DataUnchain is engineered from the ground up for absolute data sovereignty. Every component runs on your hardware, every AI inference stays local, and no byte ever leaves your network.
Every design decision in DataUnchain traces back to four non-negotiable pillars that guarantee the integrity and confidentiality of your documents.
Privacy is not an afterthought or a checkbox. It is the foundational constraint from which the entire system architecture derives. Every pipeline stage, every data structure, every network call is evaluated against the question: "Does this minimize exposure?" The answer must always be yes before a single line of code ships. No external API calls for AI inference, no cloud storage for intermediary results, no analytics pixels tracking operator behavior. Data minimization is enforced at the schema level: we extract only the fields you configure, store only what is needed, and purge on your schedule.
DataUnchain operates under a strict zero-trust posture toward the outside world. The system assumes that any external network is hostile. There are no outbound connections during normal operation: no telemetry pings, no license-check heartbeats, no crash-report uploads, no model-update pulls. All AI models are served locally via Ollama, running our proprietary VLM entirely on your hardware. Even time synchronization and DNS resolution can be disabled in air-gapped deployments. The only network boundary that matters is the one you control.
Every container, every process, every database role in DataUnchain operates with the minimum permissions necessary to perform its function. The FastAPI service cannot write to the model directory. The Ollama container has no access to the PostgreSQL socket. The Streamlit dashboard reads extraction results but cannot modify pipeline configuration. This compartmentalization means that even if an attacker compromises one component, lateral movement is architecturally blocked. Docker namespaces, read-only file systems, and non-root user directives enforce this at the OS level.
Opaque systems cannot be trusted. DataUnchain produces a complete audit trail for every document processed: the raw input hash, the exact prompt sent to the VLM, the full JSON response, the mathematical validation verdicts, the confidence score, and the timestamp of every state transition. All logs are structured JSON, queryable by any SIEM or log aggregator. Prometheus metrics expose queue depth, inference latency, error rates, and resource utilization in real time. You can replay, inspect, and verify every extraction decision the system has ever made.
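To make the audit trail concrete, here is a minimal sketch of what assembling one such record could look like. This is illustrative stdlib Python, not DataUnchain's actual schema; the field names are assumptions chosen to mirror the items listed above.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_entry(raw_input: bytes, prompt: str, response: dict,
                      verdict: str, confidence: float) -> dict:
    """Assemble one audit-trail record for a processed document.

    Field names are illustrative, not DataUnchain's shipped schema.
    """
    return {
        "input_sha256": hashlib.sha256(raw_input).hexdigest(),
        "prompt": prompt,
        "response": response,
        "verdict": verdict,
        "confidence": confidence,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

entry = build_audit_entry(
    b"%PDF-1.7 ...", "Extract invoice fields as JSON",
    {"total": "1220.00"}, "PASS", 0.97,
)
# Serialized as one structured line, ready for any SIEM or log aggregator:
line = json.dumps(entry, sort_keys=True)
```

Because every record carries the input hash and a timestamp, any extraction can later be matched back to the exact document bytes it came from.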
Seven tightly integrated components, each containerized, each replaceable, each auditable. No black boxes.
The entire DataUnchain appliance is defined in a single docker-compose.yml file. One command brings up the full stack: AI engine, API server, database, queue, dashboard, and monitoring. Each service is version-pinned, health-checked, and restartable independently. Compose profiles let you toggle GPU mode, CPU-only mode, or development mode with a single environment variable. Updates are atomic: pull the new images, run docker compose up -d, and the system rolls forward with zero downtime for stateless services.
Ollama serves as the local inference runtime, hosting our proprietary Vision-Language Model entirely on your hardware. The VLM processes document images natively, understanding tables, handwriting, stamps, rotations, and multi-page layouts without any OCR pre-processing step. Model weights are loaded once at startup and remain in GPU VRAM (or system RAM in CPU mode) for sub-second inference on subsequent documents. No model phones home. No usage metrics are collected. The model binary is cryptographically signed, and its hash is verified at every container restart to guarantee integrity.
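The integrity check described above can be sketched in a few lines of stdlib Python. The file path and pinned hash here are stand-ins; the real appliance pins the expected hash per release.

```python
import hashlib
import tempfile
from pathlib import Path

def verify_model(path: Path, expected_sha256: str, chunk: int = 1 << 20) -> bool:
    """Stream the model file in chunks and compare its SHA-256
    against a pinned value. Paths and hashes here are demo stand-ins."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest() == expected_sha256

# A throwaway file stands in for the multi-gigabyte model binary:
model = Path(tempfile.mkdtemp()) / "model.bin"
model.write_bytes(b"weights")
pinned = hashlib.sha256(b"weights").hexdigest()
ok = verify_model(model, pinned)  # True: file matches the pinned hash
```

Streaming in fixed-size chunks keeps memory flat even for model files measured in gigabytes.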
The API layer is built on FastAPI, the async Python framework that delivers automatic OpenAPI documentation, Pydantic validation on every request and response, and native async/await support for non-blocking I/O. Endpoints handle document upload, extraction status queries, result retrieval, webhook dispatch, and administrative operations. Rate limiting is enforced per-client. Authentication uses API keys with configurable scopes. Every request is logged with a correlation ID that traces through the entire pipeline, from upload to final database write.
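One way the correlation-ID propagation described above can be implemented in async Python is with `contextvars`, which scopes a value to the current request without threading it through every function signature. This is a sketch of the pattern, not the appliance's exact code.

```python
import uuid
from contextvars import ContextVar

# One correlation ID per request, visible to every pipeline stage.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

def start_request() -> str:
    """Called once at the edge (e.g. upload endpoint); stamps the context."""
    cid = uuid.uuid4().hex
    correlation_id.set(cid)
    return cid

def log(msg: str) -> str:
    """Any downstream stage can prefix its log line with the current ID."""
    return f"[{correlation_id.get()}] {msg}"

cid = start_request()
line = log("document enqueued")
```

`ContextVar` is async-safe, so concurrent requests each see their own ID even when handlers interleave on the event loop.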
Extraction results are stored as JSONB in PostgreSQL, enabling full SQL querying over semi-structured data without sacrificing schema flexibility. Each extraction record includes the raw JSON response, the validation result, the confidence score, processing duration, and a foreign key to the source document metadata. For lightweight or embedded deployments, DataUnchain can fall back to SQLite with the same ORM layer. PostgreSQL volumes are mounted on the host for straightforward backup with pg_dump or filesystem-level snapshots.
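On the SQLite fallback, the JSON querying described above looks like the sketch below, where `json_extract` plays the role of PostgreSQL's `->>` operator. Table and field names are illustrative, not the shipped schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extractions (id INTEGER PRIMARY KEY, result TEXT)")
conn.execute(
    "INSERT INTO extractions (result) VALUES (?)",
    (json.dumps({"invoice_number": "2024-001",
                 "total": 1220.0, "confidence": 0.97}),),
)

# Full SQL over semi-structured data: filter on one JSON field,
# project another, no fixed column schema required.
row = conn.execute(
    "SELECT json_extract(result, '$.invoice_number') "
    "FROM extractions WHERE json_extract(result, '$.confidence') > 0.9"
).fetchone()
```

The same query shape carries over to PostgreSQL JSONB with `result->>'invoice_number'` and a cast on the confidence comparison.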
Document processing jobs are managed through a Redis-backed task queue. When a document is uploaded, it is immediately enqueued with its configuration payload and assigned a unique job ID. Workers pull tasks asynchronously, ensuring that the API remains responsive even under heavy batch loads. Failed jobs are automatically retried with exponential backoff. Dead-letter tracking captures permanently failed documents for manual review. Queue depth and worker utilization are exposed via Prometheus metrics so you can scale workers precisely to your throughput requirements.
A real-time operator interface built on Streamlit provides document upload, extraction result inspection, side-by-side comparison of the original scan versus the extracted JSON, batch processing controls, and Progressive Learning feedback loops. The dashboard communicates exclusively with the FastAPI backend over the internal Docker network. No external CDN resources are loaded. All static assets are bundled into the container image, ensuring the dashboard renders fully in air-gapped environments without any degraded functionality.
Every service exposes a /metrics endpoint scraped by Prometheus on a configurable interval. Metrics include: documents processed per minute, average inference latency (P50, P95, P99), validation pass/fail ratio, queue depth, GPU utilization, memory pressure, and database connection pool saturation. Pre-built Grafana dashboards are included for immediate visibility. Alert rules for anomalous latency spikes, queue backlogs, and disk space thresholds are configured out of the box.
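The P50/P95/P99 latency figures above can be understood with a simple nearest-rank percentile, sketched here as a stand-in for what Prometheus histograms approximate server-side. The sample values are invented for illustration.

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over raw samples; a teaching stand-in
    for Prometheus's bucketed histogram_quantile estimate."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Nine normal inferences and one outlier (all in milliseconds):
latencies_ms = [120, 130, 125, 140, 2200, 135, 128, 131, 129, 126]
p50 = percentile(latencies_ms, 50)  # 129: the outlier barely moves it
p95 = percentile(latencies_ms, 95)  # 2200: the tail exposes the outlier
```

This is why the dashboards track P95 and P99 alongside P50: a median alone hides the slow tail that batch deadlines actually care about.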
From scan to structured JSON, every byte stays on your machine. Here is exactly what happens when a document enters DataUnchain.
$ # Step 1 — Document Ingestion
UPLOAD invoice_scan.pdf → FastAPI /api/v1/upload
├─ File saved to local volume /data/incoming/
├─ SHA-256 hash computed and stored
├─ Metadata record created in PostgreSQL
└─ Job enqueued in Redis with unique job_id
⚠ No data leaves the machine. No external call made.
$ # Step 2 — AI Vision Inference
WORKER picks job from Redis queue
├─ Document image loaded into memory
├─ Prompt assembled from configurable extraction schema
├─ Image + prompt sent to Ollama (localhost:11434)
│ └─ Our proprietary VLM processes the image
│ └─ Inference runs on local GPU/CPU
└─ Raw JSON response captured with full token metadata
⚠ Ollama binds to 127.0.0.1 only. No network exposure.
$ # Step 3 — Validation & Enrichment
VALIDATOR receives raw extraction
├─ Pydantic schema validation (types, required fields)
├─ Mathematical cross-check: Taxable + VAT = Total
├─ Date format normalization (ISO 8601)
├─ Confidence score computed per field
├─ FatturaPA XSD validation (if Italian e-invoice)
└─ Result: PASS | WARNING | FAIL
$ # Step 4 — Storage & Dispatch
STORE validated extraction
├─ JSONB record written to PostgreSQL
├─ Audit log entry with full provenance chain
├─ Webhook fired to configured ERP endpoint (LAN only)
└─ Dashboard updated in real time via WebSocket
⚠ Webhook targets are internal IPs you configure. No cloud.
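Step 3's validation logic can be sketched as a single verdict function. Field names, the date format accepted, and the tolerance are assumptions for illustration; the shipped validator is schema-driven and configurable.

```python
from datetime import datetime

def validate_invoice(fields: dict, tolerance: float = 0.01) -> str:
    """Return PASS, WARNING, or FAIL for one extraction.
    Field names and thresholds are illustrative, not the shipped schema."""
    required = ("taxable", "vat", "total", "date")
    if any(k not in fields for k in required):
        return "FAIL"
    # Mathematical cross-check: taxable + VAT must equal the total.
    if abs(fields["taxable"] + fields["vat"] - fields["total"]) > tolerance:
        return "FAIL"
    # Date must normalize to ISO 8601 (assuming DD/MM/YYYY input here).
    try:
        datetime.strptime(fields["date"], "%d/%m/%Y").date().isoformat()
    except ValueError:
        return "WARNING"
    return "PASS"

ok = validate_invoice({"taxable": 1000.0, "vat": 220.0,
                       "total": 1220.0, "date": "31/01/2024"})   # PASS
bad = validate_invoice({"taxable": 1000.0, "vat": 220.0,
                        "total": 1300.0, "date": "31/01/2024"})  # FAIL
```

Note the tolerance on the arithmetic check: scanned totals legitimately differ by rounding, so an exact float comparison would reject valid invoices.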
Production-grade security controls baked into every layer, from API authentication to infrastructure monitoring.
API keys with configurable scopes (read, write, admin) protect every endpoint. Keys are hashed with bcrypt before storage. Failed authentication attempts trigger progressive delays and are logged with source IP for SIEM ingestion. Role-based access control ensures operators see only the documents and configurations assigned to their scope.
Token-bucket rate limiting is applied per API key. Default limits are generous for normal operation but prevent abuse scenarios such as document-flood attacks. Limits are configurable per client, allowing high-volume batch integrations to operate at elevated throughput while keeping interactive endpoints responsive. Rate limit headers are returned on every response for client-side awareness.
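The core of a token bucket fits in a few lines; this sketch uses an injectable clock so the behavior is deterministic, and leaves out the per-key bucket map and rate-limit response headers the appliance adds around it.

```python
import time

class TokenBucket:
    """Minimal token bucket: refill at `rate` tokens/second,
    burst up to `capacity`. Per-API-key wiring is omitted here."""
    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate, self.capacity, self.now = rate, capacity, now
        self.tokens, self.last = capacity, now()

    def allow(self) -> bool:
        t = self.now()
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Deterministic demo with a fake clock instead of wall time:
clock = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2.0, now=lambda: clock[0])
burst = [bucket.allow() for _ in range(3)]  # [True, True, False]
clock[0] = 1.0                              # one second later: one token back
later = bucket.allow()                      # True
```

The capacity sets the permitted burst, while the refill rate sets sustained throughput, which is exactly the split between batch integrations and interactive endpoints described above.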
Failed extractions are retried with exponential backoff (1s, 2s, 4s, 8s, up to configurable max). After exhausting retries, the job moves to a dead-letter queue for manual inspection. Every retry attempt is logged with the failure reason, ensuring full traceability. Operators can re-enqueue dead-letter jobs with a single API call or dashboard button after resolving the underlying issue.
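The retry schedule above reduces to a one-line computation; the defaults shown match the 1s/2s/4s/8s progression in the text, while the retry count and cap are configurable in the appliance.

```python
def backoff_schedule(max_retries: int, base: float = 1.0,
                     cap: float = 60.0) -> list[float]:
    """Exponential backoff delays: base, 2x, 4x, ... capped at `cap`.
    The cap value here is an illustrative default."""
    return [min(cap, base * 2 ** i) for i in range(max_retries)]

delays = backoff_schedule(4)  # [1.0, 2.0, 4.0, 8.0]
```

Doubling the wait between attempts gives a transiently overloaded dependency (GPU, database) time to recover instead of hammering it, while the cap keeps worst-case job latency bounded.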
PostgreSQL data volumes are host-mounted for straightforward backup via pg_dump, filesystem snapshots, or enterprise backup agents. A built-in script automates nightly dumps with configurable retention policies. Model weights are read-only and versioned, so recovery requires only restoring the database volume and restarting containers. Recovery time objective (RTO) for a full appliance restore is under 15 minutes.
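The retention logic of such a nightly-dump script can be sketched as a pure function; the actual bundled script also performs the file I/O and `pg_dump` invocation, which are omitted here.

```python
from datetime import date, timedelta

def dumps_to_purge(dump_dates: list[date], today: date,
                   keep_days: int) -> list[date]:
    """Select nightly dumps older than the retention window.
    Pure logic only; deletion and pg_dump calls are out of scope."""
    cutoff = today - timedelta(days=keep_days)
    return [d for d in dump_dates if d < cutoff]

dates = [date(2024, 1, d) for d in range(1, 11)]
old = dumps_to_purge(dates, today=date(2024, 1, 10), keep_days=7)
# purges Jan 1 and Jan 2; keeps the seven most recent nights
```

Keeping the retention decision as a pure function makes it trivially testable, independent of the filesystem it eventually acts on.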
Every container declares a Docker health check that probes its critical dependency. FastAPI checks the database connection pool. The worker checks Redis connectivity and Ollama model availability. The dashboard checks API reachability. Unhealthy containers are automatically restarted by Docker's restart policy. Prometheus alerts fire when a service enters a restart loop, enabling proactive intervention before batch processing is impacted.
All services emit structured JSON logs to stdout, collected by Docker's logging driver. Each log entry includes a timestamp, severity level, correlation ID, service name, and a machine-parseable message body. This format integrates natively with ELK, Splunk, Datadog, or any log aggregation platform your SOC already operates. Log retention and rotation are controlled at the Docker daemon level, keeping container images stateless.
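A structured-JSON formatter of this kind is a few lines with the stdlib `logging` module. The field names below are illustrative, not DataUnchain's exact log schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line; field names are illustrative."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "correlation_id": getattr(record, "correlation_id", "-"),
            "msg": record.getMessage(),
        })

logger = logging.getLogger("dataunchain-api")
rec = logger.makeRecord("dataunchain-api", logging.INFO, __file__, 0,
                        "document enqueued", (), None,
                        extra={"correlation_id": "abc123"})
line = JsonFormatter().format(rec)
# line is a single machine-parseable JSON object
```

Because every line is self-describing JSON, aggregators can index on `correlation_id` and reconstruct a request's full path through the pipeline without custom parsers.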
For organizations operating in Italy, DataUnchain includes built-in validation against the official FatturaPA XML Schema Definition. Extracted invoice data is automatically checked for compliance with SDI (Sistema di Interscambio) requirements, including mandatory fields, codice fiscale formats, VAT number validation, and document type codes. Non-compliant extractions are flagged before they reach your ERP, preventing rejection at the SDI gateway and saving hours of manual correction.
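As a taste of the field-level checks involved, here are simplified *format* validations for two of the identifiers mentioned above. These regexes are deliberately loose sketches: the full SDI validation also covers checksum digits, omocodia substitutions in the codice fiscale, XSD structure, and document type codes.

```python
import re

# Simplified format checks only; full SDI compliance is stricter.
CODICE_FISCALE = re.compile(r"^[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]$")
PARTITA_IVA = re.compile(r"^\d{11}$")  # Italian VAT number: 11 digits

def looks_like_codice_fiscale(value: str) -> bool:
    return bool(CODICE_FISCALE.match(value.upper()))

def looks_like_partita_iva(value: str) -> bool:
    return bool(PARTITA_IVA.match(value))

cf_ok = looks_like_codice_fiscale("RSSMRA85M01H501Z")  # True
iva_ok = looks_like_partita_iva("01234567890")          # True
```

Catching a malformed identifier at this stage, before the ERP ever submits the invoice, is what prevents the SDI gateway rejections described above.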
DataUnchain's architecture was designed to satisfy the most demanding European data protection requirements by construction, not by policy addendum.
Article 25 requires that data protection is integrated into processing activities from the design stage. DataUnchain satisfies this by running all AI inference locally, storing data exclusively on premises, extracting only the fields explicitly configured by the data controller, and providing configurable retention and purge schedules. There is no "opt out" of privacy — the system literally cannot send data externally because no outbound network path exists in the default configuration.
When DataUnchain is deployed as a managed appliance, the operator acts as data processor under Article 28. Because the system runs entirely on the controller's infrastructure, the processor never has independent access to personal data. There are no sub-processors, no cross-border data transfers, and no shared infrastructure. The Data Processing Agreement (DPA) we provide reflects this architecture: processing occurs solely under the controller's instruction, on the controller's hardware, within the controller's network perimeter.
Article 32 mandates appropriate technical and organizational measures to ensure security. DataUnchain implements: encryption at rest via filesystem-level encryption on the host, encryption in transit via TLS on all internal Docker network communication when configured, access control via API key authentication with role-based scopes, resilience via automatic container restart and job retry, and regular testing via health checks and Prometheus alerting. The containerized architecture enables rapid patching: update a single image without touching the data layer.
The EU NIS2 Directive (2022/2555) imposes cybersecurity obligations on entities in critical sectors. DataUnchain supports NIS2 compliance through: supply chain security (no third-party cloud dependencies), incident reporting readiness (structured logs with full audit trail), vulnerability management (container image scanning and version pinning), and business continuity (backup automation and sub-15-minute RTO). For organizations classified as "essential" or "important" entities, DataUnchain's on-premise architecture eliminates an entire category of third-party risk.
Most document AI solutions require you to upload sensitive documents to external servers. Here is what that means in practice.
| Criterion | DataUnchain (On-Premise) | Typical Cloud Solution |
|---|---|---|
| Data residency | Your server, your jurisdiction | Provider's cloud region (often US) |
| Network exposure | Zero — no internet required | Every document transits the internet |
| Sub-processors | None | Cloud infra, CDN, logging, analytics |
| Telemetry | Zero — no phone-home, no analytics | Usage metrics, error reporting, model training |
| GDPR compliance | By architecture — no DPA chain needed | Requires DPAs, SCCs, impact assessments |
| Vendor lock-in | Open formats, standard Docker, your data | Proprietary APIs, data migration friction |
| Latency | Local network — sub-second for cached models | Network round-trip + queue wait + inference |
| Cost model | Fixed hardware — no per-page fees | Per-page or per-API-call pricing |
| Air-gap capability | Full functionality offline | Non-functional without internet |
DataUnchain runs on commodity server hardware. No specialized appliances, no proprietary chipsets.
# Clone the appliance repository
$ git clone https://github.com/dataunchain/appliance.git
$ cd appliance
# Configure environment
$ cp .env.example .env
$ nano .env # Set API keys, DB passwords, GPU mode
# Launch the full stack (GPU mode)
$ docker compose --profile gpu up -d
[+] Running 7/7
✓ Container dataunchain-db Started
✓ Container dataunchain-redis Started
✓ Container dataunchain-ollama Started
✓ Container dataunchain-api Started
✓ Container dataunchain-worker Started
✓ Container dataunchain-dashboard Started
✓ Container dataunchain-prometheus Started
# Verify health
$ docker compose ps --format "table {{.Name}}\t{{.Status}}"
NAME STATUS
dataunchain-db Up 12s (healthy)
dataunchain-redis Up 12s (healthy)
dataunchain-ollama Up 11s (healthy)
dataunchain-api Up 10s (healthy)
dataunchain-worker Up 10s (healthy)
dataunchain-dashboard Up 9s (healthy)
dataunchain-prometheus Up 9s (healthy)
DataUnchain is fully functional with no internet connection whatsoever. Here is how it works.
On a machine with internet access, pull all Docker images and model weights. Export them to a portable medium (USB, external SSD, or internal transfer share).
Move the exported images and model files to the target server on the air-gapped network. Load Docker images with docker load.
Run docker compose up -d. All containers start from local images. No registry pull required. Ollama loads the VLM from the local model directory.
Process documents, train Progressive Learning corrections, export results — all without any internet connectivity. The system will run as long as the hardware is powered.
Architecture, privacy, and deployment details.
No. The system makes zero outbound network connections during normal operation. There is no telemetry, no crash reporting, no license-check heartbeat, no model update pull, and no analytics of any kind. The Ollama inference server binds exclusively to 127.0.0.1 and is not accessible from outside the Docker network. You can verify this yourself by running the appliance with all outbound firewall rules set to DROP — the system will function identically.
DataUnchain supports a CPU-only fallback mode. If no GPU is detected, or if the GPU container fails health checks, the system can be configured to fall back to CPU inference automatically. Inference will be slower (approximately 15-30 seconds per page versus 2-4 seconds on GPU), but all other functionality remains identical. The CPU mode uses the same model weights with quantized inference optimized for AVX2 instructions. You can also run in CPU-only mode permanently for environments where GPU hardware is not available or not permitted.
The update process mirrors the initial deployment. On a connected machine, pull the new Docker images and any updated model weights. Export them with docker save, transfer the archive to the air-gapped server via approved physical media, load with docker load, and restart with docker compose up -d. Database migrations are applied automatically on startup. Your extraction data, Progressive Learning corrections, and configuration are preserved across updates.
The production deployment target is Linux (Ubuntu 22.04+, RHEL 8+, or equivalent) with Docker 24+. For development and evaluation purposes, DataUnchain can run on Windows (via WSL2 + Docker Desktop) and macOS (via Docker Desktop with Apple Silicon or Intel). GPU acceleration on Windows requires WSL2 with NVIDIA Container Toolkit. macOS does not support NVIDIA GPUs, so inference runs in CPU mode. We recommend Linux for all production deployments due to superior GPU passthrough performance and container runtime stability.
The base VLM is a general-purpose vision-language model that works out of the box on invoices, delivery notes, contracts, and other business documents. DataUnchain's Progressive Learning feature allows operators to correct extraction errors through the dashboard. These corrections are stored locally and used to refine future extractions through prompt engineering and retrieval-augmented techniques — your documents never leave the system and are never used to train or fine-tune the base model weights. The model remains static; only your correction history evolves.
DataUnchain is architected to satisfy GDPR (Articles 25, 28, 32), NIS2 Directive requirements, and Italian FatturaPA/SDI compliance. Because DataUnchain is deployed entirely on your infrastructure, the security posture inherits your organization's existing certifications (ISO 27001, SOC 2, etc.). We provide a comprehensive security architecture document, a Data Processing Agreement template, and a DPIA (Data Protection Impact Assessment) template to support your compliance documentation requirements. Contact us for specific compliance questionnaire responses.
We believe trust comes from transparency. Review the codebase, audit the Docker configuration, verify the network isolation, and run the system on your own hardware before making any commitment.
No credit card. No cloud account. Just Docker and your documents.