DataUnchain is engineered from the ground up for absolute data sovereignty. Every component runs on your hardware, every AI inference stays local, and no byte ever leaves your network.
Every design decision in DataUnchain traces back to four non-negotiable pillars that guarantee the integrity and confidentiality of your documents.
Privacy is not an afterthought or a checkbox. It is the foundational constraint from which the entire system architecture derives. Every pipeline stage, every data structure, every network call is evaluated against the question: "Does this minimize exposure?" The answer must always be yes before a single line of code ships. No external API calls for AI inference, no cloud storage for intermediary results, no analytics pixels tracking operator behavior. Data minimization is enforced at the schema level: we extract only the fields you configure, store only what is needed, and purge on your schedule.
DataUnchain operates under a strict zero-trust posture toward the outside world. The system assumes that any external network is hostile. There are no outbound connections during normal operation: no telemetry pings, no license-check heartbeats, no crash-report uploads, no model-update pulls. All AI models are served locally via Ollama, running our proprietary VLM entirely on your hardware. Even time synchronization and DNS resolution can be disabled in air-gapped deployments. The only network boundary that matters is the one you control.
Every container, every process, every database role in DataUnchain operates with the minimum permissions necessary to perform its function. The FastAPI service cannot write to the model directory. The Ollama container has no access to the PostgreSQL socket. The Streamlit dashboard reads extraction results but cannot modify pipeline configuration. This compartmentalization means that even if an attacker compromises one component, lateral movement is architecturally blocked. Docker namespaces, read-only file systems, and non-root user directives enforce this at the OS level.
Opaque systems cannot be trusted. DataUnchain produces a complete audit trail for every document processed: the raw input hash, the exact prompt sent to the VLM, the full JSON response, the mathematical validation verdicts, the confidence score, and the timestamp of every state transition. All logs are structured JSON, queryable by any SIEM or log aggregator. Prometheus metrics expose queue depth, inference latency, error rates, and resource utilization in real time. You can replay, inspect, and verify every extraction decision the system has ever made.
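To make the audit trail concrete, here is a minimal sketch of what assembling one such record could look like. This is illustrative stdlib Python, not DataUnchain's actual schema; the field names are assumptions chosen to mirror the items listed above.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_audit_entry(raw_input: bytes, prompt: str, response: dict,
                      verdict: str, confidence: float) -> dict:
    """Assemble one audit-trail record for a processed document.

    Field names are illustrative, not DataUnchain's shipped schema.
    """
    return {
        "input_sha256": hashlib.sha256(raw_input).hexdigest(),
        "prompt": prompt,
        "response": response,
        "verdict": verdict,
        "confidence": confidence,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

entry = build_audit_entry(
    b"%PDF-1.7 ...", "Extract invoice fields as JSON",
    {"total": "1220.00"}, "PASS", 0.97,
)
# Serialized as one structured line, ready for any SIEM or log aggregator:
line = json.dumps(entry, sort_keys=True)
```

Because every record carries the input hash and a timestamp, any extraction can later be matched back to the exact document bytes it came from.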
Seven tightly integrated components, each containerized, each replaceable, each auditable. No black boxes.
The entire DataUnchain appliance is defined in a single docker-compose.yml file. One command brings up the full stack: AI engine, API server, database, queue, dashboard, and monitoring. Each service is version-pinned, health-checked, and restartable independently. Compose profiles let you toggle GPU mode, CPU-only mode, or development mode with a single environment variable. Updates are atomic: pull the new images, run docker compose up -d, and the system rolls forward with zero downtime for stateless services.
Ollama serves as the local inference runtime, hosting our proprietary Vision-Language Model entirely on your hardware. The VLM processes document images natively, understanding tables, handwriting, stamps, rotations, and multi-page layouts without any OCR pre-processing step. Model weights are loaded once at startup and remain in GPU VRAM (or system RAM in CPU mode) for sub-second inference on subsequent documents. No model phones home. No usage metrics are collected. The model binary is cryptographically signed, and its hash is verified at every container restart to guarantee integrity.
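The integrity check described above can be sketched in a few lines of stdlib Python. The file path and pinned hash here are stand-ins; the real appliance pins the expected hash per release.

```python
import hashlib
import tempfile
from pathlib import Path

def verify_model(path: Path, expected_sha256: str, chunk: int = 1 << 20) -> bool:
    """Stream the model file in chunks and compare its SHA-256
    against a pinned value. Paths and hashes here are demo stand-ins."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest() == expected_sha256

# A throwaway file stands in for the multi-gigabyte model binary:
model = Path(tempfile.mkdtemp()) / "model.bin"
model.write_bytes(b"weights")
pinned = hashlib.sha256(b"weights").hexdigest()
ok = verify_model(model, pinned)  # True: file matches the pinned hash
```

Streaming in fixed-size chunks keeps memory flat even for model files measured in gigabytes.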
The API layer is built on FastAPI, the async Python framework that delivers automatic OpenAPI documentation, Pydantic validation on every request and response, and native async/await support for non-blocking I/O. Endpoints handle document upload, extraction status queries, result retrieval, webhook dispatch, and administrative operations. Rate limiting is enforced per-client. Authentication uses API keys with configurable scopes. Every request is logged with a correlation ID that traces through the entire pipeline, from upload to final database write.
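One way the correlation-ID propagation described above can be implemented in async Python is with `contextvars`, which scopes a value to the current request without threading it through every function signature. This is a sketch of the pattern, not the appliance's exact code.

```python
import uuid
from contextvars import ContextVar

# One correlation ID per request, visible to every pipeline stage.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")

def start_request() -> str:
    """Called once at the edge (e.g. upload endpoint); stamps the context."""
    cid = uuid.uuid4().hex
    correlation_id.set(cid)
    return cid

def log(msg: str) -> str:
    """Any downstream stage can prefix its log line with the current ID."""
    return f"[{correlation_id.get()}] {msg}"

cid = start_request()
line = log("document enqueued")
```

`ContextVar` is async-safe, so concurrent requests each see their own ID even when handlers interleave on the event loop.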
Extraction results are stored as JSONB in PostgreSQL, enabling full SQL querying over semi-structured data without sacrificing schema flexibility. Each extraction record includes the raw JSON response, the validation result, the confidence score, processing duration, and a foreign key to the source document metadata. For lightweight or embedded deployments, DataUnchain can fall back to SQLite with the same ORM layer. PostgreSQL volumes are mounted on the host for straightforward backup with pg_dump or filesystem-level snapshots.
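On the SQLite fallback, the JSON querying described above looks like the sketch below, where `json_extract` plays the role of PostgreSQL's `->>` operator. Table and field names are illustrative, not the shipped schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extractions (id INTEGER PRIMARY KEY, result TEXT)")
conn.execute(
    "INSERT INTO extractions (result) VALUES (?)",
    (json.dumps({"invoice_number": "2024-001",
                 "total": 1220.0, "confidence": 0.97}),),
)

# Full SQL over semi-structured data: filter on one JSON field,
# project another, no fixed column schema required.
row = conn.execute(
    "SELECT json_extract(result, '$.invoice_number') "
    "FROM extractions WHERE json_extract(result, '$.confidence') > 0.9"
).fetchone()
```

The same query shape carries over to PostgreSQL JSONB with `result->>'invoice_number'` and a cast on the confidence comparison.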
Document processing jobs are managed through a Redis-backed task queue. When a document is uploaded, it is immediately enqueued with its configuration payload and assigned a unique job ID. Workers pull tasks asynchronously, ensuring that the API remains responsive even under heavy batch loads. Failed jobs are automatically retried with exponential backoff. Dead-letter tracking captures permanently failed documents for manual review. Queue depth and worker utilization are exposed via Prometheus metrics so you can scale workers precisely to your throughput requirements.
A real-time operator interface built on Streamlit provides document upload, extraction result inspection, side-by-side comparison of the original scan versus the extracted JSON, batch processing controls, and Progressive Learning feedback loops. The dashboard communicates exclusively with the FastAPI backend over the internal Docker network. No external CDN resources are loaded. All static assets are bundled into the container image, ensuring the dashboard renders fully in air-gapped environments without any degraded functionality.
Every service exposes a /metrics endpoint scraped by Prometheus on a configurable interval. Metrics include: documents processed per minute, average inference latency (P50, P95, P99), validation pass/fail ratio, queue depth, GPU utilization, memory pressure, and database connection pool saturation. Pre-built Grafana dashboards are included for immediate visibility. Alert rules for anomalous latency spikes, queue backlogs, and disk space thresholds are configured out of the box.
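The P50/P95/P99 latency figures above can be understood with a simple nearest-rank percentile, sketched here as a stand-in for what Prometheus histograms approximate server-side. The sample values are invented for illustration.

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile over raw samples; a teaching stand-in
    for Prometheus's bucketed histogram_quantile estimate."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Nine normal inferences and one outlier (all in milliseconds):
latencies_ms = [120, 130, 125, 140, 2200, 135, 128, 131, 129, 126]
p50 = percentile(latencies_ms, 50)  # 129: the outlier barely moves it
p95 = percentile(latencies_ms, 95)  # 2200: the tail exposes the outlier
```

This is why the dashboards track P95 and P99 alongside P50: a median alone hides the slow tail that batch deadlines actually care about.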
From scan to structured JSON, every byte stays on your machine. Here is exactly what happens when a document enters DataUnchain.
$ # Step 1 — Document Ingestion
UPLOAD invoice_scan.pdf → FastAPI /api/v1/upload
├─ File saved to local volume /data/incoming/
├─ SHA-256 hash computed and stored
├─ Metadata record created in PostgreSQL
└─ Job enqueued in Redis with unique job_id
⚠ No data leaves the machine. No external call made.
$ # Step 2 — AI Vision Inference
WORKER picks job from Redis queue
├─ Document image loaded into memory
├─ Prompt assembled from configurable extraction schema
├─ Image + prompt sent to Ollama (localhost:11434)
│ └─ Our proprietary VLM processes the image
│ └─ Inference runs on local GPU/CPU
└─ Raw JSON response captured with full token metadata
⚠ Ollama binds to 127.0.0.1 only. No network exposure.
$ # Step 3 — Validation & Enrichment
VALIDATOR receives raw extraction
├─ Pydantic schema validation (types, required fields)
├─ Mathematical cross-check: Taxable + VAT = Total
├─ Date format normalization (ISO 8601)
├─ Confidence score computed per field
├─ FatturaPA XSD validation (if Italian e-invoice)
└─ Result: PASS | WARNING | FAIL
$ # Step 4 — Storage & Dispatch
STORE validated extraction
├─ JSONB record written to PostgreSQL
├─ Audit log entry with full provenance chain
├─ Webhook fired to configured ERP endpoint (LAN only)
└─ Dashboard updated in real time via WebSocket
⚠ Webhook targets are internal IPs you configure. No cloud.
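Step 3's validation logic can be sketched as a single verdict function. Field names, the date format accepted, and the tolerance are assumptions for illustration; the shipped validator is schema-driven and configurable.

```python
from datetime import datetime

def validate_invoice(fields: dict, tolerance: float = 0.01) -> str:
    """Return PASS, WARNING, or FAIL for one extraction.
    Field names and thresholds are illustrative, not the shipped schema."""
    required = ("taxable", "vat", "total", "date")
    if any(k not in fields for k in required):
        return "FAIL"
    # Mathematical cross-check: taxable + VAT must equal the total.
    if abs(fields["taxable"] + fields["vat"] - fields["total"]) > tolerance:
        return "FAIL"
    # Date must normalize to ISO 8601 (assuming DD/MM/YYYY input here).
    try:
        datetime.strptime(fields["date"], "%d/%m/%Y").date().isoformat()
    except ValueError:
        return "WARNING"
    return "PASS"

ok = validate_invoice({"taxable": 1000.0, "vat": 220.0,
                       "total": 1220.0, "date": "31/01/2024"})   # PASS
bad = validate_invoice({"taxable": 1000.0, "vat": 220.0,
                        "total": 1300.0, "date": "31/01/2024"})  # FAIL
```

Note the tolerance on the arithmetic check: scanned totals legitimately differ by rounding, so an exact float comparison would reject valid invoices.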
Production-grade security controls baked into every layer, from API authentication to infrastructure monitoring.
API keys with configurable scopes (read, write, admin) protect every endpoint. Keys are hashed with bcrypt before storage. Failed authentication attempts trigger progressive delays and are logged with source IP for SIEM ingestion. Role-based access control ensures operators see only the documents and configurations assigned to their scope.
Token-bucket rate limiting is applied per API key. Default limits are generous for normal operation but prevent abuse scenarios such as document-flood attacks. Limits are configurable per client, allowing high-volume batch integrations to operate at elevated throughput while keeping interactive endpoints responsive. Rate limit headers are returned on every response for client-side awareness.
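The core of a token bucket fits in a few lines; this sketch uses an injectable clock so the behavior is deterministic, and leaves out the per-key bucket map and rate-limit response headers the appliance adds around it.

```python
import time

class TokenBucket:
    """Minimal token bucket: refill at `rate` tokens/second,
    burst up to `capacity`. Per-API-key wiring is omitted here."""
    def __init__(self, rate: float, capacity: float, now=time.monotonic):
        self.rate, self.capacity, self.now = rate, capacity, now
        self.tokens, self.last = capacity, now()

    def allow(self) -> bool:
        t = self.now()
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Deterministic demo with a fake clock instead of wall time:
clock = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2.0, now=lambda: clock[0])
burst = [bucket.allow() for _ in range(3)]  # [True, True, False]
clock[0] = 1.0                              # one second later: one token back
later = bucket.allow()                      # True
```

The capacity sets the permitted burst, while the refill rate sets sustained throughput, which is exactly the split between batch integrations and interactive endpoints described above.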
Failed extractions are retried with exponential backoff (1s, 2s, 4s, 8s, up to configurable max). After exhausting retries, the job moves to a dead-letter queue for manual inspection. Every retry attempt is logged with the failure reason, ensuring full traceability. Operators can re-enqueue dead-letter jobs with a single API call or dashboard button after resolving the underlying issue.
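The retry schedule above reduces to a one-line computation; the defaults shown match the 1s/2s/4s/8s progression in the text, while the retry count and cap are configurable in the appliance.

```python
def backoff_schedule(max_retries: int, base: float = 1.0,
                     cap: float = 60.0) -> list[float]:
    """Exponential backoff delays: base, 2x, 4x, ... capped at `cap`.
    The cap value here is an illustrative default."""
    return [min(cap, base * 2 ** i) for i in range(max_retries)]

delays = backoff_schedule(4)  # [1.0, 2.0, 4.0, 8.0]
```

Doubling the wait between attempts gives a transiently overloaded dependency (GPU, database) time to recover instead of hammering it, while the cap keeps worst-case job latency bounded.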
PostgreSQL data volumes are host-mounted for straightforward backup via pg_dump, filesystem snapshots, or enterprise backup agents. A built-in script automates nightly dumps with configurable retention policies. Model weights are read-only and versioned, so recovery requires only restoring the database volume and restarting containers. Recovery time objective (RTO) for a full appliance restore is under 15 minutes.
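The retention logic of such a nightly-dump script can be sketched as a pure function; the actual bundled script also performs the file I/O and `pg_dump` invocation, which are omitted here.

```python
from datetime import date, timedelta

def dumps_to_purge(dump_dates: list[date], today: date,
                   keep_days: int) -> list[date]:
    """Select nightly dumps older than the retention window.
    Pure logic only; deletion and pg_dump calls are out of scope."""
    cutoff = today - timedelta(days=keep_days)
    return [d for d in dump_dates if d < cutoff]

dates = [date(2024, 1, d) for d in range(1, 11)]
old = dumps_to_purge(dates, today=date(2024, 1, 10), keep_days=7)
# purges Jan 1 and Jan 2; keeps the seven most recent nights
```

Keeping the retention decision as a pure function makes it trivially testable, independent of the filesystem it eventually acts on.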
Every container declares a Docker health check that probes its critical dependency. FastAPI checks the database connection pool. The worker checks Redis connectivity and Ollama model availability. The dashboard checks API reachability. Unhealthy containers are automatically restarted by Docker's restart policy. Prometheus alerts fire when a service enters a restart loop, enabling proactive intervention before batch processing is impacted.
All services emit structured JSON logs to stdout, collected by Docker's logging driver. Each log entry includes a timestamp, severity level, correlation ID, service name, and a machine-parseable message body. This format integrates natively with ELK, Splunk, Datadog, or any log aggregation platform your SOC already operates. Log retention and rotation are controlled at the Docker daemon level, keeping container images stateless.
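A structured-JSON formatter of this kind is a few lines with the stdlib `logging` module. The field names below are illustrative, not DataUnchain's exact log schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line; field names are illustrative."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "correlation_id": getattr(record, "correlation_id", "-"),
            "msg": record.getMessage(),
        })

logger = logging.getLogger("dataunchain-api")
rec = logger.makeRecord("dataunchain-api", logging.INFO, __file__, 0,
                        "document enqueued", (), None,
                        extra={"correlation_id": "abc123"})
line = JsonFormatter().format(rec)
# line is a single machine-parseable JSON object
```

Because every line is self-describing JSON, aggregators can index on `correlation_id` and reconstruct a request's full path through the pipeline without custom parsers.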
For organizations operating in Italy, DataUnchain includes built-in validation against the official FatturaPA XML Schema Definition. Extracted invoice data is automatically checked for compliance with SDI (Sistema di Interscambio) requirements, including mandatory fields, codice fiscale formats, VAT number validation, and document type codes. Non-compliant extractions are flagged before they reach your ERP, preventing rejection at the SDI gateway and saving hours of manual correction.
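As a taste of the field-level checks involved, here are simplified *format* validations for two of the identifiers mentioned above. These regexes are deliberately loose sketches: the full SDI validation also covers checksum digits, omocodia substitutions in the codice fiscale, XSD structure, and document type codes.

```python
import re

# Simplified format checks only; full SDI compliance is stricter.
CODICE_FISCALE = re.compile(r"^[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]$")
PARTITA_IVA = re.compile(r"^\d{11}$")  # Italian VAT number: 11 digits

def looks_like_codice_fiscale(value: str) -> bool:
    return bool(CODICE_FISCALE.match(value.upper()))

def looks_like_partita_iva(value: str) -> bool:
    return bool(PARTITA_IVA.match(value))

cf_ok = looks_like_codice_fiscale("RSSMRA85M01H501Z")  # True
iva_ok = looks_like_partita_iva("01234567890")          # True
```

Catching a malformed identifier at this stage, before the ERP ever submits the invoice, is what prevents the SDI gateway rejections described above.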
DataUnchain's architecture was designed to satisfy the most demanding European data protection requirements by construction, not by policy addendum.
Article 25 requires that data protection is integrated into processing activities from the design stage. DataUnchain satisfies this by running all AI inference locally, storing data exclusively on premises, extracting only the fields explicitly configured by the data controller, and providing configurable retention and purge schedules. There is no "opt out" of privacy — the system literally cannot send data externally because no outbound network path exists in the default configuration.
When DataUnchain is deployed as a managed appliance, the operator acts as data processor under Article 28. Because the system runs entirely on the controller's infrastructure, the processor never has independent access to personal data. There are no sub-processors, no cross-border data transfers, and no shared infrastructure. The Data Processing Agreement (DPA) we provide reflects this architecture: processing occurs solely under the controller's instruction, on the controller's hardware, within the controller's network perimeter.
Article 32 mandates appropriate technical and organizational measures to ensure security. DataUnchain implements: encryption at rest via filesystem-level encryption on the host, encryption in transit via TLS on all internal Docker network communication when configured, access control via API key authentication with role-based scopes, resilience via automatic container restart and job retry, and regular testing via health checks and Prometheus alerting. The containerized architecture enables rapid patching: update a single image without touching the data layer.
The EU NIS2 Directive (2022/2555) imposes cybersecurity obligations on entities in critical sectors. DataUnchain supports NIS2 compliance through: supply chain security (no third-party cloud dependencies), incident reporting readiness (structured logs with full audit trail), vulnerability management (container image scanning and version pinning), and business continuity (backup automation and sub-15-minute RTO). For organizations classified as "essential" or "important" entities, DataUnchain's on-premise architecture eliminates an entire category of third-party risk.
Most document AI solutions require you to upload sensitive documents to external servers. Here is what that means in practice.
| Criterion | DataUnchain (On-Premise) | Typical Cloud Solution |
|---|---|---|
| Data residency | Your server, your jurisdiction | Provider's cloud region (often US) |
| Network exposure | Zero — no internet required | Every document transits the internet |
| Sub-processors | None | Cloud infra, CDN, logging, analytics |
| Telemetry | Zero — no phone-home, no analytics | Usage metrics, error reporting, model training |
| GDPR compliance | By architecture — no DPA chain needed | Requires DPAs, SCCs, impact assessments |
| Vendor lock-in | Open formats, standard Docker, your data | Proprietary APIs, data migration friction |
| Latency | Local network — sub-second for cached models | Network round-trip + queue wait + inference |
| Cost model | Fixed hardware — no per-page fees | Per-page or per-API-call pricing |
| Air-gap capability | Full functionality offline | Non-functional without internet |
DataUnchain runs on commodity server hardware. No specialized appliances, no proprietary chipsets.
# Clone the appliance repository
$ git clone https://github.com/dataunchain/appliance.git
$ cd appliance
# Configure environment
$ cp .env.example .env
$ nano .env # Set API keys, DB passwords, GPU mode
# Launch the full stack (GPU mode)
$ docker compose --profile gpu up -d
[+] Running 7/7
✓ Container dataunchain-db Started
✓ Container dataunchain-redis Started
✓ Container dataunchain-ollama Started
✓ Container dataunchain-api Started
✓ Container dataunchain-worker Started
✓ Container dataunchain-dashboard Started
✓ Container dataunchain-prometheus Started
# Verify health
$ docker compose ps --format "table {{.Name}}\t{{.Status}}"
NAME STATUS
dataunchain-db Up 12s (healthy)
dataunchain-redis Up 12s (healthy)
dataunchain-ollama Up 11s (healthy)
dataunchain-api Up 10s (healthy)
dataunchain-worker Up 10s (healthy)
dataunchain-dashboard Up 9s (healthy)
dataunchain-prometheus Up 9s (healthy)
DataUnchain is fully functional with no internet connection whatsoever. Here is how it works.
On a machine with internet access, pull all Docker images and model weights. Export them to a portable medium (USB, external SSD, or internal transfer share).
Move the exported images and model files to the target server on the air-gapped network. Load Docker images with docker load.
Run docker compose up -d. All containers start from local images. No registry pull required. Ollama loads the VLM from the local model directory.
Process documents, train Progressive Learning corrections, export results — all without any internet connectivity. The system will run as long as the hardware is powered.
Architecture, privacy, and deployment details.
No. The system makes zero outbound network connections during normal operation. There is no telemetry, no crash reporting, no license-check heartbeat, no model update pull, and no analytics of any kind. The Ollama inference server binds exclusively to 127.0.0.1 and is not accessible from outside the Docker network. You can verify this yourself by running the appliance with all outbound firewall rules set to DROP — the system will function identically.
DataUnchain supports a CPU-only fallback mode. If no GPU is detected, or if the GPU container fails health checks, the system can be configured to fall back to CPU inference automatically. Inference will be slower (approximately 15-30 seconds per page versus 2-4 seconds on GPU), but all other functionality remains identical. The CPU mode uses the same model weights with quantized inference optimized for AVX2 instructions. You can also run in CPU-only mode permanently for environments where GPU hardware is not available or not permitted.
The update process mirrors the initial deployment. On a connected machine, pull the new Docker images and any updated model weights. Export them with docker save, transfer the archive to the air-gapped server via approved physical media, load with docker load, and restart with docker compose up -d. Database migrations are applied automatically on startup. Your extraction data, Progressive Learning corrections, and configuration are preserved across updates.
The production deployment target is Linux (Ubuntu 22.04+, RHEL 8+, or equivalent) with Docker 24+. For development and evaluation purposes, DataUnchain can run on Windows (via WSL2 + Docker Desktop) and macOS (via Docker Desktop with Apple Silicon or Intel). GPU acceleration on Windows requires WSL2 with NVIDIA Container Toolkit. macOS does not support NVIDIA GPUs, so inference runs in CPU mode. We recommend Linux for all production deployments due to superior GPU passthrough performance and container runtime stability.
The base VLM is a general-purpose vision-language model that works out of the box on invoices, delivery notes, contracts, and other business documents. DataUnchain's Progressive Learning feature allows operators to correct extraction errors through the dashboard. These corrections are stored locally and used to refine future extractions through prompt engineering and retrieval-augmented techniques — your documents never leave the system and are never used to train or fine-tune the base model weights. The model remains static; only your correction history evolves.
DataUnchain is architected to satisfy GDPR (Articles 25, 28, 32), NIS2 Directive requirements, and Italian FatturaPA/SDI compliance. Because DataUnchain is deployed entirely on your infrastructure, the security posture inherits your organization's existing certifications (ISO 27001, SOC 2, etc.). We provide a comprehensive security architecture document, a Data Processing Agreement template, and a DPIA (Data Protection Impact Assessment) template to support your compliance documentation requirements. Contact us for specific compliance questionnaire responses.
We believe trust comes from transparency. Review the codebase, audit the Docker configuration, verify the network isolation, and run the system on your own hardware before making any commitment.
No credit card. No cloud account. Just Docker and your documents.