LLM Benchmark for Document Extraction: Qwen vs LLaMA vs Mistral (2026)
A detailed benchmark of Vision-Language Models for enterprise document field extraction, conducted by the DataUnchain team. We tested Qwen 2.5-VL, LLaMA 3.2-Vision, Mistral Pixtral, and GPT-4V on 500 real European business documents across 5 document categories. All models were run on identical hardware under identical conditions.
This benchmark was conducted by DataUnchain's engineering team using documents from our production customer base (anonymized). We are not neutral parties — Qwen 2.5-VL is the model DataUnchain uses in production. We have published methodology details and acknowledge limitations so that readers can assess the results critically. We believe the data is accurate but encourage readers to run their own evaluations on their specific document types.
Why We Ran This Benchmark
When we built DataUnchain, we evaluated multiple Vision-Language Models for document extraction and chose Qwen 2.5-VL as our primary engine. Since then, customers consistently ask us the same questions: "Why Qwen and not LLaMA?" "How much worse is a smaller model?" "Is GPT-4V worth the cloud cost for the accuracy gain?" "What happens with low-quality scans?"
We decided to run a structured benchmark to answer these questions with real data rather than intuition. The evaluation took 6 weeks, covered 500 documents, and produced the results reported here. We believe this is the most detailed publicly available comparison of current open-weight VLMs specifically for business document extraction — as opposed to general image understanding or academic OCR benchmarks that don't reflect enterprise document AI conditions.
Methodology
Test Dataset
500 Italian and European business documents, sourced from DataUnchain's production customer deployments with all personally identifiable information removed. Documents were selected to represent realistic production variety — not just clean, ideal-case documents. The dataset includes:
- 5 document categories, 100 documents each
- Documents from 47 distinct suppliers across Italy, Germany, France, Austria, and the Czech Republic
- Digital PDFs, scanned paper, and FatturaPA XML (converted to image for consistent model input)
- Document quality ranging from clean 300 DPI digital PDFs to degraded 150 DPI scans on yellowed paper
- Languages: Italian (68%), German (18%), English (10%), French (4%)
Ground Truth Preparation
Two human annotators independently extracted all fields from each document. Disagreements were resolved by a third annotator. Fields where all three annotators produced the same value were marked as high-confidence ground truth. Fields where annotators disagreed (primarily due to document ambiguity, not annotator error) were marked as ambiguous and excluded from accuracy calculations. Ambiguous fields represented 1.8% of the total field instances across the 500-document set.
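As a sketch, the annotation-resolution procedure above can be expressed as follows (our own illustration; the status labels and exact tie-break rules here are assumptions, not the actual annotation tooling):

```python
def resolve_ground_truth(a1, a2, a3=None):
    """Resolve one field from two independent annotators plus a tie-breaker.

    Returns (value, status). Fields that end up "ambiguous" are excluded
    from accuracy calculations, as described above.
    """
    if a1 == a2:
        # Independent agreement: high-confidence ground truth
        return a1, "high_confidence"
    if a3 is not None and a3 in (a1, a2):
        # Third annotator sides with one of the two: disagreement resolved
        return a3, "resolved"
    # No majority: genuinely ambiguous, excluded from scoring
    return None, "ambiguous"
```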
Evaluation Hardware
All open-weight models were tested on identical hardware to ensure fair throughput comparison:
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- CPU: AMD Ryzen 9 7950X (16 cores / 32 threads)
- RAM: 64GB DDR5
- Storage: Samsung 990 Pro NVMe SSD
- OS: Ubuntu 22.04 LTS, CUDA 12.4
- Inference server: Ollama 0.5.x for all open-weight models
GPT-4V was tested via the OpenAI API from a separate environment with a stable internet connection. API latency results are reported separately and are not directly comparable to local inference times due to network overhead.
Metrics
- Field-level accuracy: percentage of extracted field values that exactly match ground truth (after normalization — date formats and decimal separators standardized)
- JSON parse success rate: percentage of inference calls that produce valid, parseable JSON (not counting field accuracy, just format validity)
- Schema adherence rate: percentage of valid JSON outputs that contain all required schema keys with correct data types
- Hallucination rate: percentage of field instances where the model returned a value not found anywhere in the document
- Processing time: wall-clock time from inference call start to completion, averaged over all 500 documents
- VRAM usage: peak GPU memory usage during inference
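A minimal sketch of how these metrics can be computed per document (our illustration; the normalization shown is simplified relative to the full rules in the Methodology Notes, and checking the normalized value against the raw document text is a simplification — a production scorer normalizes both sides):

```python
import json
import re

def normalize(value):
    """Simplified normalization: dates to ISO, decimal comma to point."""
    if value is None:
        return None
    v = str(value).strip()
    v = re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\2-\1", v)  # 31/01/2026 -> 2026-01-31
    v = re.sub(r"(?<=\d),(?=\d{2}\b)", ".", v)                  # 1234,56 -> 1234.56
    return v

def score_document(raw_output, ground_truth, document_text):
    """Score one model output against ground truth for one document."""
    metrics = {"parsed": False, "correct": 0, "total": 0, "hallucinated": 0}
    try:
        fields = json.loads(raw_output)   # JSON parse success check
        metrics["parsed"] = True
    except json.JSONDecodeError:
        return metrics                    # parse failure: no field scoring
    for name, truth in ground_truth.items():
        metrics["total"] += 1
        predicted = normalize(fields.get(name))
        if predicted == normalize(truth):
            metrics["correct"] += 1       # exact match after normalization
        elif predicted and predicted not in document_text:
            metrics["hallucinated"] += 1  # value not present in the document
    return metrics
```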
Models Tested
| Model | Parameters | Vision Input | Quantization | Notes |
|---|---|---|---|---|
| Qwen 2.5-VL 7B | 7.6B | Native | Q4_K_M | DataUnchain primary model |
| Qwen 2.5-VL 72B | 72.7B | Native | Q4_K_M | Requires 48GB+ VRAM; tested on A100 |
| LLaMA 3.2-Vision 11B | 11.0B | Native | Q4_K_M | Meta, open-weight |
| Mistral Pixtral 12B | 12.0B | Native | Q4_K_M | Mistral AI, open-weight |
| GPT-4V (API) | Unknown | Native | N/A (API) | Cloud baseline; separate test environment |
The same extraction prompt was used for all models — a structured prompt specifying the target JSON schema, field definitions, and instructions for handling missing fields. We did not perform model-specific prompt engineering, which may disadvantage models that perform better with different prompt styles. This is intentional: in production, using a single prompt across models is realistic; heavily tuning prompts per model is not.
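For illustration, a call along these lines against Ollama's HTTP API (the schema below is a trimmed stand-in, not our production prompt, and the model tag is an assumption):

```python
import base64
import json
import urllib.request

# Trimmed stand-in for the shared extraction prompt (not the production schema)
SCHEMA_PROMPT = """Extract the following fields from the document image and return
ONLY a JSON object with exactly these keys: invoice_number (string),
invoice_date (string, YYYY-MM-DD), supplier_name (string), total_amount (number).
If a field is not visible in the document, return null for that field.
Do not guess values that are not present in the document."""

def build_payload(image_b64, model="qwen2.5vl:7b"):
    """Assemble the /api/generate request body shared across all models."""
    return {
        "model": model,
        "prompt": SCHEMA_PROMPT,
        "images": [image_b64],
        "format": "json",                 # grammar-constrained JSON where supported
        "stream": False,
        "options": {"temperature": 0},    # deterministic output, as in the benchmark
    }

def extract(image_path, model="qwen2.5vl:7b", host="http://localhost:11434"):
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(image_b64, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(json.loads(resp.read())["response"])
```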
Test Document Categories
Category 1: Standard Digital Invoices (100 samples)
Clean PDF invoices generated by accounting software (primarily FatturaPA converted to visual PDF, and European supplier PDFs from systems like SAP, Sage, DATEV). All documents are text-searchable, printed at 300+ DPI when rasterized, and have standard invoice layouts with clear field labels. This category represents the easiest case and serves as a baseline.
Category 2: Scanned Paper Invoices (100 samples)
Physical invoices scanned by office scanners at 150–300 DPI. Quality varies significantly: some are clean scans of laser-printed documents; others have coffee stains, physical creases, skewed scanning angles, handwritten annotations, and rubber stamps overlapping printed text. This category tests OCR robustness and layout understanding under degraded conditions.
Category 3: Complex Invoices with Many Line Items (100 samples)
Multi-page invoices with 20–80 line items each, spanning 2–6 pages. Line item tables may have non-standard column arrangements, merged cells, subtotals interspersed within the table, and items grouped by category. This category specifically tests line item extraction completeness and accuracy — the hardest task in invoice processing.
Category 4: Commercial Contracts (100 samples)
Supplier and customer contracts in PDF format, 4–28 pages. Contracts are text-heavy with minimal visual structure — no tables, no labeled fields. Target extraction fields: party names (both parties), effective date, termination date, contract value (if stated), payment terms, governing law jurisdiction, and notice period. This category tests semantic understanding rather than field label recognition.
Category 5: DDT / Delivery Notes (100 samples)
Italian DDT (Documento di Trasporto) and equivalent European delivery notes. These have a structured but highly variable layout: carrier information, sender, recipient, list of goods with quantities and units of measure, gross weight, and transport mode. DDTs often include handwritten quantities added at delivery time. Target fields: sender, recipient, DDT number, DDT date, carrier, list of line items (description, quantity, unit of measure, lot number where present).
Results: Standard Digital Invoices
Field-level extraction accuracy on the 100 standard digital invoice samples (field values must exactly match ground truth after normalization):
| Field | Qwen 2.5-VL 7B | LLaMA 3.2-Vision | Mistral Pixtral | GPT-4V |
|---|---|---|---|---|
| Invoice number | 97.2% | 94.1% | 92.8% | 98.1% |
| Invoice date | 98.4% | 96.2% | 95.1% | 99.0% |
| Supplier name | 99.1% | 97.8% | 96.5% | 99.3% |
| VAT number (P.IVA) | 95.6% | 91.2% | 89.4% | 97.2% |
| Subtotal (imponibile) | 94.8% | 90.1% | 88.7% | 96.1% |
| VAT amount (IVA) | 93.2% | 88.4% | 86.2% | 95.4% |
| Total amount | 96.1% | 92.3% | 90.8% | 97.8% |
| Payment due date | 91.4% | 86.8% | 84.1% | 93.7% |
| Line items (all fields, all rows) | 87.4% | 79.2% | 76.1% | 91.2% |
| Overall (all fields weighted) | 94.8% | 90.7% | 88.8% | 96.4% |
Analysis: Qwen 2.5-VL 7B is the clear leader among open-weight models, approaching GPT-4V on most fields; the remaining gap to GPT-4V is most pronounced on line item extraction (87.4% vs 91.2%) and VAT number recognition (95.6% vs 97.2%). LLaMA 3.2-Vision and Mistral Pixtral perform similarly to each other, with Mistral showing slightly lower accuracy across all fields. The 4–6 percentage point overall gap between Qwen and LLaMA/Mistral is significant in production: at 50,000 documents per year, it translates to thousands of additional field errors requiring human review.
Results: Scanned Paper Invoices
| Field | Qwen 2.5-VL 7B | LLaMA 3.2-Vision | Mistral Pixtral | GPT-4V |
|---|---|---|---|---|
| Invoice number | 91.4% | 85.2% | 83.7% | 93.8% |
| Invoice date | 93.2% | 88.4% | 87.0% | 95.1% |
| Supplier name | 94.7% | 89.1% | 88.2% | 96.0% |
| Total amount | 89.3% | 82.7% | 80.4% | 92.1% |
| Line items | 76.8% | 65.4% | 62.1% | 81.3% |
| Overall | 88.2% | 81.6% | 79.3% | 91.4% |
Analysis: All models degrade on scanned documents, as expected. Qwen 2.5-VL 7B maintains the smallest accuracy drop from digital to scanned (94.8% → 88.2%, a 6.6 point drop), suggesting better OCR robustness. LLaMA drops 9.1 points and Mistral drops 9.5 points. GPT-4V's drop is only 5.0 points, suggesting the larger model size (and possibly better OCR training data) provides more scan robustness. Line item extraction on scanned documents is notably harder for all models: complex table structures in low-quality scans push line item accuracy below 80% for LLaMA and Mistral.
Results: Complex Multi-Line Invoices
| Metric | Qwen 2.5-VL 7B | LLaMA 3.2-Vision | Mistral Pixtral | GPT-4V |
|---|---|---|---|---|
| Header fields accuracy | 95.8% | 91.4% | 89.7% | 97.1% |
| Line item row recall (rows found) | 91.2% | 80.4% | 77.8% | 94.6% |
| Line item field accuracy (per row) | 84.1% | 74.2% | 71.4% | 88.7% |
| Correct page 2+ line items captured | 88.4% | 74.6% | 71.2% | 92.3% |
| Subtotal / total accuracy | 93.4% | 87.2% | 85.1% | 95.8% |
Analysis: Multi-page, multi-line invoices are the hardest category. LLaMA and Mistral show a significant drop in line item row recall: they miss an average of 20–22% of all line item rows on complex invoices. This is primarily due to page boundary handling: models that process page 1 and 2 as separate images miss rows that span pages or appear only on page 2. Qwen 2.5-VL appears to handle multi-image context better, which we tentatively attribute to differences in how its visual encoder processes multiple page images. GPT-4V performs best overall, with 94.6% row recall.
Results: Commercial Contracts
| Field | Qwen 2.5-VL 7B | LLaMA 3.2-Vision | Mistral Pixtral | GPT-4V |
|---|---|---|---|---|
| Party names (both) | 96.4% | 93.8% | 92.7% | 98.1% |
| Effective date | 91.2% | 86.4% | 84.8% | 94.1% |
| Termination / expiry date | 83.7% | 76.2% | 74.1% | 88.4% |
| Contract value (where stated) | 87.4% | 80.1% | 77.8% | 91.2% |
| Governing law jurisdiction | 88.2% | 82.7% | 80.4% | 92.8% |
| Notice period | 79.8% | 71.4% | 68.2% | 85.6% |
| Overall | 87.8% | 81.4% | 79.7% | 91.7% |
Analysis: Contract extraction is harder than invoice extraction for all models, because contracts require semantic reasoning rather than field-label matching. Notice period extraction is particularly challenging: the information may appear in any of several clauses, expressed in varied legal language. GPT-4V's larger parameter count provides the most benefit here, where semantic understanding of legal text is critical. Accordingly, the gap between Qwen and GPT-4V is wider on contracts (3.9 points) than on scanned invoices (3.2 points); even so, Qwen's semantic capabilities remain competitive among open-weight models.
Performance Results
| Model | Avg processing time | Throughput (docs/hr) | VRAM required | Hardware needed |
|---|---|---|---|---|
| Qwen 2.5-VL 7B | 8.2s | ~440 | ~9GB (Q4_K_M) | RTX 3080+ (10GB+) |
| Qwen 2.5-VL 72B | 42s | ~85 | ~44GB (Q4_K_M) | A100 80GB / 2× 3090 |
| LLaMA 3.2-Vision 11B | 11.4s | ~315 | ~14GB (Q4_K_M) | RTX 3080 Ti+ (16GB+) |
| Mistral Pixtral 12B | 13.1s | ~274 | ~15GB (Q4_K_M) | RTX 3080 Ti+ (16GB+) |
| GPT-4V (API) | 4.1s (API latency) | API rate limited | N/A | Internet connection |
Note on GPT-4V throughput: The 4.1s API latency does not include queue wait time during high demand periods. The OpenAI Tier 4 API rate limit for GPT-4V is 800 requests/minute (RPM), which theoretically allows ~48,000 documents/hour. In practice, rate limiting is hit much sooner during burst processing. Sustained throughput at high volume is constrained by API rate limits and pricing, not raw inference speed.
Note on Qwen 2.5-VL 7B throughput: The 440 docs/hour figure uses batch size 1 (sequential processing). With parallel inference workers (2–4 workers on RTX 4090), throughput increases to 800–1,400 docs/hour with proportionally increased VRAM usage. Sequential processing is sufficient for most enterprise deployments where invoice arrival rate is 50–200 documents/hour.
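The parallel-worker setup behind the 800–1,400 docs/hour figure can be sketched as below (illustrative; `extract_document` stands in for the actual per-document inference call):

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(documents, extract_document, workers=3):
    """Run up to `workers` concurrent inference calls against Ollama.

    Each worker keeps one request in flight; Ollama schedules them on
    the GPU, so VRAM use grows roughly with the worker count.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract_document, d): d for d in documents}
        for future, doc in futures.items():
            try:
                results[doc] = future.result()
            except Exception as exc:      # inference failure -> review queue
                results[doc] = {"status": "NEEDS_REVIEW", "error": str(exc)}
    return results
```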
Structured Output Reliability
Field-level accuracy is meaningless if the model does not produce parseable JSON. We measured three distinct failure modes:
| Model | JSON parse success | Schema adherence | Hallucination rate | Output truncation |
|---|---|---|---|---|
| Qwen 2.5-VL 7B | 99.2% | 97.8% | 1.4% | 0.4% |
| Qwen 2.5-VL 72B | 99.6% | 98.8% | 0.9% | 0.1% |
| LLaMA 3.2-Vision 11B | 96.4% | 93.2% | 3.8% | 2.1% |
| Mistral Pixtral 12B | 95.1% | 91.8% | 4.2% | 2.8% |
| GPT-4V | 99.8% | 99.1% | 0.7% | 0.1% |
The hallucination rate differences between models are significant in practice. LLaMA's 3.8% hallucination rate means that roughly 1 in 26 extracted field values is invented by the model rather than read from the document. For invoice processing where financial accuracy is critical, hallucinated values must be caught by the validation layer. Qwen 2.5-VL's 1.4% hallucination rate is less than half that of LLaMA or Mistral.
On JSON parse failures: LLaMA and Mistral fail to produce valid JSON in 3.6% and 4.9% of cases respectively. These failures typically occur on very long documents where the model truncates its output mid-JSON, or when the model breaks character and outputs reasoning text before the JSON object. In production, JSON parse failures require fallback handling and typically result in a NEEDS_REVIEW routing. Using Ollama's JSON mode (grammar-constrained generation) reduces parse failures for all models but does not eliminate them.
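One cheap salvage step for the "reasoning text before the JSON" failure mode is to locate the outermost JSON object in the raw output before giving up. A sketch:

```python
import json

def salvage_json(raw):
    """Recover a JSON object wrapped in prose; None means NEEDS_REVIEW."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None                       # no JSON object present at all
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None                       # e.g. output truncated mid-JSON
```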
Difficult Document Performance
Low-Quality Scans (150 DPI, Yellowed Paper)
| Model | 150 DPI clean | 150 DPI yellowed | 150 DPI + skew + stamp |
|---|---|---|---|
| Qwen 2.5-VL 7B | 86.4% | 79.2% | 68.1% |
| LLaMA 3.2-Vision 11B | 78.1% | 70.4% | 58.2% |
| Mistral Pixtral 12B | 76.8% | 68.7% | 55.4% |
| GPT-4V | 89.2% | 83.1% | 74.8% |
For heavily degraded documents (skewed, stamped, yellowed), all models show substantial accuracy drops. The practical recommendation for documents at this quality level is to implement image preprocessing (deskewing, contrast enhancement, denoising) before AI inference. Our testing showed that preprocessing improved accuracy by 8–15 percentage points across all models on the worst-quality documents. With preprocessing, the degraded scan accuracy for Qwen 2.5-VL improved from 68.1% to 79.3% on the worst-quality subset.
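A minimal Pillow-based sketch of that preprocessing (illustrative only; deskewing is omitted because Pillow alone does not provide it, and production pipelines typically use OpenCV for that step):

```python
from PIL import Image, ImageFilter, ImageOps

def preprocess_scan(img):
    """Basic cleanup for degraded scans before AI inference."""
    g = ImageOps.grayscale(img)                 # drop yellowed-paper color cast
    g = ImageOps.autocontrast(g, cutoff=2)      # stretch faded-ink contrast
    g = g.filter(ImageFilter.MedianFilter(3))   # light salt-and-pepper denoise
    return g
```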
Handwritten Annotations
Documents with handwritten annotations on top of printed text presented an interesting challenge. All models handled simple handwritten annotations (a date or quantity written in the margin) with reasonable accuracy (75–85% on the annotation itself). Annotations that overlap or cross out printed text caused more significant problems, reducing accuracy on the underlying printed field by 12–18% across all models.
Documents with Missing Fields
We tested with 30 documents that were missing at least one expected field (e.g., no explicit VAT number, no stated due date). The desired behavior is for the model to return null for missing fields rather than hallucinating a value. Qwen hallucinated on 8.2% of missing fields, LLaMA on 22.4%, Mistral on 24.1%, and GPT-4V on 5.1%. This is one of the starkest behavioral differences between models: LLaMA and Mistral are significantly more likely to invent a plausible-looking value when a field is absent from the document.
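One mitigation is a grounding check in post-processing: any extracted value that cannot be located in the document's OCR text is reset to null and flagged. A simplified sketch (real pipelines must normalize both sides, since a normalized amount like 1234.56 may appear in the document as 1.234,56):

```python
import re

def guard_missing_fields(extracted, document_text):
    """Null out extracted values that don't appear anywhere in the document."""
    haystack = re.sub(r"\s+", "", document_text.lower())
    flagged = []
    for field, value in extracted.items():
        if value is None:
            continue
        needle = re.sub(r"\s+", "", str(value).lower())
        if needle and needle not in haystack:
            extracted[field] = None
            flagged.append(field)         # route to human review
    return extracted, flagged
```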
Multi-Language Documents
| Language | Qwen 2.5-VL 7B | LLaMA 3.2-Vision | Mistral Pixtral | GPT-4V |
|---|---|---|---|---|
| Italian | 94.8% | 90.7% | 88.8% | 96.4% |
| German | 93.2% | 88.4% | 87.1% | 95.8% |
| English | 95.4% | 91.8% | 90.2% | 97.1% |
| French | 92.7% | 87.2% | 88.4% | 95.2% |
All models show consistent multilingual performance across the four tested languages, with accuracy within 2–3 percentage points of each other across languages. Notably, Mistral Pixtral performs marginally better on French than German (possibly reflecting Mistral AI's French origin and training data composition). No model required language-specific configuration.
Cost Analysis: Local vs. Cloud
Qwen 2.5-VL 7B on RTX 4090: Cost Per Document
Electricity cost calculation for on-premise processing:
- RTX 4090 TDP: 450W (under full GPU load during inference)
- Average processing time per document: 8.2 seconds
- Energy per document: 450W × (8.2/3600 hrs) = 0.001025 kWh
- Italian commercial electricity rate (2026): ~€0.28/kWh
- Electricity cost per document: €0.000287 (less than 0.03 euro cents)
- At 50,000 documents/year: €14.35/year in electricity for GPU inference
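The arithmetic above, reproduced as a runnable check:

```python
# Figures from the list above
GPU_POWER_W = 450          # RTX 4090 TDP under inference load
SECONDS_PER_DOC = 8.2
EUR_PER_KWH = 0.28         # assumed Italian commercial rate

kwh_per_doc = GPU_POWER_W / 1000 * SECONDS_PER_DOC / 3600
eur_per_doc = kwh_per_doc * EUR_PER_KWH
print(f"{kwh_per_doc:.6f} kWh, €{eur_per_doc:.6f} per document")
print(f"€{eur_per_doc * 50_000:.2f} per year at 50,000 docs")
```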
GPT-4V API Cost Per Document
GPT-4V pricing (OpenAI, as of Q1 2026 — subject to change):
- Image input: ~$0.00765 per image (1024px resolution, standard detail)
- Text input (prompt + schema): ~400 tokens × $0.01/1k tokens = $0.004
- Text output (JSON extraction): ~600 tokens × $0.03/1k tokens = $0.018
- Total cost per document: ~$0.03 (~€0.027)
- At 50,000 documents/year: $1,500/year (~€1,380)
- At 500,000 documents/year: $15,000/year (~€13,800)
Break-Even Analysis
| Annual Document Volume | GPT-4V API Annual Cost | Local (Qwen 7B) Annual Cost* | Verdict |
|---|---|---|---|
| 10,000 docs | ~€276 | €3,020 (hardware amortized) | API cheaper at low volume |
| 30,000 docs | ~€828 | €3,020 | API still cheaper |
| 50,000 docs | ~€1,380 | €3,020 | Approaching parity |
| 100,000 docs | ~€2,760 | €3,020 | Near parity |
| 200,000 docs | ~€5,520 | €3,020 | Local saves €2,500/yr |
| 500,000 docs | ~€13,800 | €4,500** | Local saves €9,300/yr |
* Annual cost includes hardware amortized over 3 years (RTX 4090 server: ~€4,500/3y = €1,500/yr) + electricity (€14/yr at 50k docs) + software support. ** Higher-volume scenario requires faster hardware; estimated €4,500/yr total.
The break-even volume is approximately 110,000–140,000 documents per year in pure economic terms, assuming no on-premise requirement. However, this calculation excludes the data sovereignty benefit: for organizations with GDPR obligations, competitive sensitivity about document contents, or air-gap requirements, the economic comparison is irrelevant — cloud is not an option regardless of price. For those organizations, on-premise local AI wins by definition.
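The pure-economics break-even point follows directly from the table's figures (this simple division lands near the low end of the quoted 110,000–140,000 range; the wider range reflects cost assumptions beyond it):

```python
API_COST_PER_DOC_EUR = 0.0276        # GPT-4V cost per document (from above)
LOCAL_FIXED_EUR_PER_YEAR = 3_020     # amortized hardware + electricity + support

break_even_docs = LOCAL_FIXED_EUR_PER_YEAR / API_COST_PER_DOC_EUR
print(f"Pure-economics break-even: ~{break_even_docs:,.0f} documents/year")
```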
Our Recommendation
| Use Case | Recommended Model |
|---|---|
| Most enterprise deployments (invoices, DDTs, standard docs) | Qwen 2.5-VL 7B |
| Highest accuracy requirements (contracts, complex legal docs) | Qwen 2.5-VL 72B or GPT-4V |
| Budget-constrained, acceptable accuracy trade-off | LLaMA 3.2-Vision 11B |
| Privacy + accuracy (air-gapped, high accuracy) | Qwen 2.5-VL 72B on-premise |
| Low volume, no privacy constraint, max accuracy | GPT-4V (API) |
Why we choose Qwen 2.5-VL 7B for DataUnchain: The combination of leading open-weight accuracy, low VRAM requirements (runs on a 10GB consumer GPU), fast inference (~440 docs/hour), excellent JSON reliability (99.2% parse success), low hallucination rate (1.4%), and strong multilingual performance makes it the optimal choice for enterprise document processing where data cannot leave the building. The 1.6–3.9 percentage point accuracy gap versus GPT-4V (depending on document category) is a reasonable trade for complete data sovereignty and zero recurring cloud cost.
Limitations of This Benchmark
- Not neutral: This benchmark was conducted by DataUnchain, which uses and sells a system built on Qwen 2.5-VL. Despite our best efforts at objective methodology, readers should be aware of this conflict of interest.
- Document set is biased toward Italian business documents: 68% of the test set is Italian-language documents from Italian suppliers. Performance on documents from other countries or in other languages may differ from reported results.
- Single prompt for all models: Models may perform significantly differently with prompts specifically engineered for each model. Our results represent a realistic "one prompt, multiple models" scenario, not the theoretical best-case for each model.
- Models are updated frequently: LLaMA, Qwen, and Mistral release updated versions regularly. By the time you read this, newer model versions may have been released with significantly different accuracy profiles. Always benchmark on your specific document types before committing to a model.
- Ground truth ambiguity: 1.8% of field instances had ambiguous ground truth where human annotators disagreed. These were excluded from accuracy calculations, which may slightly inflate all model accuracy figures.
- GPT-4V tested in separate environment: API latency, rate limits, and throughput figures for GPT-4V are not directly comparable to local inference results.
Methodology Notes
- All PDF documents were converted to PNG images at 300 DPI before being passed to models. Models were not given access to the PDF text layer — all models processed documents as images.
- Exact match scoring was used for all fields. A value was scored as correct only if it matched ground truth exactly after normalization (date format standardization, decimal separator standardization, leading/trailing whitespace removal). Partial credit was not given.
- For line item extraction, a row was scored as correctly extracted if all required line item fields (description, quantity, unit price, total) matched ground truth. A row missing one field was scored as incorrect.
- The temperature parameter was set to 0 for all models to maximize output determinism. Results may differ with higher temperature settings.
- Models were tested with Ollama's JSON mode enabled where supported (Qwen 2.5-VL, LLaMA 3.2-Vision). Mistral Pixtral was tested without JSON mode as it showed worse performance in JSON mode in our preliminary testing.
- All 500 documents were processed 3 times per model; results are averaged across runs to account for non-determinism at temperature 0.
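The all-or-nothing line item scoring rule from the notes above can be sketched as:

```python
REQUIRED_FIELDS = ("description", "quantity", "unit_price", "total")

def row_correct(predicted, truth):
    """A row counts only if every required field matches ground truth."""
    return all(
        str(predicted.get(k, "")).strip() == str(truth.get(k, "")).strip()
        for k in REQUIRED_FIELDS
    )

def line_item_recall(predicted_rows, truth_rows):
    """Fraction of ground-truth rows matched by at least one extracted row."""
    if not truth_rows:
        return 1.0
    return sum(
        any(row_correct(p, t) for p in predicted_rows) for t in truth_rows
    ) / len(truth_rows)
```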
Frequently Asked Questions
Can I run Qwen 2.5-VL 7B on an older NVIDIA GPU?
Qwen 2.5-VL 7B in Q4_K_M quantization requires approximately 9GB VRAM. It will run on any NVIDIA GPU with 10GB+ VRAM: RTX 3080 (10GB), RTX 4070, RTX 4080, RTX 4090. It will not fit on 8GB cards such as the RTX 3070 or RTX 3060 Ti. CPU-only inference is possible but extremely slow (3–5 minutes per document), which is not practical for production use.
How does Qwen 2.5-VL 7B compare to commercial IDP platforms?
We do not have benchmark data comparing to commercial IDP platforms (ABBYY, Hyperscience, AWS Textract, Google Document AI) because we do not have access to comparable test sets across those platforms. In our customer deployments, we observe accuracy competitive with commercial platforms on standard invoice extraction, with the advantage of zero cloud data transmission and no per-document pricing.
Is Qwen 2.5-VL 7B fine-tuned for document extraction?
No. DataUnchain uses the base Qwen 2.5-VL 7B model without fine-tuning. Performance improvements come from prompt engineering, extraction schema design, and post-processing validation — not from model fine-tuning. This means customers benefit from any general improvements to Qwen 2.5-VL when they upgrade model versions, without needing to re-run fine-tuning pipelines.
What happens when the model returns invalid JSON?
DataUnchain's post-processing layer applies JSON repair heuristics (fixing trailing commas, missing brackets, escaped quotes) that successfully recover approximately 60% of malformed JSON outputs. The remaining 40% of parse failures result in the document being routed to NEEDS_REVIEW status with a "parse failure" error flag, where a human reviewer enters the data manually from the original document.
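A simplified sketch of such repair heuristics (the production layer handles more cases; note the bracket-balancing below is naive and would be confused by brackets inside string values):

```python
import json
import re

def repair_json(raw):
    """Best-effort JSON repair; returns a dict/list or None (-> NEEDS_REVIEW)."""
    text = raw.strip()
    text = re.sub(r",\s*([}\]])", r"\1", text)   # drop trailing commas
    # Close any brackets the model left open (truncated output)
    stack = []
    for ch in text:
        if ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    text += "".join(reversed(stack))
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```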
Will these results hold for my specific document types?
Not necessarily. This benchmark covers Italian and European business documents across 5 specific categories. If your documents are in different languages, follow different layouts (e.g., US invoices with different field conventions), or fall into different categories (medical records, legal pleadings, engineering specifications), you should benchmark on your own document samples before drawing conclusions. We offer a no-commitment evaluation where we process 50 sample documents from your real document set.
Ready to automate your document workflows?
DataUnchain processes your documents locally. No cloud, no data exposure, no subscriptions.
Request a Demo →