LLM Benchmark for Document Extraction: Qwen vs LLaMA vs Mistral (2026)
A detailed benchmark of Vision-Language Models for enterprise document field extraction, conducted by the DataUnchain team. We tested Qwen 2.5-VL, LLaMA 3.2-Vision, Mistral Pixtral, and GPT-4V on 500 real European business documents across 5 document categories. All models were run on identical hardware under identical conditions.
This benchmark was conducted by DataUnchain's engineering team using documents from our production customer base (anonymized). We are not neutral parties — Qwen 2.5-VL is the model DataUnchain uses in production. We have published methodology details and acknowledge limitations so that readers can assess the results critically. We believe the data is accurate but encourage readers to run their own evaluations on their specific document types.
Why We Ran This Benchmark
When we built DataUnchain, we evaluated multiple Vision-Language Models for document extraction and chose Qwen 2.5-VL as our primary engine. Since then, customers consistently ask us the same questions: "Why Qwen and not LLaMA?" "How much worse is a smaller model?" "Is GPT-4V worth the cloud cost for the accuracy gain?" "What happens with low-quality scans?"
We decided to run a structured benchmark to answer these questions with real data rather than intuition. The evaluation took 6 weeks, covered 500 documents, and produced the results reported here. We believe this is the most detailed publicly available comparison of current open-weight VLMs specifically for business document extraction — as opposed to general image understanding or academic OCR benchmarks that don't reflect enterprise document AI conditions.
Methodology
Test Dataset
500 Italian and European business documents, sourced from DataUnchain's production customer deployments with all personally identifiable information removed. Documents were selected to represent realistic production variety — not just clean, ideal-case documents. The dataset includes:
- 5 document categories, 100 documents each
- Documents from 47 distinct suppliers across Italy, Germany, France, Austria, and the Czech Republic
- Digital PDFs, scanned paper, and FatturaPA XML (converted to image for consistent model input)
- Document quality ranging from clean 300 DPI digital PDFs to degraded 150 DPI scans on yellowed paper
- Languages: Italian (68%), German (18%), English (10%), French (4%)
Ground Truth Preparation
Two human annotators independently extracted all fields from each document. Disagreements were resolved by a third annotator. Fields where all three annotators produced the same value were marked as high-confidence ground truth. Fields where annotators disagreed (primarily due to document ambiguity, not annotator error) were marked as ambiguous and excluded from accuracy calculations. Ambiguous fields represented 1.8% of the total field instances across the 500-document set.
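As a sketch, the annotation-resolution procedure above can be expressed as follows (our own illustration; the status labels and exact tie-break rules here are assumptions, not the actual annotation tooling):

```python
def resolve_ground_truth(a1, a2, a3=None):
    """Resolve one field from two independent annotators plus a tie-breaker.

    Returns (value, status). Fields that end up "ambiguous" are excluded
    from accuracy calculations, as described above.
    """
    if a1 == a2:
        # Independent agreement: high-confidence ground truth
        return a1, "high_confidence"
    if a3 is not None and a3 in (a1, a2):
        # Third annotator sides with one of the two: disagreement resolved
        return a3, "resolved"
    # No majority: genuinely ambiguous, excluded from scoring
    return None, "ambiguous"
```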
Evaluation Hardware
All open-weight models were tested on identical hardware to ensure fair throughput comparison:
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- CPU: AMD Ryzen 9 7950X (16 cores / 32 threads)
- RAM: 64GB DDR5
- Storage: Samsung 990 Pro NVMe SSD
- OS: Ubuntu 22.04 LTS, CUDA 12.4
- Inference server: Ollama 0.5.x for all open-weight models
GPT-4V was tested via the OpenAI API from a separate environment with a stable internet connection. API latency results are reported separately and are not directly comparable to local inference times due to network overhead.
Metrics
- Field-level accuracy: percentage of extracted field values that exactly match ground truth (after normalization — date formats and decimal separators standardized)
- JSON parse success rate: percentage of inference calls that produce valid, parseable JSON (not counting field accuracy, just format validity)
- Schema adherence rate: percentage of valid JSON outputs that contain all required schema keys with correct data types
- Hallucination rate: percentage of field instances where the model returned a value not found anywhere in the document
- Processing time: wall-clock time from inference call start to completion, averaged over all 500 documents
- VRAM usage: peak GPU memory usage during inference
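A minimal sketch of how these metrics can be computed per document (our illustration; the normalization shown is simplified relative to the full rules in the Methodology Notes, and checking the normalized value against the raw document text is a simplification — a production scorer normalizes both sides):

```python
import json
import re

def normalize(value):
    """Simplified normalization: dates to ISO, decimal comma to point."""
    if value is None:
        return None
    v = str(value).strip()
    v = re.sub(r"\b(\d{2})/(\d{2})/(\d{4})\b", r"\3-\2-\1", v)  # 31/01/2026 -> 2026-01-31
    v = re.sub(r"(?<=\d),(?=\d{2}\b)", ".", v)                  # 1234,56 -> 1234.56
    return v

def score_document(raw_output, ground_truth, document_text):
    """Score one model output against ground truth for one document."""
    metrics = {"parsed": False, "correct": 0, "total": 0, "hallucinated": 0}
    try:
        fields = json.loads(raw_output)   # JSON parse success check
        metrics["parsed"] = True
    except json.JSONDecodeError:
        return metrics                    # parse failure: no field scoring
    for name, truth in ground_truth.items():
        metrics["total"] += 1
        predicted = normalize(fields.get(name))
        if predicted == normalize(truth):
            metrics["correct"] += 1       # exact match after normalization
        elif predicted and predicted not in document_text:
            metrics["hallucinated"] += 1  # value not present in the document
    return metrics
```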
Models Tested
| Model | Parameters | Vision Input | Quantization | Notes |
|---|---|---|---|---|
| Qwen 2.5-VL 7B | 7.6B | Native | Q4_K_M | DataUnchain primary model |
| Qwen 2.5-VL 72B | 72.7B | Native | Q4_K_M | Requires 48GB+ VRAM; tested on A100 |
| LLaMA 3.2-Vision 11B | 11.0B | Native | Q4_K_M | Meta, open-weight |
| Mistral Pixtral 12B | 12.0B | Native | Q4_K_M | Mistral AI, open-weight |
| GPT-4V (API) | Unknown | Native | N/A (API) | Cloud baseline; separate test environment |
The same extraction prompt was used for all models — a structured prompt specifying the target JSON schema, field definitions, and instructions for handling missing fields. We did not perform model-specific prompt engineering, which may disadvantage models that perform better with different prompt styles. This is intentional: in production, using a single prompt across models is realistic; heavily tuning prompts per model is not.
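For illustration, a call along these lines against Ollama's HTTP API (the schema below is a trimmed stand-in, not our production prompt, and the model tag is an assumption):

```python
import base64
import json
import urllib.request

# Trimmed stand-in for the shared extraction prompt (not the production schema)
SCHEMA_PROMPT = """Extract the following fields from the document image and return
ONLY a JSON object with exactly these keys: invoice_number (string),
invoice_date (string, YYYY-MM-DD), supplier_name (string), total_amount (number).
If a field is not visible in the document, return null for that field.
Do not guess values that are not present in the document."""

def build_payload(image_b64, model="qwen2.5vl:7b"):
    """Assemble the /api/generate request body shared across all models."""
    return {
        "model": model,
        "prompt": SCHEMA_PROMPT,
        "images": [image_b64],
        "format": "json",                 # grammar-constrained JSON where supported
        "stream": False,
        "options": {"temperature": 0},    # deterministic output, as in the benchmark
    }

def extract(image_path, model="qwen2.5vl:7b", host="http://localhost:11434"):
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(image_b64, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(json.loads(resp.read())["response"])
```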
Test Document Categories
Category 1: Standard Digital Invoices (100 samples)
Clean PDF invoices generated by accounting software (primarily FatturaPA converted to visual PDF, and European supplier PDFs from systems like SAP, Sage, DATEV). All documents are text-searchable, printed at 300+ DPI when rasterized, and have standard invoice layouts with clear field labels. This category represents the easiest case and serves as a baseline.
Category 2: Scanned Paper Invoices (100 samples)
Physical invoices scanned by office scanners at 150–300 DPI. Quality varies significantly: some are clean scans of laser-printed documents; others have coffee stains, physical creases, skewed scanning angles, handwritten annotations, and rubber stamps overlapping printed text. This category tests OCR robustness and layout understanding under degraded conditions.
Category 3: Complex Invoices with Many Line Items (100 samples)
Multi-page invoices with 20–80 line items each, spanning 2–6 pages. Line item tables may have non-standard column arrangements, merged cells, subtotals interspersed within the table, and items grouped by category. This category specifically tests line item extraction completeness and accuracy — the hardest task in invoice processing.
Category 4: Commercial Contracts (100 samples)
Supplier and customer contracts in PDF format, 4–28 pages. Contracts are text-heavy with minimal visual structure — no tables, no labeled fields. Target extraction fields: party names (both parties), effective date, termination date, contract value (if stated), payment terms, governing law jurisdiction, and notice period. This category tests semantic understanding rather than field label recognition.
Category 5: DDT / Delivery Notes (100 samples)
Italian DDT (Documento di Trasporto) and equivalent European delivery notes. These have a structured but highly variable layout: carrier information, sender, recipient, list of goods with quantities and units of measure, gross weight, and transport mode. DDTs often include handwritten quantities added at delivery time. Target fields: sender, recipient, DDT number, DDT date, carrier, list of line items (description, quantity, unit of measure, lot number where present).
Results: Standard Digital Invoices
Field-level extraction accuracy on the 100 standard digital invoice samples (field values must exactly match ground truth after normalization):
| Field | Qwen 2.5-VL 7B | LLaMA 3.2-Vision | Mistral Pixtral | GPT-4V |
|---|---|---|---|---|
| Invoice number | 97.2% | 94.1% | 92.8% | 98.1% |
| Invoice date | 98.4% | 96.2% | 95.1% | 99.0% |
| Supplier name | 99.1% | 97.8% | 96.5% | 99.3% |
| VAT number (P.IVA) | 95.6% | 91.2% | 89.4% | 97.2% |
| Subtotal (imponibile) | 94.8% | 90.1% | 88.7% | 96.1% |
| VAT amount (IVA) | 93.2% | 88.4% | 86.2% | 95.4% |
| Total amount | 96.1% | 92.3% | 90.8% | 97.8% |
| Payment due date | 91.4% | 86.8% | 84.1% | 93.7% |
| Line items (all fields, all rows) | 87.4% | 79.2% | 76.1% | 91.2% |
| Overall (all fields weighted) | 94.8% | 90.7% | 88.8% | 96.4% |
Analysis: Qwen 2.5-VL 7B is the clear leader among open-weight models, approaching GPT-4V on most fields; the remaining gap to GPT-4V is most pronounced on line item extraction (87.4% vs 91.2%) and VAT number recognition (95.6% vs 97.2%). LLaMA 3.2-Vision and Mistral Pixtral perform similarly to each other, with Mistral showing slightly lower accuracy across all fields. The 4–6 percentage point overall gap between Qwen and LLaMA/Mistral is significant in production: at 50,000 documents per year, it translates to thousands of additional field errors requiring human review.
Results: Scanned Paper Invoices
| Field | Qwen 2.5-VL 7B | LLaMA 3.2-Vision | Mistral Pixtral | GPT-4V |
|---|---|---|---|---|
| Invoice number | 91.4% | 85.2% | 83.7% | 93.8% |
| Invoice date | 93.2% | 88.4% | 87.0% | 95.1% |
| Supplier name | 94.7% | 89.1% | 88.2% | 96.0% |
| Total amount | 89.3% | 82.7% | 80.4% | 92.1% |
| Line items | 76.8% | 65.4% | 62.1% | 81.3% |
| Overall | 88.2% | 81.6% | 79.3% | 91.4% |
Analysis: All models degrade on scanned documents, as expected. Qwen 2.5-VL 7B maintains the smallest accuracy drop from digital to scanned (94.8% → 88.2%, a 6.6 point drop), suggesting better OCR robustness. LLaMA drops 9.1 points and Mistral drops 9.5 points. GPT-4V's drop is only 5.0 points, suggesting the larger model size (and possibly better OCR training data) provides more scan robustness. Line item extraction on scanned documents is notably harder for all models: complex table structures in low-quality scans push line item accuracy below 80% for LLaMA and Mistral.
Results: Complex Multi-Line Invoices
| Metric | Qwen 2.5-VL 7B | LLaMA 3.2-Vision | Mistral Pixtral | GPT-4V |
|---|---|---|---|---|
| Header fields accuracy | 95.8% | 91.4% | 89.7% | 97.1% |
| Line item row recall (rows found) | 91.2% | 80.4% | 77.8% | 94.6% |
| Line item field accuracy (per row) | 84.1% | 74.2% | 71.4% | 88.7% |
| Correct page 2+ line items captured | 88.4% | 74.6% | 71.2% | 92.3% |
| Subtotal / total accuracy | 93.4% | 87.2% | 85.1% | 95.8% |
Analysis: Multi-page, multi-line invoices are the hardest category. LLaMA and Mistral show a significant drop in line item row recall: they miss an average of 20–22% of all line item rows on complex invoices. This is primarily due to page boundary handling: models that process page 1 and 2 as separate images miss rows that span pages or appear only on page 2. Qwen 2.5-VL appears to handle multi-image context better, which we tentatively attribute to differences in how its visual encoder processes multiple page images. GPT-4V performs best overall, with 94.6% row recall.
Results: Commercial Contracts
| Field | Qwen 2.5-VL 7B | LLaMA 3.2-Vision | Mistral Pixtral | GPT-4V |
|---|---|---|---|---|
| Party names (both) | 96.4% | 93.8% | 92.7% | 98.1% |
| Effective date | 91.2% | 86.4% | 84.8% | 94.1% |
| Termination / expiry date | 83.7% | 76.2% | 74.1% | 88.4% |
| Contract value (where stated) | 87.4% | 80.1% | 77.8% | 91.2% |
| Governing law jurisdiction | 88.2% | 82.7% | 80.4% | 92.8% |
| Notice period | 79.8% | 71.4% | 68.2% | 85.6% |
| Overall | 87.8% | 81.4% | 79.7% | 91.7% |
Analysis: Contract extraction is harder than invoice extraction for all models, because contracts require semantic reasoning rather than field-label matching. Notice period extraction is particularly challenging: the information may appear in any of several clauses, expressed in varied legal language. GPT-4V's larger parameter count provides the most benefit here, where semantic understanding of legal text is critical. Accordingly, the gap between Qwen and GPT-4V is wider on contracts (3.9 points) than on scanned invoices (3.2 points); even so, Qwen's semantic capabilities remain competitive among open-weight models.
Performance Results
| Model | Avg processing time | Throughput (docs/hr) | VRAM required | Hardware needed |
|---|---|---|---|---|
| Qwen 2.5-VL 7B | 8.2s | ~440 | ~9GB (Q4_K_M) | RTX 3080+ (10GB+) |
| Qwen 2.5-VL 72B | 42s | ~85 | ~44GB (Q4_K_M) | A100 80GB / 2× 3090 |
| LLaMA 3.2-Vision 11B | 11.4s | ~315 | ~14GB (Q4_K_M) | RTX 3080 Ti+ (16GB+) |
| Mistral Pixtral 12B | 13.1s | ~274 | ~15GB (Q4_K_M) | RTX 3080 Ti+ (16GB+) |
| GPT-4V (API) | 4.1s (API latency) | API rate limited | N/A | Internet connection |
Note on GPT-4V throughput: The 4.1s API latency does not include queue wait time during high demand periods. The OpenAI Tier 4 API rate limit for GPT-4V is 800 requests/minute (RPM), which theoretically allows ~48,000 documents/hour. In practice, rate limiting is hit much sooner during burst processing. Sustained throughput at high volume is constrained by API rate limits and pricing, not raw inference speed.
Note on Qwen 2.5-VL 7B throughput: The 440 docs/hour figure uses batch size 1 (sequential processing). With parallel inference workers (2–4 workers on RTX 4090), throughput increases to 800–1,400 docs/hour with proportionally increased VRAM usage. Sequential processing is sufficient for most enterprise deployments where invoice arrival rate is 50–200 documents/hour.
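The parallel-worker setup behind the 800–1,400 docs/hour figure can be sketched as below (illustrative; `extract_document` stands in for the actual per-document inference call):

```python
from concurrent.futures import ThreadPoolExecutor

def process_batch(documents, extract_document, workers=3):
    """Run up to `workers` concurrent inference calls against Ollama.

    Each worker keeps one request in flight; Ollama schedules them on
    the GPU, so VRAM use grows roughly with the worker count.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(extract_document, d): d for d in documents}
        for future, doc in futures.items():
            try:
                results[doc] = future.result()
            except Exception as exc:      # inference failure -> review queue
                results[doc] = {"status": "NEEDS_REVIEW", "error": str(exc)}
    return results
```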
Structured Output Reliability
Field-level accuracy is meaningless if the model does not produce parseable JSON. We measured three distinct failure modes:
| Model | JSON parse success | Schema adherence | Hallucination rate | Output truncation |
|---|---|---|---|---|
| Qwen 2.5-VL 7B | 99.2% | 97.8% | 1.4% | 0.4% |
| Qwen 2.5-VL 72B | 99.6% | 98.8% | 0.9% | 0.1% |
| LLaMA 3.2-Vision 11B | 96.4% | 93.2% | 3.8% | 2.1% |
| Mistral Pixtral 12B | 95.1% | 91.8% | 4.2% | 2.8% |
| GPT-4V | 99.8% | 99.1% | 0.7% | 0.1% |
The hallucination rate differences between models are significant in practice. LLaMA's 3.8% hallucination rate means that roughly 1 in 26 extracted field values is invented by the model rather than read from the document. For invoice processing where financial accuracy is critical, hallucinated values must be caught by the validation layer. Qwen 2.5-VL's 1.4% hallucination rate is less than half that of LLaMA or Mistral.
On JSON parse failures: LLaMA and Mistral fail to produce valid JSON in 3.6% and 4.9% of cases respectively. These failures typically occur on very long documents where the model truncates its output mid-JSON, or when the model breaks character and outputs reasoning text before the JSON object. In production, JSON parse failures require fallback handling and typically result in a NEEDS_REVIEW routing. Using Ollama's JSON mode (grammar-constrained generation) reduces parse failures for all models but does not eliminate them.
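One cheap salvage step for the "reasoning text before the JSON" failure mode is to locate the outermost JSON object in the raw output before giving up. A sketch:

```python
import json

def salvage_json(raw):
    """Recover a JSON object wrapped in prose; None means NEEDS_REVIEW."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None                       # no JSON object present at all
    try:
        return json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None                       # e.g. output truncated mid-JSON
```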
Difficult Document Performance
Low-Quality Scans (150 DPI, Yellowed Paper)
| Model | 150 DPI clean | 150 DPI yellowed | 150 DPI + skew + stamp |
|---|---|---|---|
| Qwen 2.5-VL 7B | 86.4% | 79.2% | 68.1% |
| LLaMA 3.2-Vision 11B | 78.1% | 70.4% | 58.2% |
| Mistral Pixtral 12B | 76.8% | 68.7% | 55.4% |
| GPT-4V | 89.2% | 83.1% | 74.8% |
For heavily degraded documents (skewed, stamped, yellowed), all models show substantial accuracy drops. The practical recommendation for documents at this quality level is to implement image preprocessing (deskewing, contrast enhancement, denoising) before AI inference. Our testing showed that preprocessing improved accuracy by 8–15 percentage points across all models on the worst-quality documents. With preprocessing, the degraded scan accuracy for Qwen 2.5-VL improved from 68.1% to 79.3% on the worst-quality subset.
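A minimal Pillow-based sketch of that preprocessing (illustrative only; deskewing is omitted because Pillow alone does not provide it, and production pipelines typically use OpenCV for that step):

```python
from PIL import Image, ImageFilter, ImageOps

def preprocess_scan(img):
    """Basic cleanup for degraded scans before AI inference."""
    g = ImageOps.grayscale(img)                 # drop yellowed-paper color cast
    g = ImageOps.autocontrast(g, cutoff=2)      # stretch faded-ink contrast
    g = g.filter(ImageFilter.MedianFilter(3))   # light salt-and-pepper denoise
    return g
```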
Handwritten Annotations
Documents with handwritten annotations on top of printed text presented an interesting challenge. All models handled simple handwritten annotations (a date or quantity written in the margin) with reasonable accuracy (75–85% on the annotation itself). Annotations that overlap or cross out printed text caused more significant problems, reducing accuracy on the underlying printed field by 12–18% across all models.
Documents with Missing Fields
We tested with 30 documents that were missing at least one expected field (e.g., no explicit VAT number, no stated due date). The desired behavior is for the model to return null for missing fields rather than hallucinating a value. Qwen hallucinated on 8.2% of missing fields, LLaMA on 22.4%, Mistral on 24.1%, and GPT-4V on 5.1%. This is one of the starkest behavioral differences between models: LLaMA and Mistral are significantly more likely to invent a plausible-looking value when a field is absent from the document.
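One mitigation is a grounding check in post-processing: any extracted value that cannot be located in the document's OCR text is reset to null and flagged. A simplified sketch (real pipelines must normalize both sides, since a normalized amount like 1234.56 may appear in the document as 1.234,56):

```python
import re

def guard_missing_fields(extracted, document_text):
    """Null out extracted values that don't appear anywhere in the document."""
    haystack = re.sub(r"\s+", "", document_text.lower())
    flagged = []
    for field, value in extracted.items():
        if value is None:
            continue
        needle = re.sub(r"\s+", "", str(value).lower())
        if needle and needle not in haystack:
            extracted[field] = None
            flagged.append(field)         # route to human review
    return extracted, flagged
```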
Multi-Language Documents
| Language | Qwen 2.5-VL 7B | LLaMA 3.2-Vision | Mistral Pixtral | GPT-4V |
|---|---|---|---|---|
| Italian | 94.8% | 90.7% | 88.8% | 96.4% |
| German | 93.2% | 88.4% | 87.1% | 95.8% |
| English | 95.4% | 91.8% | 90.2% | 97.1% |
| French | 92.7% | 87.2% | 88.4% | 95.2% |
All models show consistent multilingual performance across the four tested languages, with accuracy within 2–3 percentage points of each other across languages. Notably, Mistral Pixtral performs marginally better on French than German (possibly reflecting Mistral AI's French origin and training data composition). No model required language-specific configuration.
Cost Analysis: Local vs. Cloud
Qwen 2.5-VL 7B on RTX 4090: Cost Per Document
Electricity cost calculation for on-premise processing:
- RTX 4090 TDP: 450W (under full GPU load during inference)
- Average processing time per document: 8.2 seconds
- Energy per document: 450W × (8.2/3600 hrs) = 0.001025 kWh
- Italian commercial electricity rate (2026): ~€0.28/kWh
- Electricity cost per document: €0.000287 (less than 0.03 euro cents)
- At 50,000 documents/year: €14.35/year in electricity for GPU inference
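The arithmetic above, reproduced as a runnable check:

```python
# Figures from the list above
GPU_POWER_W = 450          # RTX 4090 TDP under inference load
SECONDS_PER_DOC = 8.2
EUR_PER_KWH = 0.28         # assumed Italian commercial rate

kwh_per_doc = GPU_POWER_W / 1000 * SECONDS_PER_DOC / 3600
eur_per_doc = kwh_per_doc * EUR_PER_KWH
print(f"{kwh_per_doc:.6f} kWh, €{eur_per_doc:.6f} per document")
print(f"€{eur_per_doc * 50_000:.2f} per year at 50,000 docs")
```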
GPT-4V API Cost Per Document
GPT-4V pricing (OpenAI, as of Q1 2026 — subject to change):
- Image input: ~$0.00765 per image (1024px resolution, standard detail)
- Text input (prompt + schema): ~400 tokens × $0.01/1k tokens = $0.004
- Text output (JSON extraction): ~600 tokens × $0.03/1k tokens = $0.018
- Total cost per document: ~$0.03 (~€0.027)
- At 50,000 documents/year: $1,500/year (~€1,380)
- At 500,000 documents/year: $15,000/year (~€13,800)
Break-Even Analysis
| Annual Document Volume | GPT-4V API Annual Cost | Local (Qwen 7B) Annual Cost* | Verdict |
|---|---|---|---|
| 10,000 docs | ~€276 | €3,020 (hardware amortized) | API cheaper at low volume |
| 30,000 docs | ~€828 | €3,020 | API still cheaper |
| 50,000 docs | ~€1,380 | €3,020 | Approaching parity |
| 100,000 docs | ~€2,760 | €3,020 | Near parity |
| 200,000 docs | ~€5,520 | €3,020 | Local saves €2,500/yr |
| 500,000 docs | ~€13,800 | €4,500** | Local saves €9,300/yr |
* Annual cost includes hardware amortized over 3 years (RTX 4090 server: ~€4,500/3y = €1,500/yr) + electricity (€14/yr at 50k docs) + software support. ** Higher-volume scenario requires faster hardware; estimated €4,500/yr total.
The break-even volume is approximately 110,000–140,000 documents per year in pure economic terms, assuming no on-premise requirement. However, this calculation excludes the data sovereignty benefit: for organizations with GDPR obligations, competitive sensitivity about document contents, or air-gap requirements, the economic comparison is irrelevant — cloud is not an option regardless of price. For those organizations, on-premise local AI wins by definition.
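The pure-economics break-even point follows directly from the table's figures (this simple division lands near the low end of the quoted 110,000–140,000 range; the wider range reflects cost assumptions beyond it):

```python
API_COST_PER_DOC_EUR = 0.0276        # GPT-4V cost per document (from above)
LOCAL_FIXED_EUR_PER_YEAR = 3_020     # amortized hardware + electricity + support

break_even_docs = LOCAL_FIXED_EUR_PER_YEAR / API_COST_PER_DOC_EUR
print(f"Pure-economics break-even: ~{break_even_docs:,.0f} documents/year")
```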
Our Recommendation
| Use Case | Recommended Model |
|---|---|
| Most enterprise deployments (invoices, DDTs, standard docs) | Qwen 2.5-VL 7B |
| Highest accuracy requirements (contracts, complex legal docs) | Qwen 2.5-VL 72B or GPT-4V |
| Budget-constrained, acceptable accuracy trade-off | LLaMA 3.2-Vision 11B |
| Privacy + accuracy (air-gapped, high accuracy) | Qwen 2.5-VL 72B on-premise |
| Low volume, no privacy constraint, max accuracy | GPT-4V (API) |
Why we choose Qwen 2.5-VL 7B for DataUnchain: The combination of leading open-weight accuracy, low VRAM requirements (runs on a 10GB consumer GPU), fast inference (~440 docs/hour), excellent JSON reliability (99.2% parse success), low hallucination rate (1.4%), and strong multilingual performance makes it the optimal choice for enterprise document processing where data cannot leave the building. The 1.6–3.9 percentage point accuracy gap versus GPT-4V (depending on document category) is a reasonable trade for complete data sovereignty and zero recurring cloud cost.
Limitations of This Benchmark
- Not neutral: This benchmark was conducted by DataUnchain, which uses and sells a system built on Qwen 2.5-VL. Despite our best efforts at objective methodology, readers should be aware of this conflict of interest.
- Document set is biased toward Italian business documents: 68% of the test set is Italian-language documents from Italian suppliers. Performance on documents from other countries or in other languages may differ from reported results.
- Single prompt for all models: Models may perform significantly differently with prompts specifically engineered for each model. Our results represent a realistic "one prompt, multiple models" scenario, not the theoretical best-case for each model.
- Models are updated frequently: LLaMA, Qwen, and Mistral release updated versions regularly. By the time you read this, newer model versions may have been released with significantly different accuracy profiles. Always benchmark on your specific document types before committing to a model.
- Ground truth ambiguity: 1.8% of field instances had ambiguous ground truth where human annotators disagreed. These were excluded from accuracy calculations, which may slightly inflate all model accuracy figures.
- GPT-4V tested in separate environment: API latency, rate limits, and throughput figures for GPT-4V are not directly comparable to local inference results.
Methodology Notes
- All PDF documents were converted to PNG images at 300 DPI before being passed to models. Models were not given access to the PDF text layer — all models processed documents as images.
- Exact match scoring was used for all fields. A value was scored as correct only if it matched ground truth exactly after normalization (date format standardization, decimal separator standardization, leading/trailing whitespace removal). Partial credit was not given.
- For line item extraction, a row was scored as correctly extracted if all required line item fields (description, quantity, unit price, total) matched ground truth. A row missing one field was scored as incorrect.
- The temperature parameter was set to 0 for all models to maximize output determinism. Results may differ with higher temperature settings.
- Models were tested with Ollama's JSON mode enabled where supported (Qwen 2.5-VL, LLaMA 3.2-Vision). Mistral Pixtral was tested without JSON mode as it showed worse performance in JSON mode in our preliminary testing.
- All 500 documents were processed 3 times per model; results are averaged across runs to account for non-determinism at temperature 0.
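The all-or-nothing line item scoring rule from the notes above can be sketched as:

```python
REQUIRED_FIELDS = ("description", "quantity", "unit_price", "total")

def row_correct(predicted, truth):
    """A row counts only if every required field matches ground truth."""
    return all(
        str(predicted.get(k, "")).strip() == str(truth.get(k, "")).strip()
        for k in REQUIRED_FIELDS
    )

def line_item_recall(predicted_rows, truth_rows):
    """Fraction of ground-truth rows matched by at least one extracted row."""
    if not truth_rows:
        return 1.0
    return sum(
        any(row_correct(p, t) for p in predicted_rows) for t in truth_rows
    ) / len(truth_rows)
```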
Frequently Asked Questions
Can I run Qwen 2.5-VL 7B on an older NVIDIA GPU?
Qwen 2.5-VL 7B in Q4_K_M quantization requires approximately 9GB VRAM. It will run on any NVIDIA GPU with 10GB+ VRAM: RTX 3080 (10GB), RTX 4070, RTX 4080, RTX 4090. It will not fit on 8GB cards such as the RTX 3070 or RTX 3060 Ti. CPU-only inference is possible but extremely slow (3–5 minutes per document), which is not practical for production use.
How does Qwen 2.5-VL 7B compare to commercial IDP platforms?
We do not have benchmark data comparing to commercial IDP platforms (ABBYY, Hyperscience, AWS Textract, Google Document AI) because we do not have access to comparable test sets across those platforms. In our customer deployments, we observe accuracy competitive with commercial platforms on standard invoice extraction, with the advantage of zero cloud data transmission and no per-document pricing.
Is Qwen 2.5-VL 7B fine-tuned for document extraction?
No. DataUnchain uses the base Qwen 2.5-VL 7B model without fine-tuning. Performance improvements come from prompt engineering, extraction schema design, and post-processing validation — not from model fine-tuning. This means customers benefit from any general improvements to Qwen 2.5-VL when they upgrade model versions, without needing to re-run fine-tuning pipelines.
What happens when the model returns invalid JSON?
DataUnchain's post-processing layer applies JSON repair heuristics (fixing trailing commas, missing brackets, escaped quotes) that successfully recover approximately 60% of malformed JSON outputs. The remaining 40% of parse failures result in the document being routed to NEEDS_REVIEW status with a "parse failure" error flag, where a human reviewer enters the data manually from the original document.
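A simplified sketch of such repair heuristics (the production layer handles more cases; note the bracket-balancing below is naive and would be confused by brackets inside string values):

```python
import json
import re

def repair_json(raw):
    """Best-effort JSON repair; returns a dict/list or None (-> NEEDS_REVIEW)."""
    text = raw.strip()
    text = re.sub(r",\s*([}\]])", r"\1", text)   # drop trailing commas
    # Close any brackets the model left open (truncated output)
    stack = []
    for ch in text:
        if ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    text += "".join(reversed(stack))
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```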
Will these results hold for my specific document types?
Not necessarily. This benchmark covers Italian and European business documents across 5 specific categories. If your documents are in different languages, follow different layouts (e.g., US invoices with different field conventions), or fall into different categories (medical records, legal pleadings, engineering specifications), you should benchmark on your own document samples before drawing conclusions. We offer a no-commitment evaluation where we process 50 sample documents from your real document set.
Ready to automate your document workflows?
DataUnchain processes your documents locally. No cloud, no data exposure, no subscriptions.
Request a Demo →