The State of AI Document Parsing: Practical Findings for Production Teams

64 controlled runs comparing GPT-5.1, Claude Sonnet 4.5, AWS Textract, Azure Document Intelligence, and Google DocAI across passports, certificates, and tax forms.
Tags: ai, document parsing, ocr, benchmarking, production

Published: December 11, 2025

We ran 64 controlled experiments across five parsing providers to learn which ones are ready for production workflows that combine IDs, vital records, and tax forms. The harness, sample documents, and evaluation scripts live in the public repository document-data-extraction-benchmark so teams can reproduce and extend these results.

At a glance

  • Scope: 5 providers, 6 document types, images and PDFs
  • Metrics: accuracy, average latency, and cost
  • Runs: 64 experiments using a reproducible harness
  • Default pick: GPT-5.1 for accuracy, cost, and latency balance
  • When to specialize: AWS or Azure for passport-heavy ID workflows

Quick recommendations

  • General pipelines: GPT-5.1 as the default; consider Claude Sonnet 4.5 only when a 1M context window is required
  • IDs (passports): AWS Textract or Azure ID model; GPT-5.1 is the strongest LLM choice
  • Defer: Google DocAI unless you are investing in custom processors
  • QA: Plan human review on roughly 30 percent of outputs

Executive summary

  • Top performers: GPT-5.1 leads with 73.2 percent accuracy, production-ready average latency (~12.3 seconds), and very low cost (~$0.00054 per document). Claude Sonnet 4.5 matches accuracy but is slower and 53x more expensive.
  • OCR specialty: AWS Textract and Azure excel on passports and other government IDs thanks to specialized models, but fall short on general documents where LLMs dominate.
  • Cost value: GPT-5.1 delivers near state-of-the-art accuracy at almost the lowest cost.
  • Not production ready: Google DocAI underperformed and would require substantial custom development.
  • Testing and reproducibility: All benchmarks were run with the open harness, document samples, and evaluation scripts in document-data-extraction-benchmark.

Context and limitations
  • Results are directional from 64 tests across 6 document types, mostly U.S. documents.
  • Configurations were largely out of the box; accuracy may shift with tuning, different samples, or provider updates.
  • Use these rankings to shortlist providers. Re-run your top three documents per workflow with tuned prompts or model settings before finalizing.

Why we ran this evaluation

Operational workflows depend on accurate extraction of structured data from IDs, vital records, and tax forms. Accuracy, speed, and reliability directly affect automation quality and cost. We tested five providers on eight sample documents spanning six document types to see which vendor delivers the best real-world performance today.

Setup overview

  • Providers: GPT-5.1, Claude Sonnet 4.5, AWS Textract, Azure Document Intelligence, Google DocAI
  • Document types: Passports, driver licenses, birth certificates, marriage certificates, W-2, 1040
  • Formats: Images and PDFs
  • Metrics: Accuracy, average latency, and cost
  • Reproducibility: Harness plus scripts in the document-data-extraction-benchmark repository
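
Accuracy here means the share of ground-truth fields a provider extracts correctly. A minimal sketch of that metric, with illustrative field names and normalization (the actual harness in the repository may differ):

```python
def field_accuracy(expected: dict, extracted: dict) -> float:
    """Fraction of ground-truth fields the provider extracted correctly."""
    def norm(value):
        # Case- and whitespace-insensitive comparison; an assumption, not
        # necessarily the normalization the benchmark harness applies.
        return str(value).strip().lower() if value is not None else ""

    matches = sum(
        1 for field, truth in expected.items()
        if norm(extracted.get(field)) == norm(truth)
    )
    return matches / len(expected) if expected else 0.0

# Hypothetical passport fields for illustration.
truth = {"surname": "DOE", "given_names": "JANE", "passport_no": "X1234567"}
parsed = {"surname": "Doe", "given_names": "Jane", "passport_no": "X1234568"}
print(field_accuracy(truth, parsed))  # 2 of 3 fields match
```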

Key findings

  1. LLM vision models outperform classic OCR on general documents. GPT-5.1 and Claude Sonnet 4.5 reach 70 to 75 percent accuracy on certificates and tax forms where OCR lags at 0 to 40 percent.
  2. Specialized OCR wins for narrow ID use cases. AWS Textract and Azure hit around 90 percent accuracy on passports due to ID-trained models; their accuracy drops sharply on certificates and tax forms.
  3. GPT-5.1 is the composite leader. It ties for highest accuracy (73.2 percent) with Claude Sonnet 4.5 while being faster on average (~12.3 seconds vs ~16.3 seconds) and far cheaper (~$0.00054 vs ~$0.0286 per document).
  4. Providers with major gaps. Azure is strong on passports but inconsistent elsewhere without configuration. Google DocAI showed the lowest accuracy and needs custom processor development.
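
The cost multiple in finding 3 follows directly from the per-document prices; a quick sanity check:

```python
# Approximate per-document costs from the benchmark runs above.
gpt_cost = 0.00054      # GPT-5.1, USD per document
claude_cost = 0.0286    # Claude Sonnet 4.5, USD per document

multiple = claude_cost / gpt_cost
print(f"Claude Sonnet 4.5 costs ~{multiple:.0f}x GPT-5.1 per document")
```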

Provider snapshots

GPT-5.1 - default leader and budget choice
  • Accuracy: 73.2 percent
  • Avg latency: ~12.3 seconds
  • Cost: ~$0.00054 per document
  • Best overall balance of accuracy, cost, and speed.

AWS Textract - passport specialist
  • Accuracy: ~50 percent overall, ~90 percent on passports
  • Avg latency: ~7.7 seconds
  • Excels on passports and standardized ID formats.

Claude Sonnet 4.5 - accuracy tie at premium cost
  • Accuracy: 73.2 percent
  • Avg latency: ~16.3 seconds
  • Cost: ~$0.0286 per document
  • Use when the 1M context window is mandatory and budget is secondary.

Google DocAI - not production ready
  • Accuracy: ~22 percent
  • Avg latency: ~6.8 seconds
  • Requires custom processors for acceptable accuracy; not recommended out of the box.

Performance by document type

  • Passports: OCR tools (AWS Textract, Azure) dominate passports with MRZ-trained models at roughly 90 percent accuracy. GPT-5.1 is the best LLM here (~80 percent), closing most of the gap while staying far cheaper.
  • Driver licenses: All providers struggle; layout variability is the main issue and no model clearly dominates.
  • Birth and marriage certificates: GPT-5.1 and Claude Sonnet 4.5 reach 75 to 90 percent accuracy. OCR models often remain below 40 percent.
  • Tax forms (W-2 and 1040): LLMs show strong semantic understanding, delivering around 80 to 100 percent accuracy on many fields. Claude Sonnet 4.5, GPT-5.1, and Textract cluster around ~91.7 percent on W-2 and ~80 percent on 1040.

When to choose each provider

  • Choose GPT-5.1 if you want the best composite score (73.2 percent accuracy, ~12.3 second average latency, ~$0.00054 per document) across mixed document types with a single default.
  • Choose Claude Sonnet 4.5 if you need a 1M token context window for long or complex documents and can accept slower latency and much higher cost for the same accuracy.
  • Use AWS Textract or Azure only if you have ID-heavy workflows (especially passports) and need maximum ID accuracy at the expense of general-document performance and higher page-based pricing.
  • Avoid Google DocAI for now; low accuracy outweighs its speed advantage unless you plan significant custom processor work.
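
The shortlist above can be encoded as a simple routing rule. A sketch, with provider identifiers and document-type labels as illustrative assumptions:

```python
def pick_provider(doc_type: str, needs_long_context: bool = False) -> str:
    """Route a document to a provider per the shortlist above (illustrative)."""
    if doc_type == "passport":
        return "aws_textract"       # specialized ID models, ~90% on passports
    if needs_long_context:
        return "claude_sonnet_4_5"  # only when the 1M context window is required
    return "gpt_5_1"                # default: best accuracy/cost/latency balance

print(pick_provider("passport"))                          # aws_textract
print(pick_provider("w2"))                                # gpt_5_1
print(pick_provider("1040", needs_long_context=True))     # claude_sonnet_4_5
```

A real router would likely also consider format, page count, and per-provider rate limits; this sketch only captures the document-type decision described above.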

Strategic recommendations

  1. Adopt GPT-5.1 as the default provider for general-purpose document extraction across mixed document types.
  2. Use AWS Textract or Azure ID models for passport-heavy or government ID workflows that demand near 90 percent ID accuracy.
  3. Reserve Claude Sonnet 4.5 for scenarios where the 1M context window is a hard requirement.
  4. Budget for human review on roughly 30 percent of outputs given the current accuracy ceiling of about 73 percent.
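
The 30 percent review budget in recommendation 4 is a buffer over the expected error rate implied by the accuracy ceiling. A back-of-the-envelope check, with daily volume as a purely hypothetical figure:

```python
accuracy = 0.732                      # composite accuracy of the top performers
expected_error_rate = 1 - accuracy    # ~26.8% of documents expected to have errors
review_rate = 0.30                    # recommended buffer above the error rate

docs_per_day = 10_000                 # hypothetical daily volume
reviews_per_day = int(docs_per_day * review_rate)
print(f"expected error rate: {expected_error_rate:.1%}")
print(f"route ~{reviews_per_day} of {docs_per_day} documents to human review")
```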