From $0.004 to $0.0004 Per Receipt: How We Slashed Document AI Cost 10× (Without Losing Accuracy)


Feb 28, 2026 · Admin · 6 min read


At VaultSafe we need to turn receipts, invoices, and IDs into structured JSON at scale, reliably and cheaply. The obvious path is image → vision LLM → JSON. That works, but at roughly **$0.004 per image** (averaged over 1000+ images) it doesn't scale. We redesigned the pipeline around **OCR + deterministic extraction** and got to **~$0.0004 per image** with accuracy in the high 90s, on par with what we saw from LLM-based extraction. Here's how we did it and why the OCR we chose is good enough.

The bottom line: We cut document extraction cost from $0.004 to ~$0.0004 per image (10×) while keeping accuracy in the high 90s. No vision model in the hot path—just OCR, regex, and optional small LLM cleanup.

Two Pipelines: Vision LLM vs. OCR + Rules

Pipeline A (expensive): Image → Vision LLM → Structured JSON

  • Pros: Handles layout and noise well; one model does “see + understand.”
  • Cons: Cost (~$0.004/img), latency, and dependency on a single black box.

Pipeline B (ours): Image → OCR → Text → Regex / pattern matching (+ optional small LLM cleanup) → Structured JSON

  • Pros: ~10× lower cost, predictable latency, transparent rules, and the same end-to-end accuracy (high 90s) in our experiments.
  • Cons: OCR text has minor noise; we have to design extraction logic and optional cleanup.

We ran a formal comparison and chose Pipeline B as our default for receipts and invoices. The main insight: OCR noise is small and structured enough that regex and pattern matching get you most of the way, and a small LLM can fix the rest when needed—no need for an expensive vision model on every image.
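The cost gap compounds at volume. A back-of-envelope sketch using the per-image figures above (the monthly volume is hypothetical, for illustration only):

```python
# Back-of-envelope cost comparison using the per-image figures from the post.
VISION_LLM_PER_IMAGE = 0.004   # Pipeline A: image -> vision LLM -> JSON
OCR_RULES_PER_IMAGE = 0.0004   # Pipeline B: OCR -> rules (+ optional cleanup)

def monthly_cost(images_per_month, per_image):
    """Total extraction spend for one month of documents."""
    return images_per_month * per_image

volume = 1_000_000  # hypothetical monthly volume
print(round(monthly_cost(volume, VISION_LLM_PER_IMAGE), 2))  # 4000.0
print(round(monthly_cost(volume, OCR_RULES_PER_IMAGE), 2))   # 400.0
```

At a million documents a month, the difference is thousands of dollars, which is why the 10× figure matters more than the per-image pennies suggest.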

Why This OCR Is Good Enough

OCR output is never perfect. We see things like:

  • `Date z` (should be a date)
  • `Receipt No 3` (digit or spacing off)
  • `Transaction No 5`
  • `Amount 3 %2841927` (artifact + number)

But in practice, the semantic structure is intact: labels like “Receipt No”, “Amount”, “Date” are present and the values are nearby. That means:

  1. Regex and pattern matching work — We can reliably find “Receipt No” and the following number, “Amount” and the following value, etc.
  2. A small LLM cleanup works — For the remaining messy cases, a cheap text-only LLM can normalize and correct fields without ever seeing the image.
  3. No need for an expensive vision model — We avoid sending every image through a vision LLM while still matching the accuracy we got from the full vision pipeline.

So the bar for OCR is not “perfect text,” it’s “good enough structure for rules + optional LLM.” That’s the bar we used when choosing an engine.
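To make "good enough structure" concrete, here is a minimal sketch of label-anchored matching over a hypothetical noisy OCR dump. The `find_after` helper and the sample text are illustrative, not our production code:

```python
import re

# Hypothetical noisy OCR output: labels survive even when values are messy.
ocr_text = """
Receipt No 48213
Date 04-Jul-2025
Amount 3 %2841927
"""

def find_after(label, text):
    """Return the token run following a label on the same line, or None."""
    m = re.search(rf"{re.escape(label)}\s+(.+)", text)
    return m.group(1).strip() if m else None

print(find_after("Receipt No", ocr_text))  # 48213
print(find_after("Amount", ocr_text))      # 3 %2841927 (raw; cleanup comes later)
```

The amount still needs cleanup, but the label told us where to look, which is the property that makes rule-first extraction viable.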

Can You Extract JSON Without an LLM?

Code and automation — regex does the heavy lifting

Yes. In our setup, most receipts and invoices are fully extractable with rules only (no LLM). Example patterns:

```python
import re

receipt_no = re.search(r"Receipt No\s+(\d+)", text)
amount = re.search(r"(\d{1,2},\d{2},\d{3})", text)  # Indian-style grouping
date = re.search(r"\d{2}-[A-Za-z]{3}-\d{4}", text)  # e.g. 04-Jul-2025
```

Then normalize in code:

  • `28,41,927` → `2841927` (remove commas, treat as integer/float)
  • `04-Jul-2025` → ISO format for storage
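A minimal sketch of that normalization step (the helper names are ours, for illustration):

```python
from datetime import datetime

def normalize_amount(raw):
    """Strip Indian-style grouping commas, e.g. '28,41,927' -> 2841927."""
    return int(raw.replace(",", ""))

def normalize_date(raw):
    """Convert '04-Jul-2025' to ISO 'YYYY-MM-DD' for storage."""
    return datetime.strptime(raw, "%d-%b-%Y").date().isoformat()

print(normalize_amount("28,41,927"))  # 2841927
print(normalize_date("04-Jul-2025"))  # 2025-07-04
```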

The remainder (weird layouts, heavy noise, mixed languages) we send through a small text LLM for cleanup only. That keeps cost low while preserving accuracy in the high 90s.

How We Chose the OCR Engine: Benchmarks

Data and analytics — measuring what works

We support nine languages (en, ch, es, hi, ar, pt, fr, ja, de) and need one OCR path that works for all. We benchmarked PaddleOCR, EasyOCR, and Tesseract on the same set of 100+ images (receipts, invoices, multi-language), measuring time, memory, and accuracy (character/word-level vs. ground truth, in line with common OCR benchmarks).
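A sketch of the kind of harness we mean: it profiles any `image → text` callable for time and peak memory, so wrappers around PaddleOCR, EasyOCR, and Tesseract can be compared on equal footing. The stand-in engine below is illustrative only; accuracy scoring against ground truth is a separate step:

```python
import time
import tracemalloc

def benchmark(run_ocr, images):
    """Time and peak-memory profile an OCR callable over a list of images.

    `run_ocr` is any function image -> text; real engines would be
    wrapped the same way before being passed in.
    """
    tracemalloc.start()
    start = time.perf_counter()
    outputs = [run_ocr(img) for img in images]
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"seconds": elapsed, "peak_mb": peak / 1e6, "outputs": outputs}

# Stand-in "engine" for illustration only.
stats = benchmark(lambda img: img.upper(), ["receipt one", "invoice two"])
print(stats["outputs"])  # ['RECEIPT ONE', 'INVOICE TWO']
```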

| Engine | Time (s) | Memory (MB) | Accuracy (our eval) | Method |
|---|---|---|---|---|
| PaddleOCR | 271 | 570 | Best | 9 languages × 1 pass each, merge by position |
| EasyOCR | 840 | 570 | Lower | 5 readers (ch, ja, ar, hi, Latin), merge |
| Tesseract | 2.4 | 91 | Lowest | 1 pass, 9 languages in a single call |

Takeaways:

  • PaddleOCR delivered the best accuracy on our receipt/invoice set, consistent with reported results on document OCR. For us, that accuracy was non-negotiable.
  • Tesseract and EasyOCR were faster or lighter but trailed on our 100-image eval, so we didn't adopt either as the primary engine.
  • We went with PaddleOCR and reduced cost via a hybrid, geo-aware approach.

Hybrid approach: geo-specific languages + re-run when you need more

We don’t run all nine languages on every image. We use geo-specific tagging (e.g. from account locale, upload region, or user preference) to choose a small set of languages per document (e.g. en+hi for India, en+ch for China, en+es+pt for Latin America). That cuts model load, memory, and time while keeping accuracy high where it matters.

Users can also re-run parsing after adding more languages. If someone has a mix of Hindi and English receipts, they can enable Hindi in settings and re-parse; we run OCR only for the selected languages. So we get PaddleOCR’s accuracy without the full cost of a nine-language run on every single image.
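A sketch of geo-aware language selection under these rules. The region tags and default sets below are illustrative, not our production mapping:

```python
# Hypothetical mapping from region tag to default OCR languages.
GEO_LANGUAGES = {
    "IN": ["en", "hi"],
    "CN": ["en", "ch"],
    "LATAM": ["en", "es", "pt"],
}
ALL_LANGUAGES = ["en", "ch", "es", "hi", "ar", "pt", "fr", "ja", "de"]

def languages_for(region, user_extra=()):
    """Pick OCR languages: geo defaults plus user-enabled extras, deduped."""
    langs = list(GEO_LANGUAGES.get(region, ["en"]))
    for lang in user_extra:
        if lang in ALL_LANGUAGES and lang not in langs:
            langs.append(lang)
    return langs

print(languages_for("IN"))          # ['en', 'hi']
print(languages_for("IN", ["ja"]))  # ['en', 'hi', 'ja']
```

A re-parse after the user enables a language is just another call with a larger `user_extra`; only the selected languages are ever loaded.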

High-Level Pipeline Design

Our production flow looks like this:

```mermaid
flowchart LR
  A[Image] --> B[Geo tag]
  B --> C[PaddleOCR]
  C --> D[Regex + rules]
  D --> E{Normalize}
  E --> F[Optional LLM cleanup]
  F --> G[JSON]
  H[User adds languages] --> I[Re-run parsing]
  I --> C
```

  1. Ingest — Image (receipt/invoice/ID) is uploaded.
  2. Geo + languages — We use geo-specific tagging (locale/region/preference) to pick which languages to run (e.g. en+hi for India). Users can add more languages and re-run parsing anytime.
  3. OCR — PaddleOCR runs for the selected languages only → raw text + per-line boxes/scores.
  4. Extraction — Deterministic regex and pattern matching on the OCR text → candidate fields (receipt_no, amount, date, etc.).
  5. Normalization — Simple Python rules (comma stripping, date parsing, unit handling).
  6. Optional cleanup — For low-confidence or parse-fail cases, a small text LLM fixes or fills fields; no image input.
  7. Output — Structured JSON written to our pipeline; downstream apps and APIs consume this only.
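Steps 3–7 above can be sketched as a single function. Here `ocr` and `llm_cleanup` are injected callables, the stub OCR output is illustrative, and the patterns are the ones from earlier in the post:

```python
import json
import re

def extract_fields(text):
    """Step 4: deterministic regex over OCR text, plus step 5 normalization."""
    fields = {}
    if m := re.search(r"Receipt No\s+(\d+)", text):
        fields["receipt_no"] = m.group(1)
    if m := re.search(r"(\d{1,2},\d{2},\d{3})", text):
        fields["amount"] = int(m.group(1).replace(",", ""))  # strip grouping commas
    if m := re.search(r"\d{2}-[A-Za-z]{3}-\d{4}", text):
        fields["date"] = m.group(0)
    return fields

def process(image_bytes, ocr, llm_cleanup=None):
    """Hot path: OCR -> rules -> JSON; text-only LLM cleanup on parse failure.

    No vision model is ever called, and the image never reaches an LLM.
    """
    text = ocr(image_bytes)              # step 3
    fields = extract_fields(text)        # steps 4-5
    if len(fields) < 3 and llm_cleanup:  # step 6: fallback for weak parses
        fields = llm_cleanup(text, fields)
    return json.dumps(fields)            # step 7

# Illustration with a stub OCR that "reads" a noisy receipt.
doc = process(b"...", lambda _: "Receipt No 48213 Date 04-Jul-2025 Amount 28,41,927")
print(doc)
```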

We never send the image to a vision model in the hot path. By using PaddleOCR with geo-specific languages (and letting users re-run with more languages when needed), we go from $0.004 to ~$0.0004 per image (averaged over 1000+ images) while keeping accuracy in the high 90s.

Summary

  • Cost: We reduced document extraction cost from ~$0.004 to ~$0.0004 per image (order-of-magnitude reduction) by replacing “image → vision LLM → JSON” with “image → OCR → text → rules (+ optional small LLM) → JSON.”
  • Accuracy: End-to-end in the high 90s—comparable to the vision-LLM pipeline, because OCR keeps semantic structure intact and LLMs are good at removing the remaining noise when we use them only for cleanup.
  • OCR choice: We use PaddleOCR in production for best accuracy in our evaluation. We run it in a hybrid, geo-aware way: only the languages relevant to the user’s region (or their choice) are loaded and run, and users can re-run parsing after adding more languages.
  • Rule-first design: Most documents are handled with regex and pattern matching; the rest use a small text LLM. No expensive vision model per image.

If you’re building document extraction at scale, we hope this pipeline and these numbers help you choose between vision-heavy and OCR+rules designs. For receipts and invoices, OCR + rules (+ optional LLM cleanup) was the right tradeoff for us.