How internal document parsing works

This page describes how VaultSafe turns uploaded documents (receipts, invoices, IDs) into structured data. We use an OCR-first pipeline with deterministic extraction and optional LLM cleanup—no vision model in the hot path—to keep cost low and accuracy high (in the high 90s in our evaluations).

Pipeline overview

  1. Upload — User uploads an image or PDF (page rendered as image).
  2. Geo + languages — We use geo-specific tagging (locale/region/preference) to select which languages to run. Users can add more languages and re-run parsing later.
  3. OCR — PaddleOCR runs for the selected languages only and returns plain text (and optionally line-level boxes/scores).
  4. Extraction — Regex and pattern matching extract candidate fields (e.g. receipt number, amount, date).
  5. Normalization — Python rules normalize values (number formatting, date formats).
  6. Optional cleanup — Low-confidence or parse-fail cases can be sent to a small text-only LLM for correction.
  7. Output — Structured JSON is stored and exposed to apps and APIs.
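The control flow above can be sketched as follows. This is an illustrative skeleton, not the production code: the region table, sample OCR text, and function names are invented for the example, and normalization and the optional LLM cleanup step are omitted.

```python
import re

def select_languages(region):
    # Step 2 - hypothetical region -> language table; production uses
    # account locale, upload region, and user preference.
    return {"IN": ["en", "hi"], "CN": ["en", "zh"]}.get(region, ["en"])

def run_ocr(image_bytes, languages):
    # Step 3 - stand-in for PaddleOCR: one pass per selected language,
    # results merged into plain text.
    return "Receipt No: A-1234\nDate: 04-Jul-2025\nAmount: 2,841.92"

def extract_fields(text):
    # Step 4 - regex extraction, reduced to a single field here.
    m = re.search(r"Receipt\s*No[:#]?\s*(\S+)", text, re.IGNORECASE)
    return {"receipt_number": m.group(1)} if m else {}

def parse_document(image_bytes, region):
    # Steps 2-7 wired together into one call.
    languages = select_languages(region)
    text = run_ocr(image_bytes, languages)
    return extract_fields(text)
```

The output dict is what gets serialized to JSON in step 7.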

Design principle: We avoid sending every image through a vision LLM. OCR gives us text that preserves semantic structure; rules do most of the work, and a small LLM handles the remaining noise.

OCR engine selection

We need one OCR path that supports nine languages: English, Chinese (Simplified), Spanish, Hindi, Arabic, Portuguese, French, Japanese, and German. We evaluated three engines on the same 100+ image set (receipts, invoices, multi-language), measuring time, memory, and accuracy (character/word-level vs. ground truth):

Engine      Time (s)   Memory (MB)   Accuracy (our eval)   Method
PaddleOCR   27         1570          Best                  9 languages × 1 pass each, results merged by position
EasyOCR     840        570           Lower                 5 readers (script restrictions), results merged
Tesseract   2.4        91            Lowest                1 pass, 9 languages in a single call

Production choice: PaddleOCR. It delivered the best accuracy on our receipt/invoice set (consistent with reported document-OCR benchmarks). Tesseract and EasyOCR were faster or lighter in some setups but trailed on accuracy in our eval. We use PaddleOCR in a hybrid, geo-aware way so we don’t pay the full cost of running all nine languages on every image.

Hybrid approach: geo-specific languages and re-run

  • Geo-specific tagging — We use account locale, upload region, or user preference to choose a small set of languages per document (e.g. en+hi for India, en+zh for China, en+es+pt for Latin America). Only those language models are loaded and run.
  • Re-run with more languages — Users can add more languages in settings and re-run parsing on existing documents. We run OCR only for the selected languages, so PaddleOCR’s accuracy is preserved without running all nine every time.
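One way to implement the selection (a sketch; the field names and the region table are assumptions, not our actual schema):

```python
# Hypothetical defaults per region; production derives these from
# account locale, upload region, and user preference.
GEO_DEFAULTS = {
    "IN": {"en", "hi"},
    "CN": {"en", "zh"},
    "MX": {"en", "es", "pt"},
}

def languages_for(doc):
    # Geo defaults plus any languages the user added later in settings;
    # a re-run calls OCR with exactly this set and nothing more.
    geo = GEO_DEFAULTS.get(doc.get("region"), {"en"})
    return sorted(geo | set(doc.get("user_languages", ())))
```

Because the set is recomputed per document, adding a language in settings only changes future runs and explicit re-runs; already-parsed documents are untouched until the user re-runs them.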

PaddleOCR output is line- or word-level text with bounding boxes; we use it as the single source of truth for the extraction step.
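The per-language passes are merged by position, as noted in the comparison table. A simplified sketch of that merge, treating each OCR line as an (x, y, text, confidence) tuple and snapping positions to a coarse grid instead of doing real bounding-box matching:

```python
def merge_by_position(per_language_results, tol=10):
    # Lines from different language passes whose positions fall within
    # `tol` pixels are treated as the same physical line; the highest-
    # confidence reading wins.
    merged = {}
    for lines in per_language_results:
        for x, y, text, conf in lines:
            key = (round(x / tol), round(y / tol))
            if key not in merged or conf > merged[key][1]:
                merged[key] = (text, conf)
    # Emit top-to-bottom, then left-to-right.
    ordered = sorted(merged.items(), key=lambda kv: (kv[0][1], kv[0][0]))
    return [text for _, (text, conf) in ordered]
```

In this toy form, a low-confidence English reading of a Hindi line is simply replaced by the higher-confidence reading from the Hindi pass at the same position.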

Why OCR output is good enough

OCR text often contains minor noise (e.g. Date z, Amount 3 %2841927). What matters for our pipeline is that semantic structure is preserved: labels like “Receipt No”, “Amount”, “Date” appear in the text with values nearby. Under that condition:

  • Regex and pattern matching reliably isolate fields.
  • A small text LLM can correct or normalize the remaining noisy cases without ever seeing the image.
  • No vision model is required in the default path, which keeps cost at ~$0.0004 per image (vs. ~$0.004 per image with a vision LLM).
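To make the "labels survive, values nearby" point concrete, here is a toy regex pass over noisy text like the example above (the patterns are invented for illustration, not our production rules):

```python
import re

noisy = "Date z 04-Jul-2025\nAmount 3 %2841927\nReceipt No: R-778"

# The date label is noisy, but the value's shape is rigid - match the shape.
date = re.search(r"Date\D*?(\d{1,2}-[A-Za-z]{3}-\d{4})", noisy).group(1)

# Stray digits can leak in around an amount; take the longest digit run
# on the labelled line.
amount_line = next(l for l in noisy.splitlines() if "Amount" in l)
amount = max(re.findall(r"\d[\d,]*", amount_line), key=len)

# Anchor on the label, skip any non-digit noise, then grab the token.
receipt = re.search(r"Receipt\s*No\D*?([A-Z0-9][A-Z0-9-]*)", noisy, re.I).group(1)
```

The noise (`z`, `3 %`) never has to be understood; it only has to not match the value patterns.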

Extraction layer

The extraction layer is rule-based on the OCR text:

  • Patterns — We use regex and simple heuristics to find receipt number, amount, date, merchant, etc., depending on document type.
  • Normalization — Number formats (e.g. 28,41,927 → 2841927), dates (e.g. 04-Jul-2025 → ISO), and units are normalized in Python.
  • Confidence — When a field is missing or low-confidence, we can route that document (or field) to an optional text-only LLM cleanup step.
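The normalization rules are plain Python. A minimal sketch covering the two examples above (function names are illustrative):

```python
from datetime import datetime

def normalize_amount(raw):
    # "28,41,927" (Indian digit grouping) -> 2841927; the same rule
    # handles Western grouping like "2,841,927".
    return int(raw.replace(",", ""))

def normalize_date(raw):
    # "04-Jul-2025" -> ISO "2025-07-04".
    return datetime.strptime(raw, "%d-%b-%Y").date().isoformat()
```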

Most documents are fully extracted with rules only; the remainder use the small LLM for cleanup. End-to-end accuracy in our experiments is in the high 90s, comparable to a vision-LLM–based pipeline.

Cost and scale

  • Baseline (vision LLM per image): ~$0.004 per image (averaged over 1000+ images).
  • Our pipeline (OCR + rules + optional text LLM): ~$0.0004 per image.

The order-of-magnitude cost reduction comes from removing the vision model from the hot path and using OCR + deterministic extraction + optional small LLM cleanup instead.
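The arithmetic behind that claim, per million images (the per-image figures are the approximations quoted above):

```python
vision_llm = 0.004   # approx. $ per image, vision-LLM baseline
ocr_rules = 0.0004   # approx. $ per image, OCR + rules + optional text LLM

images = 1_000_000
ratio = round(vision_llm / ocr_rules)                # 10x cheaper
saving = round((vision_llm - ocr_rules) * images)    # approx. $3,600 saved per million images
```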

Summary

Aspect           Approach
OCR engine       PaddleOCR (geo-specific languages; re-run when user adds more)
Extraction       Regex and pattern matching on OCR text
Cleanup          Optional small text LLM for low-confidence cases
Vision model     Not used in default path
Cost (approx.)   ~$0.0004 per image
Accuracy         Best in our eval (PaddleOCR); aligned with vision-LLM baseline

For more context and benchmarks, see the blog post: How We Cut Document Extraction Cost by 10×: OCR + Rules vs. Vision LLMs.