How internal document parsing works
This page describes how VaultSafe turns uploaded documents (receipts, invoices, IDs) into structured data. We use an OCR-first pipeline with deterministic extraction and optional LLM cleanup—no vision model in the hot path—to keep cost low and accuracy high (in the high 90s in our evaluations).
Pipeline overview
- Upload — User uploads an image or PDF (page rendered as image).
- Geo + languages — We use geo-specific tagging (locale/region/preference) to select which languages to run. Users can add more languages and re-run parsing later.
- OCR — PaddleOCR runs for the selected languages only and returns plain text (and optionally line-level boxes/scores).
- Extraction — Regex and pattern matching extract candidate fields (e.g. receipt number, amount, date).
- Normalization — Python rules normalize values (number formatting, date formats).
- Optional cleanup — Low-confidence or parse-fail cases can be sent to a small text-only LLM for correction.
- Output — Structured JSON is stored and exposed to apps and APIs.
Design principle: We avoid sending every image through a vision LLM. OCR gives us text that preserves semantic structure; rules do most of the work, and a small LLM handles the remaining noise.
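The stages above can be sketched end to end. This is an illustrative skeleton, not our real API: the function names, the stub OCR output, and the cleanup check are stand-ins for the actual implementations.

```python
import re

# Minimal sketch of the pipeline stages. Each step is a trivial
# stand-in; a real run_ocr would call PaddleOCR for the selected
# languages, and cleanup would call a small text-only LLM.

def run_ocr(image_bytes: bytes, languages: list[str]) -> str:
    # Stand-in for PaddleOCR over `languages`; returns plain text.
    return "Receipt No: 4471\nAmount: 2841927\nDate: 04-Jul-2025"

def extract_fields(text: str) -> dict:
    # Regex / pattern matching over OCR text (one pattern shown).
    fields = {}
    m = re.search(r"Receipt No[:\s]+(\w+)", text)
    if m:
        fields["receipt_number"] = m.group(1)
    return fields

def needs_cleanup(fields: dict) -> bool:
    # Real check would look at per-field confidence, not just presence.
    return "receipt_number" not in fields

def parse_document(image_bytes: bytes, languages: list[str]) -> dict:
    text = run_ocr(image_bytes, languages)   # OCR, selected languages only
    fields = extract_fields(text)            # deterministic extraction
    if needs_cleanup(fields):                # low-confidence / parse-fail only
        fields["needs_llm_cleanup"] = True   # would route to text LLM
    return fields

print(parse_document(b"", ["en"]))
```

Note that the image never leaves the OCR step: the optional LLM only ever sees text.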
OCR engine selection
We need one OCR path that supports nine languages: English, Chinese (Simplified), Spanish, Hindi, Arabic, Portuguese, French, Japanese, and German. We evaluated three engines on the same 100+ image set (receipts, invoices, multi-language), measuring time, memory, and accuracy (character/word-level vs. ground truth):
| Engine | Time (s) | Memory (MB) | Accuracy (our eval) | Method |
|---|---|---|---|---|
| PaddleOCR | 27 | 1570 | Best | 9 languages × 1 pass each, results merged by position |
| EasyOCR | 840 | 570 | Lower | 5 readers (script restrictions), results merged |
| Tesseract | 2.4 | 91 | Lowest | 1 pass, 9 languages in a single call |
Production choice: PaddleOCR. It delivered the best accuracy on our receipt/invoice set (consistent with reported document-OCR benchmarks). Tesseract was far faster and EasyOCR used less memory, but both trailed on accuracy in our eval. We use PaddleOCR in a hybrid, geo-aware way so we don’t pay the full cost of running all nine languages on every image.
Hybrid approach: geo-specific languages and re-run
- Geo-specific tagging — We use account locale, upload region, or user preference to choose a small set of languages per document (e.g. en+hi for India, en+ch for China, en+es+pt for Latin America). Only those language models are loaded and run.
- Re-run with more languages — Users can add more languages in settings and re-run parsing on existing documents. We run OCR only for the selected languages, so PaddleOCR’s accuracy is preserved without running all nine every time.
PaddleOCR output is line- or word-level text with bounding boxes; we use it as the single source of truth for the extraction step.
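The "merged by position" step from the benchmark table can be sketched as follows. The line format `(y, x, text, score)` is a simplification of PaddleOCR's box/score output, and the overlap tolerance is an assumed value, not a tuned constant.

```python
# Merge per-language OCR passes "by position" (sketch): when two passes
# return a line at roughly the same location, keep the higher-confidence
# reading; otherwise keep both. Output is text in reading order.

def merge_by_position(passes: list[list[tuple[float, float, str, float]]],
                      tol: float = 10.0) -> list[str]:
    merged: list[tuple[float, float, str, float]] = []
    for lines in passes:
        for y, x, text, score in lines:
            for i, (my, mx, _, mscore) in enumerate(merged):
                if abs(my - y) <= tol and abs(mx - x) <= tol:
                    if score > mscore:                 # keep better reading
                        merged[i] = (y, x, text, score)
                    break
            else:
                merged.append((y, x, text, score))     # new line
    merged.sort(key=lambda ln: (ln[0], ln[1]))         # reading order
    return [text for _, _, text, _ in merged]
```

For example, an English pass that garbles a Hindi line is overridden by the Hindi pass's higher-confidence reading at the same coordinates.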
Why OCR output is good enough
OCR text often contains minor noise (e.g. “Date z”, “Amount 3 %2841927”). What matters for our pipeline is that semantic structure is preserved: labels like “Receipt No”, “Amount”, “Date” appear in the text with values nearby. Under that condition:
- Regex and pattern matching reliably isolate fields.
- A small text LLM can correct or normalize the remaining noisy cases without ever seeing the image.
- No vision model is required in the default path, which keeps cost at ~$0.0004 per image (vs. ~$0.004 with a vision LLM per image).
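To make the noise-tolerance point concrete, here is a small sketch of label-anchored extraction over deliberately noisy OCR text (the sample text and patterns are illustrative, not our production rule set):

```python
import re

# Labels anchor the regexes, so candidate values survive OCR noise
# like "Date z" or a stray "3 %" before the amount.

OCR_TEXT = """\
VaultMart Store #12
Receipt No 8839021
Amount 3 %2841927
Date z 04-Jul-2025
"""

FIELD_PATTERNS = {
    "receipt_number": re.compile(r"Receipt\s*No\.?\s*[:#]?\s*(\d+)", re.I),
    "amount":         re.compile(r"Amount.*?([\d,]{4,})", re.I),
    "date":           re.compile(r"Date\D*(\d{2}-[A-Za-z]{3}-\d{4})"),
}

fields = {name: m.group(1)
          for name, pat in FIELD_PATTERNS.items()
          if (m := pat.search(OCR_TEXT))}
print(fields)
# {'receipt_number': '8839021', 'amount': '2841927', 'date': '04-Jul-2025'}
```

The amount pattern skips short digit runs (`{4,}`), which is what lets it jump past the stray "3 %".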
Extraction layer
The extraction layer is rule-based on the OCR text:
- Patterns — We use regex and simple heuristics to find receipt number, amount, date, merchant, etc., depending on document type.
- Normalization — Number formats (e.g. 28,41,927 → 2841927), dates (e.g. 04-Jul-2025 → ISO), and units are normalized in Python.
- Confidence — When a field is missing or low-confidence, we can route that document (or field) to an optional text-only LLM cleanup step.
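The normalization rules can be sketched using the two examples from this page: Indian-grouped numbers and dd-Mon-yyyy dates. The function names are illustrative; the real rules cover more locales and formats.

```python
import datetime
import re

# Normalization sketches for the examples above:
#   28,41,927   -> 2841927      (strip grouping separators)
#   04-Jul-2025 -> 2025-07-04   (ISO 8601)

def normalize_number(raw: str) -> int:
    return int(re.sub(r"[,\s]", "", raw))

def normalize_date(raw: str) -> str:
    return datetime.datetime.strptime(raw, "%d-%b-%Y").date().isoformat()

print(normalize_number("28,41,927"))   # 2841927
print(normalize_date("04-Jul-2025"))   # 2025-07-04
```

Stripping separators rather than parsing locale-specific grouping means Indian lakh/crore grouping and Western thousands grouping normalize identically.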
Most documents are fully extracted with rules only; the remainder use the small LLM for cleanup. End-to-end accuracy in our experiments is in the high 90s, comparable to a vision-LLM–based pipeline.
Cost and scale
- Baseline (vision LLM per image): ~$0.004 per image (generalized over 1000+ images).
- Our pipeline (OCR + rules + optional text LLM): ~$0.0004 per image.
The order-of-magnitude cost reduction comes from removing the vision model from the hot path and using OCR + deterministic extraction + optional small LLM cleanup instead.
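A back-of-envelope check of the figures above. Only the ~$0.004 and ~$0.0004 per-image numbers come from our measurements; the split between OCR compute and text-LLM cleanup, and the 10% cleanup fraction, are assumptions for illustration.

```python
# Blended per-image cost (sketch). OCR_PER_IMAGE, TEXT_LLM_PER_CALL,
# and CLEANUP_FRACTION are illustrative assumptions.

VISION_LLM_PER_IMAGE = 0.004    # baseline: vision LLM on every image
OCR_PER_IMAGE        = 0.0003   # assumed OCR + rules compute cost
TEXT_LLM_PER_CALL    = 0.001    # assumed small text-LLM call cost
CLEANUP_FRACTION     = 0.10     # assumed share routed to cleanup

blended = OCR_PER_IMAGE + CLEANUP_FRACTION * TEXT_LLM_PER_CALL
print(f"${blended:.4f} per image")
print(f"{VISION_LLM_PER_IMAGE / blended:.0f}x cheaper than vision baseline")
```

Because cleanup is conditional, its cost scales with the fraction of noisy documents, not with total volume.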
Summary
| Aspect | Approach |
|---|---|
| OCR engine | PaddleOCR (geo-specific languages; re-run when user adds more) |
| Extraction | Regex and pattern matching on OCR text |
| Cleanup | Optional small text LLM for low-confidence cases |
| Vision model | Not used in default path |
| Cost (approx.) | ~$0.0004 per image |
| Accuracy | High 90s end-to-end in our eval; comparable to a vision-LLM pipeline |
For more context and benchmarks, see the blog post: How We Cut Document Extraction Cost by 10×: OCR + Rules vs. Vision LLMs.