<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Document Parsing | VaultSafe | Chat with Your Files — AI Document Assistant</title><link>https://www.vaultsafe.ai/en/tag/document-parsing/</link><atom:link href="https://www.vaultsafe.ai/en/tag/document-parsing/index.xml" rel="self" type="application/rss+xml"/><description>Document Parsing</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sat, 28 Feb 2026 00:00:00 +0000</lastBuildDate><image><url>https://www.vaultsafe.ai/media/logo.svg</url><title>Document Parsing</title><link>https://www.vaultsafe.ai/en/tag/document-parsing/</link></image><item><title>From $0.004 to $0.0004 Per Receipt: How We Slashed Document AI Cost 10× (Without Losing Accuracy)</title><link>https://www.vaultsafe.ai/en/blog/ocr-to-json-10x-cost-cut/</link><pubDate>Sat, 28 Feb 2026 00:00:00 +0000</pubDate><guid>https://www.vaultsafe.ai/en/blog/ocr-to-json-10x-cost-cut/</guid><description>&lt;h1 id="from-00004-per-receipt-how-we-slashed-document-ai-cost-10-without-losing-accuracy"&gt;From $0.004 to $0.0004 Per Receipt: How We Slashed Document AI Cost 10× (Without Losing Accuracy)&lt;/h1&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-100" &gt;&lt;img src="https://images.unsplash.com/photo-1554224155-8d04cb21cd6c?w=1200&amp;amp;q=80" alt="Document extraction at scale — receipts and invoices" loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;At VaultSafe we need to turn receipts, invoices, and IDs into &lt;strong&gt;structured JSON&lt;/strong&gt; at scale—reliably and cheaply. The obvious path is &lt;strong&gt;image → vision LLM → JSON&lt;/strong&gt;. That works, but at roughly &lt;strong&gt;$0.004 per image&lt;/strong&gt; (averaged over 1,000+ images) it doesn’t scale. We redesigned the pipeline around &lt;strong&gt;OCR + deterministic extraction&lt;/strong&gt; and got to &lt;strong&gt;~$0.0004 per image&lt;/strong&gt; with &lt;strong&gt;accuracy in the high 90s&lt;/strong&gt;—on par with what we saw from LLM-based extraction. Here’s how we did it and why the OCR we chose is good enough.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The bottom line:&lt;/strong&gt; We cut document extraction cost from &lt;strong&gt;$0.004 to ~$0.0004 per image&lt;/strong&gt; (10×) while keeping &lt;strong&gt;accuracy in the high 90s&lt;/strong&gt;. No vision model in the hot path—just OCR, regex, and optional small LLM cleanup.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id="two-pipelines-vision-llm-vs-ocr--rules"&gt;Two Pipelines: Vision LLM vs. OCR + Rules&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Pipeline A (expensive):&lt;/strong&gt; Image → Vision LLM → Structured JSON&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pros: Handles layout and noise well; one model does “see + understand.”&lt;/li&gt;
&lt;li&gt;Cons: Cost (~$0.004/img), latency, and dependency on a single black box.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Pipeline B (ours):&lt;/strong&gt; Image → OCR → Text → Regex / pattern matching (+ optional small LLM cleanup) → Structured JSON&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Pros: &lt;strong&gt;~10× lower cost&lt;/strong&gt;, predictable latency, transparent rules, and the same &lt;strong&gt;end-to-end accuracy (high 90s)&lt;/strong&gt; in our experiments.&lt;/li&gt;
&lt;li&gt;Cons: OCR text has minor noise; we have to design extraction logic and optional cleanup.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We ran a formal comparison and chose Pipeline B as our default for receipts and invoices. The main insight: &lt;strong&gt;OCR noise is small and structured enough that regex and pattern matching get you most of the way&lt;/strong&gt;, and a small LLM can fix the rest when needed—no need for an expensive vision model on every image.&lt;/p&gt;
&lt;h2 id="why-this-ocr-is-good-enough"&gt;Why This OCR Is Good Enough&lt;/h2&gt;
&lt;p&gt;OCR output is never perfect. We see things like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Date z&lt;/code&gt; (should be a date)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Receipt No 3&lt;/code&gt; (digit or spacing off)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Transaction No 5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Amount 3 %2841927&lt;/code&gt; (artifact + number)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But in practice, &lt;strong&gt;the semantic structure is intact&lt;/strong&gt;: labels like “Receipt No”, “Amount”, “Date” are present and the values are nearby. That means:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Regex and pattern matching work&lt;/strong&gt; — We can reliably find “Receipt No” and the following number, “Amount” and the following value, etc.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A small LLM cleanup works&lt;/strong&gt; — For the remaining messy cases, a cheap text-only LLM can normalize and correct fields without ever seeing the image.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No need for an expensive vision model&lt;/strong&gt; — We avoid sending every image through a vision LLM while still matching the accuracy we got from the full vision pipeline.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So the bar for OCR is not “perfect text,” it’s “good enough structure for rules + optional LLM.” That’s the bar we used when choosing an engine.&lt;/p&gt;
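&lt;p&gt;As a concrete sketch of that bar, label-anchored patterns like the following tolerate the noise shown above; the field names, label spellings, and regexes here are illustrative, not our exact production rules:&lt;/p&gt;

```python
import re

# Hypothetical label-anchored extraction: find the label, then grab the
# nearby value, tolerating stray punctuation between label and value.
LABEL_PATTERNS = {
    "receipt_no": re.compile(r"Receipt\s*No\.?\s*[:#]?\s*(\d+)", re.IGNORECASE),
    "date": re.compile(r"Date\s*[:#]?\s*(\d{2}-[A-Za-z]{3}-\d{4})", re.IGNORECASE),
    "amount": re.compile(r"Amount\s*[:#]?\s*[^0-9]*([\d,]+(?:\.\d{2})?)", re.IGNORECASE),
}

def extract_fields(text: str) -> dict:
    """Return whichever labelled fields the patterns can find."""
    fields = {}
    for name, pattern in LABEL_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields
```

&lt;p&gt;Because each pattern anchors on the label rather than on exact positioning, minor OCR noise around the value usually doesn&amp;rsquo;t break extraction.&lt;/p&gt;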
&lt;h2 id="can-you-extract-json-without-an-llm"&gt;Can You Extract JSON Without an LLM?&lt;/h2&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-100" &gt;&lt;img src="https://images.unsplash.com/photo-1555066931-4365d14bab8c?w=1200&amp;amp;q=80" alt="Code and automation — regex does the heavy lifting" loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Yes.&lt;/strong&gt; In our setup, &lt;strong&gt;most receipts and invoices are fully extractable with rules only&lt;/strong&gt; (no LLM). Example patterns:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-python" data-lang="python"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;receipt_no&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;Receipt No\s+(\d+)&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;(\d{1,2},\d&lt;/span&gt;&lt;span class="si"&gt;{2}&lt;/span&gt;&lt;span class="s2"&gt;,\d&lt;/span&gt;&lt;span class="si"&gt;{3}&lt;/span&gt;&lt;span class="s2"&gt;)&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Indian-style grouping&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;\d&lt;/span&gt;&lt;span class="si"&gt;{2}&lt;/span&gt;&lt;span class="s2"&gt;-[A-Za-z]&lt;/span&gt;&lt;span class="si"&gt;{3}&lt;/span&gt;&lt;span class="s2"&gt;-\d&lt;/span&gt;&lt;span class="si"&gt;{4}&lt;/span&gt;&lt;span class="s2"&gt;&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# e.g. 04-Jul-2025&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Then normalize in code:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;28,41,927&lt;/code&gt; → &lt;code&gt;2841927&lt;/code&gt; (remove commas, treat as integer/float)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;04-Jul-2025&lt;/code&gt; → ISO format for storage&lt;/li&gt;
&lt;/ul&gt;
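&lt;p&gt;A minimal sketch of that normalization step, using only the standard library (the two helpers are illustrative, not our production code):&lt;/p&gt;

```python
from datetime import datetime

def normalize_amount(raw: str) -> int:
    """'28,41,927' -> 2841927: strip grouping commas, parse as an integer."""
    return int(raw.replace(",", ""))

def normalize_date(raw: str) -> str:
    """'04-Jul-2025' -> '2025-07-04' (ISO 8601 for storage)."""
    return datetime.strptime(raw, "%d-%b-%Y").date().isoformat()
```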
&lt;p&gt;The remainder (weird layouts, heavy noise, mixed languages) we send through a &lt;strong&gt;small text LLM&lt;/strong&gt; for cleanup only. That keeps cost low while preserving accuracy in the high 90s.&lt;/p&gt;
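&lt;p&gt;The routing decision can be sketched as a simple gate; the required-field set and the 0.85 confidence threshold are hypothetical values, not our production settings:&lt;/p&gt;

```python
# Hypothetical routing sketch: send a parse to the small text LLM only when
# the rule-based pass is incomplete or OCR confidence is low.
REQUIRED_FIELDS = {"receipt_no", "amount", "date"}

def needs_llm_cleanup(fields: dict, ocr_confidence: float, threshold: float = 0.85) -> bool:
    """True if any required field is missing or the OCR pass was low-confidence."""
    missing = REQUIRED_FIELDS - fields.keys()
    return bool(missing) or ocr_confidence < threshold
```

&lt;p&gt;Gating this way means the LLM sees only the hard minority of documents, which is what keeps the average per-image cost an order of magnitude below the vision-LLM pipeline.&lt;/p&gt;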
&lt;h2 id="how-we-chose-the-ocr-engine-benchmarks"&gt;How We Chose the OCR Engine: Benchmarks&lt;/h2&gt;
&lt;p&gt;
&lt;figure &gt;
&lt;div class="flex justify-center "&gt;
&lt;div class="w-100" &gt;&lt;img src="https://images.unsplash.com/photo-1551288049-bebda4e38f71?w=1200&amp;amp;q=80" alt="Data and analytics — measuring what works" loading="lazy" data-zoomable /&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p&gt;We support &lt;strong&gt;nine languages&lt;/strong&gt; (en, ch, es, hi, ar, pt, fr, ja, de) and need one OCR path that works for all. We benchmarked &lt;strong&gt;PaddleOCR&lt;/strong&gt;, &lt;strong&gt;EasyOCR&lt;/strong&gt;, and &lt;strong&gt;Tesseract&lt;/strong&gt; on the same set of &lt;strong&gt;100+ images&lt;/strong&gt; (receipts, invoices, multi-language), measuring &lt;strong&gt;time&lt;/strong&gt;, &lt;strong&gt;memory&lt;/strong&gt;, and &lt;strong&gt;accuracy&lt;/strong&gt; (character/word-level vs. ground truth, in line with common OCR benchmarks).&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Time (s)&lt;/th&gt;
&lt;th&gt;Memory (MB)&lt;/th&gt;
&lt;th&gt;Accuracy (our eval)&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PaddleOCR&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;1570&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Best&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9 languages × 1 pass each, merge by position&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EasyOCR&lt;/td&gt;
&lt;td&gt;840&lt;/td&gt;
&lt;td&gt;570&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;5 readers (ch, ja, ar, hi, Latin), merge&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tesseract&lt;/td&gt;
&lt;td&gt;2.4&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;1 pass, 9 languages in a single call&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Takeaways:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;PaddleOCR&lt;/strong&gt; delivered the &lt;strong&gt;best accuracy&lt;/strong&gt; on our receipt/invoice set—consistent with reported results on document OCR. For us, that accuracy was non-negotiable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tesseract&lt;/strong&gt; and &lt;strong&gt;EasyOCR&lt;/strong&gt; were faster or lighter but trailed on our 100-image eval, so we didn&amp;rsquo;t adopt either as the primary engine.&lt;/li&gt;
&lt;li&gt;We went with &lt;strong&gt;PaddleOCR&lt;/strong&gt; and reduced its cost via a &lt;strong&gt;hybrid, geo-aware&lt;/strong&gt; approach.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id="hybrid-approach-geo-specific-languages--re-run-when-you-need-more"&gt;Hybrid approach: geo-specific languages + re-run when you need more&lt;/h3&gt;
&lt;p&gt;We don&amp;rsquo;t run all nine languages on every image. We use &lt;strong&gt;geo-specific tagging&lt;/strong&gt; (e.g. from account locale, upload region, or user preference) to choose a &lt;strong&gt;small set of languages per document&lt;/strong&gt; (e.g. en+hi for India, en+ch for China, en+es+pt for Latin America). That cuts model load, memory, and time while keeping accuracy high where it matters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Users can also re-run parsing&lt;/strong&gt; after adding more languages. If someone has a mix of Hindi and English receipts, they can enable Hindi in settings and re-parse; we run OCR only for the selected languages. So we get PaddleOCR&amp;rsquo;s accuracy without the full cost of a nine-language run on every single image.&lt;/p&gt;
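&lt;p&gt;A minimal sketch of the language selection, assuming a hypothetical region-to-language table and a user preference list (the real mapping and merge logic are simplified here):&lt;/p&gt;

```python
from typing import List, Optional

# Hypothetical region -> OCR-language map; the codes mirror the nine
# languages we support, but this table and the merge rule are illustrative.
GEO_LANGS = {
    "IN": ["en", "hi"],
    "CN": ["en", "ch"],
    "LATAM": ["en", "es", "pt"],
}

def languages_for(region: str, user_langs: Optional[List[str]] = None) -> List[str]:
    """Pick OCR languages from the geo tag, then append any user-enabled ones."""
    langs = list(GEO_LANGS.get(region, ["en"]))
    for lang in user_langs or []:
        if lang not in langs:
            langs.append(lang)
    return langs
```

&lt;p&gt;When a user enables an extra language and re-parses, only the merged list is loaded—never all nine models at once.&lt;/p&gt;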
&lt;h2 id="high-level-pipeline-design"&gt;High-Level Pipeline Design&lt;/h2&gt;
&lt;p&gt;Our production flow looks like this:&lt;/p&gt;
&lt;div class="mermaid"&gt;flowchart LR
A[Image] --&gt; B[Geo tag]
B --&gt; C[PaddleOCR]
C --&gt; D[Regex + rules]
D --&gt; E{Normalize}
E --&gt; F[Optional LLM cleanup]
F --&gt; G[JSON]
H[User adds languages] --&gt; I[Re-run parsing]
I --&gt; C
&lt;/div&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Ingest&lt;/strong&gt; — Image (receipt/invoice/ID) is uploaded.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geo + languages&lt;/strong&gt; — We use &lt;strong&gt;geo-specific tagging&lt;/strong&gt; (locale/region/preference) to pick which languages to run (e.g. en+hi for India). Users can add more languages and &lt;strong&gt;re-run parsing&lt;/strong&gt; anytime.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OCR&lt;/strong&gt; — &lt;strong&gt;PaddleOCR&lt;/strong&gt; runs for the selected languages only → raw text + per-line boxes/scores.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extraction&lt;/strong&gt; — Deterministic &lt;strong&gt;regex and pattern matching&lt;/strong&gt; on the OCR text → candidate fields (receipt_no, amount, date, etc.).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Normalization&lt;/strong&gt; — Simple Python rules (comma stripping, date parsing, unit handling).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optional cleanup&lt;/strong&gt; — For low-confidence or parse-fail cases, a &lt;strong&gt;small text LLM&lt;/strong&gt; fixes or fills fields; no image input.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Output&lt;/strong&gt; — Structured JSON written to our pipeline; downstream apps and APIs consume this only.&lt;/li&gt;
&lt;/ol&gt;
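&lt;p&gt;The steps above can be sketched end to end; &lt;code&gt;run_ocr&lt;/code&gt; and &lt;code&gt;llm_cleanup&lt;/code&gt; are stubs standing in for PaddleOCR and the cleanup model, and the regexes and thresholds are illustrative:&lt;/p&gt;

```python
import re

def run_ocr(image_bytes, languages):
    # Stub: a real implementation would call PaddleOCR for the selected
    # languages and return text plus an aggregate confidence score.
    return "Receipt No 12345 Amount 28,41,927", 0.9

def extract_fields(text):
    # Step 4: deterministic regex/pattern matching on the OCR text.
    fields = {}
    m = re.search(r"Receipt No\s+(\d+)", text)
    if m:
        fields["receipt_no"] = m.group(1)
    m = re.search(r"(\d{1,2},\d{2},\d{3})", text)  # Indian-style grouping
    if m:
        fields["amount"] = int(m.group(1).replace(",", ""))  # step 5: normalize
    return fields

def llm_cleanup(text, fields):
    # Stub: a real implementation would send the OCR text (never the image)
    # to a small text-only LLM and merge corrected fields back in.
    return fields

def parse_document(image_bytes, languages):
    text, confidence = run_ocr(image_bytes, languages)  # step 3: OCR
    fields = extract_fields(text)                       # steps 4-5: rules
    if confidence < 0.85 or len(fields) < 2:            # step 6: optional cleanup
        fields = llm_cleanup(text, fields)
    return fields                                       # step 7: JSON payload
```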
&lt;p&gt;We never send the image to a vision model in the hot path. By using &lt;strong&gt;PaddleOCR with geo-specific languages&lt;/strong&gt; (and letting users re-run with more languages when needed), we cut cost from &lt;strong&gt;$0.004 to ~$0.0004 per image&lt;/strong&gt; (averaged over 1,000+ images) while keeping &lt;strong&gt;accuracy in the high 90s&lt;/strong&gt;.&lt;/p&gt;
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost:&lt;/strong&gt; We reduced document extraction cost from &lt;strong&gt;~$0.004 to ~$0.0004 per image&lt;/strong&gt; (order-of-magnitude reduction) by replacing “image → vision LLM → JSON” with “image → OCR → text → rules (+ optional small LLM) → JSON.”&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Accuracy:&lt;/strong&gt; End-to-end &lt;strong&gt;in the high 90s&lt;/strong&gt;—comparable to the vision-LLM pipeline, because OCR keeps &lt;strong&gt;semantic structure&lt;/strong&gt; intact and LLMs are good at removing the remaining noise when we use them only for cleanup.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OCR choice:&lt;/strong&gt; We use &lt;strong&gt;PaddleOCR&lt;/strong&gt; in production for best accuracy in our evaluation. We run it in a &lt;strong&gt;hybrid, geo-aware&lt;/strong&gt; way: only the languages relevant to the user&amp;rsquo;s region (or their choice) are loaded and run, and users can &lt;strong&gt;re-run parsing&lt;/strong&gt; after adding more languages.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rule-first design:&lt;/strong&gt; &lt;strong&gt;Most&lt;/strong&gt; documents are handled with &lt;strong&gt;regex and pattern matching&lt;/strong&gt;; the rest use a small text LLM. No expensive vision model per image.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you’re building document extraction at scale, we hope this pipeline and these numbers help you choose between vision-heavy and OCR+rules designs. For receipts and invoices, &lt;strong&gt;OCR + rules (+ optional LLM cleanup)&lt;/strong&gt; was the right tradeoff for us.&lt;/p&gt;</description></item></channel></rss>