<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Documentation | VaultSafe | Chat with Your Files — AI Document Assistant</title><link>https://www.vaultsafe.ai/en/docs/</link><atom:link href="https://www.vaultsafe.ai/en/docs/index.xml" rel="self" type="application/rss+xml"/><description>Documentation</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Wed, 18 Feb 2026 00:00:00 +0000</lastBuildDate><image><url>https://www.vaultsafe.ai/media/logo.svg</url><title>Documentation</title><link>https://www.vaultsafe.ai/en/docs/</link></image><item><title>How internal document parsing works</title><link>https://www.vaultsafe.ai/en/docs/document-parsing-pipeline/</link><pubDate>Sat, 28 Feb 2026 00:00:00 +0000</pubDate><guid>https://www.vaultsafe.ai/en/docs/document-parsing-pipeline/</guid><description>&lt;h1 id="how-internal-document-parsing-works"&gt;How internal document parsing works&lt;/h1&gt;
&lt;p&gt;This page describes how VaultSafe turns uploaded documents (receipts, invoices, IDs) into structured data. We use an &lt;strong&gt;OCR-first pipeline&lt;/strong&gt; with deterministic extraction and optional LLM cleanup—no vision model in the hot path—to keep cost low and accuracy high (in the high 90s in our evaluations).&lt;/p&gt;
&lt;h2 id="pipeline-overview"&gt;Pipeline overview&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Upload&lt;/strong&gt; — User uploads an image or PDF (page rendered as image).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Geo + languages&lt;/strong&gt; — We use geo-specific tagging (locale/region/preference) to select which languages to run. Users can add more languages and re-run parsing later.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OCR&lt;/strong&gt; — PaddleOCR runs for the selected languages only and returns plain text (and optionally line-level boxes/scores).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Extraction&lt;/strong&gt; — Regex and pattern matching extract candidate fields (e.g. receipt number, amount, date).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Normalization&lt;/strong&gt; — Python rules normalize values (number formatting, date formats).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optional cleanup&lt;/strong&gt; — Low-confidence or parse-fail cases can be sent to a small text-only LLM for correction.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Output&lt;/strong&gt; — Structured JSON is stored and exposed to apps and APIs.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Design principle:&lt;/strong&gt; We avoid sending every image through a vision LLM. OCR gives us text that preserves semantic structure; rules do most of the work, and a small LLM handles the remaining noise.&lt;/p&gt;
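A minimal sketch of the control flow above, assuming injected stages; every function name here is illustrative, not VaultSafe's actual internal API, and the OCR, extraction, and cleanup steps are passed in as callables so the sketch stands alone:

```python
# Illustrative control flow for the parsing pipeline. All names are
# hypothetical; the OCR/extraction/cleanup stages are injected as callables.

def parse_document(image, langs, ocr_fn, extract_fn, normalize_fn,
                   llm_cleanup_fn=None):
    text = ocr_fn(image, langs)        # PaddleOCR on the selected languages only
    fields = extract_fn(text)          # regex / pattern matching
    record = normalize_fn(fields)      # deterministic Python rules
    if llm_cleanup_fn is not None and record.pop("needs_cleanup", False):
        record = llm_cleanup_fn(record)  # optional small text-only LLM
    return record                      # structured, JSON-able dict
```

In production the `ocr_fn` slot would wrap PaddleOCR; it is left abstract here so the sketch does not depend on the library.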
&lt;h2 id="ocr-engine-selection"&gt;OCR engine selection&lt;/h2&gt;
&lt;p&gt;We need one OCR path that supports &lt;strong&gt;nine languages&lt;/strong&gt;: English, Chinese (Simplified), Spanish, Hindi, Arabic, Portuguese, French, Japanese, and German. We evaluated three engines on the same &lt;strong&gt;100+ image&lt;/strong&gt; set (receipts, invoices, multi-language), measuring &lt;strong&gt;time&lt;/strong&gt;, &lt;strong&gt;memory&lt;/strong&gt;, and &lt;strong&gt;accuracy&lt;/strong&gt; (character/word-level vs. ground truth):&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Time (s)&lt;/th&gt;
&lt;th&gt;Memory (MB)&lt;/th&gt;
&lt;th&gt;Accuracy (our eval)&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PaddleOCR&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;1570&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Best&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9 languages × 1 pass each, results merged by position&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EasyOCR&lt;/td&gt;
&lt;td&gt;840&lt;/td&gt;
&lt;td&gt;570&lt;/td&gt;
&lt;td&gt;Lower&lt;/td&gt;
&lt;td&gt;5 readers (script restrictions), results merged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tesseract&lt;/td&gt;
&lt;td&gt;2.4&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;td&gt;Lowest&lt;/td&gt;
&lt;td&gt;1 pass, 9 languages in a single call&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;Production choice: PaddleOCR.&lt;/strong&gt; It delivered the best accuracy on our receipt/invoice set (consistent with reported document-OCR benchmarks). Tesseract was far faster and lighter, and EasyOCR used less memory, but both trailed PaddleOCR on accuracy in our eval. We use &lt;strong&gt;PaddleOCR&lt;/strong&gt; in a &lt;strong&gt;hybrid, geo-aware&lt;/strong&gt; way so we don&amp;rsquo;t pay the full cost of running all nine languages on every image.&lt;/p&gt;
&lt;h3 id="hybrid-approach-geo-specific-languages-and-re-run"&gt;Hybrid approach: geo-specific languages and re-run&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Geo-specific tagging&lt;/strong&gt; — We use account locale, upload region, or user preference to choose a &lt;strong&gt;small set of languages per document&lt;/strong&gt; (e.g. en+hi for India, en+ch for China, en+es+pt for Latin America). Only those language models are loaded and run.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Re-run with more languages&lt;/strong&gt; — Users can add more languages in settings and &lt;strong&gt;re-run parsing&lt;/strong&gt; on existing documents. We run OCR only for the selected languages, so PaddleOCR&amp;rsquo;s accuracy is preserved without running all nine every time.&lt;/li&gt;
&lt;/ul&gt;
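The geo-aware selection and re-run can be sketched as follows; the region keys and default sets are illustrative examples taken from the text, not VaultSafe's real configuration:

```python
# Hypothetical geo-to-language defaults; region keys and sets are examples
# from the docs, not the production configuration.
REGION_LANGS = {
    "IN": ["en", "hi"],           # India: English + Hindi
    "CN": ["en", "ch"],           # China: English + Chinese ("ch" in PaddleOCR)
    "LATAM": ["en", "es", "pt"],  # Latin America
}

def languages_for(region, user_langs=()):
    """Small per-document language set; users can add more and re-run parsing."""
    base = REGION_LANGS.get(region, ["en"])   # fall back to English only
    return sorted(set(base).union(user_langs))
```

Because only the selected languages are loaded and run, a re-run with an enlarged set costs the same as a fresh parse with that set.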
&lt;p&gt;PaddleOCR output is line- or word-level text with bounding boxes; we use it as the single source of truth for the extraction step.&lt;/p&gt;
&lt;h2 id="why-ocr-output-is-good-enough"&gt;Why OCR output is good enough&lt;/h2&gt;
&lt;p&gt;OCR text often contains minor noise (e.g. &lt;code&gt;Date z&lt;/code&gt;, &lt;code&gt;Amount 3 %2841927&lt;/code&gt;). What matters for our pipeline is that &lt;strong&gt;semantic structure is preserved&lt;/strong&gt;: labels like &amp;ldquo;Receipt No&amp;rdquo;, &amp;ldquo;Amount&amp;rdquo;, &amp;ldquo;Date&amp;rdquo; appear in the text with values nearby. Under that condition:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Regex and pattern matching&lt;/strong&gt; reliably isolate fields.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A small text LLM&lt;/strong&gt; can correct or normalize the remaining noisy cases without ever seeing the image.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No vision model&lt;/strong&gt; is required in the default path, which keeps cost at ~$0.0004 per image (vs. ~$0.004 per image with a vision LLM).&lt;/li&gt;
&lt;/ul&gt;
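To make the &ldquo;semantic structure is preserved&rdquo; point concrete, here is a sketch of label-anchored extraction surviving the kind of noise quoted above; the patterns are illustrative, not the production rules:

```python
import re

# Noisy OCR text of the kind described above (labels intact, values nearby).
NOISY = "Receipt No 7741\nAmount 3 %2841927\nDate z 04-Jul-2025"

def find_after_label(text, label, value_pattern):
    """Return the first value matching value_pattern on the label's line."""
    for line in text.splitlines():
        if label.lower() in line.lower():
            m = re.search(value_pattern, line)
            if m:
                return m.group(0)
    return None

amount = find_after_label(NOISY, "Amount", r"\d[\d,]{3,}")        # long digit run
date = find_after_label(NOISY, "Date", r"\d{2}-[A-Za-z]{3}-\d{4}")  # dd-Mon-yyyy
```

The stray `3 %` before the amount does not matter: the regex anchors on the label's line and skips short digit fragments.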
&lt;h2 id="extraction-layer"&gt;Extraction layer&lt;/h2&gt;
&lt;p&gt;The extraction layer is &lt;strong&gt;rule-based&lt;/strong&gt; on the OCR text:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Patterns&lt;/strong&gt; — We use regex and simple heuristics to find receipt number, amount, date, merchant, etc., depending on document type.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Normalization&lt;/strong&gt; — Number formats (e.g. &lt;code&gt;28,41,927&lt;/code&gt; → &lt;code&gt;2841927&lt;/code&gt;), dates (e.g. &lt;code&gt;04-Jul-2025&lt;/code&gt; → ISO), and units are normalized in Python.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Confidence&lt;/strong&gt; — When a field is missing or low-confidence, we can route that document (or field) to an optional &lt;strong&gt;text-only LLM cleanup&lt;/strong&gt; step.&lt;/li&gt;
&lt;/ul&gt;
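A sketch of the normalization rules, using the two conversions named above; the real rules cover more locales and formats:

```python
from datetime import datetime

def normalize_amount(raw):
    """Drop grouping separators, e.g. Indian-style '28,41,927' to 2841927."""
    return int(raw.replace(",", ""))

def normalize_date(raw):
    """Convert '04-Jul-2025' to ISO 8601 '2025-07-04'."""
    return datetime.strptime(raw, "%d-%b-%Y").date().isoformat()
```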
&lt;p&gt;Most documents are fully extracted with rules only; the remainder use the small LLM for cleanup. End-to-end accuracy in our experiments is in the high 90s, comparable to a vision-LLM–based pipeline.&lt;/p&gt;
&lt;h2 id="cost-and-scale"&gt;Cost and scale&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Baseline (vision LLM per image):&lt;/strong&gt; ~$0.004 per image (averaged over 1,000+ images).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Our pipeline (OCR + rules + optional text LLM):&lt;/strong&gt; ~$0.0004 per image.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The order-of-magnitude cost reduction comes from removing the vision model from the hot path and using OCR + deterministic extraction + optional small LLM cleanup instead.&lt;/p&gt;
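A worked example of the order-of-magnitude claim, using the approximate per-image figures quoted above; the batch size is an arbitrary illustration:

```python
# Approximate per-image costs from the docs; 10,000 images is an example batch.
VISION_LLM_USD = 0.004    # vision-LLM baseline, per image
PIPELINE_USD = 0.0004     # OCR + rules + optional text LLM, per image

def batch_cost(n_images, per_image_usd):
    return n_images * per_image_usd

baseline = batch_cost(10_000, VISION_LLM_USD)  # about 40 USD
ours = batch_cost(10_000, PIPELINE_USD)        # about 4 USD
```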
&lt;h2 id="summary"&gt;Summary&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OCR engine&lt;/td&gt;
&lt;td&gt;PaddleOCR (geo-specific languages; re-run when user adds more)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extraction&lt;/td&gt;
&lt;td&gt;Regex and pattern matching on OCR text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cleanup&lt;/td&gt;
&lt;td&gt;Optional small text LLM for low-confidence cases&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vision model&lt;/td&gt;
&lt;td&gt;Not used in default path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost (approx.)&lt;/td&gt;
&lt;td&gt;~$0.0004 per image&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;Best in our eval (PaddleOCR); aligned with vision-LLM baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For more context and benchmarks, see the accompanying blog post.&lt;/p&gt;</description></item><item><title>Apps</title><link>https://www.vaultsafe.ai/en/docs/apps/</link><pubDate>Wed, 18 Feb 2026 00:00:00 +0000</pubDate><guid>https://www.vaultsafe.ai/en/docs/apps/</guid><description>&lt;h1 id="apps"&gt;Apps&lt;/h1&gt;
&lt;p&gt;VaultSafe&amp;rsquo;s app ecosystem represents a breakthrough in secure, AI-powered personal data processing. Our platform enables sophisticated applications that transform document metadata into structured, actionable intelligence—all while maintaining the highest standards of privacy and security.&lt;/p&gt;
&lt;h2 id="architecture-overview"&gt;Architecture overview&lt;/h2&gt;
&lt;p&gt;VaultSafe Apps operate on a sophisticated pipeline that combines advanced AI models with zero-trust security principles:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Document metadata extraction&lt;/strong&gt; — When files are uploaded, our AI analysis engine (powered by state-of-the-art computer vision and NLP models) extracts rich metadata: document classification, entity recognition (people, dates, locations), structured key-value pairs, and semantic descriptions. This metadata layer powers both search/chat and app processing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Agent prompt execution&lt;/strong&gt; — Each App defines an &lt;em&gt;agent prompt&lt;/em&gt;: sophisticated instructions that guide our AI models to extract app-specific structured data from the metadata. These prompts leverage advanced few-shot learning and structured output capabilities, enabling precise extraction without exposing raw file content.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Secure data storage&lt;/strong&gt; — Extracted data is stored in a per-user, per-app namespace within our zero-trust architecture. Each app&amp;rsquo;s data is isolated, encrypted, and accessible only to the user who owns it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Intelligent widget rendering&lt;/strong&gt; — Apps define &lt;em&gt;widgets&lt;/em&gt; that present extracted data through configurable UI components (e.g., sorted lists, timelines, dashboards). Widgets are dynamically rendered based on app-defined schemas and display preferences.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Pipeline:&lt;/strong&gt; &lt;code&gt;Document → AI Metadata Extraction → Agent Prompt → Structured Data → Secure Storage → Widget UI&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Security guarantee:&lt;/strong&gt; Apps never access raw files or user credentials. They operate exclusively on pre-extracted metadata within isolated execution environments.&lt;/p&gt;
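A sketch of this execution model, assuming an injected `run_llm` stand-in for the structured-output model call; all names are hypothetical, not the production API:

```python
import json

def run_app(app, user_id, metadata, run_llm):
    """Apps receive pre-extracted metadata only, never the raw file."""
    prompt = app["agent_prompt"] + "\n\nInput:\n" + json.dumps(metadata)
    extracted = run_llm(prompt)             # structured (JSON) model output
    return {                                # per-user, per-app namespace
        "user_id": user_id,
        "app_id": app["id"],
        "data": extracted,
    }
```

Note that the raw file never appears in the function signature: isolation is enforced by what the app is given, not by what it promises not to read.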
&lt;h2 id="app-types"&gt;App types&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Default apps&lt;/strong&gt; — Pre-installed applications (e.g., Birthdays) that run automatically on all processed documents. These apps are developed by VaultSafe&amp;rsquo;s research team and represent best practices in secure personal data processing.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Optional apps&lt;/strong&gt; — Additional apps available in our catalog that users can enable. Once enabled, they process new documents and provide dedicated widgets for their extracted data.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Users can add or edit app entries (e.g. birthdays, reminders) from the Apps page using simple forms—no technical setup or JSON required.&lt;/p&gt;
&lt;h2 id="data-schema-and-extensibility"&gt;Data schema and extensibility&lt;/h2&gt;
&lt;p&gt;Each App defines a &lt;strong&gt;schema&lt;/strong&gt; that specifies the structure of extracted data (e.g., &lt;code&gt;{ person_name: string, date: ISO8601, source: string }&lt;/code&gt;). Schemas enable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Type-safe data storage and retrieval&lt;/li&gt;
&lt;li&gt;Validation of extraction outputs&lt;/li&gt;
&lt;li&gt;Future extensibility for custom user-defined fields&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our schema system supports complex nested structures and is designed to evolve with our platform&amp;rsquo;s capabilities.&lt;/p&gt;
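A minimal sketch of validating extraction output against a flat schema like the one above; the type names and checker are illustrative, and the real schema system also handles nested structures:

```python
from datetime import date

def _is_iso(v):
    """True when v parses as an ISO 8601 calendar date."""
    try:
        date.fromisoformat(v)
        return True
    except ValueError:
        return False

# Hypothetical type names matching the example schema in the docs.
CHECKS = {
    "string": lambda v: isinstance(v, str),
    "ISO8601": lambda v: isinstance(v, str) and _is_iso(v),
}

def validate(record, schema):
    """True when every schema field is present and passes its type check."""
    return all(CHECKS[t](record.get(k)) for k, t in schema.items())
```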
&lt;h2 id="marketplace-and-third-party-development"&gt;Marketplace and third-party development&lt;/h2&gt;
&lt;p&gt;VaultSafe is building an &lt;strong&gt;open ecosystem&lt;/strong&gt; for personal data applications. Third-party developers can create apps that integrate seamlessly with our platform:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;App packages&lt;/strong&gt; — Self-contained bundles (folder or zip) containing app metadata, agent prompts, schemas, and widget definitions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified deployment&lt;/strong&gt; — Single-package format ensures consistent installation and execution across our infrastructure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Secure execution&lt;/strong&gt; — Apps run in isolated environments with strict access controls, ensuring user data privacy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For technical specifications, development guidelines, and marketplace submission details, see &lt;a href="https://www.vaultsafe.ai/en/developers/"&gt;Marketplace apps&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="technical-capabilities"&gt;Technical capabilities&lt;/h2&gt;
&lt;p&gt;Our app platform leverages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Advanced AI models&lt;/strong&gt; for document understanding and structured extraction&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero-trust security&lt;/strong&gt; with end-to-end encryption and granular permissions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalable infrastructure&lt;/strong&gt; supporting millions of documents and thousands of concurrent app executions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Research-driven design&lt;/strong&gt; informed by our team&amp;rsquo;s experience building AI systems for Asia&amp;rsquo;s largest consumer applications&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For technical support or partnership inquiries, contact &lt;a href="mailto:support@vaultsafe.ai"&gt;support@vaultsafe.ai&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Getting started</title><link>https://www.vaultsafe.ai/en/docs/getting-started/</link><pubDate>Wed, 18 Feb 2026 00:00:00 +0000</pubDate><guid>https://www.vaultsafe.ai/en/docs/getting-started/</guid><description>&lt;h1 id="getting-started"&gt;Getting started&lt;/h1&gt;
&lt;p&gt;VaultSafe is your AI document vault: upload PDFs and images; &lt;strong&gt;OCR and parsing&lt;/strong&gt; make everything searchable. &lt;strong&gt;Chat&lt;/strong&gt; in plain English, &lt;strong&gt;fill PDF forms&lt;/strong&gt;, merge or compress PDFs, and use &lt;strong&gt;smart apps&lt;/strong&gt; that auto-extract birthdays, reminders, and relationships. Zero-trust security.&lt;/p&gt;
&lt;h2 id="quick-start"&gt;Quick start&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Sign up&lt;/strong&gt; at &lt;a href="https://app.vaultsafe.ai" target="_blank" rel="noopener"&gt;app.vaultsafe.ai&lt;/a&gt; (Google or email).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Upload files&lt;/strong&gt; (PDFs, images, receipts, IDs). OCR runs in 9 languages; documents are indexed for search and chat. Attach to a chat or add to My Files.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;My Files&lt;/strong&gt; — Browse, filter by person and type, and see processing status.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chats &amp;amp; QnA&lt;/strong&gt; — Ask in plain English, attach files, get answers and download links; use PDF tools from the conversation.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enable Apps&lt;/strong&gt; from the catalog (e.g. Birthdays, Reminders). See &lt;a href="apps/"&gt;Apps&lt;/a&gt; and &lt;a href="document-parsing-pipeline/"&gt;Document parsing pipeline&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2 id="what-you-get"&gt;What you get&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OCR &amp;amp; parsing&lt;/strong&gt; — 9 languages; receipts, invoices, IDs → searchable metadata. &lt;a href="document-parsing-pipeline/"&gt;How it works&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chat&lt;/strong&gt; — Ask e.g. &amp;ldquo;when does my insurance expire?&amp;rdquo; or &amp;ldquo;fill this PDF with Peter&amp;rsquo;s details.&amp;rdquo; Download links for filled/merged PDFs.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PDF tools&lt;/strong&gt; — Fill forms, merge, compress, place on A4—from chat.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Smart apps&lt;/strong&gt; — Birthdays, relationships, reminders auto-extracted; filter by person and type.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Privacy&lt;/strong&gt; — Your data private, encrypted, under your control.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For technical support or integration assistance, contact &lt;a href="mailto:support@vaultsafe.ai"&gt;support@vaultsafe.ai&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>App Developer Guide</title><link>https://www.vaultsafe.ai/en/docs/app-developer-guide/</link><pubDate>Sat, 08 Mar 2025 00:00:00 +0000</pubDate><guid>https://www.vaultsafe.ai/en/docs/app-developer-guide/</guid><description>&lt;h1 id="app-developer-guide"&gt;App Developer Guide&lt;/h1&gt;
&lt;p&gt;Build apps that extract structured data from documents and present it through widgets. All app definitions live in the Apps table (Postgres)—no code in the repo.&lt;/p&gt;
&lt;h2 id="how-it-works"&gt;How It Works&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;File upload&lt;/strong&gt; → VaultSafe analyzes it (document type, description, entities, key-value content).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Your app&amp;rsquo;s agent prompt runs&lt;/strong&gt; on that metadata → extracts structured data.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data is stored&lt;/strong&gt; in &lt;code&gt;user_app&lt;/code&gt; (user_id + app_id + data).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Widget renders&lt;/strong&gt; the data (list, table, or card view).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Security:&lt;/strong&gt; Apps never see raw files. They only receive pre-extracted metadata.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="creating-an-app"&gt;Creating an App&lt;/h2&gt;
&lt;h3 id="1-define-your-schema"&gt;1. Define Your Schema&lt;/h3&gt;
&lt;p&gt;Example (Birthdays):&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-json" data-lang="json"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;extraction&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="nt"&gt;&amp;#34;birthdays&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;person_name&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;string&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;date&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;string&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;source&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;string&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nt"&gt;&amp;#34;file_id&amp;#34;&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;&amp;#34;string&amp;#34;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="2-write-the-agent-prompt"&gt;2. Write the Agent Prompt&lt;/h3&gt;
&lt;p&gt;Input JSON: &lt;code&gt;suggested_file_name&lt;/code&gt;, &lt;code&gt;type_of_file&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;main_person&lt;/code&gt;, &lt;code&gt;other_persons&lt;/code&gt;, &lt;code&gt;full_content&lt;/code&gt;, &lt;code&gt;file_id&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Birthday example:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre tabindex="0" class="chroma"&gt;&lt;code class="language-fallback" data-lang="fallback"&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;You extract birthday-related information from document metadata. Only use information explicitly present in the input; do not infer or guess.
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Input is a JSON object with: suggested_file_name, type_of_file, description, main_person, other_persons, full_content (key-value from document).
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;Return a JSON object with:
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;- &amp;#34;birthdays&amp;#34;: list of { &amp;#34;person_name&amp;#34;: string, &amp;#34;date&amp;#34;: string (YYYY-MM-DD or partial like &amp;#34;15 March&amp;#34;), &amp;#34;source&amp;#34;: string, &amp;#34;file_id&amp;#34;: string (pass through from input) }
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class="line"&gt;&lt;span class="cl"&gt;If no birthday/date of birth is explicitly present, return { &amp;#34;birthdays&amp;#34;: [] }. Do not fabricate dates.
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id="3-configure-the-widget"&gt;3. Configure the Widget&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;type&lt;/td&gt;
&lt;td&gt;&lt;code&gt;list_by_date&lt;/code&gt;, &lt;code&gt;table&lt;/code&gt;, or &lt;code&gt;card&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;list_key&lt;/td&gt;
&lt;td&gt;Key in extracted JSON (e.g. &lt;code&gt;birthdays&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sort_field&lt;/td&gt;
&lt;td&gt;Field to sort by (e.g. &lt;code&gt;date&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;display_fields&lt;/td&gt;
&lt;td&gt;Fields to show&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;title&lt;/td&gt;
&lt;td&gt;Widget title&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;empty_message&lt;/td&gt;
&lt;td&gt;Message when no data&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
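How these fields could drive rendering, as a sketch; the config values match the Birthdays example and the renderer is illustrative, not the production widget engine:

```python
# Example widget config using the fields from the table above.
config = {
    "type": "list_by_date",
    "list_key": "birthdays",
    "sort_field": "date",
    "display_fields": ["person_name", "date"],
    "title": "Birthdays",
    "empty_message": "No birthdays found yet.",
}

def render(config, extracted):
    """Return display rows for a list_by_date widget (illustrative)."""
    items = extracted.get(config["list_key"], [])
    if not items:
        return [config["empty_message"]]
    items = sorted(items, key=lambda it: it[config["sort_field"]])
    return [" | ".join(str(it[f]) for f in config["display_fields"])
            for it in items]
```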
&lt;h3 id="4-publish-via-admin"&gt;4. Publish via Admin&lt;/h3&gt;
&lt;p&gt;Use the VaultSafe Admin app (local only). Create or edit the app, set &lt;strong&gt;status&lt;/strong&gt; (&lt;code&gt;enable_for_all&lt;/code&gt; or &lt;code&gt;enabled_by_user&lt;/code&gt;), and save.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="widget-types"&gt;Widget Types&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;list_by_date&lt;/td&gt;
&lt;td&gt;Sorted list, ideal for dates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;table&lt;/td&gt;
&lt;td&gt;Tabular view&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;card&lt;/td&gt;
&lt;td&gt;Card layout&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr&gt;
&lt;h2 id="app-status"&gt;App Status&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;enable_for_all&lt;/strong&gt;: Auto-enabled for all users. Backfill runs on publish.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;enabled_by_user&lt;/strong&gt;: Marketplace only. Users enable explicitly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Users can disable any app at any time.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="user-data"&gt;User Data&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;user_app.data&lt;/strong&gt;: List of &lt;code&gt;{ file_id, extracted, updated_at }&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Users can add or edit items in the Apps UI via simple forms (e.g. birthdays, reminders)—no JSON or technical setup required.&lt;/li&gt;
&lt;li&gt;Updates sync to chat context for AI-assisted Q&amp;amp;A.&lt;/li&gt;
&lt;/ul&gt;
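A sketch of upserting one file's extraction into that list; the record shape follows the `{ file_id, extracted, updated_at }` structure described above, and the function name is hypothetical:

```python
from datetime import datetime, timezone

def upsert_extraction(data, file_id, extracted):
    """Replace the entry for file_id if present, else append a new one."""
    now = datetime.now(timezone.utc).isoformat()
    entry = {"file_id": file_id, "extracted": extracted, "updated_at": now}
    rest = [e for e in data if e["file_id"] != file_id]
    return rest + [entry]
```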
&lt;hr&gt;
&lt;h2 id="full-example"&gt;Full Example&lt;/h2&gt;
&lt;p&gt;The schema, agent prompt, and widget configuration above together make up a complete Birthdays app example.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id="support"&gt;Support&lt;/h2&gt;
&lt;p&gt;For questions about building or publishing apps, contact &lt;a href="mailto:support@vaultsafe.ai"&gt;support@vaultsafe.ai&lt;/a&gt;.&lt;/p&gt;</description></item></channel></rss>