PDF Processing for Complex Documents and Why It Matters for RAG
Working with PDFs sounds easy until you try to use them as a reliable knowledge source in RAG systems.
Installation guides, technical manuals, compliance documents, and product specs are nothing like clean text files. They include tables split across pages, diagrams with captions, mixed languages, warnings buried in paragraphs, and layouts that confuse most extraction tools.
When these documents are processed poorly, retrieval may still work, but the answers won’t. Context is lost, tables break, and the model responds with partial or misleading information.
In this blog, I’ll walk through how we approached PDF processing for complex technical documents, what typically goes wrong with traditional methods, and how we built a more reliable extraction pipeline that works well for search, Q&A, and RAG-based systems.
Why Traditional PDF Processors Fall Short
We initially relied on standard PDF processing tools, the kind that extract text using layout heuristics or OCR-based pipelines.
They work reasonably well for:
- Simple documents
- Linear text
- Clean layouts
But with real-world technical PDFs, we repeatedly saw issues like:
- Tables flattened into unreadable text
- Page boundaries ignored
- Headings detached from their content
- Figures and captions skipped entirely
- Silent failures with no indication of missing data
These tools focus on text extraction, not on understanding document structure. For RAG systems, that difference matters a lot.
We needed a processor that could reason about pages, sections, and visual layout, not just read characters off a page.
Switching to LLM-Based PDF Processing
To handle this, we moved from traditional PDF processors to LLM-based PDF extraction using OpenRouter’s PDF processing API.
This shift gave us a few important advantages:
- Better understanding of document layout
- Improved table and section detection
- More consistent handling of mixed content (text, tables, warnings, diagrams)
- The ability to control how extraction happens through prompts
Instead of treating PDFs as raw text sources, we could now ask the model to extract structured content explicitly, page by page.
This didn’t solve everything automatically, but it gave us the flexibility we needed to build a reliable pipeline on top.
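To make this concrete, here is a minimal sketch of what a single extraction call looks like. It assumes OpenRouter's documented PDF file-attachment format (the PDF is sent as a base64 data URL inside a `file` content part); the model slug and prompt are placeholders, and the exact field names should be checked against the current API reference rather than taken from this sketch.

```python
import base64
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = "sk-or-..."  # placeholder; use your own OpenRouter key

def extract_pages(pdf_path: str, prompt: str,
                  model: str = "anthropic/claude-3.5-sonnet") -> str:
    """Send a PDF plus an extraction prompt to an LLM via OpenRouter and return the raw reply."""
    with open(pdf_path, "rb") as f:
        pdf_b64 = base64.b64encode(f.read()).decode()

    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # PDF attached as a base64 data URL, following OpenRouter's PDF input format
                {"type": "file",
                 "file": {"filename": pdf_path,
                          "file_data": f"data:application/pdf;base64,{pdf_b64}"}},
            ],
        }],
    }
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json=payload,
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```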
What We Needed from PDF Extraction
Our goal was not just to read PDFs, but to turn them into dependable inputs for retrieval-augmented generation (RAG).
That meant the output had to:
- Preserve page numbers for traceability
- Keep tables and figures identifiable
- Maintain exact wording (no summarization)
- Handle missing or partial extractions
- Work reliably across large PDFs (20–100+ pages)
Most importantly, we needed structured output, not just raw text.
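Concretely, each extracted page ends up looking roughly like the record below. The field names are ours, not any standard, so treat this as a sketch of the shape rather than a spec.

```python
from dataclasses import dataclass, field

@dataclass
class PageRecord:
    """One page of extracted content plus the metadata we rely on downstream."""
    page_number: int
    text: str                                            # verbatim main text, no summarization
    tables: list[dict] = field(default_factory=list)     # structured rows plus table title
    figures: list[dict] = field(default_factory=list)    # figure IDs and captions on this page
    warnings: list[str] = field(default_factory=list)    # safety notes kept separate from body text
    has_tables: bool = False
    coverage_notes: str = ""                             # anything flagged as partial or unreadable
```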
Step 1: Page-Level Indexing First
Instead of extracting everything in one pass, we start with an index pass.
This first step scans the document and builds:
- Total page count
- Headings and their page numbers
- Figures and captions
- Tables and titles
Even if the model is unsure, we force it to return a best-effort index. This gives us a roadmap of what should exist in the document before extraction begins.
This step alone helps detect:
- Missing pages
- Skipped figures
- Tables that didn’t extract correctly
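Here is roughly what the index pass looks like in code, building on the `extract_pages` helper sketched earlier. The prompt wording and JSON keys are illustrative; the important part is forcing a best-effort JSON index back.

```python
import json

INDEX_PROMPT = """Scan the attached PDF and return ONLY a JSON object with:
- "total_pages": integer
- "headings": list of {"title", "page"}
- "figures": list of {"id", "caption", "page"}
- "tables": list of {"title", "page"}
If you are unsure about an item, include it anyway with your best guess.
Do not summarize or omit entries."""

def build_index(pdf_path: str) -> dict:
    """First pass: ask the model for a best-effort map of the document."""
    raw = extract_pages(pdf_path, INDEX_PROMPT)  # helper from the earlier sketch
    # The model is told to return bare JSON; strip code fences defensively anyway.
    raw = raw.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(raw)
```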
Step 2: Controlled Page-by-Page Extraction
Rather than extracting the whole PDF at once, we extract small page ranges (for example, 2–3 pages at a time).
For each page, we extract:
- Main text (verbatim)
- Tables
- Specifications
- Diagrams
- Warnings
- Procedures
- Part numbers
- Figures and tables found on that page
Each page becomes its own structured object with metadata like:
- Page number
- Whether it contains tables or warnings
- Which figures appear on it
- Extraction coverage details
This makes the output predictable and debuggable.
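A simplified version of that loop, continuing the earlier sketches. The prompt and JSON keys mirror the page shape shown above; in practice we also harden the prompt considerably.

```python
EXTRACT_PROMPT = """Extract pages {start}-{end} of the attached PDF verbatim.
Return ONLY a JSON list, one object per page, with the keys:
page_number, text, tables, figures, warnings, has_tables, coverage_notes.
Keep the original wording exactly. Do not summarize."""

def extract_range(pdf_path: str, start: int, end: int) -> list[dict]:
    """Extract a small page range into structured per-page objects."""
    raw = extract_pages(pdf_path, EXTRACT_PROMPT.format(start=start, end=end))
    raw = raw.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(raw)

def extract_document(pdf_path: str, total_pages: int, step: int = 3) -> list[dict]:
    """Walk the document in small windows (2-3 pages) instead of one big pass."""
    pages: list[dict] = []
    for start in range(1, total_pages + 1, step):
        end = min(start + step - 1, total_pages)
        pages.extend(extract_range(pdf_path, start, end))
    return pages
```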
Step 3: Continuation Handling (The Hidden Problem)
Large PDFs often push the model past its output token limit, even when extracting only a few pages at a time.
Instead of failing or truncating output, we added continuation support:
- If the model can’t finish, it explicitly tells us
- We pass a continuation hint back
- Extraction resumes exactly where it stopped
This avoids silent data loss, which is one of the most dangerous issues in document pipelines.
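Here is a rough sketch of that resume loop. The `continuation` marker is a convention we prompt for ourselves, not an API feature, so the exact mechanism is up to you.

```python
def extract_range_resumable(pdf_path: str, start: int, end: int,
                            max_rounds: int = 5) -> list[dict]:
    """Extract a page range, resuming whenever the model reports it ran out of room."""
    pages: list[dict] = []
    hint = None
    for _ in range(max_rounds):
        prompt = EXTRACT_PROMPT.format(start=start, end=end)
        # The prompt (not shown here) also asks the model to append a final
        # {"continuation": "<last words emitted>"} object if it cannot finish the range.
        if hint:
            prompt += f"\nResume exactly after this point and do not repeat it: {hint!r}"
        raw = extract_pages(pdf_path, prompt)
        chunk = json.loads(raw.strip().removeprefix("```json").removesuffix("```").strip())
        if chunk and "continuation" in chunk[-1]:
            hint = chunk[-1]["continuation"]   # model stopped early; remember where
            pages.extend(chunk[:-1])
        else:
            pages.extend(chunk)                # finished cleanly
            return pages
    return pages  # hit the round limit; caller should flag this range for review
```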
Step 4: Detecting and Re-Extracting Missing Content
Even with careful extraction, things can still be missed:
- A table title is indexed but never extracted
- A figure is referenced but not found
- A page returns empty content
To solve this, we calculate coverage metrics:
- Which pages are missing
- Which figures weren’t extracted
- Which tables don’t appear in the final output
We then selectively re-ask only the problematic pages, instead of reprocessing the entire document. This keeps the system efficient and reliable.
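In code, this is just a diff between what the index pass promised and what extraction actually returned, followed by targeted re-extraction. A rough sketch, reusing the earlier helpers:

```python
def coverage_report(index: dict, pages: list[dict]) -> dict:
    """Compare the index pass against the extracted pages and list what is missing."""
    extracted_pages = {p["page_number"] for p in pages if p.get("text")}
    expected_pages = set(range(1, index["total_pages"] + 1))

    extracted_tables = {t.get("title") for p in pages for t in p.get("tables", [])}
    extracted_figures = {f.get("id") for p in pages for f in p.get("figures", [])}

    return {
        "missing_pages": sorted(expected_pages - extracted_pages),
        "missing_tables": [t["title"] for t in index.get("tables", [])
                           if t["title"] not in extracted_tables],
        "missing_figures": [f["id"] for f in index.get("figures", [])
                            if f["id"] not in extracted_figures],
    }

# Re-ask only the pages that came back incomplete, not the whole document.
# report = coverage_report(index, pages)
# for page_no in report["missing_pages"]:
#     pages.extend(extract_range_resumable(pdf_path, page_no, page_no))
```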
Step 5: Preparing the Output for Vector Databases
Raw page-level data is useful, but not ideal for retrieval.
Before storing data in Pinecone (or any vector database), we:
- Split pages into smaller semantic chunks
- Separate tables into their own chunks
- Preserve section headings and hierarchy
- Attach rich metadata (page, section, table title, document ID)
This allows:
- More accurate retrieval
- Cleaner answers
- Easier source citations
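A condensed sketch of that preparation step, using the Pinecone Python client. `embed()` stands in for whichever embedding model you use, and the naive character split is where you would plug in proper semantic chunking.

```python
from pinecone import Pinecone

def to_chunks(pages: list[dict], doc_id: str, max_chars: int = 1500) -> list[dict]:
    """Split page text into smaller chunks, give tables their own chunks,
    and carry page/section/document metadata on every chunk."""
    chunks = []
    for page in pages:
        text = page.get("text", "")
        for i in range(0, len(text), max_chars):   # naive split; swap in semantic chunking
            chunks.append({
                "id": f"{doc_id}-p{page['page_number']}-t{i // max_chars}",
                "text": text[i:i + max_chars],
                "metadata": {"doc_id": doc_id, "page": page["page_number"], "kind": "text"},
            })
        for j, table in enumerate(page.get("tables", [])):
            chunks.append({
                "id": f"{doc_id}-p{page['page_number']}-table{j}",
                "text": str(table),                # or render the table to text/markdown
                "metadata": {"doc_id": doc_id, "page": page["page_number"],
                             "kind": "table", "table_title": table.get("title", "")},
            })
    return chunks

# embed() is a placeholder for your embedding model of choice.
pc = Pinecone(api_key="PINECONE_API_KEY")
index = pc.Index("technical-docs")
index.upsert(vectors=[
    {"id": c["id"], "values": embed(c["text"]),
     "metadata": {**c["metadata"], "text": c["text"]}}
    for c in to_chunks(pages, doc_id="manual-001")
])
```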
Final Thoughts
RAG systems don’t fail because models are weak; they fail because documents are messy.
Switching to LLM-based PDF processing gave us the control and structure that traditional tools lacked. Combined with indexing, validation, retries, and semantic chunking, it allowed us to turn complex PDFs into dependable knowledge sources.
Once that foundation is solid, everything built on top of it becomes far more reliable.
