Multimodal RAG

Standard RAG assumes your knowledge base is text. But in most organizations, much of the valuable information lives in PDFs with tables, slide decks with diagrams, photos of whiteboards, and scanned documents. Multimodal RAG bridges that gap, and the engineering is meaningfully different from text-only retrieval.

Why Text RAG Breaks on Documents

Take a typical enterprise PDF: it has headers, footers, tables, charts, diagrams, multi-column layouts, and images with captions. A text-only RAG pipeline:

Extracts text with a PDF parser (loses layout, mangles tables).
Chunks the text (splits tables mid-row, separates captions from images).
Embeds the chunks (the embedding has no idea there was a chart).
Retrieves and answers (the model hallucinates the missing visual context).

The result: wrong answers with high confidence. Multimodal RAG fixes this by keeping the visual representation in the loop.
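
To make the chunking failure concrete, here is a minimal sketch. The extracted text, the table values, and the 40-character chunk size are all invented for illustration; real parsers and chunkers differ, but the failure mode is the same.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive chunker: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical text as a PDF parser might emit it: the table grid is gone
# and each row is just a line of space-separated values.
extracted = (
    "Quarterly results\n"
    "Region Q1 Q2 Q3\n"
    "EMEA 1.2 1.4 1.9\n"
    "APAC 0.8 1.1 1.3\n"
)

chunks = fixed_size_chunks(extracted, size=40)
for n, chunk in enumerate(chunks):
    print(n, repr(chunk))
```

The first chunk ends in the middle of the EMEA row, and the APAC row lands in a chunk that no longer contains the header, so nothing downstream can tell which column is which.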

Architecture Patterns

Three main approaches, from simplest to most capable:

1. Vision Model as Parser

Use a vision-language model to convert document pages into rich text descriptions, then run standard text RAG on those descriptions.

Render each PDF page as an image.
Send to GPT-4o / Claude / Gemini with instructions to describe all content including tables, charts, and layout.
Store the descriptions as text chunks with metadata linking back to the source page.
Retrieve and answer using a text LLM.

Pros: uses your existing text RAG stack. Cons: the description is lossy — subtle visual details get dropped, and you pay vision model costs at indexing time.
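
The indexing side of this pattern can be sketched as follows. This is an illustration under assumptions, not a reference implementation: `PageChunk`, `index_pages`, and the prompt text are hypothetical names, and `describe` is a placeholder for whichever vision-model client you use (a fake one is supplied so the sketch runs offline).

```python
from dataclasses import dataclass
from typing import Callable

# Assumed prompt wording; tune it for your documents.
PROMPT = (
    "Describe everything on this page: headings, body text, tables "
    "(row by row), charts (axes, series, trends), and layout."
)


@dataclass
class PageChunk:
    doc_id: str
    page: int   # 1-based page number, links the chunk back to its source page
    text: str   # model-written description, indexed like any other text chunk


def index_pages(
    doc_id: str,
    page_images: list[bytes],
    describe: Callable[[bytes, str], str],
) -> list[PageChunk]:
    """Turn rendered page images into text chunks via a vision model."""
    return [
        PageChunk(doc_id=doc_id, page=n, text=describe(img, PROMPT))
        for n, img in enumerate(page_images, start=1)
    ]


def fake_describe(image: bytes, prompt: str) -> str:
    # Stand-in for a real vision-model call; returns a canned description.
    return f"Page image of {len(image)} bytes: (description would go here)"


chunks = index_pages("report-2024", [b"\x89PNG...", b"\x89PNG......"], fake_describe)
```

Passing the model call in as a function keeps the indexing logic independent of any one provider, and the `doc_id` and `page` fields are what let you show the original page next to the answer.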

2. Multimodal Embeddings

Embed images and text into the same vector space, then retrieve across modalities.

CLIP — the original image-text embedding model. Good for natural images, weaker on documents and text-heavy content.
SigLIP — improved CLIP variant with better training objectives. Stronger on fine-grained visual details.
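Nomic Embed Vision: a vision encoder aligned to the same embedding space as Nomic Embed Text, so text chunks and images can be searched together. Test it against CLIP/SigLIP on your own text-heavy pages before committing.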

Pipeline: embed document page images with a multimodal embedding model, embed the query as text, retrieve the most relevant pages, send those pages (as images) to a vision-language model for answer generation.

Pros: preserves full visual information, retrieval is fast. Cons: multimodal embeddings are still weaker than text embeddings for text-heavy content.
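
Once pages and queries live in the same space, the retrieval step is just nearest-neighbor search. A minimal sketch, assuming the embeddings already exist: the 3-dimensional vectors and page IDs below are toy values standing in for real model outputs, and a real system would use a vector database rather than a linear scan.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec: list[float], page_vecs: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Score every page against the query and return the k best, best first."""
    scored = [(page_id, cosine(query_vec, vec)) for page_id, vec in page_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy page-image embeddings (hypothetical values).
page_vecs = {
    "deck.pdf#p4": [0.9, 0.1, 0.0],
    "deck.pdf#p7": [0.1, 0.9, 0.2],
    "memo.pdf#p1": [0.0, 0.2, 0.9],
}
# Toy embedding of the text query, produced by the same model's text tower.
query_vec = [0.8, 0.2, 0.1]

hits = top_k(query_vec, page_vecs, k=2)
```

The IDs in `hits` are what you resolve back to page images before handing them to the vision-language model.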

3. ColPali and Late-Interaction Visual Retrieval

ColPali is a late-interaction retriever built for document pages. Instead of embedding each page as a single vector, it produces a grid of patch embeddings, one per image patch, and uses late interaction (like ColBERT) to match query tokens against them: each query token is scored against every patch, the best match per token is kept, and those maxima are summed into the page score.

Why this matters:

Layout-aware retrieval. The model attends to specific regions of the page, not a single averaged representation.
No OCR needed. Retrieval works directly on page images.
Strong on tables and charts. Because patches preserve spatial structure.
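
The late-interaction score itself is small enough to write out. The sketch below implements the ColBERT-style MaxSim sum on toy 2-dimensional vectors (invented values, not real model outputs) and contrasts it with a single averaged vector per page.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: list[list[float]], page_patches: list[list[float]]) -> float:
    """Late interaction: each query token picks its best patch; sum the maxima."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def mean_vector(vectors: list[list[float]]) -> list[float]:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def pooled_score(query_tokens: list[list[float]], page_patches: list[list[float]]) -> float:
    """Single-vector baseline: average both sides, then take one dot product."""
    return dot(mean_vector(query_tokens), mean_vector(page_patches))


query = [[1.0, 0.0], [0.0, 1.0]]                 # two query tokens
page_a = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.2]]    # one patch matches each token
page_b = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]    # every patch is a vague match

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
pooled_a = pooled_score(query, page_a)
pooled_b = pooled_score(query, page_b)
```

With these toy values MaxSim ranks `page_a` first because it has one strongly matching patch per query token, while the averaged representation prefers `page_b`. That is the region-level matching the list above refers to.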

Current implementations: Vespa has native ColPali support, Qdrant supports multi-vector retrieval, and several open-source libraries wrap the ColPali model.

This is the direction document retrieval is heading. If you're building a new document RAG system, evaluate ColPali before defaulting to text chunking.

PDF Understanding Pipelines

A production-grade PDF RAG pipeline:

Ingest — render pages as images at 150-300 DPI. Store both the images and the raw PDF.
Classify — what type of content is on each page? (text-heavy, table, chart, diagram, mixed)
Extract — for text-heavy pages, extract text. For visual pages, keep the image. For mixed pages, do both.
Chunk intelligently — respect page boundaries. Don't split tables. Keep chart + caption together. Use layout detection (LayoutParser, Docling) to identify semantic regions.
Embed — text chunks with a text embedder, visual chunks with a multimodal embedder (or ColPali patches for all pages).
Index — store in a vector database with rich metadata (page number, document ID, content type, extraction confidence).
Retrieve — hybrid retrieval: text search + visual search + metadata filtering.
Generate — send retrieved chunks (both text and images) to a multimodal LLM for answer generation.
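
The Classify and Extract steps can be sketched as a small routing function. Everything here is an assumption made for illustration: the `PageStats` fields, the thresholds, and the reduction to three classes (text, visual, mixed) instead of the finer-grained types listed above. In practice the inputs would come from your parser and layout detector.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    text_chars: int      # characters recovered by the text extractor
    table_count: int     # tables found by a layout detector
    figure_area: float   # fraction of the page covered by figures, 0..1


def classify(stats: PageStats) -> str:
    """Heuristic page classifier; thresholds are illustrative, not tuned."""
    has_text = stats.text_chars >= 500
    has_visual = stats.table_count > 0 or stats.figure_area >= 0.25
    if has_text and has_visual:
        return "mixed"
    if has_text:
        return "text"
    # Little extractable text (e.g. a scan or a full-page chart): keep the image.
    return "visual"


def extraction_plan(kind: str) -> set[str]:
    """Map a page class to the artifacts to keep, following the Extract step."""
    return {
        "text": {"text"},
        "visual": {"image"},
        "mixed": {"text", "image"},
    }[kind]


plan = extraction_plan(classify(PageStats(text_chars=1200, table_count=1, figure_area=0.1)))
```

Defaulting to "visual" when little text is recovered means scanned pages fall through to the image path instead of being indexed as near-empty text chunks.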

Tools worth evaluating: Docling (IBM, open-source document parser), Unstructured.io, LlamaParse, Amazon Textract, Azure Document Intelligence.

Cross-Modal Retrieval

Sometimes the query is in one modality and the answer is in another:

"Find the chart that shows Q3 revenue trends" — text query, image answer.
"What does this diagram mean?" — image query, text answer.
"Find slides similar to this one" — image query, image answer.

This requires a model that scores text and images in a shared space: CLIP or SigLIP for any direction, ColPali for text queries against page images. The key engineering challenge is calibrating relevance scores across modalities. A text-to-text similarity of 0.85 doesn't mean the same thing as a text-to-image similarity of 0.85.
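
One common way to sidestep the calibration problem is to merge result lists by rank instead of by raw score. The sketch below uses reciprocal rank fusion (RRF), which is a swapped-in technique and not something this article prescribes; the result IDs are made up, and `k=60` is the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists using ranks only, so raw scores are never compared."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            fused[item_id] = fused.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


text_hits = ["chunk-12", "page-3", "chunk-07"]   # best first, from text search
image_hits = ["page-3", "page-9", "chunk-12"]    # best first, from visual search

fused = reciprocal_rank_fusion([text_hits, image_hits])
fused_ids = [item_id for item_id, _ in fused]
```

Items that rank well in both lists rise to the top, regardless of how each retriever scales its similarities. If you need calibrated scores rather than just an ordering, you would instead normalize per modality against a labeled sample.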

Practical tip: re-rank with a multimodal model. Retrieve a broad candidate set using embeddings, then re-rank by sending each candidate (text or image) alongside the query to a vision-language model and asking "is this relevant?"
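
The two-stage shape of that tip looks roughly like this. `rerank` and `toy_judge` are hypothetical names; the judge here is a word-overlap stand-in so the sketch runs offline, where a real system would call a vision-language model and parse its relevance verdict.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    judge: Callable[[str, str], float],
    keep: int,
) -> list[str]:
    """Second stage: score every candidate with a slower judge, keep the best."""
    scored = [(judge(query, cand), cand) for cand in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored[:keep]]


def toy_judge(query: str, candidate: str) -> float:
    # Stand-in for a vision-language model call: the fraction of query words
    # that appear in the candidate's description.
    words = query.lower().split()
    return sum(w in candidate.lower() for w in words) / len(words)


# Broad candidate set from the embedding stage (descriptions are invented).
candidates = [
    "table of headcount by office",
    "line chart of q3 revenue by month",
    "photo of the q3 offsite",
]
best = rerank("q3 revenue chart", candidates, toy_judge, keep=2)
```

Because the judge runs once per candidate, the size of the first-stage candidate set directly controls re-ranking cost and latency.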

What Works Today

Document parsing with vision models is production-ready and significantly better than OCR + rules.
ColPali-style retrieval is the most promising direction for document search and is usable today with some engineering effort.
CLIP/SigLIP embeddings work well for natural image retrieval but underperform on text-heavy documents.
Cross-modal retrieval works but needs careful calibration and usually a re-ranking step.

What's Still Hard

Table extraction from complex, merged-cell tables remains error-prone even with vision models.
Diagram understanding: models partially understand flowcharts, architecture diagrams, and technical drawings, but not reliably.
Scale: multimodal indexing can cost roughly 10-100x more than text indexing, since every page passes through a vision model and multi-vector indexes store many vectors per page. Budget accordingly.
Evaluation — there's no standard benchmark for multimodal RAG quality. You'll need to build your own eval set from real documents.
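
A home-grown eval set can start very small. The sketch below computes recall@k for retrieval, assuming you have hand-labeled which pages contain the answer to each question; the questions, page IDs, and retriever output are all invented for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant pages that appear in the top-k results."""
    if not relevant:
        raise ValueError("each eval item needs at least one relevant page")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Hypothetical eval set: question -> pages a human marked as containing the answer.
eval_set = {
    "What was Q3 revenue in EMEA?": {"report.pdf#p12"},
    "Which services call the billing API?": {"arch.pdf#p3", "arch.pdf#p4"},
}
# Hypothetical retriever output for the same questions, best first.
runs = {
    "What was Q3 revenue in EMEA?": ["report.pdf#p12", "report.pdf#p2", "deck.pdf#p7"],
    "Which services call the billing API?": ["arch.pdf#p4", "deck.pdf#p1", "arch.pdf#p9"],
}

scores = [recall_at_k(runs[q], eval_set[q], k=3) for q in eval_set]
mean_recall = sum(scores) / len(scores)
```

Recall@k only measures retrieval. Answer quality needs its own check, but tracking retrieval separately tells you whether a wrong answer came from missing pages or from the generation step.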
