Steven's Knowledge

Vision Models

Image understanding at production scale — from GPT-4V to open-source alternatives

Vision-language models are the most mature multimodal capability and the one most likely to be in your production stack right now. The gap between "can see an image" and "reliably extracts the right data from a messy scan" is where the real engineering lives.

The Landscape

The frontier vision-language models as of early 2026:

  • GPT-4o / GPT-4V — strong general vision, excellent at following complex instructions about images. The default starting point for most teams.
  • Claude (Sonnet / Opus) — particularly strong at document understanding, chart reading, and structured extraction. Tends to be more careful about hallucinating visual details.
  • Gemini Pro / Ultra — native multimodal from training, handles very long image sequences well. Best-in-class for multi-image reasoning.
  • Open-source (LLaVA, InternVL, Qwen-VL) — competitive on benchmarks, self-hostable, fine-tunable. InternVL 2.5 and Qwen2-VL are the current open-source leaders.

No single model wins everything. Claude is often best for document parsing, Gemini for multi-image tasks, GPT-4o for general instruction following. Benchmark the models on your actual data before committing.
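
One way to act on that advice is a small field-level accuracy harness. The sketch below assumes you already have a labeled set of documents and one extraction function per candidate model. The names `score_models`, `model_a`, `model_b`, and the sample fields are hypothetical, and the stand-in extractors return canned values rather than calling any real API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: an extractor maps image bytes to a dict of field values.
Extractor = Callable[[bytes], Dict[str, str]]
LabeledDoc = Tuple[bytes, Dict[str, str]]  # (image bytes, expected fields)


def score_models(extractors: Dict[str, Extractor],
                 dataset: List[LabeledDoc]) -> Dict[str, Dict[str, float]]:
    """Return per-model, per-field exact-match accuracy."""
    results: Dict[str, Dict[str, float]] = {}
    for name, extract in extractors.items():
        correct: Dict[str, int] = {}
        total: Dict[str, int] = {}
        for image, expected in dataset:
            predicted = extract(image)
            for field, truth in expected.items():
                total[field] = total.get(field, 0) + 1
                if predicted.get(field, "").strip() == truth.strip():
                    correct[field] = correct.get(field, 0) + 1
        results[name] = {f: correct.get(f, 0) / total[f] for f in total}
    return results


# Stand-in extractors (assumptions): replace with real model calls.
def model_a(image: bytes) -> Dict[str, str]:
    return {"invoice_number": "INV-001", "total": "120.00"}


def model_b(image: bytes) -> Dict[str, str]:
    return {"invoice_number": "INV-001", "total": "12.00"}


dataset = [(b"fake-image-bytes", {"invoice_number": "INV-001", "total": "120.00"})]
scores = score_models({"model_a": model_a, "model_b": model_b}, dataset)
```

Scoring per field rather than per document makes it easier to see where a model fails, for example one that reads invoice numbers reliably but drops digits in totals.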

Image Understanding vs Image Generation

These are different model families with different engineering concerns:

  • Understanding (vision-language models) — takes an image in, produces text out. Mature, reliable, relatively cheap.
  • Generation (diffusion models, autoregressive image models) — takes text in, produces an image out. More expensive, harder to control, quality varies.

Most production use cases are understanding, not generation. Don't let the hype around image generation distract you from the quieter wins in image understanding.

The OCR Replacement Pattern

The single highest-ROI use of vision models today is replacing OCR + regex pipelines. The old way:

  1. Run OCR (Tesseract, cloud OCR).
  2. Parse the raw text with regexes or rules.
  3. Fix errors with heuristics.
  4. Hope the layout didn't change.

The new way:

  1. Send the document image to a vision-language model.
  2. Ask for structured output (JSON) with the fields you need.
  3. Parse the response and validate the fields you care about.

The vision model handles layout, skew, handwriting, mixed languages, and ambiguity, all the things that made OCR pipelines fragile. Accuracy typically goes up and maintenance goes way down, because there are no layout-specific rules to keep in sync with each document format.
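
The new way can be sketched without committing to any one provider. In the example below, `call_model` is a hypothetical callable that takes a prompt and a base64-encoded image and returns the model's raw text. Real SDKs differ in how they accept images, so treat this as the shape of the pattern, not a specific API. `build_prompt`, `extract_fields`, and `fake_model` are illustrative names.

```python
import base64
import json
from typing import Callable, Dict, List

# Hypothetical signature: (prompt, base64-encoded image) -> raw model text.
ModelCall = Callable[[str, str], str]


def build_prompt(fields: List[str]) -> str:
    """Ask for a JSON object containing exactly the named fields."""
    return (
        "Extract the following fields from the document image. "
        "Reply with a single JSON object and nothing else. "
        "Use null for any field you cannot read.\n"
        f"Fields: {json.dumps(fields)}"
    )


def extract_fields(image: bytes, fields: List[str],
                   call_model: ModelCall) -> Dict[str, object]:
    image_b64 = base64.b64encode(image).decode("ascii")
    raw = call_model(build_prompt(fields), image_b64)
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON replies
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in fields if f not in data]
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return {f: data[f] for f in fields}  # drop any extra keys the model added


# Stand-in for a real model call (assumption), so the sketch runs offline.
def fake_model(prompt: str, image_b64: str) -> str:
    return '{"vendor": "Acme Corp", "total": "120.00", "note": "extra"}'


result = extract_fields(b"\x89PNG...", ["vendor", "total"], fake_model)
```

Failing loudly on malformed or incomplete replies keeps bad extractions from flowing silently into downstream systems.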

Document Parsing Pipelines

For serious document processing at scale, the pattern is:

  1. Page segmentation — split multi-page documents into individual pages.
  2. Classification — what type of document is this page? (invoice, receipt, contract, form)
  3. Extraction — pull structured fields using a vision-language model with a typed schema.
  4. Validation — cross-check extracted fields against each other and business rules.
  5. Human review — route low-confidence extractions to a human.
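
Steps 4 and 5 can be as simple as a handful of arithmetic checks that decide where a document goes next. The sketch below assumes an invoice with `subtotal`, `tax`, and `total` fields. Those field names, the rules, and the queue names `human_review` and `auto_accept` are illustrative, not something the pipeline above prescribes.

```python
from decimal import Decimal, InvalidOperation
from typing import Dict, List


def validate_invoice(fields: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems: List[str] = []
    try:
        subtotal = Decimal(str(fields["subtotal"]))
        tax = Decimal(str(fields["tax"]))
        total = Decimal(str(fields["total"]))
    except (KeyError, InvalidOperation):
        return ["missing or non-numeric amount"]
    # Cross-check the fields against each other.
    if subtotal + tax != total:
        problems.append(f"subtotal + tax = {subtotal + tax}, but total = {total}")
    # Example business rule (assumption): totals are never negative.
    if total < 0:
        problems.append("negative total")
    return problems


def route(fields: Dict[str, object]) -> str:
    """Send anything that fails validation to a human."""
    return "human_review" if validate_invoice(fields) else "auto_accept"


good = {"subtotal": "100.00", "tax": "20.00", "total": "120.00"}
bad = {"subtotal": "100.00", "tax": "20.00", "total": "12.00"}
unreadable = {"subtotal": None, "tax": "20.00", "total": "120.00"}
```

Using `Decimal` instead of `float` avoids rounding noise when comparing currency amounts.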

Key engineering decisions:

  • Resolution matters. Many vision models cap or downscale input images, often at roughly 2048px on the long side, and the exact limit varies by provider. Downscaling a dense document loses detail. Split into regions if needed.
  • Prompt the schema. Give the model the exact JSON schema you expect back. This dramatically improves extraction reliability.
  • Batch wisely. Multi-page documents can often be processed page-by-page in parallel, but cross-page context (like a table split across pages) needs sequential handling.
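
Splitting a page into regions is mostly arithmetic. The sketch below computes overlapping crop boxes so that no tile exceeds a maximum side length. The 2048px default and the 128px overlap are assumptions to tune per model, and `tile_boxes` is a hypothetical helper name.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def _starts(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets along one axis so that tiles cover [0, length)."""
    if length <= tile:
        return [0]
    step = tile - overlap
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)  # final tile sits flush with the far edge
    return starts


def tile_boxes(width: int, height: int,
               max_side: int = 2048, overlap: int = 128) -> List[Box]:
    """Return crop boxes no larger than max_side that cover the whole image."""
    if not 0 <= overlap < max_side:
        raise ValueError("overlap must be non-negative and smaller than max_side")
    boxes: List[Box] = []
    for top in _starts(height, max_side, overlap):
        for left in _starts(width, max_side, overlap):
            boxes.append((left, top,
                          min(left + max_side, width),
                          min(top + max_side, height)))
    return boxes


small = tile_boxes(1000, 800)    # fits in one tile
wide = tile_boxes(3000, 1500)    # needs two tiles side by side
```

The boxes use the (left, top, right, bottom) order that Pillow's `Image.crop` accepts. The overlap exists so that a line of text cut by one tile boundary appears whole in the neighboring tile.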

Practical Vision Pipelines

A few patterns that work in production:

  • Screenshot-to-data — capture UI screenshots, extract displayed information. Useful for monitoring, testing, accessibility audits.
  • Visual grounding — ask the model to locate specific elements in an image by coordinates. The foundation of computer-use agents.
  • Comparison — send two images (before/after, expected/actual) and ask the model to describe differences. Works for QA, change detection, compliance.
  • Chart and graph reading — vision models are surprisingly good at reading bar charts, line graphs, and tables from images. Often better than chart-specific extraction tools.
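
For visual grounding, the model's answer usually needs converting into pixel coordinates before anything can act on it. The sketch below assumes the model returns a normalized box `[x_min, y_min, x_max, y_max]` with values between 0 and 1. Coordinate conventions differ between models, so check what yours emits. `to_pixel_box` and `center` are hypothetical helpers.

```python
from typing import Sequence, Tuple

PixelBox = Tuple[int, int, int, int]  # (left, top, right, bottom)


def to_pixel_box(norm_box: Sequence[float], width: int, height: int) -> PixelBox:
    """Clamp a normalized box to [0, 1] and scale it to pixel coordinates."""
    if len(norm_box) != 4:
        raise ValueError("expected four coordinates")
    x0, y0, x1, y1 = (min(max(float(v), 0.0), 1.0) for v in norm_box)
    if x1 <= x0 or y1 <= y0:
        raise ValueError("degenerate or inverted box")
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


def center(box: PixelBox) -> Tuple[int, int]:
    """Midpoint of a pixel box, e.g. as a click target."""
    left, top, right, bottom = box
    return ((left + right) // 2, (top + bottom) // 2)


button = to_pixel_box([0.5, 0.25, 0.75, 0.5], width=1920, height=1080)
target = center(button)
```

Clamping and rejecting inverted boxes guards against the out-of-range coordinates models occasionally produce.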

What's Still Hard

  • Spatial precision. Models can tell you "the button is in the top-right corner" but struggle with pixel-level precision.
  • Small text in large images. If the text is small relative to the image, it gets lost. Crop and zoom first.
  • Hallucinated text. Models sometimes "read" text that isn't there, especially in low-quality images. Always validate critical extractions.
  • Cost at scale. Image tokens are expensive. Processing millions of documents needs careful batching and caching.
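
A simple form of the caching mentioned above is to key results on a hash of the image bytes plus the prompt, so identical documents are never sent twice. The sketch below keeps results in memory. `ExtractionCache` and the `extract` callable are hypothetical, and a production version would likely persist to a database or object store.

```python
import hashlib
from typing import Callable, Dict


class ExtractionCache:
    """Memoize extraction results by content hash of (prompt, image)."""

    def __init__(self, extract: Callable[[bytes, str], str]) -> None:
        self._extract = extract          # the expensive model call (assumption)
        self._store: Dict[str, str] = {}
        self.misses = 0                  # number of times the model was called

    @staticmethod
    def _key(image: bytes, prompt: str) -> str:
        h = hashlib.sha256()
        h.update(prompt.encode("utf-8"))
        h.update(b"\x00")                # separator between prompt and image
        h.update(image)
        return h.hexdigest()

    def get(self, image: bytes, prompt: str) -> str:
        key = self._key(image, prompt)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._extract(image, prompt)
        return self._store[key]


# Stand-in for a real model call, so the sketch runs offline.
def fake_extract(image: bytes, prompt: str) -> str:
    return f"{len(image)} bytes processed"


cache = ExtractionCache(fake_extract)
first = cache.get(b"scan-1", "extract totals")
second = cache.get(b"scan-1", "extract totals")   # served from the cache
third = cache.get(b"scan-1", "extract vendor")    # different prompt, new call
```

Including the prompt in the key matters: changing the schema or instructions should invalidate old results even when the image is unchanged.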
