Video Understanding
Temporal reasoning, cost tradeoffs, and practical architectures for processing video with AI
Video is the most expensive and least mature multimodal capability. The naive approach, "just send the video to the model", doesn't work, costs a fortune, or both. Understand the architectural tradeoffs before you write any code.
Video as Frames vs Native Video Models
Two fundamentally different approaches:
Frame sampling — extract frames from the video at some interval, send them as images to a vision-language model. This is what most teams actually do.
- Sample 1 frame per second for action-dense content, 1 frame every 5-10 seconds for talking-head or slow-paced content.
- Send frames as a multi-image prompt to models like Gemini (which handles long image sequences natively) or batch them through GPT-4o / Claude.
- Cheap to implement, leverages the best vision models, works today.
- Misses motion, audio context, and fast temporal dynamics. If something happens between frames, you don't see it.
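A minimal sketch of the sampling step described above, assuming you know the clip's duration and native frame rate. The function names are illustrative, not from any library; the resulting indices can be handed to whatever decoder you already use.

```python
import math
from typing import List


def sample_timestamps(duration_s: float, interval_s: float) -> List[float]:
    """Return timestamps (seconds) at which to grab a frame.

    Starts at t=0 and steps by interval_s, staying strictly before duration_s.
    """
    if duration_s < 0 or interval_s <= 0:
        raise ValueError("duration_s must be >= 0 and interval_s must be > 0")
    count = math.ceil(duration_s / interval_s)
    return [i * interval_s for i in range(count)]


def to_frame_indices(timestamps: List[float], native_fps: float) -> List[int]:
    """Map timestamps to the nearest frame index at the video's native frame rate."""
    return [round(t * native_fps) for t in timestamps]


# Action-dense content: 1 frame per second.
dense = sample_timestamps(duration_s=600, interval_s=1)
# Talking-head content: 1 frame every 10 seconds.
sparse = sample_timestamps(duration_s=600, interval_s=10)
```

For a 10-minute clip this yields 600 timestamps at a 1-second interval and 60 at a 10-second interval, which is the whole cost lever of frame sampling.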
Native video models — models trained on video as a first-class input. Still emerging:
- Gemini 1.5 / 2.0 — the most capable model for direct video input. Can process up to an hour of video in context.
- Twelve Labs — specialized video understanding API. Strong at search, classification, and temporal reasoning.
- Google Video AI — cloud APIs for label detection, shot change detection, object tracking.
- Open-source — Video-LLaVA, VideoChat, and others are progressing but lag behind on long-video reasoning.
Rule of thumb: start with frame sampling. Move to native video models only when you need temporal reasoning, motion understanding, or audio-visual correlation that frame sampling can't give you.
Temporal Reasoning
The hard part of video understanding. Questions like "what happened before X?" or "how long did Y last?" require the model to reason across time.
Current capabilities:
- Event sequencing — models can usually get the order of major events right in short clips.
- Duration estimation — unreliable. Models guess rather than measure.
- Causal reasoning — "why did this happen?" works when the cause and effect are visually obvious and temporally close.
- Long-range dependencies — connecting something in minute 1 to something in minute 45 is still very hard. Even Gemini with long context struggles on subtle temporal references.
Practical approach: if you need precise temporal reasoning, don't rely on the model alone. Use traditional video analysis (scene detection, object tracking, timestamp extraction) to create a structured timeline, then ask the model to reason over that timeline.
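One way to sketch the structured-timeline idea: durations are computed in code from detector timestamps, and the model only sees the finished timeline. The `Event` type, the prompt wording, and the sample events are assumptions for illustration; the events would come from your own scene detection or tracking step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    start_s: float
    end_s: float
    label: str


def format_ts(seconds: float) -> str:
    """Format seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def build_timeline(events: List[Event]) -> str:
    """Render events in chronological order, with measured durations."""
    lines = []
    for e in sorted(events, key=lambda ev: ev.start_s):
        duration = e.end_s - e.start_s
        lines.append(
            f"[{format_ts(e.start_s)}-{format_ts(e.end_s)}] ({duration:.0f}s) {e.label}"
        )
    return "\n".join(lines)


def build_prompt(events: List[Event], question: str) -> str:
    """Wrap the timeline and the question into a text prompt (wording is an assumption)."""
    return (
        "Timeline extracted from the video:\n"
        f"{build_timeline(events)}\n\n"
        f"Using only this timeline, answer: {question}"
    )


# Hypothetical detector output, deliberately out of order.
events = [
    Event(65, 80, "forklift enters frame"),
    Event(5, 20, "loading door opens"),
]
prompt = build_prompt(events, "How long was the door open before the forklift arrived?")
```

The model now answers from measured timestamps rather than guessing durations from pixels.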
Video Summarization
The most common production use case. Patterns:
- Hierarchical summarization — summarize each segment, then summarize the summaries. Works for long videos where a single-pass summary would lose detail.
- Keyframe + transcript — extract keyframes and a transcript (via Whisper), send both to an LLM for a combined summary. Cheaper than sending all frames.
- Native model summary — send the video directly to Gemini or Twelve Labs and ask for a summary. Simplest, but expensive for long videos.
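The hierarchical pattern can be sketched as a simple reduce loop. The `summarize` callable is a placeholder for whatever LLM call you use (an assumption here, not a real API), and `fan_in` is an illustrative knob for how many summaries get merged per step.

```python
from typing import Callable, List


def chunk(items: List[str], size: int) -> List[List[str]]:
    """Split items into consecutive groups of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def hierarchical_summary(
    segments: List[str],
    summarize: Callable[[str], str],
    fan_in: int = 5,
) -> str:
    """Summarize each segment, then repeatedly summarize groups of summaries."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    if not segments:
        return ""
    # Level 0: one summary per segment (transcript chunk, frame descriptions, ...).
    level = [summarize(segment) for segment in segments]
    # Higher levels: merge fan_in summaries at a time until one remains.
    while len(level) > 1:
        level = [summarize("\n".join(group)) for group in chunk(level, fan_in)]
    return level[0]
```

With 3 segments and `fan_in=2`, this makes 6 calls to `summarize`: 3 for the segments, 2 for the first merge level, and 1 for the final summary. Plan your budget around that call count, not just the segment count.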
For meeting recordings, transcript-first usually beats video-first: the visual content (slides, screen shares) supplements the spoken content, not the other way around.
Real-Time Video Analysis
Processing video as it streams, not after the fact. Use cases: security monitoring, live sports analysis, manufacturing quality control, telehealth.
Architecture (stages run in order):
- Capture — receive the video stream (RTSP, WebRTC, or frame-by-frame).
- Sample — extract frames at a fixed rate (1-5 fps for most use cases).
- Batch — accumulate a small window of frames (e.g., last 10 seconds).
- Analyze — send the batch to a vision model with a specific question.
- Alert — if the analysis triggers a condition, fire an alert.
Key constraint: latency budget. A cloud vision API call typically takes 1-5 seconds. For sub-second response, run an on-device model (YOLO or a lightweight classifier) on the fast path and reserve the cloud model for the slower, detailed path.
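A rough sketch of the sample, batch, analyze, and alert stages, assuming frames arrive one at a time from a capture loop that is not shown. `analyze` and `should_alert` are hypothetical callables you would supply, and the windows here are tumbling (non-overlapping) so a slow vision call runs once per window rather than once per frame.

```python
from typing import Any, Callable, List, Optional


class WindowedAnalyzer:
    """Sample incoming frames, batch them into windows, analyze, and alert."""

    def __init__(
        self,
        source_fps: float,
        sample_fps: float,
        window_s: float,
        analyze: Callable[[List[Any]], str],
        should_alert: Callable[[str], bool],
    ) -> None:
        # Keep every `stride`-th frame to approximate sample_fps.
        self.stride = max(1, round(source_fps / sample_fps))
        # Number of sampled frames that make up one window.
        self.window_size = max(1, round(sample_fps * window_s))
        self.analyze = analyze
        self.should_alert = should_alert
        self.batch: List[Any] = []
        self.seen = 0

    def push(self, frame: Any) -> Optional[str]:
        """Feed one captured frame; return an alert message or None."""
        keep = self.seen % self.stride == 0
        self.seen += 1
        if not keep:
            return None
        self.batch.append(frame)
        if len(self.batch) < self.window_size:
            return None  # window not full yet
        result = self.analyze(self.batch)
        self.batch = []  # start the next window
        return result if self.should_alert(result) else None
```

In a real deployment the `analyze` call would run off the capture thread, since a multi-second API call would otherwise stall frame intake.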
Cost and Latency
Video is expensive. Some rough numbers to ground your planning:
- Frame sampling at 1 fps turns a 10-minute video into 600 frames. At ~750 tokens per image, that's ~450K input tokens. At GPT-4o input pricing, that is roughly $1-2 per video, before counting output tokens.
- Gemini native video, 10-minute video — significantly cheaper per minute than frame sampling through other models, but still adds up at scale.
- Twelve Labs — usage-based pricing, generally cheaper for search and classification than sending raw video to a general-purpose model.
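The frame-sampling arithmetic is easy to rerun for your own settings. A small sketch follows; the price of $2.50 per million input tokens is an assumed figure for illustration only, so substitute the current rate for whichever model you use, and treat ~750 tokens per image as the rough estimate quoted above.

```python
from typing import Tuple


def estimate_frame_cost(
    duration_s: float,
    sample_fps: float,
    tokens_per_frame: int,
    usd_per_million_tokens: float,
) -> Tuple[int, int, float]:
    """Return (frames, input_tokens, input_cost_usd) for a frame-sampled video."""
    frames = int(duration_s * sample_fps)
    tokens = frames * tokens_per_frame
    cost = tokens / 1_000_000 * usd_per_million_tokens
    return frames, tokens, cost


# 10-minute video, 1 fps, ~750 tokens per image.
# ASSUMPTION: $2.50 per 1M input tokens; check current pricing.
frames, tokens, cost = estimate_frame_cost(600, 1, 750, 2.50)
```

Under those assumptions the estimate is 600 frames, 450,000 input tokens, and a little over $1 per video. Output tokens and any per-request overhead are not included.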
Cost reduction strategies:
- Adaptive sampling — sample more frames during high-activity segments, fewer during static segments. Use scene change detection to decide.
- Two-stage pipeline — cheap classifier first (is this frame interesting?), expensive model only on interesting frames.
- Cache aggressively — if the same video will be queried multiple times, cache the extracted features and frame descriptions.
- Downsample resolution — 720p is usually sufficient for understanding; you don't need 4K frames going to the model.
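Two of these strategies, adaptive sampling and the two-stage pipeline, can be sketched in a few lines. This assumes you already have a per-second activity score between 0 and 1 (for example from frame differencing or a scene-change detector); the threshold, the idle interval, and the callables are illustrative choices, not recommended values.

```python
from typing import Callable, List, Tuple, TypeVar

F = TypeVar("F")
R = TypeVar("R")


def adaptive_timestamps(
    activity: List[float],
    threshold: float = 0.3,
    idle_interval_s: int = 5,
) -> List[int]:
    """Pick seconds to sample: every busy second, else one per idle interval.

    activity[t] is an assumed, precomputed motion score for second t.
    """
    picked: List[int] = []
    last = None
    for t, score in enumerate(activity):
        busy = score >= threshold
        due = last is None or t - last >= idle_interval_s
        if busy or due:
            picked.append(t)
            last = t
    return picked


def two_stage(
    frames: List[F],
    is_interesting: Callable[[F], bool],
    describe: Callable[[F], R],
) -> List[Tuple[F, R]]:
    """Run the expensive `describe` only on frames the cheap filter keeps."""
    return [(frame, describe(frame)) for frame in frames if is_interesting(frame)]


# 20 seconds of footage: static, a 3-second burst of motion, static again.
activity = [0.0] * 10 + [0.9] * 3 + [0.0] * 7
picked = adaptive_timestamps(activity)
```

Here `picked` contains 6 of the 20 seconds, with the burst sampled at full rate and the static stretches sampled every 5 seconds.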
What's Improving Fast
- Native video context windows are getting longer (Gemini already takes roughly an hour of video in a single context).
- Open-source video models are closing the gap, especially for classification and search.
- Real-time APIs for video are emerging but not yet general-purpose.
- Video generation models (Sora, Veo, Kling) are separate from video understanding today, but the line is likely to blur.