Video Understanding

Temporal reasoning, cost tradeoffs, and practical architectures for processing video with AI

Video is the most expensive and least mature multimodal capability. The naive approach of just sending the whole video to the model often doesn't work, costs a fortune, or both. Understanding the architectural tradeoffs is essential before you write any code.

Video as Frames vs Native Video Models

Two fundamentally different approaches:

Frame sampling — extract frames from the video at some interval, send them as images to a vision-language model. This is what most teams actually do.

Sample 1 frame per second for action-dense content, 1 frame every 5-10 seconds for talking-head or slow-paced content.
Send frames as a multi-image prompt to models like Gemini (which handles long image sequences natively) or batch them through GPT-4o / Claude.
Cheap to implement, leverages the best vision models, works today.
Misses motion, audio context, and fast temporal dynamics. If something happens between frames, you don't see it.

Native video models — models trained on video as a first-class input. Still emerging:

Gemini 1.5 / 2.0 — the most capable model for direct video input. Can process up to an hour of video in context.
Twelve Labs — specialized video understanding API. Strong at search, classification, and temporal reasoning.
Google Video AI — cloud APIs for label detection, shot change detection, object tracking.
Open-source — Video-LLaVA, VideoChat, and others are progressing but lag behind on long-video reasoning.

Rule of thumb: start with frame sampling. Move to native video models only when you need temporal reasoning, motion understanding, or audio-visual correlation that frame sampling can't give you.
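
As a rough illustration of the frame-sampling approach described above, the sketch below picks which timestamps and frame indices to grab for a given sampling interval. The function names and the interval table are hypothetical, chosen only for this example; actual decoding would be done with a tool such as OpenCV or ffmpeg, which is not shown here.

```python
# Hypothetical interval table based on the guidance above:
# 1 frame per second for action-dense content, one frame every
# 5 seconds (the low end of 5-10) for slow-paced content.
SAMPLE_INTERVAL_S = {
    "action": 1.0,
    "talking_head": 5.0,
}


def sample_timestamps(duration_s: float, interval_s: float) -> list[float]:
    """Return the timestamps (in seconds) at which to grab a frame."""
    if duration_s < 0 or interval_s <= 0:
        raise ValueError("duration_s must be >= 0 and interval_s must be > 0")
    timestamps = []
    i = 0
    # Multiply instead of accumulating to avoid floating-point drift.
    while i * interval_s < duration_s:
        timestamps.append(i * interval_s)
        i += 1
    return timestamps


def frame_indices(timestamps: list[float], fps: float) -> list[int]:
    """Map timestamps to the nearest frame index of the source video."""
    return [round(t * fps) for t in timestamps]


if __name__ == "__main__":
    ts = sample_timestamps(600, SAMPLE_INTERVAL_S["action"])
    print(len(ts), "frames for a 10-minute action clip")
    print(frame_indices(ts[:3], fps=30.0))
```

Anything that happens between two sampled timestamps is invisible to the model, which is the main limitation noted above.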

Temporal Reasoning

Temporal reasoning is the hard part of video understanding. Questions like "what happened before X?" or "how long did Y last?" require the model to reason across time.

Current capabilities:

Event sequencing — models can usually get the order of major events right in short clips.
Duration estimation — unreliable. Models guess rather than measure.
Causal reasoning — "why did this happen?" works when the cause and effect are visually obvious and temporally close.
Long-range dependencies — connecting something in minute 1 to something in minute 45 is still very hard. Even Gemini with long context struggles on subtle temporal references.

Practical approach: if you need precise temporal reasoning, don't rely on the model alone. Use traditional video analysis (scene detection, object tracking, timestamp extraction) to create a structured timeline, then ask the model to reason over that timeline.
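
One way to picture the structured-timeline idea above: events detected by traditional analysis are turned into a timestamped text timeline, and the model is asked to reason over that text rather than over raw frames. This is a minimal sketch; `TimelineEvent`, `build_timeline_prompt`, and the prompt wording are hypothetical, and the events are assumed to come from some upstream detector that is not shown.

```python
from dataclasses import dataclass


@dataclass
class TimelineEvent:
    start_s: float  # measured start time in seconds
    end_s: float    # measured end time in seconds
    label: str      # description from scene detection, tracking, etc.


def format_timestamp(seconds: float) -> str:
    """Format seconds as MM:SS (minutes keep growing past 99 for long videos)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def build_timeline_prompt(events: list[TimelineEvent], question: str) -> str:
    """Render events in chronological order and append the question."""
    ordered = sorted(events, key=lambda e: e.start_s)
    lines = []
    for e in ordered:
        duration = e.end_s - e.start_s
        span = f"{format_timestamp(e.start_s)}-{format_timestamp(e.end_s)}"
        lines.append(f"[{span}] ({duration:.0f}s) {e.label}")
    timeline = "\n".join(lines)
    return (
        "Timeline of detected events (timestamps are measured):\n"
        f"{timeline}\n\n"
        f"Question: {question}\n"
        "Answer using only the timeline above."
    )


if __name__ == "__main__":
    events = [
        TimelineEvent(125, 140, "forklift enters"),
        TimelineEvent(5, 65, "door open"),
    ]
    print(build_timeline_prompt(events, "How long was the door open?"))
```

Because durations are computed from measured timestamps, the model no longer has to guess them.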

Video Summarization

The most common production use case. Patterns:

Hierarchical summarization — summarize each segment, then summarize the summaries. Works for long videos where a single-pass summary would lose detail.
Keyframe + transcript — extract keyframes and a transcript (via Whisper), send both to an LLM for a combined summary. Cheaper than sending all frames.
Native model summary — send the video directly to Gemini or Twelve Labs and ask for a summary. Simplest, but expensive for long videos.
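
The hierarchical pattern above can be sketched as a simple reduce loop. The `summarize` callable is a stand-in for whatever model call you use (an assumption of this example, not a real API), and `fan_in` is a hypothetical parameter controlling how many summaries are merged at each level.

```python
from typing import Callable


def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split a list into consecutive groups of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def hierarchical_summary(
    segments: list[str],
    summarize: Callable[[str], str],
    fan_in: int = 4,
) -> str:
    """Summarize each segment, then repeatedly summarize groups of summaries."""
    if not segments:
        raise ValueError("segments must not be empty")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each level shrinks")
    # Level 0: one summary per segment (transcript chunk, frame captions, ...).
    level = [summarize(segment) for segment in segments]
    # Merge groups of summaries until a single summary remains.
    while len(level) > 1:
        level = [summarize("\n".join(group)) for group in chunk(level, fan_in)]
    return level[0]


if __name__ == "__main__":
    # Toy stand-in for a model call: keep only the first sentence.
    def first_sentence(text: str) -> str:
        return text.split(".")[0].strip() + "."

    parts = ["Intro. Details.", "Demo. More.", "Q and A. Wrap-up."]
    print(hierarchical_summary(parts, first_sentence, fan_in=2))
```

A smaller `fan_in` keeps each merge prompt short at the price of more model calls.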

For meeting recordings: transcript-first is usually better than video-first. The visual content (slides, screen shares) supplements the spoken content, not the other way around.

Real-Time Video Analysis

Processing video as it streams, not after the fact. Use cases: security monitoring, live sports analysis, manufacturing quality control, telehealth.

Architecture:

Capture — receive the video stream (RTSP, WebRTC, or frame-by-frame).
Sample — extract frames at a fixed rate (1-5 fps for most use cases).
Batch — accumulate a small window of frames (e.g., last 10 seconds).
Analyze — send the batch to a vision model with a specific question.
Alert — if the analysis triggers a condition, fire an alert.

Key constraint: latency budget. A cloud vision API call typically takes 1-5 seconds. If you need sub-second response, run an on-device model (YOLO, lightweight classifier) on the fast path and reserve the cloud model for the slow, detailed path.
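
The capture, sample, batch, analyze, and alert steps above can be sketched as a sliding-window loop. This is a simplified, single-threaded illustration: `monitor`, `window_s`, and `stride_s` are hypothetical names, frames are assumed to arrive already sampled as `(timestamp, frame)` pairs, and `analyze` stands in for the vision model call.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Optional


def monitor(
    frames: Iterable[tuple[float, object]],
    analyze: Callable[[list[object]], Optional[str]],
    window_s: float = 10.0,
    stride_s: float = 5.0,
) -> Iterator[tuple[float, str]]:
    """Yield (timestamp, finding) whenever analysis of the window flags something."""
    window: deque = deque()
    next_run: Optional[float] = None
    for ts, frame in frames:
        # Batch: keep only the frames from the last `window_s` seconds.
        window.append((ts, frame))
        while ts - window[0][0] >= window_s:
            window.popleft()
        if next_run is None:
            next_run = ts + stride_s
        # Analyze: run the model every `stride_s` seconds on the current window.
        if ts >= next_run:
            finding = analyze([f for _, f in window])
            next_run = ts + stride_s
            # Alert: surface the finding to the caller.
            if finding is not None:
                yield ts, finding


if __name__ == "__main__":
    stream = ((float(i), i) for i in range(21))  # fake 1 fps stream
    for ts, finding in monitor(stream, lambda b: "alert" if 12 in b else None):
        print(ts, finding)
```

In this sketch the `analyze` call blocks the loop, so a real deployment would likely run it on a separate worker to stay within the latency budget.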

Cost and Latency

Video is expensive. Some rough numbers to ground your planning:

Frame sampling at 1 fps on a 10-minute video yields 600 frames. At ~750 tokens per image, that's ~450K input tokens. At GPT-4o input pricing, that is roughly $1-2 per video before any output tokens.
Gemini native video, 10-minute video — significantly cheaper per minute than frame sampling through other models, but still adds up at scale.
Twelve Labs — usage-based pricing, generally cheaper for search and classification than sending raw video to a general-purpose model.
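
The frame-sampling estimate above is simple arithmetic, and it helps to keep it in a small function so you can rerun it as prices and sampling rates change. The per-million-token price below is a placeholder assumption, not a quoted rate; check the provider's current pricing before relying on it.

```python
def frame_sampling_cost(
    duration_s: float,
    fps: float,
    tokens_per_frame: int,
    usd_per_million_tokens: float,
) -> tuple[int, int, float]:
    """Return (frames, input_tokens, input_cost_usd) for one video."""
    frames = round(duration_s * fps)
    tokens = frames * tokens_per_frame
    cost = tokens / 1_000_000 * usd_per_million_tokens
    return frames, tokens, cost


if __name__ == "__main__":
    # 10-minute video at 1 fps, ~750 tokens per image.
    # 2.50 USD per million input tokens is an assumed placeholder price.
    frames, tokens, cost = frame_sampling_cost(600, 1.0, 750, 2.50)
    print(f"{frames} frames, {tokens} tokens, ${cost:.2f} input cost")
```

Dropping to one frame every 5 seconds cuts the frame count, and therefore the input cost, by a factor of five.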

Cost reduction strategies:

Adaptive sampling — sample more frames during high-activity segments, fewer during static segments. Use scene change detection to decide.
Two-stage pipeline — cheap classifier first (is this frame interesting?), expensive model only on interesting frames.
Cache aggressively — if the same video will be queried multiple times, cache the extracted features and frame descriptions.
Downsample resolution — 720p is usually sufficient for understanding; you don't need 4K frames going to the model.
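
As one possible first stage for the two-stage and adaptive-sampling strategies above, the sketch below keeps a frame only when it differs enough from the last frame that was kept. It uses a plain mean absolute pixel difference as a stand-in for a real classifier or scene change detector; the function names, the flat grayscale frame representation, and the threshold value are all assumptions for illustration.

```python
from typing import Sequence


def mean_abs_diff(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean absolute difference between two equally sized grayscale frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_interesting(
    frames: list[Sequence[int]],
    threshold: float = 10.0,
) -> list[int]:
    """Return indices of frames that changed enough to be worth sending on."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame as the reference
    for i in range(1, len(frames)):
        # Compare against the last kept frame so slow drift still triggers.
        if mean_abs_diff(frames[i], frames[kept[-1]]) >= threshold:
            kept.append(i)
    return kept


if __name__ == "__main__":
    # Five tiny 4-pixel "frames" with two visible changes.
    frames = [[0] * 4, [1] * 4, [50] * 4, [52] * 4, [120] * 4]
    print(select_interesting(frames))
```

Only the selected indices would go to the expensive model, so static footage costs almost nothing.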

What's Improving Fast

Native video context windows are getting longer (Gemini already handles an hour or more of video in a single request).
Open-source video models are closing the gap, especially for classification and search.
Real-time APIs for video are emerging but not yet general-purpose.
Video generation models (Sora, Veo, Kling) are separate from video understanding, but the line is likely to blur.