Steven's Knowledge

Context Compression

Techniques for fitting more useful information into fewer tokens — summarization, selective context, and sliding windows

Context windows are finite, tokens cost money, and latency grows with input length. Compression is the art of keeping the information the model needs while throwing away what it doesn't. Done well, compression improves both cost and quality. Done badly, you lose the signal.

Why Compress

Three reasons to compress context rather than just stuffing everything in:

  1. Cost — halving the input tokens roughly halves the input cost per call. At scale, this matters.
  2. Latency — fewer input tokens means faster time-to-first-token.
  3. Attention quality — less noise in the context means the model focuses on what matters. A shorter, curated context often outperforms a longer, noisy one.

Summarization-Based Compression

The most intuitive approach: use an LLM to summarize content before feeding it as context.

How it works:

  1. Take a long document or conversation.
  2. Summarize it into a shorter form that preserves the key information.
  3. Use the summary as context instead of the original.

When it works well:

  • Conversation history — summarize older turns, keep recent ones verbatim.
  • Background documents — when the model needs awareness but not verbatim access.
  • Multi-document synthesis — summarize each document, then reason over summaries.

The risk: summarization is lossy. The model that produces the summary decides what's important, and it might be wrong. Critical details get dropped. Numbers get rounded. Nuance disappears.

Mitigation: keep the original accessible. Summarize for the "first pass" and let the model request the full text when it needs precision.
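
One way to picture this mitigation is a small store that hands the prompt a summary while keeping the original retrievable by ID. The sketch below is illustrative only: `SummaryStore`, `CompressedDoc`, and `first_sentence` are hypothetical names, and `first_sentence` is a stand-in so the example runs without a model. In practice the `summarize` callable would wrap an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CompressedDoc:
    doc_id: str
    summary: str


class SummaryStore:
    """Holds summaries for the prompt and originals for on-demand lookup."""

    def __init__(self, summarize: Callable[[str], str]):
        self._summarize = summarize          # assumption: wraps an LLM call in practice
        self._originals: Dict[str, str] = {}

    def add(self, doc_id: str, text: str) -> CompressedDoc:
        self._originals[doc_id] = text       # keep the lossless copy
        return CompressedDoc(doc_id, self._summarize(text))

    def full_text(self, doc_id: str) -> str:
        """Return the original, e.g. when the model requests it through a tool."""
        return self._originals[doc_id]


def first_sentence(text: str) -> str:
    """Stand-in summarizer: keeps only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


store = SummaryStore(first_sentence)
doc = store.add("q3-report", "Revenue grew 12.4% in Q3. Churn rose to 3.1%. Hiring is frozen.")
print(doc.summary)                   # summary goes into the prompt
print(store.full_text("q3-report"))  # original stays available for precision
```

The summary here drops the churn figure entirely, which is exactly the kind of loss the lookup path is meant to cover.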

LLMLingua and Token-Level Compression

A different approach: instead of rewriting content, remove tokens that are informationally redundant.

LLMLingua and its successors (LongLLMLingua, LLMLingua-2) use a small model to estimate per-token importance, then drop low-importance tokens. The result is a compressed prompt that reads like broken English to a human but preserves the information a model needs.

How it works:

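  1. Run the prompt through a small model to score each token. The original LLMLingua uses perplexity from a small language model; LLMLingua-2 instead trains a token classifier to predict which tokens to keep.
  2. Tokens with low scores (in the perplexity case: predictable, redundant tokens) are candidates for removal.
  3. Remove the lowest-scoring tokens until the target compression ratio is reached.
  4. Send the compressed prompt to the main model.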
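
The steps above can be sketched with a toy scorer. This is not LLMLingua's implementation or API: a smoothed unigram model built from a hypothetical `reference` corpus stands in for the small language model, and per-token surprisal (negative log probability) stands in for its perplexity score. The function name `compress_tokens` is likewise made up for illustration.

```python
import math
from collections import Counter
from typing import List


def compress_tokens(tokens: List[str], reference: List[str], ratio: float) -> List[str]:
    """Keep about len(tokens) / ratio tokens, dropping the most predictable first.

    Assumption: a unigram model over `reference` replaces the small LM.
    """
    counts = Counter(t.lower() for t in reference)
    total = sum(counts.values())
    vocab = len(counts) + 1

    def surprisal(tok: str) -> float:
        # Add-one smoothing so unseen tokens get a finite, high score.
        p = (counts[tok.lower()] + 1) / (total + vocab)
        return -math.log(p)

    keep_n = max(1, math.ceil(len(tokens) / ratio))
    # Rank positions by surprisal, highest first; ties favor earlier tokens.
    ranked = sorted(range(len(tokens)), key=lambda i: (-surprisal(tokens[i]), i))
    keep = sorted(ranked[:keep_n])  # restore original word order
    return [tokens[i] for i in keep]


reference = ("the of a to and in is the the of a to the and in is that "
             "the meeting is in the room and the report is on the table").split()
prompt = "the deadline for the Zurich audit is the 14th of March".split()
print(" ".join(compress_tokens(prompt, reference, ratio=2.0)))
# prints: deadline for Zurich audit 14th March
```

The output reads like broken English but keeps the tokens that carry the facts. A real scorer conditions on the preceding context rather than on unigram frequency, so it can also drop content words that are predictable in context.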

Typical compression ratios: 2x-5x without significant quality loss on many benchmarks.

Tradeoffs:

  • Adds a preprocessing step (latency and cost of the small model).
  • Works better on natural language than on structured data or code.
  • Quality degrades at high compression ratios — there's a sweet spot.
  • Compressed prompts are fragile: an update to the target model can change how well it recovers the meaning of compressed input, so results need re-checking after model changes.

Selective Context

Instead of compressing everything uniformly, keep important parts at full fidelity and aggressively compress or drop the rest.

Strategies:

  • Relevance filtering — given the user's query, use embeddings or a classifier to score each section's relevance. Keep high-relevance sections verbatim; summarize or drop low-relevance ones.
  • Recency weighting — in conversations, recent messages are more likely to be relevant. Keep the last N turns verbatim, summarize everything before that.
  • Entity-focused retention — identify key entities (people, products, decisions) and keep all mentions of those entities, dropping surrounding filler.
  • Instruction preservation — system prompts and user instructions are almost always high-value. Never compress these.

Conversation Compaction

For multi-turn conversations that grow beyond the window:

Progressive Summarization

  1. Keep the full conversation until it hits a threshold (e.g., 70% of the window).
  2. Summarize the oldest portion into a paragraph.
  3. The context becomes: summary of old turns + full recent turns.
  4. As the conversation continues, re-summarize periodically.

This is a common pattern in production chatbots. The model always has the recent conversation verbatim plus a summary of earlier context.
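
The progressive summarization loop can be sketched as a single function that is called after every turn. The names `compact`, `count_tokens`, `keep_recent`, and `stub_summarize` are hypothetical, the whitespace token count is a crude approximation of a real tokenizer, and the stub summarizer exists only so the sketch runs without a model.

```python
from typing import Callable, List, Tuple


def count_tokens(text: str) -> int:
    """Crude whitespace count; a real system would use the model's tokenizer."""
    return len(text.split())


def compact(summary: str, turns: List[str], window: int,
            summarize: Callable[[str], str],
            threshold: float = 0.7, keep_recent: int = 2) -> Tuple[str, List[str]]:
    """Fold the oldest turns into `summary` once usage passes threshold * window."""
    used = count_tokens(summary) + sum(count_tokens(t) for t in turns)
    if used <= threshold * window or len(turns) <= keep_recent:
        return summary, turns  # under budget: leave everything verbatim
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    # Re-summarize the previous summary together with the turns being retired.
    new_summary = summarize("\n".join(filter(None, [summary, *old])))
    return new_summary, recent


def stub_summarize(text: str) -> str:
    """Stand-in for an LLM call: keeps the first three words of each line."""
    return "Earlier: " + "; ".join(" ".join(line.split()[:3]) for line in text.splitlines())


turns = [
    "user: my order 1234 never arrived",
    "bot: sorry, checking the carrier now",
    "user: it was due on Monday",
    "bot: the carrier shows it as lost",
]
summary, turns = compact("", turns, window=20, summarize=stub_summarize)
print(summary)
print(turns)
```

Because each pass feeds the previous summary back into the summarizer, errors can compound over long conversations, which is one reason to keep the raw transcript stored outside the context.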

Hierarchical Compaction

A more sophisticated version:

  • Level 0 — full verbatim conversation (most recent turns).
  • Level 1 — per-turn summaries (medium-age turns).
  • Level 2 — session-level summary (old turns).

Each level compresses more aggressively. The model sees the most detail for the most recent context and progressively less for older context.
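
As a rough illustration of the three levels, the function below assembles a context from a list of turns. The split sizes (`verbatim`, `per_turn`), the bracketed labels, and the function name are assumptions made for this sketch, and `summarize` is any callable that shortens text.

```python
from typing import Callable, List


def build_hierarchical_context(turns: List[str],
                               summarize: Callable[[str], str],
                               verbatim: int = 2, per_turn: int = 2) -> List[str]:
    """Assemble context oldest-first: Level 2, then Level 1, then Level 0."""
    recent_start = max(0, len(turns) - verbatim)
    mid_start = max(0, recent_start - per_turn)
    level2_src = turns[:mid_start]              # old turns
    level1_src = turns[mid_start:recent_start]  # medium-age turns
    level0 = turns[recent_start:]               # most recent turns

    context: List[str] = []
    if level2_src:
        # Level 2: one summary covering all old turns.
        context.append("[session summary] " + summarize("\n".join(level2_src)))
    # Level 1: one short summary per medium-age turn.
    context += ["[turn summary] " + summarize(t) for t in level1_src]
    # Level 0: verbatim.
    context += level0
    return context


def first_words(text: str) -> str:
    """Stand-in summarizer: first four words only."""
    return " ".join(text.split()[:4])


turns = [f"turn {i}: some fairly long message number {i}" for i in range(1, 7)]
for line in build_hierarchical_context(turns, first_words):
    print(line)
```

This version recomputes every summary on each call for simplicity. A production version would more likely cache Level 1 and Level 2 summaries and only update them as turns age out of the level above.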

Sliding Window Approaches

When processing very long sequences (long documents, streaming data):

  • Fixed window — only the most recent N tokens are in context. Oldest tokens fall off the end. Simple but lossy — anything before the window is gone.
  • Sliding window with summary — maintain a running summary of everything before the window. The context is always: summary + current window. This preserves some awareness of earlier content.
  • Overlapping windows — process the document in windows that overlap by some percentage. Each window has context from the previous one. Good for extraction tasks where you can merge results.
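
The fixed and overlapping variants are simple enough to sketch directly. The function names are hypothetical and tokens are represented as a plain list of strings. The summary variant is omitted because it would need a summarizer.

```python
from typing import Iterator, List


def fixed_window(tokens: List[str], n: int) -> List[str]:
    """Keep only the most recent n tokens; everything earlier is lost."""
    return tokens[-n:] if n > 0 else []


def overlapping_windows(tokens: List[str], size: int, overlap: int) -> Iterator[List[str]]:
    """Yield windows of `size` tokens, each sharing `overlap` tokens with the previous."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    for start in range(0, len(tokens), step):
        yield tokens[start:start + size]
        if start + size >= len(tokens):
            break  # the last window already reaches the end


tokens = list("abcdefghij")
print(fixed_window(tokens, 4))
# prints: ['g', 'h', 'i', 'j']
print(["".join(w) for w in overlapping_windows(tokens, size=4, overlap=2)])
# prints: ['abcd', 'cdef', 'efgh', 'ghij']
```

With overlapping windows, items extracted from the shared region appear twice, so the merge step needs to deduplicate them.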

Practical Guidelines

  1. Measure before compressing — don't compress if you're not hitting a cost, latency, or quality wall. Compression adds complexity.
  2. Preserve structure — headers, section markers, and formatting carry disproportionate information. Keep them even when compressing content.
  3. Never compress the system prompt — your instructions to the model are the highest-value tokens in the context.
  4. Test compressed vs uncompressed: run your eval suite with and without compression. If quality drops by more than the cost and latency savings justify, the compression isn't worth it.
  5. Combine approaches — selective context (drop irrelevant sections) + summarization (compress the retained sections) is often more effective than either alone.
  6. Log what was compressed: when debugging bad outputs, you need to know what the model actually saw. Keep a record of both the pre-compression context and the compressed version that was sent.
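
For the logging guideline, one possible shape for a log record is sketched below. The field names and the `compression_record` helper are assumptions rather than an established schema. Storing full text may not be acceptable for sensitive data, in which case the hash and the sizes alone still help correlate outputs with inputs.

```python
import hashlib
import json
import time
from typing import Dict


def compression_record(original: str, compressed: str, method: str) -> Dict[str, object]:
    """Build a JSON-serializable record of one compression step."""
    return {
        "ts": time.time(),
        "method": method,
        "original_sha256": hashlib.sha256(original.encode("utf-8")).hexdigest(),
        "original_chars": len(original),
        "compressed_chars": len(compressed),
        "ratio": round(len(original) / max(1, len(compressed)), 2),
        "original": original,      # what was available before compression
        "compressed": compressed,  # what the model actually saw
    }


record = compression_record(
    original="the deadline for the Zurich audit is the 14th of March",
    compressed="deadline for Zurich audit 14th March",
    method="token-pruning",
)
print(json.dumps(record, indent=2))
```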
