CompressKV Cuts LLM Memory Use by Targeting Smarter Cache Heads

A new open-source framework trims the KV cache to under 1% of its original size while still reaching 90% accuracy on a long-context retrieval benchmark.

Researchers have found a way to run long-context AI inference with a fraction of the memory it normally demands.

Large language models store the attention keys and values of earlier tokens in a key-value (KV) cache during inference, essentially a scratchpad that grows with context length. The longer the input, the more memory it consumes. CompressKV, a framework from researchers at TU Darmstadt, attacks this by being selective about what gets kept. Rather than scoring and evicting tokens uniformly across all attention heads, it identifies a subset it calls Semantic Retrieval Heads: the attention heads that actually do the work of locating important information in a long prompt. Only those heads guide the decision of what stays in the cache. A per-layer cache budget, calculated offline from how much each layer degrades when its cache is cut, then determines how much each layer gets to keep.
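To make the two ideas concrete, here is a minimal Python sketch, not the CompressKV implementation. It assumes per-head attention scores and per-layer sensitivity values are already available. The function names `select_tokens` and `allocate_budgets`, the sum-of-scores ranking, and the proportional budget split are all illustrative assumptions; the paper's actual scoring and allocation rules may differ.

```python
from typing import List, Sequence


def select_tokens(scores: Sequence[Sequence[float]],
                  guide_heads: Sequence[int],
                  budget: int) -> List[int]:
    """Pick which cached tokens to keep, guided by a subset of heads.

    scores[h][t] is the attention mass head h places on cached token t
    (assumed to be given). Only heads listed in guide_heads vote.
    """
    num_tokens = len(scores[0])
    # Aggregate scores over the guiding heads only (hypothetical rule: a plain sum).
    combined = [sum(scores[h][t] for h in guide_heads) for t in range(num_tokens)]
    # Rank tokens by combined score and keep the top `budget` of them.
    ranked = sorted(range(num_tokens), key=lambda t: combined[t], reverse=True)
    # Return kept positions in their original order.
    return sorted(ranked[:budget])


def allocate_budgets(layer_sensitivity: Sequence[float],
                     total_budget: int) -> List[int]:
    """Split a total token budget across layers.

    Assumption: layers that degrade more when their cache is cut get a
    proportionally larger share. The sensitivities would be measured offline.
    """
    total = sum(layer_sensitivity)
    budgets = [int(total_budget * s / total) for s in layer_sensitivity]
    # Rounding down can leave a few slots unassigned; give them to the
    # most sensitive layers first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(len(budgets)),
                   key=lambda i: layer_sensitivity[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


# Toy example: head 0 fixates on token 0, heads 1 and 2 act as "retrieval" heads.
toy_scores = [
    [0.9, 0.05, 0.05, 0.0],
    [0.1, 0.1, 0.1, 0.7],
    [0.0, 0.2, 0.7, 0.1],
]
kept_all_heads = select_tokens(toy_scores, guide_heads=[0, 1, 2], budget=2)
kept_guided = select_tokens(toy_scores, guide_heads=[1, 2], budget=2)
layer_budgets = allocate_budgets([3, 1, 1], total_budget=10)
```

In the toy data, letting every head vote keeps token 0 because one head attends to it heavily, while restricting the vote to the two designated heads keeps tokens 2 and 3 instead. That difference is the kind of effect head selection is meant to produce.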

The numbers are striking enough to take seriously. On LongBench question-answering tasks, CompressKV retains over 97% of full-cache performance using just 3% of the KV cache. On the Needle-in-a-Haystack benchmark, a standard test for finding a specific fact buried in a long document, it hits 90% accuracy with only 0.7% of the KV storage. That matters because memory cost is a real deployment constraint: running long-context models on consumer or edge hardware is often not viable today, and cloud inference costs scale with memory usage.
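For a sense of scale, the back-of-the-envelope calculation below estimates KV cache size for a single sequence. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) and the 131,072-token context are assumptions chosen for illustration, not the configurations used in the benchmarks above.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape and context length (illustrative only).
full_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                            seq_len=131072)

# Apply the retention ratios quoted in the article.
at_3_percent = full_bytes * 0.03
at_0_7_percent = full_bytes * 0.007

GIB = 1024 ** 3
MIB = 1024 ** 2
print(f"full cache: {full_bytes / GIB:.1f} GiB")
print(f"3% budget:  {at_3_percent / MIB:.0f} MiB")
print(f"0.7% budget: {at_0_7_percent / MIB:.0f} MiB")
```

Under these assumptions the full cache comes to 16 GiB, while a 0.7% budget is roughly 115 MiB, which is the difference between needing a data-center GPU and fitting comfortably alongside the model on much smaller hardware.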

KV cache compression is a crowded field (methods like H2O, SnapKV, and PyramidKV have all chipped away at the same problem), but differentiating between heads is a genuine structural departure from approaches that apply one heuristic score across all heads. The code is public on GitHub, which is the right move; extraordinary benchmark claims need community replication before anyone should bank on them at scale.

The Revision

Written by an AI system from the public sources credited above.