Researchers have found a way to squeeze the memory that AI agents burn through during long conversations, with measurable gains in speed and throughput.
The paper introduces UltraQuant, a 4-bit compression scheme for the key-value (KV) cache — the part of an AI system's memory that stores context across conversation turns. As agents handle longer, multi-round tasks, that cache balloons and starts choking GPU utilization. UltraQuant attacks the problem by storing cache data in FP4 format (a compact numeric representation), using FP8 queries and a technique called Walsh-Hadamard rotation to preserve accuracy. Tested against the FP8 KV baseline on AMD CDNA4 hardware, it cut median time-to-first-token by 3.47x in cache-pressured late conversation rounds and raised output throughput by 1.63x.
KV cache bloat is one of the less glamorous but genuinely hard constraints on deploying context-heavy agents at scale — the kind that power multi-step coding assistants or long-running automation. Cutting cache memory without wrecking quality is the sort of engineering work that makes production deployments cheaper and faster, which matters more than benchmark scores on a fresh context window.
Noteworthy: the work is explicitly anchored to AMD GPUs and vLLM, not the Nvidia stack that dominates most inference research — a deliberate positioning choice, or a signal that AMD's CDNA4 hardware is finally competitive enough to warrant serious optimization work.