UltraQuant Cuts AI Agent Memory Costs with 4-bit KV Caching

A new compression technique for AI agent memory cuts time-to-first-token by up to 3.47x on AMD GPUs, without gutting output quality.

Researchers have found a way to squeeze the memory that AI agents burn through during long conversations, with measurable gains in speed and throughput.

The paper introduces UltraQuant, a 4-bit compression scheme for the key-value (KV) cache, the part of an AI system's memory that stores context across conversation turns. As agents handle longer, multi-round tasks, that cache balloons and starts choking GPU utilization. UltraQuant attacks the problem by storing cache data in FP4 (a 4-bit floating-point format), pairing it with FP8 queries and a technique called Walsh-Hadamard rotation to preserve accuracy. Tested against an FP8 KV cache baseline on AMD CDNA4 hardware, it cut median time-to-first-token by 3.47x in cache-pressured late conversation rounds and raised output throughput by 1.63x.
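For readers who want a feel for why rotation and 4-bit storage go together, here is a minimal sketch in plain Python. It is not the paper's kernel. It assumes the common E2M1 value grid for FP4, uses one shared scale per vector (real schemes typically scale small blocks), keeps queries in full precision rather than FP8, and plants a synthetic outlier channel in the key. All function names are hypothetical. The one property it relies on is standard: an orthonormal Walsh-Hadamard transform applied to both query and key leaves their dot product unchanged, while spreading a single large channel across all coordinates before rounding.

```python
import math
import random

# Magnitudes representable by FP4 in the E2M1 layout (assumed here; sign is separate).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0
    out = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def fp4_fake_quantize(x):
    """Round to the FP4 grid using one shared scale, then dequantize."""
    amax = max(abs(v) for v in x)
    if amax == 0.0:
        return list(x)
    scale = amax / FP4_GRID[-1]
    result = []
    for v in x:
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        result.append(math.copysign(mag * scale, v))
    return result

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def attention_score_errors(seed=0, dim=64):
    """Return (error without rotation, error with rotation) for one q.k score."""
    rng = random.Random(seed)
    key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    key[3] = 40.0  # synthetic outlier channel, an assumption for illustration
    query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    exact = dot(query, key)
    plain = dot(query, fp4_fake_quantize(key))
    rotated = dot(hadamard_rotate(query),
                  fp4_fake_quantize(hadamard_rotate(key)))
    return abs(plain - exact), abs(rotated - exact)

if __name__ == "__main__":
    plain_err, rotated_err = attention_score_errors()
    print(f"score error without rotation: {plain_err:.4f}")
    print(f"score error with rotation:    {rotated_err:.4f}")
```

Without rotation, the outlier sets the scale and most other channels round to zero. With rotation, the outlier's energy is spread evenly, so the shared scale is smaller and the grid is used more fully. Whether that lowers the error on any single score depends on the data, so try several seeds rather than reading too much into one run.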

KV cache bloat is one of the less glamorous but genuinely hard constraints on deploying context-heavy agents at scale — the kind that power multi-step coding assistants or long-running automation. Cutting cache memory without wrecking quality is the sort of engineering work that makes production deployments cheaper and faster, which matters more than benchmark scores on a fresh context window.
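To put rough numbers on that bloat, the back-of-the-envelope calculation below uses the standard KV cache size formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model shape is hypothetical and not taken from the paper, and the estimate ignores the small overhead of quantization scale factors.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_element):
    """Approximate KV cache size in bytes, ignoring scale-factor metadata."""
    elements = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return elements * bits_per_element // 8

if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(layers=32, kv_heads=8, head_dim=128)
    tokens = 131072  # a 128K-token conversation
    for name, bits in (("FP8", 8), ("FP4", 4)):
        gib = kv_cache_bytes(tokens, bits_per_element=bits, **shape) / 2**30
        print(f"{name}: {gib:.1f} GiB")
```

Under these assumptions the cache takes 8 GiB at FP8 and 4 GiB at FP4 for a single 128K-token conversation, so the same memory budget holds roughly twice as much context.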

Noteworthy: the work is explicitly anchored to AMD GPUs and vLLM, not the Nvidia stack that dominates most inference research. That is either a deliberate positioning choice or a signal that AMD's CDNA4 hardware is finally competitive enough to warrant serious optimization work.

The Revision

Written by an AI system from the public sources credited above.