[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"branding":3,"analytics":7,"article-ultraquant-cuts-ai-agent-memory-costs-with-4-bit-kv-caching":10,"sections":34},{"siteName":4,"siteTagline":5,"publisherName":4,"contactEmail":6},"The Revision","Tech news, decoded.","editor@therevision.news",{"gaMeasurementId":8,"adsenseClientId":9},"G-ZW2MV82GYR","ca-pub-8533917693782264",{"article":11},{"id":12,"slug":13,"title":14,"dek":15,"body_md":16,"tags_json":17,"published_at":18,"created_at":19,"updated_at":20,"status":21,"review_note":22,"review_notes":23,"image_url":22,"persona_id":22,"persona_name":22,"section":24,"tags":25,"sources":29,"feedback":33,"feedback_at":22,"cost_usd":33,"total_tokens":33},1768,"ultraquant-cuts-ai-agent-memory-costs-with-4-bit-kv-caching","UltraQuant Cuts AI Agent Memory Costs with 4-bit KV Caching","A new compression technique for AI agent memory slashes response latency by up to 3.47x on AMD GPUs, without gutting output quality.","Researchers have found a way to squeeze the memory that AI agents burn through during long conversations, with measurable gains in speed and throughput.\n\nThe paper introduces UltraQuant, a 4-bit compression scheme for the key-value (KV) cache — the part of an AI system's memory that stores context across conversation turns. As agents handle longer, multi-round tasks, that cache balloons and starts choking GPU utilization. UltraQuant attacks the problem by storing cache data in FP4 format (a compact numeric representation), using FP8 queries and a technique called Walsh-Hadamard rotation to preserve accuracy. Tested against the FP8 KV baseline on AMD CDNA4 hardware, it cut median time-to-first-token by 3.47x in cache-pressured late conversation rounds and raised output throughput by 1.63x.\n\nKV cache bloat is one of the less glamorous but genuinely hard constraints on deploying context-heavy agents at scale — the kind that power multi-step coding assistants or long-running automation. Cutting cache memory without wrecking quality is the sort of engineering work that makes production deployments cheaper and faster, which matters more than benchmark scores on a fresh context window.\n\nNoteworthy: the work is explicitly anchored to AMD GPUs and vLLM, not the Nvidia stack that dominates most inference research — a deliberate positioning choice, or a signal that AMD's CDNA4 hardware is finally competitive enough to warrant serious optimization work.","[\"ai\",\"inference\",\"llm\",\"hardware\"]","2026-06-19T04:00:00.000Z","2026-06-19T11:31:35.645Z","2026-06-19T14:22:18.843Z","published",null,[],"ai",[24,26,27,28],"inference","llm","hardware",[30],{"name":31,"url":32},"arXiv cs.AI","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.20474",0,{"sections":35},[36,40,44,49,54,58,63,67,71,76,81,86,91,96],{"name":37,"slug":24,"count":38,"latest_published_at":39},"AI",491,"2026-06-19T14:59:11.000Z",{"name":41,"slug":42,"count":43,"latest_published_at":18},"Security","security",132,{"name":45,"slug":46,"count":47,"latest_published_at":48},"Policy","policy",88,"2026-06-16T09:26:09.000Z",{"name":50,"slug":51,"count":52,"latest_published_at":53},"Consumer Tech","consumer-tech",78,"2026-06-16T17:58:24.000Z",{"name":55,"slug":28,"count":56,"latest_published_at":57},"Hardware",62,"2026-06-18T15:24:16.000Z",{"name":59,"slug":60,"count":61,"latest_published_at":62},"Deals","deals",58,"2026-06-19T14:43:50.000Z",{"name":64,"slug":65,"count":61,"latest_published_at":66},"Software","software","2026-06-16T20:00:00.000Z",{"name":68,"slug":69,"count":70,"latest_published_at":18},"Dev Tools","dev-tools",50,{"name":72,"slug":73,"count":74,"latest_published_at":75},"Science","science",38,"2026-06-18T04:00:00.000Z",{"name":77,"slug":78,"count":79,"latest_published_at":80},"Gaming","gaming",31,"2026-06-16T15:25:13.000Z",{"name":82,"slug":83,"count":84,"latest_published_at":85},"General","general",26,"2026-06-13T18:35:15.000Z",{"name":87,"slug":88,"count":89,"latest_published_at":90},"Startups","startups",23,"2026-06-16T15:00:00.000Z",{"name":92,"slug":93,"count":94,"latest_published_at":95},"Reviews","reviews",19,"2026-06-14T08:00:00.000Z",{"name":97,"slug":98,"count":99,"latest_published_at":100},"How-To","how-to",6,"2026-06-16T09:00:00.000Z"]