- Google announced that its Gemma 4 family now supports quantization-aware training (QAT) for the 2‑billion‑parameter and 7‑billion‑parameter variants.
- According to the blog post, QAT shrinks the 2B model from 1.6 GB to 1.1 GB and the 7B model from 7.0 GB to 4.9 GB, a reduction of roughly 30 % in each case. Latency improves by about 20 % on typical mobile CPUs, and accuracy drops by less than 1 % on the standard downstream benchmark.
- For developers, the change means larger language models can now run on smartphones and laptops without heavy cloud reliance. Companies that ship on‑device AI get a cheaper, faster path to deployment, and the smaller memory footprint leaves more headroom for multitasking.
- The update lands on June 5, 2026, alongside the open‑source release of the QAT checkpoints. It’s a modest engineering win, not a headline‑grabbing breakthrough.
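
The "roughly 30 %" figure can be checked directly from the quoted sizes. The short Python sketch below does only that arithmetic; the helper name `reduction_pct` and the `sizes` dictionary are illustrative and not taken from the announcement.

```python
def reduction_pct(before_gb: float, after_gb: float) -> float:
    """Return the relative size reduction as a percentage."""
    return (before_gb - after_gb) / before_gb * 100


# Model sizes in GB, as quoted in the summary above.
sizes = {"2B": (1.6, 1.1), "7B": (7.0, 4.9)}

for name, (before, after) in sizes.items():
    pct = reduction_pct(before, after)
    print(f"{name}: {before} GB -> {after} GB ({pct:.1f} % smaller)")
```

The 2B model comes out at about 31 % smaller and the 7B model at about 30 %, so "roughly 30 %" holds for both.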
