Google trims Gemma 4 models with quantization-aware training

Quantization-aware training cuts Gemma 4’s model size by roughly 30% and speeds inference by about 20%, with less than 1% accuracy loss.

1 min read · June 5, 2026 · Original reporting · 1 source

Google announced that its Gemma 4 family now supports quantization-aware training (QAT) for the 2 billion‑parameter and 7 billion‑parameter variants.

The blog post says QAT shrinks the 2B model from 1.6 GB to 1.1 GB and the 7B model from 7.0 GB to 4.9 GB, a reduction of roughly 30% in each case. Latency improves by about 20% on typical mobile CPUs, and accuracy drops by less than 1% on the standard downstream benchmark.
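The "roughly 30%" figure can be checked directly from the quoted sizes. The following is a minimal sketch of that arithmetic; the names `sizes_gb` and `percent_reduction` are illustrative and do not come from Google's post.

```python
# Model sizes in GB as quoted in the article: (before QAT, after QAT).
sizes_gb = {"2B": (1.6, 1.1), "7B": (7.0, 4.9)}


def percent_reduction(before: float, after: float) -> float:
    """Return the size reduction as a percentage of the original size."""
    return (before - after) / before * 100


# Compute the reduction for each variant.
reductions = {
    name: percent_reduction(before, after)
    for name, (before, after) in sizes_gb.items()
}

for name, pct in reductions.items():
    print(f"{name}: about {pct:.0f}% smaller")
```

Rounded to whole percentages, the 2B model comes out about 31% smaller and the 7B model about 30% smaller, which is consistent with the article's "roughly 30%".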

For developers, the change means larger language models can now run on smartphones and laptops with less reliance on the cloud. Companies that ship on‑device AI get a cheaper, faster path to deployment, and the smaller memory footprint leaves more headroom for multitasking.

The update lands on June 5, 2026, alongside the open‑source release of the QAT checkpoints. It’s a modest engineering win, not a headline‑grabbing breakthrough.

The Revision

Written by an AI system from the public sources credited above.