LLMs can now compress model-generated text far more tightly than earlier LLM-based methods.
The authors test two regimes. In lossless mode, LoRA adapters tuned to a domain make arithmetic coding twice as efficient as coding with the base model alone. In lossy mode, prompting for a rewrite and then applying arithmetic coding halves the encoded size, reaching a compression ratio of 0.03.
The bigger surprise is a new interactive protocol called Question-Asking compression (QA). A small model asks a larger model a series of yes/no questions and receives one bit per answer. Across eight benchmarks covering math, science, and code, ten binary questions recover 23% to 72% of the performance gap between the small and large model on standard tasks and 7% to 38% on harder ones. The resulting compression ratios fall between 0.0006 and 0.004, over a hundred times smaller than the previous best LLM-based method.
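To make the one-bit-per-answer accounting concrete, here is a minimal toy sketch of a question-asking exchange. It is not the authors' protocol: the helper names `ask_questions`, `large_model`, and `compression_ratio` are hypothetical, the "small model" is reduced to a halving search over a fixed candidate list, and the ratio is assumed to be transmitted bits divided by the bit length of the full response in UTF-8.

```python
from typing import Callable, List, Tuple


def ask_questions(
    candidates: List[str],
    answer_bit: Callable[[List[str]], bool],
    max_questions: int = 10,
) -> Tuple[str, List[int]]:
    """Narrow `candidates` by halving; every reply from the large model is one bit."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    bits: List[int] = []
    while len(pool) > 1 and len(bits) < max_questions:
        mid = len(pool) // 2
        first_half = pool[:mid]
        # Question: "Is the correct answer among these candidates?"
        reply = answer_bit(first_half)
        bits.append(int(reply))
        pool = first_half if reply else pool[mid:]
    # If the question budget runs out first, fall back to the first remaining candidate.
    return pool[0], bits


def compression_ratio(bits_sent: int, full_response: str) -> float:
    """Assumed definition: bits sent / bits in the full UTF-8 response."""
    return bits_sent / (8 * len(full_response.encode("utf-8")))


# Hypothetical setup: the large model knows the target; the small model only has guesses.
candidates = [f"x = {i}" for i in range(16)]
target = "x = 11"
full_response = f"Isolating x and simplifying both sides gives {target}."


def large_model(subset: List[str]) -> bool:
    """Stand-in for the larger model: answers a yes/no question with a single bit."""
    return target in subset


answer, bits = ask_questions(candidates, large_model)
print(answer, bits)  # x = 11 [0, 1, 0, 0]
print(compression_ratio(len(bits), full_response))
```

Under the same assumed definition of the ratio, ten transmitted bits at a ratio of 0.004 would correspond to a full response of about 2,500 bits (roughly 310 bytes), and at 0.0006 to about 16,700 bits (roughly 2 KB). A real system would replace the halving search with questions generated by the small model itself.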
Why it matters: The results show that compressing knowledge isn't limited to static encoding; a ten-question dialogue can transfer a sizable share of a large model's advantage over a small one. This could cut bandwidth for edge deployments, let small devices draw on powerful models without receiving full responses, and reshape how we think about model distillation.
In short, interactive questioning lets a budget of a few one-bit answers reach compression levels previously thought out of reach, suggesting that future AI pipelines may lean more on back-and-forth protocols than on bulk data transfer.