Nightjar boosts LLM serving throughput by up to 14.8%

Nightjar adapts speculative decoding to load, trimming draft use and freeing GPU memory, delivering up to 14.8% more throughput and 20% lower latency.

  • Nightjar dynamically toggles speculative decoding to lift LLM serving performance.

The authors introduce Nightjar, a framework that watches request load and selects the optimal speculative length per batch. When a multi‑armed‑bandit planner flags speculation as counter‑productive, Nightjar disables it and moves the draft model to the CPU if GPU memory is tight. This frees KV‑cache space, allowing larger batches. In benchmarks with variable arrival rates, Nightjar posted up to 14.76% higher throughput than traditional speculative decoding and cut latency by as much as 20.18%.
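The selection loop described above can be sketched roughly as follows. This is a minimal illustration, not Nightjar's actual planner: the source does not say which bandit algorithm is used, so epsilon-greedy is swapped in here, and the names `SpecLengthBandit`, `simulated_throughput`, and `should_offload_draft`, the candidate lengths, the throughput figures, and the memory threshold are all hypothetical.

```python
import random


class SpecLengthBandit:
    """Epsilon-greedy bandit over candidate speculative lengths.

    A length of 0 means speculation is disabled for the batch.
    (Assumption: the real planner may use a different bandit algorithm.)
    """

    def __init__(self, lengths=(0, 2, 4), epsilon=0.1, seed=0):
        self.lengths = list(lengths)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {k: 0 for k in self.lengths}
        self.mean_reward = {k: 0.0 for k in self.lengths}

    def best(self):
        # Arm with the highest observed mean reward so far.
        return max(self.lengths, key=lambda k: self.mean_reward[k])

    def choose(self):
        # Try every arm once before exploiting.
        for k in self.lengths:
            if self.counts[k] == 0:
                return k
        # Explore with probability epsilon, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.lengths)
        return self.best()

    def update(self, k, reward):
        # Incremental running mean of the reward for arm k.
        self.counts[k] += 1
        self.mean_reward[k] += (reward - self.mean_reward[k]) / self.counts[k]


def simulated_throughput(k, high_load):
    """Made-up tokens/s figures, used only to drive the sketch."""
    table = {0: 100.0, 2: 90.0, 4: 80.0} if high_load else {0: 100.0, 2: 130.0, 4: 150.0}
    return table[k]


def should_offload_draft(chosen_length, free_gpu_fraction, threshold=0.1):
    """Hypothetical rule: offload the draft model to the CPU only when
    speculation is off and free GPU memory is below the threshold."""
    return chosen_length == 0 and free_gpu_fraction < threshold


# Under sustained high load the bandit settles on disabling speculation.
bandit = SpecLengthBandit()
for _ in range(200):
    k = bandit.choose()
    bandit.update(k, simulated_throughput(k, high_load=True))
```

In this toy setup the rewards are fixed, so the running means converge immediately. A real planner would presumably need to handle shifting load, for example by discounting old observations.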

The significance lies in addressing a known flaw: fixed‑length speculation helps only in low‑load, memory‑bound settings and hurts during compute‑bound load peaks. By making speculation load‑aware, Nightjar keeps the draft model only when it actually speeds inference, preserving compute resources for heavy traffic. This could narrow the performance gap between dedicated inference servers and more flexible, shared‑resource deployments.
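A toy cost model can make the memory-bound versus compute-bound distinction concrete. The sketch below is not taken from the Nightjar work. It uses the standard expected-accepted-tokens formula for speculative decoding under an independent per-token acceptance rate, and the cost parameters (`draft_cost`, `verify_cost_per_token`) and the numbers plugged in are illustrative assumptions.

```python
def expected_tokens_per_step(accept_rate, k):
    """Expected tokens emitted per verification step when k tokens are drafted,
    assuming each drafted token is accepted independently with accept_rate < 1."""
    return (1 - accept_rate ** (k + 1)) / (1 - accept_rate)


def relative_speedup(accept_rate, k, draft_cost, verify_cost_per_token):
    """Tokens per unit time relative to plain decoding (one token at cost 1).

    Assumed step cost: 1 for the target-model pass, plus k draft passes,
    plus an extra verification cost for each of the k drafted tokens.
    """
    step_cost = 1.0 + k * draft_cost + k * verify_cost_per_token
    return expected_tokens_per_step(accept_rate, k) / step_cost


# Memory-bound (low load): extra verified tokens are assumed to be nearly free.
memory_bound = relative_speedup(0.7, 4, draft_cost=0.05, verify_cost_per_token=0.0)

# Compute-bound (peak load): each extra verified token is assumed to cost
# about as much as a full decoding step.
compute_bound = relative_speedup(0.7, 4, draft_cost=0.05, verify_cost_per_token=1.0)
```

Under these assumptions the same speculative length yields a speedup above 1 in the memory-bound case and below 1 in the compute-bound case, which is the pattern a load-aware policy tries to exploit.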

In short, Nightjar shows that adaptive speculation can extract measurable gains without the memory penalties that have limited earlier approaches.

The Revision

Written by an AI system from the public sources credited above.