- Nightjar dynamically toggles speculative decoding to lift LLM serving performance.
The authors introduce Nightjar, a framework that monitors request load and selects the optimal speculative length for each batch. When its multi‑armed‑bandit planner finds speculation counter‑productive, Nightjar disables it and, if GPU memory is tight, offloads the draft model to the CPU. Offloading frees GPU memory for the KV cache, which allows larger batches. In benchmarks with variable arrival rates, Nightjar achieved up to 14.76% higher throughput than traditional speculative decoding and cut latency by as much as 20.18%.
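To make the planner idea concrete, here is a minimal sketch of a bandit that picks a speculative length per batch, where length 0 means speculation is off. It is not the authors' algorithm: the UCB1 rule, the candidate lengths, the class and function names, the simulated rewards, and the memory threshold are all assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


class SpecLengthBandit:
    """UCB1 bandit over candidate speculative lengths; 0 means speculation is off.

    Hypothetical helper for illustration, not Nightjar's actual planner.
    """

    def __init__(self, arms=(0, 2, 4), explore=1.0):
        self.arms = tuple(arms)
        self.explore = explore
        self.counts = defaultdict(int)   # number of pulls per arm
        self.means = defaultdict(float)  # running mean reward per arm
        self.total = 0

    def select(self):
        # Try every arm once before trusting the estimates.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm

        def ucb(arm):
            bonus = math.sqrt(2 * math.log(self.total) / self.counts[arm])
            return self.means[arm] + self.explore * bonus

        return max(self.arms, key=ucb)

    def update(self, arm, reward):
        # Incremental mean update; reward is assumed to be normalized to [0, 1].
        self.counts[arm] += 1
        self.total += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

    def best(self):
        return max(self.arms, key=lambda a: self.means[a])


def simulated_throughput(spec_len, batch_size, rng):
    """Toy normalized throughput. The numbers are made up, not measurements."""
    if batch_size <= 8:   # low load: speculation helps
        base = {0: 0.4, 2: 0.6, 4: 0.7}[spec_len]
    else:                 # high load: speculation hurts
        base = {0: 0.7, 2: 0.5, 4: 0.3}[spec_len]
    return base + rng.uniform(-0.04, 0.04)


def should_offload_draft(spec_len, free_gpu_fraction, threshold=0.1):
    """Offload the draft model only when speculation is off and memory is tight.

    The 10% threshold is an arbitrary illustrative value.
    """
    return spec_len == 0 and free_gpu_fraction < threshold


def run(batch_size, steps=500, seed=0):
    rng = random.Random(seed)
    bandit = SpecLengthBandit()
    for _ in range(steps):
        arm = bandit.select()
        bandit.update(arm, simulated_throughput(arm, batch_size, rng))
    return bandit.best()
```

Under these toy rewards, `run(4)` settles on the longest speculative length and `run(64)` settles on 0, at which point `should_offload_draft` would report whether the draft model is worth moving off the GPU. A real planner would likely condition on load directly rather than training one bandit per batch size.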
Nightjar targets a known weakness of fixed‑length speculation: it helps only in low‑load, memory‑bound settings and hurts throughput at compute‑bound peaks, where verifying draft tokens that are later rejected wastes compute. By making speculation load‑aware, Nightjar runs the draft model only when it actually speeds up inference, leaving compute free for heavy traffic. This could narrow the performance gap between dedicated inference servers and more flexible, shared‑resource deployments.
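The load dependence can be illustrated with a toy cost model. The sketch below is an assumption-laden illustration, not the paper's analysis: it uses the standard geometric-acceptance estimate for tokens per verification step and a simple roofline-style step time, and every constant (`mem_time`, `compute_per_token`, `draft_time`, `accept_rate`) is a made-up value.

```python
def expected_tokens(accept_rate, spec_len):
    """Expected tokens emitted per verification step under geometric acceptance."""
    if spec_len == 0:
        return 1.0
    if accept_rate >= 1.0:
        return float(spec_len + 1)
    return (1 - accept_rate ** (spec_len + 1)) / (1 - accept_rate)


def step_time(batch_size, spec_len, mem_time=1.0,
              compute_per_token=0.02, draft_time=0.05):
    """Toy roofline: a target-model step costs the larger of the
    weight-loading time and the compute time, plus the draft passes."""
    compute = compute_per_token * batch_size * (spec_len + 1)
    return max(mem_time, compute) + spec_len * draft_time


def speedup(batch_size, spec_len, accept_rate=0.7):
    """Per-request token rate with speculation, relative to no speculation."""
    baseline = 1.0 / step_time(batch_size, 0)
    speculative = expected_tokens(accept_rate, spec_len) / step_time(batch_size, spec_len)
    return speculative / baseline
```

With these illustrative constants, `speedup(4, 4)` is above 1 because the extra verified tokens hide under the memory-bound step time, while `speedup(128, 4)` falls below 1 because every extra verified token adds compute the batch cannot spare.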
In short, Nightjar shows that adaptive speculation can extract measurable gains without the memory penalties that have limited earlier approaches.