LLM‑based semantic filters can now run up to twice as fast.
The authors propose a two‑phase cascade that first applies model‑free clustering and falls back to an online proxy only when needed, with oracle calls shared across both stages. Instead of a cosine bi‑encoder, they use off‑the‑shelf token‑aware models, and they train the proxy with the oracle’s per‑document confidence as a soft label. Calibration adds a safety margin only where the labeled sample is sparse, which avoids wasted oracle queries. On three 10K‑document corpora, the method reaches a 90% accuracy target 1.6–2.0× faster than the previous best approach and meets that target on 95% of queries.
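To make the soft-label idea concrete, here is a minimal sketch of training a proxy on oracle confidences rather than hard yes/no labels. It is not the authors' implementation: the logistic-regression proxy, the one-dimensional toy features, and the names `train_soft_label_proxy` and `oracle_confidence` are all assumptions made for illustration. The summary does not say which proxy model or loss the paper uses.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_soft_label_proxy(features, soft_labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression proxy (an assumed stand-in model) by
    full-batch gradient descent on cross-entropy with targets in [0, 1]."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, q in zip(features, soft_labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            # For cross-entropy with a soft target q, the gradient with
            # respect to the logit is simply (p - q).
            err = p - q
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Proxy's estimated probability that document x passes the filter."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy data (hypothetical): one feature per document, plus the oracle's
# confidence that the document satisfies the filter.
features = [[0.0], [1.0], [2.0], [3.0]]
oracle_confidence = [0.05, 0.30, 0.70, 0.95]

w, b = train_soft_label_proxy(features, oracle_confidence)
scores = [predict(w, b, x) for x in features]
```

With hard labels, the two middle documents would be rounded to 0 and 1, and the proxy would never learn that the oracle was unsure about them. Soft targets keep that information, which is one plausible reading of how a single oracle call can carry more training signal.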
The change matters because semantic filtering underpins many LLM‑driven pipelines, from content moderation to data curation. Reducing oracle calls directly lowers compute cost and latency, making large‑scale deployments more economical. Using the oracle’s confidence as a training signal also extracts more value from each expensive call.
In short, the paper shows that smarter cascade composition can cut filtering time by up to half today, and a theoretical lower bound suggests another order of magnitude of savings may be possible as the technique matures.