[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"branding":3,"analytics":7,"article-adaptive-two-phase-cascade-halves-llm-filtering-time":10},{"siteName":4,"siteTagline":5,"publisherName":4,"contactEmail":6},"The Revision","Tech news, decoded.","editor@therevision.news",{"gaMeasurementId":8,"adsenseClientId":9},"G-ZW2MV82GYR","ca-pub-8533917693782264",{"article":11},{"id":12,"slug":13,"title":14,"dek":15,"body_md":16,"tags_json":17,"published_at":18,"created_at":19,"updated_at":20,"status":21,"review_note":22,"review_notes":23,"image_url":22,"persona_id":22,"persona_name":22,"section":22,"tags":30,"sources":34,"feedback":38,"feedback_at":22,"cost_usd":38,"total_tokens":38},1417,"adaptive-two-phase-cascade-halves-llm-filtering-time","Adaptive two-phase cascade halves LLM filtering time","A new adaptive cascade using clustering, token-aware proxies and soft-label training cuts semantic filtering latency by up to 2× while keeping 90% accuracy.","LLM‑based semantic filters can now run up to twice as fast.\n\nThe authors propose a two‑phase cascade that first applies model‑free clustering and only falls back to an online proxy when needed, sharing oracle calls across stages. Instead of a cosine bi‑encoder, they use off‑the‑shelf token‑aware models, and they train the proxy with the oracle’s per‑document confidence as a soft label. Calibration adds a safety margin only where the labeled sample is sparse, avoiding wasted oracle queries. On three 10K‑document corpora, the method meets a 90% accuracy target 1.6–2.0× faster than the previous best approach and succeeds on 95% of queries.\n\nThe change matters because semantic filtering underpins many LLM‑driven pipelines, from content moderation to data curation. Reducing oracle calls directly lowers compute cost and latency, making large‑scale deployments more economical. 
Using the oracle’s confidence as a training signal also extracts more value from each expensive call.\n\nIn short, the paper shows that smarter cascade composition can halve filtering time today, and a theoretical lower bound suggests another order of magnitude of savings may be possible as the technique matures.","[\"large-language-models\",\"semantic-filtering\",\"information-retrieval\"]","2026-06-16T04:00:00.000Z","2026-06-17T08:44:15.980Z","2026-06-17T08:44:18.807Z","published",null,[24],{"id":25,"reviewer":26,"round":27,"reason":28,"status":29},"editor-r1","editor",1,"Add a concise concluding paragraph summarizing the findings and their significance for readers.","resolved",[31,32,33],"large-language-models","semantic-filtering","information-retrieval",[35],{"name":36,"url":37},"arXiv cs.AI","https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.08090",0]