A new benchmark called IHBench targets one of the messier problems in voice AI: what happens after a user interrupts a bot mid-sentence.
Researchers introduced IHBench to evaluate post-interruption recovery in voice agents running structured, multi-step workflows, the kind used in customer service, healthcare scheduling, and account management. The benchmark tests six types of interruptions, injected at controlled points, and scores each on two axes: whether the agent completed the task and how cleanly it recovered. The evaluation covers 27 audio-language model configurations from OpenAI, Google, and open-weight sources across 10 enterprise domains. Existing speech benchmarks, the authors note, measure the timing of interruptions (when a model detects a barge-in or yields a turn) but not whether the agent afterward picks up at the right step without repeating content the user already heard.
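To make the two scoring axes concrete, here is a minimal Python sketch of what a recovery check could look like. It is not IHBench's actual scoring code: the `score_recovery` function, the `RecoveryResult` fields, and the assumption that the agent should resume at the step that was cut off are all hypothetical, chosen only to illustrate the idea of "right step, no repeated content."

```python
from dataclasses import dataclass


@dataclass
class RecoveryResult:
    task_completed: bool       # axis 1: were all remaining steps delivered?
    resumed_correctly: bool    # axis 2a: did the agent pick up at the cut-off step?
    repeated_steps: list[int]  # axis 2b: already-heard steps the agent replayed


def score_recovery(total_steps: int, interrupted_at: int,
                   steps_after: list[int]) -> RecoveryResult:
    """Score one interrupted conversation (hypothetical scheme).

    total_steps    -- number of workflow steps, indexed 0..total_steps-1
    interrupted_at -- step being spoken when the user interrupted;
                      every earlier step is assumed to have been fully heard
    steps_after    -- ordered step indices the agent delivered after
                      handling the interruption
    """
    remaining = set(range(interrupted_at, total_steps))
    task_completed = remaining <= set(steps_after)
    resumed_correctly = bool(steps_after) and steps_after[0] == interrupted_at
    repeated_steps = [s for s in steps_after if s < interrupted_at]
    return RecoveryResult(task_completed, resumed_correctly, repeated_steps)


# Clean recovery: the agent resumes at step 2 and finishes.
clean = score_recovery(5, 2, [2, 3, 4])

# Restart from step one: the task completes, but steps 0 and 1 are replayed.
restart = score_recovery(5, 2, [0, 1, 2, 3, 4])
```

Under this sketch, the restart case still counts as a completed task, which is why a second axis for recovery quality is needed: fulfillment alone cannot distinguish a clean resume from a full replay.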
The gap matters because real conversations rarely follow the script. A voice agent that detects an interruption but then restarts from step one, or ignores the interjection entirely, is of little use in production. The benchmark also surfaces a consistent divide: closed-weight models outperform open-weight ones on task fulfillment, degrade roughly 3.3 times more slowly as conversations lengthen, and show no penalty for handling audio versus text, whereas open-weight models lose ground on all three counts.
That 3.3x degradation gap is the number to watch. It suggests that for enterprise deployments where calls routinely run long, the choice between closed and open weights is not just a cost decision but a reliability one.