A new benchmark reveals just how far AI models are from reliably helping doctors in practice.
Researchers introduced PhysAssistBench, a benchmark built from real MIMIC-IV hospital records that simulates the full complexity of a clinical encounter. Instead of testing isolated skills (medical trivia, chart-reading, or bedside manner), it puts models through multi-turn scenarios where they must handle a physician's underspecified requests, a patient's vague symptom descriptions, and precise interaction with electronic health record systems, all at once. The evaluation set covers 1,296 manually reviewed, physician-validated turns and is available in two languages. Experiments with leading LLMs show that current models cannot yet coordinate all three demands reliably.
That finding matters because it exposes a structural gap in how clinical AI gets evaluated today. Most benchmarks treat medical knowledge, communication, and tool use as separate tests; a model can ace each in isolation and still fall apart when it has to coordinate them in a single live interaction. The bottleneck, the researchers argue, is not any one capability but the coordination layer between them.
The benchmark is a useful corrective to the marketing narrative that surrounds medical AI. Vendors routinely tout high scores on clinical knowledge exams as evidence of readiness, but passing a multiple-choice board question is a different task from managing an ambiguous patient conversation while pulling the right data from a hospital record system. PhysAssistBench makes that distinction measurable rather than rhetorical.
No model cleared the bar here, which is the point. The gap between "impressive on benchmarks" and "safe to put next to a patient chart" turns out to be wide, and now there is at least one tool designed to quantify it.