TxAgent took first place in the NeurIPS 2025 CURE‑Bench therapeutic reasoning challenge.
The team behind TxAgent entered the competition with a Llama‑3.1‑8B model fine‑tuned for iterative retrieval‑augmented generation. The system generates function calls to a unified biomedical toolkit (FDA Drug API, OpenTargets, and Monarch), then incorporates the returned data into its reasoning trace. In the benchmark, which scores correctness, tool usage, and reasoning quality, TxAgent earned the Excellence Award in Open Science. The authors report that tightening retrieval for function calls lifted overall performance.
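The call-then-incorporate loop described above can be illustrated with a minimal sketch. This is not TxAgent's code: the tool function, the canned label text, the `toy_policy` stand-in for the fine-tuned model, and the `Step` record are all hypothetical names chosen for illustration, and no real FDA, OpenTargets, or Monarch endpoint is contacted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    kind: str      # "call", "observation", or "answer"
    content: str


def lookup_drug_label(drug: str) -> str:
    """Hypothetical stand-in for a drug-label tool; returns canned text."""
    labels = {
        "warfarin": "Warfarin: anticoagulant; aspirin co-use raises bleeding risk.",
    }
    return labels.get(drug.lower(), f"No label found for {drug}.")


# Registry mapping tool names to callables (assumed interface: str -> str).
TOOLS: Dict[str, Callable[[str], str]] = {"drug_label": lookup_drug_label}


def toy_policy(question: str, trace: List[Step]) -> Dict[str, str]:
    """Stand-in for the model: call a tool first, then answer from its output."""
    observations = [s.content for s in trace if s.kind == "observation"]
    if not observations:
        return {"tool": "drug_label", "arg": "warfarin"}
    return {"answer": f"Based on the label: {observations[-1]}"}


def run_agent(
    question: str,
    policy: Callable[[str, List[Step]], Dict[str, str]],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> List[Step]:
    """Alternate between policy decisions and tool calls, logging every step."""
    trace: List[Step] = []
    for _ in range(max_steps):
        action = policy(question, trace)
        if "answer" in action:
            trace.append(Step("answer", action["answer"]))
            break
        name, arg = action["tool"], action["arg"]
        trace.append(Step("call", f"{name}({arg})"))
        tool = tools.get(name)
        # Unknown tools are recorded in the trace rather than raising an error.
        result = tool(arg) if tool else f"Unknown tool: {name}"
        trace.append(Step("observation", result))
    return trace


if __name__ == "__main__":
    for step in run_agent("Can warfarin be combined with aspirin?", toy_policy, TOOLS):
        print(f"[{step.kind}] {step.content}")
```

Every tool call and its returned data land in the trace before the answer does, which is the property that makes this style of agent auditable: a reviewer can check each claim in the answer against the observation that produced it.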
This matters because therapeutic AI must link patient data to up‑to‑date drug information without hallucinating. By treating tool invocations as supervised signals, TxAgent shows a path to safer, more transparent decision support than conventional RAG models, which rely on static knowledge, can offer. If the approach scales, hospitals could deploy assistants that reliably suggest dosing or flag interactions while staying auditable.
The result is a reminder that raw language model size matters less than the surrounding ecosystem. Earlier agents that simply queried static texts fell short on safety checks; TxAgent’s tool‑driven loop narrows that gap. Still, the benchmark is a simulated setting, so real‑world validation remains a hurdle.