LLMs Confidently Wrong on Clinical Data, Study Finds

LLMs don't know what they don't know — and on structured clinical data, that blind spot could matter.

Researchers compared Qwen 2.5 7B against XGBoost on a clinical prediction task and found that the LLM's self-reported confidence barely moved, hovering between 0.856 and 0.937, whether its actual accuracy was 49% or 75.3%. The confidence figure tracked prompt format, not correctness. Stranger still, the LLM performed worst exactly where XGBoost was most certain: its accuracy fell to 64.8% on cases where XGBoost hit 99%, yet the two models nearly tied on cases where XGBoost was only moderately confident.

That inverse difficulty effect is the paper's sharpest finding. It suggests LLMs may be systematically unreliable on the structured, feature-heavy cases that clinical decision support handles best: the clear-cut ones, not the ambiguous edge cases. The researchers frame this as a cold-start problem: without retraining, LLMs lack the internal signals to know when to trust themselves on tabular data.

The proposed fix is a cross-model calibrator that uses attribution divergence between the LLM and XGBoost to generate patient-specific reliability estimates. Pairing that with few-shot examples and SHAP-derived feature evidence cut the Attribution Disagreement Score from 1.54 to 0.38 and lifted accuracy from 49% to 75.3% — no retraining required. Expected calibration error dropped from 0.254 to 0.080.
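
For readers unfamiliar with the metric, expected calibration error (ECE) measures the gap between a model's stated confidence and its observed accuracy. The sketch below uses the common equal-width binning formulation in Python. The article does not say which variant or bin count the researchers used, so the function name, the 10-bin default, and the toy inputs are illustrative assumptions rather than the study's method.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |mean confidence - accuracy| per bin.

    confidences: floats in [0, 1], one stated confidence per prediction.
    correct:     booleans, True where the prediction was right.
    n_bins:      number of equal-width confidence bins (assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the share of predictions it holds.
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece


# Toy data, not from the study: confidence stays at 0.9 while only half
# of the predictions are correct.
toy_confidences = [0.9, 0.9, 0.9, 0.9]
toy_correct = [True, True, False, False]
toy_ece = expected_calibration_error(toy_confidences, toy_correct)
```

On the toy inputs, a model that always reports 0.9 but is right only half the time scores an ECE of about 0.4. That is the pattern the study describes: confidence that stays high while accuracy moves.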

The study is a useful reminder that confidence scores from LLMs are closer to formatting artifacts than epistemic signals — a distinction that becomes consequential when the output informs a clinical decision.
