Cross-modal fusion lifts time-to-event predictions in CT-EHR data

Multimodal survival models finally get a systematic test.

Researchers built a foundation‑model pipeline that encodes CT images and longitudinal EHR data separately, then aligns them in a shared latent space using one of four fusion strategies: late fusion, contrastive alignment, cross‑attention and co‑attention. They trained on two large, multi‑institutional cohorts, one for pulmonary embolism (PE) mortality (3,099 training cases) and one for cardiovascular disease outcomes (2,951 training cases), and validated the models both internally and on external sites.
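
To make two of those fusion strategies concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes each patient already has one CT embedding and one EHR embedding of the same dimension. The function names, the symmetric InfoNCE-style loss and the temperature of 0.1 are illustrative assumptions; the study's actual loss and hyperparameters are not given here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def late_fusion(ct_emb, ehr_emb):
    # Late fusion in its simplest form: concatenate per-patient embeddings
    # and hand the result to a downstream survival head.
    return np.concatenate([ct_emb, ehr_emb], axis=1)

def contrastive_alignment_loss(ct_emb, ehr_emb, temperature=0.1):
    # Symmetric InfoNCE-style loss (an assumed choice): row i of ct_emb and
    # row i of ehr_emb belong to the same patient and should score highest.
    ct = l2_normalize(ct_emb)
    ehr = l2_normalize(ehr_emb)
    logits = ct @ ehr.T / temperature
    idx = np.arange(len(ct))
    loss_ct_to_ehr = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_ehr_to_ct = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ct_to_ehr + loss_ehr_to_ct)

# Synthetic embeddings for illustration only (not study data).
rng = np.random.default_rng(0)
ct = rng.normal(size=(8, 16))
ehr = ct + 0.01 * rng.normal(size=(8, 16))

fused = late_fusion(ct, ehr)
loss_paired = contrastive_alignment_loss(ct, ehr)
loss_mismatched = contrastive_alignment_loss(ct, np.roll(ehr, 1, axis=0))
```

In this toy setup the loss is low when each CT embedding is paired with its own patient's EHR embedding and high when the pairing is shifted by one row, which is the signal a contrastive objective uses to pull the two modalities into a shared space.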

Alignment improved the concordance index by 1.5–5.4% over single‑modality baselines whenever both modalities contributed meaningfully. Contrastive fusion with CLMBR embeddings gave the most reliable gains for PE mortality. For major adverse cardiac events, cross‑attention narrowly led on internal validation, while image‑guided co‑attention performed best on external data. The study shows that handling modality imbalance with task‑aware fusion is key to robust, scalable clinical models.
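
For readers unfamiliar with the metric, the concordance index is the fraction of comparable patient pairs in which the model assigns the higher risk to the patient who has the event first. The sketch below is a plain implementation of Harrell's version with half credit for tied risk scores. It is illustrative only: the study's exact evaluation code and its handling of ties are assumptions here, and the data are made up.

```python
import numpy as np

def concordance_index(times, events, risk_scores):
    """Harrell's C-index: higher risk should mean an earlier event.

    times       -- observed follow-up time per patient
    events      -- 1 if the event was observed, 0 if censored
    risk_scores -- model output, larger means higher predicted risk
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    risk = np.asarray(risk_scores, dtype=float)

    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:  # patient i had the event before time j
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Toy data for illustration only (not study data).
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
c_perfect = concordance_index(times, events, [4, 3, 2, 1])
c_reversed = concordance_index(times, events, [1, 2, 3, 4])
c_constant = concordance_index(times, events, [1, 1, 1, 1])
```

A perfectly ordered model scores 1.0, a perfectly reversed one 0.0, and a model that gives every patient the same risk 0.5, which is why gains of a few points over a single‑modality baseline are usually treated as meaningful.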

In practice, the work suggests that hospitals can reuse existing imaging and EHR pipelines but must choose a fusion method that matches the prediction task and data distribution.
