CLARITY lets an autonomous‑driving model shift weight between visible‑light (RGB) and thermal data based on scene lighting. The system uses a vision‑language model to infer illumination conditions, then scales each sensor’s contribution accordingly while preserving dark‑object details and sharpening thin‑object edges.
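As a rough illustration of condition‑aware weighting, the sketch below turns illumination probabilities (such as a vision‑language model might output) into per‑sensor weights and blends two feature vectors. It is not CLARITY's actual implementation: the condition labels, the `CONDITION_LOGITS` table, and the function names are all hypothetical, and a real system would presumably learn these weights and apply them to full feature maps.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical table: illumination label -> (RGB logit, thermal logit).
# These values are made up for illustration; a real model would learn them.
CONDITION_LOGITS: Dict[str, Tuple[float, float]] = {
    "day": (2.0, 0.0),
    "dusk": (0.5, 0.5),
    "night": (0.0, 2.0),
}


def modality_weights(condition_probs: Dict[str, float]) -> Tuple[float, float]:
    """Map illumination probabilities to (rgb_weight, thermal_weight), summing to 1."""
    rgb_logit = sum(p * CONDITION_LOGITS[c][0] for c, p in condition_probs.items())
    thermal_logit = sum(p * CONDITION_LOGITS[c][1] for c, p in condition_probs.items())
    # Two-way softmax, shifted by the max logit for numerical stability.
    shift = max(rgb_logit, thermal_logit)
    e_rgb = math.exp(rgb_logit - shift)
    e_thermal = math.exp(thermal_logit - shift)
    total = e_rgb + e_thermal
    return e_rgb / total, e_thermal / total


def fuse(rgb_feat: List[float], thermal_feat: List[float],
         condition_probs: Dict[str, float]) -> List[float]:
    """Blend two equal-length feature vectors with condition-dependent weights."""
    w_rgb, w_thermal = modality_weights(condition_probs)
    return [w_rgb * r + w_thermal * t for r, t in zip(rgb_feat, thermal_feat)]


# A scene judged to be mostly night leans on the thermal features.
night_scene = {"night": 0.9, "dusk": 0.1}
fused = fuse([0.2, 0.4], [0.8, 0.6], night_scene)
```

A static pipeline corresponds to fixing the two weights regardless of `condition_probs`, which is the behavior the condition‑aware approach is meant to avoid.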
Static fusion pipelines weight every frame the same way, so noise from a degraded channel, such as an RGB camera at night, can drag down the whole network. By contrast, CLARITY’s condition‑aware weighting reduces that bleed‑through, and its hierarchical decoder keeps segment borders tidy across scales. On the MFNet benchmark, an RGB‑thermal segmentation dataset, the method reaches 62.3% mean IoU and 77.5% mean accuracy, edging out the prior best by a few points.
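For readers unfamiliar with the two reported metrics, here is a minimal sketch of how mean IoU and mean accuracy are commonly computed from a confusion matrix. It assumes "mean accuracy" means per‑class recall averaged over classes and that classes absent from the ground truth are skipped. Evaluation scripts differ on such details, so this may not match the paper's exact protocol, and the helper names are hypothetical.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """cm[t][p] counts pixels with ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou_and_accuracy(cm: List[List[int]]) -> Tuple[float, float]:
    """Return (mean IoU, mean per-class accuracy) over classes that occur."""
    n = len(cm)
    ious: List[float] = []
    accs: List[float] = []
    for c in range(n):
        tp = cm[c][c]
        gt = sum(cm[c])                         # pixels whose true class is c
        pred = sum(cm[r][c] for r in range(n))  # pixels predicted as c
        union = gt + pred - tp
        if union > 0:
            ious.append(tp / union)
        if gt > 0:
            accs.append(tp / gt)
    mean_iou = sum(ious) / len(ious) if ious else 0.0
    mean_acc = sum(accs) / len(accs) if accs else 0.0
    return mean_iou, mean_acc


# Toy example: four pixels, two classes.
cm = confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
miou, macc = mean_iou_and_accuracy(cm)
```

Because both metrics average over classes rather than pixels, small or thin classes count as much as large ones, which is why gains on thin‑object edges can move the headline numbers.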
If the gain holds on larger, real‑world fleets, it could narrow the performance gap between day and night driving perception without costly sensor upgrades. The approach also hints at a broader trend: using language models as cheap scene interpreters to steer low‑level vision pipelines.
For now the result rests on a single dataset, but the improvement is enough to raise the question of how long static fusion will remain the default.