Adversarial images can hijack cheap multimodal LLM cascades

A new paper shows that adversarial image perturbations can systematically force multimodal LLM cascades to route queries to the heavyweight model.

The authors describe the Forced Deferral Attack (FDA), which adds a universal border trigger to input images. The trigger lowers the confidence scores of the cheap front‑end model, so the cascade’s deferral logic hands the request off to the strong back‑end model. FDA trains the trigger with a temperature‑flattened loss that pushes the weak model’s token distribution toward a flatter target derived from its own clean outputs. Across several benchmark datasets, model families, and deferral metrics, FDA outperforms standard image‑perturbation and prompt‑injection baselines, raising the share of queries routed to the strong model without changing the answer content.
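To make the mechanism concrete, here is a minimal Python sketch of confidence-based deferral and a temperature-flattened training target. It is an illustration under stated assumptions, not the paper's implementation: the max-probability confidence score, the 0.7 threshold, the temperature of 4.0, the cross-entropy form of the loss, and all function names are hypothetical choices made for this example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def should_defer(weak_logits, threshold=0.7):
    """Assumed deferral rule: hand off when the weak model's top probability is low."""
    return max(softmax(weak_logits)) < threshold


def flattened_target(clean_logits, temperature=4.0):
    """Flatter target built from the weak model's own clean (unattacked) logits."""
    return softmax(clean_logits, temperature)


def flattening_loss(attacked_logits, clean_logits, temperature=4.0):
    """Cross-entropy between the flattened clean target and the attacked prediction.

    Assumed loss form: an attacker would minimize this with respect to the
    trigger pixels, which is outside the scope of this sketch.
    """
    target = flattened_target(clean_logits, temperature)
    predicted = softmax(attacked_logits)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


# Hypothetical logits for one token position of the weak model.
clean_logits = [4.0, 1.0, 0.0]
attacked_logits = [1.0, 0.25, 0.0]  # same ranking, much flatter distribution

clean_deferred = should_defer(clean_logits)
attacked_deferred = should_defer(attacked_logits)
```

In this toy case the top-ranked token is the same before and after the attack, but the flatter distribution falls below the threshold, so only the attacked input is deferred. That mirrors the reported behavior of changing the routing decision while leaving the answer content alone.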

This matters because cascade architectures are a leading strategy for scaling multimodal services while controlling compute costs. If attackers can trigger expensive inference at will, providers face higher operating bills and possible service degradation. The vulnerability also highlights a broader security blind spot: confidence‑based routing can be weaponized without tampering with the final answer.
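The cost impact can be sketched with simple arithmetic. The example below assumes a cascade that always runs the weak model first and calls the strong model only on deferral. The cost units and deferral rates are hypothetical placeholders, not figures from the paper.

```python
def expected_cost_per_query(weak_cost, strong_cost, deferral_rate):
    """Expected cost when every query pays for the weak model and deferred ones also pay for the strong model."""
    if not 0.0 <= deferral_rate <= 1.0:
        raise ValueError("deferral_rate must be between 0 and 1")
    return weak_cost + deferral_rate * strong_cost


# Hypothetical costs: the strong model is 20x as expensive as the weak one.
baseline_cost = expected_cost_per_query(weak_cost=1.0, strong_cost=20.0, deferral_rate=0.25)
attacked_cost = expected_cost_per_query(weak_cost=1.0, strong_cost=20.0, deferral_rate=0.75)
```

With these placeholder numbers, raising the deferral rate from 25% to 75% moves the expected cost from 6 to 16 units per query. Because the weak model still runs on every request, a fully deferred query costs more than sending it to the strong model directly.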

The paper adds to a growing list of attacks that target system‑level decisions rather than model outputs, suggesting that future cascades will need robust confidence estimation or additional checks before delegating work.
