LLM guardrails vulnerable to denial‑of‑service loops

LLM‑based guardrails can be trapped in endless reasoning, turning them into denial‑of‑service weapons.

The paper introduces two attack pipelines. The first uses a beam‑search optimiser: an LLM is fed a bank of strategies and generates payloads that maximise the length of the guardrail’s internal chain‑of‑thought. The second relies on structural mutations that exploit the guardrail’s schema‑following logic and needs far less compute. In controlled tests, the payloads inflate token counts by 13‑63× on eight popular model backbones, including Claude, GPT, Gemini, DeepSeek and Qwen. Against real‑world agents (web bots, desktop helpers, code generators and multi‑agent systems), the same tricks cause latency spikes of up to 148×, and a single poisoned document can hog shared guardrail resources, starving other agents.

Why it matters: guardrails are marketed as the last line of defense against jailbreaks, yet their own reasoning engine becomes the Achilles’ heel. The attacks never need to slip past semantic checks; they succeed by exhausting the compute those checks run on. This flips the security narrative: keeping a guardrail available may now require throttling or cost‑bounding its reasoning depth, a design shift not yet reflected in most commercial deployments.
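
To make the cost‑bounding idea concrete, here is a minimal sketch of a wrapper that caps how much reasoning a guardrail may spend on one input. It is not the paper's method or any vendor's API. Every name in it (Verdict, bounded_guardrail, the ALLOW/BLOCK sentinel tokens, the default budgets) is a hypothetical stand‑in, and the guardrail is modelled simply as a stream of reasoning tokens.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class Verdict:
    allowed: bool
    reason: str
    tokens_used: int


def bounded_guardrail(
    reasoning_stream: Iterable[str],
    max_tokens: int = 512,        # assumed budget, tune per deployment
    max_seconds: float = 2.0,     # assumed wall-clock budget
    clock: Callable[[], float] = time.monotonic,
) -> Verdict:
    """Consume reasoning tokens until a verdict arrives or a budget runs out.

    Hypothetical protocol: the stream ends its reasoning with the token
    "ALLOW" or "BLOCK". Any budget overrun fails closed (input rejected).
    """
    deadline = clock() + max_seconds
    used = 0
    for token in reasoning_stream:
        used += 1
        if token in ("ALLOW", "BLOCK"):
            return Verdict(token == "ALLOW", "verdict", used)
        if used >= max_tokens:
            return Verdict(False, "token_budget_exceeded", used)
        if clock() > deadline:
            return Verdict(False, "time_budget_exceeded", used)
    # Stream ended without a verdict: also fail closed.
    return Verdict(False, "no_verdict", used)


def benign() -> Iterator[str]:
    """Stand-in for a guardrail that reasons briefly and decides."""
    yield from ["check", "input", "ALLOW"]


def runaway() -> Iterator[str]:
    """Stand-in for a guardrail trapped in an endless reasoning loop."""
    while True:
        yield "re-verify"


if __name__ == "__main__":
    print(bounded_guardrail(benign()))
    print(bounded_guardrail(runaway(), max_tokens=100))
```

Failing closed is a design choice, not a given: it caps the compute a single poisoned input can consume, at the price of rejecting that input outright. A deployment that shares one guardrail across agents would probably also want per‑agent budgets, so that one agent's runaway input cannot starve the rest.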

In short, the study shows that availability, not just correctness, is at risk for LLM agents. Until vendors harden guardrails against runaway loops, shared AI services could see intermittent outages triggered by a single malicious document.
