Inside the Black Box: Catching AI Deception as It Forms

Detecting bad behavior in AI models just got a lot more granular.

A research team published a method for catching misaligned behavior in large language models by reading the model's internal activations — not just its outputs. Their system decomposes misalignment into 18 fine-grained cognitive signals, spanning behaviors like strategic deception, sandbagging, and self-preservation. Linear probes trained on those signals matched a strong LLM-based judge on out-of-distribution benchmarks, hitting 0.935 AUROC while keeping false positives low on normal traffic. They also built an automated pipeline to generate multi-turn training conversations, reducing the hand-labeling burden.
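For readers who want a feel for what a linear probe is, here is a minimal sketch, not the team's actual pipeline. It assumes synthetic vectors stand in for real model activations, that a single binary signal is being probed, and that plain logistic regression is the probe; all function names and parameters (`make_synthetic_activations`, `train_linear_probe`, `shift`, and so on) are hypothetical. AUROC is computed directly as the chance that a random positive example outscores a random negative one.

```python
import numpy as np

def make_synthetic_activations(n, direction, shift, rng):
    """Hypothetical stand-in for hidden states: Gaussian noise, with
    positive examples shifted along one fixed 'signal' direction."""
    labels = rng.integers(0, 2, size=n)
    noise = rng.normal(size=(n, direction.shape[0]))
    return noise + shift * labels[:, None] * direction, labels

def train_linear_probe(acts, labels, lr=0.1, steps=500, l2=1e-3):
    """Fit a logistic-regression probe (weights w, bias b) by gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = np.clip(acts @ w + b, -30.0, 30.0)
        probs = 1.0 / (1.0 + np.exp(-logits))
        err = probs - labels
        w -= lr * (acts.T @ err / len(labels) + l2 * w)
        b -= lr * err.mean()
    return w, b

def auroc(scores, labels):
    """Probability that a random positive scores above a random negative
    (ties count as half), which equals the area under the ROC curve."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

rng = np.random.default_rng(0)
dim = 32                                   # assumed activation width
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)     # unit-length signal direction

train_x, train_y = make_synthetic_activations(400, direction, 3.0, rng)
test_x, test_y = make_synthetic_activations(400, direction, 3.0, rng)

w, b = train_linear_probe(train_x, train_y)
test_auc = auroc(test_x @ w + b, test_y)   # score held-out examples
print(f"held-out AUROC: {test_auc:.3f}")
```

The synthetic data here is far easier than real activations, so the printed number says nothing about the published result; the point is only the shape of the technique: one weight vector per signal, scored against held-out examples.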

Most current safety monitoring watches what a model says — not what it is, in some sense, "thinking." Probing internal activations is a meaningfully different bet: if misaligned intent leaves a detectable trace in the model's representations before it shapes the output, you can catch it earlier and more reliably than any output filter can. That early-warning angle matters most in high-stakes deployments where a single deceptive response can be costly.
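One way to picture the early-warning idea is a monitor that scores every turn of a conversation and raises a flag the first time a probe score crosses a cutoff. The sketch below is an illustration under stated assumptions, not the researchers' system: the scores are made-up numbers, the cutoff is set so that roughly 1% of simulated benign traffic exceeds it, and the names `calibrate_threshold` and `first_flagged_turn` are hypothetical.

```python
import numpy as np

def calibrate_threshold(benign_scores, target_fpr=0.01):
    """Pick the cutoff that only about target_fpr of benign scores exceed."""
    return float(np.quantile(benign_scores, 1.0 - target_fpr))

def first_flagged_turn(turn_scores, threshold):
    """Return the index of the first turn scoring above the cutoff, or None."""
    for i, score in enumerate(turn_scores):
        if score > threshold:
            return i
    return None

rng = np.random.default_rng(0)

# Assumption: probe scores on normal traffic look roughly standard normal.
benign_scores = rng.normal(0.0, 1.0, size=10_000)
threshold = calibrate_threshold(benign_scores, target_fpr=0.01)

# Made-up per-turn probe scores for one conversation that drifts upward.
conversation = [0.2, 0.5, 1.1, 4.0, 4.5]
flagged = first_flagged_turn(conversation, threshold)

empirical_fpr = float((benign_scores > threshold).mean())
print(f"threshold={threshold:.2f}, flagged turn={flagged}, benign FPR={empirical_fpr:.3f}")
```

Calibrating on benign traffic first is a design choice: it fixes the false-alarm budget up front, so any gain in detection does not come at the cost of flooding reviewers with flags on ordinary conversations.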

The approach sits squarely in the interpretability tradition that labs like Anthropic and DeepMind have been funding for years, but applies it to a concrete safety problem rather than abstract feature mapping. Whether it holds up as models get larger, or when they are specifically trained to obscure their internal states, remains an open question.
