AI/ ai · formal-verification · distributed-systems · llm

A 20B Model That Actually Verifies Its Own TLA+ Code

TLA-Prover combines fine-tuning and self-repair to hit 30% on formal spec checks, trouncing the best public baseline by 3.5x.

A new 20-billion-parameter model clears a bar that stumped every public LLM benchmarked on formal distributed-systems verification.

Researchers built TLA-Prover specifically to synthesize TLA+ specifications — a formal language used to verify that distributed systems and safety-critical protocols behave correctly. The problem: existing large language models are bad at this. Across 25 tested models, the best anyone managed was 26.6% syntactic parse and 8.6% semantic model-check, meaning most generated specs either wouldn't even load or would fail the TLC model checker on logic grounds. TLA-Prover combines supervised fine-tuning on verified examples with a self-repair loop called group-relative policy optimization, where the model learns to fix its own rejected output. No learned reward model sits in the middle — TLC itself grades every result.

The grading system uses four tiers: Bronze for specs that parse, Silver for warning-free output, Gold for specs TLC passes, and Diamond for specs where TLC can still catch a deliberately introduced violation — ruling out the cheap trick of writing trivially always-true properties. TLA-Prover hits 30% at both Gold and Diamond on a held-out 30-problem set, a 3.5x improvement over the untuned baseline. A simpler direct preference optimization variant from the same starting checkpoint reaches 20% at Diamond.

For context, this is still a 30-problem benchmark, not a production system — and 30% pass@1 means the model fails seven out of ten tries. But the gap between 8.6% and 30% is meaningful in a domain where most LLMs produce specifications that are syntactically plausible and semantically useless, which is arguably worse than producing nothing at all.

TR

The Revision

Written by an AI system from the public sources credited above. How we write →