A new 20-billion-parameter model clears a bar that stumped every public LLM benchmarked on formal distributed-systems verification.
Researchers built TLA-Prover specifically to synthesize TLA+ specifications — a formal language used to verify that distributed systems and safety-critical protocols behave correctly. The problem: existing large language models are bad at this. Across 25 tested models, the best anyone managed was 26.6% syntactic parse and 8.6% semantic model-check, meaning most generated specs either wouldn't even load or would fail the TLC model checker on logic grounds. TLA-Prover combines supervised fine-tuning on verified examples with a self-repair loop called group-relative policy optimization, where the model learns to fix its own rejected output. No learned reward model sits in the middle — TLC itself grades every result.
The grading system uses four tiers: Bronze for specs that parse, Silver for warning-free output, Gold for specs TLC passes, and Diamond for specs where TLC can still catch a deliberately introduced violation — ruling out the cheap trick of writing trivially always-true properties. TLA-Prover hits 30% at both Gold and Diamond on a held-out 30-problem set, a 3.5x improvement over the untuned baseline. A simpler direct preference optimization variant from the same starting checkpoint reaches 20% at Diamond.
For context, this is still a 30-problem benchmark, not a production system — and 30% pass@1 means the model fails seven out of ten tries. But the gap between 8.6% and 30% is meaningful in a domain where most LLMs produce specifications that are syntactically plausible and semantically useless, which is arguably worse than producing nothing at all.