Automated fact-checkers get worse results when their measuring stick is broken.
A team of researchers has released Credence, a framework for decomposing compound sentences into individual, checkable claims, the step that sits at the front of most automated fact-checking systems. The core problem they address is the standard metric for judging decomposition quality: a token-overlap score called Jaccard-F1, which systematically undercounts quality when a claim is paraphrased rather than copied word-for-word. Credence replaces it with Semantic-F1, which uses cosine similarity on embeddings from a model called BGE-large. Across three benchmarks covering social media, encyclopedia, and news content, Semantic-F1 scores came in 15 to 32 percentage points higher than Jaccard-F1 scores.
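To see why token overlap penalizes paraphrase, here is a minimal Python sketch. It is not the paper's implementation. The matching rule (a claim counts as matched when its best similarity clears a threshold), the thresholds, the function names, and the three-dimensional toy vectors standing in for BGE-large embeddings are all assumptions made for illustration.

```python
import math


def jaccard(a, b):
    """Token-set overlap between two claims (lowercased, whitespace-split)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def matched_f1(predicted, reference, sim, threshold):
    """F1 where an item is 'matched' if its best similarity to the other
    side is at least `threshold` (an assumed matching rule)."""
    if not predicted or not reference:
        return 0.0
    precision = sum(
        max(sim(p, r) for r in reference) >= threshold for p in predicted
    ) / len(predicted)
    recall = sum(
        max(sim(r, p) for p in predicted) >= threshold for r in reference
    ) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A reference claim and a paraphrase of it.
reference_claims = ["The mayor resigned in March"]
predicted_claims = ["In March the city's leader stepped down"]

# Made-up vectors standing in for real sentence embeddings (assumption).
reference_vecs = [[0.90, 0.10, 0.20]]
predicted_vecs = [[0.85, 0.15, 0.25]]

token_f1 = matched_f1(predicted_claims, reference_claims, jaccard, threshold=0.5)
embed_f1 = matched_f1(predicted_vecs, reference_vecs, cosine, threshold=0.8)
print(token_f1, embed_f1)
```

On this toy pair the paraphrase shares only three of nine distinct tokens with the reference, so the token-overlap path counts it as a miss, while the embedding path counts it as a match. Real embeddings would of course need a real model, and a real threshold would need tuning.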
The fix matters because decomposition quality gates everything downstream. A fact-checker that misses a compound claim, or treats a valid paraphrase as a failure, produces unreliable verdicts before a human ever sees them. The researchers also formally proved that rule-based repair in their pipeline is guaranteed to terminate, while LLM-based self-repair is not. That finding imposes a hard engineering requirement on any team using large models in this loop: build an early-exit guard or risk infinite correction cycles.
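One way such an early-exit guard could look is sketched below in Python. This is an assumed design, not the pipeline from the paper: the function names, the round budget, and the cycle check are hypothetical, and `flip_flop` is a stand-in for a model that keeps rewriting its own output without converging.

```python
from typing import Callable, List, Tuple


def repair_with_guard(
    claims: List[str],
    find_violations: Callable[[List[str]], List[str]],
    repair: Callable[[List[str]], List[str]],
    max_rounds: int = 3,
) -> Tuple[List[str], str]:
    """Repair until clean, until an output repeats, or until the budget is spent."""
    seen = {tuple(claims)}
    for _ in range(max_rounds):
        if not find_violations(claims):
            return claims, "clean"
        claims = repair(claims)
        key = tuple(claims)
        if key in seen:
            # The repairer produced a state we already tried: stop looping.
            return claims, "cycle"
        seen.add(key)
    status = "clean" if not find_violations(claims) else "budget_exhausted"
    return claims, status


def has_conjunction(claims: List[str]) -> List[str]:
    """Toy atomicity check: flag any claim that still contains ' and '."""
    return [c for c in claims if " and " in c]


def split_on_and(claims: List[str]) -> List[str]:
    """Toy rule-based repair: split each claim on ' and '."""
    out: List[str] = []
    for c in claims:
        out.extend(part.strip() for part in c.split(" and "))
    return out


def flip_flop(claims: List[str]) -> List[str]:
    """Stand-in for a non-converging repairer that alternates two outputs."""
    a, b = ["A and B"], ["B and A"]
    return b if claims == a else a


print(repair_with_guard(["Prices rose and wages fell"], has_conjunction, split_on_and))
print(repair_with_guard(["A and B"], has_conjunction, flip_flop))
```

The rule-based splitter finishes in one round because each pass strictly removes conjunctions. The flip-flopping repairer never would, so the guard stops it as soon as a previous state reappears, and the round budget catches repairers that keep producing new but still invalid outputs.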
The paper tested four open models ranging from 3.8B to 12B parameters, plus one closed API model, giving practitioners something to calibrate against before picking a decomposer. The news-domain benchmark proved notably harder: atomicity violations dropped 47 to 100 percent relative to baselines, but the error-per-repair rate bottomed out around 0.824, suggesting news language still resists clean decomposition. Anyone selling an AI fact-checking product as solved should read that number carefully.