RoTRAG boosts multi-turn chat harm detection with rule retrieval

# RoTRAG adds rule‑based grounding to chat safety checks

Researchers introduced RoTRAG, a retrieval-augmented pipeline that grounds each dialogue turn in short, human-written moral norms called Rules of Thumb. A lightweight binary router first decides whether a turn needs fresh retrieval or can rely on the rules already in context. The retrieved rules are then passed to a large language model, which produces a turn-level harm judgment and an overall severity rating.
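The control flow described above (route, retrieve if needed, then judge) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the rule list, the word-overlap retriever, the Jaccard-threshold router, the keyword-based `judge` stand-in for the language model, and the severity labels are all hypothetical placeholders.

```python
from __future__ import annotations

# Hypothetical rule store: a few illustrative Rules of Thumb (not from the paper).
RULES = [
    "It is wrong to threaten to hurt other people.",
    "You should not encourage someone to harm themselves.",
    "It is good to help friends who are struggling.",
]

# Hypothetical keyword list used only by the toy judge below.
HARM_TERMS = {"hurt", "harm", "threaten"}


def tokens(text: str) -> set[str]:
    """Lowercase word set with surrounding punctuation stripped."""
    return {w.strip(".,!?;:'\"").lower() for w in text.split()} - {""}


def retrieve(turn: str, k: int = 2) -> list[str]:
    """Toy retriever: rank rules by word overlap with the turn, keep the top k."""
    ranked = sorted(RULES, key=lambda r: len(tokens(r) & tokens(turn)), reverse=True)
    return ranked[:k]


def needs_retrieval(turn: str, last_query: str | None, threshold: float = 0.3) -> bool:
    """Toy binary router (assumption): retrieve again only when the turn has
    drifted from the turn that triggered the last retrieval."""
    if last_query is None:
        return True
    a, b = tokens(turn), tokens(last_query)
    union = a | b
    jaccard = len(a & b) / len(union) if union else 1.0
    return jaccard < threshold


def judge(turn: str, rules: list[str]) -> bool:
    """Placeholder for the LLM call: flag the turn if it shares a harm term
    with any rule currently in context."""
    flagged = tokens(turn) & HARM_TERMS
    return any(flagged & tokens(rule) for rule in rules)


def run_dialogue(turns: list[str]) -> tuple[list[bool], str, int]:
    """Return per-turn harm verdicts, a coarse severity label, and the
    number of retrieval calls actually made."""
    cached_rules: list[str] = []
    last_query: str | None = None
    retrievals = 0
    verdicts: list[bool] = []
    for turn in turns:
        if needs_retrieval(turn, last_query):
            cached_rules, last_query = retrieve(turn), turn
            retrievals += 1
        verdicts.append(judge(turn, cached_rules))
    harmful = sum(verdicts)
    # Hypothetical severity scale based on the count of harmful turns.
    severity = "none" if harmful == 0 else "low" if harmful == 1 else "high"
    return verdicts, severity, retrievals
```

Calling `run_dialogue` on a three-turn conversation in which the second turn closely repeats the first triggers only two retrievals, because the router reuses the cached rules for the second turn. In a real system each placeholder would be replaced by a trained router, a dense retriever, and an actual model call.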

The approach matters because most safety classifiers depend only on a model's internal knowledge, which can drift from explicit societal norms. By anchoring decisions to external, interpretable rules, RoTRAG improves F1 scores by roughly 40% on the ProsocialDialog and Safety Reasoning Multi Turn Dialogue benchmarks, while the router trims redundant compute by skipping retrieval when prior context suffices. Distributional error also drops by 8.4%, which suggests more consistent judgments across nuanced conversations.

If the trend holds, future chat moderators may blend rule retrieval with parametric models rather than relying on black‑box inference alone.
