A Smarter RLHF That Knows When It Doesn't Know

A new training technique called UARM aims to fix one of the quieter failure modes in AI alignment: reward models that act confident even when they shouldn't.

Reinforcement learning from human feedback (RLHF) is the standard method labs use to steer large language models toward useful, safe behavior. It works by training a reward model on human preference data, then optimizing the language model (the "policy") to score well on that reward model. The problem, according to a new paper, is that reward models are typically deterministic: they spit out a single score with no indication of how reliable that score is. Pair that with Group Relative Policy Optimization (GRPO), a training method that treats all reward signals equally when computing advantages, and you get a system that can be gamed: the policy learns to exploit uncertain, unreliable reward estimates rather than genuinely improving.
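To make the "treats all reward signals equally" point concrete, here is a minimal sketch of how GRPO-style advantages are usually computed: each sampled response's reward is standardized against the other responses to the same prompt. The function name, the sample rewards, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Every reward enters the mean and standard deviation with equal weight,
    no matter how reliable the reward model's score for it is.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four sampled responses to one prompt.
group_rewards = [0.2, 0.9, 0.5, 0.4]
advantages = grpo_advantages(group_rewards)
print(advantages)
```

Nothing in this calculation can tell a well-founded 0.9 from a lucky one, which is the gap the paper targets.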

The proposed fix, Uncertainty-Aware Reward Modeling (UARM), wraps reward predictions in calibrated confidence intervals using a technique called quantile-based conformal prediction, then reweights GRPO's advantage calculations so that shaky reward signals get less pull. In the paper's experiments on three benchmarks (HelpSteer, UltraFeedback, and PKU-SafeRLHF), UARM reportedly outperformed standard GRPO and other baselines on reward calibration, resistance to reward hacking, and overall alignment quality.
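The two ingredients can be sketched roughly as follows. The first function is the standard split-conformal correction for quantile intervals: widen the reward model's lower and upper quantile predictions by an offset learned on held-out calibration data. The second down-weights advantages by interval width. The weighting rule 1 / (1 + width), all function names, and the sample numbers are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math
from statistics import mean, pstdev


def conformal_offset(cal_lo, cal_hi, cal_y, alpha=0.1):
    """Split-conformal correction for quantile intervals.

    The score measures how far each calibration label falls outside its
    predicted [lo, hi] interval (negative if it falls inside).
    """
    scores = sorted(max(lo - y, y - hi)
                    for lo, hi, y in zip(cal_lo, cal_hi, cal_y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    # Too few calibration points for this alpha: the interval is unbounded.
    return math.inf if k > n else scores[k - 1]


def calibrated_interval(lo, hi, offset):
    """Widen (or shrink, if offset is negative) a predicted interval."""
    return lo - offset, hi + offset


def uncertainty_weighted_advantages(rewards, widths, eps=1e-8):
    """Group-standardized advantages, scaled down for wide intervals.

    The weight 1 / (1 + width) is a hypothetical choice, not the paper's.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    weights = [1.0 / (1.0 + w) for w in widths]
    return [wt * (r - mu) / (sigma + eps)
            for r, wt in zip(rewards, weights)]


# Hypothetical calibration data: predicted quantiles and true preference scores.
offset = conformal_offset([0.0] * 9, [1.0] * 9, [0.5] * 7 + [1.2, 1.3])
lo, hi = calibrated_interval(0.2, 0.6, offset)
print(uncertainty_weighted_advantages([0.2, 0.9], [0.1, hi - lo]))
```

Under this sketch, a response whose reward carries a wide interval still moves the policy in the same direction, just with less force, so a high but unreliable score is a weaker target for exploitation.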

Reward hacking is not a new problem — researchers have been documenting it since the earliest days of RLHF — but the pressure is growing as labs push policies into increasingly diverse response spaces where reward models are least reliable. UARM doesn't require scrapping the existing RLHF pipeline, just adding a layer of epistemic humility to it. Whether the major labs adopt something like this quietly or keep shipping overconfident reward models is, at this point, an open question.
