A new paper argues that wobbly, high-frequency robot policies are a critic problem, not an actor problem.
Policies trained with continuous actor-critic methods often oscillate in ways that make them unsafe or impractical to deploy on physical hardware. The standard fix is to regularize the policy's output directly, smoothing what the actor produces. The researchers behind PAVE say that misses the root cause. They prove that how erratic an optimal policy can become is bounded by a specific ratio: the Q-function's mixed-partial derivative (how strongly its action-gradient shifts when the state is perturbed, for example by noise) divided by its action-space curvature (how sharply it distinguishes between actions). When that ratio is large, the policy gradient the actor follows is volatile, and no amount of actor-side smoothing changes that underlying geometry.
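To see why a ratio of this shape appears, here is a short sketch based on the implicit function theorem. It is a reconstruction under stated assumptions, not the paper's exact theorem: the symbols $\pi^*$, $s$, $a$ and the smoothness conditions (Q twice continuously differentiable, with a strictly concave action landscape at the optimum) are assumed here for illustration.

```latex
% Assumed setup: \pi^*(s) = \arg\max_a Q(s, a), with \nabla^2_{aa} Q negative definite at the optimum.
\begin{align}
  \nabla_a Q\bigl(s, \pi^*(s)\bigr) &= 0
    && \text{(first-order optimality)} \\
  \nabla^2_{as} Q + \nabla^2_{aa} Q \, \frac{\partial \pi^*}{\partial s} &= 0
    && \text{(differentiate with respect to } s\text{)} \\
  \frac{\partial \pi^*}{\partial s} &= -\bigl(\nabla^2_{aa} Q\bigr)^{-1} \nabla^2_{as} Q \\
  \left\lVert \frac{\partial \pi^*}{\partial s} \right\rVert
    &\le \frac{\lVert \nabla^2_{as} Q \rVert}{\lambda_{\min}\bigl(-\nabla^2_{aa} Q\bigr)}
\end{align}
```

The numerator is the mixed-partial term and the denominator is the action-space curvature, so a flat action landscape or a strongly state-dependent action-gradient both let small state changes swing the optimal action.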
The implication is practical: if the critic's value field is the real source of instability, regularizing the actor is treating a fever with a cold compress. PAVE stabilizes the Q-gradient field directly — minimizing gradient volatility while preserving local curvature — and matches the smoothness of actor-side methods without modifying the actor at all. That matters because actor-side regularization can quietly degrade task performance by biasing the policy away from high-reward actions.
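The idea of penalizing gradient volatility while keeping curvature can be illustrated with a toy example. The sketch below is not the PAVE objective. It is a minimal stand-in that estimates the two quantities by finite differences for a scalar state and action. Every name in it (`mixed_partial`, `action_curvature`, `critic_penalty`, `make_quadratic_critic`), the hinge form, and the default thresholds are hypothetical choices made for illustration.

```python
# Illustrative sketch only: this is NOT the PAVE loss from the paper.
# Function names, the toy critic, and the hinge penalty are assumptions.


def mixed_partial(q, s, a, h=1e-3):
    """Central finite-difference estimate of d^2 Q / (ds da)."""
    return (
        q(s + h, a + h) - q(s + h, a - h)
        - q(s - h, a + h) + q(s - h, a - h)
    ) / (4.0 * h * h)


def action_curvature(q, s, a, h=1e-3):
    """Central finite-difference estimate of d^2 Q / da^2."""
    return (q(s, a + h) - 2.0 * q(s, a) + q(s, a - h)) / (h * h)


def critic_penalty(q, s, a, min_sharpness=1.0, weight=1.0):
    """Hypothetical critic-side regularizer.

    Penalizes the squared mixed partial (gradient volatility) and adds a
    hinge term that is nonzero only when the action landscape is flatter
    than `min_sharpness`, i.e. when -d^2Q/da^2 < min_sharpness.
    """
    m = mixed_partial(q, s, a)
    c = action_curvature(q, s, a)
    volatility = m * m
    flatness = max(0.0, min_sharpness + c) ** 2
    return volatility + weight * flatness


def make_quadratic_critic(coupling, sharpness):
    """Toy critic Q(s, a) = -0.5 * sharpness * a^2 + coupling * s * a."""
    def q(s, a):
        return -0.5 * sharpness * a * a + coupling * s * a
    return q


if __name__ == "__main__":
    critics = [
        ("smooth", make_quadratic_critic(coupling=0.5, sharpness=2.0)),
        ("volatile", make_quadratic_critic(coupling=4.0, sharpness=0.5)),
    ]
    for name, q in critics:
        m = mixed_partial(q, 0.3, 0.1)
        c = action_curvature(q, 0.3, 0.1)
        print(name, "ratio:", abs(m) / abs(c),
              "penalty:", critic_penalty(q, 0.3, 0.1))
```

For this toy critic the greedy action is `(coupling / sharpness) * s`, so the printed ratio is exactly the slope of the optimal policy with respect to the state. The penalty is larger for the critic whose greedy policy is more sensitive, and it never touches an actor.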
Actor-critic architectures underpin much of today's continuous-control research, from locomotion to manipulation, so a critic-side smoothing method that does not compromise task reward could prove significant, provided it holds up beyond the benchmark environments where most RL papers live or die.