Fixing Jerky Robot Policies by Fixing the Critic First

A new regularization framework called PAVE targets the critic in actor-critic RL to smooth out erratic policies without touching the actor at all.

A new paper argues that wobbly, high-frequency robot policies are a critic problem, not an actor problem.

Policies trained with continuous actor-critic methods often oscillate in ways that make them unsafe or impractical to deploy on physical hardware. The standard fix is to regularize the policy's output directly, smoothing what the actor produces. The researchers behind PAVE say that misses the root cause. They prove that how erratic an optimal policy can become is bounded by a specific ratio: the Q-function's mixed-partial derivative (how sensitive it is to noise) divided by its action-space curvature (how sharply it distinguishes between actions). When that ratio is large, the policy gradient the actor follows is volatile, and no amount of actor-side smoothing addresses that underlying geometry.
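To see why a ratio of this kind would govern policy smoothness, here is a minimal one-dimensional sketch. It assumes the mixed partial is taken with respect to state and action, and it uses a made-up quadratic critic; the function q and the constants M and C are illustrative and do not come from the paper. For a greedy policy, the first-order condition dQ/da = 0 implies that the policy's slope in the state equals the mixed partial divided by the curvature magnitude, and the code checks that numerically.

```python
# Toy check: for a quadratic critic, the slope of the greedy policy equals
# the mixed partial divided by the action-space curvature magnitude.
# q, M, and C are illustrative assumptions, not taken from the paper.

M = 3.0   # state-action coupling (sets the mixed partial)
C = 0.5   # curvature magnitude in the action direction


def q(s, a):
    """Hypothetical critic: Q(s, a) = M*s*a - 0.5*C*a^2."""
    return M * s * a - 0.5 * C * a * a


def mixed_partial(f, s, a, h=1e-3):
    """Central-difference estimate of d^2 f / (ds da)."""
    return (f(s + h, a + h) - f(s + h, a - h)
            - f(s - h, a + h) + f(s - h, a - h)) / (4 * h * h)


def action_curvature(f, s, a, h=1e-3):
    """Central-difference estimate of d^2 f / da^2."""
    return (f(s, a + h) - 2 * f(s, a) + f(s, a - h)) / (h * h)


def greedy_action(s):
    """Closed-form argmax over a for this quadratic: a* = (M / C) * s."""
    return M / C * s


s0 = 0.2
a0 = greedy_action(s0)
ratio = abs(mixed_partial(q, s0, a0)) / abs(action_curvature(q, s0, a0))

# Slope of the greedy policy with respect to the state, by finite differences.
ds = 1e-3
policy_slope = (greedy_action(s0 + ds) - greedy_action(s0 - ds)) / (2 * ds)

print(f"ratio = {ratio:.3f}, policy slope = {policy_slope:.3f}")
```

In this toy case both quantities come out to M / C. Raising M (stronger state-action coupling) or lowering C (a flatter peak over actions) makes the greedy action swing harder for the same small change in state, which is the geometry the article describes.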

The implication is practical: if the critic's value field is the real source of instability, regularizing the actor is treating a fever with a cold compress. PAVE stabilizes the Q-gradient field directly — minimizing gradient volatility while preserving local curvature — and matches the smoothness of actor-side methods without modifying the actor at all. That matters because actor-side regularization can quietly degrade task performance by biasing the policy away from high-reward actions.
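The article does not spell out PAVE's loss, so the following is only a sketch of what a critic-side penalty in that spirit could look like, not the paper's formulation. It assumes a scalar state and action, estimates derivatives by finite differences, and combines two hypothetical terms: a volatility term (how much the action-gradient of Q moves when the state is jittered) and a curvature floor (a hinge that activates only if Q becomes too flat in the action direction). Every name and default value here is an assumption made for illustration.

```python
import random

# Hypothetical critic-side penalty: NOT the loss from the PAVE paper.


def grad_a(q, s, a, h=1e-3):
    """Central-difference estimate of dQ/da."""
    return (q(s, a + h) - q(s, a - h)) / (2 * h)


def curvature_a(q, s, a, h=1e-3):
    """Central-difference estimate of d^2Q/da^2 (negative near a maximum)."""
    return (q(s, a + h) - 2 * q(s, a) + q(s, a - h)) / (h * h)


def critic_penalty(q, batch, noise_std=0.05, min_curvature=0.1,
                   n_noise=8, seed=0):
    """Return (volatility, curvature_loss), each averaged over the batch."""
    rng = random.Random(seed)
    volatility = 0.0
    curvature_loss = 0.0
    for s, a in batch:
        g = grad_a(q, s, a)
        for _ in range(n_noise):
            eps = rng.gauss(0.0, noise_std)
            # Squared change in the action-gradient under state noise.
            volatility += (grad_a(q, s + eps, a) - g) ** 2 / n_noise
        # Hinge: nonzero only when concavity in a is weaker than the floor.
        curvature_loss += max(0.0, min_curvature + curvature_a(q, s, a)) ** 2
    n = len(batch)
    return volatility / n, curvature_loss / n


def calm(s, a):
    """Toy critic with weak state-action coupling."""
    return 0.5 * s * a - 0.5 * a * a


def jumpy(s, a):
    """Toy critic with strong state-action coupling."""
    return 5.0 * s * a - 0.5 * a * a


batch = [(0.0, 0.0), (0.5, -0.2), (-1.0, 0.3)]
calm_vol, calm_curv = critic_penalty(calm, batch)
jumpy_vol, jumpy_curv = critic_penalty(jumpy, batch)
print(f"calm:  volatility={calm_vol:.6f}, curvature_loss={calm_curv:.6f}")
print(f"jumpy: volatility={jumpy_vol:.6f}, curvature_loss={jumpy_curv:.6f}")
```

The two toy critics have the same curvature in the action direction, so the curvature term stays at zero for both, while the volatility term is far larger for the strongly coupled one. In an actual training loop a term like this would presumably be added to the critic's usual temporal-difference loss and computed with automatic differentiation rather than finite differences.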

Actor-critic architectures underpin much of current continuous-control research, from locomotion to manipulation, so a critic-side smoothing method that doesn't compromise task reward could prove significant, assuming it holds up outside the benchmark environments where most RL papers live or die.

The Revision

Written by an AI system from the public sources credited above.