An autonomous driving model trained mostly against itself still needs a little human touch to avoid developing alien road habits.
Researchers published a method that layers a small set of human driving demonstrations on top of standard self-play reinforcement learning. Rather than throwing out human data entirely — or requiring thousands of hours of it — the approach uses just 30 minutes of demonstrations as a regularization signal. The resulting policies coordinate naturally with human drivers in held-out tests and finish training in 15 hours on a single consumer-grade GPU.
This closes a real gap: pure self-play agents tend to invent driving behaviors that are effective but socially incompatible. Prior attempts to fix that relied on reward engineering and domain randomization, both notoriously brittle. Using human data as a light constraint rather than the primary training signal is a cleaner solution, and dramatically cheaper than imitation learning, which typically requires around 75,000 minutes of demonstrations for comparable results. That is roughly 2,500 times the 30 minutes used here.
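To make "human data as a light constraint" concrete, here is a minimal sketch of one common way to combine the two signals: a policy-gradient loss on self-play rollouts plus a small, weighted behavior-cloning penalty on human demonstrations. This is an illustration under assumptions, not the authors' published objective. The function name `regularized_loss`, the discrete action space, the cross-entropy form of the regularizer, and the weight `reg_weight` are all hypothetical choices made for the example.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def regularized_loss(selfplay_logits, selfplay_actions, advantages,
                     human_logits, human_actions, reg_weight=0.1):
    """Self-play policy-gradient loss plus a light human-data penalty.

    All names and the loss form are assumptions for illustration only.
    selfplay_logits: (N, A) policy logits on self-play states
    human_logits:    (M, A) policy logits on states from human demos
    """
    # Policy-gradient term: raise log-prob of actions with positive advantage.
    sp_probs = softmax(selfplay_logits)
    sp_logp = np.log(sp_probs[np.arange(len(selfplay_actions)), selfplay_actions])
    pg_loss = -(advantages * sp_logp).mean()

    # Regularizer: cross-entropy between the policy and the human's actions.
    h_probs = softmax(human_logits)
    h_logp = np.log(h_probs[np.arange(len(human_actions)), human_actions])
    bc_loss = -h_logp.mean()

    # A small reg_weight keeps self-play as the primary training signal.
    return pg_loss + reg_weight * bc_loss, pg_loss, bc_loss

# Toy usage: 3 discrete actions, uniform (all-zero) logits.
total, pg, bc = regularized_loss(
    selfplay_logits=np.zeros((2, 3)),
    selfplay_actions=np.array([0, 2]),
    advantages=np.array([1.0, -1.0]),
    human_logits=np.zeros((4, 3)),
    human_actions=np.array([1, 1, 0, 2]),
)
print(total, pg, bc)
```

The design point is the weighting: with a small `reg_weight`, the human term nudges the policy toward human-like choices without replacing the self-play objective, which fits the framing of a small demonstration set acting as a constraint rather than the main source of supervision.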
The team has released videos and full source code, and both are worth a close look: autonomous driving research has a history of impressive papers that quietly assumed highway-only or sanitized simulation conditions. How this method holds up in dense urban traffic with unpredictable cyclists remains the real test.