Teaching AI Agents to Learn on the Job via RL

A research team has published a framework for training AI agents that keep getting better the longer they run — without waiting for a new model release.

The paper, posted to arXiv, describes "Connect the Dots" (CoD): a reinforcement learning setup where an LLM-based agent solves a sequence of tasks, learns from what it encounters, and continuously rewrites its own context about the environment. The infrastructure handles long rollout sequences that alternate between solving tasks and updating context — a non-trivial engineering problem that standard RL pipelines are not built for. The team implemented a GRPO-style RL algorithm with fine-grained credit assignment and built custom tasks and environments designed to train the meta-capability itself, not just domain-specific performance.
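To make the shape of that loop concrete, here is a minimal Python sketch of a rollout that alternates between solving a task and rewriting the context, followed by the group-relative advantage normalization commonly associated with GRPO. It is an illustration under stated assumptions, not the paper's implementation: the names `Rollout`, `run_rollout`, `toy_solve`, `toy_update_context`, and `group_relative_advantages` are hypothetical, the toy agent is a stand-in for an LLM, and the paper's fine-grained credit assignment is not reproduced here.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List, Tuple


@dataclass
class Rollout:
    """One long rollout: a task sequence with a context rewrite after each task."""
    context: str = ""
    rewards: List[float] = field(default_factory=list)


def run_rollout(
    tasks: List[str],
    solve: Callable[[str, str], Tuple[str, float]],
    update_context: Callable[[str, str, str], str],
) -> Rollout:
    """Alternate between solving a task and rewriting the agent's context."""
    rollout = Rollout()
    for task in tasks:
        # Solve the task using whatever context has accumulated so far.
        answer, reward = solve(rollout.context, task)
        rollout.rewards.append(reward)
        # Let the agent rewrite its own context before the next task.
        rollout.context = update_context(rollout.context, task, answer)
    return rollout


def group_relative_advantages(returns: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each return against its sampling group (GRPO-style baseline)."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


# Toy stand-ins for the LLM calls (assumptions for illustration only):
# the "agent" succeeds only on tasks it has already recorded in its context.
def toy_solve(context: str, task: str) -> Tuple[str, float]:
    seen = task in context.split("\n")
    return ("known" if seen else "unknown", 1.0 if seen else 0.0)


def toy_update_context(context: str, task: str, answer: str) -> str:
    return f"{context}\n{task}" if context else task


rollout = run_rollout(["a", "b", "a"], toy_solve, toy_update_context)
advantages = group_relative_advantages([1.0, 3.0])
```

In this toy run the agent fails the first two tasks and succeeds on the repeated one, because the reward depends on what an earlier context update recorded. That dependency is what makes the context-update step itself something a training signal can act on.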

What distinguishes CoD from conventional fine-tuning or retrieval-augmented approaches is the scope of what generalizes. The researchers report three distinct generalization directions: within training domains, across different domains, and from CoD settings to "Ralph-loop" settings — a separate agentic configuration the model was not explicitly trained on. That last one is the interesting claim, because cross-paradigm transfer is exactly what makes a meta-capability worth having.

The paper is careful to label these as proof-of-concept implementations, so the empirical results are evidence that the training approach works, not a claim of a production-ready system. Still, for a field that has mostly treated agent memory as a retrieval problem, framing continuous self-updating as a learnable skill, and then training for it end-to-end, is a meaningful shift in how labs might think about long-running deployments. Code is available on GitHub.
