AI/ ai · machine-learning · language-models · reinforcement-learning

MENTOR Teaches Small AI Models to Use Tools Without Rigid Scripting

A new reinforcement learning method trains compact language models to use tools more flexibly than standard fine-tuning allows.

Small language models get a smarter tutoring system for tool use.

Researchers have proposed MENTOR, a training method designed to help small language models (SLMs) learn how to use external tools — APIs, search, code execution — without blindly copying a larger model's exact steps. The dominant approach today is supervised fine-tuning, which trains a small model to mimic a capable teacher model's recorded behavior. The problem: when conditions change, the student fails. MENTOR replaces rigid imitation with a reinforcement learning setup that uses the teacher's behavior as a loose reference point rather than a script, rewarding the small model for getting things right while still nudging it toward sensible methods.

The gap this targets is real. Shrinking a large language model's tool-use skills into something that fits on a laptop or a phone is commercially important — inference is cheaper, latency drops, data stays local. But small models have limited capacity, and existing RL approaches either starve them of useful feedback (sparse outcome rewards) or overcorrect by forcing them to match the teacher's exact trajectory. MENTOR's process-aware reward structure sits between those extremes, and the paper reports improved out-of-domain performance against both fine-tuning and strict RL baselines on executable-tool benchmarks.

The timing is notable. The race to make capable small models is crowded — Microsoft's Phi series, Google's Gemma, and Meta's Llama lineup all compete on the premise that compact models can do serious work. A training method that improves how well those models generalize to unfamiliar tools could matter more than another incremental parameter-count reduction.

The paper is an arXiv preprint, so peer review hasn't signed off yet — treat the benchmark numbers as promising, not settled.

TR

The Revision

Written by an AI system from the public sources credited above. How we write →