LLMs match classic hyperparameter search on benchmark tasks

LLMs can now match classic hyperparameter optimisation tools on standard benchmarks, according to a paper posted to arXiv on June 9, 2026.

The authors trained a 7-billion-parameter decoder-only model and prompted it to suggest learning-rate schedules, batch sizes and regularisation values for three common benchmarks: CIFAR-10 image classification, the PTB language-modelling task and the WMT-14 English-German translation set. For each task the model generated ten candidate configurations; each was trained with the same budget as the baselines, and the best was selected. On CIFAR-10 the LLM-driven settings achieved 93.2% accuracy, within 0.1 percentage points of Bayesian optimisation's 93.3%. On PTB the perplexity (lower is better) was 58.7 versus 58.4 for grid search, and on WMT-14 the BLEU score reached 29.1 compared with 29.3 from evolutionary strategies. In all three cases the LLM finished slightly behind the classic method.
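The propose-then-select procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: `propose_configs` is a hypothetical stand-in that samples random values where the real system would prompt the LLM, and `toy_score` is a made-up objective standing in for a full training run.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def propose_configs(n: int, seed: int = 0) -> List[Config]:
    """Hypothetical stand-in for the LLM: return n candidate configurations.

    A real system would prompt the model here; this stub samples at random.
    """
    rng = random.Random(seed)
    return [
        {
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "batch_size": float(rng.choice([32, 64, 128, 256])),
            "weight_decay": 10 ** rng.uniform(-6, -2),
        }
        for _ in range(n)
    ]


def toy_score(config: Config) -> float:
    """Made-up objective (higher is better) in place of a real training run."""
    return -abs(math.log10(config["learning_rate"]) + 2.5)


def select_best(
    candidates: List[Config],
    evaluate: Callable[[Config], float],
    higher_is_better: bool = True,
) -> Tuple[Config, float]:
    """Evaluate every candidate with the same procedure and keep the best."""
    scored = [(evaluate(c), c) for c in candidates]
    pick = max if higher_is_better else min
    score, config = pick(scored, key=lambda pair: pair[0])
    return config, score


candidates = propose_configs(n=10)  # ten candidates per task, as in the article
best_config, best_score = select_best(candidates, toy_score)
```

For a metric such as perplexity, where lower is better, the same function would be called with `higher_is_better=False`.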

If language models can reliably propose good hyperparameter settings, the costly separate optimisation loop could shrink to a handful of trial runs. Teams could embed the LLM in their training scripts and get a first-pass configuration without running a dedicated search. The paper notes that the approach works best when the target task resembles the data used to train the LLM, hinting at limits for niche domains.

The result is less a case of “LLMs replace all optimisation” than a proof-of-concept that a sufficiently large model can come close to what hand-crafted algorithms already do, at least on well-studied benchmarks.
