Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes
TL;DR
Autonomous closed-loop empirical research with no human in the loop. Specialist agents partition the training-recipe surface and share measured lineage across trials; each trial contributes a hypothesis, an executable code edit, a measured outcome, and feedback. Across 1,197 headline-run trials and 600 control trials, no human chose proposals, edited recipes, overrode scores, or repaired failed trials. Results: Parameter Golf validation bpb down 0.81%, NanoChat-D12 CORE up 38.7%, CIFAR-10 Airbench96 wallclock down 4.59%.
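To make the loop shape concrete, here is a minimal sketch, not the paper's implementation. Everything in it is an assumption for illustration: the `Specialist`, `Trial`, and `run_loop` names, the two-knob recipe, the reverse-and-shrink proposal rule, and the toy quadratic evaluator standing in for a real training run. It only shows the structure described above: each agent owns one slice of the recipe, reads the shared lineage, proposes an edit, and an external evaluator alone decides acceptance.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Recipe = Dict[str, float]


@dataclass
class Trial:
    """One lineage entry: who proposed what, and how it measured."""
    agent: str
    hypothesis: str
    recipe: Recipe
    score: float      # lower is better in this sketch
    accepted: bool


@dataclass
class Specialist:
    """Hypothetical agent that owns a single knob of the recipe surface."""
    name: str
    knob: str
    step: float

    def propose(self, best: Recipe, lineage: List[Trial]) -> Tuple[str, Recipe]:
        # Feedback from shared lineage: if this agent's last proposal was
        # rejected, reverse direction and halve the step size.
        mine = [t for t in lineage if t.agent == self.name]
        if mine and not mine[-1].accepted:
            self.step = -self.step / 2
        candidate = dict(best)
        candidate[self.knob] = best[self.knob] + self.step
        return f"shift {self.knob} by {self.step:+.4f}", candidate


def run_loop(
    agents: List[Specialist],
    evaluate: Callable[[Recipe], float],
    start: Recipe,
    rounds: int,
) -> Tuple[Recipe, float, List[Trial]]:
    """Closed loop: propose, measure, record, keep only measured improvements."""
    best, best_score = dict(start), evaluate(start)
    lineage: List[Trial] = []
    for _ in range(rounds):
        for agent in agents:
            hypothesis, candidate = agent.propose(best, lineage)
            score = evaluate(candidate)
            accepted = score < best_score  # the evaluator alone decides
            lineage.append(Trial(agent.name, hypothesis, candidate, score, accepted))
            if accepted:
                best, best_score = candidate, score
    return best, best_score, lineage


def toy_loss(recipe: Recipe) -> float:
    # Stand-in evaluator (assumption): a real one would train and measure.
    return (recipe["lr"] - 0.3) ** 2 + (recipe["wd"] - 0.1) ** 2


agents = [Specialist("optimizer", "lr", 0.1), Specialist("regularization", "wd", 0.1)]
start = {"lr": 0.0, "wd": 0.0}
best, best_score, lineage = run_loop(agents, toy_loss, start, rounds=10)
```

The human-supplied pieces in this sketch are `toy_loss` and the choice of knobs; everything inside `run_loop` runs without intervention.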
Why this matters
The "no human in the loop" claim is the load-bearing piece. The 38.7% NanoChat-D12 gain on CORE is large enough to take seriously. If the closed loop generalizes, the labor model for training-recipe research changes: the human picks the search surface and writes the evaluator, and the system runs the loop.
Connections to prior wiki
- Confirms the Evaluation-driven Scaling thread (04-22): same outer loop, both bet on external measurement as the steering signal.
- Contradicts KernelBench-X (also today) in one specific way. KernelBench-X says iterative refinement on kernel generation improves correctness but not performance. This paper says iterative refinement on training recipes improves both. One plausible reading of the contrast: training-recipe space is denser and smoother than kernel-generation space, so the refinement loop has a usable gradient signal in one but not the other.
- Composition with Skill1: the closed loop here proposes recipes; Skill1 distills successful trajectories into reusable skills. The pair gives you a self-improving research system rather than a one-shot search.
Research angle
- The 1,797 trials (1,197 headline-run plus 600 control) cost real money. The cost per unit of bpb improvement determines whether this is a research-grade or production-grade primitive.
- Open question: whether the gains transfer to a recipe surface the system has not been run on (i.e., whether the specialist-agent partition generalizes).
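The cost-per-bpb-improvement question raised above reduces to simple arithmetic once a per-trial cost is known. The sketch below is an assumption-heavy illustration: `cost_per_relative_gain` is a hypothetical helper, and `cost_per_trial` is a placeholder of 1.0 in arbitrary units because this note gives no cost figures. Only the trial counts and the 0.81% bpb reduction come from the summary above.

```python
def cost_per_relative_gain(
    n_trials: int, cost_per_trial: float, relative_gain_pct: float
) -> float:
    """Total search cost divided by the relative improvement, in percent."""
    if relative_gain_pct <= 0:
        raise ValueError("relative_gain_pct must be positive")
    return n_trials * cost_per_trial / relative_gain_pct


HEADLINE_TRIALS = 1_197   # from the summary above
CONTROL_TRIALS = 600      # from the summary above
total_trials = HEADLINE_TRIALS + CONTROL_TRIALS

# Placeholder (assumption): one arbitrary cost unit per trial.
cost_per_trial = 1.0

# Upper bound only: the trial counts span all three benchmarks, and the
# per-benchmark split is not given here, so charging every trial to the
# 0.81% bpb reduction overstates the cost of that one result.
bpb_cost_upper_bound = cost_per_relative_gain(total_trials, cost_per_trial, 0.81)
```

Substituting a real per-trial cost (GPU-hours or dollars) and the per-benchmark trial count would turn this bound into the number that matters.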