Andrej Karpathy open-sourced autoresearch, a minimal agent loop for automated ML research, and reported roughly 20 additive changes that reduced nanochat's Time to GPT-2 from 2.02 hours to 1.80 hours. Research teams can use it as a concrete recipe for closed-loop experimentation on any metric that admits cheap proxy evaluations.

Karpathy's release thread positions autoresearch less as a polished product than as a reusable loop: give an agent a measurable objective, let it modify the training code, run full experiments, score the result, and preserve wins. The repo is available on GitHub, and a widely shared early summary distilled the operating model to "~630 lines of code," "single GPU," and short training cycles.
That matters because the contribution is procedural. Instead of promising autonomous science in the abstract, autoresearch packages the bread-and-butter ML tuning workflow Karpathy describes doing manually for "2 decades" into an agentic closed loop that can keep iterating while humans refine prompts and constraints.
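That closed loop can be sketched in a few lines. The snippet below is a toy illustration, not the repo's actual code: in autoresearch the agent is an LLM editing training code and the experiment is a real GPU run, whereas here `propose_change` is a random perturbation of one hypothetical hyperparameter and `run_experiment` is a toy objective. The "preserve wins" step is the part the sketch shares with the real loop.

```python
import random

def run_experiment(config):
    # Stand-in for a short training run scored by validation loss.
    # Toy objective: loss is minimized at lr == 0.01 (illustration only).
    return (config["lr"] - 0.01) ** 2

def propose_change(best_config):
    # Stand-in for the agent step. In autoresearch an LLM edits the
    # training code; here we perturb one hypothetical hyperparameter.
    return {"lr": best_config["lr"] * random.choice([0.5, 0.8, 1.25, 2.0])}

def closed_loop(steps=200, seed=0):
    random.seed(seed)
    best_config = {"lr": 0.1}
    best_loss = run_experiment(best_config)
    for _ in range(steps):
        candidate = propose_change(best_config)
        loss = run_experiment(candidate)
        if loss < best_loss:  # preserve wins; discard everything else
            best_config, best_loss = candidate, loss
    return best_config, best_loss

best_config, best_loss = closed_loop()
print(best_loss < run_experiment({"lr": 0.1}))  # prints True
```

The greedy "keep only if the metric improves" gate is what makes retained changes stack into a running-best path rather than a grab bag of unverified edits.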
The strongest evidence is the nanochat run itself. Karpathy says a roughly two-day run on a depth-12 model found about 20 validation-loss improvements, and that every one he tested was additive and transferred to larger depth-24 models. Stacked together, those changes moved Time to GPT-2 from 2.02 hours to 1.80 hours, which he says becomes the new leaderboard entry.
The experiment chart shows 276 plotted runs with 29 kept improvements on the running-best path, while the thread says the broader process worked through about 700 autonomous changes. The retained fixes included sharper attention from adding a missing QKnorm scaler, regularization for value embeddings, less conservative banded attention, corrected AdamW betas, a tuned weight-decay schedule, and improved initialization.
Karpathy also says the agent "looked at the sequence of results of experiments and used that to plan the next ones," which is the more important engineering claim than the raw benchmark movement: the loop is doing sequential experimental design, not just grid search. Meanwhile, the result spread quickly, with one reposted copy passing 1,000 reposts, signaling that this specific benchmark delta landed as more than a niche repo drop.
Karpathy's framing is blunt: "All LLM frontier labs will do this" and scaling it is "just engineering." His proposed path is a swarm model: agents tune smaller systems cheaply, promising ideas get promoted to larger scales, and humans stay on the edges for supervision and problem selection.
The practical boundary condition is also clear in the thread. This works best where the target metric is cheap to score directly, or where a smaller model or proxy objective gives a fast signal. That's why nanochat is a plausible first target, and why the same pattern could extend to inference, training, or system-level metrics that can be evaluated repeatedly without expensive human review.
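The proxy-then-promote pattern can be sketched as a two-tier filter: a cheap, noisy proxy run gates access to the expensive full-scale evaluation. Everything below (`true_gain`, the scoring functions, the budget) is hypothetical scaffolding for illustration, not the repo's API.

```python
import random

random.seed(0)

def proxy_score(change):
    # Stand-in for a fast, cheap run (e.g. a small depth-12-style model):
    # quick signal, but noisy.
    return change["true_gain"] + random.uniform(-0.02, 0.02)

def full_score(change):
    # Stand-in for an expensive full-scale run: slow but accurate.
    return change["true_gain"]

def promote(candidates, budget=3):
    # Rank every candidate with the cheap proxy, then spend the expensive
    # evaluation budget only on the proxy winners, keeping real improvements.
    ranked = sorted(candidates, key=proxy_score, reverse=True)
    return [c for c in ranked[:budget] if full_score(c) > 0]

changes = [{"name": f"change-{i}", "true_gain": g}
           for i, g in enumerate([0.05, -0.03, 0.01, 0.002, -0.01])]
kept = promote(changes)
print([c["name"] for c in kept])
```

The design choice mirrors the swarm roadmap: proxy noise may admit a few false positives into the shortlist, but the expensive tier only ever pays for `budget` runs, so the cost of full-scale evaluation stays bounded regardless of how many candidates the agents generate.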
A useful read from practitioners is that the hard part may shift from execution to research design. In one engineer's reaction, the interesting work becomes setting hypotheses, building verification methods, and using "contracts" so longer-horizon agents improve systems without drifting off-task.
The widely shared summary: "Andrej Karpathy just dropped something absurdly insane. An open-source repo where an AI agent runs its own ML research loop. While you sleep. The setup is almost absurdly simple: ~630 lines of code, single GPU, 5-minute training runs. But here's the twist. The human …"
Karpathy's release post: "I packaged up the 'autoresearch' project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then: the human iterates on the …"
Karpathy's results post: "Three days ago I left autoresearch tuning nanochat for ~2 days on depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, …"
"All LLM frontier labs will do this. It's the final boss battle... Doing it is 'just engineering' and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and Show more