Karpathy releases autoresearch after nanochat cuts Time to GPT-2 by 11%
Andrej Karpathy open-sourced autoresearch, a minimal agent loop for automated ML research, and reported roughly 20 additive changes that reduced nanochat’s Time to GPT-2 from 2.02 hours to 1.80 hours. Research teams can use it as a concrete recipe for closed-loop experimentation on any metric with cheap proxy evaluations.

TL;DR
- Andrej Karpathy open-sourced autoresearch as a minimal recipe for an agent that edits training code, runs experiments, evaluates loss, and keeps improvements in Git; early descriptions of the repo emphasized a "single GPU" setup and "5-minute training runs" (Karpathy's release, early repo summary).
- In Karpathy's first nanochat run, the agent worked through roughly 700 autonomous changes, kept about 20 additive improvements, and cut the leaderboard's "Time to GPT-2" from 2.02 hours to 1.80 hours, an ~11% gain nanochat results.
- The changes were not cosmetic tuning: Karpathy says the loop found concrete fixes in attention scaling, regularization, AdamW betas, weight decay scheduling, and initialization that transferred from a depth-12 run to larger depth-24 models technical findings.
- Karpathy argues this pattern generalizes to any metric that is "reasonably efficient to evaluate," and practitioners are already framing the release as a blueprint for longer-horizon, contract-driven agent workflows rather than a one-off demo (generalization claim, agent design reaction).
What shipped
Karpathy's release thread positions autoresearch less as a polished product than as a reusable loop: give an agent a measurable objective, let it modify the training code, run full experiments, score the result, and preserve wins. The repo itself is available in the GitHub project, while an early walkthrough in a widely shared summary distilled the operating model to "~630 lines of code," "single GPU," and short training cycles.
That matters because the contribution is procedural. Instead of promising autonomous science in the abstract, autoresearch packages the bread-and-butter ML tuning workflow Karpathy describes doing manually for "2 decades" into an agentic closed loop that can keep iterating while humans refine prompts and constraints manual-to-agent shift.
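The loop described above is compact enough to sketch directly. The toy version below is illustrative, not code from the repo: a cheap synthetic "validation loss" stands in for a real training run, random hyperparameter perturbations stand in for code edits, and a history list stands in for Git commits of kept wins.

```python
import random

# Toy stand-in for "run a training experiment and score it": here the
# validation loss is a simple function of two hyperparameters. In the
# real loop this step would launch a short training run and read the loss.
def evaluate(config):
    lr, wd = config["lr"], config["weight_decay"]
    return (lr - 0.003) ** 2 * 1e5 + (wd - 0.1) ** 2 * 10

def propose(config, rng):
    # "Edit the training code": perturb one hyperparameter at a time.
    new = dict(config)
    key = rng.choice(list(new))
    new[key] *= rng.choice([0.8, 1.25])
    return new

def research_loop(config, steps=200, seed=0):
    rng = random.Random(seed)
    best_loss = evaluate(config)
    history = []  # stands in for git commits of kept improvements
    for _ in range(steps):
        candidate = propose(config, rng)
        loss = evaluate(candidate)
        if loss < best_loss:  # keep only changes that improve the metric
            config, best_loss = candidate, loss
            history.append((dict(config), loss))
    return config, best_loss, history

best, loss, commits = research_loop({"lr": 0.01, "weight_decay": 0.02})
```

The key property the article emphasizes is the acceptance gate: only changes that improve the scored metric survive, so kept improvements compose additively over the run.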
What did it actually improve on nanochat?
The strongest evidence is the nanochat run itself. Karpathy says a roughly two-day run on a depth-12 model found about 20 validation-loss improvements, and that every one he tested was additive and transferred to larger depth-24 models. Stacked together, those changes moved Time to GPT-2 from 2.02 hours to 1.80 hours, which he says becomes the new leaderboard entry measured speedup.
The accompanying plot shows 276 experiments with 29 kept improvements on the running-best path, while the thread says the broader process worked through about 700 autonomous changes. The retained fixes included sharper attention from adding a previously missing QK-norm scaler, regularization for value embeddings, less conservative banded attention, corrected AdamW betas, a tuned weight decay schedule, and improved initialization kept improvements.
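The QK-norm fix among those follows a known pattern: L2-normalize queries and keys per head dimension, then replace the usual 1/sqrt(d) factor with a learnable temperature. This is a generic NumPy sketch of that pattern, not the actual nanochat patch; the `scale` value here is illustrative.

```python
import numpy as np

def qk_norm_attention(q, k, v, scale=10.0, eps=1e-6):
    """Attention with QK norm: Q and K are L2-normalized along the head
    dimension before the dot product, so logits are cosine similarities,
    and a learnable temperature `scale` replaces 1/sqrt(d).
    Shapes: q, k, v are (seq, head_dim)."""
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    logits = scale * (q @ k.T)                     # cosine-similarity logits
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out, w = qk_norm_attention(q, k, v)
```

Normalizing Q and K bounds the logits, which is one reason such a scaler can sharpen or stabilize attention when the default scaling is missing or miscalibrated.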
Karpathy also says the agent "looked at the sequence of results of experiments and used that to plan the next ones," which is a more important engineering claim than the raw benchmark movement: the loop is doing sequential experimental design, not just grid search planning claim. Meanwhile, the result spread quickly, with one reposted copy passing 1,000 reposts, signaling that this specific benchmark delta landed as more than a niche repo drop.
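That sequential-design behavior can be illustrated with a toy proposer that conditions on the result history instead of sweeping a fixed grid. Everything here is hypothetical (names, the synthetic loss, the exploit/explore rule); it only demonstrates the shape of the idea: repeat a move that just worked, otherwise explore.

```python
import random

def plan_next(history, keys, rng):
    # Read past results to plan: if the last trial improved the loss,
    # try the same key and direction again; otherwise explore randomly.
    if history and history[-1]["improved"]:
        return history[-1]["key"], history[-1]["factor"]
    return rng.choice(keys), rng.choice([0.8, 1.25])

def loss_fn(cfg):
    # Toy "validation loss" with its optimum at lr = 0.003.
    return (cfg["lr"] - 0.003) ** 2

def run(steps=60, seed=1):
    rng = random.Random(seed)
    cfg = {"lr": 0.01}
    best = loss_fn(cfg)
    history = []
    for _ in range(steps):
        key, factor = plan_next(history, list(cfg), rng)
        trial = dict(cfg)
        trial[key] *= factor
        loss = loss_fn(trial)
        improved = loss < best
        if improved:
            cfg, best = trial, loss
        history.append({"key": key, "factor": factor, "improved": improved})
    return cfg, best

cfg, best = run()
```

Even this crude history rule reaches the neighborhood of the optimum far faster than blind sampling, which is the gap between "ran 700 experiments" and "planned 700 experiments."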
Why this matters for engineering teams
Karpathy's framing is blunt: "All LLM frontier labs will do this" and scaling it is "just engineering" frontier-lab claim. His proposed path is a swarm model: agents tune smaller systems cheaply, promising ideas get promoted to larger scales, and humans stay on the edges for supervision and problem selection swarm roadmap.
The practical boundary condition is also clear in the thread. This works best where the target metric is cheap to score directly, or where a smaller model or proxy objective gives a fast signal. That's why nanochat is a plausible first target and why the same pattern could extend to inference, training, or system-level metrics that can be evaluated repeatedly without expensive human review proxy-metric framing.
A useful read from practitioners is that the hard part may shift from execution to research design. In one engineer's reaction, the interesting work becomes setting hypotheses, building verification methods, and using "contracts" so longer-horizon agents improve systems without drifting off-task.