Andrej Karpathy open-sourced autoresearch, a minimal agent loop for automated ML research, and reported roughly 20 additive changes that reduced nanochat's Time to GPT-2 from 2.02 hours to 1.80 hours. Research teams can use it as a concrete recipe for closed-loop experimentation on any metric that admits cheap proxy evaluations.

Karpathy's release thread positions autoresearch less as a polished product than as a reusable loop: give an agent a measurable objective, let it modify the training code, run full experiments, score the result, and preserve wins. The repo is available on GitHub, and a widely shared early summary distilled the operating model to "~630 lines of code," "single GPU," and short training cycles.
That matters because the contribution is procedural. Instead of promising autonomous science in the abstract, autoresearch packages the bread-and-butter ML tuning workflow Karpathy describes doing manually for "2 decades" into an agentic closed loop that can keep iterating while humans refine prompts and constraints.
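That closed loop can be sketched in a few lines. The snippet below is a toy illustration, not the repo's actual code: in autoresearch the agent is an LLM editing training code and the experiment is a real GPU run, whereas here `propose_change` is a random perturbation of one hypothetical hyperparameter and `run_experiment` is a toy objective. The "preserve wins" step is the part the sketch shares with the real loop.

```python
import random

def run_experiment(config):
    # Stand-in for a short training run scored by validation loss.
    # Toy objective: loss is minimized at lr == 0.01 (illustration only).
    return (config["lr"] - 0.01) ** 2

def propose_change(best_config):
    # Stand-in for the agent step. In autoresearch an LLM edits the
    # training code; here we perturb one hypothetical hyperparameter.
    return {"lr": best_config["lr"] * random.choice([0.5, 0.8, 1.25, 2.0])}

def closed_loop(steps=200, seed=0):
    random.seed(seed)
    best_config = {"lr": 0.1}
    best_loss = run_experiment(best_config)
    for _ in range(steps):
        candidate = propose_change(best_config)
        loss = run_experiment(candidate)
        if loss < best_loss:  # preserve wins; discard everything else
            best_config, best_loss = candidate, loss
    return best_config, best_loss

best_config, best_loss = closed_loop()
print(best_loss < run_experiment({"lr": 0.1}))  # prints True
```

The greedy "keep only if the metric improves" gate is what makes retained changes stack into a running-best path rather than a grab bag of unverified edits.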
The strongest evidence is the nanochat run itself. Karpathy says a roughly two-day run on a depth-12 model found about 20 validation-loss improvements, and that every one he tested was additive and transferred to larger depth-24 models. Stacked together, those changes moved Time to GPT-2 from 2.02 hours to 1.80 hours, which he says becomes the new leaderboard entry.
The experiment chart shows 276 plotted runs with 29 kept improvements on the running-best path, while the thread says the broader process worked through about 700 autonomous changes. The retained fixes included sharper attention from adding a missing QKnorm scaler, regularization for value embeddings, less conservative banded attention, corrected AdamW betas, a tuned weight-decay schedule, and improved initialization.
Karpathy also says the agent "looked at the sequence of results of experiments and used that to plan the next ones," which is the more important engineering claim than the raw benchmark movement: the loop is doing sequential experimental design, not just grid search. Meanwhile, the result spread quickly, with one reposted copy passing 1,000 reposts, signaling that this specific benchmark delta landed as more than a niche repo drop.
Karpathy's framing is blunt: "All LLM frontier labs will do this" and scaling it is "just engineering." His proposed path is a swarm model: agents tune smaller systems cheaply, promising ideas get promoted to larger scales, and humans stay on the edges for supervision and problem selection.
The practical boundary condition is also clear in the thread. This works best where the target metric is cheap to score directly, or where a smaller model or proxy objective gives a fast signal. That's why nanochat is a plausible first target, and why the same pattern could extend to inference, training, or system-level metrics that can be evaluated repeatedly without expensive human review.
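The proxy-then-promote pattern can be sketched as a two-tier filter: a cheap, noisy proxy run gates access to the expensive full-scale evaluation. Everything below (`true_gain`, the scoring functions, the budget) is hypothetical scaffolding for illustration, not the repo's API.

```python
import random

random.seed(0)

def proxy_score(change):
    # Stand-in for a fast, cheap run (e.g. a small depth-12-style model):
    # quick signal, but noisy.
    return change["true_gain"] + random.uniform(-0.02, 0.02)

def full_score(change):
    # Stand-in for an expensive full-scale run: slow but accurate.
    return change["true_gain"]

def promote(candidates, budget=3):
    # Rank every candidate with the cheap proxy, then spend the expensive
    # evaluation budget only on the proxy winners, keeping real improvements.
    ranked = sorted(candidates, key=proxy_score, reverse=True)
    return [c for c in ranked[:budget] if full_score(c) > 0]

changes = [{"name": f"change-{i}", "true_gain": g}
           for i, g in enumerate([0.05, -0.03, 0.01, 0.002, -0.01])]
kept = promote(changes)
print([c["name"] for c in kept])
```

The design choice mirrors the swarm roadmap: proxy noise may admit a few false positives into the shortlist, but the expensive tier only ever pays for `budget` runs, so the cost of full-scale evaluation stays bounded regardless of how many candidates the agents generate.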
A useful read from practitioners is that the hard part may shift from execution to research design. In one engineer's reaction, the interesting work becomes setting hypotheses, building verification methods, and using "contracts" so longer-horizon agents improve systems without drifting off-task.
The widely shared summary: "Andrej Karpathy just dropped something absurdly insane. An open-source repo where an AI agent runs its own ML research loop. While you sleep. The setup is almost absurdly simple: ~630 lines of code, single GPU, 5-minute training runs. But here's the twist. The human …"
Karpathy's release post: "I packaged up the 'autoresearch' project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then: the human iterates on the …"
Karpathy's results post: "Three days ago I left autoresearch tuning nanochat for ~2 days on depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, …"
"All LLM frontier labs will do this. It's the final boss battle... Doing it is 'just engineering' and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and Show more