workflow · March 21, 2026

Autoresearch claims 2718 Elo after 70 experiments on a Rust chess engine

A developer says an autoresearch loop hill-climbed a vibecoded Rust engine to 2718 Elo after running more than 70 experiments under a 500 ms move budget. The real takeaway is the workflow: automated experiment loops can optimize code against a measurable target.

Coding Agents Benchmarks Deep Research

TL;DR

Developer Deedy Das says Karpathy-style Autoresearch pushed a "vibecoded Rust chess engine" from "expert" strength to a reported 2718 Elo after running "over 70 experiments" and hill-climbing on its Elo score, according to his results thread.
The underlying engine is conventional search, not a newly trained model: Deedy's technical breakdown says it uses negamax alpha-beta search with pruning, iterative deepening, opening books, and a transposition table, all tested at a 500 ms per-move limit.
The practical engineering story is the loop, not chess. The community reaction around Autoresearch frames it as a reusable pattern for "everything with a measurable metric," where the agent proposes code changes and keeps what improves the target.
That pattern may extend beyond fully verifiable tasks. In a practitioner take, Shreya Shankar says she is "very optimistic" about combining these search loops with qualitative evaluators for more subjective coding-style work.

What actually improved in the chess engine

The claim is specific: the agent ran more than 70 experiments on its own, searching for code changes that improved Elo, and the best run landed at 2718 according to Deedy's results thread. He also describes the end result as "a top 50 grandmaster" and a "#311 chess engine." The Elo figure is the part that matters to engineers: it is a concrete target variable rather than a vague quality claim.
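The thread does not say how the 2718 figure was measured. As a rough illustration of why Elo works as a target variable, the standard Elo expected-score formula can be inverted to turn a match result into a rating difference. The function names below are illustrative, not taken from the engine or from Autoresearch.

```python
import math


def expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_diff_from_match(wins, draws, losses):
    """Estimate the rating difference implied by a match result.

    Draws count as half a point. A 0% or 100% score has no finite
    Elo difference, so it is rejected.
    """
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


# An even match implies no rating difference.
even = elo_diff_from_match(wins=40, draws=20, losses=40)
```

A harness built this way would need enough games per experiment for the estimate to be stable; a handful of games gives a very noisy difference.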

The engine itself is not exotic. Deedy's technical breakdown says it uses "negamax alpha-beta tree search with pruning and iterative deepening," plus standard opening books and a transposition table that caches results for positions already searched. He adds that every test used a 500 ms per-move budget and that there is "no offline computation or training element," so this was an automated experiment loop over ordinary engine code, not a training pipeline or a one-off tuned checkpoint. The most obvious next lever, by his account, would be replacing the static evaluation function with an efficiently updatable neural network (NNUE).
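For readers who have not seen these pieces together, here is a minimal sketch of negamax with alpha-beta pruning, iterative deepening, a transposition table, and a per-move time budget. It is not the engine's code: it is written in Python rather than Rust, and it swaps chess for a toy subtraction game (take 1 to 3 stones, last stone wins) so that it stays short and runnable. All names are assumptions for illustration.

```python
import time

EXACT, LOWER, UPPER = 0, 1, 2  # transposition-table bound types


def legal_moves(pile):
    """Toy game: remove 1-3 stones; whoever takes the last stone wins."""
    return [k for k in (1, 2, 3) if k <= pile]


def negamax(pile, depth, alpha, beta, table, deadline):
    """Score for the side to move: +1 win, -1 loss, 0 unknown at the horizon."""
    if time.monotonic() > deadline:
        raise TimeoutError
    if pile == 0:
        return -1  # the opponent just took the last stone
    if depth == 0:
        return 0   # placeholder static evaluation
    alpha_orig = alpha
    entry = table.get((pile, depth))
    if entry is not None:
        flag, value = entry
        if flag == EXACT:
            return value
        if flag == LOWER:
            alpha = max(alpha, value)
        else:
            beta = min(beta, value)
        if alpha >= beta:
            return value
    best = -2
    for k in legal_moves(pile):
        score = -negamax(pile - k, depth - 1, -beta, -alpha, table, deadline)
        best = max(best, score)
        alpha = max(alpha, best)
        if alpha >= beta:
            break  # alpha-beta cutoff
    if best <= alpha_orig:
        flag = UPPER
    elif best >= beta:
        flag = LOWER
    else:
        flag = EXACT
    table[(pile, depth)] = (flag, best)
    return best


def search(pile, budget_s=0.5, max_depth=64):
    """Iterative deepening: keep the last fully searched depth's answer."""
    deadline = time.monotonic() + budget_s
    table = {}
    best_move, best_score = legal_moves(pile)[0], 0
    for depth in range(1, max_depth + 1):
        try:
            scored = [(-negamax(pile - k, depth - 1, -2, 2, table, deadline), k)
                      for k in legal_moves(pile)]
        except TimeoutError:
            break  # budget exhausted mid-depth: fall back to the previous depth
        best_score, best_move = max(scored)
        if best_score != 0:
            break  # nonzero scores are proven wins or losses in this toy game
    return best_move, best_score
```

With a pile of 10, `search(10)` returns the move that leaves a multiple of four stones. In a real engine the `depth == 0` branch is where the static evaluation function sits, which is the component Deedy suggests replacing.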

Why this workflow matters beyond chess

The broader takeaway is that Autoresearch looks most convincing on problems with a hard metric and a fast evaluation harness. The community reaction describes practitioners applying it to "everything with a measurable metric," which is exactly why chess is a useful demo: Elo is legible, regressions are cheap to detect, and the search space is mostly code and parameter changes.
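The propose-evaluate-keep pattern described above can be sketched in a few lines. This is an assumption-laden illustration, not Autoresearch's implementation: the real loop edits code and measures Elo, whereas this sketch mutates two hypothetical integer parameters (`param_a`, `param_b`) and scores them with a stand-in metric.

```python
import random


def hill_climb(start, propose, evaluate, experiments=70, seed=0):
    """Run a fixed number of experiments, keeping only strict improvements."""
    rng = random.Random(seed)
    best, best_score = dict(start), evaluate(start)
    log = []
    for i in range(experiments):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        kept = score > best_score
        if kept:
            best, best_score = candidate, score
        log.append((i, score, kept))
    return best, best_score, log


def propose(params, rng):
    """Hypothetical mutation: nudge one parameter up or down by one."""
    candidate = dict(params)
    name = rng.choice(sorted(candidate))
    candidate[name] += rng.choice([-1, 1])
    return candidate


def evaluate(params):
    """Stand-in metric; a real harness would play games and report Elo."""
    target = {"param_a": 3, "param_b": 12}
    return -sum((params[k] - target[k]) ** 2 for k in target)


start = {"param_a": 1, "param_b": 9}
best, best_score, log = hill_climb(start, propose, evaluate)
```

The `log` records every experiment, including rejected ones, which is what makes a regression cheap to spot. Greedy acceptance like this can stall on a local optimum, and with a noisy metric it can also accept lucky candidates.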

Engineers are already testing how far that loop generalizes. Shreya Shankar argues that "auto research-style search loops" can be paired with qualitative evaluators for "non-verifiable" tasks, suggesting a bridge from benchmark optimization to coding workflows where the score is fuzzier but still rankable. Another practitioner trying the approach on RL infrastructure sums up the uncertainty: "Maybe it works, maybe it doesn't." That is where this story lands for engineering teams: an agentic optimization loop appears able to improve real code when the objective is measurable and the eval cycle is short enough to run dozens of times.
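One way to read "fuzzier but still rankable" is to replace the numeric metric with a pairwise preference. The sketch below is a guess at that shape, not anything described in the sources: `judge` is a hypothetical callable (in practice it might be a model or a human reviewer), and the toy judge here simply prefers the shorter snippet.

```python
def majority_prefers(candidate, incumbent, judge, votes=5):
    """True if the judge picks the candidate in more than half of the votes."""
    wins = sum(1 for _ in range(votes) if judge(candidate, incumbent))
    return wins * 2 > votes


def climb_with_judge(start, proposals, judge, votes=5):
    """Keep a proposal only when the judge prefers it to the current best."""
    best = start
    for candidate in proposals:
        if majority_prefers(candidate, best, judge, votes):
            best = candidate
    return best


def shorter_is_better(a, b):
    """Toy stand-in judge: prefers the shorter of two code snippets."""
    return len(a) < len(b)


result = climb_with_judge(
    "x = x + 1 + 0",
    ["x = x + 1", "x = x + 1  # increment x by one", "x += 1"],
    shorter_is_better,
)
```

Repeated votes only help when the judge is noisy; with the deterministic toy judge above, every vote agrees.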

🧾 More sources

What actually improved in the chess engine

Deedy (@deedydas) · 2:56 AM · Mar 22, 2026

Karpathy's Autoresearch pushed my vibecoded Rust chess engine AI from "expert" to a top 50 grandmaster, a #311 chess engine. It ran over 70 experiments on its own and tried to hill climb to the top ELO score it could, landing at 2718!

Deedy (@deedydas), replying to @deedydas · 2:56 AM · Mar 22, 2026

This approach fundamentally uses a negamax alpha-beta tree search with pruning and iterative deepening. I tested everything with a 500ms per move limit. The main way to improve it would be to get rid of the static evaluation at the nodes and replace it with efficiently updatable …

Why this workflow matters beyond chess

Zhengyao Jiang (@zhengyaojiang) · 3:56 PM · Mar 21, 2026

Autoresearch has been out for 2 weeks. The community is trying to apply it to everything with a measurable metric, here are some successful attempts: 🧵 (1/6)

Shreya Shankar · 5:26 PM · Mar 21, 2026

Fun article on plugging together auto research-style search loops with qualitative coding-style evaluators. I am very optimistic about this approach on non-verifiable (ie subjective) tasks

Quoting George from 🕹prodmgmt.world (@nurijanian): x.com/i/article/2034…
