workflow · May 25, 2026

Microsoft benchmarks SkillOpt at +24.8 Codex points by editing skills, not weights

Microsoft Research released SkillOpt, which optimizes external skill files instead of fine-tuning model weights and reports best-or-tied results across 52 evaluation cells. The method matters because it improved Codex and Claude Code accuracy without extra inference-time calls.


TL;DR

Microsoft Research's SkillOpt treats the agent's skill document as editable external state while keeping model weights frozen, according to omarsar0's summary and the linked paper.
Across 52 combinations of model, benchmark, and harness, SkillOpt was best or tied in every cell, per omarsar0's benchmark summary and daniel_mac8's breakdown.
The biggest headline gains came on GPT-5.5 setups: omarsar0 reported +23.5 points in direct chat, +24.8 in Codex, and +19.1 in Claude Code over running with no skill.
SkillOpt's loop keeps an edit to the skill file only when hidden-set validation improves, and the learned skill adds no extra inference-time calls at deployment, as daniel_mac8 and the project page describe it.
Microsoft released the method openly with a public GitHub repo, a project page, and the arXiv paper, all linked from AlphaSignalAI's source roundup.

You can read the paper, browse the repo, and check the project page. The fun detail is that the "training" target here is plain text, not parameters: daniel_mac8 highlights a textual learning-rate budget, a rejected-edit buffer, and epoch-wise meta updates, while TheTuringPost points to spreadsheet solving jumping from 41.8% to 80.7%.

Skill documents as trainable state

The paper's central move is simple: stop treating the skill doc as static prompt glue. Instead, treat it as external state that can be optimized while the underlying agent stays frozen.

That lands because a lot of agent work today still looks like handwritten instructions plus hope. omarsar0 frames SkillOpt as a way to train the playbook around the model rather than retrain the model itself.

The edit loop is validation-gated

SkillOpt runs the agent, scores the rollout, edits the skill file, and keeps the change only if performance improves. TheTuringPost's workflow summary describes the edit operations as add, remove, and rewrite steps over the instruction file.
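The run, score, edit, keep-if-better loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `optimize_skill`, `toy_propose`, `toy_score`, and the `HINTS` and `NOISE` lists are hypothetical names, the edit proposer is random where SkillOpt would use a model, and the scorer is a toy stand-in for validation on real agent rollouts.

```python
import random
from typing import Callable, Tuple

# Hypothetical toy data: lines that help the "agent" and lines that hurt it.
HINTS = ["check units", "validate formulas", "cite cell ranges"]
NOISE = ["be creative", "answer quickly"]


def toy_score(skill: str) -> float:
    """Stand-in for validation: reward helpful lines, penalize noisy ones."""
    lines = set(skill.splitlines())
    return sum(h in lines for h in HINTS) - 0.5 * sum(n in lines for n in NOISE)


def toy_propose(skill: str, rng: random.Random) -> str:
    """Stand-in for the editor: one random add, remove, or rewrite of a line."""
    lines = [line for line in skill.splitlines() if line]
    op = rng.choice(["add", "remove", "rewrite"])
    if op == "add" or not lines:
        lines.append(rng.choice(HINTS + NOISE))
    elif op == "remove":
        lines.pop(rng.randrange(len(lines)))
    else:
        lines[rng.randrange(len(lines))] = rng.choice(HINTS + NOISE)
    return "\n".join(lines)


def optimize_skill(
    skill: str,
    propose_edit: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    steps: int = 200,
    seed: int = 0,
) -> Tuple[str, float, int]:
    """Keep a candidate edit only if it strictly improves the score."""
    rng = random.Random(seed)
    best_score = score(skill)
    accepted = 0
    for _ in range(steps):
        candidate = propose_edit(skill, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # the validation gate
            skill, best_score = candidate, candidate_score
            accepted += 1
        # otherwise the edit is discarded and the skill is unchanged
    return skill, best_score, accepted


final_skill, final_score, n_accepted = optimize_skill(
    "be creative", toy_propose, toy_score
)
print(final_skill)
print(final_score, n_accepted)
```

Because only strict improvements are kept, the score can never fall below its starting value, and the model weights never appear in the loop at all.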

The method details surfaced in daniel_mac8's post are unusually concrete for prompt optimization work:

textual learning-rate budget
rejected-edit buffer
epoch-wise meta updates
zero extra inference-time calls once the skill is learned

That last point is the killer feature. The optimized skill is just text shipped with the agent harness, not an extra model call in the serving path.
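Two of the listed mechanisms, the textual learning-rate budget and the rejected-edit buffer, are named in the posts but not specified there. One plausible reading, offered purely as an assumption, is that the budget caps how much of the skill text a single edit may change, and the buffer stops the optimizer from re-proposing edits that already failed. The sketch below implements that reading; `change_fraction`, `EditFilter`, `max_change`, and `buffer_size` are hypothetical names, not SkillOpt's API.

```python
from collections import deque
from difflib import SequenceMatcher


def change_fraction(old: str, new: str) -> float:
    """1 minus difflib's line-level similarity ratio (0.0 means identical)."""
    matcher = SequenceMatcher(None, old.splitlines(), new.splitlines())
    return 1.0 - matcher.ratio()


class EditFilter:
    """Assumed pre-check run before spending a rollout on a candidate edit."""

    def __init__(self, max_change: float = 0.3, buffer_size: int = 50):
        self.max_change = max_change  # assumed "learning rate" for text
        self.rejected = deque(maxlen=buffer_size)  # oldest entries fall out

    def allows(self, current: str, candidate: str) -> bool:
        if candidate in self.rejected:
            return False  # already tried and rejected
        return change_fraction(current, candidate) <= self.max_change

    def record_rejection(self, candidate: str) -> None:
        self.rejected.append(candidate)


edit_filter = EditFilter(max_change=0.3)
current = "a\nb\nc\nd"
small_edit = "a\nb\nc\nx"  # one of four lines rewritten
large_edit = "x\ny\nz\nw"  # every line rewritten

print(edit_filter.allows(current, small_edit))
print(edit_filter.allows(current, large_edit))
edit_filter.record_rejection(small_edit)
print(edit_filter.allows(current, small_edit))
```

A filter like this would run before scoring, so an oversized or previously rejected edit costs no rollouts.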

Benchmarks moved across chat, Codex, and Claude Code

The headline result is breadth. omarsar0 says SkillOpt was best or tied across all 52 evaluated cells spanning models, benchmarks, and harnesses.

The GPT-5.5 averages that got attention, all measured against running with no skill and all reported in omarsar0's post, were:

Direct chat: +23.5 points
Codex: +24.8 points
Claude Code: +19.1 points

One benchmark example from TheTuringPost is spreadsheet solving moving from 41.8% to 80.7%. That is a giant jump for a method that never touches weights.
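To put the spreadsheet figure in context, the same two numbers can be read three ways: as an absolute point gain, as a relative improvement in accuracy, and as a reduction in error rate. The inputs below are the two percentages quoted above; the variable names are illustrative only.

```python
before, after = 41.8, 80.7  # spreadsheet-solving accuracy in percent

point_gain = after - before                       # absolute gain in points
relative_gain = point_gain / before * 100         # gain relative to baseline
error_reduction = point_gain / (100 - before) * 100  # share of errors removed

print(f"absolute gain: {point_gain:.1f} points")      # 38.9 points
print(f"relative gain: {relative_gain:.1f}%")         # 93.1%
print(f"error-rate reduction: {error_reduction:.1f}%")  # 66.8%
```

In other words, accuracy nearly doubled, and roughly two thirds of the tasks the baseline got wrong were solved with the optimized skill.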

Skills transferred across models and harnesses

The paper is not only claiming better prompt editing. It is also claiming that the learned skills transfer across models and harnesses, according to omarsar0's summary and TheTuringPost's writeup.

That matters because the harness itself appears to be part of the optimization target. daniel_mac8 calls out the harness as at least as important as the model, which is a pretty direct shot at the common assumption that better agent behavior mostly comes from swapping base models.

In practice, SkillOpt's benchmarks pose the same capability question through three different wrappers: direct chat, Codex, and Claude Code. The spread in gains across those setups is a reminder that the instructions around the model are doing real work.

Microsoft shipped paper, repo, and project site

This is already more than a paper drop. The source roundup in AlphaSignalAI's post points to all three public artifacts: the arXiv paper, the GitHub repository, and the project website.

That makes SkillOpt easy to inspect from three angles: the research claim in the paper, the implementation in the repo, and the benchmark framing on the site. For agent engineers, that is a much better package than a screenshot-only research teaser.