Release · June 28, 2026

Microsoft opens SkillOpt with batch eval loops for agent SOP files

Microsoft open-sourced SkillOpt, a system that treats agent skill documents as tunable artifacts and improves them against measured task batches. It matters because practitioners are already standardizing shared /research, QA, and packageable skills across harnesses, turning skill files into a new optimization surface alongside models.

TL;DR

Microsoft open-sourced SkillOpt, and pauliusztin_'s walkthrough frames it as a system for improving an agent's skill document instead of changing the underlying model.
The core loop in the SkillOpt thread is batch-based and measurable: run tasks, score outcomes, propose skill edits, then keep only the edits that improve validation performance.
Practitioners were already treating skill files as real harness artifacts before this release, with zeeg's project-local QA setup and Matt Pocock's /research skill idea showing the pattern in the wild.
Skill files are also getting packaged and shared across tools, with Matt Pocock's reply about .agents/skills, jakubkrehel's oklch-skill package, and Wes Roth's Excel skills post all pointing to reusable SOP files rather than one-off prompts.

You can open the SkillOpt link from the original post, grab a packaged oklch-skill from npm, and watch Microsoft push the same SKILL.md idea into Excel workflows in Wes Roth's demo post. The durable idea across all three is the one pauliusztin_ called out: the skill document is turning into its own optimization surface.

Skill documents

According to the main SkillOpt post, a skill is a small SOP-like document that specifies how to solve a task, which tools to use, which steps to follow, how to format outputs, and what good behavior looks like. A follow-up post from the same author makes the hierarchy explicit by arguing that the skill document can matter more than the model, memory, or harness.
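To make that list of ingredients concrete, here is a minimal Python sketch of a skill document as a data structure. It is an illustration only: the `SkillDoc` class, its field names, and the rendered section headings are hypothetical and are not taken from SkillOpt or from any SKILL.md specification.

```python
from dataclasses import dataclass, field


@dataclass
class SkillDoc:
    """Hypothetical in-memory model of a skill document (not SkillOpt's schema)."""

    name: str
    task: str                                               # how to solve the task
    tools: list[str] = field(default_factory=list)          # which tools to use
    steps: list[str] = field(default_factory=list)          # which steps to follow
    output_format: str = ""                                 # how to format outputs
    good_behavior: list[str] = field(default_factory=list)  # what good looks like

    def render(self) -> str:
        """Render the skill as a plain, markdown-like SOP."""
        lines = [f"# {self.name}", "", self.task, "", "## Tools"]
        lines += [f"- {tool}" for tool in self.tools]
        lines += ["", "## Steps"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.steps, start=1)]
        lines += ["", "## Output format", self.output_format, "", "## Good behavior"]
        lines += [f"- {rule}" for rule in self.good_behavior]
        return "\n".join(lines)
```

Because the whole document is ordinary text, any of these fields can be edited and re-evaluated without touching the model, which is the property the posts above are pointing at.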

That lines up with the community language around skills. Another pauliusztin_ post describes the trend as training documents instead of parameters, which is a crisp summary of why this feels closer to ML-style iteration than ordinary prompt tweaking.

Batch eval loops

[Figure: SkillOpt terminal run]

The original thread lays out SkillOpt's loop in five steps:

1. Run the agent on a batch of tasks.
2. Measure performance.
3. Use a second model to analyze failures.
4. Propose edits to the skill document.
5. Keep only the edits that improve validation performance.

The useful shift is the validation step. Ordinary prompt editing often stops at "this looked better once." SkillOpt, at least in the release framing captured by the announcement thread, keeps an edit only when the validation score improves.
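The five steps and the validation gate can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not SkillOpt's implementation: the function names, the callable signatures, the "score below 1.0 counts as a failure" rule, and the strict "greater than" acceptance test are all hypothetical, and the toy agent and editor stand in for real model calls.

```python
from typing import Callable, Sequence

# All names below are hypothetical stand-ins, not SkillOpt's API.
RunAgent = Callable[[str, str], float]              # (skill, task) -> score in [0, 1]
ProposeEdit = Callable[[str, Sequence[str]], str]   # (skill, failed tasks) -> edited skill


def batch_score(skill: str, tasks: Sequence[str], run_agent: RunAgent) -> float:
    """Mean score of one skill document over one batch of tasks."""
    return sum(run_agent(skill, task) for task in tasks) / len(tasks)


def optimize_skill(
    skill: str,
    train_tasks: Sequence[str],
    val_tasks: Sequence[str],
    run_agent: RunAgent,
    propose_edit: ProposeEdit,
    rounds: int = 5,
) -> tuple[str, float]:
    """Return the best skill text found and its validation score."""
    best, best_val = skill, batch_score(skill, val_tasks, run_agent)
    for _ in range(rounds):
        # Steps 1-2: run the agent on the training batch and measure each task.
        scored = [(task, run_agent(best, task)) for task in train_tasks]
        failures = [task for task, s in scored if s < 1.0]
        if not failures:
            break
        # Steps 3-4: a second model analyzes the failures and proposes an edit.
        candidate = propose_edit(best, failures)
        # Step 5: keep the edit only if the validation score improves.
        candidate_val = batch_score(candidate, val_tasks, run_agent)
        if candidate_val > best_val:
            best, best_val = candidate, candidate_val
    return best, best_val


def toy_agent(skill: str, task: str) -> float:
    """Toy agent: succeeds only if the skill text mentions the task."""
    return 1.0 if task in skill else 0.0


def toy_editor(skill: str, failures: Sequence[str]) -> str:
    """Toy 'second model': appends a rule covering the first failed task."""
    return f"{skill}\n- {failures[0]}"
```

Calling `optimize_skill("Be helpful.", train, val, toy_agent, toy_editor)` with small task lists shows the gate at work: an edit that does not raise the validation score is discarded, and the previous skill text is kept.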

Project-local skills

Two separate tweets show the same pattern from the bottom up. zeeg's QA example describes project-local skills that help agents write better tests and handle UI-specific QA, while Matt Pocock's /research sketch reads like a candidate SOP waiting to be formalized into a reusable command.

The details matter because these are not giant agent frameworks. zeeg is talking about a few focused instructions that get refined over time, and Pocock is describing a background research flow that persists its results to a markdown file. Those are exactly the kinds of artifacts SkillOpt is designed to tune.

Packaging and distribution

The other reveal is that skills are starting to move between environments as packages and files, not just prompts copied from chat logs.

jakubkrehel shipped an oklch-skill installable with npx skills add, aimed at color conversion, palette generation, contrast checks, and migration.
Matt Pocock's reply says many harnesses converge on .agents/skills, enough that he symlinks different setups into that path.
Wes Roth's Excel post says Microsoft's Copilot skills in Excel use an open-standard SKILL.md file stored in OneDrive for recurring finance workflows.

That is new surface area around the same primitive: a markdown-like operational document that can be shared, mounted into a harness, and increasingly optimized.
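The symlink approach mentioned above can be sketched with Python's standard library. This is an assumption-heavy illustration: the direction of the link (a harness-specific directory pointing at a shared `.agents/skills`), the `.example-harness` directory name, and the `link_skills` and `demo` helpers are hypothetical, and the demo runs inside a temporary directory rather than a real project.

```python
import tempfile
from pathlib import Path


def link_skills(shared: Path, harness_dir: Path) -> Path:
    """Make a harness-specific skills path a symlink to the shared directory.

    Raises FileExistsError if harness_dir already exists as a real directory,
    so existing skills are never silently replaced.
    """
    harness_dir.parent.mkdir(parents=True, exist_ok=True)
    if harness_dir.is_symlink():
        harness_dir.unlink()  # replace a stale link, never a real directory
    harness_dir.symlink_to(shared.resolve(), target_is_directory=True)
    return harness_dir


def demo() -> str:
    """Create a shared skill, link a hypothetical harness to it, read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        shared = root / ".agents" / "skills"
        (shared / "research").mkdir(parents=True)
        (shared / "research" / "SKILL.md").write_text("# research\n")
        # ".example-harness" is a made-up name for a tool-specific directory.
        linked = link_skills(shared, root / ".example-harness" / "skills")
        return (linked / "research" / "SKILL.md").read_text()
```

With this layout there is one copy of each skill file on disk, so an edit made through any harness (or by an optimizer) is visible to all of them. Creating symlinks may require extra privileges on Windows.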
