Release: March 9, 2026

Hermes Agent introduces self-evolution with a reported 39.5% quality gain

Nous Research released a self-evolution package for Hermes Agent that uses DSPy and GEPA to optimize skills, prompts, and code, and reported a phase-one score increase from 0.408 to 0.569 on one skill. Agent teams can study the repo for fallback model, memory, and self-improvement loop patterns.

TL;DR

Nous Research shipped hermes-agent-self-evolution, a Hermes Agent package that uses DSPy and GEPA to optimize the agent’s own skills, prompts, and code through an evolutionary loop rather than GPU retraining, according to the launch thread and Teknium’s repo post.
The clearest reported result is a phase-one gain on one skill: the shared results image shows an arXiv-task score moving from 0.408 to 0.569, which Nous and collaborators describe as a 39.5% quality improvement.
The repo and report position the system as population-based optimization: Nous’s announcement says it keeps multiple candidate solutions, applies LLM-driven mutations tied to failure cases, and selects by fitness; the validation report is the primary artifact behind that claim.
Alongside self-evolution, Hermes also added practical agent runtime features, including automatic fallback-model failover, multi-platform messaging, and broad secret redaction, according to the weekend ship list and Teknium’s local-run note. That makes this look less like a one-off research demo and more like a deployable agent stack in motion.

What shipped in Hermes self-evolution?

The core release is hermes-agent-self-evolution, which Nous describes as “an evolutionary self-improvement system” for Hermes Agent. In the main announcement, the team says it uses DSPy plus GEPA to optimize “skills, prompts, and code,” maintains populations of solutions, and applies “LLM-driven mutations” aimed at specific failures rather than doing standard model finetuning.
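The loop described above (keep a population, mutate candidates in response to specific failures, select by fitness) can be illustrated with a minimal sketch. This is not the Hermes, DSPy, or GEPA implementation: every name here (`Candidate`, `evolve`, `toy_evaluate`, `toy_mutate`) is hypothetical, and the mutation step is a plain string edit standing in for what would be an LLM call in a real system.

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical types for illustration; none of these names come from the Hermes repo.
Evaluation = Tuple[float, Tuple[str, ...]]  # (fitness score, failure cases)


@dataclass(frozen=True)
class Candidate:
    text: str                   # e.g. a skill description or prompt
    score: float                # fitness assigned by the evaluator
    failures: Tuple[str, ...]   # failure cases observed during evaluation


def evolve(
    seed: str,
    evaluate: Callable[[str], Evaluation],
    mutate: Callable[[str, str], str],
    generations: int = 5,
    population_size: int = 4,
    rng_seed: int = 0,
) -> Candidate:
    """Population-based search: mutate against failures, keep the fittest."""
    rng = random.Random(rng_seed)
    score, failures = evaluate(seed)
    population = [Candidate(seed, score, failures)]
    for _ in range(generations):
        children = []
        for parent in population:
            if not parent.failures:
                continue  # nothing left to fix for this candidate
            # Tie the mutation to one concrete failure case.
            failure = rng.choice(parent.failures)
            child_text = mutate(parent.text, failure)
            child_score, child_failures = evaluate(child_text)
            children.append(Candidate(child_text, child_score, child_failures))
        # Selection: parents and children compete; the best survive.
        ranked = sorted(population + children, key=lambda c: c.score, reverse=True)
        population = ranked[:population_size]
    return population[0]


# Toy stand-ins (assumptions): fitness is the share of required terms present,
# and each missing term counts as one failure case.
REQUIRED = ("title", "authors", "abstract")


def toy_evaluate(text: str) -> Evaluation:
    missing = tuple(term for term in REQUIRED if term not in text)
    return 1 - len(missing) / len(REQUIRED), missing


def toy_mutate(text: str, failure: str) -> str:
    # A real system would ask an LLM to rewrite the text given the failure.
    return f"{text} Include the {failure}."


best = evolve("Summarize the paper.", toy_evaluate, toy_mutate)
```

Because parents stay in the selection pool, the best score never decreases from one generation to the next. In this toy setup the search runs entirely through function calls, which mirrors the article's point that no gradient updates or GPU training are involved.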

The strongest concrete evidence so far is the phase-one validation result. The results screenshot says the pipeline runs “via API calls without GPU training” and reports a baseline-to-optimized jump from 0.408 to 0.569 on the arXiv skill, labeled as “+39.5%.” Teknium’s linked validation report is the source document for that early measurement, but the public evidence here is still narrow: one skill, one phase, and one reported score delta.
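The “+39.5%” label is a relative gain over the baseline, not an absolute change in score. A short check of the arithmetic, using only the two reported numbers (the function name is illustrative):

```python
def relative_gain(baseline: float, optimized: float) -> float:
    """Return the improvement as a fraction of the baseline score."""
    return (optimized - baseline) / baseline


baseline, optimized = 0.408, 0.569   # scores reported for the arXiv skill
absolute = optimized - baseline       # about 0.161 points on the score scale
relative = relative_gain(baseline, optimized)

print(f"absolute gain: {absolute:.3f}")   # absolute gain: 0.161
print(f"relative gain: {relative:.1%}")   # relative gain: 39.5%
```

The absolute change is about 0.161 on the score scale, so the headline percentage depends on the low starting point as much as on the size of the improvement.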

That still makes the release interesting for engineers because the target of optimization is not just prompt text. Nous says in the results screenshot that the loop can rewrite the agent’s “skills, descriptions, prompts, and code,” which pushes it closer to a self-editing agent framework than a prompt tuner.

What matters for implementation beyond the benchmark?

Hermes’ weekend release bundled several runtime features that make the self-improvement story more relevant to production agents. In the launch thread, Nous says Hermes now supports automatic provider failover when a primary model is rate-limited or down, with fallback switching across providers including Codex OAuth and Nous Portal. The same post also says tool outputs now redact “API keys, tokens, and passwords” across 22-plus patterns before they ever reach model context.
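Both runtime features can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Hermes’ code: the `ProviderUnavailable` exception, the provider callables, and the three regex patterns are all hypothetical, and the patterns are far narrower than the 22-plus that Nous describes.

```python
import re
from typing import Callable, List, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Hypothetical signal that a provider is rate-limited or down."""


# Illustrative patterns only; a production list would be much broader.
SECRET_PATTERNS = [
    (re.compile(r"sk-[A-Za-z0-9]{16,}"), "[REDACTED]"),
    (re.compile(r"(?i)(bearer\s+)[A-Za-z0-9._-]{16,}"), r"\1[REDACTED]"),
    (re.compile(r"(?i)(password\s*[=:]\s*)\S+"), r"\1[REDACTED]"),
]


def redact(text: str) -> str:
    """Mask secret-looking substrings before text is added to model context."""
    for pattern, replacement in SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


Provider = Tuple[str, Callable[[str], str]]  # (name, completion function)


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in order; move on only when one reports unavailability."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Toy providers standing in for real model backends.
def primary(prompt: str) -> str:
    raise ProviderUnavailable("rate limited")


def backup(prompt: str) -> str:
    return f"ok: {prompt}"


tool_output = "password=hunter2 key sk-abcdefghijklmnop1234"
used, reply = complete_with_fallback(redact(tool_output), [("primary", primary), ("backup", backup)])
```

Redacting before the provider call means the fallback path sees the same masked text as the primary path, so switching providers does not widen exposure of secrets.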

Those surrounding changes matter because self-modifying or self-optimizing agents are only useful if the runtime is resilient. Nous also says in the launch thread that Hermes now runs across Signal, iMessage, Telegram, Discord, WhatsApp, Slack, and CLI with “full feature parity,” while Teknium’s local-run note adds that the agent supports “locally running models” and can itself run locally. Together, that gives engineers a concrete set of patterns to inspect: local execution, fallback routing, memory across sessions, and a search loop that tries to improve an agent’s own components over time.
