Kimi K2.6 launches with 58.6 SWE-Bench Pro and 4,000-tool-call agent runs

Moonshot open-sourced Kimi K2.6, a 1T-parameter MoE with 32B active parameters, 256K context, multimodal input, and larger agent swarms. It now sits near frontier closed models for long-horizon coding and tool use, so teams can try it for agent workflows.


TL;DR

You can read the official blog post, inspect the model card, and browse day-one serving docs from vLLM, SGLang, and OpenRouter. The rollout also shipped fast across products: Baseten went live on day 0, NousResearch added Hermes Agent support, and Ollama exposed it through OpenClaw, Hermes, and Claude-style harnesses.

What shipped

Moonshot shipped K2.6 across chat, API, and downloadable weights on the same day, with links from the official announcement, the API platform, and the Hugging Face repo.

The exposed surfaces split into four modes, according to testingcatalog's UI capture and scaling01's platform screenshot:

  • K2.6 Instant: quick-response mode
  • K2.6 Thinking: deeper reasoning mode
  • K2.6 Agent: research, docs, slides, websites, sheets
  • K2.6 Agent Swarm Beta: large-scale search, long-form writing, batch tasks

The model card and serving posts add the harder specs: a 1T-parameter MoE with 32B active parameters, 256K context, and multimodal input.
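
Since the serving stack is OpenAI-compatible across vLLM, SGLang, and OpenRouter, the quickest smoke test is a plain chat-completions call. A minimal Python sketch follows; the `moonshotai/kimi-k2.6` slug is a placeholder assumption, so check the OpenRouter listing for the real identifier before running it.

```python
# Minimal chat-completions call against an OpenAI-compatible endpoint.
# The model slug below is a placeholder; confirm it on OpenRouter's listing.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # or a self-hosted vLLM/SGLang server
    api_key="YOUR_OPENROUTER_KEY",
)

resp = client.chat.completions.create(
    model="moonshotai/kimi-k2.6",  # placeholder slug, not confirmed by the launch posts
    messages=[{"role": "user", "content": "Summarize the K2.6 release in two sentences."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```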

Benchmarks

Moonshot's own chart puts K2.6 in a very specific place: first on several open-weight coding and tool-use tasks, near the frontier closed models on the rest, and still behind on some search, terminal, and vision rows.

The easiest way to scan the claims from the benchmark table and the launch chart is as a split list.

Moonshot-led rows

  • HLE with tools: 54.0, ahead of GPT-5.4 at 52.1, Claude Opus 4.6 at 53.0, and Gemini 3.1 Pro at 51.4
  • DeepSearchQA f1: 92.5, ahead of GPT-5.4 at 78.6 and Claude Opus 4.6 at 91.3
  • SWE-Bench Pro: 58.6, ahead of GPT-5.4 at 57.7, Claude Opus 4.6 at 53.4, and Gemini 3.1 Pro at 54.2

Rows it does not lead

  • BrowseComp: 83.2, behind Gemini 3.1 Pro at 85.9
  • Toolathlon: 50.0, behind GPT-5.4 at 54.6
  • Terminal-Bench 2.0: 66.7, behind Gemini 3.1 Pro at 68.5
  • SWE-Bench Multilingual: 76.7, behind GPT-5.4 at 77.8
  • MathVision with Python: 93.2, behind GPT-5.4 at 96.1

Third-party evals tightened the same picture instead of flipping it. Artificial Analysis ranks K2.6 at 54 on its Intelligence Index, just below the top entries from Anthropic, Google, and OpenAI, which sit tied at 57, and ArtificialAnlys' thread says that is enough for the top open-weight spot. ValsAI separately put it at number 1 among open-weight models and number 7 overall, with ValsAI's follow-up attributing most of the gain to Terminal-Bench 2.0, up 17 points, and SWE-Bench, up 8 points versus K2.5.

Agent loops

The most interesting claims in this launch are about duration, not single-turn prompts. nrehiew_'s post shows K2.6 spending nearly 12 hours and more than 4,000 tool calls writing and optimizing a Zig inference engine for Qwen 3.5 on an M3 Max, moving from roughly 15 tokens per second to 193.1 and finishing about 20 percent above LM Studio's baseline.

Moonshot paired that with a longer internal case. According to Moonshot's quoted worklog, its RL infra team used a K2.6-backed agent for five days of monitoring, incident response, and system operations, with persistent context across the run.
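
Both of those runs come down to the same mechanical pattern: the harness loops, the model keeps emitting tool calls, and results get appended back into context until the model stops asking. Below is a minimal sketch of that loop under assumptions, not Moonshot's actual harness: it uses an OpenAI-compatible tools API, a single hypothetical `run_shell` tool, and the same placeholder model slug as above. Real agent harnesses layer on retries, sandboxing, and context compaction.

```python
# Minimal long-horizon tool loop: the model requests tool calls, the harness
# executes them and appends results, and the loop continues until the model
# stops asking. Hypothetical sketch only.
import json
import subprocess
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",  # hypothetical tool for illustration
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"cmd": {"type": "string"}},
            "required": ["cmd"],
        },
    },
}]

messages = [{"role": "user", "content": "Profile the build and speed it up."}]

for step in range(4000):  # cap mirrors the multi-thousand-call runs described above
    resp = client.chat.completions.create(
        model="moonshotai/kimi-k2.6",  # placeholder slug
        messages=messages,
        tools=tools,
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        break  # model answered without requesting more tool work
    for call in msg.tool_calls:
        cmd = json.loads(call.function.arguments)["cmd"]
        out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": (out.stdout + out.stderr)[-4000:],  # truncate long tool output
        })
```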

The swarm claims are bigger than the examples: the Agent Swarm Beta surface is pitched at large-scale search, long-form writing, and batch tasks rather than a single long-running session.

This is Christmas-come-early stuff for agent harness nerds, because Moonshot is not pitching K2.6 as a better autocomplete model. It is pitching a longer-lived worker.

Frontend agent

Moonshot spent unusual launch-space on website generation, and the details are specific enough to matter. The agent demo thread says K2.6 can generate:

  • video hero sections through external video generation APIs
  • raw GLSL and WGSL shader code
  • GSAP and Framer Motion animation flows
  • Three.js and React Three Fiber scenes
  • React 19, TypeScript, Vite, Tailwind, and shadcn/ui app scaffolds
  • auth, database, and backend wiring in the same pass

The demos in chetaslua's thread, crystalsssup's WebGL clip, and ai_for_success's one-shot asset test show why this part of the launch spread so fast. Moonshot is chasing the same taste-and-motion territory that made closed models sticky for webdev demos, but with open weights and an agent surface attached.

Where it shows up

K2.6 had one of the denser day-one rollouts in recent open-model launches. From the evidence pool and linked pages, it landed across vLLM, SGLang, OpenRouter, Baseten, Ollama, and NousResearch's Hermes Agent.

Some of those integrations added their own operational details. Baseten says it is serving K2.6 with KV-aware routing, NVFP4 weights on Blackwell, multimodal hierarchical caching, and prefill-decode disaggregation. Ollama turned it into launch commands for OpenClaw, Hermes, and a Claude-style terminal harness. NousResearch made it selectable in Hermes Agent the same day.

Price and rough edges

The price story is strong, but it is not flat from K2.5. OpenRouter's listing shows K2.6 at $0.95 per million input tokens and $4 per million output tokens. ValsAI's cost thread says Moonshot's own pricing moved from $0.10 and $3 on K2.5 to $0.16 and $4 on K2.6, with an average Vals cost of $0.21 per test, still far below Opus 4.7 at $1.05 per test.
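
For a sense of what those per-token rates mean for a long agent run, here is a back-of-envelope calculation at the listed OpenRouter prices; the token volumes are illustrative assumptions, not figures from the launch posts.

```python
# Back-of-envelope cost at the listed OpenRouter rates
# ($0.95 per 1M input tokens, $4 per 1M output tokens).
INPUT_PER_M = 0.95
OUTPUT_PER_M = 4.00

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a run given total input and output token counts."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# Example: a tool-heavy run that repeatedly re-reads a large context
# (20M input tokens, 1.5M output tokens are assumed numbers, not measured ones).
print(f"${run_cost(input_tokens=20_000_000, output_tokens=1_500_000):.2f}")  # $25.00
```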

The rough-edge reports arrived almost as quickly as the benchmark praise. Mollick's hands-on post called K2.6 Thinking very good for an open-weights model, but said the Lem Test produced a 74-page thinking trace with only an okay answer. In a follow-up, Mollick's comparison said light real-world use still felt worse than Claude Opus 4.6 despite the benchmark wins.

Other early testers converged on a similar split. teortaxesTex's roguelike run said the raw model came surprisingly close to GPT-5.4 on a staged HTML-to-voxel-game test and had better aesthetic taste, while the same tester's follow-up said the richer tool environment actually introduced errors on step one. A separate Chinese community writeup collected in ZhihuFrontier's Hermes summary reported zero 429s across 23 concurrent agents, but also slower time-to-first-token and weekly quota burn that consumed 24 percent in one day of heavy testing.
