Kimi K2.6 launches with 58.6 SWE-Bench Pro and 4,000-tool-call agent runs
Moonshot open-sourced Kimi K2.6, a 1T-parameter MoE with 32B active parameters, 256K context, multimodal input, and larger agent swarms. It now sits near frontier closed models for long-horizon coding and tool use, so teams can try it for agent workflows.

TL;DR
- Moonshot open-sourced Kimi K2.6, and Kimi_Moonshot's launch post says the new model leads open-weight peers on HLE with tools at 54.0, SWE-Bench Pro at 58.6, BrowseComp at 83.2, and Toolathlon at 50.0.
- The model card on Hugging Face and vllm_project's day-0 support post describe a 1T-parameter MoE with 32B active parameters, 256K context, multimodal input, MLA attention, and native INT4 quantization.
- The headline product shift is longer autonomous work: the launch thread claims 4,000-plus tool calls over 12-plus hours, while Moonshot's RL infra example says an internal K2.6-backed agent handled monitoring and incident response for 5 days.
- Moonshot also turned K2.6 into a bigger orchestration story, with the web UI rollout exposing Instant, Thinking, Agent, and Agent Swarm modes, and Moonshot's agent demo adding one-prompt frontend generation with video, WebGL shaders, and backend wiring.
- Third-party evals put the release close to frontier closed models, but not cleanly past them: according to Artificial Analysis and ValsAI's methodology thread, K2.6 is now the top open-weight model on both Artificial Analysis and Vals, while hands-on posts from Ethan Mollick, including a follow-up comparison, still report rough edges versus closed-model leaders.
You can read the official blog post, inspect the model card, and browse day-one serving docs from vLLM, SGLang, and OpenRouter. The rollout also shipped fast across products: Baseten went live on day 0, NousResearch added Hermes Agent support, and Ollama exposed it through OpenClaw, Hermes, and Claude-style harnesses.
What shipped
Moonshot shipped K2.6 across chat, API, and downloadable weights on the same day, with links from the official announcement, the API platform, and the Hugging Face repo.
The exposed surfaces split into four modes, according to testingcatalog's UI capture and scaling01's platform screenshot:
- K2.6 Instant: quick-response mode
- K2.6 Thinking: deeper reasoning mode
- K2.6 Agent: research, docs, slides, websites, sheets
- K2.6 Agent Swarm Beta: large-scale search, long-form writing, batch tasks
The model card and serving posts add the harder specs:
- Architecture: 1T total parameters, 32B active, 384 experts with 8 routed plus 1 shared, per vllm_project
- Context: 256K on Moonshot surfaces, with OpenRouter's listing showing 262,144 tokens
- Modality: image and video input, text output, per Artificial Analysis and the model card
- Serving hooks: dedicated Kimi tool-call and reasoning parsers in vLLM, per vllm_project
- License: Modified MIT on Hugging Face, as shown in scaling01's model card screenshot
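The sparsity and context figures in the spec list are internally consistent, and the arithmetic is worth seeing once: roughly 3 percent of total parameters and about 2 percent of experts are active per token, and OpenRouter's 262,144-token figure is exactly 256K. A back-of-envelope sketch using only the numbers quoted above:

```python
# Back-of-envelope check of the K2.6 figures quoted above.
# Every constant comes from the spec list; nothing here is measured.

TOTAL_PARAMS_B = 1000    # 1T total parameters, in billions
ACTIVE_PARAMS_B = 32     # 32B active per token
TOTAL_EXPERTS = 384
ACTIVE_EXPERTS = 8 + 1   # 8 routed experts plus 1 shared expert
CONTEXT_TOKENS = 262_144 # OpenRouter's listed context length

param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
expert_fraction = ACTIVE_EXPERTS / TOTAL_EXPERTS

print(f"active params:  {param_fraction:.1%} of total")    # 3.2%
print(f"active experts: {expert_fraction:.1%} of total")   # 2.3%
print(f"context is 256K: {CONTEXT_TOKENS == 256 * 1024}")  # True
```

So "1T parameters" overstates the per-token compute by a factor of about thirty, which is the whole point of the MoE design.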
Benchmarks
Moonshot's own chart puts K2.6 in a very specific place: first on several open-weight coding and tool-use tasks, near the frontier closed models on the rest, and still behind on some search, terminal, and vision rows.
The claims from the benchmark table and the launch chart are easiest to scan as a split list.
Moonshot-led rows
- HLE with tools: 54.0, ahead of GPT-5.4 at 52.1, Claude Opus 4.6 at 53.0, and Gemini 3.1 Pro at 51.4
- DeepSearchQA f1: 92.5, ahead of GPT-5.4 at 78.6 and Claude Opus 4.6 at 91.3
- SWE-Bench Pro: 58.6, ahead of GPT-5.4 at 57.7, Claude Opus 4.6 at 53.4, and Gemini 3.1 Pro at 54.2
Rows it does not lead
- BrowseComp: 83.2, behind Gemini 3.1 Pro at 85.9
- Toolathlon: 50.0, behind GPT-5.4 at 54.6
- Terminal-Bench 2.0: 66.7, behind Gemini 3.1 Pro at 68.5
- SWE-Bench Multilingual: 76.7, behind GPT-5.4 at 77.8
- MathVision with python: 93.2, behind GPT-5.4 at 96.1
Third-party evals tightened the same picture instead of flipping it. Artificial Analysis ranks K2.6 at 54 on its Intelligence Index, three points below the top entries from Anthropic, Google, and OpenAI at 57, and ArtificialAnlys' thread says that is enough for the top open-weight spot. ValsAI separately put it at number 1 among open-weight models and number 7 overall, with ValsAI's follow-up attributing most of the gain to Terminal Bench 2, up 17 points, and SWE-Bench, up 8 points versus K2.5.
Agent loops
The most interesting claims in this launch are about duration, not single-turn prompts. nrehiew_'s post shows K2.6 spending nearly 12 hours and more than 4,000 tool calls writing and optimizing a Zig inference engine for Qwen 3.5 on an M3 Max, moving from roughly 15 tokens per second to 193.1 and finishing about 20 percent above LM Studio's baseline.
Moonshot paired that with a longer internal case. According to Moonshot's quoted worklog, its RL infra team used a K2.6-backed agent for five days of monitoring, incident response, and system operations, with persistent context across the run.
The swarm claims are bigger than the examples:
- 300 parallel sub-agents, up from K2.5's 100, per the launch post
- 4,000 coordinated steps per run, up from 1,500 on K2.5, per the same thread
- One prompt, 100-plus files, per Moonshot's release copy
- Claw Groups research preview, which maximelabonne's summary describes as routing tasks across different agents and even human-in-the-loop collaborators
This is Christmas-come-early stuff for agent harness nerds, because Moonshot is not pitching K2.6 as a better autocomplete model. It is pitching a longer-lived worker.
Frontend agent
Moonshot spent unusual launch space on website generation, and the details are specific enough to matter. The agent demo thread says K2.6 can generate:
- video hero sections through external video generation APIs
- raw GLSL and WGSL shader code
- GSAP and Framer Motion animation flows
- Three.js and React Three Fiber scenes
- React 19, TypeScript, Vite, Tailwind, and shadcn/ui app scaffolds
- auth, database, and backend wiring in the same pass
The demos in chetaslua's thread, crystalsssup's WebGL clip, and ai_for_success's one-shot asset test show why this part of the launch spread so fast. Moonshot is chasing the same taste-and-motion territory that made closed models sticky for webdev demos, but with open weights and an agent surface attached.
Where it shows up
K2.6 had one of the denser day-one rollouts in recent open-model launches. From the evidence pool and linked pages, it landed across:
- Inference providers: OpenRouter, Baseten, Fireworks, Venice, AI/ML API
- Serving stacks: vLLM, SGLang cookbook
- Agent products: Hermes Agent, OpenClaw via Ollama, Kilo, Droid, OpenCode, and Arena
- Weights and chat surfaces: Hugging Face, Kimi web, HuggingChat
Some of those integrations added their own operational details. Baseten says it is serving K2.6 with KV-aware routing, NVFP4 weights on Blackwell, multimodal hierarchical caching, and prefill-decode disaggregation. Ollama turned it into launch commands for OpenClaw, Hermes, and a Claude-style terminal harness. NousResearch made it selectable in Hermes Agent the same day.
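Most of those providers expose K2.6 through OpenAI-compatible chat-completions endpoints, which is where the tool-call parsers mentioned earlier come in. A minimal sketch of what a tool-calling request body looks like; the model id and the tool schema are illustrative assumptions for the sketch, not identifiers confirmed by any provider's listing:

```python
import json

# Illustrative OpenAI-style chat-completions payload for a tool-calling
# request. The model id and the tool itself are hypothetical examples,
# not confirmed identifiers from any provider.
payload = {
    "model": "moonshotai/kimi-k2.6",  # hypothetical provider model id
    "messages": [
        {"role": "user", "content": "What is the p99 latency of service X?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "query_metrics",  # hypothetical tool
                "description": "Fetch a metric for a named service",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "service": {"type": "string"},
                        "metric": {"type": "string"},
                    },
                    "required": ["service", "metric"],
                },
            },
        }
    ],
    "tool_choice": "auto",
}

# This body would be POSTed to a provider's /v1/chat/completions endpoint;
# the model's tool calls come back in the assistant message.
body = json.dumps(payload)
```

The long-horizon agent claims in this launch are, at bottom, the model emitting thousands of structured calls like this one and staying coherent between them.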
Price and rough edges
The price story is strong, but it is not flat from K2.5. OpenRouter's listing shows K2.6 at $0.95 per million input tokens and $4 per million output tokens. ValsAI's cost thread says Moonshot's own pricing moved from $0.10 and $3 on K2.5 to $0.16 and $4 on K2.6, with an average Vals cost of $0.21 per test, still far below Opus 4.7 at $1.05 per test.
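The per-million rates translate directly into per-run costs. A quick sketch using Moonshot's own rates as quoted from ValsAI's cost thread; the 1M-input, 200K-output workload is an illustrative assumption, not a measured agent run:

```python
# Per-run cost comparison at the quoted Moonshot per-million-token rates.
# The workload size is an illustrative assumption.

RATES = {            # (input $/M tokens, output $/M tokens)
    "K2.5": (0.10, 3.00),
    "K2.6": (0.16, 4.00),
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run at the quoted per-million-token rates."""
    in_rate, out_rate = RATES[model]
    return in_rate * input_tokens / 1e6 + out_rate * output_tokens / 1e6

for model in RATES:
    cost = run_cost(model, input_tokens=1_000_000, output_tokens=200_000)
    print(f"{model}: ${cost:.2f}")  # K2.5: $0.70, K2.6: $0.96
```

The roughly 35 percent bump per run is real, but for a model being pitched on multi-hour autonomous work, the denominator that matters is cost per finished task, which is what the Vals per-test figure tries to capture.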
The rough-edge reports arrived almost as quickly as the benchmark praise. Mollick's hands-on post called K2.6 Thinking very good for an open-weights model, but said the Lem Test produced a 74-page thinking trace with only an okay answer. In a follow-up, Mollick's comparison said light real-world use still felt worse than Claude Opus 4.6 despite the benchmark wins.
Other early testers converged on a similar split. teortaxesTex's roguelike run said the raw model came surprisingly close to GPT-5.4 on a staged HTML-to-voxel-game test and had better aesthetic taste, while the same tester's follow-up said the richer tool environment actually introduced errors on step one. A separate Chinese community writeup collected in ZhihuFrontier's Hermes summary reported zero 429s across 23 concurrent agents, but also slower time-to-first-token and weekly quota burn that consumed 24 percent in one day of heavy testing.