Kimi K2.6 launches API with $0.95/M input, 256K context, and video input
Moonshot put Kimi K2.6 on API with cache-hit/cache-miss pricing, tool calls, JSON modes, and native text-image-video input. It also open-sourced FlashKDA and landed day-zero support in Warp, Cosine, Genspark, and OpenClaw, turning the launch into usable coding-agent infrastructure from the start.

TL;DR
- Kimi_Moonshot's API launch post leads with a cache-aware price sheet: $0.16 per million cached input tokens, $0.95 per million uncached input tokens, and $4 per million output tokens, with 256K context, tool calls, JSON and partial JSON modes, web search, and native text, image, and video input.
- The same launch wave included Kimi_Moonshot's FlashKDA post, which open-sourced a CUTLASS-based Kimi Delta Attention kernel that Moonshot says delivers 1.72x to 2.22x prefill speedups over the flash-linear-attention baseline on H20.
- Day-zero infrastructure landed fast: warpdotdev's Warp announcement, CosineAI's Cosine post, openclaw's release post, and ollama's cloud integration post all shipped K2.6 support within hours.
- Bench charts are strong, but hands-on reactions split. bridgemindai's BridgeBench post put K2.6 in its top five and first on debugging, while bridgemindai's workflow test and emollick's usage note both argued the real product feel still trails the best closed models.
- Moonshot is also pushing a very specific agent story: Kimi_Moonshot's long-horizon demo claimed 4,000-plus tool calls over 12 hours, while Kimi_Moonshot's frontend demo thread showed video heroes, WebGL shaders, and backend wiring from one prompt.
Moonshot did not just post a model card. You can already try K2.6 on OpenRouter, through Ollama Cloud, with the SGLang cookbook, or in independent eval dashboards like Artificial Analysis. There is also a community-built Kimi 2.6 Code terminal client, which tells you how quickly the surrounding harness ecosystem moved.
API surface
The API post is unusually dense for a single launch tweet. It exposes two input prices, one for cache hits and one for cache misses, plus a flat output price.
- Cache-hit input: $0.16 per 1M tokens
- Cache-miss input: $0.95 per 1M tokens
- Output: $4 per 1M tokens
- Context: 256K
- Modes: thinking and non-thinking
- Native inputs: text, image, video
- API features: tool calls, JSON mode, partial mode, web search
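The split between cached and uncached input is the number that matters for agent loops, where most of each request is a reused prefix. A minimal sketch of the arithmetic, using only the prices in the launch post (the token counts in the example are made up for illustration; check the live price sheet before budgeting against it):

```python
# Rough per-request cost sketch using the prices in Moonshot's launch post
# ($0.16/M cached input, $0.95/M uncached input, $4/M output).
# Illustrative arithmetic only, not an official billing formula.

PRICE_PER_M = {"cached_in": 0.16, "uncached_in": 0.95, "out": 4.00}

def request_cost(cached_in: int, uncached_in: int, out: int) -> float:
    """Return the dollar cost of one request, given token counts."""
    return (
        cached_in * PRICE_PER_M["cached_in"]
        + uncached_in * PRICE_PER_M["uncached_in"]
        + out * PRICE_PER_M["out"]
    ) / 1_000_000

# A hypothetical agent turn reusing a 200K-token cached prefix,
# with 4K new input tokens and 2K output tokens:
cost = request_cost(cached_in=200_000, uncached_in=4_000, out=2_000)
print(f"${cost:.4f}")
```

At these rates the cache discount dominates: the same 200K-token prefix billed entirely at the uncached rate would cost roughly six times more per turn.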
That same price sheet is what bridgemindai's OpenRouter screenshot surfaced, and OpenRouter's Cloudflare routing screenshot showed the same model already running through Cloudflare Workers AI.
FlashKDA
Moonshot paired the API rollout with an infra artifact, not just a benchmark chart. FlashKDA is framed as a drop-in backend for flash-linear-attention, and the claim that matters is prefill speed: 1.72x to 2.22x faster than the baseline on H20, according to Kimi_Moonshot's FlashKDA post.
That lines up with the rest of K2.6's serving story. baseten's day-zero post said its deployment uses NVFP4 weights on Blackwell, KV-aware routing, multimodal hierarchical caching, and prefill-decode disaggregation, while vllm_project's support post listed day-zero vLLM 0.19.1 support with Moonshot's tool-call and reasoning parsers.
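To make the speedup range concrete: prefill is what you wait through before the first output token on a long prompt, so a multiplicative speedup divides that wait directly. A quick sketch, where the 30-second baseline is an invented illustration and only the 1.72x-2.22x range comes from Moonshot's FlashKDA post:

```python
# What a 1.72x-2.22x prefill speedup means for time-to-first-token.
# The 30s baseline is hypothetical; only the speedup range is from
# Moonshot's FlashKDA post, and real numbers depend on hardware and context length.

def prefill_time(baseline_s: float, speedup: float) -> float:
    """Prefill latency after applying a multiplicative speedup."""
    return baseline_s / speedup

baseline = 30.0  # hypothetical long-context prefill time, in seconds
for s in (1.72, 2.22):
    print(f"{s}x -> {prefill_time(baseline, s):.1f}s")
```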
Launch partners
The rollout looked more like agent plumbing than a normal model launch.
- Warp added K2.6 on day one, citing faster tool calls and lower cost than frontier incumbents, per warpdotdev's Warp announcement.
- Cosine said K2.6's 4,000-plus sequential tool-call budget fits its Swarm product for long-horizon tasks, according to CosineAI's Cosine post.
- Genspark shipped day-zero access through Fireworks, per genspark_ai's Genspark post.
- OpenClaw added K2.6 support in its 2026.4.20 release, according to openclaw's release post.
- Ollama exposed it as kimi-k2.6:cloud, with launch paths for OpenClaw, Hermes, and Claude-style clients, per ollama's cloud integration post.
- Windsurf made it free for Pro, Teams, and Max users for two weeks, according to cognition's Windsurf repost.
This is why the launch feels practical so quickly. The model shipped alongside terminals, agent frameworks, aggregation layers, and hosted inference, not months before them.
Long-horizon coding
Moonshot's central claim is stamina. The launch thread and follow-up demos keep returning to four numbers:
- 4,000-plus tool calls per run, per Kimi_Moonshot's main launch thread
- 12-plus hours of continuous execution, per Kimi_Moonshot's long-horizon demo
- 300 parallel sub-agents, up from 100 in K2.5, per Kimi_Moonshot's main launch thread
- 100-plus files from one prompt, also per Kimi_Moonshot's main launch thread
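The first two numbers imply a steady cadence rather than a burst. Dividing them out (both figures are from the launch thread; the per-minute rate is just derived arithmetic on the "plus" lower bounds):

```python
# Back-of-envelope cadence implied by Moonshot's stamina claims:
# 4,000-plus tool calls over 12-plus hours. Both inputs are the claimed
# lower bounds from the launch thread; the rates are derived, not reported.

tool_calls = 4_000
hours = 12

calls_per_minute = tool_calls / (hours * 60)
seconds_per_call = (hours * 3600) / tool_calls

print(f"{calls_per_minute:.1f} calls/min, one call every {seconds_per_call:.1f}s")
```

That is roughly one tool call every eleven seconds, sustained for half a day, which is why the launch partners keep framing K2.6 around long-horizon agent budgets rather than single-shot quality.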
The frontend demos make the target workload concrete. Kimi_Moonshot's frontend demo thread and Kimi_Moonshot's backend wiring demo showed K2.6 generating video hero sections, WebGL shader work, React plus Vite plus Tailwind stacks, and auth plus database wiring in one pass.
Independent evals broadly support the jump. ArtificialAnlys' model ranking post placed K2.6 at 54 on its Intelligence Index, fourth overall behind Anthropic, Google, and OpenAI, and ValsAI's open-weight ranking post put it first among open-weight models on the Vals Index.
Benchmarks and rough edges
The benchmark story is strong enough that the skepticism is part of the story. BridgeBench put K2.6 at a quality score of 81.4 and first on debugging, ahead of GPT-5.4 on that board, according to bridgemindai's BridgeBench post. But bridgemindai's workflow test later said the model broke a real website build in visible ways, and bridgemindai's lava lamp test made a similar case with a simple visual app prompt.
Ethan Mollick, Wharton professor and frequent model evaluator, wrote in emollick's first-impression thread that K2.6 Thinking looked very good for an open-weights model but still had many rough edges versus closed state of the art, then added in emollick's usage note that it did not feel as good as Claude Opus 4.6 in ordinary use despite beating it on some charts.
The sharper read is that K2.6 has crossed into "serious option" territory, but the harness and workload still matter a lot more than the headline scores imply.
Harnesses
One of the more useful signals from this launch is how fast people started wrapping K2.6 in tools built for daily work. There is already a community terminal client, Kimi 2.6 Code, modeled on Claude Code and wired to bring your own API key, per skirano's Kimi 2.6 Code post.
That showed up elsewhere too. NousResearch's Hermes Agent update added K2.6 to Hermes, opencode's OpenCode post put it into OpenCode, and omarsar0's survey-generator demo used Fireworks inference plus a plugin skill to generate full survey papers. The model launch was only half the story; the usable coding-agent harnesses arrived at the same time.
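What makes the bring-your-own-key harnesses easy to build is that most of them speak the OpenAI-compatible chat-completions schema the API post's feature list maps onto. A sketch of what such a request body looks like, constructed but not sent; the model id and the tool definition are illustrative assumptions, not confirmed identifiers:

```python
# Sketch of a bring-your-own-key request body exercising K2.6's tool-call
# support, assuming an OpenAI-compatible chat-completions schema (the shape
# most of the harnesses above speak). "kimi-k2.6" and "run_tests" are
# hypothetical names for illustration; no network request is made here.
import json

def build_request(prompt: str) -> dict:
    return {
        "model": "kimi-k2.6",  # hypothetical id; check your provider's catalog
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "run_tests",  # illustrative coding-agent tool
                    "description": "Run the project's test suite and return results.",
                    "parameters": {
                        "type": "object",
                        "properties": {"path": {"type": "string"}},
                        "required": ["path"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }

body = build_request("Fix the failing test in tests/test_auth.py")
print(json.dumps(body, indent=2)[:120])
```

Whether the request goes to Moonshot directly, OpenRouter, Ollama Cloud, or a Fireworks-backed harness, it is this shared schema that let so many clients ship within hours.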