AI Primer
release

Kimi K2.6 launches API with $0.95/M input, 256K context, and video input

Moonshot put Kimi K2.6 on API with cache-hit/cache-miss pricing, tool calls, JSON modes, and native text-image-video input. It also open-sourced FlashKDA and landed in Warp, Cosine, Genspark, and OpenClaw, making the launch usable coding-agent infrastructure.


TL;DR

Moonshot did not just post a model card. You can already try K2.6 on OpenRouter, through Ollama Cloud, with the SGLang cookbook, or in independent eval dashboards like Artificial Analysis. There is also a community-built Kimi 2.6 Code terminal client, which tells you how quickly the surrounding harness ecosystem moved.

API surface

The API post is unusually dense for a single launch tweet. It lists two input prices, one for cache hits and one for cache misses, plus a flat output price.

  • Cache-hit input: $0.16 per 1M tokens
  • Cache-miss input: $0.95 per 1M tokens
  • Output: $4 per 1M tokens
  • Context: 256K
  • Modes: thinking and non-thinking
  • Native inputs: text, image, video
  • API features: tool calls, JSON mode, partial mode, web search

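With two-tier input pricing, the effective cost of a request depends heavily on how much of the prompt is served from cache. A minimal sketch of the arithmetic, using the prices listed above (the function and the example token split are ours, not Moonshot's):

```python
# Per-1M-token prices from the published K2.6 price sheet.
CACHE_HIT_INPUT = 0.16   # $ per 1M cached input tokens
CACHE_MISS_INPUT = 0.95  # $ per 1M uncached input tokens
OUTPUT = 4.00            # $ per 1M output tokens

def request_cost(hit_tokens: int, miss_tokens: int, out_tokens: int) -> float:
    """Dollar cost of one request under the two-tier input pricing."""
    return (hit_tokens * CACHE_HIT_INPUT
            + miss_tokens * CACHE_MISS_INPUT
            + out_tokens * OUTPUT) / 1_000_000

# A long agent turn: 200K input tokens, 90% served from cache, 4K output.
print(round(request_cost(180_000, 20_000, 4_000), 4))  # → 0.0638
```

The same turn with a cold cache (all 200K tokens at the miss price) costs roughly three times as much, which is why the cache-hit rate matters so much for long agent sessions.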
That same price sheet surfaced in bridgemindai's OpenRouter screenshot, and OpenRouter's Cloudflare routing screenshot showed the model already running through Cloudflare Workers AI.

FlashKDA

Moonshot paired the API rollout with an infra artifact, not just a benchmark chart. FlashKDA is framed as a drop-in backend for flash-linear-attention, and the claim that matters is prefill speed: 1.72x to 2.22x faster than the baseline on H20, according to Kimi_Moonshot's FlashKDA post.

That lines up with the rest of K2.6's serving story. baseten's day-zero post said its deployment uses NVFP4 weights on Blackwell, KV-aware routing, multimodal hierarchical caching, and prefill-decode disaggregation, while vllm_project's support post listed day-zero vLLM 0.19.1 support with Moonshot's tool-call and reasoning parsers.

Launch partners

The rollout looked more like agent plumbing than a normal model launch: per Moonshot's announcement, K2.6 landed day-zero in Warp, Cosine, Genspark, and OpenClaw.

This is why the launch feels practical so quickly. The model shipped alongside terminals, agent frameworks, aggregation layers, and hosted inference, not months before them.

Long-horizon coding

Moonshot's central claim is stamina. The launch thread and follow-up demos keep returning to the same handful of long-horizon numbers.

The frontend demos make the target workload concrete. Kimi_Moonshot's frontend demo thread and Kimi_Moonshot's backend wiring demo showed K2.6 generating video hero sections, WebGL shader work, React plus Vite plus Tailwind stacks, and auth plus database wiring in one pass.

Independent evals broadly support the jump. ArtificialAnlys' model ranking post placed K2.6 at 54 on its Intelligence Index, fourth overall behind Anthropic, Google, and OpenAI, and ValsAI's open-weight ranking post put it first among open-weight models on the Vals Index.

Benchmarks and rough edges

The benchmark story is strong enough that the skepticism is part of the story. BridgeBench put K2.6 at a quality score of 81.4 and first on debugging, ahead of GPT-5.4 on that board, according to bridgemindai's BridgeBench post. But the same account's workflow test later said the model broke a real website build in visible ways, and its lava lamp test made a similar case with a simple visual-app prompt.

Ethan Mollick, Wharton professor and frequent model evaluator, wrote in emollick's first-impression thread that K2.6 Thinking looked very good for an open-weights model but still had many rough edges versus the closed state of the art, then added in emollick's usage note that it did not feel as good as Claude Opus 4.6 in ordinary use, despite beating it on some charts.

The sharper read is that K2.6 has crossed into "serious option" territory, but the harness and workload still matter a lot more than the headline scores imply.

Harnesses

One of the more useful signals from this launch is how fast people started wrapping K2.6 in tools built for daily work. There is already a community terminal client, Kimi 2.6 Code, modeled on Claude Code and wired to bring your own API key, per skirano's Kimi 2.6 Code post.

That showed up elsewhere too. NousResearch's Hermes Agent update added K2.6 to Hermes, opencode's OpenCode post put it into OpenCode, and omarsar0's survey-generator demo used Fireworks inference plus a plugin skill to generate full survey papers. The model launch was only half the story; the usable coding-agent harnesses arrived at the same time.
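The bring-your-own-key harnesses above all speak the same dialect: an OpenAI-compatible chat request with the launch's listed features (JSON mode, tool calls) set in the request body. A hedged sketch of what such a payload looks like; the model id, and which of these fields Moonshot's endpoint honors exactly, are assumptions here, so check the provider's docs:

```python
import json

def build_payload(prompt: str, json_mode: bool = True) -> dict:
    """Build an OpenAI-compatible chat request body (not sent anywhere)."""
    payload = {
        "model": "kimi-k2.6",  # assumed model id; confirm against the provider's model list
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    if json_mode:
        # JSON mode, one of the API features listed in the launch post.
        payload["response_format"] = {"type": "json_object"}
    return payload

# Harnesses serialize this and POST it to their configured base URL.
body = json.dumps(build_payload("List three React hooks as a JSON array."))
```

Because the shape is the standard chat-completions one, pointing an existing harness at K2.6 is mostly a matter of swapping the base URL, API key, and model id, which is why clients like Kimi 2.6 Code could appear so quickly.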
