Skip to content
AI Primer
update

Vals ranks Kimi K2.7 Code at 78.2% on SWE-bench and 67% on Terminal-Bench 2.1

Vals posted new external results for Kimi K2.7 Code, ranking it the top open-weight model on SWE-bench and Terminal-Bench 2.1. The results give Moonshot's launch claims an outside benchmark line on repo and terminal-heavy tasks.

3 min read
Vals ranks Kimi K2.7 Code at 78.2% on SWE-bench and 67% on Terminal-Bench 2.1
Vals ranks Kimi K2.7 Code at 78.2% on SWE-bench and 67% on Terminal-Bench 2.1

TL;DR

  • ValsAI's benchmark thread puts Kimi K2.7 Code at 78.2% on SWE-bench and 67% on Terminal-Bench 2.1, which Vals described as the top open-weight result on both boards.
  • The official Hugging Face model card positions K2.7 Code as a coding-specific post-train on K2.6, with 1 trillion total parameters, 32 billion active per token, and a 256K context window.
  • According to rasbt's breakdown, the interesting part is where the gains show up: terminal and repo-heavy tasks, not just one-shot code generation.
  • Moonshot's own materials, including the Fireworks day-one launch post, claim about 30% lower reasoning-token usage than K2.6 while keeping higher coding scores.

You can check the Vals SWE-bench page, browse the official model card, and see that K2.7 Code was already live on both Fireworks and Vercel AI Gateway on day one. The most useful update is not Moonshot's internal delta chart, it is that ValsAI's benchmark thread adds an outside number on two benchmarks that force models to operate inside repositories and terminals.

Vals benchmark line

Vals's post gives K2.7 Code an external reference point after Moonshot launched it with mostly first-party numbers. In Vals's telling, 78.2% on SWE-bench and 67% on Terminal-Bench 2.1 make it the highest-scoring open-weight model on both tasks.

The Vals SWE-bench page describes SWE-bench Verified as 500 human-validated GitHub issue tasks run inside isolated Docker containers, with success decided by whether the generated patch passes unit tests. That makes the 78.2% figure more useful than a plain code-gen score, because the model has to navigate an existing codebase and land a working fix.

Agentic benchmarks

Rasbt's thread is the cleanest explanation of why these numbers landed. Terminal-Bench requires the model to inspect an environment, run commands, and react to outputs, while SWE-bench asks it to resolve real repository issues and pass tests.

That split matters because K2.7 Code is not being sold as a generic coder. The Hugging Face model card calls it a coding-focused agentic model built on K2.6, and the Fireworks writeup repeats the same pitch: long-horizon coding, lower reasoning-token use, same 1T MoE backbone.

Open weights and endpoints

Moonshot shipped the weights under a Modified MIT license on Hugging Face, which keeps this result in the part of the market engineers can actually self-host or route through third parties. Day-one serving was already in place through Fireworks and Vercel AI Gateway.

The official specs are unusually frontier-sized for an open coding model: 1T total parameters, 384 experts, 8 selected experts per token, 61 layers, and a 256K context window, according to the model card. So the headline from Vals is not just that K2.7 Code scored well. It is that an open-weight model with immediate hosted access now has an outside benchmark line on two of the messier agentic coding tests.

Share on X