AI Primer
release

Qwen3.7 Max launches with 1M context, 35-hour autonomy, and 56.6 AA Index

Alibaba launched Qwen3.7 Max as its new flagship agent model with 1M context, stronger coding and reasoning scores, and cross-harness benchmarks. OpenRouter, Together, AI Gateway, and Kilo support it on day one, making it ready for immediate deployment.


TL;DR

You can read the official blog post, check the Model Studio docs, and compare the launch framing against the Artificial Analysis model page. The interesting bits are pretty concrete: Alibaba is arguing that agent skill transfers across environments, OpenRouter is already advertising explicit prompt caching, and the ecosystem rollout hit Vercel AI Gateway, OpenRouter, and Venice immediately.

What shipped

Alibaba's launch pitch breaks the model into four concrete jobs: coding agent, office assistant, long-horizon autonomous worker, and scaffold-agnostic base model.

Benchmarks

The cleanest outside read comes from Artificial Analysis. ArtificialAnlys scored Qwen3.7-Max at 56.6 on its Intelligence Index, up from 51.8 for Qwen3.6 Max Preview.

The gains it called out are narrow rather than universal.

The caveat is more interesting than the topline. ArtificialAnlys says part of the AA Index gain comes from abstaining more often on AA-Omniscience: accuracy fell from 37.7% to 30.1%, while hallucination rate dropped from 44.2% to 22.9%.
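A quick way to see why abstaining can raise a score even as accuracy falls is to split responses into correct, incorrect, and abstained shares. The sketch below is illustrative only. It assumes hallucination rate means the share of non-correct responses that were wrong answers rather than abstentions, and that the net score is correct minus incorrect. Both definitions are assumptions made for this example, not something the source spells out, and partial answers are ignored.

```python
def omniscience_breakdown(accuracy: float, hallucination_rate: float) -> dict:
    """Split responses into correct / incorrect / abstained shares.

    Assumption (not from the source): hallucination_rate is
    incorrect / (incorrect + abstained), and the net score is
    correct minus incorrect. Partial answers are ignored.
    """
    non_correct = 1.0 - accuracy
    incorrect = hallucination_rate * non_correct
    abstained = non_correct - incorrect
    return {
        "correct": accuracy,
        "incorrect": incorrect,
        "abstained": abstained,
        "net": accuracy - incorrect,
    }


# Figures quoted above: accuracy 37.7% -> 30.1%, hallucination rate 44.2% -> 22.9%.
before = omniscience_breakdown(0.377, 0.442)
after = omniscience_breakdown(0.301, 0.229)

for label, row in (("before", before), ("after", after)):
    print(label, {k: round(v * 100, 1) for k, v in row.items()})
```

Under those assumed definitions, the abstained share rises from roughly 35% to roughly 54% of responses, and the net score rises from about 10 to about 14 points despite the lower accuracy.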

Agent training

Alibaba is making an explicit training claim here. Alibaba_Qwen's agent-scaling post says Qwen3.7 extends Qwen3.5's environment-scaling approach by increasing the quality and diversity of agent training environments, with the thesis that agentic capabilities generalize the way language capabilities generalize from diverse text.

A second claim sits on top of that: Alibaba_Qwen's cross-harness note says the model stays strong across QwenClawBench and CoWorkBench regardless of which harness is used at eval time. That is a direct attempt to answer the usual suspicion around agent benchmarks, namely that a model learned one scaffold too well.

Kernel run

The launch's showpiece is not a chatbot demo. It is a long autonomous optimization loop on an attention kernel.

According to Alibaba_Qwen's self-evolving post, the run lasted about 35 hours, covered 432 kernel evaluations, used 1,158 tool calls, and produced a reported 10.0x geometric-mean speedup over a Triton reference across multiple workloads. kimmonismus usefully narrows the interpretation: the result is a model grinding through compile, profile, and rewrite cycles on one bounded target, not a general self-improvement jump.
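For readers unfamiliar with the metric, a geometric-mean speedup averages per-workload ratios multiplicatively, so one outlier workload cannot dominate the result. The sketch below uses made-up timings chosen purely for illustration, since the source does not publish per-workload numbers. It also derives two rough per-evaluation rates from the run totals quoted above, treating "about 35 hours" as exactly 35.

```python
import math


def geometric_mean_speedup(baseline_ms, optimized_ms):
    """Geometric mean of per-workload speedup ratios (baseline / optimized)."""
    ratios = [b / o for b, o in zip(baseline_ms, optimized_ms)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical timings in milliseconds, NOT measurements from the run.
baseline = [8.0, 20.0, 50.0]   # reference kernel
optimized = [1.0, 2.0, 4.0]    # tuned kernel
print(round(geometric_mean_speedup(baseline, optimized), 2))

# Run totals as reported in the post.
RUN_HOURS = 35
KERNEL_EVALS = 432
TOOL_CALLS = 1158

minutes_per_eval = RUN_HOURS * 60 / KERNEL_EVALS
tool_calls_per_eval = TOOL_CALLS / KERNEL_EVALS
print(round(minutes_per_eval, 2), round(tool_calls_per_eval, 2))
```

On the reported totals, that works out to a little under five minutes and fewer than three tool calls per kernel evaluation, which fits the compile, profile, and rewrite reading.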

That narrower framing still leaves a striking number. Alibaba_Qwen's launch post had already turned long-horizon autonomy into the headline, and togethercompute's launch post echoed the same run as evidence that the model can stay coherent over hours rather than minutes.

Where it shows up

This rollout was not limited to Alibaba's own surfaces.

Access and pricing

Alibaba made the official access story simple: Qwen Studio for direct use, Model Studio for API access, and the official blog post as the main technical summary.

Pricing was less tidy in the early evidence. ArtificialAnlys' model breakdown said pricing was still unannounced at launch time, while scaling01's pricing post claimed the live rate looked like $2.50 per 1M input tokens and $7.50 per 1M output tokens. The same breakdown also noted that the eval run consumed 96.7M output tokens, about 31% more than Qwen3.6 Max Preview.
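To put those two figures together, here is a rough back-of-the-envelope calculation. It assumes the claimed, unconfirmed rates hold, and it backs out the previous model's token count from the "about 31% more" figure, so both results are estimates rather than reported numbers.

```python
# Claimed launch-time rates from the pricing post; unconfirmed at the time.
INPUT_USD_PER_M = 2.5
OUTPUT_USD_PER_M = 7.5


def cost_usd(input_tokens: float, output_tokens: float) -> float:
    """Token cost in USD at the claimed per-million rates."""
    return (input_tokens / 1e6) * INPUT_USD_PER_M + (output_tokens / 1e6) * OUTPUT_USD_PER_M


EVAL_OUTPUT_TOKENS = 96.7e6  # output tokens reported for the eval run

# Output-side cost only; input token counts for the run are not given.
eval_output_cost = cost_usd(0, EVAL_OUTPUT_TOKENS)

# Implied output tokens for Qwen3.6 Max Preview, from "about 31% more".
implied_previous_tokens = EVAL_OUTPUT_TOKENS / 1.31

print(round(eval_output_cost, 2))
print(round(implied_previous_tokens / 1e6, 1))
```

At the claimed rate, the output side of the eval run alone would cost roughly $725, and the 31% figure implies Qwen3.6 Max Preview used about 74M output tokens on the same suite.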

One more implementation detail surfaced outside the official thread. OpenRouter's listing explicitly called out prompt caching for repeated context, and OpenRouter's prompt-caching guide linked to the provider-specific caching docs. That matters because a 1M-context model is not just about fitting more tokens. It is also about whether the serving stack exposes cost controls for repeated long prompts.
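The cost argument can be made concrete with a toy model. The sketch below compares resending a long shared prefix at full price with reading it from cache after the first request. The $2.50 input rate is the claimed figure from the pricing post. The cached-read multiplier of 0.1 and the absence of a cache-write surcharge are placeholders invented for illustration, not published terms for this model.

```python
def repeated_prompt_cost(
    prefix_tokens: int,
    suffix_tokens: int,
    requests: int,
    input_usd_per_m: float = 2.5,         # claimed launch rate, unconfirmed
    cached_read_multiplier: float = 0.1,  # hypothetical discount, an assumption
):
    """Return (uncached_usd, cached_usd) for requests that share one prefix.

    Assumes the first request pays full price for the prefix (no write
    surcharge) and every later request reads it at the discounted rate.
    """
    per_token = input_usd_per_m / 1_000_000
    uncached = requests * (prefix_tokens + suffix_tokens) * per_token
    cached = (
        prefix_tokens * per_token
        + (requests - 1) * prefix_tokens * per_token * cached_read_multiplier
        + requests * suffix_tokens * per_token
    )
    return uncached, cached


# Example: a 900k-token shared context queried 20 times with short questions.
uncached, cached = repeated_prompt_cost(900_000, 2_000, 20)
print(round(uncached, 2), round(cached, 2))
```

With these made-up parameters, input cost drops from about $45 to under $7 across the 20 requests. The real saving depends entirely on the provider's actual cache pricing and expiry rules.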
