Skip to content
AI Primer
update

Qwen 3.7 Max users report 5-minute cache creation, $43 vibe-coding bills, and uneven task quality

A day after Qwen 3.7 Max launched, users posted both standout benchmark wins and rough real-work reports, including 5-minute cache creation and $43 in 15 minutes of vibe coding. That matters because teams evaluating coding agents are seeing a gap between leaderboard strength and per-task reliability.

5 min read
Qwen 3.7 Max users report 5-minute cache creation, $43 vibe-coding bills, and uneven task quality
Qwen 3.7 Max users report 5-minute cache creation, $43 vibe-coding bills, and uneven task quality

TL;DR

You can read Alibaba's launch post, skim the Artificial Analysis model page, and try the model through Qwen Studio, OpenRouter, or Vercel AI Gateway. OpenRouter also published an explicit caching guide for Qwen models, which is suddenly relevant given the day-one cache complaints.

Benchmarks

Alibaba's own framing was broad: strong coding-agent scores, stronger general-purpose agent performance, and a 35-hour autonomous kernel optimization run with 1,158 tool calls, per Alibaba_Qwen's launch thread and Alibaba_Qwen's kernel run post. Artificial Analysis gave the more compact outside summary: 56.6 on the Intelligence Index, up 4.8 points from Qwen3.6 Max Preview, with gains concentrated in scientific reasoning, agentic capability, and coding according to ArtificialAnlys.

A few concrete numbers stood out:

The catch is in the sub-metrics. ArtificialAnlys said part of the Intelligence Index gain came from abstaining more on AA-Omniscience: accuracy fell from 37.7 percent to 30.1 percent while hallucination rate dropped from 44.2 percent to 22.9 percent.

Friction

The fastest shift in the conversation was from "best Chinese model" to "how expensive is a real task." bridgemindai's workflow review said the model looked excellent on paper, then burned $43 in 15 minutes of vibe coding and still produced 15 errors on one task.

That lined up with the softer skepticism around the launch. sbmaruf's reply said they were "totally disappointed" after using it, while teortaxesTex's cache complaint argued that a quoted "cache creation (5 min)" step made the long-horizon pitch hard to swallow.

The split is useful because it is not benchmark denial. bridgemindai's review explicitly said the benchmark scores were legit, then argued the cost per completed task was the problem. That is a much narrower and more engineer-relevant complaint.

Token economics

The most detailed independent review in the evidence pool, ZhihuFrontier's translated review, said Qwen 3.7 Max improved reasoning by more than 30 percent versus Qwen3.6-Max-Preview, but did it with average token use around 44K, roughly 50 percent higher than the prior generation.

That review described the pricing math this way:

  • Token usage up about 50 percent.
  • Price down to about 60 percent of the prior level.
  • Overall task cost roughly flat, with faster TPS and better usability.

Artificial Analysis reached a similar conclusion from a different angle. ArtificialAnlys said Qwen3.7 Max consumed 96.7 million output tokens to run the Intelligence Index, about 31 percent more than Qwen3.6 Max Preview's 73.9 million.

The day-one user reports suggest flat benchmark-era economics do not automatically translate to flat workflow cost. bridgemindai's workflow review described the model as cheaper per token but more expensive per task, and teortaxesTex's follow-up noted that the API then got a 50 percent discount.

Harness claims

Alibaba's most ambitious claim was that Qwen 3.7 Max learned agentic behavior that transfers across scaffolds. In Alibaba_Qwen's cross-harness post, the company said results held across QwenClawBench and CoWorkBench regardless of evaluation harness, naming Claude Code, OpenClaw, Qwen Code, and custom stacks in the broader launch thread.

That is why kimmonismus focused less on the 35-hour kernel demo and more on the line about agentic capabilities generalizing from diverse training environments. If that claim holds up, the story is about training distribution, not a single flashy run.

The rollout also matched the scaffold-agnostic pitch. By the end of the first day, Qwen 3.7 Max had shown up on Vercel AI Gateway, OpenRouter, Venice, and Kilo, with those launches reflected in vercel_dev, OpenRouter, AskVenice, and kilocode.

Prompt-language gotcha

One of the stranger details came from ZhihuFrontier's translated review, which said Qwen's official blog recommends an extra system prompt during inference. Without it, the reviewer said, the model may reason in Chinese; with it, the chain of thought shifts to English.

That reviewer tied the language choice to measurable performance differences in coding, geometry, and spatial tasks. It is an unusually concrete reminder that part of Qwen 3.7 Max's published performance depends on prompt setup, not just the base model.

The same review also said the model becomes noticeably shakier around 100K-context coding sessions by round three, when it starts forgetting early constraints or making simpler errors. That is a different failure mode from the five-minute cache complaint, and a more specific one.

Further reading

Discussion across the web

Where this story is being discussed, in original context.

On X· 4 threads
TL;DR1 post
Benchmarks3 posts
Friction1 post
Harness claims4 posts
Share on X