Gemini 3.5 Flash ships with 76.2% Terminal-Bench 2.1 and $1.50/$9 pricing
Google shipped Gemini 3.5 Flash as a GA model with 1M context, 65K max output, and stronger agentic benchmarks than Gemini 3.1 Pro. Watch task-level cost, since third-party evals show it can exceed Gemini 3.1 Pro and GPT-5.5 Medium on some jobs.

TL;DR
- Google's launch thread shipped Gemini 3.5 Flash as a generally available model with 1M context, up to 65K output tokens, and rollout across the Gemini app, Search AI Mode, AI Studio, Antigravity, Android Studio, and Gemini Enterprise.
- In Google's launch numbers, GoogleDeepMind's benchmark card puts Gemini 3.5 Flash ahead of Gemini 3.1 Pro on Terminal-Bench 2.1, GDPval-AA, and MCP Atlas, while Google's speed chart says it reaches 289 output tokens per second.
- The cleanest caveat comes from Artificial Analysis, which scored Gemini 3.5 Flash at 55 on its Intelligence Index, behind Gemini 3.1 Pro Preview at 57, while also estimating a 5.5x higher run cost than Gemini 3 Flash and a 75% higher cost than Gemini 3.1 Pro on its suite.
- Pricing is the part engineers noticed first: scaling01's pricing screenshot and Phil Schmid's launch summary both show $1.50 per million input tokens and $9 per million output tokens, triple Gemini 3 Flash's listed token price.
- Third-party evals split by workload. ValsAI's finance benchmark put Gemini 3.5 Flash at #1 on Finance Agent v2, but kwindla's agent benchmark thread found it expensive on long multi-turn tasks and still too slow on time-to-first-token for voice agents.
You can read Google's launch post, the Gemini API what's new page, and the full eval methodology PDF. The independent counterweight is Artificial Analysis' model page, and the oddest product demo was Google using Antigravity 2.0 plus Gemini 3.5 Flash to build a toy operating system in 12 hours.
Availability
Google positioned this as a same-day GA release, not another preview.
According to Google's availability post, Gemini 3.5 Flash is live for consumers in the Gemini app and Search AI Mode, for developers in Antigravity, AI Studio, the Gemini API, and Android Studio, and for enterprises in Gemini Enterprise Agent Platform. Google's Search update also made it the new default model for AI Mode globally, which matters more than the model card copy.
The product surfaces were already leaking before I/O. AiBattle_'s cloud console screenshot spotted the model in Google Cloud Console hours before launch, and testingcatalog's AI Studio screenshot showed the release date, Jan 2025 knowledge cutoff, and the same $1.50/$9 pricing now listed publicly.
Benchmark deltas
Google's headline is simple: Flash now beats the older Pro on the agent stack.
The first-party deltas versus Gemini 3.1 Pro that matter most:
- Terminal-Bench 2.1: 70.3% to 76.2%
- GDPval-AA: 1314 Elo to 1656 Elo
- MCP Atlas: 78.2% to 83.6%
- Output speed, per Google's speed slide: 289 tokens per second
Third-party numbers fill in the shape of the release. Arena's frontend ranking placed Gemini 3.5 Flash at #9 on Code Arena: Frontend with a 1507 score, up 70 points over Gemini 3 Flash. ValsAI's finance result ranked it #1 on Finance Agent v2 at 57.86%, ahead of GPT-5.5 at 51.76% and Opus 4.7 at 51.51%.
The same eval sheet shows where the model is not leading. bridgemindai's benchmark screenshot surfaced Google's own SWE-Bench Pro public number at 55.1%, below GPT-5.5 at 58.6% and Opus 4.7 at 64.3%. scaling01's benchmark roundup also highlighted GPT-5.5's 94.8% on MRCR v2 at 128K, far ahead of Gemini 3.5 Flash at 77.3%.
Pricing
The price card landed with much less Flash energy than the name implies.
Google's listed API price is $1.50 per million input tokens and $9 per million output tokens, as Phil Schmid's summary and OpenRouter's listing both restated. That is a straight 3x jump from Gemini 3 Flash's $0.50 and $3.00 pricing, which AiBattle_'s prelaunch pricing post caught before the keynote.
Artificial Analysis makes the more important point: list price was only part of the jump. Artificial Analysis estimated Gemini 3.5 Flash cost $1,552 to run on its Intelligence Index, versus $395 for Gemini 3 Flash and $892 for Gemini 3.1 Pro, because the model used more turns and more tokens in addition to costing more per token.
That token appetite shows up in the agentic data. Artificial Analysis' breakdown thread said Gemini 3.5 Flash averaged 49 turns per GDPval-AA task, one of the highest turn counts they had recorded, even while posting a strong 1656 Elo. Fast models that take more turns can still run up the bill.
Reasoning controls
The API side added more knobs than the keynote slides lingered on.
The practical spec list pulled from the Gemini API what's new page and echoed by ValsAI's config note looks like this:
- 1M token context window
- 65K max output tokens
- Four thinking levels: minimal, low, medium, high
- Medium is the new default thinking level
- Automatic thought preservation across multi-turn conversations
- Multimodal input, text output
Long-context behavior still comes with edge cases. Dillon Uzar's Context Arena thread found Gemini 3.5 Flash overtook Gemini 3.1 Pro from 256K onward on MRCR v2, but also reported that about 15% of samples above 200K hit a roughly 63K reasoning-token ceiling and collapsed at 1M. The same thread said low mode regressed hard versus the previous Flash on long-context reasoning.
ARC Prize's verified run shows how much the thinking setting changes outcomes. ARC Prize's verified ARC-AGI post recorded 72.1% on ARC-AGI-2 in high mode for $0.85, versus 8.9% in minimal mode for $0.11.
Agent workloads
Independent evals agree on one thing: this model is tuned for tool use and long multi-step work, not for every latency-sensitive job.
On a 30-plus-turn task benchmark, kwindla's results put Gemini 3.5 Flash high at the top with a score of 97 and a 1.86 second median turn time. The same thread said minimal mode could cost more than high mode on that workload because it made more mistakes and needed more turns to finish.
Voice is a different story. According to kwindla's voice benchmark thread, median time to first token was still about 960 ms for Gemini 3.5 Flash minimal, above the sub-700 ms threshold he wants for voice agents, while older fast models like Claude Haiku 4.5 remained better fits for that constraint.
The strongest outside consensus sits around agentic and multimodal work. Artificial Analysis credited the biggest gains to GDPval-AA and hallucination reduction, ValsAI's index post put the model at #3 on both the Vals Index and Vals Multimodal Index, and ValsAI's Vibe Code Bench note said it ranked #3 overall on SWE-bench Verified, Live Code Bench, and Terminal Bench 2 in their testing.
Antigravity
Google's best evidence for the release is not a chart, it is the harness demo.
Using Antigravity 2.0 with Gemini 3.5 Flash, Google said an autonomous team built a working operating system from scratch in 12 hours with 93 parallel sub-agents, more than 15,000 model requests, 2.6 billion processed tokens, and less than $1,000 in API credits. koraykv's launch thread added a second internal demo where agents recreated the AlphaZero paper, trained a model via self-play on TPU pods, and shipped a playable web app from two prompts.
Those demos also explain why Google keeps pairing this model with Antigravity and managed agents. Demis Hassabis claimed the model runs 12x faster inside Antigravity, and Phil Schmid's developer guide note pointed developers to a migration path built around the new Interactions API skill.