Skip to content
AI Primer
update

Gemini 3.5 Flash users report 3x price hikes and broken tool chains one day after launch

Users reported failed harness runs, benchmark misses, broken Calendar and video-editing flows, and later a tripled Antigravity rate limit after Gemini 3.5 Flash launched. Watch real agent workflows closely, because the speed gains are arriving with higher spend and unstable behavior.

6 min read
Gemini 3.5 Flash users report 3x price hikes and broken tool chains one day after launch
Gemini 3.5 Flash users report 3x price hikes and broken tool chains one day after launch

TL;DR

Google's own materials say Gemini 3.5 Flash is headed almost everywhere: the Gemini app blog post ties Gemini Spark to the Antigravity harness, the model page exposes the 1M context window, and Simon Willison's notes pulled out the buried details, including a January 2025 knowledge cutoff and Google's new Interactions API. Meanwhile you can browse kwindla's eval code, inspect the Gradient Bang task benchmark, and check the BullshitBench viewer that fueled part of the backlash.

Pricing

The launch landed with Flash branding and near-Pro pricing. According to haider1's pricing comparison, Gemini 3.5 Flash sat much closer to Gemini 3.1 Pro Preview than to earlier Flash models.

The price complaints were not just about list rates. Several benchmark runners said the model also spent more tokens to finish work:

That made "Flash" feel like a label with legacy expectations attached to it, not a cheap tier.

Benchmarks

The benchmark picture was not uniformly bad. It was weird.

On the positive side, kwindla's thread said Gemini 3.5 Flash with a high thinking budget became the top model on a 32-turn task-agent benchmark, with a per-turn P50 under two seconds. The same thread said the model improved tool calling over earlier Gemini 3 variants, even though median time to first token stayed around 1 second.

On the negative side, other public evals were much colder:

The cleanest read from the evidence pool is that Gemini 3.5 Flash looked best when the test rewarded throughput plus long multi-turn completion, and much worse when people cared about cost-normalized coding quality or reliability.

Harnesses

The loudest complaints came from people running the model inside agent harnesses instead of single-turn chats.

The reports clustered around three failure modes:

  • Harness mismatch: rishdotblog's report said Gemini 3.5 Flash was "totally broken in non-Google harnesses" and slower and worse than GPT-5.5 or Opus at tool chaining.
  • Over-scripted agent narration: arohan's screenshot showed the model repeatedly speaking in an "I will" pattern during a coding task, then arohan's follow-up suggested the summarization model should sound more like an inner voice.
  • Volatile real-task quality: bridgemindai's Flappy Bird demo called its coding output "pure slop," while petergostev's video summary said it sometimes generated best-in-class results and sometimes crashed out or did something strange.

Google's own product framing helps explain why this gap mattered. simonw's linked quote highlighted the line that Gemini Spark "runs on Gemini 3.5 and uses the Antigravity harness," so harness behavior was part of the product story on day one, not an edge case.

Gemini app

The web and app rollout produced concrete regression reports within hours.

Two reports were especially specific:

  • koltregaskes's post described a greyed out "Create video" action, unclear credit exhaustion, and a multi-video project that suddenly forked into a fresh chat on the fifth step.
  • kchonyc's post said Gemini stopped turning schedules from PDFs or HTML into Google Calendar entries after the update, then kchonyc's later follow-up said the Calendar integration was still broken the next morning.

Those issues landed alongside a broader UI refresh. testingcatalog's UI preview had already spotted a new Gemini interface, a thinking effort selector, and a usage-limits tab before I/O. The official rollout posts, GeminiApp's free rollout clip and OfficialLoganK's feedback request, showed Google treating the model as a live product surface that would be tuned in public.

Antigravity rollout

Google widened access even as complaints piled up.

Three rollout details stood out:

That last update matters because the launch story did not end at ship. It kept changing under load, with new limits, new packaging, and more surfaces exposing the same fast, expensive, still-settling model.

Further reading

Discussion across the web

Where this story is being discussed, in original context.

On X· 6 threads
TL;DR6 posts
Pricing5 posts
Benchmarks4 posts
Harnesses4 posts
Gemini app4 posts
Antigravity rollout3 posts
Share on X