Release: June 24, 2026

Gemini 3.5 Flash adds Computer Use with 78.4 OSWorld score

Google released built-in Computer Use for Gemini 3.5 Flash across browser, mobile, and desktop. Try it for agent workflows, but watch for timeout issues on long design-from-scratch runs.

TL;DR

Google added native Computer Use to Gemini 3.5 Flash, and _philschmid's launch thread says it now works across browser, mobile, and desktop from the main model.
testingcatalog's benchmark post pegged Gemini 3.5 Flash at 78.4 on OSWorld-Verified, while WesRoth's summary says the feature is in public preview through the Gemini API and Gemini Enterprise Agent Platform.
In Cua's KiCad EDA eval, trycua's early-access result reported that Gemini 3.5 Flash posted the top mean reward at 0.267, while trycua's leaderboard note says GPT-5.5 still had more full solves, 6 versus Gemini's 5.
trycua's limitation note says a long-run cliff remains on design-from-scratch tasks, where runs time out inside a 200-step budget even when the model shows the right domain knowledge.
Google paired the release with approval gates and prompt-injection stops, and osanseviero's reply about enterprise safety says extra safety systems and tooling integrations shipped alongside it.

The launch materials span Google's launch post, the build docs, Cua's benchmark, a Jira automation demo, and a KiCad run. One of the more useful buried details is that the tool is exposed as a native capability inside Gemini 3.5 Flash, per _philschmid's launch thread, not a separate wrapper model. Another is that, per OfficialLoganK's GA post, Google made the Interactions API the new default API surface at almost the same time.

Computer Use

Google's launch pitch is straightforward: give Gemini 3.5 Flash a screen and a goal, and the model decides the actions. According to _philschmid's launch thread, the built-in tool supports three environments:

Browser
Mobile phones
Desktop

WesRoth's summary adds the action vocabulary Google is exposing at a high level: screenshot analysis plus UI actions such as clicking, scrolling, and typing. The same post says each action can also carry an explanation of what the model is trying to accomplish.

Safeguards

The safety stack in the launch materials centers on intervention points, not just policy text. _philschmid's launch thread lists three concrete controls:

User confirmation
Auto-stop on suspected prompt injection
Additional prompt-injection training

osanseviero's reply about enterprise safety says Google also shipped enterprise safety systems and tooling integrations to make production rollouts easier. That matters because this is one of the few computer-use launches framing approval flow as part of the product surface on day one.
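The two intervention points described above, user confirmation and auto-stop on suspected prompt injection, can be pictured as a small gate in front of each proposed UI action. This is a minimal sketch, not Google's implementation: `ProposedAction`, its fields, and `gate` are hypothetical names invented for illustration, and the launch materials do not describe how these signals are actually exposed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    """Hypothetical shape of one model-proposed UI action (not a real API type)."""
    kind: str                          # e.g. "click", "scroll", "type"
    explanation: str                   # the model's stated intent for this action
    needs_confirmation: bool = False   # assumed flag: action should be approved by a user
    injection_suspected: bool = False  # assumed flag: prompt injection was detected


def gate(action: ProposedAction, confirm: Callable[[ProposedAction], bool]) -> str:
    """Decide what to do with an action: "stop", "skip", or "execute"."""
    # Suspected prompt injection halts the run before anything else is considered.
    if action.injection_suspected:
        return "stop"
    # Sensitive actions only proceed if the user approves them.
    if action.needs_confirmation and not confirm(action):
        return "skip"
    return "execute"


if __name__ == "__main__":
    submit = ProposedAction("click", "Submit the ticket form", needs_confirmation=True)
    print(gate(submit, confirm=lambda a: False))  # user declined the action
```

Checking for injection before confirmation is a design choice in this sketch: a run that may have been hijacked should not reach the user with an approval prompt at all.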

OSWorld

testingcatalog's OSWorld post says Gemini 3.5 Flash scored 78.4 on OSWorld-Verified. The attached chart places it:

Equal with Sonnet 4.6 at 78.4
Slightly behind GPT-5.5 at 78.7
Behind Opus 4.8 at 83.4
Ahead of Gemini 3.1 Pro at 76.2
Well ahead of Gemini 3 Flash at 65.1

The benchmark image also includes a code sample that calls client.interactions.create(...) with tools=[{"type": "computer_use", "environment": "mobile"}], which ties the new eval numbers directly to the Interactions API surface.
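To make the shape of that call concrete, here is a minimal sketch that only builds the keyword arguments and does not contact any service. The `tools` entry mirrors the sample in the benchmark image. Everything else is an assumption: the helper name `build_computer_use_request`, the `model` and `input` parameter names, the model identifier string, and the `"browser"` and `"desktop"` environment values (only `"mobile"` appears in the sample).

```python
# Assumed environment strings; only "mobile" is confirmed by the sample in the image.
ASSUMED_ENVIRONMENTS = {"browser", "mobile", "desktop"}


def build_computer_use_request(
    goal: str,
    environment: str,
    model: str = "gemini-3.5-flash",  # assumed identifier, check the official docs
) -> dict:
    """Build hypothetical keyword arguments for client.interactions.create(...)."""
    if environment not in ASSUMED_ENVIRONMENTS:
        raise ValueError(f"unsupported environment: {environment!r}")
    return {
        "model": model,   # assumed parameter name
        "input": goal,    # assumed parameter name
        # This tool entry matches the shape shown in the benchmark image.
        "tools": [{"type": "computer_use", "environment": environment}],
    }


if __name__ == "__main__":
    request = build_computer_use_request("Open settings and enable dark mode", "mobile")
    print(request["tools"])
```

With a real client, the dictionary would be unpacked into the call, for example `client.interactions.create(**request)`, once the parameter names have been checked against the official documentation.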

KiCad EDA

Cua's early-access test is a better stress test than ordinary browser automation demos. trycua's methodology note says the suite used 25 expert-authored KiCad tasks, a matched 200-step budget, and a netlist match as the success criterion.

The interesting split is full solves versus mean reward. According to trycua's leaderboard note, GPT-5.5 solved 6 of 25 tasks outright, but trycua's early-access result says Gemini 3.5 Flash led on mean reward at 0.267 because it combined 5 full solves with 3 partials.
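The relationship between those numbers can be checked with a little arithmetic. The sketch below assumes that a full solve earns a reward of 1.0, that unsolved tasks earn 0, and that the mean is taken over all 25 tasks. None of these scoring details is stated in the posts, so treat the output as a rough consistency check and not as Cua's actual scoring.

```python
def implied_partial_credit(
    tasks: int,
    mean_reward: float,
    full_solves: int,
    partials: int,
    full_reward: float = 1.0,  # assumption: a full solve scores 1.0
) -> tuple[float, float, float]:
    """Return (total reward, reward from partials, average reward per partial)."""
    total = tasks * mean_reward
    from_partials = total - full_solves * full_reward
    return total, from_partials, from_partials / partials


if __name__ == "__main__":
    # Figures reported for Gemini 3.5 Flash: 25 tasks, 0.267 mean, 5 full, 3 partial.
    total, from_partials, per_partial = implied_partial_credit(25, 0.267, 5, 3)
    print(f"total reward: {total:.3f}")
    print(f"from partials: {from_partials:.3f}")
    print(f"average per partial: {per_partial:.3f}")

    # Under the same assumption, 6 full solves alone contribute 6 / 25 to the mean.
    print(f"mean from 6 full solves only: {6 / 25:.3f}")
```

Under these assumptions the three partials together account for about 1.7 reward points, which is how five full solves can outscore six on mean reward.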

Grounding and reasoning

Cua's screenshots break the result into two capabilities:

Pixel grounding: trycua's execution-quality note says Gemini 3.5 Flash could hit zoomed-in CAD targets accurately inside dense KiCad UI state.
Analog reasoning: trycua's analog-design note says it derived a resistor value from a feedback equation, mapped it to the E96 series, then checked a datasheet.

That is a stronger signal than a generic "clicked the right button" demo. KiCad is cramped, stateful, and unforgiving.

Timeout cliff

The release did not erase the long-horizon problem. trycua's limitation note says design-from-scratch tasks still time out within the 200-step budget, even when the model appears to have the necessary engineering knowledge.
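The way a fixed step budget turns a slow but competent run into a failure can be sketched as a simple loop. This illustrates the general pattern only. It is not Cua's harness: `RunResult` and `run_with_step_budget` are hypothetical names, and the `step` callback stands in for one model action plus a success check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    status: str      # "solved" or "timeout"
    steps_used: int


def run_with_step_budget(step: Callable[[int], bool], max_steps: int = 200) -> RunResult:
    """Call step(i) until it reports success or the budget is exhausted."""
    for i in range(max_steps):
        if step(i):
            return RunResult("solved", i + 1)
    # The budget ran out before success, however close the run was to finishing.
    return RunResult("timeout", max_steps)


if __name__ == "__main__":
    # A task that would succeed on its 250th step still counts as a timeout at 200.
    print(run_with_step_budget(lambda i: i == 249))
    # A task that finishes on its 10th step is solved well within budget.
    print(run_with_step_budget(lambda i: i == 9))
```

A harness like this cannot tell a run that was a few steps from finishing apart from one that was lost, which is why partial-credit rewards are informative on long tasks.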

That lines up with how practitioners already use this category. In omarsar0's reaction, Omar says computer-use models are most valuable for agentic loops and long-running tasks, while omarsar0's follow-up narrows the use case to jobs that browser-only tooling cannot reach.

Demos

The available demos fall into three distinct shapes:

A Jira workflow demo for browser form-filling and ticket submission.
A KiCad demo for dense desktop CAD interaction through the agy CLI and Cua Driver.
An Android demo from _philschmid for phone control, where his note says Gemini 3.5 Flash appeared to prefer English text on websites.

That last detail is small, but it is the kind of friction engineers usually discover only after wiring up a real run.

Interactions API

Computer Use landed next to a bigger platform change. OfficialLoganK's GA post says Google moved the Interactions API to general availability and made it the default API going forward.

WesRoth's summary of Interactions API GA lists the pieces Google is bundling into that surface:

Managed Agents
Background execution
Expanded tool support
Multimodal content generation
Gemini Omni support, coming soon

That makes the Computer Use launch look less like a one-off tool and more like a first-class capability inside Google's stateful agent stack.