Gemini Omni Flash claims SOTA at image-to-video, text-to-video, and video editing
Google says Gemini Omni Flash now leads its benchmark set across image-to-video, text-to-video, and video editing, with API access coming soon. The claim matters because creators are already posting Flow-based reconstruction and relighting demos, even though the broader developer rollout is not live yet.

TL;DR
- OfficialLoganK's benchmark post points to Google's official Gemini Omni benchmark page, where Gemini Omni Flash is presented as leading Google's current comparison set for video editing, text-to-video, image-to-video, and reference-to-video.
- Google's Gemini Omni announcement says Omni Flash is already rolling out in the Gemini app, Google Flow, YouTube Shorts, and YouTube Create, while API access for developers and enterprise customers is still "coming in the following weeks," matching the "API soon" line in OfficialLoganK's thread.
- The official Flow page frames Omni as a conversational editing model that can create or edit from text, image, audio, and video references, and chrisfirst's demo thread shows the creator-facing version of that pitch in practice.
- OfficialLoganK's reply on benchmarks adds a useful caveat: Google is leaning on benchmark wins, but Logan also said he wants to see where those numbers feel off in real use.
You can browse the benchmark section, read Google's launch post, and poke through the official Flow tool page. The weirdly tangible part is not the leaderboard claim but the creators already using Omni inside Flow for relighting and shallow-depth-of-field makeovers in chrisfirst's demo thread, while two replies from bilawalsidhu jump straight to volumetric replay ideas for sports.
Benchmark page
Google's benchmark page breaks Omni Flash into four buckets:
- Video editing
- Text to video
- Image to video
- Reference to video
According to the official Gemini Omni page, the comparisons mix human side-by-side preference tests and benchmark datasets such as MovieGenBench and VBench I2V. The page claims Omni Flash leads Google's tested set on overall preference and instruction following for editing and text-to-video, and on image-to-video against named models including Kling and Grok's video system.
That claim is narrower than "best video model," and OfficialLoganK's own follow-up is the clearest reminder of it: benchmark wins are interesting, but Google is still waiting to see where real-world use diverges from the chart.
Flow editing
The official Flow page describes Omni as a model that can create and edit video from "any input reference," real or generated, through conversation. Google's launch post says that includes text, image, video, and voice references, with edits carrying across turns instead of restarting from scratch.
chrisfirst's demo thread gives the more useful creator read: shoot rough footage anywhere, then use Omni inside Flow to relight it, blur the background, and re-render the shot into something closer to a controlled studio setup. chrisfirst's separate complaint about Fable also hints at the current tradeoff: creators are already comparing usable generation time and iteration limits across tools, not just output quality.
Access surfaces
Google's launch post says Omni Flash is the first model in the family and is rolling out across these surfaces:
- Gemini app
- Google Flow
- YouTube Shorts
- YouTube Create
The same post says Shorts and Create access come at no cost, while fuller access sits inside Google's paid AI tiers. The official Flow changelog also logs Gemini Omni Flash as part of Flow's May 19 update set, which is a cleaner timestamp for when the tool actually landed in Google's creator stack.
The missing piece is still the developer surface. OfficialLoganK's thread says the API is coming soon, and Google's announcement uses nearly the same language, "in the following weeks," for developer and enterprise access. Until that flips, Omni Flash is still more of a creator product than a programmable building block.
Sports reconstructions
The most interesting unsolicited use case in the evidence is sports. Replying to a reconstruction clip, bilawalsidhu connected Omni-style scene understanding to Sony's Hawk-Eye camera arrays and argued that the obvious next step is not just referee review, but replay systems that let viewers move around the play.
That lines up with chrisfirst's courtside post, which pitches 3D game reconstructions as a cheaper substitute for premium seats, viewed through Apple Vision Pro. It is a different category from text-to-video benchmarks, and it is the clearest sign in this small evidence set that creators are already treating Omni less like a prompt toy and more like infrastructure for camera moves that were previously impossible.