Gemini Omni adds reference-to-prompt video restyles with a 10-second cap
Creators are using Gemini Omni to read a reference design and generate a final prompt for another video model while preserving face, voice, lip sync, and gestures. Use it to separate style translation from generation, but plan around the current 10-second output limit.

TL;DR
- ozansihay's demo post shows a workflow where Gemini Omni reads a reference visual, then writes a production-ready prompt for a second video model instead of generating the whole styled edit directly.
- In ozansihay's full prompt, the key constraint is preservation: face, identity, voice, speech, lip sync, facial expression, and hand motion stay intact while only the visual design language changes.
- The prompt text also forces the model to describe style explicitly, including palette, composition, texture, typography, collage elements, grain, halftone, paper texture, motion feel, and overall atmosphere, so the handoff prompt works without access to the original reference.
- According to two separate replies from ozansihay, the current output cap is 10 seconds, which makes this feel more like a style-transfer block for shorts than a full scene workflow.
- When asked which downstream model to use, ozansihay pointed to Seedance 2.0 in a reply, which turns the Gemini Omni step into a prompt-engineering front end for another generator.
Ozan Sihay published the full prompt in a follow-up post, showed the result in the demo clip, and clarified in two separate replies that the current ceiling is 10 seconds. He also named Seedance 2.0 in a reply, which makes the interesting part less about one model shipping one feature and more about a two-model workflow in which one model translates style and another renders the video.
Reference-to-prompt handoff
The trick in ozansihay's demo is to use Gemini Omni as a style analyst, not the final renderer. The model ingests a reference image, extracts its design language, and returns a single prompt meant to be pasted into another video system.
That separation matters because the prompt in ozansihay's full text bans vague phrases like "like the reference image" or "in this style." The output has to restate the look in enough detail that a second model can reproduce it cold.
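The ban on vague phrases can be checked mechanically before the handoff. The following is a minimal sketch, assuming the two quoted phrases are the ones to catch; the function name `find_vague_references` and the sample prompts are hypothetical and not part of the published prompt.

```python
import re

# Phrases the published prompt is reported to ban. This tuple is an
# assumption based on the two examples quoted above; extend it as needed.
BANNED_PHRASES = ("like the reference image", "in this style")


def find_vague_references(prompt: str) -> list[str]:
    """Return the banned phrases found in a handoff prompt (case-insensitive)."""
    # Collapse runs of whitespace so line breaks or double spaces do not hide a match.
    normalized = re.sub(r"\s+", " ", prompt).lower()
    return [phrase for phrase in BANNED_PHRASES if phrase in normalized]


# Hypothetical sample prompts for illustration only.
good = "Muted orange and teal palette, torn paper texture, bold halftone dots."
bad = "Restyle the clip like the   reference image, in this style."

print(find_vague_references(good))  # []
print(find_vague_references(bad))   # ['like the reference image', 'in this style']
```

A check like this only catches literal phrases. It cannot tell whether the remaining description is detailed enough for a second model to reproduce the look.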
Preservation constraints
Most of the prompt is a guardrail stack. It tells the model to change only the visual layer while preserving performance and identity.
The constraints in the published prompt break down into five parts:
- Return only the final prompt text, with no explanation or bullet list.
- Describe the reference style explicitly: color palette, composition, background texture, typography, graphic elements, collage or doodle details, lighting, contrast, grain, halftone, paper texture, motion feel, atmosphere.
- Preserve the person's face, identity, voice, speaking, lip sync, facial expressions, and hand gestures.
- Analyze the video's topic, then add fitting symbols, icons, shapes, stickers, collage pieces, doodle lines, brush effects, or motion-graphics details.
- Pull only a few key words from the spoken content for on-screen text, with no long captions or full transcription.
For creative teams, this reads like a reusable spec for restyling talking-head clips into posterized social edits without losing sync.
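To make the "reusable spec" idea concrete, the five constraint groups can be held as data and rendered into an instruction block. This is a hypothetical reconstruction based only on the summary above, not the text of the published prompt; the class name `RestyleSpec` and its fields are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestyleSpec:
    """Hypothetical container for the constraint groups summarized above."""

    style_attributes: tuple[str, ...] = (
        "color palette", "composition", "background texture", "typography",
        "graphic elements", "lighting", "contrast", "grain", "halftone",
        "paper texture", "motion feel", "atmosphere",
    )
    preserve: tuple[str, ...] = (
        "face", "identity", "voice", "speech", "lip sync",
        "facial expressions", "hand gestures",
    )
    decorations: tuple[str, ...] = (
        "symbols", "icons", "shapes", "stickers", "collage pieces",
        "doodle lines", "brush effects", "motion-graphics details",
    )

    def to_instructions(self) -> str:
        """Render one instruction line per constraint group."""
        lines = [
            "Return only the final prompt text, with no explanation or bullet list.",
            "Describe the reference style explicitly: "
            + ", ".join(self.style_attributes) + ".",
            "Preserve the person's " + ", ".join(self.preserve) + ".",
            "Analyze the video's topic, then add fitting "
            + ", ".join(self.decorations) + ".",
            "Use only a few key words from the speech as on-screen text; "
            "no long captions or full transcription.",
        ]
        return "\n".join(lines)


instructions = RestyleSpec().to_instructions()
print(instructions)
```

Keeping the lists as fields means a team could, for example, drop "halftone" for a cleaner look without touching the preservation line.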
10-second ceiling
The obvious catch is duration. In two separate replies, ozansihay said the system currently generates a maximum of 10 seconds.
That limit changes the shape of the workflow. It fits teasers, ad snippets, social cutdowns, and test shots better than longer narrative scenes.
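One way to plan around the cap is to work out in advance how a longer clip would divide into windows of at most 10 seconds. The sketch below only does that arithmetic. Splitting and restitching is an assumption here, not something shown in the source, and nothing in the replies indicates that style stays consistent across separately generated segments. The name `plan_segments` is hypothetical.

```python
import math

# Cap reported in the replies; treat it as a current value that may change.
MAX_SEGMENT_SECONDS = 10.0


def plan_segments(duration: float, cap: float = MAX_SEGMENT_SECONDS) -> list[tuple[float, float]]:
    """Split a clip of `duration` seconds into (start, end) windows no longer than `cap`."""
    duration = float(duration)
    if duration <= 0 or cap <= 0:
        raise ValueError("duration and cap must be positive")
    count = math.ceil(duration / cap)
    # The last window is clamped to the clip's end.
    return [(i * cap, min((i + 1) * cap, duration)) for i in range(count)]


print(plan_segments(25))  # [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
print(plan_segments(8))   # [(0.0, 8.0)]
```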
Seedance 2.0 as the render step
When someone asked which app to use, ozansihay answered that the real work is mostly in the prompt, then suggested Seedance 2.0. That adds one more concrete detail to the stack: Gemini Omni is doing style translation, while Seedance 2.0 is the model he recommends for the actual render.
The workflow shown across the main post, the full prompt, and the reply naming Seedance 2.0 is simple:
- pick a reference visual
- have Gemini Omni convert that reference into a standalone style prompt
- keep the original speaker performance intact
- send the finished prompt to a video model for generation
It is a neat little unbundling of the creative pipeline. One model watches and describes, another model draws.
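The two-model split can be sketched as a small pipeline with pluggable stages. This is a minimal illustration under stated assumptions: `restyle`, `fake_analyst`, and `fake_renderer` are hypothetical stand-ins, and no real Gemini Omni or Seedance 2.0 API is called or implied.

```python
from typing import Callable

# Hypothetical stage signatures; real tools will differ.
StyleAnalyst = Callable[[str], str]        # reference path -> standalone style prompt
VideoRenderer = Callable[[str, str], str]  # (prompt, source clip path) -> output path


def restyle(reference_path: str, clip_path: str,
            analyst: StyleAnalyst, renderer: VideoRenderer) -> str:
    """Run the analyst stage, then hand its prompt to the renderer stage."""
    prompt = analyst(reference_path).strip()
    if not prompt:
        raise ValueError("analyst returned an empty prompt")
    # Only the prompt crosses the boundary, so the renderer never sees the reference.
    return renderer(prompt, clip_path)


# Stub stages so the sketch runs without any external service.
def fake_analyst(reference_path: str) -> str:
    return "Warm halftone collage, torn paper texture, bold sans-serif key words."


def fake_renderer(prompt: str, clip_path: str) -> str:
    return clip_path.replace(".mp4", "_restyled.mp4")


print(restyle("reference.png", "talk.mp4", fake_analyst, fake_renderer))
# talk_restyled.mp4
```

Passing the stages in as callables mirrors the point of the workflow: either model can be swapped without changing the other step.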