Gemini Omni: Google's World Model for Video (2026)

June 2, 2026

Gemini Omni is Google's new "world model" that creates and edits video from any mix of text, image, audio, and video input, with edits driven by plain-language conversation. Google unveiled it at I/O 2026 on May 19, 2026, and shipped the first model in the family, Gemini Omni Flash, the same day to the Gemini app, Google Flow, and YouTube Shorts.1

TL;DR

Gemini Omni is Google DeepMind's attempt to fuse Gemini's reasoning with generation: instead of translating a text prompt into isolated pixels, it combines references across modalities and "reasons about what should happen next," grounded in Gemini's knowledge of physics, history, and culture.1 The launch model, Gemini Omni Flash, generates video with native audio and lets you refine clips over multiple conversational turns while keeping characters and scenes consistent.1 It is available now to Google AI Plus ($7.99/mo), AI Pro ($19.99/mo), and AI Ultra ($100 or $200/mo) subscribers via the Gemini app and Google Flow, and free on YouTube Shorts and the YouTube Create app.1,2 A developer and enterprise API is promised "in the coming weeks" but is not yet available and has no committed date.1 Every output carries an imperceptible SynthID watermark with no opt-out.1 The launch was announced by Koray Kavukcuoglu, CTO of Google DeepMind.1

What you'll learn

  • What "world model" means in the context of Gemini Omni, and how it differs from a standard text-to-video generator
  • What Gemini Omni Flash can actually do today: conversational editing, multi-input references, and avatars
  • Where you can use it right now, and what it costs across Google's AI subscription tiers
  • How Gemini Omni compares to Veo 3.1, Google's photorealistic video model
  • The status of the developer API, SynthID watermarking, and the features Google is holding back

What is Gemini Omni?

Gemini Omni is described by Google as a model that can "create anything from any input — starting with video."1 The framing matters. Google calls it a world model: a system that doesn't just recognize patterns across modalities but simulates and reasons about physical reality. In Google's words, Omni "doesn't just build scenes that look real, it reasons about what should happen next," combining "an intuitive understanding of physics with Gemini's knowledge of history, science and cultural context."1

Omni is the next step after Nano Banana, the image generation and editing model Google shipped the prior year that became widely used for restoring photos and designing from sketches.1 Where Nano Banana brought Gemini's intelligence to still images, Omni extends that intelligence to motion, beginning with video output and, Google says, expanding to image and audio output "in time."1 (For background, see our earlier coverage of Nano Banana.)

The first and only model available so far is Gemini Omni Flash. A more capable "Pro" tier in the Omni family has been discussed in coverage of the launch but has not been formally specified or dated by Google, so treat it as unconfirmed.

Conversational editing is the headline feature

The capability Google leads with is editing video through natural language across multiple turns. "Every instruction builds on the last," Google writes. "Your characters stay consistent, the physics hold up and the scene remembers what came before."1

In Google's own examples, a user starts with a clip of a violinist, then issues sequential prompts: "Transport the violinist to the image environment," "Make the violin invisible," "Change the camera angle to be over the violinist's shoulder."1 Each edit preserves the thread of the original scene rather than regenerating from scratch. Other demonstrated edits include changing materials ("Make the sculpture out of bubbles"), reworking actions in footage you shot yourself, and applying styles, motion, or effects pulled from reference media.1

This is the practical distinction from a one-shot text-to-video tool: Omni treats your video as a living document you converse with, not a single render you accept or discard.
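One way to picture the multi-turn behavior described above is as an ordered history of instructions attached to a single clip. The sketch below is purely illustrative: `EditSession`, its methods, and the file name `violinist.mp4` are hypothetical and do not correspond to any published Google API. It only models the idea that each turn is added on top of the earlier ones instead of replacing them.

```python
from dataclasses import dataclass, field


@dataclass
class EditSession:
    """Hypothetical model of a multi-turn edit: one clip plus ordered instructions."""

    source_clip: str  # assumed file name, for illustration only
    instructions: list = field(default_factory=list)

    def add(self, instruction: str) -> "EditSession":
        # Each turn is appended; earlier turns are never discarded.
        self.instructions.append(instruction)
        return self

    def context_for_turn(self, turn: int) -> list:
        # All instructions in effect at a given turn (1-indexed).
        return self.instructions[:turn]


session = EditSession("violinist.mp4")
session.add("Transport the violinist to the image environment")
session.add("Make the violin invisible")
session.add("Change the camera angle to be over the violinist's shoulder")

# By turn 2, the scene change from turn 1 is still part of the context.
print(session.context_for_turn(2))
```

A one-shot generator would correspond to a session that only ever holds a single instruction; the accumulated history is what separates the two workflows.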

Any input, grounded in world knowledge

Omni accepts images, text, video, and audio as references and blends them into a single cohesive output.1 One launch example combines an image, a reference video, and an audio file: "Dynamic sci-fi film style video based on image_0.png. Elements light up similar to video_0.mp4 synchronized to the beat of the music from audio_0.wav."1

On the physics side, Google says Omni has an improved intuitive grasp of forces like gravity, kinetic energy, and fluid dynamics, which it uses to render more believable motion.1 On the knowledge side, it draws on Gemini's broader understanding to produce explainer-style content — Google shows a "claymation explainer of protein folding" generated from a short prompt.1

One important caveat on inputs: for audio references, only voice references are supported to start, with other audio input types coming later.1
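The input rules above (four reference types, with audio limited to voice references at launch) can be expressed as a small pre-flight check. This is a minimal sketch under stated assumptions: the `Reference` type, the `is_voice` flag, and the file names `narration.wav` and `ambience.wav` are hypothetical, and nothing here reflects how Google actually validates input.

```python
from dataclasses import dataclass

# The four reference types the article lists for Omni.
SUPPORTED_KINDS = {"text", "image", "video", "audio"}


@dataclass(frozen=True)
class Reference:
    kind: str               # "text", "image", "video", or "audio"
    value: str              # prompt text or a file name
    is_voice: bool = False  # hypothetical flag; only meaningful for audio


def unsupported_reasons(refs):
    """Return a human-readable problem for each reference that breaks the stated rules."""
    problems = []
    for ref in refs:
        if ref.kind not in SUPPORTED_KINDS:
            problems.append(f"{ref.value}: unknown reference kind {ref.kind!r}")
        elif ref.kind == "audio" and not ref.is_voice:
            problems.append(f"{ref.value}: only voice audio references are supported at launch")
    return problems


refs = [
    Reference("text", "Dynamic sci-fi film style video"),
    Reference("image", "image_0.png"),
    Reference("audio", "narration.wav", is_voice=True),
    Reference("audio", "ambience.wav"),  # non-voice audio, flagged below
]
problems = unsupported_reasons(refs)
print(problems)
```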

Avatars, and what Google is holding back

Omni includes an Avatars feature that creates a digital version of you so you can generate videos that look and sound like you, using your own voice.1 Google frames this within its responsible-AI policies.

What Google is explicitly not shipping yet is broader audio and speech editing of existing video. "In terms of editing videos to change audio and speech, we are still working to test this and better understand how we can bring this capability to users responsibly," the company writes.1 So the avatar (your own voice) path is live; arbitrary audio/speech editing is withheld pending further testing.

Every video created with Omni includes an imperceptible SynthID watermark, and there is no opt-out. Google says you can verify Omni-generated video through the Gemini app, Gemini in Chrome, and Google Search.1

Where to use it, and what it costs

Gemini Omni Flash began rolling out on launch day to all Google AI Plus, Pro, and Ultra subscribers globally through the Gemini app and Google Flow, and at no cost on YouTube Shorts and the YouTube Create app.1 Google reorganized its AI subscriptions at I/O 2026 into the tiers below.2,3

| Tier | Price (per month) | Notable for Omni users |
| --- | --- | --- |
| Google AI Plus | $7.99 | Entry paid access; 200 GB storage, double Gemini limits |
| Google AI Pro | $19.99 | 5 TB storage, quadruple limits, Pro model access |
| Google AI Ultra | $100 | Aimed at developers and creators; 5x Pro's usage limits |
| Google AI Ultra | $200 | Same features, 20x Pro's usage limits (cut from $250) |
| YouTube Shorts / Create | Free | No-cost Omni access starting launch week |

⚠ Prices change frequently. The values above are for illustration only and may be out of date. Always verify current pricing directly with the provider before making cost decisions.

The free YouTube path is the lowest-friction way to try it; the paid Gemini app and Flow paths unlock the fuller editing workflow.
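To compare the paid tiers on usage instead of sticker price, you can divide each monthly price by its stated multiple of Pro's usage limits. The calculation below is a rough sketch that assumes the multiples scale linearly and apply to Omni usage in the same way, neither of which the article confirms. The prices are the illustrative figures from the tier list above.

```python
# (monthly price in USD, usage limit as a multiple of Google AI Pro's limit)
TIERS = {
    "Google AI Pro": (19.99, 1),
    "Google AI Ultra ($100)": (100.00, 5),
    "Google AI Ultra ($200)": (200.00, 20),
}


def cost_per_pro_unit(price: float, multiple: int) -> float:
    """Monthly cost for one Pro-sized allowance of usage."""
    return price / multiple


costs = {name: cost_per_pro_unit(price, mult) for name, (price, mult) in TIERS.items()}

for name, cost in costs.items():
    print(f"{name}: ${cost:.2f} per Pro-equivalent of usage")
# Google AI Pro: $19.99 per Pro-equivalent of usage
# Google AI Ultra ($100): $20.00 per Pro-equivalent of usage
# Google AI Ultra ($200): $10.00 per Pro-equivalent of usage
```

Under those assumptions, the $100 Ultra tier costs about the same per unit of usage as Pro, and only the $200 tier halves the per-unit cost.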

Gemini Omni vs Veo 3.1

Google now has two video models, and they serve different jobs. Veo 3.1 is Google DeepMind's photorealistic, cinema-oriented generator. Per Google DeepMind, Veo 3.1 produces 4-, 6-, or 8-second clips at 720p, 1080p, or 4K, with native 48kHz stereo audio at 24 frames per second, and an Extend feature that chains footage into longer sequences.4 Gemini Omni, by contrast, is positioned around multi-input reasoning, conversational multi-turn editing, and world-knowledge grounding rather than maximum photorealistic fidelity.1

| Dimension | Gemini Omni Flash (source: footnote 1) | Veo 3.1 (source: footnote 4) |
| --- | --- | --- |
| Core idea | World model: reason + create from any input | Photorealistic video generation |
| Inputs | Text, image, video, voice reference | Text, image |
| Editing | Conversational, multi-turn, scene-persistent | Prompt-based generation + Extend |
| Resolution | Not formally published by Google | 720p / 1080p / 4K |
| Audio | Native audio; voice references in, more coming | Native 48kHz stereo, 24fps |
| Watermark | SynthID, no opt-out | SynthID |
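To put the Veo 3.1 figures in concrete terms, the clip lengths, frame rate, and audio sample rate quoted above translate directly into frame and sample counts. The arithmetic below uses only those published numbers; the variable names are illustrative.

```python
FPS = 24                  # frames per second, per the Veo 3.1 figures above
SAMPLE_RATE_HZ = 48_000   # audio samples per second, per channel
CHANNELS = 2              # stereo
CLIP_SECONDS = (4, 6, 8)  # supported clip lengths

# Video frames in each supported clip length.
frames = {seconds: seconds * FPS for seconds in CLIP_SECONDS}

# Audio samples per channel that accompany a single video frame.
samples_per_frame = SAMPLE_RATE_HZ // FPS

# Total audio samples (both channels) in the longest clip.
samples_longest_clip = max(CLIP_SECONDS) * SAMPLE_RATE_HZ * CHANNELS

for seconds, count in frames.items():
    print(f"{seconds} s clip: {count} frames")
print(f"{samples_per_frame} audio samples per channel per frame")
print(f"{samples_longest_clip} audio samples in an 8 s stereo clip")
# 4 s clip: 96 frames
# 6 s clip: 144 frames
# 8 s clip: 192 frames
# 2000 audio samples per channel per frame
# 768000 audio samples in an 8 s stereo clip
```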

A note on Omni's clip specs: Google's launch materials demonstrate short clips (one example prompt explicitly asks for a "10s" video), and press coverage has reported a roughly 10-second cap with native audio.5 Google has not published a formal Omni Flash spec sheet listing maximum duration or output resolution, so those figures should be treated as reported rather than officially confirmed.

The developer API: not yet

If you are building on top of Omni, the short answer is wait. Google says only that "in the coming weeks, we'll also be rolling it out to developers and enterprise customers via APIs."1 As of this writing it is not yet available to developers, and Google has not given a committed date. Plan around it as a future item, not a current option.

For developers who need a video API today, Veo remains the available route, and for fast multimodal text-and-vision work, Google's other I/O launch, Gemini 3.5 Flash, is generally available now.
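If you want your code ready for the day the Omni API ships, one option is to route requests through a small selection layer, so that switching backends later means changing a flag and leaving the call sites alone. The sketch below is an assumption-heavy illustration: `VideoBackend`, `pick_backend`, and the capability flags are hypothetical names, and the values simply mirror the availability described in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VideoBackend:
    name: str
    api_available: bool                # can developers call it today?
    supports_multi_turn_editing: bool  # conversational edits on one clip


def pick_backend(backends, need_multi_turn: bool) -> Optional[VideoBackend]:
    """Prefer a backend with multi-turn editing when requested, else the first usable one."""
    usable = [b for b in backends if b.api_available]
    if need_multi_turn:
        for backend in usable:
            if backend.supports_multi_turn_editing:
                return backend
    return usable[0] if usable else None


# Flag values reflect the article's description at the time of writing.
today = [
    VideoBackend("Gemini Omni Flash", api_available=False, supports_multi_turn_editing=True),
    VideoBackend("Veo 3.1", api_available=True, supports_multi_turn_editing=False),
]
print(pick_backend(today, need_multi_turn=True).name)  # Veo 3.1

# Hypothetical future state once an Omni API is released.
later = [
    VideoBackend("Gemini Omni Flash", api_available=True, supports_multi_turn_editing=True),
    VideoBackend("Veo 3.1", api_available=True, supports_multi_turn_editing=False),
]
print(pick_backend(later, need_multi_turn=True).name)  # Gemini Omni Flash
```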

Bottom line

Gemini Omni reframes AI video from one-shot generation to an editable, conversational medium grounded in a model that reasons about physics and world knowledge. Gemini Omni Flash is live today for subscribers and free on YouTube, but the developer API, the part most builders care about, is still weeks out with no firm date. If you make videos yourself, it is worth trying now; if you ship products on top of video models, keep Veo in production and watch for the Omni API. Either way, the SynthID-by-default, no-opt-out watermarking signals where Google thinks AI media accountability is heading.

Footnotes

  1. Koray Kavukcuoglu, "Introducing Gemini Omni," Google (The Keyword), May 19, 2026. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-omni/

  2. "Google AI subscription updates from Google I/O 2026," Google (The Keyword), 2026. https://blog.google/products-and-platforms/products/google-one/google-ai-subscriptions/

  3. "What Gemini features you get with Google AI Plus, Pro, & Ultra," 9to5Google, May 25, 2026. https://9to5google.com/2026/05/25/google-ai-plus-pro-ultra-gemini-features/

  4. "Veo 3.1," Google DeepMind. https://deepmind.google/models/veo/

  5. "Google launches Gemini Omni Flash, a conversational video-generation model," The Next Web, 2026. https://thenextweb.com/news/google-gemini-omni-flash-video-model-io-2026

Frequently Asked Questions

What is Gemini Omni?

Gemini Omni is Google's multimodal "world model" that generates and edits video from text, image, audio, and video input, using conversational editing. The first model is Gemini Omni Flash, launched May 19, 2026.1
