Google has unveiled Gemini Omni, a new family of generative AI models capable of producing and editing video from a mix of text, images, audio, and existing clips.
According to the tech giant, Gemini Omni marks a significant expansion of Google’s consumer AI tools, moving beyond last year’s image-focused Nano Banana to full multimodal video generation. Users can feed in several types of input, including photographs, sketches, voice recordings, and written prompts, and watch Omni stitch them into a single coherent video. Editing is conversational: a creator can ask the system to change a background, add an object, or alter the action while the scene retains consistent characters and physical plausibility across multiple turns.
Google stressed that Omni is grounded in Gemini’s real-world knowledge, allowing it to reason about history, science, and cultural context rather than simply producing visually convincing but meaningless footage. The company said the model demonstrates an improved intuitive understanding of forces like gravity and fluid dynamics, resulting in more realistic motion.
A digital avatar feature lets users generate a version of themselves that looks and sounds like them, though Google is still testing safeguards around editing speech and audio more broadly. All output carries an imperceptible SynthID watermark, and verification tools are available through the Gemini app, Chrome, and Google Search.
The first release, Gemini Omni Flash, is available now to Google AI Plus, Pro, and Ultra subscribers globally, as well as through Google Flow, and to users of YouTube Shorts at no cost. It will also appear in the YouTube Create app this week. Developer and enterprise access via APIs is planned for the coming weeks. Google intends to extend the Omni family to image and audio generation later, building on the models' natively multimodal architecture.