OpenAI, known for its GPT-4 text-to-text and DALL·E 3 text-to-image models, has announced its next AI phase: Sora, a text-to-video model.
Given text instructions, Sora can generate realistic and imaginative scenes up to a minute long while maintaining exceptional visual quality and close adherence to the user's prompt:
We’re teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems that require real-world interaction.
Sora can create an entirely new video or extend existing generated videos. Sora can also take an existing still image and generate a video from it, "animating the image's contents with accuracy and attention to small detail."
How OpenAI Sora works is detailed in the following steps:
1. Sora uses the recaptioning technique from DALL·E 3, which produces highly descriptive captions for the visual training data (sketched in the first example after this list).
2. Sora is a diffusion model: it starts from a video that looks like static noise and gradually transforms it into a clear result by removing the noise over many steps (see the denoising loop after this list).
3. Sora uses a transformer architecture, which unlocks superior scaling performance (see the patch example after this list):
We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios.
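To make step 1 concrete, here is a minimal, purely illustrative sketch of the recaptioning idea. The `recaption` function is a hypothetical stand-in for a learned captioning model; OpenAI has not published the captioner it uses:

```python
# Hypothetical sketch only: `recaption` stands in for a vision-language
# captioning model (like the one used for DALL·E 3); it is not a real API.

def recaption(video_path: str, short_label: str) -> str:
    """Stand-in for a captioner that rewrites a terse human label
    into a highly descriptive training caption."""
    # A real system would run a captioning model over the video's frames.
    return f"A highly descriptive caption expanding on '{short_label}'."

# Pair each training clip with a rich caption instead of its short label.
training_clips = [("clip_001.mp4", "dog on beach"),
                  ("clip_002.mp4", "city street at night")]
recaptioned = [(path, recaption(path, label)) for path, label in training_clips]
print(recaptioned[0][1])
```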
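Step 2 can be illustrated with a toy denoising loop. Everything here is an assumption for illustration: the tensor shapes and step count are arbitrary, and `denoise_step` stands in for the trained neural network that a real diffusion model learns:

```python
import numpy as np

def denoise_step(noisy_video: np.ndarray, step: int, total_steps: int) -> np.ndarray:
    """Stand-in for a trained network that predicts the noise to subtract.
    Here we simply shrink the tensor so the loop structure is visible."""
    predicted_noise = noisy_video / (total_steps - step)
    return noisy_video - predicted_noise

def generate_video(frames=16, height=64, width=64, channels=3, total_steps=50):
    # Start from pure Gaussian noise: a video of static.
    video = np.random.randn(frames, height, width, channels)
    # Remove a little noise at each of many steps.
    for step in range(total_steps):
        video = denoise_step(video, step, total_steps)
    return video

print(generate_video().shape)  # (16, 64, 64, 3)
```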
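The quote in step 3 describes representing videos as collections of patches. Below is a hedged sketch of how a video tensor might be cut into such spacetime patches, each flattened into a vector that plays the role a token plays in GPT. The patch sizes are arbitrary assumptions, not Sora's actual values:

```python
import numpy as np

def video_to_patches(video: np.ndarray, pt=4, ph=8, pw=8) -> np.ndarray:
    """Split a (frames, height, width, channels) video into flattened
    spacetime patches of size (pt, ph, pw) per patch."""
    f, h, w, c = video.shape
    assert f % pt == 0 and h % ph == 0 and w % pw == 0
    video = video.reshape(f // pt, pt, h // ph, ph, w // pw, pw, c)
    # Group the patch-index axes together, then flatten each patch.
    video = video.transpose(0, 2, 4, 1, 3, 5, 6)
    return video.reshape(-1, pt * ph * pw * c)

video = np.random.randn(16, 64, 64, 3)   # 16 frames of 64x64 RGB
patches = video_to_patches(video)
print(patches.shape)  # (256, 768): 256 "tokens", each a 768-dim patch
```

Because any video, whatever its duration, resolution, or aspect ratio, can be reduced to such a patch sequence, a transformer can be trained on far more varied visual data than fixed-size architectures allow.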
Sora is being opened for review: OpenAI is granting access to selected visual artists, designers, and filmmakers to gather feedback on how the model can best serve creative professionals:
Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it. That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time.
Red teamers will also assess the model for critical harms and risks. Red teamers are domain experts in areas like misinformation, hateful content, and bias. Sora will also be evaluated by invited policymakers, educators, and artists around the world to identify concerns and positive use cases.
Sora goes beyond simply rendering the user's text input. The model creates complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background:
The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately persist characters and visual style.