OpenAI Introduces New Flagship Model GPT-4o

OpenAI’s new flagship model GPT-4o (“o” for “omni”) can reason across audio, vision, and text in real time and is considered a move towards much more natural human-computer interaction.

GPT-4o accepts any combination of text, audio, image, and video as input and generates any combination of text, audio, and image outputs. It is also fast: it can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation.

GPT-4o improves on earlier models in both performance and capability:

It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is also notably better at vision and audio understanding than existing models.

Previously, features such as Voice Mode for talking to ChatGPT relied on a pipeline of multiple models, including GPT-3.5 or GPT-4. GPT-4o is instead a single new model trained end-to-end across text, vision, and audio, so that all inputs and outputs are processed by the same neural network.
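
The architectural difference can be illustrated with a toy sketch. Everything in it is hypothetical: the names `transcribe`, `text_model`, `synthesize`, and `omni_model` are placeholders rather than OpenAI APIs, and the stubs only pass simple values around. It assumes a chained design in which the core model sees only a transcript, so anything the transcript does not capture (here, the speaker's tone) never reaches it, whereas a single end-to-end model receives the raw input.

```python
from dataclasses import dataclass


@dataclass
class AudioClip:
    """Toy stand-in for an audio signal: the words plus how they were said."""
    words: str
    tone: str


def transcribe(clip: AudioClip) -> str:
    # Hypothetical speech-to-text stage: only the words survive.
    return clip.words


def text_model(prompt: str) -> str:
    # Hypothetical text-only model: text in, text out.
    return f"reply to '{prompt}'"


def synthesize(text: str) -> AudioClip:
    # Hypothetical text-to-speech stage: has no tone information to work with.
    return AudioClip(words=text, tone="neutral")


def pipeline_voice_mode(clip: AudioClip) -> AudioClip:
    # Chained design: three separate stages, each seeing only the previous output.
    return synthesize(text_model(transcribe(clip)))


def omni_model(clip: AudioClip) -> AudioClip:
    # Single-model design (illustration only): the same function sees the whole
    # input, so it is able to take the tone into account. Mirroring the tone is
    # an arbitrary choice made for this toy example.
    return AudioClip(words=f"reply to '{clip.words}'", tone=clip.tone)


if __name__ == "__main__":
    question = AudioClip(words="how are you", tone="excited")
    print(pipeline_voice_mode(question))
    print(omni_model(question))
```

In this toy version the pipeline always answers in a neutral tone because the tone is discarded at the first stage, while the single-model function still has access to it.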

Safety is a recurring concern with most generative AI models, particularly regarding their outputs. For GPT-4o, OpenAI addresses this by filtering training data, refining the model’s behavior through post-training, and putting guardrails on voice outputs. Evaluations of cybersecurity, CBRN (chemical, biological, radiological, and nuclear), persuasion, and model autonomy, conducted according to OpenAI’s Preparedness Framework and in line with its voluntary commitments, show that GPT-4o does not score above Medium risk in any of these categories.

GPT-4o has also undergone external red teaming to identify and mitigate risks introduced by the newly added modalities. This effort involved more than 70 external experts in domains such as social psychology, bias and fairness, and misinformation.

OpenAI has identified audio modalities as especially prone to unsafe output. Audio outputs will therefore be limited to a selection of preset voices and will abide by OpenAI’s existing safety policies.