
OpenAI Introduces GPT-4o: The New State-of-the-Art Model

#GPT-4o #LLM #AI

Anuj Singh, Fullstack Engineer
May 15, 2024

OpenAI has launched its new state-of-the-art model, GPT-4o (Omni). Like Gemini, it has native multimodal capabilities, and it comes at half the cost and twice the speed of GPT-4 Turbo, with the same 128k-token context window. OpenAI demoed a few use cases, such as pair programming, live interpretation, and emotion detection, during the launch event. Since then, OpenAI has posted 20+ videos on their YouTube channel demonstrating Omni’s capabilities. Let’s take a look at the insights we have gathered since the event.

Current SOTA on Arena Elo + More Benchmarks

On LMSYS’s Chatbot Arena Elo leaderboard, Omni (shown as im-also-a-good-gpt2-chatbot in the graph) is currently the best model, both overall and in coding.

From https://x.com/sama/status/1790066003113607626
From https://openai.com/index/hello-gpt-4o/

Omni also does well at chess puzzles, currently outscoring GPT-4 Turbo and the other models by almost double.

From https://github.com/kagisearch/llm-chess-puzzles

Free For Everyone

OpenAI has announced that GPT-4o will be available to everyone through ChatGPT, even free users. Plus users get early access and a 5x message cap, currently 80 messages every 3 hours, which means free users should get around 16 messages every 3 hours. Once that limit is exhausted, ChatGPT switches to GPT-3.5 automatically. Omni should already be available to Plus users; free users may have to wait a little longer.

OpenAI is rolling out the new interactive Voice Mode through an alpha channel, and it will reach users gradually over the next few weeks.

Pricing for Developers

The GPT-4o API is already fully available to developers at half the cost of GPT-4 Turbo: $5 per million input tokens and $15 per million output tokens. That is also 5x cheaper on output than Anthropic’s top offering, Claude 3 Opus, which charges $75 per million output tokens.
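As a rough illustration, here is a minimal Python sketch that estimates a request’s cost from the usage object returned by the Chat Completions API, using the announced prices (the prompt text and price constants are our own placeholders):

```python
# Minimal sketch (not official pricing code): estimate a GPT-4o request's
# cost from the usage object, using the announced $5 / $15 per-million prices.
from openai import OpenAI

INPUT_USD_PER_M = 5.00    # $ per 1M input tokens
OUTPUT_USD_PER_M = 15.00  # $ per 1M output tokens

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the GPT-4o launch in one line."}],
)

usage = response.usage
cost = (usage.prompt_tokens * INPUT_USD_PER_M
        + usage.completion_tokens * OUTPUT_USD_PER_M) / 1_000_000
print(f"{usage.prompt_tokens} in / {usage.completion_tokens} out -> ${cost:.6f}")
```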

Speed

GPT-4o is expected to run at least twice as fast as GPT-4 Turbo across tasks, though some users have reported that the speed comes at some cost to accuracy. However, its primary use case seems to be digital assistants, where responsiveness matters more than flawless reasoning.

We compared GPT-4 Turbo and GPT-4o side by side on identical system and user prompts about building a terminal app with fuzzy search. Omni was twice as fast while generating more tokens and a more accurate result.
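If you want to reproduce a comparison like this, here is a rough timing harness (our own sketch, not OpenAI’s): send the same prompt to both models and compare tokens generated per second. Network jitter is significant, so treat single runs as indicative only.

```python
# Rough timing harness (our own sketch): same prompt to both models,
# compare completion tokens per second. Single runs are noisy.
import time
from openai import OpenAI

client = OpenAI()
PROMPT = "Write a Python terminal app with fuzzy search over filenames."

for model in ("gpt-4-turbo", "gpt-4o"):
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    elapsed = time.perf_counter() - start
    tokens = resp.usage.completion_tokens
    print(f"{model}: {tokens} tokens in {elapsed:.1f}s ({tokens / elapsed:.1f} tok/s)")
```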

Native Multimodal Support

Omni’s biggest strength, and the reason for its name, is its native multimodal support. Previously, GPT-4 had to rely on a separate model like Whisper to convert user speech to text; the text response it produced was then converted back to speech by a TTS model. According to OpenAI, Omni handles all of this itself, without needing any other model. On the surface, it looks as if Omni fuses several expert models responsible for different tasks, such as audio synthesis and image generation, into one.

From https://openai.com/index/hello-gpt-4o/
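For context, that pre-Omni chain can be sketched with OpenAI’s own existing endpoints (the file names below are placeholders); GPT-4o is claimed to collapse all three steps into a single model:

```python
# The pre-Omni voice pipeline, sketched with OpenAI's existing endpoints:
# Whisper for speech-to-text, a chat model for reasoning, TTS for the reply.
from openai import OpenAI

client = OpenAI()

# 1) Speech -> text (Whisper)
with open("question.mp3", "rb") as audio_file:  # placeholder file name
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# 2) Text -> text (the LLM itself)
answer = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3) Text -> speech (a separate TTS model)
speech = client.audio.speech.create(
    model="tts-1", voice="alloy",
    input=answer.choices[0].message.content,
)
speech.write_to_file("reply.mp3")  # placeholder output path
```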

Google’s Gemini supports multimodal input through a Mixture of Experts architecture but outputs only text. Omni supposedly supports multimodal input and output across text, audio, and images, which seems like a major leap.

Contrary to popular belief, Omni does not support video input. It can seemingly take in images unprompted, perhaps by capturing photos at a fixed interval, or via some on-device detection service that notifies it when interaction with the environment seems necessary.

Audio Synthesis: Beyond Speech

Omni seems capable of generating more than just human speech. In one example, it generates “sounds of coins clanging on metal”, which sounds quite accurate. This could spell bad news for startups like Suno and Udio that are working on AI music generation.

Better Conversations

It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation.

Omni understands nuances in speech, such as emotion, pauses, and tone, and is more expressive in return. The conversational experience aims to be far more human-like than anything that’s come before, though there is still an occasional uncanny-valley feel. You can also interrupt the model mid-response, mimicking how natural conversations flow.

Better Text in Images and Consistent Characters

Image generation models have always struggled to render legible text. Omni seems to do a lot better in this regard, with people speculating whether it shares a latent space with DALL·E, enabling better feedback loops during diffusion. Samples from OpenAI below:

From https://openai.com/index/hello-gpt-4o/

Better Tokenization in Non-English Languages

Omni employs a new tokenizer that compresses non-English text into far fewer tokens. This lets it respond faster and at lower cost in non-English languages such as Hindi, which now requires 2.9x fewer tokens than before. OpenAI chose 20 languages as representative of the new tokenizer’s compression across different language families.

From https://openai.com/index/hello-gpt-4o/
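You can see the difference yourself with OpenAI’s tiktoken library: GPT-4 Turbo uses the cl100k_base encoding, while GPT-4o ships with the new o200k_base encoding (the sample sentence below is our own, and tiktoken >= 0.7.0 is needed for o200k_base):

```python
# Compare the old and new tokenizers on a short Hindi sentence.
# Requires tiktoken >= 0.7.0 for the o200k_base encoding.
import tiktoken

text = "नमस्ते, आप कैसे हैं?"  # Hindi: "Hello, how are you?"

old = tiktoken.get_encoding("cl100k_base")  # GPT-4 Turbo
new = tiktoken.get_encoding("o200k_base")   # GPT-4o

old_n, new_n = len(old.encode(text)), len(new.encode(text))
print(f"cl100k_base: {old_n} tokens, o200k_base: {new_n} tokens "
      f"({old_n / new_n:.1f}x fewer)")
```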

Composite 3D rendering

OpenAI has provided an example that ostensibly demonstrates merging several generated images into a 3D composite. 3D imagery is extremely difficult to diffuse for a variety of reasons, but Omni seems to manage it well enough after generating the input images from different angles. It seems pretty incredible to me.

New macOS App

ChatGPT will be available as a desktop app on macOS. It can see your screen (if you want it to, of course) and help you with the tasks on it. At the event, OpenAI demonstrated pair programming with ChatGPT by letting it see the code on screen.

The app didn’t work for us when we tried, even with a Plus account.

It appears this feature will come to the mobile apps as well, as demoed in this video from OpenAI on YouTube, where ChatGPT helps a student with trigonometry by understanding what’s on the screen.

Use Cases

Omni seems to have use cases across many industries, but at first glance, and as demoed by OpenAI, these stand out:

Personal Assistant

ChatGPT as a personal assistant has been making the rounds, especially since Bloomberg reported that Apple is nearing a deal with OpenAI. Lower latency and better intelligence than Gemini, and especially the aging Siri, make it the best assistant available at the moment. Google is poised to announce something similar at its I/O event. This could be a leapfrog moment for assistants on mobile devices, which will be forced to adapt or die. It also leaves efforts like the Rabbit R1 in a sorry state (where, it seems, they already were).

Meeting AI

ChatGPT will be able to individually identify and help members in a meeting by understanding both images and audio. To our mind, it could help with (a rough sketch follows below):

  • Summarizing the meeting minutes and next steps
  • Retrieving information from documentation
  • Cooking up charts and code samples on the spot

From https://openai.com/index/hello-gpt-4o/
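At launch, only text and image input (with text output) are exposed in the GPT-4o API; audio in and out are not yet available to developers. A minimal sketch of the image-input side of the “charts and code samples” idea, with a placeholder slide URL:

```python
# Sketch: send a slide screenshot plus a question to GPT-4o via the
# Chat Completions vision input. The image URL is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize this slide and list next steps."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/meeting-slide.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```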

Customer Service

A field where generative AI has already been extremely disruptive is customer support: a large number of companies have integrated LLMs into their chat interfaces as the first point of contact, escalating to an agent when necessary (which, in some cases, is just a better LLM). Now, with the voice modality, it may well replace voice-based customer support for many companies. Since it can also understand what’s on the user’s screen, it is capable of providing on-device technical support, an area where OpenAI’s partner Microsoft employs a large number of people, which could prove extremely disruptive to the entire industry.

Accessibility

Through Be My Eyes, OpenAI is developing accessibility services for the visually impaired. In a demo, ChatGPT running GPT-4o helped a visually impaired person observe their surroundings and even hail the right cab. There is extremely promising potential here.

Conclusion

With GPT-4o, OpenAI has once again pushed the boundaries of what generative AI can do. By offering native multimodal capabilities, improved speed, and more human-like interactions, GPT-4o promises to serve fields from customer service to personal assistance and accessibility. However, the audio modalities carry significant risks, and OpenAI says it has built new safety systems to put guardrails on voice output. Once the new Voice Mode is out, we will be able to check for ourselves. The full potential of this new model is yet to be explored.
