OpenAI and Microsoft both announce new AI speech models
Both OpenAI and Microsoft have today introduced new AI models optimized for speech generation. The models focus on speed, naturalness, and efficiency, and are suited to a wide range of deployments, from customer support to generating podcasts from text.
OpenAI Introduces gpt-realtime
“Gpt-realtime is our most powerful voice model to date,” OpenAI states in a blog post. It generates realistic, fluent speech and can even change tone or language mid-sentence. Developers can also easily give the model instructions for specific tasks, such as quoting help desk articles in a chatbot.
A new feature also allows users to upload images, for example a screenshot of a software issue. This makes gpt-realtime suitable for advanced applications in technical support. Developers can access the model through the now generally available Realtime API.
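For illustration, instructing the model over the Realtime API could look roughly like the sketch below, which builds a session-configuration event. This is a minimal sketch only: the exact event schema, field names, and the voice name are assumptions to be checked against OpenAI's Realtime API documentation.

```python
import json

# Hedged sketch: the exact Realtime API event schema may differ from this;
# field names and the voice name ("marin") are assumptions, not confirmed API.
def build_session_update(instructions: str, voice: str = "marin") -> str:
    """Build a session.update event that gives gpt-realtime task instructions."""
    event = {
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "voice": voice,
        },
    }
    return json.dumps(event)

payload = build_session_update(
    "Answer support questions by quoting the relevant help desk article."
)
print(payload)
```

In a real application this JSON payload would be sent over the API's WebSocket connection after authenticating with an API key.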
Microsoft Launches MAI-Voice-1 and MAI-1-preview
Microsoft, in turn, introduces MAI-Voice-1, which is part of the Microsoft Copilot assistant. The model is designed with efficiency as a priority: one minute of speech is generated in less than a second, using just one GPU. Microsoft plans specialized variants of MAI-Voice-1 for different use cases in the future.
Additionally, Microsoft unveils MAI-1-preview, a powerful multimodal AI model trained on 15,000 Nvidia H100 chips. Thanks to a mixture-of-experts architecture, only a subset of the model is activated per prompt. MAI-1-preview is currently accessible only to test users, but will soon come to Copilot.
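The mixture-of-experts idea can be sketched in a few lines: a small gating network scores the experts for each input, and only the top-scoring experts actually compute. This is a generic illustration of the technique, not Microsoft's implementation; all names and shapes here are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, expert_weights, gate_weights, top_k=2):
    """Route input x to the top_k highest-scoring experts and blend their outputs."""
    scores = x @ gate_weights                  # gating logits, one per expert
    top = np.argsort(scores)[-top_k:]          # indices of the chosen experts
    probs = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over chosen
    # Only the selected experts run, which is where the efficiency gain comes from.
    return sum(p * (x @ expert_weights[i]) for p, i in zip(probs, top))

num_experts, dim = 8, 16
experts = rng.normal(size=(num_experts, dim, dim))  # one weight matrix per expert
gate = rng.normal(size=(dim, num_experts))
x = rng.normal(size=dim)
y = moe_forward(x, experts, gate)
print(y.shape)  # (16,)
```

With `top_k=2` of 8 experts, only a quarter of the expert parameters are touched per input, while the gating softmax keeps the blended output a proper weighted combination.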
Microsoft is already working on a successor, trained on a supercluster of Nvidia’s latest GB200 chips. More details will follow later, the tech giant writes in its announcement.
