Mistral launches text-to-speech model Voxtral TTS


Mistral has launched Voxtral TTS, a new text-to-speech model that supports nine languages and, according to the company, stands out for its “natural speech generation.”

Mistral AI introduces Voxtral TTS, a new text-to-speech model that “focuses on natural, expressive, and multilingual speech generation for business applications.” According to Mistral, the model combines low latency with a relatively compact size of 4 billion parameters, making it suitable for scalable AI speech agents.

Focus on natural and emotional speech

Mistral writes in a blog post that Voxtral TTS goes beyond classic text-to-speech by not only pronouncing text correctly but also interpreting context and emotion. The model can process nuances such as tone, rhythm, and intent, making speech sound more natural. Additionally, it can adapt to specific voices. After listening to a few seconds of reference audio, the model can mimic a voice, including accents and speaking style.

The model supports nine languages, including English, French, German, and Dutch. According to Mistral, it can also carry an accent across languages: a French voice, for example, can speak English with a natural French accent.

Aimed at business use cases

Mistral is positioning Voxtral TTS for business applications such as automated customer service, financial services, and real-time translation. The model can be integrated into existing AI stacks. Voxtral TTS is available via API and can be tested in Mistral Studio. The price is set at $0.016 per 1,000 characters, which Mistral expects to make the model economically attractive for large-scale implementations as well.
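To put the quoted rate in perspective, the arithmetic can be sketched in a few lines of Python. This is only a rough estimate: it assumes billing scales linearly with character count, with no minimum charge, rounding, or volume tiers, none of which the article specifies. The function name `estimate_tts_cost` is hypothetical and not part of any Mistral SDK.

```python
from decimal import Decimal

# Rate quoted in the article: $0.016 per 1,000 characters.
PRICE_PER_1K_CHARS_USD = Decimal("0.016")


def estimate_tts_cost(num_chars: int) -> Decimal:
    """Estimate the cost in USD of synthesizing `num_chars` characters.

    Assumption: strictly linear pricing (no minimums, tiers, or rounding).
    """
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return PRICE_PER_1K_CHARS_USD * Decimal(num_chars) / Decimal(1000)


if __name__ == "__main__":
    # Illustrative volumes only; these are not figures from Mistral.
    for n in (1_000, 1_000_000, 50_000_000):
        print(f"{n:,} characters: ${estimate_tts_cost(n)}")
```

Under these assumptions, one million characters of text would cost about $16.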