Microsoft’s AI division has launched its first in-house AI models: MAI-Voice-1 and MAI-1-preview. The release marks a significant milestone in the company’s effort to compete in the AI race and to build solutions for both consumers and businesses. Let’s take a technical look at these models.
MAI-Voice-1 is Microsoft’s first natural speech generation model, and it can convert text into speech in seconds. It is designed to deliver expressive, distortion-free audio that sounds human-like across a variety of contexts.
MAI-Voice-1 is also highly efficient. It can generate a full minute of audio in under a second on a single GPU, which makes it one of the fastest speech models currently available. This efficiency enables Microsoft to power features like Copilot Daily and Copilot Podcasts.
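To put that speed claim in perspective, it can be expressed as a real-time factor: seconds of audio produced per second of compute. The snippet below is only illustrative arithmetic, not anything from Microsoft. The function name `real_time_factor` is hypothetical, and the 1.0-second generation time is an assumed upper bound standing in for "under a second", not a measured figure.

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Return how many seconds of audio are produced per second of compute."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# Assumption: "under a second" is treated as exactly 1.0 s (an upper bound),
# so the result is a lower bound on the real speedup.
lower_bound = real_time_factor(audio_seconds=60.0, generation_seconds=1.0)
print(f"At least {lower_bound:.0f}x faster than real time")
```

Because the actual generation time is below one second, the true factor would be higher than this lower bound. If throughput scaled linearly, which is an assumption, a 10-minute episode would need less than 10 seconds of single-GPU compute.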
MAI-1-preview is a text-based model trained end-to-end on around 15,000 Nvidia H100 GPUs. It marks Microsoft’s first venture into developing a foundation model in-house. It is built to follow detailed instructions and provide helpful responses to everyday queries.