
Meta Unveils Spirit LM Open Source Model Integrating Text and Speech

Merima Hadžić

Meta has introduced its first open-source multimodal language model, Meta Spirit LM, which integrates both text and speech inputs and outputs. Released just in time for Halloween 2024, Spirit LM is designed to handle tasks such as automatic speech recognition (ASR), text-to-speech (TTS), and speech classification. This model competes directly with other multimodal models like OpenAI’s GPT-4o and Hume’s EVI 2, while challenging dedicated offerings such as ElevenLabs’ text-to-speech and speech-to-text solutions.

Developed by Meta’s Fundamental AI Research (FAIR) team, Spirit LM addresses the limitations of current AI voice models by focusing on more expressive, natural-sounding speech. Two versions of the model are available: Spirit LM Base, which represents speech with phonetic tokens, and Spirit LM Expressive, which adds pitch and style tokens to capture tone and enable more nuanced emotional expression. Both versions were trained on a mix of text and speech datasets, enabling cross-modal tasks while maintaining expressive, human-like outputs.
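To make the idea of a single model handling both modalities more concrete, the sketch below is a toy illustration of word-level interleaving: text spans and discrete speech-unit spans are tagged and concatenated into one token sequence that a single decoder can be trained on. Everything here (the function names, marker tokens, and unit formats) is hypothetical and illustrative, not Meta's actual Spirit LM API.

```python
# Toy illustration of interleaved text/speech token sequences.
# NOT the Spirit LM API; all names and token formats are made up.

from typing import List

TEXT_MARKER = "[TEXT]"
SPEECH_MARKER = "[SPEECH]"

def encode_text(words: List[str]) -> List[str]:
    # Hypothetical: text becomes subword tokens (here, just prefixed words).
    return [f"txt:{w}" for w in words]

def encode_speech_units(words: List[str], expressive: bool = False) -> List[str]:
    # Hypothetical: speech becomes discrete phonetic units; an "expressive"
    # variant additionally carries pitch/style units alongside them.
    units = [f"unit:{w}" for w in words]
    if expressive:
        units += ["pitch:rising", "style:excited"]
    return units

def interleave(text_span: List[str], speech_span: List[str]) -> List[str]:
    # One training sequence mixes modality-tagged spans, so the model can
    # switch between reading/writing text and speech mid-sequence.
    return [TEXT_MARKER, *encode_text(text_span),
            SPEECH_MARKER, *encode_speech_units(speech_span, expressive=True)]

if __name__ == "__main__":
    print(interleave(["the", "weather", "is"], ["sunny", "today"]))
```

In the released model, the speech units come from learned speech encoders rather than from words as above, but the underlying idea of a single interleaved token stream is what allows the same model to perform ASR, TTS, and other cross-modal tasks.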

Although Meta Spirit LM is fully open source, it is released under Meta's FAIR Noncommercial Research License, which restricts its use to research purposes and prohibits commercial applications.

The model is part of Meta’s broader commitment to open science, which aims to encourage innovation and collaboration within the AI research community. Mark Zuckerberg, Meta’s CEO, has emphasized the potential of AI to enhance human productivity, creativity, and overall quality of life. In addition to Spirit LM, Meta’s FAIR team has also released updates to other research tools, such as the Segment Anything Model 2.1 for image and video segmentation.

Spirit LM’s ability to integrate emotional states into speech generation offers significant potential for applications in virtual assistants, customer service bots, and other interactive AI systems. Meta hopes that by making Spirit LM open-source, it will contribute to new advancements in AI systems that combine text and speech more effectively.


Featured image courtesy of Beebom
