Nvidia has unveiled a cutting-edge generative audio AI model that can produce a wide array of sounds, music, and voices from users' text and audio prompts alone.
This model, known as Fugatto (short for Foundational Generative Audio Transformer Opus 1), can craft jingles and music segments from text prompts, as well as modify existing audio tracks by adding or removing instruments and vocals. It can also alter the emotional tone and accent of voices, and even create sounds that have never been heard before, according to Nvidia's announcement.
“Our goal was to design a model that comprehends and generates sound in a manner akin to human cognition,” said Rafael Valle, a manager in Nvidia’s applied audio research division. “Fugatto represents an initial stride toward a future where unsupervised multitask learning in audio creation and transformation can emerge from expansive datasets and model scalability.”
The potential applications for music producers are vast; they can quickly prototype and experiment with song ideas across different styles, enhance existing tracks by introducing new effects, and even adapt ad campaigns by localizing music and voiceovers. Additionally, it could allow game developers to dynamically adjust soundtracks as players progress through levels.
Fugatto can generate entirely new sounds, such as barking trumpets and meowing saxophones, using a method known as ComposableART, which lets it combine attributes it learned separately during training.
“I aimed to provide users the freedom to mix attributes artistically, controlling how much emphasis they placed on each,” remarked Nvidia AI researcher Rohan Badlani. “During my trials, the outcomes often surprised me and made me feel somewhat like an artist, despite being a computer scientist.”
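Nvidia hasn't published how ComposableART works under the hood, but the weighted attribute mixing Badlani describes resembles the compositional guidance used in other generative models, where each attribute conditions the model separately and the predictions are blended with user-chosen weights. The sketch below is a rough illustration of that idea only, not Fugatto's actual implementation; `ToyAudioModel`, `composed_guidance`, and every name in it are invented for the example.

```python
from typing import Optional

import numpy as np


class ToyAudioModel:
    """Stand-in for a diffusion-style audio model; returns a fake
    denoising prediction that varies with the conditioning text."""

    def predict(self, latent: np.ndarray, condition: Optional[str]) -> np.ndarray:
        # Seed from the condition so each attribute yields a distinct,
        # repeatable prediction (purely for demonstration).
        rng = np.random.default_rng(abs(hash(condition)) % (2**32))
        return latent + rng.normal(size=latent.shape)


def composed_guidance(model, latent, attributes, weights, scale=3.0):
    """Blend per-attribute conditional predictions around an unconditional
    baseline, weighting how strongly each attribute steers the output."""
    uncond = model.predict(latent, condition=None)
    out = uncond.copy()
    for attr, w in zip(attributes, weights):
        cond = model.predict(latent, condition=attr)
        out += scale * w * (cond - uncond)  # push toward this attribute
    return out


model = ToyAudioModel()
latent = np.zeros((1, 64))
# Emphasize "barking" twice as strongly as "trumpet".
blended = composed_guidance(model, latent, ["barking", "trumpet"], [0.66, 0.33])
```

Under this framing, the "artistic control" Badlani describes would just be the weight vector: sliding a weight up or down changes how much that attribute shapes the generated audio.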
The model comprises 2.5 billion parameters and was trained on 32 H100 GPUs. Fugatto joins a fast-growing field of generative audio: Stability AI introduced a similar system in April that can generate tracks up to three minutes long, while Google's V2A model can create an endless range of soundtracks from any video input.
YouTube, meanwhile, has released an AI music remixer that generates 30-second samples by interpreting an input song along with user prompts. OpenAI is also exploring the field, having launched a tool in April that can replicate a user's voice and vocal patterns from just 15 seconds of sample audio.