With Stable Diffusion, Stability AI already has a capable text-to-image generator in its lineup. Now Stable Audio is available online as well: a new diffusion model that, as the name suggests, creates audio and music from text prompts.
For this purpose, the Stable Audio model was trained on audio rather than images: more than 800,000 licensed files from the audio library AudioSparx, including their metadata. Thanks to this context-rich training, the model follows prompted specifications for content and form quite well and can also time its output to an exact length. To link text and audio, a technique called Contrastive Language-Audio Pretraining (CLAP) was used in training - see this blog post for more details, which also embeds good audio examples.
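To give a rough idea of what "contrastive" means here, the following is a minimal sketch of a symmetric contrastive loss over paired text and audio embeddings. It is not Stability AI's or CLAP's actual code: the function names, the toy two-dimensional embeddings and the temperature value of 0.07 are assumptions chosen purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cross_entropy_row(logits, target_index):
    """Softmax cross-entropy for one row, computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target_index]

def contrastive_loss(text_embeddings, audio_embeddings, temperature=0.07):
    """Symmetric contrastive loss: text i and audio i form the matching pair.

    The temperature is an assumed, illustrative value.
    """
    n = len(text_embeddings)
    # Similarity of every text embedding to every audio embedding.
    logits = [
        [cosine_similarity(t, a) / temperature for a in audio_embeddings]
        for t in text_embeddings
    ]
    # Text-to-audio direction: each text should pick out its own audio clip.
    text_to_audio = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Audio-to-text direction: the same idea applied to the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    audio_to_text = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return (text_to_audio + audio_to_text) / 2

# Toy example: two captions and two clips with hand-made embeddings.
text = [[1.0, 0.0], [0.0, 1.0]]
matching_audio = [[1.0, 0.0], [0.0, 1.0]]
swapped_audio = [[0.0, 1.0], [1.0, 0.0]]

print(contrastive_loss(text, matching_audio))
print(contrastive_loss(text, swapped_audio))
```

The loss is small when each text embedding lies closest to its own audio clip and large when the pairs are swapped. Training pushes the two encoders toward the first situation.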
Stable Audio, latent diffusion model
Music pieces of up to 90 seconds in length can be generated, as well as individual instrument tracks or sound effects. You can specify the genre, style, mood, instrumentation, tempo in BPM and more - basically everything that is usually defined in the metadata of audio libraries. In a user guide, Stability AI has collected some example prompts, ranging from short and crisp to several lines long.
The resulting pieces of music do not sound like hits, and some are "composed" rather erratically. That said, it depends on the type of music and the length: quiet, ambient-like tracks can hardly be distinguished from typical GEMA-free (royalty-free) background music. The shorter sound snippets, which can be generated as background effects, and perhaps minimalist instrument outputs seem more usable to us.
Stable Audio is available in a free version, with which 20 tracks of up to 45 seconds can be generated per month. The Pro subscription for 12 dollars per month allows for 500 generations of up to 90 seconds in length, which may also be used in commercial projects. The download is in 44.1 kHz stereo.
An open-source Stable Audio model is also expected to be released soon, though it will have been trained on a different data set, presumably for licensing reasons. More information at www.stableaudio.com
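To put these quotas into perspective, here is a small back-of-the-envelope calculation using only the figures quoted above. The dictionary layout and function names are made up for this sketch and do not correspond to any Stable Audio API.

```python
SAMPLE_RATE_HZ = 44_100  # 44.1 kHz, as stated for the download
CHANNELS = 2             # stereo

# Plan figures as quoted in the text; the structure is illustrative only.
PLANS = {
    "free": {"tracks_per_month": 20, "max_seconds": 45, "usd_per_month": 0},
    "pro": {"tracks_per_month": 500, "max_seconds": 90, "usd_per_month": 12},
}

def max_minutes_per_month(plan):
    """Upper bound on generated audio per month, in minutes."""
    return plan["tracks_per_month"] * plan["max_seconds"] / 60

def sample_values_per_track(seconds):
    """Number of individual sample values in a stereo track of this length."""
    return seconds * SAMPLE_RATE_HZ * CHANNELS

for name, plan in PLANS.items():
    minutes = max_minutes_per_month(plan)
    samples = sample_values_per_track(plan["max_seconds"])
    print(f"{name}: up to {minutes} min per month, {samples} sample values per full-length track")
```

With every generation at full length, the free version tops out at 15 minutes of audio per month and the Pro subscription at 750 minutes, or 12.5 hours.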