First pictures, then sounds: New Google AI generates arbitrary music according to text description

[10:57 Mon,30.January 2023 by Thomas Richter]

Researchers from Google have presented a new AI that generates music (instead of images) from text prompts, much like the currently very popular text-to-image AIs such as DALL-E 2, Midjourney or Stable Diffusion.

Robot Musician - imagined by Stable Diffusion

The new text-to-music AI, called "MusicLM", generates music at 24 kHz from text descriptions, and the result remains consistent over several minutes. MusicLM was trained on a dataset of 280,000 hours of music to learn to create pieces according to complex descriptions such as "A fusion of reggaeton and electronic dance music, with a spacey, otherworldly sound. The music should evoke a sense of wonder and awe while being danceable".

The spectrum of music generated by MusicLM is astonishing - it ranges from folk and classical music to jazz, pop, rap and reggae to techno, 8-bit computer music or death metal. As with the image and text AIs, it becomes apparent that style, whether of an image, a text or a piece of music, is just one parameter for an AI - as is instrumentation. Wild crossover mixes can therefore also be generated with the music AI, such as metal music with accordions, rapping string quartets and all kinds of other combinations.

Another interesting feature is the option of giving the AI a whistled or hummed melody, which then serves as a template for music in a style defined by a text description.

Here is a hummed "Bella Ciao" as input:

With MusicLM it becomes an electronic synth version:

or jazz with saxophone:

or a piano solo:

Text prompts for MusicLM can specify other instrumentations as well as abstract descriptions of specific locations, moods, musicians' skills, musical styles or combinations of these. For each description, any number of variations can be generated - as with the image or text AIs, the programme probably has a number of parameters that influence how much the results vary. The length of the generated audio ranges from short jingles to pieces of music lasting several minutes. The resulting tracks are often surprisingly coherent and the instrumentation sounds realistic, but sometimes the generated melodies and tones are a bit weird. As always with the rapid development in the field of AI, however, the next generation, and even more so the one after that, will probably be much better.

Electro Swing dancers - imagined by Midjourney

A rather unsuccessful attempt by MusicLM at swing:

Ideal for film music, for example, is the Story Mode, in which a dynamic soundtrack is generated from a series of successive text descriptions, with the sounds defined in this way merging seamlessly into one another. In the following piece, the prompts "time to meditate", "time to wake up", "time to run" and "time to give 100%" follow one another at 15-second intervals:
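To make the timing of such a prompt sequence concrete, here is a minimal Python sketch that only computes the schedule. MusicLM has no public interface, so nothing here calls the model; the names `Segment` and `build_story` are hypothetical, and the only figures taken from the article are the 24 kHz sample rate and the 15-second interval.

```python
from dataclasses import dataclass

SAMPLE_RATE_HZ = 24_000   # output sample rate mentioned in the article
SEGMENT_SECONDS = 15      # interval between prompts in the example


@dataclass(frozen=True)
class Segment:
    """One text prompt and the time window it covers (hypothetical helper)."""
    prompt: str
    start_s: int
    end_s: int

    @property
    def num_samples(self) -> int:
        # Number of audio samples in this window at 24 kHz.
        return (self.end_s - self.start_s) * SAMPLE_RATE_HZ


def build_story(prompts, segment_seconds=SEGMENT_SECONDS):
    """Assign consecutive, equally long time windows to the prompts."""
    return [
        Segment(prompt, i * segment_seconds, (i + 1) * segment_seconds)
        for i, prompt in enumerate(prompts)
    ]


story = build_story([
    "time to meditate",
    "time to wake up",
    "time to run",
    "time to give 100%",
])

for seg in story:
    print(f"{seg.start_s:>3}s - {seg.end_s:>3}s  {seg.prompt}  ({seg.num_samples} samples)")
```

With four prompts this yields a 60-second timeline; how the actual model blends neighbouring segments is not described in the article and is not modelled here.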

More information at google-research.github.io

German version of this page: Erst Bilder, dann Sounds: Neue Google-KI generiert beliebige Musik nach Textbeschreibung

Not yet public due to copyright concerns

AI makes everyone an artist - or not?
