[10:26 Thu,2.February 2023 by Thomas Richter]
How fast development in the field of AI is progressing can be seen, among other things, in "text-to-music" - AIs that generate music from a text description. Google had just presented MusicLM (we reported) when, a few days later, a research team from the University of Surrey and Imperial College followed with AudioLDM. The project is especially promising for filmmakers, because it not only synthesises pieces of music, including instruments, from a text prompt, but also noises (SFX, i.e. sound effects). AudioLDM can also produce entire soundscapes on request - ideal as sound backgrounds for films.
In addition, the AudioLDM team wants to make the program and its model available online as open source, which means it could not only be used freely on one's own computer, but also be improved by others and integrated into other software. For example, it could be used as a plug-in in video editing applications such as Adobe Premiere or Blackmagic's DaVinci Resolve to generate sound backdrops. Another argument for running AudioLDM at home is that it is said to be very efficient (i.e. it requires relatively little computing power), and training - for example, on your own sound samples - can be done using a single GPU (such as an NVIDIA RTX 3090).
In addition, AudioLDM offers practical functions already known from image AIs, such as inpainting (part of an audio recording is replaced, via text prompt, by another sound that matches the rest), style transfer (a melody is played by another instrument) and super resolution (an audio recording of music or speech with a low sampling rate is upsampled, increasing its audio quality).
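To illustrate what the super-resolution function addresses: a minimal sketch of plain signal upsampling in Python with NumPy/SciPy. Note that this is only band-limited interpolation - it cannot restore high frequencies that were never recorded, which is exactly the gap a generative model like AudioLDM tries to fill. The function name and sample values are our own illustration, not part of AudioLDM.

```python
import numpy as np
from scipy.signal import resample

def upsample_audio(samples: np.ndarray, sr_in: int, sr_out: int) -> np.ndarray:
    """Resample a mono signal from sr_in to sr_out Hz via band-limited
    (FFT-based) interpolation. Unlike AudioLDM's super resolution, this
    cannot invent missing high-frequency content - it only interpolates."""
    n_out = int(len(samples) * sr_out / sr_in)
    return resample(samples, n_out)

# 1 second of a 440 Hz tone, recorded at a low 8 kHz sampling rate
sr_in, sr_out = 8000, 48000
t = np.linspace(0, 1, sr_in, endpoint=False)
low_res = np.sin(2 * np.pi * 440 * t)

high_res = upsample_audio(low_res, sr_in, sr_out)
print(len(high_res))  # 48000 samples after upsampling to 48 kHz
```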
Here is an example of style transfer: trumpet to children's singing
In addition to describing the sounds to be generated, other parameters affecting the sound can be specified, such as the type of acoustic environment (reverberation), the material of the objects producing the sound, and the temporal order of events.
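To give a feel for what the "acoustic environment" parameter changes: a short sketch of reverberation applied conventionally, by convolving a dry signal with a synthetic, exponentially decaying impulse response. AudioLDM conditions on such properties via the text prompt instead; this example, including the function name and decay constant, is purely our own illustration.

```python
import numpy as np

def add_reverb(dry: np.ndarray, sr: int, decay_s: float = 0.3) -> np.ndarray:
    """Simulate a reverberant room by convolving the dry signal with a
    synthetic impulse response: noise with an exponential decay tail."""
    n = int(sr * decay_s)
    rng = np.random.default_rng(0)
    impulse = rng.standard_normal(n) * np.exp(-5 * np.arange(n) / n)
    impulse[0] = 1.0  # the direct (un-reflected) sound
    wet = np.convolve(dry, impulse)
    return wet / np.max(np.abs(wet))  # normalise to avoid clipping

sr = 16000
t = np.linspace(0, 0.5, sr // 2, endpoint=False)
dry = np.sin(2 * np.pi * 220 * t)  # half a second of a 220 Hz tone
wet = add_reverb(dry, sr)
print(len(wet))  # 12799: dry length + impulse length - 1
```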
The sound of a steam engine:
Cutting meat on a wooden table:
For more complex soundscapes, the researchers enlist the help of the text AI ChatGPT. For example, it responds to the prompt "Describe the sound of the universe" with a detailed description ("Radio emissions from stars, planets, galaxies and other celestial bodies, high fidelity, as well as the sounds of solar winds and cosmic rays"), which can then be used as a prompt for AudioLDM and generates the following output:
Model of AudioLDM
Actually, the source code was supposed to be published together with the research paper on Monday, but the team is still reluctant to put the model (i.e. the result of the training process) online because of the recently announced lawsuits against several image AIs over copyright infringement: the well-known BBC SFX library was used for training. Although this library may be used freely for non-commercial purposes, it is not clear whether that also covers the training of AIs, because the legal situation has not yet been clarified. Once it is, the code is to be published together with the model.
Examples of music generation:
More Audio AI Projects
The following demonstrates just how rapidly development in the field of audio AIs is progressing.
Audio AI Timeline
Within a few days, several text-to-audio AIs of very different quality have appeared, such as Noise2Music and Moûsai: Text-to-Audio with Long-Context Latent Diffusion. The Chinese project Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models seems to us particularly worth mentioning, because it enables not only audio-to-audio but also image-to-audio and video-to-audio generation, i.e. sound generated to match a video clip:
More info at audioldm.github.io
German version of this page: Neue Audio KI generiert neben Musik auch beliebige Soundeffekte