To ensure that AI-generated videos do not remain silent, several approaches to artificial (post-)dubbing already exist - as reported, the Google DeepMind team is working on a video-to-audio system to complement its video AI Veo, and AI-generated sound effects are available, for example, at ElevenLabs.
Now another model for video-guided sound generation has been introduced, and it promises some powerful capabilities. MultiFoley takes a multimodal approach and is designed to accept text, audio and video as input. The desired Foley sound for a clip can thus be generated “from scratch” using a text prompt, or an audio sample - for example from a sound effect library - can be given as a reference whose sound characteristics (e.g. rhythm and timbre) are to be adopted. If a video with partially existing sound is provided, MultiFoley extends the soundtrack accordingly.
Natural sounds can be generated (e.g. skateboard wheels rolling on a surface) as well as more bizarre audio sequences (e.g. a lion's roar that sounds like a cat's meow), each synchronized with the on-screen action. Negative prompting also makes it possible to exclude unwanted audio elements.
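In diffusion models, negative prompting is typically implemented via classifier-free guidance, where the embedding of the unwanted concept replaces the usual unconditional baseline, steering each denoising step away from it. The article does not describe MultiFoley's exact mechanism, so the following is only a generic sketch of that technique, with all names hypothetical:

```python
def cfg_step(model, x_t, t, cond_emb, neg_emb, guidance_scale=4.0):
    """One denoising step with classifier-free guidance and a negative prompt.

    The negative-prompt embedding serves as the baseline instead of the usual
    unconditional embedding, so the prediction is pushed away from the
    unwanted sound (e.g. "wind noise") and toward the wanted one.
    Hypothetical sketch - `model` stands in for any noise-prediction network.
    """
    eps_pos = model(x_t, t, cond_emb)  # noise prediction for the wanted sound
    eps_neg = model(x_t, t, neg_emb)   # noise prediction for the unwanted sound
    return eps_neg + guidance_scale * (eps_pos - eps_neg)
```

With a guidance scale of 1.0 this reduces to the plain conditional prediction; larger scales exaggerate the difference between the wanted and unwanted directions.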
MultiFoley is based on diffusion models and is currently trained on two different datasets: VGGSound with 168K samples for video-text-sound generation, and a Sound Ideas SFX library with 400K samples for text-sound generation. The approach combines language with video cues and decouples the semantic and temporal elements of videos. This enables creative Foley applications, such as making a birdsong video sound like a human voice, or turning a typewriter sound into piano notes - all while staying synchronized with the video.
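How the two datasets are combined during training is not detailed here; one common pattern is to sample each training example from either corpus, dropping the video conditioning for the audio-only SFX data. The sketch below is purely illustrative - the function, dataset shapes and mixing ratio are all assumptions, not taken from the paper:

```python
import random

def sample_training_example(av_dataset, sfx_dataset, p_av=0.3):
    """Draw one example from a mixed corpus (hypothetical sketch).

    av_dataset:  (video, text, audio) triples, VGGSound-style
    sfx_dataset: (text, audio) pairs from an SFX library, Sound-Ideas-style
    p_av:        assumed probability of drawing an audio-visual sample
    """
    if random.random() < p_av:
        video, text, audio = random.choice(av_dataset)
    else:
        text, audio = random.choice(sfx_dataset)
        video = None  # no visual conditioning for SFX-only samples
    return {"video": video, "text": text, "audio": audio}
```

Training with some conditioning signals absent is what lets a single model later accept any subset of text, audio and video as input.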
According to the developers, a key innovation is that the model can be trained on both internet video data with low-quality sound and professional SFX recordings, enabling high-quality, full-bandwidth (48 kHz) sound generation. MultiFoley is said to outperform existing methods in delivering well-synchronized, high-quality sounds. However, the aim does not appear to be generating music or dialog (as with Google's video-to-audio system) - the name says it all.
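A common way to train on mixed-quality audio while still producing full-bandwidth output is to tag each training clip with a quality label, condition the model on that label, and request the high-quality label at inference time. Whether MultiFoley does exactly this is not stated here; the helper below is a minimal sketch of the idea, with the threshold and label names assumed:

```python
def tag_example(audio, sample_rate, hq_threshold=44_100):
    """Attach an assumed quality label based on the clip's sample rate.

    Low-bandwidth internet audio and full-band studio SFX get different
    tags, so the model can learn the distinction and be asked for
    "full_band" output at inference (hypothetical names).
    """
    quality = "full_band" if sample_rate >= hq_threshold else "low_band"
    return {"audio": audio, "quality": quality}
```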
MultiFoley is a joint project by researchers at the University of Michigan and Adobe, so it would be no surprise if similar functionality appeared in the Firefly video generator sooner or later. For now, the model is not publicly accessible.