The new deep learning algorithm "Wav2Lip" from an Indian research team can match a speaker's lip movements to the words of any audio recording. It neatly demonstrates the continuous progress of machine learning, as the new method delivers significantly better results than earlier projects. Not only does it work in real time, but - and this is the real advance - it is also more universal, handling any face, any language and any voice.
The usefulness of such an algorithm for video work is obvious. As shown in the demo video, it can adapt the lip movements of a speaking person to a dubbed version created in another language, eliminating the mismatch between mouth movements and words that many viewers find distracting. This is practical for dubbed film versions as well as for lip-syncing lectures, press conferences or animated characters into other languages.
And last but not least, this technology could in principle also make it easier to replace the original on-set sound with dialogue re-recorded in post-production in scripted productions. Even minor speech errors (which would otherwise render a take unusable) could be corrected easily by briefly "tracking" the lips automatically.
Using deep learning algorithms, it would also be conceivable to offer different language versions of any clip automatically, for example on YouTube. YouTube already provides automatic transcription, and the next steps are already possible with existing algorithms: translating the transcribed text into another language, synthesizing speech in the original speaker's voice, and then lip-syncing the video to the new audio.
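The chain of steps described above can be sketched as a simple pipeline. The following is purely illustrative: every function body is a placeholder, and all names are invented for this sketch; a real system would call speech-recognition, machine-translation, voice-cloning TTS and lip-sync models (such as Wav2Lip) at the respective stages.

```python
# Hypothetical automatic-dubbing pipeline; all stages are stubs
# standing in for real models (ASR, MT, voice-cloning TTS, lip-sync).

def transcribe(audio: bytes) -> str:
    """Stage 1: speech-to-text (placeholder)."""
    return "hello world"

def translate(text: str, target_lang: str) -> str:
    """Stage 2: machine translation into the target language (placeholder)."""
    return f"[{target_lang}] {text}"

def synthesize(text: str, voice_sample: bytes) -> bytes:
    """Stage 3: TTS cloning the original speaker's voice (placeholder)."""
    return text.encode()

def lip_sync(video: bytes, new_audio: bytes) -> bytes:
    """Stage 4: re-render the lip movements to match the new audio (placeholder)."""
    return video + new_audio

def dub(video: bytes, audio: bytes, target_lang: str) -> bytes:
    """Run all four stages in sequence."""
    text = transcribe(audio)
    translated = translate(text, target_lang)
    new_audio = synthesize(translated, voice_sample=audio)
    return lip_sync(video, new_audio)
```

The point of the sketch is that each stage only consumes the previous stage's output, so the components can be swapped independently as better models appear.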
Of course, the technology can also be misused to generate clips in which people appear to say things they never said - the new audio, too, can be generated by a neural network that mimics the real voice.
Anyone can try out for themselves how good the Wav2Lip algorithm is on the project's demo website: upload a short video clip (maximum 20 seconds) of a person speaking plus a speech audio clip, and receive the newly lip-synced clip as output. Those who want to experiment further will find the program code on GitHub. (Thanks to our forum member Ruessel for the news)
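For those running the code themselves, the repository ships an inference script that is invoked roughly as below. File paths and the checkpoint name here are placeholders, and flags may change between versions, so check the repository's README for the current options.

```shell
# Sketch of a local Wav2Lip inference run; exact flags and checkpoint
# filenames may differ -- consult the repository README.
python inference.py \
    --checkpoint_path checkpoints/wav2lip_gan.pth \
    --face input_video.mp4 \
    --audio input_speech.wav
# The lip-synced result is typically written to a results/ folder.
```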