Microsoft VALL-E 2: AI imitates every voice perfectly - using only 3s voice sample

[10:07 Thu,18.July 2024 by Thomas Richter]

About a year and a half ago, Microsoft released VALL-E, a speech synthesis system that could imitate a voice speaking any given text from only a 3-second sample. Its successor, VALL-E 2, now surpasses the original in several respects. The synthesized voice is even more similar to the original speaker, and for the first time the speech quality is so high that it can no longer be distinguished from real human voices. VALL-E 2 also pronounces complex sentences better and has no problems with word repetitions, which were either dropped or sounded strange in the previous version.

The new model of VALL-E 2

This is made possible by two important improvements in the system architecture: VALL-E 2 selects speech components more skillfully, which avoids repetitions, and it processes speech data more efficiently by grouping it. However, the similarity and naturalness of the imitated voice depend on factors such as the length and quality of the voice samples and their background noise. More audio samples comparing VALL-E and VALL-E 2 can be found on Microsoft's website. The research study can be found here.
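The article describes the two architectural changes only in broad terms. As a rough illustration, the following Python sketch shows one plausible reading of them: codes are processed in fixed-size groups, and a sampler falls back to the full distribution when a token starts to dominate the recent history. All function names, parameters, and threshold values here are assumptions made for illustration and are not taken from Microsoft's implementation.

```python
import random
from typing import Dict, List, Sequence


def group_codes(codes: Sequence[int], group_size: int) -> List[List[int]]:
    """Split a code sequence into consecutive groups (the last may be shorter)."""
    return [list(codes[i:i + group_size]) for i in range(0, len(codes), group_size)]


def nucleus_sample(probs: Dict[int, float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest set of most likely tokens whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    tokens, weights = zip(*kept)
    return rng.choices(tokens, weights=weights, k=1)[0]


def repetition_aware_sample(
    probs: Dict[int, float],
    history: List[int],
    rng: random.Random,
    top_p: float = 0.8,      # illustrative value
    window: int = 10,        # illustrative value
    max_ratio: float = 0.5,  # illustrative value
) -> int:
    """Nucleus-sample a token; if it dominates the recent history,
    resample from the full distribution to break the repetition."""
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent and recent.count(token) / len(recent) > max_ratio:
        tokens, weights = zip(*probs.items())
        token = rng.choices(tokens, weights=weights, k=1)[0]
    return token
```

In this sketch, a token that already fills most of the recent window triggers a second draw from the unrestricted distribution, which gives less likely tokens a chance to end a loop.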

The 3-second sample of the original voice:

VALL-E:

VALL-E 2:

VALL-E 2 (with a 10-second voice sample):

Although commercial services like ElevenLabs also offer voice cloning, their algorithm requires several minutes of audio, and the professional model needs at least 3 hours of training material to produce sufficiently good-sounding "copied" voices.

Naturalness and similarity of the simulated voice in comparison

Fear of Misuse

VALL-E 2 is purely a research project. Out of fear of misuse, the developers have no plans to integrate VALL-E 2 into a product or make the algorithm publicly accessible. The potential applications for a system that can perfectly imitate speakers would be diverse: besides entertainment, it could be used for interactive voice dialogue systems, translations, and chatbots, or to help people who have difficulty speaking, such as those with conditions like aphasia or ALS.

However, a tool for quick and perfect voice cloning poses the risk of being misused, such as for deceiving voice authentication systems or maliciously imitating a specific voice.

If VALL-E 2 is released in the future, the researchers propose pairing it with a procedure that ensures the speaker consents to the use of their voice, together with a model for detecting synthetic speech. ElevenLabs, for example, uses a text captcha that the user must read aloud within 10 seconds.
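To make the idea of a read-aloud captcha concrete, here is a minimal Python sketch of a time-limited consent challenge. It is not ElevenLabs' or Microsoft's actual procedure: the word list, the function names, and the assumption that a separate speech-to-text step has already produced a transcript are all hypothetical. A real system would presumably also check that the voice reading the phrase matches the uploaded sample.

```python
import secrets
import time
from dataclasses import dataclass
from typing import Optional

# Hypothetical word list; a real service would use a much larger one.
WORDS = ["amber", "river", "candle", "orbit", "meadow", "signal", "harbor", "violet"]
TIME_LIMIT_SECONDS = 10.0  # the 10-second window mentioned in the article


@dataclass(frozen=True)
class Challenge:
    phrase: str
    issued_at: float


def issue_challenge(num_words: int = 4, now: Optional[float] = None) -> Challenge:
    """Create a random phrase the speaker has to read aloud."""
    phrase = " ".join(secrets.choice(WORDS) for _ in range(num_words))
    return Challenge(phrase, time.monotonic() if now is None else now)


def normalize(text: str) -> str:
    """Lowercase the text and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def verify(challenge: Challenge, transcript: str, now: Optional[float] = None) -> bool:
    """Accept only if the transcript matches the phrase and arrived in time.

    `transcript` is assumed to come from a speech-to-text step not shown here.
    """
    answered_at = time.monotonic() if now is None else now
    in_time = answered_at - challenge.issued_at <= TIME_LIMIT_SECONDS
    return in_time and normalize(transcript) == normalize(challenge.phrase)
```

Because the phrase is random and expires quickly, a previously recorded clip of the target speaker is unlikely to pass the check.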

More info at www.microsoft.com
