
Microsoft VALL-E 2: AI imitates every voice perfectly - using only 3s voice sample

[10:07 Thu, 18 July 2024, by Thomas Richter]

A year and a half ago, Microsoft released VALL-E, a speech synthesis system that could read any given text in an imitated voice using only a 3-second sample of that voice. Its successor, VALL-E 2, now surpasses the original in several respects. The synthesized voice is even closer to the original speaker, and for the first time the speech quality is so high that it can no longer be distinguished from real human voices. VALL-E 2 also pronounces complex sentences better than before and has no trouble with word repetitions, which were either dropped or sounded strange in the previous version.



[Figure: The new model of VALL-E 2]



This is made possible by two important improvements to the system architecture: VALL-E 2 selects speech components more skillfully, which avoids repetitions, and it processes speech data more efficiently by grouping it. However, the similarity and naturalness of the imitated voice still depend on factors such as the length and quality of the voice samples and their background noise. More audio samples comparing VALL-E and VALL-E 2 can be found on Microsoft's website. The accompanying research paper is available online.
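The article does not spell out how these two improvements work. As a rough illustration only, the following Python sketch assumes they correspond to what the VALL-E 2 paper calls repetition-aware sampling and grouped code modeling. All function names, the parameters `top_p`, `window`, `max_ratio` and `group_size`, and their default values are hypothetical and not taken from Microsoft's implementation.

```python
import random


def nucleus_sample(probs, top_p, rng):
    """Sample a token index from the smallest set of tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def repetition_aware_sample(probs, history, rng, top_p=0.8, window=10, max_ratio=0.5):
    """Nucleus-sample a token; if it dominates the recent history, resample.

    Assumption: when the candidate token makes up more than max_ratio of the
    last `window` tokens, it is replaced by a draw from the full distribution,
    which gives the decoder a chance to leave a repetition loop.
    """
    token = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(token) / len(recent)
        if ratio > max_ratio:
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


def group_codes(codes, group_size, pad=0):
    """Split a codec code sequence into fixed-size groups, padding the last one.

    Assumption: each group is then handled as a single step, which shortens
    the sequence the model has to process by a factor of group_size.
    """
    padded = codes + [pad] * (-len(codes) % group_size)
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]
```

For example, `group_codes([1, 2, 3, 4, 5], 2)` yields three groups instead of five single steps, with the last group padded. The sampling function behaves like plain nucleus sampling until one token starts to dominate the recent output.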

The 3-second sample of the original voice:


VALL-E:


VALL-E 2:


VALL-E 2 (with a 10-second voice sample):


Commercial services such as Elevenlabs also offer voice cloning, but their standard algorithm requires several minutes of sample material, and the professional model needs at least 3 hours of training material to produce sufficiently good-sounding "copied" voices.

[Figure: Naturalness and similarity of the simulated voice in comparison]



Fear of Misuse


VALL-E 2 is purely a research project. Out of fear of misuse, the developers have no plans to integrate VALL-E 2 into a product or to make the algorithm publicly accessible. The potential applications for a system that can perfectly imitate speakers would be diverse: besides entertainment, it could be used for interactive voice dialogue systems, translations and chatbots, or to help people who have difficulty speaking, for example because of conditions such as aphasia or ALS.

However, a tool for quick and perfect voice cloning carries the risk of misuse, for example to deceive voice authentication systems or to maliciously imitate a specific person.

Should VALL-E 2 be released in the future, the researchers propose pairing it with a procedure that ensures the speaker consents to the use of their voice, and with a model for detecting synthetic speech. Elevenlabs, for example, uses a text captcha that the user must read aloud within 10 seconds.
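To make the captcha idea concrete, here is a minimal Python sketch of a time-limited read-aloud challenge. It is not Elevenlabs' actual implementation: the word list and the function names are hypothetical, and the sketch assumes a separate speech recognizer has already turned the recording into a transcript. It also does not check that the voice reading the phrase matches the voice to be cloned, which a real consent check would need.

```python
import secrets
import time

# Hypothetical word list used to build an unpredictable phrase.
WORDS = ["amber", "river", "candle", "orbit", "velvet", "harbor", "maple", "signal"]


def issue_challenge(n_words=4, now=None):
    """Create a random phrase and remember when it was issued."""
    phrase = " ".join(secrets.choice(WORDS) for _ in range(n_words))
    issued_at = time.monotonic() if now is None else now
    return {"phrase": phrase, "issued_at": issued_at}


def normalize(text):
    """Lowercase, drop punctuation and collapse whitespace before comparing."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def verify_response(challenge, transcript, now=None, time_limit=10.0):
    """Accept only if the phrase was read back correctly within the time limit."""
    current = time.monotonic() if now is None else now
    elapsed = current - challenge["issued_at"]
    in_time = 0 <= elapsed <= time_limit
    return in_time and normalize(transcript) == normalize(challenge["phrase"])
```

Because the phrase is random and expires after 10 seconds, a pre-recorded clip of the target speaker is unlikely to contain the right words at the right time.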

More information at www.microsoft.com


last update: 16 July 2025, 08:02 - slashCAM is a project by channelunit GmbH