AI uses Deep Learning to simulate voices with deceptive realism

[09:37 Sun, 26 May 2019 by Thomas Richter]

Dessa, a company specializing in AI, has introduced a new speech synthesis system that - at least in the samples provided - can hardly be distinguished from a real voice. The company demonstrates this with the voice of Joe Rogan, a well-known stand-up comedian, commentator and podcast producer in the USA: a YouTube video presents samples of the synthesized voice, and an accompanying quiz lets you listen to various short sentences and decide for yourself whether they come from the real Rogan or from the algorithm.

The new algorithm is based on - of course - Deep Learning. The model even learned to produce breath and mouth noises in the right places so that the simulated voice sounds as natural as possible. The system that turns text into speech is called RealTalk. The results sound much better than those of Lyrebird, a voice simulation that was presented two years ago and was also built with Deep Learning.

Joe Rogan's voice was probably a particularly good choice for the demo because his 1,300 podcasts, among other things, provided an enormous amount of training material - a prerequisite for machine learning algorithms. How good a voice sounds when less training material is available remains to be seen. If the quality of the synthetic voice is reliably as good as in the examples, many very useful but also shady applications will soon be conceivable.

This is because there is no right to one's own voice analogous to the right to one's own image; there is only a right to sound recordings of one's own voice, as part of the general right of personality. A recording made with a simulated voice is not covered by it.

In film, it would be groundbreaking to use the real actors' own voices when dubbing dialogue into another language. How good the voice simulation sounds in another language would of course be important to know. Ideally, lip-sync would also be handled automatically via Deep Learning.

Other possible applications lie in existing synthetic speech output functions. These could be made much more vivid if the voice of a well-known personality or a friend were simulated - for example, when reading books aloud. Likewise, a personal digital assistant such as Siri or Alexa could speak appointment reminders in the user's own voice so that they actually get heard. A fitness app, for example, could give instructions in Arnold Schwarzenegger's voice.

A wonderful application is for people who have lost their voice due to an illness (such as people with ALS) - provided, of course, that old recordings of their voice exist as training material. They could then use their own voice to talk to others via text input. Such a voice simulation is of course also perfect fodder for the more banal purpose of creative internet memes.

Dessa itself also gives some examples of how such voice simulations could be abused: voice recordings could be falsified at will, for example to discredit someone. Combined with DeepFakes for swapping faces in videos, a simulation of the matching voice could produce the credible video forgeries that have been warned about for years. Profitable (or at least disturbing) automated advertising calls in the voice of one's own mother or a friend would also be possible. That is why Dessa is not publishing any of the models or data sets.

It would be interesting - for both the positive and the negative application examples - to know how much training material the algorithm needs to simulate a voice with deceptive realism. And to be really convincing to someone who knows the speaker well, peculiarities of expression and word choice would also have to be copied.

At the moment, some know-how, computing power and data are still needed, but in a few years (or even sooner) the technology will evolve to the point where only a few seconds of audio are needed to create a lifelike replica of any existing voice. And especially with Deep Learning based technologies, which do not rely on highly specialized algorithms that can be protected by copyright, it doesn't take long until a new technology can be imitated in similar or even better quality.

More information at medium.com
