Whisper: New free AI turns speech into text and automatically translates it into English

[15:28 Mon,26.September 2022 by Thomas Richter]

OpenAI, the creator of the text AI GPT-3 and the image generation AI DALL-E 2, among others, has presented the speech recognition system "Whisper", which can not only transcribe spoken words into text but also translate speech from many other languages into English text. Fortunately, OpenAI has taken a cue from Stability.ai's approach with its text-to-image AI Stable Diffusion and released the program, including the trained models, openly and free of charge.

The open-source code of Whisper is available on GitHub together with models in five sizes that differ in accuracy and speed, all of which run on home PCs equipped with a graphics card. Depending on the model, roughly 1 to 10 GB of VRAM is required. The four smaller sizes are additionally offered as English-only variants, while the largest model exists only as a multilingual version. The multilingual models have been trained with many other languages and can therefore also translate spoken words from those languages and output them as English text.
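As a rough illustration of how the model sizes relate to graphics memory, the following sketch picks the largest model that fits a given amount of VRAM. The parameter counts and VRAM figures are approximate values as listed in the project's README around the time of release and are not taken from this article; the function name `largest_model_for` is purely hypothetical.

```python
from typing import Optional

# Approximate figures (assumption: as listed in the Whisper README at release).
# Treat them as rough guidance, not as guarantees for a specific GPU.
WHISPER_MODELS = [
    # (name, approx. parameters in millions, approx. VRAM in GB)
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("large", 1550, 10),
]


def largest_model_for(vram_gb: float) -> Optional[str]:
    """Return the largest model whose approximate VRAM need fits, else None."""
    fitting = [name for name, _params, need in WHISPER_MODELS if need <= vram_gb]
    # The list is ordered from smallest to largest, so the last hit is the biggest.
    return fitting[-1] if fitting else None
```

With a 6 GB card, for example, this sketch would suggest the "medium" model; larger models generally transcribe more accurately but run more slowly.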

Whisper models

Whisper was trained using 680,000 hours of audio material (including transcriptions) from the Internet, about two-thirds of which was in English and the rest in a number of other languages. The Whisper architecture is an encoder-decoder transformer: the input audio is split into 30-second chunks, each chunk is converted into a log-Mel spectrogram, and the result is passed to an encoder. A decoder is trained to predict the corresponding text, intermixed with special tokens that instruct the single model to perform tasks such as language identification, phrase-level timestamping, multilingual speech transcription, and translation into English. The speech recognition works surprisingly well, even with unclear speech or distracting background noise.
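The chunking step described above can be sketched in a few lines of plain Python. This is a simplification under stated assumptions: the 16 kHz sample rate and the hop length of 160 samples are values from the reference implementation, not from this article, and the real transcription loop moves its window according to predicted timestamps rather than cutting fixed, non-overlapping chunks. The function names are hypothetical.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000   # Hz (assumption: Whisper resamples input audio to 16 kHz)
CHUNK_SECONDS = 30     # chunk length named in the article
HOP_LENGTH = 160       # samples between spectrogram frames (assumption: 10 ms)
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per chunk


def split_into_chunks(samples: Sequence[float]) -> List[List[float]]:
    """Cut mono audio into 30-second chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = list(samples[start:start + CHUNK_SAMPLES])
        # Pad with silence so every chunk has exactly CHUNK_SAMPLES samples.
        chunk.extend([0.0] * (CHUNK_SAMPLES - len(chunk)))
        chunks.append(chunk)
    return chunks


def frames_per_chunk() -> int:
    """Number of spectrogram frames (time steps) the encoder sees per chunk."""
    return CHUNK_SAMPLES // HOP_LENGTH
```

A 45-second recording would thus yield two chunks, the second half filled with silence, and each chunk would correspond to a fixed number of spectrogram frames regardless of how much speech it contains.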

First applications and tools use Whisper

Operating Whisper from the command line is quite simple. As with Stable Diffusion, however, the openly accessible source code also means that masses of tools are now being built on top of Whisper, tools that use its capabilities for special tasks or simply make it easier to handle through a graphical user interface (GUI).
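To give an idea of what such a tool does under the hood, here is a minimal sketch of a helper that assembles a Whisper command line. It assumes the `whisper` command installed by the open-source package and its `--model`, `--task` and `--language` options as shown in the project's README; the helper name `build_whisper_command` and the default model are purely illustrative.

```python
from typing import List, Optional


def build_whisper_command(
    audio_path: str,
    model: str = "medium",
    task: str = "transcribe",
    language: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for the `whisper` command-line tool.

    task="transcribe" keeps the spoken language,
    task="translate" outputs English text.
    """
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    cmd = ["whisper", audio_path, "--model", model, "--task", task]
    if language is not None:
        # Optional hint; without it the language is detected automatically.
        cmd += ["--language", language]
    return cmd
```

A GUI front end could pass such a list to `subprocess.run`. For instance, `build_whisper_command("interview.mp3", task="translate", language="German")` corresponds to the command `whisper interview.mp3 --model medium --task translate --language German`.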


Whisper Architecture

To use Whisper, you don't even have to install a program on your own PC; Whisper can also be used via web services. For example, the AI community Hugging Face hosts a simple tool, YouTube Whisperer, that can be used to automatically transcribe the spoken words of a YouTube video into text. Another, still very simple tool converts live audio input from a microphone into text. There is also a more playful Google Colab project that integrates Whisper with Stable Diffusion, allowing it to automatically generate images from English-language MP3 files.

YouTube Whisperer

The future: AI tools for everyone?

For users, Whisper is another interesting and practical AI capability that can be used in the future (for free!) for all sorts of tasks. Audio transcription is thus no longer privileged knowledge that is only usable in special paid apps (or at OS level, as in Android or via Siri). We are excited about upcoming apps that will use Whisper for interesting new functions in video, such as automatically indexing the spoken words in home or even professional film archives so that dialogue passages can be found by text search, or automatically creating text transcripts of phone calls or other audio recordings. Of particular interest to filmmakers or video podcasters, of course, is the ability to automatically create subtitles, both in the spoken language and as an English translation, and offer them depending on the origin of the target audience.
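How little is needed to get from a transcription to subtitles can be shown with a short sketch that writes timestamped segments in the common SRT subtitle format. It assumes each segment is a dictionary with `start` and `end` times in seconds and a `text` field, which mirrors the shape of the segments in Whisper's Python transcription result as we understand it; the function names are hypothetical.

```python
from typing import Dict, Iterable


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: Iterable[Dict]) -> str:
    """Turn segments with 'start', 'end' and 'text' keys into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # SRT entries are separated by a blank line.
    return "\n".join(blocks)
```

Running the same recording once as a transcription and once as a translation would give two such segment lists, and therefore two subtitle files: one in the spoken language and one in English.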


More information at openai.com
