WhisperX: Free audio transcription with speaker recognition

[11:28 Wed,1.February 2023 by Thomas Richter]

In September 2022, OpenAI, the developer of the text AI ChatGPT and the image generation AI DALL-E 2, among others, presented the speech recognition system Whisper, which can transcribe spoken words into text. Since OpenAI fortunately published the associated programme and model for free, a large number of open source projects based on it soon emerged. One of these is WhisperX, which was started by the computer scientist Max Bain and has just been published. It is of particular interest to filmmakers because it fixes some specific weaknesses of Whisper that previously prevented its use as an automatic subtitle generator.

WhisperX Model

For one thing, WhisperX recognises different speakers (unlike the original Whisper) and labels them in the transcribed text. For another, it addresses Whisper's timestamps, which can be wrong by several seconds. To prevent this, WhisperX pre-filters the audio with voice activity detection, among other measures, which significantly improves the quality of the alignment and prevents catastrophic timestamp errors by Whisper (such as negative timestamp durations). In WhisperX, the timestamps that indicate when a speaker starts and stops talking are now accurate down to the level of individual sounds (phonemes).
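The idea behind pre-filtering with voice activity detection can be illustrated with a deliberately simple sketch. It is not the detector WhisperX uses: it is a toy energy threshold, and the function name `speech_regions` and its parameters `frame_ms` and `threshold` are assumptions made up for this example. It only shows the principle of finding the spans that contain sound, so that silence never reaches the transcription step.

```python
import math


def speech_regions(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return (start_s, end_s) spans whose frame RMS energy reaches the threshold.

    Toy energy-based detector for illustration only; real voice activity
    detection uses trained models and is far more robust to noise.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    regions = []
    start = None  # sample index where the current active span began

    for offset in range(0, len(samples), frame_len):
        frame = samples[offset:offset + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold and start is None:
            start = offset  # an active span begins
        elif rms < threshold and start is not None:
            regions.append((start / sample_rate, offset / sample_rate))
            start = None  # the active span ended at this frame

    # Close a span that is still open at the end of the audio.
    if start is not None:
        regions.append((start / sample_rate, len(samples) / sample_rate))
    return regions
```

Given one second of silence, one second of a tone and another second of silence, the function reports a single region covering the middle second. Only such regions would then be passed on for transcription and alignment.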

These improvements considerably simplify the use of Whisper for creating subtitles, for example, because thanks to WhisperX much less manual editing is required. Not only is the timing now exactly right, i.e. when an actor begins to speak, the respective subtitle appears synchronously (word for word if desired), but the identification of who is speaking, which is important for subtitling for the hearing impaired, is also done automatically.
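How word-level timestamps and speaker labels turn into subtitles can be sketched in a few lines of Python. The input structure used here (a list of dictionaries with the keys `word`, `start`, `end` and `speaker`) is an assumption for illustration and not WhisperX's documented output format; the helper names `format_timestamp` and `words_to_srt` are likewise hypothetical. The output follows the common SRT subtitle layout.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_words=7):
    """Group timed words into SRT cues, starting a new cue on speaker change.

    `words` is an assumed structure: dicts with "word", "start", "end"
    (both in seconds) and "speaker".
    """
    cues = []
    current = []
    for word in words:
        if word["end"] < word["start"]:
            continue  # skip entries with a negative duration
        speaker_changed = current and word["speaker"] != current[0]["speaker"]
        if speaker_changed or len(current) >= max_words:
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        text = " ".join(w["word"] for w in cue)
        start = format_timestamp(cue[0]["start"])
        end = format_timestamp(cue[-1]["end"])
        blocks.append(f"{index}\n{start} --> {end}\n[{cue[0]['speaker']}] {text}\n")
    return "\n".join(blocks)
```

Each cue starts at the first word's start time and ends at the last word's end time, so the subtitle appears exactly when the speaker begins. Setting `max_words=1` would yield word-for-word subtitles.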

Currently, standard models are provided for English, French, German, Spanish, Italian, Japanese, Dutch and Polish, among others. WhisperX combines several free tools to produce robust word-level segmentation with speaker labels: in addition to OpenAI's Whisper, these are Meta AI's wav2vec 2.0 (responsible for phoneme-level sound detection) and a further tool for voice activity detection.

WhisperX, like Whisper itself, is free of charge and freely available on GitHub, including the source code. WhisperX is written in Python and can be run from the command line, provided you have the necessary knowledge. However, we expect that WhisperX will soon be integrated into the first (online) subtitling tools or plugins in a more user-friendly way and thus offer users simple automatic subtitling.

More information at github.com

German version of this page: WhisperX: Kostenlose lautgenaue Audiotranskription mit Sprechererkennung