WhisperX: Word-level timestamps, diarization (new), batch inference within file (new) #684
Replies: 14 comments · 17 replies
-
@m-bain, this looks quite interesting. I'd love to be able to test it on Japanese when available.
-
What about this one?
-
If you're low on GPU RAM, running transcribe() from Python seems to work where running the CLI app for whisper (or via whisperx) won't. Also, if whisperx's align() function runs you out of GPU RAM, you can use a smaller WAV2VEC2 model. I've demonstrated both below. I'm no expert in Python, and no doubt I've done some things improperly or needlessly, but it works for me, all on my very budget GeForce GTX 950. Maybe it will help someone understand what's happening. Thanks Max!
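(A minimal sketch of both tricks, assuming the early whisperx Python API with load_model, load_align_model, and align; the exact model names and defaults are assumptions, not a verbatim copy of the script from this comment.)

```python
# Sketch only: assumes the early whisperx API; adjust names for your version.
import whisperx

device = "cuda"
audio_file = "audio.wav"  # hypothetical input path

# Trick 1: call transcribe() from Python rather than the CLI; on a small
# GPU this can fit where the command-line invocation runs out of memory.
model = whisperx.load_model("small.en", device)
result = model.transcribe(audio_file)

# Trick 2: if align() exhausts GPU RAM, pass a smaller wav2vec 2.0 model
# (here a torchaudio pipeline name) than whatever default your version uses.
align_model, metadata = whisperx.load_align_model(
    language_code="en",
    device=device,
    model_name="WAV2VEC2_ASR_BASE_960H",  # a relatively small wav2vec 2.0 pipeline
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio_file, device)

for word in aligned["word_segments"][:5]:
    print(word)
```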
-
Cool project, thanks! I had a similar idea; I'm glad someone implemented it.
-
Update: whisperx now provides diarization and batch inference within a file (see the updated title).
-
Could you possibly whip up a way to limit the number of characters used in a segment? For example, Netflix recommends 42 characters, or one line of subtitles, before it changes to a new timestamp, if that makes any sense :P Maybe a setting that limits the characters to a specific amount, or a way to increase the frequency of timestamp tokens? Thanks!
-
Hi @m-bain, just saw the paper drop (https://arxiv.org/abs/2303.00747). Good work and big, big congratulations!!!
-
Hi @m-bain, how do I run whisperx with Hebrew word-level timestamps? Thanks
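One plausible approach, sketched rather than tested: depending on your whisperx version, Hebrew may not have a bundled default alignment model, in which case you pass a Hebrew wav2vec 2.0 checkpoint from the Hugging Face hub explicitly. The checkpoint name below is an example to verify on the hub, not a recommendation.

```python
# Sketch only: assumes the early whisperx API; verify the Hebrew
# checkpoint on the Hugging Face hub before relying on it.
import whisperx

device = "cuda"
audio = "hebrew_audio.wav"  # hypothetical input path

model = whisperx.load_model("large", device)
result = model.transcribe(audio, language="he")

# Pass a Hebrew alignment model explicitly if your version has no default.
align_model, metadata = whisperx.load_align_model(
    language_code="he",
    device=device,
    model_name="imvladikon/wav2vec2-xls-r-300m-hebrew",  # example checkpoint
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)
print(aligned["word_segments"][:5])
```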
-
Looks great. I just wanted to test it on colab.research.google.com. I used these install commands: ... When I run it with !whisperx "file" --model ..., I get this error: ... What could be the reason? Thanks.
-
I'm getting this error when using WhisperX in Google Colab since today.
-
This seems to be a great project. Thank you for sharing. Now that I have the *.word.srt file, how do I convert the contents to an SRT file with 42 characters per line and two lines at a time?
-
Now I'm getting "Repository unavailable due to DMCA takedown." WhisperX has been taken down for now.
-
For anyone who may find this useful, I used ChatGPT to create the Python file I needed. Its structure:

```python
def read_srt_file(file_path): ...
def write_srt_file(file_path, srt_data): ...
def split_lines(subtitle_text): ...
def process_srt_data(srt_data): ...

if __name__ == "__main__":
    ...
```
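A sketch of how those functions might be filled in, under the assumption (from the question above) that the goal is re-wrapping each cue's text to at most 42 characters per line and two lines per cue; the bodies below are illustrative, not the commenter's original code:

```python
import textwrap

def read_srt_file(file_path):
    # Parse an SRT file into (index, timing, text) triples.
    with open(file_path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    srt_data = []
    for block in blocks:
        lines = block.splitlines()
        if len(lines) >= 3:
            srt_data.append((lines[0], lines[1], " ".join(lines[2:])))
    return srt_data

def split_lines(subtitle_text, width=42, max_lines=2):
    # Wrap to `width` characters and keep at most `max_lines` lines per cue
    # (overflow beyond two lines is simply dropped in this sketch).
    return "\n".join(textwrap.wrap(subtitle_text, width=width)[:max_lines])

def process_srt_data(srt_data):
    return [(idx, timing, split_lines(text)) for idx, timing, text in srt_data]

def write_srt_file(file_path, srt_data):
    with open(file_path, "w", encoding="utf-8") as f:
        for idx, timing, text in srt_data:
            f.write(f"{idx}\n{timing}\n{text}\n\n")

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    write_srt_file("wrapped.srt", process_srt_data(read_srt_file("input.srt")))
```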
-
I just wanted to say thanks for all your hard work on this tool! I first heard of it as a graduate student a few years ago, because we have a team of graduate students working on a transcription / diarization platform for the International Storytelling Center in Jonesborough, Tennessee. From what I understand, the storytelling center has ~50 years of archives they want transcribed, on top of all the new recordings being generated each year! That project has stagnated for a while (it's a 2-year master's program, so valuable people leave), but they basically would've had to write something like WhisperX if you hadn't.

Diarization is still giving them fits where a single speaker will use multiple voices (or worse: multiple speakers using multiple voices!), and they're trying to figure out how to do context-specific transcription (e.g., if the model were going to transcribe a word as "castle" but knew the context of the story was feudal Japan, that awareness might let it choose another, more likely word). I have no clue how they plan to do that; my approach would be to post-process the initial transcript with some other model.

Now that I am faculty, I record lecture videos frequently, and I realized during week one of the Fall semester that we were going to need a reliable transcription tool. I started playing with WhisperX and it was a lifesaver for my team! Seriously, I did not think I would be able to find a transcription pipeline of this quality without paying for it. Now that I have CUDA and cuDNN, it works phenomenally for accurate timestamped transcriptions. There are student workers whose entire jobs are to transcribe videos for my university; I have a feeling we can make their jobs a lot simpler (just small corrections) with this.

Edit: Forgot to mention... I noticed that numpy 2.x deprecated

Btw, here's a nice little WhisperX GUI I had Claude.ai whip up:
-
Hi,
I've released whisperX, which refines the timestamps from whisper transcriptions using forced alignment with a phoneme-based ASR model (e.g. wav2vec 2.0). This provides word-level timestamps, as well as improved segment timestamps.
I hacked this up fairly quickly, so feedback is welcome, and it's worth playing around with the hyperparameters (particularly how much to extend the original whisper segment; sometimes these can be super inaccurate).
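For readers who want the shape of the pipeline in code, here is an illustrative sketch assuming a Python API along the lines of load_model / load_align_model / align, as in early releases; treat it as a sketch, not canonical usage.

```python
# Illustrative sketch, assuming whisperx exposes load_model,
# load_align_model, and align as in early releases.
import whisperx

device = "cuda"
audio_file = "sample.wav"  # hypothetical input

# Stage 1: plain whisper transcription (segment-level timestamps only,
# which can drift out of sync).
model = whisperx.load_model("medium.en", device)
result = model.transcribe(audio_file)

# Stage 2: forced alignment against a phoneme-based wav2vec 2.0 model,
# refining segment boundaries and producing word-level timestamps.
align_model, metadata = whisperx.load_align_model(language_code="en", device=device)
result_aligned = whisperx.align(
    result["segments"], align_model, metadata, audio_file, device
)

print(result_aligned["segments"][0])        # refined segment timestamps
print(result_aligned["word_segments"][:5])  # word-level timestamps
```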
Example:

Using whisper out of the box (medium.en), many transcriptions are out of sync:

[video: sample_whisper_og.mov]

Now, using WhisperX (medium.en) with forced alignment to wav2vec 2.0:

[video: sample_whisperx.mov]

And it supports other languages:

[video: sample_de_01_vis.mov]