WhisperX: Word-level timestamps, diarization (new), batch inference within file (new) #684
Replies: 14 comments · 17 replies
-
@m-bain, this looks quite interesting. I'd love to be able to test it on Japanese when available.
-
What about this one?
-
If you're low on GPU RAM, running transcribe() from Python seems to work where running the CLI app for whisper (or via whisperx) won't. Also, if whisperx's align() function runs you out of GPU RAM, you can use a smaller WAV2VEC2 model. I've demonstrated both below. I'm no expert in Python, and no doubt I've done some things improperly or needlessly, but it works for me, all on my very budget GeForce GTX 950. Maybe it will help someone understand what's happening. Thanks Max!
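(A minimal sketch of both tricks, assuming the early whisperx Python API with load_model, load_align_model, and align; the exact model names and defaults are assumptions, not a verbatim copy of the script from this comment.)

```python
# Sketch only: assumes the early whisperx API; adjust names for your version.
import whisperx

device = "cuda"
audio_file = "audio.wav"  # hypothetical input path

# Trick 1: call transcribe() from Python rather than the CLI; on a small
# GPU this can fit where the command-line invocation runs out of memory.
model = whisperx.load_model("small.en", device)
result = model.transcribe(audio_file)

# Trick 2: if align() exhausts GPU RAM, pass a smaller wav2vec 2.0 model
# (here a torchaudio pipeline name) than whatever default your version uses.
align_model, metadata = whisperx.load_align_model(
    language_code="en",
    device=device,
    model_name="WAV2VEC2_ASR_BASE_960H",  # a relatively small wav2vec 2.0 pipeline
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio_file, device)

for word in aligned["word_segments"][:5]:
    print(word)
```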
-
Cool project, thanks! I had a similar idea; I'm glad someone implemented it.
-
Update: whisperx now provides diarization and batch inference within a file (see the updated title).
-
Could you possibly whip up a way to limit the number of characters used in a segment? For example, Netflix recommends 42 characters, or one line of subtitles, before it changes to a new timestamp, if that makes any sense :P Maybe a setting that limits the characters to a specific amount, or a way to increase the frequency of timestamp tokens? Thanks!
-
Hi @m-bain, just saw the paper drop (https://arxiv.org/abs/2303.00747). Good work and big, big congratulations!!!
-
Hi @m-bain, how do I run whisperx with Hebrew word-level timestamps? Thanks
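One plausible approach, sketched rather than tested: depending on your whisperx version, Hebrew may not have a bundled default alignment model, in which case you pass a Hebrew wav2vec 2.0 checkpoint from the Hugging Face hub explicitly. The checkpoint name below is an example to verify on the hub, not a recommendation.

```python
# Sketch only: assumes the early whisperx API; verify the Hebrew
# checkpoint on the Hugging Face hub before relying on it.
import whisperx

device = "cuda"
audio = "hebrew_audio.wav"  # hypothetical input path

model = whisperx.load_model("large", device)
result = model.transcribe(audio, language="he")

# Pass a Hebrew alignment model explicitly if your version has no default.
align_model, metadata = whisperx.load_align_model(
    language_code="he",
    device=device,
    model_name="imvladikon/wav2vec2-xls-r-300m-hebrew",  # example checkpoint
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)
print(aligned["word_segments"][:5])
```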
-
Looks great. I just wanted to test it on colab.research.google.com. I used these install commands: ... When I run it with !whisperx "file" --model ..., I get this error: ... What could be the reason? Thanks.
-
I'm getting this error when using WhisperX in Google Colab since today.
-
This seems to be a great project. Thank you for sharing. Now that I have the *.word.srt file, how do I convert the contents to an SRT file with 42 characters per line and two lines at a time?
-
Now I'm getting "Repository unavailable due to DMCA takedown." WhisperX has been taken down for now.
-
For anyone who may find this useful, I used ChatGPT to create the Python file I needed. Its structure:

```python
def read_srt_file(file_path): ...
def write_srt_file(file_path, srt_data): ...
def split_lines(subtitle_text): ...
def process_srt_data(srt_data): ...

if __name__ == "__main__":
    ...
```
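A sketch of how those functions might be filled in, under the assumption (from the question above) that the goal is re-wrapping each cue's text to at most 42 characters per line and two lines per cue; the bodies below are illustrative, not the commenter's original code:

```python
import textwrap

def read_srt_file(file_path):
    # Parse an SRT file into (index, timing, text) triples.
    with open(file_path, encoding="utf-8") as f:
        blocks = f.read().strip().split("\n\n")
    srt_data = []
    for block in blocks:
        lines = block.splitlines()
        if len(lines) >= 3:
            srt_data.append((lines[0], lines[1], " ".join(lines[2:])))
    return srt_data

def split_lines(subtitle_text, width=42, max_lines=2):
    # Wrap to `width` characters and keep at most `max_lines` lines per cue
    # (overflow beyond two lines is simply dropped in this sketch).
    return "\n".join(textwrap.wrap(subtitle_text, width=width)[:max_lines])

def process_srt_data(srt_data):
    return [(idx, timing, split_lines(text)) for idx, timing, text in srt_data]

def write_srt_file(file_path, srt_data):
    with open(file_path, "w", encoding="utf-8") as f:
        for idx, timing, text in srt_data:
            f.write(f"{idx}\n{timing}\n{text}\n\n")

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    write_srt_file("wrapped.srt", process_srt_data(read_srt_file("input.srt")))
```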
-
I just wanted to say thanks for all your hard work on this tool! I first heard of it as a graduate student a few years ago, because we have a team of graduate students working on a transcription / diarization platform for the International Storytelling Center in Jonesborough, Tennessee. From what I understand, the storytelling center has ~50 years of archives they want transcribed, on top of all the new recordings being generated each year! That project has stagnated for a while (it's a 2-year master's program, so valuable people leave), but they basically would've had to write something like WhisperX if you hadn't.

Diarization is still giving them fits where a single speaker will use multiple voices (or worse: multiple speakers using multiple voices!), and they're trying to figure out how to do context-specific transcription (e.g., if the model were going to transcribe a word as "castle" but knew the context of the story was feudal Japan, that awareness might let it choose another, more likely word). I have no clue how they plan to do that; my approach would be to post-process the initial transcript with some other model.

Now that I am faculty, I record lecture videos frequently, and I realized during week one of the Fall semester that we were going to need a reliable transcription tool. I started playing with WhisperX and it was a lifesaver for my team! Seriously, I did not think I would be able to find a transcription pipeline of this quality without paying for it. Now that I have CUDA and cuDNN, it works phenomenally for accurate timestamped transcriptions. There are student workers whose entire jobs are to transcribe videos for my university; I have a feeling we can make their jobs a lot simpler (just small corrections) with this.

Edit: Forgot to mention... I noticed that numpy 2.x deprecated

Btw, here's a nice little WhisperX GUI I had Claude.ai whip up:
-
Hi,
I've released whisperX, which refines the timestamps from whisper transcriptions using forced alignment with a phoneme-based ASR model (e.g. wav2vec 2.0). This provides word-level timestamps, as well as improved segment timestamps.
I hacked this up fairly quickly, so feedback is welcome, and it's worth playing around with the hyperparameters (particularly how much to extend the original whisper segment; sometimes these can be super inaccurate).
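For readers who want the shape of the pipeline in code, here is an illustrative sketch assuming a Python API along the lines of load_model / load_align_model / align, as in early releases; treat it as a sketch, not canonical usage.

```python
# Illustrative sketch, assuming whisperx exposes load_model,
# load_align_model, and align as in early releases.
import whisperx

device = "cuda"
audio_file = "sample.wav"  # hypothetical input

# Stage 1: plain whisper transcription (segment-level timestamps only,
# which can drift out of sync).
model = whisperx.load_model("medium.en", device)
result = model.transcribe(audio_file)

# Stage 2: forced alignment against a phoneme-based wav2vec 2.0 model,
# refining segment boundaries and producing word-level timestamps.
align_model, metadata = whisperx.load_align_model(language_code="en", device=device)
result_aligned = whisperx.align(
    result["segments"], align_model, metadata, audio_file, device
)

print(result_aligned["segments"][0])        # refined segment timestamps
print(result_aligned["word_segments"][:5])  # word-level timestamps
```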
Example:

Using whisper out of the box (medium.en), many transcriptions are out of sync:

[video: sample_whisper_og.mov]

Now, using WhisperX (medium.en) with forced alignment to wav2vec 2.0:

[video: sample_whisperx.mov]

And it supports other languages:

[video: sample_de_01_vis.mov]