Make whisper transcribe numbers in the actual spoken words #1041

Thresher12 · 2023-03-07T06:40:41Z

Thresher12
Mar 7, 2023

Hi, is there a way to get Whisper to transcribe numbers the way they are actually spoken rather than converting them to numeric format? Yes, I know I can post-process the output, but that is a suboptimal solution: something like 2015 can be said in multiple ways ('two thousand and fifteen', 'two thousand fifteen', or 'twenty fifteen'), and Whisper automatically converts them all to the same number, while I need the actual spoken words. There's obviously some conversion process going on here, so is there a way to turn it off?

Answered by jongwook


jongwook · 2023-03-07T08:12:27Z

jongwook
Mar 7, 2023
Maintainer

It's not an explicit conversion, but the model predicting the most likely textual output end-to-end. You can try the following, which blocks all numeric tokens and encourages the model to transcribe them literally.

from whisper.tokenizer import get_tokenizer

tokenizer = get_tokenizer(multilingual=False)  # use multilingual=True if using multilingual model
number_tokens = [
    i 
    for i in range(tokenizer.eot)
    if all(c in "0123456789" for c in tokenizer.decode([i]).removeprefix(" "))
]

...

model.transcribe("audio.mp3", suppress_tokens=[-1] + number_tokens, ...)
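
To illustrate what "blocking" a token means here, the following is a minimal sketch in plain Python, not Whisper's actual decoding code. It assumes suppression works by setting the scores (logits) of the listed token ids to negative infinity before the next token is picked. The vocabulary, the scores, and the helper names `suppress` and `greedy_pick` are made up for the example.

```python
import math

def suppress(logits, suppressed_ids):
    """Return a copy of `logits` with the suppressed token ids set to -inf."""
    masked = list(logits)
    for token_id in suppressed_ids:
        masked[token_id] = -math.inf
    return masked

def greedy_pick(logits):
    """Pick the highest-scoring token id (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary and scores (hypothetical): the digit token wins at first.
vocab = ["2015", " twenty", " fifteen", " two"]
logits = [3.2, 2.9, 0.4, 1.1]

before = vocab[greedy_pick(logits)]
after = vocab[greedy_pick(suppress(logits, [0]))]
print(repr(before), "->", repr(after))
```

With the digit token masked out, the next-best candidate is a spelled-out word, which is why the model falls back to whichever spoken form it finds most likely.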

10 replies

jongwook Oct 23, 2023
Maintainer

@orianemartin You can use tokenizer.decode([i]).strip() instead!

orianemartin Oct 24, 2023

@Warp-MFT I am on 3.12.0
@jongwook thank you very much, it worked!

ulatekh Apr 11, 2024

This can also be done for faster-whisper and insanely-fast-whisper; they only differ in how the tokenizer is found.

For faster-whisper, with a multilingual model:
tokenizer = faster_whisper.tokenizer.Tokenizer(tokenizer=model.hf_tokenizer, task="transcribe", language="en", multilingual=True)
With a monolingual model:
tokenizer = faster_whisper.tokenizer.Tokenizer(tokenizer=model.hf_tokenizer, multilingual=False)

For insanely-fast-whisper:
tokenizer = pipe.tokenizer

kdcyberdude Nov 7, 2024

The blank token (id 220, token " ") is also suppressed by the above logic!!
It also suppresses token 50256 (""), an empty token with no characters. Not sure what the use of this is in the vocab?

grzegorz700 Dec 16, 2024

@jongwook
After evaluating it, I found that the original answer selects 2 unexpected tokens, which decode to a space and to an empty (or non-printable) string. Snippet showing the case:

print(f"#{tokenizer.decode([number_tokens[10]])}#")
print(f"#{tokenizer.decode([number_tokens[-1]])}#")

Output:

# #
##

To exclude them, we need to modify the code to:

number_tokens = [
    i
    for i in range(tokenizer.eot)
    if (all(c in "0123456789" for c in tokenizer.decode([i]).removeprefix(" "))
        and len(tokenizer.decode([i]).strip()) > 0)
]
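
The reason the blank and empty tokens slip through is that `all(...)` returns True for an empty sequence. A small sketch of the same filter, written against a stand-in decoder so it runs without Whisper installed: `numeric_token_ids`, `toy_vocab`, and `toy_decode` are hypothetical names, and `decode` is assumed to be any callable that maps a list of token ids to a string (such as `tokenizer.decode`). Note it uses `.strip()` rather than `.removeprefix(" ")`, which is a slight variation.

```python
def numeric_token_ids(decode, upper_bound):
    """Collect ids below `upper_bound` whose decoded text is digits only."""
    ids = []
    for i in range(upper_bound):
        text = decode([i]).strip()
        # all() is True for an empty iterable, so reject empty text first.
        if text and all(c in "0123456789" for c in text):
            ids.append(i)
    return ids

# Toy vocabulary standing in for a real tokenizer (hypothetical values).
toy_vocab = ["the", " 20", "15", " ", "", " 3rd", "7"]

def toy_decode(ids):
    """Join the toy tokens for the given ids, like a tokenizer's decode."""
    return "".join(toy_vocab[i] for i in ids)

selected = numeric_token_ids(toy_decode, len(toy_vocab))
print(selected)
```

With a real tokenizer the call would look like `numeric_token_ids(tokenizer.decode, tokenizer.eot)`.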

lixikun · 2023-03-26T13:21:09Z

lixikun
Mar 26, 2023

@jongwook Hi, I tried this method, but I got this error:

I also checked the Tokenizer, and the eot API returns an int rather than an int array. Is that correct?

    @cached_property
    def eot(self) -> int:
        return self.encoding.eot_token

2 replies

lixikun Mar 26, 2023

OK, it was successful after I modified the code from
if all(c in "0123456789" for c in tokenizer.decode(i).removeprefix(" ")) to
if all(c in "0123456789" for c in tokenizer.decode([i]).removeprefix(" "))

jongwook Apr 11, 2023
Maintainer

Thanks! This happened because of a recent change in the tokenizer class

Warp-MFT · 2023-04-24T14:37:21Z

Warp-MFT
Apr 24, 2023

In case anyone is looking for how to do this for Hugging Face Transformers:
generation_config = GenerationConfig.from_pretrained(whisper_model_name)
generation_config.suppress_tokens += number_tokens
...
predicted_ids = whisper_model.generate(input_features, generation_config=generation_config, forced_decoder_ids=forced_decoder_ids)
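
One caveat with `+=` is that it can add ids that are already in the list, and it fails if `suppress_tokens` is unset. A minimal sketch of a more defensive merge, assuming `suppress_tokens` may be either a list of ints or None; `merge_suppress_tokens` is a hypothetical helper, not part of Transformers.

```python
def merge_suppress_tokens(existing, extra):
    """Combine two suppression lists into one sorted list without duplicates."""
    merged = set(existing or [])  # treat None as an empty list
    merged.update(extra)
    return sorted(merged)

merged = merge_suppress_tokens([1, 2, 7], [7, 10, 3])
from_none = merge_suppress_tokens(None, [5, 4])
print(merged, from_none)
```

It would then be used as `generation_config.suppress_tokens = merge_suppress_tokens(generation_config.suppress_tokens, number_tokens)`.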

3 replies

asr-lord Apr 17, 2024

Could you share the entire code? I can't reproduce it. Thank you.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)
tokenizer=processor.tokenizer

number_tokens = [
    i 
    for i in range(tokenizer.eot)
    if all(c in "0123456789" for c in tokenizer.decode([i]).removeprefix(" "))
]

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe("test.mp3")

Error message:

      1 number_tokens = [
      2     i 
----> 3         for i in range(tokenizer.eot)
      4             if all(c in "0123456789" for c in tokenizer.decode([i]).removeprefix(" "))
      5             ]

AttributeError: 'WhisperTokenizer' object has no attribute 'eot'

Warp-MFT Apr 17, 2024

It's your lucky day :)
open_ai_stt.txt
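
The attachment is not reproduced here. Independently of it, one possible way around the AttributeError above is to look up the end-of-text id under whichever name the tokenizer provides. This is a sketch under the assumption that the Hugging Face `WhisperTokenizer` exposes that id as `eos_token_id`, while the openai-whisper tokenizer exposes it as `eot`; `end_of_text_id` is a hypothetical helper, and the stand-in objects only mimic those two attributes.

```python
from types import SimpleNamespace

def end_of_text_id(tokenizer):
    """Return the end-of-text token id for either tokenizer flavour."""
    for attr in ("eot", "eos_token_id"):
        value = getattr(tokenizer, attr, None)
        if value is not None:
            return value
    raise AttributeError("tokenizer exposes neither 'eot' nor 'eos_token_id'")

# Stand-ins for the two tokenizer types (ids are illustrative only).
openai_style = SimpleNamespace(eot=50257)
hf_style = SimpleNamespace(eos_token_id=50257)

print(end_of_text_id(openai_style), end_of_text_id(hf_style))
```

The list comprehension would then iterate over `range(end_of_text_id(tokenizer))` instead of `range(tokenizer.eot)`.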

JabblyApp Aug 11, 2024

Hey, is there any doc for doing this in JavaScript? I'm using these models via APIs and the numbers aren't transcribed. Thanks!


Replies: 3 comments 15 replies
