The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
#2335
I ran into the issue in the title when trying to use Whisper large for recognition.
Here is my setup:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# device and torch_dtype were not shown in the original post; typical values:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"
whisper_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True
)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=whisper_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,
    batch_size=8,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe(audio_data, generate_kwargs={"language": "chinese"}, return_timestamps=True)
I do see some problems. About 20% of the output files don't have timestamps. Sometimes the model transcribes background music (no lyrics, just melody) as speech, and sometimes it produces weird generations. May I know whether these problems are due to a wrong setting (like the attention mask one), or whether the model just sometimes can't get it right?
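For reference, here is a minimal sketch of how the attention mask could be made explicit by calling the model directly instead of going through the pipeline. This assumes audio_array holds a 16 kHz mono waveform; the variable names are illustrative, and I have not verified that this changes the quality issues, only that it makes the mask explicit:

# Sketch: run the processor and model directly so attention_mask is passed
# explicitly rather than inferred. audio_array is an assumed 16 kHz waveform.
inputs = processor(
    audio_array,
    sampling_rate=16000,
    return_tensors="pt",
    return_attention_mask=True,  # ask the feature extractor for an explicit mask
)
input_features = inputs.input_features.to(device, dtype=torch_dtype)
attention_mask = inputs.attention_mask.to(device)

predicted_ids = whisper_model.generate(
    input_features,
    attention_mask=attention_mask,
    language="chinese",
    return_timestamps=True,
)
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)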