Why do the transcriptions/translations seem to depend on the input bitrate, and why does it affect tokenization? Is there an optimal type of source? Is there a way to fine-tune the window frame to match the predicted tokens more tightly or loosely? I know there is currently no word-level tokenization implemented. Additionally, I understand that the way words are sung differs from how they are spoken naturally. If I understand correctly, is that what the patience parameter can be used for?
I believe by "tokenization" you mean the phrase boundaries produced by the model. The term more often refers to converting text to/from a sequence of integers, as done in tokenizer.py.
There might be a correlation between audio quality and typical phrase lengths in the subtitles.
Changing the decoding rules for the timestamp tokens like in #139 (comment) can affect how long the segments are in general, but not precisely.
No, the patience parameter implements the algorithm from this paper.
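To illustrate the distinction above: "tokenization" in the usual sense maps text to and from a sequence of integer IDs. The sketch below uses a toy vocabulary and a greedy longest-match encoder; it is only an illustration of the concept, not Whisper's actual BPE tokenizer (see tokenizer.py in the repository for the real implementation).

```python
# Toy illustration of tokenization: text <-> sequence of integer IDs.
# This is NOT Whisper's tokenizer; the real one uses byte-pair encoding.

vocab = {"hello": 0, "world": 1, " ": 2}
inverse = {i: t for t, i in vocab.items()}

def encode(text):
    """Greedy longest-match encoding over the toy vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for token in sorted(vocab, key=len, reverse=True):
            if text.startswith(token, i):
                ids.append(vocab[token])
                i += len(token)
                break
        else:
            raise ValueError(f"no token matches at position {i}")
    return ids

def decode(ids):
    """Map integer IDs back to the original text."""
    return "".join(inverse[i] for i in ids)

ids = encode("hello world")
print(ids)          # [0, 2, 1]
print(decode(ids))  # hello world
```

Phrase/segment boundaries, by contrast, come from the timestamp tokens the model emits during decoding, which is why they are influenced by the decoding rules rather than by the tokenizer.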