Update WhisperTokenizer Doc: Timestamps and Previous Tokens Behaviour (#33390)

* added doc explaining behaviour regarding tokens timestamps and previous tokens

* copied changes to faster tokenizer

---------

Co-authored-by: Bruno Hays <bruno.hays@illuin.tech>
This commit is contained in:
Bruno Hays 2024-09-10 16:49:28 +02:00 committed by GitHub
parent 6ed2b10942
commit dfee4f2362
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
2 changed files with 8 additions and 4 deletions

View File

@ -673,13 +673,15 @@ class WhisperTokenizer(PreTrainedTokenizer):
token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`): token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the `__call__` method. List of tokenized input ids. Can be obtained using the `__call__` method.
skip_special_tokens (`bool`, *optional*, defaults to `False`): skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding. Whether or not to remove special tokens in the decoding. Will remove the previous tokens (pre-prompt)
if present.
clean_up_tokenization_spaces (`bool`, *optional*): clean_up_tokenization_spaces (`bool`, *optional*):
Whether or not to clean up the tokenization spaces. If `None`, will default to Whether or not to clean up the tokenization spaces. If `None`, will default to
`self.clean_up_tokenization_spaces` (available in the `tokenizer_config`). `self.clean_up_tokenization_spaces` (available in the `tokenizer_config`).
output_offsets (`bool`, *optional*, defaults to `False`): output_offsets (`bool`, *optional*, defaults to `False`):
Whether or not to output the offsets of the tokens. This should only be set if the model predicted Whether or not to output the offsets of the tokens. This should only be set if the model predicted
timestamps. timestamps. If there are previous tokens (pre-prompt) to decode, they will only appear in the decoded
text if they contain timestamp tokens.
time_precision (`float`, *optional*, defaults to 0.02): time_precision (`float`, *optional*, defaults to 0.02):
The time ratio to convert from token to time. The time ratio to convert from token to time.
decode_with_timestamps (`bool`, *optional*, defaults to `False`): decode_with_timestamps (`bool`, *optional*, defaults to `False`):

View File

@ -319,13 +319,15 @@ class WhisperTokenizerFast(PreTrainedTokenizerFast):
token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`): token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the `__call__` method. List of tokenized input ids. Can be obtained using the `__call__` method.
skip_special_tokens (`bool`, *optional*, defaults to `False`): skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding. Whether or not to remove special tokens in the decoding. Will remove the previous tokens (pre-prompt)
if present.
clean_up_tokenization_spaces (`bool`, *optional*): clean_up_tokenization_spaces (`bool`, *optional*):
Whether or not to clean up the tokenization spaces. If `None`, will default to Whether or not to clean up the tokenization spaces. If `None`, will default to
`self.clean_up_tokenization_spaces` (available in the `tokenizer_config`). `self.clean_up_tokenization_spaces` (available in the `tokenizer_config`).
output_offsets (`bool`, *optional*, defaults to `False`): output_offsets (`bool`, *optional*, defaults to `False`):
Whether or not to output the offsets of the tokens. This should only be set if the model predicted Whether or not to output the offsets of the tokens. This should only be set if the model predicted
timestamps. timestamps. If there are previous tokens (pre-prompt) to decode, they will only appear in the decoded
text if they contain timestamp tokens.
time_precision (`float`, *optional*, defaults to 0.02): time_precision (`float`, *optional*, defaults to 0.02):
The time ratio to convert from token to time. The time ratio to convert from token to time.
decode_with_timestamps (`bool`, *optional*, defaults to `False`): decode_with_timestamps (`bool`, *optional*, defaults to `False`):