Commit Graph

8 Commits

Author SHA1 Message Date
cyyever
1e6b546ea6
Use Python 3.9 syntax in tests (#37343)
Signed-off-by: cyy <cyyever@outlook.com>
2025-04-08 14:12:08 +02:00
Yih-Dar
1fcaad6df9
Use lru_cache for tokenization tests (#36818)
* fix

* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-03-28 15:09:35 +01:00
Raushan Turganbay
187439c3fa
VLM: special multimodal Tokenizer (#34461)
* kinda works

* update

* add tests

* update

* use special tokens in processors

* typo

* fix copies

* fix

* fix moshi after rebase

* update

* fix tests

* update

* Update docs/source/en/main_classes/tokenizer.md

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* update docs

* test for load time adding tokens

* fix some more tests which are now fetched better

* one more fix

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-11-04 16:37:51 +01:00
Ita Zaporozhets
1e79eade41
SPLIT PR: add user defined symbols and control symbols (#31305)
* PR SPLIT: moving origina changes for adding user defined symbols

* adding gemma test and generalizing gemma converter

* ruff

* update common test

* update serialization test

* deberta v2 tests updates as rust version adds '.' as a user added token, so a space is not added

* removing commented lines

* applying feedback - user only added_tokens to add and check piece.type instead of trainer_spec for user_defined_symbols

* add comment referencing sentencepiece
2024-06-21 01:48:10 -07:00
Arthur
673440d073
update ruff version (#30932)
* update ruff version

* fix research projects

* Empty

* Fix errors

---------

Co-authored-by: Lysandre <lysandre@huggingface.co>
2024-05-22 06:40:15 +02:00
Sai-Suraj-27
06b1192768
fix: Replace deprecated assertEquals with assertEqual (#30241)
Replace deprecated assertEquals with assertEqual.
2024-04-15 09:36:06 +01:00
Lysandre Debut
11bbb505c7
Adds pretrained IDs directly in the tests (#29534)
* Adds pretrained IDs directly in the tests

* Fix tests

* Fix tests

* Review!
2024-03-13 14:53:27 +01:00
Nilesh
4d4febb7aa
Added test cases for rembert refering to albert and reformer test_tok… (#27637)
* Added test cases for rembert refering to albert and reformer test_tokenization

* removed CURL_CA_BUNDLE='

* Added flag test_sentencepiece_ignore_case and space_between_special_tokens to True

* Overrided test_added_tokens_serialization

* As slow->fast token failed due to the different initialization for [MASK]  for slow and fast, Therefore it required to make the initialization for [MASK] token uniform between fast and slow token

* Added few more test cases in test_encode_decode_round_trip and modefied the slow token (mask_token) to  have AddedToken instance with lstrip=True

* Added few test cases in test_encoder_decoder round trip and also modified slow tokenizer of rembert to have mask_token as AddedToken with lstrip = True

* Cleaned the code and added  fmt: skip to avoid line breaks after make style +  added comments to indicate from the copied test cases

* Corrected few comments

* Fixed quality issue

* Ran fix-copies

* Fixed few minor issues as (make fix-copies) broke few test cases while stripping the text

* Reverted the changes made by repo-consistancy

---------

Co-authored-by: Kokane <kokanen@apac.corpdir.net>
2023-12-04 13:36:57 +01:00