Commit Graph

14 Commits

Author SHA1 Message Date
amyeroberts
1de7dc7403
Skip tests properly (#31308)
* Skip tests properly

* [test_all]

* Add 'reason' as kwarg for skipTest

* [test_all] Fix up

* [test_all]
2024-06-26 21:59:08 +01:00
Ita Zaporozhets
deba7655e6
Add split special tokens (#30772)
* seems like `split_special_tokens` is used here

* split special token

* add new line at end of file

* moving split special token test to common tests

* added assertions

* test

* fixup

* add co-author

* passing rest of args to gptsan_japanese, fixing tests

* removing direct comparison of fast and slow models

* adding test support for UDOP and LayoutXLM

* ruff fix

* readd check if slow tokenizer

* modify test to handle bos tokens

* removing commented function

* trigger build

* applying review feedback - updated docstrings, var names, and simplified tests

* ruff fixes

* Update tests/test_tokenization_common.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* applying feedback, comments

* shutil temp directory fix

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Ita Zaporozhets <itazaporozhets@Itas-MBP.localdomain>
Co-authored-by: itazap <itazap@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Ita Zaporozhets <itazaporozhets@Itas-MacBook-Pro.local>
2024-05-24 08:38:58 -07:00
Arthur
673440d073
update ruff version (#30932)
* update ruff version

* fix research projects

* Empty

* Fix errors

---------

Co-authored-by: Lysandre <lysandre@huggingface.co>
2024-05-22 06:40:15 +02:00
Matt
9d999481b2
Add correct batched handling for apply_chat_template (#29222)
* Add correct batched handling for apply_chat_template

* Fix warning method

* Add error for incompatible options

* expand tests

* Add a skip for markuplm

* Add skips for other layout models

* Skip for LayoutLMv2

* Slightly update the warning message

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* typo fix

* Update docstring for conversation kwarg

* Update return docstring

* Remove the warning, improve error message

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/test_tokenization_common.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Update tests/test_tokenization_common.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Remove return_dict=None

* Fix up some merge cruft

* More merge cruft

* Add another skip

* Add another skip

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
2024-03-20 15:50:22 +00:00
Lysandre Debut
11bbb505c7
Adds pretrained IDs directly in the tests (#29534)
* Adds pretrained IDs directly in the tests

* Fix tests

* Fix tests

* Review!
2024-03-13 14:53:27 +01:00
Arthur
651408a077
[Styling] stylify using ruff (#27144)
* try to stylify using ruff

* might need to remove these changes?

* use ruf format andruff check

* use isinstance instead of type comparision

* use # fmt: skip

* use # fmt: skip

* nits

* soem styling changes

* update ci job

* nits isinstance

* more files update

* nits

* more nits

* small nits

* check and format

* revert wrong changes

* actually use formatter instead of checker

* nits

* well docbuilder is overwriting this commit

* revert notebook changes

* try to nuke docbuilder

* style

* fix feature exrtaction test

* remve `indent-width = 4`

* fixup

* more nits

* update the ruff version that we use

* style

* nuke docbuilder styling

* leve the print for detected changes

* nits

* Remove file I/O

Co-authored-by: charliermarsh
 <charlie.r.marsh@gmail.com>

* style

* nits

* revert notebook changes

* Add # fmt skip when possible

* Add # fmt skip when possible

* Fix

* More `  # fmt: skip` usage

* More `  # fmt: skip` usage

* More `  # fmt: skip` usage

* NIts

* more fixes

* fix tapas

* Another way to skip

* Recommended way

* Fix two more fiels

* Remove asynch
Remove asynch

---------

Co-authored-by: charliermarsh <charlie.r.marsh@gmail.com>
2023-11-16 17:43:19 +01:00
Matt
866df66fe4
Overhaul Conversation class and prompt templating (#25323)
* First commit while I figure this out

* make fixup

* Remove unused method

* Store prompt attrib

* Fix prompt argument for tests

* Make same changes in fast tokenizer

* Remove global prompts from fast tokenizer too

* stash commit

* stash commit

* Migrate PromptConfig to its True Final Location

* Replace Conversation entirely with the new class

* Import/dependency fixes

* Import/dependency fixes

* Change format for lots of default prompts

* More default prompt fixups

* Revert llama old methods so we can compare

* Fix some default configs

* Fix some default configs

* Fix misspelled kwarg

* Fixes for Blenderbot

* make fixup

* little rebase cleanup

* Add basic documentation

* Quick doc fix

* Truncate docstring for now

* Add handling for the case when messages is a single string

* Quick llama merges

* Update conversational pipeline and tests

* Add a couple of legacy properties for backward compatibility

* More legacy handling

* Add docstring for build_conversation_input_ids

* Restructure PromptConfig

* Let's start T E M P L A T I N G

* Refactor all default configs to use templates instead

* Revert changes to the special token properties since we don't need them anymore

* More class templates

* Make the sandbox even sandier

* Everything replaced with pure templating

* Remove docs for PromptConfig

* Add testing and optional requirement boilerplate

* Fix imports and make fixup

* Fix LLaMA tests and add Conversation docstring

* Finally get LLaMA working with the template system

* Finally get LLaMA working with the template system

* make fixup

* make fixup

* fmt-off for the long lists of test tokens

* Rename method to apply_chat_template for now

* Start on documentation

* Make chat_template a property that reads through to the default if it's not set

* Expand docs

* Expand chat templating doc some more

* trim/lstrip blocks by default and update doc

* Few doc tweaks

* rebase cleanup

* Clarify docstring

* rebase cleanup

* rebase cleanup

* make fixup

* Quick doc edit

* Reformat the standard template to match ChatML

* Re-add PEFT check

* Update docs/source/en/chat_templating.md

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* Add apply_chat_template to the tokenizer doc

* make fixup

* Add doc links

* Fix chat links

* Fix chat links

* Explain system messages in the doc

* Add chat template test

* Proper save-loading for chat template attribute

* Add test skips for layout models

* Remove _build_conversation_input_ids, add default_chat_template to code_llama

* Make sure all LLaMA models are using the latest template

* Remove default_system_prompt block in code_llama because it has no default prompt

* Update ConversationPipeline preprocess

* Add correct #Copied from links to the default_chat_templates

* Remove unneeded type checking line

* Add a dummy mark_processsed method

* Reorganize Conversation to have **deprecated_kwargs

* Update chat_templating.md

* Quick fix to LLAMA tests

* Small doc tweaks

* Add proper docstrings and "copied from" statements to all default chat templates

* Merge use_default_system_prompt support for code_llama too

* Improve clarity around self.chat_template

* Docstring fix

* Fix blenderbot default template

* More doctest fix

* Break out some tokenizer kwargs

* Update doc to explain default templates

* Quick tweaks to tokenizer args

* Cleanups for tokenizer args

* Add note about cacheing

* Quick tweak to the chat-templating doc

* Update the LLaMA template with error checking and correct system message embedding

* make fixup

* make fixup

* add requires_jinja

* Cleanup to expected output formatting

* Add cacheing

* Fix typo in llama default template

* Update LLaMA tests

* Update documentation

* Improved legacy handling in the Conversation class

* Update Jinja template with proper error handling

* Quick bugfix

* Proper exception raising

* Change cacheing behaviour so it doesn't try to pickle an entire Jinja env

* make fixup

* rebase cleanup

---------

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2023-09-14 15:10:34 +01:00
Arthur
30b3c46ff5
[split_special_tokens] Add support for split_special_tokens argument to encode (#25081)
* draft changes

* update and add tests

* styling for no

* move test

* path to usable model

* update test

* small update

* update bertbased tokenizers

* don'tuse kwargs for _tokenize

* don'tuse kwargs for _tokenize

* fix copies

* update

* update test for special tokenizers

* fixup

* skip two tests

* remove pdb breakpiont()

* wowo

* rewrite custom tests

* nits

* revert chang in target keys

* fix markup lm

* update documentation of the argument
2023-08-18 13:26:27 +02:00
Sylvain Gugger
6f79d26442
Update quality tooling for formatting (#21480)
* Result of black 23.1

* Update target to Python 3.7

* Switch flake8 to ruff

* Configure isort

* Configure isort

* Apply isort with line limit

* Put the right black version

* adapt black in check copies

* Fix copies
2023-02-06 18:10:56 -05:00
Bartosz Szmelczynski
78a471ff71
Fix tapas scatter (#20149)
* First draft

* Remove scatter dependency

* Add require_torch

* update vectorized sum test, add clone call

* remove artifacts

* fix style

* fix style v2

* remove "scatter" mentions from the code base

* fix isort error

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-11-14 01:04:26 -05:00
Ben Eyal
9f9ddcc2de
🚨 🚨 🚨 Fix Issue 15003: SentencePiece Tokenizers Not Adding Special Tokens in convert_tokens_to_string (#15775)
* Add test for SentencePiece not adding special tokens to strings

* Add SentencePieceStringConversionMixin to fix issue 15003

* Fix conversion from tokens to string for most SentencePiece tokenizers

Tokenizers fixed:
- AlbertTokenizer
- BarthezTokenizer
- CamembertTokenizer
- FNetTokenizer
- M2M100Tokenizer
- MBart50Tokenizer
- PegasusTokenizer
- Speech2TextTokenizer

* Fix MarianTokenizer, adjust SentencePiece test to accomodate vocab

* Fix DebertaV2Tokenizer

* Ignore LayoutXLMTokenizer in SentencePiece string conversion test

* Run 'make style' and 'make quality'

* Clean convert_tokens_to_string test

Instead of explicitly ignoring LayoutXLMTokenizer in the test,
override the test in LayoutLMTokenizationTest and do nothing in it.

* Remove commented out code

* Improve robustness of convert_tokens_to_string test

Instead of comparing lengths of re-tokenized text and input_ids,
check that converting all special tokens to string yields a string
with all special tokens.

* Inline and remove SentencePieceStringConversionMixin

The convert_tokens_to_string method is now implemented
in each relevant SentencePiece tokenizer.

* Run 'make style' and 'make quality'

* Revert removal of space in convert_tokens_to_string

* Remove redundant import

* Revert test text to original

* Uncomment the lowercasing of the reverse_text variable

* Mimic Rust tokenizer behavior for tokenizers

- Albert
- Barthez
- Camembert
- MBart50
- T5

* Fix accidentally skipping test in wrong tokenizer

* Add test for equivalent Rust and slow tokenizer behavior

* Override _decode in BigBirdTokenizer to mimic Rust behavior

* Override _decode in FNetTokenizer to mimic Rust behavior

* Override _decode in XLNetTokenizer to mimic Rust behavior

* Remove unused 're' import

* Update DebertaV2Tokenizer to mimic Rust tokenizer

* Deberta tokenizer now behaves like Albert and its `convert_tokens_to_string` is not tested.

* Ignore problematic tests in Deberta V2

* Add comment on why the Deberta V2 tests are skipped
2022-11-02 15:45:38 -04:00
SaulLu
6667b0d7bf
add warning to let the user know that the __call__ method is faster than encode + pad for a fast tokenizer (#18693)
* add warning to let the user know that the  method is slower that  for a fast tokenizer

* user warnings

* fix layoutlmv2

* fix layout*

* change warnings into logger.warning
2022-08-24 06:27:56 -04:00
Sylvain Gugger
afe5d42d8d
Black preview (#17217)
* Black preview

* Fixup too!

* Fix check copies

* Use the same version as the CI

* Bump black
2022-05-12 16:25:55 -04:00
Yih-Dar
19420fd99e
Move test model folders (#17034)
* move test model folders (TODO: fix imports and others)

* fix (potentially partially) imports (in model test modules)

* fix (potentially partially) imports (in tokenization test modules)

* fix (potentially partially) imports (in feature extraction test modules)

* fix import utils.test_modeling_tf_core

* fix path ../fixtures/

* fix imports about generation.test_generation_flax_utils

* fix more imports

* fix fixture path

* fix get_test_dir

* update module_to_test_file

* fix get_tests_dir from wrong transformers.utils

* update config.yml (CircleCI)

* fix style

* remove missing imports

* update new model script

* update check_repo

* update SPECIAL_MODULE_TO_TEST_MAP

* fix style

* add __init__

* update self-scheduled

* fix add_new_model scripts

* check one way to get location back

* python setup.py build install

* fix import in test auto

* update self-scheduled.yml

* update slack notification script

* Add comments about artifact names

* fix for yolos

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-05-03 14:42:02 +02:00