* update example
* update
* push the converted diff files for testing and ci
* correct one example
* fix class attributes and docstring
* nits
* oops
* fixed config!
* update
* nits
* class attributes are not matched against each other; this is missing
* fixed overwriting of self.xxx onto the attributes, I think
* partial fix; now handle ordering with docstrings
* fix docstring order?
* more fixes
* update
* fix missing docstrings!
* examples don't all work yet
* fixup
* nit
* updated
* hiccup
* update
* delete
* update
* update
* update
* fix
* all default
* no local import
* fix more diff
* some fix related to "safe imports"
* push fixed
* add helper!
* style
* add a check
* all by default
* add the
* update
* FINALLY!
* nit
* fix config dependencies
* man that is it
* fix fix
* update diffs
* fix the last issue
* re-default to all
* all the fixes
* nice
* fix properties vs setter
* fixup
* updates
* update dependencies
* make sure to install what needs to be installed
* fixup
* quick fix for now
* fix!
* fixup
* update
* update
* updates
* whitespaces
* nit
* fix
* simplify everything, and make it file agnostic (should work for image processors)
* style
* finish fixing all import issues
* fixup
* empty modeling should not be written!
* Add logic to find who depends on what
* update
* cleanup
* update
* update gemma to support positions
* some small nits
* this is the correct docstring for gemma2
* fix merging of docstrings
* update
* fixup
* update
* take doc into account
* styling
* update
* fix hidden activation
* more fixes
* final fixes!
* fixup
* fixup instruct blip video
* update
* fix bugs
* align gemma2 with the rest as well
* updates
* revert
* update
* more reversions
* grind
* more
* argh
* update
* order will matter
* finish del stuff
* update
* rename to modular
* fixup
* nits
* update makefile
* fixup
* update order of the checks!
* fix
* fix docstring that has a call inside
* fix conversion check
* style
* add some initial documentation
* update
* update doc
* some fixup
* updates
* yups
* Mostly todo, gimme a minute
* update
* fixup
* revert some stuff
* Review docs for the modular transformers (#33472)
Docs
* good update
* fixup
* mmm current updates lead to this code
* okay, this fixes it
* cool
* fixes
* update
* nit
* updates
* nits
* fix doc
* update
* revert bad changes
* update
* updates
* proper update
* update
* update?
* up
* update
* cool
* nits
* nits
* good, good
* fix
* ?
* minimise changes
* update
* update
* update
* updates?
* fixed gemma2
* kind of a hack
* nits
* update
* remove `diffs` in favor of `modular`
* fix make fix copies
---------
Co-authored-by: Lysandre Debut <hi@lysand.re>
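For orientation, a hedged sketch of what the `modular_*.py` files introduced by this PR look like (the Llama import is real, but the module and class names here are illustrative, not the actual contents of the PR):

```python
# modular_my_model.py -- illustrative sketch of a modular definition file.
# The converter expands this into a self-contained modeling_*.py by
# inlining the inherited code, so the modeling file stays copy-complete.
from transformers.models.llama.modeling_llama import LlamaMLP


class MyModelMLP(LlamaMLP):
    # Only the differences from Llama are spelled out; everything else
    # is pulled in during conversion.
    pass
```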
* Updated ruff version and fixed the required code according to the latest version.
* Added noqa directive to ignore 1 error shown by ruff
* Pass datasets trust_remote_code
* Pass trust_remote_code in more tests
* Add trust_remote_dataset_code arg to some tests
* Revert "Temporarily pin datasets upper version to fix CI"
This reverts commit b7672826ca.
* Pass trust_remote_code in librispeech_asr_dummy docstrings
* Revert "Pin datasets<2.20.0 for examples"
This reverts commit 833fc17a3e.
* Pass trust_remote_code to all examples
* Revert "Add trust_remote_dataset_code arg to some tests" to research_projects
* Pass trust_remote_code to tests
* Pass trust_remote_code to docstrings
* Fix flax examples tests requirements
* Pass trust_remote_dataset_code arg to tests
* Replace trust_remote_dataset_code with trust_remote_code in one example
* Fix duplicate trust_remote_code
* Replace args.trust_remote_dataset_code with args.trust_remote_code
* Replace trust_remote_dataset_code with trust_remote_code in parser
* Replace trust_remote_dataset_code with trust_remote_code in dataclasses
* Replace trust_remote_dataset_code with trust_remote_code arg
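A minimal sketch of the argument these commits thread through tests, examples, and docstrings (assuming a datasets version that exposes `trust_remote_code`; the dataset name comes from the docstring commits above):

```python
from datasets import load_dataset

# Datasets defined by a loading script on the Hub require explicit opt-in;
# the scripts' --trust_remote_code flag is forwarded here.
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy",
    "clean",
    split="validation",
    trust_remote_code=True,
)
```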
* [build-ci-image]
* correct branch
* push ci image
* [build-ci-image]
* update scheduled as well
* [push-ci-image]
* [build-ci-image]
* [push-ci-image]
* update deps
* [build-ci-image]
* [build-ci-image]
* [build-ci-image]
* [build-ci-image]
* [build-ci-image]
* [build-ci-image]
* oops [build-ci-image]
* [push-ci-image]
* fix
* [build-ci-image]
* [build-ci-image]
* [build-ci-image]
* [build-ci-image]
* [build-ci-image]
* [build-ci-image]
* [build-ci-image]
* updated
* [build-ci-image] update tag
* [build-ci-image]
* [build-ci-image]
* fix tag
* [build-ci-image]
* [build-ci-image]
* [build-ci-image]
* [build-ci-image]
* github name
* commit_title?
* fetch
* update
* it's not found
* dev
* dev
* [push-ci-image]
* dev
* dev
* update
* dev
* dev print dev commit message dev
* dev ? dev
* dev
* dev
* dev
* dev
* [build-ci-image]
* [build-ci-image]
* [push-ci-image]
* revert unwanted
* revert convert as well
* no you are not important
* [build-ci-image]
* Update .circleci/config.yml
* pin tf probability dev
* change CIs
* nits
* update
* minor updates
* [push-ci-image]
* nit [push-ci-image]
* nitsssss
* [build-ci-image]
* [push-ci-image]
* [push-ci-image]
* both
* [push-ci-image]
* this?
* [push-ci-image]
* pypi-kenlm needs g++
* [push-ci-image]
* nit
* more nits [push-ci-image]
* nits [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* add vision
* [push-ci-image]
* [push-ci-image]
* add new dummy file but will need to update them [push-ci-image]
* [push-ci-image]
* show package size as well
* [push-ci-image]
* potentially ignore failures
* workflow updates
* nits [push-ci-image]
* [push-ci-image]
* fix consistency
* clean nvidia triton
* also show big packages [push-ci-image]
* nit
* update
* another one
* line escape?
* add accelerate [push-ci-image]
* updates [push-ci-image]
* nits to run tests, no push-ci
* try to parse skip reason to make sure nothing is skipped that should not be skipped
* nit?
* always show skipped reasons
* nits
* better parsing of the test outputs
* action="store_true",
* failure on failed
* show matched
* debug
* update short summary with skipped, failed and errors
* nits
* nits
* cool updates
* remove docbuilder
* fix
* always run checks
* oops
* nits
* don't error out on library printing
* non-zero exit codes
* no warning
* nit
* WAT?
* format nit
* [push-ci-image]
* fail if fail is needed
* [push-ci-image]
* soundfile for torch-light?
* [push-ci-image]
* order is important [push-ci-image]
* [push-ci-image] reduce even further
* [push-ci-image]
* use pytest rich !
* yes [push-ci-image]
* oopsy
* bring back the full traceback, but pytest rich should help
* nit
* [push-ci-image]
* re run
* nit
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* empty push to trigger
* [push-ci-image]
* nit? [push-ci-image]
* empty
* try to install timm with no deps
* [push-ci-image]
* oops [push-ci-image]
* [push-ci-image]
* [push-ci-image] ?
* [push-ci-image] open ssh client for git checkout fast
* empty for torch light
* updates [push-ci-image]
* nit
* @v4 for checkout
* [push-ci-image]
* [push-ci-image]
* fix fetch tests with parallelism
* [push-ci-image]
* more parallelism
* nit
* more nits
* empty to re-trigger
* empty to re-trigger
* split by timing
* did not work with previous commit
* junit.xml
* no path?
* mmm this?
* junitxml format
* split by timing
* nit
* fix junit family
* now we can test if the xunit1 is compatible!
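Loosely, the combination these commits converge on (flag values inferred from the commit titles, shown via pytest's Python entry point):

```python
import pytest

# Emit a JUnit report in the xunit1 family, which preserves per-test
# classnames so results can later be sorted and split by timing.
pytest.main(["tests/", "--junitxml=junit.xml", "-o", "junit_family=xunit1"])
```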
* this?
* fully list tests
* update
* update
* oops
* finally
* use classname
* remove working directory to make sure the path does not interfere
* okay, now junit should have the correct path
* name split?
* sort by classname is what makes the most sense
* some testing
* name
* oops
* test something fun
* autodetect
* 18?
* nit
* file size?
* up
* 4 is best
* update to see versions
* better print
* [push-ci-image]
* [push-ci-image]
* please install the correct keras version
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* uv is fucking me up
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* nits
* [push-ci-image]
* [push-ci-image]
* install issues and pins
* tapas as well
* nits
* more parallelism
* short tb
* soundfile
* soundfile
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* oops
* [push-ci-image]
* fix some things
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* use torch-light for hub
* small git lfs for hub job
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* fix tf tapas
* [push-ci-image]
* nits
* [push-ci-image]
* don't update the test
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* now use them
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* update tf proba
* [push-ci-image]
* [push-ci-image]
* woops
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* test with built dockers
* [push-ci-image]
* skip annoying tests
* revert fix copy
* update test values
* update
* last skip and fixup
* nit
* ALL GOOOD
* quality
* Update tests/models/layoutlmv2/test_image_processing_layoutlmv2.py
* Update docker/quality.dockerfile
Co-authored-by: Lysandre Debut <hi@lysand.re>
* Update src/transformers/models/tapas/modeling_tf_tapas.py
Co-authored-by: Lysandre Debut <hi@lysand.re>
* Apply suggestions from code review
Co-authored-by: Lysandre Debut <hi@lysand.re>
* use torch-speed
* updates
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* [push-ci-image]
* fuck ken-lm [push-ci-image]
* [push-ci-image]
* [push-ci-image]
---------
Co-authored-by: Lysandre Debut <hi@lysand.re>
* [DO NOT MERGE] Testing tokenizers 0.19.0rc0
* Accounting for the breaking change.
* Ruff.
* Upgrading to tokenizers `0.19` (new release with prepend_scheme fixed
and new surface for BPE tiktoken bug).
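As a rough illustration of the `prepend_scheme` surface mentioned here (assuming tokenizers>=0.19, where the `Metaspace` pre-tokenizer exposes it directly):

```python
from tokenizers import pre_tokenizers

# "first" prepends the replacement only to the first word, avoiding the
# double-prefix behaviour the fix addresses; "always" and "never" also exist.
pre_tok = pre_tokenizers.Metaspace(replacement="▁", prepend_scheme="first")
print(pre_tok.pre_tokenize_str("Hello world"))
```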
* Pin torch to <2.2.0
* Pin torchvision and torchaudio as well
* Playing around with versions to see if this helps
* twiddle something to restart the CI
* twiddle it back
* Try changing the natten version
* make fixup
* Revert "Try changing the natten version"
This reverts commit de0d6592c3.
* make fixup
* fix fix fix
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* try to stylify using ruff
* might need to remove these changes?
* use ruff format and ruff check
* use isinstance instead of type comparison
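The kind of rewrite that commit applies (the pycodestyle rule ruff enforces as E721), sketched on a throwaway value:

```python
x = 3.0

# before (flagged by ruff's E721 check):
# if type(x) == float:
#     print("x is a float")

# after: isinstance is the idiomatic check and also accepts subclasses
if isinstance(x, float):
    print("x is a float")
```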
* use # fmt: skip
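A one-line example of the directive (both black and ruff-format honor it): append `# fmt: skip` to keep hand-aligned code from being reflowed.

```python
# the formatter would normally collapse this spacing; fmt: skip preserves it
LOOKUP = {"a": 1,    "bb": 22,    "ccc": 333}  # fmt: skip
```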
* nits
* some styling changes
* update ci job
* nits isinstance
* more files update
* nits
* more nits
* small nits
* check and format
* revert wrong changes
* actually use formatter instead of checker
* nits
* well docbuilder is overwriting this commit
* revert notebook changes
* try to nuke docbuilder
* style
* fix feature extraction test
* remove `indent-width = 4`
* fixup
* more nits
* update the ruff version that we use
* style
* nuke docbuilder styling
* leave the print for detected changes
* nits
* Remove file I/O
Co-authored-by: charliermarsh <charlie.r.marsh@gmail.com>
* style
* nits
* revert notebook changes
* Add # fmt: skip when possible
* Fix
* More `# fmt: skip` usage
* Nits
* more fixes
* fix tapas
* Another way to skip
* Recommended way
* Fix two more files
* Remove asynch
---------
Co-authored-by: charliermarsh <charlie.r.marsh@gmail.com>
* Support runs/
* Upload runs folder as part of push to hub
* Add a test
* Add to test deps
* Update with proposed solution from Slack
* Ensure that repo gets deleted in tests
* remove SharedDDP as it was deprecated
* apply review suggestion
* make style
* Oops, forgot to remove the compute_loss context manager in Seq2SeqTrainer.
* remove the unnecessary conditional statement
* keep the logic of IPEX
* clean code
* mixed precision setup & make fixup
---------
Co-authored-by: statelesshz <jihuazhong1@huawei.com>
* fix test for bart. Order is correct now, let's skip BPEs
* phew
* styling
* fix bert....
* slow refactoring
* current updates
* massive refactoring
* update
* NICE!
* update to see where I am at
* updates
* update
* update
* revert
* updates
* updates
* start supporting legacy_save
* styling
* big update
* revert some changes
* nits
* nniiiiiice
* small fixes
* kinda fix t5 with new behaviour
* major update
* fixup
* fix copies
* today's updates
* fix byt5
* update
* update
* update
* updates
* update vocab size test
* Barthez does not need the fairseq offset ids
* super call must be after
* call super
* move all super init
* move other super init
* fixup
* nits
* more fixes
* nits
* more fixes
* nits
* more fix
* remove useless files
* ouch all of them are affected
* and more!
* small improvements
* no more sanitize token
* more changes around unique no split tokens
* partially fix more things
* keep legacy save but add warning
* so... more fixes
* updates
* guess deberta tokenizer could be nuked
* fixup
* fixup did some bad things
* nuke it if it breaks
* remove prints and pretrain fast from slow with new format.
* fixups
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* phew
* nit
* by default specials should not be normalized?
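The question above concerns the `normalized` flag on added tokens; a hedged sketch (assuming the tokenizers `AddedToken` API at >=0.14, which this PR pins to):

```python
from tokenizers import AddedToken

# normalized=False means the token is matched before normalization runs,
# so e.g. a lowercasing normalizer cannot break "<SPECIAL>" matching.
token = AddedToken("<SPECIAL>", normalized=False, special=True)
```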
* update
* remove breakpoint
* updates
* a lot of updates
* fixup
* fixes revert some changes to match fast
* small nits
* that makes it cleaner
* fix camembert accordingly
* update
* some less breaking changes
* update
* fixup
* fix byt5 and whisper mostly
* some more fixes, canine's byte vocab
* fix gpt2
* fix most of the perceiver tests (4 left)
* fix layout lmv3
* fixup
* fix copies for gpt2 style
* make sure to only warn once
* fix perceiver and gpt2 tests
* some more backward compatibility: also read the special tokens map, because some people still use it...
* fixup
* add else when reading
* nits
* fresh updates
* fix copies
* will this make everything faster?
* fixes
* more fixes
* update
* more fixes
* fixup
* is the source of truth right?
* sorry camembert for the troubles
* current updates
* fixup
* update led
* update
* fix regression
* fix single word
* more model specific fixes
* fix t5 tests
* fixup
* more comments
* update
* fix nllb
* rstrip removed
* small fixes
* better handle additional_special_tokens and vocab sizes
* fixing
* styling
* fix 4 / 21
* fixup
* fix nllb's tests
* some fixes
* fix t5
* fixes
* style
* fix canine tests
* damn this is nice
* nits
* m2m100 nit
* fixups
* fixes!
* fixup
* stash
* fix merge
* revert bad change
* fixup
* correct order for Code Llama
* fix speecht5 post merge
* styling
* revert source of 11 fails
* small nits
* all changes in one go
* fnet hack
* fix 2 more tests
* update based on main branch of tokenizers
* fixup
* fix VITS issues
* more fixes
* fix mgp test
* fix camembert issues
* oops, camembert still has 2 failing tests
* mluke fixes
* decode fixes
* small nits
* nits
* fix llama and vits
* fix camembert
* small nits
* more fixes when initialising a fast tokenizer from a slow one, etc.
* fix one of the last test
* fix CPM tokenizer test
* fixups
* fix pop2piano
* fixup
* ⚠️ Change tokenizers required version ⚠️
* "tokenizers>=0.14,<0.15", don't forget the "smaller than" upper bound
* fix musicgen tests and PreTrainedTokenizerFast
* fix owlvit and all
* update t5
* fix 800 red tests
* fix tests
* fix the fix of the fix of t5
* styling
* documentation nits
* cache _added_tokens_encoder
* fixups
* Nit
* fix red tests
* one last nit!
* make everything a lot simpler
* Now it's over 😉
* few small nits
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* updates that work for now
* tests that should not be skipped / changed and fixed next
* fixup
* i am ashamed
* pushed the fix
* update
* fixups
* nits
* fix added_tokens_encoder
* fix canine test
* fix pegasus vocab
* fix transfoXL
* fixup
* whisper needs to be fixed for train new
* pegasus nits
* more pegasus fixes
* minor update
* better error message in failed test
* fix whisper failing test
* fix pegasus
* fixup
* fix **** pegasus
* reset things
* remove another file
* attempts to fix the strange custom encoder and offset
* nits here and there
* update
* fixup
* nit
* fix the whisper test
* nits nits
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* updates based on review
* some small update to potentially remove
* nits
* import lru cache
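Presumably the `functools` helper; a minimal sketch of caching a derived token mapping this way (names here are illustrative, not the actual attributes):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def build_added_tokens_encoder(added_tokens):
    # added_tokens is a hashable tuple of (content, id) pairs;
    # the content -> id map is computed once per unique vocabulary
    return {content: idx for content, idx in added_tokens}
```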
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Lysandre Debut <hi@lysand.re>
* move warning to `from_pretrained`
* update tests results now that the special tokens are always added
---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
* Limit Pydantic to V1 in dependencies
Pydantic is about to release V2, which will break a lot of things. This change prevents `transformers` from being used with Pydantic V2 to avoid breakage.
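In packaging terms the cap is just an upper bound in the dependency list; a sketch of what such a pin looks like in a `setup.py` (the excerpt is illustrative):

```python
# setup.py (excerpt, illustrative)
install_requires = [
    "pydantic<2",  # V2 is a breaking release; stay on V1 until migration
]
```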
* more
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* An end to accursed version-specific imports
* No more K.is_keras_tensor() either
* Update dependency tables
* Use a cleaner call context function getter
* Add a cap to <2.14
* Add cap to examples requirements too
* fix for ragged list
* unpin numba
* make style
* np.object -> object
* propagate changes to tokenizer as well
* np.long -> "long"
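The alias replacements, concretely: the deprecated `np.object` and `np.long` aliases were removed from NumPy, and the builtin `object` and the `"long"` dtype string are the drop-in fixes.

```python
import numpy as np

# was: np.asarray([[1, 2], [3]], dtype=np.object)  -- np.object is gone
ragged = np.asarray([[1, 2], [3]], dtype=object)

# was: np.asarray([1, 2, 3], dtype=np.long)  -- np.long is gone
longs = np.asarray([1, 2, 3], dtype="long")
```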
* revert tokenization changes
* check with tokenization changes
* list/tuple logic
* catch numpy
* catch else case
* clean up
* up
* better check
* trigger ci
* Empty commit to trigger CI
* Making `safetensors` a core dependency.
To be merged later, I'm creating the PR so we can try it out.
* Update setup.py
* Remove duplicates.
* Even more redundant.
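For context, the core of what the new dependency provides, sketched as a save/load round trip (file name illustrative):

```python
import torch
from safetensors.torch import load_file, save_file

# safetensors is a simple, pickle-free tensor serialization format
tensors = {"weight": torch.zeros((2, 2))}
save_file(tensors, "model.safetensors")
restored = load_file("model.safetensors")
assert torch.equal(restored["weight"], tensors["weight"])
```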
* Adding Llama FastTokenizer support.
- Requires the version from https://github.com/huggingface/tokenizers/pull/1183
- Only supports byte_fallback for llama; raises otherwise (safety net).
- Lots of open questions around special tokens
How to test:
```python
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from transformers import AutoTokenizer
from tokenizers import Tokenizer

tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b")

# Toggle: load a previously converted tokenizer from disk, or reconvert and save it.
if False:
    new_tokenizer = Tokenizer.from_file("tok.json")
else:
    new_tokenizer = convert_slow_tokenizer(tokenizer)
    new_tokenizer.save("tok.json")

strings = [
    "This is a test",
    "生活的真谛是",
    "生活的真谛是[MASK]。",
    # XXX: This one is problematic because of special tokens
    # "<s> Something something",
]

for string in strings:
    # The slow and converted fast tokenizers must agree on both encode and decode.
    encoded = tokenizer(string)["input_ids"]
    encoded2 = new_tokenizer.encode(string).ids
    assert encoded == encoded2, f"{encoded} != {encoded2}"
    decoded = tokenizer.decode(encoded)
    decoded2 = new_tokenizer.decode(encoded2)
    assert decoded.strip() == decoded2, f"{repr(decoded)} != {repr(decoded2)}"
```
The converter + some test script.
The test script.
Tmp save.
Adding Fast tokenizer + tests.
Adding the tokenization tests.
Correct combination.
Small fix.
Fixing tests.
Fixing with latest update.
Rebased.
fix copies + normalized added tokens + copies.
Adding doc.
TMP.
Doc + split files.
Doc.
Versions + try import.
Fix Camembert + warnings -> Error.
Fix by ArthurZucker.
Not a decorator.
* Fixing comments.
* Adding more to docstring.
* Doc rewriting.
* [setup] drop deprecated `distutils` usage
* drop deprecated `distutils.util.strtobool` usage
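Since `distutils.util.strtobool` has no stdlib replacement, the usual fix is a small vendored equivalent; a sketch matching the distutils semantics:

```python
def strtobool(val: str) -> int:
    """Mirror of the removed distutils.util.strtobool."""
    val = val.lower()
    if val in ("y", "yes", "t", "true", "on", "1"):
        return 1
    if val in ("n", "no", "f", "false", "off", "0"):
        return 0
    raise ValueError(f"invalid truth value {val!r}")
```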
* fix import order
* reformat docstring by `doc-builder`