Commit Graph

433 Commits

Arthur
2da8853775
🚨🚨 🚨🚨 [Tokenizer] attempt to fix add_token issues 🚨🚨 🚨🚨 (#23909)
* fix test for bart. Order is correct now let's skip BPEs

* phew

* styling

* fix bert....

* slow refactoring

* current updates

* massive refactoring

* update

* NICE!

* update to see where I am at

* updates

* update

* update

* revert

* updates

* updates

* start supporting legacy_save

* styling

* big update

* revert some changes

* nits

* nniiiiiice

* small fixes

* kinda fix t5 with new behaviour

* major update

* fixup

* fix copies

* today's updates

* fix byt5

* update

* update

* update

* updates

* update vocab size test

* Barthez does not need the fairseq offset ids

* super call must be after

* call super

* move all super init

* move other super init

* fixup

* nits

* more fixes

* nits

* more fixes

* nits

* more fix

* remove useless files

* ouch all of them are affected

* and more!

* small improvements

* no more sanitize token

* more changes around unique no split tokens

* partially fix more things

* keep legacy save but add warning

* so... more fixes

* updates

* guess deberta tokenizer could be nuked

* fixup

* fixup did some bad things

* nuke it if it breaks

* remove prints and pretrain fast from slow with new format.

* fixups

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* phew

* nit

* by default specials should not be normalized?

* update

* remove breakpoint

* updates

* a lot of updates

* fixup

* fixes: revert some changes to match fast

* small nits

* that makes it cleaner

* fix camembert accordingly

* update

* some less breaking changes

* update

* fixup

* fix byt5 and whisper mostly

* some more fixes, canine's byte vocab

* fix gpt2

* fix most of the perceiver tests (4 left)

* fix layout lmv3

* fixup

* fix copies for gpt2 style

* make sure to only warn once

* fix perceiver and gpt2 tests

* some more backward compatibility: also read the special tokens map because some people use it

* fixup

* add else when reading

* nits

* fresh updates

* fix copies

* will this make everything faster?

* fixes

* more fixes

* update

* more fixes

* fixup

* is the source of truth right?

* sorry camembert for the troubles

* current updates

* fixup

* update led

* update

* fix regression

* fix single word

* more model specific fixes

* fix t5 tests

* fixup

* more comments

* update

* fix nllb

* rstrip removed

* small fixes

* better handle additional_special_tokens and vocab sizes

* fixing

* styling

* fix 4 / 21

* fixup

* fix nllb's tests

* some fixes

* fix t5

* fixes

* style

* fix canine tests

* damn this is nice

* nits

* m2m100 nit

* fixups

* fixes!

* fixup

* stash

* fix merge

* revert bad change

* fixup

* correct order for Code Llama

* fix speecht5 post merge

* styling

* revert source of 11 fails

* small nits

* all changes in one go

* fnet hack

* fix 2 more tests

* update based on main branch of tokenizers

* fixup

* fix VITS issues

* more fixes

* fix mgp test

* fix camembert issues

* oops camembert still has 2 failing tests

* mluke fixes

* decode fixes

* small nits

* nits

* fix llama and vits

* fix camembert

* small nits

* more fixes when initialising a fast tokenizer from a slow one, etc.

* fix one of the last tests

* fix CPM tokenizer test

* fixups

* fix pop2piano

* fixup

* ⚠️ Change tokenizers required version ⚠️

* ⚠️ Change tokenizers required version ⚠️

* "tokenizers>=0.14,<0.15", don't forget smaller than

* fix musicgen tests and PreTrainedTokenizerFast

* fix owlvit and all

* update t5

* fix 800 red tests

* fix tests

* fix the fix of the fix of t5

* styling

* documentation nits

* cache _added_tokens_encoder

* fixups

* Nit

* fix red tests

* one last nit!

* make everything a lot simpler

* Now it's over 😉

* few small nits

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates that work for now

* tests that should not be skipped / changed and fixed next

* fixup

* I am ashamed

* push the fix

* update

* fixups

* nits

* fix added_tokens_encoder

* fix canine test

* fix pegasus vocab

* fix transfoXL

* fixup

* whisper needs to be fixed for train new

* pegasus nits

* more pegasus fixes

* minor update

* better error message in failed test

* fix whisper failing test

* fix whisper failing test

* fix pegasus

* fixup

* fix **** pegasus

* reset things

* remove another file

* attempts to fix the strange custom encoder and offset

* nits here and there

* update

* fixup

* nit

* fix the whisper test

* nits nits

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* updates based on review

* some small update to potentially remove

* nits

* import lru cache

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Lysandre Debut <hi@lysand.re>

* move warning to `from_pretrained`

* update test results now that the special tokens are always added
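
To ground the above, a minimal sketch of the `add_token` path this PR reworks (checkpoint and token string are illustrative); note the commit also bumps the pin to "tokenizers>=0.14,<0.15":

```python
# Illustrative: AddedToken controls how an added token is matched and
# normalized; with this PR, special tokens are always added to the vocab,
# which is what the updated test expectations assume.
from transformers import AutoTokenizer
from tokenizers import AddedToken

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens([AddedToken("<new_tok>", normalized=False)], special_tokens=True)
print(tokenizer.tokenize("hello <new_tok>"))  # "<new_tok>" kept as a single token
```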

---------

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Lysandre Debut <hi@lysand.re>
2023-09-18 20:28:36 +02:00
Lysandre
d8e13b3e04 v4.34.dev.0 2023-09-04 15:12:11 -04:00
Yih-Dar
3fb1535b09
Update setup.py (#25893)
update

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-08-31 18:54:01 +02:00
Matt
62396cff46
TF 2.14 compatibility (#25630)
* Update the TF pin and see if anything breaks

* make fixup

* make fixup

* make fixup
2023-08-22 13:13:38 +01:00
Sylvain Gugger
5c67682b16
v4.33.0.dev0 2023-08-21 07:07:04 -04:00
Sylvain Gugger
2defb6b048
More utils doc (#25457)
* Document and clean more utils.

* More documentation and fixes

* Switch to Lysandre's token

* Address review comments

* Actually put else
2023-08-17 07:58:35 +02:00
Sylvain Gugger
baf1daa58e
Migrate Trainer from Repository to upload_folder (#25095)
* First draft

* Deal with progress bars

* Update src/transformers/utils/hub.py

Co-authored-by: Lucain <lucainp@gmail.com>

* Address review comments

* Forgot one

* Pin hf_hub

* Add argument for push all and fix tests

* Fix tests

* Address review comments

---------

Co-authored-by: Lucain <lucainp@gmail.com>
2023-08-07 17:47:22 +02:00
Sanchit Gandhi
66c240f3c9
[JAX] Bump min version (#25286)
* [JAX] Bump min version

* make fixup
2023-08-03 16:05:02 +01:00
Sylvain Gugger
e9ad51306f
4.32.0.dev0 2023-07-17 13:30:44 -04:00
Georgie Mathews
0866705022
Update setup.py to be compatible with pipenv (#24789) 2023-07-13 12:56:43 -04:00
Yih-Dar
e538189931
Upgrade jax/jaxlib/flax pin versions (#24791)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-07-13 13:57:30 +02:00
Yih-Dar
6eedfa6dd1
Pin Pillow for now (#24633)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-07-03 12:24:46 +02:00
Serge Matveenko
d51aa48a76
Limit Pydantic to V1 in dependencies (#24596)
* Limit Pydantic to V1 in dependencies

Pydantic is about to release V2, which will break a lot of things. This change prevents `transformers` from being used with Pydantic V2 (see the pin sketch after this list).

* more
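
As a rough sketch, the pin pattern this commit describes; the exact layout of transformers' setup.py is an assumption, the version specifier is the point:

```python
# Hypothetical excerpt of a setup.py dependency list: any specifier below 2
# keeps installs on Pydantic V1 until V2 support is sorted out.
install_requires = [
    "pydantic<2",
]
```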

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-07-01 00:04:03 +02:00
Yih-Dar
299aafe55f
Use protobuf 4 (#24599)
* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-06-30 20:56:55 +02:00
Yih-Dar
11cb6e0f7e
Unpin DeepSpeed and require DS >= 0.9.3 (#24541)
* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-06-28 14:01:22 +02:00
Yih-Dar
e84bf1f734
⚠️ Time to say goodbye to py37 (#24091)
* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-06-28 07:22:39 +02:00
Matt
8e164c5400
Improved keras imports (#24448)
* An end to accursed version-specific imports

* No more K.is_keras_tensor() either

* Update dependency tables

* Use a cleaner call context function getter

* Add a cap to <2.14

* Add cap to examples requirements too
2023-06-23 19:09:34 +01:00
Sylvain Gugger
26a2ec56d7
Clean up old Accelerate checks (#24279)
* Clean up old Accelerate checks

* Put back imports
2023-06-14 12:44:09 -04:00
Sylvain Gugger
8c5f306719
Update the pin on Accelerate (#24110) 2023-06-08 10:11:01 -04:00
Sylvain Gugger
ba695c1efd
v4.31.0.dev0 2023-06-07 16:49:00 -04:00
Zachary Mueller
5eb3d3c702
Up pinned accelerate version (#24089)
* Min accelerate

* Also min version

* Min accelerate

* Also min version

* To different minor version

* Empty
2023-06-07 16:21:51 -04:00
Sylvain Gugger
9193188276
Pin rhoknp (#23937) 2023-06-01 10:25:43 -04:00
Zachary Mueller
55451c66ce
Upgrade safetensors version (#23911)
* Upgrade safetensors

* Second table
2023-05-31 11:30:39 -04:00
Sanchit Gandhi
8f915c450d
Unpin numba (#23162)
* fix for ragged list

* unpin numba

* make style

* np.object -> object

* propagate changes to tokenizer as well

* np.long -> "long"

* revert tokenization changes

* check with tokenization changes

* list/tuple logic

* catch numpy

* catch else case

* clean up

* up

* better check

* trigger ci

* Empty commit to trigger CI
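
A short sketch of the alias migration these bullets describe (array contents are illustrative):

```python
# np.object / np.long are removed NumPy aliases; the fix switches to the
# builtin object dtype (needed for ragged lists) and the "long" dtype string.
import numpy as np

ragged = np.array([[1, 2], [3]], dtype=object)  # was: dtype=np.object
ids = np.array([1, 2, 3], dtype="long")         # was: dtype=np.long
```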
2023-05-31 14:59:30 +01:00
Nicolas Patry
9e8d7066e6
Making safetensors a core dependency. (#23254)
* Making `safetensors` a core dependency.

To be merged later, I'm creating the PR so we can try it out.

* Update setup.py

* Remove duplicates.

* Even more redundant.
2023-05-23 15:16:34 +02:00
Sylvain Gugger
9cf4a8b456
Build with non Python files (#23405)
* Add a test of the built release

* Polish everything

* Trigger CI
2023-05-16 14:23:10 -04:00
Yih-Dar
a3975f94f3
Only add files with modifications outside doc blocks (#23327)
* min. version for pytest

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-05-12 16:35:15 +02:00
Sylvain Gugger
786b9cf5ca
Style 2023-05-11 14:40:38 -04:00
Lysandre Debut
71b19ee251
Agents extras (#23301)
* Agents extras

* Add to docs
2023-05-11 14:25:51 -04:00
José Ángel Rey Liñares
0c65fb7cfa
chore: allow protobuf 3.20.3 requirement (#22759)
* chore: allow protobuf 3.20.3

Allow latest bugfix release for protobuf (3.20.3)

* chore: update auto-generated dependency table

update auto-generated dependency table

* run in subprocess

* Apply suggestions from code review

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

* Apply suggestions

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
2023-05-10 20:22:56 +02:00
Sylvain Gugger
a0c0a78233
v4.30.0.dev0 2023-05-09 14:59:38 -04:00
Sylvain Gugger
94056b57be
New version of Accelerate for the Trainer (#23204) 2023-05-08 09:47:08 -04:00
Sylvain Gugger
3341bb41cd
Pin urllib3 2023-05-04 12:00:22 -04:00
Sylvain Gugger
4b6aecb48e
Pin numba for now (#23118) 2023-05-02 22:02:39 -04:00
amyeroberts
e5f3487190
Pin flax & optax version (#22895)
* Pin optax version

* Pin flax too

* Fixup
2023-04-20 17:30:14 +01:00
Zachary Mueller
aec10d162f
Update accelerate version + warning check fix (#22833) 2023-04-18 12:51:32 -04:00
Zachary Mueller
03462875cc
Introduce PartialState as the device handler in the Trainer (#22752)
* Use accelerate for device management

* Add accelerate to setup
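
Below is a minimal sketch of what `PartialState` provides as a device handler (attribute names from accelerate's public API; the actual Trainer integration is more involved):

```python
# Sketch: PartialState centralizes device placement and distributed-process
# info, replacing hand-rolled device-management logic in the Trainer.
from accelerate import PartialState

state = PartialState()
print(state.device)                                   # device for this process
print(state.process_index, "/", state.num_processes)  # rank / world size
```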


Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2023-04-17 15:09:45 -04:00
Sylvain Gugger
888c4a2ae0
v4.29.0.dev0 2023-04-12 20:04:29 -04:00
Sylvain Gugger
6db23af50c
Revert migration of setup to pyproject.toml (#22658) 2023-04-07 15:08:44 -04:00
Nicolas Patry
1670be4bde
Adding Llama FastTokenizer support. (#22264)
* Adding Llama FastTokenizer support.

- Requires the tokenizers version from https://github.com/huggingface/tokenizers/pull/1183
- Only support byte_fallback for llama, raise otherwise (safety net).
- Lots of open questions concern special tokens

How to test:

```python
from transformers.convert_slow_tokenizer import convert_slow_tokenizer
from transformers import AutoTokenizer
from tokenizers import Tokenizer

# Load the slow (sentencepiece-based) tokenizer as the reference.
tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b")

# Set to True to reload a previously converted tokenizer from disk
# instead of re-running the conversion.
reload_from_disk = False
if reload_from_disk:
    new_tokenizer = Tokenizer.from_file("tok.json")
else:
    new_tokenizer = convert_slow_tokenizer(tokenizer)
    new_tokenizer.save("tok.json")

strings = [
    "This is a test",
    "生活的真谛是",
    "生活的真谛是[MASK]。",
    # XXX: This one is problematic because of special tokens
    # "<s> Something something",
]

# The converted fast tokenizer must agree with the slow one on both
# encoding and decoding.
for string in strings:
    encoded = tokenizer(string)["input_ids"]
    encoded2 = new_tokenizer.encode(string).ids

    assert encoded == encoded2, f"{encoded} != {encoded2}"

    decoded = tokenizer.decode(encoded)
    decoded2 = new_tokenizer.decode(encoded2)

    assert decoded.strip() == decoded2, f"{repr(decoded)} != {repr(decoded2)}"
```
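
For completeness, a minimal sketch of loading the fast tokenizer directly once this support is merged (illustrative only; same checkpoint as above):

```python
# Hypothetical usage after this PR: AutoTokenizer can return the fast Llama
# tokenizer directly, with no manual conversion step.
from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained("huggingface/llama-7b", use_fast=True)
print(fast_tokenizer("This is a test")["input_ids"])
```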

The converter + some test script.

The test script.

Tmp save.

Adding Fast tokenizer + tests.

Adding the tokenization tests.

Correct combination.

Small fix.

Fixing tests.

Fixing with latest update.

Rebased.

fix copies + normalized added tokens + copies.

Adding doc.

TMP.

Doc + split files.

Doc.

Versions + try import.

Fix Camembert + warnings -> Error.

Fix by ArthurZucker.

Not a decorator.

* Fixing comments.

* Adding more to docstring.

* Doc rewriting.
2023-04-06 09:53:03 +02:00
Xuehai Pan
4169dc84bf
[setup] migrate setup script to pyproject.toml (#22539)
* [setup] migrate setup script to `pyproject.toml`

* [setup] cleanup configurations

* remove unused imports
2023-04-03 14:03:41 -04:00
Xuehai Pan
80d1319e1b
[setup] drop deprecated distutils usage (#22531)
* [setup] drop deprecated `distutils` usage

* drop deprecated `distutils.util.strtobool` usage

* fix import order

* reformat docstring by `doc-builder`
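
For reference, a sketch of the usual drop-in replacement for `distutils.util.strtobool` (the helper name and where it lives in transformers are assumptions):

```python
# Hypothetical local replacement mirroring distutils.util.strtobool semantics,
# since distutils is deprecated and removed in Python 3.12.
def strtobool(val: str) -> int:
    val = val.lower()
    if val in ("y", "yes", "t", "true", "on", "1"):
        return 1
    if val in ("n", "no", "f", "false", "off", "0"):
        return 0
    raise ValueError(f"invalid truth value {val!r}")
```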
2023-04-03 12:04:24 -04:00
Sylvain Gugger
2194943a34
Pin ruff (#22455) 2023-03-29 14:07:06 -04:00
Sylvain Gugger
4c295a265b
Update release instructions (#22454) 2023-03-29 14:05:42 -04:00
Joao Gante
88dae78f4d
TensorFlow: pin maximum version to 2.12 (#22364) 2023-03-24 18:45:03 +00:00
Sylvain Gugger
6587125c0a
Pin tensorflow-text to go with tensorflow (#22362)
* Pin tensorflow-text to go with tensorflow

* Make it more convenient to pin TensorFlow

* setup doesn't like f-strings
2023-03-24 10:54:06 -04:00
Stas Bekman
89a0a9eace
[deepspeed] offload + non-cpuadam optimizer exception doc (#22044)
* [deepspeed] offload + non-cpuadam optimizer exception doc

* deps
2023-03-21 17:00:05 -07:00
Ali Hassani
5990743fdd
Correct NATTEN function signatures and force new version (#22298) 2023-03-21 17:21:34 -04:00
Yih-Dar
67c2dbdb54
Time to Say Goodbye, torch 1.7 and 1.8 (#22291)
* time to say goodbye, torch 1.7 and 1.8

* clean up torch_int_div

* clean up is_torch_less_than_1_8-9

* update
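
A sketch of the shim being cleaned up; `torch_int_div` is named in the commit, while the body shown is the standard floor-division workaround and an assumption about the exact implementation:

```python
# With torch >= 1.9 guaranteed, the rounding_mode kwarg is always available,
# so the torch_int_div shim can be deleted in favor of direct calls.
import torch

def torch_int_div(a, b):  # the old shim, assumed implementation
    return torch.div(a, b, rounding_mode="floor")

a, b = torch.tensor([7, -7]), torch.tensor([2, 2])
assert torch.equal(torch_int_div(a, b), torch.div(a, b, rounding_mode="floor"))
```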

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-03-21 19:22:01 +01:00
Ali Hassani
3028b20a71
Fix natten (#22229)
* Add kernel size to NATTEN's QK arguments.

The new NATTEN 0.14.5 supports PyTorch 2.0, but also adds an additional
argument to the QK operation to allow optional RPBs.

This ends up failing NATTEN tests.

This commit adds NATTEN back to circleci and adds the arguments to get
it working again.

* Force NATTEN >= 0.14.5
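
A heavily hedged sketch of the call-site change (function name from `natten.functional`; tensor shapes are best-effort assumptions, so treat this as illustrative rather than exact):

```python
# Sketch: NATTEN 0.14.5's QK op takes an explicit kernel_size so the RPB can
# be optional; older call sites that omitted it started failing.
import torch
from natten.functional import natten2dqkrpb

batch, heads, height, width, dim, kernel_size = 1, 2, 8, 8, 16, 3
q = torch.randn(batch, heads, height, width, dim)
k = torch.randn(batch, heads, height, width, dim)
rpb = torch.randn(heads, 2 * kernel_size - 1, 2 * kernel_size - 1)

attn = natten2dqkrpb(q, k, rpb, kernel_size, 1)  # 0.14.5+: kernel_size explicit
```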
2023-03-17 11:07:55 -04:00