Commit Graph

456 Commits

Author SHA1 Message Date
Sylvain Gugger
b6e58db277
Speedup doc build (#8301)
* Try -j option

* Try other thing

* Bigger machine

* Test lower sphinx version

* Remove trailing space
2020-11-04 11:51:21 -05:00
Sylvain Gugger
c5f3149f95
Adjust setup so that all extras run on Windows (#8102) 2020-10-27 14:39:49 -04:00
Jason Wolosonovich
9fefdb0751
DEP: pinned sentencepiece to 0.1.91 in setup.py (#8069)
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-27 14:09:31 -04:00
Stas Bekman
edd3721cd4
update/add setup targets (#8076) 2020-10-27 13:54:57 -04:00
Lysandre
eb0e0ce2ad Release: v3.4.0 2020-10-20 16:22:26 +02:00
Funtowicz Morgan
8f8f8d99fc
Integrate Bert-like model on Flax runtime. (#3722)
* WIP flax bert

* Initial commit Bert Jax/Flax implementation.

* Embeddings working and equivalent to PyTorch.

* Move embeddings in its own module BertEmbeddings

* Added jax.jit annotation on forward call

* BertEncoder on par with PyTorch ! :D

* Add BertPooler on par with PyTorch !!

* Working Jax+Flax implementation of BertModel with < 1e-5 differences on the last layer.

* Fix pooled output to take only the first token of the sequence.

* Refactoring to use BertConfig from transformers.

* Renamed FXBertModel to FlaxBertModel

* Model is now initialized in FlaxBertModel constructor and reused.

* WIP JaxPreTrainedModel

* Cleaning up the code of FlaxBertModel

* Added ability to load Flax model saved through save_pretrained()

* Added ability to convert Pytorch Bert model to FlaxBert

* FlaxBert can now load every Pytorch Bert model with on-the-fly conversion

* Fix hardcoded shape values in conversion scripts.

* Improve the way we handle LayerNorm conversion from PyTorch to Flax.

* Added positional embeddings as parameter of BertModel with default to np.arange.

* Let's roll FlaxRoberta !

* Fix missing position_ids parameters on predict for Bert

* Flax backend now supports batched inputs

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Make it possible to load msgpacked model on convert from pytorch in last resort.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Moved save_pretrained to Jax base class along with more constructor parameters.

* Use specialized, model dependent conversion functio.

* Expose `is_flax_available` in file_utils.

* Added unittest for Flax models.

* Added run_tests_flax to the CI.

* Introduce FlaxAutoModel

* Added more unittests

* Flax model reference the _MODEL_ARCHIVE_MAP from PyTorch model.

* Addressing review comments.

* Expose seed in both Bert and Roberta

* Fix typo suggested by @stefan-it

Co-Authored-By: Stefan Schweter <stefan@schweter.it>

* Attempt to make style

* Attempt to make style in tests too

* Added jax & jaxlib to the flax optional dependencies.

* Attempt to fix flake8 warnings ...

* Redo black again and again

* When black and flake8 fight each other for a space ... 💥 💥 💥

* Try removing trailing comma to make both black and flake happy!

* Fix invalid is_<framework>_available call, thanks @LysandreJik 🎉

* Fix another invalid import in flax_roberta test

* Bump and pin flax release to 0.1.0.

* Make flake8 happy, remove unused jax import

* Change the type of the catch for msgpack.

* Remove unused import.

* Put seed as optional constructor parameter.

* trigger ci again

* Fix too much parameters in BertAttention.

* Formatting.

* Simplify Flax unittests to avoid machine crashes.

* Fix invalid number of arguments when raising issue for an unknown model.

* Address @bastings comment in PR, moving jax.jit decorated outside of __call__

* Fix incorrect path to require_flax/require_pytorch functions.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Attempt to make style.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Correct rebasing of circle-ci dependencies

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix import sorting.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix unused imports.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Again import sorting...

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Installing missing nlp dependency for flax unittests.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Fix laoding of model for Flax implementations.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* jit the inner function call to make JAX-compatible

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Format !

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Flake one more time 🎶

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Rewrites BERT in Flax to the new Linen API (#7211)

* Rewrite Flax HuggingFace PR to Linen

* Some fixes

* Fix tests

* Fix CI with change of name of nlp (#7054)

* nlp -> datasets

* More nlp -> datasets

* Woopsie

* More nlp -> datasets

* One last

* Expose `is_flax_available` in file_utils.

* Added run_tests_flax to the CI.

* Attempt to make style

* trigger ci again

* Fix import sorting.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Revert "Rewrites BERT in Flax to the new Linen API (#7211)"

This reverts commit 23703a5eb3364e26a1cbc3ee34b4710d86a674b0.

* Remove jnp.lax references

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Reintroduce Linen changes ...

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Use jax native's gelu function.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Renaming BertModel to BertModule to highlight the fact this is the Flax Module object.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Rewrite FlaxAutoModel test to not rely on pretrained_model_archive_map

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove unused variable in BertModule.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove unused variable in BertModule again

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Attempt to have is_flax_available working again.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Introduce JAX TensorType

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Improve ImportError message when trying to convert to various TensorType format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Makes Flax model jittable.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure flax models are jittable in unittests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove unused imports.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Ensure jax imports are guarded behind is_flax_available.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style again

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style again again

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style again again again

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Update src/transformers/file_utils.py

Co-authored-by: Marc van Zee <marcvanzee@gmail.com>

* Bump flax to it's latest version

Co-authored-by: Marc van Zee <marcvanzee@gmail.com>

* Bump jax version to at least 0.2.0

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Style.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Update the unittest to use TensorType.JAX

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* isort import in tests.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Match new flax parameters name "params"

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove unused imports.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Add flax models to transformers __init__

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Attempt to address all CI related comments.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Correct circle.yml indent.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Correct circle.yml indent (2)

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Remove coverage from flax tests

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Addressing many naming suggestions from comments

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Simplify for loop logic to interate over layers in FlaxBertLayerCollection

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* use f-string syntax for formatting logs.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Use config property from FlaxPreTrainedModel.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* use "cls_token" instead of "first_token" variable name.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* use "hidden_state" instead of "h" variable name.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Correct class reference in docstring to link to Flax related modules.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Added HF + Google Flax team copyright.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make Roberta independent from Bert

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Move activation functions to flax_utils.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Move activation functions to flax_utils for bert.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Added docstring for BERT

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Update import for Bert and Roberta tokenizers

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make style.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* fix-copies

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Correct FlaxRobertaLayer to match PyTorch.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Use the same store_artifact for flax unittest

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Style.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Make sure gradient are disabled only locally for flax unittest using torch equivalence.

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

* Use relative imports

Signed-off-by: Morgan Funtowicz <funtowiczmo@gmail.com>

Co-authored-by: Stefan Schweter <stefan@schweter.it>
Co-authored-by: Marc van Zee <marcvanzee@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-19 09:55:41 -04:00
Thomas Wolf
ba8c4d0ac0
[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659)
* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* spliting all the tokenizer files - sorting sentencepiece based ones

* update tokenizer version up to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉

* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update ad fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast  conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests lighten up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style quality split hebert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix hebert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-18 20:51:24 +02:00
Doug Blank
9618cd6964
Import integration libraries first (#7650)
* Import intergration libraries first

* isort and black happiness

* flake8 happiness

* Add a test

* Black reformat

* Ignore import order in tests

* A heavy-handed method of disabling comet for tests

* Remove comet_ml tests

* Run black on setup.py
2020-10-09 12:13:22 -04:00
sgugger
4dcc424de3 Complete release instruction 2020-10-09 12:12:03 -04:00
Sylvain Gugger
a3cea6a8cc
Better links for models in READMED and doc index (#7680) 2020-10-09 11:17:16 -04:00
sgugger
bc00b37a0d Revert "Better model links in the README and index"
This reverts commit 76e05518bb.
2020-10-09 10:56:13 -04:00
sgugger
76e05518bb Better model links in the README and index 2020-10-09 10:54:40 -04:00
Thomas Wolf
9aeacb58ba
Adding Fast tokenizers for SentencePiece based tokenizers - Breaking: remove Transfo-XL fast tokenizer (#7141)
* [WIP] SP tokenizers

* fixing tests for T5

* WIP tokenizers

* serialization

* update T5

* WIP T5 tokenization

* slow to fast conversion script

* Refactoring to move tokenzier implementations inside transformers

* Adding gpt - refactoring - quality

* WIP adding several tokenizers to the fast world

* WIP Roberta - moving implementations

* update to dev4 switch file loading to in-memory loading

* Updating and fixing

* advancing on the tokenizers - updating do_lower_case

* style and quality

* moving forward with tokenizers conversion and tests

* MBart, T5

* dumping the fast version of transformer XL

* Adding to autotokenizers + style/quality

* update init and space_between_special_tokens

* style and quality

* bump up tokenizers version

* add protobuf

* fix pickle Bert JP with Mecab

* fix newly added tokenizers

* style and quality

* fix bert japanese

* fix funnel

* limite tokenizer warning to one occurence

* clean up file

* fix new tokenizers

* fast tokenizers deep tests

* WIP adding all the special fast tests on the new fast tokenizers

* quick fix

* adding more fast tokenizers in the fast tests

* all tokenizers in fast version tested

* Adding BertGenerationFast

* bump up setup.py for CI

* remove BertGenerationFast (too early)

* bump up tokenizers version

* Clean old docstrings

* Typo

* Update following Lysandre comments

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
2020-10-08 11:32:16 +02:00
Sylvain Gugger
4c6728460a
Bump isort version. (#7484) 2020-09-30 13:44:58 -04:00
Sylvain Gugger
1ba08dc221 Release: v3.3.1 2020-09-29 14:17:34 -04:00
Lysandre
0613f05226 Release: v3.3.0 2020-09-28 16:24:43 +02:00
Ola Piktus
c754c41c61
RAG (#6813)
* added rag WIP

* path fix

* Formatting / renaming prior to actual work

* added rag WIP

* path fix

* Formatting / renaming prior to actual work

* added rag WIP

* path fix

* Formatting / renaming prior to actual work

* added rag WIP

* Formatting / renaming prior to actual work

* First commit

* improve comments

* Retrieval evaluation scripts

* refactor to include modeling outputs + MPI retriever

* Fix rag-token model + refactor

* Various fixes + finetuning logic

* use_bos fix

* Retrieval refactor

* Finetuning refactoring and cleanup

* Add documentation and cleanup

* Remove set_up_rag_env.sh file

* Fix retrieval wit HF index

* Fix import errors

* Fix quality errors

* Refactor as per suggestions in https://github.com/huggingface/transformers/pull/6813#issuecomment-687208867

* fix quality

* Fix RAG Sequence generation

* minor cleanup plus initial tests

* fix test

* fix tests 2

* Comments fix

* post-merge fixes

* Improve readme + post-rebase refactor

* Extra dependencied for tests

* Fix tests

* Fix tests 2

* Refactor test requirements

* Fix tests 3

* Post-rebase refactor

* rename nlp->datasets

* RAG integration tests

* add tokenizer to slow integration test and allow retriever to run on cpu

* add tests; fix position ids warning

* change structure

* change structure

* add from encoder generator

* save working solution

* make all integration tests pass

* add RagTokenizer.save/from_pretrained and RagRetriever.save/from_pretrained

* don't save paths

* delete unnecessary imports

* pass config to AutoTokenizer.from_pretrained for Rag tokenizers

* init wiki_dpr only once

* hardcode legacy index and passages paths (todo: add the right urls)

* finalize config

* finalize retriver api and config api

* LegacyIndex index download refactor

* add dpr to autotokenizer

* make from pretrained more flexible

* fix ragfortokengeneration

* small name changes in tokenizer

* add labels to models

* change default index name

* add retrieval tests

* finish token generate

* align test with previous version and make all tests pass

* add tests

* finalize tests

* implement thoms suggestions

* add first version of test

* make first tests work

* make retriever platform agnostic

* naming

* style

* add legacy index URL

* docstrings + simple retrieval test for distributed

* clean model api

* add doc_ids to retriever's outputs

* fix retrieval tests

* finish model outputs

* finalize model api

* fix generate problem for rag

* fix generate for other modles

* fix some tests

* save intermediate

* set generate to default

* big refactor generate

* delete rag_api

* correct pip faiss install

* fix auto tokenization test

* fix faiss install

* fix test

* move the distributed logic to examples

* model page

* docs

* finish tests

* fix dependencies

* fix import in __init__

* Refactor eval_rag and finetune scripts

* start docstring

* add psutil to test

* fix tf test

* move require torch to top

* fix retrieval test

* align naming

* finish automodel

* fix repo consistency

* test ragtokenizer save/load

* add rag model output docs

* fix ragtokenizer save/load from pretrained

* fix tokenizer dir

* remove torch in retrieval

* fix docs

* fixe finetune scripts

* finish model docs

* finish docs

* remove auto model for now

* add require torch

* remove solved todos

* integrate sylvains suggestions

* sams comments

* correct mistake on purpose

* improve README

* Add generation test cases

* fix rag token

* clean token generate

* fix test

* add note to test

* fix attention mask

* add t5 test for rag

* Fix handling prefix in finetune.py

* don't overwrite index_name

Co-authored-by: Patrick Lewis <plewis@fb.com>
Co-authored-by: Aleksandra Piktus <piktus@devfair0141.h2.fair>
Co-authored-by: Aleksandra Piktus <piktus@learnfair5102.h2.fair>
Co-authored-by: Aleksandra Piktus <piktus@learnfair5067.h2.fair>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
2020-09-22 18:29:58 +02:00
Lysandre
3ebb1b3a2b Release: v3.2.0 2020-09-22 17:36:51 +02:00
sgugger
206b78d485 Pin version of TF and torch 2020-09-14 14:08:51 -04:00
Stas Bekman
159ef07e4c
match CI's version of flake8 (#6941)
my flake8 wasn't up-to-date enough `make quality` wasn't reporting the same things CI did - this PR adds the actual required version.

Thinking more about some of these minimal versions - CI will always install afresh and thus will always run the latest version. Is there a way to tell pip to always install the latest versions of certain dependencies on `pip install -i ".[dev]"`, rather than hardcoding the minimals which quickly become outdated?
2020-09-07 08:12:25 -04:00
Stas Bekman
b4a9c95f1b
[testing] add dependency: parametrize (#6958)
unittest doesn't support pytest's super-handy `@pytest.mark.parametrize`, I researched and there are many proposed workarounds, most tedious at best. If we include https://pypi.org/project/parameterized/ in dev dependencies - it will provide a very easy to write parameterization in tests. Same as pytest's fixture, plus quite a few other ways. 

Example:
```
from parameterized import parameterized
@parameterized([
    (2, 2, 4),
    (2, 3, 8),
    (1, 9, 1),
    (0, 9, 0),
])
def test_pow(base, exponent, expected):
   assert_equal(math.pow(base, exponent), expected)
```
(extra `self`var if inside a test class)

To remind the pytest style is slightly different:
```
    @pytest.mark.parametrize("test_input,expected", [("3+5", 8), ("2+4", 6), ("6*9", 42)])
    def test_eval(test_input, expected):
```
More examples here: https://pypi.org/project/parameterized

May I suggest that it will make it much easier to write some types of tests?
2020-09-07 05:50:18 -04:00
Lysandre
4b3ee9cbc5 Release: v3.1.0 2020-09-01 14:27:52 +02:00
Stas Bekman
743d131d76
[style] set the minimal required version for black (#6784)
`make style` with `black` < 20.8b1 is a no go (in case some other package forced a lower version) - so make it explicit to avoid confusion
2020-08-28 11:38:09 +08:00
Funtowicz Morgan
ac9702c284
Fix ONNX test_quantize unittest (#6716) 2020-08-25 13:24:40 -04:00
Sylvain Gugger
a573777901
Update repo to isort v5 (#6686)
* Run new isort

* More changes

* Update CI, CONTRIBUTING and benchmarks
2020-08-24 11:03:01 -04:00
Masatoshi Suzuki
48c6c6139f
Support additional dictionaries for BERT Japanese tokenizers (#6515)
* Update BERT Japanese tokenizers

* Update CircleCI config to download unidic

* Specify to use the latest dictionary packages
2020-08-17 12:00:23 +08:00
Paul O'Leary McCann
cf3cf304ca
Replace mecab-python3 with fugashi for Japanese tokenization (#6086)
* Replace mecab-python3 with fugashi

This replaces mecab-python3 with fugashi for Japanese tokenization. I am
the maintainer of both projects.

Both projects are MeCab wrappers, so the underlying C++ code is the
same. fugashi is the newer wrapper and doesn't use SWIG, so for basic
use of the MeCab API it's easier to use.

This code insures the use of a version of ipadic installed via pip,
which should make versioning and tracking down issues easier.

fugashi has wheels for Windows, OSX, and Linux, which will help with
issues with installing old versions of mecab-python3 on Windows.
Compared to mecab-python3, because fugashi doesn't use SWIG, it doesn't
require a C++ runtime to be installed on Windows.

In adding this change I removed some code dealing with `cursor`,
`token_start`, and `token_end` variables. These variables didn't seem to
be used for anything, it is unclear to me why they were there.

I ran the tests and they passed, though I couldn't figure out how to run
the slow tests (`--runslow` gave an error) and didn't try testing with
Tensorflow.

* Style fix

* Remove unused variable

Forgot to delete this...

* Adapt doc with install instructions

* Fix typo

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-07-31 04:41:14 -04:00
Julien Plu
fc64559c45
Fix TF CTRL model naming (#6134) 2020-07-29 12:20:00 -04:00
sgugger
9d0d3a6645 Pin TF while we wait for a fix 2020-07-27 18:03:09 -04:00
Sebastian
eae6d8d14f
Update tokenizers to 0.8.1.rc to fix Mac OS X issues (#5867) 2020-07-18 08:20:11 -04:00
Lysandre
1d2332861f Post v3.0.2 release commit 2020-07-06 18:56:47 -04:00
Lysandre
b0892fa0e8 Release: v3.0.2 2020-07-06 18:49:44 -04:00
Anthony MOI
5787e4c159
Various tokenizers fixes (#5558)
* BertTokenizerFast - Do not specify strip_accents by default

* Bump tokenizers to new version

* Add test for AddedToken serialization
2020-07-06 18:27:53 -04:00
Thomas Wolf
b58a15a31e unpining specific git versions in setup.py 2020-07-03 17:38:39 +02:00
Thomas Wolf
fedabcd154 Release: 3.0.1 2020-07-03 17:02:44 +02:00
Lysandre Debut
69d313e808
Bans SentencePiece 0.1.92 (#5418) 2020-07-02 09:23:00 -04:00
Lysandre
90d13954c4 Repin versions 2020-06-30 09:16:36 -04:00
Lysandre
b62ca59527 Release: v3.0.0 2020-06-29 10:40:13 -04:00
Sylvain Gugger
482c9178d3
Pin mecab for now (#5362) 2020-06-29 09:51:13 -04:00
Lysandre Debut
364a5ae1f0
Refactor Code samples; Test code samples (#5036)
* Refactor code samples

* Test docstrings

* Style

* Tokenization examples

* Run rust of tests

* First step to testing source docs

* Style and BART comment

* Test the remainder of the code samples

* Style

* let to const

* Formatting fixes

* Ready for merge

* Fix fixture + Style

* Fix last tests

* Update docs/source/quicktour.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Addressing @sgugger's comments + Fix MobileBERT in TF

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-06-25 16:46:00 -04:00
Thomas Wolf
11fdde0271
Tokenizers API developments (#5103)
* Add return lengths

* make pad a bit more flexible so it can be used as collate_fn

* check all kwargs sent to encoding method are known

* fixing kwargs in encodings

* New AddedToken class in python

This class let you specify specifique tokenization behaviors for some special tokens. Used in particular for GPT2 and Roberta, to control how white spaces are stripped around special tokens.

* style and quality

* switched to hugginface tokenizers library for AddedTokens

* up to tokenizer 0.8.0-rc3 - update API to use AddedToken state

* style and quality

* do not raise an error on additional or unused kwargs for tokenize() but only a warning

* transfo-xl pretrained model requires torch

* Update src/transformers/tokenization_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-06-23 13:36:57 +02:00
Lysandre Debut
973433260e
Pin sphinx-rtd-theme (#5128) 2020-06-18 18:07:59 -04:00
Anthony MOI
36434220fc
[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)
* Use tokenizers pre-tokenized pipeline

* failing pretrokenized test

* Fix is_pretokenized in python

* add pretokenized tests

* style and quality

* better tests for batched pretokenized inputs

* tokenizers clean up - new padding_strategy - split the files

* [HUGE] refactoring tokenizers - padding - truncation - tests

* style and quality

* bump up requied tokenizers version to 0.8.0-rc1

* switched padding/truncation API - simpler better backward compat

* updating tests for custom tokenizers

* style and quality - tests on pad

* fix QA pipeline

* fix backward compatibility for max_length only

* style and quality

* Various cleans up - add verbose

* fix tests

* update docstrings

* Fix tests

* Docs reformatted

* __call__ method documented

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-06-15 17:12:51 -04:00
Patrick von Platen
2cfb947f59
[Benchmark] add tpu and torchscipt for benchmark (#4850)
* add tpu and torchscipt for benchmark

* fix name in tests

* "fix email"

* make style

* better log message for tpu

* add more print and info for tpu

* allow possibility to print tpu metrics

* correct cpu usage

* fix test for non-install

* remove bugus file

* include psutil in testing

* run a couple of times before tracing in torchscript

* do not allow tpu memory tracing for now

* make style

* add torchscript to env

* better name for torch tpu

Co-authored-by: Patrick von Platen <patrick@huggingface.co>
2020-06-09 23:12:43 +02:00
Lysandre
d976ef262e Repin versions 2020-06-02 10:27:15 -04:00
Lysandre
b43c78e5d3 Release: v2.11.0 2020-06-02 09:49:09 -04:00
Bram Vanroy
8cc6807e89
Make transformers-cli cross-platform (#4131)
* make transformers-cli cross-platform

Using "scripts" is a useful option in setup.py particularly when you want to get access to non-python scripts. However, in this case we want to have an entry point into some of our own Python scripts. To do this in a concise, cross-platfom way, we can use entry_points.console_scripts. This change is necessary to provide the CLI on different platforms, which "scripts" does not ensure. Usage remains the same, but the "transformers-cli" script has to be moved (be part of the library) and renamed (underscore + extension)

* make style & quality
2020-05-26 10:00:51 -04:00
Julien Chaumond
2c1ebb8b50 Re-apply #4446 + add packaging dependency
As discussed w/ @lysandrejik

packaging is maintained by PyPA (the Python Packaging Authority), and should be lightweight and stable
2020-05-22 17:29:03 -04:00
Lysandre
ef22ba4836 Re-pin versions 2020-05-22 11:03:07 -04:00
Lysandre
e0db6bbd65 Release: v2.10.0 2020-05-22 10:37:44 -04:00
Funtowicz Morgan
db0076a9df
Conversion script to export transformers models to ONNX IR. (#4253)
* Added generic ONNX conversion script for PyTorch model.

* WIP initial TF support.

* TensorFlow/Keras ONNX export working.

* Print framework version info

* Add possibility to check the model is correctly loading on ONNX runtime.

* Remove quantization option.

* Specify ONNX opset version when exporting.

* Formatting.

* Remove unused imports.

* Make functions more generally reusable from other part of the code.

* isort happy.

* flake happy

* Export only feature-extraction for now

* Correctly check inputs order / filter before export.

* Removed task variable

* Fix invalid args call in load_graph_from_args.

* Fix invalid args call in convert.

* Fix invalid args call in infer_shapes.

* Raise exception and catch in caller function instead of exit.

* Add 04-onnx-export.ipynb notebook

* More WIP on the notebook

* Remove unused imports

* Simplify & remove unused constants.

* Export with constant_folding in PyTorch

* Let's try to put function args in the right order this time ...

* Disable external_data_format temporary

* ONNX notebook draft ready.

* Updated notebooks charts + wording

* Correct error while exporting last chart in notebook.

* Adressing @LysandreJik comment.

* Set ONNX opset to 11 as default value.

* Set opset param mandatory

* Added ONNX export unittests

* Quality.

* flake8 happy

* Add keras2onnx dependency on extras["tf"]

* Pin keras2onnx on github master to v1.6.5

* Second attempt.

* Third attempt.

* Use the right repo URL this time ...

* Do the same for onnxconverter-common

* Added keras2onnx and onnxconveter-common to 1.7.0 to supports TF2.2

* Correct commit hash.

* Addressing PR review: Optimization are enabled by default.

* Addressing PR review: small changes in the notebook

* setup.py comment about keras2onnx versioning.
2020-05-14 16:35:52 -04:00
Julien Chaumond
448c467256
Fix: unpin flake8 and fix cs errors (#4367)
* Fix: unpin flake8 and fix cs errors

* Ok we still need to quote those
2020-05-14 13:14:26 -04:00
Julien Chaumond
015f7812ed [ci skip] Pin isort 2020-05-14 10:12:18 -04:00
Lysandre
7cb203fae4 Release: v2.9.1 2020-05-13 17:38:50 -04:00
Funtowicz Morgan
7d7fe4997f
Allow BatchEncoding to be initialized empty. (#4316)
* Allow BatchEncoding to be initialized empty.

This is required by recent changes introduced in TF 2.2.

* Attempt to unpin Tensorflow to 2.2 with the previous commit.
2020-05-12 15:02:46 -04:00
Lysandre Debut
30e343862f
pin TF to 2.1 (#4297)
* pin TF to 2.1

* Pin flake8 as well
2020-05-11 21:03:30 -04:00
Julien Plu
94b57bf796
[TF 2.2 compat] use tf.VariableAggregation.ONLY_FIRST_REPLICA (#4283)
* Fix the issue to properly run the accumulator with TF 2.2

* Apply style

* Fix training_args_tf for TF 2.2

* Fix the TF training args when only one GPU is available

* Remove the fixed version of TF in setup.py
2020-05-11 11:28:37 -04:00
Lysandre
2e57824374 Pin isort and tf <= 2.1.0 2020-05-07 14:42:00 -04:00
Lysandre
e7cfc1a313 Release: v2.9.0 2020-05-07 14:15:20 -04:00
Lysandre Debut
79b1c6966b
Pytorch 1.5.0 (#3973)
* Standard deviation can no longer be set to 0

* Remove torch pinned version

* 9th instead of 10th, silly me
2020-05-05 10:23:01 -04:00
Sam Shleifer
18db92dd9a
[testing] add timeout_decorator (#3543) 2020-05-01 09:05:47 -04:00
Julien Chaumond
97a375484c rm boto3 dependency 2020-04-27 11:17:14 -04:00
Anthony MOI
13dd2acca4
Bump tokenizers version to final 0.7.0 (#3898) 2020-04-22 11:02:29 -04:00
Julien Chaumond
eb5601b0a5 [ci] Pin torch version while we update 2020-04-21 15:46:18 -04:00
Thomas Wolf
827d6d6ef0
Cleanup fast tokenizers integration (#3706)
* First pass on utility classes and python tokenizers

* finishing cleanup pass

* style and quality

* Fix tests

* Updating following @mfuntowicz comment

* style and quality

* Fix Roberta

* fix batch_size/seq_length inBatchEncoding

* add alignement methods + tests

* Fix OpenAI and Transfo-XL tokenizers

* adding trim_offsets=True default for GPT2 et RoBERTa

* style and quality

* fix tests

* add_prefix_space in roberta

* bump up tokenizers to rc7

* style

* unfortunately tensorfow does like these - removing shape/seq_len for now

* Update src/transformers/tokenization_utils.py

Co-Authored-By: Stefan Schweter <stefan@schweter.it>

* Adding doc and docstrings

* making flake8 happy

Co-authored-by: Stefan Schweter <stefan@schweter.it>
2020-04-18 13:43:57 +02:00
Anthony MOI
b7cf9f43d2
Update tokenizers to 0.7.0-rc5 (#3705) 2020-04-10 14:23:49 -04:00
Funtowicz Morgan
96ab75b8dd
Tokenizers v3.0.0 (#3185)
* Renamed num_added_tokens to num_special_tokens_to_add

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Cherry-Pick: Partially fix space only input without special tokens added to the output #3091

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Make fast tokenizers unittests work on Windows.

* Entirely refactored unittest for tokenizers fast.

* Remove ABC class for CommonFastTokenizerTest

* Added embeded_special_tokens tests from allenai @dirkgr

* Make embeded_special_tokens tests from allenai more generic

* Uniformize vocab_size as a property for both Fast and normal tokenizers

* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)

* Ensure providing None input raise the same ValueError than Python tokenizer + tests.

* Fix invalid input for assert_padding when testing batch_encode_plus

* Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.

* Ensure tokenize() correctly forward add_special_tokens to rust.

* Adding None checking on top on encode / encode_batch for TransfoXLTokenizerFast.
Avoid stripping on None values.

* unittests ensure tokenize() also throws a ValueError if provided None

* Added add_special_tokens unittest for all supported models.

* Style

* Make sure TransfoXL test run only if PyTorch is provided.

* Split up tokenizers tests for each model type.

* Fix invalid unittest with new tokenizers API.

* Filter out Roberta openai detector models from unittests.

* Introduce BatchEncoding on fast tokenizers path.

This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward.

* Introduce BatchEncoding on slow tokenizers path.

Backward compatibility.

* Improve error message on BatchEncoding for slow path

* Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.

* Style and format.

* Added typing on all methods for PretrainedTokenizerFast

* Style and format

* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.

* Style and format

* encode_plus now supports pretokenized inputs.

* Remove user warning about add_special_tokens when working on pretokenized inputs.

* Always go through the post processor.

* Added support for pretokenized input pairs on encode_plus

* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.

* Added pretokenized inputs support on batch_encode_plus

* Update BatchEncoding methods name to match Encoding.

* Bump setup.py tokenizers dependency to 0.7.0rc1

* Remove unused parameters in BertTokenizerFast

* Make sure Roberta returns token_type_ids for unittests.

* Added missing typings

* Update add_tokens prototype to match tokenizers side and allow AddedToken

* Bumping tokenizers to 0.7.0rc2

* Added documentation for BatchEncoding

* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.

* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.

* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.

* Fix text-classification pipeline using the wrong tokenizer

* Make pipelines works with BatchEncoding

* Turn off add_special_tokens on tokenize by default.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove add_prefix_space from tokenize call in unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style and quality

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Correct message for batch_encode_plus none input exception.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix invalid list comprehension for offset_mapping overriding content every iteration.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXL uses Strip normalizer.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.7.0rc3

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* SpecilaTokenMixin can use slots to faster access to underlying attributes.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove update_special_tokens from fast tokenizers.

* Ensure TransfoXL unittests are run only when torch is available.

* Style.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style

* Style 🙏🙏

* Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.

* Remove Roberta warning on __init__.

* Move documentation to Google style.

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2020-04-07 00:29:15 +02:00
LysandreJik
ea6dba2787 Re-pin isort 2020-04-06 10:09:54 -04:00
LysandreJik
11c3257a18 unpin isort for pypi 2020-04-06 10:06:41 -04:00
LysandreJik
36bffc81b3 Release: v2.8.0 2020-04-06 10:03:53 -04:00
LysandreJik
eff757f2e3 Re-pin isort version 2020-03-30 09:00:47 -04:00
LysandreJik
a009d751c2 Un-pin isort for v2.7.0 pypi 2020-03-30 08:55:10 -04:00
LysandreJik
6f5a12a583 Release: v2.7.0 2020-03-30 08:49:24 -04:00
Patrick von Platen
b4fb94fe6d revert unpin isort commit 2020-03-26 13:19:18 -04:00
Julien Chaumond
83272a3853
Experiment w/ dataclasses (including Py36) (#3423)
* [ci] Also run test_examples in py37

(will revert at the end of the experiment)

* InputExample: use immutable dataclass

* [deps] Install dataclasses for Py<3.7

* [skip ci] Revert "[ci] Also run test_examples in py37"

This reverts commit d29afd9959.
2020-03-25 11:10:20 -04:00
LysandreJik
fbc5bf10cf v2.6.0 release: isort un-pinned 2020-03-24 11:52:02 -04:00
LysandreJik
471cce24b3 Release: v2.6.0 2020-03-24 10:37:32 -04:00
Julien Chaumond
ec6766a363 [deps] scikit-learn's transient issue was fixed 2020-03-23 18:38:09 -04:00
LysandreJik
e52482909b Correct order for dev/quality dependencies
cc @julien-c
2020-03-23 12:01:23 -04:00
Julien Chaumond
18eec3a984 [ci] simpler way to load correct version of isort
hat/tip @bramvanroy
2020-03-23 10:03:22 -04:00
Bram Vanroy
115abd2166 Handle pinned version of isort
The CONTRIBUTING file pins to a specific version of isort, so we might as well install that in `dev` . This makes it easier for contributors so they don't have to manually install the specific commit.
2020-03-20 18:00:04 -04:00
Thomas Wolf
2187c49f5c
CPU/GPU memory benchmarking utilities - Remove support for python 3.5 (now only 3.6+) (#3186)
* memory benchmark rss

* have both forward pass and line-by-line mem tracing

* cleaned up tracing

* refactored and cleaning up API

* no f-strings yet...

* add GPU mem logging

* fix GPU memory monitoring

* style and quality

* clean up and doc

* update with comments

* Switching to python 3.6+

* fix quality
2020-03-17 10:17:11 -04:00
Patrick von Platen
34de670dbe
fix sklearn release circle ci [temporary] (#3123) 2020-03-04 11:25:23 -05:00
Lysandre
f9ec5ca90b Release: v2.5.1 2020-02-24 18:22:54 -05:00
Funtowicz Morgan
4cd9c0971c
Fix for fast tokenizers save_pretrained compatibility with Python. (#2933)
* Renamed file generate by tokenizers when calling save_pretrained to match python.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added save_vocabulary tests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove python quick and dirty fix for clean Rust impl.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.5.1

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXLTokenizerFast uses a json vocabulary file + warning about incompatibility between Python and Rust

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added some save_pretrained / from_pretrained unittests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update tokenizers to 0.5.2

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Quality and format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* flake8

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Making sure there is really a bug in unittest

* Fix TransfoXL constructor vocab_file / pretrained_vocab_file mixin.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-24 18:20:42 -05:00
Lysandre
59c23ad9c9 README link + better instructions for release 2020-02-19 11:57:17 -05:00
Lysandre
fb560dcb07 Release: v2.5.0
Welcome Rust Tokenizers
2020-02-19 11:46:19 -05:00
Funtowicz Morgan
3f3fa7f7da
Integrate fast tokenizers library inside transformers (#2674)
* Implemented fast version of tokenizers

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bumped tokenizers version requirements to latest 0.2.1

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added matching tests

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Matching OpenAI GPT tokenization !

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Matching GPT2 on tokenizers

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Expose add_prefix_space as constructor parameter for GPT2

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Matching Roberta tokenization !

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Removed fast implementation of CTRL.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Binding TransformerXL tokenizers to Rust.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Updating tests accordingly.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added tokenizers as top-level modules.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Black & isort.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Rename LookupTable to WordLevel to match Rust side.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Black.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Use "fast" suffix instead of "ru" for rust tokenizers implementations.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Introduce tokenize() method on fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* encode_plus dispatchs to batch_encode_plus

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* batch_encode_plus now dispatchs to encode if there is only one input element.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bind all the encode_plus parameter to the forwarded batch_encode_plus call.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.3.0

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Formatting.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix tokenization_auto with support for new (python, fast) mapping schema.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Give correct fixtures path in test_tokenization_fast.py for the CLI.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Expose max_len_ properties on BertTokenizerFast

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Move max_len_ properties to PreTrainedTokenizerFast and override in specific subclasses.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* _convert_encoding should keep the batch axis tensor if only one sample in the batch.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add warning message for RobertaTokenizerFast if used for MLM.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added use_fast (bool) parameter on AutoTokenizer.from_pretrained().

This allows to easily enable/disable Rust-based tokenizer instantiation.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Let's tokenizers handle all the truncation and padding stuff.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Allow to provide tokenizer arguments during pipeline creation.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update test_fill_mask pipeline to not use fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix too much parameters for convert_encoding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* When enabling padding, max_length should be set to None.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Avoid returning nested tensors of length 1 when calling encode_plus

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure output is padded when return_tensor is not None.

Tensor creation requires the inital list input to be of the exact same size.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Disable transfoxl unittest if pytorch is not available (required to load the model)

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* encode_plus should not remove the leading batch axis if return_tensor is set

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Temporary disable fast tokenizers on QA pipelines.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix formatting issues.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Update tokenizers to 0.4.0

* Update style

* Enable truncation + stride unit test on fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Add unittest ensuring special_tokens set match between Python and Rust.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure special_tokens are correctly set during construction.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Give more warning feedback to the user in case of padding without pad_token.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* quality & format.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added possibility to add a single token as str

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added unittest for add_tokens and add_special_tokens on fast tokenizers.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix rebase mismatch on pipelines qa default model.

QA requires cased input while the tokenizers would be uncased.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Using offset mapping relative to the original string + unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: save_vocabulary requires folder and file name

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Simplify import for Bert.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: truncate_and_pad disables padding according to the same heuristic than the one enabling padding.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Remove private member access in tokenize()

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Bump tokenizers dependency to 0.4.2

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* format & quality.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Use named arguments when applicable.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Add Github link to Roberta/GPT2 space issue on masked input.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Move max_len_single_sentence / max_len_sentences_pair to PreTrainedTokenizerFast + tests.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Relax type checking to include tuple and list object.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Addressing review comment: Document the truncate_and_pad manager behavior.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Raise an exception if return_offsets_mapping is not available with the current tokenizer.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Ensure padding is set on the tokenizers before setting any padding strategy + unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* On pytorch we need to stack tensor to get proper new axis.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Generalize tests to different framework removing hard written return_tensors="..."

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizer dependency for num_special_tokens_to_add

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Overflowing tokens in batch_encode_plus are now stacked over the batch axis.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Improved error message for padding strategy without pad token.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bumping tokenizers dependency to 0.5.0 for release.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Optimizing convert_encoding around 4x improvement. 🚀

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* expose pad_to_max_length in encode_plus to avoid duplicating the parameters in kwargs

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Generate a proper overflow_to_sampling_mapping when return_overflowing_tokens is True.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix unittests for overflow_to_sampling_mapping not being returned as tensor.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Format & quality.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove perfect alignment constraint for Roberta (allowing 1% difference max)

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Triggering final CI

Co-authored-by: MOI Anthony <xn1t0x@gmail.com>
2020-02-19 11:35:40 -05:00
Morgan Funtowicz
6aa7973aec Fix circleci cuInit error on Tensorflow >= 2.1.0.
Tensorflow 2.1.0 introduce a new dependency model where pip install tensorflow would install tf with GPU support.
Before it would just install with CPU support, thus CircleCI is looking for NVidia driver version at initialization of the
tensorflow related tests but fails as their is no NVidia Driver running.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>
2020-02-10 13:24:37 +01:00
Lysandre
d426b58b9e Patch: v2.4.1 2020-01-31 14:55:33 -05:00
Lysandre
8036ceb7c5 Update commands for pypi test 2020-01-31 09:48:15 -05:00
Lysandre
6664ea943d Release: v2.4.0 2020-01-31 09:40:32 -05:00
Julien Chaumond
5004d5af42 [serving] Update dependencies 2020-01-27 19:58:00 -05:00
Brendan Roof
23c6998bf4 Add lower bound to tqdm for tqdm.auto
- It appears that `tqdm` only introduced `tqdm.auto` in 4.27.
- See https://github.com/tqdm/tqdm/releases/tag/v4.27.0.
- Without the lower bound I received the following stack trace in an environment where I already had tqdm installed:
```
  File "/home/brendanr/anaconda3/envs/allennlp/lib/python3.6/site-packages/transformers/__init__.py", line 20, in <module>
    from .file_utils import (TRANSFORMERS_CACHE, PYTORCH_TRANSFORMERS_CACHE, PYTORCH_PRETRAINED_BERT_CACHE,
  File "/home/brendanr/anaconda3/envs/allennlp/lib/python3.6/site-packages/transformers/file_utils.py", line 24, in <module>
    from tqdm.auto import tqdm
ModuleNotFoundError: No module named 'tqdm.auto'
```
2020-01-17 18:29:11 -05:00
alberduris
81d6841b4b GPU text generation: mMoved the encoded_prompt to correct device 2020-01-06 15:11:12 +01:00
alberduris
dd4df80f0b Moved the encoded_prompts to correct device 2020-01-06 15:11:12 +01:00
Anthony MOI
bfe870be65
Hotfix tokenizers version for sdist installs 2019-12-27 11:05:52 -05:00
Anthony MOI
1f82a5d910
Update for changes in tokenizers API 2019-12-26 14:37:55 -05:00
Anthony MOI
734d29b03d
tokenizers is now a real dependency 2019-12-24 13:32:41 -05:00
Aymeric Augustin
7a865821d9 Remove stray egg-info directory automatically.
If a user or contributor ran `pip install -e .` on transformers < 3.0,
pip created a transformers.egg-info directory next to the transformers
directory at the root of the repository.

In transformers 3.0, the source is in a `src` subdirectory.
`pip install -e .` creates a transformers.egg-info directory there.
However, pip will still pick transformers.egg-info from the previous
location. This is a bug: https://github.com/pypa/pip/issues/5466

Users and contributors are likely to hit this problem because the
documentation for transformers 3.0 relies heavily on extra_requires
which didn't exist in earlier versions, so aren't defined in a stale
transformers.egg-info directory.

If such a directory exists, remove it. It's autogenerated, gitignored
and not supposed to contain anything of value.
2019-12-23 21:06:23 +01:00
Aymeric Augustin
d73eb552e8 Remove requirements.txt.
It's redundant with setup.py and, also, incomplete (e.g. numpy).
2019-12-23 19:15:08 +01:00
Aymeric Augustin
76a1417f2a Include all optional dependencies in extras.
Take advantage of this to simplify the Circle CI configuration.

Don't bother with tensorboardX: it's a fallback for PyTorch < 1.1.0.
2019-12-23 19:14:31 +01:00
Aymeric Augustin
f2522869ea Review and update setup.py. 2019-12-23 18:45:42 +01:00
Aymeric Augustin
1c62e87b34 Use built-in open().
On Python 3, `open is io.open`.
2019-12-22 18:38:56 +01:00
Aymeric Augustin
d6eaf4e6d2 Update comments mentioning Python 2. 2019-12-22 18:38:56 +01:00
Aymeric Augustin
6be7cdda66 Move source code inside a src subdirectory.
This prevents transformers from being importable simply because the CWD
is the root of the git repository, while not being importable from other
directories. That led to inconsistent behavior, especially in examples.

Once you fetch this commit, in your dev environment, you must run:

    $ pip uninstall transformers
    $ pip install -e .
2019-12-22 14:15:13 +01:00
Aymeric Augustin
158e82e061 Sort imports with isort.
This is the result of:

    $ isort --recursive examples templates transformers utils hubconf.py setup.py
2019-12-22 10:57:46 +01:00
Aymeric Augustin
fa84ae26d6 Reformat source code with black.
This is the result of:

    $ black --line-length 119 examples templates transformers utils hubconf.py setup.py

There's a lot of fairly long lines in the project. As a consequence, I'm
picking the longest widely accepted line length, 119 characters.

This is also Thomas' preference, because it allows for explicit variable
names, to make the code easier to understand.
2019-12-21 17:52:29 +01:00
Aymeric Augustin
a4c9338b83 Prevent parallel downloads of the same file with a lock.
Since the file is written to the filesystem, a filesystem lock is the
way to go here. Add a dependency on the third-party filelock library to
get cross-platform functionality.
2019-12-21 08:43:19 +01:00
Lysandre
a436574bfd Release: v2.3.0 2019-12-20 16:22:20 -05:00
thomwolf
73fcebf7ec update serving command 2019-12-20 13:47:35 +01:00
thomwolf
407093b3fa Merge branch 'cli' of https://github.com/huggingface/transformers into cli 2019-12-19 20:26:51 +01:00
thomwolf
c7be096c39 Merge branch 'master' into cli 2019-12-19 20:26:08 +01:00
Morgan Funtowicz
a305067f2d Removed __main__ 2019-12-19 19:41:48 +01:00
Lysandre
5e289f69bc regex 2019.12.17 install fails with Python 2 2019-12-17 15:54:05 -05:00
Morgan Funtowicz
d7c62661a3 Provide serving dependencies for tensorflow and pytorch (serving-tf, serving-torch) 2019-12-17 11:23:39 +01:00
Lysandre
7bd11dda6f Release: v2.2.2 2019-12-13 16:45:30 -05:00
thomwolf
72c36b9ea2 [WIP] - CLI 2019-12-10 11:33:14 +01:00
Aymeric Augustin
35401fe50f Remove dependency on pytest for running tests (#2055)
* Switch to plain unittest for skipping slow tests.

Add a RUN_SLOW environment variable for running them.

* Switch to plain unittest for PyTorch dependency.

* Switch to plain unittest for TensorFlow dependency.

* Avoid leaking open files in the test suite.

This prevents spurious warnings when running tests.

* Fix unicode warning on Python 2 when running tests.

The warning was:

    UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal

* Support running PyTorch tests on a GPU.

Reverts 27e015bd.

* Tests no longer require pytest.

* Make tests pass on cuda
2019-12-06 13:57:38 -05:00
Julien Chaumond
e4fbf3e2cc CLI for authenticated file sharing 2019-12-04 00:52:23 -05:00
LysandreJik
8101924a68 Patch: v2.2.1 2019-12-03 11:20:26 -05:00
Lysandre
ae98d45991 Release: v2.2.0 2019-11-26 14:12:44 -05:00
Lysandre
3ddce1d74c Release: 2.1.1 2019-10-11 06:37:49 -04:00
LysandreJik
9c2e0a4acf Release: 2.1.0 2019-10-09 12:14:03 -04:00
thomwolf
e4e35296fb update setup.py metadata 2019-09-26 13:52:24 +02:00
thomwolf
9676d1a2a8 update readme and setup.py 2019-09-26 13:47:58 +02:00
thomwolf
31c23bd5ee [BIG] pytorch-transformers => transformers 2019-09-26 10:15:53 +02:00
thomwolf
9d0a11a68c update dependencies and circle-ci 2019-09-08 15:02:06 +03:00
thomwolf
89fd3450a6 Release: 1.2.0 2019-09-04 13:32:18 +02:00
Shijie Wu
ca4baf8ca1 Match order of casing in OSS XLM; Improve document; Clean up dependency 2019-08-27 20:03:18 -04:00
Shijie Wu
e85123d398 Add custom tokenizer for zh and ja 2019-08-23 20:27:52 -04:00
Shijie Wu
436ce07218 Tokenization behave the same as original XLM proprocessing for most languages except zh, ja and th; Change API to allow specifying language in tokenize 2019-08-23 14:40:17 -04:00
LysandreJik
fe02e45e48 Release: 1.1.0 2019-08-15 11:15:08 -04:00
thomwolf
58830807d1 inidicate we only support pytorch 1.0.0+ now 2019-08-05 14:38:59 +02:00
thomwolf
6b70760204 typos 2019-07-16 21:21:03 +02:00
thomwolf
ed7549bb1a release version 1.0 2019-07-16 16:10:58 +02:00
thomwolf
eb91f6437e update readme and setup 2019-07-05 12:30:15 +02:00
thomwolf
0bab55d5d5 [BIG] name change 2019-07-05 11:55:36 +02:00
thomwolf
32da75486b add tokenizer and tests 2019-06-21 11:09:51 +02:00
thomwolf
b832d5bb8a Release: 0.6.2 2019-04-25 21:37:47 +02:00
thomwolf
e0855e8929 forgot to add regex to requirements :( 2019-02-18 11:54:51 +01:00
thomwolf
009ee86a19 fix tests - bump up version 2019-02-17 23:57:23 +01:00
thomwolf
321d70a7a9 bump up to 0.5.1 2019-02-13 10:11:20 +01:00
thomwolf
448937c00d python 2 compatibility 2019-02-06 00:07:46 +01:00
thomwolf
eed51c5bdf add OpenAI GPT 2019-01-08 12:26:58 +01:00
Patrick Sodré
87c1244c7d Convert scripts into entry_points
The recommended approach to create launch scripts is to use entry_points
and console_scripts.

xref: https://packaging.python.org/guides/distributing-packages-using-setuptools/#scripts
2018-12-19 02:26:08 +00:00
thomwolf
ae88eb88a4 set encoding to 'utf-8' in calls to open 2018-12-14 13:48:58 +01:00
thomwolf
1cbb32a542 include version number + comment in setup.py 2018-12-13 12:50:44 +01:00
thomwolf
ce52177638 added version in __init__.py 2018-12-13 12:50:44 +01:00
thomwolf
258eb50086 bump up version 2018-11-30 22:55:33 +01:00
thomwolf
ce37b8e481 bump version in setup.py 2018-11-26 10:45:48 +01:00
thomwolf
c8cba67742 clean up readme and examples 2018-11-17 12:19:16 +01:00
thomwolf
a99b971738 bump up version minor 2018-11-17 10:43:39 +01:00
thomwolf
4e46affc34 updating examples 2018-11-17 10:30:54 +01:00
thomwolf
f920eff8c3 update readme 2018-11-17 08:42:45 +01:00
thomwolf
1de35b624b preparing for first release 2018-11-15 20:56:10 +01:00