Commit Graph

137 Commits

Author SHA1 Message Date
Stas Bekman
159ef07e4c
match CI's version of flake8 (#6941)
my flake8 wasn't up-to-date enough `make quality` wasn't reporting the same things CI did - this PR adds the actual required version.

Thinking more about some of these minimal versions - CI will always install afresh and thus will always run the latest version. Is there a way to tell pip to always install the latest versions of certain dependencies on `pip install -i ".[dev]"`, rather than hardcoding the minimals which quickly become outdated?
2020-09-07 08:12:25 -04:00
Stas Bekman
b4a9c95f1b
[testing] add dependency: parametrize (#6958)
unittest doesn't support pytest's super-handy `@pytest.mark.parametrize`, I researched and there are many proposed workarounds, most tedious at best. If we include https://pypi.org/project/parameterized/ in dev dependencies - it will provide a very easy to write parameterization in tests. Same as pytest's fixture, plus quite a few other ways. 

Example:
```
from parameterized import parameterized
@parameterized([
    (2, 2, 4),
    (2, 3, 8),
    (1, 9, 1),
    (0, 9, 0),
])
def test_pow(base, exponent, expected):
   assert_equal(math.pow(base, exponent), expected)
```
(extra `self`var if inside a test class)

To remind the pytest style is slightly different:
```
    @pytest.mark.parametrize("test_input,expected", [("3+5", 8), ("2+4", 6), ("6*9", 42)])
    def test_eval(test_input, expected):
```
More examples here: https://pypi.org/project/parameterized

May I suggest that it will make it much easier to write some types of tests?
2020-09-07 05:50:18 -04:00
Lysandre
4b3ee9cbc5 Release: v3.1.0 2020-09-01 14:27:52 +02:00
Stas Bekman
743d131d76
[style] set the minimal required version for black (#6784)
`make style` with `black` < 20.8b1 is a no go (in case some other package forced a lower version) - so make it explicit to avoid confusion
2020-08-28 11:38:09 +08:00
Funtowicz Morgan
ac9702c284
Fix ONNX test_quantize unittest (#6716) 2020-08-25 13:24:40 -04:00
Sylvain Gugger
a573777901
Update repo to isort v5 (#6686)
* Run new isort

* More changes

* Update CI, CONTRIBUTING and benchmarks
2020-08-24 11:03:01 -04:00
Masatoshi Suzuki
48c6c6139f
Support additional dictionaries for BERT Japanese tokenizers (#6515)
* Update BERT Japanese tokenizers

* Update CircleCI config to download unidic

* Specify to use the latest dictionary packages
2020-08-17 12:00:23 +08:00
Paul O'Leary McCann
cf3cf304ca
Replace mecab-python3 with fugashi for Japanese tokenization (#6086)
* Replace mecab-python3 with fugashi

This replaces mecab-python3 with fugashi for Japanese tokenization. I am
the maintainer of both projects.

Both projects are MeCab wrappers, so the underlying C++ code is the
same. fugashi is the newer wrapper and doesn't use SWIG, so for basic
use of the MeCab API it's easier to use.

This code insures the use of a version of ipadic installed via pip,
which should make versioning and tracking down issues easier.

fugashi has wheels for Windows, OSX, and Linux, which will help with
issues with installing old versions of mecab-python3 on Windows.
Compared to mecab-python3, because fugashi doesn't use SWIG, it doesn't
require a C++ runtime to be installed on Windows.

In adding this change I removed some code dealing with `cursor`,
`token_start`, and `token_end` variables. These variables didn't seem to
be used for anything, it is unclear to me why they were there.

I ran the tests and they passed, though I couldn't figure out how to run
the slow tests (`--runslow` gave an error) and didn't try testing with
Tensorflow.

* Style fix

* Remove unused variable

Forgot to delete this...

* Adapt doc with install instructions

* Fix typo

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-07-31 04:41:14 -04:00
Julien Plu
fc64559c45
Fix TF CTRL model naming (#6134) 2020-07-29 12:20:00 -04:00
sgugger
9d0d3a6645 Pin TF while we wait for a fix 2020-07-27 18:03:09 -04:00
Sebastian
eae6d8d14f
Update tokenizers to 0.8.1.rc to fix Mac OS X issues (#5867) 2020-07-18 08:20:11 -04:00
Lysandre
1d2332861f Post v3.0.2 release commit 2020-07-06 18:56:47 -04:00
Lysandre
b0892fa0e8 Release: v3.0.2 2020-07-06 18:49:44 -04:00
Anthony MOI
5787e4c159
Various tokenizers fixes (#5558)
* BertTokenizerFast - Do not specify strip_accents by default

* Bump tokenizers to new version

* Add test for AddedToken serialization
2020-07-06 18:27:53 -04:00
Thomas Wolf
b58a15a31e unpining specific git versions in setup.py 2020-07-03 17:38:39 +02:00
Thomas Wolf
fedabcd154 Release: 3.0.1 2020-07-03 17:02:44 +02:00
Lysandre Debut
69d313e808
Bans SentencePiece 0.1.92 (#5418) 2020-07-02 09:23:00 -04:00
Lysandre
90d13954c4 Repin versions 2020-06-30 09:16:36 -04:00
Lysandre
b62ca59527 Release: v3.0.0 2020-06-29 10:40:13 -04:00
Sylvain Gugger
482c9178d3
Pin mecab for now (#5362) 2020-06-29 09:51:13 -04:00
Lysandre Debut
364a5ae1f0
Refactor Code samples; Test code samples (#5036)
* Refactor code samples

* Test docstrings

* Style

* Tokenization examples

* Run rust of tests

* First step to testing source docs

* Style and BART comment

* Test the remainder of the code samples

* Style

* let to const

* Formatting fixes

* Ready for merge

* Fix fixture + Style

* Fix last tests

* Update docs/source/quicktour.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Addressing @sgugger's comments + Fix MobileBERT in TF

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-06-25 16:46:00 -04:00
Thomas Wolf
11fdde0271
Tokenizers API developments (#5103)
* Add return lengths

* make pad a bit more flexible so it can be used as collate_fn

* check all kwargs sent to encoding method are known

* fixing kwargs in encodings

* New AddedToken class in python

This class let you specify specifique tokenization behaviors for some special tokens. Used in particular for GPT2 and Roberta, to control how white spaces are stripped around special tokens.

* style and quality

* switched to hugginface tokenizers library for AddedTokens

* up to tokenizer 0.8.0-rc3 - update API to use AddedToken state

* style and quality

* do not raise an error on additional or unused kwargs for tokenize() but only a warning

* transfo-xl pretrained model requires torch

* Update src/transformers/tokenization_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-06-23 13:36:57 +02:00
Lysandre Debut
973433260e
Pin sphinx-rtd-theme (#5128) 2020-06-18 18:07:59 -04:00
Anthony MOI
36434220fc
[HUGE] Refactoring tokenizers backend - padding - truncation - pre-tokenized pipeline - fast tokenizers - tests (#4510)
* Use tokenizers pre-tokenized pipeline

* failing pretrokenized test

* Fix is_pretokenized in python

* add pretokenized tests

* style and quality

* better tests for batched pretokenized inputs

* tokenizers clean up - new padding_strategy - split the files

* [HUGE] refactoring tokenizers - padding - truncation - tests

* style and quality

* bump up requied tokenizers version to 0.8.0-rc1

* switched padding/truncation API - simpler better backward compat

* updating tests for custom tokenizers

* style and quality - tests on pad

* fix QA pipeline

* fix backward compatibility for max_length only

* style and quality

* Various cleans up - add verbose

* fix tests

* update docstrings

* Fix tests

* Docs reformatted

* __call__ method documented

Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2020-06-15 17:12:51 -04:00
Patrick von Platen
2cfb947f59
[Benchmark] add tpu and torchscipt for benchmark (#4850)
* add tpu and torchscipt for benchmark

* fix name in tests

* "fix email"

* make style

* better log message for tpu

* add more print and info for tpu

* allow possibility to print tpu metrics

* correct cpu usage

* fix test for non-install

* remove bugus file

* include psutil in testing

* run a couple of times before tracing in torchscript

* do not allow tpu memory tracing for now

* make style

* add torchscript to env

* better name for torch tpu

Co-authored-by: Patrick von Platen <patrick@huggingface.co>
2020-06-09 23:12:43 +02:00
Lysandre
d976ef262e Repin versions 2020-06-02 10:27:15 -04:00
Lysandre
b43c78e5d3 Release: v2.11.0 2020-06-02 09:49:09 -04:00
Bram Vanroy
8cc6807e89
Make transformers-cli cross-platform (#4131)
* make transformers-cli cross-platform

Using "scripts" is a useful option in setup.py particularly when you want to get access to non-python scripts. However, in this case we want to have an entry point into some of our own Python scripts. To do this in a concise, cross-platfom way, we can use entry_points.console_scripts. This change is necessary to provide the CLI on different platforms, which "scripts" does not ensure. Usage remains the same, but the "transformers-cli" script has to be moved (be part of the library) and renamed (underscore + extension)

* make style & quality
2020-05-26 10:00:51 -04:00
Julien Chaumond
2c1ebb8b50 Re-apply #4446 + add packaging dependency
As discussed w/ @lysandrejik

packaging is maintained by PyPA (the Python Packaging Authority), and should be lightweight and stable
2020-05-22 17:29:03 -04:00
Lysandre
ef22ba4836 Re-pin versions 2020-05-22 11:03:07 -04:00
Lysandre
e0db6bbd65 Release: v2.10.0 2020-05-22 10:37:44 -04:00
Funtowicz Morgan
db0076a9df
Conversion script to export transformers models to ONNX IR. (#4253)
* Added generic ONNX conversion script for PyTorch model.

* WIP initial TF support.

* TensorFlow/Keras ONNX export working.

* Print framework version info

* Add possibility to check the model is correctly loading on ONNX runtime.

* Remove quantization option.

* Specify ONNX opset version when exporting.

* Formatting.

* Remove unused imports.

* Make functions more generally reusable from other part of the code.

* isort happy.

* flake happy

* Export only feature-extraction for now

* Correctly check inputs order / filter before export.

* Removed task variable

* Fix invalid args call in load_graph_from_args.

* Fix invalid args call in convert.

* Fix invalid args call in infer_shapes.

* Raise exception and catch in caller function instead of exit.

* Add 04-onnx-export.ipynb notebook

* More WIP on the notebook

* Remove unused imports

* Simplify & remove unused constants.

* Export with constant_folding in PyTorch

* Let's try to put function args in the right order this time ...

* Disable external_data_format temporary

* ONNX notebook draft ready.

* Updated notebooks charts + wording

* Correct error while exporting last chart in notebook.

* Adressing @LysandreJik comment.

* Set ONNX opset to 11 as default value.

* Set opset param mandatory

* Added ONNX export unittests

* Quality.

* flake8 happy

* Add keras2onnx dependency on extras["tf"]

* Pin keras2onnx on github master to v1.6.5

* Second attempt.

* Third attempt.

* Use the right repo URL this time ...

* Do the same for onnxconverter-common

* Added keras2onnx and onnxconveter-common to 1.7.0 to supports TF2.2

* Correct commit hash.

* Addressing PR review: Optimization are enabled by default.

* Addressing PR review: small changes in the notebook

* setup.py comment about keras2onnx versioning.
2020-05-14 16:35:52 -04:00
Julien Chaumond
448c467256
Fix: unpin flake8 and fix cs errors (#4367)
* Fix: unpin flake8 and fix cs errors

* Ok we still need to quote those
2020-05-14 13:14:26 -04:00
Julien Chaumond
015f7812ed [ci skip] Pin isort 2020-05-14 10:12:18 -04:00
Lysandre
7cb203fae4 Release: v2.9.1 2020-05-13 17:38:50 -04:00
Funtowicz Morgan
7d7fe4997f
Allow BatchEncoding to be initialized empty. (#4316)
* Allow BatchEncoding to be initialized empty.

This is required by recent changes introduced in TF 2.2.

* Attempt to unpin Tensorflow to 2.2 with the previous commit.
2020-05-12 15:02:46 -04:00
Lysandre Debut
30e343862f
pin TF to 2.1 (#4297)
* pin TF to 2.1

* Pin flake8 as well
2020-05-11 21:03:30 -04:00
Julien Plu
94b57bf796
[TF 2.2 compat] use tf.VariableAggregation.ONLY_FIRST_REPLICA (#4283)
* Fix the issue to properly run the accumulator with TF 2.2

* Apply style

* Fix training_args_tf for TF 2.2

* Fix the TF training args when only one GPU is available

* Remove the fixed version of TF in setup.py
2020-05-11 11:28:37 -04:00
Lysandre
2e57824374 Pin isort and tf <= 2.1.0 2020-05-07 14:42:00 -04:00
Lysandre
e7cfc1a313 Release: v2.9.0 2020-05-07 14:15:20 -04:00
Lysandre Debut
79b1c6966b
Pytorch 1.5.0 (#3973)
* Standard deviation can no longer be set to 0

* Remove torch pinned version

* 9th instead of 10th, silly me
2020-05-05 10:23:01 -04:00
Sam Shleifer
18db92dd9a
[testing] add timeout_decorator (#3543) 2020-05-01 09:05:47 -04:00
Julien Chaumond
97a375484c rm boto3 dependency 2020-04-27 11:17:14 -04:00
Anthony MOI
13dd2acca4
Bump tokenizers version to final 0.7.0 (#3898) 2020-04-22 11:02:29 -04:00
Julien Chaumond
eb5601b0a5 [ci] Pin torch version while we update 2020-04-21 15:46:18 -04:00
Thomas Wolf
827d6d6ef0
Cleanup fast tokenizers integration (#3706)
* First pass on utility classes and python tokenizers

* finishing cleanup pass

* style and quality

* Fix tests

* Updating following @mfuntowicz comment

* style and quality

* Fix Roberta

* fix batch_size/seq_length inBatchEncoding

* add alignement methods + tests

* Fix OpenAI and Transfo-XL tokenizers

* adding trim_offsets=True default for GPT2 et RoBERTa

* style and quality

* fix tests

* add_prefix_space in roberta

* bump up tokenizers to rc7

* style

* unfortunately tensorfow does like these - removing shape/seq_len for now

* Update src/transformers/tokenization_utils.py

Co-Authored-By: Stefan Schweter <stefan@schweter.it>

* Adding doc and docstrings

* making flake8 happy

Co-authored-by: Stefan Schweter <stefan@schweter.it>
2020-04-18 13:43:57 +02:00
Anthony MOI
b7cf9f43d2
Update tokenizers to 0.7.0-rc5 (#3705) 2020-04-10 14:23:49 -04:00
Funtowicz Morgan
96ab75b8dd
Tokenizers v3.0.0 (#3185)
* Renamed num_added_tokens to num_special_tokens_to_add

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Cherry-Pick: Partially fix space only input without special tokens added to the output #3091

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Added property is_fast on PretrainedTokenizer and PretrainedTokenizerFast

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Make fast tokenizers unittests work on Windows.

* Entirely refactored unittest for tokenizers fast.

* Remove ABC class for CommonFastTokenizerTest

* Added embeded_special_tokens tests from allenai @dirkgr

* Make embeded_special_tokens tests from allenai more generic

* Uniformize vocab_size as a property for both Fast and normal tokenizers

* Move special tokens handling out of PretrainedTokenizer (SpecialTokensMixin)

* Ensure providing None input raise the same ValueError than Python tokenizer + tests.

* Fix invalid input for assert_padding when testing batch_encode_plus

* Move add_special_tokens from constructor to tokenize/encode/[batch_]encode_plus methods parameter.

* Ensure tokenize() correctly forward add_special_tokens to rust.

* Adding None checking on top on encode / encode_batch for TransfoXLTokenizerFast.
Avoid stripping on None values.

* unittests ensure tokenize() also throws a ValueError if provided None

* Added add_special_tokens unittest for all supported models.

* Style

* Make sure TransfoXL test run only if PyTorch is provided.

* Split up tokenizers tests for each model type.

* Fix invalid unittest with new tokenizers API.

* Filter out Roberta openai detector models from unittests.

* Introduce BatchEncoding on fast tokenizers path.

This new structure exposes all the mappings retrieved from Rust.
It also keeps the current behavior with model forward.

* Introduce BatchEncoding on slow tokenizers path.

Backward compatibility.

* Improve error message on BatchEncoding for slow path

* Make add_prefix_space True by default on Roberta fast to match Python in majority of cases.

* Style and format.

* Added typing on all methods for PretrainedTokenizerFast

* Style and format

* Added path for feeding pretokenized (List[str]) input to PretrainedTokenizerFast.

* Style and format

* encode_plus now supports pretokenized inputs.

* Remove user warning about add_special_tokens when working on pretokenized inputs.

* Always go through the post processor.

* Added support for pretokenized input pairs on encode_plus

* Added is_pretokenized flag on encode_plus for clarity and improved error message on input TypeError.

* Added pretokenized inputs support on batch_encode_plus

* Update BatchEncoding methods name to match Encoding.

* Bump setup.py tokenizers dependency to 0.7.0rc1

* Remove unused parameters in BertTokenizerFast

* Make sure Roberta returns token_type_ids for unittests.

* Added missing typings

* Update add_tokens prototype to match tokenizers side and allow AddedToken

* Bumping tokenizers to 0.7.0rc2

* Added documentation for BatchEncoding

* Added (unused) is_pretokenized parameter on PreTrainedTokenizer encode_plus/batch_encode_plus methods.

* Added higher-level typing for tokenize / encode_plus / batch_encode_plus.

* Fix unittests failing because add_special_tokens was defined as a constructor parameter on Rust Tokenizers.

* Fix text-classification pipeline using the wrong tokenizer

* Make pipelines works with BatchEncoding

* Turn off add_special_tokens on tokenize by default.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove add_prefix_space from tokenize call in unittest.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style and quality

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Correct message for batch_encode_plus none input exception.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Fix invalid list comprehension for offset_mapping overriding content every iteration.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* TransfoXL uses Strip normalizer.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Bump tokenizers dependency to 0.7.0rc3

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Support AddedTokens for special_tokens and use left stripping on mask for Roberta.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* SpecilaTokenMixin can use slots to faster access to underlying attributes.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Remove update_special_tokens from fast tokenizers.

* Ensure TransfoXL unittests are run only when torch is available.

* Style.

Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>

* Style

* Style 🙏🙏

* Remove slots on SpecialTokensMixin, need deep dive into pickle protocol.

* Remove Roberta warning on __init__.

* Move documentation to Google style.

Co-authored-by: LysandreJik <lysandre.debut@reseau.eseo.fr>
2020-04-07 00:29:15 +02:00
LysandreJik
ea6dba2787 Re-pin isort 2020-04-06 10:09:54 -04:00
LysandreJik
11c3257a18 unpin isort for pypi 2020-04-06 10:06:41 -04:00