* remove references to old API in docstring - update data processors
* style
* fix tests - better type checking error messages
* better type checking
* include awesome fix by @LysandreJik for #5310
* updated doc and examples
* First pass on utility classes and python tokenizers
* finishing cleanup pass
* style and quality
* Fix tests
* Updating following @mfuntowicz comment
* style and quality
* Fix Roberta
* fix batch_size/seq_length inBatchEncoding
* add alignement methods + tests
* Fix OpenAI and Transfo-XL tokenizers
* adding trim_offsets=True default for GPT2 et RoBERTa
* style and quality
* fix tests
* add_prefix_space in roberta
* bump up tokenizers to rc7
* style
* unfortunately tensorfow does like these - removing shape/seq_len for now
* Update src/transformers/tokenization_utils.py
Co-Authored-By: Stefan Schweter <stefan@schweter.it>
* Adding doc and docstrings
* making flake8 happy
Co-authored-by: Stefan Schweter <stefan@schweter.it>
This change is mostly autogenerated with:
$ python -m autoflake --in-place --recursive --remove-all-unused-imports --ignore-init-module-imports examples templates transformers utils hubconf.py setup.py
I made minor changes in the generated diff.
This is the result of:
$ black --line-length 119 examples templates transformers utils hubconf.py setup.py
There's a lot of fairly long lines in the project. As a consequence, I'm
picking the longest widely accepted line length, 119 characters.
This is also Thomas' preference, because it allows for explicit variable
names, to make the code easier to understand.