Lysandre Debut
5164ea91a7
Skipping outputs ( #3116 )
* Minimal example
* Proposal 2
* Proposal 2 for fast tokenizers
* Typings
* Docs
* Revert "Docs" for easier review
This reverts commit eaf0f97062e809887704a542144c537f769d5223.
* Remove unnecessary assignments
* Tests
* Fix faulty type
* Remove prints
* return_outputs -> model_input_names
* Revert "Revert "Docs" for easier review"
This reverts commit 6fdc69408102bf695797f2dfddbb6350c6b9e722.
* code quality
2020-03-09 13:48:58 -04:00
Patrick von Platen
c0135194eb
Force pad_token_id to be set before padding for standard tokenizer ( #3035 )
* force pad_token_id to be set before padding
* fix tests and forbid padding without having a padding_token_id set
2020-03-02 10:53:55 -05:00
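The behaviour described in the commit above (refusing to pad when no padding token id is set) can be illustrated with a minimal sketch. This is not the library's code: the function name `pad_batch` and its signature are hypothetical and chosen only for illustration.

```python
def pad_batch(sequences, pad_token_id=None):
    """Right-pad every sequence to the length of the longest one.

    Hypothetical helper, not the tokenizer's actual implementation.
    """
    # Refuse to pad when no padding token id has been configured.
    if pad_token_id is None:
        raise ValueError("pad_token_id must be set before padding")
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_token_id] * (max_len - len(seq)) for seq in sequences]


padded = pad_batch([[1, 2, 3], [4]], pad_token_id=0)
```

Raising early keeps a missing padding token from silently producing malformed batches.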
Lysandre Debut
21d8b6a33e
Testing that batch_encode_plus is the same as encode_plus ( #2973 )
* Testing that encode_plus and batch_encode_plus behave the same way
Spoiler alert: they don't
* Testing rest of arguments in batch_encode_plus
* Test tensor return in batch_encode_plus
* Addressing Sam's comments
* flake8
* Simplified with `num_added_tokens`
2020-02-24 12:09:46 -05:00
Joe Davison
197d74f988
Add get_vocab method to PretrainedTokenizer
2020-02-20 15:26:49 -05:00
Joe Davison
f1e8a51f08
Preserve spaces in GPT-2 tokenizers ( #2778 )
* Preserve spaces in GPT-2 tokenizers
Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
tokenizers, enabling correct BPE encoding. Automatically inserts a space
in front of first token in encode function when adding special tokens.
* Add tokenization preprocessing method
* Add framework argument to pipeline factory
Also fixes pipeline test issue. Each test input now treated as a
distinct sequence.
2020-02-13 13:29:43 -05:00
Lysandre
e63a81dd25
Style
2020-01-29 16:29:20 -05:00
Lysandre
217349016a
Copy object instead of passing the reference
2020-01-29 16:15:39 -05:00
alberduris
81d6841b4b
GPU text generation: Moved the encoded_prompt to correct device
2020-01-06 15:11:12 +01:00
alberduris
dd4df80f0b
Moved the encoded_prompts to correct device
2020-01-06 15:11:12 +01:00
Thomas Wolf
9f5f646442
Merge pull request #2211 from huggingface/fast-tokenizers
Fast tokenizers
2019-12-27 10:24:29 +01:00
Anthony MOI
2818e50569
Add tests for fast tokenizers
2019-12-24 13:29:01 -05:00
Aymeric Augustin
e6c0019c80
Remove unused variables in tests.
2019-12-23 22:38:18 +01:00
Aymeric Augustin
1c62e87b34
Use built-in open().
On Python 3, `open is io.open`.
2019-12-22 18:38:56 +01:00
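The note in the commit above, that on Python 3 `open is io.open`, can be checked directly. The following is a minimal sketch rather than code from the commit; the helper name `read_text` and the file name `vocab.txt` are hypothetical.

```python
import io
import os
import tempfile

# On Python 3 the built-in open() and io.open are the same object,
# so importing io only to call io.open is unnecessary.
same_object = io.open is open


def read_text(path):
    # Hypothetical helper: reads a file with the built-in open()
    # and an explicit encoding.
    with open(path, encoding="utf-8") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "vocab.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("hello\nworld\n")
    contents = read_text(sample_path)
```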
Aymeric Augustin
798b3b3899
Remove sys.version_info[0] == 2 or 3.
2019-12-22 18:38:42 +01:00
Aymeric Augustin
c824d15aa1
Remove __future__ imports.
2019-12-22 17:47:54 +01:00
Aymeric Augustin
00204f2b4c
Replace CommonTestCases for tokenizers with a mixin.
This is the same change as for (TF)CommonTestCases for modeling.
2019-12-22 15:35:25 +01:00
Aymeric Augustin
a3c5883f2c
Rename file for consistency.
2019-12-22 15:35:25 +01:00