Joe Davison
197d74f988
Add get_vocab method to PretrainedTokenizer
2020-02-20 15:26:49 -05:00
Joe Davison
f1e8a51f08
Preserve spaces in GPT-2 tokenizers (#2778)
* Preserve spaces in GPT-2 tokenizers
Preserves spaces after special tokens in GPT-2 and inherited (RoBERTa)
tokenizers, enabling correct BPE encoding. Automatically inserts a space
in front of the first token in the encode function when adding special tokens.
* Add tokenization preprocessing method
* Add framework argument to pipeline factory
Also fixes a pipeline test issue. Each test input is now treated as a
distinct sequence.
2020-02-13 13:29:43 -05:00
Lysandre
e63a81dd25
Style
2020-01-29 16:29:20 -05:00
Lysandre
217349016a
Copy object instead of passing the reference
2020-01-29 16:15:39 -05:00
alberduris
81d6841b4b
GPU text generation: Moved the encoded_prompt to correct device
2020-01-06 15:11:12 +01:00
alberduris
dd4df80f0b
Moved the encoded_prompts to correct device
2020-01-06 15:11:12 +01:00
Thomas Wolf
9f5f646442
Merge pull request #2211 from huggingface/fast-tokenizers
Fast tokenizers
2019-12-27 10:24:29 +01:00
Anthony MOI
2818e50569
Add tests for fast tokenizers
2019-12-24 13:29:01 -05:00
Aymeric Augustin
e6c0019c80
Remove unused variables in tests.
2019-12-23 22:38:18 +01:00
Aymeric Augustin
1c62e87b34
Use built-in open().
On Python 3, `open is io.open`.
2019-12-22 18:38:56 +01:00
Aymeric Augustin
798b3b3899
Remove sys.version_info[0] == 2 or 3.
2019-12-22 18:38:42 +01:00
Aymeric Augustin
c824d15aa1
Remove __future__ imports.
2019-12-22 17:47:54 +01:00
Aymeric Augustin
00204f2b4c
Replace CommonTestCases for tokenizers with a mixin.
This is the same change as for (TF)CommonTestCases for modeling.
2019-12-22 15:35:25 +01:00
Aymeric Augustin
a3c5883f2c
Rename file for consistency.
2019-12-22 15:35:25 +01:00