Commit Graph

70 Commits

Author SHA1 Message Date
thomwolf
0f091062d4 Merge branch 'glue-example' into tf2 2019-09-25 10:21:52 +02:00
thomwolf
c4acc3a8e9 let encode accept tensor inputs 2019-09-25 10:19:14 +02:00
thomwolf
a6981076ec various updates 2019-09-24 16:46:26 +02:00
LysandreJik
d340e2329e create_mask_from_sequences -> create_token_type_ids_from_sequences 2019-09-24 09:09:28 -04:00
LysandreJik
c832f43a4d output_token_type -> token_type_ids 2019-09-24 07:21:38 -04:00
LysandreJik
0ea82b246f Updated tests 2019-09-24 07:10:09 -04:00
LysandreJik
9d44236f70 Updated DistilBERT 2019-09-24 07:03:24 -04:00
LysandreJik
ab984a8b72 Python 2 compatibility 2019-09-19 15:01:33 +02:00
LysandreJik
3df208c93a Tokenizer accepts token list as well as string 2019-09-19 14:47:52 +02:00
LysandreJik
66ea76b8a9 prepare_for_model and prepare_pair_for_model methods. Added an option to select which sequence will be truncated. 2019-09-19 13:50:51 +02:00
LysandreJik
baa74326ab Stride + tests + small fixes 2019-09-19 10:55:06 +02:00
LysandreJik
c10c7d59e7 Mask computing in standalone method. Tests. 2019-09-19 10:55:06 +02:00
LysandreJik
bf503158c5 Sentence -> Sequence. Removed output_mask from the special token addition methods. 2019-09-19 10:55:06 +02:00
LysandreJik
8cba057260 Doc + remove artefacts 2019-09-19 10:55:06 +02:00
LysandreJik
dcc9bb3252 Modified encode to return only lists. Added a more complete encode_plus method 2019-09-19 10:55:06 +02:00
LysandreJik
af23b626c8 Max encoding length + corresponding tests 2019-09-19 10:55:06 +02:00
LysandreJik
d572d7027b Number of added tokens calculator 2019-09-19 10:55:06 +02:00
LysandreJik
2d8ec5a684 Changed warning to be more explicit (Co-authored by: julien_c <chaumond@gmail.com>) 2019-09-19 10:55:06 +02:00
LysandreJik
c3df2136e1 Added binary masking tests 2019-09-19 10:55:06 +02:00
LysandreJik
e391d4735e Tokenizers' encode function can output binary masks 2019-09-19 10:55:06 +02:00
thomwolf
5c6cac102b adding test for common properties and cleaning up a bit base class 2019-09-05 21:31:29 +02:00
maru0kun
d737947725 Fix typo 2019-09-05 19:24:57 +09:00
LysandreJik
31d3373bc9 Appends space before special token 2019-09-01 21:07:00 -04:00
thomwolf
fede4ef45d fixing #1133 2019-09-02 02:27:39 +02:00
Thomas Wolf
f3d18c71ec Merge pull request #1152 from epwalsh/fix-special-tokens: fix adding special tokens 2019-08-30 23:21:59 +02:00
Thomas Wolf
d483cd8e46 Merge pull request #1074 from huggingface/improved_testing: Shortcut to special tokens' ids - fix GPT2 & RoBERTa tokenizers - improved testing for GPT/GPT-2 2019-08-30 23:18:58 +02:00
Thomas Wolf
cd65c41a83 Merge branch 'master' into xlm-tokenization 2019-08-30 17:15:16 +02:00
thomwolf
69da972ace added test and debug tokenizer configuration serialization 2019-08-30 17:09:36 +02:00
thomwolf
88111de07c saving and reloading tokenizer configurations 2019-08-30 16:55:48 +02:00
thomwolf
82462c5cba Added option to setup pretrained tokenizer arguments 2019-08-30 15:30:41 +02:00
Thomas Wolf
50e615f43d Merge branch 'master' into improved_testing 2019-08-30 13:40:35 +02:00
thomwolf
8faf2e086b more doc on special tokens 2019-08-30 13:36:22 +02:00
thomwolf
d51f72d5de adding shortcut to the ids of all the special tokens 2019-08-30 11:41:11 +02:00
epwalsh
07e21307b6 fix adding special tokens 2019-08-29 13:44:50 -07:00
Thomas Wolf
5f297c7be3 Merge pull request #1087 from huggingface/fix-warnings: Decode now calls private property instead of public method 2019-08-28 22:22:11 +02:00
Thomas Wolf
0ecfd17f49 Merge pull request #987 from huggingface/generative-finetuning: Generative finetuning 2019-08-28 16:51:50 +02:00
thomwolf
a175a9dc01 add kwargs to base encode function 2019-08-27 14:05:59 +02:00
LysandreJik
529a16dec6 Generic encoding implementation. 2019-08-26 15:00:43 -04:00
thomwolf
3bcbebd440 max_len_single_sentence & max_len_sentences_pair as attributes so they can be modified 2019-08-23 22:07:26 +02:00
thomwolf
47d6853439 adding max_lengths for single sentences and sentences pairs 2019-08-23 17:31:11 +02:00
Abhishek Rao
c603d099aa reraise EnvironmentError in from_pretrained functions of Model and Tokenizer 2019-08-22 15:25:40 -07:00
LysandreJik
2ba1a14fb0 Decode now calls private property instead of public method 2019-08-22 17:25:55 -04:00
Thomas Wolf
e4515faf54 Merge pull request #1057 from huggingface/fixes: Add a few of typos corrections, bugs fixes and small improvements 2019-08-21 01:54:05 +02:00
Thomas Wolf
d30cbaf5dc Merge branch 'master' into iterative_split_on_token 2019-08-21 01:33:02 +02:00
thomwolf
43489756ad adding proxies options for the from_pretrained methods 2019-08-20 16:59:11 +02:00
thomwolf
fecaed0ed4 add force_download option to from_pretrained methods 2019-08-20 10:56:12 +02:00
LysandreJik
572dcfd1db Doc 2019-08-14 14:56:14 -04:00
samvelyan
9ce36e3e4b Re-implemented tokenize() iteratively in PreTrainedTokenizer. 2019-08-14 08:57:09 +00:00
LysandreJik
3d87991f60 Fixed error with encoding 2019-08-13 12:00:24 -04:00
LysandreJik
22ac004a7c Added documentation and changed parameters for special_tokens_sentences_pair. 2019-08-12 15:13:53 -04:00