Commit Graph

27 Commits

Author SHA1 Message Date
LysandreJik
c10c7d59e7 Mask computing in standalone method. Tests. 2019-09-19 10:55:06 +02:00
LysandreJik
bf503158c5 Sentence -> Sequence. Removed output_mask from the special token addition methods. 2019-09-19 10:55:06 +02:00
LysandreJik
e391d4735e Tokenizers' encode function can output binary masks 2019-09-19 10:55:06 +02:00
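The e391d4735e entry describes encode() optionally returning a binary mask alongside the token ids. A minimal sketch of that idea, with hypothetical names (`encode_with_mask` and `pad_id` are illustrative, not from the commit):

```python
# Sketch of the "binary mask" idea: alongside padded token ids, return a
# mask with 1 for real tokens and 0 for padding, so downstream models can
# ignore the padded positions. Names here are hypothetical.
def encode_with_mask(ids, max_length, pad_id=0):
    pad_count = max_length - len(ids)
    mask = [1] * len(ids) + [0] * pad_count
    ids = ids + [pad_id] * pad_count
    return ids, mask

ids, mask = encode_with_mask([101, 2054, 102], max_length=5)
print(ids)   # [101, 2054, 102, 0, 0]
print(mask)  # [1, 1, 1, 0, 0]
```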
Thomas Wolf
d483cd8e46 Merge pull request #1074 from huggingface/improved_testing 2019-08-30 23:18:58 +02:00
    Shortcut to special tokens' ids - fix GPT2 & RoBERTa tokenizers - improved testing for GPT/GPT-2
Thomas Wolf
cd65c41a83 Merge branch 'master' into xlm-tokenization 2019-08-30 17:15:16 +02:00
thomwolf
82462c5cba Added option to setup pretrained tokenizer arguments 2019-08-30 15:30:41 +02:00
Thomas Wolf
50e615f43d Merge branch 'master' into improved_testing 2019-08-30 13:40:35 +02:00
thomwolf
f8aace6bcd update tokenizers to use self.XX_token_id instead of converting self.XX_token 2019-08-30 13:39:52 +02:00
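The f8aace6bcd entry replaces repeated `convert` calls on `self.XX_token` with `self.XX_token_id` shortcuts. A minimal sketch of such a shortcut property, with hypothetical class and vocab (only the `unk_token` / `unk_token_id` naming pattern comes from the commit message):

```python
# Hypothetical minimal tokenizer illustrating the XX_token_id shortcut:
# a property converts the stored token string once, so callers write
# self.unk_token_id instead of self.convert_tokens_to_ids(self.unk_token).
class Tokenizer:
    def __init__(self, vocab, unk_token="[UNK]"):
        self.vocab = vocab          # token string -> id
        self.unk_token = unk_token

    def convert_tokens_to_ids(self, token):
        return self.vocab[token]

    @property
    def unk_token_id(self):
        # shortcut to the id of the unknown-token string
        return self.convert_tokens_to_ids(self.unk_token)

tok = Tokenizer({"[UNK]": 100})
print(tok.unk_token_id)  # 100
```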
thomwolf
3bcbebd440 max_len_single_sentence & max_len_sentences_pair as attributes so they can be modified 2019-08-23 22:07:26 +02:00
thomwolf
47d6853439 adding max_lengths for single sentences and sentences pairs 2019-08-23 17:31:11 +02:00
thomwolf
901dde0e45 fix #1014 2019-08-20 11:05:51 +02:00
LysandreJik
22ac004a7c Added documentation and changed parameters for special_tokens_sentences_pair. 2019-08-12 15:13:53 -04:00
LysandreJik
14e970c271 Tokenization encode/decode class-based sequence handling 2019-08-09 15:01:38 -04:00
thomwolf
328afb7097 cleaning up tokenizer tests structure (at last) - last remaining ppb refs 2019-08-05 14:08:56 +02:00
thomwolf
009273dbdd big doc update [WIP] 2019-08-04 12:14:57 +02:00
thomwolf
4cc1bf81ee typos 2019-07-27 12:08:21 +02:00
Yiqing-Zhou
b1019d2a8e token[-1] -> token.rstrip('\n') 2019-07-23 20:41:26 +08:00
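The b1019d2a8e entry swaps index-based newline removal for rstrip('\n'). A minimal sketch of why, assuming a hypothetical vocab-loading loop (the variable names are illustrative):

```python
# Slicing off the last character corrupts a final line that has no
# trailing newline, while rstrip('\n') is a no-op when there is nothing
# to strip. Hypothetical vocab lines for illustration.
lines = ["hello\n", "world"]  # last line lacks a trailing '\n'

stripped_by_slice = [t[:-1] for t in lines]
stripped_by_rstrip = [t.rstrip("\n") for t in lines]

print(stripped_by_slice)   # ['hello', 'worl'] -- 'd' lost
print(stripped_by_rstrip)  # ['hello', 'world']
```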
Yiqing-Zhou
bef0c629ca fix: remove '\n' before adding token into vocab 2019-07-22 22:30:49 +08:00
Yiqing-Zhou
897d0841be read().splitlines() -> readlines() 2019-07-22 20:49:09 +08:00
    splitlines() does not work as expected here for bert-base-chinese because the vocab file contains a '\u2028' (Unicode line separator) token: splitlines() treats '\u2028' itself as a line break and turns that entry into empty strings. Perhaps we should use readlines() instead.
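The 897d0841be entry hinges on str.splitlines() breaking on U+2028 in addition to '\n'. A short demonstration with a made-up vocab string (the token names are illustrative):

```python
# str.splitlines() breaks on U+2028 (LINE SEPARATOR) as well as '\n',
# so a vocab file containing '\u2028' as an actual token loses it.
text = "token_a\n\u2028\ntoken_b\n"

# '\u2028' is consumed as a line boundary, leaving empty strings:
print(text.splitlines())   # ['token_a', '', '', 'token_b']

# Splitting only on '\n' (what readlines() effectively does) keeps it:
print(text.split("\n"))    # ['token_a', '\u2028', 'token_b', '']
```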
thomwolf
15d8b1266c update tokenizer - update squad example for xlnet 2019-07-15 17:30:42 +02:00
thomwolf
ab49fafc04 update tokenization docstrings for #328 2019-07-15 12:51:23 +02:00
thomwolf
a9ab15174c fix #328 2019-07-15 12:42:12 +02:00
thomwolf
e468192e2f Merge branch 'pytorch-transformers' into xlnet 2019-07-09 17:05:37 +02:00
thomwolf
d5481cbe1b adding tests to examples - updating summary module - coverage update 2019-07-09 15:29:42 +02:00
thomwolf
b19786985d unified tokenizer api and serialization + tests 2019-07-09 10:25:18 +02:00
thomwolf
36bca545ff tokenization abstract class - tests for examples 2019-07-05 15:02:59 +02:00
thomwolf
0bab55d5d5 [BIG] name change 2019-07-05 11:55:36 +02:00