Commit Graph

23 Commits

Author SHA1 Message Date
Thomas Wolf
d483cd8e46 Merge pull request #1074 from huggingface/improved_testing
Shortcut to special tokens' ids - fix GPT2 & RoBERTa tokenizers - improved testing for GPT/GPT-2 2019-08-30 23:18:58 +02:00
thomwolf
8678ff8df5 adding 17 and 100 xlm models 2019-08-30 16:26:04 +02:00
thomwolf
82462c5cba Added option to setup pretrained tokenizer arguments 2019-08-30 15:30:41 +02:00
Thomas Wolf
50e615f43d Merge branch 'master' into improved_testing 2019-08-30 13:40:35 +02:00
thomwolf
f8aace6bcd update tokenizers to use self.XX_token_id instead of converting self.XX_token 2019-08-30 13:39:52 +02:00
Shijie Wu
ca4baf8ca1 Match order of casing in OSS XLM; Improve document; Clean up dependency 2019-08-27 20:03:18 -04:00
Shijie Wu
e85123d398 Add custom tokenizer for zh and ja 2019-08-23 20:27:52 -04:00
thomwolf
3bcbebd440 max_len_single_sentence & max_len_sentences_pair as attributes so they can be modified 2019-08-23 22:07:26 +02:00
Shijie Wu
436ce07218 Tokenization behave the same as original XLM proprocessing for most languages except zh, ja and th; Change API to allow specifying language in tokenize 2019-08-23 14:40:17 -04:00
thomwolf
47d6853439 adding max_lengths for single sentences and sentences pairs 2019-08-23 17:31:11 +02:00
Guillem García Subies
388e3251fa Update tokenization_xlm.py 2019-08-20 14:19:39 +02:00
Guillem García Subies
bfd75056b0 Update tokenization_xlm.py 2019-08-20 14:06:17 +02:00
LysandreJik
22ac004a7c Added documentation and changed parameters for special_tokens_sentences_pair. 2019-08-12 15:13:53 -04:00
LysandreJik
14e970c271 Tokenization encode/decode class-based sequence handling 2019-08-09 15:01:38 -04:00
thomwolf
1849aa7d39 update readme and pretrained model weight files 2019-07-16 15:11:29 +02:00
thomwolf
15d8b1266c update tokenizer - update squad example for xlnet 2019-07-15 17:30:42 +02:00
LysandreJik
7fdbc47822 Added the two CLM XLM pretrained checkpoints. Fixed file extensions for config/vocab/merges of XLM models. 2019-07-10 19:37:24 -04:00
LysandreJik
dee3e45b93 Fixed XLM weights conversion script. Added 5 new checkpoints for XLM. 2019-07-10 19:04:21 -04:00
LysandreJik
f773faa258 Fixed all links. Removed TPU. Changed CLI to Converting TF models. Many minor formatting adjustments. Added "TODO Lysandre filled" where necessary. 2019-07-10 14:45:56 -04:00
thomwolf
d5481cbe1b adding tests to examples - updating summary module - coverage update 2019-07-09 15:29:42 +02:00
thomwolf
b19786985d unified tokenizer api and serialization + tests 2019-07-09 10:25:18 +02:00
thomwolf
36bca545ff tokenization abstract class - tests for examples 2019-07-05 15:02:59 +02:00
thomwolf
0bab55d5d5 [BIG] name change 2019-07-05 11:55:36 +02:00