LysandreJik
|
c10c7d59e7
|
Mask computing in standalone method. Tests.
|
2019-09-19 10:55:06 +02:00 |
|
LysandreJik
|
bf503158c5
|
Sentence -> Sequence. Removed output_mask from the special token addition methods.
|
2019-09-19 10:55:06 +02:00 |
|
LysandreJik
|
e391d4735e
|
Tokenizers' encode function can output binary masks
|
2019-09-19 10:55:06 +02:00 |
|
Shijie Wu
|
a15562e170
|
Fix reference of import when called for the second time
|
2019-09-03 18:27:29 -07:00 |
|
LysandreJik
|
6ae0bb5291
|
XLM 100 different URLs
|
2019-08-31 14:46:31 -04:00 |
|
Thomas Wolf
|
d483cd8e46
|
Merge pull request #1074 from huggingface/improved_testing
Shortcut to special tokens' ids - fix GPT2 & RoBERTa tokenizers - improved testing for GPT/GPT-2
|
2019-08-30 23:18:58 +02:00 |
|
thomwolf
|
8678ff8df5
|
adding 17 and 100 xlm models
|
2019-08-30 16:26:04 +02:00 |
|
thomwolf
|
82462c5cba
|
Added option to setup pretrained tokenizer arguments
|
2019-08-30 15:30:41 +02:00 |
|
Thomas Wolf
|
50e615f43d
|
Merge branch 'master' into improved_testing
|
2019-08-30 13:40:35 +02:00 |
|
thomwolf
|
f8aace6bcd
|
update tokenizers to use self.XX_token_id instead of converting self.XX_token
|
2019-08-30 13:39:52 +02:00 |
|
Shijie Wu
|
ca4baf8ca1
|
Match order of casing in OSS XLM; Improve document; Clean up dependency
|
2019-08-27 20:03:18 -04:00 |
|
Shijie Wu
|
e85123d398
|
Add custom tokenizer for zh and ja
|
2019-08-23 20:27:52 -04:00 |
|
thomwolf
|
3bcbebd440
|
max_len_single_sentence & max_len_sentences_pair as attributes so they can be modified
|
2019-08-23 22:07:26 +02:00 |
|
Shijie Wu
|
436ce07218
|
Tokenization behaves the same as the original XLM preprocessing for most languages except zh, ja and th; Change API to allow specifying language in tokenize
|
2019-08-23 14:40:17 -04:00 |
|
thomwolf
|
47d6853439
|
adding max_lengths for single sentences and sentences pairs
|
2019-08-23 17:31:11 +02:00 |
|
Guillem García Subies
|
388e3251fa
|
Update tokenization_xlm.py
|
2019-08-20 14:19:39 +02:00 |
|
Guillem García Subies
|
bfd75056b0
|
Update tokenization_xlm.py
|
2019-08-20 14:06:17 +02:00 |
|
LysandreJik
|
22ac004a7c
|
Added documentation and changed parameters for special_tokens_sentences_pair.
|
2019-08-12 15:13:53 -04:00 |
|
LysandreJik
|
14e970c271
|
Tokenization encode/decode class-based sequence handling
|
2019-08-09 15:01:38 -04:00 |
|
thomwolf
|
1849aa7d39
|
update readme and pretrained model weight files
|
2019-07-16 15:11:29 +02:00 |
|
thomwolf
|
15d8b1266c
|
update tokenizer - update squad example for xlnet
|
2019-07-15 17:30:42 +02:00 |
|
LysandreJik
|
7fdbc47822
|
Added the two CLM XLM pretrained checkpoints.
Fixed file extensions for config/vocab/merges of XLM models.
|
2019-07-10 19:37:24 -04:00 |
|
LysandreJik
|
dee3e45b93
|
Fixed XLM weights conversion script. Added 5 new checkpoints for XLM.
|
2019-07-10 19:04:21 -04:00 |
|
LysandreJik
|
f773faa258
|
Fixed all links. Removed TPU. Changed CLI to Converting TF models. Many minor formatting adjustments. Added "TODO Lysandre filled" where necessary.
|
2019-07-10 14:45:56 -04:00 |
|
thomwolf
|
d5481cbe1b
|
adding tests to examples - updating summary module - coverage update
|
2019-07-09 15:29:42 +02:00 |
|
thomwolf
|
b19786985d
|
unified tokenizer api and serialization + tests
|
2019-07-09 10:25:18 +02:00 |
|
thomwolf
|
36bca545ff
|
tokenization abstract class - tests for examples
|
2019-07-05 15:02:59 +02:00 |
|
thomwolf
|
0bab55d5d5
|
[BIG] name change
|
2019-07-05 11:55:36 +02:00 |
|