Thomas Wolf | f3d18c71ec | Merge pull request #1152 from epwalsh/fix-special-tokens: fix adding special tokens | 2019-08-30 23:21:59 +02:00
Thomas Wolf | d483cd8e46 | Merge pull request #1074 from huggingface/improved_testing: Shortcut to special tokens' ids - fix GPT2 & RoBERTa tokenizers - improved testing for GPT/GPT-2 | 2019-08-30 23:18:58 +02:00
Thomas Wolf | cd65c41a83 | Merge branch 'master' into xlm-tokenization | 2019-08-30 17:15:16 +02:00
thomwolf | 69da972ace | added test and debug tokenizer configuration serialization | 2019-08-30 17:09:36 +02:00
thomwolf | 88111de07c | saving and reloading tokenizer configurations | 2019-08-30 16:55:48 +02:00
thomwolf | 82462c5cba | Added option to setup pretrained tokenizer arguments | 2019-08-30 15:30:41 +02:00
Thomas Wolf | 50e615f43d | Merge branch 'master' into improved_testing | 2019-08-30 13:40:35 +02:00
thomwolf | 8faf2e086b | more doc on special tokens | 2019-08-30 13:36:22 +02:00
thomwolf | d51f72d5de | adding shortcut to the ids of all the special tokens | 2019-08-30 11:41:11 +02:00
epwalsh | 07e21307b6 | fix adding special tokens | 2019-08-29 13:44:50 -07:00
Thomas Wolf | 5f297c7be3 | Merge pull request #1087 from huggingface/fix-warnings: Decode now calls private property instead of public method | 2019-08-28 22:22:11 +02:00
Thomas Wolf | 0ecfd17f49 | Merge pull request #987 from huggingface/generative-finetuning: Generative finetuning | 2019-08-28 16:51:50 +02:00
thomwolf | a175a9dc01 | add kwargs to base encode function | 2019-08-27 14:05:59 +02:00
LysandreJik | 529a16dec6 | Generic encoding implementation. | 2019-08-26 15:00:43 -04:00
thomwolf | 3bcbebd440 | max_len_single_sentence & max_len_sentences_pair as attributes so they can be modified | 2019-08-23 22:07:26 +02:00
thomwolf | 47d6853439 | adding max_lengths for single sentences and sentences pairs | 2019-08-23 17:31:11 +02:00
Abhishek Rao | c603d099aa | reraise EnvironmentError in from_pretrained functions of Model and Tokenizer | 2019-08-22 15:25:40 -07:00
LysandreJik | 2ba1a14fb0 | Decode now calls private property instead of public method | 2019-08-22 17:25:55 -04:00
Thomas Wolf | e4515faf54 | Merge pull request #1057 from huggingface/fixes: Add a few of typos corrections, bugs fixes and small improvements | 2019-08-21 01:54:05 +02:00
Thomas Wolf | d30cbaf5dc | Merge branch 'master' into iterative_split_on_token | 2019-08-21 01:33:02 +02:00
thomwolf | 43489756ad | adding proxies options for the from_pretrained methods | 2019-08-20 16:59:11 +02:00
thomwolf | fecaed0ed4 | add force_download option to from_pretrained methods | 2019-08-20 10:56:12 +02:00
LysandreJik | 572dcfd1db | Doc | 2019-08-14 14:56:14 -04:00
samvelyan | 9ce36e3e4b | Re-implemented tokenize() iteratively in PreTrainedTokenizer. | 2019-08-14 08:57:09 +00:00
LysandreJik | 3d87991f60 | Fixed error with encoding | 2019-08-13 12:00:24 -04:00
LysandreJik | 22ac004a7c | Added documentation and changed parameters for special_tokens_sentences_pair. | 2019-08-12 15:13:53 -04:00
LysandreJik | 14e970c271 | Tokenization encode/decode class-based sequence handling | 2019-08-09 15:01:38 -04:00
LysandreJik | 6c41a8f5dc | Encode and Decode are back in the superclass. They now handle sentence pairs special tokens. | 2019-08-08 18:20:32 -04:00
Thomas Wolf | d43dc48b34 | Merge branch 'master' into auto_models | 2019-08-05 19:17:35 +02:00
thomwolf | 328afb7097 | cleaning up tokenizer tests structure (at last) - last remaining ppb refs | 2019-08-05 14:08:56 +02:00
thomwolf | 00132b7a7a | updating docs - adding few tests to tokenizers | 2019-08-04 22:42:55 +02:00
thomwolf | 009273dbdd | big doc update [WIP] | 2019-08-04 12:14:57 +02:00
thomwolf | c717d38573 | dictionnary => dictionary | 2019-07-26 23:30:48 +02:00
thomwolf | 7b6e474c9a | fix #901 | 2019-07-26 21:26:44 +02:00
thomwolf | 27b0f86d36 | clean up pretrained | 2019-07-26 17:09:21 +02:00
Joel Grus | ae152cec09 | make save_pretrained work with added tokens: right now it's dumping the *decoder* when it should be dumping the *encoder*. this fixes that. | 2019-07-24 16:54:48 -07:00
thomwolf | 1849aa7d39 | update readme and pretrained model weight files | 2019-07-16 15:11:29 +02:00
thomwolf | 1b35d05d4b | update conversion scripts and __main__ | 2019-07-16 09:41:55 +02:00
thomwolf | 15d8b1266c | update tokenizer - update squad example for xlnet | 2019-07-15 17:30:42 +02:00
thomwolf | 7d4b200e40 | good quality generation example for GPT, GPT-2, Transfo-XL, XLNet | 2019-07-13 15:25:03 +02:00
thomwolf | d5481cbe1b | adding tests to examples - updating summary module - coverage update | 2019-07-09 15:29:42 +02:00
thomwolf | c079d7ddff | fix python 2 tests | 2019-07-09 10:40:59 +02:00
thomwolf | b19786985d | unified tokenizer api and serialization + tests | 2019-07-09 10:25:18 +02:00
thomwolf | 1113f97f33 | clean up glue example | 2019-07-05 16:31:13 +02:00
thomwolf | 6dacc79d39 | fix python2 tests | 2019-07-05 15:11:59 +02:00
thomwolf | 36bca545ff | tokenization abstract class - tests for examples | 2019-07-05 15:02:59 +02:00