thomwolf | a6bcfb8015 | fix tests | 2019-09-25 21:14:12 +02:00
thomwolf | 78863f6b36 | fix tokenizer to tensors | 2019-09-25 21:09:46 +02:00
thomwolf | 8a618e0af5 | clean up __init__ | 2019-09-25 21:04:52 +02:00
thomwolf | 0f091062d4 | Merge branch 'glue-example' into tf2 | 2019-09-25 10:21:52 +02:00
thomwolf | c4acc3a8e9 | let encode accept tensor inputs | 2019-09-25 10:19:14 +02:00
thomwolf | a6981076ec | various updates | 2019-09-24 16:46:26 +02:00
LysandreJik | d340e2329e | create_mask_from_sequences -> create_token_type_ids_from_sequences | 2019-09-24 09:09:28 -04:00
LysandreJik | c832f43a4d | output_token_type -> token_type_ids | 2019-09-24 07:21:38 -04:00
LysandreJik | 0ea82b246f | Updated tests | 2019-09-24 07:10:09 -04:00
LysandreJik | 9d44236f70 | Updated DistilBERT | 2019-09-24 07:03:24 -04:00
LysandreJik | ab984a8b72 | Python 2 compatibility | 2019-09-19 15:01:33 +02:00
LysandreJik | 3df208c93a | Tokenizer accepts token list as well as string | 2019-09-19 14:47:52 +02:00
LysandreJik | 66ea76b8a9 | prepare_for_model and prepare_pair_for_model methods. Added an option to select which sequence will be truncated. | 2019-09-19 13:50:51 +02:00
LysandreJik | baa74326ab | Stride + tests + small fixes | 2019-09-19 10:55:06 +02:00
LysandreJik | c10c7d59e7 | Mask computing in standalone method. Tests. | 2019-09-19 10:55:06 +02:00
LysandreJik | bf503158c5 | Sentence -> Sequence. Removed output_mask from the special token addition methods. | 2019-09-19 10:55:06 +02:00
LysandreJik | 8cba057260 | Doc + remove artefacts | 2019-09-19 10:55:06 +02:00
LysandreJik | dcc9bb3252 | Modified encode to return only lists. Added a more complete encode_plus method | 2019-09-19 10:55:06 +02:00
LysandreJik | af23b626c8 | Max encoding length + corresponding tests | 2019-09-19 10:55:06 +02:00
LysandreJik | d572d7027b | Number of added tokens calculator | 2019-09-19 10:55:06 +02:00
LysandreJik | 2d8ec5a684 | Changed warning to be more explicit; Co-authored by: julien_c <chaumond@gmail.com> | 2019-09-19 10:55:06 +02:00
LysandreJik | c3df2136e1 | Added binary masking tests | 2019-09-19 10:55:06 +02:00
LysandreJik | e391d4735e | Tokenizers' encode function can output binary masks | 2019-09-19 10:55:06 +02:00
thomwolf | 5c6cac102b | adding test for common properties and cleaning up a bit base class | 2019-09-05 21:31:29 +02:00
maru0kun | d737947725 | Fix typo | 2019-09-05 19:24:57 +09:00
LysandreJik | 31d3373bc9 | Appends space before special token | 2019-09-01 21:07:00 -04:00
thomwolf | fede4ef45d | fixing #1133 | 2019-09-02 02:27:39 +02:00
Thomas Wolf | f3d18c71ec | Merge pull request #1152 from epwalsh/fix-special-tokens: fix adding special tokens | 2019-08-30 23:21:59 +02:00
Thomas Wolf | d483cd8e46 | Merge pull request #1074 from huggingface/improved_testing: Shortcut to special tokens' ids - fix GPT2 & RoBERTa tokenizers - improved testing for GPT/GPT-2 | 2019-08-30 23:18:58 +02:00
Thomas Wolf | cd65c41a83 | Merge branch 'master' into xlm-tokenization | 2019-08-30 17:15:16 +02:00
thomwolf | 69da972ace | added test and debug tokenizer configuration serialization | 2019-08-30 17:09:36 +02:00
thomwolf | 88111de07c | saving and reloading tokenizer configurations | 2019-08-30 16:55:48 +02:00
thomwolf | 82462c5cba | Added option to setup pretrained tokenizer arguments | 2019-08-30 15:30:41 +02:00
Thomas Wolf | 50e615f43d | Merge branch 'master' into improved_testing | 2019-08-30 13:40:35 +02:00
thomwolf | 8faf2e086b | more doc on special tokens | 2019-08-30 13:36:22 +02:00
thomwolf | d51f72d5de | adding shortcut to the ids of all the special tokens | 2019-08-30 11:41:11 +02:00
epwalsh | 07e21307b6 | fix adding special tokens | 2019-08-29 13:44:50 -07:00
Thomas Wolf | 5f297c7be3 | Merge pull request #1087 from huggingface/fix-warnings: Decode now calls private property instead of public method | 2019-08-28 22:22:11 +02:00
Thomas Wolf | 0ecfd17f49 | Merge pull request #987 from huggingface/generative-finetuning: Generative finetuning | 2019-08-28 16:51:50 +02:00
thomwolf | a175a9dc01 | add kwargs to base encode function | 2019-08-27 14:05:59 +02:00
LysandreJik | 529a16dec6 | Generic encoding implementation. | 2019-08-26 15:00:43 -04:00
thomwolf | 3bcbebd440 | max_len_single_sentence & max_len_sentences_pair as attributes so they can be modified | 2019-08-23 22:07:26 +02:00
thomwolf | 47d6853439 | adding max_lengths for single sentences and sentences pairs | 2019-08-23 17:31:11 +02:00
Abhishek Rao | c603d099aa | reraise EnvironmentError in from_pretrained functions of Model and Tokenizer | 2019-08-22 15:25:40 -07:00
LysandreJik | 2ba1a14fb0 | Decode now calls private property instead of public method | 2019-08-22 17:25:55 -04:00
Thomas Wolf | e4515faf54 | Merge pull request #1057 from huggingface/fixes: Add a few of typos corrections, bugs fixes and small improvements | 2019-08-21 01:54:05 +02:00
Thomas Wolf | d30cbaf5dc | Merge branch 'master' into iterative_split_on_token | 2019-08-21 01:33:02 +02:00
thomwolf | 43489756ad | adding proxies options for the from_pretrained methods | 2019-08-20 16:59:11 +02:00
thomwolf | fecaed0ed4 | add force_download option to from_pretrained methods | 2019-08-20 10:56:12 +02:00
LysandreJik | 572dcfd1db | Doc | 2019-08-14 14:56:14 -04:00