thomwolf | b3c6ee0ac1 | tokenization updates | 2019-04-15 14:24:52 +02:00
thomwolf | 870b734bfd | added tokenizers serialization tests | 2019-04-15 12:03:56 +02:00
thomwolf | 3e65f255dc | add serialization semantics to tokenizers - fix transfo-xl tokenizer | 2019-04-15 11:47:25 +02:00
thomwolf | 1d8c232324 | Fix #436 | 2019-04-03 10:51:03 +02:00
thomwolf | 5c85fc3977 | fix typo - logger info | 2019-03-06 10:05:21 +01:00
Thomas Wolf | 477ec4b6cc | Merge pull request #337 from CatalinVoss/patch-2: Allow tokenization of sequences > 512 for caching | 2019-03-06 09:45:49 +01:00
Catalin Voss | 4a49c22584 | Warn instead of raising in BERT and GPT-2 tokenizers as well, to allow for pre-caching of tokens | 2019-03-05 12:31:45 -08:00
John Hewitt | 4d1ad83236 | update docstring of BERT tokenizer to reflect do_wordpiece_only | 2019-02-27 14:50:41 -08:00
John Hewitt | e14c6b52e3 | add BertTokenizer flag to skip basic tokenization | 2019-02-26 20:11:24 -08:00
Yongbo Wang | 813e4d18ba | typo | 2019-02-20 21:10:07 +08:00
thomwolf | edcb56fd96 | more explicit variable name | 2019-02-08 09:54:49 +01:00
Thomas Wolf | 848aae49e1 | Merge branch 'master' into python_2 | 2019-02-06 00:13:20 +01:00
thomwolf | 448937c00d | python 2 compatibility | 2019-02-06 00:07:46 +01:00
WrRan | 3f60a60eed | text in never_split should not lowercase | 2019-01-08 13:33:57 +08:00
WrRan | 751beb9e73 | never split some text | 2019-01-08 10:54:51 +08:00
Thomas Wolf | 7fb94ab934 | Merge pull request #127 from patrick-s-h-lewis/tokenizer-error-on-long-seqs: raises value error for bert tokenizer for long sequences | 2018-12-19 10:29:17 +01:00
Julien Chaumond | d57763f582 | Fix typos | 2018-12-18 19:23:22 -05:00
Patrick Lewis | 78cf7b4ab4 | added code to raise value error for bert tokenizer for covert_tokens_to_indices | 2018-12-18 14:41:30 +00:00
thomwolf | 4a4b0e5783 | remove logging.basicConfig from library code | 2018-12-14 14:46:25 +01:00
thomwolf | d6f06c03f4 | fixed loading pre-trained tokenizer from directory | 2018-11-30 14:09:06 +01:00
thomwolf | 298107fed7 | Added new bert models | 2018-11-30 13:56:02 +01:00
thomwolf | 32167cdf4b | remove convert_to_unicode and printable_text from examples | 2018-11-26 23:33:22 +01:00
thomwolf | 982339d829 | fixing unicode error | 2018-11-23 12:22:12 +01:00
weiyumou | 37b6c9b21b | Fixed UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3793: ordinal not in range(128) | 2018-11-19 23:01:28 -05:00
thomwolf | 1de35b624b | preparing for first release | 2018-11-15 20:56:10 +01:00