Thomas Wolf
477ec4b6cc
Merge pull request #337 from CatalinVoss/patch-2
Allow tokenization of sequences > 512 for caching
2019-03-06 09:45:49 +01:00
Thomas Wolf
7b9e5a54b5
Merge pull request #327 from lukovnikov/master
Issue#324: warmup linear fixes
2019-03-06 09:44:56 +01:00
Catalin Voss
4a49c22584
Warn instead of raising in BERT and GPT-2 tokenizers as well, to allow for pre-caching of tokens
2019-03-05 12:31:45 -08:00
Aaron Mangum
0c970caa4a
catch exception if pathlib is not installed
2019-03-04 14:30:19 -08:00
Catalin Voss
9775b2eb27
Allow tokenization of sequences > 512 for caching
For many applications requiring randomized data access, it's easier to cache the tokenized representations than the words. So why not turn this into a warning?
2019-03-02 16:30:21 -08:00
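The change described above turns the tokenizer's hard length check into a warning, so over-long sequences can still be converted to ids for caching. A minimal stand-alone sketch of that behavior (a simplified illustration, not the library's actual code):

```python
import logging

logger = logging.getLogger(__name__)

def convert_tokens_to_ids(vocab, tokens, max_len=512):
    """Map tokens to ids; warn (rather than raise) on over-long sequences.

    Sequences longer than max_len cannot be fed to BERT directly, but
    callers may still want the full id sequence for caching purposes.
    """
    ids = [vocab[token] for token in tokens]
    if len(ids) > max_len:
        logger.warning(
            "Token indices sequence length is longer than the specified "
            "maximum sequence length for this model (%d > %d). Running this "
            "sequence through the model will result in indexing errors",
            len(ids), max_len,
        )
    return ids
```

The caller gets the full id list back either way; only feeding it to the model would fail.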
John Hewitt
4d1ad83236
update docstring of BERT tokenizer to reflect do_wordpiece_only
2019-02-27 14:50:41 -08:00
lukovnikov
35410da758
added warning
2019-02-27 17:11:42 +01:00
lukovnikov
4d79e0d386
added warning
2019-02-27 16:50:05 +01:00
lukovnikov
66a84b63b0
added warning
2019-02-27 16:38:00 +01:00
lukovnikov
070f3b21d8
added warning
2019-02-27 16:26:45 +01:00
lukovnikov
46ef646016
added warning
2019-02-27 16:22:27 +01:00
lukovnikov
9bc3773c84
added warning
2019-02-27 16:10:31 +01:00
lukovnikov
60a372387f
added warning
2019-02-27 15:54:09 +01:00
John Hewitt
e14c6b52e3
add BertTokenizer flag to skip basic tokenization
2019-02-26 20:11:24 -08:00
lukovnikov
da2d8ca265
fix for negative learning rate with warmup_linear in BertAdam (happens when t_total is specified incorrectly)
+ copied BERT optimization warmup functions to OpenAI optimization file + added comments
2019-02-26 17:16:06 +01:00
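The bug this commit fixes: the linear-decay part of the `warmup_linear` schedule goes negative once training runs past `t_total` (i.e. the normalized progress `x` exceeds 1.0), producing a negative learning rate. A sketch of the old and clamped schedules, assuming the standard `warmup_linear(x, warmup)` signature:

```python
def warmup_linear_old(x, warmup=0.002):
    # Original schedule: goes NEGATIVE once x (= step / t_total) exceeds 1.0,
    # which happens when t_total is specified too small.
    if x < warmup:
        return x / warmup
    return 1.0 - x

def warmup_linear_fixed(x, warmup=0.002):
    # Fixed schedule: same linear warmup, but the linear decay is clamped
    # at 0 so the learning-rate multiplier can never become negative.
    if x < warmup:
        return x / warmup
    return max((x - 1.0) / (warmup - 1.0), 0.0)
```

With the fix, overshooting `t_total` simply freezes the learning rate at zero instead of reversing the updates.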
lukovnikov
e04bab59e1
fix for negative learning rate with warmup_linear in BertAdam (happens when t_total is specified incorrectly)
+ copied BERT optimization warmup functions to OpenAI optimization file + added comments
2019-02-26 16:22:52 +01:00
Joel Grus
8722e9eb3b
finish updating docstrings
2019-02-23 06:31:59 -08:00
Joel Grus
33aa7a80ca
update documentation
2019-02-22 15:37:59 -08:00
Yongbo Wang
2fdab323d1
typo
2019-02-20 21:11:06 +08:00
Yongbo Wang
813e4d18ba
typo
2019-02-20 21:10:07 +08:00
thomwolf
e0855e8929
forgot to add regex to requirements :(
2019-02-18 11:54:51 +01:00
thomwolf
ab7f5d2943
simple
2019-02-18 11:33:54 +01:00
thomwolf
b450a7faf2
clean up tokenization - fix python 2 tests
2019-02-18 11:27:18 +01:00
thomwolf
d44db1145c
update readme
2019-02-18 11:12:09 +01:00
thomwolf
690a0dbf36
fix example - masking
2019-02-18 10:50:30 +01:00
thomwolf
fbb248a2e4
examples testing
2019-02-18 01:28:18 +01:00
thomwolf
5ff0c60505
language update
2019-02-18 00:55:47 +01:00
thomwolf
210d407245
updating init
2019-02-18 00:55:39 +01:00
thomwolf
009ee86a19
fix tests - bump up version
2019-02-17 23:57:23 +01:00
thomwolf
ffd623823d
adding gpt2
2019-02-17 23:38:51 +01:00
Dan Hendrycks
434d15da8e
Update activation function docstring
2019-02-16 12:17:52 -08:00
thomwolf
321d70a7a9
bump up to 0.5.1
2019-02-13 10:11:20 +01:00
thomwolf
c6bea08448
OpenAI GPT Tokenizer can fallback on using BERT BasicTokenizer
2019-02-13 10:11:00 +01:00
thomwolf
e7cfc46fc1
fix TransfoXLModel loading
2019-02-13 09:32:46 +01:00
thomwolf
e8fe6b7140
adapting transfo tokenizer to transposed inputs
2019-02-11 13:30:04 +01:00
thomwolf
884ca81d87
transposing the inputs of Transformer-XL to have a unified interface
2019-02-11 13:19:59 +01:00
thomwolf
2071a9b86e
fix python 2.7 imports
2019-02-11 10:35:36 +01:00
thomwolf
b514a60c36
added tests for OpenAI GPT and Transformer-XL tokenizers
2019-02-11 10:17:16 +01:00
thomwolf
f0bf81e141
back compatibility with Path inputs in file_utils
2019-02-09 17:05:23 +01:00
thomwolf
6cd769957e
update transfo xl example
2019-02-09 16:59:17 +01:00
thomwolf
1320e4ec0c
mc_token_mask => mc_token_ids
2019-02-09 16:58:53 +01:00
thomwolf
cfcb95417c
fix hasattr
2019-02-08 23:08:53 +01:00
thomwolf
1756b5e956
fix loading from Transfo-XL LM model
2019-02-08 22:32:17 +01:00
thomwolf
dadd0c1b13
updating __main__
2019-02-08 22:31:57 +01:00
thomwolf
102c6b238c
adding file cache to __init__
2019-02-08 22:31:46 +01:00
thomwolf
80607874c1
fix layer norm epsilon in OpenAI GPT
2019-02-08 21:49:05 +01:00
thomwolf
5ee4f17234
adding option to load on cpu
2019-02-08 10:37:40 +01:00
thomwolf
777459b471
get OpenAI example running
2019-02-08 10:33:14 +01:00
thomwolf
edcb56fd96
more explicit variable name
2019-02-08 09:54:49 +01:00
thomwolf
eb8fda51f4
update docstrings
2019-02-07 23:15:20 +01:00
thomwolf
f99f2fb661
docstrings
2019-02-07 17:07:22 +01:00
thomwolf
438db43d46
update adaptive softmax head
2019-02-07 17:07:15 +01:00
thomwolf
c306869ea2
add two transformer xl models
2019-02-07 17:07:03 +01:00
thomwolf
9c3c24800b
split saved model in config & weights
2019-02-07 17:06:17 +01:00
thomwolf
ed47cb6cba
fixing transfo eval script
2019-02-06 16:22:17 +01:00
thomwolf
973926431e
fix differences with the TensorFlow version (mem cells and adaptive softmax clusters)
2019-02-06 15:42:29 +01:00
Thomas Wolf
848aae49e1
Merge branch 'master' into python_2
2019-02-06 00:13:20 +01:00
thomwolf
448937c00d
python 2 compatibility
2019-02-06 00:07:46 +01:00
thomwolf
822915142b
fix docstring
2019-02-05 16:34:32 +01:00
Thibault Fevry
f3bda2352a
Only keep the active part of the loss for token classification
2019-02-04 11:46:36 -05:00
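The "active loss" fix above restricts the token-classification loss to positions where the attention mask is 1, so padding tokens no longer contribute. A framework-agnostic numpy sketch of the idea (the real code uses PyTorch tensors and `CrossEntropyLoss`):

```python
import numpy as np

def masked_token_loss(logits, labels, attention_mask):
    """Cross-entropy over only the 'active' (non-padding) token positions.

    logits:         (seq_len, num_labels) unnormalized scores
    labels:         (seq_len,) gold label ids
    attention_mask: (seq_len,) 1 for real tokens, 0 for padding
    """
    active = attention_mask == 1
    active_logits = logits[active]          # keep only real tokens
    active_labels = labels[active]
    # standard softmax cross-entropy over the active positions
    shifted = active_logits - active_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(active_labels)), active_labels].mean()
```

Without the mask, confident-but-wrong predictions on padding positions would dominate the loss.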
thomwolf
6179f537a3
clean up tokenization spaces
2019-02-04 17:41:22 +01:00
thomwolf
850da1cc36
strip decoded outputs
2019-02-04 17:35:05 +01:00
thomwolf
01a3966bc6
more options on special tokens
2019-02-04 17:26:25 +01:00
thomwolf
05f961840b
logging
2019-02-04 13:06:19 +01:00
thomwolf
3a848111e6
update config, docstrings and readme to switch to separated tokens and position embeddings
2019-01-29 11:00:11 +01:00
thomwolf
98c96fb1a7
splitting position and tokens embeddings in OpenAI GPT - updating tf imports - tests
2019-01-29 10:31:42 +01:00
thomwolf
5456d82311
more versatile model loading
2019-01-29 09:54:18 +01:00
thomwolf
9b2540b5a7
update __init__
2019-01-29 09:54:08 +01:00
thomwolf
bd3b3aee9c
update
2019-01-28 17:47:29 +01:00
thomwolf
b12616fd8e
updating code organization to fix imports
2019-01-28 17:03:39 +01:00
thomwolf
d77dd62ff8
directly load from TF checkpoints + code cleanup
2019-01-28 16:50:23 +01:00
thomwolf
9c35c132fa
apex LayerNorm
2019-01-17 09:19:19 +01:00
thomwolf
b9c77b98d5
fix transposition in model conversion and memory initialization
2019-01-17 00:33:21 +01:00
thomwolf
009101de12
fix loading bug and check full conversion of model
2019-01-16 12:16:20 +01:00
thomwolf
fea15cc9f5
update model conversion
2019-01-16 11:54:54 +01:00
thomwolf
c03c12687f
fix __main__ entry script
2019-01-16 10:55:22 +01:00
thomwolf
8831c68803
fixing various parts of model conversion, loading and weights sharing
2019-01-16 10:31:16 +01:00
thomwolf
a69ec2c722
improved corpus and tokenization conversion - added evaluation script
2019-01-15 23:17:46 +01:00
thomwolf
7d03c53718
conversion working
2019-01-15 16:07:25 +01:00
thomwolf
3a9c88377f
adding Transformer XL
2019-01-15 12:59:38 +01:00
nhatchan
cd30565aed
Fix importing unofficial TF models
Importing unofficial TF models seems to be working well, at least for me.
This PR resolves #50.
2019-01-14 13:35:40 +09:00
thomwolf
e5c78c6684
update readme and few typos
2019-01-10 01:40:00 +01:00
thomwolf
ab90d4cddd
adding docs and example for OpenAI GPT
2019-01-09 00:12:43 +01:00
thomwolf
dc5df92fa8
added LM head for OpenAI
2019-01-08 17:18:47 +01:00
thomwolf
3cf12b235a
added tests + fixed losses
2019-01-08 16:24:23 +01:00
thomwolf
eed51c5bdf
add OpenAI GPT
2019-01-08 12:26:58 +01:00
WrRan
3f60a60eed
text in never_split should not be lowercased
2019-01-08 13:33:57 +08:00
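The point of the two `never_split` commits above: special tokens must be passed through verbatim, or lowercasing would stop them from matching their vocab entries. A simplified sketch of the behavior (the real `BasicTokenizer` also handles punctuation and accents):

```python
def basic_tokenize(text, do_lower_case=True,
                   never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")):
    """Whitespace-split, lowercasing everything EXCEPT tokens in never_split.

    Simplified sketch: tokens listed in never_split are kept verbatim so
    they still match the special-token entries in the vocabulary.
    """
    tokens = []
    for token in text.strip().split():
        if do_lower_case and token not in never_split:
            token = token.lower()
        tokens.append(token)
    return tokens
```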
WrRan
751beb9e73
never split some text
2019-01-08 10:54:51 +08:00
thomwolf
793dcd236b
Merge branch 'master' of https://github.com/huggingface/pytorch-pretrained-BERT into fifth-release
2019-01-07 13:37:55 +01:00
thomwolf
93f563b8a8
adding OpenAI GPT
2019-01-07 12:55:36 +01:00
Thomas Wolf
e048c7f1c8
Merge pull request #171 from donglixp/patch-1
LayerNorm initialization
2019-01-07 12:44:46 +01:00
Thomas Wolf
bcd607542c
Merge pull request #145 from wlhgtc/master
Correct the wrong note
2019-01-07 12:23:05 +01:00
Li Dong
d0d9b384f2
LayerNorm initialization
The LayerNorm gamma and beta should be initialized by .fill_(1.0) and .zero_().
reference links:
989e78c412/tensorflow/contrib/layers/python/layers/layers.py (L2298)
989e78c412/tensorflow/contrib/layers/python/layers/layers.py (L2308)
2019-01-07 15:51:33 +08:00
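The correction above: LayerNorm's scale (gamma) should start at 1 and its shift (beta) at 0, so the layer initially performs plain normalization. A numpy sketch of the intended initialization (the real code is a PyTorch module using `.fill_(1.0)` and `.zero_()`):

```python
import numpy as np

class LayerNorm:
    def __init__(self, hidden_size, eps=1e-12):
        # Initialized as in the fix: ones and zeros rather than random
        # values, so training starts from an identity-like normalization.
        self.gamma = np.ones(hidden_size)
        self.beta = np.zeros(hidden_size)
        self.eps = eps

    def __call__(self, x):
        mu = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta
```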
wlhgtc
e626eecc25
Update modeling.py
2018-12-22 20:26:05 +08:00
Grégory Châtel
7176674849
Fixing various class documentations.
2018-12-20 13:11:17 +01:00
Thomas Wolf
7fb94ab934
Merge pull request #127 from patrick-s-h-lewis/tokenizer-error-on-long-seqs
raises value error for bert tokenizer for long sequences
2018-12-19 10:29:17 +01:00
Patrick Sodré
87c1244c7d
Convert scripts into entry_points
The recommended approach to create launch scripts is to use entry_points
and console_scripts.
xref: https://packaging.python.org/guides/distributing-packages-using-setuptools/#scripts
2018-12-19 02:26:08 +00:00
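The packaging change above replaces shipped script files with `console_scripts` entry points, so pip installs a launcher on the PATH that calls a Python function. A hypothetical sketch of the relevant `setup.py` fragment (the command and module names here are illustrative, not necessarily the ones the PR used):

```python
# In setup.py one would pass: setup(..., entry_points=ENTRY_POINTS)
ENTRY_POINTS = {
    "console_scripts": [
        # format: "<command-name>=<module.path>:<function>"
        # installs a `pytorch_pretrained_bert` command that runs main()
        "pytorch_pretrained_bert=pytorch_pretrained_bert.__main__:main",
    ]
}
```

Compared to raw scripts, entry points get correct shebangs and `.exe` wrappers generated per platform at install time.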
Julien Chaumond
d57763f582
Fix typos
2018-12-18 19:23:22 -05:00
Patrick Lewis
78cf7b4ab4
added code to raise value error for bert tokenizer for convert_tokens_to_indices
2018-12-18 14:41:30 +00:00
thomwolf
4a4b0e5783
remove logging.basicConfig from library code
2018-12-14 14:46:25 +01:00
thomwolf
ae88eb88a4
set encoding to 'utf-8' in calls to open
2018-12-14 13:48:58 +01:00
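The fix above (and the later UnicodeDecodeError fix) amounts to passing an explicit encoding to every `open()` call, so reading vocab and config files does not depend on the platform's default locale, which may be ASCII and choke on bytes like 0xc2. A minimal sketch:

```python
def read_text(path):
    # Explicit encoding: never rely on the locale-dependent default,
    # which raises UnicodeDecodeError on non-ASCII bytes under an
    # ASCII locale.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```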
thomwolf
52c53f39d0
clean up apex integration
2018-12-13 13:02:17 +01:00
thomwolf
d23eed85bb
model loading apex modification
2018-12-13 12:53:17 +01:00
thomwolf
1cbb32a542
include version number + comment in setup.py
2018-12-13 12:50:44 +01:00
thomwolf
ce52177638
added version in __init__.py
2018-12-13 12:50:44 +01:00
thomwolf
93f335ef86
add pretrained loading from state_dict
2018-12-13 12:48:13 +01:00
thomwolf
13bf0d4659
fixing Adam weights skip in TF convert script
2018-12-13 12:48:13 +01:00
Thomas Wolf
91aab2a6d3
Merge pull request #116 from FDecaYed/deyuf/fp16_with_apex
Change to use apex for better fp16 and multi-gpu support
2018-12-13 12:32:37 +01:00
Thomas Wolf
32a227f507
Merge pull request #113 from hzhwcmhf/master
fix compatibility with python 3.5.2
2018-12-13 12:15:15 +01:00
Thomas Wolf
ffe9075f48
Merge pull request #96 from rodgzilla/multiple-choice-code
BertForMultipleChoice and Swag dataset example.
2018-12-13 12:05:11 +01:00
Deyu Fu
3b0a14b761
add fallback path for apex used in modeling.py
2018-12-12 15:05:45 -08:00
Deyu Fu
c8ea286048
change to apex for better fp16 and multi-gpu support
2018-12-11 17:13:58 -08:00
hzhwcmhf
485adde742
add pathlib support for file_utils.py on python 3.5
2018-12-11 22:49:19 +08:00
hzhwcmhf
bc659f86ad
fix compatibility with python 3.5.2; convert path to str
2018-12-11 20:18:56 +08:00
thomwolf
1df6f26214
Merge branch 'fourth-release' of https://github.com/huggingface/pytorch-pretrained-BERT into fourth-release
2018-12-11 12:20:31 +01:00
thomwolf
770f805ae5
include version number + comment in setup.py
2018-12-11 12:20:22 +01:00
thomwolf
ed3b62cd3b
added version in __init__.py
2018-12-11 12:12:08 +01:00
Thomas Wolf
632f2d2df9
Merge branch 'master' into fourth-release
2018-12-11 06:00:53 -05:00
thomwolf
270fa2f20b
add pretrained loading from state_dict
2018-12-11 11:50:38 +01:00
Li Li
81e1e2489f
Fix optimizer to work with horovod
2018-12-10 02:08:38 -08:00
thomwolf
68f77303b2
fixing Adam weights skip in TF convert script
2018-12-09 16:17:11 -05:00
Grégory Châtel
fc5a38ac92
Adding the BertForMultipleChoice class.
2018-12-06 18:42:23 +01:00
thomwolf
511bce58bd
update new token classification model
2018-11-30 22:56:02 +01:00
thomwolf
d787c6be8c
improve docstrings and fix new token classification model
2018-11-30 22:55:26 +01:00
thomwolf
ed302a73f4
add new token classification model
2018-11-30 22:55:03 +01:00
thomwolf
d6f06c03f4
fixed loading pre-trained tokenizer from directory
2018-11-30 14:09:06 +01:00
thomwolf
532a81d3d6
fixed docstrings
2018-11-30 13:57:01 +01:00
thomwolf
296f006132
added BertForTokenClassification model
2018-11-30 13:56:53 +01:00
thomwolf
298107fed7
Added new bert models
2018-11-30 13:56:02 +01:00
thomwolf
32167cdf4b
remove convert_to_unicode and printable_text from examples
2018-11-26 23:33:22 +01:00
thomwolf
05053d163c
update cache_dir in readme and examples
2018-11-26 10:45:13 +01:00
thomwolf
63ae5d2134
added cache_dir option in from_pretrained
2018-11-26 10:21:56 +01:00
thomwolf
ebaacba38b
fixing typo in docstring
2018-11-26 09:55:15 +01:00
thomwolf
870d71636e
fixing target size in crossentropy losses
2018-11-26 09:51:34 +01:00
thomwolf
982339d829
fixing unicode error
2018-11-23 12:22:12 +01:00
weiyumou
37b6c9b21b
Fixed UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 3793: ordinal not in range(128)
2018-11-19 23:01:28 -05:00
thomwolf
757750d6f6
fix tests
2018-11-17 11:58:14 +01:00
thomwolf
886cb49792
updating readme and notebooks
2018-11-16 14:31:15 +01:00
thomwolf
1de35b624b
preparing for first release
2018-11-15 20:56:10 +01:00