thomwolf
f31154cb9d
Merge branch 'xlnet'
2019-07-16 11:51:13 +02:00
thomwolf
0bab55d5d5
[BIG] name change
2019-07-05 11:55:36 +02:00
thomwolf
c41f2bad69
WIP XLM + refactoring
2019-07-03 22:54:39 +02:00
Mayhul Arora
08ff056c43
Added an option to use multiple workers to create training data for LM fine-tuning
2019-06-26 16:16:12 -07:00
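Since each epoch's examples are built independently, epoch generation parallelizes naturally. A minimal sketch of the idea with a worker pool; the `generate_epoch` helper and the worker count are illustrative, not the script's actual API:

```python
from multiprocessing import Pool

def generate_epoch(epoch_idx):
    # Hypothetical worker: build the training examples for one epoch
    # and write them to an epoch-specific output file.
    print(f"generating data for epoch {epoch_idx}")

if __name__ == "__main__":
    num_epochs = 10
    num_workers = 4  # e.g. a value passed via a --num_workers flag
    with Pool(num_workers) as pool:
        # Epochs are independent, so they can be generated in parallel.
        pool.map(generate_epoch, range(num_epochs))
```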
Thomas Wolf
460d9afd45
Merge pull request #640 from Barqawiz/master
...
Support latest multi-language BERT fine-tuning
2019-06-14 16:57:02 +02:00
jeonsworld
a3a604cefb
Update pregenerate_training_data.py
...
Apply the Whole Word Masking technique,
following [create_pretraining_data.py](https://github.com/google-research/bert/blob/master/create_pretraining_data.py).
2019-06-10 12:17:23 +09:00
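Whole Word Masking masks all WordPiece sub-tokens of a word together instead of masking sub-tokens independently. A minimal sketch of the grouping step, assuming the usual `##` continuation prefix; identifiers here are illustrative:

```python
import random

def whole_word_candidates(tokens):
    """Group WordPiece tokens into whole-word units for masking."""
    groups = []
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue
        # A '##' prefix marks a continuation of the previous word.
        if groups and token.startswith("##"):
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups

tokens = ["[CLS]", "un", "##affable", "fox", "[SEP]"]
for i in random.choice(whole_word_candidates(tokens)):
    tokens[i] = "[MASK]"  # every sub-token of the chosen word is masked
print(tokens)
```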
Ahmad Barqawi
c4fe56dcc0
Support latest multi-language BERT fine-tuning
...
Fix an issue with bert-base-multilingual and add support for the uncased multilingual model
2019-05-27 11:27:41 +02:00
Matthew Carrigan
dbbd6c7500
Replaced some randints with cleaner randranges, and added a helpful
...
error for users whose corpus is just one giant document.
2019-04-12 15:07:58 +01:00
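For context, `random.randint(a, b)` includes both endpoints while `random.randrange(a, b)` excludes `b`, matching Python's usual half-open ranges:

```python
import random

print(random.randint(0, 9))     # inclusive of both endpoints: 0..9
print(random.randrange(0, 10))  # half-open like range(): also 0..9
```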
jeonsworld
60005f464d
Update pregenerate_training_data.py
...
If randint returns rand_end, searchsorted returns a sampled_doc_index equal to current_idx.
Example:
cumsum_max = 30
doc_cumsum = [5, 7, 11, 19, 30]
doc_lengths = [5, 2, 4, 8, 11]
If current_idx = 1, then rand_start = 7 and rand_end = 35, so
sentence_index = randint(7, 35) % cumsum_max.
If randint returns 35, sentence_index becomes 5,
and for sentence_index = 5, np.searchsorted returns 1, which equals current_idx.
2019-03-30 14:50:17 +09:00
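A sketch of the off-by-one described above, using the commit's own numbers and assuming the `side='right'` searchsorted call implied by its example; `randrange`, with its exclusive upper bound, avoids ever drawing `rand_end`:

```python
import numpy as np
from random import randrange

doc_lengths = [5, 2, 4, 8, 11]
doc_cumsum = np.cumsum(doc_lengths)   # [ 5  7 11 19 30]
cumsum_max = int(doc_cumsum[-1])      # 30

current_idx = 1
rand_start = int(doc_cumsum[current_idx])                      # 7
rand_end = rand_start + cumsum_max - doc_lengths[current_idx]  # 35

# Buggy path: randint's inclusive upper bound lets 35 be drawn,
# and 35 % 30 == 5 maps back to the current document.
assert np.searchsorted(doc_cumsum, 35 % cumsum_max, side="right") == current_idx

# Fixed path: randrange never returns rand_end itself.
sentence_index = randrange(rand_start, rand_end) % cumsum_max
sampled_doc_index = np.searchsorted(doc_cumsum, sentence_index, side="right")
assert sampled_doc_index != current_idx
```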
Matthew Carrigan
abb7d1ff6d
Added proper context management to ensure cleanup happens in the right
...
order.
2019-03-21 17:50:03 +00:00
Matthew Carrigan
7d1ae644ef
Added a --reduce_memory option to the training script to keep training
...
data on disk as a memmap rather than in memory
2019-03-21 17:02:18 +00:00
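A minimal sketch of the memmap approach: pre-allocate a fixed-shape array backed by a file and fill it row by row, so examples live on disk rather than in RAM. Shape, dtype, and filename here are illustrative:

```python
import numpy as np

num_samples, seq_len = 100_000, 128
# The array is backed by a file; the OS pages rows in and out on demand
# instead of keeping every example resident in memory.
input_ids = np.memmap("input_ids.memmap", mode="w+",
                      dtype=np.int32, shape=(num_samples, seq_len))
input_ids[0] = np.zeros(seq_len, dtype=np.int32)  # write one example
input_ids.flush()  # push pending writes to disk
```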
Matthew Carrigan
2bba7f810e
Added a --reduce_memory option to shelve docs to disk instead of keeping them in memory.
2019-03-21 16:50:16 +00:00
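The shelve variant stores whole documents in an on-disk key-value store and loads one back only when it is sampled. Keys and contents below are illustrative; note that shelve keys must be strings:

```python
import shelve

with shelve.open("docs_shelf") as docs:
    docs["0"] = [["hello", "world"], ["another", "sentence"]]
    # Only the requested document is deserialized into memory.
    doc = docs["0"]
```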
Matthew Carrigan
8733ffcb5e
Removing a couple of other old unnecessary comments
2019-03-21 14:09:57 +00:00
Matthew Carrigan
8a861048dd
Fixed up the notes on a possible future low-memory path
2019-03-21 14:08:39 +00:00
Matthew Carrigan
0ae59e662d
Reduced memory usage for pregenerating the data substantially by writing it
...
out on the fly without shuffling; the Sampler in the finetuning script
will shuffle for us.
2019-03-21 14:04:17 +00:00
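This relies on the DataLoader in the finetuning script: with shuffle=True it uses a RandomSampler, which visits the unshuffled pregenerated data in a fresh random order each epoch. A minimal sketch:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Examples stored in creation order, exactly as written out on the fly.
dataset = TensorDataset(torch.arange(10).unsqueeze(1))
# shuffle=True makes the DataLoader use a RandomSampler under the hood.
loader = DataLoader(dataset, batch_size=2, shuffle=True)
for batch in loader:  # each epoch sees a new random order
    print(batch)
```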
Matthew Carrigan
6a9038ba53
Removed an old irrelevant comment
2019-03-21 13:36:41 +00:00
Matthew Carrigan
934d3f4d2f
Syncing up argument names between the scripts
2019-03-20 17:23:23 +00:00
Matthew Carrigan
7de5c6aa5e
PEP8 and formatting cleanups
2019-03-20 16:44:04 +00:00
Matthew Carrigan
1798e98e5a
Added final TODOs
2019-03-20 16:42:37 +00:00
Matthew Carrigan
976554a472
First commit of the new LM finetuning
2019-03-20 14:23:51 +00:00