transformers

mirror of https://github.com/huggingface/transformers.git synced 2025-07-16 11:08:23 +06:00

Author	SHA1	Message	Date
thomwolf	f31154cb9d	Merge branch 'xlnet'	2019-07-16 11:51:13 +02:00
thomwolf	0bab55d5d5	[BIG] name change	2019-07-05 11:55:36 +02:00
thomwolf	c41f2bad69	WIP XLM + refactoring	2019-07-03 22:54:39 +02:00
Mayhul Arora	08ff056c43	Added option to use multiple workers to create training data for lm fine tuning	2019-06-26 16:16:12 -07:00
Thomas Wolf	460d9afd45	Merge pull request #640 from Barqawiz/master: Support latest multi language bert fine tune	2019-06-14 16:57:02 +02:00
jeonsworld	a3a604cefb	Update pregenerate_training_data.py: apply the Whole Word Masking technique, following [create_pretraining_data.py](https://github.com/google-research/bert/blob/master/create_pretraining_data.py)	2019-06-10 12:17:23 +09:00
Ahmad Barqawi	c4fe56dcc0	Support latest multi-language BERT fine-tuning: fix an issue with bert-base-multilingual and add support for uncased multilingual	2019-05-27 11:27:41 +02:00
Matthew Carrigan	dbbd6c7500	Replaced some randints with cleaner randranges, and added a helpful error for users whose corpus is just one giant document.	2019-04-12 15:07:58 +01:00
jeonsworld	60005f464d	Update pregenerate_training_data.py: if randint returns rand_end, searchsorted returns a sampled_doc_index equal to current_idx. Example: cumsum_max = 30, doc_cumsum = [5, 7, 11, 19, 30], doc_lengths = [5, 2, 4, 8, 11]. With current_idx = 1, rand_start = 7 and rand_end = 35, so sentence_index = randint(7, 35) % cumsum_max. If randint returns 35, sentence_index becomes 5, and np.searchsorted then returns 1, which equals current_idx.	2019-03-30 14:50:17 +09:00
Matthew Carrigan	abb7d1ff6d	Added proper context management to ensure cleanup happens in the right order.	2019-03-21 17:50:03 +00:00
Matthew Carrigan	7d1ae644ef	Added a --reduce_memory option to the training script to keep training data on disc as a memmap rather than in memory	2019-03-21 17:02:18 +00:00
Matthew Carrigan	2bba7f810e	Added a --reduce_memory option to shelve docs to disc instead of keeping them in memory.	2019-03-21 16:50:16 +00:00
Matthew Carrigan	8733ffcb5e	Removing a couple of other old unnecessary comments	2019-03-21 14:09:57 +00:00
Matthew Carrigan	8a861048dd	Fixed up the notes on a possible future low-memory path	2019-03-21 14:08:39 +00:00
Matthew Carrigan	0ae59e662d	Reduced memory usage for pregenerating the data a lot by writing it out on the fly without shuffling - the Sampler in the finetuning script will shuffle for us.	2019-03-21 14:04:17 +00:00
Matthew Carrigan	6a9038ba53	Removed an old irrelevant comment	2019-03-21 13:36:41 +00:00
Matthew Carrigan	934d3f4d2f	Syncing up argument names between the scripts	2019-03-20 17:23:23 +00:00
Matthew Carrigan	7de5c6aa5e	PEP8 and formatting cleanups	2019-03-20 16:44:04 +00:00
Matthew Carrigan	1798e98e5a	Added final TODOs	2019-03-20 16:42:37 +00:00
Matthew Carrigan	976554a472	First commit of the new LM finetuning	2019-03-20 14:23:51 +00:00

20 Commits
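
The off-by-one described in commit 60005f464d can be checked by hand. The sketch below is not the repository's code: it is a minimal reconstruction that reuses the numbers from that commit message, assumes `rand_end = rand_start + cumsum_max - doc_lengths[current_idx]` (which reproduces the quoted `rand_end = 35`), and substitutes the standard library's `bisect_right` for `np.searchsorted` with right-side insertion (an assumption about the call). The helper name `docs_reachable` is hypothetical. Instead of drawing random numbers, it enumerates every value the draw could take and records which document each one maps to.

```python
from bisect import bisect_right
from itertools import accumulate

# Values taken from the commit message.
doc_lengths = [5, 2, 4, 8, 11]
doc_cumsum = list(accumulate(doc_lengths))  # [5, 7, 11, 19, 30]
cumsum_max = doc_cumsum[-1]                 # 30


def docs_reachable(current_idx, inclusive_end):
    """Return the set of document indices a draw could select.

    inclusive_end=True mimics random.randint(a, b), which can return b.
    inclusive_end=False mimics random.randrange(a, b), which cannot.
    """
    rand_start = doc_cumsum[current_idx]
    # Assumed formula; it yields rand_end = 35 for current_idx = 1.
    rand_end = rand_start + cumsum_max - doc_lengths[current_idx]
    stop = rand_end + 1 if inclusive_end else rand_end
    reachable = set()
    for draw in range(rand_start, stop):
        sentence_index = draw % cumsum_max
        # bisect_right stands in for a right-sided searchsorted.
        reachable.add(bisect_right(doc_cumsum, sentence_index))
    return reachable


with_randint = docs_reachable(1, inclusive_end=True)
with_randrange = docs_reachable(1, inclusive_end=False)

print("inclusive end can pick the current doc:", 1 in with_randint)
print("exclusive end can pick the current doc:", 1 in with_randrange)
```

With an inclusive upper bound, the draw 35 wraps to sentence index 5, the first sentence of document 1, so the current document can be sampled as its own "random" counterpart. An exclusive upper bound, as in the later switch to `randrange` in dbbd6c7500, leaves that value out.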