transformers

mirror of https://github.com/huggingface/transformers.git synced 2025-08-02 19:21:31 +06:00

Author	SHA1	Message	Date
LysandreJik	c10c7d59e7	Mask computing in standalone method. Tests.	2019-09-19 10:55:06 +02:00
LysandreJik	bf503158c5	Sentence -> Sequence. Removed output_mask from the special token addition methods.	2019-09-19 10:55:06 +02:00
LysandreJik	e391d4735e	Tokenizers' encode function can output binary masks	2019-09-19 10:55:06 +02:00
Shijie Wu	a15562e170	Fix reference of import when called for the second time	2019-09-03 18:27:29 -07:00
LysandreJik	6ae0bb5291	XLM 100 different URLs	2019-08-31 14:46:31 -04:00
Thomas Wolf	d483cd8e46	Merge pull request #1074 from huggingface/improved_testing Shortcut to special tokens' ids - fix GPT2 & RoBERTa tokenizers - improved testing for GPT/GPT-2	2019-08-30 23:18:58 +02:00
thomwolf	8678ff8df5	adding 17 and 100 xlm models	2019-08-30 16:26:04 +02:00
thomwolf	82462c5cba	Added option to setup pretrained tokenizer arguments	2019-08-30 15:30:41 +02:00
Thomas Wolf	50e615f43d	Merge branch 'master' into improved_testing	2019-08-30 13:40:35 +02:00
thomwolf	f8aace6bcd	update tokenizers to use self.XX_token_id instead of converting self.XX_token	2019-08-30 13:39:52 +02:00
Shijie Wu	ca4baf8ca1	Match order of casing in OSS XLM; Improve document; Clean up dependency	2019-08-27 20:03:18 -04:00
Shijie Wu	e85123d398	Add custom tokenizer for zh and ja	2019-08-23 20:27:52 -04:00
thomwolf	3bcbebd440	max_len_single_sentence & max_len_sentences_pair as attributes so they can be modified	2019-08-23 22:07:26 +02:00
Shijie Wu	436ce07218	Tokenization behave the same as original XLM preprocessing for most languages except zh, ja and th; Change API to allow specifying language in `tokenize`	2019-08-23 14:40:17 -04:00
thomwolf	47d6853439	adding max_lengths for single sentences and sentences pairs	2019-08-23 17:31:11 +02:00
Guillem García Subies	388e3251fa	Update tokenization_xlm.py	2019-08-20 14:19:39 +02:00
Guillem García Subies	bfd75056b0	Update tokenization_xlm.py	2019-08-20 14:06:17 +02:00
LysandreJik	22ac004a7c	Added documentation and changed parameters for special_tokens_sentences_pair.	2019-08-12 15:13:53 -04:00
LysandreJik	14e970c271	Tokenization encode/decode class-based sequence handling	2019-08-09 15:01:38 -04:00
thomwolf	1849aa7d39	update readme and pretrained model weight files	2019-07-16 15:11:29 +02:00
thomwolf	15d8b1266c	update tokenizer - update squad example for xlnet	2019-07-15 17:30:42 +02:00
LysandreJik	7fdbc47822	Added the two CLM XLM pretrained checkpoints. Fixed file extensions for config/vocab/merges of XLM models.	2019-07-10 19:37:24 -04:00
LysandreJik	dee3e45b93	Fixed XLM weights conversion script. Added 5 new checkpoints for XLM.	2019-07-10 19:04:21 -04:00
LysandreJik	f773faa258	Fixed all links. Removed TPU. Changed CLI to Converting TF models. Many minor formatting adjustments. Added "TODO Lysandre filled" where necessary.	2019-07-10 14:45:56 -04:00
thomwolf	d5481cbe1b	adding tests to examples - updating summary module - coverage update	2019-07-09 15:29:42 +02:00
thomwolf	b19786985d	unified tokenizer api and serialization + tests	2019-07-09 10:25:18 +02:00
thomwolf	36bca545ff	tokenization abstract class - tests for examples	2019-07-05 15:02:59 +02:00
thomwolf	0bab55d5d5	[BIG] name change	2019-07-05 11:55:36 +02:00

28 Commits