transformers

mirror of https://github.com/huggingface/transformers.git synced 2025-07-21 05:28:21 +06:00

Author	SHA1	Message	Date
Patrick von Platen	f5b50c6b8e	make style	2020-02-25 16:41:54 +01:00
Patrick von Platen	ec16142ee5	add special tokens to pretrain configs of respective lm head models	2020-02-25 16:37:59 +01:00
Patrick von Platen	e645dcbb70	add special tokens to pretrain configs of respective lm head models	2020-02-25 16:37:56 +01:00
Julien Chaumond	e693cd1e87	[ci] Run slow tests every day	2020-02-24 19:54:47 -05:00
Julien Chaumond	4fc63151af	[ci] Attempt to fix #2844	2020-02-24 19:51:34 -05:00
Lysandre Debut	b90745c590	Test correct tokenizers after default switch (#3003 )	2020-02-24 18:45:53 -05:00
Lysandre Debut	3716c3d8af	False by default (#3002 )	2020-02-24 18:30:57 -05:00
Lysandre	f9ec5ca90b	Release: v2.5.1	2020-02-24 18:22:54 -05:00
Funtowicz Morgan	4cd9c0971c	Fix for fast tokenizers save_pretrained compatibility with Python. (#2933 ) * Renamed file generate by tokenizers when calling save_pretrained to match python. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added save_vocabulary tests. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Remove python quick and dirty fix for clean Rust impl. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Bump tokenizers dependency to 0.5.1 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * TransfoXLTokenizerFast uses a json vocabulary file + warning about incompatibility between Python and Rust Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added some save_pretrained / from_pretrained unittests. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Update tokenizers to 0.5.2 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Quality and format. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * flake8 Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Making sure there is really a bug in unittest * Fix TransfoXL constructor vocab_file / pretrained_vocab_file mixin. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-24 18:20:42 -05:00
Sandro Cavallari	ee60840ee6	fix _update_memory fn call in transformer-xl (#2971 )	2020-02-24 17:50:24 -05:00
Patrick von Platen	6a50d501ec	add explaining example to XLNet LM modeling (#2997 ) * add explaining example to XLNet LM modeling * improve docstring for xlnet	2020-02-24 15:42:38 -05:00
Patrick von Platen	65d74c4965	Add preprocessing step for transfo-xl tokenization to avoid tokenizing words followed by punctuation to <unk> (#2987 ) * add preprocessing to add space before punctuation for transfo_xl * improve warning messages * make style * compile regex at instantiation of tokenizer object	2020-02-24 15:11:10 -05:00
Bram Vanroy	a143d9479e	Add local_files_only parameter to pretrained items (#2930 ) * Add disable_outgoing to pretrained items Setting disable_outgoing=True disables outgoing traffic: - etags are not looked up - models are not downloaded * parameter name change * Remove forgotten print	2020-02-24 14:58:15 -05:00
Manuel Romero	286d1ec746	Create README.md	2020-02-24 14:33:49 -05:00
Lysandre Debut	7984a70ee4	kwargs are passed to both model and configuration in AutoModels (#2998 )	2020-02-24 14:19:39 -05:00
Lysandre Debut	21d8b6a33e	Testing that batch_encode_plus is the same as encode_plus (#2973 ) * Testing that encode_plus and batch_encode_plus behave the same way Spoiler alert: they don't * Testing rest of arguments in batch_encode_plus * Test tensor return in batch_encode_plus * Addressing Sam's comments * flake8 * Simplified with `num_added_tokens`	2020-02-24 12:09:46 -05:00
Patrick von Platen	17c45c39ed	Add slow generate tests for pretrained lm models (#2909 ) * add slow generate lm_model tests * fix conflicts * merge conflicts * fix conflicts * add slow generate lm_model tests * make style * delete unused variable * fix conflicts * fix conflicts * fix conflicts * delete unused variable * fix conflicts * finished hard coded tests	2020-02-24 11:51:57 -05:00
Lysandre Debut	8194df8e0c	Warning on `add_special_tokens` (#2966 ) Warning on `add_special_tokens` when passed to `encode`, `encode_plus` and `batch_encode_plus`	2020-02-24 08:42:54 -05:00
Patrick von Platen	38f5fe9e02	add_ctags_to_git_ignore (#2984 )	2020-02-23 16:55:32 -05:00
Martin Malmsten	105dcb4162	Now passes style guide enforcement	2020-02-23 21:47:59 +01:00
Martin Malmsten	33eb8a165d	Added ,	2020-02-23 21:43:31 +01:00
Martin Malmsten	869b66f6b3	* Added support for Albert when fine-tuning for NER * Added support for Albert in NER pipeline * Added command-line options to examples/ner/run_ner.py to better control tokenization * Added class AlbertForTokenClassification * Changed output for NerPipeline to use .convert_ids_to_tokens(...) instead of .decode(...) to better reflect tokens	2020-02-23 21:13:03 +01:00
Sam Shleifer	129f0604ac	Delete untested, broken Model2LSTM (#2968 )	2020-02-23 11:28:48 -05:00
Lysandre Debut	0e84559d64	Correct `special_tokens_mask` when `add_special_tokens=False` (#2965 ) Don't know of a use case where that would be useful, but this is more consistent	2020-02-23 09:50:39 -05:00
Sam Shleifer	92487a1dc0	Bart: fix layerdrop and cached decoder_input_ids for generation (#2969 )	2020-02-22 16:25:04 -05:00
Joe Davison	c36416e53c	Add standardized get_vocab method to tokenizers	2020-02-22 12:09:01 -05:00
saippuakauppias	cafc4dfc7c	fix hardcoded path in examples readme	2020-02-22 11:12:38 -05:00
Malte Pietsch	34b4b5a9ed	Update modelcard of bert-base-german-cased Add image	2020-02-22 11:08:42 -05:00
Manuel Romero	7df12d7bf8	Update README.md - I added an example using the model with pipelines to show that we have set```{"use_fast": False}``` in the tokenizer. - I added a Colab to play with the model and pipelines - I added a Colab to discover Huggingface pipelines at the end of the document	2020-02-22 11:06:41 -05:00
Funtowicz Morgan	cc6775cdf5	Fix max_length not taken into account when using pad_to_max_length on fast tokenizers (#2961 ) * enable_padding should pad up to max_length if set. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Added more testing on padding. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-22 09:27:47 -05:00
Lysandre Debut	94ff2d6ee8	Remove double bias (#2958 )	2020-02-21 17:10:18 -05:00
Sam Shleifer	b5b3445c4f	Only use F.gelu for torch >=1.4.0 (#2955 ) * Only use F.gelu for torch >=1.4.0 * Use F.gelu for newer torch	2020-02-21 16:10:21 -05:00
Patrick von Platen	fc38d4c86f	Improve special_token_id logic in run_generation.py and add tests (#2885 ) * improving generation * finalized special token behaviour for no_beam_search generation * solved modeling_utils merge conflict * solve merge conflicts in modeling_utils.py * add run_generation improvements from PR #2749 * adapted language generation to not use hardcoded -1 if no padding token is available * remove the -1 removal as hard coded -1`s are not necessary anymore * add lightweight language generation testing for randomly initialized models - just checking whether no errors are thrown * add slow language generation tests for pretrained models using hardcoded output with pytorch seed * delete ipdb * check that all generated tokens are valid * renaming * renaming Generation -> Generate * make style * updated so that generate_beam_search has same token behavior than generate_no_beam_search * consistent return format for run_generation.py * deleted pretrain lm generate tests -> will be added in another PR * cleaning of unused if statements and renaming * run_generate will always return an iterable * make style * consistent renaming * improve naming, make sure generate function always returns the same tensor, add docstring * add slow tests for all lmhead models * make style and improve example comments modeling_utils * better naming and refactoring in modeling_utils * improving generation * finalized special token behaviour for no_beam_search generation * solved modeling_utils merge conflict * solve merge conflicts in modeling_utils.py * add run_generation improvements from PR #2749 * adapted language generation to not use hardcoded -1 if no padding token is available * remove the -1 removal as hard coded -1`s are not necessary anymore * add lightweight language generation testing for randomly initialized models - just checking whether no errors are thrown * add slow language generation tests for pretrained models using hardcoded output with pytorch seed * delete ipdb * check that all generated tokens are valid * renaming * renaming Generation -> Generate * make style * updated so that generate_beam_search has same token behavior than generate_no_beam_search * consistent return format for run_generation.py * deleted pretrain lm generate tests -> will be added in another PR * cleaning of unused if statements and renaming * run_generate will always return an iterable * make style * consistent renaming * improve naming, make sure generate function always returns the same tensor, add docstring * add slow tests for all lmhead models * make style and improve example comments modeling_utils * better naming and refactoring in modeling_utils * changed fast random lm generation testing design to more general one * delete in old testing design in gpt2 * correct old variable name * temporary fix for encoder_decoder lm generation tests - has to be updated when t5 is fixed * adapted all fast random generate tests to new design * better warning description in modeling_utils * better comment * better comment and error message Co-authored-by: Thomas Wolf <thomwolf@users.noreply.github.com>	2020-02-21 12:09:59 -05:00
maximeilluin	c749a543fa	Added CamembertForQuestionAnswering (#2746 ) * Added CamembertForQuestionAnswering * fixed camembert tokenizer case	2020-02-21 12:01:02 -05:00
Bram Vanroy	5211d333bb	Update modeling_tf_utils.py (#2924 ) Tensorflow does not use .eval() vs .train(). closes https://github.com/huggingface/transformers/issues/2906	2020-02-21 11:28:32 -05:00
ahotrod	3e98f27e4a	Create README.md for xlnet_large_squad (#2942 )	2020-02-21 08:54:41 -05:00
Martin Malmsten	4452b44b90	Labels are now added to model config under id2label and label2id (#2945 )	2020-02-21 08:53:05 -05:00
Sam Shleifer	53ce3854a1	New BartModel (#2745 ) * Results same as fairseq * Wrote a ton of tests * Struggled with api signatures * added some docs	2020-02-20 18:11:13 -05:00
guillaume-be	564fd75d65	Removed unused fields in DistilBert TransformerBlock (#2710 ) * Removed unused fields in DistilBert TransformerBlock	2020-02-20 16:08:21 -05:00
srush	889d3bfdbb	default arg fix (#2937 )	2020-02-20 15:31:17 -05:00
Joe Davison	197d74f988	Add get_vocab method to PretrainedTokenizer	2020-02-20 15:26:49 -05:00
Scott Gigante	ea8eba35e2	Fix InputExample docstring (#2891 )	2020-02-20 15:25:15 -05:00
Funtowicz Morgan	e2a6445ebb	Tokenizer fast warnings (#2922 ) * Remove warning when pad_to_max_length is not set. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Move RoberTa warning to RoberTa and not GPT2 base tokenizer. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-20 11:55:03 -05:00
Funtowicz Morgan	9b3093311f	Expose all constructor parameter for BertTokenizerFast (#2921 ) Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-20 11:53:32 -05:00
srush	b662f0e625	Support for torch-lightning in NER examples (#2890 ) * initial pytorch lightning commit * tested multigpu * Fix learning rate schedule * black formatting * fix flake8 * isort * isort * . Co-authored-by: Check your git settings! <chris@chris-laptop>	2020-02-20 11:50:05 -05:00
Ilias Chalkidis	ab1238393c	Update to include example of LM The model files have been updated in order to include the classification layers, based on https://github.com/huggingface/transformers/issues/2901, and now can be also used as a LM.	2020-02-20 10:57:59 -05:00
Santiago Castro	976e9afece	Add syntax highlighting to the BibTeX in README	2020-02-20 10:06:15 -05:00
Cong	cbc5705541	Fix spell: EsperBERTo, not EspertBERTo	2020-02-20 10:02:07 -05:00
Funtowicz Morgan	d490b5d500	Fast Tokenizers save pretrained should return the list of generated file paths. (#2918 ) * Correctly return the tuple of generated file(s) when calling save_pretrained Signed-off-by: Morgan Funtowicz <morgan@huggingface.co> * Quality and format. Signed-off-by: Morgan Funtowicz <morgan@huggingface.co>	2020-02-20 00:58:04 +01:00
Lysandre	2708b44ee9	Patch ALBERT with heads in TensorFlow	2020-02-19 18:46:25 -05:00

15053 Commits