Commit Graph

66 Commits

Author SHA1 Message Date
Sylvain Gugger
aaaed56ffc
Trainer iterable dataset (#11254)
* IterableDatasetShard

* Test and integration in Trainer

* Update src/transformers/trainer_pt_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Style

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-04-14 17:02:26 -04:00
Stas Bekman
c6d664849b
[DeepSpeed] ZeRO Stage 3 (#10753)
* synced gpus

* fix

* fix

* need to use t5-small for quality tests

* notes

* complete merge

* fix a disappearing std stream problem

* start zero3 tests

* wip

* tune params

* sorting out the pre-trained model loading

* reworking generate loop wip

* wip

* style

* fix tests

* split the tests

* refactor tests

* wip

* parameterized

* fix

* workout the resume from non-ds checkpoint pass + test

* cleanup

* remove no longer needed code

* split getter/setter functions

* complete the docs

* suggestions

* gpus and their compute capabilities link

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* style

* remove invalid paramgd

* automatically configure zero3 params that rely on hidden size

* make _get_resized_embeddings zero3-aware

* add test exercising resize_token_embeddings()

* add docstring

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-04-08 09:53:01 -07:00
Stas Bekman
3318c246f3
make failure to find a resume checkpoint fatal + tests (#10777) 2021-03-17 11:16:37 -07:00
Stas Bekman
cd8c93f701
[DeepSpeed] improve checkpoint loading code plus tests (#10760)
* deepspeed checkpoint loading code plus tests

* style

* style
2021-03-17 10:22:58 -07:00
Sylvain Gugger
3ced9b3eb9
Check layer types for Optimizer construction (#10598)
* Check layer types for Optimizer construction

* Duplicate class
2021-03-08 16:40:11 -05:00
Sylvain Gugger
821d518e03 Revert "Tests"
This reverts commit b35e7b68ca.
2021-03-08 16:05:55 -05:00
Sylvain Gugger
4196bfeda0 Revert "Style"
This reverts commit a8ec52efc2.
2021-03-08 16:05:52 -05:00
Sylvain Gugger
a8ec52efc2 Style 2021-03-08 16:04:46 -05:00
Sylvain Gugger
b35e7b68ca Tests 2021-03-08 16:04:30 -05:00
Stas Bekman
f882966004
fix double wrapping + test (#10583) 2021-03-08 10:15:55 -05:00
Sylvain Gugger
6290169eb3
Rework TPU checkpointing in Trainer (#10504)
* Rework TPU checkpointing in Trainer

* Wraps the barrier in a dist test

* Address review comments

* Remove line
2021-03-04 11:46:11 -05:00
Tanmay Garg
256482ac92
Introduce save_strategy training argument (#10286)
* Introduce save_strategy training argument

* deprecate EvaluationStrategy

* collapse EvaluationStrategy and LoggingStrategy into a single
  IntervalStrategy enum

* modify tests to use modified enum
2021-02-27 19:34:22 -05:00
Kai Fricke
98569d4ba2
Add Ray Tune hyperparameter search integration test (#10414) 2021-02-26 10:18:33 -05:00
Stas Bekman
4eddc459a9
[trainer] implement support for full fp16 in evaluation/predict (#10268)
* implement --fp16_full_eval

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

* add test

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-02-18 17:02:35 -08:00
Stas Bekman
d9a81fc0c5
fix func signature (#10271) 2021-02-18 16:44:42 -08:00
Stas Bekman
97e688bc22
[Trainer] memory tracker metrics (#10225)
* memory tracker metrics

* go back to eval for somewhat consistency

* handle no-gpu case

* deal with stackable eval calls

* restore callback order

* style

* simplify the API

* add test

* docs

* consistently use eval_ prefix

* improve docs

* Update src/transformers/trainer_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* rename method

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-02-18 09:27:32 -08:00
Sylvain Gugger
7169d1ea7b
Store FLOS as floats to avoid overflow. (#10213) 2021-02-16 11:15:15 -05:00
Lysandre Debut
8cbd0bd137
Specify dataset dtype (#10195)
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>

Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
2021-02-15 12:57:17 -05:00
Sylvain Gugger
b4e559cfa1
Deprecate model_path in Trainer.train (#9854) 2021-01-28 08:32:46 -05:00
Sylvain Gugger
35d55b7b84
When resuming training from checkpoint, Trainer loads model (#9818)
* Whenresuming training from checkpoint, Trainer loads model

* Finish cleaning tests

* Address review comment

* Use global_step from state
2021-01-27 09:31:18 -05:00
Sylvain Gugger
5e1bea4f16
Fix Trainer with a parallel model (#9578)
* Fix Trainer with a parallel model

* More clean up
2021-01-14 03:23:41 -05:00
Sylvain Gugger
04dc65e5c6
Fix data parallelism in Trainer (#9566)
* Fix data parallelism in Trainer

* Update src/transformers/training_args.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-13 09:54:41 -05:00
Stas Bekman
9f675b05d4
[trainer] self.model_wrapped + _model_unwrap (#9390)
* model wrapped + model_unwrap

* cleanup

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

* deprecation warning

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-06 06:50:11 -05:00
Sylvain Gugger
1198ba8fba
Add timing inside Trainer (#9196)
* Add timing inside Trainer

* Fix tests

* Add n_objs for train

* Sort logs
2020-12-18 15:10:39 -05:00
Sylvain Gugger
ad895af98d
Add possibility to switch between APEX and AMP in Trainer (#9137)
* Add possibility to switch between APEX and AMP in Trainer

* Update src/transformers/training_args.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Address review comments

* Update src/transformers/training_args.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2020-12-15 16:38:10 -05:00
Sylvain Gugger
7c10dd22ae
Better support for resuming training (#8878) 2020-12-01 13:45:21 -05:00
Colin Brochtrup
8ffc01a76a
Add early stopping callback to pytorch trainer (#8581)
* Add early stopping patience and minimum threshold metric must improve to prevent early stopping to pytorch trainer

* Add early stopping test

* Set patience counter to 0 if best metric not defined yet

* Make early stopping a callback. Add callback event for updating the best metric for early stopping callback to trigger on.

* Run make style

* make funciton name sensible

* Improve new argument docstring wording and hope that flakey CI test passes.

* Use on_evaluation callback instead of custom. Remove some debug printing

* Move early stopping arguments and state into early stopping callback

* Run make style

* Remove old code

* Fix docs formatting. make style went rogue on me.

* Remove copied attributes and fix variable

* Add assertions on training arguments instead of mutating them. Move comment out of public docs.

* Make separate test for early stopping callback. Add test of invalid arguments.

* Run make style... I remembered before CI this time!

* appease flake8

* Add EarlyStoppingCallback to callback docs

* Make docstring EarlyStoppingCallabck match other callbacks.

* Fix typo in docs
2020-11-23 17:25:35 -05:00
Sylvain Gugger
4208f496ee
Better filtering of the model outputs in Trainer (#8633)
* Better filtering of the model outputs in Trainer

* Fix examples tests

* Add test for Lysandre
2020-11-19 10:43:15 -05:00
Sylvain Gugger
1e62e999e8
Fixes the training resuming with gradient accumulation (#8624) 2020-11-18 12:00:11 -05:00
Sylvain Gugger
04e442d575
Make Trainer evaluation handle dynamic seq_length (#8336)
* Make Trainer evaluation handle dynamic seq_length

* Document behavior.

* Fix test

* Better fix

* Fixes for realsies this time

* Address review comments

* Without forgetting to save...
2020-11-05 15:13:51 -05:00
Sylvain Gugger
4c19f3baab
Clean Trainer tests and datasets dep (#8268) 2020-11-03 15:50:55 -05:00
François Lagunas
e174bfeb34
TensorBoard/Wandb/optuna/raytune integration improvements. (#7935)
Improved TensorBoard and Wandb integration, as well as optuna and ray/tune support, with minor modifications to trainer core code.
2020-10-21 17:18:52 +02:00
Julien Rossi
a09fe140c1
Trainer with Iterable Dataset (#7858)
* fix 5990

* accomodate iterable dataset without predefined length
* set it as 1 use case: provide max_steps, and NO num_epochs
* Is a merge of master and PR 5995

* fix trainer test under TF

* fix only for torch
* TF trainer untouched
* trainer tests are skipped when no torch

* address comments

* fix quality checks

* remove torch.dataset from test_trainer

* unnecessary inheritance
* RegressionDataset implements all needed methods __len__ and __getitem__

* fix quality checks

* restore RegressionDataset

* was wrongly under is_torch_available()
2020-10-19 11:57:39 -04:00
Thomas Wolf
ba8c4d0ac0
[Dependencies|tokenizers] Make both SentencePiece and Tokenizers optional dependencies (#7659)
* splitting fast and slow tokenizers [WIP]

* [WIP] splitting sentencepiece and tokenizers dependencies

* update dummy objects

* add name_or_path to models and tokenizers

* prefix added to file names

* prefix

* styling + quality

* spliting all the tokenizer files - sorting sentencepiece based ones

* update tokenizer version up to 0.9.0

* remove hard dependency on sentencepiece 🎉

* and removed hard dependency on tokenizers 🎉

* update conversion script

* update missing models

* fixing tests

* move test_tokenization_fast to main tokenization tests - fix bugs

* bump up tokenizers

* fix bert_generation

* update ad fix several tokenizers

* keep sentencepiece in deps for now

* fix funnel and deberta tests

* fix fsmt

* fix marian tests

* fix layoutlm

* fix squeezebert and gpt2

* fix T5 tokenization

* fix xlnet tests

* style

* fix mbart

* bump up tokenizers to 0.9.2

* fix model tests

* fix tf models

* fix seq2seq examples

* fix tests without sentencepiece

* fix slow => fast  conversion without sentencepiece

* update auto and bert generation tests

* fix mbart tests

* fix auto and common test without tokenizers

* fix tests without tokenizers

* clean up tests lighten up when tokenizers + sentencepiece are both off

* style quality and tests fixing

* add sentencepiece to doc/examples reqs

* leave sentencepiece on for now

* style quality split hebert and fix pegasus

* WIP Herbert fast

* add sample_text_no_unicode and fix hebert tokenization

* skip FSMT example test for now

* fix style

* fix fsmt in example tests

* update following Lysandre and Sylvain's comments

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/testing_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/tokenization_utils_base.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2020-10-18 20:51:24 +02:00
Sylvain Gugger
a1d1b332d0
Add predict step accumulation (#7767)
* Add eval_accumulation_step and clean distributed eval

* Add TPU test

* Add TPU stuff

* Fix arg name

* Fix Seq2SeqTrainer

* Fix total_size

* Update src/transformers/trainer_pt_utils.py

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Doc and add test to TPU

* Add unit test

* Adapt name

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2020-10-14 11:41:45 -04:00
Sylvain Gugger
7968051aba
Fix typo 2020-10-13 17:30:46 -04:00
Sylvain Gugger
c6e18de9f8
Fix flaky test in test_trainer (#7689) 2020-10-09 20:01:15 -04:00
Sylvain Gugger
d3adb985d1
Expand test to locate flakiness (#7580) 2020-10-05 09:45:47 -04:00
Sylvain Gugger
29baa8fabe
Clean the Trainer state (#7490)
* Trainer should not modify its TrainingArguments

* Trainer should not modify its TrainingArguments

* Trainer should not modify its TrainingArguments

* Add test of resumed training

* Fixes

* Non multiGPU test

* Clean Trainer state

* Add more to the state

* Documentation

* One last test

* Make resume training test more complete

* Unwanted changes
2020-10-01 13:07:04 -04:00
Sylvain Gugger
8546dc55c2
Fix Trainer tests in a multiGPU env (#7458) 2020-09-29 14:06:41 -04:00
Sylvain Gugger
52e8392b7e
Add automatic best model loading to Trainer (#7431)
* Add automatic best model loading to Trainer

* Some small fixes

* Formatting
2020-09-29 10:41:18 -04:00
Marcin Zabłocki
4083a55ab0
Flos fix (#7384) 2020-09-28 04:09:26 -04:00
Sylvain Gugger
1ee2194fb6
Mark big downloads slow (#7325)
* Make big downloads as slow

* Add import

* Right order for slow decorator

* More slow tests
2020-09-22 12:21:52 -04:00
Sylvain Gugger
492bb6aa48
Trainer multi label (#7191)
* Trainer accep multiple labels

* Missing import

* Fix dosctrings
2020-09-17 08:15:37 -04:00
Yih-Dar
4c62c6021a
fix ZeroDivisionError and epoch counting (#7125)
* fix ZeroDivisionError and epoch counting

* Add test for num_train_epochs calculation in trainer.py

* Remove @require_non_multigpu for test_num_train_epochs_in_training
2020-09-15 11:51:50 -04:00
Sylvain Gugger
7186ca6240
Multi predictions trainer (#7126)
* Allow multiple outputs

* Formatting

* Move the unwrapping before metrics

* Fix typo

* Add test for non-supported config options
2020-09-15 10:27:24 -04:00
Sylvain Gugger
2bf70e2150
Fix reproducible tests in Trainer (#7119)
* Fix reproducible tests in Trainer

* Deal with multiple GPUs
2020-09-15 03:32:44 -04:00
Lysandre Debut
bb3106f741
Temporarily skip failing tests due to dependency change (#7118)
* Temporarily skip failing tests due to dependency change

* Remove trace
2020-09-14 07:42:13 -04:00
Stas Bekman
8fcbe486e1
these tests require non-multigpu env (#7059)
* these tests require non-multigpu env

* cleanup

* clarify
2020-09-10 18:52:55 -04:00
Sylvain Gugger
514486739c
Fix CI with change of name of nlp (#7054)
* nlp -> datasets

* More nlp -> datasets

* Woopsie

* More nlp -> datasets

* One last
2020-09-10 14:51:08 -04:00