Patrick von Platen
1f9dcfc1ef
[Trainer] Add nan/inf logging filter ( #13619 )
...
* finish
* add test
* push
* remove unnecessary code
* up
* correct test
* Update src/transformers/training_args.py
2021-09-17 16:21:59 +02:00
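A minimal sketch of the filter this commit introduces, assuming the TrainingArguments flag is named logging_nan_inf_filter as in the PR title; when enabled, step losses that come out as nan or inf are filtered from the logged average.

```python
from transformers import TrainingArguments

# logging_nan_inf_filter is assumed from the PR title; it defaults to enabled.
args = TrainingArguments(
    output_dir="out",             # hypothetical output path
    logging_steps=10,
    logging_nan_inf_filter=True,  # skip nan/inf losses when averaging logs
)
```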
Sylvain Gugger
3081d3868e
Push to hub when saving checkpoints ( #13503 )
...
* Push to hub when saving checkpoints
* Add model card
* Revert partial model card
* Small fix for checkpoint
* Add tests
* Add documentation
* Fix tests
* Bump huggingface_hub
* Fix test
2021-09-14 08:02:15 -04:00
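A hedged sketch of the feature: with push_to_hub enabled, each checkpoint save also uploads to the Hub. The repo and directory names here are illustrative.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="my-finetuned-model",  # also used to derive the Hub repo name
    save_strategy="epoch",
    push_to_hub=True,  # after this PR, saves are pushed as they happen
)
# A final trainer.push_to_hub() at the end also uploads the generated model card.
```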
Sylvain Gugger
e59d4d0147
Refactor internals for Trainer push_to_hub ( #13486 )
2021-09-09 13:04:37 -04:00
Philip May
b7439675b8
Fix Trainer.train(resume_from_checkpoint=False) raising an exception ( #12981 )
...
* fix #12970
* Update tests/test_trainer.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update tests/test_trainer.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update tests/test_trainer.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* remove unnecessary issue link
* fix test formatting
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-08-03 10:10:33 +02:00
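For context, the three accepted forms of the argument; this sketch assumes an already-constructed trainer. After the fix, passing False simply starts training from scratch instead of raising.

```python
# Resume from the most recent checkpoint in output_dir:
trainer.train(resume_from_checkpoint=True)

# Resume from an explicit checkpoint directory (path is illustrative):
trainer.train(resume_from_checkpoint="out/checkpoint-500")

# Train from scratch; no exception after this fix:
trainer.train(resume_from_checkpoint=False)
```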
Sylvain Gugger
0118ef89ee
Enforce eval and save strategies are compatible when --load_best_model_at_end ( #12786 )
...
* Enforce eval and save strategies are compatible when --load_best_model_at_end
* Update doc
* Fix typos
* Fix tests
2021-07-19 19:50:47 +02:00
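A sketch of a configuration that satisfies the new check: when load_best_model_at_end is set, the evaluation and save strategies (and step intervals) have to line up so a best checkpoint actually exists.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,  # must match the eval cadence
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
```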
Sylvain Gugger
53c60babe4
Clean push to hub API ( #12187 )
...
* Clean push to hub API
* Create working dir if it does not exist
* Different tweak
* New API + all models + test Flax
* Adds the Trainer clean up
* Update src/transformers/file_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Address review comments
* (nit) output types
* No need to set clone_from when folder exists
* Update src/transformers/trainer.py
Co-authored-by: Julien Chaumond <julien@huggingface.co>
* Add generated_from_trainer tag
* Update to new version
* Fixes
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Lysandre <lysandre.debut@reseau.eseo.fr>
2021-06-23 10:11:19 -04:00
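The cleaned-up API attaches a push_to_hub mixin to models and tokenizers; a minimal usage sketch with an illustrative repo name:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# "my-finetuned-bert" is a hypothetical repo name on the Hub.
model.push_to_hub("my-finetuned-bert")
tokenizer.push_to_hub("my-finetuned-bert")
```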
Stas Bekman
ebe5413589
[trainer] 2 bug fixes and a rename ( #12309 )
...
* bug fixes and a rename
* add extended DDP test
2021-06-22 11:13:23 -07:00
Stas Bekman
0d97ba8a98
[tests] multiple improvements ( #12294 )
...
* [tests] multiple improvements
* cleanup
* style
* todo to investigate
* fix
2021-06-21 19:51:36 -07:00
Stas Bekman
dad414d5f9
[trainer + examples] set log level from CLI ( #12276 )
...
* set log level from CLI
* add log_level_replica + test + extended docs
* cleanup
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* rename datasets objects to allow datasets module
* improve the doc
* style
* doc improve
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-21 19:30:50 -07:00
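A sketch of the new knobs, using the argument names from the commit body (log_level, log_level_replica); the same values can be passed on the command line as --log_level and --log_level_replica.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    log_level="info",             # verbosity on the main process
    log_level_replica="warning",  # keep replicas quieter in distributed runs
)
```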
Stas Bekman
a4ed074d4b
reset report_to to none, avoid deprecation warning ( #12293 )
2021-06-21 16:50:12 -07:00
Amog Kamsetty
b9d66f4c4b
Ray Tune Integration Updates ( #12134 )
...
* fix
* fixes
* add back to scheduled tests
* formatting
* Update integrations.py
2021-06-15 14:11:29 -04:00
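For reference, a hedged sketch of the Ray Tune entry point this integration maintains; it assumes the Trainer was built with model_init so each trial gets a fresh model, and the search space is illustrative.

```python
from ray import tune

def hp_space(trial):
    # Illustrative search space; tune.loguniform/tune.choice are Ray Tune samplers.
    return {
        "learning_rate": tune.loguniform(1e-6, 1e-4),
        "num_train_epochs": tune.choice([2, 3, 4]),
    }

best_run = trainer.hyperparameter_search(backend="ray", hp_space=hp_space, n_trials=8)
```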
Stas Bekman
372ab9cd6d
[style] consistent nn. and nn.functional: part 3 tests ( #12155 )
...
* consistent nn. and nn.functional: p3 templates
* restore
2021-06-14 12:18:22 -07:00
Stas Bekman
ff7c81687a
[optim] implement AdafactorSchedule ( #12123 )
...
* implement AdafactorSchedule
* typo
* fix
* Update src/transformers/optimization.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-14 09:43:48 -07:00
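The scheduler exists because Adafactor computes its learning rate internally; AdafactorSchedule is a proxy that gives the Trainer something to log. A sketch close to the documented usage, assuming a model is already built:

```python
from transformers.optimization import Adafactor, AdafactorSchedule

optimizer = Adafactor(
    model.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,  # let Adafactor derive the lr itself
)
lr_scheduler = AdafactorSchedule(optimizer)
# pass both to Trainer via optimizers=(optimizer, lr_scheduler)
```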
Stas Bekman
b1a8aa94f0
[test] support more than 2 gpus ( #12074 )
...
* support more than 2 gpus
* style
2021-06-09 09:23:47 -07:00
Stas Bekman
4ba203d9d3
[Trainer] add train loss and flops metrics reports ( #11980 )
...
* add train loss and flops metrics reports
* consistency
* add train_loss to skip keys
* restore on_train_end call timing
2021-06-01 15:58:31 -07:00
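After this change the metrics returned by train() include the average training loss; a short sketch (key names as in the commit, worth verifying against your version):

```python
train_result = trainer.train()
metrics = train_result.metrics
print(metrics["train_loss"])      # average loss over the whole run
print(metrics.get("total_flos"))  # estimated floating-point operations
```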
Lysandre Debut
6da129cb31
Enable memory metrics in tests that need it ( #11859 )
2021-05-25 04:06:19 -04:00
Sylvain Gugger
afe479adb5
[Trainer] Report both steps and num samples per second ( #11818 )
...
* [Trainer] Report both steps and num samples per second
* Fix batch number
* Update src/transformers/trainer_utils.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Address review comments
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2021-05-24 19:51:42 -04:00
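Correspondingly on the eval side, both throughput figures are now reported; key names assumed from the PR title:

```python
metrics = trainer.evaluate()
print(metrics["eval_samples_per_second"])
print(metrics["eval_steps_per_second"])
```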
Sylvain Gugger
a515caa331
Fix checkpoint deletion ( #11748 )
2021-05-18 07:42:39 -04:00
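The deletion logic being fixed here is the one driven by save_total_limit; a minimal sketch of that setting:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    save_strategy="steps",
    save_steps=500,
    save_total_limit=3,  # checkpoints beyond the 3 newest are deleted
)
```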
Sylvain Gugger
a135f59536
Auto modelcard ( #11599 )
...
* Autogenerate model cards from the Trainer
* ModelCard deprecated
* Fix test
* Style
* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Address review comments
* Quality
* With all metadata
* Metadata
* Post-merge conflict mess
* Data args and all examples
* Default license and languages when possible
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-05-11 11:30:34 -04:00
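A hedged sketch of the entry point this adds; per the commit, the Trainer can now generate a model card in its output directory:

```python
# After training finishes:
trainer.create_model_card()  # writes README.md to args.output_dir
```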
Sylvain Gugger
6b241e0e3b
Reproducible checkpoint ( #11582 )
...
* Set generator in dataloader
* Use generator in all random samplers
* Checkpoint all RNG states
* Final version
* Quality
* Test
* Address review comments
* Quality
* Remove debug util
* Add python and numpy RNGs
* Split states in different files in distributed
* Quality
* local_rank for TPUs
* Only use generator when accepted
* Add test
* Set seed to avoid flakiness
* Make test less flaky
* Quality
2021-05-04 16:20:56 -04:00
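A minimal sketch of the underlying technique (not the Trainer's actual code): capture every RNG state at save time and restore it on resume, so data order and dropout continue deterministically.

```python
import random
import numpy as np
import torch

# Save side: snapshot Python, NumPy, and torch RNG states.
rng_state = {
    "python": random.getstate(),
    "numpy": np.random.get_state(),
    "cpu": torch.random.get_rng_state(),
}
if torch.cuda.is_available():
    rng_state["cuda"] = torch.cuda.random.get_rng_state_all()
torch.save(rng_state, "rng_state.pth")

# Resume side: restore all of them before continuing training.
state = torch.load("rng_state.pth")
random.setstate(state["python"])
np.random.set_state(state["numpy"])
torch.random.set_rng_state(state["cpu"])
if "cuda" in state:
    torch.cuda.random.set_rng_state_all(state["cuda"])
```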
Stas Bekman
bc2571e61c
[Deepspeed] ZeRO-Infinity integration plus config revamp ( #11418 )
...
* adding Z-inf
* revamp config process
* up version requirement
* wip
* massive rewrite
* cleanup
* cleanup
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* consistent json commas
* act on suggestions
* leave this feature for 0.3.16
* style
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-26 10:40:32 -07:00
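For orientation, a hedged sketch of a ZeRO stage-3 config with NVMe offload (the "ZeRO-Infinity" shape); keys follow the DeepSpeed config schema but may differ between versions, and "auto" values are filled in by the Trainer integration.

```python
import json

ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
# then: TrainingArguments(..., deepspeed="ds_config.json")
```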
Sylvain Gugger
7959d83599
Give each test a different repo name ( #11453 )
2021-04-26 11:52:23 -04:00
Sylvain Gugger
bf2e0cf70b
Trainer push to hub ( #11328 )
...
* Initial support for upload to hub
* push -> upload
* Fixes + examples
* Fix torchhub test
* Torchhub test I hate you
* push_model_to_hub -> push_to_hub
* Apply mixin to other pretrained models
* Remove ABC inheritance
* Add tests
* Typo
* Run tests
* Install git-lfs
* Change approach
* Add push_to_hub to all
* Staging test suite
* Typo
* Maybe like this?
* More deps
* Cache
* Adapt name
* Quality
* MOAR tests
* Put it in testing_utils
* Docs + torchhub last hope
* Styling
* Wrong method
* Typos
* Update src/transformers/file_utils.py
Co-authored-by: Julien Chaumond <julien@huggingface.co>
* Address review comments
* Apply suggestions from code review
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Julien Chaumond <julien@huggingface.co>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
2021-04-23 09:17:37 -04:00
Sylvain Gugger
c0328a6c26
Load checkpoint without re-creating the model ( #11318 )
2021-04-19 20:31:29 -04:00
Sylvain Gugger
d9c62047a8
Trainer support for IterableDataset for evaluation and predict ( #11286 )
...
* Bulk of the work
* Polish and tests
* Update QA Trainer
* Avoid breaking the predict method
* Deprecation warnings
* Store real eval dataloader
* Get eval dataset reference before wrap
2021-04-16 16:01:58 -04:00
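A sketch of what the change enables: evaluate/predict now accept datasets without __len__. The dataset class and variable names here are hypothetical, and an existing trainer is assumed.

```python
from torch.utils.data import IterableDataset

class StreamingEvalDataset(IterableDataset):
    """Hypothetical stream of pre-tokenized examples."""
    def __init__(self, examples):
        self.examples = examples

    def __iter__(self):
        yield from self.examples

metrics = trainer.evaluate(eval_dataset=StreamingEvalDataset(tokenized_examples))
```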
Sylvain Gugger
aaaed56ffc
Trainer iterable dataset ( #11254 )
...
* IterableDatasetShard
* Test and integration in Trainer
* Update src/transformers/trainer_pt_utils.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Style
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-04-14 17:02:26 -04:00
Stas Bekman
c6d664849b
[DeepSpeed] ZeRO Stage 3 ( #10753 )
...
* synced gpus
* fix
* fix
* need to use t5-small for quality tests
* notes
* complete merge
* fix a disappearing std stream problem
* start zero3 tests
* wip
* tune params
* sorting out the pre-trained model loading
* reworking generate loop wip
* wip
* style
* fix tests
* split the tests
* refactor tests
* wip
* parameterized
* fix
* work out the resume from non-ds checkpoint pass + test
* cleanup
* remove no longer needed code
* split getter/setter functions
* complete the docs
* suggestions
* gpus and their compute capabilities link
* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* style
* remove invalid paramgd
* automatically configure zero3 params that rely on hidden size
* make _get_resized_embeddings zero3-aware
* add test exercising resize_token_embeddings()
* add docstring
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-04-08 09:53:01 -07:00
Stas Bekman
3318c246f3
make failure to find a resume checkpoint fatal + tests ( #10777 )
2021-03-17 11:16:37 -07:00
Stas Bekman
cd8c93f701
[DeepSpeed] improve checkpoint loading code plus tests ( #10760 )
...
* deepspeed checkpoint loading code plus tests
* style
* style
2021-03-17 10:22:58 -07:00
Sylvain Gugger
3ced9b3eb9
Check layer types for Optimizer construction ( #10598 )
...
* Check layer types for Optimizer construction
* Duplicate class
2021-03-08 16:40:11 -05:00
Sylvain Gugger
821d518e03
Revert "Tests"
...
This reverts commit b35e7b68ca.
2021-03-08 16:05:55 -05:00
Sylvain Gugger
4196bfeda0
Revert "Style"
...
This reverts commit a8ec52efc2.
2021-03-08 16:05:52 -05:00
Sylvain Gugger
a8ec52efc2
Style
2021-03-08 16:04:46 -05:00
Sylvain Gugger
b35e7b68ca
Tests
2021-03-08 16:04:30 -05:00
Stas Bekman
f882966004
fix double wrapping + test ( #10583 )
2021-03-08 10:15:55 -05:00
Sylvain Gugger
6290169eb3
Rework TPU checkpointing in Trainer ( #10504 )
...
* Rework TPU checkpointing in Trainer
* Wraps the barrier in a dist test
* Address review comments
* Remove line
2021-03-04 11:46:11 -05:00
Tanmay Garg
256482ac92
Introduce save_strategy training argument ( #10286 )
...
* Introduce save_strategy training argument
* deprecate EvaluationStrategy
* collapse EvaluationStrategy and LoggingStrategy into a single IntervalStrategy enum
* modify tests to use modified enum
2021-02-27 19:34:22 -05:00
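A sketch of the resulting API; strings and the IntervalStrategy enum are interchangeable after this change:

```python
from transformers import TrainingArguments
from transformers.trainer_utils import IntervalStrategy

args = TrainingArguments(
    output_dir="out",
    save_strategy="epoch",                       # new argument from this PR
    evaluation_strategy=IntervalStrategy.EPOCH,  # replaces EvaluationStrategy
)
```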
Kai Fricke
98569d4ba2
Add Ray Tune hyperparameter search integration test ( #10414 )
2021-02-26 10:18:33 -05:00
Stas Bekman
4eddc459a9
[trainer] implement support for full fp16 in evaluation/predict ( #10268 )
...
* implement --fp16_full_eval
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* style
* add test
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-02-18 17:02:35 -08:00
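A minimal sketch of the new flag; it casts the model to fp16 for evaluation/predict, roughly halving memory at some numerical risk:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    fp16_full_eval=True,  # run eval/predict entirely in fp16
)
```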
Stas Bekman
d9a81fc0c5
fix func signature ( #10271 )
2021-02-18 16:44:42 -08:00
Stas Bekman
97e688bc22
[Trainer] memory tracker metrics ( #10225 )
...
* memory tracker metrics
* go back to eval for some consistency
* handle no-gpu case
* deal with stackable eval calls
* restore callback order
* style
* simplify the API
* add test
* docs
* consistently use eval_ prefix
* improve docs
* Update src/transformers/trainer_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* rename method
* style
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-02-18 09:27:32 -08:00
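A hedged way to inspect what the tracker reports: memory deltas show up in the returned metrics with stage prefixes (init_/train_/eval_). Exact key names vary across versions, and newer releases gate this behind skip_memory_metrics=False.

```python
metrics = trainer.train().metrics
for name, value in metrics.items():
    if "mem" in name:  # e.g. train_mem_gpu_alloc_delta (name assumed)
        print(name, value)
```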
Sylvain Gugger
7169d1ea7b
Store FLOS as floats to avoid overflow. ( #10213 )
2021-02-16 11:15:15 -05:00
Lysandre Debut
8cbd0bd137
Specify dataset dtype ( #10195 )
...
Co-authored-by: Quentin Lhoest <lhoest.q@gmail.com>
2021-02-15 12:57:17 -05:00
Sylvain Gugger
b4e559cfa1
Deprecate model_path in Trainer.train ( #9854 )
2021-01-28 08:32:46 -05:00
Sylvain Gugger
35d55b7b84
When resuming training from checkpoint, Trainer loads model ( #9818 )
...
* When resuming training from checkpoint, Trainer loads model
* Finish cleaning tests
* Address review comment
* Use global_step from state
2021-01-27 09:31:18 -05:00
Sylvain Gugger
5e1bea4f16
Fix Trainer with a parallel model ( #9578 )
...
* Fix Trainer with a parallel model
* More clean up
2021-01-14 03:23:41 -05:00
Sylvain Gugger
04dc65e5c6
Fix data parallelism in Trainer ( #9566 )
...
* Fix data parallelism in Trainer
* Update src/transformers/training_args.py
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-01-13 09:54:41 -05:00
Stas Bekman
9f675b05d4
[trainer] self.model_wrapped + _model_unwrap ( #9390 )
...
* model wrapped + model_unwrap
* cleanup
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* style
* deprecation warning
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-01-06 06:50:11 -05:00
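For orientation, the distinction this commit introduces, as a comment-level sketch:

```python
# trainer.model is always the bare model, as passed in by the user.
# trainer.model_wrapped is the outermost wrapper (DDP, DeepSpeed, ...) and
# falls back to the same object when no wrapper is active.
bare = trainer.model
wrapped = trainer.model_wrapped  # use this one for forward/backward in training
```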
Sylvain Gugger
1198ba8fba
Add timing inside Trainer ( #9196 )
...
* Add timing inside Trainer
* Fix tests
* Add n_objs for train
* Sort logs
2020-12-18 15:10:39 -05:00
Sylvain Gugger
ad895af98d
Add possibility to switch between APEX and AMP in Trainer ( #9137 )
...
* Add possibility to switch between APEX and AMP in Trainer
* Update src/transformers/training_args.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
* Address review comments
* Update src/transformers/training_args.py
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2020-12-15 16:38:10 -05:00
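A sketch of the resulting switch, using the fp16_backend argument from this PR; "amp" selects native torch.cuda.amp and "apex" selects NVIDIA Apex:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    fp16=True,
    fp16_backend="apex",  # or "amp"; default "auto" picks one for you
)
```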