Commit Graph

82 Commits

Author SHA1 Message Date
Stas Bekman
ce2fef2ad2
[trainer / deepspeed] fix hyperparameter_search (#16740)
* [trainer / deepspeed] fix hyperparameter_search

* require optuna

* style

* oops

* add dep in the right place

* create deepspeed-testing dep group

* Trigger CI
2022-04-14 17:24:38 -07:00
Sylvain Gugger
4975002df5
Reorganize file utils (#16264)
* Split file_utils in several submodules

* Fixes

* Add back more objects

* More fixes

* Who exactly decided to import that from there?

* Second suggestion to code with code review

* Revert wront move

* Fix imports

* Adapt all imports

* Adapt all imports everywhere

* Revert this import, will fix in a separate commit
2022-03-23 10:26:33 -04:00
Stas Bekman
580dd87c55
[Deepspeed] add support for bf16 mode (#14569)
* [WIP] add support for bf16 mode

* prep for bf16

* prep for bf16

* fix; zero2/bf16 is ok

* check bf16 is available

* test fixes

* enable zero3_bf16

* config files

* docs

* split stage_dtype; merge back to non-dtype-specific config file

* fix doc

* cleanup

* cleanup

* bfloat16 => bf16 to match the PR changes

* s/zero_gather_fp16_weights_on_model_save/zero_gather_16bit_weights_on_model_save/; s/save_fp16_model/save_16bit_model/

* test fixes/skipping

* move

* fix

* Update docs/source/main_classes/deepspeed.mdx

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* backticks

* cleanup

* cleanup

* cleanup

* new version

* add note about grad accum in bf16

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-03-11 17:53:53 -08:00
Stas Bekman
b842d7277a
fix deepspeed tests (#15881)
* fix deepspeed tests

* style

* more fixes
2022-03-01 19:27:28 -08:00
Lysandre Debut
29c10a41d0
[Test refactor 1/5] Per-folder tests reorganization (#15725)
* Per-folder tests reorganization

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Stas Bekman <stas@stason.org>
2022-02-23 15:46:28 -05:00
Stas Bekman
4f5faaf044
[deepspeed] fix a bug in a test (#15493)
* [deepspeed] fix a bug in a test

* consistency
2022-02-03 08:55:45 -08:00
Stas Bekman
1eb40338ac
[deepspeed tests] fix summarization (#15149) 2022-01-13 13:48:51 -08:00
Stas Bekman
b66c5ab20c
[deepspeed] fix --load_best_model_at_end (#14652)
* [deepspeed] fix load_best_model_at_end

* try with pull_request_target

* revert: try with pull_request_target

* style

* add test

* cleanup
2021-12-06 21:57:47 -08:00
Stas Bekman
956a483173
[deepspeed] zero inference (#14253)
* [deepspeed] zero inference

* only z3 makes sense for inference

* fix and style

* docs

* rework

* fix test

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* responding to suggestions

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-11-23 14:09:15 -08:00
Stas Bekman
1c76a51615
solve the port conflict (#14362) 2021-11-10 19:11:45 -08:00
Jeff Rasley
d0e96c6de6
[deepspeed] Enable multiple test runs on single box, defer to DS_TEST_PORT if set (#14331)
* defer to DS_TEST_PORT if set

* style

Co-authored-by: Stas Bekman <stas@stason.org>
2021-11-08 12:40:29 -08:00
Olatunji Ruwase
42f359d015
Use DS callable API to allow hf_scheduler + ds_optimizer (#13216)
* Use DS callable API to allow hf_scheduler + ds_optimizer

* Preserve backward-compatibility

* Restore backward compatibility

* Tweak arg positioning

* Tweak arg positioning

* bump the required version

* Undo indent

* Update src/transformers/trainer.py

* style

Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2021-08-30 10:01:06 -07:00
Stas Bekman
98364ea74f
[tests] fix logging_steps requirements (#12860) 2021-07-23 08:05:48 -07:00
Stas Bekman
5dd0c956a8
non-native optimizers are mostly ok with zero-offload (#12690) 2021-07-13 20:18:51 -07:00
Stas Bekman
78f5fe1416
[Deepspeed] adapt multiple models, add zero_to_fp32 tests (#12477)
* zero_to_fp32 tests

* args change

* remove unnecessary work

* use transformers.trainer_utils.get_last_checkpoint

* document the new features

* cleanup

* wip

* fix fsmt

* add bert

* cleanup

* add xlm-roberta

* electra works

* cleanup

* sync

* split off the model zoo tests

* cleanup

* cleanup

* cleanup

* cleanup

* reformat

* cleanup

* casing

* deepspeed>=0.4.3

* adjust distilbert

* Update docs/source/main_classes/deepspeed.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-07-13 12:07:32 -07:00
Stas Bekman
ebe5413589
[trainer] 2 bug fixes and a rename (#12309)
* bug fixes and a rename

* add extended DDP test
2021-06-22 11:13:23 -07:00
Stas Bekman
11d86d3de4
[Deepspeed Wav2vec2] integration (#11638)
* wip

* wip - but working with https://github.com/microsoft/DeepSpeed/pull/1044

* cleanup

* workaround

* working 5/8 modes

* solve fp32 distributed zero3

* style

* sync

* sync

* rework

* deprecation

* cleanup

* https://github.com/microsoft/DeepSpeed/pull/1044 pr was merged

* clean up

* add a guide

* more prose

* more prose

* fix

* more prose

* sub_group_size was too big

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* refactor

* bug fix

* make the true check explicit

* new deepspeed release

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-08 12:32:03 -07:00
Stas Bekman
32290d87f6
[Deepspeed] various fixes (#12058)
* replace deprecated config

* sub_group_size was too big

* complete deprecation removal
2021-06-08 08:36:15 -07:00
Stas Bekman
2c73b93099
[Deepspeed] Assert on mismatches between ds and hf args (#12021)
* wip

* add mismatch validation + test

* renames

* Update docs/source/main_classes/deepspeed.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* renames

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-04 08:58:23 -07:00
Stas Bekman
61c5063491
[deepspeed] add nvme test skip rule (#11997)
* add nvme skip rule

* fix
2021-06-02 12:06:37 -07:00
Stas Bekman
640318befa
[deepspeed] Move code and doc into standalone files (#11984)
* move code and docs

* style

* moved

* restore
2021-06-02 09:56:00 -07:00
Stas Bekman
7ec596ecda
[DeepSpeed] decouple DeepSpeedConfigHF from Trainer (#11966)
* decouple DeepSpeedConfigHF from Trainer

* add LoggingLevel ctx manager; add new test

* cleanup

* add docs

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* implemented suggested renames

* formatter workaround

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-01 13:24:52 -07:00
Stas Bekman
a26f4d6208
[Deepspeed] support zero.Init in from_config (#11805)
* support zero.Init in from_config

* no need for eval test
2021-05-21 09:07:46 -07:00
Stas Bekman
619200cc42
[cuda ext tests] fixing tests (#11619)
* fixing tests

* cleanup
2021-05-06 13:35:28 -07:00
Stas Bekman
4e7bf94e72
[DeepSpeed] fp32 support (#11499)
* prep for deepspeed==0.3.16

* new version

* too soon

* support and test fp32 mode

* troubleshooting doc start

* workaround no longer needed

* add fp32 doc

* style

* cleanup, add tf32 note

* clarify

* release was made
2021-04-30 12:51:48 -07:00
Stas Bekman
bc2571e61c
[Deepspeed] ZeRO-Infinity integration plus config revamp (#11418)
* adding Z-inf

* revamp config process

* up version requirement

* wip

* massive rewrite

* cleanup

* cleanup

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* consistent json commas

* act on suggestions

* leave this feature for 0.3.16

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-26 10:40:32 -07:00
Bhadresh Savani
1d30ec95c7
[Examples] Fixes inconsistency around eval vs val and predict vs test (#11380)
* added changes for uniformity

* modified files

* corrected typo

* fixed qa scripts

* fix typos

* fixed predict typo in qa no trainer

* fixed test file

* reverted trainer changes

* reverted trainer changes in custom exmaples

* updated readme

* added changes in deepspeed test

* added changes for predict and eval
2021-04-26 09:24:31 -07:00
Patrick von Platen
32dbb2d954
make style (#11442) 2021-04-26 13:50:34 +02:00
Sylvain Gugger
dabeb15292
Examples reorg (#11350)
* Base move

* Examples reorganization

* Update references

* Put back test data

* Move conftest

* More fixes

* Move test data to test fixtures

* Update path

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments and clean

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-04-21 11:11:20 -04:00
Stas Bekman
83206ca6a8
[deepspeed] test on one node 2 gpus max (#11237)
* test on one node 2 gpus max

* fix the other place

* refactor

* fix

* cleanup

* more exact version
2021-04-14 11:06:59 -07:00
Stas Bekman
3d339ee659
[Deepspeed] zero3 tests band aid (#11235)
* temp band-aid

* style
2021-04-13 17:58:09 -04:00
Stas Bekman
66446909b2
[tests] relocate core integration tests (#11146)
* relocate core integration tests

* add sys.path context manager

* cleanup

* try

* try2

* fix path

* doc

* style

* add dep

* add 2 more deps
2021-04-08 13:13:17 -07:00