Commit Graph

51 commits — each entry below lists the author, the commit SHA-1, the commit message, and the date.
Joao Gante
beb2a09687
DeepSpeed: hardcode torch.arange dtype on float usage to avoid incorrect initialization (#28760) 2024-01-31 14:39:07 +00:00
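A minimal sketch of the failure mode this commit guards against — not the shipped transformers code. It assumes a lowered default float dtype as a stand-in for what a fp16/bf16 DeepSpeed config effectively does:

```python
import torch

# Under a fp16/bf16 DeepSpeed config the default float dtype is lowered,
# and torch.arange called with float arguments silently inherits that
# reduced precision — corrupting float-valued initializations.
torch.set_default_dtype(torch.bfloat16)  # stand-in for the DeepSpeed effect

dim = 64
fragile = torch.arange(0.0, dim, 2.0)                        # bfloat16 here
robust = torch.arange(0, dim, 2, dtype=torch.int64).float()  # pinned, ends up float32

print(fragile.dtype, robust.dtype)  # torch.bfloat16 torch.float32
```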
Xuehai Pan
976189a6df
Fix initialization for missing parameters in from_pretrained under ZeRO-3 (#28245)
* Fix initialization for missing parameters in `from_pretrained` under ZeRO-3

* Test initialization for missing parameters under ZeRO-3

* Add more tests

* Only enable deepspeed context for per-module level parameters

* Enable deepspeed context only once

* Move class definition inside test case body
2024-01-09 14:58:21 +00:00
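A hedged sketch of the ZeRO-3 pattern this fix relies on, assuming a launched distributed run with DeepSpeed installed; `init_fn` is a stand-in for the model-specific transformers `_init_weights` logic:

```python
import deepspeed
import torch

# Parameters created under zero.Init live partitioned across ranks, so
# weights missing from the checkpoint must be gathered, initialized on one
# rank, and re-partitioned when the context exits. recurse=False mirrors
# the "per-module level parameters" note in the commits above.
def init_missing_params(module: torch.nn.Module, init_fn) -> None:
    params = list(module.parameters(recurse=False))
    with deepspeed.zero.GatheredParameters(params, modifier_rank=0):
        if torch.distributed.get_rank() == 0:
            init_fn(module)  # mutate full weights; DeepSpeed scatters on exit
```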
Ella Charlaix
39acfe84ba
Add deepspeed test to amd scheduled CI (#27633)
* add deepspeed scheduled test for amd

* fix image

* add dockerfile

* add comment

* enable tests

* trigger

* remove trigger for this branch

* trigger

* change runner env to trigger the docker build image test

* use new docker image

* remove test suffix from docker image tag

* replace test docker image with original image

* push new image

* Trigger

* add back amd tests

* fix typo

* add amd tests back

* fix

* comment until docker image build scheduled test fix

* remove deprecated deepspeed build option

* upgrade torch

* update docker & make tests pass

* Update docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile

* fix

* tmp disable test

* precompile deepspeed to avoid timeout during tests

* fix comment

* trigger deepspeed tests with new image

* comment tests

* trigger

* add sklearn dependency to fix slow tests

* enable back other tests

* final update

---------

Co-authored-by: Felix Marty <felix@hf.co>
Co-authored-by: Félix Marty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-12-11 16:33:36 +01:00
Hz, Ji
c5d7754b11
device-agnostic deepspeed testing (#27342) 2023-11-09 12:34:13 +01:00
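A minimal sketch of the device-agnostic idea: resolve the accelerator at runtime instead of hardcoding "cuda", so the same DeepSpeed tests can run on CUDA, XPU, or CPU boxes. The helper name is illustrative, not the one added by the PR:

```python
import torch

def get_torch_device() -> str:
    # Prefer CUDA, then Intel XPU (guarded, since older torch lacks it), else CPU.
    if torch.cuda.is_available():
        return "cuda"
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "xpu"
    return "cpu"

x = torch.ones(2, 2, device=get_torch_device())
```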
Sourab Mangrulkar
7ecd229ba4
Smangrul/fix failing ds ci tests (#27358)
* fix failing DeepSpeed CI tests due to `safetensors` being default

* debug

* remove debug statements

* resolve comments

* Update test_deepspeed.py
2023-11-09 11:47:24 +05:30
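A sketch of the breakage behind this fix, assuming network access and a public tiny test checkpoint: `save_pretrained` now writes `model.safetensors` by default, so tests asserting on `pytorch_model.bin` had to either look for the new filename or opt back into the legacy format.

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("hf-internal-testing/tiny-random-bert")
model.save_pretrained("out_safetensors")                    # -> model.safetensors
model.save_pretrained("out_bin", safe_serialization=False)  # -> pytorch_model.bin (legacy)
```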
Sourab Mangrulkar
b477327394
fix the deepspeed tests (#26021)
* fix the deepspeed tests

* resolve comment
2023-09-13 10:26:53 +05:30
Sourab Mangrulkar
6bc517ccd4
deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler (#25863)
* Add support for deepspeed optimizer and HF scheduler

* fix bug

* fix the import

* fix issue with deepspeed scheduler saving for hf optim + hf scheduler scenario

* fix loading of hf scheduler when loading deepspeed checkpoint

* fix import of `DeepSpeedSchedulerWrapper`

* add tests

* add the comment and skip the failing tests

* address comment
2023-09-05 22:31:20 +05:30
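A hedged sketch of the newly supported combination (config values are illustrative): the DeepSpeed config supplies the optimizer while the HF Trainer keeps supplying the LR scheduler — note the absence of a "scheduler" block in `ds_config`.

```python
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": "auto"}},  # DeepSpeed optimizer
    "train_micro_batch_size_per_gpu": "auto",
    # no "scheduler" key -> Trainer falls back to lr_scheduler_type below
}
args = TrainingArguments(
    output_dir="out",
    deepspeed=ds_config,         # a dict or a path to a JSON file
    lr_scheduler_type="cosine",  # HF scheduler, saved/restored on resume per this fix
)
```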
Younes Belkada
4b79697865
🚨🚨🚨 [Refactor] Move third-party related utility files into integrations/ folder 🚨🚨🚨 (#25599)
* move deepspeed to `lib_integrations.deepspeed`

* more refactor

* oops

* fix slow tests

* Fix docs

* fix docs

* address feedback

* address feedback

* final modifs for PEFT

* fixup

* ok now

* trigger CI

* trigger CI again

* Update docs/source/en/main_classes/deepspeed.md

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* import from `integrations`

* address feedback

* revert removal of `deepspeed` module

* revert removal of `deepspeed` module

* fix conflicts

* ooops

* oops

* add deprecation warning

* place it on the top

* put `FutureWarning`

* fix conflicts with not_doctested.txt

* add back `bitsandbytes` module with a depr warning

* fix

* fix

* fixup

* oops

* fix doctests

---------

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2023-08-25 17:13:34 +02:00
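A sketch of the import move this refactor performs; per the commit messages above, the old path is kept as a shim that emits a FutureWarning:

```python
# before the refactor:
#   from transformers.deepspeed import HfDeepSpeedConfig
# after:
from transformers.integrations import HfDeepSpeedConfig
```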
Sourab Mangrulkar
a73b1d59a3
accelerate deepspeed and gradient accumulation integrate (#23236)
* mixed precision support via accelerate

* fix issues

* fix for the sharded ddp case

* fix flax and tf failing tests

* refactor the place to create `Accelerator` object

* move ddp prep to accelerate

* fix 😅

* resolving comments

* move fsdp handling to accelerate

* fixes

* fix saving

* shift torch dynamo handling to accelerate

* shift deepspeed integration and save & load utils to accelerate

* fix accelerate launcher support

* oops

* fix 🐛

* save ckpt fix

* Trigger CI

* nasty 🐛 😅

* as deepspeed needs grad_acc fixes, transfer grad_acc to accelerate

* make tests happy

* quality 

* loss tracked needs to account for grad_acc

* fixing the deepspeed tests

* quality 

* 😅😅😅

* tests 😡

* quality 

* Trigger CI

* resolve comments and fix the issue with the previous merge from branch

* Trigger CI

* accelerate took over deepspeed integration

---------

Co-authored-by: Stas Bekman <stas@stason.org>
2023-05-31 15:16:22 +05:30
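A minimal sketch of the Accelerate-side gradient-accumulation pattern the Trainer now delegates to; the model and data here are toy stand-ins:

```python
import torch
from accelerate import Accelerator

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = [torch.randn(8, 4) for _ in range(16)]

accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer = accelerator.prepare(model, optimizer)

for batch in data:
    with accelerator.accumulate(model):  # no-ops optimizer.step() until the boundary
        loss = model(batch).sum()
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```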
Yih-Dar
fe1f5a639d
Fix decorator order (#22708)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-04-11 17:59:15 +02:00
Stas Bekman
ec24132b6c
[deepspeed] offload + non-cpuadam optimizer exception (#22043)
* [deepspeed] offload + non-cpuadam optimizer exception

* flip

* revert min version
2023-03-09 08:12:57 -08:00
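A sketch of the configuration this exception targets, using real DeepSpeed config keys: optimizer offload to CPU is meant to pair with DeepSpeedCPUAdam, and pairing it with a plain torch optimizer now raises unless the check is explicitly disabled.

```python
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},
    },
    # Set to False only if you accept a (slow) non-CPU-Adam optimizer
    # alongside optimizer offload.
    "zero_force_ds_cpu_optimizer": False,
}
```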
Stas Bekman
633062639b
[deepspeed tests] fix issues introduced by #21700 (#21769)
* [deepspeed tests] fix issues introduced by #21700

* fix

* fix
2023-02-23 13:22:25 -08:00
Aaron Gokaslan
5e8c8eb5ba
Apply ruff flake8-comprehensions (#21694) 2023-02-22 09:14:54 +01:00
Stas Bekman
8ea994d3c5
[tests] add missing report_to none (#21505)
[tests] report_to none
2023-02-08 09:32:40 -08:00
Sylvain Gugger
6f79d26442
Update quality tooling for formatting (#21480)
* Result of black 23.1

* Update target to Python 3.7

* Switch flake8 to ruff

* Configure isort

* Configure isort

* Apply isort with line limit

* Put the right black version

* adapt black in check copies

* Fix copies
2023-02-06 18:10:56 -05:00
Sylvain Gugger
36d4647993
Refine Bf16 test for deepspeed (#17734)
* Refine BF16 check in CPU/GPU

* Fixes

* Renames
2022-06-16 11:27:58 -04:00
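A sketch of the refined availability check (illustrative helper name): bf16 on GPU needs Ampere or newer, which torch exposes directly, while CPU support depends only on the torch build.

```python
import torch

def is_bf16_available_on_gpu() -> bool:
    # torch.cuda.is_bf16_supported() covers the compute-capability check.
    return torch.cuda.is_available() and torch.cuda.is_bf16_supported()
```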
Stas Bekman
d28b7aa8cb
[deepspeed / testing] reset global state (#17553)
* [deepspeed] fix load_best_model test

* [deepspeed] add state reset on unittest tearDown
2022-06-06 07:49:25 -07:00
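A hedged sketch of the tearDown reset: `HfDeepSpeedConfig` installs itself as a process-wide global, so each test must clear it or later tests see a stale ZeRO config. The import path shown is the post-refactor one; at the time of this commit the module lived at `transformers.deepspeed`.

```python
import unittest

from transformers.integrations.deepspeed import unset_hf_deepspeed_config

class DeepSpeedTestCase(unittest.TestCase):
    def tearDown(self):
        super().tearDown()
        unset_hf_deepspeed_config()  # drop the global so the next test starts clean
```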
Stas Bekman
26e5e129b4
[deepspeed] fix load_best_model test (#17550) 2022-06-03 11:19:03 -07:00
Stas Bekman
2f59ad1609
[trainer/deepspeed] load_best_model (reimplement re-init) (#17151)
* [trainer/deepspeed] load_best_model

* to sync with DS PR #1947

* simplify

* rework load_best_model test

* cleanup

* bump deepspeed>=0.6.5

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2022-06-02 09:14:21 -07:00
Stas Bekman
f861504466
[Deepspeed] add many more models to the model zoo test (#12695)
* model zoo take 2

* add deberta

* new param for zero2

* doc update

* doc update

* add layoutlm

* bump deepspeed

* add deberta-v2, funnel, longformer

* new models

* style

* add t5_v1

* update TAPAS status

* reorg problematic models

* move doc to another PR

* style

* fix checkpoint check test

* making progress on more models running

* cleanup

* new version

* cleanup
2022-05-10 08:22:42 -07:00
Stas Bekman
ce2fef2ad2
[trainer / deepspeed] fix hyperparameter_search (#16740)
* [trainer / deepspeed] fix hyperparameter_search

* require optuna

* style

* oops

* add dep in the right place

* create deepspeed-testing dep group

* Trigger CI
2022-04-14 17:24:38 -07:00
Sylvain Gugger
4975002df5
Reorganize file utils (#16264)
* Split file_utils in several submodules

* Fixes

* Add back more objects

* More fixes

* Who exactly decided to import that from there?

* Second suggestion to code with code review

* Revert wrong move

* Fix imports

* Adapt all imports

* Adapt all imports everywhere

* Revert this import, will fix in a separate commit
2022-03-23 10:26:33 -04:00
Stas Bekman
580dd87c55
[Deepspeed] add support for bf16 mode (#14569)
* [WIP] add support for bf16 mode

* prep for bf16

* prep for bf16

* fix; zero2/bf16 is ok

* check bf16 is available

* test fixes

* enable zero3_bf16

* config files

* docs

* split stage_dtype; merge back to non-dtype-specific config file

* fix doc

* cleanup

* cleanup

* bfloat16 => bf16 to match the PR changes

* s/zero_gather_fp16_weights_on_model_save/zero_gather_16bit_weights_on_model_save/; s/save_fp16_model/save_16bit_model/

* test fixes/skipping

* move

* fix

* Update docs/source/main_classes/deepspeed.mdx

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* backticks

* cleanup

* cleanup

* cleanup

* new version

* add note about grad accum in bf16

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-03-11 17:53:53 -08:00
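A sketch of a bf16 ZeRO-3 config of the kind this PR enables; note the rename called out in the commits above (`..._fp16_...` to `..._16bit_...`):

```python
ds_config = {
    "bf16": {"enabled": "auto"},  # mirrors --bf16, replacing an "fp16" block
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
```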
Stas Bekman
b842d7277a
fix deepspeed tests (#15881)
* fix deepspeed tests

* style

* more fixes
2022-03-01 19:27:28 -08:00
Lysandre Debut
29c10a41d0
[Test refactor 1/5] Per-folder tests reorganization (#15725)
* Per-folder tests reorganization

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Stas Bekman <stas@stason.org>
2022-02-23 15:46:28 -05:00
Stas Bekman
4f5faaf044
[deepspeed] fix a bug in a test (#15493)
* [deepspeed] fix a bug in a test

* consistency
2022-02-03 08:55:45 -08:00
Stas Bekman
b66c5ab20c
[deepspeed] fix --load_best_model_at_end (#14652)
* [deepspeed] fix load_best_model_at_end

* try with pull_request_target

* revert: try with pull_request_target

* style

* add test

* cleanup
2021-12-06 21:57:47 -08:00
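A sketch of the arguments this fix makes work under DeepSpeed (the config path is an assumption): to reload the best checkpoint at the end, the eval and save cadences must line up.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config.json",   # assumed path to your ZeRO config
    load_best_model_at_end=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=50,
    save_steps=50,
    metric_for_best_model="loss",
)
```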
Stas Bekman
956a483173
[deepspeed] zero inference (#14253)
* [deepspeed] zero inference

* only z3 makes sense for inference

* fix and style

* docs

* rework

* fix test

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* responding to suggestions

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-11-23 14:09:15 -08:00
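A hedged sketch of ZeRO-3 inference ("only z3 makes sense for inference"), shown with the post-refactor import path: keep `HfDeepSpeedConfig` alive *before* `from_pretrained` so weights are partitioned at load time, then build an engine with no optimizer.

```python
import deepspeed
from transformers import AutoModel
from transformers.integrations import HfDeepSpeedConfig

ds_config = {"zero_optimization": {"stage": 3}, "train_micro_batch_size_per_gpu": 1}
dschf = HfDeepSpeedConfig(ds_config)  # must outlive model loading
model = AutoModel.from_pretrained("bert-base-uncased")
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()
```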
Stas Bekman
1c76a51615
solve the port conflict (#14362) 2021-11-10 19:11:45 -08:00
Jeff Rasley
d0e96c6de6
[deepspeed] Enable multiple test runs on single box, defer to DS_TEST_PORT if set (#14331)
* defer to DS_TEST_PORT if set

* style

Co-authored-by: Stas Bekman <stas@stason.org>
2021-11-08 12:40:29 -08:00
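A sketch of the env-var escape hatch these two commits add: parallel test runs on one box pick distinct launcher ports from `DS_TEST_PORT` instead of all colliding on the default 29500.

```python
import os

master_port = os.environ.get("DS_TEST_PORT", "29500")
launcher = f"deepspeed --num_gpus 2 --master_port {master_port}"
```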
Olatunji Ruwase
42f359d015
Use DS callable API to allow hf_scheduler + ds_optimizer (#13216)
* Use DS callable API to allow hf_scheduler + ds_optimizer

* Preserve backward-compatibility

* Restore backward compatibility

* Tweak arg positioning

* Tweak arg positioning

* bump the required version

* Undo indent

* Update src/transformers/trainer.py

* style

Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2021-08-30 10:01:06 -07:00
Stas Bekman
98364ea74f
[tests] fix logging_steps requirements (#12860) 2021-07-23 08:05:48 -07:00
Stas Bekman
5dd0c956a8
non-native optimizers are mostly ok with zero-offload (#12690) 2021-07-13 20:18:51 -07:00
Stas Bekman
78f5fe1416
[Deepspeed] adapt multiple models, add zero_to_fp32 tests (#12477)
* zero_to_fp32 tests

* args change

* remove unnecessary work

* use transformers.trainer_utils.get_last_checkpoint

* document the new features

* cleanup

* wip

* fix fsmt

* add bert

* cleanup

* add xlm-roberta

* electra works

* cleanup

* sync

* split off the model zoo tests

* cleanup

* cleanup

* cleanup

* cleanup

* reformat

* cleanup

* casing

* deepspeed>=0.4.3

* adjust distilbert

* Update docs/source/main_classes/deepspeed.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-07-13 12:07:32 -07:00
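A sketch of what `zero_to_fp32` does (the checkpoint path is illustrative): a ZeRO checkpoint holds sharded 16-bit weights plus fp32 optimizer partitions, and this utility reconstructs a single consolidated fp32 state_dict offline, without a live engine.

```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# DeepSpeed also drops a standalone zero_to_fp32.py script into each
# checkpoint dir for doing the same conversion from the command line.
state_dict = get_fp32_state_dict_from_zero_checkpoint("path/to/checkpoint_dir")
```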
Stas Bekman
ebe5413589
[trainer] 2 bug fixes and a rename (#12309)
* bug fixes and a rename

* add extended DDP test
2021-06-22 11:13:23 -07:00
Stas Bekman
11d86d3de4
[Deepspeed Wav2vec2] integration (#11638)
* wip

* wip - but working with https://github.com/microsoft/DeepSpeed/pull/1044

* cleanup

* workaround

* working 5/8 modes

* solve fp32 distributed zero3

* style

* sync

* sync

* rework

* deprecation

* cleanup

* https://github.com/microsoft/DeepSpeed/pull/1044 pr was merged

* clean up

* add a guide

* more prose

* more prose

* fix

* more prose

* sub_group_size was too big

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* refactor

* bug fix

* make the true check explicit

* new deepspeed release

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-08 12:32:03 -07:00
Stas Bekman
32290d87f6
[Deepspeed] various fixes (#12058)
* replace deprecated config

* sub_group_size was too big

* complete deprecation removal
2021-06-08 08:36:15 -07:00
Stas Bekman
2c73b93099
[Deepspeed] Assert on mismatches between ds and hf args (#12021)
* wip

* add mismatch validation + test

* renames

* Update docs/source/main_classes/deepspeed.rst

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* renames

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-04 08:58:23 -07:00
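A sketch of the convention this validation enforces: ds_config values either say "auto" (the Trainer fills them from `TrainingArguments`) or must equal the corresponding HF argument — a silent mismatch now raises instead of one side quietly winning.

```python
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",  # <- per_device_train_batch_size
    "gradient_accumulation_steps": "auto",
    "optimizer": {"type": "AdamW", "params": {"lr": "auto"}},
}
```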
Stas Bekman
61c5063491
[deepspeed] add nvme test skip rule (#11997)
* add nvme skip rule

* fix
2021-06-02 12:06:37 -07:00
Stas Bekman
640318befa
[deepspeed] Move code and doc into standalone files (#11984)
* move code and docs

* style

* moved

* restore
2021-06-02 09:56:00 -07:00
Stas Bekman
7ec596ecda
[DeepSpeed] decouple DeepSpeedConfigHF from Trainer (#11966)
* decouple DeepSpeedConfigHF from Trainer

* add LoggingLevel ctx manager; add new test

* cleanup

* add docs

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* implemented suggested renames

* formatter workaround

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-01 13:24:52 -07:00
Stas Bekman
a26f4d6208
[Deepspeed] support zero.Init in from_config (#11805)
* support zero.Init in from_config

* no need for eval test
2021-05-21 09:07:46 -07:00
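A sketch of what this enables, shown with the post-refactor import path: with a live ZeRO-3 `HfDeepSpeedConfig`, `from_config` now builds the model inside `deepspeed.zero.Init` — just like `from_pretrained` — so fresh weights are partitioned across ranks at creation instead of materializing whole on every process.

```python
from transformers import AutoConfig, AutoModel
from transformers.integrations import HfDeepSpeedConfig

dschf = HfDeepSpeedConfig({"zero_optimization": {"stage": 3},
                           "train_micro_batch_size_per_gpu": 1})
config = AutoConfig.from_pretrained("bert-base-uncased")
model = AutoModel.from_config(config)  # allocated under zero.Init
```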
Stas Bekman
619200cc42
[cuda ext tests] fixing tests (#11619)
* fixing tests

* cleanup
2021-05-06 13:35:28 -07:00
Stas Bekman
4e7bf94e72
[DeepSpeed] fp32 support (#11499)
* prep for deepspeed==0.3.16

* new version

* too soon

* support and test fp32 mode

* troubleshooting doc start

* workaround no longer needed

* add fp32 doc

* style

* cleanup, add tf32 note

* clarify

* release was made
2021-04-30 12:51:48 -07:00
Stas Bekman
bc2571e61c
[Deepspeed] ZeRO-Infinity integration plus config revamp (#11418)
* adding Z-inf

* revamp config process

* up version requirement

* wip

* massive rewrite

* cleanup

* cleanup

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* consistent json commas

* act on suggestions

* leave this feature for 0.3.16

* style

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-26 10:40:32 -07:00
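A sketch of a ZeRO-Infinity config of the kind this integration adds: stage-3 partitioning with params and optimizer state spilled to NVMe. The `nvme_path` is an assumption — point it at a fast local SSD.

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
    # async I/O tuning knobs for the NVMe path
    "aio": {"block_size": 1048576, "queue_depth": 8},
}
```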
Bhadresh Savani
1d30ec95c7
[Examples] Fixes inconsistency around eval vs val and predict vs test (#11380)
* added changes for uniformity

* modified files

* corrected typo

* fixed qa scripts

* fix typos

* fixed predict typo in qa no trainer

* fixed test file

* reverted trainer changes

* reverted trainer changes in custom examples

* updated readme

* added changes in deepspeed test

* added changes for predict and eval
2021-04-26 09:24:31 -07:00
Patrick von Platen
32dbb2d954
make style (#11442) 2021-04-26 13:50:34 +02:00
Sylvain Gugger
dabeb15292
Examples reorg (#11350)
* Base move

* Examples reorganization

* Update references

* Put back test data

* Move conftest

* More fixes

* Move test data to test fixtures

* Update path

* Apply suggestions from code review

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>

* Address review comments and clean

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-04-21 11:11:20 -04:00
Stas Bekman
83206ca6a8
[deepspeed] test on one node 2 gpus max (#11237)
* test on one node 2 gpus max

* fix the other place

* refactor

* fix

* cleanup

* more exact version
2021-04-14 11:06:59 -07:00
Stas Bekman
3d339ee659
[Deepspeed] zero3 tests band aid (#11235)
* temp band-aid

* style
2021-04-13 17:58:09 -04:00