Joao Gante
beb2a09687
DeepSpeed: hardcode torch.arange dtype on float usage to avoid incorrect initialization ( #28760 )
2024-01-31 14:39:07 +00:00
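A minimal sketch of the failure mode this commit guards against, assuming DeepSpeed's zero.Init has patched tensor creation so a dtype-less `torch.arange` can come back in the half-precision default (the helper below is illustrative, not the actual patch):

```python
import torch

def rope_inv_freq(dim: int) -> torch.Tensor:
    # Before: torch.arange(0, dim, 2).float() could be created in fp16
    # under DeepSpeed, losing precision before the cast. Hardcoding the
    # dtype makes the initialization deterministic.
    positions = torch.arange(0, dim, 2, dtype=torch.int64).float()
    return 1.0 / (10000.0 ** (positions / dim))
```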
Xuehai Pan
976189a6df
Fix initialization for missing parameters in from_pretrained under ZeRO-3 ( #28245 )
* Fix initialization for missing parameters in `from_pretrained` under ZeRO-3
* Test initialization for missing parameters under ZeRO-3
* Add more tests
* Only enable deepspeed context for per-module level parameters
* Enable deepspeed context only once
* Move class definition inside test case body
2024-01-09 14:58:21 +00:00
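A hedged sketch of the per-module pattern the fix relies on (helper name is hypothetical): under ZeRO-3 each parameter is partitioned across ranks, so it must be gathered, initialized on rank 0, and re-partitioned.

```python
import deepspeed
import torch.nn as nn

def init_missing_weights(model, module: nn.Module):
    # Gather this module's own (non-recursive) parameters; rank 0 runs
    # the usual transformers _init_weights hook, then DeepSpeed
    # re-partitions them on context exit.
    params = list(module.parameters(recurse=False))
    with deepspeed.zero.GatheredParameters(params, modifier_rank=0):
        model._init_weights(module)
```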
Ella Charlaix
39acfe84ba
Add deepspeed test to amd scheduled CI ( #27633 )
* add deepspeed scheduled test for amd
* fix image
* add dockerfile
* add comment
* enable tests
* trigger
* remove trigger for this branch
* trigger
* change runner env to trigger the docker build image test
* use new docker image
* remove test suffix from docker image tag
* replace test docker image with original image
* push new image
* Trigger
* add back amd tests
* fix typo
* add amd tests back
* fix
* comment until docker image build scheduled test fix
* remove deprecated deepspeed build option
* upgrade torch
* update docker & make tests pass
* Update docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile
* fix
* tmp disable test
* precompile deepspeed to avoid timeout during tests
* fix comment
* trigger deepspeed tests with new image
* comment tests
* trigger
* add sklearn dependency to fix slow tests
* enable back other tests
* final update
---------
Co-authored-by: Felix Marty <felix@hf.co>
Co-authored-by: Félix Marty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-12-11 16:33:36 +01:00
Hz, Ji
c5d7754b11
device-agnostic deepspeed testing ( #27342 )
2023-11-09 12:34:13 +01:00
Sourab Mangrulkar
7ecd229ba4
Smangrul/fix failing ds ci tests ( #27358 )
* fix failing DeepSpeed CI tests due to `safetensors` being default
* debug
* remove debug statements
* resolve comments
* Update test_deepspeed.py
2023-11-09 11:47:24 +05:30
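For context, a sketch of the knob involved (illustrative, and only an assumption about how a test might opt out rather than the verbatim fix): `safetensors` became the default save format, so tests asserting on `.bin` checkpoint files had to adapt.

```python
from transformers import TrainingArguments

# Hypothetical opt-out: keep the legacy torch .bin serialization.
args = TrainingArguments(output_dir="out", save_safetensors=False)
```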
Sourab Mangrulkar
b477327394
fix the deepspeed tests ( #26021 )
* fix the deepspeed tests
* resolve comment
2023-09-13 10:26:53 +05:30
Sourab Mangrulkar
6bc517ccd4
deepspeed resume from ckpt fixes and adding support for deepspeed optimizer and HF scheduler ( #25863 )
* Add support for deepspeed optimizer and HF scheduler
* fix bug
* fix the import
* fix issue with deepspeed scheduler saving for hf optim + hf scheduler scenario
* fix loading of hf scheduler when loading deepspeed checkpoint
* fix import of `DeepSpeedSchedulerWrapper`
* add tests
* add the comment and skip the failing tests
* address comment
2023-09-05 22:31:20 +05:30
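An illustrative sketch of the newly supported combination (values are examples): the optimizer comes from the DeepSpeed config while the scheduler stays on the HF side, since the config carries no `scheduler` block.

```python
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {"stage": 2},
    # DeepSpeed optimizer; "auto" defers the value to the Trainer args.
    "optimizer": {"type": "AdamW", "params": {"lr": "auto"}},
    # no "scheduler" block: the HF scheduler below is used instead
}
args = TrainingArguments(
    output_dir="out",
    deepspeed=ds_config,
    lr_scheduler_type="cosine",  # HF scheduler paired with the DS optimizer
)
```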
Younes Belkada
4b79697865
🚨 🚨 🚨 [Refactor] Move third-party related utility files into integrations/ folder 🚨 🚨 🚨 ( #25599 )
* move deepspeed to `lib_integrations.deepspeed`
* more refactor
* oops
* fix slow tests
* Fix docs
* fix docs
* address feedback
* address feedback
* final modifs for PEFT
* fixup
* ok now
* trigger CI
* trigger CI again
* Update docs/source/en/main_classes/deepspeed.md
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* import from `integrations`
* address feedback
* revert removal of `deepspeed` module
* revert removal of `deepspeed` module
* fix conflicts
* ooops
* oops
* add deprecation warning
* place it on the top
* put `FutureWarning`
* fix conflicts with not_doctested.txt
* add back `bitsandbytes` module with a depr warning
* fix
* fix
* fixup
* oops
* fix doctests
---------
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2023-08-25 17:13:34 +02:00
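After this refactor the import path changes, with the old one kept behind a deprecation shim:

```python
# New location after the move to the integrations/ folder:
from transformers.integrations import HfDeepSpeedConfig

# Old path still resolves but emits a FutureWarning:
# from transformers.deepspeed import HfDeepSpeedConfig
```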
Sourab Mangrulkar
a73b1d59a3
accelerate deepspeed and gradient accumulation integrate ( #23236 )
* mixed precision support via accelerate
* fix issues
* fix for the sharded ddp case
* fix flax and tf failing tests
* refactor the place to create `Accelerator` object
* move ddp prep to accelerate
* fix 😅
* resolving comments
* move fsdp handling to accelerate
* fixes
* fix saving
* shift torch dynamo handling to accelerate
* shift deepspeed integration and save & load utils to accelerate
* fix accelerate launcher support
* oops
* fix 🐛
* save ckpt fix
* Trigger CI
* nasty 🐛 😅
* as deepspeed needs grad_acc fixes, transfer grad_acc to accelerate
* make tests happy
* quality ✨
* loss tracked needs to account for grad_acc
* fixing the deepspeed tests
* quality ✨
* 😅 😅 😅
* tests 😡
* quality ✨
* Trigger CI
* resolve comments and fix the issue with the previous merge from branch
* Trigger CI
* accelerate took over deepspeed integration
---------
Co-authored-by: Stas Bekman <stas@stason.org>
2023-05-31 15:16:22 +05:30
Yih-Dar
fe1f5a639d
Fix decorator order ( #22708 )
fix
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-04-11 17:59:15 +02:00
Stas Bekman
ec24132b6c
[deepspeed] offload + non-cpuadam optimizer exception ( #22043 )
* [deepspeed] offload + non-cpuadam optimizer exception
* flip
* revert min version
2023-03-09 08:12:57 -08:00
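An illustrative config for the case under test (keys are standard DeepSpeed options; the pairing semantics are my reading of the commit title): optimizer-state offload combined with an optimizer other than DeepSpeed's CPUAdam is the setup expected to raise.

```python
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # offload expects CPUAdam...
    },
    # ...unless DeepSpeed is told not to force its CPU optimizer in:
    "zero_force_ds_cpu_optimizer": False,
}
```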
Stas Bekman
633062639b
[deepspeed tests] fix issues introduced by #21700 ( #21769 )
* [deepspeed tests] fix issues introduced by #21700
* fix
* fix
2023-02-23 13:22:25 -08:00
Aaron Gokaslan
5e8c8eb5ba
Apply ruff flake8-comprehensions ( #21694 )
2023-02-22 09:14:54 +01:00
Stas Bekman
8ea994d3c5
[tests] add missing report_to none ( #21505 )
[tests] report_to none
2023-02-08 09:32:40 -08:00
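The pattern this commit adds across the tests, shown as a one-liner sketch:

```python
from transformers import TrainingArguments

# Disable all experiment-tracking integrations so test runs don't try
# to report to wandb, tensorboard & co.
args = TrainingArguments(output_dir="out", report_to="none")
```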
Sylvain Gugger
6f79d26442
Update quality tooling for formatting ( #21480 )
* Result of black 23.1
* Update target to Python 3.7
* Switch flake8 to ruff
* Configure isort
* Configure isort
* Apply isort with line limit
* Put the right black version
* adapt black in check copies
* Fix copies
2023-02-06 18:10:56 -05:00
Sylvain Gugger
36d4647993
Refine Bf16 test for deepspeed ( #17734 )
* Refine BF16 check in CPU/GPU
* Fixes
* Renames
2022-06-16 11:27:58 -04:00
Stas Bekman
d28b7aa8cb
[deepspeed / testing] reset global state ( #17553 )
* [deepspeed] fix load_best_model test
* [deepspeed] add state reset on unittest tearDown
2022-06-06 07:49:25 -07:00
Stas Bekman
26e5e129b4
[deepspeed] fix load_best_model test ( #17550 )
2022-06-03 11:19:03 -07:00
Stas Bekman
2f59ad1609
[trainer/deepspeed] load_best_model (reimplement re-init) ( #17151 )
* [trainer/deepspeed] load_best_model
* to sync with DS PR #1947
* simplify
* rework load_best_model test
* cleanup
* bump deepspeed>=0.6.5
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2022-06-02 09:14:21 -07:00
Stas Bekman
f861504466
[Deepspeed] add many more models to the model zoo test ( #12695 )
* model zoo take 2
* add deberta
* new param for zero2
* doc update
* doc update
* add layoutlm
* bump deepspeed
* add deberta-v2, funnel, longformer
* new models
* style
* add t5_v1
* update TAPAS status
* reorg problematic models
* move doc to another PR
* style
* fix checkpoint check test
* making progress on more models running
* cleanup
* new version
* cleanup
2022-05-10 08:22:42 -07:00
Stas Bekman
ce2fef2ad2
[trainer / deepspeed] fix hyperparameter_search ( #16740 )
* [trainer / deepspeed] fix hyperparameter_search
* require optuna
* style
* oops
* add dep in the right place
* create deepspeed-testing dep group
* Trigger CI
2022-04-14 17:24:38 -07:00
Sylvain Gugger
4975002df5
Reorganize file utils ( #16264 )
* Split file_utils in several submodules
* Fixes
* Add back more objects
* More fixes
* Who exactly decided to import that from there?
* Second suggestion to code with code review
* Revert wrong move
* Fix imports
* Adapt all imports
* Adapt all imports everywhere
* Revert this import, will fix in a separate commit
2022-03-23 10:26:33 -04:00
Stas Bekman
580dd87c55
[Deepspeed] add support for bf16 mode ( #14569 )
* [WIP] add support for bf16 mode
* prep for bf16
* prep for bf16
* fix; zero2/bf16 is ok
* check bf16 is available
* test fixes
* enable zero3_bf16
* config files
* docs
* split stage_dtype; merge back to non-dtype-specific config file
* fix doc
* cleanup
* cleanup
* bfloat16 => bf16 to match the PR changes
* s/zero_gather_fp16_weights_on_model_save/zero_gather_16bit_weights_on_model_save/; s/save_fp16_model/save_16bit_model/
* test fixes/skipping
* move
* fix
* Update docs/source/main_classes/deepspeed.mdx
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* backticks
* cleanup
* cleanup
* cleanup
* new version
* add note about grad accum in bf16
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2022-03-11 17:53:53 -08:00
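An illustrative fragment of the new bf16 mode (the save knob reflects the fp16-to-16bit rename in the bullets above; values are examples, not defaults):

```python
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
```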
Stas Bekman
b842d7277a
fix deepspeed tests ( #15881 )
* fix deepspeed tests
* style
* more fixes
2022-03-01 19:27:28 -08:00
Lysandre Debut
29c10a41d0
[Test refactor 1/5] Per-folder tests reorganization ( #15725 )
* Per-folder tests reorganization
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Stas Bekman <stas@stason.org>
2022-02-23 15:46:28 -05:00
Stas Bekman
4f5faaf044
[deepspeed] fix a bug in a test ( #15493 )
* [deepspeed] fix a bug in a test
* consistency
2022-02-03 08:55:45 -08:00
Stas Bekman
b66c5ab20c
[deepspeed] fix --load_best_model_at_end ( #14652 )
* [deepspeed] fix load_best_model_at_end
* try with pull_request_target
* revert: try with pull_request_target
* style
* add test
* cleanup
2021-12-06 21:57:47 -08:00
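A sketch of the arguments the new test exercises (config path is hypothetical): with DeepSpeed engaged, the best checkpoint must be reloadable at the end of training.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config_zero3.json",   # hypothetical config file
    evaluation_strategy="steps",
    save_strategy="steps",
    load_best_model_at_end=True,
)
```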
Stas Bekman
956a483173
[deepspeed] zero inference ( #14253 )
* [deepspeed] zero inference
* only z3 makes sense for inference
* fix and style
* docs
* rework
* fix test
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* responding to suggestions
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-11-23 14:09:15 -08:00
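A hedged sketch of the ZeRO-3 inference recipe this introduces (only ZeRO-3 makes sense for inference, as the bullets note; model and values are illustrative, and a deepspeed launcher is needed to actually run it):

```python
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = {
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
}
dschf = HfDeepSpeedConfig(ds_config)  # must stay alive through from_pretrained
model = AutoModelForCausalLM.from_pretrained("gpt2")
engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)
engine.module.eval()
```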
Stas Bekman
1c76a51615
solve the port conflict ( #14362 )
2021-11-10 19:11:45 -08:00
Jeff Rasley
d0e96c6de6
[deepspeed] Enable multiple test runs on single box, defer to DS_TEST_PORT if set ( #14331 )
* defer to DS_TEST_PORT if set
* style
Co-authored-by: Stas Bekman <stas@stason.org>
2021-11-08 12:40:29 -08:00
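The gist of the change, sketched (default value shown is illustrative):

```python
import os

# Defer to DS_TEST_PORT when set, so several DeepSpeed test runs can
# share one box without the distributed launchers colliding on a port.
DEFAULT_MASTER_PORT = "10999"
master_port = os.environ.get("DS_TEST_PORT", DEFAULT_MASTER_PORT)
```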
Olatunji Ruwase
42f359d015
Use DS callable API to allow hf_scheduler + ds_optimizer ( #13216 )
* Use DS callable API to allow hf_scheduler + ds_optimizer
* Preserve backward-compatibility
* Restore backward compatibility
* Tweak arg positioning
* Tweak arg positioning
* bump the required version
* Undo indent
* Update src/transformers/trainer.py
* style
Co-authored-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2021-08-30 10:01:06 -07:00
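A sketch of the callable API the commit switches to (toy model and config; needs a distributed launcher to actually run): rather than pre-building the HF scheduler, which is impossible before the DS optimizer exists, a factory is passed that DeepSpeed calls with whichever optimizer it created.

```python
import deepspeed
import torch
from transformers.optimization import get_linear_schedule_with_warmup

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 3e-5}},
}

def _lr_scheduler_callable(optimizer):
    # Called by DeepSpeed with the optimizer it instantiated.
    return get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=1000
    )

engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=torch.nn.Linear(4, 4),  # toy stand-in
    config_params=ds_config,
    lr_scheduler=_lr_scheduler_callable,
)
```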
Stas Bekman
98364ea74f
[tests] fix logging_steps requirements ( #12860 )
2021-07-23 08:05:48 -07:00
Stas Bekman
5dd0c956a8
non-native optimizers are mostly ok with zero-offload ( #12690 )
2021-07-13 20:18:51 -07:00
Stas Bekman
78f5fe1416
[Deepspeed] adapt multiple models, add zero_to_fp32 tests ( #12477 )
* zero_to_fp32 tests
* args change
* remove unnecessary work
* use transformers.trainer_utils.get_last_checkpoint
* document the new features
* cleanup
* wip
* fix fsmt
* add bert
* cleanup
* add xlm-roberta
* electra works
* cleanup
* sync
* split off the model zoo tests
* cleanup
* cleanup
* cleanup
* cleanup
* reformat
* cleanup
* casing
* deepspeed>=0.4.3
* adjust distilbert
* Update docs/source/main_classes/deepspeed.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* style
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-07-13 12:07:32 -07:00
Stas Bekman
ebe5413589
[trainer] 2 bug fixes and a rename ( #12309 )
* bug fixes and a rename
* add extended DDP test
2021-06-22 11:13:23 -07:00
Stas Bekman
11d86d3de4
[Deepspeed Wav2vec2] integration ( #11638 )
* wip
* wip - but working with https://github.com/microsoft/DeepSpeed/pull/1044
* cleanup
* workaround
* working 5/8 modes
* solve fp32 distributed zero3
* style
* sync
* sync
* rework
* deprecation
* cleanup
* https://github.com/microsoft/DeepSpeed/pull/1044 pr was merged
* clean up
* add a guide
* more prose
* more prose
* fix
* more prose
* sub_group_size was too big
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* refactor
* bug fix
* make the true check explicit
* new deepspeed release
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-08 12:32:03 -07:00
Stas Bekman
32290d87f6
[Deepspeed] various fixes ( #12058 )
* replace deprecated config
* sub_group_size was too big
* complete deprecation removal
2021-06-08 08:36:15 -07:00
Stas Bekman
2c73b93099
[Deepspeed] Assert on mismatches between ds and hf args ( #12021 )
* wip
* add mismatch validation + test
* renames
* Update docs/source/main_classes/deepspeed.rst
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* renames
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-04 08:58:23 -07:00
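A sketch of the contract being validated: DS config values that restate Trainer arguments must either match them or be set to "auto", which defers to the HF side; explicit mismatches now raise.

```python
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",  # <- args.per_device_train_batch_size
    "gradient_accumulation_steps": "auto",     # <- args.gradient_accumulation_steps
}
```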
Stas Bekman
61c5063491
[deepspeed] add nvme test skip rule ( #11997 )
* add nvme skip rule
* fix
2021-06-02 12:06:37 -07:00
Stas Bekman
640318befa
[deepspeed] Move code and doc into standalone files ( #11984 )
* move code and docs
* style
* moved
* restore
2021-06-02 09:56:00 -07:00
Stas Bekman
7ec596ecda
[DeepSpeed] decouple DeepSpeedConfigHF from Trainer ( #11966 )
* decouple DeepSpeedConfigHF from Trainer
* add LoggingLevel ctx manager; add new test
* cleanup
* add docs
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* implemented suggested renames
* formatter workaround
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-06-01 13:24:52 -07:00
Stas Bekman
a26f4d6208
[Deepspeed] support zero.Init in from_config ( #11805 )
* support zero.Init in from_config
* no need for eval test
2021-05-21 09:07:46 -07:00
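A sketch of what this enables, written with the HfDeepSpeedConfig name the class acquired shortly after (see the decouple/rename entry above); the ZeRO-3 config is illustrative:

```python
from transformers import AutoConfig, AutoModel
from transformers.deepspeed import HfDeepSpeedConfig

# While this object is alive, from_config builds the model under
# deepspeed.zero.Init, partitioning weights at construction instead of
# materializing the full model on every rank.
dschf = HfDeepSpeedConfig({
    "zero_optimization": {"stage": 3},
    "train_micro_batch_size_per_gpu": 1,
})
model = AutoModel.from_config(AutoConfig.from_pretrained("gpt2"))
```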
Stas Bekman
619200cc42
[cuda ext tests] fixing tests ( #11619 )
* fixing tests
* cleanup
2021-05-06 13:35:28 -07:00
Stas Bekman
4e7bf94e72
[DeepSpeed] fp32 support ( #11499 )
* prep for deepspeed==0.3.16
* new version
* too soon
* support and test fp32 mode
* troubleshooting doc start
* workaround no longer needed
* add fp32 doc
* style
* cleanup, add tf32 note
* clarify
* release was made
2021-04-30 12:51:48 -07:00
Stas Bekman
bc2571e61c
[Deepspeed] ZeRO-Infinity integration plus config revamp ( #11418 )
* adding Z-inf
* revamp config process
* up version requirement
* wip
* massive rewrite
* cleanup
* cleanup
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* consistent json commas
* act on suggestions
* leave this feature for 0.3.16
* style
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2021-04-26 10:40:32 -07:00
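An illustrative ZeRO-Infinity fragment (keys are standard DeepSpeed options; path and values are examples): stage-3 partitioning with parameters and optimizer state offloaded to NVMe.

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}
```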
Bhadresh Savani
1d30ec95c7
[Examples] Fixes inconsistency around eval vs val and predict vs test ( #11380 )
* added changes for uniformity
* modified files
* corrected typo
* fixed qa scripts
* fix typos
* fixed predict typo in qa no trainer
* fixed test file
* reverted trainer changes
* reverted trainer changes in custom examples
* updated readme
* added changes in deepspeed test
* added changes for predict and eval
2021-04-26 09:24:31 -07:00
Patrick von Platen
32dbb2d954
make style ( #11442 )
2021-04-26 13:50:34 +02:00
Sylvain Gugger
dabeb15292
Examples reorg ( #11350 )
* Base move
* Examples reorganization
* Update references
* Put back test data
* Move conftest
* More fixes
* Move test data to test fixtures
* Update path
* Apply suggestions from code review
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
* Address review comments and clean
Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
2021-04-21 11:11:20 -04:00
Stas Bekman
83206ca6a8
[deepspeed] test on one node 2 gpus max ( #11237 )
* test on one node 2 gpus max
* fix the other place
* refactor
* fix
* cleanup
* more exact version
2021-04-14 11:06:59 -07:00
Stas Bekman
3d339ee659
[Deepspeed] zero3 tests band aid ( #11235 )
* temp band-aid
* style
2021-04-13 17:58:09 -04:00