transformers/tests/trainer
Matthew Hoffman 70b07d97cf
Default synced_gpus to True when using FullyShardedDataParallel (#33483)
* Default synced_gpus to True when using FullyShardedDataParallel

Fixes #30228

Related:

* https://github.com/pytorch/pytorch/issues/100069
* https://github.com/pytorch/pytorch/issues/123962

Similar to DeepSpeed ZeRO Stage 3, when using FSDP across multiple GPUs with differently sized data per rank, the ranks reach different synchronization points at the same time, leading to deadlock: a rank that finishes generating early stops participating in the collective ops that the still-running ranks are blocked on.
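
To see why, here is a self-contained illustration of the failure mode and of the synced-GPUs escape hatch (plain torch.distributed, not transformers code; runnable on CPU via the gloo backend): every decoding step under FSDP triggers collective ops, so a rank that stops stepping early leaves the other ranks blocked, and the cure is for finished ranks to keep issuing dummy steps until everyone is done.

```python
# Standalone sketch of the deadlock mechanism, not transformers code.
# Run with: torchrun --nproc_per_node=2 deadlock_sketch.py
import torch
import torch.distributed as dist


def main():
    dist.init_process_group("gloo")  # CPU backend, so no GPUs needed
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Pretend rank 0 needs 3 decoding steps and every other rank needs 5,
    # e.g. because the per-rank batches have different lengths.
    my_steps = 3 if rank == 0 else 5

    step, finished = 0, False
    while True:
        # synced_gpus-style termination: all ranks agree, every step,
        # on whether anyone still has real work left.
        done = torch.tensor(1.0 if finished else 0.0)
        dist.all_reduce(done, op=dist.ReduceOp.MIN)
        if done.item() == 1.0:
            break

        # Stand-in for one forward pass: under FSDP each step runs
        # collectives like this parameter all-gather. If a finished rank
        # simply stopped looping (the old behavior), the remaining ranks
        # would block here forever.
        shard = torch.randn(4)
        gathered = [torch.empty(4) for _ in range(world_size)]
        dist.all_gather(gathered, shard)

        step += 1
        finished = step >= my_steps  # this rank ran out of real tokens

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```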

To avoid this, we can automatically set synced_gpus to True when we detect that a PreTrainedModel is being managed by FSDP, using _is_fsdp_managed_module, which was added in PyTorch 2.0.0 for torch.compile: https://github.com/pytorch/pytorch/blob/v2.0.0/torch/distributed/fsdp/_dynamo_utils.py
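
A minimal sketch of that detection and the new default (hedged: the helper below mirrors the description above, but the exact names and call site in the PR may differ):

```python
# Sketch only; the exact helper name and call site in the PR may differ.
from typing import Optional

import torch.nn as nn


def is_fsdp_managed_module(module: nn.Module) -> bool:
    """True if `module` is an FSDP wrapper or is annotated as FSDP-managed."""
    import torch.distributed.fsdp

    return isinstance(module, torch.distributed.fsdp.FullyShardedDataParallel) or getattr(
        # PyTorch >= 2.0 sets `_is_fsdp_managed_module` on FSDP-managed
        # submodules for torch.compile (see the link above).
        module,
        "_is_fsdp_managed_module",
        False,
    )


def resolve_synced_gpus(model: nn.Module, synced_gpus: Optional[bool]) -> bool:
    # Only change behavior when the caller didn't pick a value themselves:
    # under FSDP, default to True so that every rank keeps running forward
    # passes until all ranks have finished generating.
    if synced_gpus is None:
        return is_fsdp_managed_module(model)
    return synced_gpus
```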

* Remove test file

* ruff formatting

* ruff format

* Update copyright year

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Add test for FSDP-wrapped model generation

Before #33483, these tests would have hung for 10 minutes before failing with a timeout error
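
For context, the launcher half of such a test can look roughly like the sketch below (hypothetical class and worker-script names; TestCasePlus, execute_subprocess_async, get_torch_dist_unique_port, and require_torch_multi_gpu are existing helpers in transformers.testing_utils). Because the workers run in a subprocess, a deadlock surfaces as a subprocess timeout instead of hanging the whole suite:

```python
# Hypothetical launcher sketch; the real test in test_trainer_fsdp.py differs.
import torch

from transformers.testing_utils import (
    TestCasePlus,
    execute_subprocess_async,
    get_torch_dist_unique_port,
    require_torch_multi_gpu,
)


class TestFSDPGenerateSketch(TestCasePlus):
    @require_torch_multi_gpu
    def test_fsdp_generate(self):
        # Spawn one worker per GPU under torchrun; the worker script
        # (hypothetical name) wraps a small model in FSDP and calls
        # generate() with differently sized inputs per rank.
        distributed_args = f"""--nproc_per_node={torch.cuda.device_count()}
            --master_port={get_torch_dist_unique_port()}
            {self.test_file_dir}/fsdp_generate_worker.py
        """.split()
        cmd = ["torchrun"] + distributed_args
        execute_subprocess_async(cmd, env=self.get_env())
```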

* Ruff format

* Move argparse import

* Remove barrier

I think this might cause more problems if one of the workers were killed

* Move import into function to decrease load time

https://github.com/huggingface/transformers/pull/33483#discussion_r1787972735

* Add test for accelerate and Trainer

https://github.com/huggingface/transformers/pull/33483#discussion_r1790309675

* Refactor imports

* Ruff format

* Use nullcontext

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-10-10 14:09:04 -04:00
__init__.py [Test refactor 1/5] Per-folder tests reorganization (#15725) 2022-02-23 15:46:28 -05:00
test_data_collator.py Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs (#31629) 2024-07-23 15:56:41 +02:00
test_trainer_callback.py add a callback hook right before the optimizer step (#33444) 2024-09-13 10:43:45 +02:00
test_trainer_distributed.py CI: update to ROCm 6.0.2 and test MI300 (#30266) 2024-05-13 18:14:36 +02:00
test_trainer_fsdp.py Default synced_gpus to True when using FullyShardedDataParallel (#33483) 2024-10-10 14:09:04 -04:00
test_trainer_seq2seq.py Trainer - deprecate tokenizer for processing_class (#32385) 2024-10-02 14:08:46 +01:00
test_trainer_tpu.py [Test refactor 1/5] Per-folder tests reorganization (#15725) 2022-02-23 15:46:28 -05:00
test_trainer_utils.py Add strategy to store results in evaluation loop (#30267) 2024-04-17 12:42:27 +01:00
test_trainer.py Fix PIL dep for tests (#34028) 2024-10-09 10:45:06 -04:00