* args keep_torch_compile=False in _save and _wwrap_method
* Fix FSDP execution on evaluation for torch_compile mode
* add test trainer FSDP + Torch Compile
* fix quality code
* make style
* Revert " make style"
This reverts commit 77e797f8829c50992cc21496be3d9a3e480e1c97.
* make style
* test
* fix
* fix
* skip some and run some first
* test fsdp
* fix
* patches for generate
* test distributed
* copy
* don't test distributed loss for hpu
* require fp16 and run first
* changes from marc's PR fixing zero3
* better alternative
* return True when fp16 support on gaudi without creating bridge
* fix
* fix tested dtype in deepspeed inference test
* test
* fix
* test
* fix
* skip
* require fp16
* run first fsdp
* Apply suggestions from code review
* address comments
* address comments and refactor test
* reduce precison
* avoid doing gaudi1 specific stuff in the genreation loop
* document test_gradient_accumulation_loss_alignment_with_model_loss test a bit more
* Remove FSDP wrapping from sub-models.
* solve conflict trainer.py
* make fixup
* add unit test for fsdp_auto_wrap_policy when using auto_find_batch_size
* put back extract_model_from_parallel
* use transformers unwrap_model
* exclude fsdp from delay_optimizer_creation
* add test case for trainer: FSDP mode and fp8 as mixed precision
* rearrange imports
* ruff formatted
* adapt _init_fsdp to fp8
* use _init_fsdp only when resume_from_checkpoint
* In case of FDP, self.layer will be CheckpointWrapper which has no len() method
* delete _init_fsdp
* solve conflict
* fix conflict
* make fixup
* Default synced_gpus to True when using FullyShardedDataParallel
Fixes#30228
Related:
* https://github.com/pytorch/pytorch/issues/100069
* https://github.com/pytorch/pytorch/issues/123962
Similar to DeepSpeed ZeRO Stage 3, when using FSDP with multiple GPUs and differently sized data per rank, the ranks reach different synchronization points at the same time, leading to deadlock
To avoid this, we can automatically set synced_gpus to True if we detect that a PreTrainedModel is being managed by FSDP using _is_fsdp_managed_module, which was added in 2.0.0 for torch.compile: https://github.com/pytorch/pytorch/blob/v2.0.0/torch/distributed/fsdp/_dynamo_utils.py
* Remove test file
* ruff formatting
* ruff format
* Update copyright year
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Add test for FSDP-wrapped model generation
Before #33483, these tests would have hung for 10 minutes before crashing due to a timeout error
* Ruff format
* Move argparse import
* Remove barrier
I think this might cause more problems if one of the workers was killed
* Move import into function to decrease load time
https://github.com/huggingface/transformers/pull/33483#discussion_r1787972735
* Add test for accelerate and Trainer
https://github.com/huggingface/transformers/pull/33483#discussion_r1790309675
* Refactor imports
* Ruff format
* Use nullcontext
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>