fxmarty
37bba2a32d
CI: update to ROCm 6.0.2 and test MI300 (#30266)
...
* update to ROCm 6.0.2 and test MI300
* add callers for mi300
* update dockerfile
* fix trainer tests
* remove apex
* style
* Update tests/trainer/test_trainer_seq2seq.py
* update to torch 2.3
* add workflow dispatch target
* we may need branches: mi300-ci after all
* nit
* fix docker build
* nit
* add check runner
* remove docker-gpu
* fix issues
* fix
---------
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-05-13 18:14:36 +02:00
Zach Mueller
1681a6d452
🚨 Fully revert atomic checkpointing 🚨 (#29370)
...
Fully revert atomic checkpointing
2024-03-04 06:17:42 -05:00
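For context, the atomic-checkpointing pattern being reverted writes to a temporary file and then renames it into place so readers never observe a half-written checkpoint. A minimal stdlib sketch of that general pattern (illustrative only; the reverted Trainer code applied the idea at the checkpoint-directory level, and `atomic_save` is a hypothetical helper name):

```python
import os
import tempfile


def atomic_save(data: bytes, path: str) -> None:
    """Write `data` to `path` atomically: write a temp file in the same
    directory, then os.replace() it into place. os.replace is atomic on
    POSIX when source and destination are on the same filesystem."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp_path, path)  # readers see either the old or new file
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file on failure
        raise
```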
Zach Mueller
93766251cb
Fix bug with rotating checkpoints (#28009)
...
* Fix bug
* Write test
* Keep back old modification for grad accum steps
* Whitespace...
* Whitespace again
* Race condition
* Wait for everyone
2023-12-13 12:17:30 -05:00
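The "Race condition" and "Wait for everyone" bullets point at a classic multi-process pitfall: only the main process should rotate (delete) old checkpoints, and every rank must synchronize before training continues, so no process reads a directory mid-deletion. A hedged sketch of that shape with `torch.distributed` (the helper name and layout are illustrative, not the actual Trainer code):

```python
import os
import shutil

import torch.distributed as dist


def rotate_checkpoints(output_dir: str, save_total_limit: int) -> None:
    """Keep at most `save_total_limit` checkpoint-* directories.

    Only rank 0 deletes; all ranks then hit a barrier ("wait for
    everyone") so no process races against the deletion.
    """
    is_main = not dist.is_initialized() or dist.get_rank() == 0
    if is_main:
        checkpoints = sorted(
            (d for d in os.listdir(output_dir) if d.startswith("checkpoint-")),
            key=lambda d: int(d.split("-")[-1]),  # sort by global step
        )
        for stale in checkpoints[:-save_total_limit]:
            shutil.rmtree(os.path.join(output_dir, stale))
    if dist.is_initialized():
        dist.barrier()
```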
Abhilash Majumder
70a98024b1
Patch with accelerate xpu (#25714)
...
* patch with accelerate xpu
* formatting
* fix tests
* revert ruff unrelated fixes
* fix test
* review fixes
* black fixed
* review commits
* style fix
* use pytorch_utils
* revert markuplm test
2023-09-05 15:41:42 +01:00
Zach Mueller
be0e189bd3
Revert frozen training arguments (#25903)
...
* Revert frozen training arguments
* TODO
2023-09-01 11:24:12 -04:00
Zach Mueller
ca51499248
Make training args fully immutable (#25435)
...
* Make training args fully immutable
* Working tests, PyTorch
* In test_trainer
* during testing
* Use proper dataclass way
* Fix test
* Another one
* Fix tf
* Lingering slow
* Exception
* Clean
2023-08-15 11:47:47 -04:00
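The "Use proper dataclass way" bullet refers to enforcing immutability through dataclass machinery rather than ad-hoc checks. A minimal sketch of one common pattern, a frozen dataclass with a copy-on-modify helper (the class and fields here are hypothetical stand-ins, not the real `TrainingArguments`):

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    # Hypothetical fields for illustration only.
    learning_rate: float = 5e-5
    num_train_epochs: int = 3

    def with_updates(self, **changes) -> "TrainingConfig":
        # dataclasses.replace returns a modified copy and leaves the
        # original untouched -- the idiomatic escape hatch for frozen args.
        return dataclasses.replace(self, **changes)
```

With `frozen=True`, any attempt to assign to an attribute after construction raises `dataclasses.FrozenInstanceError`, which is what makes downstream mutation bugs surface immediately.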
Zach Mueller
3b734f5042
Add dispatch_batches to training arguments (#25038)
...
* Dispatch batches
* Copy items
2023-07-24 09:27:19 -04:00
statelesshz
9c875839c0
add ascend npu accelerator support (#24879)
...
* Add Ascend NPU accelerator support
* fix style warning
2023-07-18 08:20:32 -04:00
Zachary Mueller
03462875cc
Introduce PartialState as the device handler in the Trainer (#22752)
...
* Use accelerate for device management
* Add accelerate to setup
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2023-04-17 15:09:45 -04:00
Stas Bekman
1306b7d3ae
[tests] switch to torchrun (#22712)
2023-04-12 08:25:45 -07:00
Sylvain Gugger
6f79d26442
Update quality tooling for formatting (#21480)
...
* Result of black 23.1
* Update target to Python 3.7
* Switch flake8 to ruff
* Configure isort
* Configure isort
* Apply isort with line limit
* Put the right black version
* adapt black in check copies
* Fix copies
2023-02-06 18:10:56 -05:00
jeffhataws
c59d71b282
Add AWS Neuron torchrun support (#20806)
...
* Add XLA torchrun support
* Clarify that DDP doesn't yet work with the torch.distributed XLA backend
* Enable DDP with torchrun and XLA (now available in PT-XLA 1.13)
* Add check for AWS Neuron availability and AWS Neuron specific compiler flag
* Change the new test's name to TestTrainerDistributedNeuronCore
* Remove "assert" and replace raised exception
* Remove compiler flag as it is optional. If needed, will be another PR.
* Use TORCHELASTIC_RUN_ID to determine whether torchrun is used
2023-01-18 11:21:19 -05:00
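The detection trick in the last bullet works because torchrun (torch.distributed.elastic) exports `TORCHELASTIC_RUN_ID` into every worker's environment, so its presence distinguishes a torchrun launch from a plain `python script.py` invocation. A stdlib sketch of that check (the function name is illustrative, not the exact commit code):

```python
import os


def launched_with_torchrun() -> bool:
    """Return True when the process was started by torchrun.

    torchrun / torch.distributed.elastic set TORCHELASTIC_RUN_ID in each
    worker's environment; a direct invocation leaves it unset.
    """
    return "TORCHELASTIC_RUN_ID" in os.environ
```

On AWS Neuron instances this lets the trainer decide whether to initialize the distributed XLA backend for DDP without requiring an extra flag.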
Lysandre Debut
29c10a41d0
[Test refactor 1/5] Per-folder tests reorganization (#15725)
...
* Per-folder tests reorganization
Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Stas Bekman <stas@stason.org>
2022-02-23 15:46:28 -05:00