Commit Graph

13 Commits

Author · SHA1 · Message · Date
fxmarty
37bba2a32d
CI: update to ROCm 6.0.2 and test MI300 (#30266)
* update to ROCm 6.0.2 and test MI300

* add callers for mi300

* update dockerfile

* fix trainer tests

* remove apex

* style

* Update tests/trainer/test_trainer_seq2seq.py

* Update tests/trainer/test_trainer_seq2seq.py

* Update tests/trainer/test_trainer_seq2seq.py

* Update tests/trainer/test_trainer_seq2seq.py

* update to torch 2.3

* add workflow dispatch target

* we may need branches: mi300-ci after all

* nit

* fix docker build

* nit

* add check runner

* remove docker-gpu

* fix issues

* fix

---------

Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-05-13 18:14:36 +02:00
Zach Mueller
1681a6d452
🚨 Fully revert atomic checkpointing 🚨 (#29370)
Fully revert atomic checkpointing
2024-03-04 06:17:42 -05:00
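For context, "atomic" here means the save either fully lands or not at all. Below is a minimal sketch of the pattern this PR reverts, assuming a POSIX filesystem; `save_checkpoint_atomically` and its file layout are illustrative, not the Trainer's actual code.

```python
# Sketch of atomic checkpointing: write into a staging directory, then
# rename it into place, so a crash mid-save never leaves a torn checkpoint.
# Assumes `final_path` does not already exist; names are illustrative only.
import os
import tempfile

import torch


def save_checkpoint_atomically(state_dict: dict, output_dir: str, name: str) -> str:
    final_path = os.path.join(output_dir, name)
    # Stage in a sibling directory so the final rename stays on one filesystem.
    staging_dir = tempfile.mkdtemp(prefix=f"tmp-{name}-", dir=output_dir)
    torch.save(state_dict, os.path.join(staging_dir, "pytorch_model.bin"))
    os.replace(staging_dir, final_path)  # atomic rename on POSIX
    return final_path
```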
Zach Mueller
93766251cb
Fix bug with rotating checkpoints (#28009)
* Fix bug

* Write test

* Keep back old modification for grad accum steps

* Whitespace...

* Whitespace again

* Race condition

* Wait for everyone
2023-12-13 12:17:30 -05:00
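"Rotating checkpoints" is the `save_total_limit` behavior: keep only the newest N checkpoint directories and delete the rest. A minimal sketch of the idea; the helper below is illustrative, not the actual Trainer code.

```python
# Keep only the newest `save_total_limit` checkpoint-<step> directories.
import os
import re
import shutil


def rotate_checkpoints(output_dir: str, save_total_limit: int) -> None:
    pattern = re.compile(r"^checkpoint-(\d+)$")
    checkpoints = sorted(
        (int(m.group(1)), os.path.join(output_dir, name))
        for name in os.listdir(output_dir)
        if (m := pattern.match(name)) is not None
    )
    # Everything older than the newest `save_total_limit` entries is deleted.
    for _, path in checkpoints[: max(len(checkpoints) - save_total_limit, 0)]:
        shutil.rmtree(path, ignore_errors=True)
```

The "Race condition" and "Wait for everyone" commits point at the distributed wrinkle: only one process should do the deleting, and the others must wait for it to finish before writing the next checkpoint.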
Abhilash Majumder
70a98024b1
Patch with accelerate xpu (#25714)
* patch with accelerate xpu

* patch with accelerate xpu

* formatting

* fix tests

* revert ruff unrelated fixes

* revert ruff unrelated fixes

* revert ruff unrelated fixes

* fix test

* review fixes

* review fixes

* black fixed

* review commits

* review commits

* style fix

* use pytorch_utils

* revert markuplm test
2023-09-05 15:41:42 +01:00
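The XPU (Intel GPU) path hinges on detecting the backend at runtime. A hedged sketch of that kind of check, assuming the era's requirement that `intel_extension_for_pytorch` registers the `torch.xpu` namespace on import; this mirrors the spirit of the check, not its exact code.

```python
import importlib.util

import torch


def is_xpu_available() -> bool:
    # intel_extension_for_pytorch registers the torch.xpu backend on import.
    if importlib.util.find_spec("intel_extension_for_pytorch") is None:
        return False
    import intel_extension_for_pytorch  # noqa: F401

    return hasattr(torch, "xpu") and torch.xpu.is_available()
```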
Zach Mueller
be0e189bd3
Revert frozen training arguments (#25903)
* Revert frozen training arguments

* TODO
2023-09-01 11:24:12 -04:00
Zach Mueller
ca51499248
Make training args fully immutable (#25435)
* Make training args fully immutable

* Working tests, PyTorch

* In test_trainer

* during testing

* Use proper dataclass way

* Fix test

* Another one

* Fix tf

* Lingering slow

* Exception

* Clean
2023-08-15 11:47:47 -04:00
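To illustrate what "fully immutable" means here: attribute assignments raise once initialization completes. A generic sketch of the frozen-after-init dataclass pattern, not transformers' actual `TrainingArguments` code.

```python
import dataclasses


@dataclasses.dataclass
class FrozenAfterInitArgs:
    learning_rate: float = 5e-5
    num_train_epochs: int = 3

    def __post_init__(self):
        # Flip the switch via object.__setattr__ to bypass our own guard.
        object.__setattr__(self, "_frozen", True)

    def __setattr__(self, name, value):
        if getattr(self, "_frozen", False):
            raise dataclasses.FrozenInstanceError(f"cannot assign to field {name!r}")
        super().__setattr__(name, value)


args = FrozenAfterInitArgs()
# args.learning_rate = 1e-4  # would raise dataclasses.FrozenInstanceError
```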
Zach Mueller
3b734f5042
Add dispatch_batches to training arguments (#25038)
* Dispatch batches

* Copy items
2023-07-24 09:27:19 -04:00
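Roughly, `dispatch_batches=True` makes the main process iterate the dataloader and broadcast slices of each batch to the other processes, while `False` lets every process iterate its own shard. A hedged sketch against the accelerate API of that era; in later accelerate versions the flag moved into `DataLoaderConfiguration`.

```python
from torch.utils.data import DataLoader

from accelerate import Accelerator

# Main process loads data and dispatches batch slices to all workers.
accelerator = Accelerator(dispatch_batches=True)
loader = accelerator.prepare(DataLoader(range(100), batch_size=8))
```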
statelesshz
9c875839c0
add ascend npu accelerator support (#24879)
* Add Ascend NPU accelerator support

* fix style warning
2023-07-18 08:20:32 -04:00
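A short, hedged sketch of how Ascend NPU support is typically used: the `torch_npu` package registers a `torch.npu` namespace mirroring the CUDA API. Assumes the Ascend stack is installed; illustrative, not the PR's code.

```python
import torch
import torch_npu  # noqa: F401  (registers the torch.npu namespace)

if torch.npu.is_available():
    device = torch.device("npu:0")
    tensor = torch.ones(2, 2).to(device)
```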
Zachary Mueller
03462875cc
Introduce PartialState as the device handler in the Trainer (#22752)
* Use accelerate for device management

* Add accelerate to setup

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
2023-04-17 15:09:45 -04:00
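`PartialState` is accelerate's lightweight process state: it resolves the launcher, rank, and device without constructing a full `Accelerator`, which is what lets the Trainer delegate device management. A minimal usage sketch:

```python
from accelerate import PartialState

state = PartialState()
print(f"process {state.process_index}/{state.num_processes} on {state.device}")

if state.is_main_process:
    ...  # one-off setup on the main process only
state.wait_for_everyone()  # barrier so other processes don't race ahead
```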
Stas Bekman
1306b7d3ae
[tests] switch to torchrun (#22712)
2023-04-12 08:25:45 -07:00
Sylvain Gugger
6f79d26442
Update quality tooling for formatting (#21480)
* Result of black 23.1

* Update target to Python 3.7

* Switch flake8 to ruff

* Configure isort

* Configure isort

* Apply isort with line limit

* Put the right black version

* adapt black in check copies

* Fix copies
2023-02-06 18:10:56 -05:00
jeffhataws
c59d71b282
Add AWS Neuron torchrun support (#20806)
* Add XLA torchrun support

* Clarify that DDP doesn't yet work with the torch.distributed XLA backend

* Enable DDP with torchrun and XLA (now available in PT-XLA 1.13)

* Add check for AWS Neuron availability and AWS Neuron specific compiler flag

* Change the new test's name to TestTrainerDistributedNeuronCore

* Remove "assert" and replace raised exception

* Remove compiler flag as it is optional. If needed, will be another PR.

* Use TORCHELASTIC_RUN_ID to determine whether torchrun is used
2023-01-18 11:21:19 -05:00
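The detection trick in the last bullet is simple: torchrun (TorchElastic) exports `TORCHELASTIC_RUN_ID` for every worker it spawns, so its presence distinguishes a `torchrun --nproc_per_node=2 train.py` launch from a plain `python train.py`. A minimal sketch:

```python
import os


def launched_with_torchrun() -> bool:
    # torchrun (TorchElastic) sets this for every worker it spawns.
    return os.environ.get("TORCHELASTIC_RUN_ID") is not None
```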
Lysandre Debut
29c10a41d0
[Test refactor 1/5] Per-folder tests reorganization (#15725)
* Per-folder tests reorganization

Co-authored-by: sgugger <sylvain.gugger@gmail.com>
Co-authored-by: Stas Bekman <stas@stason.org>
2022-02-23 15:46:28 -05:00