Commit Graph

152 Commits

Author SHA1 Message Date
Yih-Dar
3d34b92116
Switch to use A10 progressively (#38936)
* try

* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-06-20 16:10:35 +00:00
Yih-Dar
5009252a05
Better CI (#38552)
better CI

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-06-06 17:59:14 +02:00
Yih-Dar
eb74cf977b
Use one utils/notification_service.py (#38379)
* step 1

* step 2

* step 3

* step 4

* step 5

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-05-26 16:15:29 +02:00
Yih-Dar
d0c9c66d1c
new failure CI reports for all jobs (#38298)
* new failures

* report_repo_id

* report_repo_id

* report_repo_id

* More fixes

* More fixes

* More fixes

* ruff

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-05-24 19:15:02 +02:00
Yih-Dar
7819911b0c
Use T4 single GPU runner with more CPU RAM (#37961)
larger T4 single GPU

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-05-05 16:17:45 +02:00
Yih-Dar
4f139f5a50
Send trainer/fsdp/deepspeed CI job reports to a single channel (#37411)
* send trainer/fsdd/deepspeed channel

* update

* change name

* no .

* final

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-04-10 13:17:31 +02:00
Yih-Dar
8064cd9b4f
fix deepspeed job (#37284)
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-04-08 15:19:33 +02:00
Joao Gante
0863eef248
[tests] remove pt_tf equivalence tests (#36253)
2025-02-19 11:55:11 +00:00
Stas Bekman
9dc1efa5d4
DeepSpeed github repo move sync (#36021)
deepspeed github repo move
2025-02-05 08:19:31 -08:00
Yih-Dar
fce1fcfe71
Ping team members for new failed tests in daily CI (#34171)
* ping

* fix

* fix

* fix

* remove runner

* update members

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-10-17 16:11:52 +02:00
Yih-Dar
75c878da1e
Update daily ci to use new cluster (#33627)
* update

* re-enable daily CI

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-09-20 21:05:30 +02:00
Yih-Dar
5334b61c33
Revive AMD scheduled CI (#33448)
Revive AMD scheduled CI

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-09-12 15:52:15 +02:00
Yih-Dar
d4564df1d4
Revive Nightly/Past CI (#31159)
* build

* build

* build

* build

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-06-20 18:57:24 +02:00
Yih-Dar
fbb41cd420
consistent job / pytest report / artifact name correspondence (#30392)
* better names

* run better names

* update

* update

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-04-24 22:32:42 +02:00
Yih-Dar
440bd3c3c0
update github actions packages' version to suppress warnings (#30249)
update

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-04-15 15:08:09 +02:00
Marc Sun
bb76f81e40
[CI] Quantization workflow fix (#30158)
* fix workflow

* call ci

* Update .github/workflows/self-scheduled-caller.yml

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2024-04-10 11:51:06 +02:00
Marc Sun
6cdbd73e01
[CI] Fix setup (#30147)
* [CI] fix setup

* fix

* test

* Revert "test"

This reverts commit 7df416d450.
2024-04-09 18:10:00 +02:00
Marc Sun
58a939c6b7
Fix quantization tests (#29914)
* revert back to torch 2.1.1

* run test

* switch to torch 2.2.1

* udapte dockerfile

* fix awq tests

* fix test

* run quanto tests

* update tests

* split quantization tests

* fix

* fix again

* final fix

* fix report artifact

* build docker again

* Revert "build docker again"

This reverts commit 399a5f9d93.

* debug

* revert

* style

* new notification system

* testing notfication

* rebuild docker

* fix_prev_ci_results

* typo

* remove warning

* fix typo

* fix artifact name

* debug

* issue fixed

* debug again

* fix

* fix time

* test notif with faling test

* typo

* issues again

* final fix ?

* run all quantization tests again

* remove name to clear space

* revert modfiication done on workflow

* fix

* build docker

* build only quant docker

* fix quantization ci

* fix

* fix report

* better quantization_matrix

* add print

* revert to the basic one
2024-04-09 17:10:29 +02:00
Yih-Dar
b17b54d3dd
Refactor daily CI workflow (#30012)
* separate jobs

* separate jobs

* use channel name directly instead of ID

* use channel name directly instead of ID

* use channel name directly instead of ID

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-04-05 15:49:51 +02:00
Marc Sun
f54d82cace
[CI] Quantization workflow (#29046)
* [CI] Quantization workflow

* build dockerfile

* fix dockerfile

* update self-cheduled.yml

* test build dockerfile on push

* fix torch install

* udapte to python 3.10

* update aqlm version

* uncomment build dockerfile

* tests if the scheduler works

* fix docker

* do not trigger on psuh again

* add additional runs

* test again

* all good

* style

* Update .github/workflows/self-scheduled.yml

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

* test build dockerfile with torch 2.2.0

* fix extra

* clean

* revert changes

* Revert "revert changes"

This reverts commit 4cb52b8822.

* revert correct change

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
2024-02-28 10:09:25 -05:00
Yih-Dar
93f8617afd
Use DS_DISABLE_NINJA=1 (#29290)
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-02-26 17:41:01 +08:00
Yih-Dar
4735866141
Split daily CI using 2 level matrix (#28773)
* update / add new workflow files

* Add comment

* Use env.NUM_SLICES

* use scripts

* use scripts

* use scripts

* Fix

* using one script

* Fix

* remove unused file

* update

* fail-fast: false

* remove unused file

* fix

* fix

* use matrix

* inputs

* style

* update

* fix

* fix

* no model name

* add doc

* allow args

* style

* pass argument

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-01-31 18:04:43 +01:00
Yih-Dar
95346e9dcd
Add artifact name in job step to maintain job / artifact correspondence (#28682)
* avoid using job name

* apply to other files

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-01-31 15:58:17 +01:00
Patrick von Platen
cbbe30749b
[Whisper] Fix slow test (#28407)
* [Whisper] Fix slow test

* update

* update

* update

* update

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2024-01-10 22:35:36 +01:00
Ella Charlaix
39acfe84ba
Add deepspeed test to amd scheduled CI (#27633)
* add deepspeed scheduled test for amd

* fix image

* add dockerfile

* add comment

* enable tests

* trigger

* remove trigger for this branch

* trigger

* change runner env to trigger the docker build image test

* use new docker image

* remove test suffix from docker image tag

* replace test docker image with original image

* push new image

* Trigger

* add back amd tests

* fix typo

* add amd tests back

* fix

* comment until docker image build scheduled test fix

* remove deprecated deepspeed build option

* upgrade torch

* update docker & make tests pass

* Update docker/transformers-pytorch-deepspeed-amd-gpu/Dockerfile

* fix

* tmp disable test

* precompile deepspeed to avoid timeout during tests

* fix comment

* trigger deepspeed tests with new image

* comment tests

* trigger

* add sklearn dependency to fix slow tests

* enable back other tests

* final update

---------

Co-authored-by: Felix Marty <felix@hf.co>
Co-authored-by: Félix Marty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-12-11 16:33:36 +01:00
Yih-Dar
9f1f11a2e7
Show new failing tests in a more clear way in slack report (#27881)
* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-12-07 15:09:30 +01:00
Yih-Dar
64e21ca2a4
Make some jobs run on the GitHub Actions runners (#27512)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-11-15 10:43:16 +01:00
Yih-Dar
00dc856233
At most 2 GPUs for CI (#27435)
At most 2 GPUs

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-11-10 16:19:06 +01:00
Yih-Dar
9dc4ce9ea7
Disable CI runner check (#27170)
Disable runner check

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-10-31 11:59:21 +01:00
Yih-Dar
6ae71ec836
Update runs-on in workflow files (#26435)
* update

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-09-27 19:25:52 +02:00
Yih-Dar
11cb6e0f7e
Unpin DeepSpeed and require DS >= 0.9.3 (#24541)
* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-06-28 14:01:22 +02:00
Yih-Dar
7631db0fdc
Pin deepspeed to 0.9.2 for now (#24024)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-06-05 20:00:28 +02:00
Yih-Dar
e69feab8a1
Update workflow files (#23658)
* fix

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-05-22 21:26:51 +02:00
Yih-Dar
aa4316757d
Change schedule CI time (#22884)
* fix

* Update .github/workflows/self-nightly-past-ci-caller.yml

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
2023-04-20 14:01:08 +02:00
Yih-Dar
648bd5a8aa
Show diff between 2 CI runs on Slack reports (#22798)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-04-19 19:27:37 +02:00
Yih-Dar
656d41ab4c
Remove DS_BUILD_AIO=1 (#22741)
fix

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-04-13 18:08:22 +02:00
Yih-Dar
0fe6c6bdca
(Re-)Enable Nightly + Past CI (#22393)
* Enable Nightly + Past CI

* put schedule

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-03-30 21:06:35 +02:00
Yih-Dar
aab895c396
Make Slack CI reporting stronger (#21823)
* Use token

* Avoid failure

* better error

* Fix

* fix style

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2023-02-28 17:12:44 +01:00
Yih-Dar
bf9a5882a7
Update some GH action versions (#20537)
* update actions versions

* update actions versions

* update actions versions

* update actions versions

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-12-06 16:54:40 +01:00
Yih-Dar
67d32f4649
Replace set-output by $GITHUB_OUTPUT (#20547)
* remove set-output

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-12-05 18:25:13 +01:00
Yih-Dar
e8d448edcf
extract warnings in GH workflows (#20487)
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-11-29 15:58:54 +01:00
Yih-Dar
700e0cd65f
Add missing report button for Example test (#20293)
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-11-17 15:55:00 +01:00
Yih-Dar
c06d555647
Show installed libraries and their versions in GA jobs (#20069)
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-11-04 18:03:18 +01:00
Sylvain Gugger
9ac586b3c8
Rework pipeline tests (#19366)
* Rework pipeline tests

* Try to fix Flax tests

* Try to put it before

* Use a new decorator instead

* Remove ignore marker since it doesn't work

* Filter pipeline tests

* Woopsie

* Use the fitlered list

* Clean up and fake modif

* Remove init

* Revert fake modif
2022-10-07 18:01:58 -04:00
Yih-Dar
ba7f2173cc
Add runner availability check (#19054)
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-09-19 12:27:06 +02:00
Yih-Dar
7a8118947f
Add checks for more workflow jobs (#18905)
* add check for scheduled CI

* Add check to other CIs

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-09-07 12:51:37 +02:00
Yih-Dar
7d5fde991d
unpin slack_sdk version (#18901)
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-09-06 18:42:00 +02:00
Yih-Dar
ecdf9b06bc
Remove cached torch_extensions on CI runners (#18868)
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-09-02 18:17:58 +02:00
Yih-Dar
0ab465a5d2
pin Slack SDK to 3.18.1 to avoid failing issue (#18869)
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-09-02 16:49:08 +02:00
Yih-Dar
d2704c4143
Add machine type in the artifact of Examples directory job (#18459)
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2022-08-04 18:52:01 +02:00