Commit Graph

4956 Commits

Author SHA1 Message Date
Rémi Ouazan
493cf1554b
[seamless_m4t] Skip some tests when speech is not available (#38430)
* Added the require_speech decorator

* Added require_speech to some seamless_m4t tests

* Changed skip message
2025-06-02 09:17:28 +00:00
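For context, decorators like this in the test suite are thin wrappers around `unittest.skipUnless`; a minimal sketch of what `require_speech` plausibly does (the exact dependency it checks is an assumption):

```python
import unittest


def require_speech(test_case):
    # Skip when speech dependencies are missing; the real decorator may probe
    # a different package -- torchaudio here is an assumption.
    try:
        import torchaudio  # noqa: F401

        speech_available = True
    except ImportError:
        speech_available = False
    return unittest.skipUnless(speech_available, "test requires speech support")(test_case)
```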
Yuanyuan Chen
fde1120b6c
Remove deprecated use_flash_attention_2 parameter (#37131)
Signed-off-by: cyy <cyyever@outlook.com>
2025-06-02 11:06:25 +02:00
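The removed flag had been superseded by the `attn_implementation` argument; a minimal before/after sketch (model id illustrative):

```python
from transformers import AutoModelForCausalLM

# Before (removed by this commit):
# model = AutoModelForCausalLM.from_pretrained(
#     "mistralai/Mistral-7B-v0.1", use_flash_attention_2=True
# )

# After:
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", attn_implementation="flash_attention_2"
)
```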
M Saqlain
e0545ef0b8
[Tests] Reduced model size for albert-test model (#38480)
* Reduced model size for albert-test model

* Run checks

* Removed test_save_load

* Removed test skipping functions
2025-05-30 14:22:32 +00:00
Yih-Dar
81cff7ad34
Fix Gemma3IntegrationTest (#38471)
* check

* check

* check

* check

* check

* check

* check

* test style bot

* fix

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-05-29 16:51:12 +02:00
Raushan Turganbay
ad9dd3d17b
🔴 [VLM] modeling updates (#38317)
* updates

* fixup

* fix tests

* fix test

* fix

* let it be here for now, till monday

* two more fixes

* persimmon

* fixup

* fix

* fixup

* make sure fuyu runs now that LM has new attn API

* fixup + tests

* qwen vl uses new mask interface as well

* qwen image features format

* update

* remove image_sizes

* address comments

* i am dumb...
2025-05-29 11:08:23 +00:00
Yaswanth Gali
a6f7acb603
[Tests] Clean up test cases for few models (#38315)
* Update tests

* revert aria change

* too slow hence revert
2025-05-29 08:21:28 +00:00
Yih-Dar
66da700145
Fix GLM4 checkpoints (#38412)
* fix

* fix

* fix

* fix

* fix

* fix

* test style bot

* Apply style fixes

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
2025-05-28 16:40:08 +00:00
Matt
f844733568
Fix MoE gradient test (#38438) 2025-05-28 16:44:20 +01:00
Matt
0ed6f7e6b4
Remove redundant test_sdpa_equivalence test (#38436)
* Remove redundant test

* make fixup
2025-05-28 17:22:25 +02:00
Peter St. John
bab40c6838
[core] support tensor-valued _extra_state values in from_pretrained (#38155)
Support tensor-valued _extra_state values

TransformerEngine uses the pytorch get/set_extra_state API to store FP8
layer config information as bytes Tensor in the _extra_state entry in
the state dict. With recent changes to from_pretrained, this
functionality has broken and loading a model that uses this API doesn't
appear to work. This PR fixes the save/load pretrained functions for
extra state entries that use a pytorch tensor, and adds a (currently
x-failing) test for a dictionary extra state.

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
2025-05-28 15:38:42 +02:00
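A toy sketch of the PyTorch extra-state mechanism described above (module and payload are illustrative, not TransformerEngine's real format):

```python
import torch
from torch import nn


class FP8Linear(nn.Module):
    """Stores opaque config bytes via PyTorch's get/set_extra_state API."""

    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(4, 4))

    def get_extra_state(self):
        # Config serialized as a uint8 tensor, analogous to TransformerEngine's
        # bytes Tensor holding FP8 metadata.
        return torch.tensor(list(b'{"fp8": true}'), dtype=torch.uint8)

    def set_extra_state(self, state):
        self.fp8_config = bytes(state.tolist())


print(list(FP8Linear().state_dict()))  # ['weight', '_extra_state']
```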
Anton Vlasjuk
badc71b9f6
🔴[Attention] Attention refactor for Whisper-based models (#38235)
* start refactoring whisper

* revert for now

* first step

* carry over attn fixes

* check if this works

* whisper has an off by one somewhere - cutting mask in any interface

* make it based on interface

* remove some tests that were skipped but now work

* some fixes for whisper tests

* interface changes

* change the order of fix

* some attention adjustments for eager + TP

* fix scaling

* mask changes

* why does whisper contain those extra seq lens?

* fix from config for fa2 as input_ids is invalid

* fix another test

* another fix

* disable flex attn due to compile issues

* copies and refactor for qwen audio since it somewhat relies on whisper

* fix scaling and smaller things

* retrigger

* new new interface version + more fixups

* adjust qwen

* add comment

* forgot this one

* change copies as whisper cuts on the mask

* add guard

* add flex attention

* switch to new mask function + add skips for torchscript

* remove old api with cache position

* last changes?

* trigger ci
2025-05-28 13:32:38 +02:00
Yao Matrix
fb82a98717
enable large_gpu and torchao cases on XPU (#38355)
* cohere2 done

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* enable torchao cases on XPU

Signed-off-by: Matrix YAO <matrix.yao@intel.com>

* fix

Signed-off-by: Matrix YAO <matrix.yao@intel.com>

* fix

Signed-off-by: Matrix YAO <matrix.yao@intel.com>

* fix

Signed-off-by: Matrix YAO <matrix.yao@intel.com>

* rename

Signed-off-by: Matrix YAO <matrix.yao@intel.com>

* fix

Signed-off-by: Matrix YAO <matrix.yao@intel.com>

* fix comments

Signed-off-by: Matrix YAO <matrix.yao@intel.com>

---------

Signed-off-by: Matrix Yao <matrix.yao@intel.com>
Signed-off-by: Matrix YAO <matrix.yao@intel.com>
2025-05-28 10:30:16 +02:00
Yih-Dar
cea254c909
Update CsmForConditionalGenerationIntegrationTest (#38424)
* require_read_token

* ruff

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-05-28 10:20:43 +02:00
Cory Cornelius
9c50576860
[mllama] Allow pixel_values with inputs_embeds (#38334)
* Allow pixel_values and inputs_embeds at the same time

* remove unnecessary overwritten tests
2025-05-27 16:33:56 +00:00
Joao Gante
0f5a8243c4
[tests] remove overload for deleted test (test_offloaded_cache_implementation) (#37896)
* remove overload for deleted tests

* make fixup
2025-05-27 16:45:15 +01:00
eustlb
b9f8f863d9
[CSM] update model id (#38211)
* update model id

* codec_model eval

* add processor img

* use ungated repo for processor tests
2025-05-27 17:03:55 +02:00
Raushan Turganbay
19fdb75cf0
[video utils] group and reorder by number of frames (#38374)
fix
2025-05-27 11:32:33 +02:00
Raushan Turganbay
b0735dc0c1
[paligemma] fix processor with suffix (#38365)
fix pg processor
2025-05-27 11:31:56 +02:00
Yih-Dar
07848a8405
update gemma tests (#38384)
* update

* update

* update

* update

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-05-26 19:54:04 +02:00
Matt
ba6d72226d
🚨 🚨 Fix custom code saving (#37716)
* Firstly: Better detection of when we're a custom class

* Trigger tests

* Let's break everything

* make fixup

* fix mistaken line doubling

* Let's try to get rid of it from config classes at least

* Let's try to get rid of it from config classes at least

* Fixup image processor

* no more circular import

* Let's go back to setting `_auto_class` again

* Let's go back to setting `_auto_class` again

* stash commit

* Revert the irrelevant changes until we figure out AutoConfig

* Change tests since we're breaking expectations

* make fixup

* do the same for all custom classes

* Cleanup for feature extractor tests

* Cleanup tokenization tests too

* typo

* Fix tokenizer tests

* make fixup

* fix image processor test

* make fixup

* Remove warning from register_for_auto_class

* Stop adding model info to auto map entirely

* Remove todo

* Remove the other todo

* Let's start slapping _auto_class on models why not

* Let's start slapping _auto_class on models why not

* Make sure the tests know what's up

* Make sure the tests know what's up

* Completely remove add_model_info_to_*

* Start adding _auto_class to models

* Start adding _auto_class to models

* Add a flaky decorator

* Add a flaky decorator and import

* stash commit

* More message cleanup

* make fixup

* fix indent

* Fix trust_remote_code prompts

* make fixup

* correct indentation

* Reincorporate changes into dynamic_module_utils

* Update call to trust_remote_code

* make fixup

* Fix video processors too

* Fix video processors too

* Remove is_flaky additions

* make fixup
2025-05-26 17:37:30 +01:00
Dhia Eddine Rhaiem
7a9b071bfd
[Falcon H1] Fix slow path forward pass (#38320)
* Create push-important-models.yml

* feat: add falcon-h1

* fixup

* address comment

* fix

* fix copies

* fix copies

* fix

* fix

* fix

* fix

* fix copies

* fix

* fix copies

* fix test import to at least trigger the CIs

* yups

* update

* fix make fix copies

* fix inits?

* fix style

* skip annoying test

* add integration test for Falcon H1

* fix copies

* fix

* fix typo

* make style

* fix slow path generations

* clean debug traces

* debug

* remove debug traces final confirmation

* clean debug traces final

* fix format and lineup

* make style

* debug

* Update src/transformers/models/falcon_h1/modular_falcon_h1.py

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

* address comments

* fix fix-copies

* fix integration test

* Merge pull request #7 from ydshieh/fix-slow-path

update

* another update (#8)

* update

* update

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Younes Belkada <younesbelkada@gmail.com>
Co-authored-by: younesbelkada <younes.belkada@tii.ae>
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-05-26 15:30:35 +02:00
Isotr0py
bff32678cc
Fix incorrect batching audio index calculation for Phi-4-Multimodal (#38103)
* fix

Signed-off-by: Isotr0py <2037008807@qq.com>

* add tests

Signed-off-by: Isotr0py <2037008807@qq.com>

* code format

Signed-off-by: Isotr0py <2037008807@qq.com>

* Update src/transformers/models/phi4_multimodal/feature_extraction_phi4_multimodal.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Signed-off-by: Isotr0py <2037008807@qq.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-05-26 12:41:31 +00:00
Yao Matrix
a5a0c7b888
switch to device agnostic device calling for test cases (#38247)
* use device agnostic APIs in test cases

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* fix style

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* add one more

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* xpu now supports integer device id, aligning to CUDA behaviors

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* update to use device_properties

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* fix style

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* update comment

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* fix comments

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* fix style

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

---------

Signed-off-by: Matrix Yao <matrix.yao@intel.com>
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-05-26 10:18:53 +02:00
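The pattern this commit moves tests to: take the device from `transformers.testing_utils` rather than hard-coding `"cuda"`, so the same case runs on CUDA, XPU, or MPS. A minimal illustration:

```python
import torch
from transformers.testing_utils import torch_device

# torch_device resolves to "cuda", "xpu", "mps", or "cpu" depending on what
# the current machine offers, so tests never hard-code an accelerator.
x = torch.ones(2, 2, device=torch_device)
print(x.device)
```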
Cyril Vallez
e0aad278fe
Never fallback to eager implicitly (#38327)
* remove arg everywhere

* Update warnings

* add more models

* Update sdpa_attention.py

* fix style

* fix

* readd warnings but not for flex

* Update test_modeling_common.py

* skip

* fix

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-05-23 19:48:01 +02:00
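After this change, an attention implementation that cannot be used raises instead of silently degrading to eager; callers who want eager must now ask for it. A sketch:

```python
from transformers import AutoModelForCausalLM

# Explicit is the new contract: if e.g. "flash_attention_2" is requested but
# flash-attn is not installed, loading errors out instead of quietly running
# eager attention.
model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
```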
Matt
53fb245eb6
🚨 🚨 Inherited CausalLM Tests (#37590)
* stash commit

* Experiment 1: Try just Gemma

* Experiment 1: Just try Gemma

* make fixup

* Trigger tests

* stash commit

* Try adding Gemma3 as well

* make fixup

* Correct attrib names

* Correct pipeline model mapping

* Add in all_model_classes for Gemma1 again

* Move the pipeline model mapping around again

* make fixup

* Revert Gemma3 changes since it's a VLM

* Let's try Falcon

* Correct attributes

* Correct attributes

* Let's try just overriding get_config() for now

* Do Nemotron too

* And Llama!

* Do llama/persimmon

* Correctly skip tests

* Fix Persimmon

* Include Phimoe

* Fix Gemma2

* Set model_tester_class correctly

* Add GLM

* More models!

* models models models

* make fixup

* Add Qwen3 + Qwen3MoE

* Correct import

* make fixup

* Add the QuestionAnswering classes

* Add the QuestionAnswering classes

* Move pipeline mapping to the right place

* Jetmoe too

* Stop RoPE testing models with no RoPE

* Fix up JetMOE a bit

* Fix up JetMOE a bit

* Can we just force pad_token_id all the time?

* make fixup

* fix starcoder2

* Move pipeline mapping

* Fix RoPE skipping

* Fix RecurrentGemma tests

* Fix Falcon tests

* Add MoE attributes

* Fix values for RoPE testing

* Make sure we set bos_token_id and eos_token_id in an appropriate range

* make fixup

* Fix GLM4

* Add mamba attributes

* Revert bits of JetMOE

* Re-add the JetMOE skips

* Update tests/causal_lm_tester.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Add licence

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-05-23 18:29:31 +01:00
Aaron V
d5f992f5e6
Enhance Model Loading By Providing Parallelism, Uses Optional Env Flag (#36835)
* Get parallel loader working. Include tests.

* Update the tests for parallel loading

* Rename env variables.

* Add docs for parallel model weight loading.

* Touch up parallel model loading docs.

* Touch up parallel model loading docs again.

* Edit comment in test_modeling_utils_parallel_loading.py

* Make sure HF_PARALLEL_LOADING_WORKERS is spelled correctly in modeling_utils.py

* Correct times for parallelized loading, previous times were for a "hot" filesystem

* Update parallel model loading so the spawn method is encapsulated. DRY up the code by leveraging get_submodule.

* Update docs on model loading parallelism so that details on setting the multiprocessing start method are removed, now that the package handles this step internally.

* Fix style on model loading parallelism changes.

* Merge latest version of master's modeling_utils.

* Removed unused variable.

* Fix argument packing for the parallel loader.

* Fix state dict being undefined in the parallel model loader.

* Rename variables used in parallel model loading for clarity. Use get_module_from_name().

* Switch to the use of threads for parallel model loading.

* Update docs for parallel loading.

* Remove the use of json.loads when evaluating HF_ENABLE_PARALLEL_LOADING. Prefer simple casting.

* Move parallelized shard loading into its own function.

* Remove use of is_true(). Favor checking env var true values for HF_ENABLE_PARALLEL_LOADING.

* Update copyright to 2025 in readme for parallel model loading.

* Remove garbage collection line in load_shard_file, implicit garbage collection already occurs.

* Run formatter on modeling_utils.py

* Apply style fixes

* Delete tests/utils/test_modeling_utils_parallel_loading.py

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
2025-05-23 16:39:47 +00:00
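Going by the commit messages, the feature is opt-in via environment flags checked before loading; a usage sketch under that assumption:

```python
import os

# Flag names taken from the commit messages above; accepted values are an
# assumption ("truthy" strings, per the "simple casting" commit).
os.environ["HF_ENABLE_PARALLEL_LOADING"] = "true"
os.environ["HF_PARALLEL_LOADING_WORKERS"] = "8"  # threads, after the switch from processes

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # shards load in parallel
```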
Anton Vlasjuk
1ed19360b1
[FlexAttention] Reenable flex for encoder-decoder and make the test more robust (#38321)
* reenable most flex attention test cases

* style

* trigger

* trigger
2025-05-23 18:16:43 +02:00
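For reference, a minimal FlexAttention call of the kind these tests exercise (generic sketch, not the repository's test code; requires a recent PyTorch):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention


def causal(score, b, h, q_idx, kv_idx):
    # Mask future positions by pushing their scores to -inf.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))


q = k = v = torch.randn(1, 2, 8, 16)  # (batch, heads, seq_len, head_dim)
out = flex_attention(q, k, v, score_mod=causal)
print(out.shape)
```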
Arthur
f5d45d89c4
🚨Early-error🚨 config will error out if output_attentions=True and the attn implementation is wrong (#38288)
* Protect ParallelInterface

* early error out on output attention setting for no warning in modeling

* modular update

* fixup

* update model tests

* update

* oups

* set model's config

* more cases

* ??

* properly fix

* fixup

* update

* last onces

* update

* fix?

* fix wrong merge commit

* fix hub test

* nits

* wow I am tired

* updates

* fix pipeline!

---------

Co-authored-by: Lysandre <hi@lysand.re>
2025-05-23 17:17:38 +02:00
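The constraint being enforced: only the eager implementation materializes attention weights, so `output_attentions=True` with any other implementation now fails fast at config time instead of warning during modeling. The valid combination:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2", attn_implementation="eager")
out = model(torch.tensor([[464, 3290, 318]]), output_attentions=True)
print(len(out.attentions), out.attentions[0].shape)  # 12 layers, (1, 12, 3, 3)
```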
Cyril Vallez
896833c183
Fix some tests (especially compile with fullgraph=True on Python<3.11) (#38319)
* fix tests

* better fix for python<3.11

* fixes

* style
2025-05-23 17:11:40 +02:00
Joao Gante
a6b51e7341
[Whisper + beam search] fix usage of beam_indices (#38259)
* tmp

* fix test_tiny_token_timestamp_batch_generation

* better comments

* test

* comments

* Apply suggestions from code review

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>

---------

Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
2025-05-23 10:05:44 +00:00
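For reference, `beam_indices` is the beam-search generate output field that maps each generated token back to the beam it came from; a minimal sketch of producing it (model chosen purely for illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Hello there", return_tensors="pt")

out = model.generate(
    **inputs, num_beams=4, max_new_tokens=5,
    return_dict_in_generate=True, output_scores=True,
)
print(out.beam_indices)  # one beam index per generated position, -1 as padding
```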
Yuanyuan Chen
11b670a282
Fix run_slow (#38314)
Signed-off-by: cyy <cyyever@outlook.com>
2025-05-23 10:18:30 +02:00
Yaswanth Gali
2b585419b4
[Tests] Cleanup Janus Testcase (#38311)
* Cleanup janus testcase

* shift code to setup
2025-05-23 09:29:16 +02:00
Arthur
211f2b0875
Add CB (#38085)
* stash for now

* initial commit

* small updated

* up

* up

* works!

* nits and fixes

* don't loop too much

* finish working example

* update

* fix the small freeblocks issue

* feat: stream inputs to continuous batch

* fix: update attn from `eager` to `sdpa`

* refactor: fmt

* refactor: cleanup unnecessary code

* feat: add `update` fn to `PagedAttentionCache`

* feat: broken optimal block size computation

* fix: debugging invalid cache logic

* fix: attention mask

* refactor: use custom prompts for example

* feat: add streaming output

* fix: prefill split

refactor: add doc strings and remove unsound/redundant logic
fix: compute optimal blocks logic

* fix: send decoded tokens when `prefilling_split` -> `decoding`

* refactor: move logic to appropriate parent class

* fix: remove truncation as we split prefilling anyways

refactor: early return when we have enough selected requests

* feat: add paged attention forward

* push graphs

* add paged sdpa

* update

* better mps defaults

* feat: add progress bar for `generate_batch`

* feat: add opentelemetry metrics (ttft + batch fill %age)

* feat: add tracing

* Add cuda graphs (#38059)

* draft cudagraphs addition

* nits

* styling

* update

* fix

* kinda draft of what it should look like

* fixes

* lol

* not sure why inf everywhere

* can generate but output is shit

* some fixes

* we should have a single device synch

* broken outputs but it does run

* refactor

* updates

* updates with some fixes

* fix mask causality

* another commit that casts after

* add error

* simplify example

* update

* updates

* revert llama changes

* fix merge conflicts

* fix: tracing and metrics

* my updates

* update script default values

* fix block allocation issue

* fix prefill split attention mask

* no bugs

* add paged eager

* fix

* update

* style

* feat: add pytorch traces

* fix

* fix

* refactor: remove pytorch profiler data

* style

* nits

* cleanup

* draft test file

* fix

* fix

* fix paged and graphs

* small renamings

* cleanups and push

* refactor: move tracing and metrics logic to utils

* refactor: trace more blocks of code

* nits

* nits

* update

* to profile or not to profile

* refactor: create new output object

* causal by default

* cleanup but generations are still off for IDK what reason

* simplifications but not running still

* this does work.

* small quality of life updates

* nits

* update

* fix the scheduler

* fix warning

* ol

* fully fixed

* nits

* different generation parameters

* nice

* just style

* feat: add cache memory usage

* feat: add kv cache free memory

* feat: add active/waiting count & req latency

* do the sampling

* fix: synchronize CUDA only if available and improve error handling in ContinuousBatchingManager

* fix on mps

* feat: add dashboard & histogram buckets

* perf: improve waiting reqs data structures

* attempt to compile, but we should only do it on mps AFAIK

* feat: decouple scheduling logic

* just a draft

* cleanup and fixup

* optional

* style

* update

* update

* remove the draft documentation

* fix import as well

* update

* fix the test

* style doomed

---------

Co-authored-by: Luc Georges <luc.sydney.georges@gmail.com>
2025-05-22 17:43:48 +02:00
Anton Vlasjuk
d95c864a25
🔴🔴🔴 [Attention] Refactor Attention Interface for Bart-based Models (#38108)
* starting attn refactor for encoder decoder models via bart (eager + sdpa)

* flash attention works, remove unnecessary code

* flex attention support for bart!, gotta check if the renaming is not too aggressive

* some comments

* skip flex grad test for standalone as done with the other test

* revert flex attn rename (for now), sdpa simplify, and todos

* more todos

* refactor mask creation for reuse

* modular attempt at biogpt

* first batch of other models

* fix attn dropout

* fix autoformer copies

* hubert

* another batch of models

* copies/style + last round of bart models --> whisper next?

* remove unnecessary _reshape function and remove copy to whisper

* add skip for decoder-only models out of enc-dec (same as in bart)

* bring back licences

* remove comment, added to pr read instead

* mostly docs

* disable sew flex attn as its attn mask is unclear for now

* oops

* test fixes for enc-dec

* torch fx fixes + try at flex attn

* skip on mbart

* some more fixes

* musicgen skip / delete old attn class logic + sdpa compose compile skip

* disable flex attn for musicgen, not worth the effort

* more fixes and style

* flex attention test for dropout and encoder decoder that dont have main input names

* informer fixes

* the weirdest thing I've encountered yet...

* style

* remove empty tensor attempt, found core root in previous commits

* disable time series due to tests being very text centric on inputs

* add speech to text to be ignoring the other attns, also due to tests

* update docs

* remaining issues resolved ?

* update docs for current state --> nllb moe and pegasus x sdpa is questionable :D

* some models have not set the is_causal flag...

* change dtype in softmax to old behaviour + some modular fixes

* I hate it but it is what it is

* fixes from main for bart

* forgot this one

* some model fixes

* style

* current status

* marian works now

* fixing some copies

* some copy fixes + time series x informer

* last models possibly and fixes on style/copies

* some post merge fixes

* more fixes

* make attention interface callable and move warnings there

* style lol

* add comment to "unsupported"

* remove callable interface and change interface warnings + some copies

* fix

* ternary is ugly af, make it simpler

* how did that happen

* fix flex attn test

* failing the test

* no more fallback! fixing copies next

* style + attn fixed

* fixing copies and mask creation

* wrong copy

* fixup tests and disable flex attn for now

* fixup last tests?
2025-05-22 17:12:58 +02:00
Cyril Vallez
163138a911
🚨🚨[core] Completely rewrite the masking logic for all attentions (#37866)
* start

* start having a clean 4d mask primitive

* Update mask_utils.py

* Update mask_utils.py

* switch name

* Update masking_utils.py

* add a new AttentionMask tensor class

* fix import

* nits

* fixes

* use full and quadrants

* general sdpa mask for all caches

* style

* start some tests

* tests with sliding, chunked

* add styling

* test hybrid

* Update masking_utils.py

* small temp fixes

* Update modeling_gemma2.py

* compile compatible

* Update masking_utils.py

* improve

* start making it more general

* Update masking_utils.py

* generate

* make it work with flex style primitives!

* Update masking_utils.py

* Update masking_utils.py

* Update masking_utils.py

* improve

* Update cache_utils.py

* Update masking_utils.py

* simplify - starting to look good!

* Update masking_utils.py

* name

* Update masking_utils.py

* style

* Update masking_utils.py

* Update masking_utils.py

* Update masking_utils.py

* Update masking_utils.py

* small fix for flex

* flex compile

* FA2

* Update masking_utils.py

* Escape for TGI/vLLM!

* Update masking_utils.py

* Update masking_utils.py

* Update masking_utils.py

* General case without cache

* rename

* full test on llama4

* small fix for FA2 guard with chunk

* Update modeling_gemma2.py

* post rebase cleanup

* FA2 supports static cache!

* Update modeling_flash_attention_utils.py

* Update flex_attention.py

* Update masking_utils.py

* Update masking_utils.py

* Update utils.py

* override for export

* Update executorch.py

* Update executorch.py

* Update executorch.py

* Update executorch.py

* Update masking_utils.py

* Update masking_utils.py

* output attentions

* style

* Update masking_utils.py

* Update executorch.py

* Add docstring

* Add license and put mask visualizer at the end

* Update test_modeling_common.py

* fix broken test

* Update test_modeling_gemma.py

* Update test_modeling_gemma2.py

* Use fullgraph=False with FA2

* Update utils.py

* change name

* Update masking_utils.py

* improve doc

* change name

* Update modeling_attn_mask_utils.py

* more explicit logic based on model's property

* pattern in config

* extend

* fixes

* make it better

* generalize to other test models

* fix

* Update masking_utils.py

* fix

* do not check mask equivalence if layer types are different

* executorch

* Update modeling_gemma2.py

* Update masking_utils.py

* use layer_idx instead

* adjust

* Update masking_utils.py

* test

* fix imports

* Update modeling_gemma2.py

* other test models

* Update modeling_llama4.py

* Update masking_utils.py

* improve

* simplify

* Update masking_utils.py

* typos

* typo

* fix

* Update masking_utils.py

* default DynamicCache

* remove default cache

* simplify

* Update masking_utils.py

* Update masking_utils.py

* Update masking_utils.py

* Update masking_utils.py

* simplify

* Update masking_utils.py

* Update masking_utils.py

* Update masking_utils.py

* export

* Update executorch.py

* Update executorch.py

* Update flex_attention.py

* Update executorch.py

* upstream to modular gemma 1 & 2

* Update modular_mistral.py

* switch names

* use dict

* put it in the Layer directly

* update copy model source for mask functions

* apply so many modular (hopefully 1 shot)

* use explicite dicts for make style happy

* protect import

* check docstring

* better default in hybrid caches

* qwens

* Update modular_qwen2.py

* simplify core logic!

* Update executorch.py

* qwen3 moe

* Update masking_utils.py

* Update masking_utils.py

* simplify a lot sdpa causal skip

* Update masking_utils.py

* post-rebase

* gemma3 finally

* style

* check it before

* gemma3

* More general with newer torch

* align gemma3

* Update utils.py

* Update utils.py

* Update masking_utils.py

* Update test_modeling_common.py

* Update flex_attention.py

* Update flex_attention.py

* Update flex_attention.py

* test

* executorch

* Update test_modeling_common.py

* Update masking_utils.py

* Update masking_utils.py

* Update masking_utils.py

* Update masking_utils.py

* Update executorch.py

* Update test_modeling_common.py

* fix copies

* device

* sdpa can be used without mask -> pass the torchscript tests in this case

* Use enum for check

* revert enum and add check instead

* remove broken test

* cohere2

* some doc & reorganize the Interface

* Update tensor_parallel.py

* Update tensor_parallel.py

* doc and dummy

* Update test_modeling_paligemma2.py

* Update modeling_falcon_h1.py

* Update masking_utils.py

* executorch patch

* style

* CIs

* use register in executorch

* final comments!

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2025-05-22 11:38:26 +02:00
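The 4D mask primitive at the center of the rewrite can be pictured as a predicate over `(batch, head, q_idx, kv_idx)` that is materialized for SDPA/eager or handed to flex attention directly; a toy illustration, not the actual `masking_utils` API:

```python
import torch


def causal_mask_function(batch_idx, head_idx, q_idx, kv_idx):
    # Flex-style primitive: True where the query may attend to the key.
    return kv_idx <= q_idx


q = torch.arange(4)[:, None]
kv = torch.arange(4)[None, :]
mask_4d = causal_mask_function(0, 0, q, kv)[None, None]  # (1, 1, q_len, kv_len)
print(mask_4d.squeeze(0).squeeze(0).int())  # lower-triangular 4x4
```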
Joao Gante
aa02a5d902
[whisper] move processor test into processor test file 🧹 (#38266)
move processor tests
2025-05-22 10:07:11 +01:00
Yuanyuan Chen
ae3e4e2d97
Improve typing in TrainingArgument (#36944)
* Better error message in TrainingArgument typing checks

* Better typing

* Small fixes

Signed-off-by: cyy <cyyever@outlook.com>

---------

Signed-off-by: cyy <cyyever@outlook.com>
2025-05-21 13:54:38 +00:00
Joao Gante
e4decee9c0
[whisper] small changes for faster tests (#38236) 2025-05-21 14:11:08 +01:00
Mohamed Mekkouri
9a962dd9ed
Add tearDown method to Quark to solve OOM issues (#38234)
fix
2025-05-21 14:26:44 +02:00
youngrok cha
101b3fa4ea
fix multi-image case for llava-onevision (#38084)
* _get_padding_size module

* do not patchify images when processing multi image

* modify llava onevision image processor fast

* tensor to list of tensors

* backward compat

* reuse pad_to_square in llava & some clarification

* add to doc

* fix: consider no image cases (text only or video)

* add integration test

* style & repo_consistency
2025-05-21 11:50:46 +02:00
Raushan Turganbay
a21f11fca2
[compile] re-enable for Qwen-VL models (#38127)
* compile qwen models

* delete TODO comment

* fix embeds test

* fix assisted decoding

* add comments
2025-05-21 09:50:39 +00:00
Dhia Eddine Rhaiem
4542086db7
[Falcon H1] Fix Typo in Integration Test (#38256)
* Create push-important-models.yml

* feat: add falcon-h1

* fixup

* address comment

* fix

* fix copies

* fix copies

* fix

* fix

* fix

* fix

* fix copies

* fix

* fix copies

* fix test import to at least trigger the CIs

* yups

* update

* fix make fix copies

* fix inits?

* fix style

* skip annoying test

* add integration test for Falcon H1

* fix copies

* fix

* fix typo

* make style

---------

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Younes Belkada <younesbelkada@gmail.com>
Co-authored-by: younesbelkada <younes.belkada@tii.ae>
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2025-05-21 11:25:26 +02:00
Younes Belkada
6829936ee0
[MODEL] Add Falcon H1 (#38249)
* Create push-important-models.yml

* feat: add falcon-h1

* fixup

* address comment

* fix

* fix copies

* fix copies

* fix

* fix

* fix

* fix

* fix copies

* fix

* fix copies

* fix test import to at least trigger the CIs

* yups

* update

* fix make fix copies

* fix inits?

* fix style

* skip annoying test

* add integration test for Falcon H1

* fix copies

* fix

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: dhia.rhaiem <dhia.rhaiem@tii.ae>
2025-05-21 10:43:11 +02:00
Garrett Goon
390f153469
Add padding-free to bamba (#35861)
* add seq_idx and fa kwargs

* update tests

* docs and grad ckpt support

* fmt

* better names

* test_raise_missing_padding_free_kwarg_errs

* + seq_idx in doc strings

* padding free training docs

* add link to pr plots

* raise err on attn_mask with padding free

* rm raising missing padding free err test

* BambaFlashAttentionKwargs

* run modular util for modular_granitemoehybrid.py
2025-05-20 17:13:59 +02:00
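The padding-free idea, sketched (`seq_idx` is the kwarg named in the commits above; shapes and dtype here are assumptions): sequences are concatenated instead of padded, and per-token sequence ids let flash attention keep them separate.

```python
import torch

# Two sequences packed without padding; no attention_mask needed.
seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
input_ids = torch.cat(seqs)[None]  # (1, total_tokens)
seq_idx = torch.cat(
    [torch.full((len(s),), i, dtype=torch.long) for i, s in enumerate(seqs)]
)[None]
print(input_ids)  # tensor([[1, 2, 3, 4, 5]])
print(seq_idx)    # tensor([[0, 0, 0, 1, 1]])
```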
ivarflakstad
3f0b7d0fac
Mamba2 remove unecessary test parameterization (#38227) 2025-05-20 13:54:04 +00:00
Pablo Montalvo
9cde2f5d42
Minor llama4 fixes (#38123)
* fix wrong scaling value/default Cache init

* style

* fix various issues on integration tests

* change expected outputs

* fixup

* fix config access

* protect default scaling
2025-05-20 13:15:54 +00:00
ivarflakstad
de70c8426e
Disable torchscript tests for AriaForConditionalGenerationModelTest (#38225)
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
2025-05-20 14:37:55 +02:00
Manuel de Prada Corral
d34e21e7dd
New cache tests and refactored Hybrid Cache (#37972) 2025-05-20 12:46:13 +02:00
Titus
f022bf9322
Remove trust_remote_code=True tests from bnb quantization tests (MPT now integrated) (#38206)
bnb quant tests: remove obsolete trust_remote_code test

The MPT model is now natively integrated in Transformers and no longer requires trust_remote_code=True. This removes the failing test_get_keys_to_not_convert_trust_remote_code and related usage, which depended on remote code and caused CI issues due to missing dependencies (e.g., triton_pre_mlir).
2025-05-20 11:43:11 +02:00
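The load that no longer needs remote code, sketched (quantization config per the bitsandbytes integration; model id illustrative):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# MPT is natively integrated, so no trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    "mosaicml/mpt-7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)
```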
Raushan Turganbay
0a52bd2403
[fix] sliding window attention mask (#38045)
* fix sliding attn

* make style

* Update tests/test_modeling_common.py

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>

* on a second thought, should default to `True` for BC

---------

Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
2025-05-20 09:32:19 +00:00
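The shape of the mask being fixed, as a generic illustration (not the repository's implementation): each query attends to itself and the previous `window - 1` positions.

```python
import torch


def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where query i may attend to key j, i.e. 0 <= i - j < window."""
    diff = torch.arange(seq_len)[:, None] - torch.arange(seq_len)[None, :]
    return (diff >= 0) & (diff < window)


print(sliding_window_causal_mask(5, 3).int())
```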