transformers/tests
Arthur 211f2b0875
Add CB (#38085)
* stash for now

* initial commit

* small updated

* up

* up

* works!

* nits and fixes

* don't loop too much

* finish working example

* update

* fix the small freeblocks issue

* feat: stream inputs to continuous batch

* fix: update attn from `eager` to `sdpa`

* refactor: fmt

* refactor: cleanup unnecessary code

* feat: add `update` fn to `PagedAttentionCache`

* feat: broken optimal block size computation

* fix: debugging invalid cache logic

* fix: attention mask

* refactor: use custom prompts for example

* feat: add streaming output

* fix: prefill split

refactor: add doc strings and unsound/redundant logic
fix: compute optimal blocks logic

* fix: send decoded tokens when `prefilling_split` -> `decoding`

* refactor: move logic to appropriate parent class

* fix: remove truncation as we split prefilling anyways

refactor: early return when we have enough selected requests

* feat: add paged attention forward

* push Ggraoh>

* add paged sdpa

* update

* btter mps defaults

* feat: add progress bar for `generate_batch`

* feat: add opentelemetry metrics (ttft + batch fill %age)

* feat: add tracing

* Add cuda graphs (#38059)

* draft cudagraphs addition

* nits

* styling

* update

* fix

* kinda draft of what it should look like

* fixes

* lol

* not sure why inf everywhere

* can generate but output is shit

* some fixes

* we should have a single device synch

* broken outputs but it does run

* refactor

* updates

* updates with some fixes

* fix mask causality

* another commit that casts after

* add error

* simplify example

* update

* updates

* revert llama changes

* fix merge conflicts

* fix: tracing and metrics

* my updates

* update script default values

* fix block allocation issue

* fix prefill split attnetion mask

* no bugs

* add paged eager

* fix

* update

* style

* feat: add pytorch traces

* fix

* fix

* refactor: remove pytorch profiler data

* style

* nits

* cleanup

* draft test file

* fix

* fix

* fix paged and graphs

* small renamings

* cleanups and push

* refactor: move tracing and metrics logic to utils

* refactor: trace more blocks of code

* nits

* nits

* update

* to profile or not to profile

* refactor: create new output object

* causal by default

* cleanup but generations are still off for IDK what reason

* simplifications but not running still

* this does work.

* small quality of life updates

* nits

* updaet

* fix the scheduler

* fix warning

* ol

* fully fixed

* nits

* different generation parameters

* nice

* just style

* feat: add cache memory usage

* feat: add kv cache free memory

* feat: add active/waiting count & req latency

* do the sampling

* fix: synchronize CUDA only if available and improve error handling in ContinuousBatchingManager

* fix on mps

* feat: add dashboard & histogram buckets

* perf: improve waiting reqs data structures

* attempt to compile, but we should only do it on mps AFAIK

* feat: decouple scheduling logic

* just a draft

* c;eanup and fixup

* optional

* style

* update

* update

* remove the draft documentation

* fix import as well

* update

* fix the test

* style doomed

---------

Co-authored-by: Luc Georges <luc.sydney.georges@gmail.com>
2025-05-22 17:43:48 +02:00
..
bettertransformer Use Python 3.9 syntax in tests (#37343) 2025-04-08 14:12:08 +02:00
deepspeed 🚨 rm already deprecated pad_to_max_length arg (#37617) 2025-05-01 15:21:55 +02:00
extended Add Optional to remaining types (#37808) 2025-04-28 14:20:45 +01:00
fixtures Implementation of SuperPoint and AutoModelForKeypointDetection (#28966) 2024-03-19 14:43:02 +00:00
fsdp Fix the fsdp config cannot work issue. (#37549) 2025-04-28 10:44:51 +02:00
generation Add CB (#38085) 2025-05-22 17:43:48 +02:00
models 🔴🔴🔴 [Attention] Refactor Attention Interface for Bart-based Models (#38108) 2025-05-22 17:12:58 +02:00
optimization Use Python 3.9 syntax in tests (#37343) 2025-04-08 14:12:08 +02:00
peft_integration FIX: Faulty PEFT tests (#37757) 2025-04-28 15:10:46 +02:00
pipelines [whisper] move processor test into processor test file 🧹 (#38266) 2025-05-22 10:07:11 +01:00
quantization Add tearDown method to Quark to solve OOM issues (#38234) 2025-05-21 14:26:44 +02:00
repo_utils Simplify soft dependencies and update the dummy-creation process (#36827) 2025-04-11 11:08:36 +02:00
sagemaker Use Python 3.9 syntax in tests (#37343) 2025-04-08 14:12:08 +02:00
tensor_parallel enable misc cases on XPU & use device agnostic APIs for cases in tests (#38192) 2025-05-20 10:09:01 +02:00
tokenization Use Python 3.9 syntax in tests (#37343) 2025-04-08 14:12:08 +02:00
trainer enable misc cases on XPU & use device agnostic APIs for cases in tests (#38192) 2025-05-20 10:09:01 +02:00
utils 🚨🚨[core] Completely rewrite the masking logic for all attentions (#37866) 2025-05-22 11:38:26 +02:00
__init__.py GPU text generation: mMoved the encoded_prompt to correct device 2020-01-06 15:11:12 +01:00
test_backbone_common.py Use Python 3.9 syntax in tests (#37343) 2025-04-08 14:12:08 +02:00
test_configuration_common.py Update composition flag usage (#36263) 2025-04-09 11:48:49 +02:00
test_feature_extraction_common.py Use Python 3.9 syntax in tests (#37343) 2025-04-08 14:12:08 +02:00
test_image_processing_common.py fix multi-image case for llava-onevision (#38084) 2025-05-21 11:50:46 +02:00
test_image_transforms.py Fix pad image transform for batched inputs (#37544) 2025-05-08 10:51:15 +01:00
test_modeling_common.py 🔴🔴🔴 [Attention] Refactor Attention Interface for Bart-based Models (#38108) 2025-05-22 17:12:58 +02:00
test_modeling_flax_common.py Use Python 3.9 syntax in tests (#37343) 2025-04-08 14:12:08 +02:00
test_modeling_tf_common.py Use Python 3.9 syntax in tests (#37343) 2025-04-08 14:12:08 +02:00
test_pipeline_mixin.py Use Python 3.9 syntax in tests (#37343) 2025-04-08 14:12:08 +02:00
test_processing_common.py 🔴 Video processors as a separate class (#35206) 2025-05-12 11:55:51 +02:00
test_sequence_feature_extraction_common.py Use Python 3.9 syntax in tests (#37343) 2025-04-08 14:12:08 +02:00
test_tokenization_common.py 🚨 rm already deprecated pad_to_max_length arg (#37617) 2025-05-01 15:21:55 +02:00
test_training_args.py Fix TrainingArguments.torch_empty_cache_steps post_init check (#36734) 2025-03-17 16:09:46 +01:00
test_video_processing_common.py 🔴 Video processors as a separate class (#35206) 2025-05-12 11:55:51 +02:00