* stash for now
* initial commit
* small updated
* up
* up
* works!
* nits and fixes
* don't loop too much
* finish working example
* update
* fix the small freeblocks issue
* feat: stream inputs to continuous batch
* fix: update attn from `eager` to `sdpa`
* refactor: fmt
* refactor: cleanup unnecessary code
* feat: add `update` fn to `PagedAttentionCache`
* feat: broken optimal block size computation
* fix: debugging invalid cache logic
* fix: attention mask
* refactor: use custom prompts for example
* feat: add streaming output
* fix: prefill split
refactor: add doc strings and unsound/redundant logic
fix: compute optimal blocks logic
* fix: send decoded tokens when `prefilling_split` -> `decoding`
* refactor: move logic to appropriate parent class
* fix: remove truncation as we split prefilling anyways
refactor: early return when we have enough selected requests
* feat: add paged attention forward
* push Ggraoh>
* add paged sdpa
* update
* btter mps defaults
* feat: add progress bar for `generate_batch`
* feat: add opentelemetry metrics (ttft + batch fill %age)
* feat: add tracing
* Add cuda graphs (#38059)
* draft cudagraphs addition
* nits
* styling
* update
* fix
* kinda draft of what it should look like
* fixes
* lol
* not sure why inf everywhere
* can generate but output is shit
* some fixes
* we should have a single device synch
* broken outputs but it does run
* refactor
* updates
* updates with some fixes
* fix mask causality
* another commit that casts after
* add error
* simplify example
* update
* updates
* revert llama changes
* fix merge conflicts
* fix: tracing and metrics
* my updates
* update script default values
* fix block allocation issue
* fix prefill split attnetion mask
* no bugs
* add paged eager
* fix
* update
* style
* feat: add pytorch traces
* fix
* fix
* refactor: remove pytorch profiler data
* style
* nits
* cleanup
* draft test file
* fix
* fix
* fix paged and graphs
* small renamings
* cleanups and push
* refactor: move tracing and metrics logic to utils
* refactor: trace more blocks of code
* nits
* nits
* update
* to profile or not to profile
* refactor: create new output object
* causal by default
* cleanup but generations are still off for IDK what reason
* simplifications but not running still
* this does work.
* small quality of life updates
* nits
* updaet
* fix the scheduler
* fix warning
* ol
* fully fixed
* nits
* different generation parameters
* nice
* just style
* feat: add cache memory usage
* feat: add kv cache free memory
* feat: add active/waiting count & req latency
* do the sampling
* fix: synchronize CUDA only if available and improve error handling in ContinuousBatchingManager
* fix on mps
* feat: add dashboard & histogram buckets
* perf: improve waiting reqs data structures
* attempt to compile, but we should only do it on mps AFAIK
* feat: decouple scheduling logic
* just a draft
* c;eanup and fixup
* optional
* style
* update
* update
* remove the draft documentation
* fix import as well
* update
* fix the test
* style doomed
---------
Co-authored-by: Luc Georges <luc.sydney.georges@gmail.com>
* starting attn refactor for encoder decoder models via bart (eager + sdpa)
* flash attention works, remove unnecessary code
* flex attention support for bart!, gotta check if the renaming is not too aggressive
* some comments
* skip flex grad test for standalone as done with the other test
* revert flex attn rename (for now), sdpa simplify, and todos
* more todos
* refactor mask creation for reuse
* modular attempt at biogpt
* first batch of other models
* fix attn dropout
* fix autoformer copies
* hubert
* another batch of models
* copies/style + last round of bart models --> whisper next?
* remove unnecessary _reshape function and remove copy to whisper
* add skip for decoder-only models out of enc-dec (same as in bart)
* bring back licences
* remove comment, added to pr read instead
* mostly docs
* disable sew flex attn as it's unclear attn mask for now
* oops
* test fixes for enc-dec
* torch fx fixes + try at flex attn
* skip on mbart
* some more fixes
* musicgen skip / delete old attn class logic + sdpa compose compile skip
* disable flex attn for musicgen, not worth the effort
* more fixes and style
* flex attention test for dropout and encoder decoder that dont have main input names
* informer fixes
* the weirdest thing I've encountered yet...
* style
* remove empty tensor attempt, found core root in previous commits
* disable time series due to tests being very text centric on inputs
* add speech to text to be ignoring the other attns, also due to tests
* update docs
* remaining issues resolved ?
* update docs for current state --> nllb moe and pegasus x sdpa is questionable :D
* some models have not set the is_causal flag...
* change dtype in softmax tol old behaviour + some modular fixes
* I hate it but it is what it is
* fixes from main for bart
* forgot this one
* some model fixes
* style
* current status
* marian works now
* fixing some copies
* some copy fixes + time series x informer
* last models possibly and fixes on style/copies
* some post merge fixes
* more fixes
* make attention interface callable and move warnings there
* style lol
* add comment to "unsupported"
* remove callable interface and change interface warnings + some copies
* fix
* ternary is ugly af, make it simpler
* how did that happen
* fix flex attn test
* failing the test
* no more fallback! fixing copies next
* style + attn fixed
* fixing copies and mask creation
* wrong copy
* fixup tests and disable flex attn for now
* fixup last tests?
* docs(swin): Update Swin model card to standard format
* docs(swin): Refine link to Microsoft organization for Swin models
Apply suggestion from @stevhliu in PR #37628.
This change updates the link pointing to the official Microsoft Swin Transformer checkpoints on the Hugging Face Hub.
The link now directs users specifically to the Microsoft organization page, filtered for Swin models, providing a clearer and more canonical reference compared to the previous general search link.
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs(swin): Clarify padding description and link to backbone docs
Apply suggestion from @stevhliu in PR #37628.
This change introduces two improvements to the Swin model card:
1. Refines the wording describing how Swin handles input padding for better clarity.
2. Adds an internal documentation link to the general "backbones" page when discussing Swin's capability as a backbone model.
These updates enhance readability and improve navigation within the Transformers documentation.
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* docs(swin): Change Swin paper link to huggingface.co/papers as suggested
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* update model card.
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* update quantization example.
* update example.
* update
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* assign the correct data layout for xpu
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* check torch version before using torchao xpu
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* fix the log
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* fix zero point type
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* fix check torch version
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
---------
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
* _get_padding_size module
* do not patchify images when processing multi image
* modify llava onevision image processor fast
* tensor to list of tensors
* backward compat
* reuse pad_to_square in llave & some clarification
* add to doc
* fix: consider no image cases (text only or video)
* add integration test
* style & repo_consistency
* accept custom device_mesh
* fix device_map
* assert that num_heads % tp_size == 0
* todo.
* ReplicateParallel
* handle tied weights
* handle dtensor in save_pretrained with safe_serialization
* tp test works
* doesnt work
* fix shard_and_distribute_module's rank should be local_rank
* tp=4 is correct
* dp+tp is broken
* todo allreduce with dtensors on another dim is annoying
* workaround to sync dp grads when using dtensors
* loading a checkpoint works
* wandb and compare losses with different tp/dp
* cleaning
* cleaning
* .
* .
* logs
* CP2 DP2 no mask works after commenting attn_mask and is_causal from scaled_dot_product_attention
* DP=2 TP=2 now works even with tied embeddings
* model.parameters() and model.module.parameters() are empty..
* reformat sanity_check_tensor_sync
* set atol=1e-4 for CP to pass
* try populate _parameters from named_modules
* refactors
TP2 DP2 works
CP2 DP2 works
* is_causal=True and pack sequences, no attn mask, and preshuffle dataset
* fix packing
* CP=4 doesn't work
* fix labels and position_ids for CP
* DP CP works with transformers 🥳🥳🥳
* refactor
* add example cp
* fixup
* revert sdpa changes
* example cleared
* add CP, DP to the mesh init
* nit
* clean
* use `ALL_PARALLEL_STYLES`
* style
* FSDP works
* log on 1 rank
* .
* fix?
* FSDP1 also has .parameters() bug
* reported gradnorm when using FSDP1 is wrong, but loss is correct so it's okay
* .
* style and fixup
* move stuff around
* fix tests
* style
* let's make it a check
* add missing licences
* warning should be an info
* tp plan should not be NONE
* test all
* god damn it
* test all
---------
Co-authored-by: nouamanetazi <nouamane98@gmail.com>
* add seq_idx and fa kwargs
* update tests
* docs and grad ckpt support
* fmt
* better names
* test_raise_missing_padding_free_kwarg_errs
* + seq_idx in doc strings
* padding free training docs
* add link to pr plots
* raise err on attn_mask with padding free
* rm raising missing padding free err test
* BambaFlashAttentionKwargs
* run modular util for modular_granitemoehybrid.py
* accept custom device_mesh
* fix device_map
* assert that num_heads % tp_size == 0
* todo.
* ReplicateParallel
* handle tied weights
* handle dtensor in save_pretrained with safe_serialization
* tp test works
* doesnt work
* fix shard_and_distribute_module's rank should be local_rank
* tp=4 is correct
* dp+tp is broken
* todo allreduce with dtensors on another dim is annoying
* workaround to sync dp grads when using dtensors
* loading a checkpoint works
* wandb and compare losses with different tp/dp
* cleaning
* cleaning
* .
* .
* logs
* CP2 DP2 no mask works after commenting attn_mask and is_causal from scaled_dot_product_attention
* DP=2 TP=2 now works even with tied embeddings
* model.parameters() and model.module.parameters() are empty..
* reformat sanity_check_tensor_sync
* set atol=1e-4 for CP to pass
* try populate _parameters from named_modules
* refactors
TP2 DP2 works
CP2 DP2 works
* is_causal=True and pack sequences, no attn mask, and preshuffle dataset
* fix packing
* CP=4 doesn't work
* fix labels and position_ids for CP
* DP CP works with transformers 🥳🥳🥳
* refactor
* add example cp
* fixup
* revert sdpa changes
* example cleared
* add CP, DP to the mesh init
* nit
* clean
* use `ALL_PARALLEL_STYLES`
* style
* FSDP works
* log on 1 rank
* .
* fix?
* FSDP1 also has .parameters() bug
* reported gradnorm when using FSDP1 is wrong, but loss is correct so it's okay
* .
* style and fixup
* move stuff around
* fix tests
* style
* let's make it a check
* warning should be an info
---------
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>