* enable misc test cases on XPU
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* fix style
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* tweak bamba ground truth on XPU
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* remove print
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* one more
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* fix style
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
---------
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* remove it from all py files
* remove it from the doc
* remove it from examples
* style
* remove traces of _fast_init
* Update test_peft_integration.py
* CIs
* enable glm4 integration cases on XPU, set xpu expectation for blip2
Signed-off-by: Matrix YAO <matrix.yao@intel.com>
* more
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* fix style
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* refine wording
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* refine test case names
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* run
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* add gemma2 and chameleon
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* fix review comments
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
---------
Signed-off-by: Matrix YAO <matrix.yao@intel.com>
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* stash for now
* initial commit
* small updated
* up
* up
* works!
* nits and fixes
* don't loop too much
* finish working example
* update
* fix the small freeblocks issue
* feat: stream inputs to continuous batch
* fix: update attn from `eager` to `sdpa`
* refactor: fmt
* refactor: cleanup unnecessary code
* feat: add `update` fn to `PagedAttentionCache`
* feat: broken optimal block size computation
* fix: debugging invalid cache logic
* fix: attention mask
* refactor: use custom prompts for example
* feat: add streaming output
* fix: prefill split
refactor: add doc strings and unsound/redundant logic
fix: compute optimal blocks logic
* fix: send decoded tokens when `prefilling_split` -> `decoding`
* refactor: move logic to appropriate parent class
* fix: remove truncation as we split prefilling anyways
refactor: early return when we have enough selected requests
* feat: add paged attention forward
* push Ggraoh>
* add paged sdpa
* update
* btter mps defaults
* feat: add progress bar for `generate_batch`
* feat: add opentelemetry metrics (ttft + batch fill %age)
* feat: add tracing
* Add cuda graphs (#38059)
* draft cudagraphs addition
* nits
* styling
* update
* fix
* kinda draft of what it should look like
* fixes
* lol
* not sure why inf everywhere
* can generate but output is shit
* some fixes
* we should have a single device synch
* broken outputs but it does run
* refactor
* updates
* updates with some fixes
* fix mask causality
* another commit that casts after
* add error
* simplify example
* update
* updates
* revert llama changes
* fix merge conflicts
* fix: tracing and metrics
* my updates
* update script default values
* fix block allocation issue
* fix prefill split attnetion mask
* no bugs
* add paged eager
* fix
* update
* style
* feat: add pytorch traces
* fix
* fix
* refactor: remove pytorch profiler data
* style
* nits
* cleanup
* draft test file
* fix
* fix
* fix paged and graphs
* small renamings
* cleanups and push
* refactor: move tracing and metrics logic to utils
* refactor: trace more blocks of code
* nits
* nits
* update
* to profile or not to profile
* refactor: create new output object
* causal by default
* cleanup but generations are still off for IDK what reason
* simplifications but not running still
* this does work.
* small quality of life updates
* nits
* updaet
* fix the scheduler
* fix warning
* ol
* fully fixed
* nits
* different generation parameters
* nice
* just style
* feat: add cache memory usage
* feat: add kv cache free memory
* feat: add active/waiting count & req latency
* do the sampling
* fix: synchronize CUDA only if available and improve error handling in ContinuousBatchingManager
* fix on mps
* feat: add dashboard & histogram buckets
* perf: improve waiting reqs data structures
* attempt to compile, but we should only do it on mps AFAIK
* feat: decouple scheduling logic
* just a draft
* c;eanup and fixup
* optional
* style
* update
* update
* remove the draft documentation
* fix import as well
* update
* fix the test
* style doomed
---------
Co-authored-by: Luc Georges <luc.sydney.georges@gmail.com>
* starting attn refactor for encoder decoder models via bart (eager + sdpa)
* flash attention works, remove unnecessary code
* flex attention support for bart!, gotta check if the renaming is not too aggressive
* some comments
* skip flex grad test for standalone as done with the other test
* revert flex attn rename (for now), sdpa simplify, and todos
* more todos
* refactor mask creation for reuse
* modular attempt at biogpt
* first batch of other models
* fix attn dropout
* fix autoformer copies
* hubert
* another batch of models
* copies/style + last round of bart models --> whisper next?
* remove unnecessary _reshape function and remove copy to whisper
* add skip for decoder-only models out of enc-dec (same as in bart)
* bring back licences
* remove comment, added to pr read instead
* mostly docs
* disable sew flex attn as it's unclear attn mask for now
* oops
* test fixes for enc-dec
* torch fx fixes + try at flex attn
* skip on mbart
* some more fixes
* musicgen skip / delete old attn class logic + sdpa compose compile skip
* disable flex attn for musicgen, not worth the effort
* more fixes and style
* flex attention test for dropout and encoder decoder that dont have main input names
* informer fixes
* the weirdest thing I've encountered yet...
* style
* remove empty tensor attempt, found core root in previous commits
* disable time series due to tests being very text centric on inputs
* add speech to text to be ignoring the other attns, also due to tests
* update docs
* remaining issues resolved ?
* update docs for current state --> nllb moe and pegasus x sdpa is questionable :D
* some models have not set the is_causal flag...
* change dtype in softmax tol old behaviour + some modular fixes
* I hate it but it is what it is
* fixes from main for bart
* forgot this one
* some model fixes
* style
* current status
* marian works now
* fixing some copies
* some copy fixes + time series x informer
* last models possibly and fixes on style/copies
* some post merge fixes
* more fixes
* make attention interface callable and move warnings there
* style lol
* add comment to "unsupported"
* remove callable interface and change interface warnings + some copies
* fix
* ternary is ugly af, make it simpler
* how did that happen
* fix flex attn test
* failing the test
* no more fallback! fixing copies next
* style + attn fixed
* fixing copies and mask creation
* wrong copy
* fixup tests and disable flex attn for now
* fixup last tests?
* pipeline generation defaults
* add max_new_tokens=20 in test pipelines
* pop all kwargs that are used to parameterize generation config
* add class attr that tell us whether a pipeline calls generate
* tmp commit
* pt text gen pipeline tests passing
* remove failing tf tests
* fix text gen pipeline mixin test corner case
* update text_to_audio pipeline tests
* trigger tests
* a few more tests
* skips
* some more audio tests
* not slow
* broken
* lower severity of generation mode errors
* fix all asr pipeline tests
* nit
* skip
* image to text pipeline tests
* text2test pipeline
* last pipelines
* fix flaky
* PR comments
* handle generate attrs more carefully in models that cant generate
* same as above
* tmp commit (imports broken)
* working version; update tests
* remove line break
* shorter msg
* dola checks need num_beams=1; other minor PR comments
* update early trainer failing on bad gen config
* make fixup
* test msg
* mvp
* remove trust_remote_code
* generate_from_hub
* handle requirements; docs
* english
* doc PR suggestions
* Apply suggestions from code review
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* changed remote code path to generate/generate.py
* model repo has custom generate -> override base generate
* check for proper inheritance
* some doc updates (missing: tag-related docs)
* update docs to model repo
* nit
* nit
* nits
* Update src/transformers/dynamic_module_utils.py
* Apply suggestions from code review
* Update docs/source/en/generation_strategies.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* trust remote code is required
* use new import utils for requirements version parsing
* use org examples
* add tests
* Apply suggestions from code review
Co-authored-by: Manuel de Prada Corral <6536835+manueldeprada@users.noreply.github.com>
* ascii file structure; tag instructions on readme.md
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Manuel de Prada Corral <6536835+manueldeprada@users.noreply.github.com>
* enable generation fsdp/utils test cases on XPU
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
* fix style
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
* xx
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
* use backend_xx APIs
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
* fix style
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
---------
Signed-off-by: Yao Matrix <matrix.yao@intel.com>
* skip compilation on cpu offload
* add test
* better logic
* docstring
* boolean logic
* add disk offload check
* warn users if compilation options are set but compilation doesn happen
* fix test
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* initial commit
* add convert internvl
* add first end-to-end working internvl
* nit prompt and image proc
* add working chat template
* add conversion llama-based models
* add tests
* pass all tests
* fix isort
* fix modular after main merge
* add video processing for internvl
* add support for interlaced images and videos
* Remove processing and config from modular, add more tests
* add llama model tests
* Modify processor for compatibility with refactored got ocr image processor
* add comments in processor
* Add docs and nits
* change video processing to use custom sample_indices_fn
* rebase and fix tests
* add processor tests
* Add changes Raushan review
* Use the new attention interface for the vision model
* nits
* add support for custom video_load_backend
* remove mention to InternVLTokenizer
* refactor vision model to simplify logic
* refactor processor for better readibility
* fix copies
* fix require av processor test
* refactor internVL vision
* Update processor and fix processing tests
* fix docstring
* update convert_weights for internvl3
* change image processor to fast by default
* remove do_center_crop=True in convert_weights
* force use_cache to True
* push_to_hub before reloading
* fix internVLVision for larger models
* update convert weight for qk norm
* fix convert_weights
* fix eos_token_id in convert
* update docs and integration tests
* make modifs after review
* fix wrong k_norm and reduce modular
* change image_token_index to image_token_id
* change checkpoint to OpenGVLab org
* last nits
* explicitely del self.num_key_value_groups
* add extra special tokens
* use only `xxx_token_id` for multimodal tokens
* update modeling files as well
* fixup
* why fixup doesn't fix modular docstring first?
* janus, need to update configs in the hub still
* last fixup
* Iterative generation using input embeds
* Add Janus model
* discard changes
* Janus imports
* Refactor config and processor
* Added Vision tower of Janus
* Import Janus Image processor
* Vision tower fixes
* Refactor code
* Added VQ Model
* Complete model integration
* temp conversion script
* processor refactor
* Adding files to facilitate pulling
* Fixes after debugging
* Skip test for these models
* Add Janus Model
* discard changes
* Janus imports
* Refactor config and processor
* Added Vision tower of Janus
* Import Janus Image processor
* Vision tower fixes
* Refactor code
* Added VQ Model
* Complete model integration
* temp conversion script
* processor refactor
* Adding files to facilitate pulling
* Fixes after debugging
* Refactor to Text config
* ✨ Added generate function
* Saving intermediate convert file. Still need to read configs from the hub and convert them to our format.
* Adding version that reads from the JSON files. Still have to tweak some parameters manually.
* relative imports
* Initial tests
* Refactor image processor
* Seemingly working version of the conversion script, will need to test further.
* Adding command message
* Fixing conflicting JanusTextConfig class
* Incorporating some of the discussed changes.
* Small fix to create dir.
* Removing system from JINJA template
* Adding draft processor tests
* style fixes
* Minor fixes and enhancement
* added generation config
* Initial tests
* Small modifications, tests are now passing.
* Small changes I noticed while reading code.
* more fixes
* Added JanusModel class
* Small merge adaptations
* Small merge adaptations
* Image processing tests passing
* More tests and fixes
* Convert script updated and refactored
* Tests and cleanup
* make style
* Postprocessing for image generation
* generate refactor
* fixes
* - Passing tests that write a part of the model to cpu (e.g. test_cpu_offload)
- Passing tests of dispatching SDPA
- Only gradient checkpointing tests are left.
* Removing temporary code
* Changes
* Writing change to modular
* Added JanusVisionModel. SDPA dispatch tests pass more robustly. Gradient checkpoint tests are next
* Gradient checkpoint tests passing
* Removing debug code
* Major generate refactor 😮💨
* Temp changes for testing
* Green quality CI
* 2 out of 4 integration tests passing
* breadcrumbs
* Usage Examples
* Regenerate modeling after merge
* dirty code
* JanusIntegrationTest are passing
* breadcrumbs
* happy CI
* fixes
* Changing template
* nits
* Text generation logits matching original codebase at 100% precision
* Remove ./tmp from git tracking
* Remove ./tmp from git tracking
* Checkpointing changes after reviewing
* Fixing code in docstrings
* CHanging comments and small bug in convert file
* Fixing bug in image_token_id for 7B version
* Removing line that was added by both of us
* Pushing changes after discussion. Only one left is to change the key mapping for convert file.
* Updating module file
* New convert file using dict. Tested that it is equivalent to the old one by:
- comparing keys in a script
- comparing checksums of the output files between version generated with the current convert script and those generated with the old script. This is a more reliable test.
* revert changes
* mistake
* consistency change for CI
* make style
* doc fixes
* more fixes
* experimenting with masking out pad token
* checkpoint
* Batched generation with multi-images working for 1B models. Will test 7B next.
* Device fix.
* Writing changes to modular, previous ones were written to modeling just for quick testing.
* Using passed processor attention mask (only in modeling for now)
* Matching performance done in the non-standard way
* Working version of batched generation. Will change how some args are passed to make it more similar to language case
* More compliant version of the code
* Removed duplicated `_prepare_4d_causal_attention_mask_with_cache_position`
* Updating modular file, making masked filling with paddings more efficient
* Slightly more efficient version
* Modifying JanusVisionModel to be a wrapper
* Fixing test to comply with new names
* Modular overhaul
* More refactoring
* - Changing JanusVisionModel back
- Changing forward pass
- Adding boi token to the comparison
* - Removing whole context model_ids
- Using inherited implementation of prepare_inputs_for_generation
* Moving the way boi token is passed to the model
* Fixing sdpa test
* Minor changes
* testing changes
* Minor fix
* - Adding postprocessing test
- checking values of generated image on integration test
* changes
* Removing pooled attention vision module, fixing convert script as a consequence
* More changes
* Fixes
* Draft after merge
* Bug fixes
* More bug fix
* Fixing docs
* Nits
* Refactor return dict
* Moving image post processing test to main processor post process
* Passing guidance_scale as kwarg
* make style
* 🔥 refactor
* make style
* Update and green CI
* Nits and tests update
* up
* Added MID block
* fix
* Dead code
* update testcase
* update
* model_id change
* init_weight changes
---------
Co-authored-by: hsilva664 <metallic-silver@hotmail.com>
* First pass at speech granite
Add encoder / projector, rename things
* Combine into one model file with causal lm outputs for forward
* Add loss calc
* Fix config loading
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
* Split new / old loading logic
* Use transformers integration for loading peft adapters
* Add generation wrapper for selective lora enablement
* Add note for qformer encoder automodel
* Guard torch/audio imports in feature extractor
* Handle granite speech autoclasses
* Handle optional deps in package structure for granite speech
* Add granite pretrained model def for init
* Add dummy objects for torch/torchaudio
* Add tests for granite speech processor
* Minor formatting fixes and refactoring
* Add options for falling back to config in forward
* Tentative model docstrings for granite speech
* Fix config type
* Remove legacy load
* Allow non-lora variants for granite speech
* Override weight tying for llm
* Use text config instead of llm config
* Add output embeddings getter to fix weight tying
* Fix relative imports
* computing the number of audio features, based on the raw audio sequence.
* collating audio inputs, and keeping the original lengths.
* asserted we have text. otherwise we can't specify the audio special token.
* assering the number of audio-symbols/audios match correctly.
running get validated_audios only when audio is present
* indentation bugfix + supporting different feature lengths when expanding audio.
* redundant, done in _get_validated_text
* adapting the tests:
- we must have text (not either audio or text)
- _get_num_audio_features takes a list of raw lengths, provided it insetad.
* Minor cleanup, remove unused import
* Add more tests for batch feature processing
* Allow setting offset in rel position embeddings
* Add config option for warning if peft is not installed w/ lora
* Port blip2 qformer code into granite speech
* Add sad test for numpy arr processing
* Allow numpy arrays / tuples in granite speech processor
* Fix config type for projector
* - pad instead of creating a zeros tensor, to keep the original dtype/device (support bfloat16)
- cast input_features to the model dtype (support bfloat16)
* merge Blip2QFormerConfig to GraniteSpeechProjectorConfig
* prevent a crash when re-saving/loading the model (line 109)
* consider additional edge cases during preprocessing.
* consider additional edge cases during preprocessing.
* add features mask for batched inference (bugfix)
* Minor refactor, remove multiaudio processor tests
* Add set input/output embeddings for granite speech
* Fix feature dim check in processor test
* Pop input features in embed test for granite speech
* Small fixes for test edge cases
Add granite speech to seq2seq causal lm mapping names
* Add small tests for granite speech model
* Fix data parallelism test
* Standardize model class names
* Fix check for copies
* Fix misaligned init check
* Skip granite speech in checkpoint check
* Use default for tie_word_embeddings in granite speech
* Fix non documentation granite speech repo issues
* Fix comments and docstring checks
* Add placeholder docs for granite speech
* Fix test naming collision
* Code formatting
* Rerun torch dummy obj regen
* Fix save pretrained for granite speech
* Import sorting
* Fix tests typo
* Remove offset hack
* Pass args through encoder config
* Remove unused prune heads from blip2
* removing einsum. replaced with explicit multiplication (relative positional encodings) and sdpa attention.
* remove Sequential from ConformerFeedForward and ConformerConvModule. + fix for sdpa attention
* remove GraniteSpeechConformerScale
* rename to hidden_states
* rename conformer layers to self.layers, remove the first linear from the list to keep the list homogenous.
* move pre-norm to the attention/feedforward blocks (avoid complex module wrapping)
* adding pre_norm into forward
* feature extractor refactoring to resemble how it's done in phi4multimodal.
* rename feature_extractor to audio_processor
* bugfix: input_feature_mask fix to get the exact number tokens.
* Fix pytest decorator in processor test
* Add (disabled) integration tests for granite speech
* Fix handling of optional feature masking
* Loosen validation in processing for vLLM compatability
* Formatting fixes
* Update init structure to mirror llama
* Make granite speech projector generic
* Update test config to reflect generic projector
* Formatting fixes
* Fix typos, add license
* Fix undefined var in input processing
* Cleanup and expose ctc encoder
* Add missing config docstrings
* Better var names, type hints, etc
* Set attn context size in init
* Add max pos emb to encoder config
* Cleanup feature extractor
* Add granite speech architecture details
* Remove granite speech qformer ref
* Add paper link, explicit calc for qkv
* Calculate padding directly in depthwise conv1d init
* Raise value error instead of asserting
* Reorder class defs (classes used at top)
* Precompute relpos distances
* Run formatting
* Pass attention distances through forward
* Apply suggestions from code review
Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
* Add todo for using common batch feature extraction
* Rename audios/features
* Ensure chat template may be provided to processor
* Move granite speech docs to audio models
* Add todos for input proc refactoring
* Fix import order
* Guard torch import
* Use relative imports
* Require torch backend for processor in granite speech
* Add backend guards in feature extractor
---------
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
Co-authored-by: Avihu Dekel <avihu.dekel@ibm.com>
Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
* Remove unnecessary masked_fill in deberta models
* Enable some code when exporting but not compiling
* add missing import
* style
* replace if by torch.cond
* style
* use numel
* style
* add unit tests
* style
* change empty value for dynamic cache
* replace != [] by numel()
* fix import issue
* style
* fix XPU UT error case brough by RNG difference btw XPU and CUDA
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* enable tests/models/llama/test_modeling_llama.py::LlamaIntegrationTest::test_model_7b_logits and tests/models/llama/test_modeling_llama.py::LlamaIntegrationTest::test_model_7b_logits_bf16 on xpu
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* Revert "enable tests/models/llama/test_modeling_llama.py::LlamaIntegrationTest::test_model_7b_logits and tests/models/llama/test_modeling_llama.py::LlamaIntegrationTest::test_model_7b_logits_bf16 on xpu"
This reverts commit 3ef83a4f02.
---------
Signed-off-by: YAO Matrix <matrix.yao@intel.com>