* First pass at speech granite
Add encoder / projector, rename things
* Combine into one model file with causal lm outputs for forward
* Add loss calc
* Fix config loading
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
* Split new / old loading logic
* Use transformers integration for loading peft adapters
* Add generation wrapper for selective lora enablement
* Add note for qformer encoder automodel
* Guard torch/audio imports in feature extractor
* Handle granite speech autoclasses
* Handle optional deps in package structure for granite speech
* Add granite pretrained model def for init
* Add dummy objects for torch/torchaudio
* Add tests for granite speech processor
* Minor formatting fixes and refactoring
* Add options for falling back to config in forward
* Tentative model docstrings for granite speech
* Fix config type
* Remove legacy load
* Allow non-lora variants for granite speech
* Override weight tying for llm
* Use text config instead of llm config
* Add output embeddings getter to fix weight tying
* Fix relative imports
* computing the number of audio features, based on the raw audio sequence length (see the sketch after this section).
* collating audio inputs, and keeping the original lengths.
* assert that we have text; otherwise we can't specify the audio special token.
* asserting that the number of audio symbols and audios match correctly.
* running get_validated_audios only when audio is present
* indentation bugfix + supporting different feature lengths when expanding audio.
* redundant, done in _get_validated_text
* adapting the tests:
- we must have text (not either audio or text)
- _get_num_audio_features takes a list of raw lengths, so we provide it instead.
* Minor cleanup, remove unused import
* Add more tests for batch feature processing
* Allow setting offset in rel position embeddings
* Add config option for warning if peft is not installed w/ lora
* Port blip2 qformer code into granite speech
* Add sad-path test for numpy array processing
* Allow numpy arrays / tuples in granite speech processor
* Fix config type for projector
* pad instead of creating a zeros tensor, to keep the original dtype/device (support bfloat16)
- cast input_features to the model dtype (support bfloat16)
* merge Blip2QFormerConfig into GraniteSpeechProjectorConfig
* prevent a crash when re-saving/loading the model (line 109)
* consider additional edge cases during preprocessing.
* add features mask for batched inference (bugfix)
* Minor refactor, remove multiaudio processor tests
* Add set input/output embeddings for granite speech
* Fix feature dim check in processor test
* Pop input features in embed test for granite speech
* Small fixes for test edge cases
* Add granite speech to seq2seq causal lm mapping names
* Add small tests for granite speech model
* Fix data parallelism test
* Standardize model class names
* Fix check for copies
* Fix misaligned init check
* Skip granite speech in checkpoint check
* Use default for tie_word_embeddings in granite speech
* Fix non documentation granite speech repo issues
* Fix comments and docstring checks
* Add placeholder docs for granite speech
* Fix test naming collision
* Code formatting
* Rerun torch dummy obj regen
* Fix save pretrained for granite speech
* Import sorting
* Fix tests typo
* Remove offset hack
* Pass args through encoder config
* Remove unused prune heads from blip2
* removing einsum. replaced with explicit multiplication (relative positional encodings) and sdpa attention.
* remove Sequential from ConformerFeedForward and ConformerConvModule + fix for sdpa attention
* remove GraniteSpeechConformerScale
* rename to hidden_states
* rename conformer layers to self.layers, remove the first linear from the list to keep the list homogenous.
* move pre-norm to the attention/feedforward blocks (avoid complex module wrapping)
* adding pre_norm into forward
* feature extractor refactoring to resemble how it's done in phi4multimodal.
* rename feature_extractor to audio_processor
* bugfix: input_feature_mask fix to get the exact number of tokens.
* Fix pytest decorator in processor test
* Add (disabled) integration tests for granite speech
* Fix handling of optional feature masking
* Loosen validation in processing for vLLM compatibility
* Formatting fixes
* Update init structure to mirror llama
* Make granite speech projector generic
* Update test config to reflect generic projector
* Formatting fixes
* Fix typos, add license
* Fix undefined var in input processing
* Cleanup and expose ctc encoder
* Add missing config docstrings
* Better var names, type hints, etc
* Set attn context size in init
* Add max pos emb to encoder config
* Cleanup feature extractor
* Add granite speech architecture details
* Remove granite speech qformer ref
* Add paper link, explicit calc for qkv
* Calculate padding directly in depthwise conv1d init
* Raise value error instead of asserting
* Reorder class defs (classes used at top)
* Precompute relpos distances
* Run formatting
* Pass attention distances through forward
* Apply suggestions from code review
Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
* Add todo for using common batch feature extraction
* Rename audios/features
* Ensure chat template may be provided to processor
* Move granite speech docs to audio models
* Add todos for input proc refactoring
* Fix import order
* Guard torch import
* Use relative imports
* Require torch backend for processor in granite speech
* Add backend guards in feature extractor
---------
Signed-off-by: Alex-Brooks <Alex.brooks@ibm.com>
Co-authored-by: Avihu Dekel <avihu.dekel@ibm.com>
Co-authored-by: eustlb <94853470+eustlb@users.noreply.github.com>
* update for fixes
* more fixes
* fix dynamic cache?
* style
* fix both training and generating. Eager seems alright
* dynamic does not work
* fix most cases, use_cache or not, eager or not, no default cache (e.g. not training but you want to get cache states)
* should be final fixes
* fix more stuff no cat
* style
* fix
* style
* final style
* quality
* fix
* revert
* Initial commit for Qwen3
* fix and add tests for qwen3 & qwen3_moe
* rename models for tests.
* fix
* fix
* fix and add docs.
* fix model name in docs.
* simplify modular and fix configuration issues
* Fix the red CI: ruff was updated
* revert ruff, version was wrong
* fix qwen3moe.
* fix
* make sure the MoE variant can load (see the loading sketch after this section)
* fix copies
---------
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
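As a usage note for the loading check above, a hedged sketch of loading the new classes through the auto mappings; the checkpoint id below is a placeholder, not a real release:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-MoE-placeholder"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # dispatches to the Qwen3 MoE class via the auto mapping

inputs = tokenizer("Hello", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```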
* Use `deformable_detr` kernel from the Hub
Remove the `deformable_detr` kernel from `kernels/` and use the
pre-built kernel from the Hub instead (see the sketch after this section).
* Add license header
* Add `kernels` as an extra `hub-kernels`
Also add it to `testing`, so that the kernel replacement gets tested
when using CUDA in CI.
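A sketch of what the Hub-kernel path looks like with the `kernels` package (installed via the new `hub-kernels` extra); the repo id below is illustrative, not necessarily the one transformers resolves to:

```python
from kernels import get_kernel

# hypothetical repo id; the real kernel lives in a community repo on the Hub
deformable_detr = get_kernel("kernels-community/deformable-detr")
# the returned module exposes the compiled ops previously vendored under kernels/
```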
* draft of model tracer visualiser
* add context manager in addition to decorator (see the hook sketch after this section)
* add debug utils to init
* move model debugging utils to dedicated file
* add documentation
* protect some imports
* format
* move and protect imports
* format
* doc: improve errors in case of broken dummy imports.
* format
* use automatic torch backend
* update doc
* fix backend
* (TEMP) move to dummies while waiting on the backend
* update documentation
* doc
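The tracer commits above boil down to hooking every submodule's forward call. A generic sketch of the idea, not the actual transformers implementation:

```python
from contextlib import contextmanager
import torch

@contextmanager
def trace_forward(model: torch.nn.Module):
    """Log every submodule call with its output shape while the context is open."""
    def hook(module, args, output):
        desc = tuple(output.shape) if isinstance(output, torch.Tensor) else type(output).__name__
        print(f"{module.__class__.__name__}: {desc}")
    handles = [m.register_forward_hook(hook) for m in model.modules()]
    try:
        yield model
    finally:
        for handle in handles:
            handle.remove()  # always detach the hooks, even if forward raised

with trace_forward(torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU())) as m:
    m(torch.randn(1, 4))
```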
* Fix converter
* [Broken] Adds Gemma 3 to Hugging Face Transformers
* Consolidating Config and Processor params across impls
* Sorting out configuration parameters. Adds qk_norm before RoPE. Still not sure if RoPE is right.
* Additional plumbing for CausalLM and ConditionalGeneration variants
* incomplete draft of Orbax conversion script
* More complete checkpoint conversion
* Supporting Gemma 3 1B checkpoints
* Updating RoPE for multiple frequencies
* Adjustments to rotary embedder
* Proof of life for text-only operation
* Updating the conversion script to handle multimodal projection weights
* Fixing text-only conversions
* Cleaner conversion script with multimodal support and a simpler processor
* Additional refactors to the Gemma3Processor
* Simplified Processor to work over text representations
* Updated conversion script to join text and vision embeddings at conversion time
* Logging for debugging
* Update src/transformers/models/gemma2/modeling_gemma2.py
Co-authored-by: Joshua Lochner <admin@xenova.com>
* Removed extraneous Config params
* Switching to fast tokenizer for checkpoint conversions
* isolating siglip for performance testing
* Minor changes for debugging tests against baselines
* Adding average pooling for soft tokens
* Updating processor code to enable simpler embedding interleaving for arbitrary number of images in prompts
* Updating conversion script for ShieldGemma 2 conversion compatibility
* Allow disable_compile to be provided as a kwarg
* Refresh from modular
* Updated conversion script and corrected sliding window
* Fix type mismatch in cache_position (#4)
* Fix dtype (#5)
* Fix type mismatch in cache_position
* Actually fix in the modular file
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
---------
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
* fixes for embedding table overflow and missing image_soft_token_mask from Gemma3Processor
* Adding 2D pooling for image embeddings
* Revert "Adding 2D pooling for image embeddings"
This reverts commit 65350cf531.
* Gemma3 average pooling changed from 1D to 2D (see the pooling sketch after this section)
* Major refactor to Gemma3MultimodalInputProjection
* Updating Gemma 3 Auto* registrations
* Add option to save Gemma 3 chat template with tokenizer during weights conversion
* Removing unused imports
* Moving out-of-vocab handling from Gemma3Processor to Gemma3ForConditionalGeneration
* Removing duplicate config property
* Removing final logit softcapping and 1-indexing of position ids
* Fixing image processor config and none --> None typo
* Fixing sliding window size for 1B
* Updating image_mean and image_std in Image Processor
* Attention masking changed to lower triangular
* Moving image special tokens to conversion script
* Mirror image processor defaults from conversion script into Gemma3ProcessorKwargs
* Remove special token variables from symbol space
* Moving image soft token mask computation from Gemma3Processor to Gemma3ForConditionalGeneration
* tie lm_head and embedding weights
Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
* Correct tied weights in Gemma3CausalLM
* iterative bidirectional attention
* resolving merge conflicts
* Reverting to Gemma 2 HybridCache with sliding window support and a sliding_window_pattern of 6
* Correcting RoPE scaling
* clean up first pass, dummy model generation works
* final clean up before fixing tests
* causal lm test works, so fine
* Fix conversion
* Update src/transformers/models/gemma3/processing_gemma3.py
* model tests are happy
* processor tests are happy
* image processing tests added
* fixup
* Fix pre-processing in conversion
* Inputs merging
* Do not normalize vision embeddings
* Apply Ryan's (and team) changes to attention
* token type ids + mask
* template
* move embed scale, add rope scale, fix tests
* Add chat template to tokenizer
* Use prefix for causal model loading
* use existing code for sliding mask from gemma2
* self.embed_tokens already normalizes
* Correcting Gemma3TextConfig parameters in conversion script
* typo, modular overwrites my fixes
* enable device map for text model
* Conversion updates
* ultra nit: no einsums
* update image token
* copy deepcopy config + some docs
* add some test, still WIP
* Refactoring --include_chat_template logic in converter
* Update src/transformers/models/gemma3/modular_gemma3.py
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Add eos tokens for instruct models
* dump so i can work on dgx
* Removing add_bos by default
* dump
* add fast im proc
* docs for PaS + fixup
* another fixup
* one more fixup
* fix tests
* Inverting prior BOS change
* ultra nit
* Reverting to Tokenizer saved with add_bos_token=True and chat template starting with BOS
* resize embeds, remove sqrt, add slow test outputs
* FA2 but quality is meh
* nit
* skip FA2, no idea what happened
* last bit for green CI
* please, green CI for docs
* T_T
* Fix for Gemma3 logits
* Support both options for system prompt
* Update src/transformers/models/gemma3/image_processing_gemma3_fast.py
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update docs/source/en/model_doc/gemma3.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update docs/source/en/model_doc/gemma3.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update docs/source/en/model_doc/gemma3.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update docs/source/en/model_doc/gemma3.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Update docs/source/en/model_doc/gemma3.md
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
* Docs updates now that assets are live
* Style fixes
---------
Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
Co-authored-by: Mayank Chaturvedi <imayank@google.com>
Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
Co-authored-by: raushan <raushan@huggingface.co>
Co-authored-by: Raushan Turganbay <raushan.turganbay@alumni.nu.edu.kz>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: Lysandre <hi@lysand.re>
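On the 1D-to-2D pooling change for image soft tokens: the idea is to pool over the spatial patch grid rather than the flattened token sequence. A sketch with made-up shapes, not Gemma 3's real configuration:

```python
import torch
import torch.nn as nn

batch, grid, dim = 1, 16, 32          # hypothetical 16x16 patch grid and hidden size
tokens_1d = torch.randn(batch, grid * grid, dim)

# reshape flat vision tokens back to the 2D grid, pool 4x4 windows, flatten again
tokens_2d = tokens_1d.transpose(1, 2).reshape(batch, dim, grid, grid)
pooled = nn.AvgPool2d(kernel_size=4, stride=4)(tokens_2d)   # (batch, dim, 4, 4)
soft_tokens = pooled.flatten(2).transpose(1, 2)             # (batch, 16, dim)
print(soft_tokens.shape)
```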
* add init and base image processing functions
* add add_fast_image_processor to transformers-cli
* add working fast image processor clip
* add fast image processor to doc, working tests
* remove "to be implemented" SigLip
* fix unprotected import
* fix unprotected vision import
* update ViTImageProcessorFast
* increase threshold for slow/fast equivalence
* add fast img blip
* add fast class in tests with cli
* improve cli
* add fast image processor convnext
* add LlavaPatchingMixin and fast image processor for llava_next and llava_onevision
* add device kwarg to ImagesKwargs for fast processing on cuda
* cleanup
* fix unprotected import
* group images by sizes and add batch processing (see the grouping sketch after this section)
* Add batch equivalence tests, skip when center_crop is used
* cleanup
* update init and cli
* fix-copies
* refactor convnext, cleanup base
* fix
* remove patching mixins, add piped torchvision transforms for ViT
* fix unbatched processing
* fix f-strings
* protect imports
* change llava onevision to class transforms (test)
* fix convnext
* improve formatting (following Pavel review)
* fix handling device arg
* improve cli
* fix
* fix inits
* Add distinction between preprocess and _preprocess, and support for arbitrary kwargs through valid_extra_kwargs
* uniformize qwen2_vl fast
* fix docstrings
* add fast image processor llava
* remove min_pixels max_pixels from accepted size
* nit
* nit
* refactor fast image processors docstrings
* cleanup and remove fast class transforms
* update add_fast_image_processor in transformers-cli
* cleanup docstring
* uniformize pixtral fast and make _process_image explicit
* fix prepare image structure llava next/onevision
* Use typed kwargs instead of explicit args
* nit fix import Unpack
* clearly separate pops and gets in base preprocess. Use explicit typed kwargs
* make qwen2_vl preprocess arguments hashable
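The batching trick behind "group images by sizes" is to stack same-shaped images so each group runs through one batched op. An illustrative sketch; the real helpers live elsewhere in the fast image processing utilities:

```python
from collections import defaultdict
import torch

def group_images_by_shape(images):
    """Stack same-shaped images so each group can be processed in one batched op."""
    grouped = defaultdict(list)
    index = []  # (shape, position-in-group) per input image, to restore order later
    for image in images:
        grouped[image.shape].append(image)
        index.append((image.shape, len(grouped[image.shape]) - 1))
    return {shape: torch.stack(group) for shape, group in grouped.items()}, index

images = [torch.randn(3, 224, 224), torch.randn(3, 336, 336), torch.randn(3, 224, 224)]
batched, index = group_images_by_shape(images)
# run e.g. one batched resize per group, then reorder outputs through `index`
```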
* initial commit
* encoder+decoder layer changes WIP
* architecture checks
* working version of detection + segmentation
* fix modeling outputs
* fix return dict + output att/hs
* found the position embedding masking bug
* pre-training version
* added image processors
* typo in init.py
* iterupdate set to false
* fixed num_labels in class_output linear layer bias init
* multihead attention shape fixes
* test improvements
* test update
* dab-detr model_doc update
* dab-detr model_doc update2
* test fix: test_retain_grad_hidden_states_attentions
* config file clean and renaming variables
* config file clean and renaming variables fix
* updated convert_to_hf file
* small fixes
* style and quality checks
* return_dict fix
* Merge branch main into add_dab_detr
* small comment fix
* skip test_inputs_embeds test
* image processor updates + image processor test updates
* check copies test fix update
* updates for check_copies.py test
* updates for check_copies.py test2
* tied weights fix
* fixed image processing tests and fixed shared weights issues
* added numpy nd array option to get_Expected_values method in test_image_processing_dab_detr.py
* delete prints from test file
* SafeTensor modification to solve HF Trainer issue
* removing the safetensor modifications
* make fix-copies and HF upload have been added.
* fixed index.md
* fixed repo consistency
* style fix and DabDetrImageProcessor docstring update
* requested modifications after the first review
* Update src/transformers/models/dab_detr/image_processing_dab_detr.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* repo consistency has been fixed
* update copied NestedTensor function after main merge
* Update src/transformers/models/dab_detr/modeling_dab_detr.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* temp commit
* temp commit2
* temp commit 3
* unit tests are fixed
* fixed repo consistency
* updated expected_boxes variable values based on related notebook results in DABDETRIntegrationTests file.
* temporary config modifications and repo consistency fixes
* Put dilation parameter back to config
* pattern embeddings have been added to the rename_keys method
* add dilation comment to config + add as an exception in check_config_attributes SPECIAL CASES
* delete FeatureExtractor part from docs.md
* requested modifications in modeling_dab_detr.py
* [run_slow] dab_detr
* deleted last segmentation code part, updated conversion script and changed the hf path in test files
* temp commit of requested modifications
* temp commit of requested modifications 2
* updated config file, resolved codepaths and refactored conversion script
* updated decoder layer block types and refactored conversion script
* style and quality update
* small modifications based on the request
* attentions are refactored
* removed loss functions from modeling file, added loss function to lossutils, tried to move the MLP layer generation to config but it failed
* deleted imageprocessor
* fixed conversion script + quality and style
* fixed config_att
* [run_slow] dab_detr
* changing model path in conversion file and in test file
* fix Decoder variable naming
* testing the old loss function
* switched back to the new loss function and testing with the old attention functions
* switched back to the new last good result modeling file
* moved back to the version when I asked the review
* missing new line at the end of the file
* old version test
* turn back to newest model version but change image processor
* style fix
* style fix after merge main
* [run_slow] dab_detr
* [run_slow] dab_detr
* added device and type for head bias data part
* [run_slow] dab_detr
* fixed model head bias data fill
* changed test_inference_object_detection_head assertTrues to torch.testing.assert_close (see the sketch after this section)
* fixes part 1
* quality update
* self.bbox_embed in decoder has been restored
* changed assertTrue(torch.allclose(...)) checks to torch.testing.assert_close
* modelcard markdown file has been updated
* deleted intermediate list from decoder module
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
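On the assertion change in the integration tests: torch.testing.assert_close reports the actual mismatch on failure, unlike a bare assertTrue(torch.allclose(...)). A minimal before/after sketch with made-up box values:

```python
import torch

expected_boxes = torch.tensor([[0.25, 0.55, 0.30, 0.40]])
predicted_boxes = expected_boxes + 1e-5  # stand-in for model output

# before: self.assertTrue(torch.allclose(predicted_boxes, expected_boxes, atol=1e-4))
torch.testing.assert_close(predicted_boxes, expected_boxes, rtol=1e-4, atol=1e-4)
```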