* fix FA2
* update is causal flag and remove mask for FA2
* update for FA2 with varlen path
* how the tests were passing with different devices?
* add comment and ref to the PR
* move mask preparation to base pretrained model
* seq len is the first dim, not second
* fix copies to fix GLM4V
* deprecate for 1 version
* style
* fix some tests
* fix esm
* skip for now, GC requires positional args but we have keyword args
* remove transpose for scores in modified models only
* skip fx trace tests
* remove the skips
* fix the epsilon to a small value (does not make sense otherwise)
* safeguard
* overload test_eager_matches_sdpa
* Update test_modeling_common.py
* skip appropriate tests
* correct no_split_layer
* fix all devices issue
* fix backward
* fix
TST Fix PEFT integration test bitsandbytes config
The PEFT integration tests still used load_in_{4,8}_bit, which is
deprecated; move to properly setting BitsAndBytesConfig instead. For 4-bit,
also ensure that nf4 is used to prevent
> RuntimeError: quant_type must be nf4 on CPU, got fp4
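A minimal sketch of the updated setup, assuming a tiny placeholder checkpoint (not necessarily the one used in the actual tests): the quantization options are passed through an explicit BitsAndBytesConfig with nf4 instead of the deprecated load_in_4bit argument on from_pretrained.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Explicit quantization config instead of passing load_in_4bit=True to from_pretrained.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # fp4 raises "quant_type must be nf4 on CPU"
)

model = AutoModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-LlamaForCausalLM",  # placeholder tiny model
    quantization_config=bnb_config,
)
```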
* Add Fast Image Processor for Chameleon
* add warning to resize and move blend_rgba to convert_to_rgb
* Remove unrelated files
* Update image_processing_chameleon_fast to use auto_docstring
* fix equivalence test
---------
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
* add fast image processor nougat
* test fixes
* docstring white space
* last fixes
* docstring_type
* tolerance unit test
* fix tolerance
* fix rtol
* remove trailing white space
* remove white space
* note for tolerance unit test
* fix tests
* remove print
---------
Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
Some PEFT integration tests involving text generation pipelines were
failing since #38129 because the base model is too small to generate
longer sequences. Setting max_new_tokens fixes this.
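Roughly the shape of the fix (a sketch only; the model id and prompt are placeholders, not the actual test values):

```python
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="hf-internal-testing/tiny-random-LlamaForCausalLM",  # placeholder tiny model
)

# Bound the generation length explicitly so the tiny base model is not asked
# to produce long sequences.
output = pipe("Hello, my name is", max_new_tokens=10)
print(output[0]["generated_text"])
```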
* timestamp token is end of token time !!!
* ensure correct alignment between tokens and timestamp tokens
* ignore input tokens for DTW computation
* use num_frames to avoid token timestamp hallucinations
* token timestamps test updates !
* num_frames: deprecate and use attention_mask instead
* avoid breaking change
* fix the pipeline usage for the chunk approach (see the sketch after this commit list)
* make style
* better logging
* better logging
* make style
* update tests with correct values
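For reference, a sketch of the chunked pipeline usage these commits touch (the checkpoint and audio file are stand-ins, not taken from the tests): word-level timestamps are requested via return_timestamps and depend on the token-timestamp alignment fixed above.

```python
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",  # small public checkpoint, used here only for illustration
    chunk_length_s=30,            # long-form audio handled with the chunked approach
)

# Word-level timestamps rely on correct token/timestamp alignment and the DTW step.
result = asr("sample.wav", return_timestamps="word")
print(result["chunks"])
```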
* fix a bunch of XPU UT failures on stock PyTorch 2.7 and 2.8
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* qwen3
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* quanto
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* models
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* fix style
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* idefics2
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
---------
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
* Gemma 3n
* initial commit of Gemma 3n scaffold
* Fixing param pass through on Gemma3p5RMSNorm
* Adds Einsum layer to Gemma 3n
* Updating EinsumLayer API
* Undoing erroneous force push
* Reverting RMSNorm to with_scale by default
* Adds LAuReL to Gemma 3n
* Adds AltUp to Gemma 3n
* Adding Gemma3p5 overall and text config with vision and audio config placeholders (#3)
* Adding gemma3p5 text configs
* Adding audio config placeholders
* Adding a placeholder for vision configs
* Updating MobileNetVisionConfig, inheriting TimmWrapperConfig
* Updating text configs
* Update src/transformers/models/gemma3p5/modular_gemma3p5.py
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Removing altup configs to accept the suggested configs
* Update src/transformers/models/gemma3p5/modular_gemma3p5.py
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating altup config
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Addressing review comments and updating text configs
* Adding a config for activation sparsity
* Updating configs to pass through options to super class init and adjust some name prefixes
* Updating laurel and altup with corrected config values
* Normalizing sub_config initializers
---------
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating MLP with activation sparsity (#2)
* Updating DecoderBlock for Gemma 3n (#3)
* Initial Gemma3nTextModel (#4)
NOTE: This implementation WILL CHANGE in the coming weeks; however, changes will be strictly additive, and this will remain a suitable baseline for downstream implementations to reference.
* Adding KV Cache Sharing
* Adds Einsum layer to Gemma 3n
* Updating EinsumLayer API
* Refactored kv cache sharing in attention
* Adding KVStore for cache sharing
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update src/transformers/cache_utils.py
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Undoing erroneous force push
* Reverting RMSNorm to with_scale by default
* Adds LAuReL to Gemma 3n
* Updating KV Cache Sharing implementation
* Updating the q and k norm definitions in the attention module
* Fixing name error for q,k,v RMS norm to use the right 3n module
* Updating MLP with activation sparsity
* Updating DecoderBlock for Gemma 3.5
* Updating kv cache sharing implementation with the use of a cache buffer and refactoring some lines of code
* Isolating KV Cache logic to relevant components
* Fixing logic error in Gemma3nAttention.forward
* Refactoring caching contributions and fixing kv_store initialization
* Simplifying Configs
* Remove errant self from super init call
* Bug fix in the Attention module - changing self.head_dim to config.head_dim
* Bug fixes in the LaurelBlock and RMS Norm super init call
* removing redundant code from a merge
* Adding per_layer_inputs to TextModel
* Adding preprocess embeddings with altup
* Adds per-layer-to-single output and a host of TODOs
* Integrating altup predict with the model workflow and other minor bug fixes
* Using nn.Embedding temporarily for text model
* It goes forward
* Minor refactor of attention sparsity and RoPE initialization
* Fixing duplicate rope_scaling param bug when loading from pretrained
---------
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
* Normalizing on altup_num_inputs config option
* regenerating modeling file after syncing to HEAD
* Use torch.std(..., unbiased=False) for activation sparsity (#8)
* Refactoring to a single QVK Norm (#13)
* AltUp: support scale_corrected_output (#14)
* Converts einsums to nn.Linear (#7)
* Converts einsums to nn.Linear
* Removing unused variables
* Aligning SharedKVCache with HybridCache (#11)
* Aligning SharedKVStore with HybridCache
* Remove KVStore. Refactor apply_rotary_pos_emb for sharing
* Addressing review comments
* Supporting split modality embeddings in Gemma3n (#10)
* Adding the Embedder class
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Addressing review comments, adding audio embedding layers, integrating embedder with the remaining architecture, adding a forward method for conditional generation
* Apply suggestions from code review
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Update modular
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
* Addressing review comments, prop drilling audio and vision configs to the text config
* Removing TODO's that have been addressed
* Simplify Embedder init and add audio embeddings
* Embeddings refactor. Adds Gemma3nAudioEmbedder and Gemma3nVisionEmbedder
* Refactoring vision and audio embeddings into ConditionalGeneration model
---------
Co-authored-by: Ryan Mullins <ryan@ryanmullins.org>
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating attention mask for Gemma 3.5 (#15)
* xxx_token_index to xxx_token_id
* removing deprecated last_cache_position
* Removing references to SigLIP
* Always init per-layer inputs
* Using torch.finfo().min for epsilon_tensor
* Gemma3nDecoderLayer inherits from Gemma3DecoderLayer. Remove gating lambdas
* fix modular GEMMA3N_INPUTS_DOCSTRING
* Gemma3nAttention inherits from Gemma3Attention
* Modular inheritance fixes
* CausalLM conversion script for 4B model (#16)
* Add Gemma3n Audio Encoder (#6)
* initial commit of Gemma 3.5 scaffold
* Fixing param pass through on Gemma3nRMSNorm
* Adds Einsum layer to Gemma 3.5
* Updating EinsumLayer API
* Undoing erroneous force push
* Reverting RMSNorm to with_scale by default
* Adds LAuReL to Gemma 3n
* Adds AltUp to Gemma 3n
* Adding Gemma3n overall and text config with vision and audio config placeholders (#3)
* Adding gemma3n text configs
* Adding audio config placeholders
* Adding a placeholder for vision configs
* Updating MobileNetVisionConfig, inheriting TimmWrapperConfig
* Updating text configs
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Removing altup configs to accept the suggested configs
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating altup config
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Addressing review comments and updating text configs
* Adding a config for activation sparsity
* Updating configs to pass through options to super class init and adjust some name prefixes
* Updating laurel and altup with corrected config values
* Normalizing sub_config initializers
---------
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating MLP with activation sparsity (#2)
* Updating DecoderBlock for Gemma 3.5 (#3)
* Initial Gemma3nTextModel (#4)
NOTE: This implementation WILL CHANGE in the coming weeks; however, changes will be strictly additive, and this will remain a suitable baseline for downstream implementations to reference.
* Adding KV Cache Sharing
* Adds Einsum layer to Gemma 3.5
* Updating EinsumLayer API
* Refactored kv cache sharing in attention
* Adding KVStore for cache sharing
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update src/transformers/cache_utils.py
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Undoing erroneous force push
* Reverting RMSNorm to with_scale by default
* Adds LAuReL to Gemma 3n
* Updating KV Cache Sharing implementation
* Updating the q and k norm definitions in the attention module
* Fixing name error for q,k,v RMS norm to use the right Gemma 3n module
* Updating MLP with activation sparsity
* Updating DecoderBlock for Gemma 3.5
* Updating kv cache sharing implementation with the use of a cache buffer and refactoring some lines of code
* Isolating KV Cache logic to relevant components
* Fixing logic error in Gemma3nAttention.forward
* Refactoring caching contributions and fixing kv_store initialization
* Simplifying Configs
* Remove errant self from super init call
* Bug fix in the Attention module - changing self.head_dim to config.head_dim
* Bug fixes in the LaurelBlock and RMS Norm super init call
* removing redundant code from a merge
* Adding per_layer_inputs to TextModel
* Adding preprocess embeddings with altup
* Adds per-layer-to-single output and a host of TODOs
* Integrating altup predict with the model workflow and other minor bug fixes
* Using nn.Embedding temporarily for text model
* It goes forward
* Minor refactor of attention sparsity and RoPE initialization
* Fixing duplicate rope_scaling param bug when loading from pretrained
---------
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
* Normalizing on altup_num_inputs config option
* Adding audio encoder config
* Adds high-level components for Audio Encoder
* Implement uniform reducer for Audio Encoder
* Adding placeholders for Conformer components in Audio Encoder
* Adding placeholders for SubSampleConvProjection components in Audio Encoder
* Adding SequenceLayer component placeholders
* Implementing Gemma3nAudioEncoder with nn.Sequential
* Implementing Gemma3nAudioSubSampleConvProjection with nn.Sequential
* Implementing Conformer model with SequenceLayers
* Use OrderedDict in nn.Sequential initializers
* Implements sl.Residual in Torch with nn.Sequential and OrderedDict
* Adopting a base SequenceLayer class with default forward() method
* Implementing sl.GatedLinearUnit in Torch
* Implementing sl.Swish in Torch
* Implementing sl.ReLU in Torch
* Implementing sl.Scale in Torch
* Removing sl.Dropout after tree-shaking
* Implementing sl.RMSNorm in Torch with fake shape
* Implementing sl.GroupNorm in Torch
* Implementing sl.Conv2d in Torch
* Implementing sl.Dense in Torch
* Removing sl.Delay layers, which act as pass-throughs
* Connecting shapes to configs in initializers
* Removing sl.Emit
* Implementing sl.ExpandDims in Torch
* Adding sl.GradientClipping to Torch
* Implementing sl.DenseShaped in Torch
* Implementing sl.LDPA in Torch
* Removing unused sl.CombinedQKVProj class
* Fixing erroneous type hint
* Implementing sl.DepthwiseConv1D in Torch
* Implementing sl.MaskInvalid in Torch
* Fixes for initialization
* Fixes for saving weights
* Removing einsums per feedback from HF staff
* Removing Sequence Layers idioms from audio encoder
* Fixes for reviewer comments
* CausalLM conversion script for 4B model
* inv_timescales to non-persistent buffer
* Addressing audio encoder Attention feedback
* Addressing Gemma3nAudioSSCPConvBlock feedback
* Addressing Gemma3nAudioConformerAttention feedback
* Addressing padding feedback
* Weights conversion loads audio state dict
* Always use vision_config so saving works
* Token id updates for configs
* Stubs for interleaving audio embs
* Addressing reviewer feedback
---------
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
* Fixing cache access error
* Removing duplicate code from a bad merge
* Gemma 3n Text + Vision Part 1 (#17)
* testing utilities for numerics comparisons
* Corrected einsum to nn.Linear weights conversion
* Inherit scaled word embs from Gemma3 not Bart
* Fixing transposes for collapsed linears
* More transpose fixes
* numpy api fix
* RMSNorm: Explicit kwargs, scale_shift=0.0 when with_scale=True
* Force AltUp to float32
* Updating debugging script for AudioEncoder debugging
* Support divide_weight_by_sqrt_fan_in from JAX for per-layer inputs
* Correcting attention einsum conversions
* RMSNorm in type of x
* Fixing duplicate laurel norm/gating
* KV sharing using the right previous indices
* Refactor kv shared index computation. Correct frac_shared_layers
* Use num_shared_layers instead of inferring from a fraction
* fixing a bug for logging
* Fix shared data_ptrs in altup inits
* rope: adjust proj -> norm -> rope to preserve computation (#20)
* rope: adjust proj -> norm -> rope to preserve computation
* Removing some breaking language model fluff in ConditionalGeneration
* Consolidate query_states transforms
---------
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com>
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Vectorize the loops in AltUp (#19)
* Vectorize the loops in AltUp
* fix typo
* Expanding to support batched inputs
* remove extra debug script
* Fix AltUp.forward
---------
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Add 'scale_shift=0.0, with_scale=True' to the final norm in TextModel
* Convert norm to 1/sqrt (#21)
* Convert norm to 1/sqrt
* Scale shift change per Phil's rec
* Adding default activation sparsity
* Fixing 2B config in weights conversion script
* Fixing RMSNorm parameters - adding scale_shift and with_scale
* Correcting query pre-attention scaling
* Adding query_rescale_scalar to text config
* Adding layer_idx to MLP
* Permafix for input_layernorm
* Use 1/sqrt instead of rsqrt in DecoderLayer
* Fix o_proj conversion
* Conversion script update for vision encoder
* Removing logging for debugging timm model
* Fixing bugs in Gemma3nForConditionalGeneration for text generation
* Generating the modeling_gemma3n.py file
* Removing the addition of an erroneous line in the modeling file
* Adding gemma3n text model to modeling_auto
* Bugfix: Updating the interleaving of inputs_embeds and vision_embeds
* Updating the modeling file with the latest bugfix changes
* Updating models/auto for Gemma 3n
* using AutoTokenizer in forward test
* Adding processing_gemma3n.py
* Gemma 3n configured for AutoModel. Conversion script updated.
* Removing errant merge artifacts
---------
Co-authored-by: Mayank Chaturvedi <imayank@google.com>
Co-authored-by: Douglas Reid <douglas-reid@users.noreply.github.com>
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
* Removing errant debugging statements from Gemma 3
* Gemma3n audio model (#18)
* testing utilities for numerics comparisons
* Implement CumulativeGroupNorm and add to SubSampleConvProjection and SSCPConvBlock
* Add audio version of forward script based on RyanMullins' implementation
* Updating to match encoder tests. WIP: config question needs resolving
* Updates to audio classes to enable end-to-end running
* Removing vestigial classes, cleaning up print statements
* Adding SiLU / Swish to audio conformer feed forward block
* Shifted Gemma3p5Audio naming prefix to Gemma3NanoAudio
* Adding outputs to audio test
* Fixes to padding in SSCP and 1D convolution, align RMS Norm with wider model
* Update forward test to load from local weights
* Update conversion to process / output audio layers
* Update __all__ to export audio encoder
* AutoModel registration for Gemma 3n Audio
* Use AutoModel for ConditionalGeneration.audio_tower
* Fixing input_proj_linear transpose
* Fixing Gemma3NanoAudioConformerAttention.post conversion
* Fixing Gemma3NanoAudioSSCPConvBlock.conv weights conversion
* Correcting indentation issue on Gemma3p5RMSNorm
---------
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Text + Vision Part 2 (#23)
* Updates for ConditionalGeneration.get_image_features
* Adding a WIP draft of image_processing_gemma3p5.py
* Update src/transformers/models/gemma3p5/modular_gemma3p5.py
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
* Modular conversion after github suggested change
* Text + image gives good results
* Fixing image size preset
* Updating configs for the 2B variant in the conversion script
* Using final generation config in conversion script
---------
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
* Audio Integration (#12)
* initial commit of Gemma 3n scaffold
* Fixing param pass through on Gemma3nRMSNorm
* Adds Einsum layer to Gemma 3n
* Updating EinsumLayer API
* Undoing erroneous force push
* Reverting RMSNorm to with_scale by default
* Adds LAuReL to Gemma 3n
* Adds AltUp to Gemma 3n
* Adding Gemma 3n overall and text config with vision and audio config placeholders (#3)
* Adding Gemma 3n text configs
* Adding audio config placeholders
* Adding a placeholder for vision configs
* Updating MobileNetVisionConfig, inheriting TimmWrapperConfig
* Updating text configs
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Removing altup configs to accept the suggested configs
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating altup config
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Addressing review comments and updating text configs
* Adding a config for activation sparsity
* Updating configs to pass through options to super class init and adjust some name prefixes
* Updating laurel and altup with corrected config values
* Normalizing sub_config initializers
---------
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Updating MLP with activation sparsity (#2)
* Updating DecoderBlock for Gemma 3n (#3)
* Initial Gemma3nTextModel (#4)
NOTE: This implementation WILL CHANGE in the coming weeks; however, changes will be strictly additive, and this will remain a suitable baseline for downstream implementations to reference.
* Adding KV Cache Sharing
* Adds Einsum layer to Gemma 3n
* Updating EinsumLayer API
* Refactored kv cache sharing in attention
* Adding KVStore for cache sharing
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update modular
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Update src/transformers/cache_utils.py
Co-authored-by: Ryan Mullins <ryanmullins@google.com>
* Undoing erroneous force push
* Reverting RMSNorm to with_scale by default
* Adds LAuReL to Gemma 3n
* Updating KV Cache Sharing implementation
* Updating the q and k norm definitions in the attention module
* Fixing name error for q,k,v RMS norm to use the right 3n module
* Updating MLP with activation sparsity
* Updating DecoderBlock for Gemma 3n
* Updating kv cache sharing implementation with the use of a cache buffer and refactoring some lines of code
* Isolating KV Cache logic to relevant components
* Fixing logic error in Gemma3nAttention.forward
* Refactoring caching contributions and fixing kv_store initialization
* Simplifying Configs
* Remove errant self from super init call
* Bug fix in the Attention module - changing self.head_dim to config.head_dim
* Bug fixes in the LaurelBlock and RMS Norm super init call
* removing redundant code from a merge
* Adding per_layer_inputs to TextModel
* Adding preprocess embeddings with altup
* Adds per-layer-to-single output and a host of TODOs
* Integrating altup predict with the model workflow and other minor bug fixes
* Using nn.Embedding temporarily for text model
* It goes forward
* Minor refactor of attention sparsity and RoPE initialization
* Fixing duplicate rope_scaling param bug when loading from pretrained
---------
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
* Normalizing on altup_num_inputs config option
* Adding audio encoder config
* Adds high-level components for Audio Encoder
* Implement uniform reducer for Audio Encoder
* Adding placeholders for Conformer components in Audio Encoder
* Adding placeholders for SubSampleConvProjection components in Audio Encoder
* Adding SequenceLayer component placeholders
* Implementing Gemma3nAudioEncoder with nn.Sequential
* Implementing Gemma3nAudioSubSampleConvProjection with nn.Sequential
* Implementing Conformer model with SequenceLayers
* Use OrderedDict in nn.Sequential initializers
* Implements sl.Residual in Torch with nn.Sequential and OrderedDict
* Adopting a base SequenceLayer class with default forward() method
* Implementing sl.GatedLinearUnit in Torch
* Implementing sl.Swish in Torch
* Implementing sl.ReLU in Torch
* Implementing sl.Scale in Torch
* Removing sl.Dropout after tree-shaking
* Implementing sl.RMSNorm in Torch with fake shape
* Implementing sl.GroupNorm in Torch
* Implementing sl.Conv2d in Torch
* Implementing sl.Dense in Torch
* Removing sl.Delay layers, which act as pass-throughs
* Connecting shapes to configs in initializers
* Removing sl.Emit
* Implementing sl.ExpandDims in Torch
* Adding sl.GradientClipping to Torch
* Implementing sl.DenseShaped in Torch
* Implementing sl.LDPA in Torch
* Removing unused sl.CombinedQKVProj class
* Fixing erroneous type hint
* Implementing sl.DepthwiseConv1D in Torch
* Implementing sl.MaskInvalid in Torch
* Fixes for initialization
* Fixes for saving weights
* Removing einsums per feedback from HF staff
* Removing Sequence Layers idioms from audio encoder
* Fixes for reviewer comments
* Converting sl.Frontend to FeatureExtractor
* Updates for ConditionalGeneration.get_image_features
* Adding a WIP draft of image_processing_gemma3n.py
* Update modular
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
* Modular conversion after github suggested change
* Text + image gives good results
* Fixing image size preset
* Draft of audio data in chat template
* Removing image processing. Using SigLIP instead.
* Audio input going end-to-end
* Fixing dtype issues in audio encoder
* x-lib formatting consistency
* Adding example data
* Save preprocessor_config.json from conversion script
* Instrumentation for debugging
* Additional instrumentation for preprocessing debugging
* Updates to preprocessor, padding; produces correct end-to-end results on sample
* Tackling configuration TODOs
* Start of feature extractor refactor
* Adds Numpy version of USM extractor, removes Torch version and dependencies
* Fixing AltUp.correct coef permute
* Supporting batches of single audio segment inputs
* Docstrings updates for config
* In-lining audio feature extraction
* Adjustments to conversion script and smoke test script
---------
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: pculliton <phillipculliton@gmail.com>
* Gemma 3n renaming
* Removing test data and utilities
* Renaming test files
* Gemma 3n refactor
* Fix tokenizer config in conversion script
* Address reviewer feedback
* FeatureExtractor returns float32 by default
* Adding basic tests for audio, and input name for audio encoder
* Audio integration test, updates to model_id for other integration tests
* Use scales for q and k norms (#26)
* Update audio integration test to use HF dataset
* Reviewer feedback
* Expand embedding table to full vocab size in weights conversion
* Mix-n-match MatFormers for Gemma 3n (#25)
* Remove in-place operations (#30)
* chore: removing inplace ops
* remove [tensor] * n pattern
* chore: reviewer feedback in AudioEncoder and AltUp
* More grad clipping
* Dynamo compatibility
* fix: cache slicing error
* chore: simplify shared kv cache slicing
* chore: vision encoder rename in timm
* fix: image processor do_normalize=False
* fixup: style
* chore: model_doc
* fix: docs for code quality
* chore: repo consistency
* fix: RMSNorm in float as in prior Gemmas
* fix: per_layer_inputs = None
* chore: Gemma3nForCausalLM from Gemma3nForConditionalGeneration checkpoint
* chore: repo consistency
* Add initial unit tests for Gemma3nAudioFeatureExtractor (#27)
* Add initial unit tests for Gemma3nAudioFeatureExtractor
* Add basic unit tests for Gemma3nProcessor (#28)
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com>
* parameterize tests
---------
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com>
* chore: code style
* fix: test cases
* style and consistency
* fix config in the test to be coherent with layer cache sharing
* fix hidden states in tests and code
* inits and mappings
* fix modality prefixes
* test order and prefixes
* fix test exception
* fix class order and reduce model size for faster tests
* restore _checkpoint_conversion_mapping to load Causal from Conditional
* fix config mapping!
* fix: reviewer feedback
---------
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: raushan <raushan@huggingface.co>
Co-authored-by: Mayank Chaturvedi <imayank@google.com>
Co-authored-by: Douglas Reid <douglas-reid@users.noreply.github.com>
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: pculliton <phillipculliton@gmail.com>
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
* fix import test
* add model args
* auto_docstring
* replace test path
* consistency
* skip tests for now
* fix docstring for doc builder
* skip unused attr
---------
Co-authored-by: SindhuRaghuram97 <114270661+SindhuRaghuram97@users.noreply.github.com>
Co-authored-by: Sindhu Raghuram <sindhuraghuram@google.com>
Co-authored-by: raushan <raushan@huggingface.co>
Co-authored-by: Mayank Chaturvedi <imayank@google.com>
Co-authored-by: Douglas Reid <douglas-reid@users.noreply.github.com>
Co-authored-by: Douglas Reid <21148125+douglas-reid@users.noreply.github.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: pculliton <phillipculliton@gmail.com>
Co-authored-by: Aritra Roy Gosthipaty <aritra.born2fly@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
Co-authored-by: Arthur <arthur.zucker@gmail.com>
* rm tf/flax tests
* more flax deletions
* revert fixture change
* reverted test that should not be deleted; rm tf/flax test
* revert
* fix a few add-model-like tests
* fix add-model-like checkpoint source
* a few more
* test_get_model_files_only_pt fix
* fix test_retrieve_info_for_model_with_xxx
* fix test_retrieve_model_classes
* relative paths are the devil
* add todo
* handle long form generation
* add warning
* correct an incorrect in-place token change
* update test to catch edge case
* make style
* update warning
* add doc
* Image processor compile fix (#38540)
* Added a compile-friendly version of resize to BaseImageProcessorFast
* Changed qwen2 processor to use its parent class .resize
* Style
* underlying issue only happens on AMD; documented with a comment and a bool check
* Fixed some utils functions
* Fixed the same issue for bridgetower
* Fixed the same issue for llava_next
* Repo consistency for llava onevision
* Update src/transformers/image_processing_utils_fast.py
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
---------
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
* Added an Expectation to an internvl test
* Made qwen2_vl use the resize method of its parent class
* Changed to torch.where (see the sketch after this PR's commit list)
---------
Co-authored-by: Mohit Sharma <mohit21sharma.ms@gmail.com>
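Illustrative only: the general pattern behind the torch.where change above, assuming the goal is to avoid data-dependent Python branching (which causes graph breaks under torch.compile) inside the fast image processors; the function name and values here are made up.

```python
import torch

def clamp_nonpositive(values: torch.Tensor) -> torch.Tensor:
    # A Python-level check such as
    #   if (values <= 0).any(): values = values.clamp(min=eps)
    # depends on tensor data and breaks the torch.compile graph.
    # torch.where expresses the same logic entirely inside the graph.
    eps = torch.finfo(values.dtype).eps
    return torch.where(values <= 0, torch.full_like(values, eps), values)

compiled = torch.compile(clamp_nonpositive)
print(compiled(torch.tensor([-1.0, 0.5, 2.0])))
```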
* add dia model
* add tokenizer files
* cleanup some stuff
* brut copy paste code
* rough cleanup of the modeling code
* nuke some stuff
* more nuking
* more cleanups
* updates
* add multiLayerEmbedding vectorization
* nits
* more modeling simplifications
* updates
* update rope
* update rope
* just fixup
* update configuration files
* more cleanup!
* default config values
* update
* forgotten comma
* another comma!
* update, more cleanups
* just more nits
* more config cleanups
* time for the encoder
* fix
* small nit
* nits
* n
* refacto a bit
* cleanup
* update conversion script
* fix last issues
* fix last nits
* styling
* small fixes
* just run 1 generation
* fixes
* nits
* fix conversion
* fix
* more fixes
* full generate
* phew!
* fixes!
* updates
* fix
* fix conversion script
* fixup
* nits
* delete wrong test
* update
* update
* test tokenization
* let's start changing things bit by bit - fix encoder step
* removing custom generation, moving to GenerationMixin
* add encoder decoder attention masks for generation
* mask changes, correctness checked against ad29837 in dia repo
* refactor a bit already --> next cache
* too important not to push :)
* minimal cleanup + more todos
* make main overwrite modeling utils
* add cfg filter & eos filter
* add eos countdown & delay pattern
* update eos countdown
* add max step eos countdown
* fix tests
* fix some things
* fix generation with testing
* move cfg & eos stuff to logits processor
* make RepetitionPenaltyLogitsProcessor flexible
- can accept 3D scores like (batch_size, channel, vocab)
* fix input_ids concatenation dimension in GenerationMixin for flexibility
* Add DiaHangoverLogitsProcessor and DiaExponentialDecayLengthPenalty classes; refactor logits processing in DiaForConditionalGeneration to utilize new configurations and improve flexibility.
* Add stopping criteria
* refactor
* move delay pattern from processor to modeling like musicgen.
- add docs
- change eos countdown to eos delay pattern
* fix processor & fix tests
* refactor types
* refactor imports
* format code
* fix docstring to pass ci
* add docstring to DiaConfig & add DiaModel to test
* fix docstring
* add docstring
* fix some bugs
* check
* porting / merging results from other branch - IMPORTANT: it very likely breaks generation; the goal is to have a proper forward path first
* experimental testing of left padding for first channel
* whoops
* Fix merge to make generation work
* fix cfg filter
* add position ids
* add todos, break things
* revert changes to generation --> we will force 2d but go 3d on custom stuff
* refactor a lot, change prepare decoder ids to work with left padding (needs testing), add todos
* some first fixes to get to 10. in generation
* some more generation fixes / adjustment
* style + rope fixes
* move cfg out, simplify a few things, more todos
* nit
* start working on custom logit processors
* nit
* quick fixes
* cfg top k
* more refactor of logits processing, needs a decision if gen config gets the new attributes or if we move it to config or similar
* lets keep changes to core code minimal, only eos scaling is questionable atm
* simpler eos delay logits processor
* that was for debugging :D
* proof of concept rope
* small fix on device mismatch
* cfg fixes + delay logits max len
* transformers rope
* modular dia
* more cleanup
* keep modeling consistently 3D, generate handles 2D internally
* decoder starts with bos if nothing
* post processing prototype
* style
* lol
* force sample / greedy + fixes on padding
* style
* fixup tokenization
* nits
* revert
* start working on dia tests
* fix a lot of tests
* more test fixes
* nit
* more test fixes + some features to simplify code more
* more cleanup
* forgot that one
* autodocs
* small consistency fixes
* fix regression
* small fixes
* dia feature extraction
* docs
* wip processor
* fix processor order
* processing goes brrr
* transpose before
* small fix
* fix major bug but needs now a closer look into the custom processors esp cfg
* small thing on logits
* nits
* simplify indices and shifts
* add simpler version of padding tests back (temporarily)
* add logit processor tests
* starting tests on processor
* fix mask application during generation
* some fixes on the weights conversion
* style + fixup logits order
* simplify conversion
* nit
* remove padding tests
* nits on modeling
* hmm
* fix tests
* trigger
* probably gonna be reverted, just a quick design around audio tokenizer
* fixup typing
* post merge + more typing
* initial design for audio tokenizer
* more design changes
* nit
* more processor tests and style related things
* add to init
* protect import
* not sure why tbh
* add another protect
* more fixes
* wow
* it aint stopping :D
* another missed type issue
* ...
* change design around audio tokenizer to prioritize init and go for auto - in regards to the review
* change to new causal mask function + docstrings
* change ternary
* docs
* remove todo, i dont think its essential tbh
* remove pipeline as current pipelines do not fit in the current scheme, same as csm
* closer to wrapping up the processor
* text to audio, just for demo purposes (will likely be reverted)
* check if it's this
* save audio function
* ensure no grad
* fixes on prefixed audio, hop length is used via preprocess dac, device fixes
* integration tests (tested locally on a100) + some processor utils / fixes
* style
* nits
* another round of smaller things
* docs + some fixes (generate one might be big)
* mystery solved
* small fix on conversion
* add abstract audio tokenizer, change init check to abstract class
* nits
* update docs + fix some processing :D
* change inheritance scheme for audio tokenizer
* delete dead / unnecessary code in copied generate loop
* last nits on new pipeline behavior (+ todo on tests) + style
* trigger
---------
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Vasqu <antonprogamer@gmail.com>
* remove trust_remote_code
* again
* Revert "Skip some tests for now (#38931)"
This reverts commit 31d30b7224.
* again
* style
* again
* again
* style
* fix integration test
* fix tests
* style
* fix
* fix
* fix the last ones
* style
* last one
* fix last
* fix
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Support `flash_attn_3`
Implements fwd and tests for Flash Attention 3 https://github.com/Dao-AILab/flash-attention/commits/main/hopper
- Includes checks for dropout>0 and ALiBi in `modeling_utils.PreTrainedModel._check_and_enable_flash_attn_3` (dropout will likely be supported soon, so this check will need to be updated, as will `modeling_flash_attention_utils._flash_attention_forward` at the `if _IS_FLASH_ATTN_3_AVAILABLE: ...` branch)
An example Llama implementation is included in `modeling_llama.py`, but other models would still need to be updated
Based on https://github.com/huggingface/transformers/pull/36190, which has model implementations and examples that could be merged
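A sketch of how the new backend would be selected, assuming it follows the existing attn_implementation convention used for FA2 (the exact string, the checkpoint, and the hardware requirements are assumptions, not confirmed by this changelog):

```python
import torch
from transformers import AutoModelForCausalLM

# Assumes FA3 is exposed the same way as FA2 via attn_implementation and that the
# flash_attn_3 kernels from the Hopper branch are installed.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # example checkpoint, not named in this changelog
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_3",
    device_map="cuda",
)
```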
* Add tests for Flash Attention 2 and 3 parity
* ci fix
* FA2 compatibility
- `_prepare_flash_attention_from_position_ids` -> `prepare_fa2_from_position_ids`
- Remove bettertransformer check in Flash Attention 3
- Merge tests
- Add licensing
* ci fix
* Test naming consistency
* ci fix
* Deprecation warning for `prepare_fa2_from_position_ids`
* ci fix
* Initial submit
* Fix bugs:
1. add __init__ file
2. tied word embedding
3. support flash/flex attention
4. model saving and loading
* Code refactor:
* Rename encdecgemma to t5gemma.
* Split attention into self- and cross-attention
* Split stack into encoder and decoder
* Add test cases
* Add auto configuration
* Update configurations.
* Fix bugs related to copy and attribute checks
* Fix type union
* Fix merge errors
* run ruff format
* Run make style and update tests.
* Add t5gemma model doc.
* ruff and style formatting.
* Add missed module config.
* Add dummy checkpoint link to pass tests (needs updating when real checkpoints are uploaded).
* Update model doc.
* Minor updates following Arthur's comments:
* replace docstrings with auto_docstrings
* remove checkpoint layers
* remove deprecate_kwargs
* fix rebase errors
* Fix docstring issues.
* fix t5gemma doc issue.
* run ruff format
* Updates:
* split encoder-only model out
* make T5GemmaModel encoder-decoder only
* update token and sequence classification
* update tests
* don't move the whole video to GPU
* add torchcodec
* add tests
* make style
* instructblip as well
* consistency
* Update src/transformers/utils/import_utils.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Update src/transformers/utils/import_utils.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Update src/transformers/video_utils.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Fix graph break in torch.compile when using FA2 with attention_mask=None and batch size > 1
* fix code format
* add test; replace position_ids with query_states because position_ids.shape[0] is always 1
* add assert loss is not nan
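A rough sketch of the scenario the new test covers (placeholder model id; requires flash-attn and a CUDA device): compile with fullgraph=True, pass a batch larger than 1 with attention_mask=None, and check that the loss is finite.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "hf-internal-testing/tiny-random-LlamaForCausalLM",  # placeholder tiny model
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).cuda()

compiled = torch.compile(model, fullgraph=True)  # fullgraph=True errors out on any graph break

input_ids = torch.randint(0, model.config.vocab_size, (2, 16), device="cuda")  # batch size > 1
out = compiled(input_ids=input_ids, attention_mask=None, labels=input_ids)
assert not torch.isnan(out.loss)  # mirrors the "assert loss is not nan" commit
```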
* Add Arcee model support to transformers
- Add ArceeConfig and model mappings for all task types (CausalLM, SequenceClassification, QuestionAnswering, TokenClassification)
- Add auto-loading support through AutoModel, AutoConfig, and AutoTokenizer
- Use LlamaTokenizer for tokenization
- Add FX graph support for Arcee models
- Create lazy loading module structure for Arcee
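Given the auto-class mappings listed above, usage would presumably follow the standard pattern (the checkpoint path is a placeholder; no public Arcee weights are named in this log):

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/arcee-checkpoint"  # placeholder path

config = AutoConfig.from_pretrained(checkpoint)            # resolves to ArceeConfig
tokenizer = AutoTokenizer.from_pretrained(checkpoint)      # backed by LlamaTokenizer
model = AutoModelForCausalLM.from_pretrained(checkpoint)   # resolves to ArceeForCausalLM

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```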
* feat: update YARN scaling and RoPE validation for Arcee model
* feat: add auto_docstring checkpoint config to Arcee model classes
* docs: add pre-trained model weights reference to Arcee configuration files
* refactor: move RoPE utilities to dedicated modeling_rope_utils module
* Add comprehensive test suite for Arcee model
- Add test_modeling_arcee.py following standard transformers test patterns
- Include tests for all model variants (CausalLM, SequenceClassification, QuestionAnswering, TokenClassification)
- Add specific test for ReLU² activation in ArceeMLP
- Add RoPE scaling tests including YARN support
- Follow CausalLMModelTest pattern used by similar models
* Add documentation for Arcee model
- Add comprehensive model documentation with usage examples
- Include all model variants in autodoc
- Add to table of contents in proper alphabetical order
- Fixes documentation coverage for Arcee model classes
* Make style/fixup
* fix copyright year
* Sync modular conversion
* revert change to legacy supported models in src/transformers/utils/fx
* cleaned redundant code in modular_arcee.py
* cleaned testing
* removed pretraining tp
* fix styles
* integration testing
---------
Co-authored-by: Pranav <veldurthipranav@gmail.com>
Co-authored-by: Pranav <56645758+pranav4501@users.noreply.github.com>