* Support the generate argument return_dict_in_generate=True instead of returning an error
* fix: call test with return_dict_in_generate=True
* fix: Only import torch if it is present
* update: Encapsulate output_dict changes
* fix: added back original comments
---------
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
* correctly slice
* check mask
* Update modular_gemma2.py
* fix
* add tests
* fix typo
* finally fix mask slicing
* Finally correctly slice in all cases!!
* add test for all attention functions
* small fix in tests
* trick around dynamo tracing issue
* last update
* more robust
* kwargs propagation
* make it explicit for checkpointing
* apply modular
* Added `segmentation_maps` support for DPT image processor
* Added tests for dpt image processor
* Moved preprocessing into separate functions
* Added # Copied from statements
* Fixed # Copied from statements
* Added `segmentation_maps` support for DPT image processor
* Added tests for dpt image processor
* Moved preprocessing into separate functions
* Added # Copied from statements
* Fixed # Copied from statements
* First commit
* Finish model implementation
* First commit
* Finish model implementation
* Register zamba2
* generated modeling and configuration
* generated modeling and configuration
* added hybrid cache
* fix attention_mask in mamba
* dropped unused loras
* fix flash2
* config docstrings
* fix config and fwd pass
* make fixup fixes
* text_modeling_zamba2
* small fixes
* make fixup fixes
* Fix modular model converter
* added inheritances in modular, renamed zamba cache
* modular rebase
* new modular conversion
* fix generated modeling file
* fixed import for Zamba2RMSNormGated
* modular file cleanup
* make fixup and model tests
* dropped inheritance for Zamba2PreTrainedModel
* make fixup and unit tests
* Add inheritance of rope from GemmaRotaryEmbedding
* moved rope to model init
* drop del self.self_attn and del self.feed_forward
* fix tests
* renamed lora -> adapter
* rewrote adapter implementation
* fixed tests
* Fix torch_forward in mamba2 layer
* Fix torch_forward in mamba2 layer
* Fix torch_forward in mamba2 layer
* Dropped adapter in-place sum
* removed rope from attention init
* updated rope
* created get_layers method
* make fixup fix
* make fixup fixes
* make fixup fixes
* update to new attention standard
* update to new attention standard
* make fixup fixes
* minor fixes
* cache_position
* removed cache_position postion_ids use_cache
* remove config from modular
* removed config from modular (2)
* import apply_rotary_pos_emb from llama
* fixed rope_kwargs
* Instantiate cache in Zamba2Model
* fix cache
* fix @slow decorator
* small fix in modular file
* Update docs/source/en/model_doc/zamba2.md
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* several minor fixes
* inherit mamba2decoder fwd and drop position_ids in mamba
* removed docstrings from modular
* reinstate zamba2 attention decoder fwd
* use regex for tied keys
* Revert "use regex for tied keys"
This reverts commit 9007a522b1.
* use regex for tied keys
* add cpu to slow forward tests
* dropped config.use_shared_mlp_adapter
* Update docs/source/en/model_doc/zamba2.md
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* re-convert from modular
---------
Co-authored-by: root <root@node-2.us-southcentral1-a.compute.internal>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* use torch.testing.assertclose instead to get more details about error in cis
* fix
* style
* test_all
* revert for I bert
* fixes and updates
* more image processing fixes
* more image processors
* fix mamba and co
* style
* less strict
* ok I won't be strict
* skip and be done
* up
* Fix test_pipelines_video_classification that was always failing
* Update video pipeline docstring to reflect actual return type
---------
Co-authored-by: Louis Groux <louis.cal.groux@gmail.com>
* Initial commit with template code generated by transformers-cli
* Multiple additions to SuperGlue implementation :
- Added the SuperGlueConfig
- Added the SuperGlueModel and its implementation
- Added basic weight conversion script
- Added new ImageMatchingOutput dataclass
* Few changes for SuperGlue
* Multiple changes :
- Added keypoint detection config to SuperGlueConfig
- Completed convert_superglue_to_pytorch and successfully ran inference
* Reverted unintentional change
* Multiple changes :
- Added SuperGlue to a bunch of places
- Divided SuperGlue into SuperGlueForImageMatching and SuperGlueModel
- Added testing images
* Moved things in init files
* Added docs (to be finished depending on the final implementation)
* Added necessary imports and some doc
* Removed unnecessary import
* Fixed make fix-copies bug and ran it
* Deleted SuperGlueModel
Fixed convert script
* Added SuperGlueImageProcessor
* Changed SuperGlue to support batching pairs of images and modified ImageMatchingOutput accordingly
* Changed convert_superglue_to_hf.py script to experiment with different ways of reading an image and assess their impact on performance
* Added initial tests for SuperGlueImageProcessor
* Added AutoModelForImageMatching in missing places and tests
* Fixed keypoint_detector_output instructions
* Fix style
* Adapted to latest main changes
* Added integration test
* Fixed bugs to pass tests
* Added keypoints returned by keypoint detector in the output of SuperGlue
* Added doc to SuperGlue
* SuperGlue returning all attention and hidden states for a fixed number of keypoints
* Make style
* Changed SuperGlueImageProcessor tests
* Revert "SuperGlue returning all attention and hidden states for a fixed number of keypoints"
Changed tests accordingly
This reverts commit 5b3b669c
* Added back hidden_states and attentions masked outputs with tests
* Renamed ImageMatching occurrences into KeypointMatching
* Changed SuperGlueImageProcessor to raise error when batch_size is not even
* Added docs and clarity to hidden state and attention grouping function
* Fixed some code and done refactoring
* Fixed typo in SuperPoint output doc
* Fixed some of the formatting and variable naming problems
* Removed useless function call
* Removed AutoModelForKeypointMatching
* Fixed SuperGlueImageProcessor to only accept pairs of images
* Added more fixes to SuperGlueImageProcessor
* Simplified the batching of attention and hidden states
* Simplified stack functions
* Moved attention instructions into class
* Removed unused do_batch_norm argument
* Moved weight initialization to the proper place
* Replaced deepcopy for instantiation
* Fixed small bug
* Changed from stevenbucaille to magic-leap repo
* Renamed London Bridge images to Tower Bridge
* Fixed formatting
* Renamed remaining "london" to "tower"
* Apply suggestions from code review
Small changes in the docs
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Added AutoModelForKeypointMatching
* Changed images used in example
* Several changes to image_processing_superglue and style
* Fixed resample type hint
* Changed SuperGlueImageProcessor and added test case for list of 2 images
* Changed list_of_tuples implementation
* Fix in dummy objects
* Added normalize_keypoint, log_sinkhorn_iterations and log_optimal_transport docstring
* Added missing docstring
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Moved forward block at bottom
* Added docstring to forward method
* Added docstring to match_image_pair method
* Changed test_model_common_attributes to test_model_get_set_embeddings test method signature
* Removed AutoModelForKeypointMatching
* Removed image fixtures and added load_dataset
* Added padding of images in SuperGlueImageProcessor
* Cleaned up convert_superglue_to_hf script
* Added missing docs and fixed unused argument
* Fixed SuperGlueImageProcessor tests
* Transposed all hidden states from SuperGlue to reflect the standard (..., seq_len, feature_dim) shape
* Added SuperGlueForKeypointMatching back to modeling_auto
* Fixed image processor padding test
* Changed SuperGlue docs
* changes:
- Abstraction to batch, concat and stack of inconsistent tensors
- Changed conv1d's to linears to match standard attention implementations
- Renamed all tensors to be tensor0 and not tensor_0 and be consistent
- Changed match image pair to run keypoint detection on all images first, create batching tensors and then fill these tensors match by match
- Various changes in docs, etc
* Changes to SuperGlueImageProcessor:
- Reworked the input image pairs checking function and added tests accordingly
- Added Copied from statements
- Added do_grayscale tag (also for SuperPointImageProcessor)
- Misc changes for better code
* Formatting changes
* Reverted conv1d to linear conversion because of numerical differences
* fix: changed some code to be more straightforward (e.g. filtering keypoints) and converted plot from opencv to matplotlib
* fix: removed unnecessary test
* chore: removed commented code and added back hidden states transpositions
* chore: changed from "inconsistent" to "ragged" function names as suggested
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* docs: applied suggestions
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* docs: updated to display matched output
* chore: applied suggestion for check_image_pairs_input function
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* chore: changed check_image_pairs_input function name to validate_and_format_image_pairs and used validate_preprocess_arguments function
* tests: simplified tests for image input format and shapes
* feat: converted SuperGlue's use of Conv1d with kernel_size of 1 with Linear layers. Changed tests and conversion script accordingly
* feat: several changes to address comments
Conversion script:
- Reverted fuse batchnorm to linear conversion
- Changed all 'nn.Module' to respective SuperGlue models
- Changed conversion script to use regex mapping and match other recent scripts
Modeling SuperGlue:
- Added batching with mask and padding to attention
- Removed unnecessary concat, stack and batch ragged pairs functions
- Reverted batchnorm layer
- Renamed query, key, value and merge layers into q, k, v, out proj
- Removed Union of different Module into nn.Module in _init_weights method typehint
- Changed several method's signature to combine image0 and image1 inputs with appropriate doc changes
- Updated SuperGlue's doc with torch.no_grad()
Updated test to reflect changes in SuperGlue model
* refactor: changed validate_and_format_image_pairs function for clarity
* refactor: changed from one SuperGlueMLP class to a list of SuperGlueMLP class
* fix: fixed forgotten init weight change from last commit
* fix: fixed rebase mistake
* fix: removed leftover commented code
* fix: added typehint and changed some of arguments default values
* fix: fixed attribute default values for SuperGlueConfig
* feat: added SuperGlueImageProcessor post process keypoint matching method with tests
* fix: fixed SuperGlue attention and hidden state tuples aggregation
* chore: fixed mask optionality and reordered tensor reshapes to be cleaner
* chore: fixed docs and error message returned in validate_and_format_image_pairs function
* fix: fixed returned keypoints to be the ones that SuperPoint returns
* fix: fixed check on number of image sizes for post process compared to the pairs in outputs of SuperGlue
* fix: fixed check on number of image sizes for post process compared to the pairs in outputs of SuperGlue (bis)
* fix: Changed SuperGlueMultiLayerPerceptron instantiation to avoid if statement
* fix: Changed convert_superglue_to_hf script to reflect latest SuperGlue changes and got rid of nn.Modules
* WIP: implement Attention from an existing class (like BERT)
* docs: Changed docs to include more appealing matching plot
* WIP: Implement Attention
* chore: minor typehint change
* chore: changed convert superglue script by removing all classes and apply conv to linear conversion in state dict + rearrange keys to comply with changes in model's layers organisation
* Revert "Fixed typo in SuperPoint output doc"
This reverts commit 2120390e82.
* chore: added comments in SuperGlueImageProcessor
* chore: changed SuperGlue organization HF repo to magic-leap-community
* [run-slow] refactor: small change in layer instantiation
* [run-slow] chore: replaced remaining stevenbucaille org to magic-leap-community
* [run-slow] chore: make style
* chore: update image matching fixture dataset HF repository
* [run-slow] superglue
* tests: overwriting test_batching_equivalence
* [run-slow] superglue
* tests: changed test to cope with value changing depending on cuda version
* [run-slow] superglue
* tests: changed matching_threshold value
* [run-slow] superglue
* [run-slow] superglue
* tests: changed tests for integration
* [run-slow] superglue
* fix: Changed tensor view and permutations to match original implementation results
* fix: updated convert script and integration test to include last change in model
* fix: increase tolerance for CUDA variances
* Apply suggestions from code review
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* [run-slow] superglue
* chore: removed blank whitespaces
* [run-slow] superglue
* Revert SuperPoint image processor accident changes
* [run-slow] superglue
* refactor: reverted copy from BERT class
* tests: lower the tolerance in integration tests for SuperGlue
* [run-slow] superglue
* chore: set do_grayscale to False in SuperPoint and SuperGlue image processors
* [run-slow] superglue
* fix: fixed imports in SuperGlue files
* chore: changed do_grayscale SuperGlueImageProcessing default value to True
* docs: added typehint to post_process_keypoint_matching method in SuperGlueImageProcessor
* fix: set matching_threshold default value to 0.0 instead of 0.2
* feat: added matching_threshold to post_process_keypoint_matching method
* docs: update superglue.md to include matching_threshold parameter
* docs: updated SuperGlueConfig docstring for matching_threshold default value
* refactor: removed unnecessary parameters in SuperGlueConfig
* fix: changed from matching_threshold to threshold
* fix: re-revert changes to make SuperGlue attention classes copies of BERT
* [run-slow] superglue
* fix: added missing device argument in post_processing method
* [run-slow] superglue
* fix: add matches different from -1 to compute valid matches in post_process_keypoint_matching (and docstring)
* fix: add device to image_sizes tensor instantiation
* tests: added checks on do_grayscale test
* chore: reordered and added Optional typehint to KeypointMatchingOutput
* LightGluePR suggestions:
- use `post_process_keypoint_matching` as default docs example
- add `post_process_keypoint_matching` in autodoc
- add `SuperPointConfig` import under TYPE_CHECKING condition
- format SuperGlueConfig docstring
- add device in convert_superglue_to_hf
- Fix typo
- Fix KeypointMatchingOutput docstring
- Removed unnecessary line
- Added missing SuperGlueConfig in __init__ methods
* LightGluePR suggestions:
- use batching to get keypoint detection
* refactor: processing images done in 1 for loop instead of 4
* fix: use @ instead of torch.einsum for scores computation
* style: added #fmt skip to long tensor values
* refactor: rollbacked validate_and_format_image_pairs valid and invalid case to more simple ones
* refactor: prepare_imgs
* refactor: simplified `validate_and_format_image_pairs`
* docs: fixed doc
---------
Co-authored-by: steven <steven.bucaillle@gmail.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Steven Bucaille <steven.bucaille@buawei.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Convert more checkpoints
* Update docs, convert huge variant
* Update model name
* Update src/transformers/models/vitpose/modeling_vitpose.py
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* Remove print statements
* Update docs/source/en/model_doc/vitpose.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Link to collection
---------
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
`return unittest.skip()` used in the xpu skip condition of `test_model_parallel_beam_search`
did not actually mark the test as skipped when running under pytest:
* 148 passed, 1 skipped
Other tests use `self.skipTest()`. Reusing this approach and moving the
condition outside the loop (since it does not depend on it) skips correctly
for xpu:
* 148 skipped
Secondly, `device_map="auto"` is now implemented for XPU for IPEX>=2.5 and
torch>=2.6, so we can now enable these tests for XPU for new IPEX/torch
versions.
Fixes: 1ea3ad1ae ("[tests] use `torch_device` instead of `auto` for model testing (#29531)")
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
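A minimal sketch of the pattern described above, with illustrative names for the xpu condition and the per-model loop: `self.skipTest()` raises `unittest.SkipTest`, which is what makes pytest actually report the test as skipped, whereas `return unittest.skip(...)` inside a test body only returns an unused decorator.

```python
import unittest

# Illustrative stand-ins for the real xpu condition and per-model loop.
RUNNING_ON_UNSUPPORTED_XPU = True
MODEL_CLASSES = ["ModelA", "ModelB"]


class ModelParallelBeamSearchTest(unittest.TestCase):
    def test_model_parallel_beam_search(self):
        # Check once, outside the loop, since the condition does not depend on it.
        if RUNNING_ON_UNSUPPORTED_XPU:
            self.skipTest("device_map='auto' on XPU needs IPEX>=2.5 and torch>=2.6")
        for model_class in MODEL_CLASSES:
            ...  # exercise beam search with device_map="auto" for each model class


if __name__ == "__main__":
    unittest.main()
```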
* Add input ids to model output
* Add text preprocessing for processor
* Fix snippet
* Add test for equivalence
* Add type checking guard
* Fixing typehint
* Fix test for added `input_ids` in output
* Add deprecations and "text_labels" to output
* Adjust tests
* Fix test
* Update code examples
* Minor docs and code improvement
* Remove one-liner functions and rename class to CamelCase
* Update docstring
* Fixup
* An attempt to fix #29554. Include 'LayerNorm.' in the gamma/beta rename scope, considerably reducing the number of characters searched on every load.
* Fix fix on load issue
* Fix gamma/beta warning test
* A style complaint
* Improve efficiency of weight norm key rename. Add better comments about weight norm and layer norm renaming.
* Habitual elif redundant with the return
* add test
* augment test as suggested
* Update tests/utils/test_modeling_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* rerun tests
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
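A rough sketch of the narrowed renaming described above (the helper name is hypothetical, not the actual transformers function): including `LayerNorm.` in the search string means only LayerNorm parameters are candidates for the gamma/beta rename.

```python
# Hypothetical helper illustrating the scoped gamma/beta rename.
def rename_layer_norm_key(key: str) -> str:
    # Only keys containing "LayerNorm." are considered for renaming,
    # which keeps the per-key search cheap on every checkpoint load.
    if "LayerNorm." in key:
        key = key.replace("LayerNorm.gamma", "LayerNorm.weight")
        key = key.replace("LayerNorm.beta", "LayerNorm.bias")
    return key


print(rename_layer_norm_key("encoder.layer.0.output.LayerNorm.gamma"))
# encoder.layer.0.output.LayerNorm.weight
```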
* DataCollatorForLanguageModeling class was updated with new parameters that provide more control over token masking and replacing
* DataCollatorForLanguageModeling class was updated with new parameters that provide more control over token masking and replacing
* Addressed review comments, modified the docstring and made a test for the DataCollatorForLanguageModeling
* Add the helium model.
* Add a missing helium.
* And add another missing helium.
* Use float for the rmsnorm mul.
* Add the Helium tokenizer converter.
* Add the pad token as suggested by Arthur.
* Update the RMSNorm + some other tweaks.
* Fix more rebase issues.
* fix copies and style
* fixes and add helium.md
* add missing tests
* update the backlink
* oups
* style
* update init, and expected results
* small fixes
* match test outputs
* style fixup, fix doc builder
* add dummies and we should be good to go!
* update sdpa and fa2 documentation
---------
Co-authored-by: laurent <laurent.mazare@gmail.com>
* model can convert to HF and be loaded back
* nit
* works in single batch generation but hallucinates
* use the image tokens
* add image generation
* now it works
* add tests
* update
* add modular but it doesn't work for porting docstring :(
* skip some tests
* add slow tests
* modular removed the import?
* guess this works
* update
* update
* fix copies
* fix test
* fix copies
* update
* docs
* fix tests
* last fix tests?
* pls
* repo consistency
* more style
* style
* remove file
* address comments
* tiny bits
* update after the new modular
* fix tests
* add one more cond in check attributes
* decompose down/up/mid blocks
* allow static cache generation in VLMs
* nit
* fix copies
* Update docs/source/en/model_doc/emu3.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/emu3.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/emu3.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/emu3.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/emu3.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/emu3.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/emu3.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/model_doc/emu3.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* fix VAE upsampling
* Update src/transformers/models/emu3/modular_emu3.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* address comments
* state overwritten stuff explicitly
* fix copies
* add the flag for flex attn
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Introduce 5 integration tests for the 4 model classes + torch export
* ModernBert: reuse GemmaRotaryEmbedding via modular
* Revert #35589, keep rope_kwargs; rely on them in modular_modernbert
* Revert "Revert #35589, keep rope_kwargs; rely on them in modular_modernbert"
This reverts commit 11b44b9ee8.
* Don't set rope_kwargs; override 'self.rope_init_fn' call instead
* Ensure that add_prefix_space is propagated to backend_tokenizer.pre_tokenizer
in PreTrainedTokenizerFast, rather than relying on subclasses to take care of this.
* Simplify setting self.add_prefix_space, ensure pre_tok exists
* Wrap in try-except to catch 'Custom PreTokenizer cannot be serialized'
862d1a346a/bindings/python/src/pre_tokenizers.rs (L672) produces the exception. It is triggered by the roformer tests, as RoFormerTokenizerFast uses a custom PreTokenizer.
* Propagate add_prefix_space in T5TokenizerFast to superclass
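A hedged sketch of the propagation logic described above (the helper is hypothetical; the real logic lives in `PreTrainedTokenizerFast`): the backend pre-tokenizer is re-created with the requested `add_prefix_space`, and custom pre-tokenizers that cannot be serialized are left alone.

```python
import json

from tokenizers import pre_tokenizers


def set_add_prefix_space(backend_tokenizer, add_prefix_space: bool) -> None:
    try:
        # Pre-tokenizers from the `tokenizers` library serialize to JSON; a custom
        # PreTokenizer raises "Custom PreTokenizer cannot be serialized" here.
        state = json.loads(backend_tokenizer.pre_tokenizer.__getstate__())
    except Exception:
        return  # e.g. RoFormerTokenizerFast: keep the custom pre-tokenizer as-is
    if state.get("add_prefix_space") == add_prefix_space:
        return
    # Rebuild a pre-tokenizer of the same type with the new setting.
    pre_tok_class = getattr(pre_tokenizers, state.pop("type"))
    state["add_prefix_space"] = add_prefix_space
    backend_tokenizer.pre_tokenizer = pre_tok_class(**state)
```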
* look-ahead negation
* re add examples by default
* Fix the bug in topological sort
* Update create_dependency_mapping.py
* start adding test
* finalize test
* more tests
* style
* style
* update modular_modernbert -- add inputs_embeds param to ModernBertModel
* Fix implementation issues; extend to other classes; docstring
First of all, the inputs_embeds shouldn't fully replace `self.embeddings(input_ids)`, because this call also does layer normalization and dropout. So, now both input_ids and inputs_embeds are passed to ModernBertEmbeddings, much like how BertEmbeddings is implemented.
I also added `inputs_embeds` to the docstring, and propagated the changes to the other model classes.
I also introduced an error if input_ids and inputs_embeds are both provided or neither is.
Lastly, I fixed an issue with device being based solely on input_ids with attention_mask.
* Propagate inputs_embeds to ModernBertForMaskedLM correctly
Also reintroduce inputs_embeds test
---------
Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
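A simplified sketch of the behaviour described above, with generic module names rather than the real ModernBert classes: both paths go through layer norm and dropout, and exactly one of `input_ids` / `inputs_embeds` must be supplied.

```python
import torch
import torch.nn as nn


class Embeddings(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int, dropout: float = 0.1):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)
        self.drop = nn.Dropout(dropout)

    def forward(self, input_ids=None, inputs_embeds=None):
        # Error out when both or neither are provided.
        if (input_ids is None) == (inputs_embeds is None):
            raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
        if inputs_embeds is None:
            inputs_embeds = self.tok_embeddings(input_ids)
        # Normalization and dropout are applied in both cases, so passing
        # inputs_embeds does not silently skip them.
        return self.drop(self.norm(inputs_embeds))


emb = Embeddings(vocab_size=100, hidden_size=16)
print(emb(input_ids=torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 16])
```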
* update codecarbon
* replace directly-specified-test-dirs with tmp_dir
* pass tmp_dir to all get_regression_trainer
* test_trainer.py: Use tmp_dir consistently for all output_dir arguments
* fix some with...as tmp_dir blocks
* reflect the comments to improve test_trainer.py
* refresh .gitignore
* update conversion script
* update for bias again
* remove pdv
* use my dir
* Update how we initialize the tokenizer
* Convert in bfloat16
* Undo that one again
* fix config dump
* .to() was broken for BatchMixFeature
* quick debug breakpoint
* put the breakpoint in the right place
* Add a config flag for the multimodal projector bias
* Add a config flag for the multimodal projector bias
* Conversion script can load chat templates
* Indent config for comparison
* Stop clobbering the config
* Re-enable the config clobber
* Get rid of the config manual save - it has no effect!
* Handle adapter bias correctly
* Default vision transformer activation to silu
* Remove legacy processing path
* One commit with all the debug breakpoints before I delete them all, in case I need to revert
* Update conversion
* Remove vLLM debugging instrumentation
* Drop xformers
* Remove debug enumerates
* make fixup
* make fixup
* Break copied from in pixtral
* Propagate multimodal_projector_bias change
* Propagate multimodal_projector_bias change
* Remove debug device .to()
* Restore attention weights output
* Fix Pixtral test
* Drop image_seq_length
* Drop image_seq_length
* Put the legacy processing code back
* Add the bias option to the llava_next_video config
* Add the bias option to the llava_next_video config
* Make certain args required in converter
* Make certain args required in converter
* typo
* make fixup
* Reverting some dtype changes since it seems to work without them
---------
Co-authored-by: arthur@huggingface.co <arthur@ip-26-0-166-244.ec2.internal>
Co-authored-by: Matt <rocketknight1@gmail.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
* Updated docstring for _determine_best_metric.
* Updated docstring for metric_for_best_model.
* Added test case for save strategy.
* Updated incorrect test case.
* Changed eval_strategy to match save_strategy.
* Separated test cases for metric.
* Allow load_best_model when save_strategy == "best".
* Updated docstring for metric_for_best_model.
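A hedged sketch of how the new strategy might be used (argument values are illustrative; the exact validation rules are defined in the PR): checkpoints are written only when the tracked metric improves, and the best model can still be loaded at the end.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    eval_strategy="steps",
    save_strategy="best",              # save only when metric_for_best_model improves
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    load_best_model_at_end=True,
)
```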
* fix: processing odd number of frames
* feat: add test case
* update: test one frame
* feat: support custom patch size
* fix: test with videos
* revert: change on patch repeat
* fix: much wow
* update: fixups
* fixup pls
* ruff fixup
* fix typo at least
* Correctly list the chat template file in the saved files list
* Update src/transformers/tokenization_utils_base.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Add save file checking to test
* make fixup
* better filename handling
* make fixup
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* add audio_token attribute to proc
* expand input_ids
* and legacy and expanded input_ids
* test update
* split lines
* add possibility not to provide eos and bos audio tokens
* raise errors
* test incorrect number of audio tokens
* add example
* fmt
* typo
* first adding diffllama
* add Diff Attention and other but still with errors
* complate make attention Diff-Attention
* fix some bugs which may be caused by transformer-cli while adding model
* fix a bug caused by forgetting KV cache...
* Update src/transformers/models/diffllama/modeling_diffllama.py
You don't need to divide by 2 if we use the same number of attention heads as llama; instead you can just split in forward.
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* Update src/transformers/models/diffllama/modeling_diffllama.py
fit to the changed "num_heads // 2" placement
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* Update src/transformers/models/diffllama/modeling_diffllama.py
new codes are more meaningful than before
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* Update src/transformers/models/diffllama/modeling_diffllama.py
new codes are more meaningful than before
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* Update src/transformers/models/diffllama/modeling_diffllama.py
fit to the changed "num_heads // 2" placement
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* Update src/transformers/models/diffllama/modeling_diffllama.py
fix 2times divide by sqrt(self.head_dim)
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* Update src/transformers/models/diffllama/modeling_diffllama.py
fix 2times divide by sqrt(self.head_dim)
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* Update src/transformers/models/diffllama/modeling_diffllama.py
fit to the changed "num_heads // 2" placement,
and make it more visible
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* I found the Attention was still mis-implemented relative to the paper as of e072544a3b.
* re-implemented
* adding groupnorm
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* align with transformers code style
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* fix typo
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* adding groupnorm
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* change SdpaAttention to DiffSdpaAttention
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* fix bug
* Update src/transformers/models/diffllama/modeling_diffllama.py
resolve "not same outputs" problem
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* fix bugs of places of "GroupNorm with scale" and etc
* Revert "fix bugs of places of "GroupNorm with scale" and etc"
This reverts commit 26307d92f6.
* simplify multiple of attention (matmul) operations into one by repeating value_states
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* simplify multiple of attention (matmul) operations into one by repeating value_states
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* simplify multiple of attention (matmul) operations into one by repeating value_states
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
* remove missed type
* add diffllama model_doc
* apply make style/quality
* apply review comment about model
* apply review comment about test
* place diffllama alphabetically on the src/transformers/__init__.py
* fix forgot code
* Supports parameters that are not initialized with standard deviation 0 in the conventional method
* add DiffLlamaConfig to CONFIG_CLASSES_TO_IGNORE_FOR_DOCSTRING_CHECKPOINT_CHECK on utils/check_config_docstrings.py
* remove unused property of config
* add to supported model list
* add to sdpa supported model list
* fix copyright, remove pretraining_tensor_parallel, and modify for initialization test
* remove unused import and etc.
* empty commit
* empty commit
* empty commit
* apply modular transformers but with bugs
* revert prev commit
* create src/transformers/model/diffllama/modular_diffllama.py
* run utils/modular_model_converter.py
* empty commit
* leaner modular diffllama
* remove more and more in modular_diffllama.py
* remove more and more in modular_diffllama.py
* resolve missing docstring entries
* force reset
* convert modular
---------
Co-authored-by: Minho Ryu <ryumin93@gmail.com>
The `parallelize()` API is deprecated in favor of accelerate's `device_map="auto"`
and is therefore not accepting new features. At the same time, the `parallelize()`
implementation is currently CUDA-specific. This commit marks the respective
CI tests with `@require_torch_gpu`.
Fixes: #35252
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
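A minimal sketch of the marking described above (the test class and body are illustrative): `require_torch_gpu` from `transformers.testing_utils` skips the test on non-CUDA devices, matching the CUDA-only `parallelize()` implementation.

```python
import unittest

from transformers.testing_utils import require_torch_gpu


class ParallelizeTest(unittest.TestCase):
    @require_torch_gpu
    def test_model_parallelization(self):
        ...  # exercises the deprecated, CUDA-specific parallelize() API
```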
* added logic for deleting adapters once loaded
* updated to the latest version of transformers, merged utility function into the source
* updated with missing check
* added peft version check
* Apply suggestions from code review
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* changes according to reviewer
* added test for deleting adapter(s)
* styling changes
* styling changes in test
* removed redundant code
* formatted my contributions with ruff
* optimized error handling
* ruff formatted with correct config
* resolved formatting issues
---------
Co-authored-by: Anton Vlasjuk <73884904+vasqu@users.noreply.github.com>
* Make kwargs uniform for SAM
* Remove unused attribute
* Make point_pad_value part of image_kwargs
* Update annotations
* Code review - use existing methods
* Use ProcessorTesterMixin
* Do not add ProcessorTesterMixin everywhere
* fixup mamba2 - caching and several other small fixes
* fixup cached forward
* correct fix this time
* fixup cache - we do not need to extend the attn mask it's handled by generate (gives total ids + mask at each step)
* remove unnecessary (un)squeeze
* fixup cache position
* simplify a few things
* [run-slow] mamba2
* multi gpu attempt two
* [run-slow] mamba2
* [run-slow] mamba2
* [run-slow] mamba2
* [run-slow] mamba2
* add newer slow path fix
* [run-slow] mamba2
* initial cut of modernbert for transformers
* small bug fixes
* fixes
* Update import
* Use compiled mlp->mlp_norm to match research implementation
* Propagate changes in modular to modeling
* Replace duplicate attn_out_dropout in favor of attention_dropout
cc @warner-benjamin let me know if the two should remain separate!
* Update BOS to CLS and EOS to SEP
Please confirm @warner-benjamin
* Set default classifier bias to False, matching research repo
* Update tie_word_embeddings description
* Fix _init_weights for ForMaskedLM
* Match base_model_prefix
* Add compiled_head to match research repo outputs
* Fix imports for ModernBertForMaskedLM
* Just use "gelu" default outright for classifier
* Fix config name typo: initalizer -> initializer
* Remove some unused parameters in docstring. Still lots to edit there!
* Compile the embeddings forward
Not having this resulted in very slight differences - so small it wasn't even noticed for the base model, only for the large model.
But the tiny difference for large propagated at the embedding layer through the rest of the model, leading to notable differences of ~0.0084 average per value, up to 0.2343 for the worst case.
* Add drafts for ForSequenceClassification/ForTokenClassification
* Add initial SDPA support (not exactly equivalent to FA2 yet!)
During testing, FA2 and SDPA still differ by about 0.0098 per value in the token embeddings. It still predicts the correct mask fills, but I'd like to get it fully 1-1 if possible.
* Only use attention dropout if training
* Add initial eager attention support (also not equivalent to FA2 yet!)
Frustratingly, I also can't get eager to be equivalent to FA2 (or sdpa), but it does get really close, i.e. avg ~0.010 difference per value.
Especially if I use fp32 for both FA2&eager, avg ~0.0029 difference per value
The fill-mask results are good with eager.
* Add initial tests, output_attentions, output_hidden_states, prune_heads
Tests are based on BERT, not all tests pass yet: 23 failed, 79 passed, 100 skipped
* Remove kwargs from ModernBertForMaskedLM
Disable sparse_prediction by default to match the normal HF, can be enabled via config
* Remove/adjust/skip improper tests; warn if padding but no attn mask
* Run formatting etc.
* Run python utils/custom_init_isort.py
* FlexAttention with unpadded sequences (matches FA2 within bf16 numerics)
* Reformat init_weights based on review
* self -> module in attention forwards
* Remove if config.tie_word_embeddings
* Reformat output projection on a different line
* Remove pruning
* Remove assert
* Call contiguous() to simplify paths
* Remove prune_qkv_linear_layer
* Format code
* Keep as kwargs, only use if needed
* Remove unused codepaths & related config options
* Remove 3d attn_mask test; fix token classification tuple output
* Reorder: attention_mask above position_ids, fixes gradient checkpointing
* Fix usage if no FA2 or torch v2.5+
* Make torch.compile/triton optional
Should we rename 'compile'? It's a bit vague
* Separate pooling options into separate functions (cls, mean) - cls as default
* Simplify _pad_modernbert_output, remove unused labels path
* Update tied weights to remove decoder.weight, simplify decoder loading
* Adaptively set config.compile based on hf_device_map/device/resize, etc.
* Update ModernBertConfig docstring
* Satisfy some consistency checks, add unfinished docs
* Only set compile to False if there's more than 1 device
* Add docstrings for public ModernBert classes
* Dont replace docstring returns - ends up being duplicate
* Fix mistake in toctree
* Reformat toctree
* Patched FlexAttention, SDPA, Eager with Local Attention
* Implement FA2 -> SDPA -> Eager attn_impl defaulting, crucial
both to match the original performance, and to get the highest inference speed without requiring users to manually pick FA2
* Patch test edge case with Idefics3 not working with 'attn_implementation="sdpa"'
* Repad all_hidden_states as well
* rename config.compile to reference_compile
* disable flex_attention since it crashes
* Update modernbert.md
* Using dtype min to mask in eager
* Fully remove flex attention for now
It's only compatible with the nightly torch 2.6, so we'll leave it be for now. It's also slower than eager/sdpa.
Also, update compile -> reference_compile in one more case
* Call contiguous to allow for .view()
* Copyright 2020 -> 2024
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update/simplify __init__ structure
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Remove "... if dropout_prob > 0 else identity"
As dropout with 0.0 should be efficient like identity
* re-use existing pad/unpad functions instead of creating new ones
* remove flexattention method
* Compute attention_mask and local_attention_mask once in modeling
* Simplify sequence classification prediction heads, only CLS now
Users can make custom heads if they feel like it
Also removes the unnecessary pool parameter
* Simplify module.training in eager attn
* Also export ModernBertPreTrainedModel
* Update the documentation with links to finetuning scripts
* Explain local_attention_mask parameter in docstring
* Simplify _autoset_attn_implementation, rely on super()
* Keep "in" to initialize Prediction head
Doublechecked with Benjamin that it's correct/what we used for pretraining
* add back mean pooling
* Use the pooling head in TokenClassification
* update copyright
* Reset config._attn_implementation_internal on failure
* Allow optional attention_mask in ForMaskedLM head
* fix failing run_slow tests
* Add links to the paper
* Remove unpad_no_grad, always pad/unpad without gradients
* local_attention_mask -> sliding_window_mask
* Revert "Use the pooling head in TokenClassification"
This reverts commit 99c38badd1.
There was no real motivation, no info on whether having this bigger head does anything useful.
* Simplify pooling, 2 options via if-else
---------
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
Co-authored-by: Said Taghadouini <taghadouinisaid@gmail.com>
Co-authored-by: Benjamin Clavié <ben@clavie.eu>
Co-authored-by: Antoine Chaffin <ant54600@hotmail.fr>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* initial commit for PR
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>
* rename dynamic cache
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* add more unit tests
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* add integration test
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* add integration test
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* Add modular bamba file
* Remove trainer changes from unrelated PR
* Modify modular and cofig to get model running
* Fix some CI errors and beam search
* Fix a plethora of bugs from CI/docs/etc
* Add bamba to models with special caches
* Update to newer mamba PR for mamba sublayer
* fix test_left_padding_compatibility
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* fix style
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* fix remaining tests
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* missed this test
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* ran make style
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* move slow tag to integration obj
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* make style
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* address comments
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* fix modular
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* left out one part of modular
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* change model
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* Make Rotary modular as well
* Update bamba.md
Added overview, update Model inference card and added config
* Update bamba.md
* Update bamba.md
* Update bamba.md
Minor fixes
* Add docs for config and model back
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
* Add warning when using fast kernels
* replaced generate example
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* Address comments from PR
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
* Propagate attention fixes
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
* Fix attention interfaces to the new API
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
* Fix API for decoder layer
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
* Remove extra weights
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
---------
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>
Co-authored-by: Antoni Viros i Martin <aviros@ibm.com>
Co-authored-by: divya-kumari32 <72085811+divya-kumari32@users.noreply.github.com>
Co-authored-by: Antoni Viros <ani300@gmail.com>
* do not remove decoder_input_ids for the first segment
* do not remove eos token in generate_with_fallback
* when removing padding tokens, do not remove eos token
* remove eos token in generate (and not in generate_with_fallback!)
* reconciliate short-from/ long-form behavior
* correct avg_logprobs calculation
* handle eos token in segments
* handle decoder_input_ids and eos token in _prepare_decoder_input_ids
* fix incorrect time precision
* always remove eos token
* always remove decoder_input_ids
* no need to handle decoder_inputs_ids and eos token
* no need to remove decoder_input_ids
* no need to handle eos token
* fix num_beams in _retrieve_logit_processors
* remove todo inconsistency
* no need to add eos token
* last_timestamp_pos should indeed be timestamp token pos
* patch generate to enable compatibility with GenerationTesterMixin tests
* adapt test_generate_continue_from_past_key_values
* adapt test_prompt_lookup_decoding_matches_greedy_search
* adapt generic GenerationMixin tests to whisper's generate
* fix speculative decoding
* fix
* [run-slow] whisper
* change HF_HUB_TOKEN for require_read_token
* [run-slow] whisper
* prioritize kwargs over generation_config
* remove unnecessary args
* [run-slow] whisper
* update tests
* [run-slow] whisper
* add comment
* update test
* [run-slow] whisper
* update test + revert require_read_token
* docstring updates
* revert tokenizer decode args change
* do not use a patch + docstring updates
* [run-slow] whisper
* make
* [run-slow] whisper
* add a flag to force unique call to generate
* test update
* [run-slow] whisper
* add force_unique_generate_call arg
* do not use a patch
* correct the timestamps for the pad tokens
* docstring update
* docstring update
* docstring update
* update TF tests
* add require_read_token
* [run-slow] whisper
* test reset dynamo
* [run-slow] whisper
* fix
* [run-slow] whisper
* avoid iterating twice on current_segments
* [run-slow] whisper
* [run-slow] whisper
---------
Co-authored-by: Eustache Le Bihan <eustlb@users.noreply.huggingface.co>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* feat: add support for sdpa and gradient checkpointing
* fix: ruff format
* fix: config sdpa
* fix: sdpa layer naming convention
* fix: update test_eager_matches_sdpa_inference to handle vision_hidden_states
* test: skip incompatible tests and fix loading issue with sdpa
- Updated tests to skip flash and dynamic compile cases.
- Minor adjustment to ensure correct loading of model with sdpa for dispatch test.
* style: apply Ruff formatting
* ruff fix again after rebase
* [run-slow] sam
* [run-slow] sam
* refactor: Address review comments and improve sub-config handling in SAM model tests
- Added attributes for sub_configs as per PR #34410.
- Enabled tests for configs, ensuring the composite model (SAM) has several sub-configs in the main config.
- Added class attribute _is_composite=True to the tester class
- test_sdpa_can_dispatch_composite_models added
* [run-slow] sam
* style: ruff
* [run-slow] sam
* style: ruff again ...
* [run-slow] sam
* refactor image_processing_auto logic
* fix fast image processor tests
* Fix tests fast vit image processor
* Add safeguard when use_fast True and torchvision not available
* change default use_fast back to None, add warnings
* remove debugging print
* call get_image_processor_class_from_name once
* add more cases
* fix method not found in unittest
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
* fix more cases
* add more models
* add all
* no unittest.case
* remove for oneformer
* fix style
---------
Signed-off-by: Lin, Fanli <fanli.lin@intel.com>
* draft, run model in compressed/uncompressed mode
* draft
* run run_compressed=False
* run_compressed as attr
* set run_compressed=False using quantization_config
* remove redundant line
* make is_qat_trainable dependent on run_compressed status
* add tests
* lint
* fill in docstring
* add decompress
* comments
* decompress if model is compressed and not run_compressed
* apply_quant_config logic fix -- populate statedict properly
* comments
* remove non compressed model
* make is_compressed as property
* cosmetic
* run apply_quant_config for non-compressed models -- populate scales and zeropoints
* add pathway for decompressing sparse models
* typo on is_quantization_compressed
* lint
* fix typo
* Add files
* Init
* Add TimmWrapperModel
* Fix up
* Some fixes
* Fix up
* Remove old file
* Sort out import orders
* Fix some model loading
* Compatible with pipeline and trainer
* Fix up
* Delete test_timm_model_1/config.json
* Remove accidentally committed files
* Delete src/transformers/models/modeling_timm_wrapper.py
* Remove empty imports; fix transformations applied
* Tidy up
* Add image classification model to special cases
* Create pretrained model; enable device_map='auto'
* Enable most tests; fix init order
* Sort imports
* [run-slow] timm_wrapper
* Pass num_classes into timm.create_model
* Remove train transforms from image processor
* Update timm creation with pretrained=False
* Fix gamma/beta issue for timm models
* Fixing gamma and beta renaming for timm models
* Simplify config and model creation
* Remove attn_implementation diff
* Fixup
* Docstrings
* Fix warning msg text according to test case
* Fix device_map auto
* Set dtype and device for pixel_values in forward
* Enable output hidden states
* Enable tests for hidden_states and model parallel
* Remove default scriptable arg
* Refactor inner model
* Update timm version
* Fix _find_mismatched_keys function
* Change inheritance for Classification model (fix weights loading with device_map)
* Minor bugfix
* Disable save pretrained for image processor
* Rename hook method for loaded keys correction
* Rename state dict keys on save, remove `timm_model` prefix, make checkpoint compatible with `timm`
* Managing num_labels <-> num_classes attributes
* Enable loading checkpoints in Trainer to resume training
* Update error message for output_hidden_states
* Add output hidden states test
* Decouple base and classification models
* Add more test cases
* Add save-load-to-timm test
* Fix test name
* Fixup
* Add do_pooling
* Add test for do_pooling
* Fix doc
* Add tests for TimmWrapperModel
* Add validation for `num_classes=0` in timm config + test for DINO checkpoint
* Adjust atol for test
* Fix docs
* dev-ci
* dev-ci
* Add tests for image processor
* Update docs
* Update init to new format
* Update docs in configuration
* Fix some docs in image processor
* Improve docs for modeling
* fix for is_timm_checkpoint
* Update code examples
* Fix header
* Fix typehint
* Increase tolerance a bit
* Fix Path
* Fixing model parallel tests
* Disable "parallel" tests
* Add comment for metadata
* Refactor AutoImageProcessor for timm wrapper loading
* Remove custom test_model_outputs_equivalence
* Add require_timm decorator
* Fix comment
* Make image processor work with older timm versions and tensor input
* Save config instead of whole model in image processor tests
* Add docstring for `image_processor_filename`
* Sanitize kwargs for timm image processor
* Fix doc style
* Update check for tensor input
* Update normalize
* Remove _load_timm_model function
---------
Co-authored-by: Amy Roberts <22614925+amyeroberts@users.noreply.github.com>
Original issue: https://github.com/huggingface/peft/issues/2256
There is a potential error when using load_best_model_at_end=True with a
prompt learning PEFT method. This is because Trainer uses load_adapter
under the hood but with some prompt learning methods, there is an
optimization on the saved model to remove parameters that are not
required for inference, which in turn requires a change to the model
architecture. This is why load_adapter will fail in such cases and users
should instead set load_best_model_at_end=False and use
PeftModel.from_pretrained. As this is not obvious, we now intercept the
error and add a helpful error message.
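A hedged sketch of the workaround pointed to by the new error message (the model id and checkpoint path are placeholders): disable `load_best_model_at_end` and reload the best prompt-learning checkpoint through PEFT directly.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, TrainingArguments

args = TrainingArguments(output_dir="out", load_best_model_at_end=False)
# ... train with Trainer(model=base_model, args=args, ...) as usual ...

base_model = AutoModelForCausalLM.from_pretrained("my-base-model")        # placeholder id
best_model = PeftModel.from_pretrained(base_model, "out/checkpoint-123")  # placeholder path
```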
* Support BatchNorm in Hubert pos_conv_emb as in fairseq
* Correct the new defaults (#34377)
* Correct the new defaults
* CIs
* add check
* Update utils.py
* Update utils.py
* Add the max_length in generate test checking shape without passing length
* style
* CIs
* fix fx CI issue
* [auto. ping] Avoid sending empty info + add more team members (#34383)
* update
* update
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Fix glm (#34388)
* Fix duplicated
* fix import
* Use non nested images and batched text Idefics2/3 (#34222)
* add support for non nested images and add tests
* add tests error scenario
* fix style
* added single and no image to error tests
* Fix onnx non-exportable inplace aten op (#34376)
* fix onnx non-exportable inplace op
* mistral, qwen2, qwen2_vl, starcoder2
* fixup copies
* Fix right padding in LLaVA models (#34305)
* fix right pad llavas
* device mismatch
* no filter (#34391)
* no filter
* no filter
* no filter
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* SynthID: better example (#34372)
* better example
* Update src/transformers/generation/configuration_utils.py
* Update src/transformers/generation/logits_process.py
* nits
* Tests: upgrade `test_eager_matches_sdpa_generate` (#34386)
* Fix bnb training test failure (#34414)
* Fix bnb training test: compatibility with OPTSdpaAttention
* Avoid check expected exception when it is on CUDA (#34408)
* update
* update
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Fix typos in agents_advanced.md (#34405)
* [docs] Cache implementations (#34325)
cache
* [run-slow] hubert
* Support BatchNorm in Hubert pos_conv_emb as in fairseq
Add conversion integration test, and make batchnorm explicit variable
* Support BatchNorm in Hubert pos_conv_emb as in fairseq
fix make fixup styling changes
* [run-slow] hubert
* Support BatchNorm in Hubert pos_conv_emb as in fairseq
* [run-slow] hubert
* Support BatchNorm in Hubert pos_conv_emb as in fairseq
Add conversion integration test, and make batchnorm explicit variable
* Support BatchNorm in Hubert pos_conv_emb as in fairseq
fix make fixup styling changes
* [run-slow] hubert
* [run-slow] hubert
---------
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Yoni Gozlan <74535834+yonigozlan@users.noreply.github.com>
Co-authored-by: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
Co-authored-by: Raushan Turganbay <raushan@huggingface.co>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>
Co-authored-by: Rudy Delouya <rudy.delouya@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com>
* fix GA bugs and add unit test
* narrow down model loss unit test diff gap
* format code to make ruff happy
* send num_items_in_batch argument to decoder
* fix GA loss bug in BertLMHeadModel
* use TinyStories-33M to narrow down diff gap
* format code
* missing .config
* avoid add extra args
---------
Co-authored-by: kangsheng <kangsheng@meituan.com>
* gpt neox flex attention + refactor
* some formatting
* small fix on dropout
* add assertion on flex attn test
* flaky ci :(
* add head mask support
* style
* handle dtype, replace torch where
* fixup flex with output attns
* code review and several other fixes
* Update src/transformers/modeling_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* style
* remove unnecessary comment
* remove incorrect comment
* make flex attn check more agnostic tor versions and centralized
* change peft input dtype check to value since q and k could be affected by other stuff like RoPE
* I forgot
* flaky
* code review and small fixes
* Update src/transformers/models/gpt_neox/modeling_gpt_neox.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Use torch.nn.attention.sdpa_kernel instead of deprecated torch.backends.cuda.sdp_kernel
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
* Fix test_eager_matches_sdpa_inference for XPU backend
As of PyTorch 2.5 the XPU backend supports only torch.nn.attention.SDPBackend.MATH,
which is implemented at the PyTorch level using aten operators and is device
agnostic with respect to the implementation of each aten operator. Thus, we can
reuse CUDA (or CPU) MATH weights for XPU.
Fixes: #34888
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
* Use torch.amp.autocast instead of deprecated torch.cuda.amp.autocast in nemotron
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
---------
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
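A short sketch of the API swap described in the first commit: the deprecated `torch.backends.cuda.sdp_kernel` context manager is replaced by `torch.nn.attention.sdpa_kernel`, here restricted to the MATH backend (the only one available on XPU as of PyTorch 2.5).

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(1, 2, 4, 8)
# Force the device-agnostic MATH implementation of scaled_dot_product_attention.
with sdpa_kernel(SDPBackend.MATH):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 2, 4, 8])
```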
* [PEFT] Set eval mode when loading PEFT adapter
Resolves #34469
When calling model.load_adapter to load a PEFT adapter, by default the
adapter should be set to eval mode. This is now correctly done. Users
can still pass is_trainable=True to load the adapter in training mode.
* Linter
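A hedged sketch of the behaviour above (model and adapter ids are placeholders): adapters now load in eval mode by default, and `is_trainable=True` restores training-mode loading.

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("my-base-model")  # placeholder id
model.load_adapter("my-peft-adapter")                          # loaded in eval mode by default
model.load_adapter("my-peft-adapter", adapter_name="trainable", is_trainable=True)
```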
* Initial draft
* Add .jinja file loading for processors
* Add processor saving of naked chat template files
* make fixup
* Add save-load test for tokenizers
* Add save-load test for tokenizers
* stash commit
* Try popping the file
* make fixup
* Pop the arg correctly
* Pop the arg correctly
* Add processor test
* Fix processor code
* stash commit
* Processor clobbers child tokenizer's chat template
* Processor clobbers child tokenizer's chat template
* make fixup
* Split processor/tokenizer files to avoid interactions
* fix test
* Expand processor tests
* Rename arg to "save_raw_chat_template" across all classes
* Update processor warning
* Move templates to single file
* Move templates to single file
* Improve testing for processor/tokenizer clashes
* Improve testing for processor/tokenizer clashes
* Extend saving test
* Test file priority correctly
* make fixup
* Don't pop the chat template file before the slow tokenizer gets a look
* Remove breakpoint
* make fixup
* Fix error
* fix test_tiny_timestamp_generation
* fix test_large_timestamp_generation
* fix test_whisper_shortform_single_batch_prev_cond
* fix test_whisper_shortform_multi_batch_hard_prev_cond
* return_timestamps necessary with long form
* fix test_default_multilingual_transcription_long_form
* fix test_tiny_token_timestamp_generation_longform
* fix test_whisper_longform_multi_batch_hard
* Update tests/models/whisper/test_modeling_whisper.py
Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com>
* fix typo
* do not expect special tokens
* fix test_whisper_longform_single_batch_beam
* fix test_whisper_longform_multi_batch_hard_prev_cond
* update test_whisper_longform_multi_batch_hard_prev_cond
* update test_whisper_longform_multi_batch_hard_prev_cond
* these tests do not make sense anymore
* this test does not make sense anymore
* make fixup
* suggested nits
* add test with forced_decoder_ids
* this test does not make sense anymore
* change assert for unittest test cases
* make fixup
* test with prompt_ids and task and language
* fix unittest test case call
* fix test_tiny_generation
* fix test_tiny_en_generation
* fix test_tiny_en_batched_generation
* fix test_tiny_longform_timestamps_generation
* fix test_tiny_timestamp_generation
* fix test_large_generation
* fix test_large_batched_generation
* fix test_large_generation_multilingual
* fix test_large_timestamp_generation
* fix test_large_timestamp_generation
* fix test_tiny_token_timestamp_generation_longform
* fix test_tiny_en_batched_generation
* make fixup
* [run-slow] whisper
---------
Co-authored-by: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com>
* allow unused parameter passthrough when chunking in asr pipelines
* format code
* format
* run fixup
* update tests
* update parameters to pipeline in test
* update parameters in tests
* change spelling in gitignore
* revert .gitignore to main
* add git ignore of devcontainer folder
* assert asr output follows expected inference output type
* run fixup
* Remove .devcontainer from .gitignore
* remove compliance check
* Add Nemotron GGUF Loading Support
* fix the Nemotron architecture assignment
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Do not load for meta device
* Make some minor improvements
* Add test
* Update tests/utils/test_modeling_utils.py
Update test parameters
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Make the test simpler
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* add deformable detr image processor fast
* add fast processor to doc
* fix copies
* nit docstring
* Add tests gpu/cpu and fix docstrings
* fix docstring
* import changes from detr
* fix imports
* rebase and fix
* fix input data format change in detr and rtdetr fast
* add support for openai api image_url input
* change continue to elif
* Explicitly add support for OpenAI/TGI chat format
* rewrite content to transformers chat format and add tests (see the sketch after this block)
* Add support for typing of image type in chat templates
* add base64 to possible image types
* refactor nesting
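A hedged illustration of the message shapes involved (URLs are placeholders, and the rewritten form is approximate): an OpenAI/TGI-style `image_url` entry is accepted and rewritten to the transformers chat format before templating.

```python
# OpenAI/TGI-style input now accepted by the chat handling.
openai_style_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]

# Roughly the equivalent transformers-style message after rewriting.
transformers_style_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is in this image?"},
        ],
    }
]
```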
* softcapping
* soft cap before the mask
* style
* ...
* super nit
* update
* fixes
* update
* small issue with modular
* fix modular imports
* update
* fixup
* simplify a hell lot
* simplify cleaning imports
* finish fixing
* update our design
* nits
* use a deprecation cycle
* updates
* Fix modular (recursive deps need to always be computed after merges!)
* push
* fix
* update
* fix modular order
* make fix-copies
* updates
* update
* ?
* don't compile for now
* ?
* fix some stuff
* donc!
* fix copies
* update
* fixup
* ?
* fix two tests
* fix?
* for now, don't use head info
* eager when outputting attentions, and sdpa or flash otherwise, as it's the simplest behaviour (for our tests as well :))
* fix-copies
* revert sdpa check
* Apply suggestions from code review
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
* rebase, fix-copies and push
* add a slow integration test
* update the test
* fix left padding issue
* fix test
* remove duplicate scaling
* quality
* add a small test and make sure it works
* 2b
---------
Co-authored-by: Cyril Vallez <cyril.vallez@gmail.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
19d58d31f introduced a context manager to manage subtests of
test_training_gradient_checkpointing. However, the test body was not
moved under the "with" statement. Thus, while tests were correctly
marked as skipped, the test bodies were still executed. In some cases,
as with llama, this caused attribute errors.
Fixes: #34722
Fixes: 19d58d31f ("Add MLLama (#33703)")
Signed-off-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
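A hedged sketch of the fix described above, using `unittest.subTest` as a stand-in for the actual context manager introduced in 19d58d31f: the body of each subtest has to live inside the `with` block, otherwise it keeps running even when the subtest is skipped.

```python
import unittest


class GradientCheckpointingTest(unittest.TestCase):
    def test_training_gradient_checkpointing(self):
        for kwargs in (None, {"use_reentrant": True}, {"use_reentrant": False}):
            with self.subTest(gradient_checkpointing_kwargs=kwargs):
                # Previously this call sat outside the "with" block, so it ran
                # even when the subtest had been marked as skipped.
                self._run_training(kwargs)

    def _run_training(self, kwargs):
        ...  # placeholder for the real per-model training step


if __name__ == "__main__":
    unittest.main()
```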
* Add model skeleton with transformers-cli add-new-model-like
* Convert config to modular, add rms_norm_eps, delete clip_qkv
* Convert model to modular, add RMSNorm
* Add flash attention with qk norm and no qkv clipping
* Add decoder layer with RMSNorm after attention/feedforward layers
* Add base and causal model
* Add converter improvements from OLMo repo
* Update weight loading in OLMo to HF converter
* Set correct default for rms_norm_eps
* Set correct pipeline_model_mapping in test
* Run make fixup
* Fix model type
* Re-run modular conversion
* Manually set config docs to fix build errors
* Convert olmo-1124 to olmo_1124 to fix flash attention docs errors
* Start updating tests
* Update tests
* Copy upstream test_eager_matches_sdpa_inference_1_bfloat16 changes to olmo_1124
* Rename input_layernorm and post_attention_layernorm to reflect their ops better
* Use correct tokenizer
* Remove test unsupported by GPT2 tokenizer
* Create GenerationConfig outside of from_pretrained call
* Use simpler init file structure
* Add explicit __all__ to support simplified init
* Make safetensor serialization the default
* Update OLMo November 2024 docs
* remove v4.44 deprecations
* PR comments
* deprecations scheduled for v4.50
* hub version update
* make fixup
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Remove FSDP wrapping from sub-models.
* solve conflict trainer.py
* make fixup
* add unit test for fsdp_auto_wrap_policy when using auto_find_batch_size
* put back extract_model_from_parallel
* use transformers unwrap_model
* Retain newlines in chat template when
* Add try/except
* Add regression test
* Simplify test
* Apply suggestions from code review
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
---------
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>