* fix gptq creation when optimum is not installed + fix metadata checking
* fix compressed tensors as well
* style
* pray for ci luck on flaky tests :prayge:
* trigger ci
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
* Make kwargs uniform for SAM
* Remove unused attribute
* Make point_pad_value part of image_kwargs
* Update annotations
* Code review - use existing methods
* Use ProcessorTesterMixin
* Do not add ProcessorTesterMixin everywhere
* Improve modular transformers documentation
- Adds hints to general contribution guides
- Lists which utils scripts are available to generate single-files from modular files and check their content
* Show commands in copyable code cells
---------
Co-authored-by: Joel Koch <joel@bitcrowd.net>
* bugfix: torch.export failure caused by `_make_causal_mask`
Recent changes in torch dynamo prevent mutations on tensors converted with aten::_to_copy. To address this, we can clone such tensor before performing in-place operation `masked_fill_` only when the code is being compiled by torch dynamo.
(relevant issue: https://github.com/pytorch/pytorch/issues/127571)
* chore: use `is_torchdynamo_compiling` instead of `torch._dynamo.is_compiling`
* fixup mamba2 - caching and several other small fixes
* fixup cached forward
* correct fix this time
* fixup cache - we do not need to extend the attn mask it's handled by generate (gives total ids + mask at each step)
* remove unnecessary (un)squeeze
* fixup cache position
* simplify a few things
* [run-slow] mamba2
* multi gpu attempt two
* [run-slow] mamba2
* [run-slow] mamba2
* [run-slow] mamba2
* [run-slow] mamba2
* add newer slow path fix
* [run-slow] mamba2
* initial cut of modernbert for transformers
* small bug fixes
* fixes
* Update import
* Use compiled mlp->mlp_norm to match research implementation
* Propagate changes in modular to modeling
* Replace duplicate attn_out_dropout in favor of attention_dropout
cc @warner-benjamin let me know if the two should remain separate!
* Update BOS to CLS and EOS to SEP
Please confirm @warner-benjamin
* Set default classifier bias to False, matching research repo
* Update tie_word_embeddings description
* Fix _init_weights for ForMaskedLM
* Match base_model_prefix
* Add compiled_head to match research repo outputs
* Fix imports for ModernBertForMaskedLM
* Just use "gelu" default outright for classifier
* Fix config name typo: initalizer -> initializer
* Remove some unused parameters in docstring. Still lots to edit there!
* Compile the embeddings forward
Not having this resulted in very slight differences - so small it wasn't even noticed for the base model, only for the large model.
But the tiny difference for large propagated at the embedding layer through the rest of the model, leading to notable differences of ~0.0084 average per value, up to 0.2343 for the worst case.
* Add drafts for ForSequenceClassification/ForTokenClassification
* Add initial SDPA support (not exactly equivalent to FA2 yet!)
During testing, FA2 and SDPA still differ by about 0.0098 per value in the token embeddings. It still predicts the correct mask fills, but I'd like to get it fully 1-1 if possible.
* Only use attention dropout if training
* Add initial eager attention support (also not equivalent to FA2 yet!)
Frustratingly, I also can't get eager to be equivalent to FA2 (or sdpa), but it does get really close, i.e. avg ~0.010 difference per value.
Especially if I use fp32 for both FA2&eager, avg ~0.0029 difference per value
The fill-mask results are good with eager.
* Add initial tests, output_attentions, output_hidden_states, prune_heads
Tests are based on BERT, not all tests pass yet: 23 failed, 79 passed, 100 skipped
* Remove kwargs from ModernBertForMaskedLM
Disable sparse_prediction by default to match the normal HF, can be enabled via config
* Remove/adjust/skip improper tests; warn if padding but no attn mask
* Run formatting etc.
* Run python utils/custom_init_isort.py
* FlexAttention with unpadded sequences(matches FA2 within bf16 numerics)
* Reformat init_weights based on review
* self -> module in attention forwards
* Remove if config.tie_word_embeddings
* Reformat output projection on a different line
* Remove pruning
* Remove assert
* Call contiguous() to simplify paths
* Remove prune_qkv_linear_layer
* Format code
* Keep as kwargs, only use if needed
* Remove unused codepaths & related config options
* Remove 3d attn_mask test; fix token classification tuple output
* Reorder: attention_mask above position_ids, fixes gradient checkpointing
* Fix usage if no FA2 or torch v2.5+
* Make torch.compile/triton optional
Should we rename 'compile'? It's a bit vague
* Separate pooling options into separate functions (cls, mean) - cls as default
* Simplify _pad_modernbert_output, remove unused labels path
* Update tied weights to remove decoder.weight, simplify decoder loading
* Adaptively set config.compile based on hf_device_map/device/resize, etc.
* Update ModernBertConfig docstring
* Satisfy some consistency checks, add unfinished docs
* Only set compile to False if there's more than 1 device
* Add docstrings for public ModernBert classes
* Dont replace docstring returns - ends up being duplicate
* Fix mistake in toctree
* Reformat toctree
* Patched FlexAttention, SDPA, Eager with Local Attention
* Implement FA2 -> SDPA -> Eager attn_impl defaulting, crucial
both to match the original performance, and to get the highest inference speed without requiring users to manually pick FA2
* Patch test edge case with Idefics3 not working with 'attn_implementation="sdpa"'
* Repad all_hidden_states as well
* rename config.compile to reference_compile
* disable flex_attention since it crashes
* Update modernbert.md
* Using dtype min to mask in eager
* Fully remove flex attention for now
It's only compatible with the nightly torch 2.6, so we'll leave it be for now. It's also slower than eager/sdpa.
Also, update compile -> reference_compile in one more case
* Call contiguous to allow for .view()
* Copyright 2020 -> 2024
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Update/simplify __init__ structure
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* Remove "... if dropout_prob > 0 else identity"
As dropout with 0.0 should be efficient like identity
* re-use existing pad/unpad functions instead of creating new ones
* remove flexattention method
* Compute attention_mask and local_attention_mask once in modeling
* Simplify sequence classification prediction heads, only CLS now
Users can make custom heads if they feel like it
Also removes the unnecessary pool parameter
* Simplify module.training in eager attn
* Also export ModernBertPreTrainedModel
* Update the documentation with links to finetuning scripts
* Explain local_attention_mask parameter in docstring
* Simplify _autoset_attn_implementation, rely on super()
* Keep "in" to initialize Prediction head
Doublechecked with Benjamin that it's correct/what we used for pretraining
* add back mean pooling
* Use the pooling head in TokenClassification
* update copyright
* Reset config._attn_implementation_internal on failure
* Allow optional attention_mask in ForMaskedLM head
* fix failing run_slow tests
* Add links to the paper
* Remove unpad_no_grad, always pad/unpad without gradients
* local_attention_mask -> sliding_window_mask
* Revert "Use the pooling head in TokenClassification"
This reverts commit 99c38badd1.
There was no real motivation, no info on whether having this bigger head does anything useful.
* Simplify pooling, 2 options via if-else
---------
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
Co-authored-by: Tom Aarsen <Cubiegamedev@gmail.com>
Co-authored-by: Said Taghadouini <taghadouinisaid@gmail.com>
Co-authored-by: Benjamin Clavié <ben@clavie.eu>
Co-authored-by: Antoine Chaffin <ant54600@hotmail.fr>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* docs: fix typo quickstart snippet in ColPali's model card
* docs: clean the ColPali's model card
* docs: make the `ColPaliForRetrieval`'s docstring more concise
* docs: add missing bash command used to convert weights for `vidore/colpali-v1.3-hf`
* initial commit for PR
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>
* rename dynamic cache
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* add more unit tests
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* add integration test
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* add integration test
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* Add modular bamba file
* Remove trainer changes from unrelated PR
* Modify modular and cofig to get model running
* Fix some CI errors and beam search
* Fix a plethora of bugs from CI/docs/etc
* Add bamba to models with special caches
* Updat to newer mamba PR for mamba sublayer
* fix test_left_padding_compatibility
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* fix style
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* fix remaining tests
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* missed this test
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* ran make style
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* move slow tag to integration obj
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* make style
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* address comments
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* fix modular
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* left out one part of modular
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* change model
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* Make Rotary modular as well
* Update bamba.md
Added overview, update Model inference card and added config
* Update bamba.md
* Update bamba.md
* Update bamba.md
Minor fixes
* Add docs for config and model back
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
* Add warning when using fast kernels
* replaced generate example
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
* Address comments from PR
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
* Propagate attention fixes
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
* Fix attention interfaces to the new API
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
* Fix API for decoder layer
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
* Remove extra weights
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
---------
Signed-off-by: Yu Chin Fabian Lim <flim@sg.ibm.com>
Signed-off-by: Antoni Viros i Martin <aviros@ibm.com>
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>
Co-authored-by: Antoni Viros i Martin <aviros@ibm.com>
Co-authored-by: divya-kumari32 <72085811+divya-kumari32@users.noreply.github.com>
Co-authored-by: Antoni Viros <ani300@gmail.com>
* feat: add `benchmarks_entrypoint.py`
Adding `benchmarks_entrypoint.py` file, which will be run from the
benchmarks CI.
This python script will list all python files from the `benchmark/`
folder and run the included `run_benchmark` function, allowing people to
add new benchmarks scripts.
* feat: add `MetricsRecorder`
* feat: update dashboard
* fix: add missing arguments to `MetricsRecorder`
* feat: update dash & add datasource + `default.yml`
* fix: move responsibility to create `MetricsRecorder` in bench script
* fix: update incorrect datasource UID
* fix: incorrect variable values
* debug: benchmark entrypoint script
* refactor: update log level
* fix: update broken import
* feat: add debug log in `MetricsRecorder`
* debug: set log level to debug
* fix: set connection `autocommit` to `True`