Mirror of https://github.com/huggingface/transformers.git, synced 2025-07-15 10:38:23 +06:00

* Initial commit with template code generated by transformers-cli
* Multiple additions to the SuperGlue implementation:
- Added the SuperGlueConfig
- Added the SuperGlueModel and its implementation
- Added basic weight conversion script
- Added new ImageMatchingOutput dataclass
* Few changes for SuperGlue
* Multiple changes:
- Added keypoint detection config to SuperGlueConfig
- Completed convert_superglue_to_pytorch and successfully ran inference
* Reverted unintentional change
* Multiple changes:
- Added SuperGlue to a bunch of places
- Divided SuperGlue into SuperGlueForImageMatching and SuperGlueModel
- Added testing images
* Moved things in init files
* Added docs (to be finished depending on the final implementation)
* Added necessary imports and some doc
* Removed unnecessary import
* Fixed make fix-copies bug and ran it
* Deleted SuperGlueModel
Fixed convert script
* Added SuperGlueImageProcessor
* Changed SuperGlue to support batching pairs of images and modified ImageMatchingOutput accordingly
* Changed convert_superglue_to_hf.py script to experiment with different ways of reading an image and measure their impact on performance
* Added initial tests for SuperGlueImageProcessor
* Added AutoModelForImageMatching in missing places and tests
* Fixed keypoint_detector_output instructions
* Fix style
* Adapted to latest main changes
* Added integration test
* Fixed bugs to pass tests
* Added keypoints returned by keypoint detector in the output of SuperGlue
* Added doc to SuperGlue
* SuperGlue returning all attention and hidden states for a fixed number of keypoints
* Make style
* Changed SuperGlueImageProcessor tests
* Revert "SuperGlue returning all attention and hidden states for a fixed number of keypoints"
Changed tests accordingly
This reverts commit 5b3b669c
* Added back hidden_states and attentions masked outputs with tests
* Renamed ImageMatching occurrences into KeypointMatching
* Changed SuperGlueImageProcessor to raise error when batch_size is not even
* Added docs and clarity to hidden state and attention grouping function
* Fixed some code and done refactoring
* Fixed typo in SuperPoint output doc
* Fixed some of the formatting and variable naming problems
* Removed useless function call
* Removed AutoModelForKeypointMatching
* Fixed SuperGlueImageProcessor to only accept pairs of images
* Added more fixes to SuperGlueImageProcessor
* Simplified the batching of attention and hidden states
* Simplified stack functions
* Moved attention instructions into class
* Removed unused do_batch_norm argument
* Moved weight initialization to the proper place
* Replaced deepcopy for instantiation
* Fixed small bug
* Changed from stevenbucaille to magic-leap repo
* Renamed London Bridge images to Tower Bridge
* Fixed formatting
* Renamed remaining "london" to "tower"
* Apply suggestions from code review
Small changes in the docs
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Added AutoModelForKeypointMatching
* Changed images used in example
* Several changes to image_processing_superglue and style
* Fixed resample type hint
* Changed SuperGlueImageProcessor and added test case for list of 2 images
* Changed list_of_tuples implementation
* Fix in dummy objects
* Added normalize_keypoint, log_sinkhorn_iterations and log_optimal_transport docstring
* Added missing docstring
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* Moved forward block at bottom
* Added docstring to forward method
* Added docstring to match_image_pair method
* Changed test_model_common_attributes to test_model_get_set_embeddings test method signature
* Removed AutoModelForKeypointMatching
* Removed image fixtures and added load_dataset
* Added padding of images in SuperGlueImageProcessor
* Cleaned up convert_superglue_to_hf script
* Added missing docs and fixed unused argument
* Fixed SuperGlueImageProcessor tests
* Transposed all hidden states from SuperGlue to reflect the standard (..., seq_len, feature_dim) shape
* Added SuperGlueForKeypointMatching back to modeling_auto
* Fixed image processor padding test
* Changed SuperGlue docs
* changes:
- Abstraction to batch, concat and stack of inconsistent tensors
- Changed Conv1d layers to Linear layers to match standard attention implementations
- Renamed all tensors to tensor0 instead of tensor_0 for consistency
- Changed match_image_pair to run keypoint detection on all images first, create batching tensors, and then fill these tensors match after match
- Various changes in docs, etc
* Changes to SuperGlueImageProcessor:
- Reworked the input image pairs checking function and added tests accordingly
- Added Copied from statements
- Added do_grayscale tag (also for SuperPointImageProcessor)
- Misc changes for better code
* Formatting changes
* Reverted conv1d to linear conversion because of numerical differences
* fix: changed some code to be more straightforward (e.g. filtering keypoints) and converted plot from opencv to matplotlib
* fix: removed unnecessary test
* chore: removed commented code and added back hidden states transpositions
* chore: changed from "inconsistent" to "ragged" function names as suggested
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* docs: applied suggestions
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* docs: updated to display matched output
* chore: applied suggestion for check_image_pairs_input function
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
* chore: changed check_image_pairs_input function name to validate_and_format_image_pairs and used validate_preprocess_arguments function
* tests: simplified tests for image input format and shapes
* feat: converted SuperGlue's use of Conv1d with kernel_size of 1 to Linear layers. Changed tests and conversion script accordingly
* feat: several changes to address comments
Conversion script:
- Reverted fuse batchnorm to linear conversion
- Changed all 'nn.Module' to respective SuperGlue models
- Changed conversion script to use regex mapping and match other recent scripts
Modeling SuperGlue:
- Added batching with mask and padding to attention
- Removed unnecessary concat, stack and batch ragged pairs functions
- Reverted batchnorm layer
- Renamed query, key, value and merge layers into q, k, v, out proj
- Replaced the Union of different Modules with nn.Module in the _init_weights method typehint
- Changed several methods' signatures to combine image0 and image1 inputs, with appropriate doc changes
- Updated SuperGlue's doc with torch.no_grad()
Updated test to reflect changes in SuperGlue model
* refactor: changed validate_and_format_image_pairs function with clarity
* refactor: changed from one SuperGlueMLP class to a list of SuperGlueMLP instances
* fix: fixed forgotten init weight change from last commit
* fix: fixed rebase mistake
* fix: removed leftover commented code
* fix: added typehints and changed some argument default values
* fix: fixed attribute default values for SuperGlueConfig
* feat: added SuperGlueImageProcessor post process keypoint matching method with tests
* fix: fixed SuperGlue attention and hidden state tuples aggregation
* chore: fixed mask optionality and reordered tensor reshapes to be cleaner
* chore: fixed docs and error message returned in validate_and_format_image_pairs function
* fix: fixed returned keypoints to be the ones that SuperPoint returns
* fix: fixed check on number of image sizes for post process compared to the pairs in outputs of SuperGlue
* fix: fixed check on number of image sizes for post process compared to the pairs in outputs of SuperGlue (bis)
* fix: Changed SuperGlueMultiLayerPerceptron instantiation to avoid if statement
* fix: Changed convert_superglue_to_hf script to reflect latest SuperGlue changes and got rid of nn.Modules
* WIP: implement Attention from an existing class (like BERT)
* docs: Changed docs to include more appealing matching plot
* WIP: Implement Attention
* chore: minor typehint change
* chore: changed the convert superglue script by removing all classes, applying the conv-to-linear conversion in the state dict, and rearranging keys to match the model's new layer organization
* Revert "Fixed typo in SuperPoint output doc"
This reverts commit 2120390e82.
* chore: added comments in SuperGlueImageProcessor
* chore: changed SuperGlue organization HF repo to magic-leap-community
* [run-slow] refactor: small change in layer instantiation
* [run-slow] chore: replaced remaining stevenbucaille org to magic-leap-community
* [run-slow] chore: make style
* chore: update image matching fixture dataset HF repository
* [run-slow] superglue
* tests: overwriting test_batching_equivalence
* [run-slow] superglue
* tests: changed test to cope with value changing depending on cuda version
* [run-slow] superglue
* tests: changed matching_threshold value
* [run-slow] superglue
* [run-slow] superglue
* tests: changed tests for integration
* [run-slow] superglue
* fix: Changed tensor view and permutations to match original implementation results
* fix: updated convert script and integration test to include last change in model
* fix: increase tolerance for CUDA variances
* Apply suggestions from code review
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
* [run-slow] superglue
* chore: removed blank whitespaces
* [run-slow] superglue
* Revert SuperPoint image processor accidental changes
* [run-slow] superglue
* refactor: reverted copy from BERT class
* tests: lower the tolerance in integration tests for SuperGlue
* [run-slow] superglue
* chore: set do_grayscale to False in SuperPoint and SuperGlue image processors
* [run-slow] superglue
* fix: fixed imports in SuperGlue files
* chore: changed do_grayscale default value in SuperGlueImageProcessor to True
* docs: added typehint to post_process_keypoint_matching method in SuperGlueImageProcessor
* fix: set matching_threshold default value to 0.0 instead of 0.2
* feat: added matching_threshold to post_process_keypoint_matching method
* docs: update superglue.md to include matching_threshold parameter
* docs: updated SuperGlueConfig docstring for matching_threshold default value
* refactor: removed unnecessary parameters in SuperGlueConfig
* fix: changed from matching_threshold to threshold
* fix: re-revert changes to make SuperGlue attention classes copies of BERT
* [run-slow] superglue
* fix: added missing device argument in post_processing method
* [run-slow] superglue
* fix: add matches different from -1 to compute valid matches in post_process_keypoint_matching (and docstring)
* fix: add device to image_sizes tensor instantiation
* tests: added checks on do_grayscale test
* chore: reordered and added Optional typehint to KeypointMatchingOutput
* LightGlue PR suggestions:
- use `post_process_keypoint_matching` as default docs example
- add `post_process_keypoint_matching` in autodoc
- add `SuperPointConfig` import under TYPE_CHECKING condition
- format SuperGlueConfig docstring
- add device in convert_superglue_to_hf
- Fix typo
- Fix KeypointMatchingOutput docstring
- Removed unnecessary line
- Added missing SuperGlueConfig in __init__ methods
* LightGlue PR suggestions:
- use batching to get keypoint detection
* refactor: processing images done in 1 for loop instead of 4
* fix: use @ instead of torch.einsum for scores computation
* style: added #fmt skip to long tensor values
* refactor: rolled back validate_and_format_image_pairs valid and invalid cases to simpler ones
* refactor: prepare_imgs
* refactor: simplified `validate_and_format_image_pairs`
* docs: fixed doc
---------
Co-authored-by: steven <steven.bucaillle@gmail.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Steven Bucaille <steven.bucaille@buawei.com>
Co-authored-by: Pavel Iakubovskii <qubvel@gmail.com>
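
Below is a minimal usage sketch of the keypoint-matching workflow described by the commits above. The checkpoint name, processor call, and output field names are assumptions drawn from the commit messages, not verified against the merged code.

```python
# Hedged sketch: match keypoints between a pair of images with SuperGlue.
# Checkpoint name and output keys are assumptions; image paths are placeholders.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("magic-leap-community/superglue_outdoor")
model = AutoModel.from_pretrained("magic-leap-community/superglue_outdoor")

# The image processor validates that images come in pairs (an even number overall).
image0 = Image.open("image0.jpg")
image1 = Image.open("image1.jpg")
inputs = processor([image0, image1], return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Post-process raw outputs into per-pair keypoints and matches, filtered by a
# matching threshold (default 0.0 according to the commit log).
image_sizes = [[(image.height, image.width) for image in [image0, image1]]]
results = processor.post_process_keypoint_matching(outputs, image_sizes, threshold=0.2)
print(results[0]["keypoints0"].shape, results[0]["matching_scores"].shape)
```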
<!--Copyright 2020 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# 🤗 Transformers

State-of-the-art Machine Learning for [PyTorch](https://pytorch.org/), [TensorFlow](https://www.tensorflow.org/), and [JAX](https://jax.readthedocs.io/en/latest/).

🤗 Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs and carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities, such as:

📝 **Natural Language Processing**: text classification, named entity recognition, question answering, language modeling, code generation, summarization, translation, multiple choice, and text generation.<br>
🖼️ **Computer Vision**: image classification, object detection, and segmentation.<br>
🗣️ **Audio**: automatic speech recognition and audio classification.<br>
🐙 **Multimodal**: table question answering, optical character recognition, information extraction from scanned documents, video classification, and visual question answering.
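
Many of these tasks can be run in a couple of lines with the `pipeline` API; the sketch below uses sentiment analysis, and the example text and printed score are illustrative:

```python
from transformers import pipeline

# Downloads a default pretrained model and tokenizer for the task, then runs inference.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers makes state-of-the-art models easy to use."))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]
```

Swapping the task string (for example "automatic-speech-recognition" or "image-classification") covers the other modalities listed above.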
🤗 Transformers supports framework interoperability between PyTorch, TensorFlow, and JAX. This provides the flexibility to use a different framework at each stage of a model's life: train a model in three lines of code in one framework, and load it for inference in another. Models can also be exported to formats like ONNX and TorchScript for deployment in production environments.
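
As a sketch of that interoperability (the checkpoint name and save path below are illustrative), weights saved from a PyTorch model can be reloaded directly into its TensorFlow counterpart:

```python
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification

# Load (or fine-tune) a model in PyTorch and save its weights and config.
pt_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
pt_model.save_pretrained("./my-model")

# Reload the same checkpoint in TensorFlow by converting from the PyTorch weights.
tf_model = TFAutoModelForSequenceClassification.from_pretrained("./my-model", from_pt=True)
```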

Join the growing community on the [Hub](https://huggingface.co/models), [forum](https://discuss.huggingface.co/), or [Discord](https://discord.com/invite/JfAtkvEtRb) today!

## If you are looking for custom support from the Hugging Face team

<a target="_blank" href="https://huggingface.co/support">
    <img alt="HuggingFace Expert Acceleration Program" src="https://cdn-media.huggingface.co/marketing/transformers/new-support-improved.png" style="width: 100%; max-width: 600px; border: 1px solid #eee; border-radius: 4px; box-shadow: 0 1px 2px 0 rgba(0, 0, 0, 0.05);">
</a>

## Contents
The documentation is organized into five sections:

- **GET STARTED** provides a quick tour of the library and installation instructions to get up and running.
- **TUTORIALS** are a great place to start if you're a beginner. This section will help you gain the basic skills you need to start using the library.
- **HOW-TO GUIDES** show you how to achieve a specific goal, like finetuning a pretrained model for language modeling or how to write and share a custom model.
- **CONCEPTUAL GUIDES** offers more discussion and explanation of the underlying concepts and ideas behind models, tasks, and the design philosophy of 🤗 Transformers.
- **API** describes all classes and functions:

  - **MAIN CLASSES** details the most important classes like configuration, model, tokenizer, and pipeline.
  - **MODELS** details the classes and functions related to each model implemented in the library.
  - **INTERNAL HELPERS** details utility classes and functions used internally.

## Supported models and frameworks
The table below shows the current support in the library for each of these models: whether they have support in PyTorch, TensorFlow, and/or JAX (via Flax).

<!--This table is updated automatically from the auto modules with _make fix-copies_. Do not update manually!-->

| Model | PyTorch support | TensorFlow support | Flax Support |
|:------------------------------------------------------------------------:|:---------------:|:------------------:|:------------:|
| [ALBERT](model_doc/albert) | ✅ | ✅ | ✅ |
| [ALIGN](model_doc/align) | ✅ | ❌ | ❌ |
| [AltCLIP](model_doc/altclip) | ✅ | ❌ | ❌ |
| [Aria](model_doc/aria) | ✅ | ❌ | ❌ |
| [AriaText](model_doc/aria_text) | ✅ | ❌ | ❌ |
| [Audio Spectrogram Transformer](model_doc/audio-spectrogram-transformer) | ✅ | ❌ | ❌ |
| [Autoformer](model_doc/autoformer) | ✅ | ❌ | ❌ |
| [Bamba](model_doc/bamba) | ✅ | ❌ | ❌ |
| [Bark](model_doc/bark) | ✅ | ❌ | ❌ |
| [BART](model_doc/bart) | ✅ | ✅ | ✅ |
| [BARThez](model_doc/barthez) | ✅ | ✅ | ✅ |
| [BARTpho](model_doc/bartpho) | ✅ | ✅ | ✅ |
| [BEiT](model_doc/beit) | ✅ | ❌ | ✅ |
| [BERT](model_doc/bert) | ✅ | ✅ | ✅ |
| [Bert Generation](model_doc/bert-generation) | ✅ | ❌ | ❌ |
| [BertJapanese](model_doc/bert-japanese) | ✅ | ✅ | ✅ |
| [BERTweet](model_doc/bertweet) | ✅ | ✅ | ✅ |
| [BigBird](model_doc/big_bird) | ✅ | ❌ | ✅ |
| [BigBird-Pegasus](model_doc/bigbird_pegasus) | ✅ | ❌ | ❌ |
| [BioGpt](model_doc/biogpt) | ✅ | ❌ | ❌ |
| [BiT](model_doc/bit) | ✅ | ❌ | ❌ |
| [Blenderbot](model_doc/blenderbot) | ✅ | ✅ | ✅ |
| [BlenderbotSmall](model_doc/blenderbot-small) | ✅ | ✅ | ✅ |
| [BLIP](model_doc/blip) | ✅ | ✅ | ❌ |
| [BLIP-2](model_doc/blip-2) | ✅ | ❌ | ❌ |
| [BLOOM](model_doc/bloom) | ✅ | ❌ | ✅ |
| [BORT](model_doc/bort) | ✅ | ✅ | ✅ |
| [BridgeTower](model_doc/bridgetower) | ✅ | ❌ | ❌ |
| [BROS](model_doc/bros) | ✅ | ❌ | ❌ |
| [ByT5](model_doc/byt5) | ✅ | ✅ | ✅ |
| [CamemBERT](model_doc/camembert) | ✅ | ✅ | ❌ |
| [CANINE](model_doc/canine) | ✅ | ❌ | ❌ |
| [Chameleon](model_doc/chameleon) | ✅ | ❌ | ❌ |
| [Chinese-CLIP](model_doc/chinese_clip) | ✅ | ❌ | ❌ |
| [CLAP](model_doc/clap) | ✅ | ❌ | ❌ |
| [CLIP](model_doc/clip) | ✅ | ✅ | ✅ |
| [CLIPSeg](model_doc/clipseg) | ✅ | ❌ | ❌ |
| [CLVP](model_doc/clvp) | ✅ | ❌ | ❌ |
| [CodeGen](model_doc/codegen) | ✅ | ❌ | ❌ |
| [CodeLlama](model_doc/code_llama) | ✅ | ❌ | ✅ |
| [Cohere](model_doc/cohere) | ✅ | ❌ | ❌ |
| [Cohere2](model_doc/cohere2) | ✅ | ❌ | ❌ |
| [ColPali](model_doc/colpali) | ✅ | ❌ | ❌ |
| [Conditional DETR](model_doc/conditional_detr) | ✅ | ❌ | ❌ |
| [ConvBERT](model_doc/convbert) | ✅ | ✅ | ❌ |
| [ConvNeXT](model_doc/convnext) | ✅ | ✅ | ❌ |
| [ConvNeXTV2](model_doc/convnextv2) | ✅ | ✅ | ❌ |
| [CPM](model_doc/cpm) | ✅ | ✅ | ✅ |
| [CPM-Ant](model_doc/cpmant) | ✅ | ❌ | ❌ |
| [CTRL](model_doc/ctrl) | ✅ | ✅ | ❌ |
| [CvT](model_doc/cvt) | ✅ | ✅ | ❌ |
| [DAC](model_doc/dac) | ✅ | ❌ | ❌ |
| [Data2VecAudio](model_doc/data2vec) | ✅ | ❌ | ❌ |
| [Data2VecText](model_doc/data2vec) | ✅ | ❌ | ❌ |
| [Data2VecVision](model_doc/data2vec) | ✅ | ✅ | ❌ |
| [DBRX](model_doc/dbrx) | ✅ | ❌ | ❌ |
| [DeBERTa](model_doc/deberta) | ✅ | ✅ | ❌ |
| [DeBERTa-v2](model_doc/deberta-v2) | ✅ | ✅ | ❌ |
| [Decision Transformer](model_doc/decision_transformer) | ✅ | ❌ | ❌ |
| [Deformable DETR](model_doc/deformable_detr) | ✅ | ❌ | ❌ |
| [DeiT](model_doc/deit) | ✅ | ✅ | ❌ |
| [DePlot](model_doc/deplot) | ✅ | ❌ | ❌ |
| [Depth Anything](model_doc/depth_anything) | ✅ | ❌ | ❌ |
| [DETA](model_doc/deta) | ✅ | ❌ | ❌ |
| [DETR](model_doc/detr) | ✅ | ❌ | ❌ |
| [DialoGPT](model_doc/dialogpt) | ✅ | ✅ | ✅ |
| [DiffLlama](model_doc/diffllama) | ✅ | ❌ | ❌ |
| [DiNAT](model_doc/dinat) | ✅ | ❌ | ❌ |
| [DINOv2](model_doc/dinov2) | ✅ | ❌ | ✅ |
| [DINOv2 with Registers](model_doc/dinov2_with_registers) | ✅ | ❌ | ❌ |
| [DistilBERT](model_doc/distilbert) | ✅ | ✅ | ✅ |
| [DiT](model_doc/dit) | ✅ | ❌ | ✅ |
| [DonutSwin](model_doc/donut) | ✅ | ❌ | ❌ |
| [DPR](model_doc/dpr) | ✅ | ✅ | ❌ |
| [DPT](model_doc/dpt) | ✅ | ❌ | ❌ |
| [EfficientFormer](model_doc/efficientformer) | ✅ | ✅ | ❌ |
| [EfficientNet](model_doc/efficientnet) | ✅ | ❌ | ❌ |
| [ELECTRA](model_doc/electra) | ✅ | ✅ | ✅ |
| [Emu3](model_doc/emu3) | ✅ | ❌ | ❌ |
| [EnCodec](model_doc/encodec) | ✅ | ❌ | ❌ |
| [Encoder decoder](model_doc/encoder-decoder) | ✅ | ✅ | ✅ |
| [ERNIE](model_doc/ernie) | ✅ | ❌ | ❌ |
| [ErnieM](model_doc/ernie_m) | ✅ | ❌ | ❌ |
| [ESM](model_doc/esm) | ✅ | ✅ | ❌ |
| [FairSeq Machine-Translation](model_doc/fsmt) | ✅ | ❌ | ❌ |
| [Falcon](model_doc/falcon) | ✅ | ❌ | ❌ |
| [Falcon3](model_doc/falcon3) | ✅ | ❌ | ✅ |
| [FalconMamba](model_doc/falcon_mamba) | ✅ | ❌ | ❌ |
| [FastSpeech2Conformer](model_doc/fastspeech2_conformer) | ✅ | ❌ | ❌ |
| [FLAN-T5](model_doc/flan-t5) | ✅ | ✅ | ✅ |
| [FLAN-UL2](model_doc/flan-ul2) | ✅ | ✅ | ✅ |
| [FlauBERT](model_doc/flaubert) | ✅ | ✅ | ❌ |
| [FLAVA](model_doc/flava) | ✅ | ❌ | ❌ |
| [FNet](model_doc/fnet) | ✅ | ❌ | ❌ |
| [FocalNet](model_doc/focalnet) | ✅ | ❌ | ❌ |
| [Funnel Transformer](model_doc/funnel) | ✅ | ✅ | ❌ |
| [Fuyu](model_doc/fuyu) | ✅ | ❌ | ❌ |
| [Gemma](model_doc/gemma) | ✅ | ❌ | ✅ |
| [Gemma2](model_doc/gemma2) | ✅ | ❌ | ❌ |
| [GIT](model_doc/git) | ✅ | ❌ | ❌ |
| [GLM](model_doc/glm) | ✅ | ❌ | ❌ |
| [GLPN](model_doc/glpn) | ✅ | ❌ | ❌ |
| [GPT Neo](model_doc/gpt_neo) | ✅ | ❌ | ✅ |
| [GPT NeoX](model_doc/gpt_neox) | ✅ | ❌ | ❌ |
| [GPT NeoX Japanese](model_doc/gpt_neox_japanese) | ✅ | ❌ | ❌ |
| [GPT-J](model_doc/gptj) | ✅ | ✅ | ✅ |
| [GPT-Sw3](model_doc/gpt-sw3) | ✅ | ✅ | ✅ |
| [GPTBigCode](model_doc/gpt_bigcode) | ✅ | ❌ | ❌ |
| [GPTSAN-japanese](model_doc/gptsan-japanese) | ✅ | ❌ | ❌ |
| [Granite](model_doc/granite) | ✅ | ❌ | ❌ |
| [GraniteMoeMoe](model_doc/granitemoe) | ✅ | ❌ | ❌ |
| [Graphormer](model_doc/graphormer) | ✅ | ❌ | ❌ |
| [Grounding DINO](model_doc/grounding-dino) | ✅ | ❌ | ❌ |
| [GroupViT](model_doc/groupvit) | ✅ | ✅ | ❌ |
| [Helium](model_doc/helium) | ✅ | ❌ | ❌ |
| [HerBERT](model_doc/herbert) | ✅ | ✅ | ✅ |
| [Hiera](model_doc/hiera) | ✅ | ❌ | ❌ |
| [Hubert](model_doc/hubert) | ✅ | ✅ | ❌ |
| [I-BERT](model_doc/ibert) | ✅ | ❌ | ❌ |
| [I-JEPA](model_doc/ijepa) | ✅ | ❌ | ❌ |
| [IDEFICS](model_doc/idefics) | ✅ | ✅ | ❌ |
| [Idefics2](model_doc/idefics2) | ✅ | ❌ | ❌ |
| [Idefics3](model_doc/idefics3) | ✅ | ❌ | ❌ |
| [Idefics3VisionTransformer](model_doc/idefics3_vision) | ❌ | ❌ | ❌ |
| [ImageGPT](model_doc/imagegpt) | ✅ | ❌ | ❌ |
| [Informer](model_doc/informer) | ✅ | ❌ | ❌ |
| [InstructBLIP](model_doc/instructblip) | ✅ | ❌ | ❌ |
| [InstructBlipVideo](model_doc/instructblipvideo) | ✅ | ❌ | ❌ |
| [Jamba](model_doc/jamba) | ✅ | ❌ | ❌ |
| [JetMoe](model_doc/jetmoe) | ✅ | ❌ | ❌ |
| [Jukebox](model_doc/jukebox) | ✅ | ❌ | ❌ |
| [KOSMOS-2](model_doc/kosmos-2) | ✅ | ❌ | ❌ |
| [LayoutLM](model_doc/layoutlm) | ✅ | ✅ | ❌ |
| [LayoutLMv2](model_doc/layoutlmv2) | ✅ | ❌ | ❌ |
| [LayoutLMv3](model_doc/layoutlmv3) | ✅ | ✅ | ❌ |
| [LayoutXLM](model_doc/layoutxlm) | ✅ | ❌ | ❌ |
| [LED](model_doc/led) | ✅ | ✅ | ❌ |
| [LeViT](model_doc/levit) | ✅ | ❌ | ❌ |
| [LiLT](model_doc/lilt) | ✅ | ❌ | ❌ |
| [LLaMA](model_doc/llama) | ✅ | ❌ | ✅ |
| [Llama2](model_doc/llama2) | ✅ | ❌ | ✅ |
| [Llama3](model_doc/llama3) | ✅ | ❌ | ✅ |
| [LLaVa](model_doc/llava) | ✅ | ❌ | ❌ |
| [LLaVA-NeXT](model_doc/llava_next) | ✅ | ❌ | ❌ |
| [LLaVa-NeXT-Video](model_doc/llava_next_video) | ✅ | ❌ | ❌ |
| [LLaVA-Onevision](model_doc/llava_onevision) | ✅ | ❌ | ❌ |
| [Longformer](model_doc/longformer) | ✅ | ✅ | ❌ |
| [LongT5](model_doc/longt5) | ✅ | ❌ | ✅ |
| [LUKE](model_doc/luke) | ✅ | ❌ | ❌ |
| [LXMERT](model_doc/lxmert) | ✅ | ✅ | ❌ |
| [M-CTC-T](model_doc/mctct) | ✅ | ❌ | ❌ |
| [M2M100](model_doc/m2m_100) | ✅ | ❌ | ❌ |
| [MADLAD-400](model_doc/madlad-400) | ✅ | ✅ | ✅ |
| [Mamba](model_doc/mamba) | ✅ | ❌ | ❌ |
| [mamba2](model_doc/mamba2) | ✅ | ❌ | ❌ |
| [Marian](model_doc/marian) | ✅ | ✅ | ✅ |
| [MarkupLM](model_doc/markuplm) | ✅ | ❌ | ❌ |
| [Mask2Former](model_doc/mask2former) | ✅ | ❌ | ❌ |
| [MaskFormer](model_doc/maskformer) | ✅ | ❌ | ❌ |
| [MatCha](model_doc/matcha) | ✅ | ❌ | ❌ |
| [mBART](model_doc/mbart) | ✅ | ✅ | ✅ |
| [mBART-50](model_doc/mbart50) | ✅ | ✅ | ✅ |
| [MEGA](model_doc/mega) | ✅ | ❌ | ❌ |
| [Megatron-BERT](model_doc/megatron-bert) | ✅ | ❌ | ❌ |
| [Megatron-GPT2](model_doc/megatron_gpt2) | ✅ | ✅ | ✅ |
| [MGP-STR](model_doc/mgp-str) | ✅ | ❌ | ❌ |
| [Mimi](model_doc/mimi) | ✅ | ❌ | ❌ |
| [Mistral](model_doc/mistral) | ✅ | ✅ | ✅ |
| [Mixtral](model_doc/mixtral) | ✅ | ❌ | ❌ |
| [Mllama](model_doc/mllama) | ✅ | ❌ | ❌ |
| [mLUKE](model_doc/mluke) | ✅ | ❌ | ❌ |
| [MMS](model_doc/mms) | ✅ | ✅ | ✅ |
| [MobileBERT](model_doc/mobilebert) | ✅ | ✅ | ❌ |
| [MobileNetV1](model_doc/mobilenet_v1) | ✅ | ❌ | ❌ |
| [MobileNetV2](model_doc/mobilenet_v2) | ✅ | ❌ | ❌ |
| [MobileViT](model_doc/mobilevit) | ✅ | ✅ | ❌ |
| [MobileViTV2](model_doc/mobilevitv2) | ✅ | ❌ | ❌ |
| [ModernBERT](model_doc/modernbert) | ✅ | ❌ | ❌ |
| [Moonshine](model_doc/moonshine) | ✅ | ❌ | ❌ |
| [Moshi](model_doc/moshi) | ✅ | ❌ | ❌ |
| [MPNet](model_doc/mpnet) | ✅ | ✅ | ❌ |
| [MPT](model_doc/mpt) | ✅ | ❌ | ❌ |
| [MRA](model_doc/mra) | ✅ | ❌ | ❌ |
| [MT5](model_doc/mt5) | ✅ | ✅ | ✅ |
| [MusicGen](model_doc/musicgen) | ✅ | ❌ | ❌ |
| [MusicGen Melody](model_doc/musicgen_melody) | ✅ | ❌ | ❌ |
| [MVP](model_doc/mvp) | ✅ | ❌ | ❌ |
| [NAT](model_doc/nat) | ✅ | ❌ | ❌ |
| [Nemotron](model_doc/nemotron) | ✅ | ❌ | ❌ |
| [Nezha](model_doc/nezha) | ✅ | ❌ | ❌ |
| [NLLB](model_doc/nllb) | ✅ | ❌ | ❌ |
| [NLLB-MOE](model_doc/nllb-moe) | ✅ | ❌ | ❌ |
| [Nougat](model_doc/nougat) | ✅ | ✅ | ✅ |
| [Nyströmformer](model_doc/nystromformer) | ✅ | ❌ | ❌ |
| [OLMo](model_doc/olmo) | ✅ | ❌ | ❌ |
| [OLMo2](model_doc/olmo2) | ✅ | ❌ | ❌ |
| [OLMoE](model_doc/olmoe) | ✅ | ❌ | ❌ |
| [OmDet-Turbo](model_doc/omdet-turbo) | ✅ | ❌ | ❌ |
| [OneFormer](model_doc/oneformer) | ✅ | ❌ | ❌ |
| [OpenAI GPT](model_doc/openai-gpt) | ✅ | ✅ | ❌ |
| [OpenAI GPT-2](model_doc/gpt2) | ✅ | ✅ | ✅ |
| [OpenLlama](model_doc/open-llama) | ✅ | ❌ | ❌ |
| [OPT](model_doc/opt) | ✅ | ✅ | ✅ |
| [OWL-ViT](model_doc/owlvit) | ✅ | ❌ | ❌ |
| [OWLv2](model_doc/owlv2) | ✅ | ❌ | ❌ |
| [PaliGemma](model_doc/paligemma) | ✅ | ❌ | ❌ |
| [PatchTSMixer](model_doc/patchtsmixer) | ✅ | ❌ | ❌ |
| [PatchTST](model_doc/patchtst) | ✅ | ❌ | ❌ |
| [Pegasus](model_doc/pegasus) | ✅ | ✅ | ✅ |
| [PEGASUS-X](model_doc/pegasus_x) | ✅ | ❌ | ❌ |
| [Perceiver](model_doc/perceiver) | ✅ | ❌ | ❌ |
| [Persimmon](model_doc/persimmon) | ✅ | ❌ | ❌ |
| [Phi](model_doc/phi) | ✅ | ❌ | ❌ |
| [Phi3](model_doc/phi3) | ✅ | ❌ | ❌ |
| [Phimoe](model_doc/phimoe) | ✅ | ❌ | ❌ |
| [PhoBERT](model_doc/phobert) | ✅ | ✅ | ✅ |
| [Pix2Struct](model_doc/pix2struct) | ✅ | ❌ | ❌ |
| [Pixtral](model_doc/pixtral) | ✅ | ❌ | ❌ |
| [PLBart](model_doc/plbart) | ✅ | ❌ | ❌ |
| [PoolFormer](model_doc/poolformer) | ✅ | ❌ | ❌ |
| [Pop2Piano](model_doc/pop2piano) | ✅ | ❌ | ❌ |
| [ProphetNet](model_doc/prophetnet) | ✅ | ❌ | ❌ |
| [PVT](model_doc/pvt) | ✅ | ❌ | ❌ |
| [PVTv2](model_doc/pvt_v2) | ✅ | ❌ | ❌ |
| [QDQBert](model_doc/qdqbert) | ✅ | ❌ | ❌ |
| [Qwen2](model_doc/qwen2) | ✅ | ❌ | ❌ |
| [Qwen2Audio](model_doc/qwen2_audio) | ✅ | ❌ | ❌ |
| [Qwen2MoE](model_doc/qwen2_moe) | ✅ | ❌ | ❌ |
| [Qwen2VL](model_doc/qwen2_vl) | ✅ | ❌ | ❌ |
| [RAG](model_doc/rag) | ✅ | ✅ | ❌ |
| [REALM](model_doc/realm) | ✅ | ❌ | ❌ |
| [RecurrentGemma](model_doc/recurrent_gemma) | ✅ | ❌ | ❌ |
| [Reformer](model_doc/reformer) | ✅ | ❌ | ❌ |
| [RegNet](model_doc/regnet) | ✅ | ✅ | ✅ |
| [RemBERT](model_doc/rembert) | ✅ | ✅ | ❌ |
| [ResNet](model_doc/resnet) | ✅ | ✅ | ✅ |
| [RetriBERT](model_doc/retribert) | ✅ | ❌ | ❌ |
| [RoBERTa](model_doc/roberta) | ✅ | ✅ | ✅ |
| [RoBERTa-PreLayerNorm](model_doc/roberta-prelayernorm) | ✅ | ✅ | ✅ |
| [RoCBert](model_doc/roc_bert) | ✅ | ❌ | ❌ |
| [RoFormer](model_doc/roformer) | ✅ | ✅ | ✅ |
| [RT-DETR](model_doc/rt_detr) | ✅ | ❌ | ❌ |
| [RT-DETR-ResNet](model_doc/rt_detr_resnet) | ✅ | ❌ | ❌ |
| [RWKV](model_doc/rwkv) | ✅ | ❌ | ❌ |
| [SAM](model_doc/sam) | ✅ | ✅ | ❌ |
| [SeamlessM4T](model_doc/seamless_m4t) | ✅ | ❌ | ❌ |
| [SeamlessM4Tv2](model_doc/seamless_m4t_v2) | ✅ | ❌ | ❌ |
| [SegFormer](model_doc/segformer) | ✅ | ✅ | ❌ |
| [SegGPT](model_doc/seggpt) | ✅ | ❌ | ❌ |
| [SEW](model_doc/sew) | ✅ | ❌ | ❌ |
| [SEW-D](model_doc/sew-d) | ✅ | ❌ | ❌ |
| [SigLIP](model_doc/siglip) | ✅ | ❌ | ❌ |
| [Speech Encoder decoder](model_doc/speech-encoder-decoder) | ✅ | ❌ | ✅ |
| [Speech2Text](model_doc/speech_to_text) | ✅ | ✅ | ❌ |
| [SpeechT5](model_doc/speecht5) | ✅ | ❌ | ❌ |
| [Splinter](model_doc/splinter) | ✅ | ❌ | ❌ |
| [SqueezeBERT](model_doc/squeezebert) | ✅ | ❌ | ❌ |
| [StableLm](model_doc/stablelm) | ✅ | ❌ | ❌ |
| [Starcoder2](model_doc/starcoder2) | ✅ | ❌ | ❌ |
| [SuperGlue](model_doc/superglue) | ✅ | ❌ | ❌ |
| [SuperPoint](model_doc/superpoint) | ✅ | ❌ | ❌ |
| [SwiftFormer](model_doc/swiftformer) | ✅ | ✅ | ❌ |
| [Swin Transformer](model_doc/swin) | ✅ | ✅ | ❌ |
| [Swin Transformer V2](model_doc/swinv2) | ✅ | ❌ | ❌ |
| [Swin2SR](model_doc/swin2sr) | ✅ | ❌ | ❌ |
| [SwitchTransformers](model_doc/switch_transformers) | ✅ | ❌ | ❌ |
| [T5](model_doc/t5) | ✅ | ✅ | ✅ |
| [T5v1.1](model_doc/t5v1.1) | ✅ | ✅ | ✅ |
| [Table Transformer](model_doc/table-transformer) | ✅ | ❌ | ❌ |
| [TAPAS](model_doc/tapas) | ✅ | ✅ | ❌ |
| [TAPEX](model_doc/tapex) | ✅ | ✅ | ✅ |
| [TextNet](model_doc/textnet) | ✅ | ❌ | ❌ |
| [Time Series Transformer](model_doc/time_series_transformer) | ✅ | ❌ | ❌ |
| [TimeSformer](model_doc/timesformer) | ✅ | ❌ | ❌ |
| [TimmWrapperModel](model_doc/timm_wrapper) | ✅ | ❌ | ❌ |
| [Trajectory Transformer](model_doc/trajectory_transformer) | ✅ | ❌ | ❌ |
| [Transformer-XL](model_doc/transfo-xl) | ✅ | ✅ | ❌ |
| [TrOCR](model_doc/trocr) | ✅ | ❌ | ❌ |
| [TVLT](model_doc/tvlt) | ✅ | ❌ | ❌ |
| [TVP](model_doc/tvp) | ✅ | ❌ | ❌ |
| [UDOP](model_doc/udop) | ✅ | ❌ | ❌ |
| [UL2](model_doc/ul2) | ✅ | ✅ | ✅ |
| [UMT5](model_doc/umt5) | ✅ | ❌ | ❌ |
| [UniSpeech](model_doc/unispeech) | ✅ | ❌ | ❌ |
| [UniSpeechSat](model_doc/unispeech-sat) | ✅ | ❌ | ❌ |
| [UnivNet](model_doc/univnet) | ✅ | ❌ | ❌ |
| [UPerNet](model_doc/upernet) | ✅ | ❌ | ❌ |
| [VAN](model_doc/van) | ✅ | ❌ | ❌ |
| [VideoLlava](model_doc/video_llava) | ✅ | ❌ | ❌ |
| [VideoMAE](model_doc/videomae) | ✅ | ❌ | ❌ |
| [ViLT](model_doc/vilt) | ✅ | ❌ | ❌ |
| [VipLlava](model_doc/vipllava) | ✅ | ❌ | ❌ |
| [Vision Encoder decoder](model_doc/vision-encoder-decoder) | ✅ | ✅ | ✅ |
| [VisionTextDualEncoder](model_doc/vision-text-dual-encoder) | ✅ | ✅ | ✅ |
| [VisualBERT](model_doc/visual_bert) | ✅ | ❌ | ❌ |
| [ViT](model_doc/vit) | ✅ | ✅ | ✅ |
| [ViT Hybrid](model_doc/vit_hybrid) | ✅ | ❌ | ❌ |
| [VitDet](model_doc/vitdet) | ✅ | ❌ | ❌ |
| [ViTMAE](model_doc/vit_mae) | ✅ | ✅ | ❌ |
| [ViTMatte](model_doc/vitmatte) | ✅ | ❌ | ❌ |
| [ViTMSN](model_doc/vit_msn) | ✅ | ❌ | ❌ |
| [ViTPose](model_doc/vitpose) | ✅ | ❌ | ❌ |
| [ViTPoseBackbone](model_doc/vitpose_backbone) | ✅ | ❌ | ❌ |
| [VITS](model_doc/vits) | ✅ | ❌ | ❌ |
| [ViViT](model_doc/vivit) | ✅ | ❌ | ❌ |
| [Wav2Vec2](model_doc/wav2vec2) | ✅ | ✅ | ✅ |
| [Wav2Vec2-BERT](model_doc/wav2vec2-bert) | ✅ | ❌ | ❌ |
| [Wav2Vec2-Conformer](model_doc/wav2vec2-conformer) | ✅ | ❌ | ❌ |
| [Wav2Vec2Phoneme](model_doc/wav2vec2_phoneme) | ✅ | ✅ | ✅ |
| [WavLM](model_doc/wavlm) | ✅ | ❌ | ❌ |
| [Whisper](model_doc/whisper) | ✅ | ✅ | ✅ |
| [X-CLIP](model_doc/xclip) | ✅ | ❌ | ❌ |
| [X-MOD](model_doc/xmod) | ✅ | ❌ | ❌ |
| [XGLM](model_doc/xglm) | ✅ | ✅ | ✅ |
| [XLM](model_doc/xlm) | ✅ | ✅ | ❌ |
| [XLM-ProphetNet](model_doc/xlm-prophetnet) | ✅ | ❌ | ❌ |
| [XLM-RoBERTa](model_doc/xlm-roberta) | ✅ | ✅ | ✅ |
| [XLM-RoBERTa-XL](model_doc/xlm-roberta-xl) | ✅ | ❌ | ❌ |
| [XLM-V](model_doc/xlm-v) | ✅ | ✅ | ✅ |
| [XLNet](model_doc/xlnet) | ✅ | ✅ | ❌ |
| [XLS-R](model_doc/xls_r) | ✅ | ✅ | ✅ |
| [XLSR-Wav2Vec2](model_doc/xlsr_wav2vec2) | ✅ | ✅ | ✅ |
| [YOLOS](model_doc/yolos) | ✅ | ❌ | ❌ |
| [YOSO](model_doc/yoso) | ✅ | ❌ | ❌ |
| [Zamba](model_doc/zamba) | ✅ | ❌ | ❌ |
| [ZoeDepth](model_doc/zoedepth) | ✅ | ❌ | ❌ |

<!-- End table-->