Improve vision models docs (#19103)
* Add tips
* Add BEiT figure
* Fix URL
* Move tip to start
* Add tip to TF model as well

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
parent 0d1ba2dd0b
commit e7206ceab9
@@ -59,6 +59,11 @@ Tips:
 `use_relative_position_bias` attribute of [`BeitConfig`] to `True` in order to add
 position embeddings.

+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/beit_architecture.jpg"
+alt="drawing" width="600"/>
+
+<small> BEiT pre-training. Taken from the <a href="https://arxiv.org/abs/2106.08254">original paper.</a> </small>
+
 This model was contributed by [nielsr](https://huggingface.co/nielsr). The JAX/FLAX version of this model was
 contributed by [kamalkraj](https://huggingface.co/kamalkraj). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/beit).
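As a side note to the tip touched by this hunk, here is a minimal sketch (not part of the commit) of what setting `use_relative_position_bias` looks like in practice:

```python
from transformers import BeitConfig, BeitModel

# Enable relative position bias in the self-attention layers, as the tip above
# describes for BEiT models pre-trained from scratch.
config = BeitConfig(use_relative_position_bias=True)

# Randomly initialized BEiT encoder built from this configuration.
model = BeitModel(config)
print(model.config.use_relative_position_bias)  # True
```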
@@ -12,13 +12,6 @@ specific language governing permissions and limitations under the License.

 # Vision Transformer (ViT)

-<Tip>
-
-This is a recently introduced model so the API hasn't been tested extensively. There may be some bugs or slight
-breaking changes to fix it in the future. If you see something strange, file a [Github Issue](https://github.com/huggingface/transformers/issues/new?assignees=&labels=&template=bug-report.md&title).
-
-</Tip>
-
 ## Overview

 The Vision Transformer (ViT) model was proposed in [An Image is Worth 16x16 Words: Transformers for Image Recognition
@@ -63,6 +56,11 @@ Tips:
 language modeling). With this approach, the smaller ViT-B/16 model achieves 79.9% accuracy on ImageNet, a significant
 improvement of 2% to training from scratch, but still 4% behind supervised pre-training.

+<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/vit_architecture.jpg"
+alt="drawing" width="600"/>
+
+<small> ViT architecture. Taken from the <a href="https://arxiv.org/abs/2010.11929">original paper.</a> </small>
+
 Following the original Vision Transformer, some follow-up works have been made:

 - [DeiT](deit) (Data-efficient Image Transformers) by Facebook AI. DeiT models are distilled vision transformers.
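Not part of the diff, but for readers of the ViT docs a minimal inference sketch may help. The `google/vit-base-patch16-224` checkpoint and the COCO sample image are conventional documentation examples, not something this commit adds, and the sketch assumes a recent transformers release:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTForImageClassification

# Sample image commonly used in the transformers documentation.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring ImageNet class id back to its label.
print(model.config.id2label[logits.argmax(-1).item()])
```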
@@ -23,7 +23,8 @@ The abstract from the paper is the following:

 Tips:

-- Usage of X-CLIP is identical to CLIP.
+- Usage of X-CLIP is identical to [CLIP](clip).
 - Demo notebooks for X-CLIP can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/X-CLIP).

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/xclip_architecture.png"
 alt="drawing" width="600"/>
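Since the tip states that X-CLIP usage mirrors CLIP, here is a hedged sketch of video-text matching. The checkpoint name, the 8-frame clip length, and the random frames are assumptions for illustration only:

```python
import numpy as np
import torch
from transformers import XCLIPModel, XCLIPProcessor

# Hypothetical clip: 8 RGB frames of 224x224 (the base checkpoint samples 8 frames per video).
video = list(np.random.randint(0, 256, size=(8, 224, 224, 3), dtype=np.uint8))

processor = XCLIPProcessor.from_pretrained("microsoft/xclip-base-patch32")
model = XCLIPModel.from_pretrained("microsoft/xclip-base-patch32")

inputs = processor(
    text=["playing basketball", "cooking dinner"],
    videos=video,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    outputs = model(**inputs)

# Video-to-text similarity, analogous to CLIP's logits_per_image.
probs = outputs.logits_per_video.softmax(dim=1)
print(probs)
```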
@@ -555,8 +555,15 @@ class DeiTPooler(nn.Module):


 @add_start_docstrings(
-    "DeiT Model with a decoder on top for masked image modeling, as proposed in"
-    " [SimMIM](https://arxiv.org/abs/2111.09886).",
+    """DeiT Model with a decoder on top for masked image modeling, as proposed in [SimMIM](https://arxiv.org/abs/2111.09886).
+
+    <Tip>
+
+    Note that we provide a script to pre-train this model on custom data in our [examples
+    directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
+
+    </Tip>
+    """,
     DEIT_START_DOCSTRING,
 )
 class DeiTForMaskedImageModeling(DeiTPreTrainedModel):
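The new docstring advertises SimMIM-style masked image modeling, so a short usage sketch may help. The checkpoint and the random patch mask are illustrative assumptions, the output attribute names assume a recent transformers release, and the same pattern applies to the Swin, Swinv2, and ViT variants changed below:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, DeiTForMaskedImageModeling

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
model = DeiTForMaskedImageModeling.from_pretrained("facebook/deit-base-distilled-patch16-224")

pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# Randomly mask roughly half of the image patches: shape (batch_size, num_patches).
num_patches = (model.config.image_size // model.config.patch_size) ** 2
bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
loss, reconstruction = outputs.loss, outputs.reconstruction
print(reconstruction.shape)  # (1, 3, 224, 224)
```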
@@ -1007,8 +1007,15 @@ class SwinModel(SwinPreTrainedModel):


 @add_start_docstrings(
-    "Swin Model with a decoder on top for masked image modeling, as proposed in"
-    " [SimMIM](https://arxiv.org/abs/2111.09886).",
+    """Swin Model with a decoder on top for masked image modeling, as proposed in [SimMIM](https://arxiv.org/abs/2111.09886).
+
+    <Tip>
+
+    Note that we provide a script to pre-train this model on custom data in our [examples
+    directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
+
+    </Tip>
+    """,
     SWIN_START_DOCSTRING,
 )
 class SwinForMaskedImageModeling(SwinPreTrainedModel):
@@ -1087,8 +1087,16 @@ class Swinv2Model(Swinv2PreTrainedModel):


 @add_start_docstrings(
-    "Swinv2 Model with a decoder on top for masked image modeling, as proposed in"
-    " [SimMIM](https://arxiv.org/abs/2111.09886).",
+    """Swinv2 Model with a decoder on top for masked image modeling, as proposed in
+    [SimMIM](https://arxiv.org/abs/2111.09886).
+
+    <Tip>
+
+    Note that we provide a script to pre-train this model on custom data in our [examples
+    directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
+
+    </Tip>
+    """,
     SWINV2_START_DOCSTRING,
 )
 # Copied from transformers.models.swin.modeling_swin.SwinForMaskedImageModeling with SWIN->SWINV2,Swin->Swinv2,swin->swinv2,224->256,window7->window8
@@ -733,6 +733,14 @@ class TFViTPooler(tf.keras.layers.Layer):
     """
     ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of
     the [CLS] token) e.g. for ImageNet.
+
+    <Tip>
+
+    Note that it's possible to fine-tune ViT on higher resolution images than the ones it has been trained on, by
+    setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
+    position embeddings to the higher resolution.
+
+    </Tip>
     """,
     VIT_START_DOCSTRING,
 )
@@ -597,8 +597,15 @@ class ViTPooler(nn.Module):


 @add_start_docstrings(
-    "ViT Model with a decoder on top for masked image modeling, as proposed in"
-    " [SimMIM](https://arxiv.org/abs/2111.09886).",
+    """ViT Model with a decoder on top for masked image modeling, as proposed in [SimMIM](https://arxiv.org/abs/2111.09886).
+
+    <Tip>
+
+    Note that we provide a script to pre-train this model on custom data in our [examples
+    directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
+
+    </Tip>
+    """,
     VIT_START_DOCSTRING,
 )
 class ViTForMaskedImageModeling(ViTPreTrainedModel):
@@ -712,6 +719,14 @@ class ViTForMaskedImageModeling(ViTPreTrainedModel):
     """
     ViT Model transformer with an image classification head on top (a linear layer on top of the final hidden state of
     the [CLS] token) e.g. for ImageNet.
+
+    <Tip>
+
+    Note that it's possible to fine-tune ViT on higher resolution images than the ones it has been trained on, by
+    setting `interpolate_pos_encoding` to `True` in the forward of the model. This will interpolate the pre-trained
+    position embeddings to the higher resolution.
+
+    </Tip>
     """,
     VIT_START_DOCSTRING,
 )
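As a companion to the tip added here (and to the TF variant earlier in this commit), a minimal sketch of running the classifier at a higher resolution than it was trained on. The 480x480 random tensor merely stands in for a properly preprocessed image batch:

```python
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

# Stand-in for a batch of images preprocessed at 480x480 rather than the
# 224x224 resolution the checkpoint was trained on.
pixel_values = torch.randn(1, 3, 480, 480)

# interpolate_pos_encoding=True resizes the pre-trained position embeddings
# so the model accepts the larger input.
with torch.no_grad():
    logits = model(pixel_values, interpolate_pos_encoding=True).logits

print(logits.shape)  # torch.Size([1, 1000])
```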
@@ -837,7 +837,16 @@ class ViTMAEDecoder(nn.Module):


 @add_start_docstrings(
-    "The ViTMAE Model transformer with the decoder on top for self-supervised pre-training.",
+    """The ViTMAE Model transformer with the decoder on top for self-supervised pre-training.
+
+    <Tip>
+
+    Note that we provide a script to pre-train this model on custom data in our [examples
+    directory](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-pretraining).
+
+    </Tip>
+
+    """,
     VIT_MAE_START_DOCSTRING,
 )
 class ViTMAEForPreTraining(ViTMAEPreTrainedModel):
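To ground the ViTMAE docstring change, a short sketch of the self-supervised pre-training forward pass. The `facebook/vit-mae-base` checkpoint and the COCO test image are conventional documentation examples, not something this commit prescribes:

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEForPreTraining

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

loss = outputs.loss      # reconstruction loss on the randomly masked patches
mask = outputs.mask      # (batch_size, num_patches), 1 where a patch was masked
logits = outputs.logits  # per-patch pixel reconstructions
print(loss.item(), mask.shape, logits.shape)
```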