docs(swinv2): Update SwinV2 model card to new standard format (#37942)
* docs(swinv2): Update SwinV2 model card to new standard format

* docs(swinv2): Apply review suggestions

  Incorporates feedback from @stevhliu to:
  - Enhance the introductory paragraph with more details about scaling and SimMIM.
  - Generalize the tip from "image classification tasks" to "vision tasks".

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
parent 33d23c39ed
commit 36f97ae15b
<div style="float: right;">
    <div class="flex flex-wrap space-x-1">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
    </div>
</div>
# Swin Transformer V2
[Swin Transformer V2](https://huggingface.co/papers/2111.09883) is a 3B parameter model that focuses on how to scale a vision model to billions of parameters. It introduces techniques like residual-post-norm combined with cosine attention for improved training stability, log-spaced continuous position bias to better handle varying image resolutions between pre-training and fine-tuning, and a new pre-training method (SimMIM) to reduce the need for large amounts of labeled data. These improvements enable efficiently training very large models (up to 3 billion parameters) capable of processing high-resolution images.
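
As a rough illustration of the scaled cosine attention idea, here is a minimal, hypothetical single-head sketch (standalone code, not the `transformers` implementation; in the paper the temperature is a learnable, clamped parameter rather than the constant used here):

```py
import torch
import torch.nn.functional as F

# Swin v2 replaces scaled dot-product attention with cosine similarity:
# L2-normalize queries and keys, then divide by a temperature instead of sqrt(d).
def cosine_attention(query, key, value, tau=0.1):
    query = F.normalize(query, dim=-1)
    key = F.normalize(key, dim=-1)
    scores = (query @ key.transpose(-2, -1)) / tau  # cosine similarities / tau
    return scores.softmax(dim=-1) @ value

q = k = v = torch.randn(1, 64, 96)  # (batch, tokens in one window, channels)
print(cosine_attention(q, k, v).shape)  # torch.Size([1, 64, 96])
```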
You can find official Swin Transformer V2 checkpoints under the [Microsoft](https://huggingface.co/microsoft?search_models=swinv2) organization.
> [!TIP]
> Click on the Swin Transformer V2 models in the right sidebar for more examples of how to apply Swin Transformer V2 to vision tasks.
This model was contributed by [nandwalritik](https://huggingface.co/nandwalritik). The original code can be found [here](https://github.com/microsoft/Swin-Transformer).

The example below demonstrates how to classify an image with [`Pipeline`] or the [`AutoModel`] class.
<hfoptions id="usage">
<hfoption id="Pipeline">
```py
import torch
from transformers import pipeline

pipeline = pipeline(
    task="image-classification",
    model="microsoft/swinv2-tiny-patch4-window8-256",
    torch_dtype=torch.float16,  # half precision for faster inference
    device=0
)
pipeline(images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
```
</hfoption>
<hfoption id="AutoModel">
```py
import torch
import requests
from PIL import Image
from transformers import AutoModelForImageClassification, AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained(
    "microsoft/swinv2-tiny-patch4-window8-256",
)
model = AutoModelForImageClassification.from_pretrained(
    "microsoft/swinv2-tiny-patch4-window8-256",
    device_map="auto"
)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax(dim=-1).item()
predicted_class_label = model.config.id2label[predicted_class_id]
print(f"The predicted class label is: {predicted_class_label}")
```
</hfoption>
</hfoptions>
## Notes
- Swin Transformer V2 pads its inputs internally, so it supports any input height and width divisible by `32`.
- Swin Transformer V2 can be used as a [backbone](../backbones). When `output_hidden_states = True`, it outputs both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch_size, num_channels, height, width)` rather than `(batch_size, sequence_length, num_channels)`, as the sketch after this list shows.
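
A minimal sketch of the second note (assuming the `microsoft/swinv2-tiny-patch4-window8-256` checkpoint; the example shapes are for a 256x256 input):

```py
import torch
from transformers import Swinv2Model

model = Swinv2Model.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
pixel_values = torch.randn(1, 3, 256, 256)  # height and width divisible by 32

with torch.no_grad():
    outputs = model(pixel_values, output_hidden_states=True)

# sequence layout vs. spatial layout of the same features
print(outputs.hidden_states[0].shape)           # torch.Size([1, 4096, 96])
print(outputs.reshaped_hidden_states[0].shape)  # torch.Size([1, 96, 64, 64])
```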
## Swinv2Config