mirror of https://github.com/huggingface/transformers.git (synced 2025-08-01 02:31:11 +06:00)

Update doc examples feature extractor -> image processor (#20501)

* Update doc example feature extractor -> image processor
* Apply suggestions from code review

This commit is contained in:
parent afad0c18d9
commit 17a7b49bda
@@ -23,6 +23,7 @@ Remember, architecture refers to the skeleton of the model and checkpoints are t
 In this tutorial, learn to:
 
 * Load a pretrained tokenizer.
+* Load a pretrained image processor
 * Load a pretrained feature extractor.
 * Load a pretrained processor.
 * Load a pretrained model.
@@ -49,9 +50,20 @@ Then tokenize your input as shown below:
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
 ```
 
+## AutoImageProcessor
+
+For vision tasks, an image processor processes the image into the correct input format.
+
+```py
+>>> from transformers import AutoImageProcessor
+
+>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
+```
+
 
 ## AutoFeatureExtractor
 
-For audio and vision tasks, a feature extractor processes the audio signal or image into the correct input format.
+For audio tasks, a feature extractor processes the audio signal into the correct input format.
 
 Load a feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
 
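For illustration (outside the diff above), a minimal sketch of the loading step the surrounding text describes; the Wav2Vec2 checkpoint name is an assumption, and any repository that ships a `preprocessor_config.json` works the same way:

```py
>>> from transformers import AutoFeatureExtractor

>>> # assumed checkpoint for illustration
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
```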
@@ -65,7 +77,7 @@ Load a feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
 
 ## AutoProcessor
 
-Multimodal tasks require a processor that combines two types of preprocessing tools. For example, the [LayoutLMV2](model_doc/layoutlmv2) model requires a feature extractor to handle images and a tokenizer to handle text; a processor combines both of them.
+Multimodal tasks require a processor that combines two types of preprocessing tools. For example, the [LayoutLMV2](model_doc/layoutlmv2) model requires an image processor to handle images and a tokenizer to handle text; a processor combines both of them.
 
 Load a processor with [`AutoProcessor.from_pretrained`]:
 
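For illustration, a sketch of the corresponding loading call; the LayoutLMv2 checkpoint is an assumed example:

```py
>>> from transformers import AutoProcessor

>>> # assumed checkpoint for illustration; it bundles an image processor and a tokenizer
>>> processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased")
```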
@@ -103,7 +115,7 @@ TensorFlow and Flax checkpoints are not affected, and can be loaded within PyTor
 
 </Tip>
 
-Generally, we recommend using the `AutoTokenizer` class and the `AutoModelFor` class to load pretrained instances of models. This will ensure you load the correct architecture every time. In the next [tutorial](preprocessing), learn how to use your newly loaded tokenizer, feature extractor and processor to preprocess a dataset for fine-tuning.
+Generally, we recommend using the `AutoTokenizer` class and the `AutoModelFor` class to load pretrained instances of models. This will ensure you load the correct architecture every time. In the next [tutorial](preprocessing), learn how to use your newly loaded tokenizer, image processor, feature extractor and processor to preprocess a dataset for fine-tuning.
 </pt>
 <tf>
 Finally, the `TFAutoModelFor` classes let you load a pretrained model for a given task (see [here](model_doc/auto) for a complete list of available tasks). For example, load a model for sequence classification with [`TFAutoModelForSequenceClassification.from_pretrained`]:
@@ -122,6 +134,6 @@ Easily reuse the same checkpoint to load an architecture for a different task:
 >>> model = TFAutoModelForTokenClassification.from_pretrained("distilbert-base-uncased")
 ```
 
-Generally, we recommend using the `AutoTokenizer` class and the `TFAutoModelFor` class to load pretrained instances of models. This will ensure you load the correct architecture every time. In the next [tutorial](preprocessing), learn how to use your newly loaded tokenizer, feature extractor and processor to preprocess a dataset for fine-tuning.
+Generally, we recommend using the `AutoTokenizer` class and the `TFAutoModelFor` class to load pretrained instances of models. This will ensure you load the correct architecture every time. In the next [tutorial](preprocessing), learn how to use your newly loaded tokenizer, image processor, feature extractor and processor to preprocess a dataset for fine-tuning.
 </tf>
 </frameworkcontent>
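As a quick sketch of the preprocessing hand-off the tutorial points to (the checkpoint and image URL are assumptions reused from examples elsewhere in these docs):

```py
>>> import requests
>>> from PIL import Image

>>> from transformers import AutoImageProcessor

>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # the image processor resizes and normalizes the image into model-ready pixel values
>>> inputs = image_processor(image, return_tensors="pt")
>>> list(inputs.keys())
['pixel_values']
```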
@@ -17,7 +17,8 @@ An [`AutoClass`](model_doc/auto) automatically infers the model architecture and
 - Load and customize a model configuration.
 - Create a model architecture.
 - Create a slow and fast tokenizer for text.
-- Create a feature extractor for audio or image tasks.
+- Create an image processor for vision tasks.
+- Create a feature extractor for audio tasks.
 - Create a processor for multimodal tasks.
 
 ## Configuration
@@ -244,21 +245,21 @@ By default, [`AutoTokenizer`] will try to load a fast tokenizer. You can disable
 
 </Tip>
 
-## Feature Extractor
+## Image Processor
 
-A feature extractor processes audio or image inputs. It inherits from the base [`~feature_extraction_utils.FeatureExtractionMixin`] class, and may also inherit from the [`ImageFeatureExtractionMixin`] class for processing image features or the [`SequenceFeatureExtractor`] class for processing audio inputs.
+An image processor processes vision inputs. It inherits from the base [`~image_processing_utils.ImageProcessingMixin`] class.
 
-Depending on whether you are working on an audio or vision task, create a feature extractor associated with the model you're using. For example, create a default [`ViTFeatureExtractor`] if you are using [ViT](model_doc/vit) for image classification:
+To use, create an image processor associated with the model you're using. For example, create a default [`ViTImageProcessor`] if you are using [ViT](model_doc/vit) for image classification:
 
 ```py
->>> from transformers import ViTFeatureExtractor
+>>> from transformers import ViTImageProcessor
 
->>> vit_extractor = ViTFeatureExtractor()
+>>> vit_extractor = ViTImageProcessor()
 >>> print(vit_extractor)
-ViTFeatureExtractor {
+ViTImageProcessor {
   "do_normalize": true,
   "do_resize": true,
-  "feature_extractor_type": "ViTFeatureExtractor",
+  "feature_extractor_type": "ViTImageProcessor",
   "image_mean": [
     0.5,
     0.5,
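To show what a default image processor produces, a sketch using a random array as a stand-in for a real image (the array and the printed shape are illustrative assumptions):

```py
>>> import numpy as np

>>> from transformers import ViTImageProcessor

>>> image_processor = ViTImageProcessor()
>>> dummy_image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # stand-in for a real RGB image
>>> pixel_values = image_processor(dummy_image, return_tensors="np").pixel_values
>>> pixel_values.shape  # resized to the default 224x224 and put in channels-first layout
(1, 3, 224, 224)
```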
@@ -276,21 +277,21 @@ ViTFeatureExtractor {
 
 <Tip>
 
-If you aren't looking for any customization, just use the `from_pretrained` method to load a model's default feature extractor parameters.
+If you aren't looking for any customization, just use the `from_pretrained` method to load a model's default image processor parameters.
 
 </Tip>
 
-Modify any of the [`ViTFeatureExtractor`] parameters to create your custom feature extractor:
+Modify any of the [`ViTImageProcessor`] parameters to create your custom image processor:
 
 ```py
->>> from transformers import ViTFeatureExtractor
+>>> from transformers import ViTImageProcessor
 
->>> my_vit_extractor = ViTFeatureExtractor(resample="PIL.Image.BOX", do_normalize=False, image_mean=[0.3, 0.3, 0.3])
+>>> my_vit_extractor = ViTImageProcessor(resample="PIL.Image.BOX", do_normalize=False, image_mean=[0.3, 0.3, 0.3])
 >>> print(my_vit_extractor)
-ViTFeatureExtractor {
+ViTImageProcessor {
   "do_normalize": false,
   "do_resize": true,
-  "feature_extractor_type": "ViTFeatureExtractor",
+  "feature_extractor_type": "ViTImageProcessor",
   "image_mean": [
     0.3,
     0.3,
@@ -306,7 +307,11 @@ ViTFeatureExtractor {
 }
 ```
 
-For audio inputs, you can create a [`Wav2Vec2FeatureExtractor`] and customize the parameters in a similar way:
+## Feature Extractor
+
+A feature extractor processes audio inputs. It inherits from the base [`~feature_extraction_utils.FeatureExtractionMixin`] class, and may also inherit from the [`SequenceFeatureExtractor`] class for processing audio inputs.
+
+To use, create a feature extractor associated with the model you're using. For example, create a default [`Wav2Vec2FeatureExtractor`] if you are using [Wav2Vec2](model_doc/wav2vec2) for audio classification:
 
 ```py
 >>> from transformers import Wav2Vec2FeatureExtractor
@@ -324,9 +329,34 @@ Wav2Vec2FeatureExtractor {
 }
 ```
 
+<Tip>
+
+If you aren't looking for any customization, just use the `from_pretrained` method to load a model's default feature extractor parameters.
+
+</Tip>
+
+Modify any of the [`Wav2Vec2FeatureExtractor`] parameters to create your custom feature extractor:
+
+```py
+>>> from transformers import Wav2Vec2FeatureExtractor
+
+>>> w2v2_extractor = Wav2Vec2FeatureExtractor(sampling_rate=8000, do_normalize=False)
+>>> print(w2v2_extractor)
+Wav2Vec2FeatureExtractor {
+  "do_normalize": false,
+  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
+  "feature_size": 1,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "return_attention_mask": false,
+  "sampling_rate": 8000
+}
+```
+
 
 ## Processor
 
-For models that support multimodal tasks, 🤗 Transformers offers a processor class that conveniently wraps a feature extractor and tokenizer into a single object. For example, let's use the [`Wav2Vec2Processor`] for an automatic speech recognition task (ASR). ASR transcribes audio to text, so you will need a feature extractor and a tokenizer.
+For models that support multimodal tasks, 🤗 Transformers offers a processor class that conveniently wraps processing classes such as a feature extractor and a tokenizer into a single object. For example, let's use the [`Wav2Vec2Processor`] for an automatic speech recognition task (ASR). ASR transcribes audio to text, so you will need a feature extractor and a tokenizer.
 
 Create a feature extractor to handle the audio inputs:
 
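The doc's own code for this step falls outside the hunk above; as a hedged stand-in, a sketch of applying a `Wav2Vec2FeatureExtractor` to raw audio, using a silent one-second array as an illustrative assumption:

```py
>>> import numpy as np

>>> from transformers import Wav2Vec2FeatureExtractor

>>> feature_extractor = Wav2Vec2FeatureExtractor(sampling_rate=16000)
>>> raw_audio = np.zeros(16000, dtype=np.float32)  # one second of 16 kHz audio
>>> inputs = feature_extractor(raw_audio, sampling_rate=16000, return_tensors="pt")
>>> inputs.input_values.shape
torch.Size([1, 16000])
```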
@@ -352,4 +382,4 @@ Combine the feature extractor and tokenizer in [`Wav2Vec2Processor`]:
 >>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
 ```
 
-With two basic classes - configuration and model - and an additional preprocessing class (tokenizer, feature extractor, or processor), you can create any of the models supported by 🤗 Transformers. Each of these base classes are configurable, allowing you to use the specific attributes you want. You can easily setup a model for training or modify an existing pretrained model to fine-tune.
+With two basic classes - configuration and model - and an additional preprocessing class (tokenizer, image processor, feature extractor, or processor), you can create any of the models supported by 🤗 Transformers. Each of these base classes are configurable, allowing you to use the specific attributes you want. You can easily setup a model for training or modify an existing pretrained model to fine-tune.
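A hedged sketch of what using the combined processor then looks like; the checkpoint and the silent input array are assumptions:

```py
>>> import numpy as np

>>> from transformers import Wav2Vec2Processor

>>> # assumed checkpoint for illustration; it ships both a feature extractor and a tokenizer
>>> processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
>>> raw_audio = np.zeros(16000, dtype=np.float32)
>>> inputs = processor(raw_audio, sampling_rate=16000, return_tensors="pt")  # forwarded to the feature extractor
>>> inputs.input_values.shape
torch.Size([1, 16000])
```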
@@ -299,7 +299,7 @@ whole text, individual words).
 
 ### pixel values
 
-A tensor of the numerical representations of an image that is passed to a model. The pixel values have a shape of [`batch_size`, `num_channels`, `height`, `width`], and are generated from a feature extractor.
+A tensor of the numerical representations of an image that is passed to a model. The pixel values have a shape of [`batch_size`, `num_channels`, `height`, `width`], and are generated from an image processor.
 
 ### pooling
 
@@ -20,8 +20,8 @@ Processors can mean two different things in the Transformers library:
 ## Multi-modal processors
 
 Any multi-modal model will require an object to encode or decode the data that groups several modalities (among text,
-vision and audio). This is handled by objects called processors, which group tokenizers (for the text modality) and
-feature extractors (for vision and audio).
+vision and audio). This is handled by objects called processors, which group together two or more processing objects
+such as tokenizers (for the text modality), image processors (for vision) and feature extractors (for audio).
 
 Those processors inherit from the following base class that implements the saving and loading functionality:
 
@@ -40,12 +40,12 @@ Tips:
 - BEiT models are regular Vision Transformers, but pre-trained in a self-supervised way rather than supervised. They
   outperform both the [original model (ViT)](vit) as well as [Data-efficient Image Transformers (DeiT)](deit) when fine-tuned on ImageNet-1K and CIFAR-100. You can check out demo notebooks regarding inference as well as
   fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace
-  [`ViTFeatureExtractor`] by [`BeitFeatureExtractor`] and
+  [`ViTFeatureExtractor`] by [`BeitImageProcessor`] and
   [`ViTForImageClassification`] by [`BeitForImageClassification`]).
 - There's also a demo notebook available which showcases how to combine DALL-E's image tokenizer with BEiT for
   performing masked image modeling. You can find it [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/BEiT).
 - As the BEiT models expect each image to be of the same size (resolution), one can use
-  [`BeitFeatureExtractor`] to resize (or rescale) and normalize images for the model.
+  [`BeitImageProcessor`] to resize (or rescale) and normalize images for the model.
 - Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
   each checkpoint. For example, `microsoft/beit-base-patch16-224` refers to a base-sized architecture with patch
   resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=microsoft/beit).
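For illustration, a sketch of the swap described in the tip above; the checkpoint choice is an assumption:

```py
>>> import torch
>>> import requests
>>> from PIL import Image

>>> from transformers import BeitImageProcessor, BeitForImageClassification

>>> # assumed checkpoint for illustration
>>> image_processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
>>> model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = image_processor(image, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits
>>> predicted_label = model.config.id2label[logits.argmax(-1).item()]
```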
@@ -32,7 +32,7 @@ a crucial component in existing Vision Transformers, can be safely removed in ou
 Tips:
 
 - CvT models are regular Vision Transformers, but trained with convolutions. They outperform the [original model (ViT)](vit) when fine-tuned on ImageNet-1K and CIFAR-100.
-- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace [`ViTFeatureExtractor`] by [`AutoFeatureExtractor`] and [`ViTForImageClassification`] by [`CvtForImageClassification`]).
+- You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer) (you can just replace [`ViTFeatureExtractor`] by [`AutoImageProcessor`] and [`ViTForImageClassification`] by [`CvtForImageClassification`]).
 - The available checkpoints are either (1) pre-trained on [ImageNet-22k](http://www.image-net.org/) (a collection of 14 million images and 22k classes) only, (2) also fine-tuned on ImageNet-22k or (3) also fine-tuned on [ImageNet-1k](http://www.image-net.org/challenges/LSVRC/2012/) (also referred to as ILSVRC 2012, a collection of 1.3 million
   images and 1,000 classes).
 
@@ -23,7 +23,7 @@ The abstract from the paper is the following:
 
 Tips:
 
-- One can use [`DeformableDetrFeatureExtractor`] to prepare images (and optional targets) for the model.
+- One can use [`DeformableDetrImageProcessor`] to prepare images (and optional targets) for the model.
 - Training Deformable DETR is equivalent to training the original [DETR](detr) model. Demo notebooks can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR).
 
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/deformable_detr_architecture.png"
@@ -66,7 +66,7 @@ Tips:
   augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
   (while only using ImageNet-1k for pre-training). There are 4 variants available (in 3 different sizes):
   *facebook/deit-tiny-patch16-224*, *facebook/deit-small-patch16-224*, *facebook/deit-base-patch16-224* and
-  *facebook/deit-base-patch16-384*. Note that one should use [`DeiTFeatureExtractor`] in order to
+  *facebook/deit-base-patch16-384*. Note that one should use [`DeiTImageProcessor`] in order to
   prepare images for the model.
 
 This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [amyeroberts](https://huggingface.co/amyeroberts).
@@ -105,11 +105,11 @@ Tips:
 - DETR resizes the input images such that the shortest side is at least a certain amount of pixels while the longest is
   at most 1333 pixels. At training time, scale augmentation is used such that the shortest side is randomly set to at
   least 480 and at most 800 pixels. At inference time, the shortest side is set to 800. One can use
-  [`~transformers.DetrFeatureExtractor`] to prepare images (and optional annotations in COCO format) for the
+  [`~transformers.DetrImageProcessor`] to prepare images (and optional annotations in COCO format) for the
   model. Due to this resizing, images in a batch can have different sizes. DETR solves this by padding images up to the
   largest size in a batch, and by creating a pixel mask that indicates which pixels are real/which are padding.
   Alternatively, one can also define a custom `collate_fn` in order to batch images together, using
-  [`~transformers.DetrFeatureExtractor.pad_and_create_pixel_mask`].
+  [`~transformers.DetrImageProcessor.pad_and_create_pixel_mask`].
 - The size of the images will determine the amount of memory being used, and will thus determine the `batch_size`.
   It is advised to use a batch size of 2 per GPU. See [this Github thread](https://github.com/facebookresearch/detr/issues/150) for more info.
 
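A hedged sketch of the preparation step the tip above describes; the checkpoint is an assumption and the image is the example COCO image used elsewhere in these docs:

```py
>>> import torch
>>> import requests
>>> from PIL import Image

>>> from transformers import DetrImageProcessor, DetrForObjectDetection

>>> # assumed checkpoint for illustration
>>> image_processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
>>> model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

>>> image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
>>> inputs = image_processor(images=image, return_tensors="pt")  # produces pixel_values and pixel_mask
>>> with torch.no_grad():
...     outputs = model(**inputs)
```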
@@ -142,14 +142,14 @@ As a summary, consider the following table:
 | **Description** | Predicting bounding boxes and class labels around objects in an image | Predicting masks around objects (i.e. instances) in an image | Predicting masks around both objects (i.e. instances) as well as "stuff" (i.e. background things like trees and roads) in an image |
 | **Model** | [`~transformers.DetrForObjectDetection`] | [`~transformers.DetrForSegmentation`] | [`~transformers.DetrForSegmentation`] |
 | **Example dataset** | COCO detection | COCO detection, COCO panoptic | COCO panoptic | |
-| **Format of annotations to provide to** [`~transformers.DetrFeatureExtractor`] | {'image_id': `int`, 'annotations': `List[Dict]`} each Dict being a COCO object annotation | {'image_id': `int`, 'annotations': `List[Dict]`} (in case of COCO detection) or {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} (in case of COCO panoptic) | {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} and masks_path (path to directory containing PNG files of the masks) |
-| **Postprocessing** (i.e. converting the output of the model to COCO API) | [`~transformers.DetrFeatureExtractor.post_process`] | [`~transformers.DetrFeatureExtractor.post_process_segmentation`] | [`~transformers.DetrFeatureExtractor.post_process_segmentation`], [`~transformers.DetrFeatureExtractor.post_process_panoptic`] |
+| **Format of annotations to provide to** [`~transformers.DetrImageProcessor`] | {'image_id': `int`, 'annotations': `List[Dict]`} each Dict being a COCO object annotation | {'image_id': `int`, 'annotations': `List[Dict]`} (in case of COCO detection) or {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} (in case of COCO panoptic) | {'file_name': `str`, 'image_id': `int`, 'segments_info': `List[Dict]`} and masks_path (path to directory containing PNG files of the masks) |
+| **Postprocessing** (i.e. converting the output of the model to COCO API) | [`~transformers.DetrImageProcessor.post_process`] | [`~transformers.DetrImageProcessor.post_process_segmentation`] | [`~transformers.DetrImageProcessor.post_process_segmentation`], [`~transformers.DetrImageProcessor.post_process_panoptic`] |
 | **evaluators** | `CocoEvaluator` with `iou_types="bbox"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"` | `CocoEvaluator` with `iou_types="bbox"` or `"segm"`, `PanopticEvaluator` |
 
 In short, one should prepare the data either in COCO detection or COCO panoptic format, then use
-[`~transformers.DetrFeatureExtractor`] to create `pixel_values`, `pixel_mask` and optional
+[`~transformers.DetrImageProcessor`] to create `pixel_values`, `pixel_mask` and optional
 `labels`, which can then be used to train (or fine-tune) a model. For evaluation, one should first convert the
-outputs of the model using one of the postprocessing methods of [`~transformers.DetrFeatureExtractor`]. These can
+outputs of the model using one of the postprocessing methods of [`~transformers.DetrImageProcessor`]. These can
 be provided to either `CocoEvaluator` or `PanopticEvaluator`, which allow you to calculate metrics like
 mean Average Precision (mAP) and Panoptic Quality (PQ). The latter objects are implemented in the [original repository](https://github.com/facebookresearch/detr). See the [example notebooks](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DETR) for more info regarding evaluation.
 
@@ -32,7 +32,7 @@ The abstract from the paper is the following:
 Tips:
 
 - A notebook illustrating inference with [`GLPNForDepthEstimation`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/GLPN/GLPN_inference_(depth_estimation).ipynb).
-- One can use [`GLPNFeatureExtractor`] to prepare images for the model.
+- One can use [`GLPNImageProcessor`] to prepare images for the model.
 
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/glpn_architecture.jpg"
 alt="drawing" width="600"/>
@@ -49,7 +49,7 @@ Tips:
   applied k-means clustering to the (R,G,B) pixel values with k=512. This way, we only have a 32*32 = 1024-long
   sequence, but now of integers in the range 0..511. So we are shrinking the sequence length at the cost of a bigger
   embedding matrix. In other words, the vocabulary size of ImageGPT is 512, + 1 for a special "start of sentence" (SOS)
-  token, used at the beginning of every sequence. One can use [`ImageGPTFeatureExtractor`] to prepare
+  token, used at the beginning of every sequence. One can use [`ImageGPTImageProcessor`] to prepare
   images for the model.
 - Despite being pre-trained entirely unsupervised (i.e. without the use of any labels), ImageGPT produces fairly
   performant image features useful for downstream tasks, such as image classification. The authors showed that the
@@ -53,11 +53,11 @@ Tips:
   Techniques like data augmentation, optimization, and regularization were used in order to simulate training on a much larger dataset
   (while only using ImageNet-1k for pre-training). The 5 variants available are (all trained on images of size 224x224):
   *facebook/levit-128S*, *facebook/levit-128*, *facebook/levit-192*, *facebook/levit-256* and
-  *facebook/levit-384*. Note that one should use [`LevitFeatureExtractor`] in order to
+  *facebook/levit-384*. Note that one should use [`LevitImageProcessor`] in order to
   prepare images for the model.
 - [`LevitForImageClassificationWithTeacher`] currently supports only inference and not training or fine-tuning.
 - You can check out demo notebooks regarding inference as well as fine-tuning on custom data [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VisionTransformer)
-  (you can just replace [`ViTFeatureExtractor`] by [`LevitFeatureExtractor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`]).
+  (you can just replace [`ViTFeatureExtractor`] by [`LevitImageProcessor`] and [`ViTForImageClassification`] by [`LevitForImageClassification`] or [`LevitForImageClassificationWithTeacher`]).
 
 This model was contributed by [anugunj](https://huggingface.co/anugunj). The original code can be found [here](https://github.com/facebookresearch/LeViT).
 
@@ -32,8 +32,8 @@ Tips:
 - If you want to train the model in a distributed environment across multiple nodes, then one should update the
   `get_num_masks` function inside the `MaskFormerLoss` class of `modeling_maskformer.py`. When training on multiple nodes, this should be
   set to the average number of target masks across all nodes, as can be seen in the original implementation [here](https://github.com/facebookresearch/MaskFormer/blob/da3e60d85fdeedcb31476b5edd7d328826ce56cc/mask_former/modeling/criterion.py#L169).
-- One can use [`MaskFormerFeatureExtractor`] to prepare images and optional targets for the model.
-- To get the final segmentation, depending on the task, you can call [`~MaskFormerFeatureExtractor.post_process_semantic_segmentation`] or [`~MaskFormerFeatureExtractor.post_process_panoptic_segmentation`]. Both tasks can be solved using [`MaskFormerForInstanceSegmentation`] output; panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together.
+- One can use [`MaskFormerImageProcessor`] to prepare images and optional targets for the model.
+- To get the final segmentation, depending on the task, you can call [`~MaskFormerImageProcessor.post_process_semantic_segmentation`] or [`~MaskFormerImageProcessor.post_process_panoptic_segmentation`]. Both tasks can be solved using [`MaskFormerForInstanceSegmentation`] output; panoptic segmentation accepts an optional `label_ids_to_fuse` argument to fuse instances of the target object/s (e.g. sky) together.
 
 The figure below illustrates the architecture of MaskFormer. Taken from the [original paper](https://arxiv.org/abs/2107.06278).
 
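A hedged sketch of the post-processing call mentioned above; the checkpoint and the exact keyword arguments are assumptions:

```py
>>> import torch
>>> import requests
>>> from PIL import Image

>>> from transformers import MaskFormerImageProcessor, MaskFormerForInstanceSegmentation

>>> # assumed checkpoint for illustration
>>> image_processor = MaskFormerImageProcessor.from_pretrained("facebook/maskformer-swin-base-ade")
>>> model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-ade")

>>> image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> # one (height, width) map of semantic class ids per image in the batch
>>> semantic_map = image_processor.post_process_semantic_segmentation(
...     outputs, target_sizes=[image.size[::-1]]
... )[0]
```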
@@ -26,7 +26,7 @@ Tips:
 
 - Even though the checkpoint is trained on images of specific size, the model will work on images of any size. The smallest supported image size is 32x32.
 
-- One can use [`MobileNetV1FeatureExtractor`] to prepare images for the model.
+- One can use [`MobileNetV1ImageProcessor`] to prepare images for the model.
 
 - The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes). However, the model predicts 1001 classes: the 1000 classes from ImageNet plus an extra “background” class (index 0).
 
@@ -28,7 +28,7 @@ Tips:
 
 - Even though the checkpoint is trained on images of specific size, the model will work on images of any size. The smallest supported image size is 32x32.
 
-- One can use [`MobileNetV2FeatureExtractor`] to prepare images for the model.
+- One can use [`MobileNetV2ImageProcessor`] to prepare images for the model.
 
 - The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes). However, the model predicts 1001 classes: the 1000 classes from ImageNet plus an extra “background” class (index 0).
 
@@ -23,7 +23,7 @@ The abstract from the paper is the following:
 Tips:
 
 - MobileViT is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, there are no embeddings. The backbone model outputs a feature map. You can follow [this tutorial](https://keras.io/examples/vision/mobilevit) for a lightweight introduction.
-- One can use [`MobileViTFeatureExtractor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB).
+- One can use [`MobileViTImageProcessor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB).
 - The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
 - The segmentation model uses a [DeepLabV3](https://arxiv.org/abs/1706.05587) head. The available semantic segmentation checkpoints are pre-trained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/).
 - As the name suggests MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with [TensorFlow Lite](https://www.tensorflow.org/lite).
@@ -28,7 +28,7 @@ The figure below illustrates the architecture of PoolFormer. Taken from the [ori
 Tips:
 
 - PoolFormer has a hierarchical architecture, where instead of Attention, a simple Average Pooling layer is present. All checkpoints of the model can be found on the [hub](https://huggingface.co/models?other=poolformer).
-- One can use [`PoolFormerFeatureExtractor`] to prepare images for the model.
+- One can use [`PoolFormerImageProcessor`] to prepare images for the model.
 - As most models, PoolFormer comes in different sizes, the details of which can be found in the table below.
 
 | **Model variant** | **Depths** | **Hidden sizes** | **Params (M)** | **ImageNet-1k Top 1** |
@@ -24,7 +24,7 @@ The abstract from the paper is the following:
 
 Tips:
 
-- One can use [`AutoFeatureExtractor`] to prepare images for the model.
+- One can use [`AutoImageProcessor`] to prepare images for the model.
 - The huge 10B model from [Self-supervised Pretraining of Visual Features in the Wild](https://arxiv.org/abs/2103.01988), trained on one billion Instagram images, is available on the [hub](https://huggingface.co/facebook/regnet-y-10b-seer)
 
 This model was contributed by [Francesco](https://huggingface.co/Francesco). The TensorFlow version of the model
@@ -25,7 +25,7 @@ The depth of representations is of central importance for many visual recognitio
 
 Tips:
 
-- One can use [`AutoFeatureExtractor`] to prepare images for the model.
+- One can use [`AutoImageProcessor`] to prepare images for the model.
 
 The figure below illustrates the architecture of ResNet. Taken from the [original paper](https://arxiv.org/abs/1512.03385).
 
@@ -56,12 +56,12 @@ Tips:
 - One can also check out [this interactive demo on Hugging Face Spaces](https://huggingface.co/spaces/chansung/segformer-tf-transformers)
   to try out a SegFormer model on custom images.
 - SegFormer works on any input size, as it pads the input to be divisible by `config.patch_sizes`.
-- One can use [`SegformerFeatureExtractor`] to prepare images and corresponding segmentation maps
-  for the model. Note that this feature extractor is fairly basic and does not include all data augmentations used in
+- One can use [`SegformerImageProcessor`] to prepare images and corresponding segmentation maps
+  for the model. Note that this image processor is fairly basic and does not include all data augmentations used in
   the original paper. The original preprocessing pipelines (for the ADE20k dataset for instance) can be found [here](https://github.com/NVlabs/SegFormer/blob/master/local_configs/_base_/datasets/ade20k_repeat.py). The most
   important preprocessing step is that images and segmentation maps are randomly cropped and padded to the same size,
   such as 512x512 or 640x640, after which they are normalized.
-- One additional thing to keep in mind is that one can initialize [`SegformerFeatureExtractor`] with
+- One additional thing to keep in mind is that one can initialize [`SegformerImageProcessor`] with
   `reduce_labels` set to `True` or `False`. In some datasets (like ADE20k), the 0 index is used in the annotated
   segmentation maps for background. However, ADE20k doesn't include the "background" class in its 150 labels.
   Therefore, `reduce_labels` is used to reduce all labels by 1, and to make sure no loss is computed for the
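For illustration, a sketch of the `reduce_labels` initialization discussed above; the checkpoint name and the commented call are assumptions:

```py
>>> from transformers import SegformerImageProcessor

>>> # assumed checkpoint for illustration
>>> image_processor = SegformerImageProcessor.from_pretrained(
...     "nvidia/segformer-b0-finetuned-ade-512-512", reduce_labels=True
... )

>>> # images and their segmentation maps can then be prepared together, e.g. (assumed call):
>>> # inputs = image_processor(images=image, segmentation_maps=seg_map, return_tensors="pt")
```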
@@ -33,7 +33,7 @@ prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO
 The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.*
 
 Tips:
-- One can use the [`AutoFeatureExtractor`] API to prepare images for the model.
+- One can use the [`AutoImageProcessor`] API to prepare images for the model.
 - Swin pads the inputs supporting any input height and width (if divisible by `32`).
 - Swin can be used as a *backbone*. When `output_hidden_states = True`, it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch, num_channels, height, width)` rather than `(batch_size, sequence_length, num_channels)`.
 
@@ -21,7 +21,7 @@ The abstract from the paper is the following:
 *Large-scale NLP models have been shown to significantly improve the performance on language tasks with no signs of saturation. They also demonstrate amazing few-shot capabilities like that of human beings. This paper aims to explore large-scale models in computer vision. We tackle three major issues in training and application of large vision models, including training instability, resolution gaps between pre-training and fine-tuning, and hunger on labelled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images. Through these techniques, this paper successfully trained a 3 billion-parameter Swin Transformer V2 model, which is the largest dense vision model to date, and makes it capable of training with images of up to 1,536×1,536 resolution. It set new performance records on 4 representative vision tasks, including ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification. Also note our training is much more efficient than that in Google's billion-level visual models, which consumes 40 times less labelled data and 40 times less training time.*
 
 Tips:
-- One can use the [`AutoFeatureExtractor`] API to prepare images for the model.
+- One can use the [`AutoImageProcessor`] API to prepare images for the model.
 
 This model was contributed by [nandwalritik](https://huggingface.co/nandwalritik).
 The original code can be found [here](https://github.com/microsoft/Swin-Transformer).
@@ -32,7 +32,7 @@ special customization for these tasks.*
 Tips:
 
 - The authors released 2 models, one for [table detection](https://huggingface.co/microsoft/table-transformer-detection) in documents, one for [table structure recognition](https://huggingface.co/microsoft/table-transformer-structure-recognition) (the task of recognizing the individual rows, columns etc. in a table).
-- One can use the [`AutoFeatureExtractor`] API to prepare images and optional targets for the model. This will load a [`DetrFeatureExtractor`] behind the scenes.
+- One can use the [`AutoImageProcessor`] API to prepare images and optional targets for the model. This will load a [`DetrImageProcessor`] behind the scenes.
 
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/table_transformer_architecture.jpeg"
 alt="drawing" width="600"/>
@@ -55,9 +55,9 @@ Tips:
 TrOCR's [`VisionEncoderDecoder`] model accepts images as input and makes use of
 [`~generation.GenerationMixin.generate`] to autoregressively generate text given the input image.
 
-The [`ViTFeatureExtractor`/`DeiTFeatureExtractor`] class is responsible for preprocessing the input image and
+The [`ViTImageProcessor`/`DeiTImageProcessor`] class is responsible for preprocessing the input image and
 [`RobertaTokenizer`/`XLMRobertaTokenizer`] decodes the generated target tokens to the target string. The
-[`TrOCRProcessor`] wraps [`ViTFeatureExtractor`/`DeiTFeatureExtractor`] and [`RobertaTokenizer`/`XLMRobertaTokenizer`]
+[`TrOCRProcessor`] wraps [`ViTImageProcessor`/`DeiTImageProcessor`] and [`RobertaTokenizer`/`XLMRobertaTokenizer`]
 into a single instance to both extract the input features and decode the predicted token ids.
 
 - Step-by-step Optical Character Recognition (OCR)
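A hedged sketch of the OCR flow these classes implement; the checkpoint is an assumption and the image path is a placeholder:

```py
>>> from PIL import Image

>>> from transformers import TrOCRProcessor, VisionEncoderDecoderModel

>>> # assumed checkpoint for illustration
>>> processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
>>> model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

>>> image = Image.open("text_line.png").convert("RGB")  # placeholder path to an image of a single text line

>>> pixel_values = processor(image, return_tensors="pt").pixel_values
>>> generated_ids = model.generate(pixel_values)
>>> generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```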
@@ -23,7 +23,7 @@ The abstract from the paper is the following:
 
 Tips:
 
-- One can use [`VideoMAEFeatureExtractor`] to prepare videos for the model. It will resize + normalize all frames of a video for you.
+- One can use [`VideoMAEImageProcessor`] to prepare videos for the model. It will resize + normalize all frames of a video for you.
 - [`VideoMAEForPreTraining`] includes the decoder on top for self-supervised pre-training.
 
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/videomae_architecture.jpeg"
@@ -68,17 +68,17 @@ To perform inference, one uses the [`generate`] method, which allows to autoregr
 >>> import requests
 >>> from PIL import Image
 
->>> from transformers import GPT2TokenizerFast, ViTFeatureExtractor, VisionEncoderDecoderModel
+>>> from transformers import GPT2TokenizerFast, ViTImageProcessor, VisionEncoderDecoderModel
 
->>> # load a fine-tuned image captioning model and corresponding tokenizer and feature extractor
+>>> # load a fine-tuned image captioning model and corresponding tokenizer and image processor
 >>> model = VisionEncoderDecoderModel.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
 >>> tokenizer = GPT2TokenizerFast.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
->>> feature_extractor = ViTFeatureExtractor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
+>>> image_processor = ViTImageProcessor.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
 
 >>> # let's perform inference on an image
 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)
->>> pixel_values = feature_extractor(image, return_tensors="pt").pixel_values
+>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values
 
 >>> # autoregressively generate caption (uses greedy decoding by default)
 >>> generated_ids = model.generate(pixel_values)
@ -115,10 +115,10 @@ As you can see, only 2 inputs are required for the model in order to compute a l
|
|||||||
images) and `labels` (which are the `input_ids` of the encoded target sequence).
|
images) and `labels` (which are the `input_ids` of the encoded target sequence).
|
||||||
|
|
||||||
```python
|
```python
|
||||||
>>> from transformers import ViTFeatureExtractor, BertTokenizer, VisionEncoderDecoderModel
|
>>> from transformers import ViTImageProcessor, BertTokenizer, VisionEncoderDecoderModel
|
||||||
>>> from datasets import load_dataset
|
>>> from datasets import load_dataset
|
||||||
|
|
||||||
>>> feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
|
>>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
|
||||||
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
|
>>> tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
|
||||||
>>> model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
|
>>> model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
|
||||||
... "google/vit-base-patch16-224-in21k", "bert-base-uncased"
|
... "google/vit-base-patch16-224-in21k", "bert-base-uncased"
|
||||||
@ -129,7 +129,7 @@ images) and `labels` (which are the `input_ids` of the encoded target sequence).
|
|||||||
|
|
||||||
>>> dataset = load_dataset("huggingface/cats-image")
|
>>> dataset = load_dataset("huggingface/cats-image")
|
||||||
>>> image = dataset["test"]["image"][0]
|
>>> image = dataset["test"]["image"][0]
|
||||||
>>> pixel_values = feature_extractor(image, return_tensors="pt").pixel_values
|
>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values
|
||||||
|
|
||||||
>>> labels = tokenizer(
|
>>> labels = tokenizer(
|
||||||
... "an image of two cats chilling on a couch",
|
... "an image of two cats chilling on a couch",
|
||||||
|
@ -53,7 +53,7 @@ vectors to a standard BERT model. The text input is concatenated in the front of
|
|||||||
layer, and is expected to be bound by [CLS] and a [SEP] tokens, as in BERT. The segment IDs must also be set
|
layer, and is expected to be bound by [CLS] and a [SEP] tokens, as in BERT. The segment IDs must also be set
|
||||||
appropriately for the textual and visual parts.
|
appropriately for the textual and visual parts.
|
||||||
|
|
||||||
The [`BertTokenizer`] is used to encode the text. A custom detector/feature extractor must be used
|
The [`BertTokenizer`] is used to encode the text. A custom detector/image processor must be used
|
||||||
to get the visual embeddings. The following example notebooks show how to use VisualBERT with Detectron-like models:
|
to get the visual embeddings. The following example notebooks show how to use VisualBERT with Detectron-like models:
|
||||||
|
|
||||||
- [VisualBERT VQA demo notebook](https://github.com/huggingface/transformers/tree/main/examples/research_projects/visual_bert) : This notebook
|
- [VisualBERT VQA demo notebook](https://github.com/huggingface/transformers/tree/main/examples/research_projects/visual_bert) : This notebook
|
||||||
|
@@ -40,7 +40,7 @@ Tips:
 used for classification. The authors also add absolute position embeddings, and feed the resulting sequence of
 vectors to a standard Transformer encoder.
 - As the Vision Transformer expects each image to be of the same size (resolution), one can use
-[`ViTFeatureExtractor`] to resize (or rescale) and normalize images for the model.
+[`ViTImageProcessor`] to resize (or rescale) and normalize images for the model.
 - Both the patch resolution and image resolution used during pre-training or fine-tuning are reflected in the name of
 each checkpoint. For example, `google/vit-base-patch16-224` refers to a base-sized architecture with patch
 resolution of 16x16 and fine-tuning resolution of 224x224. All checkpoints can be found on the [hub](https://huggingface.co/models?search=vit).
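The tip above about resizing and normalizing with [`ViTImageProcessor`] looks roughly like this in practice; the checkpoint name and image URL are illustrative assumptions:

```python
>>> import requests
>>> from PIL import Image
>>> from transformers import ViTImageProcessor, ViTForImageClassification

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> inputs = image_processor(images=image, return_tensors="pt")  # resized to 224x224 and normalized
>>> logits = model(**inputs).logits
>>> predicted_class = model.config.id2label[logits.argmax(-1).item()]
```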
@@ -67,7 +67,7 @@ Following the original Vision Transformer, some follow-up works have been made:
 The authors of DeiT also released more efficiently trained ViT models, which you can directly plug into [`ViTModel`] or
 [`ViTForImageClassification`]. There are 4 variants available (in 3 different sizes): *facebook/deit-tiny-patch16-224*,
 *facebook/deit-small-patch16-224*, *facebook/deit-base-patch16-224* and *facebook/deit-base-patch16-384*. Note that one should
-use [`DeiTFeatureExtractor`] in order to prepare images for the model.
+use [`DeiTImageProcessor`] in order to prepare images for the model.

 - [BEiT](beit) (BERT pre-training of Image Transformers) by Microsoft Research. BEiT models outperform supervised pre-trained
 vision transformers using a self-supervised method inspired by BERT (masked image modeling) and based on a VQ-VAE.
@@ -37,7 +37,7 @@ One can easily tweak it for their own use case.
 - A notebook that illustrates how to visualize reconstructed pixel values with [`ViTMAEForPreTraining`] can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb).
 - After pre-training, one "throws away" the decoder used to reconstruct pixels, and one uses the encoder for fine-tuning/linear probing. This means that after
 fine-tuning, one can directly plug in the weights into a [`ViTForImageClassification`].
-- One can use [`ViTFeatureExtractor`] to prepare images for the model. See the code examples for more info.
+- One can use [`ViTImageProcessor`] to prepare images for the model. See the code examples for more info.
 - Note that the encoder of MAE is only used to encode the visual patches. The encoded patches are then concatenated with mask tokens, which the decoder (which also
 consists of Transformer blocks) takes as input. Each mask token is a shared, learned vector that indicates the presence of a missing patch to be predicted. Fixed
 sin/cos position embeddings are added both to the input of the encoder and the decoder.
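A hedged sketch of the preprocessing tip above, preparing an image with [`ViTImageProcessor`] and running the self-supervised [`ViTMAEForPreTraining`] model; the checkpoint name and image URL are assumptions for illustration:

```python
>>> import requests
>>> from PIL import Image
>>> from transformers import ViTImageProcessor, ViTMAEForPreTraining

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = ViTImageProcessor.from_pretrained("facebook/vit-mae-base")
>>> model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")

>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> loss = outputs.loss  # reconstruction loss on the masked patches
>>> mask = outputs.mask  # which patches were masked out
```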
@@ -23,7 +23,7 @@ The abstract from the paper is the following:

 Tips:

-- One can use [`YolosFeatureExtractor`] for preparing images (and optional targets) for the model. Contrary to [DETR](detr), YOLOS doesn't require a `pixel_mask` to be created.
+- One can use [`YolosImageProcessor`] for preparing images (and optional targets) for the model. Contrary to [DETR](detr), YOLOS doesn't require a `pixel_mask` to be created.
 - Demo notebooks (regarding inference and fine-tuning on custom data) can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/YOLOS).

 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/yolos_architecture.png"
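A short sketch of the tip above: preparing an image with the renamed image processor (no `pixel_mask` needed) and post-processing the detections. The checkpoint name, threshold and image URL are illustrative assumptions:

```python
>>> import torch
>>> import requests
>>> from PIL import Image
>>> from transformers import AutoImageProcessor, AutoModelForObjectDetection

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("hustvl/yolos-tiny")
>>> model = AutoModelForObjectDetection.from_pretrained("hustvl/yolos-tiny")

>>> inputs = image_processor(images=image, return_tensors="pt")  # no pixel_mask required
>>> outputs = model(**inputs)

>>> target_sizes = torch.tensor([image.size[::-1]])
>>> results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[0]
>>> for score, label in zip(results["scores"], results["labels"]):
...     print(model.config.id2label[label.item()], round(score.item(), 3))
```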
@@ -24,7 +24,7 @@ The library was designed with two strong goals in mind:

 - We strongly limited the number of user-facing abstractions to learn, in fact, there are almost no abstractions,
 just three standard classes required to use each model: [configuration](main_classes/configuration),
-[models](main_classes/model), and a preprocessing class ([tokenizer](main_classes/tokenizer) for NLP, [feature extractor](main_classes/feature_extractor) for vision and audio, and [processor](main_classes/processors) for multimodal inputs).
+[models](main_classes/model), and a preprocessing class ([tokenizer](main_classes/tokenizer) for NLP, [image processor](main_classes/image_processor) for vision, [feature extractor](main_classes/feature_extractor) for audio, and [processor](main_classes/processors) for multimodal inputs).
 - All of these classes can be initialized in a simple and unified way from pretrained instances by using a common
 `from_pretrained()` method which downloads (if needed), caches and
 loads the related class instance and associated data (configurations' hyperparameters, tokenizers' vocabulary,
@@ -62,7 +62,7 @@ The library is built around three types of classes for each model:

 - **Model classes** can be PyTorch models ([torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module)), Keras models ([tf.keras.Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model)) or JAX/Flax models ([flax.linen.Module](https://flax.readthedocs.io/en/latest/api_reference/flax.linen.html)) that work with the pretrained weights provided in the library.
 - **Configuration classes** store the hyperparameters required to build a model (such as the number of layers and hidden size). You don't always need to instantiate these yourself. In particular, if you are using a pretrained model without any modification, creating the model will automatically take care of instantiating the configuration (which is part of the model).
-- **Preprocessing classes** convert the raw data into a format accepted by the model. A [tokenizer](main_classes/tokenizer) stores the vocabulary for each model and provide methods for encoding and decoding strings in a list of token embedding indices to be fed to a model. [Feature extractors](main_classes/feature_extractor) preprocess audio or vision inputs, and a [processor](main_classes/processors) handles multimodal inputs.
+- **Preprocessing classes** convert the raw data into a format accepted by the model. A [tokenizer](main_classes/tokenizer) stores the vocabulary for each model and provide methods for encoding and decoding strings in a list of token embedding indices to be fed to a model. [Image processors](main_classes/image_processor) preprocess vision inputs, [feature extractors](main_classes/feature_extractor) preprocess audio inputs, and a [processor](main_classes/processors) handles multimodal inputs.

 All these classes can be instantiated from pretrained instances, saved locally, and shared on the Hub with three methods:

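The three methods mentioned above look roughly like this for an image processor and a model; the checkpoint name and local path are illustrative assumptions:

```python
>>> from transformers import AutoImageProcessor, AutoModelForImageClassification

>>> # from_pretrained() downloads (if needed), caches and loads the class instance and its data
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

>>> # save_pretrained() writes the same files to a local directory so they can be reloaded later
>>> image_processor.save_pretrained("./my-vit-checkpoint")
>>> model.save_pretrained("./my-vit-checkpoint")

>>> # push_to_hub() uploads them to the Hub (requires authentication), e.g.:
>>> # model.push_to_hub("my-vit-checkpoint")
```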
@@ -19,11 +19,11 @@ Before you can train a model on a dataset, it needs to be preprocessed into the
 * Text, use a [Tokenizer](./main_classes/tokenizer) to convert text into a sequence of tokens, create a numerical representation of the tokens, and assemble them into tensors.
 * Image inputs use a [ImageProcessor](./main_classes/image) to convert images into tensors.
 * Speech and audio, use a [Feature extractor](./main_classes/feature_extractor) to extract sequential features from audio waveforms and convert them into tensors.
-* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor.
+* Multimodal inputs, use a [Processor](./main_classes/processors) to combine a tokenizer and a feature extractor or image processor.

 <Tip>

-`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, feature extractor or processor.
+`AutoProcessor` **always** works and automatically chooses the correct class for the model you're using, whether you're using a tokenizer, image processor, feature extractor or processor.

 </Tip>

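A compact sketch of the one-preprocessing-class-per-modality idea described above; all checkpoint names are illustrative assumptions:

```python
>>> from transformers import AutoTokenizer, AutoImageProcessor, AutoFeatureExtractor, AutoProcessor

>>> # text -> tokenizer, vision -> image processor, audio -> feature extractor, multimodal -> processor
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")  # feature extractor + tokenizer
```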
@@ -320,9 +320,9 @@ The sample lengths are now the same and match the specified maximum length. You

 ## Computer vision

-For computer vision tasks, you'll need a [feature extractor](main_classes/feature_extractor) to prepare your dataset for the model. The feature extractor is designed to extract features from images, and convert them into tensors.
+For computer vision tasks, you'll need an [image processor](main_classes/image_processor) to prepare your dataset for the model. The image processor is designed to preprocess images, and convert them into tensors.

-Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a feature extractor with computer vision datasets:
+Load the [food101](https://huggingface.co/datasets/food101) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use an image processor with computer vision datasets:

 <Tip>

@@ -346,17 +346,17 @@ Next, take a look at the image with 🤗 Datasets [`Image`](https://huggingface.
 <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/vision-preprocess-tutorial.png"/>
 </div>

-Load the feature extractor with [`AutoFeatureExtractor.from_pretrained`]:
+Load the image processor with [`AutoImageProcessor.from_pretrained`]:

 ```py
->>> from transformers import AutoFeatureExtractor
+>>> from transformers import AutoImageProcessor

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
+>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
 ```

 For computer vision tasks, it is common to add some type of data augmentation to the images as a part of preprocessing. You can add augmentations with any library you'd like, but in this tutorial, you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module. If you're interested in using another data augmentation library, learn how in the [Albumentations](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_albumentations.ipynb) or [Kornia notebooks](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification_kornia.ipynb).

-1. Normalize the image with the feature extractor and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:
+1. Normalize the image with the image processor and use [`Compose`](https://pytorch.org/vision/master/generated/torchvision.transforms.Compose.html) to chain some transforms - [`RandomResizedCrop`](https://pytorch.org/vision/main/generated/torchvision.transforms.RandomResizedCrop.html) and [`ColorJitter`](https://pytorch.org/vision/main/generated/torchvision.transforms.ColorJitter.html) - together:

 ```py
 >>> from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor
@@ -370,7 +370,7 @@ For computer vision tasks, it is common to add some type of data augmentation to
 >>> _transforms = Compose([RandomResizedCrop(size), ColorJitter(brightness=0.5, hue=0.5), ToTensor(), normalize])
 ```

-2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input, which is generated by the feature extractor. Create a function that generates `pixel_values` from the transforms:
+2. The model accepts [`pixel_values`](model_doc/visionencoderdecoder#transformers.VisionEncoderDecoderModel.forward.pixel_values) as its input, which is generated by the image processor. Create a function that generates `pixel_values` from the transforms:

 ```py
 >>> def transforms(examples):
@@ -384,7 +384,7 @@ For computer vision tasks, it is common to add some type of data augmentation to
 >>> dataset.set_transform(transforms)
 ```

-4. Now when you access the image, you'll notice the feature extractor has added `pixel_values`. You can pass your processed dataset to the model now!
+4. Now when you access the image, you'll notice the image processor has added `pixel_values`. You can pass your processed dataset to the model now!

 ```py
 >>> dataset[0]["image"]
@@ -431,7 +431,7 @@ Here is what the image looks like after the transforms are applied. The image ha

 ## Multimodal

-For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples a tokenizer and feature extractor.
+For tasks involving multimodal inputs, you'll need a [processor](main_classes/processors) to prepare your dataset for the model. A processor couples together two processing objects such as a tokenizer and a feature extractor.

 Load the [LJ Speech](https://huggingface.co/datasets/lj_speech) dataset (see the 🤗 [Datasets tutorial](https://huggingface.co/docs/datasets/load_hub.html) for more details on how to load a dataset) to see how you can use a processor for automatic speech recognition (ASR):

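A hedged sketch of the coupling described above, using a speech recognition processor; the checkpoint name and the dummy audio array are assumptions for illustration:

```python
>>> import numpy as np
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")

>>> # the feature extractor half turns raw audio into model inputs ...
>>> audio = np.zeros(16000, dtype=np.float32)  # one second of silence sampled at 16 kHz
>>> inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

>>> # ... and the tokenizer half encodes the target transcription
>>> labels = processor(text="HELLO WORLD").input_ids
```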
@@ -225,7 +225,7 @@ A tokenizer can also accept a list of inputs, and pad and truncate the text to r

 <Tip>

-Check out the [preprocess](./preprocessing) tutorial for more details about tokenization, and how to use an [`AutoFeatureExtractor`] and [`AutoProcessor`] to preprocess image, audio, and multimodal inputs.
+Check out the [preprocess](./preprocessing) tutorial for more details about tokenization, and how to use an [`AutoImageProcessor`], [`AutoFeatureExtractor`] and [`AutoProcessor`] to preprocess image, audio, and multimodal inputs.

 </Tip>

@@ -424,7 +424,7 @@ Depending on your task, you'll typically pass the following parameters to [`Trai
 ... )
 ```

-3. A preprocessing class like a tokenizer, feature extractor, or processor:
+3. A preprocessing class like a tokenizer, image processor, feature extractor, or processor:

 ```py
 >>> from transformers import AutoTokenizer
@@ -501,7 +501,7 @@ All models are a standard [`tf.keras.Model`](https://www.tensorflow.org/api_docs
 >>> model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
 ```

-2. A preprocessing class like a tokenizer, feature extractor, or processor:
+2. A preprocessing class like a tokenizer, image processor, feature extractor, or processor:

 ```py
 >>> from transformers import AutoTokenizer
@@ -1101,24 +1101,24 @@ Class Egyptian cat with score 0.0239
 Class tiger cat with score 0.0229
 ```

-The general process for using a model and feature extractor for image classification is:
+The general process for using a model and image processor for image classification is:

-1. Instantiate a feature extractor and a model from the checkpoint name.
-2. Process the image to be classified with a feature extractor.
+1. Instantiate an image processor and a model from the checkpoint name.
+2. Process the image to be classified with an image processor.
 3. Pass the input through the model and take the `argmax` to retrieve the predicted class.
 4. Convert the class id to a class name with `id2label` to return an interpretable result.

 <frameworkcontent>
 <pt>
 ```py
->>> from transformers import AutoFeatureExtractor, AutoModelForImageClassification
+>>> from transformers import AutoImageProcessor, AutoModelForImageClassification
 >>> import torch
 >>> from datasets import load_dataset

 >>> dataset = load_dataset("huggingface/cats-image")
 >>> image = dataset["test"]["image"][0]

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
+>>> feature_extractor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
 >>> model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

 >>> inputs = feature_extractor(image, return_tensors="pt")
@@ -91,26 +91,26 @@ Now you can convert the label id to a label name:

 ## Preprocess

-The next step is to load a ViT feature extractor to process the image into a tensor:
+The next step is to load a ViT image processor to process the image into a tensor:

 ```py
->>> from transformers import AutoFeatureExtractor
+>>> from transformers import AutoImageProcessor

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
+>>> image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
 ```

 Apply some image transformations to the images to make the model more robust against overfitting. Here you'll use torchvision's [`transforms`](https://pytorch.org/vision/stable/transforms.html) module, but you can also use any image library you like.

 Crop a random part of the image, resize it, and normalize it with the image mean and standard deviation:

 ```py
 >>> from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

->>> normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
+>>> normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
 >>> size = (
-... feature_extractor.size["shortest_edge"]
-... if "shortest_edge" in feature_extractor.size
-... else (feature_extractor.size["height"], feature_extractor.size["width"])
+... image_processor.size["shortest_edge"]
+... if "shortest_edge" in image_processor.size
+... else (image_processor.size["height"], image_processor.size["width"])
 ... )
 >>> _transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])
 ```
@@ -213,7 +213,7 @@ At this point, only three steps remain:
 ... data_collator=data_collator,
 ... train_dataset=food["train"],
 ... eval_dataset=food["test"],
-... tokenizer=feature_extractor,
+... tokenizer=image_processor,
 ... compute_metrics=compute_metrics,
 ... )

@@ -266,14 +266,14 @@ You can also manually replicate the results of the `pipeline` if you'd like:

 <frameworkcontent>
 <pt>
-Load a feature extractor to preprocess the image and return the `input` as PyTorch tensors:
+Load an image processor to preprocess the image and return the `input` as PyTorch tensors:

 ```py
->>> from transformers import AutoFeatureExtractor
+>>> from transformers import AutoImageProcessor
 >>> import torch

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("my_awesome_food_model")
->>> inputs = feature_extractor(image, return_tensors="pt")
+>>> image_processor = AutoImageProcessor.from_pretrained("my_awesome_food_model")
+>>> inputs = image_processor(image, return_tensors="pt")
 ```

 Pass your inputs to the model and return the logits:
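Continuing the snippet above ("Pass your inputs to the model and return the logits"), a hedged sketch of the remaining step; `my_awesome_food_model` is the placeholder checkpoint used by the surrounding guide, and `torch` and `inputs` come from the block just shown:

```python
>>> from transformers import AutoModelForImageClassification

>>> model = AutoModelForImageClassification.from_pretrained("my_awesome_food_model")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> predicted_label = logits.argmax(-1).item()
>>> model.config.id2label[predicted_label]
```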
@@ -90,12 +90,12 @@ You'll also want to create a dictionary that maps a label id to a label class wh

 ## Preprocess

-The next step is to load a SegFormer feature extractor to prepare the images and annotations for the model. Some datasets, like this one, use the zero-index as the background class. However, the background class isn't actually included in the 150 classes, so you'll need to set `reduce_labels=True` to subtract one from all the labels. The zero-index is replaced by `255` so it's ignored by SegFormer's loss function:
+The next step is to load a SegFormer image processor to prepare the images and annotations for the model. Some datasets, like this one, use the zero-index as the background class. However, the background class isn't actually included in the 150 classes, so you'll need to set `reduce_labels=True` to subtract one from all the labels. The zero-index is replaced by `255` so it's ignored by SegFormer's loss function:

 ```py
->>> from transformers import AutoFeatureExtractor
+>>> from transformers import AutoImageProcessor

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("nvidia/mit-b0", reduce_labels=True)
+>>> feature_extractor = AutoImageProcessor.from_pretrained("nvidia/mit-b0", reduce_labels=True)
 ```

 It is common to apply some data augmentations to an image dataset to make a model more robust against overfitting. In this guide, you'll use the [`ColorJitter`](https://pytorch.org/vision/stable/generated/torchvision.transforms.ColorJitter.html) function from [torchvision](https://pytorch.org/vision/stable/index.html) to randomly change the color properties of an image, but you can also use any image library you like.
@@ -106,7 +106,7 @@ It is common to apply some data augmentations to an image dataset to make a mode
 >>> jitter = ColorJitter(brightness=0.25, contrast=0.25, saturation=0.25, hue=0.1)
 ```

-Now create two preprocessing functions to prepare the images and annotations for the model. These functions convert the images into `pixel_values` and annotations to `labels`. For the training set, `jitter` is applied before providing the images to the feature extractor. For the test set, the feature extractor crops and normalizes the `images`, and only crops the `labels` because no data augmentation is applied during testing.
+Now create two preprocessing functions to prepare the images and annotations for the model. These functions convert the images into `pixel_values` and annotations to `labels`. For the training set, `jitter` is applied before providing the images to the image processor. For the test set, the image processor crops and normalizes the `images`, and only crops the `labels` because no data augmentation is applied during testing.

 ```py
 >>> def train_transforms(example_batch):
@@ -281,7 +281,7 @@ The simplest way to try out your finetuned model for inference is to use it in a
 'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062E10>}]
 ```

-You can also manually replicate the results of the `pipeline` if you'd like. Process the image with a feature extractor and place the `pixel_values` on a GPU:
+You can also manually replicate the results of the `pipeline` if you'd like. Process the image with an image processor and place the `pixel_values` on a GPU:

 ```py
 >>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # use GPU if available, otherwise use a CPU
@@ -89,7 +89,7 @@ TensorFlow's [model.save](https://www.tensorflow.org/tutorials/keras/save_and_lo
 Another common error you may encounter, especially if it is a newly released model, is `ImportError`:

 ```
-ImportError: cannot import name 'ImageGPTFeatureExtractor' from 'transformers' (unknown location)
+ImportError: cannot import name 'ImageGPTImageProcessor' from 'transformers' (unknown location)
 ```

 For these error types, check to make sure you have the latest version of 🤗 Transformers installed to access the most recent models:
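A quick, generic way to check which version is actually installed before upgrading; this is a sketch, not part of the commit:

```python
>>> import transformers

>>> print(transformers.__version__)  # compare against the release that introduced the missing class
>>> # if it is too old, upgrade with: pip install --upgrade transformers
```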
@@ -49,7 +49,7 @@ logger = logging.get_logger(__name__)

 # General docstring
 _CONFIG_FOR_DOC = "BeitConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "BeitFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "BeitImageProcessor"

 # Base docstring
 _CHECKPOINT_FOR_DOC = "microsoft/beit-base-patch16-224-pt22k"
@@ -593,8 +593,8 @@ BEIT_START_DOCSTRING = r"""
 BEIT_INPUTS_DOCSTRING = r"""
 Args:
 pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`BeitFeatureExtractor`]. See
-[`BeitFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`BeitImageProcessor`]. See
+[`BeitImageProcessor.__call__`] for details.

 head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
 Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -769,7 +769,7 @@ class BeitForMaskedImageModeling(BeitPreTrainedModel):
 Examples:

 ```python
->>> from transformers import BeitFeatureExtractor, BeitForMaskedImageModeling
+>>> from transformers import BeitImageProcessor, BeitForMaskedImageModeling
 >>> import torch
 >>> from PIL import Image
 >>> import requests
@@ -777,11 +777,11 @@ class BeitForMaskedImageModeling(BeitPreTrainedModel):
 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
+>>> image_processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
 >>> model = BeitForMaskedImageModeling.from_pretrained("microsoft/beit-base-patch16-224-pt22k")

 >>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
->>> pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
+>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
 >>> # create random boolean mask of shape (batch_size, num_patches)
 >>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

@@ -1218,17 +1218,17 @@ class BeitForSemanticSegmentation(BeitPreTrainedModel):
 Examples:

 ```python
->>> from transformers import AutoFeatureExtractor, BeitForSemanticSegmentation
+>>> from transformers import AutoImageProcessor, BeitForSemanticSegmentation
 >>> from PIL import Image
 >>> import requests

 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/beit-base-finetuned-ade-640-640")
+>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-finetuned-ade-640-640")
 >>> model = BeitForSemanticSegmentation.from_pretrained("microsoft/beit-base-finetuned-ade-640-640")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
 >>> outputs = model(**inputs)
 >>> # logits are of shape (batch_size, num_labels, height, width)
 >>> logits = outputs.logits
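Assuming your version exposes semantic segmentation post-processing on [`BeitImageProcessor`] (worth verifying against the API reference), the raw logits from the example above can be turned into a per-pixel class map roughly like this:

```python
>>> # upsample the logits to the original image size and take the per-pixel argmax
>>> predicted_map = image_processor.post_process_semantic_segmentation(
...     outputs, target_sizes=[image.size[::-1]]
... )[0]
>>> # predicted_map is a (height, width) tensor whose entries are predicted class ids
```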
@@ -102,8 +102,8 @@ BEIT_START_DOCSTRING = r"""
 BEIT_INPUTS_DOCSTRING = r"""
 Args:
 pixel_values (`numpy.ndarray` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`BeitFeatureExtractor`]. See
-[`BeitFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`BeitImageProcessor`]. See
+[`BeitImageProcessor.__call__`] for details.

 output_attentions (`bool`, *optional*):
 Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
@@ -756,17 +756,17 @@ FLAX_BEIT_MODEL_DOCSTRING = """
 Examples:

 ```python
->>> from transformers import BeitFeatureExtractor, FlaxBeitModel
+>>> from transformers import BeitImageProcessor, FlaxBeitModel
 >>> from PIL import Image
 >>> import requests

 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
+>>> image_processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")
 >>> model = FlaxBeitModel.from_pretrained("microsoft/beit-base-patch16-224-pt22k-ft22k")

->>> inputs = feature_extractor(images=image, return_tensors="np")
+>>> inputs = image_processor(images=image, return_tensors="np")
 >>> outputs = model(**inputs)
 >>> last_hidden_states = outputs.last_hidden_state
 ```
@@ -843,17 +843,17 @@ FLAX_BEIT_MLM_DOCSTRING = """
 Examples:

 ```python
->>> from transformers import BeitFeatureExtractor, BeitForMaskedImageModeling
+>>> from transformers import BeitImageProcessor, BeitForMaskedImageModeling
 >>> from PIL import Image
 >>> import requests

 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
+>>> image_processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
 >>> model = BeitForMaskedImageModeling.from_pretrained("microsoft/beit-base-patch16-224-pt22k")

->>> inputs = feature_extractor(images=image, return_tensors="np")
+>>> inputs = image_processor(images=image, return_tensors="np")
 >>> outputs = model(**inputs)
 >>> logits = outputs.logits
 ```
@@ -927,17 +927,17 @@ FLAX_BEIT_CLASSIF_DOCSTRING = """
 Example:

 ```python
->>> from transformers import BeitFeatureExtractor, FlaxBeitForImageClassification
+>>> from transformers import BeitImageProcessor, FlaxBeitForImageClassification
 >>> from PIL import Image
 >>> import requests

 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = BeitFeatureExtractor.from_pretrained("microsoft/beit-base-patch16-224")
+>>> image_processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
 >>> model = FlaxBeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")

->>> inputs = feature_extractor(images=image, return_tensors="np")
+>>> inputs = image_processor(images=image, return_tensors="np")
 >>> outputs = model(**inputs)
 >>> logits = outputs.logits
 >>> # model predicts one of the 1000 ImageNet classes
@@ -153,8 +153,8 @@ class ConditionalDetrObjectDetectionOutput(ModelOutput):
 pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
 Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
 values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
-possible padding). You can use [`~ConditionalDetrFeatureExtractor.post_process_object_detection`] to
-retrieve the unnormalized bounding boxes.
+possible padding). You can use [`~ConditionalDetrImageProcessor.post_process_object_detection`] to retrieve
+the unnormalized bounding boxes.
 auxiliary_outputs (`list[Dict]`, *optional*):
 Optional, only returned when auxilary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
 and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and
@@ -217,13 +217,13 @@ class ConditionalDetrSegmentationOutput(ModelOutput):
 pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
 Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
 values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
-possible padding). You can use [`~ConditionalDetrFeatureExtractor.post_process_object_detection`] to
-retrieve the unnormalized bounding boxes.
+possible padding). You can use [`~ConditionalDetrImageProcessor.post_process_object_detection`] to retrieve
+the unnormalized bounding boxes.
 pred_masks (`torch.FloatTensor` of shape `(batch_size, num_queries, height/4, width/4)`):
 Segmentation masks logits for all queries. See also
-[`~ConditionalDetrFeatureExtractor.post_process_semantic_segmentation`] or
-[`~ConditionalDetrFeatureExtractor.post_process_instance_segmentation`]
-[`~ConditionalDetrFeatureExtractor.post_process_panoptic_segmentation`] to evaluate semantic, instance and
+[`~ConditionalDetrImageProcessor.post_process_semantic_segmentation`] or
+[`~ConditionalDetrImageProcessor.post_process_instance_segmentation`]
+[`~ConditionalDetrImageProcessor.post_process_panoptic_segmentation`] to evaluate semantic, instance and
 panoptic segmentation masks respectively.
 auxiliary_outputs (`list[Dict]`, *optional*):
 Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
@@ -1097,8 +1097,8 @@ CONDITIONAL_DETR_INPUTS_DOCSTRING = r"""
 pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
 Pixel values. Padding will be ignored by default should you provide it.

-Pixel values can be obtained using [`ConditionalDetrFeatureExtractor`]. See
-[`ConditionalDetrFeatureExtractor.__call__`] for details.
+Pixel values can be obtained using [`ConditionalDetrImageProcessor`]. See
+[`ConditionalDetrImageProcessor.__call__`] for details.

 pixel_mask (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*):
 Mask to avoid performing attention on padding pixel values. Mask values selected in `[0, 1]`:
@@ -1519,18 +1519,18 @@ class ConditionalDetrModel(ConditionalDetrPreTrainedModel):
 Examples:

 ```python
->>> from transformers import AutoFeatureExtractor, AutoModel
+>>> from transformers import AutoImageProcessor, AutoModel
 >>> from PIL import Image
 >>> import requests

 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/conditional-detr-resnet-50")
+>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/conditional-detr-resnet-50")
 >>> model = AutoModel.from_pretrained("microsoft/conditional-detr-resnet-50")

 >>> # prepare image for the model
->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

 >>> # forward pass
 >>> outputs = model(**inputs)
@@ -1687,25 +1687,25 @@ class ConditionalDetrForObjectDetection(ConditionalDetrPreTrainedModel):
 Examples:

 ```python
->>> from transformers import AutoFeatureExtractor, AutoModelForObjectDetection
+>>> from transformers import AutoImageProcessor, AutoModelForObjectDetection
 >>> from PIL import Image
 >>> import requests

 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/conditional-detr-resnet-50")
+>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/conditional-detr-resnet-50")
 >>> model = AutoModelForObjectDetection.from_pretrained("microsoft/conditional-detr-resnet-50")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

 >>> outputs = model(**inputs)

 >>> # convert outputs (bounding boxes and class logits) to COCO API
 >>> target_sizes = torch.tensor([image.size[::-1]])
->>> results = feature_extractor.post_process_object_detection(
-... outputs, threshold=0.5, target_sizes=target_sizes
-... )[0]
+>>> results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[
+... 0
+... ]
 >>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
 ... box = [round(i, 2) for i in box.tolist()]
 ... print(
@@ -1880,7 +1880,7 @@ class ConditionalDetrForSegmentation(ConditionalDetrPreTrainedModel):
 >>> import numpy

 >>> from transformers import (
-... AutoFeatureExtractor,
+... AutoImageProcessor,
 ... ConditionalDetrConfig,
 ... ConditionalDetrForSegmentation,
 ... )
@@ -1889,21 +1889,21 @@ class ConditionalDetrForSegmentation(ConditionalDetrPreTrainedModel):
 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/conditional-detr-resnet-50")
+>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/conditional-detr-resnet-50")

 >>> # randomly initialize all weights of the model
 >>> config = ConditionalDetrConfig()
 >>> model = ConditionalDetrForSegmentation(config)

 >>> # prepare image for the model
->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

 >>> # forward pass
 >>> outputs = model(**inputs)

->>> # Use the `post_process_panoptic_segmentation` method of `ConditionalDetrFeatureExtractor` to retrieve post-processed panoptic segmentation maps
+>>> # Use the `post_process_panoptic_segmentation` method of `ConditionalDetrImageProcessor` to retrieve post-processed panoptic segmentation maps
 >>> # Segmentation results are returned as a list of dictionaries
->>> result = feature_extractor.post_process_panoptic_segmentation(outputs, target_sizes=[(300, 500)])
+>>> result = image_processor.post_process_panoptic_segmentation(outputs, target_sizes=[(300, 500)])
 >>> # A tensor of shape (height, width) where each value denotes a segment id, filled with -1 if no segment is found
 >>> panoptic_seg = result[0]["segmentation"]
 >>> # Get prediction score and segment_id to class_id mapping of each segment
@@ -37,7 +37,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "ConvNextConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "ConvNextFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "ConvNextImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "facebook/convnext-tiny-224"
@@ -308,8 +308,8 @@ CONVNEXT_START_DOCSTRING = r"""
CONVNEXT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.

output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
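
To make the updated wording concrete, here is a minimal sketch of how `pixel_values` are produced with `AutoImageProcessor` for the `facebook/convnext-tiny-224` checkpoint named in the constants above. The `ConvNextModel` usage around it is illustrative, not part of this diff.

```python
>>> from transformers import AutoImageProcessor, ConvNextModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # the image processor resizes and normalizes the PIL image into the `pixel_values` tensor described above
>>> image_processor = AutoImageProcessor.from_pretrained("facebook/convnext-tiny-224")
>>> model = ConvNextModel.from_pretrained("facebook/convnext-tiny-224")

>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
```
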
@@ -432,8 +432,8 @@ CONVNEXT_START_DOCSTRING = r"""
CONVNEXT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`np.ndarray`, `tf.Tensor`, `List[tf.Tensor]` ``Dict[str, tf.Tensor]` or `Dict[str, np.ndarray]` and each example must have the shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`ConvNextFeatureExtractor`]. See
-[`ConvNextFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`ConvNextImageProcessor`]. See
+[`ConvNextImageProcessor.__call__`] for details.

output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
@@ -470,17 +470,17 @@ class TFConvNextModel(TFConvNextPreTrainedModel):
Examples:

```python
->>> from transformers import ConvNextFeatureExtractor, TFConvNextModel
+>>> from transformers import ConvNextImageProcessor, TFConvNextModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = ConvNextFeatureExtractor.from_pretrained("facebook/convnext-tiny-224")
+>>> image_processor = ConvNextImageProcessor.from_pretrained("facebook/convnext-tiny-224")
>>> model = TFConvNextModel.from_pretrained("facebook/convnext-tiny-224")

->>> inputs = feature_extractor(images=image, return_tensors="tf")
+>>> inputs = image_processor(images=image, return_tensors="tf")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
```"""
@@ -561,7 +561,7 @@ class TFConvNextForImageClassification(TFConvNextPreTrainedModel, TFSequenceClas
Examples:

```python
->>> from transformers import ConvNextFeatureExtractor, TFConvNextForImageClassification
+>>> from transformers import ConvNextImageProcessor, TFConvNextForImageClassification
>>> import tensorflow as tf
>>> from PIL import Image
>>> import requests
@@ -569,10 +569,10 @@ class TFConvNextForImageClassification(TFConvNextPreTrainedModel, TFSequenceClas
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = ConvNextFeatureExtractor.from_pretrained("facebook/convnext-tiny-224")
+>>> image_processor = ConvNextImageProcessor.from_pretrained("facebook/convnext-tiny-224")
>>> model = TFConvNextForImageClassification.from_pretrained("facebook/convnext-tiny-224")

->>> inputs = feature_extractor(images=image, return_tensors="tf")
+>>> inputs = image_processor(images=image, return_tensors="tf")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
@@ -35,7 +35,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "CvtConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "AutoFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "AutoImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "microsoft/cvt-13"
@@ -573,8 +573,8 @@ CVT_START_DOCSTRING = r"""
CVT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`CvtFeatureExtractor`]. See
-[`CvtFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`CvtImageProcessor`]. See [`CvtImageProcessor.__call__`]
+for details.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
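
The same pattern applies to the CvT docstring above; a minimal sketch using the `microsoft/cvt-13` checkpoint from the constants above. The classification head usage and the `id2label` lookup are illustrative additions, not part of this diff.

```python
>>> from transformers import AutoImageProcessor, CvtForImageClassification
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
>>> model = CvtForImageClassification.from_pretrained("microsoft/cvt-13")

>>> inputs = image_processor(images=image, return_tensors="pt")  # a dict holding the `pixel_values` tensor
>>> logits = model(**inputs).logits
>>> predicted_label = model.config.id2label[logits.argmax(-1).item()]
```
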
@@ -766,8 +766,8 @@ TFCVT_START_DOCSTRING = r"""
TFCVT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`np.ndarray`, `tf.Tensor`, `List[tf.Tensor]` ``Dict[str, tf.Tensor]` or `Dict[str, np.ndarray]` and each example must have the shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.

output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
@@ -808,17 +808,17 @@ class TFCvtModel(TFCvtPreTrainedModel):
Examples:

```python
->>> from transformers import AutoFeatureExtractor, TFCvtModel
+>>> from transformers import AutoImageProcessor, TFCvtModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/cvt-13")
+>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
>>> model = TFCvtModel.from_pretrained("microsoft/cvt-13")

->>> inputs = feature_extractor(images=image, return_tensors="tf")
+>>> inputs = image_processor(images=image, return_tensors="tf")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
```"""
@@ -897,7 +897,7 @@ class TFCvtForImageClassification(TFCvtPreTrainedModel, TFSequenceClassification
Examples:

```python
->>> from transformers import AutoFeatureExtractor, TFCvtForImageClassification
+>>> from transformers import AutoImageProcessor, TFCvtForImageClassification
>>> import tensorflow as tf
>>> from PIL import Image
>>> import requests
@@ -905,10 +905,10 @@ class TFCvtForImageClassification(TFCvtPreTrainedModel, TFSequenceClassification
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/cvt-13")
+>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/cvt-13")
>>> model = TFCvtForImageClassification.from_pretrained("microsoft/cvt-13")

->>> inputs = feature_extractor(images=image, return_tensors="tf")
+>>> inputs = image_processor(images=image, return_tensors="tf")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
@@ -48,7 +48,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "Data2VecVisionConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "BeitFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "BeitImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "facebook/data2vec-vision-base"
@@ -606,8 +606,8 @@ DATA2VEC_VISION_START_DOCSTRING = r"""
DATA2VEC_VISION_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`BeitFeatureExtractor`]. See
-[`BeitFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`BeitImageProcessor`]. See
+[`BeitImageProcessor.__call__`] for details.

head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -1146,17 +1146,17 @@ class Data2VecVisionForSemanticSegmentation(Data2VecVisionPreTrainedModel):
Examples:

```python
->>> from transformers import AutoFeatureExtractor, Data2VecVisionForSemanticSegmentation
+>>> from transformers import AutoImageProcessor, Data2VecVisionForSemanticSegmentation
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/data2vec-vision-base")
+>>> image_processor = AutoImageProcessor.from_pretrained("facebook/data2vec-vision-base")
>>> model = Data2VecVisionForSemanticSegmentation.from_pretrained("facebook/data2vec-vision-base")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> # logits are of shape (batch_size, num_labels, height, width)
>>> logits = outputs.logits
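
The semantic-segmentation example above stops at the raw logits. A minimal sketch of turning them into a per-pixel class map, reusing `logits` and `image` from that example; upsampling to the original image size with bilinear interpolation is an illustrative choice, not part of this diff.

```python
>>> import torch

>>> # logits: (batch_size, num_labels, height, width) -> upsample to the input image size, then pick the best class per pixel
>>> upsampled = torch.nn.functional.interpolate(logits, size=image.size[::-1], mode="bilinear", align_corners=False)
>>> segmentation_map = upsampled.argmax(dim=1)[0]  # (height, width) tensor of class ids
```
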
@@ -53,7 +53,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "Data2VecVisionConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "BeitFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "BeitImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "facebook/data2vec-vision-base"
@@ -849,8 +849,8 @@ DATA2VEC_VISION_START_DOCSTRING = r"""
DATA2VEC_VISION_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`np.ndarray`, `tf.Tensor`, `List[tf.Tensor]` ``Dict[str, tf.Tensor]` or `Dict[str, np.ndarray]` and each example must have the shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`BeitFeatureExtractor`]. See
-[`BeitFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`BeitImageProcessor`]. See
+[`BeitImageProcessor.__call__`] for details.

head_mask (`np.ndarray` or `tf.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -1397,17 +1397,17 @@ class TFData2VecVisionForSemanticSegmentation(TFData2VecVisionPreTrainedModel):
Examples:

```python
->>> from transformers import AutoFeatureExtractor, TFData2VecVisionForSemanticSegmentation
+>>> from transformers import AutoImageProcessor, TFData2VecVisionForSemanticSegmentation
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/data2vec-vision-base")
+>>> image_processor = AutoImageProcessor.from_pretrained("facebook/data2vec-vision-base")
>>> model = TFData2VecVisionForSemanticSegmentation.from_pretrained("facebook/data2vec-vision-base")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> # logits are of shape (batch_size, num_labels, height, width)
>>> logits = outputs.logits
@@ -241,7 +241,7 @@ class DeformableDetrObjectDetectionOutput(ModelOutput):
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
-possible padding). You can use [`~AutoFeatureExtractor.post_process_object_detection`] to retrieve the
+possible padding). You can use [`~AutoImageProcessor.post_process_object_detection`] to retrieve the
unnormalized bounding boxes.
auxiliary_outputs (`list[Dict]`, *optional*):
Optional, only returned when auxilary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
@@ -1073,8 +1073,7 @@ DEFORMABLE_DETR_INPUTS_DOCSTRING = r"""
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Pixel values. Padding will be ignored by default should you provide it.

-Pixel values can be obtained using [`AutoFeatureExtractor`]. See [`AutoFeatureExtractor.__call__`] for
-details.
+Pixel values can be obtained using [`AutoImageProcessor`]. See [`AutoImageProcessor.__call__`] for details.

pixel_mask (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*):
Mask to avoid performing attention on padding pixel values. Mask values selected in `[0, 1]`:
@@ -1603,17 +1602,17 @@ class DeformableDetrModel(DeformableDetrPreTrainedModel):
Examples:

```python
->>> from transformers import AutoFeatureExtractor, DeformableDetrModel
+>>> from transformers import AutoImageProcessor, DeformableDetrModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("SenseTime/deformable-detr")
+>>> image_processor = AutoImageProcessor.from_pretrained("SenseTime/deformable-detr")
>>> model = DeformableDetrModel.from_pretrained("SenseTime/deformable-detr")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

>>> outputs = model(**inputs)

@@ -1873,24 +1872,24 @@ class DeformableDetrForObjectDetection(DeformableDetrPreTrainedModel):
Examples:

```python
->>> from transformers import AutoFeatureExtractor, DeformableDetrForObjectDetection
+>>> from transformers import AutoImageProcessor, DeformableDetrForObjectDetection
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("SenseTime/deformable-detr")
+>>> image_processor = AutoImageProcessor.from_pretrained("SenseTime/deformable-detr")
>>> model = DeformableDetrForObjectDetection.from_pretrained("SenseTime/deformable-detr")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)

>>> # convert outputs (bounding boxes and class logits) to COCO API
>>> target_sizes = torch.tensor([image.size[::-1]])
->>> results = feature_extractor.post_process_object_detection(
-... outputs, threshold=0.5, target_sizes=target_sizes
-... )[0]
+>>> results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[
+... 0
+... ]
>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
... box = [round(i, 2) for i in box.tolist()]
... print(
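
As the `pred_boxes` description above notes, the raw boxes are normalized `(center_x, center_y, width, height)` values; the post-processing helper in the example converts them to absolute corner coordinates. A minimal sketch of reading those results, reusing `results` and `model` from the object-detection example above; the `id2label` lookup and the print formatting are illustrative.

```python
>>> # `results` holds "scores", "labels" and absolute (x_min, y_min, x_max, y_max) "boxes" for one image
>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
...     name = model.config.id2label[label.item()]
...     print(f"{name}: {round(score.item(), 3)} at {[round(c, 1) for c in box.tolist()]}")
```
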
@@ -44,7 +44,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "DeiTConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "DeiTFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "DeiTImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "facebook/deit-base-distilled-patch16-224"
@@ -433,8 +433,8 @@ DEIT_START_DOCSTRING = r"""
DEIT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`DeiTFeatureExtractor`]. See
-[`DeiTFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`DeiTImageProcessor`]. See
+[`DeiTImageProcessor.__call__`] for details.

head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -611,7 +611,7 @@ class DeiTForMaskedImageModeling(DeiTPreTrainedModel):

Examples:
```python
->>> from transformers import DeiTFeatureExtractor, DeiTForMaskedImageModeling
+>>> from transformers import DeiTImageProcessor, DeiTForMaskedImageModeling
>>> import torch
>>> from PIL import Image
>>> import requests
@@ -619,11 +619,11 @@ class DeiTForMaskedImageModeling(DeiTPreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = DeiTFeatureExtractor.from_pretrained("facebook/deit-base-distilled-patch16-224")
+>>> image_processor = DeiTImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
>>> model = DeiTForMaskedImageModeling.from_pretrained("facebook/deit-base-distilled-patch16-224")

>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
->>> pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
+>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()

@@ -721,7 +721,7 @@ class DeiTForImageClassification(DeiTPreTrainedModel):
Examples:

```python
->>> from transformers import DeiTFeatureExtractor, DeiTForImageClassification
+>>> from transformers import DeiTImageProcessor, DeiTForImageClassification
>>> import torch
>>> from PIL import Image
>>> import requests
@@ -732,10 +732,10 @@ class DeiTForImageClassification(DeiTPreTrainedModel):

>>> # note: we are loading a DeiTForImageClassificationWithTeacher from the hub here,
>>> # so the head will be randomly initialized, hence the predictions will be random
->>> feature_extractor = DeiTFeatureExtractor.from_pretrained("facebook/deit-base-distilled-patch16-224")
+>>> image_processor = DeiTImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
>>> model = DeiTForImageClassification.from_pretrained("facebook/deit-base-distilled-patch16-224")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
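
Continuing the classification example above from `logits`, a minimal sketch of reading off the predicted ImageNet class; the `id2label` lookup is illustrative, and, as the comment in the example warns, the randomly initialized head makes the actual prediction meaningless here.

```python
>>> predicted_class_idx = logits.argmax(-1).item()
>>> print("Predicted class:", model.config.id2label[predicted_class_idx])
```
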
@@ -52,7 +52,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "DeiTConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "DeiTFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "DeiTImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "facebook/deit-base-distilled-patch16-224"
@@ -614,8 +614,8 @@ DEIT_START_DOCSTRING = r"""
DEIT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`tf.Tensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`DeiTFeatureExtractor`]. See
-[`DeiTFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`DeiTImageProcessor`]. See
+[`DeiTImageProcessor.__call__`] for details.

head_mask (`tf.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -786,7 +786,7 @@ class TFDeiTForMaskedImageModeling(TFDeiTPreTrainedModel):

Examples:
```python
->>> from transformers import DeiTFeatureExtractor, TFDeiTForMaskedImageModeling
+>>> from transformers import DeiTImageProcessor, TFDeiTForMaskedImageModeling
>>> import tensorflow as tf
>>> from PIL import Image
>>> import requests
@@ -794,11 +794,11 @@ class TFDeiTForMaskedImageModeling(TFDeiTPreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = DeiTFeatureExtractor.from_pretrained("facebook/deit-base-distilled-patch16-224")
+>>> image_processor = DeiTImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
>>> model = TFDeiTForMaskedImageModeling.from_pretrained("facebook/deit-base-distilled-patch16-224")

>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
->>> pixel_values = feature_extractor(images=image, return_tensors="tf").pixel_values
+>>> pixel_values = image_processor(images=image, return_tensors="tf").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = tf.cast(tf.random.uniform((1, num_patches), minval=0, maxval=2, dtype=tf.int32), tf.bool)

@@ -917,7 +917,7 @@ class TFDeiTForImageClassification(TFDeiTPreTrainedModel, TFSequenceClassificati
Examples:

```python
->>> from transformers import DeiTFeatureExtractor, TFDeiTForImageClassification
+>>> from transformers import DeiTImageProcessor, TFDeiTForImageClassification
>>> import tensorflow as tf
>>> from PIL import Image
>>> import requests
@@ -928,10 +928,10 @@ class TFDeiTForImageClassification(TFDeiTPreTrainedModel, TFSequenceClassificati

>>> # note: we are loading a TFDeiTForImageClassificationWithTeacher from the hub here,
>>> # so the head will be randomly initialized, hence the predictions will be random
->>> feature_extractor = DeiTFeatureExtractor.from_pretrained("facebook/deit-base-distilled-patch16-224")
+>>> image_processor = DeiTImageProcessor.from_pretrained("facebook/deit-base-distilled-patch16-224")
>>> model = TFDeiTForImageClassification.from_pretrained("facebook/deit-base-distilled-patch16-224")

->>> inputs = feature_extractor(images=image, return_tensors="tf")
+>>> inputs = image_processor(images=image, return_tensors="tf")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
>>> # model predicts one of the 1000 ImageNet classes
@@ -148,7 +148,7 @@ class DetrObjectDetectionOutput(ModelOutput):
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
-possible padding). You can use [`~DetrFeatureExtractor.post_process_object_detection`] to retrieve the
+possible padding). You can use [`~DetrImageProcessor.post_process_object_detection`] to retrieve the
unnormalized bounding boxes.
auxiliary_outputs (`list[Dict]`, *optional*):
Optional, only returned when auxilary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
@@ -211,13 +211,13 @@ class DetrSegmentationOutput(ModelOutput):
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
-possible padding). You can use [`~DetrFeatureExtractor.post_process_object_detection`] to retrieve the
+possible padding). You can use [`~DetrImageProcessor.post_process_object_detection`] to retrieve the
unnormalized bounding boxes.
pred_masks (`torch.FloatTensor` of shape `(batch_size, num_queries, height/4, width/4)`):
Segmentation masks logits for all queries. See also
-[`~DetrFeatureExtractor.post_process_semantic_segmentation`] or
-[`~DetrFeatureExtractor.post_process_instance_segmentation`]
-[`~DetrFeatureExtractor.post_process_panoptic_segmentation`] to evaluate semantic, instance and panoptic
+[`~DetrImageProcessor.post_process_semantic_segmentation`] or
+[`~DetrImageProcessor.post_process_instance_segmentation`]
+[`~DetrImageProcessor.post_process_panoptic_segmentation`] to evaluate semantic, instance and panoptic
segmentation masks respectively.
auxiliary_outputs (`list[Dict]`, *optional*):
Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
@@ -856,8 +856,7 @@ DETR_INPUTS_DOCSTRING = r"""
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Pixel values. Padding will be ignored by default should you provide it.

-Pixel values can be obtained using [`DetrFeatureExtractor`]. See [`DetrFeatureExtractor.__call__`] for
-details.
+Pixel values can be obtained using [`DetrImageProcessor`]. See [`DetrImageProcessor.__call__`] for details.

pixel_mask (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*):
Mask to avoid performing attention on padding pixel values. Mask values selected in `[0, 1]`:
@@ -1243,18 +1242,18 @@ class DetrModel(DetrPreTrainedModel):
Examples:

```python
->>> from transformers import DetrFeatureExtractor, DetrModel
+>>> from transformers import DetrImageProcessor, DetrModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50")
+>>> image_processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
>>> model = DetrModel.from_pretrained("facebook/detr-resnet-50")

>>> # prepare image for the model
->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

>>> # forward pass
>>> outputs = model(**inputs)
@@ -1410,7 +1409,7 @@ class DetrForObjectDetection(DetrPreTrainedModel):
Examples:

```python
->>> from transformers import DetrFeatureExtractor, DetrForObjectDetection
+>>> from transformers import DetrImageProcessor, DetrForObjectDetection
>>> import torch
>>> from PIL import Image
>>> import requests
@@ -1418,17 +1417,17 @@ class DetrForObjectDetection(DetrPreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50")
+>>> image_processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
>>> model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)

>>> # convert outputs (bounding boxes and class logits) to COCO API
>>> target_sizes = torch.tensor([image.size[::-1]])
->>> results = feature_extractor.post_process_object_detection(
-... outputs, threshold=0.9, target_sizes=target_sizes
-... )[0]
+>>> results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[
+... 0
+... ]

>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
... box = [round(i, 2) for i in box.tolist()]
@@ -1588,24 +1587,24 @@ class DetrForSegmentation(DetrPreTrainedModel):
>>> import torch
>>> import numpy

->>> from transformers import DetrFeatureExtractor, DetrForSegmentation
+>>> from transformers import DetrImageProcessor, DetrForSegmentation
>>> from transformers.image_transforms import rgb_to_id

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = DetrFeatureExtractor.from_pretrained("facebook/detr-resnet-50-panoptic")
+>>> image_processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50-panoptic")
>>> model = DetrForSegmentation.from_pretrained("facebook/detr-resnet-50-panoptic")

>>> # prepare image for the model
->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

>>> # forward pass
>>> outputs = model(**inputs)

->>> # Use the `post_process_panoptic_segmentation` method of `DetrFeatureExtractor` to retrieve post-processed panoptic segmentation maps
+>>> # Use the `post_process_panoptic_segmentation` method of `DetrImageProcessor` to retrieve post-processed panoptic segmentation maps
>>> # Segmentation results are returned as a list of dictionaries
->>> result = feature_extractor.post_process_panoptic_segmentation(outputs, target_sizes=[(300, 500)])
+>>> result = image_processor.post_process_panoptic_segmentation(outputs, target_sizes=[(300, 500)])

>>> # A tensor of shape (height, width) where each value denotes a segment id, filled with -1 if no segment is found
>>> panoptic_seg = result[0]["segmentation"]
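
Besides `post_process_panoptic_segmentation`, the `DetrSegmentationOutput` docstring above also points to semantic and instance post-processing. A minimal sketch of the semantic variant, reusing `image_processor`, `outputs` and `image` from the panoptic example above; only the `outputs` and `target_sizes` arguments are shown, other keyword arguments are left at their defaults.

```python
>>> # one (height, width) map of class ids per image, at the resolution given in `target_sizes`
>>> semantic_map = image_processor.post_process_semantic_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
```
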
@@ -52,7 +52,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "DPTConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "DPTFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "DPTImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "Intel/dpt-large"
@@ -651,8 +651,8 @@ DPT_START_DOCSTRING = r"""
DPT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`ViTFeatureExtractor`]. See
-[`ViTFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`ViTImageProcessor`]. See [`ViTImageProcessor.__call__`]
+for details.

head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -890,7 +890,7 @@ class DPTForDepthEstimation(DPTPreTrainedModel):

Examples:
```python
->>> from transformers import DPTFeatureExtractor, DPTForDepthEstimation
+>>> from transformers import DPTImageProcessor, DPTForDepthEstimation
>>> import torch
>>> import numpy as np
>>> from PIL import Image
@@ -899,11 +899,11 @@ class DPTForDepthEstimation(DPTPreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
+>>> image_processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
>>> model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

>>> # prepare image for the model
->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
... outputs = model(**inputs)
@@ -1052,17 +1052,17 @@ class DPTForSemanticSegmentation(DPTPreTrainedModel):

Examples:
```python
->>> from transformers import DPTFeatureExtractor, DPTForSemanticSegmentation
+>>> from transformers import DPTImageProcessor, DPTForSemanticSegmentation
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large-ade")
+>>> image_processor = DPTImageProcessor.from_pretrained("Intel/dpt-large-ade")
>>> model = DPTForSemanticSegmentation.from_pretrained("Intel/dpt-large-ade")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

>>> outputs = model(**inputs)
>>> logits = outputs.logits
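
The depth-estimation example above ends at the forward pass. A minimal sketch of one way to inspect the result, reusing `outputs`, `image`, `torch` and `Image` from that example; it assumes the depth map is exposed as `predicted_depth` on the returned object, and the resizing and 8-bit rescaling choices are illustrative.

```python
>>> predicted_depth = outputs.predicted_depth  # (batch_size, height, width)

>>> # resize back to the original image size and rescale to 0-255 for viewing
>>> depth = torch.nn.functional.interpolate(
...     predicted_depth.unsqueeze(1), size=image.size[::-1], mode="bicubic", align_corners=False
... ).squeeze()
>>> depth_image = Image.fromarray((255 * (depth - depth.min()) / (depth.max() - depth.min())).numpy().astype("uint8"))
```
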
@@ -41,7 +41,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "GLPNConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "GLPNFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "GLPNImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "vinvino02/glpn-kitti"
@@ -464,7 +464,7 @@ GLPN_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
-[`GLPNFeatureExtractor`]. See [`GLPNFeatureExtractor.__call__`] for details.
+[`GLPNImageProcessor`]. See [`GLPNImageProcessor.__call__`] for details.

output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
@@ -713,7 +713,7 @@ class GLPNForDepthEstimation(GLPNPreTrainedModel):
Examples:

```python
->>> from transformers import GLPNFeatureExtractor, GLPNForDepthEstimation
+>>> from transformers import GLPNImageProcessor, GLPNForDepthEstimation
>>> import torch
>>> import numpy as np
>>> from PIL import Image
@@ -722,11 +722,11 @@ class GLPNForDepthEstimation(GLPNPreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = GLPNFeatureExtractor.from_pretrained("vinvino02/glpn-kitti")
+>>> image_processor = GLPNImageProcessor.from_pretrained("vinvino02/glpn-kitti")
>>> model = GLPNForDepthEstimation.from_pretrained("vinvino02/glpn-kitti")

>>> # prepare image for the model
->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
... outputs = model(**inputs)
@@ -556,7 +556,7 @@ IMAGEGPT_INPUTS_DOCSTRING = r"""
If `past_key_values` is used, only `input_ids` that do not have their past calculated should be passed as
`input_ids`.

-Indices can be obtained using [`ImageGPTFeatureExtractor`]. See [`ImageGPTFeatureExtractor.__call__`] for
+Indices can be obtained using [`ImageGPTImageProcessor`]. See [`ImageGPTImageProcessor.__call__`] for
details.

past_key_values (`Tuple[Tuple[torch.Tensor]]` of length `config.n_layers`):
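
Unlike the other vision models in this commit, ImageGPT consumes `input_ids` rather than `pixel_values`: as the docstring above says, the indices come from `ImageGPTImageProcessor`. A self-contained minimal sketch, using the `openai/imagegpt-small` checkpoint from the examples that follow; the `input_ids` key name is assumed from the docstring wording.

```python
>>> from transformers import ImageGPTImageProcessor
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # the processor downsamples the image and maps each pixel to its nearest color cluster
>>> image_processor = ImageGPTImageProcessor.from_pretrained("openai/imagegpt-small")
>>> encoding = image_processor(images=image, return_tensors="pt")
>>> input_ids = encoding["input_ids"]  # (batch_size, sequence_length) of color-cluster indices
```
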
@ -679,17 +679,17 @@ class ImageGPTModel(ImageGPTPreTrainedModel):
Examples:

```python
->>> from transformers import ImageGPTFeatureExtractor, ImageGPTModel
+>>> from transformers import ImageGPTImageProcessor, ImageGPTModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = ImageGPTFeatureExtractor.from_pretrained("openai/imagegpt-small")
+>>> image_processor = ImageGPTImageProcessor.from_pretrained("openai/imagegpt-small")
>>> model = ImageGPTModel.from_pretrained("openai/imagegpt-small")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
```"""
@ -973,12 +973,12 @@ class ImageGPTForCausalImageModeling(ImageGPTPreTrainedModel):
Examples:

```python
->>> from transformers import ImageGPTFeatureExtractor, ImageGPTForCausalImageModeling
+>>> from transformers import ImageGPTImageProcessor, ImageGPTForCausalImageModeling
>>> import torch
>>> import matplotlib.pyplot as plt
>>> import numpy as np

->>> feature_extractor = ImageGPTFeatureExtractor.from_pretrained("openai/imagegpt-small")
+>>> image_processor = ImageGPTImageProcessor.from_pretrained("openai/imagegpt-small")
>>> model = ImageGPTForCausalImageModeling.from_pretrained("openai/imagegpt-small")
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>>> model.to(device)
@ -991,9 +991,9 @@ class ImageGPTForCausalImageModeling(ImageGPTPreTrainedModel):
... input_ids=context, max_length=model.config.n_positions + 1, temperature=1.0, do_sample=True, top_k=40
... )

->>> clusters = feature_extractor.clusters
->>> height = feature_extractor.size["height"]
->>> width = feature_extractor.size["width"]
+>>> clusters = image_processor.clusters
+>>> height = image_processor.size["height"]
+>>> width = image_processor.size["width"]

>>> samples = output[:, 1:].cpu().detach().numpy()
>>> samples_img = [
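The docstring continues past this hunk by mapping the generated cluster indices back to RGB. A hedged sketch of that decoding step, reusing `clusters`, `samples`, `height` and `width` from the snippet above (the exact plotting code is omitted):

```python
import numpy as np
import matplotlib.pyplot as plt

# `clusters` holds the RGB cluster centers in [-1, 1]; rescaling to [0, 255] and
# reshaping to (height, width, 3) recovers a viewable image per generated sample.
samples_img = [
    np.reshape(np.rint(127.5 * (clusters[s] + 1.0)), [height, width, 3]).astype(np.uint8)
    for s in samples
]
plt.imshow(samples_img[0])
plt.axis("off")
plt.show()
```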
@ -1124,17 +1124,17 @@ class ImageGPTForImageClassification(ImageGPTPreTrainedModel):
Examples:

```python
->>> from transformers import ImageGPTFeatureExtractor, ImageGPTForImageClassification
+>>> from transformers import ImageGPTImageProcessor, ImageGPTForImageClassification
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = ImageGPTFeatureExtractor.from_pretrained("openai/imagegpt-small")
+>>> image_processor = ImageGPTImageProcessor.from_pretrained("openai/imagegpt-small")
>>> model = ImageGPTForImageClassification.from_pretrained("openai/imagegpt-small")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
```"""
@ -38,7 +38,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "LevitConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "LevitFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "LevitImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "facebook/levit-128S"
@ -523,8 +523,8 @@ LEVIT_START_DOCSTRING = r"""
LEVIT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.

output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
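As context for the `pixel_values` argument above, a minimal sketch of how it is typically produced with the renamed `AutoImageProcessor` (checkpoint and output shape are illustrative, not part of the diff):

```python
from transformers import AutoImageProcessor
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Resizes, crops and normalizes the image into the (batch_size, num_channels, height, width)
# tensor described above, typically (1, 3, 224, 224) for this checkpoint.
image_processor = AutoImageProcessor.from_pretrained("facebook/levit-128S")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
```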
@ -51,7 +51,7 @@ logger = logging.get_logger(__name__)

_CONFIG_FOR_DOC = "MaskFormerConfig"
_CHECKPOINT_FOR_DOC = "facebook/maskformer-swin-base-ade"
-_FEAT_EXTRACTOR_FOR_DOC = "MaskFormerFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "MaskFormerImageProcessor"

MASKFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = [
"facebook/maskformer-swin-base-ade",
@ -192,10 +192,10 @@ class MaskFormerForInstanceSegmentationOutput(ModelOutput):
"""
Class for outputs of [`MaskFormerForInstanceSegmentation`].

-This output can be directly passed to [`~MaskFormerFeatureExtractor.post_process_semantic_segmentation`] or or
-[`~MaskFormerFeatureExtractor.post_process_instance_segmentation`] or
-[`~MaskFormerFeatureExtractor.post_process_panoptic_segmentation`] depending on the task. Please, see
-[`~MaskFormerFeatureExtractor] for details regarding usage.
+This output can be directly passed to [`~MaskFormerImageProcessor.post_process_semantic_segmentation`] or or
+[`~MaskFormerImageProcessor.post_process_instance_segmentation`] or
+[`~MaskFormerImageProcessor.post_process_panoptic_segmentation`] depending on the task. Please, see
+[`~MaskFormerImageProcessor] for details regarding usage.

Args:
loss (`torch.Tensor`, *optional*):
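Of the three post-processing methods mentioned above, only the semantic and panoptic variants appear in the examples further down. A hedged sketch of the instance-segmentation call on the renamed `MaskFormerImageProcessor` (checkpoint and result keys follow the current API and are illustrative):

```python
import torch
from transformers import MaskFormerImageProcessor, MaskFormerForInstanceSegmentation
from PIL import Image
import requests

image_processor = MaskFormerImageProcessor.from_pretrained("facebook/maskformer-swin-base-coco")
model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-coco")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Returns one dict per image with a (height, width) `segmentation` map and
# per-segment metadata under `segments_info`.
result = image_processor.post_process_instance_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
```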
@ -1462,8 +1462,8 @@ MASKFORMER_START_DOCSTRING = r"""
MASKFORMER_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.
pixel_mask (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*):
Mask to avoid performing attention on padding pixel values. Mask values selected in `[0, 1]`:
@ -1562,18 +1562,18 @@ class MaskFormerModel(MaskFormerPreTrainedModel):
Examples:

```python
->>> from transformers import MaskFormerFeatureExtractor, MaskFormerModel
+>>> from transformers import MaskFormerImageProcessor, MaskFormerModel
>>> from PIL import Image
>>> import requests

>>> # load MaskFormer fine-tuned on ADE20k semantic segmentation
->>> feature_extractor = MaskFormerFeatureExtractor.from_pretrained("facebook/maskformer-swin-base-ade")
+>>> image_processor = MaskFormerImageProcessor.from_pretrained("facebook/maskformer-swin-base-ade")
>>> model = MaskFormerModel.from_pretrained("facebook/maskformer-swin-base-ade")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> inputs = feature_extractor(image, return_tensors="pt")
+>>> inputs = image_processor(image, return_tensors="pt")

>>> # forward pass
>>> outputs = model(**inputs)
@ -1741,19 +1741,19 @@ class MaskFormerForInstanceSegmentation(MaskFormerPreTrainedModel):
Semantic segmentation example:

```python
->>> from transformers import MaskFormerFeatureExtractor, MaskFormerForInstanceSegmentation
+>>> from transformers import MaskFormerImageProcessor, MaskFormerForInstanceSegmentation
>>> from PIL import Image
>>> import requests

>>> # load MaskFormer fine-tuned on ADE20k semantic segmentation
->>> feature_extractor = MaskFormerFeatureExtractor.from_pretrained("facebook/maskformer-swin-base-ade")
+>>> image_processor = MaskFormerImageProcessor.from_pretrained("facebook/maskformer-swin-base-ade")
>>> model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-ade")

>>> url = (
... "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
... )
>>> image = Image.open(requests.get(url, stream=True).raw)
->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

>>> outputs = model(**inputs)
>>> # model predicts class_queries_logits of shape `(batch_size, num_queries)`
@ -1761,8 +1761,8 @@ class MaskFormerForInstanceSegmentation(MaskFormerPreTrainedModel):
>>> class_queries_logits = outputs.class_queries_logits
>>> masks_queries_logits = outputs.masks_queries_logits

->>> # you can pass them to feature_extractor for postprocessing
->>> predicted_semantic_map = feature_extractor.post_process_semantic_segmentation(
+>>> # you can pass them to image_processor for postprocessing
+>>> predicted_semantic_map = image_processor.post_process_semantic_segmentation(
... outputs, target_sizes=[image.size[::-1]]
... )[0]
@ -1774,17 +1774,17 @@ class MaskFormerForInstanceSegmentation(MaskFormerPreTrainedModel):
Panoptic segmentation example:

```python
->>> from transformers import MaskFormerFeatureExtractor, MaskFormerForInstanceSegmentation
+>>> from transformers import MaskFormerImageProcessor, MaskFormerForInstanceSegmentation
>>> from PIL import Image
>>> import requests

>>> # load MaskFormer fine-tuned on COCO panoptic segmentation
->>> feature_extractor = MaskFormerFeatureExtractor.from_pretrained("facebook/maskformer-swin-base-coco")
+>>> image_processor = MaskFormerImageProcessor.from_pretrained("facebook/maskformer-swin-base-coco")
>>> model = MaskFormerForInstanceSegmentation.from_pretrained("facebook/maskformer-swin-base-coco")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

>>> outputs = model(**inputs)
>>> # model predicts class_queries_logits of shape `(batch_size, num_queries)`
@ -1792,8 +1792,8 @@ class MaskFormerForInstanceSegmentation(MaskFormerPreTrainedModel):
>>> class_queries_logits = outputs.class_queries_logits
>>> masks_queries_logits = outputs.masks_queries_logits

->>> # you can pass them to feature_extractor for postprocessing
->>> result = feature_extractor.post_process_panoptic_segmentation(outputs, target_sizes=[image.size[::-1]])[0]
+>>> # you can pass them to image_processor for postprocessing
+>>> result = image_processor.post_process_panoptic_segmentation(outputs, target_sizes=[image.size[::-1]])[0]

>>> # we refer to the demo notebooks for visualization (see "Resources" section in the MaskFormer docs)
>>> predicted_panoptic_map = result["segmentation"]
@ -33,7 +33,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "MobileNetV1Config"
-_FEAT_EXTRACTOR_FOR_DOC = "MobileNetV1FeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "MobileNetV1ImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "google/mobilenet_v1_1.0_224"
@ -285,8 +285,8 @@ MOBILENET_V1_START_DOCSTRING = r"""
MOBILENET_V1_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`MobileNetV1FeatureExtractor`]. See
-[`MobileNetV1FeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`MobileNetV1ImageProcessor`]. See
+[`MobileNetV1ImageProcessor.__call__`] for details.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
@ -43,7 +43,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "MobileNetV2Config"
-_FEAT_EXTRACTOR_FOR_DOC = "MobileNetV2FeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "MobileNetV2ImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "google/mobilenet_v2_1.0_224"
@ -486,8 +486,8 @@ MOBILENET_V2_START_DOCSTRING = r"""
MOBILENET_V2_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`MobileNetV2FeatureExtractor`]. See
-[`MobileNetV2FeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`MobileNetV2ImageProcessor`]. See
+[`MobileNetV2ImageProcessor.__call__`] for details.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
@ -811,17 +811,17 @@ class MobileNetV2ForSemanticSegmentation(MobileNetV2PreTrainedModel):
Examples:

```python
->>> from transformers import MobileNetV2FeatureExtractor, MobileNetV2ForSemanticSegmentation
+>>> from transformers import MobileNetV2ImageProcessor, MobileNetV2ForSemanticSegmentation
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = MobileNetV2FeatureExtractor.from_pretrained("google/deeplabv3_mobilenet_v2_1.0_513")
+>>> image_processor = MobileNetV2ImageProcessor.from_pretrained("google/deeplabv3_mobilenet_v2_1.0_513")
>>> model = MobileNetV2ForSemanticSegmentation.from_pretrained("google/deeplabv3_mobilenet_v2_1.0_513")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
... outputs = model(**inputs)
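The example above stops at the raw outputs; the segmentation logits come back at a reduced resolution, so a common follow-up (a sketch reusing `outputs` and `image` from the snippet above; the exact downscaling factor depends on the model) is to upsample and take a per-pixel argmax:

```python
import torch

logits = outputs.logits  # (batch_size, num_labels, reduced_height, reduced_width)
upsampled_logits = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
segmentation_map = upsampled_logits.argmax(dim=1)[0]  # (height, width) map of label ids
```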
@ -49,7 +49,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "MobileViTConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "MobileViTFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "MobileViTImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "apple/mobilevit-small"
@ -692,8 +692,8 @@ MOBILEVIT_START_DOCSTRING = r"""
MOBILEVIT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`MobileViTFeatureExtractor`]. See
-[`MobileViTFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`MobileViTImageProcessor`]. See
+[`MobileViTImageProcessor.__call__`] for details.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
@ -1027,17 +1027,17 @@ class MobileViTForSemanticSegmentation(MobileViTPreTrainedModel):
Examples:

```python
->>> from transformers import MobileViTFeatureExtractor, MobileViTForSemanticSegmentation
+>>> from transformers import MobileViTImageProcessor, MobileViTForSemanticSegmentation
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/deeplabv3-mobilevit-small")
+>>> image_processor = MobileViTImageProcessor.from_pretrained("apple/deeplabv3-mobilevit-small")
>>> model = MobileViTForSemanticSegmentation.from_pretrained("apple/deeplabv3-mobilevit-small")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")

>>> with torch.no_grad():
... outputs = model(**inputs)
@ -43,7 +43,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "MobileViTConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "MobileViTFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "MobileViTImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "apple/mobilevit-small"
@ -811,8 +811,8 @@ MOBILEVIT_START_DOCSTRING = r"""
MOBILEVIT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`np.ndarray`, `tf.Tensor`, `List[tf.Tensor]`, `Dict[str, tf.Tensor]` or `Dict[str, np.ndarray]` and each example must have the shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`MobileViTFeatureExtractor`]. See
-[`MobileViTFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`MobileViTImageProcessor`]. See
+[`MobileViTImageProcessor.__call__`] for details.

output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
@ -1103,17 +1103,17 @@ class TFMobileViTForSemanticSegmentation(TFMobileViTPreTrainedModel):
Examples:

```python
->>> from transformers import MobileViTFeatureExtractor, TFMobileViTForSemanticSegmentation
+>>> from transformers import MobileViTImageProcessor, TFMobileViTForSemanticSegmentation
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = MobileViTFeatureExtractor.from_pretrained("apple/deeplabv3-mobilevit-small")
+>>> image_processor = MobileViTImageProcessor.from_pretrained("apple/deeplabv3-mobilevit-small")
>>> model = TFMobileViTForSemanticSegmentation.from_pretrained("apple/deeplabv3-mobilevit-small")

->>> inputs = feature_extractor(images=image, return_tensors="tf")
+>>> inputs = image_processor(images=image, return_tensors="tf")

>>> outputs = model(**inputs)
@ -768,7 +768,7 @@ class PerceiverModel(PerceiverPreTrainedModel):
Examples:

```python
->>> from transformers import PerceiverConfig, PerceiverTokenizer, PerceiverFeatureExtractor, PerceiverModel
+>>> from transformers import PerceiverConfig, PerceiverTokenizer, PerceiverImageProcessor, PerceiverModel
>>> from transformers.models.perceiver.modeling_perceiver import (
... PerceiverTextPreprocessor,
... PerceiverImagePreprocessor,
@ -839,10 +839,10 @@ class PerceiverModel(PerceiverPreTrainedModel):
... )

>>> # you can then do a forward pass as follows:
->>> feature_extractor = PerceiverFeatureExtractor()
+>>> image_processor = PerceiverImageProcessor()
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
->>> inputs = feature_extractor(image, return_tensors="pt").pixel_values
+>>> inputs = image_processor(image, return_tensors="pt").pixel_values

>>> with torch.no_grad():
... outputs = model(inputs=inputs)
@ -1266,17 +1266,17 @@ class PerceiverForImageClassificationLearned(PerceiverPreTrainedModel):
Examples:

```python
->>> from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationLearned
+>>> from transformers import PerceiverImageProcessor, PerceiverForImageClassificationLearned
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-learned")
+>>> image_processor = PerceiverImageProcessor.from_pretrained("deepmind/vision-perceiver-learned")
>>> model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

->>> inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
+>>> inputs = image_processor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
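Continuing the snippet above (reusing `logits` and `model`), the logits can be mapped to a readable label via the model config; a small illustrative addition, not part of the diff:

```python
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```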
@ -1407,17 +1407,17 @@ class PerceiverForImageClassificationFourier(PerceiverPreTrainedModel):
Examples:

```python
->>> from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationFourier
+>>> from transformers import PerceiverImageProcessor, PerceiverForImageClassificationFourier
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-fourier")
+>>> image_processor = PerceiverImageProcessor.from_pretrained("deepmind/vision-perceiver-fourier")
>>> model = PerceiverForImageClassificationFourier.from_pretrained("deepmind/vision-perceiver-fourier")

->>> inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
+>>> inputs = image_processor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
@ -1548,17 +1548,17 @@ class PerceiverForImageClassificationConvProcessing(PerceiverPreTrainedModel):
Examples:

```python
->>> from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationConvProcessing
+>>> from transformers import PerceiverImageProcessor, PerceiverForImageClassificationConvProcessing
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-conv")
+>>> image_processor = PerceiverImageProcessor.from_pretrained("deepmind/vision-perceiver-conv")
>>> model = PerceiverForImageClassificationConvProcessing.from_pretrained("deepmind/vision-perceiver-conv")

->>> inputs = feature_extractor(images=image, return_tensors="pt").pixel_values
+>>> inputs = image_processor(images=image, return_tensors="pt").pixel_values
>>> outputs = model(inputs=inputs)
>>> logits = outputs.logits
>>> list(logits.shape)
@ -34,7 +34,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "PoolFormerConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "PoolFormerFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "PoolFormerImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "sail/poolformer_s12"
@ -302,8 +302,8 @@ POOLFORMER_START_DOCSTRING = r"""
POOLFORMER_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`PoolFormerFeatureExtractor`]. See
-[`PoolFormerFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`PoolFormerImageProcessor`]. See
+[`PoolFormerImageProcessor.__call__`] for details.
"""
@ -37,7 +37,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "RegNetConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "AutoFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "AutoImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "facebook/regnet-y-040"
@ -313,8 +313,8 @@ REGNET_START_DOCSTRING = r"""
REGNET_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.

output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
@ -35,7 +35,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "RegNetConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "AutoFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "AutoImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "facebook/regnet-y-040"
@ -389,8 +389,8 @@ REGNET_START_DOCSTRING = r"""
REGNET_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`tf.Tensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
@ -43,7 +43,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "ResNetConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "AutoFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "AutoImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "microsoft/resnet-50"
@ -285,8 +285,8 @@ RESNET_START_DOCSTRING = r"""
RESNET_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.

output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
@ -34,7 +34,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "ResNetConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "AutoFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "AutoImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "microsoft/resnet-50"
@ -313,8 +313,8 @@ RESNET_START_DOCSTRING = r"""
RESNET_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`tf.Tensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.

output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
@ -42,7 +42,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "SegformerConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "SegformerFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "SegformerImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "nvidia/mit-b0"
@ -491,7 +491,7 @@ SEGFORMER_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
-[`SegformerFeatureExtractor`]. See [`SegformerFeatureExtractor.__call__`] for details.
+[`SegformerImageProcessor`]. See [`SegformerImageProcessor.__call__`] for details.

output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
@ -772,17 +772,17 @@ class SegformerForSemanticSegmentation(SegformerPreTrainedModel):
Examples:

```python
->>> from transformers import SegformerFeatureExtractor, SegformerForSemanticSegmentation
+>>> from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation
>>> from PIL import Image
>>> import requests

->>> feature_extractor = SegformerFeatureExtractor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
+>>> image_processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
>>> model = SegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> logits = outputs.logits # shape (batch_size, num_labels, height/4, width/4)
>>> list(logits.shape)
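Since the logits above come out at a quarter of the input resolution, the image processor can rescale them back to the original image size. A short hedged sketch reusing `image_processor`, `outputs` and `image` from the example above:

```python
# Returns one (height, width) tensor of predicted label ids per image.
predicted_segmentation_map = image_processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
```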
@ -37,7 +37,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "SegformerConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "SegformerFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "SegformerImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "nvidia/mit-b0"
@ -568,8 +568,8 @@ SEGFORMER_INPUTS_DOCSTRING = r"""

Args:
pixel_values (`np.ndarray`, `tf.Tensor`, `List[tf.Tensor]` ``Dict[str, tf.Tensor]` or `Dict[str, np.ndarray]` and each example must have the shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.

output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
@ -835,17 +835,17 @@ class TFSegformerForSemanticSegmentation(TFSegformerPreTrainedModel):
Examples:

```python
->>> from transformers import SegformerFeatureExtractor, TFSegformerForSemanticSegmentation
+>>> from transformers import SegformerImageProcessor, TFSegformerForSemanticSegmentation
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = SegformerFeatureExtractor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
+>>> image_processor = SegformerImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
>>> model = TFSegformerForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

->>> inputs = feature_extractor(images=image, return_tensors="tf")
+>>> inputs = image_processor(images=image, return_tensors="tf")
>>> outputs = model(**inputs, training=False)
>>> # logits are of shape (batch_size, num_labels, height/4, width/4)
>>> logits = outputs.logits
@ -43,7 +43,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "SwinConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "AutoFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "AutoImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "microsoft/swin-tiny-patch4-window7-224"
@ -888,8 +888,8 @@ SWIN_START_DOCSTRING = r"""
SWIN_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:

@ -1053,7 +1053,7 @@ class SwinForMaskedImageModeling(SwinPreTrainedModel):

Examples:
```python
->>> from transformers import AutoFeatureExtractor, SwinForMaskedImageModeling
+>>> from transformers import AutoImageProcessor, SwinForMaskedImageModeling
>>> import torch
>>> from PIL import Image
>>> import requests
@ -1061,11 +1061,11 @@ class SwinForMaskedImageModeling(SwinPreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/swin-base-simmim-window6-192")
+>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/swin-base-simmim-window6-192")
>>> model = SwinForMaskedImageModeling.from_pretrained("microsoft/swin-base-simmim-window6-192")

>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
->>> pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
+>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()
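The hunk ends before the mask is actually used; a sketch of the step that typically follows, reusing `model`, `pixel_values` and `bool_masked_pos` from the snippet above (attribute names assume the current SimMIM-style output and are illustrative):

```python
outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
# `loss` is the reconstruction loss over the masked patches; `reconstruction` has the
# same (batch_size, num_channels, height, width) shape as the input pixel values.
loss, reconstructed_pixel_values = outputs.loss, outputs.reconstruction
```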
@ -47,7 +47,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "SwinConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "AutoFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "AutoImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "microsoft/swin-tiny-patch4-window7-224"
@ -985,8 +985,8 @@ SWIN_START_DOCSTRING = r"""
SWIN_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`tf.Tensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.
head_mask (`tf.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:

@ -1321,7 +1321,7 @@ class TFSwinForMaskedImageModeling(TFSwinPreTrainedModel):

Examples:
```python
->>> from transformers import AutoFeatureExtractor, TFSwinForMaskedImageModeling
+>>> from transformers import AutoImageProcessor, TFSwinForMaskedImageModeling
>>> import tensorflow as tf
>>> from PIL import Image
>>> import requests
@ -1329,11 +1329,11 @@ class TFSwinForMaskedImageModeling(TFSwinPreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
+>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
>>> model = TFSwinForMaskedImageModeling.from_pretrained("microsoft/swin-tiny-patch4-window7-224")

>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
->>> pixel_values = feature_extractor(images=image, return_tensors="tf").pixel_values
+>>> pixel_values = image_processor(images=image, return_tensors="tf").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = tf.random.uniform((1, num_patches)) >= 0.5
@ -44,7 +44,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "Swinv2Config"
-_FEAT_EXTRACTOR_FOR_DOC = "AutoFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "AutoImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "microsoft/swinv2-tiny-patch4-window8-256"
@ -968,8 +968,8 @@ SWINV2_START_DOCSTRING = r"""
SWINV2_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:

@ -1136,7 +1136,7 @@ class Swinv2ForMaskedImageModeling(Swinv2PreTrainedModel):

Examples:
```python
->>> from transformers import AutoFeatureExtractor, Swinv2ForMaskedImageModeling
+>>> from transformers import AutoImageProcessor, Swinv2ForMaskedImageModeling
>>> import torch
>>> from PIL import Image
>>> import requests
@ -1144,11 +1144,11 @@ class Swinv2ForMaskedImageModeling(Swinv2PreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
+>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")
>>> model = Swinv2ForMaskedImageModeling.from_pretrained("microsoft/swinv2-tiny-patch4-window8-256")

>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
->>> pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
+>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()
@ -136,7 +136,7 @@ class TableTransformerModelOutput(Seq2SeqModelOutput):

@dataclass
-# Copied from transformers.models.detr.modeling_detr.DetrObjectDetectionOutput with Detr->TableTransformer,DetrFeatureExtractor->DetrFeatureExtractor
+# Copied from transformers.models.detr.modeling_detr.DetrObjectDetectionOutput with Detr->TableTransformer,DetrImageProcessor->DetrImageProcessor
class TableTransformerObjectDetectionOutput(ModelOutput):
"""
Output type of [`TableTransformerForObjectDetection`].
@ -153,7 +153,7 @@ class TableTransformerObjectDetectionOutput(ModelOutput):
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
-possible padding). You can use [`~TableTransformerFeatureExtractor.post_process_object_detection`] to
+possible padding). You can use [`~TableTransformerImageProcessor.post_process_object_detection`] to
retrieve the unnormalized bounding boxes.
auxiliary_outputs (`list[Dict]`, *optional*):
Optional, only returned when auxilary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
@ -797,8 +797,7 @@ TABLE_TRANSFORMER_INPUTS_DOCSTRING = r"""
|
|||||||
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
|
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
|
||||||
Pixel values. Padding will be ignored by default should you provide it.
|
Pixel values. Padding will be ignored by default should you provide it.
|
||||||
|
|
||||||
Pixel values can be obtained using [`DetrFeatureExtractor`]. See [`DetrFeatureExtractor.__call__`] for
|
Pixel values can be obtained using [`DetrImageProcessor`]. See [`DetrImageProcessor.__call__`] for details.
|
||||||
details.
|
|
||||||
|
|
||||||
pixel_mask (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*):
|
pixel_mask (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*):
|
||||||
Mask to avoid performing attention on padding pixel values. Mask values selected in `[0, 1]`:
|
Mask to avoid performing attention on padding pixel values. Mask values selected in `[0, 1]`:
|
||||||
@ -1188,18 +1187,18 @@ class TableTransformerModel(TableTransformerPreTrainedModel):
|
|||||||
Examples:
|
Examples:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
>>> from transformers import AutoFeatureExtractor, TableTransformerModel
|
>>> from transformers import AutoImageProcessor, TableTransformerModel
|
||||||
>>> from huggingface_hub import hf_hub_download
|
>>> from huggingface_hub import hf_hub_download
|
||||||
>>> from PIL import Image
|
>>> from PIL import Image
|
||||||
|
|
||||||
>>> file_path = hf_hub_download(repo_id="nielsr/example-pdf", repo_type="dataset", filename="example_pdf.png")
|
>>> file_path = hf_hub_download(repo_id="nielsr/example-pdf", repo_type="dataset", filename="example_pdf.png")
|
||||||
>>> image = Image.open(file_path).convert("RGB")
|
>>> image = Image.open(file_path).convert("RGB")
|
||||||
|
|
||||||
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/table-transformer-detection")
|
>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
|
||||||
>>> model = TableTransformerModel.from_pretrained("microsoft/table-transformer-detection")
|
>>> model = TableTransformerModel.from_pretrained("microsoft/table-transformer-detection")
|
||||||
|
|
||||||
>>> # prepare image for the model
|
>>> # prepare image for the model
|
||||||
>>> inputs = feature_extractor(images=image, return_tensors="pt")
|
>>> inputs = image_processor(images=image, return_tensors="pt")
|
||||||
|
|
||||||
>>> # forward pass
|
>>> # forward pass
|
||||||
>>> outputs = model(**inputs)
|
>>> outputs = model(**inputs)
|
||||||
@ -1357,24 +1356,24 @@ class TableTransformerForObjectDetection(TableTransformerPreTrainedModel):
|
|||||||
|
|
||||||
```python
|
```python
|
||||||
>>> from huggingface_hub import hf_hub_download
|
>>> from huggingface_hub import hf_hub_download
|
||||||
>>> from transformers import AutoFeatureExtractor, TableTransformerForObjectDetection
|
>>> from transformers import AutoImageProcessor, TableTransformerForObjectDetection
|
||||||
>>> import torch
|
>>> import torch
|
||||||
>>> from PIL import Image
|
>>> from PIL import Image
|
||||||
|
|
||||||
>>> file_path = hf_hub_download(repo_id="nielsr/example-pdf", repo_type="dataset", filename="example_pdf.png")
|
>>> file_path = hf_hub_download(repo_id="nielsr/example-pdf", repo_type="dataset", filename="example_pdf.png")
|
||||||
>>> image = Image.open(file_path).convert("RGB")
|
>>> image = Image.open(file_path).convert("RGB")
|
||||||
|
|
||||||
>>> feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/table-transformer-detection")
|
>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
|
||||||
>>> model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")
|
>>> model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")
|
||||||
|
|
||||||
>>> inputs = feature_extractor(images=image, return_tensors="pt")
|
>>> inputs = image_processor(images=image, return_tensors="pt")
|
||||||
>>> outputs = model(**inputs)
|
>>> outputs = model(**inputs)
|
||||||
|
|
||||||
>>> # convert outputs (bounding boxes and class logits) to COCO API
|
>>> # convert outputs (bounding boxes and class logits) to COCO API
|
||||||
>>> target_sizes = torch.tensor([image.size[::-1]])
|
>>> target_sizes = torch.tensor([image.size[::-1]])
|
||||||
>>> results = feature_extractor.post_process_object_detection(
|
>>> results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[
|
||||||
... outputs, threshold=0.9, target_sizes=target_sizes
|
... 0
|
||||||
... )[0]
|
... ]
|
||||||
|
|
||||||
>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
|
>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
|
||||||
... box = [round(i, 2) for i in box.tolist()]
|
... box = [round(i, 2) for i in box.tolist()]
|
||||||
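The diffed example ends inside the reporting loop (only the box rounding is shown). Here is a hedged sketch, not taken from the diff, of how such a loop is usually finished, written as a small helper so it stands on its own; `results` is the dict returned by `post_process_object_detection` above and `id2label` would be `model.config.id2label`.

```python
# Illustrative helper, not part of this commit: report the detections contained in the
# `results` dict produced by `post_process_object_detection` in the example above.
def report_detections(results, id2label):
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        box = [round(i, 2) for i in box.tolist()]
        # id2label maps class indices to readable names; the score is rounded for display
        print(f"Detected {id2label[label.item()]} with confidence {round(score.item(), 3)} at location {box}")
```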
@@ -23,7 +23,7 @@ from ...processing_utils import ProcessorMixin

class TrOCRProcessor(ProcessorMixin):
r"""
-Constructs a TrOCR processor which wraps a vision feature extractor and a TrOCR tokenizer into a single processor.
+Constructs a TrOCR processor which wraps a vision image processor and a TrOCR tokenizer into a single processor.

[`TrOCRProcessor`] offers all the functionalities of [`ViTImageProcessor`/`DeiTImageProcessor`] and
[`RobertaTokenizer`/`XLMRobertaTokenizer`]. See the [`~TrOCRProcessor.__call__`] and [`~TrOCRProcessor.decode`] for
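This hunk only rewords the `TrOCRProcessor` docstring, so a short usage sketch may help show what the wrapped image processor and tokenizer do in practice. It is illustrative and not taken from the diff: the checkpoint name is an assumption, and the COCO image URL from the examples above merely stands in for a real text-line image.

```python
# Illustrative sketch, not part of this commit: TrOCRProcessor bundles an image processor
# (image -> pixel_values) with a tokenizer (token ids -> text) behind one object.
import requests
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Checkpoint name is an assumption for illustration; any TrOCR checkpoint should behave the same.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# Placeholder image; in practice this would be a cropped line of handwritten or printed text.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

pixel_values = processor(images=image, return_tensors="pt").pixel_values  # image-processor side
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]  # tokenizer side
```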
@@ -38,7 +38,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "VanConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "AutoFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "AutoImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "Visual-Attention-Network/van-base"
@@ -407,8 +407,8 @@ VAN_START_DOCSTRING = r"""
VAN_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.

output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all stages. See `hidden_states` under returned tensors for
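Since the VAN hunks only touch docstring constants, here is a minimal sketch of producing the `pixel_values` described above with the renamed class. It is illustrative, not part of the diff, and the classification head (`VanForImageClassification`) is an assumption rather than something this commit touches.

```python
# Illustrative sketch, not part of this commit: obtain `pixel_values` for VAN with
# AutoImageProcessor, using the checkpoint named in the docstring constants above.
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, VanForImageClassification  # model class assumed

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("Visual-Attention-Network/van-base")
model = VanForImageClassification.from_pretrained("Visual-Attention-Network/van-base")

inputs = image_processor(images=image, return_tensors="pt")  # yields the `pixel_values` argument
with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = model.config.id2label[logits.argmax(-1).item()]
```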
@@ -510,8 +510,8 @@ VIDEOMAE_START_DOCSTRING = r"""
VIDEOMAE_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_frames, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`VideoMAEFeatureExtractor`]. See
-[`VideoMAEFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`VideoMAEImageProcessor`]. See
+[`VideoMAEImageProcessor.__call__`] for details.

head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -581,7 +581,7 @@ class VideoMAEModel(VideoMAEPreTrainedModel):
>>> from decord import VideoReader, cpu
>>> import numpy as np

->>> from transformers import VideoMAEFeatureExtractor, VideoMAEModel
+>>> from transformers import VideoMAEImageProcessor, VideoMAEModel
>>> from huggingface_hub import hf_hub_download

@@ -605,11 +605,11 @@ class VideoMAEModel(VideoMAEPreTrainedModel):
>>> indices = sample_frame_indices(clip_len=16, frame_sample_rate=4, seg_len=len(videoreader))
>>> video = videoreader.get_batch(indices).asnumpy()

->>> feature_extractor = VideoMAEFeatureExtractor.from_pretrained("MCG-NJU/videomae-base")
+>>> image_processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
>>> model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

>>> # prepare video for the model
->>> inputs = feature_extractor(list(video), return_tensors="pt")
+>>> inputs = image_processor(list(video), return_tensors="pt")

>>> # forward pass
>>> outputs = model(**inputs)
@@ -765,17 +765,17 @@ class VideoMAEForPreTraining(VideoMAEPreTrainedModel):

Examples:
```python
->>> from transformers import VideoMAEFeatureExtractor, VideoMAEForPreTraining
+>>> from transformers import VideoMAEImageProcessor, VideoMAEForPreTraining
>>> import numpy as np
>>> import torch

>>> num_frames = 16
>>> video = list(np.random.randn(16, 3, 224, 224))

->>> feature_extractor = VideoMAEFeatureExtractor.from_pretrained("MCG-NJU/videomae-base")
+>>> image_processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
>>> model = VideoMAEForPreTraining.from_pretrained("MCG-NJU/videomae-base")

->>> pixel_values = feature_extractor(video, return_tensors="pt").pixel_values
+>>> pixel_values = image_processor(video, return_tensors="pt").pixel_values

>>> num_patches_per_frame = (model.config.image_size // model.config.patch_size) ** 2
>>> seq_length = (num_frames // model.config.tubelet_size) * num_patches_per_frame
@@ -942,7 +942,7 @@ class VideoMAEForVideoClassification(VideoMAEPreTrainedModel):
>>> import torch
>>> import numpy as np

->>> from transformers import VideoMAEFeatureExtractor, VideoMAEForVideoClassification
+>>> from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification
>>> from huggingface_hub import hf_hub_download

>>> np.random.seed(0)
@@ -968,10 +968,10 @@ class VideoMAEForVideoClassification(VideoMAEPreTrainedModel):
>>> indices = sample_frame_indices(clip_len=16, frame_sample_rate=4, seg_len=len(videoreader))
>>> video = videoreader.get_batch(indices).asnumpy()

->>> feature_extractor = VideoMAEFeatureExtractor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
+>>> image_processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")
>>> model = VideoMAEForVideoClassification.from_pretrained("MCG-NJU/videomae-base-finetuned-kinetics")

->>> inputs = feature_extractor(list(video), return_tensors="pt")
+>>> inputs = image_processor(list(video), return_tensors="pt")

>>> with torch.no_grad():
... outputs = model(**inputs)
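The video-classification example above stops at the forward pass. A small hedged helper, not from the diff, shows how the resulting logits are usually mapped to a Kinetics label; it assumes `outputs` is the value returned by `model(**inputs)` and `id2label` is `model.config.id2label`.

```python
# Illustrative helper, not part of this commit: turn VideoMAEForVideoClassification logits
# into a human-readable label.
import torch

def top_prediction(outputs, id2label):
    logits = outputs.logits                      # shape (batch_size, num_labels)
    predicted_class = int(torch.argmax(logits, dim=-1)[0])
    return id2label[predicted_class]
```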
@@ -86,8 +86,8 @@ VISION_ENCODER_DECODER_START_DOCSTRING = r"""
VISION_ENCODER_DECODER_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`jnp.ndarray` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using the vision model's feature extractor. For example, using
-[`ViTFeatureExtractor`]. See [`ViTFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using the vision model's image processor. For example, using
+[`ViTImageProcessor`]. See [`ViTImageProcessor.__call__`] for details.
decoder_input_ids (`jnp.ndarray` of shape `(batch_size, target_sequence_length)`, *optional*):
Indices of decoder input sequence tokens in the vocabulary.

@@ -114,8 +114,8 @@ VISION_ENCODER_DECODER_INPUTS_DOCSTRING = r"""
VISION_ENCODER_DECODER_ENCODE_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`jnp.ndarray` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using the vision model's feature extractor. For example, using
-[`ViTFeatureExtractor`]. See [`ViTFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using the vision model's image processor. For example, using
+[`ViTImageProcessor`]. See [`ViTImageProcessor.__call__`] for details.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
@@ -409,21 +409,21 @@ class FlaxVisionEncoderDecoderModel(FlaxPreTrainedModel):
Example:

```python
->>> from transformers import ViTFeatureExtractor, FlaxVisionEncoderDecoderModel
+>>> from transformers import ViTImageProcessor, FlaxVisionEncoderDecoderModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
+>>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

>>> # initialize a vit-gpt2 from pretrained ViT and GPT2 models. Note that the cross-attention layers will be randomly initialized
>>> model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
... "google/vit-base-patch16-224-in21k", "gpt2"
... )

->>> pixel_values = feature_extractor(images=image, return_tensors="np").pixel_values
+>>> pixel_values = image_processor(images=image, return_tensors="np").pixel_values
>>> encoder_outputs = model.encode(pixel_values)
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
@@ -487,7 +487,7 @@ class FlaxVisionEncoderDecoderModel(FlaxPreTrainedModel):
Example:

```python
->>> from transformers import ViTFeatureExtractor, FlaxVisionEncoderDecoderModel
+>>> from transformers import ViTImageProcessor, FlaxVisionEncoderDecoderModel
>>> import jax.numpy as jnp
>>> from PIL import Image
>>> import requests
@@ -495,14 +495,14 @@ class FlaxVisionEncoderDecoderModel(FlaxPreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
+>>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

>>> # initialize a vit-gpt2 from pretrained ViT and GPT2 models. Note that the cross-attention layers will be randomly initialized
>>> model = FlaxVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
... "google/vit-base-patch16-224-in21k", "gpt2"
... )

->>> pixel_values = feature_extractor(images=image, return_tensors="np").pixel_values
+>>> pixel_values = image_processor(images=image, return_tensors="np").pixel_values
>>> encoder_outputs = model.encode(pixel_values)

>>> decoder_start_token_id = model.config.decoder.bos_token_id
@@ -617,14 +617,14 @@ class FlaxVisionEncoderDecoderModel(FlaxPreTrainedModel):
Examples:

```python
->>> from transformers import FlaxVisionEncoderDecoderModel, ViTFeatureExtractor, GPT2Tokenizer
+>>> from transformers import FlaxVisionEncoderDecoderModel, ViTImageProcessor, GPT2Tokenizer
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
+>>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

>>> # load output tokenizer
>>> tokenizer_output = GPT2Tokenizer.from_pretrained("gpt2")
@@ -634,7 +634,7 @@ class FlaxVisionEncoderDecoderModel(FlaxPreTrainedModel):
... "google/vit-base-patch16-224-in21k", "gpt2"
... )

->>> pixel_values = feature_extractor(images=image, return_tensors="np").pixel_values
+>>> pixel_values = image_processor(images=image, return_tensors="np").pixel_values

>>> # use GPT2's eos_token as the pad as well as eos token
>>> model.config.eos_token_id = model.config.decoder.eos_token_id
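The captioning example ends after setting the EOS token. A hedged continuation, not part of the diff, shows how generation and decoding usually follow; it reuses `model`, `pixel_values` and `tokenizer_output` from the example above and assumes the Flax `generate` API of this release returns an object with a `.sequences` field.

```python
# Illustrative continuation, not part of this commit; names come from the example above.
model.config.pad_token_id = model.config.eos_token_id  # GPT2 has no dedicated pad token

# Beam-search caption generation; `num_beams` and `max_length` are arbitrary illustrative values.
sequences = model.generate(pixel_values, num_beams=4, max_length=12).sequences
captions = tokenizer_output.batch_decode(sequences, skip_special_tokens=True)
```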
@@ -106,8 +106,8 @@ VISION_TEXT_DUAL_ENCODER_INPUTS_DOCSTRING = r"""
[What are position IDs?](../glossary#position-ids)
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
-a feature extractor (e.g. if you use ViT as the encoder, you should use [`ViTFeatureExtractor`]). See
-[`ViTFeatureExtractor.__call__`] for details.
+an image processor (e.g. if you use ViT as the encoder, you should use [`ViTImageProcessor`]). See
+[`ViTImageProcessor.__call__`] for details.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
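The hunk above only renames the class the docstring points at; a minimal sketch of producing that `pixel_values` argument with `ViTImageProcessor` follows (illustrative, not part of the diff; the checkpoint name is an example).

```python
# Illustrative sketch, not part of this commit: build the `pixel_values` tensor described in the
# VISION_TEXT_DUAL_ENCODER_INPUTS_DOCSTRING above with ViTImageProcessor.
import requests
from PIL import Image
from transformers import ViTImageProcessor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")  # example checkpoint
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
# pixel_values has shape (batch_size, num_channels, height, width), as documented above.
```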
@@ -70,8 +70,8 @@ VIT_START_DOCSTRING = r"""
VIT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`numpy.ndarray` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`ViTFeatureExtractor`]. See
-[`ViTFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`ViTImageProcessor`]. See [`ViTImageProcessor.__call__`]
+for details.

output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
@@ -565,17 +565,17 @@ FLAX_VISION_MODEL_DOCSTRING = """
Examples:

```python
->>> from transformers import ViTFeatureExtractor, FlaxViTModel
+>>> from transformers import ViTImageProcessor, FlaxViTModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
+>>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = FlaxViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

->>> inputs = feature_extractor(images=image, return_tensors="np")
+>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
```
@@ -648,7 +648,7 @@ FLAX_VISION_CLASSIF_DOCSTRING = """
Example:

```python
->>> from transformers import ViTFeatureExtractor, FlaxViTForImageClassification
+>>> from transformers import ViTImageProcessor, FlaxViTForImageClassification
>>> from PIL import Image
>>> import jax
>>> import requests
@@ -656,10 +656,10 @@ FLAX_VISION_CLASSIF_DOCSTRING = """
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
+>>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
>>> model = FlaxViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

->>> inputs = feature_extractor(images=image, return_tensors="np")
+>>> inputs = image_processor(images=image, return_tensors="np")
>>> outputs = model(**inputs)
>>> logits = outputs.logits
@@ -41,7 +41,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "ViTConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "ViTFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "ViTImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "google/vit-base-patch16-224-in21k"
@@ -629,8 +629,8 @@ VIT_START_DOCSTRING = r"""
VIT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`np.ndarray`, `tf.Tensor`, `List[tf.Tensor]` ``Dict[str, tf.Tensor]` or `Dict[str, np.ndarray]` and each example must have the shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`ViTFeatureExtractor`]. See
-[`ViTFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`ViTImageProcessor`]. See [`ViTImageProcessor.__call__`]
+for details.

head_mask (`np.ndarray` or `tf.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -42,7 +42,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "ViTConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "ViTFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "ViTImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "google/vit-base-patch16-224-in21k"
@@ -481,8 +481,8 @@ VIT_START_DOCSTRING = r"""
VIT_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`ViTFeatureExtractor`]. See
-[`ViTFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`ViTImageProcessor`]. See [`ViTImageProcessor.__call__`]
+for details.

head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -664,7 +664,7 @@ class ViTForMaskedImageModeling(ViTPreTrainedModel):

Examples:
```python
->>> from transformers import ViTFeatureExtractor, ViTForMaskedImageModeling
+>>> from transformers import ViTImageProcessor, ViTForMaskedImageModeling
>>> import torch
>>> from PIL import Image
>>> import requests
@@ -672,11 +672,11 @@ class ViTForMaskedImageModeling(ViTPreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
+>>> image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
>>> model = ViTForMaskedImageModeling.from_pretrained("google/vit-base-patch16-224-in21k")

>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
->>> pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values
+>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()
@@ -770,8 +770,8 @@ VIT_MAE_START_DOCSTRING = r"""
VIT_MAE_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`np.ndarray`, `tf.Tensor`, `List[tf.Tensor]` ``Dict[str, tf.Tensor]` or `Dict[str, np.ndarray]` and each example must have the shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.

head_mask (`np.ndarray` or `tf.Tensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -830,17 +830,17 @@ class TFViTMAEModel(TFViTMAEPreTrainedModel):
Examples:

```python
->>> from transformers import AutoFeatureExtractor, TFViTMAEModel
+>>> from transformers import AutoImageProcessor, TFViTMAEModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-mae-base")
+>>> image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
>>> model = TFViTMAEModel.from_pretrained("facebook/vit-mae-base")

->>> inputs = feature_extractor(images=image, return_tensors="tf")
+>>> inputs = image_processor(images=image, return_tensors="tf")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
```"""
@@ -1121,17 +1121,17 @@ class TFViTMAEForPreTraining(TFViTMAEPreTrainedModel):
Examples:

```python
->>> from transformers import AutoFeatureExtractor, TFViTMAEForPreTraining
+>>> from transformers import AutoImageProcessor, TFViTMAEForPreTraining
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-mae-base")
+>>> image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
>>> model = TFViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> loss = outputs.loss
>>> mask = outputs.mask
@@ -612,8 +612,8 @@ VIT_MAE_START_DOCSTRING = r"""
VIT_MAE_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.

head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -677,17 +677,17 @@ class ViTMAEModel(ViTMAEPreTrainedModel):
Examples:

```python
->>> from transformers import AutoFeatureExtractor, ViTMAEModel
+>>> from transformers import AutoImageProcessor, ViTMAEModel
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-mae-base")
+>>> image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
>>> model = ViTMAEModel.from_pretrained("facebook/vit-mae-base")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
```"""
@@ -978,17 +978,17 @@ class ViTMAEForPreTraining(ViTMAEPreTrainedModel):
Examples:

```python
->>> from transformers import AutoFeatureExtractor, ViTMAEForPreTraining
+>>> from transformers import AutoImageProcessor, ViTMAEForPreTraining
>>> from PIL import Image
>>> import requests

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-mae-base")
+>>> image_processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
>>> model = ViTMAEForPreTraining.from_pretrained("facebook/vit-mae-base")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> loss = outputs.loss
>>> mask = outputs.mask
@@ -464,8 +464,8 @@ VIT_MSN_START_DOCSTRING = r"""
VIT_MSN_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.

head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -532,7 +532,7 @@ class ViTMSNModel(ViTMSNPreTrainedModel):
Examples:

```python
->>> from transformers import AutoFeatureExtractor, ViTMSNModel
+>>> from transformers import AutoImageProcessor, ViTMSNModel
>>> import torch
>>> from PIL import Image
>>> import requests
@@ -540,9 +540,9 @@ class ViTMSNModel(ViTMSNPreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-small")
+>>> image_processor = AutoImageProcessor.from_pretrained("facebook/vit-msn-small")
>>> model = ViTMSNModel.from_pretrained("facebook/vit-msn-small")
->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> last_hidden_states = outputs.last_hidden_state
@@ -627,7 +627,7 @@ class ViTMSNForImageClassification(ViTMSNPreTrainedModel):
Examples:

```python
->>> from transformers import AutoFeatureExtractor, ViTMSNForImageClassification
+>>> from transformers import AutoImageProcessor, ViTMSNForImageClassification
>>> import torch
>>> from PIL import Image
>>> import requests
@@ -637,10 +637,10 @@ class ViTMSNForImageClassification(ViTMSNPreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/vit-msn-small")
+>>> image_processor = AutoImageProcessor.from_pretrained("facebook/vit-msn-small")
>>> model = ViTMSNForImageClassification.from_pretrained("facebook/vit-msn-small")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
... logits = model(**inputs).logits
>>> # model predicts one of the 1000 ImageNet classes
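The classification example above ends at the comment about the 1000 ImageNet classes. As a variation on reading off only the argmax, here is a hedged sketch, not from the diff, that reports the top-5 classes from the `logits` computed above, with `id2label` standing for `model.config.id2label`.

```python
# Illustrative helper, not part of this commit: top-5 ImageNet classes with softmax scores.
import torch

def top5_labels(logits, id2label):
    scores, indices = torch.topk(logits.softmax(dim=-1), k=5, dim=-1)
    return [(id2label[i.item()], round(s.item(), 3)) for s, i in zip(scores[0], indices[0])]
```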
@@ -53,7 +53,7 @@ logger = logging.get_logger(__name__)

# General docstring
_CONFIG_FOR_DOC = "YolosConfig"
-_FEAT_EXTRACTOR_FOR_DOC = "YolosFeatureExtractor"
+_FEAT_EXTRACTOR_FOR_DOC = "YolosImageProcessor"

# Base docstring
_CHECKPOINT_FOR_DOC = "hustvl/yolos-small"
@@ -83,7 +83,7 @@ class YolosObjectDetectionOutput(ModelOutput):
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
-possible padding). You can use [`~DetrFeatureExtractor.post_process`] to retrieve the unnormalized bounding
+possible padding). You can use [`~DetrImageProcessor.post_process`] to retrieve the unnormalized bounding
boxes.
auxiliary_outputs (`list[Dict]`, *optional*):
Optional, only returned when auxilary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
@@ -573,8 +573,8 @@ YOLOS_START_DOCSTRING = r"""
YOLOS_INPUTS_DOCSTRING = r"""
Args:
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
-Pixel values. Pixel values can be obtained using [`AutoFeatureExtractor`]. See
-[`AutoFeatureExtractor.__call__`] for details.
+Pixel values. Pixel values can be obtained using [`AutoImageProcessor`]. See
+[`AutoImageProcessor.__call__`] for details.

head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
@@ -756,7 +756,7 @@ class YolosForObjectDetection(YolosPreTrainedModel):
Examples:

```python
->>> from transformers import AutoFeatureExtractor, AutoModelForObjectDetection
+>>> from transformers import AutoImageProcessor, AutoModelForObjectDetection
>>> import torch
>>> from PIL import Image
>>> import requests
@@ -764,17 +764,17 @@ class YolosForObjectDetection(YolosPreTrainedModel):
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

->>> feature_extractor = AutoFeatureExtractor.from_pretrained("hustvl/yolos-tiny")
+>>> image_processor = AutoImageProcessor.from_pretrained("hustvl/yolos-tiny")
>>> model = AutoModelForObjectDetection.from_pretrained("hustvl/yolos-tiny")

->>> inputs = feature_extractor(images=image, return_tensors="pt")
+>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)

>>> # convert outputs (bounding boxes and class logits) to COCO API
>>> target_sizes = torch.tensor([image.size[::-1]])
->>> results = feature_extractor.post_process_object_detection(
-... outputs, threshold=0.9, target_sizes=target_sizes
-... )[0]
+>>> results = image_processor.post_process_object_detection(outputs, threshold=0.9, target_sizes=target_sizes)[
+... 0
+... ]

>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
... box = [round(i, 2) for i in box.tolist()]
@@ -186,14 +186,14 @@ wish, as it will appear on the Model Hub. Do not forget to include the organisat
Then you will have to say whether your model re-uses the same processing classes as the model you're cloning:

```
-Will your new model use the same processing class as Xxx (XxxTokenizer/XxxFeatureExtractor)
+Will your new model use the same processing class as Xxx (XxxTokenizer/XxxFeatureExtractor/XxxImageProcessor)
```

Answer yes if you have no intentions to make any change to the class used for preprocessing. It can use different
files (for instance you can reuse the `BertTokenizer` with a new vocab file).

If you answer no, you will have to give the name of the classes
-for the new tokenizer/feature extractor/processor (depending on the model you're cloning).
+for the new tokenizer/image processor/feature extractor/processor (depending on the model you're cloning).

Next the questionnaire will ask