<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# MobileViT
<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
<img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
</div>

## Overview
The MobileViT model was proposed in [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari. MobileViT introduces a new layer that replaces local processing in convolutions with global processing using transformers.

The abstract from the paper is the following:

*Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters.*

This model was contributed by [matthijs](https://huggingface.co/Matthijs). The TensorFlow version of the model was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code and weights can be found [here](https://github.com/apple/ml-cvnets).
## Usage tips
- MobileViT is more like a CNN than a Transformer model. It does not work on sequence data but on batches of images. Unlike ViT, there are no embeddings. The backbone model outputs a feature map. You can follow [this tutorial](https://keras.io/examples/vision/mobilevit) for a lightweight introduction.
- One can use [`MobileViTImageProcessor`] to prepare images for the model. Note that if you do your own preprocessing, the pretrained checkpoints expect images to be in BGR pixel order (not RGB). See the classification example right after these tips.
- The available image classification checkpoints are pre-trained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
- The segmentation model uses a [DeepLabV3](https://arxiv.org/abs/1706.05587) head. The available semantic segmentation checkpoints are pre-trained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/). See the segmentation example right after these tips.
- As the name suggests, MobileViT was designed to be performant and efficient on mobile phones. The TensorFlow versions of the MobileViT models are fully compatible with [TensorFlow Lite](https://www.tensorflow.org/lite).
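
If you just want to try a checkpoint, the snippet below is a minimal classification sketch in PyTorch; the checkpoint name and the example image URL are only placeholders you can swap for your own.

```py
import torch
import requests
from PIL import Image
from transformers import MobileViTImageProcessor, MobileViTForImageClassification

# Any of the ImageNet-1k classification checkpoints should work here.
checkpoint = "apple/mobilevit-small"
image_processor = MobileViTImageProcessor.from_pretrained(checkpoint)
model = MobileViTForImageClassification.from_pretrained(checkpoint)

# Example image; replace with your own.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The image processor handles resizing and the RGB-to-BGR channel flip mentioned above.
inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```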
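
A similar minimal sketch for semantic segmentation, assuming one of the DeepLabV3 checkpoints such as `apple/deeplabv3-mobilevit-small`:

```py
import torch
import requests
from PIL import Image
from transformers import MobileViTImageProcessor, MobileViTForSemanticSegmentation

checkpoint = "apple/deeplabv3-mobilevit-small"
image_processor = MobileViTImageProcessor.from_pretrained(checkpoint)
model = MobileViTForSemanticSegmentation.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Turn the logits into a per-pixel class map at the original image size.
segmentation_map = image_processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
print(segmentation_map.shape)
```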
You can use the following code to convert a MobileViT checkpoint (whether image classification or semantic segmentation) to a TensorFlow Lite model:
```py
from transformers import TFMobileViTForImageClassification
import tensorflow as tf

model_ckpt = "apple/mobilevit-xx-small"
model = TFMobileViTForImageClassification.from_pretrained(model_ckpt)

# Convert the Keras model to TensorFlow Lite, falling back to select TF ops
# for any operations without a built-in TFLite kernel.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()

# Write the serialized model to disk, e.g. "mobilevit-xx-small.tflite".
tflite_filename = model_ckpt.split("/")[-1] + ".tflite"
with open(tflite_filename, "wb") as f:
    f.write(tflite_model)
```
The resulting model will be just **about an MB**, making it a good fit for mobile applications where resources and network bandwidth can be constrained.
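
To sanity-check the exported file, you can run it with the TensorFlow Lite interpreter. The sketch below assumes the `mobilevit-xx-small.tflite` file written by the conversion code above and feeds the model a random input just to confirm that the graph executes:

```py
import numpy as np
import tensorflow as tf

# Load the converted model and allocate its tensors.
interpreter = tf.lite.Interpreter(model_path="mobilevit-xx-small.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Feed a random tensor with the input shape and dtype the model expects.
dummy_input = np.random.rand(*input_details[0]["shape"]).astype(input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy_input)
interpreter.invoke()

# For the classification checkpoints, the output is a batch of logits.
logits = interpreter.get_tensor(output_details[0]["index"])
print(logits.shape)
```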
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with MobileViT.

<PipelineTag pipeline="image-classification"/>

- [`MobileViTForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
- See also: [Image classification task guide](../tasks/image_classification)

**Semantic segmentation**

- [Semantic segmentation task guide](../tasks/semantic_segmentation)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
## MobileViTConfig
[[autodoc]] MobileViTConfig

## MobileViTFeatureExtractor

[[autodoc]] MobileViTFeatureExtractor
    - __call__
    - post_process_semantic_segmentation

## MobileViTImageProcessor

[[autodoc]] MobileViTImageProcessor
    - preprocess
    - post_process_semantic_segmentation

<frameworkcontent>
<pt>

## MobileViTModel

[[autodoc]] MobileViTModel
    - forward

## MobileViTForImageClassification

[[autodoc]] MobileViTForImageClassification
    - forward

## MobileViTForSemanticSegmentation

[[autodoc]] MobileViTForSemanticSegmentation
    - forward

</pt>
<tf>

## TFMobileViTModel

[[autodoc]] TFMobileViTModel
    - call

## TFMobileViTForImageClassification

[[autodoc]] TFMobileViTForImageClassification
    - call

## TFMobileViTForSemanticSegmentation

[[autodoc]] TFMobileViTForSemanticSegmentation
    - call

</tf>
</frameworkcontent>