# VisionTextDualEncoder
## Overview
The [`VisionTextDualEncoderModel`] can be used to initialize a vision-text dual encoder model with
any pretrained vision autoencoding model as the vision encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit)) and any pretrained text autoencoding model as the text encoder (*e.g.* [RoBERTa](roberta), [BERT](bert)). Two projection layers are added on top of both the vision and text encoder to project the output embeddings
to a shared latent space. The projection layers are randomly initialized so the model should be fine-tuned on a
downstream task. This model can be used to align the vision-text embeddings using CLIP like contrastive image-text
training and then can be used for zero-shot vision tasks such image-classification or retrieval.
In [LiT: Zero-Shot Transfer with Locked-image Text Tuning](https://arxiv.org/abs/2111.07991) it is shown how
leveraging pre-trained (locked/frozen) image and text model for contrastive learning yields significant improvement on
new zero-shot vision tasks such as image classification or retrieval.
## VisionTextDualEncoderConfig
[[autodoc]] VisionTextDualEncoderConfig
## VisionTextDualEncoderProcessor
[[autodoc]] VisionTextDualEncoderProcessor
## VisionTextDualEncoderModel
[[autodoc]] VisionTextDualEncoderModel
- forward
## FlaxVisionTextDualEncoderModel
[[autodoc]] FlaxVisionTextDualEncoderModel
- __call__
## TFVisionTextDualEncoderModel
[[autodoc]] TFVisionTextDualEncoderModel
- call