# Vision Encoder Decoder Models

The [`VisionEncoderDecoderModel`] can be used to initialize an image-to-text-sequence model with any pretrained vision autoencoding model as the encoder (*e.g.* [ViT](vit), [BEiT](beit), [DeiT](deit)) and any pretrained language model as the decoder (*e.g.* [RoBERTa](roberta), [GPT2](gpt2), [BERT](bert)).

The effectiveness of initializing image-to-text-sequence models with pretrained checkpoints has been shown in (for example) [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.

An example of how to use a [`VisionEncoderDecoderModel`] for inference can be seen in [TrOCR](trocr).

## VisionEncoderDecoderConfig

[[autodoc]] VisionEncoderDecoderConfig

## VisionEncoderDecoderModel

[[autodoc]] VisionEncoderDecoderModel
    - forward
    - from_encoder_decoder_pretrained

## FlaxVisionEncoderDecoderModel

[[autodoc]] FlaxVisionEncoderDecoderModel
    - __call__
    - from_encoder_decoder_pretrained
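## Usage examples

Below is a minimal sketch of the encoder-decoder initialization described above, using [`~VisionEncoderDecoderModel.from_encoder_decoder_pretrained`]. The checkpoint names `google/vit-base-patch16-224-in21k` and `bert-base-uncased` are illustrative choices; any compatible vision encoder and language decoder checkpoints should work.

```python
from transformers import VisionEncoderDecoderModel

# Combine a pretrained ViT encoder with a pretrained BERT decoder.
# The cross-attention layers added to the decoder are randomly initialized,
# so the resulting model should be fine-tuned on a downstream task before use.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "bert-base-uncased"
)

# The combined model can be saved and reloaded like any other model.
model.save_pretrained("./vit-bert")
model = VisionEncoderDecoderModel.from_pretrained("./vit-bert")
```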
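And here is a minimal inference sketch along the lines of the TrOCR example referenced above, assuming the `microsoft/trocr-base-handwritten` checkpoint and an arbitrary image of a single line of handwritten text:

```python
import requests
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# TrOCR checkpoints are instances of VisionEncoderDecoderModel.
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# Load an example image of a single text line (any such image works here).
url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Preprocess the image, generate token ids autoregressively, and decode them.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)
```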