# LayoutLMv3

## Overview

The LayoutLMv3 model was proposed in [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei. LayoutLMv3 simplifies [LayoutLMv2](layoutlmv2) by using patch embeddings (as in [ViT](vit)) instead of leveraging a CNN backbone, and pre-trains the model on 3 objectives: masked language modeling (MLM), masked image modeling (MIM) and word-patch alignment (WPA).

The abstract from the paper is the following:

*Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.*

Tips:

- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
    - images need to be resized and normalized with channels in regular RGB format. LayoutLMv2 on the other hand normalizes the images internally and expects the channels in BGR format.
    - text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.

  Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`] which internally combines a [`LayoutLMv3FeatureExtractor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality) to prepare all data for the model (see the sketch after this list).
- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
- Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).
- Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3).

*LayoutLMv3 architecture. Taken from the [original paper](https://arxiv.org/abs/2204.08387).*

This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [chriskoo](https://huggingface.co/chriskoo), [tokec](https://huggingface.co/tokec), and [lre](https://huggingface.co/lre). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/layoutlmv3).
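Below is a minimal sketch of this preprocessing flow with the official `microsoft/layoutlmv3-base` checkpoint. By default the feature extractor applies OCR (`apply_ocr=True`, which requires the Tesseract engine and `pytesseract` to be installed) to obtain words and normalized bounding boxes from the image; the image path `document.png` is a placeholder for your own document.

```python
from PIL import Image

from transformers import LayoutLMv3Model, LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
model = LayoutLMv3Model.from_pretrained("microsoft/layoutlmv3-base")

# "document.png" is a placeholder; note that LayoutLMv3 expects regular RGB channels
image = Image.open("document.png").convert("RGB")

# The processor runs OCR internally (apply_ocr=True by default) and returns
# input_ids, attention_mask, bbox and pixel_values in a single encoding
encoding = processor(image, return_tensors="pt")

outputs = model(**encoding)
last_hidden_state = outputs.last_hidden_state
```

If you already have words and normalized boxes from your own OCR engine, you can instead create the feature extractor with `apply_ocr=False` and pass the words and boxes to the processor yourself, as described in the [usage guide](layoutlmv2#usage-layoutlmv2processor) of the predecessor.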
## LayoutLMv3Config

[[autodoc]] LayoutLMv3Config

## LayoutLMv3FeatureExtractor

[[autodoc]] LayoutLMv3FeatureExtractor
    - __call__

## LayoutLMv3Tokenizer

[[autodoc]] LayoutLMv3Tokenizer
    - __call__
    - save_vocabulary

## LayoutLMv3TokenizerFast

[[autodoc]] LayoutLMv3TokenizerFast
    - __call__

## LayoutLMv3Processor

[[autodoc]] LayoutLMv3Processor
    - __call__

## LayoutLMv3Model

[[autodoc]] LayoutLMv3Model
    - forward

## LayoutLMv3ForSequenceClassification

[[autodoc]] LayoutLMv3ForSequenceClassification
    - forward

## LayoutLMv3ForTokenClassification

[[autodoc]] LayoutLMv3ForTokenClassification
    - forward

## LayoutLMv3ForQuestionAnswering

[[autodoc]] LayoutLMv3ForQuestionAnswering
    - forward

## TFLayoutLMv3Model

[[autodoc]] TFLayoutLMv3Model
    - call

## TFLayoutLMv3ForSequenceClassification

[[autodoc]] TFLayoutLMv3ForSequenceClassification
    - call

## TFLayoutLMv3ForTokenClassification

[[autodoc]] TFLayoutLMv3ForTokenClassification
    - call

## TFLayoutLMv3ForQuestionAnswering

[[autodoc]] TFLayoutLMv3ForQuestionAnswering
    - call