<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# LayoutLMv3

## Overview

The LayoutLMv3 model was proposed in [LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking](https://arxiv.org/abs/2204.08387) by Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei.
LayoutLMv3 simplifies [LayoutLMv2](layoutlmv2) by using patch embeddings (as in [ViT](vit)) instead of leveraging a CNN backbone, and pre-trains the model on three objectives: masked language modeling (MLM), masked image modeling (MIM),
and word-patch alignment (WPA).
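
As a quick illustration of the ViT-style visual branch, the configuration exposes the image resolution and patch size directly. This is a minimal sketch; the values printed are the library defaults for [`LayoutLMv3Config`].

```python
from transformers import LayoutLMv3Config, LayoutLMv3Model

# Default configuration: the visual branch embeds raw 224x224 images as
# 16x16 patches (ViT-style); there is no CNN backbone, unlike LayoutLMv2.
config = LayoutLMv3Config()
print(config.input_size, config.patch_size)  # 224 16

# Randomly initialized model with this architecture, for inspection only
model = LayoutLMv3Model(config)
```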

The abstract from the paper is the following:

*Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.*

Tips:

- In terms of data processing, LayoutLMv3 is identical to its predecessor [LayoutLMv2](layoutlmv2), except that:
    - images need to be resized and normalized with channels in regular RGB format. LayoutLMv2, on the other hand, normalizes the images internally and expects the channels in BGR format.
    - text is tokenized using byte-pair encoding (BPE), as opposed to WordPiece.

  Due to these differences in data preprocessing, one can use [`LayoutLMv3Processor`], which internally combines a [`LayoutLMv3FeatureExtractor`] (for the image modality) and a [`LayoutLMv3Tokenizer`]/[`LayoutLMv3TokenizerFast`] (for the text modality), to prepare all data for the model; see the sketch after this list.
- Regarding usage of [`LayoutLMv3Processor`], we refer to the [usage guide](layoutlmv2#usage-layoutlmv2processor) of its predecessor.
- Demo notebooks for LayoutLMv3 can be found [here](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/LayoutLMv3).
- Demo scripts can be found [here](https://github.com/huggingface/transformers/tree/main/examples/research_projects/layoutlmv3).
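
Below is a minimal sketch of both preprocessing paths. It assumes a local scan named `document.png`; the example words, boxes and label count are placeholders, and the token classification head is randomly initialized on top of the `microsoft/layoutlmv3-base` checkpoint.

```python
import torch
from PIL import Image
from transformers import (
    LayoutLMv3FeatureExtractor,
    LayoutLMv3ForTokenClassification,
    LayoutLMv3Processor,
    LayoutLMv3TokenizerFast,
)

image = Image.open("document.png").convert("RGB")  # placeholder document scan

# Path 1: built-in OCR. apply_ocr=True is the default, so the feature extractor
# runs Tesseract itself to obtain the words and normalized bounding boxes.
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
encoding = processor(image, return_tensors="pt")

# Path 2: bring your own OCR. Disable apply_ocr on the feature extractor and
# pass the words together with bounding boxes normalized to a 0-1000 scale.
feature_extractor = LayoutLMv3FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv3TokenizerFast.from_pretrained("microsoft/layoutlmv3-base")
processor = LayoutLMv3Processor(feature_extractor, tokenizer)
words = ["hello", "world"]  # placeholder OCR output
boxes = [[48, 84, 73, 128], [74, 84, 98, 128]]  # one 0-1000 box per word
encoding = processor(image, words, boxes=boxes, return_tensors="pt")

# The encoding feeds any LayoutLMv3 head; num_labels=2 is a placeholder.
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=2)
with torch.no_grad():
    outputs = model(**encoding)
print(outputs.logits.shape)  # (batch_size, sequence_length, num_labels)
```

The keys of `encoding` (`input_ids`, `attention_mask`, `bbox` and `pixel_values`) match the arguments of the model's forward pass, so the dictionary can be unpacked directly into the model.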

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/layoutlmv3_architecture.png"
alt="drawing" width="600"/>

<small> LayoutLMv3 architecture. Taken from the <a href="https://arxiv.org/abs/2204.08387">original paper</a>. </small>

This model was contributed by [nielsr](https://huggingface.co/nielsr). The TensorFlow version of this model was added by [chriskoo](https://huggingface.co/chriskoo), [tokec](https://huggingface.co/tokec), and [lre](https://huggingface.co/lre). The original code can be found [here](https://github.com/microsoft/unilm/tree/master/layoutlmv3).

## LayoutLMv3Config

[[autodoc]] LayoutLMv3Config

## LayoutLMv3FeatureExtractor

[[autodoc]] LayoutLMv3FeatureExtractor
    - __call__

## LayoutLMv3Tokenizer

[[autodoc]] LayoutLMv3Tokenizer
    - __call__
    - save_vocabulary

## LayoutLMv3TokenizerFast

[[autodoc]] LayoutLMv3TokenizerFast
    - __call__

## LayoutLMv3Processor

[[autodoc]] LayoutLMv3Processor
    - __call__

## LayoutLMv3Model

[[autodoc]] LayoutLMv3Model
    - forward

## LayoutLMv3ForSequenceClassification

[[autodoc]] LayoutLMv3ForSequenceClassification
    - forward

## LayoutLMv3ForTokenClassification

[[autodoc]] LayoutLMv3ForTokenClassification
    - forward

## LayoutLMv3ForQuestionAnswering

[[autodoc]] LayoutLMv3ForQuestionAnswering
    - forward

## TFLayoutLMv3Model

[[autodoc]] TFLayoutLMv3Model
    - call

## TFLayoutLMv3ForSequenceClassification

[[autodoc]] TFLayoutLMv3ForSequenceClassification
    - call

## TFLayoutLMv3ForTokenClassification

[[autodoc]] TFLayoutLMv3ForTokenClassification
    - call

## TFLayoutLMv3ForQuestionAnswering

[[autodoc]] TFLayoutLMv3ForQuestionAnswering
    - call