..
    Copyright 2021 The HuggingFace Team. All rights reserved.

    Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
    the License. You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
    an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
    specific language governing permissions and limitations under the License.

Perceiver
-----------------------------------------------------------------------------------------------------------------------

Overview
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Perceiver IO model was proposed in `Perceiver IO: A General Architecture for Structured Inputs & Outputs
<https://arxiv.org/abs/2107.14795>`__ by Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch,
Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M.
Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira.

Perceiver IO is a generalization of `Perceiver <https://arxiv.org/abs/2103.03206>`__ to handle arbitrary outputs in
addition to arbitrary inputs. The original Perceiver only produced a single classification label. In addition to
classification labels, Perceiver IO can produce (for example) language, optical flow, and multimodal videos with audio.
This is done using the same building blocks as the original Perceiver. The computational complexity of Perceiver IO is
linear in the input and output size and the bulk of the processing occurs in the latent space, allowing us to process
inputs and outputs that are much larger than can be handled by standard Transformers. This means, for example,
Perceiver IO can do BERT-style masked language modeling directly using bytes instead of tokenized inputs.

The abstract from the paper is the following:

*The recently-proposed Perceiver model obtains good results on several domains (images, audio, multimodal, point
clouds) while scaling linearly in compute and memory with the input size. While the Perceiver supports many kinds of
inputs, it can only produce very simple outputs such as class scores. Perceiver IO overcomes this limitation without
sacrificing the original's appealing properties by learning to flexibly query the model's latent space to produce
outputs of arbitrary size and semantics. Perceiver IO still decouples model depth from data size and still scales
linearly with data size, but now with respect to both input and output sizes. The full Perceiver IO model achieves
strong results on tasks with highly structured output spaces, such as natural language and visual understanding,
StarCraft II, and multi-task and multi-modal domains. As highlights, Perceiver IO matches a Transformer-based BERT
baseline on the GLUE language benchmark without the need for input tokenization and achieves state-of-the-art
performance on Sintel optical flow estimation.*

Here's a TLDR explaining how Perceiver works:

The main problem with the self-attention mechanism of the Transformer is that the time and memory requirements scale
quadratically with the sequence length. Hence, models like BERT and RoBERTa are limited to a max sequence length of 512
tokens. Perceiver aims to solve this issue by performing self-attention not on the inputs themselves, but on a set of
latent variables, and using the inputs only for cross-attention. In this way, the time and memory requirements no
longer depend on the length of the inputs, as one uses a fixed number of latent variables, like 256 or 512. These are
randomly initialized, after which they are trained end-to-end using backpropagation.

Internally, :class:`~transformers.PerceiverModel` will create the latents, a tensor of shape :obj:`(batch_size,
num_latents, d_latents)`. One must provide :obj:`inputs` (which could be text, images, audio, you name it!) to the
model, which it will use to perform cross-attention with the latents. The output of the Perceiver encoder is a tensor
of the same shape. One can then, similar to BERT, convert the last hidden states of the latents to classification
logits by averaging along the sequence dimension and placing a linear layer on top to project :obj:`d_latents` to
:obj:`num_labels`.
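
Below is a minimal sketch of this pooling idea, using a randomly initialized :class:`~transformers.PerceiverModel`
together with :class:`~transformers.models.perceiver.modeling_perceiver.PerceiverTextPreprocessor`; the final linear
layer (with 2 labels) is purely illustrative and not part of the library:

.. code-block:: python

    import torch
    from transformers import PerceiverConfig, PerceiverTokenizer, PerceiverModel
    from transformers.models.perceiver.modeling_perceiver import PerceiverTextPreprocessor

    config = PerceiverConfig()
    # the text preprocessor embeds raw byte IDs before they are cross-attended with the latents
    model = PerceiverModel(config, input_preprocessor=PerceiverTextPreprocessor(config))

    tokenizer = PerceiverTokenizer()
    inputs = tokenizer("hello world", return_tensors="pt").input_ids

    with torch.no_grad():
        outputs = model(inputs=inputs)

    # the final latents, of shape (batch_size, num_latents, d_latents)
    latents = outputs.last_hidden_state

    # BERT-style pooling: average over the latent (sequence) dimension, then project to num_labels
    pooled = latents.mean(dim=1)
    logits = torch.nn.Linear(config.d_latents, 2)(pooled)  # illustrative classification head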

This was the idea of the original Perceiver paper. However, it could only output classification logits. In a follow-up
work, Perceiver IO, the authors generalized it to let the model also produce outputs of arbitrary size. How, you might
ask? The idea is actually relatively simple: one defines outputs of an arbitrary size, and then applies cross-attention
with the last hidden states of the latents, using the outputs as queries, and the latents as keys and values.

So let's say one wants to perform masked language modeling (BERT-style) with the Perceiver. As the Perceiver's input
length will not have an impact on the computation time of the self-attention layers, one can provide raw bytes,
providing :obj:`inputs` of length 2048 to the model. If one now masks out some of these 2048 tokens, one can define the
:obj:`outputs` as being of shape :obj:`(batch_size, 2048, 768)`. Next, one performs cross-attention with the final
hidden states of the latents to update the :obj:`outputs` tensor. After cross-attention, one still has a tensor of
shape :obj:`(batch_size, 2048, 768)`. One can then place a regular language modeling head on top, to project the last
dimension to the vocabulary size of the model, i.e. creating logits of shape :obj:`(batch_size, 2048, 262)` (as
Perceiver uses a vocabulary size of 262 byte IDs).
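
As a concrete sketch of the above (assuming the :obj:`deepmind/language-perceiver` checkpoint is available on the hub),
byte-level masked language modeling could look as follows; the masked byte positions are chosen purely for
illustration:

.. code-block:: python

    from transformers import PerceiverTokenizer, PerceiverForMaskedLM

    tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
    model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")

    text = "This is an incomplete sentence where some words are missing."
    # the tokenizer works on UTF-8 bytes and pads to the model's max length of 2048
    encoding = tokenizer(text, padding="max_length", return_tensors="pt")

    # mask a span of bytes (positions picked for illustration only)
    encoding.input_ids[0, 52:61] = tokenizer.mask_token_id

    outputs = model(inputs=encoding.input_ids, attention_mask=encoding.attention_mask)
    # logits of shape (batch_size, 2048, 262): one score per byte ID at every position
    print(outputs.logits.shape)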

This model was contributed by `nielsr <https://huggingface.co/nielsr>`__. The original code can be found `here
<https://github.com/deepmind/deepmind-research/tree/master/perceiver>`__.
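
For image classification, a similar sketch (assuming the :obj:`deepmind/vision-perceiver-learned` checkpoint is
available on the hub) pairs :class:`~transformers.PerceiverFeatureExtractor` with
:class:`~transformers.PerceiverForImageClassificationLearned`:

.. code-block:: python

    import requests
    from PIL import Image
    from transformers import PerceiverFeatureExtractor, PerceiverForImageClassificationLearned

    feature_extractor = PerceiverFeatureExtractor.from_pretrained("deepmind/vision-perceiver-learned")
    model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # center crop, resize and normalize the image into pixel values
    inputs = feature_extractor(images=image, return_tensors="pt").pixel_values

    outputs = model(inputs=inputs)
    predicted_class = outputs.logits.argmax(-1).item()
    print(model.config.id2label[predicted_class])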

Perceiver specific outputs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverModelOutput
    :members:

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverDecoderOutput
    :members:

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMaskedLMOutput
    :members:

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverClassifierOutput
    :members:


PerceiverConfig
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PerceiverConfig
    :members:


PerceiverTokenizer
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PerceiverTokenizer
    :members: build_inputs_with_special_tokens, get_special_tokens_mask,
        create_token_type_ids_from_sequences, save_vocabulary


PerceiverFeatureExtractor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PerceiverFeatureExtractor
    :members:


PerceiverTextPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverTextPreprocessor
    :members:


PerceiverImagePreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverImagePreprocessor
    :members:


PerceiverOneHotPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverOneHotPreprocessor
    :members:


PerceiverAudioPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverAudioPreprocessor
    :members:


PerceiverMultimodalPreprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPreprocessor
    :members:


PerceiverProjectionPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverProjectionPostprocessor
    :members:


PerceiverAudioPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverAudioPostprocessor
    :members:


PerceiverClassificationPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverClassificationPostprocessor
    :members:


PerceiverMultimodalPostprocessor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.models.perceiver.modeling_perceiver.PerceiverMultimodalPostprocessor
    :members:


PerceiverModel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PerceiverModel
    :members: forward


PerceiverForMaskedLM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PerceiverForMaskedLM
    :members: forward


PerceiverForSequenceClassification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PerceiverForSequenceClassification
    :members: forward


PerceiverForImageClassificationLearned
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PerceiverForImageClassificationLearned
    :members: forward


PerceiverForImageClassificationFourier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PerceiverForImageClassificationFourier
    :members: forward


PerceiverForImageClassificationConvProcessing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PerceiverForImageClassificationConvProcessing
    :members: forward


PerceiverForOpticalFlow
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PerceiverForOpticalFlow
    :members: forward


PerceiverForMultimodalAutoencoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: transformers.PerceiverForMultimodalAutoencoding
    :members: forward