Add support for auto_docstring with model outputs (#38242)

* experiment auto_docstring model outputs

* Fix PatchTSMixer

* Add check model output docstring to check_auto_docstring and fix all model outputs docstring

* add reordering of docstring in check_docstrings

* add check for redundant docstring in check_docstrings, remove redundant docstrings

* refactor check_auto_docstring

* make style

* fix copies

* remove commented code

* change List-> list Tuple-> tuple in docstrings

* fix modular

* make style

* Fix modular vipllava

---------

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
Authored by Yoni Gozlan on 2025-06-23 10:39:41 -04:00 (committed by GitHub)
parent 0c98f24889
commit b6b4d43d6d
177 changed files with 6799 additions and 8820 deletions
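
Every file below follows the same pattern: the hand-written `Args:` block of a `ModelOutput` subclass is replaced by the `@auto_docstring` decorator (optionally with a `custom_intro`), and only the model-specific arguments keep an explicit entry in a raw `r"""` docstring; the boilerplate entries for standard outputs such as `hidden_states`, `attentions` and `past_key_values` are filled in by the decorator. A minimal sketch of the resulting shape, assuming `auto_docstring` and `ModelOutput` are importable from `transformers.utils` as in the modeling files below; `MyModel`, `MyModelOutput` and its fields are placeholders, not part of this commit:

from dataclasses import dataclass
from typing import Optional

import torch

from transformers.utils import ModelOutput, auto_docstring


@dataclass
@auto_docstring(
    custom_intro="""
    Output type of a hypothetical `MyModel`.
    """
)
class MyModelOutput(ModelOutput):
    r"""
    loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
        Model-specific field that still needs a hand-written description.
    """

    # Standard fields (hidden_states, attentions, ...) no longer need manual
    # docstring entries; auto_docstring supplies the shared descriptions.
    loss: Optional[torch.FloatTensor] = None
    logits: Optional[torch.FloatTensor] = None
    hidden_states: Optional[tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[tuple[torch.FloatTensor, ...]] = None

The diffs that follow apply this transformation to each model's output classes.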


@@ -570,30 +570,21 @@ class AlbertPreTrainedModel(PreTrainedModel):
@dataclass
@auto_docstring(
    custom_intro="""
    Output type of [`AlbertForPreTraining`].
    """
)
class AlbertForPreTrainingOutput(ModelOutput):
    r"""
    loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
        Total loss as the sum of the masked language modeling loss and the next sequence prediction
        (classification) loss.
    prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
        Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
    sop_logits (`torch.FloatTensor` of shape `(batch_size, 2)`):
        Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
        before SoftMax).
    """

    loss: Optional[torch.FloatTensor] = None


@@ -40,20 +40,15 @@ logger = logging.get_logger(__name__)
@dataclass
@auto_docstring(
    custom_intro="""
    Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
    """
)
class AlignVisionModelOutput(ModelOutput):
    r"""
    image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
        The image embeddings obtained by applying the projection layer to the pooler_output.
    """

    image_embeds: Optional[torch.FloatTensor] = None

@@ -62,26 +57,15 @@ class AlignVisionModelOutput(ModelOutput):
@dataclass
@auto_docstring(
    custom_intro="""
    Base class for text model's outputs that also contains a pooling of the last hidden states.
    """
)
class AlignTextModelOutput(ModelOutput):
    r"""
    text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
        The text embeddings obtained by applying the projection layer to the pooler_output.
    """

    text_embeds: Optional[torch.FloatTensor] = None

@@ -91,25 +75,25 @@ class AlignTextModelOutput(ModelOutput):
@dataclass
@auto_docstring
class AlignOutput(ModelOutput):
    r"""
    loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
        Contrastive loss for image-text similarity.
    logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
        The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
        similarity scores.
    logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
        The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
        similarity scores.
    text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
        The text embeddings obtained by applying the projection layer to the pooled output of [`AlignTextModel`].
    image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
        The output of [`AlignVisionModel`].
    text_model_output (`BaseModelOutputWithPoolingAndCrossAttentions`):
        The output of the [`AlignTextModel`].
    vision_model_output (`BaseModelOutputWithPoolingAndNoAttention`):
        The output of the [`AlignVisionModel`].
    """

    loss: Optional[torch.FloatTensor] = None


@@ -53,26 +53,26 @@ def clip_loss(similarity: torch.Tensor) -> torch.Tensor:
@dataclass
@auto_docstring
# Copied from transformers.models.clip.modeling_clip.CLIPOutput with CLIP->AltCLIP
class AltCLIPOutput(ModelOutput):
    r"""
    loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
        Contrastive loss for image-text similarity.
    logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
        The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
        similarity scores.
    logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
        The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
        similarity scores.
    text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
        The text embeddings obtained by applying the projection layer to the pooled output of [`AltCLIPTextModel`].
    image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
        The image embeddings obtained by applying the projection layer to the pooled output of [`AltCLIPVisionModel`].
    text_model_output (`BaseModelOutputWithPooling`):
        The output of the [`AltCLIPTextModel`].
    vision_model_output (`BaseModelOutputWithPooling`):
        The output of the [`AltCLIPVisionModel`].
    """

    loss: Optional[torch.FloatTensor] = None


@@ -963,35 +963,26 @@ class AriaTextForCausalLM(AriaTextPreTrainedModel, GenerationMixin):
@dataclass
@auto_docstring(
    custom_intro="""
    Base class for Aria causal language model (or autoregressive) outputs.
    """
)
class AriaCausalLMOutputWithPast(ModelOutput):
    r"""
    loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
        Language modeling loss (for next-token prediction).
    logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
        Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
    past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
        Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
        `(batch_size, num_heads, sequence_length, embed_size_per_head)`)

        Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
        `past_key_values` input) to speed up sequential decoding.
    image_hidden_states (`torch.FloatTensor`, *optional*):
        A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
        image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
    """

    loss: Optional[torch.FloatTensor] = None

@@ -1003,33 +994,22 @@ class AriaCausalLMOutputWithPast(ModelOutput):
@dataclass
@auto_docstring(
    custom_intro="""
    Base class for Aria outputs, with hidden states and attentions.
    """
)
class AriaModelOutputWithPast(BaseModelOutputWithPast):
    r"""
    past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
        Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
        `(batch_size, num_heads, sequence_length, embed_size_per_head)`)

        Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
        `past_key_values` input) to speed up sequential decoding.
    image_hidden_states (`torch.FloatTensor`, *optional*):
        A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
        image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
    """

    image_hidden_states: Optional[torch.FloatTensor] = None


@@ -46,44 +46,35 @@ logger = logging.get_logger(__name__)
@dataclass
@auto_docstring(
    custom_intro="""
    Base class for model's outputs that may also contain a past key/values (to speed up sequential decoding).
    """
)
class AutoFormerDecoderOutput(ModelOutput):
    r"""
    last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
        Sequence of hidden-states at the output of the last layer of the model.

        If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1,
        hidden_size)` is output.
    trend (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
        Trend tensor for each time series.
    past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
        Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
        `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
        `config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
        encoder_sequence_length, embed_size_per_head)`.

        Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
        `config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
        input) to speed up sequential decoding.
    cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
        Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
        sequence_length)`.

        Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the
        weighted average in the cross-attention heads.
    """

    last_hidden_state: Optional[torch.FloatTensor] = None

@@ -95,63 +86,35 @@ class AutoFormerDecoderOutput(ModelOutput):
@dataclass
@auto_docstring(
    custom_intro="""
    Autoformer model output that contains the additional trend output.
    """
)
class AutoformerModelOutput(ModelOutput):
    r"""
    last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
        Sequence of hidden-states at the output of the last layer of the decoder of the model.

        If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1,
        hidden_size)` is output.
    trend (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
        Trend tensor for each time series.
    past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
        Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
        `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
        `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

        Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
        blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
    loc (`torch.FloatTensor` of shape `(batch_size,)` or `(batch_size, input_size)`, *optional*):
        Shift values of each time series' context window which is used to give the model inputs of the same
        magnitude and then used to shift back to the original magnitude.
    scale (`torch.FloatTensor` of shape `(batch_size,)` or `(batch_size, input_size)`, *optional*):
        Scaling values of each time series' context window which is used to give the model inputs of the same
        magnitude and then used to rescale back to the original magnitude.
    static_features: (`torch.FloatTensor` of shape `(batch_size, feature size)`, *optional*):
        Static features of each time series' in a batch which are copied to the covariates at inference time.
    """

    last_hidden_state: Optional[torch.FloatTensor] = None

@@ -1795,6 +1758,14 @@ class AutoformerForPrediction(AutoformerPreTrainedModel):
            Transformer requires to provide additional features.

            The Autoformer only learns additional embeddings for `static_categorical_features`.
        future_observed_mask (`torch.BoolTensor` of shape `(batch_size, sequence_length)` or `(batch_size, sequence_length, input_size)`, *optional*):
            Boolean mask to indicate which `future_values` were observed and which were missing. Mask values selected
            in `[0, 1]`:

            - 1 for values that are **observed**,
            - 0 for values that are **missing** (i.e. NaNs that were replaced by zeros).

            This mask is used to filter out missing values for the final loss calculation.
        cross_attn_head_mask (`torch.Tensor` of shape `(decoder_layers, decoder_attention_heads)`, *optional*):
            Mask to nullify selected heads of the cross-attention modules. Mask values selected in `[0, 1]`:

@@ -1804,14 +1775,6 @@ class AutoformerForPrediction(AutoformerPreTrainedModel):
            Tuple consists of `last_hidden_state`, `hidden_states` (*optional*) and `attentions` (*optional*)
            `last_hidden_state` of shape `(batch_size, sequence_length, hidden_size)` (*optional*) is a sequence of
            hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.

        Examples:


@@ -117,35 +117,26 @@ class AyaVisionPreTrainedModel(PreTrainedModel):
@dataclass
@auto_docstring(
    custom_intro="""
    Base class for AyaVision causal language model (or autoregressive) outputs.
    """
)
class AyaVisionCausalLMOutputWithPast(ModelOutput):
    r"""
    loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
        Language modeling loss (for next-token prediction).
    logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
        Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
    past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
        Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
        `(batch_size, num_heads, sequence_length, embed_size_per_head)`)

        Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
        `past_key_values` input) to speed up sequential decoding.
    image_hidden_states (`torch.FloatTensor`, *optional*):
        A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
        image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
    """

    loss: Optional[torch.FloatTensor] = None

@@ -157,33 +148,22 @@ class AyaVisionCausalLMOutputWithPast(ModelOutput):
@dataclass
@auto_docstring(
    custom_intro="""
    Base class for AyaVision outputs, with hidden states and attentions.
    """
)
class AyaVisionModelOutputWithPast(BaseModelOutputWithPast):
    r"""
    past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
        Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
        `(batch_size, num_heads, sequence_length, embed_size_per_head)`)

        Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
        `past_key_values` input) to speed up sequential decoding.
    image_hidden_states (`torch.FloatTensor`, *optional*):
        A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
        image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
    """

    image_hidden_states: Optional[torch.FloatTensor] = None


@@ -44,39 +44,19 @@ from .configuration_beit import BeitConfig
logger = logging.get_logger(__name__)

@dataclass
@auto_docstring(
    custom_intro="""
    Class for outputs of [`BeitModel`].
    """
)
class BeitModelOutputWithPooling(BaseModelOutputWithPooling):
    r"""
    pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
        Average of the last layer hidden states of the patch tokens (excluding the *[CLS]* token) if
        *config.use_mean_pooling* is set to True. If set to False, then the final hidden state of the *[CLS]* token
        will be returned.
    """


@@ -805,30 +805,21 @@ class BertPreTrainedModel(PreTrainedModel):
@dataclass
@auto_docstring(
    custom_intro="""
    Output type of [`BertForPreTraining`].
    """
)
class BertForPreTrainingOutput(ModelOutput):
    r"""
    loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
        Total loss as the sum of the masked language modeling loss and the next sequence prediction
        (classification) loss.
    prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
        Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
    seq_relationship_logits (`torch.FloatTensor` of shape `(batch_size, 2)`):
        Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
        before SoftMax).
    """

    loss: Optional[torch.FloatTensor] = None


@@ -1744,30 +1744,21 @@ class BigBirdPreTrainedModel(PreTrainedModel):
@dataclass
@auto_docstring(
    custom_intro="""
    Output type of [`BigBirdForPreTraining`].
    """
)
class BigBirdForPreTrainingOutput(ModelOutput):
    r"""
    loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
        Total loss as the sum of the masked language modeling loss and the next sequence prediction
        (classification) loss.
    prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
        Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
    seq_relationship_logits (`torch.FloatTensor` of shape `(batch_size, 2)`):
        Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
        before SoftMax).
    """

    loss: Optional[torch.FloatTensor] = None

@@ -1778,30 +1769,17 @@ class BigBirdForPreTrainingOutput(ModelOutput):
@dataclass
@auto_docstring(
    custom_intro="""
    Base class for outputs of question answering models.
    """
)
class BigBirdForQuestionAnsweringModelOutput(ModelOutput):
    r"""
    loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
        Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
    pooler_output (`torch.FloatTensor` of shape `(batch_size, 1)`):
        pooler output from BigBigModel
    """

    loss: Optional[torch.FloatTensor] = None


@ -49,31 +49,31 @@ def blip_loss(similarity: torch.Tensor) -> torch.Tensor:
@dataclass @dataclass
class BlipForConditionalGenerationModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Adapted from the base class for vision model's outputs that also contains image embeddings of the pooling of the Adapted from the base class for vision model's outputs that also contains image embeddings of the pooling of the
last hidden states. This class also adds the loss term from the text decoder. last hidden states. This class also adds the loss term from the text decoder.
"""
)
class BlipForConditionalGenerationModelOutput(ModelOutput):
r"""
loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
Language modeling loss from the text decoder.
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`, *optional*):
Prediction scores of the language modeling head of the text decoder model.
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*):
The image embeddings obtained after applying the Vision Transformer model to the input image.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Args: Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed):
Language modeling loss from the text decoder. Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`, *optional*): sequence_length)`.
Prediction scores of the language modeling head of the text decoder model.
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*):
The image embeddings obtained after applying the Vision Transformer model to the input image.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed): heads.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[tuple[torch.FloatTensor]] = None loss: Optional[tuple[torch.FloatTensor]] = None
@ -94,29 +94,18 @@ class BlipForConditionalGenerationModelOutput(ModelOutput):
@dataclass @dataclass
class BlipTextVisionModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Adapted from the base class for vision model's outputs that also contains image embeddings of the pooling of the Adapted from the base class for vision model's outputs that also contains image embeddings of the pooling of the
last hidden states. This class also adds the loss term from the text decoder. last hidden states. This class also adds the loss term from the text decoder.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): class BlipTextVisionModelOutput(ModelOutput):
Language modeling loss from the text decoder. r"""
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
The image embeddings obtained by applying the projection layer to the pooler_output. Language modeling loss from the text decoder.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
Sequence of hidden-states at the output of the last layer of the model. The image embeddings obtained by applying the projection layer to the pooler_output.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -127,36 +116,25 @@ class BlipTextVisionModelOutput(ModelOutput):
@dataclass @dataclass
class BlipImageTextMatchingModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Adapted from the base class for vision model's outputs that also contains image embeddings of the pooling of the Adapted from the base class for vision model's outputs that also contains image embeddings of the pooling of the
last hidden states. This class also adds the loss term from the text decoder as well as the image-text similarity last hidden states. This class also adds the loss term from the text decoder as well as the image-text similarity
scores. scores.
"""
Args: )
itm_score (`torch.FloatTensor`): class BlipImageTextMatchingModelOutput(ModelOutput):
The image-text similarity scores. r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): itm_score (`torch.FloatTensor`):
Language modeling loss from the text decoder. The image-text similarity scores.
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
The image embeddings obtained by applying the projection layer to the pooler_output. Language modeling loss from the text decoder.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
Sequence of hidden-states at the output of the last layer of the model. The image embeddings obtained by applying the projection layer to the pooler_output.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): vision_pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + Last layer hidden-state of the vision of the vision-only branch of the model.
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. question_embeds (`torch.FloatTensor`):
The question embeddings obtained by the text projection layer.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
vision_pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`, *optional*):
Last layer hidden-state of the vision of the vision-only branch of the model.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
question_embeds (`torch.FloatTensor`):
The question embeddings obtained by the text projection layer.
""" """
itm_score: Optional[torch.FloatTensor] = None itm_score: Optional[torch.FloatTensor] = None
@ -170,25 +148,25 @@ class BlipImageTextMatchingModelOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
class BlipOutput(ModelOutput): class BlipOutput(ModelOutput):
""" r"""
Args: loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): Contrastive loss for image-text similarity.
Contrastive loss for image-text similarity. logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
logits_per_image:(`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`): The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text similarity scores.
similarity scores. logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
logits_per_text:(`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`): The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image similarity scores.
similarity scores. text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
text_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`): The text embeddings obtained by applying the projection layer to the pooled output of [`BlipTextModel`].
The text embeddings obtained by applying the projection layer to the pooled output of [`BlipTextModel`]. image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`): The image embeddings obtained by applying the projection layer to the pooled output of [`BlipVisionModel`].
The image embeddings obtained by applying the projection layer to the pooled output of [`BlipVisionModel`]. text_model_output (`BaseModelOutputWithPooling`):
text_model_output(`BaseModelOutputWithPooling`): The output of the [`BlipTextModel`].
The output of the [`BlipTextModel`]. vision_model_output (`BaseModelOutputWithPooling`):
vision_model_output(`BaseModelOutputWithPooling`): The output of the [`BlipVisionModel`].
The output of the [`BlipVisionModel`].
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
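
The pattern applied throughout these hunks can be sketched as follows. This is an illustrative sketch only, not code from this commit: the class and field names are hypothetical, and it assumes the internal `ModelOutput` / `auto_docstring` helpers behave the same when imported from `transformers.utils` outside the library (inside the repository the imports are relative, e.g. `from ...utils import ModelOutput, auto_docstring`). Standard fields such as `hidden_states` and `attentions` no longer carry hand-written descriptions; only model-specific fields keep an entry in the `r"""` block, and `custom_intro` replaces the old class-level intro.

from dataclasses import dataclass
from typing import Optional

import torch

from transformers.utils import ModelOutput, auto_docstring


@dataclass
@auto_docstring(
    custom_intro="""
    Output type of a hypothetical `MyModelForPreTraining` (illustrative only).
    """
)
class MyModelForPreTrainingOutput(ModelOutput):
    r"""
    loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
        Total pretraining loss.
    prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
        Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
    """

    loss: Optional[torch.FloatTensor] = None
    prediction_logits: Optional[torch.FloatTensor] = None
    # `hidden_states` and `attentions` are standard arguments: their descriptions are
    # generated by @auto_docstring, so they are not repeated in the class docstring above.
    hidden_states: Optional[tuple[torch.FloatTensor]] = None
    attentions: Optional[tuple[torch.FloatTensor]] = None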

View File

@ -45,21 +45,23 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class Blip2ForConditionalGenerationModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Class defining the outputs of [`Blip2ForConditionalGeneration`]. Class defining the outputs of [`Blip2ForConditionalGeneration`].
"""
Args: )
loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): class Blip2ForConditionalGenerationModelOutput(ModelOutput):
Language modeling loss from the language model. r"""
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
Prediction scores of the language modeling head of the language model. Language modeling loss from the language model.
vision_outputs (`BaseModelOutputWithPooling`): logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Outputs of the vision encoder. Prediction scores of the language modeling head of the language model.
qformer_outputs (`BaseModelOutputWithPoolingAndCrossAttentions`): vision_outputs (`BaseModelOutputWithPooling`):
Outputs of the Q-Former (Querying Transformer). Outputs of the vision encoder.
language_model_outputs (`CausalLMOutputWithPast` or `Seq2SeqLMOutput`): qformer_outputs (`BaseModelOutputWithPoolingAndCrossAttentions`):
Outputs of the language model. Outputs of the Q-Former (Querying Transformer).
language_model_outputs (`CausalLMOutputWithPast` or `Seq2SeqLMOutput`):
Outputs of the language model.
""" """
loss: Optional[tuple[torch.FloatTensor]] = None loss: Optional[tuple[torch.FloatTensor]] = None
@ -78,25 +80,25 @@ class Blip2ForConditionalGenerationModelOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
class Blip2ImageTextMatchingModelOutput(ModelOutput): class Blip2ImageTextMatchingModelOutput(ModelOutput):
""" r"""
Args: loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): Contrastive loss for image-text similarity.
Contrastive loss for image-text similarity. logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`): The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text similarity scores.
similarity scores. logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`): The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image similarity scores.
similarity scores. text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): The text embeddings obtained by applying the projection layer to the pooled output.
The text embeddings obtained by applying the projection layer to the pooled output. image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): The image embeddings obtained by applying the projection layer to the pooled output.
The image embeddings obtained by applying the projection layer to the pooled output. text_model_output (`BaseModelOutputWithPooling`):
text_model_output (`BaseModelOutputWithPooling`): The output of the [`Blip2QFormerModel`].
The output of the [`Blip2QFormerModel`]. vision_model_output (`BaseModelOutputWithPooling`):
vision_model_output (`BaseModelOutputWithPooling`): The output of the [`Blip2VisionModel`].
The output of the [`Blip2VisionModel`].
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -115,27 +117,16 @@ class Blip2ImageTextMatchingModelOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Base class for text model's outputs that also contains a pooling of the last hidden states.
"""
)
# Copied from transformers.models.clip.modeling_clip.CLIPTextModelOutput with CLIP->Blip2 # Copied from transformers.models.clip.modeling_clip.CLIPTextModelOutput with CLIP->Blip2
class Blip2TextModelOutput(ModelOutput): class Blip2TextModelOutput(ModelOutput):
""" r"""
Base class for text model's outputs that also contains a pooling of the last hidden states. text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
The text embeddings obtained by applying the projection layer to the pooler_output.
Args:
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
The text embeddings obtained by applying the projection layer to the pooler_output.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
text_embeds: Optional[torch.FloatTensor] = None text_embeds: Optional[torch.FloatTensor] = None
@ -145,27 +136,16 @@ class Blip2TextModelOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
"""
)
# Copied from transformers.models.clip.modeling_clip.CLIPVisionModelOutput with CLIP->Blip2 # Copied from transformers.models.clip.modeling_clip.CLIPVisionModelOutput with CLIP->Blip2
class Blip2VisionModelOutput(ModelOutput): class Blip2VisionModelOutput(ModelOutput):
""" r"""
Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states. image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
The image embeddings obtained by applying the projection layer to the pooler_output.
Args:
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
The image embeddings obtained by applying the projection layer to the pooler_output.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
image_embeds: Optional[torch.FloatTensor] = None image_embeds: Optional[torch.FloatTensor] = None

View File

@ -45,28 +45,20 @@ _TOKENIZER_FOR_DOC = "RobertaTokenizer"
@dataclass @dataclass
class BridgeTowerModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`BridgeTowerModel`]. Output type of [`BridgeTowerModel`].
"""
Args: )
text_features (`torch.FloatTensor` of shape `(batch_size, text_sequence_length, hidden_size)`): class BridgeTowerModelOutput(ModelOutput):
Sequence of hidden-states at the text output of the last layer of the model. r"""
image_features (`torch.FloatTensor` of shape `(batch_size, image_sequence_length, hidden_size)`): text_features (`torch.FloatTensor` of shape `(batch_size, text_sequence_length, hidden_size)`):
Sequence of hidden-states at the image output of the last layer of the model. Sequence of hidden-states at the text output of the last layer of the model.
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size x 2)`): image_features (`torch.FloatTensor` of shape `(batch_size, image_sequence_length, hidden_size)`):
Concatenation of last layer hidden-state of the first token of the text and image sequence (classification Sequence of hidden-states at the image output of the last layer of the model.
token), respectively, after further processing through layers used for auxiliary pretraining tasks. pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size x 2)`):
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Concatenation of last layer hidden-state of the first token of the text and image sequence (classification
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + token), respectively, after further processing through layers used for auxiliary pretraining tasks.
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of
the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
text_features: Optional[torch.FloatTensor] = None text_features: Optional[torch.FloatTensor] = None
@ -77,28 +69,26 @@ class BridgeTowerModelOutput(ModelOutput):
@dataclass @dataclass
class BridgeTowerContrastiveOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of ['BridgeTowerForContrastiveLearning'] Output type of ['BridgeTowerForContrastiveLearning']
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`: class BridgeTowerContrastiveOutput(ModelOutput):
Image-text contrastive loss. r"""
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). Image-text contrastive loss.
text_embeds (`torch.FloatTensor)`, *optional*, returned when model is initialized with `with_projection=True`): logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
The text embeddings obtained by applying the projection layer to the pooler_output. Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
image_embeds (`torch.FloatTensor)`, *optional*, returned when model is initialized with `with_projection=True`): text_embeds (`torch.FloatTensor)`, *optional*, returned when model is initialized with `with_projection=True`):
The image embeddings obtained by applying the projection layer to the pooler_output. The text embeddings obtained by applying the projection layer to the pooler_output.
cross_embeds (`torch.FloatTensor)`, *optional*, returned when model is initialized with `with_projection=True`): image_embeds (`torch.FloatTensor)`, *optional*, returned when model is initialized with `with_projection=True`):
The text-image cross-modal embeddings obtained by applying the projection layer to the pooler_output. The image embeddings obtained by applying the projection layer to the pooler_output.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): cross_embeds (`torch.FloatTensor)`, *optional*, returned when model is initialized with `with_projection=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + The text-image cross-modal embeddings obtained by applying the projection layer to the pooler_output.
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
the model at the output of each layer plus the optional initial embedding outputs. Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): sequence_length)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -40,28 +40,19 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class BrosSpadeOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of token classification models. Base class for outputs of token classification models.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) : class BrosSpadeOutput(ModelOutput):
Classification loss. r"""
initial_token_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Classification scores for entity initial tokens (before SoftMax). Classification loss.
subsequent_token_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, sequence_length+1)`): initial_token_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`):
Classification scores for entity sequence tokens (before SoftMax). Classification scores for entity initial tokens (before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): subsequent_token_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, sequence_length+1)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + Classification scores for entity sequence tokens (before SoftMax).
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -49,32 +49,34 @@ _PRIMES = [31, 43, 59, 61, 73, 97, 103, 113, 137, 149, 157, 173, 181, 193, 211,
@dataclass @dataclass
class CanineModelOutputWithPooling(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`CanineModel`]. Based on [`~modeling_outputs.BaseModelOutputWithPooling`], but with slightly Output type of [`CanineModel`]. Based on [`~modeling_outputs.BaseModelOutputWithPooling`], but with slightly
different `hidden_states` and `attentions`, as these also include the hidden states and attentions of the shallow different `hidden_states` and `attentions`, as these also include the hidden states and attentions of the shallow
Transformer encoders. Transformer encoders.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class CanineModelOutputWithPooling(ModelOutput):
Sequence of hidden-states at the output of the last layer of the model (i.e. the output of the final r"""
shallow Transformer encoder). last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): Sequence of hidden-states at the output of the last layer of the model (i.e. the output of the final
Hidden-state of the first token of the sequence (classification token) at the last layer of the deep shallow Transformer encoder).
Transformer encoder, further processed by a Linear layer and a Tanh activation function. The Linear layer pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
weights are trained from the next sentence prediction (classification) objective during pretraining. Hidden-state of the first token of the sequence (classification token) at the last layer of the deep
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Transformer encoder, further processed by a Linear layer and a Tanh activation function. The Linear layer
Tuple of `torch.FloatTensor` (one for the input to each encoder + one for the output of each layer of each weights are trained from the next sentence prediction (classification) objective during pretraining.
encoder) of shape `(batch_size, sequence_length, hidden_size)` and `(batch_size, sequence_length // hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
config.downsampling_rate, hidden_size)`. Hidden-states of the model at the output of each layer plus the Tuple of `torch.FloatTensor` (one for the input to each encoder + one for the output of each layer of each
initial input to each Transformer encoder. The hidden states of the shallow encoders have length encoder) of shape `(batch_size, sequence_length, hidden_size)` and `(batch_size, sequence_length //
`sequence_length`, but the hidden states of the deep encoder have length `sequence_length` // config.downsampling_rate, hidden_size)`. Hidden-states of the model at the output of each layer plus the
`config.downsampling_rate`. initial input to each Transformer encoder. The hidden states of the shallow encoders have length
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): `sequence_length`, but the hidden states of the deep encoder have length `sequence_length` //
Tuple of `torch.FloatTensor` (one for each layer) of the 3 Transformer encoders of shape `(batch_size, `config.downsampling_rate`.
num_heads, sequence_length, sequence_length)` and `(batch_size, num_heads, sequence_length // attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
config.downsampling_rate, sequence_length // config.downsampling_rate)`. Attentions weights after the Tuple of `torch.FloatTensor` (one for each layer) of the 3 Transformer encoders of shape `(batch_size,
attention softmax, used to compute the weighted average in the self-attention heads. num_heads, sequence_length, sequence_length)` and `(batch_size, num_heads, sequence_length //
config.downsampling_rate, sequence_length // config.downsampling_rate)`. Attentions weights after the
attention softmax, used to compute the weighted average in the self-attention heads.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None

View File

@ -52,27 +52,27 @@ def chinese_clip_loss(similarity: torch.Tensor) -> torch.Tensor:
@dataclass @dataclass
@auto_docstring
class ChineseCLIPOutput(ModelOutput): class ChineseCLIPOutput(ModelOutput):
""" r"""
Args: loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): Contrastive loss for image-text similarity.
Contrastive loss for image-text similarity. logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
logits_per_image:(`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`): The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text similarity scores.
similarity scores. logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
logits_per_text:(`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`): The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image similarity scores.
similarity scores. text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
text_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`): The text embeddings obtained by applying the projection layer to the pooled output of
The text embeddings obtained by applying the projection layer to the pooled output of [`ChineseCLIPTextModel`].
[`ChineseCLIPTextModel`]. image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`): The image embeddings obtained by applying the projection layer to the pooled output of
The image embeddings obtained by applying the projection layer to the pooled output of [`ChineseCLIPVisionModel`].
[`ChineseCLIPVisionModel`]. text_model_output (`BaseModelOutputWithPoolingAndCrossAttentions`):
text_model_output(`BaseModelOutputWithPoolingAndCrossAttentions`): The output of the [`ChineseCLIPTextModel`].
The output of the [`ChineseCLIPTextModel`]. vision_model_output (`BaseModelOutputWithPoolingAndCrossAttentions`):
vision_model_output(`BaseModelOutputWithPoolingAndCrossAttentions`): The output of the [`ChineseCLIPVisionModel`].
The output of the [`ChineseCLIPVisionModel`].
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -122,27 +122,16 @@ def contrastive_loss(logits: torch.Tensor) -> torch.Tensor:
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Base class for text model's outputs that also contains a pooling of the last hidden states.
"""
)
# Copied from transformers.models.clip.modeling_clip.CLIPTextModelOutput with CLIP->Clap # Copied from transformers.models.clip.modeling_clip.CLIPTextModelOutput with CLIP->Clap
class ClapTextModelOutput(ModelOutput): class ClapTextModelOutput(ModelOutput):
""" r"""
Base class for text model's outputs that also contains a pooling of the last hidden states. text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
The text embeddings obtained by applying the projection layer to the pooler_output.
Args:
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
The text embeddings obtained by applying the projection layer to the pooler_output.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
text_embeds: Optional[torch.FloatTensor] = None text_embeds: Optional[torch.FloatTensor] = None
@ -152,26 +141,15 @@ class ClapTextModelOutput(ModelOutput):
@dataclass @dataclass
class ClapAudioModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
ClapAudio model output to mimic the output of the original implementation. ClapAudio model output to mimic the output of the original implementation.
"""
Args: )
audio_embeds (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): class ClapAudioModelOutput(ModelOutput):
The Audio embeddings obtained by applying the projection layer to the pooler_output. r"""
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): audio_embeds (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model. The Audio embeddings obtained by applying the projection layer to the pooler_output.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
""" """
audio_embeds: Optional[torch.FloatTensor] = None audio_embeds: Optional[torch.FloatTensor] = None
@ -181,26 +159,26 @@ class ClapAudioModelOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
# Copied from transformers.models.clip.modeling_clip.CLIPOutput with CLIP->Clap, vision->audio, Vision->Audio, image->audio # Copied from transformers.models.clip.modeling_clip.CLIPOutput with CLIP->Clap, vision->audio, Vision->Audio, image->audio
class ClapOutput(ModelOutput): class ClapOutput(ModelOutput):
""" r"""
Args: loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): Contrastive loss for audio-text similarity.
Contrastive loss for audio-text similarity. logits_per_audio (`torch.FloatTensor` of shape `(audio_batch_size, text_batch_size)`):
logits_per_audio (`torch.FloatTensor` of shape `(audio_batch_size, text_batch_size)`): The scaled dot product scores between `audio_embeds` and `text_embeds`. This represents the audio-text
The scaled dot product scores between `audio_embeds` and `text_embeds`. This represents the audio-text similarity scores.
similarity scores. logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, audio_batch_size)`):
logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, audio_batch_size)`): The scaled dot product scores between `text_embeds` and `audio_embeds`. This represents the text-audio
The scaled dot product scores between `text_embeds` and `audio_embeds`. This represents the text-audio similarity scores.
similarity scores. text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): The text embeddings obtained by applying the projection layer to the pooled output of [`ClapTextModel`].
The text embeddings obtained by applying the projection layer to the pooled output of [`ClapTextModel`]. audio_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
audio_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): The audio embeddings obtained by applying the projection layer to the pooled output of [`ClapAudioModel`].
The audio embeddings obtained by applying the projection layer to the pooled output of [`ClapAudioModel`]. text_model_output (`BaseModelOutputWithPooling`):
text_model_output (`BaseModelOutputWithPooling`): The output of the [`ClapTextModel`].
The output of the [`ClapTextModel`]. audio_model_output (`BaseModelOutputWithPooling`):
audio_model_output (`BaseModelOutputWithPooling`): The output of the [`ClapAudioModel`].
The output of the [`ClapAudioModel`].
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -1931,11 +1909,11 @@ class ClapModel(ClapPreTrainedModel):
input_features (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): input_features (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Input audio features. This should be returned by the [`ClapFeatureExtractor`] class that you can also Input audio features. This should be returned by the [`ClapFeatureExtractor`] class that you can also
retrieve from [`AutoFeatureExtractor`]. See [`ClapFeatureExtractor.__call__`] for details. retrieve from [`AutoFeatureExtractor`]. See [`ClapFeatureExtractor.__call__`] for details.
return_loss (`bool`, *optional*):
Whether or not to return the contrastive loss.
is_longer (`torch.FloatTensor`, of shape `(batch_size, 1)`, *optional*): is_longer (`torch.FloatTensor`, of shape `(batch_size, 1)`, *optional*):
Whether the audio clip is longer than `max_length`. If `True`, a feature fusion will be enabled to enhance Whether the audio clip is longer than `max_length`. If `True`, a feature fusion will be enabled to enhance
the features. the features.
return_loss (`bool`, *optional*):
Whether or not to return the contrastive loss.
Examples: Examples:
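
The effect of the conversion can be checked at runtime with any of the converted classes, for example `ClapOutput` above. This is a quick illustrative check, assuming a transformers install that already contains this refactor; the exact wording of the printed docstring depends on the auto_docstring templates.

import torch

from transformers.models.clap.modeling_clap import ClapOutput

# ModelOutput behaviour is unchanged by the docstring refactor: dataclass-style
# construction, attribute access and dict-style access all still work.
out = ClapOutput(
    logits_per_audio=torch.zeros(2, 3),
    logits_per_text=torch.zeros(3, 2),
)
print(out.logits_per_audio.shape)
print(out["logits_per_text"].shape)

# The descriptions of the standard fields (text_model_output, audio_model_output, ...)
# are now generated by @auto_docstring rather than written out in modeling_clap.py.
print(ClapOutput.__doc__)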

View File

@ -57,26 +57,15 @@ def _get_vector_norm(tensor: torch.Tensor) -> torch.Tensor:
@dataclass @dataclass
class CLIPVisionModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states. Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
"""
Args: )
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): class CLIPVisionModelOutput(ModelOutput):
The image embeddings obtained by applying the projection layer to the pooler_output. r"""
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
Sequence of hidden-states at the output of the last layer of the model. The image embeddings obtained by applying the projection layer to the pooler_output.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
image_embeds: Optional[torch.FloatTensor] = None image_embeds: Optional[torch.FloatTensor] = None
@ -86,26 +75,15 @@ class CLIPVisionModelOutput(ModelOutput):
@dataclass @dataclass
class CLIPTextModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for text model's outputs that also contains a pooling of the last hidden states. Base class for text model's outputs that also contains a pooling of the last hidden states.
"""
Args: )
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): class CLIPTextModelOutput(ModelOutput):
The text embeddings obtained by applying the projection layer to the pooler_output. r"""
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
Sequence of hidden-states at the output of the last layer of the model. The text embeddings obtained by applying the projection layer to the pooler_output.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
text_embeds: Optional[torch.FloatTensor] = None text_embeds: Optional[torch.FloatTensor] = None
@ -115,25 +93,25 @@ class CLIPTextModelOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
class CLIPOutput(ModelOutput): class CLIPOutput(ModelOutput):
""" r"""
Args: loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): Contrastive loss for image-text similarity.
Contrastive loss for image-text similarity. logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`): The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text similarity scores.
similarity scores. logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`): The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image similarity scores.
similarity scores. text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): The text embeddings obtained by applying the projection layer to the pooled output of [`CLIPTextModel`].
The text embeddings obtained by applying the projection layer to the pooled output of [`CLIPTextModel`]. image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): The image embeddings obtained by applying the projection layer to the pooled output of [`CLIPVisionModel`].
The image embeddings obtained by applying the projection layer to the pooled output of [`CLIPVisionModel`]. text_model_output (`BaseModelOutputWithPooling`):
text_model_output (`BaseModelOutputWithPooling`): The output of the [`CLIPTextModel`].
The output of the [`CLIPTextModel`]. vision_model_output (`BaseModelOutputWithPooling`):
vision_model_output (`BaseModelOutputWithPooling`): The output of the [`CLIPVisionModel`].
The output of the [`CLIPVisionModel`].
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -49,26 +49,26 @@ def clipseg_loss(similarity: torch.Tensor) -> torch.Tensor:
@dataclass @dataclass
@auto_docstring
# Copied from transformers.models.clip.modeling_clip.CLIPOutput with CLIP->CLIPSeg # Copied from transformers.models.clip.modeling_clip.CLIPOutput with CLIP->CLIPSeg
class CLIPSegOutput(ModelOutput): class CLIPSegOutput(ModelOutput):
""" r"""
Args: loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): Contrastive loss for image-text similarity.
Contrastive loss for image-text similarity. logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`): The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text similarity scores.
similarity scores. logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`): The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image similarity scores.
similarity scores. text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): The text embeddings obtained by applying the projection layer to the pooled output of [`CLIPSegTextModel`].
The text embeddings obtained by applying the projection layer to the pooled output of [`CLIPSegTextModel`]. image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): The image embeddings obtained by applying the projection layer to the pooled output of [`CLIPSegVisionModel`].
The image embeddings obtained by applying the projection layer to the pooled output of [`CLIPSegVisionModel`]. text_model_output (`BaseModelOutputWithPooling`):
text_model_output (`BaseModelOutputWithPooling`): The output of the [`CLIPSegTextModel`].
The output of the [`CLIPSegTextModel`]. vision_model_output (`BaseModelOutputWithPooling`):
vision_model_output (`BaseModelOutputWithPooling`): The output of the [`CLIPSegVisionModel`].
The output of the [`CLIPSegVisionModel`].
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -87,18 +87,11 @@ class CLIPSegOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
class CLIPSegDecoderOutput(ModelOutput): class CLIPSegDecoderOutput(ModelOutput):
""" r"""
Args: logits (`torch.FloatTensor` of shape `(batch_size, height, width)`):
logits (`torch.FloatTensor` of shape `(batch_size, height, width)`): Classification scores for each pixel.
Classification scores for each pixel.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.
""" """
logits: Optional[torch.FloatTensor] = None logits: Optional[torch.FloatTensor] = None
@ -107,14 +100,21 @@ class CLIPSegDecoderOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
class CLIPSegImageSegmentationOutput(ModelOutput): class CLIPSegImageSegmentationOutput(ModelOutput):
""" r"""
Args: loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): Binary cross entropy loss for segmentation.
Contrastive loss for image-text similarity. logits (`torch.FloatTensor` of shape `(batch_size, height, width)`):
... Classification scores for each pixel.
vision_model_output (`BaseModelOutputWithPooling`): conditional_embeddings (`torch.FloatTensor` of shape `(batch_size, projection_dim)`):
The output of the [`CLIPSegVisionModel`]. Conditional embeddings used for segmentation.
pooled_output (`torch.FloatTensor` of shape `(batch_size, embed_dim)`):
Pooled output of the [`CLIPSegVisionModel`].
vision_model_output (`BaseModelOutputWithPooling`):
The output of the [`CLIPSegVisionModel`].
decoder_output (`CLIPSegDecoderOutput`):
The output of the [`CLIPSegDecoder`].
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -1260,15 +1260,15 @@ class CLIPSegForImageSegmentation(CLIPSegPreTrainedModel):
return_dict: Optional[bool] = None, return_dict: Optional[bool] = None,
) -> Union[tuple, CLIPSegOutput]: ) -> Union[tuple, CLIPSegOutput]:
r""" r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
conditional_pixel_values (`torch.FloatTensor`, *optional*): conditional_pixel_values (`torch.FloatTensor`, *optional*):
The pixel values of the conditional images. The pixel values of the conditional images.
conditional_embeddings (`torch.FloatTensor` of shape `(batch_size, config.projection_dim)`, *optional*): conditional_embeddings (`torch.FloatTensor` of shape `(batch_size, config.projection_dim)`, *optional*):
The conditional embeddings for the query images. If provided, the model will use this instead of computing The conditional embeddings for the query images. If provided, the model will use this instead of computing
the embeddings from the conditional_pixel_values. the embeddings from the conditional_pixel_values.
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
Examples: Examples:

View File

@ -144,26 +144,20 @@ def _pad_extra_bos_eos_tokens(
@dataclass @dataclass
class ClvpEncoderOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for CLVP encoder's outputs that contains a pooling of the last hidden states as well as a projection Base class for CLVP encoder's outputs that contains a pooling of the last hidden states as well as a projection
output (a linear layer on top of the pooled output). output (a linear layer on top of the pooled output).
"""
Args: )
embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`): class ClvpEncoderOutput(ModelOutput):
The embeddings obtained by applying the projection layer to the pooler_output. r"""
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when model is initialized with `with_projection=True`):
The hidden state of the last layer of the model. The embeddings obtained by applying the projection layer to the pooler_output.
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Pooled output of the `last_hidden_state`. The hidden state of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + Pooled output of the `last_hidden_state`.
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of
the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.
""" """
embeds: Optional[torch.FloatTensor] = None embeds: Optional[torch.FloatTensor] = None
@ -174,35 +168,35 @@ class ClvpEncoderOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
class ClvpOutput(ModelOutput): class ClvpOutput(ModelOutput):
""" r"""
Args: loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): Contrastive loss for speech-text similarity.
Contrastive loss for speech-text similarity. speech_ids (`torch.LongTensor`, *optional*):
speech_ids (`torch.LongTensor`, *optional*): speech_ids (or speech candidates) generated by the `ClvpForCausalLM` model.
speech_ids (or speech candidates) generated by the `ClvpForCausalLM` model. logits_per_speech (`torch.FloatTensor` of shape `(speech_batch_size, text_batch_size)`):
logits_per_speech (`torch.FloatTensor` of shape `(speech_batch_size, text_batch_size)`): The scaled dot product scores between `speech_embeds` and `text_embeds`. This represents the speech-text
The scaled dot product scores between `speech_embeds` and `text_embeds`. This represents the speech-text similarity scores.
similarity scores. logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, speech_batch_size)`):
logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, speech_batch_size)`): The scaled dot product scores between `text_embeds` and `speech_embeds`. This represents the text-speech
The scaled dot product scores between `text_embeds` and `speech_embeds`. This represents the text-speech similarity scores.
similarity scores. text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): The text embeddings obtained by applying the projection layer to the pooled output of the text encoder
The text embeddings obtained by applying the projection layer to the pooled output of the text encoder model.
model. speech_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
speech_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): The speech embeddings obtained by applying the projection layer to the pooled output of the speech encoder
The speech embeddings obtained by applying the projection layer to the pooled output of the speech encoder model.
model. text_model_output (`BaseModelOutputWithPooling`):
text_model_output (`BaseModelOutputWithPooling`): The pooled output of the `last_hidden_state` of the text encoder Model.
The pooled output of the `last_hidden_state` of the text encoder Model. speech_model_output (`BaseModelOutputWithPooling`):
speech_model_output (`BaseModelOutputWithPooling`): The pooled output of the `last_hidden_state` of the speech encoder Model.
The pooled output of the `last_hidden_state` of the speech encoder Model. decoder_hidden_states (`torch.FloatTensor`, *optional*):
decoder_hidden_states (`torch.FloatTensor`, *optional*): The hidden states of the decoder model.
The hidden states of the decoder model. text_encoder_hidden_states (`torch.FloatTensor`, *optional*):
text_encoder_hidden_states (`torch.FloatTensor`, *optional*): The hidden states of the text encoder model.
The hidden states of the text encoder model. speech_encoder_hidden_states (`torch.FloatTensor`, *optional*):
speech_encoder_hidden_states (`torch.FloatTensor`, *optional*): The hidden states of the speech encoder model.
The hidden states of the speech encoder model.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -28,6 +28,7 @@ from ...utils import ModelOutput, auto_docstring, can_return_tuple
from .configuration_colpali import ColPaliConfig from .configuration_colpali import ColPaliConfig
@auto_docstring
class ColPaliPreTrainedModel(PreTrainedModel): class ColPaliPreTrainedModel(PreTrainedModel):
config_class = ColPaliConfig config_class = ColPaliConfig
base_model_prefix = "model" base_model_prefix = "model"
@ -51,35 +52,26 @@ class ColPaliPreTrainedModel(PreTrainedModel):
@dataclass @dataclass
class ColPaliForRetrievalOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for ColPali embeddings output. Base class for ColPali embeddings output.
"""
)
class ColPaliForRetrievalOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
embeddings (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
The embeddings of the model.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). image_hidden_states (`torch.FloatTensor`, *optional*):
embeddings (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
The embeddings of the model. image_hidden_states of the model produced by the vision encoder after projecting last hidden state.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder after projecting last hidden state.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -36,6 +36,7 @@ if is_torch_available():
import torch import torch
@auto_docstring
class ColQwen2PreTrainedModel(PreTrainedModel): class ColQwen2PreTrainedModel(PreTrainedModel):
config_class = ColQwen2Config config_class = ColQwen2Config
base_model_prefix = "model" base_model_prefix = "model"
@ -62,32 +63,23 @@ class ColQwen2PreTrainedModel(PreTrainedModel):
@dataclass @dataclass
class ColQwen2ForRetrievalOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for ColQwen2 embeddings output. Base class for ColQwen2 embeddings output.
"""
)
class ColQwen2ForRetrievalOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
embeddings (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
The embeddings of the model.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction).
embeddings (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
The embeddings of the model.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -231,32 +231,23 @@ class ColQwen2PreTrainedModel(ColPaliPreTrainedModel):
@dataclass @dataclass
class ColQwen2ForRetrievalOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for ColQwen2 embeddings output. Base class for ColQwen2 embeddings output.
"""
)
class ColQwen2ForRetrievalOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
embeddings (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
The embeddings of the model.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction).
embeddings (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
The embeddings of the model.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
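The `embeddings` field documented above holds per-token (multi-vector) embeddings of shape `(batch_size, sequence_length, hidden_size)`. As a rough sketch of what retrieval code typically does with them — ColBERT-style MaxSim scoring between query and document embeddings — the snippet below uses toy tensors; the library's own scoring helpers may additionally handle padding masks and normalization, so treat this as an illustration only.

```python
import torch


def maxsim_scores(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) scores between queries and documents.

    query_embeddings: (num_queries, query_len, hidden_size)
    doc_embeddings:   (num_docs, doc_len, hidden_size)
    Returns a (num_queries, num_docs) score matrix.
    """
    # Token-level similarities: (num_queries, num_docs, query_len, doc_len)
    sims = torch.einsum("qnh,dmh->qdnm", query_embeddings, doc_embeddings)
    # For each query token, keep its best-matching document token, then sum over query tokens.
    return sims.max(dim=-1).values.sum(dim=-1)


# Toy shapes only; in practice the tensors come from ColQwen2ForRetrievalOutput.embeddings.
q = torch.randn(2, 5, 128)
d = torch.randn(3, 7, 128)
print(maxsim_scores(q, d).shape)  # torch.Size([2, 3])
```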

View File

@ -39,33 +39,25 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class ConditionalDetrDecoderOutput(BaseModelOutputWithCrossAttentions): @auto_docstring(
""" custom_intro="""
Base class for outputs of the Conditional DETR decoder. This class adds one attribute to Base class for outputs of the Conditional DETR decoder. This class adds one attribute to
BaseModelOutputWithCrossAttentions, namely an optional stack of intermediate decoder activations, i.e. the output BaseModelOutputWithCrossAttentions, namely an optional stack of intermediate decoder activations, i.e. the output
of each decoder layer, each of them gone through a layernorm. This is useful when training the model with auxiliary of each decoder layer, each of them gone through a layernorm. This is useful when training the model with auxiliary
decoding losses. decoding losses.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class ConditionalDetrDecoderOutput(BaseModelOutputWithCrossAttentions):
Sequence of hidden-states at the output of the last layer of the model. r"""
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
plus the initial embedding outputs. used to compute the weighted average in the cross-attention heads.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in layernorm.
the self-attention heads. reference_points (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, 2 (anchor points))`):
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`): Reference points (reference points of each layer of the decoder).
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
layernorm.
reference_points (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, 2 (anchor points))`):
Reference points (reference points of each layer of the decoder).
""" """
intermediate_hidden_states: Optional[torch.FloatTensor] = None intermediate_hidden_states: Optional[torch.FloatTensor] = None
@ -73,43 +65,23 @@ class ConditionalDetrDecoderOutput(BaseModelOutputWithCrossAttentions):
@dataclass @dataclass
class ConditionalDetrModelOutput(Seq2SeqModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of the Conditional DETR encoder-decoder model. This class adds one attribute to Base class for outputs of the Conditional DETR encoder-decoder model. This class adds one attribute to
Seq2SeqModelOutput, namely an optional stack of intermediate decoder activations, i.e. the output of each decoder Seq2SeqModelOutput, namely an optional stack of intermediate decoder activations, i.e. the output of each decoder
layer, each of them gone through a layernorm. This is useful when training the model with auxiliary decoding layer, each of them gone through a layernorm. This is useful when training the model with auxiliary decoding
losses. losses.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class ConditionalDetrModelOutput(Seq2SeqModelOutput):
Sequence of hidden-states at the output of the last layer of the decoder of the model. r"""
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Sequence of hidden-states at the output of the last layer of the decoder of the model.
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the decoder at the output of each intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, sequence_length, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
layer plus the initial embedding outputs. Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): layernorm.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, reference_points (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, 2 (anchor points))`):
sequence_length)`. Attentions weights of the decoder, after the attention softmax, used to compute the Reference points (reference points of each layer of the decoder).
weighted average in the self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each
layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the encoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, sequence_length, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
layernorm.
reference_points (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, 2 (anchor points))`):
Reference points (reference points of each layer of the decoder).
""" """
intermediate_hidden_states: Optional[torch.FloatTensor] = None intermediate_hidden_states: Optional[torch.FloatTensor] = None
@ -117,53 +89,33 @@ class ConditionalDetrModelOutput(Seq2SeqModelOutput):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Output type of [`ConditionalDetrForObjectDetection`].
"""
)
# Copied from transformers.models.detr.modeling_detr.DetrObjectDetectionOutput with Detr->ConditionalDetr # Copied from transformers.models.detr.modeling_detr.DetrObjectDetectionOutput with Detr->ConditionalDetr
class ConditionalDetrObjectDetectionOutput(ModelOutput): class ConditionalDetrObjectDetectionOutput(ModelOutput):
""" r"""
Output type of [`ConditionalDetrForObjectDetection`]. loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)):
Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a
Args: bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)): scale-invariant IoU loss.
Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a loss_dict (`Dict`, *optional*):
bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized A dictionary containing the individual losses. Useful for logging.
scale-invariant IoU loss. logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`):
loss_dict (`Dict`, *optional*): Classification logits (including no-object) for all queries.
A dictionary containing the individual losses. Useful for logging. pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`): Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
Classification logits (including no-object) for all queries. values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`): possible padding). You can use [`~ConditionalDetrImageProcessor.post_process_object_detection`] to retrieve the
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These unnormalized bounding boxes.
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding auxiliary_outputs (`list[Dict]`, *optional*):
possible padding). You can use [`~ConditionalDetrImageProcessor.post_process_object_detection`] to retrieve the Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
unnormalized bounding boxes. and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and
auxiliary_outputs (`list[Dict]`, *optional*): `pred_boxes`) for each decoder layer.
Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`) last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and Sequence of hidden-states at the output of the last layer of the decoder of the model.
`pred_boxes`) for each decoder layer.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the decoder of the model.
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the decoder at the output of each
layer plus the initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each
layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the encoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
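The docstring above points to `ConditionalDetrImageProcessor.post_process_object_detection` for turning the normalized `pred_boxes` into absolute coordinates. A usage sketch follows; the checkpoint name, image URL, and threshold are illustrative rather than taken from this diff.

```python
import requests
import torch
from PIL import Image

from transformers import AutoImageProcessor, ConditionalDetrForObjectDetection

# Checkpoint name is illustrative; any Conditional DETR detection checkpoint should work.
ckpt = "microsoft/conditional-detr-resnet-50"
image_processor = AutoImageProcessor.from_pretrained(ckpt)
model = ConditionalDetrForObjectDetection.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = image_processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # ConditionalDetrObjectDetectionOutput

# Convert normalized (center_x, center_y, width, height) boxes into absolute (x1, y1, x2, y2) pixels.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = image_processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```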
@ -181,59 +133,39 @@ class ConditionalDetrObjectDetectionOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Output type of [`ConditionalDetrForSegmentation`].
"""
)
# Copied from transformers.models.detr.modeling_detr.DetrSegmentationOutput with Detr->ConditionalDetr # Copied from transformers.models.detr.modeling_detr.DetrSegmentationOutput with Detr->ConditionalDetr
class ConditionalDetrSegmentationOutput(ModelOutput): class ConditionalDetrSegmentationOutput(ModelOutput):
""" r"""
Output type of [`ConditionalDetrForSegmentation`]. loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)):
Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a
Args: bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)): scale-invariant IoU loss.
Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a loss_dict (`Dict`, *optional*):
bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized A dictionary containing the individual losses. Useful for logging.
scale-invariant IoU loss. logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`):
loss_dict (`Dict`, *optional*): Classification logits (including no-object) for all queries.
A dictionary containing the individual losses. Useful for logging. pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`): Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
Classification logits (including no-object) for all queries. values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`): possible padding). You can use [`~ConditionalDetrImageProcessor.post_process_object_detection`] to retrieve the
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These unnormalized bounding boxes.
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding pred_masks (`torch.FloatTensor` of shape `(batch_size, num_queries, height/4, width/4)`):
possible padding). You can use [`~ConditionalDetrImageProcessor.post_process_object_detection`] to retrieve the Segmentation masks logits for all queries. See also
unnormalized bounding boxes. [`~ConditionalDetrImageProcessor.post_process_semantic_segmentation`] or
pred_masks (`torch.FloatTensor` of shape `(batch_size, num_queries, height/4, width/4)`): [`~ConditionalDetrImageProcessor.post_process_instance_segmentation`]
Segmentation masks logits for all queries. See also [`~ConditionalDetrImageProcessor.post_process_panoptic_segmentation`] to evaluate semantic, instance and panoptic
[`~ConditionalDetrImageProcessor.post_process_semantic_segmentation`] or segmentation masks respectively.
[`~ConditionalDetrImageProcessor.post_process_instance_segmentation`] auxiliary_outputs (`list[Dict]`, *optional*):
[`~ConditionalDetrImageProcessor.post_process_panoptic_segmentation`] to evaluate semantic, instance and panoptic Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
segmentation masks respectively. and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and
auxiliary_outputs (`list[Dict]`, *optional*): `pred_boxes`) for each decoder layer.
Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`) last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and Sequence of hidden-states at the output of the last layer of the decoder of the model.
`pred_boxes`) for each decoder layer.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the decoder of the model.
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the decoder at the output of each
layer plus the initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each
layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the encoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -1022,7 +954,6 @@ class MLP(nn.Module):
@auto_docstring @auto_docstring
# Copied from transformers.models.detr.modeling_detr.DetrPreTrainedModel with Detr->ConditionalDetr # Copied from transformers.models.detr.modeling_detr.DetrPreTrainedModel with Detr->ConditionalDetr
class ConditionalDetrPreTrainedModel(PreTrainedModel): class ConditionalDetrPreTrainedModel(PreTrainedModel):
config_class = ConditionalDetrConfig config_class = ConditionalDetrConfig

View File

@ -46,49 +46,40 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class CsmOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for the model autoregressive outputs. Base class for the model autoregressive outputs.
"""
)
class CsmOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). depth_decoder_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): Language modeling loss (for next-token prediction) of the depth decoder model.
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). depth_decoder_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): Prediction scores of the depth decoder (scores for each vocabulary token before SoftMax).
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape depth_decoder_past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
depth_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
`past_key_values` input) to speed up sequential decoding. depth_decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + sequence_length)`.
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. backbone_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction) of the backbone model.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
depth_decoder_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction) of the depth decoder model.
depth_decoder_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the depth decoder (scores for each vocabulary token before SoftMax).
depth_decoder_past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
depth_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
depth_decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
backbone_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction) of the backbone model.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -46,49 +46,40 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class CsmOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for the model autoregressive outputs. Base class for the model autoregressive outputs.
"""
)
class CsmOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). depth_decoder_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): Language modeling loss (for next-token prediction) of the depth decoder model.
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). depth_decoder_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): Prediction scores of the depth decoder (scores for each vocabulary token before SoftMax).
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape depth_decoder_past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
depth_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
`past_key_values` input) to speed up sequential decoding. depth_decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + sequence_length)`.
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. backbone_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction) of the backbone model.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
depth_decoder_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction) of the depth decoder model.
depth_decoder_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the depth decoder (scores for each vocabulary token before SoftMax).
depth_decoder_past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
depth_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
depth_decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
backbone_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction) of the backbone model.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
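Since `CsmOutputWithPast` exposes `backbone_loss` and `depth_decoder_loss` alongside the combined `loss`, the two components can be logged separately during training. A small sketch with a stand-in object is below; the field names are taken from the docstring above, and no real model is loaded.

```python
from types import SimpleNamespace

import torch


def collect_csm_losses(outputs) -> dict:
    """Gather the documented loss components from a CsmOutputWithPast-like object."""
    return {
        name: value.item()
        for name in ("loss", "backbone_loss", "depth_decoder_loss")
        if (value := getattr(outputs, name, None)) is not None
    }


# Toy stand-in for a real CsmOutputWithPast (field names only; values are arbitrary).
outputs = SimpleNamespace(
    loss=torch.tensor(2.5),
    backbone_loss=torch.tensor(1.0),
    depth_decoder_loss=torch.tensor(1.5),
)
print(collect_csm_losses(outputs))  # {'loss': 2.5, 'backbone_loss': 1.0, 'depth_decoder_loss': 1.5}
```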

View File

@ -33,19 +33,15 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class BaseModelOutputWithCLSToken(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for model's outputs, with potential hidden states and attentions. Base class for model's outputs, with potential hidden states and attentions.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class BaseModelOutputWithCLSToken(ModelOutput):
Sequence of hidden-states at the output of the last layer of the model. r"""
cls_token_value (`torch.FloatTensor` of shape `(batch_size, 1, hidden_size)`): cls_token_value (`torch.FloatTensor` of shape `(batch_size, 1, hidden_size)`):
Classification token at the output of the last layer of the model. Classification token at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
plus the initial embedding outputs.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
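`cls_token_value` has shape `(batch_size, 1, hidden_size)` and is what a classification head would typically pool instead of the full sequence. A minimal sketch with stand-in tensors (all dimensions are illustrative):

```python
import torch
from torch import nn

batch_size, seq_len, hidden_size, num_labels = 4, 197, 384, 10

# Stand-ins for BaseModelOutputWithCLSToken fields (shapes taken from the docstring above).
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)
cls_token_value = torch.randn(batch_size, 1, hidden_size)

# A classification head reads the CLS token rather than averaging the whole sequence.
classifier = nn.Linear(hidden_size, num_labels)
logits = classifier(cls_token_value[:, 0])
print(logits.shape)  # torch.Size([4, 10])
```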

View File

@ -433,57 +433,41 @@ class DFineDecoderLayer(nn.Module):
@dataclass @dataclass
class DFineModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of the RT-DETR encoder-decoder model. Base class for outputs of the RT-DETR encoder-decoder model.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`): class DFineModelOutput(ModelOutput):
Sequence of hidden-states at the output of the last layer of the decoder of the model. r"""
intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`):
Stacked intermediate hidden states (output of each layer of the decoder). Sequence of hidden-states at the output of the last layer of the decoder of the model.
intermediate_logits (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, sequence_length, config.num_labels)`): intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`):
Stacked intermediate logits (logits of each layer of the decoder). Stacked intermediate hidden states (output of each layer of the decoder).
intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`): intermediate_logits (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, sequence_length, config.num_labels)`):
Stacked intermediate reference points (reference points of each layer of the decoder). Stacked intermediate logits (logits of each layer of the decoder).
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Stacked intermediate reference points (reference points of each layer of the decoder).
shape `(batch_size, num_queries, hidden_size)`. Hidden-states of the decoder at the output of each layer intermediate_predicted_corners (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`):
plus the initial embedding outputs. Stacked intermediate predicted corners (predicted corners of each layer of the decoder).
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): initial_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, num_queries, Initial reference points used for the first decoder layer.
num_queries)`. Attentions weights of the decoder, after the attention softmax, used to compute the weighted init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
average in the self-attention heads. Initial reference points sent through the Transformer decoder.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): enc_topk_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_queries, num_heads, 4, 4)`. Predicted bounding boxes scores where the top `config.two_stage_num_proposals` scoring bounding boxes are
Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the picked as region proposals in the encoder stage. Output of bounding box binary classification (i.e.
weighted average in the cross-attention heads. foreground and background).
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): enc_topk_bboxes (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`):
Sequence of hidden-states at the output of the last layer of the encoder of the model. Logits of predicted bounding boxes coordinates in the encoder stage.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): enc_outputs_class (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Predicted bounding boxes scores where the top `config.two_stage_num_proposals` scoring bounding boxes are
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each picked as region proposals in the first stage. Output of bounding box binary classification (i.e.
layer plus the initial embedding outputs. foreground and background).
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_queries, num_heads, 4, 4)`. Logits of predicted bounding boxes coordinates in the first stage.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the denoising_meta_values (`dict`):
self-attention heads. Extra dictionary for the denoising related values.
init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
Initial reference points sent through the Transformer decoder.
enc_topk_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`):
Predicted bounding boxes scores where the top `config.two_stage_num_proposals` scoring bounding boxes are
picked as region proposals in the encoder stage. Output of bounding box binary classification (i.e.
foreground and background).
enc_topk_bboxes (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`):
Logits of predicted bounding boxes coordinates in the encoder stage.
enc_outputs_class (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Predicted bounding boxes scores where the top `config.two_stage_num_proposals` scoring bounding boxes are
picked as region proposals in the first stage. Output of bounding box binary classification (i.e.
foreground and background).
enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Logits of predicted bounding boxes coordinates in the first stage.
denoising_meta_values (`dict`):
Extra dictionary for the denoising related values
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -507,76 +491,56 @@ class DFineModelOutput(ModelOutput):
@dataclass @dataclass
class DFineObjectDetectionOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`DFineForObjectDetection`]. Output type of [`DFineForObjectDetection`].
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)): class DFineObjectDetectionOutput(ModelOutput):
Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a r"""
bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)):
scale-invariant IoU loss. Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a
loss_dict (`Dict`, *optional*): bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized
A dictionary containing the individual losses. Useful for logging. scale-invariant IoU loss.
logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`): loss_dict (`Dict`, *optional*):
Classification logits (including no-object) for all queries. A dictionary containing the individual losses. Useful for logging.
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`): logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`):
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These Classification logits (including no-object) for all queries.
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
possible padding). You can use [`~DFineImageProcessor.post_process_object_detection`] to retrieve the Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
unnormalized (absolute) bounding boxes. values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
auxiliary_outputs (`list[Dict]`, *optional*): possible padding). You can use [`~DFineImageProcessor.post_process_object_detection`] to retrieve the
Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`) unnormalized (absolute) bounding boxes.
and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and auxiliary_outputs (`list[Dict]`, *optional*):
`pred_boxes`) for each decoder layer. Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`): and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and
Sequence of hidden-states at the output of the last layer of the decoder of the model. `pred_boxes`) for each decoder layer.
intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`):
Stacked intermediate hidden states (output of each layer of the decoder). Sequence of hidden-states at the output of the last layer of the decoder of the model.
intermediate_logits (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, config.num_labels)`): intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`):
Stacked intermediate logits (logits of each layer of the decoder). Stacked intermediate hidden states (output of each layer of the decoder).
intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`): intermediate_logits (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, config.num_labels)`):
Stacked intermediate reference points (reference points of each layer of the decoder). Stacked intermediate logits (logits of each layer of the decoder).
intermediate_predicted_corners (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`): intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`):
Stacked intermediate predicted corners (predicted corners of each layer of the decoder). Stacked intermediate reference points (reference points of each layer of the decoder).
initial_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`): intermediate_predicted_corners (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`):
Stacked initial reference points (initial reference points of each layer of the decoder). Stacked intermediate predicted corners (predicted corners of each layer of the decoder).
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): initial_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Stacked initial reference points (initial reference points of each layer of the decoder).
shape `(batch_size, num_queries, hidden_size)`. Hidden-states of the decoder at the output of each layer init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
plus the initial embedding outputs. Initial reference points sent through the Transformer decoder.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): enc_topk_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, num_queries, Logits of predicted bounding boxes coordinates in the encoder.
num_queries)`. Attentions weights of the decoder, after the attention softmax, used to compute the weighted enc_topk_bboxes (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
average in the self-attention heads. Logits of predicted bounding boxes coordinates in the encoder.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): enc_outputs_class (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_queries, num_heads, 4, 4)`. Predicted bounding boxes scores where the top `config.two_stage_num_proposals` scoring bounding boxes are
Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the picked as region proposals in the first stage. Output of bounding box binary classification (i.e.
weighted average in the cross-attention heads. foreground and background).
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Sequence of hidden-states at the output of the last layer of the encoder of the model. Logits of predicted bounding boxes coordinates in the first stage.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): denoising_meta_values (`dict`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Extra dictionary for the denoising related values
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each
layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_queries, num_heads, 4, 4)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
Initial reference points sent through the Transformer decoder.
enc_topk_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Logits of predicted bounding boxes coordinates in the encoder.
enc_topk_bboxes (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Logits of predicted bounding boxes coordinates in the encoder.
enc_outputs_class (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Predicted bounding boxes scores where the top `config.two_stage_num_proposals` scoring bounding boxes are
picked as region proposals in the first stage. Output of bounding box binary classification (i.e.
foreground and background).
enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Logits of predicted bounding boxes coordinates in the first stage.
denoising_meta_values (`dict`):
Extra dictionary for the denoising related values
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
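
Note on the pattern applied throughout this commit: the per-class `Args:` blocks lose the boilerplate entries (`hidden_states`, `attentions`, encoder/decoder states), which `@auto_docstring` now fills in from the shared templates, and the remaining `r"""` block documents only the model-specific fields. A minimal sketch of the target layout, using a hypothetical class and field names; outside the transformers source tree the decorator may not resolve the shared descriptions, so this is illustrative only (assuming `auto_docstring` and `ModelOutput` are importable from `transformers.utils`, as in the modeling files):

    from dataclasses import dataclass
    from typing import Optional

    import torch

    from transformers.utils import ModelOutput, auto_docstring


    @dataclass
    @auto_docstring(
        custom_intro="""
        Output type of a hypothetical detection model.
        """
    )
    class MyDetectionOutput(ModelOutput):
        r"""
        loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided):
            Total detection loss.
        logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`):
            Classification logits (including no-object) for all queries.
        """

        # Only model-specific fields are declared and documented here; standard fields
        # would keep their library-wide description via the decorator.
        loss: Optional[torch.FloatTensor] = None
        logits: Optional[torch.FloatTensor] = None
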
@ -1008,38 +972,30 @@ class DFineIntegral(nn.Module):
@dataclass @dataclass
class DFineDecoderOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of the DFineDecoder. This class adds two attributes to Base class for outputs of the DFineDecoder. This class adds two attributes to
BaseModelOutputWithCrossAttentions, namely: BaseModelOutputWithCrossAttentions, namely:
- a stacked tensor of intermediate decoder hidden states (i.e. the output of each decoder layer) - a stacked tensor of intermediate decoder hidden states (i.e. the output of each decoder layer)
- a stacked tensor of intermediate reference points. - a stacked tensor of intermediate reference points.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class DFineDecoderOutput(ModelOutput):
Sequence of hidden-states at the output of the last layer of the model. r"""
intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`): intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`):
Stacked intermediate hidden states (output of each layer of the decoder). Stacked intermediate hidden states (output of each layer of the decoder).
intermediate_logits (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, sequence_length, config.num_labels)`): intermediate_logits (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, sequence_length, config.num_labels)`):
Stacked intermediate logits (logits of each layer of the decoder). Stacked intermediate logits (logits of each layer of the decoder).
intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, sequence_length, hidden_size)`): intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, sequence_length, hidden_size)`):
Stacked intermediate reference points (reference points of each layer of the decoder). Stacked intermediate reference points (reference points of each layer of the decoder).
intermediate_predicted_corners (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`): intermediate_predicted_corners (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`):
Stacked intermediate predicted corners (predicted corners of each layer of the decoder). Stacked intermediate predicted corners (predicted corners of each layer of the decoder).
initial_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`): initial_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`):
Stacked initial reference points (initial reference points of each layer of the decoder). Stacked initial reference points (initial reference points of each layer of the decoder).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
plus the initial embedding outputs. used to compute the weighted average in the cross-attention heads.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None

View File

@ -39,34 +39,26 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
# Copied from transformers.models.conditional_detr.modeling_conditional_detr.ConditionalDetrDecoderOutput with ConditionalDetr->DabDetr,Conditional DETR->DAB-DETR,2 (anchor points)->4 (anchor points) @auto_docstring(
class DabDetrDecoderOutput(BaseModelOutputWithCrossAttentions): custom_intro="""
"""
Base class for outputs of the Conditional DETR decoder. This class adds one attribute to Base class for outputs of the Conditional DETR decoder. This class adds one attribute to
BaseModelOutputWithCrossAttentions, namely an optional stack of intermediate decoder activations, i.e. the output BaseModelOutputWithCrossAttentions, namely an optional stack of intermediate decoder activations, i.e. the output
of each decoder layer, each of them gone through a layernorm. This is useful when training the model with auxiliary of each decoder layer, each of them gone through a layernorm. This is useful when training the model with auxiliary
decoding losses. decoding losses.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): # Copied from transformers.models.conditional_detr.modeling_conditional_detr.ConditionalDetrDecoderOutput with ConditionalDetr->DabDetr,Conditional DETR->DAB-DETR,2 (anchor points)->4 (anchor points)
Sequence of hidden-states at the output of the last layer of the model. class DabDetrDecoderOutput(BaseModelOutputWithCrossAttentions):
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): r"""
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
plus the initial embedding outputs. sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): used to compute the weighted average in the cross-attention heads.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
the self-attention heads. layernorm.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`): reference_points (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, 2 (anchor points))`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, Reference points (reference points of each layer of the decoder).
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
layernorm.
reference_points (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, 2 (anchor points))`):
Reference points (reference points of each layer of the decoder).
""" """
intermediate_hidden_states: Optional[torch.FloatTensor] = None intermediate_hidden_states: Optional[torch.FloatTensor] = None
@ -74,44 +66,24 @@ class DabDetrDecoderOutput(BaseModelOutputWithCrossAttentions):
@dataclass @dataclass
# Copied from transformers.models.conditional_detr.modeling_conditional_detr.ConditionalDetrModelOutput with ConditionalDetr->DabDetr,Conditional DETR->DAB-DETR,2 (anchor points)->4 (anchor points) @auto_docstring(
class DabDetrModelOutput(Seq2SeqModelOutput): custom_intro="""
"""
Base class for outputs of the Conditional DETR encoder-decoder model. This class adds one attribute to Base class for outputs of the Conditional DETR encoder-decoder model. This class adds one attribute to
Seq2SeqModelOutput, namely an optional stack of intermediate decoder activations, i.e. the output of each decoder Seq2SeqModelOutput, namely an optional stack of intermediate decoder activations, i.e. the output of each decoder
layer, each of them gone through a layernorm. This is useful when training the model with auxiliary decoding layer, each of them gone through a layernorm. This is useful when training the model with auxiliary decoding
losses. losses.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): # Copied from transformers.models.conditional_detr.modeling_conditional_detr.ConditionalDetrModelOutput with ConditionalDetr->DabDetr,Conditional DETR->DAB-DETR,2 (anchor points)->4 (anchor points)
Sequence of hidden-states at the output of the last layer of the decoder of the model. class DabDetrModelOutput(Seq2SeqModelOutput):
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): r"""
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the decoder at the output of each Sequence of hidden-states at the output of the last layer of the decoder of the model.
layer plus the initial embedding outputs. intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, sequence_length, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, layernorm.
sequence_length)`. Attentions weights of the decoder, after the attention softmax, used to compute the reference_points (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, 2 (anchor points))`):
weighted average in the self-attention heads. Reference points (reference points of each layer of the decoder).
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each
layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the encoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, sequence_length, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
layernorm.
reference_points (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, 2 (anchor points))`):
Reference points (reference points of each layer of the decoder).
""" """
intermediate_hidden_states: Optional[torch.FloatTensor] = None intermediate_hidden_states: Optional[torch.FloatTensor] = None
@ -119,53 +91,33 @@ class DabDetrModelOutput(Seq2SeqModelOutput):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Output type of [`DabDetrForObjectDetection`].
"""
)
# Copied from transformers.models.detr.modeling_detr.DetrObjectDetectionOutput with Detr->DabDetr # Copied from transformers.models.detr.modeling_detr.DetrObjectDetectionOutput with Detr->DabDetr
class DabDetrObjectDetectionOutput(ModelOutput): class DabDetrObjectDetectionOutput(ModelOutput):
""" r"""
Output type of [`DabDetrForObjectDetection`]. loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)):
Total loss as a linear combination of a negative log-likehood (cross-entropy) for class prediction and a
Args: bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)): scale-invariant IoU loss.
Total loss as a linear combination of a negative log-likehood (cross-entropy) for class prediction and a loss_dict (`Dict`, *optional*):
bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized A dictionary containing the individual losses. Useful for logging.
scale-invariant IoU loss. logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`):
loss_dict (`Dict`, *optional*): Classification logits (including no-object) for all queries.
A dictionary containing the individual losses. Useful for logging. pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`): Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
Classification logits (including no-object) for all queries. values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`): possible padding). You can use [`~DabDetrImageProcessor.post_process_object_detection`] to retrieve the
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These unnormalized bounding boxes.
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding auxiliary_outputs (`list[Dict]`, *optional*):
possible padding). You can use [`~DabDetrImageProcessor.post_process_object_detection`] to retrieve the Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
unnormalized bounding boxes. and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and
auxiliary_outputs (`list[Dict]`, *optional*): `pred_boxes`) for each decoder layer.
Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`) last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and Sequence of hidden-states at the output of the last layer of the decoder of the model.
`pred_boxes`) for each decoder layer.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the decoder of the model.
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the decoder at the output of each
layer plus the initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each
layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the encoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
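
The `logits` and `pred_boxes` documented above are normally consumed through the image processor's post-processing step rather than read raw. A hedged usage sketch; the checkpoint id and image path are placeholders, not taken from this diff:

    import torch
    from PIL import Image
    from transformers import AutoImageProcessor, AutoModelForObjectDetection

    checkpoint = "IDEA-Research/dab-detr-resnet-50"  # assumed checkpoint id
    image = Image.open("example.jpg")                # placeholder image

    image_processor = AutoImageProcessor.from_pretrained(checkpoint)
    model = AutoModelForObjectDetection.from_pretrained(checkpoint)

    inputs = image_processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)  # DabDetrObjectDetectionOutput

    # Turn normalized (center_x, center_y, width, height) boxes into absolute pixel corners.
    target_sizes = torch.tensor([image.size[::-1]])
    detections = image_processor.post_process_object_detection(
        outputs, threshold=0.5, target_sizes=target_sizes
    )[0]
    print(detections["scores"], detections["labels"], detections["boxes"])
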

View File

@ -29,19 +29,19 @@ from .configuration_dac import DacConfig
@dataclass @dataclass
@auto_docstring
class DacOutput(ModelOutput): class DacOutput(ModelOutput):
""" r"""
Args: loss (`torch.Tensor`):
loss (`torch.Tensor`): Loss from the encoder model, comprising the weighted combination of the commitment and codebook losses.
Loss from the encoder model, comprising the weighted combination of the commitment and codebook losses. audio_values (`torch.Tensor` of shape `(batch_size, input_length)`):
audio_values (`torch.Tensor` of shape `(batch_size, input_length)`): Reconstructed audio data.
Reconstructed audio data. quantized_representation (`torch.Tensor` of shape `(batch_size, dimension, time_steps)`):
quantized_representation (`torch.Tensor` of shape `(batch_size, dimension, time_steps)`): Quantized continuous representation of input.
Quantized continuous representation of input. audio_codes (`torch.LongTensor` of shape `(batch_size, num_codebooks, time_steps)`):
audio_codes (`torch.LongTensor` of shape `(batch_size, num_codebooks, time_steps)`): Codebook indices for each codebook (quantized discrete representation of input).
Codebook indices for each codebook (quantized discrete representation of input). projected_latents (`torch.Tensor` of shape `(batch_size, num_codebooks * dimension, time_steps)`):
projected_latents (`torch.Tensor` of shape `(batch_size, num_codebooks * dimension, time_steps)`): Projected latents (continuous representation of input before quantization).
Projected latents (continuous representation of input before quantization).
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -52,17 +52,17 @@ class DacOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
class DacEncoderOutput(ModelOutput): class DacEncoderOutput(ModelOutput):
""" r"""
Args: loss (`torch.Tensor`):
loss (`torch.Tensor`): Loss from the encoder model, comprising the weighted combination of the commitment and codebook losses.
Loss from the encoder model, comprising the weighted combination of the commitment and codebook losses. quantized_representation (`torch.Tensor` of shape `(batch_size, dimension, time_steps)`, *optional*):
quantized_representation (`torch.Tensor` of shape `(batch_size, dimension, time_steps)`, *optional*): Quantized continuous representation of input.
Quantized continuous representation of input. audio_codes (`torch.Tensor` of shape `(batch_size, num_codebooks, time_steps)`, *optional*):
audio_codes (`torch.Tensor` of shape `(batch_size, num_codebooks, time_steps)`, *optional*): Codebook indices for each codebook (quantized discrete representation of input).
Codebook indices for each codebook (quantized discrete representation of input). projected_latents (`torch.Tensor` of shape `(batch_size, num_codebooks * dimension, time_steps)`, *optional*):
projected_latents (`torch.Tensor` of shape `(batch_size, num_codebooks * dimension, time_steps)`, *optional*): Projected latents (continuous representation of input before quantization).
Projected latents (continuous representation of input before quantization).
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -72,12 +72,12 @@ class DacEncoderOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
# Copied from transformers.models.encodec.modeling_encodec.EncodecDecoderOutput with Encodec->Dac, segment_length->input_length # Copied from transformers.models.encodec.modeling_encodec.EncodecDecoderOutput with Encodec->Dac, segment_length->input_length
class DacDecoderOutput(ModelOutput): class DacDecoderOutput(ModelOutput):
""" r"""
Args: audio_values (`torch.FloatTensor` of shape `(batch_size, input_length)`, *optional*):
audio_values (`torch.FloatTensor` of shape `(batch_size, input_length)`, *optional*): Decoded audio values, obtained using the decoder part of Dac.
Decoded audio values, obtained using the decoder part of Dac.
""" """
audio_values: Optional[torch.FloatTensor] = None audio_values: Optional[torch.FloatTensor] = None
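
For context on how the three Dac output classes above surface in practice, a hedged sketch; the checkpoint id, the raw-audio shape, and the encode/decode round-trip are assumptions about the Dac API, not part of this diff:

    import torch
    from transformers import DacModel

    model = DacModel.from_pretrained("descript/dac_16khz")  # assumed checkpoint id
    audio = torch.randn(1, 1, 16000)  # (batch, channels, samples); shape assumed

    out = model(audio)  # DacOutput: reconstructed audio plus quantization tensors
    print(out.audio_values.shape, out.audio_codes.shape)

    enc = model.encode(audio)                         # DacEncoderOutput
    dec = model.decode(enc.quantized_representation)  # DacDecoderOutput
    print(dec.audio_values.shape)
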

View File

@ -43,29 +43,18 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Class for outputs of [`Data2VecVisionModel`].
"""
)
# Copied from transformers.models.beit.modeling_beit.BeitModelOutputWithPooling with Beit->Data2VecVision # Copied from transformers.models.beit.modeling_beit.BeitModelOutputWithPooling with Beit->Data2VecVision
class Data2VecVisionModelOutputWithPooling(BaseModelOutputWithPooling): class Data2VecVisionModelOutputWithPooling(BaseModelOutputWithPooling):
""" r"""
Class for outputs of [`Data2VecVisionModel`]. pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
Average of the last layer hidden states of the patch tokens (excluding the *[CLS]* token) if
Args: *config.use_mean_pooling* is set to True. If set to False, then the final hidden state of the *[CLS]* token
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): will be returned.
Sequence of hidden-states at the output of the last layer of the model.
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
Average of the last layer hidden states of the patch tokens (excluding the *[CLS]* token) if
*config.use_mean_pooling* is set to True. If set to False, then the final hidden state of the *[CLS]* token
will be returned.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """

View File

@ -711,30 +711,19 @@ class DecisionTransformerGPT2Model(DecisionTransformerGPT2PreTrainedModel):
@dataclass @dataclass
class DecisionTransformerOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for model's outputs that also contains a pooling of the last hidden states. Base class for model's outputs that also contains a pooling of the last hidden states.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class DecisionTransformerOutput(ModelOutput):
Sequence of hidden-states at the output of the last layer of the model. r"""
state_preds (`torch.FloatTensor` of shape `(batch_size, sequence_length, state_dim)`): state_preds (`torch.FloatTensor` of shape `(batch_size, sequence_length, state_dim)`):
Environment state predictions Environment state predictions
action_preds (`torch.FloatTensor` of shape `(batch_size, sequence_length, action_dim)`): action_preds (`torch.FloatTensor` of shape `(batch_size, sequence_length, action_dim)`):
Model action predictions Model action predictions
return_preds (`torch.FloatTensor` of shape `(batch_size, sequence_length, 1)`): return_preds (`torch.FloatTensor` of shape `(batch_size, sequence_length, 1)`):
Predicted returns for each state Predicted returns for each state
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
state_preds: Optional[torch.FloatTensor] = None state_preds: Optional[torch.FloatTensor] = None
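
The three prediction heads documented above produce one value per timestep of the input trajectory window; when rolling out a policy, only the most recent step is usually read off. An illustrative sketch on dummy tensors with placeholder dimensions:

    import torch

    # Stand-ins for state_preds / action_preds / return_preds: batch of 2, 20-step window.
    state_preds = torch.randn(2, 20, 17)   # (batch_size, sequence_length, state_dim)
    action_preds = torch.randn(2, 20, 6)   # (batch_size, sequence_length, action_dim)
    return_preds = torch.randn(2, 20, 1)   # (batch_size, sequence_length, 1)

    next_action = action_preds[:, -1]      # action predicted for the latest timestep
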

View File

@ -108,32 +108,24 @@ class MultiScaleDeformableAttention(nn.Module):
@dataclass @dataclass
class DeformableDetrDecoderOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of the DeformableDetrDecoder. This class adds two attributes to Base class for outputs of the DeformableDetrDecoder. This class adds two attributes to
BaseModelOutputWithCrossAttentions, namely: BaseModelOutputWithCrossAttentions, namely:
- a stacked tensor of intermediate decoder hidden states (i.e. the output of each decoder layer) - a stacked tensor of intermediate decoder hidden states (i.e. the output of each decoder layer)
- a stacked tensor of intermediate reference points. - a stacked tensor of intermediate reference points.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class DeformableDetrDecoderOutput(ModelOutput):
Sequence of hidden-states at the output of the last layer of the model. r"""
intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`): intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`):
Stacked intermediate hidden states (output of each layer of the decoder). Stacked intermediate hidden states (output of each layer of the decoder).
intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, sequence_length, hidden_size)`): intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, sequence_length, hidden_size)`):
Stacked intermediate reference points (reference points of each layer of the decoder). Stacked intermediate reference points (reference points of each layer of the decoder).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
plus the initial embedding outputs. used to compute the weighted average in the cross-attention heads.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -145,47 +137,27 @@ class DeformableDetrDecoderOutput(ModelOutput):
@dataclass @dataclass
class DeformableDetrModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of the Deformable DETR encoder-decoder model. Base class for outputs of the Deformable DETR encoder-decoder model.
"""
Args: )
init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`): class DeformableDetrModelOutput(ModelOutput):
Initial reference points sent through the Transformer decoder. r"""
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`): init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
Sequence of hidden-states at the output of the last layer of the decoder of the model. Initial reference points sent through the Transformer decoder.
intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`):
Stacked intermediate hidden states (output of each layer of the decoder). Sequence of hidden-states at the output of the last layer of the decoder of the model.
intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`): intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`):
Stacked intermediate reference points (reference points of each layer of the decoder). Stacked intermediate hidden states (output of each layer of the decoder).
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Stacked intermediate reference points (reference points of each layer of the decoder).
shape `(batch_size, num_queries, hidden_size)`. Hidden-states of the decoder at the output of each layer enc_outputs_class (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
plus the initial embedding outputs. Predicted bounding boxes scores where the top `config.two_stage_num_proposals` scoring bounding boxes are
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): picked as region proposals in the first stage. Output of bounding box binary classification (i.e.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, num_queries, foreground and background).
num_queries)`. Attentions weights of the decoder, after the attention softmax, used to compute the weighted enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
average in the self-attention heads. Logits of predicted bounding boxes coordinates in the first stage.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_queries, num_heads, 4, 4)`.
Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the
weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each
layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_queries, num_heads, 4, 4)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
enc_outputs_class (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Predicted bounding boxes scores where the top `config.two_stage_num_proposals` scoring bounding boxes are
picked as region proposals in the first stage. Output of bounding box binary classification (i.e.
foreground and background).
enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Logits of predicted bounding boxes coordinates in the first stage.
""" """
init_reference_points: Optional[torch.FloatTensor] = None init_reference_points: Optional[torch.FloatTensor] = None
@ -203,64 +175,44 @@ class DeformableDetrModelOutput(ModelOutput):
@dataclass @dataclass
class DeformableDetrObjectDetectionOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`DeformableDetrForObjectDetection`]. Output type of [`DeformableDetrForObjectDetection`].
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)): class DeformableDetrObjectDetectionOutput(ModelOutput):
Total loss as a linear combination of a negative log-likehood (cross-entropy) for class prediction and a r"""
bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)):
scale-invariant IoU loss. Total loss as a linear combination of a negative log-likehood (cross-entropy) for class prediction and a
loss_dict (`Dict`, *optional*): bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized
A dictionary containing the individual losses. Useful for logging. scale-invariant IoU loss.
logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`): loss_dict (`Dict`, *optional*):
Classification logits (including no-object) for all queries. A dictionary containing the individual losses. Useful for logging.
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`): logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`):
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These Classification logits (including no-object) for all queries.
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
possible padding). You can use [`~DeformableDetrProcessor.post_process_object_detection`] to retrieve the Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
unnormalized bounding boxes. values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
auxiliary_outputs (`list[Dict]`, *optional*): possible padding). You can use [`~DeformableDetrProcessor.post_process_object_detection`] to retrieve the
Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`) unnormalized bounding boxes.
and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and auxiliary_outputs (`list[Dict]`, *optional*):
`pred_boxes`) for each decoder layer. Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`, *optional*): and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and
Sequence of hidden-states at the output of the last layer of the decoder of the model. `pred_boxes`) for each decoder layer.
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Initial reference points sent through the Transformer decoder.
shape `(batch_size, num_queries, hidden_size)`. Hidden-states of the decoder at the output of each layer last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`, *optional*):
plus the initial embedding outputs. Sequence of hidden-states at the output of the last layer of the decoder of the model.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, num_queries, Stacked intermediate hidden states (output of each layer of the decoder).
num_queries)`. Attentions weights of the decoder, after the attention softmax, used to compute the weighted intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`):
average in the self-attention heads. Stacked intermediate reference points (reference points of each layer of the decoder).
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): enc_outputs_class (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_queries, num_heads, 4, 4)`. Predicted bounding boxes scores where the top `config.two_stage_num_proposals` scoring bounding boxes are
Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the picked as region proposals in the first stage. Output of bounding box binary classification (i.e.
weighted average in the cross-attention heads. foreground and background).
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Sequence of hidden-states at the output of the last layer of the encoder of the model. Logits of predicted bounding boxes coordinates in the first stage.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each
layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, sequence_length, num_heads, 4,
4)`. Attentions weights of the encoder, after the attention softmax, used to compute the weighted average
in the self-attention heads.
intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`):
Stacked intermediate hidden states (output of each layer of the decoder).
intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`):
Stacked intermediate reference points (reference points of each layer of the decoder).
init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
Initial reference points sent through the Transformer decoder.
enc_outputs_class (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Predicted bounding boxes scores where the top `config.two_stage_num_proposals` scoring bounding boxes are
picked as region proposals in the first stage. Output of bounding box binary classification (i.e.
foreground and background).
enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.with_box_refine=True` and `config.two_stage=True`):
Logits of predicted bounding boxes coordinates in the first stage.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
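
When `config.auxiliary_loss=True` and labels are passed, `auxiliary_outputs` above holds one `{"logits", "pred_boxes"}` dict per decoder layer, which is handy for seeing how boxes refine across layers. An illustrative loop on dummy stand-ins (all dimensions are placeholders):

    import torch

    auxiliary_outputs = [
        {"logits": torch.randn(2, 300, 92), "pred_boxes": torch.rand(2, 300, 4)}
        for _ in range(6)  # one entry per decoder layer
    ]
    for layer_idx, aux in enumerate(auxiliary_outputs):
        print(layer_idx, aux["logits"].shape, aux["pred_boxes"].shape)
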

View File

@ -807,27 +807,21 @@ class DeiTForImageClassification(DeiTPreTrainedModel):
@dataclass @dataclass
class DeiTForImageClassificationWithTeacherOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`DeiTForImageClassificationWithTeacher`]. Output type of [`DeiTForImageClassificationWithTeacher`].
"""
Args: )
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`): class DeiTForImageClassificationWithTeacherOutput(ModelOutput):
Prediction scores as the average of the cls_logits and distillation logits. r"""
cls_logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`): logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Prediction scores of the classification head (i.e. the linear layer on top of the final hidden state of the Prediction scores as the average of the cls_logits and distillation logits.
class token). cls_logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
distillation_logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`): Prediction scores of the classification head (i.e. the linear layer on top of the final hidden state of the
Prediction scores of the distillation head (i.e. the linear layer on top of the final hidden state of the class token).
distillation token). distillation_logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Prediction scores of the distillation head (i.e. the linear layer on top of the final hidden state of the
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of distillation token).
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.
""" """
logits: Optional[torch.FloatTensor] = None logits: Optional[torch.FloatTensor] = None
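
As stated above, `logits` is simply the average of the two heads; an illustrative check on dummy tensors:

    import torch

    cls_logits = torch.randn(4, 1000)
    distillation_logits = torch.randn(4, 1000)

    logits = (cls_logits + distillation_logits) / 2  # what the model exposes as `logits`
    predicted_class = logits.argmax(dim=-1)
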

View File

@ -32,26 +32,17 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class DepthProOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for DepthPro's outputs. Base class for DepthPro's outputs.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, n_patches_per_batch, sequence_length, hidden_size)`): class DepthProOutput(ModelOutput):
Sequence of hidden-states at the output of the last layer of the model. r"""
features (`Union[torch.FloatTensor, list[torch.FloatTensor]]`, *optional*): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, n_patches_per_batch, sequence_length, hidden_size)`):
Features from encoders. Can be a single feature or a list of features. Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): features (`Union[torch.FloatTensor, list[torch.FloatTensor]]`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + Features from encoders. Can be a single feature or a list of features.
one for the output of each layer) of shape `(batch_size, n_patches_per_batch, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer and the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, n_patches_per_batch, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -61,28 +52,17 @@ class DepthProOutput(ModelOutput):
@dataclass @dataclass
class DepthProDepthEstimatorOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for DepthProForDepthEstimation's output. Base class for DepthProForDepthEstimation's output.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): class DepthProDepthEstimatorOutput(ModelOutput):
Classification (or regression if config.num_labels==1) loss. r"""
predicted_depth (`torch.FloatTensor` of shape `(batch_size, height, width)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Predicted depth for each pixel. Classification (or regression if config.num_labels==1) loss.
field_of_view (`torch.FloatTensor` of shape `(batch_size,)`, *optional*, returned when `use_fov_model` is provided): field_of_view (`torch.FloatTensor` of shape `(batch_size,)`, *optional*, returned when `use_fov_model` is provided):
Field of View Scaler. Field of View Scaler.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, n_patches_per_batch, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer and the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, n_patches_per_batch, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
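
The per-pixel map in `predicted_depth` (shape `(batch_size, height, width)` per the shared template) is usually resized back to the original image resolution before use. A hedged sketch on a dummy tensor; the target size is a placeholder:

    import torch

    predicted_depth = torch.rand(1, 384, 384)  # stand-in for outputs.predicted_depth

    depth = torch.nn.functional.interpolate(
        predicted_depth.unsqueeze(1),  # add a channel dimension for interpolate
        size=(480, 640),               # placeholder original (height, width)
        mode="bilinear",
        align_corners=False,
    ).squeeze(1)
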

View File

@ -45,122 +45,74 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class DetrDecoderOutput(BaseModelOutputWithCrossAttentions): @auto_docstring(
""" custom_intro="""
Base class for outputs of the DETR decoder. This class adds one attribute to BaseModelOutputWithCrossAttentions, Base class for outputs of the DETR decoder. This class adds one attribute to BaseModelOutputWithCrossAttentions,
namely an optional stack of intermediate decoder activations, i.e. the output of each decoder layer, each of them namely an optional stack of intermediate decoder activations, i.e. the output of each decoder layer, each of them
gone through a layernorm. This is useful when training the model with auxiliary decoding losses. gone through a layernorm. This is useful when training the model with auxiliary decoding losses.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class DetrDecoderOutput(BaseModelOutputWithCrossAttentions):
Sequence of hidden-states at the output of the last layer of the model. r"""
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
plus the initial embedding outputs. used to compute the weighted average in the cross-attention heads.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in layernorm.
the self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
layernorm.
""" """
intermediate_hidden_states: Optional[torch.FloatTensor] = None intermediate_hidden_states: Optional[torch.FloatTensor] = None
@dataclass @dataclass
class DetrModelOutput(Seq2SeqModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of the DETR encoder-decoder model. This class adds one attribute to Seq2SeqModelOutput, Base class for outputs of the DETR encoder-decoder model. This class adds one attribute to Seq2SeqModelOutput,
namely an optional stack of intermediate decoder activations, i.e. the output of each decoder layer, each of them namely an optional stack of intermediate decoder activations, i.e. the output of each decoder layer, each of them
gone through a layernorm. This is useful when training the model with auxiliary decoding losses. gone through a layernorm. This is useful when training the model with auxiliary decoding losses.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class DetrModelOutput(Seq2SeqModelOutput):
Sequence of hidden-states at the output of the last layer of the decoder of the model. r"""
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Sequence of hidden-states at the output of the last layer of the decoder of the model.
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the decoder at the output of each intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, sequence_length, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
layer plus the initial embedding outputs. Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): layernorm.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each
layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the encoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, sequence_length, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
layernorm.
""" """
intermediate_hidden_states: Optional[torch.FloatTensor] = None intermediate_hidden_states: Optional[torch.FloatTensor] = None
@dataclass @dataclass
class DetrObjectDetectionOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`DetrForObjectDetection`]. Output type of [`DetrForObjectDetection`].
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)): class DetrObjectDetectionOutput(ModelOutput):
Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a r"""
bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)):
scale-invariant IoU loss. Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a
loss_dict (`Dict`, *optional*): bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized
A dictionary containing the individual losses. Useful for logging. scale-invariant IoU loss.
logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`): loss_dict (`Dict`, *optional*):
Classification logits (including no-object) for all queries. A dictionary containing the individual losses. Useful for logging.
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`): logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`):
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These Classification logits (including no-object) for all queries.
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
possible padding). You can use [`~DetrImageProcessor.post_process_object_detection`] to retrieve the Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
unnormalized bounding boxes. values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
auxiliary_outputs (`list[Dict]`, *optional*): possible padding). You can use [`~DetrImageProcessor.post_process_object_detection`] to retrieve the
Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`) unnormalized bounding boxes.
and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and auxiliary_outputs (`list[Dict]`, *optional*):
`pred_boxes`) for each decoder layer. Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and
Sequence of hidden-states at the output of the last layer of the decoder of the model. `pred_boxes`) for each decoder layer.
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Sequence of hidden-states at the output of the last layer of the decoder of the model.
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the decoder at the output of each
layer plus the initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each
layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the encoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
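As a usage note for `DetrObjectDetectionOutput` above: `logits` and `pred_boxes` are normally not read directly but handed to the image-processor helper the docstring points to. A hedged sketch of that flow; the checkpoint name, image URL and confidence threshold are illustrative choices, not part of this diff.

```python
import requests
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # DetrObjectDetectionOutput

# Turn the normalized (center_x, center_y, w, h) boxes into absolute (x0, y0, x1, y1) coordinates.
target_sizes = torch.tensor([image.size[::-1]])
detections = processor.post_process_object_detection(
    outputs, threshold=0.9, target_sizes=target_sizes
)[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    print(model.config.id2label[label.item()], round(score.item(), 3), box.tolist())
```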
@ -178,58 +130,38 @@ class DetrObjectDetectionOutput(ModelOutput):
@dataclass @dataclass
class DetrSegmentationOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`DetrForSegmentation`]. Output type of [`DetrForSegmentation`].
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)): class DetrSegmentationOutput(ModelOutput):
Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a r"""
bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)):
scale-invariant IoU loss. Total loss as a linear combination of a negative log-likelihood (cross-entropy) for class prediction and a
loss_dict (`Dict`, *optional*): bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized
A dictionary containing the individual losses. Useful for logging. scale-invariant IoU loss.
logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`): loss_dict (`Dict`, *optional*):
Classification logits (including no-object) for all queries. A dictionary containing the individual losses. Useful for logging.
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`): logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`):
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These Classification logits (including no-object) for all queries.
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
possible padding). You can use [`~DetrImageProcessor.post_process_object_detection`] to retrieve the Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
unnormalized bounding boxes. values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
pred_masks (`torch.FloatTensor` of shape `(batch_size, num_queries, height/4, width/4)`): possible padding). You can use [`~DetrImageProcessor.post_process_object_detection`] to retrieve the
Segmentation masks logits for all queries. See also unnormalized bounding boxes.
[`~DetrImageProcessor.post_process_semantic_segmentation`] or pred_masks (`torch.FloatTensor` of shape `(batch_size, num_queries, height/4, width/4)`):
[`~DetrImageProcessor.post_process_instance_segmentation`] Segmentation masks logits for all queries. See also
[`~DetrImageProcessor.post_process_panoptic_segmentation`] to evaluate semantic, instance and panoptic [`~DetrImageProcessor.post_process_semantic_segmentation`] or
segmentation masks respectively. [`~DetrImageProcessor.post_process_instance_segmentation`]
auxiliary_outputs (`list[Dict]`, *optional*): [`~DetrImageProcessor.post_process_panoptic_segmentation`] to evaluate semantic, instance and panoptic
Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`) segmentation masks respectively.
and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and auxiliary_outputs (`list[Dict]`, *optional*):
`pred_boxes`) for each decoder layer. Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and
Sequence of hidden-states at the output of the last layer of the decoder of the model. `pred_boxes`) for each decoder layer.
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Sequence of hidden-states at the output of the last layer of the decoder of the model.
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the decoder at the output of each
layer plus the initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the encoder at the output of each
layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the encoder, after the attention softmax, used to compute the
weighted average in the self-attention heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
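Similarly for `DetrSegmentationOutput`: `pred_masks` are meant to go through one of the post-processing helpers listed in the docstring. A rough sketch using the panoptic variant; the checkpoint name and image URL are illustrative assumptions.

```python
import requests
import torch
from PIL import Image
from transformers import DetrForSegmentation, DetrImageProcessor

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50-panoptic")
model = DetrForSegmentation.from_pretrained("facebook/detr-resnet-50-panoptic")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # DetrSegmentationOutput

# Combine class logits and mask logits into a panoptic segmentation map.
panoptic = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
print(panoptic["segmentation"].shape, len(panoptic["segments_info"]))
```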

View File

@ -57,30 +57,19 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class DinatEncoderOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Dinat encoder's outputs, with potential hidden states and attentions. Dinat encoder's outputs, with potential hidden states and attentions.
"""
)
class DinatEncoderOutput(ModelOutput):
r"""
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): include the spatial dimensions.
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each stage) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -90,32 +79,21 @@ class DinatEncoderOutput(ModelOutput):
@dataclass @dataclass
class DinatModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Dinat model's outputs that also contains a pooling of the last hidden states. Dinat model's outputs that also contains a pooling of the last hidden states.
"""
)
class DinatModelOutput(ModelOutput):
r"""
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`, *optional*, returned when `add_pooling_layer=True` is passed):
Average pooling of the last layer hidden-state.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): include the spatial dimensions.
Sequence of hidden-states at the output of the last layer of the model.
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`, *optional*, returned when `add_pooling_layer=True` is passed):
Average pooling of the last layer hidden-state.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each stage) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -126,32 +104,23 @@ class DinatModelOutput(ModelOutput):
@dataclass @dataclass
class DinatImageClassifierOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Dinat outputs for image classification. Dinat outputs for image classification.
"""
)
class DinatImageClassifierOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Classification (or regression if config.num_labels==1) loss.
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): include the spatial dimensions.
Classification (or regression if config.num_labels==1) loss.
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each stage) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
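The `reshaped_hidden_states` field these Dinat outputs add (and which the DonutSwin outputs below reuse) is simply the sequence-shaped hidden state folded back onto the spatial grid. A torch-only sketch of that relationship, with made-up sizes purely for illustration:

```python
import torch

batch_size, height, width, hidden_size = 2, 7, 7, 96

# What `hidden_states` stores: a flattened (batch, height * width, hidden_size) sequence.
sequence_hidden = torch.randn(batch_size, height * width, hidden_size)

# What `reshaped_hidden_states` stores: the same values with the spatial dimensions
# restored, channels first: (batch, hidden_size, height, width).
reshaped = sequence_hidden.permute(0, 2, 1).reshape(batch_size, hidden_size, height, width)

print(reshaped.shape)  # torch.Size([2, 96, 7, 7])
```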

View File

@ -38,31 +38,20 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
@auto_docstring(
custom_intro="""
DonutSwin encoder's outputs, with potential hidden states and attentions.
"""
)
# Copied from transformers.models.swin.modeling_swin.SwinEncoderOutput with Swin->DonutSwin # Copied from transformers.models.swin.modeling_swin.SwinEncoderOutput with Swin->DonutSwin
class DonutSwinEncoderOutput(ModelOutput): class DonutSwinEncoderOutput(ModelOutput):
""" r"""
DonutSwin encoder's outputs, with potential hidden states and attentions. reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): include the spatial dimensions.
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each stage) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -72,33 +61,22 @@ class DonutSwinEncoderOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
DonutSwin model's outputs that also contains a pooling of the last hidden states.
"""
)
# Copied from transformers.models.swin.modeling_swin.SwinModelOutput with Swin->DonutSwin # Copied from transformers.models.swin.modeling_swin.SwinModelOutput with Swin->DonutSwin
class DonutSwinModelOutput(ModelOutput): class DonutSwinModelOutput(ModelOutput):
""" r"""
DonutSwin model's outputs that also contains a pooling of the last hidden states. pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`, *optional*, returned when `add_pooling_layer=True` is passed):
Average pooling of the last layer hidden-state.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): include the spatial dimensions.
Sequence of hidden-states at the output of the last layer of the model.
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`, *optional*, returned when `add_pooling_layer=True` is passed):
Average pooling of the last layer hidden-state.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each stage) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -109,33 +87,24 @@ class DonutSwinModelOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
DonutSwin outputs for image classification.
"""
)
# Copied from transformers.models.swin.modeling_swin.SwinImageClassifierOutput with Swin->DonutSwin # Copied from transformers.models.swin.modeling_swin.SwinImageClassifierOutput with Swin->DonutSwin
class DonutSwinImageClassifierOutput(ModelOutput): class DonutSwinImageClassifierOutput(ModelOutput):
""" r"""
DonutSwin outputs for image classification. loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Classification (or regression if config.num_labels==1) loss.
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): include the spatial dimensions.
Classification (or regression if config.num_labels==1) loss.
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each stage) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -40,26 +40,17 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Class for outputs of [`DPRQuestionEncoder`].
"""
)
class DPRContextEncoderOutput(ModelOutput): class DPRContextEncoderOutput(ModelOutput):
""" r"""
Class for outputs of [`DPRQuestionEncoder`]. pooler_output (`torch.FloatTensor` of shape `(batch_size, embeddings_size)`):
The DPR encoder outputs the *pooler_output* that corresponds to the context representation. Last layer
Args: hidden-state of the first token of the sequence (classification token) further processed by a Linear layer.
pooler_output (`torch.FloatTensor` of shape `(batch_size, embeddings_size)`): This output is to be used to embed contexts for nearest neighbors queries with questions embeddings.
The DPR encoder outputs the *pooler_output* that corresponds to the context representation. Last layer
hidden-state of the first token of the sequence (classification token) further processed by a Linear layer.
This output is to be used to embed contexts for nearest neighbors queries with questions embeddings.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
pooler_output: torch.FloatTensor pooler_output: torch.FloatTensor
@ -68,26 +59,17 @@ class DPRContextEncoderOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Class for outputs of [`DPRQuestionEncoder`].
"""
)
class DPRQuestionEncoderOutput(ModelOutput): class DPRQuestionEncoderOutput(ModelOutput):
""" r"""
Class for outputs of [`DPRQuestionEncoder`]. pooler_output (`torch.FloatTensor` of shape `(batch_size, embeddings_size)`):
The DPR encoder outputs the *pooler_output* that corresponds to the question representation. Last layer
Args: hidden-state of the first token of the sequence (classification token) further processed by a Linear layer.
pooler_output (`torch.FloatTensor` of shape `(batch_size, embeddings_size)`): This output is to be used to embed questions for nearest neighbors queries with context embeddings.
The DPR encoder outputs the *pooler_output* that corresponds to the question representation. Last layer
hidden-state of the first token of the sequence (classification token) further processed by a Linear layer.
This output is to be used to embed questions for nearest neighbors queries with context embeddings.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
pooler_output: torch.FloatTensor pooler_output: torch.FloatTensor
@ -96,29 +78,20 @@ class DPRQuestionEncoderOutput(ModelOutput):
@dataclass @dataclass
class DPRReaderOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Class for outputs of [`DPRQuestionEncoder`]. Class for outputs of [`DPRQuestionEncoder`].
"""
Args: )
start_logits (`torch.FloatTensor` of shape `(n_passages, sequence_length)`): class DPRReaderOutput(ModelOutput):
Logits of the start index of the span for each passage. r"""
end_logits (`torch.FloatTensor` of shape `(n_passages, sequence_length)`): start_logits (`torch.FloatTensor` of shape `(n_passages, sequence_length)`):
Logits of the end index of the span for each passage. Logits of the start index of the span for each passage.
relevance_logits (`torch.FloatTensor` of shape `(n_passages, )`): end_logits (`torch.FloatTensor` of shape `(n_passages, sequence_length)`):
Outputs of the QA classifier of the DPRReader that corresponds to the scores of each passage to answer the Logits of the end index of the span for each passage.
question, compared to all the other passages. relevance_logits (`torch.FloatTensor` of shape `(n_passages, )`):
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Outputs of the QA classifier of the DPRReader that corresponds to the scores of each passage to answer the
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of question, compared to all the other passages.
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
start_logits: torch.FloatTensor start_logits: torch.FloatTensor
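For the three DPR output classes above, `pooler_output` is the embedding actually used for retrieval. A hedged sketch of the intended usage; the checkpoint names, toy passages and dot-product ranking are illustrative assumptions.

```python
import torch
from transformers import (
    DPRContextEncoder,
    DPRContextEncoderTokenizer,
    DPRQuestionEncoder,
    DPRQuestionEncoderTokenizer,
)

ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

passages = ["Paris is the capital of France.", "The Nile is a river in Africa."]
question = "What is the capital of France?"

with torch.no_grad():
    ctx_emb = ctx_encoder(**ctx_tokenizer(passages, return_tensors="pt", padding=True)).pooler_output
    q_emb = q_encoder(**q_tokenizer(question, return_tensors="pt")).pooler_output

# Rank passages by dot-product similarity with the question embedding.
scores = q_emb @ ctx_emb.T
print(passages[scores.argmax().item()])
```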

View File

@ -42,16 +42,18 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class BaseModelOutputWithIntermediateActivations(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for model's outputs that also contains intermediate activations that can be used at later stages. Useful Base class for model's outputs that also contains intermediate activations that can be used at later stages. Useful
in the context of Vision models. in the context of Vision models.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class BaseModelOutputWithIntermediateActivations(ModelOutput):
Sequence of hidden-states at the output of the last layer of the model. r"""
intermediate_activations (`tuple(torch.FloatTensor)`, *optional*): last_hidden_states (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Intermediate activations that can be used to compute hidden states of the model at various layers. Sequence of hidden-states at the output of the last layer of the model.
intermediate_activations (`tuple(torch.FloatTensor)`, *optional*):
Intermediate activations that can be used to compute hidden states of the model at various layers.
""" """
last_hidden_states: Optional[torch.FloatTensor] = None last_hidden_states: Optional[torch.FloatTensor] = None
@ -59,32 +61,21 @@ class BaseModelOutputWithIntermediateActivations(ModelOutput):
@dataclass @dataclass
class BaseModelOutputWithPoolingAndIntermediateActivations(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for model's outputs that also contains a pooling of the last hidden states as well as intermediate Base class for model's outputs that also contains a pooling of the last hidden states as well as intermediate
activations that can be used by the model at later stages. activations that can be used by the model at later stages.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class BaseModelOutputWithPoolingAndIntermediateActivations(ModelOutput):
Sequence of hidden-states at the output of the last layer of the model. r"""
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
Last layer hidden-state of the first token of the sequence (classification token) after further processing Last layer hidden-state of the first token of the sequence (classification token) after further processing
through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns
the classification token after processing through a linear layer and a tanh activation function. The linear the classification token after processing through a linear layer and a tanh activation function. The linear
layer weights are trained from the next sentence prediction (classification) objective during pretraining. layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): intermediate_activations (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + Intermediate activations that can be used to compute hidden states of the model at various layers.
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
intermediate_activations (`tuple(torch.FloatTensor)`, *optional*):
Intermediate activations that can be used to compute hidden states of the model at various layers.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None

View File

@ -667,26 +667,17 @@ class ElectraPreTrainedModel(PreTrainedModel):
@dataclass @dataclass
class ElectraForPreTrainingOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`ElectraForPreTraining`]. Output type of [`ElectraForPreTraining`].
"""
Args: )
loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): class ElectraForPreTrainingOutput(ModelOutput):
Total loss of the ELECTRA objective. r"""
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`): loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
Prediction scores of the head (scores for each token before SoftMax). Total loss of the ELECTRA objective.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Prediction scores of the head (scores for each token before SoftMax).
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
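The `logits` kept in `ElectraForPreTrainingOutput` are per-token replaced-token-detection scores. A hedged sketch of reading them; the checkpoint name, example sentence and the sigmoid/threshold step are illustrative.

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")

# A sentence where "amazing" has been swapped in as a fake token.
inputs = tokenizer("The quick brown amazing jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch_size, sequence_length)

# Scores with sigmoid > 0.5 mean the discriminator flags the token as replaced.
predictions = (torch.sigmoid(logits) > 0.5).long().squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(inputs.input_ids.squeeze(0).tolist())
for token, flagged in zip(tokens, predictions):
    print(token, flagged.item())
```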

View File

@ -38,13 +38,13 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
@auto_docstring
class EncodecOutput(ModelOutput): class EncodecOutput(ModelOutput):
""" r"""
Args: audio_codes (`torch.LongTensor` of shape `(batch_size, nb_chunks, chunk_length)`, *optional*):
audio_codes (`torch.LongTensor` of shape `(batch_size, nb_chunks, chunk_length)`, *optional*): Discrete code embeddings computed using `model.encode`.
Discrete code embeddings computed using `model.encode`. audio_values (`torch.FloatTensor` of shape `(batch_size, segment_length)`, *optional*):
audio_values (`torch.FlaotTensor` of shape `(batch_size, sequence_length)`, *optional*) Decoded audio values, obtained using the decoder part of Encodec.
Decoded audio values, obtained using the decoder part of Encodec.
""" """
audio_codes: Optional[torch.LongTensor] = None audio_codes: Optional[torch.LongTensor] = None
@ -52,13 +52,13 @@ class EncodecOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
class EncodecEncoderOutput(ModelOutput): class EncodecEncoderOutput(ModelOutput):
""" r"""
Args: audio_codes (`torch.LongTensor` of shape `(batch_size, nb_chunks, chunk_length)`, *optional*):
audio_codes (`torch.LongTensor` of shape `(batch_size, nb_chunks, chunk_length)`, *optional*): Discrete code embeddings computed using `model.encode`.
Discrete code embeddings computed using `model.encode`. audio_scales (`torch.Tensor` of shape `(batch_size, nb_chunks)`, *optional*):
audio_scales (`torch.Tensor` of shape `(batch_size, nb_chunks)`, *optional*): Scaling factor for each `audio_codes` input. This is used to unscale each chunk of audio when decoding.
Scaling factor for each `audio_codes` input. This is used to unscale each chunk of audio when decoding.
""" """
audio_codes: Optional[torch.LongTensor] = None audio_codes: Optional[torch.LongTensor] = None
@ -66,11 +66,11 @@ class EncodecEncoderOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
class EncodecDecoderOutput(ModelOutput): class EncodecDecoderOutput(ModelOutput):
""" r"""
Args: audio_values (`torch.FloatTensor` of shape `(batch_size, segment_length)`, *optional*):
audio_values (`torch.FloatTensor` of shape `(batch_size, segment_length)`, *optional*): Decoded audio values, obtained using the decoder part of Encodec.
Decoded audio values, obtained using the decoder part of Encodec.
""" """
audio_values: Optional[torch.FloatTensor] = None audio_values: Optional[torch.FloatTensor] = None
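Connecting the three Encodec output classes above: `encode` yields `audio_codes` (plus `audio_scales`), and `decode` turns them back into `audio_values`. A hedged sketch of that round trip; the checkpoint name and the random dummy waveform are assumptions, and exact tensor shapes depend on the configuration.

```python
import torch
from transformers import EncodecModel

model = EncodecModel.from_pretrained("facebook/encodec_24khz")

# One second of fake mono audio at 24 kHz: (batch, channels, samples).
audio = torch.randn(1, 1, 24_000)

with torch.no_grad():
    encoded = model.encode(audio)                                      # EncodecEncoderOutput
    decoded = model.decode(encoded.audio_codes, encoded.audio_scales)  # EncodecDecoderOutput

print(encoded.audio_codes.shape, decoded.audio_values.shape)
```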

View File

@ -647,31 +647,22 @@ class ErniePreTrainedModel(PreTrainedModel):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Output type of [`ErnieForPreTraining`].
"""
)
# Copied from transformers.models.bert.modeling_bert.BertForPreTrainingOutput with Bert->Ernie # Copied from transformers.models.bert.modeling_bert.BertForPreTrainingOutput with Bert->Ernie
class ErnieForPreTrainingOutput(ModelOutput): class ErnieForPreTrainingOutput(ModelOutput):
""" r"""
Output type of [`ErnieForPreTraining`]. loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
Total loss as the sum of the masked language modeling loss and the next sequence prediction
Args: (classification) loss.
loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Total loss as the sum of the masked language modeling loss and the next sequence prediction Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
(classification) loss. seq_relationship_logits (`torch.FloatTensor` of shape `(batch_size, 2)`):
prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). before SoftMax).
seq_relationship_logits (`torch.FloatTensor` of shape `(batch_size, 2)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -53,59 +53,61 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class EsmForProteinFoldingOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`EsmForProteinFoldingOutput`]. Output type of [`EsmForProteinFoldingOutput`].
"""
Args: )
frames (`torch.FloatTensor`): class EsmForProteinFoldingOutput(ModelOutput):
Output frames. r"""
sidechain_frames (`torch.FloatTensor`): frames (`torch.FloatTensor`):
Output sidechain frames. Output frames.
unnormalized_angles (`torch.FloatTensor`): sidechain_frames (`torch.FloatTensor`):
Predicted unnormalized backbone and side chain torsion angles. Output sidechain frames.
angles (`torch.FloatTensor`): unnormalized_angles (`torch.FloatTensor`):
Predicted backbone and side chain torsion angles. Predicted unnormalized backbone and side chain torsion angles.
positions (`torch.FloatTensor`): angles (`torch.FloatTensor`):
Predicted positions of the backbone and side chain atoms. Predicted backbone and side chain torsion angles.
states (`torch.FloatTensor`): positions (`torch.FloatTensor`):
Hidden states from the protein folding trunk. Predicted positions of the backbone and side chain atoms.
s_s (`torch.FloatTensor`): states (`torch.FloatTensor`):
Per-residue embeddings derived by concatenating the hidden states of each layer of the ESM-2 LM stem. Hidden states from the protein folding trunk.
s_z (`torch.FloatTensor`): s_s (`torch.FloatTensor`):
Pairwise residue embeddings. Per-residue embeddings derived by concatenating the hidden states of each layer of the ESM-2 LM stem.
distogram_logits (`torch.FloatTensor`): s_z (`torch.FloatTensor`):
Input logits to the distogram used to compute residue distances. Pairwise residue embeddings.
lm_logits (`torch.FloatTensor`): distogram_logits (`torch.FloatTensor`):
Logits output by the ESM-2 protein language model stem. Input logits to the distogram used to compute residue distances.
aatype (`torch.FloatTensor`): lm_logits (`torch.FloatTensor`):
Input amino acids (AlphaFold2 indices). Logits output by the ESM-2 protein language model stem.
atom14_atom_exists (`torch.FloatTensor`): aatype (`torch.FloatTensor`):
Whether each atom exists in the atom14 representation. Input amino acids (AlphaFold2 indices).
residx_atom14_to_atom37 (`torch.FloatTensor`): atom14_atom_exists (`torch.FloatTensor`):
Mapping between atoms in the atom14 and atom37 representations. Whether each atom exists in the atom14 representation.
residx_atom37_to_atom14 (`torch.FloatTensor`): residx_atom14_to_atom37 (`torch.FloatTensor`):
Mapping between atoms in the atom37 and atom14 representations. Mapping between atoms in the atom14 and atom37 representations.
atom37_atom_exists (`torch.FloatTensor`): residx_atom37_to_atom14 (`torch.FloatTensor`):
Whether each atom exists in the atom37 representation. Mapping between atoms in the atom37 and atom14 representations.
residue_index (`torch.FloatTensor`): atom37_atom_exists (`torch.FloatTensor`):
The index of each residue in the protein chain. Unless internal padding tokens are used, this will just be Whether each atom exists in the atom37 representation.
a sequence of integers from 0 to `sequence_length`. residue_index (`torch.FloatTensor`):
lddt_head (`torch.FloatTensor`): The index of each residue in the protein chain. Unless internal padding tokens are used, this will just be
Raw outputs from the lddt head used to compute plddt. a sequence of integers from 0 to `sequence_length`.
plddt (`torch.FloatTensor`): lddt_head (`torch.FloatTensor`):
Per-residue confidence scores. Regions of low confidence may indicate areas where the model's prediction is Raw outputs from the lddt head used to compute plddt.
uncertain, or where the protein structure is disordered. plddt (`torch.FloatTensor`):
ptm_logits (`torch.FloatTensor`): Per-residue confidence scores. Regions of low confidence may indicate areas where the model's prediction is
Raw logits used for computing ptm. uncertain, or where the protein structure is disordered.
ptm (`torch.FloatTensor`): ptm_logits (`torch.FloatTensor`):
TM-score output representing the model's high-level confidence in the overall structure. Raw logits used for computing ptm.
aligned_confidence_probs (`torch.FloatTensor`): ptm (`torch.FloatTensor`):
Per-residue confidence scores for the aligned structure. TM-score output representing the model's high-level confidence in the overall structure.
predicted_aligned_error (`torch.FloatTensor`): aligned_confidence_probs (`torch.FloatTensor`):
Predicted error between the model's prediction and the ground truth. Per-residue confidence scores for the aligned structure.
max_predicted_aligned_error (`torch.FloatTensor`): predicted_aligned_error (`torch.FloatTensor`):
Per-sample maximum predicted error. Predicted error between the model's prediction and the ground truth.
max_predicted_aligned_error (`torch.FloatTensor`):
Per-sample maximum predicted error.
""" """
frames: Optional[torch.FloatTensor] = None frames: Optional[torch.FloatTensor] = None

View File

@ -492,24 +492,19 @@ class FalconMambaPreTrainedModel(PreTrainedModel):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Class for the FALCONMAMBA model outputs.
"""
)
# Copied from transformers.models.mamba.modeling_mamba.MambaOutput with MAMBA->FALCONMAMBA,Mamba->FalconMamba,FalconMambaCache->MambaCache # Copied from transformers.models.mamba.modeling_mamba.MambaOutput with MAMBA->FALCONMAMBA,Mamba->FalconMamba,FalconMambaCache->MambaCache
class FalconMambaOutput(ModelOutput): class FalconMambaOutput(ModelOutput):
""" r"""
Class for the FALCONMAMBA model outputs. cache_params (`MambaCache`):
The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
avoid providing the old `input_ids`.
Args: Includes both the State space model state matrices after the selective scan, and the Convolutional states
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
cache_params (`MambaCache`):
The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
avoid providing the old `input_ids`.
Includes both the State space model state matrices after the selective scan, and the Convolutional states
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -518,26 +513,23 @@ class FalconMambaOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Base class for causal language model (or autoregressive) outputs.
"""
)
# Copied from transformers.models.mamba.modeling_mamba.MambaCausalLMOutput with Mamba->FalconMamba,FalconMambaCache->MambaCache # Copied from transformers.models.mamba.modeling_mamba.MambaCausalLMOutput with Mamba->FalconMamba,FalconMambaCache->MambaCache
class FalconMambaCausalLMOutput(ModelOutput): class FalconMambaCausalLMOutput(ModelOutput):
""" r"""
Base class for causal language model (or autoregressive) outputs. loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
cache_params (`MambaCache`):
The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
avoid providing the old `input_ids`.
Args: Includes both the State space model state matrices after the selective scan, and the Convolutional states
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
cache_params (`MambaCache`):
The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
avoid providing the old `input_ids`.
Includes both the State space model state matrices after the selective scan, and the Convolutional states
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None


@ -35,46 +35,21 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class FastSpeech2ConformerModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`FastSpeech2ConformerModel`]. Output type of [`FastSpeech2ConformerModel`].
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): class FastSpeech2ConformerModelOutput(ModelOutput):
Spectrogram generation loss. r"""
spectrogram (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_bins)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
The predicted spectrogram. Spectrogram generation loss.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): duration_outputs (`torch.LongTensor` of shape `(batch_size, max_text_length + 1)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model. Outputs of the duration predictor.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): pitch_outputs (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + Outputs of the pitch predictor.
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. energy_outputs (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*):
Outputs of the energy predictor.
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
duration_outputs (`torch.LongTensor` of shape `(batch_size, max_text_length + 1)`, *optional*):
Outputs of the duration predictor.
pitch_outputs (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*):
Outputs of the pitch predictor.
energy_outputs (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*):
Outputs of the energy predictor.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -90,47 +65,23 @@ class FastSpeech2ConformerModelOutput(ModelOutput):
@dataclass @dataclass
class FastSpeech2ConformerWithHifiGanOutput(FastSpeech2ConformerModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`FastSpeech2ConformerWithHifiGan`]. Output type of [`FastSpeech2ConformerWithHifiGan`].
"""
Args: )
waveform (`torch.FloatTensor` of shape `(batch_size, audio_length)`): class FastSpeech2ConformerWithHifiGanOutput(FastSpeech2ConformerModelOutput):
Speech output as a result of passing the predicted mel spectrogram through the vocoder. r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Spectrogram generation loss. Spectrogram generation loss.
spectrogram (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_bins)`): duration_outputs (`torch.LongTensor` of shape `(batch_size, max_text_length + 1)`, *optional*):
The predicted spectrogram. Outputs of the duration predictor.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): pitch_outputs (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model. Outputs of the pitch predictor.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): energy_outputs (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + Outputs of the energy predictor.
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. waveform (`torch.FloatTensor` of shape `(batch_size, audio_length)`):
Speech output as a result of passing the predicted mel spectrogram through the vocoder.
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
duration_outputs (`torch.LongTensor` of shape `(batch_size, max_text_length + 1)`, *optional*):
Outputs of the duration predictor.
pitch_outputs (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*):
Outputs of the pitch predictor.
energy_outputs (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*):
Outputs of the energy predictor.
""" """
waveform: Optional[torch.FloatTensor] = None waveform: Optional[torch.FloatTensor] = None
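
The two FastSpeech2Conformer hunks above drop every entry (`spectrogram`, `encoder_last_hidden_state`, the encoder/decoder hidden-state and attention tuples) that is already covered by shared argument descriptions. The snippet below is an illustrative sketch of that de-duplication step, not the repository's actual docstring-checking utility; `CENTRALLY_DOCUMENTED` is a made-up stand-in for the shared table.

```python
# Illustrative sketch of removing docstring entries that are already documented
# centrally. NOT the repository's actual checking utility.
import re

CENTRALLY_DOCUMENTED = {
    "spectrogram",
    "encoder_last_hidden_state",
    "encoder_hidden_states",
    "encoder_attentions",
    "decoder_hidden_states",
    "decoder_attentions",
}


def drop_redundant_entries(docstring: str) -> str:
    kept, skipping = [], False
    for line in docstring.splitlines():
        header = re.match(r"\s*(\w+) \(", line)  # start of an "arg (...):" entry
        if header:
            skipping = header.group(1) in CENTRALLY_DOCUMENTED
        if not skipping:
            kept.append(line)
    return "\n".join(kept)


example = """\
    loss (`torch.FloatTensor` of shape `(1,)`, *optional*):
        Spectrogram generation loss.
    spectrogram (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_bins)`):
        The predicted spectrogram.
    duration_outputs (`torch.LongTensor`, *optional*):
        Outputs of the duration predictor.
"""
print(drop_redundant_entries(example))
```

Running it keeps the `loss` and `duration_outputs` entries and removes `spectrogram`, mirroring the manual edit in the hunks above.
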


@ -246,27 +246,28 @@ class FlaubertPredLayer(nn.Module):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Base class for outputs of question answering models using a [`~modeling_utils.FlaubertSQuADHead`].
"""
)
# Copied from transformers.models.xlm.modeling_xlm.XLMSquadHeadOutput with XLM->Flaubert # Copied from transformers.models.xlm.modeling_xlm.XLMSquadHeadOutput with XLM->Flaubert
class FlaubertSquadHeadOutput(ModelOutput): class FlaubertSquadHeadOutput(ModelOutput):
""" r"""
Base class for outputs of question answering models using a [`~modeling_utils.FlaubertSQuADHead`]. loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned if both `start_positions` and `end_positions` are provided):
Classification loss as the sum of start token, end token (and is_impossible if provided) classification
Args: losses.
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned if both `start_positions` and `end_positions` are provided): start_top_log_probs (`torch.FloatTensor` of shape `(batch_size, config.start_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
Classification loss as the sum of start token, end token (and is_impossible if provided) classification Log probabilities for the top config.start_n_top start token possibilities (beam-search).
losses. start_top_index (`torch.LongTensor` of shape `(batch_size, config.start_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
start_top_log_probs (`torch.FloatTensor` of shape `(batch_size, config.start_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided): Indices for the top config.start_n_top start token possibilities (beam-search).
Log probabilities for the top config.start_n_top start token possibilities (beam-search). end_top_log_probs (`torch.FloatTensor` of shape `(batch_size, config.start_n_top * config.end_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
start_top_index (`torch.LongTensor` of shape `(batch_size, config.start_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided): Log probabilities for the top `config.start_n_top * config.end_n_top` end token possibilities
Indices for the top config.start_n_top start token possibilities (beam-search). (beam-search).
end_top_log_probs (`torch.FloatTensor` of shape `(batch_size, config.start_n_top * config.end_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided): end_top_index (`torch.LongTensor` of shape `(batch_size, config.start_n_top * config.end_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
Log probabilities for the top `config.start_n_top * config.end_n_top` end token possibilities Indices for the top `config.start_n_top * config.end_n_top` end token possibilities (beam-search).
(beam-search). cls_logits (`torch.FloatTensor` of shape `(batch_size,)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
end_top_index (`torch.LongTensor` of shape `(batch_size, config.start_n_top * config.end_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided): Log probabilities for the `is_impossible` label of the answers.
Indices for the top `config.start_n_top * config.end_n_top` end token possibilities (beam-search).
cls_logits (`torch.FloatTensor` of shape `(batch_size,)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
Log probabilities for the `is_impossible` label of the answers.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -815,6 +816,14 @@ class FlaubertModel(FlaubertPreTrainedModel):
return_dict: Optional[bool] = None, return_dict: Optional[bool] = None,
) -> Union[tuple, BaseModelOutput]: ) -> Union[tuple, BaseModelOutput]:
r""" r"""
langs (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
A parallel sequence of tokens to be used to indicate the language of each token in the input. Indices are
languages ids which can be obtained from the language names by using two conversion mappings provided in
the configuration of the model (only provided for multilingual models). More precisely, the *language name
to language id* mapping is in `model.config.lang2id` (which is a dictionary string to int) and the
*language id to language name* mapping is in `model.config.id2lang` (dictionary int to string).
See usage examples detailed in the [multilingual documentation](../multilingual).
lengths (`torch.LongTensor` of shape `(batch_size,)`, *optional*): lengths (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Length of each sentence that can be used to avoid performing attention on padding token indices. You can Length of each sentence that can be used to avoid performing attention on padding token indices. You can
also use `attention_mask` for the same result (see above), kept here for compatibility. Indices selected in also use `attention_mask` for the same result (see above), kept here for compatibility. Indices selected in
@ -824,14 +833,6 @@ class FlaubertModel(FlaubertPreTrainedModel):
attention blocks) as computed by the model (see `cache` output below). Can be used to speed up sequential attention blocks) as computed by the model (see `cache` output below). Can be used to speed up sequential
decoding. The dictionary object will be modified in-place during the forward pass to add newly computed decoding. The dictionary object will be modified in-place during the forward pass to add newly computed
hidden-states. hidden-states.
langs (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
A parallel sequence of tokens to be used to indicate the language of each token in the input. Indices are
languages ids which can be obtained from the language names by using two conversion mappings provided in
the configuration of the model (only provided for multilingual models). More precisely, the *language name
to language id* mapping is in `model.config.lang2id` (which is a dictionary string to int) and the
*language id to language name* mapping is in `model.config.id2lang` (dictionary int to string).
See usage examples detailed in the [multilingual documentation](../multilingual).
""" """
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = ( output_hidden_states = (
@ -1040,6 +1041,14 @@ class FlaubertWithLMHeadModel(FlaubertPreTrainedModel, GenerationMixin):
return_dict: Optional[bool] = None, return_dict: Optional[bool] = None,
) -> Union[tuple, MaskedLMOutput]: ) -> Union[tuple, MaskedLMOutput]:
r""" r"""
langs (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
A parallel sequence of tokens to be used to indicate the language of each token in the input. Indices are
languages ids which can be obtained from the language names by using two conversion mappings provided in
the configuration of the model (only provided for multilingual models). More precisely, the *language name
to language id* mapping is in `model.config.lang2id` (which is a dictionary string to int) and the
*language id to language name* mapping is in `model.config.id2lang` (dictionary int to string).
See usage examples detailed in the [multilingual documentation](../multilingual).
lengths (`torch.LongTensor` of shape `(batch_size,)`, *optional*): lengths (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Length of each sentence that can be used to avoid performing attention on padding token indices. You can Length of each sentence that can be used to avoid performing attention on padding token indices. You can
also use `attention_mask` for the same result (see above), kept here for compatibility. Indices selected in also use `attention_mask` for the same result (see above), kept here for compatibility. Indices selected in
@ -1053,14 +1062,6 @@ class FlaubertWithLMHeadModel(FlaubertPreTrainedModel, GenerationMixin):
Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
`labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100` `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]` are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]`
langs (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
A parallel sequence of tokens to be used to indicate the language of each token in the input. Indices are
languages ids which can be obtained from the language names by using two conversion mappings provided in
the configuration of the model (only provided for multilingual models). More precisely, the *language name
to language id* mapping is in `model.config.lang2id` (which is a dictionary string to int) and the
*language id to language name* mapping is in `model.config.id2lang` (dictionary int to string).
See usage examples detailed in the [multilingual documentation](../multilingual).
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
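
Both Flaubert hunks above reorder the documented arguments (`langs` is now described before `lengths`) so that the docstring follows the order of the `forward` signature. The self-contained sketch below illustrates that reordering idea; it is not the library's actual docstring-consistency check, just an illustration built on `inspect.signature` and a simple regex.

```python
# Illustrative sketch of reordering docstring entries to follow the forward
# signature, which is the effect of the two Flaubert hunks above.
# NOT the library's actual docstring-consistency check.
import inspect
import re


def reorder_doc_entries(func, docstring: str) -> str:
    entries, current = {}, None
    for line in docstring.splitlines():
        header = re.match(r"\s*(\w+) \(", line)  # start of an "arg (...):" entry
        if header:
            current = header.group(1)
            entries[current] = []
        if current is not None:
            entries[current].append(line)
    signature_order = list(inspect.signature(func).parameters)
    ordered = [name for name in signature_order if name in entries]
    ordered += [name for name in entries if name not in signature_order]
    return "\n".join("\n".join(entries[name]) for name in ordered)


def forward(input_ids=None, langs=None, lengths=None):  # stand-in signature
    ...


doc = """\
    lengths (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
        Length of each sentence, used to avoid attention on padding tokens.
    langs (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
        Language id of each input token (see the multilingual documentation).
"""
print(reorder_doc_entries(forward, doc))
```

For the toy `forward` defined here, the entries come back in `langs`, `lengths` order, which is exactly the change applied in these hunks.
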
@ -1413,37 +1414,28 @@ class FlaubertForQuestionAnsweringSimple(FlaubertPreTrainedModel):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Base class for outputs of question answering models using a `SquadHead`.
"""
)
# Copied from transformer.models.xlm.modeling_xlm.XLMForQuestionAnsweringOutput with XLM->Flaubert # Copied from transformer.models.xlm.modeling_xlm.XLMForQuestionAnsweringOutput with XLM->Flaubert
class FlaubertForQuestionAnsweringOutput(ModelOutput): class FlaubertForQuestionAnsweringOutput(ModelOutput):
""" r"""
Base class for outputs of question answering models using a `SquadHead`. loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned if both `start_positions` and `end_positions` are provided):
Classification loss as the sum of start token, end token (and is_impossible if provided) classification
Args: losses.
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned if both `start_positions` and `end_positions` are provided): start_top_log_probs (`torch.FloatTensor` of shape `(batch_size, config.start_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
Classification loss as the sum of start token, end token (and is_impossible if provided) classification Log probabilities for the top config.start_n_top start token possibilities (beam-search).
losses. start_top_index (`torch.LongTensor` of shape `(batch_size, config.start_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
start_top_log_probs (`torch.FloatTensor` of shape `(batch_size, config.start_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided): Indices for the top config.start_n_top start token possibilities (beam-search).
Log probabilities for the top config.start_n_top start token possibilities (beam-search). end_top_log_probs (`torch.FloatTensor` of shape `(batch_size, config.start_n_top * config.end_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
start_top_index (`torch.LongTensor` of shape `(batch_size, config.start_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided): Log probabilities for the top `config.start_n_top * config.end_n_top` end token possibilities
Indices for the top config.start_n_top start token possibilities (beam-search). (beam-search).
end_top_log_probs (`torch.FloatTensor` of shape `(batch_size, config.start_n_top * config.end_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided): end_top_index (`torch.LongTensor` of shape `(batch_size, config.start_n_top * config.end_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
Log probabilities for the top `config.start_n_top * config.end_n_top` end token possibilities Indices for the top `config.start_n_top * config.end_n_top` end token possibilities (beam-search).
(beam-search). cls_logits (`torch.FloatTensor` of shape `(batch_size,)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
end_top_index (`torch.LongTensor` of shape `(batch_size, config.start_n_top * config.end_n_top)`, *optional*, returned if `start_positions` or `end_positions` is not provided): Log probabilities for the `is_impossible` label of the answers.
Indices for the top `config.start_n_top * config.end_n_top` end token possibilities (beam-search).
cls_logits (`torch.FloatTensor` of shape `(batch_size,)`, *optional*, returned if `start_positions` or `end_positions` is not provided):
Log probabilities for the `is_impossible` label of the answers.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None


@ -49,27 +49,29 @@ FlavaPossibleConfigs = Union[FlavaTextConfig, FlavaImageConfig, FlavaMultimodalC
@dataclass @dataclass
class FlavaModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output from FlavaModel containing embeddings and outputs from individual encoders. Output from FlavaModel containing embeddings and outputs from individual encoders.
Note that `image_embeddings` and `text_embeddings` returned are similar to pooled output returned from a Note that `image_embeddings` and `text_embeddings` returned are similar to pooled output returned from a
transformer. If you want embeddings for contrastive loss or retrieval use a FLAVA model's `image_projection` and transformer. If you want embeddings for contrastive loss or retrieval use a FLAVA model's `image_projection` and
`text_projection` layers on `image_embeddings` and `text_embeddings` respectively. `text_projection` layers on `image_embeddings` and `text_embeddings` respectively.
"""
Args: )
image_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `pixel_values` are present): class FlavaModelOutput(ModelOutput):
The image embeddings which are basically the pooled output of [`FlavaImageModel`]. r"""
image_output (`BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present): image_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `pixel_values` are present):
The output of the [`FlavaImageModel`]. The image embeddings which are basically the pooled output of [`FlavaImageModel`].
text_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` are present): image_output (`BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present):
The text embeddings which are basically the pooled output of [`FlavaTextModel`]. The output of the [`FlavaImageModel`].
text_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids` are present): text_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` are present):
The output of the [`FlavaTextModel`]. The text embeddings which are basically the pooled output of [`FlavaTextModel`].
multimodal_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` and `pixel_values` are present and `skip_multimodal_encoder` is `None` or `False`): text_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids` are present):
The multimodal embeddings which are basically the pooled output of [`FlavaTextModel`]. The output of the [`FlavaTextModel`].
multimodal_output (`BaseModelOutputWithPooling`, returned when `input_ids` and `pixel_values` are present and `skip_multimodal_encoder` is `None` or `False`): multimodal_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` and `pixel_values` are present and `skip_multimodal_encoder` is `None` or `False`):
The output of the [`FlavaMultimodalModel`]. The multimodal embeddings which are basically the pooled output of [`FlavaTextModel`].
multimodal_output (`BaseModelOutputWithPooling`, returned when `input_ids` and `pixel_values` are present and `skip_multimodal_encoder` is `None` or `False`):
The output of the [`FlavaMultimodalModel`].
""" """
image_embeddings: Optional[torch.FloatTensor] = None image_embeddings: Optional[torch.FloatTensor] = None
@ -87,24 +89,27 @@ class FlavaModelOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Class representing pretraining losses from FLAVA model
"""
)
class FlavaLosses(ModelOutput): class FlavaLosses(ModelOutput):
"""Class representing pretraining losses from FLAVA model r"""
mim (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `mim_labels` and `pixel_values` are present, `input_ids_masked` is absent and `mim_weight` > 0.):
Args: Masked Image Modeling loss as used in BeIT calculated only for unimodal image data.
mim (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `mim_labels` and `pixel_values` are present, `input_ids_masked` is absent and `mim_weight` > 0.: mlm (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `mlm_labels` and `input_ids_masked` are present, `pixel_values` is absent and `mlm_weight` > 0.):
Masked Image Modeling loss as used in BeIT calculated only for unimodal image data. Masked Language Modeling loss as used in BERT calculated only for unimodal text data.
mlm (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `mlm_labels` and `input_ids_masked` are present, `pixel_values` is absent and `mlm_weight` > 0.: itm (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `itm_labels`, `input_ids_masked`, `pixel_values` are present and `itm_weight` > 0.):
Masked Language Modeling loss as used in BERT calculated only for unimodal text data. Image Text Matching (ITM) loss calculated for paired image-text data. Note that ITM loss is calculated on
itm (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `itm_labels`, `input_ids_masked`, `pixel_values` are present and `itm_weight` > 0.: masked pairs in FLAVA.
Image Text Matching (ITM) loss calculated for paired image-text data. Note that ITM loss is calculated on global_contrastive (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `input_ids` and `pixel_values` are present and `global_contrastive_weight` > 0.):
masked pairs in FLAVA. Contrastive loss for image-text similarity similar to CLIP but calculated globally for paired image-text
global_contrastive (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `input_ids` and `pixel_values` are present and `global_contrastive_weight` > 0.: data. This is calculated on unmasked images and texts.
Contrastive loss for image-text similarity similar to CLIP but calculated globally for paired image-text mmm_image (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `mim_labels`, `pixel_values` and `input_ids_masked` are present and `mmm_image_weight` > 0.):
data. This is calculated on unmasked images and texts. Masked Multimodal Modeling loss's image component calculated on paired image-text data.
mmm_image (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `mim_labels`, `pixel_values` and `input_ids_masked` are present and `mmm_image_weight` > 0.: mmm_text (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `mlm_labels`, `pixel_values` and `input_ids_masked` are present and `mmm_text_weight` > 0.):
Masked Multimodal Modeling loss's image component calculated on paired image-text data. Masked Multimodal Modeling loss's text component calculated on paired image-text data.
mmm_text (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `mlm_labels`, `pixel_values` and `input_ids_masked` are present and `mmm_text_weight` > 0.:
Masked Multimodal Modeling loss's text component calculated on paired image-text data.
""" """
mim: Optional[torch.FloatTensor] = None mim: Optional[torch.FloatTensor] = None
@ -124,69 +129,69 @@ class FlavaLosses(ModelOutput):
@dataclass @dataclass
class FlavaForPreTrainingOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output from FlavaForPreTraining containing embeddings, and outputs from individual encoders. Output from FlavaForPreTraining containing embeddings, and outputs from individual encoders.
Note that `image_embeddings` and `text_embeddings` returned are similar to pooled output returned from a Note that `image_embeddings` and `text_embeddings` returned are similar to pooled output returned from a
transformer. If you want embeddings for contrastive loss or retrieval use a FLAVA model's `image_projection` and transformer. If you want embeddings for contrastive loss or retrieval use a FLAVA model's `image_projection` and
`text_projection` layers on `image_embeddings` and `text_embeddings` respectively. `text_projection` layers on `image_embeddings` and `text_embeddings` respectively.
"""
Args: )
loss (`torch.FloatTensor`, *optional*, returned when `return_loss` is True): class FlavaForPreTrainingOutput(ModelOutput):
Total loss calculated for this model. r"""
loss_info (`FlavaLosses`): loss (`torch.FloatTensor`, *optional*, returned when `return_loss` is True):
Detailed info for FLAVA Pretraining losses. Check `FlavaLosses` class description for the information on Total loss calculated for this model.
the keys. loss_info (`FlavaLosses`):
image_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `pixel_values` are present): Detailed info for FLAVA Pretraining losses. Check `FlavaLosses` class description for the information on
The image embeddings which are basically the pooled output of [`FlavaImageModel`]. the keys.
image_output (`BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present): image_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `pixel_values` are present):
The output of the [`FlavaImageModel`]. The image embeddings which are basically the pooled output of [`FlavaImageModel`].
text_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` are present): image_output (`BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present):
The text embeddings which are basically the pooled output of [`FlavaTextModel`]. The output of the [`FlavaImageModel`].
text_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids` are present): text_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` are present):
The output of the [`FlavaTextModel`]. The text embeddings which are basically the pooled output of [`FlavaTextModel`].
multimodal_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` and `pixel_values` are present and `skip_unmasked_multimodal_encoder` is `None` or `False`): text_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids` are present):
The multimodal embeddings which are basically the pooled output of [`FlavaTextModel`]. The output of the [`FlavaTextModel`].
multimodal_output (`BaseModelOutputWithPooling`, returned when `input_ids` and `pixel_values` are present and `skip_unmasked_multimodal_encoder` is `None` or `False`): multimodal_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` and `pixel_values` are present and `skip_unmasked_multimodal_encoder` is `None` or `False`):
The output of the [`FlavaMultimodalModel`]. The multimodal embeddings which are basically the pooled output of [`FlavaTextModel`].
multimodal_output (`BaseModelOutputWithPooling`, returned when `input_ids` and `pixel_values` are present and `skip_unmasked_multimodal_encoder` is `None` or `False`):
image_masked_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `pixel_values` are present): The output of the [`FlavaMultimodalModel`].
The image embeddings which are basically the pooled output of [`FlavaImageModel`]. Uses `bool_masked_pos` image_masked_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `pixel_values` are present):
to create masked images. The image embeddings which are basically the pooled output of [`FlavaImageModel`]. Uses `bool_masked_pos`
image_masked_output (`BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present): to create masked images.
The output of the [`FlavaImageModel`]. Uses `bool_masked_pos` to create masked images. image_masked_output (`BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present):
text_masked_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids_masked` are present): The output of the [`FlavaImageModel`]. Uses `bool_masked_pos` to create masked images.
The text embeddings which are basically the pooled output of [`FlavaTextModel`]. text_masked_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids_masked` are present):
text_masked_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids_masked` are present): The text embeddings which are basically the pooled output of [`FlavaTextModel`].
The output of the [`FlavaTextModel`]. text_masked_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids_masked` are present):
multimodal_masked_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` and `pixel_values` are present): The output of the [`FlavaTextModel`].
The multimodal embeddings which are basically the pooled output of [`FlavaTextModel`]. multimodal_masked_embeddings (`torch.FloatTensor` of shape `(batch_size, output_dim)`, *optional*, returned when `input_ids` and `pixel_values` are present):
multimodal_masked_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids_masked` and `pixel_values` are present): The multimodal embeddings which are basically the pooled output of [`FlavaTextModel`].
The output of the [`FlavaMultimodalModel`]. multimodal_masked_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids_masked` and `pixel_values` are present):
The output of the [`FlavaMultimodalModel`].
mim_logits (`torch.FloatTensor` of shape `(batch_size, num_image_patches, image_vocab_size)` or of shape `(total_masked_patches, image_vocab_size)` , *optional*, returned when `pixel_values` are present and `input_ids_masked` are not): mim_logits (`torch.FloatTensor` of shape `(batch_size, num_image_patches, image_vocab_size)` or of shape `(total_masked_patches, image_vocab_size)` , *optional*, returned when `pixel_values` are present and `input_ids_masked` are not):
The logits for MIM unimodal loss. Uses `bool_masked_pos` to get masked patches. The flattened output is The logits for MIM unimodal loss. Uses `bool_masked_pos` to get masked patches. The flattened output is
returned when `bool_masked_pos` has some of the patches masked. returned when `bool_masked_pos` has some of the patches masked.
mlm_logits (`torch.FloatTensor` of shape `(batch_size, text_seq_length, text_vocab_size)` or of shape `(total_masked_seq_length, text_vocab_size)`, *optional*, returned when `input_ids_masked` are present and `pixel_values` are not): mlm_logits (`torch.FloatTensor` of shape `(batch_size, text_seq_length, text_vocab_size)` or of shape `(total_masked_seq_length, text_vocab_size)`, *optional*, returned when `input_ids_masked` are present and `pixel_values` are not):
The logits for MLM unimodal loss. The flattened output is returned when `input_ids_masked` has some of The logits for MLM unimodal loss. The flattened output is returned when `input_ids_masked` has some of
the tokens masked. the tokens masked.
itm_logits (`torch.FloatTensor` of shape `(batch_size, 2)`, *optional*, returned when `input_ids_masked` and `pixel_values` are present): itm_logits (`torch.FloatTensor` of shape `(batch_size, 2)`, *optional*, returned when `input_ids_masked` and `pixel_values` are present):
The logits for ITM loss. Note that ITM loss is calculated on masked pairs in FLAVA. The logits for ITM loss. Note that ITM loss is calculated on masked pairs in FLAVA.
mmm_image_logits (`torch.FloatTensor` of shape `(batch_size, num_image_patches, image_vocab_size)` or of shape`(total_masked_patches, image_vocab_size)`, *optional*, returned when `pixel_values` and `input_ids_masked` are present): contrastive_logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
The logits for MMM image multimodal loss. Uses `bool_masked_pos` to get masked patches. The flattened The scaled dot product scores between `image_embeddings` and `text_embeddings` but passed through FLAVA's
output is returned when `bool_masked_pos` has some of the patches masked. `image_projection` and `text_projection` layers respectively. This represents the image-text similarity
mmm_text_logits (`torch.FloatTensor` of shape `(batch_size, text_seq_length, text_vocab_size)` or of shape `(`(total_masked_seq_length, text_vocab_size)`), *optional*, returned when `pixel_values` and `input_ids_masked` are present): scores. This is calculated on unmasked images and texts.
The logits for MMM text multimodal loss. The flattened output is returned when `input_ids_masked` has contrastive_logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
some of the tokens masked. The scaled dot product scores between `text_embeddings` and `image_embeddings` but passed through FLAVA's
contrastive_logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`): `text_projection` and `image_projection` layers respectively. This is calculated on unmasked images and
The scaled dot product scores between `image_embeddings` and `text_embeddings` but passed through FLAVA's texts.
`image_projection` and `text_projection` layers respectively. This represents the image-text similarity mmm_image_logits (`torch.FloatTensor` of shape `(batch_size, num_image_patches, image_vocab_size)` or of shape`(total_masked_patches, image_vocab_size)`, *optional*, returned when `pixel_values` and `input_ids_masked` are present):
scores. This is calculated on unmasked images and texts. The logits for MMM image multimodal loss. Uses `bool_masked_pos` to get masked patches. The flattened
contrastive_logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`): output is returned when `bool_masked_pos` has some of the patches masked.
The scaled dot product scores between `text_embeddings` and `image_embeddings` but passed through FLAVA's mmm_text_logits (`torch.FloatTensor` of shape `(batch_size, text_seq_length, text_vocab_size)` or of shape `(`(total_masked_seq_length, text_vocab_size)`), *optional*, returned when `pixel_values` and `input_ids_masked` are present):
`text_projection` and `image_projection` layers respectively. This is calculated on unmasked images and The logits for MMM text multimodal loss. The flattened output is returned when `input_ids_masked` has
texts. some of the tokens masked.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -1207,12 +1212,12 @@ class FlavaModel(FlavaPreTrainedModel):
[What are token type IDs?](../glossary#token-type-ids) [What are token type IDs?](../glossary#token-type-ids)
bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, image_num_patches)`): bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, image_num_patches)`):
Boolean masked positions. Indicates which patches are masked (1) and which aren't (0). Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
skip_multimodal_encoder (*bool*, *optional*):
Skip any calculations for multimodal encoder. Useful if multimodal encoding is not going to be used.
image_attention_mask (`torch.Tensor` of shape `(batch_size, image_num_patches)`, *optional*): image_attention_mask (`torch.Tensor` of shape `(batch_size, image_num_patches)`, *optional*):
Mask to avoid performing attention on padding pixel values for image inputs. Mask values selected in `[0, 1]`: Mask to avoid performing attention on padding pixel values for image inputs. Mask values selected in `[0, 1]`:
- 1 for pixel values that are real (i.e., **not masked**), - 1 for pixel values that are real (i.e., **not masked**),
- 0 for pixel values that are padding (i.e., **masked**). - 0 for pixel values that are padding (i.e., **masked**).
skip_multimodal_encoder (*bool*, *optional*):
Skip any calculations for multimodal encoder. Useful if multimodal encoding is not going to be used.
Examples: Examples:
@ -1681,6 +1686,8 @@ class FlavaForPreTraining(FlavaPreTrainedModel):
to be used with MLM. Indices can be obtained using [`AutoTokenizer`] along with to be used with MLM. Indices can be obtained using [`AutoTokenizer`] along with
[`DataCollatorForMaskedLanguageModeling`]. See [`PreTrainedTokenizer.encode`] and [`DataCollatorForMaskedLanguageModeling`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details. [What are input IDs?](../glossary#input-ids) [`PreTrainedTokenizer.__call__`] for details. [What are input IDs?](../glossary#input-ids)
codebook_pixel_values (`torch.FloatTensor` of shape `(batch_size, num_image_patches, patch_size, patch_size, 3)`, *optional*):
Pixel values for image patches that are used to compute the image codebook labels for masked image modeling.
token_type_ids (`torch.LongTensor` of shape `(batch_size, text_seq_len)`, *optional*): token_type_ids (`torch.LongTensor` of shape `(batch_size, text_seq_len)`, *optional*):
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
1]`: 1]`:
@ -1714,8 +1721,6 @@ class FlavaForPreTraining(FlavaPreTrainedModel):
The pairs with 0 will be skipped for calculation of MMM and global contrastive losses as well. The pairs with 0 will be skipped for calculation of MMM and global contrastive losses as well.
return_loss (`bool`, *optional*, default to None): return_loss (`bool`, *optional*, default to None):
Whether to return calculated loss or not. Whether to return calculated loss or not.
codebook_pixel_values (`torch.FloatTensor` of shape `(batch_size, num_image_patches, patch_size, patch_size, 3)`, *optional*):
Pixel values for image patches that are used to compute the image codebook labels for masked image modeling.
Examples: Examples:
```python ```python


@ -409,23 +409,21 @@ class FNetPreTrainedModel(PreTrainedModel):
@dataclass @dataclass
class FNetForPreTrainingOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`FNetForPreTraining`]. Output type of [`FNetForPreTraining`].
"""
Args: )
loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): class FNetForPreTrainingOutput(ModelOutput):
Total loss as the sum of the masked language modeling loss and the next sequence prediction r"""
(classification) loss. loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): Total loss as the sum of the masked language modeling loss and the next sequence prediction
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). (classification) loss.
seq_relationship_logits (`torch.FloatTensor` of shape `(batch_size, 2)`): prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
before SoftMax). seq_relationship_logits (`torch.FloatTensor` of shape `(batch_size, 2)`):
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of before SoftMax).
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
plus the initial embedding outputs.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
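
However the docstrings are produced, classes like `FNetForPreTrainingOutput` all share the `ModelOutput` runtime behaviour: attribute access, dict-style access, and a tuple view that skips unset fields. Below is a minimal sketch of that behaviour, assuming `torch` and `transformers` are installed; `ToyPreTrainingOutput` is an invented example class, not one of the classes touched by this diff.

```python
# Minimal sketch of ModelOutput access semantics (assumes torch and transformers
# are installed). ToyPreTrainingOutput is an invented class for illustration.
from dataclasses import dataclass
from typing import Optional

import torch

from transformers.utils import ModelOutput


@dataclass
class ToyPreTrainingOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    prediction_logits: Optional[torch.FloatTensor] = None
    seq_relationship_logits: Optional[torch.FloatTensor] = None


out = ToyPreTrainingOutput(prediction_logits=torch.zeros(1, 4, 10))

print(out.prediction_logits.shape)     # attribute access
print(out["prediction_logits"].shape)  # dict-style access
print(out.to_tuple())                  # tuple view skips fields left as None
print(out.loss)                        # unset optional fields stay None
```
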


@ -37,25 +37,19 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class FocalNetEncoderOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
FocalNet encoder's outputs, with potential hidden states. FocalNet encoder's outputs, with potential hidden states.
"""
)
class FocalNetEncoderOutput(ModelOutput):
r"""
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): include the spatial dimensions.
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -64,26 +58,21 @@ class FocalNetEncoderOutput(ModelOutput):
@dataclass @dataclass
class FocalNetModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
FocalNet model's outputs that also contains a pooling of the last hidden states. FocalNet model's outputs that also contains a pooling of the last hidden states.
"""
)
class FocalNetModelOutput(ModelOutput):
r"""
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`, *optional*, returned when `add_pooling_layer=True` is passed):
Average pooling of the last layer hidden-state.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): include the spatial dimensions.
Sequence of hidden-states at the output of the last layer of the model.
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`, *optional*, returned when `add_pooling_layer=True` is passed):
Average pooling of the last layer hidden-state.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -93,26 +82,23 @@ class FocalNetModelOutput(ModelOutput):
@dataclass @dataclass
class FocalNetMaskedImageModelingOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
FocalNet masked image model outputs. FocalNet masked image model outputs.
"""
)
class FocalNetMaskedImageModelingOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `bool_masked_pos` is provided):
Masked image modeling (MLM) loss.
reconstruction (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Reconstructed pixel values.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `bool_masked_pos` is provided): include the spatial dimensions.
Masked image modeling (MLM) loss.
reconstruction (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Reconstructed pixel values.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -122,26 +108,23 @@ class FocalNetMaskedImageModelingOutput(ModelOutput):
@dataclass @dataclass
class FocalNetImageClassifierOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
FocalNet outputs for image classification. FocalNet outputs for image classification.
"""
)
class FocalNetImageClassifierOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Classification (or regression if config.num_labels==1) loss.
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): include the spatial dimensions.
Classification (or regression if config.num_labels==1) loss.
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, hidden_size, height, width)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None


@ -804,26 +804,17 @@ class FunnelClassificationHead(nn.Module):
@dataclass @dataclass
class FunnelForPreTrainingOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`FunnelForPreTraining`]. Output type of [`FunnelForPreTraining`].
"""
Args: )
loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): class FunnelForPreTrainingOutput(ModelOutput):
Total loss of the ELECTRA-style objective. r"""
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`): loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
Prediction scores of the head (scores for each token before SoftMax). Total loss of the ELECTRA-style objective.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Prediction scores of the head (scores for each token before SoftMax).
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None


@ -290,12 +290,12 @@ class FuyuForCausalLM(FuyuPreTrainedModel, GenerationMixin):
image_patches (`torch.FloatTensor` of shape `(batch_size, num_total_patches, patch_size_ x patch_size x num_channels)`, *optional*): image_patches (`torch.FloatTensor` of shape `(batch_size, num_total_patches, patch_size_ x patch_size x num_channels)`, *optional*):
Image patches to be used as continuous embeddings. The patches are flattened and then projected to the Image patches to be used as continuous embeddings. The patches are flattened and then projected to the
hidden size of the model. hidden size of the model.
image_patches_indices (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Tensor of indices of the image patches in the input_ids tensor.
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.text_config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored config.text_config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.text_config.vocab_size]`. (masked), the loss is only computed for the tokens with labels in `[0, ..., config.text_config.vocab_size]`.
image_patches_indices (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Tensor of indices of the image patches in the input_ids tensor.
Examples: Examples:
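The Fuyu hunk above (like the GraniteSpeech one further down) does not change any wording: it only moves `image_patches_indices` below `labels`, presumably so the hand-written entries follow the order the doc tooling expects. As a self-contained illustration of that kind of check (this is not the repository's actual utility, and the reference order below is hypothetical), a few lines of Python are enough to extract the documented argument names and compare them against an expected order:

import re


def documented_args(docstring: str) -> list[str]:
    # Argument names are the identifiers that open a "name (`type` ...):" entry.
    return re.findall(r"^ *(\w+) \(", docstring, flags=re.MULTILINE)


def expected_order(docstring: str, reference: list[str]) -> list[str]:
    found = [name for name in documented_args(docstring) if name in reference]
    target = [name for name in reference if name in found]
    return [] if found == target else target


doc = """
    image_patches (`torch.FloatTensor`, *optional*):
        Flattened image patches.
    image_patches_indices (`torch.LongTensor`, *optional*):
        Indices of the image patches in `input_ids`.
    labels (`torch.LongTensor`, *optional*):
        Labels for computing the loss.
"""

print(expected_order(doc, ["image_patches", "labels", "image_patches_indices"]))
# -> ['image_patches', 'labels', 'image_patches_indices'], i.e. the order the block should be rewritten to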

View File

@ -48,68 +48,48 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class Gemma3ModelOutputWithPast(BaseModelOutputWithPast): @auto_docstring(
""" custom_intro="""
Base class for Gemma3 outputs, with hidden states and attentions. Base class for Gemma3 outputs, with hidden states and attentions.
"""
)
class Gemma3ModelOutputWithPast(BaseModelOutputWithPast):
r"""
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): `past_key_values` input) to speed up sequential decoding.
Sequence of hidden-states at the output of the last layer of the model. image_hidden_states (`torch.FloatTensor`, *optional*):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
image_hidden_states: Optional[torch.FloatTensor] = None image_hidden_states: Optional[torch.FloatTensor] = None
@dataclass @dataclass
class Gemma3CausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Gemma3 causal language model (or autoregressive) outputs. Base class for Gemma3 causal language model (or autoregressive) outputs.
"""
)
class Gemma3CausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.text_config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). image_hidden_states (`torch.FloatTensor`, *optional*):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.text_config.vocab_size)`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). image_hidden_states of the model produced by the vision encoder after projecting last hidden state.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder after projecting last hidden state.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
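Only the model-specific fields (`past_key_values` with its caching caveats, `image_hidden_states`) keep hand-written entries here; everything else is documented by the decorator. Independently of how the docstring is produced, the returned object behaves like any `ModelOutput`: attribute, key, and positional access all work, and fields left as `None` are dropped from the tuple form. A minimal sketch with a toy output class (assuming a working `transformers` install; the class and field names are made up):

from dataclasses import dataclass
from typing import Optional

import torch
from transformers.utils import ModelOutput


@dataclass
class ToyOutputWithPast(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    logits: Optional[torch.FloatTensor] = None
    image_hidden_states: Optional[torch.FloatTensor] = None


out = ToyOutputWithPast(logits=torch.zeros(1, 4), image_hidden_states=torch.ones(1, 2))

print(out.loss)             # None: unset fields read as None via attribute access
print(out["logits"].shape)  # key access, like a dict
print(out[0].shape)         # positional access skips None fields, so this is `logits`
print(len(out.to_tuple()))  # 2: `loss` is dropped from the tuple form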

View File

@ -15,7 +15,6 @@
# limitations under the License. # limitations under the License.
import copy import copy
from collections.abc import Callable from collections.abc import Callable
from dataclasses import dataclass
from typing import Any, Optional, Union from typing import Any, Optional, Union
import torch import torch
@ -346,12 +345,10 @@ class Gemma3Config(PretrainedConfig):
super().__init__(**kwargs) super().__init__(**kwargs)
@dataclass
class Gemma3ModelOutputWithPast(PaligemmaModelOutputWithPast): class Gemma3ModelOutputWithPast(PaligemmaModelOutputWithPast):
pass pass
@dataclass
class Gemma3CausalLMOutputWithPast(PaligemmaCausalLMOutputWithPast): class Gemma3CausalLMOutputWithPast(PaligemmaCausalLMOutputWithPast):
pass pass
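In the modular file the two classes reduce to bare subclasses, so the local `@dataclass` decorators and the `dataclasses` import can go: a subclass that adds no new fields simply inherits the generated `__init__` and field list from its parent. A plain-stdlib illustration of that behaviour:

from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ParentOutput:
    loss: Optional[float] = None
    logits: Optional[list] = None


class ChildOutput(ParentOutput):  # no new fields, so no @dataclass needed
    pass


child = ChildOutput(loss=0.5, logits=[1, 2, 3])
print(child.loss, child.logits)               # 0.5 [1, 2, 3]
print([f.name for f in fields(ChildOutput)])  # ['loss', 'logits'] (inherited unchanged)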

View File

@ -49,27 +49,16 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
"""
)
# Copied from transformers.models.clip.modeling_clip.CLIPVisionModelOutput with CLIP->Git # Copied from transformers.models.clip.modeling_clip.CLIPVisionModelOutput with CLIP->Git
class GitVisionModelOutput(ModelOutput): class GitVisionModelOutput(ModelOutput):
""" r"""
Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states. image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
The image embeddings obtained by applying the projection layer to the pooler_output.
Args:
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
The image embeddings obtained by applying the projection layer to the pooler_output.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
image_embeds: Optional[torch.FloatTensor] = None image_embeds: Optional[torch.FloatTensor] = None

View File

@ -289,27 +289,16 @@ class GotOcr2VisionLayer(GradientCheckpointingLayer):
@dataclass @dataclass
class GotOcr2VisionEncoderOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for got_ocr2 vision model's outputs that also contains image embeddings obtained by applying the projection Base class for got_ocr2 vision model's outputs that also contains image embeddings obtained by applying the projection
layer to the pooler_output. layer to the pooler_output.
"""
Args: )
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`): class GotOcr2VisionEncoderOutput(ModelOutput):
The image embeddings obtained by applying the projection layer to the pooler_output. r"""
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
Sequence of hidden-states at the output of the last layer of the model. The image embeddings obtained by applying the projection layer to the pooler_output.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
image_embeds: Optional[torch.FloatTensor] = None image_embeds: Optional[torch.FloatTensor] = None
@ -505,35 +494,26 @@ class GotOcr2MultiModalProjector(nn.Module):
@dataclass @dataclass
class GotOcr2CausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for GotOcr2 causal language model (or autoregressive) outputs. Base class for GotOcr2 causal language model (or autoregressive) outputs.
"""
)
class GotOcr2CausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). image_hidden_states (`torch.FloatTensor`, *optional*):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -545,33 +525,22 @@ class GotOcr2CausalLMOutputWithPast(ModelOutput):
@dataclass @dataclass
class GotOcr2ModelOutputWithPast(BaseModelOutputWithPast): @auto_docstring(
""" custom_intro="""
Base class for GotOcr2 outputs, with hidden states and attentions. Base class for GotOcr2 outputs, with hidden states and attentions.
"""
)
class GotOcr2ModelOutputWithPast(BaseModelOutputWithPast):
r"""
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): `past_key_values` input) to speed up sequential decoding.
Sequence of hidden-states at the output of the last layer of the model. image_hidden_states (`torch.FloatTensor`, *optional*):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
image_hidden_states: Optional[torch.FloatTensor] = None image_hidden_states: Optional[torch.FloatTensor] = None
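The `past_key_values` entry in these classes is not just documentation: it is what callers feed back into the model to avoid re-encoding the prompt at every generation step. A minimal sketch of that round trip with a small public causal LM (`gpt2` is used here only as a convenient checkpoint, not a model touched by this diff):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The quick brown", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, use_cache=True)

# Greedy-pick one token, then reuse the cache instead of re-running the full prompt.
next_token = out.logits[:, -1:].argmax(dim=-1)
with torch.no_grad():
    out2 = model(input_ids=next_token, past_key_values=out.past_key_values, use_cache=True)

print(out2.logits.shape)  # torch.Size([1, 1, 50257]): only the new position is computed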

View File

@ -597,36 +597,27 @@ class GPT2PreTrainedModel(PreTrainedModel):
@dataclass @dataclass
class GPT2DoubleHeadsModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of models predicting if two sentences are consecutive or not. Base class for outputs of models predicting if two sentences are consecutive or not.
"""
)
class GPT2DoubleHeadsModelOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss.
mc_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `mc_labels` is provided):
Multiple choice classification loss.
logits (`torch.FloatTensor` of shape `(batch_size, num_choices, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
mc_logits (`torch.FloatTensor` of shape `(batch_size, num_choices)`):
Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).
past_key_values (`tuple[tuple[torch.Tensor]]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of length `config.n_layers`, containing tuples of tensors of shape `(batch_size, num_heads,
sequence_length, embed_size_per_head)`).
Args: Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss.
mc_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `mc_labels` is provided):
Multiple choice classification loss.
logits (`torch.FloatTensor` of shape `(batch_size, num_choices, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
mc_logits (`torch.FloatTensor` of shape `(batch_size, num_choices)`):
Prediction scores of the multiple choice classification head (scores for each choice before SoftMax).
past_key_values (`tuple[tuple[torch.Tensor]]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of length `config.n_layers`, containing tuples of tensors of shape `(batch_size, num_heads,
sequence_length, embed_size_per_head)`).
Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
GPT2Attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
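What gets deleted in hunks like this one is exactly the text that repeats the stock description of a standard field (`hidden_states`, `attentions`, `past_key_values`, ...), while model-specific entries such as `mc_loss` and `mc_logits` stay. A tiny, self-contained sketch of that idea (not the repository's actual checking logic, and with a made-up table of stock descriptions) that flags documented entries whose text merely restates the standard one:

import re

# Hypothetical stock descriptions keyed by argument name.
STANDARD_DESCRIPTIONS = {
    "hidden_states": "Hidden-states of the model at the output of each layer plus the initial embedding outputs.",
    "attentions": "Attentions weights after the attention softmax.",
}


def redundant_entries(docstring: str) -> list[str]:
    redundant = []
    for name, body in re.findall(r"^ {4}(\w+) \(.*\):\n((?: {8}.*\n?)+)", docstring, flags=re.MULTILINE):
        text = " ".join(line.strip() for line in body.splitlines())
        if name in STANDARD_DESCRIPTIONS and STANDARD_DESCRIPTIONS[name] in text:
            redundant.append(name)
    return redundant


doc = """\
    mc_logits (`torch.FloatTensor`):
        Prediction scores of the multiple choice classification head.
    hidden_states (`tuple(torch.FloatTensor)`):
        Hidden-states of the model at the output of each layer plus the initial embedding outputs.
"""

print(redundant_entries(doc))  # ['hidden_states']: safe to drop once the decorator documents it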

View File

@ -33,32 +33,23 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class GraniteSpeechCausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for LlavaNext causal language model (or autoregressive) outputs. Base class for LlavaNext causal language model (or autoregressive) outputs.
"""
)
class GraniteSpeechCausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -382,12 +373,12 @@ class GraniteSpeechForConditionalGeneration(GraniteSpeechPreTrainedModel, Genera
The tensors corresponding to the input audios. input features can be obtained using The tensors corresponding to the input audios. input features can be obtained using
[`AutoFeatureExtractor`]. See [`GraniteSpeechFeatureExtractor.__call__`] for details. [`AutoFeatureExtractor`]. See [`GraniteSpeechFeatureExtractor.__call__`] for details.
[`GraniteSpeechProcessor`] uses [`GraniteSpeechFeatureExtractor`] for processing audio. [`GraniteSpeechProcessor`] uses [`GraniteSpeechFeatureExtractor`] for processing audio.
input_features_mask (`torch.Tensor`, *optional*):
Mask to be applied to audio features prior to scattering into the language embeddings.
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`. (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
input_features_mask (`torch.Tensor`, *optional*):
Mask to be applied to audio features prior to scattering into the language embeddings.
""" """
# TODO (@alex-jw-brooks) add an example to this docstring once models are released # TODO (@alex-jw-brooks) add an example to this docstring once models are released
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
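The unchanged context line at the bottom of this hunk shows the usual per-call override pattern used throughout these `forward` methods: an explicit keyword argument wins, otherwise the value stored on the config is used. Stripped down to the idiom itself:

class ToyConfig:
    output_attentions = False


class ToyModel:
    def __init__(self, config):
        self.config = config

    def forward(self, output_attentions=None):
        # Per-call value wins; otherwise fall back to the config default.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        return output_attentions


model = ToyModel(ToyConfig())
print(model.forward())                        # False (config default)
print(model.forward(output_attentions=True))  # True  (explicit override)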

View File

@ -102,28 +102,20 @@ class MultiScaleDeformableAttention(nn.Module):
@dataclass @dataclass
class GroundingDinoDecoderOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of the GroundingDinoDecoder. This class adds two attributes to Base class for outputs of the GroundingDinoDecoder. This class adds two attributes to
BaseModelOutputWithCrossAttentions, namely: BaseModelOutputWithCrossAttentions, namely:
- a stacked tensor of intermediate decoder hidden states (i.e. the output of each decoder layer) - a stacked tensor of intermediate decoder hidden states (i.e. the output of each decoder layer)
- a stacked tensor of intermediate reference points. - a stacked tensor of intermediate reference points.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class GroundingDinoDecoderOutput(ModelOutput):
Sequence of hidden-states at the output of the last layer of the model. r"""
intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`): intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`):
Stacked intermediate hidden states (output of each layer of the decoder). Stacked intermediate hidden states (output of each layer of the decoder).
intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, sequence_length, hidden_size)`): intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, sequence_length, hidden_size)`):
Stacked intermediate reference points (reference points of each layer of the decoder). Stacked intermediate reference points (reference points of each layer of the decoder).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
plus the initial embedding outputs.
attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of tuples of `torch.FloatTensor` (one for attention for each layer) of shape `(batch_size, num_heads,
sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
weighted average in the self-attention, cross-attention and multi-scale deformable attention heads.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -134,30 +126,27 @@ class GroundingDinoDecoderOutput(ModelOutput):
@dataclass @dataclass
class GroundingDinoEncoderOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of the GroundingDinoEncoder. This class extends BaseModelOutput, due to: Base class for outputs of the GroundingDinoEncoder. This class extends BaseModelOutput, due to:
- vision and text last hidden states - vision and text last hidden states
- vision and text intermediate hidden states - vision and text intermediate hidden states
"""
Args: )
last_hidden_state_vision (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class GroundingDinoEncoderOutput(ModelOutput):
Sequence of hidden-states at the output of the last layer of the vision encoder. r"""
last_hidden_state_text (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): last_hidden_state_vision (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the text encoder. Sequence of hidden-states at the output of the last layer of the vision encoder.
vision_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): last_hidden_state_text (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Tuple of `torch.FloatTensor` (one for the output of the vision embeddings + one for the output of each Sequence of hidden-states at the output of the last layer of the text encoder.
layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the vision encoder at the vision_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
output of each layer plus the initial embedding outputs. Tuple of `torch.FloatTensor` (one for the output of the vision embeddings + one for the output of each
text_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the vision encoder at the
Tuple of `torch.FloatTensor` (one for the output of the text embeddings + one for the output of each layer) output of each layer plus the initial embedding outputs.
of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the text encoder at the output of text_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
each layer plus the initial embedding outputs. Tuple of `torch.FloatTensor` (one for the output of the text embeddings + one for the output of each layer)
attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the text encoder at the output of
Tuple of tuples of `torch.FloatTensor` (one for attention for each layer) of shape `(batch_size, num_heads, each layer plus the initial embedding outputs.
sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
weighted average in the text-vision attention, vision-text attention, text-enhancer (self-attention) and
multi-scale deformable attention heads.
""" """
last_hidden_state_vision: Optional[torch.FloatTensor] = None last_hidden_state_vision: Optional[torch.FloatTensor] = None
@ -168,55 +157,49 @@ class GroundingDinoEncoderOutput(ModelOutput):
@dataclass @dataclass
class GroundingDinoModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of the Grounding DINO encoder-decoder model. Base class for outputs of the Grounding DINO encoder-decoder model.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`): class GroundingDinoModelOutput(ModelOutput):
Sequence of hidden-states at the output of the last layer of the decoder of the model. r"""
init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`):
Initial reference points sent through the Transformer decoder. Sequence of hidden-states at the output of the last layer of the decoder of the model.
intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`): init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
Stacked intermediate hidden states (output of each layer of the decoder). Initial reference points sent through the Transformer decoder.
intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`): intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`):
Stacked intermediate reference points (reference points of each layer of the decoder). Stacked intermediate hidden states (output of each layer of the decoder).
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Stacked intermediate reference points (reference points of each layer of the decoder).
shape `(batch_size, num_queries, hidden_size)`. Hidden-states of the decoder at the output of each layer encoder_last_hidden_state_vision (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
plus the initial embedding outputs. Sequence of hidden-states at the output of the last layer of the encoder of the model.
decoder_attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): encoder_last_hidden_state_text (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Tuple of tuples of `torch.FloatTensor` (one for attention for each layer) of shape `(batch_size, num_heads, Sequence of hidden-states at the output of the last layer of the encoder of the model.
sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the encoder_vision_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
weighted average in the self-attention, cross-attention and multi-scale deformable attention heads. Tuple of `torch.FloatTensor` (one for the output of the vision embeddings + one for the output of each
encoder_last_hidden_state_vision (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the vision encoder at the
Sequence of hidden-states at the output of the last layer of the encoder of the model. output of each layer plus the initial embedding outputs.
encoder_last_hidden_state_text (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): encoder_text_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Sequence of hidden-states at the output of the last layer of the encoder of the model. Tuple of `torch.FloatTensor` (one for the output of the text embeddings + one for the output of each layer)
encoder_vision_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the text encoder at the output of
Tuple of `torch.FloatTensor` (one for the output of the vision embeddings + one for the output of each each layer plus the initial embedding outputs.
layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the vision encoder at the encoder_attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
output of each layer plus the initial embedding outputs. Tuple of tuples of `torch.FloatTensor` (one for attention for each layer) of shape `(batch_size, num_heads,
encoder_text_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
Tuple of `torch.FloatTensor` (one for the output of the text embeddings + one for the output of each layer) weighted average in the text-vision attention, vision-text attention, text-enhancer (self-attention) and
of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the text encoder at the output of multi-scale deformable attention heads. attention softmax, used to compute the weighted average in the
each layer plus the initial embedding outputs. bi-attention heads.
encoder_attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): enc_outputs_class (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.two_stage=True`):
Tuple of tuples of `torch.FloatTensor` (one for attention for each layer) of shape `(batch_size, num_heads, Predicted bounding boxes scores where the top `config.num_queries` scoring bounding boxes are picked as
sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the region proposals in the first stage. Output of bounding box binary classification (i.e. foreground and
weighted average in the text-vision attention, vision-text attention, text-enhancer (self-attention) and background).
multi-scale deformable attention heads. attention softmax, used to compute the weighted average in the enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.two_stage=True`):
bi-attention heads. Logits of predicted bounding boxes coordinates in the first stage.
enc_outputs_class (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.two_stage=True`): encoder_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.two_stage=True`):
Predicted bounding boxes scores where the top `config.num_queries` scoring bounding boxes are picked as Logits of top `config.num_queries` scoring bounding boxes in the first stage.
region proposals in the first stage. Output of bounding box binary classification (i.e. foreground and encoder_pred_boxes (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.two_stage=True`):
background). Coordinates of top `config.num_queries` scoring bounding boxes in the first stage.
enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.two_stage=True`):
Logits of predicted bounding boxes coordinates in the first stage.
encoder_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.two_stage=True`):
Logits of top `config.num_queries` scoring bounding boxes in the first stage.
encoder_pred_boxes (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.two_stage=True`):
Coordinates of top `config.num_queries` scoring bounding boxes in the first stage.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -237,73 +220,62 @@ class GroundingDinoModelOutput(ModelOutput):
@dataclass @dataclass
class GroundingDinoObjectDetectionOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`GroundingDinoForObjectDetection`]. Output type of [`GroundingDinoForObjectDetection`].
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)): class GroundingDinoObjectDetectionOutput(ModelOutput):
Total loss as a linear combination of a negative log-likehood (cross-entropy) for class prediction and a r"""
bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` are provided)):
scale-invariant IoU loss. Total loss as a linear combination of a negative log-likehood (cross-entropy) for class prediction and a
loss_dict (`Dict`, *optional*): bounding box loss. The latter is defined as a linear combination of the L1 loss and the generalized
A dictionary containing the individual losses. Useful for logging. scale-invariant IoU loss.
logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`): loss_dict (`Dict`, *optional*):
Classification logits (including no-object) for all queries. A dictionary containing the individual losses. Useful for logging.
pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`): logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes + 1)`):
Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These Classification logits (including no-object) for all queries.
values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding pred_boxes (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
possible padding). You can use [`~GroundingDinoProcessor.post_process_grounded_object_detection`] to retrieve the Normalized boxes coordinates for all queries, represented as (center_x, center_y, width, height). These
unnormalized bounding boxes. values are normalized in [0, 1], relative to the size of each individual image in the batch (disregarding
auxiliary_outputs (`list[Dict]`, *optional*): possible padding). You can use [`~GroundingDinoProcessor.post_process_grounded_object_detection`] to retrieve the
Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`) unnormalized bounding boxes.
and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and auxiliary_outputs (`list[Dict]`, *optional*):
`pred_boxes`) for each decoder layer. Optional, only returned when auxiliary losses are activated (i.e. `config.auxiliary_loss` is set to `True`)
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`, *optional*): and labels are provided. It is a list of dictionaries containing the two above keys (`logits` and
Sequence of hidden-states at the output of the last layer of the decoder of the model. `pred_boxes`) for each decoder layer.
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_queries, hidden_size)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Sequence of hidden-states at the output of the last layer of the decoder of the model.
shape `(batch_size, num_queries, hidden_size)`. Hidden-states of the decoder at the output of each layer init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
plus the initial embedding outputs. Initial reference points sent through the Transformer decoder.
decoder_attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`):
Tuple of tuples of `torch.FloatTensor` (one for attention for each layer) of shape `(batch_size, num_heads, Stacked intermediate hidden states (output of each layer of the decoder).
sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`):
weighted average in the self-attention, cross-attention and multi-scale deformable attention heads. Stacked intermediate reference points (reference points of each layer of the decoder).
encoder_last_hidden_state_vision (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): encoder_last_hidden_state_vision (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model. Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_last_hidden_state_text (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*): encoder_last_hidden_state_text (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model. Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_vision_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): encoder_vision_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the vision embeddings + one for the output of each Tuple of `torch.FloatTensor` (one for the output of the vision embeddings + one for the output of each
layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the vision encoder at the layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the vision encoder at the
output of each layer plus the initial embedding outputs. output of each layer plus the initial embedding outputs.
encoder_text_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): encoder_text_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the text embeddings + one for the output of each layer) Tuple of `torch.FloatTensor` (one for the output of the text embeddings + one for the output of each layer)
of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the text encoder at the output of of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the text encoder at the output of
each layer plus the initial embedding outputs. each layer plus the initial embedding outputs.
encoder_attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): enc_outputs_class (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.two_stage=True`):
Tuple of tuples of `torch.FloatTensor` (one for attention for each layer) of shape `(batch_size, num_heads, Predicted bounding boxes scores where the top `config.num_queries` scoring bounding boxes are picked as
sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the region proposals in the first stage. Output of bounding box binary classification (i.e. foreground and
weighted average in the text-vision attention, vision-text attention, text-enhancer (self-attention) and background).
multi-scale deformable attention heads. enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.two_stage=True`):
intermediate_hidden_states (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, hidden_size)`): Logits of predicted bounding boxes coordinates in the first stage.
Stacked intermediate hidden states (output of each layer of the decoder). encoder_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.two_stage=True`):
intermediate_reference_points (`torch.FloatTensor` of shape `(batch_size, config.decoder_layers, num_queries, 4)`): Logits of top `config.num_queries` scoring bounding boxes in the first stage.
Stacked intermediate reference points (reference points of each layer of the decoder). encoder_pred_boxes (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.two_stage=True`):
init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`): Coordinates of top `config.num_queries` scoring bounding boxes in the first stage.
Initial reference points sent through the Transformer decoder. input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
enc_outputs_class (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.two_stage=True`): Encoded candidate labels sequence. Used in processor to post process object detection result.
Predicted bounding boxes scores where the top `config.num_queries` scoring bounding boxes are picked as
region proposals in the first stage. Output of bounding box binary classification (i.e. foreground and
background).
enc_outputs_coord_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.two_stage=True`):
Logits of predicted bounding boxes coordinates in the first stage.
encoder_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`, *optional*, returned when `config.two_stage=True`):
Logits of top `config.num_queries` scoring bounding boxes in the first stage.
encoder_pred_boxes (`torch.FloatTensor` of shape `(batch_size, sequence_length, 4)`, *optional*, returned when `config.two_stage=True`):
Coordinates of top `config.num_queries` scoring bounding boxes in the first stage.
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Encoded candidate labels sequence. Used in processor to post process object detection result.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
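The `pred_boxes` entry points readers at `GroundingDinoProcessor.post_process_grounded_object_detection` to convert the normalized boxes back to pixel coordinates. A sketch of that flow, assuming the public `IDEA-Research/grounding-dino-tiny` checkpoint and default thresholds (the exact post-processing kwargs depend on the installed version):

import requests
import torch
from PIL import Image
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "a cat. a remote control."  # grounding text: lowercase phrases separated by periods

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # a GroundingDinoObjectDetectionOutput, as documented above

results = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
)
print(results[0]["boxes"].shape, results[0]["labels"])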

View File

@ -259,38 +259,37 @@ class GroupViTTokenAssign(nn.Module):
@dataclass @dataclass
@auto_docstring
class GroupViTModelOutput(ModelOutput): class GroupViTModelOutput(ModelOutput):
""" r"""
Args: loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`): Contrastive loss for image-text similarity.
Contrastive loss for image-text similarity. logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`): The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text similarity scores.
similarity scores. logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`): The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image similarity scores.
similarity scores. segmentation_logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels, logits_height, logits_width)`):
segmentation_logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels, logits_height, logits_width)`): Classification scores for each pixel.
Classification scores for each pixel.
<Tip warning={true}> <Tip warning={true}>
The logits returned do not necessarily have the same size as the `pixel_values` passed as inputs. This is The logits returned do not necessarily have the same size as the `pixel_values` passed as inputs. This is
to avoid doing two interpolations and lose some quality when a user needs to resize the logits to the to avoid doing two interpolations and lose some quality when a user needs to resize the logits to the
original image size as post-processing. You should always check your logits shape and resize as needed. original image size as post-processing. You should always check your logits shape and resize as needed.
</Tip> </Tip>
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): The text embeddings obtained by applying the projection layer to the pooled output of
The text embeddings obtained by applying the projection layer to the pooled output of [`GroupViTTextModel`].
[`GroupViTTextModel`]. image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`): The image embeddings obtained by applying the projection layer to the pooled output of
The image embeddings obtained by applying the projection layer to the pooled output of [`GroupViTVisionModel`].
[`GroupViTVisionModel`]. text_model_output (`BaseModelOutputWithPooling`):
text_model_output (`BaseModelOutputWithPooling`): The output of the [`GroupViTTextModel`].
The output of the [`GroupViTTextModel`]. vision_model_output (`BaseModelOutputWithPooling`):
vision_model_output (`BaseModelOutputWithPooling`): The output of the [`GroupViTVisionModel`].
The output of the [`GroupViTVisionModel`].
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
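GroupViT keeps the bare `@auto_docstring` form: no `custom_intro`, and only the model-specific entries in the `r"""` block. Conceptually, such a decorator merges an optional intro, the hand-written entries, and stock text for the remaining fields. Below is a deliberately simplified stand-in for that merge (it is not the real implementation in `transformers.utils`, and the stock table is made up), just to make the mechanism concrete:

import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical stock descriptions keyed by field name.
STOCK_DESCRIPTIONS = {
    "loss": "The computed loss, returned when labels are provided.",
}


def toy_auto_docstring(custom_intro=""):
    def decorator(cls):
        # Hand-written entries from the class body, keyed by argument name.
        custom = dict(re.findall(r"^ {4}(\w+) \(.*\):\n {8}(.*)", cls.__doc__ or "", flags=re.MULTILINE))
        lines = [custom_intro.strip(), "", "Args:"]
        for name in cls.__annotations__:  # dataclass fields, in declaration order
            description = custom.get(name) or STOCK_DESCRIPTIONS.get(name, "See the standard documentation.")
            lines += [f"    {name}:", f"        {description}"]
        cls.__doc__ = "\n".join(lines)
        return cls

    return decorator


@dataclass
@toy_auto_docstring(custom_intro="Output of a toy contrastive model.")
class ToyModelOutput:
    r"""
    logits_per_image (`FloatTensor`):
        Image-text similarity scores.
    """

    loss: Optional[float] = None
    logits_per_image: Optional[list] = None


print(ToyModelOutput.__doc__)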

View File

@ -42,30 +42,19 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class HieraEncoderOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Hiera encoder's outputs, with potential hidden states and attentions. Hiera encoder's outputs, with potential hidden states and attentions.
"""
)
class HieraEncoderOutput(ModelOutput):
r"""
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, height, width, hidden_size)`. These are the reshaped and re-rolled hidden states of the model.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): include the spatial dimensions.
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`. These are the unrolled hidden states of the model.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each stage) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, height, width, hidden_size)`. These are the reshaped and re-rolled hidden states of the model.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -75,36 +64,25 @@ class HieraEncoderOutput(ModelOutput):
@dataclass @dataclass
class HieraModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Hiera model's outputs that also contains a pooling of the last hidden states. Hiera model's outputs that also contains a pooling of the last hidden states.
"""
)
class HieraModelOutput(ModelOutput):
r"""
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`, *optional*, returned when `add_pooling_layer=True` is passed):
Average pooling of the last layer hidden-state.
bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, sequence_length)`):
Tensor indicating which patches are masked (0) and which are not (1).
ids_restore (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Tensor containing the original index of the (shuffled) masked patches.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, height, width, hidden_size)`. These are the reshaped and re-rolled hidden states of the model.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): include the spatial dimensions.
Sequence of hidden-states at the output of the last layer of the model.
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`, *optional*, returned when `add_pooling_layer=True` is passed):
Average pooling of the last layer hidden-state.
bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, sequence_length)`):
Tensor indicating which patches are masked (0) and which are not (1).
ids_restore (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Tensor containing the original index of the (shuffled) masked patches.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`. These are the unrolled hidden states of the model.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each stage) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, height, width, hidden_size)`. These are the reshaped and re-rolled hidden states of the model.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -117,32 +95,34 @@ class HieraModelOutput(ModelOutput):
@dataclass @dataclass
class HieraForImageClassificationOutput(ImageClassifierOutput): @auto_docstring(
""" custom_intro="""
Hiera image classification outputs. Hiera image classification outputs.
"""
)
class HieraForImageClassificationOutput(ImageClassifierOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, `optional`):
Loss value for the training task.
logits (`torch.FloatTensor` of shape `(batch_size, num_labels)`):
Prediction scores of the classification head (logits of the output layer).
hidden_states (`tuple(torch.FloatTensor)`, `optional`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`. These are the unrolled hidden states of the model.
Args: Hidden-states of the model at the output of each layer plus the initial embedding outputs.
loss (`torch.FloatTensor` of shape `(1,)`, `optional`): attentions (`tuple(torch.FloatTensor)`, `optional`):
Loss value for the training task. Tuple of `torch.FloatTensor` (one for each stage) of shape `(batch_size, num_heads, sequence_length,
logits (`torch.FloatTensor` of shape `(batch_size, num_labels)`): sequence_length)`.
Prediction scores of the classification head (logits of the output layer).
hidden_states (`tuple(torch.FloatTensor)`, `optional`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`. These are the unrolled hidden states of the model.
Hidden-states of the model at the output of each layer plus the initial embedding outputs. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
attentions (`tuple(torch.FloatTensor)`, `optional`): heads.
Tuple of `torch.FloatTensor` (one for each stage) of shape `(batch_size, num_heads, sequence_length, reshaped_hidden_states (`tuple(torch.FloatTensor)`, `optional`):
sequence_length)`. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, height, width, hidden_size)`. These are the reshaped and re-rolled hidden states of the model.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
heads. include the spatial dimensions.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, `optional`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, height, width, hidden_size)`. These are the reshaped and re-rolled hidden states of the model.
Hidden-states of the model at the output of each layer plus the initial embedding outputs reshaped to
include the spatial dimensions.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -153,31 +133,25 @@ class HieraForImageClassificationOutput(ImageClassifierOutput):
@dataclass @dataclass
class HieraForPreTrainingOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Class for HieraForPreTraining's outputs, with potential hidden states and attentions. Class for HieraForPreTraining's outputs, with potential hidden states and attentions.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`): class HieraForPreTrainingOutput(ModelOutput):
Pixel reconstruction loss. r"""
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, patch_size ** 2 * num_channels)`): loss (`torch.FloatTensor` of shape `(1,)`):
Pixel reconstruction logits. Pixel reconstruction loss.
bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, sequence_length)`): logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, patch_size ** 2 * num_channels)`):
Tensor indicating which patches are masked (0) and which are not (1). Pixel reconstruction logits.
ids_restore (`torch.LongTensor` of shape `(batch_size, sequence_length)`): bool_masked_pos (`torch.BoolTensor` of shape `(batch_size, sequence_length)`):
Tensor containing the original index of the (shuffled) masked patches. Tensor indicating which patches are masked (0) and which are not (1).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): ids_restore (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Tensor containing the original index of the (shuffled) masked patches.
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
plus the initial embedding outputs. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): shape `(batch_size, height, width, hidden_size)`. Hidden-states of the model at the output of each layer
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, plus the initial embedding outputs reshaped to include the spatial dimensions.
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.
reshaped_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, height, width, hidden_size)`. Hidden-states of the model at the output of each layer
plus the initial embedding outputs reshaped to include the spatial dimensions.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -52,41 +52,32 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class IdeficsBaseModelOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Idefics model's outputs that may also contain a past key/values (to speed up sequential decoding). Base class for Idefics model's outputs that may also contain a past key/values (to speed up sequential decoding).
"""
)
class IdeficsBaseModelOutputWithPast(ModelOutput):
r"""
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
Args: If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1,
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): hidden_size)` is output.
Sequence of hidden-states at the output of the last layer of the model. past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
`config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
encoder_sequence_length, embed_size_per_head)`.
If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1, Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
hidden_size)` is output. `config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): input) to speed up sequential decoding.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
`config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads, sequence_length, hidden_size)`.
encoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
`config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -97,37 +88,28 @@ class IdeficsBaseModelOutputWithPast(ModelOutput):
@dataclass @dataclass
class IdeficsCausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Idefics causal language model (or autoregressive) outputs. Base class for Idefics causal language model (or autoregressive) outputs.
"""
)
class IdeficsCausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). sequence_length, hidden_size)`.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -1445,16 +1427,16 @@ class IdeficsForVisionText2Text(IdeficsPreTrainedModel, GenerationMixin):
**kwargs: Unpack[KwargsForCausalLM], **kwargs: Unpack[KwargsForCausalLM],
) -> Union[tuple, IdeficsCausalLMOutputWithPast]: ) -> Union[tuple, IdeficsCausalLMOutputWithPast]:
r""" r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
image_encoder_embeddings (`torch.FloatTensor`, *optional*): image_encoder_embeddings (`torch.FloatTensor`, *optional*):
The output of the image encoder. The output of the image encoder.
perceiver_embeddings (`torch.FloatTensor`, *optional*): perceiver_embeddings (`torch.FloatTensor`, *optional*):
The output of the perceiver resampler. The output of the perceiver resampler.
image_attention_mask (`torch.LongTensor`, *optional*): image_attention_mask (`torch.LongTensor`, *optional*):
The attention mask for the image encoder. The attention mask for the image encoder.
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
Example: Example:

View File

@ -39,35 +39,29 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class Idefics2BaseModelOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Idefics2 model's outputs that may also contain a past key/values (to speed up sequential decoding). Base class for Idefics2 model's outputs that may also contain a past key/values (to speed up sequential decoding).
Args: """
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): )
Sequence of hidden-states at the output of the last layer of the model. class Idefics2BaseModelOutputWithPast(ModelOutput):
If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1, r"""
hidden_size)` is output. last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): Sequence of hidden-states at the output of the last layer of the model.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1,
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if hidden_size)` is output.
`config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads, past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
encoder_sequence_length, embed_size_per_head)`. Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
`config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values` `config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
input) to speed up sequential decoding. encoder_sequence_length, embed_size_per_head)`.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + `config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. input) to speed up sequential decoding.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length, hidden_size)`.
sequence_length)`. image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -78,33 +72,27 @@ class Idefics2BaseModelOutputWithPast(ModelOutput):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Base class for Idefics2 causal language model (or autoregressive) outputs.
"""
)
# Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->Idefics2 # Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->Idefics2
class Idefics2CausalLMOutputWithPast(ModelOutput): class Idefics2CausalLMOutputWithPast(ModelOutput):
""" r"""
Base class for Idefics2 causal language model (or autoregressive) outputs. loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Args: Language modeling loss (for next-token prediction).
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Language modeling loss (for next-token prediction). Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): `(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) `past_key_values` input) to speed up sequential decoding.
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
`past_key_values` input) to speed up sequential decoding. Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -39,35 +39,29 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class Idefics3BaseModelOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Idefics3 model's outputs that may also contain a past key/values (to speed up sequential decoding). Base class for Idefics3 model's outputs that may also contain a past key/values (to speed up sequential decoding).
Args: """
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): )
Sequence of hidden-states at the output of the last layer of the model. class Idefics3BaseModelOutputWithPast(ModelOutput):
If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1, r"""
hidden_size)` is output. last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): Sequence of hidden-states at the output of the last layer of the model.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1,
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if hidden_size)` is output.
`config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads, past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
encoder_sequence_length, embed_size_per_head)`. Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
`config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values` `config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
input) to speed up sequential decoding. encoder_sequence_length, embed_size_per_head)`.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + `config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. input) to speed up sequential decoding.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length, hidden_size)`.
sequence_length)`. image_hidden_states of the model produced by the vision encoder
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -78,33 +72,26 @@ class Idefics3BaseModelOutputWithPast(ModelOutput):
@dataclass @dataclass
class Idefics3CausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Idefics causal language model (or autoregressive) outputs. Base class for Idefics causal language model (or autoregressive) outputs.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): class Idefics3CausalLMOutputWithPast(ModelOutput):
Language modeling loss (for next-token prediction). r"""
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). Language modeling loss (for next-token prediction).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`past_key_values` input) to speed up sequential decoding. `(batch_size, num_heads, sequence_length, embed_size_per_head)`)
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + `past_key_values` input) to speed up sequential decoding.
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, image_hidden_states of the model produced by the vision encoder
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -44,22 +44,24 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Class defining the outputs of [`InstructBlipForConditionalGeneration`].
"""
)
# Copied from transformers.models.blip_2.modeling_blip_2.Blip2ForConditionalGenerationModelOutput with Blip2->InstructBlip # Copied from transformers.models.blip_2.modeling_blip_2.Blip2ForConditionalGenerationModelOutput with Blip2->InstructBlip
class InstructBlipForConditionalGenerationModelOutput(ModelOutput): class InstructBlipForConditionalGenerationModelOutput(ModelOutput):
""" r"""
Class defining the outputs of [`InstructBlipForConditionalGeneration`]. loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
Language modeling loss from the language model.
Args: logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): Prediction scores of the language modeling head of the language model.
Language modeling loss from the language model. vision_outputs (`BaseModelOutputWithPooling`):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): Outputs of the vision encoder.
Prediction scores of the language modeling head of the language model. qformer_outputs (`BaseModelOutputWithPoolingAndCrossAttentions`):
vision_outputs (`BaseModelOutputWithPooling`): Outputs of the Q-Former (Querying Transformer).
Outputs of the vision encoder. language_model_outputs (`CausalLMOutputWithPast` or `Seq2SeqLMOutput`):
qformer_outputs (`BaseModelOutputWithPoolingAndCrossAttentions`): Outputs of the language model.
Outputs of the Q-Former (Querying Transformer).
language_model_outputs (`CausalLMOutputWithPast` or `Seq2SeqLMOutput`):
Outputs of the language model.
""" """
loss: Optional[tuple[torch.FloatTensor]] = None loss: Optional[tuple[torch.FloatTensor]] = None

View File

@ -1147,21 +1147,23 @@ class InstructBlipVideoQFormerModel(InstructBlipVideoPreTrainedModel):
@dataclass @dataclass
class InstructBlipVideoForConditionalGenerationModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Class defining the outputs of [`InstructBlipVideoForConditionalGeneration`]. Class defining the outputs of [`InstructBlipVideoForConditionalGeneration`].
"""
Args: )
loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): class InstructBlipVideoForConditionalGenerationModelOutput(ModelOutput):
Language modeling loss from the language model. r"""
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): loss (`torch.FloatTensor`, *optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
Prediction scores of the language modeling head of the language model. Language modeling loss from the language model.
vision_outputs (`BaseModelOutputWithPooling`): logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Outputs of the vision encoder. Prediction scores of the language modeling head of the language model.
qformer_outputs (`BaseModelOutputWithPoolingAndCrossAttentions`): vision_outputs (`BaseModelOutputWithPooling`):
Outputs of the Q-Former (Querying Transformer). Outputs of the vision encoder.
language_model_outputs (`CausalLMOutputWithPast` or `Seq2SeqLMOutput`): qformer_outputs (`BaseModelOutputWithPoolingAndCrossAttentions`):
Outputs of the language model. Outputs of the Q-Former (Querying Transformer).
language_model_outputs (`CausalLMOutputWithPast` or `Seq2SeqLMOutput`):
Outputs of the language model.
""" """
loss: Optional[tuple[torch.FloatTensor]] = None loss: Optional[tuple[torch.FloatTensor]] = None

View File

@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
from dataclasses import dataclass
from typing import Optional, Union from typing import Optional, Union
import torch import torch
@ -189,7 +188,6 @@ class InstructBlipVideoQFormerModel(InstructBlipQFormerModel):
pass pass
@dataclass
class InstructBlipVideoForConditionalGenerationModelOutput(InstructBlipForConditionalGenerationModelOutput): class InstructBlipVideoForConditionalGenerationModelOutput(InstructBlipForConditionalGenerationModelOutput):
pass pass

View File

@ -209,28 +209,17 @@ class InternVLVisionPreTrainedModel(PreTrainedModel):
@dataclass @dataclass
class InternVLVisionModelOutputWithPooling(BaseModelOutputWithPooling): @auto_docstring(
""" custom_intro="""
Class for outputs of [`InternVLVisionModel`]. Class for outputs of [`InternVLVisionModel`].
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class InternVLVisionModelOutputWithPooling(BaseModelOutputWithPooling):
Sequence of hidden-states at the output of the last layer of the model. r"""
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
Average of the last layer hidden states of the patch tokens (excluding the *[CLS]* token) if Average of the last layer hidden states of the patch tokens (excluding the *[CLS]* token) if
*config.use_mean_pooling* is set to True. If set to False, then the final hidden state of the *[CLS]* token *config.use_mean_pooling* is set to True. If set to False, then the final hidden state of the *[CLS]* token
will be returned. will be returned.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
@ -569,33 +558,22 @@ class InternVLMultiModalProjector(nn.Module):
@dataclass @dataclass
class InternVLModelOutputWithPast(BaseModelOutputWithPast): @auto_docstring(
""" custom_intro="""
Base class for InternVL outputs, with hidden states and attentions. Base class for InternVL outputs, with hidden states and attentions.
"""
)
class InternVLModelOutputWithPast(BaseModelOutputWithPast):
r"""
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): `past_key_values` input) to speed up sequential decoding.
Sequence of hidden-states at the output of the last layer of the model. image_hidden_states (`torch.FloatTensor`, *optional*):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
image_hidden_states: Optional[torch.FloatTensor] = None image_hidden_states: Optional[torch.FloatTensor] = None
@ -805,35 +783,26 @@ class InternVLModel(InternVLPreTrainedModel):
@dataclass @dataclass
class InternVLCausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for InternVL causal language model (or autoregressive) outputs. Base class for InternVL causal language model (or autoregressive) outputs.
"""
)
class InternVLCausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). image_hidden_states (`torch.FloatTensor`, *optional*):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -171,28 +171,17 @@ class InternVLVisionPreTrainedModel(PreTrainedModel):
@dataclass @dataclass
class InternVLVisionModelOutputWithPooling(BaseModelOutputWithPooling): @auto_docstring(
""" custom_intro="""
Class for outputs of [`InternVLVisionModel`]. Class for outputs of [`InternVLVisionModel`].
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class InternVLVisionModelOutputWithPooling(BaseModelOutputWithPooling):
Sequence of hidden-states at the output of the last layer of the model. r"""
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
Average of the last layer hidden states of the patch tokens (excluding the *[CLS]* token) if Average of the last layer hidden states of the patch tokens (excluding the *[CLS]* token) if
*config.use_mean_pooling* is set to True. If set to False, then the final hidden state of the *[CLS]* token *config.use_mean_pooling* is set to True. If set to False, then the final hidden state of the *[CLS]* token
will be returned. will be returned.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """

View File

@ -87,14 +87,17 @@ class JanusPreTrainedModel(PreTrainedModel):
@dataclass @dataclass
class JanusVQVAEOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Janus VQ-VAE mode model outputs. Base class for Janus VQ-VAE mode model outputs.
Args: """
decoded_pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`): )
Reconstructed pixel values after encoding and decoding the input. class JanusVQVAEOutput(ModelOutput):
embedding_loss (`torch.FloatTensor`): r"""
Embedding loss. decoded_pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`):
Reconstructed pixel values after encoding and decoding the input.
embedding_loss (`torch.FloatTensor`):
Embedding loss.
""" """
decoded_pixel_values: Optional[torch.FloatTensor] = None decoded_pixel_values: Optional[torch.FloatTensor] = None
@ -102,41 +105,32 @@ class JanusVQVAEOutput(ModelOutput):
@dataclass @dataclass
class JanusBaseModelOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Janus model's outputs that may also contain a past key/values (to speed up sequential decoding). Base class for Janus model's outputs that may also contain a past key/values (to speed up sequential decoding).
"""
)
class JanusBaseModelOutputWithPast(ModelOutput):
r"""
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
Args: If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1,
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): hidden_size)` is output.
Sequence of hidden-states at the output of the last layer of the model. past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
`config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
encoder_sequence_length, embed_size_per_head)`.
If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1, Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
hidden_size)` is output. `config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): input) to speed up sequential decoding.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
`config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads, sequence_length, hidden_size)`.
encoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
`config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -147,37 +141,28 @@ class JanusBaseModelOutputWithPast(ModelOutput):
@dataclass @dataclass
class JanusCausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Janus causal language model (or autoregressive) outputs. Base class for Janus causal language model (or autoregressive) outputs.
"""
)
class JanusCausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). sequence_length, hidden_size)`.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings, `(batch_size, num_images,
sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -408,26 +408,27 @@ class JanusPreTrainedModel(PreTrainedModel):
@dataclass @dataclass
class JanusVQVAEOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Janus VQ-VAE mode model outputs. Base class for Janus VQ-VAE mode model outputs.
Args: """
decoded_pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`): )
Reconstructed pixel values after encoding and decoding the input. class JanusVQVAEOutput(ModelOutput):
embedding_loss (`torch.FloatTensor`): r"""
Embedding loss. decoded_pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`):
Reconstructed pixel values after encoding and decoding the input.
embedding_loss (`torch.FloatTensor`):
Embedding loss.
""" """
decoded_pixel_values: Optional[torch.FloatTensor] = None decoded_pixel_values: Optional[torch.FloatTensor] = None
embedding_loss: torch.FloatTensor = None embedding_loss: torch.FloatTensor = None
@dataclass
class JanusBaseModelOutputWithPast(IdeficsBaseModelOutputWithPast): class JanusBaseModelOutputWithPast(IdeficsBaseModelOutputWithPast):
pass pass
@dataclass
class JanusCausalLMOutputWithPast(IdeficsCausalLMOutputWithPast): class JanusCausalLMOutputWithPast(IdeficsCausalLMOutputWithPast):
pass pass
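
The Janus hunks above all apply the same refactor: the hand-written `Args:` docstring on each `ModelOutput` subclass is replaced by `@auto_docstring(custom_intro=...)` plus an `r"""` block that documents only the arguments the decorator cannot fill in on its own. A minimal sketch of that convention (not taken from this diff) might look like the following; the class name and fields are hypothetical, and it assumes `ModelOutput` and `auto_docstring` are importable from `transformers.utils` as they are inside the modeling files shown here.

```python
from dataclasses import dataclass
from typing import Optional

import torch
from transformers.utils import ModelOutput, auto_docstring


@dataclass
@auto_docstring(
    custom_intro="""
    Output type of a hypothetical `ToyForPreTraining` model.
    """
)
class ToyForPreTrainingOutput(ModelOutput):
    r"""
    loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
        Total pre-training loss.
    prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
        Prediction scores of the language modeling head (before SoftMax).
    """

    # Standard fields such as `hidden_states` and `attentions` are deliberately
    # not described above: `auto_docstring` is expected to document them centrally.
    loss: Optional[torch.FloatTensor] = None
    prediction_logits: Optional[torch.FloatTensor] = None
    hidden_states: Optional[tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[tuple[torch.FloatTensor, ...]] = None
```

Keeping only the intro and the model-specific arguments in each class is what lets this commit delete so many repeated `hidden_states`/`attentions` blocks across the 177 touched files.
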

View File

@ -90,43 +90,32 @@ def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_l
@dataclass @dataclass
class Kosmos2ModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for text model's outputs that also contains a pooling of the last hidden states. Base class for text model's outputs that also contains a pooling of the last hidden states.
"""
)
class Kosmos2ModelOutput(ModelOutput):
r"""
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
`config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
encoder_sequence_length, embed_size_per_head)`.
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): `config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
Sequence of hidden-states at the output of the last layer of the model. input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): image_embeds (`torch.FloatTensor` of shape `(batch_size, latent_query_num, hidden_size)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + Sequence of hidden-states at the output of `Kosmos2ImageToTextProjection`.
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. projection_attentions (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. Attentions weights given by `Kosmos2ImageToTextProjection`, after the attention softmax, used to compute
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): the weighted average in the self-attention heads.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, vision_model_output (`BaseModelOutputWithPooling`, *optional*):
sequence_length)`. The output of the [`Kosmos2VisionModel`].
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_embeds (`torch.FloatTensor` of shape `(batch_size, latent_query_num, hidden_size)`, *optional*):
Sequence of hidden-states at the output of `Kosmos2ImageToTextProjection`.
projection_attentions (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights given by `Kosmos2ImageToTextProjection`, after the attention softmax, used to compute
the weighted average in the self-attention heads.
vision_model_output(`BaseModelOutputWithPooling`, *optional*):
The output of the [`Kosmos2VisionModel`].
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
`config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
encoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
`config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
input) to speed up sequential decoding.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -145,45 +134,36 @@ class Kosmos2ModelOutput(ModelOutput):
@dataclass @dataclass
class Kosmos2ForConditionalGenerationModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Model output class for `Kosmos2ForConditionalGeneration`. Model output class for `Kosmos2ForConditionalGeneration`.
"""
)
class Kosmos2ForConditionalGenerationModelOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
`config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
encoder_sequence_length, embed_size_per_head)`.
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
Language modeling loss (for next-token prediction). input) to speed up sequential decoding.
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): image_embeds (`torch.FloatTensor` of shape `(batch_size, latent_query_num, hidden_size)`, *optional*):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). Sequence of hidden-states at the output of `Kosmos2ImageToTextProjection`.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): projection_attentions (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. sequence_length)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. Attentions weights given by `Kosmos2ImageToTextProjection`, after the attention softmax, used to compute
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): the weighted average in the self-attention heads.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, vision_model_output (`BaseModelOutputWithPooling`, *optional*):
sequence_length)`. The output of the [`Kosmos2VisionModel`].
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_embeds (`torch.FloatTensor` of shape `(batch_size, latent_query_num, hidden_size)`, *optional*):
Sequence of hidden-states at the output of `Kosmos2ImageToTextProjection`.
projection_attentions (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights given by `Kosmos2ImageToTextProjection`, after the attention softmax, used to compute
the weighted average in the self-attention heads.
vision_model_output(`BaseModelOutputWithPooling`, *optional*):
The output of the [`Kosmos2VisionModel`].
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and optionally if
`config.is_encoder_decoder=True` 2 additional tensors of shape `(batch_size, num_heads,
encoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks and optionally if
`config.is_encoder_decoder=True` in the cross-attention blocks) that can be used (see `past_key_values`
input) to speed up sequential decoding.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -1332,6 +1312,8 @@ class Kosmos2TextModel(Kosmos2PreTrainedModel):
**kwargs: Unpack[FlashAttentionKwargs], **kwargs: Unpack[FlashAttentionKwargs],
) -> Union[tuple, BaseModelOutputWithPastAndCrossAttentions]: ) -> Union[tuple, BaseModelOutputWithPastAndCrossAttentions]:
r""" r"""
image_embeds (`torch.FloatTensor` of shape `(batch_size, latent_query_num, hidden_size)`, *optional*):
Sequence of hidden-states at the output of `Kosmos2ImageToTextProjection`.
image_embeds_position_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): image_embeds_position_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to indicate the location in a sequence to insert the image features. Mask values selected in `[0, Mask to indicate the location in a sequence to insert the image features. Mask values selected in `[0,
1]`: 1]`:
@ -1343,8 +1325,6 @@ class Kosmos2TextModel(Kosmos2PreTrainedModel):
- 1 indicates the head is **not masked**, - 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**. - 0 indicates the head is **masked**.
image_embeds (`torch.FloatTensor` of shape `(batch_size, latent_query_num, hidden_size)`, *optional*):
Sequence of hidden-states at the output of `Kosmos2ImageToTextProjection`.
""" """
return self.model( return self.model(
input_ids=input_ids, input_ids=input_ids,
@ -1423,6 +1403,8 @@ class Kosmos2TextForCausalLM(Kosmos2PreTrainedModel, GenerationMixin):
**kwargs: Unpack[KwargsForCausalLM], **kwargs: Unpack[KwargsForCausalLM],
) -> Union[tuple, CausalLMOutputWithCrossAttentions]: ) -> Union[tuple, CausalLMOutputWithCrossAttentions]:
r""" r"""
image_embeds (`torch.FloatTensor` of shape `(batch_size, latent_query_num, hidden_size)`, *optional*):
Sequence of hidden-states at the output of `Kosmos2ImageToTextProjection`.
image_embeds_position_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*): image_embeds_position_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to indicate the location in a sequence to insert the image features . Mask values selected in `[0, Mask to indicate the location in a sequence to insert the image features . Mask values selected in `[0,
1]`: 1]`:
@ -1438,8 +1420,6 @@ class Kosmos2TextForCausalLM(Kosmos2PreTrainedModel, GenerationMixin):
Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
`[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are `[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are
ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]` ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
image_embeds (`torch.FloatTensor` of shape `(batch_size, latent_query_num, hidden_size)`, *optional*):
Sequence of hidden-states at the output of `Kosmos2ImageToTextProjection`.
""" """
return_dict = return_dict if return_dict is not None else self.config.use_return_dict return_dict = return_dict if return_dict is not None else self.config.use_return_dict
@ -1794,12 +1774,12 @@ class Kosmos2ForConditionalGeneration(Kosmos2PreTrainedModel, GenerationMixin):
- 1 for places where to put the image features, - 1 for places where to put the image features,
- 0 for places that are not for image features (i.e. for text tokens). - 0 for places that are not for image features (i.e. for text tokens).
image_embeds (`torch.FloatTensor` of shape `(batch_size, latent_query_num, hidden_size)`, *optional*):
Sequence of hidden-states at the output of `Kosmos2ImageToTextProjection`.
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*): labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
`[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are `[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are
ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]` ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
image_embeds (`torch.FloatTensor` of shape `(batch_size, latent_query_num, hidden_size)`, *optional*):
Sequence of hidden-states at the output of `Kosmos2ImageToTextProjection`.
Examples: Examples:
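
The Kosmos-2 hunks also reorder the per-argument documentation inside `forward` docstrings so that custom arguments such as `image_embeds` are described in the same order as they appear in the signature. A rough, self-contained illustration of that convention follows; `ToyTextModel` and its arguments are made up and the snippet does not use the real Kosmos-2 classes.

```python
from typing import Optional

import torch
from torch import nn


class ToyTextModel(nn.Module):
    def forward(
        self,
        input_ids: torch.Tensor,
        image_embeds: Optional[torch.FloatTensor] = None,
        image_embeds_position_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        r"""
        image_embeds (`torch.FloatTensor` of shape `(batch_size, latent_query_num, hidden_size)`, *optional*):
            Sequence of hidden-states coming from an image-to-text projection.
        image_embeds_position_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Mask indicating where in the sequence the image features should be inserted.
        """
        # Toy body so the example runs; a real model would embed `input_ids` and
        # splice `image_embeds` in at the positions flagged by the mask.
        return input_ids
```

Documenting only the non-standard arguments, in signature order, keeps the generated documentation aligned with the function definition and avoids the reordering that the commit's updated `check_docstrings` would otherwise apply.
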

View File

@ -1132,41 +1132,36 @@ class LEDPreTrainedModel(PreTrainedModel):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Base class for LEDEncoder's outputs, with potential hidden states, local and global attentions.
"""
)
# Copied from transformers.models.longformer.modeling_longformer.LongformerBaseModelOutput with Longformer->LEDEncoder # Copied from transformers.models.longformer.modeling_longformer.LongformerBaseModelOutput with Longformer->LEDEncoder
class LEDEncoderBaseModelOutput(ModelOutput): class LEDEncoderBaseModelOutput(ModelOutput):
""" r"""
Base class for LEDEncoder's outputs, with potential hidden states, local and global attentions. attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x +
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Args: Local attentions weights after the attention softmax, used to compute the weighted average in the
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): self-attention heads. Those are the attention weights from every token in the sequence to every token with
Sequence of hidden-states at the output of the last layer of the model. global attention (first `x` values) and to every token in the attention window (remaining `attention_window
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
shape `(batch_size, sequence_length, hidden_size)`. token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
(succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Hidden-states of the model at the output of each layer plus the initial embedding outputs. Global attentions weights after the attention softmax, used to compute the weighted average in the
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): self-attention heads. Those are the attention weights from every token with global attention to every token
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x + in the sequence.
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Local attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token in the sequence to every token with
global attention (first `x` values) and to every token in the attention window (remaining `attention_window
+ 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
(succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Global attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token with global attention to every token
in the sequence.
""" """
last_hidden_state: torch.FloatTensor last_hidden_state: torch.FloatTensor
@ -1176,60 +1171,32 @@ class LEDEncoderBaseModelOutput(ModelOutput):
@dataclass @dataclass
class LEDSeq2SeqModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for model encoder's outputs that also contains: pre-computed hidden states that can speed up sequential Base class for model encoder's outputs that also contains: pre-computed hidden states that can speed up sequential
decoding. decoding.
"""
)
class LEDSeq2SeqModelOutput(ModelOutput):
r"""
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the decoder of the model.
Args: If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1,
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): hidden_size)` is output.
Sequence of hidden-states at the output of the last layer of the decoder of the model. past_key_values (`list[torch.FloatTensor]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
List of `torch.FloatTensor` of length `config.n_layers`, with each tensor of shape `(2, batch_size,
num_heads, sequence_length, embed_size_per_head)`).
If `past_key_values` is used only the last hidden-state of the sequences of shape `(batch_size, 1, Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be
hidden_size)` is output. used (see `past_key_values` input) to speed up sequential decoding.
past_key_values (`list[torch.FloatTensor]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): encoder_global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
List of `torch.FloatTensor` of length `config.n_layers`, with each tensor of shape `(2, batch_size, Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
num_heads, sequence_length, embed_size_per_head)`). where `x` is the number of tokens with global attention mask.
Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be Global attentions weights after the attention softmax, used to compute the weighted average in the
used (see `past_key_values` input) to speed up sequential decoding. self-attention heads. Those are the attention weights from every token with global attention to every token
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): in the sequence.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the
weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
encoder_global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Global attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token with global attention to every token
in the sequence.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -1244,58 +1211,30 @@ class LEDSeq2SeqModelOutput(ModelOutput):
@dataclass @dataclass
class LEDSeq2SeqLMOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for sequence-to-sequence language models outputs. Base class for sequence-to-sequence language models outputs.
"""
)
class LEDSeq2SeqLMOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss.
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`list[torch.FloatTensor]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
List of `torch.FloatTensor` of length `config.n_layers`, with each tensor of shape `(2, batch_size,
num_heads, sequence_length, embed_size_per_head)`).
Args: Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): used (see `past_key_values` input) to speed up sequential decoding.
Language modeling loss. encoder_global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). where `x` is the number of tokens with global attention mask.
past_key_values (`list[torch.FloatTensor]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
List of `torch.FloatTensor` of length `config.n_layers`, with each tensor of shape `(2, batch_size,
num_heads, sequence_length, embed_size_per_head)`).
Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be Global attentions weights after the attention softmax, used to compute the weighted average in the
used (see `past_key_values` input) to speed up sequential decoding. self-attention heads. Those are the attention weights from every token with global attention to every token
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): in the sequence.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the
weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
encoder_global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Global attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token with global attention to every token
in the sequence.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -1311,58 +1250,30 @@ class LEDSeq2SeqLMOutput(ModelOutput):
@dataclass @dataclass
class LEDSeq2SeqSequenceClassifierOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of sequence-to-sequence sentence classification models. Base class for outputs of sequence-to-sequence sentence classification models.
"""
)
class LEDSeq2SeqSequenceClassifierOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `label` is provided):
Classification (or regression if config.num_labels==1) loss.
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
past_key_values (`list[torch.FloatTensor]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
List of `torch.FloatTensor` of length `config.n_layers`, with each tensor of shape `(2, batch_size,
num_heads, sequence_length, embed_size_per_head)`).
Args: Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `label` is provided): used (see `past_key_values` input) to speed up sequential decoding.
Classification (or regression if config.num_labels==1) loss. encoder_global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`): Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
Classification (or regression if config.num_labels==1) scores (before SoftMax). where `x` is the number of tokens with global attention mask.
past_key_values (`list[torch.FloatTensor]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
List of `torch.FloatTensor` of length `config.n_layers`, with each tensor of shape `(2, batch_size,
num_heads, sequence_length, embed_size_per_head)`).
Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be Global attentions weights after the attention softmax, used to compute the weighted average in the
used (see `past_key_values` input) to speed up sequential decoding. self-attention heads. Those are the attention weights from every token with global attention to every token
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): in the sequence.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the
weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
encoder_global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Global attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token with global attention to every token
in the sequence.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -1378,60 +1289,28 @@ class LEDSeq2SeqSequenceClassifierOutput(ModelOutput):
@dataclass @dataclass
class LEDSeq2SeqQuestionAnsweringModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of sequence-to-sequence question answering models. Base class for outputs of sequence-to-sequence question answering models.
"""
)
class LEDSeq2SeqQuestionAnsweringModelOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
past_key_values (`list[torch.FloatTensor]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
List of `torch.FloatTensor` of length `config.n_layers`, with each tensor of shape `(2, batch_size,
num_heads, sequence_length, embed_size_per_head)`).
Args: Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): used (see `past_key_values` input) to speed up sequential decoding.
Total span extraction loss is the sum of a Cross-Entropy for the start and end positions. encoder_global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
start_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`): Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
Span-start scores (before SoftMax). where `x` is the number of tokens with global attention mask.
end_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
Span-end scores (before SoftMax).
past_key_values (`list[torch.FloatTensor]`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
List of `torch.FloatTensor` of length `config.n_layers`, with each tensor of shape `(2, batch_size,
num_heads, sequence_length, embed_size_per_head)`).
Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be Global attentions weights after the attention softmax, used to compute the weighted average in the
used (see `past_key_values` input) to speed up sequential decoding. self-attention heads. Those are the attention weights from every token with global attention to every token
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): in the sequence.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the decoder's cross-attention layer, after the attention softmax, used to compute the
weighted average in the cross-attention heads.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights of the encoder, after the attention softmax, used to compute the weighted average in the
self-attention heads.
encoder_global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Global attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token with global attention to every token
in the sequence.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
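
For a quick local look at what the decorator produces for one of these refactored LED classes, a probe like the following can help. It assumes a transformers install that already contains this change; whether the merged text is exposed through `__doc__` depends on how `auto_docstring` attaches it, so treat the printed output as informational rather than authoritative.

```python
# Print whatever docstring ends up attached to a refactored output class.
from transformers.models.led.modeling_led import LEDSeq2SeqModelOutput

print(LEDSeq2SeqModelOutput.__doc__)
```
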

View File

@ -38,23 +38,21 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class LevitForImageClassificationWithTeacherOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`LevitForImageClassificationWithTeacher`]. Output type of [`LevitForImageClassificationWithTeacher`].
"""
Args: )
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`): class LevitForImageClassificationWithTeacherOutput(ModelOutput):
Prediction scores as the average of the `cls_logits` and `distillation_logits`. r"""
cls_logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`): logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Prediction scores of the classification head (i.e. the linear layer on top of the final hidden state of the Prediction scores as the average of the `cls_logits` and `distillation_logits`.
class token). cls_logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
distillation_logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`): Prediction scores of the classification head (i.e. the linear layer on top of the final hidden state of the
Prediction scores of the distillation head (i.e. the linear layer on top of the final hidden state of the class token).
distillation token). distillation_logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Prediction scores of the distillation head (i.e. the linear layer on top of the final hidden state of the
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of distillation token).
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
plus the initial embedding outputs.
""" """
logits: Optional[torch.FloatTensor] = None logits: Optional[torch.FloatTensor] = None

View File

@ -36,36 +36,38 @@ from .configuration_lightglue import LightGlueConfig
@dataclass @dataclass
class LightGlueKeypointMatchingOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of LightGlue keypoint matching models. Due to the nature of keypoint detection and matching, Base class for outputs of LightGlue keypoint matching models. Due to the nature of keypoint detection and matching,
the number of keypoints is not fixed and can vary from image to image, which makes batching non-trivial. In the the number of keypoints is not fixed and can vary from image to image, which makes batching non-trivial. In the
batch of images, the maximum number of matches is set as the dimension of the matches and matching scores. The mask batch of images, the maximum number of matches is set as the dimension of the matches and matching scores. The mask
tensor is used to indicate which values in the keypoints, matches, matching_scores and prune tensors are keypoint tensor is used to indicate which values in the keypoints, matches, matching_scores and prune tensors are keypoint
matching information. matching information.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*): class LightGlueKeypointMatchingOutput(ModelOutput):
Loss computed during training. r"""
matches (`torch.FloatTensor` of shape `(batch_size, 2, num_matches)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*):
Index of keypoint matched in the other image. Loss computed during training.
matching_scores (`torch.FloatTensor` of shape `(batch_size, 2, num_matches)`): matches (`torch.FloatTensor` of shape `(batch_size, 2, num_matches)`):
Scores of predicted matches. Index of keypoint matched in the other image.
keypoints (`torch.FloatTensor` of shape `(batch_size, num_keypoints, 2)`): matching_scores (`torch.FloatTensor` of shape `(batch_size, 2, num_matches)`):
Absolute (x, y) coordinates of predicted keypoints in a given image. Scores of predicted matches.
prune (`torch.IntTensor` of shape `(batch_size, num_keypoints)`): keypoints (`torch.FloatTensor` of shape `(batch_size, num_keypoints, 2)`):
Pruning mask indicating which keypoints are removed and at which layer. Absolute (x, y) coordinates of predicted keypoints in a given image.
mask (`torch.BoolTensor` of shape `(batch_size, num_keypoints)`): prune (`torch.IntTensor` of shape `(batch_size, num_keypoints)`):
Mask indicating which values in matches, matching_scores, keypoints and prune are keypoint matching Pruning mask indicating which keypoints are removed and at which layer.
information. mask (`torch.BoolTensor` of shape `(batch_size, num_keypoints)`):
hidden_states (`Tuple[torch.FloatTensor, ...]`, *optional*): Mask indicating which values in matches, matching_scores, keypoints and prune are keypoint matching
Tuple of `torch.FloatTensor` (one for the output of each stage) of shape `(batch_size, 2, num_channels, information.
num_keypoints)` returned when `output_hidden_states=True` is passed or when hidden_states (`Tuple[torch.FloatTensor, ...]`, *optional*):
`config.output_hidden_states=True` Tuple of `torch.FloatTensor` (one for the output of each stage) of shape `(batch_size, 2, num_channels,
attentions (`Tuple[torch.FloatTensor, ...]`, *optional*): num_keypoints)` returned when `output_hidden_states=True` is passed or when
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, 2, num_heads, num_keypoints, `config.output_hidden_states=True`
num_keypoints)` returned when `output_attentions=True` is passed or when attentions (`Tuple[torch.FloatTensor, ...]`, *optional*):
`config.output_attentions=True` Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, 2, num_heads, num_keypoints,
num_keypoints)` returned when `output_attentions=True` is passed or when
`config.output_attentions=True`
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -37,9 +37,6 @@ from ..superpoint import SuperPointConfig
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = "LightGlueConfig"
_CHECKPOINT_FOR_DOC = "ETH-CVG/lightglue_superpoint"
class LightGlueConfig(PretrainedConfig): class LightGlueConfig(PretrainedConfig):
r""" r"""
@ -158,36 +155,38 @@ class LightGlueConfig(PretrainedConfig):
@dataclass @dataclass
class LightGlueKeypointMatchingOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of LightGlue keypoint matching models. Due to the nature of keypoint detection and matching, Base class for outputs of LightGlue keypoint matching models. Due to the nature of keypoint detection and matching,
the number of keypoints is not fixed and can vary from image to image, which makes batching non-trivial. In the the number of keypoints is not fixed and can vary from image to image, which makes batching non-trivial. In the
batch of images, the maximum number of matches is set as the dimension of the matches and matching scores. The mask batch of images, the maximum number of matches is set as the dimension of the matches and matching scores. The mask
tensor is used to indicate which values in the keypoints, matches, matching_scores and prune tensors are keypoint tensor is used to indicate which values in the keypoints, matches, matching_scores and prune tensors are keypoint
matching information. matching information.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*): class LightGlueKeypointMatchingOutput(ModelOutput):
Loss computed during training. r"""
matches (`torch.FloatTensor` of shape `(batch_size, 2, num_matches)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*):
Index of keypoint matched in the other image. Loss computed during training.
matching_scores (`torch.FloatTensor` of shape `(batch_size, 2, num_matches)`): matches (`torch.FloatTensor` of shape `(batch_size, 2, num_matches)`):
Scores of predicted matches. Index of keypoint matched in the other image.
keypoints (`torch.FloatTensor` of shape `(batch_size, num_keypoints, 2)`): matching_scores (`torch.FloatTensor` of shape `(batch_size, 2, num_matches)`):
Absolute (x, y) coordinates of predicted keypoints in a given image. Scores of predicted matches.
prune (`torch.IntTensor` of shape `(batch_size, num_keypoints)`): keypoints (`torch.FloatTensor` of shape `(batch_size, num_keypoints, 2)`):
Pruning mask indicating which keypoints are removed and at which layer. Absolute (x, y) coordinates of predicted keypoints in a given image.
mask (`torch.BoolTensor` of shape `(batch_size, num_keypoints)`): prune (`torch.IntTensor` of shape `(batch_size, num_keypoints)`):
Mask indicating which values in matches, matching_scores, keypoints and prune are keypoint matching Pruning mask indicating which keypoints are removed and at which layer.
information. mask (`torch.BoolTensor` of shape `(batch_size, num_keypoints)`):
hidden_states (`Tuple[torch.FloatTensor, ...]`, *optional*): Mask indicating which values in matches, matching_scores, keypoints and prune are keypoint matching
Tuple of `torch.FloatTensor` (one for the output of each stage) of shape `(batch_size, 2, num_channels, information.
num_keypoints)` returned when `output_hidden_states=True` is passed or when hidden_states (`Tuple[torch.FloatTensor, ...]`, *optional*):
`config.output_hidden_states=True` Tuple of `torch.FloatTensor` (one for the output of each stage) of shape `(batch_size, 2, num_channels,
attentions (`Tuple[torch.FloatTensor, ...]`, *optional*): num_keypoints)` returned when `output_hidden_states=True` is passed or when
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, 2, num_heads, num_keypoints, `config.output_hidden_states=True`
num_keypoints)` returned when `output_attentions=True` is passed or when attentions (`Tuple[torch.FloatTensor, ...]`, *optional*):
`config.output_attentions=True` Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, 2, num_heads, num_keypoints,
num_keypoints)` returned when `output_attentions=True` is passed or when
`config.output_attentions=True`
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -719,35 +719,26 @@ class Llama4ForCausalLM(Llama4PreTrainedModel, GenerationMixin):
@dataclass @dataclass
class Llama4CausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Llava causal language model (or autoregressive) outputs. Base class for Llava causal language model (or autoregressive) outputs.
"""
)
class Llama4CausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). image_hidden_states (`torch.FloatTensor`, *optional*):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): A `torch.FloatTensor` of size (batch_size, num_images, sequence_length, hidden_size)`.
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size (batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
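
The Llama4 hunk above, and most of the files that follow, apply the same recipe to a `ModelOutput` subclass: the boilerplate entries for standard fields are dropped, the class-level summary moves into the `custom_intro` argument of `@auto_docstring`, and the `r"""` docstring keeps only the fields whose descriptions are model specific. Below is a minimal sketch of that recipe, not code from this commit: the class and field names are hypothetical, the absolute import stands in for the `from ...utils import auto_docstring` used inside the library, and it assumes a transformers version that already contains this change.

# Sketch only: `MyVisionTextOutput` and its fields are made up for illustration.
from dataclasses import dataclass
from typing import Optional

import torch

from transformers.utils import ModelOutput, auto_docstring


@dataclass
@auto_docstring(
    custom_intro="""
    Base class for the outputs of a hypothetical vision-text model.
    """
)
class MyVisionTextOutput(ModelOutput):
    r"""
    image_hidden_states (`torch.FloatTensor`, *optional*):
        A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
        Image features produced by the vision encoder after projecting the last hidden state.
    """

    # Standard fields such as `loss`, `logits`, `hidden_states` and `attentions` need no
    # handwritten entry; only the model-specific `image_hidden_states` is documented above.
    loss: Optional[torch.FloatTensor] = None
    logits: Optional[torch.FloatTensor] = None
    hidden_states: Optional[tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[tuple[torch.FloatTensor, ...]] = None
    image_hidden_states: Optional[torch.FloatTensor] = None

The surviving entries keep the usual `name (type, *optional*):` layout, so the trimmed docstrings still read like the rest of the library once the standard descriptions are filled back in.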

View File

@ -36,68 +36,48 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class LlavaModelOutputWithPast(BaseModelOutputWithPast): @auto_docstring(
""" custom_intro="""
Base class for Llava outputs, with hidden states and attentions. Base class for Llava outputs, with hidden states and attentions.
"""
)
class LlavaModelOutputWithPast(BaseModelOutputWithPast):
r"""
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): `past_key_values` input) to speed up sequential decoding.
Sequence of hidden-states at the output of the last layer of the model. image_hidden_states (`torch.FloatTensor`, *optional*):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
image_hidden_states: Optional[torch.FloatTensor] = None image_hidden_states: Optional[torch.FloatTensor] = None
@dataclass @dataclass
class LlavaCausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Llava causal language model (or autoregressive) outputs. Base class for Llava causal language model (or autoregressive) outputs.
"""
)
class LlavaCausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). image_hidden_states (`torch.FloatTensor`, *optional*):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -145,68 +145,48 @@ def unpad_image(tensor, original_size):
@dataclass @dataclass
class LlavaNextModelOutputWithPast(BaseModelOutputWithPast): @auto_docstring(
""" custom_intro="""
Base class for Llava outputs, with hidden states and attentions. Base class for Llava outputs, with hidden states and attentions.
"""
)
class LlavaNextModelOutputWithPast(BaseModelOutputWithPast):
r"""
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): `past_key_values` input) to speed up sequential decoding.
Sequence of hidden-states at the output of the last layer of the model. image_hidden_states (`torch.FloatTensor`, *optional*):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
image_hidden_states: Optional[torch.FloatTensor] = None image_hidden_states: Optional[torch.FloatTensor] = None
@dataclass @dataclass
class LlavaNextCausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for LlavaNext causal language model (or autoregressive) outputs. Base class for LlavaNext causal language model (or autoregressive) outputs.
"""
)
class LlavaNextCausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). image_hidden_states (`torch.FloatTensor`, *optional*):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): A `torch.FloatTensor` of size `(batch_size * num_patches, num_images, sequence_length, hidden_size)`.
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size (batch_size * num_patches, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -43,37 +43,25 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class LlavaNextVideoModelOutputWithPast(BaseModelOutputWithPast): @auto_docstring(
""" custom_intro="""
Base class for Llava outputs, with hidden states and attentions. Base class for Llava outputs, with hidden states and attentions.
"""
)
class LlavaNextVideoModelOutputWithPast(BaseModelOutputWithPast):
r"""
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): `past_key_values` input) to speed up sequential decoding.
Sequence of hidden-states at the output of the last layer of the model. image_hidden_states (`torch.FloatTensor`, *optional*):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) video_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
video_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`.
video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
image_hidden_states: Optional[torch.FloatTensor] = None image_hidden_states: Optional[torch.FloatTensor] = None
@ -82,39 +70,29 @@ class LlavaNextVideoModelOutputWithPast(BaseModelOutputWithPast):
@dataclass @dataclass
class LlavaNextVideoCausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for LlavaNextVideo causal language model (or autoregressive) outputs. Base class for LlavaNextVideo causal language model (or autoregressive) outputs.
"""
)
class LlavaNextVideoCausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). image_hidden_states (`torch.FloatTensor`, *optional*):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): A `torch.FloatTensor` of size `(batch_size * num_patches, num_images, sequence_length, hidden_size)`.
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): video_hidden_states (`torch.FloatTensor`, *optional*):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`.
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size (batch_size * num_patches, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
video_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`.
video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -14,7 +14,6 @@
# limitations under the License. # limitations under the License.
import math import math
from dataclasses import dataclass
from typing import Optional, Union from typing import Optional, Union
import torch import torch
@ -182,9 +181,17 @@ class LlavaNextVideoConfig(PretrainedConfig):
super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs) super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
@dataclass
class LlavaNextVideoModelOutputWithPast(LlavaNextModelOutputWithPast): class LlavaNextVideoModelOutputWithPast(LlavaNextModelOutputWithPast):
""" r"""
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
video_hidden_states (`torch.FloatTensor`, *optional*): video_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`. A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`.
video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state. video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
@ -193,9 +200,21 @@ class LlavaNextVideoModelOutputWithPast(LlavaNextModelOutputWithPast):
video_hidden_states: Optional[torch.FloatTensor] = None video_hidden_states: Optional[torch.FloatTensor] = None
@dataclass
class LlavaNextVideoCausalLMOutputWithPast(LlavaNextCausalLMOutputWithPast): class LlavaNextVideoCausalLMOutputWithPast(LlavaNextCausalLMOutputWithPast):
""" r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size * num_patches, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
video_hidden_states (`torch.FloatTensor`, *optional*): video_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`. A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`.
video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state. video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
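
In the modular source above, the same trimming is applied to classes that inherit another model's output: `LlavaNextVideoModelOutputWithPast` subclasses `LlavaNextModelOutputWithPast`, restates only the non-standard entries in an `r"""` block, and drops `@dataclass` together with the `dataclasses` import, while the generated `modeling_llava_next_video.py` shown earlier in this commit still carries both the decorator and the `@auto_docstring` intro. A self-contained sketch of the subclass side of that pattern, written as ordinary runtime code (so it keeps `@dataclass`) and using a hypothetical class name:

# Hypothetical subclass; `LlavaNextModelOutputWithPast` is the real parent class touched above.
# The `@auto_docstring(custom_intro=...)` stacking from the generated file is omitted for brevity.
from dataclasses import dataclass
from typing import Optional

import torch

from transformers.models.llava_next.modeling_llava_next import LlavaNextModelOutputWithPast


@dataclass
class MyVideoModelOutputWithPast(LlavaNextModelOutputWithPast):
    r"""
    video_hidden_states (`torch.FloatTensor`, *optional*):
        A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`.
        Video features produced by the vision encoder after projecting the last hidden state.
    """

    # Only the new field is declared; `past_key_values`, `image_hidden_states` and the rest
    # are inherited from the parent output.
    video_hidden_states: Optional[torch.FloatTensor] = None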

View File

@ -49,37 +49,25 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class LlavaOnevisionModelOutputWithPast(BaseModelOutputWithPast): @auto_docstring(
""" custom_intro="""
Base class for Llava outputs, with hidden states and attentions. Base class for Llava outputs, with hidden states and attentions.
"""
)
class LlavaOnevisionModelOutputWithPast(BaseModelOutputWithPast):
r"""
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): `past_key_values` input) to speed up sequential decoding.
Sequence of hidden-states at the output of the last layer of the model. image_hidden_states (`torch.FloatTensor`, *optional*):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) video_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
video_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`.
video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
image_hidden_states: Optional[torch.FloatTensor] = None image_hidden_states: Optional[torch.FloatTensor] = None
@ -88,39 +76,29 @@ class LlavaOnevisionModelOutputWithPast(BaseModelOutputWithPast):
@dataclass @dataclass
class LlavaOnevisionCausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for LlavaOnevision causal language model (or autoregressive) outputs. Base class for LlavaOnevision causal language model (or autoregressive) outputs.
"""
)
class LlavaOnevisionCausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). image_hidden_states (`torch.FloatTensor`, *optional*):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): A `torch.FloatTensor` of size `(batch_size * num_patches, num_images, sequence_length, hidden_size)`.
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): video_hidden_states (`torch.FloatTensor`, *optional*):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`.
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size (batch_size * num_patches, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
video_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size * num_frames, num_videos, sequence_length, hidden_size)`.
video_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -35,40 +35,35 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class LongformerBaseModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Longformer's outputs, with potential hidden states, local and global attentions. Base class for Longformer's outputs, with potential hidden states, local and global attentions.
"""
)
class LongformerBaseModelOutput(ModelOutput):
r"""
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x +
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Args: Local attentions weights after the attention softmax, used to compute the weighted average in the
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): self-attention heads. Those are the attention weights from every token in the sequence to every token with
Sequence of hidden-states at the output of the last layer of the model. global attention (first `x` values) and to every token in the attention window (remaining `attention_window
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
shape `(batch_size, sequence_length, hidden_size)`. token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
(succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Hidden-states of the model at the output of each layer plus the initial embedding outputs. Global attentions weights after the attention softmax, used to compute the weighted average in the
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): self-attention heads. Those are the attention weights from every token with global attention to every token
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x + in the sequence.
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Local attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token in the sequence to every token with
global attention (first `x` values) and to every token in the attention window (remaining `attention_window
+ 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
(succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Global attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token with global attention to every token
in the sequence.
""" """
last_hidden_state: torch.FloatTensor last_hidden_state: torch.FloatTensor
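
Unlike the vision-language outputs earlier in this commit, the Longformer outputs keep `attentions` and `global_attentions` in their trimmed docstrings: their shapes depend on the attention window and on the number of tokens with global attention, so the stock descriptions would not apply. A quick way to inspect what the decorator ends up exposing for the class changed above; this assumes `__doc__` is assembled when the class is created, which the commit does not state explicitly, so some versions may show only the trimmed text:

# Prints the docstring of a class touched in this hunk; purely read-only, safe to run.
from transformers.models.longformer.modeling_longformer import LongformerBaseModelOutput

print(LongformerBaseModelOutput.__doc__)
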
@ -78,44 +73,39 @@ class LongformerBaseModelOutput(ModelOutput):
@dataclass @dataclass
class LongformerBaseModelOutputWithPooling(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Longformer's outputs that also contains a pooling of the last hidden states. Base class for Longformer's outputs that also contains a pooling of the last hidden states.
"""
)
class LongformerBaseModelOutputWithPooling(ModelOutput):
r"""
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
Last layer hidden-state of the first token of the sequence (classification token) further processed by a
Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence
prediction (classification) objective during pretraining.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x +
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Args: Local attentions weights after the attention softmax, used to compute the weighted average in the
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): self-attention heads. Those are the attention weights from every token in the sequence to every token with
Sequence of hidden-states at the output of the last layer of the model. global attention (first `x` values) and to every token in the attention window (remaining `attention_window
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): + 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
Last layer hidden-state of the first token of the sequence (classification token) further processed by a remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
prediction (classification) objective during pretraining. (succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): If the attention window contains a token with global attention, the attention weight at the corresponding
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
shape `(batch_size, sequence_length, hidden_size)`. attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Hidden-states of the model at the output of each layer plus the initial embedding outputs. Global attentions weights after the attention softmax, used to compute the weighted average in the
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): self-attention heads. Those are the attention weights from every token with global attention to every token
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x + in the sequence.
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Local attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token in the sequence to every token with
global attention (first `x` values) and to every token in the attention window (remaining `attention_window
+ 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
(succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Global attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token with global attention to every token
in the sequence.
""" """
last_hidden_state: torch.FloatTensor last_hidden_state: torch.FloatTensor
@ -126,42 +116,39 @@ class LongformerBaseModelOutputWithPooling(ModelOutput):
@dataclass @dataclass
class LongformerMaskedLMOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for masked language models outputs. Base class for masked language models outputs.
"""
)
class LongformerMaskedLMOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Masked language modeling (MLM) loss.
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x +
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Args: Local attentions weights after the attention softmax, used to compute the weighted average in the
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): self-attention heads. Those are the attention weights from every token in the sequence to every token with
Masked language modeling (MLM) loss. global attention (first `x` values) and to every token in the attention window (remaining `attention_window
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): + 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of (succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
shape `(batch_size, sequence_length, hidden_size)`. If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Hidden-states of the model at the output of each layer plus the initial embedding outputs. Global attentions weights after the attention softmax, used to compute the weighted average in the
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): self-attention heads. Those are the attention weights from every token with global attention to every token
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x + in the sequence.
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Local attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token in the sequence to every token with
global attention (first `x` values) and to every token in the attention window (remaining `attention_window
+ 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
(succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Global attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token with global attention to every token
in the sequence.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -172,44 +159,37 @@ class LongformerMaskedLMOutput(ModelOutput):
@dataclass @dataclass
class LongformerQuestionAnsweringModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of question answering Longformer models. Base class for outputs of question answering Longformer models.
"""
)
class LongformerQuestionAnsweringModelOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x +
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Args: Local attentions weights after the attention softmax, used to compute the weighted average in the
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): self-attention heads. Those are the attention weights from every token in the sequence to every token with
Total span extraction loss is the sum of a Cross-Entropy for the start and end positions. global attention (first `x` values) and to every token in the attention window (remaining `attention_window
start_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`): + 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
Span-start scores (before SoftMax). remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
end_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`): token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
Span-end scores (before SoftMax). (succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): If the attention window contains a token with global attention, the attention weight at the corresponding
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
shape `(batch_size, sequence_length, hidden_size)`. attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Hidden-states of the model at the output of each layer plus the initial embedding outputs. Global attentions weights after the attention softmax, used to compute the weighted average in the
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): self-attention heads. Those are the attention weights from every token with global attention to every token
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x + in the sequence.
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Local attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token in the sequence to every token with
global attention (first `x` values) and to every token in the attention window (remaining `attention_window
+ 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
(succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Global attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token with global attention to every token
in the sequence.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -221,42 +201,39 @@ class LongformerQuestionAnsweringModelOutput(ModelOutput):
@dataclass @dataclass
class LongformerSequenceClassifierOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of sentence classification models. Base class for outputs of sentence classification models.
"""
)
class LongformerSequenceClassifierOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Classification (or regression if config.num_labels==1) loss.
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Classification (or regression if config.num_labels==1) scores (before SoftMax).
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x +
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Args: Local attentions weights after the attention softmax, used to compute the weighted average in the
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): self-attention heads. Those are the attention weights from every token in the sequence to every token with
Classification (or regression if config.num_labels==1) loss. global attention (first `x` values) and to every token in the attention window (remaining `attention_window
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`): + 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
Classification (or regression if config.num_labels==1) scores (before SoftMax). remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of (succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
shape `(batch_size, sequence_length, hidden_size)`. If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Hidden-states of the model at the output of each layer plus the initial embedding outputs. Global attentions weights after the attention softmax, used to compute the weighted average in the
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): self-attention heads. Those are the attention weights from every token with global attention to every token
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x + in the sequence.
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Local attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token in the sequence to every token with
global attention (first `x` values) and to every token in the attention window (remaining `attention_window
+ 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
(succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Global attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token with global attention to every token
in the sequence.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -267,44 +244,41 @@ class LongformerSequenceClassifierOutput(ModelOutput):
@dataclass @dataclass
class LongformerMultipleChoiceModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of multiple choice Longformer models. Base class for outputs of multiple choice Longformer models.
"""
)
class LongformerMultipleChoiceModelOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape *(1,)*, *optional*, returned when `labels` is provided):
Classification loss.
logits (`torch.FloatTensor` of shape `(batch_size, num_choices)`):
*num_choices* is the second dimension of the input tensors. (see *input_ids* above).
Args: Classification scores (before SoftMax).
loss (`torch.FloatTensor` of shape *(1,)*, *optional*, returned when `labels` is provided): attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Classification loss. Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x +
logits (`torch.FloatTensor` of shape `(batch_size, num_choices)`): attention_window + 1)`, where `x` is the number of tokens with global attention mask.
*num_choices* is the second dimension of the input tensors. (see *input_ids* above).
Classification scores (before SoftMax). Local attentions weights after the attention softmax, used to compute the weighted average in the
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): self-attention heads. Those are the attention weights from every token in the sequence to every token with
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of global attention (first `x` values) and to every token in the attention window (remaining `attention_window
shape `(batch_size, sequence_length, hidden_size)`. + 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
(succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Hidden-states of the model at the output of each layer plus the initial embedding outputs. Global attentions weights after the attention softmax, used to compute the weighted average in the
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): self-attention heads. Those are the attention weights from every token with global attention to every token
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x + in the sequence.
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Local attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token in the sequence to every token with
global attention (first `x` values) and to every token in the attention window (remaining `attention_window
+ 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
(succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Global attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token with global attention to every token
in the sequence.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -315,42 +289,39 @@ class LongformerMultipleChoiceModelOutput(ModelOutput):
@dataclass @dataclass
class LongformerTokenClassifierOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of token classification models. Base class for outputs of token classification models.
"""
)
class LongformerTokenClassifierOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Classification loss.
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`):
Classification scores (before SoftMax).
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x +
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Args: Local attentions weights after the attention softmax, used to compute the weighted average in the
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) : self-attention heads. Those are the attention weights from every token in the sequence to every token with
Classification loss. global attention (first `x` values) and to every token in the attention window (remaining `attention_window
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`): + 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
Classification scores (before SoftMax). remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of (succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
shape `(batch_size, sequence_length, hidden_size)`. If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Hidden-states of the model at the output of each layer plus the initial embedding outputs. Global attentions weights after the attention softmax, used to compute the weighted average in the
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): self-attention heads. Those are the attention weights from every token with global attention to every token
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x + in the sequence.
attention_window + 1)`, where `x` is the number of tokens with global attention mask.
Local attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token in the sequence to every token with
global attention (first `x` values) and to every token in the attention window (remaining `attention_window
+ 1` values). Note that the first `x` values refer to tokens with fixed positions in the text, but the
remaining `attention_window + 1` values refer to tokens with relative positions: the attention weight of a
token to itself is located at index `x + attention_window / 2` and the `attention_window / 2` preceding
(succeeding) values are the attention weights to the `attention_window / 2` preceding (succeeding) tokens.
If the attention window contains a token with global attention, the attention weight at the corresponding
index is set to 0; the value should be accessed from the first `x` attention weights. If a token has global
attention, the attention weights to all other tokens in `attentions` is set to 0, the values should be
accessed from `global_attentions`.
global_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, x)`,
where `x` is the number of tokens with global attention mask.
Global attentions weights after the attention softmax, used to compute the weighted average in the
self-attention heads. Those are the attention weights from every token with global attention to every token
in the sequence.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None


@ -36,30 +36,22 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class BaseLukeModelOutputWithPooling(BaseModelOutputWithPooling): @auto_docstring(
""" custom_intro="""
Base class for outputs of the LUKE model. Base class for outputs of the LUKE model.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class BaseLukeModelOutputWithPooling(BaseModelOutputWithPooling):
Sequence of hidden-states at the output of the last layer of the model. r"""
entity_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, entity_length, hidden_size)`): pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
Sequence of entity hidden-states at the output of the last layer of the model. Last layer hidden-state of the first token of the sequence (classification token) further processed by a
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): Linear layer and a Tanh activation function.
Last layer hidden-state of the first token of the sequence (classification token) further processed by a entity_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, entity_length, hidden_size)`):
Linear layer and a Tanh activation function. Sequence of entity hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
plus the initial embedding outputs. layer plus the initial entity embedding outputs.
entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
layer plus the initial entity embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length +
entity_length, sequence_length + entity_length)`. Attentions weights after the attention softmax, used to
compute the weighted average in the self-attention heads.
""" """
entity_last_hidden_state: Optional[torch.FloatTensor] = None entity_last_hidden_state: Optional[torch.FloatTensor] = None
@ -67,30 +59,19 @@ class BaseLukeModelOutputWithPooling(BaseModelOutputWithPooling):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Base class for model's outputs, with potential hidden states and attentions.
"""
)
class BaseLukeModelOutput(BaseModelOutput): class BaseLukeModelOutput(BaseModelOutput):
""" r"""
Base class for model's outputs, with potential hidden states and attentions. entity_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, entity_length, hidden_size)`):
Sequence of entity hidden-states at the output of the last layer of the model.
Args: entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
Sequence of hidden-states at the output of the last layer of the model. shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
entity_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, entity_length, hidden_size)`): layer plus the initial entity embedding outputs.
Sequence of entity hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
layer plus the initial entity embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
entity_last_hidden_state: Optional[torch.FloatTensor] = None entity_last_hidden_state: Optional[torch.FloatTensor] = None
@ -98,36 +79,27 @@ class BaseLukeModelOutput(BaseModelOutput):
@dataclass @dataclass
class LukeMaskedLMOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for model's outputs, with potential hidden states and attentions. Base class for model's outputs, with potential hidden states and attentions.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): class LukeMaskedLMOutput(ModelOutput):
The sum of masked language modeling (MLM) loss and entity prediction loss. r"""
mlm_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Masked language modeling (MLM) loss. The sum of masked language modeling (MLM) loss and entity prediction loss.
mep_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): mlm_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Masked entity prediction (MEP) loss. Masked language modeling (MLM) loss.
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): mep_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). Masked entity prediction (MEP) loss.
entity_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the entity prediction head (scores for each entity vocabulary token before SoftMax). Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): entity_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Prediction scores of the entity prediction head (scores for each entity vocabulary token before SoftMax).
shape `(batch_size, sequence_length, hidden_size)`. entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
Hidden-states of the model at the output of each layer plus the initial embedding outputs. shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): layer plus the initial entity embedding outputs.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
layer plus the initial entity embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -141,27 +113,21 @@ class LukeMaskedLMOutput(ModelOutput):
@dataclass @dataclass
class EntityClassificationOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Outputs of entity classification models. Outputs of entity classification models.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): class EntityClassificationOutput(ModelOutput):
Classification loss. r"""
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Classification scores (before SoftMax). Classification loss.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Classification scores (before SoftMax).
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
plus the initial embedding outputs. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of layer plus the initial entity embedding outputs.
shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
layer plus the initial entity embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -172,27 +138,21 @@ class EntityClassificationOutput(ModelOutput):
@dataclass @dataclass
class EntityPairClassificationOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Outputs of entity pair classification models. Outputs of entity pair classification models.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): class EntityPairClassificationOutput(ModelOutput):
Classification loss. r"""
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Classification scores (before SoftMax). Classification loss.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Classification scores (before SoftMax).
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
plus the initial embedding outputs. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of layer plus the initial entity embedding outputs.
shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
layer plus the initial entity embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -203,27 +163,21 @@ class EntityPairClassificationOutput(ModelOutput):
@dataclass @dataclass
class EntitySpanClassificationOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Outputs of entity span classification models. Outputs of entity span classification models.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): class EntitySpanClassificationOutput(ModelOutput):
Classification loss. r"""
logits (`torch.FloatTensor` of shape `(batch_size, entity_length, config.num_labels)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Classification scores (before SoftMax). Classification loss.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): logits (`torch.FloatTensor` of shape `(batch_size, entity_length, config.num_labels)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Classification scores (before SoftMax).
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
plus the initial embedding outputs. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of layer plus the initial entity embedding outputs.
shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
layer plus the initial entity embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -234,30 +188,21 @@ class EntitySpanClassificationOutput(ModelOutput):
@dataclass @dataclass
class LukeSequenceClassifierOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Outputs of sentence classification models. Outputs of sentence classification models.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): class LukeSequenceClassifierOutput(ModelOutput):
Classification (or regression if config.num_labels==1) loss. r"""
logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Classification (or regression if config.num_labels==1) scores (before SoftMax). Classification (or regression if config.num_labels==1) loss.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): logits (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + Classification (or regression if config.num_labels==1) scores (before SoftMax).
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): layer plus the initial entity embedding outputs.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
layer plus the initial entity embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -268,30 +213,21 @@ class LukeSequenceClassifierOutput(ModelOutput):
@dataclass @dataclass
class LukeTokenClassifierOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for outputs of token classification models. Base class for outputs of token classification models.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) : class LukeTokenClassifierOutput(ModelOutput):
Classification loss. r"""
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Classification scores (before SoftMax). Classification loss.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.num_labels)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + Classification scores (before SoftMax).
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs. shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): layer plus the initial entity embedding outputs.
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
layer plus the initial entity embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -302,32 +238,19 @@ class LukeTokenClassifierOutput(ModelOutput):
@dataclass @dataclass
class LukeQuestionAnsweringModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Outputs of question answering models. Outputs of question answering models.
"""
Args: )
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): class LukeQuestionAnsweringModelOutput(ModelOutput):
Total span extraction loss is the sum of a Cross-Entropy for the start and end positions. r"""
start_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`): loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Span-start scores (before SoftMax). Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
end_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`): entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Span-end scores (before SoftMax). Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + layer plus the initial entity embedding outputs.
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
layer plus the initial entity embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -339,32 +262,23 @@ class LukeQuestionAnsweringModelOutput(ModelOutput):
@dataclass @dataclass
class LukeMultipleChoiceModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Outputs of multiple choice models. Outputs of multiple choice models.
"""
)
class LukeMultipleChoiceModelOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape *(1,)*, *optional*, returned when `labels` is provided):
Classification loss.
logits (`torch.FloatTensor` of shape `(batch_size, num_choices)`):
*num_choices* is the second dimension of the input tensors. (see *input_ids* above).
Args: Classification scores (before SoftMax).
loss (`torch.FloatTensor` of shape *(1,)*, *optional*, returned when `labels` is provided): entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Classification loss. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
logits (`torch.FloatTensor` of shape `(batch_size, num_choices)`): shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
*num_choices* is the second dimension of the input tensors. (see *input_ids* above). layer plus the initial entity embedding outputs.
Classification scores (before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
entity_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, entity_length, hidden_size)`. Entity hidden-states of the model at the output of each
layer plus the initial entity embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None


@ -42,39 +42,40 @@ class GeLU(nn.Module):
@dataclass @dataclass
class LxmertModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Lxmert's outputs that contain the last hidden states, pooled outputs, and attention probabilities for the language, Lxmert's outputs that contain the last hidden states, pooled outputs, and attention probabilities for the language,
visual, and, cross-modality encoders. (note: the visual encoder in Lxmert is referred to as the "relation-ship" visual, and, cross-modality encoders. (note: the visual encoder in Lxmert is referred to as the "relation-ship"
encoder") encoder")
"""
)
Args: class LxmertModelOutput(ModelOutput):
language_output (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): r"""
Sequence of hidden-states at the output of the last layer of the language encoder. language_output (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
vision_output (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): Sequence of hidden-states at the output of the last layer of the language encoder.
Sequence of hidden-states at the output of the last layer of the visual encoder. vision_output (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
pooled_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): Sequence of hidden-states at the output of the last layer of the visual encoder.
Last layer hidden-state of the first token of the sequence (classification, CLS, token) further processed pooled_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
by a Linear layer and a Tanh activation function. The Linear Last layer hidden-state of the first token of the sequence (classification, CLS, token) further processed
language_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): by a Linear layer and a Tanh activation function. The Linear
Tuple of `torch.FloatTensor` (one for input features + one for the output of each cross-modality layer) of language_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
shape `(batch_size, sequence_length, hidden_size)`. Tuple of `torch.FloatTensor` (one for input features + one for the output of each cross-modality layer) of
vision_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for input features + one for the output of each cross-modality layer) of vision_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
shape `(batch_size, sequence_length, hidden_size)`. Tuple of `torch.FloatTensor` (one for input features + one for the output of each cross-modality layer) of
language_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): shape `(batch_size, sequence_length, hidden_size)`.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, language_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
the self-attention heads. sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
vision_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): the self-attention heads.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, vision_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
the self-attention heads. sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
cross_encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): the self-attention heads.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, cross_encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
the self-attention heads. sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.
""" """
language_output: Optional[torch.FloatTensor] = None language_output: Optional[torch.FloatTensor] = None
@ -88,34 +89,36 @@ class LxmertModelOutput(ModelOutput):
@dataclass @dataclass
class LxmertForQuestionAnsweringOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`LxmertForQuestionAnswering`]. Output type of [`LxmertForQuestionAnswering`].
"""
Args: )
loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): class LxmertForQuestionAnsweringOutput(ModelOutput):
Total loss as the sum of the masked language modeling loss and the next sequence prediction r"""
(classification) loss.k. loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
question_answering_score (`torch.FloatTensor` of shape `(batch_size, n_qa_answers)`, *optional*): Total loss as the sum of the masked language modeling loss and the next sequence prediction
Prediction scores of question answering objective (classification). (classification) loss.k.
language_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): question_answering_score (`torch.FloatTensor` of shape `(batch_size, n_qa_answers)`, *optional*):
Tuple of `torch.FloatTensor` (one for input features + one for the output of each cross-modality layer) of Prediction scores of question answering objective (classification).
shape `(batch_size, sequence_length, hidden_size)`. language_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
vision_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Tuple of `torch.FloatTensor` (one for input features + one for the output of each cross-modality layer) of
Tuple of `torch.FloatTensor` (one for input features + one for the output of each cross-modality layer) of shape `(batch_size, sequence_length, hidden_size)`.
shape `(batch_size, sequence_length, hidden_size)`. vision_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
language_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): Tuple of `torch.FloatTensor` (one for input features + one for the output of each cross-modality layer) of
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, shape `(batch_size, sequence_length, hidden_size)`.
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in language_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
the self-attention heads. Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
vision_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, the self-attention heads.
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in vision_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
the self-attention heads. Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
cross_encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, the self-attention heads.
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in cross_encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
the self-attention heads. Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -128,40 +131,41 @@ class LxmertForQuestionAnsweringOutput(ModelOutput):
@dataclass @dataclass
class LxmertForPreTrainingOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`LxmertForPreTraining`]. Output type of [`LxmertForPreTraining`].
"""
Args: )
loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): class LxmertForPreTrainingOutput(ModelOutput):
Total loss as the sum of the masked language modeling loss and the next sequence prediction r"""
(classification) loss. loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): Total loss as the sum of the masked language modeling loss and the next sequence prediction
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). (classification) loss.
cross_relationship_score (`torch.FloatTensor` of shape `(batch_size, 2)`): prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the textual matching objective (classification) head (scores of True/False Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
continuation before SoftMax). cross_relationship_score (`torch.FloatTensor` of shape `(batch_size, 2)`):
question_answering_score (`torch.FloatTensor` of shape `(batch_size, n_qa_answers)`): Prediction scores of the textual matching objective (classification) head (scores of True/False
Prediction scores of question answering objective (classification). continuation before SoftMax).
language_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): question_answering_score (`torch.FloatTensor` of shape `(batch_size, n_qa_answers)`):
Tuple of `torch.FloatTensor` (one for input features + one for the output of each cross-modality layer) of Prediction scores of question answering objective (classification).
shape `(batch_size, sequence_length, hidden_size)`. language_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
vision_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Tuple of `torch.FloatTensor` (one for input features + one for the output of each cross-modality layer) of
Tuple of `torch.FloatTensor` (one for input features + one for the output of each cross-modality layer) of shape `(batch_size, sequence_length, hidden_size)`.
shape `(batch_size, sequence_length, hidden_size)`. vision_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
language_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): Tuple of `torch.FloatTensor` (one for input features + one for the output of each cross-modality layer) of
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, shape `(batch_size, sequence_length, hidden_size)`.
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in language_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
the self-attention heads. Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
vision_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, the self-attention heads.
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in vision_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
the self-attention heads. Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
cross_encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, the self-attention heads.
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in cross_encoder_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
the self-attention heads. Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None


@ -429,23 +429,18 @@ class MambaPreTrainedModel(PreTrainedModel):
@dataclass @dataclass
class MambaOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Class for the MAMBA model outputs. Class for the MAMBA model outputs.
"""
)
class MambaOutput(ModelOutput):
r"""
cache_params (`MambaCache`):
The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
avoid providing the old `input_ids`.
Args: Includes both the State space model state matrices after the selective scan, and the Convolutional states
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
cache_params (`MambaCache`):
The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
avoid providing the old `input_ids`.
Includes both the State space model state matrices after the selective scan, and the Convolutional states
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -454,25 +449,22 @@ class MambaOutput(ModelOutput):
@dataclass @dataclass
class MambaCausalLMOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for causal language model (or autoregressive) outputs. Base class for causal language model (or autoregressive) outputs.
"""
)
class MambaCausalLMOutput(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
cache_params (`MambaCache`):
The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
avoid providing the old `input_ids`.
Args: Includes both the State space model state matrices after the selective scan, and the Convolutional states
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
cache_params (`MambaCache`):
The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
avoid providing the old `input_ids`.
Includes both the State space model state matrices after the selective scan, and the Convolutional states
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
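The `cache_params` field documented above is what lets Mamba decode incrementally: the state returned by one forward pass is reused so later steps only see the new tokens. A rough usage sketch (the checkpoint name is an assumption; `generate` handles the cache bookkeeping internally):

import torch
from transformers import AutoTokenizer, MambaForCausalLM

tokenizer = AutoTokenizer.from_pretrained("state-spaces/mamba-130m-hf")  # assumed checkpoint
model = MambaForCausalLM.from_pretrained("state-spaces/mamba-130m-hf")

inputs = tokenizer("Hello", return_tensors="pt")

# Prefill: the returned MambaCausalLMOutput carries cache_params
# (SSM states + convolutional states) next to the logits.
with torch.no_grad():
    out = model(**inputs, use_cache=True)
print(type(out.cache_params))  # MambaCache

# generate() threads cache_params through the decoding loop, so each step
# only feeds the newly sampled token instead of the full prefix.
generated = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(generated[0], skip_special_tokens=True))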

View File

@ -763,24 +763,19 @@ class Mamba2PreTrainedModel(PreTrainedModel):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Class for the MAMBA2 model outputs.
"""
)
# Copied from transformers.models.mamba.modeling_mamba.MambaOutput with MAMBA->MAMBA2,Mamba->Mamba2 # Copied from transformers.models.mamba.modeling_mamba.MambaOutput with MAMBA->MAMBA2,Mamba->Mamba2
class Mamba2Output(ModelOutput): class Mamba2Output(ModelOutput):
""" r"""
Class for the MAMBA2 model outputs. cache_params (`Mamba2Cache`):
The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
avoid providing the old `input_ids`.
Args: Includes both the State space model state matrices after the selective scan, and the Convolutional states
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
cache_params (`Mamba2Cache`):
The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
avoid providing the old `input_ids`.
Includes both the State space model state matrices after the selective scan, and the Convolutional states
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -789,26 +784,23 @@ class Mamba2Output(ModelOutput):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Base class for causal language model (or autoregressive) outputs.
"""
)
# Copied from transformers.models.mamba.modeling_mamba.MambaCausalLMOutput with Mamba->Mamba2 # Copied from transformers.models.mamba.modeling_mamba.MambaCausalLMOutput with Mamba->Mamba2
class Mamba2CausalLMOutput(ModelOutput): class Mamba2CausalLMOutput(ModelOutput):
""" r"""
Base class for causal language model (or autoregressive) outputs. loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
cache_params (`Mamba2Cache`):
The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
avoid providing the old `input_ids`.
Args: Includes both the State space model state matrices after the selective scan, and the Convolutional states
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
cache_params (`Mamba2Cache`):
The state of the model at the last time step. Can be used in a forward method with the next `input_ids` to
avoid providing the old `input_ids`.
Includes both the State space model state matrices after the selective scan, and the Convolutional states
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -44,22 +44,24 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class Mask2FormerPixelDecoderOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Mask2Former's pixel decoder module output, practically a Multi-Scale Deformable Attention based decoder. It returns Mask2Former's pixel decoder module output, practically a Multi-Scale Deformable Attention based decoder. It returns
the mask features and the multiscale features. the mask features and the multiscale features.
"""
Args: )
multi_scale_features (`tuple(torch.FloatTensor)`): class Mask2FormerPixelDecoderOutput(ModelOutput):
Tuple of multi-scale features of scales [1/8, 1/16, 1/32] and shape `(batch_size, num_channels, height, r"""
width)` from the Multi-Scale Deformable Attention based Pixel Decoder. multi_scale_features (`tuple(torch.FloatTensor)`):
mask_features (`torch.FloatTensor`): Tuple of multi-scale features of scales [1/8, 1/16, 1/32] and shape `(batch_size, num_channels, height,
Tensor of shape `(batch_size, num_channels, height, width)`, 1/4 scale features from the last Pixel Decoder width)` from the Multi-Scale Deformable Attention based Pixel Decoder.
Layer. mask_features (`torch.FloatTensor`):
attentions (`tuple(torch.FloatTensor)`, *optional*): Tensor of shape `(batch_size, num_channels, height, width)`, 1/4 scale features from the last Pixel Decoder
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, Layer.
sequence_length)`. Attentions weights from pixel decoder. Returned when `output_attentions=True` is passed attentions (`tuple(torch.FloatTensor)`, *optional*):
or when `config.output_attentions=True` Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights from pixel decoder. Returned when `output_attentions=True` is passed
or when `config.output_attentions=True`
""" """
multi_scale_features: tuple[torch.FloatTensor] = None multi_scale_features: tuple[torch.FloatTensor] = None
@ -68,28 +70,28 @@ class Mask2FormerPixelDecoderOutput(ModelOutput):
@dataclass @dataclass
class Mask2FormerMaskedAttentionDecoderOutput(BaseModelOutputWithCrossAttentions): @auto_docstring(
""" custom_intro="""
Base class for outputs of the Transformer decoder. This class adds two attributes to Base class for outputs of the Transformer decoder. This class adds two attributes to
BaseModelOutputWithCrossAttentions for mask predictions logits and a tuple of intermediate decoder activations, BaseModelOutputWithCrossAttentions for mask predictions logits and a tuple of intermediate decoder activations,
i.e. the output of each decoder layer, each of them gone through a layernorm. i.e. the output of each decoder layer, each of them gone through a layernorm.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class Mask2FormerMaskedAttentionDecoderOutput(BaseModelOutputWithCrossAttentions):
Sequence of hidden-states at the output of the last layer of the model. r"""
hidden_states (`tuple(torch.FloatTensor)`, *optional*): hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
plus the initial embedding outputs. Returned when `output_hidden_states=True`. plus the initial embedding outputs. Returned when `output_hidden_states=True`.
attentions (`tuple(torch.FloatTensor)`, *optional*): attentions (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in
the self-attention heads. Returned when `output_attentions=True`. the self-attention heads. Returned when `output_attentions=True`.
masks_queries_logits (`tuple(torch.FloatTensor)` of shape `(batch_size, num_queries, height, width)`): masks_queries_logits (`tuple(torch.FloatTensor)` of shape `(batch_size, num_queries, height, width)`):
Tuple of mask predictions from all layers of the transformer decoder. Tuple of mask predictions from all layers of the transformer decoder.
intermediate_hidden_states (`tuple(torch.FloatTensor)` of shape `(num_queries, 1, hidden_size)`): intermediate_hidden_states (`tuple(torch.FloatTensor)` of shape `(num_queries, 1, hidden_size)`):
Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
layernorm. layernorm.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -100,28 +102,30 @@ class Mask2FormerMaskedAttentionDecoderOutput(BaseModelOutputWithCrossAttentions
@dataclass @dataclass
class Mask2FormerPixelLevelModuleOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Mask2Former's pixel level module output. It returns the output of the encoder (optional) and all hidden states Mask2Former's pixel level module output. It returns the output of the encoder (optional) and all hidden states
(multi-scale features) from the `decoder`. By default, the `encoder` is a Swin Backbone and the `decoder` is a (multi-scale features) from the `decoder`. By default, the `encoder` is a Swin Backbone and the `decoder` is a
Multi-Scale Deformable Attention based decoder. Multi-Scale Deformable Attention based decoder.
The `decoder_last_hidden_state` are the **per-pixel embeddings** while `decoder_hidden_states` refer to multi-scale The `decoder_last_hidden_state` are the **per-pixel embeddings** while `decoder_hidden_states` refer to multi-scale
feature maps produced using **multi-scaling strategy** defined in the paper. feature maps produced using **multi-scaling strategy** defined in the paper.
"""
Args: )
encoder_last_hidden_state (`torch.FloatTensor`): class Mask2FormerPixelLevelModuleOutput(ModelOutput):
Last hidden states (final feature map of shape `(batch_size, num_channels, height, width)`) of the last r"""
stage of the encoder. encoder_last_hidden_state (`torch.FloatTensor`):
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*): Last hidden states (final feature map of shape `(batch_size, num_channels, height, width)`) of the last
Tuple of `torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`. Hidden states (also stage of the encoder.
called feature maps) of the model at the output of each stage. Returned if output_hidden_states is set to encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
True. Tuple of `torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`. Hidden states (also
decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): called feature maps) of the model at the output of each stage. Returned if output_hidden_states is set to
1/4 scale features from the last Pixel Decoder Layer. True.
decoder_hidden_states (`tuple(torch.FloatTensor)`): decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Tuple of `torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`. Hidden states (also 1/4 scale features from the last Pixel Decoder Layer.
called feature maps) of the model at the output of each stage. decoder_hidden_states (`tuple(torch.FloatTensor)`):
Tuple of `torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`. Hidden states (also
called feature maps) of the model at the output of each stage.
""" """
encoder_last_hidden_state: Optional[torch.FloatTensor] = None encoder_last_hidden_state: Optional[torch.FloatTensor] = None
@ -131,38 +135,40 @@ class Mask2FormerPixelLevelModuleOutput(ModelOutput):
@dataclass @dataclass
class Mask2FormerModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Class for outputs of [`Mask2FormerModel`]. This class returns all the needed hidden states to compute the logits. Class for outputs of [`Mask2FormerModel`]. This class returns all the needed hidden states to compute the logits.
"""
Args: )
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`, *optional*): class Mask2FormerModelOutput(ModelOutput):
Last hidden states (final feature map) of the last stage of the encoder model (backbone). Returned when r"""
`output_hidden_states=True` is passed. encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`, *optional*):
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*): Last hidden states (final feature map) of the last stage of the encoder model (backbone). Returned when
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of `output_hidden_states=True` is passed.
shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the encoder pixel_decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`, *optional*):
model at the output of each stage. Returned when `output_hidden_states=True` is passed. Last hidden states (final feature map) of the last stage of the pixel decoder model.
pixel_decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`, *optional*): transformer_decoder_last_hidden_state (`tuple(torch.FloatTensor)`):
Last hidden states (final feature map) of the last stage of the pixel decoder model. Final output of the transformer decoder `(batch_size, sequence_length, hidden_size)`.
pixel_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the pixel shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the encoder
decoder model at the output of each stage. Returned when `output_hidden_states=True` is passed. model at the output of each stage. Returned when `output_hidden_states=True` is passed.
transformer_decoder_last_hidden_state (`tuple(torch.FloatTensor)`): pixel_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Final output of the transformer decoder `(batch_size, sequence_length, hidden_size)`. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
transformer_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*): shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the pixel
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of decoder model at the output of each stage. Returned when `output_hidden_states=True` is passed.
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states (also called feature maps) of the transformer_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
transformer decoder at the output of each stage. Returned when `output_hidden_states=True` is passed. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
transformer_decoder_intermediate_states (`tuple(torch.FloatTensor)` of shape `(num_queries, 1, hidden_size)`): shape `(batch_size, sequence_length, hidden_size)`. Hidden-states (also called feature maps) of the
Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a transformer decoder at the output of each stage. Returned when `output_hidden_states=True` is passed.
layernorm. transformer_decoder_intermediate_states (`tuple(torch.FloatTensor)` of shape `(num_queries, 1, hidden_size)`):
masks_queries_logits (`tuple(torch.FloatTensor)` of shape `(batch_size, num_queries, height, width)`): Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
Mask Predictions from each layer in the transformer decoder. layernorm.
attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True` is passed): masks_queries_logits (`tuple(torch.FloatTensor)` of shape `(batch_size, num_queries, height, width)`):
Tuple of `tuple(torch.FloatTensor)` (one for each layer) of shape `(batch_size, num_heads, sequence_length, Mask Predictions from each layer in the transformer decoder.
sequence_length)`. Self attentions weights from transformer decoder. attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True` is passed):
Tuple of `tuple(torch.FloatTensor)` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Self attentions weights from transformer decoder.
""" """
encoder_last_hidden_state: Optional[torch.FloatTensor] = None encoder_last_hidden_state: Optional[torch.FloatTensor] = None
@ -177,47 +183,49 @@ class Mask2FormerModelOutput(ModelOutput):
@dataclass @dataclass
class Mask2FormerForUniversalSegmentationOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Class for outputs of [`Mask2FormerForUniversalSegmentationOutput`]. Class for outputs of [`Mask2FormerForUniversalSegmentationOutput`].
This output can be directly passed to [`~Mask2FormerImageProcessor.post_process_semantic_segmentation`] or This output can be directly passed to [`~Mask2FormerImageProcessor.post_process_semantic_segmentation`] or
[`~Mask2FormerImageProcessor.post_process_instance_segmentation`] or [`~Mask2FormerImageProcessor.post_process_instance_segmentation`] or
[`~Mask2FormerImageProcessor.post_process_panoptic_segmentation`] to compute final segmentation maps. Please, see [`~Mask2FormerImageProcessor.post_process_panoptic_segmentation`] to compute final segmentation maps. Please, see
[`~Mask2FormerImageProcessor`] for details regarding usage. [`~Mask2FormerImageProcessor`] for details regarding usage.
"""
Args: )
loss (`torch.Tensor`, *optional*): class Mask2FormerForUniversalSegmentationOutput(ModelOutput):
The computed loss, returned when labels are present. r"""
class_queries_logits (`torch.FloatTensor`): loss (`torch.Tensor`, *optional*):
A tensor of shape `(batch_size, num_queries, num_labels + 1)` representing the proposed classes for each The computed loss, returned when labels are present.
query. Note the `+ 1` is needed because we incorporate the null class. class_queries_logits (`torch.FloatTensor`):
masks_queries_logits (`torch.FloatTensor`): A tensor of shape `(batch_size, num_queries, num_labels + 1)` representing the proposed classes for each
A tensor of shape `(batch_size, num_queries, height, width)` representing the proposed masks for each query. Note the `+ 1` is needed because we incorporate the null class.
query. masks_queries_logits (`torch.FloatTensor`):
auxiliary_logits (`list[Dict(str, torch.FloatTensor)]`, *optional*): A tensor of shape `(batch_size, num_queries, height, width)` representing the proposed masks for each
List of class and mask predictions from each layer of the transformer decoder. query.
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): auxiliary_logits (`list[Dict(str, torch.FloatTensor)]`, *optional*):
Last hidden states (final feature map) of the last stage of the encoder model (backbone). List of class and mask predictions from each layer of the transformer decoder.
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of Last hidden states (final feature map) of the last stage of the encoder model (backbone).
shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the encoder pixel_decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
model at the output of each stage. Last hidden states (final feature map) of the last stage of the pixel decoder model.
pixel_decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): transformer_decoder_last_hidden_state (`tuple(torch.FloatTensor)`):
Last hidden states (final feature map) of the last stage of the pixel decoder model. Final output of the transformer decoder `(batch_size, sequence_length, hidden_size)`.
pixel_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the pixel shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the encoder
decoder model at the output of each stage. model at the output of each stage.
transformer_decoder_last_hidden_state (`tuple(torch.FloatTensor)`): pixel_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Final output of the transformer decoder `(batch_size, sequence_length, hidden_size)`. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
transformer_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the pixel
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of decoder model at the output of each stage.
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states (also called feature maps) of the transformer_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
transformer decoder at the output of each stage. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): shape `(batch_size, sequence_length, hidden_size)`. Hidden-states (also called feature maps) of the
Tuple of `tuple(torch.FloatTensor)` (one for each layer) of shape `(batch_size, num_heads, sequence_length, transformer decoder at the output of each stage.
sequence_length)`. Self and Cross Attentions weights from transformer decoder. attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `tuple(torch.FloatTensor)` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Self and Cross Attentions weights from transformer decoder.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
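As the intro above notes, this output plugs straight into the image processor's post-processing helpers. A short sketch of the semantic-segmentation path (the checkpoint name and image path are assumptions):

import torch
from PIL import Image
from transformers import Mask2FormerForUniversalSegmentation, Mask2FormerImageProcessor

checkpoint = "facebook/mask2former-swin-tiny-ade-semantic"  # assumed checkpoint
processor = Mask2FormerImageProcessor.from_pretrained(checkpoint)
model = Mask2FormerForUniversalSegmentation.from_pretrained(checkpoint)

image = Image.open("scene.jpg").convert("RGB")  # any RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)  # Mask2FormerForUniversalSegmentationOutput

# The output object is passed as-is; target_sizes expects (height, width) pairs.
semantic_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
print(semantic_map.shape)  # per-pixel class ids at the original resolution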

View File

@ -53,59 +53,53 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
# Copied from transformers.models.detr.modeling_detr.DetrDecoderOutput @auto_docstring(
class DetrDecoderOutput(BaseModelOutputWithCrossAttentions): custom_intro="""
"""
Base class for outputs of the DETR decoder. This class adds one attribute to BaseModelOutputWithCrossAttentions, Base class for outputs of the DETR decoder. This class adds one attribute to BaseModelOutputWithCrossAttentions,
namely an optional stack of intermediate decoder activations, i.e. the output of each decoder layer, each of them namely an optional stack of intermediate decoder activations, i.e. the output of each decoder layer, each of them
gone through a layernorm. This is useful when training the model with auxiliary decoding losses. gone through a layernorm. This is useful when training the model with auxiliary decoding losses.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): # Copied from transformers.models.detr.modeling_detr.DetrDecoderOutput
Sequence of hidden-states at the output of the last layer of the model. class DetrDecoderOutput(BaseModelOutputWithCrossAttentions):
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): r"""
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
plus the initial embedding outputs. sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): used to compute the weighted average in the cross-attention heads.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
sequence_length)`. Attentions weights after the attention softmax, used to compute the weighted average in Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
the self-attention heads. layernorm.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` and `config.add_cross_attention=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights of the decoder's cross-attention layer, after the attention softmax,
used to compute the weighted average in the cross-attention heads.
intermediate_hidden_states (`torch.FloatTensor` of shape `(config.decoder_layers, batch_size, num_queries, hidden_size)`, *optional*, returned when `config.auxiliary_loss=True`):
Intermediate decoder activations, i.e. the output of each decoder layer, each of them gone through a
layernorm.
""" """
intermediate_hidden_states: Optional[torch.FloatTensor] = None intermediate_hidden_states: Optional[torch.FloatTensor] = None
@dataclass @dataclass
class MaskFormerPixelLevelModuleOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
MaskFormer's pixel level module output. It returns both the last and (optionally) the hidden states from the MaskFormer's pixel level module output. It returns both the last and (optionally) the hidden states from the
`encoder` and `decoder`. By default, the `encoder` is a MaskFormerSwin Transformer and the `decoder` is a Feature `encoder` and `decoder`. By default, the `encoder` is a MaskFormerSwin Transformer and the `decoder` is a Feature
Pyramid Network (FPN). Pyramid Network (FPN).
The `encoder_last_hidden_state` is referred to in the paper as **image features**, while `decoder_last_hidden_state` The `encoder_last_hidden_state` is referred to in the paper as **image features**, while `decoder_last_hidden_state`
as **pixel embeddings** as **pixel embeddings**
"""
Args: )
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): class MaskFormerPixelLevelModuleOutput(ModelOutput):
Last hidden states (final feature map) of the last stage of the encoder. r"""
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of Last hidden states (final feature map) of the last stage of the encoder.
shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the model at decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
the output of each stage. Last hidden states (final feature map) of the last stage of the decoder.
decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Last hidden states (final feature map) of the last stage of the decoder. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the model at
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of the output of each stage.
shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the model at decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
the output of each stage. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the model at
the output of each stage.
""" """
encoder_last_hidden_state: Optional[torch.FloatTensor] = None encoder_last_hidden_state: Optional[torch.FloatTensor] = None
@ -115,22 +109,16 @@ class MaskFormerPixelLevelModuleOutput(ModelOutput):
@dataclass @dataclass
class MaskFormerPixelDecoderOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
MaskFormer's pixel decoder module output, practically a Feature Pyramid Network. It returns the last hidden state MaskFormer's pixel decoder module output, practically a Feature Pyramid Network. It returns the last hidden state
and (optionally) the hidden states. and (optionally) the hidden states.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): class MaskFormerPixelDecoderOutput(ModelOutput):
Last hidden states (final feature map) of the last stage of the model. r"""
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of Last hidden states (final feature map) of the last stage of the model.
shape `(batch_size, num_channels, height, width)`. Hidden-states of the model at the output of each layer
plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`. Attentions weights from Detr's decoder after the attention softmax, used to compute the
weighted average in the self-attention heads.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -139,36 +127,34 @@ class MaskFormerPixelDecoderOutput(ModelOutput):
@dataclass @dataclass
class MaskFormerModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Class for outputs of [`MaskFormerModel`]. This class returns all the needed hidden states to compute the logits. Class for outputs of [`MaskFormerModel`]. This class returns all the needed hidden states to compute the logits.
"""
Args: )
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): class MaskFormerModelOutput(ModelOutput):
Last hidden states (final feature map) of the last stage of the encoder model (backbone). r"""
pixel_decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN). Last hidden states (final feature map) of the last stage of the encoder model (backbone).
transformer_decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): pixel_decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Last hidden states (final feature map) of the last stage of the transformer decoder model. Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): transformer_decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of Last hidden states (final feature map) of the last stage of the transformer decoder model.
shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the encoder encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
model at the output of each stage. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
pixel_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the encoder
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of model at the output of each stage.
shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the pixel pixel_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
decoder model at the output of each stage. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
transformer_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the pixel
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of decoder model at the output of each stage.
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states (also called feature maps) of the transformer_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
transformer decoder at the output of each stage. Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): shape `(batch_size, sequence_length, hidden_size)`. Hidden-states (also called feature maps) of the
Tuple of `torch.FloatTensor` containing `encoder_hidden_states`, `pixel_decoder_hidden_states` and transformer decoder at the output of each stage.
`decoder_hidden_states` hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): Tuple of `torch.FloatTensor` containing `encoder_hidden_states`, `pixel_decoder_hidden_states` and
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, `decoder_hidden_states`
sequence_length)`. Attentions weights from Detr's decoder after the attention softmax, used to compute the
weighted average in the self-attention heads.
""" """
encoder_last_hidden_state: Optional[torch.FloatTensor] = None encoder_last_hidden_state: Optional[torch.FloatTensor] = None
@ -182,49 +168,49 @@ class MaskFormerModelOutput(ModelOutput):
@dataclass @dataclass
class MaskFormerForInstanceSegmentationOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Class for outputs of [`MaskFormerForInstanceSegmentation`]. Class for outputs of [`MaskFormerForInstanceSegmentation`].
This output can be directly passed to [`~MaskFormerImageProcessor.post_process_semantic_segmentation`] or This output can be directly passed to [`~MaskFormerImageProcessor.post_process_semantic_segmentation`] or
[`~MaskFormerImageProcessor.post_process_instance_segmentation`] or [`~MaskFormerImageProcessor.post_process_instance_segmentation`] or
[`~MaskFormerImageProcessor.post_process_panoptic_segmentation`] depending on the task. Please, see [`~MaskFormerImageProcessor.post_process_panoptic_segmentation`] depending on the task. Please, see
[`~MaskFormerImageProcessor`] for details regarding usage. [`~MaskFormerImageProcessor`] for details regarding usage.
"""
Args: )
loss (`torch.Tensor`, *optional*): class MaskFormerForInstanceSegmentationOutput(ModelOutput):
The computed loss, returned when labels are present. r"""
class_queries_logits (`torch.FloatTensor`): loss (`torch.Tensor`, *optional*):
A tensor of shape `(batch_size, num_queries, num_labels + 1)` representing the proposed classes for each The computed loss, returned when labels are present.
query. Note the `+ 1` is needed because we incorporate the null class. class_queries_logits (`torch.FloatTensor`):
masks_queries_logits (`torch.FloatTensor`): A tensor of shape `(batch_size, num_queries, num_labels + 1)` representing the proposed classes for each
A tensor of shape `(batch_size, num_queries, height, width)` representing the proposed masks for each query. Note the `+ 1` is needed because we incorporate the null class.
query. masks_queries_logits (`torch.FloatTensor`):
encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): A tensor of shape `(batch_size, num_queries, height, width)` representing the proposed masks for each
Last hidden states (final feature map) of the last stage of the encoder model (backbone). query.
pixel_decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`): auxiliary_logits (`Dict[str, torch.FloatTensor]`, *optional*, returned when `output_auxiliary_logits=True`):
Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN). Dictionary containing auxiliary predictions for each decoder layer when auxiliary losses are enabled.
transformer_decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): encoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Last hidden states (final feature map) of the last stage of the transformer decoder model. Last hidden states (final feature map) of the last stage of the encoder model (backbone).
encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): pixel_decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of Last hidden states (final feature map) of the last stage of the pixel decoder model (FPN).
shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the encoder transformer_decoder_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
model at the output of each stage. Last hidden states (final feature map) of the last stage of the transformer decoder model.
pixel_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): encoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the pixel shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the encoder
decoder model at the output of each stage. model at the output of each stage.
transformer_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): pixel_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the transformer decoder at the output shape `(batch_size, num_channels, height, width)`. Hidden-states (also called feature maps) of the pixel
of each stage. decoder model at the output of each stage.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): transformer_decoder_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` containing `encoder_hidden_states`, `pixel_decoder_hidden_states` and Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each stage) of
`decoder_hidden_states`. shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the transformer decoder at the output
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): of each stage.
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
sequence_length)`. Attentions weights from Detr's decoder after the attention softmax, used to compute the Tuple of `torch.FloatTensor` containing `encoder_hidden_states`, `pixel_decoder_hidden_states` and
weighted average in the self-attention heads. `decoder_hidden_states`.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None

View File

@ -30,36 +30,25 @@ from ...modeling_layers import GradientCheckpointingLayer
from ...modeling_outputs import BackboneOutput from ...modeling_outputs import BackboneOutput
from ...modeling_utils import PreTrainedModel from ...modeling_utils import PreTrainedModel
from ...pytorch_utils import find_pruneable_heads_and_indices, meshgrid, prune_linear_layer from ...pytorch_utils import find_pruneable_heads_and_indices, meshgrid, prune_linear_layer
from ...utils import torch_int from ...utils import auto_docstring, torch_int
from ...utils.backbone_utils import BackboneMixin from ...utils.backbone_utils import BackboneMixin
from .configuration_maskformer_swin import MaskFormerSwinConfig from .configuration_maskformer_swin import MaskFormerSwinConfig
@dataclass @dataclass
class MaskFormerSwinModelOutputWithPooling(ModelOutput): @auto_docstring(
""" custom_intro="""
Class for MaskFormerSwinModel's outputs that also contains the spatial dimensions of the hidden states. Class for MaskFormerSwinModel's outputs that also contains the spatial dimensions of the hidden states.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class MaskFormerSwinModelOutputWithPooling(ModelOutput):
Sequence of hidden-states at the output of the last layer of the model. r"""
pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`):
Last layer hidden-state after a mean pooling operation. Last layer hidden-state after a mean pooling operation.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): hidden_states_spatial_dimensions (`tuple(tuple(int, int))`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of A tuple containing the spatial dimension of each `hidden_state` needed to reshape the `hidden_states` to
shape `(batch_size, sequence_length, hidden_size)`. `batch, channels, height, width`. Due to padding, their spatial size cannot be inferred before the
`forward` method.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
hidden_states_spatial_dimensions (`tuple(tuple(int, int))`, *optional*):
A tuple containing the spatial dimension of each `hidden_state` needed to reshape the `hidden_states` to
`batch, channels, height, width`. Due to padding, their spatial size cannot be inferred before the
`forward` method.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
@ -70,28 +59,17 @@ class MaskFormerSwinModelOutputWithPooling(ModelOutput):
@dataclass @dataclass
class MaskFormerSwinBaseModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Class for SwinEncoder's outputs. Class for SwinEncoder's outputs.
"""
Args: )
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): class MaskFormerSwinBaseModelOutput(ModelOutput):
Sequence of hidden-states at the output of the last layer of the model. r"""
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): hidden_states_spatial_dimensions (`tuple(tuple(int, int))`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of A tuple containing the spatial dimension of each `hidden_state` needed to reshape the `hidden_states` to
shape `(batch_size, sequence_length, hidden_size)`. `batch, channels, height, width`. Due to padding, their spatial size cannot be inferred before the `forward`
method.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
hidden_states_spatial_dimensions (`tuple(tuple(int, int))`, *optional*):
A tuple containing the spatial dimension of each `hidden_state` needed to reshape the `hidden_states` to
`batch, channels, height, width`. Due to padding, their spatial size cannot be inferred before the `forward`
method.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
last_hidden_state: Optional[torch.FloatTensor] = None last_hidden_state: Optional[torch.FloatTensor] = None
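The `hidden_states_spatial_dimensions` entries described above exist so the flattened `(batch, seq_len, hidden)` states can be folded back into feature maps. A minimal sketch of that reshape with stand-in tensors (the permute-then-reshape order follows the usual channels-first convention and is an assumption here):

import torch

# Stand-ins for one entry of `hidden_states` and of `hidden_states_spatial_dimensions`.
batch_size, hidden_size, height, width = 2, 96, 56, 56
hidden_state = torch.randn(batch_size, height * width, hidden_size)
spatial_dimensions = (height, width)

h, w = spatial_dimensions
feature_map = hidden_state.permute(0, 2, 1).reshape(batch_size, hidden_size, h, w)
print(feature_map.shape)  # torch.Size([2, 96, 56, 56])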
@ -759,12 +737,8 @@ class MaskFormerSwinEncoder(nn.Module):
) )
@auto_docstring
class MaskFormerSwinPreTrainedModel(PreTrainedModel): class MaskFormerSwinPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = MaskFormerSwinConfig config_class = MaskFormerSwinConfig
base_model_prefix = "model" base_model_prefix = "model"
main_input_name = "pixel_values" main_input_name = "pixel_values"

View File

@ -700,31 +700,22 @@ class MegatronBertPreTrainedModel(PreTrainedModel):
@dataclass @dataclass
@auto_docstring(
custom_intro="""
Output type of [`MegatronBertForPreTraining`].
"""
)
# Copied from transformers.models.bert.modeling_bert.BertForPreTrainingOutput with Bert->MegatronBert # Copied from transformers.models.bert.modeling_bert.BertForPreTrainingOutput with Bert->MegatronBert
class MegatronBertForPreTrainingOutput(ModelOutput): class MegatronBertForPreTrainingOutput(ModelOutput):
""" r"""
Output type of [`MegatronBertForPreTraining`]. loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
Total loss as the sum of the masked language modeling loss and the next sequence prediction
Args: (classification) loss.
loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Total loss as the sum of the masked language modeling loss and the next sequence prediction Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
(classification) loss. seq_relationship_logits (`torch.FloatTensor` of shape `(batch_size, 2)`):
prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). before SoftMax).
seq_relationship_logits (`torch.FloatTensor` of shape `(batch_size, 2)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None


@ -69,35 +69,26 @@ class MgpstrDropPath(nn.Module):
@dataclass @dataclass
class MgpstrModelOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states. Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
"""
)
class MgpstrModelOutput(ModelOutput):
r"""
logits (`tuple(torch.FloatTensor)` of shape `(batch_size, config.num_character_labels)`):
Tuple of `torch.FloatTensor` (one for the output of character of shape `(batch_size,
config.max_token_length, config.num_character_labels)`, + one for the output of bpe of shape `(batch_size,
config.max_token_length, config.num_bpe_labels)`, + one for the output of wordpiece of shape `(batch_size,
config.max_token_length, config.num_wordpiece_labels)`) .
Args: Classification scores (before SoftMax) of character, bpe and wordpiece.
logits (`tuple(torch.FloatTensor)` of shape `(batch_size, config.num_character_labels)`): a3_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_a3_attentions=True` is passed or when `config.output_a3_attentions=True`):
Tuple of `torch.FloatTensor` (one for the output of character of shape `(batch_size, Tuple of `torch.FloatTensor` (one for the attention of character, + one for the attention of bpe`, + one
config.max_token_length, config.num_character_labels)`, + one for the output of bpe of shape `(batch_size, for the attention of wordpiece) of shape `(batch_size, config.max_token_length, sequence_length)`.
config.max_token_length, config.num_bpe_labels)`, + one for the output of wordpiece of shape `(batch_size,
config.max_token_length, config.num_wordpiece_labels)`) .
Classification scores (before SoftMax) of character, bpe and wordpiece. Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): heads.
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, config.max_token_length,
sequence_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
a3_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_a3_attentions=True` is passed or when `config.output_a3_attentions=True`):
Tuple of `torch.FloatTensor` (one for the attention of character, + one for the attention of bpe`, + one
for the attention of wordpiece) of shape `(batch_size, config.max_token_length, sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
logits: tuple[torch.FloatTensor] = None logits: tuple[torch.FloatTensor] = None
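A hedged usage sketch for the tuple-valued `logits` documented above, using a randomly initialized model (the 32x128 input size and dummy tensor are assumptions, not taken from this diff):

import torch

from transformers import MgpstrConfig, MgpstrForSceneTextRecognition

config = MgpstrConfig()  # defaults; real use would load a pretrained checkpoint
model = MgpstrForSceneTextRecognition(config).eval()

pixel_values = torch.randn(1, 3, 32, 128)  # (batch, channels, height, width) dummy image
with torch.no_grad():
    outputs = model(pixel_values)

# One logits tensor per head, in the order described in the docstring above.
char_logits, bpe_logits, wordpiece_logits = outputs.logits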


@ -47,29 +47,29 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
@auto_docstring
class MimiOutput(ModelOutput): class MimiOutput(ModelOutput):
""" r"""
Args: audio_codes (`torch.LongTensor` of shape `(batch_size, num_quantizers, codes_length)`, *optional*):
audio_codes (`torch.LongTensor` of shape `(batch_size, num_quantizers, codes_length)`, *optional*): Discrete code embeddings computed using `model.encode`.
Discrete code embeddings computed using `model.encode`. audio_values (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*):
audio_values (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*) Decoded audio values, obtained using the decoder part of Mimi.
Decoded audio values, obtained using the decoder part of Mimi. encoder_past_key_values (`Cache`, *optional*):
encoder_past_key_values (`Cache`, *optional*): Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the encoder transformer.
Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the encoder transformer. This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
The model will output the same cache format that is fed as input. The model will output the same cache format that is fed as input.
If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes (those that don't If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes (those that don't
have their past key value states given to this model). have their past key value states given to this model).
decoder_past_key_values (`Cache`, *optional*): decoder_past_key_values (`Cache`, *optional*):
Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the decoder transformer. Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the decoder transformer.
This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`. This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
The model will output the same cache format that is fed as input. The model will output the same cache format that is fed as input.
If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes (those that don't If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes (those that don't
have their past key value states given to this model). have their past key value states given to this model).
""" """
audio_codes: Optional[torch.LongTensor] = None audio_codes: Optional[torch.LongTensor] = None
@ -79,19 +79,19 @@ class MimiOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
class MimiEncoderOutput(ModelOutput): class MimiEncoderOutput(ModelOutput):
""" r"""
Args: audio_codes (`torch.LongTensor` of shape `(batch_size, num_quantizers, codes_length)`, *optional*):
audio_codes (`torch.LongTensor` of shape `(batch_size, num_quantizers, codes_length)`, *optional*): Discrete code embeddings computed using `model.encode`.
Discrete code embeddings computed using `model.encode`. encoder_past_key_values (`Cache`, *optional*):
encoder_past_key_values (`Cache`, *optional*): Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the encoder transformer.
Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the encoder transformer. This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
The model will output the same cache format that is fed as input. The model will output the same cache format that is fed as input.
If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes (those that don't If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes (those that don't
have their past key value states given to this model). have their past key value states given to this model).
""" """
audio_codes: Optional[torch.LongTensor] = None audio_codes: Optional[torch.LongTensor] = None
@ -99,19 +99,19 @@ class MimiEncoderOutput(ModelOutput):
@dataclass @dataclass
@auto_docstring
class MimiDecoderOutput(ModelOutput): class MimiDecoderOutput(ModelOutput):
""" r"""
Args: audio_values (`torch.FloatTensor` of shape `(batch_size, segment_length)`, *optional*):
audio_values (`torch.FloatTensor` of shape `(batch_size, segment_length)`, *optional*): Decoded audio values, obtained using the decoder part of Mimi.
Decoded audio values, obtained using the decoder part of Mimi. decoder_past_key_values (`Cache`, *optional*):
decoder_past_key_values (`Cache`, *optional*): Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the decoder transformer.
Pre-computed hidden-states (key and values in the self-attention blocks) that can be used to speed up sequential decoding of the decoder transformer. This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
This typically consists in the `past_key_values` returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
The model will output the same cache format that is fed as input. The model will output the same cache format that is fed as input.
If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes (those that don't If `past_key_values` are used, the user can optionally input only the last `audio_values` or `audio_codes (those that don't
have their past key value states given to this model). have their past key value states given to this model).
""" """
audio_values: Optional[torch.FloatTensor] = None audio_values: Optional[torch.FloatTensor] = None
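A hedged sketch of how the three Mimi output types above show up in practice, with a randomly initialized model and dummy audio (method names assumed from the public Mimi API):

import torch

from transformers import MimiConfig, MimiModel

model = MimiModel(MimiConfig()).eval()  # defaults; real use would load a pretrained checkpoint
waveform = torch.randn(1, 1, 24000)  # ~1 s of dummy mono audio at the default 24 kHz rate

with torch.no_grad():
    encoded = model.encode(waveform)             # MimiEncoderOutput
    decoded = model.decode(encoded.audio_codes)  # MimiDecoderOutput
    roundtrip = model(waveform)                  # MimiOutput: codes plus reconstructed audio

print(encoded.audio_codes.shape)  # (batch, num_quantizers, codes_length)
print(decoded.audio_values.shape)
print(roundtrip.audio_values.shape)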


@ -1209,17 +1209,6 @@ class MiniMaxForQuestionAnswering(MiniMaxPreTrainedModel):
output_hidden_states: Optional[bool] = None, output_hidden_states: Optional[bool] = None,
**kwargs, **kwargs,
) -> QuestionAnsweringModelOutput: ) -> QuestionAnsweringModelOutput:
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
are not taken into account for computing the loss.
"""
outputs: BaseModelOutputWithPast = self.model( outputs: BaseModelOutputWithPast = self.model(
input_ids, input_ids,
attention_mask=attention_mask, attention_mask=attention_mask,


@ -760,17 +760,6 @@ class MistralForQuestionAnswering(MistralPreTrainedModel):
output_hidden_states: Optional[bool] = None, output_hidden_states: Optional[bool] = None,
**kwargs, **kwargs,
) -> QuestionAnsweringModelOutput: ) -> QuestionAnsweringModelOutput:
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
are not taken into account for computing the loss.
"""
outputs: BaseModelOutputWithPast = self.model( outputs: BaseModelOutputWithPast = self.model(
input_ids, input_ids,
attention_mask=attention_mask, attention_mask=attention_mask,
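The `start_positions`/`end_positions` blocks removed here (and in the MiniMax, modular Mistral and Mixtral hunks) are exactly the arguments auto_docstring now documents centrally. A hedged sketch of what they do, using a deliberately tiny, hypothetical config:

import torch

from transformers import MistralConfig, MistralForQuestionAnswering

config = MistralConfig(
    vocab_size=1000, hidden_size=64, intermediate_size=128,
    num_hidden_layers=2, num_attention_heads=4, num_key_value_heads=4,
)
model = MistralForQuestionAnswering(config)

input_ids = torch.randint(0, config.vocab_size, (2, 16))
start_positions = torch.tensor([3, 5])  # index where each answer span starts
end_positions = torch.tensor([6, 9])    # index where each answer span ends

outputs = model(input_ids, start_positions=start_positions, end_positions=end_positions)
print(outputs.loss, outputs.start_logits.shape, outputs.end_logits.shape)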


@ -29,8 +29,6 @@ from .configuration_mistral import MistralConfig
logger = logging.get_logger(__name__) logger = logging.get_logger(__name__)
_CHECKPOINT_FOR_DOC = "mistralai/Mistral-7B-v0.1"
class MistralMLP(LlamaMLP): class MistralMLP(LlamaMLP):
def __init__(self, config): def __init__(self, config):
@ -247,17 +245,6 @@ class MistralForQuestionAnswering(LlamaForQuestionAnswering):
output_hidden_states: Optional[bool] = None, output_hidden_states: Optional[bool] = None,
**kwargs, **kwargs,
) -> QuestionAnsweringModelOutput: ) -> QuestionAnsweringModelOutput:
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
are not taken into account for computing the loss.
"""
outputs: BaseModelOutputWithPast = self.model( outputs: BaseModelOutputWithPast = self.model(
input_ids, input_ids,
attention_mask=attention_mask, attention_mask=attention_mask,


@ -123,35 +123,26 @@ class Mistral3MultiModalProjector(nn.Module):
@dataclass @dataclass
class Mistral3CausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
Base class for Mistral3 causal language model (or autoregressive) outputs. Base class for Mistral3 causal language model (or autoregressive) outputs.
"""
)
class Mistral3CausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction). image_hidden_states (`torch.FloatTensor`, *optional*):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -163,33 +154,22 @@ class Mistral3CausalLMOutputWithPast(ModelOutput):
@dataclass @dataclass
class Mistral3ModelOutputWithPast(BaseModelOutputWithPast): @auto_docstring(
""" custom_intro="""
Base class for Mistral3 outputs, with hidden states and attentions. Base class for Mistral3 outputs, with hidden states and attentions.
"""
)
class Mistral3ModelOutputWithPast(BaseModelOutputWithPast):
r"""
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): `past_key_values` input) to speed up sequential decoding.
Sequence of hidden-states at the output of the last layer of the model. image_hidden_states (`torch.FloatTensor`, *optional*):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`torch.FloatTensor`, *optional*):
A `torch.FloatTensor` of size `(batch_size, num_images, sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder and after projecting the last hidden state.
""" """
image_hidden_states: Optional[torch.FloatTensor] = None image_hidden_states: Optional[torch.FloatTensor] = None
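Everything touched in this diff is a `ModelOutput` dataclass, so fields like `image_hidden_states` above can be read by attribute, by key, or positionally; a small self-contained sketch with dummy values:

from dataclasses import dataclass
from typing import Optional

import torch

from transformers.utils import ModelOutput


@dataclass
class TinyOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    logits: Optional[torch.FloatTensor] = None
    image_hidden_states: Optional[torch.FloatTensor] = None


out = TinyOutput(logits=torch.zeros(2, 4), image_hidden_states=torch.zeros(2, 3, 8))

print(out.image_hidden_states.shape)  # attribute access
print(out["logits"].shape)            # dict-style access
print(len(out.to_tuple()))            # 2 -- fields left as None are dropped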


@ -992,17 +992,6 @@ class MixtralForQuestionAnswering(MixtralPreTrainedModel):
output_hidden_states: Optional[bool] = None, output_hidden_states: Optional[bool] = None,
**kwargs, **kwargs,
) -> QuestionAnsweringModelOutput: ) -> QuestionAnsweringModelOutput:
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
are not taken into account for computing the loss.
"""
outputs: BaseModelOutputWithPast = self.model( outputs: BaseModelOutputWithPast = self.model(
input_ids, input_ids,
attention_mask=attention_mask, attention_mask=attention_mask,


@ -1531,10 +1531,6 @@ class MllamaForCausalLM(MllamaPreTrainedModel, GenerationMixin):
For each text token (in seq_length): For each text token (in seq_length):
- 1 indicates the token **should attend** to the corresponding image tile - 1 indicates the token **should attend** to the corresponding image tile
- 0 indicates the token **should not attend** to the corresponding image tile - 0 indicates the token **should not attend** to the corresponding image tile
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
full_text_row_masked_out_mask (`tuple[torch.Tensor, torch.Tensor]`, *optional*): full_text_row_masked_out_mask (`tuple[torch.Tensor, torch.Tensor]`, *optional*):
A tuple containing two tensors that mask out rows in the cross-attention mechanism: A tuple containing two tensors that mask out rows in the cross-attention mechanism:
- The first tensor has shape `(batch_size, 1, seq_length, 1)` and contains values of 0 or 1. - The first tensor has shape `(batch_size, 1, seq_length, 1)` and contains values of 0 or 1.
@ -1544,6 +1540,10 @@ class MllamaForCausalLM(MllamaPreTrainedModel, GenerationMixin):
the forward pass of cross-attention layers. the forward pass of cross-attention layers.
This mask is derived from the cross_attention_mask and is used to handle cases where a text token This mask is derived from the cross_attention_mask and is used to handle cases where a text token
should not attend to any image token. should not attend to any image token.
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
Example: Example:
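The `labels` description moved in this hunk relies on `-100` as the ignore index for the loss; a minimal sketch of building such labels so only the answer tokens contribute (token ids and prompt length are made up):

import torch

input_ids = torch.tensor([[101, 7592, 2088, 2003, 1037, 2742, 102]])
labels = input_ids.clone()
prompt_length = 4
labels[:, :prompt_length] = -100  # positions set to -100 are ignored by the cross-entropy loss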


@ -678,30 +678,21 @@ class MobileBertPreTrainedModel(PreTrainedModel):
@dataclass @dataclass
class MobileBertForPreTrainingOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Output type of [`MobileBertForPreTraining`]. Output type of [`MobileBertForPreTraining`].
"""
Args: )
loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`): class MobileBertForPreTrainingOutput(ModelOutput):
Total loss as the sum of the masked language modeling loss and the next sequence prediction r"""
(classification) loss. loss (*optional*, returned when `labels` is provided, `torch.FloatTensor` of shape `(1,)`):
prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): Total loss as the sum of the masked language modeling loss and the next sequence prediction
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax). (classification) loss.
seq_relationship_logits (`torch.FloatTensor` of shape `(batch_size, 2)`): prediction_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
before SoftMax). seq_relationship_logits (`torch.FloatTensor` of shape `(batch_size, 2)`):
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation
Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of before SoftMax).
shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None


@ -50,41 +50,43 @@ logger = logging.get_logger(__name__)
@dataclass @dataclass
class MoshiConditionalGenerationGenerateOutput(ModelOutput): @auto_docstring(
""" custom_intro="""
Outputs of [`MoshiForConditionalGeneration.generate`]. Outputs of [`MoshiForConditionalGeneration.generate`].
"""
Args: )
audio_sequences (`torch.LongTensor` of shape `(batch_size*num_return_sequences, 1, sequence_length)`, *optional*): class MoshiConditionalGenerationGenerateOutput(ModelOutput):
The generated audio waveforms. r"""
sequences (`torch.LongTensor` of shape `(batch_size*num_return_sequences, sequence_length)`): audio_sequences (`torch.LongTensor` of shape `(batch_size*num_return_sequences, 1, sequence_length)`, *optional*):
The generated text sequences. The second dimension (sequence_length) is either equal to `max_length` or shorter The generated audio waveforms.
if all batches finished early due to the `eos_token_id`. sequences (`torch.LongTensor` of shape `(batch_size*num_return_sequences, sequence_length)`):
sequences_scores (`torch.FloatTensor` of shape `(batch_size*num_return_sequences)`, *optional*, returned when `output_scores=True`): The generated text sequences. The second dimension (sequence_length) is either equal to `max_length` or shorter
Final beam scores of the generated `sequences`. if all batches finished early due to the `eos_token_id`.
scores (`tuple(torch.FloatTensor)` *optional*, returned when `output_scores=True`): sequences_scores (`torch.FloatTensor` of shape `(batch_size*num_return_sequences)`, *optional*, returned when `output_scores=True`):
Beam transition scores for each vocabulary token at each generation step. Beam transition scores consisting Final beam scores of the generated `sequences`.
of log probabilities of tokens conditioned on log softmax of previously generated tokens in this beam. scores (`tuple(torch.FloatTensor)` *optional*, returned when `output_scores=True`):
Tuple of `torch.FloatTensor` with up to `max_new_tokens` elements (one element for each generated token), Beam transition scores for each vocabulary token at each generation step. Beam transition scores consisting
with each tensor of shape `(batch_size*num_beams, config.vocab_size)`. of log probabilities of tokens conditioned on log softmax of previously generated tokens in this beam.
logits (`tuple(torch.FloatTensor)` *optional*, returned when `output_logits=True`): Tuple of `torch.FloatTensor` with up to `max_new_tokens` elements (one element for each generated token),
Unprocessed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) with each tensor of shape `(batch_size*num_beams, config.vocab_size)`.
at each generation step. Tuple of `torch.FloatTensor` with up to `max_new_tokens` elements (one element for logits (`tuple(torch.FloatTensor)` *optional*, returned when `output_logits=True`):
each generated token), with each tensor of shape `(batch_size, config.vocab_size)`. Unprocessed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax)
beam_indices (`torch.LongTensor`, *optional*, returned when `output_scores=True`): at each generation step. Tuple of `torch.FloatTensor` with up to `max_new_tokens` elements (one element for
Beam indices of generated token id at each generation step. `torch.LongTensor` of shape each generated token), with each tensor of shape `(batch_size, config.vocab_size)`.
`(batch_size*num_return_sequences, sequence_length)`. beam_indices (`torch.LongTensor`, *optional*, returned when `output_scores=True`):
attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True`): Beam indices of generated token id at each generation step. `torch.LongTensor` of shape
Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of `(batch_size*num_return_sequences, sequence_length)`.
`torch.FloatTensor` of shape `(batch_size*num_beams, num_heads, generated_length, sequence_length)`. attentions (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_attentions=True`):
hidden_states (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_hidden_states=True`): Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of
Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of `torch.FloatTensor` of shape `(batch_size*num_beams, num_heads, generated_length, sequence_length)`.
`torch.FloatTensor` of shape `(batch_size*num_beams*num_return_sequences, generated_length, hidden_size)`. hidden_states (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `output_hidden_states=True`):
past_key_values (`tuple(tuple(torch.FloatTensor)))`, *optional*, returned when `use_cache=True`): Tuple (one element for each generated token) of tuples (one element for each layer of the decoder) of
Returns the model cache, used to speed up decoding. Different models have a different cache format, check `torch.FloatTensor` of shape `(batch_size*num_beams*num_return_sequences, generated_length, hidden_size)`.
the model's documentation. Usually, a [`~cache_utils.Cache`] instance. past_key_values (`tuple(tuple(torch.FloatTensor)))`, *optional*, returned when `use_cache=True`):
audio_codes (`torch.LongTensor` of shape `(batch_size*num_return_sequences, num_codebooks, sequence_length)`, *optional*): Contains the model cache, used to speed up decoding. Different models have a different cache format, check
The generated audio codes. Returned if `return_audio_codes=True`. Intermediate audio "tokens" which transforms to `audio_sequences` once passed through the audio decoder. the model's documentation. Usually, a [`~cache_utils.Cache`] instance.
audio_codes (`torch.LongTensor` of shape `(batch_size*num_return_sequences, num_codebooks, sequence_length)`, *optional*):
The generated audio codes. Returned if `return_audio_codes=True`. Intermediate audio "tokens" which transforms to `audio_sequences` once passed through the audio decoder.
""" """
audio_sequences: Optional[torch.Tensor] = None audio_sequences: Optional[torch.Tensor] = None
@ -100,34 +102,23 @@ class MoshiConditionalGenerationGenerateOutput(ModelOutput):
@dataclass @dataclass
class MoshiCausalLMOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
`MoshiForCausalLM` outputs. `MoshiForCausalLM` outputs.
"""
)
class MoshiCausalLMOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided): `past_key_values` input) to speed up sequential decoding.
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -139,45 +130,34 @@ class MoshiCausalLMOutputWithPast(ModelOutput):
@dataclass @dataclass
class MoshiConditionalGenerationOutputWithPast(ModelOutput): @auto_docstring(
""" custom_intro="""
`MoshiForConditionalGeneration` outputs. `MoshiForConditionalGeneration` outputs.
"""
)
class MoshiConditionalGenerationOutputWithPast(ModelOutput):
r"""
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `text_labels` is provided):
Text language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the text language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`)
Args: Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `text_labels` is provided): `past_key_values` input) to speed up sequential decoding.
Text language modeling loss (for next-token prediction). depth_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `audio_labels` is provided):
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`): Audio language modeling loss (for next-token prediction).
Prediction scores of the text language modeling head (scores for each vocabulary token before SoftMax). audio_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): Prediction scores of the audio language modeling heads.
Sequence of hidden-states at the output of the last layer of the model. depth_past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`): Past key-values of the depth decoder.
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape depth_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) Hidden states of the depth decoder
depth_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see Depth decoder's Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
`past_key_values` input) to speed up sequential decoding. heads.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
depth_loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `audio_labels` is provided):
Audio language modeling loss (for next-token prediction).
audio_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the audio language modeling heads.
depth_past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Past key-values of the depth decoder.
depth_hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Hidden states of the depth decoder
depth_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Depth decoder's Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
""" """
loss: Optional[torch.FloatTensor] = None loss: Optional[torch.FloatTensor] = None
@ -194,18 +174,18 @@ class MoshiConditionalGenerationOutputWithPast(ModelOutput):
@dataclass @dataclass
@auto_docstring
class MoshiUnconditionalInput(ModelOutput): class MoshiUnconditionalInput(ModelOutput):
""" r"""
Args: input_ids (`torch.Tensor `of shape `(batch_size, sequence_length), *optional*):
input_ids (`torch.Tensor `of shape `(batch_size, sequence_length), *optional*): The sequence used as a text prompt for the generation.
The sequence used as a text prompt for the generation. user_audio_codes (`torch.Tensor `of shape `(batch_size, num_codebooks, sequence_length), *optional*):
user_audio_codes (`torch.Tensor `of shape `(batch_size, num_codebooks, sequence_length), *optional*): The audio codes used as audio user prompt for the generation. Has priority over `user_input_values` and represents the audio "tokens" of `user_input_values` once passed through the audio encoder.
The audio codes used as audio user prompt for the generation. Has priority over `user_input_values` and represents the audio "tokens" of `user_input_values` once passed through the audio encoder. moshi_audio_codes (`torch.Tensor `of shape `(batch_size, num_codebooks, sequence_length), *optional*):
moshi_audio_codes (`torch.Tensor `of shape `(batch_size, num_codebooks, sequence_length), *optional*): The audio codes used as audio Moshi prompt for the generation. Has priority over `moshi_input_values` and represents the audio "tokens" of `moshi_input_values` once passed through the audio encoder.
The audio codes used as audio Moshi prompt for the generation. Has priority over `moshi_input_values` and represents the audio "tokens" of `moshi_input_values` once passed through the audio encoder. attention_mask (`torch.LongTensor`) of shape `(batch_size, sequence_length)`, *optional*):
attention_mask (`torch.LongTensor`) of shape `(batch_size, sequence_length)`, *optional*): Attention mask to avoid performing attention on padding token indices. Mask values selected in `[0,
Attention mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: 1 for tokens that are **not masked**, 0 for tokens that are **masked**.
1]`: 1 for tokens that are **not masked**, 0 for tokens that are **masked**.
""" """
input_ids: Optional[torch.LongTensor] = None input_ids: Optional[torch.LongTensor] = None
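A hedged end-to-end sketch tying the Moshi outputs above together; the checkpoint name and the `get_unconditional_inputs` helper come from the public Moshi documentation rather than from this diff, so treat the exact call as an assumption:

from transformers import MoshiForConditionalGeneration

model = MoshiForConditionalGeneration.from_pretrained("kyutai/moshiko-pytorch-bf16")

# MoshiUnconditionalInput: text and audio prompts for unprompted generation.
unconditional_inputs = model.get_unconditional_inputs(num_samples=1)

# MoshiConditionalGenerationGenerateOutput: text tokens, decoded speech and,
# when requested, the intermediate audio codes.
output = model.generate(**unconditional_inputs, max_new_tokens=25, return_audio_codes=True)
print(output.sequences.shape)
print(output.audio_sequences.shape)
print(output.audio_codes.shape)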

Some files were not shown because too many files have changed in this diff.