
# Utilities for Generation

This page lists all the utility functions used by [`~generation.GenerationMixin.generate`], [`~generation.GenerationMixin.greedy_search`], [`~generation.GenerationMixin.contrastive_search`], [`~generation.GenerationMixin.sample`], [`~generation.GenerationMixin.beam_search`], [`~generation.GenerationMixin.beam_sample`], [`~generation.GenerationMixin.group_beam_search`], and [`~generation.GenerationMixin.constrained_beam_search`].

Most of those are only useful if you are studying the code of the generate methods in the library.

## Generate Outputs

The output of [`~generation.GenerationMixin.generate`] is an instance of a subclass of [`~utils.ModelOutput`]. This output is a data structure containing all the information returned by [`~generation.GenerationMixin.generate`], but it can also be used as a tuple or a dictionary.

Here's an example:

```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
generation_output = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)
```

The `generation_output` object is a [`~generation.GreedySearchDecoderOnlyOutput`]. As we can see in the documentation of that class below, it has the following attributes:

- `sequences`: the generated sequences of tokens
- `scores` (optional): the prediction scores of the language modeling head, for each generation step
- `hidden_states` (optional): the hidden states of the model, for each generation step
- `attentions` (optional): the attention weights of the model, for each generation step

Here we have the `scores` since we passed along `output_scores=True`, but we don't have `hidden_states` and `attentions` because we didn't pass `output_hidden_states=True` or `output_attentions=True`.

You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you will get `None`. Here, for instance, `generation_output.scores` contains all the generated prediction scores of the language modeling head, and `generation_output.attentions` is `None`.

When using our `generation_output` object as a tuple, it only keeps the attributes that don't have `None` values. Here, for instance, it has two elements, `sequences` then `scores`, so

```python
generation_output[:2]
```

will return the tuple `(generation_output.sequences, generation_output.scores)`.

When using our `generation_output` object as a dictionary, it only keeps the attributes that don't have `None` values. Here, for instance, it has two keys, which are `sequences` and `scores`.
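Concretely, with the call above, the following access patterns all return the same tensor (a minimal sketch, reusing the `generation_output` object from the example):

```python
sequences = generation_output.sequences     # attribute access
sequences = generation_output["sequences"]  # dictionary-style access
sequences = generation_output[0]            # tuple-style access (first non-None field)
```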

We document here all output types.

### PyTorch

[[autodoc]] generation.GreedySearchEncoderDecoderOutput

[[autodoc]] generation.GreedySearchDecoderOnlyOutput

[[autodoc]] generation.SampleEncoderDecoderOutput

[[autodoc]] generation.SampleDecoderOnlyOutput

[[autodoc]] generation.BeamSearchEncoderDecoderOutput

[[autodoc]] generation.BeamSearchDecoderOnlyOutput

[[autodoc]] generation.BeamSampleEncoderDecoderOutput

[[autodoc]] generation.BeamSampleDecoderOnlyOutput

[[autodoc]] generation.ContrastiveSearchEncoderDecoderOutput

[[autodoc]] generation.ContrastiveSearchDecoderOnlyOutput

### TensorFlow

[[autodoc]] generation.TFGreedySearchEncoderDecoderOutput

[[autodoc]] generation.TFGreedySearchDecoderOnlyOutput

[[autodoc]] generation.TFSampleEncoderDecoderOutput

[[autodoc]] generation.TFSampleDecoderOnlyOutput

[[autodoc]] generation.TFBeamSearchEncoderDecoderOutput

[[autodoc]] generation.TFBeamSearchDecoderOnlyOutput

[[autodoc]] generation.TFBeamSampleEncoderDecoderOutput

[[autodoc]] generation.TFBeamSampleDecoderOnlyOutput

[[autodoc]] generation.TFContrastiveSearchEncoderDecoderOutput

[[autodoc]] generation.TFContrastiveSearchDecoderOnlyOutput

### FLAX

[[autodoc]] generation.FlaxSampleOutput

[[autodoc]] generation.FlaxGreedySearchOutput

[[autodoc]] generation.FlaxBeamSearchOutput

## LogitsProcessor

A [`LogitsProcessor`] can be used to modify the prediction scores of a language model head for generation.
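As a rough sketch of the interface (the class and parameter names below are our own, not part of the library), a custom processor subclasses [`LogitsProcessor`] and implements `__call__`, which receives the running `input_ids` and the next-token `scores` and returns the modified scores:

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList


class ScoreScalingProcessor(LogitsProcessor):
    """Hypothetical processor that divides every score by a constant."""

    def __init__(self, scale: float):
        self.scale = scale

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # scores has shape (batch_size, vocab_size)
        return scores / self.scale


processors = LogitsProcessorList([ScoreScalingProcessor(2.0)])
# These can then be passed to generation, e.g.:
# outputs = model.generate(**inputs, logits_processor=processors)
```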

### PyTorch

[[autodoc]] AlternatingCodebooksLogitsProcessor
    - __call__

[[autodoc]] ClassifierFreeGuidanceLogitsProcessor
    - __call__

[[autodoc]] EncoderNoRepeatNGramLogitsProcessor
    - __call__

[[autodoc]] EncoderRepetitionPenaltyLogitsProcessor
    - __call__

[[autodoc]] EpsilonLogitsWarper
    - __call__

[[autodoc]] EtaLogitsWarper
    - __call__

[[autodoc]] ExponentialDecayLengthPenalty
    - __call__

[[autodoc]] ForcedBOSTokenLogitsProcessor
    - __call__

[[autodoc]] ForcedEOSTokenLogitsProcessor
    - __call__

[[autodoc]] ForceTokensLogitsProcessor
    - __call__

[[autodoc]] HammingDiversityLogitsProcessor
    - __call__

[[autodoc]] InfNanRemoveLogitsProcessor
    - __call__

[[autodoc]] LogitNormalization
    - __call__

[[autodoc]] LogitsProcessor
    - __call__

[[autodoc]] LogitsProcessorList
    - __call__

[[autodoc]] LogitsWarper
    - __call__

[[autodoc]] MinLengthLogitsProcessor
    - __call__

[[autodoc]] MinNewTokensLengthLogitsProcessor
    - __call__

[[autodoc]] NoBadWordsLogitsProcessor
    - __call__

[[autodoc]] NoRepeatNGramLogitsProcessor
    - __call__

[[autodoc]] PrefixConstrainedLogitsProcessor
    - __call__

[[autodoc]] RepetitionPenaltyLogitsProcessor
    - __call__

[[autodoc]] SequenceBiasLogitsProcessor
    - __call__

[[autodoc]] SuppressTokensAtBeginLogitsProcessor
    - __call__

[[autodoc]] SuppressTokensLogitsProcessor
    - __call__

[[autodoc]] TemperatureLogitsWarper
    - __call__

[[autodoc]] TopKLogitsWarper
    - __call__

[[autodoc]] TopPLogitsWarper
    - __call__

[[autodoc]] TypicalLogitsWarper
    - __call__

[[autodoc]] UnbatchedClassifierFreeGuidanceLogitsProcessor
    - __call__

[[autodoc]] WhisperTimeStampLogitsProcessor
    - __call__

### TensorFlow

[[autodoc]] TFForcedBOSTokenLogitsProcessor
    - __call__

[[autodoc]] TFForcedEOSTokenLogitsProcessor
    - __call__

[[autodoc]] TFForceTokensLogitsProcessor
    - __call__

[[autodoc]] TFLogitsProcessor
    - __call__

[[autodoc]] TFLogitsProcessorList
    - __call__

[[autodoc]] TFLogitsWarper
    - __call__

[[autodoc]] TFMinLengthLogitsProcessor
    - __call__

[[autodoc]] TFNoBadWordsLogitsProcessor
    - __call__

[[autodoc]] TFNoRepeatNGramLogitsProcessor
    - __call__

[[autodoc]] TFRepetitionPenaltyLogitsProcessor
    - __call__

[[autodoc]] TFSuppressTokensAtBeginLogitsProcessor
    - __call__

[[autodoc]] TFSuppressTokensLogitsProcessor
    - __call__

[[autodoc]] TFTemperatureLogitsWarper
    - __call__

[[autodoc]] TFTopKLogitsWarper
    - __call__

[[autodoc]] TFTopPLogitsWarper
    - __call__

### FLAX

[[autodoc]] FlaxForcedBOSTokenLogitsProcessor
    - __call__

[[autodoc]] FlaxForcedEOSTokenLogitsProcessor
    - __call__

[[autodoc]] FlaxForceTokensLogitsProcessor
    - __call__

[[autodoc]] FlaxLogitsProcessor
    - __call__

[[autodoc]] FlaxLogitsProcessorList
    - __call__

[[autodoc]] FlaxLogitsWarper
    - __call__

[[autodoc]] FlaxMinLengthLogitsProcessor
    - __call__

[[autodoc]] FlaxSuppressTokensAtBeginLogitsProcessor
    - __call__

[[autodoc]] FlaxSuppressTokensLogitsProcessor
    - __call__

[[autodoc]] FlaxTemperatureLogitsWarper
    - __call__

[[autodoc]] FlaxTopKLogitsWarper
    - __call__

[[autodoc]] FlaxTopPLogitsWarper
    - __call__

[[autodoc]] FlaxWhisperTimeStampLogitsProcessor
    - __call__

## StoppingCriteria

A [`StoppingCriteria`] can be used to change when to stop generation (other than an EOS token). Please note that this is exclusively available to our PyTorch implementations.
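As a rough sketch (the class below is a hypothetical example of ours, not part of the library), a custom criterion subclasses [`StoppingCriteria`] and implements `__call__`, returning `True` when generation should stop:

```python
import torch
from transformers import MaxLengthCriteria, StoppingCriteria, StoppingCriteriaList


class StopOnToken(StoppingCriteria):
    """Hypothetical criterion that stops once every sequence ends with a given token."""

    def __init__(self, stop_token_id: int):
        self.stop_token_id = stop_token_id

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        return bool((input_ids[:, -1] == self.stop_token_id).all())


criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=64), StopOnToken(stop_token_id=13)])
# outputs = model.generate(**inputs, stopping_criteria=criteria)
```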

[[autodoc]] StoppingCriteria
    - __call__

[[autodoc]] StoppingCriteriaList
    - __call__

[[autodoc]] MaxLengthCriteria
    - __call__

[[autodoc]] MaxTimeCriteria
    - __call__

## Constraints

A [`Constraint`] can be used to force the generation to include specific tokens or sequences in the output. Please note that this is exclusively available to our PyTorch implementations.
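For illustration, here is a minimal sketch of constrained generation with a [`PhrasalConstraint`] (the prompt and forced phrase are arbitrary choices of ours); note that constrained beam search requires `num_beams > 1`:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, PhrasalConstraint

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Token ids of a phrase that must appear somewhere in the generated text
phrase_ids = tokenizer(" very cute", add_special_tokens=False).input_ids
constraint = PhrasalConstraint(phrase_ids)

inputs = tokenizer("Hello, my dog is", return_tensors="pt")
outputs = model.generate(**inputs, constraints=[constraint], num_beams=4, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```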

[[autodoc]] Constraint

[[autodoc]] PhrasalConstraint

[[autodoc]] DisjunctiveConstraint

[[autodoc]] ConstraintListState

## BeamSearch

[[autodoc]] BeamScorer
    - process
    - finalize

[[autodoc]] BeamSearchScorer
    - process
    - finalize

[[autodoc]] ConstrainedBeamSearchScorer
    - process
    - finalize

## Utilities

[[autodoc]] top_k_top_p_filtering

[[autodoc]] tf_top_k_top_p_filtering

## Streamers

[[autodoc]] TextStreamer

[[autodoc]] TextIteratorStreamer
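For example, [`TextStreamer`] prints decoded tokens to stdout as they are produced (a minimal sketch, reusing the gpt2 checkpoint from the example above):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextStreamer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Hello, my dog is cute and", return_tensors="pt")
streamer = TextStreamer(tokenizer, skip_prompt=True)

# Decoded text is printed to stdout while generate() produces tokens
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=20)
```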

## Caches

[[autodoc]] Cache
    - update

[[autodoc]] DynamicCache
    - update
    - get_seq_length
    - reorder_cache
    - to_legacy_cache
    - from_legacy_cache

[[autodoc]] SinkCache
    - update
    - get_seq_length
    - reorder_cache
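As a rough sketch of how a cache instance can be used (the checkpoint is a placeholder; only models ported to the new cache classes, such as Llama, Mistral, Persimmon, and Phi, accept them), a [`SinkCache`] keeps a handful of initial "sink" tokens plus a sliding window of the most recent tokens, which lets generation continue past the window length:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tokenizer("Hello, my dog is cute and", return_tensors="pt")

# Cache at most 256 tokens, keeping the first 4 as attention sinks
cache = SinkCache(window_length=256, num_sink_tokens=4)
outputs = model.generate(**inputs, past_key_values=cache, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```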