
Utilities for Generation
This page lists all the utility functions used by [~generation.GenerationMixin.generate], [~generation.GenerationMixin.greedy_search], [~generation.GenerationMixin.contrastive_search], [~generation.GenerationMixin.sample], [~generation.GenerationMixin.beam_search], [~generation.GenerationMixin.beam_sample], [~generation.GenerationMixin.group_beam_search], and [~generation.GenerationMixin.constrained_beam_search].
Most of those are only useful if you are studying the code of the generate methods in the library.
Generate Outputs
The output of [~generation.GenerationMixin.generate] is an instance of a subclass of [~utils.ModelOutput]. This output is a data structure containing all the information returned by [~generation.GenerationMixin.generate], but one that can also be used as a tuple or a dictionary.
Here's an example:
```python
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("Hello, my dog is cute and ", return_tensors="pt")
# return_dict_in_generate=True returns a ModelOutput subclass instead of a plain tensor
generation_output = model.generate(**inputs, return_dict_in_generate=True, output_scores=True)
```
The generation_output object is a [~generation.GreedySearchDecoderOnlyOutput]. As we can see in the documentation of that class below, it has the following attributes:

- sequences: the generated sequences of tokens
- scores (optional): the prediction scores of the language modeling head, for each generation step
- hidden_states (optional): the hidden states of the model, for each generation step
- attentions (optional): the attention weights of the model, for each generation step
Here we have the scores since we passed along output_scores=True, but we don't have hidden_states and attentions because we didn't pass output_hidden_states=True or output_attentions=True.
You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you will get None. Here, for instance, generation_output.scores are all the generated prediction scores of the language modeling head, and generation_output.attentions is None.
When using our generation_output object as a tuple, it only keeps the attributes that don't have None values. Here, for instance, it has two elements, sequences then scores, so generation_output[:2] will return the tuple (generation_output.sequences, generation_output.scores).
When using our generation_output object as a dictionary, it only keeps the attributes that don't have None values. Here, for instance, it has two keys, which are sequences and scores.
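As a quick illustration of the two access patterns described above, here is a minimal sketch reusing the generation_output object from the example:

```python
# Tuple-like access: only the non-None attributes are kept, in declaration order
sequences, scores = generation_output[:2]

# Dict-like access by attribute name
assert generation_output["sequences"] is generation_output.sequences

# Attributes that were not requested come back as None
assert generation_output.attentions is None
```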
We document here all output types.
PyTorch
autodoc generation.GreedySearchEncoderDecoderOutput
autodoc generation.GreedySearchDecoderOnlyOutput
autodoc generation.SampleEncoderDecoderOutput
autodoc generation.SampleDecoderOnlyOutput
autodoc generation.BeamSearchEncoderDecoderOutput
autodoc generation.BeamSearchDecoderOnlyOutput
autodoc generation.BeamSampleEncoderDecoderOutput
autodoc generation.BeamSampleDecoderOnlyOutput
autodoc generation.ContrastiveSearchEncoderDecoderOutput
autodoc generation.ContrastiveSearchDecoderOnlyOutput
TensorFlow
autodoc generation.TFGreedySearchEncoderDecoderOutput
autodoc generation.TFGreedySearchDecoderOnlyOutput
autodoc generation.TFSampleEncoderDecoderOutput
autodoc generation.TFSampleDecoderOnlyOutput
autodoc generation.TFBeamSearchEncoderDecoderOutput
autodoc generation.TFBeamSearchDecoderOnlyOutput
autodoc generation.TFBeamSampleEncoderDecoderOutput
autodoc generation.TFBeamSampleDecoderOnlyOutput
autodoc generation.TFContrastiveSearchEncoderDecoderOutput
autodoc generation.TFContrastiveSearchDecoderOnlyOutput
FLAX
autodoc generation.FlaxSampleOutput
autodoc generation.FlaxGreedySearchOutput
autodoc generation.FlaxBeamSearchOutput
LogitsProcessor
A [LogitsProcessor] can be used to modify the prediction scores of a language model head for generation.
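For instance, here is a minimal sketch of plugging a processor into generation, using MinLengthLogitsProcessor (documented below); the model and prompt are arbitrary choices for illustration:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessorList,
    MinLengthLogitsProcessor,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Hello, my dog is", return_tensors="pt")

# Prevent the EOS token from being generated before 15 tokens have been produced
logits_processor = LogitsProcessorList(
    [MinLengthLogitsProcessor(15, eos_token_id=model.config.eos_token_id)]
)
outputs = model.generate(**inputs, logits_processor=logits_processor)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```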
PyTorch
autodoc AlternatingCodebooksLogitsProcessor - call
autodoc ClassifierFreeGuidanceLogitsProcessor - call
autodoc EncoderNoRepeatNGramLogitsProcessor - call
autodoc EncoderRepetitionPenaltyLogitsProcessor - call
autodoc EpsilonLogitsWarper - call
autodoc EtaLogitsWarper - call
autodoc ExponentialDecayLengthPenalty - call
autodoc ForcedBOSTokenLogitsProcessor - call
autodoc ForcedEOSTokenLogitsProcessor - call
autodoc ForceTokensLogitsProcessor - call
autodoc HammingDiversityLogitsProcessor - call
autodoc InfNanRemoveLogitsProcessor - call
autodoc LogitNormalization - call
autodoc LogitsProcessor - call
autodoc LogitsProcessorList - call
autodoc LogitsWarper - call
autodoc MinLengthLogitsProcessor - call
autodoc MinNewTokensLengthLogitsProcessor - call
autodoc NoBadWordsLogitsProcessor - call
autodoc NoRepeatNGramLogitsProcessor - call
autodoc PrefixConstrainedLogitsProcessor - call
autodoc RepetitionPenaltyLogitsProcessor - call
autodoc SequenceBiasLogitsProcessor - call
autodoc SuppressTokensAtBeginLogitsProcessor - call
autodoc SuppressTokensLogitsProcessor - call
autodoc TemperatureLogitsWarper - call
autodoc TopKLogitsWarper - call
autodoc TopPLogitsWarper - call
autodoc TypicalLogitsWarper - call
autodoc UnbatchedClassifierFreeGuidanceLogitsProcessor - call
autodoc WhisperTimeStampLogitsProcessor - call
TensorFlow
autodoc TFForcedBOSTokenLogitsProcessor - call
autodoc TFForcedEOSTokenLogitsProcessor - call
autodoc TFForceTokensLogitsProcessor - call
autodoc TFLogitsProcessor - call
autodoc TFLogitsProcessorList - call
autodoc TFLogitsWarper - call
autodoc TFMinLengthLogitsProcessor - call
autodoc TFNoBadWordsLogitsProcessor - call
autodoc TFNoRepeatNGramLogitsProcessor - call
autodoc TFRepetitionPenaltyLogitsProcessor - call
autodoc TFSuppressTokensAtBeginLogitsProcessor - call
autodoc TFSuppressTokensLogitsProcessor - call
autodoc TFTemperatureLogitsWarper - call
autodoc TFTopKLogitsWarper - call
autodoc TFTopPLogitsWarper - call
FLAX
autodoc FlaxForcedBOSTokenLogitsProcessor - call
autodoc FlaxForcedEOSTokenLogitsProcessor - call
autodoc FlaxForceTokensLogitsProcessor - call
autodoc FlaxLogitsProcessor - call
autodoc FlaxLogitsProcessorList - call
autodoc FlaxLogitsWarper - call
autodoc FlaxMinLengthLogitsProcessor - call
autodoc FlaxSuppressTokensAtBeginLogitsProcessor - call
autodoc FlaxSuppressTokensLogitsProcessor - call
autodoc FlaxTemperatureLogitsWarper - call
autodoc FlaxTopKLogitsWarper - call
autodoc FlaxTopPLogitsWarper - call
autodoc FlaxWhisperTimeStampLogitsProcessor - call
StoppingCriteria
A [StoppingCriteria] can be used to change when to stop generation (other than an EOS token). Please note that this is exclusively available to our PyTorch implementations.
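As a minimal sketch of how these combine with generate (the model and the length limit are arbitrary choices for illustration):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    MaxLengthCriteria,
    StoppingCriteriaList,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Hello, my dog is", return_tensors="pt")

# Stop once the total sequence (prompt + generated tokens) reaches 20 tokens
stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])
outputs = model.generate(**inputs, stopping_criteria=stopping_criteria)
```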
autodoc StoppingCriteria - call
autodoc StoppingCriteriaList - call
autodoc MaxLengthCriteria - call
autodoc MaxTimeCriteria - call
Constraints
A [Constraint] can be used to force the generation to include specific tokens or sequences in the output. Please note that this is exclusively available to our PyTorch implementations.
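For example, a minimal sketch forcing a phrase to appear in the output with [PhrasalConstraint]; constrained generation requires beam search, hence num_beams > 1, and the model, prompt, and phrase are arbitrary choices for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, PhrasalConstraint

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The weather today is", return_tensors="pt")

# Force the token sequence for " very cold" to appear somewhere in the generation
constraint = PhrasalConstraint(tokenizer(" very cold", add_special_tokens=False).input_ids)
outputs = model.generate(**inputs, constraints=[constraint], num_beams=4, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```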
autodoc Constraint
autodoc PhrasalConstraint
autodoc DisjunctiveConstraint
autodoc ConstraintListState
BeamSearch
autodoc BeamScorer - process - finalize
autodoc BeamSearchScorer - process - finalize
autodoc ConstrainedBeamSearchScorer - process - finalize
Utilities
autodoc top_k_top_p_filtering
autodoc tf_top_k_top_p_filtering
Streamers
autodoc TextStreamer
autodoc TextIteratorStreamer
Caches
autodoc Cache - update
autodoc DynamicCache - update - get_seq_length - reorder_cache - to_legacy_cache - from_legacy_cache
autodoc SinkCache - update - get_seq_length - reorder_cache
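As a minimal sketch of passing a cache object to generation, here is an example using SinkCache; this assumes a model whose attention layers support the new Cache classes (such as Llama), and the window sizes are arbitrary choices for illustration:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hello, my dog is", return_tensors="pt")

# Keep 4 "attention sink" tokens plus a rolling window of recent tokens,
# so generation can keep going past the usual cache growth
cache = SinkCache(window_length=256, num_sink_tokens=4)
outputs = model.generate(**inputs, past_key_values=cache, max_new_tokens=50)
```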