[docs] update cache docs with new info (#38775)

* update docs with new info

* Update docs/source/en/kv_cache.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Raushan Turganbay, 2025-06-13 09:10:56 +02:00, committed by GitHub
parent 324cc77dc3
commit e26ae89281


@@ -261,7 +261,9 @@ A cache can also work in iterative generation settings where there is back-and-forth
 For iterative generation with a cache, start by initializing an empty cache class and then you can feed in your new prompts. Keep track of dialogue history with a [chat template](./chat_templating).
 
-The example below demonstrates how to use a cache for iterative generation.
+The following example demonstrates [Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). If you're using a different chat-style model, [`~PreTrainedTokenizer.apply_chat_template`] may process messages differently. It might cut out important tokens depending on how the Jinja template is written.
+
+For example, some models use special `<think> ... </think>` tokens during reasoning. These could get lost during re-encoding, causing indexing issues. You might need to manually remove or adjust extra tokens from the completions to keep things stable.
 
 ```py
 import torch
@@ -281,7 +283,6 @@ tokenizer = AutoTokenizer.from_pretrained(model_id)
 user_prompts = ["Hello, what's your name?", "Btw, yesterday I was on a rock concert."]
 
 past_key_values = DynamicCache()
-max_cache_length = past_key_values.get_max_length()
 
 messages = []
 for prompt in user_prompts:
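
For context, here is a minimal runnable sketch of the iterative-generation snippet this diff edits, reconstructed from the fragments shown above. The model id comes from the doc; `max_new_tokens=256` and the `re.sub` cleanup for `<think> ... </think>` spans are illustrative assumptions, not part of the committed example:

```py
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Llama-2-7b-chat-hf"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

user_prompts = ["Hello, what's your name?", "Btw, yesterday I was on a rock concert."]

# initialize an empty cache once and reuse it across turns; DynamicCache grows
# as needed, so no max length has to be queried up front (hence the removed line)
past_key_values = DynamicCache()

messages = []
for prompt in user_prompts:
    messages.append({"role": "user", "content": prompt})
    # re-encode the full dialogue history through the chat template
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
    ).to(model.device)
    input_length = inputs["input_ids"].shape[1]
    outputs = model.generate(**inputs, past_key_values=past_key_values, max_new_tokens=256)
    completion = tokenizer.decode(outputs[0, input_length:], skip_special_tokens=True)
    # hypothetical cleanup for reasoning models: strip <think> ... </think> spans so the
    # re-encoded history stays aligned with the cache (adjust to your model's markup)
    completion = re.sub(r"<think>.*?</think>", "", completion, flags=re.DOTALL).strip()
    messages.append({"role": "assistant", "content": completion})

print(messages)
```

Dropping the completion's reasoning markup before appending it to `messages` keeps the re-encoded history consistent with what is already stored in the cache, which is the indexing concern the new paragraph warns about.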