Commit Graph

36 Commits

Author SHA1 Message Date
Manuel de Prada Corral
d34e21e7dd
New cache tests and refactored Hybrid Cache (#37972) 2025-05-20 12:46:13 +02:00
Yao Matrix
3bd1c20149
enable misc cases on XPU & use device agnostic APIs for cases in tests (#38192)
* use device agnostic APIs in tests

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* more

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* fix style

Signed-off-by: Matrix Yao <matrix.yao@intel.com>

* add reset_peak_memory_stats API

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* update

---------

Signed-off-by: Matrix Yao <matrix.yao@intel.com>
Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
2025-05-20 10:09:01 +02:00
Yao Matrix
a72cb31434
enable utils test cases on XPU (#38005)
* enable utils test cases on XPU

Signed-off-by: Yao Matrix <matrix.yao@intel.com>

* fix style

Signed-off-by: Yao Matrix <matrix.yao@intel.com>

* Update tests/utils/test_skip_decorators.py

Co-authored-by: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>

* fix comment

Signed-off-by: Yao Matrix <matrix.yao@intel.com>

---------

Signed-off-by: Yao Matrix <matrix.yao@intel.com>
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
Co-authored-by: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
2025-05-09 08:45:01 +02:00
Joao Gante
f2b59c6173
[caches] Raise exception on offloaded static caches + multi device (#37974)
* skip tests on >1 gpu

* add todo
2025-05-08 14:37:36 +01:00
Joao Gante
9981214d32
[tests] Smaller model in slow cache tests (#37922) 2025-05-06 11:15:25 +01:00
Joao Gante
1b222903c3
[tests] Test all cache implementations (#37873) 2025-04-30 15:37:00 +01:00
Guang Yang
a57274466f
Allow override inputs to export recipe (#37508)
Add option to specify dynamic shapes during export

Co-authored-by: Guang Yang <guangyang@fb.com>
2025-04-30 10:19:27 +02:00
Joao Gante
755b0fa2fe
[tests] reorganize cache tests and clean memory between tests (#37684) 2025-04-29 12:21:14 +01:00
Poedator
7c62e69326
GPT2Model StaticCache support (#35761)
* initial GPT2 changes

* causal_mask support

* return_legacy_cache

* cleanup

* fix1

* outputs shape fixes

* gpt2 return fix

* pkv, attn fixes

* fix dual_head

* is_causal arg fix

* decision transformer updated

* style fix

* batch_size from inputs_embeds

* DecisionTransformerModel fixes

* cross-attn support + cache warning

* x-attn @decision

* EDCache proper init

* simplified logic in `if use_cache:` for GPT2Model

* @deprecate_kwarg for DecisionTr attn fwd

* @deprecate_kwarg in gpt2

* deprecation version updated to 4.51

* kwargs in gradient_checkpointing_fn

* rename next_cache to past_key_values

* attention_mask prep

* +cache_position in GPT2DoubleHeadsModel

* undo kwargs in gradient checkpointing

* moved up `if self.gradient_checkpointing`

* consistency in decision_transformer

* pastkv, cache_pos in grad_checkpt args

* rm _reorder_cache

* output_attentions streamlined

* decision_transformer consistency

* return_legacy_cache improved

* ClvpForCausalLM used for legacy cache test now

* is_causal fixed

* attn_output cleanup

* consistency @ decision_transformer

* Updated deprecation notice version to 4.52

* upd deprecation

* consistent legacy cache code in decision transformers\

* next_cache -> past_kv in decision_tr

* cache support flags in decision_transf

* rm legacy cache warning

* consistency in cache init for decision transf

* no Static Cache for Decision Transformer

---------

Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
2025-04-24 14:46:35 +02:00
cyyever
1e6b546ea6
Use Python 3.9 syntax in tests (#37343)
Signed-off-by: cyy <cyyever@outlook.com>
2025-04-08 14:12:08 +02:00
cyyever
6cc9c8d7d1
Remove deprecated batch_size parameter (#37007) 2025-03-27 15:01:56 +00:00
Joao Gante
bc1c90a755
[Utils] torch version checks optionally accept dev versions (#36847) 2025-03-25 10:58:58 +00:00
Tugsbayasgalan Manlaibaatar
f39f4960f3
Support tracable dynamicKVcache (#36311)
* Support tracable dynamicKVcache

* Fix lint

* More fine grained test

* Lint

* Update

* Update

* Fix up

* Apply suggestions from code review

* Update src/transformers/cache_utils.py

* Update tests/utils/test_cache_utils.py

* Apply suggestions from code review

* Update

* Change error message

* Rename

* Apply suggestions from code review

* Apply suggestions from code review

* Apply suggestions from code review

---------

Co-authored-by: Ilyas Moutawwakil <57442720+IlyasMoutawwakil@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
2025-03-19 16:52:30 +00:00
Yao Matrix
b11050d6a2
enable OffloadedCache on XPU from PyTorch 2.7 (#36654)
* fix "Cannot copy out of meta tensor; no data!" issue for BartForConditionalGeneration model

* follow Marc's suggestion to use _tie_weights to fix

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* enable OffloadedCache on XPU since PyTorch 2.7

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* fix style

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>

* don't change bart

Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>

* make code more concise per review comments

Signed-off-by: N <matrix.yao@intel.com>

* fix review comments

Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>

* Revert "fix review comments"

This reverts commit acf1484b86.

* fix review comments

Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>

* fix style

Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>

---------

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Signed-off-by: root <root@a4bf01945cfe.jf.intel.com>
Signed-off-by: N <matrix.yao@intel.com>
Co-authored-by: root <root@a4bf01945cfe.jf.intel.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
2025-03-19 15:15:52 +01:00
Joao Gante
c4161238bd
[Cache] Don't initialize the cache on meta device (#36543) 2025-03-13 10:13:29 +00:00
Joao Gante
8aed019764
[generate] torch.distributed-compatible DynamicCache (#36373)
* test

* docstring

* prepare distributed cache data

* fix cat dim

* test mvp

* add test checks

* like this?

* working test and solution

* nit

* nit

* add shape info
2025-02-27 11:48:57 +00:00
Ilyas Moutawwakil
5e2183f344
Make cache traceable (#35873)
simply make cache traceable
2025-02-20 09:59:25 +01:00
Joao Gante
ece8c42488
Test: generate with torch.compile(model.forward) as a fast test (#34544) 2025-01-28 14:10:38 +00:00
Raushan Turganbay
373e50e970
Init cache on meta device (#35164)
* init cache on meta device

* offloaded static + enable tests

* tests weren't running before  :(

* update

* fix mamba

* fix copies

* update

* address comments and fix tests

* fix copies

* Update src/transformers/cache_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* update

* mamba fix

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2025-01-22 09:49:17 +01:00
jiqing-feng
387663e571
Enable gptqmodel (#35012)
* gptqmodel

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix format

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* update readme

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* gptqmodel need use checkpoint_format (#1)

* gptqmodel need use checkpoint_format

* fix quantize

* Update quantization_config.py

* Update quantization_config.py

* Update quantization_config.py

---------

Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>

* Revert quantizer_gptq.py (#2)

* revert quantizer_gptq.py change

* pass **kwargs

* limit gptqmodel and optimum version

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix format

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix warning

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix version check

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* revert unrelated changes

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* enable gptqmodel tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix requires gptq

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* Fix Transformer compat (#3)

* revert quantizer_gptq.py change

* pass **kwargs

* add meta info

* cleanup

* cleanup

* Update quantization_config.py

* hf_select_quant_linear pass checkpoint_format and meta

* fix GPTQTestCUDA

* Update test_gptq.py

* gptqmodel.hf_select_quant_linear() now does not select ExllamaV2

* cleanup

* add backend

* cleanup

* cleanup

* no need check exllama version

* Update quantization_config.py

* lower checkpoint_format and backend

* check none

* cleanup

* Update quantization_config.py

* fix self.use_exllama == False

* spell

* fix unittest

* fix unittest

---------

Co-authored-by: LRL <lrl@lbx.dev>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>

* fix format

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix format again

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* update gptqmodel version (#6)

* update gptqmodel version

* update gptqmodel version

* fix unit test (#5)

* update gptqmodel version

* update gptqmodel version

* "not self.use_exllama" is not equivalent to "self.use_exllama==False"

* fix unittest

* update gptqmodel version

* backend is loading_attibutes (#7)

* fix format and tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix memory check

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix device mismatch

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* fix result check

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* Update src/transformers/quantizers/quantizer_gptq.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Update src/transformers/quantizers/quantizer_gptq.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Update src/transformers/quantizers/quantizer_gptq.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* update tests

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* review: update docs (#10)

* review: update docs (#12)

* review: update docs

* fix typo

* update tests for gptqmodel

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>

* update document (#9)

* update overview.md

* cleanup

* Update overview.md

* Update overview.md

* Update overview.md

* update gptq.md

* Update gptq.md

* Update gptq.md

* Update gptq.md

* Update gptq.md

* Update gptq.md

* Update gptq.md

---------

Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>

* typo

* doc note for asymmetric quant

* typo with apple silicon(e)

* typo for marlin

* column name revert: review

* doc rocm support

* Update docs/source/en/quantization/gptq.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quantization/gptq.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quantization/gptq.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quantization/gptq.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quantization/overview.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update docs/source/en/quantization/overview.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

---------

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Co-authored-by: LRL-ModelCloud <165116337+LRL-ModelCloud@users.noreply.github.com>
Co-authored-by: ZX-ModelCloud <zx@modelcloud.ai>
Co-authored-by: Qubitium-ModelCloud <qubitium@modelcloud.ai>
Co-authored-by: ZX-ModelCloud <165115237+ZX-ModelCloud@users.noreply.github.com>
Co-authored-by: LRL <lrl@lbx.dev>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
2025-01-15 14:22:49 +01:00
Joao Gante
38f9f10dd9
Cache: revert DynamicCache init for BC (#33861)
* tmp commit

* tmp commit

* make fixup

* missing removal

* fix condition

* fix end-to-end compilation

* if -> elif

* BC

* BC

* use @deprecate_kwarg("num_hidden_layers", version="4.47.0")

* wups the import

* 🥴

---------

Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2024-10-04 22:47:08 +02:00
Guang Yang
808997a634
Fix passing str dtype to static cache (#33741)
Co-authored-by: Guang Yang <guangyang@fb.com>
2024-10-01 09:50:17 +02:00
Arthur
19d58d31f1
Add MLLama (#33703)
* current changes

* nit

* Add cross_attenttion_mask to processor

* multi-image fixed

* Add cross_attenttion_mask to processor

* cross attn works in all cases

* WIP refactoring function for image processor

* WIP refactoring image processor functions

* Refactor preprocess to use global loops instead of list nested list comps

* Docstrings

* Add channels unification

* fix dtype issues

* Update docsrings and format

* Consistent max_image_tiles

* current script

* updates

* Add convert to rgb

* Add image processor tests

* updates!

* update

* god damn it I am dumb sometimes

* Precompute aspect ratios

* now this works, full match

* fix 😉

* nits

* style

* fix model and conversion

* nit

* nit

* kinda works

* hack for sdpa non-contiguous bias

* nits here and there

* latest c hanges

* merge?

* run forward

* Add aspect_ratio_mask

* vision attention mask

* update script and config variable names

* nit

* nits

* be able to load

* style

* nits

* there

* nits

* make forward run

* small update

* enable generation multi-turn

* nit

* nit

* Clean up a bit for errors and typos

* A bit more constant fixes

* 90B keys and shapes match

* Fix for 11B model

* Fixup, remove debug part

* Docs

* Make max_aspect_ratio_id to be minimal

* Update image processing code to match new implementation

* Adjust conversion for final checkpoint state

* Change dim in repeat_interleave (accordig to meta code)

* tmp fix for num_tiles

* Fix for conversion (gate<->up, q/k_proj rope permute)

* nits

* codestyle

* Vision encoder fixes

* pass cross attn mask further

* Refactor aspect ratio mask

* Disable text-only generation

* Fix cross attention layers order, remove q/k norm rotation for cross atention layers

* Refactor gated position embeddings

* fix bugs but needs test with new weights

* rope scaling should be llama3

* Fix rope scaling name

* Remove debug for linear layer

* fix copies

* Make mask prepare private func

* Remove linear patch embed

* Make precomputed embeddings as nn.Embedding module

* MllamaPrecomputedAspectRatioEmbedding with config init

* Remove unused self.output_dim

* nit, intermediate layers

* Rename ln and pos_embed

* vision_chunk_size -> image_size

* return_intermediate -> intermediate_layers_indices

* vision_input_dim -> hidden_size

* Fix copied from statements

* fix most tests

* Fix more copied from

* layer_id->layer_idx

* Comment

* Fix tests for processor

* Copied from for _prepare_4d_causal_attention_mask_with_cache_position

* Style fix

* Add MllamaForCausalLM

* WIP fixing tests

* Remove duplicated layers

* Remove dummy file

* Fix style

* Fix consistency

* Fix some TODOs

* fix language_model instantiation, add docstring

* Move docstring, remove todos for precomputed embeds (we cannot init them properly)

* Add initial docstrings

* Fix

* fix some tests

* lets skip these

* nits, remove print, style

* Add one more copied from

* Improve test message

* Make validate func private

* Fix dummy objects

* Refactor `data_format` a bit + add comment

* typos/nits

Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>

* fix dummy objects and imports

* Add chat template config json

* remove num_kv_heads from vision attention

* fix

* move some commits and add more tests

* fix test

* Remove `update_key_name` from modeling utils

* remove num-kv-heads again

* some prelimiary docs

* Update chat template + tests

* nit, conversion script max_num_tiles from params

* Fix warning for text-only generation

* Update conversion script for instruct models

* Update chat template in converstion + test

* add tests for CausalLM model

* model_max_length, avoid null chat_template

* Refactor conversion script

* Fix forward

* Fix integration tests

* Refactor vision config + docs

* Fix default

* Refactor text config

* Doc fixes

* Remove unused args, fix docs example

* Squashed commit of the following:

commit b51ce5a2efffbecdefbf6fc92ee87372ec9d8830
Author: qubvel <qubvel@gmail.com>
Date:   Wed Sep 18 13:39:15 2024 +0000

    Move model + add output hidden states and output attentions

* Fix num_channels

* Add mllama text and mllama vision models

* Fixing repo consistency

* Style fix

* Fixing repo consistency

* Fixing unused config params

* Fix failed tests after refactoring

* hidden_activation -> hidden_act  for text mlp

* Remove from_pretrained from sub-configs

* Apply suggestions from code review

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/mllama/convert_mllama_weights_to_hf.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Reuse lambda in conversion script

* Remove run.py

* Update docs/source/en/model_doc/mllama.md

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/mllama/processing_mllama.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Remove unused LlamaTokenizerFast

* Fix logging

* Refactor gating

* Remove cycle for collecting intermediate states

* Refactor text-only check, add integration test for text-only

* Revert from pretrained to configs

* Fix example

* Add auto `bos_token` adding in processor

* Fix tips

* Update src/transformers/models/auto/tokenization_auto.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Enable supports_gradient_checkpointing model flag

* add eager/sdpa options

* don't skip attn tests and bring back GC skips (did i really remove those?)

* Fix signature, but get error with None gradient

* Fix output attention tests

* Disable GC back

* Change no split modules

* Fix dropout

* Style

* Add Mllama to sdpa list

* Add post init for vision model

* Refine config for MllamaForCausalLMModelTest and skipped tests for CausalLM model

* if skipped, say it, don't pass

* Clean vision tester config

* Doc for args

* Update tests/models/mllama/test_modeling_mllama.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Add cross_attention_mask to test

* typehint

* Remove todo

* Enable gradient checkpointing

* Docstring

* Style

* Fixing and skipping some tests for new cache

* Mark flaky test

* Skip `test_sdpa_can_compile_dynamic` test

* Fixing some offload tests

* Add direct GenerationMixin inheritance

* Remove unused code

* Add initializer_range to vision config

* update the test to make sure we show if split

* fix gc?

* Fix repo consistency

* Undo modeling utils debug changes

* Fix link

* mllama -> Mllama

* [mllama] -> [Mllama]

* Enable compile test for CausalLM model (text-only)

* Fix TextModel prefix

* Update doc

* Docs for forward, type hints, and vision model prefix

* make sure to reset

* fix init

* small script refactor and styling

* nit

* updates!

* some nits

* Interpolate embeddings for 560 size and update integration tests

* nit

* does not suppor static cache!

* update

* fix

* nit2

* this?

* Fix conversion

* Style

* 4x memory improvement with image cache AFAIK

* Token decorator for tests

* Skip failing tests

* update processor errors

* fix split issues

* style

* weird

* style

* fix failing tests

* update

* nit fixing the whisper tests

* fix path

* update

---------

Co-authored-by: raushan <raushan@huggingface.co>
Co-authored-by: pavel <ubuntu@ip-10-90-0-11.ec2.internal>
Co-authored-by: qubvel <qubvel@gmail.com>
Co-authored-by: Pablo Montalvo <39954772+molbap@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
2024-09-25 19:56:25 +02:00
Fanli Lin
b87755aa6d
[tests] skip tests for xpu (#33553)
* enable

* fix

* add xpu skip

* add marker

* skip for xpu

* add more

* add one more
2024-09-19 19:28:04 +01:00
Guang Yang
f38590dade
Make StaticCache configurable at model construct time (#32830)
* Make StaticCache configurable at model construct time

* integrations import structure

* add new doc file to toc

---------

Co-authored-by: Guang Yang <guangyang@fb.com>
Co-authored-by: Joao Gante <joao@huggingface.co>
2024-09-10 16:35:57 +01:00
Raushan Turganbay
ebbe8d8014
Cache docs: update (#32929)
* some changes

* more updates

* fix cache copy

* nits

* nits

* add tests
2024-09-04 15:05:31 +05:00
Gerben van V
5129671290
Add a static cache that offloads to the CPU or other device (#32161)
* Add a static cache that offloads to the CPU or other device

* Fix PR comments, add unit-tests
2024-08-29 11:51:09 +02:00
Joao Gante
cf32ee1753
Cache: use batch_size instead of max_batch_size (#32657)
* more precise name

* better docstrings

* Update src/transformers/cache_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-08-16 11:48:45 +01:00
Guang Yang
0164560353
Fixed test test_static_cache_exportability with torch 2.4.0 (#32516)
Workaround the export issue in torch 2.4

Co-authored-by: Guang Yang <guangyang@fb.com>
2024-08-08 18:13:40 +01:00
OsamaS99
51ab25e293
Fixed Hybrid Cache Shape Initialization. (#32163)
* fixed hybrid cache init, added test

* Fix Test Typo

---------

Co-authored-by: Aaron Haag <aaron.haag@siemens.com>
2024-08-01 13:57:42 +01:00
Nikos Karampatziakis
ca59d6f77c
Offloaded KV Cache (#31325)
* Initial implementation of OffloadedCache

* enable usage via cache_implementation

* Address feedback, add tests, remove legacy methods.

* Remove flash-attn, discover synchronization bugs, fix bugs

* Prevent usage in CPU only mode

* Add a section about offloaded KV cache to the docs

* Fix typos in docs

* Clarifications and better explanation of streams
2024-08-01 14:42:07 +02:00
Guang Yang
811a9caa21
Make static cache compatible with torch.export (#32168) 2024-07-29 18:19:15 +01:00
Joao Gante
7ffe25f2b9
Generate: end-to-end compilation (#30788)
* mvp

* added test (a few models need fixes)

* fix a few test cases

* test nits

* harder test 😈

* revert changes in stablelm

* test with improved condition

* add todo

* tmp commit

* merged with main

* nits

* add todo

* final corrections

* add docs for generation compilation

* docs nits

* add  tip

* PR suggestions

* add more details to the compilation docs

* fix cache positions

* cache is now init in generate; update docs

* tag test as flaky

* docs

* post rebase make fixup and other nits

* remove unintended changes

* whisper (encoder-decoder) not supported

* move token default updates to ; add tests for token defaults

* push changes

* manual rebase

* chameleon doesn't support this

* fix test_static_cache_mha_mqa_gqa (broken in another PR)

* docs: dynamic is better with end-to-end compilation
2024-07-29 10:52:13 +01:00
Fanli Lin
27c7f971c0
[tests] fix static cache implementation is not compatible with attn_implementation==flash_attention_2 (#32039)
* add flash attention check

* fix

* fix
2024-07-26 11:41:27 +02:00
Joao Gante
739a63166d
Generate: remove deprecated code due to Cache and cache_position being default (#31898)
* tmp commit

* shorter

* nit

* explicit kwargs

* propagate changes

* mass propagation with a few manual touches (let's see how CI behaves)

* fix cacheless case

* Update src/transformers/generation/utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* make fixup

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-07-14 15:16:58 +01:00
Yih-Dar
93cd94b79d
Move some test files (tets/test_xxx_utils.py) to tests/utils (#31730)
* move

* move

* move

* move

* Update tests/utils/test_image_processing_utils.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
2024-07-02 13:46:03 +02:00