Joao Gante
9a1c1fe7ed
[CI] green llama tests ( #37244 )
* green llama tests
* use cleanup instead
* better test comment; cleanup upgrade
* better test comment; cleanup upgrade
2025-04-03 14:15:53 +01:00
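For reference, a hedged sketch of the teardown pattern the "use cleanup instead" message points at; the `cleanup` helper and its `gc_collect` argument are assumed to come from `transformers.testing_utils` as of this commit, and the test class name is illustrative:

```python
import unittest

from transformers.testing_utils import cleanup, torch_device  # cleanup helper assumed


class LlamaIntegrationTest(unittest.TestCase):
    def tearDown(self):
        # Free accelerator memory between tests instead of calling
        # gc.collect() / torch.cuda.empty_cache() by hand in every test file.
        cleanup(torch_device, gc_collect=True)
```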
Steven Liu
c0f8d055ce
[docs] Redesign ( #31757 )
* toctree
* not-doctested.txt
* collapse sections
* feedback
* update
* rewrite get started sections
* fixes
* fix
* loading models
* fix
* customize models
* share
* fix link
* contribute part 1
* contribute pt 2
* fix toctree
* tokenization pt 1
* Add new model (#32615 )
* v1 - working version
* fix
* fix
* fix
* fix
* rename to correct name
* fix title
* fixup
* rename files
* fix
* add copied from on tests
* rename to `FalconMamba` everywhere and fix bugs
* fix quantization + accelerate
* fix copies
* add `torch.compile` support
* fix tests
* fix tests and add slow tests
* copies on config
* merge the latest changes
* fix tests
* add few lines about instruct
* Apply suggestions from code review
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* fix
* fix tests
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
* "to be not" -> "not to be" (#32636 )
* "to be not" -> "not to be"
* Update sam.md
* Update trainer.py
* Update modeling_utils.py
* Update test_modeling_utils.py
* Update test_modeling_utils.py
* fix hfoption tag
* tokenization pt. 2
* image processor
* fix toctree
* backbones
* feature extractor
* fix file name
* processor
* update not-doctested
* update
* make style
* fix toctree
* revision
* make fixup
* fix toctree
* fix
* make style
* fix hfoption tag
* pipeline
* pipeline gradio
* pipeline web server
* add pipeline
* fix toctree
* not-doctested
* prompting
* llm optims
* fix toctree
* fixes
* cache
* text generation
* fix
* chat pipeline
* chat stuff
* xla
* torch.compile
* cpu inference
* toctree
* gpu inference
* agents and tools
* gguf/tiktoken
* finetune
* toctree
* trainer
* trainer pt 2
* optims
* optimizers
* accelerate
* parallelism
* fsdp
* update
* distributed cpu
* hardware training
* gpu training
* gpu training 2
* peft
* distrib debug
* deepspeed 1
* deepspeed 2
* chat toctree
* quant pt 1
* quant pt 2
* fix toctree
* fix
* fix
* quant pt 3
* quant pt 4
* serialization
* torchscript
* scripts
* tpu
* review
* model addition timeline
* modular
* more reviews
* reviews
* fix toctree
* reviews reviews
* continue reviews
* more reviews
* modular transformers
* more review
* zamba2
* fix
* all frameworks
* pytorch
* supported model frameworks
* flashattention
* rm check_table
* not-doctested.txt
* rm check_support_list.py
* feedback
* updates/feedback
* review
* feedback
* fix
* update
* feedback
* updates
* update
---------
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
2025-03-03 10:33:46 -08:00
Jacky Lee
44a26c871c
Update llm_optims docs for sdpa_kernel ( #35481 )
update: use sdpa_kernel
2025-01-06 08:54:31 -08:00
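For context, a minimal sketch of the `sdpa_kernel` context manager the updated docs switch to (it replaces the deprecated `torch.backends.cuda.sdp_kernel`), assuming PyTorch >= 2.3, a CUDA device, and an illustrative checkpoint:

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, attn_implementation="sdpa"
).to("cuda")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

# Pin scaled_dot_product_attention to the FlashAttention backend for this block.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```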
wejoncy
4e27a4009d
FEAT : Adding VPTQ quantization method to HFQuantizer ( #34770 )
* init vptq
* add integration
* add vptq support
fix readme
* add tests && format
* format
* address comments
* format
* format
* address comments
* format
* address comments
* remove debug code
* Revert "remove debug code"
This reverts commit ed3b3eaaba.
* fix test
---------
Co-authored-by: Yang Wang <wyatuestc@gmail.com>
2024-12-20 09:45:53 +01:00
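A hedged sketch of what the new quantizer enables: loading a checkpoint that was already quantized with VPTQ, with the method picked up from the checkpoint's quantization config. The repo id below is a placeholder and the `vptq` package is assumed to be installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id for a VPTQ-quantized checkpoint on the Hub.
model_id = "VPTQ-community/Meta-Llama-3.1-8B-Instruct-v8-k65536-0-woft"

# The quantization method is read from the checkpoint's quantization_config,
# so no extra arguments are needed at load time.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```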
Steven Liu
1ed1de2fec
[docs] Increase visibility of torch_dtype="auto" ( #35067 )
* auto-dtype
* feedback
2024-12-04 09:18:44 -08:00
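The pattern the docs now surface more prominently; the checkpoint name is illustrative:

```python
from transformers import AutoModelForCausalLM

# torch_dtype="auto" reads the dtype saved in the checkpoint's config
# (e.g. bfloat16) instead of defaulting to float32.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
print(model.dtype)
```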
Fanli Lin
329f5dbf97
[docs] use device-agnostic API instead of hard-coded cuda ( #35048 )
replace cuda
2024-12-03 10:54:15 -08:00
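A sketch of the device-agnostic pattern the docs move to instead of hard-coding `.to("cuda")`; the exact availability check varies per backend:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pick whatever accelerator is available instead of assuming CUDA.
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

inputs = tokenizer("Hello", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=10)
```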
Abhishek Maurya
65753d6065
Remove graph breaks for torch.compile() in flash_attention_forward when Llama Model is padding-free tuned ( #33932 )
* fix: fixes for graph breaks
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* fix: formatting
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* fix: import error
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* fix: Add Fa2Kwargs
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* fix: PR Changes
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* PR changes
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* PR changes
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* PR changes
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* PR changes
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* Revert "PR changes"
This reverts commit 39d2868e5c.
* PR changes
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* fix: FlashAttentionKwarg
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* fix: FlashAttentionKwarg
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* PR Changes
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* PR Changes
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* PR Changes
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* PR Changes
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* PR Changes
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* addition of documentation
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* change in _flash_attention_forward
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* make fix-copies
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* revert make fix-copies
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
* fix copies
* style
* loss kwargs typing
* style and pull latest changes
---------
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>
2024-10-24 11:02:54 +02:00
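A hedged way to check for the graph breaks this PR removes: compile the flash-attention model with `fullgraph=True`, which errors out if tracing would split into multiple graphs. The checkpoint name is illustrative, and flash-attn plus a CUDA device are assumed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2"
).to("cuda")

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")

# fullgraph=True raises on any graph break instead of silently falling back to eager.
compiled_forward = torch.compile(model.forward, fullgraph=True)
with torch.no_grad():
    outputs = compiled_forward(**inputs)
```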
Joao Gante
2b789f27f3
Docs: add more cross-references to the KV cache docs ( #33323 )
* add more cross-references
* nit
* import guard
* more import guards
* nit
* Update src/transformers/generation/configuration_utils.py
2024-09-06 10:22:00 +01:00
Joao Gante
cf32ee1753
Cache: use batch_size instead of max_batch_size ( #32657 )
* more precise name
* better docstrings
* Update src/transformers/cache_utils.py
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
---------
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
2024-08-16 11:48:45 +01:00
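The renamed argument in practice, as of this commit (it was previously `max_batch_size`); the model name and sizes are illustrative:

```python
from transformers import AutoModelForCausalLM, StaticCache

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", torch_dtype="auto")

# batch_size replaces the old max_batch_size argument.
past_key_values = StaticCache(
    config=model.config,
    batch_size=1,
    max_cache_len=1024,
    device=model.device,
    dtype=model.dtype,
)
```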
Joao Gante
7ffe25f2b9
Generate: end-to-end compilation ( #30788 )
* mvp
* added test (a few models need fixes)
* fix a few test cases
* test nits
* harder test 😈
* revert changes in stablelm
* test with improved condition
* add todo
* tmp commit
* merged with main
* nits
* add todo
* final corrections
* add docs for generation compilation
* docs nits
* add tip
* PR suggestions
* add more details to the compilation docs
* fix cache positions
* cache is now init in generate; update docs
* tag test as flaky
* docs
* post rebase make fixup and other nits
* remove unintended changes
* whisper (encoder-decoder) not supported
* move token default updates to ; add tests for token defaults
* push changes
* manual rebase
* chameleon doesn't support this
* fix test_static_cache_mha_mqa_gqa (broken in another PR)
* docs: dynamic is better with end-to-end compilation
2024-07-29 10:52:13 +01:00
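A hedged sketch of the end-to-end setup the new docs in this PR describe: a static KV cache plus `torch.compile` on the model's forward pass. The checkpoint name is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Static shapes so the compiled graph can be reused across decoding steps.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("The theory of special relativity states", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```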
Longjie Zheng
616bb11d48
Add torch.compile for Mistral ( #30642 )
* first version
* fix sliding window
* fix style
* add sliding window cache
* fix style
* address comments
* fix test
* fix style
* move sliding window check inside cache init
* revert changes on irrelevant files & add comment on SlidingWindowCache
* address comments & fix style
fix style
* update causal mask
* [run-slow] mistral
* [run-slow] mistral
* [run-slow] mistral
* [run-slow] mistral
* [run-slow] mistral
* [run-slow] llama
* [run-slow] mistral
* [run-slow] mistral
* [run-slow] mistral
* revert CI from a10 to t4
* wrap up
2024-05-20 16:27:24 +02:00
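A hedged sketch of what this enables for Mistral: a SlidingWindowCache with static shapes so the forward pass can be compiled. The checkpoint name is illustrative and the `"sliding_window"` cache key is assumed from the cache implementations registered around this change:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("My favourite condiment is", return_tensors="pt").to("cuda")

# "sliding_window" selects the SlidingWindowCache added alongside this change.
outputs = model.generate(**inputs, max_new_tokens=32, cache_implementation="sliding_window")
```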
Joao Gante
75bbfd5b22
Cache: Static cache as a standalone object ( #30476 )
2024-04-30 16:37:19 +01:00
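A sketch of the cache as a standalone object, which is what this PR makes possible: the caller builds the cache and hands it to the model instead of relying on internal setup. At the time of this commit the size argument was still `max_batch_size` (renamed to `batch_size` in #32657 above); the checkpoint name is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")

# Pre-allocated cache created by the caller rather than inside the model.
past_key_values = StaticCache(
    config=model.config, max_batch_size=1, max_cache_len=256, device="cuda", dtype=model.dtype
)
outputs = model(**inputs, past_key_values=past_key_values, use_cache=True)
```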
Steven Liu
e74d793a3c
[docs] LLM inference ( #29791 )
* first draft
* feedback
* static cache snippet
* feedback
* feedback
2024-04-22 12:41:51 -07:00
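And the kind of static cache snippet this guide introduced, in hedged form: opting into the static KV cache straight from `generate()`. The checkpoint name is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")

# cache_implementation="static" swaps the default dynamic KV cache for a pre-allocated one.
outputs = model.generate(**inputs, cache_implementation="static", max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```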