transformers/examples

Examples

We host a wide range of example scripts for multiple learning frameworks. Simply choose your favorite: TensorFlow, PyTorch or JAX/Flax.

We also have some research projects, as well as some legacy examples. Note that, unlike the main examples, these are not actively maintained and may require specific older versions of dependencies in order to run.

While we strive to present as many use cases as possible, the example scripts are just that - examples. It is expected that they won't work out-of-the-box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, most of the examples fully expose the preprocessing of the data, allowing you to tweak and edit them as required.

Before submitting a PR, please discuss any feature you would like to implement in an example on the forum or in an issue. We welcome bug fixes, but since we want to keep the examples as simple as possible, it is unlikely that we will merge a pull request that adds functionality at the cost of readability.

Important note

To make sure you can successfully run the latest versions of the example scripts, you have to install the library from source and install some example-specific requirements. To do this, execute the following steps in a new virtual environment:

git clone https://github.com/huggingface/transformers
cd transformers
pip install .

Then cd into the example folder of your choice and run:

pip install -r requirements.txt

To browse the examples corresponding to released versions of 🤗 Transformers, click on the line below and then on your desired version of the library:

Examples for older versions of 🤗 Transformers

Alternatively, you can switch your cloned 🤗 Transformers to a specific version (for instance, v3.5.1) with:

git checkout tags/v3.5.1

and run the example command as usual afterward.
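Because `pip install .` copies the package into the environment instead of linking to the checkout, checking out a tag does not by itself change which library version is imported. It can therefore be worth confirming that the installed version matches the tag before running an example, and re-running `pip install .` if it does not. The following is a minimal sketch of such a check; the helper names `installed_version` and `matches_tag` are hypothetical and not part of the library, and the check assumes that tags are the version string prefixed with `v`.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "transformers") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_tag(version: Optional[str], tag: str) -> bool:
    """Compare an installed version string with a git tag such as 'v3.5.1'.

    Assumption: the tag is the version string with a leading 'v'.
    """
    if version is None:
        return False
    expected = tag[1:] if tag.startswith("v") else tag
    return version == expected


if __name__ == "__main__":
    version = installed_version()
    tag = "v3.5.1"  # the tag used in the checkout command above
    if matches_tag(version, tag):
        print(f"Installed version {version} matches {tag}.")
    else:
        print(f"Installed version is {version}, not {tag}; consider re-running `pip install .`.")
```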

Running the Examples on Remote Hardware with Auto-Setup

run_on_remote.py is a script that launches any example on remote self-hosted hardware, with automatic hardware and environment setup. It uses Runhouse to launch on self-hosted hardware (e.g. in your own cloud account or on-premise cluster), but there are other options for running remotely as well. You can customize the example used, the command line arguments, the dependencies, and the type of compute hardware, and then run the script to launch the example automatically.

You can refer to hardware setup for more information about hardware and dependency setup with Runhouse, or this Colab tutorial for a more in-depth walkthrough.

You can run the script with the following commands:

# First install runhouse:
pip install runhouse

# For an on-demand V100 with whichever cloud provider you have configured:
python run_on_remote.py \
    --example pytorch/text-generation/run_generation.py \
    --model_type=gpt2 \
    --model_name_or_path=openai-community/gpt2 \
    --prompt "I am a language model and"

# For a BYO (bring your own) cluster:
python run_on_remote.py --host <cluster_ip> --user <ssh_user> --key_path <ssh_key_path> \
  --example <example> <args>

# For on-demand instances:
python run_on_remote.py --instance <instance> --provider <provider> \
  --example <example> <args>

You can also adapt the script to your own needs.
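If you launch several examples or switch between clusters, one option is to assemble the command line programmatically. The sketch below only builds the argument list from the flags shown in the commands above (`--host`, `--user`, `--key_path`, `--instance`, `--provider`, `--example`); the helper `build_remote_command` is hypothetical and not part of the script, and it does not validate which flag combinations the script accepts.

```python
import shlex
from typing import List, Optional, Sequence


def build_remote_command(
    example: str,
    example_args: Sequence[str] = (),
    host: Optional[str] = None,
    user: Optional[str] = None,
    key_path: Optional[str] = None,
    instance: Optional[str] = None,
    provider: Optional[str] = None,
) -> List[str]:
    """Build an argv list for run_on_remote.py (hypothetical helper).

    Only flags that appear in the README commands are emitted; unset
    options are omitted so the script's own defaults apply.
    """
    cmd = ["python", "run_on_remote.py"]
    options = {
        "--host": host,
        "--user": user,
        "--key_path": key_path,
        "--instance": instance,
        "--provider": provider,
    }
    for flag, value in options.items():
        if value is not None:
            cmd += [flag, value]
    # The example path comes last, followed by the example's own arguments.
    cmd += ["--example", example, *example_args]
    return cmd


if __name__ == "__main__":
    cmd = build_remote_command(
        "pytorch/text-generation/run_generation.py",
        ["--model_type=gpt2", "--model_name_or_path=openai-community/gpt2"],
    )
    # Print a shell-ready command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list instead of a single string keeps each argument intact when it is passed to `subprocess.run`, so prompts containing spaces do not need manual quoting.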