Mirror of https://github.com/huggingface/transformers.git (synced 2025-07-03 12:50:06 +06:00)

* stash for now
* initial commit
* small updated
* up
* up
* works!
* nits and fixes
* don't loop too much
* finish working example
* update
* fix the small freeblocks issue
* feat: stream inputs to continuous batch
* fix: update attn from `eager` to `sdpa`
* refactor: fmt
* refactor: cleanup unnecessary code
* feat: add `update` fn to `PagedAttentionCache`
* feat: broken optimal block size computation
* fix: debugging invalid cache logic
* fix: attention mask
* refactor: use custom prompts for example
* feat: add streaming output
* fix: prefill split refactor: add doc strings and unsound/redundant logic fix: compute optimal blocks logic
* fix: send decoded tokens when `prefilling_split` -> `decoding`
* refactor: move logic to appropriate parent class
* fix: remove truncation as we split prefilling anyways refactor: early return when we have enough selected requests
* feat: add paged attention forward
* push Ggraoh>
* add paged sdpa
* update
* btter mps defaults
* feat: add progress bar for `generate_batch`
* feat: add opentelemetry metrics (ttft + batch fill %age)
* feat: add tracing
* Add cuda graphs (#38059)
* draft cudagraphs addition
* nits
* styling
* update
* fix
* kinda draft of what it should look like
* fixes
* lol
* not sure why inf everywhere
* can generate but output is shit
* some fixes
* we should have a single device synch
* broken outputs but it does run
* refactor
* updates
* updates with some fixes
* fix mask causality
* another commit that casts after
* add error
* simplify example
* update
* updates
* revert llama changes
* fix merge conflicts
* fix: tracing and metrics
* my updates
* update script default values
* fix block allocation issue
* fix prefill split attnetion mask
* no bugs
* add paged eager
* fix
* update
* style
* feat: add pytorch traces
* fix
* fix
* refactor: remove pytorch profiler data
* style
* nits
* cleanup
* draft test file
* fix
* fix
* fix paged and graphs
* small renamings
* cleanups and push
* refactor: move tracing and metrics logic to utils
* refactor: trace more blocks of code
* nits
* nits
* update
* to profile or not to profile
* refactor: create new output object
* causal by default
* cleanup but generations are still off for IDK what reason
* simplifications but not running still
* this does work.
* small quality of life updates
* nits
* updaet
* fix the scheduler
* fix warning
* ol
* fully fixed
* nits
* different generation parameters
* nice
* just style
* feat: add cache memory usage
* feat: add kv cache free memory
* feat: add active/waiting count & req latency
* do the sampling
* fix: synchronize CUDA only if available and improve error handling in ContinuousBatchingManager
* fix on mps
* feat: add dashboard & histogram buckets
* perf: improve waiting reqs data structures
* attempt to compile, but we should only do it on mps AFAIK
* feat: decouple scheduling logic
* just a draft
* c;eanup and fixup
* optional
* style
* update
* update
* remove the draft documentation
* fix import as well
* update
* fix the test
* style doomed

---------

Co-authored-by: Luc Georges <luc.sydney.georges@gmail.com>
services:
  memcached:
    image: memcached:1.6.29
    container_name: memcached
    ports:
      - "11211:11211"
    environment:
      - MEMCACHED_MAX_MEMORY=64m # Set the maximum memory usage
      - MEMCACHED_THREADS=4 # Number of threads to use

  prometheus:
    image: prom/prometheus:latest
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - --web.enable-otlp-receiver # Enable OTLP receiver
      - --web.enable-remote-write-receiver
      - --enable-feature=exemplar-storage
      - --enable-feature=native-histograms
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  tempo:
    image: grafana/tempo:latest
    command: [ "-config.file=/etc/tempo.yaml" ]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
    ports:
      - "14268:14268" # jaeger ingest
      - "3200:3200"   # tempo
      - "9095:9095"   # tempo grpc
      - "4317:4317"   # otlp grpc
      - "4318:4318"   # otlp http
      - "9411:9411"   # zipkin
    depends_on:
      - memcached

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./continuous-batching-dashboard.json:/etc/grafana/provisioning/dashboards/continuous-batching-dashboard.json
      - ./grafana-dashboard.yaml:/etc/grafana/provisioning/dashboards/grafana-dashboard.yaml
      - ./grafana-datasources.yaml:/etc/grafana/provisioning/datasources/datasources.yaml
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_AUTH_DISABLE_LOGIN_FORM=true
      - GF_FEATURE_TOGGLES_ENABLE=traceqlEditor metricsSummary
      - GF_INSTALL_PLUGINS=https://storage.googleapis.com/integration-artifacts/grafana-exploretraces-app/grafana-exploretraces-app-latest.zip;grafana-traces-app
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
      - tempo
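The compose file above mounts a ./prometheus.yml that is not reproduced on this page. Because Prometheus is started with --web.enable-otlp-receiver, the benchmarking process can push its metrics directly to the OTLP endpoint (http://localhost:9090/api/v1/otlp/v1/metrics), so no scrape job is strictly required. A minimal sketch of such a file, assuming only defaults plus a couple of promoted resource attributes (the otlp block needs a recent Prometheus release and is an assumption, not the file shipped with the example), could look like:

# prometheus.yml — hypothetical minimal sketch
global:
  scrape_interval: 15s

# Assumption: expose a few OpenTelemetry resource attributes as labels so they
# can be used in Grafana queries.
otlp:
  promote_resource_attributes:
    - service.name
    - service.instance.id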
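Likewise, ./tempo.yaml is mounted but not shown here. Judging from the published ports (Jaeger 14268, Zipkin 9411, OTLP 4317/4318, Tempo 3200), a plausible sketch enables the matching receivers and local block storage; the actual file in the repository may differ:

# tempo.yaml — hypothetical sketch based on the ports exposed above
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:          # 4317
        http:          # 4318
    jaeger:
      protocols:
        thrift_http:   # 14268
    zipkin:            # 9411

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/blocks
    wal:
      path: /var/tempo/wal

# The memcached service that tempo depends_on is typically wired in as a cache
# backend here; the exact cache configuration is omitted rather than guessed.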
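Grafana is provisioned from ./grafana-datasources.yaml. A standard datasource-provisioning file pointing at the two backends defined in this stack would look roughly like the following (datasource names and flags are assumptions):

# grafana-datasources.yaml — hypothetical sketch
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # service name on the compose network
    isDefault: true
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3200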
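Finally, ./grafana-dashboard.yaml is a dashboard provider that tells Grafana to load continuous-batching-dashboard.json from the directory both files are mounted into. A minimal sketch (the provider name is an assumption):

# grafana-dashboard.yaml — hypothetical sketch
apiVersion: 1

providers:
  - name: continuous-batching   # assumed provider name
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards

With anonymous access enabled and the Admin role granted in the compose environment, the dashboard is reachable at http://localhost:3000 without logging in.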