
* Get parallel loader working. Include tests.
* Update the tests for parallel loading
* Rename env variables.
* Add docs for parallel model weight loading.
* Touch up parallel model loading docs.
* Touch up parallel model loading docs again.
* Edit comment in test_modeling_utils_parallel_loading.py
* Make sure HF_PARALLEL_LOADING_WORKERS is spelled correctly in modeling_utils.py
* Correct times for parallelized loading, previous times were for a "hot" filesystem
* Update parallel model loading so the spawn method is encapsulated. DRY up the code by leveraging get_submodule.
* Update docs on model loading parallelism so that details on setting the multiprocessing start method are removed, now that the package handles this step internally.
* Fix style on model loading parallelism changes.
* Merge latest version of master's modeling_utils.
* Removed unused variable.
* Fix argument packing for the parallel loader.
* Fix state dict being undefined in the parallel model loader.
* Rename variables used in parallel model loading for clarity. Use get_module_from_name().
* Switch to the use of threads for parallel model loading.
* Update docs for parallel loading.
* Remove the use of json.loads when evaluating HF_ENABLE_PARALLEL_LOADING. Prefer simple casting.
* Move parallelized shard loading into its own function.
* Remove use of is_true(). Favor checking env var true values for HF_ENABLE_PARALLEL_LOADING.
* Update copyright to 2025 in readme for parallel model loading.
* Remove garbage collection line in load_shard_file, implicit garbage collection already occurs.
* Run formatter on modeling_utils.py
* Apply style fixes
* Delete tests/utils/test_modeling_utils_parallel_loading.py

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Cyril Vallez <cyril.vallez@huggingface.co>
## Environment Variables

### HF_ENABLE_PARALLEL_LOADING
Disabled by default. When enabled, torch and safetensors based weights are loaded in parallel, which can significantly decrease the time needed to load large models, often producing speedups of around 50%.
Can be set to the string `"false"` or `"true"`, e.g. `os.environ["HF_ENABLE_PARALLEL_LOADING"] = "true"`.
For example, `facebook/opt-30b` on an AWS EC2 g4dn.metal instance loads in ~30s with this enabled vs. ~55s without it.
Profile before committing to this environment variable; it will not produce speedups for smaller models.
```python
import os

os.environ["HF_ENABLE_PARALLEL_LOADING"] = "true"

from transformers import pipeline

model = pipeline(task="text-generation", model="facebook/opt-30b", device_map="auto")
```
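Because the payoff depends on model size and filesystem state, it is worth timing the load yourself. The sketch below is one way to do that; the script name, layout, and use of `time.perf_counter` are illustrative, not part of the library. Run it once with `true` and once with `false`, then compare:

```python
# profile_loading.py -- a minimal, illustrative timing sketch.
# Usage: python profile_loading.py true   (then rerun with: false)
# Assumes the checkpoint is already cached locally, so load time
# rather than download time is measured.
import os
import sys
import time

# Set before importing transformers, mirroring the documented examples.
os.environ["HF_ENABLE_PARALLEL_LOADING"] = sys.argv[1]

from transformers import pipeline

start = time.perf_counter()
model = pipeline(task="text-generation", model="facebook/opt-30b", device_map="auto")
print(f"HF_ENABLE_PARALLEL_LOADING={sys.argv[1]}: loaded in {time.perf_counter() - start:.1f}s")
```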
### HF_PARALLEL_LOADING_WORKERS
Determines how many threads should be used when parallel loading is enabled. The default is `8`.
If the number of files being loaded is smaller than the number of threads specified, only as many threads as there are files will be spawned. For example, if you specify 8 workers and there are only 2 files, only 2 workers will be spawned (see the sketch at the end of this section).
Tune as you see fit.
```python
import os

os.environ["HF_ENABLE_PARALLEL_LOADING"] = "true"
os.environ["HF_PARALLEL_LOADING_WORKERS"] = "4"

from transformers import pipeline

model = pipeline(task="text-generation", model="facebook/opt-30b", device_map="auto")
```
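For intuition, the worker-count behavior described above amounts to capping a thread pool at the number of shard files. The following is a conceptual sketch only, not the transformers internals; `load_shard_file` and the file names are stand-ins:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def load_shard_file(path: str) -> dict:
    # Stand-in stub for reading one checkpoint shard (e.g. a safetensors
    # file) and returning its tensors as a state dict.
    return {}

shard_files = [
    "model-00001-of-00002.safetensors",  # illustrative file names
    "model-00002-of-00002.safetensors",
]

requested = int(os.environ.get("HF_PARALLEL_LOADING_WORKERS", "8"))
# Never spawn more threads than there are files to load.
num_workers = min(requested, len(shard_files))  # 8 requested, 2 files -> 2

with ThreadPoolExecutor(max_workers=num_workers) as pool:
    state_dicts = list(pool.map(load_shard_file, shard_files))
```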