
DeepSpeed Integration
DeepSpeed implements everything described in the ZeRO paper. Currently it provides full support for:
- Optimizer state partitioning (ZeRO stage 1)
- Gradient partitioning (ZeRO stage 2)
- Parameter partitioning (ZeRO stage 3)
- Custom mixed precision training handling
- A range of fast CUDA-extension-based optimizers
- ZeRO-Offload to CPU and NVMe
ZeRO-Offload has its own dedicated paper: ZeRO-Offload: Democratizing Billion-Scale Model Training. And NVMe support is described in the paper ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning.
DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use to inference.
DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded onto multiple GPUs, which won't be possible on a single GPU.
🤗 Transformers integrates DeepSpeed via two options:
- Integration of the core DeepSpeed features via [Trainer]. This is an everything-done-for-you type of integration - just supply your custom config file or use our template and there is nothing else you need to do. Most of this document is focused on this feature.
- If you don't use [Trainer] and want to use your own Trainer where you integrated DeepSpeed yourself, core functions like from_pretrained and from_config include integration of essential parts of DeepSpeed, like zero.Init for ZeRO stage 3 and higher. To tap into this feature, read the docs on non-Trainer DeepSpeed Integration.
What is integrated:
Training:
- DeepSpeed ZeRO training supports the full ZeRO stages 1, 2 and 3 with ZeRO-Infinity (CPU and NVME offload).
Inference:
- DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but it doesn't use an optimizer and a lr scheduler and only stage 3 is relevant. For more details see: zero-inference.
There is also DeepSpeed Inference - this is a totally different technology which uses Tensor Parallelism instead of ZeRO (coming soon).
Trainer Deepspeed Integration
Installation
Install the library via pypi:
pip install deepspeed
or via transformers' extras:
pip install transformers[deepspeed]
or find more details on DeepSpeed's GitHub page and advanced install.
If you're still struggling with the build, first make sure to read the CUDA Extension Installation Notes.
If you don't prebuild the extensions and rely on them being built at run time, and you have tried all of the above solutions to no avail, the next thing to try is to pre-build the modules before installing them.
To make a local build for DeepSpeed:
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 pip install . \
--global-option="build_ext" --global-option="-j8" --no-cache -v \
--disable-pip-version-check 2>&1 | tee build.log
If you intend to use NVMe offload you will also need to include DS_BUILD_AIO=1 in the instructions above (and also install libaio-dev system-wide).
Edit TORCH_CUDA_ARCH_LIST to insert the code for the architectures of the GPU cards you intend to use. Assuming all your cards are the same, you can get the arch via:
CUDA_VISIBLE_DEVICES=0 python -c "import torch; print(torch.cuda.get_device_capability())"
So if you get 8, 6, then use TORCH_CUDA_ARCH_LIST="8.6". If you have multiple different cards, you can list all of them, e.g. TORCH_CUDA_ARCH_LIST="6.1;8.6".
If you need to use the same setup on multiple machines, make a binary wheel:
git clone https://github.com/microsoft/DeepSpeed/
cd DeepSpeed
rm -rf build
TORCH_CUDA_ARCH_LIST="8.6" DS_BUILD_CPU_ADAM=1 DS_BUILD_UTILS=1 \
python setup.py build_ext -j8 bdist_wheel
It will generate something like dist/deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl, which you can now install as pip install deepspeed-0.3.13+8cd046f-cp38-cp38-linux_x86_64.whl locally or on any other machine.
Again, remember to adjust TORCH_CUDA_ARCH_LIST to the target architectures.
You can find the complete list of NVIDIA GPUs and their corresponding Compute Capabilities (same as arch in this context) here.
You can check the archs pytorch was built with using:
python -c "import torch; print(torch.cuda.get_arch_list())"
Here is how to find out the arch for one of the installed GPUs. For example, for GPU 0:
CUDA_VISIBLE_DEVICES=0 python -c "import torch; \
print(torch.cuda.get_device_properties(torch.device('cuda')))"
If the output is:
_CudaDeviceProperties(name='GeForce RTX 3090', major=8, minor=6, total_memory=24268MB, multi_processor_count=82)
then you know that this card's arch is 8.6.
You can also leave TORCH_CUDA_ARCH_LIST out completely, and then the build program will automatically query the architecture of the GPUs the build is made on. This may or may not match the GPUs on the target machines, which is why it's best to specify the desired archs explicitly.
If after trying everything suggested you still encounter build issues, please proceed with the GitHub Issues of DeepSpeed.
Deployment with multiple GPUs
To deploy the DeepSpeed integration, adjust the [Trainer] command line arguments to include a new argument --deepspeed ds_config.json, where ds_config.json is the DeepSpeed configuration file as documented here. The file naming is up to you.
It's recommended to use DeepSpeed's add_config_arguments utility to add the necessary command line arguments to your code. For more information please see DeepSpeed's Argument Parsing doc.
You can use a launcher of your choice here. You can continue using the pytorch launcher:
torch.distributed.run --nproc_per_node=2 your_program.py <normal cl args> --deepspeed ds_config.json
or use the launcher provided by deepspeed:
deepspeed --num_gpus=2 your_program.py <normal cl args> --deepspeed ds_config.json
As you can see the arguments aren't the same, but for most needs either of them works. The full details on how to configure various nodes and GPUs can be found here.
When you use the deepspeed launcher and you want to use all available GPUs, you can just omit the --num_gpus flag.
Here is an example of running run_translation.py under DeepSpeed deploying all available GPUs:
deepspeed examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero3.json \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro
Note that in the DeepSpeed documentation you are likely to see --deepspeed --deepspeed_config ds_config.json - i.e. two DeepSpeed-related arguments - but for the sake of simplicity, and since there are already so many arguments to deal with, we combined the two into a single argument.
For some practical usage examples, please see this post.
Deployment with one GPU
To deploy DeepSpeed with one GPU, adjust the [Trainer] command line arguments as follows:
deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero2.json \
--model_name_or_path t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro
This is almost the same as with multiple GPUs, but here we tell DeepSpeed explicitly to use just one GPU via --num_gpus=1. By default, DeepSpeed deploys all GPUs it can see on the given node. If you have only 1 GPU to start with, then you don't need this argument. The following documentation discusses the launcher options.
Why would you want to use DeepSpeed with just one GPU?
- It has a ZeRO-offload feature which can delegate some computations and memory to the host's CPU and RAM, and thus leave more GPU resources for the model's needs - e.g. a larger batch size, or enabling the fitting of a very big model which normally won't fit.
- It provides a smart GPU memory management system that minimizes memory fragmentation, which again allows you to fit bigger models and data batches.
While we are going to discuss the configuration in detail next, the key to getting a huge improvement on a single GPU with DeepSpeed is to have at least the following configuration in the configuration file:
{
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"overlap_comm": true,
"contiguous_gradients": true
}
}
which enables optimizer offload and some other important features. You may experiment with the buffer sizes; you will find more details in the discussion below.
For a practical usage example of this type of deployment, please see this post.
You may also try ZeRO-3 with CPU and NVMe offload, as discussed further in this document.
Notes:
- If you need to run on a specific GPU which is different from GPU 0, you can't use CUDA_VISIBLE_DEVICES to limit the visible scope of available GPUs. Instead, you have to use the following syntax: deepspeed --include localhost:1 examples/pytorch/translation/run_translation.py ... In this example, we tell DeepSpeed to use GPU 1 (the second GPU).
Deployment with multiple Nodes
The information in this section isn't specific to the DeepSpeed integration and is applicable to any multi-node program. But DeepSpeed provides a deepspeed launcher that is easier to use than other launchers, unless you are in a SLURM environment.
For the duration of this section, let's assume that you have 2 nodes with 8 GPUs each. You can reach the first node with ssh hostname1 and the second node with ssh hostname2, and both must be able to reach each other via local ssh without a password. Of course, you will need to rename these host (node) names to the actual host names you are working with.
The torch.distributed.run launcher
For example, to use torch.distributed.run, you could do:
python -m torch.distributed.run --nproc_per_node=8 --nnode=2 --node_rank=0 --master_addr=hostname1 \
--master_port=9901 your_program.py <normal cl args> --deepspeed ds_config.json
You will need to ssh to each node and run this same command on each one. There is no need to rush; the launcher will wait until both nodes synchronize.
For more information please see torchrun. Incidentally, this is also the launcher that replaced torch.distributed.launch a few pytorch versions back.
The deepspeed launcher
To use the deepspeed launcher instead, you first have to create a hostfile file:
hostname1 slots=8
hostname2 slots=8
and then you can launch it as:
deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \
your_program.py <normal cl args> --deepspeed ds_config.json
Unlike the torch.distributed.run launcher, deepspeed will automatically launch this command on both nodes!
For more information please see Resource Configuration (multi-node).
Launching in a SLURM environment
In a SLURM environment the following approach can be used. The following is a slurm script launch.slurm, which you will need to adapt to your specific SLURM environment.
#SBATCH --job-name=test-nodes # name
#SBATCH --nodes=2 # nodes
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=10 # number of cores per tasks
#SBATCH --gres=gpu:8 # number of gpus
#SBATCH --time 20:00:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
export GPUS_PER_NODE=8
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=9901
srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \
--nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \
--master_addr $MASTER_ADDR --master_port $MASTER_PORT \
your_program.py <normal cl args> --deepspeed ds_config.json'
All that is left is to schedule it to run:
sbatch launch.slurm
Use of Non-shared filesystem
By default, DeepSpeed expects that a multi-node environment uses shared storage. If this is not the case and each node can only see the local filesystem, you need to adjust the config file to include a checkpoint section with the following setting:
{
"checkpoint": {
"use_node_local_storage": true
}
}
Alternatively, you can use the [Trainer]'s --save_on_each_node argument, and the above config will be added automatically for you.
Deployment in Notebooks
The problem with running notebook cells as a script is that there is no normal deepspeed launcher to rely on, so under certain setups we have to emulate it.
If you're using only 1 GPU, here is how you'd have to adjust your training code in the notebook to use DeepSpeed:
# DeepSpeed requires a distributed environment even when only one process is used.
# This emulates a launcher in the notebook
import os
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994" # modify if RuntimeError: Address already in use
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
# Now proceed as normal, plus pass the deepspeed config file
training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json")
trainer = Trainer(...)
trainer.train()
Note: ... stands for the normal arguments that you'd pass to the functions.
If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. That is, you have to use the launcher for that purpose, and this cannot be accomplished by emulating the distributed environment presented at the beginning of this section.
If you want to create the config file on the fly in the notebook in the current directory, you could have a dedicated cell with:
%%bash
cat <<'EOT' > ds_config_zero3.json
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
EOT
If the training script is in a normal file and not in the notebook cells, you can launch deepspeed normally via shell from a cell. For example, to use run_translation.py you would launch it with:
!git clone https://github.com/huggingface/transformers
!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ...
or with %%bash magic, where you can write multi-line code for the shell program to run:
%%bash
git clone https://github.com/huggingface/transformers
cd transformers
deepspeed examples/pytorch/translation/run_translation.py ...
In such case you don't need any of the code presented at the beginning of this section.
Note: while %%bash magic is neat, it currently buffers the output, so you won't see the logs until the process completes.
Configuration
For the complete guide to the DeepSpeed configuration options that can be used in its configuration file, please refer to the following documentation.
You can find dozens of DeepSpeed configuration examples that address various practical needs in the DeepSpeedExamples repository:
git clone https://github.com/microsoft/DeepSpeedExamples
cd DeepSpeedExamples
find . -name '*json'
Continuing the code from above, let's say you're looking to configure the Lamb optimizer. You can then search through the example .json files with:
grep -i Lamb $(find . -name '*json')
Some more examples are to be found in the main repo as well.
When using DeepSpeed you always need to supply a DeepSpeed configuration file, yet some configuration parameters have to be set via the command line. You will find the nuances in the rest of this guide.
To get an idea of what the DeepSpeed configuration file looks like, here is one that activates ZeRO stage 2 features, including optimizer state CPU offload, uses the AdamW optimizer and the WarmupLR scheduler, and will enable mixed precision training if --fp16 is passed:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto"
}
When you execute the program, DeepSpeed will log the configuration it received from the [Trainer] to the console, so you can see exactly what the final configuration passed to it was.
Passing Configuration
As discussed in this document, normally the DeepSpeed configuration is passed as a path to a json file, but if you're not using the command line interface to configure the training, and instead instantiate the [Trainer] via [TrainingArguments], then for the deepspeed argument you can pass a nested dict. This allows you to create the configuration on the fly and doesn't require you to write it to the file system before passing it to [TrainingArguments].
To summarize, you can do:
TrainingArguments(..., deepspeed="/path/to/ds_config.json")
or:
ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params)
TrainingArguments(..., deepspeed=ds_config_dict)
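As a minimal sketch of the dict-based approach (the parameter choices below are illustrative, not recommendations), building such a config on the fly could look like this:

```python
import json

# Sketch: a DeepSpeed config built as a nested dict instead of a json file.
# The "auto" placeholders are the ones the Trainer fills in at runtime.
ds_config_dict = {
    "fp16": {"enabled": "auto"},
    "optimizer": {"type": "AdamW", "params": {"lr": "auto"}},
    "scheduler": {"type": "WarmupLR", "params": {"warmup_num_steps": "auto"}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# The dict round-trips through json, so it could equally be written to disk.
print(json.dumps(ds_config_dict)[:13])

# then: training_args = TrainingArguments(..., deepspeed=ds_config_dict)
```

Note that in Python source the JSON literals true/false become True/False; the conversion happens automatically when the dict is serialized.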
Shared Configuration
This section is a must-read.
Some configuration values are required by both the [Trainer] and DeepSpeed to function correctly. Therefore, to prevent conflicting definitions, which could lead to hard-to-detect errors, we chose to configure those via the [Trainer] command line arguments.
Additionally, some configuration values are derived automatically based on the model's configuration, so instead of remembering to manually adjust multiple values, it's best to let the [Trainer] do the majority of the configuration for you.
Therefore, in the rest of this guide you will find a special configuration value: auto, which when set will be automatically replaced with the correct or most efficient value. Please feel free to ignore this recommendation and set the values explicitly, in which case be very careful that your [Trainer] arguments and DeepSpeed configurations agree. For example, are you using the same learning rate, batch size, or gradient accumulation settings? If these mismatch, the training may fail in ways that are very difficult to detect. You have been warned.
There are multiple other values that are specific to DeepSpeed only, and those you will have to set manually to suit your needs.
In your own programs, you can also use the following approach if you'd like to modify the DeepSpeed config as the master configuration and set up [TrainingArguments] based on it. The steps are:
- Create or load the DeepSpeed configuration to be used as the master configuration
- Create the [TrainingArguments] object based on these values
Do note that some values, such as scheduler.params.total_num_steps, are calculated by the [Trainer] during train, but you can of course do the math yourself.
ZeRO
Zero Redundancy Optimizer (ZeRO) is the workhorse of DeepSpeed. It supports 3 different levels (stages) of optimization. The first one is not quite interesting for scalability purposes, therefore this document focuses on stages 2 and 3. Stage 3 is further improved by the latest addition of ZeRO-Infinity. You will find more in-depth information in the DeepSpeed documentation.
The zero_optimization section of the configuration file is the most important part (docs), since that is where you define which ZeRO stages you want to enable and how to configure them. You will find the explanation for each parameter in the DeepSpeed docs.
This section has to be configured exclusively via the DeepSpeed configuration - the [Trainer] provides no equivalent command line arguments.
Note: currently DeepSpeed doesn't validate parameter names, so if you misspell any, it'll use the default setting for the parameter that got misspelled. You can watch the DeepSpeed engine start-up log messages to see what values it is going to use.
ZeRO-2 Config
The following is an example of configuration for ZeRO stage 2:
{
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 5e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 5e8,
"contiguous_gradients": true
}
}
Performance tuning:
- enabling offload_optimizer should reduce GPU RAM usage (it requires "stage": 2)
- "overlap_comm": true trades off increased GPU RAM usage to lower all-reduce latency. overlap_comm uses 4.5x the allgather_bucket_size and reduce_bucket_size values. So if they are set to 5e8, this requires a 9GB footprint (5e8 x 2Bytes x 2 x 4.5). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting OOM errors you will need to reduce those parameters to about 2e8, which would require 3.6GB. You will want to do the same on a larger-capacity GPU as well if you're starting to hit OOM.
- when reducing these buffers you're trading communication speed to avail more GPU RAM. The smaller the buffer size, the slower the communication gets and the more GPU RAM is available to other tasks. So if a bigger batch size is important, getting a slightly slower training time could be a good trade.
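The 9GB and 3.6GB figures above come from simple arithmetic, which can be sketched as:

```python
def comm_buffer_footprint_gb(allgather_bucket_size: float,
                             reduce_bucket_size: float,
                             bytes_per_elem: int = 2,
                             overlap_factor: float = 4.5) -> float:
    """GPU RAM claimed by the ZeRO-2 communication buffers when
    overlap_comm is enabled: each bucket holds fp16 elements (2 bytes)
    and overlap_comm multiplies the footprint by ~4.5x."""
    elems = allgather_bucket_size + reduce_bucket_size
    return elems * bytes_per_elem * overlap_factor / 1e9

print(comm_buffer_footprint_gb(5e8, 5e8))  # 9.0  -> the 9GB example
print(comm_buffer_footprint_gb(2e8, 2e8))  # 3.6  -> fits an 8GB card
```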
Additionally, deepspeed==0.4.4 added a new option round_robin_gradients, which you can enable with:
{
"zero_optimization": {
"round_robin_gradients": true
}
}
This is a stage 2 optimization for CPU offloading that parallelizes gradient copying to CPU memory among ranks by fine-grained gradient partitioning. The performance benefit grows with gradient accumulation steps (more copying between optimizer steps) or GPU count (increased parallelism).
ZeRO-3 Config
The following is an example of configuration for ZeRO stage 3:
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
If you are getting OOMs because your model or activations don't fit into GPU memory and you have unutilized CPU memory, offloading the optimizer states and parameters to CPU memory with "device": "cpu" may solve this limitation.
If you don't want to offload to CPU memory, use none instead of cpu for the device entry. Offloading to NVMe is discussed further down.
Pinned memory is enabled with pin_memory set to true. This feature can improve throughput at the cost of making less memory available to other processes. Pinned memory is set aside for the specific process that requested it, and is typically accessed much faster than normal CPU memory.
Performance tuning:
- stage3_max_live_parameters: 1e9
- stage3_max_reuse_distance: 1e9
If hitting OOM, reduce stage3_max_live_parameters and stage3_max_reuse_distance. They should have minimal impact on performance unless you are doing activation checkpointing. 1e9 would consume ~2GB. The memory is shared by stage3_max_live_parameters and stage3_max_reuse_distance, so it's not additive - it's just 2GB total.
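The ~2GB estimate is just the fp16 footprint of the parameters, which can be sketched as:

```python
def fp16_params_gb(n_params: float) -> float:
    """Memory consumed by n_params fp16 parameters (2 bytes each), in GB."""
    return n_params * 2 / 1e9

# 1e9 parameters -> ~2GB. stage3_max_live_parameters and
# stage3_max_reuse_distance draw on the same shared pool, so setting both
# to 1e9 costs ~2GB in total, not 4GB.
print(fp16_params_gb(1e9))  # 2.0
```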
stage3_max_live_parameters is the upper limit on how many full parameters you want to keep on the GPU at any given time. "Reuse distance" is a metric we use to figure out when a parameter will be used again in the future, and we use stage3_max_reuse_distance to decide whether to throw away the parameter or to keep it. If a parameter is going to be used again in the near future (less than stage3_max_reuse_distance), then we keep it to reduce communication overhead. This is super helpful when you have activation checkpointing enabled, where we do a forward recompute and backward pass at single-layer granularity and want to keep the parameter on the GPU from the forward recompute until the backward pass.
The following configuration values depend on the model's hidden size:
- reduce_bucket_size: hidden_size*hidden_size
- stage3_prefetch_bucket_size: 0.9 * hidden_size * hidden_size
- stage3_param_persistence_threshold: 10 * hidden_size
Therefore, if you set these values to auto, the [Trainer] will automatically assign the recommended values. But, of course, feel free to set these explicitly as well.
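A small sketch of the arithmetic behind these auto values (the [Trainer] derives hidden_size from the model config, so treat this as illustrative):

```python
def zero3_auto_values(hidden_size: int) -> dict:
    """hidden_size-derived defaults, per the formulas above."""
    return {
        "reduce_bucket_size": hidden_size * hidden_size,
        "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
        "stage3_param_persistence_threshold": 10 * hidden_size,
    }

# e.g. a model with hidden_size=1024:
print(zero3_auto_values(1024))
```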
stage3_gather_16bit_weights_on_model_save enables model fp16 weights consolidation when the model gets saved. With large models and multiple GPUs this is an expensive operation in terms of both memory and speed. It's currently required if you plan to resume training. Watch for future updates that will remove this limitation and make things more flexible.
If you're migrating from a ZeRO-2 configuration, note that the allgather_partitions, allgather_bucket_size and reduce_scatter configuration parameters are not used in ZeRO-3. If you keep them in the config file, they will just be ignored.
- sub_group_size: 1e9
sub_group_size controls the granularity in which parameters are updated during optimizer steps. Parameters are grouped into buckets of sub_group_size and each bucket is updated one at a time. When used with NVMe offload in ZeRO-Infinity, sub_group_size therefore controls the granularity in which model states are moved in and out of CPU memory from NVMe during the optimizer step. This prevents running out of CPU memory for extremely large models.
You can leave sub_group_size at its default value of 1e9 when not using NVMe offload. You may want to change this default in the following cases:
- Running into OOM during the optimizer step: reduce sub_group_size to reduce the memory utilization of temporary buffers.
- The optimizer step is taking a long time: increase sub_group_size to improve bandwidth utilization as a result of the increased data buffers.
ZeRO-0 Config
Note that we're listing Stages 0 and 1 last since they are rarely used.
Stage 0 disables all types of sharding and just uses DeepSpeed as DDP. You can turn it on with:
{
"zero_optimization": {
"stage": 0
}
}
This will essentially disable ZeRO, without you needing to change anything else.
ZeRO-1 Config
Stage 1 is Stage 2 minus gradient sharding. You can always try it to speed things up a tiny bit, as it only shards the optimizer states:
{
"zero_optimization": {
"stage": 1
}
}
NVMe Support
ZeRO-Infinity allows for training incredibly big models by extending GPU and CPU memory with NVMe memory. Thanks to smart partitioning and tiling algorithms, each GPU needs to send and receive very small amounts of data during offloading, so modern NVMe proved to be fit to allow for an even larger total memory pool available to your training process. ZeRO-Infinity requires ZeRO-3 to be enabled.
The following configuration example enables NVMe to offload both optimizer states and the params:
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 4,
"fast_init": false
},
"offload_param": {
"device": "nvme",
"nvme_path": "/local_nvme",
"pin_memory": true,
"buffer_count": 5,
"buffer_size": 1e8,
"max_in_cpu": 1e9
},
"aio": {
"block_size": 262144,
"queue_depth": 32,
"thread_count": 1,
"single_submit": false,
"overlap_events": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
}
}
You can choose to offload both optimizer states and params to NVMe, or just one of them, or none. For example, if you have copious amounts of CPU memory available, by all means offload to CPU memory only, as it'd be faster (hint: "device": "cpu").
Make sure that your nvme_path is actually an NVMe, since it will work with a normal hard drive or SSD but will be much, much slower. The fast, scalable training was designed with modern NVMe transfer speeds in mind (as of this writing one can get ~3.5GB/s read, ~3GB/s write peak speeds).
In order to figure out the optimal aio configuration block, you must run a benchmark on your target setup, as explained here.
ZeRO-2 vs ZeRO-3 Performance
ZeRO-3 is likely to be slower than ZeRO-2 if everything else is configured the same, because the former has to gather model weights in addition to what ZeRO-2 does. If ZeRO-2 meets your needs and you don't need to scale beyond a few GPUs, then you may choose to stick with it. It's important to understand that ZeRO-3 enables a much higher scalability capacity at a cost of speed.
It's possible to adjust the ZeRO-3 configuration to make it perform closer to ZeRO-2:
- set stage3_param_persistence_threshold to a very large number - larger than the largest parameter, e.g., 6 * hidden_size * hidden_size. This will keep the parameters on the GPUs.
- turn off offload_params, since ZeRO-2 doesn't have that option.
The performance will likely improve significantly with just offload_params turned off, even if you don't change stage3_param_persistence_threshold. Of course, these changes will impact the size of the model you'll be able to train. So these help you to trade scalability for speed depending on your needs.
ZeRO-2 Example
Here is a full ZeRO-2 auto-configuration file ds_config_zero2.json:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
Here is a full ZeRO-2 all-enabled, manually set configuration file. It is here mainly for you to see what the typical values look like, but we highly recommend using the one with multiple auto settings in it.
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 3e-5,
"betas": [0.8, 0.999],
"eps": 1e-8,
"weight_decay": 3e-7
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 3e-5,
"warmup_num_steps": 500
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"steps_per_print": 2000,
"wall_clock_breakdown": false
}
ZeRO-3 Example
Here is a full ZeRO-3 auto-configuration file ds_config_zero3.json:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 2000,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
Here is a full ZeRO-3 all-enabled, manually set configuration file. It is here mainly for you to see what the typical values look like, but we highly recommend using the one with multiple auto settings in it.
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 3e-5,
"betas": [0.8, 0.999],
"eps": 1e-8,
"weight_decay": 3e-7
}
},
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 3e-5,
"warmup_num_steps": 500
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": 1e6,
"stage3_prefetch_bucket_size": 0.94e6,
"stage3_param_persistence_threshold": 1e4,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"steps_per_print": 2000,
"wall_clock_breakdown": false
}
How to Choose Which ZeRO Stage and Offloads To Use For Best Performance
So now you know there are all these different stages. How do you decide which of them to use? This section will attempt to answer this question.
In general, the following applies:
- Speed-wise (left is faster than right):
Stage 0 (DDP) > Stage 1 > Stage 2 > Stage 2 + offload > Stage 3 > Stage 3 + offloads
- GPU memory usage-wise (right is more GPU memory efficient than left):
Stage 0 (DDP) < Stage 1 < Stage 2 < Stage 2 + offload < Stage 3 < Stage 3 + offloads
So when you want to get the fastest execution while fitting into the minimal number of GPUs, you can follow this process. We start with the fastest approach, and if we run into GPU OOM we move to the next slower approach, which uses less GPU memory. And so on and so forth.
First of all, set the batch size to 1 (you can always use gradient accumulation for any desired effective batch size).
1. Enable --gradient_checkpointing 1 (HF Trainer) or directly model.gradient_checkpointing_enable() - if OOM then
2. Try ZeRO stage 2 first - if OOM then
3. Try ZeRO stage 2 + offload_optimizer - if OOM then
4. Switch to ZeRO stage 3 - if OOM then
5. Enable offload_param to cpu - if OOM then
6. Enable offload_optimizer to cpu - if OOM then
7. If you still can't fit a batch size of 1, first check the various default values and lower them if you can. For example, if you use generate and you don't use a wide search beam, make it narrower, as it consumes a lot of memory.
8. Definitely use mixed half-precision over fp32 - so bf16 on Ampere and higher GPUs, and fp16 on older GPU architectures.
9. If you still OOM, you could add more hardware or enable ZeRO-Infinity - that is, switch the offloads offload_param and offload_optimizer to nvme. You need to make sure it's a very fast NVMe. As an anecdote, I was able to infer BLOOM-176B on a tiny GPU using ZeRO-Infinity, except it was extremely slow. But it worked!
ãã¡ãããæã GPU ã¡ã¢ãªå¹çã®é«ãæ§æããå§ããŠãåŸããéã«é²ãããšã§ããããã®æé ãéã«å®è¡ããããšãã§ããŸãããããã¯äºçåããŠã¿ãŠãã ããã
OOM ãåŒãèµ·ãããªãããã ãµã€ãº 1 ãååŸããããå®å¹ã¹ã«ãŒããããæž¬å®ããŸãã
次ã«ãããã ãµã€ãºãã§ããã ã倧ããããŠã¿ãŸããããã ãµã€ãºã倧ããã»ã©ãä¹ç®ããè¡åã巚倧ãªå Žåã« GPU ã®ããã©ãŒãã³ã¹ãæé«ã«ãªããããGPU ã®å¹çãåäžããŸãã
ããã§ãããã©ãŒãã³ã¹æé©åã²ãŒã ãå§ãŸããŸããäžéšã®ãªãããŒãæ©èœããªãã«ããããZeRO 段éã§ã¹ãããããŠã³ããŠããã ãµã€ãºã墿žããŠãå®å¹ã¹ã«ãŒããããå床枬å®ããããšãã§ããŸããæºè¶³ãããŸã§æŽãæµããç¹°ãè¿ããŸãã
æ°žé ã«ããã«è²»ããå¿ èŠã¯ãããŸãããã3 ãæã®ãã¬ãŒãã³ã°ãéå§ããããšããŠããå Žåã¯ãã¹ã«ãŒãããã«é¢ããŠæã广çãªèšå®ãèŠã€ããããã«æ°æ¥ãããŠãã ããããã®ããããã¬ãŒãã³ã°ã®ã³ã¹ããæå°éã«ãªãããã¬ãŒãã³ã°ãããæ©ãå®äºã§ããŸããçŸåšã®ç®ãŸããããå€åãã ML ã®äžçã§ã¯ãäœãããã¬ãŒãã³ã°ããã®ã«ããã« 1 ãæãããå Žåãçµ¶å¥œã®æ©äŒãéãå¯èœæ§ããããŸãããã¡ãããããã¯ç§ãæèŠãå ±æããŠããã ãã§ãããæ±ºããŠããªããæ¥ããããšããŠããããã§ã¯ãããŸããã BLOOM-176B ã®ãã¬ãŒãã³ã°ãéå§ããåã«ããã®ããã»ã¹ã« 2 æ¥éè²»ãããã¹ã«ãŒãããã 90 TFLOP ãã 150 TFLOP ã«åäžãããããšãã§ããŸããããã®åãçµã¿ã«ããããã¬ãŒãã³ã°æéã 1 ãæä»¥äžç¯çŽã§ããŸããã
ãããã®ã¡ã¢ã¯äž»ã«ãã¬ãŒãã³ã° ã¢ãŒãçšã«æžããããã®ã§ãããã»ãšãã©ã®å Žåã¯æšè«ã«ãé©çšãããã¯ãã§ããããšãã°ãåŸé ãã§ãã¯ãã€ã³ãã¯ãã¬ãŒãã³ã°äžã«ã®ã¿åœ¹ç«ã€ãããæšè«äžã¯äœãè¡ãããŸãããããã«ããã«ã GPU æšè«ãå®è¡ããŠããŠãDeepSpeed-InferenceãAccelerate ã¯åªããããã©ãŒãã³ã¹ãæäŸããã¯ãã§ãã
ãã®ä»ã®ããã©ãŒãã³ã¹é¢é£ã®ç°¡åãªã¡ã¢:
- äœããæåãããã¬ãŒãã³ã°ããŠããå Žåã¯ãåžžã« 16 ã§å²ãåãã圢ç¶ã®ãã³ãœã« (é ãããµã€ãºãªã©) ã䜿çšããããã«ããŠãã ãããããããµã€ãºã«ã€ããŠã¯ãå°ãªããšã 2 ã§å²ãåããããã«ããŠãã ãããGPU ããããã«é«ãããã©ãŒãã³ã¹ãåŒãåºãããå Žåã¯ãããŒããŠã§ã¢åºæã® wave ããã³ tile ã®éååã«é¢ããå¯åæ§ã®èŠä»¶ããããŸãã
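äžèšã®ãåæ°ã«åãäžãããã«ãŒã«ã¯ã次ã®ãããªå°ããªãã«ããŒã§ç¢ºèªã§ããŸã (颿°åã¯èª¬æçšã®ä»®ã®ãã®ã§ã)ã

```python
# 説æçšã®ä»®ã®ãã«ããŒ: ãµã€ãºãæå®ããåæ°ã«åãäžããŸã
def pad_to_multiple(size: int, multiple: int = 16) -> int:
    """size ä»¥äžã§æå°ã® multiple ã®åæ°ãè¿ããŸã"""
    return ((size + multiple - 1) // multiple) * multiple

print(pad_to_multiple(512))   # 512 (ãã§ã« 16 ã®åæ°)
print(pad_to_multiple(50))    # 64
print(pad_to_multiple(1000))  # 1008
```

ãªããð€ Transformers ã®ããŒã¯ãã€ã¶ãŒã«ã¯ãåæ§ã®ç®çã§äœ¿ãã `pad_to_multiple_of` åŒæ°ãçšæãããŠããŸãã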
Activation Checkpointing or Gradient Checkpointing
ã¢ã¯ãã£ããŒã·ã§ã³ ãã§ãã¯ãã€ã³ããšåŸé ãã§ãã¯ãã€ã³ãã¯ãåãæ¹æ³è«ãæã 2 ã€ã®ç°ãªãçšèªã§ããéåžžã«çŽããããã§ããããããç ç¶ã§ãã
åŸé ãã§ãã¯ãã€ã³ãã䜿çšãããšãé床ã GPU ã¡ã¢ãªãšåŒãæãã«ã§ããŸããããã«ãããGPU OOM ãå æããããããã ãµã€ãºãå¢ããããšãã§ããå€ãã®å Žåãããã©ãŒãã³ã¹ã®åäžã«ã€ãªãããŸãã
HF Transformers ã¢ãã«ã¯ãDeepSpeed ã®ã¢ã¯ãã£ããŒã·ã§ã³ ãã§ãã¯ãã€ã³ãã«ã€ããŠäœãç¥ããªããããDeepSpeed æ§æãã¡ã€ã«ã§ãã®æ©èœãæå¹ã«ããããšããŠããäœãèµ·ãããŸããã
ãããã£ãŠããã®éåžžã«æçãªæ©èœã掻çšããã«ã¯ 2 ã€ã®æ¹æ³ããããŸãã
- HF Transformers ã¢ãã«ã䜿çšãããå Žåã¯ã`model.gradient_checkpointing_enable()` ãå®è¡ããããHF Trainer ã§ `--gradient_checkpointing` ã䜿çšããŸããããã«ããããããèªåçã«æå¹ã«ãªããŸããè£ã§äœ¿ãããã®ã¯ `torch.utils.checkpoint` ã§ãã
- ç¬èªã®ã¢ãã«ãäœæããDeepSpeed ã®ã¢ã¯ãã£ããŒã·ã§ã³ ãã§ãã¯ãã€ã³ãã䜿çšãããå Žåã¯ãããã§èŠå®ãããŠãã API ã䜿çšã§ããŸãããŸããHF Transformers ã®ã¢ããªã³ã° ã³ãŒãã䜿çšã㊠`torch.utils.checkpoint` ã DeepSpeed ã® API ã«çœ®ãæããããšãã§ããŸããåŸè ã¯ãé æ¹åã®ã¢ã¯ãã£ããŒã·ã§ã³ãåèšç®ãã代ããã« CPU ã¡ã¢ãªã«ãªãããŒãã§ãããããããæè»ã§ãã
Optimizer and Scheduler
offload_optimizer
ãæå¹ã«ããªãéããDeepSpeed ã¹ã±ãžã¥ãŒã©ãŒãš HuggingFace ã¹ã±ãžã¥ãŒã©ãŒãçµã¿åãããŠäœ¿çšââã§ããŸãã
ãªããã£ãã€ã¶ãŒ (HuggingFace ã¹ã±ãžã¥ãŒã©ãŒãš DeepSpeed ãªããã£ãã€ã¶ãŒã®çµã¿åãããé€ã):
| Combos       | HF Scheduler | DS Scheduler |
|:-------------|:-------------|:-------------|
| HF Optimizer | Yes          | Yes          |
| DS Optimizer | No           | Yes          |
`offload_optimizer` ãæå¹ãªå Žåã§ããCPU ãš GPU ã®äž¡æ¹ã®å®è£ ãæã€ãªããã£ãã€ã¶ãŒã§ããã° (LAMB ãé€ã)ãDeepSpeed 以å€ã®ãªããã£ãã€ã¶ãŒã䜿çšã§ããŸãã
Optimizer
DeepSpeed ã®äž»ãªãªããã£ãã€ã¶ãŒã¯ãAdamãAdamWãOneBitAdamãLamb ã§ããããã㯠ZeRO ã§åŸ¹åºçã«ãã¹ããããŠããã ãããã£ãŠã䜿çšããããšããå§ãããŸãããã ããä»ã®ãªããã£ãã€ã¶ããtorchãããã€ã³ããŒãããããšã¯ã§ããŸããå®å šãªããã¥ã¡ã³ã㯠ãã¡ã ã«ãããŸãã
èšå®ãã¡ã€ã«ã§ optimizer
ãšã³ããªãèšå®ããªãå Žåã[Trainer
] ã¯
èªåçã«AdamW
ã«èšå®ãããæå®ãããå€ãŸãã¯æ¬¡ã®ã³ãã³ãã©ã€ã³ã®ããã©ã«ãã䜿çšãããŸãã
åŒæ°: --learning_rate
ã--adam_beta1
ã--adam_beta2
ã--adam_epsilon
ãããã³ --weight_decay
ã
以äžã¯ãAdamW
ã®èªåæ§æãããoptimizer
ãšã³ããªã®äŸã§ãã
{
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
}
}
ã³ãã³ãã©ã€ã³åŒæ°ã«ãã£ãŠæ§æãã¡ã€ã«å ã®å€ãèšå®ãããããšã«æ³šæããŠãã ãããããã¯ãå€ã®æ±ºå®çãªãœãŒã¹ã 1 ã€ã«ä¿ã¡ãããšãã°åŠç¿çãããŸããŸãªå Žæã§ç°ãªãå€ã«èšå®ãããŠããå Žåã®èŠã€ãã«ãããšã©ãŒãåé¿ããããã§ããã³ãã³ãã©ã€ã³ã®ã«ãŒã«ãåªå ãããŸãããªãŒããŒã©ã€ããããå€ã¯æ¬¡ã®ãšããã§ãã
- `lr` ã«ã¯ `--learning_rate` ã®å€
- `betas` ã«ã¯ `--adam_beta1 --adam_beta2` ã®å€
- `eps` ã«ã¯ `--adam_epsilon` ã®å€
- `weight_decay` ã«ã¯ `--weight_decay` ã®å€
ãããã£ãŠãã³ãã³ãã©ã€ã³ã§å ±æãã€ããŒãã©ã¡ãŒã¿ã調æŽããããšãå¿ããªãã§ãã ããã
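ãã® `"auto"` 眮æã®æåã¯ãæŠå¿µçã«ã¯æ¬¡ã®ãããªåŠçã«çžåœããŸã (å®éã® [`Trainer`] ã®å®è£ ãã®ãã®ã§ã¯ãªãã颿°åãåŒæ°åã¯ä»®ã®ãã®ã§ã)ã

```python
# "auto" ãšãªã£ãŠãããªããã£ãã€ã¶ãŒèšå®ãã³ãã³ãã©ã€ã³åŒæ°ã§çœ®ãæããæŠå¿µçãªã¹ã±ãã
def fill_optimizer_auto(ds_config: dict, args: dict) -> dict:
    mapping = {
        "lr": args["learning_rate"],
        "betas": [args["adam_beta1"], args["adam_beta2"]],
        "eps": args["adam_epsilon"],
        "weight_decay": args["weight_decay"],
    }
    params = ds_config["optimizer"]["params"]
    for key, value in mapping.items():
        if params.get(key) == "auto":
            params[key] = value
    return ds_config

cfg = {"optimizer": {"type": "AdamW",
                     "params": {"lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto"}}}
args = {"learning_rate": 3e-5, "adam_beta1": 0.9, "adam_beta2": 0.999,
        "adam_epsilon": 1e-8, "weight_decay": 0.0}
cfg = fill_optimizer_auto(cfg, args)
print(cfg["optimizer"]["params"]["lr"])  # 3e-05
```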
å€ãæç€ºçã«èšå®ããããšãã§ããŸãã
{
"optimizer": {
"type": "AdamW",
"params": {
"lr": 0.001,
"betas": [0.8, 0.999],
"eps": 1e-8,
"weight_decay": 3e-7
}
}
}
ãã ãããã®å Žåã¯ [`Trainer`] ã®ã³ãã³ãã©ã€ã³åŒæ°ãš DeepSpeed ã®æ§æãèªåã§åæãããå¿ èŠããããŸãã
äžèšã«ãªã¹ããããŠããªãå¥ã®ãªããã£ãã€ã¶ãŒã䜿çšããå Žåã¯ããããã¬ãã«ã®æ§æã«è¿œå ããå¿ èŠããããŸãã
{
"zero_allow_untested_optimizer": true
}
`AdamW` ãšåæ§ã«ãå ¬åŒã«ãµããŒããããŠããä»ã®ãªããã£ãã€ã¶ãŒãæ§æã§ããŸãããã ãããããã¯ç°ãªãèšå®å€ãæã€å¯èœæ§ãããããšã«æ³šæããŠãã ãããããšãã° Adam ã®å Žåã¯ã`weight_decay` ã `0.01` ä»è¿ã«ããå¿ èŠããããŸãã
ããã«ããªãããŒã㯠DeepSpeed ã® CPU Adam ãªããã£ãã€ã¶ãŒãšäœµçšãããšãã«æã广çã«æ©èœããŸãããªãããŒãã§å¥ã®ãªããã£ãã€ã¶ãŒã䜿çšãããå Žåã¯ã`deepspeed==0.8.3` 以éã§ã¯ä»¥äžã远å ããå¿ èŠããããŸãã
{
"zero_force_ds_cpu_optimizer": false
}
ãã®ãšã³ããªã¯æ§æã®æäžäœã«è¿œå ããŸãã
Scheduler
DeepSpeed ã¯ãLRRangeTest
ãOneCycle
ãWarmupLR
ãããã³WarmupDecayLR
åŠç¿çã¹ã±ãžã¥ãŒã©ãŒããµããŒãããŠããŸããå®å šãªããã¥ã¡ã³ãã¯ãã¡ãã§ãã
ããã§ã¯ãð€ Transformers ãš DeepSpeed ã®éã§ã¹ã±ãžã¥ãŒã©ãŒãéè€ããå Žæã瀺ããŸãã
- `WarmupLR` 㯠`--lr_scheduler_type constant_with_warmup` çµç±
- `WarmupDecayLR` 㯠`--lr_scheduler_type linear` çµç± (ãã㯠`--lr_scheduler_type` ã®ããã©ã«ãå€ã§ããããŸã)

ãããã£ãŠãã¹ã±ãžã¥ãŒã©ãèšå®ããªãå Žåããããããã©ã«ãã§èšå®ãããã¹ã±ãžã¥ãŒã©ã«ãªããŸãã
èšå®ãã¡ã€ã«ã§ `scheduler` ãšã³ããªãèšå®ããªãå Žåã[`Trainer`] 㯠`--lr_scheduler_type`ã`--learning_rate`ãããã³ `--warmup_steps` ãŸã㯠`--warmup_ratio` ã®å€ã䜿çšããŠããã® ð€ Transformers ããŒãžã§ã³ãæ§æããŸãã
以äžã¯ãWarmupLR
ã®èªåæ§æãããscheduler
ãšã³ããªã®äŸã§ãã
{
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
}
}
"auto" ã䜿çšãããŠããããã[Trainer
] åŒæ°ã¯èšå®ã«æ£ããå€ãèšå®ããŸãã
ãã¡ã€ã«ãããã¯ãå€ã®æ±ºå®çãªãœãŒã¹ã 1 ã€ã«ä¿ã¡ãããšãã°åŠç¿çãå Žæããšã«ç°ãªãå€ã«èšå®ãããŠããå Žåã®èŠã€ãã«ãããšã©ãŒãé¿ããããã§ããã³ãã³ãã©ã€ã³ã®ã«ãŒã«ãåªå ãããŸããèšå®ãããå€ã¯æ¬¡ã®ãšããã§ãã

- `warmup_min_lr` ã®å€ã¯ `0`
- `warmup_max_lr` ã«ã¯ `--learning_rate` ã®å€
- `warmup_num_steps` ã«ã¯ `--warmup_steps` ã®å€ (æå®ãããŠããå Žå)ããã以å€ã®å Žå㯠`--warmup_ratio` ã«ãã¬ãŒãã³ã° ã¹ãããæ°ãæããŠåãäžããå€
- `total_num_steps` ã«ã¯ `--max_steps` ã®å€ãæå®ãããŠããªãå Žåã¯ãç°å¢ãããŒã¿ã»ããã®ãµã€ãºããã®ä»ã®ã³ãã³ãã©ã€ã³åŒæ°ã«åºã¥ããŠå®è¡æã«èªåçã«å°åºãããå€ (`WarmupDecayLR` ã«å¿ èŠ)
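`--warmup_steps` ãæå®ãããªãå Žåã® `warmup_num_steps` ã®å°åºã¯ãæŠã次ã®èšç®ã«çžåœããŸã (颿°åã¯ä»®ã®ãã®ã§ãéåžžã®åäœã®ã¹ã±ããã§ã)ã

```python
import math

# --warmup_steps ãæå®ãããªãå Žåã« --warmup_ratio ãããŠã©ãŒã ã¢ããã¹ãããæ°ãå°åºããã¹ã±ãã
def get_warmup_steps(max_steps: int, warmup_steps: int = 0, warmup_ratio: float = 0.0) -> int:
    # warmup_steps ãæå®ãããŠããã°ãããåªå ããããã§ãªããã° ratio ãããåãäžãã§èšç®ããŸã
    return warmup_steps if warmup_steps > 0 else math.ceil(max_steps * warmup_ratio)

print(get_warmup_steps(1000, warmup_ratio=0.06))  # 60
print(get_warmup_steps(1000, warmup_steps=500))   # 500
```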
ãã¡ãããæ§æå€ã®äžéšãŸãã¯ãã¹ãŠãåŒãç¶ãã§ãèªåã§èšå®ããããšãã§ããŸãã
{
"scheduler": {
"type": "WarmupLR",
"params": {
"warmup_min_lr": 0,
"warmup_max_lr": 0.001,
"warmup_num_steps": 1000
}
}
}
ãã ãããã®å Žåã¯ [`Trainer`] ã®ã³ãã³ãã©ã€ã³åŒæ°ãš DeepSpeed ã®æ§æãèªåã§åæãããå¿ èŠããããŸãã
ããšãã°ãWarmupDecayLR
ã®å Žåã¯ã次ã®ãšã³ããªã䜿çšã§ããŸãã
{
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"last_batch_iteration": -1,
"total_num_steps": "auto",
"warmup_min_lr": "auto",
"warmup_max_lr": "auto",
"warmup_num_steps": "auto"
}
}
}
`total_num_steps`ã`warmup_min_lr`ã`warmup_max_lr`ãããã³ `warmup_num_steps` ã¯ããŒãæã«èšå®ãããŸãã
fp32 Precision
Deepspeed ã¯ãå®å šãª fp32 ãš fp16 ã®æ··å粟床ããµããŒãããŸãã
fp16 æ··å粟床ã䜿çšãããšãå¿
èŠãªã¡ã¢ãªã倧å¹
ã«åæžãããé床ãåäžããããã
䜿çšããŠããã¢ãã«ããã®ãã¬ãŒãã³ã° ã¢ãŒãã§é©åã«åäœããªãå Žåã¯ã䜿çšããªãæ¹ãããã§ããããéåžžãã
ã¢ãã«ã fp16 æ··å粟床ã§äºåãã¬ãŒãã³ã°ãããŠããªãå Žåã«çºçããŸã (ããšãã°ããã㯠bf16 ã§äºåãã¬ãŒãã³ã°ãããå Žåã«ããçºçããŸã)
ã¢ãã«ïŒããã®ãããªã¢ãã«ã§ã¯ããªãŒããŒãããŒãŸãã¯ã¢ã³ããŒãããŒãçºçããNaN
æå€±ãçºçããå¯èœæ§ããããŸãããããããªãã®å Žåã¯ã䜿çšããããšæãã§ããã
å®å
šãª fp32 ã¢ãŒããããã©ã«ãã® fp16 æ··å粟床ã¢ãŒããæ¬¡ã®ããã«æç€ºçã«ç¡å¹ã«ããŸãã
{
    "fp16": {
        "enabled": false
    }
}
Ampere ã¢ãŒããã¯ãã£ ããŒã¹ã® GPU ã䜿çšããŠããå ŽåãPyTorch ããŒãžã§ã³ 1.7 以éã§ã¯ãäžéšã®æŒç®ã§ã¯ããã«å¹çç㪠tf32 圢åŒãèªåçã«äœ¿çšããããã«åãæ¿ãããŸãããçµæã¯äŸç¶ãšã㊠fp32 ã«ãªããŸãã詳现ãšãã³ãããŒã¯ã«ã€ããŠã¯ãAmpere ããã€ã¹äžã® TensorFloat-32(TF32) ãåç §ããŠãã ãããããã¥ã¡ã³ãã«ã¯ãäœããã®çç±ã§ãã®èªå倿ã䜿çšããããªãå Žåã«ç¡å¹ã«ããæ¹æ³ã説æãããŠããŸãã
ð€ ãã¬ãŒããŒã§ã¯ã--tf32
ã䜿çšããŠæå¹ã«ãããã--tf32 0
ãŸã㯠--no_tf32
ã䜿çšããŠç¡å¹ã«ããããšãã§ããŸããããã©ã«ãã§ã¯ãPyTorch ã®ããã©ã«ãã䜿çšãããŸãã
Automatic Mixed Precision
pytorch ã®ãã㪠AMP ã®æ¹æ³ãŸã㯠apex ã®ãããªæ¹æ³ã§èªåæ··å粟床ã䜿çšã§ããŸãã
fp16
fp16 (float16) ãèšå®ã㊠pytorch AMP ã®ãããªã¢ãŒããèšå®ããã«ã¯:
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
}
}
[`Trainer`] 㯠`args.fp16_backend` ã®å€ã«åºã¥ããŠããããèªåçã«æå¹ãŸãã¯ç¡å¹ã«ããŸããæ®ãã®èšå®å€ã¯ããªã次第ã§ãã
ãã®ã¢ãŒãã¯ã--fp16 --fp16_backend amp
ãŸãã¯--fp16_full_eval
ã³ãã³ãã©ã€ã³åŒæ°ãæž¡ããããšæå¹ã«ãªããŸãã
ãã®ã¢ãŒããæç€ºçã«æå¹/ç¡å¹ã«ããããšãã§ããŸãã
{
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
}
}
ãã ãããã®å Žåã¯ [`Trainer`] ã®ã³ãã³ãã©ã€ã³åŒæ°ãš DeepSpeed ã®æ§æãèªåã§åæãããå¿ èŠããããŸãã
ãããããã¥ã¡ã³ãã§ãã
BF16
fp16 ã®ä»£ããã« bf16 (bfloat16) ãå¿ èŠãªå Žåã¯ãæ¬¡ã®æ§æã»ã¯ã·ã§ã³ã䜿çšãããŸãã
{
"bf16": {
"enabled": "auto"
}
}
bf16 㯠fp32 ãšåããã€ããã㯠ã¬ã³ãžãåããŠãããããæå€±ã¹ã±ãŒãªã³ã°ã¯å¿ èŠãããŸããã
ãã®ã¢ãŒãã¯ã--bf16
ãŸã㯠--bf16_full_eval
ã³ãã³ãã©ã€ã³åŒæ°ãæž¡ããããšæå¹ã«ãªããŸãã
ãã®ã¢ãŒããæç€ºçã«æå¹/ç¡å¹ã«ããããšãã§ããŸãã
{
"bf16": {
"enabled": true
}
}
deepspeed==0.6.0
ã®æç¹ã§ã¯ãbf16 ãµããŒãã¯æ°ããå®éšçãªãã®ã§ãã
bf16 ãæå¹ãªç¶æ ã§ åŸé çŽ¯ç© ã䜿çšããå Žåã¯ãbf16 ã§åŸé ã环ç©ãããããšã«æ³šæããå¿ èŠããããŸãããã®åœ¢åŒã®ç²ŸåºŠãäœããããããã¯åžæã©ããã§ã¯ãªãå¯èœæ§ããããŸããæå€±ã®ããèç©ã«ã€ãªãããŸãã
ãã®åé¡ãä¿®æ£ããããé«ç²ŸåºŠã® dtype
(fp16 ãŸã㯠fp32) ã䜿çšãããªãã·ã§ã³ãæäŸããããã®äœæ¥ãè¡ãããŠããŸãã
NCCL Collectives

èšç·Žäœå¶ã® `dtype` ãšã¯å¥ã«ããªãã¯ã·ã§ã³ãåé/忣æäœãªã©ã®ã³ãã¥ãã±ãŒã·ã§ã³éåäœã«äœ¿çšããã `dtype` ããããŸãã

ãã¹ãŠã®åé/忣æäœã¯ãããŒã¿ãå«ãŸããŠããã®ãšåã `dtype` ã§å®è¡ããããããbf16 ãã¬ãŒãã³ã°äœå¶ã䜿çšããŠããå ŽåãããŒã¿ã¯ bf16 ã§åéãããŸããåéã¯æå€±ã®ãªãæäœã§ãã

äžæ¹ãåçš®ã®ãªãã¥ãŒã¹æäœã¯å€§ããªæå€±ãçããå¯èœæ§ããããŸããããšãã°ãè€æ°ã® GPU éã§åŸé ãå¹³ååããéã«éä¿¡ã fp16 ãŸã㯠bf16 ã§è¡ããããšãè€æ°ã®æ°å€ãäœç²ŸåºŠã§å ç®ããçµæã¯æ£ç¢ºã§ã¯ãªããªããŸããbf16 㯠fp16 ããã粟床ãäœããããªãããã§ããéåžžã¯éåžžã«å°ããåŸé ãå¹³ååããéã®æå€±ã¯æå°éã§ãããããfp16 ã§ååãªããšãå€ãã§ãããã®ãããå粟床ãã¬ãŒãã³ã°ã§ã¯ããã©ã«ãã§ fp16 ããªãã¯ã·ã§ã³æŒç®ã®ããã©ã«ããšããŠäœ¿çšãããŸãã

ãã ãããã®æ©èœã¯å®å šã«å¶åŸ¡ã§ããŸããå¿ èŠã«å¿ããŠå°ããªãªãŒããŒãããã远å ãããªãã¯ã·ã§ã³ã®çޝç© dtype ãšã㊠fp32 ã䜿çšããçµæã®æºåãã§ããæç¹ã§ã®ã¿ããã¬ãŒãã³ã°ã§äœ¿çšããŠããå粟床 `dtype` ã«ããŠã³ãã£ã¹ãããããã«ããããšãã§ããŸãã

ããã©ã«ãããªãŒããŒã©ã€ãããã«ã¯ãæ°ããæ§æãšã³ããªã远å ããã ãã§ãã

{
    "communication_data_type": "fp32"
}
ãã®èšäºã®å·çæç¹ã§ã®æå¹ãªå€ã¯ã"fp16"ã"bfp16"ã"fp32"ã§ãã
泚: ã¹ããŒãž ãŒã 3 ã«ã¯ãbf16 éä¿¡ã¿ã€ãã«é¢ãããã°ããããdeepspeed==0.8.1
ã§ä¿®æ£ãããŸããã
apex
apex AMP ã®ãããªã¢ãŒã ã»ãããèšå®ããã«ã¯:
"amp": {
"enabled": "auto",
"opt_level": "auto"
}
[`Trainer`] 㯠`args.fp16_backend` ããã³ `args.fp16_opt_level` ã®å€ã«åºã¥ããŠããããèªåçã«èšå®ããŸãã
ãã®ã¢ãŒãã¯ã--fp16 --fp16_backend apex --fp16_opt_level 01
ã³ãã³ã ã©ã€ã³åŒæ°ãæž¡ããããšæå¹ã«ãªããŸãã
ãã®ã¢ãŒããæç€ºçã«æ§æããããšãã§ããŸãã
{
"amp": {
"enabled": true,
"opt_level": "O1"
}
}
ãã ãããã®å Žåã¯ [`Trainer`] ã®ã³ãã³ãã©ã€ã³åŒæ°ãš DeepSpeed ã®æ§æãèªåã§åæãããå¿ èŠããããŸãã
ããã¯ããã¥ã¡ã³ãã§ãã
Batch Size
ããããµã€ãºãèšå®ããã«ã¯ã次ã䜿çšããŸãã
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto"
}
[`Trainer`] ã¯èªåçã« `train_micro_batch_size_per_gpu` ã `args.per_device_train_batch_size` ã®å€ã«ã`train_batch_size` ã `args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps` ã«èšå®ããŸãã
å€ãæç€ºçã«èšå®ããããšãã§ããŸãã
{
"train_batch_size": 12,
"train_micro_batch_size_per_gpu": 4
}
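3 ã€ã®å€ã®é¢ä¿ã¯ `train_batch_size = world_size à train_micro_batch_size_per_gpu à gradient_accumulation_steps` ã§ããæç€ºçã«èšå®ããéã®æŽåæ§ã¯ã次ã®ãããªå°ããªã¹ã±ããã§ç¢ºèªã§ããŸã (颿°åã¯ä»®ã®ãã®ã§ã)ã

```python
# train_batch_size ãšæ§æèŠçŽ ã®æŽåæ§ã確èªããå°ããªã¹ã±ãã
def check_batch_config(train_batch_size, micro_batch_per_gpu, grad_accum, world_size):
    return train_batch_size == micro_batch_per_gpu * grad_accum * world_size

print(check_batch_config(12, 4, 1, 3))  # True: 4 à 1 à 3 = 12
print(check_batch_config(12, 4, 3, 1))  # True: 4 à 3 à 1 = 12
print(check_batch_config(12, 4, 2, 2))  # False: 4 à 2 à 2 = 16
```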
ãã ãããã®å Žåã¯ [`Trainer`] ã®ã³ãã³ãã©ã€ã³åŒæ°ãš DeepSpeed ã®æ§æãèªåã§åæãããå¿ èŠããããŸãã
Gradient Accumulation
åŸé 环ç©ã»ãããæ§æããã«ã¯:
{
"gradient_accumulation_steps": "auto"
}
[Trainer
] ã¯èªåçã«ããã args.gradient_accumulation_steps
ã®å€ã«èšå®ããŸãã
å€ãæç€ºçã«èšå®ããããšãã§ããŸãã
{
"gradient_accumulation_steps": 3
}
ãã ãããã®å Žåã¯ [`Trainer`] ã®ã³ãã³ãã©ã€ã³åŒæ°ãš DeepSpeed ã®æ§æãèªåã§åæãããå¿ èŠããããŸãã
Gradient Clipping
åŸé ã¯ãªããã³ã°ãæ§æããã«ã¯:
{
"gradient_clipping": "auto"
}
[Trainer
] ã¯èªåçã«ããã args.max_grad_norm
ã®å€ã«èšå®ããŸãã
å€ãæç€ºçã«èšå®ããããšãã§ããŸãã
{
"gradient_clipping": 1.0
}
ãã ãããã®å Žåã¯ [`Trainer`] ã®ã³ãã³ãã©ã€ã³åŒæ°ãš DeepSpeed ã®æ§æãèªåã§åæãããå¿ èŠããããŸãã
Getting The Model Weights Out
ãã¬ãŒãã³ã°ãç¶ç¶ããDeepSpeed ã®äœ¿çšãåéããéããäœãå¿é ããå¿ èŠã¯ãããŸãããDeepSpeed ã¯ããã¹ã¿ãŒã® fp32 éã¿ãã«ã¹ã¿ã ãã§ãã¯ãã€ã³ã ãªããã£ãã€ã¶ãŒ ãã¡ã€ã« (glob ãã¿ãŒã³ã¯ `global_step*/*optim_states.pt`) ã«ä¿åããŠãããããã¯éåžžã®ãã§ãã¯ãã€ã³ãã®äžã«ä¿åãããŸãã
FP16 ãŠã§ã€ã:
ã¢ãã«ã ZeRO-2 ã§ä¿åãããšãã¢ãã«ã®éã¿ãå«ãéåžžã® `pytorch_model.bin` ãã¡ã€ã«ãäœæãããŸããããããã¯éã¿ã® fp16 ããŒãžã§ã³ã«ãããŸããã

ZeRO-3 ã§ã¯ãã¢ãã«ã®éã¿ãè€æ°ã® GPU ã«åå²ããããããç¶æ³ã¯ããã«è€éã«ãªããŸãã[`Trainer`] ã« fp16 ããŒãžã§ã³ã®éã¿ãä¿åãããã«ã¯ã`"stage3_gather_16bit_weights_on_model_save": true` ãå¿ èŠã§ãããã®èšå®ã `False` ã®å Žåã`pytorch_model.bin` ã¯äœæãããŸãããããã¯ãããã©ã«ãã§ã¯ DeepSpeed ã® `state_dict` ã«å®éã®éã¿ã§ã¯ãªããã¬ãŒã¹ãã«ããŒãå«ãŸããããã§ãããã® `state_dict` ãä¿åããŠããããŒãçŽãããšã¯ã§ããŸããã
{
"zero_optimization": {
"stage3_gather_16bit_weights_on_model_save": true
}
}
FP32 ãŠã§ã€ã:

fp16 ãŠã§ã€ãã¯ãã¬ãŒãã³ã°ãåéããã®ã«ã¯é©ããŠããŸãããã¢ãã«ã®åŸ®èª¿æŽãå®äºããåŸãã¢ãã«ãã¢ãã« ããã«ã¢ããããŒãããããä»ã®äººã«æž¡ãããããå Žåã¯ãfp32 ã®éã¿ãå¿ èŠã«ãªãã§ããããããã¯å€§éã®ã¡ã¢ãªãå¿ èŠãšããããã»ã¹ã§ããããããã¬ãŒãã³ã°äžã«ã¯è¡ããªãã®ãçæ³çã§ãããããã£ãŠããã¬ãŒãã³ã°ã®å®äºåŸã«ãªãã©ã€ã³ã§å®è¡ããã®ãæé©ã§ãããã ããå¿ èŠã«å¿ããŠã空ã CPU ã¡ã¢ãªãååã«ããå Žåã¯ãåããã¬ãŒãã³ã° ã¹ã¯ãªããå ã§å®è¡ã§ããããšãèŠããŠãããŠãã ãããæ¬¡ã®ã»ã¯ã·ã§ã³ã§ã¯ãäž¡æ¹ã®ã¢ãããŒãã«ã€ããŠèª¬æããŸãã
ã©ã€ã FP32 ãŠã§ã€ã ãªã«ããª:
ã¢ãã«ã倧ããããã¬ãŒãã³ã°ã®çµäºæã«ç©ºã CPU ã¡ã¢ãªãã»ãšãã©æ®ã£ãŠããªãå Žåããã®ã¢ãããŒãã¯æ©èœããªãå¯èœæ§ããããŸãã
å°ãªããšã 1 ã€ã®ãã§ãã¯ãã€ã³ããä¿åããŠããŠãææ°ã®ãã§ãã¯ãã€ã³ãã䜿çšãããå Žåã¯ãæ¬¡ã®æé ãå®è¡ã§ããŸãã
from transformers.trainer_utils import get_last_checkpoint
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
`--load_best_model_at_end` [`TrainingArguments`] åŒæ° (æè¯ã®ã¢ãã«ã®ãã§ãã¯ãã€ã³ãã远跡ããããã®ãã®) ã䜿çšããŠããå Žåã¯ãæåã«æçµã¢ãã«ãæç€ºçã«ä¿åããŠãããäžèšãšåãããšãè¡ãããšã§ãã¬ãŒãã³ã°ãçµäºã§ããŸãã
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint
checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
trainer.deepspeed.save_checkpoint(checkpoint_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)
`load_state_dict_from_zero_checkpoint` ãå®è¡ãããšãåãã¢ããªã±ãŒã·ã§ã³ã® DeepSpeed ã³ã³ããã¹ãã§ã¯ `model` ã¯ãã¯äœ¿çšã§ããªããªãããšã«æ³šæããŠãã ãããã€ãŸããdeepspeed ãšã³ãžã³ãååæåããå¿ èŠããããŸãã`model.load_state_dict(state_dict)` ãã¢ãã«ãããã¹ãŠã® DeepSpeed ããžãã¯ãåé€ããŠããŸãããã§ãããããã£ãŠãããã¯ãã¬ãŒãã³ã°ã®æåŸã«ã®ã¿å®è¡ããŠãã ããã
ãã¡ããã[`Trainer`] ã䜿çšããå¿ èŠã¯ãªããäžèšã®äŸãç¬èªã®ãã¬ãŒããŒã«åãããŠèª¿æŽããããšãã§ããŸãã
äœããã®çç±ã§ããã«æ¹è¯ãããå Žåã¯ãéã¿ã® fp32 state_dict
ãæœåºããŠé©çšããããšãã§ããŸãã
次ã®äŸã«ç€ºãããã«ããããã¯èªåã§äœæããŸãã
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir) # already on cpu
model = model.cpu()
model.load_state_dict(state_dict)
ãªãã©ã€ã³ FP32 ãŠã§ã€ã ãªã«ããª:
DeepSpeed ã¯ç¹å¥ãªå€æã¹ã¯ãªãã `zero_to_fp32.py` ãäœæãããã§ãã¯ãã€ã³ãã®æäžäœãã©ã«ããŒã«é 眮ããŸãããã®ã¹ã¯ãªããã䜿çšãããšããã€ã§ãéã¿ãæœåºã§ããŸããã¹ã¯ãªããã¯ã¹ã¿ã³ãã¢ãã³ãªã®ã§ãæœåºã®ããã«èšå®ãã¡ã€ã«ã [`Trainer`] ã¯å¿ èŠãããŸããã
ãã§ãã¯ãã€ã³ã ãã©ã«ããŒã次ã®ããã«ãªã£ãŠãããšããŸãã
$ ls -l output_dir/checkpoint-1/
-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json
drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/
-rw-rw-r-- 1 stas stas 12 Mar 27 13:16 latest
-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt
-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin
-rw-rw-r-- 1 stas stas 623 Mar 27 20:42 scheduler.pt
-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json
-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model
-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json
-rw-rw-r-- 1 stas stas 339 Mar 27 20:42 trainer_state.json
-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin
-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py*
ãã®äŸã§ã¯ãDeepSpeed ãã§ãã¯ãã€ã³ã ãµããã©ã«ã㌠global_step1 ã 1 ã€ã ããããŸãããããã£ãŠãFP32ãåæ§ç¯ããã«ã¯ éã¿ãå®è¡ããã ãã§ã:
python zero_to_fp32.py . pytorch_model.bin
ããã ãã§ã`pytorch_model.bin` ã«ã¯ãè€æ°ã® GPU ããçµ±åãããå®å šãª fp32 ã¢ãã«ã®éã¿ãå«ãŸããããã«ãªããŸãã
ã¹ã¯ãªããã¯ãZeRO-2 ãŸã㯠ZeRO-3 ãã§ãã¯ãã€ã³ããèªåçã«åŠçã§ããããã«ãªããŸãã
python zero_to_fp32.py -h
ãå®è¡ãããšãäœ¿çšæ¹æ³ã®è©³çްã衚瀺ãããŸãã
ã¹ã¯ãªããã¯ããã¡ã€ã« `latest` ã®å 容ã䜿çšã㊠deepspeed ãµããã©ã«ããŒãèªåæ€åºããŸãããã®äŸã§ã¯ `global_step1` ã§ãã
泚: çŸåšãã¹ã¯ãªããã«ã¯æçµç㪠fp32 ã¢ãã«ã®éã¿ã® 2 åã®äžè¬ RAM ãå¿ èŠã§ãã
ZeRO-3 ãš Infinity Nuances
ZeRO-3 ã¯ããã©ã¡ãŒã¿ ã·ã£ãŒãã£ã³ã°æ©èœã®ç¹ã§ ZeRO-2 ãšã¯å€§ããç°ãªããŸãã
ZeRO-Infinity 㯠ZeRO-3 ãããã«æ¡åŒµããNVMe ã¡ã¢ãªããã®ä»ã®è€æ°ã®é床ãšã¹ã±ãŒã©ããªãã£ã®åäžããµããŒãããŸãã
ã¢ãã«ã«ç¹å¥ãªå€æŽãå ããå¿ èŠããªããŠãæ£åžžã«åäœããããã«ããããåªåãæãããŠããŸããããç¹å®ã®ç¹ã§ã¯ ç¶æ³ã«ãã£ãŠã¯ãæ¬¡ã®æ å ±ãå¿ èŠã«ãªãå ŽåããããŸãã
Constructing Massive Models
DeepSpeed/ZeRO-3 ã¯ãæ¢åã® RAM ã«åãŸããªãå¯èœæ§ã®ããæ°å ã®ãã©ã¡ãŒã¿ãæã€ã¢ãã«ãåŠçã§ããŸãããã®ãããªå Žåã ãŸããåæåãããé«éã«å®è¡ãããå Žåã¯ãdeepspeed.zero.Init() ã䜿çšããŠã¢ãã«ãåæåããŸãã ã³ã³ããã¹ã ãããŒãžã£ãŒ (颿°ãã³ã¬ãŒã¿ãŒã§ããããŸã)ãæ¬¡ã®ããã«ãªããŸãã
from transformers import T5ForConditionalGeneration, T5Config
import deepspeed
with deepspeed.zero.Init():
config = T5Config.from_pretrained("t5-small")
model = T5ForConditionalGeneration(config)
ã芧ã®ãšãããããã«ããã©ã³ãã ã«åæåãããã¢ãã«ãåŸãããŸãã
äºåãã¬ãŒãã³ã°ãããã¢ãã«ã䜿çšãããå Žåã`model_class.from_pretrained` ã¯ã`is_deepspeed_zero3_enabled()` ã `True` ãè¿ãéãããã®æ©èœãæå¹ã«ããŸããããã¯çŸåšãæž¡ããã DeepSpeed æ§æãã¡ã€ã«ã« ZeRO-3 æ§æã»ã¯ã·ã§ã³ãå«ãŸããŠããå Žåã«ã[`TrainingArguments`] ãªããžã§ã¯ãã«ãã£ãŠèšå®ãããŸãããããã£ãŠã`from_pretrained` åŒã³åºãã®**åã«** [`TrainingArguments`] ãªããžã§ã¯ããäœæããå¿ èŠããããŸããèããããã·ãŒã±ã³ã¹ã®äŸã次ã«ç€ºããŸãã
from transformers import AutoModel, Trainer, TrainingArguments
training_args = TrainingArguments(..., deepspeed=ds_config)
model = AutoModel.from_pretrained("t5-small")
trainer = Trainer(model=model, args=training_args, ...)
å ¬åŒã®ãµã³ãã« ã¹ã¯ãªããã䜿çšããŠããŠãã³ãã³ãã©ã€ã³åŒæ°ã« `--deepspeed ds_config.json` (ZeRO-3 èšå®ãæå¹) ãå«ãŸããŠããå Žåã¯ããµã³ãã« ã¹ã¯ãªããã¯ãã®ããã«æžãããŠãããããããã¯ãã¹ãŠãã§ã«å®äºããŠããŸãã
泚: ã¢ãã«ã® fp16 éã¿ãåäžã® GPU ã®ã¡ã¢ãªã«åãŸããªãå Žåã¯ããã®æ©èœã䜿çšããå¿ èŠããããŸãã
ãã®æ¹æ³ãšãã®ä»ã®é¢é£æ©èœã®è©³çްã«ã€ããŠã¯ãå€§èŠæš¡ã¢ãã«ã®æ§ç¯ ãåç §ããŠãã ããã
ãŸããfp16 ã§äºåãã¬ãŒãã³ã°ãããã¢ãã«ãããŒããããšãã¯ã`from_pretrained` ã« `torch_dtype=torch.float16` ã䜿çšããããæç€ºããå¿ èŠããããŸãã詳现ã«ã€ããŠã¯ãfrom_pretrained-torch-dtype ãåç §ããŠãã ããã
Gathering Parameters
è€æ°ã® GPU äžã® ZeRO-3 ã§ã¯ãçŸåšã® GPU ã®ãã©ã¡ãŒã¿ã§ãªãéããåäžã® GPU ããã¹ãŠã®ãã©ã¡ãŒã¿ãæã€ããšã¯ãããŸããã å®è¡å±€ããããã£ãŠããã¹ãŠã®ã¬ã€ã€ãŒã®ãã¹ãŠã®ãã©ã¡ãŒã¿ãŒã«äžåºŠã«ã¢ã¯ã»ã¹ããå¿ èŠãããå Žåã¯ããããè¡ãããã®ç¹å®ã®æ¹æ³ããããŸãã ã»ãšãã©ã®å Žåã¯å¿ èŠãããŸããããå¿ èŠãªå Žåã¯ããã©ã¡ãŒã¿ã®åé ãåç §ããŠãã ããã
ãã ããããã€ãã®å Žæã§å éšçã«äœ¿çšãããŠããŸãããã®äŸã® 1 ã€ã¯ãäºåãã¬ãŒãã³ã°ãããã¢ãã«ã®éã¿ãããŒããã `from_pretrained` ã§ããäžåºŠã« 1 ã€ã®ã¬ã€ã€ãŒãããŒãããåå ããŠãããã¹ãŠã® GPU ã«å³åº§ã«åå²ããŸããå€§èŠæš¡ãªã¢ãã«ã§ã¯ãã¡ã¢ãªã®å¶éã«ããã1 ã€ã® GPU ã«ããŒãããŠããè€æ°ã® GPU ã«åæ£ããããšãã§ããªãããã§ãã
ãŸããZeRO-3 ã§ç¬èªã®ã³ãŒããäœæããŠããŠã次ã®ãããªã¢ãã« ãã©ã¡ãŒã¿ãŒã®éã¿ã«ééããå Žå:
tensor([1.0], device="cuda:0", dtype=torch.float16, requires_grad=True)
(`tensor([1.])` ã§ããç¹ã«æ³šç®)ããŸãã¯ãã©ã¡ãŒã¿ã®ãµã€ãºããæ¬æ¥ã®å€§ããªå€æ¬¡å 圢ç¶ã§ã¯ãªã `1` ã§ãããšãããšã©ãŒãçºçããå Žåãããã¯ãã©ã¡ãŒã¿ãŒãåå²ãããŠããã衚瀺ãããŠããã®ã¯ ZeRO-3 ã®ãã¬ãŒã¹ãã«ããŒã§ããããšãæå³ããŸãã
ZeRO Inference
ZeRO Inference ã¯ãZeRO-3 Training ãšåãæ§æã䜿çšããŸãããªããã£ãã€ã¶ãŒãšã¹ã±ãžã¥ãŒã©ãŒã®ã»ã¯ã·ã§ã³ãäžèŠã«ãªãã ãã§ããå®éããã¬ãŒãã³ã°ãšåãæ§æãã¡ã€ã«ãå ±æãããå Žåã¯ããããã®ã»ã¯ã·ã§ã³ãæ®ããŠãããŸããŸããããã ç¡èŠãããã ãã§ãã
ãã以å€ã®å Žåã¯ãéåžžã® [TrainingArguments
] åŒæ°ãæž¡ãã ãã§ããäŸãã°ïŒ
deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json
å¯äžéèŠãªããšã¯ãZeRO-3 æ§æã䜿çšããå¿ èŠããããšããããšã§ãããã©ã¡ãŒã¿ãŒã®ã·ã£ãŒãã£ã³ã°ãå®è¡ããã®ã¯ ZeRO-3 ã®ã¿ã§ãããZeRO-1/ZeRO-2 ãã·ã£ãŒãã£ã³ã°ããåŸé ããªããã£ãã€ã¶ãŒã®ç¶æ ã¯æšè«ã§ã¯äœ¿ãããªããããZeRO-2 ã«ã¯æšè«ã«ãããå©ç¹ããŸã£ãããããŸããã
以äžã¯ãå©çšå¯èœãªãã¹ãŠã® GPU ããããã€ãã DeepSpeed ã§run_translation.py
ãå®è¡ããäŸã§ãã
deepspeed examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero3.json \
--model_name_or_path t5-small --output_dir output_dir \
--do_eval --max_eval_samples 50 --warmup_steps 50 \
--max_source_length 128 --val_max_target_length 128 \
--overwrite_output_dir --per_device_eval_batch_size 4 \
--predict_with_generate --dataset_config "ro-en" --fp16 \
--source_lang en --target_lang ro --dataset_name wmt16 \
--source_prefix "translate English to Romanian: "
æšè«ã®ããã«ããªããã£ãã€ã¶ãŒã®ç¶æ ãšåŸé ã«ãã£ãŠäœ¿çšããã远å ã®å€§ããªã¡ã¢ãªã¯å¿ èŠãªãããã ã¯ããã«å€§ããªããããã·ãŒã±ã³ã¹é·ãåãããŒããŠã§ã¢ã«é©åã§ããå¿ èŠããããŸãã
ããã«ãDeepSpeed ã¯çŸåšãDeepspeed-Inference ãšåŒã°ããé¢é£è£œåãéçºããŠããŸãããã㯠ZeRO ãã¯ãããžãŒãšã¯äœã®é¢ä¿ããªãã代ããã«ãã³ãœã«äžŠååŠçã䜿çšããŠãåäžã® GPU ã«åãŸããªãã¢ãã«ãã¹ã±ãŒãªã³ã°ããŸããããã¯çŸåšéçºäžã§ãã補åã宿ãããçµ±åãæäŸããäºå®ã§ãã
Memory Requirements
Deepspeed ZeRO ã¯ã¡ã¢ãªã CPU (ããã³ NVMe) ã«ãªãããŒãã§ããããããã¬ãŒã ã¯ãŒã¯ã¯ã䜿çšãããŠãã GPU ã®æ°ã«å¿ããŠå¿ èŠãª CPU ããã³ GPU ã¡ã¢ãªã®éãç¥ãããšãã§ãããŠãŒãã£ãªãã£ãæäŸããŸãã
åäžã® GPU ã§ bigscience/T0_3B
ã埮調æŽããããã«å¿
èŠãªã¡ã¢ãªã®éãèŠç©ãã£ãŠã¿ãŸãããã
$ python -c 'from transformers import AutoModel; \
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
model = AutoModel.from_pretrained("bigscience/T0_3B"); \
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)'
[...]
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 2783M total params, 65M largest layer params.
per CPU | per GPU | Options
70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=1
62.23GB | 5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=0
0.37GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=1
15.56GB | 46.91GB | offload_param=none, offload_optimizer=none, zero_init=0
ãããã£ãŠãåäžã® 80GB GPU ãªã CPU ãªãããŒããªãã§åãŸããŸãããå°ã㪠8GB GPU ã§ãæå€§çŽ 60GB ã® CPU ã¡ã¢ãªãããã°åãŸããŸãã(ããã¯ãã©ã¡ãŒã¿ããªããã£ãã€ã¶ã®ç¶æ ãããã³åŸé ã®ããã®ã¡ã¢ãªã§ããããšã«æ³šæããŠãã ãããCUDA ã«ãŒãã«ãã¢ã¯ãã£ããŒã·ã§ã³ãäžæã¡ã¢ãªã«ã¯ããå°ãå€ãã®ã¡ã¢ãªãå¿ èŠã§ãã)
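äžã®è¡šã®ããªãããŒããªãã®ãGPU ãããã®å€ã¯ãZeRO-3 ã§ã¢ãã«ç¶æ (fp16 ãã©ã¡ãŒã¿ 2 ãã€ã + fp16 åŸé 2 ãã€ã + fp32 ãªããã£ãã€ã¶ç¶æ çŽ 14 ãã€ã = èšçŽ 18 ãã€ã/ãã©ã¡ãŒã¿) ã GPU æ°ã§åå²ãããšããæŠç®ã§ããããã説æã§ããŸãã以äžã¯ãã®æŠç®ã®ã¹ã±ããã§ã (æ£ç¢ºãªå€ã¯äžèšã® DeepSpeed ã®æšå®ããŒã«ã䜿çšããŠãã ãã)ã

```python
def zero3_gpu_mem_gb(num_params: int, num_gpus: int) -> float:
    # ãã©ã¡ãŒã¿ãããçŽ 18 ãã€ã (2+2+14) ã GPU æ°ã§åå²ãããšããç²ãæŠç®
    return 18 * num_params / num_gpus / 2**30

# T0_3B (çŽ 2783M ãã©ã¡ãŒã¿) ã®å Žå: 1 GPU ã§çŽ 47GBã2 GPU ã§çŽ 23GB (äžã®è¡šã®å€ã«è¿ã)
print(round(zero3_gpu_mem_gb(2_783_000_000, 1), 1))
print(round(zero3_gpu_mem_gb(2_783_000_000, 2), 1))
```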
次ã«ãã³ã¹ããšé床ã®ãã¬ãŒããªãã«ãªããŸããããå°ãã GPU ãè³Œå ¥ãŸãã¯ã¬ã³ã¿ã«ããæ¹ãå®ããªããŸã (Deepspeed ZeRO ã§ã¯è€æ°ã® GPU ã䜿çšã§ãããããGPU ã®æ°ãæžããããšãã§ããŸã)ããããããã®å Žåã¯é ããªããŸãããã®ãããäœããå®è¡ããéåºŠãæ°ã«ããªããŠããé床ã®äœäžã¯ GPU ã®äœ¿çšæéã«çŽæ¥åœ±é¿ããã³ã¹ããå¢å€§ãããããã©ããæã广çããå®éšããŠæ¯èŒããŠãã ããã
åå㪠GPU ã¡ã¢ãªãããå Žåã¯ããã¹ãŠãé«éã«ãªããããCPU/NVMe ãªãããŒããå¿ ãç¡å¹ã«ããŠãã ããã
ããšãã°ã2 ã€ã® GPU ã«å¯ŸããŠåãããšãç¹°ãè¿ããŠã¿ãŸãããã
$ python -c 'from transformers import AutoModel; \
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
model = AutoModel.from_pretrained("bigscience/T0_3B"); \
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=2, num_nodes=1)'
[...]
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 2 GPUs per node.
SW: Model with 2783M total params, 65M largest layer params.
per CPU | per GPU | Options
70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
70.00GB | 0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
62.23GB | 2.84GB | offload_param=none, offload_optimizer=cpu , zero_init=1
62.23GB | 2.84GB | offload_param=none, offload_optimizer=cpu , zero_init=0
0.74GB | 23.58GB | offload_param=none, offload_optimizer=none, zero_init=1
31.11GB | 23.58GB | offload_param=none, offload_optimizer=none, zero_init=0
ãããã£ãŠãããã§ã¯ãCPU ã«ãªãããŒãããã« 2x 32GB 以äžã® GPU ãå¿ èŠã«ãªããŸãã
詳现ã«ã€ããŠã¯ãã¡ã¢ãªæšå®ããŒã« ãåç §ããŠãã ããã
Filing Issues
ããã§ã¯ãåé¡ã®ççžãããã«è§£æããäœæ¥ã®ãããã¯ãè§£é€ã§ãããããåé¡ãå ±åããæ¹æ³ã説æããŸãã
ã¬ããŒãã«ã¯å¿ ãæ¬¡ã®å 容ãå«ããŠãã ããã
-
ã¬ããŒãå ã®å®å šãª Deepspeed æ§æãã¡ã€ã«
-
[`Trainer`] ã䜿çšããŠããå Žåã¯ã³ãã³ãã©ã€ã³åŒæ°ããèªåã§ã¹ã¯ãªãããäœæããŠããå Žå㯠[`TrainingArguments`] åŒæ°ãå«ããŠãã ããããã ãã[`TrainingArguments`] ã«ã¯ç¡é¢ä¿ãªãšã³ããªã倿°å«ãŸããŠãããããããããã³ãããã®ã¯é¿ããŠãã ããã
次ã®åºå:
python -c 'import torch; print(f"torch: {torch.__version__}")' python -c 'import transformers; print(f"transformers: {transformers.__version__}")' python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")'
-
å¯èœã§ããã°ãåé¡ãåçŸã§ãã Google Colab ããŒãããã¯ãžã®ãªã³ã¯ãå«ããŠãã ãããåºçºç¹ãšããŠãã®ããŒãããã¯ã䜿çšã§ããŸãã
-
äžå¯èœã§ãªãéããã«ã¹ã¿ã ããŒã¿ã»ããã§ã¯ãªããåžžã«äœ¿çšã§ããæšæºããŒã¿ã»ããã䜿çšããŠãã ããã
-
å¯èœã§ããã°ãæ¢åã® ãµã³ãã« ã®ããããã䜿çšããŠåé¡ãåçŸããŠã¿ãŠãã ããã
-
Deepspeed ãåé¡ã®åå ã§ã¯ãªãããšããããããŸãã
æåºãããåé¡ã®äžã«ã¯ãDeepspeed ãšã¯ç¡é¢ä¿ã ã£ããã®ããããŸããã€ãŸããDeepspeed ãã»ããã¢ããããåé€ããŠãåé¡ãæ®ã£ãŠããå Žåã§ãã

ãããã£ãŠãDeepSpeed é¢é£ã®åé¡ã§ããããšãå®å šã«æçœã§ãªãå Žåãã€ãŸãäŸå€ãçºçã㊠DeepSpeed ã¢ãžã¥ãŒã«ãé¢ä¿ããŠãããšã¯éããªãå Žåã¯ããŸã DeepSpeed ãå«ãŸãªãã»ããã¢ãããåãã¹ãããŠãã ãããDeepSpeed ã䜿çšããå Žåã«ã®ã¿åé¡ãæ®ããšãã«ãDeepspeed ã«ã€ããŠèšåããå¿ èŠãªè©³çްããã¹ãŠæäŸããŠãã ããã
-
åé¡ãçµ±åéšåã§ã¯ãªã DeepSpeed ã³ã¢ã«ããããšãæãããªå Žåã¯ãDeepspeed ã«çŽæ¥ issue ãæåºããŠãã ãããããããããªãå Žåã§ãããå®å¿ãã ãããã©ã¡ãã® issue ãã©ãã«ãŒã§ãåé¡ãããŸãããæçš¿ããããã°ãã¡ãã§å€æããå¿ èŠã§ããã°å¥ã® issue ãã©ãã«ãŒã«ãªãã€ã¬ã¯ãããŸãã
Troubleshooting
the `deepspeed` process gets killed at startup without a traceback

`deepspeed` ããã»ã¹ãèµ·åæã«ãã¬ãŒã¹ããã¯ãªãã§åŒ·å¶çµäºãããå Žåãããã¯éåžžãããã°ã©ã ãã·ã¹ãã ã«ãã CPU ã¡ã¢ãªããŸãã¯ããã»ã¹ã«å²ãåœãŠãèš±å¯ãããŠãã以äžã® CPU ã¡ã¢ãªãå²ãåœãŠãããšããOS ã«ãŒãã«ããã®ããã»ã¹ã匷å¶çµäºããããšãæå³ããŸãã

ããã¯ãèšå®ãã¡ã€ã«ã« `offload_optimizer` ãŸã㯠`offload_param` (ãããã¯äž¡æ¹) ãå«ãŸããŠããŠããããã `cpu` ã«ãªãããŒãããããã«èšå®ãããŠããå¯èœæ§ãé«ãã§ããNVMe ãããå Žåã¯ãZeRO-3 ã§å®è¡ããŠãããªã NVMe ãžã®ãªãããŒãã詊ããŠãã ãããç¹å®ã®ã¢ãã«ã«å¿ èŠãªã¡ã¢ãªéãèŠç©ããæ¹æ³ã¯ãã¡ã (https://deepspeed.readthedocs.io/en/latest/memory.html) ã§ãã
training and/or eval/predict loss is NaN
ããã¯ãbf16 æ··å粟床ã¢ãŒãã§äºåãã¬ãŒãã³ã°ãããã¢ãã«ãååŸããããã fp16 (æ··åç²ŸåºŠã®æç¡ã«ããããã) ã§äœ¿çšããããšããå Žåã«ããçºçããŸãã TPU ã§ãã¬ãŒãã³ã°ãããã»ãšãã©ã®ã¢ãã«ãããã³å€ãã®å ŽåãGoogle ã«ãã£ãŠãªãªãŒã¹ãããã¢ãã«ã¯ããã®ã«ããŽãªã«åé¡ãããŸã (ããšãã°ãã»ãŒãã¹ãŠã® t5 ããŒã¹ã®ã¢ãã«)ãããã§ã®è§£æ±ºçã¯ãããŒããŠã§ã¢ããµããŒãããŠããå Žå (TPUãAmpere GPU 以é)ãfp32 ãŸã㯠bf16 ã䜿çšããããšã§ãã
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
}
}
ãã°ã«ã¯ãDeepspeed ãæ¬¡ã®ããã«OVERFLOW!
ãå ±åããŠããããšãããããŸãã
0%| | 0/189 [00:00<?, ?it/s]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 262144
1%|â | 1/189 [00:00<01:26, 2.17it/s]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072.0
1%|ââ
[...]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
14%|âââââââââââââââââ | 27/189 [00:14<01:13, 2.21it/s]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
15%|ââââââââââââââââââ | 28/189 [00:14<01:13, 2.18it/s]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
15%|ââââââââââââââââââ | 29/189 [00:15<01:13, 2.18it/s]
[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
[...]
ããã¯ãDeepspeed æå€±ã¹ã±ãŒã©ãŒãæå€±ãªãŒããŒãããŒãå æããã¹ã±ãŒãªã³ã°ä¿æ°ãèŠã€ããããªãããšãæå³ããŸãã
(ãã°ã¯ããã§èªã¿ãããããããã«ãããµãŒãžãããŠããŸãã)
ãã®å Žåãé垞㯠`initial_scale_power` ã®å€ãäžããå¿ èŠããããŸããé垞㯠`initial_scale_power: 32` ã«èšå®ãããšåé¡ã¯è§£æ±ºããŸãã
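`initial_scale_power` ãšåææå€±ã¹ã±ãŒã«ã®é¢ä¿ã¯ `loss_scale = 2 ** initial_scale_power` ã§ããããã©ã«ãã® 16 ã§ã¯ 65536 ããå§ãŸããŸãã

```python
def initial_loss_scale(initial_scale_power: int) -> int:
    # DeepSpeed ã®åçæå€±ã¹ã±ãŒãªã³ã°ã¯ 2 ã® initial_scale_power ä¹ããéå§ãããŸã
    return 2 ** initial_scale_power

print(initial_loss_scale(16))  # 65536
print(initial_loss_scale(32))  # 4294967296
```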
Notes
- DeepSpeed ã«ã¯ pip ã§ã€ã³ã¹ããŒã«å¯èœãª PyPI ããã±ãŒãžããããŸãããããŒããŠã§ã¢ã«æãé©åããããã«ããŸãæå¹ã«ããå¿ èŠãããå Žåã¯ããœãŒã¹ ããã€ã³ã¹ããŒã«ããããšã匷ããå§ãããŸãã 1 ããã Adam ãªã©ã®ç¹å®ã®æ©èœã¯ãpypi ãã£ã¹ããªãã¥ãŒã·ã§ã³ã§ã¯å©çšã§ããŸããã
- ð€ Transformers ã§ DeepSpeed ã䜿çšããããã« [`Trainer`] ã䜿çšããå¿ èŠã¯ãããŸãã - ä»»æã®ã¢ãã«ãç¬èªã®ãã¬ãŒããŒãšçµã¿åãããŠäœ¿çšã§ããŸãããã®å Žåããã¬ãŒããŒã¯ DeepSpeed çµ±åæé ã«åŸã£ãŠèª¿æŽããå¿ èŠããããŸãã
Non-Trainer Deepspeed Integration
[`~integrations.HfDeepSpeedConfig`] ã¯ã[`Trainer`] ã䜿çšããªãå Žåã« Deepspeed ã ð€ Transformers ã³ã¢æ©èœã«çµ±åããããã«äœ¿çšãããŸãããããè¡ãå¯äžã®ããšã¯ã`from_pretrained` åŒã³åºãäžã« Deepspeed ZeRO-3 ã®ãã©ã¡ãŒã¿åéãåŠçããã¢ãã«ãè€æ°ã® GPU ã«èªåçã«åå²ããããšã§ãããã以å€ã¯ãã¹ãŠèªåã§è¡ãå¿ èŠããããŸãã

[`Trainer`] ã䜿çšãããšããã¹ãŠãèªåçã«åŠçãããŸãã

[`Trainer`] ã䜿çšããªãå ŽåãDeepSpeed ZeRO-3 ãå¹ççã«å°å ¥ããã«ã¯ãã¢ãã«ãã€ã³ã¹ã¿ã³ã¹åããåã« [`~integrations.HfDeepSpeedConfig`] ãªããžã§ã¯ããã€ã³ã¹ã¿ã³ã¹åãããã®ãªããžã§ã¯ããçãããŸãŸã«ããŠãã ããã

Deepspeed ZeRO-1 ãŸã㯠ZeRO-2 ã䜿çšããŠããå Žåã¯ã`HfDeepSpeedConfig` ã䜿çšããå¿ èŠã¯ãŸã£ãããããŸããã
ããšãã°ãäºåãã¬ãŒãã³ã°ãããã¢ãã«ã®å Žåã¯æ¬¡ã®ããã«ãªããŸãã
from transformers.integrations import HfDeepSpeedConfig
from transformers import AutoModel
import deepspeed
ds_config = {...} # deepspeed config object or path to the file
# must run before instantiating the model to detect zero 3
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
model = AutoModel.from_pretrained("gpt2")
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
ãŸãã¯ãäºåãã¬ãŒãã³ã°ãããŠããªãã¢ãã«ã®å Žå:
from transformers.integrations import HfDeepSpeedConfig
from transformers import AutoModel, AutoConfig
import deepspeed
ds_config = {...} # deepspeed config object or path to the file
# must run before instantiating the model to detect zero 3
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
config = AutoConfig.from_pretrained("gpt2")
model = AutoModel.from_config(config)
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)
[`Trainer`] çµ±åã䜿çšããŠããªãå Žåã¯ãå®å šã«ç¬åã§è¡ãããšã«ãªãç¹ã«æ³šæããŠãã ãããåºæ¬çã«ã¯ Deepspeed Web ãµã€ãã®ããã¥ã¡ã³ãã«åŸã£ãŠãã ããããŸããèšå®ãã¡ã€ã«ã¯æç€ºçã«èšå®ããå¿ èŠããããŸãã`"auto"` å€ã¯äœ¿çšã§ããã代ããã«å®éã®å€ãå ¥åããå¿ èŠããããŸãã
HfDeepSpeedConfig
autodoc integrations.HfDeepSpeedConfig - all
Custom DeepSpeed ZeRO Inference
以äžã¯ãåäžã® GPU ã«ã¢ãã«ãé©åã§ããªãå Žåã«ã[Trainer
] ã䜿çšããã« DeepSpeed ZeRO æšè«ãå®è¡ããæ¹æ³ã®äŸã§ãã解決çã«ã¯ã远å ã® GPU ã®äœ¿çšããŸã㯠GPU ã¡ã¢ãªã CPU ã¡ã¢ãªã«ãªãããŒãããããšãå«ãŸããŸãã
ããã§çè§£ãã¹ãéèŠãªãã¥ã¢ã³ã¹ã¯ãZeRO ã®èšè𿹿³ã«ãããç°ãªã GPU ã§ç°ãªãå ¥åã䞊è¡ããŠåŠçã§ãããšããããšã§ãã
ãã®äŸã«ã¯å€§éã®ã¡ã¢ããããèªå·±ææžåãããŠããŸãã
å¿ ãæ¬¡ã®ããšãè¡ã£ãŠãã ããã
- åå㪠GPU ã¡ã¢ãªãããå Žåã¯ãCPU ãªãããŒããç¡å¹ã«ããŸã (é床ãäœäžãããã)ã
- Ampere ãŸãã¯æ°ãã GPU ãææããŠããå Žåã¯ãåŠçãé«éåããããã« bf16 ãæå¹ã«ããŸãããã®ããŒããŠã§ã¢ããªãå Žåã¯ãbf16 æ··å粟床ã§äºåãã¬ãŒãã³ã°ãããã¢ãã« (ã»ãšãã©ã® t5 ã¢ãã«ãªã©) ã䜿çšããªãéããfp16 ãæå¹ã«ããããšãã§ããŸãããããã¯éåžžãfp16 ã§ãªãŒããŒãããŒããåºåãšããŠã¬ããŒãžã衚瀺ãããŸãã
```python
#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in an inference mode when one can't fit a model
# into a single GPU
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use a 3B "bigscience/T0_3B" model which needs about 15GB GPU RAM - so 1 largish or 2
# small GPUs can handle it, or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0" which needs about 50GB, unless you have an 80GB GPU -
# you will need 2-4 gpus. And then you can adapt the script to handle more gpus if you want to
# process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so chances are that if you
# have a lot of available CPU memory and you don't mind a slowdown you should be able to load a
# model that doesn't normally fit into a single GPU. If you have enough GPU memory and don't need
# the CPU offload, the program will run faster - so disable that section then.
#
# To deploy on 1 gpu:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 gpus:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py

import os

import deepspeed
import torch
from transformers import AutoConfig, AutoModelForSeq2SeqLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # To avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

model_name = "bigscience/T0_3B"

config = AutoConfig.from_pretrained(model_name)
model_hidden_size = config.d_model

# batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
#
# - enable bf16 if you use Ampere or higher GPU - this will run in mixed precision and will be
# faster.
#
# - for older GPUs you can enable fp16, but it'll only work for non-bf16 pretrained models - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section if you
#   don't want CPU offload
#
# - if using `offload_param` you can manually finetune stage3_param_persistence_threshold to
#   control which params should remain on gpus - the larger the value the smaller the offload size
#
# For in-depth info on Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except it uses lower case for true/false
# fmt: off
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# next line instructs transformers to partition the model directly over multiple gpus using
# deepspeed.zero.Init when model's `from_pretrained` method is called.
#
# **it has to be run before loading the model AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model will first be loaded normally and only partitioned at forward time which is
# less efficient and when there is little CPU RAM may fail
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now a model can be loaded.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# initialise Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So for 2 gpus you process 2 inputs at once.
# If you use more GPUs adjust for more.
#
# And of course if you have just one input to process you then need to pass the same string to both gpus.
# If you use only one GPU, then you will have only rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")
```
Let's save it as `t0.py` and run it:
```bash
$ deepspeed --num_gpus 2 t0.py
rank0:
   in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy
  out=Positive
rank1:
   in=Is this review positive or negative? Review: this is the worst restaurant ever
  out=negative
```
This was a very basic example and you will want to adapt it to your needs.
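The ZeRO-3 buffer values in the config above were derived from the model's hidden size; if you adapt the script to another model, the same arithmetic can be packaged into a small helper (the function name here is ours for illustration, not part of any library):

```python
def zero3_auto_buffers(hidden_size: int) -> dict:
    # mirrors the hidden-size arithmetic used for ds_config in the script above
    return {
        "reduce_bucket_size": hidden_size * hidden_size,
        "stage3_prefetch_bucket_size": int(0.9 * hidden_size * hidden_size),
        "stage3_param_persistence_threshold": 10 * hidden_size,
    }


# e.g. for a model whose config reports d_model (or hidden_size) of 1024
print(zero3_auto_buffers(1024))
```

The returned dict can be merged into the `zero_optimization` section of the config before calling `deepspeed.initialize`.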
### Generate nuances
When using multiple GPUs with ZeRO Stage-3, one has to synchronize the GPUs by calling `generate(..., synced_gpus=True)`. If this is not done and one GPU finishes generating before the others, the whole system hangs, because the remaining GPUs can no longer receive the weight shards from the GPU that stopped generating.

Starting from `transformers>=4.28`, if `synced_gpus` isn't explicitly specified, it is automatically set to `True` when these conditions are detected. You can still override the value of `synced_gpus` if needed.
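The defaulting behaviour described above can be sketched roughly as follows (the helper name and exact conditions are illustrative, a simplification of the library's internals, not its actual code):

```python
def default_synced_gpus(synced_gpus, world_size, zero3_enabled):
    """Illustrative sketch: `synced_gpus` defaults to True only when it was
    left unset while running ZeRO Stage-3 on more than one GPU; an explicit
    value is always respected."""
    if synced_gpus is None:
        return world_size > 1 and zero3_enabled
    return synced_gpus


print(default_synced_gpus(None, 2, True))   # unset + ZeRO-3 + 2 GPUs -> sync
print(default_synced_gpus(None, 1, True))   # single GPU: no sync needed
print(default_synced_gpus(False, 2, True))  # explicit value is respected
```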
## Testing Deepspeed Integration
If you're submitting a PR that involves the DeepSpeed integration, note that our CircleCI PR CI setup has no GPUs, so the tests requiring GPUs run on a different CI, nightly only. Therefore a green CI report on your PR doesn't mean the DeepSpeed tests passed.

To run the DeepSpeed tests, please run at least:

```bash
RUN_SLOW=1 pytest tests/deepspeed/test_deepspeed.py
```

If you changed any of the modeling or pytorch examples code, then run the model zoo tests as well. The following will run all DeepSpeed tests:

```bash
RUN_SLOW=1 pytest tests/deepspeed
```
## Main DeepSpeed Resources
Papers:

- [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054)
- [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840)
- [ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857)
Finally, please remember that the HuggingFace [`Trainer`] only integrates DeepSpeed, so if you have any problems or questions with regard to DeepSpeed usage, please file an issue on the [DeepSpeed GitHub](https://github.com/microsoft/DeepSpeed/issues).