Mirror of https://github.com/huggingface/transformers.git
Update perf_train_gpu_many.md (#31451)
* Update perf_train_gpu_many.md
* Update docs/source/en/perf_train_gpu_many.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/en/perf_train_gpu_many.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
---------
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
This commit is contained in:
parent 280cef51b3
commit 22b41b3f8a
@@ -56,15 +56,15 @@ impact performance. Here's a breakdown of your options:
 If your model can comfortably fit onto a single GPU, you have two primary options:
 
 1. DDP - Distributed DataParallel
-2. ZeRO - depending on the situation and configuration used, this method may or may not be faster, however, it's worth experimenting with it.
+2. [Zero Redundancy Optimizer (ZeRO)](https://arxiv.org/abs/1910.02054) - depending on the situation and configuration used, this method may or may not be faster, however, it's worth experimenting with it.
 
 **Case 2: Your model doesn't fit onto a single GPU:**
 
 If your model is too large for a single GPU, you have several alternatives to consider:
 
 1. PipelineParallel (PP)
-2. ZeRO
-3. TensorParallel (TP)
+2. [ZeRO](https://arxiv.org/abs/1910.02054)
+3. [TensorParallel](#tensor-parallelism) (TP)
 
 With very fast inter-node connectivity (e.g., NVLINK or NVSwitch) all three strategies (PP, ZeRO, TP) should result in
 similar performance. However, without these, PP will be faster than TP or ZeRO. The degree of TP may also
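For the DDP option in Case 1 of the diff above, the following is a minimal sketch (not part of the commit) of what such a training script could look like, assuming a hypothetical `train.py` built around `Trainer` and launched with `torchrun`; the model name, dataset, and hyperparameters are illustrative only.

```python
# Minimal DDP sketch (Case 1, option 1). Hypothetical train.py; launch with:
#   torchrun --nproc_per_node=8 train.py
# Under torchrun, Trainer wraps the model in torch.nn.parallel.DistributedDataParallel,
# so each GPU holds a full model replica and trains on its own slice of each batch.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)


def main():
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

    # Illustrative dataset; any tokenized dataset works the same way.
    train_dataset = load_dataset("glue", "mrpc", split="train").map(
        lambda batch: tokenizer(batch["sentence1"], batch["sentence2"], truncation=True),
        batched=True,
    )

    args = TrainingArguments(
        output_dir="ddp_out",
        per_device_train_batch_size=16,  # per-GPU batch size; effective batch = 16 * num GPUs
        num_train_epochs=1,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    trainer.train()


if __name__ == "__main__":
    main()
```

The same script run without `torchrun` falls back to single-GPU training, which is why DDP is usually the lowest-effort starting point when the model fits on one device.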
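For the ZeRO option (listed under both cases in the diff), one way to experiment with it is through the Trainer's DeepSpeed integration. The sketch below is an assumed, untuned stage-3 configuration passed as a plain dict; it requires the `deepspeed` package, and every value is illustrative rather than a recommendation from the commit.

```python
# Sketch of enabling ZeRO stage 3 via the Trainer's DeepSpeed integration
# (requires `deepspeed` to be installed). Stage 3 shards optimizer states,
# gradients, and parameters across GPUs, which is what makes Case 2 workable.
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {"stage": 3},  # shard optimizer states, gradients, and parameters
    # "auto" lets the integration fill these in from TrainingArguments
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
}

args = TrainingArguments(
    output_dir="zero3_out",
    per_device_train_batch_size=8,
    bf16=True,
    deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
)
# Build the model and Trainer exactly as in the DDP sketch, then call
# trainer.train(); launch with `torchrun` or the `deepspeed` launcher.
```

Whether this beats plain DDP depends on the model size and the stage/offload settings, which is why the doc frames ZeRO as something worth experimenting with rather than a default.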
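As a rough aid to the connectivity note at the end of the hunk, the snippet below (an assumption, not from the commit) checks whether GPUs on a node can reach each other via peer-to-peer access, which fast links such as NVLink or NVSwitch provide; `nvidia-smi topo -m` reports the actual link topology.

```python
# Rough intra-node connectivity check: peer-to-peer access between GPUs is a
# prerequisite for fast direct GPU-to-GPU transfers (NVLink/NVSwitch provide
# this), but it does not by itself prove NVLink is present.
import torch

num_gpus = torch.cuda.device_count()
for src in range(num_gpus):
    for dst in range(num_gpus):
        if src != dst:
            p2p = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: peer access {'yes' if p2p else 'no'}")
```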