Mirror of https://github.com/huggingface/transformers.git (synced 2025-07-31 10:12:23 +06:00)
fixed typo in fp16 training section for perf_train_gpu_one (#19736)
parent 8db92dbe26
commit 5cbf1fa8ca
@@ -311,7 +311,7 @@ We can see that this saved some more memory but at the same time training became
## Floating Data Types
-The idea of mixed precision training is that no all variables need to be stored in full (32-bit) floating point precision. If we can reduce the precision the variables and their computations are faster. Here are the commonly used floating point data types choice of which impacts both memory usage and throughput:
+The idea of mixed precision training is that not all variables need to be stored in full (32-bit) floating point precision. If we can reduce the precision the variables and their computations are faster. Here are the commonly used floating point data types choice of which impacts both memory usage and throughput:
- fp32 (`float32`)
|
||||
- fp16 (`float16`)
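For a concrete feel of these formats, a minimal sketch, assuming PyTorch is available, that prints the bit width and maximum representable value of each dtype via `torch.finfo` (illustrative only, not part of the patched page):

```py
import torch

# Illustrative only: inspect the bit width and value range of the dtypes
# discussed in this section of the guide.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{dtype}: {info.bits} bits, max ~ {info.max:.3e}")
```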
@@ -328,7 +328,7 @@ While fp16 and fp32 have been around for quite some time, bf16 and tf32 are only
### FP16 Training
-The idea of mixed precision training is that no all variables need to be stored in full (32-bit) floating point precision. If we can reduce the precision the variales and their computations are faster. The main advantage comes from saving the activations in half (16-bit) precision. Although the gradients are also computed in half precision they are converted back to full precision for the optimization step so no memory is saved here. Since the model is present on the GPU in both 16-bit and 32-bit precision this can use more GPU memory (1.5x the original model is on the GPU), especially for small batch sizes. Since some computations are performed in full and some in half precision this approach is also called mixed precision training. Enabling mixed precision training is also just a matter of setting the `fp16` flag to `True`:
+The idea of mixed precision training is that not all variables need to be stored in full (32-bit) floating point precision. If we can reduce the precision the variales and their computations are faster. The main advantage comes from saving the activations in half (16-bit) precision. Although the gradients are also computed in half precision they are converted back to full precision for the optimization step so no memory is saved here. Since the model is present on the GPU in both 16-bit and 32-bit precision this can use more GPU memory (1.5x the original model is on the GPU), especially for small batch sizes. Since some computations are performed in full and some in half precision this approach is also called mixed precision training. Enabling mixed precision training is also just a matter of setting the `fp16` flag to `True`:
```py
training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)
```
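For context, a hedged sketch of how the resulting `training_args` would be used; `model`, `ds`, and `default_args` are assumed to be the model, tokenized training dataset, and shared keyword arguments set up earlier in the guide, not defined in this hunk:

```py
from transformers import Trainer, TrainingArguments

# Sketch only: `model`, `ds`, and `default_args` are assumed to be defined
# earlier in the guide (the model under test, a tokenized training dataset,
# and a dict of shared TrainingArguments keyword arguments).
training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)

# Mixed precision is handled internally by the Trainer; no other change is needed.
trainer = Trainer(model=model, args=training_args, train_dataset=ds)
result = trainer.train()
print(result.metrics)
```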