Mirror of https://github.com/huggingface/transformers.git
Synced 2025-07-31 02:02:21 +06:00
[Doc]: Broken link in Kubernetes doc (#33879)
* add relative path in .md and redirects to conf.py
* add redirects to conf.py and update .md
* modify links in .md
This commit is contained in:
parent: 124713c32b
commit: b6a01df6e9
@@ -11,4 +11,4 @@ black_avoid_patterns = {
     "{processor_class}": "FakeProcessorClass",
     "{model_class}": "FakeModelClass",
     "{object_class}": "FakeObjectClass",
 }
@@ -138,16 +138,16 @@ Now, run the following command in node0 and **4DDP** will be enabled in node0 an
 ## Usage with Kubernetes
 
 The same distributed training job from the previous section can be deployed to a Kubernetes cluster using the
-[Kubeflow PyTorchJob training operator](https://www.kubeflow.org/docs/components/training/pytorch/).
+[Kubeflow PyTorchJob training operator](https://www.kubeflow.org/docs/components/training/user-guides/pytorch).
 
 ### Setup
 
 This example assumes that you have:
-* Access to a Kubernetes cluster with [Kubeflow installed](https://www.kubeflow.org/docs/started/installing-kubeflow/)
-* [`kubectl`](https://kubernetes.io/docs/tasks/tools/) installed and configured to access the Kubernetes cluster
-* A [Persistent Volume Claim (PVC)](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) that can be used
+* Access to a Kubernetes cluster with [Kubeflow installed](https://www.kubeflow.org/docs/started/installing-kubeflow)
+* [`kubectl`](https://kubernetes.io/docs/tasks/tools) installed and configured to access the Kubernetes cluster
+* A [Persistent Volume Claim (PVC)](https://kubernetes.io/docs/concepts/storage/persistent-volumes) that can be used
   to store datasets and model files. There are multiple options for setting up the PVC including using an NFS
-  [storage class](https://kubernetes.io/docs/concepts/storage/storage-classes/) or a cloud storage bucket.
+  [storage class](https://kubernetes.io/docs/concepts/storage/storage-classes) or a cloud storage bucket.
 * A Docker container that includes your model training script and all the dependencies needed to run the script. For
   distributed CPU training jobs, this typically includes PyTorch, Transformers, Intel Extension for PyTorch, Intel
   oneCCL Bindings for PyTorch, and OpenSSH to communicate between the containers.
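For the PVC prerequisite above, a minimal manifest sketch might look like the following. This is not part of the commit; the claim name, storage class, and size are hypothetical and must be adapted to your cluster:

```yaml
# Hypothetical PVC for shared datasets and model files; adjust
# storageClassName to match your cluster (e.g. an NFS-backed class).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: transformers-pvc        # hypothetical name
spec:
  accessModes:
    - ReadWriteMany             # so all worker pods can share the volume
  storageClassName: nfs-client  # hypothetical NFS storage class name
  resources:
    requests:
      storage: 100Gi            # hypothetical size
```

`ReadWriteMany` is the access mode that lets every worker pod mount the same volume; an NFS storage class typically supports it, while many cloud block-storage classes do not.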
@@ -176,7 +176,7 @@ PyTorchJob to the cluster.
 
 ### PyTorchJob Specification File
 
-The [Kubeflow PyTorchJob](https://www.kubeflow.org/docs/components/training/pytorch/) is used to run the distributed
+The [Kubeflow PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch) is used to run the distributed
 training job on the cluster. The yaml file for the PyTorchJob defines parameters such as:
 * The name of the PyTorchJob
 * The number of replicas (workers)
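To make the listed parameters concrete, a heavily trimmed PyTorchJob sketch is shown below. The name, image, replica count, and resource amounts are hypothetical assumptions for illustration; the commit's actual yaml example is not reproduced here:

```yaml
# Hypothetical, trimmed PyTorchJob; a real spec also needs the training
# command, volume mounts for the PVC, and environment settings.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: transformers-cpu-training   # the name of the PyTorchJob
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 4                   # the number of replicas (workers)
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/my-training-image:latest  # hypothetical image
              resources:
                requests:           # equal requests and limits give the
                  cpu: "8"          # pod "guaranteed" quality of service
                  memory: 64Gi
                limits:
                  cpu: "8"
                  memory: 64Gi
```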
@@ -273,12 +273,13 @@ To run this example, update the yaml based on your training script and the nodes
 
 <Tip>
 
-The CPU resource limits/requests in the yaml are defined in [cpu units](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu)
+The CPU resource limits/requests in the yaml are defined in
+[cpu units](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu)
 where 1 CPU unit is equivalent to 1 physical CPU core or 1 virtual core (depending on whether the node is a physical
 host or a VM). The amount of CPU and memory limits/requests defined in the yaml should be less than the amount of
 available CPU/memory capacity on a single machine. It is usually a good idea to not use the entire machine's capacity in
 order to leave some resources for the kubelet and OS. In order to get ["guaranteed"](https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed)
-[quality of service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/) for the worker pods,
+[quality of service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod) for the worker pods,
 set the same CPU and memory amounts for both the resource limits and requests.
 
 </Tip>
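As a side note on the cpu-unit arithmetic in the Tip above, a small helper (hypothetical, not part of the commit or of any Kubernetes client library) can convert Kubernetes CPU quantity strings into core counts:

```python
def parse_cpu_units(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity string into a number of cores.

    Hypothetical helper for illustration: the "m" suffix means millicores,
    so "500m" is half a core, while a bare number like "2" is whole cores.
    """
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000.0
    return float(quantity)


print(parse_cpu_units("500m"))  # 0.5
print(parse_cpu_units("2"))     # 2.0
```

Checking the parsed request against a node's core count is one way to verify that the yaml stays below a single machine's capacity, as the Tip recommends.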
@@ -318,4 +319,4 @@ with the job, the PyTorchJob resource can be deleted from the cluster using `kub
 
 This guide covered running distributed PyTorch training jobs using multiple CPUs on bare metal and on a Kubernetes
 cluster. Both cases utilize Intel Extension for PyTorch and Intel oneCCL Bindings for PyTorch for optimal training
-performance, and can be used as a template to run your own workload on multiple nodes.
+performance, and can be used as a template to run your own workload on multiple nodes.