[Doc]: Broken link in Kubernetes doc (#33879)

* add relative path in .md and redirects to conf.py

* add redirects to conf.py and update .md

* modify links in .md
Deepak Saldanha · 2024-10-04 14:50:56 +05:30 · committed by GitHub
parent 124713c32b
commit b6a01df6e9
2 changed files with 11 additions and 10 deletions


@@ -11,4 +11,4 @@ black_avoid_patterns = {
     "{processor_class}": "FakeProcessorClass",
     "{model_class}": "FakeModelClass",
     "{object_class}": "FakeObjectClass",
-}
+}


@@ -138,16 +138,16 @@ Now, run the following command in node0 and **4DDP** will be enabled in node0 an
 ## Usage with Kubernetes
 The same distributed training job from the previous section can be deployed to a Kubernetes cluster using the
-[Kubeflow PyTorchJob training operator](https://www.kubeflow.org/docs/components/training/pytorch/).
+[Kubeflow PyTorchJob training operator](https://www.kubeflow.org/docs/components/training/user-guides/pytorch).
 ### Setup
 This example assumes that you have:
-* Access to a Kubernetes cluster with [Kubeflow installed](https://www.kubeflow.org/docs/started/installing-kubeflow/)
-* [`kubectl`](https://kubernetes.io/docs/tasks/tools/) installed and configured to access the Kubernetes cluster
-* A [Persistent Volume Claim (PVC)](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) that can be used
+* Access to a Kubernetes cluster with [Kubeflow installed](https://www.kubeflow.org/docs/started/installing-kubeflow)
+* [`kubectl`](https://kubernetes.io/docs/tasks/tools) installed and configured to access the Kubernetes cluster
+* A [Persistent Volume Claim (PVC)](https://kubernetes.io/docs/concepts/storage/persistent-volumes) that can be used
   to store datasets and model files. There are multiple options for setting up the PVC including using an NFS
-  [storage class](https://kubernetes.io/docs/concepts/storage/storage-classes/) or a cloud storage bucket.
+  [storage class](https://kubernetes.io/docs/concepts/storage/storage-classes) or a cloud storage bucket.
 * A Docker container that includes your model training script and all the dependencies needed to run the script. For
   distributed CPU training jobs, this typically includes PyTorch, Transformers, Intel Extension for PyTorch, Intel
   oneCCL Bindings for PyTorch, and OpenSSH to communicate between the containers.
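
For context on the setup list above, the PVC it mentions can be declared with a short manifest. This is a minimal sketch only; the claim name `transformers-pvc`, the `nfs-client` storage class, and the 50Gi size are illustrative placeholders, not values taken from this PR or the guide.

```yaml
# Minimal PVC sketch (placeholder names and size, not from this PR).
# An NFS-backed storage class lets every worker pod mount the same volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: transformers-pvc
spec:
  storageClassName: nfs-client   # placeholder NFS storage class
  accessModes:
    - ReadWriteMany              # shared read/write across worker pods
  resources:
    requests:
      storage: 50Gi              # placeholder size for datasets and model files
```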
@@ -176,7 +176,7 @@ PyTorchJob to the cluster.
 ### PyTorchJob Specification File
-The [Kubeflow PyTorchJob](https://www.kubeflow.org/docs/components/training/pytorch/) is used to run the distributed
+The [Kubeflow PyTorchJob](https://www.kubeflow.org/docs/components/training/user-guides/pytorch) is used to run the distributed
 training job on the cluster. The yaml file for the PyTorchJob defines parameters such as:
 * The name of the PyTorchJob
 * The number of replicas (workers)
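
The parameters listed in that hunk map onto a PyTorchJob manifest roughly as follows. This skeleton is a hedged sketch rather than the spec file shipped with the guide; the job name, replica count, image, command, and volume names are placeholders.

```yaml
# Skeleton PyTorchJob (placeholders only; see the guide's full yaml for real values).
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: transformers-pytorchjob        # name of the PyTorchJob
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 4                      # number of worker replicas
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch            # the training operator expects a container named "pytorch"
              image: "example.com/your-training-image:latest"      # placeholder image
              command: ["torchrun", "/workspace/run_training.py"]  # placeholder command
              volumeMounts:
                - name: training-storage
                  mountPath: /workspace
          volumes:
            - name: training-storage
              persistentVolumeClaim:
                claimName: transformers-pvc   # placeholder PVC name
```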
@@ -273,12 +273,13 @@ To run this example, update the yaml based on your training script and the nodes
 <Tip>
-The CPU resource limits/requests in the yaml are defined in [cpu units](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu)
+The CPU resource limits/requests in the yaml are defined in
+[cpu units](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#meaning-of-cpu)
 where 1 CPU unit is equivalent to 1 physical CPU core or 1 virtual core (depending on whether the node is a physical
 host or a VM). The amount of CPU and memory limits/requests defined in the yaml should be less than the amount of
 available CPU/memory capacity on a single machine. It is usually a good idea to not use the entire machine's capacity in
 order to leave some resources for the kubelet and OS. In order to get ["guaranteed"](https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#guaranteed)
-[quality of service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/) for the worker pods,
+[quality of service](https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod) for the worker pods,
 set the same CPU and memory amounts for both the resource limits and requests.
 </Tip>
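
In a worker container spec, the guaranteed-QoS advice in that tip comes down to making `requests` identical to `limits`, for example as below; the CPU and memory amounts are placeholders and should stay below a single node's capacity.

```yaml
# Resources sketch: identical requests and limits give the pod "Guaranteed" QoS.
resources:
  limits:
    cpu: 32            # cpu units (cores), leaving headroom for the kubelet and OS
    memory: 64Gi
  requests:
    cpu: 32            # must match the limit for Guaranteed QoS
    memory: 64Gi
```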
@@ -318,4 +319,4 @@ with the job, the PyTorchJob resource can be deleted from the cluster using `kub
 This guide covered running distributed PyTorch training jobs using multiple CPUs on bare metal and on a Kubernetes
 cluster. Both cases utilize Intel Extension for PyTorch and Intel oneCCL Bindings for PyTorch for optimal training
-performance, and can be used as a template to run your own workload on multiple nodes.
+performance, and can be used as a template to run your own workload on multiple nodes.