Switch the default XPU TP backend to PyTorch's built-in XCCL starting with PyTorch 2.8 (#39024)

* Switch the default XPU TP backend to PyTorch's built-in XCCL starting with PyTorch 2.8

Signed-off-by: YAO Matrix <matrix.yao@intel.com>

* Update docs/source/en/perf_infer_gpu_multi.md

Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>

* Update perf_infer_gpu_multi.md

* Update perf_infer_gpu_multi.md

* Update perf_infer_gpu_multi.md

---------

Signed-off-by: YAO Matrix <matrix.yao@intel.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
Author: Yao Matrix, 2025-06-30 23:54:05 +08:00, committed by GitHub
Parent: 9e0c865b8b
Commit: 29a3f5ed8c
2 changed files with 6 additions and 4 deletions

docs/source/en/perf_infer_gpu_multi.md

@@ -15,9 +15,9 @@ rendered properly in your Markdown viewer.
# Distributed inference
-When a model doesn't fit on a single GPU, distributed inference with [tensor parallelism](./perf_train_gpu_many#tensor-parallelism) can help. Tensor parallelism shards a model onto multiple GPUs and parallelizes computations such as matrix multiplication. It enables fitting larger model sizes into memory and is faster because each GPU can process a tensor slice.
+When a model doesn't fit on a single GPU, distributed inference with [tensor parallelism](./perf_train_gpu_many#tensor-parallelism) can help. Tensor parallelism shards a model onto multiple accelerators (CUDA GPU, Intel XPU, etc.) and parallelizes computations such as matrix multiplication. It enables fitting larger model sizes into memory and is faster because each accelerator can process a tensor slice.
-However, tensor parallelism adds communication overhead and should be used on single machine setups with multiple GPUs to take advantage of fast intra-node communication. For multi-node training, it may be more efficient to use pipeline or data parallelism depending on your use case.
+However, tensor parallelism adds communication overhead and should be used on single machine setups with multiple accelerators to take advantage of fast intra-node communication. For multi-node training, it may be more efficient to use pipeline or data parallelism depending on your use case.
> [!TIP]
> Refer to the [Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=tensor_parallelism) section on tensor parallelism to learn more.
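
The page this hunk edits is built around the `tp_plan="auto"` argument of `from_pretrained`. As a minimal sketch of that workflow (checkpoint name, prompt, and script name are illustrative, not part of this commit), launched with something like `torchrun --nproc-per-node 4 run_tp_infer.py`:

```python
# run_tp_infer.py -- launch with: torchrun --nproc-per-node 4 run_tp_infer.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative checkpoint

# tp_plan="auto" asks Transformers to shard supported layers across the
# accelerators visible to the torchrun process group (CUDA GPUs, Intel XPUs, ...).
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, tp_plan="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Tensor parallelism shards a model so that", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)
```

On an Intel XPU machine with PyTorch 2.8 or newer, this path now initializes the process group with the built-in `xccl` backend by default, which is the point of the code change below.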


@@ -57,10 +57,12 @@ def initialize_tensor_parallelism(tp_plan, tp_size=None):
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
-    backend_map = {"cuda": "nccl", "cpu": "gloo", "xpu": "ccl", "hpu": "hccl"}
+    backend_map = {"cuda": "nccl", "cpu": "gloo", "xpu": "xccl", "hpu": "hccl"}
    backend = backend_map.get(device_type)
    if device_type == "cpu" and int(os.environ.get("CCL_WORKER_COUNT", 0)):
        backend = "ccl"
+    if device_type == "xpu" and not is_torch_greater_or_equal("2.8", accept_dev=True):
+        backend = "ccl"
    torch.distributed.init_process_group(backend=backend, rank=rank, world_size=world_size)
    current_device = getattr(torch, device_type)
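
In effect, on XPU the process group now uses PyTorch's built-in `xccl` backend when running PyTorch 2.8 or newer, and still falls back to the external oneCCL (`ccl`) backend on older releases. A standalone sketch of that selection logic (the helper name is illustrative, and the version check is simplified compared to `is_torch_greater_or_equal("2.8", accept_dev=True)` above):

```python
# pick_tp_backend.py -- illustrative restatement of the selection in the diff above.
from packaging import version

import torch


def pick_tp_backend(device_type: str) -> str:
    # Default distributed backends per accelerator type.
    backend_map = {"cuda": "nccl", "cpu": "gloo", "xpu": "xccl", "hpu": "hccl"}
    backend = backend_map.get(device_type)
    # PyTorch ships a built-in XCCL backend starting with 2.8; older releases
    # still need the external oneCCL bindings ("ccl").
    if device_type == "xpu" and version.parse(torch.__version__).release < (2, 8):
        backend = "ccl"
    return backend


print(pick_tp_backend("xpu"))  # "xccl" on PyTorch >= 2.8, otherwise "ccl"
```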