Validating GPU Cluster Communication with NCCL Tests

This tutorial demonstrates how to run NCCL all_reduce_perf_mpi benchmarks inside PyTorch containers launched as NVIDIA Run:ai training workloads. A single workflow covers single-node, single-rack, and multi-rack validation, so you can verify GPU-to-GPU communication on the cluster before running training, inference, or other production workloads.

Note

  • The example uses the nvcr.io/nvidia/pytorch:26.01-py3 container, which ships with the NCCL test binaries pre-built under /usr/local/bin/. Adjust the tag to match the CUDA and driver versions supported by your cluster.

  • The example is configured for GB200 systems. For other hardware, verify that the image and workload settings are compatible with your target GPUs.

  • While the walkthrough uses the NVIDIA PyTorch container, you can adapt this workflow to other container images and hardware configurations.

In this tutorial, you will learn how to:

  • Submit single-node, single-rack, and multi-rack training workloads through the user interface, API, or CLI

  • Run an all_reduce NCCL benchmark interactively from inside each workload

  • Verify pod placement across nodes and racks

  • Use additional NCCL collective tests to characterize specific communication patterns

What NCCL Tests Validate

NCCL (NVIDIA Collective Communications Library) benchmarks are a fast health check for a GPU cluster. A short nccl-tests run validates:

  • Driver and library install - CUDA, NCCL, and the MPI launcher work end-to-end inside the container.

  • GPU visibility - All requested GPUs are passed through to the pod.

  • Intra-node fabric - NVLink and NVSwitch bandwidth on a single node.

  • Inter-node fabric - InfiniBand or RoCE bandwidth between nodes, including rack-to-rack hops.

  • Scheduling and topology - Pods are placed on the nodes and racks that the fabric topology expects.

Diagnosing a failing or low-bandwidth NCCL run is far cheaper than burning GPU-hours on a misconfigured stack.

Configurations Covered

The tutorial walks through three scenarios of increasing infrastructure complexity:

| # | Test | Workers × GPUs per worker | Validates |
| --- | --- | --- | --- |
| 1 | Single node | 1 × 4 = 4 GPUs | Intra-node NVLink |
| 2 | Single rack | 2 × 4 = 8 GPUs | Inter-node, same rack |
| 3 | Multi-rack | 3 × 4 = 12 GPUs | Cross-rack fabric |

Prerequisites

Before you start, make sure the following requirements are met:

  • Your administrator has:

    • Created a project with a sufficient GPU quota for the benchmark workloads.

  • You have:

    • A container image that bundles NCCL and the nccl-tests binaries. The example uses nvcr.io/nvidia/pytorch:26.01-py3, which ships /usr/local/bin/all_reduce_perf_mpi.

    • For multi-node tests, an image that supports MPI launch between pods (for example, through SSH keys or a site-specific PMIx setup).

    • Optional: kubectl access for inspecting Kubernetes cluster node labels (including rack-level information, if configured).
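One way to confirm that an image meets the first requirement is to look for the MPI-enabled test binaries from a shell inside the container. The sketch below assumes the binaries follow the *_perf_mpi naming used in this tutorial; the function name find_nccl_tests is illustrative and not part of any NVIDIA tool.

```shell
# List NCCL test binaries in a directory (default: /usr/local/bin).
# Returns non-zero if no executable *_perf_mpi file is found.
find_nccl_tests() {
  dir="${1:-/usr/local/bin}"
  found=1
  for bin in "$dir"/*_perf_mpi; do
    if [ -x "$bin" ]; then
      basename "$bin"
      found=0
    fi
  done
  return "$found"
}
```

Calling find_nccl_tests with no argument inside the container should print all_reduce_perf_mpi among the results if the image matches the one described here.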

Note

The example image is publicly pullable from NGC without authentication. If you use a container that requires authentication, your administrator must configure a Docker registry credential so the cluster can pull the image.

Step 1: Logging In

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Step 2: Creating a User Access Key

Access keys are used for API integrations with NVIDIA Run:ai. An access key contains a client ID and a client secret. With the client credentials, you can obtain a token and use it within subsequent API calls.

In the NVIDIA Run:ai user interface:

  1. Click the user avatar at the top right corner, then select Settings

  2. Click +ACCESS KEY

  3. Enter the access key's name and click CREATE

  4. Copy the Client ID and Client secret and store them securely

  5. Click DONE

To obtain an API access token, send the client ID and client secret to the Tokens API.
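A token request can be sketched with curl as follows. The endpoint path /api/v1/token, the client_credentials grant type, and the JSON field names are assumptions based on the public Run:ai API reference, and the URL is a placeholder; check them against the API documentation for your version.

```shell
# Assumption: RUNAI_URL is your control-plane URL (placeholder default below).
RUNAI_URL="${RUNAI_URL:-https://runai.example.com}"

# Build the JSON request body from a client ID and a client secret.
token_payload() {
  printf '{"grantType":"client_credentials","clientId":"%s","clientSecret":"%s"}' "$1" "$2"
}

# POST the credentials to the (assumed) Tokens API endpoint and print the response.
request_token() {
  curl -sS -X POST "$RUNAI_URL/api/v1/token" \
    -H 'Content-Type: application/json' \
    -d "$(token_payload "$1" "$2")"
}

# Usage: request_token "$CLIENT_ID" "$CLIENT_SECRET"
```

The response is JSON; pass the token it contains as a Bearer token in the Authorization header of later API calls.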

Step 3: Identifying Available Nodes and Racks

Before submitting any workloads, identify which nodes are available and which racks they belong to. This is required to design tests that exercise intra-rack and inter-rack communication paths.

  1. SSH to a Kubernetes node and load the cluster module:

  2. List nodes with kubectl to see node names and labels. Node names typically encode the rack identifier (for example, s03 and s04 indicate different racks):

  3. List nodes and node pools through NVIDIA Run:ai:
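The rack lookup in this step can be sketched as a small helper that groups node names by the rack token they contain. The token pattern (an "s" followed by digits, as in s03 and s04) is taken from the example above and is an assumption about your naming scheme; the function names are illustrative.

```shell
# Prefix each node name read from stdin with the rack token found in it.
# Assumption: rack tokens look like "s03" or "s04"; adjust the pattern for your site.
group_by_rack() {
  while read -r node; do
    rack=$(printf '%s\n' "$node" | grep -oE 's[0-9]+' | head -n 1)
    printf '%s %s\n' "${rack:-unknown}" "$node"
  done | sort
}

# Requires kubectl access to the cluster.
list_nodes_by_rack() {
  kubectl get nodes -o name | sed 's|^node/||' | group_by_rack
}

# Usage: list_nodes_by_rack
# To inspect labels instead of names: kubectl get nodes --show-labels
```

If your CLI version provides them, runai node list and runai nodepool list give the corresponding view from the NVIDIA Run:ai side; treat those subcommand names as assumptions and confirm them with runai --help.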

Step 4: Running NCCL Tests on a Single Node

This step runs an all_reduce benchmark across four GPUs within a single node, validating intra-node NVLink bandwidth.

Submitting the Workload

Submit a training workload that requests four GPUs on one node and idles the container so you can run benchmarks interactively.

  1. To create the training workload, go to Workload manager → Workloads.

  2. Click +NEW WORKLOAD and select Training from the dropdown menu.

  3. Within the new training form, select the cluster and the nccl-benchmarking project.

  4. Set the training workload architecture to Standard. A standard workload runs a single pod.

  5. Select Start from scratch.

  6. Enter nccl-single-node as the name for the workload, then click CONTINUE.

  7. Under Environment, set the Image URL to nvcr.io/nvidia/pytorch:26.01-py3.

  8. Under Runtime settings, click +COMMAND & ARGUMENTS, then set the Command to bash and the Arguments to -c 'sleep 1d'. The container idles for one day so you can launch the benchmark interactively.

  9. Under Compute resource, set GPU devices per pod to 4.

  10. Click CREATE TRAINING.
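The same workload can be sketched as a CLI submission. The subcommand and flag names below (-i for the image, -g for GPU devices per pod, --command to override the entrypoint) are assumptions based on the Run:ai CLI v2 reference; verify them with runai training standard submit --help before use.

```shell
# CLI sketch of the form above. Flag names are assumed, not confirmed by this
# tutorial; check: runai training standard submit --help
submit_single_node() {
  runai training standard submit nccl-single-node \
    -p nccl-benchmarking \
    -i nvcr.io/nvidia/pytorch:26.01-py3 \
    -g 4 \
    --command -- bash -c 'sleep 1d'
}

# Usage: submit_single_node
```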

Running the Benchmark

The CLI commands below pass -p nccl-benchmarking explicitly. Alternatively, set the project as the CLI default once with runai project set nccl-benchmarking and omit the flag from subsequent commands.

  1. Confirm the workload is in Running state and review GPU allocation. The exec step in the next instruction fails if the pod is not yet running:

  2. Once the workload is Running, open an interactive shell into the workload and check GPU visibility:

  3. Run the four-GPU all_reduce benchmark with debug logging enabled. The parameters sweep message sizes from 8 bytes to 1 GB doubling at each step (-b 8 -e 1G -f 2), assign one GPU per rank (-g 1), perform two warmup iterations (-w 2), measure ten iterations (--iters 10), and run ten validation checks (-c 10):
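The three instructions above can be sketched as shell helpers. The runai subcommands (describe, exec with --tty --stdin) and the Open MPI options (--allow-run-as-root, -x) are assumptions about your CLI version and MPI build; the benchmark arguments are the ones described in the last instruction.

```shell
# Assumed Run:ai CLI v2 subcommands; verify with: runai training standard --help
show_workload() { runai training standard describe "$1" -p nccl-benchmarking; }
open_shell()    { runai training standard exec "$1" -p nccl-benchmarking --tty --stdin -- bash; }

# Print the mpirun command for a given number of ranks (one GPU per rank).
# Assumption: Open MPI's mpirun is on PATH inside the container.
allreduce_cmd() {
  np="$1"
  printf 'mpirun --allow-run-as-root -np %s -x NCCL_DEBUG=INFO ' "$np"
  printf '/usr/local/bin/all_reduce_perf_mpi -b 8 -e 1G -f 2 -g 1 -w 2 --iters 10 -c 10\n'
}

# Count the message sizes in the sweep: start at 8 bytes, double until 1 GiB.
sweep_steps() {
  size=8
  steps=0
  while [ "$size" -le 1073741824 ]; do
    steps=$((steps + 1))
    size=$((size * 2))
  done
  printf '%s\n' "$steps"
}
```

From the workstation, show_workload nccl-single-node followed by open_shell nccl-single-node reaches the pod; inside the pod, nvidia-smi should list four GPUs and eval "$(allreduce_cmd 4)" launches the benchmark. sweep_steps prints 28, the number of message sizes the run reports.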

Step 5: Running NCCL Tests Across Multiple Nodes in the Same Rack

This step runs an eight-GPU all_reduce benchmark spanning two nodes in the same rack, validating intra-rack inter-node fabric performance.

Submitting the Workload

Submit a multi-worker training workload. Two workers each request four GPUs, for a total of eight GPUs across two nodes. The master and both workers idle so you can launch benchmarks interactively.

  1. To create the training workload, go to Workload manager → Workloads.

  2. Click +NEW WORKLOAD and select Training from the dropdown menu.

  3. Within the new training form, select the cluster and the nccl-benchmarking project.

  4. Set the training workload architecture to Distributed. A distributed workload runs multiple processes that can span multiple nodes.

  5. Set the framework for the distributed workload to MPI. If MPI isn't available, see Distributed training prerequisites for details on enabling it.

  6. Set the distributed workload configuration to Workers & master. The master pod coordinates mpirun across the workers.

  7. Select Start from scratch.

  8. Enter nccl-multi-node as the name for the workload, then click CONTINUE.

  9. Under Environment, set the Image URL to nvcr.io/nvidia/pytorch:26.01-py3.

  10. Under Runtime settings, click +COMMAND & ARGUMENTS, then set the Command to bash and the Arguments to -c 'sleep 1d'.

  11. Under Compute resource, set GPU devices per pod to 4.

  12. Set the number of workers to 2. Combined with the master pod, this produces three pods; only the workers run on GPU nodes.

  13. Click CONTINUE.

  14. Ensure the Allow different setup for the master toggle is disabled so the master uses the same image and command as the workers.

  15. Click CREATE TRAINING.
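A CLI sketch of the same distributed submission follows. The mpi subcommand and the --workers flag are assumptions based on the Run:ai CLI v2 reference, and the helper name submit_mpi is illustrative; confirm the flags with runai training mpi submit --help.

```shell
# Submit an idle MPI workload with the given name and worker count.
# Flag names are assumed, not confirmed by this tutorial.
submit_mpi() {
  name="$1"
  workers="$2"
  runai training mpi submit "$name" \
    -p nccl-benchmarking \
    -i nvcr.io/nvidia/pytorch:26.01-py3 \
    -g 4 \
    --workers "$workers" \
    --command -- bash -c 'sleep 1d'
}

# Usage: submit_mpi nccl-multi-node 2
```

Keeping the name and worker count as parameters lets the same helper cover larger runs by changing only those two values.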

Running the Benchmark

The CLI commands below pass -p nccl-benchmarking explicitly. Alternatively, set the project as the CLI default once with runai project set nccl-benchmarking and omit the flag from subsequent commands.

  1. Confirm the workload is in Running state and review GPU allocation. The exec step in the next instruction fails if the pod is not yet running:

  2. Once the workload is Running, open an interactive shell into the workload and check GPU visibility. nvidia-smi only reports the 4 GPUs on the pod's own node. To verify all 8 GPUs across the workload, repeat the exec for each pod (use --pod with the pod names from the workload) and rerun nvidia-smi in each:

  3. Run the eight-GPU all_reduce benchmark, distributing four ranks per node using the MPI hostfile bundled in the PyTorch container at /tmp/nccl-hosts:
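To make the rank layout concrete, the sketch below prints an Open MPI hostfile with four slots per worker and the matching mpirun command. The worker host names are hypothetical, the helper names are illustrative, and the mpirun options assume Open MPI; if the hostfile at /tmp/nccl-hosts already exists in your pod, inspect it with cat rather than regenerating it.

```shell
# Print an Open MPI hostfile with four slots (one per GPU) for each worker host.
make_hostfile() {
  for host in "$@"; do
    printf '%s slots=4\n' "$host"
  done
}

# Print the mpirun command for a hostfile and a host count; ranks = 4 per host.
multi_node_cmd() {
  hostfile="$1"
  hosts="$2"
  printf 'mpirun --allow-run-as-root --hostfile %s -np %s ' "$hostfile" "$((hosts * 4))"
  printf '/usr/local/bin/all_reduce_perf_mpi -b 8 -e 1G -f 2 -g 1 -w 2 --iters 10 -c 10\n'
}

# Usage (hypothetical host names):
#   make_hostfile worker-0 worker-1
#   multi_node_cmd /tmp/nccl-hosts 2
```

With two hosts the command requests eight ranks, which matches the four-ranks-per-node layout described above.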

Step 6: Running NCCL Tests Across Multiple Racks

This step runs a twelve-GPU all_reduce benchmark spanning three nodes placed in different racks, validating inter-rack fabric performance.

Submitting the Workload

Submit a workload that scales the multi-worker pattern to three workers across three nodes.

  1. To create the training workload, go to Workload manager → Workloads.

  2. Click +NEW WORKLOAD and select Training from the dropdown menu.

  3. Within the new training form, select the cluster and the nccl-benchmarking project.

  4. Set the training workload architecture to Distributed and the framework to MPI.

  5. Set the distributed workload configuration to Workers & master.

  6. Select Start from scratch.

  7. Enter nccl-multi-rack as the name for the workload, then click CONTINUE.

  8. Under Environment, set the Image URL to nvcr.io/nvidia/pytorch:26.01-py3.

  9. Under Runtime settings, click +COMMAND & ARGUMENTS, then set the Command to bash and the Arguments to -c 'sleep 1d'.

  10. Under Compute resource, set GPU devices per pod to 4.

  11. Set the number of workers to 3. To force the workers onto nodes in different racks, set node affinities or labels through your cluster administrator.

  12. Click CONTINUE.

  13. Ensure the Allow different setup for the master toggle is disabled.

  14. Click CREATE TRAINING.

Running the Benchmark

The CLI commands below pass -p nccl-benchmarking explicitly. Alternatively, set the project as the CLI default once with runai project set nccl-benchmarking and omit the flag from subsequent commands.

  1. Confirm the workload is in Running state and review GPU allocation. The exec step in the next instruction fails if the pod is not yet running:

  2. Once the workload is Running, open an interactive shell into the workload and check GPU visibility. nvidia-smi only reports the 4 GPUs on the pod's own node. To verify all 12 GPUs across the workload, repeat the exec for each pod (use --pod with the pod names from the workload) and rerun nvidia-smi in each:

  3. Run the twelve-GPU all_reduce benchmark, distributing four ranks per node using the MPI hostfile bundled in the PyTorch container at /tmp/nccl-hosts:
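Before trusting a multi-rack result, it helps to confirm that the pods actually landed in different racks. The sketch below counts the distinct rack tokens among the nodes hosting the pods; it assumes the rack is encoded in the node name as an "s" followed by digits, and that you know the Kubernetes namespace backing the project (passed as an argument). Both helper names are illustrative.

```shell
# Read "pod node" pairs from stdin and print how many distinct racks they span.
# Assumption: the rack token is embedded in the node name as "s<digits>".
count_racks() {
  awk '{ print $2 }' | grep -oE 's[0-9]+' | sort -u | wc -l | tr -d ' '
}

# Requires kubectl access; $1 is the namespace that backs the project.
pods_rack_count() {
  kubectl get pods -n "$1" --no-headers \
    -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName | count_racks
}

# Usage: pods_rack_count <namespace>
```

A result lower than the number of workers means at least two workers share a rack, so the run does not exercise every cross-rack path.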

Step 7: Running Other NCCL Collective Tests

The previous steps use all_reduce_perf_mpi, but the NCCL test suite includes additional binaries that exercise different collective operations. Substitute the binary in the mpirun command above to characterize the collective most relevant to your workload:

| Binary | Typical workload pattern |
| --- | --- |
| all_reduce_perf_mpi | Default; bandwidth and latency for gradient synchronization |
| all_gather_perf_mpi | Tensor-parallel forward path or activations gather |
| reduce_scatter_perf_mpi | ZeRO or sharded-optimizer step |
| broadcast_perf_mpi | Weight broadcast from rank 0 |
| reduce_perf_mpi | All-to-one reduction onto a root rank |
| alltoall_perf_mpi | Mixture-of-Experts dispatch or sequence-parallel |
| sendrecv_perf_mpi | Pipeline-parallel hop |
| gather_perf_mpi, scatter_perf_mpi | Asymmetric collectives |
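Swapping the binary can be scripted. The sketch below prints one mpirun command per collective with a reduced argument set; it assumes each binary is present under /usr/local/bin in your image and that Open MPI's mpirun is used, and the helper name is illustrative.

```shell
# Print one mpirun command per collective binary for the given rank count.
# Assumption: each *_perf_mpi binary exists under /usr/local/bin in the image.
collective_cmds() {
  np="$1"
  for coll in all_gather reduce_scatter broadcast reduce alltoall sendrecv; do
    printf 'mpirun --allow-run-as-root -np %s /usr/local/bin/%s_perf_mpi -b 8 -e 1G -f 2 -g 1\n' "$np" "$coll"
  done
}

# Usage: collective_cmds 4
```

For multi-node runs, add the hostfile option and raise the rank count to match the total number of GPUs.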

Step 8: Cleaning Up the Environment

The benchmark containers idle for one day and are reclaimed automatically when the sleep expires. To release GPUs sooner, delete the workloads manually.
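Manual deletion can be sketched with the CLI as follows. The delete subcommands are assumptions based on the Run:ai CLI v2 reference, and the workload names are the ones used in this tutorial; adjust both if your setup differs.

```shell
# Delete the three benchmark workloads created in this tutorial.
# Subcommand names are assumed; verify with: runai training --help
cleanup_workloads() {
  runai training standard delete nccl-single-node -p nccl-benchmarking
  runai training mpi delete nccl-multi-node -p nccl-benchmarking
  runai training mpi delete nccl-multi-rack -p nccl-benchmarking
}

# Usage: cleanup_workloads
```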
