Validating GPU Cluster Communication with NCCL Tests
This tutorial demonstrates how to run NCCL all_reduce_perf_mpi benchmarks inside PyTorch containers launched as NVIDIA Run:ai training workloads. A single workflow covers single-node, single-rack, and multi-rack validation, so you can verify GPU-to-GPU communication on the cluster before running training, inference, or other production workloads.
Note
The example uses the nvcr.io/nvidia/pytorch:26.01-py3 container, which ships with the NCCL test binaries pre-built under /usr/local/bin/. Adjust the tag to match the CUDA and driver versions supported by your cluster.
The example is configured for GB200 systems. For other hardware, verify that the image and settings are compatible with your target GPUs.
While the walkthrough uses the NVIDIA PyTorch container, you can adapt this workflow to other container images and hardware configurations.
In this tutorial, you will learn how to:
Submit single-node, single-rack, and multi-rack training workloads through the user interface, API, or CLI
Run an all_reduce NCCL benchmark interactively from inside each workload
Verify pod placement across nodes and racks
Use additional NCCL collective tests to characterize specific communication patterns
What NCCL Tests Validate
NCCL (NVIDIA Collective Communications Library) benchmarks are the fastest health check for a GPU cluster. A short nccl-tests run validates:
Driver and library install - CUDA, NCCL, and the MPI launcher work end-to-end inside the container.
GPU visibility - All requested GPUs are passed through to the pod.
Intra-node fabric - NVLink and NVSwitch bandwidth on a single node.
Inter-node fabric - InfiniBand or RoCE bandwidth between nodes, including rack-to-rack hops.
Scheduling and topology - Pods land where the fabric expects.
Diagnosing a failing or low-bandwidth NCCL run is much cheaper than burning GPU-hours on a misconfigured stack.
Configurations Covered
The tutorial walks through three progressive scenarios that validate increasing infrastructure complexity:
# | Scenario | Nodes × GPUs per node | Fabric exercised
1 | Single node | 1 × 4 = 4 GPUs | Intra-node NVLink
2 | Single rack | 2 × 4 = 8 GPUs | Inter-node, same rack
3 | Multi-rack | 3 × 4 = 12 GPUs | Cross-rack fabric
Prerequisites
Before you start, make sure the following requirements are met:
Your administrator has:
Created a project with a sufficient GPU quota for the benchmark workloads.
You have:
A container image that bundles NCCL and the nccl-tests binaries. The example uses nvcr.io/nvidia/pytorch:26.01-py3, which ships /usr/local/bin/all_reduce_perf_mpi.
For multi-node tests, an image that supports MPI launch between pods (SSH keys or site PMIx setup).
Optional:
kubectl access for inspecting Kubernetes cluster node labels (including rack-level information, if configured).
Note
The example image is publicly pullable from NGC without authentication. If you use a container that requires authentication, your administrator must configure a Docker registry credential so the cluster can pull the image.
Step 1: Logging In
Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.
To use the API, you will need to obtain a token as shown in Creating a user access key.
Run the following --help command to list the login options, then log in according to your setup:
Step 2: Creating a User Access Key
Access keys are used for API integrations with NVIDIA Run:ai. An access key contains a client ID and a client secret. With the client credentials, you can obtain a token and use it within subsequent API calls.
In the NVIDIA Run:ai user interface:
Click the user avatar at the top right corner, then select Settings
Click +ACCESS KEY
Enter the access key's name and click CREATE
Copy the Client ID and Client secret and store them securely
Click DONE
To request an API access token, pass the client credentials to the Tokens API. The returned token authorizes subsequent NVIDIA Run:ai API calls.
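The token request can be sketched in Python as follows. This is a minimal illustration, not the official client: the /api/v1/token path, the grantType, clientId, and clientSecret field names, and the accessToken response field are assumptions that should be checked against the Tokens API reference for your NVIDIA Run:ai version. The helper names build_token_request and fetch_token are hypothetical.

```python
import json
import urllib.request


def build_token_request(base_url, client_id, client_secret):
    """Build (but do not send) a client-credentials token request.

    Assumption: the endpoint path and JSON field names below follow the
    Tokens API; verify them against your Run:ai API reference.
    """
    payload = {
        "grantType": "client_credentials",
        "clientId": client_id,
        "clientSecret": client_secret,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v1/token",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_token(request):
    """Send the request and return the token string.

    Assumption: the response body is JSON with an "accessToken" field.
    """
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["accessToken"]
```

Building the request separately from sending it makes it easy to inspect the URL and payload before any credentials leave your machine. Pass the full <COMPANY-URL>, including the scheme, as base_url.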
Step 3: Identifying Available Nodes and Racks
Before submitting any workloads, identify which nodes are available and which racks they belong to. This is required to design tests that exercise intra-rack and inter-rack communication paths.
SSH to a Kubernetes node and load the cluster module:
List nodes with kubectl to see node names and labels. Node names typically encode the rack identifier (for example, s03 and s04 indicate different racks):
List nodes and node pools through NVIDIA Run:ai:
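Once you have the node names, grouping them by rack makes it easier to pick nodes for the single-rack and multi-rack tests. The sketch below assumes a hypothetical naming scheme in which a rack token such as s03 appears as a dash-delimited field in the node name; the node names shown are invented for illustration, and the pattern must be adapted to your cluster's conventions.

```python
import re
from collections import defaultdict

# Hypothetical naming scheme: the rack token ("s03", "s04", ...) is a
# dash-delimited field of the node name. Adjust for your cluster.
RACK_PATTERN = re.compile(r"(?:^|-)(s\d+)(?:-|$)")


def group_nodes_by_rack(node_names):
    """Return a dict mapping rack identifier -> list of node names."""
    racks = defaultdict(list)
    for name in node_names:
        match = RACK_PATTERN.search(name)
        # Nodes that do not match the pattern are kept under "unknown"
        # so that nothing is silently dropped.
        rack = match.group(1) if match else "unknown"
        racks[rack].append(name)
    return dict(racks)


if __name__ == "__main__":
    # Invented node names, for illustration only.
    nodes = ["gpu-s03-n01", "gpu-s03-n02", "gpu-s04-n01"]
    for rack, members in group_nodes_by_rack(nodes).items():
        print(rack, members)
```

If your cluster exposes rack information through node labels rather than node names, group on the label value instead.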
Step 4: Running NCCL Tests on a Single Node
This step runs an all_reduce benchmark across four GPUs within a single node, validating intra-node NVLink and PCIe bandwidth.
Submitting the Workload
Submit a training workload that requests four GPUs on one node and idles the container so you can run benchmarks interactively.
To create the training workload, go to Workload manager → Workloads.
Click +NEW WORKLOAD and select Training from the dropdown menu.
Within the new training form, select the cluster and the nccl-benchmarking project.
Set the training workload architecture to Standard. A standard workload runs a single pod.
Select Start from scratch.
Enter nccl-single-node as the name for the workload, then click CONTINUE.
Under Environment, set the Image URL to nvcr.io/nvidia/pytorch:26.01-py3.
Under Runtime settings, click +COMMAND & ARGUMENTS, then set the Command to bash and the Arguments to -c 'sleep 1d'. The container idles for one day so you can launch the benchmark interactively.
Under Compute resource, set GPU devices per pod to 4.
Click CREATE TRAINING.
Copy the following example request and update the parameters as needed. For more details, see the Trainings API:
<COMPANY-URL> - The link to the NVIDIA Run:ai user interface.
<TOKEN> - The API access token obtained in Step 2.
<PROJECT-ID> - The ID of the nccl-benchmarking project. Retrieve it via the Get projects API.
<CLUSTER-UUID> - The unique identifier of the cluster. Retrieve it via the Get clusters API.
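As a rough illustration of how the placeholders above fit together, the following Python sketch assembles a request body that mirrors the settings chosen in the user interface (image, command, arguments, and four GPUs). The JSON field names (projectId, clusterId, spec, compute, gpuDevicesRequest) are assumptions about the Trainings API schema and must be verified against the API reference; build_training_payload is a hypothetical helper.

```python
import json


def build_training_payload(project_id, cluster_id):
    """Assemble a request body for the single-node idle workload.

    Assumption: field names follow the Trainings API schema; check the
    API reference for your Run:ai version before using them.
    """
    return {
        "name": "nccl-single-node",
        "projectId": project_id,
        "clusterId": cluster_id,
        "spec": {
            "image": "nvcr.io/nvidia/pytorch:26.01-py3",
            # Idle the container for one day so the benchmark can be
            # launched interactively.
            "command": "bash",
            "args": "-c 'sleep 1d'",
            "compute": {"gpuDevicesRequest": 4},
        },
    }


if __name__ == "__main__":
    payload = build_training_payload("<PROJECT-ID>", "<CLUSTER-UUID>")
    print(json.dumps(payload, indent=2))
```

Printing the payload before sending it lets you compare it field by field with the example request in the API reference.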
Copy the following command to your terminal. For more details, see the CLI reference:
Running the Benchmark
The CLI commands below pass -p nccl-benchmarking explicitly. Alternatively, set the project as the CLI default once with runai project set nccl-benchmarking and omit the flag from subsequent commands.
Confirm the workload is in Running state and review GPU allocation. The exec step in the next instruction fails if the pod is not yet running:
Once the workload is Running, open an interactive shell into the workload and check GPU visibility:
Run the four-GPU all_reduce benchmark with debug logging enabled. The parameters sweep message sizes from 8 bytes to 1 GB doubling at each step (-b 8 -e 1G -f 2), assign one GPU per rank (-g 1), perform two warmup iterations (-w 2), measure ten iterations (--iters 10), and run ten validation checks (-c 10):
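The flags described above can be assembled and sanity-checked with a short Python sketch. It builds the argument list from the parameters named in this step and enumerates the message sizes the sweep visits. The mpirun -np launcher flag and the NCCL_DEBUG=INFO setting are assumptions about how the benchmark is started with debug logging, and the sketch assumes that nccl-tests reads 1G as 2**30 bytes.

```python
import shlex


def sweep_sizes(min_bytes, max_bytes, factor):
    """List the message sizes visited for -b/-e/-f style sweep settings."""
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= factor
    return sizes


def build_single_node_command(num_gpus=4):
    """Return the benchmark invocation as an argument list.

    Assumption: one MPI rank per GPU, launched with mpirun -np.
    """
    return [
        "mpirun", "-np", str(num_gpus),
        "/usr/local/bin/all_reduce_perf_mpi",
        "-b", "8", "-e", "1G", "-f", "2",  # 8 bytes to 1G, doubling
        "-g", "1",                          # one GPU per rank
        "-w", "2",                          # two warmup iterations
        "--iters", "10",                    # ten measured iterations
        "-c", "10",                         # ten validation checks
    ]


if __name__ == "__main__":
    sizes = sweep_sizes(8, 2**30, 2)
    print(f"{len(sizes)} sizes from {sizes[0]} to {sizes[-1]} bytes")
    # Assumption: NCCL_DEBUG=INFO is how debug logging is enabled.
    print("NCCL_DEBUG=INFO " + shlex.join(build_single_node_command()))
```

Under these assumptions the sweep covers 28 message sizes, which gives a sense of how many result rows to expect in the benchmark output.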
Step 5: Running NCCL Tests Across Multiple Nodes in the Same Rack
This step runs an eight-GPU all_reduce benchmark spanning two nodes in the same rack, validating intra-rack inter-node fabric performance.
Submitting the Workload
Submit a multi-worker training workload. Two workers each request four GPUs, for a total of eight GPUs across two nodes. Both the master and the workers idle so benchmarks can be launched interactively.
To create the training workload, go to Workload manager → Workloads.
Click +NEW WORKLOAD and select Training from the dropdown menu.
Within the new training form, select the cluster and the nccl-benchmarking project.
Set the training workload architecture to Distributed. A distributed workload runs multiple processes that can span across different nodes.
Set the framework for the distributed workload to MPI. If MPI isn't available, see Distributed training prerequisites for details on enabling.
Set the distributed workload configuration to Workers & master. The master pod coordinates mpirun across the workers.
Select Start from scratch.
Enter nccl-multi-node as the name for the workload, then click CONTINUE.
Under Environment, set the Image URL to nvcr.io/nvidia/pytorch:26.01-py3.
Under Runtime settings, click +COMMAND & ARGUMENTS, then set the Command to bash and the Arguments to -c 'sleep 1d'.
Under Compute resource, set GPU devices per pod to 4.
Set the number of workers to 2. Combined with the master pod, this produces three pods; only the workers run on GPU nodes.
Click CONTINUE.
Ensure the Allow different setup for the master toggle is disabled so the master uses the same image and command as the workers.
Click CREATE TRAINING.
Copy the following example request and update the parameters as needed. For more details, see the Distributed API:
<COMPANY-URL> - The link to the NVIDIA Run:ai user interface.
<TOKEN> - The API access token obtained in Step 2.
<PROJECT-ID> - The ID of the nccl-benchmarking project. Retrieve it via the Get projects API.
<CLUSTER-UUID> - The unique identifier of the cluster. Retrieve it via the Get clusters API.
Copy the following command to your terminal. For more details, see the CLI reference:
Running the Benchmark
The CLI commands below pass -p nccl-benchmarking explicitly. Alternatively, set the project as the CLI default once with runai project set nccl-benchmarking and omit the flag from subsequent commands.
Confirm the workload is in Running state and review GPU allocation. The exec step in the next instruction fails if the pod is not yet running:
Once the workload is Running, open an interactive shell into the workload and check GPU visibility. nvidia-smi only reports the 4 GPUs on the pod's own node. To verify all 8 GPUs across the workload, repeat the exec for each pod (use --pod with the pod names from the workload) and rerun nvidia-smi in each:
Run the eight-GPU all_reduce benchmark, distributing four ranks per node using the MPI hostfile bundled in the PyTorch container at /tmp/nccl-hosts:
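To make the "four ranks per node" layout concrete, the sketch below renders a hostfile in the Open MPI "host slots=N" format and derives the matching rank count. The contents of /tmp/nccl-hosts are not shown in this tutorial, so the format, the worker host names, and the mpirun --hostfile and -np flags are assumptions for illustration; substitute the pod names from your own workload.

```python
def render_hostfile(hosts, slots_per_host=4):
    """Render an Open MPI style hostfile: one 'host slots=N' line per host."""
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


def build_multi_node_command(hosts, slots_per_host=4,
                             hostfile="/tmp/nccl-hosts"):
    """Return the launcher arguments for one rank per GPU on every host."""
    total_ranks = len(hosts) * slots_per_host
    return [
        "mpirun", "--hostfile", hostfile, "-np", str(total_ranks),
        "/usr/local/bin/all_reduce_perf_mpi",
        "-b", "8", "-e", "1G", "-f", "2", "-g", "1",
    ]


if __name__ == "__main__":
    # Hypothetical worker host names, for illustration only.
    workers = ["nccl-multi-node-worker-0", "nccl-multi-node-worker-1"]
    print(render_hostfile(workers), end="")
    print(" ".join(build_multi_node_command(workers)))
```

With two workers and four slots each, the rank count comes to eight, matching the eight GPUs requested by the workload.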
Step 6: Running NCCL Tests Across Multiple Racks
This step runs a twelve-GPU all_reduce benchmark spanning three nodes placed in different racks, validating inter-rack fabric performance.
Submitting the Workload
Submit a workload that scales the multi-worker pattern to three workers across three nodes.
To create the training workload, go to Workload manager → Workloads.
Click +NEW WORKLOAD and select Training from the dropdown menu.
Within the new training form, select the cluster and the nccl-benchmarking project.
Set the training workload architecture to Distributed and the framework to MPI.
Set the distributed workload configuration to Workers & master.
Select Start from scratch.
Enter nccl-multi-rack as the name for the workload, then click CONTINUE.
Under Environment, set the Image URL to nvcr.io/nvidia/pytorch:26.01-py3.
Under Runtime settings, click +COMMAND & ARGUMENTS, then set the Command to bash and the Arguments to -c 'sleep 1d'.
Under Compute resource, set GPU devices per pod to 4.
Set the number of workers to 3. To force the workers onto nodes in different racks, set node affinities or labels through your cluster administrator.
Click CONTINUE.
Ensure the Allow different setup for the master toggle is disabled.
Click CREATE TRAINING.
Copy the following example request and update the parameters as needed. For more details, see the Distributed API:
<COMPANY-URL> - The link to the NVIDIA Run:ai user interface.
<TOKEN> - The API access token obtained in Step 2.
<PROJECT-ID> - The ID of the nccl-benchmarking project. Retrieve it via the Get projects API.
<CLUSTER-UUID> - The unique identifier of the cluster. Retrieve it via the Get clusters API.
Copy the following command to your terminal. For more details, see the CLI reference:
Running the Benchmark
The CLI commands below pass -p nccl-benchmarking explicitly. Alternatively, set the project as the CLI default once with runai project set nccl-benchmarking and omit the flag from subsequent commands.
Confirm the workload is in Running state and review GPU allocation. The exec step in the next instruction fails if the pod is not yet running:
Once the workload is Running, open an interactive shell into the workload and check GPU visibility. nvidia-smi only reports the 4 GPUs on the pod's own node. To verify all 12 GPUs across the workload, repeat the exec for each pod (use --pod with the pod names from the workload) and rerun nvidia-smi in each:
Run the twelve-GPU all_reduce benchmark, distributing four ranks per node using the MPI hostfile bundled in the PyTorch container at /tmp/nccl-hosts:
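When comparing results from the 4-, 8-, and 12-GPU runs, it helps to know how the two bandwidth columns in the nccl-tests output relate. As described in the nccl-tests performance notes, the bus bandwidth for all_reduce is the algorithm bandwidth scaled by 2(n-1)/n, where n is the number of ranks. The sketch below computes that factor for the three configurations in this tutorial; the function names are hypothetical and no measured values are implied.

```python
from fractions import Fraction


def allreduce_busbw_factor(num_ranks):
    """Return the all_reduce scaling factor 2(n-1)/n as an exact fraction."""
    if num_ranks < 2:
        raise ValueError("all_reduce needs at least two ranks")
    return Fraction(2 * (num_ranks - 1), num_ranks)


def bus_bandwidth(algorithm_bandwidth, num_ranks):
    """Convert an algorithm bandwidth reading to bus bandwidth (same units)."""
    return algorithm_bandwidth * float(allreduce_busbw_factor(num_ranks))


if __name__ == "__main__":
    for ranks in (4, 8, 12):
        factor = allreduce_busbw_factor(ranks)
        print(f"{ranks} ranks: busbw = algbw x {factor} ({float(factor):.3f})")
```

Because the factor depends on the rank count, bus bandwidth is usually the better column for comparing runs of different sizes against the expected fabric speed.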
Step 7: Running Other NCCL Collective Tests
The previous steps use all_reduce_perf_mpi, but the NCCL test suite includes additional binaries that exercise different collective operations. Substitute the binary in the mpirun command above to characterize the collective most relevant to your workload:
Binary | Typical use
all_reduce_perf_mpi | Default; bandwidth and latency for gradient synchronization
all_gather_perf_mpi | Tensor-parallel forward path or activations gather
reduce_scatter_perf_mpi | ZeRO or sharded-optimizer step
broadcast_perf_mpi | Weight broadcast from rank 0
reduce_perf_mpi | All-to-one reduction onto a root rank
alltoall_perf_mpi | Mixture-of-Experts dispatch or sequence parallelism
sendrecv_perf_mpi | Pipeline-parallel hop
gather_perf_mpi, scatter_perf_mpi | Asymmetric collectives
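Swapping the binary can be scripted so that the same launcher arguments are reused for every collective. The sketch below is one way to do it, assuming the binaries live under /usr/local/bin as in the example image; the short collective keys and the swap_collective helper are hypothetical names introduced for illustration.

```python
NCCL_TEST_DIR = "/usr/local/bin"  # location in the example image

# Hypothetical short keys mapped to the binaries listed above.
COLLECTIVE_BINARIES = {
    "all_reduce": "all_reduce_perf_mpi",
    "all_gather": "all_gather_perf_mpi",
    "reduce_scatter": "reduce_scatter_perf_mpi",
    "broadcast": "broadcast_perf_mpi",
    "reduce": "reduce_perf_mpi",
    "alltoall": "alltoall_perf_mpi",
    "sendrecv": "sendrecv_perf_mpi",
    "gather": "gather_perf_mpi",
    "scatter": "scatter_perf_mpi",
}


def swap_collective(command, collective):
    """Return a copy of command with the nccl-tests binary replaced."""
    binary = f"{NCCL_TEST_DIR}/{COLLECTIVE_BINARIES[collective]}"
    swapped = [binary if arg.endswith("_perf_mpi") else arg
               for arg in command]
    # Fail loudly if the command never referenced a test binary.
    if swapped == command and binary not in command:
        raise ValueError("command does not reference an nccl-tests binary")
    return swapped


if __name__ == "__main__":
    base = ["mpirun", "-np", "4", "/usr/local/bin/all_reduce_perf_mpi",
            "-b", "8", "-e", "1G", "-f", "2", "-g", "1"]
    for name in ("all_gather", "reduce_scatter", "alltoall"):
        print(" ".join(swap_collective(base, name)))
```

Keeping the sweep flags identical across collectives makes the resulting bandwidth tables directly comparable.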
Step 8: Cleaning Up the Environment
The benchmark containers idle for one day and are reclaimed automatically when the sleep expires. To release GPUs sooner, delete the workloads manually.