# Validating GPU Cluster Communication with NCCL Tests

This tutorial demonstrates how to run NCCL `all_reduce_perf_mpi` benchmarks inside PyTorch containers launched as NVIDIA Run:ai training workloads. A single workflow covers single-node, single-rack, and multi-rack validation, so you can verify GPU-to-GPU communication on the cluster before running training, inference, or other production workloads.

{% hint style="info" %}
**Note**

* The example uses the `nvcr.io/nvidia/pytorch:26.01-py3` container, which ships with the NCCL test binaries pre-built under `/usr/local/bin/`. Adjust the tag to match the [CUDA and driver versions](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html) supported by your cluster.
* The example is configured for GB200 systems. For other hardware, verify compatibility with your target [GPU hardware](https://developer.nvidia.com/cuda/gpus).
* While the walkthrough uses the NVIDIA PyTorch container, you can adapt this workflow to other container images and hardware configurations.
  {% endhint %}

In this tutorial, you will learn how to:

* Submit single-node, single-rack, and multi-rack training workloads through the user interface, API, or CLI
* Run an `all_reduce` NCCL benchmark interactively from inside each workload
* Verify pod placement across nodes and racks
* Use additional NCCL collective tests to characterize specific communication patterns

## What NCCL Tests Validate

NCCL (NVIDIA Collective Communications Library) benchmarks are a fast, end-to-end health check for a GPU cluster. A short [`nccl-tests`](https://github.com/NVIDIA/nccl-tests) run validates:

* **Driver and library install** - CUDA, NCCL, and the MPI launcher work end-to-end inside the container.
* **GPU visibility** - All requested GPUs are passed through to the pod.
* **Intra-node fabric** - NVLink and NVSwitch bandwidth on a single node.
* **Inter-node fabric** - InfiniBand or RoCE bandwidth between nodes, including rack-to-rack hops.
* **Scheduling and topology** - Pods land where the fabric expects.

Diagnosing a failing or low-bandwidth NCCL run is far cheaper than burning GPU-hours on a misconfigured stack.

## Configurations Covered

The tutorial walks through three progressive scenarios that validate increasing infrastructure complexity:

| # | Test        | Workers × GPUs per worker | Validates             |
| - | ----------- | ------------------------- | --------------------- |
| 1 | Single node | 1 × 4 = 4 GPUs            | Intra-node NVLink     |
| 2 | Single rack | 2 × 4 = 8 GPUs            | Inter-node, same rack |
| 3 | Multi-rack  | 3 × 4 = 12 GPUs           | Cross-rack fabric     |

## Prerequisites

Before you start, make sure the following requirements are met:

* Your administrator has:
  * Created a [project](/self-hosted/2.24/platform-management/aiinitiatives/organization/projects.md) with a sufficient GPU quota for the benchmark workloads.
* You have:
  * A container image that bundles NCCL and the [`nccl-tests`](https://github.com/NVIDIA/nccl-tests) binaries. The example uses `nvcr.io/nvidia/pytorch:26.01-py3`, which ships `/usr/local/bin/all_reduce_perf_mpi`.
  * For multi-node tests, an image that supports MPI launch between pods (SSH keys or site PMIx setup).
  * Optional: `kubectl` access for inspecting Kubernetes cluster node labels (including rack-level information, if configured).

{% hint style="info" %}
**Note**

The example image is publicly pullable from NGC without authentication. If you use a container that requires authentication, your administrator must configure a Docker registry [credential](/self-hosted/2.24/workloads-in-nvidia-run-ai/assets/credentials.md#docker-registry) so the cluster can pull the image.
{% endhint %}

## Step 1: Logging In

{% tabs %}
{% tab title="UI" %}
Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.
{% endtab %}

{% tab title="API" %}
To use the API, you will need to obtain a token as shown in [Creating a user access key](#step-2-creating-a-user-access-key).
{% endtab %}

{% tab title="CLI v2" %}
Run the following `--help` command to list the login options, then log in according to your setup:

```sh
runai login --help
```

{% endtab %}
{% endtabs %}

## Step 2: Creating a User Access Key

Access keys are used for API integrations with NVIDIA Run:ai. An access key contains a client ID and a client secret. With the client credentials, you can obtain a token and use it within subsequent API calls.

In the NVIDIA Run:ai user interface:

1. Click the user avatar at the top right corner, then select **Settings**
2. Click **+ACCESS KEY**
3. Enter the access key's **name** and click **CREATE**
4. Copy the **Client ID** and **Client secret** and store them securely
5. Click **DONE**

To request an API access token, send the client credentials to the [Tokens](https://run-ai-docs.nvidia.com/api/2.24/authentication-and-authorization/tokens) API. For example:

```bash
# Replace <COMPANY_URL> below:
#   For SaaS, use <tenant-name>.run.ai
#   For self-hosted, use the NVIDIA Run:ai user interface URL.
curl -X POST \
  'https://<COMPANY_URL>/api/v1/token' \
  --header 'Accept: */*' \
  --header 'Content-Type: application/json' \
  --data-raw '{
  "grantType": "client_credentials",
  "clientId": "<CLIENT ID>",
  "clientSecret": "<CLIENT SECRET>"
}'
```
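
To reuse the token in later requests, it can be convenient to capture it in a shell variable. The following is a minimal sketch, assuming the response is a JSON object that carries the token in an `accessToken` string field (check the Tokens API reference for the exact schema). The helper name `extract_access_token` and the file name `response.json` are hypothetical and introduced here only for illustration.

```bash
# Hypothetical helper: read a JSON token response on stdin and print the
# value of its "accessToken" field (the field name is an assumption).
extract_access_token() {
  sed -n 's/.*"accessToken"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage, after saving the curl response above to response.json:
#   TOKEN=$(extract_access_token < response.json)
```

The helper relies only on `sed`, so it works in minimal containers; if `jq` is available, a JSON-aware parser is the more robust choice.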

## Step 3: Identifying Available Nodes and Racks

Before submitting any workloads, identify which nodes are available and which racks they belong to. This is required to design tests that exercise intra-rack and inter-rack communication paths.

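1. SSH to a node that has `kubectl` access and load the Kubernetes module. The hostname and module name below are site-specific examples; substitute your own: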

   ```bash
   ssh t06-p1-k8s-arm-01
   module load kubernetes/k8s-user/
   ```
2. List nodes with `kubectl` to see node names and labels. Node names typically encode the rack identifier (for example, `s03` and `s04` indicate different racks):

   ```bash
   kubectl get nodes -o wide
   ```
3. List nodes and node pools through NVIDIA Run:ai:

   ```bash
   runai list nodes
   runai nodepool list
   ```
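
If node names encode the rack as described above, the node list can be grouped by rack with a short filter. This is a sketch under a stated assumption: the rack token is the first `s` followed by digits in the node name (such as `s03`). Your site's naming scheme may differ, and the helper names `rack_of` and `group_by_rack` are hypothetical.

```bash
# Hypothetical helper: print the first "s<digits>" token found in a node name.
rack_of() {
  printf '%s\n' "$1" | grep -oE 's[0-9]+' | head -n 1
}

# Read node names on stdin and print "<rack> <node>" lines sorted by rack.
# Nodes without a recognizable rack token are reported as "unknown".
group_by_rack() {
  while IFS= read -r node; do
    [ -n "$node" ] || continue
    rack=$(rack_of "$node" || true)
    printf '%s %s\n' "${rack:-unknown}" "$node"
  done | sort
}

# Usage on a live cluster:
#   kubectl get nodes -o name | sed 's|^node/||' | group_by_rack
```

If your cluster exposes rack membership through node labels instead, reading the label is more reliable than parsing names.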

## Step 4: Running NCCL Tests on a Single Node

This step runs an `all_reduce` benchmark across four GPUs within a single node, validating intra-node NVLink and PCIe bandwidth.

### Submitting the Workload

Submit a training workload that requests four GPUs on one node and idles the container so you can run benchmarks interactively.

{% tabs %}
{% tab title="UI" %}

1. To create the training workload, go to Workload manager → Workloads.
2. Click **+NEW WORKLOAD** and select **Training** from the dropdown menu.
3. Within the new training form, select the **cluster** and the `nccl-benchmarking` **project**.
4. Set the training workload **architecture** to **Standard**. A standard workload runs a single pod.
5. Select **Start from scratch**.
6. Enter `nccl-single-node` as the **name** for the workload, then click **CONTINUE**.
7. Under **Environment**, set the **Image URL** to `nvcr.io/nvidia/pytorch:26.01-py3`.
8. Under **Runtime settings**, click **+COMMAND & ARGUMENTS**, then set the **Command** to `bash` and the **Arguments** to `-c 'sleep 1d'`. The container idles for one day so you can launch the benchmark interactively.
9. Under **Compute resource**, set **GPU devices per pod** to `4`.
10. Click **CREATE TRAINING**.
    {% endtab %}

{% tab title="API" %}
Copy the following example request and update the parameters as needed. For more details, see the [Trainings](https://run-ai-docs.nvidia.com/api/2.24/workloads/trainings) API:

* `<COMPANY-URL>` - The link to the NVIDIA Run:ai user interface.
* `<TOKEN>` - The API access token obtained in [Step 2](#step-2-creating-a-user-access-key).
* `<PROJECT-ID>` - The ID of the `nccl-benchmarking` project. Retrieve it via the [Get projects](https://run-ai-docs.nvidia.com/api/2.24/organizations/projects#get-api-v1-org-unit-projects) API.
* `<CLUSTER-UUID>` - The unique identifier of the cluster. Retrieve it via the [Get clusters](https://run-ai-docs.nvidia.com/api/2.24/organizations/clusters#get-api-v1-clusters) API.

```bash
curl -L 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "name": "nccl-single-node",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
      "image": "nvcr.io/nvidia/pytorch:26.01-py3",
      "command": "bash",
      "args": "-c \"sleep 1d\"",
      "compute": {
        "gpuDevicesRequest": 4
      }
    }
  }'
```

{% endtab %}

{% tab title="CLI v2" %}
Copy the following command to your terminal. For more details, see the [CLI reference](/self-hosted/2.24/reference/cli/runai/runai_training_submit.md):

```bash
runai training standard submit nccl-single-node \
  -p nccl-benchmarking \
  -i nvcr.io/nvidia/pytorch:26.01-py3 \
  -g 4 \
  -- bash -c 'sleep 1d'
```

{% endtab %}
{% endtabs %}

### Running the Benchmark

The CLI commands below pass `-p nccl-benchmarking` explicitly. Alternatively, set the project as the CLI default once with `runai project set nccl-benchmarking` and omit the flag from subsequent commands.

1. Confirm the workload is in **Running** state and review GPU allocation. The `exec` step in the next instruction fails if the pod is not yet running:

   ```bash
   runai training standard describe nccl-single-node -p nccl-benchmarking
   ```
2. Once the workload is **Running**, open an interactive shell into the workload and check GPU visibility:

   ```bash
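   # Open a shell in the workload's container
   runai training standard exec nccl-single-node -p nccl-benchmarking -it -- bash
   # Then, inside the container shell, list the visible GPUs
   nvidia-smi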
   ```
3. Run the four-GPU `all_reduce` benchmark with debug logging enabled (`NCCL_DEBUG=INFO`). The parameters sweep message sizes from 8 bytes to 1 GB, doubling at each step (`-b 8 -e 1G -f 2`), assign one GPU per rank (`-g 1`), perform two warmup iterations (`-w 2`), measure ten iterations (`--iters 10`), and check result correctness over ten iterations (`-c 10`):

   ```bash
   mpirun -np 4 \
     --allow-run-as-root \
     -x NCCL_DEBUG=INFO \
     /usr/local/bin/all_reduce_perf_mpi \
     -b 8 -e 1G -f 2 -g 1 -w 2 --iters 10 -c 10
   ```
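
When the run finishes, `nccl-tests` prints a summary line that begins with `# Avg bus bandwidth`. A small helper can pull that number out of a saved log so it can be compared against a site baseline. This is only a sketch: the helper names `avg_busbw` and `busbw_at_least` and the log file name are hypothetical, and the threshold has to come from the figures expected for your own hardware.

```bash
# Hypothetical helper: print the value from the "# Avg bus bandwidth" summary
# line of an nccl-tests log (reads the given file, or stdin if none is given).
avg_busbw() {
  awk -F: '/^# Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$@"
}

# Hypothetical helper: succeed when value $1 is at least threshold $2 (GB/s).
busbw_at_least() {
  awk -v v="$1" -v min="$2" 'BEGIN { exit !(v + 0 >= min + 0) }'
}

# Usage: append "| tee nccl-single-node.log" to the mpirun command, then:
#   busbw_at_least "$(avg_busbw nccl-single-node.log)" "<YOUR-BASELINE>" \
#     || echo "average bus bandwidth is below the baseline"
```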

## Step 5: Running NCCL Tests Across Multiple Nodes in the Same Rack

This step runs an eight-GPU `all_reduce` benchmark spanning two nodes in the same rack, validating intra-rack inter-node fabric performance.

### Submitting the Workload

Submit a multi-worker training workload. Two workers each request four GPUs, for a total of eight GPUs across two nodes. Both the master and the workers idle so that benchmarks can be launched interactively.

{% tabs %}
{% tab title="UI" %}

1. To create the training workload, go to Workload manager → Workloads.
2. Click **+NEW WORKLOAD** and select **Training** from the dropdown menu.
3. Within the new training form, select the **cluster** and the `nccl-benchmarking` **project**.
4. Set the training workload **architecture** to **Distributed**. A distributed workload runs multiple processes that can span across different nodes.
5. Set the framework for the distributed workload to **MPI**. If MPI isn't available, see [Distributed training prerequisites](/self-hosted/2.24/getting-started/installation/install-using-helm/system-requirements.md#distributed-training) for details on enabling it.
6. Set the **distributed workload configuration** to **Workers & master**. The master pod coordinates `mpirun` across the workers.
7. Select **Start from scratch**.
8. Enter `nccl-multi-node` as the **name** for the workload, then click **CONTINUE**.
9. Under **Environment**, set the **Image URL** to `nvcr.io/nvidia/pytorch:26.01-py3`.
10. Under **Runtime settings**, click **+COMMAND & ARGUMENTS**, then set the **Command** to `bash` and the **Arguments** to `-c 'sleep 1d'`.
11. Under **Compute resource**, set **GPU devices per pod** to `4`.
12. Set the **number of workers** to `2`. Combined with the master pod, this produces three pods; only the workers run on GPU nodes.
13. Click **CONTINUE**.
14. Ensure the **Allow different setup for the master** toggle is **disabled** so the master uses the same image and command as the workers.
15. Click **CREATE TRAINING**.
    {% endtab %}

{% tab title="API" %}
Copy the following example request and update the parameters as needed. For more details, see the [Distributed](https://run-ai-docs.nvidia.com/api/2.24/workloads/distributed) API:

* `<COMPANY-URL>` - The link to the NVIDIA Run:ai user interface.
* `<TOKEN>` - The API access token obtained in [Step 2](#step-2-creating-a-user-access-key).
* `<PROJECT-ID>` - The ID of the `nccl-benchmarking` project. Retrieve it via the [Get projects](https://run-ai-docs.nvidia.com/api/2.24/organizations/projects#get-api-v1-org-unit-projects) API.
* `<CLUSTER-UUID>` - The unique identifier of the cluster. Retrieve it via the [Get clusters](https://run-ai-docs.nvidia.com/api/2.24/organizations/clusters#get-api-v1-clusters) API.

```bash
curl -L 'https://<COMPANY-URL>/api/v1/workloads/distributed' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "name": "nccl-multi-node",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "masterSpecSameAsWorker": true,
    "spec": {
      "image": "nvcr.io/nvidia/pytorch:26.01-py3",
      "command": "bash",
      "args": "-c \"sleep 1d\"",
      "compute": {
        "gpuDevicesRequest": 4
      },
      "numWorkers": 2,
      "distributedFramework": "MPI"
    }
  }'
```

{% endtab %}

{% tab title="CLI v2" %}
Copy the following command to your terminal. For more details, see the [CLI reference](/self-hosted/2.24/reference/cli/runai/runai_training_submit.md):

```bash
runai training mpi submit nccl-multi-node \
  -p nccl-benchmarking \
  -i nvcr.io/nvidia/pytorch:26.01-py3 \
  -g 4 \
  --workers 2 \
  --slots-per-worker 4 \
  --master-command bash \
  --master-args "-c 'sleep 1d'" \
  -- bash -c 'sleep 1d'
```

{% endtab %}
{% endtabs %}

### Running the Benchmark

The CLI commands below pass `-p nccl-benchmarking` explicitly. Alternatively, set the project as the CLI default once with `runai project set nccl-benchmarking` and omit the flag from subsequent commands.

1. Confirm the workload is in **Running** state and review GPU allocation. The `exec` step in the next instruction fails if the pod is not yet running:

   ```bash
   runai training mpi describe nccl-multi-node -p nccl-benchmarking
   ```
2. Once the workload is **Running**, open an interactive shell into the workload and check GPU visibility. `nvidia-smi` only reports the 4 GPUs on the pod's own node. To verify all 8 GPUs across the workload, repeat the `exec` for each pod (use `--pod` with the pod names from the workload) and rerun `nvidia-smi` in each:

   ```bash
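   # Open a shell in the workload's container
   runai training mpi exec nccl-multi-node -p nccl-benchmarking -it -- bash
   # Then, inside the container shell, list the GPUs visible to this pod
   nvidia-smi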
   ```
3. Run the eight-GPU `all_reduce` benchmark, distributing four ranks per node using the MPI hostfile bundled in the PyTorch container at `/tmp/nccl-hosts`:

   ```bash
   mpirun -np 8 \
     --hostfile /tmp/nccl-hosts \
     --map-by ppr:4:node \
     --allow-run-as-root \
     -x NCCL_DEBUG=INFO \
     /usr/local/bin/all_reduce_perf_mpi \
     -b 8 -e 1G -f 2 -g 1 -w 2 --iters 10 -c 10
   ```
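
The `-np` value should match the number of slots that the hostfile provides. A quick sanity check is to sum the `slots=` entries. This sketch assumes the file uses the Open MPI `hostname slots=N` format; inspect `/tmp/nccl-hosts` in your own workload to confirm. The helper name `total_slots` is hypothetical.

```bash
# Hypothetical helper: sum the explicit "slots=N" entries of an Open MPI style
# hostfile (reads the given file, or stdin if none is given).
total_slots() {
  awk '
    /^[ \t]*#/ { next }   # skip comment lines
    {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^slots=[0-9]+$/) { split($i, kv, "="); total += kv[2] }
    }
    END { print total + 0 }
  ' "$@"
}

# Usage inside the pod where mpirun is launched:
#   total_slots /tmp/nccl-hosts   # compare the result with the -np value
```

Hosts listed without an explicit `slots=` entry are not counted, so a result lower than expected can also mean the hostfile omits slot counts.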

## Step 6: Running NCCL Tests Across Multiple Racks

This step runs a twelve-GPU `all_reduce` benchmark spanning three nodes placed in different racks, validating inter-rack fabric performance.

### Submitting the Workload

Submit a workload that scales the multi-worker pattern to three workers across three nodes.

{% tabs %}
{% tab title="UI" %}

1. To create the training workload, go to Workload manager → Workloads.
2. Click **+NEW WORKLOAD** and select **Training** from the dropdown menu.
3. Within the new training form, select the **cluster** and the `nccl-benchmarking` **project**.
4. Set the training workload **architecture** to **Distributed** and the framework to **MPI**.
5. Set the **distributed workload configuration** to **Workers & master**.
6. Select **Start from scratch**.
7. Enter `nccl-multi-rack` as the **name** for the workload, then click **CONTINUE**.
8. Under **Environment**, set the **Image URL** to `nvcr.io/nvidia/pytorch:26.01-py3`.
9. Under **Runtime settings**, click **+COMMAND & ARGUMENTS**, then set the **Command** to `bash` and the **Arguments** to `-c 'sleep 1d'`.
10. Under **Compute resource**, set **GPU devices per pod** to `4`.
11. Set the **number of workers** to `3`. To force the workers onto nodes in different racks, set node affinities or labels through your cluster administrator.
12. Click **CONTINUE**.
13. Ensure the **Allow different setup for the master** toggle is **disabled**.
14. Click **CREATE TRAINING**.
    {% endtab %}

{% tab title="API" %}
Copy the following example request and update the parameters as needed. For more details, see the [Distributed](https://run-ai-docs.nvidia.com/api/2.24/workloads/distributed) API:

* `<COMPANY-URL>` - The link to the NVIDIA Run:ai user interface.
* `<TOKEN>` - The API access token obtained in [Step 2](#step-2-creating-a-user-access-key).
* `<PROJECT-ID>` - The ID of the `nccl-benchmarking` project. Retrieve it via the [Get projects](https://run-ai-docs.nvidia.com/api/2.24/organizations/projects#get-api-v1-org-unit-projects) API.
* `<CLUSTER-UUID>` - The unique identifier of the cluster. Retrieve it via the [Get clusters](https://run-ai-docs.nvidia.com/api/2.24/organizations/clusters#get-api-v1-clusters) API.

```bash
curl -L 'https://<COMPANY-URL>/api/v1/workloads/distributed' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
    "name": "nccl-multi-rack",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "masterSpecSameAsWorker": true,
    "spec": {
      "image": "nvcr.io/nvidia/pytorch:26.01-py3",
      "command": "bash",
      "args": "-c \"sleep 1d\"",
      "compute": {
        "gpuDevicesRequest": 4
      },
      "numWorkers": 3,
      "distributedFramework": "MPI"
    }
  }'
```

{% endtab %}

{% tab title="CLI v2" %}
Copy the following command to your terminal. For more details, see the [CLI reference](/self-hosted/2.24/reference/cli/runai/runai_training_submit.md):

```bash
runai training mpi submit nccl-multi-rack \
  -p nccl-benchmarking \
  -i nvcr.io/nvidia/pytorch:26.01-py3 \
  -g 4 \
  --workers 3 \
  --slots-per-worker 4 \
  --master-command bash \
  --master-args "-c 'sleep 1d'" \
  -- bash -c 'sleep 1d'
```

{% endtab %}
{% endtabs %}

### Running the Benchmark

The CLI commands below pass `-p nccl-benchmarking` explicitly. Alternatively, set the project as the CLI default once with `runai project set nccl-benchmarking` and omit the flag from subsequent commands.

1. Confirm the workload is in **Running** state and review GPU allocation. The `exec` step in the next instruction fails if the pod is not yet running:

   ```bash
   runai training mpi describe nccl-multi-rack -p nccl-benchmarking
   ```
2. Once the workload is **Running**, open an interactive shell into the workload and check GPU visibility. `nvidia-smi` only reports the 4 GPUs on the pod's own node. To verify all 12 GPUs across the workload, repeat the `exec` for each pod (use `--pod` with the pod names from the workload) and rerun `nvidia-smi` in each:

   ```bash
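   # Open a shell in the workload's container
   runai training mpi exec nccl-multi-rack -p nccl-benchmarking -it -- bash
   # Then, inside the container shell, list the GPUs visible to this pod
   nvidia-smi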
   ```
3. Run the twelve-GPU `all_reduce` benchmark, distributing four ranks per node using the MPI hostfile bundled in the PyTorch container at `/tmp/nccl-hosts`:

   ```bash
   mpirun -np 12 \
     --hostfile /tmp/nccl-hosts \
     --map-by ppr:4:node \
     --allow-run-as-root \
     -x NCCL_DEBUG=INFO \
     /usr/local/bin/all_reduce_perf_mpi \
     -b 8 -e 1G -f 2 -g 1 -w 2 --iters 10 -c 10
   ```
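
Before interpreting cross-rack numbers, it helps to confirm that the workers actually landed on different nodes. The sketch below counts the distinct nodes hosting worker pods. It assumes worker pod names contain the string `worker` and that the project's pods live in a namespace named `runai-nccl-benchmarking`; both are assumptions to verify on your cluster, and `worker_node_count` is a hypothetical helper name.

```bash
# Hypothetical helper: read "<pod> <node>" lines on stdin and print the number
# of distinct nodes that host pods whose name contains "worker".
worker_node_count() {
  awk '$1 ~ /worker/ { seen[$2] = 1 } END { n = 0; for (k in seen) n++; print n }'
}

# Usage (the namespace name is an assumption; adjust it to your project):
#   kubectl get pods -n runai-nccl-benchmarking --no-headers \
#     -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName | worker_node_count
```

For this workload the expected count is three. To map each node to its rack, compare the node names with the rack information gathered in Step 3.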

## Step 7: Running Other NCCL Collective Tests

The previous steps use `all_reduce_perf_mpi`, but the NCCL test suite includes additional binaries that exercise different collective operations. Substitute the binary in the `mpirun` command above to characterize the collective most relevant to your workload:

| Binary                                | Typical workload pattern                                    |
| ------------------------------------- | ----------------------------------------------------------- |
| `all_reduce_perf_mpi`                 | Default; bandwidth and latency for gradient synchronization |
| `all_gather_perf_mpi`                 | Tensor-parallel forward path or activations gather          |
| `reduce_scatter_perf_mpi`             | ZeRO or sharded-optimizer step                              |
| `broadcast_perf_mpi`                  | Weight broadcast from rank 0                                |
| `reduce_perf_mpi`                     | All-to-one reduction                                        |
| `alltoall_perf_mpi`                   | Mixture-of-Experts dispatch or sequence-parallel            |
| `sendrecv_perf_mpi`                   | Pipeline-parallel hop                                       |
| `gather_perf_mpi`, `scatter_perf_mpi` | Asymmetric collectives                                      |
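
Swapping the binary by hand is error-prone when several collectives are tested in a row. The sketch below generates the multi-node `mpirun` command line for a given collective and rank count, reusing the flags from the earlier steps. The helper name `nccl_cmd` is hypothetical, and the commands are only printed, not executed.

```bash
# Hypothetical helper: print the mpirun command line for one collective.
#   $1 = collective name without the _perf_mpi suffix (for example, all_gather)
#   $2 = total number of ranks (8 or 12 for the multi-node workloads)
nccl_cmd() {
  printf 'mpirun -np %s --hostfile /tmp/nccl-hosts --map-by ppr:4:node --allow-run-as-root -x NCCL_DEBUG=INFO /usr/local/bin/%s_perf_mpi -b 8 -e 1G -f 2 -g 1 -w 2 --iters 10 -c 10\n' "$2" "$1"
}

# Print the commands for a few collectives on the eight-GPU workload.
for collective in all_gather reduce_scatter alltoall; do
  nccl_cmd "$collective" 8
done
```

Printing the commands first makes it easy to review them before pasting them into the interactive shell of the master pod.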

## Step 8: Cleaning Up the Environment

The benchmark containers idle for one day and are reclaimed automatically when the sleep expires. To release GPUs sooner, delete the workloads manually.
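
A minimal sketch of the manual cleanup with the CLI follows. It assumes the `delete` subcommand is available for both the `standard` and `mpi` training types in your CLI version (run `runai training --help` to confirm). The function name `cleanup_nccl_workloads` and the `RUNAI` override variable are hypothetical; setting `RUNAI=echo` prints the commands instead of running them.

```bash
# Hypothetical helper: delete the three tutorial workloads from a project.
#   $1 = project name (defaults to nccl-benchmarking)
# Set RUNAI=echo for a dry run that only prints the commands.
cleanup_nccl_workloads() {
  project=${1:-nccl-benchmarking}
  ${RUNAI:-runai} training standard delete nccl-single-node -p "$project"
  ${RUNAI:-runai} training mpi delete nccl-multi-node -p "$project"
  ${RUNAI:-runai} training mpi delete nccl-multi-rack -p "$project"
}

# Usage:
#   RUNAI=echo cleanup_nccl_workloads   # dry run
#   cleanup_nccl_workloads              # delete the workloads
```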

