
# Validation Tests

{% hint style="info" %}
**Container images**

These tests use NVIDIA NGC container images:

* `nvcr.io/nvidia/nv-mission-control/nvbandwidth`: bandwidth tests. This image is part of the NVIDIA Mission Control software stack and requires a valid NVIDIA Mission Control entitlement. Pulling it requires an NGC API key stored as an image pull secret in the cluster. The bandwidth manifests below reference this secret as `ngc-nvcr`, so create it before applying them (this step is also included in the post-wizard deployment steps).
* `nvcr.io/nvidia/pytorch`: NCCL tests. This image can be pulled from NGC without authentication and ships the NCCL test binaries (for example `/usr/local/bin/all_reduce_perf_mpi`).

To create the `ngc-nvcr` image pull secret, generate an NGC API key from the [NGC dashboard](https://ngc.nvidia.com) (**Setup → Generate API Key**), then create a `docker-registry` secret in the namespace where the tests run. NGC uses the literal username `$oauthtoken` and the API key as the password (quote `$oauthtoken` so the shell does not expand it):

```bash
kubectl create secret docker-registry ngc-nvcr \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password='<NGC_API_KEY>' \
  -n default
```

The NCCL tests use the public `pytorch` image and do not need this secret.

Adjust image tags to match the CUDA and driver versions supported by your cluster.
{% endhint %}
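Before applying the bandwidth manifests, it can help to confirm that the pull secret is present in the target namespace. The following is a minimal sketch; the helper name `check_pull_secret` is hypothetical, and it assumes the secret was created with `kubectl create secret docker-registry` as shown above, which produces a secret of type `kubernetes.io/dockerconfigjson`.

```bash
# Hypothetical helper (not part of the product): succeeds only when the secret
# exists and has the type that `kubectl create secret docker-registry` produces.
check_pull_secret() {
  secret_name="${1:-ngc-nvcr}"
  secret_namespace="${2:-default}"
  # A missing secret makes kubectl exit non-zero, which fails the check.
  secret_type="$(kubectl get secret "$secret_name" -n "$secret_namespace" \
    -o jsonpath='{.type}' 2>/dev/null)" || return 1
  [ "$secret_type" = "kubernetes.io/dockerconfigjson" ]
}

# Usage:
#   check_pull_secret ngc-nvcr default && echo "pull secret found"
```

This only checks that the secret exists with the expected type. It does not verify that the API key inside it is valid or entitled to pull the image.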

{% hint style="warning" %}
**Air-gapped clusters**

The MPI-based tests below start `sshd` in the worker pods by installing `openssh-server` at runtime with `apt-get`, which requires the pods to reach a package repository. In an air-gapped cluster, either make a local apt mirror reachable from the worker pods and configure the containers to use it, or substitute a container image that already includes `openssh-server` (and, for the NCCL tests, the NCCL test binaries) so that no runtime package installation is needed.
{% endhint %}

## Multi-Node NVLink (MNNVL) ComputeDomain Bandwidth Test (dra-computedomain-test.yaml)

This test exercises the NVIDIA DRA driver's `ComputeDomain` (the Multi-Node NVLink / IMEX channel) with a multi-node `nvbandwidth` MPIJob. It applies to MNNVL-capable platforms — for example DGX GB200 and GB300, and future MNNVL systems — and is named for the capability it validates rather than a specific platform SKU.

```yaml
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: nvbandwidth-test-compute-domain
spec:
  numNodes: 2
  channel:
    resourceClaimTemplate:
      name: nvbandwidth-test-compute-domain-channel

---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nvbandwidth-test
spec:
  slotsPerWorker: 4
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running
  # The NGC nvbandwidth image runs as root and does not ship an SSH server,
  # so keys are mounted under /root/.ssh and the worker installs openssh-server
  # at runtime (see the Worker spec below).
  sshAuthMountPath: /root/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            mpi-memcpy-dra-test-replica: mpi-launcher
        spec:
          restartPolicy: OnFailure
          # The launcher pins to a control-plane node; tolerate its taint.
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - image: nvcr.io/nvidia/nv-mission-control/nvbandwidth:1.8.0
            name: mpi-launcher
            securityContext:
              runAsUser: 0
            env:
            - name: OMPI_ALLOW_RUN_AS_ROOT
              value: "1"
            - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
              value: "1"
            command: ["/bin/bash", "-lc"]
            args:
            - >
              mpirun --allow-run-as-root
              --bind-to core --map-by ppr:4:node -np 8
              --report-bindings -q
              -mca plm_rsh_args "-p 2222 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /root/.ssh/id_rsa"
              nvbandwidth -t multinode_device_to_device_memcpy_read_ce
          imagePullSecrets:
          - name: ngc-nvcr
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            mpi-memcpy-dra-test-replica: mpi-worker
        spec:
          restartPolicy: OnFailure
          containers:
          - image: nvcr.io/nvidia/nv-mission-control/nvbandwidth:1.8.0
            name: mpi-worker
            securityContext:
              runAsUser: 0
            # The NGC image has no SSH server; install openssh-server at runtime
            # and start sshd on port 2222 for the MPI launcher to connect.
            command: ["/bin/bash", "-lc"]
            args:
            - >
              apt-get update -qq &&
              apt-get install -y -q openssh-server &&
              mkdir -p /run/sshd &&
              ssh-keygen -A &&
              exec /usr/sbin/sshd -De -p 2222 -o StrictModes=no
            resources:
              limits:
                nvidia.com/gpu: 4
              claims:
              - name: compute-domain-channel
          imagePullSecrets:
          - name: ngc-nvcr
          resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: nvbandwidth-test-compute-domain-channel
          # GB200/GB300 only: co-locate all worker pods within the same NVL
          # clique so the multi-node nvbandwidth test exercises NVLink rather
          # than the scale-out fabric. Omit (or change topologyKey) on
          # non-NVL accelerators.
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: mpi-memcpy-dra-test-replica
                    operator: In
                    values:
                    - mpi-worker
                topologyKey: nvidia.com/gpu.clique
```
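One way to run the manifest above and collect its result is sketched below. The helper name `run_validation_job`, the 15-minute timeout, and the `default` namespace are illustrative assumptions. The sketch also assumes the MPIJob reports a `Succeeded` condition in its status when it completes. The launcher label used in the usage example is the one set in the manifest.

```bash
# Hypothetical helper: apply a manifest, wait for the MPIJob to report the
# Succeeded condition, then print the log of the pods matching the selector.
run_validation_job() {
  manifest="$1"
  job_name="$2"
  launcher_selector="$3"
  job_namespace="${4:-default}"
  kubectl apply -n "$job_namespace" -f "$manifest" || return 1
  # The timeout is an illustrative value; adjust it for your cluster.
  kubectl wait -n "$job_namespace" --for=condition=Succeeded \
    "mpijob/$job_name" --timeout=15m || return 1
  kubectl logs -n "$job_namespace" -l "$launcher_selector"
}

# Usage:
#   run_validation_job dra-computedomain-test.yaml nvbandwidth-test \
#     mpi-memcpy-dra-test-replica=mpi-launcher
```

If the wait times out, inspecting the worker pods with `kubectl describe pod` is a reasonable next step, since an unschedulable worker keeps the launcher from being created under `WaitForWorkersReady`.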

## InfiniBand (SR-IOV) Bandwidth Tests (ib-bandwidth-test.yaml)

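This test runs `nvbandwidth` over the SR-IOV InfiniBand fabric. It applies to InfiniBand-fabric deployments, for example GB200 / GB300 SuperPOD systems configured for InfiniBand. The MPIJob uses the same name (`nvbandwidth-test`) as the MNNVL test above, so delete that job first if both run in the same namespace.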

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nvbandwidth-test
spec:
  slotsPerWorker: 4
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running
  # The NGC nvbandwidth image runs as root and does not ship an SSH server,
  # so keys are mounted under /root/.ssh and the worker installs openssh-server
  # at runtime (see the Worker spec below).
  sshAuthMountPath: /root/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            mpi-memcpy-dra-test-replica: mpi-launcher
        spec:
          restartPolicy: OnFailure
          # The launcher pins to a control-plane node; tolerate its taint.
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - image: nvcr.io/nvidia/nv-mission-control/nvbandwidth:1.8.0
            name: mpi-launcher
            securityContext:
              runAsUser: 0
            env:
            - name: OMPI_ALLOW_RUN_AS_ROOT
              value: "1"
            - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
              value: "1"
            command: ["/bin/bash", "-lc"]
            args:
            - >
              mpirun --allow-run-as-root
              --bind-to core --map-by ppr:4:node -np 8
              --report-bindings -q
              -mca plm_rsh_args "-p 2222 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /root/.ssh/id_rsa"
              nvbandwidth
          imagePullSecrets:
          - name: ngc-nvcr
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            mpi-memcpy-dra-test-replica: mpi-worker
          annotations:
            k8s.v1.cni.cncf.io/networks: sriovibnet-rdma-default-a-su-1,sriovibnet-rdma-default-b-su-1,sriovibnet-rdma-default-c-su-1,sriovibnet-rdma-default-d-su-1
        spec:
          restartPolicy: OnFailure
          imagePullSecrets:
          - name: ngc-nvcr
          containers:
          - image: nvcr.io/nvidia/nv-mission-control/nvbandwidth:1.8.0
            name: mpi-worker
            securityContext:
              runAsUser: 0
            # The NGC image has no SSH server; install openssh-server at runtime
            # and start sshd on port 2222 for the MPI launcher to connect.
            command: ["/bin/bash", "-lc"]
            args:
            - >
              apt-get update -qq &&
              apt-get install -y -q openssh-server &&
              mkdir -p /run/sshd &&
              ssh-keygen -A &&
              exec /usr/sbin/sshd -De -p 2222 -o StrictModes=no
            resources:
              limits:
                nvidia.com/gpu: 4
                nvidia.com/sriovib_resource_a: '1'
                nvidia.com/sriovib_resource_b: '1'
                nvidia.com/sriovib_resource_c: '1'
                nvidia.com/sriovib_resource_d: '1'
```
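The worker pods in the manifest above only start if every network named in the `k8s.v1.cni.cncf.io/networks` annotation resolves to a `NetworkAttachmentDefinition`. The sketch below checks a comma-separated list of names before the job is applied. The helper name `missing_networks` is hypothetical, and it assumes the definitions live in the same namespace as the pods, which is how names without a namespace prefix are resolved.

```bash
# Hypothetical helper: print the networks from a comma-separated list that
# have no NetworkAttachmentDefinition in the given namespace.
# Returns 0 only when none are missing.
missing_networks() {
  network_list="$1"
  nad_namespace="${2:-default}"
  missing=""
  for net in $(printf '%s' "$network_list" | tr ',' ' '); do
    kubectl get network-attachment-definitions.k8s.cni.cncf.io "$net" \
      -n "$nad_namespace" >/dev/null 2>&1 || missing="$missing $net"
  done
  # Strip the leading space before printing the result.
  printf '%s\n' "${missing# }"
  [ -z "$missing" ]
}

# Usage:
#   missing_networks "sriovibnet-rdma-default-a-su-1,sriovibnet-rdma-default-b-su-1"
```

An empty output line with a zero exit status means every listed network was found.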

## InfiniBand (SR-IOV) NCCL Tests (ib-nccl-test.yaml)

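This test runs an NCCL all-reduce over the SR-IOV InfiniBand fabric. It applies to InfiniBand-fabric deployments: DGX B200, and B-series / GB-series systems configured for InfiniBand. The `btl_tcp_if_include` and `oob_tcp_if_include` subnets in the launcher command are site-specific; adjust them to match your cluster's networks before applying.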

{% hint style="info" %}
**Spectrum-X (RoCE) deployments**

This test targets the InfiniBand fabric. DGX systems configured for NVIDIA Spectrum-X Ethernet (RoCE) — for example DGX B300 SuperPOD deployments using Spectrum-X — use a different NCCL transport and a different secondary-network and resource model, and therefore require a separate RoCE NCCL validation test. Do not use the InfiniBand test below on a Spectrum-X / RoCE-configured cluster; use the **Spectrum-X (RoCE) NCCL Tests** section that follows.
{% endhint %}

```yaml
---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-test
spec:
  slotsPerWorker: 8
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running

  # Mount the MPI Operator's SSH keys in the home directory of the user
  # the containers run as. They run as root, so use /root/.ssh.
  sshAuthMountPath: /root/.ssh

  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpi-launcher
              image: nvcr.io/nvidia/pytorch:26.04-py3
              # Run as root to avoid the "uid 1000" user lookup error
              securityContext:
                runAsUser: 0
              env:
                - name: OMPI_ALLOW_RUN_AS_ROOT
                  value: "1"
                - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                  value: "1"
              command: ["/bin/bash","-lc"]
              args:
                - >
                  mpirun --allow-run-as-root
                  -np 16
                  -bind-to none -map-by slot
                  -mca pml ob1
                  -mca btl ^openib
                  -mca btl_tcp_if_include 192.168.0.0/16
                  -mca oob_tcp_if_include 172.29.0.0/16
                  -mca routed direct
                  -mca plm_rsh_args "-p 2222 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /root/.ssh/id_rsa"
                  /usr/local/bin/all_reduce_perf_mpi -b 16 -e 16G -f 2 -g 1

    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            mpi-memcpy-dra-test-replica: mpi-worker
          annotations:
            # SR-IOV RDMA secondary networks (align to your site configuration)
            k8s.v1.cni.cncf.io/networks: "sriovibnet-rdma-default-a-su-1,sriovibnet-rdma-default-b-su-1,sriovibnet-rdma-default-c-su-1,sriovibnet-rdma-default-d-su-1,sriovibnet-rdma-default-e-su-1,sriovibnet-rdma-default-f-su-1,sriovibnet-rdma-default-g-su-1,sriovibnet-rdma-default-h-su-1"
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpi-worker
              image: nvcr.io/nvidia/pytorch:26.04-py3
              # Root to install/run sshd, generate host keys & read /etc/ssh/*.
              # The PyTorch NGC image does not ship an SSH server, so install
              # openssh-server at runtime before starting sshd for MPI launch.
              securityContext:
                runAsUser: 0
                capabilities:
                  add: ["IPC_LOCK"]
              command: ["/bin/bash","-lc"]
              args:
                - >
                  apt-get update -qq &&
                  apt-get install -y -q openssh-server &&
                  mkdir -p /run/sshd &&
                  ssh-keygen -A &&
                  exec /usr/sbin/sshd -De -p 2222 -o StrictModes=no
              resources:
                limits:
                  nvidia.com/gpu: 8
                  nvidia.com/sriovib_resource_a: "1"
                  nvidia.com/sriovib_resource_b: "1"
                  nvidia.com/sriovib_resource_c: "1"
                  nvidia.com/sriovib_resource_d: "1"
                  nvidia.com/sriovib_resource_e: "1"
                  nvidia.com/sriovib_resource_f: "1"
                  nvidia.com/sriovib_resource_g: "1"
                  nvidia.com/sriovib_resource_h: "1"
```
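In the manifests above, the `-np` value passed to `mpirun` equals the worker `replicas` multiplied by `slotsPerWorker`: 2 × 4 = 8 for the bandwidth tests and 2 × 8 = 16 for the NCCL test. Assuming that relationship should hold when you scale a test to more nodes, a small helper (hypothetical name `mpi_np`) can recompute the value:

```bash
# Hypothetical helper: total MPI ranks = worker replicas x slots per worker.
mpi_np() {
  workers="$1"
  slots_per_worker="$2"
  echo $(( workers * slots_per_worker ))
}

mpi_np 2 4   # bandwidth manifests: prints 8
mpi_np 2 8   # NCCL manifest: prints 16
```

When changing the worker count, update `replicas`, `-np`, and (for the MNNVL test) the ComputeDomain `numNodes` together so they stay consistent.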

## Spectrum-X (RoCE) NCCL Tests (roce-nccl-test.yaml)

NCCL all-reduce over the NVIDIA Spectrum-X Ethernet (RoCE) fabric. This test applies to Spectrum-X / RoCE deployments — for example DGX B300 SuperPOD systems configured for Spectrum-X. It is the RoCE counterpart to the InfiniBand NCCL test above: it engages the NVIDIA Spectrum-X NCCL plugin (`NCCL_NET_PLUGIN=spcx`) and the twin-planar RoCE secondary networks, rather than the SR-IOV InfiniBand transport.

{% hint style="info" %}
**Fabric and resource names are site-specific**

The `nvidia.com/rX-pY` resources and the `k8s.v1.cni.cncf.io/networks` names below are the twin-planar layout (8 rails × 2 planes = 16 RoCE VFs) provisioned by the NVIDIA Network Operator for Spectrum-X. Align them to your site's `NetworkAttachmentDefinition` / `NicClusterPolicy` names before applying. The manifest sets `NCCL_DEBUG=INFO` so the launcher log shows which transport NCCL selected; confirm a successful run by checking the launcher log for the Spectrum-X plugin (lines such as `NCCL INFO Assigned NET plugin SPCX to comm` and channels routed `via NET/SPCX/<n>/GDRDMA`) together with a non-zero average bus bandwidth (`# Avg bus bandwidth`).
{% endhint %}

```yaml
---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-test-rocex
spec:
  slotsPerWorker: 8
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running

  # Mount the MPI Operator's SSH keys in the home directory of the user
  # the containers run as. They run as root, so use /root/.ssh.
  sshAuthMountPath: /root/.ssh

  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpi-launcher
              image: nvcr.io/nvidia/pytorch:26.04-py3
              securityContext:
                runAsUser: 0
              env:
                - name: OMPI_ALLOW_RUN_AS_ROOT
                  value: "1"
                - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                  value: "1"
              command: ["/bin/bash","-lc"]
              # The Spectrum-X plugin (spcx) and its library ship in the
              # PyTorch NGC image under /opt/hpcx. The -x flags below are the
              # validated Spectrum-X NCCL tuning; align NCCL_IB_HCA only if your
              # RoCE device prefix differs from roce_.
              args:
                - >
                  mpirun --allow-run-as-root
                  --mca plm_rsh_args "-p 2222 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null"
                  --bind-to core --map-by ppr:8:node
                  -np 16
                  -x NCCL_DEBUG=INFO
                  -x NCCL_DEBUG_SUBSYS=INIT,NET
                  -x NCCL_NET_PLUGIN=spcx
                  -x LD_LIBRARY_PATH=/opt/hpcx/nccl_spectrum-x_plugin/lib:/usr/local/lib:$LD_LIBRARY_PATH
                  -x NCCL_IB_HCA=roce_
                  -x NCCL_SOCKET_IFNAME=eth0
                  -x UCX_TLS=tcp
                  -x NCCL_IB_TC=96
                  -x NCCL_CROSS_NIC=0
                  -x NCCL_IB_MERGE_NICS=0
                  -x NCCL_NET_MERGE_LEVEL=PIX
                  -x NCCL_NCHANNELS_PER_NET_PEER=4
                  -x NCCL_P2P_NET_CHUNKSIZE=262144
                  /usr/local/bin/all_reduce_perf_mpi -b 8 -e 8G -f 2 -g 1 -w 2 --iters 4 -c 10
              volumeMounts:
                - name: dshm
                  mountPath: /dev/shm
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory

    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            # Spectrum-X twin-planar RoCE networks (align to site config)
            k8s.v1.cni.cncf.io/networks: "r0-p0,r0-p1,r1-p0,r1-p1,r2-p0,r2-p1,r3-p0,r3-p1,r4-p0,r4-p1,r5-p0,r5-p1,r6-p0,r6-p1,r7-p0,r7-p1"
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpi-worker
              image: nvcr.io/nvidia/pytorch:26.04-py3
              # Root to install/run sshd, generate host keys & read /etc/ssh/*.
              # The PyTorch NGC image does not ship an SSH server, so install
              # openssh-server at runtime before starting sshd for MPI launch.
              securityContext:
                runAsUser: 0
                capabilities:
                  add: ["NET_ADMIN","NET_RAW","IPC_LOCK","SYS_RESOURCE"]
              command: ["/bin/bash","-lc"]
              args:
                - >
                  apt-get update -qq &&
                  apt-get install -y -q openssh-server &&
                  mkdir -p /run/sshd &&
                  ssh-keygen -A &&
                  exec /usr/sbin/sshd -De -p 2222 -o StrictModes=no
              resources:
                limits:
                  nvidia.com/gpu: 8
                  nvidia.com/r0-p0: "1"
                  nvidia.com/r0-p1: "1"
                  nvidia.com/r1-p0: "1"
                  nvidia.com/r1-p1: "1"
                  nvidia.com/r2-p0: "1"
                  nvidia.com/r2-p1: "1"
                  nvidia.com/r3-p0: "1"
                  nvidia.com/r3-p1: "1"
                  nvidia.com/r4-p0: "1"
                  nvidia.com/r4-p1: "1"
                  nvidia.com/r5-p0: "1"
                  nvidia.com/r5-p1: "1"
                  nvidia.com/r6-p0: "1"
                  nvidia.com/r6-p1: "1"
                  nvidia.com/r7-p0: "1"
                  nvidia.com/r7-p1: "1"
              volumeMounts:
                - name: dshm
                  mountPath: /dev/shm
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
```
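The two success criteria described in the hint above, the Spectrum-X plugin appearing in the NCCL log and a non-zero average bus bandwidth, can be checked against a saved launcher log. The following is a minimal sketch: the helper name `check_spcx_log` is hypothetical, and the patterns it matches are taken from the example log lines quoted on this page, so adjust them if your NCCL version words its output differently.

```bash
# Hypothetical helper: verify a saved launcher log shows the Spectrum-X
# transport and a non-zero "# Avg bus bandwidth" summary value.
check_spcx_log() {
  log_file="$1"
  # Transport check: look for the plugin assignment or an SPCX channel route.
  if ! grep -Eq 'NET plugin SPCX|NET/SPCX/' "$log_file"; then
    echo "Spectrum-X plugin not found in log" >&2
    return 1
  fi
  # Bandwidth check: take the number after the colon on the summary line.
  avg_bw="$(awk '/# Avg bus bandwidth/ { sub(/.*:[[:space:]]*/, ""); print $1 }' \
    "$log_file" | tail -n 1)"
  if [ -z "$avg_bw" ]; then
    echo "average bus bandwidth line not found" >&2
    return 1
  fi
  awk -v bw="$avg_bw" 'BEGIN { exit !(bw + 0 > 0) }'
}

# Usage (after saving the launcher output to launcher.log):
#   check_spcx_log launcher.log && echo "RoCE NCCL test passed"
```

A non-zero exit status means one of the two criteria was not met. The sketch does not judge whether the reported bandwidth is adequate for your hardware.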

