# Validation Tests

## Dynamic Resource Allocation Tests for GB200 and GB300 Systems (dra-test-gb200-gb300.yaml)

```yaml
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: nvbandwidth-test-compute-domain
spec:
  numNodes: 2
  channel:
    resourceClaimTemplate:
      name: nvbandwidth-test-compute-domain-channel

---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nvbandwidth-test
spec:
  slotsPerWorker: 4
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            mpi-memcpy-dra-test-replica: mpi-launcher
        spec:
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - --bind-to
            - core
            - --map-by
            - ppr:4:node
            - -np
            - "8"
            - --report-bindings
            - -q
            - nvbandwidth
            - -t
            - multinode_device_to_device_memcpy_read_ce
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            mpi-memcpy-dra-test-replica: mpi-worker
        spec:
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                nvidia.com/gpu: 4
              claims:
              - name: compute-domain-channel
          resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: nvbandwidth-test-compute-domain-channel

```
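In the manifest above, the launcher's `-np "8"` equals `Worker.replicas: 2` multiplied by `slotsPerWorker: 4`, and `ppr:4:node` matches the four GPUs requested per worker. If you change the node count, these values need to move together. The following Python sketch shows one way to check that arithmetic. The helper names `expected_ranks` and `launcher_np` and the plain-dictionary input are illustrative assumptions, not part of any NVIDIA or Kubeflow tooling.

```python
def expected_ranks(mpijob_spec: dict) -> int:
    """Return the MPI rank count implied by an MPIJob spec (workers x slots)."""
    workers = mpijob_spec["mpiReplicaSpecs"]["Worker"]["replicas"]
    slots = mpijob_spec["slotsPerWorker"]
    return workers * slots


def launcher_np(mpijob_spec: dict) -> int:
    """Read the value that follows -np in the launcher container's args list."""
    launcher = mpijob_spec["mpiReplicaSpecs"]["Launcher"]
    args = launcher["template"]["spec"]["containers"][0]["args"]
    return int(args[args.index("-np") + 1])


# Hand-copied subset of the manifest above (only the fields the check needs).
spec = {
    "slotsPerWorker": 4,
    "mpiReplicaSpecs": {
        "Launcher": {
            "template": {"spec": {"containers": [
                {"args": ["--map-by", "ppr:4:node", "-np", "8", "nvbandwidth"]}
            ]}}
        },
        "Worker": {"replicas": 2},
    },
}

# 2 workers x 4 slots per worker = 8 ranks, matching -np "8".
assert launcher_np(spec) == expected_ranks(spec)
```

The same check applies to the other MPIJob manifests on this page, for example 2 workers with 8 slots each and `-np 16`, provided `-np` is passed as a separate list item as it is here.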

## Dynamic Resource Allocation Tests for B200 and B300 Systems (dra-test-b200-b300.yaml)

```yaml
---
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com

---
apiVersion: v1
kind: Pod
metadata:
  name: dra-test-b200
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  - name: ctr1
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

# See https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/demo/specs/quickstart/gpu-test2.yaml

```
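In the Pod above, both containers reference the same pod-level claim, `shared-gpu`, which is created from the `single-gpu` template, so the two containers share one allocated GPU. A container claim name that is not declared under `spec.resourceClaims` is a common editing mistake. The Python sketch below illustrates a simple pre-apply check for it. The function name `undeclared_claims` and the dictionary form of the Pod spec are hypothetical and only cover the fields shown.

```python
def undeclared_claims(pod_spec: dict) -> set:
    """Return claim names used by containers but missing from resourceClaims."""
    declared = {claim["name"] for claim in pod_spec.get("resourceClaims", [])}
    used = {
        claim["name"]
        for container in pod_spec.get("containers", [])
        for claim in container.get("resources", {}).get("claims", [])
    }
    return used - declared


# Hand-copied subset of the Pod spec above.
pod_spec = {
    "containers": [
        {"name": "ctr0", "resources": {"claims": [{"name": "shared-gpu"}]}},
        {"name": "ctr1", "resources": {"claims": [{"name": "shared-gpu"}]}},
    ],
    "resourceClaims": [
        {"name": "shared-gpu", "resourceClaimTemplateName": "single-gpu"}
    ],
}

# Both containers point at the single declared claim, so nothing is missing.
assert undeclared_claims(pod_spec) == set()
```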

## InfiniBand Tests for GB200 and GB300 Systems (ib-test-gb200-gb300.yaml)

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nvbandwidth-test
spec:
  slotsPerWorker: 4
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            mpi-memcpy-dra-test-replica: mpi-launcher
        spec:
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - --bind-to
            - core
            - --map-by
            - ppr:4:node
            - -np
            - "8"
            - --report-bindings
            - -q
            - nvbandwidth
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
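          # Placeholder toleration: replace "key" and "value" with the taint
          # set on the nodes that run the launcher, if any.
          tolerations: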
            - key: "key"
              operator: "Equal"
              value: "value"
              effect: "NoSchedule"
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            mpi-memcpy-dra-test-replica: mpi-worker
          annotations:
            k8s.v1.cni.cncf.io/networks: sriovibnet-rdma-default-a-su-1,sriovibnet-rdma-default-b-su-1,sriovibnet-rdma-default-c-su-1,sriovibnet-rdma-default-d-su-1
        spec:
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                nvidia.com/gpu: 4
                nvidia.com/sriovib_resource_a: '1'
                nvidia.com/sriovib_resource_b: '1'
                nvidia.com/sriovib_resource_c: '1'
                nvidia.com/sriovib_resource_d: '1'


```

## InfiniBand Tests for B200 and B300 Systems (ib-test-b200-b300.yaml)

```yaml
---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-test
spec:
  slotsPerWorker: 8
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running

  # Mount the MPI Operator's SSH keys in the home directory of the user
  # the containers run as. The containers run as root, so use /root/.ssh.
  sshAuthMountPath: /root/.ssh

  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpi-launcher
              image: docker.io/deepops/nccl-tests:2312
              # Run as root to avoid the "uid 1000" user lookup error
              securityContext:
                runAsUser: 0
              env:
                - name: OMPI_ALLOW_RUN_AS_ROOT
                  value: "1"
                - name: OMPI_ALLOW_RUN_AS_ROOT_CONFIRM
                  value: "1"
              command: ["/bin/bash","-lc"]
              args:
                - >
                  mpirun --allow-run-as-root
                  -np 16
                  -bind-to none -map-by slot
                  -mca pml ob1
                  -mca btl ^openib
                  -mca btl_tcp_if_include 192.168.0.0/16
                  -mca oob_tcp_if_include 172.29.0.0/16
                  -mca routed direct
                  -mca plm_rsh_args "-p 2222 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i /root/.ssh/id_rsa"
                  all_reduce_perf_mpi -b 16 -e 16G -f 2 -g 1

    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            mpi-memcpy-dra-test-replica: mpi-worker
          annotations:
            # SR-IOV RDMA networks; adjust to the network names in your cluster
            k8s.v1.cni.cncf.io/networks: "sriovibnet-rdma-default-a-su-1,sriovibnet-rdma-default-b-su-1,sriovibnet-rdma-default-c-su-1,sriovibnet-rdma-default-d-su-1,sriovibnet-rdma-default-e-su-1,sriovibnet-rdma-default-f-su-1,sriovibnet-rdma-default-g-su-1,sriovibnet-rdma-default-h-su-1"
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mpi-worker
              image: docker.io/deepops/nccl-tests:2312
              # Run as root so ssh-keygen can generate host keys and sshd can read /etc/ssh/*
              securityContext:
                runAsUser: 0
                capabilities:
                  add: ["IPC_LOCK"]
              command: ["/bin/bash","-lc"]
              args:
                - >
                  ssh-keygen -A &&
                  exec /usr/sbin/sshd -De -p 2222
              resources:
                limits:
                  nvidia.com/gpu: 8
                  nvidia.com/sriovib_resource_a: "1"
                  nvidia.com/sriovib_resource_b: "1"
                  nvidia.com/sriovib_resource_c: "1"
                  nvidia.com/sriovib_resource_d: "1"
                  nvidia.com/sriovib_resource_e: "1"
                  nvidia.com/sriovib_resource_f: "1"
                  nvidia.com/sriovib_resource_g: "1"
                  nvidia.com/sriovib_resource_h: "1"
```
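In the worker template above, the `k8s.v1.cni.cncf.io/networks` annotation lists eight SR-IOV networks (suffixes `a` through `h`) and the resource limits request one `nvidia.com/sriovib_resource_*` device for each of the same eight suffixes. The GB200 and GB300 manifest follows the same pattern with four of each. The Python sketch below shows one way to confirm that the two counts agree after editing. The function name `sriov_counts` is hypothetical, and the check compares counts only. It does not verify that each network maps to a particular resource.

```python
SRIOV_PREFIX = "nvidia.com/sriovib_resource_"


def sriov_counts(networks_annotation: str, limits: dict) -> tuple:
    """Return (attached networks, requested SR-IOV IB devices) for one worker."""
    networks = [n.strip() for n in networks_annotation.split(",") if n.strip()]
    devices = sum(
        int(value) for key, value in limits.items() if key.startswith(SRIOV_PREFIX)
    )
    return len(networks), devices


# Values rebuilt from the worker template above (suffixes a through h).
suffixes = "abcdefgh"
annotation = ",".join(f"sriovibnet-rdma-default-{s}-su-1" for s in suffixes)
limits = {"nvidia.com/gpu": 8}
limits.update({f"{SRIOV_PREFIX}{s}": "1" for s in suffixes})

attached, requested = sriov_counts(annotation, limits)
assert attached == requested == 8
```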

