# Deployment

## Preparations <a href="#system-and-network-requirements" id="system-and-network-requirements"></a>

Before installing NVIDIA Run:ai, make sure you have reviewed the [Preparations](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/preparations) section and completed all tasks indicated in the [Pre-installation checklist](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/preparations#pre-installation-checklist).

## BCM Version

The instructions in this document are specific to BCM 11, with a minimum required version of **11.25.08**.

## Deploy Using the Wizard

1. Access the active BCM head node via ssh:

   ```sh
   ssh root@<IP address of BCM head node>
   ```
2. Verify the BCM version:

   ```sh
   cm-package-release-info -f cm-setup,cmdaemon

   Name      Version    Release(s)
   --------  ---------  ------------
   cm-setup  123245     11.25.08
   cmdaemon  163415     11.25.08
   ```
3. Create the following files in the `/cm/shared/runai/` directory, populating each file from the linked content:
   * [netop-values.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/netop-values.yaml)
   * [nic-cluster-policy.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/nicclusterpolicy.yaml)
   * [sriov-node-pool-config.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/sriov-node-pool-config.yaml)
   * If on DGX GB200, add these:
     * [combined-ippools-gb200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/combined-ippools-gb200.yaml)
     * [combined-sriovibnet-gb200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/combined-sriovibnet-gb200.yaml)
     * [dra-test-gb200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/dra-test-gb200.yaml)
     * [ib-test-gb200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/ib-test-gb200.yaml)
   * If on DGX B200, add these:
     * [combined-ippools-b200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/combined-ippools-b200.yaml)
     * [combined-sriovibnet-b200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/combined-sriovibnet-b200.yaml)
     * [dra-test-b200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/dra-test-b200.yaml)
     * [ib-test-b200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/ib-test-b200.yaml)
4. Verify that all files from the [Preparations](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/preparations) section and the step above have been created and are present:

   ```sh
   root@bcm11-headnode:~# ls -1 /cm/shared/runai/*

   credential.jwt
   netop-values.yaml
   nic-cluster-policy.yaml
   sriov-node-pool-config.yaml
   combined-ippools-gb200.yaml
   combined-sriovibnet-gb200.yaml
   dra-test-gb200.yaml
   ib-test-gb200.yaml
   full-chain.pem
   private.key
   ca.crt # optional
   ```
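The presence check above can also be scripted. A minimal sketch (the file set shown assumes a GB200 deployment; substitute the `*-b200.yaml` files on DGX B200, and omit `ca.crt` if you are not using a local CA):

```sh
# Report any required file that is missing from /cm/shared/runai
missing=0
for f in credential.jwt netop-values.yaml nic-cluster-policy.yaml \
         sriov-node-pool-config.yaml combined-ippools-gb200.yaml \
         combined-sriovibnet-gb200.yaml dra-test-gb200.yaml \
         ib-test-gb200.yaml full-chain.pem private.key; do
  [ -f "/cm/shared/runai/$f" ] || { echo "missing: $f"; missing=$((missing+1)); }
done
echo "total missing: $missing"
```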
5. Run the following command to initiate deployment via an interactive command-line assistant:

   ```sh
   cm-kubernetes-setup
   ```
6. Select **Deploy Kubernetes installation wizard** and click **Ok** to proceed. If `cm-kubernetes-setup` is being run on a GB200 system, refer to the second screenshot:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-94b9374dc9997bf97e909d2a01f1e91ca4313c8a%2Funknown.png?alt=media" alt=""><figcaption></figcaption></figure>

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-7f3b06e084f16f6f2853318e7bc0a9d943a4f3f9%2Fcm-kubernetes-gb200.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

7. Select the relevant **Kubernetes** version. This guide is based on Base Command Manager 11.25.08, which requires Kubernetes 1.32. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-cf90e1918dbbd868642c3cec0ea27a5fbc1295ad%2Funknown.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

8. The next step asks whether a Docker Hub registry mirror is available. Using a local registry mirror is recommended when one is available. For the purposes of this guide, leave the default value (blank) and click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-1edf18ef92405e5cf598266f715046a0f4a45a17%2Fdockerhub-registry.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

9. Insert values for the new Kubernetes cluster that NVIDIA Run:ai will be installed into. Click **Ok** to proceed:
   * The Kubernetes cluster name should be a short, unique name that can be used to distinguish between multiple clusters (e.g., `k8s-user`).
   * The `k8s-user.local` value for the Kubernetes domain name is the default for internal (within the Kubernetes cluster) name resolution and service discovery. It should be unique so it can be distinguished from the NMC cluster on DGX GB200 and later SuperPODs. Common practice is to avoid using the same domain for the internal Kubernetes domain name and the externally referenceable FQDN, to avoid potential name-resolution inconsistencies.
   * The Kubernetes external FQDN field refers to the domain name that the Kubernetes API server will be proxied at and is automatically populated by BCM. If a valid name record (FQDN) for the BCM head node was established beforehand, it should be entered here. See the reference architecture section of the [BCM Containerization Manual](https://support.brightcomputing.com/manuals/11/containerization-manual.pdf) for details on how this is implemented via an NGINX proxy.
   * The Service network base address, Service network netmask bits, Pod network base address, and Pod network netmask bits fields provide CIDR ranges for the Kubernetes service and pod networks. These are pre-populated from private, non-routable ranges, taking care to avoid overlap with networks known to BCM.

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-56041e63713b2841fdaa768108c0df97e83e9908%2Funknown.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>
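Before entering the external FQDN, it can help to confirm that the name resolves from the head node. A minimal sketch (`runai.example.com` is a placeholder, not a real record):

```sh
# getent consults /etc/hosts as well as DNS, matching how most tools
# resolve names on the node
getent hosts runai.example.com || echo "runai.example.com does not resolve yet"
```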

10. The next step asks about exposing the Kubernetes API server to the external network. Select **no** and click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-e13300edcbbc3ed40fa4761e7493195182ebd5a7%2Funknown.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

11. The preferred internal network is used for Kubernetes communication between control plane and worker nodes. Select **internalnet** for the preferred internal network and click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-8f96f7013c8f96b57258ad449aaf10686cae79cc%2Finternal-network.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

12. Select 3 or more Kubernetes master nodes. These should be the same nodes assigned to the control plane category. The screenshot below is for illustration only - the correct category should be `k8s-system-user`. See the [BCM node categories](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/preparations#bcm-node-categories) section for more information. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-49d8af1bb5e9524f7cd5f7f031ce510eb04ac79c%2Fkubernetes-master-nodes.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

{% hint style="info" %}
**Note**

To ensure high availability and prevent a single point of failure, it is recommended to configure at least three Kubernetes master nodes in your cluster. The nodes selected at this stage will be employed to serve the needs of the control plane and should be located on CPU nodes. In contemporary Kubernetes versions, “master nodes” are referred to as control plane nodes.
{% endhint %}

13. Select the worker node categories to operate as the Kubernetes worker nodes. The screenshot below is for illustration only - the correct category should be either `dgx-gb200-k8s` or `dgx-b200-k8s` and `k8s-system-user`. See the [BCM node categories](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/preparations#bcm-node-categories) section for more information. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-fbda2d871c82b128ed6a7698939b96a1d110c6e2%2Fkubernetes-workers.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

{% hint style="info" %}
**Note**

Both the control plane nodes and the DGX nodes must be selected. Selecting the control plane nodes here allows select NVIDIA Run:ai services to run on the control plane nodes. If the cluster configuration has dedicated NVIDIA Run:ai system nodes as described in the optional [Node Category](#bcm-node-categories) section, select that category here instead.
{% endhint %}

14. Skip the selection of individual Kubernetes worker nodes (the category selected in the previous step will be used instead). Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-36e68e0445898e35d9284a0768d77379797c51ec%2Findividual-worker-nodes.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

{% hint style="info" %}
**Note**

In steps 13 and 14 above, you must select either:

* A node category only (`k8s-system-user` in this guide)
* Individual Kubernetes nodes only (not generally recommended)
* A combination of both
{% endhint %}

15. Select the nodes on which to deploy [etcd](https://kubernetes.io/docs/concepts/architecture/#etcd). Make sure to select the same three nodes as the Kubernetes control plane nodes (Step 12). Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-e09b3ef3470a0b1abcccaf016409a527c875fcc9%2Fetcd-nodes.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

16. Leave the API server proxy port and [etcd](https://kubernetes.io/docs/concepts/architecture/#etcd) spool directory values at their prepopulated values (do not modify them). Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-4fd181833d05cabf64746dd425e9f4b26d0b2290%2Fmain-kubernetes-components.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

{% hint style="info" %}
**Note**

If there are multiple Kubernetes clusters being managed by BCM (such as in the case of DGX GB200 and later SuperPODs), the default proxy port value will automatically be incremented to avoid an overlap with existing clusters and may not match the screenshot.
{% endhint %}

17. Select **Calico** as the Kubernetes network plugin. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-aa3b8423a9b4bf1e3f2ca7058cafce6a3c48e1da%2Fnetwork-plugin-cni.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

18. Select **no** and click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-071a55d5b6d637a335ba9a6b5efdd697fd22d168%2Fkyverno-policy.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

19. The components selected in this screen represent those required by NVIDIA Run:ai for a self-hosted installation. Select the operator and NVIDIA Run:ai self-hosted options as depicted below. Click **Ok** to proceed:
    * NVIDIA GPU Operator
    * Ingress NGINX Controller
    * Knative Operator
    * KubeFlow Training operator
    * Kubernetes Metrics Server
    * Kubernetes State Metrics
    * LeaderWorkerSet operator
    * MetalLB
    * Network Operator
    * Prometheus Adapter
    * Prometheus Operator Stack
    * Run:ai (self-hosted)

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-952b79754b58cf12eb72fe6cc5426f4d65545815%2Funknown.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

20. Provide the NVIDIA Run:ai configuration with the below and click **Ok** to proceed:
    * **Run:ai Registry Credentials** - Enter the path to a file containing the base64-encoded NVIDIA token. Alternatively, the base64-encoded value can be pasted in directly.
    * **Run:ai Control Plane Domain Name (FQDN)** - Enter the Run:ai control plane’s fully qualified domain name (e.g., `runai.example.com`). This value should differ from the FQDN entered on the “Insert basic values” Kubernetes setup screen in Step 9. It should match the name used when creating the certificates (and should not be the same as the BCM head node hostname).
    * **Local CA Cert Path (.crt or .pem)** - Path to the root CA certificate file if you are using a local CA–issued certificate (common in testing or internal environments). It’s optional if using a certificate from a public CA.
    * **Domain Cert Path (.crt/.pem)** - Path to the full-chain certificate for your domain (the domain’s leaf certificate followed by any intermediate certificates).
    * **Domain Cert Key Path (.key)** - Path to the private key that matches the domain certificate.

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-62fe3b0baa451a5f6b076c32b886f1955c5405a9%2Funknown.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>
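As a sketch of the registry-credentials format described above (the token value below is a placeholder, not a real NGC token):

```sh
# Write a base64-encoded token to a file that can be supplied to the wizard
printf '%s' 'nvapi-PLACEHOLDER-TOKEN' | base64 -w0 > /tmp/credential.jwt

# Round-trip check: decoding should reproduce the original token
base64 -d /tmp/credential.jwt
```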

{% hint style="info" %}
**Note**

It’s recommended to save all certificates, configuration files, and deployment artifacts into a persistent and accessible location in case of redeployment. The `/cm/shared/runai/` directory referred to in this guide resides on a shared mount point and would be a suitable location. See the [TLS certificates](#tls-certificates) section for additional clarification.
{% endhint %}
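One way to catch certificate problems before the wizard consumes them is to verify that the domain certificate and private key are actually a pair. The sketch below generates a throwaway self-signed pair to demonstrate the check; for a real deployment, point the two digest commands at `full-chain.pem` and `private.key` in `/cm/shared/runai/`:

```sh
# Demo only: create a short-lived self-signed certificate and key
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=runai.example.com" \
  -keyout /tmp/demo.key -out /tmp/demo.crt 2>/dev/null

# The two digests must match if the certificate and key belong together
openssl x509 -in /tmp/demo.crt -noout -pubkey | openssl sha256
openssl pkey -in /tmp/demo.key -pubout | openssl sha256
```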

21. Select **yes** to install NVIDIA Run:ai components. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-92f3e4e9747c01d35239901bf9041d46e336d95d%2Frunai-cluster.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

{% hint style="info" %}
**Note**

In this version of the BCM installation assistant, a warning dialog indicating an ssh issue will follow - disregard it and click **Ok** to proceed. Other errors at this stage may indicate a problem with the supplied certificates.
{% endhint %}

22. Select the `k8s-system-user` node category for the NVIDIA Run:ai control plane nodes and click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-77254493ee18426577812e403690ed7eda18fb69%2Funknown.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

23. Select the required **NVIDIA GPU Operator** version (v25.3.2). Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-44b7b0925dcce3c7cb4602e7ad77e6d1a7979f72%2Fgpu-operator-version.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

24. Select the required **Network Operator** version (v25.4.0). Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-ab51731108be130b2a52757f392e2de100e7638e%2Funknown.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

25. Select the required **NVIDIA Run:ai version**. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-200401704d9351af441263c665056b6b68e801e4%2Fself-hosted-version.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

26. When prompted to supply a **Custom YAML config** for the GPU Operator leave the default (blank) and click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-b50b59b4c8e78cf87ad139d6f5477d57a45f8495%2Fgpu-operator-yaml.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

27. Configure the NVIDIA GPU Operator by selecting the following configuration parameters. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-a61b07d2552c6f2d54de1673827682a5910d466d%2Funknown.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

28. Supply the path to the `netop-values.yaml` file that was created before. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-7d554756d352b5173e5746d2ebebd5d9d601812f%2Fnetop-value.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

29. Click **Ok** on the MetalLB IP address pools page; the requirements for NVIDIA Run:ai will be set up automatically:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-8052c4538cf9298914a271b332c9f271295863b4%2Funknown.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

30. Specify the ingress IP addresses prepared as documented in the [Pre-installation checklist](#pre-installation-checklist) section. The mention of MetalLB here indicates that these will be set up as part of a load-balanced pool and assigned to each respective ingress. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-b449ba313959b92f6eee1b4f3bf7dca1ce8535dd%2Fload-balancer.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

31. Select **no** to expose the **Kubernetes Ingress** to the default HTTPS port. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-0b1069586141e81f1f831adfdd701666267babb7%2Fexpose-kubernetes-ingress.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

32. Leave the node ports for the Ingress NGINX Controller at the pre-populated values (do not modify them) and click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-792f70bd0b4265a66aa18693c6a900ba887b20f6%2Fnginx-controller.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

33. Select the serving option in the Knative Operator components dialog. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-534d7ea5bc3c162c8f3676806e2e0bf85e9e3fd0%2Fknative-operator.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

34. If deploying onto an A100 or H100 only cluster, select **yes**. If deploying on any other cluster configuration select **no**. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-d535a96ea592c3c85060cc1a0763b454831a0be5%2Fnetwork-operator-policies.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

{% hint style="info" %}
**Note**

If applicable, Network Operator policies for DGX B200 or DGX GB200 systems will be applied in a [post-deployment](#post-wizard-deployment-steps) step described below.
{% endhint %}

35. If **yes** was selected for the previous step, select the appropriate option for the cluster and click **Ok** to proceed. If **no** was selected for the previous step, this page will not appear:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-fedce1808ee2bd20d4d54dab2d009b865d5d99dc%2Fnetwork-policies-dgx.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

36. Select **yes** to install the **Permission Manager**. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-03606d56eb3c62de85449ba12622c98702e4fda2%2Fpermissions-manager.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

{% hint style="info" %}
**Note**

The BCM Permission Manager coordinates security policy, system accounts, RBAC, and configures Kubernetes to employ BCM LDAP for user accounts. BCM User Accounts, however, are not automatically represented within NVIDIA Run:ai. For assistance with configuring NVIDIA Run:ai, see [Set Up SSO with OpenID Connect](https://run-ai-docs.nvidia.com/self-hosted/2.22/infrastructure-setup/authentication/sso/openidconnect). For more information on the BCM Permission Manager, see [Containerization Manual documentation](https://docs.nvidia.com/base-command-manager/index.html#product-manuals).
{% endhint %}

37. Select **Local path** as the Kubernetes StorageClass. Ensure that both **enabled** and **default** are specified. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-05890fd3405159b0d0b5c913b2728c0ceb648113%2Fkubernetes-storageclass.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

{% hint style="info" %}
**Note**

The indication “local path” in the installation assistant may imply that local storage is employed, but those paths are pointing to NFS mountpoints. These were mounted as part of standard BCM node provisioning (e.g. `/cm/shared/home`).
{% endhint %}

38. Configure the CSI Provider (`local-path-provisioner`) to employ shared storage (`/cm/shared/apps/kubernetes/k8s-user/var/volumes` as a default). Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-4a0331e64db18678f2fcb39d81dcc3336f522722%2Funknown.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

39. Select **yes** to enable local persistent storage for Grafana. Click **Ok** to proceed:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-de3a9f3db43d48924021056a39c6c24c3327814a%2Fpersistent-storage.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

40. Select **Save config**, set an accessible location for the config file (for example: `/cm/shared/runai/cm-kubernetes-setup.conf`) with the rest of the config files, and then click **Ok**. Select **Exit** and **Ok** to complete the wizard and return to the terminal:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-8adb4621024fa290294eda858e8bd9409df5dd7a%2Funknown.png?alt=media" alt="" width="516"><figcaption></figcaption></figure>

The deployment process may take an extended period (60+ minutes). To prevent interruptions, failures, or network outages from disrupting it, it’s recommended to run the deployment from a persistent terminal session such as tmux or screen.

```sh
# Start a new screen session named "install_runai" 
# This allows detach/reattaching safely during the installation
screen -S install_runai

# Inside the screen session: run the cluster setup using the configuration file 
cm-kubernetes-setup -c /cm/shared/runai/cm-kubernetes-setup.conf
```

{% hint style="info" %}
**Note**

During the deployment process all nodes that are members of the new Kubernetes cluster will be rebooted.
{% endhint %}

## Connect to NVIDIA Run:ai User Interface <a href="#connect-to-nvidia-run-ai-user-interface" id="connect-to-nvidia-run-ai-user-interface"></a>

1. Open your browser and go to: `https://<DOMAIN>`
2. Log in using the default credentials:
   * User: `test@run.ai`
   * Password: `Abcd!234`

You will be prompted to change the password.

## Post-wizard Deployment Steps

After the BCM installation assistant completes, additional steps are required.

If multiple Kubernetes clusters are configured in this instance of BCM, load the correct Kubernetes module before running all post-wizard commands:

```bash
module unload kubernetes
module load kubernetes/k8s-user
```

### MPI Operator

Install the MPI Operator v0.6.0 or later by running the following command:

```bash
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml --force-conflicts

# Validate MPIJob CRD is installed
kubectl get crd mpijobs.kubeflow.org  
> NAME                   CREATED AT
> mpijobs.kubeflow.org   2025-09-10T20:57:42Z
```

### NVIDIA Dynamic Resource Allocation (DRA) Driver

The [NVIDIA DRA Driver for GPUs](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/dra-cds.html) extends how NVIDIA GPUs are consumed within Kubernetes. This is required to enable secure Internode Memory Exchange (IMEX) on Multi-Node NVLink (MNNVL) systems (e.g. GB200 and similar) for Kubernetes workloads and should be included with all NVIDIA GPU systems.

1. Install using Helm:

   ```bash
   helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
   && helm repo update
   helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
   --version="25.3.0" \
   --create-namespace \
   --namespace nvidia-dra-driver-gpu \
   --set nvidiaDriverRoot=/ \
   --set resources.gpus.enabled=false
   ```
2. **GB200 only** - Create a file in `/cm/shared/runai` from [dra-test-gb200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/dra-test-gb200.yaml) and update the clique ID to match the clique ID from the cluster. Note that the following test addresses single rack NVL72 clusters. For multi-rack systems, you’ll need to adjust `podAffinity` (e.g. `topologyKey: nvidia.com/gpu.clique`).

   ```bash
   kubectl describe nodes | grep nvidia.com/gpu.clique=
   >                    nvidia.com/gpu.clique=f84d133c-bbc9-55fd-b1ff-ffffc7ef6783.23322

   # For DGX GB200 systems
   kubectl apply -f /cm/shared/runai/dra-test-gb200.yaml
   ```
3. **GB200 only** - Validate the test successfully completed and inspect the logs of the launcher:

   ```bash
   # For GB200
   kubectl get pods
   > NAME                              READY   STATUS      RESTARTS   AGE
   > nvbandwidth-test-launcher-snb82   0/1     Completed   0          72s

   kubectl logs nvbandwidth-test-launcher-snb82
   ```
4. **GB200 only** - Cleanup test:

   ```bash
   #For GB200
   kubectl delete -f dra-test-gb200.yaml
   ```

#### Enable DRA and Multi-Node NVLink

The default NVIDIA Run:ai configuration does not expose DRA features. After installing the DRA components, this can be enabled by modifying the `runaiconfig` in the cluster. See [Advanced cluster configurations](https://run-ai-docs.nvidia.com/self-hosted/2.22/infrastructure-setup/advanced-setup/cluster-config) for more details:

```bash
# Edit the runaiconfig object to toggle GPUNetworkAccelerationEnable
# to true and adjust tolerations for the Kubernetes control plane

kubectl patch runaiconfig runai \
  -n runai \
  --context=kubernetes-admin@k8s-user \
  --type='merge' \
  -p '{
    "spec": {
      "workload-controller": {
        "GPUNetworkAccelerationEnabled": true
      },
      "global": {
        "tolerations": [
          {
            "key": "node-role.kubernetes.io/control-plane",
            "operator": "Exists",
            "effect": "NoSchedule"
          }
        ]
      }
    }
  }'
```

Instructions for validating the change and reverting if necessary:

```bash
# Validate the patch was applied successfully

kubectl get runaiconfig runai \
  -n runai \
  --context=kubernetes-admin@k8s-user \
  -o custom-columns=GPUAccelEnabled:.spec.workload-controller.GPUNetworkAccelerationEnabled,Tolerations:.spec.global.tolerations

# To revert the runaiconfig object change

kubectl patch runaiconfig runai -n runai --type='merge' -p '{
  "spec": {
    "workload-controller": {
      "GPUNetworkAccelerationEnabled": false
    },
    "global": {
      "tolerations": null
    }
  }
}'
```

### Configure the Network Operator for B200 and GB200 Systems

In version 11.25.08 of the BCM installation assistant, the Network Operator requires additional configuration on DGX B200 and GB200 SuperPOD / BasePOD systems. While the operator is installed in a preceding step, it does not automatically initialize or configure SR-IOV and secondary network plugins.

The following CRD resources must be created in exactly the order below:

* SR-IOV network policies for each NVIDIA InfiniBand NIC
* An nvIPAM IP address pool
* SR-IOV InfiniBand networks

1. Create SR-IOV network node policies using the [nic-cluster-policy.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/nicclusterpolicy.yaml) that was created in an earlier step:

   ```bash
   kubectl apply -f /cm/shared/runai/nic-cluster-policy.yaml
   ```
2. Create an IPAM IP Pool using the respective [combined-ippools-gb200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/combined-ippools-gb200.yaml) or [combined-ippools-b200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/combined-ippools-b200.yaml) that were created in an earlier step:

   ```bash
   kubectl apply -f /cm/shared/runai/combined-ippools-gb200.yaml
   ```
3. Create the SR-IOV IB networks using the respective [combined-sriovibnet-gb200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/combined-sriovibnet-gb200.yaml) or [combined-sriovibnet-b200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/combined-sriovibnet-b200.yaml) that were created in an earlier step:

   ```bash
   kubectl apply -f /cm/shared/runai/combined-sriovibnet-gb200.yaml
   ```

{% hint style="info" %}
**Note**

You may need to modify the interface names for non-DGX systems.
{% endhint %}
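
For orientation, the interface names typically live in the `nicSelector.pfNames` field of the SR-IOV node policies. The sketch below is illustrative only; the real objects come from the supplied YAML files, and the names shown here are hypothetical:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriovib-resource-a            # hypothetical policy name
  namespace: network-operator
spec:
  resourceName: sriovib_resource_a    # surfaces in node capacity under the operator's resource prefix
  nicSelector:
    pfNames: ["ibp24s0"]              # adjust PF interface names for non-DGX hardware
  numVfs: 16
  isRdma: true
  linkType: IB
  deviceType: netdevice
```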

4. Create the SR-IOV node pool configuration using the [sriov-node-pool-config.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/sriov-node-pool-config.yaml):

   ```bash
   kubectl apply -f /cm/shared/runai/sriov-node-pool-config.yaml
   ```

{% hint style="info" %}
**Note**

This step typically reconfigures NICs and may trigger a node reboot. The supplied YAML sets the `maxUnavailable` field to 20%; adjust this value to match your operational requirements. A value of 1 serializes the rollout, so a single node failure blocks all remaining nodes. For a small lab deployment, 100% may be appropriate, since no single machine failure can then block the remaining nodes. For larger clusters, a lower percentage effectively splits the rollout into batches.
{% endhint %}
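
The `maxUnavailable` value is set in the `SriovNetworkPoolConfig` resource. A sketch of the relevant fragment (field names per the SR-IOV Network Operator CRD; the pool name and node selector are hypothetical):

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: worker-pool              # hypothetical pool name
  namespace: network-operator
spec:
  maxUnavailable: "20%"          # how many nodes may be reconfigured (and possibly rebooted) at once
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""   # hypothetical selector
```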

5. Validate by describing one of the DGX nodes and checking for SR-IOV devices:

   ```bash
   # Describe a DGX worker node
   kubectl describe node <dgx-node> --context=kubernetes-admin@k8s-user | grep sriovib


   # Example output (each resource appears twice: once under Capacity, once under Allocatable)
     nvidia.com/sriovib_resource_a:  16
     nvidia.com/sriovib_resource_b:  16
     nvidia.com/sriovib_resource_c:  16
     nvidia.com/sriovib_resource_d:  16
     nvidia.com/sriovib_resource_a:  16
     nvidia.com/sriovib_resource_b:  16
     nvidia.com/sriovib_resource_c:  16
     nvidia.com/sriovib_resource_d:  16

   # Check the state of SR-IOV Nodes
   kubectl get -n network-operator sriovnetworknodestate --context=kubernetes-admin@k8s-user
   # Example Output
   NAME        SYNC STATUS
   <dgx_node>   Succeeded
   ```

{% hint style="info" %}
**Note**

It might take several minutes for these settings to take effect. If the SR-IOV network config daemon changes the NIC configuration, a node reboot will occur.
{% endhint %}

6. Validate by running the DGX SuperPOD platform-specific tests:
   1. For GB200 - [ib-test-gb200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/ib-test-gb200.yaml):

      ```bash
      # DGX GB200
      kubectl apply -f /cm/shared/runai/ib-test-gb200.yaml -n default

      # Example output
      MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
      CUDA Runtime Version: 12080
      CUDA Driver Version: 12080
      Driver Version: 570.172.08
      Process 0 (nvbandwidth-test-worker-0): device 0: NVIDIA GB200 (00000008:01:00)
      Process 1 (nvbandwidth-test-worker-0): device 1: NVIDIA GB200 (00000009:01:00)
      Process 2 (nvbandwidth-test-worker-0): device 2: NVIDIA GB200 (00000018:01:00)
      Process 3 (nvbandwidth-test-worker-0): device 3: NVIDIA GB200 (00000019:01:00)
      Process 4 (nvbandwidth-test-worker-1): device 0: NVIDIA GB200 (00000008:01:00)
      Process 5 (nvbandwidth-test-worker-1): device 1: NVIDIA GB200 (00000009:01:00)
      Process 6 (nvbandwidth-test-worker-1): device 2: NVIDIA GB200 (00000018:01:00)
      Process 7 (nvbandwidth-test-worker-1): device 3: NVIDIA GB200 (00000019:01:00)

      Running host_to_device_memcpy_ce.
      memcpy CE CPU(row) -> GPU(column) bandwidth (GB/s)
                 0         1         2         3
       0     85.59     95.33    200.73    191.27

      SUM host_to_device_memcpy_ce 572.93
      ```
   2. For B200 - [ib-test-b200.yaml](https://run-ai-docs.nvidia.com/self-hosted/2.22/getting-started/installation/bcm-install/ib-test-b200.yaml):

      ```bash
      # DGX B200
      kubectl apply -f /cm/shared/runai/ib-test-b200.yaml -n default

      kubectl logs -n default nccl-test-launcher-hdm54 
      Warning: Permanently added '[nccl-test-worker-0.nccl-test.default.svc]:2222' (ED25519) to the list of known hosts.
      Warning: Permanently added '[nccl-test-worker-1.nccl-test.default.svc]:2222' (ED25519) to the list of known hosts.
      # nThread 1 nGpus 1 minBytes 16 maxBytes 17179869184 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
      #
      # Using devices
      #  Rank  0 Group  0 Pid     46 on nccl-test-worker-0 device  0 [0x1b] NVIDIA B200
      #  Rank  1 Group  0 Pid     47 on nccl-test-worker-0 device  1 [0x43] NVIDIA B200
      #  Rank  2 Group  0 Pid     48 on nccl-test-worker-0 device  2 [0x52] NVIDIA B200
      #  Rank  3 Group  0 Pid     49 on nccl-test-worker-0 device  3 [0x61] NVIDIA B200
      #  Rank  4 Group  0 Pid     50 on nccl-test-worker-0 device  4 [0x9d] NVIDIA B200
      #  Rank  5 Group  0 Pid     52 on nccl-test-worker-0 device  5 [0xc3] NVIDIA B200
      #  Rank  6 Group  0 Pid     55 on nccl-test-worker-0 device  6 [0xd1] NVIDIA B200
      #  Rank  7 Group  0 Pid     59 on nccl-test-worker-0 device  7 [0xdf] NVIDIA B200
      #  Rank  8 Group  0 Pid     46 on nccl-test-worker-1 device  0 [0x1b] NVIDIA B200
      #  Rank  9 Group  0 Pid     47 on nccl-test-worker-1 device  1 [0x43] NVIDIA B200
      #  Rank 10 Group  0 Pid     48 on nccl-test-worker-1 device  2 [0x52] NVIDIA B200
      #  Rank 11 Group  0 Pid     49 on nccl-test-worker-1 device  3 [0x61] NVIDIA B200
      #  Rank 12 Group  0 Pid     50 on nccl-test-worker-1 device  4 [0x9d] NVIDIA B200
      #  Rank 13 Group  0 Pid     51 on nccl-test-worker-1 device  5 [0xc3] NVIDIA B200
      #  Rank 14 Group  0 Pid     54 on nccl-test-worker-1 device  6 [0xd1] NVIDIA B200
      #  Rank 15 Group  0 Pid     58 on nccl-test-worker-1 device  7 [0xdf] NVIDIA B200
      #
      #                                                              out-of-place                       in-place          
      #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
      #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
                16             4     float     sum      -1    38.60    0.00    0.00      0    49.18    0.00    0.00      0
                32             8     float     sum      -1    42.23    0.00    0.00      0    50.96    0.00    0.00      0
                64            16     float     sum      -1    47.44    0.00    0.00      0    42.18    0.00    0.00      0
               128            32     float     sum      -1    39.51    0.00    0.01      0    42.78    0.00    0.01      0
               256            64     float     sum      -1    40.16    0.01    0.01      0    43.35    0.01    0.01      0
               512           128     float     sum      -1    39.22    0.01    0.02      0    44.53    0.01    0.02      0
              1024           256     float     sum      -1    42.81    0.02    0.04      0    44.47    0.02    0.04      0
              2048           512     float     sum      -1    40.63    0.05    0.09      0    52.70    0.04    0.07      0
              4096          1024     float     sum      -1    46.76    0.09    0.16      0    52.63    0.08    0.15      0
              8192          2048     float     sum      -1    47.22    0.17    0.33      0    53.69    0.15    0.29      0
             16384          4096     float     sum      -1    49.02    0.33    0.63      0    50.96    0.32    0.60      0
             32768          8192     float     sum      -1    54.24    0.60    1.13      0    53.88    0.61    1.14      0
             65536         16384     float     sum      -1    59.05    1.11    2.08      0    59.53    1.10    2.06      0
            131072         32768     float     sum      -1    62.04    2.11    3.96      0    63.99    2.05    3.84      0
            262144         65536     float     sum      -1    106.4    2.46    4.62      0    103.1    2.54    4.77      0
            524288        131072     float     sum      -1    107.5    4.88    9.15      0    102.8    5.10    9.56      0
           1048576        262144     float     sum      -1    108.8    9.64   18.07      0    106.6    9.83   18.44      0
           2097152        524288     float     sum      -1    112.7   18.60   34.88      0    106.6   19.67   36.88      0
           4194304       1048576     float     sum      -1    118.2   35.49   66.54      0    116.6   35.97   67.44      0
           8388608       2097152     float     sum      -1    150.2   55.85  104.72      0    153.8   54.54  102.26      0
          16777216       4194304     float     sum      -1    187.5   89.46  167.73      0    188.1   89.19  167.23      0
          33554432       8388608     float     sum      -1    250.5  133.97  251.20      0    251.6  133.35  250.02      0
          67108864      16777216     float     sum      -1    395.9  169.52  317.86      0    395.1  169.87  318.50      0
         134217728      33554432     float     sum      -1    618.9  216.85  406.59      0    620.8  216.20  405.37      0
         268435456      67108864     float     sum      -1   1073.4  250.08  468.90      0   1074.2  249.89  468.54      0
         536870912     134217728     float     sum      -1   1977.2  271.53  509.13      0   1976.0  271.69  509.42      0
        1073741824     268435456     float     sum      -1   3713.5  289.14  542.15      0   3710.3  289.40  542.62      0
        2147483648     536870912     float     sum      -1   7245.1  296.40  555.76      0   7226.3  297.18  557.20      0
        4294967296    1073741824     float     sum      -1    14049  305.71  573.20      0    13939  308.13  577.75      0
        8589934592    2147483648     float     sum      -1    27360  313.97  588.68      0    27292  314.74  590.13      0
       17179869184    4294967296     float     sum      -1    53941  318.49  597.17      0    53953  318.43  597.05      0
      # Out of bounds values : 0 OK
      # Avg bus bandwidth    : 168.649 


      # Clean up after validating via ib-test-b200.yaml
      kubectl delete -f /cm/shared/runai/ib-test-b200.yaml -n default 
      ```

{% hint style="info" %}
**Note**

The Network Operator will restart the DGX nodes if the number of Virtual Functions in the SR-IOV Network Policy file does not match the NVIDIA/Mellanox firmware configuration.
{% endhint %}

## (Optional) Apply Security Policies

By default, the BCM Kubernetes deployment uses permissive security policies to ease use in development environments. For production clusters or secure environments, it’s recommended to take additional steps to harden the cluster, such as configuring the permission manager and applying Kyverno and Calico policies.

For deployments of NVIDIA Run:ai as a part of NVIDIA Mission Control, please reach out to your NVIDIA representative for the latest example configurations and suggested policies. The [Kubernetes Security Hardening](https://docs.nvidia.com/mission-control/docs/nmc-software-installation-guide/2.0.0/nmc-kube-security-guide.html) documentation in the Mission Control software installation guide provides guidance on applying them.
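
As a flavor of what such hardening looks like, the sketch below is a Kyverno `ClusterPolicy` that audits privileged containers. It is illustrative only (the policy name and scope are hypothetical); use the example configurations from your NVIDIA representative for actual deployments:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged          # hypothetical policy name
spec:
  validationFailureAction: Audit     # switch to Enforce once validated
  rules:
    - name: deny-privileged-containers
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            containers:
              - =(securityContext):
                  =(privileged): "false"
```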

## (Optional) Create Node Pools

See [Node pools](https://run-ai-docs.nvidia.com/self-hosted/2.22/platform-management/aiinitiatives/resources/node-pools) to create and manage groups of nodes (either by predefined node label or administrator-defined node labels). This optional configuration step can be used for advanced deployment scenarios to allocate different resources across teams or projects.

## (Optional) Add Additional Users

See [Users](https://run-ai-docs.nvidia.com/self-hosted/2.22/infrastructure-setup/authentication/users) for steps on adding additional users beyond the initial <test@run.ai> account or connecting SSO.

## (Optional) Install the NVIDIA Run:ai Command Line

To obtain the command line binary, see the [Install and configure CLI](https://run-ai-docs.nvidia.com/self-hosted/2.22/reference/cli/install-cli) section.

### Test the command line tool installation

Validate the installation by running the following command:

```bash
runai version
```

{% hint style="info" %}
**Note**

If NVIDIA Run:ai had previously been installed via BCM, it may be necessary to update the command line version.
{% endhint %}

### Set the Control Plane URL

The following step is required for Windows users only. Linux and Mac clients are configured via the installation script.

Run the following command (substituting the NVIDIA Run:ai control plane FQDN value specified in previous steps) to create the `config.json` file in the default path:

```bash
runai config set --cp-url runai.example.com
```

Alternatively, the Base Command Manager installation assistant can generate this configuration with the following steps:

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-099e5a4144aa9147388924db54b3a68994ca11f0%2Funknown.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

<figure><img src="https://3765967871-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FlZD4O0TqiQ8mlhnxho0w%2Fuploads%2Fgit-blob-e8c64e14e65c8416d3f8b42a4219156f03a69821%2Funknown.png?alt=media" alt="" width="563"><figcaption></figcaption></figure>

## Validate NVIDIA Run:ai

To validate the installation, please refer to the quick start guides for deploying single-GPU training, multi-node training, single-GPU inference, and multi-GPU inference workloads. Certain NGC workloads may require adding [NGC API keys and Docker credentials](https://docs.nvidia.com/nemo/microservices/latest/set-up/manage-secrets/ngc-image-pull-secret.html) to the cluster.

1. Validate that the ingress IP for NVIDIA Run:ai inference is configured; `EXTERNAL-IP` should have the value configured in the prior MetalLB steps:

   ```bash
   kubectl get svc -n knative-serving kourier -o wide

   NAME      TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)                      AGE
   kourier   LoadBalancer   x.x.x.x          10.1.1.26      80:31038/TCP,443:30783/TCP   8h
   ```
2. Validate distributed training workloads, see [Run your first distributed training workload](https://run-ai-docs.nvidia.com/self-hosted/2.22/workloads-in-nvidia-run-ai/using-training/distributed-training/quick-starts/distributed-training-quickstart):

   ```bash
   # Example command
   runai training mpi submit distributed-training \
     -g 4 \
     -p training \
     --node-pools nvl72rackb06 \
     -i ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163 \
     --workers 2 \
     --slots-per-worker 4 \
     --run-as-uid 1000 \
     --ssh-auth-mount-path /home/mpiuser/.ssh \
     --clean-pod-policy Running \
     --master-command mpirun \
     --master-args "--bind-to core --map-by ppr:4:node -np 8 --report-bindings -q nvbandwidth -t multinode_device_to_device_memcpy_read_ce" \
     --command -- /usr/sbin/sshd -De -f /home/mpiuser/.sshd_config
   ```
3. Validate distributed inference workloads, see [Run your first custom inference workload](https://run-ai-docs.nvidia.com/self-hosted/2.22/workloads-in-nvidia-run-ai/using-inference/quick-starts/inference-quickstart):

   ```json
   {
       "name": "distributed-vllm",
       "projectId": "4501034",
       "clusterId": "c7cd67df-c309-45ac-9056-5a04d074617d",
       "spec": {
           "workers": 1,
           "replicas": 1,
           "servingPort": {
               "port": 8000,
               "exposedUrl": "http://vllm.infernece-calorado.runailabs-ps.com/"
           },
           "leader": {
               "image": "vllm/vllm-openai:latest-aarch64",
               "command": "sh -c \"bash /vllm-workspace/examples/online_serving/multi-node-serving.sh leader --ray_cluster_size=$(LWS_GROUP_SIZE); python3 -m vllm.entrypoints.openai.api_server --port 8000 --model meta-llama/Meta-Llama-3.1-8B-Instruct --tensor-parallel-size 4 --pipeline_parallel_size 2\"",
               "environmentVariables": [
                   {
                       "name": "NCCL_MNNVL_ENABLE",
                       "value": "0"
                   },
                   {
                       "name": "HF_TOKEN",
                       "value": "<HF_TOKEN>"
                   }
               ],
               "compute": {
                   "largeShmRequest": true,
                   "gpuDevicesRequest": 4
               }
           },
           "worker": {
               "image": "vllm/vllm-openai:latest-aarch64",
               "command": "sh -c \"bash /vllm-workspace/examples/online_serving/multi-node-serving.sh worker --ray_address=$(LWS_LEADER_ADDRESS)\"",
               "environmentVariables": [
                   {
                       "name": "NCCL_MNNVL_ENABLE",
                       "value": "0"
                   },
                   {
                       "name": "HF_TOKEN",
                       "value": "<HF_TOKEN>"
                   }
               ],
               "compute": {
                   "largeShmRequest": true,
                   "gpuDevicesRequest": 4
               }
           }
       }
   }
   ```

## Troubleshooting Common Issues

<details>

<summary>Slow installation</summary>

Provide a registry mirror when requested in the wizard. If one isn’t available, authenticated access to Docker Hub can avoid potential [rate limiting](https://docs.docker.com/docker-hub/usage/pulls/) for at least some of the artifact pulls:

```bash
# Authenticate to Docker Hub prior to running cm-kubernetes-setup
echo -n "$DOCKERHUB_TOKEN" | docker login --username "$DOCKERHUB_USER" --password-stdin
```
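
Pull rate limits can also be mitigated by configuring a registry mirror directly in the Docker daemon. A minimal sketch of `/etc/docker/daemon.json`, assuming a hypothetical mirror URL:

```json
{
  "registry-mirrors": ["https://registry-mirror.example.com"]
}
```

Restart the Docker daemon after editing this file so the mirror takes effect.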

</details>

<details>

<summary>Delayed responsiveness from the cmsh command</summary>

If you encounter slow responses when running the `cmsh` command, try using the `cmsh-lazy-load` command (substituting it for `cmsh` wherever referenced in the documentation).

```sh
# Example: use of cmsh-lazy-load as substitute for cmsh
cmsh-lazy-load -c "device list; quit"
```

</details>

<details>

<summary>Failed installation</summary>

If the installation fails (which should be evident immediately), ensure that the DGX node kernel parameters are not inadvertently forcing [Cgroup v1 vs Cgroup v2](https://kubernetes.io/blog/2024/08/14/kubernetes-1-31-moving-cgroup-v1-support-maintenance-mode/):

```bash
# the following kernel parameters should not be present 
systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller
```
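
A quick way to check the running node is to inspect the kernel command line directly. A minimal sketch, assuming a Linux shell on the DGX node (the parameter names are those listed above):

```shell
# Report whether the running kernel was booted with cgroup v1 pinning parameters
if grep -Eq 'systemd\.unified_cgroup_hierarchy=0|systemd\.legacy_systemd_cgroup_controller' /proc/cmdline; then
    echo "WARNING: cgroup v1 kernel parameters present - remove them and reboot"
else
    echo "OK: no cgroup v1 pinning parameters found"
fi
```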

</details>

<details>

<summary>Shared Storage (NFS) configuration</summary>

If encountering issues indicating problems consistently accessing Persistent Volumes (PVs), ensure that NFSv3 is used for `/cm/shared` on both of the node categories used later in this guide. For example (please substitute the category name as appropriate for the DGX type):

```sh
# Force NFSv3 for the worker node category
cmsh -c "category use dgx-gb200-k8s; fsmounts; use /cm/shared; set mountoptions defaults,_netdev,vers=3; commit; quit"

# Force NFSv3 on the CPU nodes
cmsh -c "category use k8s-system-user; fsmounts; use /cm/shared; set mountoptions defaults,_netdev,vers=3; commit; quit"
```

</details>

<details>

<summary>MetalLB Load Balancer manual installation</summary>

Since CPU nodes are shared for the combined control plane elements in this architecture, BCM configures MetalLB and adjusts node labels so it can run. The following manual step would be required when deploying MetalLB in this manner:

```bash
# Remove the exclusion preventing nodes from receiving load balancer traffic
kubectl label nodes --all node.kubernetes.io/exclude-from-external-load-balancers-
```
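
For orientation, the `EXTERNAL-IP` handed to the kourier service comes from a MetalLB address pool. A hedged sketch of what such a pool looks like (names and address range are hypothetical; the BCM installation assistant generates the real objects):

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: inference-pool               # hypothetical pool name
  namespace: metallb-system
spec:
  addresses:
    - 10.1.1.20-10.1.1.30            # range containing the ingress IP
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: inference-l2                 # hypothetical advertisement name
  namespace: metallb-system
spec:
  ipAddressPools:
    - inference-pool
```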

{% hint style="info" %}
**Note**

The above is not required when using the BCM installation assistant. It’s included here to assist with alternative deployment approaches on DGX SuperPOD / BasePOD.
{% endhint %}

</details>

<details>

<summary>NVIDIA Run:ai exact version selection</summary>

The BCM installation assistant pulls the latest NVIDIA Run:ai patch release available for the selected minor version. The following can be used to determine which version will be installed:

```bash
helm repo add runai https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm search repo runai/control-plane --versions | grep "2.22"

runai/control-plane        	2.22.48      	2.22.48    	Run:ai Control Plane
```

</details>
