Install using Base Command Manager

This section explains the steps required to install the NVIDIA Run:ai cluster on a DGX Kubernetes Cluster using NVIDIA Base Command Manager (BCM).

NVIDIA Run:ai installer

The NVIDIA Run:ai installer is a wizard that simplifies the deployment of the NVIDIA Run:ai cluster on DGX. It is installed via the BCM cluster wizard when the cluster is created.

Note

For custom deployment options, see Install using Helm.

System and network requirements

Before installing the NVIDIA Run:ai cluster on a DGX system using BCM, ensure that your environment meets the System requirements and Network requirements.

The BCM cluster wizard deploys essential Software Requirements, such as the Kubernetes Ingress Controller, NVIDIA GPU Operator, and Prometheus, as part of the NVIDIA Run:ai Installer deployment. Additional optional software requirements for Distributed training and Inference require manual setup.

Tenant name

Your tenant name is predefined and supplied by NVIDIA Run:ai. Each customer is provided with a unique, dedicated URL in the format <tenant-name>.run.ai, which includes the required tenant name.
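As a small illustration of the URL format above, the tenant name can be read off the host part of the URL. The sketch below assumes the URL follows the <tenant-name>.run.ai pattern exactly; the function name tenant_from_url is hypothetical and is not part of any NVIDIA Run:ai tooling.

```shell
# Hypothetical helper: derive the tenant name from a URL of the form
# <tenant-name>.run.ai. The function name is illustrative only.
tenant_from_url() {
  local url host
  url="$1"
  host="${url#*://}"   # drop an optional scheme such as https://
  host="${host%%/*}"   # drop any path after the host
  host="${host%%:*}"   # drop an optional port
  case "$host" in
    *.run.ai)
      # Strip the fixed .run.ai suffix; what remains is the tenant name.
      printf '%s\n' "${host%.run.ai}"
      ;;
    *)
      echo "not a <tenant-name>.run.ai URL: $url" >&2
      return 1
      ;;
  esac
}
```

For example, calling tenant_from_url with the URL supplied by NVIDIA Run:ai prints the tenant name to enter later in the installer.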

Application secret key

An application secret key is required to connect the cluster to the NVIDIA Run:ai Platform. To get the application secret key, add a new cluster:

  1. Follow the Adding a new cluster setup instructions. Do not follow the Installation instructions.

  2. Once the cluster instructions are displayed, find the controlPlane.clientSecret flag in the displayed Helm command, then copy and save its value.
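If the displayed Helm command is saved to a text file, the value can also be pulled out with standard text tools rather than copied by hand. This is a minimal sketch that assumes the flag appears in the form controlPlane.clientSecret=<value> with no quoting or embedded whitespace; the function name extract_client_secret and the file name helm-command.txt are hypothetical.

```shell
# Hypothetical helper: read a Helm command from standard input and print
# the value that follows "controlPlane.clientSecret=".
# Assumes the value contains no whitespace and is not quoted.
extract_client_secret() {
  grep -o 'controlPlane\.clientSecret=[^[:space:]]*' \
    | head -n 1 \
    | cut -d= -f2-
}

# Usage (helm-command.txt is an assumed file holding the pasted command):
#   extract_client_secret < helm-command.txt
```

Treat the printed value as a credential: avoid leaving it in shell history or in world-readable files.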

Note

For DGX Bundle customers installing their first NVIDIA Run:ai cluster, the application secret key is provided by the NVIDIA Run:ai support team.

TLS certificate

TLS private and public keys for the cluster’s Fully Qualified Domain Name (FQDN) are required for HTTPS access to the cluster.

Note

The TLS certificate must be trusted. Self-signed certificates are not supported.
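Before entering the certificate in the installer, it may help to run two quick local checks with the openssl command-line tool. The sketch below is a heuristic, not the installer's own validation: it treats a certificate whose subject equals its issuer as self-signed, and it confirms that the certificate and private key belong together by comparing their public keys. The function names are hypothetical, and the checks do not prove that the issuing CA is trusted by clients.

```shell
# Hypothetical helpers built on the openssl CLI.

# Succeeds when the certificate's subject and issuer are identical,
# which is a common sign of a self-signed certificate (heuristic only).
cert_is_self_signed() {
  local subject issuer
  subject="$(openssl x509 -in "$1" -noout -subject | sed 's/^subject= *//')"
  issuer="$(openssl x509 -in "$1" -noout -issuer | sed 's/^issuer= *//')"
  [ "$subject" = "$issuer" ]
}

# Succeeds when the public key embedded in the certificate ($1) matches
# the public key derived from the private key ($2).
cert_matches_key() {
  local cert_pub key_pub
  cert_pub="$(openssl x509 -in "$1" -noout -pubkey)" || return 1
  key_pub="$(openssl pkey -in "$2" -pubout)" || return 1
  [ "$cert_pub" = "$key_pub" ]
}

# Usage (file names are assumptions):
#   cert_is_self_signed cert.pem && echo "self-signed: not supported"
#   cert_matches_key cert.pem key.pem || echo "certificate and key do not match"
```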

Installation

Follow these instructions to install using BCM.

Installing a cluster

The cluster installer is available via the locally installed BCM landing page:

  1. Go to the locally installed BCM landing page and select the NVIDIA Run:ai tile, or go directly to http://<BCM-CLUSTER-IP>:30080/runai-installer (HTTP only).

  2. Click VERIFY to check that the System Requirements are met.

  3. After verification completes successfully, click CONTINUE.

  4. Enter the cluster information and click CONTINUE.

  5. The NVIDIA Run:ai installation starts and should complete within a few minutes.

  6. Once the message NVIDIA Run:ai was installed successfully! is displayed, click START USING NVIDIA Run:ai to launch the tenant's login page in a new browser tab.
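If the installer page does not load in the browser, a quick command-line check can show whether the endpoint is reachable at all. The sketch below builds the installer URL from the port and path given in step 1 and probes it with curl; the function names and the 10-second timeout are assumptions chosen for illustration.

```shell
# Hypothetical helpers for checking the installer endpoint.

# Print the installer URL for a given BCM cluster IP or host name.
# The port (30080) and path (/runai-installer) come from step 1 above.
installer_url() {
  printf 'http://%s:30080/runai-installer\n' "$1"
}

# Succeeds when the installer endpoint answers without an HTTP error.
# The 10-second timeout is an arbitrary choice for this sketch.
installer_reachable() {
  curl -fsS -o /dev/null --max-time 10 "$(installer_url "$1")"
}

# Usage (replace <BCM-CLUSTER-IP> with the real address):
#   installer_reachable <BCM-CLUSTER-IP> && echo "installer is reachable"
```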

Troubleshooting

If you encounter an issue with the installation, try the troubleshooting scenarios below.

NVIDIA Run:ai installer

The NVIDIA Run:ai installer is a pod in Kubernetes. The pod is responsible for the installation preparation and prerequisite gathering phase. If there is an error during the prerequisite verification process, run the following commands to print the logs:

kubectl get pods -n runai | grep 'cluster-installer'  # Find the cluster installer pod's name
kubectl logs <POD-NAME> -n runai  # Print the cluster installer pod logs
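The two commands above can be combined so that the pod name does not have to be copied by hand. This is a minimal sketch that assumes a single installer pod whose name contains cluster-installer in the runai namespace, as in the commands above; the function names are hypothetical.

```shell
# Hypothetical helpers that wrap the two kubectl commands above.

# Read pod names from standard input (as printed by "kubectl get pods -o name")
# and print the first one that contains "cluster-installer".
pick_installer_pod() {
  grep 'cluster-installer' | head -n 1
}

# Look up the installer pod in the runai namespace and print its logs.
print_installer_logs() {
  local pod
  pod="$(kubectl get pods -n runai -o name | pick_installer_pod)"
  if [ -z "$pod" ]; then
    echo "no cluster-installer pod found in namespace runai" >&2
    return 1
  fi
  # "-o name" yields names such as pod/<name>, which kubectl logs accepts.
  kubectl logs "$pod" -n runai
}

# Usage:
#   print_installer_logs
```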

Installation

If the NVIDIA Run:ai cluster installation fails, check the installation logs to identify the issue. Run the following script to print the installation logs:

curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh | bash

Cluster status

If the NVIDIA Run:ai cluster installation is complete but the cluster status does not change to Connected, check the cluster troubleshooting scenarios.
