Install using Base Command Manager
This section explains the steps required to install the NVIDIA Run:ai cluster on a DGX Kubernetes Cluster using NVIDIA Base Command Manager (BCM).
NVIDIA Run:ai installer
The NVIDIA Run:ai installer is a wizard that simplifies the deployment of the NVIDIA Run:ai cluster on DGX. The NVIDIA Run:ai installer is installed via the BCM cluster wizard when the cluster is created.
Note
For custom deployment options, see Install using Helm.
System and network requirements
Before installing the NVIDIA Run:ai cluster on a DGX system using BCM, ensure that your environment meets the System requirements and Network requirements.
The BCM cluster wizard deploys essential Software Requirements, such as the Kubernetes Ingress Controller, NVIDIA GPU Operator, and Prometheus, as part of the NVIDIA Run:ai installer deployment. Additional optional software requirements for Distributed training and Inference require manual setup.
Tenant name
Your tenant name is predefined and supplied by NVIDIA Run:ai. Each customer is provided with a unique, dedicated URL in the format <tenant-name>.run.ai, which includes the required tenant name.
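The relationship between the tenant name and the dedicated URL can be sketched in a few lines of Python. This is only an illustration: the helper names `tenant_host` and `tenant_name_from_host` are hypothetical, and the rule that a tenant name looks like a DNS label (lowercase letters, digits, hyphens) is an assumption, not something stated by NVIDIA Run:ai.

```python
import re

# Assumption: a tenant name is shaped like a single DNS label.
_TENANT_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")
_DOMAIN_SUFFIX = ".run.ai"


def tenant_host(tenant_name: str) -> str:
    """Build the host in the documented <tenant-name>.run.ai format."""
    name = tenant_name.strip().lower()
    if not _TENANT_LABEL.match(name):
        raise ValueError(f"unexpected tenant name: {tenant_name!r}")
    return f"{name}{_DOMAIN_SUFFIX}"


def tenant_name_from_host(host: str) -> str:
    """Recover the tenant name from a <tenant-name>.run.ai host."""
    host = host.strip().lower()
    if not host.endswith(_DOMAIN_SUFFIX):
        raise ValueError(f"not a run.ai host: {host!r}")
    name = host[: -len(_DOMAIN_SUFFIX)]
    if not _TENANT_LABEL.match(name):
        raise ValueError(f"unexpected tenant name in host: {host!r}")
    return name
```

For a made-up tenant called `acme`, `tenant_host("acme")` gives `acme.run.ai`, and `tenant_name_from_host("acme.run.ai")` gives back `acme`.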
Application secret key
An application secret key is required to connect the cluster to the NVIDIA Run:ai platform. To obtain the application secret key, you must add a new cluster.
Follow the Adding a new cluster setup instructions. Do not follow the Installation instructions.
Once the cluster instructions are displayed, find the controlPlane.clientSecret flag in the displayed Helm command, then copy and save its value.
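If you prefer not to pick the value out by eye, the copy step can be sketched as a small parser. The sketch assumes the flag appears in the form `--set controlPlane.clientSecret=<value>` and that the value contains no commas; the function name `extract_client_secret` and the sample command in the usage note are hypothetical and are not the command the platform displays.

```python
import shlex
from typing import Optional

_FLAG_PREFIX = "controlPlane.clientSecret="


def extract_client_secret(helm_command: str) -> Optional[str]:
    """Return the controlPlane.clientSecret value from a Helm command, or None."""
    # Join shell line continuations so the command parses as one line.
    single_line = helm_command.replace("\\\n", " ")
    for token in shlex.split(single_line):
        # Handle both "--set key=value" and "--set=key=value".
        if token.startswith("--set="):
            token = token[len("--set="):]
        # A single --set may carry several comma-separated assignments.
        for assignment in token.split(","):
            if assignment.startswith(_FLAG_PREFIX):
                return assignment[len(_FLAG_PREFIX):]
    return None
```

Given a made-up command such as `helm upgrade -i my-release my-chart --set controlPlane.clientSecret=EXAMPLE-SECRET`, the function returns `EXAMPLE-SECRET`. It returns `None` when the flag is absent.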
Note
For DGX Bundle customers installing their first NVIDIA Run:ai cluster, the application secret key is provided by the NVIDIA Run:ai support team.
TLS certificate
A TLS private key and public key for the cluster’s Fully Qualified Domain Name (FQDN) are required for HTTPS access to the cluster.
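Before supplying the key pair, it can help to confirm that the private key actually belongs to the certificate. The sketch below uses Python's standard `ssl` module for that check. It assumes PEM-encoded files and an unencrypted private key. The function name `key_matches_certificate` is hypothetical. The check does not verify that the certificate was issued for the cluster's FQDN.

```python
import ssl


def key_matches_certificate(cert_path: str, key_path: str) -> bool:
    """Return True if the PEM certificate and private key load as a matching pair."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    try:
        # load_cert_chain raises ssl.SSLError when the key does not match
        # the certificate or a file is not valid PEM, and another OSError
        # subclass when a file cannot be read.
        context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    except OSError:  # ssl.SSLError is a subclass of OSError
        return False
    return True
```

A call such as `key_matches_certificate("tls.crt", "tls.key")`, where both file names are placeholders, returns `False` if either file is missing or unparsable, or if the two do not form a pair.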
Installation
Follow these instructions to install using BCM.
Installing a cluster
The cluster installer is available from the locally installed BCM landing page.
Go to the locally installed BCM landing page and select the NVIDIA Run:ai tile, or go directly to http://<BCM-CLUSTER-IP>:30080/runai-installer (HTTP only).
Click VERIFY to check that the system requirements are met.
After verification completes successfully, click CONTINUE.
Enter the cluster information and click CONTINUE.
The NVIDIA Run:ai installation starts and should complete within a few minutes.
Once the message "NVIDIA Run:ai was installed successfully!" is displayed, click START USING NVIDIA Run:ai to launch the tenant's login page in a new browser tab.
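If the landing page tile is not convenient, the direct installer address from the steps above can be built and probed with a short script. This is a minimal sketch that uses only the Python standard library. The port and path come from the URL given above, while the function names `installer_url` and `installer_reachable` are hypothetical, and treating any non-error HTTP response as "reachable" is an assumption.

```python
import ipaddress
import urllib.request

INSTALLER_PORT = 30080              # from the documented installer URL
INSTALLER_PATH = "/runai-installer"  # from the documented installer URL


def installer_url(bcm_cluster_ip: str) -> str:
    """Build http://<BCM-CLUSTER-IP>:30080/runai-installer (HTTP only)."""
    ip = ipaddress.ip_address(bcm_cluster_ip.strip())  # ValueError if not an IP
    # IPv6 literals must be bracketed inside a URL.
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"http://{host}:{INSTALLER_PORT}{INSTALLER_PATH}"


def installer_reachable(bcm_cluster_ip: str, timeout: float = 5.0) -> bool:
    """Return True if the installer page answers with a non-error status."""
    try:
        with urllib.request.urlopen(installer_url(bcm_cluster_ip), timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # Covers connection failures, timeouts, and HTTP error statuses
        # (urllib's URLError and HTTPError are OSError subclasses).
        return False
```

For an illustrative address of `10.0.0.5`, `installer_url("10.0.0.5")` returns `http://10.0.0.5:30080/runai-installer`.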
Troubleshooting
If you encounter an issue with the installation, try the troubleshooting scenarios below.
NVIDIA Run:ai installer
The NVIDIA Run:ai installer is a pod in Kubernetes. The pod is responsible for the installation preparation and prerequisite gathering phase. If there is an error during the prerequisite verification process, run the following command to print the logs:
Installation
If the NVIDIA Run:ai cluster installation fails, check the installation logs to identify the issue. Run the following script to print the installation logs:
Cluster status
If the NVIDIA Run:ai cluster installation is complete but the cluster status does not change to Connected, see the cluster troubleshooting scenarios.