Install the Cluster
In this section you will install the NVIDIA Run:ai cluster on your Kubernetes environment using Helm. The cluster extends Kubernetes with NVIDIA Run:ai orchestration capabilities - scheduling and workload management - and connects to the previously installed control plane for centralized management.
Once the control plane is installed and you access the NVIDIA Run:ai UI for the first time, an onboarding wizard opens automatically. The wizard guides you through the cluster setup and generates a Helm installation command. Follow the instructions below to modify and run the command based on your artifact source and environment.
This procedure includes:
Adding the NVIDIA Run:ai Helm repository from NGC or JFrog
Installing the NVIDIA Run:ai cluster into the runai namespace
Registering the NVIDIA Run:ai cluster with the NVIDIA Run:ai control plane using the provided connection details
By completing this process, the NVIDIA Run:ai cluster will be connected to the NVIDIA Run:ai control plane and ready to run training, inference, and other workloads.
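At a high level, the process maps to a Helm flow like the sketch below. This is illustrative only: the repository URL, version, and --set parameter names are placeholder assumptions, and the real command (including your cluster's identifiers and client secret) is generated by the onboarding wizard.

```shell
# Illustrative sketch of the overall flow; always use the wizard-generated
# command. <HELM_REPO_URL>, <VERSION>, and the --set names/values are
# placeholders, not confirmed parameter names.
helm repo add runai <HELM_REPO_URL>
helm repo update
helm upgrade --install runai-cluster runai/runai-cluster \
  --namespace runai --create-namespace \
  --version <VERSION> \
  --set controlPlane.url=<CONTROL_PLANE_FQDN> \
  --set controlPlane.clientSecret=<CLIENT_SECRET>
```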
System and Network Requirements
Before installing the NVIDIA Run:ai cluster, validate that the system requirements and network requirements are met. For air-gapped environments, make sure you have the software artifacts prepared.
Once all the requirements are met, it is highly recommended to use the NVIDIA Run:ai cluster preinstall diagnostics tool to:
Test the above requirements, as well as common failure points related to Kubernetes, NVIDIA, storage, and networking
Inspect additional installed components and analyze their relevance to a successful installation
For more information, see preinstall diagnostics. To run the preinstall diagnostics tool, download the latest version, and run:
chmod +x ./preinstall-diagnostics-<platform> && \
./preinstall-diagnostics-<platform> \
--domain ${CONTROL_PLANE_FQDN} \
--cluster-domain ${CLUSTER_FQDN} \
--image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
--image ${PRIVATE_REGISTRY_IMAGE_URL}
The --image-pull-secret and --image flags are needed only if the diagnostics image is hosted in a private registry.
In an air-gapped deployment, the diagnostics image is saved, pushed, and pulled manually from the organization's registry.
# Save the image locally
docker save --output preinstall-diagnostics.tar gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION}
# Load the image into the organization's registry
docker load --input preinstall-diagnostics.tar
docker tag gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION} ${CLIENT_IMAGE_AND_TAG}
docker push ${CLIENT_IMAGE_AND_TAG}
Run the binary with the --image parameter to specify the diagnostics image to use:
chmod +x ./preinstall-diagnostics-darwin-arm64 && \
./preinstall-diagnostics-darwin-arm64 \
--domain ${CONTROL_PLANE_FQDN} \
--cluster-domain ${CLUSTER_FQDN} \
--image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
--image ${PRIVATE_REGISTRY_IMAGE_URL}
Helm
NVIDIA Run:ai requires Helm 3.14 or later. To install Helm, see Installing Helm. If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai tar file contains the helm binary.
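A quick way to check the Helm 3.14 minimum before proceeding is to compare version strings with sort -V; this is a sketch, and the hardcoded version below stands in for the live output of helm version.

```shell
# Returns success if version $1 >= version $2 (semantic "vX.Y.Z" strings).
version_at_least() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example value; in practice, substitute:
#   current="$(helm version --template '{{.Version}}')"
current="v3.14.2"
if version_at_least "$current" "v3.14.0"; then
  echo "Helm $current meets the minimum requirement"
else
  echo "Helm $current is older than v3.14.0; please upgrade" >&2
fi
```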
Permissions
It is recommended to use a Kubernetes user with the cluster-admin role to ensure a successful installation. For more information, see Using RBAC authorization.
Installation
Note
To customize the installation based on your environment, see Advanced cluster configurations.
You can store the clientSecret as a Kubernetes secret within the cluster instead of using plain text. You can then configure the installation to use it by setting the controlPlane.existingSecret and controlPlane.secretKeys.clientSecret parameters as described in Advanced cluster configurations.
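As a sketch of the secret-based approach, the commands might look like the following; the secret name and key name are arbitrary choices, while the two controlPlane parameters are the ones referenced above.

```shell
# Create a Kubernetes secret holding the client secret.
# "runai-client-secret" and the "client-secret" key are arbitrary names.
kubectl create namespace runai
kubectl create secret generic runai-client-secret \
  --namespace runai \
  --from-literal=client-secret=<CLIENT_SECRET>

# Point the installation at the secret instead of passing plain text,
# keeping the remaining --set values generated by the wizard:
helm upgrade --install runai-cluster runai/runai-cluster \
  --namespace runai \
  --set controlPlane.existingSecret=runai-client-secret \
  --set controlPlane.secretKeys.clientSecret=client-secret
```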
Artifact Source
Starting with v2.24, NVIDIA Run:ai artifacts are available on both NVIDIA NGC and JFrog. NGC is the recommended artifact source. JFrog remains supported in v2.24 but will be removed in a future release. For connected environments, follow the instructions for your artifact source in the sections below. For air-gapped environments, the installation steps are the same regardless of artifact source. Artifacts are prepared in the Preparations step.
Kubernetes
Connected
When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.
Enter a unique name for your cluster and click CONTINUE.
Set the cluster location. Choose where the NVIDIA Run:ai cluster will be installed:
Same as the control plane - Install the NVIDIA Run:ai cluster on the same Kubernetes cluster as the NVIDIA Run:ai control plane.
Remote control plane - Install the NVIDIA Run:ai cluster on a different Kubernetes cluster than the NVIDIA Run:ai control plane.
Note
The selected location must align with the system requirements you prepared earlier. The NVIDIA Run:ai cluster system requirements differ depending on whether the NVIDIA Run:ai cluster is installed on the same Kubernetes cluster as the NVIDIA Run:ai control plane or on a separate one.
If you selected Remote control plane, enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.
Click CONTINUE.
Install NVIDIA Run:ai on Your Cluster
The next section presents the NVIDIA Run:ai cluster installation steps.
The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard. Follow the instructions for your artifact source.
NGC (Recommended)
Modify the UI-generated command as follows:
Add --username='$oauthtoken' and --password=<NGC_API_KEY> to the helm repo add command, and replace <NGC_API_KEY> with your NGC API key.
If you are using a local certificate authority, add --set global.customCA.enabled=true to the Helm command as described in the Local certificate authority section.
The recommended ingress controller is HAProxy. If you are using a different ingress controller, update the ingress class to match the ingress controller configured during the control plane installation.
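The modified helm repo add command might look like the sketch below; the repository name and <NGC_HELM_REPO_URL> are placeholders for the values shown in the UI-generated command.

```shell
# Add the NGC Helm repository with authentication.
# '$oauthtoken' is the literal NGC username; <NGC_API_KEY> is your key.
helm repo add runai <NGC_HELM_REPO_URL> \
  --username='$oauthtoken' \
  --password=<NGC_API_KEY>
helm repo update
```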
JFrog
Run the Helm commands exactly as shown in the UI. If you are using a local certificate authority, add --set global.customCA.enabled=true to the Helm command as described in the Local certificate authority section.
The wizard displays Waiting for cluster to connect while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to Cluster connected. After completing the wizard flow, the cluster is added to the Clusters table.
Tip
Use the dry-run flag --dry-run=client to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.
Air-gapped
When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.
Enter a unique name for your cluster and click CONTINUE.
Set the cluster location. Choose where the NVIDIA Run:ai cluster will be installed:
Same as the control plane - Install the NVIDIA Run:ai cluster on the same Kubernetes cluster as the NVIDIA Run:ai control plane.
Remote control plane - Install the NVIDIA Run:ai cluster on a different Kubernetes cluster than the NVIDIA Run:ai control plane.
Note
The selected location must align with the system requirements you prepared earlier. The NVIDIA Run:ai cluster system requirements differ depending on whether the NVIDIA Run:ai cluster is installed on the same Kubernetes cluster as the NVIDIA Run:ai control plane or on a separate one.
If you selected Remote control plane, enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.
Click CONTINUE.
Install NVIDIA Run:ai on Your Cluster
The next section presents the NVIDIA Run:ai cluster installation steps.
The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard.
Do not run the command exactly as shown in the UI.
Update the UI-generated Helm command as follows (see example command below) and use the pre-provided installation file instead of the Helm repositories:
Do not add the Helm repository (helm repo add), and do not run helm repo update.
Instead, edit the helm upgrade command:
Replace runai/runai-cluster with runai-cluster-<VERSION>.tgz, where <VERSION> is the NVIDIA Run:ai cluster version.
Add --set global.image.registry=<DOCKER_REGISTRY_ADDRESS>, where <DOCKER_REGISTRY_ADDRESS> is the Docker registry address configured in the Preparations section.
Add --set clusterConfig.prometheus.spec.baseImage=<DOCKER_REGISTRY_ADDRESS>/<FULL_IMAGE_PATH>. The registry address should point to the location where the Prometheus image is hosted.
Add --set global.customCA.enabled=true as described in the Local certificate authority section.
The recommended ingress controller is HAProxy. If you are using a different ingress controller, update the ingress class to match the ingress controller configured during the control plane installation.
Keep the remaining --set values exactly as generated by the UI.
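Putting those edits together, the resulting command might look like the sketch below; the release name is illustrative, the angle-bracket values must match your environment, and the remaining --set values from the UI-generated command must be appended.

```shell
# Illustrative air-gapped installation command. Replace the placeholder
# values and append the remaining --set values exactly as generated by
# the UI (cluster identifiers, client secret, and so on).
helm upgrade --install runai-cluster runai-cluster-<VERSION>.tgz \
  --namespace runai --create-namespace \
  --set global.image.registry=<DOCKER_REGISTRY_ADDRESS> \
  --set clusterConfig.prometheus.spec.baseImage=<DOCKER_REGISTRY_ADDRESS>/<FULL_IMAGE_PATH> \
  --set global.customCA.enabled=true
```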
The wizard displays Waiting for cluster to connect while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to Cluster connected. After completing the wizard flow, the cluster is added to the Clusters table.
Tip
Use the dry-run flag --dry-run=client to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.
OpenShift
Connected
When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.
Enter a unique name for your cluster and click CONTINUE.
Set the cluster location. Choose where the NVIDIA Run:ai cluster will be installed:
Same as the control plane - Install the NVIDIA Run:ai cluster on the same Kubernetes cluster as the NVIDIA Run:ai control plane.
Remote control plane - Install the NVIDIA Run:ai cluster on a different Kubernetes cluster than the NVIDIA Run:ai control plane.
Note
The selected location must align with the system requirements you prepared earlier. The NVIDIA Run:ai cluster system requirements differ depending on whether the NVIDIA Run:ai cluster is installed on the same Kubernetes cluster as the NVIDIA Run:ai control plane or on a separate one.
If you selected Remote control plane, enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.
Click CONTINUE.
Install NVIDIA Run:ai on Your Cluster
The next section presents the NVIDIA Run:ai cluster installation steps.
The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard. Follow the instructions for your artifact source:
NGC (Recommended)
Modify the UI-generated command as follows:
Add --username='$oauthtoken' and --password=<NGC_API_KEY> to the helm repo add command, and replace <NGC_API_KEY> with your NGC API key.
If you are using a local certificate authority, add --set global.customCA.enabled=true to the Helm command as described in the Local certificate authority section.
JFrog
Run the Helm commands exactly as shown in the UI. If you are using a local certificate authority, add --set global.customCA.enabled=true to the Helm command as described in the Local certificate authority section.
The wizard displays Waiting for cluster to connect while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to Cluster connected. After completing the wizard flow, the cluster is added to the Clusters table.
Tip
Use the dry-run flag --dry-run=client to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.
Air-gapped
When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.
Enter a unique name for your cluster and click CONTINUE.
Set the cluster location. Choose where the NVIDIA Run:ai cluster will be installed:
Same as the control plane - Install the NVIDIA Run:ai cluster on the same Kubernetes cluster as the NVIDIA Run:ai control plane.
Remote control plane - Install the NVIDIA Run:ai cluster on a different Kubernetes cluster than the NVIDIA Run:ai control plane.
Note
The selected location must align with the system requirements you prepared earlier. The NVIDIA Run:ai cluster system requirements differ depending on whether the NVIDIA Run:ai cluster is installed on the same Kubernetes cluster as the NVIDIA Run:ai control plane or on a separate one.
If you selected Remote control plane, enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.
Click CONTINUE.
Install NVIDIA Run:ai on Your Cluster
The next section presents the NVIDIA Run:ai cluster installation steps.
The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard.
Do not run the command exactly as shown in the UI.
Update the UI-generated Helm command as follows (see example command below) and use the pre-provided installation file instead of the Helm repositories:
Do not add the Helm repository, and do not run helm repo update.
Instead, edit the helm upgrade command:
Replace runai/runai-cluster with runai-cluster-<VERSION>.tgz, where <VERSION> is the NVIDIA Run:ai cluster version.
Add --set global.image.registry=<DOCKER_REGISTRY_ADDRESS>, where <DOCKER_REGISTRY_ADDRESS> is the Docker registry address configured in the Preparations section.
Add --set clusterConfig.prometheus.spec.baseImage=<DOCKER_REGISTRY_ADDRESS>/<FULL_IMAGE_PATH>. The registry address should point to the location where the Prometheus image is hosted.
Add --set global.customCA.enabled=true as described in the Local certificate authority section.
Keep the remaining --set values exactly as generated by the UI.
The wizard displays Waiting for cluster to connect while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to Cluster connected. After completing the wizard flow, the cluster is added to the Clusters table.
Tip
Use the dry-run flag --dry-run=client to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.
Troubleshooting
If you encounter an issue with the installation, try the troubleshooting scenario below.
Installation
If the NVIDIA Run:ai cluster installation failed, check the installation logs to identify the issue. Run the following script to print the installation logs:
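A minimal sketch for collecting those logs with kubectl is shown below; it assumes the cluster components run in the runai namespace, and the exact script provided by NVIDIA Run:ai may differ.

```shell
# List the NVIDIA Run:ai pods and print recent logs from any pod that is
# not running. The "runai" namespace is an assumption based on the
# installation steps above; adjust it to your environment.
kubectl get pods -n runai
for pod in $(kubectl get pods -n runai --field-selector=status.phase!=Running -o name); do
  echo "==== $pod ===="
  kubectl logs -n runai "$pod" --all-containers --tail=50
done
```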
Cluster Status
If the NVIDIA Run:ai cluster installation completed but the cluster status did not change to Connected, refer to the cluster Troubleshooting scenarios section.
Next Steps
Once the cluster is installed and connected, the NVIDIA Run:ai UI guides you through optional post-installation configurations. These steps are optional but recommended for a production setup:
SSO (Single Sign-On) - Configure SSO to allow users to log in with your organization's identity provider. See SSO for setup instructions using SAML or OpenID Connect.
Email server - Configure an SMTP server to enable email notifications for workload events and system alerts. See Email notifications for configuration details.
Create your first research team - Set up projects and permissions to organize your AI practitioners and allocate GPU resources.