Install the Cluster

In this section you will install the NVIDIA Run:ai cluster on your Kubernetes environment using Helm. The cluster extends Kubernetes with NVIDIA Run:ai orchestration capabilities - scheduling and workload management - and connects to the previously installed control plane for centralized management.

Once the control plane is installed and you access the NVIDIA Run:ai UI for the first time, an onboarding wizard opens automatically. The wizard guides you through the cluster setup and generates a Helm installation command. Follow the instructions below to modify and run the command based on your artifact source and environment.

This procedure includes:

  • Adding the NVIDIA Run:ai Helm repository from NGC or JFrog

  • Installing the NVIDIA Run:ai cluster into the runai namespace

  • Registering the NVIDIA Run:ai cluster with the NVIDIA Run:ai control plane using the provided connection details

By completing this process, the NVIDIA Run:ai cluster will be connected to the NVIDIA Run:ai control plane and ready to run training, inference, and other workloads.

System and Network Requirements

Before installing the NVIDIA Run:ai cluster, validate that the system requirements and network requirements are met. For air-gapped environments, make sure you have the software artifacts prepared.

Once all the requirements are met, it is highly recommended to use the NVIDIA Run:ai cluster preinstall diagnostics tool to:

  • Test the system and network requirements above, in addition to failure points related to Kubernetes, NVIDIA, storage, and networking

  • Detect additional installed components and analyze their relevance to a successful installation

For more information, see preinstall diagnostics. To run the preinstall diagnostics tool, download the latest version, and run:

chmod +x ./preinstall-diagnostics-<platform>
./preinstall-diagnostics-<platform> \
  --domain ${CONTROL_PLANE_FQDN} \
  --cluster-domain ${CLUSTER_FQDN}

# If the diagnostics image is hosted in a private registry, also pass:
#   --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
#   --image ${PRIVATE_REGISTRY_IMAGE_URL}

Helm

NVIDIA Run:ai requires Helm 3.14 or later. To install Helm, see Installing Helm. If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai tar file contains the helm binary.

Permissions

It is recommended to use a Kubernetes user with the cluster-admin role to ensure a successful installation. For more information, see Using RBAC authorization.

Installation

Note

  • To customize the installation based on your environment, see Advanced cluster configurations.

  • You can store the clientSecret as a Kubernetes secret within the cluster instead of using plain text. You can then configure the installation to use it by setting the controlPlane.existingSecret and controlPlane.secretKeys.clientSecret parameters as described in Advanced cluster configurations.
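As an illustration, the clientSecret could be stored in a Kubernetes secret like the following sketch (the secret name and key name here are placeholders you choose; only the controlPlane.existingSecret and controlPlane.secretKeys.clientSecret parameters come from the documentation):

```shell
# Create a Kubernetes secret holding the client secret.
# "runai-client-secret" and the key "client-secret" are illustrative names.
kubectl create secret generic runai-client-secret \
  --namespace runai \
  --from-literal=client-secret=<CLIENT_SECRET>

# Then point the installation at it by adding to the Helm command:
#   --set controlPlane.existingSecret=runai-client-secret \
#   --set controlPlane.secretKeys.clientSecret=client-secret
```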

Artifact Source

Starting with v2.24, NVIDIA Run:ai artifacts are available on both NVIDIA NGC and JFrog. NGC is the recommended artifact source. JFrog remains supported in v2.24 but will be removed in a future release. For connected environments, follow the instructions for your artifact source in the sections below. For air-gapped environments, the installation steps are the same regardless of artifact source. Artifacts are prepared in the Preparations step.

Kubernetes

Connected

When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.

  1. Enter a unique name for your cluster and click CONTINUE.

  2. Set the cluster location. Choose where the NVIDIA Run:ai cluster will be installed:

    • Same as the control plane - Install the NVIDIA Run:ai cluster on the same Kubernetes cluster as the NVIDIA Run:ai control plane.

    • Remote control plane - Install the NVIDIA Run:ai cluster on a different Kubernetes cluster than the NVIDIA Run:ai control plane.

      Note

      The selected location must align with the system requirements you prepared earlier. The NVIDIA Run:ai cluster system requirements differ depending on whether the NVIDIA Run:ai cluster is installed on the same Kubernetes cluster as the NVIDIA Run:ai control plane or on a separate one.

  3. If you selected Remote control plane, enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.

  4. Click CONTINUE

Install NVIDIA Run:ai on Your Cluster

Follow the steps below to install the NVIDIA Run:ai cluster.

  1. Before installing the NVIDIA Run:ai cluster, ensure that all required system and network requirements are met.

  2. The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard. Follow the instructions for your artifact source.

NGC (Recommended)

Modify the UI-generated command as follows:

  • Add --username='$oauthtoken' and --password=<NGC_API_KEY> to the helm repo add command, and replace <NGC_API_KEY> with your NGC API key.

  • If you are using a local certificate authority, add --set global.customCA.enabled=true to the Helm command as described in the Local certificate authority section.

  • The recommended ingress controller is HAProxy. If you are using a different ingress controller, update the ingress class to match the ingress controller configured during the control plane installation.
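Putting the NGC modifications together, the modified helm repo add command might look like the following sketch (the repository name and URL come from the UI-generated command; the placeholders here are illustrative):

```shell
# Add the NVIDIA Run:ai Helm repository from NGC with authentication.
# <NGC_HELM_REPO_URL> is the repository URL from the UI-generated command;
# <NGC_API_KEY> is your NGC API key.
helm repo add runai <NGC_HELM_REPO_URL> \
  --username='$oauthtoken' \
  --password=<NGC_API_KEY>
helm repo update
```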

JFrog

Run the Helm commands exactly as shown in the UI. If you are using a local certificate authority, add --set global.customCA.enabled=true to the Helm command as described in the Local certificate authority section.


The wizard displays Waiting for cluster to connect while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to Cluster connected. After completing the wizard flow, the cluster is added to the Clusters table.

Tip

Use the dry-run flag --dry-run=client to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.

Air-gapped

When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.

  1. Enter a unique name for your cluster and click CONTINUE.

  2. Set the cluster location. Choose where the NVIDIA Run:ai cluster will be installed:

    • Same as the control plane - Install the NVIDIA Run:ai cluster on the same Kubernetes cluster as the NVIDIA Run:ai control plane.

    • Remote control plane - Install the NVIDIA Run:ai cluster on a different Kubernetes cluster than the NVIDIA Run:ai control plane.

      Note

      The selected location must align with the system requirements you prepared earlier. The NVIDIA Run:ai cluster system requirements differ depending on whether the NVIDIA Run:ai cluster is installed on the same Kubernetes cluster as the NVIDIA Run:ai control plane or on a separate one.

  3. If you selected Remote control plane, enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.

  4. Click CONTINUE

Install NVIDIA Run:ai on Your Cluster

Follow the steps below to install the NVIDIA Run:ai cluster.

  1. Before installing the NVIDIA Run:ai cluster, ensure that all required system and network requirements are met.

  2. The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard.

    Important: Do not run the command exactly as shown in the UI.

  3. Update the UI-generated Helm command as follows (see example command below) and use the pre-provided installation file instead of using helm repositories:

    • Do not run the helm repo add and helm repo update commands.

    • Instead, edit the helm upgrade command:

      • Replace runai/runai-cluster with runai-cluster-<VERSION>.tgz, where <VERSION> is the NVIDIA Run:ai cluster version.

      • Add --set global.image.registry=<DOCKER_REGISTRY_ADDRESS>, where <DOCKER_REGISTRY_ADDRESS> is the Docker registry address configured in the Preparations section.

      • Add --set clusterConfig.prometheus.spec.baseImage=<DOCKER_REGISTRY_ADDRESS>/<FULL_IMAGE_PATH>. The registry address should point to the location where the Prometheus image is hosted.

      • Add --set global.customCA.enabled=true as described in the Local certificate authority section.

      • The recommended ingress controller is HAProxy. If you are using a different ingress controller, update the ingress class to match the ingress controller configured during the control plane installation.

      • Keep the remaining --set values exactly as generated by the UI.
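Taken together, the edited air-gapped command might look like the following sketch (the release name and namespace are assumptions based on this guide; the remaining --set values and placeholder substitutions come from the UI-generated command and your environment):

```shell
# Sketch of an air-gapped install from a local chart file.
# Substitute the placeholders and keep the --set values generated by the UI.
helm upgrade --install runai-cluster runai-cluster-<VERSION>.tgz \
  --namespace runai --create-namespace \
  --set global.image.registry=<DOCKER_REGISTRY_ADDRESS> \
  --set clusterConfig.prometheus.spec.baseImage=<DOCKER_REGISTRY_ADDRESS>/<FULL_IMAGE_PATH> \
  --set global.customCA.enabled=true
  # ...plus the remaining --set values exactly as generated by the UI
```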

The wizard displays Waiting for cluster to connect while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to Cluster connected. After completing the wizard flow, the cluster is added to the Clusters table.

Tip

Use the dry-run flag --dry-run=client to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.

OpenShift

Connected

When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.

  1. Enter a unique name for your cluster and click CONTINUE.

  2. Set the cluster location. Choose where the NVIDIA Run:ai cluster will be installed:

    • Same as the control plane - Install the NVIDIA Run:ai cluster on the same Kubernetes cluster as the NVIDIA Run:ai control plane.

    • Remote control plane - Install the NVIDIA Run:ai cluster on a different Kubernetes cluster than the NVIDIA Run:ai control plane.

      Note

      The selected location must align with the system requirements you prepared earlier. The NVIDIA Run:ai cluster system requirements differ depending on whether the NVIDIA Run:ai cluster is installed on the same Kubernetes cluster as the NVIDIA Run:ai control plane or on a separate one.

  3. If you selected Remote control plane, enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.

  4. Click CONTINUE

Install NVIDIA Run:ai on Your Cluster

Follow the steps below to install the NVIDIA Run:ai cluster.

  1. Before installing the NVIDIA Run:ai cluster, ensure that all required system and network requirements are met.

  2. The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard. Follow the instructions for your artifact source:

NGC (Recommended)

Modify the UI-generated command as follows:

  • Add --username='$oauthtoken' and --password=<NGC_API_KEY> to the helm repo add command, and replace <NGC_API_KEY> with your NGC API key.

  • If you are using a local certificate authority, add --set global.customCA.enabled=true to the Helm command as described in the Local certificate authority section.

JFrog

Run the Helm commands exactly as shown in the UI. If you are using a local certificate authority, add --set global.customCA.enabled=true to the Helm command as described in the Local certificate authority section.


The wizard displays Waiting for cluster to connect while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to Cluster connected. After completing the wizard flow, the cluster is added to the Clusters table.

Tip

Use the dry-run flag --dry-run=client to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.

Air-gapped

When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.

  1. Enter a unique name for your cluster and click CONTINUE.

  2. Set the cluster location. Choose where the NVIDIA Run:ai cluster will be installed:

    • Same as the control plane - Install the NVIDIA Run:ai cluster on the same Kubernetes cluster as the NVIDIA Run:ai control plane.

    • Remote control plane - Install the NVIDIA Run:ai cluster on a different Kubernetes cluster than the NVIDIA Run:ai control plane.

      Note

      The selected location must align with the system requirements you prepared earlier. The NVIDIA Run:ai cluster system requirements differ depending on whether the NVIDIA Run:ai cluster is installed on the same Kubernetes cluster as the NVIDIA Run:ai control plane or on a separate one.

  3. If you selected Remote control plane, enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.

  4. Click CONTINUE

Install NVIDIA Run:ai on Your Cluster

Follow the steps below to install the NVIDIA Run:ai cluster.

  1. Before installing the NVIDIA Run:ai cluster, ensure that all required system and network requirements are met.

  2. The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard.

    Important: Do not run the command exactly as shown in the UI.

  3. Update the UI-generated Helm command as follows (see example command below) and use the pre-provided installation file instead of using helm repositories:

    • Do not run the helm repo add and helm repo update commands.

    • Instead, edit the helm upgrade command:

      • Replace runai/runai-cluster with runai-cluster-<VERSION>.tgz, where <VERSION> is the NVIDIA Run:ai cluster version.

      • Add --set global.image.registry=<DOCKER_REGISTRY_ADDRESS>, where <DOCKER_REGISTRY_ADDRESS> is the Docker registry address configured in the Preparations section.

      • Add --set clusterConfig.prometheus.spec.baseImage=<DOCKER_REGISTRY_ADDRESS>/<FULL_IMAGE_PATH>. The registry address should point to the location where the Prometheus image is hosted.

      • Add --set global.customCA.enabled=true as described in the Local certificate authority section.

      • Keep the remaining --set values exactly as generated by the UI.

The wizard displays Waiting for cluster to connect while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to Cluster connected. After completing the wizard flow, the cluster is added to the Clusters table.

Tip

Use the dry-run flag --dry-run=client to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.

Troubleshooting

If you encounter an issue with the installation, try the troubleshooting scenario below.

Installation

If the NVIDIA Run:ai cluster installation failed, check the installation logs in the runai namespace to identify the issue.
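As an illustration, the logs can be collected with kubectl; this sketch assumes the cluster was installed into the default runai namespace:

```shell
# List the NVIDIA Run:ai pods and print recent logs from each container.
kubectl get pods -n runai
for pod in $(kubectl get pods -n runai -o name); do
  echo "=== ${pod} ==="
  kubectl logs -n runai "${pod}" --all-containers --tail=100 || true
done
```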

Cluster Status

If the NVIDIA Run:ai cluster installation completed, but the cluster did not change its status to Connected, refer to the cluster Troubleshooting scenarios section.

Next Steps

Once the cluster is installed and connected, the NVIDIA Run:ai UI guides you through optional post-installation configurations. These steps are optional but recommended for a production setup:

  • SSO (Single Sign-On) - Configure SSO to allow users to log in with your organization's identity provider. See SSO for setup instructions using SAML or OpenID Connect.

  • Email server - Configure an SMTP server to enable email notifications for workload events and system alerts. See Email notifications for configuration details.

  • Create your first research team - Set up projects and permissions to organize your AI practitioners and allocate GPU resources.
