Install the Cluster

In this section you will install the NVIDIA Run:ai cluster on your Kubernetes environment using Helm. The cluster extends Kubernetes with NVIDIA Run:ai orchestration capabilities - scheduling and workload management - and connects to NVIDIA’s cloud-hosted control plane for centralized management.

Once you access the NVIDIA Run:ai UI for the first time, an onboarding wizard opens automatically. The wizard guides you through the cluster setup and generates a Helm installation command.

This procedure includes:

  • Adding the NVIDIA Run:ai Helm repository from JFrog

  • Installing the NVIDIA Run:ai cluster into the runai namespace

  • Registering the NVIDIA Run:ai cluster with the NVIDIA Run:ai control plane using the provided connection details

After you complete this process, the NVIDIA Run:ai cluster is connected to NVIDIA’s cloud-hosted control plane and ready to run training, inference, and other workloads.

System and Network Requirements

Before installing the NVIDIA Run:ai cluster, validate that the system requirements and network requirements are met.

Once all the requirements are met, it is highly recommended to use the NVIDIA Run:ai cluster preinstall diagnostics tool to:

  • Test the requirements below, as well as failure points related to Kubernetes, NVIDIA, storage, and networking

  • Look at additional components installed and analyze their relevance to a successful installation

To run the preinstall diagnostics tool, download the latest version, and run:

chmod +x ./preinstall-diagnostics-<platform> && \
./preinstall-diagnostics-<platform> \
  --domain ${COMPANY_NAME}.run.ai \
  --cluster-domain ${CLUSTER_FQDN}

For more information, see preinstall diagnostics.
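
The diagnostics command above reads COMPANY_NAME and CLUSTER_FQDN from the shell environment. A minimal sketch of preparing them could look like the following; the values `acme` and `runai.example.internal` are placeholders rather than real endpoints, and the `diagnostics_flags` helper is a hypothetical name used only for illustration.

```shell
# Placeholder values (assumptions): replace with your tenant name and cluster FQDN.
COMPANY_NAME="acme"
CLUSTER_FQDN="runai.example.internal"

# Print the flags passed to the diagnostics tool, failing early if a variable is empty.
diagnostics_flags() {
  : "${COMPANY_NAME:?COMPANY_NAME is not set}"
  : "${CLUSTER_FQDN:?CLUSTER_FQDN is not set}"
  printf '%s %s %s %s\n' \
    "--domain" "${COMPANY_NAME}.run.ai" \
    "--cluster-domain" "${CLUSTER_FQDN}"
}

diagnostics_flags
```

Printing the flags first makes it easy to confirm the expanded values before handing them to the diagnostics binary.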

Helm

The NVIDIA Run:ai cluster requires Helm 3.14 or above. To install Helm, see Helm Install.
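
To confirm the Helm requirement before installing, a check along these lines may help. It is a sketch, not part of the NVIDIA Run:ai tooling: `helm_meets_minimum` is a hypothetical helper, and it assumes the version string has the usual `vMAJOR.MINOR.PATCH` shape that `helm version --template '{{.Version}}'` prints.

```shell
# Return success when a Helm version string (e.g. "v3.14.0") is 3.14 or newer.
helm_meets_minimum() {
  ver="${1#v}"         # drop the leading "v"
  major="${ver%%.*}"   # text before the first dot
  rest="${ver#*.}"     # text after the first dot
  minor="${rest%%.*}"  # text up to the next dot
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 14 ]; }
}

# Only query Helm when it is installed.
if command -v helm >/dev/null 2>&1; then
  installed="$(helm version --template '{{.Version}}')"
  if helm_meets_minimum "$installed"; then
    echo "Helm $installed meets the 3.14 minimum"
  else
    echo "Helm $installed is older than 3.14; upgrade before installing"
  fi
else
  echo "helm not found on PATH"
fi
```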

Permissions

To ensure a successful installation, use a Kubernetes user with the cluster-admin role. For more information, see Using RBAC authorization.
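
One way to sanity-check permissions is to ask the API server whether the current user may perform any action on any resource. The sketch below uses the standard `kubectl auth can-i` command; `has_cluster_admin_rights` is a hypothetical helper name. A positive answer indicates permissions as broad as those cluster-admin grants, but it does not prove that the binding is specifically to the cluster-admin role.

```shell
# Succeeds when the current kubectl user may perform every verb on every resource.
has_cluster_admin_rights() {
  answer="$(kubectl auth can-i '*' '*' --all-namespaces --request-timeout=10s 2>/dev/null)"
  [ "$answer" = "yes" ]
}

# Only run the check when kubectl is installed.
if command -v kubectl >/dev/null 2>&1; then
  if has_cluster_admin_rights; then
    echo "Current user has cluster-wide permissions"
  else
    echo "Current user lacks cluster-wide permissions, or the cluster is unreachable"
  fi
fi
```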

Installation

Follow these instructions to install using Helm.

Note

  • To customize the installation based on your environment, see Advanced cluster configurations.

  • You can store the clientSecret as a Kubernetes secret within the cluster instead of using plain text. You can then configure the installation to use it by setting the controlPlane.existingSecret and controlPlane.secretKeys.clientSecret parameters as described in Advanced cluster configurations.
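
As a sketch of the secret-based option described in the note, the following creates a Kubernetes secret and prints the extra Helm flags that reference it. The secret name `runai-control-plane-credentials` and the key `client-secret` are hypothetical, the runai namespace is assumed to exist already, and it is assumed that controlPlane.secretKeys.clientSecret names the key inside the secret; confirm the exact semantics in Advanced cluster configurations.

```shell
# Hypothetical names (assumptions): replace with your own.
SECRET_NAME="runai-control-plane-credentials"
SECRET_KEY="client-secret"

# Create the secret in the runai namespace from a file containing the client
# secret, so the value does not end up in shell history.
# Usage: create_client_secret /path/to/client-secret-file
create_client_secret() {
  kubectl create secret generic "$SECRET_NAME" \
    --namespace runai \
    --from-file="$SECRET_KEY=$1"
}

# Print the extra flags to append to the wizard-generated Helm command.
helm_secret_flags() {
  printf '%s %s\n' \
    "--set controlPlane.existingSecret=$SECRET_NAME" \
    "--set controlPlane.secretKeys.clientSecret=$SECRET_KEY"
}

helm_secret_flags
```

Reading the value with `--from-file` rather than `--from-literal` keeps the secret out of the command line itself.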

Adding a New Cluster

When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.

  1. Enter a unique name for your cluster and click CONTINUE.

  2. Set your Cluster URL by entering the Kubernetes cluster's URL. The URL is accessible only from within your organization's network.
For more information, see Fully Qualified Domain Name (FQDN).

  3. Click CONTINUE.

Installation Instructions

In the next section, the NVIDIA Run:ai cluster installation steps will be presented.

  1. Before installing the NVIDIA Run:ai cluster, ensure that all system and network requirements are met.

  2. The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

The wizard displays Waiting for cluster to connect while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to Cluster connected. After completing the wizard flow, the cluster is added to the Clusters table.
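
While the wizard shows Waiting for cluster to connect, the installation can also be inspected from the command line. The sketch below relies only on standard `helm list` and `kubectl get pods` commands against the runai namespace; `show_runai_install_state` is a hypothetical helper name, and the release and pod names you see depend on the command the wizard generated.

```shell
# List Helm releases and pods in the runai namespace.
show_runai_install_state() {
  helm list --namespace runai
  kubectl get pods --namespace runai
}

# Only run the check when both tools are installed.
if command -v helm >/dev/null 2>&1 && command -v kubectl >/dev/null 2>&1; then
  show_runai_install_state || echo "Could not query the runai namespace"
fi
```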

Troubleshooting

If you encounter an issue with the installation, try the troubleshooting scenarios below.

Installation

If the NVIDIA Run:ai cluster installation failed, check the installation logs to identify the issue. Run the following script to print the installation logs:
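
Independently of the NVIDIA Run:ai log script, generic Kubernetes commands can give a first indication of what went wrong. The following is a sketch under that assumption, not the official script: it lists pods in the runai namespace that are neither Running nor Succeeded and then prints the namespace events sorted by time. `show_runai_failures` is a hypothetical helper name.

```shell
# Show unhealthy pods and recent events in the runai namespace.
show_runai_failures() {
  # Pods that are neither Running nor Succeeded
  kubectl get pods --namespace runai \
    --field-selector='status.phase!=Running,status.phase!=Succeeded'
  # Events ordered by their last timestamp, oldest first
  kubectl get events --namespace runai --sort-by=.lastTimestamp
}

# Only run the check when kubectl is installed.
if command -v kubectl >/dev/null 2>&1; then
  show_runai_failures || echo "Could not query the runai namespace"
fi
```

For a pod that appears in the first list, `kubectl describe pod <pod-name> --namespace runai` and `kubectl logs <pod-name> --namespace runai` usually show the underlying error.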

Cluster Status

If the NVIDIA Run:ai cluster installation completed but the cluster status did not change to Connected, check the cluster troubleshooting scenarios.

Next Steps

Once the cluster is installed and connected, the NVIDIA Run:ai UI guides you through post-installation configurations. These steps are optional but recommended for a production setup:

  • SSO (Single Sign-On) - Configure SSO to allow users to log in with your organization's identity provider. See SSO for setup instructions using SAML or OpenID Connect.

  • Create your first research team - Set up projects and permissions to organize your AI practitioners and allocate GPU resources.
