# Install the Cluster

In this section you will install the NVIDIA Run:ai cluster on your Kubernetes environment using Helm. The cluster extends Kubernetes with NVIDIA Run:ai orchestration capabilities - scheduling and workload management - and connects to the previously installed [control plane](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/install-control-plane) for centralized management.

Once the control plane is installed and you access the NVIDIA Run:ai UI for the first time, an onboarding wizard opens automatically. The wizard guides you through the cluster setup and generates a Helm installation command. Follow the instructions below to modify and run the command based on your artifact source and environment.

This procedure includes:

* Adding the NVIDIA Run:ai Helm repository from NGC or JFrog
* Installing the NVIDIA Run:ai cluster into the `runai` namespace
* Registering the NVIDIA Run:ai cluster with the NVIDIA Run:ai control plane using the provided connection details

By completing this process, the NVIDIA Run:ai cluster will be connected to the NVIDIA Run:ai control plane and ready to run training, inference, and other workloads.

## System and Network Requirements <a href="#system-and-network-requirements" id="system-and-network-requirements"></a>

Before installing the NVIDIA Run:ai cluster, validate that the [system requirements](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/system-requirements) and [network requirements](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/network-requirements) are met. For air-gapped environments, make sure you have the [software artifacts](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/preparations#software-artifacts) prepared.

Once all the requirements are met, it is highly recommended to run the NVIDIA Run:ai cluster preinstall diagnostics tool to:

* Verify the requirements above, as well as common failure points related to Kubernetes, NVIDIA, storage, and networking
* Inspect additional installed components and analyze their relevance to a successful installation

For more information, see [preinstall diagnostics](https://github.com/run-ai/preinstall-diagnostics). To run the preinstall diagnostics tool, [download](https://runai.jfrog.io/ui/native/pd-cli-prod/preinstall-diagnostics-cli/) the latest version and run:

{% tabs %}
{% tab title="Connected" %}

```bash
# The --image-pull-secret and --image flags are only needed if the
# diagnostics image is hosted in a private registry
chmod +x ./preinstall-diagnostics-<platform> && \
./preinstall-diagnostics-<platform> \
  --domain ${CONTROL_PLANE_FQDN} \
  --cluster-domain ${CLUSTER_FQDN} \
  --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
  --image ${PRIVATE_REGISTRY_IMAGE_URL}
```

{% endtab %}

{% tab title="Air-gapped" %}
In an air-gapped deployment, the diagnostics image is saved, pushed, and pulled manually from the organization's registry.

```bash
# Save the image locally
docker save --output preinstall-diagnostics.tar gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION}
# Load the image, then tag and push it to the organization's registry
docker load --input preinstall-diagnostics.tar
docker tag gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION} ${CLIENT_IMAGE_AND_TAG}
docker push ${CLIENT_IMAGE_AND_TAG}
```

Run the binary with the `--image` parameter pointing to the diagnostics image in your registry:

```bash
chmod +x ./preinstall-diagnostics-<platform> && \
./preinstall-diagnostics-<platform> \
  --domain ${CONTROL_PLANE_FQDN} \
  --cluster-domain ${CLUSTER_FQDN} \
  --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
  --image ${PRIVATE_REGISTRY_IMAGE_URL}
```

{% endtab %}
{% endtabs %}
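The `<platform>` placeholder in the commands above stands for the binary's OS/architecture suffix (for example, `darwin-arm64`). As a convenience, the suffix for the current machine can be derived like this; the `<os>-<arch>` naming convention is an assumption, so check the download page for the exact file names:

```shell
# Derive the diagnostics binary suffix for the local machine.
# Assumes releases are named preinstall-diagnostics-<os>-<arch>,
# e.g. preinstall-diagnostics-linux-amd64 or preinstall-diagnostics-darwin-arm64.
OS=$(uname -s | tr '[:upper:]' '[:lower:]')
ARCH=$(uname -m)
case "$ARCH" in
  x86_64)  ARCH=amd64 ;;
  aarch64) ARCH=arm64 ;;
esac
echo "preinstall-diagnostics-${OS}-${ARCH}"
```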

## Helm

NVIDIA Run:ai requires [Helm](https://helm.sh/) 3.14 or later. To install Helm, see [Installing Helm](https://helm.sh/docs/intro/install/). If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai tar file contains the [helm binary](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/preparations#software-artifacts).
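To confirm the installed Helm meets the 3.14 requirement before proceeding, a simple version comparison can be scripted; the `version_ge` helper below is illustrative, and the hardcoded version stands in for the output of `helm version --short`:

```shell
# Returns success if version $1 >= $2, using sort -V for a semver-style compare
version_ge() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Substitute the hardcoded example with: current=$(helm version --short | cut -d+ -f1)
current="v3.14.2"
if version_ge "${current#v}" "3.14"; then
  echo "Helm ${current} meets the 3.14+ requirement"
else
  echo "Helm ${current} is too old; 3.14 or later is required" >&2
fi
```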

## Permissions <a href="#permissions" id="permissions"></a>

It is recommended to use a Kubernetes user with the `cluster-admin` role to ensure a successful installation. For more information, see [Using RBAC authorization](https://kubernetes.io/docs/reference/access-authn-authz/rbac/).
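A quick way to sanity-check your permissions on the target cluster is with generic `kubectl` commands such as the following (these require access to the cluster, so they are shown as a sketch rather than a required step):

```shell
# Prints "yes" if the current user may perform any action on any resource,
# which is effectively what the cluster-admin role grants
kubectl auth can-i '*' '*' --all-namespaces

# Alternatively, list the bindings that grant the cluster-admin role
kubectl get clusterrolebindings -o wide | grep cluster-admin
```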

## Installation

{% hint style="info" %}
**Note**

* To customize the installation based on your environment, see [Advanced cluster configurations](https://run-ai-docs.nvidia.com/self-hosted/infrastructure-setup/advanced-setup/cluster-config).
* You can store the `clientSecret` as a Kubernetes secret within the cluster instead of using plain text. You can then configure the installation to use it by setting the `controlPlane.existingSecret` and `controlPlane.secretKeys.clientSecret` parameters as described in [Advanced cluster configurations](https://run-ai-docs.nvidia.com/self-hosted/infrastructure-setup/advanced-setup/cluster-config).
{% endhint %}
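As an example of the secret-based approach from the note above, the `clientSecret` can be stored in the cluster before running the Helm command. The secret and key names below are illustrative; they must match the `controlPlane.existingSecret` and `controlPlane.secretKeys.clientSecret` values you configure:

```shell
# Create the namespace if it does not exist yet
kubectl create namespace runai --dry-run=client -o yaml | kubectl apply -f -

# Store the client secret in a Kubernetes secret instead of passing it in plain text.
# "runai-control-plane" and "clientSecret" are example names.
kubectl -n runai create secret generic runai-control-plane \
  --from-literal=clientSecret=<CLIENT_SECRET>
```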

### Artifact Source

Starting with v2.24, NVIDIA Run:ai artifacts are available on both NVIDIA NGC and JFrog. NGC is the recommended artifact source. JFrog remains supported in v2.24 but will be removed in a future release. For connected environments, follow the instructions for your artifact source in the sections below. For air-gapped environments, the installation steps are the same regardless of artifact source. Artifacts are prepared in the [Preparations](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/preparations#software-artifacts) step.

### Kubernetes

<details>

<summary>Connected</summary>

When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.

1. Enter a unique name for your cluster and click **CONTINUE**.
2. Set the cluster location. Choose where the NVIDIA Run:ai cluster will be installed:
   * Same as the control plane - Install the NVIDIA Run:ai cluster on the same Kubernetes cluster as the NVIDIA Run:ai control plane.
   * Remote control plane - Install the NVIDIA Run:ai cluster on a different Kubernetes cluster than the NVIDIA Run:ai control plane.

     <div data-gb-custom-block data-tag="hint" data-style="info" class="hint hint-info"><p><strong>Note</strong></p><p>The selected location must align with the <a href="system-requirements">system</a> requirements you prepared earlier. The NVIDIA Run:ai cluster system requirements differ depending on whether the NVIDIA Run:ai cluster is installed on the same Kubernetes cluster as the NVIDIA Run:ai control plane or on a separate one.</p></div>
3. If you selected Remote control plane, enter the Cluster URL. For more information, see [Fully Qualified Domain Name](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/system-requirements#fully-qualified-domain-name-fqdn) requirement.
4. Click **CONTINUE**

**Install NVIDIA Run:ai on Your Cluster**

The following steps describe how to install the NVIDIA Run:ai cluster.

1. Before installing the NVIDIA Run:ai cluster, ensure that all required [system](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/system-requirements) and [network](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/network-requirements) requirements are met.
2. The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard. Follow the instructions for your artifact source.

**NGC (Recommended)**

Modify the UI-generated command as follows:

* Add `--username='$oauthtoken'` and `--password=<NGC_API_KEY>` to the `helm repo add` command, and replace `<NGC_API_KEY>` with your NGC API key.
* If you are using a local certificate authority, add `--set global.customCA.enabled=true` to the Helm command as described in the [Local certificate authority](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/cp-system-requirements#local-certificate-authority) section.
* The recommended ingress controller is HAProxy. If you are using a different ingress controller, update the ingress class to match the ingress controller configured during the [control plane installation](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-control-plane#connected).

```bash
helm repo add runai https://helm.ngc.nvidia.com/nvidia/runai --force-update \
  --username='$oauthtoken' \
  --password=<NGC_API_KEY>
helm repo update
helm upgrade -i runai-cluster runai/runai-cluster -n runai \
    --set controlPlane.url=... \
    --set controlPlane.clientSecret=... \
    --set cluster.uid=... \
    --set cluster.url=... --version="<VERSION>" --create-namespace \
    --set clusterConfig.global.ingress.ingressClass=haproxy
```

**JFrog**

Run the Helm commands exactly as shown in the UI. If you are using a local certificate authority, add `--set global.customCA.enabled=true` to the Helm command as described in the [Local certificate authority](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/cp-system-requirements#local-certificate-authority) section.

***

The wizard displays **Waiting for cluster to connect** while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to **Cluster connected**. After completing the wizard flow, the cluster is added to the [Clusters](https://run-ai-docs.nvidia.com/self-hosted/infrastructure-setup/procedures/clusters#clusters-table) table.
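While the wizard shows **Waiting for cluster to connect**, you can follow the rollout directly on the cluster with generic `kubectl` and `helm` commands (shown as a sketch; they assume access to the target cluster):

```shell
# Watch the NVIDIA Run:ai components come up in the runai namespace
kubectl get pods -n runai -w

# Check the Helm release status once the install command returns
helm status runai-cluster -n runai
```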

{% hint style="info" %}
**Tip**

Use the dry-run flag `--dry-run=client` to gain an understanding of what is being installed before the actual installation. For more details, see [Understanding cluster access roles](https://docs.run.ai/latest/admin/config/access-roles/).
{% endhint %}

</details>

<details>

<summary>Air-gapped</summary>

When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.

1. Enter a unique name for your cluster and click **CONTINUE**.
2. Set the cluster location. Choose where the NVIDIA Run:ai cluster will be installed:
   * Same as the control plane - Install the NVIDIA Run:ai cluster on the same Kubernetes cluster as the NVIDIA Run:ai control plane.
   * Remote control plane - Install the NVIDIA Run:ai cluster on a different Kubernetes cluster than the NVIDIA Run:ai control plane.

     <div data-gb-custom-block data-tag="hint" data-style="info" class="hint hint-info"><p><strong>Note</strong></p><p>The selected location must align with the <a href="system-requirements">system</a> requirements you prepared earlier. The NVIDIA Run:ai cluster system requirements differ depending on whether the NVIDIA Run:ai cluster is installed on the same Kubernetes cluster as the NVIDIA Run:ai control plane or on a separate one.</p></div>
3. If you selected Remote control plane, enter the Cluster URL. For more information, see [Fully Qualified Domain Name](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/system-requirements#fully-qualified-domain-name-fqdn) requirement.
4. Click **CONTINUE**

**Install NVIDIA Run:ai on Your Cluster**

The following steps describe how to install the NVIDIA Run:ai cluster.

1. Before installing the NVIDIA Run:ai cluster, ensure that all required [system](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/system-requirements) and [network](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/network-requirements) requirements are met.
2. The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard.

   **Do not run the command exactly as shown in the UI.**
3. Update the UI-generated Helm command as follows (see example command below) and use the [pre-provided installation file](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/preparations#air-gapped) instead of using helm repositories:
   * Do not run `helm repo add` or `helm repo update`.
   * Instead, edit the `helm upgrade` command:
     * Replace `runai/runai-cluster` with the chart file name `runai-cluster-<VERSION>.tgz`.
     * Replace `<VERSION>` with the NVIDIA Run:ai cluster version.
     * Add `--set global.image.registry=<DOCKER_REGISTRY_ADDRESS>`, where `<DOCKER_REGISTRY_ADDRESS>` is the Docker registry address configured in the [Preparations](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/preparations#air-gapped) section.
     * Add `--set clusterConfig.prometheus.spec.baseImage=<DOCKER_REGISTRY_ADDRESS>/<FULL_IMAGE_PATH>`, where the registry address points to the location where the Prometheus image is hosted.
     * Add `--set global.customCA.enabled=true` as described in the [Local certificate authority](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/system-requirements#local-certificate-authority) section.
     * The recommended ingress controller is HAProxy. If you are using a different ingress controller, update the ingress class to match the ingress controller configured during the [control plane installation](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-control-plane#connected).
     * Keep the remaining `--set` values exactly as generated by the UI.

```bash
helm upgrade -i runai-cluster runai-cluster-<VERSION>.tgz \
    --set controlPlane.url=... \
    --set controlPlane.clientSecret=... \
    --set cluster.uid=... \
    --set cluster.url=... --create-namespace \
    --set global.image.registry=registry.mycompany.local \
    --set clusterConfig.prometheus.spec.baseImage=registry.mycompany.local/prometheus/prometheus \
    --set global.customCA.enabled=true \
    --set clusterConfig.global.ingress.ingressClass=haproxy
```
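Before running the command, it can be worth confirming that the chart tarball and the private registry are in place. The registry address below matches the example above and is illustrative:

```shell
# Inspect the chart metadata in the pre-provided tarball
helm show chart runai-cluster-<VERSION>.tgz

# Verify the private registry answers the Docker Registry v2 API
curl -fsS https://registry.mycompany.local/v2/ > /dev/null && echo "registry reachable"
```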

The wizard displays **Waiting for cluster to connect** while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to **Cluster connected**. After completing the wizard flow, the cluster is added to the [Clusters](https://run-ai-docs.nvidia.com/self-hosted/infrastructure-setup/procedures/clusters#clusters-table) table.

{% hint style="info" %}
**Tip**

Use the dry-run flag `--dry-run=client` to gain an understanding of what is being installed before the actual installation. For more details, see [Understanding cluster access roles.](https://docs.run.ai/latest/admin/config/access-roles/)
{% endhint %}

</details>

### OpenShift

<details>

<summary>Connected</summary>

When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.

1. Enter a unique name for your cluster and click **CONTINUE.**
2. Set the cluster location. Choose where the NVIDIA Run:ai cluster will be installed:
   * Same as the control plane - Install the NVIDIA Run:ai cluster on the same Kubernetes cluster as the NVIDIA Run:ai control plane.
   * Remote control plane - Install the NVIDIA Run:ai cluster on a different Kubernetes cluster than the NVIDIA Run:ai control plane.

     <div data-gb-custom-block data-tag="hint" data-style="info" class="hint hint-info"><p><strong>Note</strong></p><p>The selected location must align with the <a href="system-requirements">system</a> requirements you prepared earlier. The NVIDIA Run:ai cluster system requirements differ depending on whether the NVIDIA Run:ai cluster is installed on the same Kubernetes cluster as the NVIDIA Run:ai control plane or on a separate one.</p></div>
3. If you selected Remote control plane, enter the Cluster URL. For more information, see [Fully Qualified Domain Name](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/system-requirements#fully-qualified-domain-name-fqdn) requirement.
4. Click **CONTINUE**

**Install NVIDIA Run:ai on Your Cluster**

The following steps describe how to install the NVIDIA Run:ai cluster.

1. Before installing the NVIDIA Run:ai cluster, ensure that all required [system](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/system-requirements) and [network](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/network-requirements) requirements are met.
2. The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard. Follow the instructions for your artifact source:

**NGC (Recommended)**

Modify the UI-generated command as follows:

* Add `--username='$oauthtoken'` and `--password=<NGC_API_KEY>` to the `helm repo add` command, and replace `<NGC_API_KEY>` with your NGC API key.
* If you are using a local certificate authority, add `--set global.customCA.enabled=true` to the Helm command as described in the [Local certificate authority](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/cp-system-requirements#local-certificate-authority) section.

```bash
helm repo add runai https://helm.ngc.nvidia.com/nvidia/runai --force-update \
  --username='$oauthtoken' \
  --password=<NGC_API_KEY>
helm repo update
helm upgrade -i runai-cluster runai/runai-cluster -n runai \
  --set controlPlane.url=... \
  --set controlPlane.clientSecret=... \
  --set cluster.uid=... \
  --set cluster.url=... --version="<VERSION>" --create-namespace
```

**JFrog**

Run the Helm commands exactly as shown in the UI. If you are using a local certificate authority, add `--set global.customCA.enabled=true` to the Helm command as described in the [Local certificate authority](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/cp-system-requirements#local-certificate-authority) section.

***

The wizard displays **Waiting for cluster to connect** while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to **Cluster connected**. After completing the wizard flow, the cluster is added to the [Clusters](https://run-ai-docs.nvidia.com/self-hosted/infrastructure-setup/procedures/clusters#clusters-table) table.

{% hint style="info" %}
**Tip**

Use the dry-run flag `--dry-run=client` to gain an understanding of what is being installed before the actual installation. For more details, see [Understanding cluster access roles.](https://docs.run.ai/latest/admin/config/access-roles/)
{% endhint %}

</details>

<details>

<summary>Air-gapped</summary>

When adding a cluster for the first time, the onboarding wizard opens automatically when you log in to the NVIDIA Run:ai platform. You cannot perform other actions in the platform until the cluster is created.

1. Enter a unique name for your cluster and click **CONTINUE**.
2. Set the cluster location. Choose where the NVIDIA Run:ai cluster will be installed:
   * Same as the control plane - Install the NVIDIA Run:ai cluster on the same Kubernetes cluster as the NVIDIA Run:ai control plane.
   * Remote control plane - Install the NVIDIA Run:ai cluster on a different Kubernetes cluster than the NVIDIA Run:ai control plane.

     <div data-gb-custom-block data-tag="hint" data-style="info" class="hint hint-info"><p><strong>Note</strong></p><p>The selected location must align with the <a href="system-requirements">system</a> requirements you prepared earlier. The NVIDIA Run:ai cluster system requirements differ depending on whether the NVIDIA Run:ai cluster is installed on the same Kubernetes cluster as the NVIDIA Run:ai control plane or on a separate one.</p></div>
3. If you selected Remote control plane, enter the Cluster URL. For more information, see [Fully Qualified Domain Name](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/system-requirements#fully-qualified-domain-name-fqdn) requirement.
4. Click **CONTINUE**

**Install NVIDIA Run:ai on Your Cluster**

The following steps describe how to install the NVIDIA Run:ai cluster.

1. Before installing the NVIDIA Run:ai cluster, ensure that all required [system](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/system-requirements) and [network](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/install-using-helm/network-requirements) requirements are met.
2. The NVIDIA Run:ai platform displays the Helm installation command in the cluster wizard.

   **Do not run the command exactly as shown in the UI.**
3. Update the UI-generated Helm command as follows (see example command below) and use the [pre-provided installation file](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/preparations#air-gapped-1) instead of using helm repositories:

   * Do not run `helm repo add` or `helm repo update`.
   * Instead, edit the `helm upgrade` command:
     * Replace `runai/runai-cluster` with the chart file name `runai-cluster-<VERSION>.tgz`.
     * Replace `<VERSION>` with the NVIDIA Run:ai cluster version.
     * Add `--set global.image.registry=<DOCKER_REGISTRY_ADDRESS>`, where `<DOCKER_REGISTRY_ADDRESS>` is the Docker registry address configured in the [Preparations](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/preparations#air-gapped) section.
     * Add `--set clusterConfig.prometheus.spec.baseImage=<DOCKER_REGISTRY_ADDRESS>/<FULL_IMAGE_PATH>`, where the registry address points to the location where the Prometheus image is hosted.
     * Add `--set global.customCA.enabled=true` as described in the [Local certificate authority](https://run-ai-docs.nvidia.com/self-hosted/getting-started/installation/system-requirements#local-certificate-authority) section.
     * Keep the remaining `--set` values exactly as generated by the UI.

   ```bash
   helm upgrade -i runai-cluster runai-cluster-<VERSION>.tgz \
       --set controlPlane.url=... \
       --set controlPlane.clientSecret=... \
       --set cluster.uid=... \
       --set cluster.url=... --create-namespace \
       --set global.image.registry=registry.mycompany.local \
       --set clusterConfig.prometheus.spec.baseImage=registry.mycompany.local/prometheus/prometheus \
       --set global.customCA.enabled=true
   ```

The wizard displays **Waiting for cluster to connect** while the cluster is being installed and connected to the control plane. Once the installation completes successfully and the cluster establishes communication with the control plane, the wizard updates to **Cluster connected**. After completing the wizard flow, the cluster is added to the [Clusters](https://run-ai-docs.nvidia.com/self-hosted/infrastructure-setup/procedures/clusters#clusters-table) table.

{% hint style="info" %}
**Tip**

Use the dry-run flag `--dry-run=client` to gain an understanding of what is being installed before the actual installation. For more details, see [Understanding cluster access roles.](https://docs.run.ai/latest/admin/config/access-roles/)
{% endhint %}

</details>

## Troubleshooting <a href="#troubleshooting" id="troubleshooting"></a>

If you encounter an issue with the installation, try the troubleshooting scenarios below.

### Installation <a href="#installation_1" id="installation_1"></a>

If the NVIDIA Run:ai cluster installation failed, check the installation logs to identify the issue. Run the following script to print the installation logs:

```bash
curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh | bash
```

### Cluster Status <a href="#cluster-status" id="cluster-status"></a>

If the NVIDIA Run:ai cluster installation completed, but the cluster status did not change to **Connected**, refer to the cluster [Troubleshooting scenarios](https://run-ai-docs.nvidia.com/self-hosted/infrastructure-setup/procedures/clusters#troubleshooting-scenarios) section.
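To narrow down why the cluster has not connected, a few generic checks on the cluster side can help; component names vary by version, so the commands below stay generic and assume only the `runai` namespace from the installation:

```shell
# Surface any NVIDIA Run:ai pods that are not running
kubectl get pods -n runai --field-selector=status.phase!=Running

# Recent events often point at image-pull, DNS, or certificate issues
kubectl get events -n runai --sort-by=.lastTimestamp | tail -n 20
```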

## Next Steps

Once the cluster is installed and connected, the NVIDIA Run:ai UI guides you through optional post-installation configurations. These steps are optional but recommended for a production setup:

* **SSO (Single Sign-On)** - Configure SSO to allow users to log in with your organization's identity provider. See [SSO](https://run-ai-docs.nvidia.com/self-hosted/infrastructure-setup/authentication/sso) for setup instructions using SAML or OpenID Connect.
* **Email server** - Configure an SMTP server to enable email notifications for workload events and system alerts. See [Email notifications](https://run-ai-docs.nvidia.com/self-hosted/settings/general-settings/notifications#email-notifications) for configuration details.
* **Create your first research team** - Set up [projects](https://run-ai-docs.nvidia.com/self-hosted/platform-management/aiinitiatives/organization/projects) and [permissions](https://run-ai-docs.nvidia.com/self-hosted/infrastructure-setup/authentication/accessrules) to organize your AI practitioners and allocate GPU resources.
