# Install the Cluster

## System and Network Requirements <a href="#system-and-network-requirements" id="system-and-network-requirements"></a>

Before installing the NVIDIA Run:ai cluster, validate that the [system requirements](/self-hosted/2.23/getting-started/installation/install-using-helm/system-requirements.md) and [network requirements](/self-hosted/2.23/getting-started/installation/install-using-helm/network-requirements.md) are met. For air-gapped environments, make sure you have the [software artifacts](/self-hosted/2.23/getting-started/installation/install-using-helm/preparations.md#software-artifacts) prepared.

Once all the requirements are met, it is highly recommended to use the NVIDIA Run:ai cluster preinstall diagnostics tool to:

* Test the requirements above, as well as failure points related to Kubernetes, NVIDIA, storage, and networking
* Look at additional components installed and analyze their relevance to a successful installation

For more information, see [preinstall diagnostics](https://github.com/run-ai/preinstall-diagnostics). To run the preinstall diagnostics tool, [download](https://runai.jfrog.io/ui/native/pd-cli-prod/preinstall-diagnostics-cli/) the latest version, and run:

{% tabs %}
{% tab title="Connected" %}

```bash
# --image-pull-secret and --image are only needed if the diagnostics image is hosted in a private registry
chmod +x ./preinstall-diagnostics-<platform> && \
./preinstall-diagnostics-<platform> \
  --domain ${CONTROL_PLANE_FQDN} \
  --cluster-domain ${CLUSTER_FQDN} \
  --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
  --image ${PRIVATE_REGISTRY_IMAGE_URL}
```

{% endtab %}

{% tab title="Air-gapped" %}
In an air-gapped deployment, the diagnostics image is saved, pushed, and pulled manually from the organization's registry.

```bash
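# On a machine with internet access: pull the image (docker save needs a local copy) and save it to a tar file
docker pull gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION}
docker save --output preinstall-diagnostics.tar gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION}
# On a machine with access to the organization's registry: load, retag, and push the image
docker load --input preinstall-diagnostics.tar
docker tag gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION} ${CLIENT_IMAGE_AND_TAG}
docker push ${CLIENT_IMAGE_AND_TAG}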
```

Run the binary with the `--image` parameter to specify the diagnostics image to use:

```bash
chmod +x ./preinstall-diagnostics-darwin-arm64 && \
./preinstall-diagnostics-darwin-arm64 \
  --domain ${CONTROL_PLANE_FQDN} \
  --cluster-domain ${CLUSTER_FQDN} \
  --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
  --image ${PRIVATE_REGISTRY_IMAGE_URL}    
```

{% endtab %}
{% endtabs %}
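
The `<platform>` placeholder in the commands above has to match the binary you downloaded. The following is a minimal sketch of deriving it from the local machine. It assumes the binaries follow an `<os>-<arch>` naming pattern, which is inferred from the `darwin-arm64` example and not confirmed here, so check the download page for the actual file names. `platform_suffix` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: derive an "<os>-<arch>" suffix for the diagnostics binary.
# Assumption: binaries are named preinstall-diagnostics-<os>-<arch>,
# inferred from the darwin-arm64 example; verify against the download page.

platform_suffix() {
  local os arch
  # Normalize the kernel name (e.g. "Linux" -> "linux", "Darwin" -> "darwin").
  os="$(printf '%s' "${1:-$(uname -s)}" | tr '[:upper:]' '[:lower:]')"
  # Map common machine names to the amd64/arm64 convention.
  case "${2:-$(uname -m)}" in
    x86_64|amd64)  arch="amd64" ;;
    aarch64|arm64) arch="arm64" ;;
    *)             arch="${2:-$(uname -m)}" ;;
  esac
  printf '%s-%s\n' "$os" "$arch"
}

# Print the binary name expected for this machine.
echo "preinstall-diagnostics-$(platform_suffix)"
```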

## Helm

NVIDIA Run:ai requires [Helm](https://helm.sh/) 3.14 or later. To install Helm, see [Installing Helm](https://helm.sh/docs/intro/install/). If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai tar file contains the [helm binary](/self-hosted/2.23/getting-started/installation/install-using-helm/preparations.md#software-artifacts).

{% hint style="info" %}
**Note**

Helm 4 defaults to [server-side apply](https://helm.sh/docs/overview/#server-side-apply) when installing a new chart release, which can conflict with resources managed by the NVIDIA Run:ai operator. Append `--server-side=false` to your `helm upgrade` command. NVIDIA Run:ai clusters originally installed with Helm 3.x are unaffected.
{% endhint %}
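
Before continuing, it can help to confirm which Helm version is on your path. The sketch below classifies a version string against the 3.14 minimum and flags Helm 4 so you remember the note above. `helm_version_status` is a hypothetical helper, and the sketch assumes `helm version --short` prints a string such as `v3.14.0+g3fc9f4b`.

```bash
#!/usr/bin/env bash
# Sketch: classify a Helm version string against the 3.14 minimum.
# helm_version_status is a hypothetical helper, not part of Helm or NVIDIA Run:ai.

helm_version_status() {
  local version major minor
  version="${1#v}"            # strip the leading "v" (v3.14.0 -> 3.14.0)
  major="${version%%.*}"      # text before the first dot
  minor="${version#*.}"       # drop "<major>."
  minor="${minor%%.*}"        # keep text before the next dot
  if [ "$major" -ge 4 ]; then
    echo "helm4"              # see the Helm 4 note about --server-side=false
  elif [ "$major" -eq 3 ] && [ "$minor" -ge 14 ]; then
    echo "ok"
  else
    echo "too-old"
  fi
}

# Classify the installed Helm, if there is one on the PATH.
if command -v helm >/dev/null 2>&1; then
  helm_version_status "$(helm version --short)"
fi
```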

## Permissions <a href="#permissions" id="permissions"></a>

To ensure a successful installation, it is recommended to use a Kubernetes user with the `cluster-admin` role. For more information, see [Using RBAC authorization](https://kubernetes.io/docs/reference/access-authn-authz/rbac/).
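
As a quick sanity check, you can ask the API server whether the current user is allowed to perform any action on any resource. This is a rough approximation of `cluster-admin`, not a check of the role binding itself. `has_full_cluster_access` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: check whether the current kubectl user can do everything, everywhere.
# This approximates cluster-admin; it does not inspect the role binding itself.

has_full_cluster_access() {
  # `kubectl auth can-i` prints "yes" or "no" for the current context's user.
  [ "$(kubectl auth can-i '*' '*' --all-namespaces 2>/dev/null)" = "yes" ]
}

# Example call:
#   has_full_cluster_access && echo "full access" || echo "limited access"
```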

## Installation

{% hint style="info" %}
**Note**

* To customize the installation based on your environment, see [Advanced cluster configurations](/self-hosted/2.23/infrastructure-setup/advanced-setup/cluster-config.md).
* You can store the `clientSecret` as a Kubernetes secret within the cluster instead of using plain text. You can then configure the installation to use it by setting the `controlPlane.existingSecret` and `controlPlane.secretKeys.clientSecret` parameters as described in [Advanced cluster configurations](/self-hosted/2.23/infrastructure-setup/advanced-setup/cluster-config.md).
  {% endhint %}
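
The following is a minimal sketch of the secret-based approach described in the note above. The secret name, the key name, and the `runai` namespace are all assumptions chosen for illustration, as is the reading that `controlPlane.secretKeys.clientSecret` holds the key name inside the secret. Confirm the exact semantics in Advanced cluster configurations before relying on it.

```bash
#!/usr/bin/env bash
# Sketch: store clientSecret in a Kubernetes secret instead of plain text.
# Assumptions (hypothetical, not taken from this page): the secret name,
# the key name, and the "runai" namespace. Adjust them to your environment.
SECRET_NAME="runai-client-secret"
SECRET_KEY="client-secret"
NAMESPACE="runai"

# Creates the secret from the CLIENT_SECRET environment variable.
create_client_secret() {
  kubectl create secret generic "$SECRET_NAME" \
    --namespace "$NAMESPACE" \
    --from-literal="${SECRET_KEY}=${CLIENT_SECRET:?export CLIENT_SECRET first}"
}

# Prints the flags that point the chart at the secret.
helm_secret_flags() {
  printf '%s %s %s %s\n' \
    --set "controlPlane.existingSecret=${SECRET_NAME}" \
    --set "controlPlane.secretKeys.clientSecret=${SECRET_KEY}"
}

helm_secret_flags
```

After calling `create_client_secret`, the printed flags would be added to the `helm upgrade` command in place of `--set controlPlane.clientSecret=...`.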

### Kubernetes

<details>

<summary>Connected</summary>

Follow the steps below to add a new cluster.

**Note:** When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

1. In the NVIDIA Run:ai platform, go to **Resources**
2. Click **+NEW CLUSTER**
3. Enter a unique name for your cluster
4. Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)
5. Enter the Cluster URL. For more information, see the [Fully Qualified Domain Name](/self-hosted/2.23/getting-started/installation/install-using-helm/system-requirements.md#fully-qualified-domain-name-fqdn) requirement.
6. Click **Continue**

**Installing NVIDIA Run:ai cluster**

In the next section, the NVIDIA Run:ai cluster installation steps are presented.

1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.
2. Click **DONE**

The cluster is displayed in the table with the status **Waiting to connect**. Once installation is complete, the cluster status changes to **Connected**.

**Tip:** Use the dry-run flag `--dry-run=client` to gain an understanding of what is being installed before the actual installation.
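
One way to apply the tip is to wrap the `helm` call so that the flag is always appended. This is only a sketch: `helm_dry_run` is a hypothetical wrapper, and the actual flags come from the installation instructions shown in the New Cluster form.

```bash
#!/usr/bin/env bash
# Sketch: run a helm command with a client-side dry run appended.
# helm_dry_run is a hypothetical wrapper, not a Helm subcommand.
helm_dry_run() {
  helm "$@" --dry-run=client
}

# Example (replace the trailing placeholder with the flags from the New Cluster form):
#   helm_dry_run upgrade -i runai-cluster runai/runai-cluster <flags from the form>
```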

</details>

<details>

<summary>Air-gapped</summary>

{% hint style="warning" %}
**Prerequisite**

If your internal registry requires authentication, you must create the `runai-reg-creds` imagePullSecret before proceeding. See [Private Docker Registry](/self-hosted/2.23/getting-started/installation/install-using-helm/preparations.md#private-docker-registry) in Preparations.
{% endhint %}

Follow the steps below to add a new cluster.

**Note:** When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

1. In the NVIDIA Run:ai platform, go to **Resources**
2. Click **+NEW CLUSTER**
3. Enter a unique name for your cluster
4. Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)
5. Enter the Cluster URL. For more information, see the [Fully Qualified Domain Name](/self-hosted/2.23/getting-started/installation/install-using-helm/system-requirements.md#fully-qualified-domain-name-fqdn) requirement.
6. Click **Continue**

**Installing NVIDIA Run:ai cluster**

In the next section, the NVIDIA Run:ai cluster installation steps are presented.

1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.
2. On the second tab of the cluster wizard, when copying the helm command for installation, you will need to use the [pre-provided installation file](/self-hosted/2.23/getting-started/installation/install-using-helm/preparations.md#software-artifacts) instead of using helm repositories. As such:

   * Do not add the helm repository and do not run `helm repo update`.
   * Instead, edit the `helm upgrade` command:
     * Replace `runai/runai-cluster` with `./chart/runai-cluster-<VERSION>.tgz`, where `<VERSION>` is the full version number (e.g., `runai-cluster-2.24.58.tgz`). This file is located in the `chart` folder of the extracted software artifacts.
     * Add `--set global.image.registry=<DOCKER REGISTRY ADDRESS>` where the registry address is as entered in the [preparations](/self-hosted/2.23/getting-started/installation/install-using-helm/preparations.md) section
     * Add `--set clusterConfig.prometheus.spec.baseImage=<DOCKER REGISTRY ADDRESS>/<FULL_IMAGE_PATH>`. The registry address should point to the location where the Prometheus image is hosted.
     * Add `--set global.customCA.enabled=true` as described [here](/self-hosted/2.23/getting-started/installation/install-using-helm/system-requirements.md#local-certificate-authority)

   Run the command from the root of the extracted software artifacts directory. It should look like the following:

   ```bash
   helm upgrade -i runai-cluster ./chart/runai-cluster-<VERSION>.tgz \
       --set controlPlane.url=... \
       --set controlPlane.clientSecret=... \
       --set cluster.uid=... \
       --set cluster.url=... --create-namespace \
       --set global.image.registry=registry.mycompany.local \
       --set clusterConfig.prometheus.spec.baseImage=registry.mycompany.local/prometheus/prometheus \
       --set global.customCA.enabled=true
   ```
3. Click **DONE**

The cluster is displayed in the table with the status **Waiting to connect**. Once installation is complete, the cluster status changes to **Connected**.

**Tip:** Use the dry-run flag `--dry-run=client` to gain an understanding of what is being installed before the actual installation.
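
Because the `helm upgrade` command in step 2 depends on the exact chart file name, it can be useful to resolve that name before running it. The sketch below looks for a single `runai-cluster-<VERSION>.tgz` in the `chart` folder. `find_cluster_chart` is a hypothetical helper, and it assumes it is run from the root of the extracted software artifacts directory.

```bash
#!/usr/bin/env bash
# Sketch: find the chart archive under ./chart before running helm upgrade.
# find_cluster_chart is a hypothetical helper; run it from the root of the
# extracted software artifacts directory.

find_cluster_chart() {
  local dir="${1:-./chart}"
  # Expect exactly one runai-cluster-<VERSION>.tgz in the chart folder.
  set -- "$dir"/runai-cluster-*.tgz
  if [ "$#" -eq 1 ] && [ -f "$1" ]; then
    printf '%s\n' "$1"
  else
    echo "expected exactly one runai-cluster-<VERSION>.tgz in $dir" >&2
    return 1
  fi
}

# Example:
#   CHART="$(find_cluster_chart)" && echo "Using chart: $CHART"
```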

</details>

### OpenShift

<details>

<summary>Connected</summary>

Follow the steps below to add a new cluster.

**Note:** When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

1. In the NVIDIA Run:ai platform, go to **Resources**
2. Click **+NEW CLUSTER**
3. Enter a unique name for your cluster
4. Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)
5. Enter the Cluster URL
6. Click **Continue**

**Installing NVIDIA Run:ai cluster**

In the next section, the NVIDIA Run:ai cluster installation steps are presented.

1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.
2. Click **DONE**

The cluster is displayed in the table with the status **Waiting to connect**. Once installation is complete, the cluster status changes to **Connected**.

</details>

<details>

<summary>Air-gapped</summary>

When creating a new cluster, select the **OpenShift** target platform.

Follow the steps below to add a new cluster.

**Note:** When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

1. In the NVIDIA Run:ai platform, go to **Resources**
2. Click **+NEW CLUSTER**
3. Enter a unique name for your cluster
4. Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)
5. Enter the Cluster URL
6. Click **Continue**

**Installing NVIDIA Run:ai cluster**

In the next section, the NVIDIA Run:ai cluster installation steps are presented.

1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.
2. On the second tab of the cluster wizard, when copying the helm command for installation, you will need to use the [pre-provided installation file](/self-hosted/2.23/getting-started/installation/install-using-helm/preparations.md#software-artifacts) instead of using helm repositories. As such:

   * Do not add the helm repository and do not run `helm repo update`.
   * Instead, edit the `helm upgrade` command:
     * Replace `runai/runai-cluster` with `./chart/runai-cluster-<VERSION>.tgz`, where `<VERSION>` is the full version number (e.g., `runai-cluster-2.24.58.tgz`). This file is located in the `chart` folder of the extracted software artifacts.
     * Add `--set global.image.registry=<DOCKER REGISTRY ADDRESS>` where the registry address is as entered in the [preparations](/self-hosted/2.23/getting-started/installation/install-using-helm/preparations.md) section
     * Add `--set clusterConfig.prometheus.spec.baseImage=<DOCKER REGISTRY ADDRESS>/<FULL_IMAGE_PATH>`. The registry address should point to the location where the Prometheus image is hosted.
     * Add `--set global.customCA.enabled=true` as described [here](/self-hosted/2.23/getting-started/installation/install-using-helm/system-requirements.md#local-certificate-authority)

   Run the command from the root of the extracted software artifacts directory. It should look like the following:

   ```bash
   helm upgrade -i runai-cluster ./chart/runai-cluster-<VERSION>.tgz \
       --set controlPlane.url=... \
       --set controlPlane.clientSecret=... \
       --set cluster.uid=... \
       --set cluster.url=... --create-namespace \
       --set global.image.registry=registry.mycompany.local \
       --set clusterConfig.prometheus.spec.baseImage=registry.mycompany.local/prometheus/prometheus \
       --set global.customCA.enabled=true
   ```
3. Click **DONE**

The cluster is displayed in the table with the status **Waiting to connect**. Once installation is complete, the cluster status changes to **Connected**.

</details>

## Troubleshooting <a href="#troubleshooting" id="troubleshooting"></a>

If you encounter an issue with the installation, try the troubleshooting scenarios below.

### Installation <a href="#installation_1" id="installation_1"></a>

If the NVIDIA Run:ai cluster installation failed, check the installation logs to identify the issue. Run the following script to print the installation logs:

{% file src="/files/2BCXbwVvSn6F1NwIjGZ5" %}

### Cluster Status <a href="#cluster-status" id="cluster-status"></a>

If the NVIDIA Run:ai cluster installation completed but the cluster status did not change to **Connected**, check the cluster [troubleshooting scenarios](/self-hosted/2.23/infrastructure-setup/procedures/clusters.md#troubleshooting-scenarios).
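
As a first look before working through those scenarios, you can list pods that are neither running nor completed. This is a sketch only: the `runai` namespace is an assumption not stated on this page, and `not_ready_pods` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: list pods whose STATUS is neither Running nor Completed.
# Assumption: the cluster components run in the "runai" namespace;
# pass a different namespace as the first argument if yours differs.

not_ready_pods() {
  # Column 3 of `kubectl get pods --no-headers` is STATUS.
  kubectl get pods -n "${1:-runai}" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1 " (" $3 ")" }'
}

# Example call:
#   not_ready_pods
```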

