Cluster Restore

This section explains how to restore an NVIDIA Run:ai cluster in a different Kubernetes environment.

To recover from a critical Kubernetes failure, or to migrate an NVIDIA Run:ai cluster to a new Kubernetes environment, reinstall the NVIDIA Run:ai cluster. Once the cluster is reinstalled and reconnected, projects, workloads, and other cluster data are synced automatically.

NVIDIA Run:ai advanced cluster configurations are stored locally on the Kubernetes cluster. Backing them up and restoring them is optional and is done separately.

Back Up the Cluster

Backing up data is not required. The backup procedure is optional and applies only to advanced deployments, as explained above.

Save Cluster Configurations

To back up the NVIDIA Run:ai cluster configurations, you should save both the Helm values and the runtime configuration (runaiconfig).

  1. Back up Helm values - Run the following command to export the Helm values used for deployment:

    helm get values runai-cluster -n runai > runai_cluster_values_backup.yaml
  2. Back up the runtime configuration (runaiconfig) - Run the following command to export the active runtime configuration:

    kubectl get runaiconfig runai -n runai -o yaml -o=jsonpath='{.spec}' > runaiconfig_backup.yaml
  3. Save both backup files (runai_cluster_values_backup.yaml and runaiconfig_backup.yaml) externally so they can be retrieved later if needed.
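The three steps above can be combined into one helper. This is a minimal sketch, not part of the product: the function name `backup_runai_cluster` and the timestamped directory name are hypothetical, while the two export commands are copied unchanged from steps 1 and 2.

```shell
#!/bin/sh
# Sketch: export the Helm values and the runaiconfig into one directory.
# backup_runai_cluster and the default directory name are hypothetical.
backup_runai_cluster() {
  out_dir="${1:-runai-backup-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$out_dir" || return 1

  # Step 1: Helm values used for the deployment.
  helm get values runai-cluster -n runai \
    > "$out_dir/runai_cluster_values_backup.yaml" || return 1

  # Step 2: active runtime configuration (runaiconfig).
  kubectl get runaiconfig runai -n runai -o yaml -o=jsonpath='{.spec}' \
    > "$out_dir/runaiconfig_backup.yaml" || return 1

  # Do not report success if either export produced an empty file.
  for f in runai_cluster_values_backup.yaml runaiconfig_backup.yaml; do
    [ -s "$out_dir/$f" ] || { echo "empty backup file: $f" >&2; return 1; }
  done

  # Print the directory so the caller can copy it to external storage (step 3).
  echo "$out_dir"
}
```

Calling `backup_runai_cluster` with no argument writes to a new timestamped directory; passing a path writes there instead. Copying the resulting directory off the cluster is still a separate, manual step.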

Restore the Cluster

Follow the steps below to restore the NVIDIA Run:ai cluster on a new Kubernetes environment.

Prerequisites

Before restoring the NVIDIA Run:ai cluster, verify that it is both uninstalled and disconnected.

  1. If the Kubernetes cluster is still available, uninstall the NVIDIA Run:ai cluster. Make sure not to remove the cluster from the control plane.

  2. Navigate to the Clusters grid in the NVIDIA Run:ai UI.

  3. Locate the cluster and verify its status is Disconnected.
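If the old Kubernetes cluster is still reachable, the uninstall in step 1 can also be checked from the command line before looking at the UI. The sketch below assumes the release name `runai-cluster` and namespace `runai` used in the backup commands; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: succeed only when no Helm release named runai-cluster exists in
# the runai namespace. runai_cluster_is_uninstalled is a hypothetical name.
runai_cluster_is_uninstalled() {
  # helm status exits non-zero when the release cannot be found.
  if helm status runai-cluster -n runai >/dev/null 2>&1; then
    echo "release runai-cluster is still installed in namespace runai" >&2
    return 1
  fi
  return 0
}
```

`helm status` also exits non-zero when the cluster cannot be reached at all, so this check is only meaningful against a reachable cluster. It does not replace verifying the Disconnected status in the Clusters grid.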

Re-install the Cluster

  1. Follow the NVIDIA Run:ai cluster installation instructions and ensure all prerequisites are met.

  2. If you have a backup of the cluster configurations, reload it once the installation is complete:

    kubectl apply -f runaiconfig_backup.yaml -n runai
  3. Navigate to the Clusters grid in the NVIDIA Run:ai UI.

  4. Locate the cluster and verify its status is Connected.
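Because the configuration backup is optional, step 2 can be wrapped so that it is skipped cleanly when no backup file exists. A minimal sketch, assuming the file name from the backup procedure; the function name `restore_runaiconfig` is hypothetical.

```shell
#!/bin/sh
# Sketch: apply the saved runaiconfig only if a non-empty backup file exists.
# restore_runaiconfig is a hypothetical name.
restore_runaiconfig() {
  backup_file="${1:-runaiconfig_backup.yaml}"

  # No backup is a valid state: the backup procedure is optional.
  if [ ! -s "$backup_file" ]; then
    echo "no backup found at $backup_file, skipping" >&2
    return 0
  fi

  # Same command as in step 2 above.
  kubectl apply -f "$backup_file" -n runai
}
```

Run it after the installation in step 1 has completed, then continue with steps 3 and 4 to confirm the cluster is Connected.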

Restore Namespace and RoleBindings

If your cluster configuration disables automatic namespace creation for projects, you must manually:

  • Re-create each project namespace

  • Reapply the required role bindings for access control

For more information, see Advanced cluster configurations.
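The two manual tasks above could be scripted along the following lines. This is a sketch under stated assumptions: the function name, the namespace names, and the role bindings manifest are all hypothetical, and the manifest is assumed to omit `metadata.namespace` so that it can be applied to each namespace with `-n`. Anything else your configuration requires for a project namespace is not covered here.

```shell
#!/bin/sh
# Sketch: re-create project namespaces and reapply role bindings.
# Usage: recreate_project_namespaces <rolebindings-manifest> <namespace>...
# The function name and both arguments are hypothetical placeholders.
recreate_project_namespaces() {
  rolebindings_file="$1"
  shift

  for ns in "$@"; do
    # Idempotent create: render the Namespace locally, then apply it,
    # so an already existing namespace does not cause an error.
    kubectl create namespace "$ns" --dry-run=client -o yaml \
      | kubectl apply -f - || return 1

    # Reapply the access-control role bindings inside that namespace.
    kubectl apply -f "$rolebindings_file" -n "$ns" || return 1
  done
}
```

For example, `recreate_project_namespaces rolebindings.yaml runai-team-a runai-team-b` would process two namespaces in order and stop at the first failure.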
