Cluster Restore

This section explains how to restore an NVIDIA Run:ai cluster on a different Kubernetes environment.

After a critical Kubernetes failure, or when migrating an NVIDIA Run:ai cluster to a new Kubernetes environment, reinstall the NVIDIA Run:ai cluster. Once the cluster is reinstalled and reconnected, projects, workloads, and other cluster data are synced automatically.

NVIDIA Run:ai advanced cluster configurations and customized deployment configurations are stored locally on the Kubernetes cluster. Backing up and restoring them is optional and is done separately.

Back Up the Cluster

Backing up data is not required. The backup procedure is optional and applies only to advanced deployments, as explained above.

Save Cluster Configurations

To back up the NVIDIA Run:ai cluster configurations:

  1. Run the following command in your terminal:

    kubectl get runaiconfig runai -n runai -o yaml -o=jsonpath='{.spec}' > runaiconfig_backup.yaml
  2. Once the runaiconfig_backup.yaml backup file is created, save it externally so that it can be retrieved later.
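
Step 2 can be scripted so that an empty or missing backup file is caught before it is relied on. The following is a minimal sketch, not part of the NVIDIA Run:ai tooling: the function name save_backup and the destination directory are hypothetical, and any external location (object storage, a mounted share) could take the directory's place.

```shell
# Hypothetical helper: confirm the backup file is non-empty, then copy it
# to a directory outside the Kubernetes cluster.
# Arguments: $1 = backup file, $2 = destination directory (assumed to be external).
save_backup() {
    backup_file="$1"
    backup_dir="$2"

    # -s is true only if the file exists and its size is greater than zero.
    if [ ! -s "$backup_file" ]; then
        echo "backup file '$backup_file' is missing or empty" >&2
        return 1
    fi

    mkdir -p "$backup_dir" || return 1
    cp "$backup_file" "$backup_dir/" || return 1
    echo "saved $backup_file to $backup_dir"
}
```

For example, save_backup runaiconfig_backup.yaml /mnt/external/runai-backups would copy the file created in step 1, where the path is a placeholder.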

Restore the Cluster

Follow the steps below to restore the NVIDIA Run:ai cluster on a new Kubernetes environment.

Prerequisites

Before restoring the NVIDIA Run:ai cluster, make sure that it is both uninstalled and disconnected.

  1. If the Kubernetes cluster is still available, uninstall the NVIDIA Run:ai cluster. Make sure not to remove the cluster from the control plane.

  2. Navigate to the Clusters grid in the NVIDIA Run:ai UI.

  3. Locate the cluster and verify that its status is Disconnected.
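
If the old Kubernetes cluster is still reachable, the uninstall in step 1 can also be spot-checked from the command line. The sketch below assumes that an empty runai namespace is a reasonable proxy for "uninstalled". That is an assumption rather than an official check, and the function name runai_is_uninstalled is hypothetical. The Disconnected status in the UI remains the check described above.

```shell
# Hypothetical helper: report whether any pods remain in the Run:ai namespace.
# Returns 0 if none remain, 1 if pods are still present, 2 if the query failed.
# Assumes kubectl is pointed at the OLD Kubernetes cluster.
runai_is_uninstalled() {
    namespace="${1:-runai}"

    # With -o name, kubectl prints one line per pod and nothing when there are none.
    if ! pods=$(kubectl get pods -n "$namespace" -o name); then
        echo "could not query namespace '$namespace'" >&2
        return 2
    fi

    if [ -n "$pods" ]; then
        echo "pods still present in namespace '$namespace':" >&2
        echo "$pods" >&2
        return 1
    fi

    echo "no pods found in namespace '$namespace'"
}
```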

Reinstall the Cluster

  1. Follow the NVIDIA Run:ai cluster installation instructions and ensure that all prerequisites are met.

  2. If you have a backup of the cluster configurations, reload it once the installation is complete:

    kubectl apply -f runaiconfig_backup.yaml -n runai
  3. Navigate to the Clusters grid in the NVIDIA Run:ai UI.

  4. Locate the cluster and verify that its status is Connected.
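
After step 2, it can be useful to confirm from the command line that the runaiconfig resource exists on the new cluster. This is a minimal sketch under the assumption that the resource is named runai in the runai namespace, as in the backup command. The function name runaiconfig_present is hypothetical, and the check does not replace verifying the Connected status in the UI.

```shell
# Hypothetical helper: confirm that the runaiconfig resource named "runai"
# exists in the given namespace on the NEW Kubernetes cluster.
runaiconfig_present() {
    namespace="${1:-runai}"

    if kubectl get runaiconfig runai -n "$namespace" -o name >/dev/null 2>&1; then
        echo "runaiconfig 'runai' found in namespace '$namespace'"
    else
        echo "runaiconfig 'runai' not found in namespace '$namespace'" >&2
        return 1
    fi
}
```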

Restore Namespaces and RoleBindings

If your cluster configuration disables automatic namespace creation for projects, you must manually:

  • Re-create each project namespace

  • Reapply the required role bindings for access control

For more information, see Advanced cluster configurations.
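
The two manual steps above can be sketched with standard kubectl commands. Everything specific here is an assumption for illustration: the function name restore_project_namespace, the namespace and user names, and the use of the built-in edit ClusterRole. The role bindings actually required, and any labels needed to associate a namespace with its project, depend on your deployment and are not shown.

```shell
# Hypothetical helper: re-create one project namespace and bind one user to a
# ClusterRole inside it.
# Arguments: $1 = namespace, $2 = user, $3 = ClusterRole (defaults to "edit").
restore_project_namespace() {
    namespace="$1"
    user="$2"
    role="${3:-edit}"

    if [ -z "$namespace" ] || [ -z "$user" ]; then
        echo "usage: restore_project_namespace NAMESPACE USER [CLUSTERROLE]" >&2
        return 2
    fi

    # Create the namespace only if it does not already exist.
    if ! kubectl get namespace "$namespace" >/dev/null 2>&1; then
        kubectl create namespace "$namespace" || return 1
    fi

    # A RoleBinding grants the ClusterRole's permissions within this namespace only.
    kubectl create rolebinding "$user-$role" \
        --clusterrole="$role" --user="$user" -n "$namespace"
}
```

For example, restore_project_namespace runai-team-a alice would create the placeholder namespace runai-team-a and bind the placeholder user alice to the edit role in it.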
