Cluster Restore
This section explains how to restore an NVIDIA Run:ai cluster on a different Kubernetes environment.
After a critical Kubernetes failure, or when you want to migrate an NVIDIA Run:ai cluster to a new Kubernetes environment, reinstall the NVIDIA Run:ai cluster. Once the cluster is reinstalled and reconnected, projects, workloads, and other cluster data are synced automatically.
Advanced NVIDIA Run:ai cluster configurations are stored locally on the Kubernetes cluster. Backing them up and restoring them is optional and is done separately.
Back Up the Cluster
Because cluster data is synced automatically, a data backup is not required. The backup procedure below is optional and applies only to deployments that use advanced cluster configurations, as explained above.
Save Cluster Configurations
To back up the NVIDIA Run:ai cluster configurations, save both the Helm values and the runtime configuration (runaiconfig).
Back up Helm values - Run the following command to export the Helm values used for deployment:
helm get values runai-cluster -n runai > runai_cluster_values_backup.yaml
Back up the runtime configuration (runaiconfig) - Run the following command to export the active runtime configuration:
kubectl get runaiconfig runai -n runai -o yaml -o=jsonpath='{.spec}' > runaiconfig_backup.yaml
Save both backup files (runai_cluster_values_backup.yaml and runaiconfig_backup.yaml) externally so they can be retrieved later if needed.
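The two export commands can be wrapped in one helper so both files land in the same directory and an empty export is caught early. This is a minimal sketch, not part of the official procedure: the function name backup_runai_config and its destination-directory argument are hypothetical, while the helm and kubectl commands are the ones shown in this section.

```shell
#!/usr/bin/env bash
# Sketch: export both Run:ai cluster configuration backups into one directory.
# backup_runai_config and its directory argument are hypothetical names.
backup_runai_config() {
  local dest="${1:-runai-backup}"
  local f
  mkdir -p "$dest" || return 1
  # Helm values used for the deployment (command taken from this section).
  helm get values runai-cluster -n runai \
    > "$dest/runai_cluster_values_backup.yaml" || return 1
  # Active runtime configuration (command taken from this section).
  kubectl get runaiconfig runai -n runai -o yaml -o=jsonpath='{.spec}' \
    > "$dest/runaiconfig_backup.yaml" || return 1
  # Do not report success if either export came back empty.
  for f in "$dest/runai_cluster_values_backup.yaml" "$dest/runaiconfig_backup.yaml"; do
    [ -s "$f" ] || { echo "empty backup: $f" >&2; return 1; }
  done
  echo "backups written to $dest"
}
```

Calling backup_runai_config /path/to/external/mount writes both files under that path; copying the directory off the cluster host is still up to you.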
Restore the Cluster
Follow the steps below to restore the NVIDIA Run:ai cluster on a new Kubernetes environment.
Prerequisites
Before restoring the NVIDIA Run:ai cluster, verify that it is both uninstalled and disconnected.
If the Kubernetes cluster is still available, uninstall the NVIDIA Run:ai cluster. Make sure not to remove the cluster from the control plane.
Navigate to the Clusters grid in the NVIDIA Run:ai UI
Locate the cluster and verify its status is Disconnected
Re-install the Cluster
Follow the NVIDIA Run:ai cluster installation instructions and ensure all prerequisites are met
If you have a backup of the cluster configurations, reload it once the installation is complete:
kubectl apply -f runaiconfig_backup.yaml -n runai
Navigate to the Clusters grid in the NVIDIA Run:ai UI
Locate the cluster and verify its status is Connected
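Because the configuration backup is optional, the reload step can be written so that it is skipped when no backup file exists. The following is a sketch under that assumption: restore_runai_config is a hypothetical helper name, and the kubectl apply command is the one given in the step above.

```shell
#!/usr/bin/env bash
# Sketch: reapply the saved runtime configuration only if a backup exists.
# restore_runai_config is a hypothetical helper name.
restore_runai_config() {
  local backup="${1:-runaiconfig_backup.yaml}"
  if [ ! -s "$backup" ]; then
    # No backup (or an empty file): nothing to reload, which is not an error.
    echo "no backup found at $backup, skipping" >&2
    return 0
  fi
  # Command taken from the reinstall step in this section.
  kubectl apply -f "$backup" -n runai
}
```

Run it after the installation completes, for example restore_runai_config ./runaiconfig_backup.yaml, and then check the cluster status in the Clusters grid as described.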
Restore Namespace and RoleBindings
If your cluster configuration disables automatic namespace creation for projects, you must manually:
Re-create each project namespace
Reapply the required role bindings for access control
For more information, see Advanced cluster configurations.
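Re-creating the namespaces can be scripted from a list kept alongside the backups. This is a rough sketch only: recreate_project_namespaces and the one-name-per-line list file are hypothetical, and the namespace names must match whatever your projects used before. Role bindings depend on your access-control setup and are not covered here.

```shell
#!/usr/bin/env bash
# Sketch: re-create project namespaces listed one per line in a text file.
# recreate_project_namespaces and the list file format are hypothetical.
recreate_project_namespaces() {
  local list="$1"
  local ns
  while IFS= read -r ns; do
    # Skip blank lines and comment lines.
    case "$ns" in ''|'#'*) continue ;; esac
    # Generate the manifest client-side and apply it, so the step is
    # safe to repeat if the namespace already exists.
    kubectl create namespace "$ns" --dry-run=client -o yaml < /dev/null \
      | kubectl apply -f - || return 1
  done < "$list"
}
```

For example, recreate_project_namespaces project_namespaces.txt processes each listed namespace in order and stops at the first failure.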