Cluster Restore
This section explains how to restore an NVIDIA Run:ai cluster on a different Kubernetes environment.
In the event of a critical Kubernetes failure, or if you want to migrate an NVIDIA Run:ai cluster to a new Kubernetes environment, simply reinstall the NVIDIA Run:ai cluster. Once the cluster is reinstalled and reconnected, projects, workloads, and other cluster data are synced automatically.
Advanced features and customized deployment configurations are stored locally on the Kubernetes cluster. Backing them up and restoring them is optional, and is done separately from the automatic sync.
Backup
Because cluster data is synced automatically, a backup is not required. The backup procedure below is optional and relevant only for the advanced deployments described above.
Backup Cluster Configurations
To back up NVIDIA Run:ai cluster configurations:
Run the following command in your terminal:
kubectl get runaiconfig runai -n runai -o yaml -o=jsonpath='{.spec}' > runaiconfig_backup.yaml
Once the runaiconfig_backup.yaml backup file is created, save it externally so that it can be retrieved later.
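For example, the backup file can be copied to external object storage; the bucket name below is a hypothetical placeholder:
aws s3 cp runaiconfig_backup.yaml s3://<your-backup-bucket>/runai/runaiconfig_backup.yaml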
Restore
Follow the steps below to restore the NVIDIA Run:ai cluster on a new Kubernetes environment.
Prerequisites
Before restoring the NVIDIA Run:ai cluster, validate that the cluster is both disconnected and uninstalled.
If the Kubernetes cluster is still available, uninstall the NVIDIA Run:ai cluster (see the Helm sketch after this list) - make sure not to remove the cluster from the Control Plane
Navigate to the Cluster page in the NVIDIA Run:ai platform
Search for the cluster, and make sure its status is Disconnected
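For the uninstall step above, a minimal Helm sketch, assuming the cluster was installed under the default release name runai-cluster in the runai namespace (adjust to your installation):
helm uninstall runai-cluster -n runai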
Re-installing NVIDIA Run:ai Cluster
Follow the NVIDIA Run:ai cluster installation instructions and ensure all prerequisites are met
If you have a back-up of the cluster configurations, reload it once the installation is complete
kubectl apply -f runaiconfig_backup.yaml -n runai
Navigate to the Cluster page in the NVIDIA Run:ai platform
Search for the cluster, and make sure its status is Connected
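To confirm that the reloaded configuration and the cluster components are in place, standard kubectl checks such as the following can help; the runai namespace assumes a default installation:
kubectl get runaiconfig runai -n runai -o yaml
kubectl get pods -n runai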
NVIDIA Run:ai Control Plane
Database Storage
By default, NVIDIA Run:ai uses an internal PostgreSQL database. The database is stored on a Kubernetes PersistentVolume (PV). You must provide a backup solution for the database. Some options include:
Backing up PostgreSQL itself (a matching restore sketch follows this list). For example,
kubectl -n runai-backend exec -it runai-backend-postgresql-0 -- env PGPASSWORD=password pg_dump -U postgres backend > cluster_name_db_backup.sql
Backing up the PersistentVolume holding the database storage.
Using third-party backup solutions.
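If a dump taken this way later needs to be loaded back into the internal database, a hedged restore sketch that mirrors the pg_dump example above (the pod name, user, password, and file name are the same placeholders) is:
kubectl -n runai-backend exec -i runai-backend-postgresql-0 -- env PGPASSWORD=password psql -U postgres backend < cluster_name_db_backup.sql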
NVIDIA Run:ai also supports an external PostgreSQL database. For details, see external PostgreSQL database.
Metrics Storage
NVIDIA Run:ai stores metrics history using Thanos. Thanos is configured to store data on a persistent volume. The recommendation is to back up the PV.
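As a starting point, list the persistent volume claims in the control plane namespace to identify the Thanos storage:
kubectl get pvc -n runai-backend
If your storage driver supports CSI snapshots, a VolumeSnapshot manifest along the following lines can capture the volume; the snapshot class and claim names are placeholders that depend on your storage setup:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: runai-metrics-snapshot
  namespace: runai-backend
spec:
  volumeSnapshotClassName: <your-snapshot-class>
  source:
    persistentVolumeClaimName: <thanos-pvc-name>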
Backing up Control Plane Configuration
The installation of the NVIDIA Run:ai control plane can be customized. Customizations are provided as --set flags in the Helm installation command. These changes are preserved on upgrade, but are not preserved on uninstall or if the Kubernetes cluster is damaged, so it is best to back them up. For a list of customizations used during the installation, run:
helm get values runai-backend -n runai-backend
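To keep a copy of these values for reuse during recovery, the output can be redirected to a file, for example:
helm get values runai-backend -n runai-backend -o yaml > control_plane_values_backup.yaml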
Recovery
To recover NVIDIA Run:ai:
Re-create the Kubernetes/OpenShift cluster.
Recover the persistent volumes for the database and metrics.
Re-install the NVIDIA Run:ai control plane (a Helm sketch is shown after this list). Use the additional configuration previously saved and connect to the restored PostgreSQL PV. Connect Prometheus to the stored metrics PV.
Re-install the cluster. Add additional configurations post-install.
If the cluster is configured such that projects do not create a namespace automatically, you will need to re-create namespaces and apply role bindings as discussed in Kubernetes or OpenShift.
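For the control plane re-installation step above, a hedged Helm sketch, assuming the customizations were saved to control_plane_values_backup.yaml as described earlier (the chart reference and release name follow a typical NVIDIA Run:ai control plane installation and may differ in your environment):
helm upgrade -i runai-backend runai-backend/control-plane -n runai-backend --create-namespace -f control_plane_values_backup.yaml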