High Availability

This guide outlines best practices for configuring the NVIDIA Run:ai platform for high availability, so that service continues during system failures and under heavy load. The goal is to reduce downtime and eliminate single points of failure by combining Kubernetes best practices with NVIDIA Run:ai-specific configuration options. The NVIDIA Run:ai platform relies on two fundamental high availability strategies:

  • Use of system nodes - Assigning multiple dedicated nodes to critical system services provides control and resource isolation, and enables system-level scaling.

  • Replication of core and third-party services - Configuring multiple replicas of essential services distributes workloads and reduces single points of failure. If a component fails on one node, requests can be routed to another instance.

System Nodes

The NVIDIA Run:ai platform allows you to dedicate specific nodes (system nodes) exclusively for core platform services. This approach provides improved operational isolation and easier resource management.

Configure at least three system nodes to support high availability. If only a single node is used for core services, the replicas of horizontally scaled components cannot be distributed across nodes, and that node becomes a single point of failure. See NVIDIA Run:ai system nodes for more details.
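The three-node minimum above can be checked mechanically once node labels are known. The following is a minimal sketch rather than part of the NVIDIA Run:ai tooling: it takes a plain mapping of node names to labels (for example, exported from the cluster by any means) and counts the nodes marked as system nodes. The label key `node-role.kubernetes.io/runai-system` and the helper names are assumptions made for illustration; confirm the actual label in the NVIDIA Run:ai system nodes documentation.

```python
from typing import Dict, List

# Assumed label key, used here for illustration only; confirm the real key
# in the NVIDIA Run:ai system nodes documentation.
SYSTEM_LABEL_KEY = "node-role.kubernetes.io/runai-system"

# Minimum number of system nodes recommended by this guide.
MIN_SYSTEM_NODES = 3


def system_nodes(
    node_labels: Dict[str, Dict[str, str]],
    label_key: str = SYSTEM_LABEL_KEY,
) -> List[str]:
    """Return the sorted names of nodes labeled as system nodes."""
    return sorted(
        name
        for name, labels in node_labels.items()
        if labels.get(label_key) == "true"
    )


def meets_ha_minimum(
    node_labels: Dict[str, Dict[str, str]],
    label_key: str = SYSTEM_LABEL_KEY,
    minimum: int = MIN_SYSTEM_NODES,
) -> bool:
    """Return True if at least `minimum` nodes carry the system-node label."""
    return len(system_nodes(node_labels, label_key)) >= minimum
```

Passing the labels as a dictionary keeps the check independent of any particular Kubernetes client; the same function works on data gathered from a script, a CI job, or a test fixture.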

Service Replicas

By default, NVIDIA Run:ai cluster services are deployed with a single replica. To achieve high availability, configure multiple replicas for core NVIDIA Run:ai services. For more information, see NVIDIA Run:ai services replicas.

Note: Some NVIDIA Run:ai services do not have a replicas configuration. These will always run a single replica, and their recovery time after failure is tied to pod restart and rescheduling time.
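To see which services remain single points of failure after replicas are configured, it can help to compare each service's replica placement against the nodes hosting it. The sketch below is illustrative only and is not an NVIDIA Run:ai API: it assumes a mapping from service name to the list of nodes running its replicas, and the service and node names are hypothetical.

```python
from typing import Dict, List


def single_points_of_failure(placements: Dict[str, List[str]]) -> List[str]:
    """Return services whose replicas run on fewer than two distinct nodes.

    `placements` maps a service name to the node name of each of its
    replicas. A service is flagged if it has one replica, or if all of
    its replicas share the same node.
    """
    return sorted(
        service
        for service, nodes in placements.items()
        if len(set(nodes)) < 2
    )


# Hypothetical placement data for illustration.
example = {
    "service-a": ["node-1", "node-2", "node-3"],  # spread across nodes
    "service-b": ["node-1"],                      # single replica
    "service-c": ["node-2", "node-2"],            # replicas share one node
}
flagged = single_points_of_failure(example)
```

In this example, `flagged` contains `service-b` and `service-c`: the first has only one replica, and the second has two replicas that would both be lost if `node-2` failed. Services without a replicas configuration will always appear in this list.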
