System requirements
The NVIDIA Run:ai cluster is a Kubernetes application. This section describes the hardware and software requirements for the NVIDIA Run:ai cluster.
Hardware requirements
The following hardware requirements apply to the Kubernetes cluster nodes. By default, all NVIDIA Run:ai cluster services run on all available nodes. For production deployments, you may want to set node roles to separate system and worker nodes, reduce downtime, and save CPU cycles on expensive GPU machines.
NVIDIA Run:ai cluster - system nodes
This configuration is the minimum required to install and use the NVIDIA Run:ai cluster:
CPU: 10 cores
Memory: 20GB
Disk space: 50GB
Note
To designate nodes for NVIDIA Run:ai system services, follow the instructions in System nodes.
NVIDIA Run:ai cluster - worker nodes
The NVIDIA Run:ai cluster supports x86 and ARM (see the below note) CPUs, and NVIDIA GPUs from the T, V, A, L, H, B and GH architecture families. For the list of supported GPU models, see Supported NVIDIA Data Center GPUs and Systems.
The following configuration represents the minimum hardware requirements for installing and operating the NVIDIA Run:ai cluster on worker nodes. Each node must meet these specifications:
CPU: 2 cores
Memory: 4GB
Note
NVIDIA Run:ai supports AMD CPUs for all supported versions and ARM CPUs starting from v2.19. Using ARM CPUs may require minor additional handling. Please contact NVIDIA Run:ai support for assistance.
To designate nodes for NVIDIA Run:ai workloads, follow the instructions in Worker nodes.
Shared storage
NVIDIA Run:ai workloads must be able to access data uniformly from any worker node, in order to read training data and code and to save checkpoints, weights, and other machine-learning artifacts.
Typical solutions are Network File System (NFS) and Network-Attached Storage (NAS). The NVIDIA Run:ai cluster supports both; for more information, see Shared storage.
Software requirements
The following software requirements must be fulfilled on the Kubernetes cluster.
Operating system
Any Linux operating system supported by both Kubernetes and NVIDIA GPU Operator
NVIDIA Run:ai cluster on Google Kubernetes Engine (GKE) supports both Ubuntu and Container Optimized OS (COS). COS is supported only with NVIDIA GPU Operator 24.6 or newer, and NVIDIA Run:ai cluster version 2.19 or newer. NVIDIA Run:ai cluster on Oracle Kubernetes Engine (OKE) supports only Ubuntu.
Internal testing is performed on Ubuntu 22.04, and on CoreOS for OpenShift.
Kubernetes distribution
NVIDIA Run:ai cluster requires Kubernetes. The following Kubernetes distributions are supported:
Vanilla Kubernetes
OpenShift Container Platform (OCP)
NVIDIA Base Command Manager (BCM)
Amazon Elastic Kubernetes Service (EKS)
Google Kubernetes Engine (GKE)
Azure Kubernetes Service (AKS)
Oracle Kubernetes Engine (OKE)
Rancher Kubernetes Engine (RKE1)
Rancher Kubernetes Engine 2 (RKE2)
Note
The latest release of the NVIDIA Run:ai cluster supports Kubernetes 1.30 to 1.32 and OpenShift 4.14 to 4.18.
For existing Kubernetes clusters, see the following Kubernetes version support matrix for the latest NVIDIA Run:ai cluster releases:
v2.17: Kubernetes 1.27 to 1.29, OpenShift 4.12 to 4.15
v2.18: Kubernetes 1.28 to 1.30, OpenShift 4.12 to 4.16
v2.19: Kubernetes 1.28 to 1.31, OpenShift 4.12 to 4.17
v2.20: Kubernetes 1.29 to 1.32, OpenShift 4.14 to 4.17
For managed Kubernetes services, consult the release notes provided by your Kubernetes service provider to confirm which underlying Kubernetes versions are supported and ensure compatibility with NVIDIA Run:ai. For an up-to-date end-of-life statement, see Kubernetes Release History or the OpenShift Container Platform Life Cycle Policy.
Container runtime
NVIDIA Run:ai supports the following container runtimes. Make sure your Kubernetes cluster is configured with one of these runtimes:
Containerd (default in Kubernetes)
CRI-O (default in OpenShift)
Kubernetes pod security admission
NVIDIA Run:ai supports the restricted policy for Pod Security Admission (PSA) on OpenShift only. Other Kubernetes distributions are supported only with the privileged policy.
For NVIDIA Run:ai on OpenShift to run with the PSA restricted policy:
Label the runai namespace as described in Pod Security Admission with the required labels (see the sketch after these steps).
The workloads submitted through NVIDIA Run:ai should comply with the restrictions of the PSA restricted policy. This can be enforced using Policies.
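A minimal sketch of the labeling step using the standard Pod Security Admission namespace labels. The label values shown are an assumption; confirm the exact labels and values against the Pod Security Admission documentation referenced above before applying.

```bash
# Sketch only: label the runai namespace with PSA labels.
# The enforcement level shown here is an assumption; verify it against the
# Pod Security Admission guide for your NVIDIA Run:ai version.
kubectl label namespace runai \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/audit=privileged \
  pod-security.kubernetes.io/warn=privileged
```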
NVIDIA Run:ai namespace
NVIDIA Run:ai cluster must be installed in a namespace named runai. Create the namespace by running:
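For example, with kubectl (assuming kubectl is already configured against the target cluster):

```bash
# Create the namespace required by the NVIDIA Run:ai cluster installation.
kubectl create ns runai
```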
Kubernetes ingress controller
NVIDIA Run:ai cluster requires Kubernetes Ingress Controller to be installed on the Kubernetes cluster.
OpenShift, RKE, and RKE2 come with a pre-installed ingress controller.
Internal testing is performed on NGINX, Rancher NGINX, OpenShift Router, and Istio.
Make sure that a default ingress controller is set.
There are many ways to install and configure an ingress controller. A simple example of installing and configuring the NGINX ingress controller using Helm:
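A basic sketch assuming the public ingress-nginx Helm chart; values such as the service type, external IPs, and default ingress class depend on your environment.

```bash
# Install the NGINX ingress controller and mark its ingress class as the cluster default.
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
  --namespace nginx-ingress --create-namespace \
  --set controller.ingressClassResource.default=true
```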
NVIDIA GPU Operator
The NVIDIA Run:ai cluster requires the NVIDIA GPU Operator to be installed on the Kubernetes cluster. Versions 22.9 to 25.3 are supported.
See Installing the NVIDIA GPU Operator, then review the notes below:
Use the default gpu-operator namespace. Otherwise, you must specify the target namespace using the flag runai-operator.config.nvidiaDcgmExporter.namespace, as described in customized cluster installation.
NVIDIA drivers may already be installed on the nodes. In such cases, install the NVIDIA GPU Operator with the flag --set driver.enabled=false. DGX OS is one such example, as it comes bundled with NVIDIA drivers.
For distribution-specific additional instructions, see below:
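As an illustration of the driver.enabled flag mentioned above, a minimal sketch of installing the GPU Operator with Helm on nodes where NVIDIA drivers are preinstalled (for example, DGX OS); the chart version and other values depend on your environment.

```bash
# Install the NVIDIA GPU Operator without deploying the driver container,
# for nodes that already have NVIDIA drivers installed.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false
```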
For troubleshooting information, see the NVIDIA GPU Operator Troubleshooting Guide.
Prometheus
NVIDIA Run:ai cluster requires Prometheus to be installed on the Kubernetes cluster.
OpenShift comes with Prometheus pre-installed.
For RKE2, see the Enable Monitoring instructions to install Prometheus.
There are many ways to install Prometheus. A simple example is to install the community kube-prometheus-stack using Helm by running the following commands:
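A sketch assuming the community prometheus-community Helm repository; the release name and target namespace are illustrative.

```bash
# Install the community kube-prometheus-stack chart into a dedicated namespace.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace
```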
Additional software requirements
Additional NVIDIA Run:ai capabilities, namely distributed training and inference, require additional Kubernetes applications (frameworks) to be installed on the cluster.
Distributed training
Distributed training enables training of AI models over multiple nodes. This requires installing a distributed training framework on the cluster. The following frameworks are supported: TensorFlow, PyTorch, XGBoost, and MPI.
There are several ways to install each framework. A simple installation method is the Kubeflow Training Operator, which includes TensorFlow, PyTorch, and XGBoost.
It is recommended to use Kubeflow Training Operator v1.8.1, and MPI Operator v0.6.0 or later for compatibility with advanced workload capabilities, such as Stopping a workload and Scheduling rules.
To install the Kubeflow Training Operator for TensorFlow, PyTorch and XGBoost frameworks, run the following command:
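A sketch based on the Kubeflow Training Operator's standalone manifests; the pinned version tag is an assumption and should match the recommendation above.

```bash
# Deploy the Kubeflow Training Operator (standalone overlay) pinned to v1.8.1.
kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.1"
```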
To install the MPI Operator for MPI v2, run the following command:
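A sketch using the MPI Operator's released v2beta1 manifest; the version in the URL is an assumption and should match the recommendation above.

```bash
# Deploy the MPI Operator (MPI v2 API) from its released manifest.
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
```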
Note
If you require both the MPI Operator and Kubeflow Training Operator, follow the steps below:
Install the Kubeflow Training Operator as described above.
Disable and delete MPI v1 in the Kubeflow Training Operator by running the commands shown in the sketch after these steps.
Install the MPI Operator as described above.
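A sketch of the second step above, assuming the Training Operator's default kubeflow namespace and deployment name; verify the container arguments for your installation before applying.

```bash
# Restrict the Training Operator to the TFJob, PyTorchJob, and XGBoostJob schemes
# (dropping MPI v1), then remove the MPI v1 CRD so the standalone MPI Operator can own MPIJobs.
kubectl patch deployment training-operator -n kubeflow --type json -p \
  '[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--enable-scheme=tfjob", "--enable-scheme=pytorchjob", "--enable-scheme=xgboostjob"]}]'
kubectl delete crd mpijobs.kubeflow.org
```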
Inference
Inference enables serving of AI models. This requires the Knative Serving framework to be installed on the cluster. Knative versions 1.11 to 1.16 are supported.
Follow the Installing Knative instructions. After installation, configure Knative to use the NVIDIA Run:ai scheduler and features by running:
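A sketch of the configuration step, assuming Knative Serving's config-features ConfigMap; the exact set of feature flags to enable should be confirmed against the NVIDIA Run:ai documentation for your version.

```bash
# Enable Knative Serving feature flags used by NVIDIA Run:ai scheduling and workloads.
kubectl patch configmap/config-features \
  --namespace knative-serving \
  --type merge \
  --patch '{"data": {
    "kubernetes.podspec-schedulername": "enabled",
    "kubernetes.podspec-affinity": "enabled",
    "kubernetes.podspec-tolerations": "enabled",
    "kubernetes.podspec-volumes-emptydir": "enabled",
    "kubernetes.podspec-securitycontext": "enabled",
    "kubernetes.podspec-persistent-volume-claim": "enabled",
    "kubernetes.podspec-persistent-volume-write": "enabled",
    "kubernetes.podspec-init-containers": "enabled",
    "multi-container": "enabled"
  }}'
```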
Knative autoscaling
NVIDIA Run:ai allows autoscaling a deployment according to the following metrics:
Latency (milliseconds)
Throughput (requests/sec)
Concurrency (requests)
Using a custom metric (for example, Latency) requires installing Horizontal Pod Autoscaler (HPA) support for Knative Serving. Use the following command to install it. Make sure to update the VERSION in the below command with a supported Knative version.
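A sketch based on the Knative Serving release artifacts, assuming the serving-hpa.yaml add-on provides the HPA-class autoscaling support; set VERSION to your supported Knative version.

```bash
# Install HPA-class autoscaling support for Knative Serving.
VERSION=1.16.0   # replace with a supported Knative version
kubectl apply -f "https://github.com/knative/serving/releases/download/knative-v${VERSION}/serving-hpa.yaml"
```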
Fully Qualified Domain Name (FQDN)
You must have a Fully Qualified Domain Name (FQDN) to install the NVIDIA Run:ai control plane (for example, runai.mycorp.local). This cannot be an IP address. The domain name must be accessible inside the organization's private network.
TLS certificates
You must have a TLS certificate that is associated with the FQDN for HTTPS access. Create a Kubernetes Secret named runai-cluster-domain-tls-secret in the runai namespace, providing the path to the TLS certificate (--cert) and its corresponding private key (--key), by running the following:
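A sketch of the command, assuming the certificate and key are available as local PEM files; the file paths are placeholders.

```bash
# Create the TLS secret for the cluster FQDN in the runai namespace.
# Replace the --cert and --key paths with your certificate chain and private key files.
kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
  --cert /path/to/fullchain.pem \
  --key /path/to/private.pem
```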