Cluster System Requirements

After provisioning the tenant’s Kubernetes cluster, the next step is to install the required system components to support the NVIDIA Run:ai platform.

The NVIDIA Run:ai cluster is deployed as a Kubernetes application. This section outlines the minimum hardware and software requirements that must be installed and configured on each tenant cluster before deploying the NVIDIA Run:ai components.

Hardware Requirements

The following hardware requirements apply to the Kubernetes cluster nodes. By default, all NVIDIA Run:ai cluster services run on all available nodes. For production deployments, consider setting node roles to separate system nodes from worker nodes, which reduces downtime and saves CPU cycles on expensive GPU machines.

Architecture

  • x86 - Supported for Kubernetes and OpenShift.

  • ARM - Supported for Kubernetes and OpenShift.

NVIDIA Run:ai Cluster - System Nodes

This configuration is the minimum required to install and use the NVIDIA Run:ai cluster.

Component     Required Capacity
CPU           10 cores
Memory        20GB
Disk space    50GB

Note

To designate nodes to NVIDIA Run:ai system services, follow the instructions as described in System nodes.

NVIDIA Run:ai Cluster - Worker Nodes

The NVIDIA Run:ai cluster supports x86 and ARM CPUs, and any NVIDIA GPUs supported by the NVIDIA GPU Operator. The list of supported GPUs depends on the version of the NVIDIA GPU Operator installed in the cluster. NVIDIA Run:ai supports GPU Operator versions 25.10 to 26.3.

For the list of supported GPUs, see Supported NVIDIA Data Center GPUs and Systems. To install the GPU Operator, see NVIDIA GPU Operator.

Note

  • NVIDIA DGX Spark, NVIDIA Jetson and workstations are not supported.

  • vGPU is not supported. NVIDIA Run:ai currently supports GPU passthrough only.

The following configuration represents the minimum hardware requirements for installing and operating the NVIDIA Run:ai cluster on worker nodes. Each node must meet these specifications:

Component     Required Capacity
CPU           2 cores
Memory        4GB

Note

To designate nodes to NVIDIA Run:ai workloads, follow the instructions as described in Worker nodes.
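
The minimums in the two tables above can be turned into a small pre-flight check. The following is a minimal Python sketch, not an NVIDIA Run:ai tool: the names NodeCapacity, MINIMUMS and shortfalls are hypothetical, and it assumes you already know the capacity of the node you want to check.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NodeCapacity:
    cpu_cores: int
    memory_gb: float
    disk_gb: Optional[float] = None  # the worker table lists no disk minimum


# Minimums copied from the system-node and worker-node tables above.
MINIMUMS = {
    "system": NodeCapacity(cpu_cores=10, memory_gb=20, disk_gb=50),
    "worker": NodeCapacity(cpu_cores=2, memory_gb=4),
}


def shortfalls(role: str, node: NodeCapacity) -> List[str]:
    """Return the resources on which `node` is below the minimum for `role`."""
    required = MINIMUMS[role]
    missing = []
    if node.cpu_cores < required.cpu_cores:
        missing.append("cpu")
    if node.memory_gb < required.memory_gb:
        missing.append("memory")
    # Disk is only checked when the role defines a disk minimum.
    if required.disk_gb is not None and (node.disk_gb or 0) < required.disk_gb:
        missing.append("disk")
    return missing
```

An empty list means the node meets every listed minimum for that role; for example, a node with 8 cores, 32GB of memory and 40GB of disk would be reported as short on CPU and disk for the system role.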

Software Requirements

The following software requirements must be fulfilled on the Kubernetes cluster.

Operating System

  • Any Linux operating system supported by both Kubernetes and NVIDIA GPU Operator

  • NVIDIA Run:ai cluster on Google Kubernetes Engine (GKE) supports both Ubuntu and Container-Optimized OS (COS). COS is supported only with NVIDIA GPU Operator 24.6 or newer, and NVIDIA Run:ai cluster version 2.19 or newer.

  • NVIDIA Run:ai cluster on Elastic Kubernetes Service (EKS) does not support Bottlerocket or Amazon Linux.

  • NVIDIA Run:ai cluster on Oracle Kubernetes Engine (OKE) supports only Ubuntu.

  • NVIDIA Run:ai is tested internally on Ubuntu 22.04 and, for OpenShift, on CoreOS.

Kubernetes Distribution

NVIDIA Run:ai cluster requires Kubernetes. The following Kubernetes distributions are supported:

  • Vanilla Kubernetes

  • OpenShift Container Platform (OCP)

  • NVIDIA Base Command Manager (BCM)

  • Rancher Kubernetes Engine 2 (RKE2)

Note

  • The latest release of the NVIDIA Run:ai cluster supports Kubernetes 1.33 to 1.35 and OpenShift 4.18 to 4.21.

  • For Multi-Node NVLink support (e.g. GB200), Kubernetes 1.32 and above is required.

For existing Kubernetes clusters, see the following Kubernetes version support matrix for the latest NVIDIA Run:ai cluster releases:

NVIDIA Run:ai version    Supported Kubernetes versions    Supported OpenShift versions
v2.25 (latest)           1.33 to 1.35                     4.18 to 4.21
v2.24                    1.33 to 1.35                     4.17 to 4.20
v2.23                    1.31 to 1.34                     4.16 to 4.19
v2.22                    1.31 to 1.33                     4.15 to 4.19

For managed Kubernetes, consult the release notes provided by your Kubernetes service provider to confirm which version of the underlying Kubernetes platform is supported and that it is compatible with NVIDIA Run:ai. For an up-to-date end-of-life statement, see Kubernetes Release History or OpenShift Container Platform Life Cycle Policy.
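
To check an existing cluster against the support matrix above, the ranges can be encoded as data. The following is a minimal Python sketch under the assumption that only the major and minor version matter; SUPPORT_MATRIX, parse_minor and is_supported are hypothetical names, not part of any NVIDIA Run:ai tooling.

```python
# Ranges copied from the support matrix above, as ((major, minor), (major, minor)).
SUPPORT_MATRIX = {
    "2.25": {"kubernetes": ((1, 33), (1, 35)), "openshift": ((4, 18), (4, 21))},
    "2.24": {"kubernetes": ((1, 33), (1, 35)), "openshift": ((4, 17), (4, 20))},
    "2.23": {"kubernetes": ((1, 31), (1, 34)), "openshift": ((4, 16), (4, 19))},
    "2.22": {"kubernetes": ((1, 31), (1, 33)), "openshift": ((4, 15), (4, 19))},
}


def parse_minor(version: str) -> tuple:
    """Turn 'v1.33.4' or '1.33' into (1, 33); the patch level is ignored."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def is_supported(runai_version: str, platform: str, platform_version: str) -> bool:
    """Check a Kubernetes or OpenShift version against the matrix row."""
    low, high = SUPPORT_MATRIX[runai_version.lstrip("v")][platform]
    return low <= parse_minor(platform_version) <= high
```

For example, is_supported("v2.25", "kubernetes", "1.34.1") is true, while is_supported("2.23", "kubernetes", "1.35") is false because v2.23 tops out at Kubernetes 1.34.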

Container Runtime

NVIDIA Run:ai supports the following container runtimes. Make sure your Kubernetes cluster is configured with one of these runtimes:

Kubernetes Pod Security Admission

NVIDIA Run:ai supports the restricted policy for Pod Security Admission (PSA) on OpenShift only. Other Kubernetes distributions are supported only with the privileged policy.

For NVIDIA Run:ai on OpenShift to run with PSA restricted policy:

  • The workloads submitted through NVIDIA Run:ai should comply with the restrictions of PSA restricted policy. This can be enforced using Policies.

NVIDIA Run:ai Namespace

NVIDIA Run:ai must be installed in a namespace or project (OpenShift) called runai. Use the following to create the namespace/project:

Kubernetes Load Balancer

In Kubernetes, services of type LoadBalancer are used to expose applications outside the cluster through a single, stable IP address, providing a consistent entry point for external traffic. In managed cloud environments this capability is built-in, while in self-hosted and on-premise deployments it must be provided explicitly.

MetalLB fulfills this role by allocating external IP addresses from a predefined pool and advertising them on the external network, enabling access to services running inside the cluster.

In NVIDIA Run:ai, this is required to support north-south traffic, including access to the NVIDIA Run:ai control plane, APIs, UI, inference endpoints, and externally exposed development workspaces and training workloads.

  1. Reserve a range of IP addresses (a full /24 subnet is recommended), for example: 172.20.10.0-172.20.10.255

  2. Install MetalLB:

  3. Create a YAML file named metalLB-config.yaml and replace <IPADDRESS-RANGE-START>-<IPADDRESS-RANGE-END> with the reserved range of IP addresses:

  4. Apply the YAML:
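
As a sanity check for step 1, the reserved range can be inspected before it is written into the MetalLB configuration. The following is a minimal Python sketch using only the standard-library ipaddress module; describe_range is a hypothetical helper name, and the START-END string format is assumed to match the placeholder used in step 3.

```python
import ipaddress


def describe_range(ip_range: str):
    """Parse 'START-END' and return (address count, covering CIDR blocks)."""
    start_text, end_text = ip_range.split("-")
    start = ipaddress.ip_address(start_text.strip())
    end = ipaddress.ip_address(end_text.strip())
    if end < start:
        raise ValueError("range end is lower than range start")
    # summarize_address_range yields the smallest set of networks
    # that exactly covers the inclusive range start..end.
    blocks = [str(net) for net in ipaddress.summarize_address_range(start, end)]
    count = int(end) - int(start) + 1
    return count, blocks


count, blocks = describe_range("172.20.10.0-172.20.10.255")
print(count, blocks)  # 256 ['172.20.10.0/24']
```

The example range from step 1 covers 256 addresses and collapses to a single /24 block. A range that comes back as several small blocks is still usable, but it is worth confirming that it does not overlap addresses handed out elsewhere on the network.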

Kubernetes Ingress Controller

NVIDIA Run:ai cluster requires a Kubernetes Ingress Controller to be installed on the Kubernetes cluster.

  • OpenShift and RKE2 come with a pre-installed ingress controller.

  • Make sure that a default ingress controller is set using global.ingress.ingressClass. For more details, see Advanced cluster configurations.

There are many ways to install and configure different ingress controllers. The following provides a simple example of installing and configuring the HAProxy ingress controller using Helm:

NVIDIA GPU Operator

The NVIDIA Run:ai cluster requires NVIDIA GPU Operator to be installed on the Kubernetes cluster. GPU Operator versions 25.10 to 26.3 are supported.

Note

For Multi-Node NVLink support (e.g. GB200), GPU Operator 25.3 and above is required.

See Installing the NVIDIA GPU Operator, and then follow the notes below:

  • Use the default gpu-operator namespace. Otherwise, you must specify the target namespace using the flag runai-operator.config.nvidiaDcgmExporter.namespace as described in customized cluster installation.

  • NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flag --set driver.enabled=false. DGX OS is one such example, as it comes bundled with NVIDIA drivers.

  • For distribution-specific additional instructions see below:

OpenShift Container Platform (OCP)

The Node Feature Discovery (NFD) Operator is a prerequisite for the NVIDIA GPU Operator in OpenShift. Install the NFD Operator using the Red Hat OperatorHub catalog in the OpenShift Container Platform web console. For more information, see Installing the Node Feature Discovery (NFD) Operator.

Rancher Kubernetes Engine 2 (RKE2)

Before installing the GPU Operator, verify that the host OS requirements are met. Then, install the operator.

When installing GPU Operator v25.3, update the Helm values file as follows:

For troubleshooting information, see the NVIDIA GPU Operator Troubleshooting Guide.

NVIDIA Network Operator

When deploying on clusters with RDMA or Multi-Node NVLink‑capable nodes (e.g. B200, GB200), the NVIDIA Network Operator is required to enable high-performance networking features such as GPUDirect RDMA in Kubernetes. Network Operator versions 25.10 to 26.1 are supported.

The Network Operator works alongside the NVIDIA GPU Operator to provide:

  • NVIDIA networking drivers for advanced network capabilities.

  • Kubernetes device plugins to expose high‑speed network hardware to workloads.

  • Secondary network components to support network‑intensive applications.

The Network Operator must be installed and configured as follows:

  1. Configure SR-IOV InfiniBand support as detailed in Network Operator Deployment with an SR-IOV InfiniBand Network.

NVIDIA Dynamic Resource Allocation (DRA) Driver

When deploying on clusters with Multi-Node NVLink (e.g. GB200), the NVIDIA DRA driver is essential to enable Dynamic Resource Allocation at the Kubernetes level. To install, follow the instructions in Configure and Helm-install the driver. DRA driver versions 25.8 to 25.12 are supported.

After the DRA driver is installed, update runaiconfig using the GPUNetworkAccelerationEnabled=True flag to enable GPU network acceleration. This triggers an update of the NVIDIA Run:ai workload-controller deployment and restarts the controller. See Advanced cluster configurations for more details.
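
The supported ranges stated in the NVIDIA GPU Operator, Network Operator and DRA driver sections can be checked programmatically. The following is a minimal Python sketch; the component keys and function names are hypothetical labels chosen for illustration. It compares versions as integer tuples, because comparing them as decimals would wrongly treat 25.10 as lower than 25.8.

```python
# Supported ranges as stated in the GPU Operator, Network Operator
# and DRA driver sections. The dictionary keys are illustrative labels.
SUPPORTED_RANGES = {
    "gpu-operator": ("25.10", "26.3"),
    "network-operator": ("25.10", "26.1"),
    "dra-driver": ("25.8", "25.12"),
}


def as_tuple(version: str) -> tuple:
    """'v25.10.1' -> (25, 10); only the first two components are compared."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def in_supported_range(component: str, version: str) -> bool:
    """True when `version` lies inside the stated range for `component`."""
    low, high = SUPPORTED_RANGES[component]
    return as_tuple(low) <= as_tuple(version) <= as_tuple(high)
```

With this comparison, DRA driver 25.10 falls inside the 25.8 to 25.12 range, whereas a naive float comparison would place it below 25.8.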

Prometheus

Note

Installing Prometheus applies to Kubernetes only.

NVIDIA Run:ai cluster requires Prometheus to be installed on the Kubernetes cluster.

There are many ways to install Prometheus. As a simple example, to install the community Kube-Prometheus Stack using Helm, run the following commands:

Routing Traffic to and from NVIDIA Run:ai Services

This section describes how to route traffic to and from NVIDIA Run:ai services. Proper traffic routing is required to enable secure access to the NVIDIA Run:ai control plane (via the UI or API), as well as external access to development workspaces, training workloads, and inference workloads.

NVIDIA Run:ai supports two routing approaches for exposing services: host-based routing and path-based routing. While path-based routing exposes multiple services under a shared domain using URL paths, host-based routing assigns each service its own subdomain. Since many development tools and applications expect to run at the root path, host-based routing avoids common compatibility issues and is therefore used by default in NVIDIA Run:ai.

Host-based routing relies on DNS and TLS configuration to securely expose services. To support this, three key components must be configured:

Together, these components ensure that traffic is routed correctly and securely across all NVIDIA Run:ai services.

Note

  • NVIDIA Run:ai also supports path-based routing. If this approach better fits your environment, you can use it instead of the default host-based routing. In this case, the workspace and training wildcard certificate is not required.

  • To use path-based routing, disable host-based routing by setting clusterConfig.global.subdomainSupport: false during Helm installation. See Advanced cluster configurations.

Fully Qualified Domain Name (FQDN)

Note

Fully Qualified Domain Name applies to Kubernetes only.

NVIDIA Run:ai services rely on Fully Qualified Domain Names (FQDNs) to route traffic between system components and to expose workloads externally. In NVIDIA Run:ai, the FQDN settings are needed for:

  • Enabling communication between the control plane and the cluster

  • Exposing development workspaces and training workloads via subdomains

  • Exposing inference workloads via dedicated subdomains

You must configure domain names for each of the following communication types:

  • Control plane ↔ cluster communication. Example: runai.mycorp.local. The IP address of this domain must be resolvable within the organization’s private network.

  • Workspace and training workloads (external access). Example: *.runai.mycorp.local

  • Inference workloads (external access). Example: *.runai-inference.mycorp.local

Since NVIDIA Run:ai uses host-based routing, wildcard DNS records must be configured to enable external access to workloads.

Configure the following DNS records. Both records should resolve to the same cluster public IP address, or to the cluster’s load balancer IP in on-prem environments. This ensures that each workspace or workload is assigned a unique subdomain under the wildcard domains:

NVIDIA Run:ai TLS Certificates

TLS certificates secure communication between NVIDIA Run:ai components and enable HTTPS access to exposed services.

NVIDIA Run:ai requires three TLS certificates, each aligned with a specific domain:

  • Cluster domain certificate

  • Workspaces and training workload certificate

  • Inference certificate (wildcard)

Cluster Domain Certificates (Single-Domain)

  • Kubernetes - To enable secure communication between the NVIDIA Run:ai control plane and the cluster, configure a TLS certificate associated with the cluster’s main domain (e.g. runai.mycorp.local). This certificate should be stored as a secret named runai-cluster-domain-tls-secret in the runai namespace.

    • Replace /path/to/fullchain.pem with the actual path to your TLS certificate.

    • Replace /path/to/private.pem with the actual path to your private key.

  • OpenShift - NVIDIA Run:ai uses the OpenShift default Ingress router for serving. The TLS certificate configured for this router must be issued by a trusted CA. For more details, see the OpenShift documentation on configuring certificates.

Workspaces & Training Workload Wildcard Certificate

Note

For path-based routing, ignore this configuration and move to the next step.

  • Kubernetes - To allow secure access to workspace and training workloads via subdomains, configure a wildcard TLS certificate that matches the cluster domain (e.g. *.runai.mycorp.local). This certificate should be stored as a secret named runai-cluster-domain-star-tls-secret in the runai namespace.

    • Replace /path/to/fullchain.pem with the actual path to your TLS certificate.

    • Replace /path/to/private.pem with the actual path to your private key.

  • OpenShift - A wildcard TLS certificate for workspace and training workloads is not required. OpenShift Routes handle TLS termination for these endpoints using the platform’s built-in routing and certificate management.

Inference Wildcard Certificate

  • Kubernetes - To securely expose inference services over HTTPS, configure a wildcard TLS certificate for the inference domain (e.g. *.runai-inference.mycorp.local). This certificate should be stored as a secret named runai-cluster-inference-tls-secret in the knative-serving namespace.

    • Replace /path/to/fullchain.pem with the actual path to your TLS certificate.

    • Replace /path/to/private.pem with the actual path to your private key.

  • OpenShift - A wildcard TLS certificate for Inference workloads is not required. OpenShift Routes handle TLS termination for inference endpoints using the platform’s built-in routing and certificate management.
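
To see how the three certificates divide up the hostnames on Kubernetes, the example domains and secret names above can be expressed as a lookup. The following is a minimal Python sketch, assuming TLS-style wildcard matching in which * stands for exactly one leftmost DNS label; the function names and the workload hostnames used in the usage note are hypothetical.

```python
from typing import Optional, Tuple

# Example domains, secret names and namespaces taken from the sections above.
# Each entry: (certificate name pattern, secret name, namespace)
CERTIFICATES = [
    ("runai.mycorp.local", "runai-cluster-domain-tls-secret", "runai"),
    ("*.runai.mycorp.local", "runai-cluster-domain-star-tls-secret", "runai"),
    ("*.runai-inference.mycorp.local", "runai-cluster-inference-tls-secret", "knative-serving"),
]


def pattern_covers(pattern: str, host: str) -> bool:
    """TLS-style matching: '*' stands for exactly one leftmost DNS label."""
    pattern, host = pattern.lower(), host.lower()
    if not pattern.startswith("*."):
        return pattern == host
    label, _, parent = host.partition(".")
    return bool(label) and parent == pattern[2:]


def secret_for(host: str) -> Optional[Tuple[str, str]]:
    """Return (secret name, namespace) of the certificate covering `host`."""
    for pattern, secret, namespace in CERTIFICATES:
        if pattern_covers(pattern, host):
            return secret, namespace
    return None
```

For example, a hypothetical workspace at jupyter-1.runai.mycorp.local is covered by the wildcard secret in the runai namespace, while a.b.runai.mycorp.local is covered by none of the three, because a wildcard certificate does not span two label levels.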

Host-Based Routing (Default)

Note

  • The following steps are required for Kubernetes only. For OpenShift, no additional configuration is required.

  • NVIDIA Run:ai also supports path-based routing. If you prefer to use it instead of the default host-based routing, disable host-based routing by setting clusterConfig.global.subdomainSupport: false during the Helm installation. See Advanced cluster configurations.

  • If you choose path-based routing, skip the steps below.

Host-based routing binds together the configured domains (FQDN), TLS certificates, and ingress rules to expose workloads externally. With NVIDIA Run:ai host-based routing, workloads are exposed using subdomains, so each workload is assigned its own URL. For example:

Host-based routing relies on:

  • The FQDN structure defined earlier

  • The TLS certificates configured for those domains

This section describes how to connect these components to enable workload exposure.

  1. Ensure that:

    • Wildcard DNS records are configured (see FQDN section)

    • TLS certificates are created and stored as Kubernetes secrets (see TLS certificates section)

  2. Create the ingress resource

  3. Run the following:

Additional Software Requirements

Additional NVIDIA Run:ai capabilities, such as distributed training and inference, require additional Kubernetes applications (frameworks) to be installed on the cluster.

Distributed Training

Distributed training enables training of AI models over multiple nodes. This requires installing a distributed training framework on the cluster. The following frameworks are supported:

There are several ways to install each framework. A simple example is to install the Kubeflow Training Operator, which includes TensorFlow, PyTorch, XGBoost and JAX.

It is recommended to use Kubeflow Training Operator v1.9.2, and MPI Operator v0.6.0 or later for compatibility with advanced workload capabilities, such as Stopping a workload and Scheduling rules.

  • To install the Kubeflow Training Operator for TensorFlow, PyTorch, XGBoost and JAX frameworks, run the following command:

  • To install the MPI Operator for MPI v2, run the following command:

Note

If you require both the MPI Operator and Kubeflow Training Operator, follow the steps below:

  • Install the Kubeflow Training Operator as described above.

  • Disable and delete MPI v1 in the Kubeflow Training Operator by running:

  • Install the MPI Operator as described above.

Inference

Inference enables serving of AI models. This requires the Knative Serving framework to be installed on the cluster. Knative versions 1.19 to 1.21 are supported.

Follow the Installing Knative instructions or run:

Once installed, follow the steps below:

  1. Create the knative-serving namespace:

  2. Create a YAML file named knative-serving.yaml and replace the placeholder FQDN with your wildcard inference domain (for example, runai-inference.mycorp.local):

  3. Apply the changes:

  4. Configure HAProxy to proxy requests to Kourier / Knative and handle TLS termination using the wildcard certificate. Create a YAML file named knative-ingress.yaml and replace the FQDN placeholders with your wildcard inference domain:

  5. Apply the changes:

Autoscaling

NVIDIA Run:ai can autoscale a deployment based on the following metrics:

  • Latency (milliseconds)

  • Throughput (requests/sec)

  • Concurrency (requests)

Using a custom metric (for example, Latency) requires installing the Kubernetes Horizontal Pod Autoscaler (HPA). Use the following command to install it. Make sure to replace {VERSION} in the command below with a supported Knative version.

Distributed Inference

NVIDIA Run:ai supports distributed inference (multi-node) deployments using the Leader Worker Set (LWS). To enable this capability, you must install the LWS Helm chart, version 0.7.0 or higher, on your cluster:
