Cluster System Requirements

After provisioning the tenant’s Kubernetes cluster, the next step is to install the required system components to support the NVIDIA Run:ai platform.

The NVIDIA Run:ai cluster is deployed as a Kubernetes application. This section outlines the minimum hardware and software requirements that must be installed and configured on each tenant cluster before deploying the NVIDIA Run:ai components.

Hardware Requirements

The following hardware requirements are for the Kubernetes cluster nodes. By default, all NVIDIA Run:ai cluster services run on all available nodes. For production deployments, consider setting node roles to separate system nodes from worker nodes, which reduces downtime and saves CPU cycles on expensive GPU machines.

Architecture

x86 – Supported for both Kubernetes and OpenShift deployments.
ARM – Supported for Kubernetes only. ARM is currently not supported for OpenShift.

NVIDIA Run:ai Cluster - System Nodes

This configuration is the minimum required to install and use the NVIDIA Run:ai cluster.

| Component | Required Capacity |
| --- | --- |
| CPU | 10 cores |
| Memory | 20GB |
| Disk space | 50GB |

Note

To designate nodes to NVIDIA Run:ai system services, follow the instructions as described in System nodes.

NVIDIA Run:ai Cluster - Worker Nodes

The NVIDIA Run:ai cluster supports x86 and ARM CPUs, and any NVIDIA GPUs supported by the NVIDIA GPU Operator. The list of supported GPUs depends on the version of the NVIDIA GPU Operator installed in the cluster. NVIDIA Run:ai supports GPU Operator versions 24.9 to 25.3.

For the list of supported GPU models, see Supported NVIDIA Data Center GPUs and Systems. To install the GPU Operator, see NVIDIA GPU Operator.

Note

NVIDIA DGX Spark and NVIDIA Jetson are not supported.

The following configuration represents the minimum hardware requirements for installing and operating the NVIDIA Run:ai cluster on worker nodes. Each node must meet these specifications:

| Component | Required Capacity |
| --- | --- |
| CPU | 2 cores |
| Memory | 4GB |

Note

To designate nodes to NVIDIA Run:ai workloads, follow the instructions as described in Worker nodes.

Software Requirements

The following software requirements must be fulfilled on the Kubernetes cluster.

Operating System

Any Linux operating system supported by both Kubernetes and NVIDIA GPU Operator
NVIDIA Run:ai cluster on Google Kubernetes Engine (GKE) supports both Ubuntu and Container Optimized OS (COS). COS is supported only with NVIDIA GPU Operator 24.6 or newer, and NVIDIA Run:ai cluster version 2.19 or newer. NVIDIA Run:ai cluster on Oracle Kubernetes Engine (OKE) supports only Ubuntu.
Internal tests are being performed on Ubuntu 22.04 and CoreOS for OpenShift.

Kubernetes Distribution

NVIDIA Run:ai cluster requires Kubernetes. The following Kubernetes distributions are supported:

Vanilla Kubernetes
OpenShift Container Platform (OCP)
NVIDIA Base Command Manager (BCM)
Rancher Kubernetes Engine (RKE1)
Rancher Kubernetes Engine 2 (RKE2)

Note

For Multi-Node NVLink support (e.g. GB200), Kubernetes 1.32 and above is required.

For existing Kubernetes clusters, see the following Kubernetes version support matrix for the latest NVIDIA Run:ai cluster releases:

| NVIDIA Run:ai version | Supported Kubernetes versions | Supported OpenShift versions |
| --- | --- | --- |
| v2.22 (latest) | 1.31 to 1.33 | 4.15 to 4.19 |

For managed Kubernetes, consult the release notes of your Kubernetes service provider to confirm which version of the underlying Kubernetes platform is supported and that it is compatible with NVIDIA Run:ai. For up-to-date end-of-life statements, see Kubernetes Release History or OpenShift Container Platform Life Cycle Policy.
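
As a rough illustration of the Kubernetes range listed for v2.22 (1.31 to 1.33), the following sketch checks a version string against it. The function name and the MIN_MINOR/MAX_MINOR variables are illustrative and not part of any NVIDIA Run:ai tooling; the bounds need updating for other releases.

```shell
# Sketch: check that a Kubernetes version falls in the range listed for v2.22.
# MIN_MINOR and MAX_MINOR mirror the support matrix (1.31 to 1.33).
MIN_MINOR=31
MAX_MINOR=33

k8s_version_supported() (
  # Accepts forms such as "1.32", "v1.32.4" or "v1.32.4+rke2r1".
  version="${1#v}"              # drop a leading "v" if present
  major="${version%%.*}"        # text before the first dot
  rest="${version#*.}"          # text after the first dot
  minor="${rest%%[!0-9]*}"      # leading digits of the remainder
  [ "$major" = "1" ] && [ -n "$minor" ] \
    && [ "$minor" -ge "$MIN_MINOR" ] && [ "$minor" -le "$MAX_MINOR" ]
)

if k8s_version_supported "v1.32.4"; then
  echo "supported"
else
  echo "not supported"
fi
```

With cluster access, the gitVersion reported under serverVersion by `kubectl version -o json` can be passed as the argument.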

Container Runtime

NVIDIA Run:ai supports the following container runtimes. Make sure your Kubernetes cluster is configured with one of these runtimes:

Containerd (default in Kubernetes)
CRI-O (default in OpenShift)

Kubernetes Pod Security Admission

NVIDIA Run:ai supports restricted policy for Pod Security Admission (PSA) on OpenShift only. Other Kubernetes distributions are only supported with privileged policy.

For NVIDIA Run:ai on OpenShift to run with PSA restricted policy:

Label the runai namespace as described in Pod Security Admission with the following labels:

pod-security.kubernetes.io/audit=privileged
pod-security.kubernetes.io/enforce=privileged
pod-security.kubernetes.io/warn=privileged

The workloads submitted through NVIDIA Run:ai should comply with the restrictions of PSA restricted policy. This can be enforced using Policies.
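
The three labels from the first step can be applied in a single command. The sketch below only prints that command so it can be reviewed before running; the helper name is illustrative, and `kubectl` can be swapped for `oc` if that is the CLI in use on the cluster.

```shell
# Sketch: build the label command for the runai namespace.
# The command is printed rather than executed so it can be reviewed first.
psa_label_command() (
  namespace="${1:-runai}"
  level="privileged"
  args=""
  # One label per Pod Security Admission mode.
  for mode in audit enforce warn; do
    args="$args pod-security.kubernetes.io/$mode=$level"
  done
  echo "kubectl label namespace $namespace$args --overwrite"
)

psa_label_command
```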

NVIDIA Run:ai Namespace

NVIDIA Run:ai must be installed in a namespace or project (OpenShift) called runai. Use the following to create the namespace/project:

kubectl create ns runai

Kubernetes Ingress Controller

NVIDIA Run:ai cluster requires a Kubernetes ingress controller to be installed on the Kubernetes cluster.

OpenShift, RKE and RKE2 come with a pre-installed ingress controller.
Internal tests are being performed on NGINX, Rancher NGINX, OpenShift Router, and Istio.
Make sure that a default ingress controller is set.

There are many ways to install and configure different ingress controllers. The following is a simple example that installs and configures the NGINX ingress controller using Helm:

Run the following commands:

For cloud deployments, both the internal IP and external IP are required.
For on-prem deployments, only the external IP is needed.

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
    --namespace nginx-ingress --create-namespace \
    --set controller.kind=DaemonSet \
    --set controller.service.externalIPs="{<INTERNAL-IP>,<EXTERNAL-IP>}" # Replace <INTERNAL-IP> and <EXTERNAL-IP> with the internal and external IP addresses of one of the nodes
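
The value passed to controller.service.externalIPs differs between cloud and on-prem deployments, as noted above. A minimal sketch of building it, assuming a hypothetical helper name and placeholder addresses from documentation ranges:

```shell
# Sketch: build the externalIPs value for the helm --set flag.
# First argument: external IP. Optional second argument: internal IP
# (pass it for cloud deployments, omit it for on-prem).
external_ips_value() (
  external_ip="$1"
  internal_ip="${2:-}"
  if [ -n "$internal_ip" ]; then
    echo "{${internal_ip},${external_ip}}"
  else
    echo "{${external_ip}}"
  fi
)

external_ips_value 203.0.113.10 10.0.0.12   # cloud: internal and external
external_ips_value 203.0.113.10             # on-prem: external only
```

The result can then be used as `--set controller.service.externalIPs="$(external_ips_value <EXTERNAL-IP> <INTERNAL-IP>)"` in the helm command.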

Fully Qualified Domain Name (FQDN)

You must have a Fully Qualified Domain Name (FQDN) to install the NVIDIA Run:ai cluster (for example, runai.mycorp.local). This cannot be an IP address. The domain name must be accessible inside the organization's private network.
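
A quick pre-flight check can catch an IP address or a bare hostname before installation. The following is a conservative sketch with an illustrative function name; it is not a full hostname validator and does not test whether the name actually resolves inside the network.

```shell
# Sketch: reject values that cannot serve as the cluster FQDN.
is_valid_fqdn() (
  name="$1"
  # Digits and dots only (or empty): treat as an IPv4 address and reject.
  case "$name" in
    *[!0-9.]*) ;;
    *) exit 1 ;;
  esac
  case "$name" in
    # Reject unexpected characters, empty labels, and labels that
    # start or end with a hyphen.
    *[!A-Za-z0-9.-]*|.*|*.|*..*|-*|*.-*|*-.*|*-) exit 1 ;;
    # Require at least one dot, so a bare hostname is not accepted.
    *.*) exit 0 ;;
    *) exit 1 ;;
  esac
)

for candidate in runai.mycorp.local 10.0.0.12 localhost; do
  if is_valid_fqdn "$candidate"; then
    echo "$candidate: looks like an FQDN"
  else
    echo "$candidate: rejected"
  fi
done
```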

Wildcard FQDN for Inference (Optional)

In order to make inference serving endpoints available externally to the cluster, configure a wildcard DNS record (*.runai-inference.mycorp.local) that resolves to the cluster’s public IP address, or to the cluster's load balancer IP address in on-prem environments. This ensures each inference workload receives a unique subdomain under the wildcard domain.

TLS Certificates

Kubernetes - You must have a TLS certificate that is associated with the FQDN for HTTPS access. Create a Kubernetes Secret named runai-cluster-domain-tls-secret in the runai namespace and include the path to the TLS --cert and its corresponding private --key by running the following:

# Replace /path/to/fullchain.pem with the actual path to your TLS certificate
# and /path/to/private.pem with the actual path to your private key.
kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
  --cert /path/to/fullchain.pem \
  --key /path/to/private.pem

OpenShift - NVIDIA Run:ai uses the OpenShift default Ingress router for serving. The TLS certificate configured for this router must be issued by a trusted CA. For more details, see the OpenShift documentation on configuring certificates.

Wildcard TLS Certificate - Inference

Kubernetes - For serving inference endpoints over HTTPS, NVIDIA Run:ai requires a dedicated wildcard TLS certificate that matches the fully qualified domain name (FQDN) used for inference. This certificate ensures secure external access to inference workloads:

# Replace /path/to/fullchain.pem with the actual path to your TLS certificate
# and /path/to/private.pem with the actual path to your private key.
kubectl create secret tls runai-cluster-inference-tls-secret -n knative-serving \
    --cert /path/to/fullchain.pem \
    --key /path/to/private.pem

OpenShift - A wildcard TLS certificate for inference is not required. OpenShift Routes handle TLS termination for inference endpoints using the platform’s built-in routing and certificate management.

NVIDIA GPU Operator

NVIDIA Run:ai cluster requires the NVIDIA GPU Operator to be installed on the Kubernetes cluster. Versions 24.9 to 25.3 are supported.

Note

For Multi-Node NVLink support (e.g. GB200), GPU Operator 25.3 and above is required.

See Installing the NVIDIA GPU Operator, then follow the notes below:

Use the default gpu-operator namespace. Otherwise, you must specify the target namespace using the flag runai-operator.config.nvidiaDcgmExporter.namespace as described in customized cluster installation.
NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flag --set driver.enabled=false. DGX OS is one such example, as it comes bundled with NVIDIA drivers.
For distribution-specific additional instructions see below:

OpenShift Container Platform (OCP)

The Node Feature Discovery (NFD) Operator is a prerequisite for the NVIDIA GPU Operator in OpenShift. Install the NFD Operator using the Red Hat OperatorHub catalog in the OpenShift Container Platform web console. For more information, see Installing the Node Feature Discovery (NFD) Operator.

Rancher Kubernetes Engine 2 (RKE2)

Before installing the GPU Operator, verify the host OS requirements are met. Then, install the operator.

When installing GPU Operator v25.3, update the Helm values file as follows:

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: gpu-operator
  namespace: kube-system
spec:
  repo: https://helm.ngc.nvidia.com/nvidia
  chart: gpu-operator
  version: v25.3.4
  targetNamespace: gpu-operator
  createNamespace: true
  valuesContent: |-
    toolkit:
      env:
      - name: CONTAINERD_SOCKET
        value: /run/k3s/containerd/containerd.sock

For troubleshooting information, see the NVIDIA GPU Operator Troubleshooting Guide.

NVIDIA Network Operator

When deploying on clusters with RDMA or Multi Node NVLink‑capable nodes (e.g. B200, GB200), the NVIDIA Network Operator is required to enable high-performance networking features such as GPUDirect RDMA in Kubernetes. Network Operator versions v24.4 and above are supported.

The Network Operator works alongside the NVIDIA GPU Operator to provide:

NVIDIA networking drivers for advanced network capabilities.
Kubernetes device plugins to expose high‑speed network hardware to workloads.
Secondary network components to support network‑intensive applications.

The Network Operator must be installed and configured as follows:

Install the network operator as detailed in Network Operator Deployment on Vanilla Kubernetes Cluster.
Configure SR-IOV InfiniBand support as detailed in Network Operator Deployment with an SR-IOV InfiniBand Network.

NVIDIA Dynamic Resource Allocation (DRA) Driver

When deploying on clusters with Multi-Node NVLink (e.g. GB200), the NVIDIA DRA driver is essential to enable Dynamic Resource Allocation at the Kubernetes level. To install, follow the instructions in Configure and Helm-install the driver. For air-gapped installations, the DRA driver is installed with the GPU Operator.

After the DRA driver is installed, update runaiconfig using the GPUNetworkAccelerationEnabled=True flag to enable GPU network acceleration. This triggers an update of the NVIDIA Run:ai workload-controller deployment and restarts the controller. See Advanced cluster configurations for more details.

Prometheus

Note

Installing Prometheus applies to Kubernetes only.

NVIDIA Run:ai cluster requires Prometheus to be installed on the Kubernetes cluster.

OpenShift comes with Prometheus pre-installed.
For RKE2, see the Enable Monitoring instructions to install Prometheus.

There are many ways to install Prometheus. As a simple example, to install the community Kube-Prometheus Stack using Helm, run the following commands:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
    -n monitoring --create-namespace --set grafana.enabled=false

Additional Software Requirements

The additional NVIDIA Run:ai capabilities, Distributed Training and Inference, require additional Kubernetes applications (frameworks) to be installed on the cluster.

Distributed Training

Distributed training enables training of AI models over multiple nodes. This requires installing a distributed training framework on the cluster. The following frameworks are supported:

TensorFlow
PyTorch
XGBoost
MPI v2
JAX

There are several ways to install each framework. A simple option is the Kubeflow Training Operator, which includes TensorFlow, PyTorch, XGBoost and JAX.

It is recommended to use Kubeflow Training Operator v1.9.2, and MPI Operator v0.6.0 or later for compatibility with advanced workload capabilities, such as Stopping a workload and Scheduling rules.

To install the Kubeflow Training Operator for TensorFlow, PyTorch, XGBoost and JAX frameworks, run the following command:

kubectl apply -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.2"

To install the MPI Operator for MPI v2, run the following command:

kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml

Note

If you require both the MPI Operator and Kubeflow Training Operator, follow the steps below:

Install the Kubeflow Training Operator as described above.
Disable and delete MPI v1 in the Kubeflow Training Operator by running:

kubectl patch deployment training-operator -n kubeflow --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--enable-scheme=tfjob", "--enable-scheme=pytorchjob", "--enable-scheme=xgboostjob", "--enable-scheme=jaxjob"]}]'
kubectl delete crd mpijobs.kubeflow.org

Install the MPI Operator as described above.

Inference

Inference enables serving of AI models. This requires the Knative Serving framework to be installed on the cluster. Knative versions 1.11 to 1.18 are supported.

On Kubernetes, follow the Installing Knative instructions or run:

helm repo add knative-operator https://knative.github.io/operator
helm install knative-operator --create-namespace --namespace knativeoperator --version 1.18.2 knative-operator/knative-operator

Once installed, follow the below steps:

Create the knative-serving namespace:
```
kubectl create ns knative-serving
```

Create a YAML file named knative-serving.yaml and replace the placeholder FQDN with your wildcard inference domain (for example, runai-inference.mycorp.local):

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  config:
    config-autoscaler:
      enable-scale-to-zero: "true"
    config-features:
      kubernetes.podspec-affinity: enabled
      kubernetes.podspec-init-containers: enabled
      kubernetes.podspec-persistent-volume-claim: enabled
      kubernetes.podspec-persistent-volume-write: enabled
      kubernetes.podspec-schedulername: enabled
      kubernetes.podspec-securitycontext: enabled
      kubernetes.podspec-tolerations: enabled
      kubernetes.podspec-volumes-emptydir: enabled
      kubernetes.podspec-fieldref: enabled
      kubernetes.containerspec-addcapabilities: enabled
      kubernetes.podspec-nodeselector: enabled
      multi-container: enabled
    domain:
      runai-inference.mycorp.local: "" # replace with the wildcard FQDN for Inference
    network:
      domainTemplate: '{{.Name}}-{{.Namespace}}.{{.Domain}}'
      ingress-class: kourier.ingress.networking.knative.dev
      default-external-scheme: https
  high-availability:
    replicas: 2
  ingress:
    kourier:
      enabled: true

Apply the changes:
```
kubectl apply -f knative-serving.yaml
```

Configure NGINX to proxy requests to Kourier / Knative and handle TLS termination using the wildcard certificate. Create a YAML file named knative-ingress.yaml and replace the FQDN placeholders with your wildcard inference domain:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  ingressClassName: nginx
  rules:
  - host: '*.runai-inference.mycorp.local' # replace with the wildcard FQDN for Inference
    http:
      paths:
      - backend:
          service:
            name: kourier
            port:
              number: 80
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - '*.runai-inference.mycorp.local' # replace with the wildcard FQDN for Inference
    secretName: runai-cluster-inference-tls-secret

Apply the changes:
```
kubectl apply -f knative-ingress.yaml
```
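
To make the effect of the domainTemplate setting in knative-serving.yaml concrete, the sketch below expands '{{.Name}}-{{.Namespace}}.{{.Domain}}' by hand. The function name, workload name and namespace are hypothetical and only illustrate the shape of the resulting hostname.

```shell
# Sketch: expand the domainTemplate '{{.Name}}-{{.Namespace}}.{{.Domain}}'.
inference_host() (
  name="$1"
  namespace="$2"
  domain="$3"
  echo "${name}-${namespace}.${domain}"
)

inference_host llama runai-team-a runai-inference.mycorp.local
# prints: llama-runai-team-a.runai-inference.mycorp.local
```

Because the name and namespace are joined with a hyphen rather than a dot, each endpoint has exactly one label in front of the inference domain, which is the form a wildcard certificate for *.runai-inference.mycorp.local matches.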

On OpenShift, follow the Installing the OpenShift Serverless Operator instructions. Once installed, follow the steps below:

Create the knative-serving project:
```
oc new-project knative-serving
```

Create a YAML file named knative-serving.yaml:

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  finalizers:
    - knative-serving-openshift
    - knativeservings.operator.knative.dev
  name: knative-serving
  namespace: knative-serving
spec:
  config:
    config-features:
      kubernetes.podspec-tolerations: enabled
      kubernetes.podspec-volumes-emptydir: enabled
      kubernetes.podspec-persistent-volume-claim: enabled
      multi-container: enabled
      kubernetes.podspec-persistent-volume-write: enabled
      kubernetes.podspec-fieldref: enabled
      kubernetes.podspec-schedulername: enabled
      kubernetes.podspec-nodeselector: enabled
      kubernetes.podspec-init-containers: enabled
      kubernetes.podspec-securitycontext: enabled
      kubernetes.podspec-affinity: enabled
      kubernetes.containerspec-addcapabilities: enabled
  controller-custom-certs:
    name: ''
    type: ''
  registry: {}

Apply the changes:
```
oc apply -f knative-serving.yaml
```

Autoscaling

NVIDIA Run:ai can autoscale a deployment based on the following metrics:

Latency (milliseconds)
Throughput (requests/sec)
Concurrency (requests)

Using a custom metric (for example, Latency) requires installing the Kubernetes Horizontal Pod Autoscaler (HPA). To install it, run the following command, replacing {VERSION} with a supported Knative version:

kubectl apply -f https://github.com/knative/serving/releases/download/knative-{VERSION}/serving-hpa.yaml
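
As an illustration of the {VERSION} substitution, the following sketch builds the manifest URL. It assumes Knative Serving release tags of the form knative-v1.18.0; the helper name is illustrative, and the version shown is only an example that should be replaced with the release installed on the cluster.

```shell
# Sketch: fill in {VERSION} for the serving-hpa.yaml manifest URL.
knative_hpa_url() (
  version="$1"
  # Accept "1.18.0" as well as "v1.18.0".
  case "$version" in
    v*) ;;
    *) version="v$version" ;;
  esac
  echo "https://github.com/knative/serving/releases/download/knative-${version}/serving-hpa.yaml"
)

knative_hpa_url 1.18.0
```

The printed URL can then be passed to `kubectl apply -f`.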

Distributed Inference

NVIDIA Run:ai supports distributed inference (multi-node) deployments using the Leader Worker Set (LWS). To enable this capability, you must install the LWS Helm chart on your cluster:

CHART_VERSION=0.6.2
helm install lws oci://registry.k8s.io/lws/charts/lws \
  --version=$CHART_VERSION \
  --namespace lws-system \
  --create-namespace \
  --wait --timeout 300s


Last updated 1 month ago
