1 of 100

v2.21

Getting Started

Overview

NVIDIA Run:ai is a GPU orchestration and optimization platform that helps organizations maximize compute utilization for AI workloads. By optimizing the use of expensive compute resources, NVIDIA Run:ai accelerates AI development cycles, and drives faster time-to-market for AI-powered innovations.

Built on Kubernetes, NVIDIA Run:ai supports dynamic GPU allocation, workload submission, workload scheduling, and resource sharing, ensuring that AI teams get the compute power they need while IT teams maintain control over infrastructure efficiency.

How NVIDIA Run:ai Helps Your Organization

What’s New

This section includes release information for the self-hosted version of NVIDIA Run:ai:

New Features and Enhancements - Highlights major updates introduced in each version, including new capabilities, UI improvements, and changes to system behavior..
Hotfixes - Lists patches applied to released versions, including critical fixes and behavior corrections.

Note

See our for a list of supported versions and their respective support timelines.

Feature Life Cycle

NVIDIA Run:ai uses life cycle labels to indicate the maturity and stability of features across releases:

Experimental - This feature is in early development. It may not be stable and could be removed or changed significantly in future versions. Use with caution.
Beta - This feature is still being developed for official release in a future version and may have some limitations. Use with caution.

Quick Start Guides

This quick start helps you get up and running with NVIDIA Run:ai by guiding you through the main stages of platform adoption. Each guide is written for a specific role and provides step-by-step instructions.

Install the Platform

For infrastructure administrators

Follow the instructions in the Quick start for infrastructure administrators to install NVIDIA Run:ai and complete the required post-installation infrastructure setup.

Onboard Organization, Projects and Users

For platform administrators

Follow the instructions in the to complete the onboarding wizard, configure authentication and organizational structure, and prepare the platform for teams to run workloads.

Build, Train and Deploy Models

For AI practitioners

Follow the instructions in the to log in, select a project, and run experiments and production workloads on NVIDIA Run:ai.

Quick Start for Infrastructure Administrators

This guide is for infrastructure administrators responsible for installing, configuring, and operating NVIDIA Run:ai.

The quick start walks through the initial infrastructure setup lifecycle, including platform installation and the essential post-installation configuration required to prepare the cluster for onboarding and workload execution. It focuses on infrastructure-level concerns such as cluster readiness, control plane behavior, security boundaries, and operational stability.

Prerequisites

Before you begin, ensure that:

Quick Start for Platform Administrators

This guide is for platform administrators responsible for configuring, governing, and operating NVIDIA Run:ai across the organization.

The quick start outlines the critical, high-level setup phases you must complete immediately after NVIDIA Run:ai is installed. It focuses on establishing authentication, organizational structure, resource governance, and operational visibility required to enable teams to run workloads.

Prerequisites

Access to the NVIDIA Run:ai tenant - You have the NVIDIA Run:ai tenant URL for your organization (for example, https://<your-domain>).
Administrator credentials - You have credentials with system administrator permissions to sign in to the NVIDIA Run:ai UI.

Ongoing Platform Management

After installation, manage the platform through the following core activities.

Configure Platform Behavior and Admin Settings

NVIDIA Run:ai provides global configuration options to control system-wide features and functionality.

Use the General settings in the Admin panel to tailor how the platform operates, including feature enablement and analytics behavior. These settings apply across all users and workloads and can be adjusted to align with organizational policies and operational requirements.

Define Authorization and Access Control

Authorization ensures users can access only the features and resources required for their role. Define how users and teams access platform capabilities using and to:

Grant users the appropriate level of access
Control who can submit workloads or manage resources

Define Organizational Structure and Quota

Create additional and to logically partition resources. For each project, define guaranteed and over-quota resource allocations to control how resources are shared across the organization. At a high level, these constructs often align with an organization’s internal structure, such as business units, teams, or similar groupings. You can model them to reflect how your organization plans and manages AI initiatives. See for more details.

Configure Node Pools

Node pools allow you to translate organizational and business priorities into infrastructure-level scheduling decisions. By grouping worker nodes based on hardware type, capabilities, or location (for example, H100 vs. A100 GPUs), enable you to:

Reserve high-end or scarce GPUs for business-critical or production workloads
Provide predictable performance and scheduling behavior guarantees for prioritized projects
Prevent lower-priority or experimental workloads from impacting production or revenue-generating use cases

Node pools form a foundational layer for enforcing resource access, workload placement, and scheduling policies across the organization.

Define Policies (Governance)

Policies define how workloads are governed and scheduled across the platform. By combining workload policies with scheduling rules, you can standardize workload behavior and control how resources are used and shared.

Use policies to:

Standardize workload behavior with , enforcing best practices and organizational limits
Define scheduling behavior with that determine how workloads are placed or how long they can run

Scheduling rules are applied at the project or department level and affect all matching workloads for that scope.

Monitor and Optimize the Platform

Monitor platform usage and health to ensure efficient and reliable operation.

Monitor usage - Use analytics dashboards to track GPU utilization, identify idle resources, and review consumption by department and project.
Review logs and system health - Monitor control plane and cluster components to proactively troubleshoot issues and manage maintenance activities.

Quick Start for AI Practitioners

This guide is for AI practitioners responsible for running experiments and production workloads on NVIDIA Run:ai.

The quick start walks through the essential steps to begin using the platform, from initial access and project selection to launching a workspace and submitting your first workloads. The focus is on day-to-day workload execution and resource consumption, so you can experiment, train models, and deploy inference within your assigned project.

Prerequisites

To begin, ensure you meet the following conditions set up by your platform administrator:

You have an active user account and credentials to access the NVIDIA Run:ai UI
You are assigned to at least one project
Your project has available resources to run workloads

Getting Started

Choose a quick start based on your goal. Each scenario walks through a practical example so you can validate access, confirm resource availability, and understand how workloads run in your environment.

- Launch a Jupyter notebook workspace for interactive development and experimentation.
- Submit a standard training job to run a model training script on a single GPU.
- Submit a distributed PyTorch training job and launch a multi-node training workload using an example PyTorch image.

Understand Workload Capabilities

After completing the quick starts, explore the broader workload capabilities available in NVIDIA Run:ai. This helps you move beyond basic scenarios and take advantage of advanced scheduling, scaling, and configuration options.

- How workloads are defined, scheduled, and executed in NVIDIA Run:ai.
- The different supported workload types and the capabilities available for each, including scaling, resource configuration, scheduling behavior, and other advanced options.
- Shared resources used by workloads, such as environments, data sources, and credentials.

Run Workloads for Your Use Case

Once you understand the supported workload types and configuration options, proceed to the workload-specific documentation to configure and run workloads tailored for your project. Each workload section includes complete configuration examples and step-by-step instructions for the UI, API, and CLI.

- Interactive development environment for building and testing. Recommended for lightweight experimentation and debugging.
- Workload for standard or distributed training models. Recommended for resource-intensive model development.
- Deployment of an AI model for serving via an API. Recommended for production use.

Installation

NVIDIA Run:ai Components

As part of the installation process, you will install:

A control plane managing cluster/s
One or more

Both the control plane and clusters require Kubernetes. Typically, the control plane and first cluster are installed on the same Kubernetes cluster.

Installation Types

The self-hosted option is for organizations that cannot use a SaaS solution due to data leakage concerns. NVIDIA Run:ai self-hosting comes with two variants:

Type

Description

Support Matrix

The support matrix outlines the verified compatibility standards for NVIDIA Run:ai v2.21. To ensure a stable and performant deployment, all infrastructure components, including Kubernetes/OpenShift distributions, NVIDIA Operators, and specialized frameworks, must align with the versions specified below. Use this matrix as a validation checklist prior to performing new installations or upgrades.

Operator and Framework Versions

Component

Supported Versions

Install Using Helm

Network Requirements

The following network requirements are for the NVIDIA Run:ai components installation and usage.

External Access

Set out below are the domains to whitelist and ports to open for installation, upgrade, and usage of the application and its management.

Customized Installation

This section explains the available configurations for customizing the NVIDIA Run:ai control plane and cluster installation.

Control Plane Helm Chart Values

The NVIDIA Run:ai control plane installation can be customized to support your environment via Helm or flags. See .

Upgrade

Before Upgrade

Before proceeding with the upgrade, it's crucial to apply the specific prerequisites associated with your current version of NVIDIA Run:ai and every version in between up to the version you are upgrading to.

To ensure a smooth and supported upgrade process:

Uninstall

Uninstall the Control Plane

To delete the control plane, run:

helm uninstall runai-backend -n runai-backend

Uninstall the Cluster

To uninstall the NVIDIA Run:ai cluster, run the following command in your terminal:

To remove the NVIDIA Run:ai cluster from the NVIDIA Run:ai platform, see .

Note

Uninstall of NVIDIA Run:ai cluster from the Kubernetes cluster does not delete existing projects, departments or workloads submitted by users.

Infrastructure setup

Authentication and Authorization

NVIDIA Run:ai authentication and authorization enables a streamlined experience for the user with precise controls covering the data each user can see and the actions each user can perform in the NVIDIA Run:ai platform.

Authentication verifies user identity during login, and authorization assigns the user with specific permissions according to the assigned .

Authenticated access is required to use all aspects of the NVIDIA Run:ai interfaces, including the NVIDIA Run:ai platform, the NVIDIA Run:ai Command Line Interface (CLI) and APIs.

Authentication

Users

This section explains the procedure to manage users and their permissions.

Users can be managed locally, or via the identity provider (Idp), while assigned with to manage permissions. For example, user [email protected] is a department admin in department A.

Users Table

The Users table can be found under Access in the NVIDIA Run:ai platform.

SSO

Applications

This section explains the procedure to manage your organization's applications.

Applications are used for API integrations with NVIDIA Run:ai. An application contains a client ID and a client secret. With the client credentials, you can obtain a token as detailed in and use it within subsequent API calls.

Applications are assigned with to manage permissions. For example, application ci-pipeline-prod is assigned with a Researcher role in Cluster: A.

User Applications

This article explains the procedure to create your own user applications.

Notes

All clusters in the tenant must be version 2.20 and onward.
The token obtained through user applications assumes the of the user.

Creating an Application

To create an application:

Click the user avatar at the top right corner, then select Settings
Click +APPLICATION
Enter the application’s name

You can create up to 20 user applications.

Note

The client secret is visible only at the time of creation. It cannot be recovered but can be regenerated.

Regenerating a Client Secret

To regenerate a client secret:

Locate the application you want to regenerate its client secret
Click Regenerate client secret
Click REGENERATE

Important

Regenerating a client secret revokes the previous one.

Deleting an Application

Locate the application you want to delete
Click on the trash icon
On the dialog, click DELETE to confirm

Using API

Go to the API reference to view the available actions.

Advanced Setup

Node Roles

This article explains how to designate specific node roles in a Kubernetes cluster to ensure optimal performance and reliability in production deployments.

For optimal performance in production clusters, it is essential to avoid extensive CPU usage on GPU nodes where possible. This can be done by ensuring the following:

NVIDIA Run:ai system-level services run on dedicated CPU-only nodes.
Workloads that do not request GPU resources (e.g. Machine Learning jobs) are executed on CPU-only nodes.

NVIDIA Run:ai services are scheduled on the defined node roles by applying using node labels .

Prerequisites

To perform these tasks, make sure to install the NVIDIA Run:ai .

Configure Node Roles

The following node roles can be configured on the cluster:

System node: Reserved for NVIDIA Run:ai system-level services.
GPU Worker node: Dedicated for GPU-based workloads.
CPU Worker node: Used for CPU-only workloads.

System Nodes

NVIDIA Run:ai system nodes run system-level services required to operate. This can be done via the (recommended) or via NVIDIA Run:ai .

By default, NVIDIA Run:ai applies a node affinity rule to prefer nodes that are labeled with node-role.kubernetes.io/runai-system for system services scheduling. You can modify the default node affinity rule by:

Editing the spec.global.affinity configuration parameter as detailed in .
Editing the global.affinity configuration as detailed in for self-hosted deployments.

Note

To ensure and prevent a single point of failure, it is recommended to configure at least three system nodes in your cluster.
By default, Kubernetes master nodes are configured to prevent workloads from running on them as a best-practice measure to safeguard control plane stability. While this restriction is generally recommended, certain NVIDIA reference architectures allow adding tolerations to the NVIDIA Run:ai deployment so critical system services can run on these nodes.

Kubectl

To set a system role for a node in your Kubernetes cluster using Kubectl, follow these steps:

Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.
Run one of the following commands to label the node with its role:

NVIDIA Run:ai Administrator CLI

Note

The NVIDIA Run:ai Administrator CLI only supports the default node affinity.

To set a system role for a node in your Kubernetes cluster, follow these steps:

Run the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.
Run one of the following commands to set or remove a node’s role:

The set node-role command will label the node and set relevant cluster configurations.

Worker Nodes

NVIDIA Run:ai worker nodes run user-submitted workloads and system-level required to operate. This can be managed via the (recommended) or via NVIDIA Run:ai .

By default, GPU workloads are scheduled on GPU nodes based on the nvidia.com/gpu.present label. When global.nodeAffinity.restrictScheduling is set to true via the :

GPU Workloads are scheduled with node affinity rule to require nodes that are labeled with node-role.kubernetes.io/runai-gpu-worker
CPU-only Workloads are scheduled with node affinity rule to require nodes that are labeled with node-role.kubernetes.io/runai-cpu-worker

Kubectl

To set a worker role for a node in your Kubernetes cluster using Kubectl, follow these steps:

Validate the global.nodeAffinity.restrictScheduling is set to true in the cluster’s .
Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.
Run one of the following commands to label the node with its role. Replace the label and value (true

NVIDIA Run:ai Administrator CLI

To set worker role for a node in your Kubernetes cluster via NVIDIA Run:ai , follow these steps:

Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.
Run one of the following commands to set or remove a node’s role. <node-role> must be either --gpu-worker or --cpu-worker :

The set node-role command will label the node and set cluster configuration global.nodeAffinity.restrictScheduling true.

Note

Use the --all flag to set or remove a role to all nodes.

Container Access

External Access to Containers

Researchers may need to access containers remotely during workload execution. Common use cases include:

Running a Jupyter Notebook inside the container
Connecting PyCharm for remote Python development

Service Mesh

NVIDIA Run:ai supports service mesh implementations. When a service mesh is deployed with sidecar injection, specific configurations must be applied to ensure compatibility with NVIDIA Run:ai. This document outlines the required changes for the NVIDIA Run:ai control plane and cluster.

Control Plane Configuration

Interworking with Karpenter

Karpenter is an open-source, Kubernetes cluster autoscaler built for cloud deployments. Karpenter optimizes the cloud cost of a customer’s cluster by moving workloads between different node types, consolidating workloads into fewer nodes, using lower-cost nodes where possible, scaling up new nodes when needed, and shutting down unused nodes.

Karpenter’s main goal is cost optimization. Unlike Karpenter, NVIDIA Run:ai’s Scheduler optimizes for fairness and resource utilization. Therefore, there are a few potential friction points when using both on the same cluster.

Friction Points Using Karpenter with NVIDIA Run:ai

Infrastructure Procedures

NVIDIA Run:ai at Scale

Operating NVIDIA Run:ai at scale ensures that the system can efficiently handle fluctuating workloads while maintaining optimal performance. As clusters grow, whether due to an increasing number of nodes or a surge in workload demand, NVIDIA Run:ai services must be appropriately tuned to support large-scale environments.

This guide outlines the best practices for optimizing NVIDIA Run:ai for high-performance deployments, including NVIDIA Run:ai system services configurations, vertical scaling (adjusting CPU and memory resources) and where applicable, horizontal scaling (replicas).

NVIDIA Run:ai Services

High Availability

This guide outlines the best practices for configuring the NVIDIA Run:ai platform to ensure high availability and maintain service continuity during system failures or under heavy load. The goal is to reduce downtime and eliminate single points of failure by leveraging Kubernetes best practices alongside NVIDIA Run:ai specific configuration options. The NVIDIA Run:ai platform relies on two fundamental high availability strategies:

Use of system nodes - Assigning multiple dedicated nodes for critical system services ensures control, resource isolation, and enables system-level scaling.
Replication of core and third-party services - Configuring multiple replicas of essential services, including both platform and third-party components, distributes workloads and reduces single points of failure. If a component fails on one node, requests can seamlessly route to another instance.

Monitoring and Maintenance

Deploying NVIDIA Run:ai in mission-critical environments requires proper monitoring and maintenance of resources to ensure workloads run and are deployed as expected.

Details on how to monitor different parts of the physical resources in your Kubernetes system, including and , can be found in the monitoring and maintenance section. Adjacent configuration and troubleshooting sections also cover high availability, and clusters, , and to meet compliance requirements.

In addition to monitoring NVIDIA Run:ai resources, it is also highly recommended to monitor NVIDIA Run:ai runs on Kubernetes, which manages containerized applications. In particular, focus on three main layers:

Shared Storage

Shared storage is a critical component in AI and machine learning workflows, particularly in scenarios involving distributed training and shared datasets. In AI and ML environments, data must be readily accessible across multiple nodes, especially when training large models or working with vast datasets. Shared storage enable seamless access to data, ensuring that all nodes in a distributed training setup can read and write to the same datasets simultaneously. This setup not only enhances efficiency but is also crucial for maintaining consistency and speed in high-performance computing environments.

While NVIDIA Run:ai Platform supports a variety of remote data sources, such as Git and S3, it is often more efficient to keep data close to the compute resources. This proximity is typically achieved through the use of shared storage, accessible to multiple nodes in your Kubernetes cluster.

Shared Storage

Backup and Restore

This document outlines how to back up and restore a NVIDIA Run:ai deployment, including both the NVIDIA Run:ai cluster and control plane.

Back Up the Cluster

The restoration or backup of NVIDIA Run:ai and which are stored locally on the Kubernetes cluster is optional and can be restored and backed up separately. As backup of data is not required, the backup procedure is optional for advanced deployments.

Secure Your Cluster

This section details the security considerations for deploying NVIDIA Run:ai. It is intended to help administrators and security officers understand the specific permissions required by NVIDIA Run:ai.

Access to the Kubernetes Cluster

NVIDIA Run:ai integrates with Kubernetes clusters and requires specific permissions to successfully operate. These are permissions are controlled with configuration flags that dictate how NVIDIA Run:ai interacts with cluster resources. Prior to installation, security teams can review the permissions and ensure it aligns with their organization’s policies.

Logs Collection

This section provides instructions for IT administrators on collecting NVIDIA Run:ai logs for support, including prerequisites, CLI commands, and log file retrieval. It also covers enabling verbose logging for Prometheus and the NVIDIA Run:ai Scheduler.

Collect Logs to Send to Support

To collect NVIDIA Run:ai logs, follow these steps:

Event History

This section provides details about NVIDIA Run:ai’s Audit log.

The NVIDIA Run:ai control plane provides the audit log API and event history table in the NVIDIA Run:ai UI. Both reflect the same information regarding changes to business objects: clusters, projects and assets etc.

Note

Only system administrator users with tenant-wide permissions can access Audit log.

Platform management

Manage AI Initiatives

Managing Your Organization

Managing Your Resources

Scheduling and Resource Optimization

Scheduling

Using the Scheduler with Third-Party Workloads

By default, Kubernetes uses its own native scheduler to determine pod placement. The NVIDIA Run:ai platform provides a custom scheduler, runai-scheduler, which is used by default for workloads submitted using the platform. This section outlines how to configure third-party workloads, such as those submitted directly to Kubernetes, to run with the , runai-scheduler, instead of the default Kubernetes scheduler.

This section outlines how to configure workloads submitted directly to Kubernetes or through external frameworks to run with the , instead of the default Kubernetes scheduler.

Quick Starts

Resource Optimization

Quick Starts

Policies

Monitor Performance and Health

Workloads in NVIDIA Run:ai

How the Scheduler Works

Efficient resource allocation is critical for managing AI and compute-intensive workloads in Kubernetes clusters. The NVIDIA Run:ai Scheduler enhances Kubernetes' native capabilities by introducing advanced scheduling principles such as fairness, quota management, and dynamic resource balancing. It ensures that workloads, whether simple single-pod or complex distributed tasks, are allocated resources effectively while adhering to organizational policies and priorities.

This guide explores the NVIDIA Run:ai Scheduler’s allocation process, preemption mechanisms, and resource management. Through examples and detailed explanations, you'll gain insights into how the Scheduler dynamically balances workloads to optimize cluster utilization and maintain fairness across projects and departments.

Allocation Process

Pod Creation and Prouping

When a workload is submitted, the workload controller creates a pod or pods (for distributed training workloads or deployment based inference). When the Scheduler gets a submit request with the first pod, it creates a and allocates all the relevant building blocks of that workload. The next pods of the same workload are attached to the same pod group.

Queue Management

A workload, with its associated pod group, is queued in the appropriate . In every scheduling cycle, the Scheduler ranks the order of queues by calculating their precedence for scheduling.

Resource Binding

The next step is for the Scheduler to find nodes for those pods, assign the pods to their nodes (bind operation), and bind other building blocks of the pods such as storage, ingress and so on. If the pod group has a rule attached to it, the Scheduler either allocates and binds all pods together, or puts all of them into pending state. It retries to schedule them all together in the next scheduling cycle. The Scheduler also updates the status of the pods and their associated pod group. Users are able to track the workload submission process both in the CLI or NVIDIA Run:ai UI. For more details on submitting and managing workloads, see .

Preemption

If the Scheduler cannot find resources for the submitted workloads (and all of its associated pods), and the workload deserves resources either because it is under its queue quota , the Scheduler tries to from other queues. If this does not solve the resource issue, the Scheduler tries to preempt within the same queue (project).

Reclaim Preemption Between Projects and Departments

Reclaim is an inter-project and inter-department resource balancing action that takes back resources from one project or department that has used them as an over quota. It returns the resources back to a project (or department) that deserves those resources as part of its deserved quota, or to balance fairness between projects (or departments), so a project (or department) does not exceed its fairshare (portion of the unused resources).

This mode of operation means that a lower priority workload submitted in one project (e.g. training) can reclaim resources from a project that runs a higher priority workload (e.g. preemptive workspace) if fairness balancing is required.

Note

Only preemptive workloads can go over quota as they are susceptible to reclaim (cross-projects preemption) of the over quota resources they are using. The amount of over quota resources a project can gain depends on the over quota weight or quota (if over quota weight is disabled). Departments’ over quota is always proportional to its quota.

Priority Preemption Within a Project

may preempt lower priority preemptible workloads within the same project/node pool queue. For example, in a project that runs a training workload that exceeds the project quota for a certain node pool, a newly submitted workspace within the same project/node pool may stop (preempt) the training workload if there are not enough over quota resources for the project within that node pool to run both workloads (e.g. workspace using in-quota resources and training using over quota resources).

Note

Workload priority applies only within the same project and does not influence workloads across different projects, where fairness determines precedence.

Quota, Over Quota, and Fairshare

The NVIDIA Run:ai Scheduler strives to ensure fairness between projects and between departments. This means each department and project always strive to get their deserved quota, and unused resources are split between projects according to known rules (e.g. over quota weights).

If a project needs more resources even beyond its fairshare, and the Scheduler finds unused resources that no other project needs, this project can consume resources even beyond its fairshare.

Some scenarios can prevent the Scheduler from fully providing deserved quota and fairness:

Fragmentation or other scheduling constraints such affinities, taints etc.
Some requested resources, such as GPUs and CPU memory, can be allocated, while others, like CPU cores, are insufficient to meet the request. As a result, the Scheduler will place the workload in a pending state until the required resource becomes available.

Example of Splitting Quota

The example below illustrates a split of quota between different projects and departments using several node pools:

The example below illustrates how fairshare is calculated per project/node pool for the above example:

For each Project:
- The over quota (OQ) portion of each project (per node pool) is calculated as:
[(OQ-Weight) / (Σ Projects OQ-Weights)] x (Unused Resource per node pool)

Fairshare Balancing

The Scheduler constantly re-calculates the fairshare of each project and department per node pool, represented in the scheduler as queues, resulting in the re-balancing of resources between projects and between departments. This means that a preemptible workload that was granted resources to run in one scheduling cycle, can find itself preempted and go back to pending state while waiting for resources in the next cycle.

A queue, representing a scheduler-managed object for each project or department per node pool, can be in one of 3 states:

In-quota: The queue’s allocated resources ≤ queue deserved quota. The Scheduler’s first priority is to ensure each queue receives its deserved quota.
Over quota but below fairshare: The queue’s deserved quota < queue’s allocated resources <= queue’s fairshare. The Scheduler tries to find and allocate more resources to queues that need resources beyond their deserved quota and up to their fairshare.
Over-fairshare and over quota: The queue’s fairshare < queue’s allocated resources. The Scheduler tries to allocate resources to queues that need even more resources beyond their fairshare.

When re-balancing resources between queues of different projects and departments, the Scheduler goes in the opposite direction, i.e. first take resources from over-fairshare queues, then from over quota queues, and finally, in some scenarios, even from queues that are below their deserved quota.

Next Steps

Now that you have gained insights into how the Scheduler dynamically balances workloads to optimize cluster utilization and maintain fairness across projects and departments, you can . Before submitting your workloads, it’s important to familiarize yourself with the following key topics:

- Learn what workloads are and what is supported for both NVIDIA Run:ai and third-party workloads.
- Explore the various NVIDIA Run:ai workload types available and understand their specific purposes to enable you to choose the most appropriate workload type for your needs.

What’s New in Version 2.21

The NVIDIA Run:ai v2.21 what's new provides a detailed summary of the latest features, enhancements, and updates introduced in this version. They serve as a guide to help users, administrators, and researchers understand the new capabilities and how to leverage them for improved workload management, resource optimization, and more.

Important

For a complete list of deprecations, see Deprecation notifications. Deprecated features and capabilities will be available for two versions ahead of the notification.

AI Practitioners

Flexible Workload Submission

Streamlined workload submission with a customizable form - The new customizable submission form allows you to submit workloads by selecting and modifying an existing setup or providing your own settings. This enables faster, more accurate submissions that align with organizational policies and individual workload needs. Beta From cluster v2.18 onward

Feature high level details:

Flexible submission options - Choose from an existing setup and customize it, or start from scratch and provide your own settings for a one-time setup.
Improved visibility - Review existing setups and understand their associated policy definitions.
One-time data sources setup - Configure a data source as part of your one-time setup for a specific workload.

Workspaces and Training

Support for JAX distributed training workloads - You can now submit distributed training workloads using the JAX framework via the UI, API, and CLI. This enables you to leverage JAX for scalable, high-performance training, making it easier to run and manage JAX-based workloads seamlessly within NVIDIA Run:ai. See for more details. From cluster v2.21 onward
Pod restart policy for all workload types - A restart policy can be configured to define how pods are restarted when they terminate. The policy is set at the workload level across all workload types via the API and CLI. For distributed training workloads, restart policies can be set separately for master and worker pods. This enhancement ensures workloads are restarted efficiently, minimizing downtime and optimizing resource usage. From cluster v2.21 onward

Workload Assets

New environment presets - Added new NVIDIA Run:ai environment presets when running in a host-based routing cluster - vscode, rstudio, jupyter-scipy, tensorboard-tensorflow. See for more details. From cluster v2.21 onward
Support for PVC size expansion - Adjust the size of Persistent Volume Claims (PVCs) via the API, leveraging the allowVolumeExpansion field of the storage class resource. This enhancement enables you to dynamically adjust storage capacity as needed.

Command-line Interface (CLI v2)

New default CLI - CLI v2 is the default command-line interface. CLI v1 has been as of version 2.20.
Secret volume mapping for workloads - You can now map secrets to volumes when submitting workloads using the --secret-volume flag. This feature is available for all workload types - workspaces, training, and inference.
Support for environment field references in submit commands - A new flag,

ML Engineers

Workloads - Inference

Support for inference workloads via CLI v2 - You can now run inference workloads directly from the command-line interface. This update enables greater automation and flexibility for managing inference workloads. See for more details.
Enhanced rolling inference updates - Rolling inference updates allow ML engineers to apply live updates to existing inference workloads—regardless of their current status (e.g., running or pending)—without disrupting critical services. Experimental

Platform Administrators

Analytics

Enhancements to the Overview dashboard - The Overview dashboard includes optimization insights for projects and departments, providing real-time visibility into GPU resource allocation and utilization. These insights help department and project managers make more informed decisions about quota management, ensuring efficient resource usage.
Dashboard UX improvements:
- Improved visibility of metrics in the Resources utilization widget by repositioning them above the graphs.

Organizations - Projects/Departments

Enhanced resource prioritization for projects and departments - Admins can now define and manage SLAs tailored to specific and via the UI, ensuring resource allocation aligns with real business priorities. This enhancement empowers admins to assign strict priority to over-quota resources, extending control beyond the existing over-quota weight system. From cluster v2.20 onward
This feature allows administrators to:
- Set the priority of each department relative to other departments within the same node pool.

Audit Logs

Updated access control for audit logs - Only users with tenant-wide permissions have the ability to access audit logs, ensuring proper access control and data security. This update reinforces security and compliance by restricting access to sensitive system logs. It ensures that only authorized users can view audit logs, reducing the risk of unauthorized access and potential data exposure.

Notifications

Slack API integration for notifications - A new API allows organizations to receive notifications directly to Slack. This feature enhances real-time communication and monitoring by enabling users to stay informed about workload statuses. See for more details.

Authentication and Authorization

Improved visibility into user roles and access scopes - Individual users can now view their assigned roles and scopes directly in their settings. This enhancement provides greater transparency into user permissions, allowing individuals to easily verify their access levels. It helps users understand what actions they can perform and reduces dependency on administrators for access-related inquiries. See for more details.
Added auto-redirect to SSO - To deliver a consistent and streamlined login experience across customer applications, users accessing the NVIDIA Run:ai login page will be automatically redirected to SSO, bypassing the standard login screen entirely. This can be enabled via a toggle after an Identity Provider is added, and is available through both the UI and API. See for more details.

Data & Storage

Added Data volumes to the UI - Administrators can now create and manage data volumes directly from the UI and share data across different scopes in a cluster, including projects and departments. See for more details. Experimental From cluster v2.19 onward

Infrastructure Administrators

NVIDIA Datacenter GPUs - Grace-Blackwell

Support for NVIDIA GB200 NVL72 and MultiNode NVLink systems - NVIDIA Run:ai offers full support for NVIDIA’s most advanced MultiNode NVLink (MNNVL) systems, including NVIDIA GB200, NVIDIA GB200 NVL72 and its derivatives. NVIDIA Run:ai simplifies the complexity of managing and submitting workloads on these systems by automating infrastructure detection, domain labeling, and distributed job submission via the UI, CLI, or API. See for more details. From cluster v2.21 onward

Advanced Cluster Configurations

Automatic cleanup of resources for failed workloads - When a workload fails due to infrastructure issues, its resources can be automatically cleaned up using failureResourceCleanupPolicy, reducing resource of failed workloads. For more details, see . From cluster v2.21 onward

Advanced Setup

Custom pod labels and annotations - Add custom labels and annotations to pods in both the control plane and cluster. This new capability enables service mesh deployment in NVIDIA Run:ai. This feature provides greater flexibility in workload customization and management, allowing users to integrate with service meshes more easily. See for more details.

System Requirements

NVIDIA Run:ai now supports NVIDIA GPU Operator version 25.3.
NVIDIA Run:ai now supports OpenShift version 4.18.
NVIDIA Run:ai now supports Kubeflow Training Operator 1.9.

Deprecation Notifications

Cluster API for Workload Submission

Using the Cluster API to submit NVIDIA Run:ai workloads via YAML was starting from NVIDIA Run:ai version 2.18. For cluster version 2.18 and above, use the to submit workloads. The Cluster API documentation has also been removed from v2.20 and above.

GPU Memory Swap

NVIDIA Run:ai’s GPU memory swap helps administrators and AI practitioners to further increase the utilization of their existing GPU hardware by improving GPU sharing between AI initiatives and stakeholders. This is done by expanding the GPU physical memory to the CPU memory, typically an order of magnitude larger than that of the GPU.

Expanding the GPU physical memory helps the NVIDIA Run:ai system to put more workloads on the same GPU physical hardware, and to provide a smooth workload context switching between GPU memory and CPU memory, eliminating the need to kill workloads when the memory requirement is larger than what the GPU physical memory can provide.

Benefits of GPU Memory Swap

There are several use cases where GPU memory swap can benefit and improve the user experience and the system's overall utilization.

AI practitioners use notebooks to develop and test new AI models and to improve existing AI models. While developing or testing an AI model, notebooks use GPU resources intermittently, yet, required resources of the GPUs are pre-allocated by the notebook and cannot be used by other workloads after one notebook has already reserved them. To overcome this inefficiency, NVIDIA Run:ai introduced .

When one or more workloads require more than their requested GPU resources, there’s a high probability not all workloads can run on a single GPU because the total memory required is larger than the physical size of the GPU memory.

With GPU memory swap, several workloads can run on the same GPU, even if the sum of their used memory is larger than the size of the physical GPU memory. GPU memory swap can swap in and out workloads interchangeably, allowing multiple workloads to each use the full amount of GPU memory. The most common scenario is for one workload to run on the GPU (for example, an interactive notebook), while other notebooks are either idle or using the CPU to develop new code (while not using the GPU). From a user experience point of view, the swap in and out is a smooth process since the notebooks do not notice that they are being swapped in and out of the GPU memory. On rare occasions, when multiple notebooks need to access the GPU simultaneously, slower workload execution may be experienced.

Notebooks typically use the GPU intermittently, therefore with high probability, only one workload (for example, an ), will use the GPU at a time. The more notebooks the system puts on a single GPU, the higher the chances are that there will be more than one notebook requiring the GPU resources at the same time. Admins have a significant role here in fine tuning the number of notebooks running on the same GPU, based on specific use patterns and required SLAs.

A single GPU can be shared between an (for example, a Jupyter notebook, image recognition services, or an LLM service), and a training workload that is not time-sensitive or delay-sensitive. At times when the inference/interactive workload uses the GPU, both training and inference/interactive workloads share the GPU resources, each running part of the time swapped-in to the GPU memory, and swapped-out into the CPU memory the rest of the time.

Whenever the inference/interactive workload stops using the GPU, the swap mechanism swaps out the inference/interactive workload GPU data to the CPU memory. Kubernetes wise, the pod is still alive and running using the CPU. This allows the training workload to run faster when the inference/interactive workload is not using the GPU, and slower when it does, thus sharing the same resource between multiple workloads, fully utilizing the GPU at all times, and maintaining uninterrupted service for both workloads.

Serving Inference Warm Models with GPU Memory Swap

Running multiple is a demanding task and you will need to ensure that your SLA is met. You need to provide high performance and low latency, while maximizing GPU utilization. This becomes even more challenging when the exact model usage patterns are unpredictable. You must plan for the agility of inference services and strive to keep models on standby in a ready state rather than an idle state.

NVIDIA Run:ai’s GPU memory swap feature enables you to load multiple models to a single GPU, where each can use up to the full amount GPU memory. Using an application load balancer, the administrator can control to which server each inference request is sent. Then the GPU can be loaded with multiple models, where the model in use is loaded into the GPU memory and the rest of the models are swapped-out to the CPU memory. The swapped models are stored as ready models to be loaded when required. GPU memory swap always maintains the context of the workload (model) on the GPU so it can easily and quickly switch between models. This is unlike industry standard model servers that load models from scratch into the GPU whenever required.

How GPU Memory Swap Works

Swapping the workload’s GPU memory to and from the CPU is performed simultaneously and synchronously for all GPUs used by the workload. In some cases, if workloads specify a memory limit smaller than a full GPU memory size, multiple workloads can run in parallel on the same GPUs, maximizing the utilization and shortening the response times.

In other cases, workloads will run serially, with each workload running for a few seconds before the system swaps them in/out. If multiple workloads occupy more than the GPU physical memory and attempt to run simultaneously, memory swapping will occur. In this scenario, each workload will run part of the time on the GPU while being swapped out to the CPU memory the other part of the time, slowing down the execution of the workloads. Therefore, it is important to evaluate whether memory swapping is suitable for your specific use cases, weighing the benefits against the potential for slower execution time. To better understand the benefits and use cases of GPU memory swap, refer to the detailed sections below. This will help you determine how to best utilize GPU swap for your workloads and achieve optimal performance.

The workload MUST use . This means the workload’s memory Request is less than a full GPU, but it may add a GPU memory Limit to allow the workload to effectively use the full GPU memory. The NVIDIA Run:ai Scheduler allocates the dynamic fraction pair (Request and Limit) on single or multiple GPU devices in the same node.

The administrator must label each node that they want to provide GPU memory swap with a run.ai/swap-enabled=true to enable that node. Enabling the feature reserves CPU memory to serve the swapped GPU memory from all GPUs on that node. The administrator sets the size of the CPU reserved RAM memory using the runaiconfig file as detailed in .

Multi-GPU Memory Swap

NVIDIA Run:ai also supports workload submission using multi-GPU memory swap. Multi-GPU memory swap works similarly to single GPU memory swap, but instead of swapping memory for a single GPU workload, it swaps memory for workloads across multiple GPUs simultaneously and synchronously.

The NVIDIA Run:ai Scheduler allocates the same dynamic GPU fraction pair (Request and Limit) on multiple GPU devices in the same node. For example, if you want to run two LLM models, each consuming 8 GPUs that are not used simultaneously, you can use GPU memory swap to share their GPUs. This approach allows multiple models to be stacked on the same node.

The following outlines the advantages of stacking multiple models on the same node:

Maximizes GPU utilization - Efficiently uses available GPU resources by enabling multiple workloads to share GPUs.
Improves cold start times - Loading large LLM models to a node and its GPUs can take several minutes during a “cold start”. Using memory swap turns this process into a “warm start” that takes only a fraction of a second to a few seconds (depending on the model size and the GPU model).
Increases GPU availability - Frees up and maximizes GPU availability for additional workloads (and users), enabling better resource sharing.

Deployment Considerations

A pod created before the GPU memory swap feature was enabled in that cluster, cannot be scheduled to a swap-enabled node. A proper event is generated in case no matching node is found. Users must re-submit those pods to make them swap-enabled.
GPU memory swap cannot be enabled if the NVIDIA Run:ai is used. GPU memory swap can only be used with the default NVIDIA time-slicing mechanism.
CPU RAM size cannot be decreased once GPU memory swap is enabled.

Enabling and Configuring GPU Memory Swap

Before configuring GPU memory swap, dynamic GPU fractions must be enabled. Dynamic GPU fractions enable you to make your workloads burstable as well as maximize your workloads’ performance and GPU utilization within a single node.

To enable GPU memory swap in a NVIDIA Run:ai cluster:

Add the following label to each node where you want to enable GPU memory swap:
Edit the runaiconfig file with the following parameters. This example uses 100Gi as the size of the swap memory. For more details, see :
Or, use the following patch command from your terminal:

Configuring System Reserved GPU Resources

Swappable workloads require reserving a small part of the GPU for non-swappable allocations like binaries and GPU context. To avoid getting out-of-memory (OOM) errors due to non-swappable memory regions, the system reserves a 2GiB of GPU RAM memory by default, effectively truncating the total size of the GPU memory. For example, a 16GiB T4 will appear as 14GiB on a swap-enabled node. The exact reserved size is application-dependent, and 2GiB is a safe assumption for 2-3 applications sharing and swapping on a GPU. This value can be changed by:

Editing the runaiconfig as follows:
Or, using the following patch command from your terminal:

Preventing Your Workloads from Getting Swapped

If you prefer your workloads not to be swapped into CPU memory, you can specify on the pod an anti-affinity to run.ai/swap-enabled=true node label when submitting your workloads and the Scheduler will ensure not to use swap-enabled nodes. An alternative way is to set swap on a dedicated node pool and not use this node pool for workloads you prefer not to swap.

What Happens When the CPU Reserved Memory for GPU Swap is Exhausted?

CPU memory is limited, and since a single CPU serves multiple GPUs on a node, this number is usually between 2 to 8. For example, when using 80GB of GPU memory, each swapped workload consumes up to 80GB (but may use less) assuming each GPU is shared between 2-4 workloads. In this example, you can see how the swap memory can become very large. Therefore, we give administrators a way to limit the size of the CPU reserved memory for swapped GPU memory on each swap-enabled node as shown in .

Control Plane System Requirements

The NVIDIA Run:ai control plane is a Kubernetes application. This section explains the required hardware and software system requirements for the NVIDIA Run:ai control plane. Before you start, make sure to review the Installation overview.

Installer Machine

The machine running the installation script (typically the Kubernetes master) must have:

At least 50GB of free space
Docker installed
3.14 or later

Note

If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai include the Helm binary.

Hardware Requirements

The following hardware requirements are for the control plane system nodes. By default, all NVIDIA Run:ai control plane services run on all available nodes.

Architecture

x86 - Supported for both Kubernetes and OpenShift deployments.
ARM - Supported for Kubernetes only. ARM is currently not supported for OpenShift.

NVIDIA Run:ai Control Plane - System Nodes

This configuration is the minimum requirement you need to install and use NVIDIA Run:ai control plane:

Component

Required Capacity

Note

To designate nodes to NVIDIA Run:ai system services, follow the instructions as described in .

If NVIDIA Run:ai control plane is planned to be installed on the same Kubernetes cluster as the NVIDIA Run:ai cluster, make sure the cluster are considered in addition to the NVIDIA Run:ai control plane hardware requirements.

Software Requirements

The following software requirements must be fulfilled.

Operating System

Any Linux operating system supported by both Kubernetes and NVIDIA GPU Operator
Internal tests are being performed on Ubuntu 22.04 and CoreOS for OpenShift.

Network Time Protocol

Nodes are required to be synchronized by time using NTP (Network Time Protocol) for proper system functionality.

Kubernetes Distribution

NVIDIA Run:ai control plane requires Kubernetes. The following Kubernetes distributions are supported:

Vanilla Kubernetes
OpenShift Container Platform (OCP)
NVIDIA Base Command Manager (BCM)
Elastic Kubernetes Engine (EKS)

Note

The latest release of the NVIDIA Run:ai control plane supports Kubernetes 1.30 to 1.32 and OpenShift 4.14 to 4.18.

See the following Kubernetes version support matrix for the latest NVIDIA Run:ai releases:

NVIDIA Run:ai version

Supported Kubernetes versions

Supported OpenShift versions

For information on supported versions of managed Kubernetes, it's important to consult the release notes provided by your Kubernetes service provider. There, you can confirm the specific version of the underlying Kubernetes platform supported by the provider, ensuring compatibility with NVIDIA Run:ai. For an up-to-date end-of-life statement see or .

NVIDIA Run:ai Namespace

The NVIDIA Run:ai control plane uses a namespace or project (OpenShift) called runai-backend. Use the following to create the namespace/project:

Default Storage Class

Note

Default storage class applies for Kubernetes only.

The NVIDIA Run:ai control plane requires a default storage class to create persistent volume claims for NVIDIA Run:ai storage. The storage class, as per Kubernetes standards, controls the reclaim behavior, whether the NVIDIA Run:ai persistent data is saved or deleted when the NVIDIA Run:ai control plane is deleted.

Note

For a simple (non-production) storage class example see . The storage class will set the directory /opt/local-path-provisioner to be used across all nodes as the path for provisioning persistent volumes. Then set the new storage class as default:

Kubernetes Ingress Controller

Note

Installing ingress controller applies for Kubernetes only.

The NVIDIA Run:ai control plane requires to be installed.

OpenShift, RKE and RKE2 come with a pre-installed ingress controller.
Internal tests are being performed on NGINX, Rancher NGINX, OpenShift Router, and Istio.
Make sure that a default ingress controller is set.

There are many ways to install and configure different ingress controllers. The following shows a simple example to install and configure NGINX ingress controller using :

Vanilla Kubernetes

Run the following commands:

For cloud deployments, both the internal IP and external IP are required.
For on-prem deployments, only the

Managed Kubernetes (EKS, GKE, AKS)

Run the following commands:

Oracle Kubernetes Engine (OKE)

Run the following commands:

Fully Qualified Domain Name (FQDN)

Note

Fully Qualified Domain Name applies for Kubernetes only.

You must have a Fully Qualified Domain Name (FQDN) to install the NVIDIA Run:ai control plane (ex: runai.mycorp.local). This cannot be an IP. The FQDN must be resolvable within the organization's private network.

TLS Certificate

Kubernetes

You must have a TLS certificate that is associated with the FQDN for HTTPS access. Create a named runai-backend-tls in the runai-backend namespace and include the path to the TLS --cert and its corresponding private --key by running the following:

OpenShift

NVIDIA Run:ai uses the OpenShift default Ingress router for serving. The TLS certificate configured for this router must be issued by a trusted CA. For more details, see the OpenShift documentation on .

Local Certificate Authority

A local certificate authority serves as the root certificate for organizations that cannot use publicly trusted certificate authority. Follow the below steps to configure the local certificate authority.

In air-gapped environments, you must configure and install the local CA's public key in the Kubernetes cluster. This is required for the installation to succeed:

Add the public key to the runai-backend namespace:

When installing the control plane, make sure the following flag is added to the helm command --set global.customCA.enabled=true. See .

External Postgres Database (Optional)

The NVIDIA Run:ai control plane installation includes a default PostgreSQL database. However, you may opt to use an existing PostgreSQL database if you have specific requirements or preferences as detailed in . Note that only PostgreSQL version 16 is supported.

Dynamic GPU Fractions

Many workloads utilize GPU resources intermittently, with long periods of inactivity. These workloads typically need GPU resources when they are running AI applications or debugging a model in development. Other workloads such as inference may utilize GPUs at lower rates than requested, but may demand higher resource usage during peak utilization. The disparity between resource request and actual resource utilization often leads to inefficient utilization of GPUs. This usually occurs when multiple workloads request resources based on their peak demand, despite operating below those peaks for the majority of their runtime.

To address this challenge, NVIDIA Run:ai has introduced dynamic GPU fractions. This feature optimizes GPU utilization by enabling workloads to dynamically adjust their resource usage. It allows users to specify a guaranteed fraction of GPU memory and compute resources with a higher limit that can be dynamically utilized when additional resources are requested.

How Dynamic GPU Fractions Work

With dynamic GPU fractions, users can using GPU fraction Request and Limit which is achieved by leveraging the Kubernetes Request and Limit notations. You can either:

Request a GPU fraction (portion) using a percentage of a GPU and specify a Limit
Request a GPU memory size (GB, MB) and specify a Limit

When setting a GPU memory limit either as GPU fraction or GPU memory size, the Limit must be equal to or greater than the GPU fractional memory request. Both GPU fraction and GPU memory are translated into the actual requested memory size of the Request (guaranteed resources) and the Limit (burstable resources - non guaranteed).

For example, a user can specify a workload with a GPU fraction request of 0.25 GPU, and add a limit of up to 0.80 GPU. The NVIDIA Run:ai schedules the workload to a node that can provide the GPU fraction request (0.25), and then assigns the workload to a GPU. The GPU scheduler monitors the workload and allows it to occupy memory between 0 to 0.80 of the GPU memory (based on the Limit), where only 0.25 of the GPU memory is guaranteed to that workload. The rest of the memory (from 0.25 to 0.8) is “loaned” to the workload, as long as it is not needed by other workloads.

NVIDIA Run:ai automatically manages the state changes between Request and Limit as well as the reverse (when the balance needs to be "returned"), updating the workloads’ utilization vs. Request and Limit parameters in the .

To guarantee fair quality of service between different workloads using the same GPU, NVIDIA Run:ai developed an extendable GPUOOMKiller (Out Of Memory Killer) component that guarantees the quality of service using Kubernetes semantics for resources of Request and Limit.

The OOMKiller capability requires adding CAP_KILL capabilities to the dynamic GPU fractions and to the NVIDIA Run:ai core scheduling module (toolkit daemon). This capability is enabled by default.

Note

Dynamic GPU fractions is enabled by default in the cluster. Disabling dynamic GPU fractions in removes the CAP_KILL capability.

Multi-GPU Dynamic Fractions

NVIDIA Run:ai also supports workload submission using multi-GPU dynamic fractions. Multi-GPU dynamic fractions work similarly to dynamic fractions on a single GPU workload, however, instead of a single GPU device, the NVIDIA Run:ai Scheduler allocates the same dynamic fraction pair (Request and Limit) on multiple GPU devices within the same node. For example, if practitioners develop a new model that uses 8 GPUs and requires 40GB of memory per GPU, but may want to burst out and consume up to the full GPU memory, they can allocate 8×40GB with multi-GPU fractions and a limit of 80GB (e.g. H100 GPU) instead of reserving the full memory of each GPU (e.g. 80GB). This leaves 40GB of GPU memory available on each of the 8 GPUs for other workloads within that node.This is useful during model development, where memory requirements are usually lower due to experimentation with smaller models or configurations.

This approach significantly improves GPU utilization and availability, enabling more precise and often smaller quota requirements for the end user. Time sharing where single GPUs can serve multiple workloads with dynamic fractions remains unchanged, only now, it serves multiple workloads using multi-GPUs per workload.

Setting Dynamic GPU Fractions

Note

Dynamic GPU fractions is disabled by default in the NVIDIA Run:ai UI. To use dynamic GPU fractions, it must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.

Using the asset, you can define the compute requirements by specifying your requested GPU portion or GPU memory, and set a Limit. You can then use the compute resource with any of the for single and multi-GPU dynamic fractions. In addition, you will be able to view the workloads’ utilization vs. Request and Limit parameters in the .

Single dynamic GPU fractions - Define the compute requirement to run 1 GPU device, by specifying either a fraction (percentage) of the overall memory or specifying the memory request (GB, MB) with a Limit. The limit must be equal to or greater than the GPU fractional memory request.
Multi-GPU dynamic fractions - Define the compute requirement to run multiple GPU devices, by specifying either a fraction (percentage) of the overall memory or specifying the memory request (GB, MB) with a Limit. The limit must be equal to or greater than the GPU fractional memory request.

Note

When setting a workload with dynamic GPU fractions, (for example, when using it with GPU Request or GPU memory Limits), you practically make the workload burstable. This means it can use memory that is not guaranteed for that workload and is susceptible to an ‘OOM Kill’ signal if the actual owner of that memory requires it back. This applies to non-preemptive workloads as well. For that reason, it is recommended that you use dynamic GPU fractions with Interactive workloads running Notebooks. Notebook pods are not evicted when their GPU process is OOM Kill’ed. This behavior is the same as standard Kubernetes burstable CPU workloads.

Setting Dynamic GPU Fractions for Third-Party Workloads

To enable dynamic GPU fractions for workloads submitted via Kubernetes YAML, use the following annotations to define the GPU fraction configuration. You can configure either gpu-fraction or gpu-memory. You must also set the RUNAI_GPU_MEMORY_LIMIT environment variable in the first container to enforce the memory limit. This is the GPU consuming container. Make sure the default scheduler is set to runai-scheduler. See for more details.

Variable

Input Format

Where to Set

The following example YAML creates a pod that requests 2 GPU devices, each requesting 50% of memory (gpu-fraction: "0.5") and allows usage of up to 95% (RUNAI_GPU_MEMORY_LIMIT: "0.95") if available.

Using CLI

To view the available actions, go to the and run according to your workload.

Using API

To view the available actions, go to the and run according to your workload.

GPU Fractions

To submit a workload with GPU resources in Kubernetes, you typically need to specify an integer number of GPUs. However, workloads often require diverse GPU memory and compute requirements or even use GPUs intermittently depending on the application (such as inference workloads, training workloads or notebooks at the model-creation phase). Additionally, GPUs are becoming increasingly powerful, offering more processing power and larger memory capacity for applications. Despite the increasing model sizes, the increasing capabilities of GPUs allow them to be effectively shared among multiple users or applications.

NVIDIA Run:ai’s GPU fractions provide an agile and easy-to-use method to share a GPU or multiple GPUs across workloads. With GPU fractions, you can divide the GPU/s memory into smaller chunks and share the GPU/s compute resources between different workloads and users, resulting in higher GPU utilization and more efficient resource allocation.

Benefits of GPU Fractions

Utilizing GPU fractions to share GPU resources among multiple workloads provides numerous advantages for both platform administrators and practitioners, including improved efficiency, resource optimization, and enhanced user experience.

For the AI practitioner:
- Reduced wait time - Workloads with smaller GPU requests are more likely to be scheduled quickly, minimizing delays in accessing resources.
- Increased workload capacity - More workloads can be run using the same admin-defined GPU and available unused resources - .

Quota Planning with GPU Fractions

When planning the quota distribution for your and , using fractions gives the platform administrator the ability to allocate more precise quota per project and department, assuming the usage of GPU fractions or enforcing it with or templates.

For example, in an organization with a department budgeted for two nodes of 8×H100 GPUs and a team of 32 researchers:

Allocating 0.5 GPU per researcher ensures all researchers have access to GPU resources.
Using fractions enables researchers to run smaller workloads intermittently within their quota or go over their quota by using temporary over quota resources with higher resource demanding workloads.
Using GPUs for notebook-based model development, where GPUs are not continuously active and can be shared among multiple users.

For more details on mapping your organization and resources, see .

How GPU Fractions Work

When a workload is submitted, the Scheduler finds a node with a GPU that can satisfy the requested GPU portion or GPU memory, then it schedules the pod to that node. The NVIDIA Run:ai GPU fractions logic, running locally on each NVIDIA Run:ai worker node, allocates the requested memory size on the selected GPU. Each pod uses its own separate virtual memory address space. NVIDIA Run:ai’s GPU fractions logic enforces the requested memory size, so no workload can use more than requested, and no workload can run over another workload’s memory. This gives users the experience of a ‘logical GPU’ per workload.

While requires administrative work to configure every MIG slice, where a slice is a fixed chunk of memory, GPU fractions allow dynamic and fully flexible allocation of GPU memory chunks. By default, GPU fractions use NVIDIA’s time-slicing to share the GPU compute runtime. You can also use the which allows dynamic and fully flexible splitting of the GPU compute time.

NVIDIA Run:ai GPU fractions are agile and dynamic allowing a user to allocate and free GPU fractions during the runtime of the system, at any size between zero to the maximum GPU portion (100%) or memory size (up to the maximum memory size of a GPU).

The NVIDIA Run:ai Scheduler can work alongside other schedulers. In order to avoid collisions with other schedulers, the NVIDIA Run:ai Scheduler creates special reservation pods. Once a workload is submitted requesting a fraction of a GPU, NVIDIA Run:ai will create a pod in a dedicated runai-reservation namespace with the full GPU as a resource, allowing other schedulers to understand that the GPU is reserved.

Note

Splitting a GPU into fractions may generate some fragmentation of the GPU memory. The will try to consolidate GPU resources where feasible (i.e. preemptible workloads).
Using as a scheduling placement strategy can also reduce GPU fragmentation.

Multi-GPU Fractions

NVIDIA Run:ai also supports workload submission using multi-GPU fractions. Multi-GPU fractions work similarly to single-GPU fractions, however, the NVIDIA Run:ai Scheduler allocates the same fraction size on multiple GPU devices within the same node. For example, if practitioners develop a new model that uses 8 GPUs and requires 40GB of memory per GPU, they can allocate 8×40GB with multi-GPU fractions instead of reserving the full memory of each GPU (e.g. 80GB). This leaves 40GB of GPU memory available on each of the 8 GPUs for other workloads within that node.

Time sharing where single GPUs can serve multiple workloads with fractions remains unchanged, only now, it serves multiple workloads using multi-GPUs per workload, single-GPU per workload, or a mix of both.

Deployment Considerations

Selecting a GPU portion using percentages as units does not guarantee the exact memory size. This means 50% of an A-100-40GB is 20GB while 50% of an A-100-80 is 40GB. To have better control over the exact allocated memory, specify the exact memory size, i.e. 40GB.
Using NVIDIA Run:ai GPU fractions controls the memory split (i.e. 0.5 GPU means 50% of the GPU memory) but not the compute (processing time). To split the compute time, see .
NVIDIA Run:ai GPU fractions and cannot be used on the same node.

Setting GPU Fractions

Using the asset, you can define the compute requirements by specifying your requested GPU portion or GPU memory, and use it with any of the for single GPU and multi-GPU fractions.

Single-GPU fractions - Define the compute requirement to run 1 GPU device, by specifying either a fraction (percentage) of the overall memory or a memory request (GB, MB).
Multi-GPU fractions - Define the compute requirement to run multiple GPU devices, by specifying either a fraction (percentage) of the overall memory or a memory request (GB, MB).

Setting GPU Fractions for Third-Party Workloads

To enable GPU fractions for workloads submitted via Kubernetes YAML, use the following annotations to define the GPU fraction configuration. You can configure either gpu-fraction or gpu-memory. Make sure the default scheduler is set to runai-scheduler. See for more details.

Variable

Input Format

Where to Set

The following example YAML creates a pod that requests 2 GPU devices, each requesting 50% of memory (gpu-fraction: "0.5") .

Using CLI

To view the available actions, go to the or the and run according to your workload.

Using API

To view the available actions, go to the and run according to your workload.

Using GB200 NVL72 and Multi-Node NVLink Domains

Multi-Node NVLink (MNNVL) systems, including NVIDIA GB200, NVIDIA GB200 NVL72 and its derivatives are fully supported by the NVIDIA Run:ai platform.

Kubernetes does not natively recognize NVIDIA’s MNNVL architecture, which makes managing and scheduling workloads across these high-performance domains more complex. The NVIDIA Run:ai platform simplifies this by abstracting the complexity of MNNVL configuration. Without this abstraction, optimal performance on a GB200 NVL72 system would require deep knowledge of NVLink domains, their hardware dependencies, and manual configuration for each distributed workload. NVIDIA Run:ai automates these steps, ensuring high performance with minimal effort. While GB200 NVL72 supports all workload types, distributed training workloads benefit most from its accelerated GPU networking capabilities.

To learn more about GB200, MNNVL and related NVIDIA technologies, refer to the following:

NVIDIA GB200 NVL72

Benefits of Using GB200 NVL72 with NVIDIA Run:ai

The NVIDIA Run:ai platform enables administrators, researchers, and MLOps engineers to fully leverage GB200 NVL72 systems and other NVLink-based domains without requiring deep knowledge of hardware configurations or NVLink topologies. Key capabilities include:

Automatic detection and labeling
- Detects GB200 NVL72 nodes and identifies MNNVL domains (e.g., GB200 NVL72 racks).
- Automatically detects whether a node pool contains GB200 NVL72.

Prerequisites

Kubernetes version - Requires Kubernetes 1.32 or later.
NVIDIA GPU Operator - Install NVIDIA GPU Operator version 25.3 or above. See the section for installation instructions. This version must include the associated Dynamic Resource Allocation (DRA) driver, which provides support for GB200 accelerated networking resources and the ComputeDomain feature. For detailed steps on installing the DRA driver and configuring ComputeDomain, refer to .
NVIDIA Network Operator - Install the NVIDIA Network Operator. See the

Configuring and Managing GB200 NVL72 Domains

Administrators must define dedicated node pools that align with GB200 NVL72 rack topologies. These node pools ensure that workloads are isolated to nodes with NVLink interconnects and are not scheduled on incompatible hardware. Each node pool can be manually configured in the NVIDIA Run:ai platform and associated with specific node labels. Two key configurations are required for each node pool:

Node Labels – Identify nodes equipped with GB200.
MNNVL Domain Discovery – Specify how the platform detects whether the node pool includes NVLink-connected nodes.

To create a node pool with GPU network acceleration, see .

Identifying GB200 Nodes

To enable the to recognize GB200-based nodes, administrators must:

Use the default node label provided by the NVIDIA GPU Operator - nvidia.com/gpu.clique.
Or, apply a custom label that clearly marks the node as GB200/MNNVL capable.

This node label serves as the basis for identifying appropriate nodes and ensuring workloads are scheduled on the correct hardware.

Enabling MNNVL Domain Discovery

The administrator can configure how the NVIDIA Run:ai platform detects MNNVL domains for each node pool. The available options include:

Automatic Discovery – Uses the default label key nvidia.com/gpu.clique, or a custom label key specified by the administrator. The NVIDIA Run:ai platform automatically discovers MNNVL domains within node pools. If a node is labeled with the MNNVL label key, the NVIDIA Run:ai platform indicates this node pool as MNNVL detected. MNNVL detected node pools are treated differently by the NVIDIA Run:ai platform when submitting a distributed training workload.
Manual Discovery – The platform does not evaluate any node labels. Detection is based solely on the administrator’s configuration of the node pool as MNNVL “Detected” or “Not Detected.”

When automatic discovery is enabled, all GB200 nodes that are part of the same physical rack (NVL72 or other future topologies) are part of the same NVL Domain and automatically labeled by the GPU Operator with a common label using a unique label value per domain and sub-domain. The default label key set by the NVIDIA GPU Operator is nvidia.com/gpu.clique and its value consists of - <NVL Domain ID (ClusterUUID)>.<Clique ID> :

The NVL Domain ID (ClusterUUID) is a unique identifier that represents the physical NVL domain, for example, a physical GB200 NVL72 rack.
The Clique ID denotes a logical MNNVL sub-domain. A clique represents a further logical split of the MNNVL into smaller domains that enable secure, fast, and isolated communication between pods running on different GB200 nodes within the same GB200 NVL72.

The provides more information on which GB200 NVL72 domain each node belongs to, and which Clique ID it is associated with.

Submitting Distributed Training Workloads

When a distributed training workload is submitted to an MNNVL-detected node pool, the NVIDIA Run:ai platform automates several key configuration steps to ensure optimal workload execution:

ComputeDomain creation - The NVIDIA Run:ai platform creates a ComputeDomain Custom Resource Definition (CRD), which is a proprietary resource used to manage NVLink-based domain assignments.
Resource Claim injection - A reference to the ComputeDomain is automatically added to the workload specification as a resource claim, allowing the Scheduler to link the workload to a specific NVLink domain.
Pod affinity configuration - Pod affinity is applied using a Preferred policy with the MNNVL label key (e.g., nvidia.com/gpu.clique

These additional steps are crucial for the creation of underlying HW resources (also known as IMEX channels) and stickiness of the distributed workload to MNNVL topologies and nodes. When a distributed workload is stopped or evicted, the platform automatically removes the corresponding ComputeDomain.

Best Practices for MNNVL Node Pool Management

When submitting a distributed workload, you should explicitly specify a list of one or more MNNVL detected node pools, or a list of one or more non-MNNVL detected node pools. A mix of MNNVL detected and non-MNNVL detected node pools is not supported. A GB200 MNNVL node pool is a pool that contains at least one node belonging to an MNNVL domain.
Other workload types (not distributed) can include a list of mixed MNNVL and non-MNNVL node pools, from which the Scheduler will choose.
MNNVL node pools can include any size of MNNVL domains (i.e. NVL72 and any future domain size) and support any Grace-Blackwell models (GB200 and any future models).

Fine-tuning Scheduling Behavior for MNNVL

You can influence how the Scheduler places distributed training workloads into GB200 MNNVL node pools using the Topology field available in the .

Note

The following options are based on , which define how pods are grouped based on topology.

Confine a workload to a single GB200 MNNVL domain - To ensure the workload is scheduled within a single GB200 MNNVL domain (e.g., a GB200 NVL72 rack), apply a topology label with a Required policy using the MNNVL label key (nvidia.com/gpu.clique). This instructs the Scheduler to strictly place all pods within the same MNNVL domain. If the workload exceeds 18 pods (or 72 GPUs), the Scheduler will not be able to find a matching domain and will fail to schedule the workload.
Try to schedule a workload using a Preferred topology - To guide the Scheduler to prioritize a specific topology without enforcing it, apply a topology label with a policy of Preferred. You can apply any topology label with a Preferred policy. These labels are treated with higher scheduling weight than the default Preferred pod affinity automatically applied by NVIDIA Run:ai for MNNVL.

Fine-tuning MNNVL per Workload

You can customize how the NVIDIA Run:ai platform applies the MNNVL feature to each distributed training workload. This allows you to override the default behavior when needed. To configure this behavior, set the proprietary label key run.ai/MNNVL in the General settings section of the . The following values are supported:

None - Disables the MNNVL feature for the workload. The platform does not create a ComputeDomain and no pod affinity or node affinity is applied by default.
Preferred (default) - Indicates that MNNVL feature is preferred but not required. This is the default behavior when submitting a distributed training workload:
- If the workload is submitted to a 'non-MNNVL detected' node pool, then the NVIDIA Run:ai platform does not add a ComputeDomain, ComputeDomain claim, pod affinity or node affinity for MNNVL nodes.

Known Limitations and Compatibility

If the DRA driver is not installed correctly in the cluster, particularly if the required CRDs are missing, and the MNNVL feature is enabled in the NVIDIA Run:ai platform, the workload controller will enter a crash loop. This will continue until the DRA driver is properly installed with all necessary CRDs or the MNNVL feature is disabled in the NVIDIA Run:ai platform.
To run workloads on a GB200 node pool (i.e., a node pool detected as MNNVL-enabled), the workload must explicitly request that node pool. To prevent unintentional use of MNNVL-detected node pools, administrators must ensure these node pools are not included in any project's default list of node pools.
Only one distributed training workload per node can use GB200 accelerated networking resources. If GPUs remain unused on that node, other workload types may still utilize them.

v2.21

Getting Started

Overview

hashtagHow NVIDIA Run:ai Helps Your Organization

What’s New

hashtagFeature Life Cycle

Quick Start Guides

hashtagInstall the Platform

hashtagOnboard Organization, Projects and Users

hashtagBuild, Train and Deploy Models

Quick Start for Infrastructure Administrators

hashtagPrerequisites

Quick Start for Platform Administrators

hashtagPrerequisites

hashtagOngoing Platform Management

hashtagConfigure Platform Behavior and Admin Settings

hashtagDefine Authorization and Access Control

hashtagDefine Organizational Structure and Quota

hashtagConfigure Node Pools

hashtagDefine Policies (Governance)

hashtagMonitor and Optimize the Platform

Quick Start for AI Practitioners

hashtagPrerequisites

hashtagGetting Started

hashtagUnderstand Workload Capabilities

hashtagRun Workloads for Your Use Case

Installation

hashtagNVIDIA Run:ai Components

hashtagInstallation Types

Support Matrix

hashtagOperator and Framework Versions

Install Using Helm

Network Requirements

hashtagExternal Access

Customized Installation

hashtagControl Plane Helm Chart Values

Upgrade

hashtagBefore Upgrade

Uninstall

hashtagUninstall the Control Plane

hashtagUninstall the Cluster

Infrastructure setup

Authentication and Authorization

Authentication and Authorization

hashtagAuthentication

Users

hashtagUsers Table

SSO

Applications

hashtag

User Applications

hashtagCreating an Application

hashtagRegenerating a Client Secret

hashtagDeleting an Application

hashtagUsing API

Advanced Setup

Node Roles

hashtagPrerequisites

hashtagConfigure Node Roles

hashtagSystem Nodes

hashtagKubectl

hashtagNVIDIA Run:ai Administrator CLI

hashtagWorker Nodes

hashtagKubectl

hashtagNVIDIA Run:ai Administrator CLI

Container Access

External Access to Containers

Service Mesh

hashtagControl Plane Configuration

Interworking with Karpenter

hashtagFriction Points Using Karpenter with NVIDIA Run:ai

Infrastructure Procedures

NVIDIA Run:ai at Scale

hashtagNVIDIA Run:ai Services

High Availability

Monitoring and Maintenance

hashtag

Shared Storage

hashtagShared Storage

Backup and Restore

How NVIDIA Run:ai Helps Your Organization

Feature Life Cycle

Install the Platform

Onboard Organization, Projects and Users

Build, Train and Deploy Models

Prerequisites

Prerequisites

Ongoing Platform Management

Configure Platform Behavior and Admin Settings

Define Authorization and Access Control

Define Organizational Structure and Quota

Configure Node Pools

Define Policies (Governance)

Monitor and Optimize the Platform

Prerequisites

Getting Started

Understand Workload Capabilities

Run Workloads for Your Use Case

NVIDIA Run:ai Components

Installation Types

Operator and Framework Versions

External Access

Control Plane Helm Chart Values

Before Upgrade

Uninstall the Control Plane

Uninstall the Cluster

Authentication

Users Table

Creating an Application

Regenerating a Client Secret

Deleting an Application

Using API

Prerequisites

Configure Node Roles

System Nodes

Kubectl

NVIDIA Run:ai Administrator CLI

Worker Nodes

Kubectl

NVIDIA Run:ai Administrator CLI

Control Plane Configuration

Friction Points Using Karpenter with NVIDIA Run:ai

NVIDIA Run:ai Services

Shared Storage

Back Up the Cluster

Access to the Kubernetes Cluster

Collect Logs to Send to Support

Uninstall the Control Plane

Uninstall the Cluster

Install the Platform

Onboard Organization, Projects and Users

Build, Train and Deploy Models

Feature Life Cycle

Prerequisites

Access to the Kubernetes Cluster

Shared Storage

Installation

Post Installation Infrastructure Setup

Secure Installation

Security Vulnerabilities

Kubernetes Storage Classes

Direct NFS Mount

Kubernetes Cluster

Host Infrastructure

Prerequisites

Getting Started

Understand Workload Capabilities

Run Workloads for Your Use Case