What's New in Version 2.24
These release notes provide a detailed summary of the features, enhancements, and updates introduced in NVIDIA Run:ai v2.24. They serve as a guide to help users, administrators, and researchers understand the new capabilities and how to leverage them for improved workload management, resource optimization, and more.
Important
For a complete list of deprecations, see Deprecation notifications. Deprecated features, APIs, and capabilities remain available for two versions from the time of the deprecation notice, after which they may be removed.
Dynamo Support
From cluster v2.23 onward, NVIDIA Run:ai supports Dynamo-based inference workloads through the DynamoGraphDeployment workload type. This allows Dynamo workloads to be deployed, scheduled, and monitored using the same platform capabilities and operational model as native workloads. See Supported workload types for more details.
Key capabilities include:
YAML-based deployment and management - Dynamo workloads can be submitted via YAML from the UI, API, or CLI, without requiring direct cluster access; a minimal manifest sketch follows this list.
Hierarchical gang scheduling - NVIDIA Run:ai supports multi-level gang scheduling for Dynamo workloads. Replica groups are scheduled together as sub-gangs, and the entire workload is then scheduled as a single unit. This ensures coordinated placement and execution across all components of the Dynamo workload.
Topology-aware scheduling - NVIDIA Run:ai applies topology-aware scheduling at the workload level to ensure Dynamo workload components are placed according to the underlying cluster topology, improving communication efficiency and execution consistency.
Support for automatic MNNVL - Dynamo workloads are supported on Multi-Node NVLink (MNNVL) domains, including GB200 NVL72 systems. NVIDIA Run:ai applies the appropriate compute domain configuration to ensure Dynamo workloads are placed and scaled within the same NVLink domain.
Automatic discovery of Dynamo frontend endpoints - NVIDIA Run:ai automatically detects Dynamo frontend endpoints and exposes them for access and monitoring.
Unified workload lifecycle and status visibility - Dynamo workloads are managed, monitored, and tracked with a unified lifecycle and status view.
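For orientation, the sketch below shows the general shape of a DynamoGraphDeployment manifest that could be submitted as YAML through the UI, API, or CLI. The apiVersion, component names, and spec fields are assumptions based on the upstream Dynamo operator and are illustrative only; refer to the NVIDIA Dynamo documentation and Supported workload types for the authoritative schema and for how the workload is associated with a project at submission time.

  # Illustrative sketch only - field names assume the upstream Dynamo operator's CRD schema.
  apiVersion: nvidia.com/v1alpha1
  kind: DynamoGraphDeployment
  metadata:
    name: llm-disagg-demo          # hypothetical workload name
  spec:
    services:                      # one entry per Dynamo component; replica groups are gang-scheduled as sub-gangs
      Frontend:
        replicas: 1
      PrefillWorker:               # hypothetical worker component name
        replicas: 2
      DecodeWorker:                # hypothetical worker component name
        replicas: 2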
AI Practitioners
Workloads
YAML-based workload submission in the UI and CLI (from cluster v2.23 onward) - Submit supported workload types defined in YAML directly from the UI and CLI. This brings YAML-based submission, previously available through the API, into interactive workflows, allowing you to submit existing Kubernetes or framework-specific manifests while still benefiting from NVIDIA Run:ai scheduling, resource management, and monitoring. See Submit supported workload types via YAML for more details.
Automatic network topology acceleration for supported workloads (from cluster v2.23 onward) - Network topology–aware scheduling is applied automatically to supported distributed workload types submitted via YAML. Once a topology is attached to a node pool, NVIDIA Run:ai automatically applies Preferred topology constraints at the lowest available level for the entire workload, optimizing pod placement without additional user configuration. This expands topology acceleration beyond NVIDIA Run:ai native distributed workloads to additional workload types. See Accelerating workloads with network topology-aware scheduling for more details.
Visibility into workload topology constraints (from cluster v2.24 onward) - Workloads now expose the topology constraints requested during scheduling, providing clear visibility into how network topology influences placement decisions. In the UI, NVIDIA Run:ai native workloads display the requested topology constraints in the workload Details view, while the Workloads API exposes these fields across native and supported workload types.
MNNVL acceleration for supported workload types (from cluster v2.24 onward) - NVIDIA Run:ai now enables running supported workload types on Multi-Node NVLink (MNNVL) domains, including GB200 NVL72 systems. NVIDIA Run:ai applies the appropriate compute domain configuration to ensure workloads are placed and scaled within the same NVLink domain. AI practitioners can submit supported workload types using the Workloads V2 API and configure their MNNVL preference as part of the workload submission.
AI application–based workload grouping in the UI (from cluster v2.23 onward) - NVIDIA Run:ai provides a dedicated AI applications view. This view automatically groups Kubernetes resources deployed via Helm charts into a single logical application, allowing you to list, sort, and filter AI applications. You can also inspect aggregated resource requests and allocations (GPU, CPU, memory) and view the underlying workloads through the Details pane, making it easier to understand and manage complex, multi-component solutions. See AI applications for more details.
Separate priority and preemptibility controls (from cluster v2.24 onward) - Workload priority and preemptibility are configured as two independent parameters across the UI, CLI, and API for native and supported workload types. If no preemptibility value is specified, the existing behavior based on priority is applied automatically. See Workload priority and preemption for more details.
Authenticated browsing for the NGC catalog (Beta; from cluster v2.23 onward) - Browse the NGC catalog and private NGC registries as an authenticated user by selecting your NGC API key credentials during workload submission or template creation. This provides access to models and containers that require authentication while preserving the option to browse the public container registry. Private NGC registries require administrator configuration in the General settings.
Improved fractional GPU support for multi-container pods (from cluster v2.24 onward) - Fractional GPUs are no longer limited to the first container in a pod. You can explicitly specify which container should receive fractional GPU resources using an annotation. If no container is specified, fractional GPUs continue to be associated with the first container by default. See GPU fractions and Dynamic GPU fractions for more details.
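As a sketch of what this could look like when submitting plain YAML, the pod below requests half a GPU through the existing gpu-fraction annotation and marks which container should receive it. The container-selection annotation key used here is a hypothetical placeholder, not the documented name - take the exact annotation from GPU fractions and Dynamic GPU fractions.

  # Sketch only - 'runai-fraction-container' is a hypothetical placeholder for the real annotation key.
  apiVersion: v1
  kind: Pod
  metadata:
    name: multi-container-fraction-demo
    annotations:
      gpu-fraction: "0.5"                       # NVIDIA Run:ai fractional GPU request
      runai-fraction-container: inference       # hypothetical key selecting the target container
  spec:
    schedulerName: runai-scheduler              # assumes the pod is scheduled by the NVIDIA Run:ai Scheduler
    containers:
      - name: inference                         # container intended to receive the GPU fraction
        image: example.com/inference:latest     # hypothetical image
      - name: log-sidecar                       # CPU-only sidecar; no GPU resources
        image: busybox
        command: ["sh", "-c", "sleep infinity"]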
Support for elastic distributed workloads on NVLink domains (from cluster v2.24 onward) - Elastic distributed workloads, including auto-scaling and dynamically sized deployments, are fully supported on GB200 NVL72 and Multi-Node NVLink (MNNVL) domains using NVIDIA DRA driver version 25.8 and later. NVIDIA Run:ai automatically applies ComputeDomain configuration and topology-aware scheduling to ensure workloads scale within the same NVLink domain. See Using GB200 NVL72 and Multi-Node NVLink domains for more details.
Native Load Balancer support (from cluster v2.24 onward) - NVIDIA Run:ai exposes LoadBalancer connectivity directly in the UI and CLI when submitting workloads or creating templates (assuming a load balancer is already installed in the cluster). Configure service ports explicitly and view clearer port configuration and connectivity status.
Min/max worker configuration for PyTorch distributed training - Define the minimum and maximum number of workers directly from the UI when submitting PyTorch distributed training workloads, providing greater flexibility and control over resource allocation. See Train models using a distributed training workload for more details.
Connections column enabled by default in the Workloads grid - The Connections column is now selected by default in the Workloads table. When a workload has a single connection, the URL is displayed directly with long URLs automatically shortened. The URL is clickable and opens the Connections dialog. When multiple connections exist, the table displays the total count.
Enhanced workload details view - The workload Details tab for NVIDIA Run:ai native workloads has been updated to provide a clearer and more structured view of workload configuration. The updated design improves readability and makes it easier to understand how a workload was submitted and configured. Key enhancements include:
Improved layout and data presentation - Configuration fields are grouped and displayed more intuitively, helping users quickly find the information they need.
Specification selector - When a workload contains multiple specs, a new dropdown allows you to easily switch between them.
Inference
NVIDIA NIM service API enhancements (from cluster v2.23 onward) - NVIDIA Run:ai expands support for deploying and managing NIMs through the NVIDIA NIM Operator, providing a standardized, operator-based deployment flow aligned with NIM-native configurations. NIM services are fully managed through the NVIDIA Run:ai API, with UI and CLI support planned for a future release. This capability does not replace the current NIM deployment flow and is available as an additional option. See NVIDIA NIM API for more details. Key enhancements include:
Autoscaling, allowing NIM services to scale dynamically based on demand
Fractional GPU support, enabling NIM services to request and use partial GPUs for more efficient GPU utilization
Multi-node NIM deployments, enabling distributed NIM workloads across multiple nodes
Policy enforcement through a dedicated NVIDIA NIM Policy API for consistent governance of NIM services
Partial updates via a new PATCH endpoint, allowing targeted changes without resubmitting the full specification
NIM Cache support for model stores, enabling caching of specific LLM or multi-LLM model artifacts to improve startup time and reuse across deployments
LeaderWorkerSet (LWS) as a new workload type (from cluster v2.23 onward) - LeaderWorkerSet is now available as a supported workload type. LWS workloads can be deployed and managed using YAML submission from the UI, API, or CLI, providing a standardized way to run leader–worker and multi-process workloads across the platform without direct cluster access. See Supported workload types for more details.
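For reference, a minimal LeaderWorkerSet manifest following the upstream leaderworkerset.x-k8s.io API is sketched below; it could be submitted through the YAML flow like any other supported workload type. The image is a placeholder, and any NVIDIA Run:ai-specific project or scheduling attachment applied at submission time is not shown.

  # Minimal LeaderWorkerSet using the upstream LWS API; the image is a hypothetical placeholder.
  apiVersion: leaderworkerset.x-k8s.io/v1
  kind: LeaderWorkerSet
  metadata:
    name: lws-demo
  spec:
    replicas: 2                    # number of leader-worker groups
    leaderWorkerTemplate:
      size: 4                      # pods per group: one leader plus three workers
      workerTemplate:
        spec:
          containers:
            - name: worker
              image: example.com/serving:latest   # hypothetical image
              resources:
                limits:
                  nvidia.com/gpu: 1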
NGC API key support for NVIDIA NIM workloads (from cluster v2.23 onward) - NVIDIA Run:ai supports using an NGC API key when deploying NIM workloads to handle both image access and model runtime authentication. A single NGC API key is automatically applied for pulling NIM images from the NGC catalog and injected as a runtime environment variable required for downloading model weights. This streamlines NIM deployment by removing the need for separate pull secrets and runtime credentials while enabling full user self-service for authenticated NIM workloads. See Deploy inference workloads from NVIDIA NIM for more details.
Distributed inference support in the CLI (from cluster v2.23 onward) - Native distributed inference workloads can be submitted and managed directly from the NVIDIA Run:ai CLI. AI practitioners can use familiar NVIDIA Run:ai commands to work with distributed inference workloads, such as list, describe, logs, exec, port-forward, update, and delete. See CLI command reference for more details.
Control access scope for inference serving endpoints (from cluster v2.24 onward) - Set whether an inference serving endpoint is accessible externally or restricted to internal cluster traffic when submitting workloads or creating templates. Endpoints can be configured as External (public access), if your administrator has configured Knative to support external access, or Internal only, limiting access to in-cluster traffic.
Hugging Face model catalog browsing - Browse and search the Hugging Face model catalog directly from the NVIDIA Run:ai UI and API when creating Hugging Face inference workloads. The live catalog view displays model details such as download count and gated status. For gated models, the platform prompts you to provide a Hugging Face token for access, while open models can be selected without authentication. See Deploy inference workloads from Hugging Face for more details.
Distributed inference templates (API; from cluster v2.22 onward) - Distributed inference templates allow you to save workload configurations that can be reused across distributed inference submissions. These templates simplify the submission process and promote standardization across distributed inference workloads.
New NVIDIA NIM performance histogram metrics (from cluster v2.23 onward) - The Metrics pane now includes two new histograms for NVIDIA NIM workloads: end-to-end request latency and time to first token (TTFT). These metrics provide deeper visibility into inference performance and responsiveness. See NVIDIA NIM metrics for more details.
Workload Assets
Updated credential creation in the UI (from cluster v2.23 onward) - The Credentials page has been redesigned for improved usability. The Access key and Username & password credential types have been consolidated under Generic secret, where each secret format now opens a dedicated form with context-specific input fields. In addition, a dedicated SSH key format has been added under Generic secret for easier configuration of SSH-based authentication. This change simplifies the UI and provides a more streamlined experience for managing credentials. See Credentials for more details.
NGC API key support for credentials (from cluster v2.23 onward) - A new NGC API Key credential type is available across both credential assets and user credentials, enabling authenticated access to NGC resources, including gated or private models and images.
New PVC events - NVIDIA Run:ai emits new PVC asset lifecycle events - Creating, Deleting, and Syncing. These events appear in the PVC’s Event history, extending the visibility introduced in previous releases and giving administrators clearer insight into PVC asset changes and activity over time.
Command-line Interface (CLI v2)
Template and asset-based workload submission in the CLI (from cluster v2.23 onward) - The NVIDIA Run:ai CLI supports submitting native workloads using existing templates and workload assets, such as compute resources, environments, and data sources. This allows AI practitioners to reuse the same predefined configurations available in the UI and API, reducing the need for long, flag-heavy CLI commands. Templates and assets can be browsed and inspected directly from the CLI to support consistent and reliable workload submission. See CLI command reference for more details.
Cluster diagnostics collection command - Added a new CLI command, runai diagnostics collect-logs, which gathers diagnostic logs from the Kubernetes cluster for troubleshooting or sharing with NVIDIA Run:ai support. You can collect logs from all or specific namespaces, specify an output directory, and choose whether to include previous pod logs, simplifying cluster debugging and support workflows. See runai diagnostics command for more details.
Extended storage visibility in CLI describe commands - The describe command for native workloads supports --storage, showing storage resources such as PVCs, ConfigMaps, and Secrets.
Platform Administrators
Organizations - Projects/Departments
Redesigned Projects and Departments management (from cluster v2.20 onward) - NVIDIA Run:ai introduces an improved organization management experience that provides better visibility into resource distribution and clearer explainability for how resources are prioritized and allocated across the organization. This update simplifies large-scale organizational management while maintaining full compatibility with NVIDIA Run:ai’s advanced scheduling capabilities. See Projects and Departments for more details.
Key improvements include:
Improved organizational visibility - A clearer, “big picture” view of projects and departments, making it easier to understand how GPU resources are distributed and prioritized.
Bulk management operations - Administrators can perform bulk actions across multiple organizational units directly from the UI and API, reducing operational overhead.
Clearer resource explainability - Improved transparency into resource contention and ordering, helping align scheduling behavior with business needs.
Authentication and Authorization
Custom roles using permission sets (API; from cluster v2.21 onward) - Administrators can now create custom roles by combining predefined permission sets using the Roles API. Permission sets are predefined, supported groupings of permissions that represent all required permission dependencies for a specific operation (for example, workload submission). Custom roles can then be assigned to users or groups through access rules in the UI or API, alongside the existing NVIDIA Run:ai predefined roles. This allows organizations to tailor access control to their operational needs while maintaining compatibility with the platform’s supported permission model. See Roles for more details.
Updated predefined roles - Predefined roles in NVIDIA Run:ai have been updated to better align with common organizational responsibilities and workflows. See Roles for more details.
Added new predefined roles - AI practitioner, Data and storage administrator, and Project administrator
Some existing predefined roles have been deprecated. See Deprecation notifications for more details.
Reduced access to clusters and node pools - New APIs and permissions, Clusters minimal and Node pools minimal, are now available to support reduced access to clusters and node pools. These APIs allow roles to perform actions such as workload submission while exposing only the minimal required cluster and node pool information (for example, names and IDs), rather than full read access. This improves role design by aligning the visible data with what is actually required for the action being performed. Roles that rely on full read access remain unchanged. Some predefined roles are planned to transition to the new minimal access as described in the Deprecation notifications.
Renamed applications and user applications - Authentication entities in NVIDIA Run:ai have been renamed across the UI, API, and CLI to better reflect their purpose and usage. Applications are now referred to as service accounts, and user applications are now referred to as access keys. These changes are terminology updates only: all existing functionality remains the same, and all existing records continue to work unchanged. Existing API endpoints and CLI login modes reflect the updated terminology. See Deprecation notifications for more details.
Email invitations for local users - Administrators can now choose to automatically send an email invitation when creating a local user. When enabled, the invitation email allows the new user to sign in and get started. To use this feature, ensure that email notifications are configured under General settings.
Node Pools
Time-based fairshare configuration per node pool (from cluster v2.24 onward) - NVIDIA Run:ai supports time-based fairshare to improve long-term fairness in over-quota resource allocation. Instead of relying only on momentary demand, the Scheduler factors in historical GPU usage over time, ensuring that projects with lower recent consumption are given fair access to resources. Usage is tracked continuously, and each project’s GPU-hour consumption is evaluated against its configured weight to balance resource distribution more effectively across projects. Time-based fairshare can be enabled and configured per node pool using the Node pools form, with advanced customization available through the Node pools API.
Onboarding
Email server configuration during admin onboarding - The administrator onboarding wizard includes an optional step for configuring an email server (SMTP). This allows administrators to set up email delivery early in the onboarding process, enabling email invitations for local users and supporting email-based notifications across the platform using the same configuration.
Analytics
Overview dashboard enhancements - Improvements to the Overview dashboard strengthen visibility and support key monitoring workflows. These enhancements also support the deprecation of the legacy Grafana dashboards; see Deprecation notifications for more details. Enhancements include:
Enhancements to the Projects/Departments tables - Timeframe controls for GPU allocation, utilization, and memory utilization are located within each column. The tables display up to 15 entries, include GPU quota, separate pending and running workloads, and provide direct links to each project or department.
New pending-time widget - Introduced a new widget that displays pending workloads count by pending time, helping admins understand how long workloads wait and identify projects and departments with extended pending times.
New guided tour for the Overview dashboard - A built-in tour walks administrators and AI practitioners through the key areas of the Overview dashboard, helping them navigate the dashboard and become familiar with the functionality available.
Additional dashboard improvements include:
Added a top-stats Failed workloads widget.
Added numeric counters to bar graphs, displaying values directly on each bar rather than only in tooltips.
Updated the Workloads by category/type widget to count running workloads only.
Added an Idle time column to the idle workloads table.
Updated widget ordering to separate current-time widgets from over-time widgets, improving the analysis flow from identifying a current issue to investigating it over time.
Policies
Policy-aware behavior for templates and assets - Templates and assets that do not fully comply with policy are no longer blocked outright when submitting a workload. Instead, NVIDIA Run:ai now evaluates non-compliance on a case-by-case basis:
Fixable non-compliance - If compliance can be achieved by adjusting settings during submission, the template or asset can be loaded. The UI highlights what needs to be updated to meet policy requirements.
Non-fixable non-compliance - If the non-compliant configuration cannot be changed, the template or asset cannot be used, and the relevant policy is displayed to explain the restriction.
Quick workload submit behavior - Templates with any non-compliance can now be loaded, but quick submit is automatically blocked. The full workload submission flow opens by default, where the UI highlights what needs to be updated to meet policy requirements.
Network Topology
Network topology visibility in clusters and node pools (from cluster v2.23 onward) - The Network topologies modal in the Clusters page displays a new column showing which node pools each topology is associated with. This information is also available in the Network topologies API.
Reports
Consumption report enhancements for GPU hour breakdown - The Consumption report includes two new columns, GPU deserved quota hours and GPU over-quota hours. These metrics fully support all existing grouping options, including cluster, node pool, department, and project. This change also supports deprecating the legacy Consumption dashboard. See Deprecation notifications for more details.
Event History
Audit logging for password resets - Audit logs capture all password reset events, including administrator-initiated resets, user-initiated resets, and password-recovery (“forgot password”) actions. This enhancement improves traceability and security visibility across user management workflows.
Infrastructure Administrators
Installation
Ingress controller recommendation update (from cluster v2.24 onward) - Due to an announced deprecation by the upstream NGINX Ingress Controller project, NVIDIA Run:ai is updating its recommended ingress controller to HAProxy Ingress for supported environments. The Kubernetes Ingress standard remains fully supported. This change affects only the underlying ingress controller implementation and is intended to ensure long-term security, stability, and maintainability. For fresh installations, see Installation. To upgrade from earlier versions, see Migrate from NGINX to HAProxy Ingress.
Advanced Configurations
Global pod tolerations (control plane; from cluster v2.24 onward) - You can define global.tolerations to apply a shared list of Kubernetes pod tolerations across NVIDIA Run:ai services and supported third-party services. These tolerations are applied globally unless explicitly overridden at the individual service level. See Advanced control plane configurations for more details.
Global replica count (control plane; from cluster v2.24 onward) - A new global.replicaCount setting allows you to define a default number of replicas for all NVIDIA Run:ai services that support scaling. See Advanced control plane configurations for more details.
Default pod anti-affinity (from cluster v2.24 onward) - A new control plane and cluster configuration, global.requireDefaultPodAntiAffinity, applies a default pod anti-affinity rule to prevent pods from the same service from being scheduled on the same node when possible. This setting is enabled by default. See Advanced control plane configurations and Advanced cluster configurations for more details.
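To illustrate how these settings fit together, here is a sketch of the relevant excerpt of a control plane values file. The global.tolerations, global.replicaCount, and global.requireDefaultPodAntiAffinity keys come from this release; the toleration entry itself is a hypothetical placeholder using standard Kubernetes toleration fields. See Advanced control plane configurations for the supported values.

  # Hypothetical excerpt of a control plane values file combining the new global settings.
  global:
    replicaCount: 2                        # default replica count for services that support scaling
    requireDefaultPodAntiAffinity: true    # enabled by default; spreads replicas of a service across nodes when possible
    tolerations:                           # standard Kubernetes tolerations applied to NVIDIA Run:ai services
      - key: dedicated                     # hypothetical taint key
        operator: Equal
        value: runai-system                # hypothetical taint value
        effect: NoSchedule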
System Requirements
NVIDIA Run:ai supports Kubernetes version 1.35.
NVIDIA Run:ai supports OpenShift version 4.20.
NVIDIA Run:ai supports GPU Operator version 25.10.
OpenShift version 4.16 is no longer supported.
Kubernetes version 1.31 is no longer supported.
Rancher Kubernetes Engine (RKE1) is no longer supported due to reaching end of life (EOL). RKE2 is the recommended Rancher distribution. See the Rancher migration guide for more details.
Deprecation Notifications
Note
Deprecated features, APIs, and capabilities remain available for two versions from the time of the deprecation notice, after which they may be removed.
NVIDIA Run:ai Predefined Roles
The following predefined roles are deprecated in the UI and API. Review the new predefined roles to determine whether they meet your requirements, or create a custom role using the API. See Roles for more details:
Compute resource administrator
Credentials administrator
Data source administrator
Data volume administrator
Environment administrator
L1 researcher
L2 researcher
ML engineer
Research manager
Template administrator
During the deprecation period, the following predefined roles will be updated with minimal access to cluster and node pool data:
Both cluster and node pool access are planned to transition to Clusters minimal and Node pools minimal - L1 researcher, L2 researcher, Research manager
Cluster access is planned to transition to Clusters minimal while node pools access remains unchanged - ML engineer
Cluster access is planned to transition to Clusters minimal - Compute resource administrator, Credentials administrator, Data source administrator, Data volume administrator, Environment administrator, Template administrator
Models Catalog
The Models catalog page is deprecated. Previously, the Models catalog provided a quick start experience for deploying a curated set of Hugging Face models. The same capability is now available through the Hugging Face inference workload flow, which integrates directly with Hugging Face and allows you to browse, select, and deploy any supported model from an open list. To deploy Hugging Face models, use the Hugging Face inference workload flow.
API Deprecation Notifications
Deprecated Endpoints
/api/v1/apps - replaced by /api/v1/service-accounts
/api/v1/user-applications - replaced by /api/v1/access-keys
/api/v1/authorization/roles - replaced by /api/v2/authorization/roles
Deprecated Parameters
/api/v1/authorization/access-rules - the subjectType: app enum value is deprecated and replaced by subjectType: service-account
/api/v2/authorization/roles, /api/v1/authorization/roles, /api/v1/authorization/permissions - the resourceType: app enum value is deprecated and replaced by resourceType: service-account
/api/v1/org-unit/projects, /api/v1/org-unit/departments - the resources: priority parameter is deprecated and replaced by resources: rank
CLI Deprecation Notifications
runai login - the application login mode is deprecated and replaced by access-key
runai login - the user login mode is deprecated and replaced by password
runai [workloadtype] submit - the --environment flag is deprecated and replaced by --environment-variable