What's New in Version 2.22

These release notes provide a detailed summary of the new features, enhancements, and updates introduced in NVIDIA Run:ai v2.22. They serve as a guide to help users, administrators, and researchers understand the new capabilities and how to leverage them for improved workload management, resource optimization, and more.

Important

For a complete list of deprecations, see Deprecation notifications. Deprecated features and capabilities remain available for two versions after the notification.

AI Practitioners

Workloads

  • Flexible workload templates - Flexible workload templates allow you to save workload configurations that can be reused across workload submissions. You can create templates from scratch or base them on existing assets - environments, compute resources, or data sources. These templates simplify the submission process and promote standardization across users and teams. See Workload templates for more details. From cluster v2.22 onward Experimental

    • Template creation - Define and save workload configurations for flexible workspace, standard training, and distributed training workloads.

    • Reusable configurations - Submit flexible workloads quickly by reusing existing templates or modifying them as needed.

    • Edit support - Use the new Edit action to update saved templates with new or revised settings.

    • Standardized setup - Ensure consistency across workload submissions and align with organizational best practices.

  • Expanded workload priority management - Workload priority management has been extended with a new set of predefined priority values and broader configuration options. Researchers can select a priority when submitting flexible workloads in the UI, in addition to the API and CLI. Administrators can update the default priority mapping for each workload type via the API, allowing platform-wide alignment with scheduling policies. These enhancements provide greater flexibility and control over how workloads are scheduled within a project. See Workload priority control for more details. From cluster v2.22 onward

  • User-scoped credential management - Users can create and manage their own credentials directly in the UI and API. These credentials are scoped to the individual user and can be securely referenced when configuring environment variables or setting up authenticated image pulls during flexible workload submission. See User credentials for more details. From cluster v2.22 onward

  • Configurable MPI launcher start behavior for distributed training - Configure the MPI launcher to wait until all workers are ready before starting execution when submitting an MPI distributed training workload via the flexible workload form in the UI, as well as through the API and CLI. This prevents failures caused by premature launcher execution and improves the stability of distributed workloads. See Train models using a distributed training workload for more details. From cluster v2.21 onward

  • Suspend and resume actions for multiple workloads - Suspend or resume multiple workloads at once using the multi-select option in the UI. From cluster v2.18 onward

  • Pod deletion policy for terminal distributed training workloads - Specify which pods should be deleted when a distributed training workload reaches a terminal state (completed/failed) using the flexible workload form in the UI. This enhancement provides greater control over resource cleanup and helps maintain a more organized and efficient cluster environment. See Train models using a distributed training workload for more details. From cluster v2.21 onward

  • Node selection controls via API and CLI - Gain more control over workload placement with new node selection capabilities (a CLI sketch follows the list below): From cluster v2.22 onward

    • Node affinity via API - Define rules to target specific nodes based on hardware or configuration labels when submitting workloads through the API using nodeAffinityRequired.

    • Exclude nodes via CLI - Use the --exclude-node flag to avoid specific nodes when submitting a workload through the CLI.
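
    For example, a minimal sketch of excluding a node at submission time. Only the --exclude-node flag is the documented addition here; the workspace submit form, workload name, and image are illustrative, so check the CLI commands reference for the exact syntax:

      # Illustrative only: submit a workspace while keeping it off a specific node
      runai workspace submit my-workspace \
          --image nvcr.io/nvidia/pytorch:24.05-py3 \
          --exclude-node worker-gpu-03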

  • ConfigMap subpath support - Added support for the subpath parameter in ConfigMap mounting, allowing you to use different paths within the volume instead of just the root. This is supported across all workloads that use flexible submission in the UI, and is also available via the API and CLI. From cluster v2.21 onward

  • Support for custom sshd mount path for non-root MPI jobs - Define a custom sshd mount path when submitting MPI distributed training workloads as a non-root user using the flexible workload form in the UI, as well as the API and CLI. This ensures correct SSH communication between master and worker nodes by avoiding fallback to root’s SSH configuration. See Train models using a distributed training workload for more details. From cluster v2.21 onward

  • Enhanced timeframe granularity for workload metrics - Added more predefined options for selecting workload metric timeframes (e.g., last minute, last 5 minutes, last hour), and improved the custom date range selector for greater flexibility. This allows users to analyze workload performance with finer granularity and more control over historical data views.

Workload Assets

  • Built-in NVIDIA environments for NeMo, BioNeMo, and PyTorch - NVIDIA Run:ai provides pre-configured NVIDIA environments for NeMo, BioNeMo, and PyTorch, enabling researchers to quickly launch workloads using optimized, ready-to-use frameworks. See Environments for more details. From cluster v2.22 onward

  • Improved compute resource configuration experience - The compute resource form has been enhanced to offer a more intuitive and streamlined experience. Users now benefit from clear summaries of total GPU resources requested, improved GPU memory configuration with validation, and contextual preemption alerts when request and limit values may trigger preemption. The layout also separates required fields from advanced options. See Compute resources for more details.

  • Display of custom storage class fields in PVC creation - The UI surfaces custom fields defined in storage class configurations when creating PVCs or volumes. This ensures users see all relevant configuration details up front, reducing setup errors and aligning storage choices with admin-defined policies. See Data sources for more details. From cluster v2.22 onward

  • Shared volume between projects - Create PVCs at the department or cluster scope with a shared volume accessible across projects. Supports both read and write options, whereas data volumes are read-only and intended for controlled, organization-wide data access. This enables more efficient storage utilization, especially in environments with limited quota space, and removes the need to over-provision volumes purely for performance gains. See Data sources for more details. From cluster v2.22 onward

Command-line Interface (CLI v2)

  • Project and department support in the CLI - The CLI includes enhanced functionality for Admins to manage projects and departments; a usage sketch follows the list below. See CLI commands reference for more details:

    • runai project create to create new projects

    • runai departments list to view a list of departments
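
    A quick usage sketch, assuming the project name is passed as a positional argument (the project name is illustrative; see the CLI commands reference for the full set of options):

      # Create a project, then list the departments visible to you
      runai project create team-nlp
      runai departments list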

  • Resource flag support for master pod in distributed training - Added new flags in the CLI to specify CPU and memory resources for the master pod in distributed training, including options to set CPU core limits, CPU core requests, memory limits, and memory requests. See CLI commands reference for more details.

  • Pagination support in the CLI - The CLI supports pagination for list commands (workloads, projects, nodes, node pools, and PVCs). You can control the number of results per page using --page-size, limit the total number of results with --max-items, and retrieve additional pages using --next-token. To disable pagination and retrieve a single page of results, use the --no-pagination flag. See CLI commands reference for more details.
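
    For example, a hedged sketch of paginated listing; the workload list subcommand shown here is assumed, and <token> stands for the continuation token returned with the previous page:

      # Fetch 20 workloads per page, capped at 100 results in total
      runai workload list --page-size 20 --max-items 100
      # Continue from where the previous page left off
      runai workload list --page-size 20 --next-token <token>
      # Disable pagination and retrieve a single page of results
      runai workload list --no-pagination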

  • Automatic CLI version updates - A new --auto-update flag has been added to the config set CLI command, allowing you to enable automatic version updates. This ensures you're always using the latest CLI features and fixes without needing manual upgrades. See CLI commands reference for more details. From cluster v2.22 onward
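
    A minimal sketch of enabling this; whether the flag takes an explicit boolean value is an assumption, so confirm the form in the CLI commands reference:

      runai config set --auto-update true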

  • Expanded commands for inference workloads in the CLI - The CLI supports additional container management commands for inference workloads, including logs, bash, exec, and port-forward. This update aligns inference support with other workload types and provides a more complete and consistent CLI experience for debugging and runtime interaction. See CLI commands reference for more details.
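
    For example (the workload name is hypothetical and the exact argument forms may differ; see the CLI commands reference):

      # Stream logs from an inference workload, then open an interactive shell in its container
      runai inference logs llm-server
      runai inference bash llm-server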

ML Engineers

Inference

  • Distributed inference support via API - The inference API supports distributed inference by allowing submission of multi-node workloads with a Leader Worker Set (LWS). This enables more advanced deployment patterns where a leader pod coordinates execution across multiple nodes, supporting scalable inference use cases. To enable this capability, install the LWS Helm chart on your cluster. From cluster v2.22 onward Experimental
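
    A hedged sketch of installing the upstream LeaderWorkerSet (LWS) Helm chart as a prerequisite. The chart location, version, and namespace below follow the upstream kubernetes-sigs/lws project and are not NVIDIA Run:ai specifics; check that project's documentation for the release that matches your cluster:

      # Install the upstream LWS controller (version shown is illustrative)
      helm install lws oci://registry.k8s.io/lws/charts/lws \
          --version 0.6.1 \
          --namespace lws-system --create-namespace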

  • New timeout controls for inference workloads - The inference workload submission in the UI, API and CLI supports timeout parameters across all inference workload types. These parameters manage workload initialization and request handling, improving control over workload behavior and ensuring timely failure detection and response enforcement. UI support is available for all inference workload types, including custom inference (flexible submission), Hugging Face, and NVIDIA NIM. From cluster v2.22 onward

  • UI support for inference policies - Administrators can submit inference policies directly through the UI. These policies are reflected in the custom inference (flexible submission) form. When applied, a policy dynamically adjusts the interface by modifying card visibility, enabling or disabling fields, locking values, and enforcing defaults or value ranges. See Policies for more details.

  • Application-based API access for inference endpoints - Inference endpoints for all inference workloads - custom, Hugging Face, and NVIDIA NIM - support authentication using NVIDIA Run:ai user applications (OIDC clients). This enables secure, programmatic access to inference endpoints when accessed externally from the cluster. To use this capability, configure the serving endpoint, authenticate using a token granted by a user application, and use the token in API requests to the endpoint.
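
    A minimal sketch of the request flow, assuming you have already created a user application and obtained an access token from it; the endpoint URL, serving path, and request body below are placeholders rather than the actual serving API schema:

      # <TOKEN> is the access token granted by your NVIDIA Run:ai user application (OIDC client)
      curl -H "Authorization: Bearer <TOKEN>" \
           -H "Content-Type: application/json" \
           -d '{"prompt": "Hello"}' \
           https://<inference-endpoint-url>/<serving-path>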

  • New gRPC option for NIM workloads - You can select gRPC as a protocol when submitting inference workloads through the NVIDIA NIM form, enabling more flexible communication with inference servers. See Deploy inference workloads with NVIDIA NIM for more details.

  • Copy and edit NIM Inference workloads in the UI - Use the Make a Copy option to duplicate and modify existing NIM inference workloads directly in the UI, making it easier to reuse and adapt workload configurations.

Platform Administrators

Analytics

  • New workload category feature for standardized workload classification - NVIDIA Run:ai supports workload categories, enabling consistent classification of workloads based on their purpose. Each workload type is automatically assigned a default category, which is visible in the Overview dashboard to improve filtering, grouping, and monitoring. Administrators can customize these default mappings using the API to align with organizational needs. See Monitor workloads by category for more details. From cluster v2.22 onward

  • Pending time visibility for workloads - The workloads grid displays total pending time, which represents the cumulative duration a workload spent in Pending state. This helps administrators assess resource demands for specific projects or departments.

Nodes / Node Pools

  • Swap and Node Level Scheduler controls for node pools - Administrators can configure swap settings per node pool via the API, including CPU swap memory size and reserved GPU memory for swap operations. Additionally, a new API option allows enabling or disabling the Node Level Scheduler, replacing the deprecated overProvisioningRatio field. These capabilities offer more fine-grained resource control and scheduling behavior per node pool. From cluster v2.22 onward

  • Enhanced metric graphs for nodes - The metric graphs in the DETAILS tab for nodes have been enhanced and aligned with the dashboard and the Nodes API. As part of this improvement, the following columns have been removed from the nodes table:

    • Used GPU memory

    • GPU compute utilization

    • GPU memory utilization

    • Used CPU memory

    • CPU compute utilization

    • CPU memory utilization

    • Used swap CPU memory

    • Advanced metrics

  • Expanded pod telemetry for resource allocation - The node view displays detailed allocation metrics for each pod. This added visibility helps users better understand how resources are distributed across running pods:

    • Allocated GPUs

    • Allocated GPU memory

    • Allocated CPUs (cores)

    • Allocated CPU memory

  • Expanded GPU telemetry for resource allocation - The node view and API display detailed allocation metrics for each GPU:

    • Allocated compute

    • Allocated memory

Authentication and Authorization

  • Bulk delete for access rules - Users can select and delete multiple access rules at once, provided they have the necessary permissions for each rule. See Access rules for more details.

  • Multi-selection support when creating access rules - When creating a new access rule, you can now select multiple subjects (users, applications, or SSO groups) and multiple scopes (projects, departments, and clusters) in a single rule in both the UI and API. This will automatically create a separate access rule for each subject-scope combination. See Access rules for more details.

  • New security settings API with scoped permissions - A dedicated API is available for managing security settings, with its own scoped permissions separate from the General settings. This separation enables more granular control over who can manage sensitive security-related configurations. The API supports updates to the following configuration keys (an illustrative request follows the list): From cluster v2.20 onward

    • autoRedirectSSO

    • excludeGroupsFromToken

    • browserSessionTimeout

    • logoutRedirectUri
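
    For example, a hedged sketch of updating these keys. The endpoint path, HTTP method, value types, and units (e.g., whether browserSessionTimeout is expressed in seconds) are assumptions; only the key names come from this release note:

      # Illustrative request; <control-plane-url>, the endpoint path, and the example values are placeholders
      curl -X PUT "https://<control-plane-url>/<security-settings-endpoint>" \
           -H "Authorization: Bearer <TOKEN>" \
           -H "Content-Type: application/json" \
           -d '{
                 "autoRedirectSSO": true,
                 "excludeGroupsFromToken": false,
                 "browserSessionTimeout": 3600,
                 "logoutRedirectUri": "https://idp.example.com/logout"
               }'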

  • Support for exchanging external IdP tokens via API - The Tokens API supports exchanging an access token issued by an external identity provider (IdP) for an NVIDIA Run:ai access token. This enables seamless integration with external authentication systems. From cluster v2.20 onward

Notifications

  • Support for 'None' authentication type in email notifications - The email notification settings support an authentication type of None, allowing configuration of SMTP servers that do not require authentication.

Organizations - Projects/Departments

  • Departments enabled by default - To simplify onboarding and standardize tenant structure, all new tenants will include a default department. The Departments setting flag has been removed from the General settings. See Departments for more details.

  • GPU memory utilization metrics for projects and departments - Added new metrics that track GPU memory utilization at the project and department level. This enables more granular visibility into resource usage across organizational units, helping teams monitor consumption and optimize allocations.

  • Enhanced visibility into node pool configuration for projects and departments - The node pool details view for both projects and departments includes additional scheduling and allocation fields: Priority and Max GPU devices allocation. For departments, Over-quota and Over-quota weight have also been added. These updates provide clearer visibility into how resources are configured and prioritized, helping Admins better understand scheduling behavior across node pools. From cluster v2.20 onward

Infrastructure Administrators

Installation

  • Support for dedicated Prometheus in OpenShift deployments - Administrators can install and configure their own Prometheus instance in OpenShift environments to be used by NVIDIA Run:ai. This provides greater flexibility and control over monitoring infrastructure in custom or enterprise-grade deployments. From cluster v2.22 onward

  • Support for custom CA with S3 and Git integrations - Administrators can configure a custom Certificate Authority (CA) for secure TLS communication with S3 and Git repositories. This update extends existing custom CA support to also cover the Git-sync and S3 sidecar containers. It simplifies setup for airgapped environments by eliminating the need for manually built images and ensures consistent secure communication across all components. From cluster v2.22 onward

Advanced Cluster Configurations

  • Enable replica capability for additional services - NVIDIA Run:ai supports running multiple replicas for additional services to enable high availability. These services operate in hot-standby or leader election mode, ensuring fault tolerance by eliminating single points of failure without impacting system performance. From cluster v2.22 onward

System Requirements

  • NVIDIA Run:ai supports Knative version 1.18.

  • NVIDIA Run:ai supports OpenShift version 4.19.

  • NVIDIA Run:ai supports Kubernetes version 1.33.

  • Kubernetes version 1.30 is no longer supported.

Deprecation Notifications

Consumption Dashboard

The Consumption dashboard is deprecated and replaced with Reports. Consumption reports provide improved visibility into resource usage with enhanced filtering and export capabilities. We recommend transitioning to consumption reports for the most up-to-date insights.

Templates

The Templates feature is deprecated. We recommend transitioning to flexible workload templates, which offer enhanced functionality and support for flexible workload types - including workspace, standard training, and distributed training.
