What's New

The what's new provides transparency into the latest changes and improvements to NVIDIA Run:ai’s SaaS platform. The updates include new features, optimizations, and fixes aimed at improving performance and user experience.

Gradual Rollout

SaaS features and bug fixes are gradually rolled out to customers to ensure a smooth transition and minimize any potential disruption. SaaS releases follow a scheduled rollout cadence, typically every two weeks, allowing us to introduce new functionalities in a controlled and predictable manner. All customers receive the changes within 10 days of the initial release.

In contrast, hotfixes are deployed as needed to address urgent issues and are released immediately to ensure the stability and security of the service.

DGX Cloud

Certain features are first made available in fully managed cloud-based deployments provisioned through DGX Cloud. These features are labeled as DGX Cloud only and will become available to all customers in future releases.

Feature Life Cycle

NVIDIA Run:ai uses life cycle labels to indicate the maturity and stability of features across releases:

Experimental - This feature is in early development. It may not be stable and could be removed or changed significantly in future versions. Use with caution.
Beta - This feature is still being developed for official release in a future version and may have some limitations. Use with caution.
Legacy - This feature is scheduled to be removed in future versions. We recommend using alternatives if available. Use only if necessary.

July 2025 Releases

July 06

Product Enhancements

Flexible workload templates - Flexible workload templates let you save workload configurations that can be reused across submissions. You can create templates from scratch or base them on existing assets like environments, compute resources, or data sources. These templates simplify the submission process and promote standardization across users and teams. DGX Cloud only Experimental
- Template creation - Define and save workload configurations for flexible workspace, standard training, and distributed training workloads.
- Reusable configurations - Submit flexible workloads quickly by reusing existing templates or modifying them as needed.
- Edit support - Use the new Edit action to update saved templates with new or revised settings.
- Standardized setup - Ensure consistency across workload submissions and align with organizational best practices.
Expanded workload priority management - Workload priority management has been extended with a new set of predefined priority values and broader configuration options. Researchers can now select a priority when submitting flexible workloads in the UI, in addition to the API and CLI. Administrators can update the default priority mapping for each workload type via the API, allowing platform-wide alignment with scheduling policies. These enhancements provide greater flexibility and control over how workloads are scheduled within a project. DGX Cloud only
New workload category feature for standardized workload classification - NVIDIA Run:ai now supports workload categories, enabling consistent classification of workloads based on their purpose - such as train, build, or deploy. Each workload type is automatically assigned a default category, which is visible in the Overview dashboard to improve filtering, grouping, and monitoring. Administrators can customize these default mappings using the API to align with organizational needs. DGX Cloud only
User-scoped credential management - Users can now create and manage their own credentials directly in the UI and API. These credentials are scoped to the individual user and can be securely referenced when configuring environment variables or setting up authenticated image pulls during flexible workload submission. DGX Cloud only
Built-in NVIDIA environments for NeMo, BioNeMo, and PyTorch - NVIDIA Run:ai now provides pre-configured NVIDIA environments for NeMo, BioNeMo, and PyTorch, enabling researchers to quickly launch workloads using optimized, ready-to-use frameworks. DGX Cloud only
Distributed inference support via API - The inference API now supports distributed inference by allowing submission of multi-node workloads with a leader worker set. This enables more advanced deployment patterns where a leader pod coordinates execution across multiple nodes, supporting scalable inference use cases. DGX Cloud only Experimental
New timeout controls for inference workloads - The inference workload submission in the UI now supports timeout parameters across all inference workload types. These parameters manage workload initialization and request handling, improving control over workload behavior and ensuring timely failure detection and response enforcement. UI support is available for all inference workload types, including custom inference (flexible submission), Hugging Face, and NVIDIA NIM. DGX Cloud only
UI support for inference policies - Administrators can now submit inference policies directly through the UI. These policies are reflected in the custom inference flexible submission form. When applied, a policy dynamically adjusts the interface by modifying card visibility, enabling or disabling fields, locking values, and enforcing defaults or value ranges.
Application-based API access for inference endpoints - Inference endpoints now support authentication using NVIDIA Run:ai user applications (OIDC clients). This enables secure, programmatic access to inference endpoints via API, particularly important when the endpoint is accessed externally from the cluster. To use this feature, configure the serving endpoint, authenticate using a token granted by a user application and make subsequent API requests to the inference endpoint.
Copy and edit NIM Inference workloads in the UI - You can now use the Make a Copy option to duplicate and modify existing NVIDIA NIM Inference workloads directly in the UI, making it easier to reuse and adapt workload configurations.
Enable replica capability for additional services - NVIDIA Run:ai now supports running multiple replicas for additional services to enable high availability. These services operate in hot-standby or leader election mode, ensuring fault tolerance by eliminating single points of failure without impacting system performance. DGX Cloud only
Support for custom sshd mount path for non-root MPI jobs - You can now define a custom sshd mount path when submitting MPI distributed training workloads as a non-root user using the flexible workload form in the UI, as well as the API and CLI. This ensures correct SSH communication between master and worker nodes by avoiding fallback to root’s SSH configuration. From cluster v2.21 onward
Component version updates – NVIDIA Run:ai now supports Knative version 1.18, OpenShift version 4.19, and Kubernetes version 1.33. Support for Kubernetes version 1.30 has been removed. DGX Cloud only

Resolved Bugs

Description

RUN-30861

Fixed an issue where pagination could fail due to insufficient timestamp precision.

RUN-30195

Fixed an issue where the project creation form displayed a generic error message without highlighting any invalid fields or showing error details.

RUN-29678

Fixed an issue where adding a master section to a policy that originally only had a worker section caused the agent to crash.

RUN-30112

Fixed an issue where the workspace policy failed to apply due to an error in the priorityClass field.

RUN-29319

Fixed an issue where a workspace with multiple mounts of the same PVC would fail.

RUN-29607

Fixed a security vulnerability in cluster-installer and go-operator related to CVE-2025-22869 with severity HIGH.

RUN-30634

Fixed a security vulnerability in cluster-installer related to CVE-2025-30204 with severity HIGH.

RUN-29845

Fixed a security vulnerability in cli-exposer, cluster-api, researcher-service, runai-cli, runaijob-controller, workload-controller, and workload-overseer related to CVE-2025-22874 with severity HIGH.

June 2025 Releases

June 30

Enhanced visibility into node pool configuration for projects and departments - The node pool details view for both projects and departments now includes additional scheduling and allocation fields: Priority and Max GPU devices allocation. For departments, Over-quota and Over-quota weight have also been added. These updates provide clearer visibility into how resources are configured and prioritized, helping Admins better understand scheduling behavior across node pools. See Projects and Departments for more details. From cluster v2.20 onward
Configurable MPI launcher start behavior for distributed training - You can now configure the MPI launcher in the UI to wait until all workers are ready before starting execution. This prevents failures caused by premature launcher execution and improves the stability of distributed workloads. See Train models using a distributed training workload for more details. DGX Cloud only
Improved compute resource configuration experience - The compute resource form has been enhanced to offer a more intuitive and streamlined experience. Users now benefit from clear summaries of total GPU resources requested, improved GPU memory configuration with validation, and contextual preemption alerts when request and limit values may trigger preemption. The layout also separates required fields from advanced options. See Compute resources for more details.
Multi-selection support when creating access rules - You can now select multiple subjects (users, applications, or SSO groups) and multiple scopes (projects, departments, and clusters) when creating a new access rule via the UI or API. A separate access rule is automatically created for each subject-scope combination. See Access rules for more details.
Support for exchanging external IdP tokens via API - The Tokens API now supports exchanging an access token issued by an external identity provider (IdP) for an NVIDIA Run:ai access token. This enables seamless integration with external authentication systems.
New timeout controls for inference workloads - The Inference API and CLI now support additional timeout parameters for managing workload initialization and request handling, initializationTimeoutSeconds and requestTimeoutSeconds. These settings improve control over workload behavior and help ensure timely failure detection and response enforcement.
Support for dedicated Prometheus in OpenShift deployments - Administrators can now install and configure their own Prometheus instance in OpenShift environments to be used by NVIDIA Run:ai. This provides greater flexibility and control over monitoring infrastructure in custom or enterprise-grade deployments. DGX Cloud only

Resolved Bugs

Description

RUN-30674

Fixed an issue where, on rare occasions, running the runai upgrade command deleted all files in the current directory.

RUN-30197

Fixed a security vulnerability in with stdlib package in go v1.24.2 with CVE-2025-22874 with severity HIGH.

RUN-29756

Fixed an issue where not all subjects were returned for each project or department.

RUN-29143

Fixed an issue where nodes could become unschedulable when workloads were submitted to a different node pool.

RUN-29712

Fixed an issue where workload name validation during submission in the UI took an unusually long time when checking for existing workloads with the same name.

RUN-28599

Fixed an issue where users with the L1 Researcher role were unable to create data volumes due to incorrect permissions.

RUN-29320

Fixed a typo in the documentation where the API key was incorrectly written as enforceRun:aiScheduler instead of the correct enforceRunaiScheduler.

RUN-29548

Fixed an issue in CLI v2 where the update server did not receive the terminal size during exec commands requiring TTY support. The terminal size is now set once upon session creation, ensuring proper behavior for interactive sessions.

RUN-29221

Fixed an issue where interactive login using the CLI application failed, preventing users from authenticating successfully.

RUN-27126

Fixed an issue where using -1 for compute resources in CLI submit commands resulted in successful submission, but the generated workload spec did not include a GPU field. The CLI now correctly rejects negative GPU values during submission.

RUN-24003

Fixed a security vulnerability in java-17-openjdk-headless related to CVE-2024-21147 with severity HIGH.

RUN-25281

Fixed an issue where deploying a Hugging Face model with vLLM using the Hugging Face UI submission form on an OpenShift environment failed due to permission errors.

RUN-26113

Fixed an issue where resetting a user's password, either by the user or an admin, did not revoke all active sessions.

RUN-27281

Added guidance for authenticating with the NGC registry using an NGC API key for NVIDIA NIM inference workloads. This resolves previous confusion around image pulls by including clear instructions in both the UI and documentation.

RUN-28248

Fixed an issue where specifying an incorrect image name for a workload resulted in no logs being displayed, instead of showing a relevant error message.

RUN-28558

Fixed an issue where memory usage alerts for NVIDIA Run:ai containers were based on container_memory_usage_bytes, which included evictable page cache memory. Alerts now use container_memory_working_set_bytes, which reflects only non-evictable memory, providing more accurate usage reporting.

RUN-28780

Fixed an issue where the Hugging Face validation incorrectly blocked some valid models that are supported by vLLM.

RUN-29282

Fixed a security vulnerability in github.com.golang.org.x.crypto related to CVE-2025-22869 with severity HIGH.

RUN-29241

Fixed an issue in distributed training workloads where using new PVCs resulted in mismatched claim names between the PVC resource and the pod template, causing workload failures. Also fixed a related issue where combining new and existing PVCs in the same workload caused incorrect claim name generation.

June 03

Product Enhancements

Project and department support in the CLI - The CLI now includes enhanced functionality for Admins to manage projects and departments. See CLI commands reference for more details:
- runai project create to create new projects
- runai departments list to view a list of departments
Expanded commands for inference workloads in the CLI - The CLI now supports additional container management commands for inference workloads, including logs, bash, exec, and port-forward. This update aligns inference support with other workload types and provides a more complete and consistent CLI experience for debugging and runtime interaction. See CLI commands reference for more details
Enhanced timeframe granularity for workload metrics - Added more predefined options for selecting workload metric timeframes (e.g., last minute, last 5 minutes, last hour), and improved the custom date range selector for greater flexibility. This allows users to analyze workload performance with finer granularity and more control over historical data views.
Configurable MPI launcher start behavior for distributed training - You can now configure the MPI launcher to wait until all workers are ready before starting execution. Available via the API and CLI, this prevents failures caused by premature launcher execution and improves the stability of distributed workloads. DGX Cloud only
Node selection controls via API and CLI - Gain more control over workload placement with new node selection capabilities. DGX Cloud only
- Node affinity via API - Define rules to target specific nodes based on hardware or configuration labels when submitting workloads through the API using nodeAffinityRequired.
- Exclude nodes via CLI - Use the --exclude-node flag to avoid specific nodes when submitting a workload through the CLI.
Replica support added for more services - Expanded replica configuration support to additional system services, improving scalability and reliability of the NVIDIA Run:ai cluster. This enhancement ensures greater consistency across the platform and strengthens resilience in high-availability environments. DGX Cloud only
New security settings API with scoped permissions - A dedicated API is now available for managing security settings, with its own scoped permissions separate from general settings. This separation enables more granular control over who can manage sensitive security-related configurations. From cluster v2.20 onward The API supports updates to the following configuration keys:
- autoRedirectSSO
- excludeGroupsFromToken
- browserSessionTimeout
- logoutRedirectUri
Support for 'None' authentication type in email notifications - The email notification settings now support an authentication type of None, allowing configuration of SMTP servers that do not require authentication.
Display of custom storage class fields in PVC creation - The UI now surfaces custom fields defined in storage class configurations when creating PVCs or volumes. This ensures users see all relevant configuration details up front, reducing setup errors and aligning storage choices with admin-defined policies. See Data sources for more details. DGX Cloud only
Swap and Node Level Scheduler controls for node pools - Admins can now configure swap settings per node pool via the API, including CPU swap memory size and reserved GPU memory for swap operations. Additionally, a new API option allows enabling or disabling the Node Level Scheduler, replacing the deprecated overProvisioningRatio field. These capabilities offer more fine-grained resource control and scheduling behavior per node pool. DGX Cloud only
Enhanced metric graphs for nodes - Enhanced metric graphs in the DETAILS tab for nodes by aligning these graphs with the dashboard and the Nodes API. As part of this improvement, the following columns have been removed from the nodes table:
- Used GPU memory
- GPU compute utilization
- GPU memory utilization
- Used CPU memory
- CPU compute utilization
- CPU memory utilization
- Used swap CPU memory
- Advanced metrics
Expanded pod telemetry for resource allocation - The node view now displays detailed allocation metrics for each pod. This added visibility helps users better understand how resources are distributed across running pods:
- Allocated GPUs
- Allocated GPU memory
- Allocated CPUs (cores)
- Allocated CPU memory
Expanded GPU telemetry for resource allocation - The node view now displays detailed allocation metrics for each GPU:
- Allocated compute
- Allocated memory

Resolved Bugs

Description

RUN-28931

Fixed an issue where the tenantId field was missing from the Audit log payload for the Tokens API.

RUN-29341

Fixed an issue which caused high CPU usage in the Cluster API.

RUN-27012

Fixed a security vulnerability in DOMPurify related to CVE-2024-24762 with severity HIGH.

RUN-26102

Fixed an issue where user sessions were not revoked after changing the user’s password via the UI, API, or CLI.

RUN-28283

Fixed an issue where the pods modal in inference workloads displayed an incorrect container image.

RUN-28727

Fixed an issue where, on rare occasions, the Get Projects operation failed when called via the API.

RUN-26361

Fixed an issue where Prometheus remote-write credentials were not properly updated on OpenShift clusters.

RUN-29323

Fixed an issue where Prometheus failed to send metrics for OpenShift.

RUN-28730

Fixed an issue where, during the Knative installation check, the Cluster API failed to detect the existence of HPA.

RUN-28957

Fixed an issue where the SAML Entity ID was not visible during SSO configuration.

RUN-28859

Fixed an issue where the knative.enable-scale-to-zero setting did not default to true as intended.

RUN-28851

Fixed an issue in CLI v2 where the port-forward command terminated SSH connections after 15–30 seconds due to an idle timeout.

RUN-28848

Fixed an issue where creating an asset with a duplicate name returned the wrong error code.

RUN-28755

Fixed an issue where the tooltip next to External URL for an inference serving endpoint incorrectly stated that the URL was internal.

RUN-28719

Fixed an issue where the ML Engineer role was incorrectly shown as lacking read access to data sources on the Roles page.

RUN-28608

Fixed an issue where users with the ML Engineer role were unable to delete multiple inference jobs at once.

RUN-28555

Fixed an issue in General settings where the "Disabled" workloads count displayed inconsistently between the collapsed and expanded views.

RUN-28550

Fixed an issue where users were still able to create HostPath data sources through inference workloads, even when policies should have restricted it.

RUN-28372

Fixed an issue where pagination in the CLI did not work as expected, preventing users from viewing paged results.

RUN-28286

Fixed an issue where CPU only workloads incorrectly triggered idle timeout notifications intended for GPU workloads.

RUN-27375

Fixed an issue where projects were not visible in the legacy job submission form, preventing users from selecting a target project.

RUN-28212

Fixed a security vulnerability in github.com.golang-jwt.jwt.v5 with CVE-2025-30204 with severity HIGH.

RUN-29093

Fixed an issue where rotating the runai-config webhook secret caused the app.kubernetes.io/managed-by=helm label to be removed.

May 2025 Releases

May 18

Product Enhancements

Bulk delete for access rules - Users can now select and delete multiple access rules at once, provided they have the necessary permissions for each rule.
GPU memory utilization metrics for projects and departments - Added new metrics that track GPU memory utilization at the project and department level. This enables more granular visibility into resource usage across organizational units, helping teams monitor consumption and optimize allocations.
Shared volume between projects - You can now create PVCs at the department or cluster scope with a shared volume accessible across projects. This enables more efficient storage utilization, especially in environments with limited quota space, and removes the need to over-provision volumes purely for performance gains. See Data sources for more details. DGX Cloud only

Resolved Bugs

Description

RUN-28665

Fixed an issue where using servingPort authorization fields in the Create an inference API on unsupported clusters did not return an error.

RUN-28266

Fixed an issue where the documentation examples for the runai workload delete CLI command were incorrect.

RUN-28717

Fixed an issue where the Update inference spec API documentation listed an incorrect response code.

RUN-28923

Fixed an issue where calling the Get node telemetry data API with the telemetryType IDLE_ALLOCATED_GPUS resulted in a 500 Internal Server Error.

RUN-28950

Fixed a security vulnerability in github.com/moby and github.com/docker/docker related to CVE-2024-41110 with severity Critical.

RUN-28762

Fixed an issue with inference workload ownership protection.

RUN-28626

Fixed an issue where columns representing average over-time resource usage appeared empty in the departments grid.

RUN-27521

Fixed an issue where disabling CPU quota in the General settings did not remove existing CPU quotas from projects and departments.

RUN-28006

Fixed an issue where tokens became invalid for the API server after one hour.

RUN-28832

Fixed inference CLI v2 documentation with examples that reflect correct usage.

RUN-28311

Fixed an issue where user creation failed with a duplicate email error, even though the email address did not exist in the system.

RUN-27423

Fixed an issue where missing fields in the grantType request body did not return a proper error. The Tokens API now responds with a 400 Bad Request when required fields are missing.

RUN-28213

Fixed a security vulnerability in github.com.golang.org.x.crypto related to CVE-2025-22869 with severity HIGH.

RUN-27986

Fixed an issue where leaving the Overview dashboard open for an extended period caused the Grafana session cookie to expire without refreshing, resulting in the dashboard becoming unavailable.

RUN-27955

Fixed an issue where the option to create a new host path data source was incorrectly available during inference workload submission, even when the policy did not allow it.

RUN-27638

Fixed a security vulnerability in axios related to CVE-2025-27152 with severity HIGH.

RUN-27514

Fixed an issue with incorrect calculation of the ALLOCATED_CPU_MEMORY_BYTES telemetry metric.

RUN-27422

Fixed an issue where deleting a node type did not trigger project updates in the cluster.

May 04

Product Enhancements

Resource flag support for master pod in distributed training - Added new flags in the CLI to specify CPU and memory resources for the master pod in distributed training, including options to set CPU core limits, CPU core requests, memory limits, and memory requests. See CLI commands reference for more details.
ConfigMap subpath support - Added support for the subpath parameter in ConfigMap mounting, allowing customers to use different paths within the volume instead of just the root. This is supported across all workloads that use flexible submission in the UI, and is also available via the API and CLI. DGX Cloud only
Suspend and resume actions for multiple workloads - You can now suspend or resume multiple workloads at once using the multi-select option in the UI, making it faster and easier to manage large sets of jobs. From cluster v2.18 onward
Pod deletion policy for terminal distributed training workloads - You can now specify which pods should be deleted when a distributed training workload reaches a terminal state (completed/failed) in the UI. This enhancement provides greater control over resource cleanup and helps maintain a more organized and efficient cluster environment. See Train models using a distributed training workload for more details. From cluster v2.20 onward
Pagination support in the CLI - The CLI now supports pagination for list commands (workloads, projects, nodes, node pools, and PVCs). You can control the number of results per page using --page-size, limit the total number of results with --max-items, and retrieve additional pages using --next-token. To disable pagination and retrieve a single page of results, use the --no-pagination flag. See CLI commands reference for more details.
Automatic CLI version updates - A new --auto-update flag has been added to the config set CLI command allowing you to enable automatic version updates. This ensures you're always using the latest CLI features and fixes without needing manual upgrades. See CLI commands reference for more details.

Resolved Bugs

Description

RUN-27837

Fixed an issue where a node pool’s placement strategy stopped functioning correctly after being edited.

RUN-28258

Fixed an issue where the nodes grid displayed undefined values in MNNVL columns.

RUN-28097

Fixed an issue where the allocated_gpu_count_per_gpu metric displayed incorrect data for fractional pods.

RUN-26359

Fixed an issue in CLI v2 where using the --toleration option required incorrect mandatory fields.

RUN-26608

Fixed an issue by adding a flag to the cli config set command and the CLI install script, allowing users to set a cache directory.

RUN-27484

Fixed an issue where duplicate app.kubernetes.io/name labels were applied to services in the control plane Helm chart.

RUN-27247

Fixed security vulnerabilities in Spring framework used by db-mechanic service - CVE-2021-27568, CVE-2021-44228, CVE-2022-22965, CVE-2023-20873, CVE-2024-22243, CVE-2024-22259 and CVE-2024-22262.

RUN-27309

Fixed an issue where workloads configured with a multi node pool setup could fail to schedule on a specific node pool in the future after an initial scheduling failure, even if sufficient resources later became available.

RUN-27515

Fixed an issue where users were unable to use assets from an upper scope during flexible workload submissions.

RUN-27826

Fixed an issue where the runai inference update command could result in a failure to update the workload. Although the command itself succeeded (since the update is asynchronous), the update often failed, and the new spec was not applied.

RUN-27915

Fixed an issue where the "Improved Command Line Interface" admin setting was incorrectly labeled as Beta instead of Stable.

RUN-27251

Fixed a security vulnerability in knative.dev/serving with CVE-2023-48713 with severity MEDIUM.

Fixed a security vulnerability in golang.org.x.net with CVE-2025-22872 with severity MEDIUM.

Fixed a security vulnerability in github.com.golang-jwt.jwt.v4 and github.com.golang-jwt.jwt.v5 with CVE-2025-30204 with severity HIGH.

April 2025 Releases

April 20

Product Enhancements

New gRPC option to NIM workloads - You can now select gRPC as a protocol when submitting inference workloads through the NVIDIA NIM form, enabling more flexible communication with inference servers. See Deploy inference workloads with NVIDIA NIM for more details.
Departments enabled by default - To simplify onboarding and standardize tenant structure, all new tenants will now include a default department. The Departments setting flag has been removed from the Admin UI. See Departments for more details.
Pending time visibility for workloads - The Workloads API and UI now display total pending time, which represents the cumulative duration a workload spent in Pending state. This helps Admins assess resource demands for specific projects or departments.

Resolved Bugs

Description

RUN-27485

Fixed an issue where users with the ML Engineer role were unable to submit inference workloads due to insufficient permissions.

RUN-27497

Fixed an issue where, after deleting an SSO user and immediately creating a local user, the delete confirmation dialog reappeared unexpectedly.

RUN-27008

Increased the range of generated reports to 31 days.

RUN-27502

Fixed the inference CLI commands documentation: --max-replicas and --min-replicas were incorrectly used instead of --max-scale and --min-scale.

RUN-27520

Fixed an issue where adding access rules immediately after creating an application did not refresh the access rules table.

RUN-26754

Fixed an issue where workload submission requests to the API did not apply UID and GID from the token when uidGidSource was set to fromIdpToken.

RUN-26953

Fixed an issue where OIDC client ID and password values containing spaces were allowed.

RUN-27264

Fixed an issue where creating a project from the UI with a non-unlimited deserved CPU value caused the queue to be created with limit = deserved instead of unlimited.

RUN-27246

Fixed an issue by adding clientID=cli to the CLI exchange request in Tokens API to ensure proper authentication flow.

RUN-26989

Fixed an issue that prevented reordering node pools in the workload submission form.

RUN-26992

Fixed an issue where workloads submitted with an invalid node port range would get stuck in Creating status.

RUN-26433

Fixed an issue where invalid GrantType values in Tokens API requests returned unclear error messages.

April 06

Product Enhancements

See the What's new in version 2.21 for the full list of new features.

Resolved bugs

Description

RUN-27088

Fixed a security vulnerability in tar-fs related to CVE-2024-12905 with severity HIGH.

RUN-26464

Fixed an issue where fields and values associated with a selected storage class were not disabled as expected.

RUN-26690

Fixed an issue where the Run:ai logs view displayed both loading and empty states.

RUN-27229

Fixed a security vulnerability in github.com.opencontainers.runc related to CVE-2024-21626 with severity HIGH.

RUN-27308

Fixed an issue where the API documentation did not include the return codes for duplicate project or department creation.

RUN-27219

Fixed an issue where project creation failed if a quota was set for CPU but not for memory.

RUN-27210

Fixed an issue that occurred when submitting a workload with multiple ConfigMap data storage entries.

RUN-26671

Fixed an issue where compute resources configured with multiple whole GPUs (e.g., 3 GPUs at 100%) were incorrectly submitted as a single GPU.

RUN-27120

Fixed an issue where copying a workload that a user was not authorized to access incorrectly granted them access to the serving port in the copied workload.

RUN-26753

Fixed an issue where creating a department or project scoped ConfigMap using the name of an existing cluster-wide ConfigMap resulted in an incorrect status

RUN-26386

Fixed an issue where inconsistent behavior occurred during project creation when configuring GPU resources with limit=null, overQuotaPriority=null, and deserved=0.

RUN-27075

Fixed an issue where, in some cases, creating a project through the API with partial parameters would return an error when the "Limit projects from exceeding department quota" setting was enabled.

RUN-26120

Fixed an issue where the metrics service could get stuck while processing reports under certain conditions.

RUN-26861

Fixed an issue where the create workload page could remain stuck in a loading state due to pending cluster details.

RUN-26602

Fixed an issue where multiple workloads with the same name could be created via the UI, eventually leading to workload failures.

RUN-26892

Fixed an issue where the inference serving endpoint did not display the connection protocol and container port in the UI.

RUN-24579

Fixed an issue where the pubsub package failed to connect if the pubsub server (Redis/NATS) was not yet deployed.

RUN-26691

Fixed a security vulnerability in axios related to CVE-2025-27152 with severity HIGH.

RUN-26324

Fixed an issue in the documentation where the toleration name was incorrectly marked as mandatory. Also fixed an issue in CLI v2 where the required fields were incorrect: name is no longer mandatory, and key is now required.

RUN-26410

Fixed an issue in the POST /api/v1/workloads/trainings API where, if Completions was set but Parallelism was not specified, the response returned Parallelism as null instead of the expected default value 1.

RUN-26955

Fixed an issue where duplicate results appeared in some cases for node metrics.

RUN-26764

Fixed an issue where in some cases, a node pool was stuck in "Creating" phase.

RUN-26772

Fixed an issue where a GET request for a non-existent workload returned an unexpected response format.

RUN-26641

Fixed an issue where CLI usage could be blocked even when the CLI and control plane version were aligned.

RUN-27041

Fixed an issue where Hugging Face inference workloads could not be submitted via the UI due to an error in the General section.

March 2025 Releases

March 16

Resolved Bugs

Description

RUN-26686

Fixed an issue where workload names exceeding 50 characters caused failures due to Kubernetes label length constraints (max 63 characters).

RUN-26272

Fixed an issue where connecting to the SMTP server without credentials was not allowed.

RUN-26659

Fixed an issue where deleting the node pool did not remove it from the default node pools list.

RUN-26630

Fixed an issue that prevented updating tenant-scoped data sources.

RUN-25769

Fixed an issue where unusual text appeared at the end of each line when using the --help option for the runai inference submit --help command.

RUN-25918

Fixed an issue where the Running/Requested Pods column in the workload list displayed 1/0 instead of the correct format (1/1-3) for inference and other workload types that support minimum and maximum requested pods in the runai workloads list command.

RUN-26473

Fixed an issue where removing labels and annotations from a workload created using "Copy & Edit" did not properly remove them.

RUN-26624

Fixed an issue which caused workloads to fail if both gpuPortionRequest and gpuPortionLimit were set to 1 (100%).

RUN-26270

Fixed an issue in SSO SAML where the Entity ID field had a different value before and after configuring SAML.

RUN-26240

CLI v2: Fixed an issue in the install script, where setting the install path environment variable did not install all the files in the correct path.

RUN-26479

CLI v2: Fixed an issue where using the wrong workload type in the workload describe command did not display an error.

RUN-26345

CLI v2: Added UIDGIDSOURCE_CUSTOM when SupplementalGroups is set.

March 05

Product Enhancements

Added functionality to verify the proper installation of Knative. The UI and API will reflect the status of various features based on their current state in Knative.
Added the NVIDIA logo to the platform, including the login page and other general areas.
Audit log: Only users with tenant-wide permissions now have the ability to access audit logs, ensuring proper access control and data security.
CLI v2: Users will be able to submit workloads and map secrets to volumes using the --secret-volume flag. This feature is applicable for all workload types - workspaces, training, and inference.

Resolved Bugs

Description

RUN-26310

Fixed an issue where Docker registry credentials/secrets were not found when adding environment variables.

RUN-26253

CLI v2 list project now supports limit and offset flags.

RUN-25382

Fixed an issue where invalid min/max policy values caused an error in the policy pod.

RUN-26135

Fixed an issue which prevented enabling/disabling email notifications.

RUN-25131

Fixed an issue where authentication failures in the Grafana proxy incorrectly returned a 401 error causing users to be signed out of the UI.

RUN-26248

CLI v2: Fixed an issue where submitting an interactive workload with --attach was not possible after the workload started running.

RUN-25982

CLI v2: Fixed an issue where interactive mode did not return an error for invalid control plane/Authentication URLs and timeout duration.

RUN-26356

Fixed an issue where Lowest for over quota weight did not appear as 0.

RUN-26249

Fixed an issue where creating a policy with the fields tty and stdin resulted in a validation error.

RUN-26178

Fixed an issue where the upgrade to 2.20 failed to migrate departments and projects if the job to validate the default department to clusters ran first.

RUN-25895

Fixed an issue where projects that were updated due to changes in their department override fields were not updated in the cluster.

RUN-26152

GET API for retrieving Workspaces, Trainings, and Inferences by ID returns deleted items.

RUN-25987

Updated all workload APIs to accurately reflect that both creating and deleting workloads return a 202 status code in the API documentation.

RUN-25984

Added a validation message to api/v1/me/password.

RUN-26062

Fixed an issue where a new API, intended for clusters running version 2.18 and above, was not disabled for older clusters, causing unintended workload operations — such as creation, deletion, resumption, or stoppage — after upgrading from versions below 2.18 to 2.18 or higher.

February 2025 Releases

February 16

Product Enhancements

NIM and model store: UX improvements
New functionalities added for CLI v2:
- Allow users to list all available Persistent Volume Claims (PVCs) when submitting workloads. This enhancement simplifies the selection process for appropriate PVCs, making workload submission more efficient.
- Enable users to display the config file in multiple formats. The available options are:
  - --json: Output structure in JSON format
  - --yaml: Output structure in YAML format

Resolved Bugs

Description

RUN-25974

Fixed an issue where using filters in the Quota management dashboard was not working properly.

RUN-25969

Fixed an issue where the UI incorrectly rejected valid toleration key inputs during validation checks.

RUN-25946

Fixed an issue where the Update Inference Spec API did not enforce a minimum cluster version returning a 400 Bad Request for versions below 2.19.

RUN-25921

Fixed an issue where the Workspaces, Trainings and Distributed APIs did not enforce a minimum cluster version returning a 400 Bad Request for versions below 2.18.

RUN-25249

Fixed an issue where submitting a workload using a yaml file with a port but without service type would use ClusterIP as the default service type. If no host port is provided, the target port will be used as the host.

RUN-25269

Fixed an issue where the Pods modal was not paginated, limiting the display to only 50 records.

RUN-25466

Fixed an issue where an environment variable with the value SECRET was not valid as only SECRET:xxx was accepted.

RUN-23048

Improved error handling to display meaningful messages from the CLI upgrade command.

RUN-25552

Fixed an issue where clicking on "View Access Rules" in the Users table displayed only the first group if a user belonged to multiple groups.

RUN-25558

Fixed a memory issue when handling external workloads (deployments, ray etc.) which when scaled caused ETCD memory to increase.

RUN-25659

CLI v2: Fixed an issue where min and max replicas were submitted using TensorFlow.

February 02

Product Enhancements

Workload Events API, /api/v1/workloads/{workloadId}/events, now supports the sort order parameter (asc, desc).
MIG profile and MIG options are now marked as deprecated in CLI v2, following the deprecation notice in the last version.
As part of inference support in CLI v2, Knative readiness is now validated on submit requests.
Improved permission error messaging when attempting to delete a user with higher privileges.
Improved visibility of metrics in the Resources utilization widget by repositioning them above the graphs.
Added a new Idle workloads table widget to help users easily identify and manage underutilized resources.
Renamed and updated the "Workloads by type" widget to provide clearer insights into cluster usage with a focus on workloads.
Improved user experience by moving the date picker to a dedicated section within the overtime widgets, Resources allocation and Resources utilization.
Simplified configuration by enabling auto-creation of storage class for discovered storage classes.
Enhanced PVC underlying storage configuration by specifying allowed context for the selected storage (Workload Volume, PVC, both, or neither).
Added configurable grace period for workload preemption in CLI v2.

Resolved Bugs

Description

RUN-24838

Fixed an issue where an environment asset could not be created if it included an environment variable with no value specified.

RUN-25031

Fixed an issue in the Templates form where existing credentials in the environment variables section were not displayed.

RUN-25303

Fixed an issue where submitting with the --attach flag was supported only in a workspace workload.

RUN-24354

Fixed an issue where migrating workloads failed due to slow network connection.

RUN-25220

CLI v2: Changed --image flag from a required field to an optional one.

RUN-25290

Fixed a security vulnerability in golang.org/x/net v0.33.0 with CVE-2024-45338 with severity HIGH.

RUN-24688

Fixed an issue that blocked the Create Template submission due to a server error. This occurred when using the Copy & Edit Template form.

RUN-25511

Fixed an issue where deleting a workload in the CLI v2 caused an error due to a missing response body. The CLI now correctly receives and handles the expected response body.

PreviousOverview NextInstallation

Last updated 2 days ago