What's New

This section includes release information for the self-hosted version of NVIDIA Run:ai:

  • New Features and Enhancements – Highlights major updates introduced in each version, including new capabilities, UI improvements, and changes to system behavior.

  • Hotfixes – Lists patches applied to released versions, including critical fixes and behavior corrections.

Note

See our Product version life cycle for a list of supported versions and their respective support timelines.

Feature Life Cycle

NVIDIA Run:ai uses life cycle labels to indicate the maturity and stability of features across releases:

  • Experimental - This feature is in early development. It may not be stable and could be removed or changed significantly in future versions. Use with caution.

  • Beta - This feature is still being developed for official release in a future version and may have some limitations. Use with caution.

  • Legacy - This feature is scheduled to be removed in future versions. We recommend using alternatives if available. Use only if necessary.


Uninstall

Uninstall the Control Plane

To delete the control plane, run:

helm uninstall runai-backend -n runai-backend

Uninstall the Cluster
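The cluster itself is removed by uninstalling its Helm release. A minimal sketch, assuming the default release name runai-cluster and the runai namespace used by the standard installation instructions:

helm uninstall runai-cluster -n runai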

Installation

NVIDIA Run:ai Components

As part of the installation process, you will install:

  • A control plane managing the cluster(s)

  • One or more clusters

Both the control plane and clusters require Kubernetes. Typically, the control plane and first cluster are installed on the same Kubernetes cluster.

Installation Types

The self-hosted option is for organizations that cannot use a SaaS solution due to data leakage concerns. NVIDIA Run:ai self-hosting comes with two variants:

  • Connected – The organization can freely download from the internet (though upload is not allowed)

  • Air-gapped – The organization has no connection to the internet

Authentication and Authorization

NVIDIA Run:ai authentication and authorization enables a streamlined experience for the user with precise controls covering the data each user can see and the actions each user can perform in the NVIDIA Run:ai platform.

Authentication verifies user identity during login, and authorization assigns the user specific permissions according to the assigned access rules.

Authenticated access is required to use all aspects of the NVIDIA Run:ai interfaces, including the NVIDIA Run:ai platform, the NVIDIA Run:ai Command Line Interface (CLI) and APIs.

Authentication

There are multiple methods to authenticate and access NVIDIA Run:ai.

Single Sign-On (SSO)

NVIDIA Run:ai supports three methods to set up SSO:

  • SAML

  • OpenID Connect (OIDC)

  • OpenShift

When using SSO, it is highly recommended to manage at least one local user, as a breakglass account (an emergency account), in case access to SSO is not possible.

Username and Password

Username and password access can be used when SSO integration is not possible.

Secret Key (for Application Programmatic Access)

Secret key is the authentication method for applications. Applications use the NVIDIA Run:ai APIs to perform automated tasks, including scripts and pipelines, based on their assigned access rules.

Authorization

The NVIDIA Run:ai platform uses Role-Based Access Control (RBAC) to manage authorization. Once a user or an application is authenticated, they can perform actions according to their assigned access rules.

Role Based Access Control (RBAC) in NVIDIA Run:ai

While Kubernetes RBAC is limited to a single cluster, NVIDIA Run:ai expands the scope of Kubernetes RBAC, making it easy for administrators to manage access rules across multiple clusters.

RBAC at NVIDIA Run:ai is configured using access rules. An access rule is the assignment of a role to a subject in a scope: <Subject> is a <Role> in a <Scope>.

  • Subject

    • A user, a group, or an application assigned with the role

  • Role

    • A set of permissions that can be assigned to subjects. Roles at NVIDIA Run:ai are system defined and cannot be created, edited or deleted.

    • A permission is a set of actions (view, edit, create and delete) over a NVIDIA Run:ai entity (e.g. projects, workloads, users). For example, a role might allow a user to create and read Projects, but not update or delete them.

  • Scope

    • A scope is part of an organization in which a set of permissions (roles) is effective. Scopes include Projects, Departments, Clusters, Account (all clusters).

Below is an example of an access rule: [email protected] is a Department admin in Department: A

User Applications

This article explains the procedure to create your own user applications.

Applications are used for API integrations with NVIDIA Run:ai. An application contains a client ID and a client secret. With the client credentials, you can obtain a token as detailed in API authentication and use it within subsequent API calls.
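For illustration, a hedged sketch of the token exchange with curl; the endpoint path and request fields below are assumptions and should be verified against the API authentication reference:

# Exchange the application's client credentials for an API token
# (endpoint and field names are illustrative assumptions -- verify in the API reference).
curl -sS -X POST "https://<control-plane-domain>/api/v1/token" \
  -H "Content-Type: application/json" \
  -d '{"grantType": "app_token", "AppId": "<client-id>", "AppSecret": "<client-secret>"}'

# Use the returned access token as a Bearer token in subsequent API calls, for example:
curl -sS "https://<control-plane-domain>/api/v1/clusters" \
  -H "Authorization: Bearer <access-token>"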

Notes

  • All clusters in the tenant must be version 2.20 or later.

  • The token obtained through user applications assumes the roles and permissions of the user.

Creating an Application

To create an application:

  1. Click the user avatar at the top right corner, then select Settings

  2. Click +APPLICATION

  3. Enter the application’s name

  4. Click CREATE

  5. Copy the Client ID and Client secret and store securely

  6. Click DONE

You can create up to 20 user applications.

Note

The client secret is visible only at the time of creation. It cannot be recovered but can be regenerated.

Regenerating a Client Secret

To regenerate a client secret:

  1. Locate the application whose client secret you want to regenerate

  2. Click Regenerate client secret

  3. Click REGENERATE

  4. Copy the New client secret and store it securely

  5. Click DONE

Note

Regenerating a client secret revokes the previous one.

Deleting an Application

  1. Locate the application you want to delete

  2. Click on the trash icon

  3. On the dialog, click DELETE to confirm

Using API

Go to the API reference to view the available actions.

Service Mesh

NVIDIA Run:ai supports service mesh implementations. When a service mesh is deployed with sidecar injection, specific configurations must be applied to ensure compatibility with NVIDIA Run:ai. This document outlines the required changes for the NVIDIA Run:ai control plane and cluster.

Control Plane Configuration

Note

This section applies to self-hosted deployments only.

By default, NVIDIA Run:ai prevents Istio from injecting sidecar containers into system jobs in the control plane. For other service mesh solutions, users must manually add annotations during installation.

To disable sidecar injection in the NVIDIA Run:ai control plane, modify the Helm values file by adding the required pod labels to the following components. See Advanced control plane configurations for more details.

Example for Open Service Mesh:

authorizationMigrator:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
clusterMigrator:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
identityProviderReconciler:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
keepPVC:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
orgUnitsMigrator:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled

Cluster Configuration

Installation Phase

Sidecar containers injected by some service mesh solutions can prevent NVIDIA Run:ai installation hooks from completing. To avoid this, modify the Helm installation command to include the required labels or annotations:

helm upgrade -i ... \
--set global.additionalJobLabels.A=B --set global.additionalJobAnnotations.A=B

Example for Istio Service Mesh:

helm upgrade -i ... \
--set-json global.additionalJobLabels='{"sidecar.istio.io/inject":"false"}'

Workloads

To prevent sidecar injection in workloads created at runtime (such as training workloads), update the runaiconfig resource. See Advanced cluster configurations for more details:

spec:
  workload-controller:
    additionalPodLabels:
      sidecar.istio.io/inject: "false"

Monitoring and Maintenance

Deploying NVIDIA Run:ai in mission-critical environments requires proper monitoring and maintenance of resources to ensure workloads run and are deployed as expected.

Details on how to monitor different parts of the physical resources in your Kubernetes system, including clusters and nodes, can be found in the monitoring and maintenance section. Adjacent configuration and troubleshooting sections also cover high availability, restoring and securing clusters, collecting logs, and reviewing audit logs to meet compliance requirements.

In addition to monitoring NVIDIA Run:ai resources, it is also highly recommended to monitor the Kubernetes environment that NVIDIA Run:ai runs on, since Kubernetes manages the containerized applications. In particular, focus on three main layers:

NVIDIA Run:ai Control Plane and Cluster Services

This is the highest layer and includes the NVIDIA Run:ai pods, which run in containers managed by Kubernetes.

Kubernetes Cluster

This layer includes the main Kubernetes system that runs and manages NVIDIA Run:ai components. Important elements to monitor include:

  • The health of the cluster and nodes (machines in the cluster).

  • The status of key Kubernetes services, such as the API server. For detailed information on managing clusters, see the official Kubernetes documentation.

Host Infrastructure

This is the base layer, representing the actual machines (virtual or physical) that make up the cluster. IT teams need to handle:

  • Managing CPU, memory, and storage

  • Keeping the operating system updated

  • Setting up the network and balancing the load

NVIDIA Run:ai does not require any special configurations at this level.

The articles below explain how to monitor these layers, maintain system security and compliance, and ensure the reliable operation of NVIDIA Run:ai in critical environments.

Shared Storage

Shared storage is a critical component in AI and machine learning workflows, particularly in scenarios involving distributed training and shared datasets. In AI and ML environments, data must be readily accessible across multiple nodes, especially when training large models or working with vast datasets. Shared storage enables seamless access to data, ensuring that all nodes in a distributed training setup can read and write to the same datasets simultaneously. This setup not only enhances efficiency but is also crucial for maintaining consistency and speed in high-performance computing environments.

While NVIDIA Run:ai Platform supports a variety of remote data sources, such as Git and S3, it is often more efficient to keep data close to the compute resources. This proximity is typically achieved through the use of shared storage, accessible to multiple nodes in your Kubernetes cluster.

Shared Storage in Kubernetes

When implementing shared storage in Kubernetes, there are two primary approaches:

  • Utilizing the Kubernetes Storage Classes of your storage provider (Recommended)

  • Using a direct NFS (Network File System) mount

NVIDIA Run:ai supports both direct NFS mounts and Kubernetes Storage Classes.

Kubernetes Storage Classes

Storage classes in Kubernetes define how storage is provisioned and managed. This allows you to select storage types optimized for AI workloads. For example, you can choose storage with high IOPS (Input/Output Operations Per Second) for rapid data access during intensive training sessions, or tiered storage options to balance cost and performance based on your organization’s requirements. This approach supports dynamic provisioning, enabling storage to be allocated on-demand as required by your applications.

NVIDIA Run:ai data sources such as Persistent Volume Claims (PVC) and Data Volumes leverage storage classes to manage and allocate storage efficiently. This ensures that the most suitable storage option is always accessible, contributing to the efficiency and performance of AI workloads.

Note

NVIDIA Run:ai lists all available storage classes in the Kubernetes cluster, making it easy for users to select the appropriate storage. Additionally, policies can be set to restrict or enforce the use of specific storage classes, to help maintain compliance with organizational standards and optimize resource utilization.
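For illustration, a minimal sketch of a PersistentVolumeClaim that requests shared storage from a storage class; the class name, namespace, and size are placeholders, not NVIDIA Run:ai defaults:

# Request shared storage from a storage class (placeholder names and size).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-datasets
  namespace: runai-team-a        # placeholder project namespace
spec:
  accessModes:
    - ReadWriteMany              # shared read/write access across nodes
  storageClassName: fast-shared  # placeholder storage class name
  resources:
    requests:
      storage: 500Gi
EOF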

Direct NFS Mount

Direct NFS allows you to mount a shared file system directly across multiple nodes in your Kubernetes cluster. This method provides a straightforward way to share data among nodes and is often used for simple setups or when a dedicated NFS server is available.

However, using NFS can present challenges related to security and control. Direct NFS setups might lack the fine-grained control and security features available with storage classes.
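For comparison, a minimal sketch of exposing an existing NFS export as a PersistentVolume; the server address and export path are placeholders:

# Expose an existing NFS export as a cluster-wide PersistentVolume (placeholder values).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-shared-datasets
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs.example.internal   # placeholder NFS server
    path: /exports/datasets        # placeholder export path
EOF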

Cluster Restore

This section explains how to restore a NVIDIA Run:ai cluster on a different Kubernetes environment.

In the event of a critical Kubernetes failure, or if you want to migrate a NVIDIA Run:ai cluster to a new Kubernetes environment, simply reinstall the NVIDIA Run:ai cluster. Once you have reinstalled and reconnected the cluster, projects, workloads and other cluster data are synced automatically.

The restoration or back-up of NVIDIA Run:ai cluster advanced features and customized deployment configurations, which are stored locally on the Kubernetes cluster, is optional; they can be restored and backed up separately.

Backup

As back-up of data is not required, the backup procedure is optional for advanced deployments, as explained above.

Backup Cluster Configurations

To backup NVIDIA Run:ai cluster configurations:

  1. Run the following command in your terminal:

    kubectl get runaiconfig runai -n runai -o yaml -o=jsonpath='{.spec}' > runaiconfig_backup.yaml

  2. Once the runaiconfig_backup.yaml back-up file is created, save the file externally so that it can be retrieved later.

Restore

Follow the steps below to restore the NVIDIA Run:ai cluster on a new Kubernetes environment.

Prerequisites

Before restoring the NVIDIA Run:ai cluster, it is essential to validate that it is both disconnected and uninstalled.

  1. If the Kubernetes cluster is still available, uninstall the NVIDIA Run:ai cluster - make sure not to remove the cluster from the Control Plane

  2. Navigate to the Cluster page in the NVIDIA Run:ai platform

  3. Search for the cluster, and make sure its status is Disconnected

Re-installing NVIDIA Run:ai Cluster

  1. Follow the NVIDIA Run:ai cluster installation instructions and ensure all prerequisites are met

  2. If you have a back-up of the cluster configurations, reload it once the installation is complete:

    kubectl apply -f runaiconfig_backup.yaml -n runai

  3. Navigate to the Cluster page in the NVIDIA Run:ai platform

  4. Search for the cluster, and make sure its status is Connected


Interworking with Karpenter

Karpenter is an open-source, Kubernetes cluster autoscaler built for cloud deployments. Karpenter optimizes the cloud cost of a customer’s cluster by moving workloads between different node types, consolidating workloads into fewer nodes, using lower-cost nodes where possible, scaling up new nodes when needed, and shutting down unused nodes.

Karpenter’s main goal is cost optimization. Unlike Karpenter, NVIDIA Run:ai’s Scheduler optimizes for fairness and resource utilization. Therefore, there are a few potential friction points when using both on the same cluster.

Friction Points Using Karpenter with NVIDIA Run:ai

  1. Karpenter looks for “unschedulable” pending workloads and may try to scale up new nodes to make those workloads schedulable. However, in some scenarios, these workloads may exceed their quota parameters, and the NVIDIA Run:ai Scheduler will put them into a pending state.

  2. Karpenter is not aware of the NVIDIA Run:ai fractions mechanism and may try to interfere incorrectly.

  3. Karpenter preempts any type of workload (i.e., high-priority, non-preemptible workloads will potentially be interrupted and moved to save cost).

  4. Karpenter has no pod-group (i.e., workload) notion or gang scheduling awareness, meaning that Karpenter is unaware that a set of “arbitrary” pods is a single workload. This may cause Karpenter to schedule those pods into different node pools (in the case of multi-node-pool workloads) or scale up or down a mix of wrong nodes.

Mitigating the Friction Points

NVIDIA Run:ai Scheduler mitigates the friction points using the following techniques (each numbered bullet below corresponds to the related friction point listed above):

  1. Karpenter uses a “nominated node” to recommend a node for the Scheduler. The NVIDIA Run:ai Scheduler treats this as a “preferred” recommendation, meaning it will try to use this node, but it’s not required and it may choose another node.

  2. Fractions - Karpenter won’t consolidate nodes with one or more pods that cannot be moved. The NVIDIA Run:ai reservation pod is marked as ‘do not evict’ to allow the NVIDIA Run:ai Scheduler to control the scheduling of fractions.

  3. Non-preemptible workloads - NVIDIA Run:ai marks non-preemptible workloads as ‘do not evict’ and Karpenter respects this annotation.

  4. NVIDIA Run:ai node pools (single-node-pool workloads) - Karpenter respects the ‘node affinity’ that NVIDIA Run:ai sets on a pod, so Karpenter uses the node affinity for its recommended node. For the gang-scheduling/pod-group (workload) notion, NVIDIA Run:ai Scheduler considers Karpenter directives as preferred recommendations rather than mandatory instructions and overrides Karpenter instructions where appropriate.

Deployment Considerations

  • Using multi-node-pool workloads

    • Workloads may include a list of optional node pools. Karpenter is not aware that only a single node pool should be selected out of that list for the workload. It may therefore recommend putting pods of the same workload into different node pools and may scale up nodes from different node pools to serve a “multi-node-pool” workload instead of nodes on the selected single node pool.

    • If this becomes an issue (i.e., if Karpenter scales up the wrong node types), users can set an inter-pod affinity using the node pool label or another common label as a ‘topology’ identifier. This will force Karpenter to choose nodes from a single-node pool per workload, selecting from any of the node pools listed as allowed by the workload.

    • An alternative approach is to use a single-node pool for each workload instead of multi-node pools.

  • Consolidation

    • To make Karpenter more effective when using its consolidation function, users should consider separating preemptible and non-preemptible workloads, either by using node pools, node affinities, taint/tolerations, or inter-pod anti-affinity.

    • If users don’t separate preemptible and non-preemptible workloads (i.e., make them run on different nodes), Karpenter’s ability to consolidate (bin-pack) and shut down nodes will be reduced, but it is still effective.

  • Conflicts between bin-packing and spread policies

    • If NVIDIA Run:ai is used with a scheduling spread policy, it will clash with Karpenter’s default bin-packing/consolidation policy, and the outcome may be a deployment that is not optimized for either of these policies.

    • Usually spread is used for Inference, which is non-preemptible and therefore not controlled by Karpenter (NVIDIA Run:ai Scheduler will mark those workloads as ‘do not evict’ for Karpenter), so this should not present a real deployment issue for customers.

Secure Your Cluster

This section details the security considerations for deploying NVIDIA Run:ai. It is intended to help administrators and security officers understand the specific permissions required by NVIDIA Run:ai.

Access to the Kubernetes Cluster

NVIDIA Run:ai integrates with Kubernetes clusters and requires specific permissions to operate successfully. These permissions are controlled with configuration flags that dictate how NVIDIA Run:ai interacts with cluster resources. Prior to installation, security teams can review the permissions and ensure they align with their organization’s policies.

Permissions and their Related Use Case

NVIDIA Run:ai provides various security-related permissions that can be customized to fit specific organizational needs. Below are brief descriptions of the key use cases for these customizations:

  • Automatic Namespace creation – Controls whether NVIDIA Run:ai automatically creates Kubernetes namespaces when new projects are created. Useful in environments where namespace creation must be strictly managed.

  • Automatic user assignment – Decides if users are automatically assigned to projects within NVIDIA Run:ai. Helps manage user access more tightly in certain compliance-driven environments.

  • Secret propagation – Determines whether NVIDIA Run:ai should propagate secrets across the cluster. Relevant for organizations with specific security protocols for managing sensitive data.

  • Disabling Kubernetes limit range – Chooses whether to disable the Kubernetes Limit Range feature. May be adjusted in environments with specific resource management needs.

Note

These security customizations allow organizations to tailor NVIDIA Run:ai to their specific needs. Changes should be made cautiously and only when necessary to meet particular security, compliance or operational requirements.

Secure Installation

Many organizations enforce IT compliance rules for Kubernetes, with strict access control for installing and running workloads. OpenShift uses Security Context Constraints (SCC) for this purpose. NVIDIA Run:ai fully supports SCC, ensuring integration with OpenShift's security requirements.

Security Vulnerabilities

The platform is actively monitored for security vulnerabilities, with regular scans conducted to identify and address potential issues. Necessary fixes are applied to ensure that the software remains secure and resilient against emerging threats, providing a safe and reliable experience.

NVIDIA Run:ai at Scale

Operating NVIDIA Run:ai at scale ensures that the system can efficiently handle fluctuating workloads while maintaining optimal performance. As clusters grow—whether due to an increasing number of nodes or a surge in workload demand—NVIDIA Run:ai services must be appropriately tuned to support large-scale environments.

This guide outlines the best practices for optimizing NVIDIA Run:ai for high-performance deployments, including NVIDIA Run:ai system services configurations, vertical scaling (adjusting CPU and memory resources) and where applicable, horizontal scaling (replicas).

NVIDIA Run:ai Services

Vertical Scaling

Each of the NVIDIA Run:ai containers has default resource requirements that reflect an average customer load. With significantly larger cluster loads, certain NVIDIA Run:ai services will require more CPU and memory resources. NVIDIA Run:ai supports configuring these resources for each NVIDIA Run:ai service group separately. For instructions and more information, see NVIDIA Run:ai services resource management.

Scheduling Services

The scheduling services group should be scaled together with the number of nodes and the number of workloads handled by the Scheduler (running / pending). These resource recommendations are based on internal benchmarks performed on stressed environments:

  • Small (30 nodes / 480 workloads) – CPU request: 1, Memory request: 1GB

  • Medium (100 nodes / 1600 workloads) – CPU request: 2, Memory request: 2GB

  • Large (500 nodes / 8500 workloads) – CPU request: 2, Memory request: 7GB

Sync and Workload Services

The sync and workload service groups are less sensitive to scale. The recommendation for large or intensive environments is the following:

  • CPU (request-limit): 1-2

  • Memory (request-limit): 1GB-2GB

Horizontal Scaling

By default, NVIDIA Run:ai cluster services are deployed with a single replica. For large scale and intensive environments it is recommended to scale the NVIDIA Run:ai services horizontally by increasing the number of replicas. For more information, see NVIDIA Run:ai services replicas.

Metrics Collection

NVIDIA Run:ai relies on Prometheus to scrape cluster metrics and forward them to the NVIDIA Run:ai control plane. The volume of metrics generated is directly proportional to the number of nodes, workloads, and projects in the system. When operating at scale, reaching hundreds or thousands of nodes and projects, the system generates a significant volume of metrics, which can place a strain on the cluster and the network bandwidth.

To mitigate this impact, it is recommended to tune the Prometheus remote-write configurations. See remote write tuning to read more about the tuning parameters available via the remote write configuration and refer to this article for optimizing Prometheus remote write performance.

You can apply the remote-write configurations required as described in advanced cluster configurations.

The following example demonstrates the recommended approach in NVIDIA Run:ai for tuning Prometheus remote-write configurations:

remoteWrite:
  queueConfig:
    capacity: 5000
    maxSamplesPerSend: 1000
    maxShards: 100
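One possible way to apply this is sketched below, under the assumption that the remote-write block sits under spec.prometheus.spec in the runaiconfig object; verify the exact path in the advanced cluster configurations reference before applying:

# Patch runaiconfig with the remote-write tuning shown above
# (the spec.prometheus.spec path is an assumption -- confirm before use).
cat <<'EOF' > remote-write-patch.yaml
spec:
  prometheus:
    spec:
      remoteWrite:
        queueConfig:
          capacity: 5000
          maxSamplesPerSend: 1000
          maxShards: 100
EOF
kubectl patch runaiconfig runai -n runai --type merge --patch-file remote-write-patch.yaml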

Overview

NVIDIA Run:ai is a GPU orchestration and optimization platform that helps organizations maximize compute utilization for AI workloads. By optimizing the use of expensive compute resources, NVIDIA Run:ai accelerates AI development cycles, and drives faster time-to-market for AI-powered innovations.

Built on Kubernetes, NVIDIA Run:ai supports dynamic GPU allocation, workload submission, workload scheduling, and resource sharing, ensuring that AI teams get the compute power they need while IT teams maintain control over infrastructure efficiency.

How NVIDIA Run:ai Helps Your Organization

For Infrastructure Administrators

NVIDIA Run:ai centralizes cluster management and optimizes infrastructure control by offering:

  • Centralized cluster management – Manage all clusters from a single platform, ensuring consistency and control across environments.

  • Usage monitoring and capacity planning – Gain real-time and historical insights into GPU consumption across clusters to optimize resource allocation and plan future capacity needs efficiently.

  • Policy enforcement – Define and enforce security and usage policies to align GPU consumption with business and compliance requirements.

  • Enterprise-grade authentication – Integrate with your organization's identity provider for streamlined authentication (Single Sign On) and role-based access control (RBAC).

  • Kubernetes-native application – Install as a Kubernetes-native application, seamlessly extending Kubernetes for native cloud experience and operational standards (install, upgrade, configure).

For Platform Administrators

NVIDIA Run:ai simplifies AI infrastructure management by providing a structured approach to managing AI initiatives, resources, and user access. It enables platform administrators to maintain control, efficiency, and scalability across their infrastructure:

  • AI Initiative structuring and management – Map and set up AI initiatives according to your organization's structure, ensuring clear resource allocation.

  • Centralized GPU resource management – Enable seamless sharing and pooling of GPUs across multiple users, reducing idle time and optimizing utilization.

  • User and access control – Assign users (AI practitioners, ML engineers) to specific projects and departments to manage access and enforce security policies, utilizing role-based access control (RBAC) to ensure permissions align with user roles.

  • Workload scheduling – Use scheduling to prioritize and allocate GPUs based on workload needs.

  • Monitoring and insights – Track real-time and historical data on GPU usage to help track resource consumption and optimize costs.

For AI Practitioners

NVIDIA Run:ai empowers data scientists and ML engineers by providing:

  • Optimized workload scheduling – Ensure high-priority jobs get GPU resources. Workloads dynamically receive resources based on demand.

  • Fractional GPU usage – Request and utilize only a fraction of a GPU's memory, ensuring efficient resource allocation and leaving room for other workloads.

  • AI initiatives lifecycle support – Run your entire AI initiatives lifecycle – Jupyter Notebooks, training jobs, and inference workloads efficiently.

  • Interactive session – Ensure an uninterrupted experience when working on Jupyter Notebooks without taking away GPUs.

  • Scalability for training and inference – Support for distributed training across multiple GPUs and auto-scaling of inference workloads.

  • Integrations – Integrate with popular ML frameworks - PyTorch, TensorFlow, XGBoost, Knative, Spark, Kubeflow Pipelines, Apache Airflow, Argo Workflows, Ray and more.

  • Flexible workload submission – Submit workloads using the NVIDIA Run:ai UI, API, CLI or run third-party workloads.

NVIDIA Run:ai System Components

NVIDIA Run:ai is made up of two components, both installed over a Kubernetes cluster:

  • NVIDIA Run:ai cluster – Provides scheduling and workload management, extending Kubernetes native capabilities.

  • NVIDIA Run:ai control plane – Provides resource management, handles workload submission and provides cluster monitoring and analytics.

NVIDIA Run:ai Cluster

The NVIDIA Run:ai cluster is responsible for scheduling AI workloads and efficiently allocating GPU resources across users and projects:

  • NVIDIA Run:ai Scheduler – Applies AI-aware rules to efficiently schedule workloads submitted by AI practitioners.

  • Workload management – Handles workload management which includes the researcher code running as a Kubernetes container and the system resources required to run the code, such as storage, credentials, network endpoints to access the container and so on.

  • Kubernetes operator-based deployment – Installed as a Kubernetes Operator to automate deployment, upgrades and configuration of NVIDIA Run:ai cluster services.

  • Storage – Supports Kubernetes-native storage using Storage Classes, allowing organizations to bring their own storage solutions. Additionally, it also integrates with external storage solutions such as Git, S3, and NFS to support various data requirements.

  • Secured communication – Uses an outbound-only, secured (SSL) connection to synchronize with the NVIDIA Run:ai control plane.

  • Private – NVIDIA Run:ai only synchronizes metadata and operational metrics (e.g., workloads, nodes) with the control plane. No proprietary data, model artifacts, or user data sets are ever transmitted, ensuring full data privacy and security.

NVIDIA Run:ai Control Plane

The NVIDIA Run:ai control plane provides a centralized management interface for organizations to oversee their GPU infrastructure across multiple locations/subnets, accessible via Web UI, API and CLI. The control plane can be deployed on the cloud or on-premise for organizations that require local control over their infrastructure (self-hosted).

  • Multi-cluster management – Manages multiple NVIDIA Run:ai clusters for a single tenant across different locations and subnets from a single unified interface.

  • Resource and access management – Allows administrators to define Projects, Departments and user roles, enforcing policies for fair resource distribution.

  • Workload submission and monitoring – Allows teams to submit workloads, track usage, and monitor GPU performance in real time.

Installation Types

There are two main installation options:

  • SaaS – NVIDIA Run:ai is installed on the customer's data science GPU clusters. The cluster connects to the NVIDIA Run:ai control plane on the cloud (https://<tenant-name>.run.ai). With this installation, the cluster requires an outbound connection to the NVIDIA Run:ai cloud.

  • Self-hosted – The NVIDIA Run:ai control plane is also installed in the customer's data center.

Customized Installation

This section explains the available configurations for customizing the NVIDIA Run:ai control plane and cluster installation.

Control Plane Helm Chart Values

The NVIDIA Run:ai control plane installation can be customized to support your environment via Helm values files or Helm install flags. See Advanced control plane configurations.
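For illustration, values can be supplied either as a file or as individual flags when installing or upgrading the control plane; the chart reference below assumes the runai-backend release used elsewhere in this guide:

# Pass a custom values file (or individual --set flags) to the control plane chart.
helm upgrade -i runai-backend runai-backend/control-plane -n runai-backend \
  -f custom-values.yaml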

Cluster Helm Chart Values

The NVIDIA Run:ai cluster installation can be customized to support your environment via Helm values files or Helm install flags.

These configurations are saved in the runaiconfig Kubernetes object and can be edited post-installation as needed. For more information, see Advanced cluster configurations.

The following table lists the available Helm chart values that can be configured to customize the NVIDIA Run:ai cluster installation.

  • global.image.registry (string) – Global Docker image registry. Default: ""

  • global.additionalImagePullSecrets (list) – List of image pull secret references. Default: []

  • spec.researcherService.ingress.tlsSecret (string) – Existing secret key where the cluster TLS certificates are stored (non-OpenShift). Default: runai-cluster-domain-tls-secret

  • spec.researcherService.route.tlsSecret (string) – Existing secret key where the cluster TLS certificates are stored (OpenShift only). Default: ""

  • spec.prometheus.spec.image (string) – Due to a known issue in the Prometheus Helm chart, the imageRegistry setting is ignored. To pull the image from a different registry, you can manually specify the Prometheus image reference. Default: quay.io/prometheus/prometheus

  • spec.prometheus.spec.imagePullSecrets (string) – List of image pull secret references in the runai namespace to use for pulling Prometheus images (relevant for air-gapped installations). Default: []

  • global.customCA.enabled – Enables the use of a custom Certificate Authority (CA) in your deployment. When set to true, the system is configured to trust a user-provided CA certificate for secure communication.

  • openShift.securityContextConstraints.create – Enables the deployment of Security Context Constraints (SCC). Disable for CIS compliance. Default: true
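For illustration, such values are typically appended as --set flags to the Helm command provided in the cluster installation instructions; the values below are examples only, and the rest of the command should be kept exactly as given:

# Append value overrides to the cluster installation/upgrade Helm command (illustrative values).
helm upgrade -i runai-cluster ... \
  --set global.image.registry=registry.example.com \
  --set openShift.securityContextConstraints.create=false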

Before You Start

NVIDIA Run:ai provides metrics and telemetry for both physical cluster entities, such as clusters, nodes, and node pools, and organizational entities, such as departments and projects. Metrics represent over-time data while telemetry represents current analytics data. This data is essential for monitoring and analyzing the performance and health of your platform.

Consuming Metrics and Telemetry Data

Users can consume the data based on their permissions:

  1. API - Access the data programmatically through the NVIDIA Run:ai API.

  2. CLI - Use the NVIDIA Run:ai Command Line Interface to query and manage the data.

  3. UI - Visualize the data through the NVIDIA Run:ai user interface.

API

  • Metrics API - Access over-time detailed analytics data programmatically.

  • Telemetry API - Access current analytics data programmatically.

Refer to metrics and telemetry to see the full list of supported metrics and telemetry APIs.

CLI

Use the list and describe commands to fetch and manage the data. See CLI reference for more details.

  • Describe a specific workload to view its telemetry

  • List projects and view their telemetry and metrics

UI Views

Refer to metrics and telemetry to see the full list of supported metrics and telemetry.

  • Overview dashboard - Provides a high-level summary of the cluster's health and performance, including key metrics such as GPU utilization, memory usage, and node status. Allows administrators to quickly identify any potential issues or areas for optimization. Offers advanced analytics capabilities for analyzing GPU usage patterns and identifying trends. Helps administrators optimize resource allocation and improve cluster efficiency.

  • Quota management - Enables administrators to monitor and manage GPU quotas across the cluster. Includes features for setting and adjusting quotas, tracking usage, and receiving alerts when quotas are exceeded.

  • Workload visualizations - Provides detailed insights into the resource usage and utilization of each GPU in the cluster. Includes metrics such as GPU memory utilization, core utilization, and power consumption. Allows administrators to identify GPUs that are under-utilized or overloaded.

  • Node and node pool visualizations - Similar to workload visualizations, but focused on the resource usage and utilization of each GPU within a specific node or node pool. Helps administrators identify potential issues or bottlenecks at the node level.

  • Advanced NVIDIA metrics - Provides access to a range of advanced NVIDIA metrics, such as GPU temperature, fan speed, and voltage. Enables administrators to monitor the health and performance of GPUs in greater detail. This data is available at the node and workload level. To enable these metrics, contact NVIDIA Run:ai customer support.

Network Requirements

The following network requirements are for the NVIDIA Run:ai components installation and usage.

External Access

Set out below are the domains to whitelist and ports to open for installation, upgrade, and usage of the application and its management.

Note

Ensure the inbound and outbound rules are correctly applied to your firewall.

Inbound Rules

To allow your organization’s NVIDIA Run:ai users to interact with the cluster using the NVIDIA Run:ai Command-line interface, or access specific UI features, certain inbound ports need to be open:

  • NVIDIA Run:ai control plane – HTTPS entrypoint. Source: 0.0.0.0, Destination: NVIDIA Run:ai system nodes, Port: 443

  • NVIDIA Run:ai cluster – HTTPS entrypoint. Source: 0.0.0.0, Destination: NVIDIA Run:ai system nodes, Port: 443

Outbound Rules

Note

Outbound rules apply to the NVIDIA Run:ai cluster component only. In case the NVIDIA Run:ai cluster is installed together with the NVIDIA Run:ai control plane, the NVIDIA Run:ai cluster FQDN refers to the NVIDIA Run:ai control plane FQDN.

For the NVIDIA Run:ai cluster installation and usage, certain outbound ports must be open:

  • Cluster sync – Sync NVIDIA Run:ai cluster with NVIDIA Run:ai control plane. Source: NVIDIA Run:ai system nodes, Destination: NVIDIA Run:ai control plane FQDN, Port: 443

  • Metric store – Push NVIDIA Run:ai cluster metrics to NVIDIA Run:ai control plane's metric store. Source: NVIDIA Run:ai system nodes, Destination: NVIDIA Run:ai control plane FQDN, Port: 443

  • Container Registry – Pull NVIDIA Run:ai images. Source: All Kubernetes nodes, Destination: runai.jfrog.io, Port: 443

  • Helm repository – NVIDIA Run:ai Helm repository for installation. Source: Installer machine, Destination: runai.jfrog.io, Port: 443

The NVIDIA Run:ai installation has software requirements that require additional components to be installed on the cluster. This article includes simple installation examples which can be used optionally and require the following cluster outbound ports to be open:

  • Kubernetes Registry – Ingress Nginx image repository. Source: All Kubernetes nodes, Destination: registry.k8s.io, Port: 443

  • Google Container Registry – GPU Operator and Knative image repository. Source: All Kubernetes nodes, Destination: gcr.io, Port: 443

  • Red Hat Container Registry – Prometheus Operator image repository. Source: All Kubernetes nodes, Destination: quay.io, Port: 443

  • Docker Hub Registry – Training Operator image repository. Source: All Kubernetes nodes, Destination: docker.io, Port: 443

Internal Network

Ensure that all Kubernetes nodes can communicate with each other across all necessary ports. Kubernetes assumes full interconnectivity between nodes, so you must configure your network to allow this seamless communication. Specific port requirements may vary depending on your network setup.

Install the Control Plane

System and Network Requirements

Before installing the NVIDIA Run:ai control plane, validate that the system and network requirements are met. For air-gapped environments, make sure you have the software files prepared.

Permissions

As part of the installation, you will be required to install the NVIDIA Run:ai control plane Helm chart. The Helm charts require Kubernetes administrator permissions. You can review the exact objects that are created by the charts using the --dry-run flag on both Helm charts.

Installation

Kubernetes

Connected

Run the following command. Replace global.domain=<DOMAIN> with the domain name obtained earlier:
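A hedged sketch of the installation command, assuming the runai-backend Helm repository and the control-plane chart referenced elsewhere in this guide; verify the repository URL and chart name against the official installation instructions:

# Add the NVIDIA Run:ai control plane Helm repository and install
# (repository URL and chart reference are assumptions to verify).
helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm upgrade -i runai-backend runai-backend/control-plane -n runai-backend \
  --create-namespace --set global.domain=<DOMAIN>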

Note: To install a specific version, add --version <VERSION> to the install command. You can find available versions by running helm search repo -l runai-backend.

Note: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation.

Air-gapped

To run the following command, make sure to replace the following. The custom-env.yaml is created when :

  1. control-plane-<VERSION>.tgz - The NVIDIA Run:ai control plane version

  2. global.domain=<DOMAIN> - The domain name set

  3. global.customCA.enabled=true as described

Note: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation.

OpenShift

Connected

Run the following command. The <OPENSHIFT-CLUSTER-DOMAIN> is the subdomain configured for the OpenShift cluster:

Note: To install a specific version, add --version <VERSION> to the install command. You can find available versions by running helm search repo -l runai-backend.

Air-gapped

To run the following command, make sure to replace the following. The custom-env.yaml is created when

  1. control-plane-<VERSION>.tgz - The NVIDIA Run:ai control plane version

  2. <OPENSHIFT-CLUSTER-DOMAIN> - The domain configured for the OpenShift cluster. To find out the OpenShift cluster domain, run oc get routes -A

  3. global.customCA.enabled=true as described

Note

To customize the installation based on your environment, see Advanced control plane configurations.

Connect to NVIDIA Run:ai User Interface

  1. Open your browser and go to:

    • Kubernetes: https://<DOMAIN>

    • OpenShift: https://runai.apps.<OpenShift-DOMAIN>

  2. Log in using the default credentials:

    • User: [email protected]

    • Password: Abcd!234

You will be prompted to change the password.

Users

This section explains the procedure to manage users and their permissions.

Users can be managed locally, or via the identity provider (IdP), while assigned with access rules to manage permissions. For example, user [email protected] is a department admin in department A.

Users Table

The Users table can be found under Access in the NVIDIA Run:ai platform.

The users table provides a list of all the users in the platform. You can manage users and user permissions (access rules) for both local and SSO users.

Single Sign-On Users

SSO users are managed by the identity provider and appear once they have signed in to NVIDIA Run:ai.

The Users table consists of the following columns:

  • User – The unique identity of the user (email address)

  • Type – The type of the user - SSO / local

  • Last login – The timestamp for the last time the user signed in

  • Access rule(s) – The access rules assigned to the user

  • Created by – The user who created the user

  • Creation time – The timestamp for when the user was created

  • Last updated – The last time the user was updated

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

Creating a Local User

To create a local user:

  1. Click +NEW LOCAL USER

  2. Enter the user’s Email address

  3. Click CREATE

  4. Review and copy the user’s credentials:

    • User Email

    • Temporary password to be used on first sign-in

  5. Click DONE

Note

The temporary password is visible only at the time of user’s creation and must be changed after the first sign-in.

Adding an Access Rule to a User

To create an access rule:

  1. Select the user you want to add an access rule for

  2. Click ACCESS RULES

  3. Click +ACCESS RULE

  4. Select a role

  5. Select a scope

  6. Click SAVE RULE

  7. Click CLOSE

Deleting a User’s Access Rule

To delete an access rule:

  1. Select the user you want to remove an access rule from

  2. Click ACCESS RULES

  3. Find the access rule assigned to the user you would like to delete

  4. Click on the trash icon

  5. Click CLOSE

Resetting a User's Password

To reset a user’s password:

  1. Select the user whose password you want to reset

  2. Click RESET PASSWORD

  3. Click RESET

  4. Review and copy the user’s credentials:

    • User Email

    • Temporary password to be used on next sign-in

  5. Click DONE

Deleting a User

  1. Select the user you want to delete

  2. Click DELETE

  3. In the dialog, click DELETE to confirm

Note

To ensure administrative operations are always available, at least one local user with System Administrator role should exist.

Using API

Go to the Users API reference to view the available actions.

Access Rules

This section explains the procedure to manage Access rules.

Access rules provide users, groups, or applications privileges to system entities. An access rule is the assignment of a role to a subject in a scope: <Subject> is a <Role> in a <Scope>. For example, user [email protected] is a department admin in department A.

Access Rules Table

The Access rules table can be found under Access in the NVIDIA Run:ai platform.

The Access rules table provides a list of all the access rules defined in the platform and allows you to manage them.

Flexible management

It is also possible to manage access rules directly for a specific user, application, project, or department.

The Access rules table consists of the following columns:

  • Type – The type of subject assigned to the access rule (user, SSO group, or application)

  • Subject – The user, SSO group, or application assigned with the role

  • Role – The role assigned to the subject

  • Scope – The scope to which the subject has access. Click the name of the scope to see the scope and its subordinates

  • Authorized by – The user who granted the access rule

  • Creation time – The timestamp for when the rule was created

  • Last updated – The last time the access rule was updated

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

Adding a New Access Rule

To add a new access rule:

  1. Click +NEW ACCESS RULE

  2. Select a subject User, SSO Group, or Application

  3. Select or enter the subject identifier:

    • User Email for a local user created in NVIDIA Run:ai or for SSO user as recognized by the IDP

    • Group name as recognized by the IDP

    • Application name as created in NVIDIA Run:ai

  4. Select a role

  5. Select a scope

  6. Click SAVE RULE

Note

An access rule consists of a single subject with a single role in a single scope. To assign multiple roles or multiple scopes to the same subject, multiple access rules must be added.

Editing an Access Rule

Access rules cannot be edited. To change an access rule, you must delete the rule, and then create a new rule to replace it.

Deleting an Access Rule

  1. Select the access rule you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm

Viewing Your User Access Rule

To view the assigned roles and scopes you have access to:

  1. Click the user avatar at the top right corner, then select Settings

  2. Click User details

The list of assigned roles and scopes will be displayed.

Using API

Go to the API reference to view the available actions.

Event History

This section provides details about NVIDIA Run:ai’s Audit log.

The NVIDIA Run:ai control plane provides the audit log API and the Event history table in the NVIDIA Run:ai UI. Both reflect the same information regarding changes to business objects: clusters, projects, assets, etc.

Note

Only system administrator users with tenant-wide permissions can access Audit log.

Event History Table

The Event history table can be found under Event history in the NVIDIA Run:ai UI.

The Event history table consists of the following columns:

  • Subject – The name of the subject

  • Subject type – The user or application assigned with the role

  • Source IP – The IP address of the subject

  • Date & time – The exact timestamp at which the event occurred. Format: dd/mm/yyyy for date and hh:mm am/pm for time

  • Event – The type of the event. Possible values: Create, Update, Delete, Login

  • Event ID – Internal event ID, can be used for support purposes

  • Status – The outcome of the logged operation. Possible values: Succeeded, Failed

  • Entity type – The type of the logged business object

  • Entity name – The name of the logged business object

  • Entity ID – The system's internal ID of the logged business object

  • URL – The endpoint or address that was accessed during the logged event

  • HTTP method – The HTTP operation method used for the request. Possible values include standard HTTP methods such as GET, POST, PUT, DELETE, indicating what kind of action was performed on the specified URL

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV or Download as JSON

Using the Event History Date Selector

The Event history table saves events for the last 90 days. However, the table itself presents up to the last 30 days of information due to the potentially very high number of operations that might be logged during this period.

To view older events, or to refine your search for more specific results or fewer results, use the time selector and change the period you search for. You can also refine your search by clicking and using ADD FILTER accordingly.

Using API

Go to the Audit log API reference to view the available actions. Since the amount of data is not trivial, the API is based on paging. It retrieves a specified number of items for each API call. You can get more data by using subsequent calls.

Limitations

Submissions of workloads are not audited. As a result, the system does not track or log details of workload submissions, such as timestamps or user activity.

Upgrade

Before Upgrade

Before proceeding with the upgrade, it's crucial to apply the specific prerequisites associated with your current version of NVIDIA Run:ai and every version in between up to the version you are upgrading to.

Helm

NVIDIA Run:ai requires Helm 3.14 or later. Before you continue, validate your installed Helm client version. To install or upgrade Helm, see the Helm documentation. If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai tar file contains the Helm binary.

Software Files

Run the helm command below:
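For connected environments, this is typically a refresh of the NVIDIA Run:ai Helm repositories (a hedged sketch; use the exact command from the upgrade instructions):

helm repo update

For air-gapped environments: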

  • Ask for a tar file runai-air-gapped-<NEW-VERSION>.tar.gz from NVIDIA Run:ai customer support. The file contains the new version you want to upgrade to. <NEW-VERSION> is the updated version of the NVIDIA Run:ai control plane.

  • Upload the images as described in the air-gapped preparation instructions.

Upgrade Control Plane

System and Network Requirements

Before upgrading the NVIDIA Run:ai control plane, validate that the latest system and network requirements are met, as they can change from time to time.

Upgrade

To upgrade from 2.17 or later, run the following:
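A hedged sketch of the upgrade command, assuming the runai-backend release and control-plane chart used at installation time; check the upgrade notes for version-specific flags or values files:

# Upgrade the control plane Helm release (illustrative; adjust flags and values as required).
helm repo update
helm upgrade runai-backend runai-backend/control-plane -n runai-backend \
  --version <VERSION> --reuse-values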

Note

To upgrade to a specific version, modify the --version flag by specifying the desired <VERSION>. You can find all available versions by using the helm search repo command.

Upgrade Cluster

System and Network Requirements

Before upgrading the NVIDIA Run:ai cluster, validate that the latest system and network requirements are met, as they can change from time to time.

Note

It is highly recommended to upgrade the Kubernetes version together with the NVIDIA Run:ai cluster version, to ensure compatibility with the latest supported version of your Kubernetes distribution.

Getting Installation Instructions

Follow the setup and installation instructions below to get the installation instructions to upgrade the NVIDIA Run:ai cluster.

Setup

  1. In the NVIDIA Run:ai UI, go to Clusters

  2. Select the cluster you want to upgrade

  3. Click INSTALLATION INSTRUCTIONS

  4. Optional: Select the NVIDIA Run:ai cluster version (latest, by default)

  5. Click CONTINUE

Installation Instructions

  1. Follow the installation instructions and run the Helm commands provided on your Kubernetes cluster. See the troubleshooting scenarios below if you encounter issues.

  2. Click DONE

  3. Once installation is complete, validate the cluster is Connected and listed with the new cluster version (see the Clusters table in the NVIDIA Run:ai UI). Once you have done this, the cluster is upgraded to the latest version.

Note

To upgrade to a specific version, modify the --version flag by specifying the desired <VERSION>. You can find all available versions by using the helm search repo command.

Troubleshooting

If you encounter an issue with the cluster upgrade, use the troubleshooting scenarios below.

Installation Fails

If the NVIDIA Run:ai cluster upgrade fails, check the installation logs to identify the issue.

Run the following script to print the installation logs:
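As a minimal substitute, assuming the cluster services run in the runai namespace, the installation pods and their logs can be inspected directly:

# Inspect cluster pods and the logs of a failing pod (generic kubectl commands).
kubectl get pods -n runai
kubectl describe pod <failing-pod-name> -n runai
kubectl logs <failing-pod-name> -n runai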

Cluster Status

If the NVIDIA Run:ai cluster upgrade completes, but the cluster status does not show as Connected, refer to the cluster troubleshooting documentation.

Workload Assets

NVIDIA Run:ai assets are preconfigured building blocks that simplify the workload submission effort and remove the complexities of Kubernetes and networks for AI practitioners.

Workload assets enable organizations to:

  • Create and reuse preconfigured setup for code, data, storage and resources to be used by AI practitioners to simplify the process of submitting workloads

  • Share the preconfigured setup with a wide audience of AI practitioners with similar needs

Note

  • The creation of assets is possible only via API and the NVIDIA Run:ai UI.

  • The submission of workloads using assets is possible only via the NVIDIA Run:ai UI.

Workload Asset Types

There are four workload asset types used by the workload:

  • Environments – The container image, tools and connections for the workload

  • Data sources – The type of data, its origin and the target storage location such as PVCs or cloud storage buckets where datasets are stored

  • Compute resources – The compute specification, including GPU and CPU compute and memory

  • Credentials – The secrets to be used to access sensitive data, services, and applications such as a docker registry or S3 buckets

Asset Scope

When a workload asset is created, a scope is required. The scope defines who in the organization can view and/or use the asset.

Note

When an asset is created via API, the scope can be the entire account. This is currently an experimental feature.

Who Can Create an Asset?

Any subject (user, application, or SSO group) with a role that has permissions to Create an asset, can do so within their scope.

Who Can Use an Asset?

Assets are used when submitting workloads. Any subject (user, application or SSO group) with a role that has permissions to Create workloads, can also use assets.

Who Can View an Asset?

Any subject (user, application, or SSO group) with a role that has permission to View an asset, can do so within their scope.

TLS certificates
TLS certificates
known issue

NVIDIA Run:ai control plane

HTTPS entrypoint

0.0.0.0

NVIDIA Run:ai system nodes

443

NVIDIA Run:ai cluster

HTTPS entrypoint

0.0.0.0

NVIDIA Run:ai system nodes

443

Cluster sync

Sync NVIDIA Run:ai cluster with NVIDIA Run:ai control plane

NVIDIA Run:ai system nodes

NVIDIA Run:ai control plane FQDN

443

Metric store

Push NVIDIA Run:ai cluster metrics to NVIDIA Run:ai control plane's metric store

NVIDIA Run:ai system nodes

NVIDIA Run:ai control plane FQDN

443

Container Registry

Pull NVIDIA Run:ai images

All kubernetes nodes

runai.jfrog.io

443

Helm repository

NVIDIA Run:ai Helm repository for installation

Installer machine

runai.jfrog.io

443

Kubernetes Registry

Ingress Nginx image repository

All kubernetes nodes

registry.k8s.io

443

Google Container Registry

GPU Operator, and Knative image repository

All kubernetes nodes

gcr.io

443

Red Hat Container Registry

Prometheus Operator image repository

All kubernetes nodes

quay.io

443

Docker Hub Registry

Training Operator image repository

All kubernetes nodes

docker.io

443

NVIDIA Run:ai Command-line interface
software requirements
workload
Environments
Data sources
Compute resources
Credentials
scope
role
role
role

User

The unique identity of the user (email address)

Type

The type of the user - SSO / local

Last login

The timestamp for the last time the user signed in

Access rule(s)

The access rule assigned to the user

Created By

The user who created the user

Creation time

The timestamp for when the user was created

Last updated

The last time the user was updated

access rules
SSO users
Users
Access rules

Type

The type of subject assigned to the access rule (user, SSO group, or application).

Subject

The user, SSO group, or application assigned with the role

Role

The role assigned to the subject

Scope

The scope to which the subject has access. Click the name of the scope to see the scope and its subordinates

Authorized by

The user who granted the access rule

Creation time

The timestamp for when the rule was created

Last updated

The last time the access rule was updated

role
subject in a scope
user
application
project
department
Access rules

Event History

| Column | Description |
| --- | --- |
| Subject | The name of the subject |
| Subject type | The type of subject that performed the action (user or application) |
| Source IP | The IP address of the subject |
| Date & time | The exact timestamp at which the event occurred. Format dd/mm/yyyy for date and hh:mm am/pm for time |
| Event | The type of the event. Possible values: Create, Update, Delete, Login |
| Event ID | Internal event ID, can be used for support purposes |
| Status | The outcome of the logged operation. Possible values: Succeeded, Failed |
| Entity type | The type of the logged business object |
| Entity name | The name of the logged business object |
| Entity ID | The system's internal ID of the logged business object |
| URL | The endpoint or address that was accessed during the logged event |
| HTTP Method | The HTTP operation method used for the request. Possible values include standard HTTP methods such as GET, POST, PUT, DELETE, indicating what kind of action was performed on the specified URL |


Node Roles

This article explains how to designate specific node roles in a Kubernetes cluster to ensure optimal performance and reliability in production deployments.

For optimal performance in production clusters, it is essential to avoid extensive CPU usage on GPU nodes where possible. This can be done by ensuring the following:

  • NVIDIA Run:ai system-level services run on dedicated CPU-only nodes.

  • Workloads that do not request GPU resources (e.g. Machine Learning jobs) are executed on CPU-only nodes.

NVIDIA Run:ai services are scheduled on the defined node roles by applying Kubernetes Node Affinity using node labels.

Prerequisites

To perform these tasks, make sure to install the NVIDIA Run:ai Administrator CLI.

Configure Node Roles

The following node roles can be configured on the cluster:

  • System node: Reserved for NVIDIA Run:ai system-level services.

  • GPU Worker node: Dedicated for GPU-based workloads.

  • CPU Worker node: Used for CPU-only workloads.

System Nodes

NVIDIA Run:ai system nodes run the system-level services required to operate. This can be done via Kubectl (recommended) or via the NVIDIA Run:ai Administrator CLI.

By default, NVIDIA Run:ai applies a node affinity rule to prefer nodes that are labeled with node-role.kubernetes.io/runai-system for system services scheduling. You can modify the default node affinity rule by:

  • Editing the spec.global.affinity configuration parameter as detailed in Advanced cluster configurations.

  • Editing the global.affinity configuration as detailed in Advanced control plane configurations for self-hosted deployments.

Note

To ensure high availability and prevent a single point of failure, it is recommended to configure at least three system nodes in your cluster.

Important

Do not assign a system node role to the Kubernetes master node. This may disrupt Kubernetes functionality, particularly if the Kubernetes API Server is configured to use port 443 instead of the default 6443.

Kubectl

To set a system role for a node in your Kubernetes cluster using Kubectl, follow these steps:

  1. Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

  2. Run one of the following commands to label the node with its role:

    kubectl label nodes <node-name> node-role.kubernetes.io/runai-system=true
    kubectl label nodes <node-name> node-role.kubernetes.io/runai-system=false
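To confirm which nodes currently carry the system role label, you can list them by label selector (a small optional check, using the same label set above):

kubectl get nodes -l node-role.kubernetes.io/runai-system=true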

NVIDIA Run:ai Administrator CLI

Note

The NVIDIA Run:ai Administrator CLI only supports the default node affinity.

To set a system role for a node in your Kubernetes cluster, follow these steps:

  1. Run the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

  2. Run one of the following commands to set or remove a node’s role:

    runai-adm set node-role --runai-system-worker <node-name>
    runai-adm remove node-role --runai-system-worker <node-name>

The set node-role command will label the node and set relevant cluster configurations.

Worker Nodes

NVIDIA Run:ai worker nodes run user-submitted workloads and the system-level DaemonSets required to operate. This can be managed via Kubectl (recommended) or via the NVIDIA Run:ai Administrator CLI.

By default, GPU workloads are scheduled on GPU nodes based on the nvidia.com/gpu.present label. When global.nodeAffinity.restrictScheduling is set to true via the Advanced cluster configurations:

  • GPU workloads are scheduled with a node affinity rule requiring nodes that are labeled with node-role.kubernetes.io/runai-gpu-worker

  • CPU-only workloads are scheduled with a node affinity rule requiring nodes that are labeled with node-role.kubernetes.io/runai-cpu-worker

Kubectl

To set a worker role for a node in your Kubernetes cluster using Kubectl, follow these steps:

  1. Validate the global.nodeAffinity.restrictScheduling is set to true in the cluster’s Configurations.

  2. Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

  3. Run one of the following commands to label the node with its role. Replace the label and value (true/false) to enable or disable GPU/CPU roles as needed:

    kubectl label nodes <node-name> node-role.kubernetes.io/runai-gpu-worker=true
    kubectl label nodes <node-name> node-role.kubernetes.io/runai-cpu-worker=false
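To review how worker roles are currently assigned across the cluster, you can optionally display both labels as columns (a sketch using kubectl's --label-columns flag):

kubectl get nodes -L node-role.kubernetes.io/runai-gpu-worker -L node-role.kubernetes.io/runai-cpu-worker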

NVIDIA Run:ai Administrator CLI

To set worker role for a node in your Kubernetes cluster via NVIDIA Run:ai Administrator CLI, follow these steps:

  1. Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

  2. Run one of the following commands to set or remove a node’s role. <node-role> must be either --gpu-worker or --cpu-worker:

    runai-adm set node-role <node-role> <node-name>
    runai-adm remove node-role <node-role> <node-name>

The set node-role command will label the node and set the cluster configuration global.nodeAffinity.restrictScheduling to true.

Note

Use the --all flag to set or remove a role for all nodes.
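For example, combining the syntax shown above with the --all flag would presumably look as follows (a sketch based on this section, not verified against every CLI version):

runai-adm set node-role --gpu-worker --all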

Configuring NVIDIA MIG Profiles

NVIDIA’s Multi-Instance GPU (MIG) enables splitting a GPU into multiple logical GPU devices, each with its own memory and compute portion of the physical GPU.

NVIDIA provides two MIG strategies:

  • Single - A GPU can be divided evenly. This means all MIG profiles are the same.

  • Mixed - A GPU can be divided into different profiles.

The NVIDIA Run:ai platform supports running workloads using NVIDIA MIG. Administrators can set the Kubernetes nodes to their preferred MIG strategy and configure the appropriate MIG profiles for researchers and MLOps engineers to use.

This guide explains how to configure MIG in each strategy to submit workloads. It also outlines the individual implications of each strategy and best practices for administrators.

Note

  • Starting from v2.19, the Dynamic MIG feature began a deprecation process and is no longer supported. With Dynamic MIG, the NVIDIA Run:ai platform automatically configured MIG profiles according to on-demand user requests for different MIG profiles or memory fractions.

  • GPU fractions and memory fractions are not supported with MIG profiles.

  • Single strategy supports both NVIDIA Run:ai and third-party workloads. Using mixed strategy can only be done using third-party workloads. For more details on NVIDIA Run:ai and third-party workloads, see Introduction to workloads.

Before You Start

To use MIG single and mixed strategy effectively, make sure to familiarize yourself with the following NVIDIA resources:

  • NVIDIA Multi-Instance GPU

  • MIG User Guide

  • GPU Operator with MIG

Configuring Single MIG Strategy

When deploying MIG using single strategy, all GPUs within a node are configured with the same profile. For example, a node might have GPUs configured with 3 MIG slices of profile type 1g.20gb, or 7 MIG slices of profile 1g.10gb. With this strategy, MIG profiles are displayed as whole GPU devices by CUDA.

The NVIDIA Run:ai platform discovers these MIG profiles as whole GPU devices as well, ensuring MIG devices are transparent to the end-user (practitioner). For example, a node that consists of 8 physical GPUs split into MIG slices, 3 × 2g.20gb slices each, is discovered by the NVIDIA Run:ai platform as a node with 24 GPU devices.

Users can submit workloads by requesting a specific number of GPU devices (X GPU) and NVIDIA Run:ai will allocate X MIG slices (logical devices). The NVIDIA Run:ai platform deducts X GPUs from the workload’s Project quota, regardless of whether this ‘logical GPU’ represents 1/3 of a physical GPU device or 1/7 of a physical GPU device.

Configuring Mixed MIG Strategy

When deploying MIG using mixed strategy, each GPU in a node can be configured with a different combination of MIG profiles such as 2×2g.20gb and 3×1g.10gb. For details on supported combinations per GPU type, refer to Supported MIG Profiles.

In mixed strategy, physical GPU devices continue to be displayed as physical GPU devices by CUDA, and each MIG profile is shown individually. The NVIDIA Run:ai platform identifies the physical GPU devices normally, however, MIG profiles are not visible in the UI or node APIs.

When submitting third-party workloads with this strategy, the user should explicitly specify the exact requested MIG profile (for example, nvidia.com/gpu.product: A100-SXM4-40GB-MIG-3g.20gb). The NVIDIA Run:ai Scheduler finds a node that can provide this specific profile and binds it to the workload.

A third-party workload submitted with a MIG profile of type Xg.Ygb (e.g. 3g.40gb or 2g.20gb) is considered as consuming X GPUs. These X GPUs will be deducted from the workload’s project quota of GPUs. For example, a 3g.40gb profile deducts 3 GPUs from the associated Project’s quota, while 2g.20gb deducts 2 GPUs from the associated Project’s quota. This is done to maintain a logical ratio according to the characteristics of the MIG profile.
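The following is a minimal sketch of a third-party pod requesting a specific MIG profile under the mixed strategy. It assumes the NVIDIA device plugin exposes mixed-strategy profiles as extended resources named nvidia.com/mig-<profile>, and it omits any NVIDIA Run:ai-specific scheduling labels your cluster may require; the pod name, namespace, and image are illustrative only:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: mig-demo            # hypothetical name
  namespace: team-a         # hypothetical namespace
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # any CUDA-capable image
      command: ["nvidia-smi", "-L"]                # list the MIG device visible to the pod
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1                # exact MIG profile, as described above
EOF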

Best Practices for Administrators

Single Strategy

  • Configure proper and uniform sizes of MIG slices (profiles) across all GPUs within a node.

  • Set the same MIG profiles on all nodes of a single node pool.

  • Create separate node pools with different MIG profile configurations allowing users to select the pool that best matches their workloads’ needs.

  • Ensure Project quotas are allocated according to the MIG profile sizes.

Mixed Strategy

  • Use mixed strategy with workloads that require diverse resources. Make sure to evaluate the workload requirements and plan accordingly.

  • Configure individual MIG profiles on each node by using a limited set of MIG profile combinations to minimize complexity. Make sure to evaluate your requirements and node configurations.

  • Ensure Project quotas are allocated according to the MIG profile sizes.

Note

Since MIG slices are a fixed size, once configured, changing MIG profiles requires administrative intervention.

Scheduling Rules

This article explains the procedure to configure and manage scheduling rules.

Scheduling rules are restrictions applied to workloads. These restrictions apply to either the resources (nodes) on which workloads can run or the duration of the run time. Scheduling rules are set for Projects or Departments and apply to specific workload types. Once scheduling rules are set for a project or department, all matching workloads associated with that project have the restrictions applied to them as defined at the time the workload was submitted. New scheduling rules added to a project are not applied to previously created workloads associated with that project.

There are three types of scheduling rules:

Workload Duration (Time Limit)

This rule limits the duration of a workload's run time. Workload run time is calculated as the total time in which the workload was in status Running. You can apply a single rule per workload type - Preemptible Workspaces, Non-preemptible Workspaces, and Training.

Idle GPU Time Limit

This rule limits the total idle GPU time of a workload. Workload idle time is counted from the first time the workload is in status Running and the GPU is idle. Idleness is calculated using the runai_gpu_idle_seconds_per_workload metric. This metric measures the total duration of zero GPU utilization within each 30-second interval. If the GPU remains idle throughout the 30-second window, 30 seconds are added to the idleness sum; otherwise, the idleness count is reset. You can apply a single rule per workload type - Preemptible Workspaces, Non-preemptible Workspaces, and Training.

Note

To make Idle GPU timeout effective, it must be set to a shorter duration than the workload duration of the same workload type.

Node Type (Affinity)

Node type is used to select a group of nodes, typically with specific characteristics such as a hardware feature, storage type, fast networking interconnection, etc. The Scheduler uses node type as an indication of which nodes should be used for your workloads, within this project.

Node type is a label in the form of run.ai/type and a value (e.g. run.ai/type = dgx200) that the administrator uses to tag a set of nodes. Adding the node type to the project’s scheduling rules requires the user to submit workloads with a node type label/value pair from this list, according to the workload type - Workspace or Training. The Scheduler then schedules workloads using a node selector, targeting nodes tagged with the NVIDIA Run:ai node type label/value pair. Node pools and node types can be used in conjunction, for example, specifying a node pool and a smaller group of nodes within that node pool that includes fast SSD storage or other unique characteristics.

Labelling Nodes for Node Types Grouping

The administrator should use a node label with the key run.ai/type and any coupled value.

To assign a label to nodes you want to group, set the ‘node type (affinity)’ on each relevant node:

  1. Obtain the list of nodes and their current labels by copying the following to your terminal:

    kubectl get nodes --show-labels

  2. Annotate a specific node with a new label by copying the following to your terminal:

    kubectl label node <node-name> run.ai/type=<value>
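For example, to tag a hypothetical node named node-17 with the dgx200 value used earlier in this article, and then list the nodes in that group:

kubectl label node node-17 run.ai/type=dgx200
kubectl get nodes -l run.ai/type=dgx200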

Adding a Scheduling Rule to a Project or Department

To add a scheduling rule:

  1. Select the project/department for which you want to add a scheduling rule

  2. Click EDIT

  3. In the Scheduling rules section click +RULE

  4. Select the rule type

  5. Select the workload type and time limitation period

  6. For Node type, choose one or more labels for the desired nodes

  7. Click SAVE

Note

You can review the defined rules in the Projects table in the relevant column.

Editing the Scheduling Rule

To edit a scheduling rule:

  1. Select the project/department for which you want to edit its scheduling rule

  2. Click EDIT

  3. Find the scheduling rule you would like to edit

  4. Edit the rule

  5. Click SAVE

Note

Setting scheduling rules in a department enforces the rules on all associated projects.

When editing a scheduling rule within a project, you can only tighten a rule applied by your department admin, meaning you can set a lower time limitation, not a higher one.

Deleting the Scheduling Rule

To delete a scheduling rule:

  1. Select the project/department from which you want to delete a scheduling rule

  2. Click EDIT

  3. Find the scheduling rule you would like to delete

  4. Click on the x icon

  5. Click SAVE

Note

Deleting a department rule within a project - a project admin cannot delete a rule created by the department admin.

Using API

Go to the Projects API reference to view the available actions.

helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane \
    --set global.domain=runai.apps.<OPENSHIFT-CLUSTER-DOMAIN> \ 
    --set global.config.kubernetesDistribution=openshift
helm upgrade -i runai-backend ./control-plane-<VERSION>.tgz -n runai-backend \
    --set global.domain=runai.apps.<OPENSHIFT-CLUSTER-DOMAIN> \ 
    --set global.config.kubernetesDistribution=openshift \
    --set global.customCA.enabled=true \ 
    -f custom-env.yaml 
helm upgrade -i runai-backend control-plane-<VERSION>.tgz \
    --set global.domain=<DOMAIN> \ 
    --set global.customCA.enabled=true \ 
    -n runai-backend -f custom-env.yaml
helm get values runai-backend -n runai-backend > runai_control_plane_values.yaml
helm upgrade runai-backend -n runai-backend runai-backend/control-plane --version "<VERSION>" -f runai_control_plane_values.yaml --reset-then-reuse-values
helm get values runai-backend -n runai-backend > runai_control_plane_values.yaml
helm upgrade runai-backend control-plane-<NEW-VERSION>.tgz -n runai-backend  -f runai_control_plane_values.yaml --reset-then-reuse-values
curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh

Policies and Rules

At NVIDIA Run:ai, administrators can access a suite of tools designed to facilitate efficient account management. This article focuses on two key features: workload policies and workload scheduling rules. These features empower admins to establish default values and implement restrictions, allowing enhanced control, ensuring compatibility with organizational policies, and optimizing resource usage.

Note

Policies V1 are still supported but require additional setup. If you have policies on clusters prior to NVIDIA Run:ai version 2.18 and upgraded to a newer version, contact NVIDIA Run:ai Customer Success for assistance in transitioning to the new policies framework.

Workload Policies

A workload policy is an end-to-end solution for AI managers and administrators to control and simplify how workloads are submitted. This solution allows them to set best practices, enforce limitations, and standardize processes for the submission of workloads for AI projects within their organization. It acts as a key guideline for data scientists, researchers, ML & MLOps engineers by standardizing submission practices and simplifying the workload submission process.

Why Use a Workload Policy?

Implementing workload policies is essential when managing complex AI projects within an enterprise for several reasons:

  1. Resource control and management - Defining or limiting the use of costly resources across the enterprise via a centralized management system to ensure efficient allocation and prevent overuse.

  2. Setting best practices - Provide managers with the ability to establish guidelines and standards to follow, reducing errors amongst AI practitioners within the organization.

  3. Security and compliance - Define and enforce permitted and restricted actions to uphold organizational security and meet compliance requirements.

  4. Simplified setup - Conveniently allow setting defaults and streamline the workload submission process for AI practitioners.

  5. Scalability and diversity

    1. Multi-purpose clusters with various workload types that may have different requirements and characteristics for resource usage.

    2. The organization has multiple hierarchies, each with distinct goals, objectives, and degrees of flexibility.

    3. Manage multiple users and projects with distinct requirements and methods, ensuring appropriate utilization of resources.

Understanding the Mechanism

The following sections provide details of how the workload policy mechanism works.

Cross-Interface Enforcement

The policy enforces the workloads regardless of whether they were submitted via UI, CLI, Rest APIs, or Kubernetes YAMLs.

Policy Types

NVIDIA Run:ai’s policies apply to NVIDIA Run:ai workloads. There is a policy type for each NVIDIA Run:ai workload type, which allows administrators to set different policies per workload type.

| Policy type | Workload type | Kubernetes name |
| --- | --- | --- |
| Workspace | Workspace | Interactive workload |
| Training: Standard | Training: Standard | Training workload |
| Training: Distributed | Training: Distributed | Distributed workload |
| Inference | Inference | Inference workload |

Policy Structure - Rules, Defaults, and Imposed Assets

A policy consists of rules for limiting and controlling the values of workload fields. In addition to rules, a policy can include defaults that set default values for different workload fields. These defaults are not rules; they simply suggest values that can be overridden during workload submission.

Furthermore, policies allow the enforcement of workload assets. For example, as an admin, you can impose a data source of type PVC to be used by any workload submitted.

For more information, see rules, defaults and imposed assets.

Scope of Effectiveness

Numerous teams working on various projects require the use of different tools, requirements, and safeguards. One policy may not suit all teams and their requirements. Hence, administrators can select the scope to cover the effectiveness of the policy. When a scope is selected, all of its subordinate units are also affected. As a result, all workloads submitted within the selected scope are controlled by the policy.

For example, if a policy is set for Department A, all workloads submitted by any of the projects within this department are controlled.

A scope for a policy can be:

  • The entire account

  • A specific cluster

  • A department

  • A project

Note

The policy submission to the entire account scope is supported via API only.

The different scoping of policies also allows the breakdown of responsibility between different administrators. This allows delegation of ownership between different levels within the organization. The policies, containing rules and defaults, propagate down the organizational tree, forming an “effective” policy that is enforced on any workload submitted by users within the project.

If a rule for a specific field is already set by a policy in the organization, another unit within the same branch cannot submit an additional rule for the same field. As a result, administrators of higher scopes must request lower-scope administrators to free up the specific rule from their policy. However, defaults for the same field can be submitted by different organizational policies, as they are “soft” rules that are not critical to override, and the lowest-level default is the one that becomes the effective default (a project default “wins” over a department default, a department default “wins” over a cluster default, etc.).

NVIDIA Run:ai policies vs. Kyverno policies

Kyverno runs as a dynamic admission controller in a Kubernetes cluster. Kyverno receives validating and mutating admission webhook HTTP callbacks from the Kubernetes API server and applies matching policies to return results that enforce admission policies or reject requests. Kyverno policies can match resources using the resource kind, name, label selectors, and much more. For more information, see How Kyverno Works.

Scheduling Rules

Scheduling rules limit a researcher's access to resources and provide a way for the admin to control resource allocation and prevent the waste of resources. Admins should use the rules to prevent GPU idleness, prevent GPU hogging, and allocate specific types of resources to different types of workloads.

Admins can limit the duration of a workload, the duration of its idle time, or the type of nodes the workload can use. Rules are defined per project or department and apply to all workloads in it. In addition, rules can be applied to a specific type of workload in a project or department (workspace, standard training, or inference). When a workload reaches the limit of a time-based rule, it is stopped. The node type rule prevents the workload from being scheduled on nodes that do not match the rule.

Workspace Templates

This section explains the procedure to manage templates.

A template is a pre-set configuration that is used to quickly configure and submit workloads using existing assets. A template consists of all the assets a workload needs, allowing researchers to submit a workload in a single click, or make subtle adjustments to differentiate them from each other.

Workspace Templates Table

The Templates table can be found under Workload manager in the NVIDIA Run:ai User interface.

The Templates table provides a list of all the templates defined in the platform, and allows you to manage them.

Flexible management

It is also possible to manage templates directly for a specific user, application, project, or department.

The Templates table consists of the following columns:

| Column | Description |
| --- | --- |
| Scope | The scope to which the template is assigned. Click the name of the scope to see the scope and its subordinates |
| Environment | The name of the environment related to the workspace template |
| Compute resource | The name of the compute resource connected to the workspace template |
| Data source(s) | The name of the data source(s) connected to the workspace template |
| Created by | The subject that created the template |
| Creation time | The timestamp for when the template was created |
| Cluster | The cluster name containing the template |

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

  • Refresh (optional) - Click REFRESH to update the table with the latest data

  • Show/Hide details (optional) - Click to view additional information on the selected row

Adding a New Workspace Template

To add a new template:

  1. Click +NEW TEMPLATE

  2. Set the scope for the template

  3. Enter a name for the template

  4. Select the environment for your workload

  5. Select the node resources needed to run your workload - or - Click +NEW COMPUTE RESOURCE

  6. Set the volume needed for your workload

  7. Create a new data source

  8. Set auto-deletion, annotations and labels, as required

  9. Click CREATE TEMPLATE

Copying a Template

To copy an existing template:

  1. Select the template you want to copy

  2. Click MAKE A COPY

  3. Enter a name for the template. The name must be unique.

  4. Update the template and click CREATE TEMPLATE

Renaming a Template

To rename an existing template:

  1. Select the template you want to rename

  2. Click Rename and edit the name/description

Deleting a Template

To delete a template:

  1. Select the template you want to delete

  2. Click DELETE

  3. Confirm you want to delete the template

Using API

Go to the Workload template API reference to view the available actions.

Applications

This section explains the procedure to manage your organization's applications.

Applications are used for API integrations with NVIDIA Run:ai. An application contains a client ID and a client secret. With the client credentials, you can obtain a token as detailed in API authentication and use it within subsequent API calls.

Applications are assigned with access rules to manage permissions. For example, application ci-pipeline-prod is assigned with a Researcher role in Cluster: A.

Applications Table

The Applications table can be found under Access in the NVIDIA Run:ai platform.

The Applications table provides a list of all the applications defined in the platform, and allows you to manage them.

The Applications table consists of the following columns:

| Column | Description |
| --- | --- |
| Application | The name of the application |
| Client ID | The client ID of the application |
| Access rule(s) | The access rules assigned to the application |
| Last login | The timestamp for the last time the application signed in |
| Created by | The user who created the application |
| Creation time | The timestamp for when the application was created |
| Last updated | The last time the application was updated |

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

Creating an Application

To create an application:

  1. Click +NEW APPLICATION

  2. Enter the application’s name

  3. Click CREATE

  4. Copy the Client ID and Client secret and store them securely

  5. Click DONE

Note

The client secret is visible only at the time of creation. It cannot be recovered but can be regenerated.

Adding an Access Rule to an Application

To create an access rule:

  1. Select the application you want to add an access rule for

  2. Click ACCESS RULES

  3. Click +ACCESS RULE

  4. Select a role

  5. Select a scope

  6. Click SAVE RULE

  7. Click CLOSE

Deleting an Access Rule from an Application

To delete an access rule:

  1. Select the application you want to remove an access rule from

  2. Click ACCESS RULES

  3. Find the access rule assigned to the user you would like to delete

  4. Click on the trash icon

  5. Click CLOSE

Regenerating a Client Secret

To regenerate a client secret:

  1. Locate the application you want to regenerate its client secret

  2. Click REGENERATE CLIENT SECRET

  3. Click REGENERATE

  4. Copy the New client secret and store it securely

  5. Click DONE

Note

Regenerating a client secret revokes the previous one.

Deleting an Application

  1. Select the application you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm

Using API

Go to the Applications and Access rules API reference to view the available actions.

NVIDIA Run:ai Workload Types

In the world of machine learning (ML), the journey from raw data to actionable insights is a complex process that spans multiple stages. Each stage of the AI lifecycle requires different tools, resources, and frameworks to ensure optimal performance. NVIDIA Run:ai simplifies this process by offering specialized workload types tailored to each phase, facilitating a smooth transition across various stages of the ML workflows.

The ML lifecycle usually begins with the experimental work on data and exploration of different modeling techniques to identify the best approach for accurate predictions. At this stage, resource consumption is usually moderate as experimentation is done on a smaller scale. As confidence grows in the model's potential and its accuracy, the demand for compute resources increases. This is especially true during the training phase, where vast amounts of data need to be processed, particularly with complex models such as large language models (LLMs), with their huge parameter sizes, that often require distributed training across multiple GPUs to handle the intensive computational load.

Finally, once the model is ready, it moves to the inference stage, where it is deployed to make predictions on new, unseen data. NVIDIA Run:ai's workload types are designed to correspond with the natural stages of this lifecycle. They are structured to align with the specific resource and framework requirements of each phase, ensuring that AI researchers and data scientists can focus on advancing their models without worrying about infrastructure management.

NVIDIA Run:ai offers three workload types that correspond to a specific phase of the researcher’s work:

  • Workspaces – For experimentation with data and models.

  • Training – For resource-intensive tasks such as model training and data preparation.

  • Inference – For deploying and serving the trained model.

Workspaces: The Experimentation Phase

The Workspace is where data scientists conduct initial research, experiment with different data sets, and test various algorithms. This is the most flexible stage in the ML lifecycle, where models and data are explored, tuned, and refined. The value of workspaces lies in the flexibility they offer, allowing the researcher to iterate quickly without being constrained by rigid infrastructure.

  • Framework flexibility

    Workspaces support a variety of machine learning frameworks, as researchers need to experiment with different tools and methods.

  • Resource requirements

    Workspaces are often lighter on resources compared to the training phase, but they still require significant computational power for data processing, analysis, and model iteration.

    Hence, by default, NVIDIA Run:ai workspaces are scheduled without the ability to preempt them once their resources have been allocated. However, this non-preemptible state does not allow utilizing resources beyond the project’s deserved quota.

See Running workspaces to learn more about how to submit a workspace via the NVIDIA Run:ai platform. For quick starts, see Running Jupyter Notebook using workspaces.

Training: Scaling Resources for Model Development

As models mature and the need for more robust data processing and model training increases, NVIDIA Run:ai facilitates this shift through the Training workload. This phase is resource-intensive, often requiring distributed computing and high-performance clusters to process vast data sets and train models.

  • Training architecture

    For training workloads NVIDIA Run:ai allows you to specify the architecture - standard or distributed. The distributed architecture is relevant for larger data sets and more complex models that require utilizing multiple nodes. For the distributed architecture, NVIDIA Run:ai allows you to specify different configurations for the master and workers and select which framework to use - PyTorch, XGBoost, MPI, TensorFlow and JAX. In addition, as part of the distributed configuration, NVIDIA Run:ai enables the researchers to schedule their distributed workloads on nodes within the same region, zone, placement group, or any other topology.

  • Resource requirements

    Training tasks demand high memory, compute power, and storage. NVIDIA Run:ai ensures that the allocated resources match the scale of the task and allows those workloads to utilize more compute resources than the project’s deserved quota. If you do not want your training workload to be preempted, make sure to specify a number of GPUs that is within your quota.

See Standard training and Distributed training to learn more about how to submit a training workload via the NVIDIA Run:ai UI. For quick starts, see Run your first standard training and Run your first distributed training.

Note

Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.

Inference: Deploying and Serving models

Once a model is trained and validated, it moves to the Inference stage, where it is deployed to make predictions (usually in a production environment). This phase is all about efficiency and responsiveness, as the model needs to serve real-time or batch predictions to end-users or other systems.

  • Inference-specific use cases

    Naturally, inference workloads must adapt to ever-changing demands in order to meet SLAs. For example, additional replicas may be deployed, manually or automatically, to increase compute resources as part of a horizontal scaling approach, or a new version of the deployment may need to be rolled out without affecting the running services.

  • Resource requirements

    Inference models differ in size and purpose, leading to varying computational requirements. For example, small OCR models can run efficiently on CPUs, whereas LLMs typically require significant GPU memory for deployment and serving. Inference workloads are considered production-critical and are given the highest priority to ensure compliance with SLAs. Additionally, NVIDIA Run:ai ensures that inference workloads cannot be preempted, maintaining consistent performance and reliability.

See Deploy a custom inference workload to learn more about how to submit an inference workload via the NVIDIA Run:ai UI.

GPU Time-Slicing

NVIDIA Run:ai supports simultaneous submission of multiple workloads to a single GPU or multiple GPUs when using GPU fractions. This is achieved by slicing the GPU memory between the different workloads according to the requested GPU fraction, and by using NVIDIA’s GPU time-slicing to share the GPU compute runtime. NVIDIA Run:ai ensures each workload receives the exact share of the GPU memory (= gpu_memory * requested), while the NVIDIA GPU time-slicing splits the GPU runtime evenly between the different workloads running on that GPU.

To provide customers with predictable and accurate GPU compute resource scheduling, NVIDIA Run:ai’s GPU time-slicing adds fractional compute capabilities on top of NVIDIA Run:ai GPU fraction capabilities.

How GPU Time-Slicing Works

While the default NVIDIA GPU time-slicing allows for sharing the GPU compute runtime evenly without splitting or limiting the runtime of each workload, NVIDIA Run:ai’s GPU time-slicing mechanism gives each workload exclusive access to the full GPU for a limited amount of time (the lease time) in each scheduling cycle (the plan time). This cycle repeats itself for the lifetime of the workload. Using the GPU runtime this way guarantees that a workload is granted its requested GPU compute resources proportionally to its requested GPU fraction, but also allows splitting unused GPU compute time between workloads, up to a requested limit.

For example, when there are 2 workloads running on the same GPU, with NVIDIA’s default GPU time slicing, each workload gets 50% of the GPU compute runtime, even if one workload requests 25% of the GPU memory, and the other workload requests 75% of the GPU memory. With the NVIDIA Run:ai GPU time-slicing, the first workload will get 25% of the GPU compute time and the second will get 75%. If one of the workloads does not use its deserved GPU compute time, the others can split that time evenly between them. As shown in the example, if one of the workloads does not request the GPU for some time, the other will get the full GPU compute time.

GPU Time-Slicing Modes

NVIDIA Run:ai offers two GPU time-slicing modes:

  • Strict - Each workload gets its precise GPU compute fraction, which equals its requested GPU (memory) fraction. In terms of official Kubernetes resource specification, this means: gpu-compute-request = gpu-compute-limit = gpu-(memory-)fraction

  • Fair - Each workload is guaranteed at least its GPU compute fraction, but at the same time can also use additional GPU runtime compute slices that are not used by other idle workloads. Those excess time slices are divided equally between all workloads running on that GPU (after each got at least its requested GPU compute fraction). In terms of official Kubernetes resource specification, this means: gpu-compute-request = gpu-(memory-)fraction, gpu-compute-limit = 1.0

The figure below illustrates how Strict time-slicing mode uses the GPU from Lease (slice) and Plan (cycle) perspective:

The figure below illustrates how Fair time-slicing mode uses the GPU from Lease (slice) and Plan (cycle) perspective:

Time-Slicing Plan and Lease Times

Each GPU scheduling cycle is a plan. The plan is determined by the lease time and granularity (precision). By default, basic lease time is 250ms with 5% granularity (precision), which means the plan (cycle) time is: 250 / 0.05 = 5000ms (5 Sec). Using these values, a workload that requests gpu-fraction=0.5 gets 2.5s runtime out of the 5s cycle time.
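As a quick sanity check of the arithmetic above, this small shell snippet recomputes the default plan time and the per-plan runtime share for a gpu-fraction of 0.5, using the values from this section:

# Default values described above
LEASE_MS=250
GRANULARITY=0.05   # 5% precision
PLAN_MS=$(awk -v l="$LEASE_MS" -v g="$GRANULARITY" 'BEGIN{print l/g}')   # 5000 ms plan (cycle)
SHARE_MS=$(awk -v p="$PLAN_MS" 'BEGIN{print p*0.5}')                     # 2500 ms for gpu-fraction=0.5
echo "plan=${PLAN_MS}ms, runtime per plan for gpu-fraction 0.5: ${SHARE_MS}ms"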

Different workloads require different SLAs and precision, so it is also possible to tune the lease time and precision to customize the time-slicing capabilities of your cluster.

Note

Decreasing the lease time makes time-slicing less accurate. Increasing the lease time makes the system more accurate, but each workload is less responsive.

Once timeSlicing is enabled in the runaiconfig, all submitted GPU fraction or GPU memory workloads will have their gpu-compute-request/limit set automatically by the system, depending on the annotation used and the time-slicing mode:

  • Strict compute resources:

| Annotation | Value | GPU Compute Request | GPU Compute Limit |
| --- | --- | --- | --- |
| gpu-fraction | x | x | x |
| gpu-memory | x | 0 | 1.0 |

  • Fair compute resources:

| Annotation | Value | GPU Compute Request | GPU Compute Limit |
| --- | --- | --- | --- |
| gpu-fraction | x | x | 1.0 |
| gpu-memory | x | 0 | 1.0 |

Note

The above tables show that when submitting a workload using gpu-memory annotation, the system will split the GPU compute time between the different workloads running on that GPU. This means the workload can get anything from very little compute time (>0) to full GPU compute time (1.0).

Enabling GPU Time-Slicing

NVIDIA Run:ai’s GPU time-slicing is a cluster flag which changes the default NVIDIA time-slicing used by GPU fractions. For more details, see Advanced cluster configurations.

Enable GPU time-slicing by setting the following cluster flag in the runaiconfig file:

global:
  core:
    timeSlicing:
      mode: fair/strict

If the timeSlicing flag is not set, the system continues to use the default NVIDIA GPU time-slicing to maintain backward compatibility.
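If you prefer patching the resource directly, a hedged equivalent of the snippet above is shown below. It assumes timeSlicing sits under spec.global.core in the runaiconfig custom resource, mirroring the kubectl patch pattern used for the Node Level Scheduler later in this document:

kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' \
  --patch '{"spec":{"global":{"core":{"timeSlicing":{"mode":"fair"}}}}}'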

Optimize Performance with Node Level Scheduler

The Node Level Scheduler optimizes the performance of your pods and maximizes the utilization of GPUs by making optimal local decisions on GPU allocation to your pods. While the NVIDIA Run:ai Scheduler chooses the specific node for a pod, it has no visibility into the node’s GPUs' internal state. The Node Level Scheduler is aware of the local GPUs' states and makes optimal local decisions so that it can optimize both GPU utilization and the performance of the pods running on the node’s GPUs.

This guide provides an overview of the best use cases for the Node Level Scheduler and instructions for configuring it to maximize GPU performance and pod efficiency.

Deployment Considerations

  • While the Node Level Scheduler applies to all workload types, it will best optimize the performance of burstable workloads. Burstable workloads are workloads that use dynamic GPU fractions, giving them more GPU memory than requested, up to the specified limit.

  • Burstable workloads are always susceptible to an OOM Kill signal if the owner of the excess memory requires it back. This means that using the Node Level Scheduler with inference or training workloads may cause pod preemption.

  • Using interactive workloads with notebooks is the best use case for burstable workloads and the Node Level Scheduler. These workloads behave differently since the OOM Kill signal causes the notebook's GPU process to exit but not the notebook itself. This keeps the interactive pod running and able to retry attaching a GPU.

Interactive Notebooks Use Case

This use case is one scenario that shows how Node Level Scheduler locally optimizes and maximizes GPU utilization and workspaces’ performance.

  1. The figure below shows a node with 2 GPUs and 2 submitted workspaces:

  2. The Scheduler instructs the node to put the 2 workspaces on a single GPU, leaving the other GPU free for a workload that requires more resources. This means GPU#2 is idle while the two workspaces can only use up to half a GPU each, even if they temporarily need more:

  3. With the Node Level Scheduler enabled, the local decision will be to spread those 2 workspaces across 2 GPUs and allow them to maximize both workspaces’ performance and the GPUs’ utilization by bursting up to the full GPU memory and GPU compute resources:

  4. The NVIDIA Run:ai Scheduler still sees a node with one fully empty GPU and one fully occupied GPU. When a 3rd workload is scheduled, and it requires a full GPU (or more than 0.5 GPU), the Scheduler will schedule it to that node, and the Node Level Scheduler will move one of the workspaces to run with the other on GPU#1, as was the Scheduler’s initial plan. Moving the workspace from GPU#2 back to GPU#1 keeps the workspace running while the GPU process within the Jupyter notebook is killed and re-established on GPU#1, continuing to serve the workspace:

Using Node Level Scheduler

The Node Level Scheduler can be enabled per node pool. To use Node Level Scheduler, follow the below steps.

Enable on Your Cluster

  1. Enable the Node Level Scheduler at the cluster level (per cluster) by:

    1. Editing the runaiconfig as follows. For more details, see Advanced cluster configurations:

    spec:
      global:
        core:
          nodeScheduler:
            enabled: true

    2. Or, using the following kubectl patch command:

    kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' --patch '{"spec":{"global":{"core":{"nodeScheduler":{"enabled": true}}}}}'

Enable on a Node Pool

Note

GPU resource optimization is disabled by default. It must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.

Enable Node Level Scheduler on any of the node pools:

  1. Select Resources → Node pools

  2. Create a new node pool or edit an existing node pool

  3. Under the Resource Utilization Optimization tab, change the number of workloads on each GPU to any value other than Not Enforced (i.e. 2, 3, 4, 5)

The Node Level Scheduler is now ready to be used on that node pool.

Submit a Workload

In order for a workload to be considered by the Node Level Scheduler for rerouting, it must be submitted with a GPU Request and Limit where the Limit is larger than the Request:

  • Enable and set dynamic GPU fractions

  • Then submit a workload using dynamic GPU fractions


Nodes Maintenance

This section provides detailed instructions on how to manage both planned and unplanned node downtimes in a Kubernetes cluster running NVIDIA Run:ai. It covers all the steps to maintain service continuity and ensure the proper handling of workloads during these events.

Prerequisites

  • Access to Kubernetes cluster - Administrative access to the Kubernetes cluster, including permissions to run kubectl commands

  • Basic knowledge of Kubernetes - Familiarity with Kubernetes concepts such as nodes, taints, and workloads

  • NVIDIA Run:ai installation - The NVIDIA Run:ai software installed and configured within your Kubernetes cluster

  • Node naming conventions - Know the names of the nodes within your cluster, as these are required when executing the commands

Node Types

This section distinguishes between two types of nodes within a NVIDIA Run:ai installation:

  • Worker nodes - Nodes on which AI practitioners can submit and run workloads

  • NVIDIA Run:ai system nodes - Nodes on which the NVIDIA Run:ai software runs, managing the cluster's operations

Worker Nodes

Worker nodes are responsible for running workloads. When a worker node goes down, either due to planned maintenance or unexpected failure, workloads ideally migrate to other available nodes or wait in the queue to be executed when possible.

Training vs. Interactive Workloads

The following workload types can run on worker nodes:

  • Training workloads - These are long-running processes that, in case of node downtime, can automatically move to another node.

  • Interactive workloads - These are short-lived, interactive processes that require manual intervention to be relocated to another node.

Note

While training workloads can be automatically migrated, it is recommended to plan maintenance and manually manage this process for a faster response, as it may take time for Kubernetes to detect a node failure.

Planned Maintenance

Before stopping a worker node for maintenance, perform the following steps:

  1. Prevent new workloads on the node

    To stop the Kubernetes Scheduler from assigning new workloads to the node and to safely remove all existing workloads, copy the following command to your terminal:

    kubectl taint nodes <node-name> runai=drain:NoExecute
    • <node-name> Replace this placeholder with the actual name of the node you want to drain

    • kubectl taint nodes This command is used to add a taint to the node, which prevents any new pods from being scheduled on it

    • runai=drain:NoExecute This specific taint ensures that all existing pods on the node are evicted and rescheduled on other available nodes, if possible

    Result: The node stops accepting new workloads, and existing workloads either migrate to other nodes or are placed in a queue for later execution.

  2. Shut down and perform maintenance

    After draining the node, you can safely shut it down and perform the necessary maintenance tasks.

  3. Restart the node

    Once maintenance is complete and the node is back online, remove the taint to allow the node to resume normal operations. Copy the following command to your terminal:

    kubectl taint nodes <node-name> runai=drain:NoExecute-

    runai=drain:NoExecute- The - at the end of the command indicates the removal of the taint. This allows the node to start accepting new workloads again.

    Result: The node rejoins the cluster's pool of available resources, and workloads can be scheduled on it as usual.

Unplanned Downtime

In the event of unplanned downtime:

  1. Automatic restart If a node fails but immediately restarts, all services and workloads automatically resume.

  2. Extended downtime

    If the node remains down for an extended period, drain the node to migrate workloads to other nodes. Copy the following command to your terminal:

    kubectl taint nodes <node-name> runai=drain:NoExecute

    The command works the same as in the planned maintenance section, ensuring that no workloads remain scheduled on the node while it is down.

  3. Reintegrate the node

    Once the node is back online, remove the taint to allow it to rejoin the cluster's operations. Copy the following command to your terminal:

    kubectl taint nodes <node-name> runai=drain:NoExecute- 

    Result: This action reintegrates the node into the cluster, allowing it to accept new workloads.

  4. Permanent shutdown

    If the node is to be permanently decommissioned, remove it from Kubernetes with the following command:

    kubectl delete node <node-name>
    • kubectl delete node This command completely removes the node from the cluster

    • <node-name> Replace this placeholder with the actual name of the node

    Result: The node is no longer part of the Kubernetes cluster. If you plan to bring the node back later, it must be rejoined to the cluster using the steps outlined in the next section.

NVIDIA Run:ai System Nodes

In a production environment, the services responsible for scheduling, submitting and managing NVIDIA Run:ai workloads operate on one or more NVIDIA Run:ai system nodes. It is recommended to have more than one system node to ensure high availability. If one system node goes down, another can take over, maintaining continuity. If a second system node does not exist, you must designate another node in the cluster as a temporary NVIDIA Run:ai system node to maintain operations.

The protocols for handling planned maintenance and unplanned downtime are identical to those for worker nodes. Refer to the above section for detailed instructions.

Rejoining a Node into the Kubernetes Cluster

To rejoin a node to the Kubernetes cluster, follow these steps:

  1. Generate a join command on the master node

    On the master node, copy the following command to your terminal:

    kubeadm token create --print-join-command
    • kubeadm token create This command generates a token that can be used to join a node to the Kubernetes cluster.

    • --print-join-command This option outputs the full command that needs to be run on the worker node to rejoin it to the cluster.

    Result: The command outputs a kubeadm join command.

  2. Run the join command on the worker node

    Copy the kubeadm join command generated from the previous step and run it on the worker node that needs to rejoin the cluster.

    kubeadm join <master-ip>:<master-port> \
        --token <token> \
        --discovery-token-ca-cert-hash sha256:<hash>

    The kubeadm join command re-enrolls the node into the cluster, allowing it to start participating in the cluster's workload scheduling.

  3. Verify node rejoining

    Verify that the node has successfully rejoined the cluster by running:

    kubectl get nodes
    • kubectl get nodes This command lists all nodes currently part of the Kubernetes cluster, along with their status

    Result: The rejoined node should appear in the list with a status of Ready

  4. Re-label nodes

    Once the node is ready, ensure it is labeled according to its role within the cluster.
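For example, if the rejoined node previously served as a GPU worker, you would re-apply the corresponding role label from the Node Roles section (the node name below is a placeholder):

kubectl label nodes <node-name> node-role.kubernetes.io/runai-gpu-worker=true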

To uninstall the NVIDIA Run:ai cluster, run the following helm command in your terminal:

helm uninstall runai-cluster -n runai

To remove the NVIDIA Run:ai cluster from the NVIDIA Run:ai platform, see Removing a cluster.

Note

Uninstall of NVIDIA Run:ai cluster from the Kubernetes cluster does not delete existing projects, departments or workloads submitted by users.

helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane \ 
    --set global.domain=<DOMAIN>

Adapting AI Initiatives to Your Organization

AI initiatives refer to advancing research, development, and implementation of AI technologies. These initiatives represent your business needs and involve collaboration between individuals, teams, and other stakeholders. AI initiatives require compute resources and a methodology to effectively and efficiently use those compute resources and split them among the different AI initiatives stakeholders. The building blocks of AI compute resources are GPUs, CPUs, and memory, which are built into nodes (servers) and can be further grouped into node pools. Nodes and node pools are part of a Kubernetes cluster.

To manage AI initiatives in NVIDIA Run:ai you should:

  • Map your organization and initiatives to projects and optionally departments

  • Map compute resources (node pools and quotas) to projects and optionally departments

  • Assign users (e.g. AI practitioners, ML engineers, Admins) to projects and departments

Mapping Your Organization

The way you map your AI initiatives and organization into NVIDIA Run:ai projects and departments should reflect your organization’s structure and Project management practices. There are multiple options, and we provide you here with 3 examples of typical forms in which to map your organization, initiatives, and users into NVIDIA Run:ai, but of course, other ways that suit your requirements are also acceptable.

Based on Individuals

A typical use case would be students (individual practitioners) within a faculty (business unit) - an individual practitioner may be involved in one or more initiatives. In this example, the resources are accounted for by the student (project) and aggregated per faculty (department).

Department = business unit / Project = individual practitioner

Based on Business Units

A typical use case would be an AI service (business unit) split into AI capabilities (initiatives) - an individual practitioner may be involved in several initiatives. In this example, the resources are accounted for by Initiative (project) and aggregated per AI service (department).

Department = business unit / Project = initiative

Based on the Organizational Structure

A typical use case would be a business unit split into teams - an individual practitioner is involved in a single team (project) but the team may be involved in several AI initiatives. In this example, the resources are accounted for by team (project) and aggregated per business unit (department).

Department = business unit / Project = team

Mapping Your Resources

AI initiatives require compute resources such as GPUs and CPUs to run. Compute resources in any organization are limited, whether because the number of servers (nodes) the organization owns is limited, or because the budget available to lease cloud resources or purchase in-house servers is limited. Every organization strives to optimize the usage of its resources by maximizing their utilization and providing all users with their needs. Therefore, the organization needs to split resources according to the organization's internal priorities and budget constraints. But even after splitting the resources, the orchestration layer should still provide fairness between the resource consumers, and allow access to unused resources to minimize scenarios of idle resources.

Another aspect of resource management is how to group your resources effectively, especially in large environments or environments made up of heterogeneous hardware types, where some users need specific hardware types, or where some users should avoid occupying hardware that is critical for other users or initiatives.

NVIDIA Run:ai assists you with these complex issues by allowing you to map your cluster resources to node pools, assign each project and department a quota allocation per node pool, and set access rights to unused resources (over quota) per node pool.

Grouping Your Resources

There are several reasons why you would group resources (nodes) into node pools:

  • Control the GPU type to use in a heterogeneous hardware environment - in many cases, AI models are optimized for the hardware type they run on, e.g. a training workload that is optimized for H100 does not necessarily run optimally on an A100, and vice versa. Segmenting nodes into node pools, each with a different hardware type, gives the AI researcher and ML engineer better control of where to run.

  • Quota control - splitting into node pools allows the admin to set a specific quota per hardware type, e.g. give a high-priority project guaranteed access to advanced GPU hardware, while keeping a lower-priority project with a lower quota, or even no quota at all, for that high-end GPU and giving it “best-effort” access only (i.e. only when the high-priority, guaranteed project is not using those resources).

  • Multi-region or multi-availability-zone cloud environments - if some or all of your clusters run in the cloud (or even on-premises) and use different physical locations or different topologies (e.g. racks), you probably want to segment your resources per region/zone/topology so you can control where your workloads run and how much quota to assign to specific environments (per project, per department), even if all those locations use the same hardware type. This methodology can also improve workload performance thanks to locality, such as the locality of distributed workloads, local storage, etc.

  • Explainability and predictability - large environments are complex to understand, and this becomes even harder when the environment is heavily loaded. To maintain users’ satisfaction and their understanding of the resources’ state, and to keep your workloads’ chances of getting scheduled predictable, segmenting your cluster into smaller pools can help significantly.

  • Scale - NVIDIA Run:ai’s implementation of node pools has many benefits, one of the main ones being scale. Each node pool has its own Scheduler instance, allowing the cluster to handle more nodes and schedule workloads faster when segmented into node pools vs. one large cluster. To allow your workloads to use any resource within a cluster that is split into node pools, a second-level Scheduler is in charge of scheduling workloads between node pools according to your preferences and resource availability.

  • Prevent mutual exclusion - some AI workloads consume CPU-only resources. To prevent those workloads from consuming the CPU resources of GPU nodes and blocking GPU workloads from using those nodes, it is recommended to group CPU-only nodes into dedicated node pool(s), assign quota for CPU projects to CPU node pools only, and keep GPU node pools with zero quota and, optionally, “best-effort” over quota access for CPU-only projects. A minimal node-labeling sketch follows this list.
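In practice, a node pool groups nodes that share a node label. As a minimal sketch (the label key, values, and node names below are placeholders rather than NVIDIA Run:ai defaults; a node pool is associated with whatever node label you specify when creating it in the UI or API), nodes can be labeled with standard kubectl commands:

# Hypothetical labels used to group nodes into "h100", "a100" and "cpu-only" pools
kubectl label node worker-gpu-01 node-pool=h100
kubectl label node worker-gpu-02 node-pool=a100
kubectl label node worker-cpu-01 node-pool=cpu-only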

Grouping Examples

Set out below are illustrations of different grouping options.

Example: grouping nodes by topology

Example: grouping nodes by hardware type

Assigning Your Resources

After the initial grouping of resources, it is time to associate resources with AI initiatives. This is performed by assigning quotas to projects and, optionally, to departments. Assigning GPU quota to a project, on a node pool basis, means that the workloads submitted by that project are entitled to use those GPUs as guaranteed resources and can use them for all workload types.

However, what happens if the project requires more resources than its quota? This depends on the type of workloads that the user wants to submit. If the user requires more resources for non-preemptible workloads, then the quota must be increased, because non-preemptible workloads require guaranteed resources. On the other hand, if the workload is preemptible - for example, a model training workload - the project can exploit unused resources of other projects, as long as the other projects don’t need them. Over quota is set per project on a node pool basis and per department.

Administrators can use quota allocations to prioritize resources between users, teams, and AI initiatives. An administrator can completely prevent a project or department from using certain node pools by setting the node pool quota to 0 and disabling over quota for that node pool, or can keep the quota at 0 and enable over quota for that node pool, allowing access based on resource availability only (e.g. unused GPUs). However, when a project with a non-zero quota needs to use those resources, the Scheduler reclaims those resources back and preempts the preemptible workloads of over quota projects. As an administrator, you can also influence the amount of over quota resources a project or department uses.

It is essential to make sure that the sum of all projects’ quotas does NOT surpass that of their department, and that the sum of all departments’ quotas does not surpass the number of physical resources, per node pool and for the entire cluster (we call such behavior ‘over-subscription’). Over-subscription is not recommended because it may produce unexpected scheduling decisions, such as preempting ‘non-preemptible’ workloads or failing to schedule in-quota workloads, whether non-preemptible or preemptible, so that quota can no longer be considered ‘guaranteed’. Admins can opt in to a system flag that helps prevent over-subscription scenarios.
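For example (illustrative numbers only): if a department is assigned a quota of 20 GPUs in node pool A, its projects’ quotas in that node pool could be split as 10 + 6 + 4 = 20 GPUs, but not 10 + 8 + 6 = 24 GPUs. Likewise, if node pool A physically contains 40 GPUs, the quotas of all departments in that node pool should sum to at most 40.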

Example: assigning resources to projects

Assigning Users to Projects and Departments

The NVIDIA Run:ai system uses ‘Role Based Access Control’ (RBAC) to manage users’ access rights to the different objects of the system, its resources, and the set of allowed actions. To allow AI researchers, ML engineers, project admins, or any other stakeholder of your AI initiatives to access projects and use AI compute resources for their AI initiatives, the administrator needs to assign users to projects. After a user is assigned to a project with the proper role, e.g. ‘L1 Researcher’, the user can submit and monitor their workloads under that project. Assigning users to departments is usually done to assign a ‘Department Admin’ to manage a specific department. Other roles, such as ‘L1 Researcher’, can also be assigned at the department level; this gives the researcher access to all projects within that department.

Scopes in an Organization

This is an example of an organization, as represented in the NVIDIA Run:ai platform:

The organizational tree is structured top-down under a single root node, the account. The account is comprised of clusters, departments, and projects.

After mapping and building your hierarchically structured organization as shown above, you can assign or associate various NVIDIA Run:ai components (e.g. workloads, roles, assets, policies, and more) to different parts of the organization - these organizational parts are the Scopes. The following organizational example consists of 5 optional scopes:

Note

When a scope is selected, that unit and all of its subordinates (both existing and any future subordinates, if added) are selected as well.

Next Steps

Now that resources are grouped into node pools, organizational units or business initiatives are mapped into projects and departments, projects’ quota parameters are set per node pool, and users are assigned to projects, you can finally submit workloads from a project and use compute resources to run your AI initiatives.

Workload Policies

This section explains the procedure to manage workload policies.

Workload Policies Table

The Workload policies table can be found under Policies in the NVIDIA Run:ai platform.

Note

Workload policies are disabled by default. If you cannot see Workload policies in the menu, it must be enabled by your administrator under General settings → Workloads → Policies

The Workload policies table provides a list of all the policies defined in the platform, and allows you to manage them.

The Workload policies table consists of the following columns:

Column
Description

Policy

The policy name which is a combination of the policy scope and the policy type

Type

The policy type is per NVIDIA Run:ai workload type. This allows administrators to set different policies for each workload type.

Status

Representation of the policy lifecycle (one of the following - “Creating…”, “Updating…”, “Deleting…”, Ready or Failed)

Scope

The scope the policy affects. Click the name of the scope to view the organizational tree diagram. You can only view the parts of the organizational tree for which you have permission to view.

Created by

The user who created the policy

Creation time

The timestamp for when the policy was created

Last updated

The last time the policy was updated

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Refresh - Click REFRESH to update the table with the latest data

Adding a Policy

To create a new policy:

  1. Click +NEW POLICY

  2. Select a scope

  3. Select the workload type

  4. Click +POLICY YAML

  5. In the YAML editor, type or paste a YAML policy with defaults and rules (a rough illustrative sketch also follows these steps). You can utilize the following references and examples:

    • Policy YAML reference

    • Policy YAML examples

  6. Click SAVE POLICY
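The YAML you paste consists of two top-level sections, defaults and rules, as described in step 5. As a rough, hypothetical sketch of that shape only (the nested field and rule names below are illustrative assumptions, not the authoritative schema; always validate against the Policy YAML reference and examples):

defaults:
  compute:
    cpuCoreRequest: 0.5
rules:
  compute:
    gpuDevicesRequest:
      required: true
      max: 1

The defaults section pre-fills workload submission parameters, while the rules section constrains what users in the selected scope may set.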

Editing a Policy

  1. Select the policy you want to edit

  2. Click EDIT

  3. Update the policy and click APPLY

  4. Click SAVE POLICY

Troubleshooting

Listed below are issues that might occur when creating or editing a policy via the YAML Editor:

Issue
Message
Mitigation

Cluster connectivity issues

There's no communication from cluster “cluster_name“. Actions may be affected, and the data may be stale.

Verify that you are on a network that has been allowed access to the cluster. Reach out to your cluster administrator for instructions on verifying the issue.

Policy can’t be applied due to a rule that is occupied by a different policy

Field “field_name” already has rules in cluster: “cluster_id”

Remove the rule from the new policy or adjust the old policy for the specific rule.

Policy is not visible in the UI

-

Check that the policy hasn’t been deleted.

Policy syntax is not valid

Add a valid policy YAML;json: unknown field "field_name"

For correct syntax, check the Policy YAML reference or the Policy YAML examples.

Policy can’t be saved for some reason

The policy couldn't be saved due to a network or other unknown issue. Download your draft and try pasting and saving it again later.

Possible cluster connectivity issues. Try updating the policy once again at a different time.

Policies were submitted before version 2.18, you upgraded to version 2.18 or above and wish to submit new policies

If you have policies and want to create a new one, first contact NVIDIA Run:ai support to prevent potential conflicts

Contact NVIDIA Run:ai support. R&D can migrate your old policies to the new version.

Viewing a Policy

To view a policy:

  1. Select the policy you want to view.

  2. Click VIEW POLICY

  3. In the Policy form per workload section, view the workload rules and defaults:

    • Parameter - The workload submission parameter that Rules and Defaults are applied to

    • Type (applicable for data sources only) - The data source type (Git, S3, nfs, pvc etc.)

    • Default - The default value of the Parameter

    • Rule - Constraints set on the workload policy field

    • Source - The origin of the applied policy (cluster, department or project)

Note

Some of the rules and defaults may be derived from policies of a parent cluster and/or department. You can see the source of each rule in the policy form. For more information, check the Scope of effectiveness documentation.

Deleting a Policy

  1. Select the policy you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm the deletion

Using API

Go to the Policies API reference to view the available actions.

Reports

This section explains the procedure of managing reports in NVIDIA Run:ai.

Reports allow users to access and organize large amounts of data in a clear, CSV-formatted layout. They enable users to monitor resource consumption, analyze trends, and make data-driven decisions to optimize their AI workloads effectively.

Note

Reports are enabled by default for SaaS. To enable this feature for self-hosted, additional configurations must be added. See Enabling reports for self-hosted accounts.

Report Types

Currently, only “Consumption Reports” are available, which provide insights into the consumption of resources such as GPU, CPU, and CPU memory across organizational units.

Reports Table

The Reports table can be found under Analytics in the NVIDIA Run:ai platform.

The Reports table provides a list of all the reports defined in the platform and allows you to manage them.

Users can access the reports they have generated themselves. Users with project viewing permissions across the entire tenant can access all reports within the tenant.

The Reports table comprises the following columns:

Column
Description

Report

The name of the report

Description

The description of the report

Status

The report's lifecycle phase, representing its condition (see Reports Status below)

Type

The type of the report – e.g., consumption

Created by

The user who created the report

Creation time

The timestamp of when the report was created

Collection period

The period in which the data was collected

Reports Status

The following table describes the reports' condition and whether they were created successfully:

Status
Description

Ready

Report is ready and can be downloaded as CSV

Pending

Report is in the queue and waiting to be processed

Failed

The report couldn’t be created

Processing...

The report is being created

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

Creating a New Report

Before you start, make sure you have a project.

To create a new report:

  1. Click +NEW REPORT

  2. Enter a name for the report (if the name already exists, you will need to choose a different one)

  3. Optional: Provide a description of the report

  4. Set the report’s data collection period

    • Start date - The date at which the report data commenced

    • End date - The date at which the report data concluded

  5. Set the report segmentation and filters

    • Filters - Filter by project or department name

    • Segment by - Data is collected and aggregated based on the segment

  6. Click CREATE REPORT

Deleting a Report

  1. Select the report you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm

Downloading a report

Note

To download, the report must be in status “Ready”.

  1. Select the report you want to download

  2. Click DOWNLOAD CSV

Enabling Reports for Self-Hosted Accounts

Reports must be saved in an S3-compatible storage solution. To activate this feature for self-hosted accounts, the storage must be linked to the account by adding the configuration to two ConfigMap objects in the control plane.

  1. Edit the runai-backend-org-unit-service ConfigMap:

    kubectl edit cm runai-backend-org-unit-service -n runai-backend
  2. Add the following lines to the file:

    S3_ENDPOINT: <S3_END_POINT_URL>
    S3_ACCESS_KEY_ID: <S3_ACCESS_KEY_ID>
    S3_ACCESS_KEY: <S3_ACCESS_KEY>
    S3_USE_SSL: "true"
    S3_BUCKET: <BUCKET_NAME>
  3. Edit the runai-backend-metrics-service ConfigMap:

    kubectl edit cm runai-backend-metrics-service -n runai-backend
  4. Add the following lines to the file:

    S3_ENDPOINT: <S3_END_POINT_URL>
    S3_ACCESS_KEY_ID: <S3_ACCESS_KEY_ID>
    S3_ACCESS_KEY: <S3_ACCESS_KEY>
    S3_USE_SSL: "true"
  5. In addition, in the same file, under the config.yaml section, add the following immediately after log_level: \"Info\" (the rendered result is shown after these steps):

    reports:\n s3_config:\n bucket: \"<BUCKET_NAME>\"\n
  6. Restart the deployments:

    kubectl rollout restart deployment runai-backend-metrics-service runai-backend-org-unit-service -n runai-backend
  7. Refresh the page to see Reports under Analytics in the NVIDIA Run:ai platform.
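For clarity, the escaped string in step 5 is presumably intended to render, inside the config.yaml value, as the following nested YAML (the bucket name is the same placeholder as above):

reports:
  s3_config:
    bucket: "<BUCKET_NAME>"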

Using API

To view the available actions, go to the Reports API reference.

How the Scheduler Works

Efficient resource allocation is critical for managing AI and compute-intensive workloads in Kubernetes clusters. The NVIDIA Run:ai Scheduler enhances Kubernetes' native capabilities by introducing advanced scheduling principles such as fairness, quota management, and dynamic resource balancing. It ensures that workloads, whether simple single-pod or complex distributed tasks, are allocated resources effectively while adhering to organizational policies and priorities.

This guide explores the NVIDIA Run:ai Scheduler’s allocation process, preemption mechanisms, and resource management. Through examples and detailed explanations, you'll gain insights into how the Scheduler dynamically balances workloads to optimize cluster utilization and maintain fairness across projects and departments.

Allocation Process

Pod Creation and Grouping

When a workload is submitted, the workload controller creates a pod or pods (for distributed training workloads or deployment-based inference). When the Scheduler receives a submit request with the first pod, it creates a pod group and allocates all the relevant building blocks of that workload. Subsequent pods of the same workload are attached to the same pod group.

Queue Management

A workload, with its associated pod group, is queued in the appropriate scheduling queue. In every scheduling cycle, the Scheduler ranks the order of queues by calculating their precedence for scheduling.

Resource Binding

The next step is for the Scheduler to find nodes for those pods, assign the pods to their nodes (bind operation), and bind other building blocks of the pods such as storage, ingress and so on. If the pod group has a gang scheduling rule attached to it, the Scheduler either allocates and binds all pods together, or puts all of them into pending state. It retries to schedule them all together in the next scheduling cycle. The Scheduler also updates the status of the pods and their associated pod group. Users are able to track the workload submission process both in the CLI or NVIDIA Run:ai UI. For more details on submitting and managing workloads, see Workloads.

Preemption

If the Scheduler cannot find resources for the submitted workload (and all of its associated pods), and the workload deserves resources either because it is under its queue quota or under its queue fairshare, the Scheduler tries to reclaim resources from other queues. If this does not solve the resource issue, the Scheduler tries to preempt lower priority preemptible workloads within the same queue (project).

Reclaim Preemption Between Projects and Departments

Reclaim is an inter-project and inter-department resource balancing action that takes back resources from one project or department that has used them as an over quota. It returns the resources back to a project (or department) that deserves those resources as part of its deserved quota, or to balance fairness between projects (or departments), so a project (or department) does not exceed its fairshare (portion of the unused resources).

This mode of operation means that a lower priority workload submitted in one project (e.g. training) can reclaim resources from a project that runs a higher priority workload (e.g. a preemptible workspace) if fairness balancing is required.

Note

Only preemptible workloads can go over quota, as they are susceptible to reclaim (cross-project preemption) of the over quota resources they are using. The amount of over quota resources a project can gain depends on the over quota weight or quota (if over quota weight is disabled). A department’s over quota is always proportional to its quota.

Priority Preemption Within a Project

Higher priority workloads may preempt lower priority preemptible workloads within the same project/node pool queue. For example, in a project that runs a training workload that exceeds the project quota for a certain node pool, a newly submitted workspace within the same project/node pool may stop (preempt) the training workload if there are not enough over quota resources for the project within that node pool to run both workloads (e.g. workspace using in-quota resources and training using over quota resources).

Note

Workload priority applies only within the same project and does not influence workloads across different projects, where fairness determines precedence.

Quota, Over Quota, and Fairshare

The NVIDIA Run:ai Scheduler strives to ensure fairness between projects and between departments. This means each department and project always strives to get its deserved quota, and unused resources are split between projects according to known rules (e.g. over quota weights).

If a project needs more resources even beyond its fairshare, and the Scheduler finds unused resources that no other project needs, this project can consume resources even beyond its fairshare.

Some scenarios can prevent the Scheduler from fully providing deserved quota and fairness:

  • Fragmentation or other scheduling constraints such as affinities, taints, etc.

  • Some requested resources, such as GPUs and CPU memory, can be allocated, while others, like CPU cores, are insufficient to meet the request. As a result, the Scheduler will place the workload in a pending state until the required resource becomes available.

Example of Splitting Quota

The example below illustrates a split of quota between different projects and departments using several node pools:

The example below illustrates how fairshare is calculated per project/node pool for the above example:

  • For each Project:

    • The over quota (OQ) portion of each project (per node pool) is calculated as:

    [(OQ-Weight) / (Σ Projects OQ-Weights)] x (Unused Resource per node pool)

    • Fairshare is calculated as the sum of quota + over quota.

  • In Project 2, we assume that out of the 36 available GPUs in node pool A, 20 GPUs are currently unused. This means either these GPUs are not part of any project’s quota, or they are part of a project’s quota but not used by any workloads of that project:

    • Project 2 over quota share:

      [(Project 2 OQ-Weight) / (Σ all Projects OQ-Weights)] x (Unused Resource within node pool A)

      [(3) / (2 + 3 + 1)] x (20) = (3/6) x 20 = 10 GPUs

    • Fairshare = deserved quota + over quota = 6 + 10 = 16 GPUs. Similarly, fairshare is also calculated for CPU and CPU memory. The Scheduler can grant a project more resources than its fairshare if the Scheduler finds resources not required by other projects that may deserve those resources.

  • In Project 3, fairshare = deserved quota + over quota = 0 + 3 = 3 GPUs. Project 3 has no guaranteed quota, but it still has a share of the excess resources in node pool A. The NVIDIA Run:ai Scheduler ensures that Project 3 receives its part of the unused resources for over quota, even if this results in reclaiming resources from other projects and preempting preemptible workloads.

Fairshare Balancing

The Scheduler constantly re-calculates the fairshare of each project and department per node pool, represented in the scheduler as queues, resulting in the re-balancing of resources between projects and between departments. This means that a preemptible workload that was granted resources to run in one scheduling cycle, can find itself preempted and go back to pending state while waiting for resources in the next cycle.

A queue, representing a scheduler-managed object for each project or department per node pool, can be in one of 3 states:

  • In-quota: The queue’s allocated resources ≤ queue deserved quota. The Scheduler’s first priority is to ensure each queue receives its deserved quota.

  • Over quota but below fairshare: The queue’s deserved quota < queue’s allocated resources <= queue’s fairshare. The Scheduler tries to find and allocate more resources to queues that need resources beyond their deserved quota and up to their fairshare.

  • Over-fairshare and over quota: The queue’s fairshare < queue’s allocated resources. The Scheduler tries to allocate resources to queues that need even more resources beyond their fairshare.

When re-balancing resources between queues of different projects and departments, the Scheduler goes in the opposite direction, i.e. it first takes resources from over-fairshare queues, then from over quota queues, and finally, in some scenarios, even from queues that are below their deserved quota.

Next Steps

Now that you have gained insights into how the Scheduler dynamically balances workloads to optimize cluster utilization and maintain fairness across projects and departments, you can submit workloads. Before submitting your workloads, it’s important to familiarize yourself with the following key topics:

  • Introduction to workloads - Learn what workloads are and what is supported for both NVIDIA Run:ai and third-party workloads.

  • NVIDIA Run:ai workload types - Explore the various NVIDIA Run:ai workload types available and understand their specific purposes to enable you to choose the most appropriate workload type for your needs.

Workload Priority Control

The workload priority management feature allows you to change the priority of a workload within a project. The priority determines the workload's position in the project scheduling queue managed by the NVIDIA Run:ai Scheduler. By adjusting the priority, you can increase the likelihood that a workload will be scheduled and preferred over others within the same project, ensuring that critical tasks are given higher priority and resources are allocated efficiently.

You can change the priority of a workload by selecting one of the predefined values from the NVIDIA Run:ai priority dictionary. This can be done using the NVIDIA Run:ai UI, API or CLI, depending on the workload type.

Note

This applies only within a single project. It does not impact the scheduling queues or workloads of other projects.

Priority Dictionary

Workload priority is defined by selecting a string name from a predefined list in the NVIDIA Run:ai priority dictionary. Each string corresponds to a specific Kubernetes PriorityClass, which in turn determines scheduling behavior, such as whether the workload is preemptible or allowed to run over quota.
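If you want to see which PriorityClass objects exist in your cluster, you can use a standard kubectl query (not a NVIDIA Run:ai-specific command; the NVIDIA Run:ai classes appear alongside any others defined in the cluster):

kubectl get priorityclasses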

Note

The numeric priority levels (1 = highest, 4 = lowest) are descriptive only and are not part of the NVIDIA Run:ai priority dictionary.

Priority Level
Name (string)
Preemption
Over Quota

1

inference

Non-preemptible

Not available

2

build

Non-preemptible

Not available

3

interactive-preemptible

Preemptible

Available

4

train

Preemptible

Available

Preemptible vs Non-Preemptible Workloads

  • Non-preemptible workloads must run within the project’s deserved quota, cannot use over-quota resources, and will not be interrupted once scheduled.

  • Preemptible workloads can use opportunistic compute resources beyond the project’s quota but may be interrupted at any time.

Default Priority per Workload

Both NVIDIA Run:ai and third-party workloads are assigned a default priority. The below table shows the default priority per workload type:

Workload Type
Default Priority

Workspaces

build

Training

train

Inference

inference

Third-party workloads

train

NVIDIA Cloud Functions (NVCF)

inference

Supported Priority Overrides per Workload

Note

Changing a workload’s priority may impact its ability to be scheduled. For example, switching a workload from a train priority (which allows over-quota usage) to build priority (which requires in-quota resources) may reduce its chances of being scheduled in cases where the required quota is unavailable.

The below table shows the default priority listed in the previous section and the supported override options per workload:

Workload Type
interactive-preemptible
build
train
inference

How to Override Priority

You can override the default priority when submitting a workload through the UI, API, or CLI depending on the workload type.

Workspaces

To use the override options:

  • UI: Enable "Allow the workload to exceed the project quota" when submitting a workspace

  • API: Set PriorityClass in the Workspaces API

  • CLI: Submit a workspace using the --priority flag

    runai workspace submit --priority priority-class

Training Workloads

To use the override options:

  • API: Set PriorityClass in the Trainings API

  • CLI: Submit training using the --priority flag

    runai training submit --priority priority-class

Set Up SSO with OpenShift

Single Sign-On (SSO) is an authentication scheme, allowing users to log-in with a single pair of credentials to multiple, independent software systems.

This article explains the procedure to configure single sign-on to NVIDIA Run:ai using the OpenID Connect protocol in OpenShift V4.

Prerequisites

Before starting, make sure you have the following available from your OpenShift cluster:

  • OpenShift OAuth client:

    • ClientID - The ID used to identify the client with the Authorization Server.

    • Client Secret - A secret password that only the Client and Authorization Server know.

  • Base URL - The OpenShift API Server endpoint (for example, https://api.<cluster-url>:6443)

Setup

Adding the Identity Provider

  1. Go to General settings

  2. Open the Security section and click +IDENTITY PROVIDER

  3. Select OpenShift V4

  4. Enter the Base URL, Client ID, and Client Secret from your OpenShift OAuth client.

  5. Copy the Redirect URL to be used in your OpenShift OAuth client

  6. Optional: Enter the user attributes and their value in the identity provider as shown in the below table

  7. Click SAVE

  8. Optional: Enable Auto-Redirect to SSO to automatically redirect users to your configured identity provider’s login page when accessing the platform.

Attribute
Default value in NVIDIA Run:ai
Description

User role groups

GROUPS

If it exists in the IDP, it allows you to assign NVIDIA Run:ai role groups via the IDP. The IDP attribute must be a list of strings.

Linux User ID

UID

If it exists in the IDP, it allows researcher containers to start with the Linux User UID. Used to map access to network resources such as file systems to users. The IDP attribute must be of type integer.

Linux Group ID

GID

If it exists in the IDP, it allows researcher containers to start with the Linux Group GID. The IDP attribute must be of type integer.

Supplementary Groups

SUPPLEMENTARYGROUPS

If it exists in the IDP, it allows researcher containers to start with the relevant Linux supplementary groups. The IDP attribute must be a list of integers.

Email

email

Defines the user attribute in the IDP holding the user's email address, which is the user identifier in NVIDIA Run:ai

User first name

firstName

Used as the user’s first name appearing in the NVIDIA Run:ai platform

User last name

lastName

Used as the user’s last name appearing in the NVIDIA Run:ai platform

Testing the Setup

  1. Open the NVIDIA Run:ai platform as an admin

  2. Add access rules to an SSO user defined in the IDP

  3. Open the NVIDIA Run:ai platform in an incognito browser tab

  4. On the sign-in page, click CONTINUE WITH SSO. You are redirected to the OpenShift IDP sign-in page

  5. In the identity provider sign-in page, log in with the SSO user to whom you granted access rules

  6. If you are unsuccessful signing in to the identity provider, follow the Troubleshooting section below

Editing the Identity Provider

You can view the identity provider details and edit its configuration:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider box, click Edit identity provider

  4. You can edit either the Base URL, Client ID, Client Secret, or the User attributes

Removing the Identity Provider

You can remove the identity provider configuration:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider card, click Remove identity provider

  4. In the dialog, click REMOVE to confirm

Note

To avoid losing access, removing the identity provider must be carried out by a local user.

Troubleshooting

If testing the setup was unsuccessful, try the different troubleshooting scenarios according to the error you received.

Troubleshooting Scenarios

Error: "403 - Sorry, we can’t let you see this page. Something about permissions…"

Description: The authenticated user is missing permissions

Mitigation:

  1. Validate that either the user or its related group(s) are assigned with access rules

  2. Validate groups attribute is available in the configured OIDC Scopes

  3. Validate the user’s groups attribute is mapped correctly

Advanced:

  1. Open the Chrome DevTools: Right-click on page → Inspect → Console tab

  2. Run the following command to retrieve and copy the user’s token: localStorage.token;

  3. Paste the token in https://jwt.io

  4. Under the Payload section validate the value of the user’s attributes

Error: "401 - We’re having trouble identifying your account because your email is incorrect or can’t be found."

Description: Authentication failed because email attribute was not found.

Mitigation:

  1. Validate email attribute is available in the configured OIDC Scopes

  2. Validate the user’s email attribute is mapped correctly

Error: "Unexpected error when authenticating with identity provider"

Description: User authentication failed

Mitigation: Validate that the configured OIDC Scopes exist and match the Identity Provider’s available scopes

Advanced: Look for the specific error message in the URL address

Error: "Unexpected error when authenticating with identity provider (SSO sign-in is not available)"

Description: User authentication failed

Mitigation:

  1. Validate that the configured OIDC scope exists in the Identity Provider

  2. Validate that the configured Client Secret matches the Client Secret value in the OAuthclient Kubernetes object.

Advanced: Look for the specific error message in the URL address

Error: "unauthorized_client"

Description: OIDC Client ID was not found in the OpenShift IDP

Mitigation: Validate that the configured Client ID matches the value in the OAuthclient Kubernetes object

Launching Workloads with GPU Fractions

This quick start provides a step-by-step walkthrough for running a Jupyter Notebook workspace using GPU fractions.

NVIDIA Run:ai’s GPU fractions provides an agile and easy-to-use method to share a GPU or multiple GPUs across workloads. With GPU fractions, you can divide the GPU/s memory into smaller chunks and share the GPU/s compute resources between different workloads and users, resulting in higher GPU utilization and more efficient resource allocation.

Note

If enabled by your Administrator, the NVIDIA Run:ai UI allows you to create a new workload using either of the available submission forms. The steps in this quick start guide reflect the Original form only.

Prerequisites

Before you start, make sure:

  • You have created a project or have one created for you.

  • The project has an assigned quota of at least 0.5 GPU.

Step 1: Logging In

Step 2: Submitting a Workspace

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Workspace

  3. Select under which cluster to create the workload

  4. Select the project in which your workspace will run

  5. Select a preconfigured template or select Start from scratch to launch a new workspace quickly

  6. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Select the ‘jupyter-lab’ environment for your workspace (Image URL: jupyter/scipy-notebook)

    • If the ‘jupyter-lab’ is not displayed in the gallery, follow the below steps:

      • Click +NEW ENVIRONMENT

      • Enter a name for the environment. The name must be unique.

      • Enter the jupyter-lab Image URL - jupyter/scipy-notebook

      • Tools - Set the connection for your tool

        • Click +TOOL

        • Select Jupyter tool from the list

      • Set the runtime settings for the environment

        • Click +COMMAND

        • Enter command - start-notebook.sh

        • Enter arguments - --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''

        Note: If is enabled on the cluster, enter the --NotebookApp.token='' only.

      • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  9. Select the ‘small-fraction’ compute resource for your workspace (GPU % of devices: 10)

    • If ‘small-fraction’ is not displayed in the gallery, follow the below steps:

      • Click +NEW COMPUTE RESOURCE

        • Enter a name for the compute resource. The name must be unique.

        • Set GPU devices per pod - 1

        • Set GPU memory per device

          • Select % (of device) - Fraction of a GPU device’s memory

          • Set the memory Request - 10 (the workload will allocate 10% of the GPU memory)

        • Optional: set the CPU compute per pod - 0.1 cores (default)

        • Optional: set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  10. Click CREATE WORKSPACE

Alternatively, you can submit the workspace using the CLI or the API. If you use the CLI, make sure to update the command with the name of your project and workload. If you use the API, make sure to update the following parameters:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the .

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the .

  • toolType will show the Jupyter icon when connecting to the Jupyter tool via the user interface.

  • toolName will show when connecting to the Jupyter tool via the user interface.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 3: Connecting to the Jupyter Notebook

  1. Select the newly created workspace with the Jupyter application that you want to connect to

  2. Click CONNECT

  3. Select the Jupyter tool. The selected tool is opened in a new tab on your browser.

To connect to the Jupyter Notebook, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>

Next Steps

Manage and monitor your newly created workload using the Workloads table.

GPU Memory Swap

NVIDIA Run:ai’s GPU memory swap helps administrators and AI practitioners to further increase the utilization of their existing GPU hardware by improving GPU sharing between AI initiatives and stakeholders. This is done by expanding the GPU physical memory to the CPU memory, typically an order of magnitude larger than that of the GPU.

Expanding the GPU physical memory helps the NVIDIA Run:ai system to put more workloads on the same GPU physical hardware, and to provide a smooth workload context switching between GPU memory and CPU memory, eliminating the need to kill workloads when the memory requirement is larger than what the GPU physical memory can provide.

Benefits of GPU Memory Swap

There are several use cases where GPU memory swap can benefit and improve the user experience and the system's overall utilization.

Sharing a GPU Between Multiple Interactive Workloads (Notebooks)

AI practitioners use notebooks to develop and test new AI models and to improve existing AI models. While developing or testing an AI model, notebooks use GPU resources intermittently, yet the GPU resources they request are pre-allocated by the notebook and cannot be used by other workloads after the notebook has reserved them. To overcome this inefficiency, NVIDIA Run:ai introduced dynamic GPU fractions and the Node Level Scheduler.

When one or more workloads require more than their requested GPU resources, there’s a high probability not all workloads can run on a single GPU because the total memory required is larger than the physical size of the GPU memory.

With GPU memory swap, several workloads can run on the same GPU, even if the sum of their used memory is larger than the size of the physical GPU memory. GPU memory swap can swap in and out workloads interchangeably, allowing multiple workloads to each use the full amount of GPU memory. The most common scenario is for one workload to run on the GPU (for example, an interactive notebook), while other notebooks are either idle or using the CPU to develop new code (while not using the GPU). From a user experience point of view, the swap in and out is a smooth process since the notebooks do not notice that they are being swapped in and out of the GPU memory. On rare occasions, when multiple notebooks need to access the GPU simultaneously, slower workload execution may be experienced.

Notebooks typically use the GPU intermittently; therefore, with high probability, only one workload (for example, an interactive notebook) will use the GPU at a time. The more notebooks the system puts on a single GPU, the higher the chance that more than one notebook will require the GPU resources at the same time. Admins have a significant role here in fine-tuning the number of notebooks running on the same GPU, based on specific usage patterns and required SLAs. Using the Node Level Scheduler reduces GPU access contention between different interactive notebooks running on the same node.

Sharing a GPU Between Inference/Interactive Workloads and Training Workloads

A single GPU can be shared between an interactive or inference workload (for example, a Jupyter notebook, image recognition services, or an LLM service), and a training workload that is not time-sensitive or delay-sensitive. At times when the inference/interactive workload uses the GPU, both training and inference/interactive workloads share the GPU resources, each running part of the time swapped-in to the GPU memory, and swapped-out into the CPU memory the rest of the time.

Whenever the inference/interactive workload stops using the GPU, the swap mechanism swaps out the inference/interactive workload GPU data to the CPU memory. Kubernetes wise, the pod is still alive and running using the CPU. This allows the training workload to run faster when the inference/interactive workload is not using the GPU, and slower when it does, thus sharing the same resource between multiple workloads, fully utilizing the GPU at all times, and maintaining uninterrupted service for both workloads.

Serving Inference Warm Models with GPU Memory Swap

Running multiple inference models is a demanding task and you will need to ensure that your SLA is met. You need to provide high performance and low latency, while maximizing GPU utilization. This becomes even more challenging when the exact model usage patterns are unpredictable. You must plan for the agility of inference services and strive to keep models on standby in a ready state rather than an idle state.

NVIDIA Run:ai’s GPU memory swap feature enables you to load multiple models to a single GPU, where each can use up to the full amount of GPU memory. Using an application load balancer, the administrator can control to which server each inference request is sent. The GPU can then be loaded with multiple models, where the model in use is loaded into the GPU memory and the rest of the models are swapped out to the CPU memory. The swapped models are stored as ready models to be loaded when required. GPU memory swap always maintains the context of the workload (model) on the GPU so it can easily and quickly switch between models. This is unlike industry-standard model servers that load models from scratch into the GPU whenever required.

How GPU Memory Swap Works

Swapping the workload’s GPU memory to and from the CPU is performed simultaneously and synchronously for all GPUs used by the workload. In some cases, if workloads specify a memory limit smaller than a full GPU memory size, multiple workloads can run in parallel on the same GPUs, maximizing the utilization and shortening the response times.

In other cases, workloads will run serially, with each workload running for a few seconds before the system swaps them in/out. If multiple workloads occupy more than the GPU physical memory and attempt to run simultaneously, memory swapping will occur. In this scenario, each workload will run part of the time on the GPU while being swapped out to the CPU memory the other part of the time, slowing down the execution of the workloads. Therefore, it is important to evaluate whether memory swapping is suitable for your specific use cases, weighing the benefits against the potential for slower execution time. To better understand the benefits and use cases of GPU memory swap, refer to the detailed sections below. This will help you determine how to best utilize GPU swap for your workloads and achieve optimal performance.

The workload MUST use dynamic GPU fractions. This means the workload’s memory Request is less than a full GPU, but it may add a GPU memory Limit to allow the workload to effectively use the full GPU memory. The NVIDIA Run:ai Scheduler allocates the dynamic fraction pair (Request and Limit) on single or multiple GPU devices in the same node.

The administrator must label each node on which they want to provide GPU memory swap with run.ai/swap-enabled=true to enable the feature on that node. Enabling the feature reserves CPU memory to serve the swapped GPU memory from all GPUs on that node. The administrator sets the size of the reserved CPU RAM memory using the runaiconfig file, as detailed in Enabling and Configuring GPU Memory Swap below.
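For example, a node can be enabled for swap with a standard kubectl label command (the node name is a placeholder):

kubectl label node <node-name> run.ai/swap-enabled=true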

Optionally, you can also configure the Node Level Scheduler:

  • The Node Level Scheduler automatically spreads workloads between the different GPUs on a node, ensuring maximum workload performance and GPU utilization.

  • In scenarios where Interactive notebooks are involved, if the CPU reserved memory for the GPU swap is full, the Node Level Scheduler preempts the GPU process of that workload and potentially routes the workload to another GPU to run.

Multi-GPU Memory Swap

NVIDIA Run:ai also supports workload submission using multi-GPU memory swap. Multi-GPU memory swap works similarly to single GPU memory swap, but instead of swapping memory for a single GPU workload, it swaps memory for workloads across multiple GPUs simultaneously and synchronously.

The NVIDIA Run:ai Scheduler allocates the same dynamic GPU fraction pair (Request and Limit) on multiple GPU devices in the same node. For example, if you want to run two LLM models, each consuming 8 GPUs that are not used simultaneously, you can use GPU memory swap to share their GPUs. This approach allows multiple models to be stacked on the same node.

The following outlines the advantages of stacking multiple models on the same node:

  • Maximizes GPU utilization - Efficiently uses available GPU resources by enabling multiple workloads to share GPUs.

  • Improves cold start times - Loading large LLM models to a node and its GPUs can take several minutes during a “cold start”. Using memory swap turns this process into a “warm start” that takes only a fraction of a second to a few seconds (depending on the model size and the GPU model).

  • Increases GPU availability - Frees up and maximizes GPU availability for additional workloads (and users), enabling better resource sharing.

  • Smaller quota requirements - Enables more precise and often smaller quota requirements for the end user.

Deployment Considerations

  • A pod created before the GPU memory swap feature was enabled in that cluster, cannot be scheduled to a swap-enabled node. A proper event is generated in case no matching node is found. Users must re-submit those pods to make them swap-enabled.

  • GPU memory swap cannot be enabled if NVIDIA Run:ai strict or fair time-slicing is used. GPU memory swap can only be used with the default NVIDIA time-slicing mechanism.

  • CPU RAM size cannot be decreased once GPU memory swap is enabled.

Enabling and Configuring GPU Memory Swap

Before configuring GPU memory swap, dynamic GPU fractions must be enabled. You can also configure and use Node Level Scheduler. Dynamic GPU fractions enable you to make your workloads burstable, while both features will maximize your workloads’ performance and GPU utilization within a single node.

To enable GPU memory swap in a NVIDIA Run:ai cluster:

  1. Edit the runaiconfig file with the following parameters. This example uses 100Gi as the size of the swap memory. For more details, see Advanced cluster configurations:

    spec:
      global:
        core:
          swap:
            enabled: true
            limits:
              cpuRam: 100Gi

  2. Or, use the following patch command from your terminal:

    kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' --patch '{"spec":{"global":{"core":{"swap":{"enabled": true, "limits": {"cpuRam": "100Gi"}}}}}}'

Configuring System Reserved GPU Resources

Swappable workloads require reserving a small part of the GPU for non-swappable allocations such as binaries and the GPU context. To avoid out-of-memory (OOM) errors due to non-swappable memory regions, the system reserves 2GiB of GPU RAM memory by default, effectively truncating the total size of the GPU memory. For example, a 16GiB T4 will appear as 14GiB on a swap-enabled node. The exact reserved size is application-dependent, and 2GiB is a safe assumption for 2-3 applications sharing and swapping on a GPU. This value can be changed by:

  1. Editing the runaiconfig as follows:

    spec:
      global:
        core:
          swap:
            limits:
              reservedGpuRam: 2Gi

  2. Or, using the following patch command from your terminal:

    kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' --patch '{"spec":{"global":{"core":{"swap":{"limits":{"reservedGpuRam": <quantity>}}}}}}'

Preventing Your Workloads from Getting Swapped

If you prefer your workloads not to be swapped into CPU memory, you can specify on the pod an anti-affinity to the run.ai/swap-enabled=true node label when submitting your workloads (see the sketch below), and the Scheduler will ensure not to use swap-enabled nodes. An alternative way is to set swap on a dedicated node pool and not use this node pool for workloads you prefer not to swap.
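A minimal sketch of such an anti-affinity, expressed as standard Kubernetes node affinity in the pod spec (plain Kubernetes syntax, not a NVIDIA Run:ai-specific field):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: run.ai/swap-enabled
              operator: NotIn
              values:
                - "true"

With this in place, the pod is only scheduled onto nodes that do not carry the run.ai/swap-enabled=true label.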

What Happens When the CPU Reserved Memory for GPU Swap is Exhausted?

CPU memory is limited, and a single node's CPU memory serves multiple GPUs (usually between 2 to 8 GPUs per node). For example, when using 80GB of GPU memory, each swapped workload consumes up to 80GB (but may use less), assuming each GPU is shared between 2-4 workloads. In this example, you can see how the swap memory can become very large. Therefore, we give administrators a way to limit the size of the CPU reserved memory for swapped GPU memory on each swap-enabled node, as shown in Enabling and Configuring GPU Memory Swap above.

Limiting the CPU reserved memory means that there may be scenarios where the GPU memory cannot be swapped out to the CPU reserved RAM. Whenever the CPU reserved memory for swapped GPU memory is exhausted, the workloads currently running are not swapped out to the CPU reserved RAM; instead, Node Level Scheduler logic (if enabled) takes over and provides GPU resource optimization.

Logs Collection

This section provides instructions for IT administrators on collecting NVIDIA Run:ai logs for support, including prerequisites, CLI commands, and log file retrieval. It also covers enabling verbose logging for Prometheus and the NVIDIA Run:ai Scheduler.

Collect Logs to Send to Support

To collect NVIDIA Run:ai logs, follow these steps:

Prerequisites

  • Ensure that you have administrator-level access to the Kubernetes cluster where NVIDIA Run:ai is installed.

  • The NVIDIA Run:ai Administrator Command-Line Interface (CLI) must be installed.

Step-by-step Instructions

  1. Run the command from your local machine or a bastion host (secure server). Open a terminal on your local machine (or any machine that has network access to the Kubernetes cluster) where the NVIDIA Run:ai Administrator CLI is installed.

  2. Collect the Logs. Execute the following command to collect the logs:

    runai-adm collect-logs

    This command gathers all relevant NVIDIA Run:ai logs from the system and generates a compressed file.

  3. Locate the Generated File. After running the command, note the location of the generated compressed log file. You can retrieve this file and send it to NVIDIA Run:ai Support for further troubleshooting.

Note

The tar file packages the logs of NVIDIA Run:ai components only. It does not include logs of researcher containers that may contain private information.

Logs Verbosity

Increase log verbosity to capture more detailed information, providing deeper insights into system behavior and making it easier to identify and resolve issues.

Prerequisites

Before you begin, ensure you have the following:

  • Access to the Kubernetes cluster where NVIDIA Run:ai is installed

    • Including the necessary permissions to view and modify configurations.

  • kubectl installed and configured:

    • The Kubernetes command-line tool, kubectl, must be installed and configured to interact with the cluster.

    • Sufficient privileges to edit configurations and view logs.

  • Monitoring Disk Space

    • When enabling verbose logging, ensure adequate disk space to handle the increased log output, especially when enabling debug or high verbosity levels.

Adding Verbosity

Adding verbosity to Prometheus

To increase the logging verbosity for Prometheus, follow these steps:

  1. Edit the RunaiConfig to adjust Prometheus log levels. Copy the following command to your terminal:

    kubectl edit runaiconfig runai -n runai

  2. In the configuration file that opens, add or modify the following section to set the log level to debug:

    spec:
      prometheus:
        spec:
          logLevel: debug

  3. Save the changes. To view the Prometheus logs with the new verbosity level, run:

    kubectl logs -n runai prometheus-runai-0

    This command streams the last 100 lines of logs from Prometheus, providing detailed information useful for debugging.

Adding verbosity to the Scheduler

To enable extended logging for the NVIDIA Run:ai scheduler:

  1. Edit the RunaiConfig to adjust scheduler verbosity:

    kubectl edit runaiconfig runai -n runai

  2. Add or modify the following section under the scheduler settings:

    runai-scheduler:
      args:
        verbosity: 6

    This increases the verbosity level of the scheduler logs to provide more detailed output.

Warning: Enabling verbose logging can significantly increase disk space usage. Monitor your storage capacity and adjust the verbosity level as necessary.


Install Using Helm

System and Network Requirements

Before installing the NVIDIA Run:ai cluster, validate that the system requirements and network requirements are met. For air-gapped environments, make sure you have the software artifacts prepared.

Once all the requirements are met, it is highly recommended to use the NVIDIA Run:ai cluster preinstall diagnostics tool to:

  • Test the below requirements in addition to failure points related to Kubernetes, NVIDIA, storage, and networking

  • Look at additional components installed and analyze their relevance to a successful installation

For more information, see preinstall diagnostics. To run the preinstall diagnostics tool, download the latest version, and run:

chmod +x ./preinstall-diagnostics-<platform> && \ 
./preinstall-diagnostics-<platform> \
  --domain ${CONTROL_PLANE_FQDN} \
  --cluster-domain ${CLUSTER_FQDN} \
#if the diagnostics image is hosted in a private registry
  --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
  --image ${PRIVATE_REGISTRY_IMAGE_URL}    

In an air-gapped deployment, the diagnostics image is saved, pushed, and pulled manually from the organization's registry.

#Save the image locally
docker save --output preinstall-diagnostics.tar gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION}
#Load the image locally, then tag and push it to the organization's registry
docker load --input preinstall-diagnostics.tar
docker tag gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION} ${CLIENT_IMAGE_AND_TAG} 
docker push ${CLIENT_IMAGE_AND_TAG}

Run the binary with the --image parameter to modify the diagnostics image to be used:

chmod +x ./preinstall-diagnostics-darwin-arm64 && \
./preinstall-diagnostics-darwin-arm64 \
  --domain ${CONTROL_PLANE_FQDN} \
  --cluster-domain ${CLUSTER_FQDN} \
  --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
  --image ${PRIVATE_REGISTRY_IMAGE_URL}    

Helm

NVIDIA Run:ai requires Helm 3.14 or later. To install Helm, see Installing Helm. If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai tar file contains the helm binary.
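For reference, a common way to install Helm on a connected machine is the official Helm install script; shown here as a sketch, see the Helm documentation for the authoritative steps:

# Download and run the official Helm 3 install script, then verify the installed version
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
helm version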

Permissions

A Kubernetes user with the cluster-admin role is required to ensure a successful installation. For more information, see Using RBAC authorization.
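If the installing user does not already hold the cluster-admin role, a cluster administrator can grant it with a standard RBAC binding. A minimal sketch, using a hypothetical user name:

# Bind the built-in cluster-admin ClusterRole to the installing user (user name is hypothetical)
kubectl create clusterrolebinding runai-installer-admin \
  --clusterrole=cluster-admin \
  --user=installer@example.com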

Installation

Kubernetes

Connected

Follow the steps below to add a new cluster.

Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

  1. In the NVIDIA Run:ai platform, go to Resources

  2. Click +NEW CLUSTER

  3. Enter a unique name for your cluster

  4. Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)

  5. Enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.

  6. Click Continue

Installing NVIDIA Run:ai cluster

The next section presents the NVIDIA Run:ai cluster installation steps.

  1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

  2. Click DONE

The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.

Tip: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.

Air-gapped

Follow the steps below to add a new cluster.

Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

  1. In the NVIDIA Run:ai platform, go to Resources

  2. Click +NEW CLUSTER

  3. Enter a unique name for your cluster

  4. Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)

  5. Enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.

  6. Click Continue

Installing NVIDIA Run:ai cluster

The next section presents the NVIDIA Run:ai cluster installation steps.

  1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

  2. On the second tab of the cluster wizard, when copying the helm command for installation, you will need to use the pre-provided installation file instead of using helm repositories. As such:

    • Do not add the helm repository and do not run helm repo update.

    • Instead, edit the helm upgrade command.

      • Replace runai/runai-cluster with runai-cluster-<VERSION>.tgz.

      • Add --set global.image.registry=<DOCKER REGISTRY ADDRESS> where the registry address is as entered in the preparation section

      • Add --set global.customCA.enabled=true as described here

    The command should look like the following:

    helm upgrade -i runai-cluster runai-cluster-<VERSION>.tgz \
        --set controlPlane.url=... \
        --set controlPlane.clientSecret=... \
        --set cluster.uid=... \
        --set cluster.url=... --create-namespace \
        --set global.image.registry=registry.mycompany.local \
        --set global.customCA.enabled=true
  3. Click DONE

The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.

Tip: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.
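For example, the helm upgrade command shown above can be rendered without installing anything by appending --dry-run; the values below are placeholders:

helm upgrade -i runai-cluster runai-cluster-<VERSION>.tgz \
    --set controlPlane.url=... \
    --set controlPlane.clientSecret=... \
    --set cluster.uid=... \
    --set cluster.url=... --create-namespace \
    --set global.image.registry=registry.mycompany.local \
    --set global.customCA.enabled=true \
    --dry-run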

OpenShift

Connected

Follow the steps below to add a new cluster.

Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

  1. In the NVIDIA Run:ai platform, go to Resources

  2. Click +NEW CLUSTER

  3. Enter a unique name for your cluster

  4. Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)

  5. Enter the Cluster URL

  6. Click Continue

Installing NVIDIA Run:ai cluster

The next section presents the NVIDIA Run:ai cluster installation steps.

  1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

  2. Click DONE

The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.

Air-gapped

When creating a new cluster, select the OpenShift target platform.

Follow the steps below to add a new cluster.

Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

  1. In the NVIDIA Run:ai platform, go to Resources

  2. Click +NEW CLUSTER

  3. Enter a unique name for your cluster

  4. Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)

  5. Enter the Cluster URL

  6. Click Continue

Installing NVIDIA Run:ai cluster

The next section presents the NVIDIA Run:ai cluster installation steps.

  1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

  2. On the second tab of the cluster wizard, when copying the helm command for installation, you will need to use the pre-provided installation file instead of using helm repositories. As such:

    • Do not add the helm repository and do not run helm repo update.

    • Instead, edit the helm upgrade command.

      • Replace runai/runai-cluster with runai-cluster-<VERSION>.tgz.

      • Add --set global.image.registry=<DOCKER REGISTRY ADDRESS> where the registry address is as entered in the preparations section

      • Add --set global.customCA.enabled=true as described here

    The command should look like the following:

    helm upgrade -i runai-cluster runai-cluster-<VERSION>.tgz \
        --set controlPlane.url=... \
        --set controlPlane.clientSecret=... \
        --set cluster.uid=... \
        --set cluster.url=... --create-namespace \
        --set global.image.registry=registry.mycompany.local \
        --set global.customCA.enabled=true
  3. Click DONE

The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.

Note

To customize the installation based on your environment, see Customized installation.

Troubleshooting

If you encounter an issue with the installation, try the troubleshooting scenario below.

Installation

If the NVIDIA Run:ai cluster installation failed, check the installation logs to identify the issue. Run the following script to print the installation logs:

curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh
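One way to fetch and run the script, shown as a sketch (adjust to your environment):

# Download the log-collection script, make it executable, and run it
curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh \
  -o get-installation-logs.sh
chmod +x get-installation-logs.sh
./get-installation-logs.sh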

Cluster Status

If the NVIDIA Run:ai cluster installation completed, but the cluster status did not change to Connected, check the cluster troubleshooting scenarios.

Preparations

The following section provides the information needed to prepare for a NVIDIA Run:ai installation.

Software Artifacts

The following software artifacts should be used when installing the control plane and cluster.

Kubernetes

Connected

You will receive a token from NVIDIA Run:ai to access the NVIDIA Run:ai container registry. Use the following command to create the required Kubernetes secret:

kubectl create secret docker-registry runai-reg-creds  \
--docker-server=https://runai.jfrog.io \
--docker-username=self-hosted-image-puller-prod \
--docker-password=<TOKEN> \
--docker-email=<EMAIL> \
--namespace=runai-backend
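The secret is created in the runai-backend namespace, so the namespace must exist first. A quick check, assuming kubectl access to the cluster and that your installation flow has not already created the namespace:

# Create the namespace if it does not exist yet, then verify the registry secret is present
kubectl create namespace runai-backend
kubectl get secret runai-reg-creds -n runai-backend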
Air-gapped

You should receive a single file runai-airgapped-package-<VERSION>.tar.gz from NVIDIA Run:ai customer support.

NVIDIA Run:ai assumes the existence of a Docker registry for images most likely installed within the organization. The installation requires the network address and port for the registry (referenced below as <REGISTRY_URL>).

SSH into a node with kubectl access to the cluster and Docker installed. To extract the NVIDIA Run:ai files, replace <VERSION> in the command below and run:

tar xvf runai-airgapped-package-<VERSION>.tar.gz

Upload images

  1. Upload images to a local Docker Registry. Set the Docker Registry address in the form of NAME:PORT (do not add https):

export REGISTRY_URL=<DOCKER REGISTRY ADDRESS>
  2. Run the following script. You must have at least 20GB of free disk space to run it. If Docker is configured to run as non-root, sudo is not required:

sudo ./setup.sh

The script should create a file named custom-env.yaml which will be used during control plane installation.
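To spot-check that the images were pushed successfully, you can query the registry catalog. This assumes the registry exposes the standard Docker Registry v2 API and that your current credentials or anonymous access allow it:

# List the repositories in the local registry (add credentials or TLS flags as your registry requires)
curl -s http://${REGISTRY_URL}/v2/_catalog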

OpenShift

Connected

You will receive a token from NVIDIA Run:ai to access the NVIDIA Run:ai container registry. Use the following command to create the required Kubernetes secret:

oc create secret docker-registry runai-reg-creds  \
--docker-server=https://runai.jfrog.io \
--docker-username=self-hosted-image-puller-prod \
--docker-password=<TOKEN> \
--docker-email=<EMAIL> \
--namespace=runai-backend
Air-gapped

You should receive a single file runai-airgapped-package-<VERSION>.tar.gz from NVIDIA Run:ai customer support.

NVIDIA Run:ai assumes the existence of a Docker registry for images most likely installed within the organization. The installation requires the network address and port for the registry (referenced below as <REGISTRY_URL>).

SSH into a node with oc access to the cluster and Docker installed. To extract the NVIDIA Run:ai files, replace <VERSION> in the command below and run:

tar xvf runai-airgapped-package-<VERSION>.tar.gz

Upload images

  1. Upload images to a local Docker Registry. Set the Docker Registry address in the form of NAME:PORT (do not add https):

export REGISTRY_URL=<DOCKER REGISTRY ADDRESS>
  2. Run the following script. You must have at least 20GB of free disk space to run it. If Docker is configured to run as non-root, sudo is not required:

sudo ./setup.sh

The script should create a file named custom-env.yaml which will be used by the control plane installation.

Private Docker Registry (Optional)

Kubernetes

To access the organization's docker registry it is required to set the registry's credentials (imagePullSecret).

Create the secret named runai-reg-creds based on your existing credentials. For more information, see Pull an Image from a Private Registry.
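A minimal sketch of creating such a secret from explicit credentials; the server, user, password, email, and namespace values are placeholders and should match your registry and the namespace expected by your installation:

kubectl create secret docker-registry runai-reg-creds \
  --docker-server=<REGISTRY_URL> \
  --docker-username=<REGISTRY_USER> \
  --docker-password=<REGISTRY_PASSWORD> \
  --docker-email=<EMAIL> \
  --namespace=<NAMESPACE>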

OpenShift

To access the organization's docker registry it is required to set the registry's credentials (imagePullSecret).

Create the secret named runai-reg-creds in the runai-backend namespace based on your existing credentials. The configuration will be copied over to the runai namespace at cluster install. For more information, see Allowing pods to reference images from other secured registries.

Set Up Your Environment

External Postgres Database (Optional)

If you have opted to use an external PostgreSQL database, you need to perform initial setup to ensure successful installation. Follow these steps:

  1. Create a SQL script file, edit the parameters below, and save it locally:

    • Replace <DATABASE_NAME> with a dedicated database name for NVIDIA Run:ai in your PostgreSQL database.

    • Replace <ROLE_NAME> with a dedicated role name (user) for the NVIDIA Run:ai database.

    • Replace <ROLE_PASSWORD> with a password for the new PostgreSQL role.

    • Replace <GRAFANA_PASSWORD> with the password to be set for Grafana integration.

    -- Create a new database for runai
    CREATE DATABASE <DATABASE_NAME>; 
    
    -- Create the role with login and password
    CREATE ROLE <ROLE_NAME>  WITH LOGIN PASSWORD '<ROLE_PASSWORD>'; 
    
    -- Grant all privileges on the database to the role
    GRANT ALL PRIVILEGES ON DATABASE <DATABASE_NAME> TO <ROLE_NAME>; 
    
    -- Connect to the newly created database
    \c <DATABASE_NAME> 
    
    -- grafana
    CREATE ROLE grafana WITH LOGIN PASSWORD '<GRAFANA_PASSWORD>'; 
    CREATE SCHEMA grafana authorization grafana;
    ALTER USER grafana set search_path='grafana';
    -- Exit psql
    \q
  2. Run the following command on a machine where PostgreSQL client (pgsql) is installed:

    • Replace <POSTGRESQL_HOST> with the PostgreSQL ip address or hostname.

    • Replace <POSTGRESQL_USER> with the PostgreSQL username.

    • Replace <POSTGRESQL_PORT> with the port number where PostgreSQL is running.

    • Replace <POSTGRESQL_DB> with the name of your PostgreSQL database.

    • Replace <SQL_FILE> with the path to the SQL script created in the previous step.

    psql --host <POSTGRESQL_HOST> \
    --user <POSTGRESQL_USER> \
    --port <POSTGRESQL_PORT> \
    --dbname <POSTGRESQL_DB> \
    -a -f <SQL_FILE>
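To confirm that the new role and database were created correctly, you can attempt a connection as the new role; this is a quick check rather than a required step, and you will be prompted for <ROLE_PASSWORD>:

psql --host <POSTGRESQL_HOST> --port <POSTGRESQL_PORT> \
  --user <ROLE_NAME> --dbname <DATABASE_NAME> -c '\conninfo'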

What’s New in Version 2.21

The NVIDIA Run:ai v2.21 what's new provides a detailed summary of the latest features, enhancements, and updates introduced in this version. They serve as a guide to help users, administrators, and researchers understand the new capabilities and how to leverage them for improved workload management, resource optimization, and more.

Important

For a complete list of deprecations, see Deprecation notifications. Deprecated features and capabilities remain available for two versions after the notification.

AI Practitioners

Flexible Workload Submission

Streamlined workload submission with a customizable form – The new customizable submission form allows you to submit workloads by selecting and modifying an existing setup or providing your own settings. This enables faster, more accurate submissions that align with organizational policies and individual workload needs. Experimental From cluster v2.18 onward

Feature high level details:

  • Flexible submission options – Choose from an existing setup and customize it, or start from scratch and provide your own settings for a one-time setup.

  • Improved visibility – Review existing setups and understand their associated policy definitions.

  • One-time data sources setup – Configure a data source as part of your one-time setup for a specific workload.

  • Unified experience – Use the new form for all workload types — workspaces, standard training, distributed training, and custom Inference.

Workspaces and Training

  • Support for JAX distributed training workloads – You can now submit distributed training workloads using the JAX framework via the UI, API, and CLI. This enables you to leverage JAX for scalable, high-performance training, making it easier to run and manage JAX-based workloads seamlessly within NVIDIA Run:ai. See Train models using a distributed training workload for more details. From cluster v2.21 onward

  • Pod restart policy for all workload types – A restart policy can be configured to define how pods are restarted when they terminate. The policy is set at the workload level across all workload types via the API and CLI. For distributed training workloads, restart policies can be set separately for master and worker pods. This enhancement ensures workloads are restarted efficiently, minimizing downtime and optimizing resource usage. From cluster v2.21 onward

  • Enhanced failure status details for workloads – When a workload is marked as "Failed", clicking the “i” icon next to the status provides detailed failure reasons, with clear explanations across compute, network, and storage resources. This enhancement improves troubleshooting efficiency, and helps you quickly diagnose and resolve issues, leading to faster workload recovery. From cluster v2.21 onward

  • Workload priority class management for training workloads – You can now change the default priority class of training workloads within a project, via the API or CLI, by selecting from predefined priority class values. This influences the workload’s position in the project scheduling queue managed by the Run:ai Scheduler, ensuring critical training jobs are prioritized and resources are allocated more efficiently. See Workload priority class control for more details. From cluster v2.18 onward

Workload Assets

  • New environment presets – Added new NVIDIA Run:ai environment presets when running in a host-based routing cluster - vscode, rstudio, jupyter-scipy, tensorboard-tensorflow. See Environments for more details. From cluster v2.21 onward

  • Support for PVC size expansion – Adjust the size of Persistent Volume Claims (PVCs) via the Update a PVC asset API, leveraging the allowVolumeExpansion field of the storage class resource. This enhancement enables you to dynamically adjust storage capacity as needed.

  • Improved visibility of storage class configurations – When creating new PVCs or volumes, the UI now displays access modes, volume modes, and size options based on administrator-defined storage class configurations. This update ensures consistency, increases transparency, and helps prevent misconfigurations during setup. From cluster v2.21 onward

  • ConfigMaps as environment variables – Use predefined ConfigMaps as environment variables during environment setup or workload submission. From cluster v2.21 onward

  • Improved scope selection experience – The scope mechanism has been improved to reduce clicks and enhance usability. The organization tree now opens by default at the cluster level for quicker navigation. Scope search now includes alphabetical sorting and supports browsing non-displayed scopes. You can also use keyboard shortcuts: Escape to cancel, or click outside the modal to close it. These improvements apply across templates, policies, projects, and all workload assets.

Command-line Interface (CLI v2)

  • New default CLI – CLI v2 is the default command-line interface. CLI v1 has been deprecated as of version 2.20.

  • Secret volume mapping for workloads – You can now map secrets to volumes when submitting workloads using the --secret-volume flag. This feature is available for all workload types - workspaces, training, and inference.

  • Support for environment field references in submit commands – A new flag, fieldRef, has been added to all submit commands to support environment field references in a key:value format. This enhancement enables dynamic injection of environment variables directly from pod specifications, offering greater flexibility during workload submission.

  • Improved PVC visibility and selection for researchers – Use runai pvc to list existing PVCs within your scope, making it easier to reference available options when submitting workloads. A noun auto-completion has been introduced for storage, streamlining the selection process. The workload describe command also includes a PVC section, improving visibility into persistent volume claims. These enhancements provide greater clarity and efficiency in storage utilization.

  • Enhanced workload deletion options – The runai workload delete command now supports deleting multiple workloads by specifying a list of workload names (e.g., workload-a, workload-b, workload-c).

ML Engineers

Workloads - Inference

  • Support for inference workloads via CLI v2 – You can now run inference workloads directly from the command-line interface. This update enables greater automation and flexibility for managing inference workloads. See runai inference for more details.

  • Enhanced rolling inference updates – Rolling inference updates allow ML engineers to apply live updates to existing inference workloads—regardless of their current status (e.g., running or pending)—without disrupting critical services. Experimental

    • This capability is now supported for both Hugging Face and custom inference workloads, with a new UI flow that aligns with the API functionality introduced in v2.19. From cluster v2.19 onward

    • Compute resources can now be updated via API and UI. From cluster v2.21 onward

  • Support for NVIDIA Cloud Functions (NVCF) external workloads – NVIDIA Run:ai enables you to deploy, schedule and manage NVCF workloads as external workloads within the platform. See Deploy NVIDIA Cloud Functions (NVCF) in NVIDIA Run:ai for more details. From cluster v2.21 onward

  • Added validation for Knative – You can now only submit inference workloads if Knative is properly installed. This ensures workloads are deployed successfully by preventing submission when Knative is misconfigured or missing. From cluster v2.21 onward

  • Enhancements in Hugging Face workloads. For more details, see Deploy inference workloads from Hugging Face:

    • Added Hugging Face model authentication – NVIDIA Run:ai validates whether a user-provided token grants access to a specific model, in addition to checking if a model requires a token and verifying the token format. This enhancement ensures that users can only load models they have permission to access, improving security and usability. From cluster v2.18 onward

    • Introduced model store support using data sources – Select a data source to serve as a model store, caching model weights to reduce loading time and avoid repeated downloads. This improves performance and deployment speed, especially for frequently used models, minimizing the need to re-authenticate with external sources.

    • Improved model selection – Select a model from a drop-down list. The list is partial and consists only of models that were tested. From cluster v2.18 onward

    • Enhanced Hugging Face environment control – Choose between vLLM, TGI, or any other custom container image by selecting an image tag and providing additional arguments. By default, workloads use the official vLLM or TGI containers, with full flexibility to override the image and customize runtime settings for more controlled and adaptable inference deployments. From cluster v2.18 onward

  • Updated authentication for NIM model access – You can now authenticate access to NIM models using tokens or credentials, ensuring a consistent, flexible, and secure authentication process. See Deploy inference workloads with NVIDIA NIM for more details. From cluster v2.19 onward

  • Added support for volume configuration – You can now set volumes for custom inference workloads. This feature allows inference workloads to allocate and retain storage, ensuring continuity and efficiency in inference execution. From cluster v2.20 onward

Platform Administrators

Analytics

  • Enhancements to the Overview dashboard – The Overview dashboard includes optimization insights for projects and departments, providing real-time visibility into GPU resource allocation and utilization. These insights help department and project managers make more informed decisions about quota management, ensuring efficient resource usage.

  • Dashboard UX improvements:

    • Improved visibility of metrics in the Resources utilization widget by repositioning them above the graphs.

    • Added a new Idle workloads table widget to help you easily identify and manage underutilized resources.

    • Renamed and updated the "Workloads by type" widget to provide clearer insights into cluster usage with a focus on workloads.

    • Improved user experience by moving the date picker to a dedicated section within the overtime widgets, Resources allocation and Resources utilization.

Organizations - Projects/Departments

  • Enhanced resource prioritization for projects and departments – Admins can now define and manage SLAs tailored to specific departments and projects via the UI, ensuring resource allocation aligns with real business priorities. This enhancement empowers admins to assign strict priority to over-quota resources, extending control beyond the existing over-quota weight system. From cluster v2.20 onward

    This feature allows administrators to:

    • Set the priority of each department relative to other departments within the same node pool.

    • Define the priority of projects within a department, on a per-node pool basis.

    • Set specific GPU resource limits for both departments and projects.

Audit Logs

Updated access control for audit logs – Only users with tenant-wide permissions have the ability to access audit logs, ensuring proper access control and data security. This update reinforces security and compliance by restricting access to sensitive system logs. It ensures that only authorized users can view audit logs, reducing the risk of unauthorized access and potential data exposure.

Notifications

Slack API integration for notifications – A new API allows organizations to receive notifications directly to Slack. This feature enhances real-time communication and monitoring by enabling users to stay informed about workload statuses. See Configuring Slack notifications for more details.

Authentication and Authorization

  • Improved visibility into user roles and access scopes – Individual users can now view their assigned roles and scopes directly in their settings. This enhancement provides greater transparency into user permissions, allowing individuals to easily verify their access levels. It helps users understand what actions they can perform and reduces dependency on administrators for access-related inquiries. See Access rules for more details.

  • Added auto-redirect to SSO – To deliver a consistent and streamlined login experience across customer applications, users accessing the NVIDIA Run:ai login page will be automatically redirected to SSO, bypassing the standard login screen entirely. This can be enabled via a toggle after an Identity Provider is added, and is available through both the UI and API. See Single Sign-On (SSO) for more details.

  • SAML service provider metadata XML – After configuring SAML IDP, the service provider metadata XML is now available for download to simplify integration with identity providers. See Set up SSO with SAML for more details.

  • Expanded SSO OpenID Connect authentication support – SSO OpenID Connect authentication supports attribute mapping of groups in both list and map formats. In map format, the group name is used as the value. This applies to new identity providers only. See Set up SSO with OpenID Connect for more details.

  • Improved permission error messaging – Enhanced clarity when attempting to delete a user with higher privileges, making it easier to understand and resolve permission-related actions.

Data & Storage

Added Data volumes to the UI – Administrators can now create and manage data volumes directly from the UI and share data across different scopes in a cluster, including projects and departments. See Data volumes for more details. Experimental From cluster v2.19 onward

Infrastructure Administrators

NVIDIA Datacenter GPUs - Grace-Blackwell

Support for NVIDIA GB200 NVL72 and MultiNode NVLink systems – NVIDIA Run:ai offers full support for NVIDIA’s most advanced MultiNode NVLink (MNNVL) systems, including NVIDIA GB200, NVIDIA GB200 NVL72 and its derivatives. NVIDIA Run:ai simplifies the complexity of managing and submitting workloads on these systems by automating infrastructure detection, domain labeling, and distributed job submission via the UI, CLI, or API. See Using GB200 NVL72 and Multi-Node NVLink Domains for more details. From cluster v2.21 onward

Advanced Cluster Configurations

Automatic cleanup of resources for failed workloads – When a workload fails due to infrastructure issues, its resources can be automatically cleaned up using failureResourceCleanupPolicy, reducing resource waste from failed workloads. For more details, see Advanced cluster configurations. From cluster v2.21 onward

Advanced Setup

Custom pod labels and annotations – Add custom labels and annotations to pods in both the control plane and cluster. This new capability enables service mesh deployment in NVIDIA Run:ai. This feature provides greater flexibility in workload customization and management, allowing users to integrate with service meshes more easily. See Service mesh for more details.

System Requirements

  • NVIDIA Run:ai now supports NVIDIA GPU Operator version 25.3.

  • NVIDIA Run:ai now supports OpenShift version 4.18.

  • NVIDIA Run:ai now supports Kubeflow Training Operator 1.9.

  • Kubernetes version 1.29 is no longer supported.

Deprecation Notifications

Cluster API for Workload Submission

Using the Cluster API to submit NVIDIA Run:ai workloads via YAML was deprecated starting from NVIDIA Run:ai version 2.18. For cluster version 2.18 and above, use the Run:ai REST API to submit workloads. The Cluster API documentation has also been removed from v2.20 and above.

Integrations

Integration Support

Support for third-party integrations varies. When noted below, the integration is supported out of the box with NVIDIA Run:ai. For other integrations, our Customer Success team has prior experience assisting customers with setup. In many cases, the NVIDIA Enterprise Support Portal may include additional reference documentation provided on an as-is basis.

Tool
Category
NVIDIA Run:ai support details
Additional Information

Triton

Orchestration

Supported

Usage via docker base image

Spark

Orchestration

Community Support

It is possible to schedule Spark workflows with the NVIDIA Run:ai Scheduler. Sample code: .

Kubeflow Pipelines

Orchestration

Community Support

It is possible to schedule kubeflow pipelines with the NVIDIA Run:ai Scheduler. Sample code: .

Apache Airflow

Orchestration

Community Support

It is possible to schedule Airflow workflows with the NVIDIA Run:ai Scheduler. Sample code: .

Argo workflows

Orchestration

Community Support

It is possible to schedule Argo workflows with the NVIDIA Run:ai Scheduler. Sample code: .

SeldonX

Orchestration

Community Support

It is possible to schedule Seldon Core workloads with the NVIDIA Run:ai Scheduler. Sample code: .

Jupyter Notebook

Development

Supported

NVIDIA Run:ai provides integrated support with Jupyter Notebooks. See example.

JupyterHub

Development

Community Support

It is possible to submit NVIDIA Run:ai workloads via JupyterHub. Sample code: .

PyCharm

Development

Supported

Containers created by NVIDIA Run:ai can be accessed via PyCharm.

VScode

Development

Supported

Containers created by NVIDIA Run:ai can be accessed via Visual Studio Code. You can automatically launch Visual Studio code web from the NVIDIA Run:ai console.

Kubeflow notebooks

Development

Community Support

It is possible to launch a Kubeflow notebook with the NVIDIA Run:ai Scheduler. Sample code: .

Ray

Training, inference, data processing

Community Support

It is possible to schedule Ray jobs with the NVIDIA Run:ai Scheduler. Sample code: .

TensorBoard

Experiment tracking

Supported

NVIDIA Run:ai comes with a preset TensorBoard asset

Weights & Biases

Experiment tracking

Community Support

It is possible to schedule W&B workloads with the NVIDIA Run:ai Scheduler. Sample code: .

ClearML

Experiment tracking

Community Support

It is possible to schedule ClearML workloads with the NVIDIA Run:ai Scheduler. Sample code: .

MLFlow

Model Serving

Community Support

It is possible to use ML Flow together with the NVIDIA Run:ai Scheduler. Sample code: .

Hugging Face

Repositories

Supported

NVIDIA Run:ai provides an out of the box integration with Hugging Face.

Docker Registry

Repositories

Supported

NVIDIA Run:ai allows using a docker registry as an asset

S3

Storage

Supported

NVIDIA Run:ai communicates with S3 by defining an asset

GitHub

Storage

Supported

NVIDIA Run:ai communicates with GitHub by defining it as an asset

TensorFlow

Training

Supported

NVIDIA Run:ai provides out of the box support for submitting TensorFlow workloads via API, CLI or UI. See for more details.

PyTorch

Training

Supported

NVIDIA Run:ai provides out of the box support for submitting PyTorch workloads via API, CLI or UI. See for more details.

MPI

Training

Supported

NVIDIA Run:ai provides out of the box support for submitting MPI workloads via API, CLI or UI. See for more details.

XGBoost

Training

Supported

NVIDIA Run:ai provides out of the box support for submitting XGBoost via API, CLI or UI. See for more details.

Karpenter

Cost Optimization

Supported

NVIDIA Run:ai provides out of the box support for Karpenter to save cloud costs. Integration notes with Karpenter can be found .

Kubernetes Workloads Integration

Kubernetes has several built-in resources that encapsulate running Pods. These are called Kubernetes Workloads and should not be confused with NVIDIA Run:ai workloads.

Examples of such resources are a Deployment that manages a stateless application, or a Job that runs tasks to completion.

A NVIDIA Run:ai workload encapsulates all the resources needed to run and creates/deletes them together. Since NVIDIA Run:ai is an open platform, it allows the scheduling of any Kubernetes Workload.

For more information, see Kubernetes Workloads Integration.
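As an illustration, a native Kubernetes Job can be placed under the NVIDIA Run:ai Scheduler by setting the scheduler name and a project queue label, the same fields used in the GPU fractions example later in this document. A minimal sketch, assuming a hypothetical project named team-a:

apiVersion: batch/v1
kind: Job
metadata:
  name: demo-k8s-job
  namespace: runai-team-a          # hypothetical project namespace
spec:
  template:
    metadata:
      labels:
        runai/queue: team-a        # schedule through the team-a project queue
    spec:
      schedulerName: runai-scheduler
      restartPolicy: Never
      containers:
      - name: main
        image: ubuntu:22.04
        command: ["bash", "-c", "echo scheduled by the NVIDIA Run:ai Scheduler"]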

Set Up SSO with OpenID Connect

Single Sign-On (SSO) is an authentication scheme, allowing users to log-in with a single pair of credentials to multiple, independent software systems.

This article explains the procedure to configure single sign-on to NVIDIA Run:ai using the OpenID Connect protocol.

Prerequisites

Before you start, make sure you have the following available from your identity provider:

  • Discovery URL - The OpenID server where the content discovery information is published.

  • ClientID - The ID used to identify the client with the Authorization Server.

  • Client Secret - A secret password that only the Client and Authorization server know.

  • Optional: Scopes - A set of user attributes to be used during authentication to authorize access to a user's details.

Setup

Adding the Identity Provider

  1. Go to General settings

  2. Open the Security section and click +IDENTITY PROVIDER

  3. Select Custom OpenID Connect

  4. Enter the Discovery URL, Client ID, and Client Secret

  5. Copy the Redirect URL to be used in your identity provider

  6. Optional: Add the OIDC scopes

  7. Optional: Enter the user attributes and their values from the identity provider, as shown below

  8. Click SAVE

  9. Optional: Enable Auto-Redirect to SSO to automatically redirect users to your configured identity provider’s login page when accessing the platform.

Each attribute below is listed with its default value in NVIDIA Run:ai and its description:

  • User role groups (GROUPS) – If it exists in the IDP, it allows you to assign NVIDIA Run:ai role groups via the IDP. The IDP attribute must be a list of strings or an object where the group names are the values.

  • Linux User ID (UID) – If it exists in the IDP, it allows Researcher containers to start with the Linux User UID. Used to map access to network resources such as file systems to users. The IDP attribute must be of type integer.

  • Linux Group ID (GID) – If it exists in the IDP, it allows Researcher containers to start with the Linux Group GID. The IDP attribute must be of type integer.

  • Supplementary Groups (SUPPLEMENTARYGROUPS) – If it exists in the IDP, it allows Researcher containers to start with the relevant Linux supplementary groups. The IDP attribute must be a list of integers.

  • Email (email) – Defines the user attribute in the IDP holding the user's email address, which is the user identifier in NVIDIA Run:ai.

  • User first name (firstName) – Used as the user’s first name appearing in the NVIDIA Run:ai user interface.

  • User last name (lastName) – Used as the user’s last name appearing in the NVIDIA Run:ai user interface.

Testing the Setup

  1. Log in to the NVIDIA Run:ai platform as an admin

  2. Add access rules to an SSO user defined in the IDP

  3. Open the NVIDIA Run:ai platform in an incognito browser tab

  4. On the sign-in page, click CONTINUE WITH SSO. You are redirected to the identity provider sign-in page

  5. In the identity provider sign-in page, log in with the SSO user you granted access rules to

  6. If you are unable to sign in to the identity provider, follow the Troubleshooting section below

Editing the Identity Provider

You can view the identity provider details and edit its configuration:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider box, click Edit identity provider

  4. You can edit either the Discovery URL, Client ID, Client Secret, OIDC scopes, or the User attributes

Removing the Identity Provider

You can remove the identity provider configuration:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider card, click Remove identity provider

  4. In the dialog, click REMOVE to confirm the action

Note

To avoid losing access, removing the identity provider must be carried out by a local user.

Troubleshooting

If testing the setup was unsuccessful, try the different troubleshooting scenarios according to the error you received.

Troubleshooting Scenarios

Error: "403 - Sorry, we can’t let you see this page. Something about permissions…"

Description: The authenticated user is missing permissions

Mitigation:

  1. Validate that either the user or its related group(s) are assigned access rules

  2. Validate groups attribute is available in the configured OIDC Scopes

  3. Validate the user’s groups attribute is mapped correctly

Advanced:

  1. Open the Chrome DevTools: Right-click on page → Inspect → Console tab

  2. Run the following command to retrieve and paste the user’s token: localStorage.token;

  3. Paste in https://jwt.io

  4. Under the Payload section validate the values of the user’s attribute
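For reference, a decoded token payload carrying the mapped attributes alongside the standard claims might look like the following; the values are entirely illustrative:

{
  "email": "jane.doe@example.com",
  "firstName": "Jane",
  "lastName": "Doe",
  "GROUPS": ["ml-researchers", "platform-admins"],
  "UID": 2001,
  "GID": 3001,
  "SUPPLEMENTARYGROUPS": [4001, 4002]
}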

Error: "401 - We’re having trouble identifying your account because your email is incorrect or can’t be found."

Description: Authentication failed because email attribute was not found.

Mitigation:

  1. Validate email attribute is available in the configured OIDC Scopes

  2. Validate the user’s email attribute is mapped correctly

Error: "Unexpected error when authenticating with identity provider"

Description: User authentication failed

Mitigation: Validate that the configured OIDC Scopes exist and match the Identity Provider’s available scopes

Advanced: Look for the specific error message in the URL address

Error: "Unexpected error when authenticating with identity provider (SSO sign-in is not available)"

Description: User authentication failed

Mitigation:

  1. Validate that the configured OIDC scope exists in the Identity Provider

  2. Validate that the configured Client Secret matches the Client Secret in the Identity Provider

Advanced: Look for the specific error message in the URL address

Error: "Client not found"

Description: OIDC Client ID was not found in the Identity Provider

Mitigation: Validate that the configured Client ID matches the Identity Provider Client ID

The NVIDIA Run:ai Scheduler: Concepts and Principles

When a user submits a workload, the workload is directed to the selected Kubernetes cluster and managed by the NVIDIA Run:ai Scheduler. The Scheduler’s primary responsibility is to allocate workloads to the most suitable node or nodes based on resource requirements and other characteristics, as well as adherence to NVIDIA Run:ai’s fairness and quota management.

The NVIDIA Run:ai Scheduler schedules native Kubernetes workloads, NVIDIA Run:ai workloads, or any other type of third-party workloads. To learn more about workloads support, see Introduction to workloads.

To understand what is behind the NVIDIA Run:ai Scheduler’s decision-making logic, get to know the key concepts, resource management and scheduling principles of the Scheduler.

Workloads and Pod Groups

Workloads can range from a single pod running on individual nodes to distributed workloads using multiple pods, each running on a node (or part of a node). For example, a large scale training workload could use up to 128 nodes or more, while an inference workload could use many pods (replicas) and nodes.

Every newly created pod is assigned to a pod group, which can represent one or multiple pods within a workload. For example, a distributed PyTorch training workload with 32 workers is grouped into a single pod group. All pods are attached to the pod group with certain rules, such as gang scheduling, applied to the entire pod group.

Scheduling Queue

A scheduling queue (or simply a queue) represents a scheduler primitive that manages the scheduling of workloads based on different parameters.

A queue is created for each project/node pool pair and department/node pool pair. The NVIDIA Run:ai Scheduler supports hierarchical queueing: project queues are bound to department queues, per node pool. This allows an organization to manage quota, over quota, and more for projects and their associated departments.

Resource Management

Quota

Each project and department includes a set of deserved resource quotas, per node pool and resource type. For example, project “LLM-Train/Node Pool NV-H100” quota parameters specify the number of GPUs, CPUs (cores), and the amount of CPU memory that this project deserves to get when using this node pool. Non-preemptible workloads can only be scheduled if their requested resources are within the deserved resource quotas of their respective project/node-pool and department/node-pool.

Over Quota

Projects and departments can have a share in the unused resources of any node pool, beyond their quota of deserved resources. These resources are referred to as over quota resources. The administrator configures the over quota parameters per node pool for each project and department.

Over Quota Weight

Projects can receive a share of the cluster/node pool unused resources when the over quota weight setting is enabled. The part each Project receives depends on its over quota weight value, and the total weights of all other projects’ over quota weights. The administrator configures the over quota weight parameters per node pool for each project and department.

Multi-Level Quota System

Each project has a set of guaranteed resource quotas (GPUs, CPUs, and CPU memory) per node pool. Projects can go over quota and get a share of the unused resources in a node pool beyond their guaranteed quota in that node pool. The same applies to departments. The Scheduler balances the amount of over quota between departments, and then between projects. The department’s deserved quota and over quota limit the sum of resources of all projects within the department. If a project still has deserved quota available but its department’s deserved quota is exhausted, the Scheduler does not give the project any more deserved resources. The same applies to over quota resources: over quota resources are first given to the department, and only then split among its projects.

Fairshare and Fairshare Balancing

The NVIDIA Run:ai Scheduler calculates a numerical value, fairshare, per project (or department) for each node pool, representing the project’s (department’s) sum of guaranteed resources plus the portion of non-guaranteed resources in that node pool.

The Scheduler aims to provide each project (or department) the resources they deserve per node pool using two main parameters: deserved quota and deserved fairshare (i.e. quota + over quota resources). If one project’s node pool queue is below fairshare and another project’s node pool queue is above fairshare, the Scheduler shifts resources between queues to balance fairness. This may result in the preemption of some over quota preemptible workloads.

Over-Subscription

Over-subscription is a scenario where the sum of all guaranteed resource quotas surpasses the physical resources of the cluster or node pool. In this case, there may be scenarios in which the Scheduler cannot find matching nodes to all workload requests, even if those requests were within the resource quota of their associated projects.

Placement Strategy - Bin-Pack and Spread

The administrator can set a placement strategy, bin-pack or spread, of the Scheduler per node pool. For GPU based workloads, workloads can request both GPU and CPU resources. For CPU-only based workloads, workloads can request CPU resources only.

  • GPU workloads:

    • Bin-pack - The Scheduler places as many workloads as possible in each GPU and node to use fewer resources and maximize GPU and node vacancy.

    • Spread - The Scheduler spreads workloads across as many GPUs and nodes as possible to minimize the load and maximize the available resources per workload.

  • CPU workloads:

    • Bin-pack - The Scheduler places as many workloads as possible in each CPU and node to use fewer resources and maximize CPU and node vacancy.

    • Spread - The Scheduler spreads workloads across as many CPUs and nodes as possible to minimize the load and maximize the available resources per workload.

Scheduling Principles

Priority and Preemption

NVIDIA Run:ai supports scheduling workloads using different priority and preemption policies:

  • High-priority workloads (pods) can preempt lower priority workloads (pods) within the same scheduling queue (project), according to their preemption policy. The NVIDIA Run:ai Scheduler implicitly assumes any PriorityClass >= 100 is non-preemptible and any PriorityClass < 100 is preemptible.

  • Cross project and cross department workload preemptions are referred to as resource reclaim and are based on fairness between queues rather than the priority of the workloads.

To make it easier for users to submit workloads, NVIDIA Run:ai preconfigured several Kubernetes PriorityClass objects. The NVIDIA Run:ai preset PriorityClass objects have their ‘preemptionPolicy’ always set to ‘PreemptLowerPriority’, regardless of their actual NVIDIA Run:ai preemption policy within the NVIDIA Run:ai platform. A non-preemptible workload is only scheduled if in-quota and cannot be preempted after being scheduled, not even by a higher priority workload.

Each preset PriorityClass is listed below with its priority value, NVIDIA Run:ai preemption policy, and Kubernetes preemption policy:

  • Inference – PriorityClass value 125, non-preemptible in NVIDIA Run:ai, Kubernetes preemption policy PreemptLowerPriority

  • Build – PriorityClass value 100, non-preemptible in NVIDIA Run:ai, Kubernetes preemption policy PreemptLowerPriority

  • Interactive-preemptible – PriorityClass value 75, preemptible in NVIDIA Run:ai, Kubernetes preemption policy PreemptLowerPriority

  • Train – PriorityClass value 50, preemptible in NVIDIA Run:ai, Kubernetes preemption policy PreemptLowerPriority

Note

You can override the default priority class of a workload. See Workload priority class control for more details.
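For workloads submitted directly as Kubernetes YAML, the priority can be expressed through the standard priorityClassName field. A minimal sketch, assuming the preset class names listed above are installed in the cluster (the project queue name is hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: priority-demo
  labels:
    runai/queue: team-a            # hypothetical project queue
spec:
  schedulerName: runai-scheduler
  priorityClassName: build         # one of the preset classes described above
  containers:
  - name: main
    image: ubuntu:22.04
    command: ["sleep", "3600"]

You can list the priority classes available in the cluster with kubectl get priorityclasses.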

Preemption of Lower Priority Workloads Within a Project

Workload priority is always respected within a project. This means higher priority workloads are scheduled before lower priority workloads. It also means that higher priority workloads may preempt lower priority workloads within the same project if the lower priority workloads are preemptible.

Fairness (Fair Resource Distribution)

Fairness is a major principle within the NVIDIA Run:ai scheduling system. It means that the NVIDIA Run:ai Scheduler always respects certain resource splitting rules (fairness) between projects and between departments.

Reclaim of Resources Between Projects and Departments

Reclaim is an inter-project (and inter-department) scheduling action that takes back resources from one project (or department) that has used them as over quota, back to a project (or department) that deserves those resources as part of its deserved quota, or to balance fairness between projects, each to its fairshare (i.e. sharing fairly the portion of the unused resources).

Gang Scheduling

Gang scheduling describes a scheduling principle where a workload composed of multiple pods is either fully scheduled (i.e. all pods are scheduled and running) or fully pending (i.e. all pods are not running). Gang scheduling refers to a single pod group.

Next Steps

Now that you have learned the key concepts and principles of the NVIDIA Run:ai Scheduler, see how the Scheduler works - allocating pods to workloads, applying preemption mechanisms, and managing resources.

GPU Fractions

To submit a workload with GPU resources in Kubernetes, you typically need to specify an integer number of GPUs. However, workloads often require diverse GPU memory and compute requirements or even use GPUs intermittently depending on the application (such as inference workloads, training workloads or notebooks at the model-creation phase). Additionally, GPUs are becoming increasingly powerful, offering more processing power and larger memory capacity for applications. Despite the increasing model sizes, the increasing capabilities of GPUs allow them to be effectively shared among multiple users or applications.

NVIDIA Run:ai’s GPU fractions provide an agile and easy-to-use method to share a GPU or multiple GPUs across workloads. With GPU fractions, you can divide the GPU/s memory into smaller chunks and share the GPU/s compute resources between different workloads and users, resulting in higher GPU utilization and more efficient resource allocation.

Benefits of GPU Fractions

Utilizing GPU fractions to share GPU resources among multiple workloads provides numerous advantages for both platform administrators and practitioners, including improved efficiency, resource optimization, and enhanced user experience.

  • For the AI practitioner:

    • Reduced wait time - Workloads with smaller GPU requests are more likely to be scheduled quickly, minimizing delays in accessing resources.

    • Increased workload capacity - More workloads can be run using the same admin-defined GPU quota and available unused resources (over quota).

  • For the platform administrator:

    • Improved GPU utilization - Sharing GPUs across workloads increases the utilization of individual GPUs, resulting in better overall platform efficiency.

    • Higher resource availability - More users gain access to GPU resources, ensuring better distribution.

    • Enhanced workload throughput - More workloads can be served per GPU, ensuring maximum output from existing hardware.

    • Optimized scheduling - Smaller and dynamic resource allocations give the Scheduler a higher chance of finding GPU resources for incoming workloads.

Quota Planning with GPU Fractions

When planning the quota distribution for your projects and departments, using fractions gives the platform administrator the ability to allocate more precise quota per project and department, assuming the usage of GPU fractions or enforcing it with pre-defined policies or compute resource templates.

For example, in an organization with a department budgeted for two nodes of 8×H100 GPUs and a team of 32 researchers:

  • Allocating 0.5 GPU per researcher ensures all researchers have access to GPU resources.

  • Using fractions enables researchers to run smaller workloads intermittently within their quota or go over their quota by using temporary over quota resources with higher resource demanding workloads.

  • Using GPUs for notebook-based model development, where GPUs are not continuously active and can be shared among multiple users.

For more details on mapping your organization and resources, see Adapting AI initiatives to your organization.

How GPU Fractions Work

When a workload is submitted, the Scheduler finds a node with a GPU that can satisfy the requested GPU portion or GPU memory, then it schedules the pod to that node. The NVIDIA Run:ai GPU fractions logic, running locally on each NVIDIA Run:ai worker node, allocates the requested memory size on the selected GPU. Each pod uses its own separate virtual memory address space. NVIDIA Run:ai’s GPU fractions logic enforces the requested memory size, so no workload can use more than requested, and no workload can run over another workload’s memory. This gives users the experience of a ‘logical GPU’ per workload.

While MIG requires administrative work to configure every MIG slice, where a slice is a fixed chunk of memory, GPU fractions allow dynamic and fully flexible allocation of GPU memory chunks. By default, GPU fractions use NVIDIA’s time-slicing to share the GPU compute runtime. You can also use the NVIDIA Run:ai GPU time-slicing which allows dynamic and fully flexible splitting of the GPU compute time.

NVIDIA Run:ai GPU fractions are agile and dynamic allowing a user to allocate and free GPU fractions during the runtime of the system, at any size between zero to the maximum GPU portion (100%) or memory size (up to the maximum memory size of a GPU).

The NVIDIA Run:ai Scheduler can work alongside other schedulers. In order to avoid collisions with other schedulers, the NVIDIA Run:ai Scheduler creates special reservation pods. Once a workload is submitted requesting a fraction of a GPU, NVIDIA Run:ai will create a pod in a dedicated runai-reservation namespace with the full GPU as a resource, allowing other schedulers to understand that the GPU is reserved.
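You can see these reservation pods directly with kubectl; a quick check using the namespace described above:

# Each GPU reserved for fractional workloads appears as a pod in the runai-reservation namespace
kubectl get pods -n runai-reservation -o wide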

Note

  • Splitting a GPU into fractions may generate some fragmentation of the GPU memory. The Scheduler will try to consolidate GPU resources where feasible (i.e. preemptible workloads).

  • Using bin-pack as a scheduling placement strategy can also reduce GPU fragmentation.

  • Using dynamic GPU fractions ensures that even small unused fragments of GPU memory are utilized by workloads.

Multi-GPU Fractions

NVIDIA Run:ai also supports workload submission using multi-GPU fractions. Multi-GPU fractions work similarly to single-GPU fractions, however, the NVIDIA Run:ai Scheduler allocates the same fraction size on multiple GPU devices within the same node. For example, if practitioners develop a new model that uses 8 GPUs and requires 40GB of memory per GPU, they can allocate 8×40GB with multi-GPU fractions instead of reserving the full memory of each GPU (e.g. 80GB). This leaves 40GB of GPU memory available on each of the 8 GPUs for other workloads within that node.

Time sharing, where a single GPU can serve multiple workloads with fractions, remains unchanged; it now serves multiple workloads using multiple GPUs per workload, a single GPU per workload, or a mix of both.

Deployment Considerations

  • Selecting a GPU portion using percentages as units does not guarantee the exact memory size. This means 50% of an A100-40GB GPU is 20GB, while 50% of an A100-80GB GPU is 40GB. To have better control over the exact allocated memory, specify the exact memory size, for example, 40GB.

  • Using NVIDIA Run:ai GPU fractions controls the memory split (i.e. 0.5 GPU means 50% of the GPU memory) but not the compute (processing time). To split the compute time, see NVIDIA Run:ai’s GPU time slicing.

  • NVIDIA Run:ai GPU fractions and MIG mode cannot be used on the same node.

Setting GPU Fractions

Using the compute resources asset, you can define the compute requirements by specifying your requested GPU portion or GPU memory, and use it with any of the NVIDIA Run:ai workload types for single GPU and multi-GPU fractions.

  • Single-GPU fractions - Define the compute requirement to run 1 GPU device, by specifying either a fraction (percentage) of the overall memory or a memory request (GB, MB).

  • Multi-GPU fractions - Define the compute requirement to run multiple GPU devices, by specifying either a fraction (percentage) of the overall memory or a memory request (GB, MB).

Setting GPU Fractions for External Workloads

To enable GPU fractions for workloads submitted via Kubernetes YAML, use the following annotations to define the GPU fraction configuration. You can configure either gpu-fraction or gpu-memory.

Variable
Input Format
Where to Set

gpu-fraction

A portion of GPU memory as a double-precision floating-point number. Example: 0.25, 0.75.

Pod annotation (metadata.annotations)

gpu-memory

Memory size in MiB. Example: 2500, 4096. The gpu-memory values are always in MiB.

Pod annotation (metadata.annotations)

gpu-fraction-num-devices

The number of GPU devices to allocate using the specified gpu-fraction or gpu-memory value. Set this annotation only if you want to request multiple GPU devices.

Pod annotation (metadata.annotations)

The following example YAML creates a pod that requests 2 GPU devices, each requesting 50% of the GPU memory (gpu-fraction: "0.5").

apiVersion: v1
kind: Pod
metadata:
  annotations:
    user: test
    gpu-fraction: "0.5"
    gpu-fraction-num-devices: "2"
  labels:
    runai/queue: test
  name: multi-fractional-pod-job
  namespace: test
spec:
  containers:
  - image: gcr.io/run-ai-demo/quickstart-cuda
    imagePullPolicy: Always
    name: job
    env:
    - name: RUNAI_VERBOSE
      value: "1"
    resources:
      limits:
        cpu: 200m
        memory: 200Mi
      requests:
        cpu: 100m
        memory: 100Mi
    securityContext:
      capabilities:
        drop: ["ALL"]
  schedulerName: runai-scheduler
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 5
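
To request an absolute memory size instead of a portion, replace the gpu-fraction annotation with gpu-memory. The following is a minimal sketch of the relevant metadata only; the value 20000 (MiB per device) is illustrative:

metadata:
  annotations:
    gpu-memory: "20000"             # 20,000 MiB of GPU memory per device
    gpu-fraction-num-devices: "2"   # optional: request the same memory size on 2 devices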

Using CLI

To view the available actions, go to the CLI v2 reference or the CLI v1 reference and run according to your workload.
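
For example, with CLI v2 a fractional workspace can be submitted by requesting either a GPU portion or a GPU memory size. The flag names below follow the CLI v2 reference; verify them against your installed CLI version:

# Request 50% of a single GPU device's memory
runai workspace submit frac-workspace --image gcr.io/run-ai-demo/quickstart-cuda \
  --gpu-portion-request 0.5

# Or request an explicit GPU memory size instead
runai workspace submit frac-workspace --image gcr.io/run-ai-demo/quickstart-cuda \
  --gpu-memory-request 20G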

Using API

To view the available actions, go to the API reference and run according to your workload.

Nodes

This section explains the procedure for managing Nodes.

Nodes are Kubernetes elements automatically discovered by the NVIDIA Run:ai platform. Once a node is discovered by the NVIDIA Run:ai platform, an associated instance is created in the Nodes table, administrators can view the node’s relevant information, and the NVIDIA Run:ai Scheduler can use the node for scheduling.

Nodes Table

The Nodes table can be found under Resources in the NVIDIA Run:ai platform.

The Nodes table displays a list of predefined nodes available to users in the NVIDIA Run:ai platform.

Note

  • It is not possible to create additional nodes, or edit, or delete existing nodes.

  • Only users with relevant permissions can view the table.

The Nodes table consists of the following columns:

Column
Description

Node

The Kubernetes name of the node

Status

The state of the node. Nodes in the Ready state are eligible for scheduling. If the state is Not ready, the main reason appears in parentheses to the right of the state field. Hovering over the state lists the reasons why a node is Not ready.

NVLink domain UID

The MNNVL domain ID, which is part of the MNNVL label value. If the MNNVL label is not the default MNNVL label key (nvidia.com/gpu.clique), this field shows the whole label value.

MNNVL domain clique ID

The MNNVL clique ID, which is part of the MNNVL label value. If the MNNVL label is not the default MNNVL label key (nvidia.com/gpu.clique), this field shows an empty value.

Node pool

The name of the associated node pool. By default, every node in the NVIDIA Run:ai platform is associated with the default node pool unless another node pool is assigned.

GPU type

The GPU model, for example, H100, or V100

GPU devices

The number of GPU devices installed on the node. Clicking this field pops up a dialog with details per GPU (described below in this article)

Free GPU devices

The current number of fully vacant GPU devices

GPU memory

The total amount of GPU memory installed on this node. For example, if the number is 640GB and the number of GPU devices is 8, then each GPU is installed with 80GB of memory (assuming the node is assembled of homogenous GPU devices)

Allocated GPUs

The total allocation of GPU devices in units of GPUs (decimal number). For example, if 3 GPUs are 50% allocated, the field prints out the value 1.50. This value represents the portion of GPU memory consumed by all running pods using this node

Used GPU memory

The actual amount of memory (in GB or MB) used by pods running on this node.

GPU compute utilization

The average compute utilization of all GPU devices in this node

GPU memory utilization

The average memory utilization of all GPU devices in this node

CPU (Cores)

The number of CPU cores installed on this node

CPU memory

The total amount of CPU memory installed on this node

Allocated CPU (Cores)

The number of CPU cores allocated by pods running on this node (decimal number, e.g. a pod allocating 350 milli-cores shows an allocation of 0.35 cores).

Allocated CPU memory

The total amount of CPU memory allocated by pods running on this node (in GB or MB)

Used CPU memory

The total amount of actually used CPU memory by pods running on this node. Pods may allocate memory but not use all of it, or go beyond their CPU memory allocation if using Limit > Request for CPU memory (burstable workload)

CPU compute utilization

The utilization of all CPU compute resources on this node (percentage)

CPU memory utilization

The utilization of all CPU memory resources on this node (percentage)

Used swap CPU memory

The amount of CPU memory (in GB or MB) used for GPU swap memory (* future)

Pod(s)

List of pods running on this node, click the field to view details (described below in this article)

GPU Devices for Node

Click one of the values in the GPU devices column, to view the list of GPU devices and their parameters.

Column
Description

Index

The GPU index, read from the GPU hardware. The same index is used when accessing the GPU directly

Used memory

The amount of memory used by pods and drivers using the GPU (in GB or MB)

Compute utilization

The portion of time the GPU is being used by applications (percentage)

Memory utilization

The portion of the GPU memory that is being used by applications (percentage)

Idle time

The elapsed time since the GPU was used (i.e. the GPU is being idle for ‘Idle time’)

Pods Associated with Node

Click one of the values in the Pod(s) column, to view the list of pods and their parameters.

Note

This column is only viewable if your role in the NVIDIA Run:ai platform gives you read access to workloads. Even if you are allowed to view workloads, you can only view the workloads within your allowed scope. This means there might be more pods running on this node than appear in the list you are viewing.

Column
Description

Pod

The Kubernetes name of the pod. Usually the pod name is composed of the parent workload’s name (if there is one) and an index that is unique for that pod instance within the workload

Status

The state of the pod. In steady state this should be Running, along with the amount of time the pod has been running

Project

The NVIDIA Run:ai project name the pod belongs to. Clicking this field takes you to the Projects table filtered by this project name

Workload

The workload name the pod belongs to. Clicking this field takes you to the Workloads table filtered by this workload name

Image

The full path of the image used by the main container of this pod

Creation time

The pod’s creation date and time

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

  • Show/Hide details - Click to view additional information on the selected row

Show/Hide Details

Click a row in the Nodes table and then click the Show details button at the upper right side of the action bar. The details screen appears, presenting the following metrics graphs:

  • GPU utilization - Per GPU graph and an average of all GPUs graph, all on the same chart, along an adjustable period allows you to see the trends of all GPUs compute utilization (percentage of GPU compute) in this node.

  • GPU memory utilization - Per GPU graph and an average of all GPUs graph, all on the same chart, along an adjustable period allows you to see the trends of all GPUs memory usage (percentage of the GPU memory) in this node.

  • CPU compute utilization - The average of all CPUs’ cores compute utilization graph, along an adjustable period allows you to see the trends of CPU compute utilization (percentage of CPU compute) in this node.

  • CPU memory utilization - The utilization of all CPUs memory in a single graph, along an adjustable period allows you to see the trends of CPU memory utilization (percentage of CPU memory) in this node.

  • CPU memory usage - The usage of all CPUs memory in a single graph, along an adjustable period allows you to see the trends of CPU memory usage (in GB or MB of CPU memory) in this node.

  • For GPUs charts - Click the GPU legend on the right-hand side of the chart, to activate or deactivate any of the GPU lines.

  • You can click the date picker to change the presented period

  • You can use your mouse to mark a sub-period in the graph for zooming in, and use the ‘Reset zoom’ button to go back to the preset period

  • Changes in the period affect all graphs on this screen.

Using API

To view the available actions, go to the Nodes API reference.

Running Jupyter Notebooks Using Workspaces

This quick start provides a step-by-step walkthrough for running a Jupyter Notebook using workspaces.

A workspace contains the setup and configuration needed for building your model, including the container, images, data sets, and resource requests, as well as the required tools for the research, all in one place. See Running workspaces for more information.

Note

If enabled by your Administrator, the NVIDIA Run:ai UI allows you to create a new workload using either the Flexible or Original submission form. The steps in this quick start guide reflect the Original form only.

Prerequisites

Before you start, make sure:

  • You have created a project or have one created for you.

  • The project has an assigned quota of at least 1 GPU.

Step 1: Logging In

Step 2: Submitting a Workspace

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Workspace

  3. Select under which cluster to create the workload

  4. Select the project in which your workspace will run

  5. Select a preconfigured template or select Start from scratch to launch a new workspace quickly

  6. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Select the ‘jupyter-lab’ environment for your workspace (Image URL: jupyter/scipy-notebook)

    • If ‘jupyter-lab’ is not displayed in the gallery, follow the below steps:

      • Click +NEW ENVIRONMENT

      • Enter a name for the environment. The name must be unique.

      • Enter the jupyter-lab Image URL - jupyter/scipy-notebook

      • Tools - Set the connection for your tool

        • Click +TOOL

        • Select Jupyter tool from the list

      • Set the runtime settings for the environment

        • Click +COMMAND

        • Enter command - start-notebook.sh

        • Enter arguments - --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''

          Note: If host-based routing is enabled on the cluster, enter the argument --NotebookApp.token='' only.

      • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  9. Select the ‘one-gpu’ compute resource for your workspace (GPU devices: 1)

    • If ‘one-gpu’ is not displayed in the gallery, follow the below steps:

      • Click +NEW COMPUTE RESOURCE

      • Enter a name for the compute resource. The name must be unique.

      • Set GPU devices per pod - 1

      • Set GPU memory per device

        • Select % (of device) - Fraction of a GPU device’s memory

        • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  10. Click CREATE WORKSPACE

Copy the following command to your terminal. Make sure to update the below with the name of your project and workload. For more details, see CLI reference:

runai project set "project-name"
runai workspace submit "workload-name" \
--image jupyter/scipy-notebook --gpu-devices-request 1 \
--command --external-url container=8888 -- start-notebook.sh \
--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''

Copy the following command to your terminal. Make sure to update the below with the name of your project and workload. For more details, see CLI reference:

runai config project "project-name"
runai submit "workload-name" --jupyter -g 1

Copy the following command to your terminal. Make sure to update the below parameters. For more details, see Workspaces API:

curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
    "name": "workload-name",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "command" : "start-notebook.sh",
        "args" : "--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''",
        "image": "jupyter/scipy-notebook",
        "compute": {
            "gpuDevicesRequest": 1
        },
        "exposedUrls" : [
            {
                "container" : 8888,
                "toolType": "jupyter-notebook",
                "toolName": "Jupyter"
            }
        ]
    }
}'
  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

  • toolType will show the Jupyter icon when connecting to the Jupyter tool via the user interface.

  • toolName will show when connecting to the Jupyter tool via the user interface.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 3: Connecting to the Jupyter Notebook

  1. Select the newly created workspace with the Jupyter application that you want to connect to

  2. Click CONNECT

  3. Select the Jupyter tool. The selected tool is opened in a new tab on your browser.

To connect to the Jupyter Notebook, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>

Next Steps

Manage and monitor your newly created workload using the Workloads table.

Data Volumes

Data volumes (DVs) are one type of workload assets. They offer a powerful solution for storing, managing, and sharing AI training data, promoting collaboration, simplifying data access control, and streamlining the AI development lifecycle.

Acting as a central repository for organizational data resources, data volumes can represent datasets or raw data, that is stored in Kubernetes Persistent Volume Claims (PVCs).

Once a data volume is created, it can be shared with multiple additional scopes and easily utilized by AI practitioners when submitting workloads. Shared data volumes are mounted with read-only permissions, ensuring data integrity. Any modifications to the data in a shared DV must be made by writing to the original volume of the PVC used to create the data volume.

Note

  • Data volumes are disabled by default. If you cannot see Data volumes, ask your Administrator to enable them under General settings → Workloads → Data volumes.

  • Data volumes are supported only for flexible workload submission.

Why Use a Data Volume?

  1. Sharing with multiple scopes - Data volumes can be shared across different scopes in a cluster, including projects and departments. Using data volumes allows for data reuse and collaboration within the organization.

  2. Storage saving - A single copy of the data can be used across multiple scopes

Typical Use Cases

  1. Sharing large datasets - In large organizations, the data is often stored in a remote location, which can be a barrier for large model training. Even if the data is transferred into the cluster, sharing it easily with multiple users is still challenging. Data volumes can help share the data seamlessly, with maximum security and control.

  2. Sharing data with colleagues - When sharing training results, generated datasets, or other artifacts with team members is needed, data volumes can help make the data available easily.

Prerequisites

To create a data volume, you must have a PVC data source already created. Make sure the PVC includes data before sharing it.
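
For reference, the origin PVC is a standard Kubernetes PVC created in the origin project's namespace. The manifest below is only an illustrative sketch; the name, namespace, storage class, and size are placeholders and not values required by NVIDIA Run:ai:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc            # placeholder name
  namespace: runai-my-project        # placeholder: the origin project's namespace
spec:
  accessModes:
    - ReadWriteMany                  # allows the data to be mounted by multiple workloads
  storageClassName: my-storage-class # placeholder storage class
  resources:
    requests:
      storage: 100Gi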

Data Volumes Table

The data volumes table can be found under Workload manager in the NVIDIA Run:ai platform.

The data volumes table provides a list of all the data volumes defined in the platform and allows you to manage them.

The data volumes table comprises the following columns:

Column
Description

Data volume

The name of the data volume

Description

A description of the data volume

Status

The lifecycle phases and current condition of the data volume

Scope

The scope of the data volume within the organizational tree. Click the scope name to view the organizational tree diagram

Origin project

The project of the origin PVC

Origin PVC

The original PVC from which the data volume was created that points to the same PV

Cluster

The cluster that the data volume is associated with

Created by

The user who created the data volume

Creation time

The timestamp for when the data volume was created

Last updated

The timestamp of when the data volume was last updated

Data Volumes Status

The following table describes the data volumes' condition and whether they were created successfully for the selected scope.

Status
Description

No issues found

No issues were found while creating the data volume

Issues found

Issues were found while sharing the data volume. Contact NVIDIA Run:ai support.

Creating…

The data volume is being created

Deleting...

The data volume is being deleted

No status / “-”

When the data volume’s scope is an account, the current version of the cluster is not up to date, or the asset is not a cluster-syncing entity, the status can’t be displayed

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Refresh - Click REFRESH to update the table with the latest data

Adding a New Data Volume

To create a new data volume:

  1. Click +NEW DATA VOLUME

  2. Enter a name for the data volume. The name must be unique.

  3. Optional: Provide a description of the data volume

  4. Set the project where the data is located

  5. Set a PVC from which to create the data volume

  6. Set the Scopes that will be able to mount the data volume

  7. Click CREATE DATA VOLUME

Editing a Data Volume

To edit a data volume:

  1. Select the data volume you want to edit

  2. Click Edit

  3. Click SAVE DATA VOLUME

Copying a Data Volume

To copy an existing data volume:

  1. Select the data volume you want to copy

  2. Click MAKE A COPY

  3. Enter a name for the data volume. The name must be unique.

  4. Set a new Origin PVC for your data volume, since only one Origin PVC can be used per data volume

  5. Click CREATE DATA VOLUME

Deleting a Data Volume

To delete a data volume:

  1. Select the data volume you want to delete

  2. Click DELETE

  3. Confirm you want to delete the data volume

Note

It is not possible to delete a data volume being used by an existing workload.

Using API

To view the available actions, go to the Data volumes API reference.

Set Up SSO with SAML

Single Sign-On (SSO) is an authentication scheme, allowing users to log in with a single pair of credentials to multiple, independent software systems.

This section explains the procedure to configure SSO to NVIDIA Run:ai using the SAML 2.0 protocol.

Prerequisites

Before you start, make sure you have the IDP Metadata XML available from your identity provider.

Setup

Adding the Identity Provider

  1. Go to General settings

  2. Open the Security section and click +IDENTITY PROVIDER

  3. Select Custom SAML 2.0

  4. Select either From computer or From URL to upload your identity provider metadata file

    • From computer - Click the Metadata XML file field, then select your file for upload

    • From URL - In the Metadata XML field, enter the URL to the IDP Metadata XML file

  5. You can either copy the Redirect URL and Entity ID displayed on the screen and enter them in your identity provider, or use the service provider metadata XML, which contains the same information in XML format. This file becomes available after you click SAVE in step 7.

  6. Optional: Enter the user attributes and their value in the identity provider as shown in the below table

  7. Click SAVE. After save, click Open service provider metadata XML to access the metadata file. This file can be used to configure your identity provider.

  8. Optional: Enable Auto-Redirect to SSO to automatically redirect users to your configured identity provider’s login page when accessing the platform.

Attribute
Default value in NVIDIA Run:ai
Description

User role groups

GROUPS

If it exists in the IDP, it allows you to assign NVIDIA Run:ai role groups via the IDP. The IDP attribute must be a list of strings.

Linux User ID

UID

If it exists in the IDP, it allows Researcher containers to start with the Linux User UID. Used to map access to network resources such as file systems to users. The IDP attribute must be of type integer.

Linux Group ID

GID

If it exists in the IDP, it allows Researcher containers to start with the Linux Group GID. The IDP attribute must be of type integer.

Supplementary Groups

SUPPLEMENTARYGROUPS

If it exists in the IDP, it allows Researcher containers to start with the relevant Linux supplementary groups. The IDP attribute must be a list of integers.

Email

email

Defines the user attribute in the IDP holding the user's email address, which is the user identifier in NVIDIA Run:ai.

User first name

firstName

Used as the user’s first name appearing in the NVIDIA Run:ai platform.

User last name

lastName

Used as the user’s last name appearing in the NVIDIA Run:ai platform.

Testing the Setup

  1. Open the NVIDIA Run:ai platform as an admin

  2. Add access rules to an SSO user defined in the IDP

  3. Open the NVIDIA Run:ai platform in an incognito browser tab

  4. On the sign-in page click CONTINUE WITH SSO. You are redirected to the identity provider sign in page

  5. In the identity provider sign-in page, log in with the SSO user to whom you granted access rules

  6. If you are unsuccessful in signing in to the identity provider, follow the Troubleshooting section below

Editing the Identity Provider

You can view the identity provider details and edit its configuration:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider box, click Edit identity provider

  4. You can edit either the metadata file or the user attributes

  5. You can view the identity provider URL, identity provider entity ID, and the certificate expiration date

Removing the Identity Provider

You can remove the identity provider configuration:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider card, click Remove identity provider

  4. In the dialog, click REMOVE to confirm the action

Note

To avoid losing access, removing the identity provider must be carried out by a local user.

Downloading the IDP Metadata XML File

You can download the XML file to view the identity provider settings:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider card, click Edit identity provider

  4. In the dialog, click DOWNLOAD IDP METADATA XML FILE

Troubleshooting

If testing the setup was unsuccessful, try the different troubleshooting scenarios according to the error you received. If an error still occurs, check the advanced troubleshooting section.

Troubleshooting Scenarios

Error: "Invalid signature in response from identity provider"

Description: After trying to log in, the following message is received in the NVIDIA Run:ai login page.

Mitigation:

  1. Go to the General settings menu

  2. Open the Security section

  3. In the identity provider box, check for a "Certificate expired” error

  4. If it is expired, update the SAML metadata file to include a valid certificate

Error: "401 - We’re having trouble identifying your account because your email is incorrect or can’t be found."

Description: Authentication failed because email attribute was not found.

Mitigation: Validate the user’s email attribute is mapped correctly

Error: "403 - Sorry, we can’t let you see this page. Something about permissions…"

Description: The authenticated user is missing permissions

Mitigation:

  1. Validate that either the user or its related group/s are assigned with access rules

  2. Validate the user’s groups attribute is mapped correctly

Advanced:

  1. Open the Chrome DevTools: Right-click on page → Inspect → Console tab

  2. Run the following command to retrieve and paste the user’s token: localStorage.token;

  3. Paste the token in https://jwt.io

  4. Under the Payload section validate the values of the user’s attributes
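
For reference, a decoded token payload contains the mapped user attributes. The example below is purely illustrative; names and values are placeholders:

{
  "email": "jane.doe@example.com",
  "firstName": "Jane",
  "lastName": "Doe",
  "groups": ["research-team"],
  "uid": 1234,
  "gid": 1234
}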

Advanced Troubleshooting

Validating the SAML request

The SAML login flow can be separated into two parts:

  • NVIDIA Run:ai redirects to the IDP for log-ins using a SAML Request

  • On successful log-in, the IDP redirects back to NVIDIA Run:ai with a SAML Response

Validate the SAML Request to ensure the SAML flow works as expected:

  1. Go to the NVIDIA Run:ai login screen

  2. Open the Chrome Network inspector: Right-click → Inspect on the page → Network tab

  3. On the sign-in page click CONTINUE WITH SSO.

  4. Once redirected to the Identity Provider, search in the Chrome network inspector for an HTTP request showing the SAML Request. Depending on the IDP url, this would be a request to the IDP domain name. For example, accounts.google.com/idp?1234.

  5. When found, go to the Payload tab and copy the value of the SAML Request

  6. Paste the value into a SAML decoder (e.g. https://www.samltool.com/decode.php)

  7. Validate the request:

    • The content of the <saml:Issuer> tag is the same as the Entity ID given when adding the identity provider

    • The content of the AssertionConsumerServiceURL is the same as the Redirect URI given when adding the identity provider

  8. Validate the response:

    • The user email under the <saml2:Subject> tag is the same as the logged-in user

    • Make sure that under the <saml2:AttributeStatement> tag, there is an Attribute named email (lowercase). This attribute is mandatory.

    • If other, optional user attributes (groups, firstName, lastName, uid, gid) are mapped make sure they also exist under <saml2:AttributeStatement> along with their respective values.
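
For reference, the attribute statement in a valid SAML Response looks roughly like the following. This is an illustrative sketch; the attribute values are placeholders:

<saml2:AttributeStatement>
  <saml2:Attribute Name="email">
    <saml2:AttributeValue>jane.doe@example.com</saml2:AttributeValue>
  </saml2:Attribute>
  <saml2:Attribute Name="groups">
    <saml2:AttributeValue>research-team</saml2:AttributeValue>
  </saml2:Attribute>
</saml2:AttributeStatement>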

Advanced Control Plane Configurations

Helm Chart Values

The NVIDIA Run:ai control plane installation can be customized to support your environment via values files or --set flags as part of the Helm install. Make sure to restart the relevant NVIDIA Run:ai pods so they can fetch the new configurations.

Key
Change
Description

global.ingress.ingressClass

Ingress class

NVIDIA Run:ai default is using NGINX. If your cluster has a different ingress controller, you can configure the ingress class to be created by NVIDIA Run:ai

global.ingress.tlsSecretName

TLS secret name

NVIDIA Run:ai requires the creation of a secret with domain certificate. If the runai-backend namespace already had such a secret, you can set the secret name here

<service-name>.podLabels

Pod labels

Set NVIDIA Run:ai and 3rd party services' Pod Labels in a format of key/value pairs.

<service-name>
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 250m
      memory: 256Mi

Pod request and limits

Set NVIDIA Run:ai and 3rd party services' resources

disableIstioSidecarInjection.enabled

Disable Istio sidecar injection

Disable the automatic injection of Istio sidecars across the entire NVIDIA Run:ai Control Plane services.

global.affinity

System nodes

Sets the system nodes where NVIDIA Run:ai system-level services are scheduled. Default: Prefer to schedule on nodes that are labeled with node-role.kubernetes.io/runai-system

global.customCA.enabled

Certificate authority

Enables the use of a custom Certificate Authority (CA) in your deployment. When set to true, the system is configured to trust a user-provided CA certificate for secure communication.
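
As a sketch of how a value from the table is applied, assuming the control plane is installed as the runai-backend Helm release (adjust the chart reference and values to your installation):

helm upgrade runai-backend runai-backend/control-plane -n runai-backend \
  --reuse-values \
  --set global.ingress.ingressClass=nginx   # example key taken from the table above

# Restart the NVIDIA Run:ai control plane pods so they pick up the new configuration
kubectl rollout restart deployment -n runai-backend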

Additional Third-Party Configurations

The NVIDIA Run:ai control plane chart includes multiple sub-charts of third-party components:

  • Data store - PostgreSQL (postgresql)

  • Metrics Store - Thanos (thanos)

  • Identity & Access Management - Keycloakx (keycloakx)

  • Analytics Dashboard - Grafana (grafana)

  • Caching, Queue - NATS (nats)

Tip

Click on any component to view its chart values and configurations.

PostgreSQL

If you have opted to connect to an external PostgreSQL database, refer to the additional configurations table below. Adjust the following parameters based on your connection details:

  1. Disable PostgreSQL deployment - postgresql.enabled

  2. NVIDIA Run:ai connection details - global.postgresql.auth

  3. Grafana connection details - grafana.dbUser, grafana.dbPassword

Key
Change
Description

postgresql.enabled

PostgreSQL installation

If set to false, PostgreSQL will not be installed.

global.postgresql.auth.host

PostgreSQL host

Hostname or IP address of the PostgreSQL server.

global.postgresql.auth.port

PostgreSQL port

Port number on which PostgreSQL is running.

global.postgresql.auth.username

PostgreSQL username

Username for connecting to PostgreSQL.

global.postgresql.auth.password

PostgreSQL password

Password for the PostgreSQL user specified by global.postgresql.auth.username.

global.postgresql.auth.postgresPassword

PostgreSQL default admin password

Password for the built-in PostgreSQL superuser (postgres).

global.postgresql.auth.existingSecret

Postgres Credentials (secret)

Existing secret name with authentication credentials.

global.postgresql.auth.dbSslMode

Postgres connection SSL mode

Set the SSL mode. See the full list in Protection Provided in Different Modes. Prefer mode is not supported.

postgresql.primary.initdb.password

PostgreSQL default admin password

Set the same password as in global.postgresql.auth.postgresPassword (if changed).

postgresql.primary.persistence.storageClass

Storage class

The installation is configured to work with a specific storage class instead of the default one.
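
A minimal values sketch for an external database might look like the following; hostnames and credentials are placeholders and should match your environment:

postgresql:
  enabled: false                    # disable the bundled PostgreSQL deployment
global:
  postgresql:
    auth:
      host: postgres.example.local  # placeholder host
      port: 5432
      username: runai               # placeholder username
      password: <PASSWORD>
grafana:
  dbUser: runai                     # placeholder Grafana database username
  dbPassword: <PASSWORD>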

Thanos

Note

This section applies to Kubernetes only.

Key
Change
Description

thanos.receive.persistence.storageClass

Storage class

The installation is configured to work with a specific storage class instead of the default one.

Keycloakx

The keycloakx.adminUser can only be set during the initial installation. The admin password can be changed later through the Keycloak UI, but you must also update the keycloakx.adminPassword value in the Helm chart using helm upgrade. Failing to update the Helm values after changing the password can lead to control plane services encountering errors.

Key
Change
Description

keycloakx.adminUser

User name of the internal identity provider administrator

This user is the administrator of Keycloak.

keycloakx.adminPassword

Password of the internal identity provider administrator

This password is for the administrator of Keycloak.

keycloakx.existingSecret

Keycloakx Credentials (secret)

Existing secret name with authentication credentials.

global.keycloakx.host

KeyCloak (NVIDIA Run:ai internal identity provider) host path

Override the DNS for Keycloak. This can be used to access Keycloak externally to the cluster.
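
For example, after changing the admin password through the Keycloak UI, the Helm value can be updated along these lines, assuming the standard runai-backend release and chart names:

helm upgrade runai-backend runai-backend/control-plane -n runai-backend \
  --reuse-values \
  --set keycloakx.adminPassword=<NEW-PASSWORD>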

Grafana

Key
Change
Description

grafana.db.existingSecret

Grafana database connection credentials (secret)

Existing secret name with authentication credentials.

grafana.dbUser

Grafana database username

Username for accessing the Grafana database.

grafana.dbPassword

Grafana database password

Password for the Grafana database user.

grafana.admin.existingSecret

Grafana admin default credentials (secret)

Existing secret name with authentication credentials.

grafana.adminUser

Grafana username

Override the NVIDIA Run:ai default user name for accessing Grafana.

grafana.adminPassword

Grafana password

Override the NVIDIA Run:ai default password for accessing Grafana.

Dynamic GPU Fractions

Many workloads utilize GPU resources intermittently, with long periods of inactivity. These workloads typically need GPU resources when they are running AI applications or debugging a model in development. Other workloads such as inference may utilize GPUs at lower rates than requested, but may demand higher resource usage during peak utilization. The disparity between resource request and actual resource utilization often leads to inefficient utilization of GPUs. This usually occurs when multiple workloads request resources based on their peak demand, despite operating below those peaks for the majority of their runtime.

To address this challenge, NVIDIA Run:ai has introduced dynamic GPU fractions. This feature optimizes GPU utilization by enabling workloads to dynamically adjust their resource usage. It allows users to specify a guaranteed fraction of GPU memory and compute resources with a higher limit that can be dynamically utilized when additional resources are requested.

How Dynamic GPU Fractions Work

With dynamic GPU fractions, users can submit workloads using GPU fraction Request and Limit, which is achieved by leveraging the Kubernetes Request and Limit notations. You can either:

  • Request a GPU fraction (portion) using a percentage of a GPU and specify a Limit

  • Request a GPU memory size (GB, MB) and specify a Limit

When setting a GPU memory limit either as GPU fraction or GPU memory size, the Limit must be equal to or greater than the GPU fractional memory request. Both GPU fraction and GPU memory are translated into the actual requested memory size of the Request (guaranteed resources) and the Limit (burstable resources - non guaranteed).

For example, a user can specify a workload with a GPU fraction request of 0.25 GPU, and add a limit of up to 0.80 GPU. The NVIDIA Run:ai Scheduler schedules the workload to a node that can provide the GPU fraction request (0.25), and then assigns the workload to a GPU. The GPU scheduler monitors the workload and allows it to occupy memory between 0 to 0.80 of the GPU memory (based on the Limit), where only 0.25 of the GPU memory is guaranteed to that workload. The rest of the memory (from 0.25 to 0.8) is “loaned” to the workload, as long as it is not needed by other workloads.

NVIDIA Run:ai automatically manages the state changes between Request and Limit as well as the reverse (when the balance needs to be "returned"), updating the workloads’ utilization vs. Request and Limit parameters in the metrics pane for each workload.

To guarantee fair quality of service between different workloads using the same GPU, NVIDIA Run:ai developed an extendable GPUOOMKiller (Out Of Memory Killer) component that guarantees the quality of service using Kubernetes semantics for resources of Request and Limit.

The OOMKiller capability requires adding CAP_KILL capabilities to the dynamic GPU fractions and to the NVIDIA Run:ai core scheduling module (toolkit daemon). This capability is enabled by default.

Note

Dynamic GPU fractions is enabled by default in the cluster. Disabling dynamic GPU fractions in runaiconfig removes the CAP_KILL capability.

Multi-GPU Dynamic Fractions

NVIDIA Run:ai also supports workload submission using multi-GPU dynamic fractions. Multi-GPU dynamic fractions work similarly to dynamic fractions on a single GPU workload, however, instead of a single GPU device, the NVIDIA Run:ai Scheduler allocates the same dynamic fraction pair (Request and Limit) on multiple GPU devices within the same node. For example, if practitioners develop a new model that uses 8 GPUs and requires 40GB of memory per GPU, but may want to burst out and consume up to the full GPU memory, they can allocate 8×40GB with multi-GPU fractions and a limit of 80GB (e.g. H100 GPU) instead of reserving the full memory of each GPU (e.g. 80GB). This leaves 40GB of GPU memory available on each of the 8 GPUs for other workloads within that node. This is useful during model development, where memory requirements are usually lower due to experimentation with smaller models or configurations.

This approach significantly improves GPU utilization and availability, enabling more precise and often smaller quota requirements for the end user. Time sharing where single GPUs can serve multiple workloads with dynamic fractions remains unchanged, only now, it serves multiple workloads using multi-GPUs per workload.

Setting Dynamic GPU Fractions

Note

Dynamic GPU fractions is disabled by default in the NVIDIA Run:ai UI. To use dynamic GPU fractions, it must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.

Using the compute resources asset, you can define the compute requirements by specifying your requested GPU portion or GPU memory, and set a Limit. You can then use the compute resource with any of the NVIDIA Run:ai workload types for single and multi-GPU dynamic fractions. In addition, you will be able to view the workloads’ utilization vs. Request and Limit parameters in the metrics pane for each workload.

  • Single dynamic GPU fractions - Define the compute requirement to run 1 GPU device, by specifying either a fraction (percentage) of the overall memory or specifying the memory request (GB, MB) with a Limit. The limit must be equal to or greater than the GPU fractional memory request.

  • Multi-GPU dynamic fractions - Define the compute requirement to run multiple GPU devices, by specifying either a fraction (percentage) of the overall memory or specifying the memory request (GB, MB) with a Limit. The limit must be equal to or greater than the GPU fractional memory request.

Note

When setting a workload with dynamic GPU fractions, (for example, when using it with GPU Request or GPU memory Limits), you practically make the workload burstable. This means it can use memory that is not guaranteed for that workload and is susceptible to an ‘OOM Kill’ signal if the actual owner of that memory requires it back. This applies to non-preemptive workloads as well. For that reason, it is recommended that you use dynamic GPU fractions with Interactive workloads running Notebooks. Notebook pods are not evicted when their GPU process is OOM Kill’ed. This behavior is the same as standard Kubernetes burstable CPU workloads.

Setting Dynamic GPU Fractions for External Workloads

To enable dynamic GPU fractions for workloads submitted via Kubernetes YAML, use the following annotations to define the GPU fraction configuration. You can configure either gpu-fraction or gpu-memory. You must also set the RUNAI_GPU_MEMORY_LIMIT environment variable in the first container to enforce the memory limit. This is the GPU consuming container.

Variable
Input Format
Where to Set

gpu-fraction

A portion of GPU memory as a double-precision floating-point number. Example: 0.25, 0.75.

Pod annotation (metadata.annotations)

gpu-memory

Memory size in MiB. Example: 2500, 4096. The gpu-memory values are always in MiB.

Pod annotation (metadata.annotations)

gpu-fraction-num-devices

The number of GPU devices to allocate using the specified gpu-fraction or gpu-memory value. Set this annotation only if you want to request multiple GPU devices.

Pod annotation (metadata.annotations)

RUNAI_GPU_MEMORY_LIMIT

  • To use for gpu-fraction - Specify a double-precision floating-point number. Example: 0.95

  • To use for gpu-memory - Specify a Kubernetes resource quantity format. Example: 500000000, 2500M

The limit must be equal to or greater than the GPU fractional memory request.

Environment variable in the first container

The following example YAML creates a pod that requests 2 GPU devices, each requesting 50% of memory (gpu-fraction: "0.5") and allows usage of up to 95% (RUNAI_GPU_MEMORY_LIMIT: "0.95") if available.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    user: test
    gpu-fraction: "0.5"
    gpu-fraction-num-devices: "2"
  labels:
    runai/queue: test
  name: multi-fractional-pod-job
  namespace: test
spec:
  containers:
  - image: gcr.io/run-ai-demo/quickstart-cuda
    imagePullPolicy: Always
    name: job
    env:
    - name: RUNAI_VERBOSE
      value: "1"
    - name: RUNAI_GPU_MEMORY_LIMIT
      value: "0.95"
    resources:
      limits:
        cpu: 200m
        memory: 200Mi
      requests:
        cpu: 100m
        memory: 100Mi
    securityContext:
      capabilities:
        drop: ["ALL"]
  schedulerName: runai-scheduler
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 5

Using CLI

To view the available actions, go to the CLI v2 reference and run according to your workload.
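
For example, with CLI v2 a dynamic-fraction workspace pairs a request with a limit. The flag names below follow the CLI v2 reference; verify them against your installed CLI version:

# Guaranteed 25% of a GPU device's memory, may burst up to 80% when that memory is free
runai workspace submit dyn-frac-workspace --image gcr.io/run-ai-demo/quickstart-cuda \
  --gpu-portion-request 0.25 --gpu-portion-limit 0.80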

Using API

To view the available actions, go to the API reference and run according to your workload.

Control Plane System Requirements

The NVIDIA Run:ai control plane is a Kubernetes application. This section explains the required hardware and software system requirements for the NVIDIA Run:ai control plane. Before you start, make sure to review the Installation overview.

Installer Machine

The machine running the installation script (typically the Kubernetes master) must have:

  • At least 50GB of free space

  • Docker installed

  • Helm 3.14 or later

Note

If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai software artifacts include the Helm binary.

Hardware Requirements

The following hardware requirements are for the control plane system nodes. By default, all NVIDIA Run:ai control plane services run on all available nodes.

Architecture

  • x86 – Supported for both Kubernetes and OpenShift deployments.

  • ARM – Supported for Kubernetes only. ARM is currently not supported for OpenShift.

NVIDIA Run:ai Control Plane - System Nodes

This configuration is the minimum requirement you need to install and use NVIDIA Run:ai control plane:

Component
Required Capacity

CPU

10 cores

Memory

12GB

Disk space

110GB

Note

To designate nodes to NVIDIA Run:ai system services, follow the instructions as described in System nodes.

If NVIDIA Run:ai control plane is planned to be installed on the same Kubernetes cluster as the NVIDIA Run:ai cluster, make sure the cluster Hardware requirements are considered in addition to the NVIDIA Run:ai control plane hardware requirements.

Software Requirements

The following software requirements must be fulfilled.

Operating System

  • Any Linux operating system supported by both Kubernetes and NVIDIA GPU Operator

  • Internal tests are being performed on Ubuntu 22.04 and CoreOS for OpenShift.

Network Time Protocol

Nodes are required to be synchronized by time using NTP (Network Time Protocol) for proper system functionality.

Kubernetes Distribution

NVIDIA Run:ai control plane requires Kubernetes. The following Kubernetes distributions are supported:

  • Vanilla Kubernetes

  • OpenShift Container Platform (OCP)

  • NVIDIA Base Command Manager (BCM)

  • Elastic Kubernetes Engine (EKS)

  • Google Kubernetes Engine (GKE)

  • Azure Kubernetes Service (AKS)

  • Oracle Kubernetes Engine (OKE)

  • Rancher Kubernetes Engine (RKE1)

  • Rancher Kubernetes Engine 2 (RKE2)

Note

The latest release of the NVIDIA Run:ai control plane supports Kubernetes 1.30 to 1.32 and OpenShift 4.14 to 4.18.

See the following Kubernetes version support matrix for the latest NVIDIA Run:ai releases:

NVIDIA Run:ai version
Supported Kubernetes versions
Supported OpenShift versions

v2.17

1.27 to 1.29

4.12 to 4.15

v2.18

1.28 to 1.30

4.12 to 4.16

v2.19

1.28 to 1.31

4.12 to 4.17

v2.20

1.29 to 1.32

4.14 to 4.17

v2.21 (latest)

1.30 to 1.32

4.14 to 4.18

For information on supported versions of managed Kubernetes, it's important to consult the release notes provided by your Kubernetes service provider. There, you can confirm the specific version of the underlying Kubernetes platform supported by the provider, ensuring compatibility with NVIDIA Run:ai. For an up-to-date end-of-life statement see Kubernetes Release History or OpenShift Container Platform Life Cycle Policy.

NVIDIA Run:ai Namespace

The NVIDIA Run:ai control plane uses a namespace or project (OpenShift) called runai-backend. Depending on your platform, use one of the following to create the namespace/project:

kubectl create namespace runai-backend
oc new-project runai-backend

Default Storage Class

Note

Default storage class applies for Kubernetes only.

The NVIDIA Run:ai control plane requires a default storage class to create persistent volume claims for NVIDIA Run:ai storage. The storage class, as per Kubernetes standards, controls the reclaim behavior, whether the NVIDIA Run:ai persistent data is saved or deleted when the NVIDIA Run:ai control plane is deleted.

Note

For a simple (non-production) storage class example see Kubernetes Local Storage Class. The storage class will set the directory /opt/local-path-provisioner to be used across all nodes as the path for provisioning persistent volumes. Then set the new storage class as default:

kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Kubernetes Ingress Controller

Note

Installing ingress controller applies for Kubernetes only.

The NVIDIA Run:ai control plane requires Kubernetes Ingress Controller to be installed.

  • OpenShift, RKE and RKE2 come with a pre-installed ingress controller.

  • Internal tests are being performed on NGINX, Rancher NGINX, OpenShift Router, and Istio.

  • Make sure that a default ingress controller is set.

There are many ways to install and configure different ingress controllers. The following shows a simple example to install and configure NGINX ingress controller using helm:

Vanilla Kubernetes

Run the following commands:

  • For cloud deployments, both the internal IP and external IP are required.

  • For on-prem deployments, only the external IP is needed.

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
    --namespace nginx-ingress --create-namespace \
    --set controller.kind=DaemonSet \
    --set controller.service.externalIPs="{<INTERNAL-IP>,<EXTERNAL-IP>}" # Replace <INTERNAL-IP> and <EXTERNAL-IP> with the internal and external IP addresses of one of the nodes
Managed Kubernetes (EKS, GKE, AKS)

Run the following commands:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
    --namespace nginx-ingress --create-namespace
Oracle Kubernetes Engine (OKE)

Run the following commands:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
    --namespace ingress-nginx --create-namespace \
    --set controller.service.annotations.oci.oraclecloud.com/load-balancer-type=nlb \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/is-preserve-source=True \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/security-list-management-mode=None \
    --set controller.service.externalTrafficPolicy=Local \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/subnet=<SUBNET-ID> # Replace <SUBNET-ID> with the subnet ID of one of your cluster

Fully Qualified Domain Name (FQDN)

Note

Fully Qualified Domain Name applies for Kubernetes only.

You must have a Fully Qualified Domain Name (FQDN) to install NVIDIA Run:ai control plane (ex: runai.mycorp.local). This cannot be an IP. The FQDN must be resolvable within the organization's private network.

TLS Certificate

Kubernetes

You must have a TLS certificate that is associated with the FQDN for HTTPS access. Create a Kubernetes Secret named runai-backend-tls in the runai-backend namespace and include the path to the TLS --cert and its corresponding private --key by running the following:

kubectl create secret tls runai-backend-tls -n runai-backend \
  --cert /path/to/fullchain.pem  \ # Replace /path/to/fullchain.pem with the actual path to your TLS certificate 
  --key /path/to/private.pem # Replace /path/to/private.pem with the actual path to your private key

OpenShift

NVIDIA Run:ai uses the OpenShift default Ingress router for serving. The TLS certificate configured for this router must be issued by a trusted CA. For more details, see the OpenShift documentation on configuring certificates.

Local Certificate Authority

A local certificate authority serves as the root certificate for organizations that cannot use a publicly trusted certificate authority but still require external connections or standard HTTPS authentication. Follow the steps below to configure the local certificate authority.

For air-gapped environments, you must configure the public key of your local certificate authority. It must be installed in Kubernetes for the installation to succeed:

  1. Add the public key to the runai-backend namespace:

kubectl -n runai-backend create secret generic runai-ca-cert \ 
    --from-file=runai-ca.pem=<ca_bundle_path>
oc -n runai-backend create secret generic runai-ca-cert \ 
    --from-file=runai-ca.pem=<ca_bundle_path>
  2. When installing the control plane, make sure the following flag is added to the helm command --set global.customCA.enabled=true. See Install control plane.

External Postgres Database (Optional)

The NVIDIA Run:ai control plane installation includes a default PostgreSQL database. However, you may opt to use an existing PostgreSQL database if you have specific requirements or preferences as detailed in External Postgres database configuration. Please ensure that your PostgreSQL database is version 16 or higher.

Using GB200 NVL72 and Multi-Node NVLink Domains

Multi-Node NVLink (MNNVL) systems, including NVIDIA GB200, NVIDIA GB200 NVL72 and its derivatives are fully supported by the NVIDIA Run:ai platform.

Kubernetes does not natively recognize NVIDIA’s MNNVL architecture, which makes managing and scheduling workloads across these high-performance domains more complex. The NVIDIA Run:ai platform simplifies this by abstracting the complexity of MNNVL configuration. Without this abstraction, optimal performance on a GB200 NVL72 system would require deep knowledge of NVLink domains, their hardware dependencies, and manual configuration for each distributed workload. NVIDIA Run:ai automates these steps, ensuring high performance with minimal effort. While GB200 NVL72 supports all workload types, distributed training workloads benefit most from its accelerated GPU networking capabilities.

To learn more about GB200, MNNVL and related NVIDIA technologies, refer to the following:

  • NVIDIA GB200 NVL72

  • NVIDIA Blackwell datasheet

  • NVIDIA Multi-Node NVLink Systems

Benefits of Using GB200 NVL72 with NVIDIA Run:ai

The NVIDIA Run:ai platform enables administrators, researchers, and MLOps engineers to fully leverage GB200 NVL72 systems and other NVLink-based domains without requiring deep knowledge of hardware configurations or NVLink topologies. Key capabilities include:

  • Automatic detection and labeling

    • Detects GB200 NVL72 nodes and identifies MNNVL domains (e.g., GB200 NVL72 racks).

    • Automatically detects whether a node pool contains GB200 NVL72.

    • Supports manual override of GB200 MNNVL detection and label key for future compatibility and improved resiliency.

  • Simplified distributed workload submission

    • Allows seamless submission of distributed workloads into GB200-based node pools, eliminating all the complexities involved with that operation on top of GB200 MNNVL domains.

    • Abstracts away the complexity of configuring workloads for NVL domains.

  • Flexible support for NVLink domain variants

    • Compatible with current and future NVL domain configurations.

    • Supports any number of domains or GB200 racks.

  • Enhanced monitoring and visibility

    • Provides detailed NVIDIA Run:ai dashboards for monitoring GB200 nodes and MNNVL domains by node pool.

  • Control and customization

    • Offers manual override and label configuration for greater resiliency and future-proofing.

    • Enables advanced users to fine-tune GB200 scheduling behavior based on workload requirements.

Prerequisites

  • Ensure that NVIDIA's GPU Operator version 25.3 or higher is installed: GPU Operator v25.3 Release Notes. This version must include the associated Dynamic Resource Allocation (DRA) driver, which provides support for GB200 accelerated networking resources and the ComputeDomain feature. For detailed steps on installing the DRA driver and configuring ComputeDomain, refer to the documentation for your installed GPU Operator version.

  • After the DRA driver is installed, update runaiconfig using the spec.workload-controller.GPUNetworkAccelerationEnabled=True flag to enable GPU network acceleration. This triggers an update of the NVIDIA Run:ai workload-controller deployment and restarts the controller. See Advanced cluster configurations for more details.
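
A minimal sketch of this update, assuming the runaiconfig resource uses the default name (runai) and namespace (runai) of a standard cluster installation:

kubectl patch runaiconfig runai -n runai --type merge \
  -p '{"spec": {"workload-controller": {"GPUNetworkAccelerationEnabled": true}}}'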

Configuring and Managing GB200 NVL72 Domains

Administrators must define dedicated node pools that align with GB200 NVL72 rack topologies. These node pools ensure that workloads are isolated to nodes with NVLink interconnects and are not scheduled on incompatible hardware. Each node pool can be manually configured in the NVIDIA Run:ai platform and associated with specific node labels. Two key configurations are required for each node pool:

  • Node Labels – Identify nodes equipped with GB200.

  • MNNVL Domain Discovery – Specify how the platform detects whether the node pool includes NVLink-connected nodes.

To create a node pool with GPU network acceleration, see Node pools.

Identifying GB200 Nodes

To enable the NVIDIA Run:ai Scheduler to recognize GB200-based nodes, administrators must:

  • Use the default node label provided by the NVIDIA GPU Operator - nvidia.com/gpu.clique.

  • Or, apply a custom label that clearly marks the node as GB200/MNNVL capable.

This node label serves as the basis for identifying appropriate nodes and ensuring workloads are scheduled on the correct hardware.

Enabling MNNVL Domain Discovery

The administrator can configure how the NVIDIA Run:ai platform detects MNNVL domains for each node pool. The available options include:

  • Automatic Discovery – Uses the default label key nvidia.com/gpu.clique, or a custom label key specified by the administrator. The NVIDIA Run:ai platform automatically discovers MNNVL domains within node pools. If a node is labeled with the MNNVL label key, the NVIDIA Run:ai platform indicates this node pool as MNNVL detected. MNNVL detected node pools are treated differently by the NVIDIA Run:ai platform when submitting a distributed training workload.

  • Manual Discovery – The platform does not evaluate any node labels. Detection is based solely on the administrator’s configuration of the node pool as MNNVL “Detected” or “Not Detected.”

When automatic discovery is enabled, all GB200 nodes that are part of the same physical rack (NVL72 or other future topologies) belong to the same NVL domain and are automatically labeled by the GPU Operator with a common label key and a unique label value per domain and sub-domain. The default label key set by the NVIDIA GPU Operator is nvidia.com/gpu.clique and its value has the format <NVL Domain ID (ClusterUUID)>.<Clique ID>:

  • The NVL Domain ID (ClusterUUID) is a unique identifier that represents the physical NVL domain, for example, a physical GB200 NVL72 rack.

  • The Clique ID denotes a logical MNNVL sub-domain. A clique represents a further logical split of the MNNVL into smaller domains that enable secure, fast, and isolated communication between pods running on different GB200 nodes within the same GB200 NVL72.

The Nodes table provides more information on which GB200 NVL72 domain each node belongs to, and which Clique ID it is associated with.
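For example, you can inspect this label directly on the nodes with kubectl (a minimal sketch, assuming the default nvidia.com/gpu.clique label key):

# List nodes together with their NVL domain / clique assignment;
# the printed value has the form <NVL Domain ID (ClusterUUID)>.<Clique ID>
kubectl get nodes -L nvidia.com/gpu.clique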

Submitting Distributed Training Workloads

When a distributed training workload is submitted to an MNNVL-detected node pool, the NVIDIA Run:ai platform automates several key configuration steps to ensure optimal workload execution:

  • ComputeDomain creation - The NVIDIA Run:ai platform creates a ComputeDomain custom resource (based on the DRA driver's ComputeDomain CRD), a dedicated resource used to manage NVLink-based domain assignments.

  • Resource Claim injection - A reference to the ComputeDomain is automatically added to the workload specification as a resource claim, allowing the Scheduler to link the workload to a specific NVLink domain.

  • Pod affinity configuration - Pod affinity is applied using a Preferred policy with the MNNVL label key (e.g., nvidia.com/gpu.clique) as the topology key. This ensures that pods within the distributed workload are located on nodes with NVLink interconnects.

  • Node affinity configuration - Node affinity is also applied using a Preferred policy based on the same label key, further guiding the Scheduler to place workloads within the correct node group.

These additional steps are crucial for the creation of underlying HW resources (also known as IMEX channels) and stickiness of the distributed workload to MNNVL topologies and nodes. When a distributed workload is stopped or evicted, the platform automatically removes the corresponding ComputeDomain.

Best Practices for MNNVL Node Pool Management

  • When submitting a distributed workload, you should explicitly specify a list of one or more MNNVL detected node pools, or a list of one or more non-MNNVL detected node pools. A mix of MNNVL detected and non-MNNVL detected node pools is not supported. A GB200 MNNVL node pool is a pool that contains at least one node belonging to an MNNVL domain.

  • Other workload types (not distributed) can include a list of mixed MNNVL and non-MNNVL node pools, from which the Scheduler will choose.

  • MNNVL node pools can include any size of MNNVL domains (i.e. NVL72 and any future domain size) and support any Grace-Blackwell models (GB200 and any future models).

  • To support the submission of larger distributed workloads, it is recommended to group as many GB200 racks as possible into fewer node pools. When possible, use a single GB200 node pool, unless there is a specific operational reason to divide resources across multiple node pools.

  • When submitting distributed training workloads with the controller pod set as a distinct non-GPU workload, the MNNVL feature should be used with the default Preferred mode as explained in the below section.

Fine-tuning Scheduling Behavior for MNNVL

You can influence how the Scheduler places distributed training workloads into GB200 MNNVL node pools using the Topology field available in the distributed training workload submission form.

Note

The following options are based on inter-pod affinity rules, which define how pods are grouped based on topology.

  • Confine a workload to a single GB200 MNNVL domain - To ensure the workload is scheduled within a single GB200 MNNVL domain (e.g., a GB200 NVL72 rack), apply a topology label with a Required policy using the MNNVL label key (nvidia.com/gpu.clique). This instructs the Scheduler to strictly place all pods within the same MNNVL domain. If the workload exceeds 18 pods (or 72 GPUs), the Scheduler will not be able to find a matching domain and will fail to schedule the workload.

  • Try to schedule a workload using a Preferred topology - To guide the Scheduler to prioritize a specific topology without enforcing it, apply a topology label with a policy of Preferred. You can apply any topology label with a Preferred policy. These labels are treated with higher scheduling weight than the default Preferred pod affinity automatically applied by NVIDIA Run:ai for MNNVL.

  • Mandate a custom topology - To force scheduling a workload into a custom topology, add a topology label with a policy of Required. This ensures the workload is strictly scheduled according to the specified topology. Keep in mind that using a Required policy can significantly constrain scheduling. If matching resources are not available, the Scheduler may fail to place the workload.

Fine-tuning MNNVL per Workload

You can customize how the NVIDIA Run:ai platform applies the MNNVL feature to each distributed training workload. This allows you to override the default behavior when needed. To configure this behavior, set the proprietary label key run.ai/MNNVL in the General settings section of the distributed training workload submission form. The following values are supported:

  • None - Disables the MNNVL feature for the workload. The platform does not create a ComputeDomain and no pod affinity or node affinity is applied by default.

  • Preferred (default) - Indicates that MNNVL feature is preferred but not required. This is the default behavior when submitting a distributed training workload:

    • If the workload is submitted to a 'non-MNNVL detected' node pool, then the NVIDIA Run:ai platform does not add a ComputeDomain, ComputeDomain claim, pod affinity or node affinity for MNNVL nodes.

    • Otherwise, if the workload is submitted to a 'MNNVL detected' node pool, then the NVIDIA Run:ai platform automatically adds: ComputeDomain, ComputeDomain claim, NodeAffinity and PodAffinity both with a Preferred policy and using the MNNVL label.

    • If you manually add an additional Preferred topology label, it will be given higher scheduling weight than the default embedded pod affinity (which has weight = 1).

  • Required - Enforces a strict use of MNNVL domains for the workload. The workload must be scheduled on MNNVL supported nodes:

    • The NVIDIA Run:ai platform creates a ComputeDomain and ComputeDomain claim.

    • The NVIDIA Run:ai platform will automatically add a node affinity rule with a Required policy using the appropriate label.

    • Pod affinity is set to Preferred by default, but you can override it manually with a Required pod affinity rule using the MNNVL label key or another custom label.

    • If any of the targeted node pools do not support MNNVL or if the workload (or any of its pods) does not request GPU resources, the workload will fail to run.

Known Limitations and Compatibility

  • If the DRA driver is not installed correctly in the cluster, particularly if the required CRDs are missing, and the MNNVL feature is enabled in the NVIDIA Run:ai platform, the workload controller will enter a crash loop. This will continue until the DRA driver is properly installed with all necessary CRDs or the MNNVL feature is disabled in the NVIDIA Run:ai platform.

  • To run workloads on a GB200 node pool (i.e., a node pool detected as MNNVL-enabled), the workload must explicitly request that node pool. To prevent unintentional use of MNNVL-detected node pools, administrators must ensure these node pools are not included in any project's default list of node pools.

  • Only one distributed training workload per node can use GB200 accelerated networking resources. If GPUs remain unused on that node, other workload types may still utilize them.

  • If a GB200 node fails, any associated pod will be re-scheduled, causing the entire distributed workload to fail and restart. On non-GB200 nodes, this scenario may be self-healed by the Scheduler without impacting the entire workload.

  • If a pod from a distributed training workload fails or is evicted by the Scheduler, it must be re-scheduled on the same node. Otherwise, the entire workload will be evicted and, in some cases, re-queued.

  • Elastic distributed training workloads are not supported with MNNVL.

  • Workloads created in versions earlier than 2.21 do not include GB200 MNNVL node pools and are therefore not expected to experience compatibility issues.

  • If a node pool that was previously used in a workload submission is later updated to include GB200 nodes (i.e., becomes a mixed node pool), the workload submitted before version 2.21 will not use any accelerated networking resources, although it may still run on GB200 nodes.

Introduction to Workloads

NVIDIA Run:ai enhances visibility and simplifies management by monitoring, presenting and orchestrating all AI workloads in the clusters where it is installed. Workloads are the fundamental building blocks for consuming resources, enabling AI practitioners such as researchers, data scientists and engineers to efficiently support the entire life cycle of an AI initiative.

Workloads Across the AI Life Cycle

A typical AI initiative progresses through several key stages, each with distinct workloads and objectives. With NVIDIA Run:ai, research and engineering teams can host and manage all these workloads to achieve the following:

  • Data preparation: Aggregating, cleaning, normalizing, and labeling data to prepare for training.

  • Training: Conducting resource-intensive model development and iterative performance optimization.

  • Fine-tuning: Adapting pre-trained models to domain-specific datasets while balancing efficiency and performance.

  • Inference: Deploying models for real-time or batch predictions with a focus on low latency and high throughput.

  • Monitoring and optimization: Ensuring ongoing performance by addressing data drift, usage patterns, and retraining as needed.

What is a Workload?

A workload runs in the cluster, is associated with a namespace, and operates to fulfill its targets, whether that is running to completion for a batch job, allocating resources for experimentation in an integrated development environment (IDE)/notebook, or serving inference requests in production.

The workload, defined by the AI practitioner, consists of:

  • Container images: This includes the application, its dependencies, and the runtime environment.

  • Compute resources: CPU, GPU, and RAM to execute efficiently and address the workload’s needs.

  • Data & storage configuration: The data needed for processing such as training and testing datasets or input from external databases, and the storage configuration which refers to the way this data is managed, stored and accessed.

  • Credentials: The access to certain data sources or external services, ensuring proper authentication and authorization.

Workload Scheduling and Orchestration

NVIDIA Run:ai’s core mission is to optimize AI resource usage at scale. This is achieved through efficient scheduling and orchestration of all cluster workloads using the NVIDIA Run:ai Scheduler. The Scheduler allows the prioritization of workloads across different departments and projects within the organization at large scale, based on the resource distribution set by the system administrator.

NVIDIA Run:ai and Third-Party Workloads

  • NVIDIA Run:ai workloads: These workloads are submitted via the NVIDIA Run:ai platform. They are represented by Kubernetes Custom Resource Definitions (CRDs) and APIs. When using NVIDIA Run:ai workloads, a complete workload and scheduling policy solution is offered, allowing administrators to ensure that optimization, governance and security standards are applied.

  • Third-party workloads: These workloads are submitted via third-party applications that use the NVIDIA Run:ai Scheduler. The NVIDIA Run:ai platform manages and monitors these workloads. They enable seamless integrations with external tools, allowing teams and individuals flexibility.

Levels of Support

Different workload types have different levels of support. It is important to understand which capabilities you need before selecting a workload type. The table below details the level of support for each workload type in NVIDIA Run:ai. NVIDIA Run:ai workloads are fully supported with all of NVIDIA Run:ai’s advanced features and capabilities, while third-party workloads are only partially supported. The list of capabilities can change between NVIDIA Run:ai versions.

| Functionality | NVIDIA Run:ai Workspace | NVIDIA Run:ai Training - Standard | NVIDIA Run:ai Training - distributed | NVIDIA Run:ai Inference | Third-party workloads |
|---|---|---|---|---|---|
| Fairness | v | v | v | v | v |
| Priority and preemption | v | v | v | v | v |
| Over quota | v | v | v | v | v |
| Node pools | v | v | v | v | v |
| Bin packing / Spread | v | v | v | v | v |
| Multi-GPU fractions | v | v | v | v | v |
| Multi-GPU dynamic fractions | v | v | v | v | v |
| Node level scheduler | v | v | v | v | v |
| Multi-GPU memory swap | v | v | v | v | v |
| Elastic scaling | NA | NA | v | v | v |
| Gang scheduling | v | v | v | v | v |
| Monitoring | v | v | v | v | v |
| Workload awareness | v | v | v | v | v |
| RBAC | v | v | v | v | v |
| Workload submission | v | v | v | v | |
| Workload actions (stop/run) | v | v | v | | |
| Rolling updates | | | | v | |
| Workload Policies | v | v | v | v | |
| Scheduling rules | v | v | v | v | |

Workload awareness refers to specific workload-aware visibility, so that different pods are identified and treated as a single workload (for example GPU utilization, workload view, dashboards).

Compute Resources

This article explains what compute resources are and how to create and use them.

Compute resources are one type of workload asset. A compute resource is a template that simplifies how workloads are submitted and can be used by AI practitioners when they submit their workloads.

A compute resource asset is a preconfigured building block that encapsulates all the specifications of compute requirements for the workload including:

  • GPU devices and GPU memory

  • CPU memory and CPU compute

Compute Resource Table

The Compute resource table can be found under Workload manager in the NVIDIA Run:ai UI.

The Compute resource table provides a list of all the compute resources defined in the platform and allows you to manage them.

The Compute resource table consists of the following columns:

  • Compute resource - The name of the compute resource

  • Description - A description of the essence of the compute resource

  • GPU devices request per pod - The number of requested physical devices per pod of the workload that uses this compute resource

  • GPU memory request per device - The amount of GPU memory per requested device that is granted to each pod of the workload that uses this compute resource

  • CPU memory request - The minimum amount of CPU memory per pod of the workload that uses this compute resource

  • CPU memory limit - The maximum amount of CPU memory per pod of the workload that uses this compute resource

  • CPU compute request - The minimum number of CPU cores per pod of the workload that uses this compute resource

  • CPU compute limit - The maximum number of CPU cores per pod of the workload that uses this compute resource

  • Scope - The scope of this compute resource within the organizational tree. Click the name of the scope to view the organizational tree diagram

  • Workload(s) - The list of workloads associated with the compute resource

  • Template(s) - The list of workload templates that use this compute resource

  • Created by - The name of the user who created the compute resource

  • Creation time - The timestamp of when the compute resource was created

  • Last updated - The timestamp of when the compute resource was last updated

  • Cluster - The cluster that the compute resource is associated with

Workloads Associated with the Compute Resource

Click one of the values in the Workload(s) column to view the list of workloads and their parameters.

  • Workload - The workload that uses the compute resource

  • Type - Workspace/Training/Inference

  • Status - Represents the workload lifecycle. See the full list of workload statuses.

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

Adding a New Compute Resource

To add a new compute resource:

  1. Go to the Compute resource table

  2. Click +NEW COMPUTE RESOURCE

  3. Select under which cluster to create the compute resource

  4. Select a scope

  5. Enter a name for the compute resource. The name must be unique.

  6. Optional: Provide a description of the essence of the compute resource

  7. Set the resource types needed within a single node (the NVIDIA Run:ai scheduler tries to match a single node that complies with the compute resource for each of the workload’s pods)

    • GPU

      • GPU devices per pod - The number of devices (physical GPUs) per pod (for example, if you requested 3 devices per pod and the running workload using this compute resource consists of 3 pods, there are 9 physical GPU devices used in total)

        Note

        • When set to zero, the workload using this compute resource neither requests nor uses GPU resources while running

        • You can set any number of GPU devices, and specify the GPU memory requirement per device either as a portion of a device (1-100%) or as an explicit memory size in GB or MB

  • GPU memory per device

    • Select the memory request format

      • % (of device) - Fraction of a GPU device’s memory

      • MB (memory size) - An explicit GPU memory unit

      • GB (memory size) - An explicit GPU memory unit

        • Set the memory Request - The minimum amount of GPU memory that is provisioned per device. This means that any pod of a running workload that uses this compute resource, receives this amount of GPU memory for each device(s) the pod utilizes

        • Optional: Set the memory Limit - The maximum amount of GPU memory that is provisioned per device. This means that any pod of a running workload that uses this compute resource, receives at most this amount of GPU memory for each device(s) the pod utilizes. To set a Limit, first enable the limit toggle. The limit value must be equal to or higher than the request.

Note

  • GPU memory limit is disabled by default. If you cannot see the Limit toggle in the compute resource form, then it must be enabled by your Administrator, under General settings → Resources → GPU resource optimization.

  • When a Limit is set and is bigger than the Request, the scheduler allows each pod to reach the maximum amount of GPU memory in an opportunistic manner (only upon availability).

  • If the GPU memory Limit is bigger than the Request, the pod is prone to be killed by the NVIDIA Run:ai toolkit (out-of-memory signal). The greater the difference between the GPU memory used and the Request, the higher the risk of being killed.

  • If GPU resource optimization is turned off, the minimum and maximum are in fact equal.

  • CPU

    • CPU compute per pod

      • Select the units for the CPU compute (Cores / Millicores)

      • Set the CPU compute Request - the minimum amount of CPU compute that is provisioned per pod. This means that any pod of a running workload that uses this compute resource, receives this amount of CPU compute for each pod.

      • Optional: Set the CPU compute Limit - The maximum amount of CPU compute that is provisioned per pod. This means that any pod of a running workload that uses this compute resource, receives at most this amount of CPU compute. To set a Limit, first enable the limit toggle. The limit value must be equal to or higher than the request. By default, the limit is set to “Unlimited” - which means that the pod may consume all the node's free CPU compute resources.

    • CPU memory per pod

      • Select the units for the CPU memory (MB / GB)

      • Set the CPU memory Request - The minimum amount of CPU memory that is provisioned per pod. This means that any pod of a running workload that uses this compute resource, receives this amount of CPU memory for each pod.

      • Optional: Set the CPU memory Limit - The maximum amount of CPU memory that is provisioned per pod. This means that any pod of a running workload that uses this compute resource, receives at most this amount of CPU memory. To set a Limit, first enable the limit toggle. The limit value must be equal to or higher than the request. By default, the limit is set to “Unlimited” - Meaning that the pod may consume all the node's free CPU memory resources.

Note

If the CPU memory Limit is bigger than the Request, the pod is prone to be killed by the operating system (out-of-memory signal). The greater the difference between the CPU memory used and the Request, the higher the risk of being killed.

  8. Optional: More settings

  • Increase shared memory size - When enabled, the shared memory size available to the pod is increased from the default 64MB to the node's total available memory, or to the CPU memory limit if set above.

  • Set extended resource(s) - Click +EXTENDED RESOURCES to add resource/quantity pairs. For more information on how to set extended resources, see the Extended resources and Quantity guides.

  9. Click CREATE COMPUTE RESOURCE

Note

It is also possible to add compute resources directly when creating a specific workspace, training or inference workload.

Editing a Compute Resource

To edit a compute resource:

  1. Select the compute resource you want to edit

  2. Click Edit

  3. Update the compute resource and click SAVE COMPUTE RESOURCE

Note

Workloads that are already bound to this asset are not affected.

Copying a Compute Resource

To copy an existing compute resource:

  1. Select the compute resource you want to copy

  2. Click MAKE A COPY

  3. Enter a name for the compute resource. The name must be unique.

  4. Update the compute resource and click CREATE COMPUTE RESOURCE

Deleting a Compute Resource

  1. Select the compute resource you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm

Note

Workloads that are already bound to this asset are not affected.

Using API

Go to the API reference to view the available actions


Policy YAML Examples

This article provides examples of:

  1. Creating a new rule within a policy

  2. Best practices for adding sections to a policy

  3. A full example of a whole policy

Creating a New Rule Within a Policy

This example shows how to add a new limitation to the GPU usage for workloads of type workspace:

  1. Check the workload API fields documentation and select the field(s) that are most relevant for GPU usage.

  2. Search for the field in the Policy YAML fields reference table. For example, gpuDevicesRequest appears under the Compute fields sub-table as follows:

    • gpuDevicesRequest - Specifies the number of GPUs to allocate for the created workload. Only if gpuDevicesRequest = 1, the gpuRequestType can be defined. Value type: integer. Supported NVIDIA Run:ai workload type: Workspace & Training.
  3. Use the value type of the gpuDevicesRequest field indicated in the table - “integer” - and navigate to the Value types table to view the possible rules that can be applied to this value type:

    for integer, the options are:

    • canEdit

    • required

    • min

    • max

    • step

  4. Proceed to the Rule Type table, select the required rule for the limitation of the field - for example, max - and use the example syntax to indicate the maximum number of GPU devices requested (for instance, max: 2 under compute.gpuDevicesRequest).

Policy YAML Best Practices

Create a policy that has multiple defaults and rules

Best practices description

Presentation of the syntax while adding a set of defaults and rules

Example

Allow only single selection out of many

Best practices description

Blocking the option to create all types of data sources except the one that is allowed is the solution

Example

Create a robust set of guidelines

Best practices description

Set rules for specific compute resource usage, addressing most relevant spec fields

Example

Policy for distributed training workloads

Best practices description

Set rules and defaults for a distributed training workload with different setting for master and workers

Example

Examples for specific sections in the policy

Best practices description

Environment creation

Example

Best practices description

Setting security measures

Example

Best practices description

Impose an asset

Example

Example of a Whole Policy

Launching Workloads with Dynamic GPU Fractions

This quick start provides a step-by-step walkthrough for running a Jupyter Notebook with dynamic GPU fractions.

NVIDIA Run:ai’s dynamic GPU fractions optimizes GPU utilization by enabling workloads to dynamically adjust their resource usage. It allows users to specify a guaranteed fraction of GPU memory and compute resources with a higher limit that can be dynamically utilized when additional resources are requested.

Note

If enabled by your Administrator, the NVIDIA Run:ai UI allows you to create a new workload using either the Original or Flexible workload submission form. The steps in this quick start guide reflect the Original form only.

Prerequisites

Before you start, make sure:

  • You have created a project or have one created for you.

  • The project has an assigned quota of at least 0.5 GPU.

  • Dynamic GPU fractions is enabled.

Note

Dynamic GPU fractions is disabled by default in the NVIDIA Run:ai UI. To use dynamic GPU fractions, it must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.

Step 1: Logging In

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Run the below --help command to obtain the login options and log in according to your setup:
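A minimal sketch, assuming the NVIDIA Run:ai CLI is installed as runai:

# Show the available login options, then log in according to your setup
runai login --help
runai login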

To use the API, you will first need to obtain an API access token.

Step 2: Submitting the First Workspace

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Workspace

  3. Select under which cluster to create the workload

  4. Select the project in which your workspace will run

  5. Select a preconfigured template or select Start from scratch to launch a new workspace quickly

  6. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Create an environment for your workspace

    • Click +NEW ENVIRONMENT

    • Enter a name for the environment. The name must be unique.

    • Enter the Image URL - gcr.io/run-ai-lab/pytorch-example-jupyter

    • Tools - Set the connection for your tool

      • Click +TOOL

      • Select Jupyter tool from the list

    • Set the runtime settings for the environment

      • Click +COMMAND

      • Enter command - start-notebook.sh

      • Enter arguments - --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''

      Note: If host-based routing is enabled on the cluster, enter the --NotebookApp.token='' argument only.

    • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  9. Create a new “request-limit” compute resource

    • Click +NEW COMPUTE RESOURCE

    • Enter a name for the compute resource. The name must be unique.

    • Set GPU devices per pod - 1

    • Set GPU memory per device

      • Select GB (memory size) - An explicit GPU memory unit

      • Set the memory Request - 4GB (the workload will allocate 4GB of the GPU memory)

      • Toggle Limit and set it to 12 GB

    • Optional: set the CPU compute per pod - 0.1 cores (default)

    • Optional: set the CPU memory per pod - 100 MB (default)

    • Select More settings and toggle Increase shared memory size

    • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  10. Click CREATE WORKSPACE

Copy the following command to your terminal. Make sure to update it with the name of your project and workload:

Copy the following command to your terminal. Make sure to update the parameters listed below; a rough sketch of the request is shown after the parameter list:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1: Logging In.

  • <PROJECT-ID> - The ID of the project the workload is running on. You can get the project ID via the API.

  • <CLUSTER-UUID> - The unique identifier of the cluster. You can get the cluster UUID via the API.

  • toolType will show the Jupyter icon when connecting to the Jupyter tool via the user interface.

  • toolName will show when connecting to the Jupyter tool via the user interface.
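The original API snippet is not reproduced here. As a rough, hedged sketch only, a request of roughly this shape can be sent to the Workspaces API; the endpoint path, the name/projectId/clusterId top-level fields, and the exposedUrls block (with toolType and toolName) are assumptions to verify against the API reference, while the compute field names (gpuDevicesRequest, gpuMemoryRequest, gpuMemoryLimit, and so on) follow the workload compute spec shown in this document:

# Rough sketch only - verify the endpoint and field names against the API reference.
# Replace <COMPANY-URL>, <TOKEN>, <PROJECT-ID> and <CLUSTER-UUID> with the values described above.
curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \
  -H 'Authorization: Bearer <TOKEN>' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "dynamic-fractions-workspace",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
      "image": "gcr.io/run-ai-lab/pytorch-example-jupyter",
      "compute": {
        "gpuDevicesRequest": 1,
        "gpuRequestType": "memory",
        "gpuMemoryRequest": "4G",
        "gpuMemoryLimit": "12G",
        "cpuCoreRequest": 0.1,
        "cpuMemoryRequest": "100M",
        "largeShmRequest": true
      },
      "exposedUrls": [
        { "toolType": "jupyter-notebook", "toolName": "Jupyter" }
      ]
    }
  }'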

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 3: Submitting the Second Workspace

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Workspace

  3. Select the cluster where the previous workspace was created

  4. Select the project where the previous workspace was created

  5. Select a preconfigured template or select Start from scratch to launch a new workspace quickly

  6. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Select the environment created in Step 2: Submitting the First Workspace

  9. Select the compute resource created in Step 2: Submitting the First Workspace

  10. Click CREATE WORKSPACE

Copy the following command to your terminal. Make sure to update it with the name of your project and workload:

Copy the following command to your terminal. Make sure to update the following parameters:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1: Logging In.

  • <PROJECT-ID> - The ID of the project the workload is running on. You can get the project ID via the API.

  • <CLUSTER-UUID> - The unique identifier of the cluster. You can get the cluster UUID via the API.

  • toolType will show the Jupyter icon when connecting to the Jupyter tool via the user interface.

  • toolName will show when connecting to the Jupyter tool via the user interface.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 4: Connecting to the Jupyter Notebook

  1. Select the newly created workspace with the Jupyter application that you want to connect to

  2. Click CONNECT

  3. Select the Jupyter tool. The selected tool is opened in a new tab on your browser.

  4. Open a terminal and use the watch nvidia-smi command to get a constant reading of the memory consumed by the pod. Note that the number shown in the memory box is the Limit, not the Request or guarantee.

  5. Open the file Untitled.ipynb and move the frame so you can see both tabs

  6. Execute both cells in Untitled.ipynb. This will consume about 3 GB of GPU memory and be well below the 4GB of the GPU Memory Request value.

  7. In the second cell, edit the value after --image-size from 100 to 200 and run the cell. This will increase the GPU memory utilization to about 11.5 GB which is above the Request value.

  1. To connect to the Jupyter Notebook, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>

  2. Open a terminal and use the watch nvidia-smi command to get a constant reading of the memory consumed by the pod. Note that the number shown in the memory box is the Limit, not the Request or guarantee.

  3. Open the file Untitled.ipynb and move the frame so you can see both tabs

  4. Execute both cells in Untitled.ipynb. This will consume about 3 GB of GPU memory and be well below the 4GB of the GPU Memory Request value.

  5. In the second cell, edit the value after --image-size from 100 to 200 and run the cell. This will increase the GPU memory utilization to about 11.5 GB which is above the Request value.


Next Steps

Manage and monitor your newly created workload using the Workloads table.

Departments

This section explains the procedure for managing departments

Departments are a grouping of projects. By grouping projects into a department, you can set quota limitations for a set of projects, create policies that are applied to the department, and create assets that can be scoped to the whole department or to a partial group of descendant projects.

For example, in an academic environment, a department can be the Physics Department grouping various projects (AI Initiatives) within the department, or grouping projects where each project represents a single student.

Departments Table

The Departments table can be found under Organization in the NVIDIA Run:ai platform.

Note

Departments are disabled by default. If you cannot see Departments in the menu, it must be enabled by your Administrator, under General settings → Resources → Departments.

The Departments table lists all departments defined for a specific cluster and allows you to manage them. You can switch between clusters by selecting your cluster using the filter at the top.

The Departments table consists of the following columns:

Column
Description

Node Pools with Quota Associated with the Department

Click one of the values in the Node pool(s) with quota column to view the list of node pools and their parameters.

Column
Description

Subjects Authorized for the Department

Click one of the values in the Subject(s) column to view the list of subjects and their parameters. This column is only viewable if your role in the NVIDIA Run:ai system affords you those permissions.

Column
Description

Note

A role given in a certain scope, means the role applies to this scope and any descendant scopes in the organizational tree.

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

Adding a New Department

To create a new Department:

  1. Click +NEW DEPARTMENT

  2. Select a scope. By default, the field contains the scope of the current UI context cluster, viewable at the top left side of your screen. You can change the current UI context cluster by clicking the ‘Cluster: cluster-name’ field and applying another cluster as the UI context. Alternatively, you can choose another cluster within the ‘+ New Department’ form by clicking the organizational tree icon on the right side of the scope field, opening the organizational tree and selecting one of the available clusters.

  3. Enter a name for the department. Department names must start with a letter and can only contain lowercase Latin letters, numbers, or a hyphen ('-').

  4. In the Quota management section, you can set the quota parameters and prioritize resources

    • Order of priority - This column is displayed only if more than one node pool exists. The node pool order of priority in Departments/Quota management sets the default node pool order of priority for newly created projects under that department; the Administrator can then change the order per project. The order of priority sets the order in which the Scheduler uses node pools to schedule a workload, and it is effective for projects and their associated workloads. The Scheduler first tries to allocate resources using the highest priority node pool, then the next in priority, until it reaches the lowest priority node pool in the list, and then starts from the highest again. The Scheduler uses the project's list of prioritized node pools only if the order of priority of node pools is not set in the workload during submission, either by an admin policy or by the user. An empty value means the node pool is not part of the department's default node pool priority list inherited by newly created projects, but a node pool can still be chosen by an admin policy or a user during workload submission.

    • Node pool - This column is displayed only if more than one node pool exists. It represents the name of the node pool.

    • Under the QUOTA tab:

      • Over-quota state - Indicates if over-quota is enabled or disabled as set in the SCHEDULING PREFERENCES tab. If over-quota is set to None, then it is disabled.

      • GPU devices - The number of GPUs you want to allocate for this department in this node pool (decimal number).

      • CPUs (Cores) - This column is displayed only if CPU quota is enabled via the General settings. Represents the number of CPU cores you want to allocate for this department in this node pool (decimal number).

      • CPU memory - This column is displayed only if CPU quota is enabled via the General settings. Represents the amount of CPU memory you want to allocate for this department in this node pool (in Megabytes or Gigabytes).

      • Under the SCHEDULING PREFERENCES tab

        • Department priority - Sets the department's scheduling priority compared to other departments in the same node pool, using one of the following priorities:

          • Highest - 255

          • VeryHigh - 240

          • High - 210

          • MediumHigh - 180

          • Medium - 150

          • MediumLow - 100

          • Low - 50

          • VeryLow - 20

          • Lowest - 1

          For v2.21, the default value is MediumLow. All departments are set with the same default value, so there is no change in scheduling behavior unless the Administrator changes any department priority values.

        • Over-quota / Over-quota weight - If over-quota weight is enabled via the General settings, then Over-quota weight is presented; otherwise, Over-quota is presented.

          • Over-quota - When enabled, the department can use non-guaranteed overage resources above its quota in this node pool. The amount of non-guaranteed overage resources for this department is calculated proportionally to the department's quota in this node pool. When disabled, the department cannot use more resources than its guaranteed quota in this node pool.

          • Over-quota weight - Represents a weight used to calculate the amount of non-guaranteed overage resources a department can get on top of its quota in this node pool. All unused resources are split between departments that require the use of overage resources:

            • Medium - The default value. The Administrator can change the default to any of the following values: High, Low, Lowest, or None.

            • Lowest - Has a unique behavior: a department with the Lowest over-quota weight can only use over-quota (unused overage) resources if no other department needs them, and any department with a higher over-quota weight can take those overage resources at any time.

            • None - When set, the department cannot use more resources than its guaranteed quota in this node pool.

          • If over-quota is disabled, workloads running under subordinate projects cannot use more resources than the department's quota, but each project can still go over quota (if enabled at the project level) up to the department's quota.

          • Unlimited CPU (Cores) and CPU memory quotas are an exception - in this case, workloads of subordinate projects can consume available resources up to the physical limitation of the cluster or any of the node pools.

        • Department max. GPU device allocation - Represents the maximum GPU device allocation the department can get from this node pool, that is, the maximum sum of quota and over-quota GPUs (decimal number).

  5. Set as required.

  6. Click CREATE DEPARTMENT

Adding an Access Rule to a Department

To create a new access rule for a department:

  1. Select the department you want to add an access rule for

  2. Click ACCESS RULES

  3. Click +ACCESS RULE

  4. Select a subject

  5. Select or enter the subject identifier:

    • User Email for a local user created in NVIDIA Run:ai or for SSO user as recognized by the IDP

    • Group name as recognized by the IDP

    • Application name as created in NVIDIA Run:ai

  6. Select a role

  7. Click SAVE RULE

  8. Click CLOSE

Deleting an Access Rule from a Department

To delete an access rule from a department:

  1. Select the department you want to remove an access rule from

  2. Click ACCESS RULES

  3. Find the access rule you would like to delete

  4. Click on the trash icon

  5. Click CLOSE

Editing a Department

  1. Select the Department you want to edit

  2. Click EDIT

  3. Update the Department and click SAVE

Viewing a Department’s Policy

To view the policy of a department:

  1. Select the department for which you want to view its policy. This option is only active if the department has defined policies in place.

  2. Click VIEW POLICY and select the workload type for which you want to view the policies: a. Workspace workload type policy with its set of rules b. Training workload type policies with its set of rules

  3. In the Policy form, view the workload rules that are enforcing your department for the selected workload type as well as the defaults:

    • Parameter - The workload submission parameter that Rule and Default is applied on

    • Type (applicable for data sources only) - The data source type (Git, S3, nfs, pvc etc.)

    • Default - The default value of the Parameter

    • Rule - Set up constraints on workload policy fields

    • Source - The origin of the applied policy (cluster, department or project)

Note

  • The policy affecting the department consists of rules and defaults. Some of these rules and defaults may be derived from the policies of a parent cluster (source). You can see the source of each rule in the policy form.

  • A policy set for a department affects all subordinated projects and their workloads, according to the policy workload type

Deleting a Department

  1. Select the department you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm the deletion

Note

Deleting a department permanently deletes its subordinated projects, any assets created in the scope of this department, and any of its subordinated projects such as compute resources, environments, data sources, templates, and credentials. However, workloads running within the department’s subordinated projects, or the policies defined for this department or its subordinated projects - remain intact and running.

Reviewing a Department

  1. Select the department you want to review

  2. Click REVIEW

  3. Review and click CLOSE

Using API

To view the available actions, go to the API reference.

Node Pools

This section explains the procedure for managing Node pools.

Node pools assist in managing heterogeneous resources effectively. A node pool is a NVIDIA Run:ai construct representing a set of nodes grouped into a bucket of resources using a predefined node label (e.g. NVIDIA GPU type) or an administrator-defined node label (any key/value pair).

Typically, the grouped nodes share a common feature or property, such as GPU type or other HW capability (such as Infiniband connectivity), or represent a proximity group (i.e. nodes interconnected via a local ultra-fast switch). Researchers and ML Engineers would typically use node pools to run specific workloads on specific resource types.

In the NVIDIA Run:ai platform, a user with the System administrator role can create, view, edit, and delete node pools. Creating a new node pool creates a new instance of the NVIDIA Run:ai Scheduler. Workloads submitted to a node pool are scheduled using the node pool’s designated scheduler instance.

Once created, the new node pool is automatically assigned to all projects and departments with a quota of zero GPU resources, unlimited CPU resources, and over quota enabled (medium weight if over-quota weight is enabled). This allows any project and department to use any node pool when over quota is enabled, even if the administrator has not assigned a quota for a specific node pool within that project or department.

When submitting a new workload, users can add a prioritized list of node pools. The node pool selector picks one node pool at a time (according to the prioritized list) and the designated node pool scheduler instance handles the submission request and tries to match the requested resources within that node pool. If the scheduler cannot find resources to satisfy the submitted workload, the node pool selector moves the request to the next node pool in the prioritized list. If no node pool satisfies the request, the node pool selector starts from the first node pool again, until one of the node pools satisfies the request.

Node Pools Table

The Node pools table can be found under Resources in the NVIDIA Run:ai platform.

The Node pools table lists all the node pools defined in the NVIDIA Run:ai platform and allows you to manage them.

Note

By default, the NVIDIA Run:ai platform includes a single node pool named ‘default’. When no other node pool is defined, all existing and new nodes are associated with the ‘default’ node pool. When deleting a node pool, if no other node pool matches any of the nodes’ labels, the node will be included in the default node pool.

The Node pools table consists of the following columns:

Column
Description

Workloads Associated with the Node Pool

Click one of the values in the Workload(s) column, to view the list of workloads and their parameters.

Note

This column is only viewable if your role in the NVIDIA Run:ai platform gives you read access to workloads. Even if you are allowed to view workloads, you can only view those within your allowed scope. This means there might be more workloads running in this node pool than appear in the list you are viewing.

Column
Description

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

  • Show/Hide details - Click to view additional information on the selected row

Show/Hide Details

Select a row in the Node pools table and then click Show details in the upper-right corner of the action bar. The details window appears, presenting metrics graphs for the whole node pool:

  • Node GPU allocation - This graph shows an overall sum of the Allocated, Unallocated, and Total number of GPUs for this node pool, over time. From observing this graph, you can learn about the occupancy of GPUs in this node pool, over time.

  • GPU Utilization Distribution - This graph shows the distribution of GPU utilization in this node pool over time. Observing this graph, you can learn how many GPUs are utilized up to 25%, 25%-50%, 50%-75%, and 75%-100%. This information helps to understand how many available resources you have in this node pool, and how well those resources are utilized by comparing the allocation graph to the utilization graphs, over time.

  • GPU Utilization - This graph shows the average GPU utilization in this node pool over time. Comparing this graph with the GPU Utilization Distribution helps to understand the actual distribution of GPU occupancy over time.

  • GPU Memory Utilization - This graph shows the average GPU memory utilization in this node pool over time, for example an average of all nodes’ GPU memory utilization over time.

  • CPU Utilization - This graph shows the average CPU utilization in this node pool over time, for example, an average of all nodes’ CPU utilization over time.

  • CPU Memory Utilization - This graph shows the average CPU memory utilization in this node pool over time, for example an average of all nodes’ CPU memory utilization over time.

Adding a New Node Pool

To create a new node pool:

  1. Click +NEW NODE POOL

  2. Enter a name for the node pool. Node pools names must start with a letter and can only contain lowercase Latin letters, numbers or a hyphen ('-’)

  3. Enter the node pool label: The node pool controller will use this node-label key-value pair to match nodes into this node pool.

    • Key is the unique identifier of a node label.

      • The key must fit the following regular expression: ^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?/?([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]$

      • The administrator can use an automatically preset label, such as nvidia.com/gpu.product which labels the GPU type, or any other key from a node label.

    • Value is the value of that label identifier (key). The same key may have different values, in this case, they are considered as different labels.

      • Value must fit the following regular expression: ^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$

    • A node pool is defined by a single key-value pair. Do not use different labels that are set on the same node for different node pools, as this may lead to unexpected results.

  4. Set the GPU placement strategy:

    • Bin-pack - Place as many workloads as possible in each GPU and node to use fewer resources and maximize GPU and node vacancy.

    • Spread - Spread workloads across as many GPUs and nodes as possible to minimize the load and maximize the available resources per workload.

    • GPU workloads are workloads that request both GPU and CPU resources

  5. Set the CPU placement strategy:

    • Bin-pack - Place as many workloads as possible in each CPU and node to use fewer resources and maximize CPU and node vacancy.

    • Spread - Spread workloads across as many CPUs and nodes as possible to minimize the load and maximize the available resources per workload.

    • CPU workloads are workloads that request purely CPU resources

  6. Set the GPU network acceleration. For more details, see Using GB200 NVL72 and Multi-Node NVLink Domains:

    • Set the discovery method of GPU network acceleration (MNNVL)

      • Automatic - Automatically identify whether the node pool contains any MNNVL nodes. MNNVL nodes that share the same ID are part of the same NVL rack.

      • Manual - Manually set whether the node pool contains any MNNVL nodes

        • Detected

        • Not detected

    • Set the node’s label used to discover GPU network acceleration (MNNVL) to nvidia.com/gpu.clique

  7. Click CREATE NODE POOL

Labeling Nodes for Node Pool Grouping

The administrator can use a preset node label, such as the nvidia.com/gpu.product that labels the GPU type, or configure any other node label (e.g. faculty=physics).

To assign a label to nodes you want to group into a node pool, set a node label on each node:

  1. Obtain the list of nodes and their current labels (see the first command in the sketch below).

  2. Label a specific node with the new key-value pair (see the second command in the sketch below).
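A minimal sketch of both commands (the node name placeholder and the faculty=physics label are examples only):

# 1. List nodes and their current labels
kubectl get nodes --show-labels

# 2. Label a specific node so it is grouped into the node pool
kubectl label node <node-name> faculty=physics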

Editing a Node Pool

  1. Select the node pool you want to edit

  2. Click EDIT

  3. Update the node pool and click SAVE

Deleting a Node Pool

  1. Select the node pool you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm the deletion

Note

The default node pool cannot be deleted. When deleting a node pool, if no other node pool matches any of the nodes’ labels, the node will be included in the default node pool.

Using API

To view the available actions, go to the API reference.

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Run the below --help command to obtain the login options and log in according to your setup:

Log in using the following command. You will be prompted to enter your username and password:

To use the API, you will need to obtain a token as shown in


{
"spec": {
    "compute": {
    "gpuDevicesRequest": 1,
    "gpuRequestType": "portion",
    "gpuPortionRequest": 0.5,
    "gpuPortionLimit": 0.5,
    "gpuMemoryRequest": "10M",
    "gpuMemoryLimit": "10M",
    "migProfile": "1g.5gb",
    "cpuCoreRequest": 0.5,
    "cpuCoreLimit": 2,
    "cpuMemoryRequest": "20M",
    "cpuMemoryLimit": "30M",
    "largeShmRequest": false,
    "extendedResources": [
        {
        "resource": "hardware-vendor.example/foo",
        "quantity": 2,
        "exclude": false
        }
    ]
    }
}
}


compute:
    gpuDevicesRequest:
        max: 2
defaults:
  createHomeDir: true
  environmentVariables:
    instances:
      - name: MY_ENV
        value: my_value
  security:
    allowPrivilegeEscalation: false

rules:
  storage:
    s3:
      attributes:
        url:
          options:
            - value: https://www.google.com
              displayed: https://www.google.com
            - value: https://www.yahoo.com
              displayed: https://www.yahoo.com
rules:
  storage:
    dataVolume:
      instances:
        canAdd: false
    hostPath:
      instances:
        canAdd: false
    pvc:
      instances:
        canAdd: false
    git:
      attributes:
        repository:
          required: true
        branch:
          required: true
        path:
          required: true
    nfs:
      instances:
        canAdd: false
    s3:
      instances:
        canAdd: false
  compute:
    cpuCoreRequest:
      required: true
      min: 0
      max: 8
    cpuCoreLimit:
      min: 0
      max: 8
    cpuMemoryRequest:
      required: true
      min: '0'
      max: 16G
    cpuMemoryLimit:
      min: '0'
      max: 8G
    migProfile:
      canEdit: false
    gpuPortionRequest:
      min: 0
      max: 1
    gpuMemoryRequest:
      canEdit: false
    extendedResources:
      instances:
        canAdd: false
defaults:
  worker:
    command: my-command-worker-1
    environmentVariables:
      instances:
        - name: LOG_DIR
          value: policy-worker-to-be-ignored
        - name: ADDED_VAR
          value: policy-worker-added
    security:
      runAsUid: 500
    storage:
      s3:
        attributes:
          bucket: bucket1-worker
  master:
    command: my-command-master-2
    environmentVariables:
      instances:
        - name: LOG_DIR
          value: policy-master-to-be-ignored
        - name: ADDED_VAR
          value: policy-master-added
    security:
      runAsUid: 800
    storage:
      s3:
        attributes:
          bucket: bucket1-master
rules:
  worker:
    command:
      options:
        - value: my-command-worker-1
          displayed: command1
        - value: my-command-worker-2
          displayed: command2
    storage:
      nfs:
        instances:
          canAdd: false
      s3:
        attributes:
          bucket:
            options:
              - value: bucket1-worker
              - value: bucket2-worker
  master:
    command:
      options:
        - value: my-command-master-1
          displayed: command1
        - value: my-command-master-2
          displayed: command2
    storage:
      nfs:
        instances:
          canAdd: false
      s3:
        attributes:
          bucket:
            options:
              - value: bucket1-master
              - value: bucket2-master
rules:
  imagePullPolicy:
    required: true
    options:
      - value: Always
        displayed: Always
      - value: Never
        displayed: Never
  createHomeDir:
    canEdit: false
rules:
  security:
    runAsUid:
      min: 1
      max: 32700
    allowPrivilegeEscalation:
      canEdit: false
defaults: null
rules: null
imposedAssets:
  - f12c965b-44e9-4ff6-8b43-01d8f9e630cc
defaults:
  createHomeDir: true
  imagePullPolicy: IfNotPresent
  nodePools:
    - node-pool-a
    - node-pool-b
  environmentVariables:
    instances:
      - name: WANDB_API_KEY
        value: REPLACE_ME!
      - name: WANDB_BASE_URL
        value: https://wandb.mydomain.com
  compute:
    cpuCoreRequest: 0.1
    cpuCoreLimit: 20
    cpuMemoryRequest: 10G
    cpuMemoryLimit: 40G
    largeShmRequest: true
  security:
    allowPrivilegeEscalation: false
  storage:
    git:
      attributes:
        repository: https://git-repo.my-domain.com
        branch: master
    hostPath:
      instances:
        - name: vol-data-1
          path: /data-1
          mountPath: /mount/data-1
        - name: vol-data-2
          path: /data-2
          mountPath: /mount/data-2
rules:
  createHomeDir:
    canEdit: false
  imagePullPolicy:
    canEdit: false
  environmentVariables:
    instances:
      locked:
        - WANDB_BASE_URL
  compute:
    cpuCoreRequest:
      max: 32
    cpuCoreLimit:
      max: 32
    cpuMemoryRequest:
      min: 1G
      max: 20G
    cpuMemoryLimit:
      min: 1G
      max: 40G
    largeShmRequest:
      canEdit: false
    extendedResources:
      instances:
        canAdd: false
  security:
    allowPrivilegeEscalation:
      canEdit: false
    runAsUid:
      min: 1
  storage:
    hostPath:
      instances:
        locked:
          - vol-data-1
          - vol-data-2
imposedAssets:
  - 4ba37689-f528-4eb6-9377-5e322780cc27

Department

The name of the department

Node pool(s) with quota

The node pools associated with this department. By default, all node pools within a cluster are associated with each department. Administrators can change the node pools’ quota parameters for a department. Click the values under this column to view the list of node pools with their parameters (as described below)

GPU quota

GPU quota associated with the department

Total GPUs for projects

The sum of all projects’ GPU quotas associated with this department

Project(s)

List of projects associated with this department

Subject(s)

The users, SSO groups, or applications with access to the department. Click the values under this column to view the list of subjects with their parameters (as described below). This column is viewable only if your role in the NVIDIA Run:ai platform grants you those permissions.

Allocated GPUs

The total number of GPUs allocated by successfully scheduled workloads in projects associated with this department

GPU allocation ratio

The ratio of Allocated GPUs to GPU quota. This number reflects how well the department’s GPU quota is utilized by its descendant projects. A number higher than 100% means the department is using over quota GPUs. A number lower than 100% means not all projects are utilizing their quotas. A quota becomes allocated once a workload is successfully scheduled.

Creation time

The timestamp for when the department was created

Workload(s)

The list of workloads under projects associated with this department. Click the values under this column to view the list of workloads with their resource parameters (as described below)

Cluster

The cluster that the department is associated with

Node pool

The name of the node pool, given by the administrator during node pool creation. All clusters have a default node pool, created automatically by the system and named ‘default’.

GPU quota

The amount of GPU quota the administrator dedicated to the department for this node pool (floating number, e.g. 2.3 means 230% of a GPU capacity)

CPU (Cores)

The amount of CPU (Cores) quota the administrator has dedicated to the department for this node pool (floating number, e.g. 1.3 Cores = 1300 millicores). The ‘unlimited’ value means the CPU (Cores) quota is not bounded and workloads using this node pool can use as many CPU (Cores) resources as they need (if available)

CPU memory

The amount of CPU memory quota the administrator has dedicated to the department for this node pool (floating number, in MB or GB). The ‘unlimited’ value means the CPU memory quota is not bounded and workloads using this node pool can use as much CPU memory resource as they need (if available).

Allocated GPUs

The total amount of GPUs allocated by workloads using this node pool under projects associated with this department. The number of allocated GPUs may temporarily surpass the GPU quota of the department if over quota is used.

Allocated CPU (Cores)

The total amount of CPUs (cores) allocated by workloads using this node pool under all projects associated with this department. The number of allocated CPUs (cores) may temporarily surpass the CPUs (Cores) quota of the department if over quota is used.

Allocated CPU memory

The actual amount of CPU memory allocated by workloads using this node pool under all projects associated with this department. The number of Allocated CPU memory may temporarily surpass the CPU memory quota if over quota is used.

Subject

A user, SSO group, or application assigned with a role in the scope of this department

Type

The type of subject assigned to the access rule (user, SSO group, or application).

Scope

The scope of this department within the organizational tree. Click the name of the scope to view the organizational tree diagram. You can only view the parts of the organizational tree for which you have viewing permissions.

Role

The role assigned to the subject, in this department’s scope

Authorized by

The user who granted the access rule

Last updated

The last time the access rule was updated


Node pool

The node pool name, set by the administrator during its creation (the node pool name cannot be changed after its creation).

Status

Node pool status. A ‘Ready’ status means the scheduler can use this node pool to schedule workloads. ‘Empty’ status means no nodes are currently included in that node pool.

Label key & label value

The node pool controller will use this node-label key-value pair to match nodes into this node pool.

Node(s)

List of nodes included in this node pool. Click the field to view details (the details are in the Nodes article).

GPU network acceleration (MNNVL)

Indicates whether the discovery of Multi-Node NVLink (MNNVL) nodes is done automatically or manually, as set by the Administrator.

MNNVL label key

The label key that is used to automatically detect if a node is part of an MNNVL domain. The default MNNVL domain label is nvidia.com/gpu.clique.

MNNVL nodes

Indicates whether MNNVL nodes are detected - automatically or manually.

GPU devices

The total number of GPU devices installed into nodes included in this node pool. For example, a node pool that includes 12 nodes each with 8 GPU devices would show a total number of 96 GPU devices.

GPU memory

The total amount of GPU memory installed on nodes included in this node pool. For example, a node pool that includes 12 nodes, each with 8 GPU devices, and each device with 80 GB of memory, would show a total memory amount of 7.68 TB.

Allocated GPUs

The total allocation of GPU devices in units of GPUs (decimal number). For example, if 3 GPUs are 50% allocated, the field prints out the value 1.50. This value represents the portion of GPU memory consumed by all running pods using this node pool. ‘Allocated GPUs’ can be larger than ‘Projects’ GPU quota’ if over quota is used by workloads, but not larger than GPU devices.

GPU resource optimization ratio

Shows the Node Level Scheduler mode.

CPUs (Cores)

The number of CPU cores installed on nodes included in this node pool

CPU memory

The total amount of CPU memory installed on nodes using this node pool

Allocated CPUs (Cores)

The total allocation of CPU compute in units of Cores (decimal number). This value represents the amount of CPU cores consumed by all running pods using this node pool. ‘Allocated CPUs’ can be larger than ‘Projects’ CPU quota’ if over quota is used by workloads, but not larger than CPUs (Cores).

Allocated CPU memory

The total allocation of CPU memory in units of TB/GB/MB (decimal number). This value represents the amount of CPU memory consumed by all running pods using this node pool. ‘Allocated CPU memory’ can be larger than ‘Projects’ CPU memory quota’ if over quota is used by workloads, but not larger than CPU memory.

GPU placement strategy

Sets the Scheduler strategy for assigning pods that request both GPU and CPU resources to nodes. The strategy can be either Bin-pack or Spread. Bin-pack is used by default but can be changed to Spread by editing the node pool. When set to Bin-pack, the Scheduler tries to fill nodes as much as possible before using empty or sparse nodes; when set to Spread, the Scheduler tries to keep nodes as sparse as possible by spreading workloads across as many nodes as it can.

CPU placement strategy

Sets the Scheduler strategy for assigning pods that request only CPU resources to nodes. The strategy can be either Bin-pack or Spread. Bin-pack is used by default but can be changed to Spread by editing the node pool. When set to Bin-pack, the Scheduler tries to fill nodes as much as possible before using empty or sparse nodes; when set to Spread, the Scheduler tries to keep nodes as sparse as possible by spreading workloads across as many nodes as it can.

Last update

The date and time when the node pool was last updated

Creation time

The date and time when the node pool was created

Workload(s)

List of workloads running on nodes included in this node pool, click the field to view details (described below in this article)

Workload

The name of the workload. If the workload’s type is one of the recognized types (for example: PyTorch, MPI, Jupyter, Ray, Spark, Kubeflow, and many more), an appropriate icon is displayed.

Type

The NVIDIA Run:ai platform type of the workload - Workspace, Training, or Inference

Status

The state of the workload. Workload states are described in the NVIDIA Run:ai workloads section

Created by

The user or application that created this workload

Running/requested pods

The number of running pods out of the number of requested pods within this workload.

Creation time

The workload’s creation date and time

Allocated GPU compute

The total amount of GPU compute allocated by this workload. A workload with 3 Pods, each allocating 0.5 GPU, will show a value of 1.5 GPUs for the workload.

Allocated GPU memory

The total amount of GPU memory allocated by this workload. A workload with 3 Pods, each allocating 20GB, will show a value of 60 GB for the workload.

Allocated CPU compute (cores)

The total amount of CPU compute allocated by this workload. A workload with 3 Pods, each allocating 0.5 Core, will show a value of 1.5 Cores for the workload.

Allocated CPU memory

The total amount of CPU memory allocated by this workload. A workload with 3 Pods, each allocating 5 GB of CPU memory, will show a value of 15 GB of CPU memory for the workload.

runai project set "project-name"
runai workspace submit "workload-name" \
--image gcr.io/run-ai-lab/pytorch-example-jupyter \
--gpu-memory-request 4G --gpu-memory-limit 12G --large-shm \
--external-url container=8888 --name-prefix jupyter  \
--command -- start-notebook.sh \
--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=
curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \ 
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \ 
-d '{ 
    "name": "workload-name", 
    "projectId": "<PROJECT-ID>", 
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "command" : "start-notebook.sh",
        "args" : "--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''",
        "image": "gcr.io/run-ai-lab/pytorch-example-jupyter",
        "compute": {
            "gpuDevicesRequest": 1,
            "gpuMemoryRequest": "4G",
            "gpuMemoryLimit": "12G",
            "largeShmRequest": true

        },
        "exposedUrls" : [
            { 
                "container" : 8888,
                "toolType": "jupyter-notebook", 
                "toolName": "Jupyter"  
            }
        ]
    }
}'

Hotfixes for Version 2.21

This section provides details on all hotfixes available for version 2.21. Hotfixes are critical updates released between our major and minor versions to address specific issues or vulnerabilities. These updates ensure the system remains secure, stable, and optimized without requiring a full version upgrade.

Version
Date
Internal ID
Description

2.21.25

11/06/2025

RUN-29548

Fixed a typo in the documentation where the API key was incorrectly written as enforceRun:aiScheduler instead of the correct enforceRunaiScheduler.

2.21.25

11/06/2025

RUN-29320

Fixed an issue in CLI v2 where the update server did not receive the terminal size during exec commands requiring TTY support. The terminal size is now set once upon session creation, ensuring proper behavior for interactive sessions.

2.21.24

08/06/2025

RUN-29282

Fixed a security vulnerability in golang.org/x/crypto related to CVE-2025-22869 with severity HIGH.

2.21.23

08/06/2025

RUN-28891

  • Fixed a security vulnerability in golang.org/x/crypto related to CVE-2024-45337 with severity HIGH.

  • Fixed a security vulnerability in go-git/go-git related to CVE-2025-21613 with severity HIGH.

2.21.23

08/06/2025

RUN-25281

Fixed an issue where deploying a Hugging Face model with vLLM using the Hugging Face inference UI form on an OpenShift environment failed due to permission errors.

2.21.22

03/06/2025

RUN-29341

Fixed an issue which caused high CPU usage in the Cluster API.

2.21.22

03/06/2025

RUN-29323

Fixed an issue where Prometheus failed to send metrics for OpenShift.

2.21.19

27/05/2025

RUN-29093

Fixed an issue where rotating the runai-config webhook secret caused the app.kubernetes.io/managed-by=helm label to be removed.

2.21.18

27/05/2025

RUN-28286

Fixed an issue where CPU-only workloads incorrectly triggered idle timeout notifications intended for GPU workloads.

2.21.18

27/05/2025

RUN-28555

Fixed an issue in Admin → General Settings where the "Disabled" workloads count displayed inconsistently between the collapsed and expanded views.

2.21.18

27/05/2025

RUN-26361

Fixed an issue where Prometheus remote-write credentials were not properly updated on OpenShift clusters.

2.21.18

27/05/2025

RUN-28780

Fixed an issue where Hugging Face model validation incorrectly blocked some valid models supported by vLLM and TGI.

2.21.18

27/05/2025

RUN-28851

Fixed an issue in CLI v2 where the port-forward command terminated SSH connections after 15–30 seconds due to an idle timeout.

2.21.18

27/05/2025

RUN-25281

Fixed an issue where the Hugging Face UI submission flow failed on OpenShift (OCP) clusters.

2.21.17

21/05/2025

RUN-28266

Fixed an issue where the documentation examples for the runai workload delete CLI command were incorrect.

2.21.17

21/05/2025

RUN-28609

Fixed an issue where users with the ML Engineer role were unable to delete multiple inference jobs at once.

2.21.17

21/05/2025

RUN-28665

Fixed an issue where using servingPort authorization fields in the API on unsupported clusters did not return an error.

2.21.17

21/05/2025

RUN-28717

Fixed an issue where the API documentation listed an incorrect response code.

2.21.17

21/05/2025

RUN-28755

Fixed an issue where the tooltip next to the External URL for an inference endpoint incorrectly stated that the URL was internal.

2.21.17

21/05/2025

RUN-28762

Fixed an issue with the inference workload ownership protection.

2.21.17

21/05/2025

RUN-28859

Fixed an issue where the knative.enable-scale-to-zero setting did not default to true as expected.

2.21.17

21/05/2025

RUN-28923

Fixed an issue where calling the API with the telemetryType IDLE_ALLOCATED_GPUS resulted in a 500 Internal Server Error.

2.21.17

21/05/2025

RUN-28950

Fixed a security vulnerability in github.com/moby and github.com/docker/docker related to CVE-2024-41110 with severity Critical.

2.21.16

18/05/2025

RUN-27295

Fixed an issue in CLI v2 where the --node-type flag for inference workloads was not properly propagated to the pod specification.

2.21.16

18/05/2025

RUN-27375

Fixed an issue where projects were not visible in the legacy job submission form, preventing users from selecting a target project.

2.21.16

18/05/2025

RUN-27514

Fixed an issue where disabling CPU quota in the General settings did not remove existing CPU quotas from projects and departments.

2.21.16

18/05/2025

RUN-27521

Fixed a security vulnerability in axios related to CVE-2025-27152 with severity HIGH.

2.21.16

18/05/2025

RUN-27638

Fixed an issue where a node pool’s placement strategy stopped functioning correctly after being edited.

2.21.16

18/05/2025

RUN-27438

Fixed an issue where MPI jobs were unavailable due to an OpenShift MPI Operator installation error.

2.21.16

18/05/2025

RUN-27952

Fixed a security vulnerability in emacs-filesystem related to CVE-2025-1244 with severity HIGH.

2.21.16

18/05/2025

RUN-28244

Fixed a security vulnerability in liblzma5 related to CVE-2025-31115 with severity HIGH.

2.21.16

18/05/2025

RUN-28006

Fixed an issue where tokens became invalid for the API server after one hour.

2.21.16

18/05/2025

RUN-28097

Fixed an issue where the allocated_gpu_count_per_gpu metric displayed incorrect data for fractional pods.

2.21.16

18/05/2025

RUN-28213

Fixed a security vulnerability in golang.org/x/crypto related to CVE-2025-22869 with severity HIGH.

2.21.16

18/05/2025

RUN-28311

Fixed an issue where user creation failed with a duplicate email error, even though the email address did not exist in the system.

2.21.16

18/05/2025

RUN-28832

Fixed inference CLI v2 documentation with examples that reflect correct usage.

2.21.15

30/04/2025

RUN-27533

Fixed an issue where workloads with idle GPUs were not suspended after exceeding the configured idle time.

2.21.14

29/04/2025

RUN-26608

Fixed an issue by adding a flag to the cli config set command and the CLI install script, allowing users to set a cache directory.

2.21.14

29/04/2025

RUN-27264

Fixed an issue where creating a project from the UI with a non-unlimited deserved CPU value caused the queue to be created with limit = deserved instead of unlimited.

2.21.14

29/04/2025

RUN-27484

Fixed an issue where duplicate app.kubernetes.io/name labels were applied to services in the control plane Helm chart.

2.21.14

29/04/2025

RUN-27502

Fixed the inference CLI commands documentation: --max-replicas and --min-replicas were incorrectly used instead of --max-scale and --min-scale.

2.21.14

29/04/2025

RUN-27513

Fixed an issue where cluster-scoped policies were not visible to users with appropriate permissions.

2.21.14

29/04/2025

RUN-27515

Fixed an issue where users were unable to use assets from an upper scope during flexible workload submissions.

2.21.14

29/04/2025

RUN-27520

Fixed an issue where adding access rules immediately after creating an application did not refresh the access rules table.

2.21.14

29/04/2025

RUN-27628

Fixed an issue where a node pool could remain stuck in Updating status in certain cases.

2.21.14

29/04/2025

RUN-27826

Fixed an issue where the runai inference update command could result in a failure to update the workload. Although the command itself succeeded (since the update is asynchronous), the update often failed, and the new spec was not applied.

2.21.14

29/04/2025

RUN-27915

Fixed an issue where the "Improved Command Line Interface" admin setting was incorrectly labeled as Beta instead of Stable.

2.21.11

29/04/2025

RUN-27251

  • Fixed a security vulnerability in github.com/golang-jwt/jwt/v4 and github.com/golang-jwt/jwt/v5 related to CVE-2025-30204 with severity HIGH.

  • Fixed a security vulnerability in golang.org/x/net related to CVE-2025-22872 with severity MEDIUM.

  • Fixed a security vulnerability in knative.dev/serving related to CVE-2023-48713 with severity MEDIUM.

2.21.11

29/04/2025

RUN-27309

Fixed an issue where workloads configured with a multi node pool setup could fail to schedule on a specific node pool in the future after an initial scheduling failure, even if sufficient resources later became available.

2.21.10

29/04/2025

RUN-26992

Fixed an issue where workloads submitted with an invalid node port range would get stuck in Creating status.

2.21.10

29/04/2025

RUN-27497

Fixed an issue where, after deleting an SSO user and immediately creating a local user, the delete confirmation dialog reappeared unexpectedly.

2.21.9

15/04/2025

RUN-26989

Fixed an issue that prevented reordering node pools in the workload submission form.

2.21.9

15/04/2025

RUN-27247

Fixed security vulnerabilities in Spring framework used by db-mechanic service - CVE-2021-27568, CVE-2021-44228, CVE-2022-22965, CVE-2023-20873, CVE-2024-22243, CVE-2024-22259 and CVE-2024-22262.

2.21.9

15/04/2025

RUN-26359

Fixed an issue in CLI v2 where using the --toleration option required incorrect mandatory fields.

Clusters

This section explains the procedure to view and manage Clusters.

The Cluster table provides a quick and easy way to see the status of your cluster.

Clusters Table

The Clusters table can be found under Resources in the NVIDIA Run:ai platform.

The clusters table provides a list of the clusters added to NVIDIA Run:ai platform, along with their status.

The clusters table consists of the following columns:

Column
Description

Cluster

The name of the cluster

Status

The status of the cluster. For more information, see the Cluster Status table below. Hover over the information icon for a short description and links to troubleshooting

Creation time

The timestamp when the cluster was created

URL

The URL that was given to the cluster

NVIDIA Run:ai cluster version

The NVIDIA Run:ai version installed on the cluster

Kubernetes distribution

The flavor of Kubernetes distribution

Kubernetes version

The version of Kubernetes installed

NVIDIA Run:ai cluster UUID

The unique ID of the cluster

Cluster Status

Status
Description

Waiting to connect

The cluster has never been connected.

Disconnected

There is no communication from the cluster to the Control plane. This may be due to a network issue.

Missing prerequisites

Some prerequisites are missing from the cluster. As a result, some features may be impacted.

Service issues

At least one of the services is not working properly. You can view the list of nonfunctioning services for more information.

Connected

The NVIDIA Run:ai cluster is connected, and all NVIDIA Run:ai services are running.

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

Adding a New Cluster

To add a new cluster, see the installation guide.

Removing a Cluster

  1. Select the cluster you want to remove

  2. Click REMOVE

  3. A dialog appears: Make sure to carefully read the message before removing

  4. Click REMOVE to confirm the removal.

Using API

Go to the Clusters API reference to view the available actions
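For example, a hedged sketch of listing clusters with the API, assuming the v1 clusters endpoint and a bearer token obtained through API authentication:

curl -L 'https://<COMPANY-URL>/api/v1/clusters' \
-H 'Authorization: Bearer <TOKEN>'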

Troubleshooting

Before starting, make sure you have access to the Kubernetes cluster where NVIDIA Run:ai is deployed with the necessary permissions

Troubleshooting Scenarios

Cluster disconnected

Description: When the cluster's status is ‘disconnected’, there is no communication from the cluster services reaching the NVIDIA Run:ai Platform. This may be due to networking issues or issues with NVIDIA Run:ai services.

Mitigation:

  1. Check NVIDIA Run:ai’s services status:

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to view pods

    3. Copy and paste the following command to verify that NVIDIA Run:ai’s services are running:

      kubectl get pods -n runai | grep -E 'runai-agent|cluster-sync|assets-sync'
    4. If any of the services are not running, see the ‘cluster has service issues’ scenario.

  2. Check the network connection

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to create pods

    3. Copy and paste the following command to create a connectivity check pod:

      kubectl run control-plane-connectivity-check -n runai --image=wbitt/network-multitool --command -- /bin/sh -c 'curl -sSf <control-plane-endpoint> > /dev/null && echo "Connection Successful" || echo "Failed connecting to the Control Plane"'
    4. Replace <control-plane-endpoint> with the URL of the Control Plane in your environment. If the pod fails to connect to the Control Plane, check for potential network policies

  3. Check and modify the network policies

    1. Open your terminal

    2. Copy and paste the following command to check the existence of network policies:

      kubectl get networkpolicies -n runai
    3. Review the policies to ensure that they allow traffic from the NVIDIA Run:ai namespace to the Control Plane. If necessary, update the policies to allow the required traffic

      Example of allowing traffic:

      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-control-plane-traffic
        namespace: runai
      spec:
        podSelector:
          matchLabels:
            app: runai
        policyTypes:
          - Ingress
          - Egress
        egress:
          - to:
              - ipBlock:
                  cidr: <control-plane-ip-range>
            ports:
              - protocol: TCP
                port: <control-plane-port>
        ingress:
          - from:
              - ipBlock:
                  cidr: <control-plane-ip-range>
            ports:
              - protocol: TCP
                port: <control-plane-port>
    4. Check infrastructure-level configurations:

      • Ensure that firewall rules and security groups allow traffic between your Kubernetes cluster and the Control Plane

      • Verify required ports and protocols:

        • Ensure that the necessary ports and protocols for NVIDIA Run:ai’s services are not blocked by any firewalls or security groups

  4. Check NVIDIA Run:ai services logs

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to view logs

    3. Copy and paste the following commands to view the logs of the NVIDIA Run:ai services:

      kubectl logs deployment/runai-agent -n runai
      kubectl logs deployment/cluster-sync -n runai
      kubectl logs deployment/assets-sync -n runai
    4. Try to identify the problem from the logs. If you cannot resolve the issue, continue to the next step.

  5. Diagnosing internal network issues: NVIDIA Run:ai operates on Kubernetes, which uses its internal subnet and DNS services for communication between pods and services. If you find connectivity issues in the logs, the problem might be related to Kubernetes' internal networking.

    To diagnose DNS or connectivity issues, you can start a debugging pod with networking utilities:

    1. Copy the following command to your terminal, to start a pod with networking tools:

      kubectl run -i --tty netutils --image=dersimn/netutils -- bash

      This command creates an interactive pod (netutils) where you can use networking commands like ping, curl, nslookup, etc., to troubleshoot network issues.

    2. Use this pod to perform network resolution tests and other diagnostics to identify any DNS or connectivity problems within your Kubernetes cluster.

  6. Contact NVIDIA Run:ai’s support

    • If the issue persists, contact NVIDIA Run:ai’s support for assistance.

Cluster has service issues

Description: When a cluster's status is ‘Service issues’, it means that one or more NVIDIA Run:ai services running in the cluster are not available.

Mitigation:

  1. Verify non-functioning services

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to view the runaiconfig resource

    3. Copy and paste the following command to determine which services are not functioning:

      kubectl get runaiconfig -n runai runai -ojson | jq -r '.status.conditions | map(select(.type == "Available"))'
  2. Check for Kubernetes events

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to view events

    3. Copy and paste the following command to get all Kubernetes events:

      kubectl get events -A
  3. Inspect resource details

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to describe resources

    3. Copy and paste the following command to check the details of the required resource:

      kubectl describe <resource_type> <name>
  4. Contact NVIDIA Run:ai’s Support

    • If the issue persists, contact NVIDIA Run:ai’s support for assistance.

Cluster is waiting to connect

Description: When the cluster's status is ‘waiting to connect’, it means that no communication from the cluster services reaches the NVIDIA Run:ai Platform. This may be due to networking issues or issues with NVIDIA Run:ai services.

Mitigation:

  1. Check NVIDIA Run:ai’s services status

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to view pods

    3. Copy and paste the following command to verify that NVIDIA Run:ai’s services are running:

      kubectl get pods -n runai | grep -E 'runai-agent|cluster-sync|assets-sync'
    4. If any of the services are not running, see the ‘cluster has service issues’ scenario.

  2. Check the network connection

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to create pods

    3. Copy and paste the following command to create a connectivity check pod:

      kubectl run control-plane-connectivity-check -n runai --image=wbitt/network-multitool --command -- /bin/sh -c 'curl -sSf <control-plane-endpoint> > /dev/null && echo "Connection Successful" || echo "Failed connecting to the Control Plane"'
    4. Replace <control-plane-endpoint> with the URL of the Control Plane in your environment. If the pod fails to connect to the Control Plane, check for potential network policies:

  3. Check and modify the network policies

    1. Open your terminal

    2. Copy and paste the following command to check the existence of network policies:

      kubectl get networkpolicies -n runai
    3. Review the policies to ensure that they allow traffic from the NVIDIA Run:ai namespace to the Control Plane. If necessary, update the policies to allow the required traffic

    4. Example of allowing traffic:

      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-control-plane-traffic
        namespace: runai
      spec:
        podSelector:
          matchLabels:
            app: runai
        policyTypes:
          - Ingress
          - Egress
        egress:
          - to:
              - ipBlock:
                  cidr: <control-plane-ip-range>
            ports:
              - protocol: TCP
                port: <control-plane-port>
        ingress:
          - from:
              - ipBlock:
                  cidr: <control-plane-ip-range>
            ports:
              - protocol: TCP
                port: <control-plane-port>
    5. Check infrastructure-level configurations:

      • Ensure that firewall rules and security groups allow traffic between your Kubernetes cluster and the Control Plane

      • Verify required ports and protocols:

        • Ensure that the necessary ports and protocols for NVIDIA Run:ai’s services are not blocked by any firewalls or security groups

  4. Check NVIDIA Run:ai services logs

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to view logs

    3. Copy and paste the following commands to view the logs of the NVIDIA Run:ai services:

      kubectl logs deployment/runai-agent -n runai
      kubectl logs deployment/cluster-sync -n runai
      kubectl logs deployment/assets-sync -n runai
    4. Try to identify the problem from the logs. If you cannot resolve the issue, continue to the next step

  5. Contact NVIDIA Run:ai’s support

    • If the issue persists, contact NVIDIA Run:ai’s support for assistance

Cluster is missing prerequisites

Description: When a cluster's status displays Missing prerequisites, it indicates that at least one of the Mandatory Prerequisites has not been fulfilled. In such cases, NVIDIA Run:ai services may not function properly.

Mitigation:

If you have ensured that all prerequisites are installed and the status still shows missing prerequisites, follow these steps:

  1. Check the message in the NVIDIA Run:ai platform for further details regarding the missing prerequisites.

  2. Inspect the runai-public ConfigMap:

    1. Open your terminal. In the terminal, type the following command to list all ConfigMaps in the runai-public namespace:

      kubectl get configmap -n runai-public
  3. Describe the ConfigMap

    1. Locate the ConfigMap named runai-public from the list

    2. To view the detailed contents of this ConfigMap, type the following command:

      kubectl describe configmap runai-public -n runai-public
  4. Find Missing Prerequisites

    1. In the output displayed, look for a section labeled dependencies.required

    2. This section provides detailed information about any missing resources or prerequisites. Review this information to identify what is needed

  5. Contact NVIDIA Run:ai’s support

    • If the issue persists, contact NVIDIA Run:ai’s support for assistance

Credentials

This section explains what credentials are and how to create and use them.

Credentials are workload assets that simplify the complexities of Kubernetes secrets. They consist of and mask sensitive access information, such as passwords, tokens, and access keys, which are necessary for gaining access to various resources.

Credentials are crucial for the security of AI workloads and the resources they require, as they restrict access to authorized users, verify identities, and ensure secure interactions. By enforcing the protection of sensitive data, credentials help organizations comply with industry regulations, fostering a secure environment overall.

Essentially, credentials enable AI practitioners to access relevant protected resources, such as private data sources and Docker images, thereby streamlining the workload submission process.

Credentials Table

The Credentials table can be found under Workload manager in the NVIDIA Run:ai User interface.

The Credentials table provides a list of all the credentials defined in the platform and allows you to manage them.

The Credentials table comprises the following columns:

Column
Description

Credential

The name of the credential

Description

A description of the credential

Type

The type of credential, e.g., Docker registry

Status

The different lifecycle and representation of the credential's condition

Scope

The scope of this credential within the organizational tree. Click the name of the scope to view the organizational tree diagram

Kubernetes name

The unique name of the credential as it appears in the cluster (its Kubernetes name)

Environment(s)

The environment(s) that are associated with the credential

Data source(s)

The private data source(s) that are accessed using the credential

Created by

The user who created the credential

Creation time

The timestamp of when the credential was created

Cluster

The cluster with which the credential is associated

Credentials Status

The following table describes the credentials’ condition and whether they were created successfully for the selected scope.

Status
Description

No issues found

No issues were found while creating the credential (this status may change while propagating the credential to the selected scope)

Issues found

Issues found while propagating the credential

Issues found

Failed to access the cluster

Creating…

Credential is being created

Deleting…

Credential is being deleted

No status

When the credential's scope is an account, or the current version of the cluster is not up to date, the status cannot be displayed

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click ‘Download as CSV’. Export to CSV is limited to 20,000 rows.

  • Refresh - Click REFRESH to update the table with the latest data

Adding a New Credential

Creating credentials is limited to specific roles.

To add a new credential:

  1. Go to the Credentials table

  2. Click +NEW CREDENTIAL

  3. Select the credential type from the list Follow the step-by-step guide for each credential type:

Docker registry

These credentials allow users to authenticate and pull images from a Docker registry, enabling access to containerized applications and services.

After creating the credential, it is used automatically when pulling images.

  1. Select a scope.

  2. Enter a name for the credential. The name must be unique.

  3. Optional: Provide a description of the credential

  4. Set how the credential is created

    • Existing secret (in the cluster) This option applies when the purpose is to create the credential based on an existing secret

      • Select a secret from the list (The list is empty if no secrets were created in advance)

    • New secret (recommended) A new secret is created together with the credential. New secrets are not added to the list of existing secrets.

      • Enter the username, password, and Docker registry URL

  5. Click CREATE CREDENTIAL

After the credential is created, check the status to monitor proper creation across the selected scope.

Access key

These credentials are unique identifiers used to authenticate and authorize access to cloud services or APIs, ensuring secure communication between applications. They typically consist of two parts:

  • An access key ID

  • A secret access key

The purpose of this credential type is to allow access to restricted data.

  1. Select a scope

  2. Enter a name for the credential. The name must be unique.

  3. Optional: Provide a description of the credential

  4. Set how the credential is created

    • Existing secret (in the cluster) This option applies when the purpose is to create the credential based on an existing secret

      • Select a secret from the list (The list is empty if no secrets were created in advance)

    • New secret (recommended) A new secret is created together with the credential. New secrets are not added to the list of existing secrets.

      • Enter the Access key and Access secret

  5. Click CREATE CREDENTIAL

After the credential is created, check the status to monitor proper creation across the selected scope.

Username & password

These credentials require a username and corresponding password to access various resources, ensuring that only authorized users can log in.

The purpose of this credential type is to allow access to restricted data.

  1. Select a scope

  2. Enter a name for the credential. The name must be unique.

  3. Optional: Provide a description of the credential

  4. Set how the credential is created

    • Existing secret (in the cluster) This option applies when the purpose is to create the credential based on an existing secret

      • Select a secret from the list (The list is empty if no secrets were created in advance)

    • New secret (recommended) A new secret is created together with the credential. New secrets are not added to the list of existing secrets.

      • Enter the username and password

  5. Click CREATE CREDENTIAL

After the credential is created, check the status to monitor proper creation across the selected scope.

Generic secret

These credentials are a flexible option that consists of multiple keys & values and can store various sensitive information, such as API keys or configuration data, to be used securely within applications.

The purpose of this credential type is to allow access to restricted data.

  1. Select a scope

  2. Enter a name for the credential. The name must be unique.

  3. Optional: Provide a description of the credential

  4. Set how the credential is created

    • Existing secret (in the cluster) This option applies when the purpose is to create the credential based on an existing secret

      • Select a secret from the list (The list is empty if no secrets were created in advance)

    • New secret (recommended) A new secret is created together with the credential. New secrets are not added to the list of existing secrets.

      • Click +KEY & VALUE - to add key/value pairs to store in the new secret

  5. Click CREATE CREDENTIAL

Editing a Credential

To rename a credential:

  1. Select the credential from the table

  2. Click Rename to edit its name and description

Deleting a Credential

To delete a credential:

  1. Select the credential you want to delete

  2. Click DELETE

  3. In the dialog, click DELETE to confirm

Note

Credentials cannot be deleted if they are being used by a workload or a template.

Using Credentials

You can use credentials (secrets) in various ways within the system:

Access Private Data Sources

To access private data sources, attach credentials to data sources of the following types: Git, S3 Bucket

Use Directly Within the Container

To use the secret directly from within the container, you can choose between the following options:

  1. Get the secret mounted to the file system by using the Generic secret data source

  2. Get the secret as an environment variable injected into the container. There are two equivalent ways to inject the environment variable.

    a. By adding it to the Environment asset.

    b. By adding it ad-hoc as part of the workload.

Creating Secrets in Advance

Add secrets in advance to be used when creating credentials via the NVIDIA Run:ai UI. Follow the steps below for each required scope:

Cluster Scope

  1. Create the secret in the NVIDIA Run:ai namespace (runai)

  2. To authorize NVIDIA Run:ai to use the secret, label it: run.ai/cluster-wide: "true"

  3. Label the secret with the correct credential type:

    1. Docker registry - run.ai/resource: "docker-registry"

    2. Access key - run.ai/resource: "access-key"

    3. Username and password - run.ai/resource: "password"

    4. Generic secret - run.ai/resource: "generic"

The secret is now displayed for that scope in the list of existing secrets.
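For example, a minimal kubectl sketch of the cluster scope steps above, using a generic secret. The secret name and key/value pair are placeholders; the labels are the ones listed in this section:

  # Create the secret in the NVIDIA Run:ai namespace (placeholder name and data)
  kubectl create secret generic my-credential -n runai \
    --from-literal=API_KEY=<value>

  # Authorize NVIDIA Run:ai to use the secret cluster-wide and mark its credential type
  kubectl label secret my-credential -n runai \
    run.ai/cluster-wide="true" \
    run.ai/resource="generic"

The department and project scope variants follow the same pattern, using the labels described in the sections below.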

Department Scope

  1. Create the secret in the NVIDIA Run:ai namespace (runai)

  2. To authorize NVIDIA Run:ai to use the secret, label it: run.ai/department: "<department_id>"

  3. Label the secret with the correct credential type:

    1. Docker registry - run.ai/resource: "docker-registry"

    2. Access key - run.ai/resource: "access-key"

    3. Username and password - run.ai/resource: "password"

    4. Generic secret - run.ai/resource: "generic"

The secret is now displayed for that scope in the list of existing secrets.

Project Scope

  1. Create the secret in the project’s namespace

  2. Label the secret with the correct credential type:

    1. Docker registry - run.ai/resource: "docker-registry"

    2. Access key - run.ai/resource: "access-key"

    3. Username and password - run.ai/resource: "password"

    4. Generic secret - run.ai/resource: "generic"

Using API

To view the available actions, go to the Credentials API reference

Advanced Cluster Configurations

Advanced cluster configurations can be used to tailor your NVIDIA Run:ai cluster deployment to meet specific operational requirements and optimize resource management. By fine-tuning these settings, you can enhance functionality, ensure compatibility with organizational policies, and achieve better control over your cluster environment. This article provides guidance on implementing and managing these configurations to adapt the NVIDIA Run:ai cluster to your unique needs.

After the NVIDIA Run:ai cluster is installed, you can adjust various settings to better align with your organization's operational needs and security requirements.

Modify Cluster Configurations

Advanced cluster configurations in NVIDIA Run:ai are managed through the runaiconfig custom resource. To edit the cluster configurations, run:
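Assuming the runaiconfig resource is named runai and lives in the runai namespace, as in the troubleshooting commands earlier in this document, a typical edit command is:

  kubectl edit runaiconfig runai -n runai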

To see the full runaiconfig object structure, use:
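A hedged sketch, assuming the CRD is named runaiconfigs.run.ai (adjust to the CRD actually present in your cluster):

  kubectl get crds runaiconfigs.run.ai -o yaml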

Configurations

The following configurations allow you to enable or disable features, control permissions, and customize the behavior of your NVIDIA Run:ai cluster:

Key
Description

NVIDIA Run:ai Services Resource Management

The NVIDIA Run:ai cluster includes many different services. To simplify resource management, the configuration structure allows you to configure the containers’ CPU/memory resources for each service individually or for a group of services together.

Service Group
Description
NVIDIA Run:ai containers

Apply the following configuration in order to change the resource requests and limits for a group of services:

Or, apply the following configuration in order to change the resource requests and limits for each service individually:

For resource recommendations, see .

NVIDIA Run:ai Services Replicas

By default, all NVIDIA Run:ai containers are deployed with a single replica. Some services support multiple replicas for redundancy and performance.

To simplify configuring replicas, a global replicas configuration can be set and is applied to all supported services:

This can be overridden for specific services (if supported). Services without the replicas configuration do not support replicas:

Prometheus

The Prometheus instance in NVIDIA Run:ai is used for metrics collection and alerting.

The configuration scheme follows the official Prometheus Operator specification and supports additional custom configurations. The PrometheusSpec schema is available using the spec.prometheus.spec configuration.

A common use case for the PrometheusSpec is metrics retention. Configuring local temporary metrics retention prevents metrics loss during potential connectivity issues, as shown in the sketch below:
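A hedged sketch of setting a short local retention window. The retention field and the 2h value come from the upstream PrometheusSpec; the patch path assumes the spec.prometheus.spec configuration described above and the runai resource name and namespace used elsewhere in this document:

  kubectl patch runaiconfig runai -n runai --type merge \
    -p '{"spec":{"prometheus":{"spec":{"retention":"2h"}}}}'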

In addition to the PrometheusSpec schema, some custom NVIDIA Run:ai configurations are also available:

  • Additional labels – Set additional labels for NVIDIA Run:ai's metrics sent by Prometheus.

  • Log level configuration – Configure the logLevel setting for the Prometheus container.

NVIDIA Run:ai Managed Nodes

To include or exclude specific nodes from running workloads within a cluster managed by NVIDIA Run:ai, use the nodeSelectorTerms configuration. For additional details, see the Kubernetes documentation on assigning pods to nodes.

Configure the node selector terms using the following fields:

  • key: Label key (e.g., zone, instance-type).

  • operator: Operator defining the inclusion/exclusion condition (In, NotIn, Exists, DoesNotExist).

  • values: List of values for the key when using In or NotIn.

For example, to include NVIDIA GPU nodes only and exclude all other GPU types in a cluster with mixed nodes, match on the GPU product-type node label (nvidia.com/gpu.product) using the Exists operator.

S3 and Git Sidecar Images

Note

This section applies to self-hosted deployments only.

For air-gapped environments, it is required to replace the default sidecar images in order to use the Git and S3 data source integrations. Use the following configurations:

Over Quota, Fairness and Preemption

This quick start provides a step-by-step walkthrough of the core scheduling concepts - over quota, fairness, and preemption. It demonstrates the simplicity of resource provisioning and how the system eliminates bottlenecks by allowing users or teams to exceed their resource quota when free GPUs are available.

  • Over quota - In this scenario, team-a runs two training workloads and team-b runs one. Team-a has 3 GPUs allocated and is over quota by 1 GPU, while team-b has 1 GPU allocated within its quota. The system allows this over quota usage as long as there are available GPUs in the cluster.

  • Fairness and preemption - Since the cluster is already at full capacity, when team-b launches a new b2 workload requiring 1 GPU, team-a can no longer remain over quota. To maintain fairness, the Scheduler preempts workload a1 (1 GPU), freeing up resources for team-b.

Note

If enabled by your Administrator, the NVIDIA Run:ai UI allows you to create a new workload using either the Flexible or the Original submission form. The steps in this quick start guide reflect the Original form only.

Prerequisites

  • You have created two projects - team-a and team-b - or have them created for you.

  • Each project has an assigned quota of 2 GPUs.

Step 1: Logging In

Step 2: Submitting the First Training Workload (team-a)

  1. Go to the Workload Manager → Workloads

  2. Click +NEW WORKLOAD and select Training

  3. Select the cluster under which to create the workload

  4. Select the project named team-a

  5. Under Workload architecture, select Standard

  6. Select a preconfigured template or select Start from scratch to quickly launch a new training workload

  7. Enter a1 as the workload name

  8. Click CONTINUE. In the next step:

  9. Create a new environment

    • Click +NEW ENVIRONMENT

    • Enter a name for the environment. The name must be unique.

    • Enter the training Image URL - runai.jfrog.io/demo/quickstart

    • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  10. Select the ‘one-gpu’ compute resource for your workload (GPU devices: 1 )

    • If ‘one-gpu’ is not displayed in the gallery, follow the below steps:

      • Click +NEW COMPUTE RESOURCE

      • Enter a name for the compute resource. The name must be unique.

      • Set GPU devices per pod: 1

      • Set GPU memory per device:

        • Select % (of device) - Fraction of a GPU device's memory

        • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  11. Click CREATE TRAINING

Copy the following command to your terminal. For more details, see the CLI reference:

Copy the following command to your terminal. Make sure to update the following parameters (see the sketch after the parameter list). For more details, see the API reference.

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in API authentication

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.
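A minimal API sketch for this step, modeled on the workspaces API call shown earlier in this document. It assumes the /api/v1/workloads/trainings endpoint and submits a1 to team-a's project with one GPU device and the quick start image; for the later steps, change the name, project, and gpuDevicesRequest accordingly (2 for a2, 1 for b1 and b2):

curl -L 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
    "name": "a1",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "image": "runai.jfrog.io/demo/quickstart",
        "compute": {
            "gpuDevicesRequest": 1
        }
    }
}'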

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 3: Submitting the Second Training Workload (team-a)

  1. Go to the Workload Manager → Workloads

  2. Click +NEW WORKLOAD and select Training

  3. Select the cluster where the previous training workload was created

  4. Select the project named team-a

  5. Under Workload architecture, select Standard

  6. Select a preconfigured template or select Start from scratch to quickly launch a new training workload

  7. Enter a2 as the workload name

  8. Click CONTINUE. In the next step:

  9. Select the environment created in Step 2

  10. Select the ‘two-gpus’ compute resource for your workload (GPU devices: 2)

    • If ‘two-gpus’ is not displayed in the gallery, follow the below steps:

      • Click +NEW COMPUTE RESOURCE

      • Enter a name for the compute resource. The name must be unique.

      • Set GPU devices per pod: 2

      • Set GPU memory per device:

        • Select % (of device) - Fraction of a GPU device's memory

        • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  11. Click CREATE TRAINING

Copy the following command to your terminal. For more details, see the CLI reference:

Copy the following command to your terminal. Make sure to update the following parameters. For more details, see the API reference.

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in API authentication

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 4: Submitting the First Training Workload (team-b)

  1. Go to the Workload Manager → Workloads

  2. Click +NEW WORKLOAD and select Training

  3. Select the cluster where the previous training was created

  4. Select the project named team-b

  5. Under Workload architecture, select Standard

  6. Select a preconfigured template, or select Start from scratch to quickly launch a new training

  7. Enter b1 as the workload name

  8. Click CONTINUE. In the next step:

  9. Select the environment created in Step 2

  10. Select the compute resource created in Step 2

  11. Click CREATE TRAINING

Copy the following command to your terminal. For more details, see the CLI reference:

Copy the following command to your terminal. For more details, see the CLI reference:

Copy the following command to your terminal. Make sure to update the following parameters. For more details, see the Trainings API.

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the project the workload runs in. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the cluster. You can get the Cluster UUID via the Get Clusters API.

Note

The API snippet above runs only on NVIDIA Run:ai clusters of version 2.18 and above.

Over Quota Status

System status after run:

Step 5: Submitting the Second Training Workload (team-b)

  1. Go to the Workload Manager → Workloads

  2. Click +NEW WORKLOAD and select Training

  3. Select the cluster where the previous training was created

  4. Select the project named team-b

  5. Under Workload architecture, select Standard

  6. Select a preconfigured template, or select Start from scratch to quickly launch a new training

  7. Enter b2 as the workload name

  8. Click CONTINUE. In the next step:

  9. Select the environment created in Step 2

  10. Select the compute resource created in Step 2

  11. Click CREATE TRAINING

Copy the following command to your terminal. For more details, see the CLI reference:

Copy the following command to your terminal. For more details, see the CLI reference:

Copy the following command to your terminal. Make sure to update the following parameters. For more details, see the Trainings API.

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the project the workload runs in. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the cluster. You can get the Cluster UUID via the Get Clusters API.

Note

The API snippet above runs only on NVIDIA Run:ai clusters of version 2.18 and above.

Basic Fairness and Preemption Status

Workloads status after run:

Next Steps

Manage and monitor your newly created workloads using the Workloads table.

Environments

This section explains what environments are and how to create and use them.

Environments are one type of workload asset. An environment consists of a configuration that simplifies how workloads are submitted and can be used by AI practitioners when they submit their workloads.

An environment asset is a preconfigured building block that encapsulates aspects for the workload such as:

  • Container image and container configuration

  • Tools and connections

  • The type of workload it serves

Environments Table

The Environments table can be found under Workload manager in the NVIDIA Run:ai platform.

The Environments table provides a list of all the environments defined in the platform and allows you to manage them.

The Environments table consists of the following columns:

Column
Description

Tools Associated with the Environment

Click one of the values in the tools column to view the list of tools and their connection type.

Column
Description

Workloads Associated with the Environment

Click one of the values in the Workload(s) column to view the list of workloads and their parameters.

Column
Description

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

Environments Created by NVIDIA Run:ai

When installing NVIDIA Run:ai, you automatically get the environments created by NVIDIA Run:ai to ease the onboarding process and support different use cases out of the box. These environments are created at the scope of the account.

Note

The environments listed below are available based on your cluster settings. Some environments, such as vscode and rstudio, are only available in clusters with host-based routing.

Environment
Image
Description

Adding a New Environment

Environment creation is limited to specific roles.

To add a new environment:

  1. Go to the Environments table

  2. Click +NEW ENVIRONMENT

  3. Select under which cluster to create the environment

  4. Select a scope

  5. Enter a name for the environment. The name must be unique.

  6. Optional: Provide a description of the essence of the environment

  7. Enter the Image URL. If a token or secret is required to pull the image, it is possible to create it via credentials of type docker registry. These credentials are automatically used once the image is pulled (which happens when the workload is submitted)

  8. Set the image pull policy - the condition for when to pull the image from the registry

  9. Set the workload architecture:

    • Standard - Only standard workloads can use the environment. A standard workload consists of a single process.

    • Distributed - Only distributed workloads can use the environment. A distributed workload consists of multiple processes working together. These processes can run on different nodes.

      • Select a framework from the list.

  10. Set the workload type:

    • Workspace

    • Training

    • Inference

      • When inference is selected, define the endpoint of the model by providing both the protocol and the container’s serving port

  11. Optional: Set the connection for your tool(s). The tools must be configured in the image. When submitting a workload using the environment, it is possible to connect to these tools

    • Select the tool from the list (the available tools vary from IDEs to experiment tracking and more, including a custom tool of your choice)

    • Select the connection type

      • External URL

        • Auto generate - A unique URL is automatically created for each workload using the environment

        • Custom URL - The URL is set manually

      • Node port

        • Auto generate - A unique port is automatically exposed for each workload using the environment

        • Custom port - Set the port manually

      • Set the container port

  12. Optional: Set a command and arguments for the container running the pod

    • When no command is added, the default command of the image is used (the image entrypoint)

    • The command can be modified while submitting a workload using the environment

    • The argument(s) can be modified while submitting a workload using the environment

  13. Optional: Set the environment variable(s)

    • Click +ENVIRONMENT VARIABLE

    • Enter a name

    • Select the source for the environment variable

      • Custom

        • Enter a value

        • Leave empty

        • Add instructions for the expected value if any

      • Credentials - Select an existing credential as the environment variable

        • Select a credential name. To add new credentials to the credentials list, and for additional information, see Credentials.

        • Select a secret key

      • ConfigMap - Select a predefined ConfigMap

        • Select a ConfigMap name. To create a ConfigMap in your cluster, see Creating ConfigMaps in advance (a minimal example appears after these steps).

        • Enter a ConfigMap key

    • The environment variables can be modified and new variables can be added while submitting a workload using the environment

  14. Optional: Set the container’s working directory to define where the container’s process starts running. When left empty, the default directory is used.

  15. Optional: Set where the UID, GID and supplementary groups are taken from. This can be:

    • From the image

    • From the IdP token (only available in SSO installations)

    • Custom (manually set) - decide whether the submitter can modify these values upon submission.

      • Set the User ID (UID), Group ID (GID) and the supplementary groups that can run commands in the container

        • Enter UID

        • Enter GID

        • Add Supplementary groups (multiple groups can be added, separated by commas)

        • Disable 'Allow the values above to be modified within the workload' if you want the values above to be used as the defaults

  16. Optional: Select Linux capabilities - Grant certain privileges to a container without granting all the privileges of the root user.

  17. Click CREATE ENVIRONMENT

Note

It is also possible to create an environment directly when submitting a specific workspace, training, or inference workload.
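If you plan to map a ConfigMap to an environment variable (step 13 above), the ConfigMap must already exist in the namespace where the workload will run. The following is a minimal sketch only; the ConfigMap name, key, value and the runai-team-a namespace are example values:

# Create a ConfigMap ahead of time so its keys can be referenced from the environment (example names and values)
kubectl create configmap my-env-config \
  --from-literal=LOG_LEVEL=debug \
  -n runai-team-a

The key (LOG_LEVEL in this example) can then be selected as the ConfigMap key when defining the environment variable.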

Editing an Environment

To edit an existing environment:

  1. Select the environment you want to edit

  2. Click Edit

  3. Update the environment and click SAVE ENVIRONMENT

Note

  • Workloads that are already using this asset are not affected.

  • llm-server and chatbot-ui environments cannot be edited.

Copying an Environment

To copy an existing environment:

  1. Select the environment you want to copy

  2. Click MAKE A COPY

  3. Enter a name for the environment. The name must be unique.

  4. Update the environment and click CREATE ENVIRONMENT

Deleting an Environment

To delete an environment:

  1. Select the environment you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm

Note

Workloads that are already using this asset are not affected.

Using API

Go to the Environments API reference to view the available actions.
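For example, the following is a minimal curl sketch for listing environment assets. The endpoint path shown here is an assumption and should be verified against the API reference:

# List environment assets visible to the caller (verify the exact path in the API reference)
curl --location 'https://<COMPANY-URL>/api/v1/asset/environment' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>'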

The cluster configurations listed below are set in the runaiconfig Kubernetes custom resource. To edit the resource, or to view the full list of available options, run:

kubectl edit runaiconfig runai -n runai
kubectl get crds/runaiconfigs.run.ai -n runai -o yaml

spec.project-controller.createNamespaces (boolean)

Allows Kubernetes namespace creation for new projects Default: true

spec.workload-controller.additionalPodLabels (object)

Set workload's Pod Labels in a format of key/value pairs. These labels are applied to all pods.

spec.workload-controller.failureResourceCleanupPolicy

NVIDIA Run:ai cleans the workload's unnecessary resources:

  • All - Removes all resources of the failed workload

  • None - Retains all resources

  • KeepFailing - Removes all resources except for those that encountered issues (primarily for debugging purposes)

Default: All

spec.workload-controller.GPUNetworkAccelerationEnabled

Enables GPU network acceleration. See Using GB200 NVL72 and Multi-Node NVLink Domains for more details.

Default: false

spec.mps-server.enabled (boolean)

Enabled when using NVIDIA MPS Default: false

spec.global.subdomainSupport (boolean)

Allows the creation of subdomains for ingress endpoints, enabling access to workloads via unique subdomains on the Fully Qualified Domain Name (FQDN). For details, see External Access to Container Default: false

spec.global.nodeAffinity.restrictScheduling (boolean)

Enables setting node roles and restricting workload scheduling to designated nodes Default: false

spec.global.affinity (object)

Sets the system nodes where NVIDIA Run:ai system-level services are scheduled. Using global.affinity will overwrite the node roles set using the Administrator CLI (runai-adm). Default: Prefer to schedule on nodes that are labeled with node-role.kubernetes.io/runai-system

spec.global.tolerations (object)

Configure Kubernetes tolerations for NVIDIA Run:ai system-level services

spec.daemonSetsTolerations (object)

Configure Kubernetes tolerations for NVIDIA Run:ai daemonSets / engine

spec.runai-container-toolkit.logLevel (string)

Specifies the NVIDIA Run:ai-container-toolkit logging level: either 'SPAM', 'DEBUG', 'INFO', 'NOTICE', 'WARN', or 'ERROR' Default: INFO

spec.runai-container-toolkit.enabled (boolean)

Enables workloads to use GPU fractions

Default: true

node-scale-adjuster.args.gpuMemoryToFractionRatio (object)

A scaling-pod requesting a single GPU device will be created for every 1 to 10 pods requesting fractional GPU memory (1/gpuMemoryToFractionRatio). This value represents the ratio (0.1-0.9) of fractional GPU memory (any size) to GPU fraction (portion) conversion. Default: 0.1

spec.global.core.dynamicFractions.enabled (boolean)

Enables dynamic GPU fractions Default: true

spec.global.core.swap.enabled (boolean)

Enables memory swap for GPU workloads Default: false

spec.global.core.swap.limits.cpuRam (string)

Sets the CPU memory size used to swap GPU workloads Default: 100Gi

spec.global.core.swap.limits.reservedGpuRam (string)

Sets the reserved GPU memory size used to swap GPU workloads Default: 2Gi

spec.global.core.nodeScheduler.enabled (boolean)

Enables the node-level scheduler Default: false

spec.global.core.timeSlicing.mode (string)

Sets the GPU time-slicing mode. Possible values:

  • timesharing - all pods on a GPU share the GPU compute time evenly.

  • strict - each pod gets an exact time slice according to its memory fraction value.

  • fair - each pod gets an exact time slice according to its memory fraction value and any unused GPU compute time is split evenly between the running pods.

Default: timesharing

spec.runai-scheduler.args.fullHierarchyFairness (boolean)

Enables fairness between departments, on top of projects fairness Default: true

spec.runai-scheduler.args.defaultStalenessGracePeriod

Sets the timeout in seconds before the scheduler evicts a stale pod-group (gang) that went below its min-members in running state:

  • 0s - Immediately (no timeout)

  • -1 - Never

Default: 60s

spec.pod-grouper.args.gangSchedulingKnative (boolean)

Enables gang scheduling for inference workloads. For backward compatibility with versions earlier than v2.19, change the value to false Default: false

spec.pod-grouper.args.gangScheduleArgoWorkflow (boolean)

Groups all pods of a single ArgoWorkflow workload into a single Pod-Group for gang scheduling Default: true

spec.runai-scheduler.args.verbosity (int)

Configures the level of detail in the logs generated by the scheduler service Default: 4

spec.limitRange.cpuDefaultRequestCpuLimitFactorNoGpu (string)

Sets a default ratio between the CPU request and the limit for workloads without GPU requests Default: 0.1

spec.limitRange.memoryDefaultRequestMemoryLimitFactorNoGpu (string)

Sets a default ratio between the memory request and the limit for workloads without GPU requests Default: 0.1

spec.limitRange.cpuDefaultRequestGpuFactor (string)

Sets a default amount of CPU allocated per GPU when the CPU is not specified Default: 100

spec.limitRange.cpuDefaultLimitGpuFactor (int)

Sets a default CPU limit based on the number of GPUs requested when no CPU limit is specified Default: NO DEFAULT

spec.limitRange.memoryDefaultRequestGpuFactor (string)

Sets a default amount of memory allocated per GPU when the memory is not specified Default: 100Mi

spec.limitRange.memoryDefaultLimitGpuFactor (string)

Sets a default memory limit based on the number of GPUs requested when no memory limit is specified Default: NO DEFAULT

spec.global.enableWorkloadOwnershipProtection (boolean)

Prevents users within the same project from deleting workloads created by others. This enhances workload ownership security and ensures better collaboration by restricting unauthorized modifications or deletions. Default: false
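As an illustration, settings such as those above can be applied with kubectl patch against the runaiconfig resource. This is a sketch only: the field paths follow the keys listed above, while the label key/value and the swap sizes are example values:

# Example: add a custom label to all workload pods and enable GPU memory swap with custom limits (example values)
kubectl patch runaiconfig runai -n runai --type merge -p '
{
  "spec": {
    "workload-controller": {
      "additionalPodLabels": { "team": "ml-research" }
    },
    "global": {
      "core": {
        "swap": {
          "enabled": true,
          "limits": { "cpuRam": "64Gi", "reservedGpuRam": "4Gi" }
        }
      }
    }
  }
}'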

SchedulingServices

Containers associated with the NVIDIA Run:ai Scheduler

Scheduler, StatusUpdater, MetricsExporter, PodGrouper, PodGroupAssigner, Binder

SyncServices

Containers associated with syncing updates between the NVIDIA Run:ai cluster and the NVIDIA Run:ai control plane

Agent, ClusterSync, AssetsSync

WorkloadServices

Containers associated with submitting NVIDIA Run:ai workloads

WorkloadController, JobController

spec:
  global:
   <service-group-name>: # schedulingServices | SyncServices | WorkloadServices
     resources:
       limits:
         cpu: 1000m
         memory: 1Gi
       requests:
         cpu: 100m
         memory: 512Mi
spec:
  <service-name>: # for example: pod-grouper
    resources:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 512Mi
spec:
  global: 
    replicaCount: 1 # default
spec:
  <service-name>: # for example: pod-grouper
    replicas: 1 # default
spec:  
  prometheus:
    spec: # PrometheusSpec
      retention: 2h # default 
      retentionSize: 20GB
spec:  
  prometheus:
    logLevel: info # debug | info | warn | error
    additionalAlertLabels:
      - env: prod # example
spec:   
  global:
     managedNodes:
       inclusionCriteria:
          nodeSelectorTerms:
          - matchExpressions:
            - key: nvidia.com/gpu.product  
              operator: Exists
spec:
  workload-controller:    
    s3FileSystemImage:
      name: goofys       
      registry: runai.jfrog.io/op-containers-prod      
      tag: 3.12.24    
    gitSyncImage:      
      name: git-sync      
      registry: registry.k8s.io     
      tag: v4.4.0

Environment

The name of the environment

Description

A description of the environment

Scope

The scope of this environment within the organizational tree. Click the name of the scope to view the organizational tree diagram

Image

The application or service to be run by the workload

Workload Architecture

This can be either standard for running workloads on a single node or distributed for running distributed workloads on multiple nodes

Tool(s)

The tools and connection types the environment exposes

Workload(s)

The list of existing workloads that use the environment

Workload types

The workload types that can use the environment (Workspace/ Training / Inference)

Template(s)

The list of workload templates that use this environment

Created by

The user who created the environment. By default NVIDIA Run:ai UI comes with preinstalled environments created by NVIDIA Run:ai

Creation time

The timestamp of when the environment was created

Last updated

The timestamp of when the environment was last updated

Cluster

The cluster with which the environment is associated

Tool name

The name of the tool or application the AI practitioner can set up within the environment. For more information, see Integrations.

Connection type

The method by which you can access and interact with the running workload. It's essentially the "doorway" through which you can reach and use the tools the workload provides (e.g., node port, external URL).

Workload

The workload that uses the environment

Type

The workload type (Workspace/Training/Inference)

Status

Represents the workload lifecycle. See the full list of workload statuses.

jupyter-lab / jupyter-scipy

jupyter/scipy-notebook

An interactive development environment for Jupyter notebooks, code, and data visualization

jupyter-tensorboard

gcr.io/run-ai-demo/jupyter-tensorboard

An integrated combination of the interactive Jupyter development environment and TensorFlow's visualization toolkit for monitoring and analyzing ML models

tensorboard / tensorboad-tensorflow

tensorflow/tensorflow:latest

A visualization toolkit for TensorFlow that helps users monitor and analyze ML models, displaying various metrics and model architecture

llm-server

runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0

A vLLM-based server that hosts and serves large language models for inference, enabling API-based access to AI models

chatbot-ui

runai.jfrog.io/core-llm/llm-app

A user interface for interacting with chat-based AI models, often used for testing and deploying chatbot applications

rstudio

rocker/rstudio:4

An integrated development environment (IDE) for R, commonly used for statistical computing and data analysis

vscode

ghcr.io/coder/code-server

A fast, lightweight code editor with powerful features like intelligent code completion, debugging, Git integration, and extensions, ideal for web development, data science, and more

gpt2

runai.jfrog.io/core-llm/quickstart-inference:gpt2-cpu

A package containing an inference server, GPT2 model and chat UI often used for quick demos

runai training submit a1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-a
runai submit a1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-a
curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \ 
--data '{
  "name": "a1",
  "projectId": "<PROJECT-ID>", 
  "clusterId": "<CLUSTER-UUID>",
  "spec": {
    "image":"runai.jfrog.io/demo/quickstart",
    "compute": {
      "gpuDevicesRequest": 1
    }
  }
}'
runai training submit a2 -i runai.jfrog.io/demo/quickstart -g 2 -p team-a
runai submit a2 -i runai.jfrog.io/demo/quickstart -g 2 -p team-a
curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \ 
--data '{
  "name": "a2",
  "projectId": "<PROJECT-ID>", 
  "clusterId": "<CLUSTER-UUID>",
  "spec": {
    "image":"runai.jfrog.io/demo/quickstart",
    "compute": {
      "gpuDevicesRequest": 2
    }
  }
}'
runai training submit b1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
runai submit b1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \
--data '{
  "name": "b1",
  "projectId": "<PROJECT-ID>", 
  "clusterId": "<CLUSTER-UUID>",
  "spec": {
    "image":"runai.jfrog.io/demo/quickstart",
    "compute": {
      "gpuDevicesRequest": 1
    }
  }
}'
~ runai workload list -A
Workload  Type      Status   Project  Running/Req.Pods  GPU Alloc.
────────────────────────────────────────────────────────────────────────────
a2       Training   Running   team-a        1/1           2.00
b1       Training   Running   team-b        1/1           1.00
a1       Training   Running   team-a        0/1           1.00
~ runai list -A
Workload  Type      Status   Project  Running/Req.Pods  GPU Alloc.
────────────────────────────────────────────────────────────────────────────
a2       Training   Running   team-a        1/1           2.00
b1       Training   Running   team-b        1/1           1.00
a1       Training   Running   team-a        0/1           1.00
curl --location 'https://<COMPANY-URL>/api/v1/workloads' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \
--data '' # <TOKEN> is the API access token obtained in Step 1.
runai training submit b2 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
runai submit b2 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \ 
--data '{
  "name": "b2",
  "projectId": "<PROJECT-ID>", 
  "clusterId": "<CLUSTER-UUID>",
  "spec": {
    "image":"runai.jfrog.io/demo/quickstart",
    "compute": {
      "gpuDevicesRequest": 1
    }
  }
}'
~ runai workload list -A
Workload  Type      Status   Project  Running/Req.Pods  GPU Alloc.
────────────────────────────────────────────────────────────────────────────
a2       Training   Running   team-a        1/1           2.00
b1       Training   Running   team-b        1/1           1.00
b2       Training   Running   team-b        1/1           1.00
a1       Training   Pending   team-a        0/1           1.00
~ runai list -A
Workload   Type     Status   Project  Running/Req.Pods  GPU Alloc.
────────────────────────────────────────────────────────────────────────────
a2       Training   Running   team-a        1/1           2.00
b1       Training   Running   team-b        1/1           1.00
b2       Training   Running   team-b        1/1           1.00
a1       Training   Pending   team-a        0/1           1.00
curl --location 'https://<COMPANY-URL>/api/v1/workloads' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \
--data '' # <TOKEN> is the API access token obtained in Step 1.

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Run the below --help command to obtain the login options and log in according to your setup:

runai login --help

Log in using the following command. You will be prompted to enter your username and password:

runai login

To use the API, you will need to obtain a token as shown in API authentication.
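For the API option, a token request looks like the following minimal sketch. It follows the application-token flow described in API authentication; the grantType value and the AppId/AppSecret field names should be verified against that guide:

# Request an API access token for an application (verify the field names against the API authentication guide)
curl --location --request POST 'https://<COMPANY-URL>/api/v1/token' \
--header 'Content-Type: application/json' \
--data '{
  "grantType": "app_token",
  "AppId": "<CLIENT-ID>",
  "AppSecret": "<CLIENT-SECRET>"
}'

The accessToken field in the response is the value to use as the Bearer <TOKEN> in subsequent API calls.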


Projects

This section explains the procedure to manage Projects.

Researchers submit AI workloads. To streamline resource allocation and prioritize work, NVIDIA Run:ai introduces the concept of Projects. Projects are the tool to implement resource allocation policies, as well as the segregation between different initiatives. A project may represent a team, an individual, or an initiative that shares resources or has a specific resource quota. Projects may be aggregated in NVIDIA Run:ai departments.

For example, you may have several people involved in a specific face-recognition initiative collaborating under one project named “face-recognition-2024”. Alternatively, you can have a project per person in your team, where each member receives their own quota.

Projects Table

The Projects table can be found under Organization in the NVIDIA Run:ai platform.

The Projects table provides a list of all projects defined for a specific cluster, and allows you to manage them. You can switch between clusters by selecting your cluster using the filter at the top.

The Projects table consists of the following columns:

Column
Description

Node Pools with Quota Associated with the Project

Click one of the values of Node pool(s) with quota column, to view the list of node pools and their parameters

Column
Description

Subjects Authorized for the Project

Click one of the values in the Subject(s) column, to view the list of subjects and their parameters. This column is only viewable, if your role in the NVIDIA Run:ai system affords you those permissions.

Column
Description

Workloads Associated with the Project

Click one of the values of Workload(s) column, to view the list of workloads and their parameters

Column
Description

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

Adding a New Project

To create a new Project:

  1. Click +NEW PROJECT

  2. Select a scope. You can only view clusters that you have permission to see, within the scope of the roles assigned to you

  3. Enter a name for the project. Project names must start with a letter and can only contain lowercase Latin letters, numbers, or a hyphen ('-')

  4. Set the namespace associated with the project. Each project has an associated (Kubernetes) namespace in the cluster. All workloads under this project use this namespace.

    1. By default, NVIDIA Run:ai creates a namespace based on the project name (in the form of runai-<name>)

    2. Alternatively, you can choose an existing namespace created for you by the cluster administrator

  5. In the Quota management section, you can set the quota parameters and prioritize resources

    • Order of priority - This column is displayed only if more than one node pool exists. It sets the default order in which the Scheduler uses node pools to schedule a workload: the Scheduler first tries to allocate resources using the highest priority node pool, then the next in priority, until it reaches the lowest priority node pool, and then starts from the highest again. The Scheduler uses the project's list of prioritized node pools only if the order of priority of node pools is not set in the workload during submission, either by an admin policy or by the user. An empty value means the node pool is not part of the project's default node pool priority list, but the node pool can still be chosen by an admin policy or a user during workload submission

    • Node pool - This column is displayed only if more than one node pool exists. It represents the name of the node pool

    • Under the QUOTA tab

      • Over-quota state - Indicates if over-quota is enabled or disabled as set in the SCHEDULING PREFERENCES tab. If over-quota is set to None, then it is disabled.

      • GPU devices - The number of GPUs you want to allocate for this project in this node pool (decimal number)

      • CPUs (Cores) - This column is displayed only if CPU quota is enabled via the General settings. Represents the number of CPU cores you want to allocate for this project in this node pool (decimal number).

      • CPU memory - This column is displayed only if CPU quota is enabled via the General settings. Represents the amount of CPU memory you want to allocate for this project in this node pool (in Megabytes or Gigabytes).

      • Under the SCHEDULING PREFERENCES tab

        • Project priority - Sets the project's scheduling priority compared to other projects in the same node pool, using one of the following priorities:

          • Highest - 255

          • VeryHigh - 240

          • High - 210

          • MediumHigh - 180

          • Medium - 150

          • MediumLow - 100

          • Low - 50

          • VeryLow - 20

          • Lowest - 1

          For v2.21, the default value is MediumLow. All projects are set with the same default value, therefore there is no change in scheduling behavior unless the Administrator changes any project priority values. To learn more about project priority, see The NVIDIA Run:ai Scheduler: concepts and principles.

        • Over-quota - If over-quota weight is enabled via the General settings, then Over-quota weight is presented; otherwise, Over-quota is presented

          • Over-quota - When enabled, the project can use non-guaranteed overage resources above its quota in this node pool. The amount of the non-guaranteed overage resources for this project is calculated proportionally to the project quota in this node pool. When disabled, the project cannot use more resources than the guaranteed quota in this node pool.

          • Over-quota weight - Represents a weight used to calculate the amount of non-guaranteed overage resources a project can get on top of its quota in this node pool. All unused resources are split between projects that require the use of overage resources:

            • Medium - The default value. The Administrator can change the default to any of the following values - High, Low, Lowest, or None.

            • Lowest - Over-quota weight 'Lowest' has a unique behavior since it can only use over-quota (unused overage) resources if no other project needs them. Any project with a higher over-quota weight can snap the overage resources at any time.

            • None - When set, the project cannot use more resources than the guaranteed quota in this node pool

          • Unlimited - CPU (Cores) and CPU memory quotas are an exception. In this case, workloads of subordinated projects can consume available resources up to the physical limitation of the cluster or any of the node pools.

        • Project max. GPU device allocation - Represents the maximum GPU device allocation the project can get from this node pool - the maximum sum of quota and over-quota GPUs (decimal number)

Note

Setting the quota to 0 (either GPU, CPU, or CPU memory) and the over quota to ‘disabled’ or over quota weight to ‘none’ means the project is blocked from using those resources on this node pool.

When no node pools are configured, you can set the same parameters for the whole project, instead of per node pool. After node pools are created, you can set the above parameters for each node-pool separately.

  1. Set as required.

  2. Click CREATE PROJECT

Adding an Access Rule to a Project

To create a new access rule for a project:

  1. Select the project you want to add an access rule for

  2. Click ACCESS RULES

  3. Click +ACCESS RULE

  4. Select a subject

  5. Select or enter the subject identifier:

    • User - Email for a local user created in NVIDIA Run:ai, or for an SSO user as recognized by the IdP

    • Group - Name as recognized by the IdP

    • Application - Name as created in NVIDIA Run:ai

  6. Select a role

  7. Click SAVE RULE

  8. Click CLOSE

Deleting an Access Rule from a Project

To delete an access rule from a project:

  1. Select the project you want to remove an access rule from

  2. Click ACCESS RULES

  3. Find the access rule you want to delete

  4. Click on the trash icon

  5. Click CLOSE

Editing a Project

To edit a project:

  1. Select the project you want to edit

  2. Click EDIT

  3. Update the Project and click SAVE

Viewing a Project’s Policy

To view the policy of a project:

  1. Select the project for which you want to view its policy. This option is only active for projects with defined policies in place.

  2. Click VIEW POLICY and select the workload type for which you want to view the policies:

    a. Workspace workload type policy with its set of rules

    b. Training workload type policy with its set of rules

  3. In the Policy form, view the workload rules that are enforcing your project for the selected workload type as well as the defaults:

    • Parameter - The workload submission parameter that Rules and Defaults are applied to

    • Type (applicable for data sources only) - The data source type (Git, S3, NFS, PVC, etc.)

    • Default - The default value of the Parameter

    • Rule - Set up constraints on workload policy fields

    • Source - The origin of the applied policy (cluster, department or project)

Note

The policy affecting the project consists of rules and defaults. Some of these rules and defaults may be derived from policies of a parent cluster and/or department (source). You can see the source of each rule in the policy form.

Deleting a Project

To delete a project:

  1. Select the project you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm

Note

Clusters < v2.20

Deleting a project does not delete its associated namespace, any of the workloads running in this namespace, or the policies defined for this project. However, any assets created in the scope of this project, such as compute resources, environments, data sources, templates and credentials, are permanently deleted from the system.

Clusters >=v2.20

Deleting a project does not delete its associated namespace, but will attempt to delete its associated workloads and assets. Any assets created in the scope of this project, such as compute resources, environments, data sources, templates and credentials, are permanently deleted from the system.

Using API

To view the available actions, go to the API reference.
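For example, the following is a minimal curl sketch for listing projects, which is also a convenient way to look up a project's ID. The org-unit path shown here is an assumption for recent API versions and should be verified against the API reference:

# List projects and their IDs (verify the exact path in the API reference)
curl --location 'https://<COMPANY-URL>/api/v1/org-unit/projects' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>'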

Project

The name of the project

Department

The name of the parent department. Several projects may be grouped under a department.

Status

The Project creation status. Projects are manifested as Kubernetes namespaces. The project status represents the Namespace creation status.

Node pool(s) with quota

The node pools associated with the project. By default, a new project is associated with all node pools within its associated cluster. Administrators can change the node pools’ quota parameters for a project. Click the values under this column to view the list of node pools with their parameters (as described below)

Subject(s)

The users, SSO groups, or applications with access to the project. Click the values under this column to view the list of subjects with their parameters (as described below). This column is only viewable if your role in the NVIDIA Run:ai platform allows you those permissions.

Allocated GPUs

The total number of GPUs allocated by successfully scheduled workloads under this project

GPU allocation ratio

The ratio of Allocated GPUs to GPU quota. This number reflects how well the project’s GPU quota is utilized by its descendent workloads. A number higher than 100% indicates the project is using over quota GPUs.

GPU quota

The GPU quota allocated to the project. This number represents the sum of all node pools’ GPU quota allocated to this project.

Allocated CPUs (Core)

The total number of CPU cores allocated by workloads submitted within this project. (This column is only available if the CPU Quota setting is enabled, as described below).

Allocated CPU Memory

The total amount of CPU memory allocated by successfully scheduled workloads under this project. (This column is only available if the CPU Quota setting is enabled, as described below).

CPU quota (Cores)

CPU quota allocated to this project. (This column is only available if the CPU Quota setting is enabled, as described below). This number represents the sum of all node pools’ CPU quota allocated to this project. The ‘unlimited’ value means the CPU (cores) quota is not bounded and workloads using this project can use as many CPU (cores) resources as they need (if available).

CPU memory quota

CPU memory quota allocated to this project. (This column is only available if the CPU Quota setting is enabled, as described below). This number represents the sum of all node pools’ CPU memory quota allocated to this project. The ‘unlimited’ value means the CPU memory quota is not bounded and workloads using this Project can use as much CPU memory resources as they need (if available).

CPU allocation ratio

The ratio of Allocated CPUs (cores) to CPU quota (cores). This number reflects how much the project’s ‘CPU quota’ is utilized by its descendent workloads. A number higher than 100% indicates the project is using over quota CPU cores.

CPU memory allocation ratio

The ratio of Allocated CPU memory to CPU memory quota. This number reflects how well the project’s ‘CPU memory quota’ is utilized by its descendent workloads. A number higher than 100% indicates the project is using over quota CPU memory.

Node affinity of training workloads

The list of NVIDIA Run:ai node-affinities. Any training workload submitted within this project must specify one of those NVIDIA Run:ai node affinities, otherwise it is not submitted.

Node affinity of interactive workloads

The list of NVIDIA Run:ai node-affinities. Any interactive (workspace) workload submitted within this project must specify one of those NVIDIA Run:ai node affinities, otherwise it is not submitted.

Idle time limit of training workloads

The time in days:hours:minutes after which the project stops a training workload not using its allocated GPU resources.

Idle time limit of preemptible workloads

The time in days:hours:minutes after which the project stops a preemptible interactive (workspace) workload not using its allocated GPU resources.

Idle time limit of non preemptible workloads

The time in days:hours:minutes after which the project stops a non-preemptible interactive (workspace) workload not using its allocated GPU resources.

Interactive workloads time limit

The duration in days:hours:minutes after which the project stops an interactive (workspace) workload

Training workloads time limit

The duration in days:hours:minutes after which the project stops a training workload

Creation time

The timestamp for when the project was created

Workload(s)

The list of workloads associated with the project. Click the values under this column to view the list of workloads with their resource parameters (as described below).

Cluster

The cluster that the project is associated with

Node pool

The name of the node pool is given by the administrator during node pool creation. All clusters have a default node pool created automatically by the system and named ‘default’.

GPU quota

The amount of GPU quota the administrator dedicated to the project for this node pool (floating number, e.g. 2.3 means 230% of GPU capacity).

CPU (Cores)

The amount of CPU (cores) quota the administrator has dedicated to the project for this node pool (floating number, e.g. 1.3 cores = 1300 milli-cores). The ‘unlimited’ value means the CPU (Cores) quota is not bounded and workloads using this node pool can use as many CPU (Cores) resources as they require (if available).

CPU memory

The amount of CPU memory quota the administrator has dedicated to the project for this node pool (floating number, in MB or GB). The ‘unlimited’ value means the CPU memory quota is not bounded and workloads using this node pool can use as much CPU memory resource as they need (if available).

Allocated GPUs

The actual amount of GPUs allocated by workloads using this node pool under this project. The number of allocated GPUs may temporarily surpass the GPU quota if over quota is used.

Allocated CPU (Cores)

The actual amount of CPUs (cores) allocated by workloads using this node pool under this project. The number of allocated CPUs (cores) may temporarily surpass the CPUs (Cores) quota if over quota is used.

Allocated CPU memory

The actual amount of CPU memory allocated by workloads using this node pool under this Project. The number of Allocated CPU memory may temporarily surpass the CPU memory quota if over quota is used.

Order of priority

The default order in which the Scheduler uses node-pools to schedule a workload. This is used only if the order of priority of node pools is not set in the workload during submission, either by an admin policy or the user. An empty value means the node pool is not part of the project’s default list, but can still be chosen by an admin policy or the user during workload submission

Subject

A user, SSO group, or application assigned with a role in the scope of this Project

Type

The type of subject assigned to the access rule (user, SSO group, or application)

Scope

The scope of this project in the organizational tree. Click the name of the scope to view the organizational tree diagram, you can only view the parts of the organizational tree for which you have permission to view.

Role

The role assigned to the subject, in this project’s scope

Authorized by

The user who granted the access rule

Last updated

The last time the access rule was updated

Workload

The name of the workload, given during its submission. Optionally, an icon describing the type of workload is also visible

Type

The type of the workload, e.g. Workspace, Training, Inference

Status

The state of the workload and time elapsed since the last status change

Created by

The subject that created this workload

Running/ requested pods

The number of running pods out of the number of requested pods for this workload. For example, a distributed workload requesting 4 pods may be in a state where only 2 are running and 2 are pending

Creation time

The date and time the workload was created

GPU compute request

The amount of GPU compute requested (floating number, represents either a portion of the GPU compute, or the number of whole GPUs requested)

GPU memory request

The amount of GPU memory requested (floating number, can either be presented as a portion of the GPU memory, an absolute memory size in MB or GB, or a MIG profile)

CPU memory request

The amount of CPU memory requested (floating number, presented as an absolute memory size in MB or GB)

CPU compute request

The amount of CPU compute requested (floating number, represents the number of requested Cores)


Workloads

This section explains the procedure for managing workloads.

Workloads Table

The Workloads table can be found under Workload manager in the NVIDIA Run:ai platform.

The workloads table provides a list of all the workloads scheduled on the NVIDIA Run:ai Scheduler, and allows you to manage them.

The Workloads table consists of the following columns:

Column
Description

Workload

The name of the workload

Type

The workload type

Preemptible

Whether the workload is preemptible (Yes/No)

Status

The different phases in a workload lifecycle

Project

The project in which the workload runs

Department

The department that the workload is associated with. This column is visible only if the department toggle is enabled by your administrator.

Created by

The user who created the workload

Running/requested pods

The number of running pods out of the requested

Creation time

The timestamp of when the workload was created

Completion time

The timestamp the workload reached a terminal state (failed/completed)

Connection(s)

The method by which you can access and interact with the running workload. It's essentially the "doorway" through which you can reach and use the tools the workload provides (e.g., node port, external URL). Click one of the values in the column to view the list of connections and their parameters.

Data source(s)

Data resources used by the workload

Environment

The environment used by the workload

Workload architecture

Standard or distributed. A standard workload consists of a single process. A distributed workload consists of multiple processes working together. These processes can run on different nodes.

GPU compute request

Amount of GPU devices requested

GPU compute allocation

Amount of GPU devices allocated

GPU memory request

Amount of GPU memory Requested

GPU memory allocation

Amount of GPU memory allocated

Idle GPU devices

The number of allocated GPU devices that have been idle for more than 5 minutes

CPU compute request

Amount of CPU cores requested

CPU compute allocation

Amount of CPU cores allocated

CPU memory request

Amount of CPU memory requested

CPU memory allocation

Amount of CPU memory allocated

Cluster

The cluster that the workload is associated with

Workload Status

The following table describes the different phases in a workload life cycle. The UI provides additional details for some of the below workload statuses which can be viewed by clicking the icon next to the status.

Status
Description
Entry Condition
Exit Condition

Creating

Workload setup is initiated in the cluster. Resources and pods are now provisioning.

A workload is submitted

A multi-pod group is created

Pending

Workload is queued and awaiting resource allocation

A pod group exists

All pods are scheduled

Initializing

Workload is retrieving images, starting containers, and preparing pods

All pods are scheduled

All pods are initialized or a failure to initialize is detected

Running

Workload is currently in progress with all pods operational

All pods initialized (all containers in pods are ready)

Workload completion or failure

Degraded

Pods may not align with specifications, network services might be incomplete, or persistent volumes may be detached. Check your logs for specific details.

  • Pending - All pods are running but have issues.

  • Running - All pods are running with no issues.

  • Running - All resources are OK.

  • Completed - Workload finished with fewer resources

  • Failed - Workload failure or user-defined rules.

Deleting

Workload and its associated resources are being decommissioned from the cluster

Deleting the workload

Resources are fully deleted

Stopped

Workload is on hold and resources are intact but inactive

Stopping the workload without deleting resources

Transitioning back to the initializing phase or proceeding to deleting the workload

Failed

Image retrieval failed or containers experienced a crash. Check your logs for specific details

An error occurs preventing the successful completion of the workload

Terminal state

Completed

Workload has successfully finished its execution

The workload has finished processing without errors

Terminal state

Pods Associated with the Workload

Click one of the values in the Running/requested pods column, to view the list of pods and their parameters.

Column
Description

Pod

Pod name

Status

Pod lifecycle stages

Node

The node on which the pod resides

Node pool

The node pool in which the pod resides (applicable if node pools are enabled)

Image

The pod’s main image

GPU compute allocation

Amount of GPU devices allocated for the pod

GPU memory allocation

Amount of GPU memory allocated for the pod

Connections Associated with the Workload

A connection refers to the method by which you can access and interact with the running workloads. It is essentially the "doorway" through which you can reach and use the applications (tools) these workloads provide.

Click one of the values in the Connection(s) column, to view the list of connections and their parameters. Connections are network interfaces that communicate with the application running in the workload. Connections are either the URL the application exposes or the IP and the port of the node that the workload is running on.

Column
Description

Name

The name of the application running on the workload

Connection type

The network connection type selected for the workload

Access

Who is authorized to use this connection (everyone, specific groups/users)

Address

The connection URL

Copy button

Copy URL to clipboard

Connect button

Enabled only for supported tools

Data Sources Associated with the Workload

Click one of the values in the Data source(s) column to view the list of data sources and their parameters.

Column
Description

Data source

The name of the data source mounted to the workload

Type

The data source type (e.g., Git, S3, NFS, PVC)

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

  • Refresh - Click REFRESH to update the table with the latest data

  • Show/Hide details - Click to view additional information on the selected row

Show/Hide Details

Click a row in the Workloads table and then click the SHOW DETAILS button at the upper-right side of the action bar. The details pane appears, presenting the following tabs:

Event History

Displays the workload status over time. It displays events describing the workload lifecycle and alerts on notable events. Use the filter to search through the history for specific events.

Metrics

  • GPU utilization - Per-GPU graphs and an average-of-all-GPUs graph, all on the same chart, over an adjustable time period, showing the trends of GPU compute utilization (percentage of GPU compute) for this workload.

  • GPU memory utilization - Per-GPU graphs and an average-of-all-GPUs graph, all on the same chart, over an adjustable time period, showing the trends of GPU memory usage (percentage of GPU memory) for this workload.

  • CPU compute utilization - The average compute utilization of all CPU cores in a single graph, over an adjustable time period, showing the trends of CPU compute utilization (percentage of CPU compute) for this workload.

  • CPU memory utilization - The utilization of all CPU memory in a single graph, over an adjustable time period, showing the trends of CPU memory utilization (percentage of CPU memory) for this workload.

  • CPU memory usage - The usage of all CPU memory in a single graph, over an adjustable time period, showing the trends of CPU memory usage (in GB or MB of CPU memory) for this workload.

  • For GPU charts - Click the GPU legend on the right-hand side of the chart to activate or deactivate any of the GPU lines.

  • You can click the date picker to change the presented period

  • You can use your mouse to mark a sub-period in the graph for zooming in, and use Reset zoom to go back to the preset period

  • Changes in the period affect all graphs on this screen.

Logs

Workload events are ordered in chronological order. The logs contain events from the workload’s lifecycle to help monitor and debug issues.

Adding a New Workload

Before starting, make sure you have created a project or have one created for you to work with workloads.

To create a new workload:

  1. Click +NEW WORKLOAD

  2. Select a workload type - Follow the links below to view the step-by-step guide for each workload type:

    • Workspace - Used for data preparation and model-building tasks.

    • Training - Used for standard training tasks of all sorts

    • Distributed Training - Used for distributed tasks of all sorts

    • Inference - Used for inference and serving tasks

    • Job (legacy). This type is displayed only if enabled by your Administrator, under General settings → Workloads → Workload policies

  3. Click CREATE WORKLOAD

Stopping a Workload

Stopping a workload kills the workload pods and releases the workload resources.

  1. Select the workload you want to stop

  2. Click STOP

Running a Workload

Running a workload spins up new pods and resumes the workload work after it was stopped.

  1. Select the workload you want to run again

  2. Click RUN

Connecting to a Workload

To connect to an application running in the workload (for example, Jupyter Notebook)

  1. Select the workload you want to connect

  2. Click CONNECT

  3. Select the tool from the drop-down list

  4. The selected tool is opened in a new tab on your browser

Copying a Workload

  1. Select the workload you want to copy

  2. Click MAKE A COPY

  3. Enter a name for the workload. The name must be unique.

  4. Update the workload and click CREATE WORKLOAD

Deleting a Workload

  1. Select the workload you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm the deletion

Note

Once a workload is deleted you can view it in the Deleted tab in the workloads view. This tab is displayed only if enabled by your Administrator, under General settings → Workloads → Deleted workloads

Using API

Go to the Workloads API reference to view the available actions

Troubleshooting

To understand the condition of the workload, review the workload status in the Workloads table. For more information, check the workload’s event history.

Listed below are a number of known issues when working with workloads and how to fix them:

Issue
Mediation

Cluster connectivity issues (a "there are issues with your connection to the cluster" error message appears)

  • Verify that you are on a network that has been granted access to the cluster.

  • Reach out to your cluster admin for instructions on verifying this.

  • If you are an admin, see the troubleshooting scenarios section in the cluster documentation

Workload in “Initializing” status for some time

  • Check that you have access to the Container image registry.

  • Check the statuses of the pods associated with the workload.

  • Check the event history for more details

Workload has been pending for some time

  • Check that you have the required quota.

  • Check the project’s available quota in the project dialog.

  • Check that all services needed to run are bound to the workload.

  • Check the event history for more details.

PVCs created using the K8s API or kubectl are not visible or mountable in NVIDIA Run:ai

This is by design. To make such a PVC usable in NVIDIA Run:ai (a minimal kubectl example is shown after this table):

  1. Create a new data source of type PVC in the NVIDIA Run:ai UI

  2. In the Data mount section, select Existing PVC

  3. Select the PVC you created via the K8S API

You are now able to select and mount this PVC in your NVIDIA Run:ai submitted workloads.

Workload is not visible in the UI

  • Check that the workload hasn’t been deleted.

  • See the “Deleted” tab in the workloads view

Launching Workloads with GPU Memory Swap

This quick start provides a step-by-step walkthrough for running multiple LLMs (inference workloads) on a single GPU using GPU memory swap.

GPU memory swap expands the GPU physical memory to the CPU memory, allowing NVIDIA Run:ai to place and run more workloads on the same GPU physical hardware. This provides a smooth workload context switching between GPU memory and CPU memory, eliminating the need to kill workloads when the memory requirement is larger than what the GPU physical memory can provide.

Note

If enabled by your Administrator, the NVIDIA Run:ai UI allows you to create a new workload using either the Flexible or Original submission form. The steps in this quick start guide reflect the Original form only.

Prerequisites

Before you start, make sure:

  • You have created a project or have one created for you.

  • The project has an assigned quota of at least 1 GPU.

  • Dynamic GPU fractions is enabled.

  • GPU memory swap is enabled on at least one free node.

  • Host-based routing is configured.

Note

Dynamic GPU fractions is disabled by default in the NVIDIA Run:ai UI. To use dynamic GPU fractions, it must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.

Step 1: Logging In

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

To use the API, you will need to obtain a token as shown in API authentication.

Step 2: Submitting the First Inference Workload

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Inference

  3. Select under which cluster to create the workload

  4. Select the project in which your workload will run

  5. Select custom inference from Inference type

  6. Enter a name for the workload (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Create an environment for your workload

    • Click +NEW ENVIRONMENT

    • Enter a name for the environment. The name must be unique.

    • Enter the NVIDIA Run:ai vLLM Image URL - runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0

    • Set the runtime settings for the environment

      • Click +ENVIRONMENT VARIABLE and add the following

        • Name: RUNAI_MODEL Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct (you can choose any vLLM supporting model from Hugging Face)

        • Name: RUNAI_MODEL_NAME Source: Custom Value: Llama-3.2-1B-Instruct

        • Name: HF_TOKEN Source: Custom Value: <Your Hugging Face token> (only needed for gated models)

        • Name: VLLM_RPC_TIMEOUT Source: Custom Value: 60000

    • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  9. Create a new “request-limit” compute resource

    • Click +NEW COMPUTE RESOURCE

    • Enter a name for the compute resource. The name must be unique.

    • Set GPU devices per pod - 1

    • Set GPU memory per device

      • Select % (of device) - Fraction of a GPU device’s memory

      • Set the memory Request - 50 (the workload will allocate 50% of the GPU memory)

      • Toggle Limit and set to 100%

    • Optional: set the CPU compute per pod - 0.1 cores (default)

    • Optional: set the CPU memory per pod - 100 MB (default)

    • Select More settings and toggle Increase shared memory size

    • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  10. Click CREATE INFERENCE

Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the Inferences API:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 3: Submitting the Second Inference Workload

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Inference

  3. Select the cluster where the previous inference workload was created

  4. Select the project where the previous inference workload was created

  5. Select custom inference from Inference type

  6. Enter a name for the workload (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Select the environment created in Step 2

  9. Select the compute resource created in Step 2

  10. Click CREATE INFERENCE

Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the Inferences API:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 4: Submitting the First Workspace

  1. Go to the Workload manager → Workloads

  2. Click COLUMNS and select Connections

  3. Select the link under the Connections column for the first inference workload created in Step 2

  4. In the Connections Associated with Workload form, copy the URL under the Address column

  5. Click +NEW WORKLOAD and select Workspace

  6. Select the cluster where the previous inference workloads were created

  7. Select the project where the previous inference workloads were created

  8. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  9. Click CONTINUE

    In the next step:

  10. Select the ‘chatbot-ui’ environment for your workspace (Image URL: runai.jfrog.io/core-llm/llm-app)

    • Set the runtime settings for the environment with the following environment variables:

      • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

      • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Add the address link from Step 4

      • Delete the PATH_PREFIX environment variable if you are using host-based routing.

    • If ‘chatbot-ui’ is not displayed in the gallery, follow the below steps:

      • Click +NEW ENVIRONMENT

      • Enter a name for the environment. The name must be unique.

      • Enter the chatbot-ui Image URL - runai.jfrog.io/core-llm/llm-app

      • Tools - Set the connection for your tool

        • Click +TOOL

        • Select Chatbot UI tool from the list

      • Set the runtime settings for the environment

        • Click +ENVIRONMENT VARIABLE

        • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

        • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Add the address link from Step 4

        • Name: RUNAI_MODEL_TOKEN_LIMIT Source: Custom Value: 8192

        • Name: RUNAI_MODEL_MAX_LENGTH Source: Custom Value: 16384

      • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  11. Select the ‘cpu-only’ compute resource for your workspace

    • If ‘cpu-only’ is not displayed in the gallery, follow the below steps:

      • Click +NEW COMPUTE RESOURCE

      • Enter a name for the compute resource. The name must be unique.

      • Set GPU devices per pod - 0

      • Set CPU compute per pod - 0.1 cores

      • Set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  12. Click CREATE WORKSPACE

Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the Workspaces API:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

  • <URL> - The URL for connecting an external service related to the workload. You can get the URL via the List Workloads API.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 5: Submitting the Second Workspace

  1. Go to the Workload manager → Workloads

  2. Click COLUMNS and select Connections

  3. Select the link under the Connections column for the second inference workload created in Step 3

  4. In the Connections Associated with Workload form, copy the URL under the Address column

  5. Click +NEW WORKLOAD and select Workspace

  6. Select the cluster where the previous inference workloads were created

  7. Select the project where the previous inference workloads were created

  8. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  9. Click CONTINUE

    In the next step:

  10. Select the ‘chatbot-ui’ environment created in Step 4

    • Set the runtime settings for the environment with the following environment variables:

      • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

      • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Add the address link from Step 4

      • Delete the PATH_PREFIX environment variable if you are using host-based routing.

  11. Select the ‘cpu-only’ compute resource created in Step 4

  12. Click CREATE WORKSPACE

Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the Workspaces API:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

  • <URL> - The URL for connecting an external service related to the workload. You can get the URL via the List Workloads API.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 6: Connecting to ChatbotUI

  1. Select the newly created workspace that you want to connect to

  2. Click CONNECT

  3. Select the ChatbotUI tool. The selected tool is opened in a new tab on your browser.

  4. Query both workspaces simultaneously and see them both respond. The workspace currently swapped out to CPU RAM will take longer to respond while it is swapped back to the GPU, and vice versa.

  1. To connect to the ChatbotUI tool, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>

  2. Query both workspaces simultaneously and see them both respond. The workspace currently swapped out to CPU RAM will take longer to respond while it is swapped back to the GPU, and vice versa.

Next Steps

Manage and monitor your newly created workloads using the Workloads table.

Running Workspaces

This section explains how to create a workspace via the NVIDIA Run:ai UI.

A workspace contains the setup and configuration needed for building your model, including the container, images, data sets, and resource requests, as well as the required tools for the research, all in a single place.

To learn more about the workspace workload type in NVIDIA Run:ai and determine whether it is the most suitable workload type for your goals, see Workload types.

Before You start

Make sure you have created a project or have one created for you.

Note

  • Flexible workload submission – Disabled by default. If unavailable, your Administrator must enable it under General Settings → Workloads → Flexible Workload Submission.

  • GPU memory limit – Disabled by default. If unavailable, your Administrator must enable it under General Settings → Resources → GPU Resource Optimization.

  • Tolerations – Disabled by default. If unavailable, your Administrator must enable it under General Settings → Workloads → Tolerations.

  • Data volumes – Disabled by default. If unavailable, your Administrator must enable it under General Settings → Workloads → Data volumes. Data volumes are available for flexible workload submission only.

Workload Priority

By default, workspaces in NVIDIA Run:ai are assigned the build priority, which is non-preemptible. If needed, you can override this default and set the priority to interactive-preemptible. For more details, see Workload priority control.

Workload Policies

When creating a new workload, fields and assets may have limitations or defaults. These rules and defaults are derived from a policy your administrator set.

Policies allow you to control, standardize, and simplify the workload submission process. For additional information, see Policies and rules.

The effects of the policy are reflected in the workspace creation form:

  • Defaults derived from the policy will be displayed automatically for specific fields.

  • Disabled actions and permitted value ranges for values will be visibly explained per field.

  • Rules and defaults for entire sections (such as environments, compute resources, or data sources) may prevent selection and will appear on the entire library card with an option for additional information via an external modal.

Submission Form Options

You can create a new workspace using either the Flexible or Original submission form. The Flexible submission form offers greater customization and is the recommended method. Within the Flexible form, you have two options:

  • Load from an existing setup - You can select an existing setup to populate the workspace form with predefined values. While the Original submission form also allows you to select an existing setup, with the Flexible submission you can customize any of the populated fields for a one-time configuration. These changes will apply only to this workspace and will not modify the original setup. If needed, you can reset the configuration to the original setup at any time.

  • Provide your own settings - Manually fill in the workspace configuration fields. This is a one-time setup that applies only to the current workspace and will not be saved for future use.

Note

The Original submission form will be deprecated in a future release.

Creating a New Workspace

  1. To add a new workspace, go to Workload manager → Workloads.

  2. Click +NEW WORKLOAD and select Workspace from the drop-down menu.

  3. Within the new workspace form, select the cluster and project. To create a new project, click +NEW PROJECT and refer to Projects for a step-by-step guide.

  4. Select a preconfigured template or select Start from scratch to launch a new workspace quickly.

  5. Enter a unique name for the workspace. If the name already exists in the project, you will be requested to submit a different name.

  6. Under Submission, select Flexible or Original and click CONTINUE.

Setting Up an Environment

Load from existing setup

  1. Click the load icon. A side pane appears, displaying a list of available environments. Select an environment from the list.

  2. Optionally, customize any of the environment’s predefined fields as shown below. The changes will apply to this workspace only and will not affect the selected environment.

  3. Alternatively, click the ➕ icon in the side pane to create a new environment. For step-by-step instructions, see Environments.

Provide your own settings

Manually configure the settings below as needed. The changes will apply to this workspace only.

Configure environment

  1. Add the Image URL or update the URL of the existing setup.

  2. Set the condition for pulling the image by selecting the image pull policy. It is recommended to pull the image only if it's not already present on the host.

  3. Set the connection for your tool(s). If you are loading from existing setup, the tools are configured as part of the environment.

    • Select the connection type - External URL or NodePort:

      • Auto generate - A unique URL / port is automatically created for each workload using the environment.

      • Custom URL / Custom port - Manually define the URL or port. For custom port, make sure to enter a port between 30000 and 32767. If the node port is already in use, the workload will fail and display an error message.

    • Modify who can access the tool:

      • By default, All authenticated users is selected giving access to everyone within the organization’s account.

      • For Specific group(s), enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access to the tool.

      • For Specific user(s), enter a valid email address or username. If you remove yourself, you will lose access to the tool.

  4. Set the command and arguments for the container running the workspace. If no command is added, the container will use the image’s default command (entry-point):

    • Modify the existing command or click +COMMAND & ARGUMENTS to add a new command.

    • Set multiple arguments separated by spaces, using the following format (e.g.: --arg1=val1).

  5. Set the environment variable(s):

    • Modify the existing environment variable(s) if you are loading from an existing setup. The existing environment variables may include instructions to guide you with entering the correct values.

    • To add a new variable, click + ENVIRONMENT VARIABLE.

    • You can either select Custom to define your own variable, or choose from a predefined list of Secrets or ConfigMaps.

  6. Enter a path pointing to the container's working directory.

  7. Set where the UID, GID, and supplementary groups for the container should be taken from. If you select Custom, you’ll need to manually enter the UID, GID and Supplementary groups values.

  8. Select additional Linux capabilities for the container from the drop-down menu. This grants certain privileges to a container without granting all the root user's privileges.

  1. Select an environment or click +NEW ENVIRONMENT to add a new environment to the gallery. For a step-by-step guide on adding environments to the gallery, see Environments. Once created, the new environment will be automatically selected.

  2. Set the connection for your tool(s). If you are loading from existing setup, the tools are configured as part of the environment.

    • Select the connection type - External URL or NodePort:

      • Auto generate - A unique URL / port is automatically created for each workload using the environment.

      • Custom URL / Custom port - Manually define the URL or port. For custom port, make sure to enter a port between 30000 and 32767. If the node port is already in use, the workload will fail and display an error message.

    • Optional: Modify who can access the tool:

      • By default, All authenticated users is selected giving access to everyone within the organization’s account.

      • For Specific group(s), enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access to the tool.

      • For Specific user(s), enter a valid email address or username. If you remove yourself, you will lose access to the tool.

    • Set the User ID (UID), Group ID (GID) and the Supplementary groups that can run commands in the container.

  3. Optional: Set the command and arguments for the container running the workload. If no command is added, the container will use the image’s default command (entry-point):

    • Modify the existing command or click +COMMAND & ARGUMENTS to add a new command.

    • Set multiple arguments separated by spaces, using the following format (e.g.: --arg1=val1).

  4. Set the environment variable(s):

    • Modify the existing environment variable(s). The existing environment variables may include instructions to guide you in entering the correct values.

    • Optional: To add a new variable, click + ENVIRONMENT VARIABLE.

    • You can either select Custom to define your own variable, or choose from a predefined list of Credentials or ConfigMaps.

Setting Up Compute Resources

Load from existing setup

  1. Click the load icon. A side pane appears, displaying a list of available compute resources. Select a compute resource from the list.

  2. Optionally, customize any of the compute resource's predefined fields. The changes will apply to this workspace only and will not affect the selected compute resource.

  3. Alternatively, click the ➕ icon in the side pane to create a new compute resource. For step-by-step instructions, see Compute resources.

Provide your own settings

Manually configure the settings below as needed. The changes will apply to this workspace only.

Configure compute resources

  1. Set the number of GPU devices per pod (physical GPUs).

  2. Set the GPU memory per device using either a fraction of a GPU device’s memory (% of device) or a GPU memory unit (MB/GB):

    • Request - The minimum GPU memory allocated per device. Each pod in the workspace receives at least this amount per device it uses.

    • Limit - The maximum GPU memory allocated per device. Each pod in the workspace receives at most this amount of GPU memory for each device the pod utilizes. This is disabled by default; to enable it, see Before You start.

  3. Set the CPU compute per pod by choosing the unit (cores or millicores):

    • Request - The minimum amount of CPU compute provisioned per pod. Each running pod receives this amount of CPU compute.

    • Limit - The maximum amount of CPU compute a pod can use. Each pod receives at most this amount of CPU compute. By default, the limit is set to Unlimited which means that the pod may consume all the node's free CPU compute resources.

  4. Set the CPU memory per pod by selecting the unit (MB or GB):

    • Request - The minimum amount of CPU memory provisioned per pod. Each running pod receives this amount of CPU memory.

    • Limit - The maximum amount of CPU memory a pod can use. Each pod receives at most this amount of CPU memory. By default, the limit is set to Unlimited which means that the pod may consume all the node's free CPU memory resources.

  5. Set extended resource(s):

    • Enable Increase shared memory size to allow the shared memory size available to the pod to increase from the default 64MB to the node's total available memory or the CPU memory limit, if set above.

    • Click +EXTENDED RESOURCES to add resource/quantity pairs. For more information on how to set extended resources, see the Extended resources and Quantity guides.

  6. Set the order of priority for the node pools on which the Scheduler tries to run the workspace. When a workspace is created, the Scheduler will try to run it on the first node pool on the list. If the node pool doesn't have free resources, the Scheduler will move on to the next one until it finds one that is available:

    • Drag and drop them to change the order, remove unwanted ones, or reset to the default order defined in the project.

    • Click +NODE POOL to add a new node pool from the list of node pools that were defined on the cluster. To configure a new node pool and for additional information, see Node pools.

  7. Select a node affinity to schedule the workspace on a specific node type. If the administrator added a ‘node type (affinity)’ scheduling rule to the project/department, this field is mandatory. Otherwise, entering a node type (affinity) is optional. Nodes must be tagged with a label that matches the node type key and value.

  8. Click +TOLERATION to allow the workspace to be scheduled on a node with a matching taint. Select the operator and the effect:

    • If you select Exists, the effect will be applied if the key exists on the node.

    • If you select Equals, the effect will be applied if the key and the value set match the value on the node.

  1. Select a compute resource or click +NEW COMPUTE RESOURCE to add a new compute resource to the gallery. For a step-by-step guide on adding compute resources to the gallery, see Compute resources. Once created, the new compute resource will be automatically selected.

  2. Optional: Set the order of priority for the node pools on which the Scheduler tries to run the workload. When a workload is created, the scheduler will try to run it on the first node pool on the list. If the node pool doesn't have free resources, the Scheduler will move on to the next one until it finds one that is available.

    • Drag and drop them to change the order, remove unwanted ones, or reset to the default order defined in the project.

    • Click +NODE POOL to add a new node pool from the list of node pools that were defined on the cluster. To configure a new node pool and for additional information, see Node pools.

  3. Select a node affinity to schedule the workload on a specific node type. If the administrator added a ‘node type (affinity)’ scheduling rule to the project/department, this field is mandatory. Otherwise, entering a node type (affinity) is optional. Nodes must be tagged with a label that matches the node type key and value.

  4. Optional: Click +TOLERATION to allow the workload to be scheduled on a node with a matching taint. Select the operator and the effect.

    • If you select Exists, the effect will be applied if the key exists on the node.

    • If you select Equals, the effect will be applied if the key and the value set match the value on the node.

Setting Up Data & Storage

Note

  • Flexible - If Data volumes are not enabled, Data & storage appears as Data sources only, and no data volumes will be available. To enable Data volumes, see Before You start.

  • Original - This tab outlines how to set Volumes and Data sources.

Load from existing setup

  1. Click the load icon. A side pane appears, displaying a list of available data sources/volumes. Select a data source/volume from the list.

  2. Optionally, customize any of the data source's predefined fields as shown below. The changes will apply to this workspace only and will not affect the selected data source.

  3. Alternatively, click the ➕ icon in the side pane to create a new data source/data volume. For step-by-step instructions, see Data sources or Data volumes.

Provide your own settings

Manually configure the settings below as needed. The changes will apply to this workspace only.

Note: Secrets, ConfigMaps and Data volumes cannot be added as a one-time configuration.

Configure data sources

  1. Click the ➕ icon and choose the data source from the drop-down menu. You can add multiple data sources.

  2. Once selected, set the data origin according to the required fields and enter the container path to set the data target location. For Git and S3, select Secret. This option is relevant for private buckets/repositories based on existing secrets that were created for the scope.

  3. Select Volume to allocate a storage space to your workspace that is persistent across restarts:

    • Set the Storage class to None or select an existing storage class from the list. To add new storage classes, and for additional information, see Kubernetes storage classes. If the administrator defined the storage class configuration, the rest of the fields will appear accordingly.

    • Select one or more access mode(s) and define the claim size and its units.

    • Select the volume mode. If you select Filesystem (default), the volume will be mounted as a filesystem, enabling the usage of directories and files. If you select Block, the volume is exposed as a block storage, which can be formatted or used directly by applications without a filesystem.

    • Set the Container path with the volume target location.

    • Set the volume persistency to Persistent if the volume and its data should be deleted when the workspace is deleted or Ephemeral if the volume and its data should be deleted every time the workspace’s status changes to “Stopped”.

  1. Optional: Click +VOLUME to set the volume needed for your workload. A volume allocates storage space to your workload that is persistent across restarts:

    • Set the Storage class to None or select an existing storage class from the list. To add new storage classes, and for additional information, see Kubernetes storage classes. If the administrator defined the storage class configuration, the rest of the fields will appear accordingly.

    • Select one or more access mode(s) and define the claim size and its units.

    • Select the volume mode. If you select Filesystem (default), the volume will be mounted as a filesystem, enabling the usage of directories and files. If you select Block, the volume is exposed as a block storage, which can be formatted or used directly by applications without a filesystem.

    • Set the Container path with the volume target location.

    • Set the volume persistency to Persistent if the volume and its data should be deleted when the workload is deleted or Ephemeral if the volume and its data should be deleted every time the workload’s status changes to “Stopped”.

  2. Optional: Select an existing data source. Modify the data target location if needed.

  3. To add a new data source, click + NEW DATA SOURCE. For a step-by-step guide, see Data sources. Once created, it will be automatically selected.

Note: If there are connectivity issues with the cluster or problems during data source creation, the data source may not appear in the list.

Setting Up General Settings

Note

The following general settings are optional.

  1. Allow the workload to exceed the project quota. Workloads running over quota may be preempted and stopped at any time.

  2. Set the backoff limit before workload failure. The backoff limit is the maximum number of retry attempts for failed workloads. After reaching the limit, the workload status will change to "Failed." Enter a value between 1 and 100.

  3. Set the timeframe for auto-deletion after workload completion or failure. The time after which a completed or failed workload is deleted; if this field is set to 0 seconds, the workload will be deleted automatically.

  4. Set annotation(s). Kubernetes annotations are key-value pairs attached to the workload. They are used for storing additional descriptive metadata to enable documentation, monitoring and automation.

  5. Set label(s). Kubernetes labels are key-value pairs attached to the workload. They are used for categorizing to enable querying.

Completing the Workspace

  1. Before finalizing your workspace, review your configurations and make any necessary adjustments.

  2. Click CREATE WORKSPACE

Managing and Monitoring

After the workspace is created, it is added to the Workloads table, where it can be managed and monitored.

Using CLI

To view the available actions on workspaces, see the Workspaces CLI v2 reference or the CLI v1 reference.

Using API

To view the available actions on workspaces, see the Workspaces API reference.

curl -L 'https://<COMPANY-URL>/api/v1/workloads/inferences' \ 
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{ 
    "name": "workload-name", 
    "useGivenNameAsPrefix": true,
    "projectId": "<PROJECT-ID>", 
    "clusterId": "<CLUSTER-UUID>", 
    "spec": {
        "image": "runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0",
        "imagePullPolicy":"IfNotPresent",
        "environmentVariables": [
          {
            "name": "RUNAI_MODEL",
            "value": "meta-lama/Llama-3.2-1B-Instruct"
          },
          {
            "name": "VLLM_RPC_TIMEOUT",
            "value": "60000"
          },
          {
            "name": "HF_TOKEN",
            "value":"<INSERT HUGGINGFACE TOKEN>"
          }
        ],
        "compute": {
            "gpuDevicesRequest": 1,
            "gpuRequestType": "portion",
            "gpuPortionRequest": 0.1,
            "gpuPortionLimit": 1,
            "cpuCoreRequest":0.2,
            "cpuMemoryRequest": "200M",
            "largeShmRequest": false

        },
        "servingPort": {
            "container": 8000,
            "protocol": "http",
            "authorizationType": "public"
        }
    }
}'
curl -L 'https://<COMPANY-URL>/api/v1/workloads/inferences' \ 
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \ 
-d '{ 
    "name": "workload-name", 
    "useGivenNameAsPrefix": true,
    "projectId": "<PROJECT-ID>",  
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "image": "runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0",
        "imagePullPolicy":"IfNotPresent",
        "environmentVariables": [
          {
            "name": "RUNAI_MODEL",
            "value": "meta-lama/Llama-3.2-1B-Instruct"
          },
          {
            "name": "VLLM_RPC_TIMEOUT",
            "value": "60000"
          },
          {
            "name": "HF_TOKEN",
            "value":"<INSERT HUGGINGFACE TOKEN>"
          }
        ],
        "compute": {
            "gpuDevicesRequest": 1,
            "gpuRequestType": "portion",
            "gpuPortionRequest": 0.1,
            "gpuPortionLimit": 1,
            "cpuCoreRequest":0.2,
            "cpuMemoryRequest": "200M",
            "largeShmRequest": false

        },
        "servingPort": {
            "container": 8000,
            "protocol": "http",
            "authorizationType": "public"
        }
    }
}'
curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \ 
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \ 
-d '{ 
    "name": "workload-name", 
    "projectId": "<PROJECT-ID>", 
    "clusterId": "<CLUSTER-UUID>",
    "spec": {  
        "image": "runai.jfrog.io/core-llm/llm-app",
        "environmentVariables": [
          {
            "name": "RUNAI_MODEL_NAME",
            "value": "meta-llama/Llama-3.2-1B-Instruct"
          },
          {
            "name": "RUNAI_MODEL_BASE_URL",
            "value": "<URL>" 
          }
        ],
        "compute": {
            "cpuCoreRequest":0.1,
            "cpuMemoryRequest": "100M",
        }
    }
}'
curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \ 
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \ 
-d '{ 
    "name": "workload-name", 
    "projectId": "<PROJECT-ID>", '\ 
    "clusterId": "<CLUSTER-UUID>", \ 
    "spec": {  
        "image": "runai.jfrog.io/core-llm/llm-app",
        "environmentVariables": [
          {
            "name": "RUNAI_MODEL_NAME",
            "value": "meta-llama/Llama-3.2-1B-Instruct"
          },
          {
            "name": "RUNAI_MODEL_BASE_URL",
            "value": "<URL>" 
          }
        ],
        "compute": {
            "cpuCoreRequest":0.1,
            "cpuMemoryRequest": "100M",
        }
    }
}'

Data Sources

This section explains what data sources are and how to create and use them.

Data sources are a type of workload assets and represent a location where data is actually stored. They may represent a remote data location, such as NFS, Git, or S3, or a Kubernetes local resource, such as PVC, ConfigMap, HostPath, or Secret.

This configuration simplifies the mapping of the data into the workload’s file system and handles the mounting process during workload creation for reading and writing. These data sources are reusable and can be easily integrated and used by AI practitioners while submitting workloads across various scopes.

Data Sources Table

The data sources table can be found under Workload manager in the NVIDIA Run:ai platform.

The data sources table provides a list of all the data sources defined in the platform and allows you to manage them.

Note

Data & storage - with Data sources and Data volumes - is visible only if your Administrator has enabled Data volumes.

The data sources table comprises the following columns:

Column
Description

Data source

The name of the data source

Description

A description of the data source

Type

The type of data source connected – e.g., S3 bucket, PVC, or others

Status

The different lifecycle and representation of the data source condition

Scope

The scope of the data source within the organizational tree. Click the scope name to view the organizational tree diagram

Kubernetes name

The unique Kubernetes name of the data source as it appears in the cluster

Workload(s)

The list of existing workloads that use the data source

Template(s)

The list of workload templates that use the data source

Created by

The user who created the data source

Creation time

The timestamp for when the data source was created

Cluster

The cluster that the data source is associated with

Data Sources Status

The following table describes the data sources' condition and whether they were created successfully for the selected scope.

Status
Description

No issues found

No issues were found while creating the data source

Issues found

Issues were found while propagating the data source credentials

Issues found

The data source couldn’t be created at the cluster

Creating…

The data source is being created

No status / “-”

When the data source’s scope is an account, the current version of the cluster is not up to date, or the asset is not a cluster-syncing entity, the status can’t be displayed

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click ‘Download as CSV’. Export to CSV is limited to 20,000 rows.

  • Refresh - Click REFRESH to update the table with the latest data

Adding a New Data Source

To create a new data source:

  1. Click +NEW DATA SOURCE

  2. Select the data source type from the list. Follow the step-by-step guide for each data source type:

NFS

A Network File System (NFS) is a Kubernetes concept used for sharing storage in the cluster among different pods. Like a PVC, the NFS volume’s content remains preserved, even outside the lifecycle of a single pod. However, unlike PVCs, which abstract storage management, NFS provides a method for network-based file sharing. The NFS volume can be pre-populated with data and can be mounted by multiple pod writers simultaneously. At NVIDIA Run:ai, an NFS-type data source is an abstraction that is mapped directly to a Kubernetes NFS volume. This integration allows multiple workloads under various scopes to mount and present the NFS data source.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Set the data origin

    • Enter the NFS server (host name or host IP)

    • Enter the NFS path

  6. Set the data target location

    • Container path

  7. Optional: Restrictions

    • Prevent data modification - When enabled, the data will be mounted with read-only permissions

  8. Click CREATE DATA SOURCE

PVC

A Persistent Volume Claim (PVC) is a Kubernetes concept used for managing storage in the cluster, which can be provisioned by an administrator or dynamically by Kubernetes using a StorageClass. PVCs allow users to request specific sizes and access modes (read/write once, read-only many). NVIDIA Run:ai ensures that data remains consistent and accessible across various scopes and workloads, beyond the lifecycle of individual pods, which is efficient while working with large datasets typically associated with AI projects.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Select PVC:

    • Existing PVC

      This option is relevant when the purpose is to create a PVC-type data source based on an existing PVC in the cluster

      • Select a PVC from the list - (The list is empty if no existing PVCs were created in advance)

    • New PVC - creates a new PVC in the cluster. New PVCs are not added to the Existing PVCs list.

      When creating a PVC-type data source and selecting the ‘New PVC’ option, the PVC is immediately created in the cluster (even if no workload has requested this PVC).

  6. Select the storage class

    • None - Proceed without defining a storage class

    • Custom storage class - This option applies when selecting a storage class based on existing storage classes.

      To add new storage classes to the storage class list, and for additional information, check Kubernetes storage classes

  7. Select the access mode(s) (multiple modes can be selected)

    • Read-write by one node - The volume can be mounted as read-write by a single node.

    • Read-only by many nodes - The volume can be mounted as read-only by many nodes.

    • Read-write by many nodes - The volume can be mounted as read-write by many nodes.

  8. Set the claim size and its units

  9. Select the volume mode

    1. File system (default) - allows the volume to be mounted as a filesystem, enabling the usage of directories and files.

    2. Block - exposes the volume as a block storage, which can be formatted or used by applications directly without a filesystem.

  10. Set the data target location

    • container path

  11. Optional: Prevent data modification - When enabled, the data will be mounted with read-only permission.

  12. Click CREATE DATA SOURCE

After the data source is created, check its status to monitor its proper creation across the selected scope.

S3 Bucket

The S3 bucket data source enables the mapping of a remote S3 bucket into the workload’s file system. Similar to a PVC, this mapping remains accessible across different workload executions, extending beyond the lifecycle of individual pods. However, unlike PVCs, data stored in an S3 bucket resides remotely, which may lead to decreased performance during the execution of heavy machine learning workloads. As part of the NVIDIA Run:ai connection to the S3 bucket, you can create credentials in order to access and map private buckets.

Note

S3 data sources are not supported for custom inference workloads.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Set the data origin

    • Set the S3 service URL

    • Select the credential

      • None - for public buckets

      • Credential names - This option is relevant for private buckets based on existing credentials that were created for the scope.

        To add new credentials to the credentials list, and for additional information, check the Credentials article.

    • Enter the bucket name

  6. Set the data target location

    • container path

  7. Click CREATE DATA SOURCE

After a private data source is created, check its status to monitor its proper creation across the selected scope.

Git

A Git-type data source is an NVIDIA Run:ai integration that enables code to be copied from a Git branch into a dedicated folder in the container. It is mainly used to provide the workload with the latest code repository. As part of the integration with Git, in order to access private repositories, you can add predefined credentials to the data source mapping.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Set the data origin

    • Set the Repository URL

    • Set the Revision (branch, tag, or hash) - If left empty, it will use the 'HEAD' (latest)

    • Select the credential

      • None - for public repositories

      • Credential names - This option applies to private repositories based on existing credentials that were created for the scope.

        To add new credentials to the credentials list, and for additional information, check the Credentials article.

  6. Set the data target location

    • container path

  7. Click CREATE DATA SOURCE

After a private data source is created, check its status to monitor its proper creation across the selected scope.

Host path

A Host path volume is a Kubernetes concept that enables mounting a host path file or a directory on the workload’s file system. Like a PVC, the host path volume’s data persists across workloads under various scopes. It also enables data serving from the hosting node.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Set the data origin

    • host path

  6. Set the data target location

    • container path

  7. Optional: Prevent data modification - When enabled, the data will be mounted with read-only permissions.

  8. Click CREATE DATA SOURCE

ConfigMap

A ConfigMap data source is an NVIDIA Run:ai abstraction for the Kubernetes ConfigMap concept. The ConfigMap is used mainly for storage that can be mounted on the workload container for non-confidential data. It is usually represented in key-value pairs (e.g., environment variables, command-line arguments etc.). It allows you to decouple environment-specific system configurations from your container images, so that your applications are easily portable. ConfigMaps must be created on the cluster prior to being used within the NVIDIA Run:ai system.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Set the data origin

    • Select the ConfigMap name (The list is empty if no existing ConfigMaps were created in advance).

  6. Set the data target location

    • container path

  7. Click CREATE DATA SOURCE

Secret

A secret-type data source enables the mapping of a credential into the workload’s file system. Credentials are a workload asset that simplify the complexities of Kubernetes Secrets. The credentials mask sensitive access information, such as passwords, tokens, and access keys, which are necessary for gaining access to various resources.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Set the data origin

    • Select the credential

      To add new credentials, and for additional information, check the Credentials article.

  6. Set the data target location

    • container path

  7. Click CREATE DATA SOURCE

After the data source is created, check its status to monitor its proper creation across the selected scope.

Note

It is also possible to add data sources directly when creating a specific workspace, training or inference workload.

Copying a Data Source

To copy an existing data source:

  1. Select the data source you want to copy

  2. Click MAKE A COPY

  3. Enter a name for the data source. The name must be unique.

  4. Update the data source and click CREATE DATA SOURCE

Renaming a Data Source

To rename an existing data source:

  1. Select the data source you want to rename

  2. Click Rename and edit the name/description

Deleting a Data Source

To delete a data source:

  1. Select the data source you want to delete

  2. Click DELETE

  3. Confirm you want to delete the data source

Note

It is not possible to delete a data source being used by an existing workload or template.

Creating PVCs in Advance

Add PVCs in advance to be used when creating a PVC-type data source via the NVIDIA Run:ai UI.

The actions taken by the admin depend on the scope (cluster, department, or project) intended for the PVC-type data source. Follow the steps below for each required scope:

Cluster Scope

  1. Locate the PVC in the NVIDIA Run:ai namespace (runai)

  2. Provide NVIDIA Run:ai with visibility and authorization to share the PVC with your selected scope by adding the following label: run.ai/cluster-wide: "true"

The PVC is now displayed for that scope in the list of existing PVCs.
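
For example, a minimal kubectl sketch, assuming an existing PVC named my-pvc (a placeholder) in the runai namespace:

# Allow NVIDIA Run:ai to share this PVC cluster-wide
kubectl label pvc my-pvc -n runai run.ai/cluster-wide="true"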

Note

This step is also relevant for creating the data source of type PVC via API

Department Scope

  1. Locate the PVC in the NVIDIA Run:ai namespace (runai)

  2. To authorize NVIDIA Run:ai to use the PVC, label it: run.ai/department: "<department-id>"

The PVC is now displayed for that scope in the list of existing PVCs.

Project Scope

Locate the PVC in the project’s namespace.

The PVC is now displayed for that scope in the list of existing PVCs.

Creating ConfigMaps in Advance

Add ConfigMaps in advance to be used when creating a ConfigMap-type data source via the NVIDIA Run:ai UI.

Cluster Scope

  1. Locate the ConfigMap in the NVIDIA Run:ai namespace (runai)

  2. To authorize NVIDIA Run:ai to use the ConfigMap, label it: run.ai/cluster-wide: "true"

  3. The ConfigMap must have a label of run.ai/resource: <resource-name>

The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.
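
For example, a minimal kubectl sketch, assuming an existing ConfigMap named my-config (a placeholder) in the runai namespace; the run.ai/resource value is shown with the same placeholder name:

# Allow NVIDIA Run:ai to share this ConfigMap cluster-wide
kubectl label configmap my-config -n runai \
  run.ai/cluster-wide="true" \
  run.ai/resource=my-config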

Department Scope

  1. Locate the ConfigMap in the NVIDIA Run:ai namespace (runai)

  2. To authorize NVIDIA Run:ai to use the ConfigMap, label it: run.ai/department: "<department-id>"

  3. The ConfigMap must have a label of run.ai/resource: <resource-name>

The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.

Project Scope

  1. Locate the ConfigMap in the project’s namespace

  2. The ConfigMap must have a label of run.ai/resource: <resource-name>

The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.

Using API

To view the available actions, go to the Data sources API reference.

Cluster System Requirements

The NVIDIA Run:ai cluster is a Kubernetes application. This section explains the required hardware and software system requirements for the NVIDIA Run:ai cluster.

The system requirements needed depend on where the control plane and cluster are installed. The following applies for Kubernetes only:

  • If you are installing the first cluster and control plane on the same Kubernetes cluster, , and are not required.

  • If you are installing the first cluster and control plane on separate Kubernetes clusters, the , and are required.

Hardware Requirements

The following hardware requirements are for the Kubernetes cluster nodes. By default, all NVIDIA Run:ai cluster services run on all available nodes. For production deployments, you may want to set node roles to separate system and worker nodes, reduce downtime, and save CPU cycles on expensive GPU machines.

Architecture

  • x86 – Supported for both Kubernetes and OpenShift deployments.

  • ARM – Supported for Kubernetes only. ARM is currently not supported for OpenShift.

NVIDIA Run:ai Cluster - System Nodes

This configuration is the minimum requirement for installing and using the NVIDIA Run:ai cluster.

Component
Required Capacity

Note

To designate nodes to NVIDIA Run:ai system services, follow the instructions as described in .

NVIDIA Run:ai Cluster - Worker Nodes

The NVIDIA Run:ai cluster supports x86 and ARM (see the below note) CPUs, and NVIDIA GPUs from the T, V, A, L, H, B, GH and GB architecture families. For the list of supported GPU models, see .

The following configuration represents the minimum hardware requirements for installing and operating the NVIDIA Run:ai cluster on worker nodes. Each node must meet these specifications:

Component
Required Capacity

Note

To designate nodes to NVIDIA Run:ai workloads, follow the instructions as described in .

Shared Storage

NVIDIA Run:ai workloads must be able to access data from any worker node in a uniform way, to access training data and code as well as save checkpoints, weights, and other machine-learning-related artifacts.

Typical protocols are Network File System (NFS) or Network-attached storage (NAS). The NVIDIA Run:ai cluster supports both.

Software Requirements

The following software requirements must be fulfilled on the Kubernetes cluster.

Operating System

  • Any Linux operating system supported by both Kubernetes and NVIDIA GPU Operator

  • NVIDIA Run:ai cluster on Google Kubernetes Engine (GKE) supports both Ubuntu and Container Optimized OS (COS). COS is supported only with NVIDIA GPU Operator 24.6 or newer, and NVIDIA Run:ai cluster version 2.19 or newer. NVIDIA Run:ai cluster on Oracle Kubernetes Engine (OKE) supports only Ubuntu.

  • Internal tests are being performed on Ubuntu 22.04 and CoreOS for OpenShift.

Kubernetes Distribution

NVIDIA Run:ai cluster requires Kubernetes. The following Kubernetes distributions are supported:

  • Vanilla Kubernetes

  • OpenShift Container Platform (OCP)

  • NVIDIA Base Command Manager (BCM)

  • Elastic Kubernetes Engine (EKS)

  • Google Kubernetes Engine (GKE)

  • Azure Kubernetes Service (AKS)

  • Oracle Kubernetes Engine (OKE)

  • Rancher Kubernetes Engine (RKE1)

  • Rancher Kubernetes Engine 2 (RKE2)

Note

The latest release of the NVIDIA Run:ai cluster supports Kubernetes 1.30 to 1.32 and OpenShift 4.14 to 4.18.

For existing Kubernetes clusters, see the following Kubernetes version support matrix for the latest NVIDIA Run:ai cluster releases:

NVIDIA Run:ai version
Supported Kubernetes versions
Supported OpenShift versions

For information on supported versions of managed Kubernetes, consult the release notes provided by your Kubernetes service provider. There, you can confirm the specific version of the underlying Kubernetes platform supported by the provider, ensuring compatibility with NVIDIA Run:ai. For an up-to-date end-of-life statement, see the Kubernetes or OpenShift release documentation.

Container Runtime

NVIDIA Run:ai supports the following container runtimes. Make sure your Kubernetes cluster is configured with one of these runtimes:

  • containerd (default in Kubernetes)

  • CRI-O (default in OpenShift)

Kubernetes Pod Security Admission

NVIDIA Run:ai supports the restricted policy for Kubernetes Pod Security Admission (PSA) on OpenShift only. Other Kubernetes distributions are only supported with the privileged policy.

For NVIDIA Run:ai on OpenShift to run with PSA restricted policy:

  • Label the runai namespace with the following labels (see the sketch after this list):

  • The workloads submitted through NVIDIA Run:ai should comply with the restrictions of PSA restricted policy. This can be enforced using Policies.
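
A minimal sketch of labeling the namespace, assuming the standard Kubernetes Pod Security Admission label keys are applied at the restricted level:

# Apply PSA labels to the runai namespace
kubectl label namespace runai \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted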

NVIDIA Run:ai Namespace

NVIDIA Run:ai must be installed in a namespace or project (OpenShift) called runai. Use the following commands to create the namespace/project:
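
For example, a minimal sketch; use the command matching your platform:

# Kubernetes
kubectl create namespace runai

# OpenShift
oc new-project runai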

Kubernetes Ingress Controller

NVIDIA Run:ai cluster requires an ingress controller to be installed on the Kubernetes cluster.

  • OpenShift, RKE and RKE2 come with a pre-installed ingress controller.

  • Internal tests are being performed on NGINX, Rancher NGINX, OpenShift Router, and Istio.

  • Make sure that a default ingress controller is set.

There are many ways to install and configure different ingress controllers. A simple example to install and configure the NGINX ingress controller using Helm:

Vanilla Kubernetes

Run the following commands:

  • For cloud deployments, both the internal IP and external IP are required.

  • For on-prem deployments, only the external IP is needed.
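
A minimal sketch using the ingress-nginx Helm chart; the IP values are placeholders per the notes above, and the chart values shown are an assumption about a typical setup:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# Expose the controller on the cluster IPs (for on-prem, the external IP alone is enough)
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  --set controller.service.externalIPs="{<INTERNAL-IP>,<EXTERNAL-IP>}"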

Managed Kubernetes (EKS, GKE, AKS)

Run the following commands:
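
A minimal sketch using the ingress-nginx Helm chart with its default LoadBalancer service; this assumes the managed cloud provisions the load balancer automatically:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace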

Oracle Kubernetes Engine (OKE)

Run the following commands:

Fully Qualified Domain Name (FQDN)

You must have a Fully Qualified Domain Name (FQDN) to install NVIDIA Run:ai control plane (ex: runai.mycorp.local). This cannot be an IP. The domain name must be accessible inside the organization's private network.

Wildcard FQDN for Inference (Optional)

In order to make inference serving endpoints available externally to the cluster, configure a wildcard DNS record (*.runai-inference.mycorp.local) that resolves to the cluster’s public IP address, or to the cluster's load balancer IP address in on-prem environments. This ensures each inference workload receives a unique subdomain under the wildcard domain.

TLS Certificate

You must have a TLS certificate that is associated with the FQDN for HTTPS access. Create a secret named runai-cluster-domain-tls-secret in the runai namespace, including the path to the TLS certificate (--cert) and its corresponding private key (--key), by running the following:
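
For example, a minimal sketch; the certificate and key paths are placeholders:

# Create the TLS secret used by the NVIDIA Run:ai cluster
kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
  --cert=<path-to-cert-file> \
  --key=<path-to-key-file>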

Local Certificate Authority

A local certificate authority serves as the root certificate for organizations that cannot use a publicly trusted certificate authority when external connections or standard HTTPS authentication are required.

In air-gapped environments, you must configure the public key of your local certificate authority. It must be installed in Kubernetes for the installation to succeed:

  1. Add the public key to the required namespace:

  2. When installing the cluster, make sure the following flag is added to the helm command: --set global.customCA.enabled=true. See Install cluster.

NVIDIA GPU Operator

The NVIDIA Run:ai cluster requires the NVIDIA GPU Operator to be installed on the Kubernetes cluster and supports versions 22.9 to 25.3. Information on how to download the GPU Operator for air-gapped installation can be found in the NVIDIA GPU Operator prerequisites.

See Installing the NVIDIA GPU Operator, then review the notes below:

  • Use the default gpu-operator namespace. Otherwise, you must specify the target namespace using the flag runai-operator.config.nvidiaDcgmExporter.namespace as described in customized cluster installation.

  • NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flag --set driver.enabled=false. DGX OS is one such example, as it comes bundled with NVIDIA drivers.

  • For distribution-specific additional instructions see below:

OpenShift Container Platform (OCP)

The Node Feature Discovery (NFD) Operator is a prerequisite for the NVIDIA GPU Operator in OpenShift. Install the NFD Operator using the Red Hat OperatorHub catalog in the OpenShift Container Platform web console. For more information, see Installing the Node Feature Discovery (NFD) Operator.

Elastic Kubernetes Service (EKS)
  • When setting up the cluster, do not install the NVIDIA device plugin (the NVIDIA GPU Operator installs it instead).

  • When using the eksctl tool to create a cluster, use the flag --install-nvidia-plugin=false to disable the installation (see the example below).

For GPU nodes, EKS uses an AMI that already contains the NVIDIA drivers. As such, you must use the GPU Operator flag --set driver.enabled=false.
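
A minimal sketch of both steps, assuming placeholder cluster and nodegroup names and the public NVIDIA helm chart (adjust the instance type and versions to your environment):

# Create a GPU nodegroup without the NVIDIA device plugin (names and instance type are placeholders)
eksctl create nodegroup --cluster my-eks-cluster --name gpu-nodes \
    --node-type p3.2xlarge --install-nvidia-plugin=false

# Install the NVIDIA GPU Operator without drivers, since the EKS GPU AMI already ships them
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set driver.enabled=false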

Google Kubernetes Engine (GKE)

Before installing the GPU Operator:

  1. Create the gpu-operator namespace by running:

  2. Create the following file:

  3. Run:

Rancher Kubernetes Engine 2 (RKE2)

Make sure to specify the CONTAINERD_CONFIG option exactly as outlined in the documentation and custom configuration guide, using the path /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl. Do not create the file manually if it does not already exist. The GPU Operator handles this configuration during deployment. A sketch of the relevant helm flags is shown below.
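
For illustration only, assuming the GPU Operator is installed from the public NVIDIA helm chart, the option can be passed as a toolkit environment variable (check the GPU Operator RKE2 guide for the full set of required settings):

helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl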

Oracle Kubernetes Engine (OKE)
  • During cluster setup, create a nodepool, and set initial_node_labels to include oci.oraclecloud.com/disable-gpu-device-plugin=true, which disables the NVIDIA GPU device plugin.

  • For GPU nodes, OKE defaults to Oracle Linux, which is incompatible with NVIDIA drivers. To resolve this, use a custom Ubuntu image instead.

For troubleshooting information, see the NVIDIA GPU Operator Troubleshooting Guide.

Prometheus

Note

Installing Prometheus applies for Kubernetes only.

NVIDIA Run:ai cluster requires Prometheus to be installed on the Kubernetes cluster.

  • OpenShift comes pre-installed with Prometheus.

  • For RKE2, see the Enable Monitoring instructions to install Prometheus.

There are many ways to install Prometheus. A simple example is to install the community Kube-Prometheus Stack using helm; run the following commands:

Additional Software Requirements

Additional NVIDIA Run:ai capabilities, such as distributed training and inference, require additional Kubernetes applications (frameworks) to be installed on the cluster.

Distributed Training

Distributed training enables training of AI models over multiple nodes. This requires installing a distributed training framework on the cluster. The following frameworks are supported: TensorFlow, PyTorch, XGBoost, MPI v2, and JAX.

There are several ways to install each framework. A simple installation method is the Kubeflow Training Operator, which includes TensorFlow, PyTorch, XGBoost, and JAX.

It is recommended to use Kubeflow Training Operator v1.9.2 and MPI Operator v0.6.0 or later for compatibility with advanced workload capabilities, such as stopping a workload and scheduling rules.

  • To install the Kubeflow Training Operator for TensorFlow, PyTorch, XGBoost and JAX frameworks, run the following command:

  • To install the MPI Operator for MPI v2, run the following command:

Note

If you require both the MPI Operator and Kubeflow Training Operator, follow the steps below:

  • Install the Kubeflow Training Operator as described above.

  • Disable and delete MPI v1 in the Kubeflow Training Operator by running:

  • Install the MPI Operator as described above.

Inference

Inference enables serving of AI models. This requires the Knative Serving framework to be installed on the cluster and supports Knative versions 1.11 to 1.16. Follow the Installing Knative instructions. Once installed, follow the steps below.

  1. Configure Knative to use the NVIDIA Run:ai Scheduler and other features using the following command:

  2. Optional: If inference serving endpoints should be accessible outside the cluster:

    1. Patch the Knative service and assign the DNS for inference workloads to the Knative ingress service:

    2. Follow the Configure external domain encryption instructions to configure TLS for the Knative ingress.

Knative Autoscaling

NVIDIA Run:ai allows autoscaling a deployment according to the following metrics:

  • Latency (milliseconds)

  • Throughput (requests/sec)

  • Concurrency (requests)

Using a custom metric (for example, Latency) requires installing the Kubernetes Horizontal Pod Autoscaler (HPA). Use the following command to install it. Make sure to update the {VERSION} in the command below with a supported Knative version.

CPU: 10 cores
Memory: 20GB
Disk space: 50GB

CPU: 2 cores
Memory: 4GB


pod-security.kubernetes.io/audit=privileged
pod-security.kubernetes.io/enforce=privileged
pod-security.kubernetes.io/warn=privileged
kubectl create ns runai
oc new-project runai
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
    --namespace nginx-ingress --create-namespace \
    --set controller.kind=DaemonSet \
    --set controller.service.externalIPs="{<INTERNAL-IP>,<EXTERNAL-IP>}" # Replace <INTERNAL-IP> and <EXTERNAL-IP> with the internal and external IP addresses of one of the nodes
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
    --namespace nginx-ingress --create-namespace
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
    --namespace ingress-nginx --create-namespace \
    --set controller.service.annotations.oci.oraclecloud.com/load-balancer-type=nlb \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/is-preserve-source=True \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/security-list-management-mode=None \
    --set controller.service.externalTrafficPolicy=Local \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/subnet=<SUBNET-ID> # Replace <SUBNET-ID> with the subnet ID of one of your cluster
kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
    --cert /path/to/fullchain.pem  \ # Replace /path/to/fullchain.pem with the actual path to your TLS certificate
    --key /path/to/private.pem # Replace /path/to/private.pem with the actual path to your private key
kubectl -n runai create secret generic runai-ca-cert \
    --from-file=runai-ca.pem=<ca_bundle_path>
kubectl label secret runai-ca-cert -n runai run.ai/cluster-wide=true run.ai/name=runai-ca-cert --overwrite
oc -n runai create secret generic runai-ca-cert \
    --from-file=runai-ca.pem=<ca_bundle_path>
oc -n openshift-monitoring create secret generic runai-ca-cert \
    --from-file=runai-ca.pem=<ca_bundle_path>
oc label secret runai-ca-cert -n runai run.ai/cluster-wide=true run.ai/name=runai-ca-cert --overwrite
kubectl create ns gpu-operator
#resourcequota.yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gcp-critical-pods
  namespace: gpu-operator
spec:
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
kubectl apply -f resourcequota.yaml
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
    -n monitoring --create-namespace --set grafana.enabled=false
kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.2"
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
kubectl patch deployment training-operator -n kubeflow --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--enable-scheme=tfjob", "--enable-scheme=pytorchjob", "--enable-scheme=xgboostjob", "--enable-scheme=jaxjob"]}]'
kubectl delete crd mpijobs.kubeflow.org
kubectl patch configmap/config-autoscaler \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"enable-scale-to-zero":"true"}}' && \
kubectl patch configmap/config-features \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"kubernetes.podspec-schedulername":"enabled","kubernetes.podspec-nodeselector":"enabled","kubernetes.podspec-affinity":"enabled","kubernetes.podspec-tolerations":"enabled","kubernetes.podspec-volumes-emptydir":"enabled","kubernetes.podspec-securitycontext":"enabled","kubernetes.containerspec-addcapabilities":"enabled","kubernetes.podspec-persistent-volume-claim":"enabled","kubernetes.podspec-persistent-volume-write":"enabled","multi-container":"enabled","kubernetes.podspec-init-containers":"enabled","kubernetes.podspec-fieldref":"enabled"}}'
# Replace <runai-inference.mycorp.local> with your FQDN for Inference (without the wildcard)
kubectl patch configmap/config-domain \
   --namespace knative-serving \
   --type merge \
   --patch '{"data":{"<runai-inference.mycorp.local>":""}}'
kubectl apply -f https://github.com/knative/serving/releases/download/knative-{VERSION}/serving-hpa.yaml

NVIDIA Run:ai System Monitoring

This section explains how to configure NVIDIA Run:ai to generate health alerts and to connect these alerts to alert-management systems within your organization. Alerts are generated for NVIDIA Run:ai clusters.

Alert Infrastructure

NVIDIA Run:ai uses Prometheus for externalizing metrics and providing visibility to end-users. The NVIDIA Run:ai Cluster installation includes Prometheus or can connect to an existing Prometheus instance used in your organization. The alerts are based on the Prometheus AlertManager. Once installed, it is enabled by default.

This document explains how to:

  • Configure alert destinations - triggered alerts send data to specified destinations

  • Understand the out-of-the-box cluster alerts, provided by NVIDIA Run:ai

  • Add additional custom alerts

Prerequisites

  • A Kubernetes cluster with the necessary permissions

  • Up and running NVIDIA Run:ai environment, including Prometheus Operator

  • kubectl command-line tool installed and configured to interact with the cluster

Setup

Use the steps below to set up monitoring alerts.

Validating Prometheus Operator Installed

  1. Verify that the Prometheus Operator Deployment is running. Copy the following command and paste it in your terminal, where you have access to the Kubernetes cluster:

kubectl get deployment kube-prometheus-stack-operator -n monitoring

In your terminal, you can see an output indicating the deployment's status, including the number of replicas and their current state.

  2. Verify that Prometheus instances are running. Copy the following command and paste it in your terminal:

kubectl get prometheus -n runai

You can see the Prometheus instance(s) listed along with their status.

Enabling Prometheus AlertManager

In each of the steps in this section, copy the content of the code snippet to a new YAML file (e.g., step1.yaml).

  1. Create the AlertManager CustomResource to enable AlertManager. Copy the following snippet to a new YAML file (e.g., step1.yaml):

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: runai
  namespace: runai
spec:
  replicas: 1
  alertmanagerConfigSelector:
    matchLabels:
      alertmanagerConfig: runai

  2. Apply the YAML file to the cluster:

kubectl apply -f step1.yaml
  3. Copy the following command to your terminal to validate that the AlertManager instance has started:

kubectl get alertmanager -n runai
  4. Copy the following command to your terminal to validate that the Prometheus operator has created a Service for AlertManager:

kubectl get svc alertmanager-operated -n runai

Configuring Prometheus to Send Alerts

  1. Open the terminal on your local machine or another machine that has access to your Kubernetes cluster

  2. Copy and paste the following command in your terminal to edit the Prometheus configuration for the runai Namespace:

kubectl edit prometheus runai -n runai

This command opens the Prometheus configuration file in your default text editor (usually vi or nano).

  3. Copy and paste the following text to change the configuration file:

alerting:  
   alertmanagers:  
      - namespace: runai  
        name: alertmanager-operated  
        port: web
  4. Save the changes and exit the text editor.

Note

To save changes using vi, type :wq and press Enter. The changes are applied to the Prometheus configuration in the cluster.

Alert Destinations

Set out below are the various alert destinations.

Configuring AlertManager for Custom Email Alerts

In each step, copy the contents of the code snippets to a new file and apply it to the cluster using kubectl apply -f.

  1. Add your smtp password as a secret:

apiVersion: v1  
kind: Secret  
metadata:  
   name: alertmanager-smtp-password  
   namespace: runai  
stringData:
   password: "your_smtp_password"
  2. Replace the relevant smtp details with your own, then apply the alertmanagerconfig using kubectl apply.

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: runai
  namespace: runai
  labels:
    alertmanagerConfig: runai
spec:
  route:
    continue: true
    groupBy:
    - alertname
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 1h
    matchers:
    - matchType: =~
      name: alertname
      value: Runai.*
    receiver: email
  receivers:
  - name: 'email'
    emailConfigs:
    - to: '<destination_email_address>'
      from: '<from_email_address>'
      smarthost: 'smtp.gmail.com:587'
      authUsername: '<smtp_server_user_name>'
      authPassword:
        name: alertmanager-smtp-password
        key: password
  3. Save and exit the editor. The configuration is automatically reloaded.

Third-Party Alert Destinations

Prometheus AlertManager provides a structured way to connect to alert-management systems. There are built-in plugins for popular systems such as PagerDuty and OpsGenie, including a generic Webhook.

Example: Integrating NVIDIA Run:ai with a Webhook

  1. Use webhook.site to get a unique URL.

  2. Use the upgrade cluster instructions to modify the values file: Edit the values file to add the following, and replace <WEB-HOOK-URL> with the URL from webhook.site:

kube-prometheus-stack:  
  ...  
  alertmanager:  
    enabled: true  
    config:  
      global:  
        resolve_timeout: 5m  
      receivers:  
      - name: "null"  
      - name: webhook-notifications  
        webhook_configs:  
          - url: <WEB-HOOK-URL>  
            send_resolved: true  
      route:  
        group_by:  
        - alertname  
        group_interval: 5m  
        group_wait: 30s  
        receiver: 'null'  
        repeat_interval: 10m  
        routes:  
        - receiver: webhook-notifications
  3. Verify that you are receiving alerts on the webhook.site page, in the left pane.

Built-in Alerts

An NVIDIA Run:ai cluster comes with several built-in alerts. Each alert notifies on a specific functionality of an NVIDIA Run:ai entity. There is also a single, inclusive alert: NVIDIA Run:ai Critical Problems, which aggregates all component-based alerts into a single cluster health test.

Runai agent cluster info push rate low

Meaning

The cluster-sync Pod in the runai namespace might not be functioning properly

Impact

Possible impact - no info/partial info from the cluster is being synced back to the control-plane

Severity

Critical

Diagnosis

Run kubectl get pod -n runai to see whether the cluster-sync pod is running.

Troubleshooting/Mitigation

To diagnose issues with the cluster-sync pod, follow these steps:

  1. Paste the following command into your terminal to receive detailed information about the cluster-sync deployment: kubectl describe deployment cluster-sync -n runai

  2. Check the logs: Use the following command to view the logs of the cluster-sync deployment: kubectl logs deployment/cluster-sync -n runai

  3. Analyze the Logs and Pod Details: From the information provided by the logs and the deployment details, attempt to identify the reason why the cluster-sync pod is not functioning correctly

  4. Check Connectivity: Ensure there is a stable network connection between the cluster and the NVIDIA Run:ai Control Plane. A connectivity issue may be the root cause of the problem.

  5. Contact Support: If the network connection is stable and you are still unable to resolve the issue, contact NVIDIA Run:ai support for further assistance

Runai agent pull rate low

Meaning

The runai-agent pod may be too loaded, is slow in processing data (possible in very big clusters), or the runai-agent pod itself in the runai namespace may not be functioning properly.

Impact

Possible impact - no info/partial info from the control-plane is being synced in the cluster

Severity

Critical

Diagnosis

Run kubectl get pod -n runai and check whether the runai-agent pod is running.

Troubleshooting/Mitigation

To diagnose issues with the runai-agent pod, follow these steps:

  1. Describe the deployment: Run the following command to get detailed information about the runai-agent deployment: kubectl describe deployment runai-agent -n runai

  2. Check the logs: Use the following command to view the logs of the runai-agent deployment: kubectl logs deployment/runai-agent -n runai

  3. Analyze the Logs and Pod Details: From the information provided by the logs and the deployment details, attempt to identify the reason why the runai-agent pod is not functioning correctly. There may be a connectivity issue with the control plane.

  4. Check Connectivity: Ensure there is a stable network connection between the runai-agent and the control plane. A connectivity issue may be the root cause of the problem.

  5. Consider Cluster Load: If the runai-agent appears to be functioning properly but the cluster is very large and heavily loaded, it may take more time for the agent to process data from the control plane.

  6. Adjust the alert threshold: If the cluster load is causing the alert to fire, you can adjust the threshold at which the alert triggers. The default value is 0.05. You can try changing it to a lower value (e.g., 0.045 or 0.04). To edit the value, paste the following in your terminal: kubectl edit runaiconfig -n runai. In the editor, navigate to spec -> prometheus -> agentPullPushRateMinForAlert. If the agentPullPushRateMinForAlert value does not exist, add it under spec -> prometheus (see the snippet below).
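
For reference, a minimal sketch of the resulting runaiconfig fragment (the 0.045 value is only an example):

spec:
  prometheus:
    agentPullPushRateMinForAlert: 0.045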

Runai container memory usage critical

Meaning

Runai container is using more than 90% of its Memory limit

Impact

The container might run out of memory and crash.

Severity

Critical

Diagnosis

Calculate the memory usage by pasting the following into your terminal: container_memory_usage_bytes{namespace=~"runai

Troubleshooting/Mitigation

Add more memory resources to the container. If the issue persists, contact NVIDIA Run:ai

Runai container memory usage warning

Meaning

Runai container is using more than 80% of its memory limit

Impact

The container might run out of memory and crash

Severity

Warning

Diagnosis

Calculate the memory usage by pasting the following into your terminal: container_memory_usage_bytes{namespace=~"runai

Troubleshooting/Mitigation

Add more memory resources to the container. If the issue persists, contact NVIDIA Run:ai

Runai container restarting

Meaning

Runai container has restarted more than twice in the last 10 min

Impact

The container might become unavailable and impact the NVIDIA Run:ai system

Severity

Warning

Diagnosis

To diagnose the issue and identify the problematic pods, paste the following into your terminal: kubectl get pods -n runai and kubectl get pods -n runai-backend. One or more of the pods has a restart count >= 2.

Troubleshooting/Mitigation

Paste the following into your terminal: kubectl logs -n NAMESPACE POD_NAME. Replace NAMESPACE and POD_NAME with the relevant pod information from the previous step. Check the logs for any standout issues and verify that the container has sufficient resources. If you need further assistance, contact NVIDIA Run:ai.

Runai CPU usage warning

Meaning

runai container is using more than 80% of its CPU limit

Impact

This might cause slowness in the operation of certain NVIDIA Run:ai features.

Severity

Warning

Diagnosis

Paste the following query to your terminal in order to calculate the CPU usage: rate(container_cpu_usage_seconds_total{namespace=~"runai

Troubleshooting/Mitigation

Add more CPU resources to the container. If the issue persists, please contact NVIDIA Run:ai.

Runai critical problem

Meaning

One of the critical NVIDIA Run:ai alerts is currently active

Impact

Impact is based on the active alert

Severity

Critical

Diagnosis

Check NVIDIA Run:ai alerts in Prometheus to identify any active critical alerts

Unknown state alert for a node

Meaning

The Kubernetes node hosting GPU workloads is in an unknown state, and its health and readiness cannot be determined.

Impact

This may interrupt GPU workload scheduling and execution.

Severity

Critical - Node is either unschedulable or has unknown status. The node is in one of the following states:

  • Ready=Unknown: The control plane cannot communicate with the node.

  • Ready=False: The node is not healthy.

  • Unschedulable=True: The node is marked as unschedulable.

Diagnosis

Check the node's status using kubectl describe node, verify Kubernetes API server connectivity, and inspect system logs for GPU-specific or node-level errors.

Low memory node alert

Meaning

The Kubernetes node hosting GPU workloads has insufficient memory to support current or upcoming workloads.

Impact

GPU workloads may fail to schedule, experience degraded performance, or crash due to memory shortages, disrupting dependent applications.

Severity

Critical - Node is using more than 90% of its memory. Warning - Node is using more than 80% of its memory.

Diagnosis

Use kubectl top node to assess memory usage, identify memory-intensive pods, consider resizing the node or optimizing memory usage in affected pods.

Runai daemonSet rollout stuck / Runai DaemonSet unavailable on nodes

Meaning

There are currently 0 available pods for the runai daemonset on the relevant node

Impact

No fractional GPU workloads support

Severity

Critical

Diagnosis

Paste the following command into your terminal: kubectl get daemonset -n runai-backend. In the result of this command, identify the daemonset(s) that don’t have any running pods.

Troubleshooting/Mitigation

Paste the following command into your terminal, where daemonsetX is the problematic daemonset from the previous step: kubectl describe daemonsetX -n runai. Look for the specific error that prevents it from creating pods. Possible reasons might be:

  • Node Resource Constraints: The nodes in the cluster may lack sufficient resources (CPU, memory, etc.) to accommodate new pods from the daemonset.

  • Node Selector or Affinity Rules: The daemonset may have node selector or affinity rules that are not matching with any nodes currently available in the cluster, thus preventing pod creation.

Runai deployment insufficient replicas / Runai deployment no available replicas / RunaiDeploymentUnavailableReplicas

Meaning

Runai deployment has one or more unavailable pods

Impact

When this happens, there may be scale issues. Additionally, new versions cannot be deployed, potentially resulting in missing features.

Severity

Critical

Diagnosis

Paste the following commands into your terminal to get the status of the deployments in the runai and runai-backend namespaces: kubectl get deployment -n runai and kubectl get deployment -n runai-backend. Identify any deployments that have missing pods. Look for discrepancies in the DESIRED and AVAILABLE columns. If the number of AVAILABLE pods is less than the DESIRED pods, it indicates that there are missing pods.

Troubleshooting/Mitigation

  • Paste the following commands into your terminal to receive detailed information about the problematic deployment: kubectl describe deployment <DEPLOYMENT_NAME> -n runai or kubectl describe deployment <DEPLOYMENT_NAME> -n runai-backend

  • Paste the following commands into your terminal to check the replicaset details associated with the deployment: kubectl describe replicaset <REPLICASET_NAME> -n runai or kubectl describe replicaset <REPLICASET_NAME> -n runai-backend

  • Paste the following commands into your terminal to retrieve the logs for the deployment and identify any errors or issues: kubectl logs deployment/<DEPLOYMENT_NAME> -n runai or kubectl logs deployment/<DEPLOYMENT_NAME> -n runai-backend

  • From the logs and the detailed information provided by the describe commands, analyze the reasons why the deployment is unable to create pods. Look for common issues such as:

    • Resource constraints (CPU, memory)

    • Misconfigured deployment settings or replicasets

    • Node selector or affinity rules preventing pod scheduling

    If the issue persists, contact NVIDIA Run:ai.

Runai project controller reconcile failure

Meaning

The project-controller in runai namespace had errors while reconciling projects

Impact

Some projects might not be in the “Ready” state. This means that they are not fully operational and may not have all the necessary components running or configured correctly.

Severity

Critical

Diagnosis

Retrieve the logs for the project-controller deployment by pasting the following command in your terminal: kubectl logs deployment/project-controller -n runai. Carefully examine the logs for any errors or warning messages. These logs help you understand what might be going wrong with the project controller.

Troubleshooting/Mitigation

Once errors in the log have been identified, follow these steps to mitigate the issue. The error messages in the logs should provide detailed information about the problem.

  1. Read through them to understand the nature of the issue. If the logs indicate which project failed to reconcile, you can further investigate by checking the status of that specific project.

  2. Run the following command, replacing <PROJECT_NAME> with the name of the problematic project: kubectl get project <PROJECT_NAME> -o yaml

  3. Review the status section in the YAML output. This section describes the current state of the project and provides insights into what might be causing the failure. If the issue persists, contact NVIDIA Run:ai.

Runai StatefulSet insufficient replicas / Runai StatefulSet no available replicas

Meaning

Runai statefulset has no available pods

Impact

Absence of metrics; database unavailability

Severity

Critical

Diagnosis

To diagnose the issue, follow these steps:

  1. Check the status of the stateful sets in the runai-backend namespace by running the following command: kubectl get statefulset -n runai-backend

  2. Identify any stateful sets that have no running pods. These are the ones that might be causing the problem.

Troubleshooting/Mitigation

Once you've identified the problematic stateful sets, follow these steps to mitigate the issue:

  1. Describe the stateful set to get detailed information on why it cannot create pods. Replace X with the name of the stateful set: kubectl describe statefulset X -n runai-backend

  2. Review the description output to understand the root cause of the issue. Look for events or error messages that explain why the pods are not being created.

  3. If you're unable to resolve the issue based on the information gathered, contact NVIDIA Run:ai support for further assistance.

Adding a Custom Alert

You can add custom alerts in addition to those provided by NVIDIA Run:ai. Alerts are triggered by using the Prometheus query language with any NVIDIA Run:ai metric.

To create an alert, follow these steps using Prometheus query language with NVIDIA Run:ai Metrics:

  • Modify Values File: Use the upgrade cluster instructions to modify the values file.

  • Add Alert Structure: Incorporate alerts according to the structure outlined below. Replace placeholders <ALERT-NAME>, <ALERT-SUMMARY-TEXT>, <PROMQL-EXPRESSION>, <optional: duration s/m/h>, and <critical/warning> with appropriate values for your alert, as described below.

kube-prometheus-stack:  
   additionalPrometheusRulesMap:  
     custom-runai:  
       groups:  
       - name: custom-runai-rules  
         rules:  
         - alert: <ALERT-NAME>  
           annotations:  
             summary: <ALERT-SUMMARY-TEXT>  
           expr:  <PROMQL-EXPRESSION>  
           for: <optional: duration s/m/h>  
           labels:  
             severity: <critical/warning>
  • <ALERT-NAME>: Choose a descriptive name for your alert, such as HighCPUUsage or LowMemory.

  • <ALERT-SUMMARY-TEXT>: Provide a brief summary of what the alert signifies, for example, High CPU usage detected or Memory usage below threshold.

  • <PROMQL-EXPRESSION>: Construct a Prometheus query (PROMQL) that defines the conditions under which the alert should trigger. This query should evaluate to a boolean value (1 for alert, 0 for no alert).

  • <optional: duration s/m/h>: Optionally, specify a duration in seconds (s), minutes (m), or hours (h) that the alert condition should persist before triggering an alert. If not specified, the alert triggers as soon as the condition is met.

  • <critical/warning>: Assign a severity level to the alert, indicating its importance. Choose between critical for severe issues requiring immediate attention, or warning for less critical issues that still need monitoring.

You can find an example in the Prometheus documentation.
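
For illustration, a hypothetical custom alert (not one of the built-in alerts) that fires when the average CPU usage of containers in the runai namespace stays above 80% for 5 minutes, using the standard cAdvisor metric container_cpu_usage_seconds_total:

kube-prometheus-stack:
  additionalPrometheusRulesMap:
    custom-runai:
      groups:
      - name: custom-runai-rules
        rules:
        - alert: RunaiNamespaceHighCpuUsage
          annotations:
            summary: High CPU usage detected in the runai namespace
          expr: avg(rate(container_cpu_usage_seconds_total{namespace="runai"}[5m])) > 0.8
          for: 5m
          labels:
            severity: warning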

Metrics and Telemetry

Metrics are numeric measurements recorded over time and emitted from the NVIDIA Run:ai cluster. Telemetry is a numeric measurement recorded in real time at the moment it is emitted from the NVIDIA Run:ai cluster.

Scopes

NVIDIA Run:ai provides a control-plane API that supports and aggregates analytics at the following levels (scopes).

Level
Description

Cluster

A cluster is a set of node pools and nodes. With Cluster metrics, metrics are aggregated at the Cluster level. In the NVIDIA Run:ai user interface, metrics are available in the Overview dashboard.

Node

Data is aggregated at the node level.

Node pool

Data is aggregated at the node pool level.

Workload

Data is aggregated at the workload level. In some workloads, e.g. with distributed workloads, these metrics aggregate data from all worker pods.

Pod

The basic unit of execution.

Project

The basic organizational unit. Projects are the tool to implement resource allocation policies as well as the segregation between different initiatives.

Department

Departments are a grouping of projects.
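
As an illustration only, a cluster-scope metrics query might look like the following; the exact path and query parameters are assumptions, so consult the NVIDIA Run:ai API reference for the authoritative definition:

curl -H "Authorization: Bearer $TOKEN" \
  "https://<control-plane-fqdn>/api/v1/clusters/<cluster-uuid>/metrics?metricType=GPU_UTILIZATION&start=2024-01-01T00:00:00Z&end=2024-01-02T00:00:00Z"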

Supported Metrics

Metric name in API
Applicable API endpoint
Metric name in UI per grid
Applicable UI grid

ALLOCATED_GPU

  • GPU devices (allocated)

  • Allocated GPUs

AVG_WORKLOAD_WAIT_TIME

CPU_LIMIT_CORES

CPU limit

CPU_MEMORY_LIMIT_BYTES

CPU memory limit

CPU_MEMORY_REQUEST_BYTES

CPU memory request

CPU_MEMORY_USAGE_BYTES

CPU memory usage

CPU_MEMORY_UTILIZATION

CPU memory utilization

CPU_REQUEST_CORES

CPU request

CPU_USAGE_CORES

CPU usage

CPU_UTILIZATION

  • CPU compute utilization

  • CPU utilization


GPU_ALLOCATION

GPU devices (allocated)

GPU_MEMORY_REQUEST_BYTES

GPU memory request

GPU_MEMORY_USAGE_BYTES

GPU memory usage

GPU_MEMORY_USAGE_BYTES_PER_GPU

GPU memory usage per GPU

GPU_MEMORY_UTILIZATION

GPU memory utilization

GPU_MEMORY_UTILIZATION_PER_GPU

GPU memory utilization per GPU

GPU_QUOTA

Quota

GPU_UTILIZATION

GPU compute utilization

GPU_UTILIZATION_PER_GPU

GPU utilization per GPU

TOTAL_GPU

  • GPU devices total

  • Total GPUs

TOTAL_GPU_NODES

GPU_UTILIZATION_DISTRIBUTION

GPU utilization distribution

UNALLOCATED_GPU

  • GPU devices (unallocated)

  • Unallocated GPUs

CPU_QUOTA_MILLICORES

CPU_MEMORY_QUOTA_MB

CPU_ALLOCATION_MILLICORES

CPU_MEMORY_ALLOCATION_MB

POD_COUNT

RUNNING_POD_COUNT

Advanced Metrics

NVIDIA provides extended metrics, as shown here. To enable these metrics, contact NVIDIA Run:ai customer support.

Metric name in API
Applicable API endpoint
Metric name in UI
Applicable UI table

GPU_FP16_ENGINE_ACTIVITY_PER_GPU

GPU FP16 engine activity

GPU_FP32_ENGINE_ACTIVITY_PER_GPU

GPU FP32 engine activity

GPU_FP64_ENGINE_ACTIVITY_PER_GPU

GPU FP64 engine activity

GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU

Graphics engine activity

GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU

GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU

GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU

GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU

GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU

GPU_SM_ACTIVITY_PER_GPU

GPU SM activity

GPU_SM_OCCUPANCY_PER_GPU

GPU SM occupancy

GPU_TENSOR_ACTIVITY_PER_GPU

GPU tensor activity

Supported Telemetry

Metric
Applicable API endpoint
Metric name in UI
Applicable UI table

WORKLOADS_COUNT

ALLOCATED_GPUS

Allocated GPUs

GPU_allocation

READY_GPU_NODES

Ready / Total GPU nodes

READY_GPUS

Ready / Total GPU devices

TOTAL_GPU_NODES

Ready / Total GPU nodes

TOTAL_GPUS

Ready / Total GPU devices

IDLE_ALLOCATED_GPUS

Idle allocated GPU devices

FREE_GPUS

Free GPU devices

TOTAL_CPU_CORES

CPU (Cores)

USED_CPU_CORES

ALLOCATED_CPU_CORES

Allocated CPU cores

TOTAL_GPU_MEMORY_BYTES

GPU memory

USED_GPU_MEMORY_BYTES

Used GPU memory

TOTAL_CPU_MEMORY_BYTES

CPU memory

USED_CPU_MEMORY_BYTES

Used CPU memory

ALLOCATED_CPU_MEMORY_BYTES

Allocated CPU memory

GPU_QUOTA

GPU quota

CPU_QUOTA

MEMORY_QUOTA

GPU_ALLOCATION_NON_PREEMPTIBLE

CPU_ALLOCATION_NON_PREEMPTIBLE

MEMORY_ALLOCATION_NON_PREEMPTIBLE

Each of the metrics and telemetry values above applies to one or more of the following API endpoints and UI grids: Clusters, Node pools, Nodes, Workloads, Workloads per pod, Pods, Projects, Departments, the Overview dashboard, and Quota management.

Roles

This section explains the available roles in the NVIDIA Run:ai platform.

A role is a set of permissions that can be assigned to a subject in a scope. A permission is a set of actions (View, Edit, Create and Delete) over a NVIDIA Run:ai entity (e.g. projects, workloads, users).

Roles Table

The Roles table can be found under Access in the NVIDIA Run:ai platform.

The Roles table displays a list of roles available to users in the NVIDIA Run:ai platform. Both predefined and custom roles will be displayed in the table.

The Roles table consists of the following columns:

Column
Description

Role

The name of the role

Created by

The name of the role creator

Creation time

The timestamp when the role was created

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

Reviewing a Role

  1. To review a role, click the role name in the table

  2. In the role form review the following:

    • Role name - The name of the role

    • Entity - A system-managed object that can be viewed, edited, created or deleted by a user based on their assigned role and scope

    • Actions - The actions that the role assignee is authorized to perform for each entity

      • View - If checked, an assigned user with this role can view instances of this type of entity within their defined scope

      • Edit - If checked, an assigned user with this role can change the settings of an instance of this type of entity within their defined scope

      • Create - If checked, an assigned user with this role can create new instances of this type of entity within their defined scope

      • Delete - If checked, an assigned user with this role can delete instances of this type of entity within their defined scope

Roles in NVIDIA Run:ai

NVIDIA Run:ai supports the following roles and their permissions. Under each role is a detailed list of the actions that the role assignee is authorized to perform for each entity.

Compute resource administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Credentials administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Data source administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Data volume administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Department administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Department viewer
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Editor
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Environment administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

L1 researcher
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

L2 researcher
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

ML engineer
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Research manager
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

System administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Template administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Viewer
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Permitted workloads

When assigning a role with any combination of the View, Edit, Create and Delete permissions for workloads, the subject has permission to manage not only NVIDIA Run:ai workloads (Workspace, Training, Inference), but also the following 3rd party workloads:

  • k8s: StatefulSet

  • k8s: ReplicaSet

  • k8s: Pod

  • k8s: Deployment

  • batch: Job

  • batch: CronJob

  • machinelearning.seldon.io: SeldonDeployment

  • kubevirt.io: VirtualMachineInstance

  • kubeflow.org: TFJob

  • kubeflow.org: PyTorchJob

  • kubeflow.org: XGBoostJob

  • kubeflow.org: MPIJob

  • kubeflow.org: Notebook

  • kubeflow.org: ScheduledWorkflow

  • amlarc.azureml.com: AmlJob

  • serving.knative.dev: Service

  • workspace.devfile.io: DevWorkspace

  • ray.io: RayCluster

  • ray.io: RayJob

  • ray.io: RayService

  • tekton.dev: TaskRun

  • tekton.dev: PipelineRun

  • argoproj.io: Workflow

Using API

Go to the Roles API reference to view the available actions.
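
For example, listing the available roles with an API token might look like the following; the endpoint path is an assumption here, so confirm it in the API reference:

curl -H "Authorization: Bearer $TOKEN" \
  "https://<control-plane-fqdn>/api/v1/authorization/roles"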

deprecated

Policy YAML Reference

A workload policy is an end-to-end solution for AI managers and administrators to control and simplify how workloads are submitted, setting best practices, enforcing limitations, and standardizing processes for AI projects within their organization.

This article explains the policy YAML fields and the possible rules and defaults that can be set for each field.
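
As a minimal sketch of the overall YAML shape, a policy combines defaults (values pre-filled for users) with rules (constraints on what users may submit). The field names below come from the reference tables in this article; the specific rule keys shown (max, required) are examples and should be checked against the policy reference:

defaults:
  imagePullPolicy: ifNotPresent
  compute:
    cpuCoreRequest: 0.5
rules:
  compute:
    gpuDeviceRequest:
      max: 2
  image:
    required: true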

Policy YAML Fields - Reference Table

The policy fields are structured in a similar format to the workload API fields. The following tables are a structured guide designed to help you understand and configure policies in YAML format. They provide the fields, descriptions, defaults, and rules for each workload type.

Click the link to view the value type of each field.

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

args

When set, contains the arguments sent along with the command. These override the entry point of the image in the created workload

  • Workspace

  • Training

command

A command to serve as the entry point of the container running the workspace

  • Workspace

  • Training

createHomeDir

Instructs the system to create a temporary home directory for the user within the container. Data stored in this directory is not saved when the container exits. When the runAsUser flag is set to true, this flag defaults to true as well

  • Workspace

  • Training

environmentVariables

Set of environmentVariables to populate the container running the workspace

  • Workspace

  • Training

image

Specifies the image to use when creating the container running the workload

  • Workspace

  • Training

imagePullPolicy

Specifies the pull policy of the image when starting a container running the created workload. Options are: always, ifNotPresent, or never

  • Workspace

  • Training

workingDir

Container’s working directory. If not specified, the container runtime default is used, which might be configured in the container image

  • Workspace

  • Training

nodeType

Nodes (machines) or a group of nodes on which the workload runs

  • Workspace

  • Training

nodePools

A prioritized list of node pools for the scheduler to run the workspace on. The scheduler always tries to use the first node pool before moving to the next one when the first is not available.

  • Workspace

  • Training

annotations

Set of annotations to populate into the container running the workspace

  • Workspace

  • Training

labels

Set of labels to populate into the container running the workspace

  • Workspace

  • Training

terminateAfterPreemption

Indicates whether the job should be terminated, by the system, after it has been preempted

  • Workspace

  • Training

autoDeletionTimeAfterCompletionSeconds

Specifies the duration after which a finished workload (Completed or Failed) is automatically deleted. If this field is set to zero, the workload becomes eligible to be deleted immediately after it finishes.

  • Workspace

  • Training

backoffLimit

Specifies the number of retries before marking a workload as failed

  • Workspace

  • Training

cleanPodPolicy

Specifies which pods will be deleted when the workload reaches a terminal state (completed/failed). The policy can be one of the following values:

  • Running - Only pods still running when a job completes (for example, parameter servers) will be deleted immediately. Completed pods will not be deleted so that the logs will be preserved. (Default).

  • All - All (including completed) pods will be deleted immediately when the job finishes.

  • None - No pods will be deleted when the job completes. It will keep running pods that consume GPU, CPU and memory over time. It is recommended to set to None only for debugging and obtaining logs from running pods.

Distributed

completions

Used with Hyperparameter Optimization. Specifies the number of successful pods the job should reach to be completed. The Job is marked as successful once the specified amount of pods has succeeded.

  • Workspace

  • Training

parallelism

Used with Hyperparameter Optimization. Specifies the maximum desired number of pods the workload should run at any given time.

  • Workspace

  • Training

exposeUrls

Specifies a set of URLs (e.g. ingress) exposed from the container running the created workload.

  • Workspace

  • Training

largeShmRequest

Specifies a large /dev/shm device to mount into a container running the created workload. SHM is a shared file system mounted on RAM.

  • Workspace

  • Training

PodAffinitySchedulingRule

Indicates whether to apply the Pod affinity rule as the “hard” (required) or the “soft” (preferred) option. This field can be specified only if PodAffinity is set to true.

  • Workspace

  • Training

podAffinityTopology

Specifies the Pod Affinity Topology to be used for scheduling the job. This field can be specified only if PodAffinity is set to true.

  • Workspace

  • Training

ports

Specifies a set of ports exposed from the container running the created workload. More information in Ports fields below.

  • Workspace

  • Training

probes

Specifies the ReadinessProbe to use to determine if the container is ready to accept traffic. More information in the Probes Fields section below

-

  • Workspace

  • Training

tolerations

Toleration rules which apply to the pods running the workload. Toleration rules guide (but do not require) the system to which node each pod can be scheduled to or evicted from, based on matching between those rules and the set of taints defined for each Kubernetes node.

  • Workspace

  • Training

priorityClass

Priority class of the workload. The values for workspace are build (default) or interactive-preemptible. For training only, use train. Enum: "build", "train", "interactive-preemptible"

Workspace

storage

Contains all the fields related to storage configurations. More information in the Storage fields section below.

-

  • Workspace

  • Training

security

Contains all the fields related to security configurations. More information in the Security Fields section below.

-

  • Workspace

  • Training

compute

Contains all the fields related to compute configurations. More information in the Compute Fields section below.

-

  • Workspace

  • Training

Ports Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

container

The port that the container running the workload exposes.

  • Workspace

  • Training

serviceType

Specifies the default service exposure method for ports. The default is used for ports that do not specify a service type. Options are: LoadBalancer, NodePort or ClusterIP. For more information, see the guide.

  • Workspace

  • Training

external

The external port which allows a connection to the container port. If not specified, the port is auto-generated by the system.

  • Workspace

  • Training

toolType

The tool type that runs on this port.

  • Workspace

  • Training

toolName

A name describing the tool that runs on this port.

  • Workspace

  • Training

Probes Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

readiness

Specifies the Readiness Probe to use to determine if the container is ready to accept traffic.

-

  • Workspace

  • Training

Readiness Field Details

  • Description: Specifies the Readiness Probe to use to determine if the container is ready to accept traffic

  • Supported NVIDIA Run:ai workload types: Workspace, Training

  • Value type: itemized

  • Example workload snippet:

defaults:
   probes:
     readiness:
         initialDelaySeconds: 2
Spec readiness fields
Description
Value type

initialDelaySeconds

Number of seconds after the container has started before liveness or readiness probes are initiated.

periodSeconds

How often (in seconds) to perform the probe

timeoutSeconds

Number of seconds after which the probe times out

successThreshold

Minimum consecutive successes for the probe to be considered successful after having failed

failureThreshold

When a probe fails, the number of times to try before giving up

Security Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

uidGidSource

Indicates the way to determine the user and group ids of the container. The options are:

  • fromTheImage - user and group IDs are determined by the docker image that the container runs. This is the default option.

  • custom - user and group IDs can be specified in the environment asset and/or the workspace creation request.

  • idpToken - user and group IDs are determined according to the identity provider (idp) access token. This option is intended for internal use of the environment UI form. For more information, see .

  • Workspace

  • Training

capabilities

The capabilities field allows adding a set of unix capabilities to the container running the workload. Capabilities are Linux distinct privileges traditionally associated with superuser which can be independently enabled and disabled

  • Workspace

  • Training

seccompProfileType

Indicates which kind of seccomp profile is applied to the container. The options are:

  • RuntimeDefault - the container runtime default profile should be used

  • Unconfined - no profile should be applied

  • Workspace

  • Training

runAsNonRoot

Indicates that the container must run as a non-root user.

  • Workspace

  • Training

readOnlyRootFilesystem

If true, mounts the container's root filesystem as read-only.

  • Workspace

  • Training

runAsUid

Specifies the Unix user id with which the container running the created workload should run.

  • Workspace

  • Training

runAsGid

Specifies the Unix Group ID with which the container should run.

  • Workspace

  • Training

supplementalGroups

Comma separated list of groups that the user running the container belongs to, in addition to the group indicated by runAsGid.

  • Workspace

  • Training

allowPrivilegeEscalation

Allows the container running the workload and all launched processes to gain additional privileges after the workload starts

  • Workspace

  • Training

hostIpc

Whether to enable hostIpc. Defaults to false.

  • Workspace

  • Training

hostNetwork

Whether to enable host network.

  • Workspace

  • Training
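
The security fields above can be combined into policy defaults and rules. The following minimal sketch mirrors the runAsUid examples later in this document; the remaining values, and the assumption that boolean security fields support canEdit, are illustrative.

defaults:
  security:
    uidGidSource: custom    # allow custom user and group IDs
    runAsNonRoot: true
    runAsUid: 1000
    runAsGid: 1000
rules:
  security:
    runAsUid:
      min: 500              # as in the Rules example below
    allowPrivilegeEscalation:
      canEdit: false        # assumed: canEdit applies to boolean fields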

Compute Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

cpuCoreRequest

CPU units to allocate for the created workload (0.5, 1, etc.). The workload receives at least this amount of CPU. Note that the workload is not scheduled unless the system can guarantee this amount of CPUs to the workload.

  • Workspace

  • Training

cpuCoreLimit

Limitations on the number of CPUs consumed by the workload (0.5, 1, etc.). The system guarantees that this workload is not able to consume more than this amount of CPUs.

  • Workspace

  • Training

cpuMemoryRequest

The amount of CPU memory to allocate for this workload (1G, 20M, etc.). The workload receives at least this amount of memory. Note that the workload is not scheduled unless the system can guarantee this amount of memory to the workload.

  • Workspace

  • Training

cpuMemoryLimit

Limitations on the CPU memory to allocate for this workload (1G, 20M, etc.). The system guarantees that this workload is not able to consume more than this amount of memory. The workload receives an error when trying to allocate more memory than this limit.

  • Workspace

  • Training

largeShmRequest

Whether to mount a large /dev/shm device into the container running the created workload (shm is a shared file system mounted on RAM).

  • Workspace

  • Training

gpuRequestType

Sets the unit type for GPU resource requests to either portion, memory, or migProfile. The request type can be set only when gpuDeviceRequest = 1.

  • Workspace

  • Training

gpuPortionRequest

Specifies the fraction of a GPU to allocate to the workload, between 0 and 1. For backward compatibility, it also supports whole numbers of GPU devices larger than 1, which are now provided using the gpuDeviceRequest field.

  • Workspace

  • Training

gpuDeviceRequest

Specifies the number of GPU devices to allocate for the created workload. gpuRequestType can be defined only when gpuDeviceRequest = 1.

  • Workspace

  • Training

gpuPortionLimit

When a fraction of a GPU is requested, the GPU limit specifies the portion limit to allocate to the workload. The range of the value is from 0 to 1.

  • Workspace

  • Training

gpuMemoryRequest

Specifies GPU memory to allocate for the created workload. The workload receives this amount of memory. Note that the workload is not scheduled unless the system can guarantee this amount of GPU memory to the workload.

  • Workspace

  • Training

gpuMemoryLimit

Specifies a limit on the GPU memory to allocate for this workload. Should be no less than gpuMemoryRequest.

  • Workspace

  • Training

extendedResources

Specifies values for extended resources. Extended resources are third-party devices (such as high-performance NICs, FPGAs, or InfiniBand adapters) that you want to allocate to your Job.

  • Workspace

  • Training
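
As a rough sketch of how the compute fields above can be given policy defaults and constraints (all values are illustrative; the gpuDevicesRequest spelling follows the rule examples later in this document):

defaults:
  compute:
    cpuCoreRequest: 0.5
    cpuMemoryRequest: 1G
    gpuDevicesRequest: 1
    gpuRequestType: portion
    gpuPortionRequest: 0.5
rules:
  compute:
    gpuDevicesRequest:
      max: 8                 # as in the Rules example below
    cpuCoreLimit:
      required: true         # submission must specify a CPU limit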

Storage Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

dataVolume

Set of data volumes to use in the workload. Each data volume is mapped to a file-system mount point within the container running the workload.

  • Workspace

  • Training

hostPath

Maps a folder to a file-system mount point within the container running the workload.

  • Workspace

  • Training

git

Details of the git repository and items mapped to it.

  • Workspace

  • Training

pvc

Specifies persistent volume claims to mount into a container running the created workload.

  • Workspace

  • Training

nfs

Specifies NFS volume to mount into the container running the workload.

  • Workspace

  • Training

s3

Specifies S3 buckets to mount into the container running the workload.

  • Workspace

  • Training

configMapVolumes

Specifies ConfigMaps to mount as volumes into a container running the created workload.

  • Workspace

  • Training

secretVolume

Set of secret volumes to use in the workload. A secret volume maps a secret resource in the cluster to a file-system mount point within the container running the workload.

  • Workspace

  • Training

hostPath Field Details

  • Description: Maps a folder to a file-system mount point within the container running the workload

  • Supported NVIDIA Run:ai workload types: Workspace, Training

  • Value type: itemized

  • Example workload snippet:

defaults:
  storage:
    hostPath:
      instances:
        - path: h3-path-1
          mountPath: h3-mount-1
        - path: h3-path-2
          mountPath: h3-mount-2
      attributes:
        readOnly: true
hostPath fields
Description
Value type

name

Unique name to identify the instance. Primarily used for policy locked rules.

path

Local path on the host to which the host volume is mapped.

readOnly

Force the volume to be mounted with read-only permissions. Defaults to false

mountPath

The path that the host volume is mounted to when in use.

mountPropagation

Share this volume mount with other containers. If set to HostToContainer, this volume mount receives all subsequent mounts that are mounted to this volume or any of its subdirectories. In case of multiple hostPath entries, this field should have the same value for all of them. Enum:

  • "None"

  • "HostToContainer"

Git Field Details

  • Description: Details of the git repository and items mapped to it

  • Supported NVIDIA Run:ai workload types: Workspace, Training

  • Value type: itemized

  • Example workload snippet:

defaults:
  storage:
    git:
      attributes:
        repository: https://runai.public.github.com
      instances:
        - branch: "master"
          path: /container/my-repository
          passwordSecret: my-password-secret
Git fields
Description
Value type

repository

URL to a remote git repository. The content of this repository is mapped to the container running the workload

revision

Specific revision to synchronize the repository from

path

Local path within the workspace to which the git repository is mapped

secretName

Optional name of Kubernetes secret that holds your git username and password

username

If secretName is provided, this field should contain the key, within the provided Kubernetes secret, which holds the value of your git username. Otherwise, this field should specify your git username in plain text (example: myuser).

PVC Field Details

  • Description: Specifies persistent volume claims to mount into a container running the created workload

  • Supported NVIDIA Run:ai workload types: Workspace, Training

  • Value type: itemized

  • Example workload snippet:

defaults:
  storage:
    pvc:
      instances:
        - claimName: pvc-staging-researcher1-home
          existingPvc: true
          path: /myhome
          readOnly: false
          claimInfo:
            accessModes:
              readWriteMany: true
Spec PVC fields
Description
Value type

claimName (mandatory)

A given name for the PVC. Allows referencing it across workspaces.

ephemeral

Use true to set PVC to ephemeral. If set to true, the PVC is deleted when the workspace is stopped.

path

Local path within the workspace to which the PVC is mapped

readOnly

Permits read-only access to the PVC, preventing additions or modifications to its content

readWriteOnce

Requesting claim that can be mounted in read/write mode to exactly 1 host. If none of the modes are specified, the default is readWriteOnce.

size

Requested size for the PVC. Mandatory when existingPvc is false

storageClass

Storage class name to associate with the PVC. This parameter may be omitted if there is a single storage class in the system, or you are using the default storage class. Further details at Kubernetes storage classes.

readOnlyMany

Requesting claim that can be mounted in read-only mode to many hosts

readWriteMany

Requesting claim that can be mounted in read/write mode to many hosts

NFS Field Details

  • Description: Specifies NFS volume to mount into the container running the workload

  • Supported NVIDIA Run:ai workload types: Workspace, Training

  • Value type: itemized

  • Example workload snippet:

defaults:
 storage:
   nfs:
     instances:
       - path: nfs-path
         readOnly: true
         server: nfs-server
         mountPath: nfs-mount
rules:
  storage:
    nfs:
      instances:
        canAdd: false
nfs fields
Description
Value type

mountPath

The path that the NFS volume is mounted to when in use

path

Path that is exported by the NFS server

readOnly

Whether to force the NFS export to be mounted with read-only permissions

nfsServer

The hostname or IP address of the NFS server

S3 Field Details

  • Description: Specifies S3 buckets to mount into the container running the workload

  • Supported NVIDIA Run:ai workload types: Workspace, Training

  • Value type: itemized

  • Example workload snippet:

defaults:
  storage:
    s3:
      instances:
        - bucket: bucket-opt-1
          path: /s3/path
          accessKeySecret: s3-access-key
          secretKeyOfAccessKeyId: s3-secret-id
          secretKeyOfSecretKey: s3-secret-key
      attributes:
        url: https://amazonaws.s3.com
s3 fields
Description
Value type

bucket

The name of the bucket

path

Local path within the workspace to which the S3 bucket is mapped

url

The URL of the S3 service provider. The default is the URL of the Amazon AWS S3 service

Value Types

Each field has a specific value type. The following value types are supported.

Value type
Description
Supported rule type
Defaults

Boolean

A binary value that can be either True or False

true/false

String

A sequence of characters used to represent text. It can include letters, numbers, symbols, and spaces

abc

Itemized

An ordered collection of items (objects); all items in the list are of the same type. For further information, see the Itemized chapter below the table.

See below

Integer

An Integer is a whole number without a fractional component.

100

Number

A numeric value that can have a fractional (non-integer) component

10.3

Quantity

Holds a string composed of a number and a unit representing a quantity

5M

Array

Set of values that are treated as one, as opposed to Itemized in which each item can be referenced separately.

  • node-a

  • node-b

  • node-c
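
To make these value types concrete, the following minimal defaults sketch assigns an illustrative value to one field of each basic type, using field names documented above; itemized fields are covered in the chapter that follows.

defaults:
  imagePullPolicy: Always    # string
  compute:
    largeShmRequest: true    # boolean
    gpuDevicesRequest: 1     # integer
    cpuCoreRequest: 0.5      # number
    cpuMemoryRequest: 1G     # quantity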

Itemized

Workload fields of type itemized can have multiple instances; however, in comparison to arrays, each instance can be referenced by a key field. The key field is defined for each such field.

Consider the following workload spec:

spec:
  image: ubuntu
  compute:
    extendedResources:
      - resource: added/cpu
        quantity: 10
      - resource: added/memory
        quantity: 20M

In this example, extendedResources has two instances, each with two attributes: resource (the key attribute) and quantity.

In a policy, the defaults and rules for itemized fields have two subsections:

  • Instances: default items to be added to the policy or rules which apply to an instance as a whole.

  • Attributes: defaults for attributes within an item or rules which apply to attributes within each item.

Consider the following example:

defaults:
  compute:
    extendedResources:
      instances: 
        - resource: default/cpu
          quantity: 5
        - resource: default/memory
          quantity: 4M
      attributes:
        quantity: 3
rules:
  compute:
    extendedResources:
      instances:
        locked: 
          - default/cpu
      attributes:
        quantity: 
          required: true

Assume the following workload submission is requested:

spec:
  image: ubuntu
  compute:
    extendedResources:
      - resource: default/memory
        exclude: true
      - resource: added/cpu
      - resource: added/memory
        quantity: 5M

The effective policy for the above-mentioned workload has the following extendedResources instances:

Resource
Source of the instance
Quantity
Source of the attribute quantity

default/cpu

Policy defaults

5

The default of this instance in the policy defaults section

added/cpu

Submission request

3

The default of the quantity attribute from the attributes section

added/memory

Submission request

5M

Submission request

Note

The default/memory resource is not populated to the workload because it has been excluded from the workload using “exclude: true”.

A workload submission request cannot exclude the default/cpu resource, as this key is included in the locked rules under the instances section.

Rule Types

Rule types
Description
Supported value types
Rule type example

canAdd

Whether the submission request can add items to an itemized field other than those listed in the policy defaults for this field.

storage:
  hostPath:
    instances:
      canAdd: false

locked

Set of items that the workload is unable to modify or exclude. In this example, a workload policy default is given to HOME and USER, which the submission request cannot modify or exclude from the workload.

storage:
  hostPath:
    instances:
      locked:
        - HOME
        - USER

canEdit

Whether the submission request can modify the policy default for this field. In this example, it is assumed that the policy has a default for imagePullPolicy. As canEdit is set to false, submission requests cannot alter this default.

imagePullPolicy:
  canEdit: false

required

When set to true, the workload must have a value for this field. The value can be obtained from policy defaults. If no value is specified in the policy defaults, a value must be specified for this field in the submission request.

image:
  required: true

min

The minimal value for the field

compute:
  gpuDevicesRequest:
    min: 3

max

The maximal value for the field

compute:
  gpuMemoryRequest:
    max: 2G

step

The allowed gap between values for this field. In this example the allowed values are: 1, 3, 5, 7

compute:
  cpuCoreRequest:
    min: 1
    max: 7
    step: 2

options

Set of allowed values for this field

image:
  options:
    - value: image-1
    - value: image-2

defaultFrom

Set a default value for a field that will be calculated based on the value of another field

cpuCoreRequest:
  defaultFrom:
    field: compute.cpuCoreLimit
    factor: 0.5
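
The following minimal sketch combines several of the rule types above into a single policy, reusing the per-rule examples from this table; the values are illustrative.

rules:
  image:
    required: true
    options:
      - value: image-1
      - value: image-2
  imagePullPolicy:
    canEdit: false
  compute:
    cpuCoreRequest:
      min: 1
      max: 7
      step: 2
    gpuDevicesRequest:
      min: 3
  storage:
    hostPath:
      instances:
        canAdd: false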

Policy Spec Sections

For each field of a specific policy, you can specify both rules and defaults. A policy spec consists of the following sections:

  • Rules

  • Defaults

  • Imposed Assets

Rules

Rules set up constraints on workload policy fields. For example, consider the following policy:

rules:
  compute:
    gpuDevicesRequest: 
      max: 8
  security:
    runAsUid: 
      min: 500

Such a policy restricts the maximum value of gpuDevicesRequest to 8, and sets the minimum value of runAsUid (in the security section) to 500.

Defaults

The defaults section is used for providing defaults for various workload fields. For example, consider the following policy:

defaults:
  imagePullPolicy: Always
  security:
    runAsNonRoot: true
    runAsUid: 500

Assume a submission request with the following values:

  • Image: ubuntu

  • runAsUid: 501

The effective workload that runs has the following set of values:

Field
Value
Source

Image

ubuntu

Submission request

ImagePullPolicy

Always

Policy defaults

security.runAsNonRoot

true

Policy defaults

security.runAsUid

501

Submission request

Note

It is possible to specify a rule for each field, which states if a submission request is allowed to change the policy default for that given field, for example:

defaults:
  imagePullPolicy: Always
  security:
    runAsNonRoot: true
    runAsUid: 500
rules:
  security:
    runAsUid:
      canEdit: false

If this policy is applied, the submission request above fails, as it attempts to change the value of security.runAsUid from 500 (the policy default) to 501 (the value provided in the submission request), which is forbidden because the canEdit rule is set to false for this field.

Imposed Assets

Default instances of a storage field can be provided using a data source asset containing the details of the storage instance. To add such instances to the policy, specify the asset IDs in the imposedAssets section of the policy.

defaults: null
rules: null
imposedAssets:
  - f12c965b-44e9-4ff6-8b43-01d8f9e630cc

Assets that reference credential assets (for example, a private S3 data source that references an AccessKey asset) cannot be used as imposedAssets.
