

Overview

NVIDIA Run:ai is a GPU orchestration and optimization platform that helps organizations maximize compute utilization for AI workloads. By optimizing the use of expensive compute resources, NVIDIA Run:ai accelerates AI development cycles, and drives faster time-to-market for AI-powered innovations.

Built on Kubernetes, NVIDIA Run:ai supports dynamic GPU allocation, workload submission, workload scheduling, and resource sharing, ensuring that AI teams get the compute power they need while IT teams maintain control over infrastructure efficiency.

How NVIDIA Run:ai Helps Your Organization

For Infrastructure Administrators

NVIDIA Run:ai centralizes cluster management and optimizes infrastructure control by offering:

  • Centralized cluster management - Manage all clusters from a single platform, ensuring consistency and control across environments.

  • Usage monitoring and capacity planning - Gain real-time and historical insights into GPU consumption across clusters to optimize resource allocation and plan future capacity needs efficiently.

  • Policy enforcement - Define and enforce security and usage policies to align GPU consumption with business and compliance requirements.

  • Enterprise-grade authentication - Integrate with your organization's identity provider for streamlined authentication (Single Sign-On) and role-based access control (RBAC).

For Platform Administrators

NVIDIA Run:ai simplifies AI infrastructure management by providing a structured approach to managing AI initiatives, resources, and user access. It enables platform administrators to maintain control, efficiency, and scalability across their infrastructure:

  • AI Initiative structuring and management - Map and set up AI initiatives according to your organization's structure, ensuring clear resource allocation.

  • Centralized GPU resource management - Enable seamless sharing and pooling of GPUs across multiple users, reducing idle time and optimizing utilization.

  • User and access control - Assign users (AI practitioners, ML engineers) to specific projects and departments to manage access and enforce security policies, utilizing role-based access control (RBAC) to ensure permissions align with user roles.

For AI Practitioners

NVIDIA Run:ai empowers data scientists and ML engineers by providing:

  • Optimized workload scheduling - Ensure high-priority jobs get GPU resources. Workloads dynamically receive resources based on demand.

  • Fractional GPU usage - Request and utilize only a fraction of a GPU's memory, ensuring efficient resource allocation and leaving room for other workloads.

  • AI initiatives lifecycle support - Run your entire AI initiatives lifecycle – Jupyter Notebooks, training jobs, and inference workloads – efficiently.

  • Interactive session - Ensure an uninterrupted experience when working in Jupyter Notebooks without taking away GPUs.

NVIDIA Run:ai System Components

NVIDIA Run:ai is made up of two components, both installed over a Kubernetes cluster:

  • NVIDIA Run:ai cluster - Provides scheduling and workload management, extending Kubernetes native capabilities.

  • NVIDIA Run:ai control plane - Provides resource management, handles workload submission and provides cluster monitoring and analytics.

NVIDIA Run:ai Cluster

The NVIDIA Run:ai cluster is responsible for scheduling AI workloads and efficiently allocating GPU resources across users and projects:

  • NVIDIA Run:ai Scheduler - Applies AI-aware rules to efficiently schedule workloads submitted by AI practitioners.

  • Workload management - Handles workload management, which includes the researcher code running as a Kubernetes container and the system resources required to run the code, such as storage, credentials, network endpoints to access the container, and so on.

  • Kubernetes operator-based deployment - Installed as a Kubernetes Operator to automate deployment, upgrades and configuration of NVIDIA Run:ai cluster services.

  • Kubernetes-native application - Installs as a Kubernetes-native application, seamlessly extending Kubernetes for a native cloud experience and operational standards (install, upgrade, configure).

  • Workload scheduling - Use scheduling to prioritize and allocate GPUs based on workload needs.

  • Monitoring and insights - Track real-time and historical data on GPU usage to help track resource consumption and optimize costs.

  • Scalability for training and inference - Support for distributed training across multiple GPUs and auto-scaling of inference workloads.

  • Integrations - Integrate with popular ML frameworks - PyTorch, TensorFlow, XGBoost, Knative, Spark, Kubeflow Pipelines, Apache Airflow, Argo workloads, Ray and more.

  • Flexible workload submission - Submit workloads using the NVIDIA Run:ai UI, API, CLI or run third-party workloads.

  • Storage - Supports Kubernetes-native storage using Storage Classes, allowing organizations to bring their own storage solutions. Additionally, it integrates with external storage solutions such as Git, S3, and NFS to support various data requirements.

  • Secured communication - Uses an outbound-only, secured (SSL) connection to synchronize with the NVIDIA Run:ai control plane.

  • Private - NVIDIA Run:ai only synchronizes metadata and operational metrics (e.g., workloads, nodes) with the control plane. No proprietary data, model artifacts, or user data sets are ever transmitted, ensuring full data privacy and security.

NVIDIA Run:ai Control Plane

The NVIDIA Run:ai control plane provides a centralized management interface for organizations to oversee their GPU infrastructure across multiple locations/subnets, accessible via the Web UI, API, and CLI. The control plane can be deployed on the cloud or on-premises for organizations that require local control over their infrastructure (self-hosted).

  • Multi-cluster management - Manages multiple NVIDIA Run:ai clusters for a single tenant across different locations and subnets from a single unified interface.

  • Resource and access management - Allows administrators to define Projects, Departments and user roles, enforcing policies for fair resource distribution.

  • Workload submission and monitoring - Allows teams to submit workloads, track usage, and monitor GPU performance in real time.

Installation Types

There are two main installation options:

  • SaaS - NVIDIA Run:ai is installed on the customer's data science GPU clusters. The cluster connects to the NVIDIA Run:ai control plane on the cloud (https://<tenant-name>.run.ai). With this installation, the cluster requires an outbound connection to the NVIDIA Run:ai cloud.

  • Self-hosted - The NVIDIA Run:ai control plane is also installed in the customer's data center.





    Uninstall

    Uninstall the Control Plane

    To delete the control plane, run:

    helm uninstall runai-backend -n runai-backend

    Uninstall the Cluster
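    The NVIDIA Run:ai cluster is removed by uninstalling its Helm release. A minimal sketch, assuming the default release name runai-cluster in the runai namespace; adjust both if your installation used different values:

    # Release name and namespace are assumptions; adjust to your installation
    helm uninstall runai-cluster -n runai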

    What's New

    This section includes release information for the self-hosted version of NVIDIA Run:ai:

    • New Features and Enhancements - Highlights major updates introduced in each version, including new capabilities, UI improvements, and changes to system behavior.

    • Hotfixes - Lists patches applied to released versions, including critical fixes and behavior corrections.

    Note

    See our Product version life cycle for a list of supported versions and their respective support timelines.

    Feature Life Cycle

    NVIDIA Run:ai uses life cycle labels to indicate the maturity and stability of features across releases:

    • Experimental - This feature is in early development. It may not be stable and could be removed or changed significantly in future versions. Use with caution.

    • Beta - This feature is still being developed for official release in a future version and may have some limitations. Use with caution.

    • Legacy - This feature is scheduled to be removed in future versions. We recommend using alternatives if available. Use only if necessary.


    User Applications

    This article explains the procedure to create your own user applications.

    Applications are used for API integrations with NVIDIA Run:ai. An application contains a client ID and a client secret. With the client credentials, you can obtain a token as detailed in API authentication and use it within subsequent API calls.

    Notes

    • All clusters in the tenant must be version 2.20 and onward.

    • The token obtained through user applications assumes the roles and permissions of the user.

    Creating an Application

    To create an application:

    1. Click the user avatar at the top right corner, then select Settings

    2. Click +APPLICATION

    3. Enter the application’s name

    4. Click CREATE

    5. Copy the Client ID and Client secret and store securely

    6. Click DONE

    You can create up to 20 user applications.

    Note

    The client secret is visible only at the time of creation. It cannot be recovered but can be regenerated.
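    With the client ID and secret from the steps above, you can request an API token from the control plane. The sketch below assumes an OAuth2-style client-credentials exchange; the exact endpoint path and field names are defined in the API authentication reference, so treat those shown here as illustrative placeholders:

    # Illustrative token request; verify the endpoint and payload in the API authentication reference
    curl -X POST "https://<control-plane-domain>/api/v1/token" \
      -H "Content-Type: application/json" \
      -d '{"grantType": "client_credentials", "clientId": "<CLIENT_ID>", "clientSecret": "<CLIENT_SECRET>"}'

    # The returned access token is then sent as a Bearer token in subsequent API calls
    curl -H "Authorization: Bearer <ACCESS_TOKEN>" "https://<control-plane-domain>/api/v1/<resource>"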

    Regenerating a Client Secret

    To regenerate a client secret:

    1. Locate the application whose client secret you want to regenerate

    2. Click Regenerate client secret

    3. Click REGENERATE

    4. Copy the New client secret and store it securely

    5. Click DONE

    Important

    Regenerating a client secret revokes the previous one.

    Deleting an Application

    1. Locate the application you want to delete

    2. Click on the trash icon

    3. On the dialog, click DELETE to confirm

    Using API

    Go to the User Applications API reference to view the available actions.

    Installation

    NVIDIA Run:ai Components

    As part of the installation process, you will install:

    • A managing control plane

    • One or more clusters

    Both the control plane and clusters require Kubernetes. Typically, the control plane and first cluster are installed on the same Kubernetes cluster.

    Installation Types

    The self-hosted option is for organizations that cannot use a SaaS solution due to data leakage concerns. NVIDIA Run:ai self-hosting comes with two variants:

    • Connected - The organization can freely download from the internet (though upload is not allowed).

    • Air-gapped - The organization has no connection to the internet.


    Service Mesh

    NVIDIA Run:ai supports service mesh implementations. When a service mesh is deployed with sidecar injection, specific configurations must be applied to ensure compatibility with NVIDIA Run:ai. This document outlines the required changes for the NVIDIA Run:ai control plane and cluster.

    Control Plane Configuration

    Note

    This section applies to self-hosted only.

    By default, NVIDIA Run:ai prevents Istio from injecting sidecar containers into system jobs in the control plane. For other service mesh solutions, users must manually add annotations during installation.

    To disable sidecar injection in the NVIDIA Run:ai control plane, modify the Helm values file by adding the required pod labels to the following components. See Advanced control plane configurations for more details.

    Example for Open Service Mesh:

    authorizationMigrator:
      podLabels:
        openservicemesh.io/sidecar-injection: disabled
    clusterMigrator:
      podLabels:
        openservicemesh.io/sidecar-injection: disabled
    identityProviderReconciler:
      podLabels:
        openservicemesh.io/sidecar-injection: disabled
    keepPVC:
      podLabels:
        openservicemesh.io/sidecar-injection: disabled
    orgUnitsMigrator:
      podLabels:
        openservicemesh.io/sidecar-injection: disabled

    Cluster Configuration

    Installation Phase

    Sidecar containers injected by some service mesh solutions can prevent NVIDIA Run:ai installation hooks from completing. To avoid this, modify the Helm installation command to include the required labels or annotations:

    helm upgrade -i ... \
      --set global.additionalJobLabels.A=B --set global.additionalJobAnnotations.A=B

    Example for Istio Service Mesh:

    helm upgrade -i ... \
      --set-json global.additionalJobLabels='{"sidecar.istio.io/inject":false}'

    Workloads

    To prevent sidecar injection in workloads created at runtime (such as training workloads), update the runaiconfig resource. See Advanced cluster configurations for more details:

    spec:
      workload-controller:
        additionalPodLabels:
          sidecar.istio.io/inject: false

    Monitoring and Maintenance

    Deploying NVIDIA Run:ai in mission-critical environments requires proper monitoring and maintenance of resources to ensure workloads run and are deployed as expected.

    Details on how to monitor different parts of the physical resources in your Kubernetes system, including clusters and nodes, can be found in the monitoring and maintenance section. Adjacent configuration and troubleshooting sections also cover high availability, restoring and securing clusters, collecting logs, and reviewing audit logs to meet compliance requirements.

    In addition to monitoring NVIDIA Run:ai resources, it is also highly recommended to monitor the Kubernetes environment that NVIDIA Run:ai runs on, which manages the containerized applications. In particular, focus on three main layers:

    NVIDIA Run:ai Control Plane and Cluster Services

    This is the highest layer and includes the NVIDIA Run:ai pods, which run in containers managed by Kubernetes.

    Kubernetes Cluster

    This layer includes the main Kubernetes system that runs and manages NVIDIA Run:ai components. Important elements to monitor include:

    • The health of the cluster and nodes (machines in the cluster).

    • The status of key Kubernetes services, such as the API server. For detailed information on managing clusters, see the official Kubernetes documentation.

    Host Infrastructure

    This is the base layer, representing the actual machines (virtual or physical) that make up the cluster. IT teams need to handle:

    • Managing CPU, memory, and storage

    • Keeping the operating system updated

    • Setting up the network and balancing the load

    NVIDIA Run:ai does not require any special configurations at this level.

    The articles below explain how to monitor these layers, maintain system security and compliance, and ensure the reliable operation of NVIDIA Run:ai in critical environments.

    Interworking with Karpenter

    Karpenter is an open-source, Kubernetes cluster autoscaler built for cloud deployments. Karpenter optimizes the cloud cost of a customer’s cluster by moving workloads between different node types, consolidating workloads into fewer nodes, using lower-cost nodes where possible, scaling up new nodes when needed, and shutting down unused nodes.

    Karpenter’s main goal is cost optimization. Unlike Karpenter, NVIDIA Run:ai’s Scheduler optimizes for fairness and resource utilization. Therefore, there are a few potential friction points when using both on the same cluster.

    Friction Points Using Karpenter with NVIDIA Run:ai

    1. Karpenter looks for “unschedulable” pending workloads and may try to scale up new nodes to make those workloads schedulable. However, in some scenarios, these workloads may exceed their quota parameters, and the NVIDIA Run:ai Scheduler will put them into a pending state.

    2. Karpenter is not aware of the NVIDIA Run:ai fractions mechanism and may try to interfere incorrectly.

    3. Karpenter preempts any type of workload (i.e., high-priority, non-preemptible workloads will potentially be interrupted and moved to save cost).

    4. Karpenter has no pod-group (i.e., workload) notion or gang scheduling awareness, meaning that Karpenter is unaware that a set of “arbitrary” pods is a single workload. This may cause Karpenter to schedule those pods into different node pools (in the case of multi-node-pool workloads) or scale up or down a mix of wrong nodes.

    Mitigating the Friction Points

    NVIDIA Run:ai Scheduler mitigates the friction points using the following techniques (each numbered bullet below corresponds to the related friction point listed above):

    1. Karpenter uses a “nominated node” to recommend a node for the Scheduler. The NVIDIA Run:ai Scheduler treats this as a “preferred” recommendation, meaning it will try to use this node, but it’s not required and it may choose another node.

    2. Fractions - Karpenter won’t consolidate nodes with one or more pods that cannot be moved. The NVIDIA Run:ai reservation pod is marked as ‘do not evict’ to allow the NVIDIA Run:ai Scheduler to control the scheduling of fractions.

    3. Non-preemptible workloads - NVIDIA Run:ai marks non-preemptible workloads as ‘do not evict’ and Karpenter respects this annotation.

    4. NVIDIA Run:ai node pools (single-node-pool workloads) - Karpenter respects the ‘node affinity’ that NVIDIA Run:ai sets on a pod, so Karpenter uses the node affinity for its recommended node. For the gang-scheduling/pod-group (workload) notion, NVIDIA Run:ai Scheduler considers Karpenter directives as preferred recommendations rather than mandatory instructions and overrides Karpenter instructions where appropriate.

    Deployment Considerations

    • Using multi-node-pool workloads

      • Workloads may include a list of optional node pools. Karpenter is not aware that only a single node pool should be selected out of that list for the workload. It may therefore recommend putting pods of the same workload into different node pools and may scale up nodes from different node pools to serve a “multi-node-pool” workload instead of nodes on the selected single node pool.

      • If this becomes an issue (i.e., if Karpenter scales up the wrong node types), users can set an inter-pod affinity using the node pool label or another common label as a ‘topology’ identifier. This will force Karpenter to choose nodes from a single node pool per workload, selecting from any of the node pools listed as allowed by the workload. An alternative approach is to use a single node pool for each workload instead of multi-node pools.

    • Consolidation

      • To make Karpenter more effective when using its consolidation function, users should consider separating preemptible and non-preemptible workloads, either by using node pools, node affinities, taints/tolerations, or inter-pod anti-affinity.

      • If users don’t separate preemptible and non-preemptible workloads (i.e., make them run on different nodes), Karpenter’s ability to consolidate (bin-pack) and shut down nodes will be reduced, but it is still effective.

    • Conflicts between bin-packing and spread policies

      • If NVIDIA Run:ai is used with a scheduling spread policy, it will clash with Karpenter’s default bin-packing/consolidation policy, and the outcome may be a deployment that is not optimized for either of these policies.

      • Usually spread is used for inference, which is non-preemptible and therefore not controlled by Karpenter (the NVIDIA Run:ai Scheduler will mark those workloads as ‘do not evict’ for Karpenter), so this should not present a real deployment issue for customers.

    High Availability

    This guide outlines the best practices for configuring the NVIDIA Run:ai platform to ensure high availability and maintain service continuity during system failures or under heavy load. The goal is to reduce downtime and eliminate single points of failure by leveraging Kubernetes best practices alongside NVIDIA Run:ai specific configuration options. The NVIDIA Run:ai platform relies on two fundamental high availability strategies:

    • Use of system nodes - Assigning multiple dedicated nodes for critical system services ensures control, resource isolation, and enables system-level scaling.

    • Replication of core and third-party services - Configuring multiple replicas of essential services, including both platform and third-party components, distributes workloads and reduces single points of failure. If a component fails on one node, requests can seamlessly route to another instance.

    System Nodes

    The NVIDIA Run:ai platform allows you to dedicate specific nodes (system nodes) exclusively for core platform services. This approach provides improved operational isolation and easier resource management.

    Ensure that at least three system nodes are configured to support high availability. If you use only a single node for core services, horizontally scaled components will not be distributed, resulting in a single point of failure. See NVIDIA Run:ai system nodes for more details. This practice applies to both the NVIDIA Run:ai cluster and control plane (self-hosted).
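    Designating a system node is done by labeling it so that NVIDIA Run:ai system components are scheduled onto it. A minimal sketch, assuming the commonly used system-node label; confirm the exact label key in the NVIDIA Run:ai system nodes documentation:

    # Label three nodes as NVIDIA Run:ai system nodes (label key is an assumption; verify in the docs)
    kubectl label node <node-1> <node-2> <node-3> node-role.kubernetes.io/runai-system=true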

    Service Replicas

    Control Plane Service Replicas

    The NVIDIA Run:ai control plane runs in the runai-backend namespace and consists of multiple Kubernetes Deployments and StatefulSets. To achieve high availability, it is recommended to configure multiple replicas during installation or upgrade using Helm flags.

    In addition, the control plane supports autoscaling for certain services to handle variable load and improve system resiliency. Autoscaling can be enabled or configured during installation or upgrade using Helm flags.

    Deployments

    Each of the NVIDIA Run:ai deployments can be set to scale up by adding Helm settings on install/upgrade. For a full list of settings, contact NVIDIA Run:ai support.

    To increase the replica count, use the following NVIDIA Run:ai control plane Helm flag:

    --set <service>.replicaCount=2

    StatefulSets

    NVIDIA Run:ai uses the following third-party components which are managed as Kubernetes StatefulSets. For more information, see Advanced control plane configurations:

    • PostgreSQL - The internal PostgreSQL cannot be scaled horizontally. To connect NVIDIA Run:ai to an external PostgreSQL service which can be configured for high availability, see External Postgres Database.

    • Thanos - To enable Thanos autoscaling, use the following NVIDIA Run:ai control plane Helm flags:

      --set thanos.query.autoscaling.enabled=true \
      --set thanos.query.autoscaling.maxReplicas=2 \
      --set thanos.query.autoscaling.minReplicas=2

    • Keycloak - By default, Keycloak sets a minimum of 3 pods and will scale to more on transaction load. To scale Keycloak, use the following NVIDIA Run:ai control plane Helm flag:

      --set keycloakx.autoscaling.enabled=true

    Cluster Services Replicas

    By default, NVIDIA Run:ai cluster services are deployed with a single replica. To achieve high availability, it is recommended to configure multiple replicas for core NVIDIA Run:ai services. For more information, see NVIDIA Run:ai services replicas.

    Note

    Some NVIDIA Run:ai services do not have a replicas configuration. These will always run a single replica, and their recovery time after failure is tied to pod restart and rescheduling time.

    Secure Your Cluster

    This section details the security considerations for deploying NVIDIA Run:ai. It is intended to help administrators and security officers understand the specific permissions required by NVIDIA Run:ai.

    Access to the Kubernetes Cluster

    NVIDIA Run:ai integrates with Kubernetes clusters and requires specific permissions to operate successfully. These permissions are controlled with configuration flags that dictate how NVIDIA Run:ai interacts with cluster resources. Prior to installation, security teams can review the permissions and ensure they align with their organization’s policies.

    Permissions and their Related Use Case

    NVIDIA Run:ai provides various security-related permissions that can be customized to fit specific organizational needs. Below are brief descriptions of the key use cases for these customizations:

    • Automatic namespace creation - Controls whether NVIDIA Run:ai automatically creates Kubernetes namespaces when new projects are created. Useful in environments where namespace creation must be strictly managed.

    • Automatic user assignment - Decides if users are automatically assigned to projects within NVIDIA Run:ai. Helps manage user access more tightly in certain compliance-driven environments.

    • Secret propagation - Determines whether NVIDIA Run:ai should propagate secrets across the cluster. Relevant for organizations with specific security protocols for managing sensitive data.

    • Disabling Kubernetes limit range - Chooses whether to disable the Kubernetes Limit Range feature. May be adjusted in environments with specific resource management needs.

    Note

    These security customizations allow organizations to tailor NVIDIA Run:ai to their specific needs. Changes should be made cautiously and only when necessary to meet particular security, compliance or operational requirements.

    Secure Installation

    Many organizations enforce IT compliance rules for Kubernetes, with strict access control for installing and running workloads. OpenShift uses Security Context Constraints (SCC) for this purpose. NVIDIA Run:ai fully supports SCC, ensuring integration with OpenShift's security requirements.

    Security Vulnerabilities

    The platform is actively monitored for security vulnerabilities, with regular scans conducted to identify and address potential issues. Necessary fixes are applied to ensure that the software remains secure and resilient against emerging threats, providing a safe and reliable experience.

    Shared Storage

    Shared storage is a critical component in AI and machine learning workflows, particularly in scenarios involving distributed training and shared datasets. In AI and ML environments, data must be readily accessible across multiple nodes, especially when training large models or working with vast datasets. Shared storage enables seamless access to data, ensuring that all nodes in a distributed training setup can read and write to the same datasets simultaneously. This setup not only enhances efficiency but is also crucial for maintaining consistency and speed in high-performance computing environments.

    While the NVIDIA Run:ai platform supports a variety of remote data sources, such as Git and S3, it is often more efficient to keep data close to the compute resources. This proximity is typically achieved through the use of shared storage, accessible to multiple nodes in your Kubernetes cluster.

    Shared Storage

    When implementing shared storage in Kubernetes, there are two primary approaches:

    • Utilizing the Kubernetes Storage Classes of your storage provider (Recommended)

    • Using a direct NFS (Network File System) mount

    NVIDIA Run:ai supports both direct NFS mounts and Kubernetes Storage Classes.

    Kubernetes Storage Classes

    Storage classes in Kubernetes define how storage is provisioned and managed. This allows you to select storage types optimized for AI workloads. For example, you can choose storage with high IOPS (Input/Output Operations Per Second) for rapid data access during intensive training sessions, or tiered storage options to balance cost and performance based on your organization’s requirements. This approach supports dynamic provisioning, enabling storage to be allocated on demand as required by your applications.

    NVIDIA Run:ai data sources such as Persistent Volume Claims (PVC) and Data Volumes leverage storage classes to manage and allocate storage efficiently. This ensures that the most suitable storage option is always accessible, contributing to the efficiency and performance of AI workloads.
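    For illustration, a minimal PersistentVolumeClaim that requests storage from a specific storage class; the claim name, class name, and size below are placeholders for whatever your storage provider exposes:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: shared-training-data          # placeholder name
    spec:
      accessModes:
        - ReadWriteMany                   # shared access across nodes for distributed training
      storageClassName: <your-storage-class>
      resources:
        requests:
          storage: 500Gi                  # placeholder size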

    Note

    NVIDIA Run:ai lists all available storage classes in the Kubernetes cluster, making it easy for users to select the appropriate storage. Additionally, policies can be set to restrict or enforce the use of specific storage classes, to help maintain compliance with organizational standards and optimize resource utilization.

    Direct NFS Mount

    Direct NFS allows you to mount a shared file system directly across multiple nodes in your Kubernetes cluster. This method provides a straightforward way to share data among nodes and is often used for simple setups or when a dedicated NFS server is available.

    However, using NFS can present challenges related to security and control. Direct NFS setups might lack the fine-grained control and security features available with storage classes.
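    As a sketch of one common way to expose a direct NFS mount to workloads in Kubernetes, a PersistentVolume backed by an NFS export plus a matching claim; the server address, export path, and sizes are placeholders:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: nfs-shared-data
    spec:
      capacity:
        storage: 500Gi
      accessModes:
        - ReadWriteMany
      nfs:
        server: <nfs-server-address>      # placeholder NFS server
        path: /exports/datasets           # placeholder export path
      persistentVolumeReclaimPolicy: Retain
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: nfs-shared-data
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: ""                # bind to the pre-created PV rather than a storage class
      resources:
        requests:
          storage: 500Gi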





    Workload Assets

    NVIDIA Run:ai assets are preconfigured building blocks that simplify the workload submission effort and remove the complexities of Kubernetes and networks for AI practitioners.

    Workload assets enable organizations to:

    • Create and reuse preconfigured setups for code, data, storage and resources, to be used by AI practitioners to simplify the process of submitting workloads

    • Share the preconfigured setups with a wide audience of AI practitioners with similar needs




    Note

    • The creation of assets is possible only via API and the NVIDIA Run:ai UI.

    • The submission of workloads using assets is possible only via the NVIDIA Run:ai UI.

    Workload Asset Types

    There are four workload asset types used by the workload:

    • Environments - The container image, tools and connections for the workload

    • Data sources - The type of data, its origin and the target storage location, such as PVCs or cloud storage buckets where datasets are stored

    • Compute resources - The compute specification, including GPU and CPU compute and memory

    • Credentials - The secrets to be used to access sensitive data, services, and applications such as a docker registry or S3 buckets

    Asset Scope

    When a workload asset is created, a scope is required. The scope defines who in the organization can view and/or use the asset.

    Note

    When an asset is created via API, the scope can be the entire account. This is currently an experimental feature.

    Who Can Create an Asset?

    Any subject (user, application, or SSO group) with a role that has permissions to Create an asset, can do so within their scope.

    Who Can Use an Asset?

    Assets are used when submitting workloads. Any subject (user, application or SSO group) with a role that has permissions to Create workloads, can also use assets.

    Who Can View an Asset?

    Any subject (user, application, or SSO group) with a role that has permission to View an asset, can do so within their scope.

    Applications

    This section explains the procedure to manage your organization's applications.

    Applications are used for API integrations with NVIDIA Run:ai. An application contains a client ID and a client secret. With the client credentials, you can obtain a token as detailed in API authentication and use it within subsequent API calls.

    Applications are assigned with access rules to manage permissions. For example, application ci-pipeline-prod is assigned with a Researcher role in Cluster: A.

    Applications Table

    The Applications table can be found under Access in the NVIDIA Run:ai platform.

    The Applications table provides a list of all the applications defined in the platform, and allows you to manage them.

    The Applications table consists of the following columns:

    • Application - The name of the application

    • Client ID - The client ID of the application

    • Access rule(s) - The access rules assigned to the application

    • Last login - The timestamp for the last time the user signed in

    • Created by - The user who created the application

    • Creation time - The timestamp for when the application was created

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

    Creating an Application

    To create an application:

    1. Click +NEW APPLICATION

    2. Enter the application’s name

    3. Click CREATE

    4. Copy the Client ID and Client secret and store them securely

    5. Click DONE

    Note

    The client secret is visible only at the time of creation. It cannot be recovered but can be regenerated.

    Adding an Access Rule to an Application

    To create an access rule:

    1. Select the application you want to add an access rule for

    2. Click ACCESS RULES

    3. Click +ACCESS RULE

    4. Select a role

    5. Select a scope

    6. Click SAVE RULE

    7. Click CLOSE

    Deleting an Access Rule from an Application

    To delete an access rule:

    1. Select the application you want to remove an access rule from

    2. Click ACCESS RULES

    3. Find the access rule you would like to delete

    4. Click on the trash icon

    5. Click CLOSE

    Regenerating a Client Secret

    To regenerate a client secret:

    1. Locate the application whose client secret you want to regenerate

    2. Click REGENERATE CLIENT SECRET

    3. Click REGENERATE

    4. Copy the New client secret and store it securely

    5. Click DONE

    Important

    Regenerating a client secret revokes the previous one.

    Deleting an Application

    1. Select the application you want to delete

    2. Click DELETE

    3. On the dialog, click DELETE to confirm

    Using API

    Go to the Applications and Access rules API references to view the available actions.

    Logs Collection

    This section provides instructions for IT administrators on collecting NVIDIA Run:ai logs for support, including prerequisites, CLI commands, and log file retrieval. It also covers enabling verbose logging for Prometheus and the NVIDIA Run:ai Scheduler.

    Collect Logs to Send to Support

    To collect NVIDIA Run:ai logs, follow these steps:

    Prerequisites
    • Ensure that you have administrator-level access to the Kubernetes cluster where NVIDIA Run:ai is installed.

    • The NVIDIA Run:ai Administrator Command-Line Interface (CLI) must be installed.

    Step-by-step Instructions

    1. Run the Command from your local machine or a Bastion Host (secure server). Open a terminal on your local machine (or any machine that has network access to the Kubernetes cluster) where the NVIDIA Run:ai Administrator CLI is installed.

    2. Collect the Logs. Run the Administrator CLI log collection command (see the sketch after these steps). This command gathers all relevant NVIDIA Run:ai logs from the system and generates a compressed file.

    3. Locate the Generated File. After running the command, note the location of the generated compressed log file. You can retrieve and send this file to NVIDIA Run:ai Support for further troubleshooting.
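    The collection command referenced in step 2 is part of the NVIDIA Run:ai Administrator CLI. The command name below is an assumption; confirm it with the CLI help for your version before running:

    # Assumed command name; run 'runai-adm --help' to confirm for your CLI version
    runai-adm collect-logs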

    Note

    The tar file packages the logs of NVIDIA Run:ai components only. It does not include logs of researcher containers that may contain private information.

    Logs Verbosity

    Increase log verbosity to capture more detailed information, providing deeper insights into system behavior and making it easier to identify and resolve issues.

    Prerequisites

    Before you begin, ensure you have the following:

    • Access to the Kubernetes cluster where NVIDIA Run:ai is installed

      • Including necessary permissions to view and modify configurations.

    • kubectl installed and configured:

      • The Kubernetes command-line tool, kubectl, must be installed and configured to interact with the cluster.

      • Sufficient privileges to edit configurations and view logs.

    • Monitoring Disk Space

      • When enabling verbose logging, ensure adequate disk space to handle the increased log output, especially when enabling debug or high verbosity levels.

    Adding Verbosity

    Adding verbosity to Prometheus

    To increase the logging verbosity for Prometheus, follow these steps:

    1. Edit the RunaiConfig to adjust the Prometheus log level (see the sketch after these steps).

    2. In the configuration that opens, add or modify the Prometheus section to set the log level to debug.

    3. Save the changes. To view the Prometheus logs with the new verbosity level, stream the last 100 lines of logs from the Prometheus pod; this provides detailed information useful for debugging.
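    A sketch of the commands referenced in the steps above. The runaiconfig object name and namespace shown are the common defaults, the exact key that controls the Prometheus log level should be taken from the advanced cluster configuration reference, and the pod name is a placeholder:

    # Open the cluster configuration for editing (default object name and namespace assumed)
    kubectl edit runaiconfig runai -n runai

    # After saving, stream the last 100 lines of the Prometheus pod logs (namespace may differ in your installation)
    kubectl logs -n runai <prometheus-pod-name> --tail 100 -f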

    Adding verbosity to the Scheduler

    To enable extended logging for the NVIDIA Run:ai scheduler:

    1. Edit the RunaiConfig (using the same edit command shown in the sketch above) to adjust scheduler verbosity.

    2. Add or modify the verbosity setting under the scheduler section of the configuration.

      This increases the verbosity level of the scheduler logs to provide more detailed output.

    Warning: Enabling verbose logging can significantly increase disk space usage. Monitor your storage capacity and adjust the verbosity level as necessary.

    Authentication and Authorization

    NVIDIA Run:ai authentication and authorization enables a streamlined experience for the user with precise controls covering the data each user can see and the actions each user can perform in the NVIDIA Run:ai platform.

    Authentication verifies user identity during login, and authorization assigns the user with specific permissions according to the assigned access rules.

    Authenticated access is required to use all aspects of the NVIDIA Run:ai interfaces, including the NVIDIA Run:ai platform, the NVIDIA Run:ai Command Line Interface (CLI) and APIs.

    Authentication

    There are multiple methods to authenticate and access NVIDIA Run:ai.

    Single Sign-On (SSO)

    NVIDIA Run:ai supports three methods to set up SSO:

    • SAML

    • OpenID Connect (OIDC)

    • OpenShift

    When using SSO, it is highly recommended to manage at least one local user, as a breakglass account (an emergency account), in case access to SSO is not possible.

    Username and Password

    Username and password access can be used when SSO integration is not possible.

    Secret Key (for Application Programmatic Access)

    Secret is the authentication method for Applications. Applications use the NVIDIA Run:ai APIs to perform automated tasks including scripts and pipelines based on their assigned access rules.

    Authorization

    The NVIDIA Run:ai platform uses Role-Based Access Control (RBAC) to manage authorization. Once a user or an application is authenticated, they can perform actions according to their assigned access rules.

    Role Based Access Control (RBAC) in NVIDIA Run:ai

    While Kubernetes RBAC is limited to a single cluster, NVIDIA Run:ai expands the scope of Kubernetes RBAC, making it easy for administrators to manage access rules across multiple clusters.

    RBAC at NVIDIA Run:ai is configured using access rules. An access rule is the assignment of a role to a subject in a scope: <Subject> is a <Role> in a <Scope>.

    • Subject

      • A user, a group, or an application assigned with the role

    • Role

      • A set of permissions that can be assigned to subjects. Roles at NVIDIA Run:ai are system defined and cannot be created, edited or deleted.

      • A permission is a set of actions (view, edit, create and delete) over an NVIDIA Run:ai entity (e.g. projects, workloads, users). For example, a role might allow a user to create and read Projects, but not update or delete them.

    • Scope

      • A scope is part of an organization in which a set of permissions (roles) is effective. Scopes include Projects, Departments, Clusters, Account (all clusters).

    Below is an example of an access rule: [email protected] is a Department admin in Department: A

    Monitor Performance and Health

    Before You Start

    NVIDIA Run:ai provides metrics and telemetry for both physical cluster entities, such as clusters, nodes, and node pools, and organizational entities, such as departments and projects. Metrics represent over-time data while telemetry represents current analytics data. This data is essential for monitoring and analyzing the performance and health of your platform.

    Consuming Metrics and Telemetry Data

    Users can consume the data based on their permissions:

    1. API - Access the data programmatically through the NVIDIA Run:ai API.

    2. CLI - Use the NVIDIA Run:ai Command Line Interface to query and manage the data.

    3. UI - Visualize the data through the NVIDIA Run:ai user interface.

    API

    • Metrics API - Access over-time detailed analytics data programmatically.

    • Telemetry API - Access current analytics data programmatically.

    Refer to metrics and telemetry to see the full list of supported metrics and telemetry APIs.

    CLI

    Use the list and describe commands to fetch and manage the data. See the CLI reference for more details. For example, you can describe a specific workload to view its telemetry, or list projects and view their telemetry and metrics.

    UI Views

    Refer to metrics and telemetry to see the full list of supported metrics and telemetry.

    • Overview dashboard - Provides a high-level summary of the cluster's health and performance, including key metrics such as GPU utilization, memory usage, and node status. Allows administrators to quickly identify any potential issues or areas for optimization. Offers advanced analytics capabilities for analyzing GPU usage patterns and identifying trends. Helps administrators optimize resource allocation and improve cluster efficiency.

    • Quota management - Enables administrators to monitor and manage GPU quotas across the cluster. Includes features for setting and adjusting quotas, tracking usage, and receiving alerts when quotas are exceeded.

    • Workload visualizations - Provides detailed insights into the resource usage and utilization of each GPU in the cluster. Includes metrics such as GPU memory utilization, core utilization, and power consumption. Allows administrators to identify GPUs that are under-utilized or overloaded.

    • Node and node pool visualizations - Similar to workload visualizations, but focused on the resource usage and utilization of each GPU within a specific node or node pool. Helps administrators identify potential issues or bottlenecks at the node level.

    • Advanced NVIDIA metrics - Provides access to a range of advanced NVIDIA metrics, such as GPU temperature, fan speed, and voltage. Enables administrators to monitor the health and performance of GPUs in greater detail. This data is available at the node and workload level. To enable these metrics, contact NVIDIA Run:ai customer support.


    Upgrade

    Before Upgrade

    Before proceeding with the upgrade, it's crucial to apply the specific prerequisites associated with your current version of NVIDIA Run:ai and every version in between up to the version you are upgrading to.

    Helm

    NVIDIA Run:ai requires Helm 3.14 or later. Before you continue, validate your installed Helm client version. To install or upgrade Helm, see the official Helm documentation. If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai tar file contains the Helm binary.

    Software Files

    Run the following commands to add the NVIDIA Run:ai Helm repository and browse the available versions:
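    A sketch of the repository commands, assuming the repository name used elsewhere in this guide (runai-backend); the repository URL is supplied by NVIDIA Run:ai and is shown here as a placeholder:

    helm repo add runai-backend <RUNAI_HELM_REPO_URL>   # URL provided by NVIDIA Run:ai
    helm repo update
    helm search repo runai-backend/control-plane --versions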

    Run the following command to browse all available air-gapped packages using the token provided by NVIDIA Run:ai.

    To download and extract a specific version, and to upload the container images to your private registry, see the software artifacts section.

    Upgrade Control Plane

    System and Network Requirements

    Before upgrading the NVIDIA Run:ai control plane, validate that the latest system and network requirements are met, as they can change from time to time.

    Upgrade

    Note

    To upgrade to a specific version, modify the --version flag by specifying the desired <VERSION>. You can find all available versions by using the helm search repo runai-backend/control-plane --versions command.

    Upgrading from Version 2.16

    You must perform a two-step upgrade:

    1. Upgrade to version 2.18.

    2. Then upgrade to the required version (see the sketch below).

    Upgrading from Version 2.17 or Later

    If your current version is 2.17 or higher, you can upgrade directly to the required version:
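    A minimal sketch of the upgrade command, assuming the default release name and namespace (runai-backend) used elsewhere in this guide; add any values files or --set flags your installation already relies on:

    helm upgrade runai-backend runai-backend/control-plane -n runai-backend \
      --version "<VERSION>" --reuse-values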

    Upgrade Cluster

    System and Network Requirements

    Before upgrading the NVIDIA Run:ai cluster, validate that the latest system and network requirements are met, as they can change from time to time.

    Note

    It is highly recommended to upgrade the Kubernetes version together with the NVIDIA Run:ai cluster version, to ensure compatibility with the latest supported versions.

    Getting Installation Instructions

    Follow the setup and installation steps below to get the instructions for upgrading the NVIDIA Run:ai cluster.

    Note

    To upgrade to a specific version, modify the --version flag by specifying the desired <VERSION>. You can find all available versions by using the helm search repo runai/runai-cluster --versions command.

    Setup

    1. In the NVIDIA Run:ai UI, go to Clusters

    2. Select the cluster you want to upgrade

    3. Click INSTALLATION INSTRUCTIONS

    4. Optional: Select the NVIDIA Run:ai cluster version (latest, by default)

    Installation Instructions

    1. Follow the installation instructions and run the Helm commands provided on your Kubernetes cluster. If you encounter issues, see the troubleshooting section below.

    2. Click DONE

    3. Once installation is complete, validate that the cluster is Connected and listed with the new cluster version. Once you have done this, the cluster is upgraded to the latest version.

    Troubleshooting

    If you encounter an issue with the cluster upgrade, use the troubleshooting scenarios below.

    Installation Fails

    If the NVIDIA Run:ai cluster upgrade fails, check the installation logs to identify the issue.

    Cluster Status

    If the NVIDIA Run:ai cluster upgrade completes, but the cluster status does not show as Connected, refer to the cluster troubleshooting scenarios.

    Users

    This section explains the procedure to manage users and their permissions.

    Users can be managed locally, or via the identity provider (IdP), while assigned with access rules to manage permissions. For example, user [email protected] is a department admin in department A.

    Users Table

    The Users table can be found under Access in the NVIDIA Run:ai platform.

    The users table provides a list of all the users in the platform. You can manage users and user permissions (access rules) for both local and SSO users.

    Single Sign-On Users

    SSO users are managed by the identity provider and appear once they have signed in to NVIDIA Run:ai.

    The Users table consists of the following columns:

    Column
    Description

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    Creating a Local User

    To create a local user:

    1. Click +NEW LOCAL USER

    2. Enter the user’s Email address

    3. Click CREATE

    4. Review and copy the user’s credentials:

    Note

    The temporary password is visible only at the time of user’s creation and must be changed after the first sign-in.

    Adding an Access Rule to a User

    To create an access rule:

    1. Select the user you want to add an access rule for

    2. Click ACCESS RULES

    3. Click +ACCESS RULE

    4. Select a role

    5. Select a scope

    6. Click SAVE RULE

    7. Click CLOSE

    Deleting a User’s Access Rule

    To delete an access rule:

    1. Select the user you want to remove an access rule from

    2. Click ACCESS RULES

    3. Find the access rule assigned to the user you would like to delete

    4. Click on the trash icon

    Resetting a User's Password

    To reset a user’s password:

    1. Select the user whose password you want to reset

    2. Click RESET PASSWORD

    3. Click RESET

    4. Review and copy the user’s credentials:

    Deleting a User

    1. Select the user you want to delete

    2. Click DELETE

    3. In the dialog, click DELETE to confirm

    Note

    To ensure administrative operations are always available, at least one local user with System Administrator role should exist.

    Using API

    Go to the Users and Access rules API references to view the available actions.

    External Access to Containers

    Researchers may need to access containers remotely during workload execution. Common use cases include:

    • Running a Jupyter Notebook inside the container

    • Connecting PyCharm for remote Python development

    • Viewing machine learning visualizations using TensorBoard

    To enable this access, you must expose the relevant container ports.

    Exposing Container Ports

    Accessing the containers remotely requires exposing container ports. In Docker, ports are exposed when launching the container. NVIDIA Run:ai provides similar functionality within a Kubernetes environment.

    Since Kubernetes abstracts the container's physical location, exposing ports is more complex. Kubernetes supports multiple methods for exposing container ports. For more details, refer to the official Kubernetes documentation.

    Method
    Description
    NVIDIA Run:ai Support

    Access to the Running Workload's Container

    Many tools used by researchers, such as Jupyter, TensorBoard, or VSCode, require remote access to the running workload's container. In NVIDIA Run:ai, this access is provided through dynamically generated URLs.

    Path-Based Routing

    By default, NVIDIA Run:ai uses the cluster URL provided during installation to dynamically create SSL-secured URLs in a path-based format such as https://<CLUSTER_URL>/<project-name>/<workload-name>.

    While path-based routing works with applications such as Jupyter Notebooks, it may not be compatible with other applications. Some applications assume they are running at the root file system, so hardcoded file paths and settings within the container may become invalid when running at a path other than the root. For example, if an application expects to access /etc/config.json but is served at /project-name/workspace-name, the file will not be found. This can cause the container to fail or not function as intended.

    Host-Based Routing

    NVIDIA Run:ai provides support for host-based routing. When enabled, each workload is served from its own subdomain of <CLUSTER_URL> instead of a path.

    This allows all workloads to run at the root path, avoiding file path issues and ensuring proper application behavior.

    Enabling Host-Based Routing

    To enable host-based routing, perform the following steps:

    Note

    For OpenShift, editing the runaiconfig (see the last step below) is the only step required to generate the URLs.

    1. Create a second DNS entry (A record) for *.<CLUSTER_URL>, pointing to the same IP as the cluster's existing DNS entry.

    2. Obtain a wildcard SSL certificate for this second DNS entry.

    3. Add the certificate as a secret in the cluster (see the sketch after these steps).

    4. Create an ingress rule for the wildcard host and replace <CLUSTER_URL> accordingly.

    5. Apply the ingress rule to the cluster.

    6. Edit the runaiconfig so that the URLs are generated correctly.
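    A sketch of step 3, creating the wildcard certificate secret with kubectl; the secret name and namespace are placeholders and must match what your ingress rule references:

    kubectl create secret tls <wildcard-tls-secret-name> \
      --cert=<path/to/fullchain.pem> \
      --key=<path/to/private-key.pem> \
      -n <runai-cluster-namespace>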

    Once these requirements have been met, all workloads will automatically be assigned a secured URL with a subdomain, ensuring full functionality for all researcher applications.

    Install the Control Plane

    System and Network Requirements

    Before installing the NVIDIA Run:ai control plane, validate that the system requirements and network requirements are met. For air-gapped environments, make sure you have the software artifacts prepared.

    Permissions

    As part of the installation, you will be required to install the NVIDIA Run:ai control plane Helm charts. The Helm charts require Kubernetes administrator permissions. You can review the exact objects that are created by the charts using the --dry-run flag on both Helm charts.

    Installation

    Note

    • To customize the installation based on your environment, see Customized Installation.

    • PostgreSQL and Keycloakx are installed with default usernames and passwords. To change the default credentials, see the advanced control plane configurations.

    NVIDIA Run:ai version

    It’s recommended to install the latest NVIDIA Run:ai release. If you need to install a specific version, you can browse the available versions using the following commands:

    Connected

    Run the following command:

    Air-gapped

    Run the following command to browse all available air-gapped packages using the token provided by NVIDIA Run:ai.

    To download and extract a specific version, and to upload the container images to your private registry, see the software artifacts section.

    Kubernetes

    Connected

    Run the following command. Replace <DOMAIN> in global.domain=<DOMAIN> with the domain obtained earlier:
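    A minimal sketch of the install command, built from the release name, namespace, and chart referenced elsewhere in this guide; append flags for your environment (values files, --version, --dry-run) as described in the notes below:

    helm repo add runai-backend <RUNAI_HELM_REPO_URL>   # URL provided by NVIDIA Run:ai, if not already added
    helm upgrade -i runai-backend runai-backend/control-plane \
      -n runai-backend --create-namespace \
      --set global.domain=<DOMAIN>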

    Note: To install a specific version, add --version <VERSION> to the install command. You can find available versions by running helm search repo -l runai-backend.

    Note: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation.

    Air-gapped

    To run the following command, make sure to replace the following. The custom-env.yaml file is created when preparing the air-gapped artifacts:

    1. control-plane-<VERSION>.tgz - The NVIDIA Run:ai control plane version

    2. global.domain=<DOMAIN>

    OpenShift

    Connected

    Run the following command. The <OPENSHIFT-CLUSTER-DOMAIN> is the subdomain configured for the OpenShift cluster:

    Note: To install a specific version, add --version <VERSION> to the install command. You can find available versions by running helm search repo -l runai-backend.

    Air-gapped

    To run the following command, make sure to replace the following. The custom-env.yaml file is created when preparing the air-gapped artifacts:

    1. control-plane-<VERSION>.tgz - The NVIDIA Run:ai control plane version

    2. <OPENSHIFT-CLUSTER-DOMAIN>

    Connect to NVIDIA Run:ai User Interface

    1. Open your browser and go to https://<DOMAIN> (Kubernetes) or https://runai.apps.<OpenShift-DOMAIN> (OpenShift).

    2. Log in using the default credentials:

      • User: [email protected]

      • Password: Abcd!234

    You will be prompted to change the password.

    Customized Installation

    This section explains the available configurations for customizing the NVIDIA Run:ai control plane and cluster installation.

    Control Plane Helm Chart Values

    The NVIDIA Run:ai control plane installation can be customized to support your environment via Helm values files or Helm install flags. See Advanced control plane configurations.

    Cluster Helm Chart Values

    The NVIDIA Run:ai cluster installation can be customized to support your environment via Helm values files or flags.

    These configurations are saved in the runaiconfig Kubernetes object and can be edited post-installation as needed. For more information, see Advanced cluster configurations.

    The following table lists the available Helm chart values that can be configured to customize the NVIDIA Run:ai cluster installation.

    Key
    Description

    global.customCA.enabled
    Enables the use of a custom Certificate Authority (CA) in your deployment. When set to true, the system is configured to trust a user-provided CA certificate for secure communication.

    openShift.securityContextConstraints.create
    Enables the deployment of Security Context Constraints (SCC). Disable for CIS compliance. Default: true

    controlPlane.existingSecret
    Specifies the name of the existing Kubernetes secret where the cluster’s clientSecret used for secure connection with the control plane is stored.

    controlPlane.secretKeys.clientSecret
    Specifies the key within the controlPlane.existingSecret that stores the cluster’s clientSecret used for secure connection with the control plane.

    global.image.registry (string)
    Global Docker image registry. Default: ""

    global.additionalImagePullSecrets (list)
    List of image pull secrets references. Default: []

    spec.researcherService.ingress.tlsSecret (string)
    Existing secret key where cluster TLS certificates are stored (non-OpenShift). Default: runai-cluster-domain-tls-secret

    spec.researcherService.route.tlsSecret (string)
    Existing secret key where cluster TLS certificates are stored (OpenShift only). Default: ""

    spec.prometheus.spec.image (string)
    Due to a known issue in the Prometheus Helm chart, the imageRegistry setting is ignored. To pull the image from a different registry, you can manually specify the Prometheus image reference. Default: quay.io/prometheus/prometheus

    spec.prometheus.spec.imagePullSecrets (string)
    List of image pull secrets references in the runai namespace to use for pulling Prometheus images (relevant for air-gapped installations). Default: []

    Configuring NVIDIA MIG Profiles

    NVIDIA’s Multi-Instance GPU (MIG) enables splitting a GPU into multiple logical GPU devices, each with its own memory and compute portion of the physical GPU.

    NVIDIA provides two MIG strategies:

    • Single - A GPU can be divided evenly. This means all MIG profiles are the same.

    • Mixed - A GPU can be divided into different profiles.

    The NVIDIA Run:ai platform supports running workloads using NVIDIA MIG. Administrators can set the Kubernetes nodes to their preferred MIG strategy and configure the appropriate MIG profiles for researchers and MLOps engineers to use.

    This guide explains how to configure MIG in each strategy to submit workloads. It also outlines the implications of each strategy and best practices for administrators.

    Note

    • Starting from v2.19, the Dynamic MIG feature entered a deprecation process and is now no longer supported. With Dynamic MIG, the NVIDIA Run:ai platform automatically configured MIG profiles according to on-demand user requests for different MIG profiles or memory fractions.

    • GPU fractions and memory fractions are not supported with MIG profiles.

    • Single strategy supports both NVIDIA Run:ai and third-party workloads. Mixed strategy can only be used with third-party workloads. For more details on NVIDIA Run:ai and third-party workloads, see Introduction to workloads.

    Before You Start

    To use the MIG single and mixed strategies effectively, make sure to familiarize yourself with the following NVIDIA resources:

    • NVIDIA Multi-Instance GPU

    • MIG User Guide

    • GPU Operator with MIG

    Configuring Single MIG Strategy

    When deploying MIG using the single strategy, all GPUs within a node are configured with the same profile. For example, a node might have GPUs configured with 3 MIG slices of profile type 1g.20gb, or 7 MIG slices of profile 1g.10gb. With this strategy, MIG profiles are displayed as whole GPU devices by CUDA.

    The NVIDIA Run:ai platform discovers these MIG profiles as whole GPU devices as well, ensuring MIG devices are transparent to the end user (practitioner). For example, a node that consists of 8 physical GPUs split into MIG slices, 3 × 2g.20gb slices each, is discovered by the NVIDIA Run:ai platform as a node with 24 GPU devices.

    Users can submit workloads by requesting a specific number of GPU devices (X GPUs), and NVIDIA Run:ai allocates X MIG slices (logical devices). The NVIDIA Run:ai platform deducts X GPUs from the workload’s Project quota, regardless of whether each ‘logical GPU’ represents 1/3 or 1/7 of a physical GPU device.
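    For illustration, with the single strategy each MIG slice is exposed to Kubernetes as a regular nvidia.com/gpu device, so a pod that requests 2 GPUs receives 2 MIG slices. This is a minimal sketch; the pod name and image below are placeholders:

    apiVersion: v1
    kind: Pod
    metadata:
      name: mig-single-strategy-example   # placeholder name
    spec:
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.01-py3   # placeholder image
        resources:
          limits:
            nvidia.com/gpu: 2   # allocated as 2 MIG slices on a single-strategy node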

    Configuring Mixed MIG Strategy

    When deploying MIG using the mixed strategy, each GPU in a node can be configured with a different combination of MIG profiles, such as 2 × 2g.20gb and 3 × 1g.10gb. For details on supported combinations per GPU type, refer to Supported MIG Profiles.

    In mixed strategy, physical GPU devices continue to be displayed as physical GPU devices by CUDA, and each MIG profile is shown individually. The NVIDIA Run:ai platform identifies the physical GPU devices normally, however, MIG profiles are not visible in the UI or node APIs.

    When submitting third-party workloads with this strategy, the user should explicitly specify the exact requested MIG profile (for example, nvidia.com/gpu.product: A100-SXM4-40GB-MIG-3g.20gb). The NVIDIA Run:ai Scheduler finds a node that can provide this specific profile and binds it to the workload.

    A third-party workload submitted with a MIG profile of type Xg.Ygb (e.g. 3g.40gb or 2g.20gb) is considered as consuming X GPUs. These X GPUs are deducted from the workload’s Project quota of GPUs. For example, a 3g.40gb profile deducts 3 GPUs from the associated Project’s quota, while 2g.20gb deducts 2 GPUs from the associated Project’s quota. This is done to maintain a logical ratio according to the characteristics of the MIG profile.
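    As a sketch of a third-party workload under the mixed strategy, the pod below pins the node using the nvidia.com/gpu.product label mentioned above and requests one 3g.20gb slice. The nvidia.com/mig-3g.20gb resource name is the one typically exposed by the NVIDIA device plugin in mixed mode; adjust it, the pod name, and the image to your environment:

    apiVersion: v1
    kind: Pod
    metadata:
      name: mig-mixed-strategy-example   # placeholder name
    spec:
      nodeSelector:
        nvidia.com/gpu.product: A100-SXM4-40GB-MIG-3g.20gb
      containers:
      - name: trainer
        image: nvcr.io/nvidia/pytorch:24.01-py3   # placeholder image
        resources:
          limits:
            nvidia.com/mig-3g.20gb: 1   # deducts 3 GPUs from the Project quota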

    Best Practices for Administrators

    Single Strategy

    • Configure proper and uniform sizes of MIG slices (profiles) across all GPUs within a node.

    • Set the same MIG profiles on all nodes of a single node pool.

    • Create separate node pools with different MIG profile configurations allowing users to select the pool that best matches their workloads’ needs.

    • Ensure Project quotas are allocated according to the MIG profile sizes.

    Mixed Strategy

    • Use mixed strategy with workloads that require diverse resources. Make sure to evaluate the workload requirements and plan accordingly.

    • Configure individual MIG profiles on each node by using a limited set of MIG profile combinations to minimize complexity. Make sure to evaluate your requirements and node configurations.

    • Ensure Project quotas are allocated according to the MIG profile sizes.

    Note

    Since MIG slices are a fixed size, once configured, changing MIG profiles requires administrative intervention.
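    For reference, when the GPU Operator's MIG Manager is used, an administrator typically changes a node's profile layout by relabeling the node, as in the sketch below; the profile name is only an example and must match a configuration supported by your GPU model and MIG Manager config:

    # Reconfigure all GPUs on the node to the 1g.10gb profile (example value)
    kubectl label node <node-name> nvidia.com/mig.config=all-1g.10gb --overwrite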

    NVIDIA Run:ai at Scale

    Operating NVIDIA Run:ai at scale ensures that the system can efficiently handle fluctuating workloads while maintaining optimal performance. As clusters grow, whether due to an increasing number of nodes or a surge in workload demand, NVIDIA Run:ai services must be appropriately tuned to support large-scale environments.

    This guide outlines the best practices for optimizing NVIDIA Run:ai for high-performance deployments, including NVIDIA Run:ai system services configurations, vertical scaling (adjusting CPU and memory resources) and where applicable, horizontal scaling (replicas).

    NVIDIA Run:ai Services

    Vertical Scaling

    Each of the NVIDIA Run:ai containers has default resource requirements that reflect an average customer load. With significantly larger cluster loads, certain NVIDIA Run:ai services will require more CPU and memory resources. NVIDIA Run:ai supports configuring these resources for each NVIDIA Run:ai service group separately. For instructions and more information, see NVIDIA Run:ai services resource management.

    Scheduling Services

    The scheduling services group should be scaled together with the number of nodes in the cluster and the number of workloads (running / pending) handled by the Scheduler. These resource recommendations are based on internal benchmarks performed on stressed environments:

    Scale (nodes/workloads)
    CPU (request)
    Memory (request)

    Small - 30 / 480
    1
    1GB

    Medium - 100 / 1600
    2
    2GB

    Large - 500 / 8500
    2
    7GB

    Sync and Workload Services

    The sync and workload service groups are less sensitive to scale. The recommendation for large or intensive environments is as follows:

    Scale (nodes/workloads)
    CPU (request)
    Memory (request)

    Small - 30 / 480
    1
    2GB

    Medium - 100 / 1600
    2
    10GB

    Large - 500 / 8500
    4
    24GB

    Horizontal Scaling

    By default, NVIDIA Run:ai cluster services are deployed with a single replica. For large-scale and intensive environments it is recommended to scale the NVIDIA Run:ai services horizontally by increasing the number of replicas. For more information, see NVIDIA Run:ai services replicas.

    Metrics Collection

    NVIDIA Run:ai relies on Prometheus to scrape cluster metrics and forward them to the NVIDIA Run:ai control plane. The volume of metrics generated is directly proportional to the number of nodes, workloads, and projects in the system. When operating at scale, reaching hundreds or thousands of nodes and projects, the system generates a significant volume of metrics, which can place a strain on the cluster and the network bandwidth.

    To mitigate this impact, it is recommended to tune the Prometheus configuration. See remote write tuning to read more about the tuning parameters available via the remote-write configuration, and refer to this article for optimizing Prometheus remote write performance.

    You can apply the required remote-write configurations as described in Advanced cluster configurations.

    The following example demonstrates the recommended approach in NVIDIA Run:ai for tuning Prometheus remote-write configurations:

    remoteWrite:
      queueConfig:
        capacity: 5000
        maxSamplesPerSend: 1000
        maxShards: 100

    Scaling the NVIDIA Run:ai Control Plane

    For clusters with more than 32 nodes (SuperPod and larger), increase the replica count for key control plane services to 2.

    To set the replica count, use the following NVIDIA Run:ai control plane Helm flag:

    --set <service>.replicaCount=2

    Replicas for the following services should not be increased: postgres, keycloak, grafana, thanos, nats, redoc, cluster-migrator, identity provider reconciler, settings migrator.

    For Grafana, enable autoscaling first and then set the number of minReplicas. Use the following NVIDIA Run:ai control plane Helm flags:

    --set grafana.autoscaling.enabled=true \
    --set grafana.autoscaling.minReplicas=2

    Thanos

    Thanos is the third-party metric store used by NVIDIA Run:ai to store metrics under a significant user load. Use the following NVIDIA Run:ai control plane Helm flags to increase resources for the Thanos query and receive functions:

    --set thanos.query.resources.limits.memory=3G \
    --set thanos.query.resources.requests.memory=3G \
    --set thanos.query.resources.limits.cpu=1 \
    --set thanos.query.resources.requests.cpu=1 \
    --set thanos.receive.resources.limits.memory=15G \
    --set thanos.receive.resources.requests.memory=15G \
    --set thanos.receive.resources.limits.cpu=2 \
    --set thanos.receive.resources.requests.cpu=2

    Event History

    This section provides details about NVIDIA Run:ai’s Audit log.

    The NVIDIA Run:ai control plane provides the audit log API and event history table in the NVIDIA Run:ai UI. Both reflect the same information regarding changes to business objects: clusters, projects and assets etc.

    Note

    Only system administrator users with tenant-wide permissions can access Audit log.

    Event History Table

    The Event history table can be found under Event history in the NVIDIA Run:ai UI.

    The Event history table consists of the following columns:

    Column
    Description

    Subject
    The name of the subject

    Subject type
    The user or application assigned with the role

    Source IP
    The IP address of the subject

    Date & time
    The exact timestamp at which the event occurred. Format dd/mm/yyyy for date and hh:mm am/pm for time.

    Event
    The type of the event. Possible values: Create, Update, Delete, Login

    Event ID
    Internal event ID, can be used for support purposes

    Status
    The outcome of the logged operation. Possible values: Succeeded, Failed

    Entity type
    The type of the logged business object.

    Entity name
    The name of the logged business object.

    Entity ID
    The system's internal ID of the logged business object.

    URL
    The endpoint or address that was accessed during the logged event.

    HTTP Method
    The HTTP operation method used for the request. Possible values include standard HTTP methods such as GET, POST, PUT, DELETE, indicating what kind of action was performed on the specified URL.

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    • Download table - Click MORE and then Click Download as CSV or Download as JSON

    Using the Event History Date Selector

    The Event history table saves events for the last 90 days. However, the table itself presents up to the last 30 days of information due to the potentially very high number of operations that might be logged during this period.

    To view older events, or to refine your search for more specific results or fewer results, use the time selector and change the period you search for. You can also refine your search by clicking and using ADD FILTER accordingly.

    Using API

    Go to the Audit log API reference to view the available actions. Since the amount of data is not trivial, the API is based on paging. It retrieves a specified number of items for each API call. You can get more data by using subsequent calls.
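    As an illustration only, a paged retrieval loop might look like the sketch below; the endpoint path and the offset/limit parameter names are assumptions here, so check the Audit log API reference for the actual contract:

    # Hypothetical example: page through audit events 100 at a time.
    # Endpoint path and query parameter names are assumptions; verify
    # them against the Audit log API reference before use.
    TOKEN=<api-token>
    BASE=https://<control-plane-domain>
    for OFFSET in 0 100 200; do
      curl -s -H "Authorization: Bearer $TOKEN" \
        "$BASE/api/v1/audit?offset=$OFFSET&limit=100"
    done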

    Limitations

    Submissions of workloads are not audited. As a result, the system does not track or log details of workload submissions, such as timestamps or user activity.

    Optimize Performance with Node Level Scheduler

    The Node Level Scheduler optimizes the performance of your pods and maximizes the utilization of GPUs by making optimal local decisions on GPU allocation to your pods. While the NVIDIA Run:ai Scheduler chooses the specific node for a pod, it has no visibility to the node’s GPUs' internal state. The Node Level Scheduler is aware of the local GPUs' states and makes optimal local decisions such that it can optimize both the GPU utilization and pods’ performance running on the node’s GPUs.

    This guide provides an overview of the best use cases for the Node Level Scheduler and instructions for configuring it to maximize GPU performance and pod efficiency.

    Deployment Considerations

    • While the Node Level Scheduler applies to all workload types, it will best optimize the performance of burstable workloads. Burstable workloads are workloads that use dynamic GPU fractions, giving them more GPU memory than requested, up to the specified limit.

    • Burstable workloads are always susceptible to an OOM Kill signal if the owner of the excess memory requires it back. This means that using the Node Level Scheduler with inference or training workloads may cause pod preemption.

    • Using interactive workloads with notebooks is the best use case for burstable workloads and Node Level Scheduler. These workloads behave differently since the OOM Kill signal will cause the notebooks' GPU process to exit but not the notebook itself. This keeps the interactive pod running and retrying to attach a GPU again.

    Interactive Notebooks Use Case

    This use case is one scenario that shows how Node Level Scheduler locally optimizes and maximizes GPU utilization and workspaces’ performance.

    1. The figure below (Unallocated GPU nodes) shows a node with 2 GPUs and 2 submitted workspaces:

    2. Using bin-packing, the Scheduler instructs the node to place the 2 workspaces on a single GPU, leaving the other GPU free for a workload that requires full GPU resources (Single allocated GPU node). This means GPU#2 is idle while the two workspaces can only use up to half a GPU each, even if they temporarily need more:

    3. With the Node Level Scheduler enabled, the local decision is to spread those 2 workspaces across the 2 GPUs, maximizing both workspaces’ performance and the GPUs’ utilization by allowing them to burst up to the full GPU memory and compute resources (Two allocated GPU nodes):

    4. The NVIDIA Run:ai Scheduler still sees a node with one fully empty GPU and one fully occupied GPU. When a 3rd workload is scheduled and requires a full GPU (or more than 0.5 GPU), the Scheduler schedules it to that node, and the Node Level Scheduler moves one of the workspaces to run with the other on GPU#1, as was the Scheduler’s initial plan. Moving the workspace keeps it running while the GPU process within the Jupyter notebook is killed and re-established on the other GPU, continuing to serve the workspace (Node Level Scheduler locally optimized GPU nodes):

    Using Node Level Scheduler

    The Node Level Scheduler can be enabled per node pool. To use Node Level Scheduler, follow the below steps.

    Enable on Your Cluster

    1. Enable the Node Level Scheduler at the cluster level (per cluster) by:

      1. Editing the runaiconfig as follows. For more details, see Advanced cluster configurations:

    spec:
      global:
        core:
          nodeScheduler:
            enabled: true

      2. Or, using the following kubectl patch command:

    kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' --patch '{"spec":{"global":{"core":{"nodeScheduler":{"enabled": true}}}}}'

    Enable on a Node Pool

    Note

    GPU resource optimization is disabled by default. It must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.

    Enable Node Level Scheduler on any of the node pools:

    1. Select Resources → Node pools

    2. Create a new node pool or edit an existing node pool

    3. Under the Resource Utilization Optimization tab, change the number of workloads on each GPU to any value other than Not Enforced (i.e. 2, 3, 4, 5)

    The Node Level Scheduler is now ready to be used on that node pool.

    Submit a Workload

    In order for a workload to be considered by the Node Level Scheduler for rerouting, it must be submitted with a GPU Request and Limit where the Limit is larger than the Request:

    • Enable and set dynamic GPU fractions

    • Then submit a workload using dynamic GPU fractions

    Network Requirements

    The following network requirements are for the NVIDIA Run:ai components installation and usage.

    External Access

    Set out below are the domains to whitelist and ports to open for installation, upgrade, and usage of the application and its management.

    Note

    Node Roles

    This article explains how to designate specific node roles in a Kubernetes cluster to ensure optimal performance and reliability in production deployments.

    For optimal performance in production clusters, it is essential to avoid extensive CPU usage on GPU nodes where possible. This can be done by ensuring the following:

    • NVIDIA Run:ai system-level services run on dedicated CPU-only nodes.

    • Workloads that do not request GPU resources (e.g. Machine Learning jobs) are executed on CPU-only nodes.

    NVIDIA Run:ai services are scheduled on the defined node roles by applying Kubernetes Node Affinity rules using node labels.

    Backup and Restore

    This document outlines how to back up and restore a NVIDIA Run:ai deployment, including both the NVIDIA Run:ai cluster and control plane.

    Back Up the Cluster

    Backing up the NVIDIA Run:ai cluster configurations, which are stored locally on the Kubernetes cluster, is optional; they can be backed up and restored separately. As a backup of data is not required, the backup procedure is optional and intended for advanced deployments.

    Access Rules

    This section explains the procedure to manage Access rules.

    Access rules provide users, groups, or applications privileges to system entities. An access rule is the assignment of a role to a subject in a scope: <Subject> is a <Role> in a <Scope>. For example, user [email protected] is a department admin in department A.

    Access Rules Table

    User Identity in Containers

    The identity of the user inside a container determines its access to various resources. For example, network file systems often rely on this identity to control access to mounted volumes. As a result, propagating the correct user identity into a container is crucial for both functionality and security.

    By default, containers in both Docker and Kubernetes run as the root user. This means any process inside the container has full administrative privileges, capable of modifying system files, installing packages, or changing configurations.

    While this level of access provides researchers with maximum flexibility, it conflicts with modern enterprise security practices. If the container’s root identity is propagated to external systems (e.g., network-attached storage), it can result in elevated permissions outside the container, increasing the risk of security breaches.

    To uninstall the NVIDIA Run:ai cluster, run the following command in your terminal:

    helm uninstall runai-cluster -n runai

    To remove the NVIDIA Run:ai cluster from the NVIDIA Run:ai platform, see Removing a cluster.

    Note

    Uninstall of NVIDIA Run:ai cluster from the Kubernetes cluster does not delete existing projects, departments or workloads submitted by users.

    kubectl edit runaiconfig runai -n runai
    runai-scheduler:
      args:
        verbosity: 6
    runai-adm collect-logs



    Last updated

    The last time the user was updated

    Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

    User Email

  • Temporary password to be used on first sign-in

  • Click DONE

  • Select a scope

  • Click SAVE RULE

  • Click CLOSE

  • Click CLOSE

    User Email

  • Temporary password to be used on next sign-in

  • Click DONE

  • User

    The unique identity of the user (email address)

    Type

    The type of the user - SSO / local

    Last login

    The timestamp for the last time the user signed in

    Access rule(s)

    The access rule assigned to the user

    Created By

    The user who created the user

    Creation time

    The timestamp for when the user was created

    Users
    Access rules


    Last updated

    The last time the application was updated


    NVIDIA Run:ai Controls for User Identity and Privileges

    NVIDIA Run:ai allows you to enhance security and enforce organizational policies by:

    • Controlling root access and privilege escalation within containers

    • Propagating the user identity to align with enterprise access policies

    Root Access and Privilege Escalation

    NVIDIA Run:ai supports security-related workload configurations to control user permissions and restrict privilege escalation. These options are available via the API and CLI during workload creation:

    • runAsNonRoot / --run-as-user - Force the container to run as non-root user.

    • allowPrivilegeEscalation / --allow-privilege-escalation - Allow the container to use setuid binaries to escalate privileges, even when running as a non-root user. This setting can increase security risk and should be disabled if elevated privileges are not required.
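    As a hedged CLI sketch only, the flags above could be combined at submission time roughly as follows; the workload name, image, and exact flag syntax are placeholders to validate against your CLI version:

    # Hypothetical submission: run as a non-root user and block privilege escalation.
    # Flag syntax may differ between CLI versions; check the CLI help output.
    runai workspace submit secure-notebook \
        --image jupyter/base-notebook \
        --run-as-user \
        --allow-privilege-escalation=false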

    Administrators can enforce secure defaults across the environment using Policies, ensuring consistent workload behavior aligned with organizational security practices.

    Passing User Identity

    Passing User Identity from Identity Provider

    A best practice is to store the User Identifier (UID) and Group Identifier (GID) in the organization's directory. NVIDIA Run:ai allows you to pass these values to the container and use them as the container identity. To perform this, you must set up single sign-on and perform the steps for UID/GID integration.

    Passing User Identity via UI

    It is possible to explicitly pass user identity when creating an environment or submitting a workload:

    • From the image - Use the UID/GID defined in the container image.

    • From the IdP token - Use identity attributes provided by the SSO identity provider (available only in SSO-enabled installations).

    • Custom - Manually set the User ID (UID), Group ID (GID) and supplementary groups that can run commands in the container.

    Administrators can enforce secure defaults across the environment using Policies, ensuring consistent workload behavior aligned with organizational security practices.

    Note

    It is also possible to set the above using the API or CLI.

    Using OpenShift or Gatekeeper to Provide Cluster Level Controls

    In OpenShift, Security Context Constraints (SCCs) manage pod-level security, including root access. By default, containers are assigned a random non-root UID, and flags such as --run-as-user and --allow-privilege-escalation are disabled.

    On non-OpenShift Kubernetes clusters, similar enforcement can be achieved using tools like Gatekeeper, which applies system-level policies to restrict containers from running as root.

    Enabling UID and GID on OpenShift

    By default, OpenShift restricts setting specific user and group IDs (UIDs/GIDs) in workloads through its SCCs. To allow NVIDIA Run:ai workloads to run with explicitly defined UIDs and GIDs, a cluster administrator must modify the relevant SCCs.

    To enable UID and GID assignment:

    1. Edit the runai-user-job SCC:

    oc edit scc runai-user-job

    2. Edit the runai-jupyter-notebook SCC (only required if using Jupyter environments):

    oc edit scc runai-jupyter-notebook

    3. In both SCC definitions, ensure the following sections are configured:

    runAsUser:
      type: RunAsAny
    supplementalGroups:
      type: RunAsAny

    These settings allow NVIDIA Run:ai to pass specific UID and GID values into the container, enabling compatibility with identity-aware file systems and enterprise access controls.

    Creating a Temporary Home Directory

    When containers run as a specific user, the user must have a home directory defined within the image. Otherwise, starting a shell session will fail due to the absence of a home directory.

    Since pre-creating a home directory for every possible user is impractical, NVIDIA Run:ai offers the createHomeDir / --create-home-dir option. When enabled, this flag creates a temporary home directory for the user inside the container at runtime. By default, the directory is created at /home/<username>.

    Note

    • This home directory is temporary and exists only for the duration of the container's lifecycle. Any data saved in this location will be lost when the container exits.

    • By default, this flag is set to true when --run-as-user is enabled, and false otherwise.

    Port Forwarding

    Simple port forwarding allows access to the container via local and/or remote port.

    Supported natively via Kubernetes

    NodePort

    Exposes the service on each Node’s IP at a static port (the NodePort). You’ll be able to contact the NodePort service from outside the cluster by requesting <NODE-IP>:<NODE-PORT> regardless of which node the container actually resides in.

    Supported

    LoadBalancer

    Exposes the service externally using a cloud provider’s load balancer.

    Supported via API with limited capabilities



    spec:
      prometheus:
        spec:
          logLevel: debug
    kubectl logs -n runai prometheus-runai-0 
    helm get values runai-backend -n runai-backend > runai_control_plane_values.yaml
    helm upgrade runai-backend -n runai-backend runai-backend/control-plane --version "2.18.0" -f runai_control_plane_values.yaml --reset-then-reuse-values
    helm get values runai-backend -n runai-backend > runai_control_plane_values.yaml
    helm upgrade runai-backend control-plane-2.18.0.tgz -n runai-backend -f runai_control_plane_values.yaml --reset-then-reuse-values
    helm get values runai-backend -n runai-backend > runai_control_plane_values.yaml
    helm upgrade runai-backend -n runai-backend runai-backend/control-plane --version "<VERSION>" -f runai_control_plane_values.yaml --reset-then-reuse-values
    helm get values runai-backend -n runai-backend > runai_control_plane_values.yaml
    helm upgrade runai-backend control-plane-<NEW-VERSION>.tgz -n runai-backend -f runai_control_plane_values.yaml --reset-then-reuse-values
    curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh
    https://<CLUSTER_URL>/project-name/workload-name
    https://project-name-workload-name.<CLUSTER_URL>/
    kubectl create secret tls runai-cluster-domain-star-tls-secret -n runai \    
      --cert /path/to/fullchain.pem  \ # Replace /path/to/fullchain.pem with the actual path to your TLS certificate    
      --key /path/to/private.pem # Replace /path/to/private.pem with the actual path to your private key
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: runai-cluster-domain-star-ingress
      namespace: runai
    spec:
      ingressClassName: nginx
      rules:
      - host: '*.<CLUSTER_URL>'
      tls:
      - hosts:
        - '*.<CLUSTER_URL>'
        secretName: runai-cluster-domain-star-tls-secret
    kubectl apply -f <filename>
    kubectl patch RunaiConfig runai -n runai --type="merge" \    
        -p '{"spec":{"global":{"subdomainSupport": true}}}' 
    Ensure the inbound and outbound rules are correctly applied to your firewall.

    Inbound Rules

    To allow your organization’s NVIDIA Run:ai users to interact with the cluster using the NVIDIA Run:ai Command-line interface, or access specific UI features, certain inbound ports need to be open:

    Name
    Description
    Source
    Destination
    Port

    NVIDIA Run:ai control plane

    HTTPS entrypoint

    0.0.0.0

    NVIDIA Run:ai system nodes

    443

    NVIDIA Run:ai cluster

    HTTPS entrypoint

    0.0.0.0

    NVIDIA Run:ai system nodes

    443

    Outbound Rules

    Note

    Outbound rules apply to the NVIDIA Run:ai cluster component only. If the NVIDIA Run:ai cluster is installed together with the NVIDIA Run:ai control plane, the NVIDIA Run:ai cluster FQDN refers to the NVIDIA Run:ai control plane FQDN.

    For the NVIDIA Run:ai cluster installation and usage, certain outbound ports must be open:

    Name
    Description
    Source
    Destination
    Port

    Cluster sync

    Sync NVIDIA Run:ai cluster with NVIDIA Run:ai control plane

    NVIDIA Run:ai cluster system nodes

    NVIDIA Run:ai control plane FQDN

    443

    Metric store

    Push NVIDIA Run:ai cluster metrics to NVIDIA Run:ai control plane's metric store

    NVIDIA Run:ai cluster system nodes

    NVIDIA Run:ai control plane FQDN

    443

    The NVIDIA Run:ai installation has software requirements that require additional components to be installed on the cluster. This article includes simple installation examples which can be used optionally and require the following cluster outbound ports to be open:

    Name
    Description
    Source
    Destination
    Port

    Kubernetes Registry

    Ingress Nginx image repository

    All kubernetes nodes

    registry.k8s.io

    443

    Google Container Registry

    GPU Operator, and Knative image repository

    All kubernetes nodes

    gcr.io

    443

    Internal Network

    Ensure that all Kubernetes nodes can communicate with each other across all necessary ports. Kubernetes assumes full interconnectivity between nodes, so you must configure your network to allow this seamless communication. Specific port requirements may vary depending on your network setup.

    Prerequisites

    To perform these tasks, make sure to install the NVIDIA Run:ai Administrator CLI.

    Configure Node Roles

    The following node roles can be configured on the cluster:

    • System node: Reserved for NVIDIA Run:ai system-level services.

    • GPU Worker node: Dedicated for GPU-based workloads.

    • CPU Worker node: Used for CPU-only workloads.

    System Nodes

    NVIDIA Run:ai system nodes run system-level services required to operate. This can be done via the Kubectl (recommended) or via NVIDIA Run:ai Administrator CLI.

    By default, NVIDIA Run:ai applies a node affinity rule to prefer nodes that are labeled with node-role.kubernetes.io/runai-system for system services scheduling. You can modify the default node affinity rule by:

    • Editing the spec.global.affinity configuration parameter as detailed in Advanced cluster configurations.

    • Editing the global.affinity configuration as detailed in Advanced control plane configurations for self-hosted deployments.

    Note

    • To ensure high availability and prevent a single point of failure, it is recommended to configure at least three system nodes in your cluster.

    • By default, Kubernetes master nodes are configured to prevent workloads from running on them as a best-practice measure to safeguard control plane stability. While this restriction is generally recommended, certain NVIDIA reference architectures allow adding tolerations to the NVIDIA Run:ai deployment so critical system services can run on these nodes.

    Kubectl

    To set a system role for a node in your Kubernetes cluster using Kubectl, follow these steps:

    1. Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

    2. Run one of the following commands to label the node with its role:
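    A minimal sketch of step 2, using the node-role.kubernetes.io/runai-system label mentioned above (the true value follows the same convention as the worker labels described below):

    # Mark the node as an NVIDIA Run:ai system node
    kubectl label node <node-name> node-role.kubernetes.io/runai-system=true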

    NVIDIA Run:ai Administrator CLI

    Note

    The NVIDIA Run:ai Administrator CLI only supports the default node affinity.

    To set a system role for a node in your Kubernetes cluster, follow these steps:

    1. Run the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

    2. Run one of the following commands to set or remove a node’s role:
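    A sketch of step 2 with the NVIDIA Run:ai Administrator CLI; the exact flag name for the system role is an assumption, so confirm it with the CLI help output:

    # Assumed flag name for the system role; verify with the CLI help
    runai-adm set node-role --runai-system-worker <node-name>
    # Removing the role is assumed to be symmetric
    runai-adm remove node-role --runai-system-worker <node-name>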

    The set node-role command will label the node and set relevant cluster configurations.

    Worker Nodes

    NVIDIA Run:ai worker nodes run user-submitted workloads and the system-level DaemonSets required to operate. This can be managed via Kubectl (recommended) or via the NVIDIA Run:ai Administrator CLI.

    By default, GPU workloads are scheduled on GPU nodes based on the nvidia.com/gpu.present label. When global.nodeAffinity.restrictScheduling is set to true via the Advanced cluster configurations:

    • GPU Workloads are scheduled with node affinity rule to require nodes that are labeled with node-role.kubernetes.io/runai-gpu-worker

    • CPU-only Workloads are scheduled with node affinity rule to require nodes that are labeled with node-role.kubernetes.io/runai-cpu-worker

    Kubectl

    To set a worker role for a node in your Kubernetes cluster using Kubectl, follow these steps:

    1. Validate the global.nodeAffinity.restrictScheduling is set to true in the cluster’s Configurations.

    2. Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

    3. Run one of the following commands to label the node with its role. Replace the label and value (true/false) to enable or disable GPU/CPU roles as needed:
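    For example, step 3 could look like the following sketch, using the labels listed earlier and a true/false value to enable or disable each role:

    # Designate a GPU worker node
    kubectl label node <node-name> node-role.kubernetes.io/runai-gpu-worker=true --overwrite
    # Designate a CPU-only worker node
    kubectl label node <node-name> node-role.kubernetes.io/runai-cpu-worker=true --overwrite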

    NVIDIA Run:ai Administrator CLI

    To set worker role for a node in your Kubernetes cluster via NVIDIA Run:ai Administrator CLI, follow these steps:

    1. Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

    2. Run one of the following commands to set or remove a node’s role. <node-role> must be either --gpu-worker or --cpu-worker :
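    A sketch of step 2, combining the set node-role command with the --gpu-worker / --cpu-worker flags named above (the argument order is an assumption):

    # Set a node as a GPU worker
    runai-adm set node-role --gpu-worker <node-name>
    # Set a node as a CPU-only worker
    runai-adm set node-role --cpu-worker <node-name>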

    The set node-role command will label the node and set cluster configuration global.nodeAffinity.restrictScheduling true.

    Note

    Use the --all flag to set or remove a role to all nodes.

    Save Cluster Configurations

    To back up the NVIDIA Run:ai cluster configurations:

    1. Run the following command in your terminal:

    2. Once the runaiconfig_back.yaml backup file is created, save the file externally, so that it can be retrieved later.
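    A minimal sketch of the backup command referenced in step 1, exporting the runaiconfig object to the runaiconfig_back.yaml file mentioned in step 2:

    kubectl get runaiconfig runai -n runai -o yaml > runaiconfig_back.yaml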

    Restore the Cluster

    In the event of a critical Kubernetes failure or alternatively, if you want to migrate a NVIDIA Run:ai cluster to a new Kubernetes environment, simply reinstall the NVIDIA Run:ai cluster. Once you have reinstalled and reconnected the cluster, projects, workloads and other cluster data are synced automatically. Follow the steps below to restore the NVIDIA Run:ai cluster on a new Kubernetes environment.

    Prerequisites

    Before restoring the NVIDIA Run:ai cluster, it is essential to validate that it is both disconnected and uninstalled:

    1. If the Kubernetes cluster is still available, uninstall the NVIDIA Run:ai cluster. Make sure not to remove the cluster from the control plane.

    2. Navigate to the Clusters grid in the NVIDIA Run:ai UI

    3. Locate the cluster and verify its status is Disconnected

    Re-install the Cluster

    1. Follow the NVIDIA Run:ai cluster installation instructions and ensure all prerequisites are met.

    2. If you have a backup of the cluster configurations, reload it once the installation is complete:

    3. Navigate to the Clusters grid in the NVIDIA Run:ai UI

    4. Locate the cluster and verify its status is Connected
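    The reload referenced in step 2 might look like the following sketch, applying the previously saved backup file (you may need to strip status and resourceVersion fields from the file before applying):

    kubectl apply -f runaiconfig_back.yaml -n runai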

    Restore Namespace and RoleBindings

    If your cluster configuration disables automatic namespace creation for projects, you must manually:

    • Re-create each project namespace

    • Reapply the required role bindings for access control

    For more information, see Advanced cluster configurations.

    Back Up the Control Plane

    Database Storage

    By default, NVIDIA Run:ai utilizes an internal PostgreSQL database to manage control plane data. This database resides on a Kubernetes Persistent Volume (PV). To safeguard against data loss, it's essential to implement a reliable backup strategy.

    Backup Methods

    Consider the following methods to back up the PostgreSQL database:

    • PostgreSQL logical backup - Use pg_dump to create a logical backup of the database. Replace <password> with the appropriate PostgreSQL password. For example:

    • Persistent volume backup - Back up the entire PV that stores the PostgreSQL data.

    • Third-Party backup solutions - Integrate with external backup tools that support Kubernetes and PostgreSQL to automate and manage backups effectively.
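    A hedged sketch of the pg_dump approach from the first bullet; the PostgreSQL pod name, user, and database name below are assumptions that should be adjusted to your deployment:

    # Assumed pod, user and database names; adjust to your deployment
    kubectl exec -n runai-backend runai-backend-postgresql-0 -- \
      env PGPASSWORD=<password> pg_dump -U postgres -d backend \
      > runai-backend-db-backup.sql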

    Note

    • To obtain your PGPASSWORD=<password>, run helm get values runai-backend -n runai-backend --all.

    • NVIDIA Run:ai also supports an external PostgreSQL database. If you are using an external PostgreSQL database, the above steps do not apply. For more details, see External PostgreSQL database.

    Metrics Storage

    NVIDIA Run:ai stores metrics history using Thanos. Thanos is configured to write data to a persistent volume (PV). To protect against data loss, it is recommended to regularly back up this volume.

    Deployment Configurations

    The NVIDIA Run:ai control plane installation can be customized using --set flags during Helm deployment. These configuration overrides are preserved during upgrades but are not retained if Kubernetes is uninstalled or damaged. To ensure recovery, it's recommended to back up the full set of applied Helm customizations. You can retrieve the current configuration using:
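    For example, building on the helm get values command mentioned in the note above, the full set of applied values can be saved to a file:

    helm get values runai-backend -n runai-backend --all > runai_control_plane_values.yaml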

    Restore the Control Plane

    Follow the steps below to restore the control plane including previously backed-up data and configurations:

    1. Recreate the Kubernetes environment - Begin by provisioning a new Kubernetes or OpenShift cluster that meets all NVIDIA Run:ai installation requirements.

    2. Restore Persistent Volumes - Recover the PVs and ensure these volumes are correctly reattached or restored from your backup solution:

      • PostgreSQL database - Stores control plane metadata

      • Thanos - Stores workload metrics and historical data

    3. Reinstall the control plane - Install the NVIDIA Run:ai control plane on the newly created cluster. During installation:

      • Use the saved Helm configuration overrides to preserve custom settings

      • Connect the control plane to the recovered PostgreSQL volume

      • Reconnect Thanos to the restored metrics volume

    Note

    For external PostgreSQL databases, ensure the appropriate connection details and credentials are reconfigured. See External PostgreSQL database for more details.

    The Access rules table can be found under Access in the NVIDIA Run:ai platform.

    The Access rules table provides a list of all the access rules defined in the platform and allows you to manage them.

    Flexible management

    It is also possible to manage access rules directly for a specific user, application, project, or department.

    The Access rules table consists of the following columns:

    Column
    Description

    Type

    The type of subject assigned to the access rule (user, SSO group, or application).

    Subject

    The user, SSO group, or application assigned with the role

    Role

    The role assigned to the subject

    Scope

    The scope to which the subject has access. Click the name of the scope to see the scope and its subordinates

    Authorized by

    The user who granted the access rule

    Creation time

    The timestamp for when the rule was created

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

    Adding a New Access Rule

    To add a new access rule:

    1. Click +NEW ACCESS RULE

    2. Select a subject User, SSO Group, or Application

    3. Select or enter the subject identifier:

      • User Email for a local user created in NVIDIA Run:ai or for SSO user as recognized by the IDP

      • Group name as recognized by the IDP

      • Application name as created in NVIDIA Run:ai

    4. Select a role

    5. Select a scope

    6. Click SAVE RULE

    Note

    An access rule consists of a single subject with a single role in a single scope. To assign multiple roles or multiple scopes to the same subject, multiple access rules must be added.

    Editing an Access Rule

    Access rules cannot be edited. To change an access rule, you must delete the rule, and then create a new rule to replace it.

    Deleting an Access Rule

    1. Select the access rule you want to delete

    2. Click DELETE

    3. On the dialog, click DELETE to confirm

    Viewing Your User Access Rule

    To view the assigned roles and scopes you have access to:

    1. Click the user avatar at the top right corner, then select Settings

    2. Click User details

    The list of assigned roles and scopes will be displayed.

    Using API

    Go to the Access rules API reference to view the available actions.


    Adapting AI Initiatives to Your Organization

    AI initiatives refer to advancing research, development, and implementation of AI technologies. These initiatives represent your business needs and involve collaboration between individuals, teams, and other stakeholders. AI initiatives require compute resources and a methodology to effectively and efficiently use those compute resources and split them among the different AI initiatives stakeholders. The building blocks of AI compute resources are GPUs, CPUs, and memory, which are built into nodes (servers) and can be further grouped into node pools. Nodes and node pools are part of a Kubernetes cluster.

    To manage AI initiatives in NVIDIA Run:ai you should:

    • Map your organization and initiatives to projects and optionally departments

    • Map compute resources (node pools and quotas) to projects and optionally departments

    • Assign users (e.g. AI practitioners, ML engineers, Admins) to projects and departments

    Mapping Your Organization

    The way you map your AI initiatives and organization into NVIDIA Run:ai should reflect your organization’s structure and project management practices. There are multiple options; here are 3 examples of typical ways to map your organization, initiatives, and users into NVIDIA Run:ai, but other approaches that suit your requirements are also acceptable.

    Based on Individuals

    A typical use case would be students (individual practitioners) within a faculty (business unit) - an individual practitioner may be involved in one or more initiatives. In this example, the resources are accounted for by the student (project) and aggregated per faculty (department).

    Department = business unit / Project = individual practitioner

    Based on Business Units

    A typical use case would be an AI service (business unit) split into AI capabilities (initiatives) - an individual practitioner may be involved in several initiatives. In this example, the resources are accounted for by Initiative (project) and aggregated per AI service (department).

    Department = business unit / Project = initiative

    Based on the Organizational Structure

    A typical use case would be a business unit split into teams - an individual practitioner is involved in a single team (project) but the team may be involved in several AI initiatives. In this example, the resources are accounted for by team (project) and aggregated per business unit (department).

    Department = business unit / Project = team

    Mapping Your Resources

    AI initiatives require compute resources such as GPUs and CPUs to run. Compute resources in any organization are limited, whether because the number of servers (nodes) the organization owns is limited, or because the budget for leasing cloud resources or purchasing in-house servers is limited. Every organization strives to optimize the usage of its resources by maximizing utilization and meeting all users' needs, so it must split resources according to its internal priorities and budget constraints. Even after splitting the resources, the orchestration layer should still provide fairness between resource consumers and allow access to unused resources, to minimize scenarios of idle resources.

    Another aspect of resource management is how to group your resources effectively, especially in large environments, or environments that are made of heterogeneous types of hardware, where some users need to use specific hardware types, or where other users should avoid occupying critical hardware of some users or initiatives.

    NVIDIA Run:ai assists you with all of these complex issues by allowing you to map your cluster resources to node pools, map each Project and Department to a quota allocation per node pool, and set access rights to unused resources (over quota) per node pool.

    Grouping Your Resources

    There are several reasons why you would group resources (nodes) into node pools:

    • Control the GPU type to use in heterogeneous hardware environment - in many cases, AI models can be optimized per hardware type they will use, e.g. a training workload that is optimized for H100 does not necessarily run optimally on an A100, and vice versa. Therefore segmenting into node pools, each with a different hardware type gives the AI researcher and ML engineer better control of where to run.

    • Quota control - splitting to node pools allows the admin to set specific quota per hardware type, e.g. give high priority project guaranteed access to advanced GPU hardware, while keeping lower priority project with a lower quota or even with no quota at all for that high-end GPU, but give it a “best-effort” access only (i.e. if the high priority guaranteed project is not using those resources).

    • Multi-region or multi-availability-zone cloud environments - if some or all of your clusters run on the cloud (or even on-premise) but any of your clusters uses different physical locations or different topologies (e.g. racks), you probably want to segment your resources per region/zone/topology to be able to control where to run your workloads, how much quota to assign to specific environments (per project, per department), even if all those locations are all using the same hardware type. This methodology can help in optimizing the performance of your workloads because of the superior performance of local computing such as the locality of distributed workloads, local storage etc.

    Grouping Examples

    Set out below are illustrations of different grouping options.

    Example: grouping nodes by topology

    Example: grouping nodes by hardware type

    Assigning Your Resources

    After the initial grouping of resources, it is time to associate resources with AI initiatives. This is performed by assigning quotas to projects and optionally to departments. Assigning GPU quota to a project, on a node pool basis, means that the workloads submitted by that project are entitled to use those GPUs as guaranteed resources and can use them for all workload types.

    However, what happens if the project requires more resources than its quota? This depends on the type of workloads that the user wants to submit. If the user requires more resources for non-preemptible workloads, then the quota must be increased, because non-preemptible workloads require guaranteed resources. On the other hand, if the workload is, for example, a preemptible model training workload, the project can exploit unused resources of other projects, as long as the other projects don’t need them. Over-quota access is set per project on a node-pool basis, and per department.

    Administrators can use quota allocations to prioritize resources between users, teams, and AI initiatives. The administrator can completely prevent the use of certain node pools by a project or department by setting the node pool quota to 0 and disabling over quota for that node pool, or it can keep the quota to 0 and enable over quota to that node pool and allow access based on resource availability only (e.g. unused GPUs). However, when a project with a non-zero quota needs to use those resources, the Scheduler reclaims those resources back and preempts the preemptible workloads of over quota projects. As an administrator, you can also have an impact on the amount of over quota resources a project or department uses.

    It is essential to make sure that the sum of all projects' quotas does NOT surpass that of the department, and that the sum of all departments does not surpass the number of physical resources, per node pool and for the entire cluster (we call such behavior ‘over-subscription’). The reason over-subscription is not recommended is that it may produce unexpected scheduling decisions, especially ones that preempt ‘non-preemptible’ workloads or fail to schedule workloads within quota, whether non-preemptible or preemptible, meaning quota can no longer be considered ‘guaranteed’. Admins can opt in to a system flag that helps prevent over-subscription scenarios.

    Example: assigning resources to projects

    Assigning Users to Projects and Departments

    The NVIDIA Run:ai system uses role-based access control (RBAC) to manage users’ access rights to the different objects of the system, its resources, and the set of allowed actions. To allow AI researchers, ML engineers, Project Admins, or any other stakeholder of your AI initiatives to access projects and use AI compute resources for their AI initiatives, the administrator needs to assign users to projects. After a user is assigned to a project with the proper role, e.g. ‘L1 Researcher’, the user can submit and monitor their workloads under that project. Assigning users to departments is usually done to assign a ‘Department Admin’ to manage a specific department. Other roles, such as ‘L1 Researcher’, can also be assigned to departments; this gives the researcher access to all projects within that department.

    Scopes in an Organization

    This is an example of an organization, as represented in the NVIDIA Run:ai platform:

    The organizational tree is structured top down under a single node headed by the account. The account comprises clusters, departments, and projects.

    After mapping and building your hierarchal structured organization as shown above, you can assign or associate various NVIDIA Run:ai components (e.g. workloads, roles, assets, policies, and more) to different parts of the organization - these organizational parts are the Scopes. The following organizational example consists of 5 optional scopes:

    Note

    When a scope is selected, that unit and all of its subordinates (both existing and any added in the future) are selected as well.

    Next Steps

    Now that resources are grouped into node pools, organizational units or business initiatives are mapped into projects and departments, projects’ quota parameters are set per node pool, and users are assigned to projects, you can finally submit workloads from a project and use compute resources to run your AI initiatives.

    How the Scheduler Works

    Efficient resource allocation is critical for managing AI and compute-intensive workloads in Kubernetes clusters. The NVIDIA Run:ai Scheduler enhances Kubernetes' native capabilities by introducing advanced scheduling principles such as fairness, quota management, and dynamic resource balancing. It ensures that workloads, whether simple single-pod or complex distributed tasks, are allocated resources effectively while adhering to organizational policies and priorities.

    This guide explores the NVIDIA Run:ai Scheduler’s allocation process, preemption mechanisms, and resource management. Through examples and detailed explanations, you'll gain insights into how the Scheduler dynamically balances workloads to optimize cluster utilization and maintain fairness across projects and departments.

    Allocation Process

    Pod Creation and Grouping

    When a workload is submitted, the workload controller creates a pod or pods (for distributed training workloads or deployment-based inference). When the Scheduler gets a submit request with the first pod, it creates a pod group and allocates all the relevant building blocks of that workload. The next pods of the same workload are attached to the same pod group.

    Queue Management

    A workload, with its associated pod group, is queued in the appropriate scheduling queue. In every scheduling cycle, the Scheduler ranks the order of queues by calculating their precedence for scheduling.

    Resource Binding

    The next step is for the Scheduler to find nodes for those pods, assign the pods to their nodes (bind operation), and bind other building blocks of the pods such as storage, ingress and so on. If the pod group has a gang scheduling rule attached to it, the Scheduler either allocates and binds all pods together, or puts all of them into pending state. It then retries to schedule them all together in the next scheduling cycle. The Scheduler also updates the status of the pods and their associated pod group. Users are able to track the workload submission process both in the CLI and the NVIDIA Run:ai UI. For more details on submitting and managing workloads, see Workloads.

    Preemption

    If the Scheduler cannot find resources for the submitted workload (and all of its associated pods), and the workload deserves resources either because it is under its queue quota or under its queue fairshare, the Scheduler tries to reclaim resources from other queues. If this does not solve the resource issue, the Scheduler tries to preempt lower priority preemptible workloads within the same queue (project).

    Reclaim Preemption Between Projects and Departments

    Reclaim is an inter-project and inter-department resource balancing action that takes back resources from a project or department that has used them as over quota. It returns those resources to a project (or department) that deserves them as part of its deserved quota, or to balance fairness between projects (or departments), so that no project (or department) exceeds its fairshare (its portion of the unused resources).

    This mode of operation means that a lower priority workload submitted in one project (e.g. training) can reclaim resources from a project that runs a higher priority workload (e.g. a preemptible workspace) if fairness balancing is required.

    Note

    Only preemptive workloads can go over quota as they are susceptible to reclaim (cross-projects preemption) of the over quota resources they are using. The amount of over quota resources a project can gain depends on the over quota weight or quota (if over quota weight is disabled). Departments’ over quota is always proportional to its quota.

    Priority Preemption Within a Project

    Higher priority workloads may preempt lower priority preemptible workloads within the same project/node pool queue. For example, in a project that runs a training workload that exceeds the project quota for a certain node pool, a newly submitted workspace within the same project/node pool may stop (preempt) the training workload if there are not enough over quota resources for the project within that node pool to run both workloads (e.g. the workspace using in-quota resources and the training using over quota resources).

    Note

    Workload priority applies only within the same project and does not influence workloads across different projects, where fairness determines precedence.

    Quota, Over Quota, and Fairshare

    The NVIDIA Run:ai Scheduler strives to ensure fairness between projects and between departments. This means each department and project always strives to get its deserved quota, and unused resources are split between projects according to known rules (e.g. over quota weights).

    If a project needs more resources even beyond its fairshare, and the Scheduler finds unused resources that no other project needs, this project can consume resources even beyond its fairshare.

    Some scenarios can prevent the Scheduler from fully providing deserved quota and fairness:

    • Fragmentation or other scheduling constraints such as affinities, taints, etc.

    • Some requested resources, such as GPUs and CPU memory, can be allocated, while others, like CPU cores, are insufficient to meet the request. As a result, the Scheduler will place the workload in a pending state until the required resource becomes available.

    Example of Splitting Quota

    The example below illustrates a split of quota between different projects and departments using several node pools:

    The example below illustrates how fairshare is calculated per project/node pool for the above example:

    • For each Project:

      • The over quota (OQ) portion of each project (per node pool) is calculated as:

      [(OQ-Weight) / (Σ Projects OQ-Weights)] x (Unused Resource per node pool)

      • Fairshare is calculated as the sum of quota + over quota.

    Fairshare Balancing

    The Scheduler constantly re-calculates the fairshare of each project and department per node pool, represented in the scheduler as queues, resulting in the re-balancing of resources between projects and between departments. This means that a preemptible workload that was granted resources to run in one scheduling cycle, can find itself preempted and go back to pending state while waiting for resources in the next cycle.

    A queue, representing a scheduler-managed object for each project or department per node pool, can be in one of 3 states:

    • In-quota: The queue’s allocated resources ≤ queue deserved quota. The Scheduler’s first priority is to ensure each queue receives its deserved quota.

    • Over quota but below fairshare: The queue’s deserved quota < queue’s allocated resources <= queue’s fairshare. The Scheduler tries to find and allocate more resources to queues that need resources beyond their deserved quota and up to their fairshare.

    • Over-fairshare and over quota: The queue’s fairshare < queue’s allocated resources. The Scheduler tries to allocate resources to queues that need even more resources beyond their fairshare.

    When re-balancing resources between queues of different projects and departments, the Scheduler goes in the opposite direction: it first takes resources from over-fairshare queues, then from over quota queues, and finally, in some scenarios, even from queues that are below their deserved quota.

    Next Steps

    Now that you have gained insights into how the Scheduler dynamically balances workloads to optimize cluster utilization and maintain fairness across projects and departments, you can submit workloads. Before submitting your workloads, it’s important to familiarize yourself with the following key topics:

    • Introduction to workloads - Learn what workloads are and what is supported for both NVIDIA Run:ai and third-party workloads.

    • NVIDIA Run:ai workload types - Explore the various NVIDIA Run:ai workload types available and understand their specific purposes to enable you to choose the most appropriate workload type for your needs.

    Policies and Rules

    At NVIDIA Run:ai, Administrators can access a suite of tools designed to facilitate efficient account management. This article focuses on two key features: workload policies and workload scheduling rules. These features empower admins to establish default values and implement restrictions allowing enhanced control, assuring compatibility with organizational policies, and optimizing resource usage and utilization.

    Note

    Policies V1 are still supported but require additional setup. If you have policies on clusters prior to NVIDIA Run:ai version 2.18 and upgraded to a newer version, contact NVIDIA Run:ai Support for assistance in transitioning to the new policies framework.

    Workload Policies

    A workload policy is an end-to-end solution for AI managers and administrators to control and simplify how workloads are submitted. This solution allows them to set best practices, enforce limitations, and standardize processes for the submission of workloads for AI projects within their organization. It acts as a key guideline for data scientists, researchers, ML & MLOps engineers by standardizing submission practices and simplifying the workload submission process.

    Why Use a Workload Policy?

    Implementing workload policies is essential when managing complex AI projects within an enterprise for several reasons:

    1. Resource control and management - Defining or limiting the use of costly resources across the enterprise via a centralized management system to ensure efficient allocation and prevent overuse.

    2. Setting best practices - Provide managers with the ability to establish guidelines and standards to follow, reducing errors amongst AI practitioners within the organization.

    3. Security and compliance - Define and enforce permitted and restricted actions to uphold organizational security and meet compliance requirements.

    Understanding the Mechanism

    The following sections provide details of how the workload policy mechanism works.

    Cross-Interface Enforcement

    The policy enforces workloads regardless of whether they were submitted via the UI, CLI, REST APIs, or Kubernetes YAMLs.

    Policy Types

    NVIDIA Run:ai’s policies enforce NVIDIA Run:ai workloads. The policy type is set per NVIDIA Run:ai workload type. This allows administrators to set different policies for each workload type.

    Policy type | Workload type | Kubernetes name
    Workspace | Workspace | Interactive workload
    Training: Standard | Training: Standard | Training workload
    Training: Distributed | Training: Distributed | Distributed workload
    Inference | Inference | Inference workload

    Policy Structure - Rules, Defaults, and Imposed Assets

    A policy consists of rules for limiting and controlling the values of fields of the workload. In addition to rules, some defaults allow the implementation of default values to different workload fields. These default values are not rules, as they simply suggest values that can be overridden during the workload submission.

    Furthermore, policies allow the enforcement of workload assets. For example, as an admin, you can impose a data source of type PVC to be used by any workload submitted.

    For more information, see rules, defaults, and imposed assets.

    Scope of Effectiveness

    Numerous teams working on various projects require the use of different tools, requirements, and safeguards. One policy may not suit all teams and their requirements. Hence, administrators can select the scope to cover the effectiveness of the policy. When a scope is selected, all of its subordinate units are also affected. As a result, all workloads submitted within the selected scope are controlled by the policy.

    For example, if a policy is set for Department A, all workloads submitted by any of the projects within this department are controlled.

    A scope for a policy can be:

    Note

    The policy submission to the entire account scope is supported via API only.

    The different scoping of policies also allows the breakdown of responsibility between different administrators. This allows delegation of ownership between different levels within the organization. The policies, containing rules and defaults, propagate down the organizational tree, forming an “effective” policy that enforces any workload submitted by users within the project.

    If a field is used by multiple policies at different scopes, the platform applies a reconciliation mechanism to determine which policy takes effect. Defaults of the same field can still be submitted by different organizational policies, as they are considered “soft” rules. In this case, the closest scope to the workload becomes the effective default (project default “wins” vs. department default, department default “wins” vs. cluster default, etc.). For rules, precedence depends on their type: simple rules on non-security and non-compute fields follow the same order as defaults (project > department > cluster), while strict rules on security and compute fields apply in reverse order (cluster > department > project).

    NVIDIA Run:ai policies vs. Kyverno policies

    Kyverno runs as a dynamic admission controller in a Kubernetes cluster. Kyverno receives validating and mutating admission webhook HTTP callbacks from the Kubernetes API server and applies matching policies to return results that enforce admission policies or reject requests. Kyverno policies can match resources using the resource kind, name, label selectors, and much more. For more information, see How Kyverno Works.

    Scheduling Rules

    Scheduling rules limit a researcher’s access to resources and provide a way for the admin to control resource allocation and prevent the waste of resources. Admins should use the rules to prevent GPU idleness, prevent GPU hogging, and allocate specific types of resources to different types of workloads.

    Admins can limit the duration of a workload, the duration of its idle time, or the type of nodes the workload can use. Rules are defined per project or department and apply to all workloads in that project or department. In addition, rules can be applied to a specific type of workload in a project or department (workspace, standard training, or inference). When a workload reaches the limit of a time-based rule, it is stopped. The node type rule prevents the workload from being scheduled on nodes that violate the rule limitation.

    Nodes Maintenance

    This section provides detailed instructions on how to manage both planned and unplanned node downtimes in a Kubernetes cluster running NVIDIA Run:ai. It covers all the steps to maintain service continuity and ensure the proper handling of workloads during these events.

    Prerequisites

    • Access to Kubernetes cluster - Administrative access to the Kubernetes cluster, including permissions to run kubectl commands

    • Basic knowledge of Kubernetes - Familiarity with Kubernetes concepts such as nodes, taints, and workloads

    • NVIDIA Run:ai installation - The NVIDIA Run:ai software installed and configured within your Kubernetes cluster

    • Node naming conventions - Know the names of the nodes within your cluster, as these are required when executing the commands

    Node Types

    This section distinguishes between two types of nodes within a NVIDIA Run:ai installation:

    • Worker nodes - Nodes on which AI practitioners can submit and run workloads

    • NVIDIA Run:ai system nodes - Nodes on which the NVIDIA Run:ai software runs, managing the cluster's operations

    Worker Nodes

    Worker nodes are responsible for running workloads. When a worker node goes down, either due to planned maintenance or unexpected failure, workloads ideally migrate to other available nodes or wait in the queue to be executed when possible.

    Training vs. Interactive Workloads

    The following workload types can run on worker nodes:

    • Training workloads - These are long-running processes that, in case of node downtime, can automatically move to another node.

    • Interactive workloads - These are short-lived, interactive processes that require manual intervention to be relocated to another node.

    Note

    While training workloads can be automatically migrated, it is recommended to plan maintenance and manually manage this process for a faster response, as it may take time for Kubernetes to detect a node failure.

    Planned Maintenance

    Before stopping a worker node for maintenance, perform the following steps:

    1. Prevent new workloads on the node

      To stop the Kubernetes Scheduler from assigning new workloads to the node and to safely remove all existing workloads, copy the following command to your terminal:
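      kubectl taint nodes <node-name> runai=drain:NoExecute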

      • <node-name> Replace this placeholder with the actual name of the node you want to drain

      • kubectl taint nodes This command is used to add a taint to the node, which prevents any new pods from being scheduled on it

    Unplanned Downtime

    In the event of unplanned downtime:

    1. Automatic restart - If a node fails but immediately restarts, all services and workloads automatically resume.

    2. Extended downtime

      If the node remains down for an extended period, drain the node to migrate workloads to other nodes. Copy the following command to your terminal:
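      kubectl taint nodes <node-name> runai=drain:NoExecute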

      The command works the same as in the planned maintenance section, ensuring that no workloads remain scheduled on the node while it is down.

    3. Reintegrate the node

    NVIDIA Run:ai System Nodes

    In a production environment, the services responsible for scheduling, submitting and managing NVIDIA Run:ai workloads operate on one or more NVIDIA Run:ai system nodes. It is recommended to have more than one system node to ensure high availability. If one system node goes down, another can take over, maintaining continuity. If a second system node does not exist, you must designate another node in the cluster as a temporary NVIDIA Run:ai system node to maintain operations.

    The protocols for handling planned maintenance and unplanned downtime are identical to those for worker nodes. Refer to the above section for detailed instructions.

    Rejoining a Node into the Kubernetes Cluster

    To rejoin a node to the Kubernetes cluster, follow these steps:

    1. Generate a join command on the master node

      On the master node, copy the following command to your terminal:
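      kubeadm token create --print-join-command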

      • kubeadm token create This command generates a token that can be used to join a node to the Kubernetes cluster.

      • --print-join-command This option outputs the full command that needs to be run on the worker node to rejoin it to the cluster.

    Workload Priority Control

    The workload priority management feature allows you to change the priority of a workload within a project. The priority determines the workload's position in the project scheduling queue managed by the NVIDIA Run:ai Scheduler. By adjusting the priority, you can increase the likelihood that a workload will be scheduled and preferred over others within the same project, ensuring that critical tasks are given higher priority and resources are allocated efficiently.

    You can change the priority of a workload by selecting one of the predefined values from the NVIDIA Run:ai priority dictionary. This can be done using the NVIDIA Run:ai UI, API or CLI, depending on the workload type.

    Note

    This applies only within a single project. It does not impact the scheduling queues or workloads of other projects.

    Priority Dictionary

    Workload priority is defined by selecting a string name from a predefined list in the NVIDIA Run:ai priority dictionary. Each string corresponds to a specific Kubernetes PriorityClass, which in turn determines scheduling behavior, such as whether the workload is preemptible or allowed to run over quota.

    Note

    The numeric priority levels (1 = highest, 4 = lowest) are descriptive only and are not part of the NVIDIA Run:ai priority dictionary.

    Priority Level | Name (string) | Preemption | Over Quota
    1 | inference | Non-preemptible | Not available
    2 | build | Non-preemptible | Not available
    3 | interactive-preemptible | Preemptible | Available
    4 | train | Preemptible | Available

    Preemptible vs Non-Preemptible Workloads

    • Non-preemptible workloads must run within the project’s deserved quota, cannot use over-quota resources, and will not be interrupted once scheduled.

    • Preemptible workloads can use opportunistic compute resources beyond the project’s quota but may be interrupted at any time.

    Default Priority per Workload

    Both NVIDIA Run:ai and third-party workloads are assigned a default priority. The below table shows the default priority per workload type:

    Workload Type | Default Priority
    Workspaces | build
    Training | train
    Inference | inference
    Third-party workloads | train
    NVIDIA Cloud Functions (NVCF) | inference

    Supported Priority Overrides per Workload

    Note

    Changing a workload’s priority may impact its ability to be scheduled. For example, switching a workload from a train priority (which allows over-quota usage) to build priority (which requires in-quota resources) may reduce its chances of being scheduled in cases where the required quota is unavailable.

    The below table shows the default priority listed in the previous section and the supported override options per workload:

    Workload Type
    interactive-preemptible
    build
    train
    inference

    How to Override Priority

    You can override the default priority when submitting a workload through the UI, API, or CLI depending on the workload type.

    Workspaces

    To use the override options:

    • UI: Enable "Allow the workload to exceed the project quota" when submitting a workspace

    • API: Set PriorityClass in the Workspaces API

    • CLI: Submit a workspace using the --priority flag

    Training Workloads

    To use the override options:

    • API: Set PriorityClass in the Trainings API

    • CLI: Submit training using the --priority flag

    Scheduling Rules

    This article explains the procedure to configure and manage scheduling rules.

    Scheduling rules are restrictions applied to workloads. These restrictions apply either to the resources (nodes) on which workloads can run or to the duration of the run time. Scheduling rules are set for projects or departments and apply to specific workload types. Once scheduling rules are set for a project or department, all matching workloads associated with that project or department have the restrictions applied to them as defined at the time the workload was submitted. New scheduling rules added to a project are not applied to previously created workloads associated with that project.

    There are three types of scheduling rules:

    Workload Duration (Time Limit)

    This rule limits the duration of a workload run time. Workload run time is calculated as the total time in which the workload was in status Running. You can apply a single rule per workload type - Preemptible Workspaces, Non-preemptible Workspaces, and Training.

    Idle GPU Time Limit

    This rule limits the total GPU idle time of a workload. Workload idle time is counted from the first time the workload is in status Running and the GPU was idle. Idleness is calculated by employing the runai_gpu_idle_seconds_per_workload metric. This metric determines the total duration of zero GPU utilization within each 30-second interval. If the GPU remains idle throughout the 30-second window, 30 seconds are added to the idleness sum; otherwise, the idleness count is reset. You can apply a single rule per workload type - Preemptible Workspaces, Non-preemptible Workspaces, and Training.

    Note

    To make Idle GPU timeout effective, it must be set to a shorter duration than the workload duration of the same workload type.

    Node Type (Affinity)

    Node type is used to select a group of nodes, typically with specific characteristics such as a hardware feature, storage type, fast networking interconnection, etc. The Scheduler uses node type as an indication of which nodes should be used for your workloads within this project.

    Node type is a label in the form of run.ai/type and a value (e.g. run.ai/type = dgx200) that the administrator uses to tag a set of nodes. Adding the node type to the project’s scheduling rules mandates the user to submit workloads with a node type label/value pair from this list, according to the workload type - Workspace or Training. The Scheduler then schedules workloads using a node selector, targeting nodes tagged with the NVIDIA Run:ai node type label/value pair. Node pools and a node type can be used in conjunction. For example, specifying a node pool and a smaller group of nodes from that node pool that includes a fast SSD memory or other unique characteristics.

    Labelling Nodes for Node Types Grouping

    The administrator should use a node label with the key of run.ai/type and any coupled value

    To assign a label to nodes you want to group, set the ‘node type (affinity)’ on each relevant node:

    1. Obtain the list of nodes and their current labels by copying the following to your terminal:
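      kubectl get nodes --show-labels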

    2. Annotate a specific node with a new label by copying the following to your terminal:
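      kubectl label node <node-name> run.ai/type=<value>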

    Adding a Scheduling Rule to a Project or Department

    To add a scheduling rule:

    1. Select the project/department for which you want to add a scheduling rule

    2. Click EDIT

    3. In the Scheduling rules section click +RULE

    4. Select the rule type

    Note

    You can review the defined rules in the Projects table in the relevant column.

    Editing the Scheduling Rule

    To edit a scheduling rule:

    1. Select the project/department for which you want to edit its scheduling rule

    2. Click EDIT

    3. Find the scheduling rule you would like to edit

    4. Edit the rule

    Note

    Setting scheduling rules in a department enforces the rules on all associated projects.

    Editing a scheduling rule within a project - you can only tighten a rule applied by your department admin, meaning you can set a lower time limitation, not a higher one.

    Deleting the Scheduling Rule

    To delete a scheduling rule:

    1. Select the project/department from which you want to delete a scheduling rule

    2. Click EDIT

    3. Find the scheduling rule you would like to delete

    4. Click on the x icon

    Note

    Deleting a department rule within a project - a project admin cannot delete a rule created by the department admin.

    Using API

    Go to the API reference to view the available actions

    NVIDIA Run:ai Workload Types

    In the world of machine learning (ML), the journey from raw data to actionable insights is a complex process that spans multiple stages. Each stage of the AI lifecycle requires different tools, resources, and frameworks to ensure optimal performance. NVIDIA Run:ai simplifies this process by offering specialized workload types tailored to each phase, facilitating a smooth transition across various stages of the ML workflows.

    The ML lifecycle usually begins with the experimental work on data and exploration of different modeling techniques to identify the best approach for accurate predictions. At this stage, resource consumption is usually moderate as experimentation is done on a smaller scale. As confidence grows in the model's potential and its accuracy, the demand for compute resources increases. This is especially true during the training phase, where vast amounts of data need to be processed, particularly with complex models such as large language models (LLMs), with their huge parameter sizes, that often require distributed training across multiple GPUs to handle the intensive computational load.

    Finally, once the model is ready, it moves to the inference stage, where it is deployed to make predictions on new, unseen data. NVIDIA Run:ai's workload types are designed to correspond with the natural stages of this lifecycle. They are structured to align with the specific resource and framework requirements of each phase, ensuring that AI researchers and data scientists can focus on advancing their models without worrying about infrastructure management.

    NVIDIA Run:ai offers three workload types that correspond to a specific phase of the researcher’s work:

    • Workspaces – For experimentation with data and models.

    • Training – For resource-intensive tasks such as model training and data preparation.

    • Inference – For deploying and serving the trained model.

    Workspaces: The Experimentation Phase

    The Workspace is where data scientists conduct initial research, experiment with different data sets, and test various algorithms. This is the most flexible stage in the ML lifecycle, where models and data are explored, tuned, and refined. The value of workspaces lies in the flexibility they offer, allowing the researcher to iterate quickly without being constrained by rigid infrastructure.

    • Framework flexibility

      Workspaces support a variety of machine learning frameworks, as researchers need to experiment with different tools and methods.

    • Resource requirements

      Workspaces are often lighter on resources compared to the training phase, but they still require significant computational power for data processing, analysis, and model iteration.

      Hence, by default, NVIDIA Run:ai workspaces are scheduled so that they cannot be preempted once their resources have been allocated. However, this non-preemptible state does not allow utilizing resources beyond the project’s deserved quota.

    See Running workspaces to learn more about how to submit a workspace via the NVIDIA Run:ai platform. For quick starts, see Running Jupyter Notebook using workspaces.

    Training: Scaling Resources for Model Development

    As models mature and the need for more robust data processing and model training increases, NVIDIA Run:ai facilitates this shift through the Training workload. This phase is resource-intensive, often requiring distributed computing and high-performance clusters to process vast data sets and train models.

    • Training architecture

      For training workloads NVIDIA Run:ai allows you to specify the architecture - standard or distributed. The distributed architecture is relevant for larger data sets and more complex models that require utilizing multiple nodes. For the distributed architecture, NVIDIA Run:ai allows you to specify different configurations for the master and workers and select which framework to use - PyTorch, XGBoost, MPI, TensorFlow and JAX. In addition, as part of the distributed configuration, NVIDIA Run:ai enables the researchers to schedule their distributed workloads on nodes within the same region, zone, placement group, or any other topology.

    • Resource requirements

      Training tasks demand high memory, compute power, and storage. NVIDIA Run:ai ensures that the allocated resources match the scale of the task and allows those workloads to utilize more compute resources than the project’s deserved quota. If you do not want your training workload to be preempted, make sure to request only the number of GPUs that is within your quota.

    See Standard training and Distributed training to learn more about how to submit a training workload via the NVIDIA Run:ai UI. For quick starts, see Run your first standard training and Run your first distributed training.

    Note

    Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.

    Inference: Deploying and Serving models

    Once a model is trained and validated, it moves to the Inference stage, where it is deployed to make predictions (usually in a production environment). This phase is all about efficiency and responsiveness, as the model needs to serve real-time or batch predictions to end-users or other systems.

    • Inference-specific use cases

      Naturally, inference workloads are required to change and adapt to the ever-changing demands to meet SLA. For example, additional replicas may be deployed, manually or automatically, to increase compute resources as part of a horizontal scaling approach or a new version of the deployment may need to be rolled out without affecting the running services.

    • Resource requirements

      Inference models differ in size and purpose, leading to varying computational requirements. For example, small OCR models can run efficiently on CPUs, whereas LLMs typically require significant GPU memory for deployment and serving. Inference workloads are considered production-critical and are given the highest priority to ensure compliance with SLAs. Additionally, NVIDIA Run:ai ensures that inference workloads cannot be preempted, maintaining consistent performance and reliability.

    See Deploy a custom inference workload to learn more about how to submit an inference workload via the NVIDIA Run:ai UI. For a quick start, see Run your first custom inference workload.

    Set Up SSO with OpenID Connect

    Single Sign-On (SSO) is an authentication scheme, allowing users to log in with a single pair of credentials to multiple, independent software systems.

    This article explains the procedure to configure single sign-on to NVIDIA Run:ai using the OpenID Connect protocol.

    Prerequisites

    Before you start, make sure you have the following available from your identity provider:

    Preparations

    The following section provides the information needed to prepare for a NVIDIA Run:ai installation.

    Software Artifacts

    The following software artifacts should be used when installing the control plane and cluster.

    Security Best Practices

    This guide provides actionable best practices for administrators to securely configure, operate, and manage NVIDIA Run:ai environments. Each section highlights both platform-native features and mapped Kubernetes security practices to maintain robust protection for workloads and resources.

    Security Area
    Best Practice

    Reports

    This section explains the procedure of managing reports in NVIDIA Run:ai.

    Reports allow users to access and organize large amounts of data in a clear, CSV-formatted layout. They enable users to monitor resource consumption, analyze trends, and make data-driven decisions to optimize their AI workloads effectively.

    Note

    Reports are enabled by default for SaaS. To enable this feature for self-hosted, additional configurations must be added. See .

    kubectl label nodes <node-name> node-role.kubernetes.io/runai-system=true
    kubectl label nodes <node-name> node-role.kubernetes.io/runai-system=false
    runai-adm set node-role --runai-system-worker <node-name>
    runai-adm remove node-role --runai-system-worker <node-name>
    runai-adm set node-role <node-role> <node-name>
    runai-adm remove node-role <node-role> <node-name>
    kubectl get runaiconfig runai -n runai -o yaml -o=jsonpath='{.spec}' > runaiconfig_backup.yaml
    kubectl apply -f runaiconfig_backup.yaml -n runai
    kubectl -n runai-backend exec -it runai-backend-postgresql-0 -- \
        env PGPASSWORD=<password> pg_dump -U postgres backend > cluster_name_db_backup.sql
    helm get values runai-backend -n runai-backend

    Service | Purpose | Source | Destination | Port
    Container Registry | Pull NVIDIA Run:ai images | All Kubernetes nodes | runai.jfrog.io | 443
    Hugging Face | Browse Hugging Face models | NVIDIA Run:ai control plane system nodes | huggingface.co | 443
    Helm repository | NVIDIA Run:ai Helm repository for installation | Installer machine | runai.jfrog.io | 443
    Red Hat Container Registry | Prometheus Operator image repository | All Kubernetes nodes | quay.io | 443
    Docker Hub Registry | Training Operator image repository | All Kubernetes nodes | docker.io | 443

  • Explainability and predictability - large environments are complex to understand, this becomes even more complex when an environment is loaded. To maintain users’ satisfaction and their understanding of the resources state, as well as to keep predictability of your workload chances to get scheduled, segmenting your cluster into smaller pools may significantly help.

  • Scale - NVIDIA Run:ai implementation of node pools has many benefits, one of the main of them is scale. Each node pool has its own Scheduler instance, therefore allowing the cluster to handle more nodes and schedule workloads faster when segmented into node pools vs. one large cluster. To allow your workloads to use any resource within a cluster that is split to node pools, a second-level Scheduler is in charge of scheduling workloads between node pools according to your preferences and resource availability.

  • Prevent mutual exclusion - Some AI workloads consume CPU-only resources, to prevent those workloads from consuming the CPU resources of GPU nodes and thus block GPU workloads from using those nodes, it is recommended to group CPU-only nodes into a dedicated node pool(s) and assign a quota for CPU projects to CPU node-pools only while keeping GPU node-pools with zero quota and optionally “best-effort” over quota access for CPU-only projects.


    In Project 2, we assume that out of the 36 available GPUs in node pool A, 20 GPUs are currently unused. This means either these GPUs are not part of any project’s quota, or they are part of a project’s quota but not used by any workloads of that project:

    • Project 2 over quota share:

      [(Project 2 OQ-Weight) / (Σ all Projects OQ-Weights)] x (Unused Resource within node pool A)

      [(3) / (2 + 3 + 1)] x (20) = (3/6) x 20 = 10 GPUs

    • Fairshare = deserved quota + over quota = 6 +10 = 16 GPUs. Similarly, fairshare is also calculated for CPU and CPU memory. The Scheduler can grant a project more resources than its fairshare if the Scheduler finds resources not required by other projects that may deserve those resources.

  • In Project 3, fairshare = deserved quota + over quota = 0 +3 = 3 GPUs. Project 3 has no guaranteed quota, but it still has a share of the excess resources in node pool A. The NVIDIA Run:ai Scheduler ensures that Project 3 receives its part of the unused resources for over quota, even if this results in reclaiming resources from other projects and preempting preemptible workloads.


    Last updated

    The last time the access rule was updated

    Simplified setup - Conveniently allow setting defaults and streamline the workload submission process for AI practitioners.
  • Scalability and diversity

    1. Multi-purpose clusters with various workload types that may have different requirements and characteristics for resource usage.

    2. The organization has multiple hierarchies, each with distinct goals, objectives, and degrees of flexibility.

    3. Manage multiple users and projects with distinct requirements and methods, ensuring appropriate utilization of resources.


    Select the workload type and time limitation period

  • For Node type, choose one or more labels for the desired nodes

  • Click SAVE

  • Click SAVE

    Click SAVE

  • runai=drain:NoExecute This specific taint ensures that all existing pods on the node are evicted and rescheduled on other available nodes, if possible

  • Result: The node stops accepting new workloads, and existing workloads either migrate to other nodes or are placed in a queue for later execution.

  • Shut down and perform maintenance

    After draining the node, you can safely shut it down and perform the necessary maintenance tasks.

  • Restart the node

    Once maintenance is complete and the node is back online, remove the taint to allow the node to resume normal operations. Copy the following command to your terminal:
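    kubectl taint nodes <node-name> runai=drain:NoExecute-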

    runai=drain:NoExecute- The - at the end of the command indicates the removal of the taint. This allows the node to start accepting new workloads again.

    Result: The node rejoins the cluster's pool of available resources, and workloads can be scheduled on it as usual.

  • Once the node is back online, remove the taint to allow it to rejoin the cluster's operations. Copy the following command to your terminal:
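    kubectl taint nodes <node-name> runai=drain:NoExecute-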

    Result: This action reintegrates the node into the cluster, allowing it to accept new workloads.

  • Permanent shutdown

    If the node is to be permanently decommissioned, remove it from Kubernetes with the following command:
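    kubectl delete node <node-name>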

    • kubectl delete node This command completely removes the node from the cluster

    • <node-name> Replace this placeholder with the actual name of the node

    Result: The node is no longer part of the Kubernetes cluster. If you plan to bring the node back later, it must be rejoined to the cluster using the steps outlined in the next section.

  • Result: The command outputs a kubeadm join command.
  • Run the join command on the worker node

    Copy the kubeadm join command generated from the previous step and run it on the worker node that needs to rejoin the cluster.
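    The generated join command takes the following form (the token and hash values come from the output of the previous step):

    kubeadm join <master-ip>:<master-port> \
        --token <token> \
        --discovery-token-ca-cert-hash sha256:<hash>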

    The kubeadm join command re-enrolls the node into the cluster, allowing it to start participating in the cluster's workload scheduling.

  • Verify node rejoining

    Verify that the node has successfully rejoined the cluster by running:
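    kubectl get nodes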

    • kubectl get nodes This command lists all nodes currently part of the Kubernetes cluster, along with their status

    Result: The rejoined node should appear in the list with a status of Ready

  • Re-label nodes

    Once the node is ready, ensure it is labeled according to its role within the cluster.
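    For example, to mark a node as a GPU worker (and clear a CPU worker label), the labeling commands take this form:

    kubectl label nodes <node-name> node-role.kubernetes.io/runai-gpu-worker=true
    kubectl label nodes <node-name> node-role.kubernetes.io/runai-cpu-worker=false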




  • Discovery URL - The OpenID server where the content discovery information is published.
  • ClientID - The ID used to identify the client with the Authorization Server.

  • Client Secret - A secret password that only the Client and Authorization server know.

  • Optional: Scopes - A set of user attributes to be used during authentication to authorize access to a user's details.

  • Setup

    Adding the Identity Provider

    1. Go to General settings

    2. Open the Security section and click +IDENTITY PROVIDER

    3. Select Custom OpenID Connect

    4. Enter the Discovery URL, Client ID, and Client Secret

    5. Copy the Redirect URL to be used in your identity provider

    6. Optional: Add the OIDC scopes

    7. Optional: Enter the user attributes and their value in the identity provider as shown in the below table

    8. Click SAVE

    9. Optional: Enable Auto-Redirect to SSO to automatically redirect users to your configured identity provider’s login page when accessing the platform.

    Attribute
    Default value in NVIDIA Run:ai
    Description

    User role groups

    GROUPS

    If it exists in the IDP, it allows you to assign NVIDIA Run:ai role groups via the IDP. The IDP attribute must be a list of strings or an object where the group names are the values.

    Linux User ID

    UID

    If it exists in the IDP, it allows Researcher containers to start with the Linux User UID. Used to map access to network resources such as file systems to users. The IDP attribute must be of type integer.

    Linux Group ID

    GID

    If it exists in the IDP, it allows Researcher containers to start with the Linux Group GID. The IDP attribute must be of type integer.

    Supplementary Groups

    SUPPLEMENTARYGROUPS

    Testing the Setup

    1. Log in to the NVIDIA Run:ai platform as an admin

    2. Add access rules to an SSO user defined in the IDP

    3. Open the NVIDIA Run:ai platform in an incognito browser tab

    4. On the sign-in page, click CONTINUE WITH SSO. You are redirected to the identity provider sign-in page

    5. In the identity provider sign-in page, log in with the SSO user who you granted with access rules

    6. If you are unsuccessful in signing in to the identity provider, follow the Troubleshooting section below

    Editing the Identity Provider

    You can view the identity provider details and edit its configuration:

    1. Go to General settings

    2. Open the Security section

    3. On the identity provider box, click Edit identity provider

    4. You can edit either the Discovery URL, Client ID, Client Secret, OIDC scopes, or the User attributes

    Removing the Identity Provider

    You can remove the identity provider configuration:

    1. Go to General settings

    2. Open the Security section

    3. On the identity provider card, click Remove identity provider

    4. In the dialog, click REMOVE to confirm the action

    Note

    To avoid losing access, removing the identity provider must be carried out by a local user.

    Troubleshooting

    If testing the setup was unsuccessful, try the different troubleshooting scenarios according to the error you received.

    Troubleshooting Scenarios

    Error: "403 - Sorry, we can’t let you see this page. Something about permissions…"

    Description: The authenticated user is missing permissions

    Mitigation:

    1. Validate either the user or its related group/s are assigned with access rules

    2. Validate groups attribute is available in the configured OIDC Scopes

    3. Validate the user’s groups attribute is mapped correctly

    Advanced:

    1. Open the Chrome DevTools: Right-click on page → Inspect → Console tab

    2. Run the following command to retrieve and paste the user’s token: localStorage.token;

    3. Paste the token in https://jwt.io

    4. Under the Payload section validate the values of the user’s attribute

    Error: "401 - We’re having trouble identifying your account because your email is incorrect or can’t be found."

    Description: Authentication failed because email attribute was not found.

    Mitigation:

    1. Validate email attribute is available in the configured OIDC Scopes

    2. Validate the user’s email attribute is mapped correctly

    Error: "Unexpected error when authenticating with identity provider"

    Description: User authentication failed

    Mitigation: Validate that the configured OIDC Scopes exist and match the Identity Provider’s available scopes

    Advanced: Look for the specific error message in the URL address

    Error: "Unexpected error when authenticating with identity provider (SSO sign-in is not available)"

    Description: User authentication failed

    Mitigation:

    1. Validate that the configured OIDC scope exists in the Identity Provider

    2. Validate that the configured Client Secret matches the Client Secret in the Identity Provider

    Advanced: Look for the specific error message in the URL address

    Error: "Client not found"

    Description: OIDC Client ID was not found in the Identity Provider

    Mitigation: Validate that the configured Client ID matches the Identity Provider Client ID

    Kubernetes
    Connected

    You will receive a token from NVIDIA Run:ai to access the NVIDIA Run:ai container registry. Use the following command to create the required Kubernetes secret:

    kubectl create secret docker-registry runai-reg-creds  \
    --docker-server=https://runai.jfrog.io \
    --docker-username=self-hosted-image-puller-prod \
    --docker-password=<TOKEN> \
    [email protected] \
    --namespace=runai-backend
    Air-gapped

    You will receive a token from NVIDIA Run:ai to access the NVIDIA Run:ai air-gapped installation package. Use the following commands with the token provided by NVIDIA Run:ai to download and extract the package.

    Download and Extract the Air-gapped Package

    1. Run the following command to browse all available air-gapped packages:

      curl -H "Authorization: Bearer <token>" "https://runai.jfrog.io/artifactory/api/storage/runai-airgapped-prod/?list"
    2. Run the following command to download the desired package:

      curl -L -H "Authorization: Bearer <token>" -O "https://runai.jfrog.io/artifactory/runai-airgapped-prod/runai-airgapped-package-<VERSION>.tar.gz"
    3. SSH into a node with kubectl access to the cluster and Docker installed.

    4. Extract the NVIDIA Run:ai package and replace <VERSION> in the command below and run:
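      A typical extraction command, assuming the archive name from the download step (adjust it to match the file you downloaded):

      # Example only - replace <VERSION> with the downloaded package version
      tar -xzf runai-airgapped-package-<VERSION>.tar.gz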

    Upload Images

    NVIDIA Run:ai assumes the existence of a Docker registry within your organization for hosting container images. The installation requires the network address and port for this registry (referred to as <REGISTRY_URL>).

    1. Upload images to a local Docker Registry. Set the Docker Registry address in the form of NAME:PORT (do not add https):

    2. Run the following script. You must have at least 20GB of free disk space to run. If Docker is configured to run as non-root, then sudo is not required:

    The script should create a file named custom-env.yaml which will be used during control plane installation.

    OpenShift

    Connected

    You will receive a token from NVIDIA Run:ai to access the NVIDIA Run:ai container registry. Use the following command to create the required Kubernetes secret:

    oc create secret docker-registry runai-reg-creds  \
    --docker-server=https://runai.jfrog.io \
    --docker-username=self-hosted-image-puller-prod \
    --docker-password=<TOKEN> \
    [email protected] \
    --namespace=runai-backend
    Air-gapped

    You will receive a token from NVIDIA Run:ai to access the NVIDIA Run:ai air-gapped installation package. Use the following commands with the token provided by NVIDIA Run:ai to download and extract the package.

    Download and Extract the Air-gapped Package

    1. Run the following command to browse all available air-gapped packages:
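      Assuming the same air-gapped package repository as in the Kubernetes section, the command takes this form:

      curl -H "Authorization: Bearer <token>" "https://runai.jfrog.io/artifactory/api/storage/runai-airgapped-prod/?list"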

    2. Run the following command to download the desired package:
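      curl -L -H "Authorization: Bearer <token>" -O "https://runai.jfrog.io/artifactory/runai-airgapped-prod/runai-airgapped-package-<VERSION>.tar.gz"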

    3. SSH into a node with oc access to the cluster and Docker installed.

    4. Extract the NVIDIA Run:ai package and replace <VERSION> in the command below and run:
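      A typical extraction command, assuming the archive name from the download step (adjust it to match the file you downloaded):

      # Example only - replace <VERSION> with the downloaded package version
      tar -xzf runai-airgapped-package-<VERSION>.tar.gz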

    Upload Images

    NVIDIA Run:ai assumes the existence of a Docker registry within your organization for hosting container images. The installation requires the network address and port for this registry (referred to as <REGISTRY_URL>).

    1. Upload images to a local Docker Registry. Set the Docker Registry address in the form of NAME:PORT (do not add https):

    2. Run the following script. You must have at least 20GB of free disk space to run. If Docker is configured to run as non-root, then sudo is not required:

    The script should create a file named custom-env.yaml which will be used by the control plane installation.

    Private Docker Registry (Optional)

    Kubernetes

    To access the organization's docker registry it is required to set the registry's credentials (imagePullSecret).

    Create the secret named runai-reg-creds based on your existing credentials. For more information, see Pull an Image from a Private Registry.

    OpenShift

    To access the organization's docker registry it is required to set the registry's credentials (imagePullSecret).

    Create the secret named runai-reg-creds in the runai-backend namespace based on your existing credentials. The configuration will be copied over to the runai namespace at cluster install. For more information, see Allowing pods to reference images from other secured registries.

    Set Up Your Environment

    External Postgres Database (Optional)

    If you have opted to use an external PostgreSQL database, you need to perform initial setup to ensure successful installation. Follow these steps:

    1. Create a SQL script file, edit the parameters below, and save it locally:

      • Replace <DATABASE_NAME> with a dedicated database name for NVIDIA Run:ai in your PostgreSQL database.

      • Replace <ROLE_NAME> with a dedicated role name (user) for NVIDIA Run:ai database.

      • Replace <ROLE_PASSWORD> with a password for the new PostgreSQL role.

      • Replace <GRAFANA_PASSWORD> with the password to be set for Grafana integration.

    2. Run the following command on a machine where the PostgreSQL client (psql) is installed:
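      A minimal sketch of the command, assuming the script from step 1 was saved as runai-db-setup.sql (a hypothetical file name):

      # Sketch only - adjust the script file name to the one saved in step 1
      psql -h <POSTGRESQL_HOST> -U <POSTGRESQL_USER> -f runai-db-setup.sql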

      • Replace <POSTGRESQL_HOST> with the PostgreSQL ip address or hostname.

      • Replace <POSTGRESQL_USER> with the PostgreSQL username.


    Tools and serving endpoint access control

    Control who can access tools and endpoints; restrict network exposure

    Maintenance and compliance

    Follow secure install guides, perform vulnerability scans, maintain data-privacy alignment

    Access Control (RBAC)

    NVIDIA Run:ai uses Role‑Based Access Control to define what each user, group, or application can do, and where. Roles are assigned within a scope, such as a project, department, or cluster, and permissions cover actions like viewing, creating, editing, or deleting entities. Unlike Kubernetes RBAC, NVIDIA Run:ai’s RBAC works across multiple clusters, giving you a single place to manage access rules. See Role Based Access Control (RBAC) for more details.

    Best Practices

    • Assign the minimum required permissions to users, groups and applications.

    • Segment duties using organizational scopes to restrict roles to specific projects or departments.

    • Regularly audit access rules and remove unnecessary privileges, especially admin-level roles.

    Kubernetes Connection

    NVIDIA Run:ai predefined roles are automatically mapped to Kubernetes cluster roles (also predefined by NVIDIA Run:ai). This means administrators do not need to manually configure role mappings.

    These cluster roles define permissions for the entities NVIDIA Run:ai manages and displays (such as workloads) and also apply to users who access cluster data directly through Kubernetes tools (for example, kubectl).

    Authentication and Session Management

    NVIDIA Run:ai supports several authentication methods to control platform access. You can use single sign-on (SSO) for unified enterprise logins, traditional username/password accounts if SSO isn’t an option, and API secret keys for automated application access. Authentication is mandatory for all interfaces, including the UI, CLI, and APIs, ensuring only verified users or applications can interact with your environment.

    Administrators can also configure session timeout. This refers to the period of inactivity before a user is automatically logged out. Once the timeout is reached, the session ends and re‑authentication is required, helping protect against risks from unattended or abandoned sessions. See Authentication and authorization for more details.

    Best Practices

    • Integrate corporate SSO for centralized identity management.

    • Enforce strong password policies for local accounts.

    • Set appropriate session timeout values to minimize idle session risk.

    • Prefer SSO to eliminate password management within NVIDIA Run:ai.

    Kubernetes Connection

    Configure the Kubernetes API server to validate tokens via NVIDIA Run:ai’s identity service, ensuring unified authentication across the platform. For more information, see Cluster authentication.

    Workload Policies: Enforcing Security at Submission

    Workload policies allow administrators to define and enforce how AI workloads are submitted and controlled across projects and teams. With these policies, you can set clear rules and defaults for workload parameters such as which resources can be requested, required security settings, and which defaults should apply. Policies are enforced whether workloads are submitted via the UI, CLI, API or Kubernetes YAML, and can be scoped to specific projects, departments, or clusters for fine-grained control. See Policies and rules for more details.

    Best Practices

    • Enforce containers to run as non-root by default. Define policies that set constraints and defaults for workload submissions, such as requiring non-root users or specifying minimum UID/GID. Example security fields in policies:

      • security.runAsNonRoot: true

      • security.runAsUid: 1000

      • Restrict runAsUid with canEdit: false to prevent users from overriding.

    • Require explicit user/group IDs for all workload containers.

    • Impose data source and resource usage limits through policies.

    • Use policy rules to prevent users from submitting non-compliant workloads.

    • Apply policies by organizational scope for nuanced control within departments or projects.

    Kubernetes Connection

    Map these policies to PodSecurityContext settings in Kubernetes, and enforce them with Pod Security Admission or Kyverno for stricter compliance.
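    As a minimal sketch, the policy defaults listed above correspond to a Kubernetes pod security context such as the following (the pod name, image, and UID/GID values are illustrative):

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: secured-workload            # illustrative name
    spec:
      securityContext:
        runAsNonRoot: true              # mirrors security.runAsNonRoot: true
        runAsUser: 1000                 # mirrors security.runAsUid: 1000
        runAsGroup: 1000
      containers:
        - name: main
          image: <your-image>
          securityContext:
            allowPrivilegeEscalation: false
    ```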

    Managing Namespace and Resource Creation

    NVIDIA Run:ai offers flexible controls for how namespaces and resources are created and managed within your clusters. When a new project is set up, you can choose whether Kubernetes namespaces are created automatically, and whether users are auto-assigned to those projects. There are also options to manage how secrets are propagated across namespaces and to enable or disable resource limit enforcement using Kubernetes LimitRange objects. See Advanced cluster configurations for more details.

    Best Practices

    • Require admin approval for namespace creation to avoid sprawl.

    • Limit secret propagation to essential cases only.

    • Use Kubernetes LimitRanges and ResourceQuotas alongside NVIDIA Run:ai policies for layered resource control.

    • Regularly audit and remove unused namespaces, secrets, and workloads.
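    A sketch of such layered controls in a project namespace (the namespace name and all values are illustrative):

    ```yaml
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: default-limits
      namespace: runai-team-a           # illustrative project namespace
    spec:
      limits:
        - type: Container
          defaultRequest:               # applied when a container sets no request
            cpu: 500m
            memory: 1Gi
          default:                      # applied when a container sets no limit
            cpu: "2"
            memory: 4Gi
    ---
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: team-quota
      namespace: runai-team-a
    spec:
      hard:
        requests.cpu: "32"
        requests.memory: 128Gi
        pods: "100"
    ```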

    Tools and Serving Endpoint Access Control

    NVIDIA Run:ai provides flexible options to control access to tools and serving endpoints. Access can be defined during workload submission or updated later, ensuring that only the intended users or groups can interact with the resource.

    When configuring an endpoint or tool, users can select from the following access levels:

    • Public - Everyone within the network can access with no authentication (serving endpoints).

    • All authenticated users - Access is granted to anyone in the organization who can log in (NVIDIA Run:ai or SSO).

    • Specific groups - Access is restricted to members of designated identity provider groups.

    • Specific users - Access is restricted to individual users by email or username.

    By default, network exposure is restricted, and access must be explicitly granted. Model endpoints automatically inherit RBAC and workload policy controls, ensuring consistent enforcement of role- and scope-based permissions across the platform. Administrators can also limit who can deploy, view, or manage endpoints, and should open network access only when required.

    Best Practices

    • Define explicit roles for model management/use.

    • Restrict endpoint access to authorized users, groups and applications.

    • Monitor and audit endpoint access logs.

    Kubernetes Connection

    Use Kubernetes NetworkPolicies to limit inter-pod and external traffic to model-serving pods. Pair with NVIDIA Run:ai RBAC for end-to-end control.
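    For example, a NetworkPolicy that restricts ingress to model-serving pods to traffic from within the same namespace might look like the following (namespace and labels are illustrative):

    ```yaml
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: restrict-serving-ingress
      namespace: runai-team-a             # illustrative project namespace
    spec:
      podSelector:
        matchLabels:
          app: model-server               # illustrative label on serving pods
      policyTypes:
        - Ingress
      ingress:
        - from:
            - podSelector: {}             # allow only pods in the same namespace
    ```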

    Secure Installation and Maintenance

    A secure deployment is the foundation on which all other controls rest, and NVIDIA Run:ai’s installation procedures are built to align with organizational policies such as OpenShift Security Context Constraints (SCC). See Advanced cluster configurations for more details.

    • Deploy NVIDIA Run:ai cluster following secure installation guides (including IT compliance mandates such as SCC for OpenShift).

    • Run regular security scans and patch/update NVIDIA Run:ai deployments promptly when vulnerabilities are reported.

    • Regularly review and update all security policies, both at the NVIDIA Run:ai and Kubernetes levels, to adapt to evolving risks.

    Compliance and Data Privacy

    NVIDIA Run:ai supports SaaS and self-hosted modes to satisfy a range of data security needs. The self-hosted mode keeps all models, logs, and user data entirely within your infrastructure; SaaS requires careful review of what (minimal) data is transmitted for platform operations and analytics. See the installation documentation for more details.

    • Use the self-hosted mode when full control over the environment is required - including deployment and day-2 operations such as upgrades, monitoring, backup, and metadata restore.

    • Ensure transmission to the NVIDIA Run:ai cloud is scoped (in SaaS mode) and aligns with organization policy.

    • Encrypt secrets and sensitive resources; control secret propagation.

    • Document and audit data flows for regulatory alignment.

    In summary, the key security practices by area are:

    • Access control (RBAC) - Enforce least privilege, segment roles by scope, audit regularly

    • Authentication and session management - Use SSO, token-based authentication, strong passwords, limit idle time

    • Workload policies - Require non-root, set UID/GID, block overrides, use trusted images

    • Namespace and resource management - Require namespace approval, limit secret propagation, apply quotas

    • Tools and serving endpoint access control - Control who can access tools and endpoints; restrict network exposure

    • Maintenance and compliance - Follow secure install guides, perform vulnerability scans, maintain data-privacy alignment

    Report Types

    Currently, only “Consumption Reports” are available, which provide insights into the consumption of resources such as GPU, CPU, and CPU memory across organizational units.

    Reports Table

    The Reports table can be found under Analytics in the NVIDIA Run:ai platform.

    The Reports table provides a list of all the reports defined in the platform and allows you to manage them.

    Users are able to access the reports they have generated themselves. Users with project viewing permissions throughout the tenant can access all reports within the tenant.

    The Reports table comprises the following columns:

    Column
    Description

    Report

    The name of the report

    Description

    The description of the report

    Status

    The different lifecycle phases and representation of the report condition

    Type

    The type of the report – e.g., consumption

    Created by

    The user who created the report

    Creation time

    The timestamp of when the report was created

    Reports Status

    The following table describes the reports' condition and whether they were created successfully:

    Status
    Description

    Ready

    Report is ready and can be downloaded as CSV

    Pending

    Report is in the queue and waiting to be processed

    Failed

    The report couldn’t be created

    Processing...

    The report is being created

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    Creating a New Report

    Before you start, make sure you have a project.

    To create a new report:

    1. Click +NEW REPORT

    2. Enter a name for the report (if the name already exists, you will need to choose a different one)

    3. Optional: Provide a description of the report

    4. Set the report’s data collection period

      • Start date - The date at which the report data commenced

      • End date - The date at which the report data concluded

    5. Set the report segmentation and filters

      • Filters - Filter by project or department name

      • Segment by - Data is collected and aggregated based on the segment

    6. Click CREATE REPORT

    Deleting a Report

    1. Select the report you want to delete

    2. Click DELETE

    3. On the dialog, click DELETE to confirm

    Downloading a report

    Note

    To download, the report must be in status “Ready”.

    1. Select the report you want to download

    2. Click DOWNLOAD CSV

    Enabling Reports for Self-Hosted Accounts

    Reports must be saved in a storage solution compatible with S3. To activate this feature for self-hosted accounts, the storage needs to be linked to the account. The configuration should be incorporated into two ConfigMap objects within the Control Plane.

    1. Edit the runai-backend-org-unit-service ConfigMap:

    2. Add the following lines to the file:

    3. Edit the runai-backend-metrics-service ConfigMap:

    4. Add the following lines to the file:

    5. In addition, in the same file, under the config.yaml section, add the following right after log_level: "Info":

    6. Restart the deployments:

    7. Refresh the page to see Reports under Analytics in the NVIDIA Run:ai platform.
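    For step 6, a sketch of the restart commands, assuming the deployment names match the ConfigMaps edited above (verify the exact names in the runai-backend namespace):

    ```bash
    kubectl rollout restart deployment runai-backend-org-unit-service -n runai-backend
    kubectl rollout restart deployment runai-backend-metrics-service -n runai-backend
    ```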

    Using API

    To view the available actions, go to the Reports API reference.

    Enabling reports for self-hosted accounts

    Install the Cluster

    System and Network Requirements

    Before installing the NVIDIA Run:ai cluster, validate that the system requirements and network requirements are met. For air-gapped environments, make sure you have the software artifacts prepared.

    Once all the requirements are met, it is highly recommended to use the NVIDIA Run:ai cluster preinstall diagnostics tool to:

    • Test the below requirements in addition to failure points related to Kubernetes, NVIDIA, storage, and networking

    • Look at additional components installed and analyze their relevance to a successful installation

    For more information, see the preinstall diagnostics documentation. To run the preinstall diagnostics tool, download the latest version and run:

    In an air-gapped deployment, the diagnostics image is saved, pushed, and pulled manually from the organization's registry.

    Run the binary with the --image parameter to modify the diagnostics image to be used:
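    As an illustration only (the binary name is an assumption; the --image parameter is the one referenced above):

    ```bash
    # Point the diagnostics tool at an image mirrored in the organization's registry
    ./preinstall-diagnostics --image <your-registry>/preinstall-diagnostics:<version>
    ```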

    Helm

    NVIDIA Run:ai requires Helm 3.14 or later. To install Helm, see the Helm installation guide. If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai tar file contains the required Helm binary.

    Permissions

    A Kubernetes user with the cluster-admin role is required to ensure a successful installation. For more information, see the Kubernetes documentation on RBAC authorization.

    Installation

    Note

    • To customize the installation based on your environment, see the cluster customization documentation.

    • You can store the clientSecret as a Kubernetes secret within the cluster instead of using plain text. You can then configure the installation to use it by setting the controlPlane.existingSecret value.

    Kubernetes

    Connected

    Follow the steps below to add a new cluster.

    Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented, until the cluster is created.

    If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

    1. In the NVIDIA Run:ai platform, go to Resources

    Air-gapped

    Follow the steps below to add a new cluster.

    Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented, until the cluster is created.

    If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

    1. In the NVIDIA Run:ai platform, go to Resources

    OpenShift

    Connected

    Follow the steps below to add a new cluster.

    Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented, until the cluster is created.

    If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

    1. In the NVIDIA Run:ai platform, go to Resources

    Air-gapped

    When creating a new cluster, select the OpenShift target platform.

    Follow the steps below to add a new cluster.

    Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented, until the cluster is created.

    If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

    1. In the NVIDIA Run:ai platform, go to Resources

    Troubleshooting

    If you encounter an issue with the installation, try the troubleshooting scenario below.

    Installation

    If the NVIDIA Run:ai cluster installation failed, check the installation logs to identify the issue. Run the following script to print the installation logs:
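    The official log-collection script is provided in the installation documentation and is not reproduced here. As a generic alternative sketch, assuming the cluster components are installed in the runai namespace, you can inspect the installer pods directly:

    ```bash
    # List the NVIDIA Run:ai cluster pods and print the logs of a failing one
    kubectl get pods -n runai
    kubectl logs -n runai <failing-pod-name>
    ```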

    Cluster Status

    If the NVIDIA Run:ai cluster installation completed, but the cluster status did not change to Connected, check the cluster troubleshooting scenarios.

    Nodes

    This section explains the procedure for managing Nodes.

    Nodes are Kubernetes elements automatically discovered by the NVIDIA Run:ai platform. Once a node is discovered by the NVIDIA Run:ai platform, an associated instance is created in the Nodes table, administrators can view the node’s relevant information, and the NVIDIA Run:ai Scheduler can use the node for scheduling.

    Nodes Table

    The Nodes table can be found under Resources in the NVIDIA Run:ai platform.

    The Nodes table displays a list of predefined nodes available to users in the NVIDIA Run:ai platform.

    Note

    • It is not possible to create additional nodes, or edit, or delete existing nodes.

    • Only users with relevant permissions can view the table.

    The Nodes table consists of the following columns:

    Column
    Description

    GPU Devices for Node

    Click one of the values in the GPU devices column, to view the list of GPU devices and their parameters.

    Column
    Description

    Pods Associated with Node

    Click one of the values in the Pod(s) column, to view the list of pods and their parameters.

    Note

    This column is only viewable if your role in the NVIDIA Run:ai platform gives you read access to workloads. Even if you are allowed to view workloads, you can only view those within your allowed scope. This means there might be more pods running on this node than appear in the list you are viewing.

    Column
    Description

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    Show/Hide Details

    Click a row in the Nodes table and then click the Show details button at the upper right side of the action bar. The details screen appears, presenting the following metrics graphs:

    • GPU utilization - Per GPU graph and an average of all GPUs graph, all on the same chart, along an adjustable period allows you to see the trends of all GPUs compute utilization (percentage of GPU compute) in this node.

    • GPU memory utilization - Per GPU graph and an average of all GPUs graph, all on the same chart, along an adjustable period allows you to see the trends of all GPUs memory usage (percentage of the GPU memory) in this node.

    • CPU compute utilization - The average of all CPUs’ cores compute utilization graph, along an adjustable period allows you to see the trends of CPU compute utilization (percentage of CPU compute) in this node.

    Using API

    To view the available actions, go to the API reference.

    Cluster Authentication

    To allow users to securely submit workloads using kubectl, you must configure the Kubernetes API server to authenticate users via the NVIDIA Run:ai identity provider. This is done by adding OpenID Connect (OIDC) flags to the Kubernetes API server configuration on each cluster.

    Retrieve Required OIDC Flags

    1. Go to General settings

    2. Navigate to Cluster authentication

    • --oidc-client-id - A client id that all tokens must be issued for.

    • --oidc-issuer-url - The URL of the NVIDIA Run:ai identity provider

    • --oidc-username-prefix - Prefix prepended to username claims to prevent clashes with existing names.

    Note

    These flags must be configured in the API server startup parameters for each cluster in your environment.

    Kubernetes Distribution-Specific Configuration

    Note

    • Azure Kubernetes Service (AKS) is not supported.

    • For other Kubernetes distributions, refer to specific instructions in the documentation.

    Vanilla Kubernetes
    1. Locate the Kubernetes API server configuration file. For vanilla Kubernetes, the configuration file is typically located at: /etc/kubernetes/manifests/kube-apiserver.yaml.

    2. Edit the file. Under the command section, add the OIDC flags retrieved above.
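    A sketch of the relevant part of kube-apiserver.yaml; the placeholder values must be replaced with the ones shown under Cluster authentication in the NVIDIA Run:ai platform:

    ```yaml
    spec:
      containers:
        - command:
            - kube-apiserver
            # ...existing flags...
            - --oidc-client-id=<client-id>
            - --oidc-issuer-url=<issuer-url>
            - --oidc-username-prefix=<username-prefix>
    ```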

    OpenShift Container Platform (OCP)

    No additional configuration is required.

    Rancher Kubernetes Engine (RKE1)
    1. Edit the cluster.yml file used by RKE1. If you're using the Rancher UI, follow the instructions in the Rancher documentation.

    2. Add the OIDC flags under the kube-api section:

    Rancher Kubernetes Engine 2 (RKE2)

    If you're using the RKE2 configuration file:

    1. Edit /etc/rancher/rke2/config.yaml.

    2. Add the OIDC flags under kube-apiserver-arg, using the format shown below:
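    A sketch of /etc/rancher/rke2/config.yaml with the flags expressed as kube-apiserver-arg entries (placeholder values as above):

    ```yaml
    kube-apiserver-arg:
      - "oidc-client-id=<client-id>"
      - "oidc-issuer-url=<issuer-url>"
      - "oidc-username-prefix=<username-prefix>"
    ```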

    Google Kubernetes Engine (GKE)

    To configure researcher authentication on GKE, use Anthos Identity Service and apply the appropriate OIDC configuration.

    1. Install Anthos Identity Service by running:

    2. Install the utility.

    Elastic Kubernetes Engine (EKS)
    1. In the AWS Console, under EKS, find your cluster.

    2. Go to Configuration and then to Authentication.

    NVIDIA Base Command Manager (BCM)
    1. Locate the Kubernetes API server configuration file. For vanilla Kubernetes, the configuration file is typically located at: /etc/kubernetes/manifests/kube-apiserver.yaml.

    2. Edit the file. Under the command section, add the OIDC flags retrieved above.

    Set Up SSO with SAML

    Single Sign-On (SSO) is an authentication scheme, allowing users to log in with a single pair of credentials to multiple, independent software systems.

    This section explains the procedure to configure SSO to NVIDIA Run:ai using the SAML 2.0 protocol.

    Prerequisites

    Before you start, make sure you have the IDP Metadata XML available from your identity provider.

    Setup

    Adding the Identity Provider

    1. Go to General settings

    2. Open the Security section and click +IDENTITY PROVIDER

    3. Select Custom SAML 2.0

    4. Select either From computer or From URL to upload your identity provider metadata file

    Attribute
    Default value in NVIDIA Run:ai
    Description

    Testing the Setup

    1. Open the NVIDIA Run:ai platform as an admin

    2. Assign access rules to an SSO user defined in the IDP

    3. Open the NVIDIA Run:ai platform in an incognito browser tab

    4. On the sign-in page click CONTINUE WITH SSO. You are redirected to the identity provider sign in page

    Editing the Identity Provider

    You can view the identity provider details and edit its configuration:

    1. Go to General settings

    2. Open the Security section

    3. On the identity provider box, click Edit identity provider

    4. You can edit either the metadata file or the user attributes

    Removing the Identity Provider

    You can remove the identity provider configuration:

    1. Go to General settings

    2. Open the Security section

    3. On the identity provider card, click Remove identity provider

    4. In the dialog, click REMOVE to confirm the action

    Note

    To avoid losing access, removing the identity provider must be carried out by a local user.

    Downloading the IDP Metadata XML File

    You can download the XML file to view the identity provider settings:

    1. Go to General settings

    2. Open the Security section

    3. On the identity provider card, click Edit identity provider

    4. In the dialog, click DOWNLOAD IDP METADATA XML FILE

    Troubleshooting

    If testing the setup was unsuccessful, try the different troubleshooting scenarios according to the error you received. If an error still occurs, check the Advanced Troubleshooting section below.

    Troubleshooting Scenarios

    Error: "Invalid signature in response from identity provider"

    Description: After trying to log in, the following message is received in the NVIDIA Run:ai login page.

    Mitigation:

    1. Go to the General settings menu

    Error: "401 - We’re having trouble identifying your account because your email is incorrect or can’t be found."

    Description: Authentication failed because email attribute was not found.

    Mitigation: Validate the user’s email attribute is mapped correctly

    Error: "403 - Sorry, we can’t let you see this page. Something about permissions…"

    Description: The authenticated user is missing permissions

    Mitigation:

    1. Validate that either the user or its related group/s are assigned with access rules

    Advanced Troubleshooting

    Validating the SAML request

    The SAML login flow can be separated into two parts:

    • NVIDIA Run:ai redirects to the IDP for log-ins using a SAML Request

    • On successful log-in, the IDP redirects back to NVIDIA Run:ai with a SAML Response

    Validate the SAML Request to ensure the SAML flow works as expected:

    Integrations

    Integration Support

    Support for third-party integrations varies. When noted below, the integration is supported out of the box with NVIDIA Run:ai. For other integrations, our Customer Success team has prior experience assisting customers with setup. In many cases, the NVIDIA Enterprise Support Portal may include additional reference documentation provided on an as-is basis.

    Tool
    Category
    NVIDIA Run:ai support details
    Additional Information

    Kubernetes Workloads Integration

    Kubernetes has several built-in resources that encapsulate running Pods. These are called Kubernetes Workloads and should not be confused with NVIDIA Run:ai workloads.

    Examples of such resources are a Deployment that manages a stateless application, or a Job that runs tasks to completion.

    A NVIDIA Run:ai workload encapsulates all the resources needed to run and creates/deletes them together. Since NVIDIA Run:ai is an open platform, it allows the scheduling of any Kubernetes Workload.

    For more information, see .

    GPU Memory Swap

    NVIDIA Run:ai’s GPU memory swap helps administrators and AI practitioners to further increase the utilization of their existing GPU hardware by improving GPU sharing between AI initiatives and stakeholders. This is done by expanding the GPU physical memory to the CPU memory, typically an order of magnitude larger than that of the GPU.

    Expanding the GPU physical memory helps the NVIDIA Run:ai system to put more workloads on the same GPU physical hardware, and to provide a smooth workload context switching between GPU memory and CPU memory, eliminating the need to kill workloads when the memory requirement is larger than what the GPU physical memory can provide.

    Benefits of GPU Memory Swap

    There are several use cases where GPU memory swap can benefit and improve the user experience and the system's overall utilization.

    Sharing a GPU Between Multiple Interactive Workloads (Notebooks)

    AI practitioners use notebooks to develop and test new AI models and to improve existing AI models. While developing or testing an AI model, notebooks use GPU resources intermittently; yet, the GPU resources requested by a notebook are pre-allocated and cannot be used by other workloads once that notebook has reserved them. To overcome this inefficiency, NVIDIA Run:ai introduced dynamic GPU fractions and the Node Level Scheduler.

    When one or more workloads require more than their requested GPU resources, there’s a high probability not all workloads can run on a single GPU because the total memory required is larger than the physical size of the GPU memory.

    With GPU memory swap, several workloads can run on the same GPU, even if the sum of their used memory is larger than the size of the physical GPU memory. GPU memory swap can swap in and out workloads interchangeably, allowing multiple workloads to each use the full amount of GPU memory. The most common scenario is for one workload to run on the GPU (for example, an interactive notebook), while other notebooks are either idle or using the CPU to develop new code (while not using the GPU). From a user experience point of view, the swap in and out is a smooth process since the notebooks do not notice that they are being swapped in and out of the GPU memory. On rare occasions, when multiple notebooks need to access the GPU simultaneously, slower workload execution may be experienced.

    Notebooks typically use the GPU intermittently; therefore, with high probability, only one workload (for example, an interactive notebook) will use the GPU at a time. The more notebooks the system puts on a single GPU, the higher the chances are that more than one notebook will require the GPU resources at the same time. Admins have a significant role here in fine-tuning the number of notebooks running on the same GPU, based on specific use patterns and required SLAs. Using the Node Level Scheduler reduces GPU access contention between different interactive notebooks running on the same node.

    Sharing a GPU Between Inference/Interactive Workloads and Training Workloads

    A single GPU can be shared between an interactive or inference workload (for example, a Jupyter notebook, image recognition services, or an LLM service) and a training workload that is not time-sensitive or delay-sensitive. At times when the inference/interactive workload uses the GPU, both training and inference/interactive workloads share the GPU resources, each running part of the time swapped-in to the GPU memory, and swapped-out into the CPU memory the rest of the time.

    Whenever the inference/interactive workload stops using the GPU, the swap mechanism swaps out the inference/interactive workload GPU data to the CPU memory. Kubernetes wise, the pod is still alive and running using the CPU. This allows the training workload to run faster when the inference/interactive workload is not using the GPU, and slower when it does, thus sharing the same resource between multiple workloads, fully utilizing the GPU at all times, and maintaining uninterrupted service for both workloads.

    Serving Inference Warm Models with GPU Memory Swap

    Running multiple inference models is a demanding task and you will need to ensure that your SLA is met. You need to provide high performance and low latency, while maximizing GPU utilization. This becomes even more challenging when the exact model usage patterns are unpredictable. You must plan for the agility of inference services and strive to keep models on standby in a ready state rather than an idle state.

    NVIDIA Run:ai’s GPU memory swap feature enables you to load multiple models to a single GPU, where each can use up to the full amount GPU memory. Using an application load balancer, the administrator can control to which server each inference request is sent. Then the GPU can be loaded with multiple models, where the model in use is loaded into the GPU memory and the rest of the models are swapped-out to the CPU memory. The swapped models are stored as ready models to be loaded when required. GPU memory swap always maintains the context of the workload (model) on the GPU so it can easily and quickly switch between models. This is unlike industry standard model servers that load models from scratch into the GPU whenever required.

    How GPU Memory Swap Works

    Swapping the workload’s GPU memory to and from the CPU is performed simultaneously and synchronously for all GPUs used by the workload. In some cases, if workloads specify a memory limit smaller than a full GPU memory size, multiple workloads can run in parallel on the same GPUs, maximizing the utilization and shortening the response times.

    In other cases, workloads will run serially, with each workload running for a few seconds before the system swaps them in/out. If multiple workloads occupy more than the GPU physical memory and attempt to run simultaneously, memory swapping will occur. In this scenario, each workload will run part of the time on the GPU while being swapped out to the CPU memory the other part of the time, slowing down the execution of the workloads. Therefore, it is important to evaluate whether memory swapping is suitable for your specific use cases, weighing the benefits against the potential for slower execution time. To better understand the benefits and use cases of GPU memory swap, refer to the detailed sections below. This will help you determine how to best utilize GPU swap for your workloads and achieve optimal performance.

    The workload MUST use dynamic GPU fractions. This means the workload’s memory Request is less than a full GPU, but it may add a GPU memory Limit to allow the workload to effectively use the full GPU memory. The NVIDIA Run:ai Scheduler allocates the dynamic fraction pair (Request and Limit) on single or multiple GPU devices in the same node.

    The administrator must label each node that should provide GPU memory swap with the run.ai/swap-enabled=true label. Enabling the feature reserves CPU memory to serve the swapped GPU memory from all GPUs on that node. The administrator sets the size of the reserved CPU RAM using the runaiconfig file, as detailed in Enabling and Configuring GPU Memory Swap.

    Optionally, you can also configure the Node Level Scheduler:

    • The Node Level Scheduler automatically spreads workloads between the different GPUs on a node, ensuring maximum workload performance and GPU utilization.

    • In scenarios where Interactive notebooks are involved, if the CPU reserved memory for the GPU swap is full, the Node Level Scheduler preempts the GPU process of that workload and potentially routes the workload to another GPU to run.

    Multi-GPU Memory Swap

    NVIDIA Run:ai also supports workload submission using multi-GPU memory swap. Multi-GPU memory swap works similarly to single GPU memory swap, but instead of swapping memory for a single GPU workload, it swaps memory for workloads across multiple GPUs simultaneously and synchronously.

    The NVIDIA Run:ai Scheduler allocates the same dynamic GPU fraction pair (Request and Limit) on multiple GPU devices in the same node. For example, if you want to run two LLM models, each consuming 8 GPUs that are not used simultaneously, you can use GPU memory swap to share their GPUs. This approach allows multiple models to be stacked on the same node.

    The following outlines the advantages of stacking multiple models on the same node:

    • Maximizes GPU utilization - Efficiently uses available GPU resources by enabling multiple workloads to share GPUs.

    • Improves cold start times - Loading large LLM models to a node and its GPUs can take several minutes during a “cold start”. Using memory swap turns this process into a “warm start” that takes only a fraction of a second to a few seconds (depending on the model size and the GPU model).

    • Increases GPU availability - Frees up and maximizes GPU availability for additional workloads (and users), enabling better resource sharing.

    Deployment Considerations

    • A pod created before the GPU memory swap feature was enabled in that cluster, cannot be scheduled to a swap-enabled node. A proper event is generated in case no matching node is found. Users must re-submit those pods to make them swap-enabled.

    • GPU memory swap cannot be enabled if NVIDIA Run:ai strict or fair time-slicing is used. GPU memory swap can only be used with the default NVIDIA time-slicing mechanism.

    • CPU RAM size cannot be decreased once GPU memory swap is enabled.

    Enabling and Configuring GPU Memory Swap

    Before configuring GPU memory swap, dynamic GPU fractions must be enabled. You can also configure and use Node Level Scheduler. Dynamic GPU fractions enable you to make your workloads burstable, while both features will maximize your workloads’ performance and GPU utilization within a single node.

    To enable GPU memory swap in a NVIDIA Run:ai cluster:

    1. Add the following label to each node where you want to enable GPU memory swap:

    2. Edit the runaiconfig file with the following parameters. This example uses 100Gi as the size of the swap memory. For more details, see Advanced cluster configurations:

    3. Or, use the following patch command from your terminal:
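    For step 1, a minimal sketch using the label described earlier in this article (the runaiconfig parameters and patch command for steps 2-3 are documented in Advanced cluster configurations and are not reproduced here):

    ```bash
    kubectl label node <node-name> run.ai/swap-enabled=true
    ```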

    Configuring System Reserved GPU Resources

    Swappable workloads require reserving a small part of the GPU for non-swappable allocations like binaries and GPU context. To avoid getting out-of-memory (OOM) errors due to non-swappable memory regions, the system reserves 2GiB of GPU RAM by default, effectively truncating the total size of the GPU memory. For example, a 16GiB T4 will appear as 14GiB on a swap-enabled node. The exact reserved size is application-dependent, and 2GiB is a safe assumption for 2-3 applications sharing and swapping on a GPU. This value can be changed by:

    1. Editing the runaiconfig as follows:

    2. Or, using the following patch command from your terminal:

    Preventing Your Workloads from Getting Swapped

    If you prefer your workloads not to be swapped into CPU memory, you can specify on the pod an anti-affinity to run.ai/swap-enabled=true node label when submitting your workloads and the Scheduler will ensure not to use swap-enabled nodes. An alternative way is to set swap on a dedicated node pool and not use this node pool for workloads you prefer not to swap.
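    A minimal sketch of such a node anti-affinity in the pod spec, keyed on the swap label described above (standard Kubernetes node affinity):

    ```yaml
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: run.ai/swap-enabled
                    operator: NotIn
                    values:
                      - "true"
    ```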

    What Happens When the CPU Reserved Memory for GPU Swap is Exhausted?

    CPU memory is limited, and a single node’s CPU memory typically serves multiple GPUs, usually between 2 and 8. For example, when using GPUs with 80GB of memory, each swapped workload can consume up to 80GB (but may use less), and each GPU may be shared between 2-4 workloads. In this example, you can see how the required swap memory can become very large. Therefore, administrators are given a way to limit the size of the CPU memory reserved for swapped GPU memory on each swap-enabled node, as shown in Enabling and Configuring GPU Memory Swap.

    Limiting the CPU reserved memory means that there may be scenarios where the GPU memory cannot be swapped out to the CPU reserved RAM. Whenever the CPU reserved memory for swapped GPU memory is exhausted, the workloads currently running will not be swapped out to the CPU reserved RAM; instead, the Node Level Scheduler logic (if enabled) takes over and provides GPU resource optimization.

    Dynamic GPU Fractions

    Many workloads utilize GPU resources intermittently, with long periods of inactivity. These workloads typically need GPU resources when they are running AI applications or debugging a model in development. Other workloads such as inference may utilize GPUs at lower rates than requested, but may demand higher resource usage during peak utilization. The disparity between resource request and actual resource utilization often leads to inefficient utilization of GPUs. This usually occurs when multiple workloads request resources based on their peak demand, despite operating below those peaks for the majority of their runtime.

    To address this challenge, NVIDIA Run:ai has introduced dynamic GPU fractions. This feature optimizes GPU utilization by enabling workloads to dynamically adjust their resource usage. It allows users to specify a guaranteed fraction of GPU memory and compute resources with a higher limit that can be dynamically utilized when additional resources are requested.

    How Dynamic GPU Fractions Work

    With dynamic GPU fractions, users can submit workloads using GPU fraction Request and Limit which is achieved by leveraging the Kubernetes Request and Limit notations. You can either:

    • Request a GPU fraction (portion) using a percentage of a GPU and specify a Limit

    • Request a GPU memory size (GB, MB) and specify a Limit

    When setting a GPU memory limit either as GPU fraction or GPU memory size, the Limit must be equal to or greater than the GPU fractional memory request. Both GPU fraction and GPU memory are translated into the actual requested memory size of the Request (guaranteed resources) and the Limit (burstable resources - non guaranteed).

    For example, a user can specify a workload with a GPU fraction request of 0.25 GPU, and add a limit of up to 0.80 GPU. The NVIDIA Run:ai Scheduler schedules the workload to a node that can provide the GPU fraction request (0.25), and then assigns the workload to a GPU. The GPU scheduler monitors the workload and allows it to occupy memory between 0 to 0.80 of the GPU memory (based on the Limit), where only 0.25 of the GPU memory is guaranteed to that workload. The rest of the memory (from 0.25 to 0.8) is “loaned” to the workload, as long as it is not needed by other workloads.

    NVIDIA Run:ai automatically manages the state changes between Request and Limit as well as the reverse (when the balance needs to be "returned"), updating the workloads’ utilization vs. Request and Limit parameters in the NVIDIA Run:ai UI.

    To guarantee fair quality of service between different workloads using the same GPU, NVIDIA Run:ai developed an extendable GPUOOMKiller (Out Of Memory Killer) component that guarantees the quality of service using Kubernetes semantics for resources of Request and Limit.

    The OOMKiller capability requires adding CAP_KILL capabilities to the dynamic GPU fractions and to the NVIDIA Run:ai core scheduling module (toolkit daemon). This capability is enabled by default.

    Note

    Dynamic GPU fractions are enabled by default in the cluster. Disabling dynamic GPU fractions removes the CAP_KILL capability.

    Multi-GPU Dynamic Fractions

    NVIDIA Run:ai also supports workload submission using multi-GPU dynamic fractions. Multi-GPU dynamic fractions work similarly to dynamic fractions on a single GPU workload, however, instead of a single GPU device, the NVIDIA Run:ai Scheduler allocates the same dynamic fraction pair (Request and Limit) on multiple GPU devices within the same node. For example, if practitioners develop a new model that uses 8 GPUs and requires 40GB of memory per GPU, but may want to burst out and consume up to the full GPU memory, they can allocate 8×40GB with multi-GPU fractions and a limit of 80GB (e.g. H100 GPU) instead of reserving the full memory of each GPU (e.g. 80GB). This leaves 40GB of GPU memory available on each of the 8 GPUs for other workloads within that node. This is useful during model development, where memory requirements are usually lower due to experimentation with smaller models or configurations.

    This approach significantly improves GPU utilization and availability, enabling more precise and often smaller quota requirements for the end user. Time sharing where single GPUs can serve multiple workloads with dynamic fractions remains unchanged, only now, it serves multiple workloads using multi-GPUs per workload.

    Setting Dynamic GPU Fractions

    Note

    Dynamic GPU fractions is disabled by default in the NVIDIA Run:ai UI. To use dynamic GPU fractions, it must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.

    Using the compute resource asset, you can define the compute requirements by specifying your requested GPU portion or GPU memory, and set a Limit. You can then use the compute resource with any of the supported workload types for single and multi-GPU dynamic fractions. In addition, you will be able to view the workloads’ utilization vs. Request and Limit parameters in the NVIDIA Run:ai UI.

    • Single dynamic GPU fractions - Define the compute requirement to run 1 GPU device, by specifying either a fraction (percentage) of the overall memory or specifying the memory request (GB, MB) with a Limit. The limit must be equal to or greater than the GPU fractional memory request.

    • Multi-GPU dynamic fractions - Define the compute requirement to run multiple GPU devices, by specifying either a fraction (percentage) of the overall memory or specifying the memory request (GB, MB) with a Limit. The limit must be equal to or greater than the GPU fractional memory request.

    Note

    When setting a workload with dynamic GPU fractions, (for example, when using it with GPU Request or GPU memory Limits), you practically make the workload burstable. This means it can use memory that is not guaranteed for that workload and is susceptible to an ‘OOM Kill’ signal if the actual owner of that memory requires it back. This applies to non-preemptive workloads as well. For that reason, it is recommended that you use dynamic GPU fractions with Interactive workloads running Notebooks. Notebook pods are not evicted when their GPU process is OOM Kill’ed. This behavior is the same as standard Kubernetes burstable CPU workloads.

    Setting Dynamic GPU Fractions for Third-Party Workloads

    To enable dynamic GPU fractions for workloads submitted via Kubernetes YAML, use the following annotations to define the GPU fraction configuration. You can configure either gpu-fraction or gpu-memory. You must also set the RUNAI_GPU_MEMORY_LIMIT environment variable in the first container to enforce the memory limit. This is the GPU consuming container. Make sure the default scheduler is set to runai-scheduler. See for more details.

    Variable
    Input Format
    Where to Set

    The following example YAML creates a pod that requests 2 GPU devices, each requesting 50% of memory (gpu-fraction: "0.5") and allows usage of up to 95% (RUNAI_GPU_MEMORY_LIMIT: "0.95") if available.
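    A sketch of such a pod, based on the annotations and environment variable described in the table below; the pod name, namespace, and image are placeholders:

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: fractional-gpu-pod
      namespace: runai-team-a               # placeholder project namespace
      annotations:
        gpu-fraction: "0.5"                 # each GPU device: 50% of memory guaranteed
        gpu-fraction-num-devices: "2"       # request 2 GPU devices
    spec:
      schedulerName: runai-scheduler        # use the NVIDIA Run:ai scheduler
      containers:
        - name: main                        # first (GPU-consuming) container
          image: <your-image>
          env:
            - name: RUNAI_GPU_MEMORY_LIMIT  # allow bursting up to 95% of GPU memory
              value: "0.95"
    ```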

    Using CLI

    To view the available actions, go to the CLI reference and run the relevant command according to your workload.

    Using API

    To view the available actions, go to the API reference and run the relevant call according to your workload.

    Set Up SSO with OpenShift

    Single Sign-On (SSO) is an authentication scheme, allowing users to log in with a single pair of credentials to multiple, independent software systems.

    This article explains the procedure to configure single sign-on to NVIDIA Run:ai using the OpenID Connect protocol in OpenShift V4.

    Prerequisites

    Before starting, make sure you have the following available from your OpenShift cluster:

    • OpenShift OAuth client:

      • ClientID - The ID used to identify the client with the Authorization Server.

      • Client Secret - A secret password that only the Client and Authorization Server know.

    • Base URL - The OpenShift API Server endpoint (for example, https://api.<cluster-url>:6443)

    Setup

    Adding the Identity Provider

    1. Go to General settings

    2. Open the Security section and click +IDENTITY PROVIDER

    3. Select OpenShift V4

    4. Enter the Base URL, Client ID, and Client Secret from your OpenShift OAuth client.

    Attribute
    Default value in NVIDIA Run:ai
    Description

    Testing the Setup

    1. Open the NVIDIA Run:ai platform as an admin

    2. Assign access rules to an SSO user defined in the IDP

    3. Open the NVIDIA Run:ai platform in an incognito browser tab

    4. On the sign-in page click CONTINUE WITH SSO. You are redirected to the OpenShift IDP sign-in page

    Editing the Identity Provider

    You can view the identity provider details and edit its configuration:

    1. Go to General settings

    2. Open the Security section

    3. On the identity provider box, click Edit identity provider

    4. You can edit either the Base URL, Client ID, Client Secret, or the User attributes

    Removing the Identity Provider

    You can remove the identity provider configuration:

    1. Go to General settings

    2. Open the Security section

    3. On the identity provider card, click Remove identity provider

    4. In the dialog, click REMOVE to confirm

    Note

    To avoid losing access, removing the identity provider must be carried out by a local user.

    Troubleshooting

    If testing the setup was unsuccessful, try the different troubleshooting scenarios according to the error you received.

    Troubleshooting Scenarios

    Error: "403 - Sorry, we can’t let you see this page. Something about permissions…"

    Description: The authenticated user is missing permissions

    Mitigation:

    1. Validate that either the user or its related group/s are assigned with access rules

    Error: "401 - We’re having trouble identifying your account because your email is incorrect or can’t be found."

    Description: Authentication failed because email attribute was not found.

    Mitigation:

    1. Validate email attribute is available in the configured OIDC Scopes

    Error: "Unexpected error when authenticating with identity provider"

    Description: User authentication failed

    Mitigation: Validate that the configured OIDC Scopes exist and match the Identity Provider’s available scopes

    Advanced: Look for the specific error message in the URL address

    Error: "Unexpected error when authenticating with identity provider (SSO sign-in is not available)"

    Description: User authentication failed

    Mitigation:

    1. Validate that the configured OIDC scope exists in the Identity Provider

    Error: "unauthorized_client"

    Description: OIDC Client ID was not found in the OpenShift IDP

    Mitigation: Validate that the configured Client ID matches the value in the OAuthclient Kubernetes object

    Using GB200 NVL72 and Multi-Node NVLink Domains

    Multi-Node NVLink (MNNVL) systems, including NVIDIA GB200, NVIDIA GB200 NVL72 and its derivatives are fully supported by the NVIDIA Run:ai platform.

    Kubernetes does not natively recognize NVIDIA’s MNNVL architecture, which makes managing and scheduling workloads across these high-performance domains more complex. The NVIDIA Run:ai platform simplifies this by abstracting the complexity of MNNVL configuration. Without this abstraction, optimal performance on a GB200 NVL72 system would require deep knowledge of NVLink domains, their hardware dependencies, and manual configuration for each distributed workload. NVIDIA Run:ai automates these steps, ensuring high performance with minimal effort. While GB200 NVL72 supports all workload types, distributed training workloads benefit most from its accelerated GPU networking capabilities.

    To learn more about GB200, MNNVL and related NVIDIA technologies, refer to the official NVIDIA documentation.

    GPU Fractions

    To submit a workload with GPU resources in Kubernetes, you typically need to specify an integer number of GPUs. However, workloads often require diverse GPU memory and compute requirements or even use GPUs intermittently depending on the application (such as inference workloads, training workloads or notebooks at the model-creation phase). Additionally, GPUs are becoming increasingly powerful, offering more processing power and larger memory capacity for applications. Despite the increasing model sizes, the increasing capabilities of GPUs allow them to be effectively shared among multiple users or applications.

    NVIDIA Run:ai’s GPU fractions provide an agile and easy-to-use method to share a GPU or multiple GPUs across workloads. With GPU fractions, you can divide the GPU/s memory into smaller chunks and share the GPU/s compute resources between different workloads and users, resulting in higher GPU utilization and more efficient resource allocation.

    Benefits of GPU Fractions

    Utilizing GPU fractions to share GPU resources among multiple workloads provides numerous advantages for both platform administrators and practitioners, including improved efficiency, resource optimization, and enhanced user experience.

    The NVIDIA Run:ai Scheduler: Concepts and Principles

    When a user submits a workload, the workload is directed to the selected Kubernetes cluster and managed by the NVIDIA Run:ai Scheduler. The Scheduler’s primary responsibility is to allocate workloads to the most suitable node or nodes based on resource requirements and other characteristics, as well as adherence to NVIDIA Run:ai’s fairness and quota management.

    The NVIDIA Run:ai Scheduler schedules native Kubernetes workloads, NVIDIA Run:ai workloads, or any other type of third-party workloads. To learn more about workload support, see the workloads documentation.

    To understand what is behind the NVIDIA Run:ai Scheduler’s decision-making logic, get to know the key concepts, resource management and scheduling principles of the Scheduler.

    Workloads and Pod Groups

    Workloads can range from a single pod running on an individual node to distributed workloads using multiple pods, each running on a node (or part of a node). For example, a large scale training workload could use up to 128 nodes or more, while an inference workload could use many pods (replicas) and nodes.

    Data Volumes

    Data volumes (DVs) offer a powerful solution for storing, managing, and sharing AI training data, promoting collaboration, simplifying data access control, and streamlining the AI development lifecycle.

    Acting as a central repository for organizational data resources, data volumes can represent datasets or raw data, that is stored in Kubernetes Persistent Volume Claims (PVCs).

    Once a data volume is created, it can be shared with additional multiple scopes and easily utilized by AI practitioners when submitting workloads. Shared data volumes are mounted with read-only permissions, ensuring data integrity. Any modifications to the data in a shared DV must be made by writing to the original volume of the PVC used to create the data volume.

    Note

    runai workspace submit --priority priority-class
    runai training submit --priority priority-class
    curl -H "Authorization: Bearer <token>" "https://runai.jfrog.io/artifactory/api/storage/runai-airgapped-prod/?list"
    curl -L -H "Authorization: Bearer <token>" -O "https://runai.jfrog.io/artifactory/runai-airgapped-prod/runai-airgapped-package-<VERSION>.tar.gz"
    kubectl edit cm runai-backend-org-unit-service -n runai-backend
    S3_ENDPOINT: <S3_END_POINT_URL>
    S3_ACCESS_KEY_ID: <S3_ACCESS_KEY_ID>
    S3_ACCESS_KEY: <S3_ACCESS_KEY>
    S3_USE_SSL: "true"
    S3_BUCKET: <BUCKET_NAME>
    kubectl edit cm runai-backend-metrics-service -n runai-backend
    S3_ENDPOINT: <S3_END_POINT_URL>
    S3_ACCESS_KEY_ID: <S3_ACCESS_KEY_ID>
    S3_ACCESS_KEY: <S3_ACCESS_KEY>
    S3_USE_SSL: "true"

    GPU devices

    The number of GPU devices installed on the node. Clicking this field pops up a dialog with details per GPU (described below in this article)

    Free GPU devices

    The current number of fully vacant GPU devices

    GPU memory

    The total amount of GPU memory installed on this node. For example, if the number is 640GB and the number of GPU devices is 8, then each GPU is installed with 80GB of memory (assuming the node is assembled of homogenous GPU devices)

    Allocated GPUs

    The total allocation of GPU devices in units of GPUs (decimal number). For example, if 3 GPUs are 50% allocated, the field prints out the value 1.50. This value represents the portion of GPU memory consumed by all running pods using this node

    Used GPU memory

    The actual amount of memory (in GB or MB) used by pods running on this node.

    GPU compute utilization

    The average compute utilization of all GPU devices in this node

    GPU memory utilization

    The average memory utilization of all GPU devices in this node

    CPU (Cores)

    The number of CPU cores installed on this node

    CPU memory

    The total amount of CPU memory installed on this node

    Allocated CPU (Cores)

    The number of CPU cores allocated by pods running on this node (decimal number, e.g. a pod allocating 350 millicores shows an allocation of 0.35 cores).

    Allocated CPU memory

    The total amount of CPU memory allocated by pods running on this node (in GB or MB)

    Used CPU memory

    The total amount of actually used CPU memory by pods running on this node. Pods may allocate memory but not use all of it, or go beyond their CPU memory allocation if using Limit > Request for CPU memory (burstable workload)

    CPU compute utilization

    The utilization of all CPU compute resources on this node (percentage)

    CPU memory utilization

    The utilization of all CPU memory resources on this node (percentage)

    Used swap CPU memory

    The amount of CPU memory (in GB or MB) used for GPU swap memory (* future)

    Pod(s)

    List of pods running on this node, click the field to view details (described below in this article)

    Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

  • Show/Hide details - Click to view additional information on the selected row

  • CPU memory utilization - The utilization of all CPUs memory in a single graph, along an adjustable period allows you to see the trends of CPU memory utilization (percentage of CPU memory) in this node.

  • CPU memory usage - The usage of all CPUs memory in a single graph, along an adjustable period allows you to see the trends of CPU memory usage (in GB or MB of CPU memory) in this node.

  • For GPUs charts - Click the GPU legend on the right-hand side of the chart, to activate or deactivate any of the GPU lines.

  • You can click the date picker to change the presented period

  • You can use your mouse to mark a sub-period in the graph for zooming in, and use the ‘Reset zoom’ button to go back to the preset period

  • Changes in the period affect all graphs on this screen.

  • Node

    The Kubernetes name of the node

    Status

    The state of the node. Nodes in the Ready state are eligible for scheduling. If the state is Not ready then the main reason appears in parenthesis on the right side of the state field. Hovering the state lists the reasons why a node is Not ready.

    NVLink domain UID

    Indicates if the MNNVL domain ID is part of the MNNVL label value. In case the MNNVL label is not the default MNNVL label key (nvidia.com/gpu.clique), this field will show the whole label value.

    MNNVL domain clique ID

    Indicates if the MNNVL clique ID is part of the MNNVL label value. In case the MNNVL label is not the default MNNVL label key (nvidia.com/gpu.clique), this field will show an empty value.

    Node pool

    The name of the associated node pool. By default, every node in the NVIDIA Run:ai platform is associated with the default node pool, if no other node pool is associated

    GPU type

    The GPU model, for example, H100, or V100

    Index

    The GPU index, read from the GPU hardware. The same index is used when accessing the GPU directly

    Used memory

    The amount of memory used by pods and drivers using the GPU (in GB or MB)

    Compute utilization

    The portion of time the GPU is being used by applications (percentage)

    Memory utilization

    The portion of the GPU memory that is being used by applications (percentage)

    Idle time

    The elapsed time since the GPU was used (i.e. the GPU is being idle for ‘Idle time’)

    Pod

    The Kubernetes name of the pod. The pod name is usually made up of the name of the parent workload (if there is one) and an index that is unique for that pod instance within the workload

    Status

    The state of the pod. In steady state this should be Running and the amount of time the pod is running

    Project

    The NVIDIA Run:ai project name the pod belongs to. Clicking this field takes you to the Projects table filtered by this project name

    Workload

    The workload name the pod belongs to. Clicking this field takes you to the Workloads table filtered by this workload name

    Image

    The full path of the image used by the main container of this pod

    Creation time

    The pod’s creation date and time


    If it exists in the IDP, it allows Researcher containers to start with the relevant Linux supplementary groups. The IDP attribute must be a list of integers.

    Email

    email

    Defines the user attribute in the IDP holding the user's email address, which is the user identifier in NVIDIA Run:ai

    User first name

    firstName

    Used as the user’s first name appearing in the NVIDIA Run:ai user interface

    User last name

    lastName

    Used as the user’s last name appearing in the NVIDIA Run:ai user interface

    Troubleshooting

    Copy the Redirect URL to be used in your OpenShift OAuth client

  • Optional: Enter the user attributes and their value in the identity provider as shown in the below table

  • Click SAVE

  • Optional: Enable Auto-Redirect to SSO to automatically redirect users to your configured identity provider’s login page when accessing the platform.

  • Email

    email

    Defines the user attribute in the IDP holding the user's email address, which is the user identifier in NVIDIA Run:ai

    User first name

    firstName

    Used as the user’s first name appearing in the NVIDIA Run:ai platform

    User last name

    lastName

    Used as the user’s last name appearing in the NVIDIA Run:ai platform

    On the identity provider sign-in page, log in with the SSO user to whom you granted access rules

  • If you are unable to sign in to the identity provider, follow the Troubleshooting section below

  • Validate groups attribute is available in the configured OIDC Scopes
  • Validate the user’s groups attribute is mapped correctly

  • Advanced:

    1. Open the Chrome DevTools: Right-click on page → Inspect → Console tab

    2. Run the following command to retrieve and copy the user’s token: localStorage.token;

    3. Paste in https://jwt.io

    4. Under the Payload section validate the value of the user’s attributes

    Validate the user’s email attribute is mapped correctly

    Validate that the configured Client Secret matches the Client Secret value in the OAuthClient Kubernetes object.

    Advanced: Look for the specific error message in the URL address

    User role groups

    GROUPS

    If it exists in the IDP, it allows you to assign NVIDIA Run:ai role groups via the IDP. The IDP attribute must be a list of strings.

    Linux User ID

    UID

    If it exists in the IDP, it allows researcher containers to start with the Linux User UID. Used to map access to network resources such as file systems to users. The IDP attribute must be of type integer.

    Linux Group ID

    GID

    If it exists in the IDP, it allows researcher containers to start with the Linux Group GID. The IDP attribute must be of type integer.

    Supplementary Groups

    SUPPLEMENTARYGROUPS


    If it exists in the IDP, it allows researcher containers to start with the relevant Linux supplementary groups. The IDP attribute must be a list of integers.

    Collection period

    The period in which the data was collected

    Smaller quota requirements - Enables more precise and often smaller quota requirements for the end user.

    gpu-fraction

    A portion of GPU memory as a double-precision floating-point number. Example: 0.25, 0.75.

    Pod annotation (metadata.annotations)

    gpu-memory

    Memory size in MiB. Example: 2500, 4096. The gpu-memory values are always in MiB.

    Pod annotation (metadata.annotations)

    gpu-fraction-num-devices

    The number of GPU devices to allocate using the specified gpu-fraction or gpu-memory value. Set this annotation only if you want to request multiple GPU devices.

    Pod annotation (metadata.annotations)

    RUNAI_GPU_MEMORY_LIMIT

    • To use for gpu-fraction - Specify a double-precision floating-point number. Example: 0.95

    • To use for gpu-memory - Specify a Kubernetes resource quantity format. Example: 500000000, 2500M

    The limit must be equal to or greater than the GPU fractional memory request.


    Environment variable in the first container

    Replace <POSTGRESQL_PORT> with the port number where PostgreSQL is running.

  • Replace <POSTGRESQL_DB> with the name of your PostgreSQL database.


  • Replace <SQL_FILE> with the path to the SQL script created in the previous step.

    tar xvf runai-airgapped-package-<VERSION>.tar.gz
    export REGISTRY_URL=<DOCKER REGISTRY ADDRESS>
    sudo ./setup.sh

    Verify that the changes have been applied. After saving the file, the API server should automatically restart since it's managed as a static pod. Confirm that the kube-apiserver-<master-node-name> pod in the kube-system namespace has restarted and is running with the new configuration. You can run the following command to check the pod status:

    Verify the flags are applied by inspecting the running API server container:

    • Follow the Rancher documentation here to locate the API server container ID.

    • Run the following command:

    • Confirm that the OIDC flags have been added correctly to the container's configuration.

    If you're using Rancher UI:

    1. Add the required flags during the cluster provisioning process.

    2. Navigate to: Cluster Management > Create, select RKE2, and choose your platform.

    3. In the Cluster Configuration screen, go to: Advanced > Additional API Server Args.

    4. Add the required OIDC flags as <key>=<value> (e.g. oidc-username-prefix=-).

    Configure the OIDC provider for username-password authentication. Make sure to use the required OIDC flags:

  • Or, configure the OIDC provider for single-sign-on. Make sure to use the required OIDC flags:

  • Update the runaiconfig with the Anthos Identity Service endpoint. First, get the external IP of the gke-oidc-envoy service:

  • Then, patch the runaiconfig to use this endpoint. Replace the below with the actual IP address of the gke-oidc-envoy service:

  • Associate a new identity provider. Use the required OIDC flags.

    The process can take up to 30 minutes.

    Verify that the changes have been applied. After saving the file, the API server should automatically restart since it's managed as a static pod. Confirm that the kube-apiserver-<master-node-name> pod in the kube-system namespace has restarted and is running with the new configuration. You can run the following command to check the pod status:

    reports:
      s3_config:
        bucket: "<BUCKET_NAME>"
    kubectl rollout restart deployment runai-backend-metrics-service runai-backend-org-unit-service -n runai-backend
    run.ai/swap-enabled=true
    spec: 
      global: 
        core: 
          swap:
            enabled: true
            limits:
              cpuRam: 100Gi
     kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' --patch '{"spec":{"global":{"core":{"swap":{"enabled": true, "limits": {"cpuRam": "100Gi"}}}}}}'
    spec: 
      global: 
        core: 
          swap:
            limits:
              reservedGpuRam: 2Gi
     kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' --patch '{"spec":{"global":{"core":{"swap":{"limits":{"reservedGpuRam": <quantity>}}}}}}'
    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        user: test
        gpu-fraction: "0.5"
        gpu-fraction-num-devices: "2"
      labels:
        runai/queue: test
      name: multi-fractional-pod-job
      namespace: test
    spec:
      containers:
      - image: gcr.io/run-ai-demo/quickstart-cuda
        imagePullPolicy: Always
        name: job
        env:
        - name: RUNAI_VERBOSE
          value: "1"
        - name: RUNAI_GPU_MEMORY_LIMIT
          value: "0.95"
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          capabilities:
            drop: ["ALL"]
      schedulerName: runai-scheduler
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 5
    tar xvf runai-airgapped-package-<VERSION>.tar.gz
    export REGISTRY_URL=<DOCKER REGISTRY ADDRESS>
    sudo ./setup.sh
    -- Create a new database for runai
    CREATE DATABASE <DATABASE_NAME>; 
    
    -- Create the role with login and password
    CREATE ROLE <ROLE_NAME>  WITH LOGIN PASSWORD '<ROLE_PASSWORD>'; 
    
    -- Grant all privileges on the database to the role
    GRANT ALL PRIVILEGES ON DATABASE <DATABASE_NAME> TO <ROLE_NAME>; 
    
    -- Connect to the newly created database
    \c <DATABASE_NAME> 
    
    -- grafana
    CREATE ROLE grafana WITH LOGIN PASSWORD '<GRAFANA_PASSWORD>'; 
    CREATE SCHEMA grafana authorization grafana;
    ALTER USER grafana set search_path='grafana';
    -- Exit psql
    \q
    psql --host <POSTGRESQL_HOST> \ 
    --user <POSTGRESQL_USER> \
    --port <POSTGRESQL_PORT> \ 
    --dbname <POSTGRESQL_DB> \
    -a -f <SQL_FILE>
    kubectl get pods -n kube-system kube-apiserver-<master-node-name> -o yaml
    docker inspect <kube-api-server-container-id>
    kubectl get clientconfig default -n kube-public -o yaml > login-config.yaml
    yq -i e ".spec +={\"authentication\":[{\"name\":\"oidc\",\"oidc\":{\"clientID\":\"runai\",\"issuerURI\":\"$OIDC_ISSUER_URL\",\"kubectlRedirectURI\":\"http://localhost:8000/callback\",\"userClaim\":\"sub\",\"userPrefix\":\"-\"}}]}" login-config.yaml
    kubectl apply -f login-config.yaml
    kubectl get clientconfig default -n kube-public -o yaml > login-config.yaml
    yq -i e ".spec +={\"authentication\":[{\"name\":\"oidc\",\"oidc\":{\"clientID\":\"runai\",\"issuerURI\":\"$OIDC_ISSUER_URL\",\"groupsClaim\":\"groups\",\"kubectlRedirectURI\":\"http://localhost:8000/callback\",\"userClaim\":\"sub\",\"userPrefix\":\"-\"}}]}" login-config.yaml
    kubectl apply -f login-config.yaml
    kubectl get svc -n anthos-identity-service
    NAME               TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)              AGE
    gke-oidc-envoy     LoadBalancer   10.37.3.111   35.236.229.19   443:31545/TCP        12h
    kubectl -n runai patch runaiconfig runai -p '{"spec": {"researcher-service": 
    {"args": {"gkeOidcEnvoyHost": "35.236.229.19"}}}}'  --type="merge"
    kubectl get pods -n kube-system kube-apiserver-<master-node-name> -o yaml
      containers:
      - command:
        ...
        - --oidc-client-id=runai
        - --oidc-issuer-url=https://<HOST>/auth/realms/runai
        - --oidc-username-prefix=-
    kube-api:
        always_pull_images: false
        extra_args:
            oidc-client-id: runai  # 
            ...
    gcloud container clusters update <gke-cluster-name> \
        --enable-identity-service --project=<gcp-project-name> --zone=<gcp-zone-name>
    kube-apiserver-arg:
    - "oidc-client-id=runai" # 
    ...

    Click +NEW CLUSTER

  • Enter a unique name for your cluster

  • Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)

  • Enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.

  • Click Continue

  • Installing NVIDIA Run:ai cluster

    The next section presents the NVIDIA Run:ai cluster installation steps.

    1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

    2. Click DONE

    The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.

    Tip: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.

    Click +NEW CLUSTER

  • Enter a unique name for your cluster

  • Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)

  • Enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.

  • Click Continue

  • Installing NVIDIA Run:ai cluster

    The next section presents the NVIDIA Run:ai cluster installation steps.

    1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

    2. On the second tab of the cluster wizard, when copying the helm command for installation, you will need to use the pre-provided installation file instead of using helm repositories. As such:

      • Do not add the helm repository and do not run helm repo update.

      • Instead, edit the helm upgrade command.

        • Replace runai/runai-cluster with runai-cluster-<VERSION>.tgz.

        • Add --set global.image.registry=<DOCKER REGISTRY ADDRESS> where the registry address is as entered in the section

        • Add --set global.customCA.enabled=true as described

      The command should look like the following:
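
      For illustration only, the edited command might look like the following sketch. The exact release name, flags, and values are generated for you in the cluster wizard - copy them from there; only the modifications listed above are shown explicitly:

      helm upgrade -i runai-cluster runai-cluster-<VERSION>.tgz \
        <flags and values copied from the cluster wizard> \
        --set global.image.registry=<DOCKER REGISTRY ADDRESS> \
        --set global.customCA.enabled=true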

    3. Click DONE

    The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.

    Tip: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.

    Click +NEW CLUSTER

  • Enter a unique name for your cluster

  • Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)

  • Enter the Cluster URL

  • Click Continue

  • Installing NVIDIA Run:ai cluster

    The next section presents the NVIDIA Run:ai cluster installation steps.

    1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

    2. Click DONE

    The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.

    Click +NEW CLUSTER

  • Enter a unique name for your cluster

  • Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)

  • Enter the Cluster URL

  • Click Continue

  • Installing NVIDIA Run:ai cluster

    The next section presents the NVIDIA Run:ai cluster installation steps.

    1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

    2. On the second tab of the cluster wizard, when copying the helm command for installation, you will need to use the pre-provided installation file instead of using helm repositories. As such:

      • Do not add the helm repository and do not run helm repo update.

      • Instead, edit the helm upgrade command.

        • Replace runai/runai-cluster with runai-cluster-<VERSION>.tgz.

        • Add --set global.image.registry=<DOCKER REGISTRY ADDRESS> where the registry address is as entered in the section

        • Add --set global.customCA.enabled=true as described

      The command should look like the following:

    3. Click DONE

    The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.

    # If the diagnostics image is hosted in a private registry, also pass --image-pull-secret and --image
    chmod +x ./preinstall-diagnostics-<platform> && \
    ./preinstall-diagnostics-<platform> \
      --domain ${CONTROL_PLANE_FQDN} \
      --cluster-domain ${CLUSTER_FQDN} \
      --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
      --image ${PRIVATE_REGISTRY_IMAGE_URL}
    #Save the image locally
    docker save --output preinstall-diagnostics.tar gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION}
    #Load the image to the organization's registry
    docker load --input preinstall-diagnostics.tar
    docker tag gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION} ${CLIENT_IMAGE_AND_TAG} 
    docker push ${CLIENT_IMAGE_AND_TAG}
    chmod +x ./preinstall-diagnostics-darwin-arm64 && \
    ./preinstall-diagnostics-darwin-arm64 \
      --domain ${CONTROL_PLANE_FQDN} \
      --cluster-domain ${CLUSTER_FQDN} \
      --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
      --image ${PRIVATE_REGISTRY_IMAGE_URL}    
  • From computer - Click the Metadata XML file field, then select your file for upload

  • From URL - In the Metadata XML field, enter the URL to the IDP Metadata XML file

  • You can either copy the Redirect URL and Entity ID displayed on the screen and enter them in your identity provider, or use the service provider metadata XML, which contains the same information in XML format. This file becomes available after you click SAVE in step 7.

  • Optional: Enter the user attributes and their value in the identity provider as shown in the below table

  • Click SAVE. After save, click Open service provider metadata XML to access the metadata file. This file can be used to configure your identity provider.

  • Optional: Enable Auto-Redirect to SSO to automatically redirect users to your configured identity provider’s login page when accessing the platform.

  • Email

    email

    Defines the user attribute in the IDP holding the user's email address, which is the user identifier in NVIDIA Run:ai.

    User first name

    firstName

    Used as the user’s first name appearing in the NVIDIA Run:ai platform.

    User last name

    lastName

    Used as the user’s last name appearing in the NVIDIA Run:ai platform.

    On the identity provider sign-in page, log in with the SSO user to whom you granted access rules

  • If you are unable to sign in to the identity provider, follow the Troubleshooting section below

  • You can view the identity provider URL, identity provider entity ID, and the certificate expiration date

    Open the Security section
  • In the identity provider box, check for a "Certificate expired” error

  • If it is expired, update the SAML metadata file to include a valid certificate

  • Validate the user’s groups attribute is mapped correctly

    Advanced:

    1. Open the Chrome DevTools: Right-click on page → Inspect → Console tab

    2. Run the following command to retrieve and copy the user’s token: localStorage.token;

    3. Paste in https://jwt.io

    4. Under the Payload section validate the values of the user’s attributes

  • Go to the NVIDIA Run:ai login screen

  • Open the Chrome Network inspector: Right-click → Inspect on the page → Network tab

  • On the sign-in page click CONTINUE WITH SSO.

  • Once redirected to the Identity Provider, search in the Chrome network inspector for an HTTP request showing the SAML Request. Depending on the IDP url, this would be a request to the IDP domain name. For example, accounts.google.com/idp?1234.

  • When found, go to the Payload tab and copy the value of the SAML Request

  • Paste the value into a SAML decoder

  • Validate the request:

    • The content of the <saml:Issuer> tag is the same as the Entity ID given when configuring the identity provider

    • The content of the AssertionConsumerServiceURL is the same as the Redirect URI given when configuring the identity provider

  • Validate the response:

    • The user email under the <saml2:Subject> tag is the same as the logged-in user

    • Make sure that under the <saml2:AttributeStatement> tag, there is an Attribute named email (lowercase). This attribute is mandatory.

  • User role groups

    GROUPS

    If it exists in the IDP, it allows you to assign NVIDIA Run:ai role groups via the IDP. The IDP attribute must be a list of strings.

    Linux User ID

    UID

    If it exists in the IDP, it allows Researcher containers to start with the Linux User UID. Used to map access to network resources such as file systems to users. The IDP attribute must be of type integer.

    Linux Group ID

    GID

    If it exists in the IDP, it allows Researcher containers to start with the Linux Group GID. The IDP attribute must be of type integer.

    Supplementary Groups

    SUPPLEMENTARYGROUPS


    If it exists in the IDP, it allows Researcher containers to start with the relevant Linux supplementary groups. The IDP attribute must be a list of integers.

    Supported

    NVIDIA Run:ai communicates with GitHub by defining it as an asset

    Hugging Face

    Repositories

    Supported

    NVIDIA Run:ai provides an out of the box integration with Hugging Face

    JupyterHub

    Development

    Community Support

    It is possible to submit NVIDIA Run:ai workloads via JupyterHub.

    Jupyter Notebook

    Development

    Supported

    NVIDIA Run:ai provides integrated support with Jupyter Notebooks. See example.

    Karpenter

    Cost Optimization

    Supported

    NVIDIA Run:ai provides out of the box support for Karpenter to save cloud costs.

    MPI

    Training

    Supported

    NVIDIA Run:ai provides out of the box support for submitting MPI workloads via API, CLI or UI.

    Kubeflow notebooks

    Development

    Community Support

    It is possible to launch a Kubeflow notebook with the NVIDIA Run:ai Scheduler.

    Kubeflow Pipelines

    Orchestration

    Community Support

    It is possible to schedule Kubeflow pipelines with the NVIDIA Run:ai Scheduler.

    MLFlow

    Model Serving

    Community Support

    It is possible to use ML Flow together with the NVIDIA Run:ai Scheduler.

    PyCharm

    Development

    Supported

    Containers created by NVIDIA Run:ai can be accessed via PyCharm.

    PyTorch

    Training

    Supported

    NVIDIA Run:ai provides out of the box support for submitting PyTorch workloads via API, CLI or UI.

    Ray

    Training, inference, data processing

    Community Support

    It is possible to schedule Ray jobs with the NVIDIA Run:ai Scheduler.

    Seldon Core

    Orchestration

    Community Support

    It is possible to schedule Seldon Core workloads with the NVIDIA Run:ai Scheduler.

    Spark

    Orchestration

    Community Support

    It is possible to schedule Spark workflows with the NVIDIA Run:ai Scheduler.

    S3

    Storage

    Supported

    NVIDIA Run:ai communicates with S3 by defining an asset

    TensorBoard

    Experiment tracking

    Supported

    NVIDIA Run:ai comes with a preset TensorBoard asset

    TensorFlow

    Training

    Supported

    NVIDIA Run:ai provides out of the box support for submitting TensorFlow workloads via API, CLI or UI.

    Triton

    Orchestration

    Supported

    Usage via docker base image

    VScode

    Development

    Supported

    Containers created by NVIDIA Run:ai can be accessed via Visual Studio Code. You can automatically launch Visual Studio Code web from the NVIDIA Run:ai console.

    Weights & Biases

    Experiment tracking

    Community Support

    It is possible to schedule W&B workloads with the NVIDIA Run:ai Scheduler.

    XGBoost

    Training

    Supported

    NVIDIA Run:ai provides out of the box support for submitting XGBoost workloads via API, CLI or UI.

    Apache Airflow

    Orchestration

    Community Support

    It is possible to schedule Airflow workflows with the NVIDIA Run:ai Scheduler. Sample code: How to integrate NVIDIA Run:ai with Apache Airflow.

    Argo workflows

    Orchestration

    Community Support

    It is possible to schedule Argo workflows with the NVIDIA Run:ai Scheduler. Sample code: How to integrate NVIDIA Run:ai with Argo Workflows.

    ClearML

    Experiment tracking

    Community Support

    It is possible to schedule ClearML workloads with the NVIDIA Run:ai Scheduler.

    Docker Registry

    Repositories

    Supported

    NVIDIA Run:ai allows using a docker registry as a Credential asset

    GitHub


    Storage

    NVIDIA Blackwell datasheet
  • NVIDIA Multi-Node NVLink Systems

  • Benefits of Using GB200 NVL72 with NVIDIA Run:ai

    The NVIDIA Run:ai platform enables administrators, researchers, and MLOps engineers to fully leverage GB200 NVL72 systems and other NVLink-based domains without requiring deep knowledge of hardware configurations or NVLink topologies. Key capabilities include:

    • Automatic detection and labeling

      • Detects GB200 NVL72 nodes and identifies MNNVL domains (e.g., GB200 NVL72 racks).

      • Automatically detects whether a node pool contains GB200 NVL72.

      • Supports manual override of GB200 MNNVL detection and label key for future compatibility and improved resiliency.

    • Simplified distributed workload submission

      • Allows seamless submission of distributed workloads into GB200-based node pools, eliminating all the complexities involved with that operation on top of GB200 MNNVL domains.

      • Abstracts away the complexity of configuring workloads for NVL domains.

    • Flexible support for NVLink domain variants

      • Compatible with current and future NVL domain configurations.

      • Supports any number of domains or GB200 racks.

    • Enhanced monitoring and visibility

      • Provides detailed NVIDIA Run:ai dashboards for monitoring GB200 nodes and MNNVL domains by node pool.

    • Control and customization

      • Offers manual override and label configuration for greater resiliency and future-proofing.

      • Enables advanced users to fine-tune GB200 scheduling behavior based on workload requirements.

    Prerequisites

    • Kubernetes version - Requires Kubernetes 1.32 or later.

    • NVIDIA GPU Operator - Install NVIDIA GPU Operator version 25.3 or above. See the NVIDIA GPU Operator section for installation instructions. This version must include the associated Dynamic Resource Allocation (DRA) driver, which provides support for GB200 accelerated networking resources and the ComputeDomain feature. For detailed steps on installing the DRA driver and configuring ComputeDomain, refer to NVIDIA Dynamic Resource Allocation (DRA) Driver.

    • NVIDIA Network Operator - Install the NVIDIA Network Operator. See the NVIDIA Network Operator section for installation instructions.

    • Enable GPU network acceleration - After installation, update runaiconfig using the GPUNetworkAccelerationEnabled=True flag to enable GPU network acceleration. This triggers an update of the NVIDIA Run:ai workload-controller deployment and restarts the controller, as sketched below.
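
    A minimal sketch of this step, assuming you edit the runaiconfig directly; the exact field path for the flag is not reproduced here and should be taken from the runaiconfig reference for your version:

    # Sketch only: edit the runaiconfig and set GPUNetworkAccelerationEnabled to True
    # (check the runaiconfig reference for the exact field path)
    kubectl edit runaiconfig runai -n runai
    # The workload-controller deployment is updated and restarted automatically after the change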

    Configuring and Managing GB200 NVL72 Domains

    Administrators must define dedicated node pools that align with GB200 NVL72 rack topologies. These node pools ensure that workloads are isolated to nodes with NVLink interconnects and are not scheduled on incompatible hardware. Each node pool can be manually configured in the NVIDIA Run:ai platform and associated with specific node labels. Two key configurations are required for each node pool:

    • Node Labels – Identify nodes equipped with GB200.

    • MNNVL Domain Discovery – Specify how the platform detects whether the node pool includes NVLink-connected nodes.

    To create a node pool with GPU network acceleration, see Node pools.

    Identifying GB200 Nodes

    To enable the NVIDIA Run:ai Scheduler to recognize GB200-based nodes, administrators must:

    • Use the default node label provided by the NVIDIA GPU Operator - nvidia.com/gpu.clique.

    • Or, apply a custom label that clearly marks the node as GB200/MNNVL capable.

    This node label serves as the basis for identifying appropriate nodes and ensuring workloads are scheduled on the correct hardware.

    Enabling MNNVL Domain Discovery

    The administrator can configure how the NVIDIA Run:ai platform detects MNNVL domains for each node pool. The available options include:

    • Automatic Discovery – Uses the default label key nvidia.com/gpu.clique, or a custom label key specified by the administrator. The NVIDIA Run:ai platform automatically discovers MNNVL domains within node pools. If a node is labeled with the MNNVL label key, the NVIDIA Run:ai platform indicates this node pool as MNNVL detected. MNNVL detected node pools are treated differently by the NVIDIA Run:ai platform when submitting a distributed training workload.

    • Manual Discovery – The platform does not evaluate any node labels. Detection is based solely on the administrator’s configuration of the node pool as MNNVL “Detected” or “Not Detected.”

    When automatic discovery is enabled, all GB200 nodes that are part of the same physical rack (NVL72 or other future topologies) are part of the same NVL Domain and automatically labeled by the GPU Operator with a common label using a unique label value per domain and sub-domain. The default label key set by the NVIDIA GPU Operator is nvidia.com/gpu.clique and its value consists of <NVL Domain ID (ClusterUUID)>.<Clique ID>:

    • The NVL Domain ID (ClusterUUID) is a unique identifier that represents the physical NVL domain, for example, a physical GB200 NVL72 rack.

    • The Clique ID denotes a logical MNNVL sub-domain. A clique represents a further logical split of the MNNVL into smaller domains that enable secure, fast, and isolated communication between pods running on different GB200 nodes within the same GB200 NVL72.

    The Nodes table provides more information on which GB200 NVL72 domain each node belongs to, and which Clique ID it is associated with.
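
    For example, you can inspect this label directly on the nodes (illustrative output; the domain and clique values are hypothetical):

    kubectl get nodes -L nvidia.com/gpu.clique
    # NAME            STATUS   ...   GPU.CLIQUE
    # gb200-node-01   Ready    ...   a1b2c3d4-e5f6-7890-abcd-ef1234567890.1
    # gb200-node-02   Ready    ...   a1b2c3d4-e5f6-7890-abcd-ef1234567890.1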

    Submitting Distributed Training Workloads

    When a distributed training workload is submitted to an MNNVL-detected node pool, the NVIDIA Run:ai platform automates several key configuration steps to ensure optimal workload execution:

    • ComputeDomain creation - The NVIDIA Run:ai platform creates a ComputeDomain Custom Resource Definition (CRD), which is a proprietary resource used to manage NVLink-based domain assignments.

    • Resource Claim injection - A reference to the ComputeDomain is automatically added to the workload specification as a resource claim, allowing the Scheduler to link the workload to a specific NVLink domain.

    • Pod affinity configuration - Pod affinity is applied using a Preferred policy with the MNNVL label key (e.g., nvidia.com/gpu.clique) as the topology key. This ensures that pods within the distributed workload are located on nodes with NVLink interconnects.

    • Node affinity configuration - Node affinity is also applied using a Preferred policy based on the same label key, further guiding the Scheduler to place workloads within the correct node group.

    These additional steps are crucial for the creation of underlying HW resources (also known as IMEX channels) and stickiness of the distributed workload to MNNVL topologies and nodes. When a distributed workload is stopped or evicted, the platform automatically removes the corresponding ComputeDomain.
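
    As an illustration only, the Preferred pod affinity described above corresponds roughly to a spec fragment like the following. The actual specification is injected automatically by NVIDIA Run:ai, and the label selector shown here is hypothetical:

    affinity:
      podAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1                              # default weight applied by the platform
          podAffinityTerm:
            topologyKey: nvidia.com/gpu.clique   # the MNNVL label key
            labelSelector:
              matchLabels:
                example-workload-label: my-distributed-training   # hypothetical selector for the workload's pods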

    Best Practices for MNNVL Node Pool Management

    • When submitting a distributed workload, you should explicitly specify a list of one or more MNNVL detected node pools, or a list of one or more non-MNNVL detected node pools. A mix of MNNVL detected and non-MNNVL detected node pools is not supported. A GB200 MNNVL node pool is a pool that contains at least one node belonging to an MNNVL domain.

    • Other workload types (not distributed) can include a list of mixed MNNVL and non-MNNVL node pools, from which the Scheduler will choose.

    • MNNVL node pools can include any size of MNNVL domains (i.e. NVL72 and any future domain size) and support any Grace-Blackwell models (GB200 and any future models).

    • To support the submission of larger distributed workloads, it is recommended to group as many GB200 racks as possible into fewer node pools. When possible, use a single GB200 node pool, unless there is a specific operational reason to divide resources across multiple node pools.

    • When submitting distributed training workloads with the controller pod set as a distinct non-GPU workload, the MNNVL feature should be used with the default Preferred mode as explained in the below section.

    Fine-tuning Scheduling Behavior for MNNVL

    You can influence how the Scheduler places distributed training workloads into GB200 MNNVL node pools using the Topology field available in the distributed training workload submission form.

    Note

    The following options are based on inter-pod affinity rules, which define how pods are grouped based on topology.

    • Confine a workload to a single GB200 MNNVL domain - To ensure the workload is scheduled within a single GB200 MNNVL domain (e.g., a GB200 NVL72 rack), apply a topology label with a Required policy using the MNNVL label key (nvidia.com/gpu.clique). This instructs the Scheduler to strictly place all pods within the same MNNVL domain. If the workload exceeds 18 pods (or 72 GPUs), the Scheduler will not be able to find a matching domain and will fail to schedule the workload.

    • Try to schedule a workload using a Preferred topology - To guide the Scheduler to prioritize a specific topology without enforcing it, apply a topology label with a policy of Preferred. You can apply any topology label with a Preferred policy. These labels are treated with higher scheduling weight than the default Preferred pod affinity automatically applied by NVIDIA Run:ai for MNNVL.

    • Mandate a custom topology - To force scheduling a workload into a custom topology, add a topology label with a policy of Required. This ensures the workload is strictly scheduled according to the specified topology. Keep in mind that using a Required policy can significantly constrain scheduling. If matching resources are not available, the Scheduler may fail to place the workload.

    Fine-tuning MNNVL per Workload

    You can customize how the NVIDIA Run:ai platform applies the MNNVL feature to each distributed training workload. This allows you to override the default behavior when needed. To configure this behavior, set the proprietary label key run.ai/MNNVL in the General settings section of the distributed training workload submission form. The following values are supported:

    • None - Disables the MNNVL feature for the workload. The platform does not create a ComputeDomain and no pod affinity or node affinity is applied by default.

    • Preferred (default) - Indicates that MNNVL feature is preferred but not required. This is the default behavior when submitting a distributed training workload:

      • If the workload is submitted to a 'non-MNNVL detected' node pool, then the NVIDIA Run:ai platform does not add a ComputeDomain, ComputeDomain claim, pod affinity or node affinity for MNNVL nodes.

      • Otherwise, if the workload is submitted to a 'MNNVL detected' node pool, then the NVIDIA Run:ai platform automatically adds: ComputeDomain, ComputeDomain claim, NodeAffinity and PodAffinity both with a Preferred policy and using the MNNVL label.

      • If you manually add an additional Preferred topology label, it will be given higher scheduling weight than the default embedded pod affinity (which has weight = 1).

    • Required - Enforces a strict use of MNNVL domains for the workload. The workload must be scheduled on MNNVL supported nodes:

      • The NVIDIA Run:ai platform creates a ComputeDomain and ComputeDomain claim.

      • The NVIDIA Run:ai platform will automatically add a node affinity rule with a Required policy using the appropriate label.

      • Pod affinity is set to Preferred by default, but you can override it manually with a Required pod affinity rule using the MNNVL label key or another custom label.

    Known Limitations and Compatibility

    • If the DRA driver is not installed correctly in the cluster, particularly if the required CRDs are missing, and the MNNVL feature is enabled in the NVIDIA Run:ai platform, the workload controller will enter a crash loop. This will continue until the DRA driver is properly installed with all necessary CRDs or the MNNVL feature is disabled in the NVIDIA Run:ai platform.

    • To run workloads on a GB200 node pool (i.e., a node pool detected as MNNVL-enabled), the workload must explicitly request that node pool. To prevent unintentional use of MNNVL-detected node pools, administrators must ensure these node pools are not included in any project's default list of node pools.

    • Only one distributed training workload per node can use GB200 accelerated networking resources. If GPUs remain unused on that node, other workload types may still utilize them.

    • If a GB200 node fails, any associated pod will be re-scheduled, causing the entire distributed workload to fail and restart. On non-GB200 nodes, this scenario may be self-healed by the Scheduler without impacting the entire workload.

    • If a pod from a distributed training workload fails or is evicted by the Scheduler, it must be re-scheduled on the same node. Otherwise, the entire workload will be evicted and, in some cases, re-queued.

    • Elastic distributed training workloads are not supported with MNNVL.

    • Workloads created in versions earlier than 2.21 do not include GB200 MNNVL node pools and are therefore not expected to experience compatibility issues.

    • If a node pool that was previously used in a workload submission is later updated to include GB200 nodes (i.e., becomes a mixed node pool), the workload submitted before version 2.21 will not use any accelerated networking resources, although it may still run on GB200 nodes.

    • For the AI practitioner:

      • Reduced wait time - Workloads with smaller GPU requests are more likely to be scheduled quickly, minimizing delays in accessing resources.

      • Increased workload capacity - More workloads can be run using the same admin-defined GPU quota and available unused resources - over quota.

    • For the platform administrator:

      • Improved GPU utilization - Sharing GPUs across workloads increases the utilization of individual GPUs, resulting in better overall platform efficiency.

      • Higher resource availability - More users gain access to GPU resources, ensuring better distribution.

      • Enhanced workload throughput - More workloads can be served per GPU, ensuring maximum output from existing hardware.

    Quota Planning with GPU Fractions

    When planning the quota distribution for your projects and departments, using fractions gives the platform administrator the ability to allocate more precise quota per project and department, assuming the usage of GPU fractions or enforcing it with pre-defined policies or compute resource templates.

    For example, in an organization with a department budgeted for two nodes of 8×H100 GPUs and a team of 32 researchers:

    • Allocating 0.5 GPU per researcher ensures all researchers have access to GPU resources.

    • Using fractions enables researchers to run smaller workloads intermittently within their quota or go over their quota by using temporary over quota resources with higher resource demanding workloads.

    • Using GPUs for notebook-based model development, where GPUs are not continuously active and can be shared among multiple users.

    For more details on mapping your organization and resources, see Adapting AI initiatives to your organization.

    How GPU Fractions Work

    When a workload is submitted, the Scheduler finds a node with a GPU that can satisfy the requested GPU portion or GPU memory, then it schedules the pod to that node. The NVIDIA Run:ai GPU fractions logic, running locally on each NVIDIA Run:ai worker node, allocates the requested memory size on the selected GPU. Each pod uses its own separate virtual memory address space. NVIDIA Run:ai’s GPU fractions logic enforces the requested memory size, so no workload can use more than requested, and no workload can run over another workload’s memory. This gives users the experience of a ‘logical GPU’ per workload.

    While MIG requires administrative work to configure every MIG slice, where a slice is a fixed chunk of memory, GPU fractions allow dynamic and fully flexible allocation of GPU memory chunks. By default, GPU fractions use NVIDIA’s time-slicing to share the GPU compute runtime. You can also use the NVIDIA Run:ai GPU time-slicing which allows dynamic and fully flexible splitting of the GPU compute time.

    NVIDIA Run:ai GPU fractions are agile and dynamic allowing a user to allocate and free GPU fractions during the runtime of the system, at any size between zero to the maximum GPU portion (100%) or memory size (up to the maximum memory size of a GPU).

    The NVIDIA Run:ai Scheduler can work alongside other schedulers. In order to avoid collisions with other schedulers, the NVIDIA Run:ai Scheduler creates special reservation pods. Once a workload is submitted requesting a fraction of a GPU, NVIDIA Run:ai will create a pod in a dedicated runai-reservation namespace with the full GPU as a resource, allowing other schedulers to understand that the GPU is reserved.
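
    You can see these reservation pods directly, for example:

    # Each reservation pod requests a full GPU on the node it reserves
    kubectl get pods -n runai-reservation -o wide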

    Note

    • Splitting a GPU into fractions may generate some fragmentation of the GPU memory. The Scheduler will try to consolidate GPU resources where feasible (i.e. preemptible workloads).

    • Using bin-pack as a scheduling placement strategy can also reduce GPU fragmentation.

    • Using dynamic GPU fractions ensures that even small unused fragments of GPU memory are utilized by workloads.

    Multi-GPU Fractions

    NVIDIA Run:ai also supports workload submission using multi-GPU fractions. Multi-GPU fractions work similarly to single-GPU fractions, however, the NVIDIA Run:ai Scheduler allocates the same fraction size on multiple GPU devices within the same node. For example, if practitioners develop a new model that uses 8 GPUs and requires 40GB of memory per GPU, they can allocate 8×40GB with multi-GPU fractions instead of reserving the full memory of each GPU (e.g. 80GB). This leaves 40GB of GPU memory available on each of the 8 GPUs for other workloads within that node.

    Time sharing, where a single GPU can serve multiple fractional workloads, remains unchanged; only now a node can serve multiple workloads using multiple GPUs per workload, a single GPU per workload, or a mix of both.

    Deployment Considerations

    • Selecting a GPU portion using percentages as units does not guarantee the exact memory size. This means 50% of an A100 40GB GPU is 20GB, while 50% of an A100 80GB GPU is 40GB. To have better control over the exact allocated memory, specify the exact memory size, i.e. 40GB.

    • Using NVIDIA Run:ai GPU fractions controls the memory split (i.e. 0.5 GPU means 50% of the GPU memory) but not the compute (processing time). To split the compute time, see NVIDIA Run:ai’s GPU time slicing.

    • NVIDIA Run:ai GPU fractions and MIG mode cannot be used on the same node.

    Setting GPU Fractions

    Using the compute resources asset, you can define the compute requirements by specifying your requested GPU portion or GPU memory, and use it with any of the NVIDIA Run:ai workload types for single GPU and multi-GPU fractions.

    • Single-GPU fractions - Define the compute requirement to run 1 GPU device, by specifying either a fraction (percentage) of the overall memory or a memory request (GB, MB).

    • Multi-GPU fractions - Define the compute requirement to run multiple GPU devices, by specifying either a fraction (percentage) of the overall memory or a memory request (GB, MB).

    Setting GPU Fractions for Third-Party Workloads

    To enable GPU fractions for workloads submitted via Kubernetes YAML, use the following annotations to define the GPU fraction configuration. You can configure either gpu-fraction or gpu-memory. Make sure the default scheduler is set to runai-scheduler. See Using the Scheduler with third-party workloads for more details.

    Variable
    Input Format
    Where to Set

    gpu-fraction

    A portion of GPU memory as a double-precision floating-point number. Example: 0.25, 0.75.

    Pod annotation (metadata.annotations)

    gpu-memory

    Memory size in MiB. Example: 2500, 4096. The gpu-memory values are always in MiB.

    Pod annotation (metadata.annotations)

    gpu-fraction-num-devices

    The number of GPU devices to allocate using the specified gpu-fraction or gpu-memory value. Set this annotation only if you want to request multiple GPU devices.

    Pod annotation (metadata.annotations)

    The following example YAML creates a pod that requests 2 GPU devices, each requesting 50% of memory (gpu-fraction: "0.5").
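
    A condensed sketch of such a pod spec, trimmed to the fraction-related fields (the name, namespace and image are illustrative):

    apiVersion: v1
    kind: Pod
    metadata:
      name: multi-fractional-pod-job
      namespace: test
      annotations:
        gpu-fraction: "0.5"                # 50% of GPU memory per device
        gpu-fraction-num-devices: "2"      # allocate the fraction on 2 GPU devices
      labels:
        runai/queue: test                  # the project queue the pod is submitted to
    spec:
      schedulerName: runai-scheduler       # make sure the NVIDIA Run:ai Scheduler is used
      containers:
      - name: job
        image: gcr.io/run-ai-demo/quickstart-cuda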

    Using CLI

    To view the available actions, go to the CLI v2 reference or the CLI v1 reference and run according to your workload.

    Using API

    To view the available actions, go to the API reference and run according to your workload.

    workload

    Every newly created pod is assigned to a pod group, which can represent one or multiple pods within a workload. For example, a distributed PyTorch training workload with 32 workers is grouped into a single pod group. All pods are attached to the pod group with certain rules, such as gang scheduling, applied to the entire pod group.

    Scheduling Queue

    A scheduling queue (or simply a queue) represents a scheduler primitive that manages the scheduling of workloads based on different parameters.

    A queue is created for each project/node pool pair and department/node pool pair. The NVIDIA Run:ai Scheduler supports hierarchical queueing: project queues are bound to department queues, per node pool. This allows an organization to manage quota, over quota and more for projects and their associated departments.

    Resource Management

    Quota

    Each project and department includes a set of deserved resource quotas, per node pool and resource type. For example, project “LLM-Train/Node Pool NV-H100” quota parameters specify the number of GPUs, CPUs(cores), and the amount of CPU memory that this project deserves to get when using this node pool. Non-preemptible workloads can only be scheduled if their requested resources are within the deserved resource quotas of their respective project/node-pool and department/node-pool.

    Over Quota

    Projects and departments can have a share in the unused resources of any node pool, beyond their quota of deserved resources. These resources are referred to as over quota resources. The administrator configures the over quota parameters per node pool for each project and department.

    Over Quota Weight

    Projects can receive a share of the cluster/node pool unused resources when the over quota weight setting is enabled. The part each Project receives depends on its over quota weight value, and the total weights of all other projects’ over quota weights. The administrator configures the over quota weight parameters per node pool for each project and department.
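
    For example, if a node pool has 12 unused GPUs and two projects compete for them with over quota weights of 3 and 1, the first project is eligible for up to 9 of the unused GPUs and the second for up to 3, in proportion to their weights.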

    Multi-Level Quota System

    Each project has a set of guaranteed resource quotas (GPUs, CPUs, and CPU memory) per node pool. Projects can go over quota and get a share of the unused resources in a node pool beyond their guaranteed quota in that node pool. The same applies to departments. The Scheduler balances the amount of over quota between departments, and then between projects. The department’s deserved quota and over quota limit the sum of resources of all projects within the department. Even if a project still has deserved quota available, once its department’s deserved quota is exhausted, the Scheduler will not grant the project any more deserved resources. The same applies to over quota resources: over quota resources are first given to the department, and only then split among its projects.

    Fairshare and Fairshare Balancing

    The NVIDIA Run:ai Scheduler calculates a numerical value, fairshare, per project (or department) for each node pool, representing the project’s (department’s) sum of guaranteed resources plus the portion of non-guaranteed resources in that node pool.

    The Scheduler aims to provide each project (or department) the resources they deserve per node pool using two main parameters: deserved quota and deserved fairshare (i.e. quota + over quota resources). If one project’s node pool queue is below fairshare and another project’s node pool queue is above fairshare, the Scheduler shifts resources between queues to balance fairness. This may result in the preemption of some over quota preemptible workloads.

    Over-Subscription

    Over-subscription is a scenario where the sum of all guaranteed resource quotas surpasses the physical resources of the cluster or node pool. In this case, there may be scenarios in which the Scheduler cannot find matching nodes to all workload requests, even if those requests were within the resource quota of their associated projects.

    Placement Strategy - Bin-Pack and Spread

    The administrator can set a placement strategy, bin-pack or spread, for the Scheduler per node pool. GPU-based workloads can request both GPU and CPU resources, while CPU-only workloads can request CPU resources only.

    • GPU workloads:

      • Bin-pack - The Scheduler places as many workloads as possible in each GPU and node to use fewer resources and maximize GPU and node vacancy.

      • Spread - The Scheduler spreads workloads across as many GPUs and nodes as possible to minimize the load and maximize the available resources per workload.

    • CPU workloads:

      • Bin-pack - The Scheduler places as many workloads as possible in each CPU and node to use fewer resources and maximize CPU and node vacancy.

      • Spread - The Scheduler spreads workloads across as many CPUs and nodes as possible to minimize the load and maximize the available resources per workload.

    Scheduling Principles

    Priority and Preemption

    NVIDIA Run:ai supports scheduling workloads using different priority and preemption policies:

    • High-priority workloads (pods) can preempt lower priority workloads (pods) within the same scheduling queue (project), according to their preemption policy. The NVIDIA Run:ai Scheduler implicitly assumes any PriorityClass >= 100 is non-preemptible and any PriorityClass < 100 is preemptible.

    • Cross project and cross department workload preemptions are referred to as resource reclaim and are based on fairness between queues rather than the priority of the workloads.

    To make it easier for users to submit workloads, NVIDIA Run:ai preconfigured several Kubernetes PriorityClass objects. The NVIDIA Run:ai preset PriorityClass objects have their ‘preemptionPolicy’ always set to ‘PreemptLowerPriority’, regardless of their actual NVIDIA Run:ai preemption policy within the NVIDIA Run:ai platform. A non-preemptible workload is only scheduled if in-quota and cannot be preempted after being scheduled, not even by a higher priority workload.

    PriorityClass Name
    PriorityClass
    NVIDIA Run:ai preemption policy
    K8s preemption policy

    Inference ()

    125

    Non-preemptible

    PreemptLowerPriority

    Build ()

    100

    Non-preemptible

    PreemptLowerPriority

    Interactive-preemptible ()

    75

    Preemptible

    PreemptLowerPriority
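
    To see which priority classes are installed in the cluster, you can list them with a standard Kubernetes command (the NVIDIA Run:ai preset classes appear alongside the Kubernetes defaults):

    kubectl get priorityclass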

    Note

    You can override the default priority class of a workload. See Workload priority class control for more details.

    Preemption of Lower Priority Workloads Within a Project

    Workload priority is always respected within a project. This means higher priority workloads are scheduled before lower priority workloads. It also means that higher priority workloads may preempt lower priority workloads within the same project if the lower priority workloads are preemptible.

    Fairness (Fair Resource Distribution)

    Fairness is a major principle within the NVIDIA Run:ai scheduling system. It means that the NVIDIA Run:ai Scheduler always respects certain resource splitting rules (fairness) between projects and between departments.

    Reclaim of Resources Between Projects and Departments

    Reclaim is an inter-project (and inter-department) scheduling action that takes resources back from a project (or department) that has used them as over quota and returns them to a project (or department) that deserves those resources as part of its deserved quota, or to balance fairness between projects so that each reaches its fairshare (i.e. fairly sharing the portion of the unused resources).

    Gang Scheduling

    Gang scheduling describes a scheduling principle where a workload composed of multiple pods is either fully scheduled (i.e. all pods are scheduled and running) or fully pending (i.e. all pods are not running). Gang scheduling refers to a single pod group.

    Next Steps

    Now that you have learned the key concepts and principles of the NVIDIA Run:ai Scheduler, see how the Scheduler works - allocating pods to workloads, applying preemption mechanisms, and managing resources.

    submits a workload
    Introduction to workloads
    Workloads

    Data volumes are disabled by default. If you cannot see Data volumes, they must be enabled by your Administrator, under General settings → Workloads → Data volumes.

  • Data volumes are supported only for flexible workload submission.

  • Why Use a Data Volume?

    1. Sharing with multiple scopes - Data volumes can be shared across different scopes in a cluster, including projects and departments. Using data volumes allows for data reuse and collaboration within the organization.

    2. Storage saving - A single copy of the data can be used across multiple scopes

    Typical Use Cases

    1. Sharing large datasets - In large organizations, the data is often stored in a remote location, which can be a barrier for large model training. Even if the data is transferred into the cluster, sharing it easily with multiple users is still challenging. Data volumes can help share the data seamlessly, with maximum security and control.

    2. Sharing data with colleagues - When sharing training results, generated datasets, or other artifacts with team members is needed, data volumes can help make the data available easily.

    Prerequisites

    To create a data volume, you must have a PVC data source already created. Make sure the PVC includes data before sharing it.
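
    To verify that the PVC exists and is bound before creating the data volume, you can check it in the project’s namespace (a sketch, assuming the default runai-<project name> namespace naming):

    kubectl get pvc -n runai-<project name>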

    Data Volumes Table

    The data volumes table can be found under Workload manager in the NVIDIA Run:ai platform.

    The data volumes table provides a list of all the data volumes defined in the platform and allows you to manage them.

    The data volumes table comprises the following columns:

    Column
    Description

    Data volume

    The name of the data volume

    Description

    A description of the data volume

    Status

    The different lifecycle and representation of the data volume condition

    Scope

    The scope of the data source within the organizational tree. Click the scope name to view the organizational tree diagram

    Origin project

    The project of the origin PVC

    Origin PVC

    The original PVC from which the data volume was created, pointing to the same PV

    Data Volumes Status

    The following table describes the data volumes' condition and whether they were created successfully for the selected scope.

    Status
    Description

    No issues found

    No issues were found while creating the data volume

    Issues found

    Issues were found while sharing the data volume. Contact NVIDIA Run:ai support.

    Creating…

    The data volume is being created

    Deleting...

    The data volume is being deleted

    No status / “-”

    When the data volume’s scope is an account, the current version of the cluster is not up to date, or the asset is not a cluster-syncing entity, the status can’t be displayed

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    • Refresh - Click REFRESH to update the table with the latest data

    Adding a New Data Volume

    To create a new data volume:

    1. Click +NEW DATA VOLUME

    2. Enter a name for the data volume. The name must be unique.

    3. Optional: Provide a description of the data volume

    4. Set the project where the data is located

    5. Set a PVC from which to create the data volume

    6. Set the scope that will be able to mount the data volume

    7. Click CREATE DATA VOLUME

    Editing a Data Volume

    To edit a data volume:

    1. Select the data volume you want to edit

    2. Click Edit

    3. Click SAVE DATA VOLUME

    Copying a Data Volume

    To copy an existing data volume:

    1. Select the data volume you want to copy

    2. Click MAKE A COPY

    3. Enter a name for the data volume. The name must be unique.

    4. Set a new Origin PVC for your data volume, since only one Origin PVC can be used per data volume

    5. Click CREATE DATA VOLUME

    Deleting a Data Volume

    To delete a data volume:

    1. Select the data volume you want to delete

    2. Click DELETE

    3. Confirm you want to delete the data volume

    Note

    It is not possible to delete a data volume being used by an existing workload.

    Using API

    To view the available actions, go to the Data volumes API reference.


    What’s New in Version 2.21

    The NVIDIA Run:ai v2.21 release notes provide a detailed summary of the latest features, enhancements, and updates introduced in this version. They serve as a guide to help users, administrators, and researchers understand the new capabilities and how to leverage them for improved workload management, resource optimization, and more.

    Important

    For a complete list of deprecations, see Deprecation notifications. Deprecated features and capabilities will be available for two versions ahead of the notification.

    AI Practitioners

    Flexible Workload Submission

    Streamlined workload submission with a customizable form - The new customizable submission form allows you to submit workloads by selecting and modifying an existing setup or providing your own settings. This enables faster, more accurate submissions that align with organizational policies and individual workload needs. Beta From cluster v2.18 onward

    Feature high level details:

    • Flexible submission options - Choose from an existing setup and customize it, or start from scratch and provide your own settings for a one-time setup.

    • Improved visibility - Review existing setups and understand their associated policy definitions.

    • One-time data sources setup - Configure a data source as part of your one-time setup for a specific workload.

    • Unified experience - Use the new form for all workload types: workspaces, standard training, distributed training, and custom inference

    Workspaces and Training

    • Support for JAX distributed training workloads - You can now submit distributed training workloads using the JAX framework via the UI, API, and CLI. This enables you to leverage JAX for scalable, high-performance training, making it easier to run and manage JAX-based workloads seamlessly within NVIDIA Run:ai. See Train models using a distributed training workload for more details. From cluster v2.21 onward

    • Pod restart policy for all workload types - A restart policy can be configured to define how pods are restarted when they terminate. The policy is set at the workload level across all workload types via the API and CLI. For distributed training workloads, restart policies can be set separately for master and worker pods. This enhancement ensures workloads are restarted efficiently, minimizing downtime and optimizing resource usage. From cluster v2.21 onward

    Workload Assets

    • New environment presets - Added new NVIDIA Run:ai environment presets when running in a host-based routing cluster - vscode, rstudio, jupyter-scipy, tensorboard-tensorflow. See Environments for more details. From cluster v2.21 onward

    • Support for PVC size expansion - Adjust the size of Persistent Volume Claims (PVCs) via the API, leveraging the allowVolumeExpansion field of the storage class resource. This enhancement enables you to dynamically adjust storage capacity as needed.

    • Improved visibility of storage class configurations - When creating new PVCs or volumes, the UI now displays access modes, volume modes, and size options based on administrator-defined storage class configurations. This update ensures consistency, increases transparency, and helps prevent misconfigurations during setup.

    Command-line Interface (CLI v2)

    • New default CLI - CLI v2 is the default command-line interface. CLI v1 has been deprecated as of version 2.20.

    • Secret volume mapping for workloads - You can now map secrets to volumes when submitting workloads using the --secret-volume flag. This feature is available for all workload types - workspaces, training, and inference.

    • Support for environment field references in submit commands - A new flag, fieldRef, has been added to all submit commands to support environment field references in a key:value format. This enhancement enables dynamic injection of environment variables directly from pod specifications, offering greater flexibility during workload submission.

    ML Engineers

    Workloads - Inference

    • Support for inference workloads via CLI v2 - You can now run inference workloads directly from the command-line interface. This update enables greater automation and flexibility for managing inference workloads. See runai inference for more details.

    • Enhanced rolling inference updates - Rolling inference updates allow ML engineers to apply live updates to existing inference workloads—regardless of their current status (e.g., running or pending)—without disrupting critical services. Experimental

      • This capability is now supported for both Hugging Face and custom inference workloads, with a new UI flow that aligns with the API functionality introduced in v2.19.

    Platform Administrators

    Analytics

    • Enhancements to the Overview dashboard - The Overview dashboard includes optimization insights for projects and departments, providing real-time visibility into GPU resource allocation and utilization. These insights help department and project managers make more informed decisions about quota management, ensuring efficient resource usage.

    • Dashboard UX improvements:

      • Improved visibility of metrics in the Resources utilization widget by repositioning them above the graphs.
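      • Added a new Idle workloads table widget to help you easily identify and manage underutilized resources.

      • Renamed and updated the "Workloads by type" widget to provide clearer insights into cluster usage with a focus on workloads.

      • Improved user experience by moving the date picker to a dedicated section within the overtime widgets, Resources allocation and Resources utilization.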

    Organizations - Projects/Departments

    • Enhanced resource prioritization for projects and departments - Admins can now define and manage SLAs tailored to specific departments and projects via the UI, ensuring resource allocation aligns with real business priorities. This enhancement empowers admins to assign strict priority to over-quota resources, extending control beyond the existing over-quota weight system. From cluster v2.20 onward

      This feature allows administrators to:

      • Set the priority of each department relative to other departments within the same node pool.
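      • Define the priority of projects within a department, on a per-node pool basis.

      • Set specific GPU resource limits for both departments and projects.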

    Audit Logs

    Updated access control for audit logs - Only users with tenant-wide permissions have the ability to access audit logs, ensuring proper access control and data security. This update reinforces security and compliance by restricting access to sensitive system logs. It ensures that only authorized users can view audit logs, reducing the risk of unauthorized access and potential data exposure.

    Notifications

    Slack API integration for notifications - A new API allows organizations to receive notifications directly to Slack. This feature enhances real-time communication and monitoring by enabling users to stay informed about workload statuses. See Configuring Slack notifications for more details.

    Authentication and Authorization

    • Improved visibility into user roles and access scopes - Individual users can now view their assigned roles and scopes directly in their settings. This enhancement provides greater transparency into user permissions, allowing individuals to easily verify their access levels. It helps users understand what actions they can perform and reduces dependency on administrators for access-related inquiries. See Access rules for more details.

    • Added auto-redirect to SSO - To deliver a consistent and streamlined login experience across customer applications, users accessing the NVIDIA Run:ai login page will be automatically redirected to SSO, bypassing the standard login screen entirely. This can be enabled via a toggle after an Identity Provider is added, and is available through both the UI and API. See Single Sign-On (SSO) for more details.

    • SAML service provider metadata XML - After configuring SAML IDP, the service provider metadata XML is now available for download to simplify integration with identity providers. See Set up SSO with SAML for more details.

    Data & Storage

    Added Data volumes to the UI - Administrators can now create and manage data volumes directly from the UI and share data across different scopes in a cluster, including projects and departments. See Data volumes for more details. Experimental From cluster v2.19 onward

    Infrastructure Administrators

    NVIDIA Datacenter GPUs - Grace-Blackwell

    Support for NVIDIA GB200 NVL72 and MultiNode NVLink systems - NVIDIA Run:ai offers full support for NVIDIA’s most advanced MultiNode NVLink (MNNVL) systems, including NVIDIA GB200, NVIDIA GB200 NVL72 and its derivatives. NVIDIA Run:ai simplifies the complexity of managing and submitting workloads on these systems by automating infrastructure detection, domain labeling, and distributed job submission via the UI, CLI, or API. See Using GB200 NVL72 and Multi-Node NVLink Domains for more details. From cluster v2.21 onward

    Advanced Cluster Configurations

    Automatic cleanup of resources for failed workloads - When a workload fails due to infrastructure issues, its resources can be automatically cleaned up using failureResourceCleanupPolicy, reducing the resource consumption of failed workloads. For more details, see Advanced cluster configurations. From cluster v2.21 onward

    Advanced Setup

    Custom pod labels and annotations - Add custom labels and annotations to pods in both the control plane and cluster. This new capability enables service mesh deployment in NVIDIA Run:ai. This feature provides greater flexibility in workload customization and management, allowing users to integrate with service meshes more easily. See Service mesh for more details.

    System Requirements

    • NVIDIA Run:ai now supports NVIDIA GPU Operator version 25.3.

    • NVIDIA Run:ai now supports OpenShift version 4.18.

    • NVIDIA Run:ai now supports Kubeflow Training Operator 1.9.

    • Kubernetes version 1.29 is no longer supported.

    Deprecation Notifications

    Cluster API for Workload Submission

    Using the Cluster API to submit NVIDIA Run:ai workloads via YAML was deprecated starting from NVIDIA Run:ai version 2.18. For cluster version 2.18 and above, use the Run:ai REST API to submit workloads. The Cluster API documentation has also been removed from v2.20 and above.

    Control Plane System Requirements

    The NVIDIA Run:ai control plane is a Kubernetes application. This section explains the required hardware and software system requirements for the NVIDIA Run:ai control plane. Before you start, make sure to review the Installation overview.

    Installer Machine

    The machine running the installation script (typically the Kubernetes master) must have:

    • At least 50GB of free space

    • Docker installed

    • Helm 3.14 or later

    Note

    If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai software artifacts include the Helm binary.

    Hardware Requirements

    The following hardware requirements are for the control plane system nodes. By default, all NVIDIA Run:ai control plane services run on all available nodes.

    Architecture

    • x86 - Supported for both Kubernetes and OpenShift deployments.

    • ARM - Supported for Kubernetes only. ARM is currently not supported for OpenShift.

    NVIDIA Run:ai Control Plane - System Nodes

    This configuration is the minimum requirement you need to install and use NVIDIA Run:ai control plane:

    • CPU - 10 cores

    • Memory - 12GB

    • Disk space - 110GB

    Note

    To designate nodes to NVIDIA Run:ai system services, follow the instructions as described in System nodes.

    If the NVIDIA Run:ai control plane is planned to be installed on the same Kubernetes cluster as the NVIDIA Run:ai cluster, make sure the cluster hardware requirements are considered in addition to the NVIDIA Run:ai control plane hardware requirements.

    Software Requirements

    The following software requirements must be fulfilled.

    Operating System

    • Any Linux operating system supported by both Kubernetes and NVIDIA GPU Operator

    • Internal tests are being performed on Ubuntu 22.04 and CoreOS for OpenShift.

    Network Time Protocol

    Nodes are required to be synchronized by time using NTP (Network Time Protocol) for proper system functionality.
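    As a quick check, you can verify that a node's clock is synchronized, for example with systemd's timedatectl (assuming a systemd-based Linux distribution):

    timedatectl status   # "System clock synchronized: yes" indicates the node clock is NTP-synchronized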

    Kubernetes Distribution

    NVIDIA Run:ai control plane requires Kubernetes. The following Kubernetes distributions are supported:

    • Vanilla Kubernetes

    • OpenShift Container Platform (OCP)

    • NVIDIA Base Command Manager (BCM)

    • Elastic Kubernetes Engine (EKS)

    • Google Kubernetes Engine (GKE)

    • Azure Kubernetes Service (AKS)

    • Oracle Kubernetes Engine (OKE)

    • Rancher Kubernetes Engine (RKE1)

    • Rancher Kubernetes Engine 2 (RKE2)

    Note

    The latest release of the NVIDIA Run:ai control plane supports Kubernetes 1.30 to 1.32 and OpenShift 4.14 to 4.18.

    See the following Kubernetes version support matrix for the latest NVIDIA Run:ai releases:

    • v2.21 (latest) - Kubernetes 1.30 to 1.32, OpenShift 4.14 to 4.18

    • v2.20 - Kubernetes 1.29 to 1.32, OpenShift 4.14 to 4.17

    • v2.19 - Kubernetes 1.28 to 1.31, OpenShift 4.12 to 4.17

    • v2.18 - Kubernetes 1.28 to 1.30, OpenShift 4.12 to 4.16

    • v2.17 - Kubernetes 1.27 to 1.29, OpenShift 4.12 to 4.15

    For information on supported versions of managed Kubernetes, it's important to consult the release notes provided by your Kubernetes service provider. There, you can confirm the specific version of the underlying Kubernetes platform supported by the provider, ensuring compatibility with NVIDIA Run:ai. For an up-to-date end-of-life statement, see Kubernetes Release History or OpenShift Container Platform Life Cycle Policy.

    NVIDIA Run:ai Namespace

    The NVIDIA Run:ai control plane uses a namespace or project (OpenShift) called runai-backend. Use the following to create the namespace/project:
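    Kubernetes:

    kubectl create namespace runai-backend

    OpenShift:

    oc new-project runai-backend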

    Default Storage Class

    Note

    Default storage class applies for Kubernetes only.

    The NVIDIA Run:ai control plane requires a default storage class to create persistent volume claims for NVIDIA Run:ai storage. The storage class, as per Kubernetes standards, controls the reclaim behavior, whether the NVIDIA Run:ai persistent data is saved or deleted when the NVIDIA Run:ai control plane is deleted.

    Note

    For a simple (non-production) storage class example see Kubernetes Local Storage Class. The storage class will set the directory /opt/local-path-provisioner to be used across all nodes as the path for provisioning persistent volumes. Then set the new storage class as default:
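    kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'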

    Kubernetes Ingress Controller

    Note

    Installing ingress controller applies for Kubernetes only.

    The NVIDIA Run:ai control plane requires an ingress controller to be installed.

    • OpenShift, RKE and RKE2 come with a pre-installed ingress controller.

    • Internal tests are being performed on NGINX, Rancher NGINX, OpenShift Router, and Istio.

    • Make sure that a default ingress controller is set.

    There are many ways to install and configure different ingress controllers. The following shows a simple example to install and configure the NGINX ingress controller using Helm:

    Vanilla Kubernetes

    Run the following commands:

    • For cloud deployments, both the internal IP and external IP are required.

    • For on-prem deployments, only the external IP is needed.
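    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo update
    helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
        --namespace nginx-ingress --create-namespace \
        --set controller.kind=DaemonSet \
        --set controller.service.externalIPs="{<INTERNAL-IP>,<EXTERNAL-IP>}" # Replace <INTERNAL-IP> and <EXTERNAL-IP> with the internal and external IP addresses of one of the nodes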

    Managed Kubernetes (EKS, GKE, AKS)

    Run the following commands:
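    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo update
    helm install nginx-ingress ingress-nginx/ingress-nginx \
        --namespace nginx-ingress --create-namespace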

    Oracle Kubernetes Engine (OKE)

    Run the following commands:
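    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo update
    helm install nginx-ingress ingress-nginx/ingress-nginx \
        --namespace ingress-nginx --create-namespace \
        --set controller.service.annotations.oci.oraclecloud.com/load-balancer-type=nlb \
        --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/is-preserve-source=True \
        --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/security-list-management-mode=None \
        --set controller.service.externalTrafficPolicy=Local \
        --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/subnet=<SUBNET-ID> # Replace <SUBNET-ID> with the subnet ID of one of your cluster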

    Fully Qualified Domain Name (FQDN)

    Note

    Fully Qualified Domain Name applies for Kubernetes only.

    You must have a Fully Qualified Domain Name (FQDN) to install the NVIDIA Run:ai control plane (ex: runai.mycorp.local). This cannot be an IP. The FQDN must be resolvable within the organization's private network.

    TLS Certificate

    Kubernetes

    You must have a TLS certificate that is associated with the FQDN for HTTPS access. Create a Kubernetes Secret named runai-backend-tls in the runai-backend namespace and include the path to the TLS --cert and its corresponding private --key by running the following:
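    kubectl create secret tls runai-backend-tls -n runai-backend \
      --cert /path/to/fullchain.pem \ # Replace /path/to/fullchain.pem with the actual path to your TLS certificate
      --key /path/to/private.pem # Replace /path/to/private.pem with the actual path to your private key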

    OpenShift

    NVIDIA Run:ai uses the OpenShift default Ingress router for serving. The TLS certificate configured for this router must be issued by a trusted CA. For more details, see the OpenShift documentation on configuring certificates.

    Local Certificate Authority

    A local certificate authority serves as the root certificate for organizations that cannot use publicly trusted certificate authority. Follow the below steps to configure the local certificate authority.

    In air-gapped environments, you must configure and install the local CA's public key in the Kubernetes cluster. This is required for the installation to succeed:

    1. Add the public key to the runai-backend namespace:
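    Kubernetes:

    kubectl -n runai-backend create secret generic runai-ca-cert \
        --from-file=runai-ca.pem=<ca_bundle_path>

    OpenShift:

    oc -n runai-backend create secret generic runai-ca-cert \
        --from-file=runai-ca.pem=<ca_bundle_path>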

    2. When installing the control plane, make sure the following flag is added to the helm command: --set global.customCA.enabled=true. See Install control plane.

    External Postgres Database (Optional)

    The NVIDIA Run:ai control plane installation includes a default PostgreSQL database. However, you may opt to use an existing PostgreSQL database if you have specific requirements or preferences, as detailed in External Postgres database configuration. Please ensure that your PostgreSQL database is version 16 or higher.

    Launching Workloads with GPU Fractions

    This quick start provides a step-by-step walkthrough for running a Jupyter Notebook workspace using GPU fractions.

    NVIDIA Run:ai’s GPU fractions provide an agile and easy-to-use method to share a GPU or multiple GPUs across workloads. With GPU fractions, you can divide the GPU/s memory into smaller chunks and share the GPU/s compute resources between different workloads and users, resulting in higher GPU utilization and more efficient resource allocation. Smaller and dynamic resource allocations also give the Scheduler a higher chance of finding GPU resources for incoming workloads.

    Prerequisites

    Before you start, make sure:

    • You have created a project or have one created for you.

    • The project has an assigned quota of at least 0.5 GPU.

    Note

    Flexible workload submission is disabled by default. If unavailable, your administrator must enable it under General Settings → Workloads → Flexible workload submission.

    Step 1: Logging In
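    Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

    To use the CLI, run the login command with --help to obtain the login options and log in according to your setup. To use the API, you will need to obtain a token as shown in API authentication.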

    Step 2: Submitting a Workspace

    1. Go to the Workload manager → Workloads

    2. Click +NEW WORKLOAD and select Workspace

    3. Select under which cluster to create the workload

    4. Select the project in which your workspace will run

    Step 3: Connecting to the Jupyter Notebook

    1. Select the newly created workspace with the Jupyter application that you want to connect to

    2. Click CONNECT

    3. Select the Jupyter tool. The selected tool is opened in a new tab on your browser.

    To connect to the Jupyter Notebook, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>

    Next Steps

    Manage and monitor your newly created workload using the Workloads table.

    Workload Policies

    This section explains the procedure to manage workload policies.

    Workload Policies Table

    The Workload policies table can be found under Policies in the NVIDIA Run:ai platform.

    Note

    Workload policies are disabled by default. If you cannot see Workload policies in the menu, then it must be enabled by your administrator, under General settings → Workloads → Policies

    The Workload policies table provides a list of all the policies defined in the platform, and allows you to manage them.

    The Workload policies table consists of the following columns:

    • Policy - The policy name, which is a combination of the policy scope and the policy type

    • Type - The policy type per NVIDIA Run:ai workload type. This allows administrators to set different policies for each workload type.

    • Status - Representation of the policy lifecycle (one of the following: “Creating…”, “Updating…”, “Deleting…”, Ready or Failed)

    • Scope - The scope the policy affects. Click the name of the scope to view the organizational tree diagram. You can only view the parts of the organizational tree for which you have permission to view.

    • Created by - The user who created the policy

    • Creation time - The timestamp for when the policy was created

    • Last updated - The last time the policy was updated

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    • Refresh - Click REFRESH to update the table with the latest data

    Adding a Policy

    To create a new policy:

    1. Click +NEW POLICY

    2. Select a scope

    3. Select the workload type

    4. Click +POLICY YAML

    5. In the YAML editor, type or paste a YAML policy with defaults and rules. You can utilize the following references and examples:

      • Policy YAML reference

      • Policy YAML examples

    6. Click SAVE POLICY

    Editing a Policy

    1. Select the policy you want to edit

    2. Click EDIT

    3. Update the policy and click APPLY

    4. Click SAVE POLICY

    Troubleshooting

    Listed below are issues that might occur when creating or editing a policy via the YAML Editor:

    • Policy can’t be saved for some reason - Message: The policy couldn't be saved due to a network or other unknown issue. Download your draft and try pasting and saving it again later. Mitigation: Possible cluster connectivity issues. Try updating the policy once again at a different time.

    • Policies were submitted before version 2.18, you upgraded to version 2.18 or above and wish to submit new policies - Message: If you have policies and want to create a new one, first contact NVIDIA Run:ai support to prevent potential conflicts. Mitigation: Contact NVIDIA Run:ai support. R&D can migrate your old policies to the new version.

    • Cluster connectivity issues - Message: There's no communication from cluster “cluster_name“. Actions may be affected, and the data may be stale. Mitigation: Verify that you are on a network that has been allowed access to the cluster. Reach out to your cluster administrator for instructions on verifying the issue.

    • Policy can’t be applied due to a rule that is occupied by a different policy - Message: Field “field_name” already has rules in cluster: “cluster_id”. Mitigation: Remove the rule from the new policy or adjust the old policy for the specific rule.

    • Policy is not visible in the UI - Mitigation: Check that the policy hasn’t been deleted.

    • Policy syntax is not valid - Message: Add a valid policy YAML; json: unknown field "field_name". Mitigation: For correct syntax, check the Policy YAML reference or the Policy YAML examples.

    Viewing a Policy

    To view a policy:

    1. Select the policy for which you want to view its rules and defaults.

    2. Click VIEW POLICY

    3. In the Policy form per workload section, view the workload rules and defaults:

      • Parameter - The workload submission parameter that Rules and Defaults are applied to

      • Type (applicable for data sources only) - The data source type (Git, S3, nfs, pvc, etc.)

      • Default - The default value of the Parameter

      • Rule - Set up constraints on workload policy fields

      • Source - The origin of the applied policy (cluster, department or project)

    Note

    Some of the rules and defaults may be derived from policies of a parent cluster and/or department. You can see the source of each rule in the policy form. For more information, check the Scope of effectiveness documentation.

    Deleting a Policy

    1. Select the policy you want to delete

    2. Click DELETE

    3. On the dialog, click DELETE to confirm the deletion

    Using API

    Go to the API reference to view the available actions.

    Advanced Control Plane Configurations

    Helm Chart Values

    The NVIDIA Run:ai control plane installation can be customized to support your environment via Helm values files or Helm install flags. Make sure to restart the relevant NVIDIA Run:ai pods so they can fetch the new configurations.
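    For example, a minimal sketch of applying customized values, assuming the control plane is installed as a Helm release named runai-backend in the runai-backend namespace (the release name, chart reference, and values file name are illustrative):

    # Apply a custom values file to the existing control plane release (illustrative names)
    helm upgrade -i runai-backend runai-backend/control-plane \
        --namespace runai-backend \
        -f custom-values.yaml
    # Then restart the relevant NVIDIA Run:ai pods so they fetch the new configuration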

    Key
    Change
    Description

    If other, optional user attributes (groups, firstName, lastName, uid, gid) are mapped make sure they also exist under <saml2:AttributeStatement> along with their respective values.

  • If any of the targeted node pools do not support MNNVL or if the workload (or any of its pods) does not request GPU resources, the workload will fail to run.





    helm upgrade -i runai-cluster runai-cluster-<VERSION>.tgz \
        --set controlPlane.url=... \
        --set controlPlane.clientSecret=... \
        --set cluster.uid=... \
        --set cluster.url=... --create-namespace \
        --set global.image.registry=registry.mycompany.local \
        --set global.customCA.enabled=true
    Enhanced failure status details for workloads - When a workload is marked as "Failed", clicking the “i” icon next to the status provides detailed failure reasons, with clear explanations across compute, network, and storage resources. This enhancement improves troubleshooting efficiency, and helps you quickly diagnose and resolve issues, leading to faster workload recovery. From cluster v2.21 onward
  • Workload priority class management for training workloads - You can now change the default priority class of training workloads within a project, via the API or CLI, by selecting from predefined priority class values. This influences the workload’s position in the project scheduling queue managed by the Run:ai Scheduler, ensuring critical training jobs are prioritized and resources are allocated more efficiently. See Workload priority class control for more details. From cluster v2.18 onward

  • ConfigMaps as environment variables - Use predefined ConfigMaps as environment variables during environment setup or workload submission. From cluster v2.21 onward

  • Improved scope selection experience - The scope mechanism has been improved to reduce clicks and enhance usability. The organization tree now opens by default at the cluster level for quicker navigation. Scope search now includes alphabetical sorting and supports browsing non-displayed scopes. You can also use keyboard shortcuts: Escape to cancel, or click outside the modal to close it. These improvements apply across templates, policies, projects, and all workload assets.

  • Improved PVC visibility and selection for researchers - Use runai pvc to list existing PVCs within your scope, making it easier to reference available options when submitting workloads. A noun auto-completion has been introduced for storage, streamlining the selection process. The workload describe command also includes a PVC section, improving visibility into persistent volume claims. These enhancements provide greater clarity and efficiency in storage utilization.

  • Enhanced workload deletion options - The runai workload delete command now supports deleting multiple workloads by specifying a list of workload names (e.g., workload-a, workload-b, workload-c).

  • Compute resources can now be updated via API and UI. From cluster v2.21 onward

  • Support for NVIDIA Cloud Functions (NVCF) external workloads - NVIDIA Run:ai enables you to deploy, schedule and manage NVCF workloads as external workloads within the platform. See Deploy NVIDIA Cloud Functions (NVCF) in NVIDIA Run:ai for more details. From cluster v2.21 onward

  • Added validation for Knative - You can now only submit inference workloads if Knative is properly installed. This ensures workloads are deployed successfully by preventing submission when Knative is misconfigured or missing. From cluster v2.21 onward

  • Enhancements in Hugging Face workloads. For more details, see Deploy inference workloads from Hugging Face:

    • Added Hugging Face model authentication - NVIDIA Run:ai validates whether a user-provided token grants access to a specific model, in addition to checking if a model requires a token and verifying the token format. This enhancement ensures that users can only load models they have permission to access, improving security and usability. From cluster v2.18 onward

    • Introduced model store support using data sources - Select a data source to serve as a model store, caching model weights to reduce loading time and avoid repeated downloads. This improves performance and deployment speed, especially for frequently used models, minimizing the need to re-authenticate with external sources.

    • Improved model selection - Select a model from a drop-down list. The list is partial and consists only of models that were tested. From cluster v2.18 onward

    • Enhanced Hugging Face environment control - Choose between vLLM, TGI, or any other custom container image by selecting an image tag and providing additional arguments. By default, workloads use the official vLLM or TGI containers, with full flexibility to override the image and customize runtime settings for more controlled and adaptable inference deployments. From cluster v2.18 onward

  • Updated authentication for NIM model access - You can now authenticate access to NIM models using tokens or credentials, ensuring a consistent, flexible, and secure authentication process. See Deploy inference workloads with NVIDIA NIM for more details. From cluster v2.19 onward

  • Added support for volume configuration - You can now set volumes for custom inference workloads. This feature allows inference workloads to allocate and retain storage, ensuring continuity and efficiency in inference execution. From cluster v2.20 onward

  • Expanded SSO OpenID Connect authentication support - SSO OpenID Connect authentication supports attribute mapping of groups in both list and map formats. In map format, the group name is used as the value. This applies to new identity providers only. See Set up SSO with OpenID Connect for more details.

  • Improved permission error messaging - Enhanced clarity when attempting to delete a user with higher privileges, making it easier to understand and resolve permission-related actions.

  • Select Start from scratch to launch a new workspace quickly

  • Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  • Under Submission, select Flexible and click CONTINUE

  • Click the load icon. A side pane appears, displaying a list of available environments. Select the ‘jupyter-lab’ environment for your workspace (Image URL: jupyter/scipy-notebook)

    • If ‘jupyter-lab’ is not displayed in the gallery, follow the below steps to create a one-time environment configuration:

      • Enter the jupyter-lab Image URL - jupyter/scipy-notebook

      • Tools - Set the connection for your tool

        • Click +TOOL

        • Select Jupyter tool from the list

      • Set the runtime settings for the environment. Click +COMMAND & ARGUMENTS and add the following:

        • Enter the command - start-notebook.sh

        • Enter the arguments - --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''

        Note: If host-based routing is enabled on the cluster, enter the --NotebookApp.token='' only.

  • Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘small-fraction’ compute resource for your workspace.

    • If ‘small-fraction’ is not displayed in the gallery, follow the below steps to create a one-time compute resource configuration:

      • Set GPU devices per pod - 1

      • Set GPU memory per device

        • Select % (of device) - Fraction of a GPU device’s memory

        • Set the memory Request - 10 (the workload will allocate 10% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

  • Click CREATE WORKSPACE

    1. Go to the Workload manager → Workloads

    2. Click +NEW WORKLOAD and select Workspace

    3. Select under which cluster to create the workload

    4. Select the project in which your workspace will run

    5. Select Start from scratch to launch a new workspace quickly

    6. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

    7. Under Submission, select Original and click CONTINUE

    8. Select the ‘jupyter-lab’ environment for your workspace (Image URL: jupyter/scipy-notebook)

      • If the ‘jupyter-lab’ is not displayed in the gallery, follow the below steps:

        • Click +NEW ENVIRONMENT

    9. Select the ‘small-fraction’ compute resource for your workspace

      • If ‘small-fraction’ is not displayed in the gallery, follow the below steps:

        • Click +NEW COMPUTE RESOURCE

        • Enter small-fraction as the name for the compute resource. The name must be unique.

    10. Click CREATE WORKSPACE

    Copy the following command to your terminal. Make sure to update the below with the name of your project and workload. For more details, see CLI reference:
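    runai project set "project-name"
    runai workspace submit "workload-name" --image jupyter/scipy-notebook \
    --gpu-devices-request 0.1 --command --external-url container=8888 \
    --name-prefix jupyter --command -- start-notebook.sh \
    --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''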

    Copy the following command to your terminal. Make sure to update the below with the name of your project and workload. For more details, see CLI reference:
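    runai config project "project-name"
    runai submit "workload-name" --jupyter -g 0.1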

    Copy the following command to your terminal. Make sure to update the below parameters. For more details, see Workspaces API:
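    curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <TOKEN>' \
    -d '{
        "name": "workload-name",
        "projectId": "<PROJECT-ID>",
        "clusterId": "<CLUSTER-UUID>",
        "spec": {
            "command" : "start-notebook.sh",
            "args" : "--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token='\''",
            "image": "jupyter/scipy-notebook",
            "compute": {
                "gpuDevicesRequest": 1,
                "gpuRequestType": "portion",
                "gpuPortionRequest": 0.1
            },
            "exposedUrls" : [
                {
                    "container" : 8888,
                    "toolType": "jupyter-notebook",
                    "toolName": "Jupyter"
                }
            ]
        }
    }'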

    • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

    • <TOKEN> - The API access token obtained in Step 1

    • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

    • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

    • toolType will show the Jupyter icon when connecting to the Jupyter tool via the user interface.

    • toolName will show when connecting to the Jupyter tool via the user interface.

    Note

    The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

    To connect to the Jupyter Notebook, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>


    global.ingress.ingressClass

    Ingress class

    NVIDIA Run:ai uses NGINX as the default ingress controller. If your cluster has a different ingress controller, you can configure the ingress class to be created by NVIDIA Run:ai.

    global.ingress.tlsSecretName

    TLS secret name

    NVIDIA Run:ai requires the creation of a secret with the domain certificate. If the runai-backend namespace already has such a secret, you can set the secret name here.

    <service-name>.podLabels

    Pod labels

    Set NVIDIA Run:ai and 3rd party services' pod labels in a format of key/value pairs.

    <service-name>.resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 250m
        memory: 256Mi

    Pod request and limits

    Set NVIDIA Run:ai and 3rd party services' resources

    disableIstioSidecarInjection.enabled

    Disable Istio sidecar injection

    Disable the automatic injection of Istio sidecars across the entire NVIDIA Run:ai Control Plane services.

    global.affinity

    System nodes

    Sets the system nodes where NVIDIA Run:ai system-level services are scheduled. Default: Prefer to schedule on nodes that are labeled with node-role.kubernetes.io/runai-system

    global.customCA.enabled

    Certificate authority

    Enables the use of a custom Certificate Authority (CA) in your deployment. When set to true, the system is configured to trust a user-provided CA certificate for secure communication.

    Additional Third-Party Configurations

    The NVIDIA Run:ai control plane chart includes multiple sub-charts of third-party components:

    • Data store- PostgreSQL (postgresql)

    • Metrics Store - Thanos (thanos)

    • Identity & Access Management - Keycloakx (keycloakx)

    • Analytics Dashboard - Grafana (grafana)

    • Caching, Queue - NATS (nats)

    Note

    Click on any component to view its chart values and configurations.

    PostgreSQL

    If you have opted to connect to an external PostgreSQL database, refer to the additional configurations table below. Adjust the following parameters based on your connection details:

    1. Disable PostgreSQL deployment - postgresql.enabled

    2. NVIDIA Run:ai connection details - global.postgresql.auth

    3. Grafana connection details - grafana.dbUser, grafana.dbPassword

    Key
    Change
    Description

    postgresql.enabled

    PostgreSQL installation

    If set to false, PostgreSQL will not be installed.

    global.postgresql.auth.host

    PostgreSQL host

    Hostname or IP address of the PostgreSQL server.

    global.postgresql.auth.port

    PostgreSQL port

    Port number on which PostgreSQL is running.

    global.postgresql.auth.username

    PostgreSQL username

    Username for connecting to PostgreSQL.
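    global.postgresql.auth.password

    PostgreSQL password

    Password for the PostgreSQL user specified by global.postgresql.auth.username.

    global.postgresql.auth.postgresPassword

    PostgreSQL default admin password

    Password for the built-in PostgreSQL superuser (postgres).

    global.postgresql.auth.existingSecret

    Postgres credentials (secret)

    Existing secret name with authentication credentials.

    global.postgresql.auth.dbSslMode

    Postgres connection SSL mode

    Set the SSL mode. See the full list in Protection Provided in Different Modes. Prefer mode is not supported.

    postgresql.primary.initdb.password

    PostgreSQL default admin password

    Set the same password as in global.postgresql.auth.postgresPassword (if changed).

    postgresql.primary.persistence.storageClass

    Storage class

    The installation is configured to work with a specific storage class instead of the default one.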

    Thanos

    Note

    This section applies to Kubernetes only.

    Key
    Change
    Description

    thanos.receive.persistence.storageClass

    Storage class

    The installation is configured to work with a specific storage class instead of the default one.

    Keycloakx

    The keycloakx.adminUser can only be set during the initial installation. The admin password can be changed later through the Keycloak UI, but you must also update the keycloakx.adminPassword value in the Helm chart using helm upgrade. See Changing Keycloak admin password for more details.

    Key
    Change
    Description

    keycloakx.adminUser

    User name of the internal identity provider administrator

    Defines the username for the Keycloak administrator. This can only be set during the initial installation.

    keycloakx.adminPassword

    Password of the internal identity provider administrator

    Defines the password for the Keycloak administrator.

    keycloakx.existingSecret

    Keycloakx credentials (secret)

    Existing secret name with authentication credentials.

    global.keycloakx.host

    Keycloak (NVIDIA Run:ai internal identity provider) host path

    Overrides the DNS for Keycloak. This can be used to access Keycloak externally to the cluster.

    Changing Keycloak Admin Password

    You can change the Keycloak admin password after deployment by performing the following steps:

    1. Open the Keycloak UI at: https://<runai-domain>/auth

    2. Sign in with your existing admin credentials as configured in your Helm values

    3. Go to Users and select admin (or your admin username)

    4. Open Credentials → Reset password

    5. Set the new password and click Save

    6. Update the keycloakx.adminPassword value using the helm upgrade command to match the password you set in the Keycloak UI

    Note

    Failing to update the Helm values after changing the password can lead to control plane services encountering errors.
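    As a minimal sketch of step 6, assuming the control plane Helm release is named runai-backend in the runai-backend namespace (release and chart names are illustrative):

    helm upgrade runai-backend runai-backend/control-plane \
        --namespace runai-backend \
        --reuse-values \
        --set keycloakx.adminPassword=<NEW_PASSWORD>   # must match the password set in the Keycloak UI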

    Grafana

    Key
    Change
    Description

    grafana.db.existingSecret

    Grafana database connection credentials (secret)

    Existing secret name with authentication credentials.

    grafana.dbUser

    Grafana database username

    Username for accessing the Grafana database.

    grafana.dbPassword

    Grafana database password

    Password for the Grafana database user.

    grafana.admin.existingSecret

    Grafana admin default credentials (secret)

    Existing secret name with authentication credentials.
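    grafana.adminUser

    Grafana username

    Override the NVIDIA Run:ai default user name for accessing Grafana.

    grafana.adminPassword

    Grafana password

    Override the NVIDIA Run:ai default password for accessing Grafana.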


    Introduction to Workloads

    NVIDIA Run:ai enhances visibility and simplifies management by monitoring, presenting and orchestrating all AI workloads in the clusters where it is installed. Workloads are the fundamental building blocks for consuming resources, enabling AI practitioners such as researchers, data scientists and engineers to efficiently support the entire life cycle of an AI initiative.

    Workloads Across the AI Life Cycle

    A typical AI initiative progresses through several key stages, each with distinct workloads and objectives. With NVIDIA Run:ai, research and engineering teams can host and manage all these workloads to achieve the following:

    • Data preparation: Aggregating, cleaning, normalizing, and labeling data to prepare for training.

    • Training: Conducting resource-intensive model development and iterative performance optimization.

    • Fine-tuning: Adapting pre-trained models to domain-specific datasets while balancing efficiency and performance.

    • Inference: Deploying models for real-time or batch predictions with a focus on low latency and high throughput.

    • Monitoring and optimization: Ensuring ongoing performance by addressing data drift, usage patterns, and retraining as needed.

    What is a Workload?

    A workload runs in the cluster, is associated with a namespace, and operates to fulfill its targets, whether that is running to completion for a batch job, allocating resources for experimentation in an integrated development environment (IDE)/notebook, or serving inference requests in production.

    The workload, defined by the AI practitioner, consists of:

    • Container images: This includes the application, its dependencies, and the runtime environment.

    • Compute resources: CPU, GPU, and RAM to execute efficiently and address the workload’s needs.

    • Data & storage configuration: The data needed for processing such as training and testing datasets or input from external databases, and the storage configuration which refers to the way this data is managed, stored and accessed.
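    • Credentials: Access to certain data sources or external services, ensuring proper authentication and authorization.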

    Workload Scheduling and Orchestration

    NVIDIA Run:ai’s core mission is to optimize AI resource usage at scale. This is achieved through efficient scheduling and orchestrating of all cluster workloads using the NVIDIA Run:ai Scheduler. The Scheduler allows the prioritization of workloads across different departments and projects within the organization at large scales, based on the resource distribution set by the system administrator.

    NVIDIA Run:ai and Third-Party Workloads

    • NVIDIA Run:ai workloads: These workloads are submitted via the NVIDIA Run:ai platform. They are represented by Kubernetes Custom Resource Definitions (CRDs) and APIs. When using NVIDIA Run:ai workloads, a complete Workload and Scheduling Policy solution is offered for administrators to ensure optimizations, governance and security standards are applied.

    • Third-party workloads: These workloads are submitted via third-party applications that use the NVIDIA Run:ai Scheduler. The NVIDIA Run:ai platform manages and monitors these workloads. They enable seamless integrations with external tools, allowing teams and individuals flexibility. See Using the Scheduler with third-party workloads.

    Levels of Support

    Different types of workloads have different levels of support. Understanding what capabilities are needed before selecting the workload type to work with is important. The table below details the level of support for each workload type in NVIDIA Run:ai. NVIDIA Run:ai workloads are fully supported with all of NVIDIA Run:ai's advanced features and capabilities, while third-party workloads are partially supported. The list of capabilities can change between different NVIDIA Run:ai versions.

    Functionality
    NVIDIA Run:ai Workspace
    NVIDIA Run:ai Training - Standard
    NVIDIA Run:ai Training - distributed
    NVIDIA Run:ai Inference
    Third-party workloads

    Workload awareness

    Specific workload-aware visibility, so that different pods are identified and treated as a single workload (for example GPU utilization, workload view, dashboards).

    Additional capabilities compared for each workload type include elastic scaling, fairness, and priority and preemption.

    Node Pools

    This section explains the procedure for managing Node pools.

    Node pools assist in managing heterogeneous resources effectively. A node pool is a NVIDIA Run:ai construct representing a set of nodes grouped into a bucket of resources using a predefined node label (e.g. NVIDIA GPU type) or an administrator-defined node label (any key/value pair).

    Typically, the grouped nodes share a common feature or property, such as GPU type or other HW capability (such as Infiniband connectivity), or represent a proximity group (i.e. nodes interconnected via a local ultra-fast switch). Researchers and ML Engineers would typically use node pools to run specific workloads on specific resource types.

    In the NVIDIA Run:ai platform, a user with the System administrator role can create, view, edit, and delete node pools. Creating a new node pool creates a new instance of the NVIDIA Run:ai Scheduler. Workloads submitted to a node pool are scheduled using the node pool’s designated scheduler instance.

    Once created, the new node pool is automatically assigned to all projects and departments with a quota of zero GPU resources, unlimited CPU resources, and over quota enabled (medium weight if over-quota weight is enabled). This allows any project and department to use any node pool when over quota is enabled, even if the administrator has not assigned a quota for a specific node pool within that project or department.

    When submitting a new workload, users can add a prioritized list of node pools. The node pool selector picks one node pool at a time (according to the prioritized list) and the designated node pool scheduler instance handles the submission request and tries to match the requested resources within that node pool. If the scheduler cannot find resources to satisfy the submitted workload, the node pool selector moves the request to the next node pool in the prioritized list. If no node pool satisfies the request, the node pool selector starts from the first node pool again until one of the node pools satisfies the request.

    Using the Scheduler with Third-Party Workloads

    By default, Kubernetes uses its own native scheduler to determine pod placement. The NVIDIA Run:ai platform provides a custom scheduler, runai-scheduler, which is used by default for workloads submitted using the platform. This section outlines how to configure third-party workloads, such as those submitted directly to Kubernetes or through external frameworks, to run with the NVIDIA Run:ai Scheduler, runai-scheduler, instead of the default Kubernetes scheduler.

    Enforce the Scheduler at the Namespace Level


    When submitting workloads in a given namespace (i.e., NVIDIA Run:ai project), the parameter enforceRunaiScheduler is enabled (true) by default. This ensures that any workload associated with a NVIDIA Run:ai project automatically uses the runai-scheduler, including workloads submitted directly to Kubernetes or through external frameworks.

    If this parameter is disabled, enforceRunaiScheduler=false, workloads will no longer default to the NVIDIA Run:ai Scheduler. In this case, you can still use the NVIDIA Run:ai Scheduler by specifying it manually in the workload YAML.

    Specify the Scheduler in the Workload YAML

    To use the NVIDIA Run:ai Scheduler, specify it in the workload’s YAML file. This instructs Kubernetes to schedule the workload using the NVIDIA Run:ai Scheduler instead of the default one.

    For example:

    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        user: test
        gpu-fraction: "0.5"
        gpu-fraction-num-devices: "2"
      labels:
        runai/queue: test
      name: multi-fractional-pod-job
      namespace: test
    spec:
      containers:
      - image: gcr.io/run-ai-demo/quickstart-cuda
        imagePullPolicy: Always
        name: job
        env:
        - name: RUNAI_VERBOSE
          value: "1"
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          capabilities:
            drop: ["ALL"]
      schedulerName: runai-scheduler
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 5
  • Enter jupyter-lab as the name for the environment. The name must be unique.
  • Enter the jupyter-lab Image URL - jupyter/scipy-notebook

  • Tools - Set the connection for your tool

    • Click +TOOL

    • Select Jupyter tool from the list

  • Set the runtime settings for the environment. Click +COMMAND & ARGUMENTS and add the following:

    • Enter the command - start-notebook.sh

    • Enter the arguments - --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''

    Note: If host-based routing is enabled on the cluster, enter the --NotebookApp.token='' only.

  • Click CREATE ENVIRONMENT

  • The newly created environment will be selected automatically

  • Set GPU devices per pod - 1

  • Set GPU memory per device

    • Select % (of device) - Fraction of a GPU device’s memory

    • Set the memory Request - 10 (the workload will allocate 10% of the GPU memory)

  • Optional: set the CPU compute per pod - 0.1 cores (default)

  • Optional: set the CPU memory per pod - 100 MB (default)

  • Click CREATE COMPUTE RESOURCE

  • The newly created compute resource will be selected automatically


    Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

    Run the below --help command to obtain the login options and log in according to your setup:
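    runai login --help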

    Log in using the following command. You will be prompted to enter your username and password:
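    runai login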

    To use the API, you will need to obtain a token as shown in API authentication.

    Credentials: Provide access to certain data sources or external services, ensuring proper authentication and authorization.

    Scheduler capability comparison for NVIDIA Run:ai workloads and third-party workloads submitted through the Scheduler, covering capabilities such as Elastic scaling, Workload awareness, Fairness, and Priority and preemption for batch jobs, experimentation, inference, and other scheduling and orchestration scenarios.

    Node Pools Table

    The Node pools table can be found under Resources in the NVIDIA Run:ai platform.

    The Node pools table lists all the node pools defined in the NVIDIA Run:ai platform and allows you to manage them.

    Note

    By default, the NVIDIA Run:ai platform includes a single node pool named ‘default’. When no other node pool is defined, all existing and new nodes are associated with the ‘default’ node pool. When deleting a node pool, if no other node pool matches any of the nodes’ labels, the node will be included in the default node pool.

    The Node pools table consists of the following columns:

    Column
    Description

    Node pool

    The node pool name, set by the administrator during its creation (the node pool name cannot be changed after its creation).

    Status

    Node pool status. A ‘Ready’ status means the scheduler can use this node pool to schedule workloads. ‘Empty’ status means no nodes are currently included in that node pool.

    Label key Label value

    The node pool controller will use this node-label key-value pair to match nodes into this node pool.

    Node(s)

    List of nodes included in this node pool. Click the field to view details (described later in this article).

    GPU network acceleration (MNNVL)

    Indicates whether Multi-Node NVLink (MNNVL) nodes are discovered automatically or manually

    MNNVL label key

    The label key that is used to automatically detect if a node is part of an MNNVL domain. The default MNNVL domain label is nvidia.com/gpu.clique.

    Workloads Associated with the Node Pool

    Click one of the values in the Workload(s) column, to view the list of workloads and their parameters.

    Note

    This column is only viewable if your role in the NVIDIA Run:ai platform gives you read access to workloads. Even if you are allowed to view workloads, you can only view the workloads within your allowed scope. This means there might be more pods running on this node than appear in the list you are viewing.

    Column
    Description

    Workload

    The name of the workload. If the workload’s type is one of the recognized types (for example: PyTorch, MPI, Jupyter, Ray, Spark, Kubeflow, and many more), an appropriate icon is displayed.

    Type

    The NVIDIA Run:ai platform type of the workload - Workspace, Training, or Inference

    Status

    The state of the workload. The workload states are described in the Workloads section.

    Created by

    The user or application that created this workload

    Running/requested pods

    The number of running pods out of the number of requested pods within this workload.

    Creation time

    The workload’s creation date and time

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

    • Show/Hide details - Click to view additional information on the selected row

    Show/Hide Details

    Select a row in the Node pools table and then click Show details in the upper-right corner of the action bar. The details window appears, presenting metrics graphs for the whole node pool:

    • Node GPU allocation - This graph shows an overall sum of the Allocated, Unallocated, and Total number of GPUs for this node pool, over time. From observing this graph, you can learn about the occupancy of GPUs in this node pool, over time.

    • GPU Utilization Distribution - This graph shows the distribution of GPU utilization in this node pool over time. Observing this graph, you can learn how many GPUs are utilized up to 25%, 25%-50%, 50%-75%, and 75%-100%. This information helps to understand how many available resources you have in this node pool, and how well those resources are utilized by comparing the allocation graph to the utilization graphs, over time.

    • GPU Utilization - This graph shows the average GPU utilization in this node pool over time. Comparing this graph with the GPU Utilization Distribution helps to understand the actual distribution of GPU occupancy over time.

    • GPU Memory Utilization - This graph shows the average GPU memory utilization in this node pool over time, for example an average of all nodes’ GPU memory utilization over time.

    • CPU Utilization - This graph shows the average CPU utilization in this node pool over time, for example, an average of all nodes’ CPU utilization over time.

    • CPU Memory Utilization - This graph shows the average CPU memory utilization in this node pool over time, for example an average of all nodes’ CPU memory utilization over time.

    Adding a New Node Pool

    To create a new node pool:

    1. Click +NEW NODE POOL

    2. Enter a name for the node pool. Node pool names must start with a letter and can only contain lowercase Latin letters, numbers, or a hyphen ('-’)

    3. Enter the node pool label: The node pool controller will use this node-label key-value pair to match nodes into this node pool.

      • Key is the unique identifier of a node label.

        • The key must fit the following regular expression: ^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?/?([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]$

        • The administrator can use an automatically preset label, such as nvidia.com/gpu.product (which labels the GPU type), or any other key from a node label.

      • Value is the value of that label identifier (key). The same key may have different values, in this case, they are considered as different labels.

        • Value must fit the following regular expression: ^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$

      • A node pool is defined by a single key-value pair. Do not use different labels that are set on the same node by different node pools; this may lead to unexpected results.

    4. Set the GPU placement strategy:

      • Bin-pack - Place as many workloads as possible in each GPU and node to use fewer resources and maximize GPU and node vacancy.

      • Spread - Spread workloads across as many GPUs and nodes as possible to minimize the load and maximize the available resources per workload.

      • GPU workloads are workloads that request both GPU and CPU resources

    5. Set the CPU placement strategy:

      • Bin-pack - Place as many workloads as possible in each CPU and node to use fewer resources and maximize CPU and node vacancy.

      • Spread - Spread workloads across as many CPUs and nodes as possible to minimize the load and maximize the available resources per workload.

      • CPU workloads are workloads that request purely CPU resources

    6. Set the GPU network acceleration. For more details, see Using GB200 NVL72 and Multi-Node NVLink domains:

      • Set the discovery method of GPU network acceleration (MNNVL)

        • Automatic - Automatically identify whether the node pool contains any MNNVL nodes. MNNVL nodes that share the same ID are part of the same NVL rack.

    7. Click CREATE NODE POOL

    Labeling Nodes for Node Pool Grouping

    The administrator can use a preset node label, such as the nvidia.com/gpu.product that labels the GPU type, or configure any other node label (e.g. faculty=physics).

    To assign a label to nodes you want to group into a node pool, set a node label on each node:

    1. Obtain the list of nodes and their current labels by copying the following to your terminal:
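      kubectl get nodes --show-labels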

    2. Annotate a specific node with a new label by copying the following to your terminal:
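      kubectl label node <node-name> <key>=<value>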

    Labeling Nodes via Cloud Providers

    Most cloud providers allow you to configure node labels at the node pool level. You can apply labels when creating a cluster, creating a node pool, or by editing an existing node pool.

    Ensure that each node is labeled using the Kubernetes label format. This label ensures that workloads are scheduled correctly based on node pool definitions:
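    run.ai/type=<TYPE_VALUE>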

    Refer to the provider-specific documentation below for guidance on how to configure node pool labels:

    • Google Kubernetes Engine (GKE)

    • Azure Kubernetes Service (AKS)

    • Amazon Elastic Kubernetes Service (EKS)

    Editing a Node Pool

    1. Select the node pool you want to edit

    2. Click EDIT

    3. Update the node pool and click SAVE

    Deleting a Node Pool

    1. Select the node pool you want to delete

    2. Click DELETE

    3. On the dialog, click DELETE to confirm the deletion

    Note

    The default node pool cannot be deleted. When deleting a node pool, if no other node pool matches any of the nodes’ labels, the node will be included in the default node pool.

    Using API

    To view the available actions, go to the Node pools API reference.


    Departments

    This section explains the procedure for managing departments.

    Departments are a grouping of projects. By grouping projects into a department, you can set quota limitations for a set of projects, create policies that are applied to the department, and create assets that can be scoped to the whole department or to a partial group of descendant projects.

    For example, in an academic environment, a department can be the Physics Department grouping various projects (AI Initiatives) within the department, or grouping projects where each project represents a single student.

    Departments Table

    The Departments table can be found under Organization in the NVIDIA Run:ai platform.

    Note

    Departments are disabled by default. If you cannot see Departments in the menu, it must be enabled by your Administrator, under General settings → Resources → Departments

    The Departments table lists all departments defined for a specific cluster and allows you to manage them. You can switch between clusters by selecting your cluster using the filter at the top.

    The Departments table consists of the following columns:

    Column
    Description

    Node Pools with Quota Associated with the Department

    Click one of the values of Node pool(s) with quota column, to view the list of node pools and their parameters

    Column
    Description

    Subjects Authorized for the Project

    Click one of the values of the Subject(s) column, to view the list of subjects and their parameters. This column is only viewable if your role in the NVIDIA Run:ai system affords you those permissions.

    Column
    Description

    Note

    A role given in a certain scope, means the role applies to this scope and any descendant scopes in the organizational tree.

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    Adding a New Department

    To create a new Department:

    1. Click +NEW DEPARTMENT

    2. Select a scope. By default, the field contains the scope of the current UI context cluster, viewable at the top left side of your screen. You can change the current UI context cluster by clicking the ‘Cluster: cluster-name’ field and applying another cluster as the UI context. Alternatively, you can choose another cluster within the ‘+ New Department’ form by clicking the organizational tree icon on the right side of the scope field, opening the organizational tree and selecting one of the available clusters.

    3. Enter a name for the department. Department names must start with a letter and can only contain lowercase Latin letters, numbers, or a hyphen ('-’).

    Adding an Access Rule to a Department

    To create a new access rule for a department:

    1. Select the department you want to add an access rule for

    2. Click ACCESS RULES

    3. Click +ACCESS RULE

    4. Select a subject

    Deleting an Access Rule from a Department

    To delete an access rule from a department:

    1. Select the department you want to remove an access rule from

    2. Click ACCESS RULES

    3. Find the access rule you would like to delete

    4. Click on the trash icon

    Editing a Department

    1. Select the Department you want to edit

    2. Click EDIT

    3. Update the Department and click SAVE

    Viewing a Department’s Policy

    To view the policy of a department:

    1. Select the department for which you want to view its policy. This option is only active if the department has defined policies in place.

    2. Click VIEW POLICY and select the workload type for which you want to view the policies:

      a. Workspace workload type policy with its set of rules

      b. Training workload type policy with its set of rules

    3. In the Policy form, view the workload rules that are enforcing your department for the selected workload type as well as the defaults:

    Note

    • The policy affecting the department consists of rules and defaults. Some of these rules and defaults may be derived from the policies of a parent cluster (source). You can see the source of each rule in the policy form.

    • A policy set for a department affects all subordinated projects and their workloads, according to the policy workload type

    Deleting a Department

    1. Select the department you want to delete

    2. Click DELETE

    3. On the dialog, click DELETE to confirm the deletion

    Note

    Deleting a department permanently deletes its subordinated projects and any assets created in the scope of this department or any of its subordinated projects, such as compute resources, environments, data sources, templates, and credentials. However, workloads running within the department’s subordinated projects, and the policies defined for this department or its subordinated projects, remain intact and running.

    Reviewing a Department

    1. Select the department you want to review

    2. Click REVIEW

    3. Review and click CLOSE

    Using API

    To view the available actions, go to the Departments API reference.

    Clusters

    This section explains the procedure to view and manage Clusters.

    The Cluster table provides a quick and easy way to see the status of your cluster.

    Clusters Table

    The Clusters table can be found under Resources in the NVIDIA Run:ai platform.

    The clusters table provides a list of the clusters added to the NVIDIA Run:ai platform, along with their status.

    The clusters table consists of the following columns:

    Advanced Cluster Configurations

    Advanced cluster configurations can be used to tailor your NVIDIA Run:ai cluster deployment to meet specific operational requirements and optimize resource management. By fine-tuning these settings, you can enhance functionality, ensure compatibility with organizational policies, and achieve better control over your cluster environment. This article provides guidance on implementing and managing these configurations to adapt the NVIDIA Run:ai cluster to your unique needs.

    After the NVIDIA Run:ai cluster is installed, you can adjust various settings to better align with your organization's operational needs and security requirements.

    Modify Cluster Configurations

    Advanced cluster configurations in NVIDIA Run:ai are managed through the runaiconfig Kubernetes custom resource. To edit the cluster configurations, run:
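    kubectl edit runaiconfig runai -n runai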

    Launching Workloads with Dynamic GPU Fractions

    This quick start provides a step-by-step walkthrough for running a Jupyter Notebook with dynamic GPU fractions.

    NVIDIA Run:ai’s dynamic GPU fractions optimizes GPU utilization by enabling workloads to dynamically adjust their resource usage. It allows users to specify a guaranteed fraction of GPU memory and compute resources with a higher limit that can be dynamically utilized when additional resources are requested.

    Prerequisites

    Before you start, make sure:

    Environments

    This section explains what environments are and how to create and use them.

    Environments are one type of workload asset. An environment consists of a configuration that simplifies how workloads are submitted and can be used by AI practitioners when they submit their workloads.

    An environment asset is a preconfigured building block that encapsulates aspects for the workload such as:

    • Container image and container configuration

    • Tools and connections

    kubectl get nodes --show-labels
    kubectl label node <node-name> <key>=<value>
    run.ai/type=<TYPE_VALUE>
    Over quota
    Node pools
    Bin packing / Spread
    Multi-GPU fractions
    Multi-GPU dynamic fractions
    Node level scheduler
    Multi-GPU memory swap
    Gang scheduling
    Monitoring
    RBAC
    Workload submission
    Workload actions (stop/run)
    Rolling updates
    Workload Policies
    Scheduling rules
    Manual - Manually set whether the node pool contains any MNNVL nodes
    • Detected

    • Not detected

  • Set the node’s label used to discover GPU network acceleration (MNNVL) to nvidia.com/gpu.clique

  • MNNVL nodes

    Indicates whether MNNVL nodes are detected - automatically or manually.

    Total GPU devices

    The total number of GPU devices installed into nodes included in this node pool. For example, a node pool that includes 12 nodes each with 8 GPU devices would show a total number of 96 GPU devices.

    Total GPU memory

    The total amount of GPU memory installed in nodes included in this node pool. For example, a node pool that includes 12 nodes, each with 8 GPU devices, and each device with 80 GB of memory would show a total memory amount of 7.68 TB.

    Allocated GPUs

    The total allocation of GPU devices in units of GPUs (decimal number). For example, if 3 GPUs are 50% allocated, the field prints out the value 1.50. This value represents the portion of GPU memory consumed by all running pods using this node pool. ‘Allocated GPUs’ can be larger than ‘Projects’ GPU quota’ if over quota is used by workloads, but not larger than GPU devices.

    GPU resource optimization ratio

    Shows the Node Level Scheduler mode.

    Total CPU (Cores)

    The number of CPU cores installed on nodes included in this node pool

    Total CPU memory

    The total amount of CPU memory installed on nodes using this node pool

    Allocated CPU (Cores)

    The total allocation of CPU compute in units of Cores (decimal number). This value represents the amount of CPU cores consumed by all running pods using this node pool. ‘Allocated CPUs’ can be larger than ‘Projects’ CPU quota’ if over quota is used by workloads, but not larger than CPUs (Cores).

    Allocated CPU memory

    The total allocation of CPU memory in units of TB/GB/MB (decimal number). This value represents the amount of CPU memory consumed by all running pods using this node pool. ‘Allocated CPUs’ can be larger than ‘Projects’ CPU memory quota’ if over quota is used by workloads, but not larger than CPU memory.

    GPU placement strategy

    Sets the Scheduler strategy for the assignment of pods requesting both GPU and CPU resources to nodes, which can be either Bin-pack or Spread. By default, Bin-pack is used, but can be changed to Spread by editing the node pool. When set to Bin-pack, the scheduler will try to fill nodes as much as possible before using empty or sparse nodes; when set to Spread, the scheduler will try to keep nodes as sparse as possible by spreading workloads across as many nodes as possible.

    CPU placement strategy

    Sets the Scheduler strategy for the assignment of pods requesting only CPU resources to nodes, which can be either Bin-pack or Spread. By default, Bin-pack is used, but can be changed to Spread by editing the node pool. When set to Bin-pack, the scheduler will try to fill nodes as much as possible before using empty or sparse nodes; when set to Spread, the scheduler will try to keep nodes as sparse as possible by spreading workloads across as many nodes as possible.

    Last update

    The date and time when the node pool was last updated

    Creation time

    The date and time when the node pool was created

    Workload(s)

    List of workloads running on nodes included in this node pool, click the field to view details (described below in this article)

    Allocated GPU compute

    The total amount of GPU compute allocated by this workload. A workload with 3 Pods, each allocating 0.5 GPU, will show a value of 1.5 GPUs for the workload.

    Allocated GPU memory

    The total amount of GPU memory allocated by this workload. A workload with 3 Pods, each allocating 20GB, will show a value of 60 GB for the workload.

    Allocated CPU compute (cores)

    The total amount of CPU compute allocated by this workload. A workload with 3 Pods, each allocating 0.5 Core, will show a value of 1.5 Cores for the workload.

    Allocated CPU memory

    The total amount of CPU memory allocated by this workload. A workload with 3 Pods, each allocating 5 GB of CPU memory, will show a value of 15 GB of CPU memory for the workload.

    runai login --help
    runai login

    Allocated GPUs

    The total number of GPUs allocated by successfully scheduled workloads in projects associated with this department

    GPU allocation ratio

    The ratio of Allocated GPUs to GPU quota. This number reflects how well the department’s GPU quota is utilized by its descendant projects. A number higher than 100% means the department is using over quota GPUs. A number lower than 100% means not all projects are utilizing their quotas. A quota becomes allocated once a workload is successfully scheduled.

    Creation time

    The timestamp for when the department was created

    Workload(s)

    The list of workloads under projects associated with this department. Click the values under this column to view the list of workloads with their resource parameters (as described below)

    Cluster

    The cluster that the department is associated with

    Allocated CPU memory

    The actual amount of CPU memory allocated by workloads using this node pool under all projects associated with this department. The number of Allocated CPU memory may temporarily surpass the CPU memory quota if over quota is used.

    Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

    In the Quota management section, you can set the quota parameters and prioritize resources

    • Order of priority This column is displayed only if more than one node pool exists. The node-pools order of priority in 'Departments/Quota management' sets the default node-pools order of priority for newly created projects under that Department. The Administrator can then change the order per Project. Node-pools order of priority sets the order in which the Scheduler uses node pools to schedule a workload, it is effective for projects and their associated workloads. This means the Scheduler first tries to allocate resources using the highest priority node pool, then the next in priority, until it reaches the lowest priority node pool list, then the Scheduler starts from the highest again. The Scheduler uses the Project's list of prioritized node pools, only if the order of priority of node pools is not set in the workload during submission, either by an admin policy or by the user. Empty value means the node pool is not part of the Department default node pool priority list inherited to newly created projects, but a node pool can still be chosen by the admin policy or a user during workload submission.

    • Node pool This column is displayed only if more than one node pool exists. It represents the name of the node pool

    • Under the QUOTA tab

      • Over-quota state Indicates if over-quota is enabled or disabled as set in the SCHEDULING PREFERENCES tab. If over-quota is set to None, then it is disabled.

      • GPU devices The number of GPUs you want to allocate for this department in this node pool (decimal number).

  • Set Scheduling rules as required.

  • Click CREATE DEPARTMENT

  • Select or enter the subject identifier:

    • User Email for a local user created in NVIDIA Run:ai or for an SSO user as recognized by the IDP

    • Group name as recognized by the IDP

    • Application name as created in NVIDIA Run:ai

  • Select a role

  • Click SAVE RULE

  • Click CLOSE

  • Click CLOSE

    Parameter - The workload submission parameter that the Rule and Default are applied to

  • Type (applicable for data sources only) - The data source type (Git, S3, NFS, PVC, etc.)

  • Default - The default value of the Parameter

  • Rule - Set up constraints on workload policy fields

  • Source - The origin of the applied policy (cluster, department or project)

  • Department

    The name of the department

    Node pool(s) with quota

    The node pools associated with this department. By default, all node pools within a cluster are associated with each department. Administrators can change the node pools’ quota parameters for a department. Click the values under this column to view the list of node pools with their parameters (as described below)

    GPU quota

    GPU quota associated with the department

    Total GPUs for projects

    The sum of all projects’ GPU quotas associated with this department

    Project(s)

    List of projects associated with this department

    Subject(s)

    The users, SSO groups, or applications with access to the project. Click the values under this column to view the list of subjects with their parameters (as described below). This column is only viewable if your role in NVIDIA Run:ai platform allows you those permissions.

    Node pool

    The name of the node pool is given by the administrator during node pool creation. All clusters have a default node pool created automatically by the system and named ‘default’.

    GPU quota

    The amount of GPU quota the administrator dedicated to the department for this node pool (floating number, e.g. 2.3 means 230% of a GPU capacity)

    CPU (Cores)

    The amount of CPU (cores) quota the administrator has dedicated to the department for this node pool (floating number, e.g. 1.3 Cores = 1300 milli-cores). The ‘unlimited’ value means the CPU (Cores) quota is not bound and workloads using this node pool can use as many CPU (Cores) resources as they need (if available)

    CPU memory

    The amount of CPU memory quota the administrator has dedicated to the department for this node pool (floating number, in MB or GB). The ‘unlimited’ value means the CPU memory quota is not bounded and workloads using this node pool can use as much CPU memory resource as they need (if available).

    Allocated GPUs

    The total amount of GPUs allocated by workloads using this node pool under projects associated with this department. The number of allocated GPUs may temporarily surpass the GPU quota of the department if over quota is used.

    Allocated CPU (Cores)

    The total amount of CPUs (cores) allocated by workloads using this node pool under all projects associated with this department. The number of allocated CPUs (cores) may temporarily surpass the CPUs (Cores) quota of the department if over quota is used.

    Subject

    A user, SSO group, or application assigned with a role in the scope of this department

    Type

    The type of subject assigned to the access rule (user, SSO group, or application).

    Scope

    The scope of this department within the organizational tree. Click the name of the scope to view the organizational tree diagram, you can only view the parts of the organizational tree for which you have permission to view.

    Role

    The role assigned to the subject, in this department’s scope

    Authorized by

    The user who granted the access rule

    Last updated

    The last time the access rule was updated

    Column
    Description

    Cluster

    The name of the cluster

    Status

    The status of the cluster. For more information, see the Cluster Status table below. Hover over the information icon for a short description and links to troubleshooting

    Creation time

    The timestamp when the cluster was created

    URL

    The URL that was given to the cluster

    NVIDIA Run:ai cluster version

    The NVIDIA Run:ai version installed on the cluster

    Kubernetes distribution

    The flavor of Kubernetes distribution

    Kubernetes version

    The Kubernetes version installed on the cluster

    Cluster Status

    Status
    Description

    Waiting to connect

    The cluster has never been connected.

    Disconnected

    There is no communication from the cluster to the Control plane. This may be due to a network issue.

    Missing prerequisites

    Some prerequisites are missing from the cluster. As a result, some features may be impacted.

    Service issues

    At least one of the services is not working properly. You can view the list of nonfunctioning services for more information.

    Connected

    The NVIDIA Run:ai cluster is connected, and all NVIDIA Run:ai services are running.

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

    Adding a New Cluster

    To add a new cluster, see the installation guide.

    Removing a Cluster

    1. Select the cluster you want to remove

    2. Click REMOVE

    3. A dialog appears: Make sure to carefully read the message before removing

    4. Click REMOVE to confirm the removal.

    Using API

    Go to the Clusters API reference to view the available actions

    Troubleshooting

    Before starting, make sure you have access to the Kubernetes cluster where NVIDIA Run:ai is deployed with the necessary permissions

    Troubleshooting Scenarios

    Cluster disconnected

    Description: When the cluster's status is ‘disconnected’, there is no communication from the cluster services reaching the NVIDIA Run:ai Platform. This may be due to networking issues or issues with NVIDIA Run:ai services.

    Mitigation:

    1. Check NVIDIA Run:ai’s services status:

      1. Open your terminal

      2. Make sure you have access to the Kubernetes cluster with permissions to view pods

      3. Copy and paste the following command to verify that NVIDIA Run:ai’s services are running:
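        For example, assuming the default runai namespace:

        kubectl get pods -n runai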

      4. If any of the services are not running, see the ‘cluster has service issues’ scenario.

    2. Check the network connection

      1. Open your terminal

      2. Make sure you have access to the Kubernetes cluster with permissions to create pods

      3. Copy and paste the following command to create a connectivity check pod:
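        A minimal sketch of such a check (the image and control-plane URL are placeholders, not the exact command from the original guide):

        kubectl run connectivity-check -n runai --rm -it --restart=Never --image=curlimages/curl --command -- curl -sSI https://<control-plane-url>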

    3. Check and modify the network policies

      1. Open your terminal

      2. Copy and paste the following command to check the existence of network policies:
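        For example:

        kubectl get networkpolicies --all-namespaces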

      3. Review the policies to ensure that they allow traffic from the NVIDIA Run:ai namespace to the Control Plane. If necessary, update the policies to allow the required traffic

    4. Check NVIDIA Run:ai services logs

      1. Open your terminal

      2. Make sure you have access to the Kubernetes cluster with permissions to view logs

      3. Copy and paste the following commands to view the logs of the NVIDIA Run:ai services:
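        For example, list the deployments first and then view the logs of a specific service (the deployment name is a placeholder):

        kubectl get deployments -n runai
        kubectl logs -n runai deployment/<deployment-name>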

    5. Diagnosing internal network issues: NVIDIA Run:ai operates on Kubernetes, which uses its internal subnet and DNS services for communication between pods and services. If you find connectivity issues in the logs, the problem might be related to Kubernetes' internal networking.

      To diagnose DNS or connectivity issues, you can start a debugging pod with networking utilities:

      1. Copy the following command to your terminal, to start a pod with networking tools:
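        One possible form, using a general-purpose networking image (the image choice is an example, not necessarily the one from the original guide; the pod starts with the image's default interactive shell):

        kubectl run netutils -n runai --rm -it --restart=Never --image=nicolaka/netshoot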

        This command creates an interactive pod (netutils) where you can use networking commands like ping, curl, nslookup

    6. Contact NVIDIA Run:ai’s support

      • If the issue persists, contact NVIDIA Run:ai support for assistance.

    Cluster has service issues

    Description: When a cluster's status is ‘Service issues’, it means that one or more NVIDIA Run:ai services running in the cluster are not available.

    Mitigation:

    1. Verify non-functioning services

      1. Open your terminal

      2. Make sure you have access to the Kubernetes cluster with permissions to view the runaiconfig resource

      3. Copy and paste the following command to determine which services are not functioning:
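        For example, inspect the runaiconfig resource and review its status section (assuming the default runai namespace):

        kubectl get runaiconfig runai -n runai -o yaml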

    2. Check for Kubernetes events

      1. Open your terminal

      2. Make sure you have access to the Kubernetes cluster with permissions to view events

      3. Copy and paste the following command to get all events:
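        For example:

        kubectl get events -n runai --sort-by=.metadata.creationTimestamp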

    3. Inspect resource details

      1. Open your terminal

      2. Make sure you have access to the Kubernetes cluster with permissions to describe resources

      3. Copy and paste the following command to check the details of the required resource:
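        For example (the resource type and name are placeholders):

        kubectl describe <resource-type> <resource-name> -n runai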

    4. Contact NVIDIA Run:ai’s Support

      • If the issue persists, contact NVIDIA Run:ai support for assistance.

    Cluster is waiting to connect

    Description: When the cluster's status is ‘waiting to connect’, it means that no communication from the cluster services reaches the NVIDIA Run:ai Platform. This may be due to networking issues or issues with NVIDIA Run:ai services.

    Mitigation:

    1. Check NVIDIA Run:ai’s services status

      1. Open your terminal

      2. Make sure you have access to the Kubernetes cluster with permissions to view pods

      3. Copy and paste the following command to verify that NVIDIA Run:ai’s services are running:

      4. If any of the services are not running, see the ‘cluster has service issues’ scenario.

    2. Check the network connection

      1. Open your terminal

      2. Make sure you have access to the Kubernetes cluster with permissions to create pods

      3. Copy and paste the following command to create a connectivity check pod:

    3. Check and modify the network policies

      1. Open your terminal

      2. Copy and paste the following command to check the existence of network policies:

      3. Review the policies to ensure that they allow traffic from the NVIDIA Run:ai namespace to the Control Plane. If necessary, update the policies to allow the required traffic

    4. Check NVIDIA Run:ai services logs

      1. Open your terminal

      2. Make sure you have access to the Kubernetes cluster with permissions to view logs

      3. Copy and paste the following commands to view the logs of the NVIDIA Run:ai services:

    5. Contact NVIDIA Run:ai’s support

      • If the issue persists, contact NVIDIA Run:ai support for assistance

    Cluster is missing prerequisites

    Description: When a cluster's status displays Missing prerequisites, it indicates that at least one of the Mandatory Prerequisites has not been fulfilled. In such cases, NVIDIA Run:ai services may not function properly.

    Mitigation:

    If you have ensured that all prerequisites are installed and the status still shows missing prerequisites, follow these steps:

    1. Check the message in the NVIDIA Run:ai platform for further details regarding the missing prerequisites.

    2. Inspect the runai-public ConfigMap:

      1. Open your terminal. In the terminal, type the following command to list all ConfigMaps in the runai-public namespace:
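        For example:

        kubectl get configmaps -n runai-public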

    3. Describe the ConfigMap

      1. Locate the ConfigMap named runai-public from the list

      2. To view the detailed contents of this ConfigMap, type the following command:
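        For example:

        kubectl describe configmap runai-public -n runai-public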

    4. Find Missing Prerequisites

      1. In the output displayed, look for a section labeled dependencies.required

      2. This section provides detailed information about any missing resources or prerequisites. Review this information to identify what is needed

    5. Contact NVIDIA Run:ai’s support

      • If the issue persists, contact NVIDIA Run:ai support for assistance

    To see the full runaiconfig object structure, use:
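    kubectl get crds/runaiconfigs.run.ai -n runai -o yaml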

    Configurations

    The following configurations allow you to enable or disable features, control permissions, and customize the behavior of your NVIDIA Run:ai cluster:

    Key
    Description

    spec.global.affinity (object)

    Sets the system nodes where NVIDIA Run:ai system-level services are scheduled. Using global.affinity will overwrite the nodes set using the Administrator CLI (runai-adm). Default: Prefer to schedule on nodes that are labeled with node-role.kubernetes.io/runai-system

    spec.global.nodeAffinity.restrictScheduling (boolean)

    Enables setting and restricting workload scheduling to designated nodes Default: false

    spec.global.tolerations (object)

    Configure Kubernetes tolerations for NVIDIA Run:ai system-level services

    spec.global.ingress.ingressClass

    NVIDIA Run:ai uses NGINX as the default ingress controller. If your cluster has a different ingress controller, you can configure the ingress class to be created by NVIDIA Run:ai.

    spec.global.subdomainSupport (boolean)

    Allows the creation of subdomains for ingress endpoints, enabling access to workloads via unique subdomains on the cluster domain. For details, see host-based routing. Default: false

    spec.global.enableWorkloadOwnershipProtection (boolean)

    Prevents users within the same project from deleting workloads created by others. This enhances workload ownership security and ensures better collaboration by restricting unauthorized modifications or deletions. Default: false

    NVIDIA Run:ai Services Resource Management

    NVIDIA Run:ai cluster includes many different services. To simplify resource management, the configuration structure allows you to configure the containers CPU / memory resources for each service individually or group of services together.

    Service Group
    Description
    NVIDIA Run:ai containers

    SchedulingServices

    Containers associated with the NVIDIA Run:ai Scheduler

    Scheduler, StatusUpdater, MetricsExporter, PodGrouper, PodGroupAssigner, Binder

    SyncServices

    Containers associated with syncing updates between the NVIDIA Run:ai cluster and the NVIDIA Run:ai control plane

    Agent, ClusterSync, AssetsSync

    WorkloadServices

    Containers associated with submitting NVIDIA Run:ai workloads

    WorkloadController, JobController

    Apply the following configuration in order to change resources request and limit for a group of services:

    Or, apply the following configuration in order to change resources request and limit for each service individually:

    For resource recommendations, see Vertical scaling.

    NVIDIA Run:ai Services Replicas

    By default, all NVIDIA Run:ai containers are deployed with a single replica. Some services support multiple replicas for redundancy and performance.

    To simplify configuring replicas, a global replicas configuration can be set and is applied to all supported services:

    This can be overwritten for specific services (if supported). Services without the replicas configuration do not support replicas:

    Prometheus

    The Prometheus instance in NVIDIA Run:ai is used for metrics collection and alerting.

    The configuration scheme follows the official PrometheusSpec and supports additional custom configurations. The PrometheusSpec schema is available using the spec.prometheus.spec configuration.

    A common use case using the PrometheusSpec is for metrics retention. This prevents metrics loss during potential connectivity issues and can be achieved by configuring local temporary metrics retention. For more information, see Prometheus Storage:

    In addition to the PrometheusSpec schema, some custom NVIDIA Run:ai configurations are also available:

    • Additional labels – Set additional labels for NVIDIA Run:ai's built-in alerts sent by Prometheus.

    • Log level configuration – Configure the logLevel setting for the Prometheus container.

    NVIDIA Run:ai Managed Nodes

    To include or exclude specific nodes from running workloads within a cluster managed by NVIDIA Run:ai, use the nodeSelectorTerms flag. For additional details, see Kubernetes nodeSelector.

    Label the nodes using the below:

    • key: Label key (e.g., zone, instance-type).

    • operator: Operator defining the inclusion/exclusion condition (In, NotIn, Exists, DoesNotExist).

    • values: List of values for the key when using In or NotIn.

    The below example shows how to include NVIDIA GPUs only and exclude all other GPU types in a cluster with mixed nodes, based on product type GPU label:

    S3 and Git Sidecar Images

    For air-gapped environments, when working with a Local Certificate Authority, it is required to replace the default sidecar images in order to use the Git and S3 data source integrations. Use the following configurations:

  • You have created a project or have one created for you.
  • The project has an assigned quota of at least 0.5 GPU.

  • Dynamic GPU fractions is enabled.

    Note

    • Flexible workload submission is disabled by default. If unavailable, your administrator must enable it under General Settings → Workloads → Flexible workload submission.

    • Dynamic GPU fractions is disabled by default in the NVIDIA Run:ai UI. To use dynamic GPU fractions, it must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.

    Step 1: Logging In

    Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

    Run the below --help command to obtain the login options and log in according to your setup:

    runai login --help

    To use the API, you will need to obtain a token as shown in API authentication.

    Step 2: Submitting the First Workspace

    1. Go to the Workload manager → Workloads

    2. Click +NEW WORKLOAD and select Workspace

    3. Select under which cluster to create the workload

    4. Select the project in which your workspace will run

    5. Select Start from scratch to launch a new workspace quickly

    6. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

    7. Under Submission, select Flexible and click CONTINUE

    8. Click the load icon. A side pane appears, displaying a list of available environments. To add a new environment:

      • Click the + icon to create a new environment

      • Enter quick-start as the name for the environment. The name must be unique.

      • Enter the Image URL -

    9. Click the load icon. A side pane appears, displaying a list of available compute resources. To add a new compute resource:

      • Click the + icon to create a new compute resource

      • Enter request-limit as the name for the compute resource. The name must be unique.

      • Set GPU devices per pod

    10. Click CREATE WORKSPACE

    1. Go to the Workload manager → Workloads

    2. Click +NEW WORKLOAD and select Workspace

    3. Select under which cluster to create the workload

    4. Select the project in which your workspace will run

    Copy the following command to your terminal. Make sure to update the below with the name of your project and workload. For more details, see :

    Copy the following command to your terminal. Make sure to update the below parameters. For more details, see

    • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

    • <TOKEN> - The API access token obtained in API authentication

    • <PROJECT-ID>

    Step 3: Submitting the Second Workspace

    1. Go to the Workload manager → Workloads

    2. Click +NEW WORKLOAD and select Workspace

    3. Select the cluster where the previous workspace was created

    4. Select the project where the previous workspace was created

    5. Select Start from scratch to launch a new workspace quickly

    6. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

    7. Under Submission, select Flexible and click CONTINUE

    8. Click the load icon. A side pane appears, displaying a list of available environments. Select the environment created in Step 2.

    9. Click the load icon. A side pane appears, displaying a list of available compute resources. Select the compute resources created in Step 2.

    10. Click CREATE WORKSPACE

    1. Go to the Workload manager → Workloads

    2. Click +NEW WORKLOAD and select Workspace

    3. Select the cluster where the previous workspace was created

    4. Select the project where the previous workspace was created

    Copy the following command to your terminal. Make sure to update the below with the name of your project and workload. For more details, see :

    Copy the following command to your terminal. Make sure to update the below parameters. For more details, see

    • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

    • <TOKEN> - The API access token obtained in API authentication

    • <PROJECT-ID>

    Step 4: Connecting to the Jupyter Notebook

    1. Select the newly created workspace with the Jupyter application that you want to connect to

    2. Click CONNECT

    3. Select the Jupyter tool. The selected tool is opened in a new tab on your browser.

    4. Open a terminal and use the watch nvidia-smi command to get a constant reading of the memory consumed by the pod. Note that the number shown in the memory box is the Limit and not the Request or Guarantee.

    5. Open the file Untitled.ipynb and move the frame so you can see both tabs

    6. Execute both cells in Untitled.ipynb. This will consume about 3 GB of GPU memory and be well below the 4GB of the GPU Memory Request value.

    7. In the second cell, edit the value after --image-size from 100 to 200 and run the cell. This will increase the GPU memory utilization to about 11.5 GB which is above the Request value.

    1. To connect to the Jupyter Notebook, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>

    2. Open a terminal and use the watch nvidia-smi command to get a constant reading of the memory consumed by the pod. Note that the number shown in the memory box is the Limit and not the Request or Guarantee.

    3. Open the file Untitled.ipynb and move the frame so you can see both tabs

    1. To connect to the Jupyter Notebook, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>

    2. Open a terminal and use the watch nvidia-smi command to get a constant reading of the memory consumed by the pod. Note that the number shown in the memory box is the Limit and not the Request or Guarantee.

    3. Open the file Untitled.ipynb and move the frame so you can see both tabs

    Next Steps

    Manage and monitor your newly created workload using the Workloads table.

    The type of workload it serves

    Environments Table

    The Environments table can be found under Workload manager in the NVIDIA Run:ai platform.

    The Environments table provides a list of all the environments defined in the platform and allows you to manage them.

    The Environments table consists of the following columns:

    Column
    Description

    Environment

    The name of the environment

    Description

    A description of the environment

    Scope

    The scope of this environment within the organizational tree. Click the name of the scope to view the organizational tree diagram

    Image

    The application or service to be run by the workload

    Workload Architecture

    This can be either standard for running workloads on a single node or distributed for running distributed workloads on multiple nodes

    Tool(s)

    The tools and connection types the environment exposes

    Tools Associated with the Environment

    Click one of the values in the tools column to view the list of tools and their connection type.

    Column
    Description

    Tool name

    The name of the tool or application the AI practitioner can set up within the environment.

    Connection type

    The method by which you can access and interact with the running workload. It's essentially the "doorway" through which you can reach and use the tools the workload provides (e.g., node port, external URL, etc.)

    Workloads Associated with the Environment

    Click one of the values in the Workload(s) column to view the list of workloads and their parameters.

    Column
    Description

    Workload

    The workload that uses the environment

    Type

    The workload type (Workspace/Training/Inference)

    Status

    Represents the workload lifecycle. See the full list of workload statuses.

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

    Environments Created by NVIDIA Run:ai

    When installing NVIDIA Run:ai, you automatically get the environments created by NVIDIA Run:ai to ease the onboarding process and support different use cases out of the box. These environments are created at the scope of the account.

    Note

    The environments listed below are available based on your cluster settings. Some environments, such as vscode and rstudio, are only available in clusters with host-based routing.

    Environment
    Image
    Description

    jupyter-lab / jupyter-scipy

    jupyter/scipy-notebook

    An interactive development environment for Jupyter notebooks, code, and data visualization

    jupyter-tensorboard

    gcr.io/run-ai-demo/jupyter-tensorboard

    An integrated combination of the interactive Jupyter development environment and TensorFlow's visualization toolkit for monitoring and analyzing ML models

    tensorboard / tensorboad-tensorflow

    tensorflow/tensorflow:latest

    A visualization toolkit for TensorFlow that helps users monitor and analyze ML models, displaying various metrics and model architecture

    llm-server

    runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0

    Adding a New Environment

    Environment creation is limited to specific roles

    To add a new environment:

    1. Go to the Environments table

    2. Click +NEW ENVIRONMENT

    3. Select under which cluster to create the environment

    4. Select a scope

    5. Enter a name for the environment. The name must be unique.

    6. Optional: Provide a description of the essence of the environment

    7. Enter the Image URL. If a token or secret is required to pull the image, it is possible to create it via credentials. These credentials are automatically used once the image is pulled (which happens when the workload is submitted)

    8. Set the image pull policy - the condition for when to pull the image from the registry

    9. Set the workload architecture:

      • Standard Only standard workloads can use the environment. A standard workload consists of a single process.

      • Distributed Only distributed workloads can use the environment. A distributed workload consists of multiple processes working together. These processes can run on different nodes.

    10. Set the workload type:

      • Workspace

      • Training

      • Inference

    11. Optional: Set the connection for your tool(s). The tools must be configured in the image. When submitting a workload using the environment, it is possible to connect to these tools

      • Select the tool from the list (the available tools vary and include IDEs, experiment tracking tools, and more, including a custom tool of your choice)

      • Select the connection type

    12. Optional: Set a command and arguments for the container running the pod

      • When no command is added, the default command of the image is used (the image entrypoint)

      • The command can be modified while submitting a workload using the environment

      • The argument(s) can be modified while submitting a workload using the environment

    13. Optional: Set the environment variable(s)

      • Click +ENVIRONMENT VARIABLE

      • Enter a name

      • Select the source for the environment variable

    14. Optional: Set the container’s working directory to define where the container’s process starts running. When left empty, the default directory is used.

    15. Optional: Set where the UID, GID and supplementary groups are taken from. This can be:

      • From the image

      • From the IdP token (only available in SSO installations)

      • Custom (manually set) - decide whether the submitter can modify these values upon submission.

    16. Optional: Select Linux capabilities - Grant certain privileges to a container without granting all the privileges of the root user.

    17. Click CREATE ENVIRONMENT

    Note

    It is also possible to add environments directly when creating a specific workspace, training or inference workload.

    Editing an Environment

    To edit an existing environment:

    1. Select the environment you want to edit

    2. Click Edit

    3. Update the environment and click SAVE ENVIRONMENT

    Note

    • The already bound workload that is using this asset will not be affected.

    • llm-server and chatbot-ui environments cannot be edited.

    Copying an Environment

    To copy an existing environment:

    1. Select the environment you want to copy

    2. Click MAKE A COPY

    3. Enter a name for the environment. The name must be unique.

    4. Update the environment and click CREATE ENVIRONMENT

    Deleting an Environment

    To delete an environment:

    1. Select the environment you want to delete

    2. Click DELETE

    3. On the dialog, click DELETE to confirm

    Note

    Workloads that are already using this asset will not be affected.

    Using API

    Go to the Environment API reference to view the available actions

    workload assets
    kubectl edit runaiconfig runai -n runai
    kubectl get crds/runaiconfigs.run.ai -n runai -o yaml
    spec:
      global:
       <service-group-name>: # schedulingServices | SyncServices | WorkloadServices
         resources:
           limits:
             cpu: 1000m
             memory: 1Gi
           requests:
             cpu: 100m
             memory: 512Mi
    spec:
      <service-name>: # for example: pod-grouper
        resources:
          limits:
            cpu: 1000m
            memory: 1Gi
          requests:
            cpu: 100m
            memory: 512Mi
    spec:
      global: 
        replicaCount: 1 # default
    spec:
      <service-name>: # for example: pod-grouper
        replicas: 1 # default
    spec:  
      prometheus:
        spec: # PrometheusSpec
          retention: 2h # default 
          retentionSize: 20GB
    spec:  
      prometheus:
        logLevel: info # debug | info | warn | error
        additionalAlertLabels:
          - env: prod # example
    spec:   
      global:
         managedNodes:
           inclusionCriteria:
              nodeSelectorTerms:
              - matchExpressions:
                - key: nvidia.com/gpu.product  
                  operator: Exists
    spec:
      workload-controller:    
        s3FileSystemImage:
          name: goofys       
          registry: runai.jfrog.io/op-containers-prod      
          tag: 3.12.24    
        gitSyncImage:      
          name: git-sync      
          registry: registry.k8s.io     
          tag: v4.4.0

    spec.project-controller.createNamespaces (boolean)

    Allows Kubernetes namespace creation for new projects Default: true

    spec.project-controller.createRoleBindings (boolean)

    Specifies if role bindings should be created in the project's namespace Default: true

    spec.project-controller.limitRange (boolean)

    Specifies if limit ranges should be defined for projects Default: true

    spec.project-controller.clusterWideSecret (boolean)

    Allows Kubernetes Secrets creation at the cluster scope. See Credentials for more details. Default: true

    spec.workload-controller.additionalPodLabels (object)

    Set workload's Pod Labels in a format of key/value pairs. These labels are applied to all pods.

    spec.workload-controller.failureResourceCleanupPolicy

    NVIDIA Run:ai cleans the workload's unnecessary resources:

    • All - Removes all resources of the failed workload

    • None - Retains all resources

    • KeepFailing - Removes all resources except for those that encountered issues (primarily for debugging purposes)

    Default: All
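
    As a sketch, keeping failing resources for debugging could look like the following (illustrative snippet based on the key above):

    spec:
      workload-controller:
        failureResourceCleanupPolicy: KeepFailing # All | None | KeepFailing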

    spec.workload-controller.GPUNetworkAccelerationEnabled

    Enables GPU network acceleration. See Using GB200 NVL72 and Multi-Node NVLink Domains for more details. Default: false

    spec.mps-server.enabled (boolean)

    Enabled when using NVIDIA MPS Default: false

    spec.daemonSetsTolerations (object)

    Configure Kubernetes tolerations for NVIDIA Run:ai daemonSets / engine

    spec.runai-container-toolkit.logLevel (string)

    Specifies the NVIDIA Run:ai-container-toolkit logging level: either 'SPAM', 'DEBUG', 'INFO', 'NOTICE', 'WARN', or 'ERROR' Default: INFO

    spec.runai-container-toolkit.enabled (boolean)

    Enables workloads to use GPU fractions Default: true

    node-scale-adjuster.args.gpuMemoryToFractionRatio (object)

    A scaling-pod requesting a single GPU device will be created for every 1 to 10 pods requesting fractional GPU memory (1/gpuMemoryToFractionRatio). This value represents the ratio (0.1-0.9) of fractional GPU memory (any size) to GPU fraction (portion) conversion. Default: 0.1
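
    A minimal sketch, assuming this argument sits under spec like the other services (the 0.2 value is illustrative and means one scaling pod per five fractional-memory pods):

    spec:
      node-scale-adjuster:
        args:
          gpuMemoryToFractionRatio: 0.2 # illustrative; valid range 0.1-0.9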

    spec.global.core.dynamicFractions.enabled (boolean)

    Enables dynamic GPU fractions Default: true

    spec.global.core.swap.enabled (boolean)

    Enables memory swap for GPU workloads Default: false

    spec.global.core.swap.limits.cpuRam (string)

    Sets the CPU memory size used to swap GPU workloads Default: 100Gi

    spec.global.core.swap.limits.reservedGpuRam (string)

    Sets the reserved GPU memory size used to swap GPU workloads Default: 2Gi
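
    For example, enabling GPU memory swap with custom limits could look like this sketch (values are illustrative; the key paths follow the settings above):

    spec:
      global:
        core:
          swap:
            enabled: true
            limits:
              cpuRam: 100Gi         # CPU memory used to swap GPU workloads
              reservedGpuRam: 2Gi   # reserved GPU memory used for swapping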

    spec.global.core.nodeScheduler.enabled (boolean)

    Enables the node-level scheduler Default: false

    spec.global.core.timeSlicing.mode (string)

    Sets the GPU time-slicing mode. Possible values:

    • timesharing - all pods on a GPU share the GPU compute time evenly.

    • strict - each pod gets an exact time slice according to its memory fraction value.

    • fair - each pod gets an exact time slice according to its memory fraction value and any unused GPU compute time is split evenly between the running pods.

    Default: timesharing
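
    A minimal sketch of setting the time-slicing mode in the runaiconfig (the mode value shown is an example):

    spec:
      global:
        core:
          timeSlicing:
            mode: strict # timesharing | strict | fair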

    spec.runai-scheduler.args.fullHierarchyFairness (boolean)

    Enables fairness between departments, on top of projects fairness Default: true

    spec.runai-scheduler.args.defaultStalenessGracePeriod

    Sets the timeout in seconds before the scheduler evicts a stale pod-group (gang) that went below its min-members in running state:

    • 0s - Immediately (no timeout)

    • -1 - Never

    Default: 60s
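
    For example, a sketch with an illustrative value (0s evicts immediately, -1 never evicts):

    spec:
      runai-scheduler:
        args:
          defaultStalenessGracePeriod: 120s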

    spec.pod-grouper.args.gangSchedulingKnative (boolean)

    Enables gang scheduling for inference workloads. For backward compatibility with versions earlier than v2.19, change the value to false Default: false

    spec.pod-grouper.args.gangScheduleArgoWorkflow (boolean)

    Groups all pods of a single ArgoWorkflow workload into a single Pod-Group for gang scheduling Default: true
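
    A sketch showing both pod-grouper flags together (the values shown are examples, not recommendations):

    spec:
      pod-grouper:
        args:
          gangSchedulingKnative: false     # example: restore pre-v2.19 behavior for Knative (inference) workloads
          gangScheduleArgoWorkflow: true   # group ArgoWorkflow pods into a single pod-group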

    spec.runai-scheduler.args.verbosity (int)

    Configures the level of detail in the logs generated by the scheduler service Default: 4

    spec.limitRange.cpuDefaultRequestCpuLimitFactorNoGpu (string)

    Sets a default ratio between the CPU request and the limit for workloads without GPU requests Default: 0.1

    spec.limitRange.memoryDefaultRequestMemoryLimitFactorNoGpu (string)

    Sets a default ratio between the memory request and the limit for workloads without GPU requests Default: 0.1

    spec.limitRange.cpuDefaultRequestGpuFactor (string)

    Sets a default amount of CPU allocated per GPU when the CPU is not specified Default: 100

    spec.limitRange.cpuDefaultLimitGpuFactor (int)

    Sets a default CPU limit based on the number of GPUs requested when no CPU limit is specified Default: NO DEFAULT

    spec.limitRange.memoryDefaultRequestGpuFactor (string)

    Sets a default amount of memory allocated per GPU when the memory is not specified Default: 100Mi

    spec.limitRange.memoryDefaultLimitGpuFactor (string)

    Sets a default memory limit based on the number of GPUs requested when no memory limit is specified Default: NO DEFAULT
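
    As an illustrative sketch, several of the limitRange defaults above can be overridden together, for example:

    spec:
      limitRange:
        cpuDefaultRequestCpuLimitFactorNoGpu: "0.2" # example ratio (default 0.1)
        cpuDefaultRequestGpuFactor: "200"           # example CPU allocated per GPU (default 100)
        memoryDefaultRequestGpuFactor: "200Mi"      # example memory allocated per GPU (default 100Mi)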

    node roles
    node roles
    Fully Qualified Domain Name (FQDN)
    External Access to Containers
    Select a framework from the list.

    When inference is selected, define the endpoint of the model by providing both the protocol and the container’s serving port

    External URL
    • Auto generate - A unique URL is automatically created for each workload using the environment

    • Custom URL - The URL is set manually

  • Node port

    • Auto generate - A unique port is automatically exposed for each workload using the environment

    • Custom port - Set the port manually

  • Set the container port

  • Custom

    • Enter a value

    • Leave empty

    • Add instructions for the expected value if any

  • Credentials - Select an existing credential as the environment variable

    • Select a credential name To add new credentials to the credentials list, and for additional information, see Credentials.

    • Select a secret key

  • ConfigMap - Select a predefined ConfigMap

    • Select a ConfigMap name To create a ConfigMap in your cluster, see Creating ConfigMaps in advance.

    • Enter a ConfigMap key

  • The environment variables can be modified and new variables can be added while submitting a workload using the environment

  • Set the User ID (UID), Group ID (GID) and the supplementary groups that can run commands in the container

    • Enter UID

    • Enter GID

    • Add Supplementary groups (multiple groups can be added, separated by commas)

    • Disable 'Allow the values above to be modified within the workload' if you want the above values to be used as the defaults

  • Workload(s)

    The list of existing workloads that use the environment

    Workload types

    The workload types that can use the environment (Workspace/ Training / Inference)

    Template(s)

    The list of workload templates that use this environment

    Created by

    The user who created the environment. By default NVIDIA Run:ai UI comes with preinstalled environments created by NVIDIA Run:ai

    Creation time

    The timestamp of when the environment was created

    Last updated

    The timestamp of when the environment was last updated

    Cluster

    The cluster with which the environment is associated

    A vLLM-based server that hosts and serves large language models for inference, enabling API-based access to AI models

    chatbot-ui - runai.jfrog.io/core-llm/llm-app - A user interface for interacting with chat-based AI models, often used for testing and deploying chatbot applications

    rstudio - rocker/rstudio:4 - An integrated development environment (IDE) for R, commonly used for statistical computing and data analysis

    vscode - ghcr.io/coder/code-server - A fast, lightweight code editor with powerful features like intelligent code completion, debugging, Git integration, and extensions, ideal for web development, data science, and more

    gpt2 - runai.jfrog.io/core-llm/quickstart-inference:gpt2-cpu - A package containing an inference server, GPT2 model and chat UI, often used for quick demos

    credentials of type docker registry
    scope
    Integrations
    workload status

    Replace <control-plane-endpoint> with the URL of the Control Plane in your environment. If the pod fails to connect to the Control Plane, check for potential network policies

    Example of allowing traffic:
  • Check infrastructure-level configurations:

    • Ensure that firewall rules and security groups allow traffic between your Kubernetes cluster and the Control Plane

    • Verify required ports and protocols:

      • Ensure that the necessary ports and protocols for NVIDIA Run:ai’s services are not blocked by any firewalls or security groups

  • Try to identify the problem from the logs. If you cannot resolve the issue, continue to the next step.

  • Deploy a network utilities pod (see the kubectl run netutils command below) and use tools such as ping, curl, nslookup, etc., to troubleshoot network issues.

  • Use this pod to perform network resolution tests and other diagnostics to identify any DNS or connectivity problems within your Kubernetes cluster.

  • Replace <control-plane-endpoint> with the URL of the Control Plane in your environment. If the pod fails to connect to the Control Plane, check for potential network policies:

    Example of allowing traffic:

  • Check infrastructure-level configurations:

  • Ensure that firewall rules and security groups allow traffic between your Kubernetes cluster and the Control Plane

  • Verify required ports and protocols:

    • Ensure that the necessary ports and protocols for NVIDIA Run:ai’s services are not blocked by any firewalls or security groups

  • Try to identify the problem from the logs. If you cannot resolve the issue, continue to the next step

    The version of Kubernetes installed

    NVIDIA Run:ai cluster UUID

    The unique ID of the cluster

    contact NVIDIA Run:ai’s support
    Kubernetes events
    contact NVIDIA Run:ai’s support
    contact NVIDIA Run:ai’s support
    contact NVIDIA Run:ai’s support
    table below
    See troubleshooting scenarios.
    See troubleshooting scenarios.
    See troubleshooting scenarios.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-control-plane-traffic
      namespace: runai
    spec:
      podSelector:
        matchLabels:
          app: runai
      policyTypes:
        - Ingress
        - Egress
      egress:
        - to:
            - ipBlock:
                cidr: <control-plane-ip-range>
          ports:
            - protocol: TCP
              port: <control-plane-port>
      ingress:
        - from:
            - ipBlock:
                cidr: <control-plane-ip-range>
          ports:
            - protocol: TCP
              port: <control-plane-port>
    kubectl get pods -n runai | grep -E 'runai-agent|cluster-sync|assets-sync'
    kubectl run control-plane-connectivity-check -n runai --image=wbitt/network-multitool --command -- /bin/sh -c 'curl -sSf <control-plane-endpoint> > /dev/null && echo "Connection Successful" || echo "Failed connecting to the Control Plane"'
    kubectl get networkpolicies -n runai
    kubectl logs deployment/runai-agent -n runai
    kubectl logs deployment/cluster-sync -n runai
    kubectl logs deployment/assets-sync -n runai
    kubectl run -i --tty netutils --image=dersimn/netutils -- bash
    kubectl get runaiconfig -n runai runai -ojson | jq -r '.status.conditions | map(select(.type == "Available"))'
    kubectl get events -A
    kubectl describe <resource_type> <name>
    kubectl get pods -n runai | grep -E 'runai-agent|cluster-sync|assets-sync'
    kubectl run control-plane-connectivity-check -n runai --image=wbitt/network-multitool --command -- /bin/sh -c 'curl -sSf <control-plane-endpoint> > /dev/null && echo "Connection Successful" || echo "Failed connecting to the Control Plane"'
    kubectl get networkpolicies -n runai
    kubectl logs deployment/runai-agent -n runai
    kubectl logs deployment/cluster-sync -n runai
    kubectl logs deployment/assets-sync -n runai
    kubectl get configmap -n runai-public
    kubectl describe configmap runai-public -n runai-public
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-control-plane-traffic
      namespace: runai
    spec:
      podSelector:
        matchLabels:
          app: runai
      policyTypes:
        - Ingress
        - Egress
      egress:
        - to:
            - ipBlock:
                cidr: <control-plane-ip-range>
          ports:
            - protocol: TCP
              port: <control-plane-port>
      ingress:
        - from:
            - ipBlock:
                cidr: <control-plane-ip-range>
          ports:
            - protocol: TCP
              port: <control-plane-port>
  • CPUs (Cores) - This column is displayed only if CPU quota is enabled via the General settings. Represents the number of CPU cores you want to allocate for this department in this node pool (decimal number).

  • CPU memory - This column is displayed only if CPU quota is enabled via the General settings. Represents the amount of CPU memory you want to allocate for this department in this node pool (in Megabytes or Gigabytes).

  • Under the SCHEDULING PREFERENCES tab

    • Department priority - Sets the department's scheduling priority compared to other departments in the same node pool, using one of the following priorities:

      • Highest - 255

      • VeryHigh - 240

      • High - 210

      • MediumHigh - 180

      • Medium - 150

      • MediumLow - 100

      • Low - 50

      • VeryLow - 20

      • Lowest - 1

      For v2.21, the default value is MediumLow. All departments are set with the same default value, therefore there is no change in scheduling behavior unless the Administrator changes any department priority values. To learn more about department priority, see The NVIDIA Run:ai Scheduler: concepts and principles.

    • Over-quota - If over-quota weight is enabled via the General settings, then Over-quota weight is presented; otherwise, Over-quota is presented

      • Over-quota - When enabled, the department can use non-guaranteed overage resources above its quota in this node pool. The amount of the non-guaranteed overage resources for this department is calculated proportionally to the department's quota in this node pool. When disabled, the department cannot use more resources than the guaranteed quota in this node pool.

      • Over-quota weight - Represents a weight used to calculate the amount of non-guaranteed overage resources a department can get on top of its quota in this node pool. All unused resources are split between departments that require the use of overage resources:

    • Department max. GPU device allocation - Represents the maximum GPU device allocation the department can get from this node pool - the maximum sum of quota and over-quota GPUs (decimal number).

  • Enter the Image URL - gcr.io/run-ai-lab/pytorch-example-jupyter
  • Tools - Set the connection for your tool:

    • Click +TOOL

    • Select Jupyter tool from the list

  • Set the runtime settings for the environment. Click +COMMAND & ARGUMENTS and add the following:

    • Enter the command - start-notebook.sh

    • Enter the arguments - --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''

    Note: If host-based routing is enabled on the cluster, enter the --NotebookApp.token='' only.

  • Click CREATE ENVIRONMENT

  • Select the newly created environment from the side pane

  • Set GPU devices per pod - 1
  • Set GPU memory per device

    • Select GB - Fraction of a GPU device’s memory

    • Set the memory Request - 4GB (the workload will allocate 4GB of the GPU memory)

    • Toggle Limit and set to 12

  • Optional: set the CPU compute per pod - 0.1 cores (default)

  • Optional: set the CPU memory per pod - 100 MB (default)

  • Select More settings and toggle Increase shared memory size

  • Click CREATE COMPUTE RESOURCE

  • Select the newly created compute resource from the side pane

  • Select Start from scratch to launch a new workspace quickly

  • Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  • Under Submission, select Original and click CONTINUE

  • Create an environment for your workspace

    • Click +NEW ENVIRONMENT

    • Enter quick-start as the name for the environment. The name must be unique.

    • Enter the Image URL - gcr.io/run-ai-lab/pytorch-example-jupyter

    • Tools - Set the connection for your tool

      • Click +TOOL

      • Select Jupyter tool from the list

    • Set the runtime settings for the environment. Click +COMMAND & ARGUMENTS and add the following:

      • Enter the command - start-notebook.sh

      • Enter the arguments - --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''

      Note: If host-based routing is enabled on the cluster, enter the --NotebookApp.token='' only.

    • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  • Create a new “request-limit” compute resource for your workspace

    • Click +NEW COMPUTE RESOURCE

    • Enter request-limit as the name for the compute resource. The name must be unique.

    • Set GPU devices per pod - 1

    • Set GPU memory per device

      • Select GB - Fraction of a GPU device’s memory

      • Set the memory Request - 4GB (the workload will allocate 4GB of the GPU memory)

      • Toggle Limit and set to 12

    • Optional: set the CPU compute per pod - 0.1 cores (default)

    • Optional: set the CPU memory per pod - 100 MB (default)

    • Select More settings and toggle Increase shared memory size

    • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  • Click CREATE WORKSPACE

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

  • toolType will show the Jupyter icon when connecting to the Jupyter tool via the user interface.

  • toolName will show "Jupyter" when connecting to the Jupyter tool via the user interface.

  • Note

    The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

  • Select Start from scratch to launch a new workspace quickly

  • Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  • Under Submission, select Original and click CONTINUE

  • Select the environment created in Step 2

  • Select the compute resource created in Step 2

  • Click CREATE WORKSPACE

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

  • toolType will show the Jupyter icon when connecting to the Jupyter tool via the user interface.

  • toolName will show "Jupyter" when connecting to the Jupyter tool via the user interface.

  • Note

    The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

    Execute both cells in Untitled.ipynb. This will consume about 3 GB of GPU memory and be well below the 4GB of the GPU Memory Request value.

  • In the second cell, edit the value after --image-size from 100 to 200 and run the cell. This will increase the GPU memory utilization to about 11.5 GB which is above the Request value.

  • Execute both cells in Untitled.ipynb. This will consume about 3 GB of GPU memory and be well below the 4GB of the GPU Memory Request value.

  • In the second cell, edit the value after --image-size from 100 to 200 and run the cell. This will increase the GPU memory utilization to about 11.5 GB which is above the Request value.

  • CLI reference
    Workspaces API:
    Step 1
    Step 2
    Step 2
    CLI reference
    Workspaces API:
    Step 1
    Get Projects API
    Get Projects API

    Projects

    This section explains the procedure to manage Projects.

    Researchers submit AI workloads. To streamline resource allocation and prioritize work, NVIDIA Run:ai introduces the concept of Projects. Projects are the tool to implement resource allocation policies as well as the segregation between different initiatives. A project may represent a team, an individual, or an initiative that shares resources or has a specific resource quota. Projects may be aggregated in NVIDIA Run:ai departments.

    For example, you may have several people involved in a specific face-recognition initiative collaborating under one project named “face-recognition-2024”. Alternatively, you can have a project per person in your team, where each member receives their own quota.

    Projects Table

    The Projects table can be found under Organization in the NVIDIA Run:ai platform.

    The Projects table provides a list of all projects defined for a specific cluster, and allows you to manage them. You can switch between clusters by selecting your cluster using the filter at the top.

    The Projects table consists of the following columns:

    Column
    Description

    Node Pools with Quota Associated with the Project

    Click one of the values of Node pool(s) with quota column, to view the list of node pools and their parameters

    Column
    Description

    Subjects Authorized for the Project

    Click one of the values in the Subject(s) column, to view the list of subjects and their parameters. This column is only viewable, if your role in the NVIDIA Run:ai system affords you those permissions.

    Column
    Description

    Workloads Associated with the Project

    Click one of the values of Workload(s) column, to view the list of workloads and their parameters

    Column
    Description

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    Adding a New Project

    To create a new Project:

    1. Click +NEW PROJECT

    2. Select a scope. You can only view clusters if you have permission to do so, within the scope of the roles assigned to you

    3. Enter a name for the project. Project names must start with a letter and can only contain lowercase Latin letters, numbers, or a hyphen ('-')

    4. Namespace associated with Project - Each project has an associated (Kubernetes) namespace in the cluster. All workloads under this project use this namespace.

    Note

    Setting the quota to 0 (either GPU, CPU, or CPU memory) and the over quota to ‘disabled’ or over quota weight to ‘none’ means the project is blocked from using those resources on this node pool.

    When no node pools are configured, you can set the same parameters for the whole project, instead of per node pool. After node pools are created, you can set the above parameters for each node-pool separately.

    1. Set as required.

    2. Click CREATE PROJECT

    Adding an Access Rule to a Project

    To create a new access rule for a project:

    1. Select the project you want to add an access rule for

    2. Click ACCESS RULES

    3. Click +ACCESS RULE

    4. Select a subject

    Deleting an Access Rule from a Project

    To delete an access rule from a project:

    1. Select the project you want to remove an access rule from

    2. Click ACCESS RULES

    3. Find the access rule you want to delete

    4. Click on the trash icon

    Editing a Project

    To edit a project:

    1. Select the project you want to edit

    2. Click EDIT

    3. Update the Project and click SAVE

    Viewing a Project’s Policy

    To view the policy of a project:

    1. Select the project for which you want to view its policy. This option is only active for projects with defined policies in place.

    2. Click VIEW POLICY and select the workload type for which you want to view the policies: a. Workspace workload type policy with its set of rules b. Training workload type policies with its set of rules

    3. In the Policy form, view the workload rules that are enforcing your project for the selected workload type as well as the defaults:

    Note

    The policy affecting the project consists of rules and defaults. Some of these rules and defaults may be derived from policies of a parent cluster and/or department (source). You can see the source of each rule in the policy form.

    Deleting a Project

    To delete a project:

    1. Select the project you want to delete

    2. Click DELETE

    3. On the dialog, click DELETE to confirm

    Note

    Clusters < v2.20

    Deleting a project does not delete its associated namespace, any of the workloads running using this namespace, or the policies defined for this project. However, any assets created in the scope of this project such as compute resources, environments, data sources, templates and credentials, are permanently deleted from the system.

    Clusters >=v2.20

    Deleting a project does not delete its associated namespace, but will attempt to delete its associated workloads and assets. Any assets created in the scope of this project, such as compute resources, environments, data sources, templates and credentials, are permanently deleted from the system.

    Using API

    To view the available actions, go to the API reference.

    Workloads

    This section explains the procedure for managing workloads.

    Workloads Table

    The Workloads table can be found under Workload manager in the NVIDIA Run:ai platform.

    The workloads table provides a list of all the workloads scheduled on the NVIDIA Run:ai Scheduler, and allows you to manage them.

    The Workloads table consists of the following columns:

    Column
    Description

    Workload Status

    The following table describes the different phases in a workload life cycle. The UI provides additional details for some of the below workload statuses which can be viewed by clicking the icon next to the status.

    Status
    Description
    Entry Condition
    Exit Condition

    Pods Associated with the Workload

    Click one of the values in the Running/requested pods column, to view the list of pods and their parameters.

    Column
    Description

    Connections Associated with the Workload

    A connection refers to the method by which you can access and interact with the running workloads. It is essentially the "doorway" through which you can reach and use the applications (tools) these workloads provide.

    Click one of the values in the Connection(s) column, to view the list of connections and their parameters. Connections are network interfaces that communicate with the application running in the workload. Connections are either the URL the application exposes or the IP and the port of the node that the workload is running on.

    Column
    Description

    Data Sources Associated with the Workload

    Click one of the values in the Data source(s) column to view the list of data sources and their parameters.

    Column
    Description

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    Show/Hide Details

    Click a row in the Workloads table and then click the SHOW DETAILS button at the upper-right side of the action bar. The details pane appears, presenting the following tabs:

    Event History

    Displays the workload status over time. It displays events describing the workload lifecycle and alerts on notable events. Use the filter to search through the history for specific events.

    Metrics

    • GPU utilization Per GPU graph and an average of all GPUs graph, all on the same chart, along an adjustable period allows you to see the trends of all GPUs compute utilization (percentage of GPU compute) in this node.

    • GPU memory utilization Per GPU graph and an average of all GPUs graph, all on the same chart, along an adjustable period allows you to see the trends of all GPUs memory usage (percentage of the GPU memory) in this node.

    • CPU compute utilization The average of all CPUs’ cores compute utilization graph, along an adjustable period allows you to see the trends of CPU compute utilization (percentage of CPU compute) in this node.

    Logs

    Workload events are ordered in chronological order. The logs contain events from the workload’s lifecycle to help monitor and debug issues.

    Note

    Logs are available only while the workload is in a non-terminal state. Once the workload completes or fails, logs are no longer accessible.

    Adding a New Workload

    Before starting, make sure you have created a project or have one created for you to work with workloads.

    To create a new workload:

    1. Click +NEW WORKLOAD

    2. Select a workload type - Follow the links below to view the step-by-step guide for each workload type:

      • Workspace - Used for data preparation and model-building tasks.

      • Training - Used for standard training tasks of all sorts

    Stopping a Workload

    Stopping a workload kills the workload pods and releases the workload resources.

    1. Select the workload you want to stop

    2. Click STOP

    Running a Workload

    Running a workload spins up new pods and resumes the workload work after it was stopped.

    1. Select the workload you want to run again

    2. Click RUN

    Connecting to a Workload

    To connect to an application running in the workload (for example, Jupyter Notebook)

    1. Select the workload you want to connect to

    2. Click CONNECT

    3. Select the tool from the drop-down list

    4. The selected tool is opened in a new tab on your browser

    Copying a Workload

    1. Select the workload you want to copy

    2. Click MAKE A COPY

    3. Enter a name for the workload. The name must be unique.

    4. Update the workload and click CREATE WORKLOAD

    Deleting a Workload

    1. Select the workload you want to delete

    2. Click DELETE

    3. On the dialog, click DELETE to confirm the deletion

    Note

    Once a workload is deleted you can view it in the Deleted tab in the workloads view. This tab is displayed only if enabled by your Administrator, under General settings → Workloads → Deleted workloads

    Using API

    Go to the API reference to view the available actions

    Troubleshooting

    To understand the condition of the workload, review the workload status in the Workloads table. For more information, check the workload's event history.

    Listed below are a number of known issues when working with workloads and how to fix them:

    Issue
    Mediation

    GPU Time-Slicing

    NVIDIA Run:ai supports simultaneous submission of multiple workloads to single or multi-GPUs when using GPU fractions. This is achieved by slicing the GPU memory between the different workloads according to the requested GPU fraction, and by using NVIDIA’s GPU time-slicing to share the GPU compute runtime. NVIDIA Run:ai ensures each workload receives the exact share of the GPU memory (= gpu_memory * requested), while the NVIDIA GPU time-slicing splits the GPU runtime evenly between the different workloads running on that GPU.

    To provide customers with predictable and accurate GPU compute resource scheduling, NVIDIA Run:ai’s GPU time-slicing adds fractional compute capabilities on top of NVIDIA Run:ai GPU fraction capabilities.

    How GPU Time-Slicing Works

    While the default NVIDIA GPU time-slicing shares the GPU compute runtime evenly without splitting or limiting the runtime of each workload, NVIDIA Run:ai's GPU time-slicing mechanism gives each workload exclusive access to the full GPU for a limited amount of time (the lease time) in each scheduling cycle (the plan time). This cycle repeats itself for the lifetime of the workload. Using the GPU runtime this way guarantees that a workload is granted its requested GPU compute resources proportionally to its requested GPU fraction, while also allowing unused GPU compute time to be split between workloads, up to each workload's requested limit.

    For example, when there are 2 workloads running on the same GPU, with NVIDIA’s default GPU time slicing, each workload gets 50% of the GPU compute runtime, even if one workload requests 25% of the GPU memory, and the other workload requests 75% of the GPU memory. With the NVIDIA Run:ai GPU time-slicing, the first workload will get 25% of the GPU compute time and the second will get 75%. If one of the workloads does not use its deserved GPU compute time, the others can split that time evenly between them. As shown in the example, if one of the workloads does not request the GPU for some time, the other will get the full GPU compute time.

    GPU Time-Slicing Modes

    NVIDIA Run:ai offers two GPU time-slicing modes:

    • Strict - Each workload gets its precise GPU compute fraction, which equals to its requested GPU (memory) fraction. In terms of official Kubernetes resource specification, this means:

    • Fair - Each workload is guaranteed at least its GPU compute fraction, but at the same time can also use additional GPU runtime compute slices that are not used by other idle workloads. Those excess time slices are divided equally between all workloads running on that GPU (after each got at least its requested GPU compute fraction). In terms of official Kubernetes resource specification, this means:

    The figure below illustrates how Strict time-slicing mode uses the GPU from Lease (slice) and Plan (cycle) perspective:

    The figure below illustrates how Fair time-slicing mode uses the GPU from Lease (slice) and Plan (cycle) perspective:

    Time-Slicing Plan and Lease Times

    Each GPU scheduling cycle is a plan. The plan is determined by the lease time and granularity (precision). By default, basic lease time is 250ms with 5% granularity (precision), which means the plan (cycle) time is: 250 / 0.05 = 5000ms (5 Sec). Using these values, a workload that requests gpu-fraction=0.5 gets 2.5s runtime out of the 5s cycle time.

    Different workloads require different SLAs and precision, so it is also possible to tune the lease time and precision to customize the time-slicing capabilities for your cluster.

    Note

    Decreasing the lease time makes time-slicing less accurate. Increasing the lease time makes the system more accurate, but each workload is less responsive.

    Once timeSlicing is enabled in the runaiconfig, all submitted GPU fraction or GPU memory workloads will have their gpu-compute-request/limit set automatically by the system, depending on the annotation used and the time-slicing mode:

    • Strict compute resources:

    • Fair compute resources:

    Note

    The above tables show that when submitting a workload using gpu-memory annotation, the system will split the GPU compute time between the different workloads running on that GPU. This means the workload can get anything from very little compute time (>0) to full GPU compute time (1.0).

    Enabling GPU Time-Slicing

    NVIDIA Run:ai's GPU time-slicing is a cluster flag which changes the default NVIDIA time-slicing used by GPU fractions. For more details, see Advanced cluster configurations.

    Enable GPU time-slicing by setting the following cluster flag in the runaiconfig file:

    If the timeSlicing flag is not set, the system continues to use the default NVIDIA GPU time-slicing to maintain backward compatibility.

    Data Sources

    This section explains what data sources are and how to create and use them.

    Data sources are a type of workload asset and represent a location where data is actually stored. They may represent a remote data location, such as NFS, Git, or S3, or a Kubernetes local resource, such as PVC, ConfigMap, HostPath, or Secret.

    This configuration simplifies the mapping of the data into the workload’s file system and handles the mounting process during workload creation for reading and writing. These data sources are reusable and can be easily integrated and used by AI practitioners while submitting workloads across various scopes.

    Data Sources Table

    The data sources table can be found under Workload manager in the NVIDIA Run:ai platform.

    runai project set "project-name"
    runai workspace submit "workload-name" \
    --image gcr.io/run-ai-lab/pytorch-example-jupyter \
    --gpu-memory-request 4G --gpu-memory-limit 12G --large-shm \
    --external-url container=8888 --name-prefix jupyter  \
    --command -- start-notebook.sh \
    --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=
    curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \ 
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <TOKEN>' \ 
    -d '{ 
        "name": "workload-name", 
        "projectId": "<PROJECT-ID>", 
        "clusterId": "<CLUSTER-UUID>",
        "spec": {
            "command" : "start-notebook.sh",
            "args" : "--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''",
            "image": "gcr.io/run-ai-lab/pytorch-example-jupyter",
            "compute": {
                "gpuDevicesRequest": 1,
                "gpuMemoryRequest": "4G",
                "gpuMemoryLimit": "12G",
                "largeShmRequest": true
    
            },
            "exposedUrls" : [
                { 
                    "container" : 8888,
                    "toolType": "jupyter-notebook", 
                    "toolName": "Jupyter"  
                }
            ]
        }
    }'
    runai project set "project-name"
    runai workspace submit "workload-name" \
    --image gcr.io/run-ai-lab/pytorch-example-jupyter --gpu-memory-request 4G \
    --gpu-memory-limit 12G --large-shm --external-url container=8888 \
    --name-prefix jupyter --command -- start-notebook.sh \
    --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=
    curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \ 
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <TOKEN>' \ 
    -d '{ 
        "name": "workload-name", 
        "projectId": "<PROJECT-ID>", 
        "clusterId": "<CLUSTER-UUID>",
        "spec": {
            "command" : "start-notebook.sh",
            "args" : "--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''",
            "image": "gcr.io/run-ai-lab/pytorch-example-jupyter",
            "compute": {
                "gpuDevicesRequest": 1,
                "gpuMemoryRequest": "4G",
                "gpuMemoryLimit": "12G",
                "largeShmRequest": true
    
            },
            "exposedUrls" : [
                { 
                    "container" : 8888,
                    "toolType": "jupyter-notebook",  
                    "toolName": "Jupyter" 
                }
            ]
        }
    }'

    Medium - The default value. The Admin can change the default to any of the following values: High, Low, Lowest, or None.

  • Lowest - The 'Lowest' over-quota weight has a unique behavior: it can only use over-quota (unused overage) resources if no other department needs them, and any department with a higher over-quota weight can snap the overage resources at any time.

  • None - When set, the department cannot use more resources than the guaranteed quota in this node pool.

  • In case over quota is disabled, workloads running under subordinate projects are not able to use more resources than the department’s quota, but each project can still go over-quota (if enabled at the project level) up to the department’s quota.

  • Unlimited CPU(Cores) and CPU memory quotas are an exception - in this case, workloads of subordinated projects can consume available resources up to the physical limitation of the cluster or any of the node pools.

  • The NVIDIA Run:ai Scheduler: concepts and principles
    only.
    host-based routing

    Creation time

    The timestamp of when the workload was created

    Completion time

    The timestamp the workload reached a terminal state (failed/completed)

    Connection(s)

    The method by which you can access and interact with the running workload. It's essentially the "doorway" through which you can reach and use the tools the workload provides (e.g., node port, external URL, etc.). Click one of the values in the column to view the list of connections and their parameters.

    Data source(s)

    Data resources used by the workload

    Environment

    The environment used by the workload

    Workload architecture

    Standard or distributed. A standard workload consists of a single process. A distributed workload consists of multiple processes working together. These processes can run on different nodes.

    GPU compute request

    Amount of GPU devices requested

    GPU compute allocation

    Amount of GPU devices allocated

    GPU memory request

    Amount of GPU memory Requested

    GPU memory allocation

    Amount of GPU memory allocated

    Idle GPU devices

    The number of allocated GPU devices that have been idle for more than 5 minutes

    CPU compute request

    Amount of CPU cores requested

    CPU compute allocation

    Amount of CPU cores allocated

    CPU memory request

    Amount of CPU memory requested

    CPU memory allocation

    Amount of CPU memory allocated

    Cluster

    The cluster that the workload is associated with

    Running

    Workload is currently in progress with all pods operational

    All pods initialized (all containers in pods are ready)

    Workload completion or failure

    Degraded

    Pods may not align with specifications, network services might be incomplete, or persistent volumes may be detached. Check your logs for specific details.

    • Pending - All pods are running but have issues.

    • Running - All pods are running with no issues.

    • Running - All resources are OK.

    • Completed - Workload finished with fewer resources

    • Failed - Workload failure or user-defined rules.

    Deleting

    Workload and its associated resources are being decommissioned from the cluster

    Deleting the workload

    Resources are fully deleted

    Stopped

    Workload is on hold and resources are intact but inactive

    Stopping the workload without deleting resources

    Transitioning back to the initializing phase or proceeding to deleting the workload

    Failed

    Image retrieval failed or containers experienced a crash. Check your logs for specific details

    An error occurs preventing the successful completion of the workload

    Terminal state

    Completed

    Workload has successfully finished its execution

    The workload has finished processing without errors

    Terminal state

    GPU memory allocation

    Amount of GPU memory allocated for the pod

    Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

  • Refresh - Click REFRESH to update the table with the latest data

  • Show/Hide details - Click to view additional information on the selected row

  • CPU memory utilization The utilization of all CPUs memory in a single graph, along an adjustable period allows you to see the trends of CPU memory utilization (percentage of CPU memory) in this node.

  • CPU memory usage The usage of all CPUs memory in a single graph, along an adjustable period allows you to see the trends of CPU memory usage (in GB or MB of CPU memory) in this node.

  • For GPUs charts - Click the GPU legend on the right-hand side of the chart, to activate or deactivate any of the GPU lines.

  • You can click the date picker to change the presented period

  • You can use your mouse to mark a sub-period in the graph for zooming in, and use Reset zoom to go back to the preset period

  • Changes in the period affect all graphs on this screen.

  • Distributed Training - Used for distributed tasks of all sorts

  • Inference - Used for inference and serving tasks

  • Job (legacy). This type is displayed only if enabled by your Administrator, under General settings → Workloads → Workload policies

  • Click CREATE WORKLOAD

  • Workload

    The name of the workload

    Type

    The workload type

    Preemptible

    Is the workload preemptible (Yes/no)

    Status

    The different phases in a workload lifecycle

    Project

    The project in which the workload runs

    Department

    The department that the workload is associated with. This column is visible only if the department toggle is enabled by your administrator.

    Created by

    The user who created the workload

    Running/requested pods

    The number of running pods out of the requested

    Creating

    Workload setup is initiated in the cluster. Resources and pods are now provisioning.

    A workload is submitted

    A multi-pod group is created

    Pending

    Workload is queued and awaiting resource allocation

    A pod group exists

    All pods are scheduled

    Initializing

    Workload is retrieving images, starting containers, and preparing pods

    All pods are scheduled

    Pod

    Pod name

    Status

    Pod lifecycle stages

    Node

    The node on which the pod resides

    Node pool

    The node pool in which the pod resides (applicable if node pools are enabled)

    Image

    The pod’s main image

    GPU compute allocation

    Amount of GPU devices allocated for the pod

    Name

    The name of the application running on the workload

    Connection type

    The network connection type selected for the workload

    Access

    Who is authorized to use this connection (everyone, specific groups/users)

    Address

    The connection URL

    Copy button

    Copy URL to clipboard

    Connect button

    Enabled only for supported tools

    Data source

    The name of the data source mounted to the workload

    Type

    The data source type

    Cluster connectivity issues (there are issues with your connection to the cluster error message)

    • Verify that you are on a network that has been granted access to the cluster.

    • Reach out to your cluster admin for instructions on verifying this.

    • If you are an admin, see the troubleshooting section in the cluster documentation

    Workload in “Initializing” status for some time

    • Check that you have access to the Container image registry.

    • Check the statuses of the pods in the pods’ dialog.

    • Check the event history for more details

    Workload has been pending for some time

    • Check that you have the required quota.

    • Check the project’s available quota in the project dialog.

    • Check that all services needed to run are bound to the workload.

    • Check the event history for more details.

    PVCs created using the K8s API or kubectl are not visible or mountable in NVIDIA Run:ai

    This is by design.

    1. Create a new data source of type PVC in the NVIDIA Run:ai UI

    2. In the Data mount section, select Existing PVC

    3. Select the PVC you created via the K8S API

    You are now able to select and mount this PVC in your NVIDIA Run:ai submitted workloads.

    Workload is not visible in the UI

    • Check that the workload hasn’t been deleted.

    • See the “Deleted” tab in the workloads view

    project
    Workspace
    Training
    Workloads
    workload’s event history

    All pods are initialized or a failure to initialize is detected

    Strict compute resources:

    Annotation | Value | GPU Compute Request | GPU Compute Limit
    gpu-fraction | x | x | x
    gpu-memory | x | 0 | 1.0

    Fair compute resources:

    Annotation | Value | GPU Compute Request | GPU Compute Limit
    gpu-fraction | x | x | 1.0
    gpu-memory | x | 0 | 1.0

    Advanced cluster configurations
    Strict time-slicing mode: gpu-compute-request = gpu-compute-limit = gpu-(memory-)fraction
    Fair time-slicing mode: gpu-compute-request = gpu-(memory-)fraction, gpu-compute-limit = 1.0
    global: 
        core: 
            timeSlicing: 
                 mode: fair/strict

    GPU allocation ratio

    The ratio of Allocated GPUs to GPU quota. This number reflects how well the project’s GPU quota is utilized by its descendent workloads. A number higher than 100% indicates the project is using over quota GPUs.

    GPU quota

    The GPU quota allocated to the project. This number represents the sum of all node pools’ GPU quota allocated to this project.

    Allocated CPUs (Core)

    The total number of CPU cores allocated by workloads submitted within this project. (This column is only available if the CPU Quota setting is enabled, as described below).

    Allocated CPU Memory

    The total amount of CPU memory allocated by successfully scheduled workloads under this project. (This column is only available if the CPU Quota setting is enabled, as described below).

    CPU quota (Cores)

    CPU quota allocated to this project. (This column is only available if the CPU Quota setting is enabled, as described below). This number represents the sum of all node pools’ CPU quota allocated to this project. The ‘unlimited’ value means the CPU (cores) quota is not bounded and workloads using this project can use as many CPU (cores) resources as they need (if available).

    CPU memory quota

    CPU memory quota allocated to this project. (This column is only available if the CPU Quota setting is enabled, as described below). This number represents the sum of all node pools’ CPU memory quota allocated to this project. The ‘unlimited’ value means the CPU memory quota is not bounded and workloads using this Project can use as much CPU memory resources as they need (if available).

    CPU allocation ratio

    The ratio of Allocated CPUs (cores) to CPU quota (cores). This number reflects how much the project’s ‘CPU quota’ is utilized by its descendent workloads. A number higher than 100% indicates the project is using over quota CPU cores.

    CPU memory allocation ratio

    The ratio of Allocated CPU memory to CPU memory quota. This number reflects how well the project’s ‘CPU memory quota’ is utilized by its descendent workloads. A number higher than 100% indicates the project is using over quota CPU memory.

    Node affinity of training workloads

    The list of NVIDIA Run:ai node-affinities. Any training workload submitted within this project must specify one of those NVIDIA Run:ai node affinities, otherwise it is not submitted.

    Node affinity of interactive workloads

    The list of NVIDIA Run:ai node-affinities. Any interactive (workspace) workload submitted within this project must specify one of those NVIDIA Run:ai node affinities, otherwise it is not submitted.

    Idle time limit of training workloads

    The time in days:hours:minutes after which the project stops a training workload not using its allocated GPU resources.

    Idle time limit of preemptible workloads

    The time in days:hours:minutes after which the project stops a preemptible interactive (workspace) workload not using its allocated GPU resources.

    Idle time limit of non preemptible workloads

    The time in days:hours:minutes after which the project stops a non-preemptible interactive (workspace) workload not using its allocated GPU resources.

    Interactive workloads time limit

    The duration in days:hours:minutes after which the project stops an interactive (workspace) workload

    Training workloads time limit

    The duration in days:hours:minutes after which the project stops a training workload

    Creation time

    The timestamp for when the project was created

    Workload(s)

    The list of workloads associated with the project. Click the values under this column to view the list of workloads with their resource parameters (as described below).

    Cluster

    The cluster that the project is associated with

    Allocated CPU memory

    The actual amount of CPU memory allocated by workloads using this node pool under this Project. The number of Allocated CPU memory may temporarily surpass the CPU memory quota if over quota is used.

    Order of priority

    The default order in which the Scheduler uses node-pools to schedule a workload. This is used only if the order of priority of node pools is not set in the workload during submission, either by an admin policy or the user. An empty value means the node pool is not part of the project’s default list, but can still be chosen by an admin policy or the user during workload submission

    GPU compute request

    The amount of GPU compute requested (floating number, represents either a portion of the GPU compute, or the number of whole GPUs requested)

    GPU memory request

    The amount of GPU memory requested (floating number, can either be presented as a portion of the GPU memory, an absolute memory size in MB or GB, or a MIG profile)

    CPU memory request

    The amount of CPU memory requested (floating number, presented as an absolute memory size in MB or GB)

    CPU compute request

    The amount of CPU compute requested (floating number, represents the number of requested Cores)

    Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

    1. By default, Run:ai creates a namespace based on the Project name (in the form of runai-<name>)

    2. Alternatively, you can choose an existing namespace created for you by the cluster administrator

  • In the Quota management section, you can set the quota parameters and prioritize resources

    • Order of priority This column is displayed only if more than one node pool exists. The default order in which the Scheduler uses node pools to schedule a workload. This means the Scheduler first tries to allocate resources using the highest priority node pool, then the next in priority, until it reaches the lowest priority node pool list, then the Scheduler starts from the highest again. The Scheduler uses the Project list of prioritized node pools, only if the order of priority of node pools is not set in the workload during submission, either by an admin policy or by the user. Empty value means the node pool is not part of the Project’s default node pool priority list, but a node pool can still be chosen by the admin policy or a user during workload submission

    • Node pool This column is displayed only if more than one node pool exists. It represents the name of the node pool

    • Under the QUOTA tab

      • Over-quota state Indicates if over-quota is enabled or disabled as set in the SCHEDULING PREFERENCES tab. If over-quota is set to None, then it is disabled.

      • GPU devices The number of GPUs you want to allocate for this project in this node pool (decimal number)

  • Select or enter the subject identifier:

    • User Email for a local user created in NVIDIA Run:ai or for SSO user as recognized by the IDP

    • Group name as recognized by the IDP

    • Application name as created in NVIDIA Run:ai

  • Select a role

  • Click SAVE RULE

  • Click CLOSE

  • Click CLOSE

    Parameter - The workload submission parameter that Rules and Defaults are applied to

  • Type (applicable for data sources only) - The data source type (Git, S3, nfs, pvc etc.)

  • Default - The default value of the Parameter

  • Rule - Set up constraints on workload policy fields

  • Source - The origin of the applied policy (cluster, department or project)

  • Project

    The name of the project

    Department

    The name of the parent department. Several projects may be grouped under a department.

    Status

    The Project creation status. Projects are manifested as Kubernetes namespaces. The project status represents the Namespace creation status.

    Node pool(s) with quota

    The node pools associated with the project. By default, a new project is associated with all node pools within its associated cluster. Administrators can change the node pools’ quota parameters for a project. Click the values under this column to view the list of node pools with their parameters (as described below)

    Subject(s)

    The users, SSO groups, or applications with access to the project. Click the values under this column to view the list of subjects with their parameters (as described below). This column is only viewable if your role in the NVIDIA Run:ai platform allows you those permissions.

    Allocated GPUs

    The total number of GPUs allocated by successfully scheduled workloads under this project

    Node pool

    The name of the node pool is given by the administrator during node pool creation. All clusters have a default node pool created automatically by the system and named ‘default’.

    GPU quota

    The amount of GPU quota the administrator dedicated to the project for this node pool (floating number, e.g. 2.3 means 230% of GPU capacity).

    CPU (Cores)

    The amount of CPUs (cores) quota the administrator has dedicated to the project for this node pool (floating number, e.g. 1.3 Cores = 1300 milli-cores). The ‘unlimited’ value means the CPU (Cores) quota is not bounded and workloads using this node pool can use as many CPU (Cores) resources as they require (if available).

    CPU memory

    The amount of CPU memory quota the administrator has dedicated to the project for this node pool (floating number, in MB or GB). The ‘unlimited’ value means the CPU memory quota is not bounded and workloads using this node pool can use as much CPU memory resource as they need (if available).

    Allocated GPUs

    The actual amount of GPUs allocated by workloads using this node pool under this project. The number of allocated GPUs may temporarily surpass the GPU quota if over quota is used.

    Allocated CPU (Cores)

    The actual amount of CPUs (cores) allocated by workloads using this node pool under this project. The number of allocated CPUs (cores) may temporarily surpass the CPUs (Cores) quota if over quota is used.

    Subject

    A user, SSO group, or application assigned with a role in the scope of this Project

    Type

    The type of subject assigned to the access rule (user, SSO group, or application)

    Scope

    The scope of this project in the organizational tree. Click the name of the scope to view the organizational tree diagram. You can only view the parts of the organizational tree that you have permission to view.

    Role

    The role assigned to the subject, in this project’s scope

    Authorized by

    The user who granted the access rule

    Last updated

    The last time the access rule was updated

    Workload

    The name of the workload, given during its submission. Optionally, an icon describing the type of workload is also visible

    Type

    The type of the workload, e.g. Workspace, Training, Inference

    Status

    The state of the workload and time elapsed since the last status change

    Created by

    The subject that created this workload

    Running/ requested pods

    The number of running pods out of the number of requested pods for this workload, e.g. a distributed workload requesting 4 pods may be in a state where only 2 are running and 2 are pending

    Creation time

    The date and time the workload was created

    The data sources table provides a list of all the data sources defined in the platform and allows you to manage them.

    Note

    Data & storage - with Data sources and Data volumes - is visible only if your Administrator has enabled Data volumes.

    The data sources table comprises the following columns:

    Column
    Description

    Data source

    The name of the data source

    Description

    A description of the data source

    Type

    The type of data source connected – e.g., S3 bucket, PVC, or others

    Status

    The different lifecycle and representation of the data source condition

    Scope

    The scope of the data source within the organizational tree. Click the scope name to view the organizational tree diagram

    Kubernetes name

    The data source’s unique Kubernetes name as it appears in the cluster

    Data Sources Status

    The following table describes the data sources' condition and whether they were created successfully for the selected scope.

    Status
    Description

    No issues found

    No issues were found while creating the data source

    Issues found

    Issues were found while propagating the data source credentials

    Issues found

    The data source couldn’t be created at the cluster

    Creating…

    The data source is being created

    No status / “-”

    When the data source’s scope is an account, the current version of the cluster is not up to date, or the asset is not a cluster-syncing entity, the status can’t be displayed

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    • Download table - Click MORE and then click ‘Download as CSV’. Export to CSV is limited to 20,000 rows.

    • Refresh - Click REFRESH to update the table with the latest data

    Adding a New Data Source

    To create a new data source:

    1. Click +NEW DATA SOURCE

    2. Select the data source type from the list. Follow the step-by-step guide for each data source type:

    NFS

    A Network File System (NFS) is a Kubernetes concept used for sharing storage in the cluster among different pods. Like a PVC, the NFS volume’s content remains preserved, even outside the lifecycle of a single pod. However, unlike PVCs, which abstract storage management, NFS provides a method for network-based file sharing. The NFS volume can be pre-populated with data and can be mounted by multiple pod writers simultaneously. At NVIDIA Run:ai, an NFS-type data source is an abstraction that is mapped directly to a Kubernetes NFS volume. This integration allows multiple workloads under various scopes to mount and present the NFS data source.

    1. Select the cluster under which to create this data source

    2. Select a scope

    3. Enter a name for the data source. The name must be unique.

    4. Optional: Provide a description of the data source

    5. Set the data origin

      • Enter the NFS server (host name or host IP)

      • Enter the NFS path

    6. Set the data target location

      • Container path

    7. Optional: Restrictions

      • Prevent data modification - When enabled, the data will be mounted with read-only permissions

    8. Click CREATE DATA SOURCE

    PVC

    A Persistent Volume Claim (PVC) is a Kubernetes concept used for managing storage in the cluster, which can be provisioned by an administrator or dynamically by Kubernetes using a StorageClass. PVCs allow users to request specific sizes and access modes (read/write once, read-only many). NVIDIA Run:ai ensures that data remains consistent and accessible across various scopes and workloads, beyond the lifecycle of individual pods, which is efficient while working with large datasets typically associated with AI projects.

    1. Select the cluster under which to create this data source

    2. Select a scope

    3. Enter a name for the data source. The name must be unique.

    4. Optional: Provide a description of the data source

    5. Select PVC:

      • Existing PVC

        This option is relevant when the purpose is to create a PVC-type data source based on an existing PVC in the cluster

        • Select a PVC from the list - (The list is empty if no existing PVCs were created in advance)

    6. Select the storage class

      • None - Proceed without defining a storage class

      • Custom storage class - This option applies when selecting a storage class based on existing storage classes.

        To add new storage classes to the storage class list, and for additional information, check Kubernetes storage classes

    7. Select the access mode(s) (multiple modes can be selected)

      • Read-write by one node - The volume can be mounted as read-write by a single node.

      • Read-only by many nodes - The volume can be mounted as read-only by many nodes.

      • Read-write by many nodes - The volume can be mounted as read-write by many nodes.

    8. Set the claim size and its units

    9. Select the volume mode

      1. File system (default) - allows the volume to be mounted as a filesystem, enabling the usage of directories and files.

      2. Block - exposes the volume as a block storage, which can be formatted or used by applications directly without a filesystem.

    10. Set the data target location

      • container path

    11. Optional: Prevent data modification - When enabled, the data will be mounted with read-only permission.

    12. Click CREATE DATA SOURCE

    After the data source is created, check its status to monitor its proper creation across the selected scope.

    S3 Bucket

    The S3 bucket data source enables the mapping of a remote S3 bucket into the workload’s file system. Similar to a PVC, this mapping remains accessible across different workload executions, extending beyond the lifecycle of individual pods. However, unlike PVCs, data stored in an S3 bucket resides remotely, which may lead to decreased performance during the execution of heavy machine learning workloads. As part of the NVIDIA Run:ai connection to the S3 bucket, you can create credentials in order to access and map private buckets.

    Note

    S3 data sources are not supported for custom inference workloads.

    1. Select the cluster under which to create this data source

    2. Select a scope

    3. Enter a name for the data source. The name must be unique.

    4. Optional: Provide a description of the data source

    After a private data source is created, check its status to monitor its proper creation across the selected scope.

    Git

    A Git-type data source is a NVIDIA Run:ai integration, that enables code to be copied from a Git branch into a dedicated folder in the container. It is mainly used to provide the workload with the latest code repository. As part of the integration with Git, in order to access private repositories, you can add predefined credentials to the data source mapping.

    1. Select the cluster under which to create this data source

    2. Select a scope

    3. Enter a name for the data source. The name must be unique.

    4. Optional: Provide a description of the data source

    5. Set the data origin

      • Set the Repository URL

      • Set the Revision (branch, tag, or hash)- If left empty, it will use the 'HEAD' (latest)

      • Select the credential

    6. Set the data target location

      • container path

    7. Click CREATE DATA SOURCE

    After a private data source is created, check its status to monitor its proper creation across the selected scope.

    Host path

    A Host path volume is a Kubernetes concept that enables mounting a host path file or a directory on the workload’s file system. Like a PVC, the host path volume’s data persists across workloads under various scopes. It also enables data serving from the hosting node.

    1. Select the cluster under which to create this data source

    2. Select a scope

    3. Enter a name for the data source. The name must be unique.

    4. Optional: Provide a description of the data source

    5. Set the data origin

      • host path

    6. Set the data target location

      • container path

    7. Optional: Prevent data modification - When enabled, the data will be mounted with read-only permissions.

    8. Click CREATE DATA SOURCE

    ConfigMap

    A ConfigMap data source is a NVIDIA Run:ai abstraction for the Kubernetes ConfigMap concept. The ConfigMap is used mainly for storage that can be mounted on the workload container for non-confidential data. It is usually represented in key-value pairs (e.g., environment variables, command-line arguments etc.). It allows you to decouple environment-specific system configurations from your container images, so that your applications are easily portable. ConfigMaps must be created on the cluster prior to being used within the NVIDIA Run:ai system.

    1. Select the cluster under which to create this data source

    2. Select a scope

    3. Enter a name for the data source. The name must be unique.

    4. Optional: Provide a description of the data source

    5. Set the data origin

      • Select the ConfigMap name (The list is empty if no existing ConfigMaps were created in advance).

    6. Set the data target location

      • container path

    7. Click CREATE DATA SOURCE

    Secret

    A secret-type data source enables the mapping of a credential into the workload’s file system. Credentials are a workload asset that simplify the complexities of Kubernetes Secrets. The credentials mask sensitive access information, such as passwords, tokens, and access keys, which are necessary for gaining access to various resources.

    1. Select the cluster under which to create this data source

    2. Select a scope

    3. Enter a name for the data source. The name must be unique.

    4. Optional: Provide a description of the data source

    5. Set the data origin

      • Select the credential

        To add new credentials, and for additional information, check the Credentials article.

    6. Set the data target location

      • container path

    7. Click CREATE DATA SOURCE

    After the data source is created, check its status to monitor its proper creation across the selected scope.

    Note

    It is also possible to add data sources directly when creating a specific workspace, training or inference workload.

    Copying a Data Source

    To copy an existing data source:

    1. Select the data source you want to copy

    2. Click MAKE A COPY

    3. Enter a name for the data source. The name must be unique.

    4. Update the data source and click CREATE DATA SOURCE

    Renaming a Data Source

    To rename an existing data source:

    1. Select the data source you want to rename

    2. Click Rename and edit the name/description

    Deleting a Data Source

    To delete a data source:

    1. Select the data source you want to delete

    2. Click DELETE

    3. Confirm you want to delete the data source

    Note

    It is not possible to delete a data source being used by an existing workload or template.

    Creating PVCs in Advance

    Add PVCs in advance to be used when creating a PVC-type data source via the NVIDIA Run:ai UI.

    The actions taken by the admin are based on the scope (cluster, department or project) that the admin wants for the PVC-type data source. Follow the steps below for each required scope:

    Cluster Scope

    1. Locate the PVC in the NVIDIA Run:ai namespace (runai)

    2. Provide NVIDIA Run:ai with visibility and authorization to share the PVC to your selected scope by implementing the following label: run.ai/cluster-wide: "true"

    The PVC is now displayed for that scope in the list of existing PVCs.

    Note

    This step is also relevant for creating the data source of type PVC via API
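    For example, a minimal kubectl sketch of step 2 above (my-shared-pvc is a placeholder PVC name):

    # Label a PVC in the runai namespace so NVIDIA Run:ai can share it cluster-wide
    # my-shared-pvc is a placeholder name
    kubectl label pvc my-shared-pvc -n runai run.ai/cluster-wide=true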

    Department Scope

    1. Locate the PVC in the NVIDIA Run:ai namespace (runai)

    2. To authorize NVIDIA Run:ai to use the PVC, label it: run.ai/department: "<department-id>"

    The PVC is now displayed for that scope in the list of existing PVCs.

    Project Scope

    Locate the PVC in the project’s namespace.

    The PVC is now displayed for that scope in the list of existing PVCs.

    Creating ConfigMaps in Advance

    Add ConfigMaps in advance to be used when creating a ConfigMap-type data source via the NVIDIA Run:ai UI.

    Cluster Scope

    1. Locate the ConfigMap in the NVIDIA Run:ai namespace (runai)

    2. To authorize NVIDIA Run:ai to use the ConfigMap, label it: run.ai/cluster-wide: "true"

    3. The ConfigMap must have a label of run.ai/resource: <resource-name>

    The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.
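    For example, a minimal kubectl sketch of steps 2 and 3 above (my-configmap and my-resource are placeholder names):

    # Authorize NVIDIA Run:ai to use the ConfigMap cluster-wide and tag it as a resource
    kubectl label configmap my-configmap -n runai run.ai/cluster-wide=true
    kubectl label configmap my-configmap -n runai run.ai/resource=my-resource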

    Department Scope

    1. Locate the ConfigMap in the NVIDIA Run:ai namespace (runai)

    2. To authorize NVIDIA Run:ai to use the ConfigMap, label it: run.ai/department: "<department-id>"

    3. The ConfigMap must have a label of run.ai/resource: <resource-name>

    The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.

    Project Scope

    1. Locate the ConfigMap in the project’s namespace

    2. The ConfigMap must have a label of run.ai/resource: <resource-name>

    The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.

    Using API

    To view the available actions, go to the Data sources API reference.


    Over Quota, Fairness and Preemption

    This quick start provides a step-by-step walkthrough of the core scheduling concepts - over quota, fairness, and preemption. It demonstrates the simplicity of resource provisioning and how the system eliminates bottlenecks by allowing users or teams to exceed their resource quota when free GPUs are available.

    • Over quota - In this scenario, team-a runs two training workloads and team-b runs one. Team-a has a quota of 3 GPUs and is over quota by 1 GPU, while team-b has a quota of 1 GPU. The system allows this over quota usage as long as there are available GPUs in the cluster.

    • Fairness and preemption - Since the cluster is already at full capacity, when team-b launches a new b2 workload requiring 1 GPU , team-a can no longer remain over quota. To maintain fairness, the NVIDIA Run:ai Scheduler preempts workload a1 (1 GPU), freeing up resources for team-b.

    Prerequisites

    • You have created two projects - team-a and team-b - or have them created for you.

    • Each project has an assigned quota of 2 GPUs.

    Note

    Flexible workload submission is disabled by default. If unavailable, your administrator must enable it under General Settings → Workloads → Flexible workload submission.

    Step 1: Logging In

    Step 2: Submitting the First Training Workload (team-a)

    1. Go to Workload manager → Workloads

    2. Click +NEW WORKLOAD and select Training

    3. Select under which cluster to create the workload

    4. Select the project named team-a

    Step 3: Submitting the Second Training Workload (team-a)

    1. Go to the Workload Manager → Workloads

    2. Click +NEW WORKLOAD and select Training

    3. Select the cluster where the previous training workload was created

    4. Select the project named team-a

    Step 4: Submitting the First Training Workload (team-b)

    1. Go to the Workload Manager → Workloads

    2. Click +NEW WORKLOAD and select Training

    3. Select the cluster where the previous training was created

    4. Select the project named team-b

    Over Quota Status

    System status after run:

    System status after run:

    System status after run:

    System status after run:

    Step 5: Submitting the Second Training Workload (team-b)

    1. Go to the Workload Manager → Workloads

    2. Click +NEW WORKLOAD and select Training

    3. Select the cluster where the previous training was created

    4. Select the project named team-b

    Basic Fairness and Preemption Status

    Workloads status after run:

    Workloads status after run:

    Workloads status after run:

    Workloads status after run:

    Next Steps

    Manage and monitor your newly created workload using the Workloads table.

  • CPUs (Cores) This column is displayed only if CPU quota is enabled via the General settings. Represents the number of CPU cores you want to allocate for this project in this node pool (decimal number).
  • CPU memory This column is displayed only if CPU quota is enabled via the General settings. Represents the amount of CPU memory you want to allocate for this project in this node pool (in Megabytes or Gigabytes).

  • Under the SCHEDULING PREFERENCES tab

    • Project priority Sets the project's scheduling priority compared to other projects in the same node pool, using one of the following priorities:

      • Highest - 255

      • VeryHigh - 240

      • High - 210

      • MediumHigh - 180

      • Medium - 150

      • MediumLow - 100

      • Low - 50

      • VeryLow - 20

      • Lowest - 1

      For v2.21, the default value is MediumLow. All Projects are set with the same default value, therefore there is no change of scheduling behavior unless the Administrator changes any Project priority values. To learn more about Project priority, see The NVIDIA Run:ai Scheduler: concepts and principles.

    • Over-quota If over quota weight is enabled via the General settings, then over quota weight is presented, otherwise over quota is presented

      • Over-quota When enabled, the project can use non-guaranteed overage resources above its quota in this node pool. The amount of the non-guaranteed overage resources for this project is calculated proportionally to the project quota in this node pool. When disabled, the project cannot use more resources than the guaranteed quota in this node pool.

      • Over quota weight Represents a weight used to calculate the amount of non-guaranteed overage resources a project can get on top of its quota in this node pool. All unused resources are split between projects that require the use of overage resources:

    • Project max. GPU device allocation Represents the maximum GPU device allocation the project can get from this node pool - the maximum sum of quota and over-quota GPUs (decimal number)

  • New PVC - creates a new PVC in the cluster. New PVCs are not added to the Existing PVCs list.

    When creating a PVC-type data source and selecting the ‘New PVC’ option, the PVC is immediately created in the cluster (even if no workload has requested this PVC).

    Set the data origin

    • Set the S3 service URL

    • Select the credential

      • None - for public buckets

      • Credential names - This option is relevant for private buckets based on existing credentials that were created for the scope.

        To add new credentials to the credentials list, and for additional information, check the Credentials article.

    • Enter the bucket name

  • Set the data target location

    • container path

  • Click CREATE DATA SOURCE

  • None - for public repositories

  • Credential names - This option applies to private repositories based on existing credentials that were created for the scope.

    To add new credentials to the credentials list, and for additional information, check the Credentials article.

  • Workload(s)

    The list of existing workloads that use the data source

    Template(s)

    The list of workload templates that use the data source

    Created by

    The user who created the data source

    Creation time

    The timestamp for when the data source was created

    Cluster

    The cluster that the data source is associated with


    Under Workload architecture, select Standard

  • Select Start from scratch to launch a new training quickly

  • Enter a1 as the workload name

  • Under Submission, select Flexible and click CONTINUE

  • Under Environment, enter the Image URL - runai.jfrog.io/demo/quickstart

  • Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘one-gpu’ compute resource for your workload.

    • If ‘one-gpu’ is not displayed, follow the below steps to create a one-time compute resource configuration:

      • Set GPU devices per pod - 1

      • Set GPU memory per device

        • Select % (of device) - Fraction of a GPU device's memory

        • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

  • Click CREATE TRAINING

    1. Go to the Workload Manager → Workloads

    2. Click +NEW WORKLOAD and select Training

    3. Select under which cluster to create the workload

    4. Select the project named team-a

    5. Under Workload architecture, select Standard

    6. Select Start from scratch to launch a new training quickly

    7. Enter a1 as the workload name

    8. Under Submission, select Original and click CONTINUE

    9. Create a new environment:

      • Click +NEW ENVIRONMENT

      • Enter quick-start as the name for the environment. The name must be unique.

      • Enter the Image URL - runai.jfrog.io/demo/quickstart

    10. Select the ‘one-gpu’ compute resource for your workload

      • If ‘one-gpu’ is not displayed in the gallery, follow the below steps:

        • Click +NEW COMPUTE RESOURCE

        • Enter one-gpu as the name for the compute resource. The name must be unique.

    11. Click CREATE TRAINING

    Copy the following command to your terminal. For more details, see CLI reference:

    Copy the following command to your terminal. For more details, see CLI reference:

    Copy the following command to your terminal. Make sure to update the following parameters. For more details, see Trainings API.

    • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

    • <TOKEN> - The API access token obtained in Step 1

    • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

    • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

    Note

    The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

  • Under Workload architecture, select Standard

  • Select Start from scratch to launch a new training quickly

  • Enter a2 as the workload name

  • Under Submission, select Flexible and click CONTINUE

  • Under Environment, enter the Image URL - runai.jfrog.io/demo/quickstart

  • Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘two-gpus’ compute resource for your workload.

    • If ‘two-gpus’ is not displayed, follow the below steps to create a one-time compute resource configuration:

      • Set GPU devices per pod - 2

      • Set GPU memory per device

        • Select % (of device) - Fraction of a GPU device's memory

        • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

  • Click CREATE TRAINING

    1. Go to the Workload Manager → Workloads

    2. Click +NEW WORKLOAD and select Training

    3. Select the cluster where the previous training workload was created

    4. Select the project named team-a

    5. Under Workload architecture, select Standard

    6. Select Start from scratch to launch a new training quickly

    7. Enter a2 as the workload name

    8. Under Submission, select Original and click CONTINUE

    9. Select the environment created in Step 2

    10. Select the ‘two-gpus’ compute resource for your workload

      • If ‘two-gpus’ is not displayed in the gallery, follow the below steps:

        • Click +NEW COMPUTE RESOURCE

        • Enter two-gpus as the name for the compute resource. The name must be unique.

    11. Click CREATE TRAINING

    Copy the following command to your terminal. For more details, see CLI reference:

    Copy the following command to your terminal. For more details, see CLI reference:

    Copy the following command to your terminal. Make sure to update the following parameters. For more details, see Trainings API.

    • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

    • <TOKEN> - The API access token obtained in Step 1

    • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

    • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

    Note

    The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

    Under Workload architecture, select Standard

  • Select Start from scratch to launch a new training quickly

  • Enter b1 as the workload name

  • Under Submission, select Flexible and click CONTINUE

  • Under Environment, enter the Image URL - runai.jfrog.io/demo/quickstart

  • Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘one-gpu’ compute resource for your workload.

    • If ‘one-gpu’ is not displayed, follow the below steps to create a one-time compute resource configuration:

      • Set GPU devices per pod - 1

      • Set GPU memory per device

        • Select % (of device) - Fraction of a GPU device's memory

        • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

  • Click CREATE TRAINING

    1. Go to the Workload Manager → Workloads

    2. Click +NEW WORKLOAD and select Training

    3. Select the cluster where the previous training was created

    4. Select the project named team-b

    5. Under Workload architecture, select Standard

    6. Select Start from scratch to launch a new training quickly

    7. Enter b1 as the workload name

    8. Under Submission, select Original and click CONTINUE

    9. Create a new environment:

      • Click +NEW ENVIRONMENT

      • Enter quick-start as the name for the environment. The name must be unique.

      • Enter the Image URL - runai.jfrog.io/demo/quickstart

    10. Select the ‘one-gpu’ compute resource for your workload

      • If ‘one-gpu’ is not displayed in the gallery, follow the below steps:

        • Click +NEW COMPUTE RESOURCE

        • Enter one-gpu as the name for the compute resource. The name must be unique.

    11. Click CREATE TRAINING

    Copy the following command to your terminal. For more details, see CLI reference:

    Copy the following command to your terminal. For more details, see CLI reference:

    Copy the following command to your terminal. Make sure to update the following parameters. For more details, see Trainings API.

    • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

    • <TOKEN> - The API access token obtained in Step 1

    • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

    • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

    Note

    The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

    Under Workload architecture, select Standard

  • Select Start from scratch to launch a new training quickly

  • Enter b2 as the workload name

  • Under Submission, select Flexible and click CONTINUE

  • Under Environment, enter the Image URL - runai.jfrog.io/demo/quickstart

  • Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘one-gpu’ compute resource for your workload.

    • If ‘one-gpu’ is not displayed, follow the below steps to create a one-time compute resource configuration:

      • Set GPU devices per pod - 1

      • Set GPU memory per device

        • Select % (of device) - Fraction of a GPU device's memory

        • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

  • Click CREATE TRAINING

    1. Go to the Workload Manager → Workloads

    2. Click +NEW WORKLOAD and select Training

    3. Select the cluster where the previous training was created

    4. Select the project named team-b

    5. Under Workload architecture, select Standard

    6. Select Start from scratch to launch a new training quickly

    7. Enter b2 as the workload name

    8. Under Submission, select Original and click CONTINUE

    9. Select the environment created in Step 4

    10. Select the compute resource created in Step 4

    11. Click CREATE TRAINING

    Copy the following command to your terminal. For more details, see CLI reference:

    Copy the following command to your terminal. For more details, see CLI reference:

    Copy the following command to your terminal. Make sure to update the following parameters. For more details, see Trainings API.

    • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

    • <TOKEN> - The API access token obtained in Step 1

    • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

    • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

    Note

    The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.


    Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

    Run the below --help command to obtain the login options and log in according to your setup:

    runai login --help

    Log in using the following command. You will be prompted to enter your username and password:

    runai login

    To use the API, you will need to obtain a token as shown in API authentication.

    Cluster System Requirements

    The NVIDIA Run:ai cluster is a Kubernetes application. This section explains the required hardware and software system requirements for the NVIDIA Run:ai cluster.

    The system requirements needed depend on where the control plane and cluster are installed. The following applies for Kubernetes only:

    • If you are installing the first cluster and control plane on the same Kubernetes cluster, the Kubernetes Ingress Controller and Fully Qualified Domain Name are not required.

    • If you are installing the first cluster and control plane on separate Kubernetes clusters, the Kubernetes Ingress Controller and Fully Qualified Domain Name are required.

    runai training submit a1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-a
    runai submit a1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-a
    curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer <TOKEN>' \ 
    --data '{
      "name": "a1",
      "projectId": "<PROJECT-ID>", 
      "clusterId": "<CLUSTER-UUID>",
      "spec": {
        "image":"runai.jfrog.io/demo/quickstart",
        "compute": {
          "gpuDevicesRequest": 1
        }
      }
    }'
    runai training submit a2 -i runai.jfrog.io/demo/quickstart -g 2 -p team-a
    runai submit a2 -i runai.jfrog.io/demo/quickstart -g 2 -p team-a
    curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer <TOKEN>' \ 
    --data '{
      "name": "a2",
      "projectId": "<PROJECT-ID>", 
      "clusterId": "<CLUSTER-UUID>",
      "spec": {
        "image":"runai.jfrog.io/demo/quickstart",
        "compute": {
          "gpuDevicesRequest": 2
        }
      }
    }'
    runai training submit b1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
    runai submit b1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
    curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer <TOKEN>' \
    --data '{
      "name": "b1",
      "projectId": "<PROJECT-ID>", 
      "clusterId": "<CLUSTER-UUID>",
      "spec": {
        "image":"runai.jfrog.io/demo/quickstart",
        "compute": {
          "gpuDevicesRequest": 1
        }
      }
    }'
    runai training submit b2 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
    runai submit b2 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
    curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer <TOKEN>' \ 
    --data '{
      "name": "b2",
      "projectId": "<PROJECT-ID>", 
      "clusterId": "<CLUSTER-UUID>",
      "spec": {
        "image":"runai.jfrog.io/demo/quickstart",
        "compute": {
          "gpuDevicesRequest": 1
        }
      }
    }'
    ~ runai workload list -A
    Workload  Type      Status   Project  Running/Req.Pods  GPU Alloc.
    ────────────────────────────────────────────────────────────────────────────
    a2       Training   Running   team-a        1/1           2.00
    b1       Training   Running   team-b        1/1           1.00
    a1       Training   Running   team-a        0/1           1.00
    ~ runai list -A
    Workload  Type      Status   Project  Running/Req.Pods  GPU Alloc.
    ────────────────────────────────────────────────────────────────────────────
    a2       Training   Running   team-a        1/1           2.00
    b1       Training   Running   team-b        1/1           1.00
    a1       Training   Running   team-a        0/1           1.00
    # <TOKEN> is the API access token obtained in Step 1
    curl --location 'https://<COMPANY-URL>/api/v1/workloads' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer <TOKEN>' \
    --data ''
    ~ runai workload list -A
    Workload  Type      Status   Project  Running/Req.Pods  GPU Alloc.
    ────────────────────────────────────────────────────────────────────────────
    a2       Training   Running   team-a        1/1           2.00
    b1       Training   Running   team-b        1/1           1.00
    b2       Training   Running   team-b        1/1           1.00
    a1       Training   Pending   team-a        0/1           1.00
    ~ runai list -A
    Workload   Type     Status   Project  Running/Req.Pods  GPU Alloc.
    ────────────────────────────────────────────────────────────────────────────
    a2       Training   Running   team-a        1/1           2.00
    b1       Training   Running   team-b        1/1           1.00
    b2       Training   Running   team-b        1/1           1.00
    a1       Training   Pending   team-a        0/1           1.00
    # <TOKEN> is the API access token obtained in Step 1
    curl --location 'https://<COMPANY-URL>/api/v1/workloads' \
    --header 'Content-Type: application/json' \
    --header 'Authorization: Bearer <TOKEN>' \
    --data ''

  • Medium The default value. The Administrator can change the default to any of the following values - High, Low, Lowest, or None.

  • Lowest Over quota weight ‘Lowest’ has a unique behavior since it can only use over-quota (unused overage) resources if no other project needs them. Any project with a higher over quota weight can take those overage resources at any time.

  • None When set, the project cannot use more resources than the guaranteed quota in this node pool

  • Unlimited CPU (Cores) and CPU memory quotas are an exception. In this case, workloads of subordinated projects can consume available resources up to the physical limitation of the cluster or any of the node pools.

  • Click CREATE ENVIRONMENT

  • The newly created environment will be selected automatically

  • Set GPU devices per pod - 1

  • Set GPU memory per device

    • Select % (of device) - Fraction of a GPU device's memory

    • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

  • Optional: set the CPU compute per pod - 0.1 cores (default)

  • Optional: set the CPU memory per pod - 100 MB (default)

  • Click CREATE COMPUTE RESOURCE

  • The newly created compute resource will be selected automatically

  • Set GPU devices per pod - 2

  • Set GPU memory per device

    • Select % (of device) - Fraction of a GPU device's memory

    • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

  • Optional: set the CPU compute per pod - 0.1 cores (default)

  • Optional: set the CPU memory per pod - 100 MB (default)

  • Click CREATE COMPUTE RESOURCE

  • The newly created compute resource will be selected automatically

  • Click CREATE ENVIRONMENT

  • The newly created environment will be selected automatically

  • Set GPU devices per pod - 1

  • Set GPU memory per device

    • Select % (of device) - Fraction of a GPU device's memory

    • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

  • Optional: set the CPU compute per pod - 0.1 cores (default)

  • Optional: set the CPU memory per pod - 100 MB (default)

  • Click CREATE COMPUTE RESOURCE

  • The newly created compute resource will be selected automatically

    Hardware Requirements

    The following hardware requirements are for the Kubernetes cluster nodes. By default, all NVIDIA Run:ai cluster services run on all available nodes. For production deployments, you may want to set node roles, to separate between system and worker nodes, reduce downtime and save CPU cycles on expensive GPU Machines.

    Architecture

    • x86 - Supported for both Kubernetes and OpenShift deployments.

    • ARM - Supported for Kubernetes only. ARM is currently not supported for OpenShift.

    NVIDIA Run:ai Cluster - System Nodes

    This configuration is the minimum requirement you need to install and use the NVIDIA Run:ai cluster.

    Component
    Required Capacity

    CPU

    10 cores

    Memory

    20GB

    Disk space

    50GB

    Note

    To designate nodes to NVIDIA Run:ai system services, follow the instructions as described in System nodes.

    NVIDIA Run:ai Cluster - Worker Nodes

    The NVIDIA Run:ai cluster supports x86 and ARM CPUs, and any NVIDIA GPU supported by the NVIDIA GPU Operator. GPU compatibility depends on the version of the NVIDIA GPU Operator installed in the cluster. NVIDIA Run:ai supports GPU Operator versions 22.9 to 25.3. For the list of supported GPU models, see Supported NVIDIA Data Center GPUs and Systems. To install the GPU Operator, see NVIDIA GPU Operator.

    The following configuration represents the minimum hardware requirements for installing and operating the NVIDIA Run:ai cluster on worker nodes. Each node must meet these specifications:

    Component
    Required Capacity

    CPU

    2 cores

    Memory

    4GB

    Note

    To designate nodes to NVIDIA Run:ai workloads, follow the instructions as described in Worker nodes.

    Shared Storage

    NVIDIA Run:ai workloads must be able to access data from any worker node in a uniform way, to access training data and code as well as save checkpoints, weights, and other machine-learning-related artifacts.

    Typical protocols are Network File Storage (NFS) or Network-attached storage (NAS). NVIDIA Run:ai cluster supports both, for more information see Shared storage.

    Software Requirements

    The following software requirements must be fulfilled on the Kubernetes cluster.

    Operating System

    • Any Linux operating system supported by both Kubernetes and NVIDIA GPU Operator

    • NVIDIA Run:ai cluster on Google Kubernetes Engine (GKE) supports both Ubuntu and Container Optimized OS (COS). COS is supported only with NVIDIA GPU Operator 24.6 or newer, and NVIDIA Run:ai cluster version 2.19 or newer.

    • NVIDIA Run:ai cluster on Elastic Kubernetes Service (EKS) does not support Bottlerocket or Amazon Linux.

    • NVIDIA Run:ai cluster on Oracle Kubernetes Engine (OKE) supports only Ubuntu.

    • Internal tests are being performed on Ubuntu 22.04 and CoreOS for OpenShift.

    Kubernetes Distribution

    NVIDIA Run:ai cluster requires Kubernetes. The following Kubernetes distributions are supported:

    • Vanilla Kubernetes

    • OpenShift Container Platform (OCP)

    • NVIDIA Base Command Manager (BCM)

    • Elastic Kubernetes Engine (EKS)

    • Google Kubernetes Engine (GKE)

    • Azure Kubernetes Service (AKS)

    • Oracle Kubernetes Engine (OKE)

    • Rancher Kubernetes Engine (RKE1)

    • Rancher Kubernetes Engine 2 (RKE2)

    Note

    • The latest release of the NVIDIA Run:ai cluster supports Kubernetes 1.30 to 1.32 and OpenShift 4.14 to 4.18.

    • For Multi-Node NVLink support (e.g. GB200), Kubernetes 1.32 and above is required.

    For existing Kubernetes clusters, see the following Kubernetes version support matrix for the latest NVIDIA Run:ai cluster releases:

    NVIDIA Run:ai version
    Supported Kubernetes versions
    Supported OpenShift versions

    v2.17

    1.27 to 1.29

    4.12 to 4.15

    v2.18

    1.28 to 1.30

    4.12 to 4.16

    v2.19

    1.28 to 1.31

    4.12 to 4.17

    v2.20

    1.29 to 1.32

    For information on supported versions of managed Kubernetes, it's important to consult the release notes provided by your Kubernetes service provider. There, you can confirm the specific version of the underlying Kubernetes platform supported by the provider, ensuring compatibility with NVIDIA Run:ai. For an up-to-date end-of-life statement see Kubernetes Release History or OpenShift Container Platform Life Cycle Policy.

    Container Runtime

    NVIDIA Run:ai supports the following container runtimes. Make sure your Kubernetes cluster is configured with one of these runtimes:

    • Containerd (default in Kubernetes)

    • CRI-O (default in OpenShift)

    Kubernetes Pod Security Admission

    NVIDIA Run:ai supports restricted policy for Pod Security Admission (PSA) on OpenShift only. Other Kubernetes distributions are only supported with privileged policy.

    For NVIDIA Run:ai on OpenShift to run with PSA restricted policy:

    • Label the runai namespace as described in Pod Security Admission with the restricted policy labels (see the example after this list).

    • The workloads submitted through NVIDIA Run:ai should comply with the restrictions of PSA restricted policy. This can be enforced using Policies.
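    A minimal sketch of the namespace labeling, using the standard Kubernetes Pod Security Admission labels (the exact set of labels required may differ; verify against the Pod Security Admission documentation):

    # Apply the restricted Pod Security Standard to the runai namespace
    kubectl label namespace runai \
      pod-security.kubernetes.io/enforce=restricted \
      pod-security.kubernetes.io/audit=restricted \
      pod-security.kubernetes.io/warn=restricted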

    NVIDIA Run:ai Namespace

    The NVIDIA Run:ai cluster must be installed in a namespace or project (OpenShift) called runai. Use the following to create the namespace/project:
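    For example:

    # Kubernetes
    kubectl create namespace runai

    # OpenShift
    oc new-project runai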

    Kubernetes Ingress Controller

    NVIDIA Run:ai cluster requires Kubernetes Ingress Controller to be installed on the Kubernetes cluster.

    • OpenShift, RKE and RKE2 come with a pre-installed ingress controller.

    • Internal tests are being performed on NGINX, Rancher NGINX, OpenShift Router, and Istio.

    • Make sure that a default ingress controller is set.

    There are many ways to install and configure different ingress controllers. A simple example to install and configure NGINX ingress controller using helm:

    Vanilla Kubernetes

    Run the following commands:

    • For cloud deployments, both the internal IP and external IP are required.

    • For on-prem deployments, only the external IP is needed.
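    A minimal sketch using the community ingress-nginx Helm chart (the release name, namespace, and default-class flag below are illustrative; provider-specific service annotations for managed Kubernetes or OKE are omitted and should be added per your environment):

    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo update
    helm install nginx-ingress ingress-nginx/ingress-nginx \
      --namespace nginx-ingress --create-namespace \
      --set controller.ingressClassResource.default=true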

    Managed Kubernetes (EKS, GKE, AKS)

    Run the following commands:

    Oracle Kubernetes Engine (OKE)

    Run the following commands:

    Fully Qualified Domain Name (FQDN)

    Note

    Fully Qualified Domain Name applies for Kubernetes only.

    You must have a Fully Qualified Domain Name (FQDN) to install the NVIDIA Run:ai cluster (ex: runai.mycorp.local). This cannot be an IP. The domain name must be accessible inside the organization's private network.

    Wildcard FQDN for Inference (Optional)

    In order to make inference serving endpoints available externally to the cluster, configure a wildcard DNS record (*.runai-inference.mycorp.local) that resolves to the cluster’s public IP address, or to the cluster's load balancer IP address in on-prem environments. This ensures each inference workload receives a unique subdomain under the wildcard domain.

    TLS Certificate

    Kubernetes

    You must have a TLS certificate that is associated with the FQDN for HTTPS access. Create a Kubernetes Secret named runai-cluster-domain-tls-secret in the runai namespace and include the path to the TLS --cert and its corresponding private --key by running the following:
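    For example, a minimal sketch assuming the certificate and key are stored at placeholder paths:

    kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
      --cert /path/to/fullchain.pem \
      --key /path/to/private.pem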

    OpenShift

    NVIDIA Run:ai uses the OpenShift default Ingress router for serving. The TLS certificate configured for this router must be issued by a trusted CA. For more details, see the OpenShift documentation on configuring certificates.

    Wildcard TLS Certificate - Inference

    Note

    The following instructions apply only to Kubernetes. Instructions for configuring a TLS certificate for Inference on OpenShift will be available in a future release.

    For serving inference endpoints over HTTPS, NVIDIA Run:ai requires a dedicated wildcard TLS certificate that matches the fully qualified domain name (FQDN) used for inference. This certificate ensures secure external access to inference workloads.

    Local Certificate Authority

    A local certificate authority serves as the root certificate for organizations that cannot use publicly trusted certificate authority. Follow the below steps to configure the local certificate authority.

    In air-gapped environments, you must configure and install the local CA's public key in the Kubernetes cluster. This is required for the installation to succeed:

    1. Add the public key to the required namespace (see the sketch below).

    2. When installing the cluster, make sure the following flag is added to the helm command --set global.customCA.enabled=true. See Install cluster.
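    A minimal sketch of step 1, assuming the CA bundle is stored as a generic secret named runai-ca-cert with the key runai-ca.pem (the secret name, key, and file path are assumptions; verify the exact names required by your NVIDIA Run:ai version):

    # Assumed secret name and key; adjust per the NVIDIA Run:ai installation documentation
    kubectl -n runai create secret generic runai-ca-cert \
      --from-file=runai-ca.pem=/path/to/ca-bundle.pem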

    Note

    When using a custom CA, sidecar containers used for S3 or Git integrations do not automatically inherit the CA configured at the cluster level. See Git and S3 sidecar containers for more details.

    NVIDIA GPU Operator

    NVIDIA Run:ai cluster requires NVIDIA GPU Operator to be installed on the Kubernetes cluster. GPU Operator versions 22.9 to 25.3 are supported.

    Note

    For Multi-Node NVLink support (e.g. GB200), GPU Operator 25.3 and above is required.

    For air-gapped installation, follow the instructions in Install NVIDIA GPU Operator in Air-Gapped Environments.

    See Installing the NVIDIA GPU Operator, followed by notes below:

    • Use the default gpu-operator namespace. Otherwise, you must specify the target namespace using the flag runai-operator.config.nvidiaDcgmExporter.namespace as described in customized cluster installation.

    • NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flag --set driver.enabled=false. DGX OS is one such example as it comes bundled with NVIDIA drivers.

    • For distribution-specific additional instructions see below:

    OpenShift Container Platform (OCP)

    The Node Feature Discovery (NFD) Operator is a prerequisite for the NVIDIA GPU Operator in OpenShift. Install the NFD Operator using the Red Hat OperatorHub catalog in the OpenShift Container Platform web console. For more information, see Installing the Node Feature Discovery (NFD) Operator.

    Elastic Kubernetes Service (EKS)
    • When setting-up the cluster, do not install the NVIDIA device plug-in (we want the NVIDIA GPU Operator to install it instead).

    • When using the eksctl tool to create a cluster, use the flag --install-nvidia-plugin=false to disable the installation.

    For GPU nodes, EKS uses an AMI which already contains the NVIDIA drivers. As such, you must use the GPU Operator flags: --set driver.enabled=false.
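    A minimal sketch of installing the GPU Operator on EKS with the operator's driver installation disabled, since the AMI already includes the NVIDIA drivers (chart values beyond driver.enabled are environment-specific):

    helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
    helm repo update
    helm install gpu-operator nvidia/gpu-operator \
      -n gpu-operator --create-namespace \
      --set driver.enabled=false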

    Google Kubernetes Engine (GKE)

    Before installing the GPU Operator:

    1. Create the gpu-operator namespace by running:

    2. Create the following file:

    3. Run:

    The three steps are combined in the sketch below.
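    A sketch of the three GKE preparation steps above, based on the NVIDIA GPU Operator installation instructions for GKE (the ResourceQuota below is an assumption taken from those instructions; verify it against the GPU Operator documentation for your version). First, create the namespace:

    kubectl create ns gpu-operator

    Then save the following as resourcequota.yaml:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: gpu-operator-quota
      namespace: gpu-operator
    spec:
      hard:
        pods: 100
      scopeSelector:
        matchExpressions:
        - operator: In
          scopeName: PriorityClass
          values:
          - system-node-critical
          - system-cluster-critical

    Finally, apply it:

    kubectl apply -f resourcequota.yaml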

    Rancher Kubernetes Engine 2 (RKE2)

    Make sure to specify the CONTAINERD_CONFIG option exactly as outlined in the documentation and custom configuration guide, using the path /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl. Do not create the file manually if it does not already exist. The GPU Operator will handle this configuration during deployment.
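    One common way to pass this option is through the GPU Operator Helm values for the container toolkit. The values below follow the RKE2 settings typically shown in the GPU Operator documentation and should be confirmed against the custom configuration guide for your version:

    helm install gpu-operator nvidia/gpu-operator \
      -n gpu-operator --create-namespace \
      --set toolkit.env[0].name=CONTAINERD_CONFIG \
      --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl \
      --set toolkit.env[1].name=CONTAINERD_SOCKET \
      --set toolkit.env[1].value=/run/k3s/containerd/containerd.sock \
      --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
      --set toolkit.env[2].value=nvidia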

    Oracle Kubernetes Engine (OKE)
    • During cluster setup, create a nodepool, and set initial_node_labels to include oci.oraclecloud.com/disable-gpu-device-plugin=true which disables the NVIDIA GPU device plugin.

    • For GPU nodes, OKE defaults to Oracle Linux, which is incompatible with NVIDIA drivers. To resolve this, use a custom Ubuntu image instead.

    For troubleshooting information, see the NVIDIA GPU Operator Troubleshooting Guide.

    NVIDIA Network Operator

    When deploying on clusters with RDMA or Multi Node NVLink‑capable nodes (e.g. B200, GB200), the NVIDIA Network Operator is required to enable high-performance networking features such as GPUDirect RDMA in Kubernetes. Network Operator versions v24.4 and above are supported.

    The Network Operator works alongside the NVIDIA GPU Operator to provide:

    • NVIDIA networking drivers for advanced network capabilities.

    • Kubernetes device plugins to expose high‑speed network hardware to workloads.

    • Secondary network components to support network‑intensive applications.

    The Network Operator must be installed and configured as follows:

    1. Install the network operator as detailed in Network Operator Deployment on Vanilla Kubernetes Cluster.

    2. Configure SR-IOV InfiniBand support as detailed in Network Operator Deployment with an SR-IOV InfiniBand Network.

    For air-gapped installation, follow the instructions in Network Operator Deployment in an Air-gapped Environment.

    NVIDIA Dynamic Resource Allocation (DRA) Driver

    When deploying on clusters with Multi-Node NVLink (e.g. GB200), the NVIDIA DRA driver is essential to enable Dynamic Resource Allocation at the Kubernetes level. To install, follow the instructions in Configure and Helm-install the driver.

    After installation, update runaiconfig using the GPUNetworkAccelerationEnabled=True flag to enable GPU network acceleration. This triggers an update of the NVIDIA Run:ai workload-controller deployment and restarts the controller. See Advanced cluster configurations for more details.

    Note

    For air-gapped installation, contact NVIDIA Run:ai support.

    Prometheus

    Note

    Installing Prometheus applies for Kubernetes only.

    NVIDIA Run:ai cluster requires Prometheus to be installed on the Kubernetes cluster.

    • OpenShift comes pre-installed with Prometheus

    • For RKE2 see Enable Monitoring instructions to install Prometheus

    There are many ways to install Prometheus. A simple example is to install the community Kube-Prometheus Stack using Helm by running the following commands:
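    A minimal sketch using the community kube-prometheus-stack chart (the release name and monitoring namespace are illustrative choices):

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install prometheus prometheus-community/kube-prometheus-stack \
      -n monitoring --create-namespace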

    Additional Software Requirements

    Additional NVIDIA Run:ai capabilities, Distributed Training and Inference require additional Kubernetes applications (frameworks) to be installed on the cluster.

    Distributed Training

    Distributed training enables training of AI models over multiple nodes. This requires installing a distributed training framework on the cluster. The following frameworks are supported:

    • TensorFlow

    • PyTorch

    • XGBoost

    • MPI v2

    There are several ways to install each framework. A simple installation method is the Kubeflow Training Operator, which includes TensorFlow, PyTorch, XGBoost and JAX.

    It is recommended to use Kubeflow Training Operator v1.9.2, and MPI Operator v0.6.0 or later for compatibility with advanced workload capabilities, such as Stopping a workload and Scheduling rules.

    • To install the Kubeflow Training Operator for TensorFlow, PyTorch, XGBoost and JAX frameworks, run the following command:

    • To install the MPI Operator for MPI v2, run the following command:

    Note

    If you require both the MPI Operator and Kubeflow Training Operator, follow the steps below:

    • Install the Kubeflow Training Operator as described above.

    • Disable and delete MPI v1 in the Kubeflow Training Operator by running:

    • Install the MPI Operator as described above.

    Inference

    Inference enables serving of AI models. This requires the Knative Serving framework to be installed on the cluster; Knative versions 1.11 to 1.16 are supported. Follow the Installing Knative instructions or run:

    Once installed, follow the steps below:

    1. Create the knative-serving namespace:

    2. Create a YAML file named knative-serving.yaml and replace the placeholder FQDN with your wildcard inference domain (for example, runai-inference.mycorp.local):

    3. Apply the changes:

    4. Configure NGINX to proxy requests to Kourier / Knative and handle TLS termination using the wildcard certificate. Create a YAML file named knative-ingress.yaml and replace the FQDN placeholders with your wildcard inference domain:

    5. Apply the changes:

    Knative Autoscaling

    NVIDIA Run:ai allows for autoscaling a deployment according to the following metrics:

    • Latency (milliseconds)

    • Throughput (requests/sec)

    • Concurrency (requests)

    Using a custom metric (for example, Latency) requires installing the Kubernetes Horizontal Pod Autoscaler (HPA). Use the following command to install it. Make sure to replace {VERSION} in the command below with a supported Knative version.


    NVIDIA Run:ai System Monitoring

    This section explains how to configure NVIDIA Run:ai to generate health alerts and to connect these alerts to alert-management systems within your organization. Alerts are generated for NVIDIA Run:ai clusters.

    Alert Infrastructure

    NVIDIA Run:ai uses Prometheus for externalizing metrics and providing visibility to end-users. The NVIDIA Run:ai cluster installation either includes Prometheus or can connect to an existing Prometheus instance used in your organization. The alerts are based on the Prometheus AlertManager, which is enabled by default once installed.

    This document explains how to:

    • Configure alert destinations - triggered alerts send data to specified destinations

    • Understand the out-of-the-box cluster alerts, provided by NVIDIA Run:ai

    • Add additional custom alerts

    Prerequisites

    • A Kubernetes cluster with the necessary permissions

    • An up-and-running NVIDIA Run:ai environment, including the Prometheus Operator

    • The kubectl command-line tool installed and configured to interact with the cluster

    Setup

    Use the steps below to set up monitoring alerts.

    Validating Prometheus Operator Installed

    1. Verify that the Prometheus Operator Deployment is running. Copy the following command and paste it in your terminal, where you have access to the Kubernetes cluster. In your terminal, you can see an output indicating the deployment's status, including the number of replicas and their current state.

    2. Verify that Prometheus instances are running. Copy the following command and paste it in your terminal. You can see the Prometheus instance(s) listed along with their status:

    Enabling Prometheus AlertManager

    In each of the steps in this section, copy the content of the code snippet to a new YAML file (e.g., step1.yaml).

    1. Copy the following command to your terminal, to apply the YAML file to the cluster:

    2. Copy the following command to your terminal to create the AlertManager CustomResource, to enable AlertManager:

    3. Copy the following command to your terminal to validate that the AlertManager instance has started:

    4. Copy the following command to your terminal to validate that the Prometheus operator has created a Service for AlertManager:

    Configuring Prometheus to Send Alerts

    1. Open the terminal on your local machine or another machine that has access to your Kubernetes cluster.

    2. Copy and paste the following command in your terminal to edit the Prometheus configuration for the runai namespace. This command opens the Prometheus configuration file in your default text editor (usually vi or nano):

    3. Copy and paste the following text to your terminal to change the configuration file:

    Note

    To save changes using vi, type :wq and press Enter. The changes are applied to the Prometheus configuration in the cluster.

    Alert Destinations

    Set out below are the various alert destinations.

    Configuring AlertManager for Custom Email Alerts

    In each step, copy the contents of the code snippets to a new file and apply it to the cluster using kubectl apply -f.

    1. Add your smtp password as a secret:

    2. Replace the relevant smtp details with your own, then apply the alertmanagerconfig using kubectl apply:

    3. Save and exit the editor. The configuration is automatically reloaded.

    Third-Party Alert Destinations

    Prometheus AlertManager provides a structured way to connect to alert-management systems. There are built-in plugins for popular systems such as PagerDuty and OpsGenie, including a generic Webhook.

    Example: Integrating NVIDIA Run:ai with a Webhook

    1. Use webhook.site to get a unique URL.

    2. Use the upgrade cluster instructions to modify the values file: edit the values file to add the following, and replace <WEB-HOOK-URL> with the URL from webhook.site:

    3. Verify that you are receiving alerts on webhook.site, in the left pane.

    Built-in Alerts

    An NVIDIA Run:ai cluster comes with several built-in alerts. Each alert relates to a specific function of an NVIDIA Run:ai entity. There is also a single, inclusive alert, NVIDIA Run:ai Critical Problems, which aggregates all component-based alerts into a single cluster health test.

    Runai agent cluster info push rate low

    Runai agent pull rate low

    Runai container memory usage critical

    Runai container memory usage warning

    Runai container restarting

    Runai CPU usage warning

    Runai critical problem

    Unknown state alert for a node

    Low memory node alert

    Runai daemonSet rollout stuck / Runai DaemonSet unavailable on nodes

    Runai deployment insufficient replicas / Runai deployment no available replicas / RunaiDeploymentUnavailableReplicas

    Runai project controller reconcile failure

    Runai StatefulSet insufficient replicas / Runai StatefulSet no available replicas

    Adding a Custom Alert

    You can add additional alerts based on NVIDIA Run:ai metrics. Alerts are triggered by using the Prometheus query language with any NVIDIA Run:ai metric.

    To create an alert, follow these steps:

    • Modify Values File: Use the upgrade cluster instructions to modify the values file.

    • Add Alert Structure: Incorporate alerts according to the structure outlined below. Replace placeholders <ALERT-NAME>, <ALERT-SUMMARY-TEXT>, <PROMQL-EXPRESSION>, <optional: duration s/m/h>, and <critical/warning> with appropriate values for your alert, as described below:

    You can find an example in the Prometheus documentation.

    Launching Workloads with GPU Memory Swap

    This quick start provides a step-by-step walkthrough for running multiple LLMs (inference workloads) on a single GPU using GPU memory swap.

    GPU memory swap expands the GPU physical memory to the CPU memory, allowing NVIDIA Run:ai to place and run more workloads on the same GPU physical hardware. This provides a smooth workload context switching between GPU memory and CPU memory, eliminating the need to kill workloads when the memory requirement is larger than what the GPU physical memory can provide.

    Prerequisites

    Before you start, make sure:

    kubectl create ns runai
    oc new-project runai
    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo update
    helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
        --namespace nginx-ingress --create-namespace \
        --set controller.kind=DaemonSet \
        --set controller.service.externalIPs="{<INTERNAL-IP>,<EXTERNAL-IP>}" # Replace <INTERNAL-IP> and <EXTERNAL-IP> with the internal and external IP addresses of one of the nodes
    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo update
    helm install nginx-ingress ingress-nginx/ingress-nginx \
        --namespace nginx-ingress --create-namespace
    helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
    helm repo update
    helm install nginx-ingress ingress-nginx/ingress-nginx \
        --namespace ingress-nginx --create-namespace \
        --set controller.service.annotations.oci.oraclecloud.com/load-balancer-type=nlb \
        --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/is-preserve-source=True \
        --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/security-list-management-mode=None \
        --set controller.service.externalTrafficPolicy=Local \
        --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/subnet=<SUBNET-ID> # Replace <SUBNET-ID> with the subnet ID of one of your cluster
    kubectl -n runai create secret generic runai-ca-cert \
        --from-file=runai-ca.pem=<ca_bundle_path>
    kubectl label secret runai-ca-cert -n runai run.ai/cluster-wide=true run.ai/name=runai-ca-cert --overwrite
    oc -n runai create secret generic runai-ca-cert \
        --from-file=runai-ca.pem=<ca_bundle_path>
    oc -n openshift-monitoring create secret generic runai-ca-cert \
        --from-file=runai-ca.pem=<ca_bundle_path>
    oc label secret runai-ca-cert -n runai run.ai/cluster-wide=true run.ai/name=runai-ca-cert --overwrite
    kubectl create ns gpu-operator
    #resourcequota.yaml
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: gcp-critical-pods
      namespace: gpu-operator
    spec:
      scopeSelector:
        matchExpressions:
        - operator: In
          scopeName: PriorityClass
          values:
          - system-node-critical
          - system-cluster-critical
    kubectl patch deployment training-operator -n kubeflow --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--enable-scheme=tfjob", "--enable-scheme=pytorchjob", "--enable-scheme=xgboostjob", "--enable-scheme=jaxjob"]}]'
    kubectl delete crd mpijobs.kubeflow.org
    kubectl create ns knative-serving
    apiVersion: operator.knative.dev/v1beta1
    kind: KnativeServing
    metadata:
      name: knative-serving
      namespace: knative-serving
    spec:
      config:
        config-autoscaler:
          enable-scale-to-zero: "true"
        config-features:
          kubernetes.podspec-affinity: enabled
          kubernetes.podspec-init-containers: enabled
          kubernetes.podspec-persistent-volume-claim: enabled
          kubernetes.podspec-persistent-volume-write: enabled
          kubernetes.podspec-schedulername: enabled
          kubernetes.podspec-securitycontext: enabled
          kubernetes.podspec-tolerations: enabled
          kubernetes.podspec-volumes-emptydir: enabled
          kubernetes.podspec-fieldref: enabled
          kubernetes.containerspec-addcapabilities: enabled
          kubernetes.podspec-nodeselector: enabled
          multi-container: enabled
        domain:
          runai-inference.mycorp.local: "" # replace with the wildcard FQDN for Inference
        network:
          domainTemplate: '{{.Name}}-{{.Namespace}}.{{.Domain}}'
          ingress-class: kourier.ingress.networking.knative.dev
          default-external-scheme: https
      high-availability:
        replicas: 2
      ingress:
        kourier:
          enabled: true
    kubectl apply -f knative-serving.yaml
    pod-security.kubernetes.io/audit=privileged
    pod-security.kubernetes.io/enforce=privileged
    pod-security.kubernetes.io/warn=privileged
    kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
        --cert /path/to/fullchain.pem  \ # Replace /path/to/fullchain.pem with the actual path to your TLS certificate
        --key /path/to/private.pem # Replace /path/to/private.pem with the actual path to your private key
    kubectl create secret tls runai-cluster-inference-tls-secret -n knative-serving \
        --cert /path/to/fullchain.pem  \ # Replace /path/to/fullchain.pem with the actual path to your TLS certificate
        --key /path/to/private.pem # Replace /path/to/private.pem with the actual path to your private key
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo update
    helm install prometheus prometheus-community/kube-prometheus-stack \
        -n monitoring --create-namespace --set grafana.enabled=false
    kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.2"
    kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
    helm repo add knative-operator https://knative.github.io/operator
    helm install knative-operator --create-namespace --namespace knativeoperator --version 1.16.6 knative-operator/knative-operator
    kubectl apply -f https://github.com/knative/serving/releases/download/knative-{VERSION}/serving-hpa.yaml

    kubectl apply -f resourcequota.yaml
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: knative-serving
      namespace: knative-serving
    spec:
      ingressClassName: nginx
      rules:
      - host: '*.runai-inference.mycorp.local' # replace with the wildcard FQDN for Inference
        http:
          paths:
          - backend:
              service:
                name: kourier
                port:
                  number: 80
            path: /
            pathType: Prefix
      tls:
      - hosts:
        - '*.runai-inference.mycorp.local' # replace with the wildcard FQDN for Inference
        secretName: runai-cluster-inference-tls-secret
    kubectl apply -f knative-ingress.yaml

    Delete the prometheus pod to reset the pod's settings:

  • Save the changes and exit the text editor.

  • <ALERT-NAME>: Choose a descriptive name for your alert, such as HighCPUUsage or LowMemory.

  • <ALERT-SUMMARY-TEXT>: Provide a brief summary of what the alert signifies, for example, High CPU usage detected or Memory usage below threshold.

  • <PROMQL-EXPRESSION>: Construct a Prometheus query (PROMQL) that defines the conditions under which the alert should trigger. This query should evaluate to a boolean value (1 for alert, 0 for no alert).

  • <optional: duration s/m/h>: Optionally, specify a duration in seconds (s), minutes (m), or hours (h) that the alert condition should persist before triggering an alert. If not specified, the alert triggers as soon as the condition is met.

  • <critical/warning>: Assign a severity level to the alert, indicating its importance. Choose between critical for severe issues requiring immediate attention, or warning for less critical issues that still need monitoring.
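    Putting these placeholders together, a hypothetical custom rule might look like the example below. The alert name, summary, expression, duration, and severity are illustrative only; replace <RUNAI-METRIC> with any NVIDIA Run:ai metric and adjust the PromQL expression to your needs:

    kube-prometheus-stack:
      additionalPrometheusRulesMap:
        custom-runai:
          groups:
          - name: custom-runai-rules
            rules:
            - alert: ExampleLowGpuUtilization
              annotations:
                summary: Average GPU utilization below 10% for 30 minutes
              expr: avg(<RUNAI-METRIC>) < 10
              for: 30m
              labels:
                severity: warning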

    Runai agent cluster info push rate low

    Meaning

    The cluster-sync Pod in the runai namespace might not be functioning properly

    Impact

    Possible impact - no info/partial info from the cluster is being synced back to the control-plane

    Severity

    Critical

    Diagnosis

    Run kubectl get pod -n runai to see if the cluster-sync pod is running.

    Troubleshooting/Mitigation

    To diagnose issues with the cluster-sync pod, follow these steps:

    1. Paste the following command to your terminal, to receive detailed information about the cluster-sync deployment: kubectl describe deployment cluster-sync -n runai

    2. Check the Logs: Use the following command to view the logs of the cluster-sync deployment: kubectl logs deployment/cluster-sync -n runai

    3. Analyze the Logs and Pod Details: From the information provided by the logs and the deployment details, attempt to identify the reason why the cluster-sync pod is not functioning correctly

    4. Check Connectivity: Ensure there is a stable network connection between the cluster and the NVIDIA Run:ai Control Plane. A connectivity issue may be the root cause of the problem.

    5. Contact Support: If the network connection is stable and you are still unable to resolve the issue, contact NVIDIA Run:ai support for further assistance

    Runai agent pull rate low

    Meaning

    The runai-agent pod may be too loaded, is slow in processing data (possible in very big clusters), or the runai-agent pod itself in the runai namespace may not be functioning properly.

    Impact

    Possible impact - no info/partial info from the control-plane is being synced in the cluster

    Severity

    Critical

    Diagnosis

    Run kubectl get pod -n runai and see if the runai-agent pod is running.

    Troubleshooting/Mitigation

    To diagnose issues with the runai-agent pod, follow these steps:

    1. Describe the Deployment: Run the following command to get detailed information about the runai-agent deployment: kubectl describe deployment runai-agent -n runai

    2. Check the Logs: Use the following command to view the logs of the runai-agent deployment: kubectl logs deployment/runai-agent -n runai

    3. Analyze the Logs and Pod Details: From the information provided by the logs and the deployment details, attempt to identify the reason why the runai-agent pod is not functioning correctly. There may be a connectivity issue with the control plane.

    4. Check Connectivity: Ensure there is a stable network connection between the runai-agent and the control plane. A connectivity issue may be the root cause of the problem.

    5. Consider Cluster Load: If the runai-agent appears to be functioning properly but the cluster is very large and heavily loaded, it may take more time for the agent to process data from the control plane.

    6. Adjust Alert Threshold: If the cluster load is causing the alert to fire, you can adjust the threshold at which the alert triggers. The default value is 0.05. You can try changing it to a lower value (e.g., 0.045 or 0.04). To edit the value, paste the following in your terminal: kubectl edit runaiconfig -n runai. In the editor, navigate to spec -> prometheus -> agentPullPushRateMinForAlert. If the agentPullPushRateMinForAlert value does not exist, add it under spec -> prometheus.
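    For example, after opening the configuration with kubectl edit runaiconfig -n runai, the relevant section should end up looking like the following (0.04 is one of the lower values suggested above):

    spec:
      prometheus:
        agentPullPushRateMinForAlert: 0.04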

    Runai container memory usage critical

    Meaning

    A Runai container is using more than 90% of its memory limit

    Impact

    The container might run out of memory and crash.

    Severity

    Critical

    Diagnosis

    Calculate the memory usage by pasting the following into your terminal: container_memory_usage_bytes{namespace=~"runai

    Troubleshooting/Mitigation

    Add more memory resources to the container. If the issue persists, contact NVIDIA Run:ai

    Runai container memory usage warning

    Meaning

    Runai container is using more than 80% of its memory limit

    Impact

    The container might run out of memory and crash

    Severity

    Warning

    Diagnosis

    Calculate the memory usage by pasting the following into your terminal: container_memory_usage_bytes{namespace=~"runai

    Troubleshooting/Mitigation

    Add more memory resources to the container. If the issue persists, contact NVIDIA Run:ai

    Runai container restarting

    Meaning

    Runai container has restarted more than twice in the last 10 min

    Impact

    The container might become unavailable and impact the NVIDIA Run:ai system

    Severity

    Warning

    Diagnosis

    To diagnose the issue and identify the problematic pods, paste these commands into your terminal: kubectl get pods -n runai and kubectl get pods -n runai-backend. One or more of the pods will have a restart count >= 2.

    Troubleshooting/Mitigation

    Paste this into your terminal: kubectl logs -n NAMESPACE POD_NAME. Replace NAMESPACE and POD_NAME with the relevant pod information from the previous step. Check the logs for any standout issues and verify that the container has sufficient resources. If you need further assistance, contact NVIDIA Run:ai.

    Runai CPU usage warning

    Meaning

    A Runai container is using more than 80% of its CPU limit

    Impact

    This might cause slowness in the operation of certain NVIDIA Run:ai features.

    Severity

    Warning

    Diagnosis

    Paste the following query into your terminal to calculate the CPU usage: rate(container_cpu_usage_seconds_total{namespace=~"runai

    Troubleshooting/Mitigation

    Add more CPU resources to the container. If the issue persists, please contact NVIDIA Run:ai.

    Runai critical problem

    Meaning

    One of the critical NVIDIA Run:ai alerts is currently active

    Impact

    Impact is based on the active alert

    Severity

    Critical

    Diagnosis

    Check NVIDIA Run:ai alerts in Prometheus to identify any active critical alerts
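    If the Prometheus UI is not already exposed in your environment, one quick way to inspect active alerts is to port-forward the operator-managed Prometheus service and browse to http://localhost:9090/alerts. The service name below (prometheus-operated) is the default created by the Prometheus Operator and is an assumption here; adjust it to match your installation:

    kubectl port-forward -n runai svc/prometheus-operated 9090:9090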

    Unknown state alert for a node

    Meaning

    The Kubernetes node hosting GPU workloads is in an unknown state, and its health and readiness cannot be determined.

    Impact

    This may interrupt GPU workload scheduling and execution.

    Severity

    Critical - Node is either unschedulable or has unknown status. The node is in one of the following states:

    • Ready=Unknown: The control plane cannot communicate with the node.

    • Ready=False: The node is not healthy.

    • Unschedulable=True: The node is marked as unschedulable.

    Diagnosis

    Check the node's status using kubectl describe node, verify Kubernetes API server connectivity, and inspect system logs for GPU-specific or node-level errors.

    Low memory node alert

    Meaning

    The Kubernetes node hosting GPU workloads has insufficient memory to support current or upcoming workloads.

    Impact

    GPU workloads may fail to schedule, experience degraded performance, or crash due to memory shortages, disrupting dependent applications.

    Severity

    Critical - Node is using more than 90% of its memory. Warning - Node is using more than 80% of its memory.

    Diagnosis

    Use kubectl top node to assess memory usage, identify memory-intensive pods, consider resizing the node or optimizing memory usage in affected pods.
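    A minimal set of commands for this check might look like the following (standard kubectl top flags; adjust namespaces as needed):

    # Show memory pressure per node, then list the most memory-hungry pods across all namespaces
    kubectl top node
    kubectl top pod -A --sort-by=memory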

    Runai daemonSet rollout stuck / Runai DaemonSet unavailable on nodes

    Meaning

    There are currently 0 available pods for the runai daemonset on the relevant node

    Impact

    No fractional GPU workloads support

    Severity

    Critical

    Diagnosis

    Paste the following command into your terminal: kubectl get daemonset -n runai-backend. In the output, identify the daemonset(s) that don’t have any running pods.

    Troubleshooting/Mitigation

    Paste the following command into your terminal, where X is the problematic daemonset from the previous step: kubectl describe daemonset X -n runai. Then look for the specific error that prevents it from creating pods. Possible reasons might be:

    • Node Resource Constraints: The nodes in the cluster may lack sufficient resources (CPU, memory, etc.) to accommodate new pods from the daemonset.

    • Node Selector or Affinity Rules: The daemonset may have node selector or affinity rules that are not matching with any nodes currently available in the cluster, thus preventing pod creation.

    Runai deployment insufficient replicas / Runai deployment no available replicas / RunaiDeploymentUnavailableReplicas

    Meaning

    Runai deployment has one or more unavailable pods

    Impact

    When this happens, there may be scale issues. Additionally, new versions cannot be deployed, potentially resulting in missing features.

    Severity

    Critical

    Diagnosis

    Paste the following commands into your terminal to get the status of the deployments in the runai and runai-backend namespaces: kubectl get deployment -n runai and kubectl get deployment -n runai-backend. Identify any deployments that have missing pods. Look for discrepancies in the DESIRED and AVAILABLE columns. If the number of AVAILABLE pods is less than the DESIRED pods, there are missing pods.

    Troubleshooting/Mitigation

    • Paste the following commands to your terminal to receive detailed information about the problematic deployment: kubectl describe deployment <DEPLOYMENT_NAME> -n runai and kubectl describe deployment <DEPLOYMENT_NAME> -n runai-backend

    • Paste the following commands to your terminal to check the replicaset details associated with the deployment: kubectl describe replicaset <REPLICASET_NAME> -n runai and kubectl describe replicaset <REPLICASET_NAME> -n runai-backend

    • Paste the following commands to your terminal to retrieve the logs for the deployment and identify any errors or issues: kubectl logs deployment/<DEPLOYMENT_NAME> -n runai and kubectl logs deployment/<DEPLOYMENT_NAME> -n runai-backend

    • From the logs and the detailed information provided by the describe commands, analyze the reasons why the deployment is unable to create pods. Look for common issues such as:

      • Resource constraints (CPU, memory)

      • Misconfigured deployment settings or replicasets

      • Node selector or affinity rules preventing pod scheduling

    Runai project controller reconcile failure

    Meaning

    The project-controller in runai namespace had errors while reconciling projects

    Impact

    Some projects might not be in the “Ready” state. This means that they are not fully operational and may not have all the necessary components running or configured correctly.

    Severity

    Critical

    Diagnosis

    Retrieve the logs for the project-controller deployment by pasting the following command in your terminal: kubectl logs deployment/project-controller -n runai. Carefully examine the logs for any errors or warning messages. These logs help you understand what might be going wrong with the project controller.

    Troubleshooting/Mitigation

    Once errors in the log have been identified, follow these steps to mitigate the issue: The error messages in the logs should provide detailed information about the problem.

    1. Read through them to understand the nature of the issue. If the logs indicate which project failed to reconcile, you can further investigate by checking the status of that specific project.

    2. Run the following command, replacing <PROJECT_NAME> with the name of the problematic project: kubectl get project <PROJECT_NAME> -o yaml

    3. Review the status section in the YAML output. This section describes the current state of the project and provides insights into what might be causing the failure. If the issue persists, contact NVIDIA Run:ai.

    Runai StatefulSet insufficient replicas / Runai StatefulSet no available replicas

    Meaning

    Runai statefulset has no available pods

    Impact

    Absence of metrics or metrics database unavailability

    Severity

    Critical

    Diagnosis

    To diagnose the issue, follow these steps:

    1. Check the status of the stateful sets in the runai-backend namespace by running the following command: kubectl get statefulset -n runai-backend

    2. Identify any stateful sets that have no running pods. These are the ones that might be causing the problem.

    Troubleshooting/Mitigation

    Once you've identified the problematic stateful sets, follow these steps to mitigate the issue:

    1. Describe the stateful set to get detailed information on why it cannot create pods. Replace X with the name of the stateful set: kubectl describe statefulset X -n runai-backend

    2. Review the description output to understand the root cause of the issue. Look for events or error messages that explain why the pods are not being created.

    3. If you're unable to resolve the issue based on the information gathered, contact NVIDIA Run:ai support for further assistance.

  • You have created a project or have one created for you.
  • The project has an assigned quota of at least 1 GPU.

  • Dynamic GPU fractions is enabled.

  • GPU memory swap is enabled on at least one free node as detailed here (a configuration sketch follows the note below).

  • Host-based routing is configured.

  • Note

    • Flexible workload submission is disabled by default. If unavailable, your administrator must enable it under General Settings → Workloads → Flexible workload submission.

    • The Custom inference type appears only if your administrator has enabled it under General settings → Workloads → Models. If not enabled, Custom becomes the default inference type and is not displayed as a selectable option.

    • Dynamic GPU fractions is disabled by default in the NVIDIA Run:ai UI. To use dynamic GPU fractions, it must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.
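    For reference, GPU memory swap is typically enabled by an administrator at the cluster level through the runaiconfig resource. The key path and CPU RAM reservation below follow the pattern used for this setting but are a sketch only; the linked configuration page is the source of truth:

    # Sketch only: enable GPU memory swap cluster-wide (assumes the runaiconfig resource is named runai)
    kubectl patch runaiconfig runai -n runai --type merge -p \
        '{"spec": {"global": {"core": {"swap": {"enabled": true, "limits": {"cpuRam": "100Gi"}}}}}}'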

    Step 1: Logging In

    Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

    To use the API, you will need to obtain a token as shown in API authentication.

    Step 2: Submitting the First Inference Workload

    1. Go to the Workload manager → Workloads

    2. Click +NEW WORKLOAD and select Inference

    3. Select under which cluster to create the workload

    4. Select the project in which your workload will run

    5. Select custom inference from Inference type (if applicable)

    6. Enter a name for the workload (if the name already exists in the project, you will be requested to submit a different name)

    7. Under Submission, select Flexible and click CONTINUE

    8. Click the load icon. A side pane appears, displaying a list of available environments. To add a new environment:

      • Click the + icon to create a new environment

      • Enter quick-start as the name for the environment. The name must be unique.

      • Enter the NVIDIA Run:ai vLLM Image URL - runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0

    9. Click the load icon. A side pane appears, displaying a list of available compute resources. To add a new compute resource:

      • Click the + icon to create a new compute resource

      • Enter request-limit as the name for the compute resource. The name must be unique.

      • Set GPU devices per pod - 1

    10. Click CREATE INFERENCE

    1. Go to the Workload manager → Workloads

    2. Click +NEW WORKLOAD and select Inference

    3. Select under which cluster to create the workload

    4. Select the project in which your workload will run

    Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the Inferences API:

    • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

    • <TOKEN> - The API access token obtained in Step 1

    • <PROJECT-ID>

    Step 3: Submitting the Second Inference Workload

    1. Go to the Workload manager → Workloads

    2. Click +NEW WORKLOAD and select Inference

    3. Select the cluster where the previous inference workload was created

    4. Select the project where the previous inference workload was created

    5. Select custom inference from Inference type (if applicable)

    6. Enter a name for the workload (if the name already exists in the project, you will be requested to submit a different name)

    7. Under Submission, select Flexible and click CONTINUE

    8. Click the load icon. A side pane appears, displaying a list of available environments. Select the environment created in Step 2.

    9. Click the load icon. A side pane appears, displaying a list of available compute resources. Select the compute resources created in Step 2.

    10. Click CREATE INFERENCE

    1. Go to the Workload manager → Workloads

    2. Click +NEW WORKLOAD and select Inference

    3. Select the cluster where the previous inference workload was created

    4. Select the project where the previous inference workload was created

    Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the Inferences API:

    • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

    • <TOKEN> - The API access token obtained in Step 1

    • <PROJECT-ID>

    Step 4: Submitting the First Workspace

    1. Go to the Workload manager → Workloads

    2. Click COLUMNS and select Connections

    3. Select the link under the Connections column for the first inference workload created in Step 2

    4. In the Connections Associated with Workload form, copy the URL under the Address column

    5. Click +NEW WORKLOAD and select Workspace

    6. Select the cluster where the previous inference workloads were created

    7. Select the project where the previous inference workloads were created

    8. Select Start from scratch to launch a new workspace quickly

    9. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

    10. Under Submission, select Flexible and click CONTINUE

    11. Click the load icon. A side pane appears, displaying a list of available environments. Select the ‘chatbot-ui’ environment for your workspace (Image URL: runai.jfrog.io/core-llm/llm-app)

      • Set the runtime settings for the environment with the following environment variables:

        • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

    12. Click the load icon. A side pane appears, displaying a list of available compute resources. Select ‘cpu-only’ from the list.

      • If ‘cpu-only’ is not displayed, follow the below steps:

        • Click the + icon to create a new compute resource

    13. Click CREATE WORKSPACE

    1. Go to the Workload manager → Workloads

    2. Click COLUMNS and select Connections

    3. Select the link under the Connections column for the first inference workload created in Step 2

    4. In the Connections Associated with Workload form, copy the URL under the Address

    Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the Workspaces API:

    • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

    • <PROJECT-ID>

    Step 5: Submitting the Second Workspace

    1. Go to the Workload manager → Workloads

    2. Click COLUMNS and select Connections

    3. Select the link under the Connections column for the second inference workload created in Step 3

    4. In the Connections Associated with Workload form, copy the URL under the Address column

    5. Click +NEW WORKLOAD and select Workspace

    6. Select the cluster where the previous inference workloads were created

    7. Select the project where the previous inference workloads were created

    8. Select Start from scratch to launch a new workspace quickly

    9. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

    10. Under Submission, select Flexible and click CONTINUE

    11. Click the load icon. A side pane appears, displaying a list of available environments. Select the environment created in Step 4.

      • Set the runtime settings for the environment with the following environment variables:

        • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

    12. Click the load icon. A side pane appears, displaying a list of available compute resources. Select the compute resources created in Step 4.

    13. Click CREATE WORKSPACE

    1. Go to the Workload manager → Workloads

    2. Click COLUMNS and select Connections

    3. Select the link under the Connections column for the second inference workload created in Step 3

    4. In the Connections Associated with Workload form, copy the URL under the Address

    Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the Workspaces API:

    • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

    • <PROJECT-ID>

    Step 6: Connecting to ChatbotUI

    1. Select the newly created workspace that you want to connect to

    2. Click CONNECT

    3. Select the ChatbotUI tool. The selected tool is opened in a new tab on your browser.

    4. Query both workspaces simultaneously and see them both responding. The workload whose memory is currently swapped out to CPU RAM will take longer to respond while it is swapped back into GPU memory, and vice versa.

    1. To connect to the ChatbotUI tool, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>

    2. Query both workspaces simultaneously and see them both responding. The workload whose memory is currently swapped out to CPU RAM will take longer to respond while it is swapped back into GPU memory, and vice versa.

    Next Steps

    Manage and monitor your newly created workloads using the Workloads table.

    kubectl delete pod prometheus-runai-0 -n runai
    kubectl get deployment kube-prometheus-stack-operator -n monitoring
    kubectl get prometheus -n runai
    kubectl apply -f step1.yaml 
    apiVersion: monitoring.coreos.com/v1  
    kind: Alertmanager  
    metadata:  
       name: runai  
       namespace: runai  
    spec:  
       replicas: 1  
       alertmanagerConfigSelector:  
          matchLabels:
             alertmanagerConfig: runai 
    kubectl get alertmanager -n runai
    kubectl get svc alertmanager-operated -n runai
    kubectl edit runaiconfig -n runai
    prometheus:
      spec:
        alerting:
          alertmanagers:
          - name: alertmanager-operated
            namespace: runai
            port: web
    apiVersion: v1  
    kind: Secret  
    metadata:  
       name: alertmanager-smtp-password  
       namespace: runai  
    stringData:
       password: "your_smtp_password"
    apiVersion: monitoring.coreos.com/v1alpha1
    kind: AlertmanagerConfig
    metadata:
      name: runai
      namespace: runai
      labels:
        alertmanagerConfig: runai
    spec:
      route:
        continue: true
        groupBy:
        - alertname
        groupWait: 30s
        groupInterval: 5m
        repeatInterval: 1h
        matchers:
        - matchType: =~
          name: alertname
          value: Runai.*
        receiver: email
      receivers:
      - name: 'email'
        emailConfigs:
        - to: '<destination_email_address>'
          from: '<from_email_address>'
          smarthost: 'smtp.gmail.com:587'
          authUsername: '<smtp_server_user_name>'
          authPassword:
            name: alertmanager-smtp-password
            key: password
    kube-prometheus-stack:  
      ...  
      alertmanager:  
        enabled: true  
        config:  
          global:  
            resolve_timeout: 5m  
          receivers:  
          - name: "null"  
          - name: webhook-notifications  
            webhook_configs:  
              - url: <WEB-HOOK-URL>  
                send_resolved: true  
          route:  
            group_by:  
            - alertname  
            group_interval: 5m  
            group_wait: 30s  
            receiver: 'null'  
            repeat_interval: 10m  
            routes:  
            - receiver: webhook-notifications
    kube-prometheus-stack:  
       additionalPrometheusRulesMap:  
         custom-runai:  
           groups:  
           - name: custom-runai-rules  
             rules:  
             - alert: <ALERT-NAME>  
               annotations:  
                 summary: <ALERT-SUMMARY-TEXT>  
               expr:  <PROMQL-EXPRESSION>  
               for: <optional: duration s/m/h>  
               labels:  
                 severity: <critical/warning>
    If the issue persists, contact NVIDIA Run:ai.
    runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0
  • Set the inference serving endpoint to HTTP and the container port to 8000

  • Set the runtime settings for the environment. Click +ENVIRONMENT VARIABLE and add the following:

    • Name: RUNAI_MODEL Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct (you can choose any vLLM supporting model from Hugging Face)

    • Name: RUNAI_MODEL_NAME Source: Custom Value: Llama-3.2-1B-Instruct

    • Name: HF_TOKEN Source: Custom Value: <Your Hugging Face token> (only needed for gated models)

    • Name: VLLM_RPC_TIMEOUT Source: Custom Value: 60000

  • Click CREATE ENVIRONMENT

  • Select the newly created environment from the side pane

  • Set GPU devices per pod - 1
  • Set GPU memory per device

    • Select % (of device) - Fraction of a GPU device’s memory

    • Set the memory Request - 50 (the workload will allocate 50% of the GPU memory)

    • Toggle Limit and set to 100%

  • Optional: set the CPU compute per pod - 0.1 cores (default)

  • Optional: set the CPU memory per pod - 100 MB (default)

  • Select More settings and toggle Increase shared memory size

  • Click CREATE COMPUTE RESOURCE

  • Select the newly created compute resource from the side pane

  • Select custom inference from Inference type (if applicable)

  • Enter a name for the workload (if the name already exists in the project, you will be requested to submit a different name)

  • Under Submission, select Original and click CONTINUE

  • Create an environment for your workload

    • Click +NEW ENVIRONMENT

    • Enter quick-start as the name for the environment. The name must be unique.

    • Enter the NVIDIA Run:ai vLLM Image URL - runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0

    • Set the runtime settings for the environment. Click +ENVIRONMENT VARIABLE and add the following:

      • Name: RUNAI_MODEL Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct (you can choose any vLLM supporting model from Hugging Face)

      • Name: RUNAI_MODEL_NAME Source: Custom Value: Llama-3.2-1B-Instruct

    • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  • Create a new “request-limit” compute resource

    • Click +NEW COMPUTE RESOURCE

    • Enter request-limit as the name for the compute resource. The name must be unique.

    • Set GPU devices per pod - 1

    • Set GPU memory per device

      • Select % (of device) - Fraction of a GPU device’s memory

      • Set the memory Request - 50 (the workload will allocate 50% of the GPU memory)

      • Toggle Limit and set to 100%

    • Optional: set the CPU compute per pod - 0.1 cores (default)

    • Optional: set the CPU memory per pod - 100 MB (default)

    • Select More settings and toggle Increase shared memory size

    • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  • Click CREATE INFERENCE

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

  • Note

    The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

  • Select custom inference from Inference type (if applicable)

  • Enter a name for the workload (if the name already exists in the project, you will be requested to submit a different name)

  • Under Submission, select Original and click CONTINUE

  • Select the environment created in Step 2

  • Select the compute resource created in Step 2

  • Click CREATE INFERENCE

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

  • Note

    The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

  • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Add the address link from Step 4

  • Delete the PATH_PREFIX environment variable if you are using host-based routing.

  • If ‘chatbot-ui’ is not displayed in the gallery, follow the below steps:

    • Click the + icon to create a new environment

    • Enter chatbot-ui as the name for the environment. The name must be unique.

    • Enter the chatbot-ui Image URL - runai.jfrog.io/core-llm/llm-app

    • Tools - Set the connection for your tool

      • Click +TOOL

      • Select Chatbot UI tool from the list

    • Set the runtime settings for the environment. Click +ENVIRONMENT VARIABLE and add the following:

      • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

      • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Add the Address link

    • Click CREATE ENVIRONMENT

    • Select the newly created environment from the side pane

  • Enter cpu-only as the name for the compute resource. The name must be unique.
  • Set GPU devices per pod - 0

  • Set CPU compute per pod - 0.1 cores

  • Set the CPU memory per pod - 100 MB (default)

  • Click CREATE COMPUTE RESOURCE

  • Select the newly created compute resource from the side pane

  • column
  • Click +NEW WORKLOAD and select Workspace

  • Select the cluster where the previous inference workloads were created

  • Select the project where the previous inference workloads were created

  • Select Start from scratch to launch a new workspace quickly

  • Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  • Under Submission, select Original and click CONTINUE

  • Select the ‘chatbot-ui’ environment for your workspace (Image URL: runai.jfrog.io/core-llm/llm-app)

    • Set the runtime settings for the environment with the following environment variables:

      • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

      • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Add the address link from Step 4

      • Delete the PATH_PREFIX environment variable if you are using host-based routing.

    • If ‘chatbot-ui’ is not displayed in the gallery, follow the below steps:

      • Click +NEW ENVIRONMENT

      • Enter chatbot-ui as the name for the environment. The name must be unique.

      • Enter the chatbot-ui Image URL - runai.jfrog.io/core-llm/llm-app

    The newly created environment will be selected automatically

  • Select the ‘cpu-only’ compute resource for your workspace

    • If ‘cpu-only’ is not displayed in the gallery, follow the below steps:

      • Click +NEW COMPUTE RESOURCE

      • Enter cpu-only as the name for the compute resource. The name must be unique.

      • Set GPU devices per pod - 0

      • Set CPU compute per pod - 0.1 cores

      • Set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

      The newly created compute resource will be selected automatically

  • Click CREATE WORKSPACE

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

  • <URL> - The URL for connecting an external service related to the workload. You can get the URL via the List Workloads API.

  • Note

    The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

    Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Add the Address link

  • Delete the PATH_PREFIX environment variable if you are using host-based routing.

  • column
  • Click +NEW WORKLOAD and select Workspace

  • Select the cluster where the previous inference workloads were created

  • Select the project where the previous inference workloads were created

  • Select Start from scratch to launch a new workspace quickly

  • Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  • Under Submission, select Original and click CONTINUE

  • Select the environment created in Step 4

    • Set the runtime settings for the environment with the following environment variables:

      • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

      • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Add the Address link

      • Delete the PATH_PREFIX environment variable if you are using host-based routing.

  • Select the compute resource created in Step 4

  • Click CREATE WORKSPACE

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

  • <URL> - The URL for connecting an external service related to the workload. You can get the URL via the List Workloads API.

  • Note

    The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.


    Policy YAML Examples

    This article provides examples of:

    1. Creating a new rule within a policy

    2. Best practices for adding sections to a policy

    3. A full example of a whole policy

    Creating a New Rule Within a Policy

    This example shows how to add a new limitation to the GPU usage for workloads of type workspace:

    1. Check the workload API fields documentation and select the field(s) that are most relevant for GPU usage.

    2. Search for the field in the Policy YAML fields - reference table. For example, gpuDevicesRequest appears under the Compute fields sub-table and appears as follows:

    Fields
    Description
    Value type
    Supported NVIDIA Run:ai workload type
    3. Use the value type of the gpuDevicesRequest field indicated in the table - “integer” and navigate to the Value types table to view the possible rules that can be applied to this value type -

      for integer, the options are:

      • canEdit

      • required

    Policy YAML Best Practices

    Create a policy that has multiple defaults and rules

    Best practices description

    Presentation of the syntax while adding a set of defaults and rules

    Example

    Allow only single selection out of many

    Best practices description

    Blocking the option to create all types of data sources except the one that is allowed is the solution

    Example

    Create a robust set of guidelines

    Best practices description

    Set rules for specific compute resource usage, addressing most relevant spec fields

    Example

    Policy for distributed training workloads

    Best practices description

    Set rules and defaults for a distributed training workload with different setting for master and workers

    Example

    Examples for specific sections in the policy

    Best practices description

    Environment creation

    Example

    Best practices description

    Setting security measures

    Example

    Best practices description

    Impose an asset

    Example of a Whole Policy

    curl -L 'https://<COMPANY-URL>/api/v1/workloads/inferences' \ 
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <TOKEN>' \
    -d '{ 
        "name": "workload-name", 
        "useGivenNameAsPrefix": true,
        "projectId": "<PROJECT-ID>", 
        "clusterId": "<CLUSTER-UUID>", 
        "spec": {
            "image": "runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0",
            "imagePullPolicy":"IfNotPresent",
            "environmentVariables": [
              {
                "name": "RUNAI_MODEL",
                "value": "meta-lama/Llama-3.2-1B-Instruct"
              },
              {
                "name": "VLLM_RPC_TIMEOUT",
                "value": "60000"
              },
              {
                "name": "HF_TOKEN",
                "value":"<INSERT HUGGINGFACE TOKEN>"
              }
            ],
            "compute": {
                "gpuDevicesRequest": 1,
                "gpuRequestType": "portion",
                "gpuPortionRequest": 0.1,
                "gpuPortionLimit": 1,
                "cpuCoreRequest":0.2,
                "cpuMemoryRequest": "200M",
                "largeShmRequest": false
    
            },
            "servingPort": {
                "container": 8000,
                "protocol": "http",
                "authorizationType": "public"
            }
        }
    }'
    curl -L 'https://<COMPANY-URL>/api/v1/workloads/inferences' \ 
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <TOKEN>' \ 
    -d '{ 
        "name": "workload-name", 
        "useGivenNameAsPrefix": true,
        "projectId": "<PROJECT-ID>",  
        "clusterId": "<CLUSTER-UUID>",
        "spec": {
            "image": "runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0",
            "imagePullPolicy":"IfNotPresent",
            "environmentVariables": [
              {
                "name": "RUNAI_MODEL",
                "value": "meta-lama/Llama-3.2-1B-Instruct"
              },
              {
                "name": "VLLM_RPC_TIMEOUT",
                "value": "60000"
              },
              {
                "name": "HF_TOKEN",
                "value":"<INSERT HUGGINGFACE TOKEN>"
              }
            ],
            "compute": {
                "gpuDevicesRequest": 1,
                "gpuRequestType": "portion",
                "gpuPortionRequest": 0.1,
                "gpuPortionLimit": 1,
                "cpuCoreRequest":0.2,
                "cpuMemoryRequest": "200M",
                "largeShmRequest": false
    
            },
            "servingPort": {
                "container": 8000,
                "protocol": "http",
                "authorizationType": "public"
            }
        }
    }'
    curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \ 
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <TOKEN>' \ 
    -d '{ 
        "name": "workload-name", 
        "projectId": "<PROJECT-ID>", 
        "clusterId": "<CLUSTER-UUID>",
        "spec": {  
            "image": "runai.jfrog.io/core-llm/llm-app",
            "environmentVariables": [
              {
                "name": "RUNAI_MODEL_NAME",
                "value": "meta-llama/Llama-3.2-1B-Instruct"
              },
              {
                "name": "RUNAI_MODEL_BASE_URL",
                "value": "<URL>" 
              }
            ],
            "compute": {
                "cpuCoreRequest":0.1,
                "cpuMemoryRequest": "100M",
            }
        }
    }'
    curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \ 
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer <TOKEN>' \ 
    -d '{ 
        "name": "workload-name", 
        "projectId": "<PROJECT-ID>", '\ 
        "clusterId": "<CLUSTER-UUID>", \ 
        "spec": {  
            "image": "runai.jfrog.io/core-llm/llm-app",
            "environmentVariables": [
              {
                "name": "RUNAI_MODEL_NAME",
                "value": "meta-llama/Llama-3.2-1B-Instruct"
              },
              {
                "name": "RUNAI_MODEL_BASE_URL",
                "value": "<URL>" 
              }
            ],
            "compute": {
                "cpuCoreRequest":0.1,
                "cpuMemoryRequest": "100M",
            }
        }
    }'

    Name: HF_TOKEN Source: Custom Value: <Your Hugging Face token> (only needed for gated models)

  • Name: VLLM_RPC_TIMEOUT Source: Custom Value: 60000

  • Name: RUNAI_MODEL_TOKEN_LIMIT Source: Custom Value: 8192

  • Name: RUNAI_MODEL_MAX_LENGTH Source: Custom Value: 16384

  • Tools - Set the connection for your tool

    • Click +TOOL

    • Select Chatbot UI tool from the list

  • Set the runtime settings for the environment. Click +ENVIRONMENT VARIABLE and add the following:

    • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

    • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Add the Address link

    • Name: RUNAI_MODEL_TOKEN_LIMIT Source: Custom Value: 8192

    • Name: RUNAI_MODEL_MAX_LENGTH Source: Custom Value: 16384

  • Click CREATE ENVIRONMENT

  • min
  • max

  • step

  • Proceed to the Rule Type table, select the required rule for the limitation of the field - for example “max” - and use the example syntax to indicate the maximum number of GPU devices requested.

  • Example

    gpuDevicesRequest

    Specifies the number of GPUs to allocate for the created workload. The gpuRequestType can be defined only if gpuDevicesRequest = 1.

    integer

    Workspace & Training

    {
    "spec": {
        "compute": {
        "gpuDevicesRequest": 1,
        "gpuRequestType": "portion",
        "gpuPortionRequest": 0.5,
        "gpuPortionLimit": 0.5,
        "gpuMemoryRequest": "10M",
        "gpuMemoryLimit": "10M",
        "migProfile": "1g.5gb",
        "cpuCoreRequest": 0.5,
        "cpuCoreLimit": 2,
        "cpuMemoryRequest": "20M",
        "cpuMemoryLimit": "30M",
        "largeShmRequest": false,
        "extendedResources": [
            {
            "resource": "hardware-vendor.example/foo",
            "quantity": 2,
            "exclude": false
            }
        ]
        }
    }
    }
    compute:
        gpuDevicesRequest:
            max: 2
    defaults:
      createHomeDir: true
      environmentVariables:
        instances:
          - name: MY_ENV
            value: my_value
      security:
        allowPrivilegeEscalation: false
    
    rules:
      storage:
        s3:
          attributes:
            url:
              options:
                - value: https://www.google.com
                  displayed: https://www.google.com
                - value: https://www.yahoo.com
                  displayed: https://www.yahoo.com
    rules:
      storage:
        dataVolume:
          instances:
            canAdd: false
        hostPath:
          instances:
            canAdd: false
        pvc:
          instances:
            canAdd: false
        git:
          attributes:
            repository:
              required: true
            branch:
              required: true
            path:
              required: true
        nfs:
          instances:
            canAdd: false
        s3:
          instances:
            canAdd: false
    compute:
        cpuCoreRequest:
          required: true
          min: 0
          max: 8
        cpuCoreLimit:
          min: 0
          max: 8
        cpuMemoryRequest:
          required: true
          min: '0'
          max: 16G
        cpuMemoryLimit:
          min: '0'
          max: 8G
        migProfile:
          canEdit: false
        gpuPortionRequest:
          min: 0
          max: 1
        gpuMemoryRequest:
          canEdit: false
        extendedResources:
          instances:
            canAdd: false
    defaults:
      worker:
        command: my-command-worker-1
        environmentVariables:
          instances:
            - name: LOG_DIR
              value: policy-worker-to-be-ignored
            - name: ADDED_VAR
              value: policy-worker-added
        security:
          runAsUid: 500
        storage:
          s3:
            attributes:
              bucket: bucket1-worker
      master:
        command: my-command-master-2
        environmentVariables:
          instances:
            - name: LOG_DIR
              value: policy-master-to-be-ignored
            - name: ADDED_VAR
              value: policy-master-added
        security:
          runAsUid: 800
        storage:
          s3:
            attributes:
              bucket: bucket1-master
    rules:
      worker:
        command:
          options:
            - value: my-command-worker-1
              displayed: command1
            - value: my-command-worker-2
              displayed: command2
        storage:
          nfs:
            instances:
              canAdd: false
          s3:
            attributes:
              bucket:
                options:
                  - value: bucket1-worker
                  - value: bucket2-worker
      master:
        command:
          options:
            - value: my-command-master-1
              displayed: command1
            - value: my-command-master-2
              displayed: command2
        storage:
          nfs:
            instances:
              canAdd: false
          s3:
            attributes:
              bucket:
                options:
                  - value: bucket1-master
                  - value: bucket2-master
    rules:
      imagePullPolicy:
        required: true
        options:
          - value: Always
            displayed: Always
          - value: Never
            displayed: Never
      createHomeDir:
        canEdit: false
    rules:
      security:
        runAsUid:
          min: 1
          max: 32700
        allowPrivilegeEscalation:
          canEdit: false
    defaults:
      createHomeDir: true
      imagePullPolicy: IfNotPresent
      nodePools:
        - node-pool-a
        - node-pool-b
      environmentVariables:
        instances:
          - name: WANDB_API_KEY
            value: REPLACE_ME!
          - name: WANDB_BASE_URL
            value: https://wandb.mydomain.com
      compute:
        cpuCoreRequest: 0.1
        cpuCoreLimit: 20
        cpuMemoryRequest: 10G
        cpuMemoryLimit: 40G
        largeShmRequest: true
      security:
        allowPrivilegeEscalation: false
      storage:
        git:
          attributes:
            repository: https://git-repo.my-domain.com
            branch: master
        hostPath:
          instances:
            - name: vol-data-1
              path: /data-1
              mountPath: /mount/data-1
            - name: vol-data-2
              path: /data-2
              mountPath: /mount/data-2
    rules:
      createHomeDir:
        canEdit: false
      imagePullPolicy:
        canEdit: false
      environmentVariables:
        instances:
          locked:
            - WANDB_BASE_URL
      compute:
        cpuCoreRequest:
          max: 32
        cpuCoreLimit:
          max: 32
        cpuMemoryRequest:
          min: 1G
          max: 20G
        cpuMemoryLimit:
          min: 1G
          max: 40G
        largeShmRequest:
          canEdit: false
        extendedResources:
          instances:
            canAdd: false
      security:
        allowPrivilegeEscalation:
          canEdit: false
        runAsUid:
          min: 1
      storage:
        hostPath:
          instances:
            locked:
              - vol-data-1
              - vol-data-2
    imposedAssets:
      - 4ba37689-f528-4eb6-9377-5e322780cc27
    
    defaults: null
    rules: null
    imposedAssets:
      - f12c965b-44e9-4ff6-8b43-01d8f9e630cc

    Hotfixes for Version 2.21

    This section provides details on all hotfixes available for version 2.21. Hotfixes are critical updates released between our major and minor versions to address specific issues or vulnerabilities. These updates ensure the system remains secure, stable, and optimized without requiring a full version upgrade.

    Version
    Date
    Internal ID
    Description

    2.21.57

    20/11/2025

    RUN-33802

    Fixed an issue that caused distributed inference workloads to become unsynchronized.

    2.21.57

    20/11/2025

    RUN-33144

    Fixed a security vulnerability related to CVE-2025-62156 with severity HIGH.

    2.21.56

    20/11/2025

    RUN-33613

    Fixed missing validations for CPU resources when the CPU quota feature flag was disabled, which caused project and department updates to skip required CPU checks.

    2.21.56

    20/11/2025

    RUN-33947

    Fixed an issue where SMTP configurations using the “none” option still sent empty username/password fields. Added the auth_none type to ensure no credentials are sent for passwordless SMTP servers.

    2.21.55

    13/11/2025

    RUN-33840

    Fixed a revision sync issue that caused excessive error logs.

    2.21.53

    02/11/2025

    RUN-33053

    Fixed an issue that caused conflicts with additional built-in Prometheus Operator deployments in OpenShift.

    2.21.53

    02/11/2025

    RUN-33006

    Fixed an issue in the CLI installer where the PATH was not configured for all shells. The installer now correctly configures PATH for both zsh and bash.

    2.21.53

    02/11/2025

    RUN-32945

    Fixed a security vulnerability related to CVE-2025-58754 with severity HIGH.

    2.21.53

    02/11/2025

    RUN-32548

    Fixed an issue where, in certain edge cases, removing an inference workload without deleting its revision caused the cluster to panic during revision sync.

    2.21.52

    19/10/2025

    RUN-31803

    Fixed an issue where the Quota management dashboard occasionally displayed incorrect GPU quota values.

    2.21.52

    19/10/2025

    RUN-33044

    Fixed an issue where the workload controller could delete all running workloads when init-ca generated a new certificate (every 30 days).

    2.21.51

    16/10/2025

    RUN-31383

    Fixed a security vulnerability related to CVE-2025-7783 with severity HIGH.

    2.21.51

    16/10/2025

    RUN-31422

    Fixed an issue where updating project resources created through the deprecated Projects API did not work correctly.

    2.21.51

    16/10/2025

    RUN-31571

    Fixed a security vulnerability related to CVE-2025-6965 with severity HIGH.

    2.21.51

    16/10/2025

    RUN-31792

    Fixed a security vulnerability related to CVE-2025-7425 with severity HIGH.

    2.21.51

    16/10/2025

    RUN-31855

    Fixed a security vulnerability related to CVE-2025-47907 with severity HIGH.

    2.21.51

    16/10/2025

    RUN-31993

    Fixed a security vulnerability related to CVE-2025-22868 with severity HIGH.

    2.21.51

    16/10/2025

    RUN-32146

    Fixed a security vulnerability related to CVE-2025-5914 with severity HIGH.

    2.21.51

    16/10/2025

    RUN-32572

    Fixed an issue where the RunaiAgentPullRateLow and RunaiAgentClusterInfoPushRateLow Prometheus alerts were firing incorrectly without cause.

    2.21.51

    16/10/2025

    RUN-32730

    Fixed an issue where incorrect average GPU utilization per project and workload type was displayed in the Projects view charts and tables.

    2.21.51

    16/10/2025

    RUN-32789

    Fixed an issue in CLI v2 where the --master-extended-resource flag had no effect in MPI training workloads.

    2.21.51

    16/10/2025

    RUN-32889

    Fixed an issue where idle GPU timeout rules were incorrectly applied to preemptible workspaces.

    2.21.51

    16/10/2025

    RUN-33039

    Fixed an issue where setting uid or gid to 0 during environment creation was not allowed.

    2.21.46

    12/08/2025

    RUN-28394

    Fixed an issue where using the GET Roles API returned a 403 unauthorized for all users.

    2.21.46

    12/08/2025

    RUN-31008

    Fixed a security vulnerability related to CVE-2025-53547 with severity HIGH.

    2.21.46

    12/08/2025

    RUN-31051

    Fixed a security vulnerability related to CVE-2025-49794 with severity HIGH.

    2.21.46

    12/08/2025

    RUN-31310

    Fixed a security vulnerability related to CVE-2025-22868 with severity HIGH.

    2.21.46

    12/08/2025

    RUN-31678

    Fixed an issue where the workload flexible submission form did not load the correct default node pools for a project.

    2.21.45

    01/08/2025

    RUN-31265

    Fixed a security vulnerability related to CVE-2025-30749 with severity HIGH.

    2.21.45

    01/08/2025

    RUN-31007

    Fixed a security vulnerability related to CVE-2025-22874 with severity HIGH.

    2.21.43

    30/07/2025

    RUN-29828

    Fixed an issue where the completion time date formatting in the Workload grid was inconsistent. Also resolved a bug where exported CSV files shifted date values to the next cell.

    2.21.43

    30/07/2025

    RUN-31039

    Fixed a base image security vulnerability in libxml2 related to CVE-2025-49796 with severity HIGH.

    2.21.43

    30/07/2025

    RUN-31263

    Fixed an issue where setting defaults for servingPort fields failed and incorrectly required the container port default as well.

    2.21.42

    24/07/2025

    RUN-30746

    Fixed an issue where workloads could not be scheduled if the combined length of the project name and node pool name was excessively long.

    2.21.42

    24/07/2025

    RUN-31039

    Fixed a security vulnerability in golang.org/x/oauth2 related to CVE-2025-22868 with severity HIGH.

    2.21.42

    24/07/2025

    RUN-31358

    Fixed an issue where enabling enableWorkloadOwnershipProtection for inference workloads caused newly submitted workloads to get stuck.

    2.21.41

    20/07/2025

    RUN-31131

    Fixed a security vulnerability in runai-container-runtime-installer and runai-container-toolkit related to CVE-2025-49794 with severity HIGH.

    2.21.39

    17/07/2025

    RUN-29092

    Fixed an issue where project quota could not be changed due to scheduling rules being set to 0 instead of null.

    2.21.38

    14/07/2025

    RUN-28377

    Fixed an issue where the CLI cache folder was created in a location where the user might not have sufficient permissions, leading to failures. The cache folder is now created in the same directory as the config file.

    2.21.38

    14/07/2025

    RUN-30713

    Fixed an issue where configuring an incorrect Auth URL during CLI installation could lead to connectivity issues. To prevent this, the option to set the Auth URL during installation has been removed. The install script now automatically sets the control plane URL based on the script's source.

    2.21.37

    09/07/2025

    RUN-29113

    Fixed a security vulnerability in DOMPurify related to CVE-2024-24762 with severity HIGH.

    2.21.37

    09/07/2025

    RUN-30634

    Fixed a security vulnerability in cluster-installer related to CVE-2025-30204 with severity HIGH.

    2.21.37

    09/07/2025

    RUN-29831

    Fixed an issue where the API documentation for asset filtering parameters was inaccurate.

    2.21.37

    09/07/2025

    RUN-30673

    Fixed an issue where users with create permissions on one scope and read-only permissions on another were incorrectly allowed to create projects in both scopes.

    2.21.37

    09/07/2025

    RUN-30657

    Fixed a security vulnerability in runai-container-runtime-installer and runai-container-toolkit related to CVE-2025-6020 with severity HIGH.

    2.21.33

    30/06/2025

    RUN-30197

    Fixed a security vulnerability in the stdlib package in go v1.24.2 related to CVE-2025-22874 with severity HIGH.

    2.21.32

    29/06/2025

    RUN-30674

    Fixed an issue where, on rare occasions, running the runai upgrade command deleted all files in the current directory.

    2.21.30

    29/06/2025

    RUN-25883

    Fixed a security vulnerability in io.netty:netty-handler related to CVE-2025-24970 with severity HIGH.

    2.21.30

    29/06/2025

    RUN-30666

    Fixed an issue where users were unable to create Hugging Face workloads due to a missing function in the system.

    2.21.29

    25/06/2025

    RUN-27390

    Fixed an issue where CPU-only workloads submitted via the CLI incorrectly displayed a GPU allocation value.

    2.21.29

    25/06/2025

    RUN-29768

    Fixed an issue where the Get token request returned a 500 error when the email mapper failed.

    2.21.29

    25/06/2025

    RUN-29049

    Fixed a security vulnerability in github.com.golang.org.x.crypto related to CVE-2025-22869 with severity HIGH.

    2.21.28

    25/06/2025

    RUN-29143

    Fixed an issue where nodes could become unschedulable when workloads were submitted to a different node pool.

    2.21.27

    17/06/2025

    RUN-29709

    • Fixed a security vulnerability in jq cli related to CVE-2024-53427 with severity HIGH.

    • Fixed a security vulnerability in jq cli related to CVE-2025-48060 with severity HIGH.

    2.21.27

    17/06/2025

    RUN-29756

    Fixed an issue where not all subjects were returned for each project or department.

    2.21.27

    17/06/2025

    RUN-29700

    Fixed a security vulnerability in github.com/moby and github.com/docker/docker related to CVE-2024-41110 with severity Critical.

    2.21.25

    11/06/2025

    RUN-29548

    Fixed a typo in the documentation where the API key was incorrectly written as enforceRun:aiScheduler instead of the correct enforceRunaiScheduler.

    2.21.25

    11/06/2025

    RUN-29320

    Fixed an issue in CLI v2 where the update server did not receive the terminal size during exec commands requiring TTY support. The terminal size is now set once upon session creation, ensuring proper behavior for interactive sessions.

    2.21.24

    08/06/2025

    RUN-29282

    Fixed a security vulnerability in golang.org.x.crypto related to CVE-2025-22869 with severity HIGH.

    2.21.23

    08/06/2025

    RUN-28891

    • Fixed a security vulnerability in golang.org/x/crypto related to CVE-2024-45337 with severity HIGH.

    • Fixed a security vulnerability in go-git/go-git related to CVE-2025-21613 with severity HIGH.

    2.21.23

    08/06/2025

    RUN-25281

    Fixed an issue where deploying a Hugging Face model with vLLM using the Hugging Face inference UI form on an OpenShift environment failed due to permission errors.

    2.21.22

    03/06/2025

    RUN-29341

    Fixed an issue which caused high CPU usage in the Cluster API.

    2.21.22

    03/06/2025

    RUN-29323

    Fixed an issue where Prometheus failed to send metrics for OpenShift.

    2.21.19

    27/05/2025

    RUN-29093

    Fixed an issue where rotating the runai-config webhook secret caused the app.kubernetes.io/managed-by=helm label to be removed.

    2.21.18

    27/05/2025

    RUN-28286

    Fixed an issue where CPU-only workloads incorrectly triggered idle timeout notifications intended for GPU workloads.

    2.21.18

    27/05/2025

    RUN-28555

    Fixed an issue in Admin → General Settings where the "Disabled" workloads count displayed inconsistently between the collapsed and expanded views.

    2.21.18

    27/05/2025

    RUN-26361

    Fixed an issue where Prometheus remote-write credentials were not properly updated on OpenShift clusters.

    2.21.18

    27/05/2025

    RUN-28780

    Fixed an issue where Hugging Face model validation incorrectly blocked some valid models supported by vLLM and TGI.

    2.21.18

    27/05/2025

    RUN-28851

    Fixed an issue in CLI v2 where the port-forward command terminated SSH connections after 15–30 seconds due to an idle timeout.

    2.21.18

    27/05/2025

    RUN-25281

    Fixed an issue where the Hugging Face UI submission flow failed on OpenShift (OCP) clusters.

    2.21.17

    21/05/2025

    RUN-28266

    Fixed an issue where the documentation examples for the runai workload delete CLI command were incorrect.

    2.21.17

    21/05/2025

    RUN-28609

    Fixed an issue where users with the ML Engineer role were unable to delete multiple inference jobs at once.

    2.21.17

    21/05/2025

    RUN-28665

    Fixed an issue where using servingPort authorization fields in the Create an inference API on unsupported clusters did not return an error.

    2.21.17

    21/05/2025

    RUN-28717

    Fixed an issue where the Update inference spec API documentation listed an incorrect response code.

    2.21.17

    21/05/2025

    RUN-28755

    Fixed an issue where the tooltip next to the External URL for an inference endpoint incorrectly stated that the URL was internal.

    2.21.17

    21/05/2025

    RUN-28762

    Fixed an issue with the inference workload ownership protection.

    2.21.17

    21/05/2025

    RUN-28859

    Fixed an issue where the knative.enable-scale-to-zero setting did not default to true as expected.

    2.21.17

    21/05/2025

    RUN-28923

    Fixed an issue where calling the Get node telemetry data API with the telemetryType IDLE_ALLOCATED_GPUS resulted in a 500 Internal Server Error.

    2.21.17

    21/05/2025

    RUN-28950

    Fixed a security vulnerability in github.com/moby and github.com/docker/docker related to CVE-2024-41110 with severity Critical.

    2.21.16

    18/05/2025

    RUN-27295

    Fixed an issue in CLI v2 where the --node-type flag for inference workloads was not properly propagated to the pod specification.

    2.21.16

    18/05/2025

    RUN-27375

    Fixed an issue where projects were not visible in the legacy job submission form, preventing users from selecting a target project.

    2.21.16

    18/05/2025

    RUN-27514

    Fixed an issue where disabling CPU quota in the General settings did not remove existing CPU quotas from projects and departments.

    2.21.16

    18/05/2025

    RUN-27521

    Fixed a security vulnerability in axios related to CVE-2025-27152 with severity HIGH.

    2.21.16

    18/05/2025

    RUN-27638

    Fixed an issue where a node pool’s placement strategy stopped functioning correctly after being edited.

    2.21.16

    18/05/2025

    RUN-27438

    Fixed an issue where MPI jobs were unavailable due to an OpenShift MPI Operator installation error.

    2.21.16

    18/05/2025

    RUN-27952

    Fixed a security vulnerability in emacs-filesystem related to CVE-2025-1244 with severity HIGH.

    2.21.16

    18/05/2025

    RUN-28244

    Fixed a security vulnerability in liblzma5 related to CVE-2025-31115 with severity HIGH.

    2.21.16

    18/05/2025

    RUN-28006

    Fixed an issue where tokens became invalid for the API server after one hour.

    2.21.16

    18/05/2025

    RUN-28097

    Fixed an issue where the allocated_gpu_count_per_gpu metric displayed incorrect data for fractional pods.

    2.21.16

    18/05/2025

    RUN-28213

    Fixed a security vulnerability in github.com.golang.org.x.crypto related to CVE-2025-22869 with severity HIGH.

    2.21.16

    18/05/2025

    RUN-28311

    Fixed an issue where user creation failed with a duplicate email error, even though the email address did not exist in the system.

    2.21.16

    18/05/2025

    RUN-28832

    Fixed inference CLI v2 documentation with examples that reflect correct usage.

    2.21.15

    30/04/2025

    RUN-27533

    Fixed an issue where workloads with idle GPUs were not suspended after exceeding the configured idle time.

    2.21.14

    29/04/2025

    RUN-26608

    Fixed an issue by adding a flag to the cli config set command and the CLI install script, allowing users to set a cache directory.

    2.21.14

    29/04/2025

    RUN-27264

    Fixed an issue where creating a project from the UI with a non-unlimited deserved CPU value caused the queue to be created with limit = deserved instead of unlimited.

    2.21.14

    29/04/2025

    RUN-27484

    Fixed an issue where duplicate app.kubernetes.io/name labels were applied to services in the control plane Helm chart.

    2.21.14

    29/04/2025

    RUN-27502

    Fixed the inference CLI commands documentation: --max-replicas and --min-replicas were incorrectly used instead of --max-scale and --min-scale.

    2.21.14

    29/04/2025

    RUN-27513

    Fixed an issue where cluster-scoped policies were not visible to users with appropriate permissions.

    2.21.14

    29/04/2025

    RUN-27515

    Fixed an issue where users were unable to use assets from an upper scope during flexible workload submissions.

    2.21.14

    29/04/2025

    RUN-27520

    Fixed an issue where adding access rules immediately after creating an application did not refresh the access rules table.

    2.21.14

    29/04/2025

    RUN-27628

    Fixed an issue where a node pool could remain stuck in Updating status in certain cases.

    2.21.14

    29/04/2025

    RUN-27826

    Fixed an issue where the runai inference update command could result in a failure to update the workload. Although the command itself succeeded (since the update is asynchronous), the update often failed, and the new spec was not applied.

    2.21.14

    29/04/2025

    RUN-27915

    Fixed an issue where the "Improved Command Line Interface" admin setting was incorrectly labeled as Beta instead of Stable.

    2.21.11

    29/04/2025

    RUN-27251

    • Fixed a security vulnerability in github.com.golang-jwt.jwt.v4 and github.com.golang-jwt.jwt.v5 with CVE-2025-30204 with severity HIGH.

    • Fixed a security vulnerability in golang.org.x.net with CVE-2025-22872 with severity MEDIUM.

    • Fixed a security vulnerability in knative.dev/serving with CVE-2023-48713 with severity MEDIUM.

    2.21.11

    29/04/2025

    RUN-27309

    Fixed an issue where workloads configured with a multi node pool setup could fail to schedule on a specific node pool in the future after an initial scheduling failure, even if sufficient resources later became available.

    2.21.10

    29/04/2025

    RUN-26992

    Fixed an issue where workloads submitted with an invalid node port range would get stuck in Creating status.

    2.21.10

    29/04/2025

    RUN-27497

    Fixed an issue where, after deleting an SSO user and immediately creating a local user, the delete confirmation dialog reappeared unexpectedly.

    2.21.9

    15/04/2025

    RUN-26989

    Fixed an issue that prevented reordering node pools in the workload submission form.

    2.21.9

    15/04/2025

    RUN-27247

    Fixed security vulnerabilities in Spring framework used by db-mechanic service - CVE-2021-27568, CVE-2021-44228, CVE-2022-22965, CVE-2023-20873, CVE-2024-22243, CVE-2024-22259 and CVE-2024-22262.

    2.21.9

    15/04/2025

    RUN-26359

    Fixed an issue in CLI v2 where using the --toleration option required incorrect mandatory fields.

    Metrics and Telemetry

    Metrics are numeric measurements recorded over time and emitted from the NVIDIA Run:ai cluster. Telemetry is a numeric measurement recorded in real time at the moment it is emitted from the cluster.

    Scopes

    NVIDIA Run:ai provides a control-plane API that supports and aggregates analytics at various levels.

    Level
    Description

    Supported Metrics

    Metric name in API
    Applicable API endpoint
    Metric name in UI per grid
    Applicable UI grid

    Advanced Metrics

    NVIDIA provides extended metrics as shown here. To enable these metrics, please contact NVIDIA Run:ai customer support.

    Metric name in API
    Applicable API endpoint
    Metric name in UI
    Applicable UI table

    Supported Telemetry

    Metric
    Applicable API endpoint
    Metric name in UI
    Applicable UI table

    CPU_MEMORY_LIMIT_BYTES

    CPU memory limit

    CPU_MEMORY_REQUEST_BYTES

    CPU memory request

    CPU_MEMORY_USAGE_BYTES

    CPU memory usage

    CPU_MEMORY_UTILIZATION

    CPU memory utilization

    CPU_REQUEST_CORES

    CPU request

    CPU_USAGE_CORES

    CPU usage

    CPU_UTILIZATION

    • CPU compute utilization

    • CPU utilization


    GPU_ALLOCATION

    GPU devices (allocated)

    GPU_MEMORY_REQUEST_BYTES

    GPU memory request

    GPU_MEMORY_USAGE_BYTES

    GPU memory usage

    GPU_MEMORY_USAGE_BYTES_PER_GPU

    GPU memory usage per GPU

    GPU_MEMORY_UTILIZATION

    GPU memory utilization

    GPU_MEMORY_UTILIZATION_PER_GPU

    GPU memory utilization per GPU

    GPU_QUOTA

    Quota

    GPU_UTILIZATION

    GPU compute utilization

    GPU_UTILIZATION_PER_GPU

    GPU utilization per GPU

    TOTAL_GPU

    • GPU devices total

    • Total GPUs

    TOTAL_GPU_NODES

    GPU_UTILIZATION_DISTRIBUTION

    GPU utilization distribution

    UNALLOCATED_GPU

    • GPU devices (unallocated)

    • Unallocated GPUs

    CPU_QUOTA_MILLICORES

    CPU_MEMORY_QUOTA_MB

    CPU_ALLOCATION_MILLICORES

    CPU_MEMORY_ALLOCATION_MB

    POD_COUNT

    RUNNING_POD_COUNT

    GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU

    Graphics engine activity

    GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU

    GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU

    GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU

    GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU

    GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU

    GPU_SM_ACTIVITY_PER_GPU

    GPU SM activity

    GPU_SM_OCCUPANCY_PER_GPU

    GPU SM occupancy

    GPU_TENSOR_ACTIVITY_PER_GPU

    GPU tensor activity

    READY_GPU_NODES

    Ready / Total GPU nodes

    READY_GPUS

    Ready / Total GPU devices

    TOTAL_GPU_NODES

    Ready / Total GPU nodes

    TOTAL_GPUS

    Ready / Total GPU devices

    IDLE_ALLOCATED_GPUS

    Idle allocated GPU devices

    FREE_GPUS

    Free GPU devices

    TOTAL_CPU_CORES

    CPU (Cores)

    USED_CPU_CORES

    ALLOCATED_CPU_CORES

    Allocated CPU cores

    TOTAL_GPU_MEMORY_BYTES

    GPU memory

    USED_GPU_MEMORY_BYTES

    Used GPU memory

    TOTAL_CPU_MEMORY_BYTES

    CPU memory

    USED_CPU_MEMORY_BYTES

    Used CPU memory

    ALLOCATED_CPU_MEMORY_BYTES

    Allocated CPU memory

    GPU_QUOTA

    GPU quota

    CPU_QUOTA

    MEMORY_QUOTA

    GPU_ALLOCATION_NON_PREEMPTIBLE

    CPU_ALLOCATION_NON_PREEMPTIBLE

    MEMORY_ALLOCATION_NON_PREEMPTIBLE

    Cluster

    A cluster is a set of node pools and nodes. Cluster metrics are aggregated at the cluster level. In the NVIDIA Run:ai user interface, these metrics are available in the Overview dashboard.

    Node

    Data is aggregated at the node level.

    Node pool

    Data is aggregated at the node pool level.

    Workload

    Data is aggregated at the workload level. In some workloads, e.g. with distributed workloads, these metrics aggregate data from all worker pods.

    Pod

    The basic unit of execution.

    Project

    The basic organizational unit. Projects are the tool for implementing resource allocation policies and for segregating different initiatives.

    Department

    Departments are a grouping of projects.

    ALLOCATED_GPU

    • Clusters

    • Node pools

    • GPU devices (allocated)

    • Allocated GPUs

    • Overview dashboard

    • Node pools

    AVG_WORKLOAD_WAIT_TIME

    • Clusters

    • Node pools

    CPU_LIMIT_CORES

    Workloads

    GPU_FP16_ENGINE_ACTIVITY_PER_GPU

    Pods

    GPU FP16 engine activity

    • Nodes

    • Workloads per pod

    GPU_FP32_ENGINE_ACTIVITY_PER_GPU

    Pods

    GPU FP32 engine activity

    • Nodes

    • Workloads per pod

    GPU_FP64_ENGINE_ACTIVITY_PER_GPU

    Pods

    WORKLOADS_COUNT

    Workloads

    ALLOCATED_GPUS

    Nodes

    Allocated GPUs

    Nodes

    GPU_allocation

    • Workloads

    • Projects

    • Departments


    Roles

    This section explains the available roles in the NVIDIA Run:ai platform.

    A role is a set of permissions that can be assigned to a subject in a scope. A permission is a set of actions (View, Edit, Create and Delete) over a NVIDIA Run:ai entity (e.g. projects, workloads, users).

    Roles Table

    The Roles table can be found under Access in the NVIDIA Run:ai platform.

    The Roles table displays a list of roles available to users in the NVIDIA Run:ai platform. Both predefined and custom roles will be displayed in the table.

    The Roles table consists of the following columns:

    Column
    Description

    Customizing the Table View

    • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

    • Search - Click SEARCH and type the value to search by

    • Sort - Click each column header to sort by

    • Column selection - Click COLUMNS and select the columns to display in the table

    Reviewing a Role

    1. To review a role, click the role name in the table

    2. In the role form, review the following:

      • Role name - The name of the role

      • Entity - A system-managed object that can be viewed, edited, created or deleted by a user based on their assigned role and scope

    Roles in NVIDIA Run:ai

    NVIDIA Run:ai supports the following roles and their permissions. Under each role is a detailed list of the actions that the role assignee is authorized to perform for each entity.

    Compute resource administrator
    Entity
    View
    Edit
    Create
    Delete
    Credentials administrator
    Entity
    View
    Edit
    Create
    Delete
    Data source administrator
    Entity
    View
    Edit
    Create
    Delete
    Data volume administrator
    Entity
    View
    Edit
    Create
    Delete
    Department administrator
    Entity
    View
    Edit
    Create
    Delete
    Department viewer
    Entity
    View
    Edit
    Create
    Delete
    Editor
    Entity
    View
    Edit
    Create
    Delete
    Environment administrator
    Entity
    View
    Edit
    Create
    Delete
    L1 researcher
    Entity
    View
    Edit
    Create
    Delete
    L2 researcher
    Entity
    View
    Edit
    Create
    Delete
    ML engineer
    Entity
    View
    Edit
    Create
    Delete
    Research manager
    Entity
    View
    Edit
    Create
    Delete
    System administrator
    Entity
    View
    Edit
    Create
    Delete
    Template administrator
    Entity
    View
    Edit
    Create
    Delete
    Viewer
    Entity
    View
    Edit
    Create
    Delete
    Permitted workloads

    When assigning a role with one, all, or any combination of the View, Edit, Create and Delete permissions for workloads, the subject has permissions to manage not only NVIDIA Run:ai workloads (Workspace, Training, Inference), but also the following third-party workloads:

    • k8s: StatefulSet

    • k8s: ReplicaSet

    Using API

    Go to the API reference to view the available actions.

    • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

    Actions - The actions that the role assignee is authorized to perform for each entity:

    • View - If checked, an assigned user with this role can view instances of this type of entity within their defined scope

    • Edit - If checked, an assigned user with this role can change the settings of an instance of this type of entity within their defined scope

    • Create - If checked, an assigned user with this role can create new instances of this type of entity within their defined scope

    • Delete - If checked, an assigned user with this role can delete instances of this type of entity within their defined scope


  • k8s: Pod

  • k8s: Deployment

  • batch: Job

  • batch: CronJob

  • machinelearning.seldon.io: SeldonDeployment

  • kubevirt.io: VirtualMachineInstance

  • kubeflow.org: TFJob

  • kubeflow.org: PyTorchJob

  • kubeflow.org: XGBoostJob

  • kubeflow.org: MPIJob


  • kubeflow.org: Notebook

  • kubeflow.org: ScheduledWorkflow

  • amlarc.azureml.com: AmlJob

  • serving.knative.dev: Service

  • workspace.devfile.io: DevWorkspace

  • ray.io: RayCluster

  • ray.io: RayJob

  • ray.io: RayService

  • tekton.dev: TaskRun

  • tekton.dev: PipelineRun

  • argoproj.io: Workflow

  • Role - The name of the role

  • Created by - The name of the role creator

  • Creation time - The timestamp when the role was created



    Policy YAML Reference

    A workload policy is an end-to-end solution for AI managers and administrators to control and simplify how workloads are submitted, setting best practices, enforcing limitations, and standardizing processes for AI projects within their organization.

    This article explains the policy YAML fields and the possible rules and defaults that can be set for each field.

    Policy YAML Fields - Reference Table

    The policy fields are structured in a similar format to the workload API fields. The following tables represent a structured guide designed to help you understand and configure policies in a YAML format. They provide the fields, descriptions, defaults and rules for each workload type.

    Click the link to view the value type of each field.

    Fields
    Description
    Value type
    Supported NVIDIA Run:ai workload type

    Ports Fields

    Fields
    Description
    Value type
    Supported NVIDIA Run:ai workload type

    Probes Fields

    Fields
    Description
    Value type
    Supported NVIDIA Run:ai workload type
    Readiness Field Details
    • Description: Specifies the Readiness Probe to use to determine if the container is ready to accept traffic

    • Value type:

    Security Fields

    Fields
    Description
    Value type
    Supported NVIDIA Run:ai workload type

    Compute Fields

    Fields
    Description
    Value type
    Supported NVIDIA Run:ai workload type

    Storage Fields

    Fields
    Description
    Value type
    Supported NVIDIA Run:ai workload type

    Storage Field Examples

    hostPath Field Details
    • Description: Maps a folder to a file system mount point within the container running the workload

    • Value type:
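
    For example, a policy can provide default hostPath instances. The following is a minimal sketch that reuses the name, path and mountPath attributes shown in the defaults example earlier in this article:

    defaults:
      storage:
        hostPath:
          instances:
            - name: vol-data-1
              path: /data-1
              mountPath: /mount/data-1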

    Git Field Details
    • Description: Details of the git repository and items mapped to it

    • Value type:
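
    For example, a policy can set default git attributes and require others at submission time. A minimal sketch based on the git attributes used elsewhere in this article (repository, branch, path):

    defaults:
      storage:
        git:
          attributes:
            repository: https://git-repo.my-domain.com
            branch: master
    rules:
      storage:
        git:
          attributes:
            path:
              required: true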

    PVC Field Details
    • Description: Specifies persistent volume claims to mount into a container running the created workload

    • Value type:

    NFS Field Details
    • Description: Specifies NFS volume to mount into the container running the workload

    • Value type:

    S3 Field Details
    • Description: Specifies S3 buckets to mount into the container running the workload

    • Value type:
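
    For example, a policy can restrict the S3 url and bucket attributes to a closed list of options. A minimal sketch (the endpoint and bucket values are illustrative):

    rules:
      storage:
        s3:
          attributes:
            url:
              options:
                - value: https://s3.my-domain.com   # illustrative endpoint
                  displayed: https://s3.my-domain.com
            bucket:
              options:
                - value: bucket1                    # illustrative bucket names
                - value: bucket2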

    Value Types

    Each field has a specific value type. The following value types are supported.

    Value type
    Description
    Supported rule type
    Defaults

    Itemized

    Workload fields of type itemized have multiple instances; however, in contrast to objects, each instance can be referenced by a key field. The key field is defined for each itemized field.

    Consider the following workload spec:
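
    A minimal sketch of such a spec, shown here in YAML (the second resource name is illustrative):

    spec:
      compute:
        extendedResources:
          - resource: hardware-vendor.example/foo   # key attribute
            quantity: 2
          - resource: hardware-vendor.example/bar   # illustrative second instance
            quantity: 1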

    In this example, extendedResources has two instances, each with two attributes: resource (the key attribute) and quantity.

    In a policy, the defaults and rules for itemized fields have two sub-sections:

    • Instances: default items to be added to the policy or rules which apply to an instance as a whole.

    • Attributes: defaults for attributes within an item or rules which apply to attributes within each item.

    Consider the following example:
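
    A minimal sketch of such a policy, consistent with the note that follows (default/cpu is provided by the policy and locked, default/memory is a default that a submission may exclude; the quantities are illustrative):

    defaults:
      compute:
        extendedResources:
          instances:
            - resource: default/cpu      # policy-provided instance, locked below
              quantity: 1
            - resource: default/memory   # policy-provided instance, may be excluded at submission
              quantity: 1
    rules:
      compute:
        extendedResources:
          instances:
            locked:
              - default/cpu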

    Assume the following workload submission is requested:
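
    A minimal sketch of such a submission request, matching the note that follows (it excludes default/memory and adds its own resource):

    spec:
      compute:
        extendedResources:
          - resource: default/memory
            exclude: true                            # removes the policy default from the workload
          - resource: hardware-vendor.example/foo    # submitter-provided instance (illustrative)
            quantity: 2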

    The effective policy for the above mentioned workload has the following extendedResources instances:

    Resource
    Source of the instance
    Quantity
    Source of the attribute quantity

    Note

    The default/memory resource is not populated to the workload because it has been excluded from the workload using “exclude: true”.

    A workload submission request cannot exclude the default/cpu resource, as this key is included in the locked rules under the instances section.

    Rule Types

    Rule types
    Description
    Supported value types

    Rule Type Examples

    canAdd
    locked
    canEdit
    required
    min
    max
    step
    options
    defaultFrom

    Policy Spec Sections

    For each field of a specific policy, you can specify both rules and defaults. A policy spec consists of the following sections:

    • Rules

    • Defaults

    • Imposed Assets

    Rules

    Rules set up constraints on workload policy fields. For example, consider the following policy:
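
    A minimal sketch of such a policy, matching the description that follows:

    rules:
      compute:
        gpuDevicesRequest:
          max: 8
      security:
        runAsUid:
          min: 500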

    Such a policy restricts the maximum value of gpuDevicesRequest to 8, and the minimum value of runAsUid, provided in the security section, to 500.

    Defaults

    The defaults section is used for providing defaults for various workload fields. For example, consider the following policy:
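
    A minimal sketch of such a policy, consistent with the submission example that follows (the policy supplies a default runAsUid of 500; the image pull policy default is illustrative):

    defaults:
      imagePullPolicy: IfNotPresent   # illustrative default
      security:
        runAsUid: 500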

    Assume a submission request with the following values:

    • Image: ubuntu

    • runAsUid: 501

    The effective workload that runs has the following set of values:

    Field
    Value
    Source

    Note

    It is possible to specify a rule for each field, which states if a submission request is allowed to change the policy default for that given field, for example:
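
    A minimal sketch of such a rule:

    rules:
      security:
        runAsUid:
          canEdit: false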

    If this policy is applied, the submission request above fails, as it attempts to change the value of security.runAsUid from 500 (the policy default) to 501 (the value provided in the submission request), which is forbidden because the canEdit rule is set to false for this field.

    Imposed Assets

    Default instances of a storage field can be provided using a data source asset containing the details of the storage instance. To add such instances to the policy, specify the asset IDs in the imposedAssets section of the policy.
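
    For example (a minimal sketch; replace the placeholder with the ID of an existing data source asset):

    imposedAssets:
      - <DATASOURCE-ASSET-ID>   # ID of the data source asset to impose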

    Assets with references to credential assets (for example: private S3, containing reference to an AccessKey asset) cannot be used as imposedAssets.

    environmentVariables

    Set of environmentVariables to populate the container running the workspace

    • Workspace

    • Standard training

    • Distributed training

    image

    Specifies the image to use when creating the container running the workload

    • Workspace

    • Standard training

    • Distributed training

    imagePullPolicy

    Specifies the pull policy of the image when starting a container running the created workload. Options are: Always, Never, or IfNotPresent

    • Workspace

    • Standard training

    • Distributed training

    workingDir

    Container’s working directory. If not specified, the container runtime default is used, which might be configured in the container image

    • Workspace

    • Standard training

    • Distributed training

    nodeType

    Nodes (machines) or a group of nodes on which the workload runs

    • Workspace

    • Standard training

    • Distributed training

    nodePools

    A prioritized list of node pools for the scheduler to run the workspace on. The scheduler always tries to use the first node pool before moving to the next one when the first is not available.

    • Workspace

    • Standard training

    • Distributed training

    annotations

    Set of annotations to populate into the container running the workspace

    • Workspace

    • Standard training

    • Distributed training

    labels

    Set of labels to populate into the container running the workspace

    • Workspace

    • Standard training

    • Distributed training

    terminateAfterPreemption

    Indicates whether the job should be terminated by the system after it has been preempted

    • Workspace

    • Standard training

    • Distributed training

    autoDeletionTimeAfterCompletionSeconds

    Specifies the duration after which a finished workload (Completed or Failed) is automatically deleted. If this field is set to zero, the workload becomes eligible to be deleted immediately after it finishes.

    • Workspace

    • Standard training

    • Distributed training

    backoffLimit

    Specifies the number of retries before marking a workload as failed

    • Workspace

    • Standard training

    • Distributed training

    restartPolicy

    Specifies the restart policy of the workload pods. The default is empty, in which case the restart policy is determined by the framework default

    Enum: "Always" "Never" "OnFailure"

    • Workspace

    • Standard training

    • Distributed training

    cleanPodPolicy

    Specifies which pods will be deleted when the workload reaches a terminal state (completed/failed). The policy can be one of the following values:

    • Running - Only pods still running when a job completes (for example, parameter servers) will be deleted immediately. Completed pods will not be deleted so that the logs will be preserved. (Default for MPI)

    • All - All (including completed) pods will be deleted immediately when the job finishes.

    • None - No pods are deleted when the workload finishes.

    Distributed training

    completions

    Used with Hyperparameter Optimization. Specifies the number of successful pods the job should reach to be completed. The Job is marked as successful once the specified number of pods has succeeded.

    Standard training

    parallelism

    Used with Hyperparameter Optimization. Specifies the maximum desired number of pods the workload should run at any given time.

    Standard training

    exposedUrls

    Specifies a set of exposed URLs (e.g. ingress) from the container running the created workload.

    • Workspace

    • Standard training

    • Distributed training

    relatedUrls

    Specifies a set of URLs related to the workload. For example, a URL to an external server providing statistics or logging about the workload.

    • Workspace

    • Standard training

    • Distributed training

    PodAffinitySchedulingRule

    Indicates whether to use the Pod Affinity rule as “hard” (required) or “soft” (preferred). This field can be specified only if PodAffinity is set to true.

    • Workspace

    • Standard training

    • Distributed training

    podAffinityTopology

    Specifies the Pod Affinity Topology to be used for scheduling the job. This field can be specified only if PodAffinity is set to true.

    • Workspace

    • Standard training

    • Distributed training

    sshAuthMountPath

    Specifies the directory where SSH keys are mounted

    Distributed training (MPI only)

    ports

    Specifies a set of ports exposed from the container running the created workload. More information in Ports fields below.

    • Workspace

    • Standard training

    • Distributed training

    probes

    Specifies the ReadinessProbe to use to determine if the container is ready to accept traffic. More information in the Probes fields below.

    -

    • Workspace

    • Standard training

    • Distributed training

    tolerations

    Toleration rules which apply to the pods running the workload. Toleration rules guide (but do not require) the system to which node each pod can be scheduled to or evicted from, based on matching between those rules and the set of taints defined for each Kubernetes node.

    • Workspace

    • Standard training

    • Distributed training

    priorityClass

    Priority class of the workload. The default value for workspace is 'build' and it can be changed to 'interactive-preemptible' to allow the workload to use over-quota resources. The default value for training is 'train' and it can be changed to 'build' to allow the training workload to have a higher priority for in-queue scheduling and also become non-preemptive (if it's in deserved quota).

    Enum: "build" "train" "interactive-preemptible"

    • Workspace

    • Standard training

    storage

    Contains all the fields related to storage configurations. More information in the Storage fields below.

    -

    • Workspace

    • Standard training

    • Distributed training

    security

    Contains all the fields related to security configurations. More information in the Security fields below.

    -

    • Workspace

    • Standard training

    • Distributed training

    compute

    Contains all the fields related to compute configurations. More information in the Compute fields below.

    -

    • Workspace

    • Standard training

    • Distributed training

    tty

    Whether this container should allocate a TTY for itself, also requires 'stdin' to be true

    • Workspace

    • Standard training

    • Distributed training

    stdin

    Whether this container should allocate a buffer for stdin in the container runtime. If this is not set, reads from stdin in the container will always result in EOF

    • Workspace

    • Standard training

    • Distributed training

    numWorkers

    The number of workers that will be allocated for running the workload.

    Distributed training

    distributedFramework

    The distributed training framework used in the workload.

    Enum: "MPI" "PyTorch" "TF" "XGBoost"

    Distributed training

    slotsPerWorker

    Specifies the number of slots per worker used in hostfile. Defaults to 1. (applicable only for MPI)

    Distributed training (MPI only)

    minReplicas

    The lower limit for the number of worker pods to which the training job can scale down. (applicable only for PyTorch)

    Distributed training (PyTorch only)

    maxReplicas

    The upper limit for the number of worker pods that can be set by the autoscaler. Cannot be smaller than MinReplicas. (applicable only for PyTorch)

    Distributed training (PyTorch only)
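
    A sketch of how these distributed fields might appear together in a PyTorch submission spec; the values are illustrative only:

    spec:
      image: ubuntu
      distributedFramework: PyTorch
      numWorkers: 4
      minReplicas: 2
      maxReplicas: 4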

    • Workspace

    • Standard training

    • Distributed training

    toolType

    The tool type that runs on this port.

    • Workspace

    • Standard training

    • Distributed training

    toolName

    A name describing the tool that runs on this port.

    • Workspace

    • Standard training

    • Distributed training

    Example policy snippet:
    Spec readiness fields
    Description
    Value type

    initialDelaySeconds

    Number of seconds after the container has started before liveness or readiness probes are initiated.

    periodSeconds

    How often (in seconds) to perform the probe

    timeoutSeconds

    Number of seconds after which the probe times out

    successThreshold

    Minimum consecutive successes for the probe to be considered successful after having failed

    failureThreshold

    When a probe fails, the number of times to try before giving up
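
    For example, a policy default that combines several of these readiness settings, following the probes.readiness structure used in the examples later on this page (the values are illustrative):

    defaults:
      probes:
        readiness:
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 1
          failureThreshold: 3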

    • Workspace

    • Standard training

    • Distributed training

    runAsNonRoot

    Indicates that the container must run as a non-root user.

    • Workspace

    • Standard training

    • Distributed training

    readOnlyRootFilesystem

    If true, mounts the container's root filesystem as read-only.

    • Workspace

    • Standard training

    • Distributed training

    runAsUid

    Specifies the Unix user id with which the container running the created workload should run.

    • Workspace

    • Standard training

    • Distributed training

    runAsGid

    Specifies the Unix Group ID with which the container should run.

    • Workspace

    • Standard training

    • Distributed training

    supplementalGroups

    Comma separated list of groups that the user running the container belongs to, in addition to the group indicated by runAsGid.

    • Workspace

    • Standard training

    • Distributed training

    allowPrivilegeEscalation

    Allows the container running the workload and all launched processes to gain additional privileges after the workload starts

    • Workspace

    • Standard training

    • Distributed training

    hostIpc

    Whether to enable hostIpc. Defaults to false.

    • Workspace

    • Standard training

    • Distributed training

    hostNetwork

    Whether to enable host network.

    • Workspace

    • Standard training

    • Distributed training
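
    An illustrative sketch of security defaults and rules built from the fields above; the chosen values are examples, not recommendations:

    defaults:
      security:
        runAsNonRoot: true
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
    rules:
      security:
        allowPrivilegeEscalation:
          canEdit: false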

    • Workspace

    • Standard training

    • Distributed training

    cpuMemoryLimit

    Limitations on the CPU memory to allocate for this workload (1G, 20M, etc.). The system guarantees that this workload is not able to consume more than this amount of memory. The workload receives an error when trying to allocate more memory than this limit.

    • Workspace

    • Standard training

    • Distributed training

    largeShmRequest

    A large /dev/shm device to mount into a container running the created workload (shm is a shared file system mounted on RAM).

    • Workspace

    • Standard training

    • Distributed training

    gpuRequestType

    Sets the unit type for GPU resource requests to either portion, memory or migProfile. The request type can be stated only if gpuDeviceRequest = 1.

    • Workspace

    • Standard training

    • Distributed training

    migProfile (Deprecated)

    Specifies the memory profile to be used for workload running on NVIDIA Multi-Instance GPU (MIG) technology.

    • Workspace

    • Standard training

    • Distributed training

    gpuPortionRequest

    Specifies the fraction of GPU to be allocated to the workload, between 0 and 1. For backward compatibility, it also supports the number of gpuDevices larger than 1, currently provided using the gpuDevices field.

    • Workspace

    • Standard training

    • Distributed training

    gpuDeviceRequest

    Specifies the number of GPUs to allocate for the created workload. Only if gpuDeviceRequest = 1, the gpuRequestType can be defined.

    • Workspace

    • Standard training

    • Distributed training

    gpuPortionLimit

    When a fraction of a GPU is requested, the GPU limit specifies the portion limit to allocate to the workload. The range of the value is from 0 to 1.

    • Workspace

    • Standard training

    • Distributed training

    gpuMemoryRequest

    Specifies GPU memory to allocate for the created workload. The workload receives this amount of memory. Note that the workload is not scheduled unless the system can guarantee this amount of GPU memory to the workload.

    • Workspace

    • Standard training

    • Distributed training

    gpuMemoryLimit

    Specifies a limit on the GPU memory to allocate for this workload. Should be no less than gpuMemoryRequest.

    • Workspace

    • Standard training

    • Distributed training
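
    A sketch of compute defaults and rules using the GPU fields above. It assumes gpuRequestType accepts the value portion, as implied by the description; all quantities are illustrative:

    defaults:
      compute:
        gpuDeviceRequest: 1
        gpuRequestType: portion
        gpuPortionRequest: 0.5
    rules:
      compute:
        gpuPortionRequest:
          max: 0.5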

    extendedResources

    Specifies values for extended resources. Extended resources are third-party devices (such as high-performance NICs, FPGAs, or InfiniBand adapters) that you want to allocate to your Job.

    • Workspace

    • Standard training

    • Distributed training

    • Workspace

    • Standard training

    • Distributed training

    pvc

    Specifies persistent volume claims to mount into a container running the created workload.

    • Workspace

    • Standard training

    • Distributed training

    nfs

    Specifies an NFS volume to mount into the container running the workload.

    • Workspace

    • Standard training

    • Distributed training

    s3

    Specifies S3 buckets to mount into the container running the workload.

    • Workspace

    • Standard training

    • Distributed training

    configMapVolumes

    Specifies ConfigMaps to mount as volumes into a container running the created workload.

    • Workspace

    • Standard training

    • Distributed training

    secretVolume

    Set of secret volumes to use in the workload. A secret volume maps a secret resource in the cluster to a file-system mount point within the container running the workload.

    • Workspace

    • Standard training

    • Distributed training
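
    A sketch of a storage default for secretVolume. It assumes secretVolume is an itemized field whose instances reference a secret name and a mount path; those sub-field names are assumptions, not taken from this reference:

    defaults:
      storage:
        secretVolume:
          instances:
            - secret: my-secret          # assumed sub-field and secret name
              mountPath: /etc/secrets    # assumed sub-field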

    Example policy snippet:
    hostPath fields
    Description
    Value type

    name

    Unique name to identify the instance. Primarily used for policy locked rules.

    path

    Local path within the controller to which the host volume is mapped.

    readOnly

    Force the volume to be mounted with read-only permissions. Defaults to false

    mountPath

    The path that the host volume is mounted to when in use

    mountPropagation

    Share this volume mount with other containers. If set to HostToContainer, this volume mount receives all subsequent mounts that are mounted to this volume or any of its subdirectories. In case of multiple hostPath entries, this field should have the same value for all of them. Enum:

    • "None"

    • "HostToContainer"

    Example policy snippet:
    Git fields
    Description
    Value type

    repository

    URL to a remote git repository. The content of this repository is mapped to the container running the workload

    revision

    Specific revision to synchronize the repository from

    path

    Local path within the workspace to which the git repository is mapped

    secretName

    Optional name of Kubernetes secret that holds your git username and password

    username

    If secretName is provided, this field should contain the key, within the provided Kubernetes secret, which holds the value of your git username. Otherwise, this field should specify your git username in plain text (example: myuser).

    Example policy snippet:
    Spec PVC fields
    Description
    Value type

    claimName (mandatory)

    A given name for the PVC. Allows referencing it across workspaces.

    ephemeral

    Use true to set PVC to ephemeral. If set to true, the PVC is deleted when the workspace is stopped.

    path

    Local path within the workspace to which the PVC is mapped

    readonly

    Permits read only from the PVC, prevents additions or modifications to its content

    size

    Requested size for the PVC. Mandatory when existing PVC is false

    storageClass

    Storage class name to associate with the PVC. This parameter may be omitted if there is a single storage class in the system, or you are using the default storage class. Further details at Kubernetes storage classes.

    readWriteOnce

    Requests a claim that can be mounted in read/write mode by exactly one host. If none of the access modes are specified, the default is readWriteOnce.

    readOnlyMany

    Requests a claim that can be mounted in read-only mode by many hosts

    readWriteMany

    Requests a claim that can be mounted in read/write mode by many hosts

    Example policy snippet:
    nfs fields
    Description
    Value type

    mountPath

    The path that the NFS volume is mounted to when in use

    path

    Path that is exported by the NFS server

    readOnly

    Whether to force the NFS export to be mounted with read-only permissions

    nfsServer

    The hostname or IP address of the NFS server

    Example policy snippet:
    s3 fields
    Description
    Value type

    Bucket

    The name of the bucket

    path

    Local path within the workspace to which the S3 bucket is mapped

    url

    The URL of the S3 service provider. The default is the URL of the Amazon AWS S3 service

    Integer

    An Integer is a whole number without a fractional component.

    • canEdit

    • required

    • min

    • max

    100

    Number

    Capable of having non-integer values

    • canEdit

    • required

    • min

    • defaultFrom

    10.3

    Quantity

    Holds a string composed of a number and a unit representing a quantity

    • canEdit

    • required

    • min

    • max

    5M

    Array

    Set of values that are treated as one, as opposed to Itemized in which each item can be referenced separately.

    • canEdit

    • required

    • node-a

    • node-b

    • node-c


    min

    The minimal value for the field

    max

    The maximal value for the field

    step

    The allowed gap between values for this field. In this example the allowed values are: 1, 3, 5, 7

    options

    Set of allowed values for this field

    defaultFrom

    Set a default value for a field that will be calculated based on the value of another field

    args

    When set, contains the arguments sent along with the command. These override the entry point of the image in the created workload

    string

    • Workspace

    • Standard training

    • Distributed training

    command

    A command to serve as the entry point of the container running the workspace

    string

    • Workspace

    • Standard training

    • Distributed training

    createHomeDir

    Instructs the system to create a temporary home directory for the user within the container. Data stored in this directory is not saved when the container exits. When the runAsUser flag is set to true, this flag defaults to true as well.

    boolean

    • Workspace

    • Standard training

    • Distributed training

    container

    The port that the container running the workload exposes.

    string

    • Workspace

    • Standard training

    • Distributed training

    serviceType

    Specifies the default service exposure method for ports. The default is used for ports that do not specify a service type. Options are: LoadBalancer, NodePort or ClusterIP. For more information see the External Access to Containers guide.

    string

    • Workspace

    • Standard training

    • Distributed training

    external

    The external port which allows a connection to the container port. If not specified, the port is auto-generated by the system.
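
    A sketch of a ports policy default, assuming ports is an itemized field whose instances use the container and external sub-fields described above; the port numbers are illustrative:

    defaults:
      ports:
        instances:
          - container: 8888
            external: 30888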

    readiness

    Specifies the Readiness Probe to use to determine if the container is ready to accept traffic.

    -

    • Workspace

    • Standard training

    • Distributed training

    uidGidSource

    Indicates the way to determine the user and group ids of the container. The options are:

    • fromTheImage - user and group IDs are determined by the docker image that the container runs. This is the default option.

    • custom - user and group IDs can be specified in the environment asset and/or the workspace creation request.

    • fromIdpToken - user and group IDs are automatically taken from the identity provider (IdP) token (available only in SSO-enabled installations).

    For more information, see User identity in containers.

    string

    • Workspace

    • Standard training

    • Distributed training
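
    An illustrative sketch that pins the source of user and group IDs, assuming uidGidSource sits under security alongside the other identity-related fields (runAsUid, runAsGid):

    defaults:
      security:
        uidGidSource: fromIdpToken
    rules:
      security:
        uidGidSource:
          canEdit: false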

    capabilities

    The capabilities field allows adding a set of Unix capabilities to the container running the workload. Capabilities are distinct Linux privileges, traditionally associated with the superuser, which can be independently enabled and disabled.

    array

    • Workspace

    • Standard training

    • Distributed training

    seccompProfileType

    Indicates which kind of seccomp profile is applied to the container. The options are:

    • RuntimeDefault - the container runtime default profile should be used

    • Unconfined - no profile should be applied
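
    A sketch of security defaults covering these two fields, assuming both sit under security; the SYS_PTRACE capability and the RuntimeDefault profile are example values only:

    defaults:
      security:
        capabilities:
          - SYS_PTRACE               # example Linux capability
        seccompProfileType: RuntimeDefault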

    cpuCoreRequest

    CPU units to allocate for the created workload (0.5, 1, etc.). The workload receives at least this amount of CPU. Note that the workload is not scheduled unless the system can guarantee this amount of CPUs to the workload.

    number

    • Workspace

    • Standard training

    • Distributed training

    cpuCoreLimit

    Limitations on the number of CPUs consumed by the workload (0.5, 1, etc.). The system guarantees that this workload is not able to consume more than this amount of CPUs.

    number

    • Workspace

    • Standard training

    • Distributed training

    cpuMemoryRequest

    The amount of CPU memory to allocate for this workload (1G, 20M, etc.). The workload receives at least this amount of memory. Note that the workload is not scheduled unless the system can guarantee this amount of memory to the workload.
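
    An example of compute defaults using the CPU fields above (the quantities are illustrative):

    defaults:
      compute:
        cpuCoreRequest: 0.5
        cpuCoreLimit: 1
        cpuMemoryRequest: 1G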

    dataVolume

    Set of data volumes to use in the workload. Each data volume is mapped to a file-system mount point within the container running the workload.

    itemized

    • Workspace

    • Standard training

    • Distributed training

    hostPath

    Maps a folder to a file-system mount point within the container running the workload.

    itemized

    • Workspace

    • Standard training

    • Distributed training

    git

    Details of the git repository and items mapped to it.

    Boolean

    A binary value that can be either True or False

    • canEdit

    • required

    true/false

    String

    A sequence of characters used to represent text. It can include letters, numbers, symbols, and spaces

    • canEdit

    • required

    • options

    abc

    Itemized

    An ordered collection of items (objects), all of the same type, in which each item can be referenced separately (as opposed to Array). For further information see the chapter below the table.

    • canAdd

    • locked

    The resulting workload has the following extended resources:

    • default/cpu - quantity 5, obtained from the policy defaults (the default of this instance in the policy defaults section)

    • added/cpu - quantity 3, obtained from the submission request (the quantity comes from the default of the quantity attribute in the attributes section)

    • added/memory - quantity 5M, obtained from the submission request

    canAdd

    Whether the submission request can add items to an itemized field other than those listed in the policy defaults for this field.

    itemized

    locked

    Set of items that the workload is unable to modify or exclude. In this example, a workload policy default is given to HOME and USER, that the submission request cannot modify or exclude from the workload.

    itemized

    canEdit

    Whether the submission request can modify the policy default for this field. In this example, it is assumed that the policy has default for imagePullPolicy. As canEdit is set to false, submission requests are not able to alter this default.

    • string

    • boolean

    • integer

    • number

    required

    When set to true, the workload must have a value for this field. The value can be obtained from the policy defaults. If no value is specified in the policy defaults, a value must be specified for this field in the submission request.

    The resulting workload runs with the following values:

    • Image: Ubuntu (obtained from the submission request)

    • ImagePullPolicy: Always (obtained from the policy defaults)

    • security.runAsNonRoot: true (obtained from the policy defaults)

    • security.runAsUid: 501 (obtained from the submission request)


    spec:
      image: ubuntu
      compute:
        extendedResources:
          - resource: added/cpu
            quantity: 10
          - resource: added/memory
            quantity: 20M
    defaults:
      compute:
        extendedResources:
          instances: 
            - resource: default/cpu
              quantity: 5
            - resource: default/memory
              quantity: 4M
          attributes:
            quantity: 3
    rules:
      compute:
        extendedResources:
          instances:
            locked: 
              - default/cpu
          attributes:
            quantity: 
              required: true
    spec:
      image: ubuntu
      compute:
        extendedResources:
          - resource: default/memory
            exclude: true
          - resource: added/cpu
          - resource: added/memory
            quantity: 5M
    storage:
      hostPath:
         instances:
           canAdd: false
    storage:
      hostPath:
        instances:
          locked:
            - HOME
            - USER
    imagePullPolicy:
        canEdit: false
    image:
        required: true
    compute:
      gpuDevicesRequest:
        min: 3
    compute:
      gpuMemoryRequest:
         max: 2G
    compute:
      cpuCoreRequest:
        min: 1
        max: 7
        step: 2
    image:
      options:
        - value: image-1
        - value: image-2
    cpuCoreRequest:
      defaultFrom:
        field: compute.cpuCoreLimit
        factor: 0.5
    rules:
      compute:
        gpuDevicesRequest: 
          max: 8
      security:
        runAsUid: 
          min: 500
    defaults:
      imagePullPolicy: Always
      security:
        runAsNonRoot: true
        runAsUid: 500
    defaults:
      imagePullPolicy: Always
      security:
        runAsNonRoot: true
        runAsUid: 500
    rules:
      security:
        runAsUid:
          canEdit: false
    defaults: null
    rules: null
    imposedAssets:
      - f12c965b-44e9-4ff6-8b43-01d8f9e630cc
    defaults:
       probes:
         readiness:
             initialDelaySeconds: 2
    defaults:
      storage:
        hostPath:
          instances:
            - path: h3-path-1
              mountPath: h3-mount-1
            - path: h3-path-2
              mountPath: h3-mount-2
          attributes:
            readOnly: true
    defaults:
      storage:
        git:
          attributes:
            repository: https://runai.public.github.com
          instances:
            - branch: "master"
              path: /container/my-repository
              passwordSecret: my-password-secret
    defaults:
      storage:
        pvc:
          instances:
            - claimName: pvc-staging-researcher1-home
              existingPvc: true
              path: /myhome
              readOnly: false
              claimInfo:
                accessModes:
                  readWriteMany: true
    defaults:
     storage:
       nfs:
         instances:
           - path: nfs-path
             readOnly: true
             server: nfs-server
             mountPath: nfs-mount
    rules:
      storage:
        nfs:
          instances:
            canAdd: false
    defaults:
      storage:
        s3:
          instances:
            - bucket: bucket-opt-1
              path: /s3/path
              accessKeySecret: s3-access-key
              secretKeyOfAccessKeyId: s3-secret-id
              secretKeyOfSecretKey: s3-secret-key
          attributes:
            url: https://amazonaws.s3.com
    • None - No pods are deleted when the job completes; they keep running and consume GPU, CPU, and memory over time. It is recommended to set this to None only for debugging and obtaining logs from running pods. (Default for PyTorch)