What's New

This section includes release information for the self-hosted version of NVIDIA Run:ai:

  • New Features and Enhancements – Highlights major updates introduced in each version, including new capabilities, UI improvements, and changes to system behavior.

  • Hotfixes – Lists patches applied to released versions, including critical fixes and behavior corrections.

Note

See our Product version life cycle for a list of supported versions and their respective support timelines.

Feature Life Cycle

NVIDIA Run:ai uses life cycle labels to indicate the maturity and stability of features across releases:

  • Experimental - This feature is in early development. It may not be stable and could be removed or changed significantly in future versions. Use with caution.

  • Beta - This feature is still being developed for official release in a future version and may have some limitations. Use with caution.

  • Legacy - This feature is scheduled to be removed in future versions. We recommend using alternatives if available. Use only if necessary.


Uninstall

Uninstall the Control Plane

To delete the control plane, run:

helm uninstall runai-backend -n runai-backend

Uninstall the Cluster
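The cluster itself is removed by uninstalling its Helm release. A minimal sketch, assuming the default release name runai-cluster and the runai namespace used by the standard installation instructions:

helm uninstall runai-cluster -n runai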

Installation

NVIDIA Run:ai Components

As part of the installation process, you will install:

  • A control plane managing the cluster(s)

  • One or more clusters

Both the control plane and clusters require Kubernetes. Typically, the control plane and first cluster are installed on the same Kubernetes cluster.

Installation Types

The self-hosted option is for organizations that cannot use a SaaS solution due to data leakage concerns. NVIDIA Run:ai self-hosting comes with two variants:

  • Connected – The organization can freely download from the internet (though upload is not allowed)

  • Air-gapped – The organization has no connection to the internet

Authentication and Authorization

NVIDIA Run:ai authentication and authorization enables a streamlined experience for the user with precise controls covering the data each user can see and the actions each user can perform in the NVIDIA Run:ai platform.

Authentication verifies user identity during login, and authorization assigns the user specific permissions according to the assigned access rules.

Authenticated access is required to use all aspects of the NVIDIA Run:ai interfaces, including the NVIDIA Run:ai platform, the NVIDIA Run:ai Command Line Interface (CLI) and APIs.

Authentication

There are multiple methods to authenticate and access NVIDIA Run:ai.

Single Sign-On (SSO)

NVIDIA Run:ai supports three methods to set up SSO:

  • SAML

  • OpenID Connect (OIDC)

  • OpenShift

When using SSO, it is highly recommended to manage at least one local user, as a breakglass account (an emergency account), in case access to SSO is not possible.

Username and Password

Username and password access can be used when SSO integration is not possible.

Secret Key (for Application Programmatic Access)

Secret key is the authentication method for applications. Applications use the NVIDIA Run:ai APIs to perform automated tasks, including scripts and pipelines, based on their assigned access rules.

Authorization

The NVIDIA Run:ai platform uses Role-Based Access Control (RBAC) to manage authorization. Once a user or an application is authenticated, they can perform actions according to their assigned access rules.

Role Based Access Control (RBAC) in NVIDIA Run:ai

While Kubernetes RBAC is limited to a single cluster, NVIDIA Run:ai expands the scope of Kubernetes RBAC, making it easy for administrators to manage access rules across multiple clusters.

RBAC at NVIDIA Run:ai is configured using access rules. An access rule is the assignment of a role to a subject in a scope: <Subject> is a <Role> in a <Scope>.

  • Subject

    • A user, a group, or an application assigned with the role

  • Role

    • A set of permissions that can be assigned to subjects. Roles at NVIDIA Run:ai are system defined and cannot be created, edited or deleted.

    • A permission is a set of actions (view, edit, create and delete) over a NVIDIA Run:ai entity (e.g. projects, workloads, users). For example, a role might allow a user to create and read Projects, but not update or delete them.

  • Scope

    • A scope is part of an organization in which a set of permissions (roles) is effective. Scopes include Projects, Departments, Clusters, Account (all clusters).

Below is an example of an access rule: [email protected] is a Department admin in Department: A

User Applications

This article explains the procedure to create your own user applications.

Applications are used for API integrations with NVIDIA Run:ai. An application contains a client ID and a client secret. With the client credentials, you can obtain a token as detailed in API authentication and use it within subsequent API calls.
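For illustration, a hedged sketch of the token exchange with curl; the endpoint path and request fields below are assumptions and should be verified against the API authentication reference:

# Exchange the application's client credentials for an API token
# (endpoint and field names are illustrative assumptions -- verify in the API reference).
curl -sS -X POST "https://<control-plane-domain>/api/v1/token" \
  -H "Content-Type: application/json" \
  -d '{"grantType": "app_token", "AppId": "<client-id>", "AppSecret": "<client-secret>"}'

# Use the returned access token as a Bearer token in subsequent API calls, for example:
curl -sS "https://<control-plane-domain>/api/v1/clusters" \
  -H "Authorization: Bearer <access-token>"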

Notes

  • All clusters in the tenant must be version 2.20 or later.

  • The token obtained through user applications assumes the roles and permissions of the user.

Creating an Application

To create an application:

  1. Click the user avatar at the top right corner, then select Settings

  2. Click +APPLICATION

  3. Enter the application’s name

  4. Click CREATE

  5. Copy the Client ID and Client secret and store securely

  6. Click DONE

You can create up to 20 user applications.

Note

The client secret is visible only at the time of creation. It cannot be recovered but can be regenerated.

Regenerating a Client Secret

To regenerate a client secret:

  1. Locate the application whose client secret you want to regenerate

  2. Click Regenerate client secret

  3. Click REGENERATE

  4. Copy the New client secret and store it securely

  5. Click DONE

Note

Regenerating a client secret revokes the previous one.

Deleting an Application

  1. Locate the application you want to delete

  2. Click on the trash icon

  3. On the dialog, click DELETE to confirm

Using API

Go to the API reference to view the available actions.

Service Mesh

NVIDIA Run:ai supports service mesh implementations. When a service mesh is deployed with sidecar injection, specific configurations must be applied to ensure compatibility with NVIDIA Run:ai. This document outlines the required changes for the NVIDIA Run:ai control plane and cluster.

Control Plane Configuration

Note

This section applies to self-hosted deployments only.

By default, NVIDIA Run:ai prevents Istio from injecting sidecar containers into system jobs in the control plane. For other service mesh solutions, users must manually add annotations during installation.

To disable sidecar injection in the NVIDIA Run:ai control plane, modify the Helm values file by adding the required pod labels to the following components. See Advanced control plane configurations for more details.

Example for Open Service Mesh:

authorizationMigrator:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
clusterMigrator:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
identityProviderReconciler:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
keepPVC:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
orgUnitsMigrator:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled

Cluster Configuration

Installation Phase

Sidecar containers injected by some service mesh solutions can prevent NVIDIA Run:ai installation hooks from completing. To avoid this, modify the Helm installation command to include the required labels or annotations:

helm upgrade -i ... \
--set global.additionalJobLabels.A=B --set global.additionalJobAnnotations.A=B

Example for Istio Service Mesh:

helm upgrade -i ... \
--set-json global.additionalJobLabels='{"sidecar.istio.io/inject":"false"}'

Workloads

To prevent sidecar injection in workloads created at runtime (such as training workloads), update the runaiconfig resource. See Advanced cluster configurations for more details:

spec:
  workload-controller:
    additionalPodLabels:
      sidecar.istio.io/inject: "false"

Monitoring and Maintenance

Deploying NVIDIA Run:ai in mission-critical environments requires proper monitoring and maintenance of resources to ensure workloads run and are deployed as expected.

Details on how to monitor different parts of the physical resources in your Kubernetes system, including clusters and nodes, can be found in the monitoring and maintenance section. Adjacent configuration and troubleshooting sections also cover high availability, restoring and securing clusters, collecting logs, and reviewing audit logs to meet compliance requirements.

In addition to monitoring NVIDIA Run:ai resources, it is also highly recommended to monitor the Kubernetes environment that NVIDIA Run:ai runs on, since Kubernetes manages the containerized applications. In particular, focus on three main layers:

NVIDIA Run:ai Control Plane and Cluster Services

This is the highest layer and includes the NVIDIA Run:ai pods, which run in containers managed by Kubernetes.

Kubernetes Cluster

This layer includes the main Kubernetes system that runs and manages NVIDIA Run:ai components. Important elements to monitor include:

  • The health of the cluster and nodes (machines in the cluster).

  • The status of key Kubernetes services, such as the API server. For detailed information on managing clusters, see the official Kubernetes documentation.

Host Infrastructure

This is the base layer, representing the actual machines (virtual or physical) that make up the cluster. IT teams need to handle:

  • Managing CPU, memory, and storage

  • Keeping the operating system updated

  • Setting up the network and balancing the load

NVIDIA Run:ai does not require any special configurations at this level.

The articles below explain how to monitor these layers, maintain system security and compliance, and ensure the reliable operation of NVIDIA Run:ai in critical environments.

Shared Storage

Shared storage is a critical component in AI and machine learning workflows, particularly in scenarios involving distributed training and shared datasets. In AI and ML environments, data must be readily accessible across multiple nodes, especially when training large models or working with vast datasets. Shared storage enables seamless access to data, ensuring that all nodes in a distributed training setup can read and write to the same datasets simultaneously. This setup not only enhances efficiency but is also crucial for maintaining consistency and speed in high-performance computing environments.

While NVIDIA Run:ai Platform supports a variety of remote data sources, such as Git and S3, it is often more efficient to keep data close to the compute resources. This proximity is typically achieved through the use of shared storage, accessible to multiple nodes in your Kubernetes cluster.

Shared Storage in Kubernetes

When implementing shared storage in Kubernetes, there are two primary approaches:

  • Utilizing the Kubernetes Storage Classes of your storage provider (Recommended)

  • Using a direct NFS (Network File System) mount

NVIDIA Run:ai supports both direct NFS mounts and Kubernetes Storage Classes.

Kubernetes Storage Classes

Storage classes in Kubernetes define how storage is provisioned and managed. This allows you to select storage types optimized for AI workloads. For example, you can choose storage with high IOPS (Input/Output Operations Per Second) for rapid data access during intensive training sessions, or tiered storage options to balance cost and performance based on your organization’s requirements. This approach supports dynamic provisioning, enabling storage to be allocated on-demand as required by your applications.

NVIDIA Run:ai data sources such as Persistent Volume Claims (PVC) and Data Volumes leverage storage classes to manage and allocate storage efficiently. This ensures that the most suitable storage option is always accessible, contributing to the efficiency and performance of AI workloads.

Note

NVIDIA Run:ai lists all available storage classes in the Kubernetes cluster, making it easy for users to select the appropriate storage. Additionally, policies can be set to restrict or enforce the use of specific storage classes, to help maintain compliance with organizational standards and optimize resource utilization.
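For illustration, a minimal sketch of a PersistentVolumeClaim that requests shared storage from a storage class; the class name, namespace, and size are placeholders, not NVIDIA Run:ai defaults:

# Request shared storage from a storage class (placeholder names and size).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-datasets
  namespace: runai-team-a        # placeholder project namespace
spec:
  accessModes:
    - ReadWriteMany              # shared read/write access across nodes
  storageClassName: fast-shared  # placeholder storage class name
  resources:
    requests:
      storage: 500Gi
EOF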

Direct NFS Mount

Direct NFS allows you to mount a shared file system directly across multiple nodes in your Kubernetes cluster. This method provides a straightforward way to share data among nodes and is often used for simple setups or when a dedicated NFS server is available.

However, using NFS can present challenges related to security and control. Direct NFS setups might lack the fine-grained control and security features available with storage classes.
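For comparison, a minimal sketch of exposing an existing NFS export as a PersistentVolume; the server address and export path are placeholders:

# Expose an existing NFS export as a cluster-wide PersistentVolume (placeholder values).
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs-shared-datasets
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: nfs.example.internal   # placeholder NFS server
    path: /exports/datasets        # placeholder export path
EOF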

Cluster Restore

This section explains how to restore a NVIDIA Run:ai cluster on a different Kubernetes environment.

In the event of a critical Kubernetes failure, or if you want to migrate a NVIDIA Run:ai cluster to a new Kubernetes environment, simply reinstall the NVIDIA Run:ai cluster. Once you have reinstalled and reconnected the cluster, projects, workloads and other cluster data are synced automatically.

The restoration or back-up of NVIDIA Run:ai cluster advanced features and customized deployment configurations, which are stored locally on the Kubernetes cluster, is optional; they can be restored and backed up separately.

Backup

As back-up of data is not required, the backup procedure is optional for advanced deployments, as explained above.

Backup Cluster Configurations

To backup NVIDIA Run:ai cluster configurations:

  1. Run the following command in your terminal:

    kubectl get runaiconfig runai -n runai -o yaml -o=jsonpath='{.spec}' > runaiconfig_backup.yaml

  2. Once the runaiconfig_backup.yaml back-up file is created, save the file externally so that it can be retrieved later.

Restore

Follow the steps below to restore the NVIDIA Run:ai cluster on a new Kubernetes environment.

Prerequisites

Before restoring the NVIDIA Run:ai cluster, it is essential to validate that it is both disconnected and uninstalled.

  1. If the Kubernetes cluster is still available, uninstall the NVIDIA Run:ai cluster - make sure not to remove the cluster from the Control Plane

  2. Navigate to the Cluster page in the NVIDIA Run:ai platform

  3. Search for the cluster, and make sure its status is Disconnected

Re-installing NVIDIA Run:ai Cluster

  1. Follow the NVIDIA Run:ai cluster installation instructions and ensure all prerequisites are met

  2. If you have a back-up of the cluster configurations, reload it once the installation is complete:

    kubectl apply -f runaiconfig_backup.yaml -n runai

  3. Navigate to the Cluster page in the NVIDIA Run:ai platform

  4. Search for the cluster, and make sure its status is Connected


Interworking with Karpenter

Karpenter is an open-source, Kubernetes cluster autoscaler built for cloud deployments. Karpenter optimizes the cloud cost of a customer’s cluster by moving workloads between different node types, consolidating workloads into fewer nodes, using lower-cost nodes where possible, scaling up new nodes when needed, and shutting down unused nodes.

Karpenter’s main goal is cost optimization. Unlike Karpenter, NVIDIA Run:ai’s Scheduler optimizes for fairness and resource utilization. Therefore, there are a few potential friction points when using both on the same cluster.

Friction Points Using Karpenter with NVIDIA Run:ai

  1. Karpenter looks for “unschedulable” pending workloads and may try to scale up new nodes to make those workloads schedulable. However, in some scenarios, these workloads may exceed their quota parameters, and the NVIDIA Run:ai Scheduler will put them into a pending state.

  2. Karpenter is not aware of the NVIDIA Run:ai fractions mechanism and may try to interfere incorrectly.

  3. Karpenter preempts any type of workload (i.e., high-priority, non-preemptible workloads will potentially be interrupted and moved to save cost).

  4. Karpenter has no pod-group (i.e., workload) notion or gang scheduling awareness, meaning that Karpenter is unaware that a set of “arbitrary” pods is a single workload. This may cause Karpenter to schedule those pods into different node pools (in the case of multi-node-pool workloads) or scale up or down a mix of wrong nodes.

Mitigating the Friction Points

NVIDIA Run:ai Scheduler mitigates the friction points using the following techniques (each numbered bullet below corresponds to the related friction point listed above):

  1. Karpenter uses a “nominated node” to recommend a node for the Scheduler. The NVIDIA Run:ai Scheduler treats this as a “preferred” recommendation, meaning it will try to use this node, but it’s not required and it may choose another node.

  2. Fractions - Karpenter won’t consolidate nodes with one or more pods that cannot be moved. The NVIDIA Run:ai reservation pod is marked as ‘do not evict’ to allow the NVIDIA Run:ai Scheduler to control the scheduling of fractions.

  3. Non-preemptible workloads - NVIDIA Run:ai marks non-preemptible workloads as ‘do not evict’ and Karpenter respects this annotation.

  4. NVIDIA Run:ai node pools (single-node-pool workloads) - Karpenter respects the ‘node affinity’ that NVIDIA Run:ai sets on a pod, so Karpenter uses the node affinity for its recommended node. For the gang-scheduling/pod-group (workload) notion, NVIDIA Run:ai Scheduler considers Karpenter directives as preferred recommendations rather than mandatory instructions and overrides Karpenter instructions where appropriate.

Deployment Considerations

  • Using multi-node-pool workloads

    • Workloads may include a list of optional node pools. Karpenter is not aware that only a single node pool should be selected out of that list for the workload. It may therefore recommend putting pods of the same workload into different node pools and may scale up nodes from different node pools to serve a “multi-node-pool” workload instead of nodes on the selected single node pool.

    • If this becomes an issue (i.e., if Karpenter scales up the wrong node types), users can set an inter-pod affinity using the node pool label or another common label as a ‘topology’ identifier. This will force Karpenter to choose nodes from a single-node pool per workload, selecting from any of the node pools listed as allowed by the workload.

    • An alternative approach is to use a single-node pool for each workload instead of multi-node pools.

  • Consolidation

    • To make Karpenter more effective when using its consolidation function, users should consider separating preemptible and non-preemptible workloads, either by using node pools, node affinities, taint/tolerations, or inter-pod anti-affinity.

    • If users don’t separate preemptible and non-preemptible workloads (i.e., make them run on different nodes), Karpenter’s ability to consolidate (bin-pack) and shut down nodes will be reduced, but it is still effective.

  • Conflicts between bin-packing and spread policies

    • If NVIDIA Run:ai is used with a scheduling spread policy, it will clash with Karpenter’s default bin-packing/consolidation policy, and the outcome may be a deployment that is not optimized for either of these policies.

    • Usually spread is used for Inference, which is non-preemptible and therefore not controlled by Karpenter (NVIDIA Run:ai Scheduler will mark those workloads as ‘do not evict’ for Karpenter), so this should not present a real deployment issue for customers.

Secure Your Cluster

This section details the security considerations for deploying NVIDIA Run:ai. It is intended to help administrators and security officers understand the specific permissions required by NVIDIA Run:ai.

Access to the Kubernetes Cluster

NVIDIA Run:ai integrates with Kubernetes clusters and requires specific permissions to operate successfully. These permissions are controlled with configuration flags that dictate how NVIDIA Run:ai interacts with cluster resources. Prior to installation, security teams can review the permissions and ensure they align with their organization’s policies.

Permissions and their Related Use Case

NVIDIA Run:ai provides various security-related permissions that can be customized to fit specific organizational needs. Below are brief descriptions of the key use cases for these customizations:

  • Automatic Namespace creation – Controls whether NVIDIA Run:ai automatically creates Kubernetes namespaces when new projects are created. Useful in environments where namespace creation must be strictly managed.

  • Automatic user assignment – Decides if users are automatically assigned to projects within NVIDIA Run:ai. Helps manage user access more tightly in certain compliance-driven environments.

  • Secret propagation – Determines whether NVIDIA Run:ai should propagate secrets across the cluster. Relevant for organizations with specific security protocols for managing sensitive data.

  • Disabling Kubernetes limit range – Chooses whether to disable the Kubernetes Limit Range feature. May be adjusted in environments with specific resource management needs.

Note

These security customizations allow organizations to tailor NVIDIA Run:ai to their specific needs. Changes should be made cautiously and only when necessary to meet particular security, compliance or operational requirements.

Secure Installation

Many organizations enforce IT compliance rules for Kubernetes, with strict access control for installing and running workloads. OpenShift uses Security Context Constraints (SCC) for this purpose. NVIDIA Run:ai fully supports SCC, ensuring integration with OpenShift's security requirements.

Security Vulnerabilities

The platform is actively monitored for security vulnerabilities, with regular scans conducted to identify and address potential issues. Necessary fixes are applied to ensure that the software remains secure and resilient against emerging threats, providing a safe and reliable experience.

NVIDIA Run:ai at Scale

Operating NVIDIA Run:ai at scale ensures that the system can efficiently handle fluctuating workloads while maintaining optimal performance. As clusters grow—whether due to an increasing number of nodes or a surge in workload demand—NVIDIA Run:ai services must be appropriately tuned to support large-scale environments.

This guide outlines the best practices for optimizing NVIDIA Run:ai for high-performance deployments, including NVIDIA Run:ai system services configurations, vertical scaling (adjusting CPU and memory resources) and where applicable, horizontal scaling (replicas).

NVIDIA Run:ai Services

Vertical Scaling

Each of the NVIDIA Run:ai containers has default resource requirements that reflect an average customer load. With significantly larger cluster loads, certain NVIDIA Run:ai services will require more CPU and memory resources. NVIDIA Run:ai supports configuring these resources for each NVIDIA Run:ai service group separately. For instructions and more information, see NVIDIA Run:ai services resource management.

Scheduling Services

The scheduling services group should be scaled together with the number of nodes and the number of workloads handled by the Scheduler (running / pending). These resource recommendations are based on internal benchmarks performed on stressed environments:

  • Small (30 nodes / 480 workloads) – CPU request: 1, Memory request: 1GB

  • Medium (100 nodes / 1600 workloads) – CPU request: 2, Memory request: 2GB

  • Large (500 nodes / 8500 workloads) – CPU request: 2, Memory request: 7GB

Sync and Workload Services

The sync and workload service groups are less sensitive to scale. The recommendation for large or intensive environments is the following:

  • CPU (request-limit): 1-2

  • Memory (request-limit): 1GB-2GB

Horizontal Scaling

By default, NVIDIA Run:ai cluster services are deployed with a single replica. For large scale and intensive environments it is recommended to scale the NVIDIA Run:ai services horizontally by increasing the number of replicas. For more information, see NVIDIA Run:ai services replicas.

Metrics Collection

NVIDIA Run:ai relies on Prometheus to scrape cluster metrics and forward them to the NVIDIA Run:ai control plane. The volume of metrics generated is directly proportional to the number of nodes, workloads, and projects in the system. When operating at scale, reaching hundreds or thousands of nodes and projects, the system generates a significant volume of metrics, which can place a strain on the cluster and the network bandwidth.

To mitigate this impact, it is recommended to tune the Prometheus remote-write configurations. See remote write tuning to read more about the tuning parameters available via the remote write configuration and refer to this article for optimizing Prometheus remote write performance.

You can apply the remote-write configurations required as described in advanced cluster configurations.

The following example demonstrates the recommended approach in NVIDIA Run:ai for tuning Prometheus remote-write configurations:

remoteWrite:
  queueConfig:
    capacity: 5000
    maxSamplesPerSend: 1000
    maxShards: 100
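One possible way to apply this is sketched below, under the assumption that the remote-write block sits under spec.prometheus.spec in the runaiconfig object; verify the exact path in the advanced cluster configurations reference before applying:

# Patch runaiconfig with the remote-write tuning shown above
# (the spec.prometheus.spec path is an assumption -- confirm before use).
cat <<'EOF' > remote-write-patch.yaml
spec:
  prometheus:
    spec:
      remoteWrite:
        queueConfig:
          capacity: 5000
          maxSamplesPerSend: 1000
          maxShards: 100
EOF
kubectl patch runaiconfig runai -n runai --type merge --patch-file remote-write-patch.yaml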

Overview

NVIDIA Run:ai is a GPU orchestration and optimization platform that helps organizations maximize compute utilization for AI workloads. By optimizing the use of expensive compute resources, NVIDIA Run:ai accelerates AI development cycles, and drives faster time-to-market for AI-powered innovations.

Built on Kubernetes, NVIDIA Run:ai supports dynamic GPU allocation, workload submission, workload scheduling, and resource sharing, ensuring that AI teams get the compute power they need while IT teams maintain control over infrastructure efficiency.

How NVIDIA Run:ai Helps Your Organization

For Infrastructure Administrators

NVIDIA Run:ai centralizes cluster management and optimizes infrastructure control by offering:

  • Centralized cluster management – Manage all clusters from a single platform, ensuring consistency and control across environments.

  • Usage monitoring and capacity planning – Gain real-time and historical insights into GPU consumption across clusters to optimize resource allocation and plan future capacity needs efficiently.

  • Policy enforcement – Define and enforce security and usage policies to align GPU consumption with business and compliance requirements.

  • Enterprise-grade authentication – Integrate with your organization's identity provider for streamlined authentication (Single Sign On) and role-based access control (RBAC).

  • Kubernetes-native application – Install as a Kubernetes-native application, seamlessly extending Kubernetes for native cloud experience and operational standards (install, upgrade, configure).

For Platform Administrators

NVIDIA Run:ai simplifies AI infrastructure management by providing a structured approach to managing AI initiatives, resources, and user access. It enables platform administrators to maintain control, efficiency, and scalability across their infrastructure:

  • AI Initiative structuring and management – Map and set up AI initiatives according to your organization's structure, ensuring clear resource allocation.

  • Centralized GPU resource management – Enable seamless sharing and pooling of GPUs across multiple users, reducing idle time and optimizing utilization.

  • User and access control – Assign users (AI practitioners, ML engineers) to specific projects and departments to manage access and enforce security policies, utilizing role-based access control (RBAC) to ensure permissions align with user roles.

  • Workload scheduling – Use scheduling to prioritize and allocate GPUs based on workload needs.

  • Monitoring and insights – Track real-time and historical data on GPU usage to help track resource consumption and optimize costs.

For AI Practitioners

NVIDIA Run:ai empowers data scientists and ML engineers by providing:

  • Optimized workload scheduling – Ensure high-priority jobs get GPU resources. Workloads dynamically receive resources based on demand.

  • Fractional GPU usage – Request and utilize only a fraction of a GPU's memory, ensuring efficient resource allocation and leaving room for other workloads.

  • AI initiatives lifecycle support – Run your entire AI initiatives lifecycle – Jupyter Notebooks, training jobs, and inference workloads efficiently.

  • Interactive session – Ensure an uninterrupted experience when working on Jupyter Notebooks without taking away GPUs.

  • Scalability for training and inference – Support for distributed training across multiple GPUs and auto-scaling of inference workloads.

  • Integrations – Integrate with popular ML frameworks - PyTorch, TensorFlow, XGBoost, Knative, Spark, Kubeflow Pipelines, Apache Airflow, Argo Workflows, Ray and more.

  • Flexible workload submission – Submit workloads using the NVIDIA Run:ai UI, API, CLI or run third-party workloads.

NVIDIA Run:ai System Components

NVIDIA Run:ai is made up of two components, both installed over a Kubernetes cluster:

  • NVIDIA Run:ai cluster – Provides scheduling and workload management, extending Kubernetes native capabilities.

  • NVIDIA Run:ai control plane – Provides resource management, handles workload submission and provides cluster monitoring and analytics.

NVIDIA Run:ai Cluster

The NVIDIA Run:ai cluster is responsible for scheduling AI workloads and efficiently allocating GPU resources across users and projects:

  • NVIDIA Run:ai Scheduler – Applies AI-aware rules to efficiently schedule workloads submitted by AI practitioners.

  • Workload management – Handles workload management which includes the researcher code running as a Kubernetes container and the system resources required to run the code, such as storage, credentials, network endpoints to access the container and so on.

  • Kubernetes operator-based deployment – Installed as a Kubernetes Operator to automate deployment, upgrades and configuration of NVIDIA Run:ai cluster services.

  • Storage – Supports Kubernetes-native storage using Storage Classes, allowing organizations to bring their own storage solutions. Additionally, it also integrates with external storage solutions such as Git, S3, and NFS to support various data requirements.

  • Secured communication – Uses an outbound-only, secured (SSL) connection to synchronize with the NVIDIA Run:ai control plane.

  • Private – NVIDIA Run:ai only synchronizes metadata and operational metrics (e.g., workloads, nodes) with the control plane. No proprietary data, model artifacts, or user data sets are ever transmitted, ensuring full data privacy and security.

NVIDIA Run:ai Control Plane

The NVIDIA Run:ai control plane provides a centralized management interface for organizations to oversee their GPU infrastructure across multiple locations/subnets, accessible via Web UI, API and CLI. The control plane can be deployed on the cloud or on-premise for organizations that require local control over their infrastructure (self-hosted).

  • Multi-cluster management – Manages multiple NVIDIA Run:ai clusters for a single tenant across different locations and subnets from a single unified interface.

  • Resource and access management – Allows administrators to define Projects, Departments and user roles, enforcing policies for fair resource distribution.

  • Workload submission and monitoring – Allows teams to submit workloads, track usage, and monitor GPU performance in real time.

Installation Types

There are two main installation options:

  • SaaS – NVIDIA Run:ai is installed on the customer's data science GPU clusters. The cluster connects to the NVIDIA Run:ai control plane on the cloud (https://<tenant-name>.run.ai). With this installation, the cluster requires an outbound connection to the NVIDIA Run:ai cloud.

  • Self-hosted – The NVIDIA Run:ai control plane is also installed in the customer's data center.

Customized Installation

This section explains the available configurations for customizing the NVIDIA Run:ai control plane and cluster installation.

Control Plane Helm Chart Values

The NVIDIA Run:ai control plane installation can be customized to support your environment via Helm values files or Helm install flags. See Advanced control plane configurations.
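For illustration, values can be supplied either as a file or as individual flags when installing or upgrading the control plane; the chart reference below assumes the runai-backend release used elsewhere in this guide:

# Pass a custom values file (or individual --set flags) to the control plane chart.
helm upgrade -i runai-backend runai-backend/control-plane -n runai-backend \
  -f custom-values.yaml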

Cluster Helm Chart Values

The NVIDIA Run:ai cluster installation can be customized to support your environment via Helm values files or Helm install flags.

These configurations are saved in the runaiconfig Kubernetes object and can be edited post-installation as needed. For more information, see Advanced cluster configurations.

The following table lists the available Helm chart values that can be configured to customize the NVIDIA Run:ai cluster installation.

  • global.image.registry (string) – Global Docker image registry. Default: ""

  • global.additionalImagePullSecrets (list) – List of image pull secret references. Default: []

  • spec.researcherService.ingress.tlsSecret (string) – Existing secret key where the cluster TLS certificates are stored (non-OpenShift). Default: runai-cluster-domain-tls-secret

  • spec.researcherService.route.tlsSecret (string) – Existing secret key where the cluster TLS certificates are stored (OpenShift only). Default: ""

  • spec.prometheus.spec.image (string) – Due to a known issue in the Prometheus Helm chart, the imageRegistry setting is ignored. To pull the image from a different registry, you can manually specify the Prometheus image reference. Default: quay.io/prometheus/prometheus

  • spec.prometheus.spec.imagePullSecrets (string) – List of image pull secret references in the runai namespace to use for pulling Prometheus images (relevant for air-gapped installations). Default: []

  • global.customCA.enabled – Enables the use of a custom Certificate Authority (CA) in your deployment. When set to true, the system is configured to trust a user-provided CA certificate for secure communication.

  • openShift.securityContextConstraints.create – Enables the deployment of Security Context Constraints (SCC). Disable for CIS compliance. Default: true
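For illustration, such values are typically appended as --set flags to the Helm command provided in the cluster installation instructions; the values below are examples only, and the rest of the command should be kept exactly as given:

# Append value overrides to the cluster installation/upgrade Helm command (illustrative values).
helm upgrade -i runai-cluster ... \
  --set global.image.registry=registry.example.com \
  --set openShift.securityContextConstraints.create=false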

Before You Start

NVIDIA Run:ai provides metrics and telemetry for both physical cluster entities, such as clusters, nodes, and node pools, and organizational entities, such as departments and projects. Metrics represent over-time data while telemetry represents current analytics data. This data is essential for monitoring and analyzing the performance and health of your platform.

Consuming Metrics and Telemetry Data

Users can consume the data based on their permissions:

  1. API - Access the data programmatically through the NVIDIA Run:ai API.

  2. CLI - Use the NVIDIA Run:ai Command Line Interface to query and manage the data.

  3. UI - Visualize the data through the NVIDIA Run:ai user interface.

API

  • Metrics API - Access over-time detailed analytics data programmatically.

  • Telemetry API - Access current analytics data programmatically.

Refer to metrics and telemetry to see the full list of supported metrics and telemetry APIs.

CLI

Use the list and describe commands to fetch and manage the data. See CLI reference for more details.

  • Describe a specific workload to view its telemetry

  • List projects and view their telemetry and metrics

UI Views

Refer to metrics and telemetry to see the full list of supported metrics and telemetry.

  • Overview dashboard - Provides a high-level summary of the cluster's health and performance, including key metrics such as GPU utilization, memory usage, and node status. Allows administrators to quickly identify any potential issues or areas for optimization. Offers advanced analytics capabilities for analyzing GPU usage patterns and identifying trends. Helps administrators optimize resource allocation and improve cluster efficiency.

  • Quota management - Enables administrators to monitor and manage GPU quotas across the cluster. Includes features for setting and adjusting quotas, tracking usage, and receiving alerts when quotas are exceeded.

  • Workload visualizations - Provides detailed insights into the resource usage and utilization of each GPU in the cluster. Includes metrics such as GPU memory utilization, core utilization, and power consumption. Allows administrators to identify GPUs that are under-utilized or overloaded.

  • Node and node pool visualizations - Similar to workload visualizations, but focused on the resource usage and utilization of each GPU within a specific node or node pool. Helps administrators identify potential issues or bottlenecks at the node level.

  • Advanced NVIDIA metrics - Provides access to a range of advanced NVIDIA metrics, such as GPU temperature, fan speed, and voltage. Enables administrators to monitor the health and performance of GPUs in greater detail. This data is available at the node and workload level. To enable these metrics, contact NVIDIA Run:ai customer support.

Network Requirements

The following network requirements are for the NVIDIA Run:ai components installation and usage.

External Access

Set out below are the domains to whitelist and ports to open for installation, upgrade, and usage of the application and its management.

Note

Ensure the inbound and outbound rules are correctly applied to your firewall.

Inbound Rules

To allow your organization’s NVIDIA Run:ai users to interact with the cluster using the NVIDIA Run:ai Command-line interface, or access specific UI features, certain inbound ports need to be open:

  • NVIDIA Run:ai control plane – HTTPS entrypoint. Source: 0.0.0.0, Destination: NVIDIA Run:ai system nodes, Port: 443

  • NVIDIA Run:ai cluster – HTTPS entrypoint. Source: 0.0.0.0, Destination: NVIDIA Run:ai system nodes, Port: 443

Outbound Rules

Note

Outbound rules apply to the NVIDIA Run:ai cluster component only. In case the NVIDIA Run:ai cluster is installed together with the NVIDIA Run:ai control plane, the NVIDIA Run:ai cluster FQDN refers to the NVIDIA Run:ai control plane FQDN.

For the NVIDIA Run:ai cluster installation and usage, certain outbound ports must be open:

  • Cluster sync – Sync NVIDIA Run:ai cluster with NVIDIA Run:ai control plane. Source: NVIDIA Run:ai system nodes, Destination: NVIDIA Run:ai control plane FQDN, Port: 443

  • Metric store – Push NVIDIA Run:ai cluster metrics to NVIDIA Run:ai control plane's metric store. Source: NVIDIA Run:ai system nodes, Destination: NVIDIA Run:ai control plane FQDN, Port: 443

  • Container Registry – Pull NVIDIA Run:ai images. Source: All Kubernetes nodes, Destination: runai.jfrog.io, Port: 443

  • Helm repository – NVIDIA Run:ai Helm repository for installation. Source: Installer machine, Destination: runai.jfrog.io, Port: 443

The NVIDIA Run:ai installation has software requirements that require additional components to be installed on the cluster. This article includes simple installation examples which can be used optionally and require the following cluster outbound ports to be open:

  • Kubernetes Registry – Ingress Nginx image repository. Source: All Kubernetes nodes, Destination: registry.k8s.io, Port: 443

  • Google Container Registry – GPU Operator and Knative image repository. Source: All Kubernetes nodes, Destination: gcr.io, Port: 443

  • Red Hat Container Registry – Prometheus Operator image repository. Source: All Kubernetes nodes, Destination: quay.io, Port: 443

  • Docker Hub Registry – Training Operator image repository. Source: All Kubernetes nodes, Destination: docker.io, Port: 443

Internal Network

Ensure that all Kubernetes nodes can communicate with each other across all necessary ports. Kubernetes assumes full interconnectivity between nodes, so you must configure your network to allow this seamless communication. Specific port requirements may vary depending on your network setup.

Install the Control Plane

System and Network Requirements

Before installing the NVIDIA Run:ai control plane, validate that the system and network requirements are met. For air-gapped environments, make sure you have the software files prepared.

Permissions

As part of the installation, you will be required to install the NVIDIA Run:ai control plane Helm chart. The Helm charts require Kubernetes administrator permissions. You can review the exact objects that are created by the charts using the --dry-run flag on both Helm charts.

Installation

Kubernetes

Connected

Run the following command. Replace global.domain=<DOMAIN> with the domain name obtained earlier:
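A hedged sketch of the installation command, assuming the runai-backend Helm repository and the control-plane chart referenced elsewhere in this guide; verify the repository URL and chart name against the official installation instructions:

# Add the NVIDIA Run:ai control plane Helm repository and install
# (repository URL and chart reference are assumptions to verify).
helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm upgrade -i runai-backend runai-backend/control-plane -n runai-backend \
  --create-namespace --set global.domain=<DOMAIN>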

Note: To install a specific version, add --version <VERSION> to the install command. You can find available versions by running helm search repo -l runai-backend.

Note: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation.

Air-gapped

To run the following command, make sure to replace the following. The custom-env.yaml is created when :

  1. control-plane-<VERSION>.tgz - The NVIDIA Run:ai control plane version

  2. global.domain=<DOMAIN> - The domain name set

  3. global.customCA.enabled=true as described

Note: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation.

OpenShift

Connected

Run the following command. The <OPENSHIFT-CLUSTER-DOMAIN> is the subdomain configured for the OpenShift cluster:

Note: To install a specific version, add --version <VERSION> to the install command. You can find available versions by running helm search repo -l runai-backend.

Air-gapped

To run the following command, make sure to replace the following. The custom-env.yaml is created when

  1. control-plane-<VERSION>.tgz - The NVIDIA Run:ai control plane version

  2. <OPENSHIFT-CLUSTER-DOMAIN> - The domain configured for the OpenShift cluster. To find out the OpenShift cluster domain, run oc get routes -A

  3. global.customCA.enabled=true as described

Note

To customize the installation based on your environment, see Advanced control plane configurations.

Connect to NVIDIA Run:ai User Interface

  1. Open your browser and go to:

    • Kubernetes: https://<DOMAIN>

    • OpenShift: https://runai.apps.<OpenShift-DOMAIN>

  2. Log in using the default credentials:

    • User: [email protected]

    • Password: Abcd!234

You will be prompted to change the password.

Users

This section explains the procedure to manage users and their permissions.

Users can be managed locally, or via the identity provider (IdP), while assigned with access rules to manage permissions. For example, user [email protected] is a department admin in department A.

Users Table

The Users table can be found under Access in the NVIDIA Run:ai platform.

The users table provides a list of all the users in the platform. You can manage users and user permissions (access rules) for both local and SSO users.

Single Sign-On Users

SSO users are managed by the identity provider and appear once they have signed in to NVIDIA Run:ai.

The Users table consists of the following columns:

  • User – The unique identity of the user (email address)

  • Type – The type of the user - SSO / local

  • Last login – The timestamp for the last time the user signed in

  • Access rule(s) – The access rules assigned to the user

  • Created by – The user who created the user

  • Creation time – The timestamp for when the user was created

  • Last updated – The last time the user was updated

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

Creating a Local User

To create a local user:

  1. Click +NEW LOCAL USER

  2. Enter the user’s Email address

  3. Click CREATE

  4. Review and copy the user’s credentials:

    • User Email

    • Temporary password to be used on first sign-in

  5. Click DONE

Note

The temporary password is visible only at the time of user’s creation and must be changed after the first sign-in.

Adding an Access Rule to a User

To create an access rule:

  1. Select the user you want to add an access rule for

  2. Click ACCESS RULES

  3. Click +ACCESS RULE

  4. Select a role

  5. Select a scope

  6. Click SAVE RULE

  7. Click CLOSE

Deleting a User’s Access Rule

To delete an access rule:

  1. Select the user you want to remove an access rule from

  2. Click ACCESS RULES

  3. Find the access rule assigned to the user you would like to delete

  4. Click on the trash icon

  5. Click CLOSE

Resetting a User's Password

To reset a user’s password:

  1. Select the user whose password you want to reset

  2. Click RESET PASSWORD

  3. Click RESET

  4. Review and copy the user’s credentials:

    • User Email

    • Temporary password to be used on next sign-in

  5. Click DONE

Deleting a User

  1. Select the user you want to delete

  2. Click DELETE

  3. In the dialog, click DELETE to confirm

Note

To ensure administrative operations are always available, at least one local user with System Administrator role should exist.

Using API

Go to the Users API reference to view the available actions.

Access Rules

This section explains the procedure to manage Access rules.

Access rules provide users, groups, or applications privileges to system entities. An access rule is the assignment of a role to a subject in a scope: <Subject> is a <Role> in a <Scope>. For example, user [email protected] is a department admin in department A.

Access Rules Table

The Access rules table can be found under Access in the NVIDIA Run:ai platform.

The Access rules table provides a list of all the access rules defined in the platform and allows you to manage them.

Flexible management

It is also possible to manage access rules directly for a specific user, application, project, or department.

The Access rules table consists of the following columns:

  • Type – The type of subject assigned to the access rule (user, SSO group, or application)

  • Subject – The user, SSO group, or application assigned with the role

  • Role – The role assigned to the subject

  • Scope – The scope to which the subject has access. Click the name of the scope to see the scope and its subordinates

  • Authorized by – The user who granted the access rule

  • Creation time – The timestamp for when the rule was created

  • Last updated – The last time the access rule was updated

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

Adding a New Access Rule

To add a new access rule:

  1. Click +NEW ACCESS RULE

  2. Select a subject User, SSO Group, or Application

  3. Select or enter the subject identifier:

    • User Email for a local user created in NVIDIA Run:ai or for SSO user as recognized by the IDP

    • Group name as recognized by the IDP

    • Application name as created in NVIDIA Run:ai

  4. Select a role

  5. Select a scope

  6. Click SAVE RULE

Note

An access rule consists of a single subject with a single role in a single scope. To assign multiple roles or multiple scopes to the same subject, multiple access rules must be added.

Editing an Access Rule

Access rules cannot be edited. To change an access rule, you must delete the rule, and then create a new rule to replace it.

Deleting an Access Rule

  1. Select the access rule you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm

Viewing Your User Access Rule

To view the assigned roles and scopes you have access to:

  1. Click the user avatar at the top right corner, then select Settings

  2. Click User details

The list of assigned roles and scopes will be displayed.

Using API

Go to the API reference to view the available actions.

Event History

This section provides details about NVIDIA Run:ai’s Audit log.

The NVIDIA Run:ai control plane provides the audit log API and the Event history table in the NVIDIA Run:ai UI. Both reflect the same information regarding changes to business objects: clusters, projects, assets, etc.

Note

Only system administrator users with tenant-wide permissions can access Audit log.

Event History Table

The Event history table can be found under Event history in the NVIDIA Run:ai UI.

The Event history table consists of the following columns:

  • Subject – The name of the subject

  • Subject type – The user or application assigned with the role

  • Source IP – The IP address of the subject

  • Date & time – The exact timestamp at which the event occurred. Format: dd/mm/yyyy for date and hh:mm am/pm for time

  • Event – The type of the event. Possible values: Create, Update, Delete, Login

  • Event ID – Internal event ID, can be used for support purposes

  • Status – The outcome of the logged operation. Possible values: Succeeded, Failed

  • Entity type – The type of the logged business object

  • Entity name – The name of the logged business object

  • Entity ID – The system's internal ID of the logged business object

  • URL – The endpoint or address that was accessed during the logged event

  • HTTP method – The HTTP operation method used for the request. Possible values include standard HTTP methods such as GET, POST, PUT, DELETE, indicating what kind of action was performed on the specified URL

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV or Download as JSON

Using the Event History Date Selector

The Event history table saves events for the last 90 days. However, the table itself presents up to the last 30 days of information due to the potentially very high number of operations that might be logged during this period.

To view older events, or to refine your search for more specific results or fewer results, use the time selector and change the period you search for. You can also refine your search by clicking and using ADD FILTER accordingly.

Using API

Go to the Audit log API reference to view the available actions. Since the amount of data is not trivial, the API is based on paging. It retrieves a specified number of items for each API call. You can get more data by using subsequent calls.

Limitations

Submissions of workloads are not audited. As a result, the system does not track or log details of workload submissions, such as timestamps or user activity.

Upgrade

Before Upgrade

Before proceeding with the upgrade, it's crucial to apply the specific prerequisites associated with your current version of NVIDIA Run:ai and every version in between up to the version you are upgrading to.

Helm

NVIDIA Run:ai requires Helm 3.14 or later. Before you continue, validate your installed Helm client version. To install or upgrade Helm, see the Helm documentation. If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai tar file contains the Helm binary.

Software Files

Run the helm command below:
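For connected environments, this is typically a refresh of the NVIDIA Run:ai Helm repositories (a hedged sketch; use the exact command from the upgrade instructions):

helm repo update

For air-gapped environments: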

  • Ask for a tar file runai-air-gapped-<NEW-VERSION>.tar.gz from NVIDIA Run:ai customer support. The file contains the new version you want to upgrade to. <NEW-VERSION> is the updated version of the NVIDIA Run:ai control plane.

  • Upload the images as described in the air-gapped preparation instructions.

Upgrade Control Plane

System and Network Requirements

Before upgrading the NVIDIA Run:ai control plane, validate that the latest system and network requirements are met, as they can change from time to time.

Upgrade

To upgrade from 2.17 or later, run the following:
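A hedged sketch of the upgrade command, assuming the runai-backend release and control-plane chart used at installation time; check the upgrade notes for version-specific flags or values files:

# Upgrade the control plane Helm release (illustrative; adjust flags and values as required).
helm repo update
helm upgrade runai-backend runai-backend/control-plane -n runai-backend \
  --version <VERSION> --reuse-values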

Note

To upgrade to a specific version, modify the --version flag by specifying the desired <VERSION>. You can find all available versions by using the helm search repo command.

Upgrade Cluster

System and Network Requirements

Before upgrading the NVIDIA Run:ai cluster, validate that the latest system and network requirements are met, as they can change from time to time.

Note

It is highly recommended to upgrade the Kubernetes version together with the NVIDIA Run:ai cluster version, to ensure compatibility with the latest supported version of your Kubernetes distribution.

Getting Installation Instructions

Follow the setup and installation instructions below to get the installation instructions to upgrade the NVIDIA Run:ai cluster.

Setup

  1. In the NVIDIA Run:ai UI, go to Clusters

  2. Select the cluster you want to upgrade

  3. Click INSTALLATION INSTRUCTIONS

  4. Optional: Select the NVIDIA Run:ai cluster version (latest, by default)

  5. Click CONTINUE

Installation Instructions

  1. Follow the installation instructions and run the Helm commands provided on your Kubernetes cluster. See the troubleshooting scenarios below if you encounter issues.

  2. Click DONE

  3. Once installation is complete, validate the cluster is Connected and listed with the new cluster version (see the Clusters table in the NVIDIA Run:ai UI). Once you have done this, the cluster is upgraded to the latest version.

Note

To upgrade to a specific version, modify the --version flag by specifying the desired <VERSION>. You can find all available versions by using the helm search repo command.

Troubleshooting

If you encounter an issue with the cluster upgrade, use the troubleshooting scenarios below.

Installation Fails

If the NVIDIA Run:ai cluster upgrade fails, check the installation logs to identify the issue.

Run the following script to print the installation logs:
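As a minimal substitute, assuming the cluster services run in the runai namespace, the installation pods and their logs can be inspected directly:

# Inspect cluster pods and the logs of a failing pod (generic kubectl commands).
kubectl get pods -n runai
kubectl describe pod <failing-pod-name> -n runai
kubectl logs <failing-pod-name> -n runai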

Cluster Status

If the NVIDIA Run:ai cluster upgrade completes, but the cluster status does not show as Connected, refer to the cluster troubleshooting documentation.

Workload Assets

NVIDIA Run:ai assets are preconfigured building blocks that simplify the workload submission effort and remove the complexities of Kubernetes and networks for AI practitioners.

Workload assets enable organizations to:

  • Create and reuse preconfigured setup for code, data, storage and resources to be used by AI practitioners to simplify the process of submitting workloads

  • Share the preconfigured setup with a wide audience of AI practitioners with similar needs

Note

  • The creation of assets is possible only via API and the NVIDIA Run:ai UI.

  • The submission of workloads using assets is possible only via the NVIDIA Run:ai UI.

Workload Asset Types

There are four workload asset types used by the workload:

  • Environments – The container image, tools and connections for the workload

  • Data sources – The type of data, its origin and the target storage location such as PVCs or cloud storage buckets where datasets are stored

  • Compute resources – The compute specification, including GPU and CPU compute and memory

  • Credentials – The secrets to be used to access sensitive data, services, and applications such as a docker registry or S3 buckets

Asset Scope

When a workload asset is created, a scope is required. The scope defines who in the organization can view and/or use the asset.

Note

When an asset is created via API, the scope can be the entire account. This is currently an experimental feature.

Who Can Create an Asset?

Any subject (user, application, or SSO group) with a role that has permissions to Create an asset, can do so within their scope.

Who Can Use an Asset?

Assets are used when submitting workloads. Any subject (user, application or SSO group) with a role that has permissions to Create workloads, can also use assets.

Who Can View an Asset?

Any subject (user, application, or SSO group) with a role that has permission to View an asset, can do so within their scope.

TLS certificates
TLS certificates
known issue

NVIDIA Run:ai control plane

HTTPS entrypoint

0.0.0.0

NVIDIA Run:ai system nodes

443

NVIDIA Run:ai cluster

HTTPS entrypoint

0.0.0.0

NVIDIA Run:ai system nodes

443

Cluster sync

Sync NVIDIA Run:ai cluster with NVIDIA Run:ai control plane

NVIDIA Run:ai system nodes

NVIDIA Run:ai control plane FQDN

443

Metric store

Push NVIDIA Run:ai cluster metrics to NVIDIA Run:ai control plane's metric store

NVIDIA Run:ai system nodes

NVIDIA Run:ai control plane FQDN

443

Container Registry

Pull NVIDIA Run:ai images

All kubernetes nodes

runai.jfrog.io

443

Helm repository

NVIDIA Run:ai Helm repository for installation

Installer machine

runai.jfrog.io

443

Kubernetes Registry

Ingress Nginx image repository

All kubernetes nodes

registry.k8s.io

443

Google Container Registry

GPU Operator, and Knative image repository

All kubernetes nodes

gcr.io

443

Red Hat Container Registry

Prometheus Operator image repository

All kubernetes nodes

quay.io

443

Docker Hub Registry

Training Operator image repository

All kubernetes nodes

docker.io

443

NVIDIA Run:ai Command-line interface
software requirements
workload
Environments
Data sources
Compute resources
Credentials
scope
role
role
role

User

The unique identity of the user (email address)

Type

The type of the user - SSO / local

Last login

The timestamp for the last time the user signed in

Access rule(s)

The access rule assigned to the user

Created By

The user who created the user

Creation time

The timestamp for when the user was created

Last updated

The last time the user was updated

access rules
SSO users
Users
Access rules

Type

The type of subject assigned to the access rule (user, SSO group, or application).

Subject

The user, SSO group, or application assigned with the role

Role

The role assigned to the subject

Scope

The scope to which the subject has access. Click the name of the scope to see the scope and its subordinates

Authorized by

The user who granted the access rule

Creation time

The timestamp for when the rule was created

Last updated

The last time the access rule was updated

role
subject in a scope
user
application
project
department
Access rules

Event History

| Column | Description |
| --- | --- |
| Subject | The name of the subject |
| Subject type | The type of subject that performed the action (user or application) |
| Source IP | The IP address of the subject |
| Date & time | The exact timestamp at which the event occurred. Format dd/mm/yyyy for date and hh:mm am/pm for time |
| Event | The type of the event. Possible values: Create, Update, Delete, Login |
| Event ID | Internal event ID, can be used for support purposes |
| Status | The outcome of the logged operation. Possible values: Succeeded, Failed |
| Entity type | The type of the logged business object |
| Entity name | The name of the logged business object |
| Entity ID | The system's internal ID of the logged business object |
| URL | The endpoint or address that was accessed during the logged event |
| HTTP Method | The HTTP operation method used for the request. Possible values include standard HTTP methods such as GET, POST, PUT, DELETE, indicating what kind of action was performed on the specified URL |


Node Roles

This article explains how to designate specific node roles in a Kubernetes cluster to ensure optimal performance and reliability in production deployments.

For optimal performance in production clusters, it is essential to avoid extensive CPU usage on GPU nodes where possible. This can be done by ensuring the following:

  • NVIDIA Run:ai system-level services run on dedicated CPU-only nodes.

  • Workloads that do not request GPU resources (e.g. Machine Learning jobs) are executed on CPU-only nodes.

NVIDIA Run:ai services are scheduled on the defined node roles by applying Kubernetes Node Affinity using node labels.

Prerequisites

To perform these tasks, make sure to install the NVIDIA Run:ai Administrator CLI.

Configure Node Roles

The following node roles can be configured on the cluster:

  • System node: Reserved for NVIDIA Run:ai system-level services.

  • GPU Worker node: Dedicated for GPU-based workloads.

  • CPU Worker node: Used for CPU-only workloads.

System Nodes

NVIDIA Run:ai system nodes run the system-level services required to operate. This can be done via Kubectl (recommended) or via the NVIDIA Run:ai Administrator CLI.

By default, NVIDIA Run:ai applies a node affinity rule to prefer nodes that are labeled with node-role.kubernetes.io/runai-system for system services scheduling. You can modify the default node affinity rule by:

  • Editing the spec.global.affinity configuration parameter as detailed in Advanced cluster configurations.

  • Editing the global.affinity configuration as detailed in Advanced control plane configurations for self-hosted deployments.

Note

To ensure high availability and prevent a single point of failure, it is recommended to configure at least three system nodes in your cluster.

Important

Do not assign a system node role to the Kubernetes master node. This may disrupt Kubernetes functionality, particularly if the Kubernetes API Server is configured to use port 443 instead of the default 6443.

Kubectl

To set a system role for a node in your Kubernetes cluster using Kubectl, follow these steps:

  1. Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

  2. Run one of the following commands to label the node with its role:

    kubectl label nodes <node-name> node-role.kubernetes.io/runai-system=true
    kubectl label nodes <node-name> node-role.kubernetes.io/runai-system=false
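To confirm which nodes currently carry the system role label, you can list them by label selector (a small optional check, using the same label set above):

kubectl get nodes -l node-role.kubernetes.io/runai-system=true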

NVIDIA Run:ai Administrator CLI

Note

The NVIDIA Run:ai Administrator CLI only supports the default node affinity.

To set a system role for a node in your Kubernetes cluster, follow these steps:

  1. Run the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

  2. Run one of the following commands to set or remove a node’s role:

    runai-adm set node-role --runai-system-worker <node-name>
    runai-adm remove node-role --runai-system-worker <node-name>

The set node-role command will label the node and set relevant cluster configurations.

Worker Nodes

NVIDIA Run:ai worker nodes run user-submitted workloads and the system-level DaemonSets required to operate. This can be managed via Kubectl (recommended) or via the NVIDIA Run:ai Administrator CLI.

By default, GPU workloads are scheduled on GPU nodes based on the nvidia.com/gpu.present label. When global.nodeAffinity.restrictScheduling is set to true via the Advanced cluster configurations:

  • GPU workloads are scheduled with a node affinity rule requiring nodes that are labeled with node-role.kubernetes.io/runai-gpu-worker

  • CPU-only workloads are scheduled with a node affinity rule requiring nodes that are labeled with node-role.kubernetes.io/runai-cpu-worker

Kubectl

To set a worker role for a node in your Kubernetes cluster using Kubectl, follow these steps:

  1. Validate the global.nodeAffinity.restrictScheduling is set to true in the cluster’s Configurations.

  2. Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

  3. Run one of the following commands to label the node with its role. Replace the label and value (true/false) to enable or disable GPU/CPU roles as needed:

    kubectl label nodes <node-name> node-role.kubernetes.io/runai-gpu-worker=true
    kubectl label nodes <node-name> node-role.kubernetes.io/runai-cpu-worker=false
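To review how worker roles are currently assigned across the cluster, you can optionally display both labels as columns (a sketch using kubectl's --label-columns flag):

kubectl get nodes -L node-role.kubernetes.io/runai-gpu-worker -L node-role.kubernetes.io/runai-cpu-worker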

NVIDIA Run:ai Administrator CLI

To set worker role for a node in your Kubernetes cluster via NVIDIA Run:ai Administrator CLI, follow these steps:

  1. Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

  2. Run one of the following commands to set or remove a node’s role. <node-role> must be either --gpu-worker or --cpu-worker:

    runai-adm set node-role <node-role> <node-name>
    runai-adm remove node-role <node-role> <node-name>

The set node-role command will label the node and set the cluster configuration global.nodeAffinity.restrictScheduling to true.

Note

Use the --all flag to set or remove a role for all nodes.
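For example, combining the syntax shown above with the --all flag would presumably look as follows (a sketch based on this section, not verified against every CLI version):

runai-adm set node-role --gpu-worker --all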

Configuring NVIDIA MIG Profiles

NVIDIA’s Multi-Instance GPU (MIG) enables splitting a GPU into multiple logical GPU devices, each with its own memory and compute portion of the physical GPU.

NVIDIA provides two MIG strategies:

  • Single - A GPU can be divided evenly. This means all MIG profiles are the same.

  • Mixed - A GPU can be divided into different profiles.

The NVIDIA Run:ai platform supports running workloads using NVIDIA MIG. Administrators can set the Kubernetes nodes to their preferred MIG strategy and configure the appropriate MIG profiles for researchers and MLOps engineers to use.

This guide explains how to configure MIG in each strategy to submit workloads. It also outlines the individual implications of each strategy and best practices for administrators.

Note

  • Starting from v2.19, the Dynamic MIG feature began a deprecation process and is no longer supported. With Dynamic MIG, the NVIDIA Run:ai platform automatically configured MIG profiles according to on-demand user requests for different MIG profiles or memory fractions.

  • GPU fractions and memory fractions are not supported with MIG profiles.

  • Single strategy supports both NVIDIA Run:ai and third-party workloads. Using mixed strategy can only be done using third-party workloads. For more details on NVIDIA Run:ai and third-party workloads, see Introduction to workloads.

Before You Start

To use MIG single and mixed strategy effectively, make sure to familiarize yourself with the following NVIDIA resources:

  • NVIDIA Multi-Instance GPU

  • MIG User Guide

  • GPU Operator with MIG

Configuring Single MIG Strategy

When deploying MIG using single strategy, all GPUs within a node are configured with the same profile. For example, a node might have GPUs configured with 3 MIG slices of profile type 1g.20gb, or 7 MIG slices of profile 1g.10gb. With this strategy, MIG profiles are displayed as whole GPU devices by CUDA.

The NVIDIA Run:ai platform discovers these MIG profiles as whole GPU devices as well, ensuring MIG devices are transparent to the end-user (practitioner). For example, a node that consists of 8 physical GPUs split into MIG slices, 3 × 2g.20gb slices each, is discovered by the NVIDIA Run:ai platform as a node with 24 GPU devices.

Users can submit workloads by requesting a specific number of GPU devices (X GPU) and NVIDIA Run:ai will allocate X MIG slices (logical devices). The NVIDIA Run:ai platform deducts X GPUs from the workload’s Project quota, regardless of whether this ‘logical GPU’ represents 1/3 of a physical GPU device or 1/7 of a physical GPU device.

Configuring Mixed MIG Strategy

When deploying MIG using mixed strategy, each GPU in a node can be configured with a different combination of MIG profiles such as 2×2g.20gb and 3×1g.10gb. For details on supported combinations per GPU type, refer to Supported MIG Profiles.

In mixed strategy, physical GPU devices continue to be displayed as physical GPU devices by CUDA, and each MIG profile is shown individually. The NVIDIA Run:ai platform identifies the physical GPU devices normally, however, MIG profiles are not visible in the UI or node APIs.

When submitting third-party workloads with this strategy, the user should explicitly specify the exact requested MIG profile (for example, nvidia.com/gpu.product: A100-SXM4-40GB-MIG-3g.20gb). The NVIDIA Run:ai Scheduler finds a node that can provide this specific profile and binds it to the workload.

A third-party workload submitted with a MIG profile of type Xg.Ygb (e.g. 3g.40gb or 2g.20gb) is considered as consuming X GPUs. These X GPUs will be deducted from the workload’s project quota of GPUs. For example, a 3g.40gb profile deducts 3 GPUs from the associated Project’s quota, while 2g.20gb deducts 2 GPUs from the associated Project’s quota. This is done to maintain a logical ratio according to the characteristics of the MIG profile.
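The following is a minimal sketch of a third-party pod requesting a specific MIG profile under the mixed strategy. It assumes the NVIDIA device plugin exposes mixed-strategy profiles as extended resources named nvidia.com/mig-<profile>, and it omits any NVIDIA Run:ai-specific scheduling labels your cluster may require; the pod name, namespace, and image are illustrative only:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: mig-demo            # hypothetical name
  namespace: team-a         # hypothetical namespace
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # any CUDA-capable image
      command: ["nvidia-smi", "-L"]                # list the MIG device visible to the pod
      resources:
        limits:
          nvidia.com/mig-3g.20gb: 1                # exact MIG profile, as described above
EOF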

Best Practices for Administrators

Single Strategy

  • Configure proper and uniform sizes of MIG slices (profiles) across all GPUs within a node.

  • Set the same MIG profiles on all nodes of a single node pool.

  • Create separate node pools with different MIG profile configurations allowing users to select the pool that best matches their workloads’ needs.

  • Ensure Project quotas are allocated according to the MIG profile sizes.

Mixed Strategy

  • Use mixed strategy with workloads that require diverse resources. Make sure to evaluate the workload requirements and plan accordingly.

  • Configure individual MIG profiles on each node by using a limited set of MIG profile combinations to minimize complexity. Make sure to evaluate your requirements and node configurations.

  • Ensure Project quotas are allocated according to the MIG profile sizes.

Note

Since MIG slices are a fixed size, once configured, changing MIG profiles requires administrative intervention.

Scheduling Rules

This article explains the procedure to configure and manage scheduling rules.

Scheduling rules are restrictions applied to workloads. These restrictions apply to either the resources (nodes) on which workloads can run or the duration of the run time. Scheduling rules are set for Projects or Departments and apply to specific workload types. Once scheduling rules are set for a project or department, all matching workloads associated with that project have the restrictions applied to them as defined at the time the workload was submitted. New scheduling rules added to a project are not applied to previously created workloads associated with that project.

There are three types of scheduling rules:

Workload Duration (Time Limit)

This rule limits the duration of a workload's run time. Workload run time is calculated as the total time in which the workload was in status Running. You can apply a single rule per workload type - Preemptible Workspaces, Non-preemptible Workspaces, and Training.

Idle GPU Time Limit

This rule limits the total idle GPU time of a workload. Workload idle time is counted from the first time the workload is in status Running and the GPU is idle. Idleness is calculated using the runai_gpu_idle_seconds_per_workload metric. This metric measures the total duration of zero GPU utilization within each 30-second interval. If the GPU remains idle throughout the 30-second window, 30 seconds are added to the idleness sum; otherwise, the idleness count is reset. You can apply a single rule per workload type - Preemptible Workspaces, Non-preemptible Workspaces, and Training.

Note

To make Idle GPU timeout effective, it must be set to a shorter duration than the workload duration of the same workload type.

Node Type (Affinity)

Node type is used to select a group of nodes, typically with specific characteristics such as a hardware feature, storage type, fast networking interconnection, etc. The Scheduler uses node type as an indication of which nodes should be used for your workloads, within this project.

Node type is a label in the form of run.ai/type and a value (e.g. run.ai/type = dgx200) that the administrator uses to tag a set of nodes. Adding the node type to the project’s scheduling rules requires the user to submit workloads with a node type label/value pair from this list, according to the workload type - Workspace or Training. The Scheduler then schedules workloads using a node selector, targeting nodes tagged with the NVIDIA Run:ai node type label/value pair. Node pools and node types can be used in conjunction, for example, specifying a node pool and a smaller group of nodes within that node pool that includes fast SSD storage or other unique characteristics.

Labelling Nodes for Node Types Grouping

The administrator should use a node label with the key run.ai/type and any coupled value.

To assign a label to nodes you want to group, set the ‘node type (affinity)’ on each relevant node:

  1. Obtain the list of nodes and their current labels by copying the following to your terminal:

    kubectl get nodes --show-labels

  2. Annotate a specific node with a new label by copying the following to your terminal:

    kubectl label node <node-name> run.ai/type=<value>
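For example, to tag a hypothetical node named node-17 with the dgx200 value used earlier in this article, and then list the nodes in that group:

kubectl label node node-17 run.ai/type=dgx200
kubectl get nodes -l run.ai/type=dgx200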

Adding a Scheduling Rule to a Project or Department

To add a scheduling rule:

  1. Select the project/department for which you want to add a scheduling rule

  2. Click EDIT

  3. In the Scheduling rules section click +RULE

  4. Select the rule type

  5. Select the workload type and time limitation period

  6. For Node type, choose one or more labels for the desired nodes

  7. Click SAVE

Note

You can review the defined rules in the Projects table in the relevant column.

Editing the Scheduling Rule

To edit a scheduling rule:

  1. Select the project/department for which you want to edit its scheduling rule

  2. Click EDIT

  3. Find the scheduling rule you would like to edit

  4. Edit the rule

  5. Click SAVE

Note

Setting scheduling rules in a department enforces the rules on all associated projects.

When editing a scheduling rule within a project, you can only tighten a rule applied by your department admin, meaning you can set a lower time limitation, not a higher one.

Deleting the Scheduling Rule

To delete a scheduling rule:

  1. Select the project/department from which you want to delete a scheduling rule

  2. Click EDIT

  3. Find the scheduling rule you would like to delete

  4. Click on the x icon

  5. Click SAVE

Note

Deleting a department rule within a project - a project admin cannot delete a rule created by the department admin.

Using API

Go to the Projects API reference to view the available actions.

helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane \
    --set global.domain=runai.apps.<OPENSHIFT-CLUSTER-DOMAIN> \ 
    --set global.config.kubernetesDistribution=openshift
helm upgrade -i runai-backend ./control-plane-<VERSION>.tgz -n runai-backend \
    --set global.domain=runai.apps.<OPENSHIFT-CLUSTER-DOMAIN> \ 
    --set global.config.kubernetesDistribution=openshift \
    --set global.customCA.enabled=true \ 
    -f custom-env.yaml 
helm upgrade -i runai-backend control-plane-<VERSION>.tgz \
    --set global.domain=<DOMAIN> \ 
    --set global.customCA.enabled=true \ 
    -n runai-backend -f custom-env.yaml
helm get values runai-backend -n runai-backend > runai_control_plane_values.yaml
helm upgrade runai-backend -n runai-backend runai-backend/control-plane --version "<VERSION>" -f runai_control_plane_values.yaml --reset-then-reuse-values
helm get values runai-backend -n runai-backend > runai_control_plane_values.yaml
helm upgrade runai-backend control-plane-<NEW-VERSION>.tgz -n runai-backend  -f runai_control_plane_values.yaml --reset-then-reuse-values
curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh

Policies and Rules

At NVIDIA Run:ai, administrators can access a suite of tools designed to facilitate efficient account management. This article focuses on two key features: workload policies and workload scheduling rules. These features empower admins to establish default values and implement restrictions, allowing enhanced control, ensuring compatibility with organizational policies, and optimizing resource usage.

Note

Policies V1 are still supported but require additional setup. If you have policies on clusters prior to NVIDIA Run:ai version 2.18 and upgraded to a newer version, contact NVIDIA Run:ai Customer Success for assistance in transitioning to the new policies framework.

Workload Policies

A workload policy is an end-to-end solution for AI managers and administrators to control and simplify how workloads are submitted. This solution allows them to set best practices, enforce limitations, and standardize processes for the submission of workloads for AI projects within their organization. It acts as a key guideline for data scientists, researchers, ML & MLOps engineers by standardizing submission practices and simplifying the workload submission process.

Why Use a Workload Policy?

Implementing workload policies is essential when managing complex AI projects within an enterprise for several reasons:

  1. Resource control and management - Defining or limiting the use of costly resources across the enterprise via a centralized management system to ensure efficient allocation and prevent overuse.

  2. Setting best practices - Provide managers with the ability to establish guidelines and standards to follow, reducing errors amongst AI practitioners within the organization.

  3. Security and compliance - Define and enforce permitted and restricted actions to uphold organizational security and meet compliance requirements.

  4. Simplified setup - Conveniently allow setting defaults and streamline the workload submission process for AI practitioners.

  5. Scalability and diversity

    1. Multi-purpose clusters with various workload types that may have different requirements and characteristics for resource usage.

    2. The organization has multiple hierarchies, each with distinct goals, objectives, and degrees of flexibility.

    3. Manage multiple users and projects with distinct requirements and methods, ensuring appropriate utilization of resources.

Understanding the Mechanism

The following sections provide details of how the workload policy mechanism works.

Cross-Interface Enforcement

The policy enforces the workloads regardless of whether they were submitted via UI, CLI, Rest APIs, or Kubernetes YAMLs.

Policy Types

NVIDIA Run:ai’s policies apply to NVIDIA Run:ai workloads. There is a policy type for each NVIDIA Run:ai workload type, which allows administrators to set different policies per workload type.

| Policy type | Workload type | Kubernetes name |
| --- | --- | --- |
| Workspace | Workspace | Interactive workload |
| Training: Standard | Training: Standard | Training workload |
| Training: Distributed | Training: Distributed | Distributed workload |
| Inference | Inference | Inference workload |

Policy Structure - Rules, Defaults, and Imposed Assets

A policy consists of rules for limiting and controlling the values of workload fields. In addition to rules, a policy can include defaults that set default values for different workload fields. These defaults are not rules; they simply suggest values that can be overridden during workload submission.

Furthermore, policies allow the enforcement of workload assets. For example, as an admin, you can impose a data source of type PVC to be used by any workload submitted.

For more information, see rules, defaults and imposed assets.

Scope of Effectiveness

Numerous teams working on various projects require the use of different tools, requirements, and safeguards. One policy may not suit all teams and their requirements. Hence, administrators can select the scope to cover the effectiveness of the policy. When a scope is selected, all of its subordinate units are also affected. As a result, all workloads submitted within the selected scope are controlled by the policy.

For example, if a policy is set for Department A, all workloads submitted by any of the projects within this department are controlled.

A scope for a policy can be:

  • The entire account

  • A specific cluster

  • A department

  • A project

Note

The policy submission to the entire account scope is supported via API only.

The different scoping of policies also allows the breakdown of responsibility between different administrators. This allows delegation of ownership between different levels within the organization. The policies, containing rules and defaults, propagate down the organizational tree, forming an “effective” policy that is enforced on any workload submitted by users within the project.

If a rule for a specific field is already set by a policy in the organization, another unit within the same branch cannot submit an additional rule for the same field. As a result, administrators of higher scopes must request lower-scope administrators to free up the specific rule from their policy. However, defaults for the same field can be submitted by different organizational policies, as they are “soft” rules that are not critical to override, and the lowest-level default is the one that becomes the effective default (a project default “wins” over a department default, a department default “wins” over a cluster default, etc.).

NVIDIA Run:ai policies vs. Kyverno policies

Kyverno runs as a dynamic admission controller in a Kubernetes cluster. Kyverno receives validating and mutating admission webhook HTTP callbacks from the Kubernetes API server and applies matching policies to return results that enforce admission policies or reject requests. Kyverno policies can match resources using the resource kind, name, label selectors, and much more. For more information, see How Kyverno Works.

Scheduling Rules

Scheduling rules limit a researcher's access to resources and provide a way for the admin to control resource allocation and prevent the waste of resources. Admins should use the rules to prevent GPU idleness, prevent GPU hogging, and allocate specific types of resources to different types of workloads.

Admins can limit the duration of a workload, the duration of its idle time, or the type of nodes the workload can use. Rules are defined per project or department and apply to all workloads in it. In addition, rules can be applied to a specific type of workload in a project or department (workspace, standard training, or inference). When a workload reaches the limit of a time-based rule, it is stopped. The node type rule prevents the workload from being scheduled on nodes that do not match the rule.

Workspace Templates

This section explains the procedure to manage templates.

A template is a pre-set configuration that is used to quickly configure and submit workloads using existing assets. A template consists of all the assets a workload needs, allowing researchers to submit a workload in a single click, or make subtle adjustments to differentiate them from each other.

Workspace Templates Table

The Templates table can be found under Workload manager in the NVIDIA Run:ai User interface.

The Templates table provides a list of all the templates defined in the platform, and allows you to manage them.

Flexible management

It is also possible to manage templates directly for a specific user, application, project, or department.

The Templates table consists of the following columns:

| Column | Description |
| --- | --- |
| Scope | The scope to which the template is assigned. Click the name of the scope to see the scope and its subordinates |
| Environment | The name of the environment related to the workspace template |
| Compute resource | The name of the compute resource connected to the workspace template |
| Data source(s) | The name of the data source(s) connected to the workspace template |
| Created by | The subject that created the template |
| Creation time | The timestamp for when the template was created |
| Cluster | The cluster name containing the template |

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

  • Refresh (optional) - Click REFRESH to update the table with the latest data

  • Show/Hide details (optional) - Click to view additional information on the selected row

Adding a New Workspace Template

To add a new template:

  1. Click +NEW TEMPLATE

  2. Set the scope for the template

  3. Enter a name for the template

  4. Select the environment for your workload

  5. Select the node resources needed to run your workload - or - Click +NEW COMPUTE RESOURCE

  6. Set the volume needed for your workload

  7. Create a new data source

  8. Set auto-deletion, annotations and labels, as required

  9. Click CREATE TEMPLATE

Copying a Template

To copy an existing template:

  1. Select the template you want to copy

  2. Click MAKE A COPY

  3. Enter a name for the template. The name must be unique.

  4. Update the template and click CREATE TEMPLATE

Renaming a Template

To rename an existing template:

  1. Select the template you want to rename

  2. Click Rename and edit the name/description

Deleting a Template

To delete a template:

  1. Select the template you want to delete

  2. Click DELETE

  3. Confirm you want to delete the template

Using API

Go to the Workload template API reference to view the available actions.

Applications

This section explains the procedure to manage your organization's applications.

Applications are used for API integrations with NVIDIA Run:ai. An application contains a client ID and a client secret. With the client credentials, you can obtain a token as detailed in API authentication and use it within subsequent API calls.

Applications are assigned with access rules to manage permissions. For example, application ci-pipeline-prod is assigned with a Researcher role in Cluster: A.

Applications Table

The Applications table can be found under Access in the NVIDIA Run:ai platform.

The Applications table provides a list of all the applications defined in the platform, and allows you to manage them.

The Applications table consists of the following columns:

| Column | Description |
| --- | --- |
| Application | The name of the application |
| Client ID | The client ID of the application |
| Access rule(s) | The access rules assigned to the application |
| Last login | The timestamp for the last time the application signed in |
| Created by | The user who created the application |
| Creation time | The timestamp for when the application was created |
| Last updated | The last time the application was updated |

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

Creating an Application

To create an application:

  1. Click +NEW APPLICATION

  2. Enter the application’s name

  3. Click CREATE

  4. Copy the Client ID and Client secret and store them securely

  5. Click DONE

Note

The client secret is visible only at the time of creation. It cannot be recovered but can be regenerated.

Adding an Access Rule to an Application

To create an access rule:

  1. Select the application you want to add an access rule for

  2. Click ACCESS RULES

  3. Click +ACCESS RULE

  4. Select a role

  5. Select a scope

  6. Click SAVE RULE

  7. Click CLOSE

Deleting an Access Rule from an Application

To delete an access rule:

  1. Select the application you want to remove an access rule from

  2. Click ACCESS RULES

  3. Find the access rule assigned to the user you would like to delete

  4. Click on the trash icon

  5. Click CLOSE

Regenerating a Client Secret

To regenerate a client secret:

  1. Locate the application you want to regenerate its client secret

  2. Click REGENERATE CLIENT SECRET

  3. Click REGENERATE

  4. Copy the New client secret and store it securely

  5. Click DONE

Note

Regenerating a client secret revokes the previous one.

Deleting an Application

  1. Select the application you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm

Using API

Go to the Applications and Access rules API reference to view the available actions.

NVIDIA Run:ai Workload Types

In the world of machine learning (ML), the journey from raw data to actionable insights is a complex process that spans multiple stages. Each stage of the AI lifecycle requires different tools, resources, and frameworks to ensure optimal performance. NVIDIA Run:ai simplifies this process by offering specialized workload types tailored to each phase, facilitating a smooth transition across various stages of the ML workflows.

The ML lifecycle usually begins with the experimental work on data and exploration of different modeling techniques to identify the best approach for accurate predictions. At this stage, resource consumption is usually moderate as experimentation is done on a smaller scale. As confidence grows in the model's potential and its accuracy, the demand for compute resources increases. This is especially true during the training phase, where vast amounts of data need to be processed, particularly with complex models such as large language models (LLMs), with their huge parameter sizes, that often require distributed training across multiple GPUs to handle the intensive computational load.

Finally, once the model is ready, it moves to the inference stage, where it is deployed to make predictions on new, unseen data. NVIDIA Run:ai's workload types are designed to correspond with the natural stages of this lifecycle. They are structured to align with the specific resource and framework requirements of each phase, ensuring that AI researchers and data scientists can focus on advancing their models without worrying about infrastructure management.

NVIDIA Run:ai offers three workload types that correspond to a specific phase of the researcher’s work:

  • Workspaces – For experimentation with data and models.

  • Training – For resource-intensive tasks such as model training and data preparation.

  • Inference – For deploying and serving the trained model.

Workspaces: The Experimentation Phase

The Workspace is where data scientists conduct initial research, experiment with different data sets, and test various algorithms. This is the most flexible stage in the ML lifecycle, where models and data are explored, tuned, and refined. The value of workspaces lies in the flexibility they offer, allowing the researcher to iterate quickly without being constrained by rigid infrastructure.

  • Framework flexibility

    Workspaces support a variety of machine learning frameworks, as researchers need to experiment with different tools and methods.

  • Resource requirements

    Workspaces are often lighter on resources compared to the training phase, but they still require significant computational power for data processing, analysis, and model iteration.

    Hence, by default, NVIDIA Run:ai workspaces are scheduled without the ability to preempt them once their resources have been allocated. However, this non-preemptible state does not allow utilizing resources beyond the project’s deserved quota.

See Running workspaces to learn more about how to submit a workspace via the NVIDIA Run:ai platform. For quick starts, see Running Jupyter Notebook using workspaces.

Training: Scaling Resources for Model Development

As models mature and the need for more robust data processing and model training increases, NVIDIA Run:ai facilitates this shift through the Training workload. This phase is resource-intensive, often requiring distributed computing and high-performance clusters to process vast data sets and train models.

  • Training architecture

    For training workloads NVIDIA Run:ai allows you to specify the architecture - standard or distributed. The distributed architecture is relevant for larger data sets and more complex models that require utilizing multiple nodes. For the distributed architecture, NVIDIA Run:ai allows you to specify different configurations for the master and workers and select which framework to use - PyTorch, XGBoost, MPI, TensorFlow and JAX. In addition, as part of the distributed configuration, NVIDIA Run:ai enables the researchers to schedule their distributed workloads on nodes within the same region, zone, placement group, or any other topology.

  • Resource requirements

    Training tasks demand high memory, compute power, and storage. NVIDIA Run:ai ensures that the allocated resources match the scale of the task and allows those workloads to utilize more compute resources than the project’s deserved quota. If you do not want your training workload to be preempted, make sure to specify a number of GPUs that is within your quota.

See Standard training and Distributed training to learn more about how to submit a training workload via the NVIDIA Run:ai UI. For quick starts, see Run your first standard training and Run your first distributed training.

Note

Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.

Inference: Deploying and Serving models

Once a model is trained and validated, it moves to the Inference stage, where it is deployed to make predictions (usually in a production environment). This phase is all about efficiency and responsiveness, as the model needs to serve real-time or batch predictions to end-users or other systems.

  • Inference-specific use cases

    Naturally, inference workloads must adapt to ever-changing demands in order to meet SLAs. For example, additional replicas may be deployed, manually or automatically, to increase compute resources as part of a horizontal scaling approach, or a new version of the deployment may need to be rolled out without affecting the running services.

  • Resource requirements

    Inference models differ in size and purpose, leading to varying computational requirements. For example, small OCR models can run efficiently on CPUs, whereas LLMs typically require significant GPU memory for deployment and serving. Inference workloads are considered production-critical and are given the highest priority to ensure compliance with SLAs. Additionally, NVIDIA Run:ai ensures that inference workloads cannot be preempted, maintaining consistent performance and reliability.

See Deploy a custom inference workload to learn more about how to submit an inference workload via the NVIDIA Run:ai UI.

GPU Time-Slicing

NVIDIA Run:ai supports simultaneous submission of multiple workloads to a single GPU or multiple GPUs when using GPU fractions. This is achieved by slicing the GPU memory between the different workloads according to the requested GPU fraction, and by using NVIDIA’s GPU time-slicing to share the GPU compute runtime. NVIDIA Run:ai ensures each workload receives the exact share of the GPU memory (= gpu_memory * requested), while the NVIDIA GPU time-slicing splits the GPU runtime evenly between the different workloads running on that GPU.

To provide customers with predictable and accurate GPU compute resource scheduling, NVIDIA Run:ai’s GPU time-slicing adds fractional compute capabilities on top of NVIDIA Run:ai GPU fraction capabilities.

How GPU Time-Slicing Works

While the default NVIDIA GPU time-slicing allows for sharing the GPU compute runtime evenly without splitting or limiting the runtime of each workload, NVIDIA Run:ai’s GPU time-slicing mechanism gives each workload exclusive access to the full GPU for a limited amount of time (the lease time) in each scheduling cycle (the plan time). This cycle repeats itself for the lifetime of the workload. Using the GPU runtime this way guarantees that a workload is granted its requested GPU compute resources proportionally to its requested GPU fraction, but also allows splitting unused GPU compute time between workloads, up to a requested limit.

For example, when there are 2 workloads running on the same GPU, with NVIDIA’s default GPU time slicing, each workload gets 50% of the GPU compute runtime, even if one workload requests 25% of the GPU memory, and the other workload requests 75% of the GPU memory. With the NVIDIA Run:ai GPU time-slicing, the first workload will get 25% of the GPU compute time and the second will get 75%. If one of the workloads does not use its deserved GPU compute time, the others can split that time evenly between them. As shown in the example, if one of the workloads does not request the GPU for some time, the other will get the full GPU compute time.

GPU Time-Slicing Modes

NVIDIA Run:ai offers two GPU time-slicing modes:

  • Strict - Each workload gets its precise GPU compute fraction, which equals its requested GPU (memory) fraction. In terms of official Kubernetes resource specification, this means: gpu-compute-request = gpu-compute-limit = gpu-(memory-)fraction

  • Fair - Each workload is guaranteed at least its GPU compute fraction, but at the same time can also use additional GPU runtime compute slices that are not used by other idle workloads. Those excess time slices are divided equally between all workloads running on that GPU (after each got at least its requested GPU compute fraction). In terms of official Kubernetes resource specification, this means: gpu-compute-request = gpu-(memory-)fraction, gpu-compute-limit = 1.0

The figure below illustrates how Strict time-slicing mode uses the GPU from Lease (slice) and Plan (cycle) perspective:

The figure below illustrates how Fair time-slicing mode uses the GPU from Lease (slice) and Plan (cycle) perspective:

Time-Slicing Plan and Lease Times

Each GPU scheduling cycle is a plan. The plan is determined by the lease time and granularity (precision). By default, basic lease time is 250ms with 5% granularity (precision), which means the plan (cycle) time is: 250 / 0.05 = 5000ms (5 Sec). Using these values, a workload that requests gpu-fraction=0.5 gets 2.5s runtime out of the 5s cycle time.
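As a quick sanity check of the arithmetic above, this small shell snippet recomputes the default plan time and the per-plan runtime share for a gpu-fraction of 0.5, using the values from this section:

# Default values described above
LEASE_MS=250
GRANULARITY=0.05   # 5% precision
PLAN_MS=$(awk -v l="$LEASE_MS" -v g="$GRANULARITY" 'BEGIN{print l/g}')   # 5000 ms plan (cycle)
SHARE_MS=$(awk -v p="$PLAN_MS" 'BEGIN{print p*0.5}')                     # 2500 ms for gpu-fraction=0.5
echo "plan=${PLAN_MS}ms, runtime per plan for gpu-fraction 0.5: ${SHARE_MS}ms"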

Different workloads require different SLAs and precision, so it is also possible to tune the lease time and precision to customize the time-slicing capabilities of your cluster.

Note

Decreasing the lease time makes time-slicing less accurate. Increasing the lease time makes the system more accurate, but each workload is less responsive.

Once timeSlicing is enabled in the runaiconfig, all submitted GPU fraction or GPU memory workloads will have their gpu-compute-request/limit set automatically by the system, depending on the annotation used and the time-slicing mode:

  • Strict compute resources:

| Annotation | Value | GPU Compute Request | GPU Compute Limit |
| --- | --- | --- | --- |
| gpu-fraction | x | x | x |
| gpu-memory | x | 0 | 1.0 |

  • Fair compute resources:

| Annotation | Value | GPU Compute Request | GPU Compute Limit |
| --- | --- | --- | --- |
| gpu-fraction | x | x | 1.0 |
| gpu-memory | x | 0 | 1.0 |

Note

The above tables show that when submitting a workload using gpu-memory annotation, the system will split the GPU compute time between the different workloads running on that GPU. This means the workload can get anything from very little compute time (>0) to full GPU compute time (1.0).

Enabling GPU Time-Slicing

NVIDIA Run:ai’s GPU time-slicing is a cluster flag which changes the default NVIDIA time-slicing used by GPU fractions. For more details, see Advanced cluster configurations.

Enable GPU time-slicing by setting the following cluster flag in the runaiconfig file:

global:
  core:
    timeSlicing:
      mode: fair/strict

If the timeSlicing flag is not set, the system continues to use the default NVIDIA GPU time-slicing to maintain backward compatibility.
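If you prefer patching the resource directly, a hedged equivalent of the snippet above is shown below. It assumes timeSlicing sits under spec.global.core in the runaiconfig custom resource, mirroring the kubectl patch pattern used for the Node Level Scheduler later in this document:

kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' \
  --patch '{"spec":{"global":{"core":{"timeSlicing":{"mode":"fair"}}}}}'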

Optimize Performance with Node Level Scheduler

The Node Level Scheduler optimizes the performance of your pods and maximizes the utilization of GPUs by making optimal local decisions on GPU allocation to your pods. While the NVIDIA Run:ai Scheduler chooses the specific node for a pod, it has no visibility into the node’s GPUs' internal state. The Node Level Scheduler is aware of the local GPUs' states and makes optimal local decisions so that it can optimize both GPU utilization and the performance of the pods running on the node’s GPUs.

This guide provides an overview of the best use cases for the Node Level Scheduler and instructions for configuring it to maximize GPU performance and pod efficiency.

Deployment Considerations

  • While the Node Level Scheduler applies to all workload types, it will best optimize the performance of burstable workloads. Burstable workloads are workloads that use dynamic GPU fractions, giving them more GPU memory than requested, up to the specified limit.

  • Burstable workloads are always susceptible to an OOM Kill signal if the owner of the excess memory requires it back. This means that using the Node Level Scheduler with inference or training workloads may cause pod preemption.

  • Using interactive workloads with notebooks is the best use case for burstable workloads and the Node Level Scheduler. These workloads behave differently since the OOM Kill signal causes the notebook's GPU process to exit but not the notebook itself. This keeps the interactive pod running and able to retry attaching a GPU.

Interactive Notebooks Use Case

This use case is one scenario that shows how Node Level Scheduler locally optimizes and maximizes GPU utilization and workspaces’ performance.

  1. The figure below shows a node with 2 GPUs and 2 submitted workspaces:

  2. The Scheduler instructs the node to put the 2 workspaces on a single GPU, leaving the other GPU free for a workload that requires more resources. This means GPU#2 is idle while the two workspaces can only use up to half a GPU each, even if they temporarily need more:

  3. With the Node Level Scheduler enabled, the local decision will be to spread those 2 workspaces across 2 GPUs and allow them to maximize both workspaces’ performance and the GPUs’ utilization by bursting up to the full GPU memory and GPU compute resources:

  4. The NVIDIA Run:ai Scheduler still sees a node with one fully empty GPU and one fully occupied GPU. When a 3rd workload is scheduled, and it requires a full GPU (or more than 0.5 GPU), the Scheduler will schedule it to that node, and the Node Level Scheduler will move one of the workspaces to run with the other on GPU#1, as was the Scheduler’s initial plan. Moving the workspace from GPU#2 back to GPU#1 keeps the workspace running while the GPU process within the Jupyter notebook is killed and re-established on GPU#1, continuing to serve the workspace:

Using Node Level Scheduler

The Node Level Scheduler can be enabled per node pool. To use Node Level Scheduler, follow the below steps.

Enable on Your Cluster

  1. Enable the Node Level Scheduler at the cluster level (per cluster) by:

    1. Editing the runaiconfig as follows. For more details, see Advanced cluster configurations:

    spec:
      global:
        core:
          nodeScheduler:
            enabled: true

    2. Or, using the following kubectl patch command:

    kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' --patch '{"spec":{"global":{"core":{"nodeScheduler":{"enabled": true}}}}}'

Enable on a Node Pool

Note

GPU resource optimization is disabled by default. It must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.

Enable Node Level Scheduler on any of the node pools:

  1. Select Resources → Node pools

  2. Create a new node pool or edit an existing node pool

  3. Under the Resource Utilization Optimization tab, change the number of workloads on each GPU to any value other than Not Enforced (i.e. 2, 3, 4, 5)

The Node Level Scheduler is now ready to be used on that node pool.

Submit a Workload

In order for a workload to be considered by the Node Level Scheduler for rerouting, it must be submitted with a GPU Request and Limit where the Limit is larger than the Request:

  • Enable and set dynamic GPU fractions

  • Then submit a workload using dynamic GPU fractions


Nodes Maintenance

This section provides detailed instructions on how to manage both planned and unplanned node downtimes in a Kubernetes cluster running NVIDIA Run:ai. It covers all the steps to maintain service continuity and ensure the proper handling of workloads during these events.

Prerequisites

  • Access to Kubernetes cluster - Administrative access to the Kubernetes cluster, including permissions to run kubectl commands

  • Basic knowledge of Kubernetes - Familiarity with Kubernetes concepts such as nodes, taints, and workloads

  • NVIDIA Run:ai installation - The NVIDIA Run:ai software installed and configured within your Kubernetes cluster

  • Node naming conventions - Know the names of the nodes within your cluster, as these are required when executing the commands

Node Types

This section distinguishes between two types of nodes within a NVIDIA Run:ai installation:

  • Worker nodes - Nodes on which AI practitioners can submit and run workloads

  • NVIDIA Run:ai system nodes - Nodes on which the NVIDIA Run:ai software runs, managing the cluster's operations

Worker Nodes

Worker nodes are responsible for running workloads. When a worker node goes down, either due to planned maintenance or unexpected failure, workloads ideally migrate to other available nodes or wait in the queue to be executed when possible.

Training vs. Interactive Workloads

The following workload types can run on worker nodes:

  • Training workloads - These are long-running processes that, in case of node downtime, can automatically move to another node.

  • Interactive workloads - These are short-lived, interactive processes that require manual intervention to be relocated to another node.

Note

While training workloads can be automatically migrated, it is recommended to plan maintenance and manually manage this process for a faster response, as it may take time for Kubernetes to detect a node failure.

Planned Maintenance

Before stopping a worker node for maintenance, perform the following steps:

  1. Prevent new workloads on the node

    To stop the Kubernetes Scheduler from assigning new workloads to the node and to safely remove all existing workloads, copy the following command to your terminal:

    kubectl taint nodes <node-name> runai=drain:NoExecute
    • <node-name> Replace this placeholder with the actual name of the node you want to drain

    • kubectl taint nodes This command is used to add a taint to the node, which prevents any new pods from being scheduled on it

    • runai=drain:NoExecute This specific taint ensures that all existing pods on the node are evicted and rescheduled on other available nodes, if possible

    Result: The node stops accepting new workloads, and existing workloads either migrate to other nodes or are placed in a queue for later execution.

  2. Shut down and perform maintenance

    After draining the node, you can safely shut it down and perform the necessary maintenance tasks.

  3. Restart the node

    Once maintenance is complete and the node is back online, remove the taint to allow the node to resume normal operations. Copy the following command to your terminal:

    kubectl taint nodes <node-name> runai=drain:NoExecute-

    runai=drain:NoExecute- The - at the end of the command indicates the removal of the taint. This allows the node to start accepting new workloads again.

    Result: The node rejoins the cluster's pool of available resources, and workloads can be scheduled on it as usual.

Unplanned Downtime

In the event of unplanned downtime:

  1. Automatic restart If a node fails but immediately restarts, all services and workloads automatically resume.

  2. Extended downtime

    If the node remains down for an extended period, drain the node to migrate workloads to other nodes. Copy the following command to your terminal:

    kubectl taint nodes <node-name> runai=drain:NoExecute

    The command works the same as in the planned maintenance section, ensuring that no workloads remain scheduled on the node while it is down.

  3. Reintegrate the node

    Once the node is back online, remove the taint to allow it to rejoin the cluster's operations. Copy the following command to your terminal:

    kubectl taint nodes <node-name> runai=drain:NoExecute- 

    Result: This action reintegrates the node into the cluster, allowing it to accept new workloads.

  4. Permanent shutdown

    If the node is to be permanently decommissioned, remove it from Kubernetes with the following command:

    kubectl delete node <node-name>
    • kubectl delete node This command completely removes the node from the cluster

    • <node-name> Replace this placeholder with the actual name of the node

    Result: The node is no longer part of the Kubernetes cluster. If you plan to bring the node back later, it must be rejoined to the cluster using the steps outlined in the next section.

NVIDIA Run:ai System Nodes

In a production environment, the services responsible for scheduling, submitting and managing NVIDIA Run:ai workloads operate on one or more NVIDIA Run:ai system nodes. It is recommended to have more than one system node to ensure high availability. If one system node goes down, another can take over, maintaining continuity. If a second system node does not exist, you must designate another node in the cluster as a temporary NVIDIA Run:ai system node to maintain operations.

The protocols for handling planned maintenance and unplanned downtime are identical to those for worker nodes. Refer to the above section for detailed instructions.

Rejoining a Node into the Kubernetes Cluster

To rejoin a node to the Kubernetes cluster, follow these steps:

  1. Generate a join command on the master node

    On the master node, copy the following command to your terminal:

    kubeadm token create --print-join-command
    • kubeadm token create This command generates a token that can be used to join a node to the Kubernetes cluster.

    • --print-join-command This option outputs the full command that needs to be run on the worker node to rejoin it to the cluster.

    Result: The command outputs a kubeadm join command.

  2. Run the join command on the worker node

    Copy the kubeadm join command generated from the previous step and run it on the worker node that needs to rejoin the cluster.

    kubeadm join <master-ip>:<master-port> \
        --token <token> \
        --discovery-token-ca-cert-hash sha256:<hash>

    The kubeadm join command re-enrolls the node into the cluster, allowing it to start participating in the cluster's workload scheduling.

  3. Verify node rejoining

    Verify that the node has successfully rejoined the cluster by running:

    kubectl get nodes
    • kubectl get nodes This command lists all nodes currently part of the Kubernetes cluster, along with their status

    Result: The rejoined node should appear in the list with a status of Ready

  4. Re-label nodes

    Once the node is ready, ensure it is labeled according to its role within the cluster.
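For example, if the rejoined node previously served as a GPU worker, you would re-apply the corresponding role label from the Node Roles section (the node name below is a placeholder):

kubectl label nodes <node-name> node-role.kubernetes.io/runai-gpu-worker=true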

To uninstall the NVIDIA Run:ai cluster, run the following helm command in your terminal:

helm uninstall runai-cluster -n runai

To remove the NVIDIA Run:ai cluster from the NVIDIA Run:ai platform, see Removing a cluster.

Note

Uninstall of NVIDIA Run:ai cluster from the Kubernetes cluster does not delete existing projects, departments or workloads submitted by users.

helm repo add runai-backend https://runai.jfrog.io/artifactory/cp-charts-prod
helm repo update
helm upgrade -i runai-backend -n runai-backend runai-backend/control-plane \ 
    --set global.domain=<DOMAIN>

Adapting AI Initiatives to Your Organization

AI initiatives refer to advancing research, development, and implementation of AI technologies. These initiatives represent your business needs and involve collaboration between individuals, teams, and other stakeholders. AI initiatives require compute resources and a methodology to effectively and efficiently use those compute resources and split them among the different AI initiatives stakeholders. The building blocks of AI compute resources are GPUs, CPUs, and memory, which are built into nodes (servers) and can be further grouped into node pools. Nodes and node pools are part of a Kubernetes cluster.

To manage AI initiatives in NVIDIA Run:ai you should:

  • Map your organization and initiatives to projects and optionally departments

  • Map compute resources (node pools and quotas) to projects and optionally departments

  • Assign users (e.g. AI practitioners, ML engineers, Admins) to projects and departments

Mapping Your Organization

The way you map your AI initiatives and organization into NVIDIA Run:ai projects and departments should reflect your organization’s structure and Project management practices. There are multiple options, and we provide you here with 3 examples of typical forms in which to map your organization, initiatives, and users into NVIDIA Run:ai, but of course, other ways that suit your requirements are also acceptable.

Based on Individuals

A typical use case would be students (individual practitioners) within a faculty (business unit) - an individual practitioner may be involved in one or more initiatives. In this example, the resources are accounted for by the student (project) and aggregated per faculty (department).

Department = business unit / Project = individual practitioner

Based on Business Units

A typical use case would be an AI service (business unit) split into AI capabilities (initiatives) - an individual practitioner may be involved in several initiatives. In this example, the resources are accounted for by Initiative (project) and aggregated per AI service (department).

Department = business unit / Project = initiative

Based on the Organizational Structure

A typical use case would be a business unit split into teams - an individual practitioner is involved in a single team (project) but the team may be involved in several AI initiatives. In this example, the resources are accounted for by team (project) and aggregated per business unit (department).

Department = business unit / Project = team

Mapping Your Resources

AI initiatives require compute resources such as GPUs and CPUs to run. Compute resources in any organization are limited, whether because the number of servers (nodes) the organization owns is limited, or because the budget available to lease cloud resources or purchase in-house servers is limited. Every organization strives to optimize the usage of its resources by maximizing their utilization and providing all users with their needs. Therefore, the organization needs to split resources according to the organization's internal priorities and budget constraints. But even after splitting the resources, the orchestration layer should still provide fairness between the resource consumers, and allow access to unused resources to minimize scenarios of idle resources.

Another aspect of resource management is how to group your resources effectively, especially in large environments or environments made up of heterogeneous hardware types, where some users need specific hardware types, or where some users should avoid occupying hardware that is critical for other users or initiatives.

NVIDIA Run:ai assists you with these complex issues by allowing you to map your cluster resources to node pools, assign each project and department a quota allocation per node pool, and set access rights to unused resources (over quota) per node pool.

Grouping Your Resources

There are several reasons why you would group resources (nodes) into node pools:

  • Control the GPU type to use in a heterogeneous hardware environment - in many cases, AI models are optimized for the hardware type they run on, e.g. a training workload that is optimized for H100 does not necessarily run optimally on an A100, and vice versa. Segmenting nodes into node pools, each with a different hardware type, gives the AI researcher and ML engineer better control of where to run.

  • Quota control - splitting into node pools allows the admin to set a specific quota per hardware type, e.g. give a high-priority project guaranteed access to advanced GPU hardware, while keeping a lower-priority project with a lower quota, or even no quota at all, for that high-end GPU and giving it “best-effort” access only (i.e. only when the high-priority, guaranteed project is not using those resources).

  • Multi-region or multi-availability-zone cloud environments - if some or all of your clusters run in the cloud (or even on-premises) and use different physical locations or different topologies (e.g. racks), you probably want to segment your resources per region/zone/topology so you can control where your workloads run and how much quota to assign to specific environments (per project, per department), even if all those locations use the same hardware type. This methodology can also improve workload performance thanks to locality, such as the locality of distributed workloads, local storage, etc.

  • Explainability and predictability - large environments are complex to understand, and this becomes even harder when the environment is heavily loaded. To maintain users’ satisfaction and their understanding of the resources’ state, and to keep your workloads’ chances of getting scheduled predictable, segmenting your cluster into smaller pools can help significantly.

  • Scale - NVIDIA Run:ai’s implementation of node pools has many benefits, one of the main ones being scale. Each node pool has its own Scheduler instance, allowing the cluster to handle more nodes and schedule workloads faster when segmented into node pools vs. one large cluster. To allow your workloads to use any resource within a cluster that is split into node pools, a second-level Scheduler is in charge of scheduling workloads between node pools according to your preferences and resource availability.

  • Prevent mutual exclusion - some AI workloads consume CPU-only resources. To prevent those workloads from consuming the CPU resources of GPU nodes and blocking GPU workloads from using those nodes, it is recommended to group CPU-only nodes into dedicated node pool(s), assign quota for CPU projects to CPU node pools only, and keep GPU node pools with zero quota and, optionally, “best-effort” over quota access for CPU-only projects. A minimal node-labeling sketch follows this list.
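In practice, a node pool groups nodes that share a node label. As a minimal sketch (the label key, values, and node names below are placeholders rather than NVIDIA Run:ai defaults; a node pool is associated with whatever node label you specify when creating it in the UI or API), nodes can be labeled with standard kubectl commands:

# Hypothetical labels used to group nodes into "h100", "a100" and "cpu-only" pools
kubectl label node worker-gpu-01 node-pool=h100
kubectl label node worker-gpu-02 node-pool=a100
kubectl label node worker-cpu-01 node-pool=cpu-only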

Grouping Examples

Set out below are illustrations of different grouping options.

Example: grouping nodes by topology

Example: grouping nodes by hardware type

Assigning Your Resources

After the initial grouping of resources, it is time to associate resources with AI initiatives. This is performed by assigning quotas to projects and, optionally, to departments. Assigning GPU quota to a project, on a node pool basis, means that the workloads submitted by that project are entitled to use those GPUs as guaranteed resources and can use them for all workload types.

However, what happens if the project requires more resources than its quota? This depends on the type of workloads that the user wants to submit. If the user requires more resources for non-preemptible workloads, then the quota must be increased, because non-preemptible workloads require guaranteed resources. On the other hand, if the workload is preemptible - for example, a model training workload - the project can exploit unused resources of other projects, as long as the other projects don’t need them. Over quota is set per project on a node pool basis and per department.

Administrators can use quota allocations to prioritize resources between users, teams, and AI initiatives. An administrator can completely prevent a project or department from using certain node pools by setting the node pool quota to 0 and disabling over quota for that node pool, or can keep the quota at 0 and enable over quota for that node pool, allowing access based on resource availability only (e.g. unused GPUs). However, when a project with a non-zero quota needs to use those resources, the Scheduler reclaims those resources back and preempts the preemptible workloads of over quota projects. As an administrator, you can also influence the amount of over quota resources a project or department uses.

It is essential to make sure that the sum of all projects’ quotas does NOT surpass that of their department, and that the sum of all departments’ quotas does not surpass the number of physical resources, per node pool and for the entire cluster (we call such behavior ‘over-subscription’). Over-subscription is not recommended because it may produce unexpected scheduling decisions, such as preempting ‘non-preemptible’ workloads or failing to schedule in-quota workloads, whether non-preemptible or preemptible, so that quota can no longer be considered ‘guaranteed’. Admins can opt in to a system flag that helps prevent over-subscription scenarios.
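For example (illustrative numbers only): if a department is assigned a quota of 20 GPUs in node pool A, its projects’ quotas in that node pool could be split as 10 + 6 + 4 = 20 GPUs, but not 10 + 8 + 6 = 24 GPUs. Likewise, if node pool A physically contains 40 GPUs, the quotas of all departments in that node pool should sum to at most 40.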

Example: assigning resources to projects

Assigning Users to Projects and Departments

The NVIDIA Run:ai system uses ‘Role Based Access Control’ (RBAC) to manage users’ access rights to the different objects of the system, its resources, and the set of allowed actions. To allow AI researchers, ML engineers, project admins, or any other stakeholder of your AI initiatives to access projects and use AI compute resources for their AI initiatives, the administrator needs to assign users to projects. After a user is assigned to a project with the proper role, e.g. ‘L1 Researcher’, the user can submit and monitor their workloads under that project. Assigning users to departments is usually done to assign a ‘Department Admin’ to manage a specific department. Other roles, such as ‘L1 Researcher’, can also be assigned at the department level; this gives the researcher access to all projects within that department.

Scopes in an Organization

This is an example of an organization, as represented in the NVIDIA Run:ai platform:

The organizational tree is structured top-down under a single root node, the account. The account is comprised of clusters, departments, and projects.

After mapping and building your hierarchically structured organization as shown above, you can assign or associate various NVIDIA Run:ai components (e.g. workloads, roles, assets, policies, and more) to different parts of the organization - these organizational parts are the Scopes. The following organizational example consists of 5 optional scopes:

Note

When a scope is selected, that unit and all of its subordinates (both existing and any future subordinates, if added) are selected as well.

Next Steps

Now that resources are grouped into node pools, organizational units or business initiatives are mapped into projects and departments, projects’ quota parameters are set per node pool, and users are assigned to projects, you can finally submit workloads from a project and use compute resources to run your AI initiatives.

Workload Policies

This section explains the procedure to manage workload policies.

Workload Policies Table

The Workload policies table can be found under Policies in the NVIDIA Run:ai platform.

Note

Workload policies are disabled by default. If you cannot see Workload policies in the menu, it must be enabled by your administrator under General settings → Workloads → Policies

The Workload policies table provides a list of all the policies defined in the platform, and allows you to manage them.

The Workload policies table consists of the following columns:

Column
Description

Policy

The policy name which is a combination of the policy scope and the policy type

Type

The policy type is per NVIDIA Run:ai workload type. This allows administrators to set different policies for each workload type.

Status

Representation of the policy lifecycle (one of the following - “Creating…”, “Updating…”, “Deleting…”, Ready or Failed)

Scope

The scope the policy affects. Click the name of the scope to view the organizational tree diagram. You can only view the parts of the organizational tree for which you have permission to view.

Created by

The user who created the policy

Creation time

The timestamp for when the policy was created

Last updated

The last time the policy was updated

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Refresh - Click REFRESH to update the table with the latest data

Adding a Policy

To create a new policy:

  1. Click +NEW POLICY

  2. Select a scope

  3. Select the workload type

  4. Click +POLICY YAML

  5. In the YAML editor, type or paste a YAML policy with defaults and rules (a rough illustrative sketch also follows these steps). You can utilize the following references and examples:

    • Policy YAML reference

    • Policy YAML examples

  6. Click SAVE POLICY
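The YAML you paste consists of two top-level sections, defaults and rules, as described in step 5. As a rough, hypothetical sketch of that shape only (the nested field and rule names below are illustrative assumptions, not the authoritative schema; always validate against the Policy YAML reference and examples):

defaults:
  compute:
    cpuCoreRequest: 0.5
rules:
  compute:
    gpuDevicesRequest:
      required: true
      max: 1

The defaults section pre-fills workload submission parameters, while the rules section constrains what users in the selected scope may set.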

Editing a Policy

  1. Select the policy you want to edit

  2. Click EDIT

  3. Update the policy and click APPLY

  4. Click SAVE POLICY

Troubleshooting

Listed below are issues that might occur when creating or editing a policy via the YAML Editor:

Issue
Message
Mitigation

Cluster connectivity issues

There's no communication from cluster “cluster_name“. Actions may be affected, and the data may be stale.

Verify that you are on a network that has been allowed access to the cluster. Reach out to your cluster administrator for instructions on verifying the issue.

Policy can’t be applied due to a rule that is occupied by a different policy

Field “field_name” already has rules in cluster: “cluster_id”

Remove the rule from the new policy or adjust the old policy for the specific rule.

Policy is not visible in the UI

-

Check that the policy hasn’t been deleted.

Policy syntax is not valid

Add a valid policy YAML;json: unknown field "field_name"

For correct syntax, check the Policy YAML reference or the Policy YAML examples.

Policy can’t be saved for some reason

The policy couldn't be saved due to a network or other unknown issue. Download your draft and try pasting and saving it again later.

Possible cluster connectivity issues. Try updating the policy once again at a different time.

Policies were submitted before version 2.18, you upgraded to version 2.18 or above and wish to submit new policies

If you have policies and want to create a new one, first contact NVIDIA Run:ai support to prevent potential conflicts

Contact NVIDIA Run:ai support. R&D can migrate your old policies to the new version.

Viewing a Policy

To view a policy:

  1. Select the policy you want to view.

  2. Click VIEW POLICY

  3. In the Policy form per workload section, view the workload rules and defaults:

    • Parameter - The workload submission parameter that Rules and Defaults are applied to

    • Type (applicable for data sources only) - The data source type (Git, S3, nfs, pvc etc.)

    • Default - The default value of the Parameter

    • Rule - Constraints set on the workload policy field

    • Source - The origin of the applied policy (cluster, department or project)

Note

Some of the rules and defaults may be derived from policies of a parent cluster and/or department. You can see the source of each rule in the policy form. For more information, check the Scope of effectiveness documentation.

Deleting a Policy

  1. Select the policy you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm the deletion

Using API

Go to the Policies API reference to view the available actions.

Reports

This section explains the procedure of managing reports in NVIDIA Run:ai.

Reports allow users to access and organize large amounts of data in a clear, CSV-formatted layout. They enable users to monitor resource consumption, analyze trends, and make data-driven decisions to optimize their AI workloads effectively.

Note

Reports are enabled by default for SaaS. To enable this feature for self-hosted, additional configurations must be added. See Enabling reports for self-hosted accounts.

Report Types

Currently, only “Consumption Reports” are available, which provide insights into the consumption of resources such as GPU, CPU, and CPU memory across organizational units.

Reports Table

The Reports table can be found under Analytics in the NVIDIA Run:ai platform.

The Reports table provides a list of all the reports defined in the platform and allows you to manage them.

Users can access the reports they have generated themselves. Users with project viewing permissions across the entire tenant can access all reports within the tenant.

The Reports table comprises the following columns:

Column
Description

Report

The name of the report

Description

The description of the report

Status

The report's lifecycle phase, representing its condition (see Reports Status below)

Type

The type of the report – e.g., consumption

Created by

The user who created the report

Creation time

The timestamp of when the report was created

Collection period

The period in which the data was collected

Reports Status

The following table describes the reports' condition and whether they were created successfully:

Status
Description

Ready

Report is ready and can be downloaded as CSV

Pending

Report is in the queue and waiting to be processed

Failed

The report couldn’t be created

Processing...

The report is being created

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

Creating a New Report

Before you start, make sure you have a project.

To create a new report:

  1. Click +NEW REPORT

  2. Enter a name for the report (if the name already exists, you will need to choose a different one)

  3. Optional: Provide a description of the report

  4. Set the report’s data collection period

    • Start date - The date at which the report data commenced

    • End date - The date at which the report data concluded

  5. Set the report segmentation and filters

    • Filters - Filter by project or department name

    • Segment by - Data is collected and aggregated based on the segment

  6. Click CREATE REPORT

Deleting a Report

  1. Select the report you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm

Downloading a report

Note

To download, the report must be in status “Ready”.

  1. Select the report you want to download

  2. Click DOWNLOAD CSV

Enabling Reports for Self-Hosted Accounts

Reports must be saved in an S3-compatible storage solution. To activate this feature for self-hosted accounts, the storage must be linked to the account by adding the configuration to two ConfigMap objects in the control plane.

  1. Edit the runai-backend-org-unit-service ConfigMap:

    kubectl edit cm runai-backend-org-unit-service -n runai-backend
  2. Add the following lines to the file:

    S3_ENDPOINT: <S3_END_POINT_URL>
    S3_ACCESS_KEY_ID: <S3_ACCESS_KEY_ID>
    S3_ACCESS_KEY: <S3_ACCESS_KEY>
    S3_USE_SSL: "true"
    S3_BUCKET: <BUCKET_NAME>
  3. Edit the runai-backend-metrics-service ConfigMap:

    kubectl edit cm runai-backend-metrics-service -n runai-backend
  4. Add the following lines to the file:

    S3_ENDPOINT: <S3_END_POINT_URL>
    S3_ACCESS_KEY_ID: <S3_ACCESS_KEY_ID>
    S3_ACCESS_KEY: <S3_ACCESS_KEY>
    S3_USE_SSL: "true"
  5. In addition, in the same file, under the config.yaml section, add the following immediately after log_level: \"Info\" (the rendered result is shown after these steps):

    reports:\n s3_config:\n bucket: \"<BUCKET_NAME>\"\n
  6. Restart the deployments:

    kubectl rollout restart deployment runai-backend-metrics-service runai-backend-org-unit-service -n runai-backend
  7. Refresh the page to see Reports under Analytics in the NVIDIA Run:ai platform.
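For clarity, the escaped string in step 5 is presumably intended to render, inside the config.yaml value, as the following nested YAML (the bucket name is the same placeholder as above):

reports:
  s3_config:
    bucket: "<BUCKET_NAME>"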

Using API

To view the available actions, go to the Reports API reference.

How the Scheduler Works

Efficient resource allocation is critical for managing AI and compute-intensive workloads in Kubernetes clusters. The NVIDIA Run:ai Scheduler enhances Kubernetes' native capabilities by introducing advanced scheduling principles such as fairness, quota management, and dynamic resource balancing. It ensures that workloads, whether simple single-pod or complex distributed tasks, are allocated resources effectively while adhering to organizational policies and priorities.

This guide explores the NVIDIA Run:ai Scheduler’s allocation process, preemption mechanisms, and resource management. Through examples and detailed explanations, you'll gain insights into how the Scheduler dynamically balances workloads to optimize cluster utilization and maintain fairness across projects and departments.

Allocation Process

Pod Creation and Grouping

When a workload is submitted, the workload controller creates a pod or pods (for distributed training workloads or deployment-based inference). When the Scheduler receives a submit request with the first pod, it creates a pod group and allocates all the relevant building blocks of that workload. Subsequent pods of the same workload are attached to the same pod group.

Queue Management

A workload, with its associated pod group, is queued in the appropriate scheduling queue. In every scheduling cycle, the Scheduler ranks the order of queues by calculating their precedence for scheduling.

Resource Binding

The next step is for the Scheduler to find nodes for those pods, assign the pods to their nodes (bind operation), and bind other building blocks of the pods such as storage, ingress and so on. If the pod group has a gang scheduling rule attached to it, the Scheduler either allocates and binds all pods together, or puts all of them into pending state. It retries to schedule them all together in the next scheduling cycle. The Scheduler also updates the status of the pods and their associated pod group. Users are able to track the workload submission process both in the CLI or NVIDIA Run:ai UI. For more details on submitting and managing workloads, see Workloads.

Preemption

If the Scheduler cannot find resources for the submitted workload (and all of its associated pods), and the workload deserves resources either because it is under its queue quota or under its queue fairshare, the Scheduler tries to reclaim resources from other queues. If this does not solve the resource issue, the Scheduler tries to preempt lower priority preemptible workloads within the same queue (project).

Reclaim Preemption Between Projects and Departments

Reclaim is an inter-project and inter-department resource balancing action that takes back resources from one project or department that has used them as an over quota. It returns the resources back to a project (or department) that deserves those resources as part of its deserved quota, or to balance fairness between projects (or departments), so a project (or department) does not exceed its fairshare (portion of the unused resources).

This mode of operation means that a lower priority workload submitted in one project (e.g. training) can reclaim resources from a project that runs a higher priority workload (e.g. a preemptible workspace) if fairness balancing is required.

Note

Only preemptible workloads can go over quota, as they are susceptible to reclaim (cross-project preemption) of the over quota resources they are using. The amount of over quota resources a project can gain depends on the over quota weight or quota (if over quota weight is disabled). A department’s over quota is always proportional to its quota.

Priority Preemption Within a Project

Higher priority workloads may preempt lower priority preemptible workloads within the same project/node pool queue. For example, in a project that runs a training workload that exceeds the project quota for a certain node pool, a newly submitted workspace within the same project/node pool may stop (preempt) the training workload if there are not enough over quota resources for the project within that node pool to run both workloads (e.g. workspace using in-quota resources and training using over quota resources).

Note

Workload priority applies only within the same project and does not influence workloads across different projects, where fairness determines precedence.

Quota, Over Quota, and Fairshare

The NVIDIA Run:ai Scheduler strives to ensure fairness between projects and between departments. This means each department and project always strives to get its deserved quota, and unused resources are split between projects according to known rules (e.g. over quota weights).

If a project needs more resources even beyond its fairshare, and the Scheduler finds unused resources that no other project needs, this project can consume resources even beyond its fairshare.

Some scenarios can prevent the Scheduler from fully providing deserved quota and fairness:

  • Fragmentation or other scheduling constraints such as affinities, taints, etc.

  • Some requested resources, such as GPUs and CPU memory, can be allocated, while others, like CPU cores, are insufficient to meet the request. As a result, the Scheduler will place the workload in a pending state until the required resource becomes available.

Example of Splitting Quota

The example below illustrates a split of quota between different projects and departments using several node pools:

The example below illustrates how fairshare is calculated per project/node pool for the above example:

  • For each Project:

    • The over quota (OQ) portion of each project (per node pool) is calculated as:

    [(OQ-Weight) / (Σ Projects OQ-Weights)] x (Unused Resource per node pool)

    • Fairshare is calculated as the sum of quota + over quota.

  • In Project 2, we assume that out of the 36 available GPUs in node pool A, 20 GPUs are currently unused. This means either these GPUs are not part of any project’s quota, or they are part of a project’s quota but not used by any workloads of that project:

    • Project 2 over quota share:

      [(Project 2 OQ-Weight) / (Σ all Projects OQ-Weights)] x (Unused Resource within node pool A)

      [(3) / (2 + 3 + 1)] x (20) = (3/6) x 20 = 10 GPUs

    • Fairshare = deserved quota + over quota = 6 + 10 = 16 GPUs. Similarly, fairshare is also calculated for CPU and CPU memory. The Scheduler can grant a project more resources than its fairshare if the Scheduler finds resources not required by other projects that may deserve those resources.

  • In Project 3, fairshare = deserved quota + over quota = 0 + 3 = 3 GPUs. Project 3 has no guaranteed quota, but it still has a share of the excess resources in node pool A. The NVIDIA Run:ai Scheduler ensures that Project 3 receives its part of the unused resources for over quota, even if this results in reclaiming resources from other projects and preempting preemptible workloads.

Fairshare Balancing

The Scheduler constantly re-calculates the fairshare of each project and department per node pool, represented in the scheduler as queues, resulting in the re-balancing of resources between projects and between departments. This means that a preemptible workload that was granted resources to run in one scheduling cycle, can find itself preempted and go back to pending state while waiting for resources in the next cycle.

A queue, representing a scheduler-managed object for each project or department per node pool, can be in one of 3 states:

  • In-quota: The queue’s allocated resources ≤ queue deserved quota. The Scheduler’s first priority is to ensure each queue receives its deserved quota.

  • Over quota but below fairshare: The queue’s deserved quota < queue’s allocated resources <= queue’s fairshare. The Scheduler tries to find and allocate more resources to queues that need resources beyond their deserved quota and up to their fairshare.

  • Over-fairshare and over quota: The queue’s fairshare < queue’s allocated resources. The Scheduler tries to allocate resources to queues that need even more resources beyond their fairshare.

When re-balancing resources between queues of different projects and departments, the Scheduler goes in the opposite direction, i.e. it first takes resources from over-fairshare queues, then from over quota queues, and finally, in some scenarios, even from queues that are below their deserved quota.

Next Steps

Now that you have gained insights into how the Scheduler dynamically balances workloads to optimize cluster utilization and maintain fairness across projects and departments, you can submit workloads. Before submitting your workloads, it’s important to familiarize yourself with the following key topics:

  • Introduction to workloads - Learn what workloads are and what is supported for both NVIDIA Run:ai and third-party workloads.

  • NVIDIA Run:ai workload types - Explore the various NVIDIA Run:ai workload types available and understand their specific purposes to enable you to choose the most appropriate workload type for your needs.

Workload Priority Control

The workload priority management feature allows you to change the priority of a workload within a project. The priority determines the workload's position in the project scheduling queue managed by the NVIDIA Run:ai Scheduler. By adjusting the priority, you can increase the likelihood that a workload will be scheduled and preferred over others within the same project, ensuring that critical tasks are given higher priority and resources are allocated efficiently.

You can change the priority of a workload by selecting one of the predefined values from the NVIDIA Run:ai priority dictionary. This can be done using the NVIDIA Run:ai UI, API or CLI, depending on the workload type.

Note

This applies only within a single project. It does not impact the scheduling queues or workloads of other projects.

Priority Dictionary

Workload priority is defined by selecting a string name from a predefined list in the NVIDIA Run:ai priority dictionary. Each string corresponds to a specific Kubernetes PriorityClass, which in turn determines scheduling behavior, such as whether the workload is preemptible or allowed to run over quota.
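If you want to see which PriorityClass objects exist in your cluster, you can use a standard kubectl query (not a NVIDIA Run:ai-specific command; the NVIDIA Run:ai classes appear alongside any others defined in the cluster):

kubectl get priorityclasses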

Note

The numeric priority levels (1 = highest, 4 = lowest) are descriptive only and are not part of the NVIDIA Run:ai priority dictionary.

Priority Level
Name (string)
Preemption
Over Quota

1

inference

Non-preemptible

Not available

2

build

Non-preemptible

Not available

3

interactive-preemptible

Preemptible

Available

4

train

Preemptible

Available

Preemptible vs Non-Preemptible Workloads

  • Non-preemptible workloads must run within the project’s deserved quota, cannot use over-quota resources, and will not be interrupted once scheduled.

  • Preemptible workloads can use opportunistic compute resources beyond the project’s quota but may be interrupted at any time.

Default Priority per Workload

Both NVIDIA Run:ai and third-party workloads are assigned a default priority. The below table shows the default priority per workload type:

Workload Type
Default Priority

Workspaces

build

Training

train

Inference

inference

Third-party workloads

train

NVIDIA Cloud Functions (NVCF)

inference

Supported Priority Overrides per Workload

Note

Changing a workload’s priority may impact its ability to be scheduled. For example, switching a workload from a train priority (which allows over-quota usage) to build priority (which requires in-quota resources) may reduce its chances of being scheduled in cases where the required quota is unavailable.

The below table shows the default priority listed in the previous section and the supported override options per workload:

Workload Type
interactive-preemptible
build
train
inference

How to Override Priority

You can override the default priority when submitting a workload through the UI, API, or CLI depending on the workload type.

Workspaces

To use the override options:

  • UI: Enable "Allow the workload to exceed the project quota" when submitting a workspace

  • API: Set PriorityClass in the Workspaces API

  • CLI: Submit a workspace using the --priority flag

    runai workspace submit --priority priority-class

Training Workloads

To use the override options:

  • API: Set PriorityClass in the Trainings API

  • CLI: Submit training using the --priority flag

    runai training submit --priority priority-class

Set Up SSO with OpenShift

Single Sign-On (SSO) is an authentication scheme, allowing users to log-in with a single pair of credentials to multiple, independent software systems.

This article explains the procedure to configure single sign-on to NVIDIA Run:ai using the OpenID Connect protocol in OpenShift V4.

Prerequisites

Before starting, make sure you have the following available from your OpenShift cluster:

  • OpenShift OAuth client:

    • ClientID - The ID used to identify the client with the Authorization Server.

    • Client Secret - A secret password that only the Client and Authorization Server know.

  • Base URL - The OpenShift API Server endpoint (for example, https://api.<cluster-url>:6443)

Setup

Adding the Identity Provider

  1. Go to General settings

  2. Open the Security section and click +IDENTITY PROVIDER

  3. Select OpenShift V4

  4. Enter the Base URL, Client ID, and Client Secret from your OpenShift OAuth client.

  5. Copy the Redirect URL to be used in your OpenShift OAuth client

  6. Optional: Enter the user attributes and their value in the identity provider as shown in the below table

  7. Click SAVE

  8. Optional: Enable Auto-Redirect to SSO to automatically redirect users to your configured identity provider’s login page when accessing the platform.

Attribute
Default value in NVIDIA Run:ai
Description

User role groups

GROUPS

If it exists in the IDP, it allows you to assign NVIDIA Run:ai role groups via the IDP. The IDP attribute must be a list of strings.

Linux User ID

UID

If it exists in the IDP, it allows researcher containers to start with the Linux User UID. Used to map access to network resources such as file systems to users. The IDP attribute must be of type integer.

Linux Group ID

GID

If it exists in the IDP, it allows researcher containers to start with the Linux Group GID. The IDP attribute must be of type integer.

Supplementary Groups

SUPPLEMENTARYGROUPS

If it exists in the IDP, it allows researcher containers to start with the relevant Linux supplementary groups. The IDP attribute must be a list of integers.

Email

email

Defines the user attribute in the IDP holding the user's email address, which is the user identifier in NVIDIA Run:ai

User first name

firstName

Used as the user’s first name appearing in the NVIDIA Run:ai platform

User last name

lastName

Used as the user’s last name appearing in the NVIDIA Run:ai platform

Testing the Setup

  1. Open the NVIDIA Run:ai platform as an admin

  2. Add access rules to an SSO user defined in the IDP

  3. Open the NVIDIA Run:ai platform in an incognito browser tab

  4. On the sign-in page, click CONTINUE WITH SSO. You are redirected to the OpenShift IDP sign-in page

  5. In the identity provider sign-in page, log in with the SSO user to whom you granted access rules

  6. If you are unsuccessful signing in to the identity provider, follow the Troubleshooting section below

Editing the Identity Provider

You can view the identity provider details and edit its configuration:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider box, click Edit identity provider

  4. You can edit either the Base URL, Client ID, Client Secret, or the User attributes

Removing the Identity Provider

You can remove the identity provider configuration:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider card, click Remove identity provider

  4. In the dialog, click REMOVE to confirm

Note

To avoid losing access, removing the identity provider must be carried out by a local user.

Troubleshooting

If testing the setup was unsuccessful, try the different troubleshooting scenarios according to the error you received.

Troubleshooting Scenarios

Error: "403 - Sorry, we can’t let you see this page. Something about permissions…"

Description: The authenticated user is missing permissions

Mitigation:

  1. Validate that either the user or its related group(s) are assigned with access rules

  2. Validate groups attribute is available in the configured OIDC Scopes

  3. Validate the user’s groups attribute is mapped correctly

Advanced:

  1. Open the Chrome DevTools: Right-click on page → Inspect → Console tab

  2. Run the following command to retrieve and copy the user’s token: localStorage.token;

  3. Paste the token in https://jwt.io

  4. Under the Payload section validate the value of the user’s attributes

Error: "401 - We’re having trouble identifying your account because your email is incorrect or can’t be found."

Description: Authentication failed because email attribute was not found.

Mitigation:

  1. Validate email attribute is available in the configured OIDC Scopes

  2. Validate the user’s email attribute is mapped correctly

Error: "Unexpected error when authenticating with identity provider"

Description: User authentication failed

Mitigation: Validate that the configured OIDC Scopes exist and match the Identity Provider’s available scopes

Advanced: Look for the specific error message in the URL address

Error: "Unexpected error when authenticating with identity provider (SSO sign-in is not available)"

Description: User authentication failed

Mitigation:

  1. Validate that the configured OIDC scope exists in the Identity Provider

  2. Validate that the configured Client Secret matches the Client Secret value in the OAuthclient Kubernetes object.

Advanced: Look for the specific error message in the URL address

Error: "unauthorized_client"

Description: OIDC Client ID was not found in the OpenShift IDP

Mitigation: Validate that the configured Client ID matches the value in the OAuthclient Kubernetes object

Launching Workloads with GPU Fractions

This quick start provides a step-by-step walkthrough for running a Jupyter Notebook workspace using GPU fractions.

NVIDIA Run:ai’s GPU fractions provides an agile and easy-to-use method to share a GPU or multiple GPUs across workloads. With GPU fractions, you can divide the GPU/s memory into smaller chunks and share the GPU/s compute resources between different workloads and users, resulting in higher GPU utilization and more efficient resource allocation.

Note

If enabled by your Administrator, the NVIDIA Run:ai UI allows you to create a new workload using either of the available submission forms. The steps in this quick start guide reflect the Original form only.

Prerequisites

Before you start, make sure:

  • You have created a project or have one created for you.

  • The project has an assigned quota of at least 0.5 GPU.

Step 1: Logging In

Step 2: Submitting a Workspace

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Workspace

  3. Select under which cluster to create the workload

  4. Select the project in which your workspace will run

  5. Select a preconfigured template or select Start from scratch to launch a new workspace quickly

  6. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Select the ‘jupyter-lab’ environment for your workspace (Image URL: jupyter/scipy-notebook)

    • If the ‘jupyter-lab’ is not displayed in the gallery, follow the below steps:

      • Click +NEW ENVIRONMENT

      • Enter a name for the environment. The name must be unique.

      • Enter the jupyter-lab Image URL - jupyter/scipy-notebook

      • Tools - Set the connection for your tool

        • Click +TOOL

        • Select Jupyter tool from the list

      • Set the runtime settings for the environment

        • Click +COMMAND

        • Enter command - start-notebook.sh

        • Enter arguments - --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''

        Note: If is enabled on the cluster, enter the --NotebookApp.token='' only.

      • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  9. Select the ‘small-fraction’ compute resource for your workspace (GPU % of devices: 10)

    • If ‘small-fraction’ is not displayed in the gallery, follow the below steps:

      • Click +NEW COMPUTE RESOURCE

        • Enter a name for the compute resource. The name must be unique.

        • Set GPU devices per pod - 1

        • Set GPU memory per device

          • Select % (of device) - Fraction of a GPU device’s memory

          • Set the memory Request - 10 (the workload will allocate 10% of the GPU memory)

        • Optional: set the CPU compute per pod - 0.1 cores (default)

        • Optional: set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  10. Click CREATE WORKSPACE

Alternatively, you can submit the workspace using the CLI or the API. If you use the CLI, make sure to update the command with the name of your project and workload. If you use the API, make sure to update the following parameters:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the .

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the .

  • toolType will show the Jupyter icon when connecting to the Jupyter tool via the user interface.

  • toolName will show when connecting to the Jupyter tool via the user interface.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 3: Connecting to the Jupyter Notebook

  1. Select the newly created workspace with the Jupyter application that you want to connect to

  2. Click CONNECT

  3. Select the Jupyter tool. The selected tool is opened in a new tab on your browser.

To connect to the Jupyter Notebook, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>

Next Steps

Manage and monitor your newly created workload using the Workloads table.

GPU Memory Swap

NVIDIA Run:ai’s GPU memory swap helps administrators and AI practitioners to further increase the utilization of their existing GPU hardware by improving GPU sharing between AI initiatives and stakeholders. This is done by expanding the GPU physical memory to the CPU memory, typically an order of magnitude larger than that of the GPU.

Expanding the GPU physical memory helps the NVIDIA Run:ai system to put more workloads on the same GPU physical hardware, and to provide a smooth workload context switching between GPU memory and CPU memory, eliminating the need to kill workloads when the memory requirement is larger than what the GPU physical memory can provide.

Benefits of GPU Memory Swap

There are several use cases where GPU memory swap can benefit and improve the user experience and the system's overall utilization.

Sharing a GPU Between Multiple Interactive Workloads (Notebooks)

AI practitioners use notebooks to develop and test new AI models and to improve existing AI models. While developing or testing an AI model, notebooks use GPU resources intermittently, yet the GPU resources they request are pre-allocated by the notebook and cannot be used by other workloads after the notebook has reserved them. To overcome this inefficiency, NVIDIA Run:ai introduced dynamic GPU fractions and the Node Level Scheduler.

When one or more workloads require more than their requested GPU resources, there’s a high probability not all workloads can run on a single GPU because the total memory required is larger than the physical size of the GPU memory.

With GPU memory swap, several workloads can run on the same GPU, even if the sum of their used memory is larger than the size of the physical GPU memory. GPU memory swap can swap in and out workloads interchangeably, allowing multiple workloads to each use the full amount of GPU memory. The most common scenario is for one workload to run on the GPU (for example, an interactive notebook), while other notebooks are either idle or using the CPU to develop new code (while not using the GPU). From a user experience point of view, the swap in and out is a smooth process since the notebooks do not notice that they are being swapped in and out of the GPU memory. On rare occasions, when multiple notebooks need to access the GPU simultaneously, slower workload execution may be experienced.

Notebooks typically use the GPU intermittently; therefore, with high probability, only one workload (for example, an interactive notebook) will use the GPU at a time. The more notebooks the system puts on a single GPU, the higher the chance that more than one notebook will require the GPU resources at the same time. Admins have a significant role here in fine-tuning the number of notebooks running on the same GPU, based on specific usage patterns and required SLAs. Using the Node Level Scheduler reduces GPU access contention between different interactive notebooks running on the same node.

Sharing a GPU Between Inference/Interactive Workloads and Training Workloads

A single GPU can be shared between an interactive or inference workload (for example, a Jupyter notebook, image recognition services, or an LLM service), and a training workload that is not time-sensitive or delay-sensitive. At times when the inference/interactive workload uses the GPU, both training and inference/interactive workloads share the GPU resources, each running part of the time swapped-in to the GPU memory, and swapped-out into the CPU memory the rest of the time.

Whenever the inference/interactive workload stops using the GPU, the swap mechanism swaps out the inference/interactive workload GPU data to the CPU memory. Kubernetes wise, the pod is still alive and running using the CPU. This allows the training workload to run faster when the inference/interactive workload is not using the GPU, and slower when it does, thus sharing the same resource between multiple workloads, fully utilizing the GPU at all times, and maintaining uninterrupted service for both workloads.

Serving Inference Warm Models with GPU Memory Swap

Running multiple inference models is a demanding task and you will need to ensure that your SLA is met. You need to provide high performance and low latency, while maximizing GPU utilization. This becomes even more challenging when the exact model usage patterns are unpredictable. You must plan for the agility of inference services and strive to keep models on standby in a ready state rather than an idle state.

NVIDIA Run:ai’s GPU memory swap feature enables you to load multiple models to a single GPU, where each can use up to the full amount of GPU memory. Using an application load balancer, the administrator can control to which server each inference request is sent. The GPU can then be loaded with multiple models, where the model in use is loaded into the GPU memory and the rest of the models are swapped out to the CPU memory. The swapped models are stored as ready models to be loaded when required. GPU memory swap always maintains the context of the workload (model) on the GPU so it can easily and quickly switch between models. This is unlike industry-standard model servers that load models from scratch into the GPU whenever required.

How GPU Memory Swap Works

Swapping the workload’s GPU memory to and from the CPU is performed simultaneously and synchronously for all GPUs used by the workload. In some cases, if workloads specify a memory limit smaller than a full GPU memory size, multiple workloads can run in parallel on the same GPUs, maximizing the utilization and shortening the response times.

In other cases, workloads will run serially, with each workload running for a few seconds before the system swaps them in/out. If multiple workloads occupy more than the GPU physical memory and attempt to run simultaneously, memory swapping will occur. In this scenario, each workload will run part of the time on the GPU while being swapped out to the CPU memory the other part of the time, slowing down the execution of the workloads. Therefore, it is important to evaluate whether memory swapping is suitable for your specific use cases, weighing the benefits against the potential for slower execution time. To better understand the benefits and use cases of GPU memory swap, refer to the detailed sections below. This will help you determine how to best utilize GPU swap for your workloads and achieve optimal performance.

The workload MUST use dynamic GPU fractions. This means the workload’s memory Request is less than a full GPU, but it may add a GPU memory Limit to allow the workload to effectively use the full GPU memory. The NVIDIA Run:ai Scheduler allocates the dynamic fraction pair (Request and Limit) on single or multiple GPU devices in the same node.

The administrator must label each node on which they want to provide GPU memory swap with run.ai/swap-enabled=true to enable the feature on that node. Enabling the feature reserves CPU memory to serve the swapped GPU memory from all GPUs on that node. The administrator sets the size of the reserved CPU RAM memory using the runaiconfig file, as detailed in Enabling and Configuring GPU Memory Swap below.
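For example, a node can be enabled for swap with a standard kubectl label command (the node name is a placeholder):

kubectl label node <node-name> run.ai/swap-enabled=true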

Optionally, you can also configure the Node Level Scheduler:

  • The Node Level Scheduler automatically spreads workloads between the different GPUs on a node, ensuring maximum workload performance and GPU utilization.

  • In scenarios where Interactive notebooks are involved, if the CPU reserved memory for the GPU swap is full, the Node Level Scheduler preempts the GPU process of that workload and potentially routes the workload to another GPU to run.

Multi-GPU Memory Swap

NVIDIA Run:ai also supports workload submission using multi-GPU memory swap. Multi-GPU memory swap works similarly to single GPU memory swap, but instead of swapping memory for a single GPU workload, it swaps memory for workloads across multiple GPUs simultaneously and synchronously.

The NVIDIA Run:ai Scheduler allocates the same dynamic GPU fraction pair (Request and Limit) on multiple GPU devices in the same node. For example, if you want to run two LLM models, each consuming 8 GPUs that are not used simultaneously, you can use GPU memory swap to share their GPUs. This approach allows multiple models to be stacked on the same node.

The following outlines the advantages of stacking multiple models on the same node:

  • Maximizes GPU utilization - Efficiently uses available GPU resources by enabling multiple workloads to share GPUs.

  • Improves cold start times - Loading large LLM models to a node and its GPUs can take several minutes during a “cold start”. Using memory swap turns this process into a “warm start” that takes only a fraction of a second to a few seconds (depending on the model size and the GPU model).

  • Increases GPU availability - Frees up and maximizes GPU availability for additional workloads (and users), enabling better resource sharing.

  • Smaller quota requirements - Enables more precise and often smaller quota requirements for the end user.

Deployment Considerations

  • A pod created before the GPU memory swap feature was enabled in that cluster, cannot be scheduled to a swap-enabled node. A proper event is generated in case no matching node is found. Users must re-submit those pods to make them swap-enabled.

  • GPU memory swap cannot be enabled if NVIDIA Run:ai strict or fair time-slicing is used. GPU memory swap can only be used with the default NVIDIA time-slicing mechanism.

  • CPU RAM size cannot be decreased once GPU memory swap is enabled.

Enabling and Configuring GPU Memory Swap

Before configuring GPU memory swap, dynamic GPU fractions must be enabled. You can also configure and use Node Level Scheduler. Dynamic GPU fractions enable you to make your workloads burstable, while both features will maximize your workloads’ performance and GPU utilization within a single node.

To enable GPU memory swap in a NVIDIA Run:ai cluster:

  1. Edit the runaiconfig file with the following parameters. This example uses 100Gi as the size of the swap memory. For more details, see Advanced cluster configurations:

    spec:
      global:
        core:
          swap:
            enabled: true
            limits:
              cpuRam: 100Gi

  2. Or, use the following patch command from your terminal:

    kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' --patch '{"spec":{"global":{"core":{"swap":{"enabled": true, "limits": {"cpuRam": "100Gi"}}}}}}'

Configuring System Reserved GPU Resources

Swappable workloads require reserving a small part of the GPU for non-swappable allocations such as binaries and the GPU context. To avoid out-of-memory (OOM) errors due to non-swappable memory regions, the system reserves 2GiB of GPU RAM memory by default, effectively truncating the total size of the GPU memory. For example, a 16GiB T4 will appear as 14GiB on a swap-enabled node. The exact reserved size is application-dependent, and 2GiB is a safe assumption for 2-3 applications sharing and swapping on a GPU. This value can be changed by:

  1. Editing the runaiconfig as follows:

    spec:
      global:
        core:
          swap:
            limits:
              reservedGpuRam: 2Gi

  2. Or, using the following patch command from your terminal:

    kubectl patch -n runai runaiconfigs.run.ai/runai --type='merge' --patch '{"spec":{"global":{"core":{"swap":{"limits":{"reservedGpuRam": <quantity>}}}}}}'

Preventing Your Workloads from Getting Swapped

If you prefer your workloads not to be swapped into CPU memory, you can specify on the pod an anti-affinity to the run.ai/swap-enabled=true node label when submitting your workloads (see the sketch below), and the Scheduler will ensure not to use swap-enabled nodes. An alternative way is to set swap on a dedicated node pool and not use this node pool for workloads you prefer not to swap.
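A minimal sketch of such an anti-affinity, expressed as standard Kubernetes node affinity in the pod spec (plain Kubernetes syntax, not a NVIDIA Run:ai-specific field):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: run.ai/swap-enabled
              operator: NotIn
              values:
                - "true"

With this in place, the pod is only scheduled onto nodes that do not carry the run.ai/swap-enabled=true label.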

What Happens When the CPU Reserved Memory for GPU Swap is Exhausted?

CPU memory is limited, and a single node's CPU memory serves multiple GPUs (usually between 2 to 8 GPUs per node). For example, when using 80GB of GPU memory, each swapped workload consumes up to 80GB (but may use less), assuming each GPU is shared between 2-4 workloads. In this example, you can see how the swap memory can become very large. Therefore, we give administrators a way to limit the size of the CPU reserved memory for swapped GPU memory on each swap-enabled node, as shown in Enabling and Configuring GPU Memory Swap above.

Limiting the CPU reserved memory means that there may be scenarios where the GPU memory cannot be swapped out to the CPU reserved RAM. Whenever the CPU reserved memory for swapped GPU memory is exhausted, the workloads currently running are not swapped out to the CPU reserved RAM; instead, Node Level Scheduler logic (if enabled) takes over and provides GPU resource optimization.

Logs Collection

This section provides instructions for IT administrators on collecting NVIDIA Run:ai logs for support, including prerequisites, CLI commands, and log file retrieval. It also covers enabling verbose logging for Prometheus and the NVIDIA Run:ai Scheduler.

Collect Logs to Send to Support

To collect NVIDIA Run:ai logs, follow these steps:

Prerequisites

  • Ensure that you have administrator-level access to the Kubernetes cluster where NVIDIA Run:ai is installed.

  • The NVIDIA Run:ai Administrator Command-Line Interface (CLI) must be installed.

Step-by-step Instructions

  1. Run the command from your local machine or a bastion host (secure server). Open a terminal on your local machine (or any machine that has network access to the Kubernetes cluster) where the NVIDIA Run:ai Administrator CLI is installed.

  2. Collect the Logs. Execute the following command to collect the logs:

    runai-adm collect-logs

    This command gathers all relevant NVIDIA Run:ai logs from the system and generates a compressed file.

  3. Locate the Generated File. After running the command, note the location of the generated compressed log file. You can retrieve this file and send it to NVIDIA Run:ai Support for further troubleshooting.

Note

The tar file packages the logs of NVIDIA Run:ai components only. It does not include logs of researcher containers that may contain private information.

Logs Verbosity

Increase log verbosity to capture more detailed information, providing deeper insights into system behavior and making it easier to identify and resolve issues.

Prerequisites

Before you begin, ensure you have the following:

  • Access to the Kubernetes cluster where NVIDIA Run:ai is installed

    • Including the necessary permissions to view and modify configurations.

  • kubectl installed and configured:

    • The Kubernetes command-line tool, kubectl, must be installed and configured to interact with the cluster.

    • Sufficient privileges to edit configurations and view logs.

  • Monitoring Disk Space

    • When enabling verbose logging, ensure adequate disk space to handle the increased log output, especially when enabling debug or high verbosity levels.

Adding Verbosity

Adding verbosity to Prometheus

To increase the logging verbosity for Prometheus, follow these steps:

  1. Edit the RunaiConfig to adjust Prometheus log levels. Copy the following command to your terminal:

    kubectl edit runaiconfig runai -n runai

  2. In the configuration file that opens, add or modify the following section to set the log level to debug:

    spec:
      prometheus:
        spec:
          logLevel: debug

  3. Save the changes. To view the Prometheus logs with the new verbosity level, run:

    kubectl logs -n runai prometheus-runai-0

    This command streams the last 100 lines of logs from Prometheus, providing detailed information useful for debugging.

Adding verbosity to the Scheduler

To enable extended logging for the NVIDIA Run:ai scheduler:

  1. Edit the RunaiConfig to adjust scheduler verbosity:

    kubectl edit runaiconfig runai -n runai

  2. Add or modify the following section under the scheduler settings:

    runai-scheduler:
      args:
        verbosity: 6

    This increases the verbosity level of the scheduler logs to provide more detailed output.

Warning: Enabling verbose logging can significantly increase disk space usage. Monitor your storage capacity and adjust the verbosity level as necessary.


Install Using Helm

System and Network Requirements

Before installing the NVIDIA Run:ai cluster, validate that the system requirements and network requirements are met. For air-gapped environments, make sure you have the software artifacts prepared.

Once all the requirements are met, it is highly recommended to use the NVIDIA Run:ai cluster preinstall diagnostics tool to:

  • Test the below requirements in addition to failure points related to Kubernetes, NVIDIA, storage, and networking

  • Look at additional components installed and analyze their relevance to a successful installation

For more information, see preinstall diagnostics. To run the preinstall diagnostics tool, download the latest version, and run:

chmod +x ./preinstall-diagnostics-<platform> && \ 
./preinstall-diagnostics-<platform> \
  --domain ${CONTROL_PLANE_FQDN} \
  --cluster-domain ${CLUSTER_FQDN} \
#if the diagnostics image is hosted in a private registry
  --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
  --image ${PRIVATE_REGISTRY_IMAGE_URL}    

In an air-gapped deployment, the diagnostics image is saved, pushed, and pulled manually from the organization's registry.

#Save the image locally
docker save --output preinstall-diagnostics.tar gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION}
#Load the image locally, then tag and push it to the organization's registry
docker load --input preinstall-diagnostics.tar
docker tag gcr.io/run-ai-lab/preinstall-diagnostics:${VERSION} ${CLIENT_IMAGE_AND_TAG} 
docker push ${CLIENT_IMAGE_AND_TAG}

Run the binary with the --image parameter to modify the diagnostics image to be used:

chmod +x ./preinstall-diagnostics-darwin-arm64 && \
./preinstall-diagnostics-darwin-arm64 \
  --domain ${CONTROL_PLANE_FQDN} \
  --cluster-domain ${CLUSTER_FQDN} \
  --image-pull-secret ${IMAGE_PULL_SECRET_NAME} \
  --image ${PRIVATE_REGISTRY_IMAGE_URL}    

Helm

NVIDIA Run:ai requires Helm 3.14 or later. To install Helm, see Installing Helm. If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai tar file contains the helm binary.
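For reference, a common way to install Helm on a connected machine is the official Helm install script; shown here as a sketch, see the Helm documentation for the authoritative steps:

# Download and run the official Helm 3 install script, then verify the installed version
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
helm version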

Permissions

A Kubernetes user with the cluster-admin role is required to ensure a successful installation. For more information, see Using RBAC authorization.
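If the installing user does not already hold the cluster-admin role, a cluster administrator can grant it with a standard RBAC binding. A minimal sketch, using a hypothetical user name:

# Bind the built-in cluster-admin ClusterRole to the installing user (user name is hypothetical)
kubectl create clusterrolebinding runai-installer-admin \
  --clusterrole=cluster-admin \
  --user=installer@example.com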

Installation

Kubernetes

Connected

Follow the steps below to add a new cluster.

Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

  1. In the NVIDIA Run:ai platform, go to Resources

  2. Click +NEW CLUSTER

  3. Enter a unique name for your cluster

  4. Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)

  5. Enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.

  6. Click Continue

Installing NVIDIA Run:ai cluster

The next section presents the NVIDIA Run:ai cluster installation steps.

  1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

  2. Click DONE

The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.

Tip: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.

Air-gapped

Follow the steps below to add a new cluster.

Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

  1. In the NVIDIA Run:ai platform, go to Resources

  2. Click +NEW CLUSTER

  3. Enter a unique name for your cluster

  4. Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)

  5. Enter the Cluster URL. For more information, see Fully Qualified Domain Name requirement.

  6. Click Continue

Installing NVIDIA Run:ai cluster

The next section presents the NVIDIA Run:ai cluster installation steps.

  1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

  2. On the second tab of the cluster wizard, when copying the helm command for installation, you will need to use the pre-provided installation file instead of using helm repositories. As such:

    • Do not add the helm repository and do not run helm repo update.

    • Instead, edit the helm upgrade command.

      • Replace runai/runai-cluster with runai-cluster-<VERSION>.tgz.

      • Add --set global.image.registry=<DOCKER REGISTRY ADDRESS> where the registry address is as entered in the preparation section

      • Add --set global.customCA.enabled=true as described here

    The command should look like the following:

    helm upgrade -i runai-cluster runai-cluster-<VERSION>.tgz \
        --set controlPlane.url=... \
        --set controlPlane.clientSecret=... \
        --set cluster.uid=... \
        --set cluster.url=... --create-namespace \
        --set global.image.registry=registry.mycompany.local \
        --set global.customCA.enabled=true
  3. Click DONE

The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.

Tip: Use the --dry-run flag to gain an understanding of what is being installed before the actual installation. For more details, see Understanding cluster access roles.
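For example, the helm upgrade command shown above can be rendered without installing anything by appending --dry-run; the values below are placeholders:

helm upgrade -i runai-cluster runai-cluster-<VERSION>.tgz \
    --set controlPlane.url=... \
    --set controlPlane.clientSecret=... \
    --set cluster.uid=... \
    --set cluster.url=... --create-namespace \
    --set global.image.registry=registry.mycompany.local \
    --set global.customCA.enabled=true \
    --dry-run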

OpenShift

Connected

Follow the steps below to add a new cluster.

Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

  1. In the NVIDIA Run:ai platform, go to Resources

  2. Click +NEW CLUSTER

  3. Enter a unique name for your cluster

  4. Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)

  5. Enter the Cluster URL

  6. Click Continue

Installing NVIDIA Run:ai cluster

The next section presents the NVIDIA Run:ai cluster installation steps.

  1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

  2. Click DONE

The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.

Air-gapped

When creating a new cluster, select the OpenShift target platform.

Follow the steps below to add a new cluster.

Note: When adding a cluster for the first time, the New Cluster form automatically opens when you log in to the NVIDIA Run:ai platform. Other actions are prevented until the cluster is created.

If this is your first cluster and you have completed the New Cluster form, start at step 3. Otherwise, start at step 1.

  1. In the NVIDIA Run:ai platform, go to Resources

  2. Click +NEW CLUSTER

  3. Enter a unique name for your cluster

  4. Optional: Choose the NVIDIA Run:ai cluster version (latest, by default)

  5. Enter the Cluster URL

  6. Click Continue

Installing NVIDIA Run:ai cluster

The next section presents the NVIDIA Run:ai cluster installation steps.

  1. Follow the installation instructions and run the commands provided on your Kubernetes cluster.

  2. On the second tab of the cluster wizard, when copying the helm command for installation, you will need to use the pre-provided installation file instead of using helm repositories. As such:

    • Do not add the helm repository and do not run helm repo update.

    • Instead, edit the helm upgrade command.

      • Replace runai/runai-cluster with runai-cluster-<VERSION>.tgz.

      • Add --set global.image.registry=<DOCKER REGISTRY ADDRESS> where the registry address is as entered in the preparations section

      • Add --set global.customCA.enabled=true as described here

    The command should look like the following:

    helm upgrade -i runai-cluster runai-cluster-<VERSION>.tgz \
        --set controlPlane.url=... \
        --set controlPlane.clientSecret=... \
        --set cluster.uid=... \
        --set cluster.url=... --create-namespace \
        --set global.image.registry=registry.mycompany.local \
        --set global.customCA.enabled=true
  3. Click DONE

The cluster is displayed in the table with the status Waiting to connect. Once installation is complete, the cluster status changes to Connected.

Note

To customize the installation based on your environment, see Customized installation.

Troubleshooting

If you encounter an issue with the installation, try the troubleshooting scenario below.

Installation

If the NVIDIA Run:ai cluster installation failed, check the installation logs to identify the issue. Run the following script to print the installation logs:

curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh
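One way to fetch and run the script, shown as a sketch (adjust to your environment):

# Download the log-collection script, make it executable, and run it
curl -fsSL https://raw.githubusercontent.com/run-ai/public/main/installation/get-installation-logs.sh \
  -o get-installation-logs.sh
chmod +x get-installation-logs.sh
./get-installation-logs.sh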

Cluster Status

If the NVIDIA Run:ai cluster installation completed, but the cluster status did not change to Connected, check the cluster troubleshooting scenarios.

Preparations

The following section provides the information needed to prepare for a NVIDIA Run:ai installation.

Software Artifacts

The following software artifacts should be used when installing the control plane and cluster.

Kubernetes

Connected

You will receive a token from NVIDIA Run:ai to access the NVIDIA Run:ai container registry. Use the following command to create the required Kubernetes secret:

kubectl create secret docker-registry runai-reg-creds  \
--docker-server=https://runai.jfrog.io \
--docker-username=self-hosted-image-puller-prod \
--docker-password=<TOKEN> \
--docker-email=<EMAIL> \
--namespace=runai-backend
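The secret is created in the runai-backend namespace, so the namespace must exist first. A quick check, assuming kubectl access to the cluster and that your installation flow has not already created the namespace:

# Create the namespace if it does not exist yet, then verify the registry secret is present
kubectl create namespace runai-backend
kubectl get secret runai-reg-creds -n runai-backend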
Air-gapped

You should receive a single file runai-airgapped-package-<VERSION>.tar.gz from NVIDIA Run:ai customer support.

NVIDIA Run:ai assumes the existence of a Docker registry for images most likely installed within the organization. The installation requires the network address and port for the registry (referenced below as <REGISTRY_URL>).

SSH into a node with kubectl access to the cluster and Docker installed. To extract the NVIDIA Run:ai files, replace <VERSION> in the command below and run:

tar xvf runai-airgapped-package-<VERSION>.tar.gz

Upload images

  1. Upload images to a local Docker Registry. Set the Docker Registry address in the form of NAME:PORT (do not add https):

export REGISTRY_URL=<DOCKER REGISTRY ADDRESS>
  2. Run the following script. You must have at least 20GB of free disk space to run it. If Docker is configured to run as non-root, sudo is not required:

sudo ./setup.sh

The script should create a file named custom-env.yaml which will be used during control plane installation.
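To spot-check that the images were pushed successfully, you can query the registry catalog. This assumes the registry exposes the standard Docker Registry v2 API and that your current credentials or anonymous access allow it:

# List the repositories in the local registry (add credentials or TLS flags as your registry requires)
curl -s http://${REGISTRY_URL}/v2/_catalog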

OpenShift

Connected

You will receive a token from NVIDIA Run:ai to access the NVIDIA Run:ai container registry. Use the following command to create the required Kubernetes secret:

oc create secret docker-registry runai-reg-creds  \
--docker-server=https://runai.jfrog.io \
--docker-username=self-hosted-image-puller-prod \
--docker-password=<TOKEN> \
--docker-email=<EMAIL> \
--namespace=runai-backend
Air-gapped

You should receive a single file runai-airgapped-package-<VERSION>.tar.gz from NVIDIA Run:ai customer support.

NVIDIA Run:ai assumes the existence of a Docker registry for images most likely installed within the organization. The installation requires the network address and port for the registry (referenced below as <REGISTRY_URL>).

SSH into a node with oc access to the cluster and Docker installed. To extract the NVIDIA Run:ai files, replace <VERSION> in the command below and run:

tar xvf runai-airgapped-package-<VERSION>.tar.gz

Upload images

  1. Upload images to a local Docker Registry. Set the Docker Registry address in the form of NAME:PORT (do not add https):

export REGISTRY_URL=<DOCKER REGISTRY ADDRESS>
  2. Run the following script. You must have at least 20GB of free disk space to run it. If Docker is configured to run as non-root, sudo is not required:

sudo ./setup.sh

The script should create a file named custom-env.yaml which will be used by the control plane installation.

Private Docker Registry (Optional)

Kubernetes

To access the organization's docker registry it is required to set the registry's credentials (imagePullSecret).

Create the secret named runai-reg-creds based on your existing credentials. For more information, see Pull an Image from a Private Registry.
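A minimal sketch of creating such a secret from explicit credentials; the server, user, password, email, and namespace values are placeholders and should match your registry and the namespace expected by your installation:

kubectl create secret docker-registry runai-reg-creds \
  --docker-server=<REGISTRY_URL> \
  --docker-username=<REGISTRY_USER> \
  --docker-password=<REGISTRY_PASSWORD> \
  --docker-email=<EMAIL> \
  --namespace=<NAMESPACE>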

OpenShift

To access the organization's docker registry it is required to set the registry's credentials (imagePullSecret).

Create the secret named runai-reg-creds in the runai-backend namespace based on your existing credentials. The configuration will be copied over to the runai namespace at cluster install. For more information, see Allowing pods to reference images from other secured registries.

Set Up Your Environment

External Postgres Database (Optional)

If you have opted to use an external PostgreSQL database, you need to perform initial setup to ensure successful installation. Follow these steps:

  1. Create a SQL script file, edit the parameters below, and save it locally:

    • Replace <DATABASE_NAME> with a dedicated database name for NVIDIA Run:ai in your PostgreSQL database.

    • Replace <ROLE_NAME> with a dedicated role name (user) for the NVIDIA Run:ai database.

    • Replace <ROLE_PASSWORD> with a password for the new PostgreSQL role.

    • Replace <GRAFANA_PASSWORD> with the password to be set for Grafana integration.

    -- Create a new database for runai
    CREATE DATABASE <DATABASE_NAME>; 
    
    -- Create the role with login and password
    CREATE ROLE <ROLE_NAME>  WITH LOGIN PASSWORD '<ROLE_PASSWORD>'; 
    
    -- Grant all privileges on the database to the role
    GRANT ALL PRIVILEGES ON DATABASE <DATABASE_NAME> TO <ROLE_NAME>; 
    
    -- Connect to the newly created database
    \c <DATABASE_NAME> 
    
    -- grafana
    CREATE ROLE grafana WITH LOGIN PASSWORD '<GRAFANA_PASSWORD>'; 
    CREATE SCHEMA grafana authorization grafana;
    ALTER USER grafana set search_path='grafana';
    -- Exit psql
    \q
  2. Run the following command on a machine where PostgreSQL client (pgsql) is installed:

    • Replace <POSTGRESQL_HOST> with the PostgreSQL ip address or hostname.

    • Replace <POSTGRESQL_USER> with the PostgreSQL username.

    • Replace <POSTGRESQL_PORT> with the port number where PostgreSQL is running.

    • Replace <POSTGRESQL_DB> with the name of your PostgreSQL database.

    • Replace <SQL_FILE> with the path to the SQL script created in the previous step.

    psql --host <POSTGRESQL_HOST> \
    --user <POSTGRESQL_USER> \
    --port <POSTGRESQL_PORT> \
    --dbname <POSTGRESQL_DB> \
    -a -f <SQL_FILE>
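To confirm that the new role and database were created correctly, you can attempt a connection as the new role; this is a quick check rather than a required step, and you will be prompted for <ROLE_PASSWORD>:

psql --host <POSTGRESQL_HOST> --port <POSTGRESQL_PORT> \
  --user <ROLE_NAME> --dbname <DATABASE_NAME> -c '\conninfo'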

What’s New in Version 2.21

The NVIDIA Run:ai v2.21 what's new provides a detailed summary of the latest features, enhancements, and updates introduced in this version. They serve as a guide to help users, administrators, and researchers understand the new capabilities and how to leverage them for improved workload management, resource optimization, and more.

Important

For a complete list of deprecations, see Deprecation notifications. Deprecated features and capabilities remain available for two versions after the notification.

AI Practitioners

Flexible Workload Submission

Streamlined workload submission with a customizable form – The new customizable submission form allows you to submit workloads by selecting and modifying an existing setup or providing your own settings. This enables faster, more accurate submissions that align with organizational policies and individual workload needs. Experimental From cluster v2.18 onward

Feature high level details:

  • Flexible submission options – Choose from an existing setup and customize it, or start from scratch and provide your own settings for a one-time setup.

  • Improved visibility – Review existing setups and understand their associated policy definitions.

  • One-time data sources setup – Configure a data source as part of your one-time setup for a specific workload.

  • Unified experience – Use the new form for all workload types — workspaces, standard training, distributed training, and custom Inference.

Workspaces and Training

  • Support for JAX distributed training workloads – You can now submit distributed training workloads using the JAX framework via the UI, API, and CLI. This enables you to leverage JAX for scalable, high-performance training, making it easier to run and manage JAX-based workloads seamlessly within NVIDIA Run:ai. See Train models using a distributed training workload for more details. From cluster v2.21 onward

  • Pod restart policy for all workload types – A restart policy can be configured to define how pods are restarted when they terminate. The policy is set at the workload level across all workload types via the API and CLI. For distributed training workloads, restart policies can be set separately for master and worker pods. This enhancement ensures workloads are restarted efficiently, minimizing downtime and optimizing resource usage. From cluster v2.21 onward

  • Enhanced failure status details for workloads – When a workload is marked as "Failed", clicking the “i” icon next to the status provides detailed failure reasons, with clear explanations across compute, network, and storage resources. This enhancement improves troubleshooting efficiency, and helps you quickly diagnose and resolve issues, leading to faster workload recovery. From cluster v2.21 onward

  • Workload priority class management for training workloads – You can now change the default priority class of training workloads within a project, via the API or CLI, by selecting from predefined priority class values. This influences the workload’s position in the project scheduling queue managed by the Run:ai Scheduler, ensuring critical training jobs are prioritized and resources are allocated more efficiently. See Workload priority class control for more details. From cluster v2.18 onward

Workload Assets

  • New environment presets – Added new NVIDIA Run:ai environment presets when running in a host-based routing cluster - vscode, rstudio, jupyter-scipy, tensorboard-tensorflow. See Environments for more details. From cluster v2.21 onward

  • Support for PVC size expansion – Adjust the size of Persistent Volume Claims (PVCs) via the Update a PVC asset API, leveraging the allowVolumeExpansion field of the storage class resource. This enhancement enables you to dynamically adjust storage capacity as needed.

  • Improved visibility of storage class configurations – When creating new PVCs or volumes, the UI now displays access modes, volume modes, and size options based on administrator-defined storage class configurations. This update ensures consistency, increases transparency, and helps prevent misconfigurations during setup. From cluster v2.21 onward

  • ConfigMaps as environment variables – Use predefined ConfigMaps as environment variables during environment setup or workload submission. From cluster v2.21 onward

  • Improved scope selection experience – The scope mechanism has been improved to reduce clicks and enhance usability. The organization tree now opens by default at the cluster level for quicker navigation. Scope search now includes alphabetical sorting and supports browsing non-displayed scopes. You can also use keyboard shortcuts: Escape to cancel, or click outside the modal to close it. These improvements apply across templates, policies, projects, and all workload assets.

Command-line Interface (CLI v2)

  • New default CLI – CLI v2 is the default command-line interface. CLI v1 has been deprecated as of version 2.20.

  • Secret volume mapping for workloads – You can now map secrets to volumes when submitting workloads using the --secret-volume flag. This feature is available for all workload types - workspaces, training, and inference.

  • Support for environment field references in submit commands – A new flag, fieldRef, has been added to all submit commands to support environment field references in a key:value format. This enhancement enables dynamic injection of environment variables directly from pod specifications, offering greater flexibility during workload submission.

  • Improved PVC visibility and selection for researchers – Use runai pvc to list existing PVCs within your scope, making it easier to reference available options when submitting workloads. A noun auto-completion has been introduced for storage, streamlining the selection process. The workload describe command also includes a PVC section, improving visibility into persistent volume claims. These enhancements provide greater clarity and efficiency in storage utilization.

  • Enhanced workload deletion options – The runai workload delete command now supports deleting multiple workloads by specifying a list of workload names (e.g., workload-a, workload-b, workload-c).

ML Engineers

Workloads - Inference

  • Support for inference workloads via CLI v2 – You can now run inference workloads directly from the command-line interface. This update enables greater automation and flexibility for managing inference workloads. See runai inference for more details.

  • Enhanced rolling inference updates – Rolling inference updates allow ML engineers to apply live updates to existing inference workloads—regardless of their current status (e.g., running or pending)—without disrupting critical services. Experimental

    • This capability is now supported for both Hugging Face and custom inference workloads, with a new UI flow that aligns with the API functionality introduced in v2.19. From cluster v2.19 onward

    • Compute resources can now be updated via API and UI. From cluster v2.21 onward

  • Support for NVIDIA Cloud Functions (NVCF) external workloads – NVIDIA Run:ai enables you to deploy, schedule and manage NVCF workloads as external workloads within the platform. See Deploy NVIDIA Cloud Functions (NVCF) in NVIDIA Run:ai for more details. From cluster v2.21 onward

  • Added validation for Knative – You can now only submit inference workloads if Knative is properly installed. This ensures workloads are deployed successfully by preventing submission when Knative is misconfigured or missing. From cluster v2.21 onward

  • Enhancements in Hugging Face workloads. For more details, see Deploy inference workloads from Hugging Face:

    • Added Hugging Face model authentication – NVIDIA Run:ai validates whether a user-provided token grants access to a specific model, in addition to checking if a model requires a token and verifying the token format. This enhancement ensures that users can only load models they have permission to access, improving security and usability. From cluster v2.18 onward

    • Introduced model store support using data sources – Select a data source to serve as a model store, caching model weights to reduce loading time and avoid repeated downloads. This improves performance and deployment speed, especially for frequently used models, minimizing the need to re-authenticate with external sources.

    • Improved model selection – Select a model from a drop-down list. The list is partial and consists only of models that were tested. From cluster v2.18 onward

    • Enhanced Hugging Face environment control – Choose between vLLM, TGI, or any other custom container image by selecting an image tag and providing additional arguments. By default, workloads use the official vLLM or TGI containers, with full flexibility to override the image and customize runtime settings for more controlled and adaptable inference deployments. From cluster v2.18 onward

  • Updated authentication for NIM model access – You can now authenticate access to NIM models using tokens or credentials, ensuring a consistent, flexible, and secure authentication process. See Deploy inference workloads with NVIDIA NIM for more details. From cluster v2.19 onward

  • Added support for volume configuration – You can now set volumes for custom inference workloads. This feature allows inference workloads to allocate and retain storage, ensuring continuity and efficiency in inference execution. From cluster v2.20 onward

Platform Administrators

Analytics

  • Enhancements to the Overview dashboard – The Overview dashboard includes optimization insights for projects and departments, providing real-time visibility into GPU resource allocation and utilization. These insights help department and project managers make more informed decisions about quota management, ensuring efficient resource usage.

  • Dashboard UX improvements:

    • Improved visibility of metrics in the Resources utilization widget by repositioning them above the graphs.

    • Added a new Idle workloads table widget to help you easily identify and manage underutilized resources.

    • Renamed and updated the "Workloads by type" widget to provide clearer insights into cluster usage with a focus on workloads.

    • Improved user experience by moving the date picker to a dedicated section within the overtime widgets, Resources allocation and Resources utilization.

Organizations - Projects/Departments

  • Enhanced resource prioritization for projects and departments – Admins can now define and manage SLAs tailored to specific departments and projects via the UI, ensuring resource allocation aligns with real business priorities. This enhancement empowers admins to assign strict priority to over-quota resources, extending control beyond the existing over-quota weight system. From cluster v2.20 onward

    This feature allows administrators to:

    • Set the priority of each department relative to other departments within the same node pool.

    • Define the priority of projects within a department, on a per-node pool basis.

    • Set specific GPU resource limits for both departments and projects.

Audit Logs

Updated access control for audit logs – Only users with tenant-wide permissions have the ability to access audit logs, ensuring proper access control and data security. This update reinforces security and compliance by restricting access to sensitive system logs. It ensures that only authorized users can view audit logs, reducing the risk of unauthorized access and potential data exposure.

Notifications

Slack API integration for notifications – A new API allows organizations to receive notifications directly to Slack. This feature enhances real-time communication and monitoring by enabling users to stay informed about workload statuses. See Configuring Slack notifications for more details.

Authentication and Authorization

  • Improved visibility into user roles and access scopes – Individual users can now view their assigned roles and scopes directly in their settings. This enhancement provides greater transparency into user permissions, allowing individuals to easily verify their access levels. It helps users understand what actions they can perform and reduces dependency on administrators for access-related inquiries. See Access rules for more details.

  • Added auto-redirect to SSO – To deliver a consistent and streamlined login experience across customer applications, users accessing the NVIDIA Run:ai login page will be automatically redirected to SSO, bypassing the standard login screen entirely. This can be enabled via a toggle after an Identity Provider is added, and is available through both the UI and API. See Single Sign-On (SSO) for more details.

  • SAML service provider metadata XML – After configuring SAML IDP, the service provider metadata XML is now available for download to simplify integration with identity providers. See Set up SSO with SAML for more details.

  • Expanded SSO OpenID Connect authentication support – SSO OpenID Connect authentication supports attribute mapping of groups in both list and map formats. In map format, the group name is used as the value. This applies to new identity providers only. See Set up SSO with OpenID Connect for more details.

  • Improved permission error messaging – Enhanced clarity when attempting to delete a user with higher privileges, making it easier to understand and resolve permission-related actions.

Data & Storage

Added Data volumes to the UI – Administrators can now create and manage data volumes directly from the UI and share data across different scopes in a cluster, including projects and departments. See Data volumes for more details. Experimental From cluster v2.19 onward

Infrastructure Administrators

NVIDIA Datacenter GPUs - Grace-Blackwell

Support for NVIDIA GB200 NVL72 and MultiNode NVLink systems – NVIDIA Run:ai offers full support for NVIDIA’s most advanced MultiNode NVLink (MNNVL) systems, including NVIDIA GB200, NVIDIA GB200 NVL72 and its derivatives. NVIDIA Run:ai simplifies the complexity of managing and submitting workloads on these systems by automating infrastructure detection, domain labeling, and distributed job submission via the UI, CLI, or API. See Using GB200 NVL72 and Multi-Node NVLink Domains for more details. From cluster v2.21 onward

Advanced Cluster Configurations

Automatic cleanup of resources for failed workloads – When a workload fails due to infrastructure issues, its resources can be automatically cleaned up using failureResourceCleanupPolicy, reducing resource waste from failed workloads. For more details, see Advanced cluster configurations. From cluster v2.21 onward

Advanced Setup

Custom pod labels and annotations – Add custom labels and annotations to pods in both the control plane and cluster. This new capability enables service mesh deployment in NVIDIA Run:ai. This feature provides greater flexibility in workload customization and management, allowing users to integrate with service meshes more easily. See Service mesh for more details.

System Requirements

  • NVIDIA Run:ai now supports NVIDIA GPU Operator version 25.3.

  • NVIDIA Run:ai now supports OpenShift version 4.18.

  • NVIDIA Run:ai now supports Kubeflow Training Operator 1.9.

  • Kubernetes version 1.29 is no longer supported.

Deprecation Notifications

Cluster API for Workload Submission

Using the Cluster API to submit NVIDIA Run:ai workloads via YAML was deprecated starting from NVIDIA Run:ai version 2.18. For cluster version 2.18 and above, use the Run:ai REST API to submit workloads. The Cluster API documentation has also been removed from v2.20 and above.

Integrations

Integration Support

Support for third-party integrations varies. When noted below, the integration is supported out of the box with NVIDIA Run:ai. For other integrations, our Customer Success team has prior experience assisting customers with setup. In many cases, the NVIDIA Enterprise Support Portal may include additional reference documentation provided on an as-is basis.

Tool
Category
NVIDIA Run:ai support details
Additional Information

Triton

Orchestration

Supported

Usage via docker base image

Spark

Orchestration

Community Support

It is possible to schedule Spark workflows with the NVIDIA Run:ai Scheduler. Sample code: .

Kubeflow Pipelines

Orchestration

Community Support

It is possible to schedule kubeflow pipelines with the NVIDIA Run:ai Scheduler. Sample code: .

Apache Airflow

Orchestration

Community Support

It is possible to schedule Airflow workflows with the NVIDIA Run:ai Scheduler. Sample code: .

Argo workflows

Orchestration

Community Support

It is possible to schedule Argo workflows with the NVIDIA Run:ai Scheduler. Sample code: .

SeldonX

Orchestration

Community Support

It is possible to schedule Seldon Core workloads with the NVIDIA Run:ai Scheduler. Sample code: .

Jupyter Notebook

Development

Supported

NVIDIA Run:ai provides integrated support with Jupyter Notebooks. See example.

JupyterHub

Development

Community Support

It is possible to submit NVIDIA Run:ai workloads via JupyterHub. Sample code: .

PyCharm

Development

Supported

Containers created by NVIDIA Run:ai can be accessed via PyCharm.

VScode

Development

Supported

Containers created by NVIDIA Run:ai can be accessed via Visual Studio Code. You can automatically launch Visual Studio code web from the NVIDIA Run:ai console.

Kubeflow notebooks

Development

Community Support

It is possible to launch a Kubeflow notebook with the NVIDIA Run:ai Scheduler. Sample code: .

Ray

Training, inference, data processing

Community Support

It is possible to schedule Ray jobs with the NVIDIA Run:ai Scheduler. Sample code: .

TensorBoard

Experiment tracking

Supported

NVIDIA Run:ai comes with a preset TensorBoard asset

Weights & Biases

Experiment tracking

Community Support

It is possible to schedule W&B workloads with the NVIDIA Run:ai Scheduler. Sample code: .

ClearML

Experiment tracking

Community Support

It is possible to schedule ClearML workloads with the NVIDIA Run:ai Scheduler. Sample code: .

MLFlow

Model Serving

Community Support

It is possible to use ML Flow together with the NVIDIA Run:ai Scheduler. Sample code: .

Hugging Face

Repositories

Supported

NVIDIA Run:ai provides an out of the box integration with Hugging Face.

Docker Registry

Repositories

Supported

NVIDIA Run:ai allows using a docker registry as an asset

S3

Storage

Supported

NVIDIA Run:ai communicates with S3 by defining an asset

GitHub

Storage

Supported

NVIDIA Run:ai communicates with GitHub by defining it as an asset

TensorFlow

Training

Supported

NVIDIA Run:ai provides out of the box support for submitting TensorFlow workloads via API, CLI or UI. See for more details.

PyTorch

Training

Supported

NVIDIA Run:ai provides out of the box support for submitting PyTorch workloads via API, CLI or UI. See for more details.

MPI

Training

Supported

NVIDIA Run:ai provides out of the box support for submitting MPI workloads via API, CLI or UI. See for more details.

XGBoost

Training

Supported

NVIDIA Run:ai provides out of the box support for submitting XGBoost via API, CLI or UI. See for more details.

Karpenter

Cost Optimization

Supported

NVIDIA Run:ai provides out of the box support for Karpenter to save cloud costs. Integration notes with Karpenter can be found .

Kubernetes Workloads Integration

Kubernetes has several built-in resources that encapsulate running Pods. These are called Kubernetes Workloads and should not be confused with NVIDIA Run:ai workloads.

Examples of such resources are a Deployment that manages a stateless application, or a Job that runs tasks to completion.

A NVIDIA Run:ai workload encapsulates all the resources needed to run and creates/deletes them together. Since NVIDIA Run:ai is an open platform, it allows the scheduling of any Kubernetes Workload.

For more information, see Kubernetes Workloads Integration.
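As an illustration, a native Kubernetes Job can be placed under the NVIDIA Run:ai Scheduler by setting the scheduler name and a project queue label, the same fields used in the GPU fractions example later in this document. A minimal sketch, assuming a hypothetical project named team-a:

apiVersion: batch/v1
kind: Job
metadata:
  name: demo-k8s-job
  namespace: runai-team-a          # hypothetical project namespace
spec:
  template:
    metadata:
      labels:
        runai/queue: team-a        # schedule through the team-a project queue
    spec:
      schedulerName: runai-scheduler
      restartPolicy: Never
      containers:
      - name: main
        image: ubuntu:22.04
        command: ["bash", "-c", "echo scheduled by the NVIDIA Run:ai Scheduler"]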

Set Up SSO with OpenID Connect

Single Sign-On (SSO) is an authentication scheme, allowing users to log-in with a single pair of credentials to multiple, independent software systems.

This article explains the procedure to configure single sign-on to NVIDIA Run:ai using the OpenID Connect protocol.

Prerequisites

Before you start, make sure you have the following available from your identity provider:

  • Discovery URL - The OpenID server where the content discovery information is published.

  • ClientID - The ID used to identify the client with the Authorization Server.

  • Client Secret - A secret password that only the Client and Authorization server know.

  • Optional: Scopes - A set of user attributes to be used during authentication to authorize access to a user's details.

Setup

Adding the Identity Provider

  1. Go to General settings

  2. Open the Security section and click +IDENTITY PROVIDER

  3. Select Custom OpenID Connect

  4. Enter the Discovery URL, Client ID, and Client Secret

  5. Copy the Redirect URL to be used in your identity provider

  6. Optional: Add the OIDC scopes

  7. Optional: Enter the user attributes and their values from the identity provider, as shown below

  8. Click SAVE

  9. Optional: Enable Auto-Redirect to SSO to automatically redirect users to your configured identity provider’s login page when accessing the platform.

Each attribute below is listed with its default value in NVIDIA Run:ai and its description:

  • User role groups (GROUPS) – If it exists in the IDP, it allows you to assign NVIDIA Run:ai role groups via the IDP. The IDP attribute must be a list of strings or an object where the group names are the values.

  • Linux User ID (UID) – If it exists in the IDP, it allows Researcher containers to start with the Linux User UID. Used to map access to network resources such as file systems to users. The IDP attribute must be of type integer.

  • Linux Group ID (GID) – If it exists in the IDP, it allows Researcher containers to start with the Linux Group GID. The IDP attribute must be of type integer.

  • Supplementary Groups (SUPPLEMENTARYGROUPS) – If it exists in the IDP, it allows Researcher containers to start with the relevant Linux supplementary groups. The IDP attribute must be a list of integers.

  • Email (email) – Defines the user attribute in the IDP holding the user's email address, which is the user identifier in NVIDIA Run:ai.

  • User first name (firstName) – Used as the user’s first name appearing in the NVIDIA Run:ai user interface.

  • User last name (lastName) – Used as the user’s last name appearing in the NVIDIA Run:ai user interface.

Testing the Setup

  1. Log in to the NVIDIA Run:ai platform as an admin

  2. Add access rules to an SSO user defined in the IDP

  3. Open the NVIDIA Run:ai platform in an incognito browser tab

  4. On the sign-in page, click CONTINUE WITH SSO. You are redirected to the identity provider sign-in page

  5. In the identity provider sign-in page, log in with the SSO user you granted access rules to

  6. If you are unable to sign in to the identity provider, follow the Troubleshooting section below

Editing the Identity Provider

You can view the identity provider details and edit its configuration:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider box, click Edit identity provider

  4. You can edit either the Discovery URL, Client ID, Client Secret, OIDC scopes, or the User attributes

Removing the Identity Provider

You can remove the identity provider configuration:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider card, click Remove identity provider

  4. In the dialog, click REMOVE to confirm the action

Note

To avoid losing access, removing the identity provider must be carried out by a local user.

Troubleshooting

If testing the setup was unsuccessful, try the different troubleshooting scenarios according to the error you received.

Troubleshooting Scenarios

Error: "403 - Sorry, we can’t let you see this page. Something about permissions…"

Description: The authenticated user is missing permissions

Mitigation:

  1. Validate that either the user or its related group(s) are assigned access rules

  2. Validate groups attribute is available in the configured OIDC Scopes

  3. Validate the user’s groups attribute is mapped correctly

Advanced:

  1. Open the Chrome DevTools: Right-click on page → Inspect → Console tab

  2. Run the following command to retrieve and paste the user’s token: localStorage.token;

  3. Paste in https://jwt.io

  4. Under the Payload section validate the values of the user’s attribute
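For reference, a decoded token payload carrying the mapped attributes alongside the standard claims might look like the following; the values are entirely illustrative:

{
  "email": "jane.doe@example.com",
  "firstName": "Jane",
  "lastName": "Doe",
  "GROUPS": ["ml-researchers", "platform-admins"],
  "UID": 2001,
  "GID": 3001,
  "SUPPLEMENTARYGROUPS": [4001, 4002]
}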

Error: "401 - We’re having trouble identifying your account because your email is incorrect or can’t be found."

Description: Authentication failed because email attribute was not found.

Mitigation:

  1. Validate email attribute is available in the configured OIDC Scopes

  2. Validate the user’s email attribute is mapped correctly

Error: "Unexpected error when authenticating with identity provider"

Description: User authentication failed

Mitigation: Validate that the configured OIDC Scopes exist and match the Identity Provider’s available scopes

Advanced: Look for the specific error message in the URL address

Error: "Unexpected error when authenticating with identity provider (SSO sign-in is not available)"

Description: User authentication failed

Mitigation:

  1. Validate that the configured OIDC scope exists in the Identity Provider

  2. Validate that the configured Client Secret matches the Client Secret in the Identity Provider

Advanced: Look for the specific error message in the URL address

Error: "Client not found"

Description: OIDC Client ID was not found in the Identity Provider

Mitigation: Validate that the configured Client ID matches the Identity Provider Client ID

The NVIDIA Run:ai Scheduler: Concepts and Principles

When a user submits a workload, the workload is directed to the selected Kubernetes cluster and managed by the NVIDIA Run:ai Scheduler. The Scheduler’s primary responsibility is to allocate workloads to the most suitable node or nodes based on resource requirements and other characteristics, as well as adherence to NVIDIA Run:ai’s fairness and quota management.

The NVIDIA Run:ai Scheduler schedules native Kubernetes workloads, NVIDIA Run:ai workloads, or any other type of third-party workloads. To learn more about workloads support, see Introduction to workloads.

To understand what is behind the NVIDIA Run:ai Scheduler’s decision-making logic, get to know the key concepts, resource management and scheduling principles of the Scheduler.

Workloads and Pod Groups

Workloads can range from a single pod running on individual nodes to distributed workloads using multiple pods, each running on a node (or part of a node). For example, a large scale training workload could use up to 128 nodes or more, while an inference workload could use many pods (replicas) and nodes.

Every newly created pod is assigned to a pod group, which can represent one or multiple pods within a workload. For example, a distributed PyTorch training workload with 32 workers is grouped into a single pod group. All pods are attached to the pod group with certain rules, such as gang scheduling, applied to the entire pod group.

Scheduling Queue

A scheduling queue (or simply a queue) represents a scheduler primitive that manages the scheduling of workloads based on different parameters.

A queue is created for each project/node pool pair and department/node pool pair. The NVIDIA Run:ai Scheduler supports hierarchical queueing: project queues are bound to department queues, per node pool. This allows an organization to manage quota, over quota, and more for projects and their associated departments.

Resource Management

Quota

Each project and department includes a set of deserved resource quotas, per node pool and resource type. For example, project “LLM-Train/Node Pool NV-H100” quota parameters specify the number of GPUs, CPUs (cores), and the amount of CPU memory that this project deserves to get when using this node pool. Non-preemptible workloads can only be scheduled if their requested resources are within the deserved resource quotas of their respective project/node-pool and department/node-pool.

Over Quota

Projects and departments can have a share in the unused resources of any node pool, beyond their quota of deserved resources. These resources are referred to as over quota resources. The administrator configures the over quota parameters per node pool for each project and department.

Over Quota Weight

Projects can receive a share of the cluster/node pool unused resources when the over quota weight setting is enabled. The part each Project receives depends on its over quota weight value, and the total weights of all other projects’ over quota weights. The administrator configures the over quota weight parameters per node pool for each project and department.

Multi-Level Quota System

Each project has a set of guaranteed resource quotas (GPUs, CPUs, and CPU memory) per node pool. Projects can go over quota and get a share of the unused resources in a node pool beyond their guaranteed quota in that node pool. The same applies to departments. The Scheduler balances the amount of over quota between departments, and then between projects. The department’s deserved quota and over quota limit the sum of resources of all projects within the department. If a project still has deserved quota available but its department’s deserved quota is exhausted, the Scheduler does not give the project any more deserved resources. The same applies to over quota resources: over quota resources are first given to the department, and only then split among its projects.

Fairshare and Fairshare Balancing

The NVIDIA Run:ai Scheduler calculates a numerical value, fairshare, per project (or department) for each node pool, representing the project’s (department’s) sum of guaranteed resources plus the portion of non-guaranteed resources in that node pool.

The Scheduler aims to provide each project (or department) the resources they deserve per node pool using two main parameters: deserved quota and deserved fairshare (i.e. quota + over quota resources). If one project’s node pool queue is below fairshare and another project’s node pool queue is above fairshare, the Scheduler shifts resources between queues to balance fairness. This may result in the preemption of some over quota preemptible workloads.

Over-Subscription

Over-subscription is a scenario where the sum of all guaranteed resource quotas surpasses the physical resources of the cluster or node pool. In this case, there may be scenarios in which the Scheduler cannot find matching nodes to all workload requests, even if those requests were within the resource quota of their associated projects.

Placement Strategy - Bin-Pack and Spread

The administrator can set a placement strategy, bin-pack or spread, of the Scheduler per node pool. For GPU based workloads, workloads can request both GPU and CPU resources. For CPU-only based workloads, workloads can request CPU resources only.

  • GPU workloads:

    • Bin-pack - The Scheduler places as many workloads as possible in each GPU and node to use fewer resources and maximize GPU and node vacancy.

    • Spread - The Scheduler spreads workloads across as many GPUs and nodes as possible to minimize the load and maximize the available resources per workload.

  • CPU workloads:

    • Bin-pack - The Scheduler places as many workloads as possible in each CPU and node to use fewer resources and maximize CPU and node vacancy.

    • Spread - The Scheduler spreads workloads across as many CPUs and nodes as possible to minimize the load and maximize the available resources per workload.

Scheduling Principles

Priority and Preemption

NVIDIA Run:ai supports scheduling workloads using different priority and preemption policies:

  • High-priority workloads (pods) can preempt lower priority workloads (pods) within the same scheduling queue (project), according to their preemption policy. The NVIDIA Run:ai Scheduler implicitly assumes any PriorityClass >= 100 is non-preemptible and any PriorityClass < 100 is preemptible.

  • Cross project and cross department workload preemptions are referred to as resource reclaim and are based on fairness between queues rather than the priority of the workloads.

To make it easier for users to submit workloads, NVIDIA Run:ai preconfigured several Kubernetes PriorityClass objects. The NVIDIA Run:ai preset PriorityClass objects have their ‘preemptionPolicy’ always set to ‘PreemptLowerPriority’, regardless of their actual NVIDIA Run:ai preemption policy within the NVIDIA Run:ai platform. A non-preemptible workload is only scheduled if in-quota and cannot be preempted after being scheduled, not even by a higher priority workload.

Each preset PriorityClass is listed below with its priority value, NVIDIA Run:ai preemption policy, and Kubernetes preemption policy:

  • Inference – PriorityClass value 125, non-preemptible in NVIDIA Run:ai, Kubernetes preemption policy PreemptLowerPriority

  • Build – PriorityClass value 100, non-preemptible in NVIDIA Run:ai, Kubernetes preemption policy PreemptLowerPriority

  • Interactive-preemptible – PriorityClass value 75, preemptible in NVIDIA Run:ai, Kubernetes preemption policy PreemptLowerPriority

  • Train – PriorityClass value 50, preemptible in NVIDIA Run:ai, Kubernetes preemption policy PreemptLowerPriority

Note

You can override the default priority class of a workload. See Workload priority class control for more details.
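For workloads submitted directly as Kubernetes YAML, the priority can be expressed through the standard priorityClassName field. A minimal sketch, assuming the preset class names listed above are installed in the cluster (the project queue name is hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: priority-demo
  labels:
    runai/queue: team-a            # hypothetical project queue
spec:
  schedulerName: runai-scheduler
  priorityClassName: build         # one of the preset classes described above
  containers:
  - name: main
    image: ubuntu:22.04
    command: ["sleep", "3600"]

You can list the priority classes available in the cluster with kubectl get priorityclasses.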

Preemption of Lower Priority Workloads Within a Project

Workload priority is always respected within a project. This means higher priority workloads are scheduled before lower priority workloads. It also means that higher priority workloads may preempt lower priority workloads within the same project if the lower priority workloads are preemptible.

Fairness (Fair Resource Distribution)

Fairness is a major principle within the NVIDIA Run:ai scheduling system. It means that the NVIDIA Run:ai Scheduler always respects certain resource splitting rules (fairness) between projects and between departments.

Reclaim of Resources Between Projects and Departments

Reclaim is an inter-project (and inter-department) scheduling action that takes back resources from one project (or department) that has used them as over quota, back to a project (or department) that deserves those resources as part of its deserved quota, or to balance fairness between projects, each to its fairshare (i.e. sharing fairly the portion of the unused resources).

Gang Scheduling

Gang scheduling describes a scheduling principle where a workload composed of multiple pods is either fully scheduled (i.e. all pods are scheduled and running) or fully pending (i.e. all pods are not running). Gang scheduling refers to a single pod group.

Next Steps

Now that you have learned the key concepts and principles of the NVIDIA Run:ai Scheduler, see how the Scheduler works - allocating pods to workloads, applying preemption mechanisms, and managing resources.

GPU Fractions

To submit a workload with GPU resources in Kubernetes, you typically need to specify an integer number of GPUs. However, workloads often require diverse GPU memory and compute requirements or even use GPUs intermittently depending on the application (such as inference workloads, training workloads or notebooks at the model-creation phase). Additionally, GPUs are becoming increasingly powerful, offering more processing power and larger memory capacity for applications. Despite the increasing model sizes, the increasing capabilities of GPUs allow them to be effectively shared among multiple users or applications.

NVIDIA Run:ai’s GPU fractions provide an agile and easy-to-use method to share a GPU or multiple GPUs across workloads. With GPU fractions, you can divide the GPU/s memory into smaller chunks and share the GPU/s compute resources between different workloads and users, resulting in higher GPU utilization and more efficient resource allocation.

Benefits of GPU Fractions

Utilizing GPU fractions to share GPU resources among multiple workloads provides numerous advantages for both platform administrators and practitioners, including improved efficiency, resource optimization, and enhanced user experience.

  • For the AI practitioner:

    • Reduced wait time - Workloads with smaller GPU requests are more likely to be scheduled quickly, minimizing delays in accessing resources.

    • Increased workload capacity - More workloads can be run using the same admin-defined GPU quota and available unused resources (over quota).

  • For the platform administrator:

    • Improved GPU utilization - Sharing GPUs across workloads increases the utilization of individual GPUs, resulting in better overall platform efficiency.

    • Higher resource availability - More users gain access to GPU resources, ensuring better distribution.

    • Enhanced workload throughput - More workloads can be served per GPU, ensuring maximum output from existing hardware.

    • Optimized scheduling - Smaller and dynamic resource allocations give the Scheduler a higher chance of finding GPU resources for incoming workloads.

Quota Planning with GPU Fractions

When planning the quota distribution for your projects and departments, using fractions gives the platform administrator the ability to allocate more precise quota per project and department, assuming the usage of GPU fractions or enforcing it with pre-defined policies or compute resource templates.

For example, in an organization with a department budgeted for two nodes of 8×H100 GPUs and a team of 32 researchers:

  • Allocating 0.5 GPU per researcher ensures all researchers have access to GPU resources.

  • Using fractions enables researchers to run smaller workloads intermittently within their quota or go over their quota by using temporary over quota resources with higher resource demanding workloads.

  • Using GPUs for notebook-based model development, where GPUs are not continuously active and can be shared among multiple users.

For more details on mapping your organization and resources, see Adapting AI initiatives to your organization.

How GPU Fractions Work

When a workload is submitted, the Scheduler finds a node with a GPU that can satisfy the requested GPU portion or GPU memory, then it schedules the pod to that node. The NVIDIA Run:ai GPU fractions logic, running locally on each NVIDIA Run:ai worker node, allocates the requested memory size on the selected GPU. Each pod uses its own separate virtual memory address space. NVIDIA Run:ai’s GPU fractions logic enforces the requested memory size, so no workload can use more than requested, and no workload can run over another workload’s memory. This gives users the experience of a ‘logical GPU’ per workload.

While MIG requires administrative work to configure every MIG slice, where a slice is a fixed chunk of memory, GPU fractions allow dynamic and fully flexible allocation of GPU memory chunks. By default, GPU fractions use NVIDIA’s time-slicing to share the GPU compute runtime. You can also use the NVIDIA Run:ai GPU time-slicing which allows dynamic and fully flexible splitting of the GPU compute time.

NVIDIA Run:ai GPU fractions are agile and dynamic allowing a user to allocate and free GPU fractions during the runtime of the system, at any size between zero to the maximum GPU portion (100%) or memory size (up to the maximum memory size of a GPU).

The NVIDIA Run:ai Scheduler can work alongside other schedulers. In order to avoid collisions with other schedulers, the NVIDIA Run:ai Scheduler creates special reservation pods. Once a workload is submitted requesting a fraction of a GPU, NVIDIA Run:ai will create a pod in a dedicated runai-reservation namespace with the full GPU as a resource, allowing other schedulers to understand that the GPU is reserved.
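You can see these reservation pods directly with kubectl; a quick check using the namespace described above:

# Each GPU reserved for fractional workloads appears as a pod in the runai-reservation namespace
kubectl get pods -n runai-reservation -o wide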

Note

  • Splitting a GPU into fractions may generate some fragmentation of the GPU memory. The Scheduler will try to consolidate GPU resources where feasible (i.e. preemptible workloads).

  • Using bin-pack as a scheduling placement strategy can also reduce GPU fragmentation.

  • Using dynamic GPU fractions ensures that even small unused fragments of GPU memory are utilized by workloads.

Multi-GPU Fractions

NVIDIA Run:ai also supports workload submission using multi-GPU fractions. Multi-GPU fractions work similarly to single-GPU fractions, however, the NVIDIA Run:ai Scheduler allocates the same fraction size on multiple GPU devices within the same node. For example, if practitioners develop a new model that uses 8 GPUs and requires 40GB of memory per GPU, they can allocate 8×40GB with multi-GPU fractions instead of reserving the full memory of each GPU (e.g. 80GB). This leaves 40GB of GPU memory available on each of the 8 GPUs for other workloads within that node.

Time sharing, where a single GPU can serve multiple workloads with fractions, remains unchanged; it now serves multiple workloads using multiple GPUs per workload, a single GPU per workload, or a mix of both.

Deployment Considerations

  • Selecting a GPU portion using percentages as units does not guarantee the exact memory size. This means 50% of an A100-40GB GPU is 20GB, while 50% of an A100-80GB GPU is 40GB. To have better control over the exact allocated memory, specify the exact memory size, for example, 40GB.

  • Using NVIDIA Run:ai GPU fractions controls the memory split (i.e. 0.5 GPU means 50% of the GPU memory) but not the compute (processing time). To split the compute time, see NVIDIA Run:ai’s GPU time slicing.

  • NVIDIA Run:ai GPU fractions and MIG mode cannot be used on the same node.

Setting GPU Fractions

Using the compute resources asset, you can define the compute requirements by specifying your requested GPU portion or GPU memory, and use it with any of the NVIDIA Run:ai workload types for single GPU and multi-GPU fractions.

  • Single-GPU fractions - Define the compute requirement to run 1 GPU device, by specifying either a fraction (percentage) of the overall memory or a memory request (GB, MB).

  • Multi-GPU fractions - Define the compute requirement to run multiple GPU devices, by specifying either a fraction (percentage) of the overall memory or a memory request (GB, MB).

Setting GPU Fractions for External Workloads

To enable GPU fractions for workloads submitted via Kubernetes YAML, use the following annotations to define the GPU fraction configuration. You can configure either gpu-fraction or gpu-memory.

Variable
Input Format
Where to Set

gpu-fraction

A portion of GPU memory as a double-precision floating-point number. Example: 0.25, 0.75.

Pod annotation (metadata.annotations)

gpu-memory

Memory size in MiB. Example: 2500, 4096. The gpu-memory values are always in MiB.

Pod annotation (metadata.annotations)

gpu-fraction-num-devices

The number of GPU devices to allocate using the specified gpu-fraction or gpu-memory value. Set this annotation only if you want to request multiple GPU devices.

Pod annotation (metadata.annotations)

The following example YAML creates a pod that requests 2 GPU devices, each requesting 50% of the GPU memory (gpu-fraction: "0.5").

apiVersion: v1
kind: Pod
metadata:
  annotations:
    user: test
    gpu-fraction: "0.5"
    gpu-fraction-num-devices: "2"
  labels:
    runai/queue: test
  name: multi-fractional-pod-job
  namespace: test
spec:
  containers:
  - image: gcr.io/run-ai-demo/quickstart-cuda
    imagePullPolicy: Always
    name: job
    env:
    - name: RUNAI_VERBOSE
      value: "1"
    resources:
      limits:
        cpu: 200m
        memory: 200Mi
      requests:
        cpu: 100m
        memory: 100Mi
    securityContext:
      capabilities:
        drop: ["ALL"]
  schedulerName: runai-scheduler
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 5
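
To request an absolute memory size instead of a portion, replace the gpu-fraction annotation with gpu-memory. The following is a minimal sketch of the relevant metadata only; the value 20000 (MiB per device) is illustrative:

metadata:
  annotations:
    gpu-memory: "20000"             # 20,000 MiB of GPU memory per device
    gpu-fraction-num-devices: "2"   # optional: request the same memory size on 2 devices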

Using CLI

To view the available actions, go to the CLI v2 reference or the CLI v1 reference and run according to your workload.
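
For example, with CLI v2 a fractional workspace can be submitted by requesting either a GPU portion or a GPU memory size. The flag names below follow the CLI v2 reference; verify them against your installed CLI version:

# Request 50% of a single GPU device's memory
runai workspace submit frac-workspace --image gcr.io/run-ai-demo/quickstart-cuda \
  --gpu-portion-request 0.5

# Or request an explicit GPU memory size instead
runai workspace submit frac-workspace --image gcr.io/run-ai-demo/quickstart-cuda \
  --gpu-memory-request 20G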

Using API

To view the available actions, go to the API reference and run according to your workload.

Nodes

This section explains the procedure for managing Nodes.

Nodes are Kubernetes elements automatically discovered by the NVIDIA Run:ai platform. Once a node is discovered by the NVIDIA Run:ai platform, an associated instance is created in the Nodes table, administrators can view the node’s relevant information, and the NVIDIA Run:ai Scheduler can use the node for scheduling.

Nodes Table

The Nodes table can be found under Resources in the NVIDIA Run:ai platform.

The Nodes table displays a list of predefined nodes available to users in the NVIDIA Run:ai platform.

Note

  • It is not possible to create additional nodes, or edit, or delete existing nodes.

  • Only users with relevant permissions can view the table.

The Nodes table consists of the following columns:

Column
Description

Node

The Kubernetes name of the node

Status

The state of the node. Nodes in the Ready state are eligible for scheduling. If the state is Not ready, the main reason appears in parentheses to the right of the state field. Hovering over the state lists the reasons why a node is Not ready.

NVLink domain UID

The MNNVL domain ID, which is part of the MNNVL label value. If the MNNVL label is not the default MNNVL label key (nvidia.com/gpu.clique), this field shows the whole label value.

MNNVL domain clique ID

The MNNVL clique ID, which is part of the MNNVL label value. If the MNNVL label is not the default MNNVL label key (nvidia.com/gpu.clique), this field shows an empty value.

Node pool

The name of the associated node pool. By default, every node in the NVIDIA Run:ai platform is associated with the default node pool unless another node pool is assigned.

GPU type

The GPU model, for example, H100, or V100

GPU devices

The number of GPU devices installed on the node. Clicking this field pops up a dialog with details per GPU (described below in this article)

Free GPU devices

The current number of fully vacant GPU devices

GPU memory

The total amount of GPU memory installed on this node. For example, if the number is 640GB and the number of GPU devices is 8, then each GPU is installed with 80GB of memory (assuming the node is assembled of homogenous GPU devices)

Allocated GPUs

The total allocation of GPU devices in units of GPUs (decimal number). For example, if 3 GPUs are 50% allocated, the field prints out the value 1.50. This value represents the portion of GPU memory consumed by all running pods using this node

Used GPU memory

The actual amount of memory (in GB or MB) used by pods running on this node.

GPU compute utilization

The average compute utilization of all GPU devices in this node

GPU memory utilization

The average memory utilization of all GPU devices in this node

CPU (Cores)

The number of CPU cores installed on this node

CPU memory

The total amount of CPU memory installed on this node

Allocated CPU (Cores)

The number of CPU cores allocated by pods running on this node (decimal number, e.g. a pod allocating 350 milli-cores shows an allocation of 0.35 cores).

Allocated CPU memory

The total amount of CPU memory allocated by pods running on this node (in GB or MB)

Used CPU memory

The total amount of actually used CPU memory by pods running on this node. Pods may allocate memory but not use all of it, or go beyond their CPU memory allocation if using Limit > Request for CPU memory (burstable workload)

CPU compute utilization

The utilization of all CPU compute resources on this node (percentage)

CPU memory utilization

The utilization of all CPU memory resources on this node (percentage)

Used swap CPU memory

The amount of CPU memory (in GB or MB) used for GPU swap memory (* future)

Pod(s)

List of pods running on this node, click the field to view details (described below in this article)

GPU Devices for Node

Click one of the values in the GPU devices column, to view the list of GPU devices and their parameters.

Column
Description

Index

The GPU index, read from the GPU hardware. The same index is used when accessing the GPU directly

Used memory

The amount of memory used by pods and drivers using the GPU (in GB or MB)

Compute utilization

The portion of time the GPU is being used by applications (percentage)

Memory utilization

The portion of the GPU memory that is being used by applications (percentage)

Idle time

The elapsed time since the GPU was used (i.e. the GPU is being idle for ‘Idle time’)

Pods Associated with Node

Click one of the values in the Pod(s) column, to view the list of pods and their parameters.

Note

This column is only viewable if your role in the NVIDIA Run:ai platform gives you read access to workloads. Even if you are allowed to view workloads, you can only view the workloads within your allowed scope. This means there might be more pods running on this node than appear in the list you are viewing.

Column
Description

Pod

The Kubernetes name of the pod. Usually the pod name is composed of the parent workload’s name (if there is one) and an index that is unique for that pod instance within the workload

Status

The state of the pod. In steady state this should be Running, along with the amount of time the pod has been running

Project

The NVIDIA Run:ai project name the pod belongs to. Clicking this field takes you to the Projects table filtered by this project name

Workload

The workload name the pod belongs to. Clicking this field takes you to the Workloads table filtered by this workload name

Image

The full path of the image used by the main container of this pod

Creation time

The pod’s creation date and time

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

  • Show/Hide details - Click to view additional information on the selected row

Show/Hide Details

Click a row in the Nodes table and then click the Show details button at the upper right side of the action bar. The details screen appears, presenting the following metrics graphs:

  • GPU utilization - Per GPU graph and an average of all GPUs graph, all on the same chart, along an adjustable period allows you to see the trends of all GPUs compute utilization (percentage of GPU compute) in this node.

  • GPU memory utilization - Per GPU graph and an average of all GPUs graph, all on the same chart, along an adjustable period allows you to see the trends of all GPUs memory usage (percentage of the GPU memory) in this node.

  • CPU compute utilization - The average of all CPUs’ cores compute utilization graph, along an adjustable period allows you to see the trends of CPU compute utilization (percentage of CPU compute) in this node.

  • CPU memory utilization - The utilization of all CPUs memory in a single graph, along an adjustable period allows you to see the trends of CPU memory utilization (percentage of CPU memory) in this node.

  • CPU memory usage - The usage of all CPUs memory in a single graph, along an adjustable period allows you to see the trends of CPU memory usage (in GB or MB of CPU memory) in this node.

  • For GPUs charts - Click the GPU legend on the right-hand side of the chart, to activate or deactivate any of the GPU lines.

  • You can click the date picker to change the presented period

  • You can use your mouse to mark a sub-period in the graph for zooming in, and use the ‘Reset zoom’ button to go back to the preset period

  • Changes in the period affect all graphs on this screen.

Using API

To view the available actions, go to the Nodes API reference.

Running Jupyter Notebooks Using Workspaces

This quick start provides a step-by-step walkthrough for running a Jupyter Notebook using workspaces.

A workspace contains the setup and configuration needed for building your model, including the container, images, data sets, and resource requests, as well as the required tools for the research, all in one place. See Running workspaces for more information.

Note

If enabled by your Administrator, the NVIDIA Run:ai UI allows you to create a new workload using either the Flexible or Original submission form. The steps in this quick start guide reflect the Original form only.

Prerequisites

Before you start, make sure:

  • You have created a project or have one created for you.

  • The project has an assigned quota of at least 1 GPU.

Step 1: Logging In

Step 2: Submitting a Workspace

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Workspace

  3. Select under which cluster to create the workload

  4. Select the project in which your workspace will run

  5. Select a preconfigured template or select Start from scratch to launch a new workspace quickly

  6. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Select the ‘jupyter-lab’ environment for your workspace (Image URL: jupyter/scipy-notebook)

    • If ‘jupyter-lab’ is not displayed in the gallery, follow the below steps:

      • Click +NEW ENVIRONMENT

      • Enter a name for the environment. The name must be unique.

      • Enter the jupyter-lab Image URL - jupyter/scipy-notebook

      • Tools - Set the connection for your tool

        • Click +TOOL

        • Select Jupyter tool from the list

      • Set the runtime settings for the environment

        • Click +COMMAND

        • Enter command - start-notebook.sh

        • Enter arguments - --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''

          Note: If host-based routing is enabled on the cluster, enter the argument --NotebookApp.token='' only.

      • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  9. Select the ‘one-gpu’ compute resource for your workspace (GPU devices: 1)

    • If ‘one-gpu’ is not displayed in the gallery, follow the below steps:

      • Click +NEW COMPUTE RESOURCE

      • Enter a name for the compute resource. The name must be unique.

      • Set GPU devices per pod - 1

      • Set GPU memory per device

        • Select % (of device) - Fraction of a GPU device’s memory

        • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  10. Click CREATE WORKSPACE

Copy the following command to your terminal. Make sure to update the below with the name of your project and workload. For more details, see CLI reference:

runai project set "project-name"
runai workspace submit "workload-name" \
--image jupyter/scipy-notebook --gpu-devices-request 1 \
--command --external-url container=8888 -- start-notebook.sh \
--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''

Copy the following command to your terminal. Make sure to update the below with the name of your project and workload. For more details, see CLI reference:

runai config project "project-name"
runai submit "workload-name" --jupyter -g 1

Copy the following command to your terminal. Make sure to update the below parameters. For more details, see Workspaces API:

curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
    "name": "workload-name",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "command" : "start-notebook.sh",
        "args" : "--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''",
        "image": "jupyter/scipy-notebook",
        "compute": {
            "gpuDevicesRequest": 1
        },
        "exposedUrls" : [
            {
                "container" : 8888,
                "toolType": "jupyter-notebook",
                "toolName": "Jupyter"
            }
        ]
    }
}'
  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

  • toolType will show the Jupyter icon when connecting to the Jupyter tool via the user interface.

  • toolName will show when connecting to the Jupyter tool via the user interface.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 3: Connecting to the Jupyter Notebook

  1. Select the newly created workspace with the Jupyter application that you want to connect to

  2. Click CONNECT

  3. Select the Jupyter tool. The selected tool is opened in a new tab on your browser.

To connect to the Jupyter Notebook, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>

Next Steps

Manage and monitor your newly created workload using the Workloads table.

Data Volumes

Data volumes (DVs) are one type of workload assets. They offer a powerful solution for storing, managing, and sharing AI training data, promoting collaboration, simplifying data access control, and streamlining the AI development lifecycle.

Acting as a central repository for organizational data resources, data volumes can represent datasets or raw data, that is stored in Kubernetes Persistent Volume Claims (PVCs).

Once a data volume is created, it can be shared with multiple additional scopes and easily utilized by AI practitioners when submitting workloads. Shared data volumes are mounted with read-only permissions, ensuring data integrity. Any modifications to the data in a shared DV must be made by writing to the original volume of the PVC used to create the data volume.

Note

  • Data volumes are disabled by default. If you cannot see Data volumes, ask your Administrator to enable them under General settings → Workloads → Data volumes.

  • Data volumes are supported only for flexible workload submission.

Why Use a Data Volume?

  1. Sharing with multiple scopes - Data volumes can be shared across different scopes in a cluster, including projects and departments. Using data volumes allows for data reuse and collaboration within the organization.

  2. Storage saving - A single copy of the data can be used across multiple scopes

Typical Use Cases

  1. Sharing large datasets - In large organizations, the data is often stored in a remote location, which can be a barrier for large model training. Even if the data is transferred into the cluster, sharing it easily with multiple users is still challenging. Data volumes can help share the data seamlessly, with maximum security and control.

  2. Sharing data with colleagues - When sharing training results, generated datasets, or other artifacts with team members is needed, data volumes can help make the data available easily.

Prerequisites

To create a data volume, you must have a PVC data source already created. Make sure the PVC includes data before sharing it.
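
For reference, the origin PVC is a standard Kubernetes PVC created in the origin project's namespace. The manifest below is only an illustrative sketch; the name, namespace, storage class, and size are placeholders and not values required by NVIDIA Run:ai:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data-pvc            # placeholder name
  namespace: runai-my-project        # placeholder: the origin project's namespace
spec:
  accessModes:
    - ReadWriteMany                  # allows the data to be mounted by multiple workloads
  storageClassName: my-storage-class # placeholder storage class
  resources:
    requests:
      storage: 100Gi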

Data Volumes Table

The data volumes table can be found under Workload manager in the NVIDIA Run:ai platform.

The data volumes table provides a list of all the data volumes defined in the platform and allows you to manage them.

The data volumes table comprises the following columns:

Column
Description

Data volume

The name of the data volume

Description

A description of the data volume

Status

The lifecycle phases and current condition of the data volume

Scope

The scope of the data volume within the organizational tree. Click the scope name to view the organizational tree diagram

Origin project

The project of the origin PVC

Origin PVC

The original PVC from which the data volume was created that points to the same PV

Cluster

The cluster that the data volume is associated with

Created by

The user who created the data volume

Creation time

The timestamp for when the data volume was created

Last updated

The timestamp of when the data volume was last updated

Data Volumes Status

The following table describes the data volumes' condition and whether they were created successfully for the selected scope.

Status
Description

No issues found

No issues were found while creating the data volume

Issues found

Issues were found while sharing the data volume. Contact NVIDIA Run:ai support.

Creating…

The data volume is being created

Deleting...

The data volume is being deleted

No status / “-”

When the data volume’s scope is an account, the current version of the cluster is not up to date, or the asset is not a cluster-syncing entity, the status can’t be displayed

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Refresh - Click REFRESH to update the table with the latest data

Adding a New Data Volume

To create a new data volume:

  1. Click +NEW DATA VOLUME

  2. Enter a name for the data volume. The name must be unique.

  3. Optional: Provide a description of the data volume

  4. Set the project where the data is located

  5. Set a PVC from which to create the data volume

  6. Set the Scopes that will be able to mount the data volume

  7. Click CREATE DATA VOLUME

Editing a Data Volume

To edit a data volume:

  1. Select the data volume you want to edit

  2. Click Edit

  3. Click SAVE DATA VOLUME

Copying a Data Volume

To copy an existing data volume:

  1. Select the data volume you want to copy

  2. Click MAKE A COPY

  3. Enter a name for the data volume. The name must be unique.

  4. Set a new Origin PVC for your data volume, since only one Origin PVC can be used per data volume

  5. Click CREATE DATA VOLUME

Deleting a Data Volume

To delete a data volume:

  1. Select the data volume you want to delete

  2. Click DELETE

  3. Confirm you want to delete the data volume

Note

It is not possible to delete a data volume being used by an existing workload.

Using API

To view the available actions, go to the Data volumes API reference.

Set Up SSO with SAML

Single Sign-On (SSO) is an authentication scheme, allowing users to log in with a single pair of credentials to multiple, independent software systems.

This section explains the procedure to configure SSO to NVIDIA Run:ai using the SAML 2.0 protocol.

Prerequisites

Before you start, make sure you have the IDP Metadata XML available from your identity provider.

Setup

Adding the Identity Provider

  1. Go to General settings

  2. Open the Security section and click +IDENTITY PROVIDER

  3. Select Custom SAML 2.0

  4. Select either From computer or From URL to upload your identity provider metadata file

    • From computer - Click the Metadata XML file field, then select your file for upload

    • From URL - In the Metadata XML field, enter the URL to the IDP Metadata XML file

  5. You can either copy the Redirect URL and Entity ID displayed on the screen and enter them in your identity provider, or use the service provider metadata XML, which contains the same information in XML format. This file becomes available after you click SAVE in step 7.

  6. Optional: Enter the user attributes and their value in the identity provider as shown in the below table

  7. Click SAVE. After save, click Open service provider metadata XML to access the metadata file. This file can be used to configure your identity provider.

  8. Optional: Enable Auto-Redirect to SSO to automatically redirect users to your configured identity provider’s login page when accessing the platform.

Attribute
Default value in NVIDIA Run:ai
Description

User role groups

GROUPS

If it exists in the IDP, it allows you to assign NVIDIA Run:ai role groups via the IDP. The IDP attribute must be a list of strings.

Linux User ID

UID

If it exists in the IDP, it allows Researcher containers to start with the Linux User UID. Used to map access to network resources such as file systems to users. The IDP attribute must be of type integer.

Linux Group ID

GID

If it exists in the IDP, it allows Researcher containers to start with the Linux Group GID. The IDP attribute must be of type integer.

Supplementary Groups

SUPPLEMENTARYGROUPS

If it exists in the IDP, it allows Researcher containers to start with the relevant Linux supplementary groups. The IDP attribute must be a list of integers.

Email

email

Defines the user attribute in the IDP holding the user's email address, which is the user identifier in NVIDIA Run:ai.

User first name

firstName

Used as the user’s first name appearing in the NVIDIA Run:ai platform.

User last name

lastName

Used as the user’s last name appearing in the NVIDIA Run:ai platform.

Testing the Setup

  1. Open the NVIDIA Run:ai platform as an admin

  2. Add access rules to an SSO user defined in the IDP

  3. Open the NVIDIA Run:ai platform in an incognito browser tab

  4. On the sign-in page click CONTINUE WITH SSO. You are redirected to the identity provider sign in page

  5. In the identity provider sign-in page, log in with the SSO user to whom you granted access rules

  6. If you are unsuccessful in signing in to the identity provider, follow the Troubleshooting section below

Editing the Identity Provider

You can view the identity provider details and edit its configuration:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider box, click Edit identity provider

  4. You can edit either the metadata file or the user attributes

  5. You can view the identity provider URL, identity provider entity ID, and the certificate expiration date

Removing the Identity Provider

You can remove the identity provider configuration:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider card, click Remove identity provider

  4. In the dialog, click REMOVE to confirm the action

Note

To avoid losing access, removing the identity provider must be carried out by a local user.

Downloading the IDP Metadata XML File

You can download the XML file to view the identity provider settings:

  1. Go to General settings

  2. Open the Security section

  3. On the identity provider card, click Edit identity provider

  4. In the dialog, click DOWNLOAD IDP METADATA XML FILE

Troubleshooting

If testing the setup was unsuccessful, try the different troubleshooting scenarios according to the error you received. If an error still occurs, check the advanced troubleshooting section.

Troubleshooting Scenarios

Error: "Invalid signature in response from identity provider"

Description: After trying to log in, the following message is received in the NVIDIA Run:ai login page.

Mitigation:

  1. Go to the General settings menu

  2. Open the Security section

  3. In the identity provider box, check for a "Certificate expired” error

  4. If it is expired, update the SAML metadata file to include a valid certificate

Error: "401 - We’re having trouble identifying your account because your email is incorrect or can’t be found."

Description: Authentication failed because email attribute was not found.

Mitigation: Validate the user’s email attribute is mapped correctly

Error: "403 - Sorry, we can’t let you see this page. Something about permissions…"

Description: The authenticated user is missing permissions

Mitigation:

  1. Validate that either the user or its related group/s are assigned with access rules

  2. Validate the user’s groups attribute is mapped correctly

Advanced:

  1. Open the Chrome DevTools: Right-click on page → Inspect → Console tab

  2. Run the following command to retrieve and paste the user’s token: localStorage.token;

  3. Paste the token in https://jwt.io

  4. Under the Payload section validate the values of the user’s attributes
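
For reference, a decoded token payload contains the mapped user attributes. The example below is purely illustrative; names and values are placeholders:

{
  "email": "jane.doe@example.com",
  "firstName": "Jane",
  "lastName": "Doe",
  "groups": ["research-team"],
  "uid": 1234,
  "gid": 1234
}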

Advanced Troubleshooting

Validating the SAML request

The SAML login flow can be separated into two parts:

  • NVIDIA Run:ai redirects to the IDP for log-ins using a SAML Request

  • On successful log-in, the IDP redirects back to NVIDIA Run:ai with a SAML Response

Validate the SAML Request to ensure the SAML flow works as expected:

  1. Go to the NVIDIA Run:ai login screen

  2. Open the Chrome Network inspector: Right-click → Inspect on the page → Network tab

  3. On the sign-in page click CONTINUE WITH SSO.

  4. Once redirected to the Identity Provider, search in the Chrome network inspector for an HTTP request showing the SAML Request. Depending on the IDP url, this would be a request to the IDP domain name. For example, accounts.google.com/idp?1234.

  5. When found, go to the Payload tab and copy the value of the SAML Request

  6. Paste the value into a SAML decoder (e.g. https://www.samltool.com/decode.php)

  7. Validate the request:

    • The content of the <saml:Issuer> tag is the same as the Entity ID given when adding the identity provider

    • The content of the AssertionConsumerServiceURL is the same as the Redirect URI given when adding the identity provider

  8. Validate the response:

    • The user email under the <saml2:Subject> tag is the same as the logged-in user

    • Make sure that under the <saml2:AttributeStatement> tag, there is an Attribute named email (lowercase). This attribute is mandatory.

    • If other, optional user attributes (groups, firstName, lastName, uid, gid) are mapped make sure they also exist under <saml2:AttributeStatement> along with their respective values.
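
For reference, the attribute statement in a valid SAML Response looks roughly like the following. This is an illustrative sketch; the attribute values are placeholders:

<saml2:AttributeStatement>
  <saml2:Attribute Name="email">
    <saml2:AttributeValue>jane.doe@example.com</saml2:AttributeValue>
  </saml2:Attribute>
  <saml2:Attribute Name="groups">
    <saml2:AttributeValue>research-team</saml2:AttributeValue>
  </saml2:Attribute>
</saml2:AttributeStatement>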

Advanced Control Plane Configurations

Helm Chart Values

The NVIDIA Run:ai control plane installation can be customized to support your environment via values files or --set flags as part of the Helm install. Make sure to restart the relevant NVIDIA Run:ai pods so they can fetch the new configurations.

Key
Change
Description

global.ingress.ingressClass

Ingress class

NVIDIA Run:ai default is using NGINX. If your cluster has a different ingress controller, you can configure the ingress class to be created by NVIDIA Run:ai

global.ingress.tlsSecretName

TLS secret name

NVIDIA Run:ai requires the creation of a secret with domain certificate. If the runai-backend namespace already had such a secret, you can set the secret name here

<service-name>.podLabels

Pod labels

Set NVIDIA Run:ai and 3rd party services' Pod Labels in a format of key/value pairs.

<service-name>
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 250m
      memory: 256Mi

Pod request and limits

Set NVIDIA Run:ai and 3rd party services' resources

disableIstioSidecarInjection.enabled

Disable Istio sidecar injection

Disable the automatic injection of Istio sidecars across the entire NVIDIA Run:ai Control Plane services.

global.affinity

System nodes

Sets the system nodes where NVIDIA Run:ai system-level services are scheduled. Default: Prefer to schedule on nodes that are labeled with node-role.kubernetes.io/runai-system

global.customCA.enabled

Certificate authority

Enables the use of a custom Certificate Authority (CA) in your deployment. When set to true, the system is configured to trust a user-provided CA certificate for secure communication.
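
As a sketch of how a value from the table is applied, assuming the control plane is installed as the runai-backend Helm release (adjust the chart reference and values to your installation):

helm upgrade runai-backend runai-backend/control-plane -n runai-backend \
  --reuse-values \
  --set global.ingress.ingressClass=nginx   # example key taken from the table above

# Restart the NVIDIA Run:ai control plane pods so they pick up the new configuration
kubectl rollout restart deployment -n runai-backend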

Additional Third-Party Configurations

The NVIDIA Run:ai control plane chart includes multiple sub-charts of third-party components:

  • Data store - PostgreSQL (postgresql)

  • Metrics Store - Thanos (thanos)

  • Identity & Access Management - Keycloakx (keycloakx)

  • Analytics Dashboard - Grafana (grafana)

  • Caching, Queue - NATS (nats)

Tip

Click on any component to view its chart values and configurations.

PostgreSQL

If you have opted to connect to an external PostgreSQL database, refer to the additional configurations table below. Adjust the following parameters based on your connection details:

  1. Disable PostgreSQL deployment - postgresql.enabled

  2. NVIDIA Run:ai connection details - global.postgresql.auth

  3. Grafana connection details - grafana.dbUser, grafana.dbPassword

Key
Change
Description

postgresql.enabled

PostgreSQL installation

If set to false, PostgreSQL will not be installed.

global.postgresql.auth.host

PostgreSQL host

Hostname or IP address of the PostgreSQL server.

global.postgresql.auth.port

PostgreSQL port

Port number on which PostgreSQL is running.

global.postgresql.auth.username

PostgreSQL username

Username for connecting to PostgreSQL.

global.postgresql.auth.password

PostgreSQL password

Password for the PostgreSQL user specified by global.postgresql.auth.username.

global.postgresql.auth.postgresPassword

PostgreSQL default admin password

Password for the built-in PostgreSQL superuser (postgres).

global.postgresql.auth.existingSecret

Postgres Credentials (secret)

Existing secret name with authentication credentials.

global.postgresql.auth.dbSslMode

Postgres connection SSL mode

Set the SSL mode. See the full list in Protection Provided in Different Modes. Prefer mode is not supported.

postgresql.primary.initdb.password

PostgreSQL default admin password

Set the same password as in global.postgresql.auth.postgresPassword (if changed).

postgresql.primary.persistence.storageClass

Storage class

The installation is configured to work with a specific storage class instead of the default one.
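
A minimal values sketch for an external database might look like the following; hostnames and credentials are placeholders and should match your environment:

postgresql:
  enabled: false                    # disable the bundled PostgreSQL deployment
global:
  postgresql:
    auth:
      host: postgres.example.local  # placeholder host
      port: 5432
      username: runai               # placeholder username
      password: <PASSWORD>
grafana:
  dbUser: runai                     # placeholder Grafana database username
  dbPassword: <PASSWORD>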

Thanos

Note

This section applies to Kubernetes only.

Key
Change
Description

thanos.receive.persistence.storageClass

Storage class

The installation is configured to work with a specific storage class instead of the default one.

Keycloakx

The keycloakx.adminUser can only be set during the initial installation. The admin password can be changed later through the Keycloak UI, but you must also update the keycloakx.adminPassword value in the Helm chart using helm upgrade. Failing to update the Helm values after changing the password can lead to control plane services encountering errors.

Key
Change
Description

keycloakx.adminUser

User name of the internal identity provider administrator

This user is the administrator of Keycloak.

keycloakx.adminPassword

Password of the internal identity provider administrator

This password is for the administrator of Keycloak.

keycloakx.existingSecret

Keycloakx Credentials (secret)

Existing secret name with authentication credentials.

global.keycloakx.host

KeyCloak (NVIDIA Run:ai internal identity provider) host path

Override the DNS for Keycloak. This can be used to access Keycloak externally to the cluster.
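
For example, after changing the admin password through the Keycloak UI, the Helm value can be updated along these lines, assuming the standard runai-backend release and chart names:

helm upgrade runai-backend runai-backend/control-plane -n runai-backend \
  --reuse-values \
  --set keycloakx.adminPassword=<NEW-PASSWORD>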

Grafana

Key
Change
Description

grafana.db.existingSecret

Grafana database connection credentials (secret)

Existing secret name with authentication credentials.

grafana.dbUser

Grafana database username

Username for accessing the Grafana database.

grafana.dbPassword

Grafana database password

Password for the Grafana database user.

grafana.admin.existingSecret

Grafana admin default credentials (secret)

Existing secret name with authentication credentials.

grafana.adminUser

Grafana username

Override the NVIDIA Run:ai default user name for accessing Grafana.

grafana.adminPassword

Grafana password

Override the NVIDIA Run:ai default password for accessing Grafana.

Dynamic GPU Fractions

Many workloads utilize GPU resources intermittently, with long periods of inactivity. These workloads typically need GPU resources when they are running AI applications or debugging a model in development. Other workloads such as inference may utilize GPUs at lower rates than requested, but may demand higher resource usage during peak utilization. The disparity between resource request and actual resource utilization often leads to inefficient utilization of GPUs. This usually occurs when multiple workloads request resources based on their peak demand, despite operating below those peaks for the majority of their runtime.

To address this challenge, NVIDIA Run:ai has introduced dynamic GPU fractions. This feature optimizes GPU utilization by enabling workloads to dynamically adjust their resource usage. It allows users to specify a guaranteed fraction of GPU memory and compute resources with a higher limit that can be dynamically utilized when additional resources are requested.

How Dynamic GPU Fractions Work

With dynamic GPU fractions, users can submit workloads using GPU fraction Request and Limit, which is achieved by leveraging the Kubernetes Request and Limit notations. You can either:

  • Request a GPU fraction (portion) using a percentage of a GPU and specify a Limit

  • Request a GPU memory size (GB, MB) and specify a Limit

When setting a GPU memory limit either as GPU fraction or GPU memory size, the Limit must be equal to or greater than the GPU fractional memory request. Both GPU fraction and GPU memory are translated into the actual requested memory size of the Request (guaranteed resources) and the Limit (burstable resources - non guaranteed).

For example, a user can specify a workload with a GPU fraction request of 0.25 GPU, and add a limit of up to 0.80 GPU. The NVIDIA Run:ai Scheduler schedules the workload to a node that can provide the GPU fraction request (0.25), and then assigns the workload to a GPU. The GPU scheduler monitors the workload and allows it to occupy memory between 0 to 0.80 of the GPU memory (based on the Limit), where only 0.25 of the GPU memory is guaranteed to that workload. The rest of the memory (from 0.25 to 0.8) is “loaned” to the workload, as long as it is not needed by other workloads.

NVIDIA Run:ai automatically manages the state changes between Request and Limit as well as the reverse (when the balance needs to be "returned"), updating the workloads’ utilization vs. Request and Limit parameters in the metrics pane for each workload.

To guarantee fair quality of service between different workloads using the same GPU, NVIDIA Run:ai developed an extendable GPUOOMKiller (Out Of Memory Killer) component that guarantees the quality of service using Kubernetes semantics for resources of Request and Limit.

The OOMKiller capability requires adding CAP_KILL capabilities to the dynamic GPU fractions and to the NVIDIA Run:ai core scheduling module (toolkit daemon). This capability is enabled by default.

Note

Dynamic GPU fractions is enabled by default in the cluster. Disabling dynamic GPU fractions in runaiconfig removes the CAP_KILL capability.

Multi-GPU Dynamic Fractions

NVIDIA Run:ai also supports workload submission using multi-GPU dynamic fractions. Multi-GPU dynamic fractions work similarly to dynamic fractions on a single GPU workload, however, instead of a single GPU device, the NVIDIA Run:ai Scheduler allocates the same dynamic fraction pair (Request and Limit) on multiple GPU devices within the same node. For example, if practitioners develop a new model that uses 8 GPUs and requires 40GB of memory per GPU, but may want to burst out and consume up to the full GPU memory, they can allocate 8×40GB with multi-GPU fractions and a limit of 80GB (e.g. H100 GPU) instead of reserving the full memory of each GPU (e.g. 80GB). This leaves 40GB of GPU memory available on each of the 8 GPUs for other workloads within that node. This is useful during model development, where memory requirements are usually lower due to experimentation with smaller models or configurations.

This approach significantly improves GPU utilization and availability, enabling more precise and often smaller quota requirements for the end user. Time sharing where single GPUs can serve multiple workloads with dynamic fractions remains unchanged, only now, it serves multiple workloads using multi-GPUs per workload.

Setting Dynamic GPU Fractions

Note

Dynamic GPU fractions is disabled by default in the NVIDIA Run:ai UI. To use dynamic GPU fractions, it must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.

Using the compute resources asset, you can define the compute requirements by specifying your requested GPU portion or GPU memory, and set a Limit. You can then use the compute resource with any of the NVIDIA Run:ai workload types for single and multi-GPU dynamic fractions. In addition, you will be able to view the workloads’ utilization vs. Request and Limit parameters in the metrics pane for each workload.

  • Single dynamic GPU fractions - Define the compute requirement to run 1 GPU device, by specifying either a fraction (percentage) of the overall memory or specifying the memory request (GB, MB) with a Limit. The limit must be equal to or greater than the GPU fractional memory request.

  • Multi-GPU dynamic fractions - Define the compute requirement to run multiple GPU devices, by specifying either a fraction (percentage) of the overall memory or specifying the memory request (GB, MB) with a Limit. The limit must be equal to or greater than the GPU fractional memory request.

Note

When setting a workload with dynamic GPU fractions, (for example, when using it with GPU Request or GPU memory Limits), you practically make the workload burstable. This means it can use memory that is not guaranteed for that workload and is susceptible to an ‘OOM Kill’ signal if the actual owner of that memory requires it back. This applies to non-preemptive workloads as well. For that reason, it is recommended that you use dynamic GPU fractions with Interactive workloads running Notebooks. Notebook pods are not evicted when their GPU process is OOM Kill’ed. This behavior is the same as standard Kubernetes burstable CPU workloads.

Setting Dynamic GPU Fractions for External Workloads

To enable dynamic GPU fractions for workloads submitted via Kubernetes YAML, use the following annotations to define the GPU fraction configuration. You can configure either gpu-fraction or gpu-memory. You must also set the RUNAI_GPU_MEMORY_LIMIT environment variable in the first container to enforce the memory limit. This is the GPU consuming container.

Variable
Input Format
Where to Set

gpu-fraction

A portion of GPU memory as a double-precision floating-point number. Example: 0.25, 0.75.

Pod annotation (metadata.annotations)

gpu-memory

Memory size in MiB. Example: 2500, 4096. The gpu-memory values are always in MiB.

Pod annotation (metadata.annotations)

gpu-fraction-num-devices

The number of GPU devices to allocate using the specified gpu-fraction or gpu-memory value. Set this annotation only if you want to request multiple GPU devices.

Pod annotation (metadata.annotations)

RUNAI_GPU_MEMORY_LIMIT

  • To use for gpu-fraction - Specify a double-precision floating-point number. Example: 0.95

  • To use for gpu-memory - Specify a Kubernetes resource quantity format. Example: 500000000, 2500M

The limit must be equal to or greater than the GPU fractional memory request.

Environment variable in the first container

The following example YAML creates a pod that requests 2 GPU devices, each requesting 50% of memory (gpu-fraction: "0.5") and allows usage of up to 95% (RUNAI_GPU_MEMORY_LIMIT: "0.95") if available.

apiVersion: v1
kind: Pod
metadata:
  annotations:
    user: test
    gpu-fraction: "0.5"
    gpu-fraction-num-devices: "2"
  labels:
    runai/queue: test
  name: multi-fractional-pod-job
  namespace: test
spec:
  containers:
  - image: gcr.io/run-ai-demo/quickstart-cuda
    imagePullPolicy: Always
    name: job
    env:
    - name: RUNAI_VERBOSE
      value: "1"
    - name: RUNAI_GPU_MEMORY_LIMIT
      value: "0.95"
    resources:
      limits:
        cpu: 200m
        memory: 200Mi
      requests:
        cpu: 100m
        memory: 100Mi
    securityContext:
      capabilities:
        drop: ["ALL"]
  schedulerName: runai-scheduler
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 5

Using CLI

To view the available actions, go to the CLI v2 reference and run according to your workload.
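
For example, with CLI v2 a dynamic-fraction workspace pairs a request with a limit. The flag names below follow the CLI v2 reference; verify them against your installed CLI version:

# Guaranteed 25% of a GPU device's memory, may burst up to 80% when that memory is free
runai workspace submit dyn-frac-workspace --image gcr.io/run-ai-demo/quickstart-cuda \
  --gpu-portion-request 0.25 --gpu-portion-limit 0.80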

Using API

To view the available actions, go to the API reference and run according to your workload.

Control Plane System Requirements

The NVIDIA Run:ai control plane is a Kubernetes application. This section explains the required hardware and software system requirements for the NVIDIA Run:ai control plane. Before you start, make sure to review the Installation overview.

Installer Machine

The machine running the installation script (typically the Kubernetes master) must have:

  • At least 50GB of free space

  • Docker installed

  • Helm 3.14 or later

Note

If you are installing an air-gapped version of NVIDIA Run:ai, the NVIDIA Run:ai software artifacts include the Helm binary.

Hardware Requirements

The following hardware requirements are for the control plane system nodes. By default, all NVIDIA Run:ai control plane services run on all available nodes.

Architecture

  • x86 – Supported for both Kubernetes and OpenShift deployments.

  • ARM – Supported for Kubernetes only. ARM is currently not supported for OpenShift.

NVIDIA Run:ai Control Plane - System Nodes

This configuration is the minimum requirement you need to install and use NVIDIA Run:ai control plane:

Component
Required Capacity

CPU

10 cores

Memory

12GB

Disk space

110GB

Note

To designate nodes to NVIDIA Run:ai system services, follow the instructions as described in System nodes.

If NVIDIA Run:ai control plane is planned to be installed on the same Kubernetes cluster as the NVIDIA Run:ai cluster, make sure the cluster Hardware requirements are considered in addition to the NVIDIA Run:ai control plane hardware requirements.

Software Requirements

The following software requirements must be fulfilled.

Operating System

  • Any Linux operating system supported by both Kubernetes and NVIDIA GPU Operator

  • Internal tests are being performed on Ubuntu 22.04 and CoreOS for OpenShift.

Network Time Protocol

Nodes are required to be synchronized by time using NTP (Network Time Protocol) for proper system functionality.

Kubernetes Distribution

NVIDIA Run:ai control plane requires Kubernetes. The following Kubernetes distributions are supported:

  • Vanilla Kubernetes

  • OpenShift Container Platform (OCP)

  • NVIDIA Base Command Manager (BCM)

  • Elastic Kubernetes Engine (EKS)

  • Google Kubernetes Engine (GKE)

  • Azure Kubernetes Service (AKS)

  • Oracle Kubernetes Engine (OKE)

  • Rancher Kubernetes Engine (RKE1)

  • Rancher Kubernetes Engine 2 (RKE2)

Note

The latest release of the NVIDIA Run:ai control plane supports Kubernetes 1.30 to 1.32 and OpenShift 4.14 to 4.18.

See the following Kubernetes version support matrix for the latest NVIDIA Run:ai releases:

NVIDIA Run:ai version
Supported Kubernetes versions
Supported OpenShift versions

v2.17

1.27 to 1.29

4.12 to 4.15

v2.18

1.28 to 1.30

4.12 to 4.16

v2.19

1.28 to 1.31

4.12 to 4.17

v2.20

1.29 to 1.32

4.14 to 4.17

v2.21 (latest)

1.30 to 1.32

4.14 to 4.18

For information on supported versions of managed Kubernetes, it's important to consult the release notes provided by your Kubernetes service provider. There, you can confirm the specific version of the underlying Kubernetes platform supported by the provider, ensuring compatibility with NVIDIA Run:ai. For an up-to-date end-of-life statement see Kubernetes Release History or OpenShift Container Platform Life Cycle Policy.

NVIDIA Run:ai Namespace

The NVIDIA Run:ai control plane uses a namespace or project (OpenShift) called runai-backend. Depending on your platform, use one of the following to create the namespace/project:

kubectl create namespace runai-backend
oc new-project runai-backend

Default Storage Class

Note

Default storage class applies for Kubernetes only.

The NVIDIA Run:ai control plane requires a default storage class to create persistent volume claims for NVIDIA Run:ai storage. The storage class, as per Kubernetes standards, controls the reclaim behavior, whether the NVIDIA Run:ai persistent data is saved or deleted when the NVIDIA Run:ai control plane is deleted.

Note

For a simple (non-production) storage class example see Kubernetes Local Storage Class. The storage class will set the directory /opt/local-path-provisioner to be used across all nodes as the path for provisioning persistent volumes. Then set the new storage class as default:

kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Kubernetes Ingress Controller

Note

Installing ingress controller applies for Kubernetes only.

The NVIDIA Run:ai control plane requires Kubernetes Ingress Controller to be installed.

  • OpenShift, RKE and RKE2 come with a pre-installed ingress controller.

  • Internal tests are being performed on NGINX, Rancher NGINX, OpenShift Router, and Istio.

  • Make sure that a default ingress controller is set.

There are many ways to install and configure different ingress controllers. The following shows a simple example to install and configure NGINX ingress controller using helm:

Vanilla Kubernetes

Run the following commands:

  • For cloud deployments, both the internal IP and external IP are required.

  • For on-prem deployments, only the external IP is needed.

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
    --namespace nginx-ingress --create-namespace \
    --set controller.kind=DaemonSet \
    --set controller.service.externalIPs="{<INTERNAL-IP>,<EXTERNAL-IP>}" # Replace <INTERNAL-IP> and <EXTERNAL-IP> with the internal and external IP addresses of one of the nodes
Managed Kubernetes (EKS, GKE, AKS)

Run the following commands:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
    --namespace nginx-ingress --create-namespace
Oracle Kubernetes Engine (OKE)

Run the following commands:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
    --namespace ingress-nginx --create-namespace \
    --set controller.service.annotations.oci.oraclecloud.com/load-balancer-type=nlb \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/is-preserve-source=True \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/security-list-management-mode=None \
    --set controller.service.externalTrafficPolicy=Local \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/subnet=<SUBNET-ID> # Replace <SUBNET-ID> with the subnet ID of one of your cluster

Fully Qualified Domain Name (FQDN)

Note

Fully Qualified Domain Name applies for Kubernetes only.

You must have a Fully Qualified Domain Name (FQDN) to install NVIDIA Run:ai control plane (ex: runai.mycorp.local). This cannot be an IP. The FQDN must be resolvable within the organization's private network.

TLS Certificate

Kubernetes

You must have a TLS certificate that is associated with the FQDN for HTTPS access. Create a Kubernetes Secret named runai-backend-tls in the runai-backend namespace and include the path to the TLS --cert and its corresponding private --key by running the following:

kubectl create secret tls runai-backend-tls -n runai-backend \
  --cert /path/to/fullchain.pem  \ # Replace /path/to/fullchain.pem with the actual path to your TLS certificate 
  --key /path/to/private.pem # Replace /path/to/private.pem with the actual path to your private key

OpenShift

NVIDIA Run:ai uses the OpenShift default Ingress router for serving. The TLS certificate configured for this router must be issued by a trusted CA. For more details, see the OpenShift documentation on configuring certificates.

Local Certificate Authority

A local certificate authority serves as the root certificate for organizations that cannot use a publicly trusted certificate authority but still require external connections or standard HTTPS authentication. Follow the steps below to configure the local certificate authority.

For air-gapped environments, you must configure the public key of your local certificate authority. It must be installed in Kubernetes for the installation to succeed:

  1. Add the public key to the runai-backend namespace:

kubectl -n runai-backend create secret generic runai-ca-cert \ 
    --from-file=runai-ca.pem=<ca_bundle_path>
oc -n runai-backend create secret generic runai-ca-cert \ 
    --from-file=runai-ca.pem=<ca_bundle_path>
  2. When installing the control plane, make sure the following flag is added to the helm command --set global.customCA.enabled=true. See Install control plane.

External Postgres Database (Optional)

The NVIDIA Run:ai control plane installation includes a default PostgreSQL database. However, you may opt to use an existing PostgreSQL database if you have specific requirements or preferences as detailed in External Postgres database configuration. Please ensure that your PostgreSQL database is version 16 or higher.

Using GB200 NVL72 and Multi-Node NVLink Domains

Multi-Node NVLink (MNNVL) systems, including NVIDIA GB200, NVIDIA GB200 NVL72 and its derivatives are fully supported by the NVIDIA Run:ai platform.

Kubernetes does not natively recognize NVIDIA’s MNNVL architecture, which makes managing and scheduling workloads across these high-performance domains more complex. The NVIDIA Run:ai platform simplifies this by abstracting the complexity of MNNVL configuration. Without this abstraction, optimal performance on a GB200 NVL72 system would require deep knowledge of NVLink domains, their hardware dependencies, and manual configuration for each distributed workload. NVIDIA Run:ai automates these steps, ensuring high performance with minimal effort. While GB200 NVL72 supports all workload types, distributed training workloads benefit most from its accelerated GPU networking capabilities.

To learn more about GB200, MNNVL and related NVIDIA technologies, refer to the following:

  • NVIDIA GB200 NVL72

  • NVIDIA Blackwell datasheet

  • NVIDIA Multi-Node NVLink Systems

Benefits of Using GB200 NVL72 with NVIDIA Run:ai

The NVIDIA Run:ai platform enables administrators, researchers, and MLOps engineers to fully leverage GB200 NVL72 systems and other NVLink-based domains without requiring deep knowledge of hardware configurations or NVLink topologies. Key capabilities include:

  • Automatic detection and labeling

    • Detects GB200 NVL72 nodes and identifies MNNVL domains (e.g., GB200 NVL72 racks).

    • Automatically detects whether a node pool contains GB200 NVL72.

    • Supports manual override of GB200 MNNVL detection and label key for future compatibility and improved resiliency.

  • Simplified distributed workload submission

    • Allows seamless submission of distributed workloads into GB200-based node pools, eliminating all the complexities involved with that operation on top of GB200 MNNVL domains.

    • Abstracts away the complexity of configuring workloads for NVL domains.

  • Flexible support for NVLink domain variants

    • Compatible with current and future NVL domain configurations.

    • Supports any number of domains or GB200 racks.

  • Enhanced monitoring and visibility

    • Provides detailed NVIDIA Run:ai dashboards for monitoring GB200 nodes and MNNVL domains by node pool.

  • Control and customization

    • Offers manual override and label configuration for greater resiliency and future-proofing.

    • Enables advanced users to fine-tune GB200 scheduling behavior based on workload requirements.

Prerequisites

  • Ensure that NVIDIA's GPU Operator version 25.3 or higher is installed: GPU Operator v25.3 Release Notes. This version must include the associated Dynamic Resource Allocation (DRA) driver, which provides support for GB200 accelerated networking resources and the ComputeDomain feature. For detailed steps on installing the DRA driver and configuring ComputeDomain, refer to the documentation for your installed GPU Operator version.

  • After the DRA driver is installed, update runaiconfig using the spec.workload-controller.GPUNetworkAccelerationEnabled=True flag to enable GPU network acceleration. This triggers an update of the NVIDIA Run:ai workload-controller deployment and restarts the controller. See Advanced cluster configurations for more details.
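
A minimal sketch of this update, assuming the runaiconfig resource uses the default name (runai) and namespace (runai) of a standard cluster installation:

kubectl patch runaiconfig runai -n runai --type merge \
  -p '{"spec": {"workload-controller": {"GPUNetworkAccelerationEnabled": true}}}'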

Configuring and Managing GB200 NVL72 Domains

Administrators must define dedicated node pools that align with GB200 NVL72 rack topologies. These node pools ensure that workloads are isolated to nodes with NVLink interconnects and are not scheduled on incompatible hardware. Each node pool can be manually configured in the NVIDIA Run:ai platform and associated with specific node labels. Two key configurations are required for each node pool:

  • Node Labels – Identify nodes equipped with GB200.

  • MNNVL Domain Discovery – Specify how the platform detects whether the node pool includes NVLink-connected nodes.

To create a node pool with GPU network acceleration, see Node pools.

Identifying GB200 Nodes

To enable the NVIDIA Run:ai Scheduler to recognize GB200-based nodes, administrators must:

  • Use the default node label provided by the NVIDIA GPU Operator - nvidia.com/gpu.clique.

  • Or, apply a custom label that clearly marks the node as GB200/MNNVL capable.

This node label serves as the basis for identifying appropriate nodes and ensuring workloads are scheduled on the correct hardware.

Enabling MNNVL Domain Discovery

The administrator can configure how the NVIDIA Run:ai platform detects MNNVL domains for each node pool. The available options include:

  • Automatic Discovery – Uses the default label key nvidia.com/gpu.clique, or a custom label key specified by the administrator. The NVIDIA Run:ai platform automatically discovers MNNVL domains within node pools. If a node is labeled with the MNNVL label key, the NVIDIA Run:ai platform indicates this node pool as MNNVL detected. MNNVL detected node pools are treated differently by the NVIDIA Run:ai platform when submitting a distributed training workload.

  • Manual Discovery – The platform does not evaluate any node labels. Detection is based solely on the administrator’s configuration of the node pool as MNNVL “Detected” or “Not Detected.”

When automatic discovery is enabled, all GB200 nodes that are part of the same physical rack (NVL72 or other future topologies) belong to the same NVL domain and are automatically labeled by the GPU Operator with a common label key and a unique label value per domain and sub-domain. The default label key set by the NVIDIA GPU Operator is nvidia.com/gpu.clique and its value has the format <NVL Domain ID (ClusterUUID)>.<Clique ID>:

  • The NVL Domain ID (ClusterUUID) is a unique identifier that represents the physical NVL domain, for example, a physical GB200 NVL72 rack.

  • The Clique ID denotes a logical MNNVL sub-domain. A clique represents a further logical split of the MNNVL into smaller domains that enable secure, fast, and isolated communication between pods running on different GB200 nodes within the same GB200 NVL72.

The Nodes table provides more information on which GB200 NVL72 domain each node belongs to, and which Clique ID it is associated with.
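For example, you can inspect this label directly on the nodes with kubectl (a minimal sketch, assuming the default nvidia.com/gpu.clique label key):

# List nodes together with their NVL domain / clique assignment;
# the printed value has the form <NVL Domain ID (ClusterUUID)>.<Clique ID>
kubectl get nodes -L nvidia.com/gpu.clique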

Submitting Distributed Training Workloads

When a distributed training workload is submitted to an MNNVL-detected node pool, the NVIDIA Run:ai platform automates several key configuration steps to ensure optimal workload execution:

  • ComputeDomain creation - The NVIDIA Run:ai platform creates a ComputeDomain custom resource (based on the DRA driver's ComputeDomain CRD), a dedicated resource used to manage NVLink-based domain assignments.

  • Resource Claim injection - A reference to the ComputeDomain is automatically added to the workload specification as a resource claim, allowing the Scheduler to link the workload to a specific NVLink domain.

  • Pod affinity configuration - Pod affinity is applied using a Preferred policy with the MNNVL label key (e.g., nvidia.com/gpu.clique) as the topology key. This ensures that pods within the distributed workload are located on nodes with NVLink interconnects.

  • Node affinity configuration - Node affinity is also applied using a Preferred policy based on the same label key, further guiding the Scheduler to place workloads within the correct node group.

These additional steps are crucial for the creation of underlying HW resources (also known as IMEX channels) and stickiness of the distributed workload to MNNVL topologies and nodes. When a distributed workload is stopped or evicted, the platform automatically removes the corresponding ComputeDomain.

Best Practices for MNNVL Node Pool Management

  • When submitting a distributed workload, you should explicitly specify a list of one or more MNNVL detected node pools, or a list of one or more non-MNNVL detected node pools. A mix of MNNVL detected and non-MNNVL detected node pools is not supported. A GB200 MNNVL node pool is a pool that contains at least one node belonging to an MNNVL domain.

  • Other workload types (not distributed) can include a list of mixed MNNVL and non-MNNVL node pools, from which the Scheduler will choose.

  • MNNVL node pools can include any size of MNNVL domains (i.e. NVL72 and any future domain size) and support any Grace-Blackwell models (GB200 and any future models).

  • To support the submission of larger distributed workloads, it is recommended to group as many GB200 racks as possible into fewer node pools. When possible, use a single GB200 node pool, unless there is a specific operational reason to divide resources across multiple node pools.

  • When submitting distributed training workloads with the controller pod set as a distinct non-GPU workload, the MNNVL feature should be used with the default Preferred mode as explained in the below section.

Fine-tuning Scheduling Behavior for MNNVL

You can influence how the Scheduler places distributed training workloads into GB200 MNNVL node pools using the Topology field available in the distributed training workload submission form.

Note

The following options are based on inter-pod affinity rules, which define how pods are grouped based on topology.

  • Confine a workload to a single GB200 MNNVL domain - To ensure the workload is scheduled within a single GB200 MNNVL domain (e.g., a GB200 NVL72 rack), apply a topology label with a Required policy using the MNNVL label key (nvidia.com/gpu.clique). This instructs the Scheduler to strictly place all pods within the same MNNVL domain. If the workload exceeds 18 pods (or 72 GPUs), the Scheduler will not be able to find a matching domain and will fail to schedule the workload.

  • Try to schedule a workload using a Preferred topology - To guide the Scheduler to prioritize a specific topology without enforcing it, apply a topology label with a policy of Preferred. You can apply any topology label with a Preferred policy. These labels are treated with higher scheduling weight than the default Preferred pod affinity automatically applied by NVIDIA Run:ai for MNNVL.

  • Mandate a custom topology - To force scheduling a workload into a custom topology, add a topology label with a policy of Required. This ensures the workload is strictly scheduled according to the specified topology. Keep in mind that using a Required policy can significantly constrain scheduling. If matching resources are not available, the Scheduler may fail to place the workload.

Fine-tuning MNNVL per Workload

You can customize how the NVIDIA Run:ai platform applies the MNNVL feature to each distributed training workload. This allows you to override the default behavior when needed. To configure this behavior, set the proprietary label key run.ai/MNNVL in the General settings section of the distributed training workload submission form. The following values are supported:

  • None - Disables the MNNVL feature for the workload. The platform does not create a ComputeDomain and no pod affinity or node affinity is applied by default.

  • Preferred (default) - Indicates that MNNVL feature is preferred but not required. This is the default behavior when submitting a distributed training workload:

    • If the workload is submitted to a 'non-MNNVL detected' node pool, then the NVIDIA Run:ai platform does not add a ComputeDomain, ComputeDomain claim, pod affinity or node affinity for MNNVL nodes.

    • Otherwise, if the workload is submitted to a 'MNNVL detected' node pool, then the NVIDIA Run:ai platform automatically adds: ComputeDomain, ComputeDomain claim, NodeAffinity and PodAffinity both with a Preferred policy and using the MNNVL label.

    • If you manually add an additional Preferred topology label, it will be given higher scheduling weight than the default embedded pod affinity (which has weight = 1).

  • Required - Enforces a strict use of MNNVL domains for the workload. The workload must be scheduled on MNNVL supported nodes:

    • The NVIDIA Run:ai platform creates a ComputeDomain and ComputeDomain claim.

    • The NVIDIA Run:ai platform will automatically add a node affinity rule with a Required policy using the appropriate label.

    • Pod affinity is set to Preferred by default, but you can override it manually with a Required pod affinity rule using the MNNVL label key or another custom label.

    • If any of the targeted node pools do not support MNNVL or if the workload (or any of its pods) does not request GPU resources, the workload will fail to run.

Known Limitations and Compatibility

  • If the DRA driver is not installed correctly in the cluster, particularly if the required CRDs are missing, and the MNNVL feature is enabled in the NVIDIA Run:ai platform, the workload controller will enter a crash loop. This will continue until the DRA driver is properly installed with all necessary CRDs or the MNNVL feature is disabled in the NVIDIA Run:ai platform.

  • To run workloads on a GB200 node pool (i.e., a node pool detected as MNNVL-enabled), the workload must explicitly request that node pool. To prevent unintentional use of MNNVL-detected node pools, administrators must ensure these node pools are not included in any project's default list of node pools.

  • Only one distributed training workload per node can use GB200 accelerated networking resources. If GPUs remain unused on that node, other workload types may still utilize them.

  • If a GB200 node fails, any associated pod will be re-scheduled, causing the entire distributed workload to fail and restart. On non-GB200 nodes, this scenario may be self-healed by the Scheduler without impacting the entire workload.

  • If a pod from a distributed training workload fails or is evicted by the Scheduler, it must be re-scheduled on the same node. Otherwise, the entire workload will be evicted and, in some cases, re-queued.

  • Elastic distributed training workloads are not supported with MNNVL.

  • Workloads created in versions earlier than 2.21 do not include GB200 MNNVL node pools and are therefore not expected to experience compatibility issues.

  • If a node pool that was previously used in a workload submission is later updated to include GB200 nodes (i.e., becomes a mixed node pool), the workload submitted before version 2.21 will not use any accelerated networking resources, although it may still run on GB200 nodes.

Introduction to Workloads

NVIDIA Run:ai enhances visibility and simplifies management by monitoring, presenting and orchestrating all AI workloads in the clusters where it is installed. Workloads are the fundamental building blocks for consuming resources, enabling AI practitioners such as researchers, data scientists and engineers to efficiently support the entire life cycle of an AI initiative.

Workloads Across the AI Life Cycle

A typical AI initiative progresses through several key stages, each with distinct workloads and objectives. With NVIDIA Run:ai, research and engineering teams can host and manage all these workloads to achieve the following:

  • Data preparation: Aggregating, cleaning, normalizing, and labeling data to prepare for training.

  • Training: Conducting resource-intensive model development and iterative performance optimization.

  • Fine-tuning: Adapting pre-trained models to domain-specific datasets while balancing efficiency and performance.

  • Inference: Deploying models for real-time or batch predictions with a focus on low latency and high throughput.

  • Monitoring and optimization: Ensuring ongoing performance by addressing data drift, usage patterns, and retraining as needed.

What is a Workload?

A workload runs in the cluster, is associated with a namespace, and operates to fulfill its targets, whether that is running to completion for a batch job, allocating resources for experimentation in an integrated development environment (IDE)/notebook, or serving inference requests in production.

The workload, defined by the AI practitioner, consists of:

  • Container images: This includes the application, its dependencies, and the runtime environment.

  • Compute resources: CPU, GPU, and RAM to execute efficiently and address the workload’s needs.

  • Data & storage configuration: The data needed for processing such as training and testing datasets or input from external databases, and the storage configuration which refers to the way this data is managed, stored and accessed.

  • Credentials: The access to certain data sources or external services, ensuring proper authentication and authorization.

Workload Scheduling and Orchestration

NVIDIA Run:ai’s core mission is to optimize AI resource usage at scale. This is achieved through efficient scheduling and orchestration of all cluster workloads using the NVIDIA Run:ai Scheduler. The Scheduler allows the prioritization of workloads across different departments and projects within the organization at large scale, based on the resource distribution set by the system administrator.

NVIDIA Run:ai and Third-Party Workloads

  • NVIDIA Run:ai workloads: These workloads are submitted via the NVIDIA Run:ai platform. They are represented by Kubernetes Custom Resource Definitions (CRDs) and APIs. When using NVIDIA Run:ai workloads, a complete workload and scheduling policy solution is offered, allowing administrators to ensure that optimization, governance and security standards are applied.

  • Third-party workloads: These workloads are submitted via third-party applications that use the NVIDIA Run:ai Scheduler. The NVIDIA Run:ai platform manages and monitors these workloads. They enable seamless integrations with external tools, allowing teams and individuals flexibility.

Levels of Support

Different workload types have different levels of support. It is important to understand which capabilities you need before selecting a workload type. The table below details the level of support for each workload type in NVIDIA Run:ai. NVIDIA Run:ai workloads are fully supported with all of NVIDIA Run:ai’s advanced features and capabilities, while third-party workloads are only partially supported. The list of capabilities can change between NVIDIA Run:ai versions.

| Functionality | NVIDIA Run:ai Workspace | NVIDIA Run:ai Training - Standard | NVIDIA Run:ai Training - distributed | NVIDIA Run:ai Inference | Third-party workloads |
|---|---|---|---|---|---|
| Fairness | v | v | v | v | v |
| Priority and preemption | v | v | v | v | v |
| Over quota | v | v | v | v | v |
| Node pools | v | v | v | v | v |
| Bin packing / Spread | v | v | v | v | v |
| Multi-GPU fractions | v | v | v | v | v |
| Multi-GPU dynamic fractions | v | v | v | v | v |
| Node level scheduler | v | v | v | v | v |
| Multi-GPU memory swap | v | v | v | v | v |
| Elastic scaling | NA | NA | v | v | v |
| Gang scheduling | v | v | v | v | v |
| Monitoring | v | v | v | v | v |
| Workload awareness | v | v | v | v | v |
| RBAC | v | v | v | v | v |
| Workload submission | v | v | v | v | |
| Workload actions (stop/run) | v | v | v | | |
| Rolling updates | | | | v | |
| Workload Policies | v | v | v | v | |
| Scheduling rules | v | v | v | v | |

Workload awareness refers to specific workload-aware visibility, so that different pods are identified and treated as a single workload (for example GPU utilization, workload view, dashboards).

Compute Resources

This article explains what compute resources are and how to create and use them.

Compute resources are one type of workload asset. A compute resource is a template that simplifies how workloads are submitted and can be used by AI practitioners when they submit their workloads.

A compute resource asset is a preconfigured building block that encapsulates all the specifications of compute requirements for the workload including:

  • GPU devices and GPU memory

  • CPU memory and CPU compute

Compute Resource Table

The Compute resource table can be found under Workload manager in the NVIDIA Run:ai UI.

The Compute resource table provides a list of all the compute resources defined in the platform and allows you to manage them.

The Compute resource table consists of the following columns:

  • Compute resource - The name of the compute resource

  • Description - A description of the essence of the compute resource

  • GPU devices request per pod - The number of requested physical devices per pod of the workload that uses this compute resource

  • GPU memory request per device - The amount of GPU memory per requested device that is granted to each pod of the workload that uses this compute resource

  • CPU memory request - The minimum amount of CPU memory per pod of the workload that uses this compute resource

  • CPU memory limit - The maximum amount of CPU memory per pod of the workload that uses this compute resource

  • CPU compute request - The minimum number of CPU cores per pod of the workload that uses this compute resource

  • CPU compute limit - The maximum number of CPU cores per pod of the workload that uses this compute resource

  • Scope - The scope of this compute resource within the organizational tree. Click the name of the scope to view the organizational tree diagram

  • Workload(s) - The list of workloads associated with the compute resource

  • Template(s) - The list of workload templates that use this compute resource

  • Created by - The name of the user who created the compute resource

  • Creation time - The timestamp of when the compute resource was created

  • Last updated - The timestamp of when the compute resource was last updated

  • Cluster - The cluster that the compute resource is associated with

Workloads Associated with the Compute Resource

Click one of the values in the Workload(s) column to view the list of workloads and their parameters.

  • Workload - The workload that uses the compute resource

  • Type - Workspace/Training/Inference

  • Status - Represents the workload lifecycle. See the full list of workload statuses.

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

Adding a New Compute Resource

To add a new compute resource:

  1. Go to the Compute resource table

  2. Click +NEW COMPUTE RESOURCE

  3. Select under which cluster to create the compute resource

  4. Select a scope

  5. Enter a name for the compute resource. The name must be unique.

  6. Optional: Provide a description of the essence of the compute resource

  7. Set the resource types needed within a single node (the NVIDIA Run:ai scheduler tries to match a single node that complies with the compute resource for each of the workload’s pods)

    • GPU

      • GPU devices per pod - The number of devices (physical GPUs) per pod (for example, if you requested 3 devices per pod and the running workload using this compute resource consists of 3 pods, there are 9 physical GPU devices used in total)

        Note

        • When set to zero, the workload using this compute resource neither requests nor uses GPU resources while running

        • You can set any number of GPU devices, and specify the GPU memory requirement per device either as a portion of a device (1-100%) or as an explicit memory size in GB or MB

  • GPU memory per device

    • Select the memory request format

      • % (of device) - Fraction of a GPU device’s memory

      • MB (memory size) - An explicit GPU memory unit

      • GB (memory size) - An explicit GPU memory unit

        • Set the memory Request - The minimum amount of GPU memory that is provisioned per device. This means that any pod of a running workload that uses this compute resource, receives this amount of GPU memory for each device(s) the pod utilizes

        • Optional: Set the memory Limit - The maximum amount of GPU memory that is provisioned per device. This means that any pod of a running workload that uses this compute resource, receives at most this amount of GPU memory for each device(s) the pod utilizes. To set a Limit, first enable the limit toggle. The limit value must be equal to or higher than the request.

Note

  • GPU memory limit is disabled by default. If you cannot see the Limit toggle in the compute resource form, then it must be enabled by your Administrator, under General settings → Resources → GPU resource optimization.

  • When a Limit is set and is bigger than the Request, the scheduler allows each pod to reach the maximum amount of GPU memory in an opportunistic manner (only upon availability).

  • If the GPU memory Limit is bigger than the Request, the pod is prone to be killed by the NVIDIA Run:ai toolkit (out-of-memory signal). The greater the difference between the GPU memory used and the Request, the higher the risk of being killed.

  • If GPU resource optimization is turned off, the minimum and maximum are in fact equal.

  • CPU

    • CPU compute per pod

      • Select the units for the CPU compute (Cores / Millicores)

      • Set the CPU compute Request - the minimum amount of CPU compute that is provisioned per pod. This means that any pod of a running workload that uses this compute resource, receives this amount of CPU compute for each pod.

      • Optional: Set the CPU compute Limit - The maximum amount of CPU compute that is provisioned per pod. This means that any pod of a running workload that uses this compute resource, receives at most this amount of CPU compute. To set a Limit, first enable the limit toggle. The limit value must be equal to or higher than the request. By default, the limit is set to “Unlimited” - which means that the pod may consume all the node's free CPU compute resources.

    • CPU memory per pod

      • Select the units for the CPU memory (MB / GB)

      • Set the CPU memory Request - The minimum amount of CPU memory that is provisioned per pod. This means that any pod of a running workload that uses this compute resource, receives this amount of CPU memory for each pod.

      • Optional: Set the CPU memory Limit - The maximum amount of CPU memory that is provisioned per pod. This means that any pod of a running workload that uses this compute resource, receives at most this amount of CPU memory. To set a Limit, first enable the limit toggle. The limit value must be equal to or higher than the request. By default, the limit is set to “Unlimited” - Meaning that the pod may consume all the node's free CPU memory resources.

Note

If the CPU memory Limit is bigger than the Request, the pod is prone to be killed by the operating system (out-of-memory signal). The greater the difference between the CPU memory used and the Request, the higher the risk of being killed.

  8. Optional: More settings

  • Increase shared memory size - When enabled, the shared memory size available to the pod is increased from the default 64MB to the node's total available memory, or to the CPU memory limit if set above.

  • Set extended resource(s) - Click +EXTENDED RESOURCES to add resource/quantity pairs. For more information on how to set extended resources, see the Extended resources and Quantity guides.

  9. Click CREATE COMPUTE RESOURCE

Note

It is also possible to add compute resources directly when creating a specific workspace, training or inference workload.

Editing a Compute Resource

To edit a compute resource:

  1. Select the compute resource you want to edit

  2. Click Edit

  3. Update the compute resource and click SAVE COMPUTE RESOURCE

Note

Workloads that are already bound to this asset are not affected.

Copying a Compute Resource

To copy an existing compute resource:

  1. Select the compute resource you want to copy

  2. Click MAKE A COPY

  3. Enter a name for the compute resource. The name must be unique.

  4. Update the compute resource and click CREATE COMPUTE RESOURCE

Deleting a Compute Resource

  1. Select the compute resource you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm

Note

Workloads that are already bound to this asset are not affected.

Using API

Go to the API reference to view the available actions


Policy YAML Examples

This article provides examples of:

  1. Creating a new rule within a policy

  2. Best practices for adding sections to a policy

  3. A full example of a whole policy

Creating a New Rule Within a Policy

This example shows how to add a new limitation to the GPU usage for workloads of type workspace:

  1. Check the workload API fields documentation and select the field(s) that are most relevant for GPU usage.

  2. Search for the field in the Policy YAML fields reference table. For example, gpuDevicesRequest appears under the Compute fields sub-table as follows:

    • gpuDevicesRequest - Specifies the number of GPUs to allocate for the created workload. Only if gpuDevicesRequest = 1, the gpuRequestType can be defined. Value type: integer. Supported NVIDIA Run:ai workload type: Workspace & Training.
  3. Use the value type of the gpuDevicesRequest field indicated in the table - “integer” - and navigate to the Value types table to view the possible rules that can be applied to this value type:

    for integer, the options are:

    • canEdit

    • required

    • min

    • max

    • step

  4. Proceed to the Rule Type table, select the required rule for the limitation of the field - for example, max - and use the example syntax to indicate the maximum number of GPU devices requested (for instance, max: 2 under compute.gpuDevicesRequest).

Policy YAML Best Practices

Create a policy that has multiple defaults and rules

Best practices description

Presentation of the syntax while adding a set of defaults and rules

Example

Allow only single selection out of many

Best practices description

Blocking the option to create all types of data sources except the one that is allowed is the solution

Example

Create a robust set of guidelines

Best practices description

Set rules for specific compute resource usage, addressing most relevant spec fields

Example

Policy for distributed training workloads

Best practices description

Set rules and defaults for a distributed training workload with different setting for master and workers

Example

Examples for specific sections in the policy

Best practices description

Environment creation

Example

Best practices description

Setting security measures

Example

Best practices description

Impose an asset

Example

Example of a Whole Policy

Launching Workloads with Dynamic GPU Fractions

This quick start provides a step-by-step walkthrough for running a Jupyter Notebook with dynamic GPU fractions.

NVIDIA Run:ai’s dynamic GPU fractions optimizes GPU utilization by enabling workloads to dynamically adjust their resource usage. It allows users to specify a guaranteed fraction of GPU memory and compute resources with a higher limit that can be dynamically utilized when additional resources are requested.

Note

If enabled by your Administrator, the NVIDIA Run:ai UI allows you to create a new workload using either the Original or Flexible workload submission form. The steps in this quick start guide reflect the Original form only.

Prerequisites

Before you start, make sure:

  • You have created a project or have one created for you.

  • The project has an assigned quota of at least 0.5 GPU.

  • Dynamic GPU fractions is enabled.

Note

Dynamic GPU fractions is disabled by default in the NVIDIA Run:ai UI. To use dynamic GPU fractions, it must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.

Step 1: Logging In

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Run the below --help command to obtain the login options and log in according to your setup:
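A minimal sketch, assuming the NVIDIA Run:ai CLI is installed as runai:

# Show the available login options, then log in according to your setup
runai login --help
runai login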

To use the API, you will first need to obtain an API access token.

Step 2: Submitting the First Workspace

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Workspace

  3. Select under which cluster to create the workload

  4. Select the project in which your workspace will run

  5. Select a preconfigured template or select Start from scratch to launch a new workspace quickly

  6. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Create an environment for your workspace

    • Click +NEW ENVIRONMENT

    • Enter a name for the environment. The name must be unique.

    • Enter the Image URL - gcr.io/run-ai-lab/pytorch-example-jupyter

    • Tools - Set the connection for your tool

      • Click +TOOL

      • Select Jupyter tool from the list

    • Set the runtime settings for the environment

      • Click +COMMAND

      • Enter command - start-notebook.sh

      • Enter arguments - --NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''

      Note: If host-based routing is enabled on the cluster, enter the --NotebookApp.token='' argument only.

    • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  9. Create a new “request-limit” compute resource

    • Click +NEW COMPUTE RESOURCE

    • Enter a name for the compute resource. The name must be unique.

    • Set GPU devices per pod - 1

    • Set GPU memory per device

      • Select GB (memory size) - An explicit GPU memory unit

      • Set the memory Request - 4GB (the workload will allocate 4GB of the GPU memory)

      • Toggle Limit and set it to 12 GB

    • Optional: set the CPU compute per pod - 0.1 cores (default)

    • Optional: set the CPU memory per pod - 100 MB (default)

    • Select More settings and toggle Increase shared memory size

    • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  10. Click CREATE WORKSPACE

Copy the following command to your terminal. Make sure to update it with the name of your project and workload:

Copy the following command to your terminal. Make sure to update the parameters listed below; a rough sketch of the request is shown after the parameter list:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1: Logging In.

  • <PROJECT-ID> - The ID of the project the workload is running on. You can get the project ID via the API.

  • <CLUSTER-UUID> - The unique identifier of the cluster. You can get the cluster UUID via the API.

  • toolType will show the Jupyter icon when connecting to the Jupyter tool via the user interface.

  • toolName will show when connecting to the Jupyter tool via the user interface.
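The original API snippet is not reproduced here. As a rough, hedged sketch only, a request of roughly this shape can be sent to the Workspaces API; the endpoint path, the name/projectId/clusterId top-level fields, and the exposedUrls block (with toolType and toolName) are assumptions to verify against the API reference, while the compute field names (gpuDevicesRequest, gpuMemoryRequest, gpuMemoryLimit, and so on) follow the workload compute spec shown in this document:

# Rough sketch only - verify the endpoint and field names against the API reference.
# Replace <COMPANY-URL>, <TOKEN>, <PROJECT-ID> and <CLUSTER-UUID> with the values described above.
curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \
  -H 'Authorization: Bearer <TOKEN>' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "dynamic-fractions-workspace",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
      "image": "gcr.io/run-ai-lab/pytorch-example-jupyter",
      "compute": {
        "gpuDevicesRequest": 1,
        "gpuRequestType": "memory",
        "gpuMemoryRequest": "4G",
        "gpuMemoryLimit": "12G",
        "cpuCoreRequest": 0.1,
        "cpuMemoryRequest": "100M",
        "largeShmRequest": true
      },
      "exposedUrls": [
        { "toolType": "jupyter-notebook", "toolName": "Jupyter" }
      ]
    }
  }'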

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 3: Submitting the Second Workspace

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Workspace

  3. Select the cluster where the previous workspace was created

  4. Select the project where the previous workspace was created

  5. Select a preconfigured template or select Start from scratch to launch a new workspace quickly

  6. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Select the environment created in Step 2: Submitting the First Workspace

  9. Select the compute resource created in Step 2: Submitting the First Workspace

  10. Click CREATE WORKSPACE

Copy the following command to your terminal. Make sure to update it with the name of your project and workload:

Copy the following command to your terminal. Make sure to update the following parameters:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1: Logging In.

  • <PROJECT-ID> - The ID of the project the workload is running on. You can get the project ID via the API.

  • <CLUSTER-UUID> - The unique identifier of the cluster. You can get the cluster UUID via the API.

  • toolType will show the Jupyter icon when connecting to the Jupyter tool via the user interface.

  • toolName will show when connecting to the Jupyter tool via the user interface.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 4: Connecting to the Jupyter Notebook

  1. Select the newly created workspace with the Jupyter application that you want to connect to

  2. Click CONNECT

  3. Select the Jupyter tool. The selected tool is opened in a new tab on your browser.

  4. Open a terminal and use the watch nvidia-smi command to get a constant reading of the memory consumed by the pod. Note that the number shown in the memory box is the Limit, not the Request or guarantee.

  5. Open the file Untitled.ipynb and move the frame so you can see both tabs

  6. Execute both cells in Untitled.ipynb. This will consume about 3 GB of GPU memory and be well below the 4GB of the GPU Memory Request value.

  7. In the second cell, edit the value after --image-size from 100 to 200 and run the cell. This will increase the GPU memory utilization to about 11.5 GB which is above the Request value.

  1. To connect to the Jupyter Notebook, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>

  2. Open a terminal and use the watch nvidia-smi command to get a constant reading of the memory consumed by the pod. Note that the number shown in the memory box is the Limit, not the Request or guarantee.

  3. Open the file Untitled.ipynb and move the frame so you can see both tabs

  4. Execute both cells in Untitled.ipynb. This will consume about 3 GB of GPU memory and be well below the 4GB of the GPU Memory Request value.

  5. In the second cell, edit the value after --image-size from 100 to 200 and run the cell. This will increase the GPU memory utilization to about 11.5 GB which is above the Request value.


Next Steps

Manage and monitor your newly created workload using the Workloads table.

Departments

This section explains the procedure for managing departments

Departments are a grouping of projects. By grouping projects into a department, you can set quota limitations for a set of projects, create policies that are applied to the department, and create assets that can be scoped to the whole department or to a partial group of descendant projects.

For example, in an academic environment, a department can be the Physics Department grouping various projects (AI Initiatives) within the department, or grouping projects where each project represents a single student.

Departments Table

The Departments table can be found under Organization in the NVIDIA Run:ai platform.

Note

Departments are disabled by default. If you cannot see Departments in the menu, it must be enabled by your Administrator, under General settings → Resources → Departments.

The Departments table lists all departments defined for a specific cluster and allows you to manage them. You can switch between clusters by selecting your cluster using the filter at the top.

The Departments table consists of the following columns:

Column
Description

Node Pools with Quota Associated with the Department

Click one of the values in the Node pool(s) with quota column to view the list of node pools and their parameters.

Column
Description

Subjects Authorized for the Department

Click one of the values in the Subject(s) column to view the list of subjects and their parameters. This column is only viewable if your role in the NVIDIA Run:ai system affords you those permissions.

Column
Description

Note

A role given in a certain scope, means the role applies to this scope and any descendant scopes in the organizational tree.

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

Adding a New Department

To create a new Department:

  1. Click +NEW DEPARTMENT

  2. Select a scope. By default, the field contains the scope of the current UI context cluster, viewable at the top left side of your screen. You can change the current UI context cluster by clicking the ‘Cluster: cluster-name’ field and applying another cluster as the UI context. Alternatively, you can choose another cluster within the ‘+ New Department’ form by clicking the organizational tree icon on the right side of the scope field, opening the organizational tree and selecting one of the available clusters.

  3. Enter a name for the department. Department names must start with a letter and can only contain lowercase Latin letters, numbers, or a hyphen ('-').

  4. In the Quota management section, you can set the quota parameters and prioritize resources

    • Order of priority - This column is displayed only if more than one node pool exists. The node pool order of priority in Departments/Quota management sets the default node pool order of priority for newly created projects under that department; the Administrator can then change the order per project. The order of priority sets the order in which the Scheduler uses node pools to schedule a workload, and it is effective for projects and their associated workloads. The Scheduler first tries to allocate resources using the highest priority node pool, then the next in priority, until it reaches the lowest priority node pool in the list, and then starts from the highest again. The Scheduler uses the project's list of prioritized node pools only if the order of priority of node pools is not set in the workload during submission, either by an admin policy or by the user. An empty value means the node pool is not part of the department's default node pool priority list inherited by newly created projects, but a node pool can still be chosen by an admin policy or a user during workload submission.

    • Node pool - This column is displayed only if more than one node pool exists. It represents the name of the node pool.

    • Under the QUOTA tab:

      • Over-quota state - Indicates if over-quota is enabled or disabled as set in the SCHEDULING PREFERENCES tab. If over-quota is set to None, then it is disabled.

      • GPU devices - The number of GPUs you want to allocate for this department in this node pool (decimal number).

      • CPUs (Cores) - This column is displayed only if CPU quota is enabled via the General settings. Represents the number of CPU cores you want to allocate for this department in this node pool (decimal number).

      • CPU memory - This column is displayed only if CPU quota is enabled via the General settings. Represents the amount of CPU memory you want to allocate for this department in this node pool (in Megabytes or Gigabytes).

      • Under the SCHEDULING PREFERENCES tab

        • Department priority - Sets the department's scheduling priority compared to other departments in the same node pool, using one of the following priorities:

          • Highest - 255

          • VeryHigh - 240

          • High - 210

          • MediumHigh - 180

          • Medium - 150

          • MediumLow - 100

          • Low - 50

          • VeryLow - 20

          • Lowest - 1

          For v2.21, the default value is MediumLow. All departments are set with the same default value, so there is no change in scheduling behavior unless the Administrator changes any department priority values.

        • Over-quota / Over-quota weight - If over-quota weight is enabled via the General settings, then Over-quota weight is presented; otherwise, Over-quota is presented.

          • Over-quota - When enabled, the department can use non-guaranteed overage resources above its quota in this node pool. The amount of non-guaranteed overage resources for this department is calculated proportionally to the department's quota in this node pool. When disabled, the department cannot use more resources than its guaranteed quota in this node pool.

          • Over-quota weight - Represents a weight used to calculate the amount of non-guaranteed overage resources a department can get on top of its quota in this node pool. All unused resources are split between departments that require the use of overage resources:

            • Medium - The default value. The Administrator can change the default to any of the following values: High, Low, Lowest, or None.

            • Lowest - Has a unique behavior: a department with the Lowest over-quota weight can only use over-quota (unused overage) resources if no other department needs them, and any department with a higher over-quota weight can take those overage resources at any time.

            • None - When set, the department cannot use more resources than its guaranteed quota in this node pool.

          • If over-quota is disabled, workloads running under subordinate projects cannot use more resources than the department's quota, but each project can still go over quota (if enabled at the project level) up to the department's quota.

          • Unlimited CPU (Cores) and CPU memory quotas are an exception - in this case, workloads of subordinate projects can consume available resources up to the physical limitation of the cluster or any of the node pools.

        • Department max. GPU device allocation - Represents the maximum GPU device allocation the department can get from this node pool, that is, the maximum sum of quota and over-quota GPUs (decimal number).

  5. Set as required.

  6. Click CREATE DEPARTMENT

Adding an Access Rule to a Department

To create a new access rule for a department:

  1. Select the department you want to add an access rule for

  2. Click ACCESS RULES

  3. Click +ACCESS RULE

  4. Select a subject

  5. Select or enter the subject identifier:

    • User Email for a local user created in NVIDIA Run:ai or for SSO user as recognized by the IDP

    • Group name as recognized by the IDP

    • Application name as created in NVIDIA Run:ai

  6. Select a role

  7. Click SAVE RULE

  8. Click CLOSE

Deleting an Access Rule from a Department

To delete an access rule from a department:

  1. Select the department you want to remove an access rule from

  2. Click ACCESS RULES

  3. Find the access rule you would like to delete

  4. Click on the trash icon

  5. Click CLOSE

Editing a Department

  1. Select the Department you want to edit

  2. Click EDIT

  3. Update the Department and click SAVE

Viewing a Department’s Policy

To view the policy of a department:

  1. Select the department for which you want to view its policy. This option is only active if the department has defined policies in place.

  2. Click VIEW POLICY and select the workload type for which you want to view the policies: a. Workspace workload type policy with its set of rules b. Training workload type policies with its set of rules

  3. In the Policy form, view the workload rules that are enforcing your department for the selected workload type as well as the defaults:

    • Parameter - The workload submission parameter that Rule and Default is applied on

    • Type (applicable for data sources only) - The data source type (Git, S3, nfs, pvc etc.)

    • Default - The default value of the Parameter

    • Rule - Set up constraints on workload policy fields

    • Source - The origin of the applied policy (cluster, department or project)

Note

  • The policy affecting the department consists of rules and defaults. Some of these rules and defaults may be derived from the policies of a parent cluster (source). You can see the source of each rule in the policy form.

  • A policy set for a department affects all subordinated projects and their workloads, according to the policy workload type

Deleting a Department

  1. Select the department you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm the deletion

Note

Deleting a department permanently deletes its subordinated projects, any assets created in the scope of this department, and any of its subordinated projects such as compute resources, environments, data sources, templates, and credentials. However, workloads running within the department’s subordinated projects, or the policies defined for this department or its subordinated projects - remain intact and running.

Reviewing a Department

  1. Select the department you want to review

  2. Click REVIEW

  3. Review and click CLOSE

Using API

To view the available actions, go to the API reference.

Node Pools

This section explains the procedure for managing Node pools.

Node pools assist in managing heterogeneous resources effectively. A node pool is a NVIDIA Run:ai construct representing a set of nodes grouped into a bucket of resources using a predefined node label (e.g. NVIDIA GPU type) or an administrator-defined node label (any key/value pair).

Typically, the grouped nodes share a common feature or property, such as GPU type or other HW capability (such as Infiniband connectivity), or represent a proximity group (i.e. nodes interconnected via a local ultra-fast switch). Researchers and ML Engineers would typically use node pools to run specific workloads on specific resource types.

In the NVIDIA Run:ai platform, a user with the System administrator role can create, view, edit, and delete node pools. Creating a new node pool creates a new instance of the NVIDIA Run:ai Scheduler. Workloads submitted to a node pool are scheduled using the node pool’s designated scheduler instance.

Once created, the new node pool is automatically assigned to all projects and departments with a quota of zero GPU resources, unlimited CPU resources, and over quota enabled (medium weight if over-quota weight is enabled). This allows any project and department to use any node pool when over quota is enabled, even if the administrator has not assigned a quota for a specific node pool within that project or department.

When submitting a new workload, users can add a prioritized list of node pools. The node pool selector picks one node pool at a time (according to the prioritized list) and the designated node pool scheduler instance handles the submission request and tries to match the requested resources within that node pool. If the scheduler cannot find resources to satisfy the submitted workload, the node pool selector moves the request to the next node pool in the prioritized list. If no node pool satisfies the request, the node pool selector starts from the first node pool again, until one of the node pools satisfies the request.

Node Pools Table

The Node pools table can be found under Resources in the NVIDIA Run:ai platform.

The Node pools table lists all the node pools defined in the NVIDIA Run:ai platform and allows you to manage them.

Note

By default, the NVIDIA Run:ai platform includes a single node pool named ‘default’. When no other node pool is defined, all existing and new nodes are associated with the ‘default’ node pool. When deleting a node pool, if no other node pool matches any of the nodes’ labels, the node will be included in the default node pool.

The Node pools table consists of the following columns:

Column
Description

Workloads Associated with the Node Pool

Click one of the values in the Workload(s) column, to view the list of workloads and their parameters.

Note

This column is only viewable if your role in the NVIDIA Run:ai platform gives you read access to workloads. Even if you are allowed to view workloads, you can only view those within your allowed scope. This means there might be more workloads running in this node pool than appear in the list you are viewing.

Column
Description

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

  • Show/Hide details - Click to view additional information on the selected row

Show/Hide Details

Select a row in the Node pools table and then click Show details in the upper-right corner of the action bar. The details window appears, presenting metrics graphs for the whole node pool:

  • Node GPU allocation - This graph shows an overall sum of the Allocated, Unallocated, and Total number of GPUs for this node pool, over time. From observing this graph, you can learn about the occupancy of GPUs in this node pool, over time.

  • GPU Utilization Distribution - This graph shows the distribution of GPU utilization in this node pool over time. Observing this graph, you can learn how many GPUs are utilized up to 25%, 25%-50%, 50%-75%, and 75%-100%. This information helps to understand how many available resources you have in this node pool, and how well those resources are utilized by comparing the allocation graph to the utilization graphs, over time.

  • GPU Utilization - This graph shows the average GPU utilization in this node pool over time. Comparing this graph with the GPU Utilization Distribution helps to understand the actual distribution of GPU occupancy over time.

  • GPU Memory Utilization - This graph shows the average GPU memory utilization in this node pool over time, for example an average of all nodes’ GPU memory utilization over time.

  • CPU Utilization - This graph shows the average CPU utilization in this node pool over time, for example, an average of all nodes’ CPU utilization over time.

  • CPU Memory Utilization - This graph shows the average CPU memory utilization in this node pool over time, for example an average of all nodes’ CPU memory utilization over time.

Adding a New Node Pool

To create a new node pool:

  1. Click +NEW NODE POOL

  2. Enter a name for the node pool. Node pools names must start with a letter and can only contain lowercase Latin letters, numbers or a hyphen ('-’)

  3. Enter the node pool label: The node pool controller will use this node-label key-value pair to match nodes into this node pool.

    • Key is the unique identifier of a node label.

      • The key must fit the following regular expression: ^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?/?([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9]$

      • The administrator can use an automatically preset label, such as nvidia.com/gpu.product which labels the GPU type, or any other key from a node label.

    • Value is the value of that label identifier (key). The same key may have different values, in this case, they are considered as different labels.

      • Value must fit the following regular expression: ^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$

    • A node pool is defined by a single key-value pair. Do not use different labels that are set on the same node for different node pools, as this may lead to unexpected results.

  4. Set the GPU placement strategy:

    • Bin-pack - Place as many workloads as possible in each GPU and node to use fewer resources and maximize GPU and node vacancy.

    • Spread - Spread workloads across as many GPUs and nodes as possible to minimize the load and maximize the available resources per workload.

    • GPU workloads are workloads that request both GPU and CPU resources

  5. Set the CPU placement strategy:

    • Bin-pack - Place as many workloads as possible in each CPU and node to use fewer resources and maximize CPU and node vacancy.

    • Spread - Spread workloads across as many CPUs and nodes as possible to minimize the load and maximize the available resources per workload.

    • CPU workloads are workloads that request purely CPU resources

  6. Set the GPU network acceleration. For more details, see Using GB200 NVL72 and Multi-Node NVLink Domains:

    • Set the discovery method of GPU network acceleration (MNNVL)

      • Automatic - Automatically identify whether the node pool contains any MNNVL nodes. MNNVL nodes that share the same ID are part of the same NVL rack.

      • Manual - Manually set whether the node pool contains any MNNVL nodes

        • Detected

        • Not detected

    • Set the node’s label used to discover GPU network acceleration (MNNVL) to nvidia.com/gpu.clique

  7. Click CREATE NODE POOL

Labeling Nodes for Node Pool Grouping

The administrator can use a preset node label, such as the nvidia.com/gpu.product that labels the GPU type, or configure any other node label (e.g. faculty=physics).

To assign a label to nodes you want to group into a node pool, set a node label on each node:

  1. Obtain the list of nodes and their current labels (see the first command in the sketch below).

  2. Label a specific node with the new key-value pair (see the second command in the sketch below).
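A minimal sketch of both commands (the node name placeholder and the faculty=physics label are examples only):

# 1. List nodes and their current labels
kubectl get nodes --show-labels

# 2. Label a specific node so it is grouped into the node pool
kubectl label node <node-name> faculty=physics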

Editing a Node Pool

  1. Select the node pool you want to edit

  2. Click EDIT

  3. Update the node pool and click SAVE

Deleting a Node Pool

  1. Select the node pool you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm the deletion

Note

The default node pool cannot be deleted. When deleting a node pool, if no other node pool matches any of the nodes’ labels, the node will be included in the default node pool.

Using API

To view the available actions, go to the API reference.

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Run the below --help command to obtain the login options and log in according to your setup:

Log in using the following command. You will be prompted to enter your username and password:

To use the API, you will need to obtain a token as shown in


{
"spec": {
    "compute": {
    "gpuDevicesRequest": 1,
    "gpuRequestType": "portion",
    "gpuPortionRequest": 0.5,
    "gpuPortionLimit": 0.5,
    "gpuMemoryRequest": "10M",
    "gpuMemoryLimit": "10M",
    "migProfile": "1g.5gb",
    "cpuCoreRequest": 0.5,
    "cpuCoreLimit": 2,
    "cpuMemoryRequest": "20M",
    "cpuMemoryLimit": "30M",
    "largeShmRequest": false,
    "extendedResources": [
        {
        "resource": "hardware-vendor.example/foo",
        "quantity": 2,
        "exclude": false
        }
    ]
    }
}
}


compute:
    gpuDevicesRequest:
        max: 2
defaults:
  createHomeDir: true
  environmentVariables:
    instances:
      - name: MY_ENV
        value: my_value
  security:
    allowPrivilegeEscalation: false

rules:
  storage:
    s3:
      attributes:
        url:
          options:
            - value: https://www.google.com
              displayed: https://www.google.com
            - value: https://www.yahoo.com
              displayed: https://www.yahoo.com
rules:
  storage:
    dataVolume:
      instances:
        canAdd: false
    hostPath:
      instances:
        canAdd: false
    pvc:
      instances:
        canAdd: false
    git:
      attributes:
        repository:
          required: true
        branch:
          required: true
        path:
          required: true
    nfs:
      instances:
        canAdd: false
    s3:
      instances:
        canAdd: false
  compute:
    cpuCoreRequest:
      required: true
      min: 0
      max: 8
    cpuCoreLimit:
      min: 0
      max: 8
    cpuMemoryRequest:
      required: true
      min: '0'
      max: 16G
    cpuMemoryLimit:
      min: '0'
      max: 8G
    migProfile:
      canEdit: false
    gpuPortionRequest:
      min: 0
      max: 1
    gpuMemoryRequest:
      canEdit: false
    extendedResources:
      instances:
        canAdd: false
defaults:
  worker:
    command: my-command-worker-1
    environmentVariables:
      instances:
        - name: LOG_DIR
          value: policy-worker-to-be-ignored
        - name: ADDED_VAR
          value: policy-worker-added
    security:
      runAsUid: 500
    storage:
      s3:
        attributes:
          bucket: bucket1-worker
  master:
    command: my-command-master-2
    environmentVariables:
      instances:
        - name: LOG_DIR
          value: policy-master-to-be-ignored
        - name: ADDED_VAR
          value: policy-master-added
    security:
      runAsUid: 800
    storage:
      s3:
        attributes:
          bucket: bucket1-master
rules:
  worker:
    command:
      options:
        - value: my-command-worker-1
          displayed: command1
        - value: my-command-worker-2
          displayed: command2
    storage:
      nfs:
        instances:
          canAdd: false
      s3:
        attributes:
          bucket:
            options:
              - value: bucket1-worker
              - value: bucket2-worker
  master:
    command:
      options:
        - value: my-command-master-1
          displayed: command1
        - value: my-command-master-2
          displayed: command2
    storage:
      nfs:
        instances:
          canAdd: false
      s3:
        attributes:
          bucket:
            options:
              - value: bucket1-master
              - value: bucket2-master
rules:
  imagePullPolicy:
    required: true
    options:
      - value: Always
        displayed: Always
      - value: Never
        displayed: Never
  createHomeDir:
    canEdit: false
rules:
  security:
    runAsUid:
      min: 1
      max: 32700
    allowPrivilegeEscalation:
      canEdit: false
defaults: null
rules: null
imposedAssets:
  - f12c965b-44e9-4ff6-8b43-01d8f9e630cc
defaults:
  createHomeDir: true
  imagePullPolicy: IfNotPresent
  nodePools:
    - node-pool-a
    - node-pool-b
  environmentVariables:
    instances:
      - name: WANDB_API_KEY
        value: REPLACE_ME!
      - name: WANDB_BASE_URL
        value: https://wandb.mydomain.com
  compute:
    cpuCoreRequest: 0.1
    cpuCoreLimit: 20
    cpuMemoryRequest: 10G
    cpuMemoryLimit: 40G
    largeShmRequest: true
  security:
    allowPrivilegeEscalation: false
  storage:
    git:
      attributes:
        repository: https://git-repo.my-domain.com
        branch: master
    hostPath:
      instances:
        - name: vol-data-1
          path: /data-1
          mountPath: /mount/data-1
        - name: vol-data-2
          path: /data-2
          mountPath: /mount/data-2
rules:
  createHomeDir:
    canEdit: false
  imagePullPolicy:
    canEdit: false
  environmentVariables:
    instances:
      locked:
        - WANDB_BASE_URL
  compute:
    cpuCoreRequest:
      max: 32
    cpuCoreLimit:
      max: 32
    cpuMemoryRequest:
      min: 1G
      max: 20G
    cpuMemoryLimit:
      min: 1G
      max: 40G
    largeShmRequest:
      canEdit: false
    extendedResources:
      instances:
        canAdd: false
  security:
    allowPrivilegeEscalation:
      canEdit: false
    runAsUid:
      min: 1
  storage:
    hostPath:
      instances:
        locked:
          - vol-data-1
          - vol-data-2
imposedAssets:
  - 4ba37689-f528-4eb6-9377-5e322780cc27

Department

The name of the department

Node pool(s) with quota

The node pools associated with this department. By default, all node pools within a cluster are associated with each department. Administrators can change the node pools’ quota parameters for a department. Click the values under this column to view the list of node pools with their parameters (as described below)

GPU quota

GPU quota associated with the department

Total GPUs for projects

The sum of all projects’ GPU quotas associated with this department

Project(s)

List of projects associated with this department

Subject(s)

The users, SSO groups, or applications with access to the department. Click the values under this column to view the list of subjects with their parameters (as described below). This column is viewable only if your role in the NVIDIA Run:ai platform grants you those permissions.

Allocated GPUs

The total number of GPUs allocated by successfully scheduled workloads in projects associated with this department

GPU allocation ratio

The ratio of Allocated GPUs to GPU quota. This number reflects how well the department’s GPU quota is utilized by its descendant projects. A number higher than 100% means the department is using over quota GPUs. A number lower than 100% means not all projects are utilizing their quotas. A quota becomes allocated once a workload is successfully scheduled.

Creation time

The timestamp for when the department was created

Workload(s)

The list of workloads under projects associated with this department. Click the values under this column to view the list of workloads with their resource parameters (as described below)

Cluster

The cluster that the department is associated with

Node pool

The name of the node pool, given by the administrator during node pool creation. All clusters have a default node pool, created automatically by the system and named ‘default’.

GPU quota

The amount of GPU quota the administrator dedicated to the department for this node pool (floating number, e.g. 2.3 means 230% of a GPU capacity)

CPU (Cores)

The amount of CPU (Cores) quota the administrator has dedicated to the department for this node pool (floating number, e.g. 1.3 Cores = 1300 millicores). The ‘unlimited’ value means the CPU (Cores) quota is not bounded and workloads using this node pool can use as many CPU (Cores) resources as they need (if available)

CPU memory

The amount of CPU memory quota the administrator has dedicated to the department for this node pool (floating number, in MB or GB). The ‘unlimited’ value means the CPU memory quota is not bounded and workloads using this node pool can use as much CPU memory resource as they need (if available).

Allocated GPUs

The total amount of GPUs allocated by workloads using this node pool under projects associated with this department. The number of allocated GPUs may temporarily surpass the GPU quota of the department if over quota is used.

Allocated CPU (Cores)

The total amount of CPUs (cores) allocated by workloads using this node pool under all projects associated with this department. The number of allocated CPUs (cores) may temporarily surpass the CPUs (Cores) quota of the department if over quota is used.

Allocated CPU memory

The actual amount of CPU memory allocated by workloads using this node pool under all projects associated with this department. The number of Allocated CPU memory may temporarily surpass the CPU memory quota if over quota is used.

Subject

A user, SSO group, or application assigned with a role in the scope of this department

Type

The type of subject assigned to the access rule (user, SSO group, or application).

Scope

The scope of this department within the organizational tree. Click the name of the scope to view the organizational tree diagram. You can only view the parts of the organizational tree for which you have viewing permissions.

Role

The role assigned to the subject, in this department’s scope

Authorized by

The user who granted the access rule

Last updated

The last time the access rule was updated


Node pool

The node pool name, set by the administrator during its creation (the node pool name cannot be changed after its creation).

Status

Node pool status. A ‘Ready’ status means the scheduler can use this node pool to schedule workloads. ‘Empty’ status means no nodes are currently included in that node pool.

Label key & label value

The node pool controller will use this node-label key-value pair to match nodes into this node pool.

Node(s)

List of nodes included in this node pool. Click the field to view details (the details are in the Nodes article).

GPU network acceleration (MNNVL)

Indicates whether the discovery of Multi-Node NVLink (MNNVL) nodes is done automatically or manually, as set by the Administrator.

MNNVL label key

The label key that is used to automatically detect if a node is part of an MNNVL domain. The default MNNVL domain label is nvidia.com/gpu.clique.

MNNVL nodes

Indicates whether MNNVL nodes are detected - automatically or manually.

GPU devices

The total number of GPU devices installed into nodes included in this node pool. For example, a node pool that includes 12 nodes each with 8 GPU devices would show a total number of 96 GPU devices.

GPU memory

The total amount of GPU memory installed on nodes included in this node pool. For example, a node pool that includes 12 nodes, each with 8 GPU devices, and each device with 80 GB of memory, would show a total memory amount of 7.68 TB.

Allocated GPUs

The total allocation of GPU devices in units of GPUs (decimal number). For example, if 3 GPUs are 50% allocated, the field prints out the value 1.50. This value represents the portion of GPU memory consumed by all running pods using this node pool. ‘Allocated GPUs’ can be larger than ‘Projects’ GPU quota’ if over quota is used by workloads, but not larger than GPU devices.

GPU resource optimization ratio

Shows the Node Level Scheduler mode.

CPUs (Cores)

The number of CPU cores installed on nodes included in this node pool

CPU memory

The total amount of CPU memory installed on nodes using this node pool

Allocated CPUs (Cores)

The total allocation of CPU compute in units of Cores (decimal number). This value represents the amount of CPU cores consumed by all running pods using this node pool. ‘Allocated CPUs’ can be larger than ‘Projects’ CPU quota’ if over quota is used by workloads, but not larger than CPUs (Cores).

Allocated CPU memory

The total allocation of CPU memory in units of TB/GB/MB (decimal number). This value represents the amount of CPU memory consumed by all running pods using this node pool. ‘Allocated CPU memory’ can be larger than ‘Projects’ CPU memory quota’ if over quota is used by workloads, but not larger than CPU memory.

GPU placement strategy

Sets the Scheduler strategy for assigning pods that request both GPU and CPU resources to nodes. The strategy can be either Bin-pack or Spread. Bin-pack is used by default but can be changed to Spread by editing the node pool. When set to Bin-pack, the Scheduler tries to fill nodes as much as possible before using empty or sparse nodes; when set to Spread, the Scheduler tries to keep nodes as sparse as possible by spreading workloads across as many nodes as it can.

CPU placement strategy

Sets the Scheduler strategy for assigning pods that request only CPU resources to nodes. The strategy can be either Bin-pack or Spread. Bin-pack is used by default but can be changed to Spread by editing the node pool. When set to Bin-pack, the Scheduler tries to fill nodes as much as possible before using empty or sparse nodes; when set to Spread, the Scheduler tries to keep nodes as sparse as possible by spreading workloads across as many nodes as it can.

Last update

The date and time when the node pool was last updated

Creation time

The date and time when the node pool was created

Workload(s)

List of workloads running on nodes included in this node pool, click the field to view details (described below in this article)

Workload

The name of the workload. If the workload’s type is one of the recognized types (for example: PyTorch, MPI, Jupyter, Ray, Spark, Kubeflow, and many more), an appropriate icon is displayed.

Type

The NVIDIA Run:ai platform type of the workload - Workspace, Training, or Inference

Status

The state of the workload. Workload states are described in the NVIDIA Run:ai workloads section

Created by

The user or application that created this workload

Running/requested pods

The number of running pods out of the number of requested pods within this workload.

Creation time

The workload’s creation date and time

Allocated GPU compute

The total amount of GPU compute allocated by this workload. A workload with 3 Pods, each allocating 0.5 GPU, will show a value of 1.5 GPUs for the workload.

Allocated GPU memory

The total amount of GPU memory allocated by this workload. A workload with 3 Pods, each allocating 20GB, will show a value of 60 GB for the workload.

Allocated CPU compute (cores)

The total amount of CPU compute allocated by this workload. A workload with 3 Pods, each allocating 0.5 Core, will show a value of 1.5 Cores for the workload.

Allocated CPU memory

The total amount of CPU memory allocated by this workload. A workload with 3 Pods, each allocating 5 GB of CPU memory, will show a value of 15 GB of CPU memory for the workload.

runai project set "project-name"
runai workspace submit "workload-name" \
--image gcr.io/run-ai-lab/pytorch-example-jupyter \
--gpu-memory-request 4G --gpu-memory-limit 12G --large-shm \
--external-url container=8888 --name-prefix jupyter  \
--command -- start-notebook.sh \
--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=
curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \ 
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \ 
-d '{ 
    "name": "workload-name", 
    "projectId": "<PROJECT-ID>", 
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "command" : "start-notebook.sh",
        "args" : "--NotebookApp.base_url=/${RUNAI_PROJECT}/${RUNAI_JOB_NAME} --NotebookApp.token=''",
        "image": "gcr.io/run-ai-lab/pytorch-example-jupyter",
        "compute": {
            "gpuDevicesRequest": 1,
            "gpuMemoryRequest": "4G",
            "gpuMemoryLimit": "12G",
            "largeShmRequest": true

        },
        "exposedUrls" : [
            { 
                "container" : 8888,
                "toolType": "jupyter-notebook", 
                "toolName": "Jupyter"  
            }
        ]
    }
}'

Hotfixes for Version 2.21

This section provides details on all hotfixes available for version 2.21. Hotfixes are critical updates released between our major and minor versions to address specific issues or vulnerabilities. These updates ensure the system remains secure, stable, and optimized without requiring a full version upgrade.

Version
Date
Internal ID
Description

2.21.25

11/06/2025

RUN-29548

Fixed a typo in the documentation where the API key was incorrectly written as enforceRun:aiScheduler instead of the correct enforceRunaiScheduler.

2.21.25

11/06/2025

RUN-29320

Fixed an issue in CLI v2 where the update server did not receive the terminal size during exec commands requiring TTY support. The terminal size is now set once upon session creation, ensuring proper behavior for interactive sessions.

2.21.24

08/06/2025

RUN-29282

Fixed a security vulnerability in golang.org/x/crypto related to CVE-2025-22869 with severity HIGH.

2.21.23

08/06/2025

RUN-28891

  • Fixed a security vulnerability in golang.org/x/crypto related to CVE-2024-45337 with severity HIGH.

  • Fixed a security vulnerability in go-git/go-git related to CVE-2025-21613 with severity HIGH.

2.21.23

08/06/2025

RUN-25281

Fixed an issue where deploying a Hugging Face model with vLLM using the Hugging Face inference UI form on an OpenShift environment failed due to permission errors.

2.21.22

03/06/2025

RUN-29341

Fixed an issue which caused high CPU usage in the Cluster API.

2.21.22

03/06/2025

RUN-29323

Fixed an issue where Prometheus failed to send metrics for OpenShift.

2.21.19

27/05/2025

RUN-29093

Fixed an issue where rotating the runai-config webhook secret caused the app.kubernetes.io/managed-by=helm label to be removed.

2.21.18

27/05/2025

RUN-28286

Fixed an issue where CPU-only workloads incorrectly triggered idle timeout notifications intended for GPU workloads.

2.21.18

27/05/2025

RUN-28555

Fixed an issue in Admin → General Settings where the "Disabled" workloads count displayed inconsistently between the collapsed and expanded views.

2.21.18

27/05/2025

RUN-26361

Fixed an issue where Prometheus remote-write credentials were not properly updated on OpenShift clusters.

2.21.18

27/05/2025

RUN-28780

Fixed an issue where Hugging Face model validation incorrectly blocked some valid models supported by vLLM and TGI.

2.21.18

27/05/2025

RUN-28851

Fixed an issue in CLI v2 where the port-forward command terminated SSH connections after 15–30 seconds due to an idle timeout.

2.21.18

27/05/2025

RUN-25281

Fixed an issue where the Hugging Face UI submission flow failed on OpenShift (OCP) clusters.

2.21.17

21/05/2025

RUN-28266

Fixed an issue where the documentation examples for the runai workload delete CLI command were incorrect.

2.21.17

21/05/2025

RUN-28609

Fixed an issue where users with the ML Engineer role were unable to delete multiple inference jobs at once.

2.21.17

21/05/2025

RUN-28665

Fixed an issue where using servingPort authorization fields in the API on unsupported clusters did not return an error.

2.21.17

21/05/2025

RUN-28717

Fixed an issue where the API documentation listed an incorrect response code.

2.21.17

21/05/2025

RUN-28755

Fixed an issue where the tooltip next to the External URL for an inference endpoint incorrectly stated that the URL was internal.

2.21.17

21/05/2025

RUN-28762

Fixed an issue with the inference workload ownership protection.

2.21.17

21/05/2025

RUN-28859

Fixed an issue where the knative.enable-scale-to-zero setting did not default to true as expected.

2.21.17

21/05/2025

RUN-28923

Fixed an issue where calling the API with the telemetryType IDLE_ALLOCATED_GPUS resulted in a 500 Internal Server Error.

2.21.17

21/05/2025

RUN-28950

Fixed a security vulnerability in github.com/moby and github.com/docker/docker related to CVE-2024-41110 with severity Critical.

2.21.16

18/05/2025

RUN-27295

Fixed an issue in CLI v2 where the --node-type flag for inference workloads was not properly propagated to the pod specification.

2.21.16

18/05/2025

RUN-27375

Fixed an issue where projects were not visible in the legacy job submission form, preventing users from selecting a target project.

2.21.16

18/05/2025

RUN-27514

Fixed an issue where disabling CPU quota in the General settings did not remove existing CPU quotas from projects and departments.

2.21.16

18/05/2025

RUN-27521

Fixed a security vulnerability in axios related to CVE-2025-27152 with severity HIGH.

2.21.16

18/05/2025

RUN-27638

Fixed an issue where a node pool’s placement strategy stopped functioning correctly after being edited.

2.21.16

18/05/2025

RUN-27438

Fixed an issue where MPI jobs were unavailable due to an OpenShift MPI Operator installation error.

2.21.16

18/05/2025

RUN-27952

Fixed a security vulnerability in emacs-filesystem related to CVE-2025-1244 with severity HIGH.

2.21.16

18/05/2025

RUN-28244

Fixed a security vulnerability in liblzma5 related to CVE-2025-31115 with severity HIGH.

2.21.16

18/05/2025

RUN-28006

Fixed an issue where tokens became invalid for the API server after one hour.

2.21.16

18/05/2025

RUN-28097

Fixed an issue where the allocated_gpu_count_per_gpu metric displayed incorrect data for fractional pods.

2.21.16

18/05/2025

RUN-28213

Fixed a security vulnerability in golang.org/x/crypto related to CVE-2025-22869 with severity HIGH.

2.21.16

18/05/2025

RUN-28311

Fixed an issue where user creation failed with a duplicate email error, even though the email address did not exist in the system.

2.21.16

18/05/2025

RUN-28832

Fixed inference CLI v2 documentation with examples that reflect correct usage.

2.21.15

30/04/2025

RUN-27533

Fixed an issue where workloads with idle GPUs were not suspended after exceeding the configured idle time.

2.21.14

29/04/2025

RUN-26608

Fixed an issue by adding a flag to the cli config set command and the CLI install script, allowing users to set a cache directory.

2.21.14

29/04/2025

RUN-27264

Fixed an issue where creating a project from the UI with a non-unlimited deserved CPU value caused the queue to be created with limit = deserved instead of unlimited.

2.21.14

29/04/2025

RUN-27484

Fixed an issue where duplicate app.kubernetes.io/name labels were applied to services in the control plane Helm chart.

2.21.14

29/04/2025

RUN-27502

Fixed the inference CLI commands documentation: --max-replicas and --min-replicas were incorrectly used instead of --max-scale and --min-scale.

2.21.14

29/04/2025

RUN-27513

Fixed an issue where cluster-scoped policies were not visible to users with appropriate permissions.

2.21.14

29/04/2025

RUN-27515

Fixed an issue where users were unable to use assets from an upper scope during flexible workload submissions.

2.21.14

29/04/2025

RUN-27520

Fixed an issue where adding access rules immediately after creating an application did not refresh the access rules table.

2.21.14

29/04/2025

RUN-27628

Fixed an issue where a node pool could remain stuck in Updating status in certain cases.

2.21.14

29/04/2025

RUN-27826

Fixed an issue where the runai inference update command could result in a failure to update the workload. Although the command itself succeeded (since the update is asynchronous), the update often failed, and the new spec was not applied.

2.21.14

29/04/2025

RUN-27915

Fixed an issue where the "Improved Command Line Interface" admin setting was incorrectly labeled as Beta instead of Stable.

2.21.11

29/04/2025

RUN-27251

  • Fixed a security vulnerability in github.com/golang-jwt/jwt/v4 and github.com/golang-jwt/jwt/v5 related to CVE-2025-30204 with severity HIGH.

  • Fixed a security vulnerability in golang.org/x/net related to CVE-2025-22872 with severity MEDIUM.

  • Fixed a security vulnerability in knative.dev/serving related to CVE-2023-48713 with severity MEDIUM.

2.21.11

29/04/2025

RUN-27309

Fixed an issue where workloads configured with a multi node pool setup could fail to schedule on a specific node pool in the future after an initial scheduling failure, even if sufficient resources later became available.

2.21.10

29/04/2025

RUN-26992

Fixed an issue where workloads submitted with an invalid node port range would get stuck in Creating status.

2.21.10

29/04/2025

RUN-27497

Fixed an issue where, after deleting an SSO user and immediately creating a local user, the delete confirmation dialog reappeared unexpectedly.

2.21.9

15/04/2025

RUN-26989

Fixed an issue that prevented reordering node pools in the workload submission form.

2.21.9

15/04/2025

RUN-27247

Fixed security vulnerabilities in Spring framework used by db-mechanic service - CVE-2021-27568, CVE-2021-44228, CVE-2022-22965, CVE-2023-20873, CVE-2024-22243, CVE-2024-22259 and CVE-2024-22262.

2.21.9

15/04/2025

RUN-26359

Fixed an issue in CLI v2 where using the --toleration option required incorrect mandatory fields.

Clusters

This section explains the procedure to view and manage Clusters.

The Cluster table provides a quick and easy way to see the status of your cluster.

Clusters Table

The Clusters table can be found under Resources in the NVIDIA Run:ai platform.

The clusters table provides a list of the clusters added to NVIDIA Run:ai platform, along with their status.

The clusters table consists of the following columns:

Column
Description

Cluster

The name of the cluster

Status

The status of the cluster. For more information, see the Cluster Status table below. Hover over the information icon for a short description and links to troubleshooting

Creation time

The timestamp when the cluster was created

URL

The URL that was given to the cluster

NVIDIA Run:ai cluster version

The NVIDIA Run:ai version installed on the cluster

Kubernetes distribution

The flavor of Kubernetes distribution

Kubernetes version

The version of Kubernetes installed

NVIDIA Run:ai cluster UUID

The unique ID of the cluster

Cluster Status

Status
Description

Waiting to connect

The cluster has never been connected.

Disconnected

There is no communication from the cluster to the Control plane. This may be due to a network issue.

Missing prerequisites

Some prerequisites are missing from the cluster. As a result, some features may be impacted.

Service issues

At least one of the services is not working properly. You can view the list of nonfunctioning services for more information.

Connected

The NVIDIA Run:ai cluster is connected, and all NVIDIA Run:ai services are running.

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

Adding a New Cluster

To add a new cluster, see the installation guide.

Removing a Cluster

  1. Select the cluster you want to remove

  2. Click REMOVE

  3. A dialog appears: Make sure to carefully read the message before removing

  4. Click REMOVE to confirm the removal.

Using API

Go to the Clusters API reference to view the available actions
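For example, a hedged sketch of listing clusters with the API, assuming the v1 clusters endpoint and a bearer token obtained through API authentication:

curl -L 'https://<COMPANY-URL>/api/v1/clusters' \
-H 'Authorization: Bearer <TOKEN>'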

Troubleshooting

Before starting, make sure you have access to the Kubernetes cluster where NVIDIA Run:ai is deployed with the necessary permissions

Troubleshooting Scenarios

Cluster disconnected

Description: When the cluster's status is ‘disconnected’, there is no communication from the cluster services reaching the NVIDIA Run:ai Platform. This may be due to networking issues or issues with NVIDIA Run:ai services.

Mitigation:

  1. Check NVIDIA Run:ai’s services status:

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to view pods

    3. Copy and paste the following command to verify that NVIDIA Run:ai’s services are running:

      kubectl get pods -n runai | grep -E 'runai-agent|cluster-sync|assets-sync'
    4. If any of the services are not running, see the ‘cluster has service issues’ scenario.

  2. Check the network connection

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to create pods

    3. Copy and paste the following command to create a connectivity check pod:

      kubectl run control-plane-connectivity-check -n runai --image=wbitt/network-multitool --command -- /bin/sh -c 'curl -sSf <control-plane-endpoint> > /dev/null && echo "Connection Successful" || echo "Failed connecting to the Control Plane"'
    4. Replace <control-plane-endpoint> with the URL of the Control Plane in your environment. If the pod fails to connect to the Control Plane, check for potential network policies

  3. Check and modify the network policies

    1. Open your terminal

    2. Copy and paste the following command to check the existence of network policies:

      kubectl get networkpolicies -n runai
    3. Review the policies to ensure that they allow traffic from the NVIDIA Run:ai namespace to the Control Plane. If necessary, update the policies to allow the required traffic

      Example of allowing traffic:

      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-control-plane-traffic
        namespace: runai
      spec:
        podSelector:
          matchLabels:
            app: runai
        policyTypes:
          - Ingress
          - Egress
        egress:
          - to:
              - ipBlock:
                  cidr: <control-plane-ip-range>
            ports:
              - protocol: TCP
                port: <control-plane-port>
        ingress:
          - from:
              - ipBlock:
                  cidr: <control-plane-ip-range>
            ports:
              - protocol: TCP
                port: <control-plane-port>
    4. Check infrastructure-level configurations:

      • Ensure that firewall rules and security groups allow traffic between your Kubernetes cluster and the Control Plane

      • Verify required ports and protocols:

        • Ensure that the necessary ports and protocols for NVIDIA Run:ai’s services are not blocked by any firewalls or security groups

  4. Check NVIDIA Run:ai services logs

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to view logs

    3. Copy and paste the following commands to view the logs of the NVIDIA Run:ai services:

      kubectl logs deployment/runai-agent -n runai
      kubectl logs deployment/cluster-sync -n runai
      kubectl logs deployment/assets-sync -n runai
    4. Try to identify the problem from the logs. If you cannot resolve the issue, continue to the next step.

  5. Diagnosing internal network issues: NVIDIA Run:ai operates on Kubernetes, which uses its internal subnet and DNS services for communication between pods and services. If you find connectivity issues in the logs, the problem might be related to Kubernetes' internal networking.

    To diagnose DNS or connectivity issues, you can start a debugging pod with networking utilities:

    1. Copy the following command to your terminal, to start a pod with networking tools:

      kubectl run -i --tty netutils --image=dersimn/netutils -- bash

      This command creates an interactive pod (netutils) where you can use networking commands like ping, curl, nslookup, etc., to troubleshoot network issues.

    2. Use this pod to perform network resolution tests and other diagnostics to identify any DNS or connectivity problems within your Kubernetes cluster.

  6. Contact NVIDIA Run:ai’s support

    • If the issue persists, contact NVIDIA Run:ai’s support for assistance.

Cluster has service issues

Description: When a cluster's status is ‘Service issues’, it means that one or more NVIDIA Run:ai services running in the cluster are not available.

Mitigation:

  1. Verify non-functioning services

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to view the runaiconfig resource

    3. Copy and paste the following command to determine which services are not functioning:

      kubectl get runaiconfig -n runai runai -ojson | jq -r '.status.conditions | map(select(.type == "Available"))'
  2. Check for Kubernetes events

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to view events

    3. Copy and paste the following command to get all Kubernetes events:

      kubectl get events -A
  3. Inspect resource details

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to describe resources

    3. Copy and paste the following command to check the details of the required resource:

      kubectl describe <resource_type> <name>
  4. Contact NVIDIA Run:ai’s Support

    • If the issue persists, contact NVIDIA Run:ai’s support for assistance.

Cluster is waiting to connect

Description: When the cluster's status is ‘waiting to connect’, it means that no communication from the cluster services reaches the NVIDIA Run:ai Platform. This may be due to networking issues or issues with NVIDIA Run:ai services.

Mitigation:

  1. Check NVIDIA Run:ai’s services status

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to view pods

    3. Copy and paste the following command to verify that NVIDIA Run:ai’s services are running:

      kubectl get pods -n runai | grep -E 'runai-agent|cluster-sync|assets-sync'
    4. If any of the services are not running, see the ‘cluster has service issues’ scenario.

  2. Check the network connection

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to create pods

    3. Copy and paste the following command to create a connectivity check pod:

      kubectl run control-plane-connectivity-check -n runai --image=wbitt/network-multitool --command -- /bin/sh -c 'curl -sSf <control-plane-endpoint> > /dev/null && echo "Connection Successful" || echo "Failed connecting to the Control Plane"'
    4. Replace <control-plane-endpoint> with the URL of the Control Plane in your environment. If the pod fails to connect to the Control Plane, check for potential network policies:

  3. Check and modify the network policies

    1. Open your terminal

    2. Copy and paste the following command to check the existence of network policies:

      kubectl get networkpolicies -n runai
    3. Review the policies to ensure that they allow traffic from the NVIDIA Run:ai namespace to the Control Plane. If necessary, update the policies to allow the required traffic

    4. Example of allowing traffic:

      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: allow-control-plane-traffic
        namespace: runai
      spec:
        podSelector:
          matchLabels:
            app: runai
        policyTypes:
          - Ingress
          - Egress
        egress:
          - to:
              - ipBlock:
                  cidr: <control-plane-ip-range>
            ports:
              - protocol: TCP
                port: <control-plane-port>
        ingress:
          - from:
              - ipBlock:
                  cidr: <control-plane-ip-range>
            ports:
              - protocol: TCP
                port: <control-plane-port>
    5. Check infrastructure-level configurations:

      • Ensure that firewall rules and security groups allow traffic between your Kubernetes cluster and the Control Plane

      • Verify required ports and protocols:

        • Ensure that the necessary ports and protocols for NVIDIA Run:ai’s services are not blocked by any firewalls or security groups

  4. Check NVIDIA Run:ai services logs

    1. Open your terminal

    2. Make sure you have access to the Kubernetes cluster with permissions to view logs

    3. Copy and paste the following commands to view the logs of the NVIDIA Run:ai services:

      kubectl logs deployment/runai-agent -n runai
      kubectl logs deployment/cluster-sync -n runai
      kubectl logs deployment/assets-sync -n runai
    4. Try to identify the problem from the logs. If you cannot resolve the issue, continue to the next step

  5. Contact NVIDIA Run:ai’s support

    • If the issue persists, contact NVIDIA Run:ai’s support for assistance

Cluster is missing prerequisites

Description: When a cluster's status displays Missing prerequisites, it indicates that at least one of the Mandatory Prerequisites has not been fulfilled. In such cases, NVIDIA Run:ai services may not function properly.

Mitigation:

If you have ensured that all prerequisites are installed and the status still shows missing prerequisites, follow these steps:

  1. Check the message in the NVIDIA Run:ai platform for further details regarding the missing prerequisites.

  2. Inspect the runai-public ConfigMap:

    1. Open your terminal. In the terminal, type the following command to list all ConfigMaps in the runai-public namespace:

      kubectl get configmap -n runai-public
  3. Describe the ConfigMap

    1. Locate the ConfigMap named runai-public from the list

    2. To view the detailed contents of this ConfigMap, type the following command:

      kubectl describe configmap runai-public -n runai-public
  4. Find Missing Prerequisites

    1. In the output displayed, look for a section labeled dependencies.required

    2. This section provides detailed information about any missing resources or prerequisites. Review this information to identify what is needed

  5. Contact NVIDIA Run:ai’s support

    • If the issue persists, contact NVIDIA Run:ai’s support for assistance

Credentials

This section explains what credentials are and how to create and use them.

Credentials are workload assets that simplify the complexities of Kubernetes secrets. They consist of and mask sensitive access information, such as passwords, tokens, and access keys, which are necessary for gaining access to various resources.

Credentials are crucial for the security of AI workloads and the resources they require, as they restrict access to authorized users, verify identities, and ensure secure interactions. By enforcing the protection of sensitive data, credentials help organizations comply with industry regulations, fostering a secure environment overall.

Essentially, credentials enable AI practitioners to access relevant protected resources, such as private data sources and Docker images, thereby streamlining the workload submission process.

Credentials Table

The Credentials table can be found under Workload manager in the NVIDIA Run:ai User interface.

The Credentials table provides a list of all the credentials defined in the platform and allows you to manage them.

The Credentials table comprises the following columns:

Column
Description

Credential

The name of the credential

Description

A description of the credential

Type

The type of credential, e.g., Docker registry

Status

The different lifecycle and representation of the credential's condition

Scope

The scope of this credential within the organizational tree. Click the name of the scope to view the organizational tree diagram

Kubernetes name

The unique name of the credential as it appears in the cluster (its Kubernetes name)

Environment(s)

The environment(s) that are associated with the credential

Data source(s)

The private data source(s) that are accessed using the credential

Created by

The user who created the credential

Creation time

The timestamp of when the credential was created

Cluster

The cluster with which the credential is associated

Credentials Status

The following table describes the credentials’ condition and whether they were created successfully for the selected scope.

Status
Description

No issues found

No issues were found while creating the credential (this status may change while propagating the credential to the selected scope)

Issues found

Issues found while propagating the credential

Issues found

Failed to access the cluster

Creating…

Credential is being created

Deleting…

Credential is being deleted

No status

When the credential's scope is an account, or the current version of the cluster is not up to date, the status cannot be displayed

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click ‘Download as CSV’. Export to CSV is limited to 20,000 rows.

  • Refresh - Click REFRESH to update the table with the latest data

Adding a New Credential

Creating credentials is limited to specific roles.

To add a new credential:

  1. Go to the Credentials table

  2. Click +NEW CREDENTIAL

  3. Select the credential type from the list Follow the step-by-step guide for each credential type:

Docker registry

These credentials allow users to authenticate and pull images from a Docker registry, enabling access to containerized applications and services.

After creating the credential, it is used automatically when pulling images.

  1. Select a scope.

  2. Enter a name for the credential. The name must be unique.

  3. Optional: Provide a description of the credential

  4. Set how the credential is created

    • Existing secret (in the cluster) This option applies when the purpose is to create the credential based on an existing secret

      • Select a secret from the list (The list is empty if no secrets were created in advance)

    • New secret (recommended) A new secret is created together with the credential. New secrets are not added to the list of existing secrets.

      • Enter the username, password, and Docker registry URL

  5. Click CREATE CREDENTIAL

After the credential is created, check the status to monitor proper creation across the selected scope.

Access key

These credentials are unique identifiers used to authenticate and authorize access to cloud services or APIs, ensuring secure communication between applications. They typically consist of two parts:

  • An access key ID

  • A secret access key

The purpose of this credential type is to allow access to restricted data.

  1. Select a scope

  2. Enter a name for the credential. The name must be unique.

  3. Optional: Provide a description of the credential

  4. Set how the credential is created

    • Existing secret (in the cluster) This option applies when the purpose is to create the credential based on an existing secret

      • Select a secret from the list (The list is empty if no secrets were created in advance)

    • New secret (recommended) A new secret is created together with the credential. New secrets are not added to the list of existing secrets.

      • Enter the Access key and Access secret

  5. Click CREATE CREDENTIAL

After the credential is created, check the status to monitor proper creation across the selected scope.

Username & password

These credentials require a username and corresponding password to access various resources, ensuring that only authorized users can log in.

The purpose of this credential type is to allow access to restricted data.

  1. Select a scope

  2. Enter a name for the credential. The name must be unique.

  3. Optional: Provide a description of the credential

  4. Set how the credential is created

    • Existing secret (in the cluster) This option applies when the purpose is to create the credential based on an existing secret

      • Select a secret from the list (The list is empty if no secrets were created in advance)

    • New secret (recommended) A new secret is created together with the credential. New secrets are not added to the list of existing secrets.

      • Enter the username and password

  5. Click CREATE CREDENTIAL

After the credential is created, check the status to monitor proper creation across the selected scope.

Generic secret

These credentials are a flexible option that consists of multiple keys & values and can store various sensitive information, such as API keys or configuration data, to be used securely within applications.

The purpose of this credential type is to allow access to restricted data.

  1. Select a scope

  2. Enter a name for the credential. The name must be unique.

  3. Optional: Provide a description of the credential

  4. Set how the credential is created

    • Existing secret (in the cluster) This option applies when the purpose is to create the credential based on an existing secret

      • Select a secret from the list (The list is empty if no secrets were created in advance)

    • New secret (recommended) A new secret is created together with the credential. New secrets are not added to the list of existing secrets.

      • Click +KEY & VALUE - to add key/value pairs to store in the new secret

  5. Click CREATE CREDENTIAL

Editing a Credential

To rename a credential:

  1. Select the credential from the table

  2. Click Rename to edit its name and description

Deleting a Credential

To delete a credential:

  1. Select the credential you want to delete

  2. Click DELETE

  3. In the dialog, click DELETE to confirm

Note

Credentials cannot be deleted if they are being used by a workload or a template.

Using Credentials

You can use credentials (secrets) in various ways within the system:

Access Private Data Sources

To access private data sources, attach credentials to data sources of the following types: Git, S3 Bucket

Use Directly Within the Container

To use the secret directly from within the container, you can choose between the following options:

  1. Get the secret mounted to the file system by using the Generic secret data source

  2. Get the secret as an environment variable injected into the container. There are two equivalent ways to inject the environment variable.

    a. By adding it to the Environment asset.

    b. By adding it ad-hoc as part of the workload.

Creating Secrets in Advance

Add secrets in advance to be used when creating credentials via the NVIDIA Run:ai UI. Follow the steps below for each required scope:

Cluster Scope

  1. Create the secret in the NVIDIA Run:ai namespace (runai)

  2. To authorize NVIDIA Run:ai to use the secret, label it: run.ai/cluster-wide: "true"

  3. Label the secret with the correct credential type:

    1. Docker registry - run.ai/resource: "docker-registry"

    2. Access key - run.ai/resource: "access-key"

    3. Username and password - run.ai/resource: "password"

    4. Generic secret - run.ai/resource: "generic"

The secret is now displayed for that scope in the list of existing secrets.
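For example, a minimal kubectl sketch of the cluster scope steps above, using a generic secret. The secret name and key/value pair are placeholders; the labels are the ones listed in this section:

  # Create the secret in the NVIDIA Run:ai namespace (placeholder name and data)
  kubectl create secret generic my-credential -n runai \
    --from-literal=API_KEY=<value>

  # Authorize NVIDIA Run:ai to use the secret cluster-wide and mark its credential type
  kubectl label secret my-credential -n runai \
    run.ai/cluster-wide="true" \
    run.ai/resource="generic"

The department and project scope variants follow the same pattern, using the labels described in the sections below.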

Department Scope

  1. Create the secret in the NVIDIA Run:ai namespace (runai)

  2. To authorize NVIDIA Run:ai to use the secret, label it: run.ai/department: "<department_id>"

  3. Label the secret with the correct credential type:

    1. Docker registry - run.ai/resource: "docker-registry"

    2. Access key - run.ai/resource: "access-key"

    3. Username and password - run.ai/resource: "password"

    4. Generic secret - run.ai/resource: "generic"

The secret is now displayed for that scope in the list of existing secrets.

Project Scope

  1. Create the secret in the project’s namespace

  2. Label the secret with the correct credential type:

    1. Docker registry - run.ai/resource: "docker-registry"

    2. Access key - run.ai/resource: "access-key"

    3. Username and password - run.ai/resource: "password"

    4. Generic secret - run.ai/resource: "generic"

Using API

To view the available actions, go to the Credentials API reference

Advanced Cluster Configurations

Advanced cluster configurations can be used to tailor your NVIDIA Run:ai cluster deployment to meet specific operational requirements and optimize resource management. By fine-tuning these settings, you can enhance functionality, ensure compatibility with organizational policies, and achieve better control over your cluster environment. This article provides guidance on implementing and managing these configurations to adapt the NVIDIA Run:ai cluster to your unique needs.

After the NVIDIA Run:ai cluster is installed, you can adjust various settings to better align with your organization's operational needs and security requirements.

Modify Cluster Configurations

Advanced cluster configurations in NVIDIA Run:ai are managed through the runaiconfig custom resource. To edit the cluster configurations, run:
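Assuming the runaiconfig resource is named runai and lives in the runai namespace, as in the troubleshooting commands earlier in this document, a typical edit command is:

  kubectl edit runaiconfig runai -n runai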

To see the full runaiconfig object structure, use:
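A hedged sketch, assuming the CRD is named runaiconfigs.run.ai (adjust to the CRD actually present in your cluster):

  kubectl get crds runaiconfigs.run.ai -o yaml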

Configurations

The following configurations allow you to enable or disable features, control permissions, and customize the behavior of your NVIDIA Run:ai cluster:

Key
Description

NVIDIA Run:ai Services Resource Management

The NVIDIA Run:ai cluster includes many different services. To simplify resource management, the configuration structure allows you to configure the containers’ CPU/memory resources for each service individually or for a group of services together.

Service Group
Description
NVIDIA Run:ai containers

Apply the following configuration in order to change the resource requests and limits for a group of services:

Or, apply the following configuration in order to change the resource requests and limits for each service individually:

For resource recommendations, see .

NVIDIA Run:ai Services Replicas

By default, all NVIDIA Run:ai containers are deployed with a single replica. Some services support multiple replicas for redundancy and performance.

To simplify configuring replicas, a global replicas configuration can be set and is applied to all supported services:

This can be overridden for specific services (if supported). Services without the replicas configuration do not support replicas:

Prometheus

The Prometheus instance in NVIDIA Run:ai is used for metrics collection and alerting.

The configuration scheme follows the official Prometheus Operator specification and supports additional custom configurations. The PrometheusSpec schema is available using the spec.prometheus.spec configuration.

A common use case for the PrometheusSpec is metrics retention. Configuring local temporary metrics retention prevents metrics loss during potential connectivity issues, as shown in the sketch below:
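A hedged sketch of setting a short local retention window. The retention field and the 2h value come from the upstream PrometheusSpec; the patch path assumes the spec.prometheus.spec configuration described above and the runai resource name and namespace used elsewhere in this document:

  kubectl patch runaiconfig runai -n runai --type merge \
    -p '{"spec":{"prometheus":{"spec":{"retention":"2h"}}}}'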

In addition to the PrometheusSpec schema, some custom NVIDIA Run:ai configurations are also available:

  • Additional labels – Set additional labels for NVIDIA Run:ai's metrics sent by Prometheus.

  • Log level configuration – Configure the logLevel setting for the Prometheus container.

NVIDIA Run:ai Managed Nodes

To include or exclude specific nodes from running workloads within a cluster managed by NVIDIA Run:ai, use the nodeSelectorTerms configuration. For additional details, see the Kubernetes documentation on assigning pods to nodes.

Configure the node selector terms using the following fields:

  • key: Label key (e.g., zone, instance-type).

  • operator: Operator defining the inclusion/exclusion condition (In, NotIn, Exists, DoesNotExist).

  • values: List of values for the key when using In or NotIn.

For example, to include NVIDIA GPU nodes only and exclude all other GPU types in a cluster with mixed nodes, match on the GPU product-type node label (nvidia.com/gpu.product) using the Exists operator.

S3 and Git Sidecar Images

Note

This section applies to self-hosted deployments only.

For air-gapped environments, it is required to replace the default sidecar images in order to use the Git and S3 data source integrations. Use the following configurations:

Over Quota, Fairness and Preemption

This quick start provides a step-by-step walkthrough of the core scheduling concepts - over quota, fairness, and preemption. It demonstrates the simplicity of resource provisioning and how the system eliminates bottlenecks by allowing users or teams to exceed their resource quota when free GPUs are available.

  • Over quota - In this scenario, team-a runs two training workloads and team-b runs one. Team-a has 3 GPUs allocated and is over quota by 1 GPU, while team-b has 1 GPU allocated within its quota. The system allows this over quota usage as long as there are available GPUs in the cluster.

  • Fairness and preemption - Since the cluster is already at full capacity, when team-b launches a new b2 workload requiring 1 GPU, team-a can no longer remain over quota. To maintain fairness, the Scheduler preempts workload a1 (1 GPU), freeing up resources for team-b.

Note

If enabled by your Administrator, the NVIDIA Run:ai UI allows you to create a new workload using either the Flexible or the Original submission form. The steps in this quick start guide reflect the Original form only.

Prerequisites

  • You have created two projects - team-a and team-b - or have them created for you.

  • Each project has an assigned quota of 2 GPUs.

Step 1: Logging In

Step 2: Submitting the First Training Workload (team-a)

  1. Go to the Workload Manager → Workloads

  2. Click +NEW WORKLOAD and select Training

  3. Select the cluster under which to create the workload

  4. Select the project named team-a

  5. Under Workload architecture, select Standard

  6. Select a preconfigured template or select Start from scratch to quickly launch a new training workload

  7. Enter a1 as the workload name

  8. Click CONTINUE. In the next step:

  9. Create a new environment

    • Click +NEW ENVIRONMENT

    • Enter a name for the environment. The name must be unique.

    • Enter the training Image URL - runai.jfrog.io/demo/quickstart

    • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  10. Select the ‘one-gpu’ compute resource for your workload (GPU devices: 1 )

    • If ‘one-gpu’ is not displayed in the gallery, follow the below steps:

      • Click +NEW COMPUTE RESOURCE

      • Enter a name for the compute resource. The name must be unique.

      • Set GPU devices per pod: 1

      • Set GPU memory per device:

        • Select % (of device) - Fraction of a GPU device's memory

        • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  11. Click CREATE TRAINING

Copy the following command to your terminal. For more details, see the CLI reference:

Copy the following command to your terminal. Make sure to update the following parameters (see the sketch after the parameter list). For more details, see the API reference.

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in API authentication

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.
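A minimal API sketch for this step, modeled on the workspaces API call shown earlier in this document. It assumes the /api/v1/workloads/trainings endpoint and submits a1 to team-a's project with one GPU device and the quick start image; for the later steps, change the name, project, and gpuDevicesRequest accordingly (2 for a2, 1 for b1 and b2):

curl -L 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
    "name": "a1",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "image": "runai.jfrog.io/demo/quickstart",
        "compute": {
            "gpuDevicesRequest": 1
        }
    }
}'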

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 3: Submitting the Second Training Workload (team-a)

  1. Go to the Workload Manager → Workloads

  2. Click +NEW WORKLOAD and select Training

  3. Select the cluster where the previous training workload was created

  4. Select the project named team-a

  5. Under Workload architecture, select Standard

  6. Select a preconfigured template or select Start from scratch to quickly launch a new training workload

  7. Enter a2 as the workload name

  8. Click CONTINUE. In the next step:

  9. Select the environment created in Step 2

  10. Select the ‘two-gpus’ compute resource for your workload (GPU devices: 2)

    • If ‘two-gpus’ is not displayed in the gallery, follow the below steps:

      • Click +NEW COMPUTE RESOURCE

      • Enter a name for the compute resource. The name must be unique.

      • Set GPU devices per pod: 2

      • Set GPU memory per device:

        • Select % (of device) - Fraction of a GPU device's memory

        • Set the memory Request - 100 (the workload will allocate 100% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  11. Click CREATE TRAINING

Copy the following command to your terminal. For more details, see the CLI reference:

Copy the following command to your terminal. Make sure to update the following parameters. For more details, see the API reference.

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in API authentication

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 4: Submitting the First Training Workload (team-b)

  1. Go to the Workload Manager → Workloads

  2. Click +NEW WORKLOAD and select Training

  3. Select the cluster where the previous training was created

  4. Select the project named team-b

  5. Under Workload architecture, select Standard

  6. Select a preconfigured template, or select Start from scratch to quickly launch a new training

  7. Enter b1 as the workload name

  8. Click CONTINUE. In the next step:

  9. Select the environment created in Step 2

  10. Select the compute resource created in Step 2

  11. Click CREATE TRAINING

Copy the following command to your terminal. For more details, see the CLI reference:

Copy the following command to your terminal. For more details, see the CLI reference:

Copy the following command to your terminal. Make sure to update the following parameters. For more details, see the Trainings API.

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the project the workload runs in. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the cluster. You can get the Cluster UUID via the Get Clusters API.

Note

The API snippet above runs only on NVIDIA Run:ai clusters of version 2.18 and above.

Over Quota Status

System status after run:

Step 5: Submitting the Second Training Workload (team-b)

  1. Go to the Workload Manager → Workloads

  2. Click +NEW WORKLOAD and select Training

  3. Select the cluster where the previous training was created

  4. Select the project named team-b

  5. Under Workload architecture, select Standard

  6. Select a preconfigured template, or select Start from scratch to quickly launch a new training

  7. Enter b2 as the workload name

  8. Click CONTINUE. In the next step:

  9. Select the environment created in Step 2

  10. Select the compute resource created in Step 2

  11. Click CREATE TRAINING

Copy the following command to your terminal. For more details, see the CLI reference:

Copy the following command to your terminal. For more details, see the CLI reference:

Copy the following command to your terminal. Make sure to update the following parameters. For more details, see the Trainings API.

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the project the workload runs in. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the cluster. You can get the Cluster UUID via the Get Clusters API.

Note

The API snippet above runs only on NVIDIA Run:ai clusters of version 2.18 and above.

Basic Fairness and Preemption Status

Workloads status after run:

Next Steps

Manage and monitor your newly created workloads using the Workloads table.

Environments

This section explains what environments are and how to create and use them.

Environments are one type of workload asset. An environment consists of a configuration that simplifies how workloads are submitted and can be used by AI practitioners when they submit their workloads.

An environment asset is a preconfigured building block that encapsulates aspects for the workload such as:

  • Container image and container configuration

  • Tools and connections

  • The type of workload it serves

Environments Table

The Environments table can be found under Workload manager in the NVIDIA Run:ai platform.

The Environments table provides a list of all the environments defined in the platform and allows you to manage them.

The Environments table consists of the following columns:

Column
Description

Tools Associated with the Environment

Click one of the values in the tools column to view the list of tools and their connection type.

Column
Description

Workloads Associated with the Environment

Click one of the values in the Workload(s) column to view the list of workloads and their parameters.

Column
Description

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

Environments Created by NVIDIA Run:ai

When installing NVIDIA Run:ai, you automatically get the environments created by NVIDIA Run:ai to ease the onboarding process and support different use cases out of the box. These environments are created at the scope of the account.

Note

The environments listed below are available based on your cluster settings. Some environments, such as vscode and rstudio, are only available in clusters with host-based routing.

Environment
Image
Description

Adding a New Environment

Environment creation is limited to specific roles.

To add a new environment:

  1. Go to the Environments table

  2. Click +NEW ENVIRONMENT

  3. Select under which cluster to create the environment

  4. Select a scope

  5. Enter a name for the environment. The name must be unique.

  6. Optional: Provide a description of the essence of the environment

  7. Enter the Image URL. If a token or secret is required to pull the image, it is possible to create it via credentials of type docker registry. These credentials are automatically used once the image is pulled (which happens when the workload is submitted)

  8. Set the image pull policy - the condition for when to pull the image from the registry

  9. Set the workload architecture:

    • Standard - Only standard workloads can use the environment. A standard workload consists of a single process.

    • Distributed - Only distributed workloads can use the environment. A distributed workload consists of multiple processes working together. These processes can run on different nodes.

      • Select a framework from the list.

  10. Set the workload type:

    • Workspace

    • Training

    • Inference

      • When inference is selected, define the endpoint of the model by providing both the protocol and the container’s serving port

  11. Optional: Set the connection for your tool(s). The tools must be configured in the image. When submitting a workload using the environment, it is possible to connect to these tools

    • Select the tool from the list (the available tools vary from IDEs to experiment tracking and more, including a custom tool of your choice)

    • Select the connection type

      • External URL

        • Auto generate - A unique URL is automatically created for each workload using the environment

        • Custom URL - The URL is set manually

      • Node port

        • Auto generate - A unique port is automatically exposed for each workload using the environment

        • Custom port - Set the port manually

      • Set the container port

  12. Optional: Set a command and arguments for the container running the pod

    • When no command is added, the default command of the image is used (the image entrypoint)

    • The command can be modified while submitting a workload using the environment

    • The argument(s) can be modified while submitting a workload using the environment

  13. Optional: Set the environment variable(s)

    • Click +ENVIRONMENT VARIABLE

    • Enter a name

    • Select the source for the environment variable

      • Custom

        • Enter a value

        • Leave empty

        • Add instructions for the expected value if any

      • Credentials - Select an existing credential as the environment variable

        • Select a credential name. To add new credentials to the credentials list, and for additional information, see Credentials.

        • Select a secret key

      • ConfigMap - Select a predefined ConfigMap

        • Select a ConfigMap name. To create a ConfigMap in your cluster, see Creating ConfigMaps in advance (a minimal example appears after these steps).

        • Enter a ConfigMap key

    • The environment variables can be modified and new variables can be added while submitting a workload using the environment

  14. Optional: Set the container’s working directory to define where the container’s process starts running. When left empty, the default directory is used.

  15. Optional: Set where the UID, GID and supplementary groups are taken from. This can be:

    • From the image

    • From the IdP token (only available in SSO installations)

    • Custom (manually set) - decide whether the submitter can modify these values upon submission.

      • Set the User ID (UID), Group ID (GID) and the supplementary groups that can run commands in the container

        • Enter UID

        • Enter GID

        • Add Supplementary groups (multiple groups can be added, separated by commas)

        • Disable 'Allow the values above to be modified within the workload' if you want the values above to be used as the defaults

  16. Optional: Select Linux capabilities - Grant certain privileges to a container without granting all the privileges of the root user.

  17. Click CREATE ENVIRONMENT

Note

It is also possible to create an environment directly when submitting a specific workspace, training, or inference workload.
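If you plan to map a ConfigMap to an environment variable (step 13 above), the ConfigMap must already exist in the namespace where the workload will run. The following is a minimal sketch only; the ConfigMap name, key, value and the runai-team-a namespace are example values:

# Create a ConfigMap ahead of time so its keys can be referenced from the environment (example names and values)
kubectl create configmap my-env-config \
  --from-literal=LOG_LEVEL=debug \
  -n runai-team-a

The key (LOG_LEVEL in this example) can then be selected as the ConfigMap key when defining the environment variable.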

Editing an Environment

To edit an existing environment:

  1. Select the environment you want to edit

  2. Click Edit

  3. Update the environment and click SAVE ENVIRONMENT

Note

  • Workloads that are already using this asset are not affected.

  • llm-server and chatbot-ui environments cannot be edited.

Copying an Environment

To copy an existing environment:

  1. Select the environment you want to copy

  2. Click MAKE A COPY

  3. Enter a name for the environment. The name must be unique.

  4. Update the environment and click CREATE ENVIRONMENT

Deleting an Environment

To delete an environment:

  1. Select the environment you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm

Note

Workloads that are already using this asset are not affected.

Using API

Go to the Environments API reference to view the available actions.
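For example, the following is a minimal curl sketch for listing environment assets. The endpoint path shown here is an assumption and should be verified against the API reference:

# List environment assets visible to the caller (verify the exact path in the API reference)
curl --location 'https://<COMPANY-URL>/api/v1/asset/environment' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>'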

The cluster configurations listed below are set in the runaiconfig Kubernetes custom resource. To edit the resource, or to view the full list of available options, run:

kubectl edit runaiconfig runai -n runai
kubectl get crds/runaiconfigs.run.ai -n runai -o yaml

spec.project-controller.createNamespaces (boolean)

Allows Kubernetes namespace creation for new projects Default: true

spec.workload-controller.additionalPodLabels (object)

Set workload's Pod Labels in a format of key/value pairs. These labels are applied to all pods.

spec.workload-controller.failureResourceCleanupPolicy

NVIDIA Run:ai cleans the workload's unnecessary resources:

  • All - Removes all resources of the failed workload

  • None - Retains all resources

  • KeepFailing - Removes all resources except for those that encountered issues (primarily for debugging purposes)

Default: All

spec.workload-controller.GPUNetworkAccelerationEnabled

Enables GPU network acceleration. See Using GB200 NVL72 and Multi-Node NVLink Domains for more details.

Default: false

spec.mps-server.enabled (boolean)

Enabled when using NVIDIA MPS Default: false

spec.global.subdomainSupport (boolean)

Allows the creation of subdomains for ingress endpoints, enabling access to workloads via unique subdomains on the Fully Qualified Domain Name (FQDN). For details, see External Access to Container Default: false

spec.global.nodeAffinity.restrictScheduling (boolean)

Enables setting node roles and restricting workload scheduling to designated nodes Default: false

spec.global.affinity (object)

Sets the system nodes where NVIDIA Run:ai system-level services are scheduled. Using global.affinity will overwrite the node roles set using the Administrator CLI (runai-adm). Default: Prefer to schedule on nodes that are labeled with node-role.kubernetes.io/runai-system

spec.global.tolerations (object)

Configure Kubernetes tolerations for NVIDIA Run:ai system-level services

spec.daemonSetsTolerations (object)

Configure Kubernetes tolerations for NVIDIA Run:ai daemonSets / engine

spec.runai-container-toolkit.logLevel (string)

Specifies the NVIDIA Run:ai-container-toolkit logging level: either 'SPAM', 'DEBUG', 'INFO', 'NOTICE', 'WARN', or 'ERROR' Default: INFO

spec.runai-container-toolkit.enabled (boolean)

Enables workloads to use GPU fractions

Default: true

node-scale-adjuster.args.gpuMemoryToFractionRatio (object)

A scaling-pod requesting a single GPU device will be created for every 1 to 10 pods requesting fractional GPU memory (1/gpuMemoryToFractionRatio). This value represents the ratio (0.1-0.9) of fractional GPU memory (any size) to GPU fraction (portion) conversion. Default: 0.1

spec.global.core.dynamicFractions.enabled (boolean)

Enables dynamic GPU fractions Default: true

spec.global.core.swap.enabled (boolean)

Enables memory swap for GPU workloads Default: false

spec.global.core.swap.limits.cpuRam (string)

Sets the CPU memory size used to swap GPU workloads Default: 100Gi

spec.global.core.swap.limits.reservedGpuRam (string)

Sets the reserved GPU memory size used to swap GPU workloads Default: 2Gi

spec.global.core.nodeScheduler.enabled (boolean)

Enables the node-level scheduler Default: false

spec.global.core.timeSlicing.mode (string)

Sets the GPU time-slicing mode. Possible values:

  • timesharing - all pods on a GPU share the GPU compute time evenly.

  • strict - each pod gets an exact time slice according to its memory fraction value.

  • fair - each pod gets an exact time slice according to its memory fraction value and any unused GPU compute time is split evenly between the running pods.

Default: timesharing

spec.runai-scheduler.args.fullHierarchyFairness (boolean)

Enables fairness between departments, on top of projects fairness Default: true

spec.runai-scheduler.args.defaultStalenessGracePeriod

Sets the timeout in seconds before the scheduler evicts a stale pod-group (gang) that went below its min-members in running state:

  • 0s - Immediately (no timeout)

  • -1 - Never

Default: 60s

spec.pod-grouper.args.gangSchedulingKnative (boolean)

Enables gang scheduling for inference workloads. For backward compatibility with versions earlier than v2.19, change the value to false Default: false

spec.pod-grouper.args.gangScheduleArgoWorkflow (boolean)

Groups all pods of a single ArgoWorkflow workload into a single Pod-Group for gang scheduling Default: true

spec.runai-scheduler.args.verbosity (int)

Configures the level of detail in the logs generated by the scheduler service Default: 4

spec.limitRange.cpuDefaultRequestCpuLimitFactorNoGpu (string)

Sets a default ratio between the CPU request and the limit for workloads without GPU requests Default: 0.1

spec.limitRange.memoryDefaultRequestMemoryLimitFactorNoGpu (string)

Sets a default ratio between the memory request and the limit for workloads without GPU requests Default: 0.1

spec.limitRange.cpuDefaultRequestGpuFactor (string)

Sets a default amount of CPU allocated per GPU when the CPU is not specified Default: 100

spec.limitRange.cpuDefaultLimitGpuFactor (int)

Sets a default CPU limit based on the number of GPUs requested when no CPU limit is specified Default: NO DEFAULT

spec.limitRange.memoryDefaultRequestGpuFactor (string)

Sets a default amount of memory allocated per GPU when the memory is not specified Default: 100Mi

spec.limitRange.memoryDefaultLimitGpuFactor (string)

Sets a default memory limit based on the number of GPUs requested when no memory limit is specified Default: NO DEFAULT

spec.global.enableWorkloadOwnershipProtection (boolean)

Prevents users within the same project from deleting workloads created by others. This enhances workload ownership security and ensures better collaboration by restricting unauthorized modifications or deletions. Default: false
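As an illustration, settings such as those above can be applied with kubectl patch against the runaiconfig resource. This is a sketch only: the field paths follow the keys listed above, while the label key/value and the swap sizes are example values:

# Example: add a custom label to all workload pods and enable GPU memory swap with custom limits (example values)
kubectl patch runaiconfig runai -n runai --type merge -p '
{
  "spec": {
    "workload-controller": {
      "additionalPodLabels": { "team": "ml-research" }
    },
    "global": {
      "core": {
        "swap": {
          "enabled": true,
          "limits": { "cpuRam": "64Gi", "reservedGpuRam": "4Gi" }
        }
      }
    }
  }
}'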

SchedulingServices

Containers associated with the NVIDIA Run:ai Scheduler

Scheduler, StatusUpdater, MetricsExporter, PodGrouper, PodGroupAssigner, Binder

SyncServices

Containers associated with syncing updates between the NVIDIA Run:ai cluster and the NVIDIA Run:ai control plane

Agent, ClusterSync, AssetsSync

WorkloadServices

Containers associated with submitting NVIDIA Run:ai workloads

WorkloadController, JobController

spec:
  global:
   <service-group-name>: # schedulingServices | SyncServices | WorkloadServices
     resources:
       limits:
         cpu: 1000m
         memory: 1Gi
       requests:
         cpu: 100m
         memory: 512Mi
spec:
  <service-name>: # for example: pod-grouper
    resources:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 512Mi
spec:
  global: 
    replicaCount: 1 # default
spec:
  <service-name>: # for example: pod-grouper
    replicas: 1 # default
spec:  
  prometheus:
    spec: # PrometheusSpec
      retention: 2h # default 
      retentionSize: 20GB
spec:  
  prometheus:
    logLevel: info # debug | info | warn | error
    additionalAlertLabels:
      - env: prod # example
spec:   
  global:
     managedNodes:
       inclusionCriteria:
          nodeSelectorTerms:
          - matchExpressions:
            - key: nvidia.com/gpu.product  
              operator: Exists
spec:
  workload-controller:    
    s3FileSystemImage:
      name: goofys       
      registry: runai.jfrog.io/op-containers-prod      
      tag: 3.12.24    
    gitSyncImage:      
      name: git-sync      
      registry: registry.k8s.io     
      tag: v4.4.0

Environment

The name of the environment

Description

A description of the environment

Scope

The scope of this environment within the organizational tree. Click the name of the scope to view the organizational tree diagram

Image

The application or service to be run by the workload

Workload Architecture

This can be either standard for running workloads on a single node or distributed for running distributed workloads on multiple nodes

Tool(s)

The tools and connection types the environment exposes

Workload(s)

The list of existing workloads that use the environment

Workload types

The workload types that can use the environment (Workspace/ Training / Inference)

Template(s)

The list of workload templates that use this environment

Created by

The user who created the environment. By default NVIDIA Run:ai UI comes with preinstalled environments created by NVIDIA Run:ai

Creation time

The timestamp of when the environment was created

Last updated

The timestamp of when the environment was last updated

Cluster

The cluster with which the environment is associated

Tool name

The name of the tool or application the AI practitioner can set up within the environment. For more information, see Integrations.

Connection type

The method by which you can access and interact with the running workload. It's essentially the "doorway" through which you can reach and use the tools the workload provides (e.g., node port, external URL).

Workload

The workload that uses the environment

Type

The workload type (Workspace/Training/Inference)

Status

Represents the workload lifecycle. See the full list of workload statuses.

jupyter-lab / jupyter-scipy

jupyter/scipy-notebook

An interactive development environment for Jupyter notebooks, code, and data visualization

jupyter-tensorboard

gcr.io/run-ai-demo/jupyter-tensorboard

An integrated combination of the interactive Jupyter development environment and TensorFlow's visualization toolkit for monitoring and analyzing ML models

tensorboard / tensorboad-tensorflow

tensorflow/tensorflow:latest

A visualization toolkit for TensorFlow that helps users monitor and analyze ML models, displaying various metrics and model architecture

llm-server

runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0

A vLLM-based server that hosts and serves large language models for inference, enabling API-based access to AI models

chatbot-ui

runai.jfrog.io/core-llm/llm-app

A user interface for interacting with chat-based AI models, often used for testing and deploying chatbot applications

rstudio

rocker/rstudio:4

An integrated development environment (IDE) for R, commonly used for statistical computing and data analysis

vscode

ghcr.io/coder/code-server

A fast, lightweight code editor with powerful features like intelligent code completion, debugging, Git integration, and extensions, ideal for web development, data science, and more

gpt2

runai.jfrog.io/core-llm/quickstart-inference:gpt2-cpu

A package containing an inference server, GPT2 model and chat UI often used for quick demos

runai training submit a1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-a
runai submit a1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-a
curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \ 
--data '{
  "name": "a1",
  "projectId": "<PROJECT-ID>", 
  "clusterId": "<CLUSTER-UUID>",
  "spec": {
    "image":"runai.jfrog.io/demo/quickstart",
    "compute": {
      "gpuDevicesRequest": 1
    }
  }
}'
runai training submit a2 -i runai.jfrog.io/demo/quickstart -g 2 -p team-a
runai submit a2 -i runai.jfrog.io/demo/quickstart -g 2 -p team-a
curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \ 
--data '{
  "name": "a2",
  "projectId": "<PROJECT-ID>", 
  "clusterId": "<CLUSTER-UUID>",
  "spec": {
    "image":"runai.jfrog.io/demo/quickstart",
    "compute": {
      "gpuDevicesRequest": 2
    }
  }
}'
runai training submit b1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
runai submit b1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \
--data '{
  "name": "b1",
  "projectId": "<PROJECT-ID>", 
  "clusterId": "<CLUSTER-UUID>",
  "spec": {
    "image":"runai.jfrog.io/demo/quickstart",
    "compute": {
      "gpuDevicesRequest": 1
    }
  }
}'
~ runai workload list -A
Workload  Type      Status   Project  Running/Req.Pods  GPU Alloc.
────────────────────────────────────────────────────────────────────────────
a2       Training   Running   team-a        1/1           2.00
b1       Training   Running   team-b        1/1           1.00
a1       Training   Running   team-a        0/1           1.00
~ runai list -A
Workload  Type      Status   Project  Running/Req.Pods  GPU Alloc.
────────────────────────────────────────────────────────────────────────────
a2       Training   Running   team-a        1/1           2.00
b1       Training   Running   team-b        1/1           1.00
a1       Training   Running   team-a        0/1           1.00
curl --location 'https://<COMPANY-URL>/api/v1/workloads' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \
--data '' # <TOKEN> is the API access token obtained in Step 1.
runai training submit b2 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
runai submit b2 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \ 
--data '{
  "name": "b2",
  "projectId": "<PROJECT-ID>", 
  "clusterId": "<CLUSTER-UUID>",
  "spec": {
    "image":"runai.jfrog.io/demo/quickstart",
    "compute": {
      "gpuDevicesRequest": 1
    }
  }
}'
~ runai workload list -A
Workload  Type      Status   Project  Running/Req.Pods  GPU Alloc.
────────────────────────────────────────────────────────────────────────────
a2       Training   Running   team-a        1/1           2.00
b1       Training   Running   team-b        1/1           1.00
b2       Training   Running   team-b        1/1           1.00
a1       Training   Pending   team-a        0/1           1.00
~ runai list -A
Workload   Type     Status   Project  Running/Req.Pods  GPU Alloc.
────────────────────────────────────────────────────────────────────────────
a2       Training   Running   team-a        1/1           2.00
b1       Training   Running   team-b        1/1           1.00
b2       Training   Running   team-b        1/1           1.00
a1       Training   Pending   team-a        0/1           1.00
curl --location 'https://<COMPANY-URL>/api/v1/workloads' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \
--data '' # <TOKEN> is the API access token obtained in Step 1.

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Run the below --help command to obtain the login options and log in according to your setup:

runai login --help

Log in using the following command. You will be prompted to enter your username and password:

runai login

To use the API, you will need to obtain a token as shown in API authentication.
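For the API option, a token request looks like the following minimal sketch. It follows the application-token flow described in API authentication; the grantType value and the AppId/AppSecret field names should be verified against that guide:

# Request an API access token for an application (verify the field names against the API authentication guide)
curl --location --request POST 'https://<COMPANY-URL>/api/v1/token' \
--header 'Content-Type: application/json' \
--data '{
  "grantType": "app_token",
  "AppId": "<CLIENT-ID>",
  "AppSecret": "<CLIENT-SECRET>"
}'

The accessToken field in the response is the value to use as the Bearer <TOKEN> in subsequent API calls.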


Projects

This section explains the procedure to manage Projects.

Researchers submit AI workloads. To streamline resource allocation and prioritize work, NVIDIA Run:ai introduces the concept of Projects. Projects are the tool to implement resource allocation policies, as well as the segregation between different initiatives. A project may represent a team, an individual, or an initiative that shares resources or has a specific resource quota. Projects may be aggregated in NVIDIA Run:ai departments.

For example, you may have several people involved in a specific face-recognition initiative collaborating under one project named “face-recognition-2024”. Alternatively, you can have a project per person in your team, where each member receives their own quota.

Projects Table

The Projects table can be found under Organization in the NVIDIA Run:ai platform.

The Projects table provides a list of all projects defined for a specific cluster, and allows you to manage them. You can switch between clusters by selecting your cluster using the filter at the top.

The Projects table consists of the following columns:

Column
Description

Node Pools with Quota Associated with the Project

Click one of the values of Node pool(s) with quota column, to view the list of node pools and their parameters

Column
Description

Subjects Authorized for the Project

Click one of the values in the Subject(s) column, to view the list of subjects and their parameters. This column is only viewable, if your role in the NVIDIA Run:ai system affords you those permissions.

Column
Description

Workloads Associated with the Project

Click one of the values of Workload(s) column, to view the list of workloads and their parameters

Column
Description

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

Adding a New Project

To create a new Project:

  1. Click +NEW PROJECT

  2. Select a scope. You can only view clusters that you have permission to see, within the scope of the roles assigned to you

  3. Enter a name for the project. Project names must start with a letter and can only contain lowercase Latin letters, numbers, or a hyphen ('-')

  4. Set the namespace associated with the project. Each project has an associated (Kubernetes) namespace in the cluster. All workloads under this project use this namespace.

    1. By default, NVIDIA Run:ai creates a namespace based on the project name (in the form of runai-<name>)

    2. Alternatively, you can choose an existing namespace created for you by the cluster administrator

  5. In the Quota management section, you can set the quota parameters and prioritize resources

    • Order of priority - This column is displayed only if more than one node pool exists. It sets the default order in which the Scheduler uses node pools to schedule a workload: the Scheduler first tries to allocate resources using the highest priority node pool, then the next in priority, until it reaches the lowest priority node pool, and then starts from the highest again. The Scheduler uses the project's list of prioritized node pools only if the order of priority of node pools is not set in the workload during submission, either by an admin policy or by the user. An empty value means the node pool is not part of the project's default node pool priority list, but the node pool can still be chosen by an admin policy or a user during workload submission

    • Node pool - This column is displayed only if more than one node pool exists. It represents the name of the node pool

    • Under the QUOTA tab

      • Over-quota state - Indicates if over-quota is enabled or disabled as set in the SCHEDULING PREFERENCES tab. If over-quota is set to None, then it is disabled.

      • GPU devices - The number of GPUs you want to allocate for this project in this node pool (decimal number)

      • CPUs (Cores) - This column is displayed only if CPU quota is enabled via the General settings. Represents the number of CPU cores you want to allocate for this project in this node pool (decimal number).

      • CPU memory - This column is displayed only if CPU quota is enabled via the General settings. Represents the amount of CPU memory you want to allocate for this project in this node pool (in Megabytes or Gigabytes).

      • Under the SCHEDULING PREFERENCES tab

        • Project priority - Sets the project's scheduling priority compared to other projects in the same node pool, using one of the following priorities:

          • Highest - 255

          • VeryHigh - 240

          • High - 210

          • MediumHigh - 180

          • Medium - 150

          • MediumLow - 100

          • Low - 50

          • VeryLow - 20

          • Lowest - 1

          For v2.21, the default value is MediumLow. All projects are set with the same default value, therefore there is no change in scheduling behavior unless the Administrator changes any project priority values. To learn more about project priority, see The NVIDIA Run:ai Scheduler: concepts and principles.

        • Over-quota - If over-quota weight is enabled via the General settings, then Over-quota weight is presented; otherwise, Over-quota is presented

          • Over-quota - When enabled, the project can use non-guaranteed overage resources above its quota in this node pool. The amount of the non-guaranteed overage resources for this project is calculated proportionally to the project quota in this node pool. When disabled, the project cannot use more resources than the guaranteed quota in this node pool.

          • Over-quota weight - Represents a weight used to calculate the amount of non-guaranteed overage resources a project can get on top of its quota in this node pool. All unused resources are split between projects that require the use of overage resources:

            • Medium - The default value. The Administrator can change the default to any of the following values - High, Low, Lowest, or None.

            • Lowest - Over-quota weight 'Lowest' has a unique behavior since it can only use over-quota (unused overage) resources if no other project needs them. Any project with a higher over-quota weight can snap the overage resources at any time.

            • None - When set, the project cannot use more resources than the guaranteed quota in this node pool

          • Unlimited - CPU (Cores) and CPU memory quotas are an exception. In this case, workloads of subordinated projects can consume available resources up to the physical limitation of the cluster or any of the node pools.

        • Project max. GPU device allocation - Represents the maximum GPU device allocation the project can get from this node pool - the maximum sum of quota and over-quota GPUs (decimal number)

Note

Setting the quota to 0 (either GPU, CPU, or CPU memory) and the over quota to ‘disabled’ or over quota weight to ‘none’ means the project is blocked from using those resources on this node pool.

When no node pools are configured, you can set the same parameters for the whole project, instead of per node pool. After node pools are created, you can set the above parameters for each node-pool separately.

  1. Set as required.

  2. Click CREATE PROJECT

Adding an Access Rule to a Project

To create a new access rule for a project:

  1. Select the project you want to add an access rule for

  2. Click ACCESS RULES

  3. Click +ACCESS RULE

  4. Select a subject

  5. Select or enter the subject identifier:

    • User - Email for a local user created in NVIDIA Run:ai, or for an SSO user as recognized by the IdP

    • Group - Name as recognized by the IdP

    • Application - Name as created in NVIDIA Run:ai

  6. Select a role

  7. Click SAVE RULE

  8. Click CLOSE

Deleting an Access Rule from a Project

To delete an access rule from a project:

  1. Select the project you want to remove an access rule from

  2. Click ACCESS RULES

  3. Find the access rule you want to delete

  4. Click on the trash icon

  5. Click CLOSE

Editing a Project

To edit a project:

  1. Select the project you want to edit

  2. Click EDIT

  3. Update the Project and click SAVE

Viewing a Project’s Policy

To view the policy of a project:

  1. Select the project for which you want to view its policy. This option is only active for projects with defined policies in place.

  2. Click VIEW POLICY and select the workload type for which you want to view the policies:

    a. Workspace workload type policy with its set of rules

    b. Training workload type policy with its set of rules

  3. In the Policy form, view the workload rules that are enforcing your project for the selected workload type as well as the defaults:

    • Parameter - The workload submission parameter that Rules and Defaults are applied to

    • Type (applicable for data sources only) - The data source type (Git, S3, NFS, PVC, etc.)

    • Default - The default value of the Parameter

    • Rule - Set up constraints on workload policy fields

    • Source - The origin of the applied policy (cluster, department or project)

Note

The policy affecting the project consists of rules and defaults. Some of these rules and defaults may be derived from policies of a parent cluster and/or department (source). You can see the source of each rule in the policy form.

Deleting a Project

To delete a project:

  1. Select the project you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm

Note

Clusters < v2.20

Deleting a project does not delete its associated namespace, any of the workloads running in this namespace, or the policies defined for this project. However, any assets created in the scope of this project, such as compute resources, environments, data sources, templates and credentials, are permanently deleted from the system.

Clusters >=v2.20

Deleting a project does not delete its associated namespace, but will attempt to delete its associated workloads and assets. Any assets created in the scope of this project, such as compute resources, environments, data sources, templates and credentials, are permanently deleted from the system.

Using API

To view the available actions, go to the API reference.
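For example, the following is a minimal curl sketch for listing projects, which is also a convenient way to look up a project's ID. The org-unit path shown here is an assumption for recent API versions and should be verified against the API reference:

# List projects and their IDs (verify the exact path in the API reference)
curl --location 'https://<COMPANY-URL>/api/v1/org-unit/projects' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>'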

Project

The name of the project

Department

The name of the parent department. Several projects may be grouped under a department.

Status

The Project creation status. Projects are manifested as Kubernetes namespaces. The project status represents the Namespace creation status.

Node pool(s) with quota

The node pools associated with the project. By default, a new project is associated with all node pools within its associated cluster. Administrators can change the node pools’ quota parameters for a project. Click the values under this column to view the list of node pools with their parameters (as described below)

Subject(s)

The users, SSO groups, or applications with access to the project. Click the values under this column to view the list of subjects with their parameters (as described below). This column is only viewable if your role in the NVIDIA Run:ai platform allows you those permissions.

Allocated GPUs

The total number of GPUs allocated by successfully scheduled workloads under this project

GPU allocation ratio

The ratio of Allocated GPUs to GPU quota. This number reflects how well the project’s GPU quota is utilized by its descendent workloads. A number higher than 100% indicates the project is using over quota GPUs.

GPU quota

The GPU quota allocated to the project. This number represents the sum of all node pools’ GPU quota allocated to this project.

Allocated CPUs (Core)

The total number of CPU cores allocated by workloads submitted within this project. (This column is only available if the CPU Quota setting is enabled, as described below).

Allocated CPU Memory

The total amount of CPU memory allocated by successfully scheduled workloads under this project. (This column is only available if the CPU Quota setting is enabled, as described below).

CPU quota (Cores)

CPU quota allocated to this project. (This column is only available if the CPU Quota setting is enabled, as described below). This number represents the sum of all node pools’ CPU quota allocated to this project. The ‘unlimited’ value means the CPU (cores) quota is not bounded and workloads using this project can use as many CPU (cores) resources as they need (if available).

CPU memory quota

CPU memory quota allocated to this project. (This column is only available if the CPU Quota setting is enabled, as described below). This number represents the sum of all node pools’ CPU memory quota allocated to this project. The ‘unlimited’ value means the CPU memory quota is not bounded and workloads using this Project can use as much CPU memory resources as they need (if available).

CPU allocation ratio

The ratio of Allocated CPUs (cores) to CPU quota (cores). This number reflects how much the project’s ‘CPU quota’ is utilized by its descendent workloads. A number higher than 100% indicates the project is using over quota CPU cores.

CPU memory allocation ratio

The ratio of Allocated CPU memory to CPU memory quota. This number reflects how well the project’s ‘CPU memory quota’ is utilized by its descendent workloads. A number higher than 100% indicates the project is using over quota CPU memory.

Node affinity of training workloads

The list of NVIDIA Run:ai node-affinities. Any training workload submitted within this project must specify one of those NVIDIA Run:ai node affinities, otherwise it is not submitted.

Node affinity of interactive workloads

The list of NVIDIA Run:ai node-affinities. Any interactive (workspace) workload submitted within this project must specify one of those NVIDIA Run:ai node affinities, otherwise it is not submitted.

Idle time limit of training workloads

The time in days:hours:minutes after which the project stops a training workload not using its allocated GPU resources.

Idle time limit of preemptible workloads

The time in days:hours:minutes after which the project stops a preemptible interactive (workspace) workload not using its allocated GPU resources.

Idle time limit of non preemptible workloads

The time in days:hours:minutes after which the project stops a non-preemptible interactive (workspace) workload not using its allocated GPU resources.

Interactive workloads time limit

The duration in days:hours:minutes after which the project stops an interactive (workspace) workload

Training workloads time limit

The duration in days:hours:minutes after which the project stops a training workload

Creation time

The timestamp for when the project was created

Workload(s)

The list of workloads associated with the project. Click the values under this column to view the list of workloads with their resource parameters (as described below).

Cluster

The cluster that the project is associated with

Node pool

The name of the node pool is given by the administrator during node pool creation. All clusters have a default node pool created automatically by the system and named ‘default’.

GPU quota

The amount of GPU quota the administrator dedicated to the project for this node pool (floating number, e.g. 2.3 means 230% of GPU capacity).

CPU (Cores)

The amount of CPU (cores) quota the administrator has dedicated to the project for this node pool (floating number, e.g. 1.3 cores = 1300 milli-cores). The ‘unlimited’ value means the CPU (Cores) quota is not bounded and workloads using this node pool can use as many CPU (Cores) resources as they require (if available).

CPU memory

The amount of CPU memory quota the administrator has dedicated to the project for this node pool (floating number, in MB or GB). The ‘unlimited’ value means the CPU memory quota is not bounded and workloads using this node pool can use as much CPU memory resource as they need (if available).

Allocated GPUs

The actual amount of GPUs allocated by workloads using this node pool under this project. The number of allocated GPUs may temporarily surpass the GPU quota if over quota is used.

Allocated CPU (Cores)

The actual amount of CPUs (cores) allocated by workloads using this node pool under this project. The number of allocated CPUs (cores) may temporarily surpass the CPUs (Cores) quota if over quota is used.

Allocated CPU memory

The actual amount of CPU memory allocated by workloads using this node pool under this Project. The number of Allocated CPU memory may temporarily surpass the CPU memory quota if over quota is used.

Order of priority

The default order in which the Scheduler uses node-pools to schedule a workload. This is used only if the order of priority of node pools is not set in the workload during submission, either by an admin policy or the user. An empty value means the node pool is not part of the project’s default list, but can still be chosen by an admin policy or the user during workload submission

Subject

A user, SSO group, or application assigned with a role in the scope of this Project

Type

The type of subject assigned to the access rule (user, SSO group, or application)

Scope

The scope of this project in the organizational tree. Click the name of the scope to view the organizational tree diagram, you can only view the parts of the organizational tree for which you have permission to view.

Role

The role assigned to the subject, in this project’s scope

Authorized by

The user who granted the access rule

Last updated

The last time the access rule was updated

Workload

The name of the workload, given during its submission. Optionally, an icon describing the type of workload is also visible

Type

The type of the workload, e.g. Workspace, Training, Inference

Status

The state of the workload and time elapsed since the last status change

Created by

The subject that created this workload

Running/ requested pods

The number of running pods out of the number of requested pods for this workload. For example, a distributed workload requesting 4 pods may be in a state where only 2 are running and 2 are pending

Creation time

The date and time the workload was created

GPU compute request

The amount of GPU compute requested (floating number, represents either a portion of the GPU compute, or the number of whole GPUs requested)

GPU memory request

The amount of GPU memory requested (floating number, can either be presented as a portion of the GPU memory, an absolute memory size in MB or GB, or a MIG profile)

CPU memory request

The amount of CPU memory requested (floating number, presented as an absolute memory size in MB or GB)

CPU compute request

The amount of CPU compute requested (floating number, represents the number of requested Cores)


Workloads

This section explains the procedure for managing workloads.

Workloads Table

The Workloads table can be found under Workload manager in the NVIDIA Run:ai platform.

The workloads table provides a list of all the workloads scheduled on the NVIDIA Run:ai Scheduler, and allows you to manage them.

The Workloads table consists of the following columns:

Column
Description

Workload

The name of the workload

Type

The workload type

Preemptible

Whether the workload is preemptible (Yes/No)

Status

The different phases in a workload lifecycle

Project

The project in which the workload runs

Department

The department that the workload is associated with. This column is visible only if the department toggle is enabled by your administrator.

Created by

The user who created the workload

Running/requested pods

The number of running pods out of the requested

Creation time

The timestamp of when the workload was created

Completion time

The timestamp the workload reached a terminal state (failed/completed)

Connection(s)

The method by which you can access and interact with the running workload. It's essentially the "doorway" through which you can reach and use the tools the workload provides (e.g., node port, external URL). Click one of the values in the column to view the list of connections and their parameters.

Data source(s)

Data resources used by the workload

Environment

The environment used by the workload

Workload architecture

Standard or distributed. A standard workload consists of a single process. A distributed workload consists of multiple processes working together. These processes can run on different nodes.

GPU compute request

Amount of GPU devices requested

GPU compute allocation

Amount of GPU devices allocated

GPU memory request

Amount of GPU memory Requested

GPU memory allocation

Amount of GPU memory allocated

Idle GPU devices

The number of allocated GPU devices that have been idle for more than 5 minutes

CPU compute request

Amount of CPU cores requested

CPU compute allocation

Amount of CPU cores allocated

CPU memory request

Amount of CPU memory requested

CPU memory allocation

Amount of CPU memory allocated

Cluster

The cluster that the workload is associated with

Workload Status

The following table describes the different phases in a workload life cycle. The UI provides additional details for some of the below workload statuses which can be viewed by clicking the icon next to the status.

Status
Description
Entry Condition
Exit Condition

Creating

Workload setup is initiated in the cluster. Resources and pods are now provisioning.

A workload is submitted

A multi-pod group is created

Pending

Workload is queued and awaiting resource allocation

A pod group exists

All pods are scheduled

Initializing

Workload is retrieving images, starting containers, and preparing pods

All pods are scheduled

All pods are initialized or a failure to initialize is detected

Running

Workload is currently in progress with all pods operational

All pods initialized (all containers in pods are ready)

Workload completion or failure

Degraded

Pods may not align with specifications, network services might be incomplete, or persistent volumes may be detached. Check your logs for specific details.

  • Pending - All pods are running but have issues.

  • Running - All pods are running with no issues.

  • Running - All resources are OK.

  • Completed - Workload finished with fewer resources

  • Failed - Workload failure or user-defined rules.

Deleting

Workload and its associated resources are being decommissioned from the cluster

Deleting the workload

Resources are fully deleted

Stopped

Workload is on hold and resources are intact but inactive

Stopping the workload without deleting resources

Transitioning back to the initializing phase or proceeding to deleting the workload

Failed

Image retrieval failed or containers experienced a crash. Check your logs for specific details

An error occurs preventing the successful completion of the workload

Terminal state

Completed

Workload has successfully finished its execution

The workload has finished processing without errors

Terminal state

Pods Associated with the Workload

Click one of the values in the Running/requested pods column, to view the list of pods and their parameters.

Column
Description

Pod

Pod name

Status

Pod lifecycle stages

Node

The node on which the pod resides

Node pool

The node pool in which the pod resides (applicable if node pools are enabled)

Image

The pod’s main image

GPU compute allocation

Amount of GPU devices allocated for the pod

GPU memory allocation

Amount of GPU memory allocated for the pod

Connections Associated with the Workload

A connection refers to the method by which you can access and interact with the running workloads. It is essentially the "doorway" through which you can reach and use the applications (tools) these workloads provide.

Click one of the values in the Connection(s) column, to view the list of connections and their parameters. Connections are network interfaces that communicate with the application running in the workload. Connections are either the URL the application exposes or the IP and the port of the node that the workload is running on.

Column
Description

Name

The name of the application running on the workload

Connection type

The network connection type selected for the workload

Access

Who is authorized to use this connection (everyone, specific groups/users)

Address

The connection URL

Copy button

Copy URL to clipboard

Connect button

Enabled only for supported tools

Data Sources Associated with the Workload

Click one of the values in the Data source(s) column to view the list of data sources and their parameters.

Column
Description

Data source

The name of the data source mounted to the workload

Type

The data source type (e.g., Git, S3, NFS, PVC)

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then Click Download as CSV. Export to CSV is limited to 20,000 rows.

  • Refresh - Click REFRESH to update the table with the latest data

  • Show/Hide details - Click to view additional information on the selected row

Show/Hide Details

Click a row in the Workloads table and then click the SHOW DETAILS button at the upper-right side of the action bar. The details pane appears, presenting the following tabs:

Event History

Displays the workload status over time. It displays events describing the workload lifecycle and alerts on notable events. Use the filter to search through the history for specific events.

Metrics

  • GPU utilization - Per-GPU graphs and an average-of-all-GPUs graph, all on the same chart, over an adjustable time period, showing the trends of GPU compute utilization (percentage of GPU compute) for this workload.

  • GPU memory utilization - Per-GPU graphs and an average-of-all-GPUs graph, all on the same chart, over an adjustable time period, showing the trends of GPU memory usage (percentage of GPU memory) for this workload.

  • CPU compute utilization - The average compute utilization of all CPU cores in a single graph, over an adjustable time period, showing the trends of CPU compute utilization (percentage of CPU compute) for this workload.

  • CPU memory utilization - The utilization of all CPU memory in a single graph, over an adjustable time period, showing the trends of CPU memory utilization (percentage of CPU memory) for this workload.

  • CPU memory usage - The usage of all CPU memory in a single graph, over an adjustable time period, showing the trends of CPU memory usage (in GB or MB of CPU memory) for this workload.

  • For GPU charts - Click the GPU legend on the right-hand side of the chart to activate or deactivate any of the GPU lines.

  • You can click the date picker to change the presented period

  • You can use your mouse to mark a sub-period in the graph for zooming in, and use Reset zoom to go back to the preset period

  • Changes in the period affect all graphs on this screen.

Logs

Workload events are ordered in chronological order. The logs contain events from the workload’s lifecycle to help monitor and debug issues.

Adding a New Workload

Before starting, make sure you have created a project or have one created for you to work with workloads.

To create a new workload:

  1. Click +NEW WORKLOAD

  2. Select a workload type - Follow the links below to view the step-by-step guide for each workload type:

    • Workspace - Used for data preparation and model-building tasks.

    • Training - Used for standard training tasks of all sorts

    • Distributed Training - Used for distributed tasks of all sorts

    • Inference - Used for inference and serving tasks

    • Job (legacy). This type is displayed only if enabled by your Administrator, under General settings → Workloads → Workload policies

  3. Click CREATE WORKLOAD

Stopping a Workload

Stopping a workload kills the workload pods and releases the workload resources.

  1. Select the workload you want to stop

  2. Click STOP

Running a Workload

Running a workload spins up new pods and resumes the workload work after it was stopped.

  1. Select the workload you want to run again

  2. Click RUN

Connecting to a Workload

To connect to an application running in the workload (for example, Jupyter Notebook)

  1. Select the workload you want to connect

  2. Click CONNECT

  3. Select the tool from the drop-down list

  4. The selected tool is opened in a new tab on your browser

Copying a Workload

  1. Select the workload you want to copy

  2. Click MAKE A COPY

  3. Enter a name for the workload. The name must be unique.

  4. Update the workload and click CREATE WORKLOAD

Deleting a Workload

  1. Select the workload you want to delete

  2. Click DELETE

  3. On the dialog, click DELETE to confirm the deletion

Note

Once a workload is deleted you can view it in the Deleted tab in the workloads view. This tab is displayed only if enabled by your Administrator, under General settings → Workloads → Deleted workloads

Using API

Go to the Workloads API reference to view the available actions

Troubleshooting

To understand the condition of the workload, review the workload status in the Workloads table. For more information, check the workload’s event history.

Listed below are a number of known issues when working with workloads and how to fix them:

Issue
Mediation

Cluster connectivity issues (a "there are issues with your connection to the cluster" error message appears)

  • Verify that you are on a network that has been granted access to the cluster.

  • Reach out to your cluster admin for instructions on verifying this.

  • If you are an admin, see the troubleshooting scenarios section in the cluster documentation

Workload in “Initializing” status for some time

  • Check that you have access to the Container image registry.

  • Check the statuses of the pods associated with the workload.

  • Check the event history for more details

Workload has been pending for some time

  • Check that you have the required quota.

  • Check the project’s available quota in the project dialog.

  • Check that all services needed to run are bound to the workload.

  • Check the event history for more details.

PVCs created using the K8s API or kubectl are not visible or mountable in NVIDIA Run:ai

This is by design. To make such a PVC usable in NVIDIA Run:ai (a minimal kubectl example is shown after this table):

  1. Create a new data source of type PVC in the NVIDIA Run:ai UI

  2. In the Data mount section, select Existing PVC

  3. Select the PVC you created via the K8S API

You are now able to select and mount this PVC in your NVIDIA Run:ai submitted workloads.

Workload is not visible in the UI

  • Check that the workload hasn’t been deleted.

  • See the “Deleted” tab in the workloads view

Launching Workloads with GPU Memory Swap

This quick start provides a step-by-step walkthrough for running multiple LLMs (inference workloads) on a single GPU using GPU memory swap.

GPU memory swap expands the GPU physical memory to the CPU memory, allowing NVIDIA Run:ai to place and run more workloads on the same GPU physical hardware. This provides a smooth workload context switching between GPU memory and CPU memory, eliminating the need to kill workloads when the memory requirement is larger than what the GPU physical memory can provide.

Note

If enabled by your Administrator, the NVIDIA Run:ai UI allows you to create a new workload using either the Flexible or Original submission form. The steps in this quick start guide reflect the Original form only.

Prerequisites

Before you start, make sure:

  • You have created a project or have one created for you.

  • The project has an assigned quota of at least 1 GPU.

  • Dynamic GPU fractions is enabled.

  • GPU memory swap is enabled on at least one free node.

  • Host-based routing is configured.

Note

Dynamic GPU fractions is disabled by default in the NVIDIA Run:ai UI. To use dynamic GPU fractions, it must be enabled by your Administrator, under General Settings → Resources → GPU resource optimization.

Step 1: Logging In

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

To use the API, you will need to obtain a token as shown in API authentication.

Step 2: Submitting the First Inference Workload

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Inference

  3. Select under which cluster to create the workload

  4. Select the project in which your workload will run

  5. Select custom inference from Inference type

  6. Enter a name for the workload (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Create an environment for your workload

    • Click +NEW ENVIRONMENT

    • Enter a name for the environment. The name must be unique.

    • Enter the NVIDIA Run:ai vLLM Image URL - runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0

    • Set the runtime settings for the environment

      • Click +ENVIRONMENT VARIABLE and add the following

        • Name: RUNAI_MODEL Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct (you can choose any vLLM supporting model from Hugging Face)

        • Name: RUNAI_MODEL_NAME Source: Custom Value: Llama-3.2-1B-Instruct

        • Name: HF_TOKEN Source: Custom Value: <Your Hugging Face token> (only needed for gated models)

        • Name: VLLM_RPC_TIMEOUT Source: Custom Value: 60000

    • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  9. Create a new “request-limit” compute resource

    • Click +NEW COMPUTE RESOURCE

    • Enter a name for the compute resource. The name must be unique.

    • Set GPU devices per pod - 1

    • Set GPU memory per device

      • Select % (of device) - Fraction of a GPU device’s memory

      • Set the memory Request - 50 (the workload will allocate 50% of the GPU memory)

      • Toggle Limit and set to 100%

    • Optional: set the CPU compute per pod - 0.1 cores (default)

    • Optional: set the CPU memory per pod - 100 MB (default)

    • Select More settings and toggle Increase shared memory size

    • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  10. Click CREATE INFERENCE

Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the Inferences API:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 3: Submitting the Second Inference Workload

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Inference

  3. Select the cluster where the previous inference workload was created

  4. Select the project where the previous inference workload was created

  5. Select custom inference from Inference type

  6. Enter a name for the workload (if the name already exists in the project, you will be requested to submit a different name)

  7. Click CONTINUE

    In the next step:

  8. Select the environment created in Step 2

  9. Select the compute resource created in Step 2

  10. Click CREATE INFERENCE

Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the Inferences API:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 4: Submitting the First Workspace

  1. Go to the Workload manager → Workloads

  2. Click COLUMNS and select Connections

  3. Select the link under the Connections column for the first inference workload created in Step 2

  4. In the Connections Associated with Workload form, copy the URL under the Address column

  5. Click +NEW WORKLOAD and select Workspace

  6. Select the cluster where the previous inference workloads were created

  7. Select the project where the previous inference workloads were created

  8. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  9. Click CONTINUE

    In the next step:

  10. Select the ‘chatbot-ui’ environment for your workspace (Image URL: runai.jfrog.io/core-llm/llm-app)

    • Set the runtime settings for the environment with the following environment variables:

      • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

      • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Add the address link from Step 4

      • Delete the PATH_PREFIX environment variable if you are using host-based routing.

    • If ‘chatbot-ui’ is not displayed in the gallery, follow the below steps:

      • Click +NEW ENVIRONMENT

      • Enter a name for the environment. The name must be unique.

      • Enter the chatbot-ui Image URL - runai.jfrog.io/core-llm/llm-app

      • Tools - Set the connection for your tool

        • Click +TOOL

        • Select Chatbot UI tool from the list

      • Set the runtime settings for the environment

        • Click +ENVIRONMENT VARIABLE

        • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

        • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Add the address link from Step 4

        • Name: RUNAI_MODEL_TOKEN_LIMIT Source: Custom Value: 8192

        • Name: RUNAI_MODEL_MAX_LENGTH Source: Custom Value: 16384

      • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  11. Select the ‘cpu-only’ compute resource for your workspace

    • If ‘cpu-only’ is not displayed in the gallery, follow the below steps:

      • Click +NEW COMPUTE RESOURCE

      • Enter a name for the compute resource. The name must be unique.

      • Set GPU devices per pod - 0

      • Set CPU compute per pod - 0.1 cores

      • Set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  12. Click CREATE WORKSPACE

Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the Workspaces API:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

  • <URL> - The URL for connecting an external service related to the workload. You can get the URL via the List Workloads API.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 5: Submitting the Second Workspace

  1. Go to the Workload manager → Workloads

  2. Click COLUMNS and select Connections

  3. Select the link under the Connections column for the second inference workload created in Step 3

  4. In the Connections Associated with Workload form, copy the URL under the Address column

  5. Click +NEW WORKLOAD and select Workspace

  6. Select the cluster where the previous inference workloads were created

  7. Select the project where the previous inference workloads were created

  8. Enter a name for the workspace (if the name already exists in the project, you will be requested to submit a different name)

  9. Click CONTINUE

    In the next step:

  10. Select the ‘chatbot-ui’ environment created in Step 4

    • Set the runtime settings for the environment with the following environment variables:

      • Name: RUNAI_MODEL_NAME Source: Custom Value: meta-llama/Llama-3.2-1B-Instruct

      • Name: RUNAI_MODEL_BASE_URL Source: Custom Value: Add the address link from Step 4

      • Delete the PATH_PREFIX environment variable if you are using host-based routing.

  11. Select the ‘cpu-only’ compute resource created in Step 4

  12. Click CREATE WORKSPACE

Copy the following command to your terminal. Make sure to update the parameters below. For more details, see the Workspaces API:

  • <COMPANY-URL> - The link to the NVIDIA Run:ai user interface

  • <TOKEN> - The API access token obtained in Step 1

  • <PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.

  • <CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

  • <URL> - The URL for connecting an external service related to the workload. You can get the URL via the List Workloads API.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Step 6: Connecting to ChatbotUI

  1. Select the newly created workspace that you want to connect to

  2. Click CONNECT

  3. Select the ChatbotUI tool. The selected tool is opened in a new tab on your browser.

  4. Query both workspaces simultaneously and see them both respond. The workspace currently swapped out to CPU RAM will take longer to respond while it is swapped back to the GPU, and vice versa.

  1. To connect to the ChatbotUI tool, browse directly to https://<COMPANY-URL>/<PROJECT-NAME>/<WORKLOAD-NAME>

  2. Query both workspaces simultaneously and see them both respond. The workspace currently swapped out to CPU RAM will take longer to respond while it is swapped back to the GPU, and vice versa.

Next Steps

Manage and monitor your newly created workloads using the Workloads table.

Running Workspaces

This section explains how to create a workspace via the NVIDIA Run:ai UI.

A workspace contains the setup and configuration needed for building your model, including the container, images, data sets, and resource requests, as well as the required tools for the research, all in a single place.

To learn more about the workspace workload type in NVIDIA Run:ai and determine whether it is the most suitable workload type for your goals, see Workload types.

Before You start

Make sure you have created a project or have one created for you.

Note

  • Flexible workload submission – Disabled by default. If unavailable, your Administrator must enable it under General Settings → Workloads → Flexible Workload Submission.

  • GPU memory limit – Disabled by default. If unavailable, your Administrator must enable it under General Settings → Resources → GPU Resource Optimization.

  • Tolerations – Disabled by default. If unavailable, your Administrator must enable it under General Settings → Workloads → Tolerations.

  • Data volumes – Disabled by default. If unavailable, your Administrator must enable it under General Settings → Workloads → Data volumes. Data volumes are available for flexible workload submission only.

Workload Priority

By default, workspaces in NVIDIA Run:ai are assigned the build priority, which is non-preemptible. If needed, you can override this default and set the priority to interactive-preemptible. For more details, see Workload priority control.

Workload Policies

When creating a new workload, fields and assets may have limitations or defaults. These rules and defaults are derived from a policy your administrator set.

Policies allow you to control, standardize, and simplify the workload submission process. For additional information, see Policies and rules.

The effects of the policy are reflected in the workspace creation form:

  • Defaults derived from the policy will be displayed automatically for specific fields.

  • Disabled actions and permitted value ranges for values will be visibly explained per field.

  • Rules and defaults for entire sections (such as environments, compute resources, or data sources) may prevent selection and will appear on the entire library card with an option for additional information via an external modal.

Submission Form Options

You can create a new workspace using either the Flexible or Original submission form. The Flexible submission form offers greater customization and is the recommended method. Within the Flexible form, you have two options:

  • Load from an existing setup - You can select an existing setup to populate the workspace form with predefined values. While the Original submission form also allows you to select an existing setup, with the Flexible submission you can customize any of the populated fields for a one-time configuration. These changes will apply only to this workspace and will not modify the original setup. If needed, you can reset the configuration to the original setup at any time.

  • Provide your own settings - Manually fill in the workspace configuration fields. This is a one-time setup that applies only to the current workspace and will not be saved for future use.

Note

The Original submission form will be deprecated in a future release.

Creating a New Workspace

  1. To add a new workspace, go to Workload manager → Workloads.

  2. Click +NEW WORKLOAD and select Workspace from the drop-down menu.

  3. Within the new workspace form, select the cluster and project. To create a new project, click +NEW PROJECT and refer to Projects for a step-by-step guide.

  4. Select a preconfigured template or select Start from scratch to launch a new workspace quickly.

  5. Enter a unique name for the workspace. If the name already exists in the project, you will be requested to submit a different name.

  6. Under Submission, select Flexible or Original and click CONTINUE.

Setting Up an Environment

Load from existing setup

  1. Click the load icon. A side pane appears, displaying a list of available environments. Select an environment from the list.

  2. Optionally, customize any of the environment’s predefined fields as shown below. The changes will apply to this workspace only and will not affect the selected environment.

  3. Alternatively, click the ➕ icon in the side pane to create a new environment. For step-by-step instructions, see Environments.

Provide your own settings

Manually configure the settings below as needed. The changes will apply to this workspace only.

Configure environment

  1. Add the Image URL or update the URL of the existing setup.

  2. Set the condition for pulling the image by selecting the image pull policy. It is recommended to pull the image only if it's not already present on the host.

  3. Set the connection for your tool(s). If you are loading from existing setup, the tools are configured as part of the environment.

    • Select the connection type - External URL or NodePort:

      • Auto generate - A unique URL / port is automatically created for each workload using the environment.

      • Custom URL / Custom port - Manually define the URL or port. For custom port, make sure to enter a port between 30000 and 32767. If the node port is already in use, the workload will fail and display an error message.

    • Modify who can access the tool:

      • By default, All authenticated users is selected giving access to everyone within the organization’s account.

      • For Specific group(s), enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access to the tool.

      • For Specific user(s), enter a valid email address or username. If you remove yourself, you will lose access to the tool.

  4. Set the command and arguments for the container running the workspace. If no command is added, the container will use the image’s default command (entry-point):

    • Modify the existing command or click +COMMAND & ARGUMENTS to add a new command.

    • Set multiple arguments separated by spaces, using the following format (e.g.: --arg1=val1).

  5. Set the environment variable(s):

    • Modify the existing environment variable(s) if you are loading from an existing setup. The existing environment variables may include instructions to guide you with entering the correct values.

    • To add a new variable, click + ENVIRONMENT VARIABLE.

    • You can either select Custom to define your own variable, or choose from a predefined list of Secrets or ConfigMaps.

  6. Enter a path pointing to the container's working directory.

  7. Set where the UID, GID, and supplementary groups for the container should be taken from. If you select Custom, you’ll need to manually enter the UID, GID and Supplementary groups values.

  8. Select additional Linux capabilities for the container from the drop-down menu. This grants certain privileges to a container without granting all the root user's privileges.

  1. Select an environment or click +NEW ENVIRONMENT to add a new environment to the gallery. For a step-by-step guide on adding environments to the gallery, see Environments. Once created, the new environment will be automatically selected.

  2. Set the connection for your tool(s). If you are loading from existing setup, the tools are configured as part of the environment.

    • Select the connection type - External URL or NodePort:

      • Auto generate - A unique URL / port is automatically created for each workload using the environment.

      • Custom URL / Custom port - Manually define the URL or port. For custom port, make sure to enter a port between 30000 and 32767. If the node port is already in use, the workload will fail and display an error message.

    • Optional: Modify who can access the tool:

      • By default, All authenticated users is selected giving access to everyone within the organization’s account.

      • For Specific group(s), enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access to the tool.

      • For Specific user(s), enter a valid email address or username. If you remove yourself, you will lose access to the tool.

    • Set the User ID (UID), Group ID (GID) and the Supplementary groups that can run commands in the container.

  3. Optional: Set the command and arguments for the container running the workload. If no command is added, the container will use the image’s default command (entry-point):

    • Modify the existing command or click +COMMAND & ARGUMENTS to add a new command.

    • Set multiple arguments separated by spaces, using the following format (e.g.: --arg1=val1).

  4. Set the environment variable(s):

    • Modify the existing environment variable(s). The existing environment variables may include instructions to guide you in entering the correct values.

    • Optional: To add a new variable, click + ENVIRONMENT VARIABLE.

    • You can either select Custom to define your own variable, or choose from a predefined list of Credentials or ConfigMaps.

Setting Up Compute Resources

Load from existing setup

  1. Click the load icon. A side pane appears, displaying a list of available compute resources. Select a compute resource from the list.

  2. Optionally, customize any of the compute resource's predefined fields. The changes will apply to this workspace only and will not affect the selected compute resource.

  3. Alternatively, click the ➕ icon in the side pane to create a new compute resource. For step-by-step instructions, see Compute resources.

Provide your own settings

Manually configure the settings below as needed. The changes will apply to this workspace only.

Configure compute resources

  1. Set the number of GPU devices per pod (physical GPUs).

  2. Set the GPU memory per device using either a fraction of a GPU device’s memory (% of device) or a GPU memory unit (MB/GB):

    • Request - The minimum GPU memory allocated per device. Each pod in the workspace receives at least this amount per device it uses.

    • Limit - The maximum GPU memory allocated per device. Each pod in the workspace receives at most this amount of GPU memory for each device the pod utilizes. This is disabled by default; to enable it, see Before You start.

  3. Set the CPU compute per pod by choosing the unit (cores or millicores):

    • Request - The minimum amount of CPU compute provisioned per pod. Each running pod receives this amount of CPU compute.

    • Limit - The maximum amount of CPU compute a pod can use. Each pod receives at most this amount of CPU compute. By default, the limit is set to Unlimited which means that the pod may consume all the node's free CPU compute resources.

  4. Set the CPU memory per pod by selecting the unit (MB or GB):

    • Request - The minimum amount of CPU memory provisioned per pod. Each running pod receives this amount of CPU memory.

    • Limit - The maximum amount of CPU memory a pod can use. Each pod receives at most this amount of CPU memory. By default, the limit is set to Unlimited which means that the pod may consume all the node's free CPU memory resources.

  5. Set extended resource(s):

    • Enable Increase shared memory size to allow the shared memory size available to the pod to increase from the default 64MB to the node's total available memory or the CPU memory limit, if set above.

    • Click +EXTENDED RESOURCES to add resource/quantity pairs. For more information on how to set extended resources, see the Extended resources and Quantity guides.

  6. Set the order of priority for the node pools on which the Scheduler tries to run the workspace. When a workspace is created, the Scheduler will try to run it on the first node pool on the list. If the node pool doesn't have free resources, the Scheduler will move on to the next one until it finds one that is available:

    • Drag and drop them to change the order, remove unwanted ones, or reset to the default order defined in the project.

    • Click +NODE POOL to add a new node pool from the list of node pools that were defined on the cluster. To configure a new node pool and for additional information, see Node pools.

  7. Select a node affinity to schedule the workspace on a specific node type. If the administrator added a ‘node type (affinity)’ scheduling rule to the project/department, this field is mandatory. Otherwise, entering a node type (affinity) is optional. Nodes must be tagged with a label that matches the node type key and value.

  8. Click +TOLERATION to allow the workspace to be scheduled on a node with a matching taint. Select the operator and the effect:

    • If you select Exists, the effect will be applied if the key exists on the node.

    • If you select Equals, the effect will be applied if the key and the value set match the value on the node.

  1. Select a compute resource or click +NEW COMPUTE RESOURCE to add a new compute resource to the gallery. For a step-by-step guide on adding compute resources to the gallery, see Compute resources. Once created, the new compute resource will be automatically selected.

  2. Optional: Set the order of priority for the node pools on which the Scheduler tries to run the workload. When a workload is created, the scheduler will try to run it on the first node pool on the list. If the node pool doesn't have free resources, the Scheduler will move on to the next one until it finds one that is available.

    • Drag and drop them to change the order, remove unwanted ones, or reset to the default order defined in the project.

    • Click +NODE POOL to add a new node pool from the list of node pools that were defined on the cluster. To configure a new node pool and for additional information, see Node pools.

  3. Select a node affinity to schedule the workload on a specific node type. If the administrator added a ‘node type (affinity)’ scheduling rule to the project/department, this field is mandatory. Otherwise, entering a node type (affinity) is optional. Nodes must be tagged with a label that matches the node type key and value.

  4. Optional: Click +TOLERATION to allow the workload to be scheduled on a node with a matching taint. Select the operator and the effect.

    • If you select Exists, the effect will be applied if the key exists on the node.

    • If you select Equals, the effect will be applied if the key and the value set match the value on the node.

Setting Up Data & Storage

Note

  • Flexible - If Data volumes are not enabled, Data & storage appears as Data sources only, and no data volumes will be available. To enable Data volumes, see Before You start.

  • Original - This tab outlines how to set Volumes and Data sources.

Load from existing setup

  1. Click the load icon. A side pane appears, displaying a list of available data sources/volumes. Select a data source/volume from the list.

  2. Optionally, customize any of the data source's predefined fields as shown below. The changes will apply to this workspace only and will not affect the selected data source.

  3. Alternatively, click the ➕ icon in the side pane to create a new data source/data volume. For step-by-step instructions, see Data sources or Data volumes.

Provide your own settings

Manually configure the settings below as needed. The changes will apply to this workspace only.

Note: Secrets, ConfigMaps and Data volumes cannot be added as a one-time configuration.

Configure data sources

  1. Click the ➕ icon and choose the data source from the drop-down menu. You can add multiple data sources.

  2. Once selected, set the data origin according to the required fields and enter the container path to set the data target location. For Git and S3, select Secret. This option is relevant for private buckets/repositories based on existing secrets that were created for the scope.

  3. Select Volume to allocate a storage space to your workspace that is persistent across restarts:

    • Set the Storage class to None or select an existing storage class from the list. To add new storage classes, and for additional information, see Kubernetes storage classes. If the administrator defined the storage class configuration, the rest of the fields will appear accordingly.

    • Select one or more access mode(s) and define the claim size and its units.

    • Select the volume mode. If you select Filesystem (default), the volume will be mounted as a filesystem, enabling the usage of directories and files. If you select Block, the volume is exposed as a block storage, which can be formatted or used directly by applications without a filesystem.

    • Set the Container path with the volume target location.

    • Set the volume persistency to Persistent if the volume and its data should be deleted when the workspace is deleted or Ephemeral if the volume and its data should be deleted every time the workspace’s status changes to “Stopped”.

  1. Optional: Click +VOLUME to set the volume needed for your workload. A volume allocates storage space to your workload that is persistent across restarts:

    • Set the Storage class to None or select an existing storage class from the list. To add new storage classes, and for additional information, see Kubernetes storage classes. If the administrator defined the storage class configuration, the rest of the fields will appear accordingly.

    • Select one or more access mode(s) and define the claim size and its units.

    • Select the volume mode. If you select Filesystem (default), the volume will be mounted as a filesystem, enabling the usage of directories and files. If you select Block, the volume is exposed as a block storage, which can be formatted or used directly by applications without a filesystem.

    • Set the Container path with the volume target location.

    • Set the volume persistency to Persistent if the volume and its data should be deleted when the workload is deleted or Ephemeral if the volume and its data should be deleted every time the workload’s status changes to “Stopped”.

  2. Optional: Select an existing data source. Modify the data target location if needed.

  3. To add a new data source, click + NEW DATA SOURCE. For a step-by-step guide, see Data sources. Once created, it will be automatically selected.

Note: If there are connectivity issues with the cluster or problems during data source creation, the data source may not appear in the list.

Setting Up General Settings

Note

The following general settings are optional.

  1. Allow the workload to exceed the project quota. Workloads running over quota may be preempted and stopped at any time.

  2. Set the backoff limit before workload failure. The backoff limit is the maximum number of retry attempts for failed workloads. After reaching the limit, the workload status will change to "Failed." Enter a value between 1 and 100.

  3. Set the timeframe for auto-deletion after workload completion or failure. The time after which a completed or failed workload is deleted; if this field is set to 0 seconds, the workload will be deleted automatically.

  4. Set annotation(s). Kubernetes annotations are key-value pairs attached to the workload. They are used for storing additional descriptive metadata to enable documentation, monitoring and automation.

  5. Set label(s). Kubernetes labels are key-value pairs attached to the workload. They are used for categorizing to enable querying.

Completing the Workspace

  1. Before finalizing your workspace, review your configurations and make any necessary adjustments.

  2. Click CREATE WORKSPACE

Managing and Monitoring

After the workspace is created, it is added to the Workloads table, where it can be managed and monitored.

Using CLI

To view the available actions on workspaces, see the Workspaces CLI v2 reference or the CLI v1 reference.

Using API

To view the available actions on workspaces, see the Workspaces API reference.

curl -L 'https://<COMPANY-URL>/api/v1/workloads/inferences' \ 
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{ 
    "name": "workload-name", 
    "useGivenNameAsPrefix": true,
    "projectId": "<PROJECT-ID>", 
    "clusterId": "<CLUSTER-UUID>", 
    "spec": {
        "image": "runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0",
        "imagePullPolicy":"IfNotPresent",
        "environmentVariables": [
          {
            "name": "RUNAI_MODEL",
            "value": "meta-lama/Llama-3.2-1B-Instruct"
          },
          {
            "name": "VLLM_RPC_TIMEOUT",
            "value": "60000"
          },
          {
            "name": "HF_TOKEN",
            "value":"<INSERT HUGGINGFACE TOKEN>"
          }
        ],
        "compute": {
            "gpuDevicesRequest": 1,
            "gpuRequestType": "portion",
            "gpuPortionRequest": 0.1,
            "gpuPortionLimit": 1,
            "cpuCoreRequest":0.2,
            "cpuMemoryRequest": "200M",
            "largeShmRequest": false

        },
        "servingPort": {
            "container": 8000,
            "protocol": "http",
            "authorizationType": "public"
        }
    }
}'
curl -L 'https://<COMPANY-URL>/api/v1/workloads/inferences' \ 
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \ 
-d '{ 
    "name": "workload-name", 
    "useGivenNameAsPrefix": true,
    "projectId": "<PROJECT-ID>",  
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "image": "runai.jfrog.io/core-llm/runai-vllm:v0.6.4-0.10.0",
        "imagePullPolicy":"IfNotPresent",
        "environmentVariables": [
          {
            "name": "RUNAI_MODEL",
            "value": "meta-lama/Llama-3.2-1B-Instruct"
          },
          {
            "name": "VLLM_RPC_TIMEOUT",
            "value": "60000"
          },
          {
            "name": "HF_TOKEN",
            "value":"<INSERT HUGGINGFACE TOKEN>"
          }
        ],
        "compute": {
            "gpuDevicesRequest": 1,
            "gpuRequestType": "portion",
            "gpuPortionRequest": 0.1,
            "gpuPortionLimit": 1,
            "cpuCoreRequest":0.2,
            "cpuMemoryRequest": "200M",
            "largeShmRequest": false

        },
        "servingPort": {
            "container": 8000,
            "protocol": "http",
            "authorizationType": "public"
        }
    }
}'
curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \ 
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \ 
-d '{ 
    "name": "workload-name", 
    "projectId": "<PROJECT-ID>", 
    "clusterId": "<CLUSTER-UUID>",
    "spec": {  
        "image": "runai.jfrog.io/core-llm/llm-app",
        "environmentVariables": [
          {
            "name": "RUNAI_MODEL_NAME",
            "value": "meta-llama/Llama-3.2-1B-Instruct"
          },
          {
            "name": "RUNAI_MODEL_BASE_URL",
            "value": "<URL>" 
          }
        ],
        "compute": {
            "cpuCoreRequest":0.1,
            "cpuMemoryRequest": "100M",
        }
    }
}'
curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \ 
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \ 
-d '{ 
    "name": "workload-name", 
    "projectId": "<PROJECT-ID>", '\ 
    "clusterId": "<CLUSTER-UUID>", \ 
    "spec": {  
        "image": "runai.jfrog.io/core-llm/llm-app",
        "environmentVariables": [
          {
            "name": "RUNAI_MODEL_NAME",
            "value": "meta-llama/Llama-3.2-1B-Instruct"
          },
          {
            "name": "RUNAI_MODEL_BASE_URL",
            "value": "<URL>" 
          }
        ],
        "compute": {
            "cpuCoreRequest":0.1,
            "cpuMemoryRequest": "100M",
        }
    }
}'

Data Sources

This section explains what data sources are and how to create and use them.

Data sources are a type of workload assets and represent a location where data is actually stored. They may represent a remote data location, such as NFS, Git, or S3, or a Kubernetes local resource, such as PVC, ConfigMap, HostPath, or Secret.

This configuration simplifies the mapping of the data into the workload’s file system and handles the mounting process during workload creation for reading and writing. These data sources are reusable and can be easily integrated and used by AI practitioners while submitting workloads across various scopes.

Data Sources Table

The data sources table can be found under Workload manager in the NVIDIA Run:ai platform.

The data sources table provides a list of all the data sources defined in the platform and allows you to manage them.

Note

Data & storage - with Data sources and Data volumes - is visible only if your Administrator has enabled Data volumes.

The data sources table comprises the following columns:

Column
Description

Data source

The name of the data source

Description

A description of the data source

Type

The type of data source connected – e.g., S3 bucket, PVC, or others

Status

The different lifecycle and representation of the data source condition

Scope

The scope of the data source within the organizational tree. Click the scope name to view the organizational tree diagram

Kubernetes name

The unique Kubernetes name of the data source as it appears in the cluster

Workload(s)

The list of existing workloads that use the data source

Template(s)

The list of workload templates that use the data source

Created by

The user who created the data source

Creation time

The timestamp for when the data source was created

Cluster

The cluster that the data source is associated with

Data Sources Status

The following table describes the data sources' condition and whether they were created successfully for the selected scope.

Status
Description

No issues found

No issues were found while creating the data source

Issues found

Issues were found while propagating the data source credentials

Issues found

The data source couldn’t be created at the cluster

Creating…

The data source is being created

No status / “-”

When the data source’s scope is an account, the current version of the cluster is not up to date, or the asset is not a cluster-syncing entity, the status can’t be displayed

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click ‘Download as CSV’. Export to CSV is limited to 20,000 rows.

  • Refresh - Click REFRESH to update the table with the latest data

Adding a New Data Source

To create a new data source:

  1. Click +NEW DATA SOURCE

  2. Select the data source type from the list. Follow the step-by-step guide for each data source type:

NFS

A Network File System (NFS) is a Kubernetes concept used for sharing storage in the cluster among different pods. Like a PVC, the NFS volume’s content remains preserved, even outside the lifecycle of a single pod. However, unlike PVCs, which abstract storage management, NFS provides a method for network-based file sharing. The NFS volume can be pre-populated with data and can be mounted by multiple pod writers simultaneously. At NVIDIA Run:ai, an NFS-type data source is an abstraction that is mapped directly to a Kubernetes NFS volume. This integration allows multiple workloads under various scopes to mount and present the NFS data source.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Set the data origin

    • Enter the NFS server (host name or host IP)

    • Enter the NFS path

  6. Set the data target location

    • Container path

  7. Optional: Restrictions

    • Prevent data modification - When enabled, the data will be mounted with read-only permissions

  8. Click CREATE DATA SOURCE

PVC

A Persistent Volume Claim (PVC) is a Kubernetes concept used for managing storage in the cluster, which can be provisioned by an administrator or dynamically by Kubernetes using a StorageClass. PVCs allow users to request specific sizes and access modes (read/write once, read-only many). NVIDIA Run:ai ensures that data remains consistent and accessible across various scopes and workloads, beyond the lifecycle of individual pods, which is efficient while working with large datasets typically associated with AI projects.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Select PVC:

    • Existing PVC

      This option is relevant when the purpose is to create a PVC-type data source based on an existing PVC in the cluster

      • Select a PVC from the list - (The list is empty if no existing PVCs were created in advance)

    • New PVC - creates a new PVC in the cluster. New PVCs are not added to the Existing PVCs list.

      When creating a PVC-type data source and selecting the ‘New PVC’ option, the PVC is immediately created in the cluster (even if no workload has requested this PVC).

  6. Select the storage class

    • None - Proceed without defining a storage class

    • Custom storage class - This option applies when selecting a storage class based on existing storage classes.

      To add new storage classes to the storage class list, and for additional information, check Kubernetes storage classes

  7. Select the access mode(s) (multiple modes can be selected)

    • Read-write by one node - The volume can be mounted as read-write by a single node.

    • Read-only by many nodes - The volume can be mounted as read-only by many nodes.

    • Read-write by many nodes - The volume can be mounted as read-write by many nodes.

  8. Set the claim size and its units

  9. Select the volume mode

    1. File system (default) - allows the volume to be mounted as a filesystem, enabling the usage of directories and files.

    2. Block - exposes the volume as a block storage, which can be formatted or used by applications directly without a filesystem.

  10. Set the data target location

    • container path

  11. Optional: Prevent data modification - When enabled, the data will be mounted with read-only permission.

  12. Click CREATE DATA SOURCE

After the data source is created, check its status to monitor its proper creation across the selected scope.

S3 Bucket

The S3 bucket data source enables the mapping of a remote S3 bucket into the workload’s file system. Similar to a PVC, this mapping remains accessible across different workload executions, extending beyond the lifecycle of individual pods. However, unlike PVCs, data stored in an S3 bucket resides remotely, which may lead to decreased performance during the execution of heavy machine learning workloads. As part of the NVIDIA Run:ai connection to the S3 bucket, you can create credentials in order to access and map private buckets.

Note

S3 data sources are not supported for custom inference workloads.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Set the data origin

    • Set the S3 service URL

    • Select the credential

      • None - for public buckets

      • Credential names - This option is relevant for private buckets based on existing credentials that were created for the scope.

        To add new credentials to the credentials list, and for additional information, check the Credentials article.

    • Enter the bucket name

  6. Set the data target location

    • container path

  7. Click CREATE DATA SOURCE

After a private data source is created, check its status to monitor its proper creation across the selected scope.

Git

A Git-type data source is an NVIDIA Run:ai integration that enables code to be copied from a Git branch into a dedicated folder in the container. It is mainly used to provide the workload with the latest code repository. As part of the integration with Git, in order to access private repositories, you can add predefined credentials to the data source mapping.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Set the data origin

    • Set the Repository URL

    • Set the Revision (branch, tag, or hash) - If left empty, it will use the 'HEAD' (latest)

    • Select the credential

      • None - for public repositories

      • Credential names - This option applies to private repositories based on existing credentials that were created for the scope.

        To add new credentials to the credentials list, and for additional information, check the Credentials article.

  6. Set the data target location

    • container path

  7. Click CREATE DATA SOURCE

After a private data source is created, check its status to monitor its proper creation across the selected scope.

Host path

A Host path volume is a Kubernetes concept that enables mounting a host path file or a directory on the workload’s file system. Like a PVC, the host path volume’s data persists across workloads under various scopes. It also enables data serving from the hosting node.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Set the data origin

    • host path

  6. Set the data target location

    • container path

  7. Optional: Prevent data modification - When enabled, the data will be mounted with read-only permissions.

  8. Click CREATE DATA SOURCE

ConfigMap

A ConfigMap data source is an NVIDIA Run:ai abstraction for the Kubernetes ConfigMap concept. The ConfigMap is used mainly for storage that can be mounted on the workload container for non-confidential data. It is usually represented in key-value pairs (e.g., environment variables, command-line arguments etc.). It allows you to decouple environment-specific system configurations from your container images, so that your applications are easily portable. ConfigMaps must be created on the cluster prior to being used within the NVIDIA Run:ai system.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Set the data origin

    • Select the ConfigMap name (The list is empty if no existing ConfigMaps were created in advance).

  6. Set the data target location

    • container path

  7. Click CREATE DATA SOURCE

Secret

A secret-type data source enables the mapping of a credential into the workload’s file system. Credentials are a workload asset that simplify the complexities of Kubernetes Secrets. The credentials mask sensitive access information, such as passwords, tokens, and access keys, which are necessary for gaining access to various resources.

  1. Select the cluster under which to create this data source

  2. Select a scope

  3. Enter a name for the data source. The name must be unique.

  4. Optional: Provide a description of the data source

  5. Set the data origin

    • Select the credential

      To add new credentials, and for additional information, check the Credentials article.

  6. Set the data target location

    • container path

  7. Click CREATE DATA SOURCE

After the data source is created, check its status to monitor its proper creation across the selected scope.

Note

It is also possible to add data sources directly when creating a specific workspace, training or inference workload.

Copying a Data Source

To copy an existing data source:

  1. Select the data source you want to copy

  2. Click MAKE A COPY

  3. Enter a name for the data source. The name must be unique.

  4. Update the data source and click CREATE DATA SOURCE

Renaming a Data Source

To rename an existing data source:

  1. Select the data source you want to rename

  2. Click Rename and edit the name/description

Deleting a Data Source

To delete a data source:

  1. Select the data source you want to delete

  2. Click DELETE

  3. Confirm you want to delete the data source

Note

It is not possible to delete a data source being used by an existing workload or template.

Creating PVCs in Advance

Add PVCs in advance to be used when creating a PVC-type data source via the NVIDIA Run:ai UI.

The actions taken by the admin depend on the scope (cluster, department, or project) intended for the PVC-type data source. Follow the steps below for each required scope:

Cluster Scope

  1. Locate the PVC in the NVIDIA Run:ai namespace (runai)

  2. Provide NVIDIA Run:ai with visibility and authorization to share the PVC with your selected scope by adding the following label: run.ai/cluster-wide: "true"

The PVC is now displayed for that scope in the list of existing PVCs.
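
For example, a minimal kubectl sketch, assuming an existing PVC named my-pvc (a placeholder) in the runai namespace:

# Allow NVIDIA Run:ai to share this PVC cluster-wide
kubectl label pvc my-pvc -n runai run.ai/cluster-wide="true"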

Note

This step is also relevant for creating the data source of type PVC via API

Department Scope

  1. Locate the PVC in the NVIDIA Run:ai namespace (runai)

  2. To authorize NVIDIA Run:ai to use the PVC, label it: run.ai/department: "<department-id>"

The PVC is now displayed for that scope in the list of existing PVCs.

Project Scope

Locate the PVC in the project’s namespace.

The PVC is now displayed for that scope in the list of existing PVCs.

Creating ConfigMaps in Advance

Add ConfigMaps in advance to be used when creating a ConfigMap-type data source via the NVIDIA Run:ai UI.

Cluster Scope

  1. Locate the ConfigMap in the NVIDIA Run:ai namespace (runai)

  2. To authorize NVIDIA Run:ai to use the ConfigMap, label it: run.ai/cluster-wide: "true"

  3. The ConfigMap must have a label of run.ai/resource: <resource-name>

The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.
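
For example, a minimal kubectl sketch, assuming an existing ConfigMap named my-config (a placeholder) in the runai namespace; the run.ai/resource value is shown with the same placeholder name:

# Allow NVIDIA Run:ai to share this ConfigMap cluster-wide
kubectl label configmap my-config -n runai \
  run.ai/cluster-wide="true" \
  run.ai/resource=my-config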

Department Scope

  1. Locate the ConfigMap in the NVIDIA Run:ai namespace (runai)

  2. To authorize NVIDIA Run:ai to use the ConfigMap, label it: run.ai/department: "<department-id>"

  3. The ConfigMap must have a label of run.ai/resource: <resource-name>

The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.

Project Scope

  1. Locate the ConfigMap in the project’s namespace

  2. The ConfigMap must have a label of run.ai/resource: <resource-name>

The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.

Using API

To view the available actions, go to the Data sources API reference.

Cluster System Requirements

The NVIDIA Run:ai cluster is a Kubernetes application. This section explains the required hardware and software system requirements for the NVIDIA Run:ai cluster.

The system requirements needed depend on where the control plane and cluster are installed. The following applies for Kubernetes only:

  • If you are installing the first cluster and control plane on the same Kubernetes cluster, , and are not required.

  • If you are installing the first cluster and control plane on separate Kubernetes clusters, the , and are required.

Hardware Requirements

The following hardware requirements are for the Kubernetes cluster nodes. By default, all NVIDIA Run:ai cluster services run on all available nodes. For production deployments, you may want to set node roles to separate system and worker nodes, reduce downtime, and save CPU cycles on expensive GPU machines.

Architecture

  • x86 – Supported for both Kubernetes and OpenShift deployments.

  • ARM – Supported for Kubernetes only. ARM is currently not supported for OpenShift.

NVIDIA Run:ai Cluster - System Nodes

This configuration is the minimum requirement for installing and using the NVIDIA Run:ai cluster.

Component
Required Capacity

Note

To designate nodes to NVIDIA Run:ai system services, follow the instructions as described in .

NVIDIA Run:ai Cluster - Worker Nodes

The NVIDIA Run:ai cluster supports x86 and ARM (see the below note) CPUs, and NVIDIA GPUs from the T, V, A, L, H, B, GH and GB architecture families. For the list of supported GPU models, see .

The following configuration represents the minimum hardware requirements for installing and operating the NVIDIA Run:ai cluster on worker nodes. Each node must meet these specifications:

Component
Required Capacity

Note

To designate nodes to NVIDIA Run:ai workloads, follow the instructions as described in .

Shared Storage

NVIDIA Run:ai workloads must be able to access data from any worker node in a uniform way, to access training data and code as well as save checkpoints, weights, and other machine-learning-related artifacts.

Typical protocols are Network File System (NFS) or Network-attached storage (NAS). The NVIDIA Run:ai cluster supports both.

Software Requirements

The following software requirements must be fulfilled on the Kubernetes cluster.

Operating System

  • Any Linux operating system supported by both Kubernetes and NVIDIA GPU Operator

  • NVIDIA Run:ai cluster on Google Kubernetes Engine (GKE) supports both Ubuntu and Container Optimized OS (COS). COS is supported only with NVIDIA GPU Operator 24.6 or newer, and NVIDIA Run:ai cluster version 2.19 or newer. NVIDIA Run:ai cluster on Oracle Kubernetes Engine (OKE) supports only Ubuntu.

  • Internal tests are being performed on Ubuntu 22.04 and CoreOS for OpenShift.

Kubernetes Distribution

NVIDIA Run:ai cluster requires Kubernetes. The following Kubernetes distributions are supported:

  • Vanilla Kubernetes

  • OpenShift Container Platform (OCP)

  • NVIDIA Base Command Manager (BCM)

  • Elastic Kubernetes Engine (EKS)

  • Google Kubernetes Engine (GKE)

  • Azure Kubernetes Service (AKS)

  • Oracle Kubernetes Engine (OKE)

  • Rancher Kubernetes Engine (RKE1)

  • Rancher Kubernetes Engine 2 (RKE2)

Note

The latest release of the NVIDIA Run:ai cluster supports Kubernetes 1.30 to 1.32 and OpenShift 4.14 to 4.18.

For existing Kubernetes clusters, see the following Kubernetes version support matrix for the latest NVIDIA Run:ai cluster releases:

NVIDIA Run:ai version
Supported Kubernetes versions
Supported OpenShift versions

For information on supported versions of managed Kubernetes, consult the release notes provided by your Kubernetes service provider. There, you can confirm the specific version of the underlying Kubernetes platform supported by the provider, ensuring compatibility with NVIDIA Run:ai. For an up-to-date end-of-life statement, see the Kubernetes or OpenShift release documentation.

Container Runtime

NVIDIA Run:ai supports the following container runtimes. Make sure your Kubernetes cluster is configured with one of these runtimes:

  • containerd (default in Kubernetes)

  • CRI-O (default in OpenShift)

Kubernetes Pod Security Admission

NVIDIA Run:ai supports the restricted policy for Kubernetes Pod Security Admission (PSA) on OpenShift only. Other Kubernetes distributions are only supported with the privileged policy.

For NVIDIA Run:ai on OpenShift to run with PSA restricted policy:

  • Label the runai namespace with the following labels (see the sketch after this list):

  • The workloads submitted through NVIDIA Run:ai should comply with the restrictions of PSA restricted policy. This can be enforced using Policies.
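
A minimal sketch of labeling the namespace, assuming the standard Kubernetes Pod Security Admission label keys are applied at the restricted level:

# Apply PSA labels to the runai namespace
kubectl label namespace runai \
  pod-security.kubernetes.io/enforce=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/warn=restricted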

NVIDIA Run:ai Namespace

NVIDIA Run:ai must be installed in a namespace or project (OpenShift) called runai. Use the following commands to create the namespace/project:
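
For example, a minimal sketch; use the command matching your platform:

# Kubernetes
kubectl create namespace runai

# OpenShift
oc new-project runai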

Kubernetes Ingress Controller

NVIDIA Run:ai cluster requires an ingress controller to be installed on the Kubernetes cluster.

  • OpenShift, RKE and RKE2 come with a pre-installed ingress controller.

  • Internal tests are being performed on NGINX, Rancher NGINX, OpenShift Router, and Istio.

  • Make sure that a default ingress controller is set.

There are many ways to install and configure different ingress controllers. A simple example to install and configure the NGINX ingress controller using Helm:

Vanilla Kubernetes

Run the following commands:

  • For cloud deployments, both the internal IP and external IP are required.

  • For on-prem deployments, only the external IP is needed.
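
A minimal sketch using the ingress-nginx Helm chart; the IP values are placeholders per the notes above, and the chart values shown are an assumption about a typical setup:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update

# Expose the controller on the cluster IPs (for on-prem, the external IP alone is enough)
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  --set controller.service.externalIPs="{<INTERNAL-IP>,<EXTERNAL-IP>}"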

Managed Kubernetes (EKS, GKE, AKS)

Run the following commands:
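
A minimal sketch using the ingress-nginx Helm chart with its default LoadBalancer service; this assumes the managed cloud provisions the load balancer automatically:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace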

Oracle Kubernetes Engine (OKE)

Run the following commands:

Fully Qualified Domain Name (FQDN)

You must have a Fully Qualified Domain Name (FQDN) to install NVIDIA Run:ai control plane (ex: runai.mycorp.local). This cannot be an IP. The domain name must be accessible inside the organization's private network.

Wildcard FQDN for Inference (Optional)

In order to make inference serving endpoints available externally to the cluster, configure a wildcard DNS record (*.runai-inference.mycorp.local) that resolves to the cluster’s public IP address, or to the cluster's load balancer IP address in on-prem environments. This ensures each inference workload receives a unique subdomain under the wildcard domain.

TLS Certificate

You must have a TLS certificate that is associated with the FQDN for HTTPS access. Create a secret named runai-cluster-domain-tls-secret in the runai namespace, including the path to the TLS certificate (--cert) and its corresponding private key (--key), by running the following:
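
For example, a minimal sketch; the certificate and key paths are placeholders:

# Create the TLS secret used by the NVIDIA Run:ai cluster
kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
  --cert=<path-to-cert-file> \
  --key=<path-to-key-file>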

Local Certificate Authority

A local certificate authority serves as the root certificate for organizations that cannot use a publicly trusted certificate authority when external connections or standard HTTPS authentication are required.

In air-gapped environments, you must configure the public key of your local certificate authority. It must be installed in Kubernetes for the installation to succeed:

  1. Add the public key to the required namespace:

  2. When installing the cluster, make sure the following flag is added to the helm command: --set global.customCA.enabled=true. See Install cluster.

NVIDIA GPU Operator

The NVIDIA Run:ai cluster requires the NVIDIA GPU Operator to be installed on the Kubernetes cluster and supports versions 22.9 to 25.3. Information on how to download the GPU Operator for air-gapped installation can be found in the NVIDIA GPU Operator prerequisites.

See Installing the NVIDIA GPU Operator, then review the notes below:

  • Use the default gpu-operator namespace. Otherwise, you must specify the target namespace using the flag runai-operator.config.nvidiaDcgmExporter.namespace as described in customized cluster installation.

  • NVIDIA drivers may already be installed on the nodes. In such cases, use the NVIDIA GPU Operator flag --set driver.enabled=false. DGX OS is one such example, as it comes bundled with NVIDIA drivers.

  • For distribution-specific additional instructions see below:

OpenShift Container Platform (OCP)

The Node Feature Discovery (NFD) Operator is a prerequisite for the NVIDIA GPU Operator in OpenShift. Install the NFD Operator using the Red Hat OperatorHub catalog in the OpenShift Container Platform web console. For more information, see Installing the Node Feature Discovery (NFD) Operator.

Elastic Kubernetes Service (EKS)
  • When setting up the cluster, do not install the NVIDIA device plugin (the NVIDIA GPU Operator installs it instead).

  • When using the eksctl tool to create a cluster, use the flag --install-nvidia-plugin=false to disable the installation (see the example below).

For GPU nodes, EKS uses an AMI that already contains the NVIDIA drivers. As such, you must use the GPU Operator flag --set driver.enabled=false.
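
A minimal sketch of both steps, assuming placeholder cluster and nodegroup names and the public NVIDIA helm chart (adjust the instance type and versions to your environment):

# Create a GPU nodegroup without the NVIDIA device plugin (names and instance type are placeholders)
eksctl create nodegroup --cluster my-eks-cluster --name gpu-nodes \
    --node-type p3.2xlarge --install-nvidia-plugin=false

# Install the NVIDIA GPU Operator without drivers, since the EKS GPU AMI already ships them
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set driver.enabled=false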

Google Kubernetes Engine (GKE)

Before installing the GPU Operator:

  1. Create the gpu-operator namespace by running:

  2. Create the following file:

  3. Run:

Rancher Kubernetes Engine 2 (RKE2)

Make sure to specify the CONTAINERD_CONFIG option exactly as outlined in the documentation and custom configuration guide, using the path /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl. Do not create the file manually if it does not already exist. The GPU Operator handles this configuration during deployment. A sketch of the relevant helm flags is shown below.
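
For illustration only, assuming the GPU Operator is installed from the public NVIDIA helm chart, the option can be passed as a toolkit environment variable (check the GPU Operator RKE2 guide for the full set of required settings):

helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace \
    --set toolkit.env[0].name=CONTAINERD_CONFIG \
    --set toolkit.env[0].value=/var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl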

Oracle Kubernetes Engine (OKE)
  • During cluster setup, create a nodepool, and set initial_node_labels to include oci.oraclecloud.com/disable-gpu-device-plugin=true, which disables the NVIDIA GPU device plugin.

  • For GPU nodes, OKE defaults to Oracle Linux, which is incompatible with NVIDIA drivers. To resolve this, use a custom Ubuntu image instead.

For troubleshooting information, see the NVIDIA GPU Operator Troubleshooting Guide.

Prometheus

Note

Installing Prometheus applies for Kubernetes only.

NVIDIA Run:ai cluster requires Prometheus to be installed on the Kubernetes cluster.

  • OpenShift comes pre-installed with Prometheus.

  • For RKE2, see the Enable Monitoring instructions to install Prometheus.

There are many ways to install Prometheus. A simple example is to install the community Kube-Prometheus Stack using helm; run the following commands:

Additional Software Requirements

Additional NVIDIA Run:ai capabilities, such as distributed training and inference, require additional Kubernetes applications (frameworks) to be installed on the cluster.

Distributed Training

Distributed training enables training of AI models over multiple nodes. This requires installing a distributed training framework on the cluster. The following frameworks are supported: TensorFlow, PyTorch, XGBoost, MPI v2, and JAX.

There are several ways to install each framework. A simple installation method is the Kubeflow Training Operator, which includes TensorFlow, PyTorch, XGBoost, and JAX.

It is recommended to use Kubeflow Training Operator v1.9.2 and MPI Operator v0.6.0 or later for compatibility with advanced workload capabilities, such as stopping a workload and scheduling rules.

  • To install the Kubeflow Training Operator for TensorFlow, PyTorch, XGBoost and JAX frameworks, run the following command:

  • To install the MPI Operator for MPI v2, run the following command:

Note

If you require both the MPI Operator and Kubeflow Training Operator, follow the steps below:

  • Install the Kubeflow Training Operator as described above.

  • Disable and delete MPI v1 in the Kubeflow Training Operator by running:

  • Install the MPI Operator as described above.

Inference

Inference enables serving of AI models. This requires the Knative Serving framework to be installed on the cluster and supports Knative versions 1.11 to 1.16. Follow the Installing Knative instructions. Once installed, follow the steps below.

  1. Configure Knative to use the NVIDIA Run:ai Scheduler and other features using the following command:

  2. Optional: If inference serving endpoints should be accessible outside the cluster:

    1. Patch the Knative service and assign the DNS for inference workloads to the Knative ingress service:

    2. Follow the Configure external domain encryption instructions to configure TLS for the Knative ingress.

Knative Autoscaling

NVIDIA Run:ai allows autoscaling a deployment according to the following metrics:

  • Latency (milliseconds)

  • Throughput (requests/sec)

  • Concurrency (requests)

Using a custom metric (for example, Latency) requires installing the Kubernetes Horizontal Pod Autoscaler (HPA). Use the following command to install it. Make sure to update the {VERSION} in the command below with a supported Knative version.

CPU: 10 cores
Memory: 20GB
Disk space: 50GB

CPU: 2 cores
Memory: 4GB


pod-security.kubernetes.io/audit=privileged
pod-security.kubernetes.io/enforce=privileged
pod-security.kubernetes.io/warn=privileged
kubectl create ns runai
oc new-project runai
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm upgrade -i nginx-ingress ingress-nginx/ingress-nginx \
    --namespace nginx-ingress --create-namespace \
    --set controller.kind=DaemonSet \
    --set controller.service.externalIPs="{<INTERNAL-IP>,<EXTERNAL-IP>}" # Replace <INTERNAL-IP> and <EXTERNAL-IP> with the internal and external IP addresses of one of the nodes
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
    --namespace nginx-ingress --create-namespace
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install nginx-ingress ingress-nginx/ingress-nginx \
    --namespace ingress-nginx --create-namespace \
    --set controller.service.annotations.oci.oraclecloud.com/load-balancer-type=nlb \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/is-preserve-source=True \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/security-list-management-mode=None \
    --set controller.service.externalTrafficPolicy=Local \
    --set controller.service.annotations.oci-network-load-balancer.oraclecloud.com/subnet=<SUBNET-ID> # Replace <SUBNET-ID> with the subnet ID of one of your cluster
kubectl create secret tls runai-cluster-domain-tls-secret -n runai \
    --cert /path/to/fullchain.pem  \ # Replace /path/to/fullchain.pem with the actual path to your TLS certificate
    --key /path/to/private.pem # Replace /path/to/private.pem with the actual path to your private key
kubectl -n runai create secret generic runai-ca-cert \
    --from-file=runai-ca.pem=<ca_bundle_path>
kubectl label secret runai-ca-cert -n runai run.ai/cluster-wide=true run.ai/name=runai-ca-cert --overwrite
oc -n runai create secret generic runai-ca-cert \
    --from-file=runai-ca.pem=<ca_bundle_path>
oc -n openshift-monitoring create secret generic runai-ca-cert \
    --from-file=runai-ca.pem=<ca_bundle_path>
oc label secret runai-ca-cert -n runai run.ai/cluster-wide=true run.ai/name=runai-ca-cert --overwrite
kubectl create ns gpu-operator
#resourcequota.yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gcp-critical-pods
  namespace: gpu-operator
spec:
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
kubectl apply -f resourcequota.yaml
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
    -n monitoring --create-namespace --set grafana.enabled=false
kubectl apply --server-side -k "github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.9.2"
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
kubectl patch deployment training-operator -n kubeflow --type='json' -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["--enable-scheme=tfjob", "--enable-scheme=pytorchjob", "--enable-scheme=xgboostjob", "--enable-scheme=jaxjob"]}]'
kubectl delete crd mpijobs.kubeflow.org
kubectl patch configmap/config-autoscaler \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"enable-scale-to-zero":"true"}}' && \
kubectl patch configmap/config-features \
  --namespace knative-serving \
  --type merge \
  --patch '{"data":{"kubernetes.podspec-schedulername":"enabled","kubernetes.podspec-nodeselector":"enabled","kubernetes.podspec-affinity":"enabled","kubernetes.podspec-tolerations":"enabled","kubernetes.podspec-volumes-emptydir":"enabled","kubernetes.podspec-securitycontext":"enabled","kubernetes.containerspec-addcapabilities":"enabled","kubernetes.podspec-persistent-volume-claim":"enabled","kubernetes.podspec-persistent-volume-write":"enabled","multi-container":"enabled","kubernetes.podspec-init-containers":"enabled","kubernetes.podspec-fieldref":"enabled"}}'
# Replace <runai-inference.mycorp.local> with your FQDN for Inference (without the wildcard)
kubectl patch configmap/config-domain \
   --namespace knative-serving \
   --type merge \
   --patch '{"data":{"<runai-inference.mycorp.local>":""}}'
kubectl apply -f https://github.com/knative/serving/releases/download/knative-{VERSION}/serving-hpa.yaml

NVIDIA Run:ai System Monitoring

This section explains how to configure NVIDIA Run:ai to generate health alerts and to connect these alerts to alert-management systems within your organization. Alerts are generated for NVIDIA Run:ai clusters.

Alert Infrastructure

NVIDIA Run:ai uses Prometheus for externalizing metrics and providing visibility to end-users. The NVIDIA Run:ai Cluster installation includes Prometheus or can connect to an existing Prometheus instance used in your organization. The alerts are based on the Prometheus AlertManager. Once installed, it is enabled by default.

This document explains how to:

  • Configure alert destinations - triggered alerts send data to specified destinations

  • Understand the out-of-the-box cluster alerts, provided by NVIDIA Run:ai

  • Add additional custom alerts

Prerequisites

  • A Kubernetes cluster with the necessary permissions

  • Up and running NVIDIA Run:ai environment, including Prometheus Operator

  • kubectl command-line tool installed and configured to interact with the cluster

Setup

Use the steps below to set up monitoring alerts.

Validating Prometheus Operator Installed

  1. Verify that the Prometheus Operator Deployment is running. Copy the following command and paste it in your terminal, where you have access to the Kubernetes cluster:

kubectl get deployment kube-prometheus-stack-operator -n monitoring

In your terminal, you can see an output indicating the deployment's status, including the number of replicas and their current state.

  2. Verify that Prometheus instances are running. Copy the following command and paste it in your terminal:

kubectl get prometheus -n runai

You can see the Prometheus instance(s) listed along with their status.

Enabling Prometheus AlertManager

In each of the steps in this section, copy the content of the code snippet to a new YAML file (e.g., step1.yaml).

  1. Create the AlertManager CustomResource to enable AlertManager. Copy the following snippet to a new YAML file (e.g., step1.yaml):

apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: runai
  namespace: runai
spec:
  replicas: 1
  alertmanagerConfigSelector:
    matchLabels:
      alertmanagerConfig: runai

  2. Apply the YAML file to the cluster:

kubectl apply -f step1.yaml
  3. Copy the following command to your terminal to validate that the AlertManager instance has started:

kubectl get alertmanager -n runai
  4. Copy the following command to your terminal to validate that the Prometheus operator has created a Service for AlertManager:

kubectl get svc alertmanager-operated -n runai

Configuring Prometheus to Send Alerts

  1. Open the terminal on your local machine or another machine that has access to your Kubernetes cluster

  2. Copy and paste the following command in your terminal to edit the Prometheus configuration for the runai Namespace:

kubectl edit prometheus runai -n runai

This command opens the Prometheus configuration file in your default text editor (usually vi or nano).

  3. Copy and paste the following text to change the configuration file:

alerting:  
   alertmanagers:  
      - namespace: runai  
        name: alertmanager-operated  
        port: web
  4. Save the changes and exit the text editor.

Note

To save changes using vi, type :wq and press Enter. The changes are applied to the Prometheus configuration in the cluster.

Alert Destinations

Set out below are the various alert destinations.

Configuring AlertManager for Custom Email Alerts

In each step, copy the contents of the code snippets to a new file and apply it to the cluster using kubectl apply -f.

  1. Add your smtp password as a secret:

apiVersion: v1  
kind: Secret  
metadata:  
   name: alertmanager-smtp-password  
   namespace: runai  
stringData:
   password: "your_smtp_password"
  2. Replace the relevant smtp details with your own, then apply the alertmanagerconfig using kubectl apply.

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: runai
  namespace: runai
  labels:
    alertmanagerConfig: runai
spec:
  route:
    continue: true
    groupBy:
    - alertname
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 1h
    matchers:
    - matchType: =~
      name: alertname
      value: Runai.*
    receiver: email
  receivers:
  - name: 'email'
    emailConfigs:
    - to: '<destination_email_address>'
      from: '<from_email_address>'
      smarthost: 'smtp.gmail.com:587'
      authUsername: '<smtp_server_user_name>'
      authPassword:
        name: alertmanager-smtp-password
        key: password
  3. Save and exit the editor. The configuration is automatically reloaded.

Third-Party Alert Destinations

Prometheus AlertManager provides a structured way to connect to alert-management systems. There are built-in plugins for popular systems such as PagerDuty and OpsGenie, including a generic Webhook.

Example: Integrating NVIDIA Run:ai with a Webhook

  1. Use webhook.site to get a unique URL.

  2. Use the upgrade cluster instructions to modify the values file: Edit the values file to add the following, and replace <WEB-HOOK-URL> with the URL from webhook.site:

kube-prometheus-stack:  
  ...  
  alertmanager:  
    enabled: true  
    config:  
      global:  
        resolve_timeout: 5m  
      receivers:  
      - name: "null"  
      - name: webhook-notifications  
        webhook_configs:  
          - url: <WEB-HOOK-URL>  
            send_resolved: true  
      route:  
        group_by:  
        - alertname  
        group_interval: 5m  
        group_wait: 30s  
        receiver: 'null'  
        repeat_interval: 10m  
        routes:  
        - receiver: webhook-notifications
  3. Verify that you are receiving alerts on the webhook.site page, in the left pane.

Built-in Alerts

An NVIDIA Run:ai cluster comes with several built-in alerts. Each alert notifies on a specific functionality of an NVIDIA Run:ai entity. There is also a single, inclusive alert: NVIDIA Run:ai Critical Problems, which aggregates all component-based alerts into a single cluster health test.

Runai agent cluster info push rate low

Meaning

The cluster-sync Pod in the runai namespace might not be functioning properly

Impact

Possible impact - no info/partial info from the cluster is being synced back to the control-plane

Severity

Critical

Diagnosis

Run kubectl get pod -n runai to see whether the cluster-sync pod is running.

Troubleshooting/Mitigation

To diagnose issues with the cluster-sync pod, follow these steps:

  1. Paste the following command into your terminal to receive detailed information about the cluster-sync deployment: kubectl describe deployment cluster-sync -n runai

  2. Check the logs: Use the following command to view the logs of the cluster-sync deployment: kubectl logs deployment/cluster-sync -n runai

  3. Analyze the Logs and Pod Details: From the information provided by the logs and the deployment details, attempt to identify the reason why the cluster-sync pod is not functioning correctly

  4. Check Connectivity: Ensure there is a stable network connection between the cluster and the NVIDIA Run:ai Control Plane. A connectivity issue may be the root cause of the problem.

  5. Contact Support: If the network connection is stable and you are still unable to resolve the issue, contact NVIDIA Run:ai support for further assistance

Runai agent pull rate low

Meaning

The runai-agent pod may be too loaded, is slow in processing data (possible in very big clusters), or the runai-agent pod itself in the runai namespace may not be functioning properly.

Impact

Possible impact - no info/partial info from the control-plane is being synced in the cluster

Severity

Critical

Diagnosis

Run kubectl get pod -n runai and check whether the runai-agent pod is running.

Troubleshooting/Mitigation

To diagnose issues with the runai-agent pod, follow these steps:

  1. Describe the deployment: Run the following command to get detailed information about the runai-agent deployment: kubectl describe deployment runai-agent -n runai

  2. Check the logs: Use the following command to view the logs of the runai-agent deployment: kubectl logs deployment/runai-agent -n runai

  3. Analyze the Logs and Pod Details: From the information provided by the logs and the deployment details, attempt to identify the reason why the runai-agent pod is not functioning correctly. There may be a connectivity issue with the control plane.

  4. Check Connectivity: Ensure there is a stable network connection between the runai-agent and the control plane. A connectivity issue may be the root cause of the problem.

  5. Consider Cluster Load: If the runai-agent appears to be functioning properly but the cluster is very large and heavily loaded, it may take more time for the agent to process data from the control plane.

  6. Adjust the alert threshold: If the cluster load is causing the alert to fire, you can adjust the threshold at which the alert triggers. The default value is 0.05. You can try changing it to a lower value (e.g., 0.045 or 0.04). To edit the value, paste the following in your terminal: kubectl edit runaiconfig -n runai. In the editor, navigate to spec -> prometheus -> agentPullPushRateMinForAlert. If the agentPullPushRateMinForAlert value does not exist, add it under spec -> prometheus (see the snippet below).
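
For reference, a minimal sketch of the resulting runaiconfig fragment (the 0.045 value is only an example):

spec:
  prometheus:
    agentPullPushRateMinForAlert: 0.045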

Runai container memory usage critical

Meaning

Runai container is using more than 90% of its Memory limit

Impact

The container might run out of memory and crash.

Severity

Critical

Diagnosis

Calculate the memory usage by pasting the following into your terminal: container_memory_usage_bytes{namespace=~"runai

Troubleshooting/Mitigation

Add more memory resources to the container. If the issue persists, contact NVIDIA Run:ai

Runai container memory usage warning

Meaning

Runai container is using more than 80% of its memory limit

Impact

The container might run out of memory and crash

Severity

Warning

Diagnosis

Calculate the memory usage by pasting the following into your terminal: container_memory_usage_bytes{namespace=~"runai

Troubleshooting/Mitigation

Add more memory resources to the container. If the issue persists, contact NVIDIA Run:ai

Runai container restarting

Meaning

Runai container has restarted more than twice in the last 10 min

Impact

The container might become unavailable and impact the NVIDIA Run:ai system

Severity

Warning

Diagnosis

To diagnose the issue and identify the problematic pods, paste the following into your terminal: kubectl get pods -n runai and kubectl get pods -n runai-backend. One or more of the pods has a restart count >= 2.

Troubleshooting/Mitigation

Paste the following into your terminal: kubectl logs -n NAMESPACE POD_NAME. Replace NAMESPACE and POD_NAME with the relevant pod information from the previous step. Check the logs for any standout issues and verify that the container has sufficient resources. If you need further assistance, contact NVIDIA Run:ai.

Runai CPU usage warning

Meaning

runai container is using more than 80% of its CPU limit

Impact

This might cause slowness in the operation of certain NVIDIA Run:ai features.

Severity

Warning

Diagnosis

Paste the following query to your terminal in order to calculate the CPU usage: rate(container_cpu_usage_seconds_total{namespace=~"runai

Troubleshooting/Mitigation

Add more CPU resources to the container. If the issue persists, please contact NVIDIA Run:ai.

Runai critical problem

Meaning

One of the critical NVIDIA Run:ai alerts is currently active

Impact

Impact is based on the active alert

Severity

Critical

Diagnosis

Check NVIDIA Run:ai alerts in Prometheus to identify any active critical alerts

Unknown state alert for a node

Meaning

The Kubernetes node hosting GPU workloads is in an unknown state, and its health and readiness cannot be determined.

Impact

This may interrupt GPU workload scheduling and execution.

Severity

Critical - Node is either unschedulable or has unknown status. The node is in one of the following states:

  • Ready=Unknown: The control plane cannot communicate with the node.

  • Ready=False: The node is not healthy.

  • Unschedulable=True: The node is marked as unschedulable.

Diagnosis

Check the node's status using kubectl describe node, verify Kubernetes API server connectivity, and inspect system logs for GPU-specific or node-level errors.

Low memory node alert

Meaning

The Kubernetes node hosting GPU workloads has insufficient memory to support current or upcoming workloads.

Impact

GPU workloads may fail to schedule, experience degraded performance, or crash due to memory shortages, disrupting dependent applications.

Severity

Critical - Node is using more than 90% of its memory. Warning - Node is using more than 80% of its memory.

Diagnosis

Use kubectl top node to assess memory usage, identify memory-intensive pods, consider resizing the node or optimizing memory usage in affected pods.

Runai daemonSet rollout stuck / Runai DaemonSet unavailable on nodes

Meaning

There are currently 0 available pods for the runai daemonset on the relevant node

Impact

No fractional GPU workloads support

Severity

Critical

Diagnosis

Paste the following command into your terminal: kubectl get daemonset -n runai-backend. In the result of this command, identify the daemonset(s) that don’t have any running pods.

Troubleshooting/Mitigation

Paste the following command into your terminal, where daemonsetX is the problematic daemonset from the previous step: kubectl describe daemonsetX -n runai. Look for the specific error that prevents it from creating pods. Possible reasons might be:

  • Node Resource Constraints: The nodes in the cluster may lack sufficient resources (CPU, memory, etc.) to accommodate new pods from the daemonset.

  • Node Selector or Affinity Rules: The daemonset may have node selector or affinity rules that are not matching with any nodes currently available in the cluster, thus preventing pod creation.

Runai deployment insufficient replicas / Runai deployment no available replicas / RunaiDeploymentUnavailableReplicas

Meaning

Runai deployment has one or more unavailable pods

Impact

When this happens, there may be scale issues. Additionally, new versions cannot be deployed, potentially resulting in missing features.

Severity

Critical

Diagnosis

Paste the following commands into your terminal to get the status of the deployments in the runai and runai-backend namespaces: kubectl get deployment -n runai and kubectl get deployment -n runai-backend. Identify any deployments that have missing pods. Look for discrepancies in the DESIRED and AVAILABLE columns. If the number of AVAILABLE pods is less than the DESIRED pods, it indicates that there are missing pods.

Troubleshooting/Mitigation

  • Paste the following commands into your terminal to receive detailed information about the problematic deployment: kubectl describe deployment <DEPLOYMENT_NAME> -n runai or kubectl describe deployment <DEPLOYMENT_NAME> -n runai-backend

  • Paste the following commands into your terminal to check the replicaset details associated with the deployment: kubectl describe replicaset <REPLICASET_NAME> -n runai or kubectl describe replicaset <REPLICASET_NAME> -n runai-backend

  • Paste the following commands into your terminal to retrieve the logs for the deployment and identify any errors or issues: kubectl logs deployment/<DEPLOYMENT_NAME> -n runai or kubectl logs deployment/<DEPLOYMENT_NAME> -n runai-backend

  • From the logs and the detailed information provided by the describe commands, analyze the reasons why the deployment is unable to create pods. Look for common issues such as:

    • Resource constraints (CPU, memory)

    • Misconfigured deployment settings or replicasets

    • Node selector or affinity rules preventing pod scheduling

    If the issue persists, contact NVIDIA Run:ai.

Runai project controller reconcile failure

Meaning

The project-controller in runai namespace had errors while reconciling projects

Impact

Some projects might not be in the “Ready” state. This means that they are not fully operational and may not have all the necessary components running or configured correctly.

Severity

Critical

Diagnosis

Retrieve the logs for the project-controller deployment by pasting the following command in your terminal: kubectl logs deployment/project-controller -n runai. Carefully examine the logs for any errors or warning messages. These logs help you understand what might be going wrong with the project controller.

Troubleshooting/Mitigation

Once errors in the log have been identified, follow these steps to mitigate the issue. The error messages in the logs should provide detailed information about the problem.

  1. Read through them to understand the nature of the issue. If the logs indicate which project failed to reconcile, you can further investigate by checking the status of that specific project.

  2. Run the following command, replacing <PROJECT_NAME> with the name of the problematic project: kubectl get project <PROJECT_NAME> -o yaml

  3. Review the status section in the YAML output. This section describes the current state of the project and provides insights into what might be causing the failure. If the issue persists, contact NVIDIA Run:ai.

Runai StatefulSet insufficient replicas / Runai StatefulSet no available replicas

Meaning

Runai statefulset has no available pods

Impact

Absence of metrics; database unavailability

Severity

Critical

Diagnosis

To diagnose the issue, follow these steps:

  1. Check the status of the stateful sets in the runai-backend namespace by running the following command: kubectl get statefulset -n runai-backend

  2. Identify any stateful sets that have no running pods. These are the ones that might be causing the problem.

Troubleshooting/Mitigation

Once you've identified the problematic stateful sets, follow these steps to mitigate the issue:

  1. Describe the stateful set to get detailed information on why it cannot create pods. Replace X with the name of the stateful set: kubectl describe statefulset X -n runai-backend

  2. Review the description output to understand the root cause of the issue. Look for events or error messages that explain why the pods are not being created.

  3. If you're unable to resolve the issue based on the information gathered, contact NVIDIA Run:ai support for further assistance.

Adding a Custom Alert

You can add custom alerts in addition to those provided by NVIDIA Run:ai. Alerts are triggered by using the Prometheus query language with any NVIDIA Run:ai metric.

To create an alert, follow these steps using Prometheus query language with NVIDIA Run:ai Metrics:

  • Modify Values File: Use the upgrade cluster instructions to modify the values file.

  • Add Alert Structure: Incorporate alerts according to the structure outlined below. Replace placeholders <ALERT-NAME>, <ALERT-SUMMARY-TEXT>, <PROMQL-EXPRESSION>, <optional: duration s/m/h>, and <critical/warning> with appropriate values for your alert, as described below.

kube-prometheus-stack:  
   additionalPrometheusRulesMap:  
     custom-runai:  
       groups:  
       - name: custom-runai-rules  
         rules:  
         - alert: <ALERT-NAME>  
           annotations:  
             summary: <ALERT-SUMMARY-TEXT>  
           expr:  <PROMQL-EXPRESSION>  
           for: <optional: duration s/m/h>  
           labels:  
             severity: <critical/warning>
  • <ALERT-NAME>: Choose a descriptive name for your alert, such as HighCPUUsage or LowMemory.

  • <ALERT-SUMMARY-TEXT>: Provide a brief summary of what the alert signifies, for example, High CPU usage detected or Memory usage below threshold.

  • <PROMQL-EXPRESSION>: Construct a Prometheus query (PROMQL) that defines the conditions under which the alert should trigger. This query should evaluate to a boolean value (1 for alert, 0 for no alert).

  • <optional: duration s/m/h>: Optionally, specify a duration in seconds (s), minutes (m), or hours (h) that the alert condition should persist before triggering an alert. If not specified, the alert triggers as soon as the condition is met.

  • <critical/warning>: Assign a severity level to the alert, indicating its importance. Choose between critical for severe issues requiring immediate attention, or warning for less critical issues that still need monitoring.

You can find an example in the Prometheus documentation.
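
For illustration, a hypothetical custom alert (not one of the built-in alerts) that fires when the average CPU usage of containers in the runai namespace stays above 80% for 5 minutes, using the standard cAdvisor metric container_cpu_usage_seconds_total:

kube-prometheus-stack:
  additionalPrometheusRulesMap:
    custom-runai:
      groups:
      - name: custom-runai-rules
        rules:
        - alert: RunaiNamespaceHighCpuUsage
          annotations:
            summary: High CPU usage detected in the runai namespace
          expr: avg(rate(container_cpu_usage_seconds_total{namespace="runai"}[5m])) > 0.8
          for: 5m
          labels:
            severity: warning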

Metrics and Telemetry

Metrics are numeric measurements recorded over time and emitted from the NVIDIA Run:ai cluster. Telemetry is a numeric measurement recorded in real time at the moment it is emitted from the NVIDIA Run:ai cluster.

Scopes

NVIDIA Run:ai provides a control-plane API that supports and aggregates analytics at the following levels (scopes).

Level
Description

Cluster

A cluster is a set of node pools and nodes. With Cluster metrics, metrics are aggregated at the Cluster level. In the NVIDIA Run:ai user interface, metrics are available in the Overview dashboard.

Node

Data is aggregated at the node level.

Node pool

Data is aggregated at the node pool level.

Workload

Data is aggregated at the workload level. In some workloads, e.g. with distributed workloads, these metrics aggregate data from all worker pods.

Pod

The basic unit of execution.

Project

The basic organizational unit. Projects are the tool to implement resource allocation policies as well as the segregation between different initiatives.

Department

Departments are a grouping of projects.
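
As an illustration only, a cluster-scope metrics query might look like the following; the exact path and query parameters are assumptions, so consult the NVIDIA Run:ai API reference for the authoritative definition:

curl -H "Authorization: Bearer $TOKEN" \
  "https://<control-plane-fqdn>/api/v1/clusters/<cluster-uuid>/metrics?metricType=GPU_UTILIZATION&start=2024-01-01T00:00:00Z&end=2024-01-02T00:00:00Z"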

Supported Metrics

Metric name in API
Applicable API endpoint
Metric name in UI per grid
Applicable UI grid

ALLOCATED_GPU

  • GPU devices (allocated)

  • Allocated GPUs

AVG_WORKLOAD_WAIT_TIME

CPU_LIMIT_CORES

CPU limit

CPU_MEMORY_LIMIT_BYTES

CPU memory limit

CPU_MEMORY_REQUEST_BYTES

CPU memory request

CPU_MEMORY_USAGE_BYTES

CPU memory usage

CPU_MEMORY_UTILIZATION

CPU memory utilization

CPU_REQUEST_CORES

CPU request

CPU_USAGE_CORES

CPU usage

CPU_UTILIZATION

  • CPU compute utilization

  • CPU utilization


GPU_ALLOCATION

GPU devices (allocated)

GPU_MEMORY_REQUEST_BYTES

GPU memory request

GPU_MEMORY_USAGE_BYTES

GPU memory usage

GPU_MEMORY_USAGE_BYTES_PER_GPU

GPU memory usage per GPU

GPU_MEMORY_UTILIZATION

GPU memory utilization

GPU_MEMORY_UTILIZATION_PER_GPU

GPU memory utilization per GPU

GPU_QUOTA

Quota

GPU_UTILIZATION

GPU compute utilization

GPU_UTILIZATION_PER_GPU

GPU utilization per GPU

TOTAL_GPU

  • GPU devices total

  • Total GPUs

TOTAL_GPU_NODES

GPU_UTILIZATION_DISTRIBUTION

GPU utilization distribution

UNALLOCATED_GPU

  • GPU devices (unallocated)

  • Unallocated GPUs

CPU_QUOTA_MILLICORES

CPU_MEMORY_QUOTA_MB

CPU_ALLOCATION_MILLICORES

CPU_MEMORY_ALLOCATION_MB

POD_COUNT

RUNNING_POD_COUNT

Advanced Metrics

NVIDIA provides extended metrics, as shown here. To enable these metrics, contact NVIDIA Run:ai customer support.

Metric name in API
Applicable API endpoint
Metric name in UI
Applicable UI table

GPU_FP16_ENGINE_ACTIVITY_PER_GPU

GPU FP16 engine activity

GPU_FP32_ENGINE_ACTIVITY_PER_GPU

GPU FP32 engine activity

GPU_FP64_ENGINE_ACTIVITY_PER_GPU

GPU FP64 engine activity

GPU_GRAPHICS_ENGINE_ACTIVITY_PER_GPU

Graphics engine activity

GPU_MEMORY_BANDWIDTH_UTILIZATION_PER_GPU

GPU_NVLINK_RECEIVED_BANDWIDTH_PER_GPU

GPU_NVLINK_TRANSMITTED_BANDWIDTH_PER_GPU

GPU_PCIE_RECEIVED_BANDWIDTH_PER_GPU

GPU_PCIE_TRANSMITTED_BANDWIDTH_PER_GPU

GPU_SM_ACTIVITY_PER_GPU

GPU SM activity

GPU_SM_OCCUPANCY_PER_GPU

GPU SM occupancy

GPU_TENSOR_ACTIVITY_PER_GPU

GPU tensor activity

Supported Telemetry

Metric
Applicable API endpoint
Metric name in UI
Applicable UI table

WORKLOADS_COUNT

ALLOCATED_GPUS

Allocated GPUs

GPU_allocation

READY_GPU_NODES

Ready / Total GPU nodes

READY_GPUS

Ready / Total GPU devices

TOTAL_GPU_NODES

Ready / Total GPU nodes

TOTAL_GPUS

Ready / Total GPU devices

IDLE_ALLOCATED_GPUS

Idle allocated GPU devices

FREE_GPUS

Free GPU devices

TOTAL_CPU_CORES

CPU (Cores)

USED_CPU_CORES

ALLOCATED_CPU_CORES

Allocated CPU cores

TOTAL_GPU_MEMORY_BYTES

GPU memory

USED_GPU_MEMORY_BYTES

Used GPU memory

TOTAL_CPU_MEMORY_BYTES

CPU memory

USED_CPU_MEMORY_BYTES

Used CPU memory

ALLOCATED_CPU_MEMORY_BYTES

Allocated CPU memory

GPU_QUOTA

GPU quota

CPU_QUOTA

MEMORY_QUOTA

GPU_ALLOCATION_NON_PREEMPTIBLE

CPU_ALLOCATION_NON_PREEMPTIBLE

MEMORY_ALLOCATION_NON_PREEMPTIBLE

Each of the metrics and telemetry values above applies to one or more of the following API endpoints and UI grids: Clusters, Node pools, Nodes, Workloads, Workloads per pod, Pods, Projects, Departments, the Overview dashboard, and Quota management.

Roles

This section explains the available roles in the NVIDIA Run:ai platform.

A role is a set of permissions that can be assigned to a subject in a scope. A permission is a set of actions (View, Edit, Create and Delete) over a NVIDIA Run:ai entity (e.g. projects, workloads, users).

Roles Table

The Roles table can be found under Access in the NVIDIA Run:ai platform.

The Roles table displays a list of roles available to users in the NVIDIA Run:ai platform. Both predefined and custom roles will be displayed in the table.

The Roles table consists of the following columns:

Column
Description

Role

The name of the role

Created by

The name of the role creator

Creation time

The timestamp when the role was created

Customizing the Table View

  • Filter - Click ADD FILTER, select the column to filter by, and enter the filter values

  • Search - Click SEARCH and type the value to search by

  • Sort - Click each column header to sort by

  • Column selection - Click COLUMNS and select the columns to display in the table

  • Download table - Click MORE and then click Download as CSV. Export to CSV is limited to 20,000 rows.

Reviewing a Role

  1. To review a role, click the role name in the table

  2. In the role form review the following:

    • Role name - The name of the role

    • Entity - A system-managed object that can be viewed, edited, created or deleted by a user based on their assigned role and scope

    • Actions - The actions that the role assignee is authorized to perform for each entity

      • View - If checked, an assigned user with this role can view instances of this type of entity within their defined scope

      • Edit - If checked, an assigned user with this role can change the settings of an instance of this type of entity within their defined scope

      • Create - If checked, an assigned user with this role can create new instances of this type of entity within their defined scope

      • Delete - If checked, an assigned user with this role can delete instances of this type of entity within their defined scope

Roles in NVIDIA Run:ai

NVIDIA Run:ai supports the following roles and their permissions. Under each role is a detailed list of the actions that the role assignee is authorized to perform for each entity.

Compute resource administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Credentials administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Data source administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Data volume administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Department administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Department viewer
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Editor
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Environment administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

L1 researcher
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

L2 researcher
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

ML engineer
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Research manager
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

System administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Template administrator
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Viewer
Entity
View
Edit
Create
Delete

Account

Departments

Event history

Policies

Projects

Security settings

Settings

Clusters

Node pools

Nodes

Access rules

Applications

Groups

Roles

User applications

Users

Analytics dashboard

Consumption dashboard

Overview dashboard

Inferences

Workloads

Compute resources

Credentials

Data sources

Data volumes

Data volumes - sharing list

Environments

Storage class configurations

Templates

Permitted workloads

When assigning a role with any combination of the View, Edit, Create and Delete permissions for workloads, the subject has permission to manage not only NVIDIA Run:ai workloads (Workspace, Training, Inference), but also the following 3rd party workloads:

  • k8s: StatefulSet

  • k8s: ReplicaSet

  • k8s: Pod

  • k8s: Deployment

  • batch: Job

  • batch: CronJob

  • machinelearning.seldon.io: SeldonDeployment

  • kubevirt.io: VirtualMachineInstance

  • kubeflow.org: TFJob

  • kubeflow.org: PyTorchJob

  • kubeflow.org: XGBoostJob

  • kubeflow.org: MPIJob

  • kubeflow.org: Notebook

  • kubeflow.org: ScheduledWorkflow

  • amlarc.azureml.com: AmlJob

  • serving.knative.dev: Service

  • workspace.devfile.io: DevWorkspace

  • ray.io: RayCluster

  • ray.io: RayJob

  • ray.io: RayService

  • tekton.dev: TaskRun

  • tekton.dev: PipelineRun

  • argoproj.io: Workflow

Using API

Go to the Roles API reference to view the available actions.
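
For example, listing the available roles with an API token might look like the following; the endpoint path is an assumption here, so confirm it in the API reference:

curl -H "Authorization: Bearer $TOKEN" \
  "https://<control-plane-fqdn>/api/v1/authorization/roles"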

deprecated

Policy YAML Reference

A workload policy is an end-to-end solution for AI managers and administrators to control and simplify how workloads are submitted, setting best practices, enforcing limitations, and standardizing processes for AI projects within their organization.

This article explains the policy YAML fields and the possible rules and defaults that can be set for each field.
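
As a minimal sketch of the overall YAML shape, a policy combines defaults (values pre-filled for users) with rules (constraints on what users may submit). The field names below come from the reference tables in this article; the specific rule keys shown (max, required) are examples and should be checked against the policy reference:

defaults:
  imagePullPolicy: ifNotPresent
  compute:
    cpuCoreRequest: 0.5
rules:
  compute:
    gpuDeviceRequest:
      max: 2
  image:
    required: true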

Policy YAML Fields - Reference Table

The policy fields are structured in a similar format to the workload API fields. The following tables are a structured guide designed to help you understand and configure policies in YAML format. They provide the fields, descriptions, defaults, and rules for each workload type.

Click the link to view the value type of each field.

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

args

When set, contains the arguments sent along with the command. These override the entry point of the image in the created workload

  • Workspace

  • Training

command

A command to serve as the entry point of the container running the workspace

  • Workspace

  • Training

createHomeDir

Instructs the system to create a temporary home directory for the user within the container. Data stored in this directory is not saved when the container exits. When the runAsUser flag is set to true, this flag defaults to true as well

  • Workspace

  • Training

environmentVariables

Set of environmentVariables to populate the container running the workspace

  • Workspace

  • Training

image

Specifies the image to use when creating the container running the workload

  • Workspace

  • Training

imagePullPolicy

Specifies the pull policy of the image when starting a container running the created workload. Options are: always, ifNotPresent, or never

  • Workspace

  • Training

workingDir

Container’s working directory. If not specified, the container runtime default is used, which might be configured in the container image

  • Workspace

  • Training

nodeType

Nodes (machines) or a group of nodes on which the workload runs

  • Workspace

  • Training

nodePools

A prioritized list of node pools for the scheduler to run the workspace on. The scheduler always tries to use the first node pool before moving to the next one when the first is not available.

  • Workspace

  • Training

annotations

Set of annotations to populate into the container running the workspace

  • Workspace

  • Training

labels

Set of labels to populate into the container running the workspace

  • Workspace

  • Training

terminateAfterPreemption

Indicates whether the job should be terminated, by the system, after it has been preempted

  • Workspace

  • Training

autoDeletionTimeAfterCompletionSeconds

Specifies the duration after which a finished workload (Completed or Failed) is automatically deleted. If this field is set to zero, the workload becomes eligible to be deleted immediately after it finishes.

  • Workspace

  • Training

backoffLimit

Specifies the number of retries before marking a workload as failed

  • Workspace

  • Training

cleanPodPolicy

Specifies which pods will be deleted when the workload reaches a terminal state (completed/failed). The policy can be one of the following values:

  • Running - Only pods still running when a job completes (for example, parameter servers) will be deleted immediately. Completed pods will not be deleted so that the logs will be preserved. (Default).

  • All - All (including completed) pods will be deleted immediately when the job finishes.

  • None - No pods will be deleted when the job completes. It will keep running pods that consume GPU, CPU and memory over time. It is recommended to set to None only for debugging and obtaining logs from running pods.

Distributed

completions

Used with Hyperparameter Optimization. Specifies the number of successful pods the job should reach to be completed. The Job is marked as successful once the specified amount of pods has succeeded.

  • Workspace

  • Training

parallelism

Used with Hyperparameter Optimization. Specifies the maximum desired number of pods the workload should run at any given time.

  • Workspace

  • Training

exposeUrls

Specifies a set of URLs (e.g. ingress) exposed from the container running the created workload.

  • Workspace

  • Training

largeShmRequest

Specifies a large /dev/shm device to mount into a container running the created workload. SHM is a shared file system mounted on RAM.

  • Workspace

  • Training

PodAffinitySchedulingRule

Indicates whether to apply the Pod affinity rule as the “hard” (required) or the “soft” (preferred) option. This field can be specified only if PodAffinity is set to true.

  • Workspace

  • Training

podAffinityTopology

Specifies the Pod Affinity Topology to be used for scheduling the job. This field can be specified only if PodAffinity is set to true.

  • Workspace

  • Training

ports

Specifies a set of ports exposed from the container running the created workload. More information in Ports fields below.

  • Workspace

  • Training

probes

Specifies the ReadinessProbe to use to determine if the container is ready to accept traffic. More information in the Probes Fields section below

-

  • Workspace

  • Training

tolerations

Toleration rules which apply to the pods running the workload. Toleration rules guide (but do not require) the system to which node each pod can be scheduled to or evicted from, based on matching between those rules and the set of taints defined for each Kubernetes node.

  • Workspace

  • Training

priorityClass

Priority class of the workload. The values for workspace are build (default) or interactive-preemptible. For training only, use train. Enum: "build", "train", "interactive-preemptible"

Workspace

storage

Contains all the fields related to storage configurations. More information in the Storage fields section below.

-

  • Workspace

  • Training

security

Contains all the fields related to security configurations. More information in the Security Fields section below.

-

  • Workspace

  • Training

compute

Contains all the fields related to compute configurations. More information in the Compute Fields section below.

-

  • Workspace

  • Training

Ports Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

container

The port that the container running the workload exposes.

  • Workspace

  • Training

serviceType

Specifies the default service exposure method for ports. The default is used for ports that do not specify a service type. Options are: LoadBalancer, NodePort or ClusterIP. For more information, see the guide.

  • Workspace

  • Training

external

The external port which allows a connection to the container port. If not specified, the port is auto-generated by the system.

  • Workspace

  • Training

toolType

The tool type that runs on this port.

  • Workspace

  • Training

toolName

A name describing the tool that runs on this port.

  • Workspace

  • Training

Probes Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

readiness

Specifies the Readiness Probe to use to determine if the container is ready to accept traffic.

-

  • Workspace

  • Training

Readiness Field Details

  • Description: Specifies the Readiness Probe to use to determine if the container is ready to accept traffic

  • Supported NVIDIA Run:ai workload types: Workspace, Training

  • Value type: itemized

  • Example workload snippet:

defaults:
   probes:
     readiness:
         initialDelaySeconds: 2
Spec readiness fields
Description
Value type

initialDelaySeconds

Number of seconds after the container has started before liveness or readiness probes are initiated.

periodSeconds

How often (in seconds) to perform the probe

timeoutSeconds

Number of seconds after which the probe times out

successThreshold

Minimum consecutive successes for the probe to be considered successful after having failed

failureThreshold

When a probe fails, the number of times to try before giving up

Security Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

uidGidSource

Indicates the way to determine the user and group ids of the container. The options are:

  • fromTheImage - user and group IDs are determined by the docker image that the container runs. This is the default option.

  • custom - user and group IDs can be specified in the environment asset and/or the workspace creation request.

  • idpToken - user and group IDs are determined according to the identity provider (idp) access token. This option is intended for internal use of the environment UI form. For more information, see .

  • Workspace

  • Training

capabilities

The capabilities field allows adding a set of unix capabilities to the container running the workload. Capabilities are Linux distinct privileges traditionally associated with superuser which can be independently enabled and disabled

  • Workspace

  • Training

seccompProfileType

Indicates which kind of seccomp profile is applied to the container. The options are:

  • RuntimeDefault - the container runtime default profile should be used

  • Unconfined - no profile should be applied

  • Workspace

  • Training

runAsNonRoot

Indicates that the container must run as a non-root user.

  • Workspace

  • Training

readOnlyRootFilesystem

If true, mounts the container's root filesystem as read-only.

  • Workspace

  • Training

runAsUid

Specifies the Unix user id with which the container running the created workload should run.

  • Workspace

  • Training

runAsGid

Specifies the Unix Group ID with which the container should run.

  • Workspace

  • Training

supplementalGroups

Comma separated list of groups that the user running the container belongs to, in addition to the group indicated by runAsGid.

  • Workspace

  • Training

allowPrivilegeEscalation

Allows the container running the workload and all launched processes to gain additional privileges after the workload starts

  • Workspace

  • Training

hostIpc

Whether to enable hostIpc. Defaults to false.

  • Workspace

  • Training

hostNetwork

Whether to enable host network.

  • Workspace

  • Training
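
The security fields above can be combined into policy defaults and rules. The following minimal sketch mirrors the runAsUid examples later in this document; the remaining values, and the assumption that boolean security fields support canEdit, are illustrative.

defaults:
  security:
    uidGidSource: custom    # allow custom user and group IDs
    runAsNonRoot: true
    runAsUid: 1000
    runAsGid: 1000
rules:
  security:
    runAsUid:
      min: 500              # as in the Rules example below
    allowPrivilegeEscalation:
      canEdit: false        # assumed: canEdit applies to boolean fields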

Compute Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

cpuCoreRequest

CPU units to allocate for the created workload (0.5, 1, etc.). The workload receives at least this amount of CPU. Note that the workload is not scheduled unless the system can guarantee this amount of CPUs to the workload.

  • Workspace

  • Training

cpuCoreLimit

Limitations on the number of CPUs consumed by the workload (0.5, 1, etc.). The system guarantees that this workload is not able to consume more than this amount of CPUs.

  • Workspace

  • Training

cpuMemoryRequest

The amount of CPU memory to allocate for this workload (1G, 20M, etc.). The workload receives at least this amount of memory. Note that the workload is not scheduled unless the system can guarantee this amount of memory to the workload.

  • Workspace

  • Training

cpuMemoryLimit

Limitations on the CPU memory to allocate for this workload (1G, 20M, etc.). The system guarantees that this workload is not able to consume more than this amount of memory. The workload receives an error when trying to allocate more memory than this limit.

  • Workspace

  • Training

largeShmRequest

Whether to mount a large /dev/shm device into the container running the created workload (shm is a shared file system mounted on RAM).

  • Workspace

  • Training

gpuRequestType

Sets the unit type for GPU resource requests to either portion, memory, or migProfile. The request type can be set only when gpuDeviceRequest = 1.

  • Workspace

  • Training

gpuPortionRequest

Specifies the fraction of a GPU to allocate to the workload, between 0 and 1. For backward compatibility, it also supports whole numbers of GPU devices larger than 1, which are now provided using the gpuDeviceRequest field.

  • Workspace

  • Training

gpuDeviceRequest

Specifies the number of GPU devices to allocate for the created workload. gpuRequestType can be defined only when gpuDeviceRequest = 1.

  • Workspace

  • Training

gpuPortionLimit

When a fraction of a GPU is requested, the GPU limit specifies the portion limit to allocate to the workload. The range of the value is from 0 to 1.

  • Workspace

  • Training

gpuMemoryRequest

Specifies GPU memory to allocate for the created workload. The workload receives this amount of memory. Note that the workload is not scheduled unless the system can guarantee this amount of GPU memory to the workload.

  • Workspace

  • Training

gpuMemoryLimit

Specifies a limit on the GPU memory to allocate for this workload. Should be no less than gpuMemoryRequest.

  • Workspace

  • Training

extendedResources

Specifies values for extended resources. Extended resources are third-party devices (such as high-performance NICs, FPGAs, or InfiniBand adapters) that you want to allocate to your Job.

  • Workspace

  • Training
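
As a rough sketch of how the compute fields above can be given policy defaults and constraints (all values are illustrative; the gpuDevicesRequest spelling follows the rule examples later in this document):

defaults:
  compute:
    cpuCoreRequest: 0.5
    cpuMemoryRequest: 1G
    gpuDevicesRequest: 1
    gpuRequestType: portion
    gpuPortionRequest: 0.5
rules:
  compute:
    gpuDevicesRequest:
      max: 8                 # as in the Rules example below
    cpuCoreLimit:
      required: true         # submission must specify a CPU limit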

Storage Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

dataVolume

Set of data volumes to use in the workload. Each data volume is mapped to a file-system mount point within the container running the workload.

  • Workspace

  • Training

hostPath

Maps a folder to a file-system mount point within the container running the workload.

  • Workspace

  • Training

git

Details of the git repository and items mapped to it.

  • Workspace

  • Training

pvc

Specifies persistent volume claims to mount into a container running the created workload.

  • Workspace

  • Training

nfs

Specifies NFS volume to mount into the container running the workload.

  • Workspace

  • Training

s3

Specifies S3 buckets to mount into the container running the workload.

  • Workspace

  • Training

configMapVolumes

Specifies ConfigMaps to mount as volumes into a container running the created workload.

  • Workspace

  • Training

secretVolume

Set of secret volumes to use in the workload. A secret volume maps a secret resource in the cluster to a file-system mount point within the container running the workload.

  • Workspace

  • Training

hostPath Field Details

  • Description: Maps a folder to a file-system mount point within the container running the workload

  • Supported NVIDIA Run:ai workload types: Workspace, Training

  • Value type: itemized

  • Example workload snippet:

defaults:
  storage:
    hostPath:
      instances:
        - path: h3-path-1
          mountPath: h3-mount-1
        - path: h3-path-2
          mountPath: h3-mount-2
      attributes:
        readOnly: true
hostPath fields
Description
Value type

name

Unique name to identify the instance. Primarily used for policy locked rules.

path

Local path on the host to which the host volume is mapped.

readOnly

Force the volume to be mounted with read-only permissions. Defaults to false

mountPath

The path that the host volume is mounted to when in use.

mountPropagation

Share this volume mount with other containers. If set to HostToContainer, this volume mount receives all subsequent mounts that are mounted to this volume or any of its subdirectories. In case of multiple hostPath entries, this field should have the same value for all of them. Enum:

  • "None"

  • "HostToContainer"

Git Field Details

  • Description: Details of the git repository and items mapped to it

  • Supported NVIDIA Run:ai workload types: Workspace, Training

  • Value type: itemized

  • Example workload snippet:

defaults:
  storage:
    git:
      attributes:
        repository: https://runai.public.github.com
      instances:
        - branch: "master"
          path: /container/my-repository
          passwordSecret: my-password-secret
Git fields
Description
Value type

repository

URL to a remote git repository. The content of this repository is mapped to the container running the workload

revision

Specific revision to synchronize the repository from

path

Local path within the workspace to which the git repository is mapped

secretName

Optional name of Kubernetes secret that holds your git username and password

username

If secretName is provided, this field should contain the key, within the provided Kubernetes secret, which holds the value of your git username. Otherwise, this field should specify your git username in plain text (example: myuser).

PVC Field Details

  • Description: Specifies persistent volume claims to mount into a container running the created workload

  • Supported NVIDIA Run:ai workload types: Workspace, Training

  • Value type: itemized

  • Example workload snippet:

defaults:
  storage:
    pvc:
      instances:
        - claimName: pvc-staging-researcher1-home
          existingPvc: true
          path: /myhome
          readOnly: false
          claimInfo:
            accessModes:
              readWriteMany: true
Spec PVC fields
Description
Value type

claimName (mandatory)

A given name for the PVC. Allows referencing it across workspaces.

ephemeral

Use true to set PVC to ephemeral. If set to true, the PVC is deleted when the workspace is stopped.

path

Local path within the workspace to which the PVC is mapped

readOnly

Permits read-only access to the PVC, preventing additions or modifications to its content

readWriteOnce

Requesting claim that can be mounted in read/write mode to exactly 1 host. If none of the modes are specified, the default is readWriteOnce.

size

Requested size for the PVC. Mandatory when existingPvc is false

storageClass

Storage class name to associate with the PVC. This parameter may be omitted if there is a single storage class in the system, or you are using the default storage class. Further details at Kubernetes storage classes.

readOnlyMany

Requesting claim that can be mounted in read-only mode to many hosts

readWriteMany

Requesting claim that can be mounted in read/write mode to many hosts

NFS Field Details

  • Description: Specifies NFS volume to mount into the container running the workload

  • Supported NVIDIA Run:ai workload types: Workspace, Training

  • Value type: itemized

  • Example workload snippet:

defaults:
 storage:
   nfs:
     instances:
       - path: nfs-path
         readOnly: true
         server: nfs-server
         mountPath: nfs-mount
rules:
  storage:
    nfs:
      instances:
        canAdd: false
nfs fields
Description
Value type

mountPath

The path that the NFS volume is mounted to when in use

path

Path that is exported by the NFS server

readOnly

Whether to force the NFS export to be mounted with read-only permissions

nfsServer

The hostname or IP address of the NFS server

S3 Field Details

  • Description: Specifies S3 buckets to mount into the container running the workload

  • Supported NVIDIA Run:ai workload types: Workspace, Training

  • Value type: itemized

  • Example workload snippet:

defaults:
  storage:
    s3:
      instances:
        - bucket: bucket-opt-1
          path: /s3/path
          accessKeySecret: s3-access-key
          secretKeyOfAccessKeyId: s3-secret-id
          secretKeyOfSecretKey: s3-secret-key
      attributes:
        url: https://amazonaws.s3.com
s3 fields
Description
Value type

bucket

The name of the bucket

path

Local path within the workspace to which the S3 bucket is mapped

url

The URL of the S3 service provider. The default is the URL of the Amazon AWS S3 service

Value Types

Each field has a specific value type. The following value types are supported.

Value type
Description
Supported rule type
Defaults

Boolean

A binary value that can be either True or False

true/false

String

A sequence of characters used to represent text. It can include letters, numbers, symbols, and spaces

abc

Itemized

An ordered collection of items (objects); all items in the list are of the same type. For further information, see the Itemized chapter below the table.

See below

Integer

An Integer is a whole number without a fractional component.

100

Number

A numeric value that can have a fractional (non-integer) component

10.3

Quantity

Holds a string composed of a number and a unit representing a quantity

5M

Array

Set of values that are treated as one, as opposed to Itemized in which each item can be referenced separately.

  • node-a

  • node-b

  • node-c
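
To make these value types concrete, the following minimal defaults sketch assigns an illustrative value to one field of each basic type, using field names documented above; itemized fields are covered in the chapter that follows.

defaults:
  imagePullPolicy: Always    # string
  compute:
    largeShmRequest: true    # boolean
    gpuDevicesRequest: 1     # integer
    cpuCoreRequest: 0.5      # number
    cpuMemoryRequest: 1G     # quantity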

Itemized

Workload fields of type itemized can have multiple instances; however, in comparison to arrays, each instance can be referenced by a key field. The key field is defined for each such field.

Consider the following workload spec:

spec:
  image: ubuntu
  compute:
    extendedResources:
      - resource: added/cpu
        quantity: 10
      - resource: added/memory
        quantity: 20M

In this example, extendedResources has two instances, each with two attributes: resource (the key attribute) and quantity.

In a policy, the defaults and rules for itemized fields have two subsections:

  • Instances: default items to be added to the policy or rules which apply to an instance as a whole.

  • Attributes: defaults for attributes within an item or rules which apply to attributes within each item.

Consider the following example:

defaults:
  compute:
    extendedResources:
      instances: 
        - resource: default/cpu
          quantity: 5
        - resource: default/memory
          quantity: 4M
      attributes:
        quantity: 3
rules:
  compute:
    extendedResources:
      instances:
        locked: 
          - default/cpu
      attributes:
        quantity: 
          required: true

Assume the following workload submission is requested:

spec:
  image: ubuntu
  compute:
    extendedResources:
      - resource: default/memory
        exclude: true
      - resource: added/cpu
      - resource: added/memory
        quantity: 5M

The effective policy for the above-mentioned workload has the following extendedResources instances:

Resource
Source of the instance
Quantity
Source of the attribute quantity

default/cpu

Policy defaults

5

The default of this instance in the policy defaults section

added/cpu

Submission request

3

The default of the quantity attribute from the attributes section

added/memory

Submission request

5M

Submission request

Note

The default/memory resource is not populated to the workload because it has been excluded from the workload using “exclude: true”.

A workload submission request cannot exclude the default/cpu resource, as this key is included in the locked rules under the instances section.

Rule Types

Rule types
Description
Supported value types
Rule type example

canAdd

Whether the submission request can add items to an itemized field other than those listed in the policy defaults for this field.

storage:
  hostPath:
    instances:
      canAdd: false

locked

Set of items that the workload is unable to modify or exclude. In this example, a workload policy default is given to HOME and USER, which the submission request cannot modify or exclude from the workload.

storage:
  hostPath:
    instances:
      locked:
        - HOME
        - USER

canEdit

Whether the submission request can modify the policy default for this field. In this example, it is assumed that the policy has a default for imagePullPolicy. As canEdit is set to false, submission requests cannot alter this default.

imagePullPolicy:
  canEdit: false

required

When set to true, the workload must have a value for this field. The value can be obtained from policy defaults. If no value is specified in the policy defaults, a value must be specified for this field in the submission request.

image:
  required: true

min

The minimal value for the field

compute:
  gpuDevicesRequest:
    min: 3

max

The maximal value for the field

compute:
  gpuMemoryRequest:
    max: 2G

step

The allowed gap between values for this field. In this example the allowed values are: 1, 3, 5, 7

compute:
  cpuCoreRequest:
    min: 1
    max: 7
    step: 2

options

Set of allowed values for this field

image:
  options:
    - value: image-1
    - value: image-2

defaultFrom

Set a default value for a field that will be calculated based on the value of another field

cpuCoreRequest:
  defaultFrom:
    field: compute.cpuCoreLimit
    factor: 0.5
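
The following minimal sketch combines several of the rule types above into a single policy, reusing the per-rule examples from this table; the values are illustrative.

rules:
  image:
    required: true
    options:
      - value: image-1
      - value: image-2
  imagePullPolicy:
    canEdit: false
  compute:
    cpuCoreRequest:
      min: 1
      max: 7
      step: 2
    gpuDevicesRequest:
      min: 3
  storage:
    hostPath:
      instances:
        canAdd: false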

Policy Spec Sections

For each field of a specific policy, you can specify both rules and defaults. A policy spec consists of the following sections:

  • Rules

  • Defaults

  • Imposed Assets

Rules

Rules set up constraints on workload policy fields. For example, consider the following policy:

rules:
  compute:
    gpuDevicesRequest: 
      max: 8
  security:
    runAsUid: 
      min: 500

Such a policy restricts the maximum value of gpuDevicesRequest to 8, and sets the minimum value of runAsUid (in the security section) to 500.

Defaults

The defaults section is used for providing defaults for various workload fields. For example, consider the following policy:

defaults:
  imagePullPolicy: Always
  security:
    runAsNonRoot: true
    runAsUid: 500

Assume a submission request with the following values:

  • Image: ubuntu

  • runAsUid: 501

The effective workload that runs has the following set of values:

Field
Value
Source

Image

ubuntu

Submission request

ImagePullPolicy

Always

Policy defaults

security.runAsNonRoot

true

Policy defaults

security.runAsUid

501

Submission request

Note

It is possible to specify a rule for each field, which states if a submission request is allowed to change the policy default for that given field, for example:

defaults:
  imagePullPolicy: Always
  security:
    runAsNonRoot: true
    runAsUid: 500
rules:
  security:
    runAsUid:
      canEdit: false

If this policy is applied, the submission request above fails, as it attempts to change the value of security.runAsUid from 500 (the policy default) to 501 (the value provided in the submission request), which is forbidden because the canEdit rule is set to false for this field.

Imposed Assets

Default instances of a storage field can be provided using a data source asset containing the details of the storage instance. To add such instances to the policy, specify the asset IDs in the imposedAssets section of the policy.

defaults: null
rules: null
imposedAssets:
  - f12c965b-44e9-4ff6-8b43-01d8f9e630cc

Assets that reference credential assets (for example, a private S3 data source that references an AccessKey asset) cannot be used as imposedAssets.
