This article explains how to designate specific node roles in a Kubernetes cluster to ensure optimal performance and reliability in production deployments.
For optimal performance in production clusters, avoid running CPU-intensive services and workloads on GPU nodes where possible. This can be done by ensuring the following:
NVIDIA Run:ai system-level services run on dedicated CPU-only nodes.
Workloads that do not request GPU resources (e.g., CPU-only machine learning jobs) are executed on CPU-only nodes.
NVIDIA Run:ai services are scheduled on the defined node roles by applying Kubernetes Node Affinity using node labels.
To perform these tasks, make sure to install the NVIDIA Run:ai Administrator CLI.
The following node roles can be configured on the cluster:
System node: Reserved for NVIDIA Run:ai system-level services.
GPU Worker node: Dedicated for GPU-based workloads.
CPU Worker node: Used for CPU-only workloads.
NVIDIA Run:ai system nodes run system-level services required to operate. This can be done via the (recommended) or via NVIDIA Run:ai .
By default, NVIDIA Run:ai applies a node affinity rule to prefer nodes that are labeled with node-role.kubernetes.io/runai-system for system services scheduling. You can modify the default node affinity rule by:
Editing the spec.global.affinity configuration parameter as detailed in .
Editing the global.affinity configuration as detailed in for self-hosted deployments.
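For illustration, the following is a minimal runaiconfig sketch that keeps the default behavior of preferring nodes labeled node-role.kubernetes.io/runai-system. It assumes spec.global.affinity accepts a standard Kubernetes affinity block, so verify the exact schema against your cluster's runaiconfig CRD before applying:
spec:
  global:
    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
                - key: node-role.kubernetes.io/runai-system
                  operator: Exists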
To set a system role for a node in your Kubernetes cluster using Kubectl, follow these steps:
Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.
Run one of the following commands to label the node with its role:
To set a system role for a node in your Kubernetes cluster via the NVIDIA Run:ai Administrator CLI, follow these steps:
Run the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.
Run one of the following commands to set or remove a node’s role:
The set node-role command will label the node and set relevant cluster configurations.
NVIDIA Run:ai worker nodes run user-submitted workloads and the system-level services required to operate them. This can be managed via the (recommended) or via NVIDIA Run:ai .
By default, GPU workloads are scheduled on GPU nodes based on the nvidia.com/gpu.present label. When global.nodeAffinity.restrictScheduling is set to true via the :
GPU Workloads are scheduled with node affinity rule to require nodes that are labeled with node-role.kubernetes.io/runai-gpu-worker
CPU-only Workloads are scheduled with node affinity rule to require nodes that are labeled with node-role.kubernetes.io/runai-cpu-worker
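For reference, enabling this restriction directly on the runaiconfig resource can look like the sketch below, which mirrors the kubectl patch pattern used elsewhere in this guide; adapt it if you manage cluster configuration through Helm or GitOps instead:
kubectl patch RunaiConfig runai -n runai --type="merge" \
  -p '{"spec":{"global":{"nodeAffinity":{"restrictScheduling":true}}}}'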
To set a worker role for a node in your Kubernetes cluster using Kubectl, follow these steps:
Validate that global.nodeAffinity.restrictScheduling is set to true in the cluster’s .
Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.
Run one of the following commands to label the node with its role. Replace the label and value (true/false) to enable or disable GPU/CPU roles as needed:
To set a worker role for a node in your Kubernetes cluster via the NVIDIA Run:ai Administrator CLI, follow these steps:
Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.
Run one of the following commands to set or remove a node’s role. <node-role> must be either --gpu-worker or --cpu-worker:
The set node-role command will label the node and set cluster configuration global.nodeAffinity.restrictScheduling true.
kubectl label nodes <node-name> node-role.kubernetes.io/runai-system=true
kubectl label nodes <node-name> node-role.kubernetes.io/runai-system=false
runai-adm set node-role --runai-system-worker <node-name>
runai-adm remove node-role --runai-system-worker <node-name>
runai-adm set node-role <node-role> <node-name>
runai-adm remove node-role <node-role> <node-name>
kubectl label nodes <node-name> node-role.kubernetes.io/runai-gpu-worker=true
kubectl label nodes <node-name> node-role.kubernetes.io/runai-cpu-worker=false
NVIDIA Run:ai supports service mesh implementations. When a service mesh is deployed with sidecar injection, specific configurations must be applied to ensure compatibility with NVIDIA Run:ai. This document outlines the required changes for the NVIDIA Run:ai control plane and cluster.
By default, NVIDIA Run:ai prevents Istio from injecting sidecar containers into system jobs in the control plane. For other service mesh solutions, users must manually add annotations during installation.
To disable sidecar injection in the NVIDIA Run:ai control plane, modify the Helm values file by adding the required pod labels to the following components. See for more details.
Example for :
Sidecar containers injected by some service mesh solutions can prevent NVIDIA Run:ai installation hooks from completing. To avoid this, modify the Helm installation command to include the required labels or annotations:
Example for :
To prevent sidecar injection in workloads created at runtime (such as training workloads), update the runaiconfig resource. See for more details:
authorizationMigrator:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
clusterMigrator:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
identityProviderReconciler:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
keepPVC:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
orgUnitsMigrator:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
helm upgrade -i ... \
  --set global.additionalJobLabels.A=B --set global.additionalJobAnnotations.A=B
helm upgrade -i ... \
  --set-json global.additionalJobLabels='{"sidecar.istio.io/inject": "false"}'
spec:
  workload-controller:
    additionalPodLabels:
      sidecar.istio.io/inject: "false"
Karpenter is an open-source Kubernetes cluster autoscaler built for cloud deployments. Karpenter optimizes the cloud cost of a customer’s cluster by moving workloads between different node types, consolidating workloads into fewer nodes, using lower-cost nodes where possible, scaling up new nodes when needed, and shutting down unused nodes.
Karpenter’s main goal is cost optimization. Unlike Karpenter, NVIDIA Run:ai’s Scheduler optimizes for fairness and resource utilization. Therefore, there are a few potential friction points when using both on the same cluster.
Karpenter looks for “unschedulable” pending workloads and may try to scale up new nodes to make those workloads schedulable. However, in some scenarios, these workloads may exceed their quota parameters, and the NVIDIA Run:ai Scheduler will put them into a pending state.
Karpenter is not aware of the NVIDIA Run:ai fractions mechanism and may try to interfere incorrectly.
Karpenter preempts any type of workload (i.e., high-priority, non-preemptible workloads will potentially be interrupted and moved to save cost).
Karpenter has no pod-group (i.e., workload) notion or gang scheduling awareness, meaning that Karpenter is unaware that a set of “arbitrary” pods is a single workload. This may cause Karpenter to schedule those pods into different node pools (in the case of multi-node-pool workloads) or scale up or down a mix of wrong nodes.
NVIDIA Run:ai Scheduler mitigates the friction points using the following techniques (each numbered bullet below corresponds to the related friction point listed above):
Karpenter uses a “nominated node” to recommend a node for the Scheduler. The NVIDIA Run:ai Scheduler treats this as a “preferred” recommendation, meaning it will try to use this node, but it’s not required and it may choose another node.
Fractions - Karpenter won’t consolidate nodes with one or more pods that cannot be moved. The NVIDIA Run:ai reservation pod is marked as ‘do not evict’ to allow the NVIDIA Run:ai Scheduler to control the scheduling of fractions.
Non-preemptible workloads - NVIDIA Run:ai marks non-preemptible workloads as ‘do not evict’ and Karpenter respects this annotation.
NVIDIA Run:ai node pools (single-node-pool workloads) - Karpenter respects the ‘node affinity’ that NVIDIA Run:ai sets on a pod, so Karpenter uses the node affinity for its recommended node. For the gang-scheduling/pod-group (workload) notion, NVIDIA Run:ai Scheduler considers Karpenter directives as preferred recommendations rather than mandatory instructions and overrides Karpenter instructions where appropriate.
Using multi-node-pool workloads
Workloads may include a list of optional node pools. Karpenter is not aware that only a single node pool should be selected out of that list for the workload. It may therefore recommend putting pods of the same workload into different node pools and may scale up nodes from different node pools to serve a “multi-node-pool” workload instead of nodes on the selected single node pool.
If this becomes an issue (i.e., if Karpenter scales up the wrong node types), users can set an inter-pod affinity using the node pool label or another common label as a ‘topology’ identifier. This will force Karpenter to choose nodes from a single-node pool per workload, selecting from any of the node pools listed as allowed by the workload.
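As an illustration only, such an inter-pod affinity has the general shape shown below; the workload label selector and the node pool label key are placeholders for the labels actually present on your pods and nodes:
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            <workload-label-key>: <workload-label-value>   # a label shared by all pods of the workload
        topologyKey: <node-pool-label-key>                 # the node label identifying the node pool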
An alternative approach is to use a single-node pool for each workload instead of multi-node pools.
Consolidation
To make Karpenter more effective when using its consolidation function, users should consider separating preemptible and non-preemptible workloads, either by using node pools, node affinities, taint/tolerations, or inter-pod anti-affinity.
If users don’t separate preemptible and non-preemptible workloads (i.e., make them run on different nodes), Karpenter’s ability to consolidate (bin-pack) and shut down nodes will be reduced, but it is still effective.
Conflicts between bin-packing and spread policies
If NVIDIA Run:ai is used with a scheduling spread policy, it will clash with Karpenter’s default bin-packing/consolidation policy, and the outcome may be a deployment that is not optimized for either policy.
Usually spread is used for Inference, which is non-preemptible and therefore not controlled by Karpenter (NVIDIA Run:ai Scheduler will mark those workloads as ‘do not evict’ for Karpenter), so this should not present a real deployment issue for customers.
The NVIDIA Run:ai control plane installation can be customized to support your environment via Helm values files or Helm install flags. Make sure to restart the relevant NVIDIA Run:ai pods so they can fetch the new configurations.
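For example, a values-file-based customization might look like the following sketch; the release name, chart reference, and namespace are placeholders for your own installation details:
helm upgrade -i <release-name> <control-plane-chart> \
  -n <control-plane-namespace> \
  -f custom-values.yaml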
The NVIDIA Run:ai control plane chart includes multiple sub-charts of third-party components:
Data store - PostgreSQL (postgresql)
Metrics store - Thanos (thanos)
Identity & Access Management - Keycloak (keycloakx)
If you have opted to connect to an , refer to the additional configurations table below. Adjust the following parameters based on your connection details:
Disable PostgreSQL deployment - postgresql.enabled
NVIDIA Run:ai connection details - global.postgresql.auth
Grafana connection details - grafana.dbUser, grafana.dbPassword
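A minimal sketch of the corresponding Helm values, using placeholder connection details (all keys are listed in the configurations table below):
postgresql:
  enabled: false              # do not deploy the bundled PostgreSQL
global:
  postgresql:
    auth:
      host: <external-postgres-host>
      port: 5432
      username: <runai-db-user>
      password: <runai-db-password>
grafana:
  dbUser: <grafana-db-user>
  dbPassword: <grafana-db-password>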
The keycloakx.adminUser can only be set during the initial installation. The admin password can be changed later through the Keycloak UI, but you must also update the keycloakx.adminPassword value in the Helm chart using helm upgrade. See for more details.
You can change the Keycloak admin password after deployment by performing the following steps:
Open the Keycloak UI at: https://<runai-domain>/auth
Sign in with your existing admin credentials as configured in your Helm values
Go to Users and select admin (or your admin username)
Open Credentials →
Researchers may need to access containers remotely during workload execution. Common use cases include:
Running a Jupyter Notebook inside the container
Connecting PyCharm for remote Python development
Viewing machine learning visualizations using TensorBoard
To enable this access, you must expose the relevant container ports.
Accessing the containers remotely requires exposing container ports. In Docker, ports are exposed by declaring them when launching the container. NVIDIA Run:ai provides similar functionality within a Kubernetes environment.
Since Kubernetes abstracts the container's physical location, exposing ports is more complex. Kubernetes supports multiple methods for exposing container ports. For more details, refer to the Kubernetes services and networking documentation.
Port Forwarding
Simple port forwarding allows access to the container via local and/or remote port.
Supported natively via Kubernetes
NodePort
Exposes the service on each Node’s IP at a static port (the NodePort). You’ll be able to contact the NodePort service from outside the cluster by requesting <NODE-IP>:<NODE-PORT> regardless of which node the container actually resides in.
Supported
LoadBalancer
Exposes the service externally using a cloud provider’s load balancer.
Supported via API with limited capabilities
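As a simple illustration of the port forwarding option, plain Kubernetes port forwarding to a workload's pod looks like the sketch below; the namespace, pod name, and port numbers are placeholders:
kubectl port-forward -n <project-namespace> pod/<workload-pod-name> 8888:8888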
Many tools used by researchers, such as Jupyter, TensorBoard, or VSCode, require remote access to the running workload's container. In NVIDIA Run:ai, this access is provided through dynamically generated URLs.
By default, NVIDIA Run:ai uses the Cluster URL provided to dynamically create SSL-secured URLs in the following format:
While path-based routing works with applications such as Jupyter Notebooks, it may not be compatible with other applications. Some applications assume they are running at the root file system, so hardcoded file paths and settings within the container may become invalid when running at a path other than the root. For example, if an application expects to access /etc/config.json but is served at /project-name/workspace-name, the file will not be found. This can cause the container to fail or not function as intended.
NVIDIA Run:ai provides support for host-based routing. When enabled, URLs follow the format:
This allows all workloads to run at the root path, avoiding file path issues and ensuring proper application behavior.
To enable host-based routing, perform the following steps:
Create a second DNS entry (A record) for *.<CLUSTER_URL>, pointing to the same IP as the cluster's Fully Qualified Domain Name (FQDN).
Obtain a wildcard SSL certificate for this second DNS entry.
Add the certificate as a secret:
Create the following ingress rule and replace <CLUSTER_URL>:
Run the following:
Edit runaiconfig to generate the URLs correctly:
Once these requirements have been met, all workloads will automatically be assigned a secured URL with a subdomain, ensuring full functionality for all researcher applications.
https://<CLUSTER_URL>/project-name/workload-name
https://project-name-workload-name.<CLUSTER_URL>/
# Replace /path/to/fullchain.pem and /path/to/private.pem with the actual paths to your TLS certificate and private key
kubectl create secret tls runai-cluster-domain-star-tls-secret -n runai \
  --cert /path/to/fullchain.pem \
  --key /path/to/private.pem
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: runai-cluster-domain-star-ingress
  namespace: runai
spec:
  ingressClassName: nginx
  rules:
    - host: '*.<CLUSTER_URL>'
  tls:
    - hosts:
        - '*.<CLUSTER_URL>'
      secretName: runai-cluster-domain-star-tls-secret
kubectl apply -f <filename>
kubectl patch RunaiConfig runai -n runai --type="merge" \
  -p '{"spec":{"global":{"subdomainSupport": true}}}'
Analytics Dashboard - Grafana (grafana)
Caching, Queue - NATS (nats)
global.postgresql.auth.password
PostgreSQL password
Password for the PostgreSQL user specified by global.postgresql.auth.username.
global.postgresql.auth.postgresPassword
PostgreSQL default admin password
Password for the built-in PostgreSQL superuser (postgres).
global.postgresql.auth.existingSecret
Postgres Credentials (secret)
Existing secret name with authentication credentials.
global.postgresql.auth.dbSslMode
Postgres connection SSL mode
Set the SSL mode. See the full list in . Prefer mode is not supported.
postgresql.primary.initdb.password
PostgreSQL default admin password
Set the same password as in global.postgresql.auth.postgresPassword (if changed).
postgresql.primary.persistence.storageClass
Storage class
Configures the installation to use a specific storage class instead of the default one.
Set the new password and click Save
Update the keycloakx.adminPassword value using the helm upgrade command to match the password you set in the Keycloak UI
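For example, the upgrade might look like the following sketch with placeholder values; keep the rest of your existing values unchanged:
helm upgrade -i <release-name> <control-plane-chart> -n <control-plane-namespace> \
  -f custom-values.yaml \
  --set keycloakx.adminPassword=<new-admin-password>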
grafana.adminUser
Grafana username
Override the NVIDIA Run:ai default user name for accessing Grafana.
grafana.adminPassword
Grafana password
Override the NVIDIA Run:ai default password for accessing Grafana.
global.ingress.ingressClass
Ingress class
NVIDIA Run:ai uses NGINX as the default ingress controller. If your cluster has a different ingress controller, you can configure the ingress class to be created by NVIDIA Run:ai.
global.ingress.tlsSecretName
TLS secret name
NVIDIA Run:ai requires the creation of a secret with the domain certificate. If the runai-backend namespace already has such a secret, you can set the secret name here.
<service-name>.podLabels
Pod labels
Set NVIDIA Run:ai and 3rd party services' Pod Labels in a format of key/value pairs.
<service-name>
resources:
limits:
cpu: 500m
memory: 512Mi
requests:
cpu: 250m
memory: 256Mi
Pod request and limits
Set NVIDIA Run:ai and 3rd party services' resources
disableIstioSidecarInjection.enabled
Disable Istio sidecar injection
Disable the automatic injection of Istio sidecars across the entire NVIDIA Run:ai Control Plane services.
global.affinity
System nodes
Sets the system nodes where NVIDIA Run:ai system-level services are scheduled.
Default: Prefer to schedule on nodes that are labeled with node-role.kubernetes.io/runai-system
global.customCA.enabled
Certificate authority
Enables the use of a custom Certificate Authority (CA) in your deployment. When set to true, the system is configured to trust a user-provided CA certificate for secure communication.
postgresql.enabled
PostgreSQL installation
If set to false, PostgreSQL will not be installed.
global.postgresql.auth.host
PostgreSQL host
Hostname or IP address of the PostgreSQL server.
global.postgresql.auth.port
PostgreSQL port
Port number on which PostgreSQL is running.
global.postgresql.auth.username
PostgreSQL username
Username for connecting to PostgreSQL.
thanos.receive.persistence.storageClass
Storage class
Configures the installation to use a specific storage class instead of the default one.
keycloakx.adminUser
User name of the internal identity provider administrator
Defines the username for the Keycloak administrator. This can only be set during the initial installation.
keycloakx.adminPassword
Password of the internal identity provider administrator
Defines the password for the Keycloak administrator.
keycloakx.existingSecret
Keycloakx credentials (secret)
Existing secret name with authentication credentials.
global.keycloakx.host
Keycloak (NVIDIA Run:ai internal identity provider) host path
Overrides the DNS for Keycloak. This can be used to access Keycloak from outside the cluster.
grafana.db.existingSecret
Grafana database connection credentials (secret)
Existing secret name with authentication credentials.
grafana.dbUser
Grafana database username
Username for accessing the Grafana database.
grafana.dbPassword
Grafana database password
Password for the Grafana database user.
grafana.admin.existingSecret
Grafana admin default credentials (secret)
Existing secret name with authentication credentials.
Support for third-party integrations varies. When noted below, the integration is supported out of the box with NVIDIA Run:ai. For other integrations, our Customer Success team has prior experience assisting customers with setup. In many cases, the NVIDIA Enterprise Support Portal may include additional reference documentation provided on an as-is basis.
Kubernetes has several built-in resources that encapsulate running Pods. These are called Kubernetes Workloads and should not be confused with NVIDIA Run:ai Workloads.
Examples of such resources are a Deployment that manages a stateless application, or a Job that runs tasks to completion.
An NVIDIA Run:ai workload encapsulates all the resources needed to run and creates/deletes them together. Since NVIDIA Run:ai is an open platform, it allows the scheduling of any Kubernetes Workload.
For more information, see .
This guide provides actionable best practices for administrators to securely configure, operate, and manage NVIDIA Run:ai environments. Each section highlights both platform-native features and mapped Kubernetes security practices to maintain robust protection for workloads and resources.
Access control (RBAC)
Enforce least privilege, segment roles by scope, audit regularly
Authentication and sessions management
Use SSO, token-based authentication, strong passwords, limit idle time
Workload policies
Require non-root, set UID/GID, block overrides, use trusted images
Namespace and resource management
NVIDIA Run:ai uses Role‑Based Access Control to define what each user, group, or application can do, and where. Roles are assigned within a scope, such as a project, department, or cluster, and permissions cover actions like viewing, creating, editing, or deleting entities. Unlike Kubernetes RBAC, NVIDIA Run:ai’s RBAC works across multiple clusters, giving you a single place to manage access rules. See for more details.
Assign the minimum required permissions to users, groups and applications.
Segment duties using organizational scopes to restrict roles to specific projects or departments.
Regularly audit access rules and remove unnecessary privileges, especially admin-level roles.
NVIDIA Run:ai predefined roles are automatically mapped to Kubernetes cluster roles (also predefined by NVIDIA Run:ai). This means administrators do not need to manually configure role mappings.
These cluster roles define permissions for the entities NVIDIA Run:ai manages and displays (such as workloads) and also apply to users who access cluster data directly through Kubernetes tools (for example, kubectl).
NVIDIA Run:ai supports several authentication methods to control platform access. You can use single sign-on (SSO) for unified enterprise logins, traditional username/password accounts if SSO isn’t an option, and API secret keys for automated application access. Authentication is mandatory for all interfaces, including the UI, CLI, and APIs, ensuring only verified users or applications can interact with your environment.
Administrators can also configure session timeout. This refers to the period of inactivity before a user is automatically logged out. Once the timeout is reached, the session ends and re‑authentication is required, helping protect against risks from unattended or abandoned sessions. See for more details.
Integrate corporate SSO for centralized identity management.
Enforce strong password policies for local accounts.
Set appropriate session timeout values to minimize idle session risk.
Prefer SSO to eliminate password management within NVIDIA Run:ai.
Configure the Kubernetes API server to validate tokens via NVIDIA Run:ai’s identity service, ensuring unified authentication across the platform. For more information, see .
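As an illustration, OIDC token validation on the API server is typically configured with flags along these lines; the issuer URL, client ID, and claim names below are placeholders, so use the values documented for your NVIDIA Run:ai tenant:
--oidc-issuer-url=https://<control-plane-url>/auth/realms/<realm>
--oidc-client-id=<client-id>
--oidc-username-claim=sub
--oidc-groups-claim=groups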
Workload policies allow administrators to define and enforce how AI workloads are submitted and controlled across projects and teams. With these policies, you can set clear rules and defaults for workload parameters such as which resources can be requested, required security settings, and which defaults should apply. Policies are enforced whether workloads are submitted via the UI, CLI, API or Kubernetes YAML, and can be scoped to specific projects, departments, or clusters for fine-grained control. See for more details.
Enforce containers to run as non-root by default. Define policies that set constraints and defaults for workload submissions, such as requiring non-root users or specifying minimum UID/GID. Example security fields in policies:
security.runAsNonRoot: true
security.runAsUid: 1000
Map these policies to PodSecurityContext settings in Kubernetes, and enforce them with Pod Security Admission or Kyverno for stricter compliance.
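For reference, the policy fields above map to Kubernetes pod and container security context settings roughly as follows; this is a generic Kubernetes sketch, not NVIDIA Run:ai policy syntax:
apiVersion: v1
kind: Pod
metadata:
  name: example-nonroot-pod        # illustrative name
spec:
  securityContext:
    runAsNonRoot: true             # corresponds to security.runAsNonRoot
    runAsUser: 1000                # corresponds to security.runAsUid
    runAsGroup: 1000
  containers:
    - name: main
      image: <trusted-image>
      securityContext:
        allowPrivilegeEscalation: false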
NVIDIA Run:ai offers flexible controls for how namespaces and resources are created and managed within your clusters. When a new project is set up, you can choose whether Kubernetes namespaces are created automatically, and whether users are auto-assigned to those projects. There are also options to manage how secrets are propagated across namespaces and to enable or disable resource limit enforcement using Kubernetes LimitRange objects. See for more details.
Require admin approval for namespace creation to avoid sprawl.
Limit secret propagation to essential cases only.
Use Kubernetes LimitRanges and ResourceQuotas alongside NVIDIA Run:ai policies for layered resource control.
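For example, a namespace-level guardrail in plain Kubernetes might look like this sketch; names, namespaces, and values are placeholders to adjust per project:
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: <project-namespace>
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 250m
        memory: 256Mi
      default:
        cpu: "1"
        memory: 1Gi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-quota
  namespace: <project-namespace>
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi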
NVIDIA Run:ai provides flexible options to control access to tools and serving endpoints. Access can be defined during workload submission or updated later, ensuring that only the intended users or groups can interact with the resource.
When configuring an endpoint or tool, users can select from the following access levels:
Public - Everyone within the network can access the endpoint with no authentication (applies to serving endpoints).
All authenticated users - Access is granted to anyone in the organization who can log in (NVIDIA Run:ai or SSO).
Specific groups - Access is restricted to members of designated identity provider groups.
Specific users - Access is restricted to individual users by email or username.
By default, network exposure is restricted, and access must be explicitly granted. Model endpoints automatically inherit RBAC and workload policy controls, ensuring consistent enforcement of role- and scope-based permissions across the platform. Administrators can also limit who can deploy, view, or manage endpoints, and should open network access only when required.
Define explicit roles for model management/use.
Restrict endpoint access to authorized users, groups and applications.
Monitor and audit endpoint access logs.
Use Kubernetes NetworkPolicies to limit inter-pod and external traffic to model-serving pods. Pair with NVIDIA Run:ai RBAC for end-to-end control.
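For example, a generic Kubernetes NetworkPolicy restricting ingress to serving pods might look like the sketch below; the namespace, labels, and port are placeholders:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-serving-ingress
  namespace: <serving-namespace>
spec:
  podSelector:
    matchLabels:
      app: <model-serving-app>          # label on the serving pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: <allowed-namespace>
      ports:
        - protocol: TCP
          port: 8080                    # the serving port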
A secure deployment is the foundation on which all other controls rest, and NVIDIA Run:ai’s installation procedures are built to align with organizational policies such as OpenShift Security Context Constraints (SCC). See for more details.
Deploy NVIDIA Run:ai cluster following secure installation guides (including IT compliance mandates such as SCC for OpenShift).
Run regular security scans and patch/update NVIDIA Run:ai deployments promptly when vulnerabilities are reported.
Regularly review and update all security policies, both at the NVIDIA Run:ai and Kubernetes levels, to adapt to evolving risks.
NVIDIA Run:ai supports SaaS and self-hosted modes to satisfy a range of data security needs. The self-hosted mode keeps all models, logs, and user data entirely within your infrastructure; SaaS requires careful review of what (minimal) data is transmitted for platform operations and analytics. See for more details.
Use the self-hosted mode when full control over the environment is required - including deployment and day-2 operations such as upgrades, monitoring, backup, and metadata restore.
Ensure transmission to the NVIDIA Run:ai cloud is scoped (in SaaS mode) and aligns with organization policy.
Encrypt secrets and sensitive resources; control secret propagation.
Document and audit data flows for regulatory alignment.
The identity of the user inside a container determines its access to various resources. For example, network file systems often rely on this identity to control access to mounted volumes. As a result, propagating the correct user identity into a container is crucial for both functionality and security.
By default, containers in both Docker and Kubernetes run as the root user. This means any process inside the container has full administrative privileges, capable of modifying system files, installing packages, or changing configurations.
While this level of access provides researchers with maximum flexibility, it conflicts with modern enterprise security practices. If the container’s root identity is propagated to external systems (e.g., network-attached storage), it can result in elevated permissions outside the container, increasing the risk of security breaches.
NVIDIA Run:ai allows you to enhance security and enforce organizational policies by:
Controlling root access and privilege escalation within containers
Propagating the user identity to align with enterprise access policies
NVIDIA Run:ai supports security-related workload configurations to control user permissions and restrict privilege escalation. These options are available via the API and CLI during workload creation:
runAsNonRoot / --run-as-user - Force the container to run as non-root user.
allowPrivilegeEscalation / --allow-privilege-escalation - Allow the container to use setuid binaries to escalate privileges, even when running as a non-root user. This setting can increase security risk and should be disabled if elevated privileges are not required.
Administrators can enforce secure defaults across the environment using Policies, ensuring consistent workload behavior aligned with organizational security practices.
A best practice is to store the User Identifier (UID) and Group Identifier (GID) in the organization's directory. NVIDIA Run:ai allows you to pass these values to the container and use them as the container identity. To perform this, you must set up single sign-on and perform the steps for UID/GID integration.
It is possible to explicitly pass user identity when creating an environment or submitting a workload:
From the image - Use the UID/GID defined in the container image.
From the IdP token - Use identity attributes provided by the SSO identity provider (available only in SSO-enabled installations).
Custom - Manually set the User ID (UID), Group ID (GID) and supplementary groups that can run commands in the container.
Administrators can enforce secure defaults across the environment using Policies, ensuring consistent workload behavior aligned with organizational security practices.
In OpenShift, Security Context Constraints (SCCs) manage pod-level security, including root access. By default, containers are assigned a random non-root UID, and flags such as --run-as-user and --allow-privilege-escalation are disabled.
On non-OpenShift Kubernetes clusters, similar enforcement can be achieved using tools like Gatekeeper, which applies system-level policies to restrict containers from running as root.
By default, OpenShift restricts setting specific user and group IDs (UIDs/GIDs) in workloads through its SCCs. To allow NVIDIA Run:ai workloads to run with explicitly defined UIDs and GIDs, a cluster administrator must modify the relevant SCCs.
To enable UID and GID assignment:
Edit the runai-user-job SCC:
Edit the runai-jupyter-notebook SCC (only required if using Jupyter environments):
In both SCC definitions, ensure the following sections are configured:
These settings allow NVIDIA Run:ai to pass specific UID and GID values into the container, enabling compatibility with identity-aware file systems and enterprise access controls.
When containers run as a specific user, the user must have a home directory defined within the image. Otherwise, starting a shell session will fail due to the absence of a home directory.
Since pre-creating a home directory for every possible user is impractical, NVIDIA Run:ai offers the createHomeDir / --create-home-dir option. When enabled, this flag creates a temporary home directory for the user inside the container at runtime. By default, the directory is created at /home/<username>.
oc edit scc runai-user-job
oc edit scc runai-jupyter-notebook
runAsUser:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny
Supported
NVIDIA Run:ai communicates with GitHub by defining it as a asset
Hugging Face
Repositories
Supported
NVIDIA Run:ai provides an out of the box integration with
JupyterHub
Development
Community Support
It is possible to submit NVIDIA Run:ai workloads via JupyterHub.
Jupyter Notebook
Development
Supported
NVIDIA Run:ai provides integrated support with Jupyter Notebooks. See example.
Karpenter
Cost Optimization
Supported
NVIDIA Run:ai provides out of the box support for Karpenter to save cloud costs. Integration notes with Karpenter can be found .
Training
Supported
NVIDIA Run:ai provides out of the box support for submitting MPI workloads via API, CLI or UI. See for more details.
Kubeflow notebooks
Development
Community Support
It is possible to launch a Kubeflow notebook with the NVIDIA Run:ai Scheduler. Sample code: .
Kubeflow Pipelines
Orchestration
Community Support
It is possible to schedule kubeflow pipelines with the NVIDIA Run:ai Scheduler. Sample code: .
MLFlow
Model Serving
Community Support
It is possible to use ML Flow together with the NVIDIA Run:ai Scheduler.
PyCharm
Development
Supported
Containers created by NVIDIA Run:ai can be accessed via PyCharm.
PyTorch
Training
Supported
NVIDIA Run:ai provides out of the box support for submitting PyTorch workloads via API, CLI or UI. See for more details.
Ray
Training, inference, data processing
Community Support
It is possible to schedule Ray jobs with the NVIDIA Run:ai Scheduler. Sample code: .
SeldonX
Orchestration
Community Support
It is possible to schedule Seldon Core workloads with the NVIDIA Run:ai Scheduler.
Spark
Orchestration
Community Support
It is possible to schedule Spark workflows with the NVIDIA Run:ai Scheduler.
S3
Storage
Supported
NVIDIA Run:ai communicates with S3 by defining a asset
TensorBoard
Experiment tracking
Supported
NVIDIA Run:ai comes with a preset TensorBoard asset
TensorFlow
Training
Supported
NVIDIA Run:ai provides out of the box support for submitting TensorFlow workloads via API, CLI or UI. See for more details.
Triton
Orchestration
Supported
Usage via docker base image
VScode
Development
Supported
Containers created by NVIDIA Run:ai can be accessed via Visual Studio Code. You can automatically launch Visual Studio code web from the NVIDIA Run:ai console.
Weights & Biases
Experiment tracking
Community Support
It is possible to schedule W&B workloads with the NVIDIA Run:ai Scheduler. Sample code: .
XGBoost
Training
Supported
NVIDIA Run:ai provides out of the box support for submitting XGBoost via API, CLI or UI. See for more details.
Apache Airflow
Orchestration
Community Support
It is possible to schedule Airflow workflows with the NVIDIA Run:ai Scheduler. Sample code: How to integrate NVIDIA Run:ai with Apache Airflow.
Argo workflows
Orchestration
Community Support
It is possible to schedule Argo workflows with the NVIDIA Run:ai Scheduler. Sample code: How to integrate NVIDIA Run:ai with Argo Workflows.
ClearML
Experiment tracking
Community Support
It is possible to schedule ClearML workloads with the NVIDIA Run:ai Scheduler.
Docker Registry
Repositories
Supported
NVIDIA Run:ai allows using a docker registry as a Credential asset
GitHub
Storage
runAsUid with canEdit: false to prevent users from overriding.
Require explicit user/group IDs for all workload containers.
Impose data source and resource usage limits through policies.
Use policy rules to prevent users from submitting non-compliant workloads.
Apply policies by organizational scope for nuanced control within departments or projects.
Require namespace approval, limit secret propagation, apply quotas
Tools and serving endpoint access control
Control who can access tools and endpoints; restrict network exposure
Maintenance and compliance
Follow secure install guides, perform vulnerability scans, maintain data-privacy alignment
Advanced cluster configurations can be used to tailor your NVIDIA Run:ai cluster deployment to meet specific operational requirements and optimize resource management. By fine-tuning these settings, you can enhance functionality, ensure compatibility with organizational policies, and achieve better control over your cluster environment. This article provides guidance on implementing and managing these configurations to adapt the NVIDIA Run:ai cluster to your unique needs.
After the NVIDIA Run:ai cluster is installed, you can adjust various settings to better align with your organization's operational needs and security requirements.
Advanced cluster configurations in NVIDIA Run:ai are managed through the runaiconfig Kubernetes Custom Resource. To edit the cluster configurations, run:
To see the full runaiconfig object structure, use:
The following configurations allow you to enable or disable features, control permissions, and customize the behavior of your NVIDIA Run:ai cluster:
The NVIDIA Run:ai cluster includes many different services. To simplify resource management, the configuration structure allows you to configure the container CPU/memory resources for each service individually or for a group of services together.
Apply the following configuration to change the resource requests and limits for a group of services:
Or, apply the following configuration to change the resource requests and limits for each service individually:
For resource recommendations, see .
By default, all NVIDIA Run:ai containers are deployed with a single replica. Some services support multiple replicas for redundancy and performance.
To simplify configuring replicas, a global replicas configuration can be set and is applied to all supported services:
This can be overwritten for specific services (if supported). Services without the replicas configuration do not support replicas:
The Prometheus instance in NVIDIA Run:ai is used for metrics collection and alerting.
The configuration scheme follows the official and supports additional custom configurations. The PrometheusSpec schema is available using the spec.prometheus.spec configuration.
A common use of the PrometheusSpec is metrics retention. Configuring local temporary metrics retention prevents metrics loss during potential connectivity issues. For more information, see :
In addition to the PrometheusSpec schema, some custom NVIDIA Run:ai configurations are also available:
Additional labels – Set additional labels for NVIDIA Run:ai's alerts sent by Prometheus.
Log level configuration – Configure the logLevel setting for the Prometheus container.
To include or exclude specific nodes from running workloads within a cluster managed by NVIDIA Run:ai, use the nodeSelectorTerms flag. For additional details, see .
Configure the node selection criteria using the fields below:
key: Label key (e.g., zone, instance-type).
operator: Operator defining the inclusion/exclusion condition (In, NotIn, Exists, DoesNotExist).
values: List of values for the key when using In or NotIn.
The below example shows how to include NVIDIA GPUs only and exclude all other GPU types in a cluster with mixed nodes, based on product type GPU label:
For air-gapped environments, when working with a , it is required to replace the default sidecar images in order to use the Git and S3 data source integrations. Use the following configurations:
kubectl edit runaiconfig runai -n runai
spec.project-controller.createNamespaces (boolean)
Allows Kubernetes namespace creation for new projects
Default: true
spec.project-controller.createRoleBindings (boolean)
Specifies if role bindings should be created in the project's namespace
Default: true
spec.project-controller.limitRange (boolean)
Specifies if limit ranges should be defined for projects
Default: true
spec.project-controller.clusterWideSecret (boolean)
Allows Kubernetes Secrets creation at the cluster scope. See for more details.
Default: true
spec.workload-controller.additionalPodLabels (object)
Sets the workload's pod labels as key/value pairs. These labels are applied to all pods.
spec.workload-controller.failureResourceCleanupPolicy
Sets how NVIDIA Run:ai cleans up the resources of a failed workload:
All - Removes all resources of the failed workload
None - Retains all resources
KeepFailing - Removes all resources except for those that encountered issues (primarily for debugging purposes)
Default: All
spec.workload-controller.GPUNetworkAccelerationEnabled
Enables GPU network acceleration. See for more details.
Default: false
spec.mps-server.enabled (boolean)
Enabled when using
Default: false
spec.daemonSetsTolerations (object)
Configure Kubernetes tolerations for NVIDIA Run:ai daemonSets / engine
spec.runai-container-toolkit.logLevel (string)
Specifies the NVIDIA Run:ai-container-toolkit logging level: either 'SPAM', 'DEBUG', 'INFO', 'NOTICE', 'WARN', or 'ERROR'
Default: INFO
spec.runai-container-toolkit.enabled (boolean)
Enables workloads to use
Default: true
node-scale-adjuster.args.gpuMemoryToFractionRatio (object)
A scaling-pod requesting a single GPU device will be created for every 1 to 10 pods requesting fractional GPU memory (1/gpuMemoryToFractionRatio). This value represents the ratio (0.1-0.9) of fractional GPU memory (any size) to GPU fraction (portion) conversion.
Default: 0.1
spec.global.core.dynamicFractions.enabled (boolean)
Enables
Default: true
spec.global.core.swap.enabled (boolean)
Enables for GPU workloads
Default: false
spec.global.core.swap.limits.cpuRam (string)
Sets the CPU memory size used to swap GPU workloads
Default: 100Gi
spec.global.core.swap.limits.reservedGpuRam (string)
Sets the reserved GPU memory size used to swap GPU workloads
Default: 2Gi
spec.global.core.nodeScheduler.enabled (boolean)
Enables the
Default: false
spec.global.core.timeSlicing.mode (string)
Sets the . Possible values:
timesharing - all pods on a GPU share the GPU compute time evenly.
strict - each pod gets an exact time slice according to its memory fraction value.
fair - each pod gets an exact time slice according to its memory fraction value and any unused GPU compute time is split evenly between the running pods.
Default: timesharing
spec.runai-scheduler.args.fullHierarchyFairness (boolean)
Enables fairness between departments, on top of projects fairness
Default: true
spec.runai-scheduler.args.defaultStalenessGracePeriod
Sets the timeout in seconds before the scheduler evicts a stale pod-group (gang) that went below its min-members in running state:
0s - Immediately (no timeout)
-1 - Never
Default: 60s
spec.pod-grouper.args.gangSchedulingKnative (boolean)
Enables gang scheduling for inference workloads. For backward compatibility with versions earlier than v2.19, change the value to false
Default: false
spec.pod-grouper.args.gangScheduleArgoWorkflow (boolean)
Groups all pods of a single ArgoWorkflow workload into a single Pod-Group for gang scheduling
Default: true
spec.runai-scheduler.args.verbosity (int)
Configures the level of detail in the logs generated by the scheduler service
Default: 4
spec.limitRange.cpuDefaultRequestCpuLimitFactorNoGpu (string)
Sets a default ratio between the CPU request and the limit for workloads without GPU requests
Default: 0.1
spec.limitRange.memoryDefaultRequestMemoryLimitFactorNoGpu (string)
Sets a default ratio between the memory request and the limit for workloads without GPU requests
Default: 0.1
spec.limitRange.cpuDefaultRequestGpuFactor (string)
Sets a default amount of CPU allocated per GPU when the CPU is not specified
Default: 100
spec.limitRange.cpuDefaultLimitGpuFactor (int)
Sets a default CPU limit based on the number of GPUs requested when no CPU limit is specified
Default: NO DEFAULT
spec.limitRange.memoryDefaultRequestGpuFactor (string)
Sets a default amount of memory allocated per GPU when the memory is not specified
Default: 100Mi
spec.limitRange.memoryDefaultLimitGpuFactor (string)
Sets a default memory limit based on the number of GPUs requested when no memory limit is specified
Default: NO DEFAULT
spec.global.affinity (object)
Sets the system nodes where NVIDIA Run:ai system-level services are scheduled. Using global.affinity will overwrite the node roles set using the Administrator CLI (runai-adm).
Default: Prefer to schedule on nodes that are labeled with node-role.kubernetes.io/runai-system
spec.global.nodeAffinity.restrictScheduling (boolean)
Enables setting node roles and restricting workload scheduling to designated nodes
Default: false
spec.global.tolerations (object)
Configure Kubernetes tolerations for NVIDIA Run:ai system-level services
spec.global.ingress.ingressClass
NVIDIA Run:ai uses NGINX as the default ingress controller. If your cluster has a different ingress controller, you can configure the ingress class to be created by NVIDIA Run:ai.
spec.global.subdomainSupport (boolean)
Allows the creation of subdomains for ingress endpoints, enabling access to workloads via unique subdomains on the Fully Qualified Domain Name (FQDN). For details, see External Access to Containers.
Default: false
spec.global.enableWorkloadOwnershipProtection (boolean)
Prevents users within the same project from deleting workloads created by others. This enhances workload ownership security and ensures better collaboration by restricting unauthorized modifications or deletions.
Default: false
SchedulingServices
Containers associated with the NVIDIA Run:ai Scheduler
Scheduler, StatusUpdater, MetricsExporter, PodGrouper, PodGroupAssigner, Binder
SyncServices
Containers associated with syncing updates between the NVIDIA Run:ai cluster and the NVIDIA Run:ai control plane
Agent, ClusterSync, AssetsSync
WorkloadServices
Containers associated with submitting NVIDIA Run:ai workloads
WorkloadController, JobController
kubectl get crds/runaiconfigs.run.ai -n runai -o yaml
spec:
  global:
    <service-group-name>: # schedulingServices | SyncServices | WorkloadServices
      resources:
        limits:
          cpu: 1000m
          memory: 1Gi
        requests:
          cpu: 100m
          memory: 512Mi
spec:
  <service-name>: # for example: pod-grouper
    resources:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 512Mi
spec:
  global:
    replicaCount: 1 # default
spec:
  <service-name>: # for example: pod-grouper
    replicas: 1 # default
spec:
  prometheus:
    spec: # PrometheusSpec
      retention: 2h # default
      retentionSize: 20GB
spec:
  prometheus:
    logLevel: info # debug | info | warn | error
    additionalAlertLabels:
      - env: prod # example
spec:
  global:
    managedNodes:
      inclusionCriteria:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: Exists
spec:
  workload-controller:
    s3FileSystemImage:
      name: goofys
      registry: runai.jfrog.io/op-containers-prod
      tag: 3.12.24
    gitSyncImage:
      name: git-sync
      registry: registry.k8s.io
      tag: v4.4.0