
Node Roles

This article explains how to designate specific node roles in a Kubernetes cluster to ensure optimal performance and reliability in production deployments.

For optimal performance in production clusters, it is essential to avoid extensive CPU usage on GPU nodes where possible. This can be done by ensuring the following:

  • NVIDIA Run:ai system-level services run on dedicated CPU-only nodes.

  • Workloads that do not request GPU resources (e.g. Machine Learning jobs) are executed on CPU-only nodes.

NVIDIA Run:ai services are scheduled on the defined node roles by applying Kubernetes Node Affinity using node labels.

Prerequisites

To perform these tasks, make sure to install the NVIDIA Run:ai Administrator CLI.

Configure Node Roles

The following node roles can be configured on the cluster:

  • System node: Reserved for NVIDIA Run:ai system-level services.

  • GPU Worker node: Dedicated for GPU-based workloads.

  • CPU Worker node: Used for CPU-only workloads.

System Nodes

NVIDIA Run:ai system nodes run the system-level services required to operate. This can be done via Kubectl (recommended) or via the NVIDIA Run:ai Administrator CLI.

By default, NVIDIA Run:ai applies a node affinity rule to prefer nodes that are labeled with node-role.kubernetes.io/runai-system for system services scheduling. You can modify the default node affinity rule by:

  • Editing the spec.global.affinity configuration parameter as detailed in Advanced cluster configurations (a sketch follows this list).

  • Editing the global.affinity configuration as detailed in Advanced control plane configurations for self-hosted deployments.
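
For reference, a minimal sketch of what a stricter spec.global.affinity override could look like in the runaiconfig resource, assuming the standard Kubernetes affinity schema is accepted for this field (verify the exact supported fields for your version before applying):

spec:
  global:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: node-role.kubernetes.io/runai-system
              operator: Exists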

Note

  • To ensure high availability and prevent a single point of failure, it is recommended to configure at least three system nodes in your cluster.

  • By default, Kubernetes master nodes are configured to prevent workloads from running on them as a best-practice measure to safeguard control plane stability. While this restriction is generally recommended, certain NVIDIA reference architectures allow adding tolerations to the NVIDIA Run:ai deployment so critical system services can run on these nodes.

Kubectl

To set a system role for a node in your Kubernetes cluster using Kubectl, follow these steps:

  1. Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

  2. Run one of the following commands to label the node with its role:

kubectl label nodes <node-name> node-role.kubernetes.io/runai-system=true
kubectl label nodes <node-name> node-role.kubernetes.io/runai-system=false

NVIDIA Run:ai Administrator CLI

Note

The NVIDIA Run:ai Administrator CLI only supports the default node affinity.

To set a system role for a node in your Kubernetes cluster, follow these steps:

  1. Run the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

  2. Run one of the following commands to set or remove a node’s role:

runai-adm set node-role --runai-system-worker <node-name>
runai-adm remove node-role --runai-system-worker <node-name>

The set node-role command will label the node and set relevant cluster configurations.

Worker Nodes

NVIDIA Run:ai worker nodes run user-submitted workloads and the system-level DaemonSets required to operate. This can be managed via Kubectl (recommended) or via the NVIDIA Run:ai Administrator CLI.

By default, GPU workloads are scheduled on GPU nodes based on the nvidia.com/gpu.present label. When global.nodeAffinity.restrictScheduling is set to true via the Advanced cluster configurations:

  • GPU Workloads are scheduled with node affinity rule to require nodes that are labeled with node-role.kubernetes.io/runai-gpu-worker

  • CPU-only Workloads are scheduled with node affinity rule to require nodes that are labeled with node-role.kubernetes.io/runai-cpu-worker

Kubectl

To set a worker role for a node in your Kubernetes cluster using Kubectl, follow these steps:

  1. Validate that global.nodeAffinity.restrictScheduling is set to true in the cluster’s configurations.

  2. Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

  3. Run one of the following commands to label the node with its role. Replace the label and value (true/false) to enable or disable GPU/CPU roles as needed:

kubectl label nodes <node-name> node-role.kubernetes.io/runai-gpu-worker=true
kubectl label nodes <node-name> node-role.kubernetes.io/runai-cpu-worker=false

NVIDIA Run:ai Administrator CLI

To set a worker role for a node in your Kubernetes cluster via the NVIDIA Run:ai Administrator CLI, follow these steps:

  1. Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.

  2. Run one of the following commands to set or remove a node’s role. <node-role> must be either --gpu-worker or --cpu-worker:

runai-adm set node-role <node-role> <node-name>
runai-adm remove node-role <node-role> <node-name>

The set node-role command will label the node and set cluster configuration global.nodeAffinity.restrictScheduling true.
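
To quickly verify which roles are set, you can list the relevant node labels with plain kubectl (an optional check, not part of the official procedure):

kubectl get nodes -L node-role.kubernetes.io/runai-system,node-role.kubernetes.io/runai-gpu-worker,node-role.kubernetes.io/runai-cpu-worker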

Note

Use the --all flag to set or remove a role for all nodes.


Service Mesh

NVIDIA Run:ai supports service mesh implementations. When a service mesh is deployed with sidecar injection, specific configurations must be applied to ensure compatibility with NVIDIA Run:ai. This document outlines the required changes for the NVIDIA Run:ai control plane and cluster.

Control Plane Configuration

Note

This section applies to self-hosted only.

By default, NVIDIA Run:ai prevents Istio from injecting sidecar containers into system jobs in the control plane. For other service mesh solutions, users must manually add annotations during installation.

To disable sidecar injection in the NVIDIA Run:ai control plane, modify the Helm values file by adding the required pod labels to the following components. See Advanced control plane configurations for more details.

Example for Open Service Mesh:

authorizationMigrator:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
clusterMigrator:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
identityProviderReconciler:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
keepPVC:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled
orgUnitsMigrator:
  podLabels:
    openservicemesh.io/sidecar-injection: disabled

Cluster Configuration

Installation Phase

Sidecar containers injected by some service mesh solutions can prevent NVIDIA Run:ai installation hooks from completing. To avoid this, modify the Helm installation command to include the required labels or annotations:

helm upgrade -i ... \
  --set global.additionalJobLabels.A=B --set global.additionalJobAnnotations.A=B

Example for Istio Service Mesh:

helm upgrade -i ... \
  --set-json global.additionalJobLabels='{"sidecar.istio.io/inject":"false"}'

Workloads

To prevent sidecar injection in workloads created at runtime (such as training workloads), update the runaiconfig resource. See Advanced cluster configurations for more details:

spec:
  workload-controller:
    additionalPodLabels:
      sidecar.istio.io/inject: "false"
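
If you prefer a one-off patch over editing the resource, the same label can be applied with kubectl patch; a minimal sketch using the resource name and namespace (runai/runai) shown elsewhere in this documentation:

kubectl patch RunaiConfig runai -n runai --type merge \
  -p '{"spec":{"workload-controller":{"additionalPodLabels":{"sidecar.istio.io/inject":"false"}}}}'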


Interworking with Karpenter

Karpenter is an open-source, Kubernetes cluster autoscaler built for cloud deployments. Karpenter optimizes the cloud cost of a customer’s cluster by moving workloads between different node types, consolidating workloads into fewer nodes, using lower-cost nodes where possible, scaling up new nodes when needed, and shutting down unused nodes.

Karpenter’s main goal is cost optimization. Unlike Karpenter, NVIDIA Run:ai’s Scheduler optimizes for fairness and resource utilization. Therefore, there are a few potential friction points when using both on the same cluster.

Friction Points Using Karpenter with NVIDIA Run:ai

  1. Karpenter looks for “unschedulable” pending workloads and may try to scale up new nodes to make those workloads schedulable. However, in some scenarios, these workloads may exceed their quota parameters, and the NVIDIA Run:ai Scheduler will put them into a pending state.

  2. Karpenter is not aware of the NVIDIA Run:ai fractions mechanism and may try to interfere incorrectly.

  3. Karpenter preempts any type of workload (i.e., high-priority, non-preemptible workloads will potentially be interrupted and moved to save cost).

  4. Karpenter has no pod-group (i.e., workload) notion or gang scheduling awareness, meaning that Karpenter is unaware that a set of “arbitrary” pods is a single workload. This may cause Karpenter to schedule those pods into different node pools (in the case of multi-node-pool workloads) or scale up or down a mix of wrong nodes.

Mitigating the Friction Points

NVIDIA Run:ai Scheduler mitigates the friction points using the following techniques (each numbered bullet below corresponds to the related friction point listed above):

  1. Karpenter uses a “nominated node” to recommend a node for the Scheduler. The NVIDIA Run:ai Scheduler treats this as a “preferred” recommendation, meaning it will try to use this node, but it’s not required and it may choose another node.

  2. Fractions - Karpenter won’t consolidate nodes with one or more pods that cannot be moved. The NVIDIA Run:ai reservation pod is marked as ‘do not evict’ to allow the NVIDIA Run:ai Scheduler to control the scheduling of fractions.

  3. Non-preemptible workloads - NVIDIA Run:ai marks non-preemptible workloads as ‘do not evict’ and Karpenter respects this annotation.

  4. NVIDIA Run:ai node pools (single-node-pool workloads) - Karpenter respects the ‘node affinity’ that NVIDIA Run:ai sets on a pod, so Karpenter uses the node affinity for its recommended node. For the gang-scheduling/pod-group (workload) notion, NVIDIA Run:ai Scheduler considers Karpenter directives as preferred recommendations rather than mandatory instructions and overrides Karpenter instructions where appropriate.

Deployment Considerations

  • Using multi-node-pool workloads

    • Workloads may include a list of optional node pools. Karpenter is not aware that only a single node pool should be selected out of that list for the workload. It may therefore recommend putting pods of the same workload into different node pools and may scale up nodes from different node pools to serve a “multi-node-pool” workload instead of nodes on the selected single node pool.

    • If this becomes an issue (i.e., if Karpenter scales up the wrong node types), users can set an inter-pod affinity using the node pool label or another common label as a ‘topology’ identifier. This will force Karpenter to choose nodes from a single-node pool per workload, selecting from any of the node pools listed as allowed by the workload.

An alternative approach is to use a single-node pool for each workload instead of multi-node pools.
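
Purely as an illustration of the inter-pod affinity suggestion above, a pod-spec snippet of the following shape keeps all pods that share a workload label on nodes with the same node pool label value. Both label keys below are assumptions (a hypothetical workloadName label and an assumed run.ai/node-pool node label); substitute the labels actually present in your cluster, and attach the affinity through whichever submission path (policy, API, or YAML) you use:

affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          workloadName: my-distributed-job   # hypothetical label carried by the workload's pods
      topologyKey: run.ai/node-pool          # assumed node pool label key; use your cluster's actual key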

  • Consolidation

    • To make Karpenter more effective when using its consolidation function, users should consider separating preemptible and non-preemptible workloads, either by using node pools, node affinities, taint/tolerations, or inter-pod anti-affinity.

    • If users don’t separate preemptible and non-preemptible workloads (i.e., make them run on different nodes), Karpenter’s ability to consolidate (bin-pack) and shut down nodes will be reduced, but it is still effective.

  • Conflicts between bin-packing and spread policies

    • If NVIDIA Run:ai is used with a scheduling spread policy, it will clash with Karpenter’s default bin-packing/consolidation policy, and the outcome may be a deployment that is not optimized for either of these policies.

    • Usually spread is used for Inference, which is non-preemptible and therefore not controlled by Karpenter (NVIDIA Run:ai Scheduler will mark those workloads as ‘do not evict’ for Karpenter), so this should not present a real deployment issue for customers.

Advanced Control Plane Configurations

    Helm Chart Values

    The NVIDIA Run:ai control plane installation can be customized to support your environment via Helm values files or Helm install flags. Make sure to restart the relevant NVIDIA Run:ai pods so they can fetch the new configurations.
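
For example, a typical customization passes a values file and/or individual --set flags to the control plane chart; the release name and chart reference below are placeholders for the install command you already use:

helm upgrade -i runai-backend <control-plane-chart> -n runai-backend \
  -f custom-values.yaml \
  --set global.ingress.ingressClass=nginx

The keys most commonly customized are listed below.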

global.ingress.ingressClass - Ingress class
NVIDIA Run:ai uses NGINX as the default ingress controller. If your cluster has a different ingress controller, you can configure the ingress class to be created by NVIDIA Run:ai.

global.ingress.tlsSecretName - TLS secret name
NVIDIA Run:ai requires the creation of a secret with the domain certificate. If the runai-backend namespace already has such a secret, you can set the secret name here.

<service-name>.podLabels - Pod labels
Set NVIDIA Run:ai and 3rd party services' pod labels in a format of key/value pairs.

<service-name>.resources - Pod requests and limits
Set NVIDIA Run:ai and 3rd party services' resources, for example:
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 250m
      memory: 256Mi

disableIstioSidecarInjection.enabled - Disable Istio sidecar injection
Disable the automatic injection of Istio sidecars across the entire NVIDIA Run:ai control plane services.

global.affinity - System nodes
Sets the system nodes where NVIDIA Run:ai system-level services are scheduled. Default: Prefer to schedule on nodes that are labeled with node-role.kubernetes.io/runai-system.

global.customCA.enabled - Certificate authority
Enables the use of a custom Certificate Authority (CA) in your deployment. When set to true, the system is configured to trust a user-provided CA certificate for secure communication.

    Additional Third-Party Configurations

    The NVIDIA Run:ai control plane chart includes multiple sub-charts of third-party components:

    • Data store - PostgreSQL (postgresql)

    • Metrics Store - Thanos (thanos)

    • Identity & Access Management - Keycloak (keycloakx)

    • Analytics Dashboard - Grafana (grafana)

    • Caching, Queue - NATS (nats)

    Note

    Click on any component to view its chart values and configurations.

    PostgreSQL

If you have opted to connect to an external PostgreSQL database, refer to the additional configurations table below. Adjust the following parameters based on your connection details:

    1. Disable PostgreSQL deployment - postgresql.enabled

    2. NVIDIA Run:ai connection details - global.postgresql.auth

    3. Grafana connection details - grafana.dbUser, grafana.dbPassword
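
Putting those parameters together, a hedged values-file sketch for an external database connection (hostnames, ports and credentials below are placeholders, not defaults):

postgresql:
  enabled: false                # do not deploy the bundled PostgreSQL
global:
  postgresql:
    auth:
      host: postgres.example.com
      port: 5432
      username: runai
      password: "<password>"
grafana:
  dbUser: runai_grafana
  dbPassword: "<password>"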

postgresql.enabled - PostgreSQL installation
If set to false, PostgreSQL will not be installed.

global.postgresql.auth.host - PostgreSQL host
Hostname or IP address of the PostgreSQL server.

global.postgresql.auth.port - PostgreSQL port
Port number on which PostgreSQL is running.

global.postgresql.auth.username - PostgreSQL username
Username for connecting to PostgreSQL.

global.postgresql.auth.password - PostgreSQL password
Password for the PostgreSQL user specified by global.postgresql.auth.username.

global.postgresql.auth.postgresPassword - PostgreSQL default admin password
Password for the built-in PostgreSQL superuser (postgres).

global.postgresql.auth.existingSecret - PostgreSQL credentials (secret)
Existing secret name with authentication credentials.

global.postgresql.auth.dbSslMode - PostgreSQL connection SSL mode
Set the SSL mode. See the full list in the PostgreSQL documentation. Prefer mode is not supported.

postgresql.primary.initdb.password - PostgreSQL default admin password
Set the same password as in global.postgresql.auth.postgresPassword (if changed).

postgresql.primary.persistence.storageClass - Storage class
The installation is configured to work with a specific storage class instead of the default one.

    Thanos

    Note

    This section applies to Kubernetes only.

thanos.receive.persistence.storageClass - Storage class
The installation is configured to work with a specific storage class instead of the default one.

    Keycloakx

The keycloakx.adminUser can only be set during the initial installation. The admin password can be changed later through the Keycloak UI, but you must also update the keycloakx.adminPassword value in the Helm chart using helm upgrade. See Changing Keycloak admin password for more details.

keycloakx.adminUser - User name of the internal identity provider administrator
Defines the username for the Keycloak administrator. This can only be set during the initial installation.

keycloakx.adminPassword - Password of the internal identity provider administrator
Defines the password for the Keycloak administrator.

keycloakx.existingSecret - Keycloakx credentials (secret)
Existing secret name with authentication credentials.

global.keycloakx.host - Keycloak (NVIDIA Run:ai internal identity provider) host path
Overrides the DNS for Keycloak. This can be used to access Keycloak externally to the cluster.

    Changing Keycloak Admin Password

    You can change the Keycloak admin password after deployment by performing the following steps:

    1. Open the Keycloak UI at: https://<runai-domain>/auth

    2. Sign in with your existing admin credentials as configured in your Helm values

    3. Go to Users and select admin (or your admin username)

    4. Open Credentials → Reset password

    5. Set the new password and click Save

    6. Update the keycloakx.adminPassword value using the helm upgrade command to match the password you set in the Keycloak UI

    Note

    Failing to update the Helm values after changing the password can lead to control plane services encountering errors.
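
For example, after resetting the password in the Keycloak UI, the Helm value can be brought back in sync with an upgrade such as the following (release name and chart reference are placeholders for your usual upgrade command):

helm upgrade -i runai-backend <control-plane-chart> -n runai-backend \
  -f custom-values.yaml \
  --set keycloakx.adminPassword='<new-password>'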

    Grafana

grafana.adminUser - Grafana username
Override the NVIDIA Run:ai default user name for accessing Grafana.

grafana.adminPassword - Grafana password
Override the NVIDIA Run:ai default password for accessing Grafana.

grafana.db.existingSecret - Grafana database connection credentials (secret)
Existing secret name with authentication credentials.

grafana.dbUser - Grafana database username
Username for accessing the Grafana database.

grafana.dbPassword - Grafana database password
Password for the Grafana database user.

grafana.admin.existingSecret - Grafana admin default credentials (secret)
Existing secret name with authentication credentials.

    External Access to Containers

    Researchers may need to access containers remotely during workload execution. Common use cases include:

    • Running a Jupyter Notebook inside the container

    • Connecting PyCharm for remote Python development

    • Viewing machine learning visualizations using TensorBoard

    To enable this access, you must expose the relevant container ports.

    Exposing Container Ports

    Accessing the containers remotely requires exposing container ports. In Docker, ports are exposed by declaring them when launching the container. NVIDIA Run:ai provides similar functionality within a Kubernetes environment.

    Since Kubernetes abstracts the container's physical location, exposing ports is more complex. Kubernetes supports multiple methods for exposing container ports. For more details, refer to the Kubernetes services and networking documentation.

    Method
    Description
    NVIDIA Run:ai Support

    Port Forwarding

    Simple port forwarding allows access to the container via local and/or remote port.

    Supported natively via Kubernetes

    NodePort

    Exposes the service on each Node’s IP at a static port (the NodePort). You’ll be able to contact the NodePort service from outside the cluster by requesting <NODE-IP>:<NODE-PORT> regardless of which node the container actually resides in.

    Supported

    LoadBalancer

    Exposes the service externally using a cloud provider’s load balancer.

    Supported via API with limited capabilities
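
As an example of the port-forwarding method listed above, a researcher with kubectl access can forward a container port to their workstation; the pod name, namespace and ports below are placeholders:

kubectl port-forward pod/<pod-name> -n <project-namespace> 8888:8888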

    Access to the Running Workload's Container

    Many tools used by researchers, such as Jupyter, TensorBoard, or VSCode, require remote access to the running workload's container. In NVIDIA Run:ai, this access is provided through dynamically generated URLs.

    Path-Based Routing

By default, NVIDIA Run:ai uses the Cluster URL provided to dynamically create SSL-secured URLs in the following format:

https://<CLUSTER_URL>/project-name/workload-name

    While path-based routing works with applications such as Jupyter Notebooks, it may not be compatible with other applications. Some applications assume they are running at the root file system, so hardcoded file paths and settings within the container may become invalid when running at a path other than the root. For example, if an application expects to access /etc/config.json but is served at /project-name/workspace-name, the file will not be found. This can cause the container to fail or not function as intended.

    Host-Based Routing

NVIDIA Run:ai provides support for host-based routing. When enabled, URLs follow the format:

https://project-name-workload-name.<CLUSTER_URL>/

    This allows all workloads to run at the root path, avoiding file path issues and ensuring proper application behavior.

    Enabling Host-Based Routing

    To enable host-based routing, perform the following steps:

    Note

For OpenShift, editing the runaiconfig resource is the only step required to generate the URLs. Refer to the last step below.

    1. Create a second DNS entry (A record) for *.<CLUSTER_URL>, pointing to the same IP as the cluster's Fully Qualified Domain Name (FQDN).

    2. Obtain a wildcard SSL certificate for this second DNS entry.

    3. Add the certificate as a secret:

# Replace the paths below with the actual paths to your TLS certificate and private key
kubectl create secret tls runai-cluster-domain-star-tls-secret -n runai \
  --cert /path/to/fullchain.pem \
  --key /path/to/private.pem

    4. Create the following ingress rule and replace <CLUSTER_URL>:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: runai-cluster-domain-star-ingress
  namespace: runai
spec:
  ingressClassName: nginx
  rules:
  - host: '*.<CLUSTER_URL>'
  tls:
  - hosts:
    - '*.<CLUSTER_URL>'
    secretName: runai-cluster-domain-star-tls-secret

    5. Run the following:

kubectl apply -f <filename>

    6. Edit runaiconfig to generate the URLs correctly:

kubectl patch RunaiConfig runai -n runai --type="merge" \
    -p '{"spec":{"global":{"subdomainSupport": true}}}'

    Once these requirements have been met, all workloads will automatically be assigned a secured URL with a subdomain, ensuring full functionality for all researcher applications.



    Integrations

    Integration Support

    Support for third-party integrations varies. When noted below, the integration is supported out of the box with NVIDIA Run:ai. For other integrations, our Customer Success team has prior experience assisting customers with setup. In many cases, the NVIDIA Enterprise Support Portal may include additional reference documentation provided on an as-is basis.

Tool | Category | NVIDIA Run:ai support details | Additional Information

• Apache Airflow | Orchestration | Community Support | It is possible to schedule Airflow workflows with the NVIDIA Run:ai Scheduler. Sample code: How to integrate NVIDIA Run:ai with Apache Airflow.
• Argo workflows | Orchestration | Community Support | It is possible to schedule Argo workflows with the NVIDIA Run:ai Scheduler. Sample code: How to integrate NVIDIA Run:ai with Argo Workflows.
• ClearML | Experiment tracking | Community Support | It is possible to schedule ClearML workloads with the NVIDIA Run:ai Scheduler.
• Docker Registry | Repositories | Supported | NVIDIA Run:ai allows using a docker registry as a Credential asset.
• GitHub | Repositories | Supported | NVIDIA Run:ai communicates with GitHub by defining it as a data source asset.
• Hugging Face | Repositories | Supported | NVIDIA Run:ai provides an out of the box integration with Hugging Face.
• JupyterHub | Development | Community Support | It is possible to submit NVIDIA Run:ai workloads via JupyterHub.
• Jupyter Notebook | Development | Supported | NVIDIA Run:ai provides integrated support with Jupyter Notebooks. See the Jupyter Notebook quick start example.
• Karpenter | Cost Optimization | Supported | NVIDIA Run:ai provides out of the box support for Karpenter to save cloud costs. Integration notes can be found in Interworking with Karpenter above.
• Kubeflow MPI | Training | Supported | NVIDIA Run:ai provides out of the box support for submitting MPI workloads via API, CLI or UI. See Distributed training for more details.
• Kubeflow notebooks | Development | Community Support | It is possible to launch a Kubeflow notebook with the NVIDIA Run:ai Scheduler. Sample code: How to integrate NVIDIA Run:ai with Kubeflow.
• Kubeflow Pipelines | Orchestration | Community Support | It is possible to schedule Kubeflow pipelines with the NVIDIA Run:ai Scheduler. Sample code: How to integrate NVIDIA Run:ai with Kubeflow.
• MLFlow | Model Serving | Community Support | It is possible to use MLFlow together with the NVIDIA Run:ai Scheduler.
• PyCharm | Development | Supported | Containers created by NVIDIA Run:ai can be accessed via PyCharm.
• PyTorch | Training | Supported | NVIDIA Run:ai provides out of the box support for submitting PyTorch workloads via API, CLI or UI. See Distributed training for more details.
• Ray | Training, inference, data processing | Community Support | It is possible to schedule Ray jobs with the NVIDIA Run:ai Scheduler. Sample code: How to Integrate NVIDIA Run:ai with Ray.
• SeldonX | Orchestration | Community Support | It is possible to schedule Seldon Core workloads with the NVIDIA Run:ai Scheduler.
• Spark | Orchestration | Community Support | It is possible to schedule Spark workflows with the NVIDIA Run:ai Scheduler.
• S3 | Storage | Supported | NVIDIA Run:ai communicates with S3 by defining a data source asset.
• TensorBoard | Experiment tracking | Supported | NVIDIA Run:ai comes with a preset TensorBoard Environment asset.
• TensorFlow | Training | Supported | NVIDIA Run:ai provides out of the box support for submitting TensorFlow workloads via API, CLI or UI. See Distributed training for more details.
• Triton | Orchestration | Supported | Usage via docker base image.
• VScode | Development | Supported | Containers created by NVIDIA Run:ai can be accessed via Visual Studio Code. You can automatically launch Visual Studio Code web from the NVIDIA Run:ai console.
• Weights & Biases | Experiment tracking | Community Support | It is possible to schedule W&B workloads with the NVIDIA Run:ai Scheduler. Sample code: How to integrate with Weights and Biases.
• XGBoost | Training | Supported | NVIDIA Run:ai provides out of the box support for submitting XGBoost workloads via API, CLI or UI. See Distributed training for more details.

    Kubernetes Workloads Integration

Kubernetes has several built-in resources that encapsulate running Pods. These are called Kubernetes Workloads and should not be confused with NVIDIA Run:ai workloads.

Examples of such resources are a Deployment that manages a stateless application, or a Job that runs tasks to completion.

A NVIDIA Run:ai workload encapsulates all the resources needed to run and creates/deletes them together. Since NVIDIA Run:ai is an open platform, it allows the scheduling of any Kubernetes Workload.

    For more information, see .

    Security Best Practices

    This guide provides actionable best practices for administrators to securely configure, operate, and manage NVIDIA Run:ai environments. Each section highlights both platform-native features and mapped Kubernetes security practices to maintain robust protection for workloads and resources.

    Security Area
    Best Practice

    Access control (RBAC)

    Enforce least privilege, segment roles by scope, audit regularly

    Authentication and sessions management

    Use SSO, token-based authentication, strong passwords, limit idle time

    Workload policies

    Require non-root, set UID/GID, block overrides, use trusted images

Namespace and resource management

Require namespace approval, limit secret propagation, apply quotas

Tools and serving endpoint access control

Control who can access tools and endpoints; restrict network exposure

Maintenance and compliance

Follow secure install guides, perform vulnerability scans, maintain data-privacy alignment

    Access Control (RBAC)

NVIDIA Run:ai uses Role‑Based Access Control to define what each user, group, or application can do, and where. Roles are assigned within a scope, such as a project, department, or cluster, and permissions cover actions like viewing, creating, editing, or deleting entities. Unlike Kubernetes RBAC, NVIDIA Run:ai’s RBAC works across multiple clusters, giving you a single place to manage access rules. See Role Based Access Control (RBAC) for more details.

    Best Practices

    • Assign the minimum required permissions to users, groups and applications.

    • Segment duties using organizational scopes to restrict roles to specific projects or departments.

    • Regularly audit access rules and remove unnecessary privileges, especially admin-level roles.

    Kubernetes Connection

    NVIDIA Run:ai predefined roles are automatically mapped to Kubernetes cluster roles (also predefined by NVIDIA Run:ai). This means administrators do not need to manually configure role mappings.

    These cluster roles define permissions for the entities NVIDIA Run:ai manages and displays (such as workloads) and also apply to users who access cluster data directly through Kubernetes tools (for example, kubectl).

    Authentication and Session Management

    NVIDIA Run:ai supports several authentication methods to control platform access. You can use single sign-on (SSO) for unified enterprise logins, traditional username/password accounts if SSO isn’t an option, and API secret keys for automated application access. Authentication is mandatory for all interfaces, including the UI, CLI, and APIs, ensuring only verified users or applications can interact with your environment.

Administrators can also configure session timeout. This refers to the period of inactivity before a user is automatically logged out. Once the timeout is reached, the session ends and re‑authentication is required, helping protect against risks from unattended or abandoned sessions. See Authentication and authorization for more details.

    Best Practices

    • Integrate corporate SSO for centralized identity management.

    • Enforce strong password policies for local accounts.

    • Set appropriate session timeout values to minimize idle session risk.

    • Prefer SSO to eliminate password management within NVIDIA Run:ai.

    Kubernetes Connection

Configure the Kubernetes API server to validate tokens via NVIDIA Run:ai’s identity service, ensuring unified authentication across the platform. For more information, see Cluster authentication.

    Workload Policies: Enforcing Security at Submission

Workload policies allow administrators to define and enforce how AI workloads are submitted and controlled across projects and teams. With these policies, you can set clear rules and defaults for workload parameters such as which resources can be requested, required security settings, and which defaults should apply. Policies are enforced whether workloads are submitted via the UI, CLI, API or Kubernetes YAML, and can be scoped to specific projects, departments, or clusters for fine-grained control. See Policies and rules for more details.

    Best Practices

    • Enforce containers to run as non-root by default. Define policies that set constraints and defaults for workload submissions, such as requiring non-root users or specifying minimum UID/GID. Example security fields in policies (an illustrative sketch follows this list):

      • security.runAsNonRoot: true

      • security.runAsUid: 1000

      • Restrict runAsUid with canEdit: false to prevent users from overriding.

    • Require explicit user/group IDs for all workload containers.

    • Impose data source and resource usage limits through policies.

    • Use policy rules to prevent users from submitting non-compliant workloads.

    • Apply policies by organizational scope for nuanced control within departments or projects.
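
The exact policy schema depends on your NVIDIA Run:ai version and submission path, so the following is only an illustrative sketch of the fields named above (defaults plus a rule that blocks overrides), not a literal policy file:

defaults:
  security:
    runAsNonRoot: true
    runAsUid: 1000
rules:
  security:
    runAsUid:
      canEdit: false   # prevent users from overriding the enforced UID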

    Kubernetes Connection

    Map these policies to PodSecurityContext settings in Kubernetes, and enforce them with Pod Security Admission or Kyverno for stricter compliance.

    Managing Namespace and Resource Creation

NVIDIA Run:ai offers flexible controls for how namespaces and resources are created and managed within your clusters. When a new project is set up, you can choose whether Kubernetes namespaces are created automatically, and whether users are auto-assigned to those projects. There are also options to manage how secrets are propagated across namespaces and to enable or disable resource limit enforcement using Kubernetes LimitRange objects. See Advanced cluster configurations for more details.

    Best Practices

    • Require admin approval for namespace creation to avoid sprawl.

    • Limit secret propagation to essential cases only.

    • Use Kubernetes LimitRanges and ResourceQuotas alongside NVIDIA Run:ai policies for layered resource control (a minimal example follows this list).

    • Regularly audit and remove unused namespaces, secrets, and workloads.
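
A minimal illustration of the Kubernetes side of that layering, applied to a project namespace (the namespace name and sizes below are placeholders):

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: runai-team-a        # placeholder project namespace
spec:
  limits:
  - type: Container
    default:
      cpu: "1"
      memory: 1Gi
    defaultRequest:
      cpu: 250m
      memory: 256Mi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: runai-team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi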

    Tools and Serving Endpoint Access Control

    NVIDIA Run:ai provides flexible options to control access to tools and serving endpoints. Access can be defined during workload submission or updated later, ensuring that only the intended users or groups can interact with the resource.

    When configuring an endpoint or tool, users can select from the following access levels:

    • Public - Everyone within the network can access with no authentication (serving endpoints).

    • All authenticated users - Access is granted to anyone in the organization who can log in (NVIDIA Run:ai or SSO).

    • Specific groups - Access is restricted to members of designated identity provider groups.

    • Specific users - Access is restricted to individual users by email or username.

    By default, network exposure is restricted, and access must be explicitly granted. Model endpoints automatically inherit RBAC and workload policy controls, ensuring consistent enforcement of role- and scope-based permissions across the platform. Administrators can also limit who can deploy, view, or manage endpoints, and should open network access only when required.

    Best Practices

    • Define explicit roles for model management/use.

    • Restrict endpoint access to authorized users, groups and applications.

    • Monitor and audit endpoint access logs.

    Kubernetes Connection

    Use Kubernetes NetworkPolicies to limit inter-pod and external traffic to model-serving pods. Pair with NVIDIA Run:ai RBAC for end-to-end control.
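
As an illustration, a NetworkPolicy of the following shape limits ingress to serving pods to a single trusted namespace and port; every selector and name below is a placeholder to adapt to your environment:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-serving-ingress
  namespace: runai-team-a            # placeholder project namespace
spec:
  podSelector:
    matchLabels:
      app: model-server              # placeholder label on the serving pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: api-gateway   # placeholder trusted namespace
    ports:
    - protocol: TCP
      port: 8080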

    Secure Installation and Maintenance

    A secure deployment is the foundation on which all other controls rest, and NVIDIA Run:ai’s installation procedures are built to align with organizational policies such as OpenShift Security Context Constraints (SCC). See for more details.

    • Deploy NVIDIA Run:ai cluster following secure installation guides (including IT compliance mandates such as SCC for OpenShift).

    • Run regular security scans and patch/update NVIDIA Run:ai deployments promptly when vulnerabilities are reported.

    • Regularly review and update all security policies, both at the NVIDIA Run:ai and Kubernetes levels, to adapt to evolving risks.

    Compliance and Data Privacy

    NVIDIA Run:ai supports SaaS and self-hosted modes to satisfy a range of data security needs. The self-hosted mode keeps all models, logs, and user data entirely within your infrastructure; SaaS requires careful review of what (minimal) data is transmitted for platform operations and analytics. See for more details.

    • Use the self-hosted mode when full control over the environment is required - including deployment and day-2 operations such as upgrades, monitoring, backup, and metadata restore.

    • Ensure transmission to the NVIDIA Run:ai cloud is scoped (in SaaS mode) and aligns with organization policy.

    • Encrypt secrets and sensitive resources; control secret propagation.

    • Document and audit data flows for regulatory alignment.

    User Identity in Containers

    The identity of the user inside a container determines its access to various resources. For example, network file systems often rely on this identity to control access to mounted volumes. As a result, propagating the correct user identity into a container is crucial for both functionality and security.

    By default, containers in both Docker and Kubernetes run as the root user. This means any process inside the container has full administrative privileges, capable of modifying system files, installing packages, or changing configurations.

    While this level of access provides researchers with maximum flexibility, it conflicts with modern enterprise security practices. If the container’s root identity is propagated to external systems (e.g., network-attached storage), it can result in elevated permissions outside the container, increasing the risk of security breaches.

    NVIDIA Run:ai Controls for User Identity and Privileges

    NVIDIA Run:ai allows you to enhance security and enforce organizational policies by:

    • Controlling root access and privilege escalation within containers

    • Propagating the user identity to align with enterprise access policies

    Root Access and Privilege Escalation

    NVIDIA Run:ai supports security-related workload configurations to control user permissions and restrict privilege escalation. These options are available via the API and CLI during workload creation:

    • runAsNonRoot / --run-as-user - Force the container to run as a non-root user.

    • allowPrivilegeEscalation / --allow-privilege-escalation - Allow the container to use setuid binaries to escalate privileges, even when running as a non-root user. This setting can increase security risk and should be disabled if elevated privileges are not required.

    Administrators can enforce secure defaults across the environment using Policies, ensuring consistent workload behavior aligned with organizational security practices.

    Passing User Identity

    Passing User Identity from Identity Provider

    A best practice is to store the User Identifier (UID) and Group Identifier (GID) in the organization's directory. NVIDIA Run:ai allows you to pass these values to the container and use them as the container identity. To perform this, you must set up single sign-on and perform the steps for UID/GID integration.

    Passing User Identity via UI

    It is possible to explicitly pass user identity when creating an environment or submitting a workload:

    • From the image - Use the UID/GID defined in the container image.

    • From the IdP token - Use identity attributes provided by the SSO identity provider (available only in SSO-enabled installations).

    • Custom - Manually set the User ID (UID), Group ID (GID) and supplementary groups that can run commands in the container.

    Administrators can enforce secure defaults across the environment using Policies, ensuring consistent workload behavior aligned with organizational security practices.

    Note

    It is also possible to set the above using the API or CLI.

    Using OpenShift or Gatekeeper to Provide Cluster Level Controls

    In OpenShift, Security Context Constraints (SCCs) manage pod-level security, including root access. By default, containers are assigned a random non-root UID, and flags such as --run-as-user and --allow-privilege-escalation are disabled.

    On non-OpenShift Kubernetes clusters, similar enforcement can be achieved using tools like Gatekeeper, which applies system-level policies to restrict containers from running as root.

    Enabling UID and GID on OpenShift

    By default, OpenShift restricts setting specific user and group IDs (UIDs/GIDs) in workloads through its SCCs. To allow NVIDIA Run:ai workloads to run with explicitly defined UIDs and GIDs, a cluster administrator must modify the relevant SCCs.

    To enable UID and GID assignment:

    1. Edit the runai-user-job SCC:

oc edit scc runai-user-job

    2. Edit the runai-jupyter-notebook SCC (only required if using Jupyter environments):

oc edit scc runai-jupyter-notebook

    3. In both SCC definitions, ensure the following sections are configured:

runAsUser:
  type: RunAsAny
supplementalGroups:
  type: RunAsAny

    These settings allow NVIDIA Run:ai to pass specific UID and GID values into the container, enabling compatibility with identity-aware file systems and enterprise access controls.

    Creating a Temporary Home Directory

    When containers run as a specific user, the user must have a home directory defined within the image. Otherwise, starting a shell session will fail due to the absence of a home directory.

    Since pre-creating a home directory for every possible user is impractical, NVIDIA Run:ai offers the createHomeDir / --create-home-dir option. When enabled, this flag creates a temporary home directory for the user inside the container at runtime. By default, the directory is created at /home/<username>.

    Note

    • This home directory is temporary and exists only for the duration of the container's lifecycle. Any data saved in this location will be lost when the container exits.

    • By default, this flag is set to true when --run-as-user is enabled, and false otherwise.



    Advanced Cluster Configurations

    Advanced cluster configurations can be used to tailor your NVIDIA Run:ai cluster deployment to meet specific operational requirements and optimize resource management. By fine-tuning these settings, you can enhance functionality, ensure compatibility with organizational policies, and achieve better control over your cluster environment. This article provides guidance on implementing and managing these configurations to adapt the NVIDIA Run:ai cluster to your unique needs.

    After the NVIDIA Run:ai cluster is installed, you can adjust various settings to better align with your organization's operational needs and security requirements.

    Modify Cluster Configurations

Advanced cluster configurations in NVIDIA Run:ai are managed through the runaiconfig Kubernetes Custom Resource. To edit the cluster configurations, run:

kubectl edit runaiconfig runai -n runai

To see the full runaiconfig object structure, use:

kubectl get crds/runaiconfigs.run.ai -n runai -o yaml
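
To review the values that are currently set (as opposed to the CRD schema), you can also read the object itself, for example:

kubectl get runaiconfig runai -n runai -o yaml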

    Configurations

    The following configurations allow you to enable or disable features, control permissions, and customize the behavior of your NVIDIA Run:ai cluster:

spec.project-controller.createNamespaces (boolean)
Allows Kubernetes namespace creation for new projects. Default: true

spec.project-controller.createRoleBindings (boolean)
Specifies if role bindings should be created in the project's namespace. Default: true

spec.project-controller.limitRange (boolean)
Specifies if limit ranges should be defined for projects. Default: true

spec.project-controller.clusterWideSecret (boolean)
Allows Kubernetes Secrets creation at the cluster scope. See Credentials for more details. Default: true

spec.workload-controller.additionalPodLabels (object)
Set workload's Pod Labels in a format of key/value pairs. These labels are applied to all pods.

spec.workload-controller.failureResourceCleanupPolicy
NVIDIA Run:ai cleans the workload's unnecessary resources:
• All - Removes all resources of the failed workload
• None - Retains all resources
• KeepFailing - Removes all resources except for those that encountered issues (primarily for debugging purposes)
Default: All

spec.workload-controller.GPUNetworkAccelerationEnabled
Enables GPU network acceleration. See Using GB200 NVL72 and Multi-Node NVLink Domains for more details. Default: false

spec.mps-server.enabled (boolean)
Enabled when using NVIDIA MPS. Default: false

spec.daemonSetsTolerations (object)
Configure Kubernetes tolerations for NVIDIA Run:ai daemonSets / engine

spec.runai-container-toolkit.logLevel (string)
Specifies the NVIDIA Run:ai-container-toolkit logging level: either 'SPAM', 'DEBUG', 'INFO', 'NOTICE', 'WARN', or 'ERROR'. Default: INFO

spec.runai-container-toolkit.enabled (boolean)
Enables workloads to use GPU fractions. Default: true

node-scale-adjuster.args.gpuMemoryToFractionRatio (object)
A scaling-pod requesting a single GPU device will be created for every 1 to 10 pods requesting fractional GPU memory (1/gpuMemoryToFractionRatio). This value represents the ratio (0.1-0.9) of fractional GPU memory (any size) to GPU fraction (portion) conversion. Default: 0.1

spec.global.core.dynamicFractions.enabled (boolean)
Enables dynamic GPU fractions. Default: true

spec.global.core.swap.enabled (boolean)
Enables memory swap for GPU workloads. Default: false

spec.global.core.swap.limits.cpuRam (string)
Sets the CPU memory size used to swap GPU workloads. Default: 100Gi

spec.global.core.swap.limits.reservedGpuRam (string)
Sets the reserved GPU memory size used to swap GPU workloads. Default: 2Gi

spec.global.core.nodeScheduler.enabled (boolean)
Enables the node-level scheduler. Default: false

spec.global.core.timeSlicing.mode (string)
Sets the GPU time-slicing mode. Possible values:
• timesharing - all pods on a GPU share the GPU compute time evenly.
• strict - each pod gets an exact time slice according to its memory fraction value.
• fair - each pod gets an exact time slice according to its memory fraction value and any unused GPU compute time is split evenly between the running pods.
Default: timesharing

spec.runai-scheduler.args.fullHierarchyFairness (boolean)
Enables fairness between departments, on top of projects fairness. Default: true

spec.runai-scheduler.args.defaultStalenessGracePeriod
Sets the timeout in seconds before the scheduler evicts a stale pod-group (gang) that went below its min-members in running state:
• 0s - Immediately (no timeout)
• -1 - Never
Default: 60s

spec.pod-grouper.args.gangSchedulingKnative (boolean)
Enables gang scheduling for inference workloads. For backward compatibility with versions earlier than v2.19, change the value to false. Default: false

spec.pod-grouper.args.gangScheduleArgoWorkflow (boolean)
Groups all pods of a single ArgoWorkflow workload into a single Pod-Group for gang scheduling. Default: true

spec.runai-scheduler.args.verbosity (int)
Configures the level of detail in the logs generated by the scheduler service. Default: 4

spec.limitRange.cpuDefaultRequestCpuLimitFactorNoGpu (string)
Sets a default ratio between the CPU request and the limit for workloads without GPU requests. Default: 0.1

spec.limitRange.memoryDefaultRequestMemoryLimitFactorNoGpu (string)
Sets a default ratio between the memory request and the limit for workloads without GPU requests. Default: 0.1

spec.limitRange.cpuDefaultRequestGpuFactor (string)
Sets a default amount of CPU allocated per GPU when the CPU is not specified. Default: 100

spec.limitRange.cpuDefaultLimitGpuFactor (int)
Sets a default CPU limit based on the number of GPUs requested when no CPU limit is specified. Default: NO DEFAULT

spec.limitRange.memoryDefaultRequestGpuFactor (string)
Sets a default amount of memory allocated per GPU when the memory is not specified. Default: 100Mi

spec.limitRange.memoryDefaultLimitGpuFactor (string)
Sets a default memory limit based on the number of GPUs requested when no memory limit is specified. Default: NO DEFAULT

spec.global.affinity (object)
Sets the system nodes where NVIDIA Run:ai system-level services are scheduled. Using global.affinity will overwrite the node roles set using the Administrator CLI (runai-adm). Default: Prefer to schedule on nodes that are labeled with node-role.kubernetes.io/runai-system

spec.global.nodeAffinity.restrictScheduling (boolean)
Enables setting node roles and restricting workload scheduling to designated nodes. Default: false

spec.global.tolerations (object)
Configure Kubernetes tolerations for NVIDIA Run:ai system-level services

spec.global.ingress.ingressClass
NVIDIA Run:ai uses NGINX as the default ingress controller. If your cluster has a different ingress controller, you can configure the ingress class to be created by NVIDIA Run:ai.

spec.global.subdomainSupport (boolean)
Allows the creation of subdomains for ingress endpoints, enabling access to workloads via unique subdomains on the Fully Qualified Domain Name (FQDN). For details, see External Access to Containers. Default: false

spec.global.enableWorkloadOwnershipProtection (boolean)
Prevents users within the same project from deleting workloads created by others. This enhances workload ownership security and ensures better collaboration by restricting unauthorized modifications or deletions. Default: false

    NVIDIA Run:ai Services Resource Management

The NVIDIA Run:ai cluster includes many different services. To simplify resource management, the configuration structure allows you to configure the container CPU/memory resources for each service individually or for a group of services together.

SchedulingServices - Containers associated with the NVIDIA Run:ai Scheduler: Scheduler, StatusUpdater, MetricsExporter, PodGrouper, PodGroupAssigner, Binder

SyncServices - Containers associated with syncing updates between the NVIDIA Run:ai cluster and the NVIDIA Run:ai control plane: Agent, ClusterSync, AssetsSync

WorkloadServices - Containers associated with submitting NVIDIA Run:ai workloads: WorkloadController, JobController

Apply the following configuration in order to change the resource requests and limits for a group of services:

spec:
  global:
    <service-group-name>: # schedulingServices | SyncServices | WorkloadServices
      resources:
        limits:
          cpu: 1000m
          memory: 1Gi
        requests:
          cpu: 100m
          memory: 512Mi

Or, apply the following configuration in order to change the resource requests and limits for each service individually:

spec:
  <service-name>: # for example: pod-grouper
    resources:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 512Mi

For resource recommendations, see Vertical scaling.

    NVIDIA Run:ai Services Replicas

    By default, all NVIDIA Run:ai containers are deployed with a single replica. Some services support multiple replicas for redundancy and performance.

To simplify configuring replicas, a global replicas configuration can be set and is applied to all supported services:

spec:
  global:
    replicaCount: 1 # default

This can be overridden for specific services (if supported). Services without a replicas configuration do not support multiple replicas:

spec:
  <service-name>: # for example: pod-grouper
    replicas: 1 # default

    Prometheus

    The Prometheus instance in NVIDIA Run:ai is used for metrics collection and alerting.

The configuration scheme follows the official PrometheusSpec and supports additional custom configurations. The PrometheusSpec schema is available using the spec.prometheus.spec configuration.

A common use case for the PrometheusSpec is metrics retention. Configuring local temporary metrics retention prevents metrics loss during potential connectivity issues. For more information, see Prometheus Storage:

spec:
  prometheus:
    spec: # PrometheusSpec
      retention: 2h # default
      retentionSize: 20GB

    In addition to the PrometheusSpec schema, some custom NVIDIA Run:ai configurations are also available:

    • Additional labels – Set additional labels for NVIDIA Run:ai's built-in alerts sent by Prometheus.

    • Log level configuration – Configure the logLevel setting for the Prometheus container.

spec:
  prometheus:
    logLevel: info # debug | info | warn | error
    additionalAlertLabels:
      - env: prod # example

    NVIDIA Run:ai Managed Nodes

To include or exclude specific nodes from running workloads within a cluster managed by NVIDIA Run:ai, use the nodeSelectorTerms flag. For additional details, see Kubernetes nodeSelector.

Define the node selector terms using the fields below:

    • key: Label key (e.g., zone, instance-type).

    • operator: Operator defining the inclusion/exclusion condition (In, NotIn, Exists, DoesNotExist).

    • values: List of values for the key when using In or NotIn.

The example below shows how to include NVIDIA GPUs only and exclude all other GPU types in a cluster with mixed nodes, based on the GPU product type label:

spec:
  global:
    managedNodes:
      inclusionCriteria:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: Exists

    S3 and Git Sidecar Images

For air-gapped environments, when working with a Local Certificate Authority, it is required to replace the default sidecar images in order to use the Git and S3 data source integrations. Use the following configurations:

spec:
  workload-controller:
    s3FileSystemImage:
      name: goofys
      registry: runai.jfrog.io/op-containers-prod
      tag: 3.12.24
    gitSyncImage:
      name: git-sync
      registry: registry.k8s.io
      tag: v4.4.0
