Advanced Cluster Configurations
Advanced cluster configurations allow you to customize your NVIDIA Run:ai cluster deployment to support your environment. Some settings may be required for deployment, while others can be fine-tuned to align with organizational policies, security requirements, or other operational preferences.
By adjusting these configurations, you can influence system behavior, including functionality, scheduling policies, and resource management, giving you greater control over how the cluster operates. This article provides guidance on configuring and managing these settings so you can adapt your NVIDIA Run:ai cluster to your organization’s needs.
Configuration Scope
The Helm chart provides the complete set of configuration options for the NVIDIA Run:ai cluster. Some options, such as global.affinity, are available only through Helm. The rest of the configurable settings are grouped under clusterConfig. These clusterConfig settings can be applied through Helm as part of your deployment or upgrade process.
The clusterConfig subset can also be managed at runtime through the runaiconfig Custom Resource (under spec). For details, see Modify cluster configurations at runtime.
At runtime, runaiconfig is the source of truth for the active cluster configuration. If a configuration key is defined in both Helm and runaiconfig and the values differ, a Helm upgrade will overwrite the runaiconfig value to match the chart.
Helm Chart Values
The NVIDIA Run:ai cluster installation can be customized to support your environment via Helm values files or Helm install flags. For example:
# values.yaml
global:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-role.kubernetes.io/runai-system
            operator: Exists
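A values file like the one above can then be passed during installation or upgrade. A sketch, assuming illustrative release and chart names:

```shell
# Release and chart names are assumptions; substitute your own.
helm upgrade -i runai-cluster runai/runai-cluster -n runai -f values.yaml
```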
Modify Cluster Configurations at Runtime
The clusterConfig subset of settings can also be managed at runtime via the runaiconfig Kubernetes Custom Resource.
To edit the cluster configurations, run:
kubectl edit runaiconfig runai -n runai
To see the full runaiconfig object structure, use:
kubectl get crds/runaiconfigs.run.ai -n runai -o yaml
When using runaiconfig, the clusterConfig values appear under spec.
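For example, a key written as clusterConfig.global.subdomainSupport in Helm appears in the Custom Resource as follows (a sketch of a runaiconfig fragment; apiVersion and metadata are omitted):

```yaml
# Fragment of the runaiconfig Custom Resource; clusterConfig keys live under spec
spec:
  global:
    subdomainSupport: true
```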
Configurations
The following configurations allow you to enable or disable features, control permissions, and customize the behavior of your NVIDIA Run:ai cluster:
global.image.registry
(string)
Global Docker image registry
Default: ""
global.additionalImagePullSecrets
(list)
List of image pull secrets references
Default: []
global.affinity
(object)
Sets the system nodes where NVIDIA Run:ai system-level services are scheduled. Using global.affinity will overwrite the node roles set using the Administrator CLI (runai-adm).
Default: Prefer to schedule on nodes that are labeled with node-role.kubernetes.io/runai-system
global.tolerations
(object)
Configure Kubernetes tolerations for NVIDIA Run:ai system-level services
global.additionalJobLabels
(object)
Set NVIDIA Run:ai and 3rd party services' Pod Labels as key/value pairs.
Default: ""
global.additionalJobAnnotations
(object)
Set NVIDIA Run:ai and 3rd party services' Annotations as key/value pairs.
Default: ""
global.customCA.enabled
Enables the use of a custom Certificate Authority (CA) in your deployment. When set to true, the system is configured to trust a user-provided CA certificate for secure communication.
global.customCAGit.enabled
Enables the use of a custom Certificate Authority (CA) for Git data sources. When set to true, the system uses the global CA certificate defined at installation unless overridden using global.customCAGit.secret.name.
global.customCAGit.secret.name
Specifies the name of the Kubernetes secret that contains a custom CA certificate for Git data sources. Overrides the default global CA when global.customCAGit.enabled is set to true.
global.customCAS3.enabled
Enables the use of a custom Certificate Authority (CA) for S3 data sources. When set to true, the system uses the global CA certificate defined at installation unless overridden using global.customCAS3.secret.name.
global.customCAS3.secret.name
Specifies the name of the Kubernetes secret that contains a custom CA certificate for S3 data sources. Overrides the default global CA when global.customCAS3.enabled is set to true.
openShift.securityContextConstraints.create
Enables the deployment of Security Context Constraints (SCC). Disable for CIS compliance.
Default: true
researcherService.ingress.tlsSecret
(string)
Existing secret key where cluster TLS certificates are stored (non-OpenShift)
Default: runai-cluster-domain-tls-secret
researcherService.route.tlsSecret
(string)
Existing secret key where cluster TLS certificates are stored (OpenShift only)
Default: ""
clusterConfig.global.nodeAffinity.restrictScheduling
(boolean)
Enables setting node roles and restricting workload scheduling to designated nodes
Default: false
clusterConfig.global.subdomainSupport
(boolean)
Allows the creation of subdomains for ingress endpoints, enabling access to workloads via unique subdomains on the Fully Qualified Domain Name (FQDN). For details, see External Access to Containers.
Default: false
clusterConfig.global.devicePluginBindings
(boolean)
Instructs NVIDIA Run:ai fractions to use the device plugin for host mounts instead of an explicit host path mount configuration on the pod. See GPU fractions and dynamic GPU fractions.
Default: false
clusterConfig.global.enableWorkloadOwnershipProtection
(boolean)
Prevents users within the same project from deleting workloads created by others. This enhances workload ownership security and ensures better collaboration by restricting unauthorized modifications or deletions.
Default: false
clusterConfig.project-controller.createNamespaces
(boolean)
Allows Kubernetes namespace creation for new projects
Default: true
clusterConfig.project-controller.CreateRoleBindings
(boolean)
Specifies if role bindings should be created in the project's namespace
Default: true
clusterConfig.project-controller.limitRange
(boolean)
Specifies if limit ranges should be defined for projects
Default: true
clusterConfig.project-controller.clusterWideSecret
(boolean)
Allows Kubernetes Secrets creation at the cluster scope. See Credentials for more details.
Default: true
clusterConfig.workload-controller.failureResourceCleanupPolicy
Determines how NVIDIA Run:ai cleans up a failed workload's resources:
All - Removes all resources of the failed workload
None - Retains all resources
KeepFailing - Removes all resources except for those that encountered issues (primarily for debugging purposes)
Default: All
clusterConfig.workload-controller.GPUNetworkAccelerationEnabled
Enables GPU network acceleration. See Using GB200 NVL72 and Multi-Node NVLink Domains for more details.
Default: false
clusterConfig.mps-server.enabled
(boolean)
Enabled when using NVIDIA MPS
Default: false
clusterConfig.daemonSetsTolerations
(object)
Configure Kubernetes tolerations for NVIDIA Run:ai daemonSets / engine
clusterConfig.runai-container-toolkit.enabled
(boolean)
Enables workloads to use GPU fractions
Default: true
clusterConfig.runai-container-toolkit.logLevel
(string)
Specifies the NVIDIA Run:ai container toolkit logging level: one of 'SPAM', 'DEBUG', 'INFO', 'NOTICE', 'WARN', or 'ERROR'
Default: INFO
clusterConfig.node-scale-adjuster.args.gpuMemoryToFractionRatio
(object)
Sets the ratio (0.1-0.9) used to convert fractional GPU memory requests (of any size) into GPU fractions. One scaling pod requesting a single GPU device is created for every 1/gpuMemoryToFractionRatio pods requesting fractional GPU memory (at the default of 0.1, one scaling pod per 10 such pods).
Default: 0.1
clusterConfig.global.core.dynamicFractions.enabled
(boolean)
Enables dynamic GPU fractions
Default: true
clusterConfig.global.core.swap.enabled
(boolean)
Enables memory swap for GPU workloads
Default: false
clusterConfig.global.core.swap.limits.cpuRam
(string)
Sets the CPU memory size used to swap GPU workloads
Default: 100Gi
clusterConfig.global.core.swap.limits.reservedGpuRam
(string)
Sets the reserved GPU memory size used to swap GPU workloads
Default: 2Gi
clusterConfig.global.core.swap.biDirectional
(boolean)
Sets the read/write memory mode of GPU memory swap to bi-directional (fully duplex). This produces higher performance (typically +80%) vs. uni-directional (simplex) read-write operations. For more details, see GPU memory swap.
Default: false
clusterConfig.global.core.swap.mode
(string)
Sets the GPU to CPU memory swap method to use UVA and optimized memory prefetch for optimized performance in some scenarios. For more details, see GPU memory swap.
Default: None. The parameter is not set by default. To add this parameter, set mode=mapped.
clusterConfig.global.core.nodeScheduler.enabled
(boolean)
Enables the node-level scheduler
Default: false
clusterConfig.global.core.timeSlicing.mode
(string)
Sets the GPU time-slicing mode. Possible values:
timesharing - all pods on a GPU share the GPU compute time evenly.
strict - each pod gets an exact time slice according to its memory fraction value.
fair - each pod gets an exact time slice according to its memory fraction value, and any unused GPU compute time is split evenly between the running pods.
Default: timesharing
clusterConfig.runai-scheduler.args.fullHierarchyFairness
(boolean)
Enables fairness between departments, on top of projects fairness
Default: true
clusterConfig.runai-scheduler.args.defaultStalenessGracePeriod
Sets the timeout in seconds before the scheduler evicts a stale pod-group (gang) that went below its min-members in running state:
0s - Immediately (no timeout)
-1 - Never
Default: 60s
clusterConfig.runai-scheduler.args.verbosity
(int)
Configures the level of detail in the logs generated by the scheduler service
Default: 4
clusterConfig.pod-grouper.args.gangSchedulingKnative
(boolean)
Enables gang scheduling for inference workloads. For backward compatibility with versions earlier than v2.19, change the value to false
Default: false
clusterConfig.pod-grouper.args.gangScheduleArgoWorkflow
(boolean)
Groups all pods of a single ArgoWorkflow workload into a single Pod-Group for gang scheduling
Default: true
clusterConfig.limitRange.cpuDefaultRequestCpuLimitFactorNoGpu
(string)
Sets a default ratio between the CPU request and the limit for workloads without GPU requests
Default: 0.1
clusterConfig.limitRange.memoryDefaultRequestMemoryLimitFactorNoGpu
(string)
Sets a default ratio between the memory request and the limit for workloads without GPU requests
Default: 0.1
clusterConfig.limitRange.cpuDefaultRequestGpuFactor
(string)
Sets a default amount of CPU allocated per GPU when the CPU is not specified
Default: 100
clusterConfig.limitRange.cpuDefaultLimitGpuFactor
(int)
Sets a default CPU limit based on the number of GPUs requested when no CPU limit is specified
Default: NO DEFAULT
clusterConfig.limitRange.memoryDefaultRequestGpuFactor
(string)
Sets a default amount of memory allocated per GPU when the memory is not specified
Default: 100Mi
clusterConfig.limitRange.memoryDefaultLimitGpuFactor
(string)
Sets a default memory limit based on the number of GPUs requested when no memory limit is specified
Default: NO DEFAULT
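Any of the clusterConfig keys above can also be changed at runtime by patching the runaiconfig resource instead of editing it interactively. A sketch, enabling GPU memory swap using keys documented above:

```shell
# Sketch: toggle a documented clusterConfig key at runtime via a merge patch
kubectl patch runaiconfig runai -n runai --type merge \
  -p '{"spec": {"global": {"core": {"swap": {"enabled": true}}}}}'
```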
NVIDIA Run:ai Services Resource Management
The NVIDIA Run:ai cluster includes many different services. To simplify resource management, the configuration structure lets you set container CPU/memory resources for each service individually or for a group of services together.
SchedulingServices - Containers associated with the NVIDIA Run:ai Scheduler: Scheduler, StatusUpdater, MetricsExporter, PodGrouper, PodGroupAssigner, PodGroupController, QueueController, NodePoolController, Binder, DevicePlugin
SyncServices - Containers associated with syncing updates between the NVIDIA Run:ai cluster and the NVIDIA Run:ai control plane: Agent, ClusterSync, AssetsSync
WorkloadServices - Containers associated with submitting NVIDIA Run:ai workloads: WorkloadController, JobController, WorkloadOverseer, ExternalWorkloadIntegrator, ClusterRedis, ClusterAPI, InferenceWorkloadController, ResearcherService, SharedObjectsController, WorkloadExporter
To change the resource requests and limits for a group of services, apply the following configuration:
clusterConfig:
  global:
    <service-group-name>: # schedulingServices | SyncServices | WorkloadServices
      resources:
        limits:
          cpu: 1000m
          memory: 1Gi
        requests:
          cpu: 100m
          memory: 512Mi
Or, to change the resource requests and limits for an individual service, apply:
clusterConfig:
  <service-name>: # for example: pod-grouper
    resources:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 100m
        memory: 512Mi
For resource recommendations, see Vertical scaling.
NVIDIA Run:ai Services Replicas
By default, all NVIDIA Run:ai containers are deployed with a single replica. Some services support multiple replicas for redundancy and performance.
To simplify configuring replicas, a global replicas configuration can be set and is applied to all supported services:
clusterConfig:
  global:
    replicaCount: 1 # default
This can be overwritten for specific services (if supported). Services without the replicas configuration do not support replicas:
clusterConfig:
  <service-name>: # for example: pod-grouper
    replicas: 1 # default
Prometheus
The Prometheus instance in NVIDIA Run:ai is used for metrics collection and alerting.
The configuration scheme follows the official PrometheusSpec and supports additional custom configurations. The PrometheusSpec schema is available under the spec.prometheus.spec configuration.
A common use case for the PrometheusSpec is metrics retention: configuring local temporary retention prevents metrics loss during potential connectivity issues. For more information, see Prometheus Storage:
clusterConfig:
  prometheus:
    spec: # PrometheusSpec
      retention: 2h # default
      retentionSize: 20GB
In addition to the PrometheusSpec schema, some custom NVIDIA Run:ai configurations are also available:
Additional labels - Set additional labels for NVIDIA Run:ai's built-in alerts sent by Prometheus.
Log level configuration - Configure the logLevel setting for the Prometheus container.
Image override - Use prometheus.spec.image to manually specify the Prometheus image reference. Due to a known issue, the imageRegistry setting in the Prometheus Helm chart is ignored. To pull the image from a different registry, specify the full image reference. Default: quay.io/prometheus/prometheus.
Image pull secrets - Use prometheus.spec.imagePullSecrets to list Kubernetes image pull secrets in the runai namespace. This is particularly relevant for air-gapped installations where pulling Prometheus images requires authentication. Default: [].
clusterConfig:
  prometheus:
    logLevel: info # debug | info | warn | error
    additionalAlertLabels:
    - env: prod # example
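The image override and pull secrets described above can be combined in one fragment. A sketch; the registry path, image tag, and secret name are illustrative placeholders:

```yaml
clusterConfig:
  prometheus:
    spec:
      # Full image reference, since the chart's imageRegistry setting is ignored
      image: my-registry.example.com/prometheus/prometheus:v2.53.0  # illustrative
      imagePullSecrets:
      - name: my-pull-secret  # illustrative; must exist in the runai namespace
```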
NVIDIA Run:ai Managed Nodes
To include or exclude specific nodes from running workloads within a cluster managed by NVIDIA Run:ai, use the nodeSelectorTerms flag. For additional details, see Kubernetes nodeSelector.
Define the selector terms using the following fields:
key - Label key (e.g., zone, instance-type).
operator - Operator defining the inclusion/exclusion condition (In, NotIn, Exists, DoesNotExist).
values - List of values for the key when using In or NotIn.
The example below shows how to include only nodes with NVIDIA GPUs and exclude all other GPU types in a cluster with mixed nodes, based on the GPU product label:
clusterConfig:
  global:
    managedNodes:
      inclusionCriteria:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: Exists
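The same mechanism can express exclusion with the NotIn operator and a values list, as described by the fields above. A sketch; the product value is a hypothetical example:

```yaml
clusterConfig:
  global:
    managedNodes:
      inclusionCriteria:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: NotIn
            values:
            - Tesla-T4  # hypothetical product label to exclude
```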
Custom Certificate Authority for Git and S3
To override the default global CA used by the system and inject a custom CA certificate for Git or S3 data sources, follow these steps:
Create a Kubernetes secret with the custom CA certificate:
kubectl -n runai create secret generic runai-cluster-git-ca \
  --from-file=runai-ca-git.pem=<ca_bundle_path>
kubectl label secret runai-cluster-git-ca -n runai \
  run.ai/cluster-wide=true run.ai/name=runai-ca-cert --overwrite
When installing the cluster, make sure the following flags are added to the helm command. See Install cluster.
--set global.customCAGit.enabled=true \
--set global.customCAGit.secret.name=<secret-name>
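The S3 data source analog follows the same pattern, using the customCAS3 keys documented above (a sketch of the Helm flags):

```shell
--set global.customCAS3.enabled=true \
--set global.customCAS3.secret.name=<secret-name>
```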