Using Node Affinity via API

NVIDIA Run:ai leverages the Kubernetes Node Affinity feature to give administrators and researchers more control over where workloads are scheduled. This guide explains how NVIDIA Run:ai integrates with and supports the standard Kubernetes Node Affinity API, both directly in workload specifications and through administrative policies. For more details, refer to the official Kubernetes documentation.

Functionality

You can use the nodeAffinityRequired field within your workload specification (spec.nodeAffinityRequired) to define scheduling constraints based on node labels.

When a workload with a node affinity specification is submitted, the NVIDIA Run:ai Scheduler evaluates these constraints alongside other scheduling factors such as resource availability and fairness policies.

Note

Preferred node affinity (preferredDuringSchedulingIgnoredDuringExecution) is not supported.

Supported Features

  • nodeAffinityRequired (requiredDuringSchedulingIgnoredDuringExecution) - Define hard requirements for node selection. Pods are scheduled only onto nodes that meet these requirements.

  • Node selector terms - Use nodeSelectorTerms with matchExpressions to specify label-based rules.

  • Operators - Supported operators in matchExpressions include: In, NotIn, Exists, DoesNotExist, Gt, and Lt.

nodeAffinityRequired      <Object>
  nodeSelectorTerms       <[]Object>
    matchExpressions      <[]Object>
      key                 <string>
      operator            <enum> (In, NotIn, Exists, DoesNotExist, Gt, Lt)
      values              <[]string>
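
For illustration, here is a minimal concrete instance of this schema combining the Exists and Gt operators. The node label keys used here (nvidia.com/gpu.present, gpu-memory-gb) are hypothetical stand-ins for labels that exist on your own nodes:

```json
{
  "nodeAffinityRequired": {
    "nodeSelectorTerms": [
      {
        "matchExpressions": [
          { "key": "nvidia.com/gpu.present", "operator": "Exists" },
          { "key": "gpu-memory-gb", "operator": "Gt", "values": ["40"] }
        ]
      }
    ]
  }
}
```

Both expressions within a single nodeSelectorTerm must match (they are ANDed), so this rule selects only nodes that carry the GPU label and report a value greater than 40 in the gpu-memory-gb label.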

Setting Node Affinity in Workload Submissions

When submitting a workload, include the nodeAffinityRequired field in the API body. This field should describe the required node affinity rule, similar to Kubernetes’ nodeAffinity under requiredDuringSchedulingIgnoredDuringExecution.

See NVIDIA Run:ai API for more details.

Example:
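
A hedged sketch of a possible submission body follows; the surrounding fields (name, projectId, clusterId, image) and the gpu-type node label are illustrative, while spec.nodeAffinityRequired follows the schema above:

```json
{
  "name": "train-resnet",
  "projectId": "<project-id>",
  "clusterId": "<cluster-id>",
  "spec": {
    "image": "nvcr.io/nvidia/pytorch:24.05-py3",
    "nodeAffinityRequired": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "gpu-type",
              "operator": "In",
              "values": ["a100", "h100"]
            }
          ]
        }
      ]
    }
  }
}
```

With this rule, the workload is scheduled only onto nodes labeled gpu-type=a100 or gpu-type=h100.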

Viewing Node Affinity in Workloads

nodeAffinity is dynamically generated by combining user input with system-level scheduling requirements. The final, effective affinity expression is a result of several components:

  • User-defined affinity - The initial rules you provide for the workload

  • Platform features - System-generated rules for features such as node pools and Multi-Node NVLink (MNNVL)

  • Scheduling policies - Additional constraints applied by the NVIDIA Run:ai Scheduler

As a result, the affinity expression returned by the GET workloads/{workloadId}/pods endpoint reflects this final merged configuration, not only your original input.

Example:

A user submits a workload excluding nodes runai-cluster-system-0-0 and runai-cluster-system-0-1:
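
One plausible shape of the user-defined rule as submitted (kubernetes.io/hostname is the standard Kubernetes node name label):

```json
{
  "nodeAffinityRequired": {
    "nodeSelectorTerms": [
      {
        "matchExpressions": [
          {
            "key": "kubernetes.io/hostname",
            "operator": "NotIn",
            "values": ["runai-cluster-system-0-0", "runai-cluster-system-0-1"]
          }
        ]
      }
    ]
  }
}
```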

The project also has quotas on two node pools: pool-a and pool-b. The merged affinity expression returned by the API reflects both the user-defined rules and the system-enforced node pool constraints:
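
One plausible shape of the merged expression returned by GET workloads/{workloadId}/pods; the node pool label key shown here (run.ai/node-pool) is illustrative, since node pools are defined by administrator-chosen node labels:

```json
{
  "requiredDuringSchedulingIgnoredDuringExecution": {
    "nodeSelectorTerms": [
      {
        "matchExpressions": [
          {
            "key": "kubernetes.io/hostname",
            "operator": "NotIn",
            "values": ["runai-cluster-system-0-0", "runai-cluster-system-0-1"]
          },
          { "key": "run.ai/node-pool", "operator": "In", "values": ["pool-a"] }
        ]
      },
      {
        "matchExpressions": [
          {
            "key": "kubernetes.io/hostname",
            "operator": "NotIn",
            "values": ["runai-cluster-system-0-0", "runai-cluster-system-0-1"]
          },
          { "key": "run.ai/node-pool", "operator": "In", "values": ["pool-b"] }
        ]
      }
    ]
  }
}
```

Because nodeSelectorTerms are ORed while the matchExpressions inside each term are ANDed, the user's hostname exclusion is repeated in each per-pool term so it applies regardless of which node pool is selected.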

Applying Node Affinity via Policies

Administrators can enforce node affinity policies in two ways:

  • Can edit - The administrator applies a policy, but users can override it when submitting a workload.

  • Can't edit - The administrator applies a policy that can't be overridden by the user.

Example:
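
A hedged sketch of a policy body as it might be submitted through the Policies API; the run.ai/dedicated label and its value are hypothetical, and setting canEdit to false under rules.nodeAffinityRequired corresponds to the "Can't edit" option above:

```json
{
  "defaults": {
    "nodeAffinityRequired": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "run.ai/dedicated",
              "operator": "In",
              "values": ["research"]
            }
          ]
        }
      ]
    }
  },
  "rules": {
    "nodeAffinityRequired": {
      "canEdit": false
    }
  }
}
```

With canEdit set to true instead, the default acts as a starting point that users may override at submission time.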
