Using Node Affinity via API

NVIDIA Run:ai leverages the Kubernetes Node Affinity feature to give administrators and researchers more control over where workloads are scheduled. This guide explains how NVIDIA Run:ai integrates with the standard Kubernetes Node Affinity API, both directly in workload specifications and through administrative policies. For more details, refer to the official Kubernetes documentation.

Functionality

Use the nodeAffinityRequired field within your workload specification (spec.nodeAffinityRequired) to define scheduling constraints based on node labels.

When a workload with a node affinity specification is submitted, the NVIDIA Run:ai Scheduler evaluates these constraints alongside other scheduling factors such as resource availability and fairness policies.

Note

Preferred node affinity (preferredDuringSchedulingIgnoredDuringExecution) is not supported.

Supported Features

  • nodeAffinityRequired (requiredDuringSchedulingIgnoredDuringExecution) - Define hard requirements for node selection. Pods are scheduled only onto nodes that meet these requirements.

  • Node selector terms - Use nodeSelectorTerms with matchExpressions to specify label-based rules.

  • Operators - Supported operators in matchExpressions include: In, NotIn, Exists, DoesNotExist, Gt, and Lt.

nodeAffinityRequired	<Object>
  nodeSelectorTerms	<[]Object>
    matchExpressions	<[]Object>
      key	    <string>
      operator	<enum> (In, NotIn, Exists, DoesNotExist, Gt, Lt)
      values	<[]string>
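The schema above can be expressed as a small builder in Python. This is an illustrative sketch, not part of any NVIDIA Run:ai SDK: the function names (match_expression, node_affinity_required) are assumptions, but the field names, operators, and structure follow the schema shown.

```python
# Hypothetical helpers for building a nodeAffinityRequired payload matching
# the schema above. Function names are illustrative; the field names and
# operator set come from the documented schema.

VALID_OPERATORS = {"In", "NotIn", "Exists", "DoesNotExist", "Gt", "Lt"}


def match_expression(key, operator, values=None):
    """Build one matchExpressions entry, validating the operator."""
    if operator not in VALID_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    expr = {"key": key, "operator": operator}
    # Exists / DoesNotExist take no values; the other operators require them.
    if operator in ("Exists", "DoesNotExist"):
        if values:
            raise ValueError(f"{operator} does not accept values")
    else:
        if not values:
            raise ValueError(f"{operator} requires at least one value")
        expr["values"] = list(values)
    return expr


def node_affinity_required(*terms):
    """Wrap one or more lists of match expressions into nodeSelectorTerms."""
    return {"nodeSelectorTerms": [{"matchExpressions": list(t)} for t in terms]}


affinity = node_affinity_required(
    [match_expression("run.ai/type", "In", ["training", "inference"])]
)
```

Note that a pod matches if it satisfies any one nodeSelectorTerms entry, but must satisfy every matchExpressions entry within that term.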

Setting Node Affinity in Workload Submissions

When submitting a workload, include the nodeAffinityRequired field in the API body. This field should describe the required node affinity rule, similar to Kubernetes’ nodeAffinity under requiredDuringSchedulingIgnoredDuringExecution.

See NVIDIA Run:ai API for more details.

Example:

curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \
-H 'Authorization: Bearer <API-TOKEN>' \
-H 'Content-Type: application/json' \
-d '{
      "name": "workload-name", 
      "projectId": "<PROJECT-ID>", 
      "clusterId": "<CLUSTER-UUID>", 
      "spec": {
        "nodeAffinityRequired": {
          "nodeSelectorTerms": [
            {
              "matchExpressions": [
                {
                  "key": "run.ai/type",
                  "operator": "In",
                  "values": ["training", "inference"]
                }
              ]
            }
          ]
        }
      }
    }'

Viewing Node Affinity in Workloads

nodeAffinity is dynamically generated by combining user input with system-level scheduling requirements. The final, effective affinity expression is a result of several components:

  • User-defined affinity - The initial rules you provide for the workload

  • Platform features - System-generated rules for features such as node pools and Multi-Node NVLink (MNNVL)

  • Scheduling policies - Additional constraints applied by the NVIDIA Run:ai Scheduler

As a result, the affinity expression returned by the GET workloads/{workloadId}/pods endpoint reflects this final merged configuration, not only your original input.

Example:

A user submits a workload excluding nodes runai-cluster-system-0-0 and runai-cluster-system-0-1:

"nodeAffinityRequired": {
        "nodeSelectorTerms": [
          {
            "matchExpressions": [
              {
                "key": "kubernetes.io/hostname",
                "operator": "NotIn",
                "values": [
                    "runai-cluster-system-0-0",
                    "runai-cluster-system-0-1"
                ]
              }
            ]
          }
       ]
     }

The project also has quotas on two node pools: pool-a and pool-b. The merged affinity expression returned by the API reflects both the user-defined rules and the system-enforced node pool constraints:

{
  "pods": [
    {
      .
      .
      .
      "requestedNodePools": [
          "pool-b",
          "pool-a"
      ],
      "nodeAffinity": {
        "required": {
          "nodeSelectorTerms": [
            {
              "matchExpressions": [
                {
                  "key": "kubernetes.io/hostname",
                  "operator": "NotIn",
                  "values": [
                    "runai-cluster-system-0-0",
                    "runai-cluster-system-0-1"
                  ]
                },
                {
                  "key": "node-pool-label",
                  "operator": "In",
                  "values": [
                    "b"
                  ]
                }
              ]
            },
            {
              "matchExpressions": [
                {
                  "key": "kubernetes.io/hostname",
                  "operator": "NotIn",
                  "values": [
                    "runai-cluster-system-0-0",
                    "runai-cluster-system-0-1"
                  ]
                },
                {
                  "key": "node-pool-label",
                  "operator": "In",
                  "values": [
                    "a" 
                  ]
                }
              ]
            }
          ]
        }
      },
      .
      .
      .
    }
  ]
}
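The merge shown above can be sketched as follows: each user-defined selector term is duplicated once per requested node pool, with an In expression on the pool label appended. This is inferred from the example output, not the actual scheduler implementation, and the "node-pool-label" key is the placeholder used in the example.

```python
import copy


def merge_node_pools(user_affinity, pools, pool_label_key="node-pool-label"):
    """Sketch of the merge behavior shown above: duplicate each user-defined
    selector term per node pool, appending an In rule on the pool label.
    Illustrative only; not the NVIDIA Run:ai Scheduler's implementation."""
    merged_terms = []
    for pool in pools:
        for term in user_affinity["nodeSelectorTerms"]:
            new_term = copy.deepcopy(term)
            new_term["matchExpressions"].append(
                {"key": pool_label_key, "operator": "In", "values": [pool]}
            )
            merged_terms.append(new_term)
    return {"required": {"nodeSelectorTerms": merged_terms}}


user = {
    "nodeSelectorTerms": [
        {"matchExpressions": [
            {"key": "kubernetes.io/hostname", "operator": "NotIn",
             "values": ["runai-cluster-system-0-0", "runai-cluster-system-0-1"]}
        ]}
    ]
}

# Two selector terms result, one per pool, each carrying the user's
# NotIn rule plus the system-enforced pool rule.
merged = merge_node_pools(user, ["b", "a"])
```

Because nodeSelectorTerms are ORed, this cross-product keeps the user's constraint in force on every pool while restricting scheduling to the pools the project has quota on.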

Applying Node Affinity via Policies

Administrators can enforce node affinity policies in two ways:

  • Can edit - The administrator applies a policy, but users can override it when submitting a workload.

  • Can't edit - The administrator applies a policy that can't be overridden by the user.

Example:

defaults:
  nodeAffinityRequired:
      nodeSelectorTerms:
        - matchExpressions:
            - key: app
              operator: In
              values:
                - frontend
                - backend

rules:
  nodeAffinityRequired:
    canEdit: false
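The interaction between defaults and the canEdit rule can be sketched in Python. This is a simplified model under stated assumptions, not the Run:ai policy engine: the function name is hypothetical, and it assumes a locked policy rejects a user-supplied value rather than silently replacing it.

```python
def resolve_affinity(policy_defaults, policy_rules, user_affinity):
    """Hypothetical sketch of canEdit enforcement: if the policy locks
    nodeAffinityRequired (canEdit: false), a user-supplied value is
    rejected; otherwise the user's value overrides the policy default."""
    can_edit = policy_rules.get("nodeAffinityRequired", {}).get("canEdit", True)
    if user_affinity is not None and not can_edit:
        # Assumed behavior: a locked field cannot be overridden at submission.
        raise PermissionError("nodeAffinityRequired is locked by policy")
    if user_affinity is not None:
        return user_affinity
    return policy_defaults.get("nodeAffinityRequired")
```

With the YAML example above, every workload in scope would receive the app In (frontend, backend) rule, and any attempt to submit a different nodeAffinityRequired would fail.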
