Using Node Affinity via API
NVIDIA Run:ai leverages the Kubernetes Node Affinity feature to give administrators and researchers more control over where workloads are scheduled. This guide explains how NVIDIA Run:ai integrates with and supports the standard Kubernetes Node Affinity API, both directly in workload specifications and through administrative policies. For more details, refer to the official Kubernetes documentation.
Functionality
You can use the nodeAffinity field within your workload specifications (spec.nodeAffinityRequired) to define scheduling constraints based on node labels.
When a workload with a node affinity specification is submitted, the NVIDIA Run:ai Scheduler evaluates these constraints alongside other scheduling factors such as resource availability and fairness policies.
Supported Features
nodeAffinityRequired (requiredDuringSchedulingIgnoredDuringExecution) - Define hard requirements for node selection. Pods are scheduled only onto nodes that meet these requirements.
Node selector terms - Use nodeSelectorTerms with matchExpressions to specify label-based rules.
Operators - Supported operators in matchExpressions include: In, NotIn, Exists, DoesNotExist, Gt, and Lt.
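For example, the Gt and Lt operators compare label values numerically. The following fragment is illustrative only (the nvidia.com/gpu.count label key is an assumption and may differ in your cluster); it matches nodes whose label value is greater than 4:

```json
{
  "matchExpressions": [
    {
      "key": "nvidia.com/gpu.count",
      "operator": "Gt",
      "values": ["4"]
    }
  ]
}
```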
nodeAffinityRequired <Object>
  nodeSelectorTerms <[]Object>
    matchExpressions <[]Object>
      key <string>
      operator <enum> (In, NotIn, Exists, DoesNotExist, Gt, Lt)
      values <[]string>
Setting Node Affinity in Workload Submissions
When submitting a workload, include the nodeAffinityRequired field in the API body. This field should describe the required node affinity rule, similar to Kubernetes’ nodeAffinity under requiredDuringSchedulingIgnoredDuringExecution.
See NVIDIA Run:ai API for more details.
Example:
curl -L 'https://<COMPANY-URL>/api/v1/workloads/workspaces' \
-H 'Authorization: Bearer <API-TOKEN>' \
-H 'Content-Type: application/json' \
-d '{
  "name": "workload-name",
  "projectId": "<PROJECT-ID>",
  "clusterId": "<CLUSTER-UUID>",
  "spec": {
    "nodeAffinityRequired": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "run.ai/type",
              "operator": "In",
              "values": ["training", "inference"]
            }
          ]
        }
      ]
    }
  }
}'
Viewing Node Affinity in Workloads
nodeAffinity is dynamically generated by combining user input with system-level scheduling requirements. The final, effective affinity expression is the result of several components:
User-defined affinity - The initial rules you provide for the workload
Platform features - System-generated rules for features such as node pools and Multi-Node NVLink (MNNVL)
Scheduling policies - Additional constraints applied by the NVIDIA Run:ai Scheduler
As a result, the affinity expression returned by the GET workloads/{workloadId}/pods endpoint reflects this final merged configuration, not only your original input.
Example:
A user submits a workload excluding nodes runai-cluster-system-0-0 and runai-cluster-system-0-1:
"nodeAffinityRequired": {
"nodeSelectorTerms": [
{
"matchExpressions": [
{
"key": "kubernetes.io/hostname",
"operator": "NotIn",
"values": [
"runai-cluster-system-0-0",
"runai-cluster-system-0-1"
]
}
]
}
]
}
The project also has quotas on two node pools: pool-a and pool-b. The merged affinity expression returned by the API reflects both the user-defined rules and the system-enforced node pool constraints:
{
  "pods": [
    {
      ...
      "requestedNodePools": [
        "pool-b",
        "pool-a"
      ],
      "nodeAffinity": {
        "required": {
          "nodeSelectorTerms": [
            {
              "matchExpressions": [
                {
                  "key": "kubernetes.io/hostname",
                  "operator": "NotIn",
                  "values": [
                    "runai-cluster-system-0-0",
                    "runai-cluster-system-0-1"
                  ]
                },
                {
                  "key": "node-pool-label",
                  "operator": "In",
                  "values": [
                    "b"
                  ]
                }
              ]
            },
            {
              "matchExpressions": [
                {
                  "key": "kubernetes.io/hostname",
                  "operator": "NotIn",
                  "values": [
                    "runai-cluster-system-0-0",
                    "runai-cluster-system-0-1"
                  ]
                },
                {
                  "key": "node-pool-label",
                  "operator": "In",
                  "values": [
                    "a"
                  ]
                }
              ]
            }
          ]
        }
      },
      ...
    }
  ]
}
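The merge behavior can be sketched roughly in Python. This is an illustrative model only, not the actual NVIDIA Run:ai scheduler code; the function name and the node-pool label key are assumptions. It relies on Kubernetes semantics: nodeSelectorTerms are ORed, while matchExpressions within a term are ANDed, so each pool constraint is ANDed into a copy of every user-defined term:

```python
import copy

def merge_affinity(user_terms, pools, pool_label_key="node-pool-label"):
    """Produce one merged nodeSelectorTerm per node pool.

    Each pool constraint is ANDed (appended to matchExpressions) into a
    deep copy of every user-defined term; the copies are ORed together
    as separate nodeSelectorTerms.
    """
    merged = []
    for pool in pools:
        for term in user_terms:
            new_term = copy.deepcopy(term)
            new_term.setdefault("matchExpressions", []).append({
                "key": pool_label_key,
                "operator": "In",
                "values": [pool],
            })
            merged.append(new_term)
    return merged

# The user-defined term from the example above.
user_terms = [{
    "matchExpressions": [{
        "key": "kubernetes.io/hostname",
        "operator": "NotIn",
        "values": ["runai-cluster-system-0-0", "runai-cluster-system-0-1"],
    }]
}]

# Two merged terms result: each keeps the NotIn rule and adds one
# pool constraint, matching the API response shown above.
result = merge_affinity(user_terms, ["b", "a"])
```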
Applying Node Affinity via Policies
Administrators can enforce node affinity policies in two ways:
Can edit - The administrator applies a policy, but users can override it when submitting a workload.
Can't edit - The administrator applies a policy that can't be overridden by the user.
Example:
defaults:
  nodeAffinityRequired:
    nodeSelectorTerms:
      - matchExpressions:
          - key: app
            operator: In
            values:
              - frontend
              - backend
rules:
  nodeAffinityRequired:
    canEdit: true
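Conversely, to apply the same default while still allowing users to override it at submission time, the rule would mark the field as editable (shown here as a sketch, assuming the same canEdit flag governs both modes):

```yaml
rules:
  nodeAffinityRequired:
    canEdit: true
```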