Node roles
This article explains how to designate specific node roles in a Kubernetes cluster to ensure optimal performance and reliability in production deployments.
For optimal performance in production clusters, it is essential to avoid extensive CPU usage on GPU nodes where possible. This can be done by ensuring the following:
NVIDIA Run:ai system-level services run on dedicated CPU-only nodes.
Workloads that do not request GPU resources (e.g. Machine Learning jobs) are executed on CPU-only nodes.
NVIDIA Run:ai services are scheduled on the defined node roles by applying Kubernetes node affinity based on node labels.
Prerequisites
To perform these tasks, make sure to install the NVIDIA Run:ai Administrator CLI.
Configure Node Roles
The following node roles can be configured on the cluster:
System node: Reserved for NVIDIA Run:ai system-level services.
GPU Worker node: Dedicated for GPU-based workloads.
CPU Worker node: Used for CPU-only workloads.
System nodes
System nodes run the system-level services that NVIDIA Run:ai requires to operate. You can assign the system role using kubectl (the preferred method) or the NVIDIA Run:ai Administrator CLI.
By default, NVIDIA Run:ai applies a node affinity rule that prefers nodes labeled with node-role.kubernetes.io/runai-system when scheduling system services. You can modify the default node affinity rule by:
Editing the spec.global.affinity configuration parameter, as detailed in Advanced cluster configurations.
Editing the global.affinity configuration, as detailed in Advanced control plane configurations, for self-hosted deployments.
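To illustrate what a "prefer" rule looks like, the sketch below prints a node affinity fragment in the standard Kubernetes Affinity schema. It assumes the affinity setting accepts that schema unchanged; the function name, the weight of 1, and the Exists operator are illustrative choices, not values taken from the NVIDIA Run:ai configuration reference.

```shell
#!/usr/bin/env bash
# Label key taken from the article.
SYSTEM_LABEL="node-role.kubernetes.io/runai-system"

# Print a "preferred" node affinity fragment in the core Kubernetes schema.
# Assumptions: weight 1 and operator Exists are illustrative; check the
# configuration reference for the shape the product actually expects.
print_preferred_affinity() {
  cat <<EOF
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: ${SYSTEM_LABEL}
              operator: Exists
EOF
}
```

Calling print_preferred_affinity writes the fragment to standard output. Because the rule is preferred rather than required, the scheduler can still place system services on unlabeled nodes when no labeled node is available.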
Note
To ensure high availability and prevent a single point of failure, it is recommended to configure at least three system nodes in your cluster.
Important
Do not assign a system node role to the Kubernetes master node. This may disrupt Kubernetes functionality, particularly if the Kubernetes API Server is configured to use port 443 instead of the default 6443.
Kubectl
To set a system role for a node in your Kubernetes cluster using Kubectl, follow these steps:
Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.
Run one of the following commands to label the node with its role:
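A minimal sketch of the labeling step, using the label key named in this article with the standard kubectl label command. The label value true, the helper function names, and the KUBECTL override variable are assumptions made for illustration, as is the idea that deleting the label is enough to clear the role.

```shell
#!/usr/bin/env bash
# Label key taken from the article; the value "true" is an assumption.
SYSTEM_LABEL="node-role.kubernetes.io/runai-system"
# KUBECTL can be overridden (for example KUBECTL=echo) to preview commands.
KUBECTL="${KUBECTL:-kubectl}"

# Add the system role label to the node named in $1.
set_system_role() {
  "$KUBECTL" label nodes "$1" "${SYSTEM_LABEL}=true" --overwrite
}

# Remove the system role label from the node named in $1.
# In kubectl, a trailing "-" after the key deletes the label.
remove_system_role() {
  "$KUBECTL" label nodes "$1" "${SYSTEM_LABEL}-"
}
```

For example, set_system_role <node-name> labels the node identified in the previous step, and remove_system_role <node-name> deletes the label again.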
NVIDIA Run:ai Administrator CLI
To set a system role for a node in your Kubernetes cluster, follow these steps:
Run the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.
Run one of the following commands to set or remove a node’s role:
The set node-role command labels the node and sets the relevant cluster configurations.
Worker nodes
Worker nodes run user-submitted workloads and the system-level DaemonSets that NVIDIA Run:ai requires to operate. You can assign worker roles using kubectl (the preferred method) or the NVIDIA Run:ai Administrator CLI.
By default, GPU workloads are scheduled on GPU nodes based on the nvidia.com/gpu.present label. When global.nodeAffinity.restrictScheduling is set to true via the Advanced cluster configurations:
GPU workloads are scheduled with a node affinity rule that requires nodes labeled with node-role.kubernetes.io/runai-gpu-worker.
CPU-only workloads are scheduled with a node affinity rule that requires nodes labeled with node-role.kubernetes.io/runai-cpu-worker.
Kubectl
To set a worker role for a node in your Kubernetes cluster using Kubectl, follow these steps:
Verify that global.nodeAffinity.restrictScheduling is set to true in the cluster’s configurations.
Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.
Run one of the following commands to label the node with its role:
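A minimal sketch of the worker labeling step, using the two label keys named in this article with standard kubectl commands. The short role names gpu and cpu, the label value true, the helper function names, and the KUBECTL override variable are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# KUBECTL can be overridden (for example KUBECTL=echo) to preview commands.
KUBECTL="${KUBECTL:-kubectl}"

# Map a short role name (hypothetical: "gpu" or "cpu") to the label key
# named in the article.
worker_label() {
  case "$1" in
    gpu) echo "node-role.kubernetes.io/runai-gpu-worker" ;;
    cpu) echo "node-role.kubernetes.io/runai-cpu-worker" ;;
    *)   echo "unknown worker role: $1" >&2; return 1 ;;
  esac
}

# Label node $1 with worker role $2; the value "true" is an assumption.
set_worker_role() {
  label="$(worker_label "$2")" || return 1
  "$KUBECTL" label nodes "$1" "${label}=true" --overwrite
}

# List the nodes that currently carry the label for worker role $1.
list_worker_nodes() {
  label="$(worker_label "$1")" || return 1
  "$KUBECTL" get nodes -l "$label"
}
```

For example, set_worker_role <node-name> gpu labels a GPU worker, and list_worker_nodes gpu then shows which nodes carry that label. Since restricted scheduling requires the label rather than preferring it, listing the labeled nodes is a quick way to confirm that each workload type has somewhere to run.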
NVIDIA Run:ai Administrator CLI
To set a worker role for a node in your Kubernetes cluster via the NVIDIA Run:ai Administrator CLI, follow these steps:
Use the kubectl get nodes command to list all the nodes in your cluster and identify the name of the node you want to modify.
Run one of the following commands to set or remove a node’s role:
The set node-role command labels the node and sets the cluster configuration global.nodeAffinity.restrictScheduling to true.