Accelerating Workloads with Network Topology-Aware Scheduling
Topology-aware scheduling in NVIDIA Run:ai optimizes the placement of workloads across data center nodes by leveraging knowledge of the underlying network topology. In modern AI/ML clusters, communication between the different pods of a distributed workload can be a significant performance bottleneck. By scheduling workloads' pods on nodes that are “closer” to each other in the network (e.g., same rack, block, NVLink domain), NVIDIA Run:ai reduces communication overhead and improves workload efficiency.
Kubernetes represents hierarchical network structures, such as racks, blocks, or NVLink domains, using node labels. These labels describe where each node resides in the data center. The NVIDIA Run:ai Scheduler then uses this topology information to keep workloads on nodes that minimize latency and maximize bandwidth availability.
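For illustration, a node's position in the hierarchy might be described with labels such as the following. This is a minimal sketch; the node name, label keys, and values are examples only and should match whatever labels are actually applied to the nodes in your cluster:
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-17              # example node name
  labels:
    # Example topology labels, ordered here from farthest (block) to closest (hostname)
    cloud.provider.com/topology-block: block-1
    cloud.provider.com/topology-rack: rack-a
    kubernetes.io/hostname: gpu-node-17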
Benefits of Using Network Topology
Improved performance for distributed workloads - Reduces inter-node communication latency by scheduling pods on nodes closer to each other.
Optimized GPU utilization - Keeps workloads within NVLink/NVL72 domains where possible, leveraging high-bandwidth interconnects.
Multi-level topology - Supports multi-level topology definitions (e.g., rack → block → node), giving administrators fine-grained control. The NVIDIA Run:ai Scheduler considers all levels when placing workloads. If scheduling cannot occur at a lower level, it automatically moves up the hierarchy, attempting placement layer by layer in order.
Seamless distributed workload experience - Topology-aware scheduling for distributed workloads is applied automatically once an administrator has configured the network topology. This ensures performance gains without requiring any additional user configuration.

The diagram above compares pod affinity and topology-aware scheduling across two GB200 racks. In the example, two workloads (Workload 1 requiring 12 nodes and Workload 2 requiring 15 nodes) are already running, and a third workload requiring 6 nodes is submitted. With pod affinity, the Scheduler places pods one by one, checking only for “closeness” to existing pods without considering the workload's full node requirement against the available nodes. As a result, the workload is split across Rack A and Rack B, introducing unnecessary cross-rack communication overhead. In contrast, topology-aware scheduling evaluates the full node requirement in advance and uses knowledge of the hierarchy (rack, block, NVLink domain) to allocate resources, so all 6 nodes are placed together in Rack A, minimizing latency and maximizing bandwidth efficiency. By avoiding fragmentation and cross-rack placement, topology-aware scheduling improves workload performance and overall cluster utilization compared to pod affinity.
Enabling Topology-Aware Scheduling
When creating or editing a cluster, administrators assign a network topology that represents the connectivity of nodes within the cluster. This topology is defined through Kubernetes node label keys. See Managing network topologies for more details.
These labels must match those configured on the cluster nodes and are used by the NVIDIA Run:ai Scheduler to guide workload placement decisions. The order of labels defines the hierarchy:
The first label represents the farthest point in the network (for example, a block).
The last label represents the closest switch or node (for example, a hostname or rack).
Example:
spec:
  levels:
    - nodeLabel: "cloud.provider.com/topology-block"
    - nodeLabel: "cloud.provider.com/topology-rack"
    - nodeLabel: "kubernetes.io/hostname"
After creating a topology, the administrator must associate it with the relevant node pool(s).
If different node pools have different topologies, each node pool must be linked to its corresponding topology (see the example after this list).
If the entire cluster shares the same topology, link the same topology to all node pools.
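As an illustration, two node pools might use different level definitions. The snippets below follow the same format as the earlier example; the label keys are placeholders and must match the labels present on the nodes in each pool:
# Topology for node pool A: block → rack → node
spec:
  levels:
    - nodeLabel: "cloud.provider.com/topology-block"
    - nodeLabel: "cloud.provider.com/topology-rack"
    - nodeLabel: "kubernetes.io/hostname"
---
# Topology for node pool B: rack → node only
spec:
  levels:
    - nodeLabel: "cloud.provider.com/topology-rack"
    - nodeLabel: "kubernetes.io/hostname"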
Submitting Distributed Workloads
When a distributed workload is submitted, the NVIDIA Run:ai Scheduler automatically applies topology-aware placement using the defined topology. The workload uses the network topology configured on its target node pool, ensuring that distributed workloads benefit from improved performance without additional user configuration. The NVIDIA Run:ai platform provides the following automation and visibility:
Automatic placement - For distributed workloads (training and inference), NVIDIA Run:ai automatically applies a Preferred topology constraint at the lowest defined topology level. This co-locates pods as close as possible in the network hierarchy, reducing communication overhead. If the Scheduler cannot place pods at this level, it automatically escalates placement by moving up through the topology hierarchy (for example: from node → rack → block → zone), always seeking the closest available level to minimize latency.
Workloads API visibility - The topology name and the constraint applied automatically by NVIDIA Run:ai are visible via the Workloads API.
Submitting Distributed Workloads via YAML
When submitting distributed workloads via YAML, topology-aware scheduling is not applied automatically by NVIDIA Run:ai. Instead, you can configure the workload with annotations to ensure the Scheduler respects the desired topology.
Add annotations to the workload manifest by specifying the topology name and constraint type. You can request both Preferred and Required constraints, but the Preferred constraint must be applied at a lower level in the topology hierarchy than the Required constraint.
For example, you can specify Preferred for the rack level and Required for the zone level:
apiVersion: batch/v1
kind: Job
metadata:
  name: topology-aware-job
  annotations:
    kai.scheduler/topology-preferred-placement: "rack"
    kai.scheduler/topology-required-placement: "zone"
    kai.scheduler/topology: "network"
Using the API
To view the available actions, go to the Network Topologies API reference.
Known Limitations
If a topology is detached from a node pool, workloads that are already running will continue using it, as long as the topology still exists.
If a topology is completely deleted while workloads are still using it:
Running workloads will continue unaffected.
Suspended workloads that are later resumed, or workloads not yet bound to a node, will become unschedulable and remain in Pending.
Submitting a workload to multiple node pools that each have different topologies is not supported. Workloads submitted through NVIDIA Run:ai will fail, while external workloads may either remain pending or run if the topology matches at least one node pool.