Accelerating Workloads with Network Topology-Aware Scheduling

Topology-aware scheduling in NVIDIA Run:ai optimizes the placement of workloads across data center nodes by leveraging knowledge of the underlying network topology. In modern AI/ML clusters, communication between the different pods of a distributed workload can be a significant performance bottleneck. By scheduling workloads' pods on nodes that are “closer” to each other in the network (e.g., same rack, block, NVLink domain), NVIDIA Run:ai reduces communication overhead and improves workload efficiency.

Kubernetes represents hierarchical network structures, such as racks, blocks, or NVLink domains, using node labels. These labels describe where each node resides in the data center. The NVIDIA Run:ai Scheduler then uses this topology information to keep workloads on nodes that minimize latency and maximize bandwidth availability.

Note

  • Distributed workloads include both distributed training and distributed inference.

  • For guidance on using topology-aware scheduling with GB200 and Multi-Node NVLink (MNNVL) systems, see Using GB200 and Multi-Node NVLink Domains.

Benefits of Using Network Topology

  • Improved performance for distributed workloads - Reduces inter-node communication latency by scheduling pods on nodes that are closer to each other.

  • Optimized GPU utilization - Keeps workloads within NVLink/NVL72 domains where possible, leveraging high-bandwidth interconnects.

  • Multi-level topology - Supports multi-level topology definitions (e.g., rack → block → node), giving administrators fine-grained control. The NVIDIA Run:ai Scheduler considers all levels when placing workloads. If scheduling cannot occur at a lower level, it automatically moves up the hierarchy, attempting placement layer by layer in order.

  • Seamless distributed workload experience - Topology-aware scheduling for distributed workloads is applied automatically once an administrator has configured the network topology. This ensures performance gains without requiring any additional user configuration.

Topology Labels in Kubernetes

Topology-aware scheduling relies on Kubernetes node labels that describe each node’s location in the topology. These labels are applied at the Kubernetes level and are managed outside of NVIDIA Run:ai, typically by tools such as Topograph or by the cloud provider.

Once the labels are in place, administrators only need to specify the label keys (as described below) that represent the topology the NVIDIA Run:ai Scheduler should reference when making placement decisions.
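For illustration, the snippet below is a minimal sketch of how such labels might appear on a node. The zone label is a well-known Kubernetes label; the block and rack keys are hypothetical placeholders, since the actual keys depend on how Topograph or your cloud provider labels nodes.

```yaml
# Illustrative node metadata: topology.kubernetes.io/zone is a well-known
# Kubernetes label, while the block and rack keys are hypothetical placeholders
# standing in for whatever keys Topograph or your cloud provider applies.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-17
  labels:
    topology.kubernetes.io/zone: us-east-1a
    example.com/network-block: block-2
    example.com/network-rack: rack-b
```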

Configuring Topology-Aware Scheduling

When creating or editing a cluster, administrators assign a network topology that represents the connectivity of nodes within the cluster. This topology is defined through Kubernetes node label keys. See Managing network topologies for more details.

These labels must match those configured on the cluster nodes and are used by the NVIDIA Run:ai Scheduler to guide workload placement decisions. The order of labels defines the hierarchy:

  • The first label represents the farthest point in the network (for example, a region).

  • The last label represents the closest switch or node (for example, a hostname or rack).

Example:
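A minimal sketch, reusing the hypothetical label keys from the node example above; the keys are listed in the order they would be entered, from the farthest network level to the closest:

```yaml
# Hypothetical ordering of label keys, from the farthest network level to the
# closest (the actual keys must match those applied to your nodes):
- topology.kubernetes.io/zone    # farthest level (zone)
- example.com/network-block      # intermediate level (block)
- example.com/network-rack       # closest level (rack)
```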

After creating a topology, the administrator must associate it with the relevant node pool(s).

  • If you are using the default node pool (that is, no additional node pools are defined), attach the topology to the default node pool.

  • If different node pools have different topologies, each node pool must be linked to its corresponding topology.

  • If the entire cluster shares the same topology, link the same topology to all node pools.

Submitting Distributed Workloads

When a distributed workload is submitted, the platform automatically applies topology-aware scheduling based on the network topology configured on the target node pool. This behavior ensures that distributed workloads benefit from improved performance without additional user configuration.

Topology-aware scheduling in NVIDIA Run:ai is applied at the workload level. This means the Scheduler considers the entire distributed workload as a single unit and places all of its pods according to the same topology constraints.

NVIDIA Run:ai automatically applies a Preferred topology constraint at the lowest defined topology level. This co-locates pods as close as possible in the network hierarchy, reducing communication overhead. If the Scheduler cannot place pods at this level, it automatically escalates placement by moving up through the topology hierarchy (for example: from node → rack → block → zone), always seeking the closest available level to minimize latency.

Submitting Distributed Workloads via YAML

When submitting distributed workloads via YAML, topology-aware scheduling is not applied automatically by NVIDIA Run:ai. Instead, you can configure the workload with annotations to ensure the Scheduler respects the desired topology.

  • Add annotations to the workload manifest by specifying the topology name and constraint type.

  • You can use either Required or Preferred constraints, or combine both for the same topology tree. When using both, note that network topologies are hierarchical (tree-structured). Applying a Preferred constraint at the same level as, or higher than, a Required constraint has no effect. A Preferred constraint is meaningful only when it is defined at a lower (more specific) topology level than the Required constraint. In this case, the topology-aware scheduling logic attempts to further group pods at that lower level, while still enforcing the mandatory Required constraint.

For example:
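The following is a minimal sketch of such a manifest, assuming a topology named gb200-topology and using placeholder annotation keys rather than the authoritative NVIDIA Run:ai key names (see the Network Topologies API reference for the exact annotations):

```yaml
# Illustrative only: the annotation keys below are placeholders, not the
# authoritative NVIDIA Run:ai names. The topology name and the label keys
# referenced must match the topology configured for the target node pool.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-train
  annotations:
    example.com/topology: gb200-topology                               # topology name
    example.com/topology-required-level: example.com/network-block     # Required: same block
    example.com/topology-preferred-level: example.com/network-rack     # Preferred: same rack
spec:
  # ... PyTorchJob spec (replica specs, containers, resources)
```

In this sketch the Preferred constraint sits one level below the Required constraint, so the Scheduler must keep all pods within a single block and additionally tries to group them within a single rack.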

Pod Affinity vs. Topology-Aware Scheduling

The following example demonstrates the difference between pod affinity and topology-aware scheduling when placing distributed workloads across two GB200 racks:

In the example, two workloads (Workload 1 requiring 12 nodes and Workload 2 requiring 15 nodes) are already running. A third workload that requires 6 nodes is submitted:

  • With pod affinity, the Scheduler places pods one by one, checking only for “closeness” to existing pods without considering the workload’s full node requirement against the available nodes. As a result, the workload is split across Rack A and Rack B, introducing unnecessary cross-rack communication overhead.

  • In contrast, topology-aware scheduling evaluates the full node requirement in advance and uses knowledge of the hierarchy (rack, block, NVLink domains) to allocate resources. This ensures all 6 nodes are placed together in Rack A, minimizing latency and maximizing bandwidth efficiency. By avoiding fragmentation and cross-rack placement, topology-aware scheduling improves workload performance and overall cluster utilization compared to pod affinity.

Using API

To view the available actions, go to the Network Topologies API reference.

Known Limitations

  • If a topology is detached from a node pool, workloads that are already running will continue using it, as long as the topology still exists.

  • If a topology is completely deleted while workloads are still using it:

    • Running workloads will continue unaffected.

    • Suspended workloads that are later resumed, or workloads not yet bound to a node, will become unschedulable and remain in Pending.

  • Submitting a workload to multiple node pools that each have different topologies is not supported. Workloads submitted through NVIDIA Run:ai will fail, while external workloads may either remain pending or run if the topology matches at least one node pool.
