Accelerating Workloads with Network Topology-Aware Scheduling

Topology-aware scheduling in NVIDIA Run:ai optimizes the placement of workloads across data center nodes by leveraging knowledge of the underlying network topology. In modern AI/ML clusters, communication between the different pods of a distributed workload can be a significant performance bottleneck. By scheduling workloads' pods on nodes that are “closer” to each other in the network (e.g., same rack, block, NVLink domain), NVIDIA Run:ai reduces communication overhead and improves workload efficiency.

Kubernetes represents hierarchical network structures, such as racks, blocks, or NVLink domains, using node labels. These labels describe where each node resides in the data center. The NVIDIA Run:ai Scheduler then uses this topology information to keep workloads on nodes that minimize latency and maximize bandwidth availability.

Note

  • Distributed workloads include both distributed training and distributed inference.

  • For guidance on using topology-aware scheduling with GB200 and Multi-Node NVLink (MNNVL) systems, see Using GB200 and Multi-Node NVLink Domains.

Benefits of Using Network Topology

  • Improved performance for distributed workloads - Reduces inter-node communication latency by scheduling pods on nodes that are closer to each other.

  • Optimized GPU utilization - Keeps workloads within NVLink/NVL72 domains where possible, leveraging high-bandwidth interconnects.

  • Multi-level topology - Supports multi-level topology definitions (e.g., rack → block → node), giving administrators fine-grained control. The NVIDIA Run:ai Scheduler considers all levels when placing workloads. If scheduling cannot occur at a lower level, it automatically moves up the hierarchy, attempting placement layer by layer in order.

  • Seamless distributed workload experience - Topology-aware scheduling for distributed workloads is applied automatically once an administrator has configured the network topology. This ensures performance gains without requiring any additional user configuration.

Topology Labels in Kubernetes

Topology-aware scheduling relies on Kubernetes node labels that describe each node’s location in the topology. These labels are applied at the Kubernetes level and are managed outside of NVIDIA Run:ai, typically by tools such as Topograph or by the cloud provider.

Once the labels are in place, administrators only need to specify the label keys (as described below) that represent the topology the NVIDIA Run:ai Scheduler should reference when making placement decisions.
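For illustration, the snippet below is a minimal sketch of how such labels might appear on a node. The zone label is a well-known Kubernetes label; the block and rack keys are hypothetical placeholders, since the actual keys depend on how Topograph or your cloud provider labels nodes.

```yaml
# Illustrative node metadata: topology.kubernetes.io/zone is a well-known
# Kubernetes label, while the block and rack keys are hypothetical placeholders
# standing in for whatever keys Topograph or your cloud provider applies.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-17
  labels:
    topology.kubernetes.io/zone: us-east-1a
    example.com/network-block: block-2
    example.com/network-rack: rack-b
```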

Configuring Topology-Aware Scheduling

When creating or editing a cluster, administrators assign a network topology that represents the connectivity of nodes within the cluster. This topology is defined through Kubernetes node label keys. See Managing network topologies for more details.

These labels must match those configured on the cluster nodes and are used by the NVIDIA Run:ai Scheduler to guide workload placement decisions. The order of labels defines the hierarchy:

  • The first label represents the farthest point in the network (for example, a region).

  • The last label represents the closest switch or node (for example, a hostname or rack).

Example:
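A minimal sketch, reusing the hypothetical label keys from the node example above; the keys are listed in the order they would be entered, from the farthest network level to the closest:

```yaml
# Hypothetical ordering of label keys, from the farthest network level to the
# closest (the actual keys must match those applied to your nodes):
- topology.kubernetes.io/zone    # farthest level (zone)
- example.com/network-block      # intermediate level (block)
- example.com/network-rack       # closest level (rack)
```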

After creating a topology, the administrator must associate it with the relevant node pool(s).

  • If you are using the default node pool (that is, no additional node pools are defined), attach the topology to the default node pool.

  • If different node pools have different topologies, each node pool must be linked to its corresponding topology.

  • If the entire cluster shares the same topology, link the same topology to all node pools.

Submitting Distributed Workloads

When a distributed workload is submitted, the platform automatically applies topology-aware scheduling based on the network topology configured on the target node pool. This behavior ensures that distributed workloads benefit from improved performance without additional user configuration.

Topology-aware scheduling in NVIDIA Run:ai is applied at the workload level. This means the Scheduler considers the entire distributed workload as a single unit and places all of its pods according to the same topology constraints.

NVIDIA Run:ai automatically applies a Preferred topology constraint at the lowest defined topology level. This co-locates pods as close as possible in the network hierarchy, reducing communication overhead. If the Scheduler cannot place pods at this level, it automatically escalates placement by moving up through the topology hierarchy (for example: from node → rack → block → zone), always seeking the closest available level to minimize latency.

Submitting Distributed Workloads via YAML

When submitting distributed workloads via YAML, topology-aware scheduling is not applied automatically by NVIDIA Run:ai. Instead, you can configure the workload with annotations to ensure the Scheduler respects the desired topology.

  • Add annotations to the workload manifest by specifying the topology name and constraint type.

  • You can use either Required or Preferred constraints, or combine both for the same topology tree. When using both, note that network topologies are hierarchical (tree-structured). Applying a Preferred constraint at the same level as, or higher than, a Required constraint has no effect. A Preferred constraint is meaningful only when it is defined at a lower (more specific) topology level than the Required constraint. In this case, the topology-aware scheduling logic attempts to further group pods at that lower level, while still enforcing the mandatory Required constraint.

For example:
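The following is a minimal sketch of such a manifest, assuming a topology named gb200-topology and using placeholder annotation keys rather than the authoritative NVIDIA Run:ai key names (see the Network Topologies API reference for the exact annotations):

```yaml
# Illustrative only: the annotation keys below are placeholders, not the
# authoritative NVIDIA Run:ai names. The topology name and the label keys
# referenced must match the topology configured for the target node pool.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-train
  annotations:
    example.com/topology: gb200-topology                               # topology name
    example.com/topology-required-level: example.com/network-block     # Required: same block
    example.com/topology-preferred-level: example.com/network-rack     # Preferred: same rack
spec:
  # ... PyTorchJob spec (replica specs, containers, resources)
```

In this sketch the Preferred constraint sits one level below the Required constraint, so the Scheduler must keep all pods within a single block and additionally tries to group them within a single rack.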

Pod Affinity vs. Topology-Aware Scheduling

The following example demonstrates the difference between pod affinity and topology-aware scheduling when placing distributed workloads across two GB200 racks:

In the example, two workloads (Workload 1 requiring 12 nodes and Workload 2 requiring 15 nodes) are already running. A third workload that requires 6 nodes is submitted:

  • With pod affinity, the Scheduler places pods one by one, checking only for “closeness” to existing pods without considering the workload’s full node requirement against the available nodes. As a result, the workload is split across Rack A and Rack B, introducing unnecessary cross-rack communication overhead.

  • In contrast, topology-aware scheduling evaluates the full node requirement in advance and uses knowledge of the hierarchy (rack, block, NVLink domains) to allocate resources. This ensures all 6 nodes are placed together in Rack A, minimizing latency and maximizing bandwidth efficiency. By avoiding fragmentation and cross-rack placement, topology-aware scheduling improves workload performance and overall cluster utilization compared to pod affinity.

Using API

To view the available actions, go to the Network Topologies API reference.

Known Limitations

  • If a topology is detached from a node pool, workloads that are already running will continue using it, as long as the topology still exists.

  • If a topology is completely deleted while workloads are still using it:

    • Running workloads will continue unaffected.

    • Suspended workloads that are later resumed, or workloads not yet bound to a node, will become unschedulable and remain in Pending.

  • Submitting a workload to multiple node pools that each have different topologies is not supported. Workloads submitted through NVIDIA Run:ai will fail, while external workloads may either remain pending or run if the topology matches at least one node pool.
