Using GB200 NVL72 and Multi-Node NVLink Domains

Multi-Node NVLink (MNNVL) systems, including NVIDIA GB200, NVIDIA GB200 NVL72, and its derivatives, are fully supported by the NVIDIA Run:ai platform.

Kubernetes does not natively recognize NVIDIA’s MNNVL architecture, which makes managing and scheduling workloads across these high-performance domains more complex. The NVIDIA Run:ai platform simplifies this by abstracting the complexity of MNNVL configuration. Without this abstraction, optimal performance on a GB200 NVL72 system would require deep knowledge of NVLink domains, their hardware dependencies, and manual configuration for each distributed workload. NVIDIA Run:ai automates these steps, ensuring high performance with minimal effort. While GB200 NVL72 supports all workload types, distributed training workloads benefit most from its accelerated GPU networking capabilities.

To learn more about GB200, MNNVL, and related NVIDIA technologies, refer to NVIDIA's official documentation.

Benefits of Using GB200 NVL72 with NVIDIA Run:ai

The NVIDIA Run:ai platform enables administrators, researchers, and MLOps engineers to fully leverage GB200 NVL72 systems and other NVLink-based domains without requiring deep knowledge of hardware configurations or NVLink topologies. Key capabilities include:

  • Automatic detection and labeling

    • Detects GB200 NVL72 nodes and identifies MNNVL domains (e.g., GB200 NVL72 racks).

    • Automatically detects whether a node pool contains GB200 NVL72.

    • Supports manual override of GB200 MNNVL detection and label key for future compatibility and improved resiliency.

  • Simplified distributed workload submission

    • Allows seamless submission of distributed workloads to GB200-based node pools, removing the manual configuration these workloads would otherwise require on GB200 MNNVL domains.

    • Abstracts away the complexity of configuring workloads for NVL domains.

  • Flexible support for NVLink domain variants

    • Compatible with current and future NVL domain configurations.

    • Supports any number of domains or GB200 racks.

  • Enhanced monitoring and visibility

    • Provides detailed NVIDIA Run:ai dashboards for monitoring GB200 nodes and MNNVL domains by node pool.

  • Control and customization

    • Offers manual override and label configuration for greater resiliency and future-proofing.

    • Enables advanced users to fine-tune GB200 scheduling behavior based on workload requirements.

Prerequisites

  • Ensure that NVIDIA's GPU Operator version 25.3 or higher is installed; see the GPU Operator v25.3 Release Notes. This version must include the associated Dynamic Resource Allocation (DRA) driver, which provides support for GB200 accelerated networking resources and the ComputeDomain feature. For detailed steps on installing the DRA driver and configuring ComputeDomain, refer to the documentation for your installed GPU Operator version.

  • After the DRA driver is installed, update runaiconfig using the spec.workload-controller.GPUNetworkAccelerationEnabled=True flag to enable GPU network acceleration. This triggers an update of the NVIDIA Run:ai workload-controller deployment and restarts the controller. See Advanced cluster configurations for more details.
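For reference, enabling the flag results in a configuration conceptually similar to the excerpt below. This is a minimal sketch based on the flag path documented above; the surrounding fields, resource name, and exact layout of runaiconfig depend on your NVIDIA Run:ai installation and version.

    # Illustrative excerpt of the runaiconfig spec with GPU network acceleration
    # enabled; field placement may differ between versions.
    spec:
      workload-controller:
        GPUNetworkAccelerationEnabled: true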

Configuring and Managing GB200 NVL72 Domains

Administrators must define dedicated node pools that align with GB200 NVL72 rack topologies. These node pools ensure that workloads are isolated to nodes with NVLink interconnects and are not scheduled on incompatible hardware. Each node pool can be manually configured in the NVIDIA Run:ai platform and associated with specific GPU labels. Two key configurations are required for each node pool:

  • GPU Labels – Identify nodes equipped with GB200.

  • MNNVL Domain Discovery – Specify how the platform detects whether the node pool includes NVLink-connected nodes.

To create a node pool with GPU network acceleration, see Node pools.

Identifying GB200 Nodes

To enable the NVIDIA Run:ai Scheduler to recognize GB200-based nodes, administrators must:

  • Use the default GPU label provided by the NVIDIA GPU Operator - nvidia.com/gpu.clique.

  • Or, apply a custom label that clearly marks the node as GB200-capable.

This GPU label serves as the basis for identifying appropriate nodes and ensuring workloads are scheduled on the correct hardware.
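For illustration, a GB200 node recognized through the default label would carry metadata similar to the following sketch. The node name is a placeholder, and the label value follows the <NVL Domain ID (ClusterUUID)>.<Clique ID> format described in the next section.

    # Illustrative node excerpt; the node name and label value are placeholders.
    # On real GB200 nodes the clique label is set by the NVIDIA GPU Operator.
    apiVersion: v1
    kind: Node
    metadata:
      name: gb200-node-01
      labels:
        nvidia.com/gpu.clique: "<ClusterUUID>.<CliqueID>"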

Enabling MNNVL Domain Discovery

The administrator can configure how the NVIDIA Run:ai platform detects MNNVL domains for each node pool. The available options include:

  • Automatic Discovery – Uses the default label key nvidia.com/gpu.clique, or a custom label key specified by the administrator. The NVIDIA Run:ai platform automatically discovers MNNVL domains within node pools: if a node is labeled with the MNNVL label key, the platform marks the node pool as MNNVL detected. MNNVL-detected node pools are treated differently by the NVIDIA Run:ai platform when a distributed training workload is submitted.

  • Manual Discovery – The platform does not evaluate any node labels. Detection is based solely on the administrator’s configuration of the node pool as MNNVL “Detected” or “Not Detected.”

When automatic discovery is enabled, all GB200 nodes that are part of the same physical rack (NVL72 or other future topologies) belong to the same NVL domain and are automatically labeled by the GPU Operator with a common label whose value is unique per domain and sub-domain. The default label key set by the NVIDIA GPU Operator is nvidia.com/gpu.clique, and its value takes the form <NVL Domain ID (ClusterUUID)>.<Clique ID>:

  • The NVL Domain ID (ClusterUUID) is a unique identifier that represents the physical NVL domain, for example, a physical GB200 NVL72 rack.

  • The Clique ID denotes a logical MNNVL sub-domain. A clique represents a further logical split of the MNNVL into smaller domains that enable secure, fast, and isolated communication between pods running on different GB200 nodes within the same GB200 NVL72.

The Nodes table provides more information on which GB200 NVL72 domain each node belongs to, and which Clique ID it is associated with.
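As a sketch of how the label value decomposes, nodes in the same rack and clique share the full label value, while nodes in a different rack carry a different ClusterUUID. All node names and values below are placeholders, not real GPU Operator output.

    # Placeholder mapping of nodes to clique labels; real ClusterUUIDs and
    # Clique IDs are assigned by the GPU Operator per rack and sub-domain.
    gb200-rack-a-node-01:
      nvidia.com/gpu.clique: "<ClusterUUID-rack-A>.<CliqueID-1>"
    gb200-rack-a-node-02:
      nvidia.com/gpu.clique: "<ClusterUUID-rack-A>.<CliqueID-1>"   # same domain and clique
    gb200-rack-b-node-01:
      nvidia.com/gpu.clique: "<ClusterUUID-rack-B>.<CliqueID-1>"   # different rack, different domain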

Submitting Distributed Training Workloads

When a distributed training workload is submitted to an MNNVL-detected node pool, the NVIDIA Run:ai platform automates several key configuration steps to ensure optimal workload execution:

  • ComputeDomain creation - The NVIDIA Run:ai platform creates a ComputeDomain custom resource, provided by the DRA driver, which is used to manage NVLink-based domain assignments.

  • Resource Claim injection - A reference to the ComputeDomain is automatically added to the workload specification as a resource claim, allowing the Scheduler to link the workload to a specific NVLink domain.

  • Pod affinity configuration - Pod affinity is applied using a Preferred policy with the MNNVL label key (e.g., nvidia.com/gpu.clique) as the topology key, steering pods of the distributed workload onto nodes that share NVLink interconnects.

  • Node affinity configuration - Node affinity is also applied using a Preferred policy based on the same label key, further guiding the Scheduler to place workloads within the correct node group.

These additional steps are crucial for creating the underlying hardware resources (known as IMEX channels) and for keeping the distributed workload bound to MNNVL topologies and nodes. When a distributed workload is stopped or evicted, the platform automatically removes the corresponding ComputeDomain.
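To make these steps concrete, the injected configuration is conceptually similar to the pod-spec fragment sketched below. This is not the literal manifest the platform generates: the label selector, claim names, and exact resource-claim fields are illustrative assumptions, and the resource-claim syntax varies with the Kubernetes version.

    # Conceptual sketch of the configuration injected for an MNNVL-detected node pool.
    spec:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1                              # default embedded weight
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-distributed-job          # hypothetical selector
                topologyKey: nvidia.com/gpu.clique   # MNNVL label key
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: nvidia.com/gpu.clique       # prefer NVLink-labeled nodes
                    operator: Exists
      resourceClaims:
        - name: compute-domain                                  # hypothetical claim name
          resourceClaimTemplateName: my-compute-domain-channel  # created from the ComputeDomain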

Best Practices for MNNVL Node Pool Management

  • When submitting a distributed workload, you should explicitly specify a list of one or more MNNVL detected node pools, or a list of one or more non-MNNVL detected node pools. A mix of MNNVL detected and non-MNNVL detected node pools is not supported. A GB200 MNNVL node pool is a pool that contains at least one node belonging to an MNNVL domain.

  • Other workload types (not distributed) can include a list of mixed MNNVL and non-MNNVL node pools, from which the Scheduler will choose.

  • MNNVL node pools can include any size of MNNVL domains (i.e. NVL72 and any future domain size) and support any Grace-Blackwell models (GB200 and any future models).

  • When submitting distributed training workloads in which the controller pod runs as a distinct non-GPU workload, use the MNNVL feature with the default Preferred mode, as explained in the section below.

Fine-tuning Scheduling Behavior for MNNVL

You can influence how the Scheduler places distributed training workloads into GB200 MNNVL node pools using the Topology field available in the distributed training workload submission form.

Note

The following options are based on inter-pod affinity rules, which define how pods are grouped based on topology.

  • Confine a workload to a single GB200 MNNVL domain - To ensure the workload is scheduled within a single GB200 MNNVL domain (e.g., a GB200 NVL72 rack), apply a topology label with a Required policy using the MNNVL label key (nvidia.com/gpu.clique), as sketched after this list. This instructs the Scheduler to strictly place all pods within the same MNNVL domain. If the workload exceeds 18 pods (or 72 GPUs), the Scheduler will not be able to find a matching domain and will fail to schedule the workload.

  • Try to schedule a workload using a Preferred topology - To guide the Scheduler to prioritize a specific topology without enforcing it, apply a topology label with a policy of Preferred. You can apply any topology label with a Preferred policy. These labels are treated with higher scheduling weight than the default Preferred pod affinity automatically applied by NVIDIA Run:ai for MNNVL.

  • Use any custom topology - To schedule a workload based on a custom topology, add a topology label with a policy of Required. This ensures the workload is strictly scheduled according to the specified topology. Keep in mind that using a Required policy can significantly constrain scheduling. If matching resources are not available, the Scheduler may fail to place the workload.
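As an illustration of the first option, confining a workload to a single MNNVL domain corresponds conceptually to a required inter-pod affinity rule like the one below. The label selector is a hypothetical example, and the actual rule is generated from the Topology field in the submission form rather than written by hand.

    # Conceptual equivalent of a Required topology setting with the MNNVL label key.
    affinity:
      podAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: my-distributed-job          # hypothetical selector for the workload's pods
            topologyKey: nvidia.com/gpu.clique   # all pods must land in the same clique/domain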

Fine-tuning MNNVL per Workload

You can customize how the NVIDIA Run:ai platform applies the MNNVL feature to each distributed training workload, overriding the default behavior when needed. To configure this behavior, set the proprietary label key run.ai/MNNVL in the General settings section of the distributed training workload submission form. The following values are supported (see the sketch after this list):

  • None - Disables the MNNVL feature for the workload. The platform does not create a ComputeDomain and no pod affinity or node affinity is applied by default.

  • Preferred (default) - Indicates that the MNNVL feature is preferred but not required. This is the default behavior when submitting a distributed training workload:

    • If the workload is submitted to a 'non-MNNVL detected' node pool, then the NVIDIA Run:ai platform does not add a ComputeDomain, ComputeDomain claim, pod affinity or node affinity for MNNVL nodes.

    • Otherwise, if the workload is submitted to an 'MNNVL detected' node pool, the NVIDIA Run:ai platform automatically adds a ComputeDomain, a ComputeDomain claim, and node affinity and pod affinity rules, both with a Preferred policy and using the MNNVL label.

    • If you manually add an additional Preferred topology label, it will be given higher scheduling weight than the default embedded pod affinity (which has weight = 1).

  • Required - Enforces a strict use of MNNVL domains for the workload. The workload must be scheduled on MNNVL supported nodes:

    • The NVIDIA Run:ai platform creates a ComputeDomain and ComputeDomain claim.

    • The NVIDIA Run:ai platform will automatically add a node affinity rule with a Required policy using the appropriate label.

    • Pod affinity is set to Preferred by default, but you can override it manually with a Required pod affinity rule using the MNNVL label key or another custom label.

    • If any of the targeted node pools do not support MNNVL or if the workload (or any of its pods) does not request GPU resources, the workload will fail to run.
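For example, pinning the behavior for a single workload amounts to setting this label on the workload, conceptually as sketched below. The workload name is a placeholder, and in practice the label is entered in the General settings section of the submission form.

    # Illustrative per-workload MNNVL label; choose one of the supported values.
    metadata:
      name: my-distributed-training    # placeholder workload name
      labels:
        run.ai/MNNVL: "Required"       # or "Preferred" (default) or "None"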

Known Limitations and Compatibility

  • If the DRA driver is not installed correctly in the cluster, particularly if the required CRDs are missing, and the MNNVL feature is enabled in the NVIDIA Run:ai platform, the workload controller will enter a crash loop. This will continue until the DRA driver is properly installed with all necessary CRDs or the MNNVL feature is disabled in the NVIDIA Run:ai platform.

  • To run workloads on a GB200 node pool (i.e., a node pool detected as MNNVL-enabled), the workload must explicitly request that node pool. To prevent unintentional use of MNNVL-detected node pools, administrators must ensure these node pools are not included in any project's default list of node pools.

  • Only one distributed training workload per node can use GB200 accelerated networking resources. If GPUs remain unused on that node, other workload types may still utilize them.

  • If a GB200 node fails, any associated pod will be re-scheduled, causing the entire distributed workload to fail and restart. On non-GB200 nodes, this scenario may be self-healed by the Scheduler without impacting the entire workload.

  • If a pod from a distributed training workload fails or is evicted by the Scheduler, it must be re-scheduled on the same node. Otherwise, the entire workload will be evicted and, in some cases, re-queued.

  • Elastic distributed training workloads are not supported with MNNVL.

  • Workloads created in versions earlier than 2.21 do not include GB200 MNNVL node pools and are therefore not expected to experience compatibility issues.

  • If a node pool that was previously used in a workload submission is later updated to include GB200 nodes (i.e., becomes a mixed node pool), a workload submitted before version 2.21 will not use any accelerated networking resources, although it may still run on GB200 nodes.
