The NVIDIA Run:ai Scheduler: Concepts and Principles
When a user submits a workload, it is directed to the designated Kubernetes cluster and managed by the NVIDIA Run:ai Scheduler. The Scheduler’s primary role is to optimize placement, allocating workloads to the most suitable nodes based on specific resource requirements, workload characteristics, and organizational policies. This ensures high utilization while strictly adhering to NVIDIA Run:ai’s fairness and quota management logic.
The Scheduler is platform-agnostic, supporting native Kubernetes workloads, NVIDIA Run:ai workloads, and various third-party frameworks. For a detailed breakdown of compatible types, see Introduction to workloads.
To better understand the logic driving these allocation decisions, familiarize yourself with the Scheduler's key concepts, resource management model, and scheduling principles.
Workloads, Pod Groups and Sub Groups
Workloads and Pod Groups
Workloads can range from a single pod running on one node to distributed workloads with many pods, each running on a node (or part of a node). A large-scale training workload could use dozens of nodes; similarly, an inference workload could use many pods (replicas) across multiple nodes.
Every newly created pod is assigned to a pod group that represents one or multiple pods belonging to a single workload. For example, a distributed PyTorch training workload with 32 workers is grouped into a single pod group.
The NVIDIA Run:ai Scheduler is a workload scheduler. This means the Scheduler always uses the pod group associated with each pod and handles the pod as part of a larger group of pods, while taking into consideration the common characteristics of the pod group, such as gang scheduling, minimum replicas or workers, workload priority class, workload preemptibility policy, topology information, and more. These characteristics are applied consistently across the entire pod group.
Multi-Level Sub Groups
Some workloads use a workload structure composed of multiple functions (functionality disaggregation). In this structure, each function can have one or more replicas, and each replica is made up of one or more pods, usually structured as leader and worker sets (distributed replicas). One example of such a workload structure is the NVIDIA Dynamo inference framework, which uses Grove as a Kubernetes API (CRD) and controller for the underlying pods. A single Dynamo workload is composed of three functions: an incoming router (gateway), a prefill, and a decoder. Each of these functions may have multiple replicas, where the prefill and decoder replicas are structured as leader-worker sets.
This workload structure requires framing sets of leader-worker replicas as gang-scheduled sub-groups, which may be further grouped into sets of scale replicas, forming a second level of sub-groups. Therefore, the NVIDIA Run:ai workload group structure, handled by the Scheduler, is a hierarchy consisting of a top-level pod group with subordinate sub-groups. In this hierarchical group structure, a sub-group always points to a higher-level sub-group and ultimately to the top-level pod group. By default, each pod group has at least one sub-group under which all workers or replicas are federated.
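To make the hierarchy concrete, the following Python sketch models a Dynamo-like workload as a top-level pod group with function sub-groups and a second level of replica sub-groups. It is illustrative only; the class and field names are hypothetical and do not correspond to the NVIDIA Run:ai API.

```python
from dataclasses import dataclass, field

@dataclass
class Group:
    """Illustrative model of a pod group or sub-group (hypothetical names)."""
    name: str
    min_members: int = 1           # minimum pods that must be scheduled (gang)
    parent: "Group | None" = None  # a sub-group always points to its parent
    children: list["Group"] = field(default_factory=list)

    def add_sub_group(self, name: str, min_members: int) -> "Group":
        sub = Group(name=name, min_members=min_members, parent=self)
        self.children.append(sub)
        return sub

# Top-level pod group for a Dynamo-like workload with three functions.
workload = Group(name="dynamo-workload")
router = workload.add_sub_group("router", min_members=1)
prefill = workload.add_sub_group("prefill", min_members=2)  # leader + worker
decode = workload.add_sub_group("decode", min_members=2)

# A second level of sub-groups: scale replicas of the prefill function.
for i in range(2):
    prefill.add_sub_group(f"prefill-replica-{i}", min_members=2)

print([c.name for c in workload.children])  # ['router', 'prefill', 'decode']
print(prefill.children[0].parent.name)      # 'prefill'
```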
Scheduling Queue
A scheduling queue (or simply a queue) represents a scheduler primitive that manages the scheduling of workloads based on different parameters.
A queue is created for each project/node pool pair and department/node pool pair. The NVIDIA Run:ai Scheduler supports hierarchical queueing: project queues are bound to department queues, per node pool. This allows an organization to manage quota, over quota, and other parameters per node pool for projects and their associated departments.
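As a rough illustration of this queue structure (not NVIDIA Run:ai code; the names and fields are hypothetical), queues can be thought of as keyed by project/node pool and department/node pool pairs, with each project queue bound to its department queue for the same node pool:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Queue:
    """Hypothetical sketch of a scheduling queue primitive."""
    owner: str                        # project or department name
    node_pool: str
    quota_gpus: float
    parent: Optional["Queue"] = None  # project queues point to a department queue

# One department queue and two project queues for node pool "NV-H100".
dept_q = Queue(owner="dept-research", node_pool="NV-H100", quota_gpus=16)
queues = {
    ("llm-train", "NV-H100"): Queue("llm-train", "NV-H100", quota_gpus=10, parent=dept_q),
    ("llm-eval", "NV-H100"): Queue("llm-eval", "NV-H100", quota_gpus=6, parent=dept_q),
}

# The Scheduler looks up the queue for a workload's project/node pool pair,
# then follows the parent pointer to apply department-level parameters too.
q = queues[("llm-train", "NV-H100")]
print(q.quota_gpus, q.parent.quota_gpus)  # 10 16
```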
Resource Management
Quota
Each project and department includes a set of deserved resource quotas, per node pool and resource type. For example, project “LLM-Train/Node Pool NV-H100” quota parameters specify the number of GPUs, CPUs (cores), and the amount of CPU memory that this project deserves to get when using this node pool. Non-preemptible workloads can only be scheduled if their requested resources are within the deserved resource quotas of their respective project/node pool and department/node pool.
Deserved Quota and Workload Scheduling
Deserved quota means that a project is entitled to use up to a maximum number of resources defined by its quota (for example, 10 GPUs). The Scheduler does not allow a project to exceed this limit within a given node pool for non-preemptible workloads. A workload scheduled within its deserved quota is assured to run once scheduled, particularly in the case of non-preemptible workloads.
Preemptible workloads, on the other hand, may be preempted by higher priority workloads using the same project/node pool resource quota.
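The admission rule for non-preemptible workloads can be sketched as follows. This is a simplified illustration under assumed inputs, not the Scheduler's actual implementation:

```python
def can_admit_non_preemptible(requested_gpus: float,
                              project_allocated: float, project_quota: float,
                              dept_allocated: float, dept_quota: float) -> bool:
    """Illustrative rule only: a non-preemptible workload fits only if the
    request stays within both the project and department deserved quotas
    for the node pool (preemptible workloads may go beyond, as over quota)."""
    fits_project = project_allocated + requested_gpus <= project_quota
    fits_department = dept_allocated + requested_gpus <= dept_quota
    return fits_project and fits_department

# Example: the project has 10 deserved GPUs with 8 already allocated; the
# department has 16 deserved GPUs with 12 allocated. A 2-GPU non-preemptible
# workload is admitted, but a 4-GPU one is not.
print(can_admit_non_preemptible(2, 8, 10, 12, 16))  # True
print(can_admit_non_preemptible(4, 8, 10, 12, 16))  # False
```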
When Deserved Quota Cannot Be Fully Satisfied
The Scheduler may not always be able to provide the full deserved quota for several reasons. For example, this may occur due to cluster fragmentation, additional workload constraints such as specific topology requirements, node failures, or situations where administrators assign projects or departments more quota than the physically available resources (that is, over-subscription).
Over Quota
Projects and departments can have a share in the unused resources of any node pool, beyond their quota of deserved resources. These resources are referred to as over quota resources. The administrator configures the over quota parameters per node pool for each project and department. Over quota resources can only be used by preemptible workloads.
Over Quota Weight
Projects can receive a share of the cluster/node pool unused resources when the over quota weight setting is enabled. The share each project receives depends on its over quota weight relative to the total over quota weights of all other projects. The same applies to departments. The administrator configures the over quota weight parameters per node pool for each project and department.
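The proportional split can be illustrated with a minimal sketch. It assumes hypothetical project names and considers only weights; the real Scheduler also accounts for quotas, limits, ranks, and actual demand:

```python
def over_quota_shares(unused_gpus: float, weights: dict[str, float]) -> dict[str, float]:
    """Illustrative only: split unused node pool resources between projects
    in proportion to their over quota weights."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return {name: 0.0 for name in weights}
    return {name: unused_gpus * w / total_weight for name, w in weights.items()}

# 12 unused GPUs split between three projects with weights 3, 2 and 1.
print(over_quota_shares(12, {"proj-a": 3, "proj-b": 2, "proj-c": 1}))
# {'proj-a': 6.0, 'proj-b': 4.0, 'proj-c': 2.0}
```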
Max GPU Device Allocation
Administrators can limit the maximum number of GPUs that a project can allocate per node pool (that is, deserved quota + over quota <= max GPU device allocation). The same constraint applies at the department level.
A department’s max GPU device allocation effectively limits the total number of GPUs that can be allocated by all projects under that department. This applies even if the combined max GPU device allocations of the individual projects exceed the department’s max GPU device allocation.
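A minimal sketch of this constraint, assuming hypothetical allocation figures (not the Scheduler's actual code):

```python
def within_max_gpu_allocation(requested: float,
                              project_allocated: float, project_max: float,
                              dept_allocated: float, dept_max: float) -> bool:
    """Illustrative check: total allocation (deserved quota + over quota) may
    not exceed the max GPU device allocation at either level. The department
    cap bounds all of its projects combined, even if the sum of the projects'
    own caps is larger."""
    return (project_allocated + requested <= project_max and
            dept_allocated + requested <= dept_max)

# Two projects are each capped at 8 GPUs, but their department is capped at 12.
# With 10 GPUs already allocated department-wide, a 4-GPU request is rejected
# even though the requesting project is still under its own cap.
print(within_max_gpu_allocation(4, project_allocated=3, project_max=8,
                                dept_allocated=10, dept_max=12))  # False
```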
Project and Department Rank
Note
NVIDIA Run:ai terminology for project priority and department priority has changed to project rank and department rank, across the UI and API.
Administrators can assign each project/node pool pair a rank. The rank determines the project’s relative precedence for resources compared to its siblings within the same department and node pool pair. Projects with a higher rank are allocated resources first, both within their quota and over quota, and may also reclaim (preempt) resources from lower-ranked projects within the same department/node pool pair.
The same principle applies to departments: a department’s rank is relative to its sibling departments using the same node pool. Because a project’s rank is local to its associated department/node pool pair, a high-ranked project in a low-ranked department can have lower overall resource precedence than a lower-ranked project in a higher-ranked department.
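The following sketch illustrates the precedence ordering only; the rank values and project names are hypothetical, and the real Scheduler combines rank with quotas, weights, and fairness:

```python
# Hypothetical projects with their department and rank values.
projects = [
    {"project": "proj-x", "department": "dept-a", "project_rank": 5, "department_rank": 1},
    {"project": "proj-y", "department": "dept-b", "project_rank": 1, "department_rank": 9},
]

# Illustrative ordering only: departments are compared first, then projects
# within a department, with higher ranks served first. proj-y wins overall
# even though its project rank is lower, because its department ranks higher.
ordered = sorted(projects,
                 key=lambda p: (p["department_rank"], p["project_rank"]),
                 reverse=True)
print([p["project"] for p in ordered])  # ['proj-y', 'proj-x']
```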
Multi-Level Quota System
Each project has a set of deserved resource quotas (GPUs, CPUs, and CPU memory) defined per node pool. Projects can exceed their deserved quota and receive a share of unused resources in the node pool beyond that quota. The same model applies at the department level.
The Scheduler first balances over-quota resources across departments and then, within each department, distributes those resources among its projects. A department’s deserved quota and over-quota limits constrain the total amount of resources that can be allocated by all projects within that department.
If a project still has available deserved quota but the department’s deserved quota is exhausted, the Scheduler does not allocate additional deserved resources to that project. The same rule applies to over-quota resources: over-quota capacity is first allocated to the department and only then divided among its projects.
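A minimal sketch of the two-level rule, with hypothetical numbers:

```python
def grant_deserved(request: float, project_free_quota: float,
                   dept_free_quota: float) -> float:
    """Illustrative two-level rule: a project can only consume deserved quota
    that is still free at both the project level and its department level."""
    return max(0.0, min(request, project_free_quota, dept_free_quota))

# The project still has 4 GPUs of deserved quota left, but its department has
# only 1 GPU of deserved quota remaining, so only 1 GPU is granted as quota.
print(grant_deserved(request=4, project_free_quota=4, dept_free_quota=1))  # 1
```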
Resource Limit Parameter
Each project and department has a Limit parameter per resource type. This parameter defines the maximum amount of that resource that a project or department can allocate, effectively placing an upper bound on the combined deserved quota + over-quota resources.
The NVIDIA Run:ai API exposes the Limit parameter for all resource types at both the project and department levels. In the UI, this parameter is exposed only for GPUs and is referred to as Max GPU Device Allocation, as GPUs are typically the most critical resource in AI clusters.
By default, all projects and departments have a Limit value of Unlimited, meaning they can theoretically allocate resources up to the cluster’s total capacity. In practice, however, resource allocation is constrained by additional factors, such as quotas, over-quota distribution, and competition between departments and projects for unused resources.
Over-Subscription
Over-subscription is a scenario where the sum of all deserved resource quotas surpasses the physical resources of the cluster or node pool. In this case, the Scheduler may be unable to find matching nodes for all workload requests, even for requests that are within the deserved resource quota of their associated projects.
Scheduling Principles
Fairness (Fair Resource Distribution)
Fairness is a major principle within the NVIDIA Run:ai scheduling system. It means that the NVIDIA Run:ai Scheduler always respects certain resource splitting rules (fairness) between projects and between departments.
Fairshare and Fairshare Balancing
To implement fairness, the NVIDIA Run:ai Scheduler calculates a numerical value called fairshare for each project or department, per node pool. This value represents the sum of the project’s or department’s deserved resources (quota) plus its share of unused resources in that node pool (over-quota resources).
The Scheduler then takes the minimum of:
the calculated fairshare, and
the total resources requested by the project’s or department’s workloads in that node pool
and uses this value as the effective fairshare for the project or department.
The Scheduler aims to provide each project or department with the resources they deserve per node pool using two main parameters: deserved quota and deserved fairshare (that is, quota plus over-quota resources). If one project’s node pool queue is below its fairshare while another project’s node pool queue exceeds its fairshare, the Scheduler shifts resources between queues to rebalance fairness. This process may result in the preemption of some over-quota, preemptible workloads.
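A simplified sketch of the fairshare calculation described above (illustrative values only; the real calculation is performed per resource type and node pool):

```python
def effective_fairshare(quota: float, over_quota_share: float,
                        requested: float) -> float:
    """Illustrative formula: fairshare = deserved quota + share of unused
    resources, capped by what the queue's workloads actually request."""
    return min(quota + over_quota_share, requested)

# Queue A deserves 10 GPUs plus a 4-GPU share of unused capacity but only
# requests 8, so its effective fairshare is 8. Queue B requests far more
# than its fairshare of 6, so it stays at 6.
fs_a = effective_fairshare(quota=10, over_quota_share=4, requested=8)   # 8
fs_b = effective_fairshare(quota=4, over_quota_share=2, requested=20)   # 6
print(fs_a, fs_b)

# If A's current allocation is below fs_a while B runs above fs_b, the
# Scheduler may preempt some of B's over-quota, preemptible workloads to
# rebalance toward the fairshare values.
```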
Time-Based Fairshare
Administrators can enable, per node pool, a fairshare mode called time-based fairshare.
When this mode is enabled, the Scheduler continuously collects historical resource usage data for each project and department, per node pool. It evaluates each project’s and department’s GPU-hour consumption relative to its configured weight and uses this information to balance resource distribution more effectively across projects and departments over time.
This capability applies only to over-quota resources. Quota resources always have the highest allocation precedence. Only after all quota allocations are satisfied does the Scheduler distribute excess (over-quota) resources among departments and projects, according to their ranks and weights.
This time-based fairshare calculation ensures that each project and department receives its fair share of resource processing time (GPUs, CPUs, and CPU memory) over extended periods. As a result, over-quota resource distribution becomes more stable and fair, reducing short-term imbalances.
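As a hedged illustration of the usage-versus-weight comparison (the metric and project names here are hypothetical, not the Scheduler's actual formula):

```python
def usage_to_weight_ratio(gpu_hours: dict[str, float],
                          weights: dict[str, float]) -> dict[str, float]:
    """Illustrative metric only: compare each project's historical GPU-hour
    consumption with its over quota weight. Projects with a high ratio have
    been over-served recently and get lower precedence for excess resources."""
    total_hours = sum(gpu_hours.values()) or 1.0
    total_weight = sum(weights.values()) or 1.0
    return {
        name: (gpu_hours[name] / total_hours) / (weights[name] / total_weight)
        for name in gpu_hours
    }

# proj-a consumed most of the recent GPU hours despite an equal weight,
# so proj-b would be favored when excess resources are handed out next.
print(usage_to_weight_ratio({"proj-a": 90.0, "proj-b": 10.0},
                            {"proj-a": 1.0, "proj-b": 1.0}))
# {'proj-a': 1.8, 'proj-b': 0.2}
```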
Priority and Preemption
NVIDIA Run:ai supports scheduling workloads using different priority and preemption policies:
A workload's priority and preemption policy are two distinct parameters. Priority sets the workload's scheduling precedence within a project, while the preemption policy sets whether the workload is preemptible or non-preemptible.
High priority workloads (pods) can preempt lower priority workloads (pods) within the same scheduling queue (project), according to their preemption policy (that is, if they are preemptible).
If no explicit preemption policy parameter is set, the NVIDIA Run:ai Scheduler implicitly treats any PriorityClass value >= 100 as non-preemptible and any PriorityClass value < 100 as preemptible (see the sketch after this list).
Cross project and cross department workload preemptions are referred to as resource reclaim and are based on fairness between queues rather than the priority of the workloads.
To make it easier for users to submit workloads, NVIDIA Run:ai preconfigures several Kubernetes PriorityClass objects. These preset PriorityClass objects always have their ‘preemptionPolicy’ set to ‘PreemptLowerPriority’, regardless of their actual preemption policy within the NVIDIA Run:ai platform.
A non-preemptible workload is only scheduled if in-quota and cannot be preempted after being scheduled, not even by a higher priority workload. To see the default priority and preemption policy of a workload and for details on how to change the priority and preemption, see Workload priority and preemption.
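The threshold rule above can be illustrated with the Kubernetes Python client. The PriorityClass name and value below are hypothetical examples, not NVIDIA Run:ai presets; under the implicit rule, a value below 100 would be treated as preemptible:

```python
from kubernetes import client

# Hypothetical PriorityClass; with value=75 (< 100) the NVIDIA Run:ai
# Scheduler's implicit rule treats workloads using it as preemptible.
# The Kubernetes-level preemptionPolicy is set to PreemptLowerPriority,
# as in the NVIDIA Run:ai preset PriorityClass objects.
priority_class = client.V1PriorityClass(
    api_version="scheduling.k8s.io/v1",
    kind="PriorityClass",
    metadata=client.V1ObjectMeta(name="my-team-train"),
    value=75,
    preemption_policy="PreemptLowerPriority",
    description="Example preemptible training priority (illustrative only)",
)

# To create it in a cluster (requires kubeconfig / in-cluster credentials):
# from kubernetes import config
# config.load_kube_config()
# client.SchedulingV1Api().create_priority_class(priority_class)
print(priority_class.metadata.name, priority_class.value)
```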
Preemption of Lower Priority Workloads Within a Project
Workload priority is always respected within a project. This means higher priority workloads are scheduled before lower priority workloads. It also means that higher priority workloads may preempt lower priority workloads within the same project if the lower priority workloads are preemptible.
Reclaim of Resources Between Projects and Departments
Reclaim is an inter-project (and inter-department) scheduling action that takes back resources from a project (or department) that has been using them as over quota. The resources are returned either to a project (or department) that deserves them as part of its deserved quota, or to balance fairness between projects, bringing each to its fairshare (that is, sharing the unused resources fairly).
Gang Scheduling
Gang scheduling describes a scheduling principle in which a workload composed of multiple pods is either fully scheduled (all pods are scheduled and running) or fully pending (none of the pods are running). Gang scheduling applies to a single pod group.
Multi-Level Gang Scheduling
The NVIDIA Run:ai Scheduler supports scheduling workloads that use multi-level pod-group structures, as described in Workloads, pod groups, and sub groups. To support these workloads, the Scheduler implements multi-level pod grouping.
The top level of the hierarchy is always a pod group, regardless of whether the workload is flat or hierarchical. Any grouping level below the top level is a sub-group. A workload may include multiple levels of sub-groups, as required, and each sub-group points to its immediate parent in the hierarchy.
A sub-group has parameters similar to those of a pod group. For example, it includes a min-members parameter, which indicates the minimum number of pods that must be scheduled for the sub-group to be considered ganged.
For a hierarchical workload to transition to the Running state, all sub-groups must be successfully ganged, including the top-level pod group.
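A minimal sketch of the multi-level gang condition, assuming hypothetical group and field names (for example, placed_pods is not an actual Scheduler field):

```python
from dataclasses import dataclass, field

@dataclass
class GangGroup:
    """Hypothetical model of a pod group / sub-group for gang checks."""
    name: str
    min_members: int        # minimum pods needed for this group to gang
    placed_pods: int = 0    # pods the Scheduler found placements for directly
    sub_groups: list["GangGroup"] = field(default_factory=list)

def total_placed(group: GangGroup) -> int:
    return group.placed_pods + sum(total_placed(sg) for sg in group.sub_groups)

def is_fully_ganged(group: GangGroup) -> bool:
    """Illustrative rule: the group meets its own min-members requirement and
    every sub-group beneath it is ganged as well."""
    return (total_placed(group) >= group.min_members
            and all(is_fully_ganged(sg) for sg in group.sub_groups))

# A pod group with two leader-worker sub-groups; each replica needs 2 pods.
replica_0 = GangGroup("replica-0", min_members=2, placed_pods=2)
replica_1 = GangGroup("replica-1", min_members=2, placed_pods=1)  # worker missing
workload = GangGroup("training-job", min_members=4,
                     sub_groups=[replica_0, replica_1])

# The workload cannot transition to Running: replica-1 is not ganged, so the
# top-level pod group is not fully ganged either (all-or-nothing behavior).
print(is_fully_ganged(workload))  # False
```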
Placement Strategy - Bin-Pack and Spread
The administrator can set the Scheduler's placement strategy, bin-pack or spread, per node pool. GPU-based workloads can request both GPU and CPU resources; CPU-only workloads can request CPU resources only. A scoring sketch illustrating both strategies appears after the lists below.
GPU workloads:
Bin-pack - The Scheduler places as many workloads as possible in each GPU and node to use fewer resources and maximize GPU and node vacancy.
Spread - The Scheduler spreads workloads across as many GPUs and nodes as possible to minimize the load and maximize the available resources per workload.
CPU workloads:
Bin-pack - The Scheduler places as many workloads as possible in each CPU and node to use fewer resources and maximize CPU and node vacancy.
Spread - The Scheduler spreads workloads across as many CPUs and nodes as possible to minimize the load and maximize the available resources per workload.
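The following sketch contrasts the two strategies with a toy GPU-only scoring rule. The node names and utilization-based scoring are assumptions for illustration; the real Scheduler's scoring also weighs CPU, memory, topology, and other constraints:

```python
from typing import Optional

def choose_node(nodes: dict[str, dict[str, float]],
                requested_gpus: float, strategy: str) -> Optional[str]:
    """Illustrative node selection only. 'bin-pack' favors the node that is
    already most utilized (keeping other nodes vacant), while 'spread' favors
    the least utilized node (balancing the load)."""
    candidates = {name: n for name, n in nodes.items()
                  if n["total_gpus"] - n["used_gpus"] >= requested_gpus}
    if not candidates:
        return None

    def utilization(name: str) -> float:
        node = candidates[name]
        return node["used_gpus"] / node["total_gpus"]

    if strategy == "bin-pack":
        return max(candidates, key=utilization)
    return min(candidates, key=utilization)  # spread

nodes = {
    "node-1": {"total_gpus": 8, "used_gpus": 6},
    "node-2": {"total_gpus": 8, "used_gpus": 1},
}
print(choose_node(nodes, 2, "bin-pack"))  # node-1: fill the busier node first
print(choose_node(nodes, 2, "spread"))    # node-2: place on the freer node
```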
Next Steps
Now that you have learned the key concepts and principles of the NVIDIA Run:ai Scheduler, see how the Scheduler works: allocating pods to nodes, applying preemption mechanisms, and managing resources.