Over Quota, Fairness and Preemption
This quick start provides a step-by-step walkthrough of the core scheduling concepts - over quota, fairness, and preemption. It demonstrates how resource provisioning works and how the system eliminates bottlenecks by allowing users or teams to exceed their resource quota when free GPUs are available.
Over quota - In this scenario, team-a runs two training workloads and team-b runs one. Each team has a quota of 2 GPUs; team-a allocates 3 GPUs and is therefore over quota by 1 GPU, while team-b allocates 1 GPU. The system allows this over quota usage as long as there are available GPUs in the cluster.
Fairness and preemption - Since the cluster is already at full capacity, when team-b launches a new b2 workload requiring 1 GPU, team-a can no longer remain over quota. To maintain fairness, the NVIDIA Run:ai Scheduler preempts workload a1 (1 GPU), freeing up resources for team-b.
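Putting numbers to the scenario (using the 4-GPU cluster described in the prerequisites below):
Cluster capacity: 4 GPUs (2 machines x 2 GPUs)
team-a quota: 2 GPUs -> runs a1 (1 GPU) + a2 (2 GPUs) = 3 GPUs (over quota by 1)
team-b quota: 2 GPUs -> runs b1 (1 GPU) = 1 GPU
Cluster fully allocated (4/4) -> when b2 (1 GPU) is submitted, a1 is preempted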
Prerequisites
You have created two projects - team-a and team-b - or they have been created for you.
Each project has an assigned quota of 2 GPUs. In this example, the cluster has 4 GPUs across 2 machines with 2 GPUs each.
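To confirm the quotas before you start, you can list the projects from the CLI (a sketch assuming the v2 CLI's project subcommand; older CLIs use runai list projects, and column names vary by version):
runai project list
# Expect team-a and team-b to each show an assigned quota of 2 GPUs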
Step 1: Logging In
Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.
Run the following --help command to see the available login options, then log in according to your setup:
runai login --help
To use the API, you will need to obtain a token as shown in API authentication.
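If you plan to use the API, a token request typically looks like the following sketch. The endpoint and grant fields here are assumptions based on one version of the API authentication guide, and <APP-ID>/<APP-SECRET> are placeholders for an application you created:
curl --location 'https://<COMPANY-URL>/api/v1/token' \
--header 'Content-Type: application/json' \
--data '{
  "grantType": "app_token",
  "AppId": "<APP-ID>",
  "AppSecret": "<APP-SECRET>"
}'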
Step 2: Submitting the First Training Workload (team-a)
Go to the Workload Manager → Workloads
Click +NEW WORKLOAD and select Training
Select the cluster in which to create the workload
Select the project named team-a
Under Workload architecture, select Standard
Select Start from scratch to launch a new training quickly
Enter a1 as the workload name
Click CONTINUE
In the next step:
Under Environment, enter the Image URL -
runai.jfrog.io/demo/quickstart
Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘one-gpu’ compute resource for your workload.
If ‘one-gpu’ is not displayed, follow the steps below to create a one-time compute resource configuration:
Set GPU devices per pod - 1
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Click CREATE TRAINING
Alternatively, submit the same workload using reusable environment and compute resource assets:
Go to the Workload Manager → Workloads
Click +NEW WORKLOAD and select Training
Select the cluster in which to create the workload
Select the project named team-a
Under Workload architecture, select Standard
Select Start from scratch to launch a new training quickly
Enter a1 as the workload name
Click CONTINUE. In the next step:
Create a new environment:
Click +NEW ENVIRONMENT
Enter quick-start as the name for the environment. The name must be unique.
Enter the Image URL -
runai.jfrog.io/demo/quickstart
Click CREATE ENVIRONMENT
The newly created environment will be selected automatically
Select the ‘one-gpu’ compute resource for your workload
If ‘one-gpu’ is not displayed in the gallery, follow the steps below:
Click +NEW COMPUTE RESOURCE
Enter one-gpu as the name for the compute resource. The name must be unique.
Set GPU devices per pod - 1
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Click CREATE COMPUTE RESOURCE
The newly created compute resource will be selected automatically
Click CREATE TRAINING
Using the CLI, copy the following command to your terminal. For more details, see CLI reference:
runai training submit a1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-a
Using the API, copy the following command to your terminal. Make sure to update the following parameters. For more details, see Trainings API.
curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \
--data '{
"name": "a1",
"projectId": "<PROJECT-ID>",
"clusterId": "<CLUSTER-UUID>",
"spec": {
"image":"runai.jfrog.io/demo/quickstart",
"compute": {
"gpuDevicesRequest": 1
}
}
}'
<COMPANY-URL> - The link to the NVIDIA Run:ai user interface
<TOKEN> - The API access token obtained in Step 1
<PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
<CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.
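If you don't have these identifiers handy, you can fetch them from the API first. The endpoint paths below are assumptions that vary by NVIDIA Run:ai version, so confirm them against the Get Projects and Get Clusters API references:
# List clusters to find <CLUSTER-UUID> (assumed endpoint path)
curl -s --location 'https://<COMPANY-URL>/api/v1/clusters' \
--header 'Authorization: Bearer <TOKEN>'
# List projects to find <PROJECT-ID> (assumed endpoint path)
curl -s --location 'https://<COMPANY-URL>/api/v1/org-unit/projects' \
--header 'Authorization: Bearer <TOKEN>'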
Step 3: Submitting the Second Training Workload (team-a)
Go to the Workload Manager → Workloads
Click +NEW WORKLOAD and select Training
Select the cluster where the previous training workload was created
Select the project named team-a
Under Workload architecture, select Standard
Select Start from scratch to launch a new training quickly
Enter a2 as the workload name
Click CONTINUE. In the next step:
Under Environment, enter the Image URL -
runai.jfrog.io/demo/quickstart
Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘two-gpus’ compute resource for your workload.
If ‘two-gpus’ is not displayed, follow the steps below to create a one-time compute resource configuration:
Set GPU devices per pod - 2
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Click CREATE TRAINING
Alternatively, submit the same workload using reusable environment and compute resource assets:
Go to the Workload Manager → Workloads
Click +NEW WORKLOAD and select Training
Select the cluster where the previous training workload was created
Select the project named team-a
Under Workload architecture, select Standard
Select Start from scratch to launch a new training quickly
Enter a2 as the workload name
Click CONTINUE. In the next step:
Select the environment created in Step 2
Select the ‘two-gpus’ compute resource for your workload
If ‘two-gpus’ is not displayed in the gallery, follow the steps below:
Click +NEW COMPUTE RESOURCE
Enter two-gpus as the name for the compute resource. The name must be unique.
Set GPU devices per pod - 2
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Click CREATE COMPUTE RESOURCE
The newly created compute resource will be selected automatically
Click CREATE TRAINING
Using the CLI, copy the following command to your terminal. For more details, see CLI reference:
runai training submit a2 -i runai.jfrog.io/demo/quickstart -g 2 -p team-a
Using the API, copy the following command to your terminal. Make sure to update the following parameters. For more details, see Trainings API.
curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \
--data '{
"name": "a2",
"projectId": "<PROJECT-ID>",
"clusterId": "<CLUSTER-UUID>",
"spec": {
"image":"runai.jfrog.io/demo/quickstart",
"compute": {
"gpuDevicesRequest": 2
}
}
}'
<COMPANY-URL> - The link to the NVIDIA Run:ai user interface
<TOKEN> - The API access token obtained in Step 1
<PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
<CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.
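To wait until a2 is actually scheduled before moving on, you can poll the workloads endpoint. This sketch assumes the response nests results under a workloads key and that each entry carries name and phase fields:
# Poll every 5 seconds until a2 reports the Running phase
until curl -s --location 'https://<COMPANY-URL>/api/v1/workloads' \
  --header 'Authorization: Bearer <TOKEN>' \
  | jq -e '.workloads[] | select(.name == "a2" and .phase == "Running")' > /dev/null; do
  sleep 5
done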
Step 4: Submitting the First Training Workload (team-b)
Go to the Workload Manager → Workloads
Click +NEW WORKLOAD and select Training
Select the cluster where the previous training was created
Select the project named team-b
Under Workload architecture, select Standard
Select Start from scratch to launch a new training quickly
Enter b1 as the workload name
Click CONTINUE
In the next step:
Under Environment, enter the Image URL -
runai.jfrog.io/demo/quickstart
Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘one-gpu’ compute resource for your workload.
If ‘one-gpu’ is not displayed, follow the steps below to create a one-time compute resource configuration:
Set GPU devices per pod - 1
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Click CREATE TRAINING
Alternatively, submit the same workload using reusable environment and compute resource assets:
Go to the Workload Manager → Workloads
Click +NEW WORKLOAD and select Training
Select the cluster where the previous training was created
Select the project named team-b
Under Workload architecture, select Standard
Select Start from scratch to launch a new training quickly
Enter b1 as the workload name
Click CONTINUE. In the next step:
Create a new environment:
Click +NEW ENVIRONMENT
Enter quick-start as the name for the environment. The name must be unique.
Enter the Image URL -
runai.jfrog.io/demo/quickstart
Click CREATE ENVIRONMENT
The newly created environment will be selected automatically
Select the ‘one-gpu’ compute resource for your workload
If ‘one-gpu’ is not displayed in the gallery, follow the steps below:
Click +NEW COMPUTE RESOURCE
Enter one-gpu as the name for the compute resource. The name must be unique.
Set GPU devices per pod - 1
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Click CREATE COMPUTE RESOURCE
The newly created compute resource will be selected automatically
Click CREATE TRAINING
Using the CLI, copy the following command to your terminal. For more details, see CLI reference:
runai training submit b1 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
Using the API, copy the following command to your terminal. Make sure to update the following parameters. For more details, see Trainings API.
curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \
--data '{
"name": "b1",
"projectId": "<PROJECT-ID>",
"clusterId": "<CLUSTER-UUID>",
"spec": {
"image":"runai.jfrog.io/demo/quickstart",
"compute": {
"gpuDevicesRequest": 1
}
}
}'
<COMPANY-URL> - The link to the NVIDIA Run:ai user interface
<TOKEN> - The API access token obtained in Step 1
<PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
<CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.
Over Quota Status
In the UI Workloads table, a1, a2, and b1 are all running, with team-a over quota by 1 GPU.
From the CLI, the system status after the run:
~ runai workload list -A
Workload  Type      Status   Project  Running/Req.Pods  GPU Alloc.
──────────────────────────────────────────────────────────────────
a2        Training  Running  team-a   1/1               2.00
b1        Training  Running  team-b   1/1               1.00
a1        Training  Running  team-a   1/1               1.00
From the API, the system status after the run:
# <TOKEN> is the API access token obtained in Step 1.
curl --location 'https://<COMPANY-URL>/api/v1/workloads' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>'
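To reduce that response to something comparable with the CLI table above, you can filter it with jq, under the same workloads/name/phase field assumptions noted in Step 3:
curl -s --location 'https://<COMPANY-URL>/api/v1/workloads' \
--header 'Authorization: Bearer <TOKEN>' \
| jq -r '.workloads[] | [.name, .phase] | @tsv'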
Step 5: Submitting the Second Training Workload (team-b)
Go to the Workload Manager → Workloads
Click +NEW WORKLOAD and select Training
Select the cluster where the previous training was created
Select the project named team-b
Under Workload architecture, select Standard
Select Start from scratch to launch a new training quickly
Enter b2 as the workload name
Click CONTINUE
In the next step:
Under Environment, enter the Image URL -
runai.jfrog.io/demo/quickstart
Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘one-gpu’ compute resource for your workload.
If ‘one-gpu’ is not displayed, follow the steps below to create a one-time compute resource configuration:
Set GPU devices per pod - 1
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Click CREATE TRAINING
Alternatively, submit the same workload using reusable environment and compute resource assets:
Go to the Workload Manager → Workloads
Click +NEW WORKLOAD and select Training
Select the cluster where the previous training was created
Select the project named team-b
Under Workload architecture, select Standard
Select Start from scratch to launch a new training quickly
Enter b2 as the workload name
Click CONTINUE. In the next step:
Select the environment created in Step 4
Select the compute resource created in Step 4
Click CREATE TRAINING
Using the CLI, copy the following command to your terminal. For more details, see CLI reference:
runai training submit b2 -i runai.jfrog.io/demo/quickstart -g 1 -p team-b
Using the API, copy the following command to your terminal. Make sure to update the following parameters. For more details, see Trainings API.
curl --location 'https://<COMPANY-URL>/api/v1/workloads/trainings' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>' \
--data '{
"name": "b2",
"projectId": "<PROJECT-ID>",
"clusterId": "<CLUSTER-UUID>",
"spec": {
"image":"runai.jfrog.io/demo/quickstart",
"compute": {
"gpuDevicesRequest": 1
}
}
}'
<COMPANY-URL> - The link to the NVIDIA Run:ai user interface
<TOKEN> - The API access token obtained in Step 1
<PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
<CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.
Basic Fairness and Preemption Status
In the UI Workloads table, b2 is running while a1 has been preempted and is now pending.
From the CLI, the workloads status after the run:
~ runai workload list -A
Workload  Type      Status   Project  Running/Req.Pods  GPU Alloc.
──────────────────────────────────────────────────────────────────
a2        Training  Running  team-a   1/1               2.00
b1        Training  Running  team-b   1/1               1.00
b2        Training  Running  team-b   1/1               1.00
a1        Training  Pending  team-a   0/1               1.00
From the API, the workloads status after the run:
# <TOKEN> is the API access token obtained in Step 1.
curl --location 'https://<COMPANY-URL>/api/v1/workloads' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer <TOKEN>'
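To see why a1 moved to Pending, you can inspect it from the CLI. This is a sketch assuming the v2 CLI's describe subcommand for trainings; the exact form differs across CLI versions:
runai training describe a1 -p team-a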
Next Steps
Manage and monitor your newly created workloads using the Workloads table.
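To follow these scheduling changes from the terminal, you can also poll the same listing used throughout this quick start (assumes the standard watch utility is installed):
watch -n 5 runai workload list -A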