Run Your First Standard Training
This quick start provides a step-by-step walkthrough for running a standard training workload.
A training workload contains the setup and configuration needed for building your model, including the container, images, data sets, and resource requests, as well as the required tools for the research, all in a single place.
Prerequisites
Before you start, make sure:
You have created a project or have one created for you.
The project has an assigned quota of at least 1 GPU.
Step 1: Logging In
Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.
Run the --help command below to see the login options, then log in according to your setup:
runai login --help
To use the API, you will need to obtain a token as shown in API authentication.
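When calling the API from a script, it can help to read the token from the environment rather than pasting it into each command. The following is a minimal sketch, assuming the token has already been obtained as described in API authentication. The variable name RUNAI_TOKEN and the function auth_header are hypothetical names chosen for illustration; the header format follows the Bearer scheme used by the API call in Step 2.

```shell
#!/bin/sh
# Hypothetical helper: print the Authorization header for API calls.
# RUNAI_TOKEN is an assumed variable name, not one defined by NVIDIA Run:ai.
auth_header() {
  if [ -z "${RUNAI_TOKEN:-}" ]; then
    # Fail early instead of sending a request with an empty token.
    echo "RUNAI_TOKEN is not set; obtain a token first" >&2
    return 1
  fi
  printf 'Authorization: Bearer %s\n' "$RUNAI_TOKEN"
}
```

The output can then be passed to curl, for example with -H "$(auth_header)".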
Step 2: Submitting a Standard Training Workload
Go to the Workload manager → Workloads
Click +NEW WORKLOAD and select Training
Select under which cluster to create the workload
Select the project in which your workload will run
Under Workload architecture, select Standard
Select Start from scratch to launch a new workload quickly
Enter a name for the standard training workload (if the name already exists in the project, you will be requested to submit a different name)
Click CONTINUE
In the next step:
Under Environment, enter the Image URL - runai.jfrog.io/demo/quickstart
Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘one-gpu’ compute resource for your workload.
If ‘one-gpu’ is not displayed, follow the steps below to create a one-time compute resource configuration:
Set GPU devices per pod - 1
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Click CREATE TRAINING
Go to the Workload manager → Workloads
Click +NEW WORKLOAD and select Training
Select under which cluster to create the workload
Select the project in which your workload will run
Under Workload architecture, select Standard
Select Start from scratch to launch a new workload quickly
Enter a name for the standard training workload (if the name already exists in the project, you will be requested to submit a different name)
Click CONTINUE
In the next step:
Create an environment for your workload
Click +NEW ENVIRONMENT
Enter quick-start as the name for the environment. The name must be unique.
Enter runai.jfrog.io/demo/quickstart as the Image URL
Click CREATE ENVIRONMENT
The newly created environment will be selected automatically
Select the ‘one-gpu’ compute resource for your workload
If ‘one-gpu’ is not displayed in the gallery, follow the steps below:
Click +NEW COMPUTE RESOURCE
Enter one-gpu as the name for the compute resource. The name must be unique.
Set GPU devices per pod - 1
Optional: set the CPU compute per pod - 0.1 cores (default)
Optional: set the CPU memory per pod - 100 MB (default)
Click CREATE COMPUTE RESOURCE
The newly created compute resource will be selected automatically
Click CREATE TRAINING
Copy the following commands to your terminal. Make sure to replace project-name and workload-name with the names of your project and workload. For more details, see CLI reference:
runai project set "project-name"
runai training submit "workload-name" -i runai.jfrog.io/demo/quickstart -g 1
Copy the following command to your terminal. Make sure to update the parameters below accordingly. For more details, see Trainings API:
curl -L 'https://<COMPANY-URL>/api/v1/workloads/training' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
"name": "workload-name",
"projectId": "<PROJECT-ID>",
"clusterId": "<CLUSTER-UUID>",
"spec": {
"image": "runai.jfrog.io/demo/quickstart",
"compute": {
"gpuDevicesRequest": 1
}
}
}'
<COMPANY-URL> - The link to the NVIDIA Run:ai user interface
<TOKEN> - The API access token obtained in Step 1
<PROJECT-ID> - The ID of the project the workload runs in. You can get the Project ID via the Get Projects API.
<CLUSTER-UUID> - The unique identifier of the cluster. You can get the Cluster UUID via the Get Clusters API.
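The request above can also be assembled from shell variables, so that each placeholder is filled in at one place. The following is a minimal sketch, not an official client. The function names build_payload and submit_training, the DRY_RUN switch, and the COMPANY_URL, TOKEN, IMAGE, and GPUS variables are hypothetical; the endpoint and JSON fields are copied from the curl example above. Values are inserted without JSON escaping, so they must not contain quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical helper: print the request body used in the curl example.
# Arguments: workload name, project ID, cluster UUID.
# Values are inserted verbatim (no JSON escaping is performed).
build_payload() {
  printf '{"name":"%s","projectId":"%s","clusterId":"%s","spec":{"image":"%s","compute":{"gpuDevicesRequest":%s}}}\n' \
    "$1" "$2" "$3" "${IMAGE:-runai.jfrog.io/demo/quickstart}" "${GPUS:-1}"
}

# Hypothetical wrapper around the curl call shown above.
# With DRY_RUN=1 (the default in this sketch) it only prints the body.
submit_training() {
  body=$(build_payload "$1" "$2" "$3")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$body"
  else
    curl -L "https://${COMPANY_URL}/api/v1/workloads/training" \
      -H 'Content-Type: application/json' \
      -H "Authorization: Bearer ${TOKEN}" \
      -d "$body"
  fi
}
```

Running submit_training with the workload name, project ID, and cluster UUID as arguments prints the body for inspection. Setting DRY_RUN=0 together with COMPANY_URL and TOKEN sends the request.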
Next Steps
Manage and monitor your newly created workload using the Workloads table.
After validating your training performance and results, deploy your model using inference.