Run Your First Standard Training

This quick start provides a step-by-step walkthrough for running a standard training workload.

A training workload bundles everything needed to build your model, including the container image, data sets, and resource requests, as well as the tools required for the research, in a single place.

Prerequisites

Before you start, make sure:

  • You have created a project or have one created for you.

  • The project has an assigned quota of at least 1 GPU.
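If you have the Run:ai CLI installed, you can sketch a quick check of these prerequisites from a terminal. This is an illustrative sketch only: the exact command names and output columns depend on your CLI version, so verify them against your installation.

```shell
# Hypothetical prerequisite check with the Run:ai CLI
# (command names and output format may vary by CLI version).

# Authenticate against the Run:ai control plane.
runai login

# List the projects you can access; confirm your project appears
# and that its assigned GPU quota is at least 1.
runai list projects
```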

Note

Flexible workload submission is disabled by default. If this option is unavailable, ask your administrator to enable it under General Settings → Workloads → Flexible workload submission.

Step 1: Logging In

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Step 2: Submitting a Standard Training Workload

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Training

  3. Select under which cluster to create the workload

  4. Select the project in which your workload will run

  5. Under Workload architecture, select Standard

  6. Select Start from scratch to launch a new workload quickly

  7. Enter a name for the standard training workload (if the name already exists in the project, you will be prompted to enter a different one)

  8. Click CONTINUE

    In the next step:

  9. Under Environment, enter the image URL: runai.jfrog.io/demo/quickstart

  10. Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘one-gpu’ compute resource for your workload.

    • If ‘one-gpu’ is not displayed, follow the steps below to create a one-time compute resource configuration:

      • Set GPU devices per pod to 1

      • Optional: set the CPU compute per pod to 0.1 cores (default)

      • Optional: set the CPU memory per pod to 100 MB (default)

  11. Click CREATE TRAINING
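The UI steps above can also be sketched from the command line. The following is a minimal sketch assuming the Run:ai CLI is installed and configured against your cluster; the project name is a placeholder, and flag names and defaults may differ across CLI versions, so treat this as illustrative rather than authoritative.

```shell
# Hypothetical CLI equivalent of the UI walkthrough above
# (flag names may vary by CLI version).

# Select the project the workload will run in ("team-a" is an assumed name).
runai config project team-a

# Submit a training workload with the quickstart image and one GPU,
# matching the "one-gpu" compute resource configured in the UI.
runai submit quickstart-train \
  --image runai.jfrog.io/demo/quickstart \
  --gpu 1
```

After submission, the workload should appear in the Workloads table alongside workloads created through the UI.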

Next Steps

  • Manage and monitor your newly created workload using the Workloads table.

  • After validating your training performance and results, deploy your model using inference.
