Run Your First Distributed Training

This quick start provides a step-by-step walkthrough for running a PyTorch distributed training workload.

Distributed training is the ability to split the training of a model among multiple processors. Each processor is called a worker. Worker nodes work in parallel to speed up model training. There is also a master which coordinates the workers.

Note

Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.

Prerequisites

Before you start, make sure:

You have created a project or have one created for you.
The project has an assigned quota of at least 1 GPU.

Note

Flexible workload submission is enabled by default. If unavailable, contact your administrator to enable it under General settings → Workloads → Flexible workload submission.

Step 1: Logging In

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Step 2: Submitting a Standard Training Workload

Go to the Workload manager → Workloads
Click +NEW WORKLOAD and select Training
Select under which cluster to create the workload
Select the project in which your workload will run
Under Workload architecture, select Distributed
Select PyTorch as the distributed framework and the distributed training configuration to Worker & master
Select Start from scratch to launch a new workload quickly
Enter a name for the training workload (if the name already exists in the project, you will be requested to submit a different name)
Click CONTINUE
In the next step:
Under Environment, enter the Image URL - kubeflow/pytorch-dist-mnist:latest
Click the load icon. A side pane appears, displaying a list of available compute resources. Select the ‘small-fraction’ compute resource for your workload.
- If ‘small-fraction’ is not displayed, follow the below steps to create a one-time compute resource configuration:
  - Set GPU devices per pod - 1
  - Enable GPU fractioning to set the GPU memory per device:
    Select % (of device) - Fraction of a GPU device’s memory
    Set the memory Request - 10 (the workload will allocate 10% of the GPU memory)
  - Optional: set the CPU compute per pod - 0.1 cores (default)
  - Optional: set the CPU memory per pod - 100 MB (default)
Click CONTINUE
Click CREATE TRAINING

Copy the following command to your terminal. Make sure to update the below with the name of your project and workload. For more details, see CLI reference:

runai project set "project-name"
runai training pytorch submit "workload-name" \
-i kubeflow/pytorch-dist-mnist:latest --workers 2 \
--gpu-request-type portion --gpu-portion-request 0.1 \
--gpu-devices-request 1 --cpu-memory-request 100M

Copy the following command to your terminal. Make sure to update the following parameters. For more details, see Distributed API:

curl -L 'https://<COMPANY-URL>/api/v1/workloads/distributed' \
-H 'Content-Type: application/json' \ 
-H 'Authorization: Bearer <TOKEN>' \   
-d '{  
    "name": "workload-name",  
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",  
    "spec": {  
        "compute": { 
            "cpuCoreRequest": 0.1,
            "gpuRequestType": "portion",
            "cpuMemoryRequest": "100M",
            "gpuDevicesRequest": 1,
            "gpuPortionRequest": 0.1
        },
        "image": "kubeflow/pytorch-dist-mnist:latest",  
        "numWorkers": 2, 
        "distributedFramework": "PyTorch" 
    } 
}

<COMPANY-URL> - The link to the NVIDIA Run:ai user interface
<TOKEN> - The API access token obtained in Step 1
<PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
<CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.

Note

The above API snippet runs with NVIDIA Run:ai clusters of 2.18 and above only.

Next Steps

Manage and monitor your newly created workload using the Workloads table.
After validating your training performance and results, deploy your model using inference.

PreviousQuick Starts NextBest Practices: Checkpointing Preemptible Training Workloads

Last updated 2 months ago

Good night

Prerequisites

Step 1: Logging In

Step 2: Submitting a Standard Training Workload

Next Steps