# Run Your First Distributed Training

This article provides a step-by-step walkthrough for running a PyTorch distributed training workload.

Distributed training is the ability to split the training of a model across multiple processors. Each processor is called a worker. Workers run in parallel to speed up model training, and a master coordinates them.
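The master/worker pattern can be illustrated with a minimal, framework-free sketch (plain Python threads standing in for worker nodes; this is an analogy, not the actual PyTorch machinery): the master splits the data into shards, the workers process their shards in parallel, and the master aggregates the results.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(shard):
    """Each worker processes its own shard of the data (stand-in for a training step)."""
    return sum(x * x for x in shard)

def master(data, num_workers=2):
    """The master splits the work, runs the workers in parallel, and aggregates."""
    shards = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(num_workers) as pool:
        partials = list(pool.map(worker, shards))  # master waits for all workers
    return sum(partials)

print(master(list(range(10))))  # → 285, the same result a single worker would compute
```

The key property shown here carries over to real distributed training: splitting the work across workers must not change the result, only the wall-clock time.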

{% hint style="info" %}
**Note**

Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.
{% endhint %}

## Prerequisites

Before you start, make sure:

* You have created a [project](https://run-ai-docs.nvidia.com/self-hosted/2.20/platform-management/aiinitiatives/organization/projects) or have one created for you.
* The project has an assigned quota of at least 1 GPU.

## Step 1: Logging In

{% tabs %}
{% tab title="UI" %}
Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.
{% endtab %}

{% tab title="CLI v2" %}
Log in using the following command. You will be prompted to enter your username and password:

```sh
runai login
```

{% endtab %}

{% tab title="CLI v1 (Deprecated)" %}
Log in using the following command. You will be prompted to enter your username and password:

```sh
runai login
```

{% endtab %}

{% tab title="API" %}
To use the API, you will need to obtain a token as shown in [API authentication](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/getting-started/how-to-authenticate-to-the-api).
{% endtab %}
{% endtabs %}

## Step 2: Submitting a Standard Training Workload

{% tabs %}
{% tab title="UI" %}

1. Go to **Workload manager** → **Workloads**
2. Click **+NEW WORKLOAD** and select **Training**
3. Select under which **cluster** to create the workload
4. Select the **project** in which your workload will run
5. Under **Workload architecture**, select **Distributed** and choose **PyTorch**. Set the distributed training configuration to **Worker & master**
6. Select **Start from scratch** to launch a new workload quickly
7. Enter a **name** for the training workload (if the name already exists in the project, you will be requested to submit a different name)
8. Click **CONTINUE**

   In the next step:
9. Create an environment for your workload

   * Click **+NEW ENVIRONMENT**
   * Enter **pytorch-dt** as the name
   * Enter `kubeflow/pytorch-dist-mnist:latest` as the **Image URL**
   * Click **CREATE ENVIRONMENT**

   The newly created environment will be selected automatically
10. Select the **‘small-fraction’** compute resource for your workload (GPU devices: 1)

    * If ‘small-fraction’ is not displayed in the gallery, follow the steps below:
      * Click **+NEW COMPUTE RESOURCE**
      * Enter a **name** for the compute resource. The name must be unique.
      * Set **GPU devices per pod** - 1
      * Set **GPU memory per device**
        * Select **% (of device)** - Fraction of a GPU device’s memory
        * Set the memory **Request** - 10 (the workload will allocate 10% of the GPU memory)
      * Optional: set the **CPU compute per pod** - 0.1 cores (default)
      * Optional: set the **CPU memory per pod** - 100 MB (default)
      * Click **CREATE COMPUTE RESOURCE**

    The newly created compute resource will be selected automatically
11. Click **CONTINUE**
12. Click **CREATE TRAINING**
    {% endtab %}

{% tab title="CLI v2" %}
Copy the following command to your terminal, replacing the project and workload names with your own. For more details, see [CLI reference](https://run-ai-docs.nvidia.com/self-hosted/2.20/reference/cli/runai):

```sh
runai project set "project-name"
runai training pytorch submit "workload-name" \
    -i kubeflow/pytorch-dist-mnist:latest --workers 2 \
    --gpu-request-type portion --gpu-portion-request 0.1 \
    --gpu-devices-request 1 --cpu-memory-request 100M
```

{% endtab %}

{% tab title="CLI v1 (Deprecated)" %}
Copy the following command to your terminal, replacing the project and workload names with your own. For more details, see [CLI reference](https://docs.run.ai/latest/Researcher/cli-reference/Introduction/):

```sh
runai config project "project-name" 
runai submit-dist pytorch "workload-name" --workers=2 -g 0.1 \
    -i kubeflow/pytorch-dist-mnist:latest
```

{% endtab %}

{% tab title="API" %}
Copy the following command to your terminal and update the parameters listed below. For more details, see the [Distributed](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/workloads/distributed) API reference:

```bash
curl -L 'https://<COMPANY-URL>/api/v1/workloads/distributed' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
    "name": "workload-name",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "compute": {
            "cpuCoreRequest": 0.1,
            "gpuRequestType": "portion",
            "cpuMemoryRequest": "100M",
            "gpuDevicesRequest": 1,
            "gpuPortionRequest": 0.1
        },
        "image": "kubeflow/pytorch-dist-mnist:latest",
        "numWorkers": 2,
        "distributedFramework": "PyTorch"
    }
}'
```

* `<COMPANY-URL>` - The link to the NVIDIA Run:ai user interface
* `<TOKEN>` - The API access token obtained in [Step 1](#step-1-logging-in)
* `<PROJECT-ID>` - The ID of the Project the workload is running on. You can get the Project ID via the [Get Projects](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/organizations/projects#get-api-v1-org-unit-projects) API.
* `<CLUSTER-UUID>` - The unique identifier of the Cluster. You can get the Cluster UUID via the [Get Clusters](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/organizations/clusters#get-api-v1-clusters) API.
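The same call can be sketched with Python's standard library (the placeholders match the curl example above; this is an illustrative equivalent, not an official client):

```python
import json
import urllib.request

def build_request(base_url, token, project_id, cluster_id):
    """Build the same distributed-workload POST request as the curl example."""
    payload = {
        "name": "workload-name",
        "projectId": project_id,
        "clusterId": cluster_id,
        "spec": {
            "compute": {
                "cpuCoreRequest": 0.1,
                "gpuRequestType": "portion",
                "cpuMemoryRequest": "100M",
                "gpuDevicesRequest": 1,
                "gpuPortionRequest": 0.1,
            },
            "image": "kubeflow/pytorch-dist-mnist:latest",
            "numWorkers": 2,
            "distributedFramework": "PyTorch",
        },
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/workloads/distributed",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

# To actually submit, fill in the placeholders and send the request:
# req = build_request("https://<COMPANY-URL>", "<TOKEN>", "<PROJECT-ID>", "<CLUSTER-UUID>")
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read())
```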

{% hint style="info" %}
**Note**

The above API snippet is supported on NVIDIA Run:ai clusters version 2.18 and above only.
{% endhint %}
{% endtab %}
{% endtabs %}
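Whichever submission method you use, the PyTorch training operator behind this workload type injects rendezvous settings (such as `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE`) into each pod; with 2 workers plus a master, the world size is typically 3. A training script inside the container would read these before initializing the process group. A minimal sketch of that bootstrap step (the `torch` call is shown as a comment so the snippet stays dependency-free):

```python
import os

def rendezvous_config():
    """Read the coordination settings the PyTorch operator injects into each pod."""
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "23456")),
        "rank": int(os.environ.get("RANK", "0")),              # rank 0 is the master
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),  # master + workers
    }

cfg = rendezvous_config()
# A real training script would now initialize the process group, e.g.:
# torch.distributed.init_process_group("nccl", rank=cfg["rank"],
#                                      world_size=cfg["world_size"])
print(f"rank {cfg['rank']} of {cfg['world_size']}, master at {cfg['master_addr']}")
```

The `kubeflow/pytorch-dist-mnist:latest` image used in this walkthrough performs this bootstrap internally, which is why no extra configuration is needed when submitting the workload.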

## Next Steps

* Manage and monitor your newly created workload using the [Workloads](https://run-ai-docs.nvidia.com/self-hosted/2.20/workloads-in-nvidia-run-ai/workloads) table.
* After validating your training performance and results, deploy your model using [inference](https://run-ai-docs.nvidia.com/self-hosted/2.20/workloads-in-nvidia-run-ai/using-inference).
