# Run Your First Distributed Training

This article provides a step-by-step walkthrough for running a PyTorch distributed training workload.

Distributed training is the ability to split the training of a model across multiple processors. Each processor is called a worker. Workers run in parallel to speed up model training, and a master coordinates them.
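The master/worker pattern can be illustrated with a minimal, framework-free sketch (plain Python threads standing in for worker nodes; this is an analogy, not the actual PyTorch machinery): the master splits the data into shards, the workers process their shards in parallel, and the master aggregates the results.

```python
from concurrent.futures import ThreadPoolExecutor

def worker(shard):
    """Each worker processes its own shard of the data (stand-in for a training step)."""
    return sum(x * x for x in shard)

def master(data, num_workers=2):
    """The master splits the work, runs the workers in parallel, and aggregates."""
    shards = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(num_workers) as pool:
        partials = list(pool.map(worker, shards))  # master waits for all workers
    return sum(partials)

print(master(list(range(10))))  # → 285, the same result a single worker would compute
```

The key property shown here carries over to real distributed training: splitting the work across workers must not change the result, only the wall-clock time.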

{% hint style="info" %}
**Note**

Multi-GPU training and distributed training are two distinct concepts. Multi-GPU training uses multiple GPUs within a single node, whereas distributed training spans multiple nodes and typically requires coordination between them.
{% endhint %}

## Prerequisites

Before you start, make sure:

* You have created a [project](https://run-ai-docs.nvidia.com/self-hosted/2.20/platform-management/aiinitiatives/organization/projects) or have one created for you.
* The project has an assigned quota of at least 1 GPU.

## Step 1: Logging In

{% tabs %}
{% tab title="UI" %}
Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.
{% endtab %}

{% tab title="CLI v2" %}
Log in using the following command. You will be prompted to enter your username and password:

```sh
runai login
```

{% endtab %}

{% tab title="CLI v1 (Deprecated)" %}
Log in using the following command. You will be prompted to enter your username and password:

```sh
runai login
```

{% endtab %}

{% tab title="API" %}
To use the API, you will need to obtain a token as shown in [API authentication](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/getting-started/how-to-authenticate-to-the-api).
{% endtab %}
{% endtabs %}

## Step 2: Submitting a Standard Training Workload

{% tabs %}
{% tab title="UI" %}

1. Go to **Workload manager** → **Workloads**
2. Click **+NEW WORKLOAD** and select **Training**
3. Select under which **cluster** to create the workload
4. Select the **project** in which your workload will run
5. Under **Workload architecture**, select **Distributed** and choose **PyTorch**. Set the distributed training configuration to **Worker & master**
6. Select **Start from scratch** to launch a new workload quickly
7. Enter a **name** for the training workload (if the name already exists in the project, you will be requested to submit a different name)
8. Click **CONTINUE**

   In the next step:
9. Create an environment for your workload

   * Click **+NEW ENVIRONMENT**
   * Enter **pytorch-dt** as the name
   * Enter `kubeflow/pytorch-dist-mnist:latest` as the **Image URL**
   * Click **CREATE ENVIRONMENT**

   The newly created environment will be selected automatically
10. Select the **‘small-fraction’** compute resource for your workload (GPU devices: 1)

    * If ‘small-fraction’ is not displayed in the gallery, follow the steps below:
      * Click **+NEW COMPUTE RESOURCE**
      * Enter a **name** for the compute resource. The name must be unique.
      * Set **GPU devices per pod** - 1
      * Set **GPU memory per device**
        * Select **% (of device)** - Fraction of a GPU device’s memory
        * Set the memory **Request** - 10 (the workload will allocate 10% of the GPU memory)
      * Optional: set the **CPU compute per pod** - 0.1 cores (default)
      * Optional: set the **CPU memory per pod** - 100 MB (default)
      * Click **CREATE COMPUTE RESOURCE**

    The newly created compute resource will be selected automatically
11. Click **CONTINUE**
12. Click **CREATE TRAINING**
    {% endtab %}

{% tab title="CLI v2" %}
Copy the following command to your terminal, replacing the project and workload names with your own. For more details, see [CLI reference](https://run-ai-docs.nvidia.com/self-hosted/2.20/reference/cli/runai):

```sh
runai project set "project-name"
runai training pytorch submit "workload-name" \
    -i kubeflow/pytorch-dist-mnist:latest --workers 2 \
    --gpu-request-type portion --gpu-portion-request 0.1 \
    --gpu-devices-request 1 --cpu-memory-request 100M
```

{% endtab %}

{% tab title="CLI v1 (Deprecated)" %}
Copy the following command to your terminal, replacing the project and workload names with your own. For more details, see [CLI reference](https://docs.run.ai/latest/Researcher/cli-reference/Introduction/):

```sh
runai config project "project-name" 
runai submit-dist pytorch "workload-name" --workers=2 -g 0.1 \
    -i kubeflow/pytorch-dist-mnist:latest
```

{% endtab %}

{% tab title="API" %}
Copy the following command to your terminal and update the parameters listed below. For more details, see the [Distributed](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/workloads/distributed) API reference:

```bash
curl -L 'https://<COMPANY-URL>/api/v1/workloads/distributed' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer <TOKEN>' \
-d '{
    "name": "workload-name",
    "projectId": "<PROJECT-ID>",
    "clusterId": "<CLUSTER-UUID>",
    "spec": {
        "compute": {
            "cpuCoreRequest": 0.1,
            "gpuRequestType": "portion",
            "cpuMemoryRequest": "100M",
            "gpuDevicesRequest": 1,
            "gpuPortionRequest": 0.1
        },
        "image": "kubeflow/pytorch-dist-mnist:latest",
        "numWorkers": 2,
        "distributedFramework": "PyTorch"
    }
}'
```

* `<COMPANY-URL>` - The link to the NVIDIA Run:ai user interface
* `<TOKEN>` - The API access token obtained in [Step 1](#step-1-logging-in)
* `<PROJECT-ID>` - The ID of the Project the workload is running on. You can get the Project ID via the [Get Projects](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/organizations/projects#get-api-v1-org-unit-projects) API.
* `<CLUSTER-UUID>` - The unique identifier of the Cluster. You can get the Cluster UUID via the [Get Clusters](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/organizations/clusters#get-api-v1-clusters) API.
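The same call can be sketched with Python's standard library (the placeholders match the curl example above; this is an illustrative equivalent, not an official client):

```python
import json
import urllib.request

def build_request(base_url, token, project_id, cluster_id):
    """Build the same distributed-workload POST request as the curl example."""
    payload = {
        "name": "workload-name",
        "projectId": project_id,
        "clusterId": cluster_id,
        "spec": {
            "compute": {
                "cpuCoreRequest": 0.1,
                "gpuRequestType": "portion",
                "cpuMemoryRequest": "100M",
                "gpuDevicesRequest": 1,
                "gpuPortionRequest": 0.1,
            },
            "image": "kubeflow/pytorch-dist-mnist:latest",
            "numWorkers": 2,
            "distributedFramework": "PyTorch",
        },
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/workloads/distributed",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

# To actually submit, fill in the placeholders and send the request:
# req = build_request("https://<COMPANY-URL>", "<TOKEN>", "<PROJECT-ID>", "<CLUSTER-UUID>")
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read())
```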

{% hint style="info" %}
**Note**

The above API snippet is supported on NVIDIA Run:ai clusters version 2.18 and above only.
{% endhint %}
{% endtab %}
{% endtabs %}
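Whichever submission method you use, the PyTorch training operator behind this workload type injects rendezvous settings (such as `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE`) into each pod; with 2 workers plus a master, the world size is typically 3. A training script inside the container would read these before initializing the process group. A minimal sketch of that bootstrap step (the `torch` call is shown as a comment so the snippet stays dependency-free):

```python
import os

def rendezvous_config():
    """Read the coordination settings the PyTorch operator injects into each pod."""
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "23456")),
        "rank": int(os.environ.get("RANK", "0")),              # rank 0 is the master
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),  # master + workers
    }

cfg = rendezvous_config()
# A real training script would now initialize the process group, e.g.:
# torch.distributed.init_process_group("nccl", rank=cfg["rank"],
#                                      world_size=cfg["world_size"])
print(f"rank {cfg['rank']} of {cfg['world_size']}, master at {cfg['master_addr']}")
```

The `kubeflow/pytorch-dist-mnist:latest` image used in this walkthrough performs this bootstrap internally, which is why no extra configuration is needed when submitting the workload.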

## Next Steps

* Manage and monitor your newly created workload using the [Workloads](https://run-ai-docs.nvidia.com/self-hosted/2.20/workloads-in-nvidia-run-ai/workloads) table.
* After validating your training performance and results, deploy your model using [inference](https://run-ai-docs.nvidia.com/self-hosted/2.20/workloads-in-nvidia-run-ai/using-inference).
