Distributed Training with PyTorch
This tutorial demonstrates how to run a distributed training workload with PyTorch on the NVIDIA Run:ai platform. Distributed training enables you to scale model training across multiple GPUs and nodes, improving performance and reducing training time. The example configurations provided here can be adapted to fit your own models, datasets, and training workflows.
In this tutorial, you will learn how to:
Build a custom PyTorch container image
Create a user application for API integrations with NVIDIA Run:ai
Create a persistent data source for storing checkpoints
Submit and run a distributed training workload through the NVIDIA Run:ai user interface, API, or CLI
Monitor workload progress and retrieve generated checkpoints
Prerequisites
Before you start, make sure you have created a project or have one created for you.
To run a distributed PyTorch training workload, you must first build a custom Docker container. This allows you to package the required code into a container that can be run and shared for future workloads.
The Docker runtime must be installed on a local machine with the same CPU architecture as the target hosts. For example, if the hosts use AMD64 CPUs, build the container on an AMD64 machine. If the hosts use ARM CPUs, build it on an ARM-based machine. Follow the Docker Engine installation guide to install Docker locally.
To store and pull the image from your own private Docker registry, create your own Docker registry credential as detailed in Step 4.
Step 1: Creating a Custom Docker Container
On your local machine where Docker is installed, create a directory to hold the build files, such as pytorch-distributed, and navigate into it.
In the new directory, open a new file named run.sh. This is a simple script that uses torchrun to launch the distributed training workload and then copies the generated checkpoint to the /checkpoints directory inside the container so it can be reused later. Copy the contents shown below into the file, then save and close it.
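A minimal sketch of run.sh, assuming the multinode.py script from the PyTorch examples repository (which accepts total_epochs, save_every, and an optional --batch_size argument and writes snapshot.pt to the working directory); adapt it to your own training code:

```bash
#!/bin/bash
# Hypothetical sketch of run.sh. Argument names follow the upstream
# ddp-tutorial-series multinode.py script: total_epochs, save_every, --batch_size.

# Launch the training script with torchrun. Rendezvous settings (master address,
# node rank, world size) are expected to come from environment variables set by
# the PyTorch training operator; add explicit torchrun flags if your cluster
# does not inject them.
torchrun multinode.py 100 25 --batch_size 32

# Copy the latest checkpoint to the mounted PVC so later workloads can reuse it.
# Only the rank-0 process writes snapshot.pt, so guard the copy.
if [ -f snapshot.pt ]; then
  cp snapshot.pt /checkpoints/snapshot.pt
fi
```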
Open another new file named Dockerfile. This Dockerfile uses the 24.07 PyTorch container hosted on NGC as a base, clones the official PyTorch examples repository inside the container, and copies the run.sh file created previously into the container. Copy the contents shown below into the file, then save and close it.
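A minimal Dockerfile sketch; the repository location and script path are assumptions chosen to match the /runai-distributed/examples/distributed/ddp-tutorial-series working directory referenced later in this tutorial:

```dockerfile
# Hypothetical sketch: adjust paths and base image tag to your needs.
FROM nvcr.io/nvidia/pytorch:24.07-py3

# Clone the official PyTorch examples repository inside the container.
WORKDIR /runai-distributed
RUN git clone https://github.com/pytorch/examples.git

# Copy the launch script created previously into the tutorial directory.
COPY run.sh /runai-distributed/examples/distributed/ddp-tutorial-series/run.sh
```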
Once both files have been saved locally, build a container with the following command, replacing <your-reg> with the name of your private or public registry. This will build the custom container locally:
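The image name and tag below match the image referenced later in this tutorial; adjust them to your own registry naming if needed:

```bash
# Run from the directory containing the Dockerfile and run.sh.
docker build -t <your-reg>/pytorch-ddp-example:24.07-py3 .
```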
Once the build has finished, push the image to your private or public registry with:
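For example, using the same image tag as in the build step:

```bash
docker push <your-reg>/pytorch-ddp-example:24.07-py3
```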
The custom container will be available in your registry and can be used immediately for workloads.
Step 2: Logging In
Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.
To use the API, you will need to obtain a token as shown in Creating a user application.
Run the --help command below to see the available login options, then log in according to your setup:
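Assuming the NVIDIA Run:ai CLI is installed and available as runai:

```bash
runai login --help
```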
Step 3: Creating a User Application
Applications are used for API integrations with NVIDIA Run:ai. An application contains a client ID and a client secret. With the client credentials, you can obtain a token and use it within subsequent API calls.
In the NVIDIA Run:ai user interface:
Click the user avatar at the top right corner, then select Settings
Click +APPLICATION
Enter the application’s name and click CREATE
Copy the Client ID and Client secret and store securely
Click DONE
To request an API access token, use the client credentials to get a token to access NVIDIA Run:ai using the Tokens API. For example:
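The request below is a sketch only; the grantType, AppId, and AppSecret field names are assumptions that should be verified against the Tokens API reference:

```bash
# Hedged sketch -- confirm the endpoint and body field names in the Tokens API reference.
curl -L 'https://<COMPANY-URL>/api/v1/token' \
  -H 'Content-Type: application/json' \
  -d '{
        "grantType": "app_token",
        "AppId": "<CLIENT-ID>",
        "AppSecret": "<CLIENT-SECRET>"
      }'
```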
Step 4: Creating a User Credential
User credentials allow users to securely store private authentication secrets, which are accessible only to the user who created them. See User credentials for more details.
In the NVIDIA Run:ai user interface:
Click the user avatar at the top right corner, then select Settings
Click +CREDENTIAL and select Docker registry from the dropdown
Enter a name for the credential. The name must be unique.
Optional: Provide a description of the credential
Enter the username, password, and Docker registry URL
Click CREATE CREDENTIAL
NVIDIA Run:ai automatically creates a Kubernetes secret prefixed with dockerreg-.
Step 5: Creating a PVC Data Source
To make it easier to reuse code and checkpoints in future workloads, create a data source in the form of a Persistent Volume Claim (PVC). The PVC can be mounted to workloads and will persist after the workload completes, allowing any data it contains to be reused.
To create a PVC, go to Workload manager → Data sources.
Click +NEW DATA SOURCE and select PVC from the dropdown menu.
Within the new form, set the scope.
Enter a name for the data source. The name must be unique.
For the data options, select New PVC and the storage class that suits your needs:
To allow all nodes to read and write from/to the PVC, select Read-write by many nodes for the access mode.
Enter 10 TB for the claim size to ensure there is capacity for future workloads.
Select Filesystem (default) as the volume mode. The volume will be mounted as a filesystem, enabling the use of directories and files.
Set the Container path to /checkpoints, which is where the PVC will be mounted inside containers.
Click CREATE DATA SOURCE
Copy the following command to your terminal. Make sure to update the following parameters:
<COMPANY-URL> - The link to the NVIDIA Run:ai user interface.
<TOKEN> - The API access token obtained in Step 3.
For all other parameters within the JSON body, refer to the PVC API.
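The request below is a sketch only; the endpoint path, scope, and claimInfo fields are assumptions drawn from the values used in this tutorial (and <PROJECT-ID> is an additional placeholder for the project scope). Verify everything against the PVC API reference:

```bash
# Hedged sketch -- confirm the endpoint path and JSON body fields in the PVC API reference.
curl -L 'https://<COMPANY-URL>/api/v1/asset/datasource/pvc' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
        "meta": {
          "name": "distributed-training-pvc",
          "scope": "project",
          "projectId": "<PROJECT-ID>"
        },
        "spec": {
          "path": "/checkpoints",
          "claimInfo": {
            "size": "10T",
            "accessModes": { "readWriteMany": true },
            "volumeMode": "Filesystem"
          }
        }
      }'
```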
After creating the data source, wait for the PVC to be provisioned. Use the List PVC assets API to retrieve the claim name. This claim name is the exact value that will be used for the <pvc-claim-name> when submitting the workload.
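A sketch of the List PVC assets call used to find the generated claim name (verify the path against the List PVC assets API reference):

```bash
# Hedged sketch -- inspect the returned assets for the claim name of the new PVC.
curl -L 'https://<COMPANY-URL>/api/v1/asset/datasource/pvc' \
  -H 'Authorization: Bearer <TOKEN>'
```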
Step 6: Creating the Workload
How the Configuration Works
Distributed workload configuration: Workers & master - Selected to enable coordination between pods during multi-node distributed training. This configuration is required for workloads that perform collective communication operations, such as all_reduce, to synchronize gradients across processes. In this example, the master pod participates in training rather than acting as a coordination-only process.
Number of workers: 1 - Configured to create two pods (1 master and 1 worker) running across two nodes. Since the master pod participates in training, the number of workers is set to one fewer than the total number of nodes.
GPU devices per pod: 1 - Each pod requests a single GPU, matching the example cluster setup where each node provides one GPU.
Runtime command: torchrun - Used to launch the distributed training workload. torchrun starts one process per allocated GPU and assigns each process a unique rank. The PyTorch training operator, managed by NVIDIA Run:ai, automatically sets environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE based on the total number of GPUs allocated to the workload.
Training parameters - The training script runs with a batch size of 32 for 100 total epochs and saves a checkpoint every 25 epochs. These values are chosen to demonstrate periodic checkpointing during a long-running training workload.
PVC mounted at /checkpoints - A PVC is mounted at /checkpoints to persist training checkpoints beyond the lifetime of the workload. This enables reuse of checkpoints in future runs and continuation of training with modified hyperparameters.
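For reference, each process in this two-pod, one-GPU-per-pod configuration sees environment values along these lines (illustrative only; the actual variables are injected at runtime by the training operator and torchrun):

```bash
# Illustrative values for a 2-pod, 1-GPU-per-pod workload.
echo "WORLD_SIZE=$WORLD_SIZE"   # total number of processes, e.g. 2
echo "RANK=$RANK"               # global rank of this process: 0 (master) or 1 (worker)
echo "LOCAL_RANK=$LOCAL_RANK"   # rank within the pod, e.g. 0 (one GPU per pod)
echo "MASTER_ADDR=$MASTER_ADDR" # hostname of the master pod, used for rendezvous
```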
Submitting the Workload
To create the training workload, go to Workload manager → Workloads.
Click +NEW WORKLOAD and select Training from the dropdown menu.
Within the new training form, select the cluster and project.
Set the training workload architecture to Distributed workload. This runs multiple processes that can span across different nodes. Distributed workloads are supported only in environments where distributed training is enabled.
Select Start from scratch to launch a new workload quickly.
Set the framework for the distributed workload to PyTorch. If PyTorch isn’t available, see Distributed training prerequisites for details on how to enable it.
Set the distributed workload configuration, which defines how distributed training workloads are divided across multiple machines or processes. For this example, select Workers & master, which enables coordination between pods during distributed training.
Choose Workers & master or Workers only based on training requirements and infrastructure.
A master pod is typically required for multi-node training workloads that must coordinate across nodes, such as workloads that perform all_reduce operations.
The master pod can either participate in training like a worker or act as a lightweight coordination process.
If coordination is not required, select Workers only.
Enter a name for the distributed training workload. If the name already exists in the project, you will be requested to submit a different name.
Click CONTINUE
Under Environment, set the Image URL by entering the image tag specified during the container build in the Creating a Custom Docker Container section above (<your-reg>/pytorch-ddp-example:24.07-py3).
Set the image pull policy:
Set the condition for pulling the image. It is recommended to select Pull the image only if it's not already present on the host. If you are pushing new containers to your registry with the same tag, select Always pull the image from the registry to check if there are updates to the image.
Set the secret for pulling the image. Provide a Kubernetes secret that contains the required Docker registry authentication credentials created in Step 4.
The Runtime settings define the working directory and command executed when the container starts. This example launches the multinode.py script using torchrun, which runs a multi-process application where each process has a unique rank. The PyTorch training operator coordinates with torchrun to automatically set environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE based on the total number of GPUs allocated to the workload.
Click +COMMAND & ARGUMENTS
In the Command field, enter bash run.sh. This will run the script on all allocated GPUs with a batch size of 32 for 100 total epochs and save a checkpoint every 25 epochs. The final checkpoint will be saved to the /checkpoints directory.
Set the container working directory to /runai-distributed/examples/distributed/ddp-tutorial-series. This directory contains the training scripts copied into the container during the image build and is the directory the container starts in when the pod launches.
Set the number of workers and GPU devices per pod:
Set 1 worker, resulting in two pods (1 master + 1 worker) running across two nodes. Since the master participates in training, specify one fewer worker than the total number of nodes.
Set GPU devices per pod to 1. In this example, the workload runs on two nodes, each with one GPU.
In the Data sources form, click the load icon. A side pane appears, displaying a list of available data sources. Select the PVC that was created in Step 5.
Click CONTINUE
Ensure the Allow different setup for the master toggle is disabled. Although the master pod can be configured differently from worker pods, this example uses the same configuration for both.
Click CREATE TRAINING
To create the training workload, go to Workload manager → Workloads.
Click +NEW WORKLOAD and select Training from the dropdown menu.
Within the new training form, select the cluster and project.
Set the training workload architecture to Distributed workload. This runs multiple processes that can span across different nodes. Distributed workloads are supported only in environments where distributed training is enabled.
Set the framework for the distributed workload to PyTorch. If PyTorch isn’t available, see Distributed training prerequisites for details on how to enable it.
Set the distributed workload configuration, which defines how distributed training workloads are divided across multiple machines or processes. For this example, select Workers & master, which enables coordination between pods during distributed training.
Choose Workers & master or Workers only based on training requirements and infrastructure.
A master pod is typically required for multi-node training workloads that must coordinate across nodes, such as workloads that perform all_reduce operations.
The master pod can either participate in training like a worker or act as a lightweight coordination process.
If coordination is not required, select Workers only.
Select Start from scratch to launch a new workload quickly.
Enter a name for the distributed training workload. If the name already exists in the project, you will be requested to submit a different name.
Click CONTINUE
Click + NEW ENVIRONMENT
Enter a name
Under Image URL, enter the image tag specified during the container build in the Creating a Custom Docker Container section above (<your-reg>/pytorch-ddp-example:24.07-py3).
Set the image pull policy. It is recommended to select Pull the image only if it's not already present on the host. If you are pushing new containers to your registry with the same tag, select Always pull the image from the registry to check if there are updates to the image.
The Runtime settings define the working directory and command executed when the container starts. This example launches the multinode.py script using torchrun, which runs a multi-process application where each process has a unique rank. The PyTorch training operator coordinates with torchrun to automatically set environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE based on the total number of GPUs allocated to the workload.
Click +COMMAND & ARGUMENTS
In the Command field, enter bash run.sh. This will run the script on all allocated GPUs with a batch size of 32 for 100 total epochs and save a checkpoint every 25 epochs. The final checkpoint will be saved to the /checkpoints directory.
Set the container working directory to /runai-distributed/examples/distributed/ddp-tutorial-series. This directory contains the training scripts copied into the container during the image build and is the directory the container starts in when the pod launches.
Click CREATE ENVIRONMENT
The newly created environment will be selected automatically
Set the number of workers to 1, resulting in two pods (1 master + 1 worker) running across two nodes. Since the master participates in training, specify one fewer worker than the total number of nodes.
Select the ‘one-gpu’ compute resource for your workload:
If ‘one-gpu’ is not displayed in the gallery, follow the steps below:
Click +NEW COMPUTE RESOURCE
Enter one-gpu as the name for the compute resource. The name must be unique
Set GPU devices per pod - 1
Click CREATE COMPUTE RESOURCE
The newly created compute resource will be selected automatically
In the Data sources form, select the PVC that was created in Step 5.
Click CONTINUE
Ensure the Allow different setup for the master toggle is disabled. Although the master pod can be configured differently from worker pods, this example uses the same configuration for both.
Click CREATE TRAINING
Copy the following example request and update the parameters as needed. For more details, see Distributed API:
<COMPANY-URL> - The link to the NVIDIA Run:ai user interface.
<TOKEN> - The API access token obtained in Step 3.
<PROJECT-ID> - The ID of the Project the workload is running on. You can get the Project ID via the Get Projects API.
<CLUSTER-UUID> - The unique identifier of the Cluster. You can get the Cluster UUID via the Get Clusters API.
<dockerreg-name> - The name of the user credential created in Step 4. Replace this with the name of the credential preceded by the system prefix dockerreg-.
<pvc-claim-name> - The claim name associated with the PVC created in Step 5.
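The request below is an abbreviated sketch; the endpoint path and spec field names are assumptions, and the full JSON schema should be taken from the Distributed API reference:

```bash
# Hedged sketch -- confirm the endpoint path and the full spec schema in the Distributed API reference.
curl -L 'https://<COMPANY-URL>/api/v1/workloads/distributed' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer <TOKEN>' \
  -d '{
        "name": "pytorch-distributed-training",
        "projectId": "<PROJECT-ID>",
        "clusterId": "<CLUSTER-UUID>",
        "spec": {
          "distributedFramework": "PyTorch",
          "numWorkers": 1,
          "image": "<your-reg>/pytorch-ddp-example:24.07-py3",
          "imagePullSecrets": ["<dockerreg-name>"],
          "command": "bash run.sh",
          "workingDir": "/runai-distributed/examples/distributed/ddp-tutorial-series",
          "compute": { "gpuDevicesRequest": 1 },
          "storage": {
            "pvc": [
              { "claimName": "<pvc-claim-name>", "path": "/checkpoints", "existingPvc": true }
            ]
          }
        }
      }'
```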
Copy the following command to your terminal. Make sure to update the command below with the name of your project and workload. For more details, see the CLI reference:
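The command below is a sketch; the subcommand and flag names vary between CLI versions, so confirm them with runai --help or the CLI reference before running it:

```bash
# Hedged sketch -- verify subcommand and flag names for your CLI version.
runai pytorch submit <workload-name> \
  --project <project-name> \
  --image <your-reg>/pytorch-ddp-example:24.07-py3 \
  --workers 1 \
  --gpu-devices-request 1 \
  --working-dir /runai-distributed/examples/distributed/ddp-tutorial-series \
  --existing-pvc claimname=<pvc-claim-name>,path=/checkpoints \
  --command -- bash run.sh
```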
Step 7: Monitoring
After the training workload is created, it is added to the Workloads table, where it can be viewed, monitored, and managed throughout its lifecycle.
Step 8: Getting the Checkpoint
At the end of the run.sh script, the latest generated checkpoint is copied to the PVC attached to the workload. Any workload that mounts this same PVC can load the checkpoint from /checkpoints/snapshot.pt. The PVC can also be used to persist additional data written to the specified filesystem path. This enables checkpoint reuse across workloads, allowing long-running training workloads to resume from a saved state or continue training with different hyperparameters.
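For example, from another workload that mounts the same PVC at /checkpoints, you can verify the checkpoint is present before resuming training:

```bash
# Run inside any workload that mounts the same PVC at /checkpoints.
ls -lh /checkpoints/snapshot.pt
```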
Step 9: Cleaning up the Environment
After the workload finishes, it can be deleted to free up resources for other workloads. To reclaim the disk space used by the PVC, delete the PVC once it is no longer needed.