Distributed Training with PyTorch

This tutorial demonstrates how to run a distributed training workload with PyTorch on the NVIDIA Run:ai platform. Distributed training enables you to scale model training across multiple GPUs and nodes, improving performance and reducing training time. The example configurations provided here can be adapted to fit your own models, datasets, and training workflows.

Note

  • Before running the example, verify that the CUDA version supported by the PyTorch container image is compatible with your target GPU hardware.

  • While the walkthrough uses PyTorch as the framework, the same principles apply to other distributed training frameworks supported by NVIDIA Run:ai.

In this tutorial, you will learn how to:

  • Build a custom PyTorch container image

  • Create a user application for API integrations with NVIDIA Run:ai

  • Create a persistent data source for storing checkpoints

  • Submit and run a distributed training workload through the NVIDIA Run:ai user interface, API, or CLI

  • Monitor workload progress and retrieve generated checkpoints

Prerequisites

  • Before you start, make sure you have created a project or have one created for you.

  • To run a distributed PyTorch training workload, you must first build a custom Docker container. This allows you to package the required code into a container that can be run and shared for future workloads.

    • The Docker runtime must be installed on a local machine with the same CPU architecture as the target hosts. For example, if the hosts use AMD64 CPUs, build the container on an AMD64 machine. If the hosts use ARM CPUs, build it on an ARM-based machine. Follow the Docker Engine installation guide to install Docker locally.

    • To store and pull the image from your own private Docker registry, create your own Docker registry credential as detailed in Step 4.

Step 1: Creating a Custom Docker Container

  1. On your local machine where Docker is installed, create and navigate to a directory to save the Dockerfile, such as pytorch-distributed:
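
     For example, from a shell on the build machine (the directory name is only a suggestion):

        mkdir pytorch-distributed
        cd pytorch-distributed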

  2. In the new directory, open a new file named run.sh and copy the following contents to the file. This is a very simple script that uses torchrun to launch a distributed training workload and copies the generated checkpoint to the /checkpoints directory inside the container so it can be used again later:
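
     A minimal sketch of what run.sh can contain is shown below. It assumes the multinode.py entry point and argument order (total epochs, checkpoint interval, batch size) from the PyTorch examples repository, and that torchrun's rendezvous settings are supplied by the training operator, so adjust it to your own setup:

        #!/bin/bash
        # Sketch of a launch script (assumed contents; adapt as needed).
        set -e

        # Launch the DDP example with torchrun. The node count, processes per node,
        # and master address are assumed to be provided by the PyTorch training
        # operator via environment variables; otherwise pass
        # --nnodes/--nproc_per_node/--rdzv_endpoint explicitly.
        # Positional arguments (assumed from the PyTorch examples repository):
        #   100 = total epochs, 25 = save a checkpoint every 25 epochs.
        torchrun multinode.py 100 25 --batch_size 32

        # multinode.py writes its checkpoint as snapshot.pt in the working directory;
        # copy it to the PVC mounted at /checkpoints so it outlives the workload.
        cp snapshot.pt /checkpoints/snapshot.pt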

  3. Save and close the file.

  4. Open another new file named Dockerfile and copy the following contents to the file. This Dockerfile uses the 24.07 PyTorch container hosted on NGC as a base, clones the official PyTorch examples repository inside the container, and copies the run.sh file created previously into the container.
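
     A sketch of such a Dockerfile is shown below, assuming the nvcr.io/nvidia/pytorch:24.07-py3 base image on NGC and a clone destination chosen so that the DDP tutorial scripts land under the working directory used later in this tutorial:

        # Base image: the 24.07 PyTorch container hosted on NGC.
        FROM nvcr.io/nvidia/pytorch:24.07-py3

        # Clone the official PyTorch examples repository (destination path assumed).
        RUN git clone https://github.com/pytorch/examples.git /runai-distributed/examples

        # Copy the launch script created in the previous step next to the tutorial scripts.
        WORKDIR /runai-distributed/examples/distributed/ddp-tutorial-series
        COPY run.sh .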

  5. Save and close the file.

  6. Once both files have been saved locally, build the custom container image locally with the following command, replacing <your-reg> with the name of your private or public registry:
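
     The image name and tag below match the Image URL used later in Step 6:

        docker build -t <your-reg>/pytorch-ddp-example:24.07-py3 .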

Note

If you are building the container on a machine with a CPU architecture different from the target cluster (for example, building on an ARM-based machine for an AMD64 cluster), use the following command instead:
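
For example, with Docker Buildx, assuming the cluster nodes are AMD64 (depending on your Buildx setup, you may need to add --load to keep the image locally or --push to send it straight to the registry):

    docker buildx build --platform linux/amd64 -t <your-reg>/pytorch-ddp-example:24.07-py3 .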

  7. Once the build has finished, push the image to your private or public registry with:
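
     For example, assuming the tag used in the build step above:

        docker push <your-reg>/pytorch-ddp-example:24.07-py3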

The custom container image will then be available in your registry and can be used immediately for workloads.

Step 2: Logging In

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Step 3: Creating a User Application

Note

This step is required only if you plan to submit or manage workloads via the API.

Applications are used for API integrations with NVIDIA Run:ai. An application contains a client ID and a client secret. With the client credentials, you can obtain a token and use it within subsequent API calls.

In the NVIDIA Run:ai user interface:

  1. Click the user avatar at the top right corner, then select Settings

  2. Click +APPLICATION

  3. Enter the application’s name and click CREATE

  4. Copy the Client ID and Client secret and store securely

  5. Click DONE

To request an API access token, pass the client credentials to the NVIDIA Run:ai Tokens API. For example:
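
A sketch of such a request with curl is shown below. The endpoint path and field names should be confirmed against the NVIDIA Run:ai API reference for your version; <company-url>, <client-id>, and <client-secret> are placeholders for your own values:

    curl -X POST "https://<company-url>/api/v1/token" \
      -H "Content-Type: application/json" \
      -d '{
            "grantType": "app_token",
            "AppId": "<client-id>",
            "AppSecret": "<client-secret>"
          }'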

Step 4: Creating a User Credential

Note

  • Creating Docker registry user credentials is supported via the UI only.

  • Pulling images from a private Docker registry using a user credential is supported in the UI (Flexible submission) and API.

User credentials allow users to securely store private authentication secrets, which are accessible only to the user who created them. See User credentials for more details.

In the NVIDIA Run:ai user interface:

  1. Click the user avatar at the top right corner, then select Settings

  2. Click +CREDENTIAL and select Docker registry from the dropdown

  3. Enter a name for the credential. The name must be unique.

  4. Optional: Provide a description of the credential

  5. Enter the username, password, and Docker registry URL

  6. Click CREATE CREDENTIAL

NVIDIA Run:ai automatically creates a Kubernetes secret prefixed with dockerreg-.
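
If you have kubectl access to the cluster, you can optionally verify that the secret was created. The namespace below assumes the common Run:ai convention of one runai-<project> namespace per project; adjust it to your cluster:

    kubectl get secrets -n runai-<project-name> | grep dockerreg-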

Step 5: Creating a PVC Data Source

To make it easier to reuse code and checkpoints in future workloads, create a data source in the form of a Persistent Volume Claim (PVC). The PVC can be mounted to workloads and will persist after the workload completes, allowing any data it contains to be reused.

Note

The first time a workload is launched using a new PVC, it takes longer to start because the underlying storage is provisioned only when the first claim against the PVC is made.

  1. To create a PVC, go to Workload manager → Data sources.

  2. Click +NEW DATA SOURCE and select PVC from the dropdown menu.

  3. Within the new form, set the scope.

Important

PVC data sources created at the cluster or department level do not replicate data across projects or namespaces. Each project or namespace is provisioned with its own PVC replica backed by a different underlying PV, so the data in one PVC is not replicated to the others.

  4. Enter a name for the data source. The name must be unique.

  5. For the data options, select New PVC and the storage class that suits your needs:

    • To allow all nodes to read and write from/to the PVC, select Read-write by many nodes for the access mode.

    • Enter 10 TB for the claim size to ensure there is capacity for future workloads.

    • Select Filesystem (default) as the volume mode. The volume will be mounted as a filesystem, enabling the usage of directories and files.

    • Set the Container path to /checkpoints, which is where the PVC will be mounted inside containers.

  6. Click CREATE DATA SOURCE

Step 6: Creating the Workload

Note

Flexible workload submission is enabled by default. If unavailable, contact your administrator to enable it under General settings → Workloads → Flexible workload submission.

How the Configuration Works

  • Distributed workload configuration: Workers & master - Selected to enable coordination between pods during multi-node distributed training. This configuration is required for workloads that perform collective communication operations, such as all_reduce, to synchronize gradients across processes. In this example, the master pod participates in training rather than acting as a coordination-only process.

  • Number of workers: 1 - Configured to create two pods (1 master and 1 worker) running across two nodes. Since the master pod participates in training, the number of workers is set to one fewer than the total number of nodes.

  • GPU devices per pod: 1 - Each pod requests a single GPU, matching the example cluster setup where each node provides one GPU.

  • Runtime command: torchrun - Used to launch the distributed training workload. torchrun starts one process per allocated GPU and assigns each process a unique rank. The PyTorch training operator, managed by NVIDIA Run:ai, automatically sets environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE based on the total number of GPUs allocated to the workload.

  • Training parameters - The training script runs with a batch size of 32 for 100 total epochs and saves a checkpoint every 25 epochs. These values are chosen to demonstrate periodic checkpointing during a long-running training workload.

  • PVC mounted at /checkpoints - A PVC is mounted at /checkpoints to persist training checkpoints beyond the lifetime of the workload. This enables reuse of checkpoints in future runs and continuation of training with modified hyperparameters.

Submitting the Workload

  1. To create the training workload, go to Workload manager → Workloads.

  2. Click +NEW WORKLOAD and select Training from the dropdown menu.

  3. Within the new training form, select the cluster and project.

  4. Set the training workload architecture to Distributed workload. This runs multiple processes that can span across different nodes. Distributed workloads are supported only in environments where distributed training is enabled.

  5. Select Start from scratch to launch a new workload quickly.

  6. Set the framework for the distributed workload to PyTorch. If PyTorch isn't available, see Distributed training prerequisites for details on enabling it.

  7. Set the distributed workload configuration, which defines how distributed training workloads are divided across multiple machines or processes. For this example, select Workers & master, which enables coordination between pods during distributed training.

    • Choose Workers & master or Workers only based on training requirements and infrastructure.

    • A master pod is typically required for multi-node training workloads that must coordinate across nodes, such as workloads that perform all_reduce operations.

    • The master pod can either participate in training like a worker or act as a lightweight coordination process.

    • If coordination is not required, select Workers only.

  8. Enter a name for the distributed training workload. If the name already exists in the project, you will be prompted to enter a different one.

  9. Click CONTINUE

  10. Under Environment, set the Image URL by entering the image tag specified during the container build in the Creating a Custom Docker Container section above (<your-reg>/pytorch-ddp-example:24.07-py3).

  11. Set the image pull policy:

    • Set the condition for pulling the image. It is recommended to select Pull the image only if it's not already present on the host. If you are pushing new containers to your registry with the same tag, select Always pull the image from the registry to check if there are updates to the image.

    • Set the secret for pulling the image. Provide a Kubernetes secret that contains the required Docker registry authentication credentials created in Step 4.

  12. The Runtime settings define the working directory and command executed when the container starts. This example launches the multinode.py script using torchrun, which runs a multi-process application where each process has a unique rank. The PyTorch training operator coordinates with torchrun to automatically set environment variables such as RANK, LOCAL_RANK, and WORLD_SIZE based on the total number of GPUs allocated to the workload.

    • Click +COMMAND & ARGUMENTS

    • In the Command field, enter bash run.sh. This will run the script on all allocated GPUs with a batch size of 32 for 100 total epochs and save a checkpoint every 25 epochs. The final checkpoint will be saved to the /checkpoints directory.

    • Set the container working directory to /runai-distributed/examples/distributed/ddp-tutorial-series. This directory contains the training scripts copied into the container during the image build and is the directory the container starts in.

  13. Set the number of workers and GPU devices per pod:

    • Set 1 worker, resulting in two pods (1 master + 1 worker) running across two nodes. Since the master participates in training, specify one fewer worker than the total number of nodes.

    • Set GPU devices per pod to 1. In this example, the workload runs on two nodes, each with one GPU.

  14. In the Data sources form, click the load icon. A side pane appears, displaying a list of available data sources. Select the PVC created in Step 5.

  15. Click CONTINUE

  16. Ensure the Allow different setup for the master toggle is disabled. Although the master pod can be configured differently from worker pods, this example uses the same configuration for both.

  17. Click CREATE TRAINING

Step 7: Monitoring

After the training workload is created, it is added to the Workloads table, where it can be viewed, monitored, and managed throughout its lifecycle.

Step 8: Getting the Checkpoint

At the end of the run.sh script, the latest generated checkpoint is copied to the PVC attached to the workload. Any workload that mounts this same PVC can load the checkpoint from /checkpoints/snapshot.pt. The PVC can also be used to persist additional data written to the specified filesystem path. This enables checkpoint reuse across workloads, allowing long-running training workloads to resume from a saved state or continue training with different hyperparameters.
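
As an illustration, a later workload could resume from the persisted snapshot along the following lines. The placeholder model and the snapshot key names are assumptions based on the multinode.py script in the PyTorch examples repository; adapt them to your own training code:

    import torch
    import torch.nn as nn

    # Placeholder model; replace with the architecture used during training
    # (the ddp-tutorial-series example trains a simple torch.nn.Linear(20, 1)).
    model = nn.Linear(20, 1)

    # Load the snapshot that run.sh copied to the PVC.
    snapshot = torch.load("/checkpoints/snapshot.pt", map_location="cpu")
    model.load_state_dict(snapshot["MODEL_STATE"])  # key name assumed from the example script
    start_epoch = snapshot["EPOCHS_RUN"]            # key name assumed from the example script
    print(f"Resuming training from epoch {start_epoch}")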

Step 9: Cleaning up the Environment

After the workload finishes, it can be deleted to free up resources for other workloads. To reclaim the disk space used by the PVC, delete the PVC once it is no longer needed.
