Run your first distributed training

This article provides a step-by-step walkthrough for running a PyTorch distributed training workload.

Distributed training splits the training of a model across multiple processes, called workers, which run in parallel to speed up training. A master process coordinates the workers, for example by synchronizing their results at each training step.
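To make the worker/master idea concrete, here is a toy illustration in plain Python (not actual PyTorch code): each worker computes a gradient on its own shard of the data, and the master averages the workers' gradients so that all of them apply the same update. All names and numbers below are invented for the example.

```python
# Toy data-parallel training loop: fit y = w * x on data generated with w = 2.

def worker_gradient(shard, weight):
    # Each worker computes the gradient of a squared-error loss
    # (weight * x - y)^2 over its local shard only.
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def training_step(shards, weight, lr=0.1):
    # In a real job the workers run in parallel; here we loop for clarity.
    grads = [worker_gradient(shard, weight) for shard in shards]
    avg_grad = sum(grads) / len(grads)  # the master's averaging step
    return weight - lr * avg_grad

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]  # split the data across 2 workers
w = 0.0
for _ in range(50):
    w = training_step(shards, w)
print(round(w, 2))  # converges toward 2.0
```

Real PyTorch distributed training works the same way in spirit: the gradient averaging is done collectively (an all-reduce) across the worker processes that this walkthrough launches for you.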

Prerequisites

Before you start, make sure:

  • You have created a project or have one created for you.

  • The project has an assigned quota of at least 1 GPU.

Step 1: Logging in

Browse to the provided NVIDIA Run:ai user interface and log in with your credentials.

Step 2: Submitting a standard training workload

  1. Go to the Workload manager → Workloads

  2. Click +NEW WORKLOAD and select Training

  3. Select under which cluster to create the workload

  4. Select the project in which your workload will run

  5. Under Workload architecture, select Distributed and choose PyTorch. Set the distributed training configuration to Worker & master

  6. Select a preconfigured template, or select Start from scratch to launch a new workload quickly

  7. Enter a name for the standard training workload (if the name already exists in the project, you will be asked to enter a different one)

  8. Click CONTINUE

    In the next step:

  9. Create an environment for your workload

    • Click +NEW ENVIRONMENT

    • Enter pytorch-dt as the name

    • Enter kubeflow/pytorch-dist-mnist:latest as the Image URL

    • Click CREATE ENVIRONMENT

    The newly created environment will be selected automatically

  10. Select the ‘small-fraction’ compute resource for your workload (GPU devices: 1)

    • If ‘small-fraction’ is not displayed in the gallery, follow the steps below:

      • Click +NEW COMPUTE RESOURCE

      • Enter a name for the compute resource. The name must be unique.

      • Set GPU devices per pod - 1

      • Set GPU memory per device

        • Select % (of device) - Fraction of a GPU device’s memory

        • Set the memory Request - 10 (the workload will allocate 10% of the GPU memory)

      • Optional: set the CPU compute per pod - 0.1 cores (default)

      • Optional: set the CPU memory per pod - 100 MB (default)

      • Click CREATE COMPUTE RESOURCE

    The newly created compute resource will be selected automatically

  11. Click CONTINUE

  12. Click CREATE TRAINING
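
If the Run:ai CLI is installed and configured against your cluster, a comparable workload can typically also be submitted from the command line. The command below is an illustrative sketch only: flag names and syntax vary between CLI versions (check `runai submit-dist --help`), and the project name `team-a` is a placeholder.

```shell
# Illustrative sketch; verify flags against your installed CLI version.
# Submits a PyTorch distributed workload with a master and two workers,
# using the same container image as this walkthrough.
runai submit-dist pytorch \
  --name pytorch-dt \
  --workers=2 \
  -g 1 \
  -i kubeflow/pytorch-dist-mnist:latest \
  -p team-a    # placeholder project name
```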

Next steps

  • Manage and monitor your newly created workload using the Workloads table.

  • After validating your training performance and results, deploy your model by running an inference workload.
