Distributed Training Templates
This guide explains how to create distributed training templates for reuse during workload submission. To manage templates, see Workload Templates.
Note
Flexible workload templates are enabled by default and apply only to flexible workload submission (also enabled by default). If unavailable, contact your administrator to enable them under General settings → Workloads → Flexible workload templates.
Before You Start
To access NGC, make sure you have an NGC account with an active NGC API key. To obtain a key, go to NGC → Setup → API Keys, then generate or copy an existing key. In NVIDIA Run:ai, store the key either as a shared secret (created by an administrator) or a user credential (created under User settings).
Workload Priority and Preemption
By default, training workloads are assigned a Low priority and are preemptible. These defaults allow training workloads to use opportunistic compute resources beyond the project’s deserved quota while being scheduled after higher-priority workloads. Preemptible training workloads may be interrupted if those resources are required by higher-priority, non-preemptible workloads.
You can override the defaults by configuring priority and preemptibility. For more details, see Workload priority and preemption.
Linking Assets
When loading an existing asset (environment or compute resource) into a template, you can choose whether to link the asset or use it without linking. Linked assets remain connected to the template: any updates made to the original environment or compute resource are automatically reflected in the template. While linked, the asset fields in the template cannot be modified.
Note
Linking data source assets is currently not supported.
Adding a New Template
To add a new template, go to Workload manager → Templates.
Click +NEW TEMPLATE and select Training from the dropdown menu.
Within the new training template form, select the scope.
Set the training workload architecture to distributed workload, which consists of multiple processes working together. These processes can run on different nodes. Only environments that support distributed training workloads can be used with this architecture.
Set the framework for the distributed workload. If a framework is not enabled, see Distributed training prerequisites for details on enabling it.
Set the distributed workload configuration that defines how distributed training workloads are divided across multiple machines or processes. Choose Workers & master or Workers only based on your training requirements and infrastructure.
Enter a unique name for the training template. If the name already exists in the project, you will be prompted to enter a different name.
Click CONTINUE.
Setting Up an Environment
Note
NGC catalog is disabled by default. If unavailable, your administrator must enable it under General settings → Workloads → NGC catalog.
To select an image from the NGC private registry, your administrator must configure it under General settings → Workloads → NGC private registry.
Load from existing setup
Click the load icon. A side pane appears, displaying a list of available environments. Select an environment from the list.
Alternatively, click the ➕ icon in the side pane to create a new environment. For step-by-step instructions, see Environments.
Choose whether to link the environment when applying it to the template. See Linking assets for more details.
Provide your own settings
Manually configure the settings below as needed.
Configure environment
Set the environment image:
Select Custom image and add the Image URL or update the URL of the existing setup.
Select from the NGC catalog and then set how to access the NGC catalog:
As a guest - Choose the image name and tag from the dropdown.
Authenticated - Configure access to NGC and then choose the image name and tag from the dropdown:
Under Source, select Shared secret or My credentials.
The Type field is fixed to NGC API key.
Under Credential name, select an existing credential from the dropdown. If you select My credentials, you can also create a new credential directly from the dropdown. The credential is also saved under User settings → Credentials.
Select from the NGC private registry and then set how to access the registry:
Select a registry from the dropdown.
Under Source, select Shared secret or My credentials.
The Type field is fixed to NGC API key.
Under Credential name, select an existing credential from the dropdown. If you select My credentials, you can also create a new credential directly from the dropdown. The credential is also saved under User settings → Credentials.
Set the condition for pulling the image by selecting the image pull policy. It is recommended to pull the image only if it's not already present on the host.
Set the connection for your tool(s). If you are loading from an existing setup, the tools are configured as part of the environment.
Select the connection type:
Auto generate - A unique URL / port is automatically created for each workload using the environment.
Custom URL / Custom port - Manually define the URL or port. For custom port, make sure to enter a port between 30000 and 32767. If the node port is already in use, the workload will fail and display an error message.
Load Balancer - Set the container port. Connection handling is managed by the load balancer. For more information, see External access to containers.
Modify who can access the tool:
By default, All authenticated users and service accounts is selected, giving access to everyone within the organization’s account.
For Specific group(s), enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access to the tool.
For Specific user(s) and service account(s), enter a valid user email or name. If you remove yourself, you will lose access to the tool.
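Under the hood, these connection types correspond to standard Kubernetes Service types, which NVIDIA Run:ai creates and manages for you. The following sketch with the kubernetes Python client shows roughly what each option maps to; the service names, selector labels, and port values are illustrative only:

```python
from kubernetes import client

# Custom port: a NodePort Service exposing the tool on a port in the
# valid 30000-32767 range (the port values here are only examples).
node_port_svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="my-tool"),          # hypothetical name
    spec=client.V1ServiceSpec(
        type="NodePort",
        selector={"app": "my-workload"},                   # hypothetical label
        ports=[client.V1ServicePort(port=8888, target_port=8888, node_port=30080)],
    ),
)

# Load Balancer: only the container port is set; the load balancer
# assigns and handles the external address.
lb_svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="my-tool-lb"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app": "my-workload"},
        ports=[client.V1ServicePort(port=8888, target_port=8888)],
    ),
)
```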
Set the command and arguments for the container running the workload. If no command is added, the container will use the image’s default command (entry-point):
Modify the existing command or click +COMMAND & ARGUMENTS to add a new command.
Set multiple arguments separated by spaces, using the following format (e.g., --arg1=val1).
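For reference, the command and arguments map to the container’s command and args fields. A minimal sketch with the kubernetes Python client, using a hypothetical training script and flag:

```python
from kubernetes import client

# If no command is given, the image's default entrypoint runs.
# Arguments entered in the form are split on spaces.
container = client.V1Container(
    name="trainer",
    image="nvcr.io/nvidia/pytorch:24.01-py3",  # example image
    command=["python"],                        # overrides the image entrypoint
    args=["train.py", "--arg1=val1"],          # hypothetical script and flag
)
```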
Set the environment variable(s):
Modify the existing environment variable(s) or click +ENVIRONMENT VARIABLE. The existing environment variables may include instructions to guide you with entering the correct values.
You can either select Custom to define a value manually, or choose an existing value from Shared secret or ConfigMap.
Both NVIDIA Run:ai and the training framework (for example, PyTorch or TensorFlow) automatically inject environment variables into distributed workloads. See Built-in workload environment variables for more details.
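For example, a PyTorch training script typically reads the injected variables rather than hard-coding rendezvous details. The sketch below assumes the standard PyTorch variable names (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE); see Built-in workload environment variables for the exact set injected by NVIDIA Run:ai:

```python
import os
import torch.distributed as dist

# MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE are injected into each
# pod of the distributed workload, so the code only needs to read them.
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# "env://" tells PyTorch to take the rendezvous information from the
# environment variables above.
dist.init_process_group(backend="nccl", init_method="env://")
print(f"process {rank} of {world_size} initialized")
```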
Enter a path pointing to the container's working directory.
Set where the UID, GID, and supplementary groups for the container should be taken from. If you select Custom, you’ll need to manually enter the UID, GID and Supplementary groups values.
Select additional Linux capabilities for the container from the dropdown menu. This grants certain privileges to a container without granting all the root user's privileges.
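These last three settings map to the container’s working directory and security context. A minimal sketch with the kubernetes Python client, using example values only; note that supplementary groups are set at the pod level:

```python
from kubernetes import client

# Container-level identity and capabilities (example values).
sec_ctx = client.V1SecurityContext(
    run_as_user=1000,                                       # custom UID
    run_as_group=3000,                                      # custom GID
    capabilities=client.V1Capabilities(add=["IPC_LOCK"]),   # example capability
)
# Supplementary groups live on the pod's security context.
pod_sec_ctx = client.V1PodSecurityContext(supplemental_groups=[2000])

container = client.V1Container(
    name="trainer",
    image="nvcr.io/nvidia/pytorch:24.01-py3",
    working_dir="/workspace",        # the container's working directory
    security_context=sec_ctx,
)
```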
Setting Up Compute Resources
Note
GPU memory limit is disabled by default. If unavailable, your administrator must enable it under General settings → Resources → GPU resource optimization.
Toleration(s) and topology can still be modified even when the compute resource asset is linked.
Load from existing setup
Click the load icon. A side pane appears, displaying a list of available compute resources. Select a compute resource from the list.
Alternatively, click the ➕ icon in the side pane to create a new compute resource. For step-by-step instructions, see Compute resources.
Choose whether to link the compute resource when applying it to the template. See Linking assets for more details.
Provide your own settings
Manually configure the settings below as needed.
Configure compute resources
Set the number of workers for your workload.
Set the number of GPU devices per pod (physical GPUs).
Enable GPU fractioning to set the GPU memory per device using either a fraction of a GPU device’s memory (% of device) or a GPU memory unit (MB/GB):
Request - The minimum GPU memory allocated per device. Each pod in the workload receives at least this amount per device it uses.
Limit - The maximum GPU memory allocated per device. Each pod in the workload receives at most this amount of GPU memory for each device it utilizes. This is disabled by default; to enable it, see the note above.
Set the CPU resources
Set CPU compute resources per pod by choosing the unit (cores or millicores):
Request - The minimum amount of CPU compute provisioned per pod. Each running pod receives this amount of CPU compute.
Limit - The maximum amount of CPU compute a pod can use. Each pod receives at most this amount of CPU compute. By default, the limit is set to Auto which means that the pod may consume up to the node's maximum available CPU compute resources.
Set the CPU memory per pod by selecting the unit (MB or GB):
Request - The minimum amount of CPU memory provisioned per pod. Each running pod receives this amount of CPU memory.
Limit - The maximum amount of CPU memory a pod can use. Each pod receives at most this amount of CPU memory. By default, the limit is set to Auto which means that the pod may consume up to the node's maximum available CPU memory resources.
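These requests and limits correspond to standard Kubernetes container resources. A minimal sketch with the kubernetes Python client (all values are examples):

```python
from kubernetes import client

# Requests are guaranteed to each pod; limits cap its usage.
# "500m" means half a core in millicores; memory accepts units like "1Gi".
resources = client.V1ResourceRequirements(
    requests={"cpu": "500m", "memory": "1Gi"},
    limits={"cpu": "2", "memory": "4Gi"},  # omit a limit to allow up to the node's capacity ("Auto")
)
```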
Set extended resource(s)
Enable Increase shared memory size to allow the shared memory size available to the pod to increase from the default 64MB to the node's total available memory or the CPU memory limit, if set above.
Click +EXTENDED RESOURCES to add resource/quantity pairs. For more information on how to set extended resources, see the Extended resources and Quantity guides.
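For reference, extended resources are plain name/quantity pairs on the container, and increasing shared memory is commonly achieved by mounting a memory-backed emptyDir over /dev/shm. The sketch below uses a hypothetical extended resource name and example sizes; NVIDIA Run:ai may implement the shared-memory toggle differently:

```python
from kubernetes import client

# An extended resource is requested by name, with matching request and limit.
resources = client.V1ResourceRequirements(
    requests={"example.com/dongle": "1"},   # hypothetical extended resource
    limits={"example.com/dongle": "1"},
)

# A common way to raise shared memory above the 64MB default: a
# memory-backed emptyDir mounted at /dev/shm.
shm_volume = client.V1Volume(
    name="dshm",
    empty_dir=client.V1EmptyDirVolumeSource(medium="Memory", size_limit="2Gi"),
)
shm_mount = client.V1VolumeMount(name="dshm", mount_path="/dev/shm")
```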
Click +TOLERATION to allow the workload to be scheduled on a node with a matching taint (see the sketch after this list). Select the operator and the effect:
If you select Exists, the effect will be applied if the key exists on the node.
If you select Equals, the effect will be applied if the key and the value set match the value on the node.
Click +TOPOLOGY to let the workload be scheduled on nodes with a matching topology - same region, zone, placement group or any other topology you define.
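The sketch below shows what the two toleration operators and a zone-based topology constraint look like as Kubernetes objects; the key names, values, and labels are examples only:

```python
from kubernetes import client

# Exists: matches as long as the taint key is present on the node,
# regardless of its value.
gpu_toleration = client.V1Toleration(
    key="nvidia.com/gpu", operator="Exists", effect="NoSchedule",
)

# Equal: both the key and the value must match the node's taint.
team_toleration = client.V1Toleration(
    key="team", operator="Equal", value="research", effect="NoExecute",
)

# Topology constraints typically rely on well-known node labels such as
# topology.kubernetes.io/zone to co-locate the workload's pods.
affinity_term = client.V1PodAffinityTerm(
    topology_key="topology.kubernetes.io/zone",
    label_selector=client.V1LabelSelector(match_labels={"app": "my-workload"}),
)
```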
Setting Up Data & Storage
Note
Data volumes is enabled by default. If unavailable, contact your administrator to enable it under General settings → Workloads → Data volumes.
If Data volumes is not enabled, Data & storage appears as Data sources only, and no data volumes will be available.
Load from existing setup
Click the load icon. A side pane appears, displaying a list of available data sources/volumes. Select a data source/volume from the list.
Optionally, customize any of the data source's predefined fields as shown below. The changes will apply to this template only and will not affect the selected data source:
Container path - Enter the container path to set the data target location.
ConfigMap sub-path - Specify a sub-path (file/key) inside the ConfigMap to mount (for example, app.properties). This lets you mount a single file from an existing ConfigMap (see the sketch below).
Alternatively, click the ➕ icon in the side pane to create a new data source/volume. For step-by-step instructions, see Data sources or Data volumes.
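For reference, the ConfigMap sub-path customization corresponds to a Kubernetes subPath volume mount. A minimal sketch with the kubernetes Python client, using hypothetical ConfigMap and path names:

```python
from kubernetes import client

# Reference the ConfigMap as a volume...
volume = client.V1Volume(
    name="config",
    config_map=client.V1ConfigMapVolumeSource(name="my-configmap"),  # hypothetical name
)
# ...then mount only one key from it as a single file via sub_path.
mount = client.V1VolumeMount(
    name="config",
    mount_path="/etc/app/app.properties",  # container path (example)
    sub_path="app.properties",             # single file/key inside the ConfigMap
)
```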
Configure data sources for a one-time configuration
Note
PVCs, Secrets, ConfigMaps and Data volumes cannot be added as a one-time configuration.
Click the ➕ icon and choose the data source from the dropdown menu. You can add multiple data sources.
Once selected, set the data origin according to the required fields and enter the container path to set the data target location.
The required fields vary by data source. For detailed configuration options and usage guidelines for each data source type, see Data sources.
Configure EmptyDir for a one-time configuration
Use EmptyDir to allocate temporary storage that exists only for the lifetime of the workload. The EmptyDir volume is ephemeral; the volume and its data are deleted every time the workload’s status changes to “Stopped”.
Set the size and units to define the maximum storage capacity.
Enter the container path to set the data target location.
Select the storage medium:
Disk-backed storage - The data is stored on the node's filesystem, which persists across container restarts but not pod rescheduling, and is slower than memory.
Memory-backed storage - The data is stored in RAM, which provides faster access but is not persistent, and will be lost if the pod restarts.
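Both storage media correspond to the Kubernetes emptyDir volume type. A minimal sketch with the kubernetes Python client; the sizes and container path are examples:

```python
from kubernetes import client

# Disk-backed (default): data lives on the node's filesystem.
scratch = client.V1Volume(
    name="scratch",
    empty_dir=client.V1EmptyDirVolumeSource(size_limit="10Gi"),
)

# Memory-backed: data lives in RAM; faster, but counts against memory
# and is lost if the pod restarts.
scratch_ram = client.V1Volume(
    name="scratch-ram",
    empty_dir=client.V1EmptyDirVolumeSource(medium="Memory", size_limit="2Gi"),
)

mount = client.V1VolumeMount(name="scratch", mount_path="/tmp/scratch")
```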
Configure Volume for a one-time configuration
Select Volume to allocate a storage space to your workload that is persistent across restarts:
Set the Storage class to None or select an existing storage class from the list. To add new storage classes, and for additional information, see Kubernetes storage classes. If the administrator defined the storage class configuration, the rest of the fields will appear accordingly.
Select one or more access mode(s) and define the claim size and its units.
Select the volume mode:
Filesystem (default) - The volume will be mounted as a filesystem, enabling the usage of directories and files.
Block - The volume is exposed as a block storage, which can be formatted or used directly by applications without a filesystem.
Enter the container path to set the data target location.
Set the volume persistency to Persistent if the volume and its data should be deleted only when the workload itself is deleted, or Ephemeral if the volume and its data should be deleted every time the workload’s status changes to “Stopped”.
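A volume configured this way corresponds to a Kubernetes PersistentVolumeClaim, which NVIDIA Run:ai creates for you. The sketch below mirrors the form fields with example values:

```python
from kubernetes import client

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="training-data"),  # hypothetical name
    spec=client.V1PersistentVolumeClaimSpec(
        storage_class_name="standard",   # or omit to use no storage class ("None")
        access_modes=["ReadWriteMany"],  # one or more access modes
        volume_mode="Filesystem",        # or "Block"
        resources=client.V1ResourceRequirements(requests={"storage": "50Gi"}),
    ),
)
```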
Setting Up General Settings
Note
The following general settings are optional.
Set whether the workload may be interrupted by selecting Preemptible or Non-preemptible:
Non-preemptible workloads use the project's available GPU quota and will not be interrupted once they start running.
Preemptible workloads may be interrupted if resources are needed for higher-priority workloads.
Set the workload priority. Choose the appropriate priority level for the workload. Higher-priority workloads are scheduled before lower-priority ones.
Set the grace period for workload preemption. This is a buffer that allows a preempted workload to reach a safe checkpoint before it is forcibly preempted. Enter a timeframe between 0 sec and 5 min. This will be applied to the workers and master.
Set the backoff limit before workload failure. The backoff limit is the maximum number of retry attempts for failed workloads. After reaching the limit, the workload status will change to "Failed." Enter a value between 0 and 100. This will be applied to the workers and master.
Set the timeframe for auto-deletion after workload completion or failure. This is the time after which a completed or failed workload is deleted; if set to 0 seconds, the workload is deleted automatically as soon as it completes or fails. This will be applied to the workers and master. This setting does not affect log retention, which is managed separately.
Set the SSH authorization mount path. Specify the path to the SSH key directory to enable MPI communication as a non-root user for MPI distributed training workloads. This setting applies to both workers and the master.
Set which pods should be deleted after workload completion or failure. Use this setting to manage resource cleanup behavior based on your use case, whether you want to debug issues or immediately release resources. The default selection varies depending on the framework used. This will be applied to the workers and master.
Set annotation(s). Kubernetes annotations are key-value pairs attached to the workload. They are used for storing additional descriptive metadata to enable documentation, monitoring and automation. This will be applied to the workers.
Set label(s). Kubernetes labels are key-value pairs attached to the workload. They are used for categorizing resources and enabling queries. This will be applied to the workers.
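Several of these general settings have direct Kubernetes counterparts. The sketch below illustrates them on a plain Job spec with example values; distributed workloads are actually managed by the relevant training operator, so this is only an analogy for what each field controls:

```python
from kubernetes import client

job_spec = client.V1JobSpec(
    backoff_limit=3,                   # retry attempts before the workload is marked "Failed"
    ttl_seconds_after_finished=3600,   # auto-deletion after completion or failure
    template=client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(
            annotations={"owner": "team-a"},   # example annotation
            labels={"project": "demo"},        # example label
        ),
        spec=client.V1PodSpec(
            termination_grace_period_seconds=60,  # preemption grace period
            containers=[client.V1Container(name="trainer", image="python:3.11")],
        ),
    ),
)
```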
Completing the Template
Use the toggle to decide whether to define a different setup for the Workers and the Master. When the toggle is disabled, the master’s setup inherits the workers’ setup. If a different setup is required, repeat the above setup steps with the necessary changes.
Before finalizing your template, review your configurations and make any necessary adjustments.
Click CREATE TEMPLATE.
Using API
Go to the Workload templates API reference to view the available actions.
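For instance, templates can be listed or created programmatically. The snippet below is only a shape sketch: the endpoint path, base URL, and token handling are assumptions, so consult the API reference for the actual paths and payload schema:

```python
import requests

BASE_URL = "https://<your-runai-url>"  # placeholder
TOKEN = "<application-token>"          # placeholder; obtain per the API docs

# Assumed endpoint shown for illustration only; see the Workload
# templates API reference for the real path and schema.
resp = requests.get(
    f"{BASE_URL}/api/v1/workload-templates",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
print(resp.json())
```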