Distributed

Distributed training is the ability to split the training of a model among multiple processors. It is often a necessity when single-node multi-GPU training no longer suffices, typically because you require more GPUs than exist on a single node. Each such split is a pod (see the definition above). NVIDIA Run:ai spawns an additional launcher process that manages and coordinates the other worker pods. For more information, see Distributed training.

Create a distributed training.

post

Use to create a distributed training.

Authorizations
AuthorizationstringRequired

Bearer authentication

Body
Responses
post
/api/v1/workloads/distributed
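As a sketch of how this endpoint might be called from Python's standard library (the base URL, token, and body fields below are placeholders, not the full request schema):

```python
import json
import urllib.request

BASE_URL = "https://my-cluster.example.com"  # placeholder control-plane URL


def create_distributed_training(token: str, body: dict) -> urllib.request.Request:
    """Build the POST request for /api/v1/workloads/distributed."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/workloads/distributed",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # Bearer authentication, as above
            "Content-Type": "application/json",
        },
        method="POST",
    )


# The body fields here are illustrative only; consult the full request schema.
req = create_distributed_training("MY_TOKEN", {"name": "train-llm", "projectId": "proj-1"})
```

Sending the request is then a matter of passing it to `urllib.request.urlopen` (omitted here since it requires a live cluster).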

Get a distributed training's data. [Experimental]

get

Retrieve the details of a distributed training by workload id.

Authorizations
AuthorizationstringRequired

Bearer authentication

Path parameters
workloadIdstring · uuidRequired

The Universally Unique Identifier (UUID) of the workload.

Responses
200

Executed successfully.

application/json
get
/api/v1/workloads/distributed/{workloadId}
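A minimal sketch of the retrieval call, again using only the standard library (the base URL and token are placeholders; only the path and auth scheme come from this page):

```python
import json
import urllib.request

BASE_URL = "https://my-cluster.example.com"  # placeholder control-plane URL


def get_distributed_training(token: str, workload_id: str) -> urllib.request.Request:
    """Build the GET request for a single distributed training by workload UUID."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/workloads/distributed/{workload_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# A 200 response carries an application/json body with the workload details:
# with urllib.request.urlopen(get_distributed_training(token, wid)) as resp:
#     details = json.load(resp)
```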

Delete a distributed training by id.

delete

Use to delete a distributed training by workload id.

Authorizations
AuthorizationstringRequired

Bearer authentication

Path parameters
workloadIdstring · uuidRequired

The Universally Unique Identifier (UUID) of the workload.

Responses
delete
/api/v1/workloads/distributed/{workloadId}

Suspend a distributed training.

post

Suspend a distributed training from running using a workload id.

Authorizations
AuthorizationstringRequired

Bearer authentication

Path parameters
workloadIdstring · uuidRequired

The Universally Unique Identifier (UUID) of the workload.

Responses
post
/api/v1/workloads/distributed/{workloadId}/suspend

Resume a distributed training.

post

Resume a distributed training that was suspended using a workload id.

Authorizations
AuthorizationstringRequired

Bearer authentication

Path parameters
workloadIdstring · uuidRequired

The Universally Unique Identifier (UUID) of the workload.

Responses
post
/api/v1/workloads/distributed/{workloadId}/resume
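The suspend and resume endpoints share the same shape (a POST to a sub-resource of the workload with no body), so a single hypothetical helper can cover both; the base URL and token are placeholders:

```python
import urllib.request

BASE_URL = "https://my-cluster.example.com"  # placeholder control-plane URL


def workload_action(token: str, workload_id: str, action: str) -> urllib.request.Request:
    """Build a POST for the suspend/resume sub-resource of a distributed training."""
    if action not in ("suspend", "resume"):
        raise ValueError(f"unsupported action: {action}")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/workloads/distributed/{workload_id}/{action}",
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )


# Suspend, then later resume, the same workload by its UUID:
suspend_req = workload_action("MY_TOKEN", "00000000-0000-0000-0000-000000000000", "suspend")
resume_req = workload_action("MY_TOKEN", "00000000-0000-0000-0000-000000000000", "resume")
```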
