Distributed training is the ability to split the training of a model among multiple processors. It is often necessary when multi-GPU training on a single node no longer suffices, typically when you require more GPUs than exist on a single node. Each such split is a pod (see definition above). NVIDIA Run:ai spawns an additional launcher process that manages and coordinates the other worker pods. For more information, see Distributed training.
Create a distributed training.
post
Use this endpoint to create a distributed training.
Authorizations
Authorization · string · Required
Bearer authentication
Body
Responses
202
Request completed successfully.
application/json
400
Bad submission request.
application/json
401
Unauthorized
application/json
403
Forbidden
application/json
503
Unexpected error.
application/json
POST /api/v1/workloads/distributed
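The create call can be sketched with the Python standard library as below. This is a minimal sketch, not an official client: `BASE_URL`, `build_create_request`, and `send` are hypothetical names, and the request body is passed in by the caller because its schema is not reproduced on this page.

```python
import json
import urllib.request

# Hypothetical control-plane address; replace with your own environment's URL.
BASE_URL = "https://runai.example.com"


def build_create_request(token, body, base_url=BASE_URL):
    """Build (but do not send) the POST request that creates a distributed training.

    `body` is a dict supplied by the caller; its fields are defined by the
    API's body schema, which this sketch does not assume.
    """
    return urllib.request.Request(
        url=f"{base_url}/api/v1/workloads/distributed",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # Bearer authentication
            "Content-Type": "application/json",
        },
    )


def send(request):
    """Send a prepared request and return (status, parsed JSON or None).

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, so the
    documented 400, 401, 403 and 503 cases surface as exceptions here.
    """
    with urllib.request.urlopen(request) as response:
        raw = response.read()
        return response.status, (json.loads(raw) if raw else None)
```

A successful submission is reported with status 202, so a caller would typically check for that value in the tuple returned by `send` rather than for 200.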
Get distributed training's data. [Experimental]
get
Retrieve the details of a distributed training by workload id.
Authorizations
Authorization · string · Required
Bearer authentication
Path parameters
workloadId · string · uuid · Required
The Universally Unique Identifier (UUID) of the workload.
Responses
200
Executed successfully.
application/json
401
Unauthorized
application/json
403
Forbidden
application/json
404
The specified resource was not found
application/json
500
Unexpected error.
application/json
503
Unexpected error.
application/json
GET /api/v1/workloads/distributed/{workloadId}
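Retrieving a workload by id might look like the following sketch. `BASE_URL` and `build_get_request` are hypothetical names introduced for illustration; the only details taken from this page are the path, the Bearer header, and the requirement that `workloadId` is a UUID.

```python
import urllib.request
import uuid

# Hypothetical control-plane address; replace with your own environment's URL.
BASE_URL = "https://runai.example.com"


def build_get_request(token, workload_id, base_url=BASE_URL):
    """Build (but do not send) the GET request for one distributed training.

    uuid.UUID raises ValueError if `workload_id` cannot be parsed as a UUID,
    which catches malformed ids before any network call is made.
    """
    workload_uuid = uuid.UUID(workload_id)
    return urllib.request.Request(
        url=f"{base_url}/api/v1/workloads/distributed/{workload_uuid}",
        method="GET",
        headers={
            "Authorization": f"Bearer {token}",  # Bearer authentication
            "Accept": "application/json",
        },
    )
```

The prepared request can be passed to `urllib.request.urlopen`; a 200 response carries the workload details as JSON. Because the endpoint is marked experimental, the shape of that JSON is not assumed here.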
Delete a distributed training by id.
delete
Use this endpoint to delete a distributed training by workload id.
Authorizations
Authorization · string · Required
Bearer authentication
Path parameters
workloadId · string · uuid · Required
The Universally Unique Identifier (UUID) of the workload.
Responses
202
Accepted.
application/json
401
Unauthorized
application/json
403
Forbidden
application/json
404
The specified resource was not found
application/json
500
Unexpected error.
application/json
503
Unexpected error.
application/json
DELETE /api/v1/workloads/distributed/{workloadId}
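The delete call and its documented responses can be sketched as follows. `BASE_URL`, `build_delete_request`, and `delete_workload` are hypothetical names, and the status table simply restates the responses listed for this endpoint; it is not an exhaustive list of what a server may return.

```python
import urllib.error
import urllib.request
import uuid

# Hypothetical control-plane address; replace with your own environment's URL.
BASE_URL = "https://runai.example.com"

# Responses documented for this endpoint, keyed by HTTP status code.
DOCUMENTED_STATUSES = {
    202: "Accepted",
    401: "Unauthorized",
    403: "Forbidden",
    404: "The specified resource was not found",
    500: "Unexpected error",
    503: "Unexpected error",
}


def build_delete_request(token, workload_id, base_url=BASE_URL):
    """Build (but do not send) the DELETE request for one distributed training."""
    workload_uuid = uuid.UUID(workload_id)  # ValueError if not a valid UUID
    return urllib.request.Request(
        url=f"{base_url}/api/v1/workloads/distributed/{workload_uuid}",
        method="DELETE",
        headers={
            "Authorization": f"Bearer {token}",  # Bearer authentication
            "Accept": "application/json",
        },
    )


def delete_workload(request):
    """Send the request and return (status, documented description)."""
    try:
        with urllib.request.urlopen(request) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; recover the status code from the error.
        status = err.code
    return status, DOCUMENTED_STATUSES.get(status, "undocumented status")
```

Note that success is reported as 202 rather than 200, which suggests the deletion is accepted and then carried out asynchronously; a caller that needs confirmation could follow up with the GET endpoint and treat a 404 as completion.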
Suspend a distributed training.
post
Suspend a running distributed training by workload id.
Authorizations
Authorization · string · Required
Bearer authentication
Path parameters
workloadId · string · uuid · Required
The Universally Unique Identifier (UUID) of the workload.