Distributed Inferences

Distributed inference enables running inference workloads across multiple pods, typically to scale model serving beyond a single container or node. This approach is useful when a single instance cannot meet the resource requirements. NVIDIA Run:ai supports this model using Leader Worker Set (LWS). Each pod plays a specific role, either as a leader or a worker, and together they form a coordinated service. NVIDIA Run:ai manages the orchestration and configuration of these pods to ensure efficient and scalable inference execution.

Create a distributed inference.

post

Create a distributed inference using container-related fields.

Authorizations
Authorization · string · Required

Bearer authentication

Body
Responses
post
/api/v1/workloads/distributed-inferences
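
The following is a minimal Python sketch of this call using the requests library. The method, path, and bearer authentication come from this reference; the base URL, token handling, and the body fields shown are illustrative placeholders rather than the documented schema.

import os
import requests

# Minimal sketch: create a distributed inference workload.
# BASE_URL, the token environment variable, and every body field below are
# assumptions for illustration; use the Body schema above for real requests.
BASE_URL = os.environ.get("RUNAI_BASE_URL", "https://my-cluster.example.com")
TOKEN = os.environ["RUNAI_TOKEN"]  # bearer token obtained out of band

body = {
    # Hypothetical fields, shown only to illustrate the request shape.
    "name": "llm-serving",
    "projectId": "example-project-id",
    "clusterId": "example-cluster-id",
}

resp = requests.post(
    f"{BASE_URL}/api/v1/workloads/distributed-inferences",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=body,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())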

Get distributed inference data.

get

Retrieve the details of a distributed inference using a workload ID.

Authorizations
Authorization · string · Required

Bearer authentication

Path parameters
workloadId · string · uuid · Required

The Universally Unique Identifier (UUID) of the workload.

Responses
200

Executed successfully.

application/json
get
/api/v1/workloads/distributed-inferences/{workloadId}
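
A minimal Python sketch of retrieving a workload by its UUID is shown below. The path and workloadId parameter follow the reference above; the base URL, token variable, and the placeholder UUID are assumptions.

import os
import requests

# Minimal sketch: fetch a distributed inference by its workload UUID.
BASE_URL = os.environ.get("RUNAI_BASE_URL", "https://my-cluster.example.com")
TOKEN = os.environ["RUNAI_TOKEN"]  # bearer token obtained out of band
workload_id = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

resp = requests.get(
    f"{BASE_URL}/api/v1/workloads/distributed-inferences/{workload_id}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()  # 200 with application/json on success
details = resp.json()
print(details)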

Delete a distributed inference.

delete

Delete a distributed inference using a workload ID.

Authorizations
Authorization · string · Required

Bearer authentication

Path parameters
workloadId · string · uuid · Required

The Universally Unique Identifier (UUID) of the workload.

Responses
delete
/api/v1/workloads/distributed-inferences/{workloadId}
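
Below is a minimal Python sketch of the delete call. The path and workloadId parameter are as documented above; the base URL, token variable, and placeholder UUID are assumptions.

import os
import requests

# Minimal sketch: delete a distributed inference by its workload UUID.
BASE_URL = os.environ.get("RUNAI_BASE_URL", "https://my-cluster.example.com")
TOKEN = os.environ["RUNAI_TOKEN"]  # bearer token obtained out of band
workload_id = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

resp = requests.delete(
    f"{BASE_URL}/api/v1/workloads/distributed-inferences/{workload_id}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.status_code)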

Update distributed inference spec.

patch

Update the specification of an existing distributed inference workload.

Authorizations
Authorization · string · Required

Bearer authentication

Path parameters
workloadId · string · uuid · Required

The Universally Unique Identifier (UUID) of the workload.

Body
Responses
patch
/api/v1/workloads/distributed-inferences/{workloadId}
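
A minimal Python sketch of updating the spec is shown below. The method, path, and workloadId parameter come from this reference; the base URL, token variable, placeholder UUID, and the patch body fields are hypothetical and stand in for the documented Body schema.

import os
import requests

# Minimal sketch: patch the spec of an existing distributed inference.
BASE_URL = os.environ.get("RUNAI_BASE_URL", "https://my-cluster.example.com")
TOKEN = os.environ["RUNAI_TOKEN"]  # bearer token obtained out of band
workload_id = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

patch_body = {
    # Hypothetical partial spec, shown only to illustrate the request shape.
    "spec": {"replicas": 2},
}

resp = requests.patch(
    f"{BASE_URL}/api/v1/workloads/distributed-inferences/{workload_id}",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=patch_body,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())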
