Inference workloads deploy trained models into a production environment to generate predictions from live data. These workloads are prioritized over Trainings and Workspaces during scheduling. NVIDIA Run:ai Inference workloads support auto-scaling to maintain service-level agreements (SLAs) by dynamically adjusting resources as demand changes.
Create an inference.
Create an inference using container-related fields.
Authorizations
Authorization · string · Required
Bearer authentication
Body
Responses
202 (application/json): Request completed successfully.
400 (application/json): Bad request.
401 (application/json): Unauthorized.
403 (application/json): Forbidden.
503 (application/json): Unexpected error.
POST /api/v1/workloads/inferences
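As a minimal sketch of calling this endpoint, the snippet below builds the POST request with Python's standard library without sending it. The base URL `https://runai.example.com` and the helper name `build_create_inference_request` are placeholders chosen for illustration. The body is passed through unchanged because its schema is not reproduced in this section.

```python
import json
import urllib.request

# Hypothetical control-plane address; replace with your own.
BASE_URL = "https://runai.example.com"


def build_create_inference_request(token: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request to the create-inference endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/workloads/inferences",
        data=json.dumps(body).encode("utf-8"),  # request body as JSON bytes
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # bearer authentication
            "Content-Type": "application/json",
        },
    )
```

Passing the returned object to `urllib.request.urlopen` sends it. `urlopen` raises `urllib.error.HTTPError` for the 4xx and 5xx responses listed above.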
Get inference data.
Retrieve the details of an inference workload by its workload ID.
Authorizations
Authorization · string · Required
Bearer authentication
Path parameters
workloadId · string · uuid · Required
The Universally Unique Identifier (UUID) of the workload.
Responses
200 (application/json): Executed successfully.
401 (application/json): Unauthorized.
403 (application/json): Forbidden.
404 (application/json): The specified resource was not found.
500 (application/json): Unexpected error.
503 (application/json): Unexpected error.
GET /api/v1/workloads/inferences/{workloadId}
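Because `workloadId` must be a UUID, a client can reject malformed IDs before making the call. The sketch below does that and maps the status codes listed for this endpoint to their descriptions. The helper names `inference_path` and `describe_status` are illustrative and not part of the API.

```python
import uuid

# Status codes and descriptions as listed for this endpoint.
GET_INFERENCE_RESPONSES = {
    200: "Executed successfully.",
    401: "Unauthorized.",
    403: "Forbidden.",
    404: "The specified resource was not found.",
    500: "Unexpected error.",
    503: "Unexpected error.",
}


def inference_path(workload_id: str) -> str:
    """Return the endpoint path, rejecting IDs that are not valid UUIDs."""
    parsed = uuid.UUID(workload_id)  # raises ValueError on malformed input
    return f"/api/v1/workloads/inferences/{parsed}"


def describe_status(status: int) -> str:
    """Map a status code to its documented description."""
    return GET_INFERENCE_RESPONSES.get(status, f"Undocumented status {status}")
```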
Delete an inference.
Delete an inference workload by its workload ID.
Authorizations
Authorization · string · Required
Bearer authentication
Path parameters
workloadId · string · uuid · Required
The Universally Unique Identifier (UUID) of the workload.
Responses
202 (application/json): Accepted.
401 (application/json): Unauthorized.
403 (application/json): Forbidden.
404 (application/json): The specified resource was not found.
500 (application/json): Unexpected error.
503 (application/json): Unexpected error.
DELETE /api/v1/workloads/inferences/{workloadId}
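The following sketch sends the DELETE request and returns the HTTP status code, so the caller can tell 202 apart from the error responses. The base URL and the function name `delete_inference` are assumptions for illustration. The `opener` parameter exists only so the function can be exercised without a live control plane.

```python
import urllib.error
import urllib.request

# Hypothetical control-plane address; replace with your own.
BASE_URL = "https://runai.example.com"


def delete_inference(workload_id: str, token: str, opener=urllib.request.urlopen) -> int:
    """Send the DELETE request and return the HTTP status code."""
    request = urllib.request.Request(
        url=f"{BASE_URL}/api/v1/workloads/inferences/{workload_id}",
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with opener(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; the status is on the exception.
        return error.code
```

In general HTTP semantics, 202 means the request was accepted for processing, so the workload may still exist for a short time after the call returns.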
Update inference spec.
Update the specification of an existing inference workload.
Authorizations
Authorization · string · Required
Bearer authentication
Path parameters
workloadId · string · uuid · Required
The Universally Unique Identifier (UUID) of the workload.
Body
Responses
202 (application/json): Executed successfully.
401 (application/json): Unauthorized.
403 (application/json): Forbidden.
404 (application/json): The specified resource was not found.
500 (application/json): Unexpected error.
503 (application/json): Unexpected error.
PATCH /api/v1/workloads/inferences/{workloadId}
Get inference metrics data.
Retrieve metrics data for an inference workload by its workload ID. Supported in control-plane version 2.18 and later.
Authorizations
Authorization · string · Required
Bearer authentication
Path parameters
workloadId · string · uuid · Required
The Universally Unique Identifier (UUID) of the workload.
Query parameters
start · string · date-time · Required
Start of the time range to fetch data for, as an ISO 8601 timestamp.
Example: 2023-06-06T12:09:18.211Z
end · string · date-time · Required
End of the time range to fetch data for, as an ISO 8601 timestamp.
Example: 2023-06-07T12:09:18.211Z
numberOfSamples · integer · max: 1000 · Optional
The number of samples to take in the specified time range.
Default: 20
Example: 20
Responses
200: Executed successfully.
207 (application/json): Partial success.
400 (application/json): Bad request.
401 (application/json): Unauthorized.
403 (application/json): Forbidden.
404 (application/json): The specified resource was not found.
500 (application/json): Unexpected error.
503 (application/json): Unexpected error.
GET /api/v1/workloads/inferences/{workloadId}/metrics
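The query parameters above can be assembled as in the sketch below, which formats timestamps like the documented examples and enforces the documented maximum of 1000 samples. The helper names, the lower bound of 1, and the check that `end` is later than `start` are assumptions, not documented API behavior.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

MAX_SAMPLES = 1000    # documented maximum for numberOfSamples
DEFAULT_SAMPLES = 20  # documented default


def iso8601_utc(moment: datetime) -> str:
    """Format a timezone-aware datetime like 2023-06-06T12:09:18.211Z."""
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


def metrics_path(workload_id: str, start: datetime, end: datetime,
                 number_of_samples: int = DEFAULT_SAMPLES) -> str:
    """Return the metrics endpoint path with its query string."""
    if end <= start:  # assumed sanity check
        raise ValueError("end must be later than start")
    if not 1 <= number_of_samples <= MAX_SAMPLES:  # lower bound is assumed
        raise ValueError(f"numberOfSamples must be between 1 and {MAX_SAMPLES}")
    query = urlencode({
        "start": iso8601_utc(start),
        "end": iso8601_utc(end),
        "numberOfSamples": number_of_samples,
    })
    return f"/api/v1/workloads/inferences/{workload_id}/metrics?{query}"
```

`urlencode` percent-encodes the colons in the timestamps, so the resulting path can be appended directly to the control-plane base URL.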
Get inference pod metrics data.
get
Retrieve metrics data for a pod of an inference workload by workload ID and pod ID. Supported in control-plane version 2.18 and later.
Authorizations
Authorization · string · Required
Bearer authentication
Path parameters
workloadId · string · uuid · Required
The Universally Unique Identifier (UUID) of the workload.
podId · string · uuid · Required
The ID of the requested pod.
Query parameters
start · string · date-time · Required
Start of the time range to fetch data for, as an ISO 8601 timestamp.
Example: 2023-06-06T12:09:18.211Z
end · string · date-time · Required
End of the time range to fetch data for, as an ISO 8601 timestamp.
Example: 2023-06-07T12:09:18.211Z
numberOfSamples · integer · max: 1000 · Optional
The number of samples to take in the specified time range.