Interacting with Workloads Using the Cluster API
Use the Cluster API for live, cluster-scoped operations on running workloads—checking cluster/workload status, streaming and downloading logs, executing commands, attaching to containers, and forwarding ports. For creating or managing workloads across a tenant, use the Control plane API.
The API is exposed per cluster at https://<cluster-fqdn>/cluster-api/. The interactive endpoints (exec, attach, port-forward, and both logs modes) use WebSocket; status, logs/download, and the cluster status endpoint use standard HTTPS. All endpoints require a bearer token (JWT)—refer to How to authenticate to the API.
Before You Start
The Cluster API runs on each NVIDIA Run:ai cluster, not on the control plane. Before making a call, you need:
The cluster FQDN - Each cluster exposes the API at https://<cluster-fqdn>/cluster-api/. Find the FQDN in the NVIDIA Run:ai UI under Clusters → your cluster → Connection details, or get it from your platform admin.
Network reachability to the cluster - The Cluster API endpoint must be reachable from the machine making the call. If your cluster is behind a VPN or private network, you must be on that network.
A bearer token (JWT) with the right permissions - All endpoints require a JWT. The token must be associated with a user or application that has permission to access the project and workload in question. Refer to How to authenticate to the API for how to obtain a token.
Valid project and workload identifiers - Most endpoints require the project name, workload type, workload framework, and workload name in the path. The identifiers must match an existing, running workload in that project. Refer to Common path parameters below.
Common Path Parameters
Most endpoints take the same four path parameters. They are documented once here and referenced by each endpoint below.
| Parameter | Required | Description | Example |
| --- | --- | --- | --- |
| projectName | Required | The name of the NVIDIA Run:ai project that owns the workload. | project1 |
| workloadType | Required | The workload type. One of: workspace, training, distributed, inference, external. | training |
| workloadFramework | Required | The workload framework. One of: runai, mpi, pytorch, tf, xgboost, any, external. Use external together with workloadType=external for workloads created from third-party CRDs (refer to the CRD-mapping table below). The any value does not route to any workload-managed third-party CRDs; use external for those. | pytorch |
| workloadName | Required | The name of the workload, as it appears in the NVIDIA Run:ai UI or CLI. For workloads created from third-party CRDs, refer to the CRD-mapping table below. | my-jp |
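The allowed values above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the function name check_path_params is hypothetical, and the rule that external must be used for both parameters at once is an assumption drawn from the workloadFramework description.

```python
# Allowed values, copied from the parameter descriptions above.
WORKLOAD_TYPES = {"workspace", "training", "distributed", "inference", "external"}
WORKLOAD_FRAMEWORKS = {"runai", "mpi", "pytorch", "tf", "xgboost", "any", "external"}


def check_path_params(project_name, workload_type, workload_framework, workload_name):
    """Return a list of problems; an empty list means the values look usable."""
    problems = []
    if not project_name:
        problems.append("projectName is required")
    if not workload_name:
        problems.append("workloadName is required")
    if workload_type not in WORKLOAD_TYPES:
        problems.append(f"unknown workloadType: {workload_type!r}")
    if workload_framework not in WORKLOAD_FRAMEWORKS:
        problems.append(f"unknown workloadFramework: {workload_framework!r}")
    # Assumption: 'external' is only meaningful when set for both parameters.
    if (workload_type == "external") != (workload_framework == "external"):
        problems.append("use workloadFramework=external together with workloadType=external")
    return problems


print(check_path_params("project1", "training", "pytorch", "my-jp"))
```

A local check like this only catches typos; the server remains the authority on whether the workload actually exists.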
CRD to Path Parameter Mapping
The cluster-api resolves the path parameters to an NVIDIA Run:ai V2 CRD inside the project namespace (runai-{project}). Use the table below to pick the right workloadType, workloadFramework, and workloadName for your workload.
| CRD | workloadType | workloadFramework | workloadName |
| --- | --- | --- | --- |
| InteractiveWorkload (NVIDIA Run:ai workspace) | workspace | runai | The workload's name as shown in UI/CLI. |
| TrainingWorkload (NVIDIA Run:ai training, single) | training | runai | The workload's name as shown in UI/CLI. |
| DistributedWorkload (PyTorch) | distributed | pytorch | The workload's name as shown in UI/CLI. |
| DistributedWorkload (TensorFlow) | distributed | tf | The workload's name as shown in UI/CLI. |
| DistributedWorkload (MPI) | distributed | mpi | The workload's name as shown in UI/CLI. |
| DistributedWorkload (XGBoost) | distributed | xgboost | The workload's name as shown in UI/CLI. |
| InferenceWorkload | inference | runai | The workload's name as shown in UI/CLI. |
| ExternalWorkload (third-party PyTorchJob) | external | external | ew-pytorchjob-{originalName} |
| ExternalWorkload (third-party TFJob) | external | external | ew-tfjob-{originalName} |
| ExternalWorkload (third-party MPIJob) | external | external | ew-mpijob-{originalName} |
| ExternalWorkload (third-party LeaderWorkerSet) | external | external | ew-leaderworkerset-{originalName} |
| ExternalWorkload (any other third-party CRD) | external | external | ew-{kind-lowercase}-{originalName} |
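The naming pattern in the last rows of the table can be expressed as a small helper. This is a sketch based only on the ew-{kind-lowercase}-{originalName} pattern and the runai-{project} namespace convention stated above; the function names are hypothetical, and the authoritative name is whatever the cluster actually reports.

```python
def project_namespace(project_name: str) -> str:
    """Namespace that holds the project's workloads (runai-{project})."""
    return f"runai-{project_name}"


def external_workload_name(kind: str, original_name: str) -> str:
    """Predicted workloadName for a third-party CRD: ew-{kind-lowercase}-{originalName}."""
    return f"ew-{kind.lower()}-{original_name}"


print(project_namespace("project1"))
print(external_workload_name("PyTorchJob", "bert-finetune"))
```

Treat the result as a prediction to confirm against the names listed in the cluster rather than as a guarantee.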
Tip
To find the right name:
For NVIDIA Run:ai-native workloads (workspace / training / distributed / inference): the name in the UI matches the workloadName directly.
For third-party CRDs: run kubectl get externalworkloads -n runai-{project} to list the auto-generated ew-… names, or look at the workload entry in the NVIDIA Run:ai UI.
Note
GET /cluster-api/status is cluster-scoped and does not take these parameters. All other endpoints are project- and workload-scoped.
Request and Response Conventions
Base URL
Replace <cluster-fqdn> with your cluster's FQDN in every example; the base URL has the form https://<cluster-fqdn>/cluster-api/.
Throughout this guide, examples use $CLUSTER for the FQDN and $TOKEN for the bearer token.
Authentication
Every request must include the bearer token in the Authorization header, in the form Authorization: Bearer <token>.
For WebSocket endpoints (exec, attach, port-forward, logs), the same header is sent during the WebSocket upgrade handshake.
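The base URL and Authorization header described above can be assembled as follows. This is a minimal sketch: the helper names are hypothetical, and reading CLUSTER and TOKEN from the environment simply mirrors the $CLUSTER and $TOKEN convention used in this guide.

```python
import os


def base_url(cluster_fqdn: str) -> str:
    """Base URL of the Cluster API on one cluster."""
    return f"https://{cluster_fqdn}/cluster-api/"


def auth_headers(token: str) -> dict:
    """Header sent on every request, including the WebSocket upgrade handshake."""
    return {"Authorization": f"Bearer {token}"}


def from_environment(env=os.environ):
    """Build (base_url, headers) from CLUSTER and TOKEN; raises KeyError if unset."""
    return base_url(env["CLUSTER"]), auth_headers(env["TOKEN"])


print(base_url("cluster.example.com"))
```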
Content Types
JSON endpoints return application/json.
The logs/download endpoint returns application/octet-stream.
WebSocket log frames carry log-line payloads as binary messages.
Connection Behavior
The following endpoints use WebSocket connections and have different semantics from standard REST endpoints:
logs (both one-shot and follow=true modes)
exec
attach
port-forward
WebSocket examples in this guide use websocat. Install with cargo install websocat or brew install websocat. Equivalent calls work from any WebSocket-capable client (Python websockets, Node ws, Go gorilla/websocket).
Upgrade Flow
Clients issue a GET request with the standard WebSocket upgrade headers (Connection: Upgrade, Upgrade: websocket, Sec-WebSocket-Key, Sec-WebSocket-Version: 13). The bearer token is passed in Authorization on the handshake request.
On success, the server responds with 101 Switching Protocols. On authentication failure, the server responds with 401 before upgrading.
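Most WebSocket libraries generate the upgrade headers for you, but it can help to see what goes on the wire when debugging a failed handshake. The sketch below builds the headers listed above and computes the Sec-WebSocket-Accept value a server is expected to return under RFC 6455. The function names are hypothetical, and nothing here is specific to the Cluster API beyond the Authorization header.

```python
import base64
import hashlib
import os

# Fixed GUID defined by RFC 6455 for computing Sec-WebSocket-Accept.
_WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def new_websocket_key() -> str:
    """Random 16-byte nonce, base64-encoded, for Sec-WebSocket-Key."""
    return base64.b64encode(os.urandom(16)).decode("ascii")


def expected_accept(key: str) -> str:
    """Value the server should echo in Sec-WebSocket-Accept for this key."""
    digest = hashlib.sha1((key + _WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def upgrade_headers(token: str, key: str) -> dict:
    """Headers for the GET request that asks for the WebSocket upgrade."""
    return {
        "Connection": "Upgrade",
        "Upgrade": "websocket",
        "Sec-WebSocket-Key": key,
        "Sec-WebSocket-Version": "13",
        "Authorization": f"Bearer {token}",
    }


key = new_websocket_key()
print(upgrade_headers("example-token", key))
```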
Keep-Alive and Timeouts
The server sends periodic ping frames every 30 seconds. Clients must respond with pong frames to keep the connection alive.
Idle connections (no traffic for 5 minutes) are closed by the server.
Long-running sessions (exec, attach) have no maximum duration as long as pings are acknowledged.
Termination
The connection closes when:
The client closes the WebSocket.
The target pod or container terminates.
The bearer token expires (server closes with code 1008).
The idle timeout is reached.
Clients should handle abrupt disconnection and reconnect if needed. No replay is provided—on reconnect, only new log lines or output are delivered.
Errors
All endpoints return standard HTTP status codes. Error responses have the following JSON shape:
Common Error Codes
| Status | Name | Description |
| --- | --- | --- |
| 400 | Bad Request | Invalid query parameter (for example, sinceSeconds=-1), malformed path parameter, or missing required query parameter. |
| 401 | Unauthorized | Bearer token missing, malformed, or expired. |
| 403 | Forbidden | Token is valid but the associated user lacks permission to access the project or workload. |
| 404 | Not Found | Project, pod, or container does not exist. Also returned when a workload exists but is not yet scheduled (no pods). |
| 409 | Conflict | Operation cannot be performed in the workload's current state (for example, exec on a pod that is terminating). |
| 500 | Internal Server Error | Unexpected cluster-side failure. Also currently returned (with a plain-text body) when the requested workload cannot be located in the cluster. Verify that projectName, workloadType, workloadFramework, and workloadName match an existing workload (refer to CRD to Path Parameter Mapping). Retry with backoff; contact support if persistent. |
| 503 | Service Unavailable | Cluster is temporarily unreachable, the cluster-api service is restarting, or the requested operation cannot complete (for example, attach to a container whose primary process already holds stdio). |
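Because a 500 response may carry a plain-text body while other errors carry JSON, a client should not assume the body always parses. The sketch below tries JSON first and falls back to the raw text. The function name is hypothetical, and it deliberately makes no assumption about which fields the JSON error object contains.

```python
import json


def describe_error(status: int, body: bytes) -> str:
    """Render an error response as one line, whether the body is JSON or plain text."""
    text = body.decode("utf-8", errors="replace").strip()
    try:
        parsed = json.loads(text)
    except ValueError:
        # Not JSON (for example, the plain-text body on some 500 responses).
        return f"HTTP {status}: {text}"
    # JSON body: re-serialize with sorted keys so log lines are stable.
    return f"HTTP {status}: {json.dumps(parsed, sort_keys=True)}"


print(describe_error(500, b"workload not found"))
```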
For WebSocket endpoints, authentication and permission errors are returned on the HTTP handshake (before upgrade). Errors during an active stream are reported through WebSocket close codes:
| Close code | Meaning |
| --- | --- |
| 1000 | Normal closure (target terminated cleanly) |
| 1001 | Going away (server shutdown or pod terminated) |
| 1002 | Protocol error (malformed frames) |
| 1006 | Abnormal closure (network drop, no close frame received) |
| 1008 | Policy violation (token expired mid-stream) |
| 1011 | Server error (unexpected cluster-side failure) |
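A client can use the close code to decide what to do after a stream ends. The sketch below is one possible policy, not documented server behavior: the action names and the choice of which codes are worth a reconnect are assumptions made for illustration.

```python
# Meanings copied from the close codes listed above.
CLOSE_CODES = {
    1000: "normal closure",
    1001: "going away",
    1002: "protocol error",
    1006: "abnormal closure",
    1008: "policy violation",
    1011: "server error",
}


def next_action(code: int) -> str:
    """Suggest what a client might do after the stream closes (assumed policy)."""
    if code == 1000:
        return "stop"  # target finished cleanly, nothing more to read
    if code == 1008:
        return "refresh-token-then-reconnect"  # token expired mid-stream
    if code in (1001, 1006, 1011):
        return "reconnect"  # likely transient; back off before retrying
    return "stop"  # 1002 or unknown: fix the client instead of retrying


for code in sorted(CLOSE_CODES):
    print(code, CLOSE_CODES[code], "->", next_action(code))
```

Since no replay is provided, a reconnecting log reader only receives lines produced after the new connection is established.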
Endpoints
Cluster Status
GET
HTTPS
Bearer (JWT)
Yes
None
application/json
Returns the cluster status as a JSON object, including the version and the state of features within the cluster.
Example
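A minimal sketch of the call using only the Python standard library. The path /cluster-api/status comes from this guide; the function names and the timeout value are illustrative, and no assumption is made about the fields of the returned JSON object.

```python
import json
import urllib.request


def build_status_request(cluster_fqdn: str, token: str) -> urllib.request.Request:
    """Prepare GET /cluster-api/status with the bearer token attached."""
    return urllib.request.Request(
        f"https://{cluster_fqdn}/cluster-api/status",
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
        method="GET",
    )


def fetch_cluster_status(cluster_fqdn: str, token: str, timeout: float = 10.0):
    """Send the request and decode the JSON body; HTTP errors raise urllib.error.HTTPError."""
    request = build_status_request(cluster_fqdn, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not touch the network.
req = build_status_request("cluster.example.com", "example-token")
print(req.get_method(), req.full_url)
```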
Response:
Workload Status
GET
HTTPS
Bearer (JWT)
Yes
None
application/json
Returns the workload status as a JSON object with a status field carrying the workload phase.
Path Parameters: projectName, workloadType, workloadFramework, workloadName—refer to Common path parameters for values and the CRD-mapping table.
Example
Response:
Logs
GET
WebSocket
Bearer (JWT)
Yes
None
WebSocket binary frames
Read the logs of a container in a pod. If the pod has multiple containers and no container name is provided, the container is auto-selected.
The logs endpoint is WebSocket-only, in both one-shot and follow modes. Plain HTTPS GET returns HTTP 426 Upgrade Required. Each log line is delivered as a binary WebSocket message. In one-shot mode (follow=false or omitted), the server closes the connection with code 1000 after delivering the requested range; in follow mode, the connection stays open and new lines are pushed as they arrive.
Path Parameters: projectName, workloadType, workloadFramework, workloadName—refer to Common path parameters.
Query Parameters
| Parameter | Required | Description | Example |
| --- | --- | --- | --- |
| pod | Optional | The name of the pod. | my-pod |
| container | Optional | The container to use. Defaults to the only container if there is one. | my-c |
| follow | Optional | Keep the connection open and stream new log lines. Defaults to false. | false |
| limitBytes | Optional | Number of bytes to read from the server before terminating log output. | 200 |
| previous | Optional | Return previous terminated container logs. Defaults to false. | false |
| sinceSeconds | Optional | Return logs newer than a relative duration (for example, 5s, 2m, 3h). Defaults to 0s (all logs). | 5s |
| timestamps | Optional | Include timestamps on each line in the log output. Defaults to false. | false |
| sinceTime | Optional | Return logs after a specific date (RFC3339). | 2024-01-01T00:00:00Z |
| tailLines | Optional | Number of recent log lines to display. Defaults to -1 (all lines). | 2 |
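The query parameters above can be assembled into a query string with the standard library. This is a sketch that covers only the query string, since the exact endpoint path is not reproduced here. The function name is hypothetical, and rendering booleans as lowercase true/false follows the example values shown in this guide.

```python
from urllib.parse import urlencode

# Parameter names copied from the list of logs query parameters above.
LOG_PARAMS = {
    "pod", "container", "follow", "limitBytes", "previous",
    "sinceSeconds", "timestamps", "sinceTime", "tailLines",
}


def logs_query(**params) -> str:
    """Encode logs query parameters, skipping any that are None."""
    pairs = []
    for name, value in params.items():
        if name not in LOG_PARAMS:
            raise ValueError(f"unknown logs parameter: {name}")
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        pairs.append((name, str(value)))
    return urlencode(pairs)


print(logs_query(follow=True, tailLines=2, sinceSeconds="5s"))
print(logs_query(sinceTime="2024-01-01T00:00:00Z"))
```

Note that urlencode percent-encodes the colons in an RFC3339 timestamp, which is safe to send as-is.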
Example (one-shot)
Example (follow / stream)
Logs Download
GET
HTTPS
Bearer (JWT)
Yes
None
application/octet-stream
Download the logs for a container in a pod as a file. If the pod has multiple containers and no container name is provided, the container is auto-selected.
Path Parameters: projectName, workloadType, workloadFramework, workloadName—refer to Common path parameters.
Query Parameters
| Parameter | Required | Description | Example |
| --- | --- | --- | --- |
| pod | Optional | The name of the pod. | my-pod |
| container | Optional | The container to use. Defaults to the only container if there is one. | my-c |
| limitBytes | Optional | Number of bytes to read from the server before terminating log output. | 200 |
| previous | Optional | Return previous terminated container logs. Defaults to false. | false |
| sinceSeconds | Optional | Return logs newer than a relative duration (for example, 5s, 2m, 3h). Defaults to 0s (all logs). | 5s |
| timestamps | Optional | Include timestamps on each line in the log output. Defaults to false. | false |
| sinceTime | Optional | Return logs after a specific date (RFC3339). | 2024-01-01T00:00:00Z |
| tailLines | Optional | Number of recent log lines to display. Defaults to -1 (all lines). | 2 |
Example
Attach
GET
WebSocket
Bearer (JWT)
No
Opens a stdio session against the running container
WebSocket binary frames
Attach to a container in a pod.
Tip
Attaching to a container whose primary process is already consuming stdio (for example, an actively running training job) can fail with 503. Use exec to start a new shell in the same container instead.
Path Parameters: projectName, workloadType, workloadFramework, workloadName—refer to Common path parameters.
Query Parameters
| Parameter | Required | Description | Example |
| --- | --- | --- | --- |
| pod | Optional | The name of the pod. | my-pod |
| container | Optional | The container to use. Defaults to the only container if there is one. | my-c |
| stderr | Optional | Redirect stderr for the attach call. Defaults to true. | true |
| stdin | Optional | Redirect standard input stream of the pod. Defaults to false. | true |
| stdout | Optional | Redirect stdout for the attach call. Defaults to true. | true |
| tty | Optional | Allocate a TTY for the attach call. Defaults to false. | false |
Example
Port Forward
GET
WebSocket
Bearer (JWT)
No
Opens a TCP forwarding session to the pod
WebSocket binary frames
Port forward to the specified pod in the given workload.
To forward more than one port, repeat the port query key (for example ?port=8090&port=8080). Comma-separated lists are not accepted.
Path Parameters: projectName, workloadType, workloadFramework, workloadName—refer to Common path parameters.
Query Parameters
| Parameter | Required | Description | Example |
| --- | --- | --- | --- |
| pod | Optional | The name of the pod. | my-pod |
| container | Optional | The container to use. Defaults to the only container if there is one. | my-c |
| port | Required | Port to forward to the pod. Repeat the parameter to forward multiple ports. | 8090 |
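Repeating the port key is easy to get wrong when a client library serializes lists in its own way. The sketch below builds the query string explicitly, one port pair per port. The function name and the 1 to 65535 range check are illustrative additions, and the endpoint path itself is not shown.

```python
from urllib.parse import urlencode


def port_forward_query(ports, pod=None, container=None) -> str:
    """Encode the port-forward query, repeating 'port' once per requested port."""
    ports = list(ports)
    if not ports:
        raise ValueError("at least one port is required")
    pairs = []
    if pod:
        pairs.append(("pod", pod))
    if container:
        pairs.append(("container", container))
    for port in ports:
        number = int(port)
        if not 1 <= number <= 65535:
            raise ValueError(f"port out of range: {port}")
        pairs.append(("port", str(number)))  # repeated key, never a comma list
    return urlencode(pairs)


print(port_forward_query([8090, 8080]))
```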
Example (single port)
Example (multiple ports)
Exec
GET
WebSocket
Bearer (JWT)
No
Spawns a new process inside the container
WebSocket binary frames
Execute a command in a container.
The command can be passed in two equivalent forms:
as the command query parameter (URL-encoded base64 of a JSON argv array), or
as the Command HTTP header on the WebSocket upgrade (raw base64 of a JSON argv array, no URL-encoding needed).
For example, ["/bin/bash"] base64-encodes to WyIvYmluL2Jhc2giXQ==.
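Both encodings can be produced with the standard library. The sketch below serializes the argv array as compact JSON, base64-encodes it for the Command header, and percent-encodes that value for the command query parameter. The function names are hypothetical.

```python
import base64
import json
from urllib.parse import quote


def encode_command(argv) -> str:
    """Raw base64 of the JSON argv array, suitable for the Command header."""
    payload = json.dumps(list(argv), separators=(",", ":")).encode("utf-8")
    return base64.b64encode(payload).decode("ascii")


def command_query_value(argv) -> str:
    """Same value, percent-encoded for use as the command query parameter."""
    # safe="" so that '/', '+', and '=' from base64 are all escaped.
    return quote(encode_command(argv), safe="")


print(encode_command(["/bin/bash"]))
print(command_query_value(["/bin/bash"]))
```

For ["/bin/bash"] this reproduces the values given in this guide: WyIvYmluL2Jhc2giXQ== for the header and WyIvYmluL2Jhc2giXQ%3D%3D for the query parameter.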
Path Parameters: projectName, workloadType, workloadFramework, workloadName—refer to Common path parameters.
Query Parameters
| Parameter | Required | Description | Example |
| --- | --- | --- | --- |
| pod | Optional | The name of the pod. | my-pod |
| container | Optional | The container to use. Defaults to the only container if there is one. | my-c |
| stderr | Optional | Redirect stderr for the exec call. Defaults to true. | true |
| stdin | Optional | Redirect standard input stream of the pod. Defaults to false. | true |
| stdout | Optional | Redirect stdout for the exec call. Defaults to true. | true |
| tty | Optional | Allocate a TTY for the exec call. Defaults to false. | false |
| command | Required | Command to execute, as base64 of a JSON argv array. Pass it URL-encoded in this query parameter, or as raw base64 in the Command header. | WyIvYmluL2Jhc2giXQ%3D%3D |
Example (Command header)
Example (query parameter)