Interacting with Workloads Using the Cluster API

Use the Cluster API for live, cluster-scoped operations on running workloads—checking cluster/workload status, streaming and downloading logs, executing commands, attaching to containers, and forwarding ports. For creating or managing workloads across a tenant, use the Control plane API.

The API is exposed per cluster at https://<cluster-fqdn>/cluster-api/. The interactive endpoints (exec, attach, port-forward, and both logs modes) use WebSocket; status, logs/download, and the cluster status endpoint use standard HTTPS. All endpoints require a bearer token (JWT)—refer to How to authenticate to the API.

Before You Start

The Cluster API runs on each NVIDIA Run:ai cluster, not on the control plane. Before making a call, you need:

  • The cluster FQDN - Each cluster exposes the API at https://<cluster-fqdn>/cluster-api/. Find the FQDN in the NVIDIA Run:ai UI under Clusters → your cluster → Connection details, or from your platform admin.

  • Network reachability to the cluster - The Cluster API endpoint must be reachable from the machine making the call. If your cluster is behind a VPN or private network, you must be on that network.

  • A bearer token (JWT) with the right permissions - All endpoints require a JWT. The token must be associated with a user or application that has permission to access the project and workload in question. Refer to How to authenticate to the API for how to obtain a token.

  • Valid project and workload identifiers - Most endpoints require the project name, workload type, workload framework, and workload name in the path. The identifiers must match an existing, running workload in that project. Refer to Common path parameters below.

Common Path Parameters

Most endpoints take the same four path parameters. They are documented once here and referenced by each endpoint below.

| Parameter | Required | Description | Example |
| --- | --- | --- | --- |
| projectName | Required | The NVIDIA Run:ai project name that owns the workload. | project1 |
| workloadType | Required | The workload type. One of: workspace, training, distributed, inference, external. | training |
| workloadFramework | Required | The workload framework. One of: runai, mpi, pytorch, tf, xgboost, any, external. Use external together with workloadType=external for workloads created from third-party CRDs (refer to the CRD-mapping table below). The any value does not route to anyworkload-managed third-party CRDs; use external for those. | pytorch |
| workloadName | Required | The name of the workload, as it appears in the NVIDIA Run:ai UI or CLI. For workloads created from third-party CRDs, refer to the CRD-mapping table below. | my-jp |
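A client can catch typos in these parameters before making a call. The following is a minimal sketch, not part of the API: the function name is illustrative, and the rule that external must appear in both workloadType and workloadFramework or in neither is one reading of the description above.

```python
ALLOWED_WORKLOAD_TYPES = {"workspace", "training", "distributed", "inference", "external"}
ALLOWED_FRAMEWORKS = {"runai", "mpi", "pytorch", "tf", "xgboost", "any", "external"}


def validate_path_params(project_name, workload_type, workload_framework, workload_name):
    """Return a list of problems; an empty list means the values look usable."""
    problems = []
    if not project_name:
        problems.append("projectName is required")
    if workload_type not in ALLOWED_WORKLOAD_TYPES:
        problems.append(f"unknown workloadType: {workload_type!r}")
    if workload_framework not in ALLOWED_FRAMEWORKS:
        problems.append(f"unknown workloadFramework: {workload_framework!r}")
    # Assumption: "external" is only meaningful when used for both parameters.
    if (workload_type == "external") != (workload_framework == "external"):
        problems.append("use workloadFramework=external together with workloadType=external")
    if not workload_name:
        problems.append("workloadName is required")
    return problems
```

This only checks the values against the lists documented here; it cannot tell whether the workload actually exists in the cluster.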

CRD to Path Parameter Mapping

The Cluster API resolves the path parameters to an NVIDIA Run:ai V2 CRD inside the project namespace (runai-{project}). Use the table below to pick the right workloadType, workloadFramework, and workloadName for your workload.

| CRD | Workload type | Workload framework | Workload name |
| --- | --- | --- | --- |
| InteractiveWorkload (NVIDIA Run:ai workspace) | workspace | runai | The workload's name as shown in UI/CLI. |
| TrainingWorkload (NVIDIA Run:ai training, single) | training | runai | The workload's name as shown in UI/CLI. |
| DistributedWorkload (PyTorch) | distributed | pytorch | The workload's name as shown in UI/CLI. |
| DistributedWorkload (TensorFlow) | distributed | tf | The workload's name as shown in UI/CLI. |
| DistributedWorkload (MPI) | distributed | mpi | The workload's name as shown in UI/CLI. |
| DistributedWorkload (XGBoost) | distributed | xgboost | The workload's name as shown in UI/CLI. |
| InferenceWorkload | inference | runai | The workload's name as shown in UI/CLI. |
| ExternalWorkload (third-party PyTorchJob) | external | external | ew-pytorchjob-{originalName} |
| ExternalWorkload (third-party TFJob) | external | external | ew-tfjob-{originalName} |
| ExternalWorkload (third-party MPIJob) | external | external | ew-mpijob-{originalName} |
| ExternalWorkload (third-party LeaderWorkerSet) | external | external | ew-leaderworkerset-{originalName} |
| ExternalWorkload (any other third-party CRD) | external | external | ew-{kind-lowercase}-{originalName} |
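The naming patterns in the table can be expressed as two small helpers. This is a sketch of the documented patterns only; the function names are illustrative, and the authoritative name is still the one listed by the cluster.

```python
def project_namespace(project_name: str) -> str:
    """Namespace that holds a project's workloads: runai-{project}."""
    return f"runai-{project_name}"


def external_workload_name(kind: str, original_name: str) -> str:
    """Auto-generated name pattern for third-party CRDs: ew-{kind-lowercase}-{originalName}."""
    return f"ew-{kind.lower()}-{original_name}"
```

For example, a third-party PyTorchJob named train-1 would be addressed as ew-pytorchjob-train-1.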

Tip

To find the right workloadName:

  • For NVIDIA Run:ai-native workloads (workspace / training / distributed / inference): the name in the UI matches the workloadName directly.

  • For third-party CRDs: run kubectl get externalworkloads -n runai-{project} to list the auto-generated ew-… names, or look at the workload entry in the NVIDIA Run:ai UI.

Note

GET /cluster-api/status is cluster-scoped and does not take these parameters. All other endpoints are project- and workload-scoped.

Request and Response Conventions

Base URL

The base URL is https://<cluster-fqdn>/cluster-api/. Replace <cluster-fqdn> with your cluster's FQDN in every example.

Throughout this guide, examples use $CLUSTER for the FQDN and $TOKEN for the bearer token.

Authentication

Every request must include the bearer token in the Authorization header, using the Bearer scheme (Authorization: Bearer <token>).

For WebSocket endpoints (exec, attach, port-forward, logs), the same header is sent during the WebSocket upgrade handshake.
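Because an expired token is rejected, it can help to check how long a token has left before opening a long-running session. The sketch below builds the Authorization header and reads the standard exp claim from the JWT payload. It assumes the token carries an exp claim, does not verify the signature, and uses illustrative function names.

```python
import base64
import json
import time


def auth_headers(token: str) -> dict:
    """Header sent on every request, including the WebSocket handshake."""
    return {"Authorization": f"Bearer {token}"}


def jwt_expiry(token: str):
    """Return the exp claim (Unix seconds) from a JWT payload, or None if absent.

    The signature is NOT verified; this only reads the claim.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a three-part JWT")
    payload_b64 = parts[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("exp")


def seconds_until_expiry(token: str, now=None):
    """Seconds left before the token expires, or None if it has no exp claim."""
    exp = jwt_expiry(token)
    if exp is None:
        return None
    return exp - (time.time() if now is None else now)
```

A client might refresh the token when seconds_until_expiry drops below the expected session length.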

Content Types

  • JSON endpoints return application/json.

  • The logs/download endpoint returns application/octet-stream.

  • WebSocket log frames carry log-line payloads as binary messages.

Connection Behavior

The following endpoints use WebSocket connections and have different semantics from standard REST endpoints:

  • logs (both one-shot and follow=true modes)

  • exec

  • attach

  • port-forward

WebSocket examples in this guide use websocat. Install it with cargo install websocat or brew install websocat. Equivalent calls work from any WebSocket-capable client (Python websockets, Node ws, Go gorilla/websocket).

Upgrade Flow

Clients issue a GET request with the standard WebSocket upgrade headers (Connection: Upgrade, Upgrade: websocket, Sec-WebSocket-Key, Sec-WebSocket-Version: 13). The bearer token is passed in Authorization on the handshake request.

On success, the server responds with 101 Switching Protocols. On authentication failure, the server responds with 401 before upgrading.
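Most WebSocket libraries build these headers for you. For clients that assemble the handshake by hand, the sketch below shows the header set and the standard RFC 6455 check of the Sec-WebSocket-Accept value returned with the 101 response. The function names are illustrative; the key and accept computation follow the RFC rather than anything specific to this API.

```python
import base64
import hashlib
import os

_WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # fixed value defined by RFC 6455


def new_websocket_key() -> str:
    """Random 16-byte nonce, base64-encoded, for Sec-WebSocket-Key."""
    return base64.b64encode(os.urandom(16)).decode("ascii")


def expected_accept(key: str) -> str:
    """Sec-WebSocket-Accept value the server should return for a given key."""
    digest = hashlib.sha1((key + _WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def upgrade_headers(token: str, key: str) -> dict:
    """Headers for the GET request that starts the WebSocket upgrade."""
    return {
        "Connection": "Upgrade",
        "Upgrade": "websocket",
        "Sec-WebSocket-Key": key,
        "Sec-WebSocket-Version": "13",
        "Authorization": f"Bearer {token}",
    }
```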

Keep-Alive and Timeouts

  • The server sends a ping frame every 30 seconds. Clients must respond with pong frames to keep the connection alive.

  • Idle connections (no traffic for 5 minutes) are closed by the server.

  • Long-running sessions (exec, attach) have no maximum duration as long as pings are acknowledged.

Termination

The connection closes when:

  • The client closes the WebSocket.

  • The target pod or container terminates.

  • The bearer token expires (server closes with code 1008).

  • Idle timeout is reached.

Clients should handle abrupt disconnection and reconnect if needed. No replay is provided—on reconnect, only new log lines or output are delivered.

Errors

All endpoints return standard HTTP status codes. Error responses have the following JSON shape:

Common Error Codes

| Code | Meaning | Typical cause |
| --- | --- | --- |
| 400 | Bad Request | Invalid query parameter (for example, sinceSeconds=-1), malformed path parameter, or missing required query parameter. |
| 401 | Unauthorized | Bearer token missing, malformed, or expired. |
| 403 | Forbidden | Token is valid but the associated user lacks permission to access the project or workload. |
| 404 | Not Found | Project, pod, or container does not exist. Also returned when a workload exists but is not yet scheduled (no pods). |
| 409 | Conflict | Operation cannot be performed in the workload's current state (for example, exec on a pod that is terminating). |
| 500 | Internal Server Error | Unexpected cluster-side failure. Also currently returned (with a plain-text body) when the requested workload cannot be located in the cluster. Verify that projectName, workloadType, workloadFramework, and workloadName match an existing workload (refer to CRD to Path Parameter Mapping). Retry with backoff; contact support if persistent. |
| 503 | Service Unavailable | Cluster is temporarily unreachable, the cluster-api service is restarting, or the requested operation cannot complete (for example, attach to a container whose primary process already holds stdio). |
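The retry guidance in the table can be sketched as a status check plus a capped exponential backoff schedule. Treating exactly 500 and 503 as retryable, and the specific delays and jitter, are assumptions made for illustration, not documented server behavior. A 500 caused by wrong path parameters will not succeed on retry, so keep the attempt count small.

```python
import random

RETRYABLE_STATUSES = {500, 503}  # assumption: the other codes need a changed request


def is_retryable(status: int) -> bool:
    """True when repeating the same request later might succeed."""
    return status in RETRYABLE_STATUSES


def backoff_delays(attempts=5, base=1.0, cap=30.0, rng=random.random):
    """Yield capped exponential delays in seconds, scaled by a random factor in [0, 1)."""
    for attempt in range(attempts):
        yield min(cap, base * 2 ** attempt) * rng()
```

A caller would sleep for each yielded delay between attempts and stop as soon as a request returns a non-retryable status.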

For WebSocket endpoints, authentication and permission errors are returned on the HTTP handshake (before upgrade). Errors during an active stream are reported through WebSocket close codes:

| Close code | Meaning |
| --- | --- |
| 1000 | Normal closure (target terminated cleanly) |
| 1001 | Going away (server shutdown or pod terminated) |
| 1002 | Protocol error (malformed frames) |
| 1006 | Abnormal closure (network drop, no close frame received) |
| 1008 | Policy violation (token expired mid-stream) |
| 1011 | Server error (unexpected cluster-side failure) |
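One way a client might act on these close codes is a simple lookup. The actions chosen here are an illustrative policy, not behavior prescribed by the API: a reconnect after 1001 can still fail if the pod itself has terminated, and unknown codes are treated conservatively.

```python
# Assumed client-side policy for each documented close code.
CLOSE_ACTIONS = {
    1000: "stop",           # normal closure, nothing more to read
    1001: "reconnect",      # server going away; the pod may also be gone
    1002: "stop",           # protocol error, likely a client bug
    1006: "reconnect",      # network drop
    1008: "refresh_token",  # token expired mid-stream; get a new one, then reconnect
    1011: "reconnect",      # unexpected cluster-side failure
}


def action_for_close(code: int) -> str:
    """Return the suggested action for a WebSocket close code."""
    return CLOSE_ACTIONS.get(code, "stop")
```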

Endpoints

Cluster Status

| Method | Connection | Auth | Idempotent | Side effects | Response |
| --- | --- | --- | --- | --- | --- |
| GET | HTTPS | Bearer (JWT) | Yes | None | application/json |

Returns the cluster status as a JSON object, including the version and the state of features within the cluster.

Example
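A minimal Python sketch of this call, using only the standard library. The /cluster-api/status path is the one named in the note under Common Path Parameters; the function names and arguments are illustrative, and the response is returned as parsed JSON without assuming any particular fields.

```python
import json
import urllib.request


def cluster_status_request(cluster: str, token: str) -> urllib.request.Request:
    """Build the GET request for the cluster status endpoint."""
    return urllib.request.Request(
        f"https://{cluster}/cluster-api/status",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def fetch_cluster_status(cluster: str, token: str, timeout: float = 10.0) -> dict:
    """Send the request and return the parsed JSON body."""
    request = cluster_status_request(cluster, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Call fetch_cluster_status with the values you would otherwise place in $CLUSTER and $TOKEN.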

Response:

Workload Status

| Method | Connection | Auth | Idempotent | Side effects | Response |
| --- | --- | --- | --- | --- | --- |
| GET | HTTPS | Bearer (JWT) | Yes | None | application/json |

Returns the workload status as a JSON object with a status field carrying the workload phase.

Path Parameters: projectName, workloadType, workloadFramework, workloadName—refer to Common path parameters for values and the CRD-mapping table.
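Once the response body is in hand, the phase can be read from the status field. This sketch assumes only what is stated above, that the body is a JSON object with a status field; the function name is illustrative.

```python
import json


def workload_phase(body) -> str:
    """Extract the workload phase from a workload status response body."""
    data = json.loads(body)
    if not isinstance(data, dict) or "status" not in data:
        raise ValueError("response has no status field")
    return data["status"]
```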

Example

Response:

Logs

| Method | Connection | Auth | Idempotent | Side effects | Response |
| --- | --- | --- | --- | --- | --- |
| GET | WebSocket | Bearer (JWT) | Yes | None | WebSocket binary frames |

Read the logs of a container in a pod. If the pod has multiple containers and no container name is provided, the container is selected automatically.

The logs endpoint is WebSocket-only, in both one-shot and follow modes. Plain HTTPS GET returns HTTP 426 Upgrade Required. Each log line is delivered as a binary WebSocket message. In one-shot mode (follow=false or omitted), the server closes the connection with code 1000 after delivering the requested range; in follow mode, the connection stays open and new lines are pushed as they arrive.
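Since each log line arrives as a binary message, a client has to decode the frames itself. The sketch below takes the frames as an iterable of bytes from any WebSocket library. It assumes UTF-8 log output and strips a trailing newline if one is present, neither of which is stated by the API.

```python
def decode_log_frames(frames):
    """Yield one text line per binary WebSocket frame."""
    for frame in frames:
        # Assumption: logs are UTF-8; undecodable bytes are replaced, not dropped.
        yield frame.decode("utf-8", errors="replace").rstrip("\r\n")


def one_shot_complete(close_code: int) -> bool:
    """In one-shot mode, close code 1000 means the requested range was delivered."""
    return close_code == 1000
```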

Path Parameters: projectName, workloadType, workloadFramework, workloadName—refer to Common path parameters.

Query Parameters

| Field | Required | Description | Example |
| --- | --- | --- | --- |
| pod | Optional | The name of the pod. | my-pod |
| container | Optional | The container to use. Defaults to the only container if there is one. | my-c |
| follow | Optional | Keep the connection open and stream new log lines. Defaults to false. | false |
| limitBytes | Optional | Number of bytes to read from the server before terminating log output. | 200 |
| previous | Optional | Return previous terminated container logs. Defaults to false. | false |
| sinceSeconds | Optional | Return logs newer than a relative duration (for example, 5s, 2m, 3h). Defaults to 0s (all logs). | 5s |
| timestamps | Optional | Include timestamps on each line in the log output. Defaults to false. | false |
| sinceTime | Optional | Return logs after a specific date (RFC3339). | 2024-01-01T00:00:00Z |
| tailLines | Optional | Number of recent log lines to display. Defaults to -1 (all lines). | 2 |
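The query string for a logs request can be assembled from these fields with the standard library. This is a sketch: the function name and keyword arguments are illustrative, booleans are written as lowercase true/false to match the examples in the table, and unset options are omitted so the server defaults apply.

```python
from urllib.parse import urlencode


def logs_query(pod=None, container=None, follow=None, limit_bytes=None,
               previous=None, since_seconds=None, timestamps=None,
               since_time=None, tail_lines=None) -> str:
    """Build the URL-encoded query string for the logs endpoints."""
    fields = [
        ("pod", pod),
        ("container", container),
        ("follow", follow),
        ("limitBytes", limit_bytes),
        ("previous", previous),
        ("sinceSeconds", since_seconds),
        ("timestamps", timestamps),
        ("sinceTime", since_time),
        ("tailLines", tail_lines),
    ]
    pairs = []
    for name, value in fields:
        if value is None:
            continue  # leave the server default in place
        if isinstance(value, bool):
            value = "true" if value else "false"
        pairs.append((name, value))
    return urlencode(pairs)
```

For example, logs_query(follow=True, tail_lines=2) asks for the last two lines and then keeps streaming.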

Example (one-shot)

Example (follow / stream)

Logs Download

| Method | Connection | Auth | Idempotent | Side effects | Response |
| --- | --- | --- | --- | --- | --- |
| GET | HTTPS | Bearer (JWT) | Yes | None | application/octet-stream |

Download the logs for a container in a pod as a file. If the pod has multiple containers and no container name is provided, the container is selected automatically.

Path Parameters: projectName, workloadType, workloadFramework, workloadName—refer to Common path parameters.

Query Parameters

| Field | Required | Description | Example |
| --- | --- | --- | --- |
| pod | Optional | The name of the pod. | my-pod |
| container | Optional | The container to use. Defaults to the only container if there is one. | my-c |
| limitBytes | Optional | Number of bytes to read from the server before terminating log output. | 200 |
| previous | Optional | Return previous terminated container logs. Defaults to false. | false |
| sinceSeconds | Optional | Return logs newer than a relative duration (for example, 5s, 2m, 3h). Defaults to 0s (all logs). | 5s |
| timestamps | Optional | Include timestamps on each line in the log output. Defaults to false. | false |
| sinceTime | Optional | Return logs after a specific date (RFC3339). | 2024-01-01T00:00:00Z |
| tailLines | Optional | Number of recent log lines to display. Defaults to -1 (all lines). | 2 |

Example
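Because the response is an application/octet-stream body, it is best written to disk in chunks rather than loaded into memory. The sketch below covers only that step. It takes any binary file-like object, such as the response returned by urllib.request.urlopen, and the function name is illustrative.

```python
import shutil


def save_log_stream(response, dest_path, chunk_size=64 * 1024) -> int:
    """Copy a binary response stream to dest_path and return the bytes written."""
    with open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
        return out.tell()
```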

Attach

| Method | Connection | Auth | Idempotent | Side effects | Response |
| --- | --- | --- | --- | --- | --- |
| GET | WebSocket | Bearer (JWT) | No | Opens a stdio session against the running container | WebSocket binary frames |

Attach to a container in a pod.

Tip

Attaching to a container whose primary process is already consuming stdio (for example, an actively-running training job) can fail with 503. Use exec to start a new shell in the same container instead.
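The fallback described in this tip could be automated along the following lines. This is a sketch under stated assumptions: ServiceUnavailable is a hypothetical exception that your own client code would raise when the handshake returns 503, and attach and exec_shell are caller-supplied functions that open the respective sessions.

```python
class ServiceUnavailable(Exception):
    """Hypothetical: raised by the caller's client on an HTTP 503 handshake response."""


def attach_or_exec(attach, exec_shell):
    """Try attach first; on 503, start a new shell with exec instead.

    Returns a (mode, session) pair so the caller knows which path was taken.
    """
    try:
        return "attach", attach()
    except ServiceUnavailable:
        return "exec", exec_shell()
```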

Path Parameters: projectName, workloadType, workloadFramework, workloadName—refer to Common path parameters.

Query Parameters

| Field | Required | Description | Example |
| --- | --- | --- | --- |
| pod | Optional | The name of the pod. | my-pod |
| container | Optional | The container to use. Defaults to the only container if there is one. | my-c |
| stderr | Optional | Redirect stderr for the attach call. Defaults to true. | true |
| stdin | Optional | Redirect standard input stream of the pod. Defaults to false. | true |
| stdout | Optional | Redirect stdout for the attach call. Defaults to true. | true |
| tty | Optional | Allocate a TTY for the attach call. Defaults to false. | false |

Example

Port Forward

| Method | Connection | Auth | Idempotent | Side effects | Response |
| --- | --- | --- | --- | --- | --- |
| GET | WebSocket | Bearer (JWT) | No | Opens a TCP forwarding session to the pod | WebSocket binary frames |

Port forward to the specified pod in the given workload.

To forward more than one port, repeat the port query key (for example ?port=8090&port=8080). Comma-separated lists are not accepted.

Path Parameters: projectName, workloadType, workloadFramework, workloadName—refer to Common path parameters.

Query Parameters

| Field | Required | Description | Example |
| --- | --- | --- | --- |
| pod | Optional | The name of the pod. | my-pod |
| container | Optional | The container to use. Defaults to the only container if there is one. | my-c |
| port | Required | Port to forward to the pod. Repeat the parameter to forward multiple ports. | 8090 |
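The repeated-key form is easy to get wrong when building URLs by hand. A minimal sketch with the standard library follows; the function name is illustrative, and the 1 to 65535 range check is ordinary TCP port validation rather than a documented API rule.

```python
from urllib.parse import urlencode


def port_forward_query(ports, pod=None, container=None) -> str:
    """Build the query string, repeating the port key once per port."""
    ports = [int(port) for port in ports]
    if not ports:
        raise ValueError("at least one port is required")
    pairs = []
    if pod is not None:
        pairs.append(("pod", pod))
    if container is not None:
        pairs.append(("container", container))
    for port in ports:
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid port: {port}")
        pairs.append(("port", port))  # one key per port, never a comma-separated list
    return urlencode(pairs)
```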

Example (single port)

Example (multiple ports)

Exec

| Method | Connection | Auth | Idempotent | Side effects | Response |
| --- | --- | --- | --- | --- | --- |
| GET | WebSocket | Bearer (JWT) | No | Spawns a new process inside the container | WebSocket binary frames |

Execute a command in a container.

The command can be passed in two equivalent forms:

  • as the command query parameter (URL-encoded base64 of a JSON argv array), or

  • as the Command HTTP header on the WebSocket upgrade (raw base64 of a JSON argv array, no URL-encoding needed).

For example, ["/bin/bash"] base64-encodes to WyIvYmluL2Jhc2giXQ==.
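Both forms start from the same base64 string. The sketch below encodes an argv list and produces the value for either the Command header or the command query parameter. The function names are illustrative, and the compact JSON separators are a choice made here, not an API requirement.

```python
import base64
import json
from urllib.parse import quote


def encode_command(argv) -> str:
    """Base64 of the JSON argv array, as used in the Command header."""
    raw = json.dumps(list(argv), separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def command_header(argv) -> dict:
    """Header form: raw base64, no URL-encoding."""
    return {"Command": encode_command(argv)}


def command_query_value(argv) -> str:
    """Query form: the same base64, URL-encoded so =, + and / survive in a URL."""
    return quote(encode_command(argv), safe="")
```

With argv set to ["/bin/bash"], encode_command returns WyIvYmluL2Jhc2giXQ== and command_query_value returns WyIvYmluL2Jhc2giXQ%3D%3D.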

Path Parameters: projectName, workloadType, workloadFramework, workloadName—refer to Common path parameters.

Query Parameters

| Field | Required | Description | Example |
| --- | --- | --- | --- |
| pod | Optional | The name of the pod. | my-pod |
| container | Optional | The container to use. Defaults to the only container if there is one. | my-c |
| stderr | Optional | Redirect stderr for the exec call. Defaults to true. | true |
| stdin | Optional | Redirect standard input stream of the pod. Defaults to false. | true |
| stdout | Optional | Redirect stdout for the exec call. Defaults to true. | true |
| tty | Optional | Allocate a TTY for the exec call. Defaults to false. | false |
| command | Required | Command to execute, as base64 of a JSON argv array. Pass it URL-encoded in this query parameter, or as raw base64 in the Command header. | WyIvYmluL2Jhc2giXQ%3D%3D |

Example (Command header)

Example (query parameter)
