Defining a Resource Interface

This section describes how to extend NVIDIA Run:ai to support ML frameworks, tools, or Kubernetes resources by defining and registering them through the Workload Types API and a corresponding Resource Interface (RI) definition. See the Quick start templates for ready-to-use examples of Resource Interface configurations.

An RI is a declarative workload contract that describes how NVIDIA Run:ai should interpret a custom resource. It defines how the platform identifies the workload’s structure, locates its pods, tracks its state, and applies scheduling, monitoring, and optimization logic.

Without an RI, NVIDIA Run:ai cannot reliably schedule, monitor, or optimize a custom resource, because it lacks the information required to traverse the resource’s schema and interpret its semantics.

Workload Types API

Use the POST /api/v1/workload-types endpoint to register a new workload type. Each request must include the workload’s metadata and one or more Resource Interface (RI) definitions, one per supported CRD version (for example, v1, v1beta1). See Workload Types API for more details.

The request body includes the following fields:

  • categoryId — The unique identifier of the workload category. See List workload categories API. Example: 046b6c7f-...

  • priorityId — The unique identifier of the workload priority. See Get workload priorities API. Example: 046b6c7f-...

  • name — The unique name of the workload type. This value must match the Kubernetes Kind that represents the workload type. Example: Deployment

  • group — The Kubernetes API group associated with the workload resource. Example: apps

  • resourceInterfaces — One or more RI definitions, one per supported CRD version (such as v1, v1beta1, or v1alpha1). Listing each supported version enables the platform to correctly parse, interpret, and manage manifests for this workload type according to that version’s structure and schema.
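A request body might be sketched as follows. The TrainingJob names are hypothetical, the structure of each resourceInterfaces entry is an assumption, and the RI body itself is elided; consult the Workload Types API reference for the authoritative schema:

```json
{
  "name": "TrainingJob",
  "group": "training.example.com",
  "categoryId": "046b6c7f-...",
  "priorityId": "046b6c7f-...",
  "resourceInterfaces": [
    {
      "version": "v1",
      "rootComponent": {}
    }
  ]
}
```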

What is a Resource Interface?

In Kubernetes, a workload is rarely a standalone execution unit (a single pod). It typically comprises several components, such as an ingress entry point, a collection of pods, and storage.

Defining a Resource Interface (RI) enables NVIDIA Run:ai to interpret, schedule, monitor, and optimize custom Kubernetes workloads by describing how the CRD’s structure maps to workload behavior. An RI tells the platform:

  • What resources represent the workload (CRD group, version, kind)

  • How to find pods and their spec fields

  • How to map status conditions to unified states

  • How to interpret component hierarchy and scheduling rules

Note

The RI defines the contract between a CRD and the NVIDIA Run:ai platform. It does not create the CRD; it describes how NVIDIA Run:ai should treat it.

Resource Interface Structure

A Resource Interface is a structured YAML or JSON object that describes how a workload is represented in Kubernetes. The following sections describe each required component of the RI definition.

Minimum Requirements

Contract Summary: Required vs Optional

  • rootComponent — Required. Defines the primary resource (CRD) that represents the workload.

  • statusDefinition — Required (under rootComponent). Maps the CRD’s status fields to NVIDIA Run:ai canonical workload states (for example, running, failed).

  • specDefinition — Required (wherever pods exist). Specifies where NVIDIA Run:ai can locate the pod specifications associated with the workload.

  • childComponents — Optional (only if the workload has owned resources). Describes subordinate Kubernetes resources that are created and managed by the root CRD.

  • additionalChildKinds — Optional. Declares additional resource kinds (GVKs) that the workload creates or controls.

  • optimizationInstructions — Optional. Instructs scheduler behaviors (for example, gang scheduling).

  • scaleDefinition — Optional. Defines how the workload or its components can be scaled.

Each required field has explicit minimal semantics and a clear reason for being required. Optional fields unlock additional platform capabilities.

A minimal RI must define a rootComponent with a name, a full GVK (group, version, kind), and a statusDefinition section that describes its runtime state.
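That minimum contract can be sketched as follows. The TrainingJob CRD and the exact field nesting (especially of the status mapping) are illustrative assumptions, not the authoritative schema:

```yaml
# Minimal Resource Interface sketch (illustrative; exact nesting may differ).
rootComponent:
  name: trainingjob             # component name
  group: training.example.com   # hypothetical CRD group
  version: v1
  kind: TrainingJob             # hypothetical Kind
  statusDefinition:
    conditionsDefinition:
      path: .status.conditions  # jq path to the CRD's conditions array
    running:                    # generic RI status, derived from conditions
      byConditions:
        - type: Running
          status: "True"
```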

Root Component

Every Resource Interface begins with a rootComponent definition:

  • Must specify the full Kubernetes GVK (group, version, kind)

  • Must include a statusDefinition to describe the workload’s runtime state

Child Components

childComponents represent resources owned by the root component.

  • Must include ownerRef (points to parent component)

  • Usually includes a specDefinition

  • All paths must be absolute, starting from the CRD root
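A childComponents entry that follows these rules might look like this sketch; the parent component name and all paths are illustrative assumptions:

```yaml
# Illustrative childComponents entry; field names follow this guide, but
# the exact nesting may differ from the authoritative schema.
childComponents:
  - name: worker
    group: apps
    version: v1
    kind: Deployment
    ownerRef: trainingjob                         # points to the parent component
    specDefinition:
      podTemplateSpecPath: .spec.workerTemplate   # absolute path from the CRD root
```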

Paths

All paths within an RI are written in jq syntax. The jq query language provides both path navigation and various query capabilities, and is widely used in the Kubernetes ecosystem. When defining paths:

  • Use the correct jq type for each property (path/query)

  • Provide default values where necessary
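For example, a query-typed property can carry a jq default so evaluation stays null-safe when a field is absent; the property names here are illustrative:

```yaml
# Illustrative jq usage in RI properties (property names assumed).
podTemplateSpecPath: .spec.template    # path type: plain navigation
replicasQuery: .spec.replicas // 1     # query type: defaults to 1 when unset
```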

Spec Definitions

specDefinition determines how NVIDIA Run:ai locates pod specifications within the CRD. Several mutually exclusive definition types are supported: podTemplateSpecPath, podSpecPath with metadataPath, and fragmentedPodSpecDefinition.

  • podTemplateSpecPath — Use when the CRD embeds a full pod template. Example: .spec.pytorchReplicaSpecs.Master.template

  • podSpecPath and metadataPath — Use when the CRD directly embeds a podSpec and/or an objectMeta. Examples: .spec.jobTemplate.spec and .spec.jobTemplate.metadata

  • fragmentedPodSpecDefinition — Use when pod fields are scattered across the CRD.

Component Instances

A component’s spec definition might point to multiple specs (in map or array format). In those cases, each instance of the component must be uniquely distinguishable: define instanceIdPath to identify each instance.
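For instance, a map of replica specs keyed by role might be handled as in this sketch; the field nesting and jq expressions are illustrative assumptions:

```yaml
# Illustrative specDefinition over a map of specs; instanceIdPath (assumed
# nesting) distinguishes each map entry.
specDefinition:
  podTemplateSpecPath: .spec.replicaSpecs[].template  # one spec per map entry
  instanceIdPath: .spec.replicaSpecs | keys[]         # e.g. "Master", "Worker"
```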

Pod Selectors

Use podSelector to associate pods with components or instances. This is required when multiple components define pods or when a component manages multiple instances. All selectors of each kind (component, instance) must be mutually exclusive within their scope. Paths in pod selectors refer to paths in the pod JSON/YAML.

  • componentTypeSelector - A key and value selector that associates a pod with the current component. If no value is provided, only the key’s existence is checked.

  • componentInstanceSelector - A path on the pod that holds its matching instance ID.
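Combining both selector kinds, a multi-instance component might be sketched as follows; the label keys are hypothetical, and the paths refer to the pod object:

```yaml
# Illustrative podSelector (exact schema assumed).
podSelector:
  componentTypeSelector:
    key: .metadata.labels["example.com/replica-type"]    # hypothetical label
    value: worker    # omit to check key existence only
  componentInstanceSelector:
    path: .metadata.labels["example.com/replica-index"]  # holds the instance id
```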

Status Definitions

The statusDefinition maps CRD conditions or phases to the generic Resource Interface (RI) statuses. A status definition is required for the rootComponent. For each generic status, you can define how it is evaluated using one or more of the following mechanisms:

Conditions and Phases

  • Statuses can be derived from CRD conditions, phases, or both.

  • If both conditions and phases are defined, both are evaluated when determining the status.

  • When using conditions or phases, you must first define a conditionsDefinition or phaseDefinition.

  • Multiple, independent definitions can be provided for the same generic status.

  • When defining a status using byConditions, all specified conditions must be met (AND logic).
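Putting these rules together, a statusDefinition might be sketched as follows; the exact field names for the condition and phase mappings are assumptions:

```yaml
# Illustrative statusDefinition (field names assumed).
statusDefinition:
  conditionsDefinition:
    path: .status.conditions   # where the CRD publishes conditions
  phaseDefinition:
    path: .status.phase        # optional single phase string
  running:
    byConditions:              # all listed conditions must hold (AND)
      - type: Running
        status: "True"
  failed:
    byPhase: Failed            # phases are evaluated alongside conditions
```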

Matched Expressions

Matched expressions consist of:

  • An expression

  • An expected result

If the evaluated expression output matches the expected result, the condition is considered met and the corresponding status is applied.

Examples:
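As a hedged illustration (the byMatchedExpressions field name and nesting are assumptions), a status could be derived from a jq expression compared against an expected result:

```yaml
# Illustrative matched expression: the status applies when the expression's
# output equals the expected result.
running:
  byMatchedExpressions:
    - expression: '.status.readyReplicas == .spec.replicas'
      expectedResult: "true"
```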

Additional Child Kinds

List any additional GVKs created or managed by the CRD but not defined explicitly under childComponents. This is essential for permission management so that your CRD can be managed correctly.
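A sketch of such a declaration, with illustrative kinds:

```yaml
# Illustrative additionalChildKinds: GVKs the workload creates that are not
# modeled explicitly as childComponents.
additionalChildKinds:
  - group: ""                # core API group
    version: v1
    kind: ConfigMap
  - group: networking.k8s.io
    version: v1
    kind: Ingress
```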

Optimization Instructions

Optimization instructions define how NVIDIA Run:ai schedules or groups pods. All paths in the configuration refer to pod YAML or JSON fields.

Supported Instruction Types

Currently, the only supported instruction type is gangScheduling, which instructs the Scheduler on how to group related pods:

  • Each pod-group definition contains a list of the included members

  • Each defined member can provide a list of distinct keys to group pods by and a list of filters to determine which pods should be included

Grouping examples

The following definitions are equivalent when the master and worker share the same job-name label:
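As a hedged sketch (the gangScheduling schema here is an assumption), grouping both members by the shared label produces the same pod groups as a single member keyed on that label:

```yaml
# Illustrative gangScheduling instruction: master and worker pods that share
# a job-name label land in the same pod group.
optimizationInstructions:
  gangScheduling:
    podGroups:
      - members:
          - name: master
            groupBy:
              - '.metadata.labels["job-name"]'
          - name: worker
            groupBy:
              - '.metadata.labels["job-name"]'
```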

Using default values

When multiple groups are possible, use a default value to cover the case where only a single group is used (the pattern used is -{name}-{index}).

Filters

When components are not named (an array or map of specs within the same component), you can use filters to form distinct pod groups. For example, for a CRD that defines multiple jobs all under a single job component, a jq query can serve as a filter to identify pods that use NVIDIA GPUs, while queries with hard-coded values serve as grouping keys.
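A sketch of such a filter-based group; the member schema and paths are assumptions:

```yaml
# Illustrative member definition: a jq filter admits only pods that request
# NVIDIA GPUs, and a hard-coded query value serves as the grouping key.
members:
  - filters:
      - '[.spec.containers[].resources.limits["nvidia.com/gpu"]] | any(. != null)'
    groupBy:
      - '"gpu-jobs"'   # hard-coded grouping value
```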

Best Practices

Before submitting a new Resource Interface, confirm:

  • All kinds use full GVK (group, version, kind)

  • The rootComponent includes a valid statusDefinition

  • Absolute paths only in specDefinitions

  • No duplicate child kinds; don’t list explicitly defined components in additionalChildKinds

  • Mutually exclusive podSelectors in multi-component workloads

  • Null-safe jq expressions (e.g. // 0 defaults)

  • Target actual components in optimizationInstructions, not just the root CRD

  • Status conditions match real framework APIs

  • Every childComponents entry has an ownerRef that points to an existing component, and there are no ownership cycles

Example: KServe Inference Service
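A hedged sketch of how an RI for KServe’s InferenceService (group serving.kserve.io, version v1beta1) might begin — the child component, paths, and status mapping are illustrative assumptions, not KServe’s verbatim contract:

```yaml
rootComponent:
  name: inferenceservice
  group: serving.kserve.io
  version: v1beta1
  kind: InferenceService
  statusDefinition:
    conditionsDefinition:
      path: .status.conditions
    running:
      byConditions:
        - type: Ready            # InferenceService reports a Ready condition
          status: "True"
childComponents:
  - name: predictor              # illustrative: the serving Deployment
    group: apps
    version: v1
    kind: Deployment
    ownerRef: inferenceservice
    specDefinition:
      podTemplateSpecPath: .spec.template   # the Deployment's pod template
additionalChildKinds:
  - group: ""                    # core API group
    version: v1
    kind: Service
```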
