Defining a Resource Interface
This section describes how to extend NVIDIA Run:ai to support ML frameworks, tools, or Kubernetes resources by defining and registering them through the Workload Types API and a corresponding Resource Interface (RI) definition. See the Quick start templates for ready-to-use examples of Resource Interface configurations.
An RI is a declarative workload contract that describes how NVIDIA Run:ai should interpret a custom resource. It defines how the platform identifies the workload’s structure, locates its pods, tracks its state, and applies scheduling, monitoring, and optimization logic.
Without an RI, NVIDIA Run:ai cannot reliably schedule, monitor, or optimize a custom resource, because it lacks the information required to traverse the resource’s schema and interpret its semantics.
Workload Types API
Use the POST /api/v1/workload-types endpoint to register a new workload type. Each request must include the workload’s metadata and one or more Resource Interface (RI) definitions, one per supported CRD version (for example, v1, v1beta1). See Workload Types API for more details.
categoryId: The unique identifier of the workload category. See the List workload categories API. Example: 046b6c7f-...
priorityId: The unique identifier of the workload priority. See the Get workload priorities API. Example: 046b6c7f-...
name: The unique name of the workload type. This value must match the Kubernetes Kind that represents the workload type. Example: Deployment
group: The Kubernetes API group associated with the workload resource. Example: apps
resourceInterfaces: One or more RI definitions, one per supported CRD version (for example, v1, v1beta1, or v1alpha1). Listing each supported version enables the platform to correctly parse, interpret, and manage manifests for this workload type according to the structure and schema of that version.
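Putting these parameters together, a registration request body might look like the following sketch (the truncated IDs are placeholders from the examples above; confirm the exact request schema against the Workload Types API reference):

```yaml
# Hypothetical body for POST /api/v1/workload-types
categoryId: 046b6c7f-...   # from the List workload categories API
priorityId: 046b6c7f-...   # from the Get workload priorities API
name: Deployment           # must match the Kubernetes Kind
group: apps
resourceInterfaces:
  - version: v1            # one RI definition per supported CRD version
    rootComponent:
      # ... RI definition, as described in the sections below ...
```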
What is a Resource Interface?
In Kubernetes, a workload is not a standalone execution unit such as a single pod. It comprises various components, including an ingress entry point, a collection of pods, and storage.
Defining a Resource Interface (RI) enables NVIDIA Run:ai to interpret, schedule, monitor, and optimize custom Kubernetes workloads by describing how the CRD’s structure maps to workload behavior. An RI tells the platform:
- What resources represent the workload (CRD group, version, kind)
- How to find pods and their spec fields
- How to map status conditions to unified states
- How to interpret component hierarchy and scheduling rules
Note
The RI defines the contract between a CRD and the NVIDIA Run:ai platform. It does not create the CRD; it explains how NVIDIA Run:ai should treat it.
Resource Interface Structure
A Resource Interface is a structured YAML or JSON object that describes how a workload is represented in Kubernetes. The following sections describe each required component of the RI definition.
Minimum Requirements
Contract Summary: Required vs Optional
rootComponent (Required): Defines the primary resource (CRD) that represents the workload.
statusDefinition (Required, under rootComponent): Maps the CRD’s status fields to NVIDIA Run:ai canonical workload states (for example, running, failed).
specDefinition (Required where pods exist): Specifies where NVIDIA Run:ai can locate the pod specifications associated with the workload.
childComponents (Optional; only if the workload has owned resources): Describes subordinate Kubernetes resources that are created and managed by the root CRD.
additionalChildKinds (Optional): Declares additional resource kinds (GVKs) that the workload creates or controls.
optimizationInstructions (Optional): Instructs scheduler behaviors (for example, gang scheduling).
scaleDefinition (Optional): Defines how the workload or its components can be scaled.
Each required field has explicit minimal semantics and a clear reason for being required. Optional fields unlock additional platform capabilities.
A minimal RI must define a rootComponent with a name, full GVK (group, version, kind) and a statusDefinition section that describes its runtime state.
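As a sketch, a minimal RI for the Kubeflow PyTorchJob CRD could look like this (the GVK is real; the nesting of fields is assumed from the terms used in this section):

```yaml
rootComponent:
  name: pytorchjob
  group: kubeflow.org
  version: v1
  kind: PyTorchJob
  statusDefinition:
    # maps CRD conditions or phases to canonical workload states;
    # see Status Definitions below
```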
Root Component
Every Resource Interface begins with a rootComponent definition:
- Must specify the full Kubernetes GVK (group, version, kind)
- Must include a statusDefinition to describe the workload’s runtime state
Child Components
childComponents represent resources owned by the root component.
- Must include ownerRef (points to the parent component)
- Usually include a specDefinition
- All paths must be absolute, starting from the CRD root
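A hedged sketch of a child component entry (all field names, the ownerRef value format, and the nesting are assumed from the terms in this section):

```yaml
childComponents:
  - name: worker
    group: batch
    version: v1
    kind: Job
    ownerRef: root            # points to the parent component (format assumed)
    specDefinition:
      podTemplateSpecPath: .spec.template   # absolute path from the CRD root
```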
Paths
All paths within an RI are written in jq syntax. The jq query language provides both path navigation and various query capabilities, and is widely used in the Kubernetes ecosystem. When defining paths:
- Use the correct jq type for each property (path or query)
- Provide default values where necessary
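For example, jq distinguishes plain path navigation from query expressions, and its // operator supplies null-safe defaults (the property names here are illustrative only):

```yaml
# path: simple navigation into the CRD
podTemplatePath: .spec.template

# query: an expression with a null-safe default, so a missing
# field yields 0 instead of null
replicasQuery: '.spec.replicas // 0'
```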
Spec Definitions
specDefinition determines how NVIDIA Run:ai locates pod specifications within the CRD. Several mutually exclusive definition types are supported: podTemplateSpecPath, podSpecPath, metadataPath, and fragmentedPodSpecDefinition.
podTemplateSpecPath: The CRD embeds a full pod template. Example: .spec.pytorchReplicaSpecs.Master.template
podSpecPath and metadataPath: The CRD directly embeds a podSpec and/or an objectMeta. Examples: .spec.jobTemplate.spec and .spec.jobTemplate.metadata
fragmentedPodSpecDefinition: Pod fields are scattered across the CRD.
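The first two options can be sketched as YAML fragments using the example paths above (only one definition type may be used at a time; the surrounding nesting is assumed):

```yaml
# Option 1: the CRD embeds a full pod template
specDefinition:
  podTemplateSpecPath: .spec.pytorchReplicaSpecs.Master.template
---
# Option 2: the CRD embeds the pod spec and its metadata separately
specDefinition:
  podSpecPath: .spec.jobTemplate.spec
  metadataPath: .spec.jobTemplate.metadata
```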
Component Instances
A component’s spec definition might point to multiple specs (in map or array format). In those cases, it is crucial to be able to distinguish between each instance of that component.
If a component produces multiple instances (arrays or maps), define instanceIdPath to identify each instance uniquely.
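For instance (the CRD field names here are hypothetical), a component whose specs live in an array might identify each instance by a name field:

```yaml
specDefinition:
  podTemplateSpecPath: .spec.workers[].template   # yields one spec per array entry
instanceIdPath: .spec.workers[].name              # uniquely identifies each instance
```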
Pod Selectors
Use podSelector to associate pods with components or instances. This is required when multiple components define pods or when a component manages multiple instances. All selectors of each kind (component, instance) must be mutually exclusive within their scope. Paths in pod selectors refer to paths in the pod JSON/YAML.
- componentTypeSelector: A key-and-value selector that associates a pod with the current component. If the value is not provided, only key existence is checked.
- componentInstanceSelector: A path on the pod that holds its matching instance id.
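A hedged sketch combining both selector types (the label keys and exact nesting are illustrative assumptions; paths refer to the pod JSON/YAML):

```yaml
podSelector:
  componentTypeSelector:
    key: training.kubeflow.org/replica-type     # key checked on the pod
    value: worker                               # omit to check key existence only
  componentInstanceSelector: .metadata.labels["replica-index"]  # pod path holding the instance id
```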
Status Definitions
The statusDefinition maps CRD conditions or phases to the generic Resource Interface (RI) statuses. A status definition is required for the rootComponent. For each generic status, you can define how it is evaluated using one or more of the following mechanisms:
Conditions and Phases
- Statuses can be derived from CRD conditions, phases, or both.
- If both conditions and phases are defined, both are evaluated when determining the status.
- When using conditions or phases, you must first define a conditionsDefinition or phaseDefinition.
- Multiple, independent definitions can be provided for the same generic status.
- When defining a status using byConditions, all specified conditions must be met (AND logic).
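As a sketch (the nesting and the status keys are assumed from the terms above), a status definition that derives running and failed from CRD conditions might look like:

```yaml
statusDefinition:
  conditionsDefinition:
    path: .status.conditions     # where the CRD publishes its conditions
  running:
    byConditions:                # all listed conditions must match (AND logic)
      - type: Running
        status: "True"
  failed:
    byConditions:
      - type: Failed
        status: "True"
```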
Matched Expressions
Matched expressions consist of:
- An expression
- An expected result
If the evaluated expression output matches the expected result, the condition is considered met and the corresponding status is applied.
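As an illustrative sketch (field names assumed), a matched expression pairs a jq expression with its expected output:

```yaml
byMatchedExpressions:
  - expression: '.status.readyReplicas == .status.replicas'  # evaluated against the CRD
    expectedResult: "true"                                   # status applies on a match
```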
Additional Child Kinds
List any additional GVKs created or managed by the CRD but not defined explicitly under childComponents. This is essential for permission management so that your CRD can be managed correctly.
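For example, a CRD that also creates ConfigMaps and Services might declare (structure assumed from the terms in this section):

```yaml
additionalChildKinds:
  - group: ""        # core API group
    version: v1
    kind: ConfigMap
  - group: ""
    version: v1
    kind: Service
```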
Optimization Instructions
Optimization instructions define how NVIDIA Run:ai schedules or groups pods. All paths in the configuration refer to pod YAML or JSON fields.
Supported Instruction Types
The supported instruction type is gangScheduling, which instructs the Scheduler on how to group related pods:
- Each pod-group definition contains a list of the included members
- Each defined member can provide a list of distinct keys to group pods by, and a list of filters to determine which pods should be included
Grouping examples
The following definitions are equivalent when the master and worker share the same job-name label:
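A hedged sketch of two such equivalent definitions (the group and member field names are assumed from the description above; paths refer to the pod JSON/YAML):

```yaml
# Variant 1: a single member grouped by the shared label
gangScheduling:
  podGroups:
    - members:
        - keys:
            - '.metadata.labels["job-name"]'
---
# Variant 2: master and worker as separate members using the same grouping key
gangScheduling:
  podGroups:
    - members:
        - filters:
            - '.metadata.labels["role"] == "master"'
          keys:
            - '.metadata.labels["job-name"]'
        - filters:
            - '.metadata.labels["role"] == "worker"'
          keys:
            - '.metadata.labels["job-name"]'
```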
Using default values
When multiple groups are possible, use a default value to cover cases where only a single group is used (the pattern used is -{name}-{index}).
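For instance (field names and labels assumed), a null-safe default keeps pods that lack an index label in a single group:

```yaml
members:
  - keys:
      - '.metadata.labels["group-name"]'
      - '.metadata.labels["group-index"] // "0"'   # default index when only one group is used
```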
Filters
When pods do not belong to distinct named components (for example, an array or map of specs within the same component), you can use filters to form separate pod groups. In this example, for a CRD that defines multiple jobs all under a single job component, a jq query filter identifies pods that use NVIDIA GPUs, and queries with hard-coded values serve as grouping keys.
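An illustrative sketch of that setup (field names assumed; the GPU check is a null-safe jq query over the pod's containers):

```yaml
gangScheduling:
  podGroups:
    - members:
        - filters:
            # include only pods that request NVIDIA GPUs
            - 'any(.spec.containers[]; .resources.limits["nvidia.com/gpu"] != null)'
          keys:
            - '"gpu-job"'    # hard-coded grouping key expressed as a jq query
```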
Best Practices
Before submitting a new Resource Interface, confirm:
- All kinds use the full GVK (group, version, kind)
- The rootComponent includes a valid statusDefinition
- specDefinitions use absolute paths only
- No duplicate child kinds; don’t list explicitly defined components in additionalChildKinds
- podSelectors are mutually exclusive in multi-component workloads
- jq expressions are null-safe (for example, // 0 defaults)
- optimizationInstructions target actual components, not just the root CRD
- Status conditions match real framework APIs
- All childComponents have an ownerRef directed to existing components, and there are no ownership cycles
Example: KServe Inference Service
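As a hedged sketch, an RI for KServe’s InferenceService might begin like this (the GVK and the Ready condition are real KServe API details; the remaining nesting is assumed from the terms in this section):

```yaml
rootComponent:
  name: inferenceservice
  group: serving.kserve.io
  version: v1beta1
  kind: InferenceService
  statusDefinition:
    conditionsDefinition:
      path: .status.conditions
    running:
      byConditions:
        - type: Ready
          status: "True"
childComponents:
  - name: predictor
    ownerRef: inferenceservice
    specDefinition:
      podSpecPath: .spec.predictor   # the predictor embeds pod-spec-like fields
```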