Defining a Resource Interface
This section describes how to extend NVIDIA Run:ai to support ML frameworks, tools, or Kubernetes resources by defining and registering them through the Workload Types API and a corresponding Resource Interface (RI) definition. See the Quick start templates for ready-to-use examples of Resource Interface configurations.
Workload Types API
Use the POST /api/v1/workload-types endpoint to register a new workload type. Each request must include the workload’s metadata and one or more Resource Interface (RI) definitions, one per supported CRD version (for example, v1, v1beta1). See Workload Types API for more details.
| Parameter | Description | Example |
| --- | --- | --- |
| categoryId | The unique identifier of the workload category. See List workload categories API. | 046b6c7f-... |
| priorityId | The unique identifier of the workload priority. See Get workload priorities API. | 046b6c7f-... |
| name | The unique name of the workload type. This value must match the Kubernetes Kind that represents the workload type. | Deployment |
| group | The Kubernetes API group associated with the workload resource. | apps |
| resourceInterfaces | One or more RI definitions, one per supported version of the custom resource definition (CRD), such as v1, v1beta1, or v1alpha1. This enables the platform to correctly parse, interpret, and manage manifests for this workload type according to the structure and schema associated with each listed version. | |
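As a rough illustration of how these fields fit together, the following Python sketch assembles a request body for POST /api/v1/workload-types. It assumes the body is a flat JSON object keyed by the parameter names above, and that each resourceInterfaces entry is an object with a top-level spec key; both are assumptions, so check the Workload Types API reference for the exact schema. The helper name build_workload_type_request and the placeholder IDs are hypothetical.

```python
import json


def build_workload_type_request(category_id, priority_id, name, group, resource_interfaces):
    """Assemble a request body from the documented parameters (assumed flat layout)."""
    if not resource_interfaces:
        raise ValueError("at least one Resource Interface definition is required")
    return {
        "categoryId": category_id,
        "priorityId": priority_id,
        "name": name,    # must match the Kubernetes Kind
        "group": group,  # Kubernetes API group of the resource
        "resourceInterfaces": list(resource_interfaces),
    }


# Placeholder IDs: real values come from the categories and priorities APIs.
body = build_workload_type_request(
    category_id="<category-id>",
    priority_id="<priority-id>",
    name="Deployment",
    group="apps",
    resource_interfaces=[{"spec": {"structureDefinition": {}}}],  # assumed entry shape
)
print(json.dumps(body, indent=2))
```

The sketch only builds and prints the JSON. Sending it, including authentication, is left out because those details are not covered in this section.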
What is a Resource Interface?
In Kubernetes, a workload is not a standalone execution unit (pod). It comprises various components, including an ingress entry point, a collection of pods, and storage.
The purpose of the Resource Interface is to allow NVIDIA Run:ai to perform actions and extract information from a new workload type. By registering a workload type via an RI, NVIDIA Run:ai can perform resource allocation, scheduling, monitoring, and data extraction, ensuring efficient operation and seamless integration.
A Resource Interface is a structured description of a Kubernetes workload type. It enables NVIDIA Run:ai to perform actions such as:
Identifying the root component (the CRD itself)
Modeling child components (for example, replicas, workers, statefulsets)
Locating pod specifications inside the resource definition
Interpreting status (e.g., running, completed, failed)
Applying optimization instructions (e.g., gang scheduling)
Resource Interface Structure
A Resource Interface is a structured YAML or JSON object that describes how a workload is represented in Kubernetes. The following sections describe each required component of the RI definition.
Minimum Requirements
A minimal RI must define a rootComponent with a name, full GVK (group, version, kind) and a statusDefinition section that describes its runtime state.
rootComponent:
  name: "minimal"
  kind:
    group: "minimal.org"
    version: "v1"
    kind: "minimalRI"
  statusDefinition:
    statusMappings:
      running:
        byConditions:
          - type: "Running"
            status: "True"
Root Component
Every Resource Interface begins with a rootComponent definition:
Must specify the full Kubernetes GVK (group, version, kind)
Must include a statusDefinition to describe the workload’s runtime state
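These requirements are easy to check before registering an RI. The following Python sketch reports which of the minimum fields are missing from an RI that has already been loaded into a dictionary. The function name missing_minimum_fields is hypothetical, and the check is only a local sanity test, not the validation that NVIDIA Run:ai itself performs.

```python
def missing_minimum_fields(ri):
    """Return the minimum-requirement fields that are absent from an RI dict."""
    root = ri.get("rootComponent")
    if not isinstance(root, dict):
        return ["rootComponent"]

    missing = []
    if not root.get("name"):
        missing.append("rootComponent.name")

    kind = root.get("kind") or {}
    for part in ("group", "version", "kind"):
        # Check presence only, so an explicitly empty group is still accepted.
        if part not in kind:
            missing.append(f"rootComponent.kind.{part}")

    if not root.get("statusDefinition"):
        missing.append("rootComponent.statusDefinition")
    return missing


complete = {
    "rootComponent": {
        "name": "pytorchjob",
        "kind": {"group": "kubeflow.org", "version": "v1", "kind": "PyTorchJob"},
        "statusDefinition": {"statusMappings": {}},
    }
}
incomplete = {"rootComponent": {"name": "pytorchjob", "kind": {"kind": "PyTorchJob"}}}

print(missing_minimum_fields(complete))
print(missing_minimum_fields(incomplete))
```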
rootComponent:
  name: "pytorchjob"
  kind:
    group: "kubeflow.org"
    version: "v1"
    kind: "PyTorchJob"
  statusDefinition:
    statusMappings:
      running:
        byConditions:
          - type: "Running"
            status: "True"
Child Components
childComponents represent resources owned by the root component.
Must include ownerRef (points to the parent component)
Usually includes a specDefinition
All paths must be absolute, starting from the CRD root
childComponents:
  - name: "worker"
    ownerRef: "pytorchjob"
    kind:
      group: "apps"
      version: "v1"
      kind: "StatefulSet"
    specDefinition:
      podTemplateSpecPath: ".spec.pytorchReplicaSpecs.Worker.template"
Paths
All paths within an RI are written in jq syntax. The jq query language provides both path navigation and various query capabilities, and is widely used in the Kubernetes ecosystem. When defining paths:
Use the correct jq type for each property (path/query)
Provide default values where necessary
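To make the role of default values concrete, the following Python sketch mimics what a jq expression such as .spec.replicas // 1 returns. It is a simplified stand-in written for illustration: it handles plain key lookups only, not the full jq language, and the helper name resolve and the sample documents are hypothetical.

```python
def resolve(doc, keys, default=None):
    """Walk nested dicts by key and fall back to `default`, like jq's `//`."""
    current = doc
    for key in keys:
        if not isinstance(current, dict):
            # A missing parent makes the whole lookup fall back to the default.
            return default
        current = current.get(key)
    # jq's alternative operator replaces both null and false.
    return default if current is None or current is False else current


with_value = {"spec": {"replicas": 3}}
without_value = {"spec": {}}

print(resolve(with_value, ["spec", "replicas"], default=1))
print(resolve(without_value, ["spec", "replicas"], default=1))
```

Without a default, a lookup on a field that the manifest omits yields a null value, which is why null-safe expressions matter for optional fields.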
Spec Definitions
specDefinition determines how NVIDIA Run:ai locates pod specifications within the CRD. Several mutually exclusive definition types are supported, such as podTemplateSpecPath, podSpecPath, metadataPath and fragmentedPodSpecDefinition.
| Definition type | When to use | Example |
| --- | --- | --- |
| podTemplateSpecPath | CRD embeds a full pod template | .spec.pytorchReplicaSpecs.Master.template |
| podSpecPath and metadataPath | CRD directly embeds a podSpec and/or an objectMeta | .spec.jobTemplate.spec and .spec.jobTemplate.metadata |
| fragmentedPodSpecDefinition | Pod fields scattered across CRD | See the following example |
specDefinition:
  fragmentedPodSpecDefinition:
    labelsPath: ".spec.labels"
    annotationsPath: ".spec.annotations"
    resourcesPath: ".spec.resources"
    schedulerNamePath: ".spec.schedulerName"
Component Instances
A component’s spec definition might point to multiple specs (in array or map format). In those cases, define instanceIdPath so that each instance of the component can be identified uniquely.
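The difference between the array and map cases can be shown with a small Python sketch. It reproduces, for plain Python data, what the jq expressions .spec.jobs[].name (array of specs) and .spec.jobs | to_entries[] | .key (map of specs) select. The function name instance_ids and the sample specs are hypothetical.

```python
def instance_ids(jobs):
    """Return one ID per instance for a list of named specs or a map of specs."""
    if isinstance(jobs, dict):
        # Map of specs: the map keys are the instance IDs.
        return list(jobs.keys())
    # Array of specs: each entry carries its own name field.
    return [job["name"] for job in jobs]


array_spec = {"spec": {"jobs": [{"name": "driver"}, {"name": "executor"}]}}
map_spec = {"spec": {"jobs": {"driver": {}, "executor": {}}}}

print(instance_ids(array_spec["spec"]["jobs"]))
print(instance_ids(map_spec["spec"]["jobs"]))
```

Both layouts yield the same set of IDs, which is what lets the rest of the RI refer to instances uniformly.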
# Array of specs
instanceIdPath: ".spec.jobs[].name"
# Map of specs (use map keys as instance IDs)
instanceIdPath: ".spec.jobs | to_entries[] | .key"
Pod Selectors
Use podSelector to associate pods with components or instances. This is required when multiple components define pods or when a component manages multiple instances. All selectors of each kind (component, instance) must be mutually exclusive within their scope. Paths in pod selectors refer to paths in the pod JSON/YAML.
componentTypeSelector - A key and value selector that associates a pod with the current component. If the value is not provided, only key existence is checked:
podSelector:
  componentTypeSelector:
    keyPath: '.metadata.labels["training.kubeflow.org/replica-type"]'
    value: "master"
componentInstanceSelector - A path on the pod that holds its matching instance ID:
podSelector:
  componentInstanceSelector:
    idPath: '.metadata.labels["jobset.sigs.k8s.io/replicatedjob-name"]'
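The matching rules for componentTypeSelector, including the requirement that selectors be mutually exclusive, can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the platform's implementation: keyPath is reduced to a plain label key, and the function names and the sample pod are hypothetical.

```python
REPLICA_TYPE = "training.kubeflow.org/replica-type"


def pod_label(pod, key):
    """Read a label from a pod manifest, returning None when it is absent."""
    return pod.get("metadata", {}).get("labels", {}).get(key)


def matching_components(pod, selectors):
    """selectors maps a component name to (label key, expected value or None)."""
    matches = []
    for component, (key, value) in selectors.items():
        actual = pod_label(pod, key)
        if actual is None:
            continue  # the key does not exist on the pod
        if value is None or actual == value:
            # With no expected value, key existence alone is enough.
            matches.append(component)
    return matches


selectors = {
    "master": (REPLICA_TYPE, "master"),
    "worker": (REPLICA_TYPE, "worker"),
}
worker_pod = {"metadata": {"labels": {REPLICA_TYPE: "worker"}}}

print(matching_components(worker_pod, selectors))
```

Selectors are mutually exclusive when every pod matches at most one component. A result with two or more names would indicate overlapping selectors.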
Status Definitions
The statusDefinition maps the described CRD conditions or phases to the RI generic statuses. For each generic status, the user can provide a definition based on conditions or phases. If both are provided, both are validated when evaluating the status.
If using a definition by conditions or phase, you must first include conditionsDefinition or phaseDefinition
Multiple, separate definitions can be provided for each generic status
When providing a definition byConditions, all listed conditions must exist (AND logic)
A statusDefinition is required for the rootComponent
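The AND logic for byConditions can be illustrated with a short Python sketch. It assumes that the separate definitions given for one generic status are alternatives (any one may match), that statuses are checked in the order they are listed, and that a condition missing from the resource never matches. These are assumptions made for the example, not documented evaluation rules, and evaluate_status is a hypothetical name.

```python
def evaluate_status(conditions, status_mappings):
    """Return the first generic status with a definition whose conditions all match."""
    observed = {c["type"]: c["status"] for c in conditions}
    for generic_status, definitions in status_mappings.items():
        for definition in definitions:
            required = definition.get("byConditions", [])
            # AND logic: every listed condition must exist with the given status.
            if required and all(observed.get(r["type"]) == r["status"] for r in required):
                return generic_status
    return None


status_mappings = {
    "initializing": [{"byConditions": [
        {"type": "Created", "status": "True"},
        {"type": "Running", "status": "False"},
    ]}],
    "running": [{"byConditions": [
        {"type": "Running", "status": "True"},
        {"type": "Succeeded", "status": "False"},
        {"type": "Failed", "status": "False"},
    ]}],
    "completed": [{"byConditions": [{"type": "Succeeded", "status": "True"}]}],
    "failed": [{"byConditions": [{"type": "Failed", "status": "True"}]}],
}

live_conditions = [
    {"type": "Created", "status": "True"},
    {"type": "Running", "status": "True"},
    {"type": "Succeeded", "status": "False"},
    {"type": "Failed", "status": "False"},
]
print(evaluate_status(live_conditions, status_mappings))
```

Requiring Succeeded and Failed to be "False" in the running mapping keeps a finished job from still being reported as running when its Running condition has not been cleared.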
statusDefinition:
  conditionsDefinition:
    path: ".status.conditions"
    typeFieldName: "type"
    statusFieldName: "status"
  statusMappings:
    initializing:
      - byConditions:
          - type: "Created"
            status: "True"
          - type: "Running"
            status: "False"
    running:
      - byConditions:
          - type: "Running"
            status: "True"
          - type: "Succeeded"
            status: "False"
          - type: "Failed"
            status: "False"
    completed:
      - byConditions:
          - type: "Succeeded"
            status: "True"
    failed:
      - byConditions:
          - type: "Failed"
            status: "True"
Additional Child Kinds
List any additional GVKs created or managed by the CRD but not defined explicitly under childComponents. This is essential for permission management so that your CRD can be managed correctly.
additionalChildKinds:
  - group: apps
    version: v1
    kind: Deployment
  - group: leaderworkerset.x-k8s.io
    version: v1
    kind: LeaderWorkerSet
Optimization Instructions
Optimization instructions define how NVIDIA Run:ai schedules or groups pods. All paths in the configuration refer to pod YAML or JSON fields.
Supported Instruction Types
The supported instruction type is gangScheduling, which instructs the Scheduler on how to group related pods:
Each pod-group definition contains a list of the included members
Each defined member can provide a list of distinct keys to group pods by and a list of filters to determine which pods should be included
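The grouping behavior described above can be approximated in Python. The sketch below places pods in the same group when they belong to a member component and share the same values for that member's grouping keys. It is a simplified model for illustration only: pods are reduced to a name, a component, and labels, grouping keys are plain label names rather than jq paths, and group_pods is a hypothetical helper, not Scheduler code.

```python
from collections import defaultdict

JOB_NAME = "training.kubeflow.org/job-name"


def group_pods(pods, members):
    """members maps a component name to the label keys used for grouping."""
    groups = defaultdict(list)
    for pod in pods:
        keys = members.get(pod["component"])
        if keys is None:
            continue  # this pod's component is not a member of the pod group
        labels = pod.get("labels", {})
        # Pods with identical key values land in the same gang.
        group_key = tuple(labels.get(key) for key in keys)
        groups[group_key].append(pod["name"])
    return dict(groups)


members = {"master": [JOB_NAME], "worker": [JOB_NAME]}
pods = [
    {"name": "a-master-0", "component": "master", "labels": {JOB_NAME: "a"}},
    {"name": "a-worker-0", "component": "worker", "labels": {JOB_NAME: "a"}},
    {"name": "b-worker-0", "component": "worker", "labels": {JOB_NAME: "b"}},
]

print(group_pods(pods, members))
```

Because the master and worker members share the job-name key, the pods of job a form one group across both components, while job b gets its own group.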
optimizationInstructions:
  gangScheduling:
    podGroups:
      - name: "job"
        members:
          - componentName: "master"
            groupByKeyPaths:
              - '.metadata.labels["training.kubeflow.org/job-name"]'
          - componentName: "worker"
            groupByKeyPaths:
              - '.metadata.labels["training.kubeflow.org/job-name"]'
Grouping examples
The following definitions are equivalent when the master and worker share the same job-name label:
optimizationInstructions:
  gangScheduling:
    podGroups:
      - name: "job"
        members:
          - componentName: "master"
            groupByKeyPaths:
              - '.metadata.labels["training.kubeflow.org/job-name"]'
          - componentName: "worker"
            groupByKeyPaths:
              - '.metadata.labels["training.kubeflow.org/job-name"]'

optimizationInstructions:
  gangScheduling:
    podGroups:
      - name: "job"
        members:
          - componentName: "job"
            groupByKeyPaths:
              - '.metadata.labels["training.kubeflow.org/job-name"]'
Using default values
When multiple groups are possible, use a default value to cover cases where only a single group is used (the pattern used is -{name}-{index}):
optimizationInstructions:
  gangScheduling:
    podGroups:
      - name: "group"
        members:
          - componentName: "group"
            groupByKeyPaths:
              - '.metadata.labels["leaderworkerset.sigs.k8s.io/name"]'
              - '.metadata.labels["leaderworkerset.sigs.k8s.io/group-index"] // "0"'
Filters
When a CRD defines several specs as an array or map under a single component, rather than as separately named components, you can use filters to form different pod groups. In this example, the CRD defines multiple jobs, all under the job component. A jq query serves as the filter that identifies pods using NVIDIA GPUs, and queries with hard-coded values serve as grouping keys.
optimizationInstructions:
  gangScheduling:
    podGroups:
      - name: "gpu-jobs"
        members:
          - componentName: "job"
            filters:
              - 'any(.spec.jobs[].spec.containers[]; (.resources.limits["nvidia.com/gpu"] // 0) > 0)'
            groupByKeyPaths:
              - 'gpu'
      - name: "no-gpu-jobs"
        members:
          - componentName: "job"
            filters:
              - 'any(.spec.jobs[].spec.containers[]; (.resources.limits["nvidia.com/gpu"] // 0) == 0)'
            groupByKeyPaths:
              - 'no-gpu'
Best Practices
Before submitting a new Resource Interface, confirm:
All kinds use the full GVK (group, version, kind)
The rootComponent includes a valid statusDefinition
Absolute paths only in specDefinitions
No duplicate child kinds; don’t list explicitly defined components in additionalChildKinds
Mutually exclusive podSelectors in multi-component workloads
Null-safe jq expressions (e.g. // 0 defaults)
Target actual components in optimizationInstructions, not just the root CRD
Status conditions match real framework APIs
All childComponents have an ownerRef that points to an existing component, and there are no ownership cycles
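The ownership items in this checklist lend themselves to an automated check. The following Python sketch verifies that every child component's ownerRef names a defined component and that following the ownerRef links never loops. It is a hypothetical pre-submission helper (ownership_problems is not part of any NVIDIA Run:ai tooling) and operates on an RI that has already been parsed into Python data.

```python
def ownership_problems(root_name, child_components):
    """Return descriptions of invalid ownerRef links and ownership cycles."""
    owners = {c["name"]: c.get("ownerRef") for c in child_components}
    known = {root_name} | set(owners)
    problems = []
    for name, owner in owners.items():
        if owner not in known:
            problems.append(f"{name}: ownerRef {owner!r} is not a defined component")
            continue
        # Follow the chain of owners; reaching the root means there is no cycle.
        seen = {name}
        current = owner
        while current != root_name and current in owners:
            if current in seen:
                problems.append(f"{name}: ownership cycle")
                break
            seen.add(current)
            current = owners[current]
    return problems


valid_children = [
    {"name": "predictor", "ownerRef": "inferenceservice"},
    {"name": "transformer", "ownerRef": "inferenceservice"},
]
cyclic_children = [
    {"name": "a", "ownerRef": "b"},
    {"name": "b", "ownerRef": "a"},
]

print(ownership_problems("inferenceservice", valid_children))
print(ownership_problems("root", cyclic_children))
```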
Example: KServe Inference Service
{
  "spec": {
    "structureDefinition": {
      "rootComponent": {
        "name": "inferenceservice",
        "kind": {
          "group": "serving.kserve.io",
          "version": "v1beta1",
          "kind": "InferenceService"
        },
        "specDefinition": {
          "fragmentedPodSpecDefinition": {
            "resourcesPath": ".spec.domain.resources",
            "priorityClassNamePath": ".spec.priorityClassName",
            "nodeAffinityPath": ".spec.affinity.nodeAffinity"
          }
        },
        "statusDefinition": {
          "conditionsDefinition": {
            "path": ".status.conditions",
            "typeFieldName": "type",
            "statusFieldName": "status"
          },
          "statusMappings": {
            "running": [
              {
                "byConditions": [
                  {
                    "type": "PredictorReady",
                    "status": "True"
                  },
                  {
                    "type": "RoutesReady",
                    "status": "True"
                  },
                  {
                    "type": "LatestDeploymentReady",
                    "status": "True"
                  }
                ]
              }
            ],
            "failed": [
              {
                "byConditions": [
                  {
                    "type": "PredictorReady",
                    "status": "False"
                  },
                  {
                    "type": "PredictorConfigurationReady",
                    "status": "False"
                  },
                  {
                    "type": "RoutesReady",
                    "status": "False"
                  }
                ]
              }
            ]
          }
        }
      },
      "childComponents": [
        {
          "name": "predictor",
          "kind": {
            "group": "apps",
            "version": "v1",
            "kind": "Deployment"
          },
          "ownerRef": "inferenceservice",
          "specDefinition": {
            "podSpecPath": ".spec.predictor",
            "metadataPath": ".spec.predictor",
            "fragmentedPodSpecDefinition": {
              "containerPath": ".spec.predictor.model"
            }
          },
          "scaleDefinition": {
            "minReplicasPath": ".spec.predictor.minReplicas",
            "maxReplicasPath": ".spec.predictor.maxReplicas"
          },
          "podSelector": {
            "componentTypeSelector": {
              "keyPath": ".metadata.labels[\"component\"]",
              "value": "predictor"
            }
          }
        },
        {
          "name": "transformer",
          "kind": {
            "group": "apps",
            "version": "v1",
            "kind": "Deployment"
          },
          "ownerRef": "inferenceservice",
          "specDefinition": {
            "podSpecPath": ".spec.transformer",
            "metadataPath": ".spec.transformer"
          },
          "scaleDefinition": {
            "minReplicasPath": ".spec.transformer.minReplicas",
            "maxReplicasPath": ".spec.transformer.maxReplicas"
          },
          "podSelector": {
            "componentTypeSelector": {
              "keyPath": ".metadata.labels[\"component\"]",
              "value": "transformer"
            }
          }
        }
      ]
    },
    "optimizationInstructions": {
      "gangScheduling": {
        "podGroups": [
          {
            "name": "service",
            "members": [
              {
                "componentName": "predictor",
                "groupByKeyPaths": [
                  ".metadata.labels[\"serving.kserve.io/inferenceservice\"]"
                ]
              },
              {
                "componentName": "transformer",
                "groupByKeyPaths": [
                  ".metadata.labels[\"serving.kserve.io/inferenceservice\"]"
                ]
              }
            ]
          }
        ]
      }
    }
  }
}