Extending Workload Support with Resource Interface
The Resource Interface (RI) in NVIDIA Run:ai provides a declarative way to extend platform support for new workload types. It enables organizations to quickly incorporate emerging ML frameworks, tools, or Kubernetes resources without requiring platform updates or code changes. Administrators can introduce new workload types at any time through the Workload Types API.
Once registered, these workloads become available across the organization, enabling teams to innovate and collaborate. Practitioners can then submit them using the Workloads v2 API and manage them with the same orchestration, monitoring, and scheduling capabilities as native workloads. For details on feature support, see Supported features.
Resource Interface
The Resource Interface is a YAML-based specification that defines how NVIDIA Run:ai should interpret, optimize, and monitor new workload types, without requiring platform updates or code changes.
Core Functions
Structure Awareness - Allows NVIDIA Run:ai to interpret the full composition of a workload, including component hierarchy, dependencies, and scaling logic. This ensures the platform can organize and optimize any supported workload, independent of framework or origin.
Monitoring and Status Mapping - Defines how to track workload health by mapping framework-specific status conditions to standard, abstract states (such as running, succeeded, or failed; see the sketch after this list). This mapping supports robust monitoring and lifecycle automation across diverse workload types.
Optimization Directives - Encodes optimization strategies such as gang scheduling. These directives help ensure that workloads run efficiently and reliably in any infrastructure environment.
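For example, a status mapping for a Kubernetes Job-like resource could translate the framework's Complete and Failed conditions into the platform's abstract states. The following is a minimal sketch only; the field names inside statusDefinition (conditions, source, mapTo) are illustrative assumptions, not the authoritative Resource Interface schema.
statusDefinition:
  # Illustrative sketch -- field names are assumptions, not the documented schema.
  conditions:
    - source:
        type: Complete      # condition type reported by the framework's resource
        status: "True"
      mapTo: succeeded      # abstract NVIDIA Run:ai state
    - source:
        type: Failed
        status: "True"
      mapTo: failed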
Structure Overview
A typical Resource Interface manifest includes the following primary sections:
structureDefinition - Specifies all components (root, children), their types, hierarchical relationships, and how to interpret their specs within the overall workload.
optimizationInstructions - Describes how workloads should be scheduled and optimized (e.g., using gang scheduling).
scaleDefinition - Defines how workload components can be scaled (manually or within auto-scaling boundaries).
statusDefinition - Maps resource-specific conditions to standard running, succeeded, or failed states for unified monitoring.
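Putting these sections together, a Resource Interface spec might be organized as in the sketch below. The rootComponent block mirrors the registration example later on this page; the fields shown under optimizationInstructions, scaleDefinition, and statusDefinition are illustrative assumptions rather than the documented schema.
spec:
  structureDefinition:
    rootComponent:
      kind:
        group: apps
        version: v1
        kind: Deployment
  optimizationInstructions:
    gangScheduling: true        # assumption: request all-or-nothing pod placement
  scaleDefinition:
    minReplicas: 1              # assumption: manual scaling boundaries
    maxReplicas: 8
  statusDefinition:
    conditions:                 # assumption: map framework conditions to abstract states
      - source:
          type: Available
          status: "True"
        mapTo: running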
Registering New Workload Types
The Workload Types API allows administrators to register and manage new workload types by providing the required details, such as workload name, supported CRD versions, category, priority, and Kubernetes group.
POST /api/v1/workload-types
with the following fields:
name - The unique name of the workload type. This value must exactly match the Kubernetes Kind that represents the workload type.
resourceInterfaces - Lists the versions of the custom resource definition (CRD) supported for this workload type, such as v1, v1beta1, or v1alpha1. This enables the platform to correctly parse, interpret, and manage manifests for this workload type according to the specific structure and schema associated with each listed version. On update, you may only add or remove supported versions; modifying existing version entries is not allowed.
categoryId - The identifier of the workload category.
priorityId - The identifier of the workload priority.
group - The Kubernetes group associated with the workload resource.
Example POST request body:
{
"name": "Deployment",
"resourceInterfaces": [
{
"spec": {
"structureDefinition": {
"rootComponent": {
"kind": {
"group": "apps",
"version": "v1",
"kind": "Deployment"
}
}
}
}
},
{
"spec": {
"structureDefinition": {
"rootComponent": {
"kind": {
"group": "apps",
"version": "v1beta1",
"kind": "Deployment"
}
}
}
}
}
],
"categoryId": "046b6c7f-0b8a-43b9-b35d-6489e6daee91",
"priorityId": "046b6c7f-0b8a-43b9-b35d-6489e6daee91",
"group": "apps"
}
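To register the workload type, send the body above to the endpoint with any HTTP client. The sketch below uses Python with the requests library; the control-plane URL and API token are placeholders, and an administrator-scoped token is assumed.
import requests

# Placeholders -- replace with your NVIDIA Run:ai control-plane URL and an
# administrator API token.
BASE_URL = "https://<control-plane-url>"
TOKEN = "<API_TOKEN>"

# Abbreviated version of the request body shown above.
workload_type = {
    "name": "Deployment",
    "resourceInterfaces": [
        {"spec": {"structureDefinition": {"rootComponent": {"kind": {
            "group": "apps", "version": "v1", "kind": "Deployment"}}}}}
    ],
    "categoryId": "046b6c7f-0b8a-43b9-b35d-6489e6daee91",
    "priorityId": "046b6c7f-0b8a-43b9-b35d-6489e6daee91",
    "group": "apps",
}

response = requests.post(
    f"{BASE_URL}/api/v1/workload-types",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=workload_type,  # requests sets Content-Type: application/json
)
response.raise_for_status()
print(response.json())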
Supported Workload Types
NVIDIA Run:ai supports a broad range of workloads from the ML and Kubernetes ecosystems. The following workload types are already registered in the platform and ready to use:
NVIDIA - NIM Services
Kubernetes - Deployment, StatefulSet, ReplicaSet, Pod, CronJob, Job, JobSet (kubernetes.io)
Kubeflow - TFJob, PyTorchJob, MPIJob, XGBoostJob, Notebook, ScheduledWorkflow (kubeflow.org)
Ray - RayService, RayCluster, RayJob (ray.io)
Tekton - PipelineRun, TaskRun (tekton.dev)
Additional frameworks - SeldonDeployment, AMLJob, Workflow, DevWorkspace, Service, VirtualMachineInstance, KServe, Milvus
Each workload type comes with a default priority and category. These defaults determine how workloads are scheduled and prioritized within a project and how they are grouped for monitoring and reporting.
Administrators can change the default priority and category assigned to a workload type by updating the mapping using the NVIDIA Run:ai API:
To update the priority mapping, see Workload priority control
To update the category mapping, see Monitor workloads by category