NVIDIA Run:ai Inference Overview
NVIDIA Run:ai provides flexible and robust deployment options for AI inference workloads, offering high performance, strong security, and seamless scalability tailored to organizational needs. The platform supports both single-node and multi-node architectures and is compatible with NVIDIA Inference Microservices (NIM), Hugging Face models, and custom inference containers. This enables integration with a wide range of frameworks and model formats.
Use Cases
Deploy any inference workload type, including LLMs, vision models, speech models, and others for diverse AI applications.
Run both batch and real-time inference, adapting to varying latency and throughput needs.
Benefit from dynamic autoscaling - clusters scale down during idle periods and rapidly scale up as new inference jobs are submitted, improving efficiency for batch workflows.
Support distributed inference for very large LLMs that cannot fit on a single node, such as DeepSeek R1, enabling multi-node deployment for high-capacity models.
Monitor LLM and model performance using real-time and historical utilization metrics, request analytics, and token-level statistics, allowing MLOps engineers to track performance, usage, and availability for operational insights.
Key Features and Capabilities
Performance and Optimizations
Topology-aware scheduling - Optimizes placement of distributed inference workloads, reducing communication overhead and improving workload efficiency.
Gang scheduling - Schedules related pods together (for example, multi-pod inference workloads such as Dynamo deployments).
Automatic MNNVL support - Optimizes placement for Multi-Node NVLink systems when available.
Priority and scheduling - Uses workload priority and advanced scheduling strategies to enhance performance and reduce latency.
Dynamic autoscaling - Defines minimum and maximum replicas and sets metric-based thresholds to handle fluctuating demand, with scaling triggered by latency, throughput, concurrency, or other custom metrics.
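For reference, the sketch below shows the standard Knative Serving autoscaling annotations that correspond to these minimum/maximum replica and threshold settings. It illustrates the underlying mechanism for single-node (Knative) workloads only and is not the exact configuration NVIDIA Run:ai generates; all values are placeholders.

```python
# Illustrative only: the standard Knative Serving autoscaling annotations that map to
# minimum/maximum replicas and a metric-based scaling threshold. NVIDIA Run:ai manages
# the equivalent settings for you when you configure autoscaling on an inference workload.
autoscaling_annotations = {
    "autoscaling.knative.dev/min-scale": "1",          # minimum replicas (0 allows scale-to-zero)
    "autoscaling.knative.dev/max-scale": "8",          # maximum replicas
    "autoscaling.knative.dev/metric": "concurrency",   # or "rps" for requests per second
    "autoscaling.knative.dev/target": "10",            # per-replica threshold that triggers scaling
}

if __name__ == "__main__":
    for key, value in autoscaling_annotations.items():
        print(f"{key}: {value}")
```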
Lifecycle Management
Flexible deployment architecture - Choose between single-node (serverless Knative) and distributed multi-node (Leader–Worker) deployments. Large language models (LLMs) that do not fit on a single node can be distributed across multiple nodes, depending on the available hardware.
Model compatibility and serving - Native support for deploying NVIDIA Inference Microservices (NIM), Hugging Face models, and custom inference containers.
Access management - Secure, customizable endpoint access for public, authenticated users, groups, or service accounts, with user-specific restrictions enforced through access token–based authentication.
Rolling updates - Supports real-time, disruption-free updates to inference workloads, including container image, configuration, compute resources, and scaling policy. Revision management capabilities allow tracking and managing changes across inference workload versions.
Dynamic NVIDIA NIM model list from NGC catalog - Automatically retrieve the available NVIDIA NIM models from the NGC catalog, ensuring the list remains current and reflects the latest model offerings.
Hugging Face model catalog browsing - Browse and search the Hugging Face model catalog directly when creating inference workloads. The live catalog view displays model details such as download count and gated status. For gated models, the platform prompts you to provide a Hugging Face token for access, while open models can be selected without authentication.
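As a hedged illustration of the gated-model case above, the snippet below shows how a Hugging Face access token is typically used to download a gated repository with the huggingface_hub library. The model ID is only an example; open models need no token.

```python
# Minimal sketch: how a Hugging Face access token unlocks a gated model.
# The repository ID below is illustrative; open models can be fetched without a token.
import os

from huggingface_hub import snapshot_download

gated_model = "meta-llama/Llama-3.1-8B-Instruct"  # example of a gated repository

local_path = snapshot_download(
    repo_id=gated_model,
    token=os.environ.get("HF_TOKEN"),  # the token you would provide in the workload form
)
print(f"Model files downloaded to {local_path}")
```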
Visualization and Monitoring
End-to-end observability - Provides unified access to resource utilization (GPU, CPU, network), inference metrics (throughput, latency, replica counts), and NIM-specific workload metrics (request concurrency, request counts, TTFT, latency percentiles, GPU KV-cache utilization) for comprehensive monitoring and analysis.
Workload Support and Framework Compatibility
NVIDIA Run:ai supports a wide range of inference workload types and frameworks, each integrated with the platform’s scheduling, orchestration, and observability capabilities. Feature availability can vary depending on the workload type and submission method. To understand which capabilities are supported for each workload type, refer to Supported features.
Native Inference Workloads
NVIDIA Run:ai supports multiple native inference workload types to accommodate different serving frameworks and deployment preferences:
Custom - Deploy user-defined inference images built with any compatible runtime. This workload type also supports distributed inference, enabling multi-node serving for large models. Distributed inference is available via API only; see Distributed inference API for more details and the sketch after this list.
Hugging Face - Run transformer-based models directly from Hugging Face repositories.
NVIDIA NIM - Deploy optimized NVIDIA Inference Microservices that include built-in observability, tracing, and GPU performance metrics.
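The sketch below illustrates how a distributed custom inference workload might be submitted programmatically, as referenced in the Custom workload type above. The endpoint path and payload fields are assumptions for illustration only; the authoritative schema is in the Distributed inference API reference.

```python
# Illustrative sketch only: submitting a distributed (multi-node) custom inference workload
# through a REST API with the `requests` library. The endpoint path and payload field names
# below are assumptions, not the authoritative NVIDIA Run:ai schema.
import requests

BASE_URL = "https://<control-plane-domain>"              # placeholder
TOKEN = "<api-token-obtained-from-the-control-plane>"    # placeholder

payload = {
    "name": "llm-distributed-serving",                   # hypothetical field names
    "projectId": "<project-id>",
    "spec": {
        "image": "myregistry/llm-server:latest",         # placeholder image
        "numWorkers": 3,                                  # leader plus workers for a model spanning nodes
        "compute": {"gpuDevicesRequest": 8},
    },
}

response = requests.post(
    f"{BASE_URL}/api/v1/workloads/distributed-inferences",  # hypothetical path
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```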
Additional Supported Inference Workloads
NVIDIA Run:ai supports a broad range of workloads from the ML and Kubernetes ecosystem that are already registered as workload types in the platform and ready to use. These workloads can be submitted via YAML and managed with the same orchestration, monitoring, and scheduling capabilities as native workloads (see the sketch after this list). See Supported workload types for more details.
NVIDIA NIM Services (using the dedicated NVIDIA NIM API)
LeaderWorkerSet (LWS)
InferenceService (KServe)
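These workload types are standard Kubernetes custom resources. As one hedged example, the sketch below creates a minimal KServe InferenceService with the official Kubernetes Python client; it uses the generic KServe v1beta1 API rather than a Run:ai-specific one, and the name, namespace, and storage URI are illustrative.

```python
# Minimal sketch: creating a KServe InferenceService with the Kubernetes Python client.
# This is the generic KServe v1beta1 API; in NVIDIA Run:ai the equivalent YAML would be
# submitted as a registered workload type. Name, namespace, and storage URI are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside a pod
api = client.CustomObjectsApi()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "runai-my-project"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},
                "storageUri": "gs://kfserving-examples/models/sklearn/1.0/model",
            }
        }
    },
}

api.create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="runai-my-project",
    plural="inferenceservices",
    body=inference_service,
)
```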
Extending Inference Workload Support
For emerging ML frameworks, tools, or Kubernetes resources, the Resource Interface (RI) provides a declarative way to extend platform support. Administrators can register new workload types via the Workload Types API, making them available across the organization without requiring platform updates or code changes. Once registered, these workloads can be submitted via YAML and managed with the same orchestration, monitoring, and scheduling capabilities as native workloads. See Extending workload support with Resource Interface for more details.
Observability and Metrics
Inference workloads deployed through NVIDIA Run:ai provide comprehensive observability and performance monitoring. See Workloads for more details.
Inference metrics - Available for all inference workloads, tracking performance indicators such as GPU utilization, request throughput, and latency.
NVIDIA NIM metrics - Specific to NVIDIA NIM workloads, offering additional visibility into model-level statistics, runtime performance, and token-level metrics for LLMs.
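NIM microservices expose these statistics in Prometheus format. The sketch below scrapes and parses such an endpoint; the URL, port, and metric-name filters are assumptions that may vary between NIM versions, and NVIDIA Run:ai surfaces the same data in its dashboards without any manual scraping.

```python
# Minimal sketch: scraping Prometheus-format metrics exposed by a NIM container and
# printing request/token oriented series. The metrics URL and the name filters are
# assumptions for illustration and may differ per NIM version.
import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8000/metrics"  # assumed metrics endpoint of a local NIM container

raw = requests.get(METRICS_URL, timeout=10).text
for family in text_string_to_metric_families(raw):
    # Keep only request- and token-related series for a quick health check
    if any(keyword in family.name for keyword in ("request", "token", "ttft")):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)
```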
Workload Architecture
NVIDIA Run:ai inference workloads can be deployed as:
Single-node - Deployed as a single pod, typically for lightweight or latency-sensitive serving.
Multi-node - Deployed as a coordinated set of pods (Leader–Worker Set) to support large language models (LLMs) that cannot fit on a single GPU node (for example, DeepSeek R1).
Deployment Architecture Overview
Single-Node Deployment
Single-node inference workloads are deployed as a single pod and use Knative Serving, which provides serverless capabilities such as request queuing, autoscaling, and granular rollout.
Request flow:
The client authenticates with the NVIDIA Run:ai control plane to obtain a token.
The client sends a request to the inference serving endpoint hosted in the NVIDIA Run:ai cluster.
The request passes through the organization’s load balancer.
The request reaches the NGINX Ingress, where TLS termination occurs. NGINX proxies the request to the Kourier ingress.
Kourier forwards the request to the Knative Queue Proxy. The queue proxy manages traffic to maintain service level objectives (SLOs), reports metrics, and enforces concurrency limits.
Authorization is validated before the LLM/model container processes the request.
Results are returned to the client.
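A minimal client-side sketch of this flow is shown below, assuming an access token has already been obtained from the control plane and that the workload exposes an OpenAI-compatible chat completions route (common for LLM servers such as NIM, but not guaranteed for every model). The endpoint URL and model name are placeholders.

```python
# Minimal client-side sketch of the request flow above: send an authenticated request to
# an inference endpoint served by the cluster. The endpoint URL is a placeholder, and the
# OpenAI-compatible /v1/chat/completions route is an assumption that holds for many LLM
# servers but not necessarily for every workload.
import requests

ENDPOINT = "https://<inference-endpoint-host>"    # URL shown for the workload's serving endpoint
TOKEN = "<access-token-from-the-control-plane>"   # obtained in step 1 of the flow

response = requests.post(
    f"{ENDPOINT}/v1/chat/completions",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "model": "my-model",                      # illustrative model name
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```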

Multi-Node Deployment
Multi-node inference workloads use multiple pods, one leader and several worker pods, forming a Leader–Worker Set (LWS). Knative is not used in this configuration.
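For reference, the sketch below shows the shape of the Kubernetes LeaderWorkerSet resource that backs such a deployment, expressed as a Python dictionary. NVIDIA Run:ai creates the actual resource when a distributed inference workload is submitted; the image, sizes, and name here are placeholders.

```python
# Illustrative sketch of the Kubernetes LeaderWorkerSet (LWS) resource behind a multi-node
# deployment: each replica group consists of one leader pod plus (size - 1) worker pods.
# NVIDIA Run:ai generates the actual resource for you; all values here are placeholders.
import json

leader_worker_set = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "llm-multinode"},
    "spec": {
        "replicas": 1,                  # number of leader/worker groups
        "leaderWorkerTemplate": {
            "size": 4,                  # pods per group: 1 leader + 3 workers
            "leaderTemplate": {
                "spec": {"containers": [{"name": "leader", "image": "myregistry/llm-server:latest"}]}
            },
            "workerTemplate": {
                "spec": {"containers": [{"name": "worker", "image": "myregistry/llm-server:latest"}]}
            },
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(leader_worker_set, indent=2))
```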
Request flow:
The client authenticates with the NVIDIA Run:ai control plane to obtain a token.
The client sends a request to the inference serving endpoint hosted in the NVIDIA Run:ai cluster.
Requests flow through the load balancer and Ingress directly to the inference leader pod.
Authorization is validated by the leader pod before any computation is offloaded to the workers.
The leader pod then delegates computation across the worker pods.
Results are aggregated by the leader and returned to the client.
