NVIDIA Run:ai Inference Overview

NVIDIA Run:ai provides flexible and robust deployment options for AI inference workloads, offering high performance, strong security, and seamless scalability tailored to organizational needs. The platform supports both single-node and multi-node architectures and is compatible with NVIDIA Inference Microservices (NIM), Hugging Face models, and custom inference containers. This enables integration with a wide range of frameworks and model formats.

Use Cases

  • Deploy any inference workload type, including LLMs, vision models, speech models, and others for diverse AI applications.

  • Run both batch and real-time inference, adapting to varying latency and throughput needs.

  • Benefit from dynamic autoscaling - clusters scale down during idle periods and rapidly scale up as new inference jobs are submitted, improving efficiency for batch workflows.

  • Support distributed inference for very large LLMs that cannot fit on a single node, such as DeepSeek R1, enabling multi-node deployment for high-capacity models.

  • Monitor LLM and model performance using real-time and historical utilization metrics, request analytics, and token-level statistics, allowing MLOps engineers to track performance, usage, and availability for operational insights.

Key Features and Capabilities

  • Flexible deployment architecture - Choose between single-node (serverless Knative) and distributed multi-node (Leader–Worker) deployments. Large language models (LLMs) that do not fit within a single node can be distributed across multiple nodes, depending on available hardware.

  • Model compatibility and serving - Native support for deploying NVIDIA Inference Microservices (NIM), Hugging Face models, and custom inference containers.

  • Access management - Secure, customizable endpoint access that can be public or restricted to authenticated users, groups, or service accounts, with access restrictions enforced through access token–based authentication.

  • Dynamic autoscaling - Define minimum and maximum replicas and set metric-based thresholds to handle fluctuating demand, with scaling conditions triggered by latency, throughput, concurrency, or other custom metrics (see the sketch after this list).

  • Priority and scheduling - Uses workload priority and advanced scheduling strategies, such as gang and topology-aware scheduling, to enhance performance and reduce latency.

  • End-to-end observability - Provides unified access to resource utilization (GPU, CPU, network), inference metrics (throughput, latency, replica counts), and NIM-specific workload metrics (request concurrency, request counts, time to first token (TTFT), latency percentiles, GPU KV-cache utilization) for comprehensive monitoring and analysis.

  • Rolling updates - Supports real-time, disruption-free updates to inference workloads, including container image, configuration, compute resources, and scaling policy. Revision management capabilities allow tracking and managing changes across inference workload versions.

  • Dynamic NVIDIA NIM model list from NGC catalog - Automatically retrieve the available NVIDIA NIM models from the NGC catalog, ensuring the list remains current and reflects the latest model offerings.

  • Comprehensive scheduling and workload support - Additional scheduling, resource management, and operational features are supported for inference workloads. For the most current and detailed feature list, refer to Supported features.
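
As a rough illustration of the autoscaling bullet above, the following Python sketch builds the kind of spec fragment that might accompany an inference workload submission. The field names (minReplicas, maxReplicas, metric, metricThreshold) and their placement are assumptions for this example, not a definitive API reference.

```python
# Illustrative only: field names are assumptions, not the authoritative
# NVIDIA Run:ai API schema.
import json

autoscaling_spec = {
    "minReplicas": 1,          # lower bound kept warm during idle periods
    "maxReplicas": 8,          # upper bound under peak demand
    "metric": "concurrency",   # could also be latency, throughput, or a custom metric
    "metricThreshold": 10,     # per-replica target that triggers scale out/in
}

print(json.dumps({"spec": {"autoscaling": autoscaling_spec}}, indent=2))
```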

Workload Support and Framework Compatibility

Native Inference Workloads

NVIDIA Run:ai supports multiple native inference workload types to accommodate different serving frameworks and deployment preferences:

  • Custom - Deploy user-defined inference images built with any compatible runtime. This workload type also supports distributed inference, enabling multi-node serving for large models. Distributed inference is available via API only. See Distributed inference API for more details.

  • Hugging Face - Run transformer-based models directly from Hugging Face repositories.

  • NVIDIA NIM - Deploy optimized NVIDIA Inference Microservices that include built-in observability, tracing, and GPU performance metrics.
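
As a minimal sketch of what submitting one of these native workload types through the REST API might look like, the example below posts a Hugging Face inference workload with Python's requests library. The endpoint path, payload fields, and values are assumptions for illustration; consult the API reference for the exact schema.

```python
# Minimal sketch; the endpoint path and payload fields are assumptions for
# illustration, not the authoritative NVIDIA Run:ai API schema.
import requests

BASE_URL = "https://my-runai.example.com"   # hypothetical control plane URL
TOKEN = "<api-token>"                       # obtained from the control plane

payload = {
    "name": "llama-chat",                   # hypothetical workload name
    "projectId": "1",
    "clusterId": "<cluster-uuid>",
    "spec": {
        "modelUri": "meta-llama/Llama-3.1-8B-Instruct",  # Hugging Face model id
        "compute": {"gpuDevicesRequest": 1},
    },
}

resp = requests.post(
    f"{BASE_URL}/api/v1/workloads/inferences",   # assumed endpoint path
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```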

Additional Supported Inference Workloads

NVIDIA Run:ai also supports a broad range of workloads from the ML and Kubernetes ecosystem that are already registered as workload types in the platform and ready to use. These workloads can be submitted using the Workloads v2 API and managed with the same orchestration, monitoring, and scheduling capabilities as native workloads. For details on feature support, see Supported features.

Extending Inference Workload Support

For emerging ML frameworks, tools, or Kubernetes resources, the Resource Interface (RI) provides a declarative way to extend platform support. Administrators can register new workload types via the Workload Types API, making them available across the organization without requiring platform updates or code changes. Once registered, these workloads can then be submitted using the Workloads v2 API and managed with the same orchestration, monitoring, and scheduling capabilities as native workloads. See Extending workload support with Resource Interface for more details.
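
At a high level, registering a Kubernetes custom resource as a new workload type could look like the sketch below. The endpoint path and body fields (the display name and the group/version/kind of the custom resource) are assumptions for illustration only, not the documented Workload Types API schema.

```python
# Hypothetical sketch of registering a custom resource as a workload type;
# the endpoint path and field names are assumptions, not the documented schema.
import requests

BASE_URL = "https://my-runai.example.com"    # hypothetical control plane URL
TOKEN = "<admin-api-token>"

workload_type = {
    "name": "kserve-inference",              # hypothetical display name
    "resource": {                            # Kubernetes resource to map
        "group": "serving.kserve.io",
        "version": "v1beta1",
        "kind": "InferenceService",
    },
}

resp = requests.post(
    f"{BASE_URL}/api/v1/workload-types",     # assumed endpoint path
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=workload_type,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```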

Observability and Metrics

Inference workloads deployed through NVIDIA Run:ai provide comprehensive observability and performance monitoring. See Workloads for more details.

  • Inference metrics - Available for all inference workloads, tracking performance indicators such as GPU utilization, request throughput, and latency.

  • NVIDIA NIM metrics - Specific to NVIDIA NIM workloads, offering additional visibility into model-level statistics, runtime performance, and token-level metrics for LLMs.
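
For teams that pull these metrics into their own dashboards or alerting jobs, the flow typically reduces to an authenticated metrics query over a time window, as in the sketch below. The endpoint path, query parameters, metric type names, and response shape are assumptions for this example.

```python
# Illustrative metrics query; the endpoint path, parameters, metric type
# names, and response shape are assumptions, not the authoritative API.
import requests
from datetime import datetime, timedelta, timezone

BASE_URL = "https://my-runai.example.com"    # hypothetical control plane URL
TOKEN = "<api-token>"
WORKLOAD_ID = "<workload-uuid>"              # hypothetical inference workload id

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

resp = requests.get(
    f"{BASE_URL}/api/v1/workloads/{WORKLOAD_ID}/metrics",  # assumed endpoint
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={
        "metricType": ["GPU_UTILIZATION", "THROUGHPUT", "LATENCY"],  # assumed names
        "start": start.isoformat(),
        "end": end.isoformat(),
        "numberOfSamples": 60,
    },
    timeout=30,
)
resp.raise_for_status()
for series in resp.json().get("measurements", []):          # assumed response shape
    print(series.get("type"), len(series.get("values", [])), "samples")
```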

Workload Architecture

NVIDIA Run:ai inference workloads can be deployed as:

  • Single-node - Deployed as a single pod, typically for lightweight or latency-sensitive serving.

  • Multi-node - Deployed as a coordinated set of pods (Leader–Worker Set) to support large language models (LLMs) that cannot fit on a single GPU node (for example, DeepSeek R1).

Deployment Architecture Overview

Note

The deployment architecture applies to native inference workloads only. The architecture for other workload types and frameworks (e.g., KServe or SeldonDeployment) may differ based on the framework.

Single-Node Deployment

Single-node inference workloads are deployed as a single pod and use Knative Serving, which provides serverless capabilities such as request queuing, autoscaling, and granular rollout.

Request flow:

  1. The client authenticates with the NVIDIA Run:ai control plane to obtain a token.

  2. The client sends a request to the inference serving endpoint hosted in the NVIDIA Run:ai cluster.

  3. The request passes through the organization’s load balancer.

  4. The request reaches the NGINX Ingress, where TLS termination occurs. NGINX proxies the request to the Kourier ingress.

  5. Kourier forwards the request to the Knative Queue Proxy. The queue proxy manages traffic to enforce concurrency limits, report metrics, and maintain service level objectives (SLOs).

  6. Authorization is validated before the LLM/model container processes the request.

  7. Results are returned to the client.
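
From the client's perspective, steps 1–2 and 7 reduce to obtaining a token from the control plane and calling the serving endpoint with it. The sketch below shows that pattern in Python; the token endpoint path, credential field names, serving URL, and the OpenAI-style completion path exposed by the model container are all assumptions for illustration.

```python
# Client-side sketch of steps 1-2 and 7. The token endpoint, credential
# field names, serving URL, and completion path are assumptions.
import requests

CONTROL_PLANE = "https://my-runai.example.com"            # hypothetical control plane
SERVING_URL = "https://llama-chat.inference.example.com"  # hypothetical endpoint URL

# Step 1: authenticate with the control plane to obtain a token (assumed schema).
token_resp = requests.post(
    f"{CONTROL_PLANE}/api/v1/token",
    json={"grantType": "app_token", "appId": "<app-id>", "appSecret": "<app-secret>"},
    timeout=30,
)
token_resp.raise_for_status()
token = token_resp.json()["accessToken"]                  # assumed response field

# Step 2: send the inference request to the serving endpoint with the token.
resp = requests.post(
    f"{SERVING_URL}/v1/chat/completions",                 # assumes an OpenAI-style server
    headers={"Authorization": f"Bearer {token}"},
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
resp.raise_for_status()

# Step 7: the result is returned to the client.
print(resp.json()["choices"][0]["message"]["content"])
```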

Multi-Node Deployment

Multi-node inference workloads use multiple pods, one leader and several worker pods, forming a Leader–Worker Set (LWS). Knative is not used in this configuration.

Request flow:

  1. The client authenticates with the NVIDIA Run:ai control plane to obtain a token.

  2. The client sends a request to the inference serving endpoint hosted in the NVIDIA Run:ai cluster.

  3. Requests flow through the load balancer and Ingress directly to the inference leader pod.

  4. Authorization is validated by the leader pod before any computation is offloaded to workers.

  5. The leader pod delegates computation across the worker pods.

  6. Results are aggregated by the leader and returned to the client.
