Introduction to workloads

NVIDIA Run:ai enhances visibility and simplifies management by monitoring, presenting, and orchestrating all AI workloads in the clusters where it is installed. Workloads are the fundamental building blocks for consuming resources, enabling AI practitioners such as researchers, data scientists, and engineers to efficiently support the entire life cycle of an AI initiative.

Workloads across the AI lifecycle

A typical AI initiative progresses through several key stages, each with distinct workloads and objectives. With NVIDIA Run:ai, research and engineering teams can host and manage all these workloads to achieve the following:

  • Data preparation: Aggregating, cleaning, normalizing, and labeling data to prepare for training.

  • Training: Conducting resource-intensive model development and iterative performance optimization.

  • Fine-tuning: Adapting pre-trained models to domain-specific datasets while balancing efficiency and performance.

  • Inference: Deploying models for real-time or batch predictions with a focus on low latency and high throughput.

  • Monitoring and optimization: Ensuring ongoing performance by addressing data drift, usage patterns, and retraining as needed.

What is a workload?

A workload runs in the cluster, is associated with a namespace, and operates to fulfill its targets, whether that is running to completion for a batch job, allocating resources for experimentation in an integrated development environment (IDE)/notebook, or serving inference requests in production.

The workload, defined by the AI practitioner, consists of:

  • Container images: This includes the application, its dependencies, and the runtime environment.

  • Compute resources: CPU, GPU, and RAM to execute efficiently and address the workload’s needs.

  • Data & storage configuration: The data needed for processing, such as training and testing datasets or input from external databases, and the storage configuration, which defines how this data is managed, stored, and accessed.

  • Credentials: Access to data sources or external services, ensuring proper authentication and authorization.
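As a rough illustration, the four components above map naturally onto a Kubernetes-style manifest. The sketch below uses only standard Kubernetes Pod fields, not the actual NVIDIA Run:ai workload schema; all names (image, claim, secret) are hypothetical:

```yaml
# Hypothetical manifest illustrating the four workload components.
# Uses plain Kubernetes Pod fields, not the NVIDIA Run:ai CRD schema.
apiVersion: v1
kind: Pod
metadata:
  name: train-example
spec:
  containers:
    - name: trainer
      image: my-registry/trainer:1.0      # container image: app, dependencies, runtime
      resources:
        limits:
          cpu: "8"                        # compute resources: CPU, RAM, GPU
          memory: 32Gi
          nvidia.com/gpu: 1
      env:
        - name: DB_PASSWORD               # credentials, injected from a secret
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: password
      volumeMounts:
        - name: datasets                  # data & storage configuration
          mountPath: /data
  volumes:
    - name: datasets
      persistentVolumeClaim:
        claimName: training-datasets
```

In practice these details are typically filled in through the NVIDIA Run:ai platform rather than written by hand.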

Workload scheduling and orchestration

NVIDIA Run:ai’s core mission is to optimize AI resource usage at scale. This is achieved through efficient scheduling and orchestration of all cluster workloads using the NVIDIA Run:ai Scheduler. The Scheduler prioritizes workloads across the organization’s departments and projects at large scale, based on the resource distribution set by the system administrator.

NVIDIA Run:ai and third-party workloads

  • NVIDIA Run:ai workloads: These workloads are submitted via the NVIDIA Run:ai platform and are represented by Kubernetes Custom Resource Definitions (CRDs) and APIs. With NVIDIA Run:ai workloads, administrators get a complete workload and scheduling policy solution for enforcing optimization, governance, and security standards.

  • Third-party workloads: These workloads are submitted via third-party applications that use the NVIDIA Run:ai Scheduler, and the NVIDIA Run:ai platform manages and monitors them. They enable seamless integration with external tools, giving teams and individuals the flexibility to keep their existing workflows.
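For example, a third-party tool typically opts a pod into the NVIDIA Run:ai Scheduler by setting the pod's `schedulerName`. This is a minimal sketch; the exact project/queue label key and any additional required fields depend on your NVIDIA Run:ai version and installation:

```yaml
# Minimal sketch of a third-party pod handed to the NVIDIA Run:ai Scheduler.
apiVersion: v1
kind: Pod
metadata:
  name: external-job
  labels:
    runai/queue: team-a          # assumption: project/queue label; the exact key varies by version
spec:
  schedulerName: runai-scheduler # delegate scheduling of this pod to the NVIDIA Run:ai Scheduler
  containers:
    - name: main
      image: my-registry/job:1.0 # hypothetical image
```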

Levels of support

Different workload types have different levels of support, so it is important to understand which capabilities you need before selecting a workload type. The table below details the level of support for each workload type in NVIDIA Run:ai. NVIDIA Run:ai workloads are fully supported with all of NVIDIA Run:ai's advanced features and capabilities, while third-party workloads are only partially supported. The list of capabilities can change between NVIDIA Run:ai versions.

The comparison covers the following workload types:

  • NVIDIA Run:ai Workspace

  • NVIDIA Run:ai Training - Standard

  • NVIDIA Run:ai Training - Distributed

  • NVIDIA Run:ai Inference

  • Third-party workloads

Functionality: Workload awareness

Specific workload-aware visibility, so that different pods are identified and treated as a single workload (for example, GPU utilization, workload view, dashboards).