NVIDIA Run:ai Self-Hosted Product Documentation

  • Getting Started
    • Overview
    • What's New
      • What's New in Version 2.21
      • Hotfixes for Version 2.21
    • Installation
      • Control Plane System Requirements
      • Preparations
      • Network Requirements
      • Install the Control Plane
      • Cluster System Requirements
      • Install Using Helm
      • Customized Installation
      • Upgrade
      • Uninstall
  • Infrastructure Setup
    • Authentication and Authorization
      • Authentication and Authorization
      • Users
      • SSO
        • Set Up SSO with SAML
        • Set Up SSO with OpenID Connect
        • Set Up SSO with OpenShift
      • Roles
      • Applications
      • User Applications
      • Access Rules
    • Advanced Setup
      • Node Roles
      • Advanced Control Plane Configurations
      • Advanced Cluster Configurations
      • Service Mesh
      • Integrations
        • Interworking with Karpenter
    • Infrastructure Procedures
      • NVIDIA Run:ai at Scale
      • Monitoring and Maintenance
      • NVIDIA Run:ai System Monitoring
      • Clusters
      • Shared Storage
      • Node Maintenance
      • Cluster Restore
      • Secure Your Cluster
      • Logs Collection
      • Event History
  • Platform Management
    • Manage AI Initiatives
      • Adapting AI Initiatives to Your Organization
      • Managing Your Organization
        • Projects
        • Departments
      • Managing Your Resources
        • Nodes
        • Configuring NVIDIA MIG Profiles
        • Using GB200 NVL72 and Multi-Node NVLink Domains
        • Node Pools
    • Scheduling and Resource Optimization
      • Scheduling
        • The NVIDIA Run:ai Scheduler: Concepts and Principles
        • How the Scheduler Works
        • Workload Priority Control
        • Quick Starts
          • Over Quota, Fairness and Preemption
      • Resource Optimization
        • GPU Fractions
        • Dynamic GPU Fractions
        • Optimize Performance with Node Level Scheduler
        • GPU Time-Slicing
        • GPU Memory Swap
        • Quick Starts
          • Launching Workloads with GPU Fractions
          • Launching Workloads with Dynamic GPU Fractions
          • Launching Workloads with GPU Memory Swap
    • Policies
      • Policies and Rules
      • Workload Policies
      • Policy YAML Examples
      • Policy YAML Reference
      • Scheduling Rules
    • Monitor Performance and Health
      • Before You Start
      • Metrics and Telemetry
      • Reports
  • Workloads in NVIDIA Run:ai
    • Introduction to Workloads
    • NVIDIA Run:ai Workload Types
    • Workloads
    • Workload Assets
      • Workload Assets
      • Environments
      • Data Sources
      • Data Volumes
      • Compute Resources
      • Credentials
    • Workload Templates
      • Workspace Templates
    • Experiment Using Workspaces
      • Running Workspaces
      • Quick Starts
        • Running Jupyter Notebooks Using Workspaces
    • Train Models Using Training
      • Standard Training
        • Train Models Using a Standard Training Workload
        • Quick Starts
          • Run Your First Standard Training
      • Distributed Training
        • Train Models Using a Distributed Training Workload
        • Quick Starts
          • Run Your First Distributed Training
    • Deploy Models Using Inference
      • Deploy a Custom Inference Workload
      • Deploy Inference Workloads from Hugging Face
      • Deploy Inference Workloads with NVIDIA NIM
      • Deploy NVIDIA Cloud Functions (NVCF) in NVIDIA Run:ai
  • Reference
    • CLI Reference
      • Install and Configure CLI
      • Administrator CLI
      • Add NVIDIA Run:ai Authorization to Kubeconfig
      • CLI Commands Reference
        • runai cluster
        • runai config
        • runai inference
        • runai jax
        • runai kubeconfig
        • runai login
        • runai logout
        • runai mpi
        • runai node
        • runai nodepool
        • runai project
        • runai pvc
        • runai pytorch
        • runai report
        • runai tensorflow
        • runai training
        • runai upgrade
        • runai version
        • runai whoami
        • runai workload
        • runai workspace
        • runai xgboost
      • CLI Commands Examples
    • API Reference
      • How to Authenticate to the API
      • NVIDIA Run:ai REST API
        • Configuring Slack Notifications
  • Support Policy
    • Product Support Policy
    • Product Version Life Cycle

Install, set up and monitor

  • Install self-hosted
  • Set authenticated access
  • Set node roles and advanced cluster configurations
  • Monitor, manage and restore clusters
  • Monitor your platform

Manage organizations and resources

  • Map and set up your organizations
  • Set up and assign your resources
  • Manage permissions
  • Create and manage policies
  • Monitor performance and health

Build, train and deploy models

  • Learn more about workloads and workload types
  • Prepare workload assets
  • Build your model using workspaces
  • Train your model using standard or distributed training workloads
  • Deploy your model with inference workloads

Scheduling and resource optimization

  • Learn the NVIDIA Run:ai Scheduler concepts and principles
  • Understand how the Scheduler works
  • Explore resource optimization techniques

Develop with APIs

  • Set up API access
  • Use REST APIs
  • Consume metrics and telemetry
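
If you are new to the REST API, the general flow is: create an application for API access, exchange its credentials for an access token, then pass that token as a bearer token on each request. The sketch below illustrates this flow only; the token endpoint path, request fields, and response shape shown here are assumptions, so verify them against How to Authenticate to the API before use.

  # Exchange application credentials for an access token. The endpoint
  # path and JSON field names are assumptions; confirm them against
  # "How to Authenticate to the API". <control-plane-url>, <APP_ID> and
  # <APP_SECRET> are placeholders. Requires jq for JSON parsing.
  TOKEN=$(curl -s -X POST "https://<control-plane-url>/api/v1/token" \
      -H "Content-Type: application/json" \
      -d '{"grantType": "client_credentials", "clientId": "<APP_ID>", "clientSecret": "<APP_SECRET>"}' \
      | jq -r '.accessToken')

  # Call the REST API with the bearer token ("list clusters" is shown
  # only as an illustrative endpoint).
  curl -H "Authorization: Bearer $TOKEN" "https://<control-plane-url>/api/v1/clusters"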

Use the CLI

  • Install and configure the CLI
  • See the full list of commands and examples
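
If you just want to confirm that the CLI is working, a minimal first session might look like the sketch below. It uses only commands listed under CLI Commands Reference above; see each command's page for its flags and arguments.

  runai login      # authenticate to the NVIDIA Run:ai control plane
  runai whoami     # confirm which user you are logged in as
  runai version    # print the installed CLI version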

Quick starts

  • Run Jupyter Notebooks using workspaces
  • Run your first distributed training workload
  • Launch workloads with dynamic GPU fractions
