runai inference nim submit

Submit a NIM inference workload.

Synopsis

Before using the flags, keep in mind:

Model Store Configuration:

  • You must select exactly one model store option: --model-existing-pvc, --model-new-pvc, or --model-nim-cache.

Autoscaling vs Fixed Replicas:

  • Use --replicas for a fixed number of instances.

  • Use --min-replicas, --max-replicas, and --metric for autoscaling.

Multi-Node Inference:

  • Use --workers to specify the number of worker nodes per leader.

  • Setting --workers to a value greater than 0 automatically enables multi-node mode (1 Leader + N Workers).

  • Note: Multi-node (--workers) cannot be used with autoscaling (--min-replicas/--max-replicas).
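
The flag rules above can be checked before a submission is attempted. The following is a minimal sketch, not part of the runai CLI: check_nim_flags is a hypothetical helper, and it assumes every flag and its value are passed as separate arguments (for example --workers 2 rather than --workers=2). It enforces only the two constraints stated explicitly above: exactly one model store option, and no mixing of --workers with the autoscaling replica flags.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; this function is NOT provided by runai.
# Assumption: flags and values are separate arguments (--workers 2).
check_nim_flags() {
  local stores=0 autoscale=0 workers=0
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --model-existing-pvc|--model-new-pvc|--model-nim-cache)
        # Count every model store option that appears.
        stores=$((stores + 1)) ;;
      --min-replicas|--max-replicas)
        autoscale=1 ;;
      --workers)
        # Read the value that follows the flag; default to 0 if absent.
        workers="${2:-0}" ;;
    esac
    shift
  done

  if [ "$stores" -ne 1 ]; then
    echo "error: select exactly one model store option" >&2
    return 1
  fi
  if [ "$workers" -gt 0 ] && [ "$autoscale" -eq 1 ]; then
    echo "error: --workers cannot be used with --min-replicas/--max-replicas" >&2
    return 1
  fi
  return 0
}
```

One possible use is to guard the real command, for example check_nim_flags "$@" && runai inference nim submit "$@", so that an invalid combination is caught locally. The CLI may perform its own validation as well; this sketch only mirrors the rules listed on this page.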

runai inference nim submit [flags]

Examples


# Submit a Llama3 model using an existing PVC
runai inference nim submit <workload-name> -p <project-name> -i nvcr.io/nim/meta/llama3-8b-instruct:latest --ngc-auth-secret my-ngc-secret --model-existing-pvc claimname=my-model-pvc --serving-port 8000 -g 1

# Submit with autoscaling
runai inference nim submit <workload-name> -p <project-name> -i nvcr.io/nim/meta/llama3-8b-instruct:latest --ngc-auth-secret my-ngc-secret --model-nim-cache "name=my-cache,profile=10gb" --min-replicas 1 --max-replicas 5 --metric concurrency --metric-threshold 10 --serving-port 8000

# Submit multi-node model (1 Leader + 2 Workers)
runai inference nim submit <workload-name> -p <project-name> -i nvcr.io/nim/meta/llama3-8b-instruct:latest --ngc-auth-secret my-ngc-secret --model-new-pvc "claimname=my-model-pvc,size=100Gi,accessmode-rwo" --workers 2 --replicas 1 --serving-port 8000

Options

Options inherited from parent commands

SEE ALSO

  • runai inference nim - [Experimental] Runs NVIDIA NIM (NVIDIA Inference Microservices) workloads. Optimized for deploying foundation models.
