Deploy Inference Workloads with NVIDIA NIM
This section explains how to deploy a GenAI model from NVIDIA NIM as an inference workload via the NVIDIA Run:ai UI.
An inference workload provides the setup and configuration needed to deploy your trained model for real-time or batch predictions. It includes specifications for the container image, data sets, network settings, and resource requests required to serve your models.
The inference workload is assigned to a project and is affected by the project’s quota.
To learn more about the inference workload type in NVIDIA Run:ai and determine whether it is the most suitable workload type for your goals, see Workload types and features.

Before You Start
Make sure you have created a project or have one created for you.
Make sure Knative is properly installed by your administrator.
NVIDIA NIM requires an image that is pulled from the NGC catalog. Make sure a Docker registry credential is configured in your project (where the workload will run) with the following values:
Username - $oauthtoken
Password - <NGC API key>
Docker registry URL - nvcr.io
For a step-by-step guide on adding credentials to the gallery, see Credentials.
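If you prefer to prepare the registry secret directly in the cluster, the kubectl sketch below is a rough equivalent. It assumes kubectl access and that the project maps to the namespace runai-<project-name>; the secret name ngc-registry-secret is a hypothetical example.

# Hypothetical example - replace the namespace, secret name, and API key placeholders
kubectl create secret docker-registry ngc-registry-secret \
  --namespace=runai-<project-name> \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password='<NGC API key>'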
Workload Priority
By default, inference workloads in NVIDIA Run:ai are assigned a priority of very-high, which is non-preemptible. This behavior ensures that inference workloads, which often serve real-time or latency-sensitive traffic, are guaranteed the resources they need and will not be disrupted by other workloads. You can select a different priority when submitting a workload. For more details on the available options, see Workload priority control.
Submission Form Options
You can create a new workload using either the Flexible or Original submission form. The Flexible submission form offers greater customization and is the recommended method. Within the Flexible form, you have two options:
Load from an existing setup - You can select an existing setup to populate the workload form with predefined values. While the Original submission form also allows you to select an existing setup, with the Flexible submission you can customize any of the populated fields for a one-time configuration. These changes will apply only to this workload and will not modify the original setup. If needed, you can reset the configuration to the original setup at any time.
Provide your own settings - Manually fill in the workload configuration fields. This is a one-time setup that applies only to the current workload and will not be saved for future use.
Advanced Setup Form
The Advanced setup form allows you to fine-tune your workload configuration with additional preferences, including environment details, image settings, workload priority, and data sources. This gives you greater flexibility, helping you adapt the workload to your specific requirements. After completing the initial setup you can either create the workload as is or use the dropdown next to CREATE INFERENCE and select Advanced setup.
Creating a NIM Inference Workload
To create an inference workload, go to Workload manager → Workloads.
Click +NEW WORKLOAD and select Inference from the dropdown menu.
Within the new form, select the cluster and project. To create a new project, click +NEW PROJECT and refer to Projects for a step-by-step guide.
Select a template or Start from scratch to launch a new workload quickly. You can use a workload template to populate the workload advanced setup form with predefined configuration values. You can still modify the populated fields before submitting the workload. Any changes you make will apply only to the current workload and will not be saved back to the original template.
Select NVIDIA NIM from the Inference type options (if applicable)
Enter a unique name for the workload. If the name already exists in the project, you will be requested to submit a different name.
Click CONTINUE
Setting Up an Environment
Set the model name by selecting a model from the dropdown list or entering the model name
Set how the model profile should be selected. A NIM model profile sets compatible model engines and criteria for engine selection, such as precision, latency, throughput optimization, and GPU requirements. Profiles are optimized to balance either latency or throughput, with quantized profiles (e.g., fp8) preferred to reduce memory usage and enhance performance.
Automatically (recommended) - NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile consists of different parameters that influence the selection process.
Manually - Enter profile name or unique hash identifier
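To find a valid profile name or hash for manual selection, you can list the profiles bundled with the NIM image outside of NVIDIA Run:ai. The shell sketch below is illustrative: it assumes local Docker access to a GPU host, uses an example image, and relies on the list-model-profiles utility described in the NVIDIA NIM documentation.

# Illustrative only - substitute the NIM image you plan to deploy
export NGC_API_KEY=<NGC API key>
docker run --rm --gpus all -e NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest \
  list-model-profiles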
Set how to access NGC by choosing one of the following:
If you choose not to allow access to NGC, load the model from a local model store. Go to Setting up data & storage.
Provide a token by entering your NVIDIA NGC API key. To obtain a key, go to NGC → Setup → API Key, then generate or copy an existing key.
Select credential from a predefined list of Shared secrets or My Credentials. My credentials are Generic secret credentials created via User settings. This credential must contain an NGC_API_KEY and must already exist in the same namespace as the workload. To obtain a key, go to NGC → Setup → API Key, then generate or copy an existing key. To add a new Generic secret credential:
Click +NEW CREDENTIALS
Enter a name
The key is set to NGC_API_KEY and cannot be changed
Enter a value - your <NGC API key>
Click CREATE CREDENTIAL
When created, the credential is also saved under User settings → Credentials, where it can be reused in future workloads.
Set an inference serving endpoint
Select HTTP or gRPC and enter the corresponding container port
Modify who can access the endpoint. See Accessing the inference workload for more details:
By default, Public is selected, giving everyone within the network access to the endpoint with no authentication
If you select All authenticated users and applications, access is given to everyone within the organization’s account that can log in (to NVIDIA Run:ai or SSO).
For Specific group(s), enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access.
For Specific user(s) and application(s), enter a valid user email or name. If you remove yourself, you will lose access.
Load from existing setup
Click the load icon. A side pane appears, displaying a list of available environments. Select an environment from the list.
Optionally, customize any of the environment’s predefined fields as shown below. The changes will apply to this workload only and will not affect the selected environment.
Alternatively, click the ➕ icon in the side pane to create a new environment. For step-by-step instructions, see Environments.
Provide your own settings
Manually configure the settings below as needed. The changes will apply to this workload only.
Configure environment
Set the environment image:
Select Custom image and add the Image URL or update the URL of the existing setup
Select from the NGC public registry and choose the image name and tag from the dropdown.
Set the image pull policy - the condition for pulling the image. It is recommended to pull the image only if it is not already present on the host.
Set an inference serving endpoint. The connection protocol and the container port are defined within the environment:
Select HTTP or gRPC and enter a corresponding container port
Modify who can access the endpoint. See Accessing the inference workload for more details:
By default, Public is selected, giving everyone within the network access to the endpoint with no authentication
If you select All authenticated users and applications, access is given to everyone within the organization’s account that can log in (to NVIDIA Run:ai or SSO).
For Specific group(s), enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access.
For Specific user(s) and application(s), enter a valid user email or name. If you remove yourself, you will lose access.
Set the connection for your tool(s). If you are loading from an existing setup, the tools are configured as part of the environment.
Select the connection type - External URL or NodePort:
Auto generate - A unique URL / port is automatically created for each workload using the environment.
Custom URL / Custom port - Manually define the URL or port. For a custom port, make sure to enter a port between 30000 and 32767. If the node port is already in use, the workload will fail and display an error message.
Modify who can access the tool:
By default, All authenticated users and applications is selected, giving access to everyone within the organization’s account.
For Specific group(s), enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access to the tool.
For Specific user(s) and application(s), enter a valid user email or name. If you remove yourself, you will lose access to the tool.
Set the command and arguments for the container running the workload. If no command is added, the container will use the image’s default command (entry-point).
Modify the existing command or click +COMMAND & ARGUMENTS to add a new command.
Set multiple arguments separated by spaces, using the following format (e.g., --arg1=val1).
Set the environment variable(s):
Modify the existing environment variable(s) or click +ENVIRONMENT VARIABLE. The existing environment variables may include instructions to guide you with entering the correct values.
You can select Custom to define your own variable or choose from a predefined list of Secrets, ConfigMaps and My credentials. My credentials are Docker registry or Generic secret credentials created via User settings. The credential must already exist in the same namespace as the workload.
Some environment variables are injected by NVIDIA Run:ai. See Built-in workload environment variables for more details.
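For illustration, the variables below are commonly set for NIM workloads: NGC_API_KEY authenticates model downloads from NGC and NIM_CACHE_PATH controls where downloaded engines are cached. The names follow the NVIDIA NIM documentation; the values shown are placeholders.

# Example values only - enter these through the +ENVIRONMENT VARIABLE fields
NGC_API_KEY=<NGC API key>         # pulls models and engines from NGC
NIM_CACHE_PATH=/opt/nim/.cache    # cache location inside the container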
Enter a path pointing to the container's working directory
Set where the UID, GID, and supplementary groups for the container should be taken from. If you select Custom, you’ll need to manually enter the UID, GID and Supplementary groups values.
Select additional Linux capabilities for the container from the drop-down menu. This grants certain privileges to a container without granting all the root user's privileges.
Select the NIM model. Set the model name by selecting a model or entering the model name as displayed in NIM.
Set how the model profile should be selected. A NIM model profile sets compatible model engines and criteria for engine selection, such as precision, latency, throughput optimization, and GPU requirements. Profiles are optimized to balance either latency or throughput, with quantized profiles (e.g., fp8) preferred to reduce memory usage and enhance performance.
Automatically (recommended) - NIM is designed to automatically select the most suitable profile from the list of compatible profiles based on the detected hardware. Each profile consists of different parameters that influence the selection process.
Manually - Enter profile name or hash
Set how to access NGC by choosing one of the following:
If you choose not to allow access to NGC, load the model from a local model store. Go to Setting up data & storage.
Provide a token by entering your NVIDIA NGC API key. To obtain the key, go to NGC → Setup → API Key, then generate or copy an existing key.
Select credential. If you are selecting an existing Generic secret credential, make sure it contains an NGC_API_KEY. To add a new Generic secret credential:
Click +NEW CREDENTIAL
Enter a name
Select New secret:
Enter a key - NGC_API_KEY
Enter a value - your <NGC API key>
Click CREATE CREDENTIAL
Once created, the new credential is automatically selected. For step-by-step instructions, see Credentials.
Set an inference serving endpoint
Select HTTP or gRPC and enter the corresponding container port
(Optional) Modify who can access the endpoint. See Accessing the inference workload for more details:
By default, Public is selected, giving everyone within the network access to the endpoint with no authentication
If you select All authenticated users and applications, access is given to everyone within the organization’s account that can log in (to NVIDIA Run:ai or SSO).
For Specific group(s), enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access.
For Specific user(s) and application(s), enter a valid user email or name. If you remove yourself, you will lose access.
Setting Up Compute Resources
Load from existing setup
Click the load icon. A side pane appears, displaying a list of available compute resources. Select a compute resource from the list.
Optionally, customize any of the compute resource's predefined fields as shown below. The changes will apply to this workload only and will not affect the selected compute resource.
Alternatively, click the ➕ icon in the side pane to create a new compute resource. For step-by-step instructions, see Compute resources.
Provide your own settings
Manually configure the settings below as needed. The changes will apply to this workload only.
Configure compute resources
Set the number of GPU devices per pod (physical GPUs).
Enable GPU fractioning to set the GPU memory per device using either a fraction of a GPU device’s memory (% of device) or a GPU memory unit (MB/GB):
Request - The minimum GPU memory allocated per device. Each pod in the workload receives at least this amount per device it uses.
Limit - The maximum GPU memory allocated per device. Each pod in the workload receives at most this amount of GPU memory for each device it utilizes. This is disabled by default.
Set the CPU resources
Set CPU compute resources per pod by choosing the unit (cores or millicores):
Request - The minimum amount of CPU compute provisioned per pod. Each running pod receives this amount of CPU compute.
Limit - The maximum amount of CPU compute a pod can use. Each pod receives at most this amount of CPU compute. By default, the limit is set to Auto which means that the pod may consume up to the node's maximum available CPU compute resources.
Set the CPU memory per pod by selecting the unit (MB or GB):
Request - The minimum amount of CPU memory provisioned per pod. Each running pod receives this amount of CPU memory.
Limit - The maximum amount of CPU memory a pod can use. Each pod receives at most this amount of CPU memory. By default, the limit is set to Auto which means that the pod may consume up to the node's maximum available CPU memory resources.
Set extended resource(s)
Enable Increase shared memory size to allow the shared memory size available to the pod to increase from the default 64MB to the node's total available memory or the CPU memory limit, if set above.
Click +EXTENDED RESOURCES to add resource/quantity pairs. For more information on how to set extended resources, see the Extended resources and Quantity guides.
Set the minimum and maximum number of replicas to be scaled up and down to meet the changing demands of inference services:
If the minimum and maximum number of replicas differ, autoscaling will be triggered and you'll need to set conditions for creating a new replica. A replica will be created every time a condition is met. When a condition is no longer met after a replica was created, the replica will be automatically deleted to save resources.
Select one of the variables to set the conditions for creating a new replica. The variable's values will be monitored via the container's port. When you set a value, this value is the threshold at which autoscaling is triggered.
Set when the replicas should be automatically scaled down to zero. This allows compute resources to be freed up when the model is inactive (i.e., there are no requests being sent). Automatic scaling to zero is enabled only when the minimum number of replicas in the previous step is set to 0.
Set the order of priority for the node pools on which the Scheduler tries to run the workload. When a workload is created, the Scheduler will try to run it on the first node pool on the list. If the node pool doesn't have free resources, the Scheduler will move on to the next one until it finds one that is available:
Drag and drop them to change the order, remove unwanted ones, or reset to the default order defined in the project.
Click +NODE POOL to add a new node pool from the list of node pools that were defined on the cluster. To configure a new node pool and for additional information, see Node pools.
Select a node affinity to schedule the workload on a specific node type. If the administrator added a ‘node type (affinity)’ scheduling rule to the project/department, then this field is mandatory. Otherwise, entering a node type (affinity) is optional. Nodes must be tagged with a label that matches the node type key and value.
Click +TOLERATION to allow the workload to be scheduled on a node with a matching taint. Select the operator and the effect:
If you select Exists, the effect will be applied if the key exists on the node.
If you select Equals, the effect will be applied if the key and the value set match the value on the node.
Select a compute resource or click +NEW COMPUTE RESOURCE to add a new compute resource to the gallery. For a step-by-step guide on adding compute resources to the gallery, see Compute resources. Once created, the new compute resource will be automatically selected.
Set the minimum and maximum number of replicas to be scaled up and down to meet the changing demands of inference services:
If the minimum and maximum number of replicas differ, autoscaling will be triggered and you'll need to set conditions for creating a new replica. A replica will be created every time a condition is met. When a condition is no longer met after a replica was created, the replica will be automatically deleted to save resources.
Select one of the variables to set the conditions for creating a new replica. The variable's values will be monitored via the container's port. When you set a value, this value is the threshold at which autoscaling is triggered.
Set when the replicas should be automatically scaled down to zero. This allows compute resources to be freed up when the model is inactive (i.e., there are no requests being sent). When automatic scaling to zero is enabled, the minimum number of replicas set in the previous step automatically changes to 0.
Optional: Set the order of priority for the node pools on which the Scheduler tries to run the workload. When a workload is created, the scheduler will try to run it on the first node pool on the list. If the node pool doesn't have free resources, the Scheduler will move on to the next one until it finds one that is available:
Drag and drop them to change the order, remove unwanted ones, or reset to the default order defined in the project.
Click +NODE POOL to add a new node pool from the list of node pools that were defined on the cluster. To configure a new node pool and for additional information, see Node pools.
Select a node affinity to schedule the workload on a specific node type. If the administrator added a ‘node type (affinity)’ scheduling rule to the project/department, then this field is mandatory. Otherwise, entering a node type (affinity) is optional. Nodes must be tagged with a label that matches the node type key and value.
Optional: Click +TOLERATION to allow the workload to be scheduled on a node with a matching taint. Select the operator and the effect:
If you select Exists, the effect will be applied if the key exists on the node.
If you select Equals, the effect will be applied if the key and the value set match the value on the node.
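For context, a toleration lets the workload run on a node that an administrator has tainted. The kubectl sketch below shows a hypothetical taint that a toleration with key dedicated, operator Equals, value inference, and effect NoSchedule would match.

# Hypothetical taint applied by a cluster administrator
kubectl taint nodes <node-name> dedicated=inference:NoSchedule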
Setting Up Data & Storage
Select the data source that will serve as the model store. If the model is already stored on the selected data source it will be loaded from there automatically. Otherwise, it will be stored on the selected data source during the first workload deployment:
Click the load icon. A side pane appears, displaying a list of available data sources. Select a data source from the list.
Optionally, customize any of the data source's predefined fields as shown below. The changes will apply to this workload only and will not affect the selected data source:
Container path - Enter the container path to set the data target location.
Alternatively, click the ➕ icon in the side pane to create a new data source. For step-by-step instructions, see Data sources.
Load from existing setup
Click the load icon. A side pane appears, displaying a list of available data sources/volumes. Select a data source/volume from the list.
Optionally, customize any of the data source's predefined fields as shown below. The changes will apply to this workload only and will not affect the selected data source.
Container path - Enter the container path to set the data target location.
ConfigMaps sub-path - Specify a sub-path (file/key) inside the ConfigMap to mount (for example, app.properties). This lets you mount a single file from an existing ConfigMap.
Alternatively, click the ➕ icon in the side pane to create a new data source/volume. For step-by-step instructions, see Data sources or Data volumes.
Configure data sources for a one-time configuration
Click the ➕ icon and choose the data source from the dropdown menu. You can add multiple data sources.
Once selected, set the data origin according to the required fields and enter the container path to set the data target location.
Select Volume to allocate a storage space to your workload that is persistent across restarts:
Set the Storage class to None or select an existing storage class from the list. To add new storage classes, and for additional information, see Kubernetes storage classes. If the administrator defined the storage class configuration, the rest of the fields will appear accordingly.
Select one or more access mode(s) and define the claim size and its units.
Select the volume mode. If you select Filesystem (default), the volume will be mounted as a filesystem, enabling the usage of directories and files. If you select Block, the volume is exposed as a block storage, which can be formatted or used directly by applications without a filesystem.
Set the Container path with the volume target location.
Select the data source that will serve as the model store. If the model is already stored on the selected data source it will be loaded from there automatically. Otherwise, it will be stored on the selected data source during the first workload deployment.
Select an existing data source where the model is already cached to reduce loading time.
To add a new data source, click + NEW DATA SOURCE. For a step-by-step guide, see Data sources. Once created, it will be automatically selected. This will cache the model and reduce loading time for future use.
Setting Up General Settings
Set the workload priority. Choose the appropriate priority level for the workload. Higher-priority workloads are scheduled before lower-priority ones.
Set the workload initialization timeout. This is the maximum amount of time the system will wait for the workload to start and become ready. If the workload does not start within this time, it will automatically fail. Enter a value between 5 seconds and 60 minutes. If you do not set a value, the default is taken from Knative’s max-revision-timeout-seconds.
Set the request timeout. This defines the maximum time allowed to process an end-user request. If the system does not receive a response within this time, the request will be ignored. Enter a value between 5 seconds and 10 minutes. If you do not set a value, the default is taken from Knative’s revision-timeout-seconds.
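If you leave these fields empty and want to confirm the Knative defaults in effect, the kubectl sketch below assumes a standard Knative Serving installation in the knative-serving namespace, where both max-revision-timeout-seconds and revision-timeout-seconds are defined.

# Inspect Knative's default timeouts (standard installation assumed)
kubectl get configmap config-defaults -n knative-serving -o yaml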
Set annotations(s). Kubernetes annotations are key-value pairs attached to the workload. They are used for storing additional descriptive metadata to enable documentation, monitoring and automation.
Set labels(s). Kubernetes labels are key-value pairs attached to the workload. They are used for categorizing to enable querying.
Completing the Workload
Decide if you wish to fine-tune your setup with additional preferences including environment details and data sources. If yes, click on the dropdown next to CREATE INFERENCE and select Advanced setup. Follow the advanced setup steps above.
Before finalizing your workload, review your configurations and make any necessary adjustments.
Click CREATE INFERENCE
Managing and Monitoring
After the inference workload is created, it is added to the Workloads table, where it can be managed and monitored.
NIM observability metrics are available for monitoring performance, including request throughput, latency, and token usage (for LLMs). These metrics can be accessed through the Workloads and Pods APIs. For the full list of metrics, response formats, and examples, see NIM observability metrics via API.
Accessing the Inference Workload
You can programmatically consume an inference workload via API by making direct calls to the serving endpoint, typically from other workloads or external integrations.
Once an inference workload is deployed, the serving endpoint URL appears in the Connections column of the inference workloads grid.
Access to the inference serving API depends on how the serving endpoint access was configured when submitting the inference workload:
If Public access is enabled, no authentication is required.
If restricted access is configured:
Authentication is performed either by a user (with a username and password), or by a user application or application (with client credentials).
Authorization to access the endpoint is enforced based on user, application or group membership.
Users relying on SSO should authenticate using their user applications.
Follow the steps below to obtain a token:
Use the Tokens API with:
grantType: password - for specific users or groups
grantType: client_credentials - for applications and user applications
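The sketch below illustrates a client_credentials token request. The endpoint path and body field names are assumptions based on common Tokens API usage; consult the Tokens API reference for the exact schema.

# Illustrative only - path and field names may differ; see the Tokens API reference
curl -X POST https://<control-plane-url>/api/v1/token \
  -H "Content-Type: application/json" \
  -d '{
    "grantType": "client_credentials",
    "clientId": "<application-id>",
    "clientSecret": "<application-secret>"
  }'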
Use the obtained token to make API calls to the inference serving endpoint. For example:
# Replace <serving-endpoint-url> and <model-name> (e.g., "meta-llama/Llama-3.1-8B-Instruct")
curl <serving-endpoint-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-access-token>" \
  -d '{
    "model": "<model-name>",
    "messages": [{ "role": "user", "content": "Write a short poem on AI" }]
  }'
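For LLM NIMs, the serving endpoint follows the OpenAI-compatible chat completions schema, so the response is a JSON object whose choices array contains the generated message.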
Using CLI
To view the available actions, see the inference workload CLI v2 reference.
Using API
To view the available actions for creating an inference workload, see the Inferences API reference.