Deploy Distributed Inference Workloads
This guide explains how to create a native distributed inference workload via the NVIDIA Run:ai UI.
An inference workload provides the setup and configuration needed to deploy your trained model for real-time or batch predictions. It includes specifications for the container image, data sets, network settings, and resource requests required to serve your models.
The inference workload is assigned to a project and is affected by the project’s quota.
To learn more about the inference workload type in NVIDIA Run:ai and determine that it is the most suitable workload type for your goals, see Workload types and features.

Before You Start
Make sure you have created a project or have one created for you.
Make sure LWS is properly installed by your administrator.
If the image you are planning to use is hosted in NGC, make sure you have an NGC account with an active NGC API key. To obtain a key, go to NGC → Setup → API Keys, then generate or copy an existing key. In NVIDIA Run:ai, store the key either as a shared secret (created by an administrator) or a user credential (created under User settings).
Workload Priority and Preemption
By default, inference workloads in NVIDIA Run:ai are assigned a Very high priority and are non-preemptible. These defaults ensures that inference workloads, which often serve real-time or latency-sensitive traffic, are guaranteed the resources they need. The default very-high priority ensures the fastest possible scheduling, and the default non-preemptible value guarantees that the workload will not be disrupted by other workloads once it starts running.
You can override the defaults by configuring priority and preemptibility. For more details, see Workload priority and preemption.
Workload Policies
When creating a new workload, fields and assets may have limitations or defaults. These rules and defaults are derived from a policy your administrator set.
Policies allow you to control, standardize, and simplify the workload submission process. For additional information, see Policies and rules.
The effects of the policy are reflected in the workspace creation form:
Defaults derived from the policy will be displayed automatically for specific fields.
Disabled actions and permitted value ranges for values will be visibly explained per field.
Rules and defaults for entire sections (such as environments, compute resources, or data sources) may prevent selection and will appear on the entire library card with an option for additional information via an external modal.
Leader Worker Set Architecture
Distributed inference workloads are based on a Leader Worker Set (LWS) architecture, where each replica consists of a leader pod and a set of worker pods. The workload is configured across three steps:
Leader - Defines the specification for the leader pod, which is responsible for coordinating the distributed inference workload and exposing the serving endpoint. The container image, compute resources, environment variables, and storage are configured independently for the leader.
Workers - Defines the specification for the worker pods, which are responsible for model inference processing. Worker pods can be configured with a different container image and compute resources than the leader. The number of worker pods is also defined in this step.
Workload Setup - Defines workload-level settings, including the number of replicas (each replica consists of one leader and its associated worker pods), scheduling priority, preemptibility, and the startup policy that determines when worker pods are created relative to the leader.

Creating a Custom Distributed Inference Workload
To create an inference workload, go to Workload manager → Workloads.
Click +NEW WORKLOAD and select Inference from the dropdown menu.
Within the new form, select the cluster and project. To create a new project, click +NEW PROJECT and refer to Projects for a step-by-step guide.
Set the inference workload architecture as Distributed. This workload will only be able to use environments that support distributed workloads. A distributed workload consists of multiple processes working together. These processes can run on different nodes.
Enter a unique name for the workload. If the name already exists in the project, you will be requested to submit a different name.
Click CONTINUE
Setting Up an Environment
Note
NGC catalog is disabled by default. If unavailable, your administrator must enable it under General settings → Workloads → NGC catalog.
To select an image from the NGC private registry, your administrator must configure it under General settings → Workloads → NGC private registry.
Load from existing setup
Click the load icon. A side pane appears, displaying a list of available environments. Select an environment from the list.
Optionally, customize any of the environment’s predefined fields as shown below. The changes will apply to this workload only and will not affect the selected environment.
Alternatively, click the ➕ icon in the side pane to create a new environment. For step-by-step instructions, see Environments.
Provide your own settings
Manually configure the settings below as needed. The changes will apply to this workload only.
Configure environment
Set the environment image:
Select Custom image and add the Image URL or update the URL of the existing setup.
Select from the NGC catalog and then set how to access the NGC catalog:
As a guest - Choose the image name and tag from the dropdown.
Authenticated - Configure access to NGC and then choose the image name and tag from the dropdown:
Under Source, select Shared secret or My credentials.
The Type field is fixed to NGC API key.
Under Credential name, select an existing credential from the dropdown. If you select My credentials, you can also create a new credential directly from the dropdown. The credential is also saved under User settings → Credentials.
Select from the NGC private registry and then set how to access the registry:
Select a registry from the dropdown.
Under Source, select Shared secret or My credentials.
The Type field is fixed to NGC API key.
Under Credential name, select an existing credential from the dropdown. If you select My credentials, you can also create a new credential directly from the dropdown. The credential is also saved under User settings → Credentials.
Set the image pull policy:
Set the condition for pulling the image. It is recommended to pull the image only if it's not already present on the host.
Set your Docker registry credential for pulling the image:
Under Source, select Shared secret or My credentials.
Under Type, select Docker registry or NGC API key.
Under Credential name, select an existing credential from the dropdown. If you select My credentials, you can also create a new credential directly from the dropdown. The credential is also saved under User settings → Credentials.
Set an inference serving endpoint. The connection protocol and the container port are defined within the environment:
Select HTTP and enter a corresponding container port
Modify who can access the endpoint. See Accessing the inference workload for more details:
By default, Public is selected giving everyone within the network access to the endpoint with no authentication
If you select All authenticated users and service accounts, access is given to everyone within the organization’s account that can log in (to NVIDIA Run:ai or SSO).
For Specific group(s), enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access.
For Specific user(s) and service account(s), enter a valid user email or name. If you remove yourself, you will lose access.
Set whether inference serving endpoints can be accessed externally or only by internal cluster traffic:
External (Public access) - Available only if your administrator has configured Knative to support external access. See the Inference requirements section for more details.
Internal only - Endpoint is accessible only within cluster traffic.
Set the command and arguments for the container running the workload. If no command is added, the container will use the image’s default command (entry-point).
Modify the existing command or click +COMMAND & ARGUMENTS to add a new command.
Set multiple arguments separated by spaces, using the following format (e.g.:
--arg1=val1).
Set the environment variable(s):
Modify the existing environment variable(s) or click +ENVIRONMENT VARIABLE. The existing environment variables may include instructions to guide you with entering the correct values.
You can either select Custom to define a value manually, or choose an existing value from Shared secret, My Credentials, or ConfigMap. If you select My credentials, you can also create a new credential directly from the dropdown. The credential is also saved under User settings → Credentials.
Some environment variables are automatically injected by NVIDIA Run:ai. See Built-in workload environment variables for more details.
Enter a path pointing to the container's working directory.
Set where the UID, GID, and supplementary groups for the container should be taken from. If you select Custom, you’ll need to manually enter the UID, GID and Supplementary groups values.
Select additional Linux capabilities for the container from the drop-down menu. This grants certain privileges to a container without granting all the root user's privileges.
Setting Up Compute Resources
Note
GPU memory limit is disabled by default. If unavailable, your administrator must enable it under General Settings → Resources → GPU resource optimization.
Load from existing setup
Click the load icon. A side pane appears, displaying a list of available compute resources. Select a compute resource from the list.
Optionally, customize any of the compute resource's predefined fields as shown below. The changes will apply to this workload only and will not affect the selected compute resource.
Alternatively, click the ➕ icon in the side pane to create a new compute resource. For step-by-step instructions, see Compute resources.
Provide your own settings
Manually configure the settings below as needed. The changes will apply to this workload only.
Configure compute resources
Set the number of workers for each replica. This setting is only available in the Workers step.
Set the number of GPU devices per pod (physical GPUs).
Enable GPU fractioning to set the GPU memory per device using either a fraction of a GPU device’s memory (% of device) or a GPU memory unit (MB/GB):
Request - The minimum GPU memory allocated per device. Each pod in the workload receives at least this amount per device it uses.
Limit - The maximum GPU memory allocated per device. Each pod in the workload receives at most this amount of GPU memory for each device(s) the pod utilizes. This is disabled by default, to enable see the above note.
Set the CPU resources
Set CPU compute resources per pod by choosing the unit (cores or millicores):
Request - The minimum amount of CPU compute provisioned per pod. Each running pod receives this amount of CPU compute.
Limit - The maximum amount of CPU compute a pod can use. Each pod receives at most this amount of CPU compute. By default, the limit is set to Auto which means that the pod may consume up to the node's maximum available CPU compute resources.
Set the CPU memory per pod by selecting the unit (MB or GB):
Request - The minimum amount of CPU memory provisioned per pod. Each running pod receives this amount of CPU memory.
Limit - The maximum amount of CPU memory a pod can use. Each pod receives at most this amount of CPU memory. By default, the limit is set to Auto which means that the pod may consume up to the node's maximum available CPU memory resources.
Set extended resource(s)
Enable Increase shared memory size to allow the shared memory size available to the pod to increase from the default 64MB to the node's total available memory or the CPU memory limit, if set above.
Click +EXTENDED RESOURCES to add resource/quantity pairs. For more information on how to set extended resources, see the Extended resources and Quantity guides.
Set the order of priority for the node pools on which the Scheduler tries to run the workload. When a workload is created, the Scheduler will try to run it on the first node pool on the list. If the node pool doesn't have free resources, the Scheduler will move on to the next one until it finds one that is available:
Drag and drop them to change the order, remove unwanted ones, or reset to the default order defined in the project.
Click +NODE POOL to add a new node pool from the list of node pools that were defined on the cluster. To configure a new node pool and for additional information, see Node pools.
Node pool priority is configured in the Leader step and inherited by the workers. It cannot be modified in the Workers step.
Select a node affinity to schedule the workload on a specific node type. If the administrator added a ‘node type (affinity)’ scheduling rule to the project/department, then this field is mandatory. Otherwise, entering a node type (affinity) is optional. Nodes must be tagged with a label that matches the node type key and value.
Click +TOLERATION to allow the workload to be scheduled on a node with a matching taint. Select the operator and the effect:
If you select Exists, the effect will be applied if the key exists on the node.
If you select Equals, the effect will be applied if the key and the value set match the value on the node.
Click +TOPOLOGY to let the workload be scheduled on nodes with a matching topology - same region, zone, placement group or any other topology you define.
Setting Up Data & Storage
Note
S3 data sources are not supported for inference workloads.
Load from existing setup
Click the load icon. A side pane appears, displaying a list of available data sources. Select a data source from the list.
Optionally, customize any of the data source's predefined fields as shown below. The changes will apply to this workload only and will not affect the selected data source:
Container path - Enter the container path to set the data target location.
ConfigMap sub-path - Specify a sub-path (file/key) inside the ConfigMap to mount (for example,
app.properties). This lets you mount a single file from an existing ConfigMap.
Alternatively, click the ➕ icon in the side pane to create a new data source/volume. For step-by-step instructions, see Data sources.
Configure data sources for a one-time configuration
Note
PVCs, Secrets, and ConfigMaps cannot be added as a one-time configuration.
Click the ➕ icon and choose the data source from the dropdown menu. You can add multiple data sources.
Once selected, set the data origin according to the required fields and enter the container path to set the data target location.
The required fields vary by data source. For detailed configuration options and usage guidelines for each data source type, see Data sources.
Configure EmptyDir for a one-time configuration
Use EmptyDir to allocate temporary storage that exists only for the lifetime of the workload. The EmptyDir volume is ephemeral, the volume and its data is deleted every time the workload’s status changes to “Stopped
Set the size and units to define the maximum storage capacity.
Enter the container path to set the data target location.
Select the storage medium:
Disk-backed storage - The data is stored on the node's filesystem, which persists across container restarts but not pod rescheduling, and is slower than memory
Memory-backed storage - The data is stored in RAM, which provides faster access but is not persistent, and will be lost if the pod restarts
Configure Volume for a one-time configuration
Select Volume to allocate a storage space to your workspace that is persistent across restarts:
Set the Storage class to None or select an existing storage class from the list. To add new storage classes, and for additional information, see Kubernetes storage classes. If the administrator defined the storage class configuration, the rest of the fields will appear accordingly.
Select one or more access mode(s) and define the claim size and its units.
Select the volume mode:
Filesystem (default) - The volume will be mounted as a filesystem, enabling the usage of directories and files.
Block - The volume is exposed as a block storage, which can be formatted or used directly by applications without a filesystem.
Enter the container path to set the data target location.
Optional: Click +VOLUME to set the volume needed for your workload. A volume allocates storage space to your workload that is persistent across restarts:
Set the Storage class to None or select an existing storage class from the list. To add new storage classes, and for additional information, see Kubernetes storage classes. If the administrator defined the storage class configuration, the rest of the fields will appear accordingly.
Select one or more access mode(s) and define the claim size and its units.
Select the volume mode. If you select Filesystem (default), the volume will be mounted as a filesystem, enabling the usage of directories and files. If you select Block, the volume is exposed as a block storage, which can be formatted or used directly by applications without a filesystem.
Set the Container path with the volume target location.
Setting Up General Settings
Set annotations(s). Kubernetes annotations are key-value pairs attached to the workload. They are used for storing additional descriptive metadata to enable documentation, monitoring and automation.
Set labels(s). Kubernetes labels are key-value pairs attached to the workload. They are used for categorizing to enable querying.
Configuring Workload Setup
Set the number of replicas for your workload. Each replica consists of one leader pod and its associated worker pods.
Set whether the workload may be interrupted by selecting Preemptible or Non-preemptible:
Non-preemptible workloads use the project's available GPU quota and will not be interrupted once they start running.
Preemptible workloads may be interrupted if resources are needed for higher-priority workloads.
Set the workload priority. Choose the appropriate priority level for the workload. Higher-priority workloads are scheduled before lower-priority ones.
Set when the workers' pod should be created:
When the leader pod is created - Create the workers’ pods alongside the leader pod to reduce startup time. This may cause the deployment to fail if the leader isn't ready.
Once the leader pod is ready - Wait for the leader’s pod to be ready before creating the workers’ pods.
Completing the Workload
Before finalizing your workload, review your configurations and make any necessary adjustments.
Click CREATE INFERENCE
Managing and Monitoring
After the workload is created, it is added to the Workloads table, where it can be managed and monitored.
Accessing the Inference Workload
You can programmatically consume an inference workload via API by making direct calls to the serving endpoint, typically from other workloads or external integrations. Once an inference workload is deployed, the serving endpoint URL appears in the Connections column of the Workloads table.
Access to the inference serving API depends on how the serving endpoint access was configured when submitting the inference workload:
If Public access is enabled, no authentication is required.
If restricted access is configured:
Authentication is performed either by a user (with a username/password), user access keys or service accounts (with client credentials).
Authorization to access the endpoint is enforced based on user, service account or group membership.
Users relying on SSO should authenticate using their user access keys.
Follow the below steps to obtain a token:
Use the Tokens API with:
grantType: passwordfor specific users or groupsgrantType: client_credentialsfor service accounts and user access keys
Use the obtained token to make API calls to the inference serving endpoint. For example:
Using CLI
To view the available actions, see the inference workload CLI v2 reference.
Using API
To view the available actions for creating an inference workload, see the Distributed Inferences API reference.
Last updated