Custom Inference Templates

This section explains how to create custom inference templates for reuse during workload submission. To manage templates, see Workload Templates.

Note

Flexible workload templates is enabled by default and applies only to flexible workload submission, which is also enabled by default. If unavailable, contact your administrator to enable it under General settings → Workloads → Flexible workload templates.

Linking Assets

When loading an existing asset (an environment or a compute resource) into a template, you can choose whether to link the asset or use it without linking. Linked assets remain connected to the template: any updates made to the original environment or compute resource are automatically reflected in the template. While an asset is linked, its fields in the template cannot be modified.

Note

Linking data source assets is currently not supported.

Adding a New Template

  1. To add a new template, go to Workload manager → Templates.

  2. Click +NEW TEMPLATE and select Inference from the dropdown menu.

  3. Within the new inference template form, select the scope.

  4. Select Custom inference from the Inference type dropdown (if applicable).

  5. Enter a unique name for the inference template. If the name already exists in the project, you will be prompted to enter a different name.

  6. Click CONTINUE.

Setting Up an Environment

Note

NGC public registry is disabled by default. If unavailable, your administrator must enable it under General settings → Workloads → NGC public registry. When the NGC public registry is disabled, only the Image URL field is available.

Load from existing setup

  1. Click the load icon. A side pane appears, displaying a list of available environments. Select an environment from the list.

  2. Alternatively, click the icon in the side pane to create a new environment. For step-by-step instructions, see Environments.

  3. Choose whether to link the environment when applying it to the template. See Linking assets for more details.

Provide your own settings

Manually configure the settings below as needed; a sketch of the equivalent Kubernetes container spec follows the list.

Configure environment

  1. Set the environment image:

    • Select Custom image and add the Image URL, or update the URL of the existing setup.

    • Select from the NGC public registry and choose the image name and tag from the dropdown.

  2. Set the condition for pulling the image by selecting the image pull policy. It is recommended to pull the image only if it's not already present on the host.

  3. Set an inference serving endpoint. The connection protocol and the container port are defined within the environment:

    • Select HTTP or gRPC and enter a corresponding container port

    • Modify who can access the endpoint. See Accessing the inference workload for more details:

      • By default, Public is selected, giving everyone within the network access to the endpoint with no authentication.

      • If you select All authenticated users and applications, access is given to everyone within the organization’s account who can log in to NVIDIA Run:ai, directly or via SSO.

      • For Specific group(s), enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access.

      • For Specific user(s) and application(s), enter a valid user email or name. If you remove yourself, you will lose access.

  4. Set the connection for your tool(s). If you are loading from an existing setup, the tools are configured as part of the environment.

    • Select the connection type - External URL or NodePort:

      • Auto generate - A unique URL / port is automatically created for each workload using the environment.

      • Custom URL / Custom port - Manually define the URL or port. For custom port, make sure to enter a port between 30000 and 32767. If the node port is already in use, the workload will fail and display an error message.

    • Modify who can access the tool:

      • By default, All authenticated users and applications is selected, giving access to everyone within the organization’s account.

      • For Specific group(s), enter group names as they appear in your identity provider. You must be a member of one of the groups listed to have access to the tool.

      • For Specific user(s) and application(s), enter a valid user email or name. If you remove yourself, you will lose access to the tool.

  5. Set the command and arguments for the container running the workload. If no command is added, the container will use the image’s default command (entry-point).

    • Modify the existing command or click +COMMAND & ARGUMENTS to add a new command.

    • Set multiple arguments separated by spaces, using the format --arg1=val1.

  6. Set the environment variable(s):

    • Modify the existing environment variable(s) or click +ENVIRONMENT VARIABLE. The existing environment variables may include instructions to guide you with entering the correct values.

    • You can either select Custom to define your own variable, or choose from a predefined list of Secrets or ConfigMaps.

  7. Enter a path pointing to the container's working directory.

  8. Set where the UID, GID, and supplementary groups for the container should be taken from. If you select Custom, you’ll need to manually enter the UID, GID and Supplementary groups values.

  9. Select additional Linux capabilities for the container from the drop-down menu. This grants certain privileges to a container without granting all the root user's privileges.
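
As a rough illustration, the environment settings above correspond to fields of a standard Kubernetes container spec. The sketch below is a minimal Python rendering of such a spec, not the spec NVIDIA Run:ai actually generates; all names and values (the image, the hf-secret Secret, the UID/GID, and so on) are illustrative placeholders.

```python
# A minimal sketch of how the environment settings above map onto a
# Kubernetes container spec. All names and values are placeholders.
container = {
    "name": "inference-server",
    "image": "myregistry.example.com/my-model-server:1.2.0",  # Custom image URL
    "imagePullPolicy": "IfNotPresent",       # pull only if not already on the host
    "ports": [{"containerPort": 8080, "protocol": "TCP"}],  # HTTP serving endpoint
    "command": ["python", "serve.py"],       # overrides the image's default entry-point
    "args": ["--arg1=val1", "--arg2=val2"],  # arguments, space-separated in the UI
    "env": [
        {"name": "MODEL_NAME", "value": "my-model"},  # Custom variable
        {   # variable sourced from a predefined Secret
            "name": "HF_TOKEN",
            "valueFrom": {"secretKeyRef": {"name": "hf-secret", "key": "token"}},
        },
    ],
    "workingDir": "/app",                    # container's working directory
    "securityContext": {
        "runAsUser": 1000,                   # Custom UID
        "runAsGroup": 1000,                  # Custom GID
        "capabilities": {"add": ["NET_ADMIN"]},  # additional Linux capabilities
    },
}
# Supplementary groups are set at the pod level rather than per container:
pod_security_context = {"supplementalGroups": [4000]}
```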

Setting Up Compute Resources

Note

  • GPU memory limit is disabled by default. If unavailable, your administrator must enable it under General settings → Resources → GPU resource optimization.

  • Replica autoscaling and toleration(s) can still be modified even when the compute resource asset is linked.

Load from existing setup

  1. Click the load icon. A side pane appears, displaying a list of available compute resources. Select a compute resource from the list.

  2. Alternatively, click the icon in the side pane to create a new compute resource. For step-by-step instructions, see Compute resources.

  3. Choose whether to link the compute resource when applying it to the template. See Linking assets for more details.

Provide your own settings

Manually configure the settings below as needed; sketches of the equivalent Kubernetes resource settings and Knative autoscaling annotations follow the list.

Configure compute resources

  1. Set the number of GPU devices per pod (physical GPUs).

  2. Enable GPU fractioning to set the GPU memory per device using either a fraction of a GPU device’s memory (% of device) or a GPU memory unit (MB/GB):

    • Request - The minimum GPU memory allocated per device. Each pod in the workload receives at least this amount per device it uses.

    • Limit - The maximum GPU memory allocated per device. Each pod in the workload receives at most this amount of GPU memory for each device it utilizes. This is disabled by default; to enable it, see the note above.

  3. Set the CPU resources:

    • Set CPU compute resources per pod by choosing the unit (cores or millicores):

      • Request - The minimum amount of CPU compute provisioned per pod. Each running pod receives this amount of CPU compute.

      • Limit - The maximum amount of CPU compute a pod can use. Each pod receives at most this amount of CPU compute. By default, the limit is set to Auto which means that the pod may consume up to the node's maximum available CPU compute resources.

    • Set the CPU memory per pod by selecting the unit (MB or GB):

      • Request - The minimum amount of CPU memory provisioned per pod. Each running pod receives this amount of CPU memory.

      • Limit - The maximum amount of CPU memory a pod can use. Each pod receives at most this amount of CPU memory. By default, the limit is set to Auto which means that the pod may consume up to the node's maximum available CPU memory resources.

  4. Set extended resource(s):

    • Enable Increase shared memory size to allow the shared memory size available to the pod to increase from the default 64MB to the node's total available memory or the CPU memory limit, if set above.

    • Click +EXTENDED RESOURCES to add resource/quantity pairs. For more information on how to set extended resources, see the Extended resources and Quantity guides.

  5. Set the minimum and maximum number of replicas to be scaled up and down to meet the changing demands of inference services:

    • If the minimum and maximum number of replicas differ, autoscaling is enabled and you'll need to set conditions for creating a new replica. A replica is created every time a condition is met; when a condition is no longer met, the replica is automatically deleted to save resources.

    • Select one of the variables to set the conditions for creating a new replica. The variable's value is monitored through the container's port, and the value you set is the threshold at which autoscaling is triggered.

  6. Set when the replicas should be automatically scaled down to zero. This allows compute resources to be freed up when the model is inactive (i.e., there are no requests being sent). Automatic scaling to zero is enabled only when the minimum number of replicas in the previous step is set to 0.

  7. Click +TOLERATION to allow the workload to be scheduled on a node with a matching taint. Select the operator and the effect:

    • If you select Exists, the effect will be applied if the key exists on the node.

    • If you select Equals, the effect will be applied if the key and the value set match the value on the node.
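
The GPU, CPU, extended resource, shared memory, and toleration settings above have direct analogues in a standard Kubernetes pod spec. The sketch below shows those analogues with placeholder names and quantities; note that NVIDIA Run:ai handles GPU fractions through its own scheduler, so the nvidia.com/gpu line reflects only the plain-Kubernetes whole-device case.

```python
# A sketch of the compute settings above as a Kubernetes pod spec fragment.
# All names and quantities are placeholders.
pod_spec_fragment = {
    "containers": [{
        "name": "inference-server",
        "resources": {
            "requests": {
                "nvidia.com/gpu": "1",      # GPU devices per pod (whole devices)
                "cpu": "500m",              # CPU compute request (millicores)
                "memory": "2G",             # CPU memory request
                "example.com/dongle": "1",  # an extended resource/quantity pair
            },
            "limits": {
                "cpu": "2",                 # CPU compute limit (cores)
                "memory": "4G",             # CPU memory limit
                "example.com/dongle": "1",
            },
        },
        # "Increase shared memory size": a memory-backed emptyDir over /dev/shm
        "volumeMounts": [{"name": "dshm", "mountPath": "/dev/shm"}],
    }],
    "volumes": [{"name": "dshm", "emptyDir": {"medium": "Memory"}}],
    # Toleration: allow scheduling on a node with a matching taint
    "tolerations": [{
        "key": "dedicated",
        "operator": "Equal",   # or "Exists" to match on the key alone
        "value": "inference",
        "effect": "NoSchedule",
    }],
}
```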
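
The replica settings map onto Knative Serving, which backs inference workloads (the Knative defaults are referenced under Setting up general settings below). The sketch below shows the standard Knative autoscaling annotations that correspond to these settings; the actual configuration is derived from the values you enter in the UI, and the values shown are placeholders.

```python
# Standard Knative Serving autoscaling annotations, shown only to
# illustrate the replica settings above. Values are placeholders.
autoscaling_annotations = {
    "autoscaling.knative.dev/min-scale": "0",  # minimum replicas; 0 enables scale-to-zero
    "autoscaling.knative.dev/max-scale": "4",  # maximum replicas
    "autoscaling.knative.dev/metric": "concurrency",  # variable monitored per replica
    "autoscaling.knative.dev/target": "100",   # threshold that triggers a new replica
}
```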

Setting Up Data & Storage

Note

  • Data volumes is enabled by default. If unavailable, contact your administrator to enable it under General settings → Workloads → Data volumes.

  • If Data volumes is not enabled, Data & storage appears as Data sources only, and no data volumes will be available.

  • S3 data sources are not supported for inference workloads.

Load from existing setup

  1. Click the load icon. A side pane appears, displaying a list of available data sources/volumes. Select a data source/volume from the list.

  2. Optionally, customize any of the data source's predefined fields as shown below. The changes will apply to this template only and will not affect the selected data source:

    • Container path - Enter the container path to set the data target location.

    • ConfigMap sub-path - Specify a sub-path (file/key) inside the ConfigMap to mount (for example, app.properties). This lets you mount a single file from an existing ConfigMap, as shown in the sketch after this list.

  3. Alternatively, click the icon in the side pane to create a new data source/volume. For step-by-step instructions, see Data sources or Data volumes.
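
To illustrate the ConfigMap sub-path option above: in Kubernetes terms it corresponds to a volume mount with a subPath, which mounts a single key of a ConfigMap as one file. The sketch below uses placeholder names.

```python
# Mounting a single key from an existing ConfigMap as one file.
# Names are illustrative placeholders.
volumes = [{"name": "app-config", "configMap": {"name": "my-configmap"}}]
volume_mounts = [{
    "name": "app-config",
    "mountPath": "/etc/config/app.properties",  # container path (data target location)
    "subPath": "app.properties",                # single file/key inside the ConfigMap
}]
```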

Configure data sources for a one-time configuration

Note

PVCs, Secrets, ConfigMaps and Data volumes cannot be added as a one-time configuration.

  1. Click the icon and choose the data source from the dropdown menu. You can add multiple data sources.

  2. Once selected, set the data origin according to the required fields and enter the container path to set the data target location.

  3. Select Volume to allocate storage space to your workload that is persistent across restarts:

    • Set the Storage class to None or select an existing storage class from the list. To add new storage classes, and for additional information, see Kubernetes storage classes. If the administrator defined the storage class configuration, the rest of the fields will appear accordingly.

    • Select one or more access mode(s) and define the claim size and its units.

    • Select the volume mode. If you select Filesystem (default), the volume will be mounted as a filesystem, enabling the use of directories and files. If you select Block, the volume is exposed as block storage, which can be formatted or used directly by applications without a filesystem.

    • Set the Container path with the volume target location.
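
In Kubernetes terms, the Volume configuration above corresponds to a PersistentVolumeClaim. The sketch below shows one with placeholder names and sizes; the actual claim is created from the values you enter.

```python
# A sketch of the PersistentVolumeClaim corresponding to the one-time
# Volume configuration above. Names and sizes are placeholders.
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "my-workload-volume"},
    "spec": {
        "storageClassName": "standard",    # selected storage class (None omits this field)
        "accessModes": ["ReadWriteOnce"],  # one or more access modes
        "volumeMode": "Filesystem",        # or "Block" for raw block storage
        "resources": {"requests": {"storage": "10Gi"}},  # claim size and units
    },
}
```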

Setting Up General Settings

Note

The following general settings are optional.

  1. Set the workload priority. Choose the appropriate priority level for the workload. Higher-priority workloads are scheduled before lower-priority ones. See Workload priority control for more details.

  2. Set the workload initialization timeout. This is the maximum amount of time the system will wait for the workload to start and become ready. If the workload does not start within this time, it will automatically fail. Enter a value between 5 seconds and 60 minutes. If you do not set a value, the default is taken from Knative’s max-revision-timeout-seconds.

  3. Set the request timeout. This defines the maximum time allowed to process an end-user request. If the system does not receive a response within this time, the request will be ignored. Enter a value between 5 seconds and 10 minutes. If you do not set a value, the default is taken from Knative’s revision-timeout-seconds.

  4. Set annotation(s). Kubernetes annotations are key-value pairs attached to the workload. They are used for storing additional descriptive metadata to enable documentation, monitoring and automation.

  5. Set label(s). Kubernetes labels are key-value pairs attached to the workload. They are used for categorization, to enable querying.
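
As a rough illustration, the sketch below shows where some of these settings land in a Knative Service manifest: annotations and labels are plain key-value metadata, and the request timeout corresponds to Knative's per-revision timeoutSeconds. All keys and values are placeholders.

```python
# Illustrative placeholders for the optional general settings above.
knative_service_fragment = {
    "metadata": {
        "annotations": {"owner-team": "ml-platform"},    # descriptive metadata
        "labels": {"app": "my-model", "stage": "prod"},  # categorization / querying
    },
    "spec": {
        # Request timeout (5 seconds to 10 minutes); if unset, Knative's
        # revision-timeout-seconds default applies, as noted above.
        "timeoutSeconds": 300,
    },
}
```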

Completing the Template

  1. Before finalizing your template, review your configurations and make any necessary adjustments.

  2. Click CREATE TEMPLATE.

Using API

Go to the Workload templates API reference to view the available actions.
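
For orientation, the sketch below shows what a template-creation call might look like from Python. The endpoint path, payload shape, and token handling are assumptions for illustration only; consult the Workload templates API reference for the actual contract.

```python
import requests

# A hedged sketch of creating a template via the REST API. The path
# /api/v1/workload-templates and the payload fields are hypothetical.
BASE_URL = "https://my-company.run.ai"  # your NVIDIA Run:ai control-plane URL
TOKEN = "<bearer-token>"                # obtained per the API authentication docs

resp = requests.post(
    f"{BASE_URL}/api/v1/workload-templates",  # hypothetical endpoint path
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={                                    # hypothetical payload shape
        "name": "my-inference-template",
        "projectId": "proj-123",
        "spec": {"image": "myregistry.example.com/my-model-server:1.2.0"},
    },
)
resp.raise_for_status()
print(resp.json())
```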
