# Hugging Face Inference Templates

This guide explains how to create Hugging Face inference templates for reuse during workload submission. To manage templates, see [Workload Templates](https://run-ai-docs.nvidia.com/saas/workloads-in-nvidia-run-ai/workload-templates).

{% hint style="info" %}
**Note**

* Flexible workload templates is enabled by default and applies only to flexible workload submission (also enabled by default). If unavailable, contact your administrator to enable it under **General settings** → Workloads → Flexible workload templates.
* Selecting the **Inference type** is enabled by default. If you cannot see it in the menu, contact your administrator to enable it under **General settings** → Workloads → Hugging Face models.
* **Docker registry URL for inference workloads** - For Knative-based inference workloads, Docker Hub credentials must be configured using `https://index.docker.io/v1/` as the registry URL. Credentials configured with `docker.io` result in `401 Unauthorized` errors for Knative-based inference workloads due to differences in how image digests are resolved during image pull. See [Credentials](https://run-ai-docs.nvidia.com/saas/workloads-in-nvidia-run-ai/workload-templates/inference-templates/broken-reference) for more details.
  {% endhint %}
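For context, Docker Hub pull credentials end up in a Kubernetes `kubernetes.io/dockerconfigjson` secret whose `auths` map is keyed by the registry URL, which is why the key must be `https://index.docker.io/v1/` rather than `docker.io`. A minimal sketch of that payload (the username and token below are placeholders):

```python
import base64
import json

def docker_config_json(username: str, token: str,
                       registry: str = "https://index.docker.io/v1/") -> str:
    """Build the .dockerconfigjson payload stored in a
    kubernetes.io/dockerconfigjson secret. The registry key must match
    what the runtime looks up at image pull time."""
    auth = base64.b64encode(f"{username}:{token}".encode()).decode()
    config = {"auths": {registry: {"username": username,
                                   "password": token,
                                   "auth": auth}}}
    return json.dumps(config)

payload = docker_config_json("alice", "dckr_pat_example")
assert "https://index.docker.io/v1/" in payload
```

If the `auths` key were `docker.io`, the lookup at pull time would miss and the pull would proceed unauthenticated, producing the `401 Unauthorized` errors described above.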

## Before You Start <a href="#workload-priority-class" id="workload-priority-class"></a>

* To access gated Hugging Face models, make sure to obtain an access token from [Hugging Face Account Settings](https://huggingface.co/) → Access Token. In NVIDIA Run:ai, store the token as a [shared secret](https://run-ai-docs.nvidia.com/saas/assets/credentials#generic-secret) (created by an administrator). The credential must be a Generic secret that contains an `HF_TOKEN`.
* If the image you are planning to use is hosted in NGC, make sure you have an NGC account with an active NGC API key. To obtain a key, go to [NGC](https://catalog.ngc.nvidia.com/) → Setup → API Keys, then generate or copy an existing key. In NVIDIA Run:ai, store the key either as a [shared secret](https://run-ai-docs.nvidia.com/saas/assets/credentials#ngc-api-key) (created by an administrator) or a [user credential](https://run-ai-docs.nvidia.com/saas/settings/user-settings/user-credentials) (created under User settings).
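Hugging Face Hub clients authenticate gated-model downloads with a bearer token, conventionally read from the `HF_TOKEN` environment variable that the Generic secret injects. A minimal sketch of how a workload might consume it (the token value here is a placeholder):

```python
import os

def hub_auth_headers() -> dict:
    """Read the token injected from the Generic secret and build the
    Authorization header the Hugging Face Hub expects for gated models."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; check the secret mapping")
    return {"Authorization": f"Bearer {token}"}

os.environ.setdefault("HF_TOKEN", "hf_example_token")  # placeholder for local runs
assert hub_auth_headers()["Authorization"].startswith("Bearer ")
```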

## Workload Priority and Preemption <a href="#workload-priority-class" id="workload-priority-class"></a>

By default, inference workloads in NVIDIA Run:ai are assigned a **Very high** priority and are **non-preemptible**. These defaults ensure that inference workloads, which often serve real-time or latency-sensitive traffic, are guaranteed the resources they need. The default very-high priority ensures the fastest possible scheduling, and the default non-preemptible value guarantees that the workload will not be disrupted by other workloads once it starts running.

You can override the defaults by configuring priority and preemptibility. For more details, see [Workload priority and preemption](https://run-ai-docs.nvidia.com/saas/platform-management/runai-scheduler/scheduling/workload-priority-control).

## Linking Assets

When loading an existing asset (environment or compute resource) into a template, you can choose whether to link the asset or use it without linking. Linked assets remain connected to the template. Any updates made to the original environment or compute resource are automatically reflected in the template. While linked, the asset fields in the template cannot be modified.

{% hint style="info" %}
**Note**

Linking data source assets is currently not supported.
{% endhint %}

## Adding a New Template

1. To add a new template, go to Workload manager → Templates.
2. Click **+NEW TEMPLATE** and select **Inference** from the dropdown menu.
3. Within the new inference template form, select the **scope**.
4. Select **Hugging Face** from the **Inference type** dropdown.
5. Enter a unique **name** for the inference template. If the name already exists in the project, you will be requested to submit a different name.
6. Click **CONTINUE**.

### Setting Up an Environment

{% hint style="info" %}
**Note**

* NGC catalog is disabled by default. If unavailable, your administrator must enable it under **General settings** → Workloads → NGC catalog.
* To select an image from the NGC private registry, your administrator must configure it under **General settings** → Workloads → NGC private registry.
  {% endhint %}

{% tabs %}
{% tab title="Initial setup" %}

1. Select or type a **language model** from the dropdown list, as displayed in Hugging Face.
2. If you select a gated model, set how to **access** Hugging Face by choosing one of the following. Make sure your token has the required permissions and that you have been granted access to the selected model:
   * **Provide a token** - Enter the **access token**.
   * **Select credential** - Select an existing Generic secret credential that contains an `HF_TOKEN` from the dropdown.
3. Set the **server type** that runs the model - [**vLLM**](https://docs.vllm.ai/en/latest/models/supported_models.html) or [**TGI**](https://huggingface.co/docs/text-generation-inference/en/supported_models).
4. Set whether inference serving endpoints can be accessed externally or only by internal cluster traffic:
   * **External (Public access)** - Available only if your administrator has configured **Knative** to support external access. See the [Inference requirements](https://run-ai-docs.nvidia.com/saas/getting-started/installation/install-using-helm/system-requirements#inference) section for more details.
   * **Internal only** - Endpoint is accessible only within cluster traffic.
5. Set who can access the inference **serving endpoint.** See [Accessing the inference workload](https://run-ai-docs.nvidia.com/saas/using-inference/hugging-face-inference#accessing-the-inference-workload) for more details:
   * By default, **Public** is selected, giving everyone within the network access to the endpoint with no authentication.
   * If you select **All authenticated users and service accounts**, access is given to everyone within the organization’s account who can log in (to NVIDIA Run:ai or via SSO).
   * For **Specific group(s)**, enter **group names** as they appear in your identity provider. You must be a member of one of the groups listed to have access.
   * For **Specific user(s) and service account(s)**, enter a valid user email or name. If you remove yourself, you will lose access.
     {% endtab %}

{% tab title="Advanced setup" %}
**Load from existing setup**

1. Click the **load** icon. A side pane appears, displaying a list of available environments. Select an environment from the list.
2. Alternatively, click the **➕** icon in the side pane to create a new environment. For step-by-step instructions, see [Environments](https://run-ai-docs.nvidia.com/saas/workloads-in-nvidia-run-ai/assets/environments).
3. Choose whether to **link** the environment when applying it to the template. See [Linking assets](#linking-assets) for more details.

**Provide your own settings**

Manually configure the settings below as needed.

**Configure environment**

1. Set the **environment image**:
   * Select **Custom image** and add the **Image URL** or update the URL of the existing setup.
   * **Select from the NGC catalog** and then set how to **access the NGC catalog**:
     * **As a guest** - Choose the **image name** and **tag** from the dropdown.
     * **Authenticated** - Configure **access** to NGC and then choose the **image name** and **tag** from the dropdown:
       * Under **Source**, select [**Shared secret**](https://run-ai-docs.nvidia.com/saas/assets/credentials#ngc-api-key) or [**My credentials**](https://run-ai-docs.nvidia.com/saas/settings/user-settings/user-credentials).
       * The **Type** field is fixed to **NGC API key**.
       * Under **Credential name**, select an existing credential from the dropdown. If you select **My credentials**, you can also create a new credential directly from the dropdown. The credential is also saved under **User settings** → Credentials.
   * **Select from the NGC private registry** and then set how to access the registry:
     * Select a **registry** from the dropdown.
     * Under **Source**, select [**Shared secret**](https://run-ai-docs.nvidia.com/saas/assets/credentials#ngc-api-key) or [**My credentials**](https://run-ai-docs.nvidia.com/saas/settings/user-settings/user-credentials).
     * The **Type** field is fixed to **NGC API key**.
     * Under **Credential name**, select an existing credential from the dropdown. If you select **My credentials**, you can also create a new credential directly from the dropdown. The credential is also saved under **User settings** → Credentials.
2. Set the **image pull policy**, the condition for pulling the image. It is recommended to pull the image only if it is not already present on the host.
3. Set whether inference serving endpoints can be accessed externally or only by internal cluster traffic:
   * **External (Public access)** - Available only if your administrator has configured **Knative** to support external access. See the [Inference requirements](https://run-ai-docs.nvidia.com/saas/getting-started/installation/install-using-helm/system-requirements#inference) section for more details.
   * **Internal only** - Endpoint is accessible only within cluster traffic.
4. Set an inference **serving endpoint**. The connection protocol and the container port are defined within the environment:
   * Select **HTTP** or **gRPC** and enter a corresponding **container port**.
   * Modify who can access the endpoint. See [Accessing the inference workload](https://run-ai-docs.nvidia.com/saas/using-inference/hugging-face-inference#accessing-the-inference-workload) for more details:
     * By default, **Public** is selected, giving everyone within the network access to the endpoint with no authentication.
     * If you select **All authenticated users and service accounts**, access is given to everyone within the organization’s account who can log in (to NVIDIA Run:ai or via SSO).
     * For **Specific group(s)**, enter **group names** as they appear in your identity provider. You must be a member of one of the groups listed to have access.
     * For **Specific user(s) and service account(s)**, enter a valid user email or name. If you remove yourself, you will lose access.
5. Set the connection for your **tool(s)**. If you are loading from an existing setup, the tools are configured as part of the environment.
   * Select the connection type:
     * **Auto generate** - A unique URL / port is automatically created for each workload using the environment.
     * **Custom URL** / **Custom port** - Manually define the URL or port. For a custom port, make sure to enter a port between `30000` and `32767`. If the node port is already in use, the workload will fail and display an error message.
     * **Load Balancer** - Set the container port. Connection handling is managed by the load balancer. For more information, see [External access to containers](https://run-ai-docs.nvidia.com/saas/infrastructure-setup/advanced-setup/container-access/external-access-to-containers).
   * Modify who can **access** the tool:
     * By default, **All authenticated users and service accounts** is selected, giving access to everyone within the organization’s account.
     * For **Specific group(s)**, enter **group names** as they appear in your identity provider. You must be a member of one of the groups listed to have access to the tool.
     * For **Specific user(s) and service account(s)**, enter a valid user email or name. If you remove yourself, you will lose access to the tool.
6. Set the **command and arguments** for the container running the workload. If no command is added, the container will use the image’s default command (entry-point).
   * Modify the existing command or click **+COMMAND & ARGUMENTS** to add a new command.
   * Set multiple arguments separated by spaces (e.g., `--arg1=val1 --arg2=val2`).
7. Set the **environment variable(s)**:

   * Modify the existing environment variable(s) or click **+ENVIRONMENT VARIABLE**. The existing environment variables may include instructions to guide you with entering the correct values.
   * You can either select **Custom** to define a value manually, or choose an existing value from [**Shared secret**](https://run-ai-docs.nvidia.com/saas/workloads-in-nvidia-run-ai/assets/credentials) or [**ConfigMap**](https://run-ai-docs.nvidia.com/saas/assets/datasources#creating-configmaps-in-advance).

   Some environment variables are automatically injected by NVIDIA Run:ai. See [Built-in workload environment variables](https://run-ai-docs.nvidia.com/saas/workloads#built-in-workload-environment-variables) for more details.
8. Enter a path pointing to the **container's working directory**.
9. Set where the UID, GID, and supplementary groups for the container should be taken from. If you select **Custom**, you’ll need to manually enter the **UID**, **GID**, and **Supplementary groups** values.
10. Select additional Linux capabilities for the container from the drop-down menu. This grants certain privileges to a container without granting all the root user's privileges.
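The space-separated argument string from step 6 is split into individual argv entries the same way a shell would split them. A quick illustration using Python's `shlex` (the flag names are only examples):

```python
import shlex

# "--arg1=val1 --arg2=val2" becomes two separate argv entries
assert shlex.split("--arg1=val1 --arg2=val2") == ["--arg1=val1", "--arg2=val2"]

# Quoting keeps a value containing spaces together as a single argument
assert shlex.split('--arg1="a b"') == ["--arg1=a b"]
```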
    {% endtab %}
    {% endtabs %}

### Setting Up Compute Resources

{% hint style="info" %}
**Note**

* Setting up compute resources is available in both the initial and advanced setup forms. If you configure it during the initial setup, the values will be populated into the advanced setup. Any changes made during the advanced setup will override the initial configuration and be applied to the template.
* GPU memory limit is disabled by default. If unavailable, your administrator must enable it under **General settings** → Resources → GPU resource optimization.
* Replica autoscaling and toleration(s) can still be modified even when the compute resource asset is linked.
  {% endhint %}

**Load from existing setup**

1. Click the **load** icon. A side pane appears, displaying a list of available compute resources. Select a compute resource from the list.
2. Alternatively, click the **➕** icon in the side pane to create a new compute resource. For step-by-step instructions, see [Compute resources](https://run-ai-docs.nvidia.com/saas/workloads-in-nvidia-run-ai/assets/compute-resources).
3. Choose whether to **link** the compute resource when applying it to the template. See [Linking assets](#linking-assets) for more details.

**Provide your own settings**

Manually configure the settings below as needed.

**Configure compute resources**

1. Set the number of **GPU devices** per pod (physical GPUs).
2. Enable **GPU fractioning** to set the GPU memory per device using either a fraction of a GPU device’s memory **(% of device)** or a GPU memory unit **(MB/GB)**:
   * **Request** - The minimum GPU memory allocated per device. Each pod in the workload receives at least this amount per device it uses.
   * **Limit** - The maximum GPU memory allocated per device. Each pod in the workload receives **at most** this amount of GPU memory for each device it utilizes. This is disabled by default; to enable it, see the note above.
3. Set the **CPU resources**
   * Set **CPU compute resources** per pod by choosing the unit (**cores** or **millicores**):
     * **Request** - The minimum amount of CPU compute provisioned per pod. Each running pod receives this amount of CPU compute.
     * **Limit** - The maximum amount of CPU compute a pod can use. Each pod receives **at most** this amount of CPU compute. By default, the limit is set to **Auto** which means that the pod may consume up to the node's maximum available CPU compute resources.
   * Set the **CPU memory per pod** by selecting the unit (**MB** or **GB**):
     * **Request** - The minimum amount of CPU memory provisioned per pod. Each running pod receives this amount of CPU memory.
     * **Limit** - The maximum amount of CPU memory a pod can use. Each pod receives at most this amount of CPU memory. By default, the limit is set to **Auto** which means that the pod may consume up to the node's maximum available CPU memory resources.
4. Set **extended resource(s)**
   * Enable **Increase shared memory size** to allow the shared memory size available to the pod to increase from the default 64MB to the node's total available memory or the CPU memory limit, if set above.
   * Click **+EXTENDED RESOURCES** to add resource/quantity pairs. For more information on how to set extended resources, see the [Extended resources](https://kubernetes.io/docs/tasks/configure-pod-container/extended-resource/) and [Quantity](https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/quantity/) guides.
5. Set the **minimum and maximum** number of replicas to be scaled up and down to meet the changing demands of inference services:
   * If the minimum and maximum number of replicas differ, autoscaling will be triggered and you'll need to set **conditions for creating a new replica**. A replica will be created every time a condition is met. When a condition is no longer met after a replica was created, the replica will be automatically deleted to save resources.
   * Select one of the **variables to set** the conditions for creating a new replica. The variable's values will be monitored via the container's port. When you set a **value**, this value is the threshold at which autoscaling is triggered.
6. Set when the replicas should be automatically **scaled down to zero**. This allows compute resources to be freed up when the model is inactive (i.e., there are no requests being sent). Automatic scaling to zero is enabled only when the minimum number of replicas in the previous step is set to 0.
7. Click **+TOLERATION** to allow the workload to be scheduled on a node with a matching taint. Select the **operator** and the **effect**:
   * If you select **Exists**, the effect will be applied if the key exists on the node.
   * If you select **Equals,** the effect will be applied if the key and the value set match the value on the node.
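The autoscaling behavior in steps 5 and 6 can be sketched as a simple threshold rule: the desired replica count grows with the monitored metric divided by the per-replica threshold, clamped to the configured minimum and maximum. This is a simplified model for intuition only; real autoscalers such as Knative's also smooth the metric over time windows:

```python
import math

def desired_replicas(metric_value: float, threshold: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each one stays at or below the threshold,
    clamped to [min_replicas, max_replicas]."""
    if metric_value <= 0:
        return min_replicas  # scale-to-zero only when min_replicas == 0
    needed = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, needed))

# 90 concurrent requests at a threshold of 25 per replica -> 4 replicas
assert desired_replicas(90, 25, min_replicas=1, max_replicas=8) == 4
# No traffic and min_replicas=0 -> scale down to zero
assert desired_replicas(0, 25, min_replicas=0, max_replicas=8) == 0
```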

### Setting Up Data & Storage

{% hint style="info" %}
**Note**

* **Advanced setup** - Data volumes is enabled by default. If unavailable, contact your administrator to enable it under **General settings** → Workloads → Data volumes.
* **Advanced setup** - If Data volumes is not enabled, **Data & storage** appears as **Data sources** only, and no data volumes will be available.
* S3 data sources are not supported for inference workloads.
  {% endhint %}

{% tabs %}
{% tab title="Initial setup" %}
Select the data source that will serve as the **model store**. If the model is already stored on the selected data source, it will be loaded from there automatically. Otherwise, it will be stored on the selected data source during the first workload deployment.

1. Click the **load** icon. A side pane appears, displaying a list of available data sources. Select a data source from the list.
2. Optionally, customize any of the data source's predefined fields. The changes will apply to this template only and will not affect the selected data source:
   * **Container path** - Enter the **container path** to set the **data target location**.
3. Alternatively, click the **➕** icon in the side pane to create a new data source. For step-by-step instructions, see [Data sources](https://run-ai-docs.nvidia.com/saas/workloads-in-nvidia-run-ai/assets/datasources).
   {% endtab %}

{% tab title="Advanced setup" %}
**Load from existing setup**

1. Click the **load** icon. A side pane appears, displaying a list of available data sources/volumes. Select a data source/volume from the list.
2. Optionally, customize any of the data source's predefined fields as shown below. The changes will apply to this template only and will not affect the selected data source:
   * **Container path** - Enter the **container path** to set the **data target location**.
   * **ConfigMap** **sub-path** - Specify a **sub-path** (file/key) inside the ConfigMap to mount (for example, `app.properties`). This lets you mount a single file from an existing ConfigMap.
3. Alternatively, click the **➕** icon in the side pane to create a new data source/volume. For step-by-step instructions, see [Data sources](https://run-ai-docs.nvidia.com/saas/workloads-in-nvidia-run-ai/assets/datasources) or [Data volumes](https://run-ai-docs.nvidia.com/saas/workloads-in-nvidia-run-ai/assets/data-volumes).

**Configure data sources for a one-time configuration**

{% hint style="info" %}
**Note**

[PVCs](https://run-ai-docs.nvidia.com/saas/assets/datasources#pvc), [Secrets](https://run-ai-docs.nvidia.com/saas/assets/datasources#secret), [ConfigMaps](https://run-ai-docs.nvidia.com/saas/assets/datasources#configmap) and [Data volumes](https://run-ai-docs.nvidia.com/saas/workloads-in-nvidia-run-ai/assets/data-volumes) cannot be added as a one-time configuration.
{% endhint %}

1. Click the **➕** icon and choose the data source from the dropdown menu. You can add multiple data sources.
2. Once selected, set the **data origin** according to the required fields and enter the **container path** to set the **data target location**.

The required fields vary by data source. For detailed configuration options and usage guidelines for each data source type, see [Data sources](https://run-ai-docs.nvidia.com/saas/workloads-in-nvidia-run-ai/assets/datasources).

**Configure EmptyDir for a one-time configuration**

Use **EmptyDir** to allocate temporary storage that exists only for the lifetime of the workload. The **EmptyDir** volume is ephemeral: the volume and its data are deleted every time the workload’s status changes to “Stopped”.

1. Set the **size** and **units** to define the maximum storage capacity.
2. Enter the **container path** to set the **data target location**.
3. Select the **storage medium**:
   * **Disk-backed storage** - The data is stored on the node's filesystem, which persists across container restarts but not pod rescheduling, and is slower than memory.
   * **Memory-backed storage** - The data is stored in RAM, which provides faster access but is not persistent; the data will be lost if the pod restarts.

**Configure Volume for a one-time configuration**

Select **Volume** to allocate a storage space for your workload that is persistent across restarts:

1. Set the **Storage class** to **None** or select an existing storage class from the list. To add new storage classes, and for additional information, see [Kubernetes storage classes](https://run-ai-docs.nvidia.com/saas/infrastructure-setup/procedures/shared-storage). If the administrator defined the storage class configuration, the rest of the fields will appear accordingly.
2. Select one or more **access mode(s)** and define the **claim size** and its **units**.
3. Select the **volume mode**:
   1. **Filesystem** (default) - The volume will be mounted as a filesystem, enabling the usage of directories and files.
   2. **Block** - The volume is exposed as block storage, which can be formatted or used directly by applications without a filesystem.
4. Enter the **container path** to set the **data target location**.
   {% endtab %}
   {% endtabs %}

### Setting Up General Settings

{% hint style="info" %}
**Note**

* The following general settings are optional.
* General settings are available via the Advanced setup form.
  {% endhint %}

1. Set whether the workload may be interrupted by selecting **Preemptible** or **Non-preemptible**:
   * Non-preemptible workloads use the project's available GPU quota and will not be interrupted once they start running.
   * Preemptible workloads may be interrupted if resources are needed for higher-priority workloads.
2. Set the **workload priority**. Choose the appropriate priority level for the workload. Higher-priority workloads are scheduled before lower-priority ones.
3. Set the **workload initialization timeout.** This is the maximum amount of time the system will wait for the workload to start and become ready. If the workload does not start within this time, it will automatically fail. Enter a value between 5 seconds and 720 minutes. If you do not set a value, the default is taken from [Knative’s max-revision-timeout-seconds](https://knative.dev/docs/serving/configuration/config-defaults/#max-revision-timeout-seconds).
4. Set the **request timeout.** This defines the maximum time allowed to process an end-user request. If the system does not receive a response within this time, the request will be ignored. Enter a value between 5 seconds and 10 minutes. If you do not set a value, the default is taken from [Knative’s revision-timeout-seconds](https://knative.dev/docs/serving/configuration/config-defaults/#revision-timeout-seconds).
5. Set **annotation(s)**. Kubernetes annotations are key-value pairs attached to the workload. They are used for storing additional descriptive metadata to enable documentation, monitoring, and automation.
6. Set **label(s)**. Kubernetes labels are key-value pairs attached to the workload. They are used for categorizing workloads to enable querying.
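The two timeouts in steps 3 and 4 share a 5-second floor but have different ceilings. A small helper makes the bounds concrete (the helper itself is illustrative, not part of the product; values are in seconds):

```python
def validate_timeout(seconds: int, max_seconds: int, name: str) -> int:
    """Both timeouts share a 5-second floor; the ceiling differs:
    720 minutes for initialization, 10 minutes for requests."""
    if not 5 <= seconds <= max_seconds:
        raise ValueError(f"{name} must be between 5 and {max_seconds} seconds")
    return seconds

# Initialization timeout: 5 s to 720 min
assert validate_timeout(300, 720 * 60, "initialization timeout") == 300
# Request timeout: 5 s to 10 min
assert validate_timeout(30, 10 * 60, "request timeout") == 30
```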

### Completing the Template

1. Decide if you wish to fine-tune your setup with additional preferences including environment details and data sources. If yes, click on the dropdown next to **CREATE TEMPLATE** and select **Advanced setup**. Follow the advanced setup steps above.
2. Before finalizing your template, review your configurations and make any necessary adjustments.
3. Click **CREATE TEMPLATE**.

## Using API

Go to the [Workload templates](https://app.gitbook.com/s/b5QLzc5pV7wpXz3CDYyp/workloads/workload-templates) API reference to view the available actions.
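As a rough sketch of programmatic use, a template submission is a JSON body sent with a bearer token. The path and field names below are assumptions for illustration only; consult the Workload templates API reference above for the actual schema:

```python
import json

# Illustrative only: the path and field names are assumptions, not the
# documented schema -- check the Workload templates API reference.
API_PATH = "/api/v1/workload-templates"  # assumed path
payload = {
    "name": "hf-inference-template",  # unique template name
    "scope": "project-a",             # assumed scope identifier
    "type": "inference",
}
headers = {
    "Authorization": "Bearer <APP_TOKEN>",  # placeholder token
    "Content-Type": "application/json",
}
body = json.dumps(payload)
assert json.loads(body)["type"] == "inference"
```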
