# Deploy Inference Workloads from Hugging Face

This section explains how to create an inference workload via the NVIDIA Run:ai UI using Hugging Face inference models.

An inference workload provides the setup and configuration needed to deploy your trained model for real-time or batch predictions. It includes specifications for the container image, data sets, network settings, and resource requests required to serve your models.

The inference workload is assigned to a project and is affected by the project's quota.

To learn more about the inference workload type in NVIDIA Run:ai and determine that it is the most suitable workload type for your goals, see [Workload types](https://run-ai-docs.nvidia.com/self-hosted/2.21/workloads-in-nvidia-run-ai/workload-types).

![](https://1836807109-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FUc7kDeOTlZaDiMM2pR07%2Fuploads%2Fgit-blob-0c29150c36bc6a84e593346bafd9a4c465d8db60%2Finference-workload.png?alt=media)

## Before You Start

* Make sure you have created a [project](https://run-ai-docs.nvidia.com/self-hosted/2.21/platform-management/aiinitiatives/organization/projects) or have one created for you.
* Make sure [Knative](https://run-ai-docs.nvidia.com/self-hosted/2.21/getting-started/installation/install-using-helm/system-requirements#inference) is properly installed by your administrator.

{% hint style="info" %}
**Note**

* The Inference workload type is disabled by default. If you cannot see it in the menu, it must be enabled by your administrator under **General settings** → Workloads → Models.
* Tolerations are disabled by default. If you cannot see Tolerations in the menu, they must be enabled by your administrator under **General settings** → Workloads → Tolerations.
* **Docker registry URL for inference workloads** - For Knative-based inference workloads, Docker Hub credentials must be configured using `https://index.docker.io/v1/` as the registry URL. Credentials configured with `docker.io` result in `401 Unauthorized` errors for Knative-based inference workloads due to differences in how image digests are resolved during image pull. See [Credentials](https://run-ai-docs.nvidia.com/self-hosted/2.21/assets/credentials#docker-registry) for more details.
{% endhint %}
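
For illustration only, this is what the required registry URL looks like when creating an equivalent Kubernetes image-pull secret directly with kubectl. The secret name, namespace, and account values below are placeholders; in NVIDIA Run:ai the credential itself is normally created through the Credentials asset linked above.

```bash
# Note the registry URL: https://index.docker.io/v1/ (not docker.io)
kubectl create secret docker-registry dockerhub-creds \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=<docker-hub-username> \
  --docker-password=<docker-hub-token> \
  -n <project-namespace>
```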

## Workload Priority

By default, inference workloads in NVIDIA Run:ai are assigned the `inference` priority which is non-preemptible. This behavior ensures that inference workloads, which often serve real-time or latency-sensitive traffic, are guaranteed the resources they need and will not be disrupted by other workloads. For more details, see [Workload priority control](https://run-ai-docs.nvidia.com/self-hosted/2.21/platform-management/runai-scheduler/scheduling/workload-priority-control).

## Creating a Hugging Face Inference Workload

To add a new inference workload:

1. Go to the Workload manager → Workloads
2. Click **+NEW WORKLOAD** and select **Inference**\
   Within the new inference form:
3. Select under which **cluster** to create the inference workload
4. Select the **project** in which your inference will run
5. Select **Hugging Face** from **Inference type**
6. Enter a unique **name** for the inference workload (if the name already exists in the project, you will be requested to submit a different name)
7. Click **CONTINUE**\
   In the next step:
8. Set the **model** and how to access it
   * Set the **model name** by selecting a model or entering the model name as displayed in Hugging Face.
   * Set how to **access** Hugging Face (a quick token check is shown at the end of this step)
     * **Provide a token** by entering the access token
     * **Select credential**
       * If you are selecting an existing credential, make sure it contains an **HF\_TOKEN**
       * To add a new credential, click **+NEW CREDENTIAL** and make sure to create one with an **HF\_TOKEN**. For a step-by-step guide on adding credentials to the gallery, see [Credentials](https://run-ai-docs.nvidia.com/self-hosted/2.21/workloads-in-nvidia-run-ai/assets/credentials). Once created, the new credential will be automatically selected.
   * Optional: Modify who can access the inference **serving endpoint**. See [Accessing the inference workload](#accessing-the-inference-workload) for more details:
     * **Public (default)**

       Everyone within the network can access the endpoint with no authentication
     * **All authenticated users**

       Everyone within the organization's account who can log in (to NVIDIA Run:ai or via SSO)
     * **Specific group(s)**
       * Click **+GROUP**
       * Enter **group names** as they appear in your identity provider. You must be a member of one of the groups listed to have access to the endpoint.
     * **Specific user(s)**
       * Click **+USER**
       * Enter a valid email address or username. If you remove yourself, you will lose access to the endpoint.
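
   If you provide a token directly, you can optionally check that it is valid before submitting the workload. A minimal sketch, assuming the token is exported as `HF_TOKEN` (`whoami-v2` is Hugging Face Hub's token-validation endpoint):

   ```bash
   # Returns your Hugging Face account details (JSON) if the token is valid;
   # returns an authentication error otherwise
   curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2
   ```
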
9. Select the **compute resource** for your inference workload
   * Select a compute resource or click **+NEW COMPUTE RESOURCE** to add a new compute resource to the gallery.\
     For a step-by-step guide on adding compute resources to the gallery, see [compute resources](https://run-ai-docs.nvidia.com/self-hosted/2.21/workloads-in-nvidia-run-ai/assets/compute-resources). Once created, the new compute resource will be automatically selected.
   * Optional: Set the **minimum and maximum** number of replicas to be scaled up and down to meet the changing demands of inference services.
   * If the minimum and maximum number of replicas differ, autoscaling will be triggered and you'll need to set **conditions for creating a new replica**. A replica will be created every time a condition is met. When a condition is no longer met after a replica was created, the replica will be automatically deleted to save resources (see the Knative illustration at the end of this step).
     * Select a **variable** - The variable's values will be monitored via the container's port.
       * **Latency (milliseconds)**
       * **Throughput (Requests/sec)**
       * **Concurrency (Requests)**
     * Set a **value** - This value is the threshold at which autoscaling is triggered.
   * Optional: Set when the replicas should be automatically **scaled down to zero**. This allows the compute resources to be freed up when the model is inactive (i.e., there are no requests being sent). When automatic scaling to zero is enabled, the minimum number of replicas set in the previous step automatically changes to 0.
   * Optional: Set the **order of priority** for the **node pools** on which the scheduler tries to run the workload.\
     When a workload is created, the scheduler will try to run it on the first node pool on the list. If the node pool doesn't have free resources, the scheduler will move on to the next one until it finds one that is available.
     * Drag and drop them to change the order, remove unwanted ones, or reset to the default order defined in the project.
     * Click **+NODE POOL** to add a new node pool from the list of node pools that were defined on the cluster.\
       To configure a new node pool and for additional information, see [node pools](https://run-ai-docs.nvidia.com/self-hosted/2.21/platform-management/aiinitiatives/resources/node-pools).
   * Select a **node affinity** to schedule the workload on a specific node type.\
     If the administrator added a '[node type (affinity)](https://run-ai-docs.nvidia.com/self-hosted/2.21/platform-management/policies/scheduling-rules#node-type-affinity)' scheduling rule to the project/department, then this field is mandatory.\
     Otherwise, entering a node type (affinity) is optional. [Nodes must be tagged](https://run-ai-docs.nvidia.com/self-hosted/2.21/platform-management/policies/scheduling-rules#labelling-nodes-for-node-types-grouping) with a label that matches the node type key and value.
   * Optional: Set toleration(s) to allow the workload to be scheduled on a node with a matching taint.
     * Click **+TOLERATION**
     * Enter a **key**
     * Select the **operator**
       * **Exists** - If the key exists on the node, the effect will be applied
       * **Equals** - If the key and the value set below match the key and value on the node, the effect will be applied
         * Enter a **value** matching the value on the node
     * Select the effect for the toleration
       * **NoExecute** - Pods that do not tolerate this taint are evicted immediately.
       * **NoSchedule** - No new pods will be scheduled on the tainted node unless they have a matching toleration. Pods currently running on the node will not be evicted.
       * **PreferNoSchedule** - The control plane will try to avoid placing a pod that does not tolerate the taint on the node, but it is not guaranteed.
       * **Any** - All effects above match.
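
   NVIDIA Run:ai serves these workloads as Knative services, so the replica and autoscaling settings above ultimately drive Knative's autoscaler. Purely as an illustration of that underlying mechanism (not the exact specification NVIDIA Run:ai generates; the workload name and namespace are placeholders, and in practice these settings are managed for you through the form above), equivalent Knative autoscaling annotations look like this:

   ```bash
   # Illustration only: concurrency-based scaling between 1 and 4 replicas,
   # targeting about 10 concurrent requests per replica
   kubectl patch ksvc <workload-name> -n <project-namespace> --type merge -p '{
     "spec": {"template": {"metadata": {"annotations": {
       "autoscaling.knative.dev/metric": "concurrency",
       "autoscaling.knative.dev/target": "10",
       "autoscaling.knative.dev/min-scale": "1",
       "autoscaling.knative.dev/max-scale": "4"
     }}}}
   }'
   ```
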
10. Under Environment, set the **server type** to [**vLLM**](https://docs.vllm.ai/en/latest/models/supported_models.html) or [**TGI**](https://huggingface.co/docs/text-generation-inference/en/supported_models):
    * Enter the **Image URL**. Make sure to use the correct version in the image URL path.
    * Set arguments separated by spaces, in the format `--arg1=val1`.
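
    For example, to sanity-check a set of vLLM arguments locally before filling in the form, you could run the same image with Docker. The image tag, model, and flags below are illustrative only; verify them against the vLLM documentation.

    ```bash
    # Illustrative values - not required settings for the workload form
    docker run --gpus all -p 8000:8000 \
      -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
      vllm/vllm-openai:v0.6.3 \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --max-model-len=8192 --dtype=auto
    ```
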
11. Select the **data source** that will serve as the model store\
    Select a data source where the model is already cached to reduce loading time or click **+NEW DATA SOURCE** to add a new data source to the gallery. This will cache the model and reduce loading time for future use. If there are issues with the connectivity to the cluster, or issues while creating the data source, the data source won't be available for selection. For a step-by-step guide on adding data sources to the gallery, see [data sources](https://run-ai-docs.nvidia.com/self-hosted/2.21/workloads-in-nvidia-run-ai/assets/datasources). Once created, the new data source will be automatically selected.
12. **Optional - General settings**:
    * Set **annotation(s)**\
      Kubernetes annotations are key-value pairs attached to the workload. They are used for storing additional descriptive metadata to enable documentation, monitoring and automation.
      * Click **+ANNOTATION**
      * Enter a **name**
      * Enter a **value**
    * Set **label(s)**\
      Kubernetes labels are key-value pairs attached to the workload. They are used for categorizing workloads to enable querying. To add labels:
      * Click **+LABEL**
      * Enter a **name**
      * Enter a **value**
13. Click **CREATE INFERENCE**

## Managing and Monitoring

After the inference workload is created, it is added to the [Workloads](https://run-ai-docs.nvidia.com/self-hosted/2.21/workloads-in-nvidia-run-ai/workloads) table, where it can be managed and monitored.

## Rolling Inference Updates <a href="#rolling-inference-updates" id="rolling-inference-updates"></a>

When deploying models and running inference workloads, you may need to update the workload configuration in real-time without disrupting critical services. Rolling inference updates allow you to submit changes to an existing inference workload, regardless of its current status (running, pending, etc.).

To update an inference workload, select the workload and click **UPDATE**. Only the settings listed below can be modified.

### Supported Updates

You can update various aspects of an inference workload, for example:

* **Container image** – Deploy a new model version.
* **Configuration parameters** – Modify command arguments and/or environment variables.
* **Compute resources** – Adjust resources to optimize performance.
* **Replica count and scaling policy** – Adapt to changing workload demands.

Throughout the update process, the workload remains operational, ensuring uninterrupted access for consumers (e.g., interacting with an LLM).

### Update Process

When an inference workload is updated, a **new revision** of the pod(s) is created based on the updated specification.

* Multiple updates can be submitted in succession, but only the latest update takes effect—previous updates are ignored.
* Once the new revision is fully deployed and running, traffic is redirected to it.
* The original revision is then terminated, and its resources are released back to the shared pool.
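
If you have kubectl access to the cluster, you can observe this revision handover directly. A rough sketch, assuming the workload runs as a Knative service and using placeholder names:

```bash
# While the update is in progress, both the original and the new revision are listed;
# once the rollout completes, the old revision is terminated
kubectl get revisions -n <project-namespace>

# Shows the latest created and latest ready revisions of the service
kubectl get ksvc <workload-name> -n <project-namespace>
```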

### GPU Quota Considerations

To successfully complete an inference workload update, the project must have sufficient free GPU quota. For example:

* **Existing workload** - The current inference workload is running with 3 replicas. Assuming each replica uses 1 GPU, the project is currently consuming 3 GPUs from its quota. For clarity, we'll refer to this as **Revision 1**.
* **Updated workload** - The workload is updated to use 8 replicas, which requires 8 additional GPUs during the update process, on top of the 3 GPUs still held by **Revision 1** (11 GPUs in total during the transition). These 8 GPUs must be available in the project's quota before the update can begin. Once the update is complete and the new revision is running, the 3 GPUs used by **Revision 1** are released.

### Monitoring Updates in the UI

In the UI, the **Workloads table** displays the configuration of the latest submitted update. For example, if you change the container image, the **image** column will display the name of the updated image.

The **status** of the workload continues to reflect the operational state of the service the workload exposes. For instance, during an update, the workload status remains "Running" if the service is still being delivered to consumers. Hovering over the workload's **status** in the grid will display the phase message for the update, offering additional insights into its update state.

### Timeout and Resource Allocation

* As long as the update process is not completed, GPUs are not allocated to the replicas of the new revision. This prevents idle GPUs from being allocated, so other workloads are not deprived of them. This relies on the Knative behavior described below.
* If the update process is not completed within the default time limit of 10 minutes, it will automatically stop. At that point, all replicas of the new revision will be removed, and the original revision will continue to run normally.
* The above default time limit for updates is configurable. Consider setting a longer duration if your workload requires extended time to pull the image due to its size, if the workload takes additional time to reach a 'READY' state due to a long initialization process, or if your cluster depends on autoscaling to allocate resources for new replicas. For example, to set the time limit to 30 minutes, you can run the following command:

  ```bash
  kubectl patch ConfigMap config-deployment -n knative-serving --type='merge' -p '{"data": {"progress-deadline": "1800s"}}'
  ```
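
  To confirm the change, you can read the value back from the same ConfigMap:

  ```bash
  kubectl get configmap config-deployment -n knative-serving -o jsonpath='{.data.progress-deadline}'
  ```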

#### Inference Workloads with Knative <a href="#inference-workloads-with-knative-new-behavior-in-v219" id="inference-workloads-with-knative-new-behavior-in-v219"></a>

Starting in version 2.19, all pods of a single Knative revision are grouped under a single pod-group. This means that when a new Knative revision is created:

* It either succeeds in allocating the minimum number of pods; or
* It fails and moves into a pending state, retrying later to allocate all pods with their resources.

The resources (GPUs, CPUs) are not occupied by a new Knative revision until it succeeds in allocating all pods. The older revision pods are then terminated and release their resources (GPUs, CPUs) back to the cluster to be used by other workloads.

## Accessing the Inference Workload

You can programmatically consume an inference workload via API by making direct calls to the serving endpoint, typically from other workloads or external integrations.

Once an inference workload is deployed, the serving endpoint URL appears in the **Connections** column of the inference workloads grid.

{% hint style="info" %}
**Note**

If the serving endpoint URL ends with `.svc.cluster.local`, it is accessible only within the cluster. To enable external access, your administrator must configure the cluster as described in the [inference requirements](https://run-ai-docs.nvidia.com/self-hosted/2.21/getting-started/installation/install-using-helm/system-requirements#inference) section.
{% endhint %}
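
If the endpoint is internal-only, you can still reach it from inside the cluster, for example with a short-lived pod. A minimal sketch, assuming a **Public** endpoint served by vLLM (for a restricted endpoint, add the Authorization header shown below; for TGI, adjust the path to an endpoint your server exposes):

```bash
# Launch a temporary pod in the cluster and query the internal serving endpoint
kubectl run endpoint-test --rm -it --restart=Never --image=curlimages/curl --command -- \
  curl -s <serving-endpoint-url>/v1/models
```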

Access to the inference serving API depends on how the serving endpoint access was configured when submitting the inference workload:

* If **Public** access is enabled, no authentication is required.
* If restricted access is configured, a valid access token is required. Authentication is limited to **Specific user(s)** and **Specific group(s)**. SSO users and applications are currently not supported.

Follow the steps below to obtain and use a token:

1. Use the [Tokens](https://run-ai-docs.nvidia.com/api/2.21/authentication-and-authorization/tokens) API with `grantType: password`.
2. Use the obtained token to make API calls to the inference serving endpoint. For example:

   ```bash
   # Replace <serving-endpoint-url> and <model-name> (e.g. "meta-llama/Llama-3.1-8B-Instruct")
   curl <serving-endpoint-url>/v1/chat/completions \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer <your-access-token>" \
     -d '{
       "model": "<model-name>",
       "messages": [{
         "role": "user",
         "content": "Write a short poem on AI"
       }]
     }'
   ```

## Using CLI

To view the available actions, see the inference workload [CLI v2 reference](https://run-ai-docs.nvidia.com/self-hosted/2.21/reference/cli/runai).

## Using API

* To view the available actions for creating an inference workload, see the [Inferences](https://run-ai-docs.nvidia.com/api/2.21/workloads/inferences) API reference.
* To view the available actions for rolling an inference update, see the [Update inference spec](https://run-ai-docs.nvidia.com/api/2.21/workloads/inferences#patch-api-v1-workloads-inferences-workloadid) API reference.
