Data Sources

This section explains what data sources are and how to create and use them.

Data sources are a type of workload assets and represent a location where data is actually stored. They may represent a remote data location, such as NFS, Git, or S3, or a Kubernetes local resource, such as PVC, ConfigMap, HostPath, or Secret.

This configuration simplifies the mapping of the data into the workload’s file system and handles the mounting process during workload creation for reading and writing. These data sources are reusable and can be easily integrated and used by AI practitioners while submitting workloads across various scopes.

Data Sources Table

The data sources table can be found under Workload manager in the NVIDIA Run:ai platform.

The data sources table provides a list of all the data sources defined in the platform and allows you to manage them.

Note

Data & storage - with Data sources and Data volumes - is visible only if your Administrator has enabled Data volumes.

The data sources table comprises the following columns:

Column

Description

Data source

The name of the data source

Description

A description of the data source

Type

The type of data source connected – e.g., S3 bucket, PVC, or others

Status

The different lifecycle phases and representation of the data source condition

Scope

The scope of the data source within the organizational tree. Click the scope name to view the organizational tree diagram

Kubernetes name

The unique name of the data sources Kubernetes name as it appears in the cluster

Workload(s)

The list of existing workloads that use the data source

Template(s)

The list of workload templates that use the data source

Created by

The user who created the data source

Creation time

The timestamp for when the data source was created

Cluster

The cluster that the data source is associated with

Data Sources Status

The following table describes the data sources' condition and whether they were created successfully for the selected scope.

Status

Description

No issues found

No issues were found while creating the data source

Issues found

Issues were found while propagating the data source credentials

Issues found

The data source couldn’t be created at the cluster

Creating…

The data source is being created

No status / “-”

When the data source’s scope is an account, the current version of the cluster is not up to date, or the asset is not a cluster-syncing entity, the status can’t be displayed

Customizing the Table View

Filter - Click ADD FILTER, select the column to filter by, and enter the filter values
Search - Click SEARCH and type the value to search by
Sort - Click each column header to sort by
Column selection - Click COLUMNS and select the columns to display in the table
Download table - Click MORE and then click ‘Download as CSV’. Export to CSV is limited to 20,000 rows.
Refresh - Click REFRESH to update the table with the latest data

Adding a New Data Source

To create a new data source:

Click +NEW DATA SOURCE
Select the data source type from the list. Follow the step-by-step guide for each data source type:

NFS

A Network File System (NFS) is a Kubernetes concept used for sharing storage in the cluster among different pods. Like a PVC, the NFS volume’s content remains preserved, even outside the lifecycle of a single pod. However, unlike PVCs, which abstract storage management, NFS provides a method for network-based file sharing. The NFS volume can be pre-populated with data and can be mounted by multiple pod writers simultaneously. At NVIDIA Run:ai, an NFS-type data source is an abstraction that is mapped directly to a Kubernetes NFS volume. This integration allows multiple workloads under various scopes to mount and present the NFS data source.

Select the cluster under which to create this data source
Select a scope
Enter a name for the data source. The name must be unique.
Optional: Provide a description of the data source
Set the data origin
- Enter the NFS server (host name or host IP)
- Enter the NFS path
Set the data target location
- Container path
Optional: Restrictions
- Prevent data modification - When enabled, the data will be mounted with read-only permissions
Click CREATE DATA SOURCE

PVC

A Persistent Volume Claim (PVC) is a Kubernetes concept used for managing storage in the cluster, which can be provisioned by an administrator or dynamically by Kubernetes using a StorageClass. PVCs allow users to request specific sizes and access modes (read/write once, read-only many). NVIDIA Run:ai ensures that data remains consistent and accessible across various scopes and workloads, beyond the lifecycle of individual pods, which is efficient while working with large datasets typically associated with AI projects.

Select the cluster under which to create this data source
Select a scope
Enter a name for the data source. The name must be unique.
Optional: Provide a description of the data source
Select PVC:
- Existing PVC
  This option is relevant when the purpose is to create a PVC-type data source based on an existing PVC in the cluster
  - Select a PVC from the list - (The list is empty if no existing PVCs were created in advance)
- New PVC - creates a new PVC in the cluster. New PVCs are not added to the Existing PVCs list.
  When creating a PVC-type data source and selecting the ‘New PVC’ option, the PVC is immediately created in the cluster (even if no workload has requested this PVC).
Select the storage class
- None - Proceed without defining a storage class
- Custom storage class - This option applies when selecting a storage class based on existing storage classes.
  To add new storage classes to the storage class list, and for additional information, check Kubernetes storage classes
Select the access mode(s) (multiple modes can be selected)
- Read-write by one node - The volume can be mounted as read-write by a single node.
- Read-only by many nodes - The volume can be mounted as read-only by many nodes.
- Read-write by many nodes - The volume can be mounted as read-write by many nodes.
Set the claim size and its units
Select the volume mode
1. File system (default) - allows the volume to be mounted as a filesystem, enabling the usage of directories and files.
2. Block - exposes the volume as a block storage, which can be formatted or used by applications directly without a filesystem.
Set the data target location
- container path
Optional: Prevent data modification - When enabled, the data will be mounted with read-only permission.
Click CREATE DATA SOURCE

After the data source is created, check its status to monitor its proper creation across the selected scope.

S3 Bucket

The S3 bucket data source enables the mapping of a remote S3 bucket into the workload’s file system. Similar to a PVC, this mapping remains accessible across different workload executions, extending beyond the lifecycle of individual pods. However, unlike PVCs, data stored in an S3 bucket resides remotely, which may lead to decreased performance during the execution of heavy machine learning workloads. As part of the NVIDIA Run:ai connection to the S3 bucket, you can create credentials in order to access and map private buckets.

Note

S3 data sources are not supported for custom inference workloads.

Select the cluster under which to create this data source
Select a scope
Enter a name for the data source. The name must be unique.
Optional: Provide a description of the data source
Set the data origin
- Set the S3 service URL
- Select the credential
  - None - for public buckets
  - Credential names - This option is relevant for private buckets based on existing credentials that were created for the scope.
    To add new credentials to the credentials list, and for additional information, check the Credentials article.
- Enter the bucket name
Set the data target location
- container path
Click CREATE DATA SOURCE

After a private data source is created, check its status to monitor its proper creation across the selected scope.

Git

A Git-type data source is a NVIDIA Run:ai integration, that enables code to be copied from a Git branch into a dedicated folder in the container. It is mainly used to provide the workload with the latest code repository. As part of the integration with Git, in order to access private repositories, you can add predefined credentials to the data source mapping.

Select the cluster under which to create this data source
Select a scope
Enter a name for the data source. The name must be unique.
Optional: Provide a description of the data source
Set the data origin
- Set the Repository URL
- Set the Revision (branch, tag, or hash)- If left empty, it will use the 'HEAD' (latest)
- Select the credential
  - None - for public repositories
  - Credential names - This option applies to private repositories based on existing credentials that were created for the scope.
    To add new credentials to the credentials list, and for additional information, check the Credentials article.
Set the data target location
- container path
Click CREATE DATA SOURCE

After a private data source is created, check its status to monitor its proper creation across the selected scope.

Host path

A Host path volume is a Kubernetes concept that enables mounting a host path file or a directory on the workload’s file system. Like a PVC, the host path volume’s data persists across workloads under various scopes. It also enables data serving from the hosting node.

Select the cluster under which to create this data source
Select a scope
Enter a name for the data source. The name must be unique.
Optional: Provide a description of the data source
Set the data origin
- host path
Set the data target location
- container path
Optional: Prevent data modification - When enabled, the data will be mounted with read-only permissions.
Click CREATE DATA SOURCE

ConfigMap

A ConfigMap data source is a NVIDIA Run:ai abstraction for the Kubernetes ConfigMap concept. The ConfigMap is used mainly for storage that can be mounted on the workload container for non-confidential data. It is usually represented in key-value pairs (e.g., environment variables, command-line arguments etc.). It allows you to decouple environment-specific system configurations from your container images, so that your applications are easily portable. ConfigMaps must be created on the cluster prior to being used within the NVIDIA Run:ai system.\

Select the cluster under which to create this data source
Select a scope
Enter a name for the data source. The name must be unique.
Optional: Provide a description of the data source
Set the data origin
- Select the ConfigMap name (The list is empty if no existing ConfigMaps were created in advance).
Set the data target location
- container path
Click CREATE DATA SOURCE

Secret

A secret-type data source enables the mapping of a credential into the workload’s file system. Credentials are a workload asset that simplify the complexities of Kubernetes Secrets. The credentials mask sensitive access information, such as passwords, tokens, and access keys, which are necessary for gaining access to various resources.

Select the cluster under which to create this data source
Select a scope
Enter a name for the data source. The name must be unique.
Optional: Provide a description of the data source
Set the data origin
- Select the credential
  To add new credentials, and for additional information, check the Credentials article.
Set the data target location
- container path
Click CREATE DATA SOURCE

After the data source is created, check its status to monitor its proper creation across the selected scope.

Note

It is also possible to add data sources directly when creating a specific workspace, training or inference workload.

Copying a Data Source

To copy an existing data source:

Select the data source you want to copy
Click MAKE A COPY
Enter a name for the data source. The name must be unique.
Update the data source and click CREATE DATA SOURCE

Renaming a Data Source

To rename an existing data source:

Select the data source you want to rename
Click Rename and edit the name/description

Deleting a Data Source

To delete a data source:

Select the data source you want to delete
Click DELETE
Confirm you want to delete the data source

Note

It is not possible to delete a data source being used by an existing workload or template.

Creating PVCs in Advance

Add PVCs in advance to be used when creating a PVC-type data source via the NVIDIA Run:ai UI.

The actions taken by the admin are based on the scope (cluster, department or project) that the admin wants for data source of type PVC. Follow the steps below for each required scope:

Cluster Scope

Locate the PVC in the NVIDIA Run:ai namespace (runai)
Provide NVIDIA Run:ai with visibility and authorization to share the PVC to your selected scope by implementing the following label: run.ai/cluster-wide: "true”

The PVC is now displayed for that scope in the list of existing PVCs.

Note

This step is also relevant for creating the data source of type PVC via API

Department Scope

Locate the PVC in the NVIDIA Run:ai namespace (runai)
To authorize NVIDIA Run:ai to use the PVC, label it: run.ai/department: "id"

The PVC is now displayed for that scope in the list of existing PVCs.

Project Scope

Locate the PVC in the project’s namespace.

The PVC is now displayed for that scope in the list of existing PVCs.

Creating ConfigMaps in Advance

Add ConfigMaps in advance to be used when creating a ConfigMap-type data source via the NVIDIA Run:ai UI.

Cluster Scope

Locate the ConfigMap in the NVIDIA Run:ai namespace (runai)
To authorize NVIDIA Run:ai to use the ConfigMap, label it: run.ai/cluster-wide: "true”
The ConfigMap must have a label of run.ai/resource: <resource-name>

The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.

Department Scope

Locate the ConfigMap in the NVIDIA Run:ai namespace (runai)
To authorize NVIDIA Run:ai to use the ConfigMap, label it: run.ai/department: "<department-id>"
The ConfigMap must have a label of run.ai/resource: <resource-name>

The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.

Project Scope

Locate the ConfigMap in the project’s namespace
The ConfigMap must have a label of run.ai/resource: <resource-name>

The ConfigMap is now displayed for that scope in the list of existing ConfigMaps.

Using API

To view the available actions, go to the Data sources API reference.

PreviousEnvironments NextData Volumes

Last updated 2 months ago