Policy YAML Reference

A workload policy is an end-to-end solution that lets AI managers and administrators control and simplify how workloads are submitted by setting best practices, enforcing limitations, and standardizing processes for AI projects within their organization.

This article explains the policy YAML fields and the possible rules and defaults that can be set for each field.

Policy YAML Fields - Reference Table

The policy fields are structured in a format similar to the workload API fields. The following tables are a structured guide to help you understand and configure policies in YAML format. They provide the fields, descriptions, defaults, and rules for each workload type.
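For orientation, a policy YAML provides defaults and rules for these fields, as detailed in the Policy Spec Sections chapter below. The following is a minimal sketch only; the field names are taken from the tables below and the values are illustrative:

defaults:
  imagePullPolicy: Always
  createHomeDir: true
rules:
  image:
    required: true
  security:
    runAsUid:
      min: 500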

See the Value Types section below for the value type of each field.

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

args

When set, contains the arguments sent along with the command. These override the entry point of the image in the created workload

  • Workspace

  • Standard training

  • Distributed training

  • Inference

command

A command to serve as the entry point of the container running the workload

  • Workspace

  • Standard training

  • Distributed training

  • Inference

createHomeDir

Instructs the system to create a temporary home directory for the user within the container. Data stored in this directory is not saved when the container exits. When the runAsUser flag is set to true, this flag defaults to true as well

  • Workspace

  • Standard training

  • Distributed training

  • Inference

environmentVariables

Set of environment variables to populate into the container running the workload

  • Workspace

  • Standard training

  • Distributed training

  • Inference

image

Specifies the image to use when creating the container running the workload

  • Workspace

  • Standard training

  • Distributed training

  • Inference

imagePullPolicy

Specifies the pull policy of the image when starting a container running the created workload. Options are: always, ifNotPresent, or never

  • Workspace

  • Standard training

  • Distributed training

  • Inference

imagePullSecrets

Specifies a list of references to Kubernetes secrets in the same namespace used for pulling container images.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

workingDir

Container’s working directory. If not specified, the container runtime default is used, which might be configured in the container image

  • Workspace

  • Standard training

  • Distributed training

  • Inference

nodeType

Nodes (machines) or a group of nodes on which the workload runs

  • Workspace

  • Standard training

  • Distributed training

  • Inference

nodePools

A prioritized list of node pools for the scheduler to run the workload on. The scheduler always tries to use the first node pool before moving to the next one when the first is not available.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

annotations

Set of annotations to populate into the container running the workload

  • Workspace

  • Standard training

  • Distributed training

  • Inference

labels

Set of labels to populate into the container running the workload

  • Workspace

  • Standard training

  • Distributed training

  • Inference

terminateAfterPreemption

Indicates whether the job should be terminated, by the system, after it has been preempted

  • Workspace

  • Standard training

  • Distributed training

autoDeletionTimeAfterCompletionSeconds

Specifies the duration after which a finished workload (Completed or Failed) is automatically deleted. If this field is set to zero, the workload becomes eligible to be deleted immediately after it finishes.

  • Workspace

  • Standard training

  • Distributed training

terminationGracePeriodSeconds

Duration in seconds the pod needs to terminate gracefully upon probe failure. The grace period is the duration in seconds between the time the processes running in the pod are sent a termination signal and the time they are forcibly halted with a kill signal. Set this value longer than the expected cleanup time for your process. The value must be a non-negative integer. The value zero indicates stop immediately via the kill signal (no opportunity to shut down).

  • Workspace

  • Standard training

  • Distributed training

backoffLimit

Specifies the number of retries before marking a workload as failed

  • Workspace

  • Standard training

  • Distributed training

restartPolicy

Specifies the restart policy of the workload pods. The default is empty, in which case the framework default is used

Enum: "Always" "Never" "OnFailure"

  • Workspace

  • Standard training

  • Distributed training

cleanPodPolicy

Specifies which pods will be deleted when the workload reaches a terminal state (completed/failed). The policy can be one of the following values:

  • Running - Only pods still running when a job completes (for example, parameter servers) will be deleted immediately. Completed pods will not be deleted so that the logs will be preserved. (Default).

  • All - All (including completed) pods will be deleted immediately when the job finishes.

  • None - No pods are deleted when the job completes. Running pods keep consuming GPU, CPU, and memory over time. It is recommended to set this to None only for debugging and obtaining logs from running pods.

Distributed training

completions

Used with Hyperparameter Optimization. Specifies the number of successful pods the job should reach to be completed. The Job is marked as successful once the specified number of pods has succeeded.

Standard training

parallelism

Used with Hyperparameter Optimization. Specifies the maximum desired number of pods the workload should run at any given time.

Standard training

exposedUrls

Specifies a set of exposed URLs (e.g. ingress) from the container running the created workload.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

relatedUrls

Specifies a set of URLs related to the workload. For example, a URL to an external server providing statistics or logging about the workload.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

podAffinitySchedulingRule

Indicates whether the Pod affinity rule is applied as the “hard” (required) or the “soft” (preferred) option.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

podAffinityTopology

Specifies the Pod Affinity Topology to be used for scheduling the job.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

category

Specifies the workload category assigned to the workload. Categories are used to classify and monitor different types of workloads within the NVIDIA Run:ai platform.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

sshAuthMountPath

Specifies the directory where SSH keys are mounted

Distributed training (MPI only)

mpiLauncherCreationPolicy

Defines whether the MPI Launcher is created in parallel with the workers, or whether its creation is postponed until all workers are in Ready state. This prevents failures when the launcher attempts to connect to workers that are not yet ready.

Enum: AtStartup, WaitForWorkersReady

Distributed training (MPI only)

ports

Specifies a set of ports exposed from the container running the created workload. More information in Ports fields below.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

probes

Specifies the ReadinessProbe to use to determine if the container is ready to accept traffic. More information in Probes fields below

-

  • Workspace

  • Standard training

  • Distributed training

  • Inference

tolerations

Toleration rules which apply to the pods running the workload. Toleration rules guide (but do not require) the system to which node each pod can be scheduled to or evicted from, based on matching between those rules and the set of taints defined for each Kubernetes node.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

priorityClass

Specifies the priority class of the workload. The default values are:

  • Workspace - High

  • Training / distributed training - Low

  • Inference - Very high

You can change it to any of the following valid values to adjust the workload's scheduling behavior: very-low, low, medium-low, medium, medium-high, high, very-high.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

nodeAffinityRequired

If the affinity requirements specified by this field are not met at scheduling time, the pod will not be scheduled onto the node. If the affinity requirements specified by this field cease to be met at some point during pod execution (e.g. due to an update), the system may or may not try to eventually evict the pod from its node.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

storage

Contains all the fields related to storage configurations. More information in Storage fields below.

-

  • Workspace

  • Standard training

  • Distributed training

  • Inference

security

Contains all the fields related to security configurations. More information in Security fields below.

-

  • Workspace

  • Standard training

  • Distributed training

  • Inference

compute

Contains all the fields related to compute configurations. More information in Compute fields below.

-

  • Workspace

  • Standard training

  • Distributed training

  • Inference

tty

Whether this container should allocate a TTY for itself; this also requires 'stdin' to be true

  • Workspace

  • Standard training

  • Distributed training

stdin

Whether this container should allocate a buffer for stdin in the container runtime. If this is not set, reads from stdin in the container will always result in EOF

  • Workspace

  • Standard training

  • Distributed training

numWorkers

The number of workers that will be allocated for running the workload.

Distributed training

distributedFramework

The distributed training framework used in the workload.

Enum: "MPI" "PyTorch" "TF" "XGBoost" "JAX"

Distributed training

slotsPerWorker

Specifies the number of slots per worker used in hostfile. Defaults to 1. (applicable only for MPI)

Distributed training (MPI only)

minReplicas

The lower limit for the number of worker pods to which the training job can scale down. (applicable only for PyTorch)

Distributed training (PyTorch only)

maxReplicas

The upper limit for the number of worker pods that can be set by the autoscaler. Cannot be smaller than minReplicas. (applicable only for PyTorch)

Distributed training (PyTorch only)

servingPort

Specifies the port for accessing the inference service. See Serving Port Fields.

-

Inference

autoscaling

Specifies the minimum and maximum number of replicas to be scaled up and down to meet the changing demands of inference services. See Autoscaling Fields.

-

Inference

servingConfiguration

Specifies the inference workload serving configuration. See Serving Configuration Fields.

-

Inference
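As an illustration of how fields from the table above combine in a policy, the snippet below sets defaults for a few top-level fields and restricts priorityClass to a set of allowed values. This is a hedged sketch: the field names come from the table, while the specific values are examples only:

defaults:
  imagePullPolicy: Always
  terminateAfterPreemption: false
  autoDeletionTimeAfterCompletionSeconds: 3600
  backoffLimit: 3
rules:
  priorityClass:
    options:
      - value: low
      - value: high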

Ports Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

container

The port that the container running the workload exposes.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

serviceType

Specifies the default service exposure method for ports. The default is used for ports that do not specify a service type. Options are: LoadBalancer, NodePort, or ClusterIP. For more information see the External Access to Containers guide.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

external

The external port which allows a connection to the container port. If not specified, the port is auto-generated by the system.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

toolType

The tool type that runs on this port.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

toolName

A name describing the tool that runs on this port.

  • Workspace

  • Standard training

  • Distributed training

  • Inference
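ports is an itemized field (see the Itemized value type below). A possible defaults snippet based on the sub-fields above is shown here; the port numbers, tool type, and tool name are illustrative assumptions, not required values:

defaults:
  ports:
    instances:
      - container: 8888
        serviceType: ClusterIP
        external: 30088
        toolType: jupyter-notebook
        toolName: Jupyter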

Probes Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

readiness

Specifies the Readiness Probe to use to determine if the container is ready to accept traffic.

-

  • Workspace

  • Standard training

  • Distributed training

  • Inference

Readiness Field Details
  • Description: Specifies the Readiness Probe to use to determine if the container is ready to accept traffic

  • Value type: itemized

  • Example policy snippet:

defaults:
   probes:
     readiness:
         initialDelaySeconds: 2
Spec readiness fields
Description
Value type

initialDelaySeconds

Number of seconds after the container has started before liveness or readiness probes are initiated.

periodSeconds

How often (in seconds) to perform the probe

timeoutSeconds

Number of seconds after which the probe times out

successThreshold

Minimum consecutive successes for the probe to be considered successful after having failed

failureThreshold

When a probe fails, the number of times to try before giving up
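Building on the example above, a fuller defaults snippet that sets all of the readiness sub-fields listed in this table might look as follows (the timing values are illustrative only):

defaults:
  probes:
    readiness:
      initialDelaySeconds: 2
      periodSeconds: 10
      timeoutSeconds: 5
      successThreshold: 1
      failureThreshold: 3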

Security Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

uidGidSource

Indicates the way to determine the user and group ids of the container. The options are:

  • fromTheImage - user and group IDs are determined by the docker image that the container runs. This is the default option.

  • custom - user and group IDs can be specified in the environment asset and/or the workspace creation request.

  • idpToken - user and group IDs are determined according to the identity provider (idp) access token. This option is intended for internal use of the environment UI form. For more information, see User identity in containers.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

capabilities

The capabilities field allows adding a set of Unix capabilities to the container running the workload. Capabilities are distinct Linux privileges, traditionally associated with the superuser, which can be independently enabled and disabled

  • Workspace

  • Standard training

  • Distributed training

  • Inference

seccompProfileType

Indicates which kind of seccomp profile is applied to the container. The options are:

  • RuntimeDefault - the container runtime default profile should be used

  • Unconfined - no profile should be applied

  • Workspace

  • Standard training

  • Distributed training

  • Inference

runAsNonRoot

Indicates that the container must run as a non-root user.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

readOnlyRootFilesystem

If true, mounts the container's root filesystem as read-only.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

runAsUid

Specifies the Unix user id with which the container running the created workload should run.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

runAsGid

Specifies the Unix Group ID with which the container should run.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

supplementalGroups

Comma separated list of groups that the user running the container belongs to, in addition to the group indicated by runAsGid.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

allowPrivilegeEscalation

Allows the container running the workload and all launched processes to gain additional privileges after the workload starts

  • Workspace

  • Standard training

  • Distributed training

hostIpc

Whether to enable hostIpc. Defaults to false.

  • Workspace

  • Standard training

  • Distributed training

hostNetwork

Whether to enable host network.

  • Workspace

  • Standard training

  • Distributed training
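As a hedged example, the following snippet combines several of the security fields above to enforce non-root execution; the field names come from the table, and the specific IDs are illustrative:

defaults:
  security:
    uidGidSource: custom
    runAsUid: 1000
    runAsGid: 1000
    runAsNonRoot: true
    allowPrivilegeEscalation: false
rules:
  security:
    runAsNonRoot:
      canEdit: false
    runAsUid:
      min: 500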

Compute Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

cpuCoreRequest

CPU units to allocate for the created workload (0.5, 1, etc.). The workload receives at least this amount of CPU. Note that the workload is not scheduled unless the system can guarantee this amount of CPUs to the workload.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

cpuCoreLimit

Limitations on the number of CPUs consumed by the workload (0.5, 1, etc.). The system guarantees that this workload is not able to consume more than this amount of CPUs.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

cpuMemoryRequest

The amount of CPU memory to allocate for this workload (1G, 20M, etc.). The workload receives at least this amount of memory. Note that the workload is not scheduled unless the system can guarantee this amount of memory to the workload.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

cpuMemoryLimit

Limitations on the CPU memory to allocate for this workload (1G, 20M, etc.). The system guarantees that this workload is not able to consume more than this amount of memory. The workload receives an error when trying to allocate more memory than this limit.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

largeShmRequest

A large /dev/shm device to mount into a container running the created workload (shm is a shared file system mounted on RAM).

  • Workspace

  • Standard training

  • Distributed training

  • Inference

gpuRequestType

Sets the unit type for GPU resource requests to either portion or memory. The request type can be stated as portion or memory only if gpuDeviceRequest = 1.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

gpuPortionRequest

Specifies the fraction of a GPU to be allocated to the workload, between 0 and 1. For backward compatibility, it also supports a number of GPU devices larger than 1, currently provided using the gpuDevices field.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

gpuDeviceRequest

Specifies the number of GPUs to allocate for the created workload. The gpuRequestType can be defined only if gpuDeviceRequest = 1.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

gpuPortionLimit

When a fraction of a GPU is requested, the GPU limit specifies the portion limit to allocate to the workload. The range of the value is from 0 to 1.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

gpuMemoryRequest

Specifies GPU memory to allocate for the created workload. The workload receives this amount of memory. Note that the workload is not scheduled unless the system can guarantee this amount of GPU memory to the workload.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

gpuMemoryLimit

Specifies a limit on the GPU memory to allocate for this workload. Should be no less than gpuMemoryRequest.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

extendedResources

Specifies values for extended resources. Extended resources are third-party devices (such as high-performance NICs, FPGAs, or InfiniBand adapters) that you want to allocate to your Job.

  • Workspace

  • Standard training

  • Distributed training

  • Inference
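For example, the following sketch sets compute defaults and bounds using the field names from the table above; the quantities are illustrative:

defaults:
  compute:
    cpuCoreRequest: 0.5
    cpuMemoryRequest: 1G
    gpuDeviceRequest: 1
rules:
  compute:
    cpuMemoryLimit:
      max: 32G
    gpuDeviceRequest:
      max: 8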

Storage Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

dataVolume

Set of data volumes to use in the workload. Each data volume is mapped to a file-system mount point within the container running the workload.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

hostPath

Maps a folder to a file-system mount point within the container running the workload.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

git

Details of the git repository and items mapped to it.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

pvc

Specifies persistent volume claims to mount into a container running the created workload.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

nfs

Specifies NFS volume to mount into the container running the workload.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

s3

Specifies S3 buckets to mount into the container running the workload.

  • Workspace

  • Standard training

  • Distributed training

configMapVolumes

Specifies ConfigMaps to mount as volumes into a container running the created workload.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

secretVolume

Set of secret volumes to use in the workload. A secret volume maps a secret resource in the cluster to a file-system mount point within the container running the workload.

  • Workspace

  • Standard training

  • Distributed training

  • Inference

Storage Field Examples

hostPath Field Details
  • Description: Maps a folder to a file-system mount point within the container running the workload

  • Value type: itemized

  • Example policy snippet:

defaults:
  storage:
    hostPath:
      instances:
        - path: h3-path-1
          mountPath: h3-mount-1
        - path: h3-path-2
          mountPath: h3-mount-2
      attributes:
        - readOnly: true
hostPath fields
Description
Value type

name

Unique name to identify the instance. Primarily used for policy locked rules.

path

Local path within the controller to which the host volume is mapped.

readOnly

Force the volume to be mounted with read-only permissions. Defaults to false

mountPath

The path that the host volume is mounted to when in use.

mountPropagation

Share this volume mount with other containers. If set to HostToContainer, this volume mount receives all subsequent mounts that are mounted to this volume or any of its subdirectories. In case of multiple hostPath entries, this field should have the same value for all of them. Enum:

  • "None"

  • "HostToContainer"

Git Field Details
  • Description: Details of the git repository and items mapped to it

  • Value type: itemized

  • Example policy snippet:

defaults:
  storage:
    git:
      attributes:
        repository: https://runai.public.github.com
      instances:
        - branch: "master"
          path: /container/my-repository
          passwordSecret: my-password-secret
Git fields
Description
Value type

repository

URL to a remote git repository. The content of this repository is mapped to the container running the workload

revision

Specific revision to synchronize the repository from

path

Local path within the workspace to which the git repository is mapped

secretName

Optional name of Kubernetes secret that holds your git username and password

username

If secretName is provided, this field should contain the key, within the provided Kubernetes secret, which holds the value of your git username. Otherwise, this field should specify your git username in plain text (example: myuser).

PVC Field Details
  • Description: Specifies persistent volume claims to mount into a container running the created workload

  • Value type: itemized

  • Example policy snippet:

defaults:
  storage:
    pvc:
      instances:
        - claimName: pvc-staging-researcher1-home
          existingPvc: true
          path: /myhome
          readOnly: false
          claimInfo:
            accessModes:
              readWriteMany: true
Spec PVC fields
Description
Value type

claimName (mandatory)

A given name for the PVC. Allows referencing it across workspaces

ephemeral

Use true to set PVC to ephemeral. If set to true, the PVC is deleted when the workspace is stopped.

path

Local path within the workspace to which the PVC is mapped

readOnly

Permits read-only access to the PVC; prevents additions or modifications to its content

readWriteOnce

Requesting claim that can be mounted in read/write mode to exactly 1 host. If none of the modes are specified, the default is readWriteOnce.

size

Requested size for the PVC. Mandatory when existingPvc is false

storageClass

Storage class name to associate with the PVC. This parameter may be omitted if there is a single storage class in the system, or you are using the default storage class. Further details at Kubernetes storage classes.

readOnlyMany

Requesting claim that can be mounted in read-only mode to many hosts

readWriteMany

Requesting claim that can be mounted in read/write mode to many hosts

NFS Field Details
  • Description: Specifies NFS volume to mount into the container running the workload

  • Value type: itemized

  • Example policy snippet:

defaults:
  storage:
    nfs:
      instances:
        - path: nfs-path
          readOnly: true
          server: nfs-server
          mountPath: nfs-mount
rules:
  storage:
    nfs:
      instances:
        canAdd: false
nfs fields
Description
Value type

mountPath

The path that the NFS volume is mounted to when in use

path

Path that is exported by the NFS server

readOnly

Whether to force the NFS export to be mounted with read-only permissions

nfsServer

The hostname or IP address of the NFS server

S3 Field Details
  • Description: Specifies S3 buckets to mount into the container running the workload

  • Value type: itemized

  • Example policy snippet:

defaults:
  storage:
    s3:
      instances:
        - bucket: bucket-opt-1
          path: /s3/path
          accessKeySecret: s3-access-key
          secretKeyOfAccessKeyId: s3-secret-id
          secretKeyOfSecretKey: s3-secret-key
      attributes:
        url: https://amazonaws.s3.com
s3 fields
Description
Value type

bucket

The name of the bucket

path

Local path within the workspace to which the S3 bucket is mapped

url

The URL of the S3 service provider. The default is the URL of the Amazon AWS S3 service

Serving Port Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

container

Specifies the port that the container running the inference service exposes

Inference

protocol

Specifies the protocol used by the port. Defaults to http.

Enum: "http", "grpc"

Inference

authorizationType

Specifies the authorization type for serving port URL access. Defaults to public, which means no authorization is required. If set to authenticatedUsers, only authenticated NVIDIA Run:ai users are allowed to access the URL. If set to authorizedUsersOrGroups, only users or groups specified in authorizedUsers or authorizedGroups are allowed to access the URL. Supported from cluster version 2.19.

Enum: "public", "authenticatedUsers", "authorizedUsersOrGroups"

Inference

authorizedUsers

Specifies the list of users that are allowed to access the URL. Note that authorizedUsers and authorizedGroups are mutually exclusive.

Inference

authorizedGroups

Specifies the list of groups that are allowed to access the URL. Note that authorizedUsers and authorizedGroups are mutually exclusive.

Inference

clusterLocalAccessOnly

Configures the serving port URL to be available only on the cluster-local network, and not externally. Defaults to false.

Inference
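A possible defaults snippet for the serving port of an inference workload, using the fields above (the port number and authorization settings are illustrative):

defaults:
  servingPort:
    container: 8000
    protocol: http
    authorizationType: authenticatedUsers
    clusterLocalAccessOnly: false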

Autoscaling Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

metricThresholdPercentage

Specifies the percentage of metric threshold value to use for autoscaling. Defaults to 70. Applicable only with the 'throughput' and 'concurrency' metrics.

Inference

minReplicas

Specifies the minimum number of replicas for autoscaling. Defaults to 1. Use 0 to allow scale-to-zero.

Inference

maxReplicas

Specifies the maximum number of replicas for autoscaling. Defaults to minReplicas, or to 1 if minReplicas is set to 0.

Inference

initialReplicas

Specifies the number of replicas to run when initializing the workload for the first time. Defaults to minReplicas, or to 1 if minReplicas is set to 0.

Inference

activationReplicas

Specifies the number of replicas to run when scaling-up from zero. Defaults to minReplicas, or to 1 if minReplicas is set to 0.

Inference

concurrencyHardLimit

Specifies the maximum number of requests allowed to flow to a single replica at any time. 0 means no limit.

Inference

scaleToZeroRetentionSeconds

Specifies the minimum amount of time (in seconds) that the last replica will remain active after a scale-to-zero decision. Defaults to 0. Available only if minReplicas is set to 0.

Inference

scaleDownDelaySeconds

Specifies the minimum amount of time (in seconds) that a replica will remain active after a scale-down decision

Inference

metric

Specifies the metric to use for autoscaling. Mandatory if minReplicas < maxReplicas, except for the special case where minReplicas is set to 0 and maxReplicas is set to 1, as in this case autoscaling decisions are made according to network activity rather than metrics. Use one of the built-in metrics of 'throughput', 'concurrency' or 'latency', or any other available custom metric. Only the 'throughput' and 'concurrency' metrics support scale-to-zero.

Inference

metricThreshold

Specifies the threshold to use with the specified metric for autoscaling. Mandatory if metric is specified.

Inference
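For example, the following hedged snippet configures concurrency-based autoscaling with scale-to-zero, consistent with the field descriptions above (the numbers are illustrative):

defaults:
  autoscaling:
    minReplicas: 0
    maxReplicas: 4
    metric: concurrency
    metricThreshold: 10
    scaleToZeroRetentionSeconds: 300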

Serving Configuration Fields

Fields
Description
Value type
Supported NVIDIA Run:ai workload type

initializationTimeoutSeconds

Specifies the maximum time (in seconds) allowed for a workload to initialize and become ready. If the workload does not start within this time, it will be moved to failed state.

Inference

requestTimeoutSeconds

Specifies the maximum time (in seconds) allowed to process an end-user request. If no response is returned within this time, the request will be ignored.

Inference
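A short example of serving configuration defaults using the two fields above (the timeout values are illustrative):

defaults:
  servingConfiguration:
    initializationTimeoutSeconds: 600
    requestTimeoutSeconds: 60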

Value Types

Each field has a specific value type. The following value types are supported.

Value type
Description
Supported rule type
Defaults

Boolean

A binary value that can be either True or False

true/false

String

A sequence of characters used to represent text. It can include letters, numbers, symbols, and spaces

abc

Itemized

An ordered collection of items (objects). All items in the list are of the same type, and each item can be referenced by its key field. For further information see the Itemized chapter below the table.

See below

Integer

An Integer is a whole number without a fractional component.

100

Number

Capable of having non-integer values

10.3

Quantity

Holds a string composed of a number and a unit representing a quantity

5M

Array

Set of values that are treated as one, as opposed to Itemized in which each item can be referenced separately.

  • node-a

  • node-b

  • node-c

Itemized

Workload fields of type itemized have multiple instances; however, in comparison to objects, each instance can be referenced by a key field. The key field is defined per field.

Consider the following workload spec:

spec:
  image: ubuntu
  compute:
    extendedResources:
      - resource: added/cpu
        quantity: 10
      - resource: added/memory
        quantity: 20M

In this example, extendedResources have two instances, each has two attributes: resource (the key attribute) and quantity.

In a policy, the defaults and rules for itemized fields have two subsections:

  • Instances: default items to be added to the policy or rules which apply to an instance as a whole.

  • Attributes: defaults for attributes within an item or rules which apply to attributes within each item.

Consider the following example:

defaults:
  compute:
    extendedResources:
      instances: 
        - resource: default/cpu
          quantity: 5
        - resource: default/memory
          quantity: 4M
      attributes:
        quantity: 3
rules:
  compute:
    extendedResources:
      instances:
        locked: 
          - default/cpu
      attributes:
        quantity: 
          required: true

Assume the following workload submission is requested:

spec:
  image: ubuntu
  compute:
    extendedResources:
      - resource: default/memory
        exclude: true
      - resource: added/cpu
      - resource: added/memory
        quantity: 5M

The effective policy for the above mentioned workload has the following extendedResources instances:

Resource
Source of the instance
Quantity
Source of the attribute quantity

default/cpu

Policy defaults

5

The default of this instance in the policy defaults section

added/cpu

Submission request

3

The default of the quantity attribute from the attributes section

added/memory

Submission request

5M

Submission request

Note

The default/memory resource is not populated into the workload because it has been excluded from the workload using “exclude: true”.

A workload submission request cannot exclude the default/cpu resource, as this key is included in the locked rules under the instances section.

Rule Types

Rule types
Description
Supported value types

canAdd

Whether the submission request can add items to an itemized field other than those listed in the policy defaults for this field.

locked

Set of items that the workload is unable to modify or exclude. In the example below, a workload policy default is assumed for the HOME and USER instances, which the submission request cannot modify or exclude from the workload.

canEdit

Whether the submission request can modify the policy default for this field. In the example below, it is assumed that the policy has a default for imagePullPolicy. As canEdit is set to false, submission requests are not able to alter this default.

required

When set to true, the workload must have a value for this field. The value can be obtained from policy defaults. If no value is specified in the policy defaults, a value must be specified for this field in the submission request.

min

The minimal value for the field

max

The maximal value for the field

step

The allowed gap between values for this field. In the example below, the allowed values are 1, 3, 5, and 7

options

Set of allowed values for this field

defaultFrom

Sets a default value for a field, calculated based on the value of another field

Rule Type Examples

canAdd
storage:
  hostPath:
    instances:
      canAdd: false
locked
storage:
  hostPath:
    instances:
      locked:
        - HOME
        - USER
canEdit
imagePullPolicy:
    canEdit: false
required
image:
    required: true
min
compute:
  gpuDevicesRequest:
    min: 3
max
compute:
  gpuMemoryRequest:
     max: 2G
step
compute:
  cpuCoreRequest:
    min: 1
    max: 7
    step: 2
options
image:
  options:
    - value: image-1
    - value: image-2
defaultFrom
cpuCoreRequest:
  defaultFrom:
    field: compute.cpuCoreLimit
    factor: 0.5

Policy Spec Sections

For each field of a specific policy, you can specify both rules and defaults. A policy spec consists of the following sections:

  • Rules

  • Defaults

  • Imposed Assets

Rules

Rules set up constraints on workload policy fields. For example, consider the following policy:

rules:
  compute:
    gpuDevicesRequest: 
      max: 8
  security:
    runAsUid: 
      min: 500

Such a policy restricts the maximum value of gpuDevicesRequest to 8, and the minimal value of runAsUid, provided in the security section, to 500.

Defaults

The defaults section is used for providing defaults for various workload fields. For example, consider the following policy:

defaults:
  imagePullPolicy: Always
  security:
    runAsNonRoot: true
    runAsUid: 500

Assume a submission request with the following values:

  • Image: ubuntu

  • runAsUid: 501

The effective workload that runs has the following set of values:

Field
Value
Source

image

ubuntu

Submission request

imagePullPolicy

Always

Policy defaults

security.runAsNonRoot

true

Policy defaults

security.runAsUid

501

Submission request

Note

It is possible to specify a rule for each field, which states whether a submission request is allowed to change the policy default for that field. For example:

defaults:
  imagePullPolicy: Always
  security:
    runAsNonRoot: true
    runAsUid: 500
rules:
  security:
    runAsUid:
      canEdit: false

If this policy is applied, the submission request above fails, as it attempts to change the value of security.runAsUid from 500 (the policy default) to 501 (the value provided in the submission request), which is forbidden because the canEdit rule is set to false for this field.

Imposed Assets

Default instances of a storage field can be provided using a data source asset containing the details of the storage instance. To add such instances to the policy, specify the asset IDs in the imposedAssets section of the policy.

defaults: null
rules: null
imposedAssets:
  - f12c965b-44e9-4ff6-8b43-01d8f9e630cc

Assets with references to credential assets (for example: private S3, containing reference to an AccessKey asset) cannot be used as imposedAssets.
