diff --git a/.gitignore b/.gitignore index 4940046..8a5c8f5 100644 --- a/.gitignore +++ b/.gitignore @@ -1,2 +1,4 @@ venv +site + diff --git a/docs/guides/compute-daemons/readme.md b/docs/guides/compute-daemons/readme.md index 66dfdb4..cb79854 100644 --- a/docs/guides/compute-daemons/readme.md +++ b/docs/guides/compute-daemons/readme.md @@ -40,7 +40,7 @@ SERVICE_ACCOUNT=$1 NAMESPACE=$2 kubectl get secret ${SERVICE_ACCOUNT} -n ${NAMESPACE} -o json | jq -Mr '.data.token' | base64 --decode > ./service.token -kubectl get secret ${SERVICE_ACCOUNT} -n ${NAMESPACE} -o json | jq -Mr '.data["ca.crt"]' | base64 -decode > ./service.cert +kubectl get secret ${SERVICE_ACCOUNT} -n ${NAMESPACE} -o json | jq -Mr '.data["ca.crt"]' | base64 --decode > ./service.cert ``` The `service.token` and `service.cert` files must be copied to each compute node, typically in the `/etc/[BINARY-NAME]/` directory diff --git a/docs/guides/ha-cluster/readme.md b/docs/guides/ha-cluster/readme.md index 25dba31..1f88a99 100644 --- a/docs/guides/ha-cluster/readme.md +++ b/docs/guides/ha-cluster/readme.md @@ -46,7 +46,7 @@ Configure the NNF agent with the following parameters: | `nnf-node-name=[NNF-NODE-NAME]` | Name of the NNF node as it is appears in the System Configuration | | `api-version=[VERSION]` | The API Version of the NNF Node resource. Defaults to "v1alpha1" | -The token and certificate can be found in the Kubernetes Secrets resource for the nnf-system/nnf-fence-agent ServiceAccount. This provides RBAC rules to limit the fencing agent to only the Kubernetes resources it needs access to. +The token and certificate can be found in the Kubernetes Secrets resource for the nnf-system/nnf-fencing-agent ServiceAccount. This provides RBAC rules to limit the fencing agent to only the Kubernetes resources it needs access to. For example, setting up the NNF fencing agent on `rabbit-node-1` with a kubernetes service API running at `192.168.0.1:6443` and the service token and certificate copied to `/etc/nnf/fence/`. This needs to be run on one node in the cluster. diff --git a/docs/guides/index.md b/docs/guides/index.md index ba933df..9f85555 100644 --- a/docs/guides/index.md +++ b/docs/guides/index.md @@ -12,3 +12,9 @@ * [Storage Profiles](storage-profiles/readme.md) * [Data Movement Configuration](data-movement/readme.md) + +## NNF User Containers + +* [User Containers](user-containers/readme.md) + + diff --git a/docs/guides/rbac-for-users/readme.md b/docs/guides/rbac-for-users/readme.md index 85a7545..ab1f563 100644 --- a/docs/guides/rbac-for-users/readme.md +++ b/docs/guides/rbac-for-users/readme.md @@ -3,13 +3,15 @@ authors: Matt Richerson categories: setup --- -# RBAC for Users +# RBAC: Role-Based Access Control -This document shows how to create a kubeconfig file with RBAC set up to restrict access to view only for resources. +RBAC (Role Based Access Control) determines the operations a user or service can perform on a list of Kubernetes resources. RBAC affects everything that interacts with the kube-apiserver (both users and services internal or external to the cluster). More information about RBAC can be found in the Kubernetes [***documentation***](https://kubernetes.io/docs/reference/access-authn-authz/rbac/). -## Overview +## RBAC for Users -RBAC (Role Based Access Control) determines the operations a user or service can perform on a list of Kubernetes resources. RBAC affects everything that interacts with the kube-apiserver (both users and services internal or external to the cluster). 
More information about RBAC can be found in the Kubernetes [***documentation***](https://kubernetes.io/docs/reference/access-authn-authz/rbac/). +This section shows how to create a kubeconfig file with RBAC set up to restrict access to view only for resources. + +### Overview User access to a Kubernetes cluster is defined through a kubeconfig file. This file contains the address of the kube-apiserver as well as the key and certificate for the user. Typically this file is located in `~/.kube/config`. When a kubernetes cluster is created, a config file is generated for the admin that allows unrestricted access to all resources in the cluster. This is the equivalent of `root` on a Linux system. @@ -19,46 +21,49 @@ The goal of this document is to create a new kubeconfig file that allows view on - Creating a new kubeconfig file - Adding RBAC rules for the "hpe" user to allow read access -## Generate a Key and Certificate +### Generate a Key and Certificate The first step is to create a new key and certificate so that HPE employees can authenticate as the "hpe" user. This will likely be done on one of the master nodes. The `openssl` command needs access to the certificate authority file. This is typically located in `/etc/kubernetes/pki`. ```bash # make a temporary work space -mkdir /tmp/hpe -cd /tmp/hpe +mkdir /tmp/rabbit +cd /tmp/rabbit + +# Create this user +export USERNAME=hpe # generate a new key -openssl genrsa -out hpe.key 2048 +openssl genrsa -out rabbit.key 2048 -# create a certificate signing request for the "hpe" user -openssl req -new -key hpe.key -out hpe.csr -subj "/CN=hpe" +# create a certificate signing request for this user +openssl req -new -key rabbit.key -out rabbit.csr -subj "/CN=$USERNAME" # generate a certificate using the certificate authority on the k8s cluster. This certificate lasts 500 days -openssl x509 -req -in hpe.csr -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key -CAcreateserial -out hpe.crt -days 500 +openssl x509 -req -in rabbit.csr -CA /etc/kubernetes/pki/ca.crt -CAkey /etc/kubernetes/pki/ca.key -CAcreateserial -out rabbit.crt -days 500 ``` -## Create a kubeconfig +### Create a kubeconfig -After the keys have been generated, a new kubeconfig file can be created for the "hpe" user. The admin kubeconfig `/etc/kubernetes/admin.conf` can be used to determine the cluster name kube-apiserver address. +After the keys have been generated, a new kubeconfig file can be created for this user. The admin kubeconfig `/etc/kubernetes/admin.conf` can be used to determine the cluster name kube-apiserver address. 
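+
+One way to discover these values is to read them from the admin kubeconfig with `kubectl config view`; the snippet below is a sketch that assumes a single-cluster `admin.conf` and sets the `CLUSTER_NAME` and `SERVER_ADDRESS` variables used by the commands that follow.
+
+```bash
+# read the cluster name and kube-apiserver address from the admin kubeconfig
+export CLUSTER_NAME=$(kubectl config view --kubeconfig=/etc/kubernetes/admin.conf -o jsonpath='{.clusters[0].name}')
+export SERVER_ADDRESS=$(kubectl config view --kubeconfig=/etc/kubernetes/admin.conf -o jsonpath='{.clusters[0].cluster.server}')
+```
+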
 ```bash
 # create a new kubeconfig with the server information
-kubectl config set-cluster {CLUSTER_NAME} --kubeconfig=/tmp/hpe/hpe.conf --server={SERVER_ADDRESS} --certificate-authority=/etc/kubernetes/pki/ca.crt --embed-certs=true
+kubectl config set-cluster $CLUSTER_NAME --kubeconfig=/tmp/rabbit/rabbit.conf --server=$SERVER_ADDRESS --certificate-authority=/etc/kubernetes/pki/ca.crt --embed-certs=true

-# add the key and cert for the "hpe" user to the config
-kubectl config set-credentials hpe --kubeconfig=/tmp/hpe/hpe.conf --client-certificate=/tmp/hpe/hpe.crt --client-key=/tmp/hpe/hpe.key --embed-certs=true
+# add the key and cert for this user to the config
+kubectl config set-credentials $USERNAME --kubeconfig=/tmp/rabbit/rabbit.conf --client-certificate=/tmp/rabbit/rabbit.crt --client-key=/tmp/rabbit/rabbit.key --embed-certs=true

 # add a context
-kubectl config set-context hpe-context --kubeconfig=/tmp/hpe/hpe.conf --cluster={CLUSTER_NAME} --user=hpe
+kubectl config set-context $USERNAME --kubeconfig=/tmp/rabbit/rabbit.conf --cluster=$CLUSTER_NAME --user=$USERNAME
 ```

 The kubeconfig file should be placed in a location where HPE employees have read access to it.

-## Create ClusterRole and ClusterRoleBinding
+### Create ClusterRole and ClusterRoleBinding

 The next step is to create ClusterRole and ClusterRoleBinding resources. The ClusterRole provided allows viewing all cluster and namespace scoped resources, but disallows creating, deleting, or modifying any resources.
@@ -92,10 +97,58 @@ roleRef:

 Both of these resources can be created using the `kubectl apply` command.

-## Testing
+### Testing

 Get, List, Create, Delete, and Modify operations can be tested as the "hpe" user by setting the KUBECONFIG environment variable to use the new kubeconfig file. Get and List should be the only allowed operations. Other operations should fail with a "forbidden" error.

 ```bash
-export KUBECONFIG=/tmp/hpe/hpe.conf
+export KUBECONFIG=/tmp/rabbit/rabbit.conf
 ```
+
+## RBAC for Workload Manager (WLM)
+
+**Note:** This section assumes the reader has read and understood the steps described above for setting up `RBAC for Users`.
+
+A workload manager (WLM) such as [Flux](https://github.com/flux-framework) or [Slurm](https://slurm.schedmd.com) will interact with [DataWorkflowServices](https://dataworkflowservices.github.io) as a privileged user. RBAC is used to limit the operations that a WLM can perform on a Rabbit system.
+
+The following steps are required to create a user and a role for the WLM. In this case, we're creating a user to be used with the Flux WLM:
+
+- Generate a new key/cert pair for a "flux" user
+- Create a new kubeconfig file
+- Add RBAC rules for the "flux" user to allow appropriate access to the DataWorkflowServices API.
+
+### Generate a Key and Certificate
+
+Generate a key and certificate for our "flux" user, similar to the way we created one for the "hpe" user above. Substitute "flux" in place of "hpe".
+
+### Create a kubeconfig
+
+After the keys have been generated, a new kubeconfig file can be created for the "flux" user, similar to the one for the "hpe" user above. Again, substitute "flux" in place of "hpe".
+
+### Apply the provided ClusterRole and create a ClusterRoleBinding
+
+DataWorkflowServices has already defined the role to be used with WLMs.
Simply apply the `workload-manager` ClusterRole from DataWorkflowServices to the system: + +```console +kubectl apply -f https://github.com/HewlettPackard/dws/raw/master/config/rbac/workload_manager_role.yaml +``` + +Create and apply a ClusterRoleBinding to associate the "flux" user with the `workload-manager` ClusterRole: + +ClusterRoleBinding +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRoleBinding +metadata: + name: flux +subjects: +- kind: User + name: flux + apiGroup: rbac.authorization.k8s.io +roleRef: + kind: ClusterRole + name: workload-manager + apiGroup: rbac.authorization.k8s.io +``` + +The WLM should then use the kubeconfig file associated with this "flux" user to access the DataWorkflowServices API and the Rabbit system. diff --git a/docs/guides/user-containers/readme.md b/docs/guides/user-containers/readme.md new file mode 100644 index 0000000..e0e7a4c --- /dev/null +++ b/docs/guides/user-containers/readme.md @@ -0,0 +1,200 @@ +# NNF User Containers + +NNF User Containers are a mechanism to allow user-defined containerized +applications to be run on Rabbit nodes with access to NNF ephemeral and persistent storage. + +!!! note + + The following is a limited look at User Containers. More content will be + provided after the RFC has been finalized. + +## Custom NnfContainerProfile + +The author of a containerized application will work with the administrator to +define a pod specification template for the container and to create an +appropriate NnfContainerProfile resource for the container. The image and tag +for the user's container will be specified in the profile. + +New NnfContainerProfile resources may be created by copying one of the provided +example profiles from the `nnf-system` namespace. The examples may be found by listing them with `kubectl`: + +```console +kubectl get nnfcontainerprofiles -n nnf-system +``` + +### Workflow Job Specification + +The user's workflow will specify the name of the NnfContainerProfile in a DW +directive. If the custom profile is named `red-rock-slushy` then it will be +specified in the "#DW container" directive with the "profile" parameter. + +```bash +#DW container profile=red-rock-slushy [...] +``` + +## Using a Private Container Repository + +The user's containerized application may be placed in a private repository. In +this case, the user must define an access token to be used with that repository, +and that token must be made available to the Rabbit's Kubernetes environment +so that it can pull that container from the private repository. + +See [Pull an Image from a Private Registry](https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/) in the Kubernetes documentation +for more information. + +### About the Example + +Each container registry will have its own way of letting its users create tokens to +be used with their repositories. Docker Hub will be used for the private repository in this example, and the user's account on Docker Hub will be "dean". + +### Preparing the Private Repository + +The user's application container is named "red-rock-slushy". To store this container +on Docker Hub the user must log into docker.com with their browser and click the "Create repository" button to create a repository named "red-rock-slushy", and the user must check the box that marks the repository as private. The repository's name will be displayed as "dean/red-rock-slushy" with a lock icon to show that it is private. 
+ +### Create and Push a Container + +The user will create their container image in the usual ways, naming it for their private repository and tagging it according to its release. + +Prior to pushing images to the repository, the user must complete a one-time login to the Docker registry using the docker command-line tool. + +```console +docker login -u dean +``` + +After completing the login, the user may then push their images to the repository. + +```console +docker push dean/red-rock-slushy:v1.0 +``` + +### Generate a Read-Only Token + +A read-only token must be generated to allow Kubernetes to pull that container +image from the private repository, because Kubernetes will not be running as +that user. **This token must be given to the administrator, who will use it to create a Kubernetes secret.** + +To log in and generate a read-only token to share with the administrator, the user must follow these steps: + +- Visit docker.com and log in using their browser. +- Click on the username in the upper right corner. +- Select "Account Settings" and navigate to "Security". +- Click the "New Access Token" button to create a read-only token. +- Keep a copy of the generated token to share with the administrator. + +### Store the Read-Only Token as a Kubernetes Secret + +The adminstrator must store the user's read-only token as a kubernetes secret. The +secret must be placed in the `default` namespace, which is the same namespace +where the user containers will be run. The secret must include the user's Docker +Hub username and the email address they have associated with that username. In +this case, the secret will be named `readonly-red-rock-slushy`. + +```console +$ USER_TOKEN=users-token-text +$ USER_NAME=dean +$ USER_EMAIL=dean@myco.com +$ SECRET_NAME=readonly-red-rock-slushy +$ kubectl create secret docker-registry $SECRET_NAME -n default --docker-server="https://index.docker.io/v1/" --docker-username=$USER_NAME --docker-password=$USER_TOKEN --docker-email=$USER_EMAIL +``` + +### Add the Secret to the NnfContainerProfile + +The administrator must add an `imagePullSecrets` list to the NnfContainerProfile +resource that was created for this user's containerized application. + +The following profile shows the placement of the `readonly-red-rock-slushy` secret +which was created in the previous step, and points to the user's +`dean/red-rock-slushy:v1.0` container. + +```yaml +apiVersion: nnf.cray.hpe.com/v1alpha1 +kind: NnfContainerProfile +metadata: + name: red-rock-slushy + namespace: nnf-system +data: + pinned: false + retryLimit: 6 + spec: + imagePullSecrets: + - name: readonly-red-rock-slushy + containers: + - command: + - /users-application + image: dean/red-rock-slushy:v1.0 + name: red-rock-app + storages: + - name: DW_JOB_foo_local_storage + optional: false + - name: DW_PERSISTENT_foo_persistent_storage + optional: true +``` + +Now any user can select this profile in their Workflow by specifying it in a +`#DW container` directive. + +```bash +#DW container profile=red-rock-slushy [...] +``` + +### Using a Private Container Repository for MPI Application Containers + +If our user's containerized application instead contains an MPI application, +because perhaps it's a private copy of [nnf-mfu](https://github.com/NearNodeFlash/nnf-mfu), +then the administrator would insert two `imagePullSecrets` lists into the +`mpiSpec` of the NnfContainerProfile for the MPI launcher and the MPI worker. 
+ +```yaml +apiVersion: nnf.cray.hpe.com/v1alpha1 +kind: NnfContainerProfile +metadata: + name: mpi-red-rock-slushy + namespace: nnf-system +data: + mpiSpec: + mpiImplementation: OpenMPI + mpiReplicaSpecs: + Launcher: + template: + spec: + imagePullSecrets: + - name: readonly-red-rock-slushy + containers: + - command: + - mpirun + - dcmp + - $(DW_JOB_foo_local_storage)/0 + - $(DW_JOB_foo_local_storage)/1 + image: dean/red-rock-slushy:v2.0 + name: red-rock-launcher + Worker: + template: + spec: + imagePullSecrets: + - name: readonly-red-rock-slushy + containers: + - image: dean/red-rock-slushy:v2.0 + name: red-rock-worker + runPolicy: + cleanPodPolicy: Running + suspend: false + slotsPerWorker: 1 + sshAuthMountPath: /root/.ssh + pinned: false + retryLimit: 6 + storages: + - name: DW_JOB_foo_local_storage + optional: false + - name: DW_PERSISTENT_foo_persistent_storage + optional: true +``` + +Now any user can select this profile in their Workflow by specifying it in a +`#DW container` directive. + +```bash +#DW container profile=mpi-red-rock-slushy [...] +``` + + diff --git a/docs/rfcs/0002/readme.md b/docs/rfcs/0002/readme.md index d12e7b9..2e28679 100644 --- a/docs/rfcs/0002/readme.md +++ b/docs/rfcs/0002/readme.md @@ -1,57 +1,189 @@ --- -authors: Nate Thornton +authors: Blake Devcich state: discussion --- -Rabbit storage for containerized applications -============================================= +# Rabbit storage for containerized applications For Rabbit to provide storage to a containerized application there needs to be _some_ mechanism. The remainder of this RFC proposes that mechanism. -Actors ------- +## Actors -There are several different actors involved +There are several actors involved: - The AUTHOR of the containerized application - The ADMINISTRATOR who works with the author to determine the application requirements for execution -- The USER who intends to to use the application using the 'container' directive in their job specification +- The USER who intends to use the application using the 'container' directive in their job specification - The RABBIT software that interprets the #DWs and starts the container during execution of the job -There are multiple relationships between the actors +There are multiple relationships between the actors: - AUTHOR to ADMINISTRATOR: The author tells the administrator how their application is executed and the NNF storage requirements. - Between the AUTHOR and USER: The application expects certain storage, and the #DW must meet those expectations. - ADMINISTRATOR to RABBIT: Admin tells Rabbit how to run the containerized application with the required storage. - Between USER and RABBIT: User provides the #DW container directive in the job specification. Rabbit validates and interprets the directive. -Proposal --------- +## Proposal -The proposal below might take a couple of read-throughs; I've also added a concrete example afterward that might help. +The proposal below outlines the high level behavior of running containers in a workflow: 1. The AUTHOR writes their application expecting NNF Storage at specific locations. For each storage requirement, they define: 1. a unique name for the storage which can be referenced in the 'container' directive - 2. the expected storage types; if necessary - 3. the required mount path or mount path prefix - 4. other constraints or storage requirements (e.g. minimum capacity) + 2. the required mount path or mount path prefix + 3. other constraints or storage requirements (e.g. minimum capacity) 2. 
The AUTHOR works with the ADMINISTRATOR to define: 1. a unique name for the program to be referred by USER - 2. the pod template specification for executing their program - 3. the NNF storage requirements described above. -3. The ADMINISTRATOR creates a corresponding _NNF Container Profile_ custom kubernetes resource with the necessary NNF storage requirements and pod specification as described by the AUTHOR -4. The USER who desires to use the application works with the AUTHOR and the related NNF Container Profile to understand the storage requirements. -5. The USER submits a WLM job with the #DW container fields populated -6. WLM runs the job and drives the job through the following stages... - 1. Proposal: RABBIT validates the #DW container directive by comparing the supplied values to what is listed in the NNF Container Profile. If the USER fails to meet the requirements, the job fails. - 2. Pre-run: RABBIT software will: - 1. create a config map reflecting the storage requirements and any runtime parameters; this is provided to the container at the volume mount named "nnf-config", if specified. - 2. duplicate the pod template specification from the Container Profile and patches the necessary Volumes and the config map. The spec is used as the basis for starting the necessary pods and containers. - 3. The containerized application executes. The expected mounts are available per the requirements and celebration occurs. - -Example -------- - -Say I authored a simple application, `foo`, that requires Rabbit local GFS2 storage and a persistent Lustre storage volume. As the author, my program is coded to expect the GFS2 volume is mounted at `/foo/local` and the Lustre volume is mounted at `/foo/persistent`. In this case, the storages are not optional, so they are defined as such in the NNF Container Profile. + 2. the pod template or MPI Job specification for executing their program + 3. the NNF storage requirements described above. +3. The ADMINISTRATOR creates a corresponding _NNF Container Profile_ Kubernetes custom resource with the necessary NNF storage requirements and pod specification as described by the AUTHOR +4. The USER who desires to use the application works with the AUTHOR and the related NNF Container Profile to understand the storage requirements +5. The USER submits a WLM job with the #DW container directive variables populated +6. WLM runs the workflow and drives it through the following stages... + 1. `Proposal`: RABBIT validates the #DW container directive by comparing the supplied values to those listed in the NNF Container Profile. If the workflow fails to meet the requirements, the job fails + 2. `PreRun`: RABBIT software: + 1. duplicates the pod template specification from the Container Profile and patches the necessary Volumes and the config map. The spec is used as the basis for starting the necessary pods and containers + 2. creates a config map reflecting the storage requirements and any runtime parameters; this is provided to the container at the volume mount named `nnf-config`, if specified + 3. The containerized application(s) executes. The expected mounts are available per the requirements and celebration occurs. The pods continue to run until: + 1. a pod completes successfully (any failed pods will be retried) + 2. the max number of pod retries is hit (indicating failure on all retry attempts) + 1. Note: retry limit is non-optional per Kubernetes configuration + 2. If retries are not desired, this number could be set to 0 to disable any retry attempts + 4. 
`PostRun`: RABBIT software: + 1. marks the stage as `Ready` if the pods have all completed successfully. This includes a successful retry after preceding failures + 2. starts a timer for any running pods. Once the timeout is hit, the pods will be killed and the workflow will indicate failure + 3. leaves all pods around for log inspection + +### Container Assignment to Rabbit Nodes + +During `Proposal`, the USER must assign compute nodes for the container workflow. The assigned compute nodes determine which Rabbit nodes run the containers. + +### Container Definition + +Containers can be launched in two ways: + +1. MPI Jobs +2. Non-MPI Jobs + +MPI Jobs are launched using [`mpi-operator`](https://github.com/kubeflow/mpi-operator). This uses a launcher/worker model. The launcher pod is responsible for running the `mpirun` command that will target the worker pods to run the MPI application. The launcher will run on the first targeted NNF node and the workers will run on each of the targeted NNF nodes. + +For Non-MPI jobs, `mpi-operator` is **not** used. This model runs the same application on each of the targeted NNF nodes. + +The NNF Container Profile allows a user to pick one of these methods. Each method is defined in similar, but different fashions. Since MPI Jobs use `mpi-operator`, the [`MPIJobSpec`](https://pkg.go.dev/github.com/kubeflow/mpi-operator@v0.4.0/pkg/apis/kubeflow/v2beta1#MPIJobSpec) is used to define the container(s). For Non-MPI Jobs a [`PodSpec`](https://pkg.go.dev/k8s.io/api/core/v1#PodSpec) is used to define the container(s). + +An example of an MPI Job is below. The `data.mpiSpec` field is defined: + +```yaml +kind: NnfContainerProfile +apiVersion: nnf.cray.hpe.com/v1alpha1 +data: + mpiSpec: + mpiReplicaSpecs: + Launcher: + template: + spec: + containers: + - command: + - mpirun + - dcmp + - $(DW_JOB_foo_local_storage)/0 + - $(DW_JOB_foo_local_storage)/1 + image: ghcr.io/nearnodeflash/nnf-mfu:latest + name: example-mpi + Worker: + template: + spec: + containers: + - image: ghcr.io/nearnodeflash/nnf-mfu:latest + name: example-mpi + slotsPerWorker: 1 +... +``` + +An example of a Non-MPI Job is below. The `data.spec` field is defined: + +```yaml +kind: NnfContainerProfile +apiVersion: nnf.cray.hpe.com/v1alpha1 +data: + spec: + containers: + - command: + - /bin/sh + - -c + - while true; do date && sleep 5; done + image: alpine:latest + name: example-forever +... +``` + +In both cases, the `spec` is used as a starting point to define the containers. NNF software supplements the specification to add functionality (e.g. mounting #DW storages). In other words, what you see here will not be the final spec for the container that ends up running as part of the container workflow. + +### Security + +The workflow's UID and GID are used to run the container application and for mounting the specified fileystems in the container. Kubernetes allows for a way to define permissions for a container using a [Security Context](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/). + +`mpirun` uses `ssh` to communicate with the worker nodes. `ssh` requires that UID is assigned to a username. Since the UID/GID are dynamic values from the workflow, work must be done to the container's `/etc/passwd` to map the UID/GID to a username. An `InitContainer` is used to modify `/etc/passwd` and mount it into the container. + +### Communication Details + +The following subsections outline the proposed communication between the Rabbit nodes themselves and the Compute nodes. 
+
+#### Rabbit-to-Rabbit Communication
+
+##### Non-MPI Jobs
+
+Each Rabbit node can be reached via `<rabbit-node-name>.<workflow-name>` using DNS. The hostname is the Rabbit node name and the workflow name is used as the subdomain.
+
+For example, with a workflow named `foo` that targets `rabbit-node2`, the container on that node can be reached at `rabbit-node2.foo`.
+
+Environment variables and a ConfigMap are provided to the container, listing each Rabbit that is targeted by the container workflow:
+
+```shell
+NNF_CONTAINER_NODES=rabbit-node2 rabbit-node3
+NNF_CONTAINER_SUBDOMAIN=foo
+NNF_CONTAINER_DOMAIN=default.svc.cluster.local
+```
+
+```yaml
+kind: ConfigMap
+apiVersion: v1
+data:
+  nnfContainerNodes:
+  - rabbit-node2
+  - rabbit-node3
+  nnfContainerSubdomain: foo
+  nnfContainerDomain: default.svc.cluster.local
+```
+
+DNS can then be used to communicate with other Rabbit containers. The FQDN for the container running on rabbit-node2 is `rabbit-node2.foo.default.svc.cluster.local`.
+
+##### MPI Jobs
+
+For MPI Jobs, these hostnames and subdomains will be slightly different due to the implementation of `mpi-operator`. However, the variables will remain the same and provide a consistent way to retrieve the values.
+
+#### Compute-to-Rabbit Communication
+
+For Compute-to-Rabbit communication, the proposal is to use an open port between the nodes so that the applications can communicate using the IP protocol. The port number would be assigned by the Rabbit software and included in the workflow resource's environment variables after the Setup state (similar to the workflow name and namespace). Flux should provide the port number to the compute application via an environment variable or command-line argument. The containerized application would always see the same port number because of the `hostPort`/`containerPort` mapping functionality included in Kubernetes. To clarify, the Rabbit software picks and manages the ports used for `hostPort`.
+
+This requires a range of ports to be open in the firewall configuration and specified in the Rabbit system configuration. The fewer ports available, the greater the chance of a port reservation conflict that would fail a workflow.
+
+Example port range definition in the SystemConfiguration:
+
+```yaml
+apiVersion: v1
+items:
+  - apiVersion: dws.cray.hpe.com/v1alpha1
+    kind: SystemConfiguration
+    name: default
+    namespace: default
+    spec:
+      containerHostPortRangeMin: 30000
+      containerHostPortRangeMax: 40000
+  ...
+```
+
+## Example
+
+For this example, let's assume I've authored an application called `foo`. This application requires Rabbit local GFS2 storage and a persistent Lustre storage volume.
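+
+The sketch below illustrates what `foo` might do at runtime. It assumes, as the MPI example earlier suggests, that each storage path is exposed to the container through an environment variable named after the storage; the flag names passed to `/foo` are purely illustrative.
+
+```bash
+#!/bin/sh
+# hypothetical entrypoint sketch for foo: resolve the DW storage paths from the environment
+LOCAL_DIR="${DW_JOB_foo_local_storage:?GFS2 storage path not set}"
+PERSISTENT_DIR="${DW_PERSISTENT_foo_persistent_storage:?Lustre storage path not set}"
+
+exec /foo --local "$LOCAL_DIR" --persistent "$PERSISTENT_DIR"
+```
+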
Working with an administrator, my application's storage requirements and pod specification are placed in an NNF Container Profile `foo`: @@ -62,40 +194,34 @@ metadata: name: foo namespace: default spec: + postRunTimeout: 300 + maxRetries: 6 storages: - - name: JOB_DW_foo-local-storage + - name: DW_JOB_foo-local-storage optional: false - - name: PERSISTENT_DW_foo-persistent-storage + - name: DW_PERSISTENT_foo-persistent-storage optional: false - template: - metadata: - name: foo - namespace: default - spec: - containers: - - name: foo - image: foo:latest - command: - - /foo - volumeMounts: - - name: foo-local-storage - mountPath: /foo/local - - name: foo-persistent-storage - mountPath: /foo/persistent - - name: nnf-config - mountPath: /nnf/config + spec: + containers: + - name: foo + image: foo:latest + command: + - /foo + ports: + - name: compute + containerPort: 80 ``` Say Peter wants to use `foo` as part of his job specification. Peter would submit the job with the directives below: -``` +```text #DW jobdw name=my-gfs2 type=gfs2 capacity=1TB #DW persistentdw name=some-lustre #DW container name=my-foo profile=foo \ - JOB_DW_foo-local-storage=my-gfs2 \ - PERSISTENT_DW_foo-persistent-storage=some-lustre + DW_JOB_foo-local-storage=my-gfs2 \ + DW_PERSISTENT_foo-persistent-storage=some-lustre ``` Since the NNF Container Profile has specified that both storages are not optional (i.e. `optional: false`), they must both be present in the #DW directives along with the `container` directive. Alternatively, if either was marked as optional (i.e. `optional: true`), it would not be required to be present in the #DW directives and therefore would not be mounted into the container. @@ -106,76 +232,84 @@ Peter submits the job to the WLM. WLM guides the job through the workflow states 2. Setup: Since there is a jobdw, `my-gfs2`, Rabbit software provisions this storage. 3. Pre-Run: 1. Rabbit software generates a config map that corresponds to the storage requirements and runtime parameters. -```yaml - kind: ConfigMap - apiVersion: v1 - metadata: - name: my-job-container-my-foo - data: - JOB_DW_foo-local-storage: type=gfs2 mount-type=indexed-mount - PERSISTENT_DW_foo-persistent-storage: type=lustre mount-type=mount-point -``` - 2. Rabbit software duplicates the `foo` pod template spec in the NNF Container Profile and fills in the necessary volumes and config map. -```yaml - kind: Pod - apiVersion: v1 - metadata: - name: my-job-container-my-foo - template: - metadata: - name: foo - namespace: default - spec: - containers: - # This section unchanged from Container Profile - - name: foo - image: foo:latest - command: - - /foo - volumeMounts: - - name: foo-local-storage - mountPath: /foo/local - - name: foo-persistent-storage - mountPath: /foo/persistent - - name: nnf-config - mountPath: /nnf/config - - # volumes added by Rabbit software - volumes: - - name: foo-local-storage - hostPath: - path: /nnf/job/my-job/my-gfs2 - - name: foo-persistent-storage - hostPath: - path: /nnf/persistent/some-lustre - - name: nnf-config - configMap: + + ```yaml + kind: ConfigMap + apiVersion: v1 + metadata: name: my-job-container-my-foo + data: + DW_JOB_foo_local_storage: mount-type=indexed-mount + DW_PERSISTENT_foo_persistent_storage: mount-type=mount-point + ... + ``` + + 2. Rabbit software creates a pod and duplicates the `foo` pod spec in the NNF Container Profile and fills in the necessary volumes and config map. 
+ + ```yaml + kind: Pod + apiVersion: v1 + metadata: + name: my-job-container-my-foo + template: + metadata: + name: foo + namespace: default + spec: + containers: + # This section unchanged from Container Profile + - name: foo + image: foo:latest + command: + - /foo + volumeMounts: + - name: foo-local-storage + mountPath: + - name: foo-persistent-storage + mountPath: + - name: nnf-config + mountPath: /nnf/config + ports: + - name: compute + hostPort: 9376 # hostport selected by Rabbit software + containerPort: 80 + + # volumes added by Rabbit software + volumes: + - name: foo-local-storage + hostPath: + path: /nnf/job/my-job/my-gfs2 + - name: foo-persistent-storage + hostPath: + path: /nnf/persistent/some-lustre + - name: nnf-config + configMap: + name: my-job-container-my-foo + + # securityContext added by Rabbit software - values will be inherited from the workflow + securityContext: + runAsUser: 1000 + runAsGroup: 2000 + fsGroup: 2000 + ``` - # securityContext added by Rabbit software - values will be inherited from the workflow - securityContext: - runAsUser: 1000 - runAsGroup: 2000 - fsGroup: 2000 -``` 3. Rabbit software starts the pods on Rabbit nodes +4. Post-Run + 1. Rabbit waits for all pods to finish (or until timeout is hit) + 2. If all pods are successful, Post-Run is marked as `Ready` + 3. If any pod is not successful, Post-Run is not marked as `Ready` -Security --------- - -Kubernetes allows for a way to define permissions for a container using a [Security Context](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/). This can be seen in the pod template spec above. The user and group IDs will be inherited from the Workflow's spec. -Special Note: Indexed-Mount Type --------------------------------- +## Special Note: Indexed-Mount Type for GFS2 File Systems -When using a file system like XFS or GFS2, each compute is allocated its own Rabbit volume. The Rabbit software mounts a collection of mount paths with a common prefix and an ending indexed value. +When using a GFS2 file system, each compute is allocated its own Rabbit volume. The Rabbit software mounts a collection of mount paths with a common prefix and an ending indexed value. Application AUTHORS must be aware that their desired mount-point really contains a collection of directories, one for each compute node. The mount point type can be known by consulting the config map values. -If we continue the example from above, the `foo` application would expect the foo-local-storage path of `/foo/local` to contain several directories +If we continue the example from above, the `foo` application expects the foo-local-storage path of `/foo/local` to contain several directories ```shell -# ls /foo/local/* +$ ls /foo/local/* node-0 node-1 @@ -184,7 +318,7 @@ node-2 node-N ``` -Node positions are ***not*** absolute locations. WLM could, in theory, select 6 physical compute nodes at physical location 1, 2, 3, 5, 8, 13, which would appear as directories `/node-0` through `/node-5` in the container path. +Node positions are _not_ absolute locations. WLM could, in theory, select 6 physical compute nodes at physical location 1, 2, 3, 5, 8, 13, which would appear as directories `/node-0` through `/node-5` in the container path. Symlinks will be added to support the physical compute node names. Assuming a compute node hostname of `compute-node-1` from the example above, it would link to `node-0`, `compute-node-2` would link to `node-1`, etc. 
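+
+To make the layout concrete, the sketch below shows how an application might consume it; the paths follow the `foo` example above and the commands are illustrative, not part of the NNF tooling.
+
+```shell
+# walk every compute node's directory under the indexed mount
+for dir in /foo/local/node-*; do
+    echo "processing ${dir}"
+done
+
+# or resolve one compute node's directory through its hostname symlink
+readlink -f /foo/local/compute-node-1    # -> /foo/local/node-0
+```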
diff --git a/mkdocs.yml b/mkdocs.yml index 9934d00..a8bad15 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -15,6 +15,7 @@ nav: - 'High Availability Cluster': 'guides/ha-cluster/readme.md' - 'RBAC for Users': 'guides/rbac-for-users/readme.md' - 'Storage Profiles': 'guides/storage-profiles/readme.md' + - 'User Containers': 'guides/user-containers/readme.md' - 'RFCs': - rfcs/index.md - 'Rabbit Request For Comment Process': 'rfcs/0001/readme.md' @@ -45,6 +46,7 @@ extra: provider: mike default: latest markdown_extensions: + - admonition - pymdownx.highlight: anchor_linenums: true - pymdownx.details