Initial SDG REST API definitions #80

Draft · wants to merge 12 commits into base: main
9 changes: 9 additions & 0 deletions .spellcheck-en-custom.txt
@@ -3,8 +3,12 @@
Abhishek
Akash
AMDGPU
API
API's
api
arge
arXiv
ascii
backend
backends
benchmarking
@@ -37,6 +41,7 @@ Eval
Excalidraw
exfiltrate
exfiltrating
extensibility
Finetuning
formedness
GFX
@@ -109,7 +114,9 @@ Shivchander
Signoff
Srivastava
subdirectory
submodule
Sudalairaj
sync'ed
Taj
tatsu
TBD
@@ -123,6 +130,7 @@ triager's
triagers
unquantized
USM
utf
UX
venv
watsonx
@@ -135,3 +143,4 @@ XT
XTX
Xu
YAML
yaml
3 changes: 3 additions & 0 deletions api-definitions/common/README.md
@@ -0,0 +1,3 @@
# Common

This section of the API definitions holds common structures that are shared across multiple service definitions.
35 changes: 35 additions & 0 deletions api-definitions/common/file-path.yaml
@@ -0,0 +1,35 @@
################################################################################
# This schema defines the common ways to reference files and directories held in
# the various supported storage media.
################################################################################

schemas:
  LocalPath:
    required: ['path']
    properties:
      path:
        type: string
        description: The name of the file on disk

  ObjectStoragePath:
    allOf:
      - $ref: './object-storage-connection.yaml#/schemas/Bucket'
      - type: object
        description: 'Path within an object storage bucket'
        required: ['path']
        properties:
          path:
            type: string
            description: The path within the bucket

  FilePath:
    description: Path to an individual file
    oneOf:
      - $ref: '#/schemas/LocalPath'
      - $ref: '#/schemas/ObjectStoragePath'

  DirectoryPath:
    description: Path to a directory
    oneOf:
      - $ref: '#/schemas/LocalPath'
      - $ref: '#/schemas/ObjectStoragePath'
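A consumer resolving the `FilePath` `oneOf` has to work out which variant a payload matches. A minimal sketch of that discrimination in Python (the `classify_file_path` helper and the example values are hypothetical, not part of the API definitions):

```python
def classify_file_path(payload: dict) -> str:
    """Crude discriminator mirroring the FilePath oneOf (illustrative only)."""
    if "bucket" in payload and "path" in payload:
        # ObjectStoragePath: Bucket fields plus a path within the bucket
        return "ObjectStoragePath"
    if "path" in payload:
        # LocalPath: only requires the name of the file on disk
        return "LocalPath"
    raise ValueError("payload matches neither FilePath variant")

# Hypothetical example payloads
local = {"path": "/data/seed.jsonl"}
remote = {
    "endpoint": "https://s3.example.com",
    "credentials": {"access_key_id": "AKIA-EXAMPLE", "secret_key": "example"},
    "bucket": "sdg-output",
    "path": "runs/latest/seed.jsonl",
}
print(classify_file_path(local))   # LocalPath
print(classify_file_path(remote))  # ObjectStoragePath
```

A real implementation would validate against the resolved OpenAPI schemas rather than key-sniffing, but the shape of the decision is the same.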
29 changes: 29 additions & 0 deletions api-definitions/common/job-status.yaml
@@ -0,0 +1,29 @@
################################################################################
# This schema defines the common elements of a Job Status response. Individual
# job types may extend this model with task-specific properties, but these
# common properties must be present in the response to all job status queries.
################################################################################

schemas:
  JobStatus:
    type: object
    description: The status of a job in the system
    properties:
      job_id:
        type: string
        description: Unique identifier for a single job
      status:
        type: string
        enum:
          - QUEUED
          - RUNNING
          - COMPLETED
          - CANCELED
          - ERRORED
Comment on lines +18 to +22

Contributor: Should we map with whatever Kubernetes says on a Job's status?

Author: Hmm, good question. From my read of the Job status API, there's no equivalent enum in k8s, since some of these states are represented by the absence of certain fields (e.g., QUEUED == missing status.startTime). For a REST API, I think an enum is a more logical way to represent this, but we could tweak the words to be a bit more in line with k8s terminology:

QUEUED -> PENDING
RUNNING -> STARTED
COMPLETED -> SUCCEEDED
CANCELED -> DELETED (I don't like this one because in k8s deletion is an actual -X DELETE)
ERRORED -> FAILED

Do we feel we need to model anything for when a job goes through a "temporary failure" and, say, goes through a retry? Would we just go from FAILED to QUEUED again, or would we consider that "process" another job entirely?

Just thinking through how we would like to model what could happen when a job hits a transient failure (say, due to part of it running on bad infrastructure that is then replaced) and a retry of that is scheduled.

Author: Good question. I think there are probably a lot of detailed error semantics that could shake out of the different usage patterns, but they would probably loosely fall into the 4XX (user error) vs. 5XX (system error) camps. I don't think we want to be too prescriptive with the job framework's error handling in the API (some implementations may retry whereas others may not), but it might be reasonable to consider having two errored states for user vs. system errors. The challenge will then be figuring out how to encode those different error types in the backend library implementing the job body.

Contributor: I like the enum, as well as the remap from the Kube terminology. Thanks!

        description: >
          Status of the job in the system:
          * QUEUED: The job has not started and is waiting to be scheduled
          * RUNNING: The job is actively running as expected
          * COMPLETED: The job has completed successfully
          * CANCELED: The job was canceled by user action
          * ERRORED: The job terminated in an error state and is not running
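The enum above can be mirrored directly in client code. A sketch in Python (the `JobState` class, the `TERMINAL` set, and `is_active` are illustrative assumptions; the API itself only defines the five string values):

```python
from enum import Enum

class JobState(str, Enum):
    """The five status values from the JobStatus schema."""
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    CANCELED = "CANCELED"
    ERRORED = "ERRORED"

# Assumption for this sketch: these three states are terminal, so a
# status poller can stop once it sees one of them.
TERMINAL = {JobState.COMPLETED, JobState.CANCELED, JobState.ERRORED}

def is_active(state: JobState) -> bool:
    """True while a status poller should keep polling."""
    return state not in TERMINAL

print(is_active(JobState("RUNNING")))  # True
```

If the k8s-style renames from this thread (PENDING/STARTED/SUCCEEDED/FAILED) were adopted, only the member values would change.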
51 changes: 51 additions & 0 deletions api-definitions/common/object-storage-connection.yaml
@@ -0,0 +1,51 @@
################################################################################
# This schema defines the common components used to reference content held in a
# cloud object store using an S3 interface.
################################################################################

schemas:
  HMACCredentials:
    type: object
    properties:
      access_key_id:
        type: string
        description: The public Access Key ID
      secret_key:
        type: string
        description: The private Secret Key

  IAMCredentials:
    type: object
    properties:
      # TODO: What else goes here?
      apikey:
        type: string
        description: The IAM apikey

  Service:
    type: object
    description: Pointer to an object storage service
    required: ['endpoint', 'credentials']
    properties:
      endpoint:
        type: string
        description: The qualified endpoint of the object storage service (http://, https://)
      credentials:
        oneOf:
          - $ref: '#/schemas/HMACCredentials'
          - $ref: '#/schemas/IAMCredentials'
      region:
        type: string
        description: The region qualifier for this service

  Bucket:
    allOf:
      - $ref: '#/schemas/Service'
      - type: object
        description: Pointer to an object storage bucket
        required: ['bucket']
        properties:
          bucket:
            type: string
            description: The name of the bucket
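A client constructing a `Bucket` reference needs the `Service` fields plus the bucket name, with `credentials` matching exactly one of the two credential shapes. A hand-rolled structural check, sketched in Python (the `validate_bucket` helper is hypothetical; a real implementation would validate against the resolved OpenAPI schema):

```python
def validate_bucket(payload: dict) -> None:
    """Structural check for the Bucket schema (Service fields + bucket).
    Hand-rolled stand-in for real OpenAPI validation; illustrative only."""
    for field in ("endpoint", "credentials", "bucket"):
        if field not in payload:
            raise ValueError(f"missing required field: {field}")
    creds = payload["credentials"]
    # credentials is a oneOf: HMAC (access_key_id + secret_key) or IAM (apikey)
    is_hmac = {"access_key_id", "secret_key"} <= creds.keys()
    is_iam = "apikey" in creds
    if is_hmac == is_iam:  # matches neither shape, or ambiguously both
        raise ValueError("credentials must match exactly one of HMAC or IAM")

# Hypothetical bucket reference; region is optional per the schema
bucket = {
    "endpoint": "https://s3.example.com",
    "credentials": {"apikey": "example-iam-key"},
    "region": "us-east",
    "bucket": "sdg-seed-data",
}
validate_bucket(bucket)  # no exception: structurally valid
```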
21 changes: 21 additions & 0 deletions api-definitions/platform.yaml
@@ -0,0 +1,21 @@
openapi: '3.0.2'
info:
  title: InstructLab Backend Platform
  version: '0.1.0'

paths:
  ## Inference #################################################################

  ## Customization #############################################################

  ## Data Jobs #################################################################

  #########
  ## SDG ##
  #########
  /synthetic-data-generations:
    $ref: './platform/synthetic-data-generations.yaml#/paths/~1synthetic-data-generations'
  /synthetic-data-generations/{job_id}:
    $ref: './platform/synthetic-data-generations.yaml#/paths/~1synthetic-data-generations~1{job_id}'
  /synthetic-data-generations/tasks:
    $ref: './platform/synthetic-data-generations.yaml#/paths/~1synthetic-data-generations~1tasks'
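The `~1` sequences inside the `$ref` values are JSON Pointer escapes (RFC 6901): `~1` encodes `/` and `~0` encodes `~`, so `~1synthetic-data-generations~1{job_id}` resolves to the path `/synthetic-data-generations/{job_id}`. A minimal decoder in Python:

```python
def unescape_json_pointer_token(token: str) -> str:
    """Decode a JSON Pointer token per RFC 6901: '~1' first, then '~0'."""
    return token.replace("~1", "/").replace("~0", "~")

print(unescape_json_pointer_token("~1synthetic-data-generations~1{job_id}"))
# /synthetic-data-generations/{job_id}
```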
104 changes: 104 additions & 0 deletions api-definitions/platform/synthetic-data-generations.yaml
@@ -0,0 +1,104 @@
openapi: '3.0.2'
info:
  title: Synthetic Data Generation
  version: '0.1.0'

paths:
  /synthetic-data-generations:
    post:
      summary: Initialize a Synthetic Data Generation job
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/SDGJobBody'
      responses:
        "201":
          description: A successfully submitted job
          content:
            application/json:
              schema:
                $ref: '../common/job-status.yaml#/schemas/JobStatus'
  /synthetic-data-generations/{job_id}:
    get:
      summary: Retrieve the status of a Synthetic Data Generation job
      responses:
        "200":
          description: The status for the job
          content:
            application/json:
              schema:
                $ref: '../common/job-status.yaml#/schemas/JobStatus'

    delete:
      summary: Cancel a running Synthetic Data Generation job
      responses:
        "200":
          description: The status for the job after cancellation
          content:
            application/json:
              schema:
                $ref: '../common/job-status.yaml#/schemas/JobStatus'

  /synthetic-data-generations/tasks:
    get:
      summary: List the currently supported SDG tasks
      responses:
        "200":
          description: The set of currently supported SDG tasks
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: '#/components/schemas/SDGTaskDefinition'

components:
  schemas:
    SDGJobBody:
      type: object
      required: ['output_directory', 'tasks', 'seed_data']
      properties:
        output_directory:
          description: Location to place the output in
          $ref: '../common/file-path.yaml#/schemas/DirectoryPath'

        tasks:
          description: Mapping from task name to task config. The config will be validated against the task's config schema.

Review comment:

I see there's the potential for multiple "tasks" within one job: what do you visualize that being long term? Would it be something along the lines of a "generate" task, a "mix" task to do random mixing of that data, and/or a "filter" task to then filter some of the data, with all of that still living within the context of one SDGJob?

Author: Good question. I took this from the CLI args to the library we're thinking of for the generic platform SDG implementation. I think for InstructLab, we'll likely only ever run a single Task (dataset + config + generation algorithm) per job.

          type: object
          additionalProperties:
            type: object
            description: Config for the given task. This will be deep-merged over the default values.

        seed_data:
          description: The file or directory containing the seed data
          oneOf:
            - $ref: '../common/file-path.yaml#/schemas/FilePath'
            - $ref: '../common/file-path.yaml#/schemas/DirectoryPath'

    SDGTaskDefinition:
      type: object
      required: ['name', 'data_json_schema', 'config_json_schema']
      properties:
        name:
          type: string
          description: The name of the task
        data_json_schema:
          type: object
          description: The json schema for input data files for this task
          # TODO: This doesn't render cleanly for some reason, but the body here
          # must be a valid JSON Schema
          # $ref: 'https://json-schema.org/draft-04/schema#'
        data_example:
          type: string
          description: Example of an input data file for this task
        config_json_schema:

Review comment:

By "config", is this where you visualize a flexible "json blob" where users could request "advanced parameters" when necessary to feed into SDG that maybe aren't default (lower-level things like num samples, the algorithm utilized in SDG, etc.)?

I like the idea of it starting out really flexible; I guess different implementations, at different moments, would only allow a given subset of what can be sent at the config level.

Author: Yep, that's exactly the idea here. We imagine the set of tasks to be extensible. Initially, this would be "build time," where the owner of the docker image would rebuild with new task implementations, but eventually we'd imagine users creating their own tasks by binding proprietary datasets/prompts/etc. to existing generation algorithms. Each generation algorithm has a set of lower-level configs that can theoretically be overridden for each job, so the idea with this API is that when creating the job, the config overrides are an opaque blob, but you can query the system beforehand to understand the right schema for that blob. This avoids the need for us to keep a giant oneOf in the API definitions while still giving the user the ability to know the acceptable schemas that will be used for validation.

          type: object
          description: The json schema for the config of this task
          # TODO: This doesn't render cleanly for some reason, but the body here
          # must be a valid JSON Schema
          # $ref: 'https://json-schema.org/draft-04/schema#'
        config_defaults:
          type: object
          description: Default values for all config values
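The `tasks` config is described above as being deep-merged over the task's `config_defaults`. One plausible reading of that merge semantics (the `deep_merge` helper and the config keys in the example are illustrative assumptions, not the actual implementation):

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Overlay overrides onto defaults: nested dicts merge recursively,
    every other value is replaced wholesale. Illustrative semantics only."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical config keys, loosely inspired by the review discussion
defaults = {"generation": {"num_samples": 100, "temperature": 0.7}, "seed": 42}
overrides = {"generation": {"num_samples": 500}}
print(deep_merge(defaults, overrides))
# {'generation': {'num_samples': 500, 'temperature': 0.7}, 'seed': 42}
```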
58 changes: 58 additions & 0 deletions docs/backend/api-definitions-guidelines.md
@@ -0,0 +1,58 @@
# API Definitions Guidelines

This document describes how service APIs will be managed for `InstructLab` and the sub-components of the `InstructLab` backend.

## What parts of InstructLab need service APIs?

There are two primary classes of service APIs needed to support InstructLab:

* `Platform APIs`: These are APIs that are ignorant of `InstructLab` and provide generic AI platform capabilities (e.g., Fine Tuning, SDG, Eval)
* `InstructLab APIs`: These are the APIs that reflect the user-facing functionality of `InstructLab` itself. They are aware of the end-to-end `InstructLab` workflow.

The `InstructLab APIs` are essential for hosting `InstructLab` as a service in a repeatable way. The `Platform APIs` are critical for component reuse and extensibility (e.g., new SDG algorithms for new taxonomy data types), but they are not strictly required for hosting `InstructLab` as a service.

## How will service APIs be defined?

Service APIs will be defined using [OpenAPI](https://www.openapis.org/) format in [YAML](https://yaml.org/). For structural and style guidelines, see [api-definitions](../../api-definitions/README.md).

## Where will service API definitions live?

Service API definitions will live in a new repository, github.com/instructlab/service-api-definitions. This repo will have two primary responsibilities:

1. House the static service API definitions
2. Build and publish any language-specific generated packages for consumption by service implementation projects (see below)

## How will service implementations reference shared APIs?

When a project chooses to implement one or more service APIs, there are three acceptable methods for doing so, listed in order of preference:

1. Consume a supported language-specific package. The `service-api-definitions` repo will build consumable packages with generated code for supported languages. This is the preferred method of consumption as it avoids repository references and code duplication.
2. For languages without a supported package, the `service-api-definitions` repo may be held as a [git submodule](https://www.git-scm.com/book/en/v2/Git-Tools-Submodules).
3. It is also acceptable for an implementation to copy the relevant API definitions to the local project repository. Any changes made in the central repository will need to be sync'ed by the project owners, and any new APIs added in the project will not be considered usable until they have been integrated into the central API definitions.

## Style Guidelines

* Use `kebab-case` for path elements
* All characters must be in the [ascii](https://www.ascii-code.com/) character set to avoid percent encoding in URIs
* All letters must be lowercase
* Words are separated by the `-` (dash) character
* Use `snake_case` for properties
* All characters must be in the [utf-8](https://www.w3schools.com/charsets/ref_html_utf8.asp) character set for simple `json` encoding
* Words are separated by the `_` (underscore) character
* Use `UpperCamelCase` for internal reusable schema names
* These are internal names, so the character set is not limited
* Words are capitalized and concatenated with no separator
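The three casing rules above can be sketched as simple converters in Python (the helper names are hypothetical; this is just to illustrate the conventions):

```python
import re

def to_kebab(name: str) -> str:
    """UpperCamelCase or snake_case -> kebab-case (for path elements)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "-", name).replace("_", "-").lower()

def to_snake(name: str) -> str:
    """UpperCamelCase -> snake_case (for properties)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

print(to_kebab("SyntheticDataGenerations"))  # synthetic-data-generations
print(to_snake("JobId"))                     # job_id
```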

## API Layout

* There will be two main portions of the APIs:
* `instructlab.yaml`: This defines the user-facing `InstructLab` REST API
* `platform.yaml`: This defines the platform-level APIs used by the `InstructLab` workflow.
* Each platform `Capability` should own its own fully-functional sub-API file that can be used by individual capability service implementations
* Any schema object that is reused between endpoints should be housed in a schema file under the central `common` directory.

## Versioning and Stability

**WARNING** At this stage in development, we make no guarantees about stability and support for APIs!

**FUTURE**: Once stabilized, the APIs will follow an agreed-upon form of [semantic versioning](https://semver.org/) so that users can rely on the API's stability. The decision of how to version the API and at what granularity to do so is still under discussion.