# Initial SDG REST API definitions (#80)
**common/README.md**

# Common

This section of the API definitions holds common structures that are shared across multiple service definitions.
**common/file-path.yaml**

################################################################################
# This schema defines the common ways to reference files and directories held in
# the various supported storage media.
################################################################################

schemas:
  LocalPath:
    type: object
    required: ['path']
    properties:
      path:
        type: string
        description: The path on the local filesystem

  ObjectStoragePath:
    allOf:
      - $ref: './object-storage-connection.yaml#/schemas/Bucket'
      - type: object
        description: 'Path within an object storage bucket'
        required: ['path']
        properties:
          path:
            type: string
            description: The path within the bucket

  FilePath:
    description: Path to an individual file
    oneOf:
      - $ref: '#/schemas/LocalPath'
      - $ref: '#/schemas/ObjectStoragePath'

  DirectoryPath:
    description: Path to a directory
    oneOf:
      - $ref: '#/schemas/LocalPath'
      - $ref: '#/schemas/ObjectStoragePath'
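For illustration, here are payloads that would satisfy the two `FilePath` branches, plus a small dispatcher that tells the `oneOf` branches apart. This is a hedged sketch: the field values and the `classify_path` helper are hypothetical illustrations, not part of the API.

```python
# Hypothetical payloads for the FilePath oneOf branches; all values are illustrative.
local_path = {"path": "/data/seed/qna.yaml"}  # LocalPath

object_storage_path = {  # ObjectStoragePath: Bucket fields plus a path
    "endpoint": "https://s3.example.com",
    "credentials": {"access_key_id": "example-access-key", "secret_key": "example-secret"},
    "bucket": "sdg-data",
    "path": "seed/qna.yaml",
}

def classify_path(payload: dict) -> str:
    """Distinguish the oneOf branches by the fields only one branch carries."""
    if "bucket" in payload and "endpoint" in payload:
        return "ObjectStoragePath"
    if "path" in payload:
        return "LocalPath"
    raise ValueError("payload matches neither FilePath branch")
```

Note that a real validator would resolve the `$ref`s and apply JSON Schema `oneOf` semantics; the helper above only mirrors the distinguishing fields.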
**common/job-status.yaml**

################################################################################
# This schema defines the common elements of a Job Status response. Individual
# job types may extend this model with task-specific properties, but these
# common properties must be present in the response to all job status queries.
################################################################################

schemas:
  JobStatus:
    type: object
    description: The status of a job in the system
    properties:
      job_id:
        type: string
        description: Unique identifier for a single job
      status:
        type: string
        enum:
          - QUEUED
          - RUNNING
          - COMPLETED
          - CANCELED
          - ERRORED
        description: >
          Status of the job in the system:

          * QUEUED: The job has not started and is waiting to be scheduled
          * RUNNING: The job is actively running as expected
          * COMPLETED: The job has completed successfully
          * CANCELED: The job was canceled by user action
          * ERRORED: The job terminated in an error state and is not running
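A status response conforming to this schema might look like the following. The job id is made up, and the `is_terminal` helper is an assumption about how a client might consume the enum, not part of the spec.

```python
# Hypothetical JobStatus response body; job_id is illustrative.
job_status = {
    "job_id": "sdg-job-123",
    "status": "RUNNING",
}

# Per the enum descriptions, COMPLETED, CANCELED, and ERRORED are the
# states from which a job will not run again.
TERMINAL_STATES = {"COMPLETED", "CANCELED", "ERRORED"}

def is_terminal(status: dict) -> bool:
    """Return True if the job has reached a final state."""
    return status["status"] in TERMINAL_STATES
```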
**common/object-storage-connection.yaml**

################################################################################
# This schema defines the common components used to reference content held in a
# cloud object store using an S3 interface.
################################################################################

schemas:
  HMACCredentials:
    type: object
    properties:
      access_key_id:
        type: string
        description: The public Access Key ID
      secret_key:
        type: string
        description: The private Secret Key

  IAMCredentials:
    type: object
    properties:
      # TODO: What else goes here?
      apikey:
        type: string
        description: The IAM apikey

  Service:
    type: object
    description: Pointer to an object storage service
    required: ['endpoint', 'credentials']
    properties:
      endpoint:
        type: string
        description: The qualified endpoint of the object storage service (http://, https://)
      credentials:
        oneOf:
          - $ref: '#/schemas/HMACCredentials'
          - $ref: '#/schemas/IAMCredentials'
      region:
        type: string
        description: The region qualifier for this service

  Bucket:
    allOf:
      - $ref: '#/schemas/Service'
      - type: object
        description: Pointer to an object storage bucket
        required: ['bucket']
        properties:
          bucket:
            type: string
          description: The name of the bucket
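A hedged sketch of a `Bucket` payload using the HMAC branch of the credentials `oneOf` (every value here is hypothetical). One caveat worth noting: neither credentials schema declares `required` fields yet, so as written an empty object could satisfy both branches of the `oneOf`; the helper below assumes the distinguishing fields are present.

```python
# Hypothetical Bucket payload: Service fields plus the bucket name.
bucket = {
    "endpoint": "https://s3.us-east.example.com",
    "region": "us-east",
    "credentials": {  # HMACCredentials branch of the oneOf
        "access_key_id": "example-access-key",
        "secret_key": "example-secret",
    },
    "bucket": "sdg-output",
}

def credential_kind(creds: dict) -> str:
    """Pick the credentials branch by its distinguishing field (assumes one is set)."""
    if "apikey" in creds:
        return "IAMCredentials"
    if "access_key_id" in creds and "secret_key" in creds:
        return "HMACCredentials"
    raise ValueError("unrecognized credentials shape")
```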
**platform.yaml**

openapi: '3.0.2'
info:
  title: InstructLab Backend Platform
  version: '0.1.0'

paths:
  ## Inference #################################################################

  ## Customization #############################################################

  ## Data Jobs #################################################################

  #########
  ## SDG ##
  #########
  /synthetic-data-generations:
    $ref: './platform/synthetic-data-generations.yaml#/paths/~1synthetic-data-generations'
  /synthetic-data-generations/{job_id}:
    $ref: './platform/synthetic-data-generations.yaml#/paths/~1synthetic-data-generations~1{job_id}'
  /synthetic-data-generations/tasks:
    $ref: './platform/synthetic-data-generations.yaml#/paths/~1synthetic-data-generations~1tasks'
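The `~1` sequences in these `$ref`s are not typos: JSON Pointer (RFC 6901) escapes `/` as `~1` and `~` as `~0`, which is how a path like `/synthetic-data-generations/{job_id}` is encoded as a single pointer token. A minimal decoder, for illustration:

```python
def unescape_json_pointer_token(token: str) -> str:
    # RFC 6901: "~1" decodes to "/", "~0" decodes to "~".
    # Order matters: decode "~1" first so "~01" correctly yields "~1".
    return token.replace("~1", "/").replace("~0", "~")

# The fragment used in the $ref above decodes back to the path template:
# "/synthetic-data-generations/{job_id}"
decoded = unescape_json_pointer_token("~1synthetic-data-generations~1{job_id}")
```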
**platform/synthetic-data-generations.yaml**

openapi: '3.0.2'
info:
  title: Synthetic Data Generation
  version: '0.1.0'

paths:
  /synthetic-data-generations:
    post:
      summary: Initialize a Synthetic Data Generation job
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/SDGJobBody'
      responses:
        "201":
          description: A successfully submitted job
          content:
            application/json:
              schema:
                $ref: '../common/job-status.yaml#/schemas/JobStatus'

  /synthetic-data-generations/{job_id}:
    get:
      summary: Retrieve the status of a Synthetic Data Generation job
      responses:
        "200":
          description: The status for the job
          content:
            application/json:
              schema:
                $ref: '../common/job-status.yaml#/schemas/JobStatus'

    delete:
      summary: Cancel a running Synthetic Data Generation job
      responses:
        "200":
          description: The status for the job after cancellation
          content:
            application/json:
              schema:
                $ref: '../common/job-status.yaml#/schemas/JobStatus'

  /synthetic-data-generations/tasks:
    get:
      summary: List the currently supported SDG tasks
      responses:
        "200":
          description: The set of currently supported SDG tasks
          content:
            application/json:
              schema:
                type: array
                items:
                  $ref: '#/components/schemas/SDGTaskDefinition'

components:
  schemas:
    SDGJobBody:
      type: object
      required: ['output_directory', 'tasks', 'seed_data']
      properties:
        output_directory:
          description: Location to place the output in
          $ref: '../common/file-path.yaml#/schemas/DirectoryPath'

        tasks:
          description: Mapping from task name to task config. The config will be validated against the task's config schema.
          type: object
          additionalProperties:
            type: object
            description: Config for the given task. This will be deep-merged over the default values.

        seed_data:
          description: The file or directory containing the seed data
          oneOf:
            - $ref: '../common/file-path.yaml#/schemas/FilePath'
            - $ref: '../common/file-path.yaml#/schemas/DirectoryPath'

    SDGTaskDefinition:
      type: object
      required: ['name', 'data_json_schema', 'config_json_schema']
      properties:
        name:
          type: string
          description: The name of the task
        data_json_schema:
          type: object
          description: The json schema for input data files for this task
          # TODO: This doesn't render cleanly for some reason, but the body here
          # must be a valid JSON Schema
          # $ref: 'https://json-schema.org/draft-04/schema#'
        data_example:
          type: string
          description: Example of an input data file for this task
        config_json_schema:
          type: object
          description: The json schema for the config of this task
          # TODO: This doesn't render cleanly for some reason, but the body here
          # must be a valid JSON Schema
          # $ref: 'https://json-schema.org/draft-04/schema#'
        config_defaults:
          type: object
          description: Default values for all config values

Review thread on the `tasks` property:

> **Comment:** I see there's the potential for multiple "tasks" within one job: what do you visualize that being long term? Would it be something along the lines of being able to have a "generate" task, a "mix" task to do random mixing of that data, and/or potentially a "filter" task to then filter some of the data, with all of that still living within the context of an SDGJob?
>
> **Reply:** Good question. I took this from the CLI args to the library we're thinking of for the generic platform SDG implementation. I think for […]

Review thread on `config_json_schema`:

> **Comment:** By "config", is this where you visualize a flexible "json blob" where users could request "advanced parameters" when necessary to feed into SDG that maybe aren't default (lower-level things like num samples, algorithm utilized in SDG, etc.)? I like the idea of it starting out really flexible, and I guess different implementations at different moments would potentially only allow a given subset of what can be sent at the config level.
>
> **Reply:** Yep, that's exactly the idea here. We imagine the set of […]
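To make the `SDGJobBody` shape and the "deep-merged over the default values" behavior concrete, here is a hedged sketch. The task name, config keys, and default values are all hypothetical; the spec does not define a merge algorithm, so `deep_merge` is one plausible reading (recursive dict merge, overrides winning).

```python
from copy import deepcopy

def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively merge overrides over defaults; nested dicts merge, scalars replace."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical SDGJobBody payload; task name and config keys are illustrative only.
job_body = {
    "output_directory": {"path": "/data/sdg-output"},   # DirectoryPath (local branch)
    "seed_data": {"path": "/data/seed/qna.yaml"},       # FilePath (local branch)
    "tasks": {
        "generate": {"num_samples": 100},  # validated against the task's config schema
    },
}

# Hypothetical config_defaults published by the task's SDGTaskDefinition.
config_defaults = {"num_samples": 30, "sampling": {"temperature": 0.7}}

effective = deep_merge(config_defaults, job_body["tasks"]["generate"])
# effective == {"num_samples": 100, "sampling": {"temperature": 0.7}}
```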
# API Definitions Guidelines

This document describes how service APIs will be managed for `InstructLab` and the sub-components of the `InstructLab` backend.

## What parts of InstructLab need service APIs?

There are two primary classes of service APIs needed to support InstructLab:

* `Platform APIs`: APIs that are ignorant of `InstructLab` and provide generic AI platform capabilities (e.g., Fine Tuning, SDG, Eval)
* `InstructLab APIs`: APIs that reflect the user-facing functionality of `InstructLab` itself. They are aware of the end-to-end `InstructLab` workflow.

The `InstructLab APIs` are essential for hosting `InstructLab` as a service in a repeatable way. The `Platform APIs` are critical for component reuse and extensibility (e.g., new SDG algorithms for new taxonomy data types), but they are not strictly required for hosting `InstructLab` as a service.

## How will service APIs be defined?

Service APIs will be defined in [OpenAPI](https://www.openapis.org/) format using [YAML](https://yaml.org/). For structural and style guidelines, see [api-definitions](../../api-definitions/README.md).

## Where will service API definitions live?

Service API definitions will live in a new repository, github.com/instructlab/service-api-definitions. This repo will have two primary responsibilities:

1. House the static service API definitions
2. Build and publish any language-specific generated packages for consumption by service implementation projects (see below)

## How will service implementations reference shared APIs?

When a project chooses to implement one or more service APIs, there are three acceptable methods for doing so, listed in order of preference:

1. Consume a supported language-specific package. The `service-api-definitions` repo will build consumable packages with generated code for supported languages. This is the preferred method of consumption, as it avoids repository references and code duplication.
2. For languages without a supported package, the `service-api-definitions` repo may be held as a [git submodule](https://www.git-scm.com/book/en/v2/Git-Tools-Submodules).
3. It is also acceptable for an implementation to copy the relevant API definitions into the local project repository. Any changes made in the central repository will need to be synced by the project owners, and any new APIs added in the project will not be considered usable until they have been integrated into the central API definitions.

## Style Guidelines

* Use `kebab-case` for path elements
  * All characters must be in the [ASCII](https://www.ascii-code.com/) character set to avoid percent encoding in URIs
  * All letters must be lowercase
  * Words are separated by the `-` (dash) character
* Use `snake_case` for properties
  * All characters must be in the [UTF-8](https://www.w3schools.com/charsets/ref_html_utf8.asp) character set for simple `json` encoding
  * Words are separated by the `_` (underscore) character
* Use `UpperCamelCase` for internal reusable schema names
  * These are internal names, so the character set is not limited
  * Words are capitalized and concatenated with no separator
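As a quick illustration, the three naming conventions can be encoded as regular expressions. This is a sketch reflecting one reading of the rules above (e.g., whether digits are allowed is not spelled out), not a normative definition:

```python
import re

# Path elements: lowercase ASCII words joined by dashes, e.g. "synthetic-data-generations".
KEBAB_PATH_SEGMENT = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

# Properties: lowercase words joined by underscores, e.g. "output_directory".
SNAKE_PROPERTY = re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$")

# Reusable schema names: capitalized words concatenated, e.g. "SDGJobBody".
UPPER_CAMEL_SCHEMA = re.compile(r"^[A-Z][A-Za-z0-9]*$")
```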
## API Layout

* There will be two main portions of the APIs:
  * `instructlab.yaml`: This defines the user-facing `InstructLab` REST API
  * `platform.yaml`: This defines the platform-level APIs used by the `InstructLab` workflow
* Each platform `Capability` should own its own fully-functional sub-API file that can be used by individual capability service implementations
* Any schema object that is reused between endpoints should be housed in a schema file under the central `common` directory

## Versioning and Stability

**WARNING**: At this stage in development, we make no guarantees about stability and support for APIs!

**FUTURE**: Once stabilized, the APIs will follow an agreed-upon form of [semantic versioning](https://semver.org/) so that users can rely on the API's stability. The decision of how to version the API, and at what granularity, is still under discussion.
Review thread on the `JobStatus` enum:

> **Comment:** Should we map to whatever Kubernetes says on Job status?
>
> **Reply:** Hmm, good question. From my read of the Job status API, there's no equivalent enum in k8s, since some of these states are represented by the absence of certain fields (e.g. `QUEUED` == missing `status.startTime`). I think for a REST API, an enum is a more logical way to represent this, but I think we could tweak the words to be a bit more in line with k8s terminology:
>
> **Comment:** Do we feel we need to model anything for when a job goes through a "temporary failure" and, let's say, goes through a retry? Or would we just go from FAILED to QUEUED again? Or would we consider that "process" another job entirely? Just thinking through how we would like to model what could happen when a job hits a transient failure (let's say due to part of it running on bad infrastructure that is then replaced) and a retry of it is scheduled.
>
> **Reply:** Good question. I think there are probably a lot of detailed error semantics that could shake out of the different usage patterns, but they would probably loosely fall into the `4XX` (user error) vs `5XX` (system error) camps. I don't think we want to be too prescriptive with the job framework's error handling in the API (some implementations may retry whereas others may not), but I think it might be reasonable to consider having two errored states for user vs system. The challenge will then be figuring out how to encode those different error types in the backend library implementing the job body.
>
> **Comment:** I like the enum too, as well as the remap from the Kube terminology. Thanks!