Merge master (#131)

* [perf] use uv for venv creation and pip install (#4414)

* Revert "remove `uv` from runtime setup due to azure installation issue (#4401)"

This reverts commit 0b20d56.

* on azure, use --prerelease=allow to install azure-cli

* use uv venv --seed

* fix backwards compatibility

* really fix backwards compatibility

* use uv to set up controller dependencies

* fix python 3.8

* lint

* add missing file

* update comment

* split out azure-cli dep

* fix lint for dependencies

* use runpy.run_path rather than modifying sys.path

* fix cloud dependency installation commands

* lint

* Update sky/utils/controller_utils.py

Co-authored-by: Zhanghao Wu <[email protected]>

---------

Co-authored-by: Zhanghao Wu <[email protected]>
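
For illustration, the uv-based setup described above boils down to something like the following sketch (flags taken from the bullets; the exact commands in SkyPilot's runtime setup may differ):

```python
# A minimal sketch, assuming `uv` is on PATH. `--seed` installs
# pip/setuptools/wheel into the new venv so that older code paths that
# shell out to `pip` keep working.
import subprocess

subprocess.run(['uv', 'venv', '--seed', '.venv'], check=True)
# On Azure, allow prereleases so azure-cli resolves (per the commit above).
subprocess.run(
    ['uv', 'pip', 'install', '--prerelease=allow', 'azure-cli'],
    check=True)
```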

* [Minor] README updates. (#4436)

* [Minor] README touches.

* update

* update

* make --fast robust against credential or wheel updates (#4289)

* add config_dict['config_hash'] output to write_cluster_config

* fix docstring for write_cluster_config

This used to be true, but since #2943, 'ray' is the only provisioner.
Add other keys that are now present instead.

* when using --fast, check if config_hash matches, and if not, provision

* mock hashing method in unit test

This is needed since some files in the fake file mounts don't actually exist,
like the wheel path.

* check config hash within provision with lock held

* address other PR review comments

* rename to skip_if_no_cluster_updates

Co-authored-by: Zhanghao Wu <[email protected]>

* add assert details

Co-authored-by: Zhanghao Wu <[email protected]>

* address PR comments and update docstrings

* fix test

* update docstrings

Co-authored-by: Zhanghao Wu <[email protected]>

* address PR comments

* fix lint and tests

* Update sky/backends/cloud_vm_ray_backend.py

Co-authored-by: Zhanghao Wu <[email protected]>

* refactor skip_if_no_cluster_update var

* clarify comment

* format exception

---------

Co-authored-by: Zhanghao Wu <[email protected]>
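
A rough sketch of the config-hash idea from #4289 (names here are illustrative, not SkyPilot's actual API): hash the rendered cluster config together with the local files it references, and let `--fast` re-provision only when the hash changes.

```python
import hashlib
import os

def config_hash(config_path: str, mount_paths: list) -> str:
    """Hash the cluster config plus the local files it references."""
    h = hashlib.sha256()
    for path in [config_path, *mount_paths]:
        path = os.path.expanduser(path)
        h.update(path.encode())
        if os.path.isfile(path):
            with open(path, 'rb') as f:
                h.update(f.read())
    return h.hexdigest()

# Under --fast: provision only when something relevant changed.
stored = None  # illustrative: would be read from cluster state
new = config_hash('cluster.yaml', ['~/.aws/credentials'])
if new != stored:
    print('config changed; re-provisioning')
```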

* [k8s] Add resource limits only if they exist (#4440)

Add limits only if they exist

* [robustness] cover some potential resource leakage cases (#4443)

* if a newly-created cluster is missing from the cloud, wait before deleting

Addresses #4431.

* confirm cluster actually terminates before deleting from the db

* avoid deleting cluster data outside the primary provision loop

* tweaks

* Apply suggestions from code review

Co-authored-by: Zhanghao Wu <[email protected]>

* use usage_intervals for new cluster detection

get_cluster_duration will include the total duration of the cluster since its
initial launch, while launched_at may be reset by sky launch on an existing
cluster. So this is a more accurate method to check.

* fix terminating/stopping state for Lambda and Paperspace

* Revert "use usage_intervals for new cluster detection"

This reverts commit aa6d2e9.

* check cloud.STATUS_VERSION before calling query_instances

* avoid try/catch when querying instances

* update comments

---------

Co-authored-by: Zhanghao Wu <[email protected]>
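
The "confirm before deleting" change above follows a simple pattern; a hedged sketch (function names are illustrative):

```python
import time

def wait_until_terminated(query_instances, timeout=120, interval=5) -> bool:
    """Poll until `query_instances()` (live instances) returns empty."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not query_instances():
            return True
        time.sleep(interval)
    return False

# Only drop the cluster record once termination is confirmed, so a
# transient empty response from the cloud cannot leak resources.
if wait_until_terminated(lambda: []):  # stub: pretend nothing is left
    print('safe to remove cluster from the DB')
```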

* smoke tests support storage mount only (#4446)

* smoke tests support storage mount only

* fix verify command

* rename to only_mount

* [Feature] support spot pod on RunPod (#4447)

* wip

* wip

* wip

* wip

* wip

* wip

* resolve comments

* wip

* wip

* wip

* wip

* wip

* wip

---------

Co-authored-by: hwei <[email protected]>

* use lazy import for runpod (#4451)

Fixes runpod import issues introduced in #4447.

* [k8s] Fix show-gpus when running with incluster auth (#4452)

* Add limits only if they exist

* Fix incluster auth handling

* Do not mutate azure dep list at runtime (#4457)

* add 1, 2, 4 size H100's to GCP (#4456)

* add 1, 2, 4 size H100's to GCP

* update

* Support buildkite CICD and restructure smoke tests (#4396)

* event based smoke test

* more event based smoke test

* more test cases

* more test cases with managed jobs

* bug fix

* bump up seconds

* merge master and resolve conflict

* more test case

* support test_managed_jobs_pipeline_failed_setup

* support test_managed_jobs_recovery_aws

* managed job status

* bug fix

* test managed job cancel

* test_managed_jobs_storage

* more test cases

* resolve pr comment

* private member function

* bug fix

* restructure

* fix import

* buildkite config

* fix stdout problem

* update pipeline test

* test again

* smoke test for buildkite

* remove unsupported clouds for now

* merge branch 'reliable_smoke_test_more'

* bug fix

* bug fix

* bug fix

* test pipeline pre merge

* build test

* test again

* trigger test

* bug fix

* generate pipeline

* robust generate pipeline

* refactor pipeline

* remove runpod

* hot fix to pass smoke test

* random order

* allow parameter

* bug fix

* bug fix

* exclude lambda cloud

* dynamic generate pipeline

* fix pre-commit

* format

* support SUPPRESS_SENSITIVE_LOG

* support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log

* support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log

* add backward_compatibility_tests to pipeline

* pip install uv for backward compatibility test

* import style

* generate all cloud

* resolve PR comment

* update comment

* naming fix

* grammar correction

* resolve PR comment

* fix import

* fix import

* support gcp on pre merge test

* no gcp test case for pre merge

* [k8s] Make node termination robust (#4469)

* Add limits only if they exist

* retry deletion

* lint

* lint

* comments

* lint

* [Catalog] Bump catalog schema version (#4470)

* Bump catalog schema version

* trigger CI

* [core] skip provider.availability_zone in the cluster config hash (#4463)

skip provider.availability_zone in the cluster config hash

* remove sky jobs launch --fast (#4467)

* remove sky jobs launch --fast

The --fast behavior is now always enabled. This was unsafe before but since
#4289 it should be safe.

We will remove the flag before 0.8.0 so that it never touches a stable version.

sky launch still has the --fast flag. This flag is unsafe because it could cause
setup to be skipped even though it should be re-run. In the managed jobs case,
this is not an issue because we fully control the setup and know it will not
change.

* fix lint

* [docs] Change urls to docs.skypilot.co, add 404 page (#4413)

* Add 404 page, change to docs.skypilot.co

* lint

* [UX] Fix unnecessary OCI logging (#4476)

Sync PR: fix-oci-logging-master

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* [Example] PyTorch distributed training with minGPT (#4464)

* Add example for distributed pytorch

* update

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update examples/distributed-pytorch/README.md

Co-authored-by: Romil Bhardwaj <[email protected]>

* Fix

---------

Co-authored-by: Romil Bhardwaj <[email protected]>

* Add tests for Azure spot instance (#4475)

* verify azure spot instance

* string style

* echo

* echo vm detail

* bug fix

* remove comment

* rename pre-merge test to quicktest-core (#4486)

* rename to test core

* rename file

* [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu (#4337)

* [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu

Signed-off-by: nkwangleiGIT <[email protected]>

* fix format issue

Signed-off-by: nkwangleiGIT <[email protected]>

---------

Signed-off-by: nkwangleiGIT <[email protected]>

* [k8s] Fix IPv6 ssh support (#4497)

* Add limits only if they exist

* Fix ipv6 support

* Fix ipv6 support

* [Serve] Add and adopt least load policy as default policy. (#4439)

* [Serve] Add and adopt least load policy as default policy.

* Docs & smoke tests

* error message for different lb policy

* add minimal example

* fix
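
As a sketch of what "least load" means here (illustrative, not SkyServe's actual implementation): route each request to the replica with the fewest in-flight requests.

```python
from collections import defaultdict

in_flight = defaultdict(int)  # replica -> active request count

def select_replica(replicas):
    return min(replicas, key=lambda r: in_flight[r])

replicas = ['replica-1', 'replica-2']
chosen = select_replica(replicas)
in_flight[chosen] += 1  # decrement again when the response completes
print(chosen)
```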

* [Docs] Update logo in docs (#4500)

* WIP updating Elisa logo; issues with light/dark modes

* Fix SVG in navbar rendering by hardcoding SVG + defining text color in css

* Update readme images

* newline

---------

Co-authored-by: Zongheng Yang <[email protected]>

* Replace `len()` Zero Checks with Pythonic Empty Sequence Checks (#4298)

* style: mainly replace len() comparisons with 0/1 with pythonic empty sequence checks

* chore: more typings

* use `df.empty` for dataframe

* fix: more `df.empty`

* format

* revert partially

* style: add back comments

* style: format

* refactor: `dict[str, str]`

Co-authored-by: Tian Xia <[email protected]>

---------

Co-authored-by: Tian Xia <[email protected]>
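
The style change in a nutshell (pandas import assumed for the DataFrame case):

```python
import pandas as pd

items = []
# Before: if len(items) == 0: ...
# After: empty sequences are falsy (PEP 8 idiom).
if not items:
    print('no items')

df = pd.DataFrame()
# DataFrames deliberately raise on bool(df), so use the explicit
# .empty attribute rather than len(df) == 0.
if df.empty:
    print('empty dataframe')
```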

* [Docs] Fix logo file path (#4504)

* Add limits only if they exist

* rename

* [Storage] Show logs for storage mount (#4387)

* commit for logging change

* logger for storage

* grammar

* fix format

* better comment

* resolve copilot review

* resolve PR comment

* remove unused var

* Update sky/data/data_utils.py

Co-authored-by: Romil Bhardwaj <[email protected]>

* resolve PR comment

* update comment for get_run_timestamp

* rename backend_util.get_run_timestamp to sky_logging.get_run_timestamp

---------

Co-authored-by: Romil Bhardwaj <[email protected]>

* [Examples] Update Ollama setup commands (#4510)

wip

* [OCI] Support OCI Object Storage  (#4501)

* OCI Object Storage Support

* example yaml update

* example update

* add more example yaml

* Support RClone-RPM pkg

* Add smoke test

* ver

* smoke test

* Resolve dependency conflict between oci-cli and runpod

* Use latest RClone version (v1.68.2)

* minor optimize

* Address review comments

* typo

* test

* sync code with repo

* Address review comments & more testing.

* address one more comment

* [Jobs] Allow specifying an intermediate bucket for file upload (#4257)

* debug

* support workdir_bucket_name config on yaml file

* change the match statement to if-else due to a mypy limitation

* pass mypy

* yapf format fix

* reformat

* remove debug line

* all dir to same bucket

* private member function

* fix mypy

* support sub dir config to separate to different directory

* rename and add smoke test

* bucketname

* support sub dir mount

* private member for _bucket_sub_path and smoke test fix

* support copy mount for sub dir

* support gcs, s3 delete folder

* doc

* r2 remove_objects_from_sub_path

* support azure remove directory and cos remove

* doc string for remove_objects_from_sub_path

* fix sky jobs subdir issue

* test case update

* rename to _bucket_sub_path

* change the config schema

* setter

* bug fix and test update

* delete bucket depends on user config or sky generated

* add test case

* smoke test bug fix

* robust smoke test

* fix comment

* bug fix

* set the storage manually

* better structure

* fix mypy

* Update docs/source/reference/config.rst

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update docs/source/reference/config.rst

Co-authored-by: Romil Bhardwaj <[email protected]>

* limit creation for bucket and delete sub dir only

* resolve comment

* Update docs/source/reference/config.rst

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update sky/utils/controller_utils.py

Co-authored-by: Romil Bhardwaj <[email protected]>

* resolve PR comment

* bug fix

* bug fix

* fix test case

* bug fix

* fix

* fix test case

* bug fix

* support is_sky_managed param in config

* pass param intermediate_bucket_is_sky_managed

* resolve PR comment

* Update sky/utils/controller_utils.py

Co-authored-by: Romil Bhardwaj <[email protected]>

* hide bucket creation log

* reset green color

* rename is_sky_managed to _is_sky_managed

* bug fix

* retrieve _is_sky_managed from stores

* propagate the log

---------

Co-authored-by: Romil Bhardwaj <[email protected]>

* [Core] Deprecate LocalDockerBackend (#4516)

Deprecate local docker backend

* [docs] Add newer examples for AI tutorial and distributed training (#4509)

* Update tutorial and distributed training examples.

* Add examples link

* add rdvz

* [k8s] Fix L40 detection for nvidia GFD labels (#4511)

Fix L40 detection

* [docs] Support OCI Object Storage (#4513)

* Support OCI Object Storage

* Add oci bucket for file_mount

* [Docs] Disable Kapa AI (#4518)

Disable kapa

* [DigitalOcean] droplet integration (#3832)

* init digital ocean droplet integration

* abbreviate cloud name

* switch to pydo

* adjust polling logic and mount block storage to instance

* filter by paginated

* lint

* sky launch, start, stop functional

* fix credential file mounts, autodown works now

* set gpu droplet image

* cleanup

* remove more tests

* atomically destroy instance and block storage simultaneously

* install docker

* disable spot test

* fix ip address bug for multinode

* lint

* patch ssh from job/serve controller

* switch to EA slugs

* do adaptor

* lint

* Update sky/clouds/do.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/clouds/do.py

Co-authored-by: Tian Xia <[email protected]>

* comment template

* comment patch

* add h100 test case

* comment on instance name length

* Update sky/clouds/do.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/clouds/service_catalog/do_catalog.py

Co-authored-by: Tian Xia <[email protected]>

* comment on max node char len

* comment on weird azure import

* comment acc price is included in instance price

* fix return type

* switch with do_utils

* remove broad except

* Update sky/provision/do/instance.py

Co-authored-by: Tian Xia <[email protected]>

* Update sky/provision/do/instance.py

Co-authored-by: Tian Xia <[email protected]>

* remove azure

* comment on non_terminated_only

* add open port debug message

* wrap start instance api

* use f-string

* wrap stop

* wrap instance down

* assert credentials and check against all contexts

* assert client is None

* remove pending instances during instance restart

* wrap rename

* rename ssh key var

* fix tags

* add tags for block device

* f strings for errors

* support image ids

* update do tests

* only store head instance id

* rename image slugs

* add digital ocean alias

* wait for docker to be available

* update requirements and tests

* increase docker timeout

* lint

* move tests

* lint

* patch test

* lint

* typo fix

* fix typo

* patch tests

* fix tests

* no_mark spot test

* handle 2cpu serve tests

* lint

* lint

* use logger.debug

* fix none cred path

* lint

* handle get_cred path

* pylint

* patch for DO test_optimizer_dryruns.py

* revert optimizer dryrun

---------

Co-authored-by: Tian Xia <[email protected]>
Co-authored-by: Ubuntu <[email protected]>

* [Docs] Refactor pod_config docs (#4427)

* refactor pod_config docs

* Update docs/source/reference/kubernetes/kubernetes-getting-started.rst

Co-authored-by: Zongheng Yang <[email protected]>

* Update docs/source/reference/kubernetes/kubernetes-getting-started.rst

Co-authored-by: Zongheng Yang <[email protected]>

---------

Co-authored-by: Zongheng Yang <[email protected]>

* [OCI] Set default image to ubuntu LTS 22.04 (#4517)

* set default gpu image to skypilot:gpu-ubuntu-2204

* add example

* remove comment line

* set cpu default image to 2204

* update change history

* [OCI] 1. Support specifying OS with custom image id. 2. Corner case fix (#4524)

* Support specifying OS type with custom image id.

* trim space

* nit

* comment

* Update intermediate bucket related doc (#4521)

* doc

* Update docs/source/examples/managed-jobs.rst

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update docs/source/examples/managed-jobs.rst

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update docs/source/examples/managed-jobs.rst

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update docs/source/examples/managed-jobs.rst

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update docs/source/examples/managed-jobs.rst

Co-authored-by: Romil Bhardwaj <[email protected]>

* Update docs/source/examples/managed-jobs.rst

Co-authored-by: Romil Bhardwaj <[email protected]>

* add tip

* minor changes

---------

Co-authored-by: Romil Bhardwaj <[email protected]>

* [aws] cache user identity by 'aws configure list' (#4507)

* [aws] cache user identity by 'aws configure list'

Signed-off-by: Aylei <[email protected]>

* refine get_user_identities docstring

Signed-off-by: Aylei <[email protected]>

* address review comments

Signed-off-by: Aylei <[email protected]>

---------

Signed-off-by: Aylei <[email protected]>
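
A hedged sketch of the caching idea (illustrative, not SkyPilot's actual code): `aws configure list` is fast and reflects the active credential configuration, so its output can serve as a cache key for the much slower STS call.

```python
import functools
import subprocess

@functools.lru_cache(maxsize=None)
def _identity_for(config_fingerprint: str) -> str:
    # Slow path: ask STS who we are; cached per credential fingerprint.
    out = subprocess.run(
        ['aws', 'sts', 'get-caller-identity', '--query', 'Arn',
         '--output', 'text'],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()

def get_user_identity() -> str:
    # Fast path: fingerprint the active credential configuration.
    fingerprint = subprocess.run(
        ['aws', 'configure', 'list'],
        capture_output=True, text=True, check=True).stdout
    return _identity_for(fingerprint)
```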

* [k8s] Add validation for pod_config #4206 (#4466)

* [k8s] Add validation for pod_config #4206

Check pod_config when running 'sky check k8s' by using the k8s API

* update: check pod_config when launch

check merged pod_config during launch using k8s api

* fix test

* ignore check failure when testing with dryrun

If there is no kube config in the env, ignore ValueError when launching
with dryrun. For now, we don't support checking the schema offline.

* use deserialize api to check pod_config schema

* test

* create another api_client with no kubeconfig

* test

* update error message

* update test

* test

* test

* Update sky/backends/backend_utils.py

---------

Co-authored-by: Romil Bhardwaj <[email protected]>
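
For reference, the deserialize-based check mentioned above can be sketched like this (a common recipe for the kubernetes Python client; SkyPilot's actual code may differ):

```python
import json
import kubernetes

class _FakeResponse:
    """ApiClient.deserialize expects an object with a `.data` attribute."""

    def __init__(self, obj):
        self.data = json.dumps(obj)

def validate_pod_config(pod_config: dict) -> None:
    # No kubeconfig is needed just to deserialize/typecheck the spec.
    api_client = kubernetes.client.ApiClient()
    try:
        api_client.deserialize(_FakeResponse(pod_config), 'V1Pod')
    except ValueError as e:
        raise ValueError(f'Invalid pod_config: {e}') from e
```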

* [core] fix wheel timestamp check (#4488)

Previously, we were only taking the max timestamp of all the subdirectories of
the given directory. So the timestamp could be incorrect if only a file changed,
and no directory changed. This fixes the issue by looking at all directories and
files given by os.walk().
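
A sketch of the fixed check: take the newest mtime over every directory and file that os.walk() yields, not just the subdirectories.

```python
import os

def latest_mtime(path: str) -> float:
    """Newest modification time of `path` and everything under it."""
    latest = os.path.getmtime(path)
    for root, dirs, files in os.walk(path):
        for name in dirs + files:
            latest = max(latest, os.path.getmtime(os.path.join(root, name)))
    return latest
```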

* [docs] Add image_id doc in task YAML for OCI (#4526)

* Add image_id doc for OCI

* nit

* Update docs/source/reference/yaml-spec.rst

Co-authored-by: Tian Xia <[email protected]>

---------

Co-authored-by: Tian Xia <[email protected]>

* [UX] warning before launching jobs/serve when using reauth-required credentials (#4479)

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* wip

* Update sky/backends/cloud_vm_ray_backend.py

Minor fix

* Update sky/clouds/aws.py

Co-authored-by: Romil Bhardwaj <[email protected]>

* wip

* minor changes

* wip

---------

Co-authored-by: hong <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>

* [GCP] Activate service account for storage and controller (#4529)

* Activate service account for storage

* disable logging if not using service account

* Activate for controller as well.

* revert controller activate

* Add comments

* format

* fix smoke

* [OCI] Support reuse existing VCN for SkyServe (#4530)

* Support reuse existing VCN for SkyServe

* fix

* remove unused import

* format

* [docs] OCI: advanced configuration & add vcn_ocid (#4531)

* Add vcn_ocid configuration

* Update config.rst

* fix merge issues WIP

* fix merging issues

* fix imports

* fix stores

---------

Signed-off-by: nkwangleiGIT <[email protected]>
Signed-off-by: Aylei <[email protected]>
Co-authored-by: Christopher Cooper <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: zpoint <[email protected]>
Co-authored-by: Hong <[email protected]>
Co-authored-by: hwei <[email protected]>
Co-authored-by: Yika <[email protected]>
Co-authored-by: Seth Kimmel <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Lei <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
Co-authored-by: Andy Lee <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Hysun He <[email protected]>
Co-authored-by: Andrew Aikawa <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Aylei <[email protected]>
Co-authored-by: Chester Li <[email protected]>
Co-authored-by: hong <[email protected]>
20 people authored Jan 7, 2025
1 parent 4d9c1d5 commit 3441512
Showing 179 changed files with 11,472 additions and 7,330 deletions.
252 changes: 252 additions & 0 deletions .buildkite/generate_pipeline.py
@@ -0,0 +1,252 @@
"""
This script generates a Buildkite pipeline from test files.
The script will generate two pipelines:
tests/smoke_tests
├── test_*.py -> release pipeline
├── test_quick_tests_core.py -> run quick tests on PR before merging
run `PYTHONPATH=$(pwd)/tests:$PYTHONPATH python .buildkite/generate_pipeline.py`
to generate the pipeline for testing. The CI will run this script as a pre-step,
and use the generated pipeline to run the tests.
1. release pipeline, which runs all smoke tests by default, generates all
smoke tests for all clouds.
2. pre-merge pipeline, which generates all smoke tests for all clouds,
author should specify which clouds to run by setting env in the step.
We only have credentials for aws/azure/gcp/kubernetes(CLOUD_QUEUE_MAP and
SERVE_CLOUD_QUEUE_MAP) now, smoke tests for those clouds are generated, other
clouds are not supported yet, smoke tests for those clouds are not generated.
"""

import ast
import os
import random
from typing import Any, Dict, List, Optional

from conftest import cloud_to_pytest_keyword
from conftest import default_clouds_to_run
import yaml

DEFAULT_CLOUDS_TO_RUN = default_clouds_to_run
PYTEST_TO_CLOUD_KEYWORD = {v: k for k, v in cloud_to_pytest_keyword.items()}

QUEUE_GENERIC_CLOUD = 'generic_cloud'
QUEUE_GENERIC_CLOUD_SERVE = 'generic_cloud_serve'
QUEUE_KUBERNETES = 'kubernetes'
QUEUE_KUBERNETES_SERVE = 'kubernetes_serve'
# Only aws, gcp, azure, and kubernetes are supported for now.
# Other clouds do not have credentials.
CLOUD_QUEUE_MAP = {
'aws': QUEUE_GENERIC_CLOUD,
'gcp': QUEUE_GENERIC_CLOUD,
'azure': QUEUE_GENERIC_CLOUD,
'kubernetes': QUEUE_KUBERNETES
}
# Serve tests runs long, and different test steps usually requires locks.
# Its highly likely to fail if multiple serve tests are running concurrently.
# So we use a different queue that runs only one concurrent test at a time.
SERVE_CLOUD_QUEUE_MAP = {
'aws': QUEUE_GENERIC_CLOUD_SERVE,
'gcp': QUEUE_GENERIC_CLOUD_SERVE,
'azure': QUEUE_GENERIC_CLOUD_SERVE,
'kubernetes': QUEUE_KUBERNETES_SERVE
}

GENERATED_FILE_HEAD = ('# This is an auto-generated Buildkite pipeline by '
'.buildkite/generate_pipeline.py, Please do not '
'edit directly.\n')


def _get_full_decorator_path(decorator: ast.AST) -> str:
"""Recursively get the full path of a decorator."""
if isinstance(decorator, ast.Attribute):
return f'{_get_full_decorator_path(decorator.value)}.{decorator.attr}'
elif isinstance(decorator, ast.Name):
return decorator.id
elif isinstance(decorator, ast.Call):
return _get_full_decorator_path(decorator.func)
raise ValueError(f'Unknown decorator type: {type(decorator)}')


def _extract_marked_tests(file_path: str) -> Dict[str, List[str]]:
"""Extract test functions and filter clouds using pytest.mark
from a Python test file.
We separate each test_function_{cloud} into different pipeline steps
to maximize the parallelism of the tests via the buildkite CI job queue.
This allows us to visualize the test results and rerun failures at the
granularity of each test_function_{cloud}.
If we make pytest --serve a job, it could contain dozens of test_functions
and run for hours. This makes it hard to visualize the test results and
rerun failures. Additionally, the parallelism would be controlled by pytest
instead of the buildkite job queue.
"""
with open(file_path, 'r', encoding='utf-8') as file:
tree = ast.parse(file.read(), filename=file_path)

for node in ast.walk(tree):
for child in ast.iter_child_nodes(node):
setattr(child, 'parent', node)

function_cloud_map = {}
for node in ast.walk(tree):
if isinstance(node, ast.FunctionDef) and node.name.startswith('test_'):
class_name = None
if hasattr(node, 'parent') and isinstance(node.parent,
ast.ClassDef):
class_name = node.parent.name

clouds_to_include = []
clouds_to_exclude = []
is_serve_test = False
for decorator in node.decorator_list:
if isinstance(decorator, ast.Call):
# We only need to consider the decorator with no arguments
# to extract clouds.
continue
full_path = _get_full_decorator_path(decorator)
if full_path.startswith('pytest.mark.'):
assert isinstance(decorator, ast.Attribute)
suffix = decorator.attr
if suffix.startswith('no_'):
clouds_to_exclude.append(suffix[3:])
else:
if suffix == 'serve':
is_serve_test = True
continue
if suffix not in PYTEST_TO_CLOUD_KEYWORD:
# This mark does not specify a cloud, so we skip it.
continue
clouds_to_include.append(
PYTEST_TO_CLOUD_KEYWORD[suffix])
clouds_to_include = (clouds_to_include if clouds_to_include else
DEFAULT_CLOUDS_TO_RUN)
clouds_to_include = [
cloud for cloud in clouds_to_include
if cloud not in clouds_to_exclude
]
cloud_queue_map = SERVE_CLOUD_QUEUE_MAP if is_serve_test else CLOUD_QUEUE_MAP
final_clouds_to_include = [
cloud for cloud in clouds_to_include if cloud in cloud_queue_map
]
if clouds_to_include and not final_clouds_to_include:
print(f'Warning: {file_path}:{node.name} '
f'is marked to run on {clouds_to_include}, '
f'but we do not have credentials for those clouds. '
f'Skipped.')
continue
if clouds_to_include != final_clouds_to_include:
excluded_clouds = set(clouds_to_include) - set(
final_clouds_to_include)
print(
f'Warning: {file_path}:{node.name} '
f'is marked to run on {clouds_to_include}, '
f'but we only have credentials for {final_clouds_to_include}. '
f'clouds {excluded_clouds} are skipped.')
function_name = (f'{class_name}::{node.name}'
if class_name else node.name)
function_cloud_map[function_name] = (final_clouds_to_include, [
cloud_queue_map[cloud] for cloud in final_clouds_to_include
])
return function_cloud_map


def _generate_pipeline(test_file: str) -> Dict[str, Any]:
"""Generate a Buildkite pipeline from test files."""
steps = []
function_cloud_map = _extract_marked_tests(test_file)
for test_function, clouds_and_queues in function_cloud_map.items():
for cloud, queue in zip(*clouds_and_queues):
step = {
'label': f'{test_function} on {cloud}',
'command': f'pytest {test_file}::{test_function} --{cloud}',
'agents': {
# Separate agent pool for each cloud.
# Since they require different amount of resources and
# concurrency control.
'queue': queue
},
'if': f'build.env("{cloud}") == "1"'
}
steps.append(step)
return {'steps': steps}


def _dump_pipeline_to_file(yaml_file_path: str,
pipelines: List[Dict[str, Any]],
extra_env: Optional[Dict[str, str]] = None):
default_env = {'LOG_TO_STDOUT': '1', 'PYTHONPATH': '${PYTHONPATH}:$(pwd)'}
if extra_env:
default_env.update(extra_env)
with open(yaml_file_path, 'w', encoding='utf-8') as file:
file.write(GENERATED_FILE_HEAD)
all_steps = []
for pipeline in pipelines:
all_steps.extend(pipeline['steps'])
# Shuffle the steps to avoid flakyness, consecutive runs of the same
# kind of test may fail for requiring locks on the same resources.
random.shuffle(all_steps)
final_pipeline = {'steps': all_steps, 'env': default_env}
yaml.dump(final_pipeline, file, default_flow_style=False)


def _convert_release(test_files: List[str]):
yaml_file_path = '.buildkite/pipeline_smoke_tests_release.yaml'
output_file_pipelines = []
for test_file in test_files:
print(f'Converting {test_file} to {yaml_file_path}')
pipeline = _generate_pipeline(test_file)
output_file_pipelines.append(pipeline)
print(f'Converted {test_file} to {yaml_file_path}\n\n')
# Enable all clouds by default for release pipeline.
_dump_pipeline_to_file(yaml_file_path,
output_file_pipelines,
extra_env={cloud: '1' for cloud in CLOUD_QUEUE_MAP})


def _convert_quick_tests_core(test_files: List[str]):
yaml_file_path = '.buildkite/pipeline_smoke_tests_quick_tests_core.yaml'
output_file_pipelines = []
for test_file in test_files:
print(f'Converting {test_file} to {yaml_file_path}')
# We want enable all clouds by default for each test function
# for pre-merge. And let the author controls which clouds
# to run by parameter.
pipeline = _generate_pipeline(test_file)
pipeline['steps'].append({
'label': 'Backward compatibility test',
'command': 'bash tests/backward_compatibility_tests.sh',
'agents': {
'queue': 'back_compat'
}
})
output_file_pipelines.append(pipeline)
print(f'Converted {test_file} to {yaml_file_path}\n\n')
_dump_pipeline_to_file(yaml_file_path,
output_file_pipelines,
extra_env={'SKYPILOT_SUPPRESS_SENSITIVE_LOG': '1'})


def main():
test_files = os.listdir('tests/smoke_tests')
release_files = []
quick_tests_core_files = []
for test_file in test_files:
if not test_file.startswith('test_'):
continue
test_file_path = os.path.join('tests/smoke_tests', test_file)
if "test_quick_tests_core" in test_file:
quick_tests_core_files.append(test_file_path)
else:
release_files.append(test_file_path)

_convert_release(release_files)
_convert_quick_tests_core(quick_tests_core_files)


if __name__ == '__main__':
main()
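
A minimal sketch of running the generator by hand, per the docstring above (assumes the repo root as the working directory; `.buildkite` is not an importable package name, so the script is loaded by path):

```python
import runpy
import sys

sys.path.insert(0, 'tests')  # so the script can `from conftest import ...`
runpy.run_path('.buildkite/generate_pipeline.py', run_name='__main__')
# Writes .buildkite/pipeline_smoke_tests_release.yaml and
# .buildkite/pipeline_smoke_tests_quick_tests_core.yaml.
```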
5 changes: 2 additions & 3 deletions .pre-commit-config.yaml
@@ -24,7 +24,6 @@ repos:
       args:
         - "--sg=build/**" # Matches "${ISORT_YAPF_EXCLUDES[@]}"
         - "--sg=sky/skylet/providers/ibm/**"
-      files: "^(sky|tests|examples|llm|docs)/.*" # Only match these directories
   # Second isort command
   - id: isort
     name: isort (IBM specific)
@@ -56,8 +55,8 @@ repos:
     hooks:
       - id: yapf
         name: yapf
-        exclude: (build/.*|sky/skylet/providers/ibm/.*) # Matches exclusions from the script
-        args: ['--recursive', '--parallel'] # Only necessary flags
+        exclude: (sky/skylet/providers/ibm/.*) # Matches exclusions from the script
+        args: ['--recursive', '--parallel', '--in-place'] # Only necessary flags
         additional_dependencies: [toml==0.10.2]

 - repo: https://github.com/pylint-dev/pylint
34 changes: 17 additions & 17 deletions README.md
@@ -6,7 +6,7 @@
 </p>

 <p align="center">
-  <a href="https://skypilot.readthedocs.io/en/latest/">
+  <a href="https://docs.skypilot.co/">
     <img alt="Documentation" src="https://readthedocs.org/projects/skypilot/badge/?version=latest">
   </a>

@@ -43,7 +43,7 @@
   <summary>Archived</summary>

 - [Jul 2024] [**Finetune**](./llm/llama-3_1-finetuning/) and [**serve**](./llm/llama-3_1/) **Llama 3.1** on your infra
-- [Apr 2024] Serve and finetune [**Llama 3**](https://skypilot.readthedocs.io/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
+- [Apr 2024] Serve and finetune [**Llama 3**](https://docs.skypilot.co/en/latest/gallery/llms/llama-3.html) on any cloud or Kubernetes: [**example**](./llm/llama-3/)
 - [Mar 2024] Serve and deploy [**Databricks DBRX**](https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm) on your infra: [**example**](./llm/dbrx/)
 - [Feb 2024] Speed up your LLM deployments with [**SGLang**](https://github.com/sgl-project/sglang) for 5x throughput on SkyServe: [**example**](./llm/sglang/)
 - [Dec 2023] Using [**LoRAX**](https://github.com/predibase/lorax) to serve 1000s of finetuned LLMs on a single instance in the cloud: [**example**](./llm/lorax/)
@@ -60,17 +60,17 @@
 SkyPilot is a framework for running AI and batch workloads on any infra, offering unified execution, high cost savings, and high GPU availability.

 SkyPilot **abstracts away infra burdens**:
-- Launch [dev clusters](https://skypilot.readthedocs.io/en/latest/examples/interactive-development.html), [jobs](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html), and [serving](https://skypilot.readthedocs.io/en/latest/serving/sky-serve.html) on any infra
+- Launch [dev clusters](https://docs.skypilot.co/en/latest/examples/interactive-development.html), [jobs](https://docs.skypilot.co/en/latest/examples/managed-jobs.html), and [serving](https://docs.skypilot.co/en/latest/serving/sky-serve.html) on any infra
 - Easy job management: queue, run, and auto-recover many jobs

 SkyPilot **supports multiple clusters, clouds, and hardware** ([the Sky](https://arxiv.org/abs/2205.07147)):
 - Bring your reserved GPUs, Kubernetes clusters, or 12+ clouds
-- [Flexible provisioning](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html) of GPUs, TPUs, CPUs, with auto-retry
+- [Flexible provisioning](https://docs.skypilot.co/en/latest/examples/auto-failover.html) of GPUs, TPUs, CPUs, with auto-retry

 SkyPilot **cuts your cloud costs & maximizes GPU availability**:
-* [Autostop](https://skypilot.readthedocs.io/en/latest/reference/auto-stop.html): automatic cleanup of idle resources
-* [Managed Spot](https://skypilot.readthedocs.io/en/latest/examples/managed-jobs.html): 3-6x cost savings using spot instances, with preemption auto-recovery
-* [Optimizer](https://skypilot.readthedocs.io/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest & most available infra
+* [Autostop](https://docs.skypilot.co/en/latest/reference/auto-stop.html): automatic cleanup of idle resources
+* [Managed Spot](https://docs.skypilot.co/en/latest/examples/managed-jobs.html): 3-6x cost savings using spot instances, with preemption auto-recovery
+* [Optimizer](https://docs.skypilot.co/en/latest/examples/auto-failover.html): 2x cost savings by auto-picking the cheapest & most available infra

 SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.

@@ -79,13 +79,13 @@ Install with pip:
 # Choose your clouds:
 pip install -U "skypilot[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp]"
 ```
-To get the latest features and fixes, use the nightly build or [install from source](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html):
+To get the latest features and fixes, use the nightly build or [install from source](https://docs.skypilot.co/en/latest/getting-started/installation.html):
 ```bash
 # Choose your clouds:
 pip install "skypilot-nightly[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp]"
 ```

-[Current supported infra](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html) (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):
+[Current supported infra](https://docs.skypilot.co/en/latest/getting-started/installation.html) (Kubernetes; AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Paperspace, Cloudflare, Samsung, IBM, VMware vSphere):
 <p align="center">
   <picture>
     <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/skypilot-org/skypilot/master/docs/source/images/cloud-logos-dark.png">
@@ -95,16 +95,16 @@ pip install "skypilot-nightly[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidst


 ## Getting Started
-You can find our documentation [here](https://skypilot.readthedocs.io/en/latest/).
-- [Installation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)
-- [Quickstart](https://skypilot.readthedocs.io/en/latest/getting-started/quickstart.html)
-- [CLI reference](https://skypilot.readthedocs.io/en/latest/reference/cli.html)
+You can find our documentation [here](https://docs.skypilot.co/).
+- [Installation](https://docs.skypilot.co/en/latest/getting-started/installation.html)
+- [Quickstart](https://docs.skypilot.co/en/latest/getting-started/quickstart.html)
+- [CLI reference](https://docs.skypilot.co/en/latest/reference/cli.html)

 ## SkyPilot in 1 Minute

 A SkyPilot task specifies: resource requirements, data to be synced, setup commands, and the task commands.

-Once written in this [**unified interface**](https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html) (YAML or Python API), the task can be launched on any available cloud. This avoids vendor lock-in, and allows easily moving jobs to a different provider.
+Once written in this [**unified interface**](https://docs.skypilot.co/en/latest/reference/yaml-spec.html) (YAML or Python API), the task can be launched on any available cloud. This avoids vendor lock-in, and allows easily moving jobs to a different provider.

 Paste the following into a file `my_task.yaml`:

@@ -135,7 +135,7 @@ Prepare the workdir by cloning:
 git clone https://github.com/pytorch/examples.git ~/torch_examples
 ```

-Launch with `sky launch` (note: [access to GPU instances](https://skypilot.readthedocs.io/en/latest/cloud-setup/quota.html) is needed for this example):
+Launch with `sky launch` (note: [access to GPU instances](https://docs.skypilot.co/en/latest/cloud-setup/quota.html) is needed for this example):
 ```bash
 sky launch my_task.yaml
 ```
@@ -152,10 +152,10 @@ SkyPilot then performs the heavy-lifting for you, including:
 </p>


-Refer to [Quickstart](https://skypilot.readthedocs.io/en/latest/getting-started/quickstart.html) to get started with SkyPilot.
+Refer to [Quickstart](https://docs.skypilot.co/en/latest/getting-started/quickstart.html) to get started with SkyPilot.

 ## More Information
-To learn more, see [Concept: Sky Computing](https://docs.skypilot.co/en/latest/sky-computing.html), [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/), and [SkyPilot blog](https://blog.skypilot.co/).
+To learn more, see [Concept: Sky Computing](https://docs.skypilot.co/en/latest/sky-computing.html), [SkyPilot docs](https://docs.skypilot.co/en/latest/), and [SkyPilot blog](https://blog.skypilot.co/).

 <!-- Keep this section in sync with index.rst in SkyPilot Docs -->
 Runnable examples:
1 change: 1 addition & 0 deletions docs/requirements-docs.txt
@@ -11,6 +11,7 @@ sphinx-autobuild==2021.3.14
 sphinx-autodoc-typehints==1.25.2
 sphinx-book-theme==1.1.0
 sphinx-togglebutton==0.3.2
+sphinx-notfound-page==1.0.4
 sphinxcontrib-applehelp==1.0.7
 sphinxcontrib-devhelp==1.0.5
 sphinxcontrib-googleanalytics==0.4