Commit
* [perf] use uv for venv creation and pip install (#4414) * Revert "remove `uv` from runtime setup due to azure installation issue (#4401)" This reverts commit 0b20d56. * on azure, use --prerelease=allow to install azure-cli * use uv venv --seed * fix backwards compatibility * really fix backwards compatibility * use uv to set up controller dependencies * fix python 3.8 * lint * add missing file * update comment * split out azure-cli dep * fix lint for dependencies * use runpy.run_path rather than modifying sys.path * fix cloud dependency installation commands * lint * Update sky/utils/controller_utils.py Co-authored-by: Zhanghao Wu <[email protected]> --------- Co-authored-by: Zhanghao Wu <[email protected]> * [Minor] README updates. (#4436) * [Minor] README touches. * update * update * make --fast robust against credential or wheel updates (#4289) * add config_dict['config_hash'] output to write_cluster_config * fix docstring for write_cluster_config This used to be true, but since #2943, 'ray' is the only provisioner. Add other keys that are now present instead. * when using --fast, check if config_hash matches, and if not, provision * mock hashing method in unit test This is needed since some files in the fake file mounts don't actually exist, like the wheel path. 
* check config hash within provision with lock held * address other PR review comments * rename to skip_if_no_cluster_updates Co-authored-by: Zhanghao Wu <[email protected]> * add assert details Co-authored-by: Zhanghao Wu <[email protected]> * address PR comments and update docstrings * fix test * update docstrings Co-authored-by: Zhanghao Wu <[email protected]> * address PR comments * fix lint and tests * Update sky/backends/cloud_vm_ray_backend.py Co-authored-by: Zhanghao Wu <[email protected]> * refactor skip_if_no_cluster_update var * clarify comment * format exception --------- Co-authored-by: Zhanghao Wu <[email protected]> * [k8s] Add resource limits only if they exist (#4440) Add limits only if they exist * [robustness] cover some potential resource leakage cases (#4443) * if a newly-created cluster is missing from the cloud, wait before deleting Addresses #4431. * confirm cluster actually terminates before deleting from the db * avoid deleting cluster data outside the primary provision loop * tweaks * Apply suggestions from code review Co-authored-by: Zhanghao Wu <[email protected]> * use usage_intervals for new cluster detection get_cluster_duration will include the total duration of the cluster since its initial launch, while launched_at may be reset by sky launch on an existing cluster. So this is a more accurate method to check. * fix terminating/stopping state for Lambda and Paperspace * Revert "use usage_intervals for new cluster detection" This reverts commit aa6d2e9. 
* check cloud.STATUS_VERSION before calling query_instances * avoid try/catch when querying instances * update comments --------- Co-authored-by: Zhanghao Wu <[email protected]> * smoke tests support storage mount only (#4446) * smoke tests support storage mount only * fix verify command * rename to only_mount * [Feature] support spot pod on RunPod (#4447) * wip * wip * wip * wip * wip * wip * resolve comments * wip * wip * wip * wip * wip * wip --------- Co-authored-by: hwei <[email protected]> * use lazy import for runpod (#4451) Fixes runpod import issues introduced in #4447. * [k8s] Fix show-gpus when running with incluster auth (#4452) * Add limits only if they exist * Fix incluster auth handling * Not mutate azure dep list at runtime (#4457) * add 1, 2, 4 size H100's to GCP (#4456) * add 1, 2, 4 size H100's to GCP * update * Support buildkite CICD and restructure smoke tests (#4396) * event based smoke test * more event based smoke test * more test cases * more test cases with managed jobs * bug fix * bump up seconds * merge master and resolve conflict * more test case * support test_managed_jobs_pipeline_failed_setup * support test_managed_jobs_recovery_aws * manged job status * bug fix * test managed job cancel * test_managed_jobs_storage * more test cases * resolve pr comment * private member function * bug fix * restructure * fix import * buildkite config * fix stdout problem * update pipeline test * test again * smoke test for buildkite * remove unsupport cloud for now * merge branch 'reliable_smoke_test_more' * bug fix * bug fix * bug fix * test pipeline pre merge * build test * test again * trigger test * bug fix * generate pipeline * robust generate pipeline * refactor pipeline * remove runpod * hot fix to pass smoke test * random order * allow parameter * bug fix * bug fix * exclude lambda cloud * dynamic generate pipeline * fix pre-commit * format * support SUPPRESS_SENSITIVE_LOG * support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log * 
support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log * add backward_compatibility_tests to pipeline * pip install uv for backward compatibility test * import style * generate all cloud * resolve PR comment * update comment * naming fix * grammar correction * resolve PR comment * fix import * fix import * support gcp on pre merge test * no gcp test case for pre merge * [k8s] Make node termination robust (#4469) * Add limits only if they exist * retry deletion * lint * lint * comments * lint * [Catalog] Bump catalog schema version (#4470) * Bump catalog schema version * trigger CI * [core] skip provider.availability_zone in the cluster config hash (#4463) skip provider.availability_zone in the cluster config hash * remove sky jobs launch --fast (#4467) * remove sky jobs launch --fast The --fast behavior is now always enabled. This was unsafe before but since \#4289 it should be safe. We will remove the flag before 0.8.0 so that it never touches a stable version. sky launch still has the --fast flag. This flag is unsafe because it could cause setup to be skipped even though it should be re-run. In the managed jobs case, this is not an issue because we fully control the setup and know it will not change. 
* fix lint * [docs] Change urls to docs.skypilot.co, add 404 page (#4413) * Add 404 page, change to docs.skypilot.co * lint * [UX] Fix unnecessary OCI logging (#4476) Sync PR: fix-oci-logging-master Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> * [Example] PyTorch distributed training with minGPT (#4464) * Add example for distributed pytorch * update * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <[email protected]> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <[email protected]> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <[email protected]> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <[email protected]> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <[email protected]> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <[email protected]> * Update examples/distributed-pytorch/README.md Co-authored-by: Romil Bhardwaj <[email protected]> * Fix --------- Co-authored-by: Romil Bhardwaj <[email protected]> * Add tests for Azure spot instance (#4475) * verify azure spot instance * string style * echo * echo vm detail * bug fix * remove comment * rename pre-merge test to quicktest-core (#4486) * rename to test core * rename file * [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu (#4337) * [k8s] support to use custom gpu resource name if it's not nvidia.com/gpu Signed-off-by: nkwangleiGIT <[email protected]> * fix format issue Signed-off-by: nkwangleiGIT <[email protected]> --------- Signed-off-by: nkwangleiGIT <[email protected]> * [k8s] Fix IPv6 ssh support (#4497) * Add limits only if they exist * Fix ipv6 support * Fix ipv6 support * [Serve] Add and adopt least load policy as default poicy. (#4439) * [Serve] Add and adopt least load policy as default poicy. 
* Docs & smoke tests * error message for different lb policy * add minimal example * fix * [Docs] Update logo in docs (#4500) * WIP updating Elisa logo; issues with light/dark modes * Fix SVG in navbar rendering by hardcoding SVG + defining text color in css * Update readme images * newline --------- Co-authored-by: Zongheng Yang <[email protected]> * Replace `len()` Zero Checks with Pythonic Empty Sequence Checks (#4298) * style: mainly replace len() comparisons with 0/1 with pythonic empty sequence checks * chore: more typings * use `df.empty` for dataframe * fix: more `df.empty` * format * revert partially * style: add back comments * style: format * refactor: `dict[str, str]` Co-authored-by: Tian Xia <[email protected]> --------- Co-authored-by: Tian Xia <[email protected]> * [Docs] Fix logo file path (#4504) * Add limits only if they exist * rename * [Storage] Show logs for storage mount (#4387) * commit for logging change * logger for storage * grammar * fix format * better comment * resolve copilot review * resolve PR comment * remove unuse var * Update sky/data/data_utils.py Co-authored-by: Romil Bhardwaj <[email protected]> * resolve PR comment * update comment for get_run_timestamp * rename backend_util.get_run_timestamp to sky_logging.get_run_timestamp --------- Co-authored-by: Romil Bhardwaj <[email protected]> * [Examples] Update Ollama setup commands (#4510) wip * [OCI] Support OCI Object Storage (#4501) * OCI Object Storage Support * example yaml update * example update * add more example yaml * Support RClone-RPM pkg * Add smoke test * ver * smoke test * Resolve dependancy conflict between oci-cli and runpod * Use latest RClone version (v1.68.2) * minor optimize * Address review comments * typo * test * sync code with repo * Address review comments & more testing. 
* address one more comment * [Jobs] Allowing to specify intermediate bucket for file upload (#4257) * debug * support workdir_bucket_name config on yaml file * change the match statement to if else due to mypy limit * pass mypy * yapf format fix * reformat * remove debug line * all dir to same bucket * private member function * fix mypy * support sub dir config to separate to different directory * rename and add smoke test * bucketname * support sub dir mount * private member for _bucket_sub_path and smoke test fix * support copy mount for sub dir * support gcs, s3 delete folder * doc * r2 remove_objects_from_sub_path * support azure remove directory and cos remove * doc string for remove_objects_from_sub_path * fix sky jobs subdir issue * test case update * rename to _bucket_sub_path * change the config schema * setter * bug fix and test update * delete bucket depends on user config or sky generated * add test case * smoke test bug fix * robust smoke test * fix comment * bug fix * set the storage manually * better structure * fix mypy * Update docs/source/reference/config.rst Co-authored-by: Romil Bhardwaj <[email protected]> * Update docs/source/reference/config.rst Co-authored-by: Romil Bhardwaj <[email protected]> * limit creation for bucket and delete sub dir only * resolve comment * Update docs/source/reference/config.rst Co-authored-by: Romil Bhardwaj <[email protected]> * Update sky/utils/controller_utils.py Co-authored-by: Romil Bhardwaj <[email protected]> * resolve PR comment * bug fix * bug fix * fix test case * bug fix * fix * fix test case * bug fix * support is_sky_managed param in config * pass param intermediate_bucket_is_sky_managed * resolve PR comment * Update sky/utils/controller_utils.py Co-authored-by: Romil Bhardwaj <[email protected]> * hide bucket creation log * reset green color * rename is_sky_managed to _is_sky_managed * bug fix * retrieve _is_sky_managed from stores * propogate the log --------- Co-authored-by: Romil Bhardwaj <[email 
protected]> * [Core] Deprecate LocalDockerBackend (#4516) Deprecate local docker backend * [docs] Add newer examples for AI tutorial and distributed training (#4509) * Update tutorial and distributed training examples. * Add examples link * add rdvz * [k8s] Fix L40 detection for nvidia GFD labels (#4511) Fix L40 detection * [docs] Support OCI Object Storage (#4513) * Support OCI Object Storage * Add oci bucket for file_mount * [Docs] Disable Kapa AI (#4518) Disable kapa * [DigitalOcean] droplet integration (#3832) * init digital ocean droplet integration * abbreviate cloud name * switch to pydo * adjust polling logic and mount block storage to instance * filter by paginated * lint * sky launch, start, stop functional * fix credential file mounts, autodown works now * set gpu droplet image * cleanup * remove more tests * atomically destroy instance and block storage simulatenously * install docker * disable spot test * fix ip address bug for multinode * lint * patch ssh from job/serve controller * switch to EA slugs * do adaptor * lint * Update sky/clouds/do.py Co-authored-by: Tian Xia <[email protected]> * Update sky/clouds/do.py Co-authored-by: Tian Xia <[email protected]> * comment template * comment patch * add h100 test case * comment on instance name length * Update sky/clouds/do.py Co-authored-by: Tian Xia <[email protected]> * Update sky/clouds/service_catalog/do_catalog.py Co-authored-by: Tian Xia <[email protected]> * comment on max node char len * comment on weird azure import * comment acc price is included in instance price * fix return type * switch with do_utils * remove broad except * Update sky/provision/do/instance.py Co-authored-by: Tian Xia <[email protected]> * Update sky/provision/do/instance.py Co-authored-by: Tian Xia <[email protected]> * remove azure * comment on non_terminated_only * add open port debug message * wrap start instance api * use f-string * wrap stop * wrap instance down * assert credentials and check against all contexts * 
assert client is None * remove pending instances during instance restart * wrap rename * rename ssh key var * fix tags * add tags for block device * f strings for errors * support image ids * update do tests * only store head instance id * rename image slugs * add digital ocean alias * wait for docker to be available * update requirements and tests * increase docker timeout * lint * move tests * lint * patch test * lint * typo fix * fix typo * patch tests * fix tests * no_mark spot test * handle 2cpu serve tests * lint * lint * use logger.debug * fix none cred path * lint * handle get_cred path * pylint * patch for DO test_optimizer_dryruns.py * revert optimizer dryrun --------- Co-authored-by: Tian Xia <[email protected]> Co-authored-by: Ubuntu <[email protected]> * [Docs] Refactor pod_config docs (#4427) * refactor pod_config docs * Update docs/source/reference/kubernetes/kubernetes-getting-started.rst Co-authored-by: Zongheng Yang <[email protected]> * Update docs/source/reference/kubernetes/kubernetes-getting-started.rst Co-authored-by: Zongheng Yang <[email protected]> --------- Co-authored-by: Zongheng Yang <[email protected]> * [OCI] Set default image to ubuntu LTS 22.04 (#4517) * set default gpu image to skypilot:gpu-ubuntu-2204 * add example * remove comment line * set cpu default image to 2204 * update change history * [OCI] 1. Support specify OS with custom image id. 2. Corner case fix (#4524) * Support specify os type with custom image id. 
* trim space * nit * comment * Update intermediate bucket related doc (#4521) * doc * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <[email protected]> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <[email protected]> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <[email protected]> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <[email protected]> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <[email protected]> * Update docs/source/examples/managed-jobs.rst Co-authored-by: Romil Bhardwaj <[email protected]> * add tip * minor changes --------- Co-authored-by: Romil Bhardwaj <[email protected]> * [aws] cache user identity by 'aws configure list' (#4507) * [aws] cache user identity by 'aws configure list' Signed-off-by: Aylei <[email protected]> * refine get_user_identities docstring Signed-off-by: Aylei <[email protected]> * address review comments Signed-off-by: Aylei <[email protected]> --------- Signed-off-by: Aylei <[email protected]> * [k8s] Add validation for pod_config #4206 (#4466) * [k8s] Add validation for pod_config #4206 Check pod_config when run 'sky check k8s' by using k8s api * update: check pod_config when launch check merged pod_config during launch using k8s api * fix test * ignore check failed when test with dryrun if there is no kube config in env, ignore ValueError when launch with dryrun. For now, we don't support check schema offline. * use deserialize api to check pod_config schema * test * create another api_client with no kubeconfig * test * update error message * update test * test * test * Update sky/backends/backend_utils.py --------- Co-authored-by: Romil Bhardwaj <[email protected]> * [core] fix wheel timestamp check (#4488) Previously, we were only taking the max timestamp of all the subdirectories of the given directory. 
So the timestamp could be incorrect if only a file changed, and no directory changed. This fixes the issue by looking at all directories and files given by os.walk(). * [docs] Add image_id doc in task YAML for OCI (#4526) * Add image_id doc for OCI * nit * Update docs/source/reference/yaml-spec.rst Co-authored-by: Tian Xia <[email protected]> --------- Co-authored-by: Tian Xia <[email protected]> * [UX] warning before launching jobs/serve when using a reauth required credentials (#4479) * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * wip * Update sky/backends/cloud_vm_ray_backend.py Minor fix * Update sky/clouds/aws.py Co-authored-by: Romil Bhardwaj <[email protected]> * wip * minor changes * wip --------- Co-authored-by: hong <[email protected]> Co-authored-by: Romil Bhardwaj <[email protected]> * [GCP] Activate service account for storage and controller (#4529) * Activate service account for storage * disable logging if not using service account * Activate for controller as well. 
* revert controller activate
* Add comments
* format
* fix smoke
* [OCI] Support reuse existing VCN for SkyServe (#4530)
* Support reuse existing VCN for SkyServe
* fix
* remove unused import
* format
* [docs] OCI: advanced configuration & add vcn_ocid (#4531)
* Add vcn_ocid configuration
* Update config.rst
* fix merge issues WIP
* fix merging issues
* fix imports
* fix stores
---------
Signed-off-by: nkwangleiGIT <[email protected]>
Signed-off-by: Aylei <[email protected]>
Co-authored-by: Christopher Cooper <[email protected]>
Co-authored-by: Zongheng Yang <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: zpoint <[email protected]>
Co-authored-by: Hong <[email protected]>
Co-authored-by: hwei <[email protected]>
Co-authored-by: Yika <[email protected]>
Co-authored-by: Seth Kimmel <[email protected]>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Lei <[email protected]>
Co-authored-by: Tian Xia <[email protected]>
Co-authored-by: Andy Lee <[email protected]>
Co-authored-by: Romil Bhardwaj <[email protected]>
Co-authored-by: Hysun He <[email protected]>
Co-authored-by: Andrew Aikawa <[email protected]>
Co-authored-by: Ubuntu <[email protected]>
Co-authored-by: Aylei <[email protected]>
Co-authored-by: Chester Li <[email protected]>
Co-authored-by: hong <[email protected]>
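Note on the wheel timestamp fix (#4488) listed above: the message says the old check took the max timestamp over subdirectories only, so a change to a file alone could be missed, and that the fix looks at all directories and files given by os.walk(). A minimal sketch of that idea follows. The function name `latest_mtime` is hypothetical and this is not the code from the PR, only an illustration under those stated assumptions.

```python
import os


def latest_mtime(root: str) -> float:
    """Return the newest modification time under root (hypothetical helper).

    Both directories and files are considered, so editing a single file
    is detected even when no directory timestamp changes.
    """
    newest = os.path.getmtime(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                newest = max(newest, os.path.getmtime(path))
            except OSError:
                # Entry disappeared or is a broken symlink; skip it.
                continue
    return newest
```

Editing a file in place updates that file's mtime but not its parent directory's, which is why a directory-only scan can report a stale timestamp.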