Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support buildkite CICD and restructure smoke tests #4396

Merged
merged 73 commits into from
Dec 13, 2024
Merged
Show file tree
Hide file tree
Changes from 68 commits
Commits
Show all changes
73 commits
Select commit Hold shift + click to select a range
468409c
event based smoke test
zpoint Nov 7, 2024
7191844
more event based smoke test
zpoint Nov 11, 2024
5cbebeb
more test cases
zpoint Nov 11, 2024
6f68409
more test cases with managed jobs
zpoint Nov 11, 2024
1f67691
bug fix
zpoint Nov 11, 2024
eacc23f
merge master to resolve conflict
zpoint Nov 12, 2024
be7964e
bump up seconds
zpoint Nov 13, 2024
9712505
merge master
zpoint Nov 15, 2024
c464005
merge master and resolve conflict
zpoint Nov 15, 2024
03485cb
Merge branch 'master' into dev/zeping/reliable_smoke_test
zpoint Nov 18, 2024
c054edf
more test case
zpoint Nov 18, 2024
8675df3
support test_managed_jobs_pipeline_failed_setup
zpoint Nov 18, 2024
7e7c055
support test_managed_jobs_recovery_aws
zpoint Nov 18, 2024
f631cd3
manged job status
zpoint Nov 18, 2024
d822c4b
bug fix
zpoint Nov 18, 2024
9d8194e
test managed job cancel
zpoint Nov 18, 2024
41dfbee
test_managed_jobs_storage
zpoint Nov 18, 2024
6a13540
more test cases
zpoint Nov 19, 2024
d83647f
resolve pr comment
zpoint Nov 19, 2024
573e83e
private member function
zpoint Nov 19, 2024
e32f1df
resolve conflict
zpoint Nov 20, 2024
1202d1a
bug fix
zpoint Nov 21, 2024
87d7f12
restructure
zpoint Nov 21, 2024
9abd4d4
fix import
zpoint Nov 21, 2024
e0a4c9f
buildkite config
zpoint Nov 21, 2024
58090a3
fix stdout problem
zpoint Nov 21, 2024
88b396f
update pipeline test
zpoint Nov 22, 2024
9405b44
test again
zpoint Nov 22, 2024
5a2409f
smoke test for buildkite
zpoint Nov 22, 2024
e11a7d1
remove unsupport cloud for now
zpoint Nov 22, 2024
ec13656
Merge branch 'master' into dev/zeping/restructure_smoke_tests
zpoint Nov 22, 2024
8a65150
merge branch 'reliable_smoke_test_more'
zpoint Nov 25, 2024
41bac9b
bug fix
zpoint Nov 25, 2024
fd46f09
bug fix
zpoint Nov 25, 2024
dc71b72
bug fix
zpoint Nov 25, 2024
e68430b
test pipeline pre merge
zpoint Nov 25, 2024
d2ab7ba
build test
zpoint Nov 25, 2024
d2a065e
test again
zpoint Nov 25, 2024
ab6a311
trigger test
zpoint Nov 25, 2024
2ada082
bug fix
zpoint Nov 26, 2024
a39c055
merge master
zpoint Nov 27, 2024
9e14168
generate pipeline
zpoint Nov 27, 2024
a2b0415
robust generate pipeline
zpoint Nov 27, 2024
1d577cf
Merge branch 'master' into dev/zeping/restructure_smoke_tests
zpoint Nov 27, 2024
e764192
refactor pipeline
zpoint Nov 28, 2024
a63a8c9
remove runpod
zpoint Nov 28, 2024
7f75f9f
hot fix to pass smoke test
zpoint Nov 29, 2024
64f9282
random order
zpoint Nov 29, 2024
543ced4
allow parameter
zpoint Nov 29, 2024
2cff4bd
bug fix
zpoint Nov 29, 2024
19fc691
bug fix
zpoint Nov 29, 2024
115af30
exclude lambda cloud
zpoint Nov 29, 2024
0c7bfd5
dynamic generate pipeline
zpoint Nov 30, 2024
b14a655
fix pre-commit
zpoint Nov 30, 2024
f4a1b36
format
zpoint Nov 30, 2024
60c9290
support SUPPRESS_SENSITIVE_LOG
zpoint Dec 2, 2024
b22afff
support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log
zpoint Dec 2, 2024
def4eb7
support env SKYPILOT_SUPPRESS_SENSITIVE_LOG to suppress debug log
zpoint Dec 2, 2024
bef1cf1
add backward_compatibility_tests to pipeline
zpoint Dec 4, 2024
60c5c97
merge master
zpoint Dec 4, 2024
cd64c4c
pip install uv for backward compatibility test
zpoint Dec 4, 2024
cd4d6e1
import style
zpoint Dec 5, 2024
7db0579
generate all cloud
zpoint Dec 5, 2024
4428c90
resolve PR comment
zpoint Dec 10, 2024
e6c7577
merge master
zpoint Dec 10, 2024
ce550e7
update comment
zpoint Dec 10, 2024
e389780
naming fix
zpoint Dec 10, 2024
74b2d6e
grammar correction
zpoint Dec 10, 2024
595c043
resolve PR comment
zpoint Dec 11, 2024
f29637f
fix import
zpoint Dec 11, 2024
010f4af
fix import
zpoint Dec 11, 2024
8d5d023
support gcp on pre merge test
zpoint Dec 12, 2024
0bd7d04
no gcp test case for pre merge
zpoint Dec 12, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
254 changes: 254 additions & 0 deletions .buildkite/generate_pipeline.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,254 @@
"""
This script generates a Buildkite pipeline from test files.

The script will generate two pipelines:

tests/smoke_tests
├── test_*.py -> release pipeline
├── test_pre_merge.py -> pre-merge pipeline

run `python .buildkite/generate_pipeline.py` to generate the pipeline for
testing. The CI will run this script as a pre-step, and use the generated
pipeline to run the tests.

1. release pipeline, which runs all smoke tests by default, generates all
smoke tests for all clouds.
2. pre-merge pipeline, which generates all smoke tests for all clouds,
author should specify which clouds to run by setting env in the step.

We only have credentials for aws/azure/gcp/kubernetes(CLOUD_QUEUE_MAP and
SERVE_CLOUD_QUEUE_MAP) now, smoke tests for those clouds are generated, other
clouds are not supported yet, smoke tests for those clouds are not generated.
"""

import ast
import os
import random
from typing import Any, Dict, List, Optional

import yaml

DEFAULT_CLOUDS_TO_RUN = ['aws', 'azure']

ALL_CLOUDS_IN_SMOKE_TESTS = [
zpoint marked this conversation as resolved.
Show resolved Hide resolved
'aws', 'gcp', 'azure', 'lambda', 'cloudflare', 'ibm', 'scp', 'oci',
'kubernetes', 'vsphere', 'cudo', 'fluidstack', 'paperspace', 'runpod',
'lambda_cloud'
]
QUEUE_GENERIC_CLOUD = 'generic_cloud'
QUEUE_GENERIC_CLOUD_SERVE = 'generic_cloud_serve'
QUEUE_KUBERNETES = 'kubernetes'
QUEUE_KUBERNETES_SERVE = 'kubernetes_serve'
# Only aws, gcp, azure, and kubernetes are supported for now.
# Other clouds do not have credentials.
CLOUD_QUEUE_MAP = {
'aws': QUEUE_GENERIC_CLOUD,
'gcp': QUEUE_GENERIC_CLOUD,
'azure': QUEUE_GENERIC_CLOUD,
'kubernetes': QUEUE_KUBERNETES
}
# Serve tests runs long, and different test steps usually requires locks.
# Its highly likely to fail if multiple serve tests are running concurrently.
# So we use a different queue that runs only one concurrent test at a time.
SERVE_CLOUD_QUEUE_MAP = {
'aws': QUEUE_GENERIC_CLOUD_SERVE,
'gcp': QUEUE_GENERIC_CLOUD_SERVE,
'azure': QUEUE_GENERIC_CLOUD_SERVE,
'kubernetes': QUEUE_KUBERNETES_SERVE
}

GENERATED_FILE_HEAD = ('# This is an auto-generated Buildkite pipeline by '
'.buildkite/generate_pipeline.py, Please do not '
'edit directly.\n')


def _get_full_decorator_path(decorator: ast.AST) -> str:
"""Recursively get the full path of a decorator."""
if isinstance(decorator, ast.Attribute):
return f'{_get_full_decorator_path(decorator.value)}.{decorator.attr}'
elif isinstance(decorator, ast.Name):
return decorator.id
elif isinstance(decorator, ast.Call):
return _get_full_decorator_path(decorator.func)
raise ValueError(f'Unknown decorator type: {type(decorator)}')


def _extract_marked_tests(file_path: str) -> Dict[str, List[str]]:
"""Extract test functions and filter clouds using pytest.mark
from a Python test file.

We separate each test_function_{cloud} into different pipeline steps
to maximize the parallelism of the tests via the buildkite CI job queue.
This allows us to visualize the test results and rerun failures at the
granularity of each test_function_{cloud}.

If we make pytest --serve a job, it could contain dozens of test_functions
and run for hours. This makes it hard to visualize the test results and
rerun failures. Additionally, the parallelism would be controlled by pytest
instead of the buildkite job queue.
"""
with open(file_path, 'r', encoding='utf-8') as file:
tree = ast.parse(file.read(), filename=file_path)

for node in ast.walk(tree):
for child in ast.iter_child_nodes(node):
setattr(child, 'parent', node)

function_cloud_map = {}
for node in ast.walk(tree):
if isinstance(node, ast.FunctionDef) and node.name.startswith('test_'):
class_name = None
if hasattr(node, 'parent') and isinstance(node.parent,
ast.ClassDef):
class_name = node.parent.name

clouds_to_include = []
clouds_to_exclude = []
is_serve_test = False
for decorator in node.decorator_list:
if isinstance(decorator, ast.Call):
# We only need to consider the decorator with no arguments
# to extract clouds.
continue
full_path = _get_full_decorator_path(decorator)
if full_path.startswith('pytest.mark.'):
assert isinstance(decorator, ast.Attribute)
suffix = decorator.attr
if suffix.startswith('no_'):
clouds_to_exclude.append(suffix[3:])
else:
if suffix == 'serve':
is_serve_test = True
continue
if suffix not in ALL_CLOUDS_IN_SMOKE_TESTS:
# This mark does not specify a cloud, so we skip it.
continue
clouds_to_include.append(suffix)
clouds_to_include = (clouds_to_include if clouds_to_include else
DEFAULT_CLOUDS_TO_RUN)
clouds_to_include = [
cloud for cloud in clouds_to_include
if cloud not in clouds_to_exclude
]
cloud_queue_map = SERVE_CLOUD_QUEUE_MAP if is_serve_test else CLOUD_QUEUE_MAP
final_clouds_to_include = [
cloud for cloud in clouds_to_include if cloud in cloud_queue_map
]
if clouds_to_include and not final_clouds_to_include:
cg505 marked this conversation as resolved.
Show resolved Hide resolved
print(f'Warning: {file_path}:{node.name} '
f'is marked to run on {clouds_to_include}, '
f'but we do not have credentials for those clouds. '
f'Skipped.')
cg505 marked this conversation as resolved.
Show resolved Hide resolved
cg505 marked this conversation as resolved.
Show resolved Hide resolved
continue
if clouds_to_include != final_clouds_to_include:
excluded_clouds = set(clouds_to_include) - set(
final_clouds_to_include)
print(
f'Warning: {file_path}:{node.name} '
f'is marked to run on {clouds_to_include}, '
f'but we only have credentials for {final_clouds_to_include}. '
f'clouds {excluded_clouds} are skipped.')
function_name = (f'{class_name}::{node.name}'
if class_name else node.name)
function_cloud_map[function_name] = (final_clouds_to_include, [
cloud_queue_map[cloud] for cloud in final_clouds_to_include
])
return function_cloud_map


def _generate_pipeline(test_file: str) -> Dict[str, Any]:
"""Generate a Buildkite pipeline from test files."""
steps = []
function_cloud_map = _extract_marked_tests(test_file)
for test_function, clouds_and_queues in function_cloud_map.items():
for cloud, queue in zip(*clouds_and_queues):
step = {
'label': f'{test_function} on {cloud}',
'command': f'pytest {test_file}::{test_function} --{cloud}',
'agents': {
# Separate agent pool for each cloud.
# Since they require different amount of resources and
# concurrency control.
'queue': queue
},
'if': f'build.env("{cloud}") == "1"'
}
steps.append(step)
return {'steps': steps}


def _dump_pipeline_to_file(yaml_file_path: str,
pipelines: List[Dict[str, Any]],
extra_env: Optional[Dict[str, str]] = None):
default_env = {'LOG_TO_STDOUT': '1', 'PYTHONPATH': '${PYTHONPATH}:$(pwd)'}
if extra_env:
default_env.update(extra_env)
with open(yaml_file_path, 'w', encoding='utf-8') as file:
file.write(GENERATED_FILE_HEAD)
all_steps = []
for pipeline in pipelines:
all_steps.extend(pipeline['steps'])
# Shuffle the steps to avoid flakyness, consecutive runs of the same
# kind of test may fail for requiring locks on the same resources.
random.shuffle(all_steps)
final_pipeline = {'steps': all_steps, 'env': default_env}
yaml.dump(final_pipeline, file, default_flow_style=False)


def _convert_release(test_files: List[str]):
yaml_file_path = '.buildkite/pipeline_smoke_tests_release.yaml'
output_file_pipelines = []
for test_file in test_files:
print(f'Converting {test_file} to {yaml_file_path}')
pipeline = _generate_pipeline(test_file)
output_file_pipelines.append(pipeline)
print(f'Converted {test_file} to {yaml_file_path}\n\n')
# Enable all clouds by default for release pipeline.
_dump_pipeline_to_file(yaml_file_path,
output_file_pipelines,
extra_env={cloud: '1' for cloud in CLOUD_QUEUE_MAP})


def _convert_pre_merge(test_files: List[str]):
yaml_file_path = '.buildkite/pipeline_smoke_tests_pre_merge.yaml'
output_file_pipelines = []
for test_file in test_files:
print(f'Converting {test_file} to {yaml_file_path}')
# We want enable all clouds by default for each test function
# for pre-merge. And let the author controls which clouds
# to run by parameter.
pipeline = _generate_pipeline(test_file)
pipeline['steps'].append({
'label': 'Backward compatibility test',
'command': 'bash tests/backward_compatibility_tests.sh',
'agents': {
'queue': 'back_compat'
},
'if': 'build.env("aws") == "1"'
})
output_file_pipelines.append(pipeline)
print(f'Converted {test_file} to {yaml_file_path}\n\n')
_dump_pipeline_to_file(yaml_file_path,
output_file_pipelines,
extra_env={'SKYPILOT_SUPPRESS_SENSITIVE_LOG': '1'})


def main():
test_files = os.listdir('tests/smoke_tests')
release_files = []
pre_merge_files = []
for test_file in test_files:
if not test_file.startswith('test_'):
continue
test_file_path = os.path.join('tests/smoke_tests', test_file)
if "test_pre_merge" in test_file:
pre_merge_files.append(test_file_path)
else:
release_files.append(test_file_path)

_convert_release(release_files)
_convert_pre_merge(pre_merge_files)


if __name__ == '__main__':
main()
5 changes: 2 additions & 3 deletions .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,6 @@ repos:
args:
- "--sg=build/**" # Matches "${ISORT_YAPF_EXCLUDES[@]}"
- "--sg=sky/skylet/providers/ibm/**"
Copy link
Collaborator Author

@zpoint zpoint Nov 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The format.sh script limits files to this range, but .github/workflows/format.yml does not. Remove the limitation in .pre-commit-config.yaml to ensure consistency with the final result (.github/workflows/format.yml).

files: "^(sky|tests|examples|llm|docs)/.*" # Only match these directories
# Second isort command
- id: isort
name: isort (IBM specific)
Expand Down Expand Up @@ -56,8 +55,8 @@ repos:
hooks:
- id: yapf
name: yapf
exclude: (build/.*|sky/skylet/providers/ibm/.*) # Matches exclusions from the script
args: ['--recursive', '--parallel'] # Only necessary flags
exclude: (sky/skylet/providers/ibm/.*) # Matches exclusions from the script
args: ['--recursive', '--parallel', '--in-place'] # Only necessary flags
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't add this flag, pre-commit does nothing for yapf. Now it should be fine.

(sky) ➜  skypilot git:(dev/zeping/restructure_smoke_tests) ✗ git commit -m "fix pre-commit"
trim trailing whitespace.................................................Passed
fix end of files.........................................................Passed
check yaml...............................................................Passed
check for added large files..............................................Passed
black (IBM specific).................................(no files to check)Skipped
isort (general)......................................(no files to check)Skipped
isort (IBM specific).................................(no files to check)Skipped
mypy.....................................................................Passed
yapf.....................................................................Failed
- hook id: yapf
- files were modified by this hook
pylint...............................................(no files to check)Skipped

additional_dependencies: [toml==0.10.2]

- repo: https://github.com/pylint-dev/pylint
Expand Down
18 changes: 18 additions & 0 deletions sky/sky_logging.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
not env_options.Options.MINIMIZE_LOGGING.get())
_FORMAT = '%(levelname).1s %(asctime)s %(filename)s:%(lineno)d] %(message)s'
_DATE_FORMAT = '%m-%d %H:%M:%S'
_SENSITIVE_LOGGER = ['sky.provisioner', 'sky.optimizer']


class NewLineFormatter(logging.Formatter):
Expand Down Expand Up @@ -75,6 +76,23 @@ def _setup_logger():
# Setting this will avoid the message
# being propagated to the parent logger.
_root_logger.propagate = False
if env_options.Options.SUPPRESS_SENSITIVE_LOG.get():
# If the sensitive log is enabled, we reinitialize a new handler
# and force set the level to INFO to suppress the debug logs
# for certain loggers.
for logger_name in _SENSITIVE_LOGGER:
logger = logging.getLogger(logger_name)
handler_to_logger = RichSafeStreamHandler(sys.stdout)
handler_to_logger.flush = sys.stdout.flush # type: ignore
logger.addHandler(handler_to_logger)
logger.setLevel(logging.INFO)
if _show_logging_prefix:
handler_to_logger.setFormatter(FORMATTER)
else:
handler_to_logger.setFormatter(NO_PREFIX_FORMATTER)
# Do not propagate to the parent logger to avoid parent
# logger printing the logs.
logger.propagate = False


def reload_logger():
Expand Down
1 change: 1 addition & 0 deletions sky/utils/env_options.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ class Options(enum.Enum):
SHOW_DEBUG_INFO = ('SKYPILOT_DEBUG', False)
DISABLE_LOGGING = ('SKYPILOT_DISABLE_USAGE_COLLECTION', False)
MINIMIZE_LOGGING = ('SKYPILOT_MINIMIZE_LOGGING', True)
SUPPRESS_SENSITIVE_LOG = ('SKYPILOT_SUPPRESS_SENSITIVE_LOG', False)
# Internal: this is used to skip the cloud user identity check, which is
# used to protect cluster operations in a multi-identity scenario.
# Currently, this is only used in the job and serve controller, as there
Expand Down
6 changes: 4 additions & 2 deletions tests/backward_compatibility_tests.sh
Original file line number Diff line number Diff line change
Expand Up @@ -35,15 +35,17 @@ rm -r ~/.sky/wheels || true
cd ../sky-master
git pull origin master
pip uninstall -y skypilot
pip install -e ".[all]"
pip install uv
uv pip install -e ".[all]"
cg505 marked this conversation as resolved.
Show resolved Hide resolved
cd -

conda env list | grep sky-back-compat-current || conda create -n sky-back-compat-current -y python=3.9
conda activate sky-back-compat-current
conda install -c conda-forge google-cloud-sdk -y
rm -r ~/.sky/wheels || true
pip uninstall -y skypilot
pip install -e ".[all]"
pip install uv
uv pip install -e ".[all]"


# exec + launch
Expand Down
2 changes: 2 additions & 0 deletions tests/smoke_tests/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
"""For smoke tests import."""
__all__ = ['smoke_tests_utils']
Loading
Loading