[jobs] revamp scheduling for managed jobs #4485

Open
wants to merge 9 commits into
base: master

cg505
Collaborator

@cg505 cg505 commented Dec 19, 2024

Detaches the job controller from the Ray worker and the Ray driver program, and uses our own scheduling and parallelism control mechanism, derived from the state tracked in the managed jobs SQLite database on the controller.

See the commands in sky/jobs/scheduler.py for more info.
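
For readers unfamiliar with the idea of detaching a controller from its parent, here is a minimal sketch of starting a process in its own session with the standard library. It is not the PR's `subprocess_utils.launch_new_process_tree`, whose implementation is not shown in this thread; `launch_detached` and its parameters are hypothetical names used only for illustration, and `start_new_session` is POSIX-only.

```python
import subprocess
from typing import List


def launch_detached(cmd: List[str], log_path: str) -> subprocess.Popen:
    """Start cmd in a new session, appending its output to log_path.

    Hypothetical helper for illustration; not SkyPilot's API.
    """
    with open(log_path, 'ab') as log_file:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            # POSIX only: the child calls setsid(), so it no longer shares
            # the parent's session or process group.
            start_new_session=True,
        )
    # The child keeps its own copy of the log file descriptor, so closing
    # the parent's copy here is safe.
    return proc
```

The caller can record `proc.pid` (for example in a job table) and later check on the process without holding a pipe open to it.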

Previously, the number of simultaneous jobs was limited to 4x the CPU count by our per-job Ray placement group request:

CONTROLLER_PROCESS_CPU_DEMAND = 0.25

After this PR, there are two parallelism limits:

  • 4 * cpu_count jobs can be launching at the same time.
  • memory / 350M jobs can be running at the same time.

Common and max instance sizes and their parallelism limits

| instance type | vCPUs | memory (GB) | old job parallelism | (new) launch parallelism | (new) run parallelism |
|---|---|---|---|---|---|
| m6i.large / Standard_D2s_v5 / n2-standard-2 | 2 | 8 | 8 launching/running at once | 8 launches at once | 22 running at once |
| r6i.large / Standard_E2s_v5 / n2-highmem-2 | 2 | 16 | 8 launching/running at once | 8 launches at once | 44 running at once |
| m6i.2xlarge / Standard_D8s_v2 / n2-standard-8 | 8 | 32 | 32 launching/running at once | 32 launches at once | 90 running at once |
| Standard_E96s_v5 | 96 | 672 | 384 launching/running at once | 384 launches at once | ~1930 running at once |
| n2-highmem-128 | 128 | 864 | 512 launching/running at once | 512 launches at once | ~2480 running at once |
| r6i.32xlarge | 128 | 1024 | 512 launching/running at once | 512 launches at once | ~2950 running at once |

Run parallelism varies slightly between clouds because instances listed with the same amount of memory do not have exactly the same number of bytes.
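
The arithmetic behind the two limits can be sketched as follows. This is only an illustration of the formulas stated in the description, not the code in the PR: the function and constant names are hypothetical, and reading "350M" as 350 MiB is an assumption.

```python
LAUNCHES_PER_CPU = 4              # from the description: 4 * cpu_count
JOB_MEMORY_BYTES = 350 * 1024**2  # assumption: "350M" read as 350 MiB


def launch_parallelism(cpu_count: int) -> int:
    """Number of jobs that may be launching at the same time."""
    return LAUNCHES_PER_CPU * cpu_count


def run_parallelism(total_memory_bytes: int) -> int:
    """Number of jobs that may be running at the same time."""
    return total_memory_bytes // JOB_MEMORY_BYTES
```

For a 2-vCPU instance, `launch_parallelism(2)` gives 8, matching the first two rows of the table. The run limit depends on the memory the instance actually reports, which is why the table values sit slightly below what the nominal memory size would give.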

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: `conda deactivate; bash -i tests/backward_compatibility_tests.

os.makedirs(logs_dir, exist_ok=True)
log_path = os.path.join(logs_dir, f'{managed_job_id}.log')

pid = subprocess_utils.launch_new_process_tree(
Collaborator Author

If the scheduler is killed before this line (e.g. when running as part of a controller job), we will get stuck: the job will be submitted but the controller will never start. TODO: figure out how to recover from this case.

Collaborator

We can have a skylet event to monitor the managed job table, like we do for normal unmanaged jobs.

Collaborator Author

We are already using the existing managed job skylet event for that. The problem is that if the scheduler dies right here, there is no way to know whether it is just about to start the process or has already died. We need a way to check whether the scheduler died, or maybe a timestamp for the WAITING -> LAUNCHING transition.
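
The timestamp idea mentioned above could look roughly like this. It is a sketch under stated assumptions, not the PR's schema: the `jobs` table, its columns (`schedule_state`, `launching_since`, `controller_pid`), and the helper names are all hypothetical, and the grace period is an arbitrary parameter.

```python
import sqlite3
import time
from typing import List, Optional


def make_db() -> sqlite3.Connection:
    """Create an in-memory table with a hypothetical schema."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE jobs ('
                 'job_id INTEGER PRIMARY KEY, '
                 'schedule_state TEXT NOT NULL, '
                 'launching_since REAL, '
                 'controller_pid INTEGER)')
    return conn


def mark_launching(conn: sqlite3.Connection, job_id: int,
                   now: Optional[float] = None) -> None:
    """Record the WAITING -> LAUNCHING transition together with its time."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "UPDATE jobs SET schedule_state = 'LAUNCHING', "
            "launching_since = ? "
            "WHERE job_id = ? AND schedule_state = 'WAITING'",
            (now, job_id))


def find_stuck_jobs(conn: sqlite3.Connection, grace_seconds: float,
                    now: Optional[float] = None) -> List[int]:
    """Jobs marked LAUNCHING for too long with no controller PID recorded."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT job_id FROM jobs "
        "WHERE schedule_state = 'LAUNCHING' "
        "AND controller_pid IS NULL "
        "AND launching_since < ? "
        "ORDER BY job_id",
        (now - grace_seconds,)).fetchall()
    return [row[0] for row in rows]
```

A periodic event could call `find_stuck_jobs` and either relaunch the controller or move the job back to WAITING. The grace period has to be longer than the normal gap between the state update and the process start, otherwise healthy jobs would be flagged.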

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @cg505 for making this significant change! This is awesome! I glanced at the code, and it mostly looks good. The main concern is the complexity and granularity we have for limiting the number of launches. Please see the comments below.

@cg505 cg505 marked this pull request as ready for review December 20, 2024 05:34
@cg505 cg505 requested a review from Michaelvll December 20, 2024 05:34
@cg505 cg505 changed the title revamp scheduling for managed jobs [jobs/ revamp scheduling for managed jobs Dec 20, 2024
@cg505 cg505 changed the title [jobs/ revamp scheduling for managed jobs [jobs] revamp scheduling for managed jobs Dec 20, 2024
@cg505
Collaborator Author

cg505 commented Dec 20, 2024

/quicktest-core

Collaborator

@Michaelvll Michaelvll left a comment

Thanks @cg505! This PR looks pretty good to me! We should do some thorough testing with managed jobs, especially for:

  1. scheduling speed for jobs
  2. special cases that may get the scheduling stuck
  3. many jobs
  4. cancellation of jobs
  5. scheduling jobs in parallel

@@ -191,6 +190,8 @@ def _run_one_task(self, task_id: int, task: 'sky.Task') -> bool:
f'Submitted managed job {self._job_id} (task: {task_id}, name: '
f'{task.name!r}); {constants.TASK_ID_ENV_VAR}: {task_id_env_var}')

scheduler.wait_until_launch_okay(self._job_id)
Collaborator

The new API looks much better than before. Maybe we can turn this into a context manager, so as to combine the wait and the finish.

Collaborator Author

self._strategy_executor.launch() may call scheduler.launch_finished and scheduler.wait_until_launch_okay in the recovery case, so I feel like the context wouldn't really be accurate.
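
For reference, the context-manager shape suggested above might look like the sketch below. It only illustrates the proposal and does not address the recovery case raised in the reply, where the launch itself may call the two functions again. `launch_slot` is a hypothetical name, the two scheduler functions are passed in as callables so the block stays self-contained, and the assumption that `launch_finished` takes a job ID is not confirmed by this thread.

```python
import contextlib
from typing import Callable, Iterator


@contextlib.contextmanager
def launch_slot(job_id: int,
                wait_until_launch_okay: Callable[[int], None],
                launch_finished: Callable[[int], None]) -> Iterator[None]:
    """Hold a launch slot for the duration of the with-block.

    Hypothetical wrapper: waits for permission to launch on entry and
    reports completion on exit, even if the body raises.
    """
    wait_until_launch_okay(job_id)
    try:
        yield
    finally:
        launch_finished(job_id)
```

Usage would be `with launch_slot(job_id, wait_fn, finished_fn): ...` around the launch call. If the body releases and reacquires the slot internally, as described for the recovery path, the with-block no longer describes when the slot is actually held, which is the accuracy concern raised in the reply.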

]
if show_all:
-columns += ['STARTED', 'CLUSTER', 'REGION', 'FAILURE']
+columns += ['STARTED', 'CLUSTER', 'REGION', 'FAILURE', 'SCHED. STATE']
Collaborator

nit: I would prefer not to have the SCHED. STATE column. Instead, we may want to do something similar to kubectl describe pod, which shows a detailed description of what the pod is working on within the same state. For example, we could rename the FAILURE column to DESCRIPTION.

Collaborator Author

Don't want to spend too much time on this but I'll take a look.

Collaborator

Yep, it doesn't need to be a large change. Just add the state as a description in the FAILURE column (which should now be renamed to DESCRIPTION).
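
One possible reading of this suggestion, as a sketch only: a single DESCRIPTION cell that shows the failure reason when there is one and otherwise falls back to the scheduler state. `describe_job`, its parameters, and the output wording are hypothetical and not taken from the PR.

```python
from typing import Optional


def describe_job(failure_reason: Optional[str],
                 schedule_state: Optional[str]) -> str:
    """Build the text for a combined DESCRIPTION column (hypothetical)."""
    if failure_reason:
        # A failure is the most useful thing to show, so it takes priority.
        return failure_reason
    if schedule_state:
        return f'Scheduler state: {schedule_state}'
    return '-'
```

This keeps the table at the same width as before while still surfacing the scheduler state for jobs that have not failed.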

@cg505 cg505 requested a review from Michaelvll January 7, 2025 21:32
@Michaelvll
Collaborator

/smoke-test managed_jobs

@zpoint
Collaborator

zpoint commented Jan 9, 2025

Need to merge this PR to get the smoke-test comment to work.
I have resolved the comment; could you help take a look again? @Michaelvll
