
[batch] Simplifying Batch Cancellation #7

Closed

Conversation

@jigold commented Aug 1, 2023

This RFC is in response to feedback on #5 asking that we explore cancellation in more detail in a separate RFC. Here is my proposed set of changes, although, as described in the document, we have a couple of different options, and what is ideal may change depending on how we want to think about job groups. I also put in a brief description of my longer-term vision for the Batch Driver. If there is interest in this vision, I can write a separate RFC for it.

@jigold force-pushed the batch-cancellation branch from 52c25dd to b0bc171 on August 1, 2023 20:32
@daniel-goldstein (Contributor) left a comment:

Thanks for writing this up; I understand our cancellation system better than ever before! Before I get too in the weeds about the proposed implementation and alternatives, I feel I need to fix a fundamental flaw in my understanding: why fair share and cancellation are so entangled. See my bigger comment in "How the Current System Works".

----------

Cancellation is a key operation in the Batch system. It must be as
fast as possible to avoid provisioning resources for jobs that users
Contributor comment:

"as fast as possible" is an impossible requirement. We should use language that allows us to discern whether or not we are meeting requirements, like: a cancellation event should be an O(1) database operation after which jobs in a cancelled batch will no longer be scheduled, etc.

Contributor comment:

You do this later on in the constraints section; it might be good to reflect those requirements up here.

Contributor comment:

I'm not entirely certain what the expectations should be, but I think instead of talking about time complexity or latency, we should be talking about wasted resources and/or USD.

Contributor comment:

That said, I think starting with talking about time-complexity is good. It's a lot easier to think about than the implications on resources and cost.

no longer want to run as well as not impeding other users (or batches)
from being able to have their jobs run. The current implementation of
cancellation does not allow for any flexibility in which jobs are
cancelled. This lack of fine-grained control over which jobs are
Contributor comment:

I don't think this RFC adds additional flexibility to cancellation, so let's leave discussion of flexibility out of this proposal.

costs. There are no current workarounds for this lack of
functionality. Therefore, we need to add more fine-grained
cancellation abilities to the Batch interface. In addition, we
independently need a simpler, easier-to-maintain implementation of
Contributor comment:

Simplification seems like the actual focus of this RFC. Fine to make a note of future additions to cancellation coming down the pipe, but can we make this the main point?

and slower feature development. This proposal reviews how cancellation
works in the current system and compares options for other
cancellation implementations before recommending a new simpler
implementation.
Contributor comment:

There's a lot of overlap here with the Motivation section. I think the main value of this paragraph is in the sentences describing what kinds of cancellation events exist, not in the implementation or its problems, as those are covered in the Motivation.

share algorithm is fast because we aggregate across a global table
``user_inst_coll_resources`` that has a limited number of rows
maintaining counts of the number of ready cores per instance
collection and user.
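To illustrate the shape of this computation, here is a toy sketch (not Batch's actual fair-share code, and with made-up numbers): because ``user_inst_coll_resources`` keeps one row per (user, instance collection), summarizing per-user demand is an aggregation over a handful of rows, independent of how many jobs exist.

```python
# Toy sketch only; the real algorithm also considers running cores, caps, etc.
from collections import defaultdict

# hypothetical rows: (user, inst_coll, ready_cores_mcpu)
user_inst_coll_resources = [
    ("alice", "standard", 16_000),
    ("alice", "highmem",   4_000),
    ("bob",   "standard",  8_000),
]

ready_cores_by_user: dict[str, int] = defaultdict(int)
for user, _inst_coll, ready_cores_mcpu in user_inst_coll_resources:
    ready_cores_by_user[user] += ready_cores_mcpu

# The per-user demand summary is cheap to compute from this small table.
print(dict(ready_cores_by_user))  # {'alice': 20000, 'bob': 8000}
```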
Contributor comment:

There's a lot of content in these paragraphs. It's all good stuff, but I think it'd be easier to digest if it were broken down into subsections for the scheduler, autoscaler, and canceler.

for that batch are calculated by aggregating over the
``batch_inst_coll_cancellable_resources`` table and then
subtracting those counts from the corresponding values in the
``user_inst_coll_resources`` table.
@daniel-goldstein (Contributor) commented Aug 1, 2023:

This has been a very enlightening read, thank you for putting the time and effort into this. I have been confused until now why fair share had anything to do with cancellation and this bullet clarifies that point entirely. Nevertheless, I am unconvinced that these two topics need to be related and do not yet understand the true motivation for performing this subtraction.

In my mind, a fair share algorithm need only concern itself with the resources currently in use, i.e. currently Running jobs. As the determiner of fair share, why would I care if a job that's running belongs to a cancelled batch? Is that user not still hogging those resources until the jobs themselves are cancelled?

Is the following scenario a correct understanding of the current system?

Two users, A and B, are perfectly sharing the cluster 50/50, each running one batch. User A submits a second, identical, batch and cancels their first batch. Immediately following cancellation, individual jobs have not yet been cancelled (and will be cancelled slowly) but the scheduler sees A as consuming 0 resources compared to B's resources that have not changed, so the scheduler will favor user A in scheduling jobs. This may result in user A using >50% of the cluster depending on the speed of the scheduler vs the canceler. As user B, I would feel pretty cheated in this scenario.

@daniel-goldstein (Contributor) commented Aug 1, 2023:

I think I misunderstood earlier: this does not wipe out Running resources, but wipes out some Ready resources? If that is the case, why does the scheduler care about how many Ready jobs a user has? I would assume that fair share should be determined based on the currently Running jobs, and that the scheduler just tries to select up to that number of ready jobs.

Contributor comment:

I don't think the scheduler does care. This is about the autoscaler. The ready jobs are a measure of the demand on the system. When a Ready job becomes Cancelled, it should be removed from our measure of demand.

Maybe the confusion is that both scheduling and autoscaling are done "fairly"? In particular, we limit the number of VM creations per unit time. We need to ensure that this scarce resource (VM creation operations) is shared fairly across all the users on our cluster.

Contributor comment:

There are three scarce resources:

  1. VM creation operations.
  2. Active VM cores.
  3. Unschedule-job operations.

The autoscaler uses scarce resource (1) to create more of (2). The scheduler uses scarce resource (2) to process jobs. Jobs create demand for (2) and, transitively, demand for (1). In both cases, we want to service the demand fairly.

The third resource is a bit independent of this system, though it does free up more of (2).
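To make "share a scarce, rate-limited resource fairly" concrete, here is a small sketch of a max-min style split of a per-loop VM-creation budget across users. This is only an illustration of the principle, not how Batch actually apportions its budget, and the numbers are hypothetical.

```python
# Illustrative only: one way to share a rate-limited budget across users.
def max_min_share(budget: int, demand: dict[str, int]) -> dict[str, int]:
    """Give every user an equal slice, capped at their demand; redistribute
    leftover capacity among users that still want more."""
    alloc = {u: 0 for u in demand}
    remaining = dict(demand)
    while budget > 0 and any(remaining.values()):
        hungry = [u for u, d in remaining.items() if d > 0]
        slice_ = max(budget // len(hungry), 1)
        for u in hungry:
            if budget == 0:
                break
            give = min(slice_, remaining[u], budget)
            alloc[u] += give
            remaining[u] -= give
            budget -= give
    return alloc

# Demand measured in "ops needed" (e.g. derived from ready cores per user).
print(max_min_share(10, {"alice": 30, "bob": 2, "carol": 5}))
# -> {'alice': 4, 'bob': 2, 'carol': 4}  (bob capped by demand; surplus shared)
```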

3. The data the fair share calculation uses must be updated
instantaneously after a cancellation event, or be updated within
some time window that balances cluster efficiency with code
complexity, or the fair share algorithm must be redesigned entirely.
Contributor comment:

Constraint 1 is concise and precise. For 2 and 3, I think they should specify which queries exactly and what "fast" and "instantaneously" and "some time window" mean in both contexts.

for "n_ready_cancellable_jobs" for batches that have been cancelled
when computing the total aggregated values while still accounting
for "n_ready_always_run_jobs" for cancelled batches. The fair share
proportions per user are cached for 5 seconds at a time in order to
Contributor comment:

The fair share proportion changes every scheduling loop, so this feels wrong to cache.

be able to have a longer running fair share computation using the
larger ``user_inst_coll_resources_by_batch`` table with more
records.
3. Change the fair share algorithm for the canceller to query the new
Contributor comment:

How come the canceller does a fair share calculation? What is it sharing?


-----------------------
Extension to Job Groups
-----------------------
Contributor comment:

I don't think discussion of job groups is necessary for this RFC. If this is about a simplification of cancellation, such a change can stand on its own. I am personally more concerned about introducing complexity due to conflating separate ideas than I am about closing the door to future improvements.

Contributor comment:

I think it's important to include this, since it's job groups that inspired this change. AFAIU, the change to cancellation is additionally motivated by the complexity of cancelling a job group.

@danking (Contributor) commented Aug 3, 2023

This RFC has a lot of threads of discussion. I think we should try to respond to all your concerns, @daniel-goldstein, but I'd first like to focus the discussion on what seems to be the high-level argument. @jigold, please chime in if I'm misrepresenting your argument in this summary.

Batch has at least three scarce resources:

  1. VM creation operations.
  2. VM resources (e.g. cores).
  3. Job cancellation operations.

The fundamental job of Batch is to execute jobs. Each job has different resource requirements that must be met.

Batch has three control systems that operate on these scarce resources:

  1. Autoscaler. This creates new VMs so that there are sufficient resources for the jobs. There's a practical limit on how many VMs we can create (due to GCE rate limits, driver CPU, etc.), and we also want to limit the number of unused cores at any time.
  2. Scheduler. This executes jobs on VMs.
  3. Canceller. This stops the execution of jobs on VMs.

All three of the systems need to fairly apportion their scarce resources to the users.

In order to fairly apportion these scarce resources, we need to know the demand for them disaggregated by user:

  1. user_inst_coll_resources.ready_cores_mcpu summarizes, by user, the cores demanded from each instance collection. The autoscaler can use this as a measure of demand for cores by user and by instance collection. It, of course, also needs to consider the currently active VMs and how soon it expects their cores to become free again.
  2. user_inst_coll_resources.running_cores_mcpu is similar to (1). This is important because if someone has lots of running cores, we want to deprioritize them until other folks get a chance to run.
  3. user_inst_coll_resources.n_cancelled_running_jobs summarizes, by user, the jobs which are waiting to be stopped/evicted. The canceller can use this as a measure of demand for cancellation by user and by instance collection.
  4. user_inst_coll_resources.n_cancelled_ready_jobs is similar to (3) but for jobs that never even made it to a VM. These are substantially less work to cancel, but we still treat SQL operations as a scarce resource, so we want to fairly share them. [1]

Great. When new jobs arrive, we account for them in user_inst_coll_resources so that the autoscaler, scheduler, and canceller can fairly process them. When a batch is newly cancelled, we want to rapidly stop spending money, but so far we have no O(1) way to calculate how many jobs were cancelled by cancelling that batch. The batch_inst_coll_cancellable_resources table is the solution to that. It tracks the information necessary to remove consideration of cancelled resources, in O(1) time, from user_inst_coll_resources.
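A minimal sketch of this bookkeeping, with plain dicts standing in for the SQL tables (illustrative only, not the actual trigger logic; the keys mirror the real table names but the data is hypothetical):

```python
# (user, inst_coll) -> aggregate demand counters
user_inst_coll_resources = {
    ("alice", "standard"): {"ready_cores_mcpu": 24_000, "n_ready_jobs": 6},
}

# (batch_id, inst_coll) -> how much of the above is still cancellable
batch_inst_coll_cancellable_resources = {
    (42, "standard"): {"ready_cores_mcpu": 16_000, "n_ready_jobs": 4},
}

def cancel_batch(batch_id: int, user: str) -> None:
    # O(1) in the number of jobs: only one row per instance collection is
    # touched, using the pre-aggregated cancellable counts.
    for (b, inst_coll), cancellable in list(batch_inst_coll_cancellable_resources.items()):
        if b != batch_id:
            continue
        agg = user_inst_coll_resources[(user, inst_coll)]
        for k, v in cancellable.items():
            agg[k] -= v
        # Once cancelled, nothing in this batch is "cancellable" any more.
        del batch_inst_coll_cancellable_resources[(b, inst_coll)]

cancel_batch(42, "alice")
print(user_inst_coll_resources)
# {('alice', 'standard'): {'ready_cores_mcpu': 8000, 'n_ready_jobs': 2}}
```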


OK, so, @jigold & @daniel-goldstein, do I understand correctly that the core frustration is how long and complex jobs_after_update is? If jobs_after_update were, like 15 simple lines, would we still feel that cancellation is unnecessarily complex?


[1] Aside: We seem to assume 300 is the maximum number of operations we can do per loop for both cancel-ready and cancel-running operations. That doesn't seem right. I'd expect the former to be faster, since we should already have an open connection to the DB and we're just sending it some CALL mark_job_complete commands. I'm pretty sure "300" came from an assumption that GCE limited us to 20 queries per second, or 1200 queries per minute. That isn't true anymore: the limit is now 1500 queries per minute per region, so we have substantially more operations available than we're using. We should probably only use ~half of our limit so that we have room for error and also room for human-operator intervention. This also raises concerns for me about "Autoscaler loop period in seconds" being configurable in the UI: we can't change that number without also changing the assumed GCE operation rate limit. We also ought to manage this scarce resource (GCE operations) fairly across all instance collections, which we don't currently do.
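For concreteness, the arithmetic in [1], with a 15-second loop period assumed only because it is what makes "300 ops/loop" consistent with the old 20 ops/sec figure (it is not a value taken from the code):

```python
# Back-of-the-envelope version of the rate-limit arithmetic in [1].
OLD_LIMIT_PER_MIN = 1200   # previously assumed GCE limit (20 ops/sec)
NEW_LIMIT_PER_MIN = 1500   # current per-region limit cited above
SAFETY_FRACTION = 0.5      # leave headroom for errors and operator intervention
LOOP_PERIOD_SECONDS = 15   # hypothetical loop period

def ops_budget_per_loop(limit_per_min: float) -> int:
    return int(limit_per_min * SAFETY_FRACTION * LOOP_PERIOD_SECONDS / 60)

print(ops_budget_per_loop(OLD_LIMIT_PER_MIN))  # 150: half of the old budget per 15s loop
print(ops_budget_per_loop(NEW_LIMIT_PER_MIN))  # 187: half of the new budget per 15s loop
```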

@jigold (Author) commented Aug 3, 2023

@danking Your understanding of how cancellation works is correct.

I think "simplifying" is a misnomer. I don't think what I have proposed is actually simpler than what we currently have -- in fact in some ways it's more complicated. The issue is that if we don't want to treat job groups as a nested hierarchy of jobs and capitalize on the fact that you know how to update the "cancellable_resources" table that stores the O(1) resource update information based on parent child relationships with the invariants that come with that relationship, then you have to do something like I proposed here to have a "fast enough" cancellation of an arbitrary set of jobs because we can't store the counts explicitly per job group (how do you subtract resources from the other job groups in the table when one job group is cancelled?). Therefore, I think the change in this proposal is only necessary if we decide we want job groups to be an arbitrary set of jobs compared to a nested job hierarchy. Otherwise, there is no net benefit in my opinion except making the trigger simpler to work with (but how often do we actually touch that trigger?).

If we decide we want a nested job hierarchy and no arbitrary job groups, then I think what I have already implemented in hail-is/hail#13086 is ideal, as it's the smallest conceptual lift from how the system currently works. It does box us into job groups being a nested hierarchy, though, which will be difficult to remove once it's there. Along those lines of boxing us into a narrower database representation, there was discussion on whether it is a good idea to treat a job group as a child of the root job group (which is a batch) and how to store the parent-child relationships. If we start separating job groups from batches, then we need to decide how much table duplication we want: aggregated_batch_resources vs. aggregated_job_group_resources, batch_attributes vs. job_group_attributes, batch_cancellable_resources vs. job_group_cancellable_resources, etc.

Lastly, I will point out that neither of these cancellation strategies is what I want in the long run. The current design of the driver is not flexible enough to do the kinds of scheduling optimizations I want to do (burn-rate limits, a more sophisticated autoscaler [the tetris idea]), has well-known performance bottlenecks, and is not easy to work with. Every time we have to modify or migrate incremental data structures in the database, it slows development down substantially. In an ideal world, we wouldn't need to keep track of the cancellable resources at all and would have a different way altogether of enforcing fair resource allocation in the autoscaler, scheduler, and canceller.

@daniel-goldstein (Contributor) commented:

> In order to fairly apportion these scarce resources, we need to know the demand for them disaggregated by user:
>
>   1. user_inst_coll_resources.ready_cores_mcpu summarizes, by user, the cores demanded from each instance collection. The autoscaler can use this as a measure of demand for cores by user and by instance collection. It, of course, also needs to consider the currently active VMs and how soon it expects their cores to become free again.
>   2. user_inst_coll_resources.running_cores_mcpu is similar to (1). This is important because if someone has lots of running cores, we want to deprioritize them until other folks get a chance to run.
>   3. user_inst_coll_resources.n_cancelled_running_jobs summarizes, by user, the jobs which are waiting to be stopped/evicted. The canceller can use this as a measure of demand for cancellation by user and by instance collection.
>   4. user_inst_coll_resources.n_cancelled_ready_jobs is similar to (3) but for jobs that never even made it to a VM. These are substantially less work to cancel, but we still treat SQL operations as a scarce resource, so we want to fairly share them. [1]
>
> Great. When new jobs arrive, we account for them in user_inst_coll_resources so that the autoscaler, scheduler, and canceller can fairly process them. When a batch is newly cancelled, we want to rapidly stop spending money, but so far we have no O(1) way to calculate how many jobs were cancelled by cancelling that batch. The batch_inst_coll_cancellable_resources table is the solution to that. It tracks the information necessary to remove consideration of cancelled resources, in O(1) time, from user_inst_coll_resources.

Wow, ok! I gotta say, this is the first time in 2.5 years that I have felt I actually understand what these tables are doing. Can we please put this into documentation?

I'll have to give this a re-read with a fresh perspective, but I am still not entirely convinced that we need to be as perfectly fair as this suggests. E.g. If we make 1 HTTP request per worker to cancel running jobs, that just kinda feels fine? I don't really get why cancelling Ready jobs needs to be fair because they won't incur cost for the user. And I don't get why the autoscaler cares about per-user demand as opposed to overall demand (per inst coll per region).

I recognize that these questions stray us further and further from the job groups goal. Maybe a fresh look with my newly-acquired understanding will ease some of the pain here.

Regarding Jackie's point:

> The issue is this: suppose we don't want to treat job groups as a nested hierarchy of jobs, and so can't capitalize on knowing how to update the "cancellable_resources" table (which stores the O(1) resource-update information) from parent-child relationships and the invariants that come with them. Then you have to do something like what I proposed here to get a "fast enough" cancellation of an arbitrary set of jobs, because we can't store the counts explicitly per job group (how do you subtract resources from the other job groups in the table when one job group is cancelled?).

Not sure I understand this point. Assuming job groups are disjoint, why would you need to subtract resources from other job groups?

@jigold (Author) commented Aug 3, 2023

> Assuming job groups are disjoint, why would you need to subtract resources from other job groups?

This is the question. Are we going to enforce that job groups are disjoint (either as a tree or with one job group per job), or are we going to allow a job to be in multiple groups? I don't think we can move forward on deciding what to do about cancellation until this decision is made.

@jigold (Author) commented Aug 3, 2023

> E.g. If we make 1 HTTP request per worker to cancel running jobs, that just kinda feels fine? I don't really get why cancelling Ready jobs needs to be fair because they won't incur cost for the user.

There's a wall-time cost to the user. The batch isn't marked as "complete" until all jobs are in a complete state (Ready is not one of those states). The QoB wait operation will suffer from this if one user starves another of cancellation operations.

@daniel-goldstein (Contributor) commented:

Can't the wait operation check whether the batch has been cancelled?

@danking (Contributor) commented Aug 3, 2023

This is really good, y'all; I feel like we're making conceptual progress. Let's grab some time after standup to chat further. I think I can answer this specific question, though:

> Not sure I understand this point. Assuming job groups are disjoint, why would you need to subtract resources from other job groups?

Even in the simplest possible situation, where batches contain zero or more disjoint job groups with no nesting, you need to adjust the amount of "cancellable" resources for a batch when one of its component job groups is cancelled.
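A minimal sketch of that adjustment, assuming disjoint, non-nested job groups and hypothetical counters (not the real tables):

```python
# Batch-level cancellable totals must shrink by the cancelled group's share so
# they stay correct for a later whole-batch cancellation.
batch_cancellable = {"n_ready_jobs": 10, "ready_cores_mcpu": 40_000}
group_cancellable = {
    "g1": {"n_ready_jobs": 6, "ready_cores_mcpu": 24_000},
    "g2": {"n_ready_jobs": 4, "ready_cores_mcpu": 16_000},
}

def cancel_job_group(group: str) -> None:
    for k, v in group_cancellable[group].items():
        batch_cancellable[k] -= v
    group_cancellable[group] = {k: 0 for k in group_cancellable[group]}

cancel_job_group("g1")
print(batch_cancellable)  # {'n_ready_jobs': 4, 'ready_cores_mcpu': 16000}
```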

@jigold (Author) commented Aug 3, 2023

> Can't the wait operation check whether the batch has been cancelled?

No. There could be always_run jobs left.

@danking (Contributor) commented Aug 21, 2023

Can we close this until we decide to revisit cancellation?

@jigold closed this on Aug 21, 2023