[batch] Simplifying Batch Cancellation #7
Conversation
Thanks for writing this up, I understand our cancellation system better than ever before! Before I get too in the weeds about the proposed implementation and alternatives, I feel I need to fix a fundamental flaw in my understanding: why are fair share and cancellation so entangled? See my bigger comment in "How the Current System Works".
Cancellation is a key operation in the Batch system. It must be as
fast as possible to avoid provisioning resources for jobs that users
"as fast as possible" is an impossible requirement. We should use language that allows us to discern whether or not we are meeting requirements, like: a cancellation event should be an O(1) database operation after which jobs in a cancelled batch will no longer be scheduled, etc.
You do this later on in the constraints section; maybe it would be good to reflect those requirements up here.
I'm not entirely certain what the expectations should be, but I think instead of talking about time complexity or latency, we should be talking about wasted resources and/or USD.
That said, I think starting with talking about time-complexity is good. It's a lot easier to think about than the implications on resources and cost.
no longer want to run as well as not impeding other users (or batches)
from being able to have their jobs run. The current implementation of
cancellation does not allow for any flexibility in which jobs are
cancelled. This lack of fine-grained control over which jobs are
I don't think this RFC adds additional flexibility to cancellation, so let's leave discussion of flexibility out of this proposal.
costs. There are no current work arounds for this lack of
functionality. Therefore, we need to add more fine-grained
cancellation abilities to the Batch interface. In addition, we
independently need a simpler, easier to maintain implementation of
Simplification seems like the actual focus of this RFC. Fine to make a note of future additions to cancellation coming down the pipe, but can we make this the main point?
and slower feature development. This proposal reviews how cancellation
works in the current system and compares options for other
cancellation implementations before recommending a new simpler
implementation.
There's a lot of overlap here with the Motivation section. I think the main value of this paragraph is in the sentences describing what kinds of cancellation events exist, not in the implementation or problems with it, as that's covered in the Motivation.
share algorithm is fast because we aggregate across a global table
``user_inst_coll_resources`` that has a limited number of rows
maintaining counts of the number of ready cores per instance
collection and user.
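For readers less familiar with this table, a rough sketch of the aggregation the excerpt describes, turning per-(user, instance collection) ready-core counts into per-user proportions; the row contents and the exact proportion rule are invented for illustration:

```python
# Rough sketch of the fair-share input described above: aggregate the small
# per-(user, inst_coll) counts table and turn ready cores into per-user
# fractions. Data and column names are illustrative, not the real schema.

from collections import defaultdict

# One row per (user, inst_coll), as in ``user_inst_coll_resources``.
rows = [
    {"user": "alice", "inst_coll": "standard", "ready_cores_mcpu": 64_000},
    {"user": "bob",   "inst_coll": "standard", "ready_cores_mcpu": 16_000},
    {"user": "bob",   "inst_coll": "highmem",  "ready_cores_mcpu": 16_000},
]

def fair_share_fractions(rows):
    ready_by_user = defaultdict(int)
    for row in rows:
        ready_by_user[row["user"]] += row["ready_cores_mcpu"]
    total = sum(ready_by_user.values())
    if total == 0:
        return {}
    return {user: cores / total for user, cores in ready_by_user.items()}

print(fair_share_fractions(rows))  # {'alice': 0.666..., 'bob': 0.333...}
```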
There's a lot of content in these paragraphs. It's all good stuff, but I think it'd be easier to digest if it were broken down into subsections for the scheduler, autoscaler, and canceler.
for that batch are calculated by aggregating over the
``batch_inst_coll_cancellable_resources`` table and then
subtracting those counts from the corresponding values in the
``user_inst_coll_resources`` table.
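For concreteness, roughly what that cancellation-time subtraction looks like; the column names and the exact statement are guesses for illustration (the real work happens in the driver and its trigger), only the two table names come from the excerpt above:

```python
# Illustrative sketch of the bookkeeping described above: on cancellation,
# the per-batch cancellable counts are aggregated and subtracted from the
# per-user counts that the fair-share query reads. Column names are invented.

CANCEL_BATCH_RESOURCE_UPDATE = """
UPDATE user_inst_coll_resources AS uicr
INNER JOIN (
    SELECT user, inst_coll,
           SUM(n_ready_cancellable_jobs) AS n_ready,
           SUM(ready_cancellable_cores_mcpu) AS ready_cores_mcpu
    FROM batch_inst_coll_cancellable_resources
    WHERE batch_id = %(batch_id)s
    GROUP BY user, inst_coll
) AS cancelled USING (user, inst_coll)
SET uicr.n_ready_jobs = uicr.n_ready_jobs - cancelled.n_ready,
    uicr.ready_cores_mcpu = uicr.ready_cores_mcpu - cancelled.ready_cores_mcpu;
"""
```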
This has been a very enlightening read, thank you for putting the time and effort into this. I have been confused until now why fair share had anything to do with cancellation and this bullet clarifies that point entirely. Nevertheless, I am unconvinced that these two topics need to be related and do not yet understand the true motivation for performing this subtraction.
In my mind, a fair share algorithm need only concern itself with the resources currently in use, i.e. currently Running jobs. As the determiner of fair share, why would I care if a job that's running belongs to a cancelled batch? Is that user not still hogging those resources until the jobs themselves are cancelled?
Is the following scenario a correct understanding of the current system?
Two users, A and B, are perfectly sharing the cluster 50/50, each running one batch. User A submits a second, identical, batch and cancels their first batch. Immediately following cancellation, individual jobs have not yet been cancelled (and will be cancelled slowly) but the scheduler sees A as consuming 0 resources compared to B's resources that have not changed, so the scheduler will favor user A in scheduling jobs. This may result in user A using >50% of the cluster depending on the speed of the scheduler vs the canceler. As user B, I would feel pretty cheated in this scenario.
I think I misunderstood earlier: this does not wipe out Running resources, but wipes out some Ready resources? If that is the case, why does the scheduler care about how many Ready jobs a user has? I would assume that fair share should be determined based on the currently Running jobs, and the scheduler just tries to select up to that number of ready jobs.
I don't think the scheduler does care. This is about the autoscaler. The ready jobs are a measure of the demand on the system. When a Ready job becomes Cancelled, it should be removed from our measure of demand.
Maybe the confusion is that both scheduling and autoscaling are done "fairly"? In particular, we limit the number of VM creations per unit time. We need to ensure that scarce resource (VM creation operations) is shared fairly across all the users on our cluster.
There are three scarce resources:
1. VM creation operations.
2. Active VM cores.
3. Unschedule-job operations.
The autoscaler uses scarce resource (1) to create more of (2). The scheduler uses scarce resource (2) to process jobs. Jobs create demand for (2) and, transitively, demand for (1). In both cases, we want to service the demand fairly.
The third resource is a bit independent of this system, though it does create more of (2).
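To make the "service the demand fairly" idea concrete, here is a rough sketch of how a per-loop budget of VM-creation operations (scarce resource 1) could be split across users in proportion to their ready-core demand. The function, budget, and numbers are all invented for illustration, not the actual autoscaler logic:

```python
# Sketch: split a small per-loop budget of VM creations across users in
# proportion to their outstanding demand (ready cores). Numbers invented.

def apportion_vm_creations(ready_cores_by_user, budget):
    total = sum(ready_cores_by_user.values())
    if total == 0:
        return {user: 0 for user in ready_cores_by_user}
    # Largest-remainder style split so the integer shares sum to the budget.
    exact = {u: budget * c / total for u, c in ready_cores_by_user.items()}
    shares = {u: int(x) for u, x in exact.items()}
    leftover = budget - sum(shares.values())
    for u in sorted(exact, key=lambda u: exact[u] - shares[u], reverse=True):
        if leftover == 0:
            break
        shares[u] += 1
        leftover -= 1
    return shares

# If a cancelled batch's Ready jobs were still counted here, its user would
# keep claiming VM creations for work that will never run.
print(apportion_vm_creations({"alice": 12_000, "bob": 4_000}, budget=8))
# -> {'alice': 6, 'bob': 2}
```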
3. The data the fair share calculation uses must be updated
instantaneously after a cancellation event or be updated within
some time window that balances cluster efficiency with code
complexity or the fair share algorithm must be redesigned entirely.
Constraint 1 is concise and precise. For 2 and 3, I think they should specify which queries exactly and what "fast" and "instantaneously" and "some time window" mean in both contexts.
for "n_ready_cancellable_jobs" for batches that have been cancelled | ||
when computing the total aggregated values while still accounting | ||
for "n_ready_always_run_jobs" for cancelled batches. The fair share | ||
proportions per user are cached for 5 seconds at a time in order to |
The fair share proportion changes every scheduling loop, so this feels wrong to cache.
be able to have a longer running fair share computation using the
larger ``user_inst_coll_resources_by_batch`` table with more
records.
3. Change the fair share algorithm for the canceller to query the new
How come the canceller does a fair share calculation? What is it sharing?
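Going back to the option quoted a few lines up (exclude cancelled batches' cancellable ready jobs at aggregation time while keeping their always-run ready jobs), here is a rough sketch of the aggregation change it seems to describe. The join on a `batches.cancelled` flag and the exact column set are guesses; only the table name ``user_inst_coll_resources_by_batch`` and the two counts come from the quoted text:

```python
# Sketch of the aggregation-time exclusion: skip cancelled batches'
# cancellable ready jobs but keep their always-run ready jobs.
# Schema details are invented for illustration.

READY_JOBS_BY_USER = """
SELECT r.user, r.inst_coll,
       SUM(IF(b.cancelled,
              r.n_ready_always_run_jobs,
              r.n_ready_always_run_jobs + r.n_ready_cancellable_jobs)) AS n_ready_jobs
FROM user_inst_coll_resources_by_batch r
INNER JOIN batches b ON r.batch_id = b.id
GROUP BY r.user, r.inst_coll;
"""
```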
-----------------------
Extension to Job Groups
-----------------------
I don't think discussion of job groups is necessary for this RFC. If this is about a simplification of cancellation, such a change can stand on its own. I am personally more concerned about introducing complexity due to conflating separate ideas than I am about closing the door to future improvements.
I think it's important to include this, since it's job groups that inspired this change. AFAIU, the change to cancellation is additionally motivated by the complexity of cancelling a job group.
This RFC has a lot of threads of discussion. I think we should try to respond to all your concerns @daniel-goldstein, but I'd like to focus the discussion a bit at first on what seems to be the high-level argument. @jigold please chime in if I'm misrepresenting your argument in this summary.

Batch has at least three scarce resources:
1. VM creation operations.
2. Active VM cores.
3. Unschedule-job operations.

The fundamental job of Batch is to execute jobs. Each job has different resource requirements that must be met. Batch has three control systems that operate on these scarce resources: the autoscaler, the scheduler, and the canceller.

All three of the systems need to fairly apportion their scarce resources to the users. In order to fairly apportion these scarce resources, we need to know the demand for them disaggregated by user. Great. When new jobs arrive, we account for them in ``user_inst_coll_resources`` and ``batch_inst_coll_cancellable_resources``.

OK, so, @jigold & @daniel-goldstein, do I understand correctly that the core frustration is how long and complex the current implementation of cancellation is? [1]

[1] Aside: We seem to assume 300 is the maximum number of operations we can do per loop for both cancel-ready as well as cancel-running operations. That doesn't seem right. I'd expect the former to be faster, since we should already have an open connection to the DB and we're just sending it some statements.
@danking Your understanding of how cancellation works is correct. I think "simplifying" is a misnomer. I don't think what I have proposed is actually simpler than what we currently have -- in fact in some ways it's more complicated.

The issue is this: if we don't want to treat job groups as a nested hierarchy of jobs, and capitalize on knowing how to update the "cancellable_resources" table (which stores the O(1) resource update information) based on parent-child relationships and the invariants that come with that relationship, then you have to do something like I proposed here to get a "fast enough" cancellation of an arbitrary set of jobs, because we can't store the counts explicitly per job group (how do you subtract resources from the other job groups in the table when one job group is cancelled?). Therefore, I think the change in this proposal is only necessary if we decide we want job groups to be an arbitrary set of jobs rather than a nested job hierarchy. Otherwise, there is no net benefit in my opinion except making the trigger simpler to work with (but how often do we actually touch that trigger?).

If we decide we want a nested job hierarchy and no arbitrary job groups, then I think what I have already implemented in hail-is/hail#13086 is ideal, as it's the smallest conceptual lift from what we currently have in terms of how the system works. It does box us into job groups being a nested hierarchy, though, which will be difficult to remove once it's there. Along those lines of boxing us into a narrower database representation, there was discussion on whether it was a good idea to treat a job group as a child of the root job group (which is a batch) and how to store the parent-child relationships. If we start separating out job group from batch, then we need to decide how much table duplication we want.

Lastly, I will point out that neither of these cancellation strategies is what I want in the long run. The current design of the driver is not flexible enough to do the kind of scheduling optimizations I want to do (burn rate limits, more sophisticated autoscaler [tetris idea]), has well-defined performance bottlenecks, and is not easy to work with. Every time we have to modify or migrate incremental data structures in the database, it slows development down substantially. In an ideal world, we wouldn't need to keep track of the cancellable resources at all and we would have a different way altogether of enforcing fair resource allocation in the autoscaler, scheduler, and canceller.
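To illustrate the distinction being drawn here, a rough sketch (invented data structures, not Batch code) of why a strict tree makes the cancellable-resource bookkeeping tractable: a job's cores are counted in exactly its ancestor chain, so cancelling a group means subtracting its subtotal from each ancestor:

```python
# Rough illustration of the tree argument: in a nested hierarchy, cancelling
# a group updates O(depth) stored counts by walking the known ancestor chain.

cancellable_cores = {"batch": 48, "group_a": 32, "group_a/child": 8, "group_b": 16}
parent = {"group_a/child": "group_a", "group_a": "batch", "group_b": "batch"}

def cancel_group(group):
    removed = cancellable_cores[group]
    cancellable_cores[group] = 0          # descendants would be zeroed too
    node = group
    while node in parent:                  # walk the known ancestor chain
        node = parent[node]
        cancellable_cores[node] -= removed

cancel_group("group_a/child")
print(cancellable_cores)
# {'batch': 40, 'group_a': 24, 'group_a/child': 0, 'group_b': 16}

# With arbitrary, possibly overlapping groups there is no such chain: a
# cancelled job may also be counted in any number of unrelated groups, so
# their stored counts cannot be fixed up without looking at the jobs.
```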
Wow, ok! I gotta say this is the first time in 2.5 years I have felt clarity about what these tables were doing. Can we please put this into documentation?

I'll have to give this a re-read with a fresh perspective, but I am still not entirely convinced that we need to be as perfectly fair as this suggests. E.g. if we make 1 HTTP request per worker to cancel running jobs, that just kinda feels fine? I don't really get why cancelling running jobs needs to be rationed this carefully. I recognize that these questions stray us further and further from the job groups goal. Maybe a fresh look with my newly-acquired understanding will ease some of the pain here.

Regarding Jackie's point ("how do you subtract resources from the other job groups in the table when one job group is cancelled?"): I'm not sure I understand this point. Assuming job groups are disjoint, why would you need to subtract resources from other job groups?
This is the question. Are we going to enforce that job groups are disjoint (either as a tree or as one job group per job), or are we going to allow a job to be in multiple groups? I don't think we can move forward on deciding what to do about cancellation until this decision is made.
There's a wall time cost to the user. The batch isn't marked as "complete" until all jobs are in a complete state (Ready is not one of these states). The QoB wait operation will suffer from this if one user starves another of cancellation operations.
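To put a rough number on that wall-time cost: all figures below are invented except the 300 operations-per-loop cap questioned in the aside above, and the loop period is a guess:

```python
# Back-of-the-envelope for the wall-time cost of slow cancellation.

ops_per_loop = 300       # assumed cap on unschedule-job operations per loop
loop_seconds = 1.0       # hypothetical loop period
running_jobs_to_cancel = 90_000

loops = running_jobs_to_cancel / ops_per_loop
print(f"~{loops:.0f} loops, ~{loops * loop_seconds / 60:.0f} minutes of wait")
# ~300 loops, ~5 minutes during which `wait` blocks and another user's
# cancellations can be crowded out if the operations aren't shared fairly.
```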
Can't
This is really good y'all, I feel like we're making conceptual progress. Let's grab some time after stand up to chat further. I think I can answer this specific question though:
Even in the simplest possible situation, where batches contain zero or more disjoint job groups with no nesting, you need to adjust the amount of "cancellable" resources for a batch when one of its component job groups is cancelled.
No. There could be always_run jobs left.
Can we close this until we decide to revisit cancellation?
This RFC is in response to some of the feedback on #5, which called for exploring cancellation in more detail in a separate RFC. Here is my proposed set of changes, although, as described in the document, we have a couple of different options and I think what is ideal may change depending on how we want to think about job groups. I also put in a brief description of my longer-term vision for the Batch Driver. If there is interest in this vision, I can write a separate RFC for that.