-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[batch] Job Groups #5
base: main
Are you sure you want to change the base?
Conversation
rfc/0000-batch-job-groups.rst
Outdated
|
||
|
||
Flow for Creating and Updating Batches | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think ~
is valid RST which prevents GitHub from rendering this for easy reading. Can you switch to -
?
@danking @daniel-goldstein I'd appreciate your thoughts on this RFC. Thanks in advance! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really a tour de force of Batch description! A lot of my comments concern the motivations, which seems good! We should make sure we are completely agreed on what the motivation is and what the MVP is. That'll help ground the discussion about tradeoffs because we can ask "is this effort worth a user being able to do specific thing X which we decided was a requirement".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for writing this up, Jackie. Echoing Dan, this is a goldmine of valuable information about the batch system and the directions that it could go in.
Regarding how this should proceed, I largely echo Dan's sentiment. I think a lot of the background section can be transplanted into documentation as it is background that will be useful not only to this RFC but many RFCs to follow. Regarding proposed implementation, I think we need to tease these many features apart into a sequence of changes we can consider individually. I think it would be nice to narrow down this particular RFC into one that introduces the notion of a job group as a way to cancel a subset of jobs in a Batch. To me, that was the most interesting and possibly hairy part of this proposal and it would be nice to flesh out those sections more on their own. I feel confident that we can make such a change without prohibiting the additional features you've outlined, which will themselves be easier because we'll have the groundwork laid for job groups.
Ok. I think the action items for the next week are to:
Does this sound good? |
AFAICT, the
I don't think this is necessary to resolve this RFC since Daniel and I are deep enough in the system and the others aren't reading the RFC. Splitting out into a dev-docs can happen later too. Let's focus all our time and energy on the discussion of cancellation. That appears to be the blocker for returning to this RFC. |
I re-read this RFC and had a discussion with Dan about how we got here and where we're heading. We agreed that the discussion RFC and discussion was incredibly valuable in cementing our understanding of the system as it stands today and what changes to cancellation could look like the in the models you have posed above. To summarize some of the points from our discussion and the discussion from Thursday meeting:
About going forward, while #7 has been incredibly informative, it seems that under the disjoint sets/tree models the added complexity to the canceller would be fairly small. So it would be good to cement the distilled information from #7 into dev documentation and then close it for the time-being. Another very useful takeaway from #7 was that the ultimate level of abstraction in that discussion was just right to express the complexities in the system. The truly salient points were not at the code or SQL level, and not at the user level, but at the level of the various kinds of resource tracking (status, cancellable jobs, billing) and the dependencies that different components in the system have on that state. To me, that kind of precise discussion at that level is new and not something we have had in dev docs or RFCs so far. It would be great if, for this RFC, we can zero in on those changes in the data model that demand changes such as hail-is/hail#12697. I would hope that such a description would be short and can allow us to finalize a written agreement on the scope of changes that have been swimming around in our heads for months now. |
bb3d54d
to
10b2605
Compare
This PR adds the job groups functionality as described in this [RFC](hail-is/hail-rfcs#5) to the Batch backend and `hailtop.batch_client`. This includes supporting nested job groups up to a maximum depth of 5. Note, that none of these changes are user-facing yet (hence no change log here). The PRs that came before this one: - #13475 - #13487 - #13810 (note that this database migration required a shutdown) Subsequent PRs will need to implement the following: - Querying job groups with the flexible query language (v2) - Implementing job groups in the Scala Client for QoB - Using job groups in QoB with `cancel_after_n_failures=1` for all new stages of worker jobs - UI functionality to page and sort through job groups - A new `hailtop.batch` interface for users to define and work with Job Groups A couple of nuances in the implementation came up that I also tried to articulate in the RFC: 1. A root job group with ID = 0 does not belong to an update ("update_id" IS NULL). This means that any checks that look for "committed" job groups need to do `(batch_updates.committed OR job_groups.job_group_id = %s)` where "%s" is the ROOT_JOB_GROUP_ID. 2. When job groups are cancelled, only the specific job group that was cancelled is inserted into `job_groups_cancelled`. This table does **NOT** contain all transitive job groups that were also cancelled indirectly. The reason for this is we cannot guarantee that a user wouldn't have millions of job groups and we can't insert millions of records inside a single SQL stored procedure. Now, any query on the driver / front_end must look up the tree and see if any parent has been cancelled. This code looks similar to the code below [1]. 3. There used to be `DELETE FROM` statements in `commit_batch_update` and `commit_batch` that cleaned up old records that were no longer used in `job_group_inst_coll_cancellable_resources` and `job_groups_inst_coll_staging`. This cleanup now occurs in a periodic loop on the driver. 4. The `job_group_inst_coll_cancellable_resources` and `job_groups_inst_coll_staging` tables have values which represent the sum of all child job groups. For example, if a job group has 1 job and it's child job group has 2 jobs, then the staging table would have n_jobs = 3 for the parent job group and n_jobs = 2 for the child job group. Likewise, all of the billing triggers and MJC have to use the `job_group_self_and_ancestors` table to modify the job group the job belongs to as well its parent job groups. [1] Code to check whether a job group has been cancelled. ```mysql SELECT job_groups.*, cancelled_t.cancelled IS NOT NULL AS cancelled FROM job_groups LEFT JOIN LATERAL ( SELECT 1 AS cancelled FROM job_group_self_and_ancestors INNER JOIN job_groups_cancelled ON job_group_self_and_ancestors.batch_id = job_groups_cancelled.id AND job_group_self_and_ancestors.ancestor_id = job_groups_cancelled.job_group_id WHERE job_groups.batch_id = job_group_self_and_ancestors.batch_id AND job_groups.job_group_id = job_group_self_and_ancestors.job_group_id ) AS cancelled_t ON TRUE WHERE ... ```
This PR includes the proposal for Job Groups in Batch.