[batch] Job Groups #5

jigold · 2023-07-20T18:25:32Z

This PR includes the proposal for Job Groups in Batch.

danking · 2023-07-20T18:34:58Z

rfc/0000-batch-job-groups.rst

+
+
+Flow for Creating and Updating Batches
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


I don't think ~ is valid RST which prevents GitHub from rendering this for easy reading. Can you switch to -?

jigold · 2023-07-26T15:46:46Z

@danking @daniel-goldstein I'd appreciate your thoughts on this RFC. Thanks in advance!

danking

This is really a tour de force of Batch description! A lot of my comments concern the motivations, which seems good! We should make sure we are completely agreed on what the motivation is and what the MVP is. That'll help ground the discussion about tradeoffs because we can ask "is this effort worth a user being able to do specific thing X which we decided was a requirement".

daniel-goldstein

Thank you for writing this up, Jackie. Echoing Dan, this is a goldmine of valuable information about the batch system and the directions that it could go in.

Regarding how this should proceed, I largely echo Dan's sentiment. I think a lot of the background section can be transplanted into documentation as it is background that will be useful not only to this RFC but many RFCs to follow. Regarding proposed implementation, I think we need to tease these many features apart into a sequence of changes we can consider individually. I think it would be nice to narrow down this particular RFC into one that introduces the notion of a job group as a way to cancel a subset of jobs in a Batch. To me, that was the most interesting and possibly hairy part of this proposal and it would be nice to flesh out those sections more on their own. I feel confident that we can make such a change without prohibiting the additional features you've outlined, which will themselves be easier because we'll have the groundwork laid for job groups.

jigold · 2023-07-28T20:49:11Z

Ok. I think the action items for the next week are to:

I split off most of the background into a separate developer's document and possibly flesh out the rest of the details of the system.
I will write a separate RFC that specifically addresses simplifying cancellation as currently implemented in the system so we can scrutinize that independently as that's the trickiest part of the proposal.
I will write a separate RFC that addresses the need for vectorizing the batches_n_completed_states table as well as adding the n_running state so I can eventually add my awesome new progress bar that tracks job states that I worked on earlier in the summer.
We as a team need to decide what the longer term vision for job groups is and agree upon the scope of work for right now before I rewrite the proposal. I think we are all in agreement we just want to be able to cancel a subset of jobs without major performance regressions as the initial MVP. I'm going to add that wait is also a requirement because we'll need that for writing the tests. If we don't have a fast, optimized wait at first, then we will need to add an endpoint that gets the current status of a job group by running a full query over the jobs table rather than storing the counts in aggregation tables.

Does this sound good?

danking · 2023-07-29T13:30:36Z

AFAICT, the batches_n_completed_states is not necessary for QoB or cancellation, so my inclination is to delay that until we've fully resolved the other threads herein.

possibly flesh out the rest of the details of the system.

I don't think this is necessary to resolve this RFC since Daniel and I are deep enough in the system and the others aren't reading the RFC. Splitting out into a dev-docs can happen later too.

Let's focus all our time and energy on the discussion of cancellation. That appears to be the blocker for returning to this RFC.

daniel-goldstein · 2023-08-07T22:23:11Z

I re-read this RFC and had a discussion with Dan about how we got here and where we're heading. We agreed that the discussion RFC and discussion was incredibly valuable in cementing our understanding of the system as it stands today and what changes to cancellation could look like the in the models you have posed above. To summarize some of the points from our discussion and the discussion from Thursday meeting:

We believe that partially-overlapping sets of jobs (arbitrary labelling) entails much more complexity and risk than reward.
Job groups as disjoint sets of jobs in a batch and job groups as a tree of those same disjoint sets are non-conflicting designs and are both substantially simpler than overlapping label sets. [batch, WIP] Job Groups hail#12697 appears to be the closest public implementation to date matching this simpler model.

About going forward, while #7 has been incredibly informative, it seems that under the disjoint sets/tree models the added complexity to the canceller would be fairly small. So it would be good to cement the distilled information from #7 into dev documentation and then close it for the time-being.

Another very useful takeaway from #7 was that the ultimate level of abstraction in that discussion was just right to express the complexities in the system. The truly salient points were not at the code or SQL level, and not at the user level, but at the level of the various kinds of resource tracking (status, cancellable jobs, billing) and the dependencies that different components in the system have on that state. To me, that kind of precise discussion at that level is new and not something we have had in dev docs or RFCs so far. It would be great if, for this RFC, we can zero in on those changes in the data model that demand changes such as hail-is/hail#12697. I would hope that such a description would be short and can allow us to finalize a written agreement on the scope of changes that have been swimming around in our heads for months now.

I don't see any remaining reviews.

This PR adds the job groups functionality as described in this [RFC](hail-is/hail-rfcs#5) to the Batch backend and `hailtop.batch_client`. This includes supporting nested job groups up to a maximum depth of 5. Note, that none of these changes are user-facing yet (hence no change log here). The PRs that came before this one: - #13475 - #13487 - #13810 (note that this database migration required a shutdown) Subsequent PRs will need to implement the following: - Querying job groups with the flexible query language (v2) - Implementing job groups in the Scala Client for QoB - Using job groups in QoB with `cancel_after_n_failures=1` for all new stages of worker jobs - UI functionality to page and sort through job groups - A new `hailtop.batch` interface for users to define and work with Job Groups A couple of nuances in the implementation came up that I also tried to articulate in the RFC: 1. A root job group with ID = 0 does not belong to an update ("update_id" IS NULL). This means that any checks that look for "committed" job groups need to do `(batch_updates.committed OR job_groups.job_group_id = %s)` where "%s" is the ROOT_JOB_GROUP_ID. 2. When job groups are cancelled, only the specific job group that was cancelled is inserted into `job_groups_cancelled`. This table does **NOT** contain all transitive job groups that were also cancelled indirectly. The reason for this is we cannot guarantee that a user wouldn't have millions of job groups and we can't insert millions of records inside a single SQL stored procedure. Now, any query on the driver / front_end must look up the tree and see if any parent has been cancelled. This code looks similar to the code below [1]. 3. There used to be `DELETE FROM` statements in `commit_batch_update` and `commit_batch` that cleaned up old records that were no longer used in `job_group_inst_coll_cancellable_resources` and `job_groups_inst_coll_staging`. This cleanup now occurs in a periodic loop on the driver. 4. The `job_group_inst_coll_cancellable_resources` and `job_groups_inst_coll_staging` tables have values which represent the sum of all child job groups. For example, if a job group has 1 job and it's child job group has 2 jobs, then the staging table would have n_jobs = 3 for the parent job group and n_jobs = 2 for the child job group. Likewise, all of the billing triggers and MJC have to use the `job_group_self_and_ancestors` table to modify the job group the job belongs to as well its parent job groups. [1] Code to check whether a job group has been cancelled. ```mysql SELECT job_groups.*, cancelled_t.cancelled IS NOT NULL AS cancelled FROM job_groups LEFT JOIN LATERAL ( SELECT 1 AS cancelled FROM job_group_self_and_ancestors INNER JOIN job_groups_cancelled ON job_group_self_and_ancestors.batch_id = job_groups_cancelled.id AND job_group_self_and_ancestors.ancestor_id = job_groups_cancelled.job_group_id WHERE job_groups.batch_id = job_group_self_and_ancestors.batch_id AND job_groups.job_group_id = job_group_self_and_ancestors.job_group_id ) AS cancelled_t ON TRUE WHERE ... ```

[batch] Job Groups

893ffbe

danking reviewed Jul 20, 2023

View reviewed changes

jigold added 6 commits July 20, 2023 14:42

better rst?

1942c2d

change heading types

25a04ee

fix numbered lists

5f3a5ac

fixes that prob dont work

a6cb014

strip all extra roles

2f7afa5

try adding back the header info

c22229a

danking previously requested changes Jul 27, 2023

View reviewed changes

daniel-goldstein reviewed Jul 28, 2023

View reviewed changes

jigold mentioned this pull request Aug 1, 2023

[batch] Simplifying Batch Cancellation #7

Closed

hail-is deleted a comment from danking Aug 7, 2023

version 2

10b2605

jigold force-pushed the batch-job-groups branch from bb3d54d to 10b2605 Compare August 9, 2023 19:51

jigold added 4 commits September 25, 2023 14:37

delete original proposal

0ae1af9

clarify from questions on pr

e50930e

clarify job group creation and submission

79890a2

updates after actually implementing the changes

9d399d1

jigold mentioned this pull request Feb 14, 2024

[batch] Add Job Groups to Batch hail-is/hail#14282

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[batch] Job Groups #5

[batch] Job Groups #5

jigold commented Jul 20, 2023

danking Jul 20, 2023

jigold commented Jul 26, 2023

danking left a comment

daniel-goldstein left a comment

jigold commented Jul 28, 2023

danking commented Jul 29, 2023 •

edited

Loading

daniel-goldstein commented Aug 7, 2023



		Flow for Creating and Updating Batches
		~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[batch] Job Groups #5

Are you sure you want to change the base?

[batch] Job Groups #5

Conversation

jigold commented Jul 20, 2023

danking Jul 20, 2023

Choose a reason for hiding this comment

jigold commented Jul 26, 2023

danking left a comment

Choose a reason for hiding this comment

daniel-goldstein left a comment

Choose a reason for hiding this comment

jigold commented Jul 28, 2023

danking commented Jul 29, 2023 • edited Loading

daniel-goldstein commented Aug 7, 2023

danking commented Jul 29, 2023 •

edited

Loading