Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] Job Groups #5

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
Open

[batch] Job Groups #5

wants to merge 12 commits into from

Conversation

jigold
Copy link

@jigold jigold commented Jul 20, 2023

This PR includes the proposal for Job Groups in Batch.



Flow for Creating and Updating Batches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think ~ is valid RST which prevents GitHub from rendering this for easy reading. Can you switch to -?

@jigold
Copy link
Author

jigold commented Jul 26, 2023

@danking @daniel-goldstein I'd appreciate your thoughts on this RFC. Thanks in advance!

Copy link
Contributor

@danking danking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really a tour de force of Batch description! A lot of my comments concern the motivations, which seems good! We should make sure we are completely agreed on what the motivation is and what the MVP is. That'll help ground the discussion about tradeoffs because we can ask "is this effort worth a user being able to do specific thing X which we decided was a requirement".

Copy link
Contributor

@daniel-goldstein daniel-goldstein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing this up, Jackie. Echoing Dan, this is a goldmine of valuable information about the batch system and the directions that it could go in.

Regarding how this should proceed, I largely echo Dan's sentiment. I think a lot of the background section can be transplanted into documentation as it is background that will be useful not only to this RFC but many RFCs to follow. Regarding proposed implementation, I think we need to tease these many features apart into a sequence of changes we can consider individually. I think it would be nice to narrow down this particular RFC into one that introduces the notion of a job group as a way to cancel a subset of jobs in a Batch. To me, that was the most interesting and possibly hairy part of this proposal and it would be nice to flesh out those sections more on their own. I feel confident that we can make such a change without prohibiting the additional features you've outlined, which will themselves be easier because we'll have the groundwork laid for job groups.

@jigold
Copy link
Author

jigold commented Jul 28, 2023

Ok. I think the action items for the next week are to:

  1. I split off most of the background into a separate developer's document and possibly flesh out the rest of the details of the system.
  2. I will write a separate RFC that specifically addresses simplifying cancellation as currently implemented in the system so we can scrutinize that independently as that's the trickiest part of the proposal.
  3. I will write a separate RFC that addresses the need for vectorizing the batches_n_completed_states table as well as adding the n_running state so I can eventually add my awesome new progress bar that tracks job states that I worked on earlier in the summer.
  4. We as a team need to decide what the longer term vision for job groups is and agree upon the scope of work for right now before I rewrite the proposal. I think we are all in agreement we just want to be able to cancel a subset of jobs without major performance regressions as the initial MVP. I'm going to add that wait is also a requirement because we'll need that for writing the tests. If we don't have a fast, optimized wait at first, then we will need to add an endpoint that gets the current status of a job group by running a full query over the jobs table rather than storing the counts in aggregation tables.

Does this sound good?

@danking
Copy link
Contributor

danking commented Jul 29, 2023

AFAICT, the batches_n_completed_states is not necessary for QoB or cancellation, so my inclination is to delay that until we've fully resolved the other threads herein.

possibly flesh out the rest of the details of the system.

I don't think this is necessary to resolve this RFC since Daniel and I are deep enough in the system and the others aren't reading the RFC. Splitting out into a dev-docs can happen later too.

Let's focus all our time and energy on the discussion of cancellation. That appears to be the blocker for returning to this RFC.

@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@hail-is hail-is deleted a comment from danking Aug 7, 2023
@daniel-goldstein
Copy link
Contributor

I re-read this RFC and had a discussion with Dan about how we got here and where we're heading. We agreed that the discussion RFC and discussion was incredibly valuable in cementing our understanding of the system as it stands today and what changes to cancellation could look like the in the models you have posed above. To summarize some of the points from our discussion and the discussion from Thursday meeting:

  • We believe that partially-overlapping sets of jobs (arbitrary labelling) entails much more complexity and risk than reward.
  • Job groups as disjoint sets of jobs in a batch and job groups as a tree of those same disjoint sets are non-conflicting designs and are both substantially simpler than overlapping label sets. [batch, WIP] Job Groups hail#12697 appears to be the closest public implementation to date matching this simpler model.

About going forward, while #7 has been incredibly informative, it seems that under the disjoint sets/tree models the added complexity to the canceller would be fairly small. So it would be good to cement the distilled information from #7 into dev documentation and then close it for the time-being.

Another very useful takeaway from #7 was that the ultimate level of abstraction in that discussion was just right to express the complexities in the system. The truly salient points were not at the code or SQL level, and not at the user level, but at the level of the various kinds of resource tracking (status, cancellable jobs, billing) and the dependencies that different components in the system have on that state. To me, that kind of precise discussion at that level is new and not something we have had in dev docs or RFCs so far. It would be great if, for this RFC, we can zero in on those changes in the data model that demand changes such as hail-is/hail#12697. I would hope that such a description would be short and can allow us to finalize a written agreement on the scope of changes that have been swimming around in our heads for months now.

@jigold jigold force-pushed the batch-job-groups branch from bb3d54d to 10b2605 Compare August 9, 2023 19:51
@jigold jigold dismissed danking’s stale review September 25, 2023 15:19

I don't see any remaining reviews.

danking pushed a commit to hail-is/hail that referenced this pull request Feb 26, 2024
This PR adds the job groups functionality as described in this
[RFC](hail-is/hail-rfcs#5) to the Batch backend
and `hailtop.batch_client`. This includes supporting nested job groups
up to a maximum depth of 5. Note, that none of these changes are
user-facing yet (hence no change log here).

The PRs that came before this one:
- #13475 
- #13487 
- #13810 (note that this database migration required a shutdown)

Subsequent PRs will need to implement the following:
- Querying job groups with the flexible query language (v2)
- Implementing job groups in the Scala Client for QoB
- Using job groups in QoB with `cancel_after_n_failures=1` for all new
stages of worker jobs
- UI functionality to page and sort through job groups
- A new `hailtop.batch` interface for users to define and work with Job
Groups

A couple of nuances in the implementation came up that I also tried to
articulate in the RFC:
1. A root job group with ID = 0 does not belong to an update
("update_id" IS NULL). This means that any checks that look for
"committed" job groups need to do `(batch_updates.committed OR
job_groups.job_group_id = %s)` where "%s" is the ROOT_JOB_GROUP_ID.
2. When job groups are cancelled, only the specific job group that was
cancelled is inserted into `job_groups_cancelled`. This table does
**NOT** contain all transitive job groups that were also cancelled
indirectly. The reason for this is we cannot guarantee that a user
wouldn't have millions of job groups and we can't insert millions of
records inside a single SQL stored procedure. Now, any query on the
driver / front_end must look up the tree and see if any parent has been
cancelled. This code looks similar to the code below [1].
3. There used to be `DELETE FROM` statements in `commit_batch_update`
and `commit_batch` that cleaned up old records that were no longer used
in `job_group_inst_coll_cancellable_resources` and
`job_groups_inst_coll_staging`. This cleanup now occurs in a periodic
loop on the driver.
4. The `job_group_inst_coll_cancellable_resources` and
`job_groups_inst_coll_staging` tables have values which represent the
sum of all child job groups. For example, if a job group has 1 job and
it's child job group has 2 jobs, then the staging table would have
n_jobs = 3 for the parent job group and n_jobs = 2 for the child job
group. Likewise, all of the billing triggers and MJC have to use the
`job_group_self_and_ancestors` table to modify the job group the job
belongs to as well its parent job groups.

[1] Code to check whether a job group has been cancelled.

```mysql
SELECT job_groups.*,
  cancelled_t.cancelled IS NOT NULL AS cancelled
FROM job_groups
LEFT JOIN LATERAL (
  SELECT 1 AS cancelled
  FROM job_group_self_and_ancestors
  INNER JOIN job_groups_cancelled
    ON job_group_self_and_ancestors.batch_id = job_groups_cancelled.id AND
      job_group_self_and_ancestors.ancestor_id = job_groups_cancelled.job_group_id
  WHERE job_groups.batch_id = job_group_self_and_ancestors.batch_id AND
    job_groups.job_group_id = job_group_self_and_ancestors.job_group_id
) AS cancelled_t ON TRUE
WHERE ...
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants