Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batch] Compact And Drop Records from job_group_inst_coll_cancellable_resources #14645

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

ehigham
Copy link
Member

@ehigham ehigham commented Jul 30, 2024

Records in the job_group_inst_coll_cancellable_resources table are dead once a
job group completes. We already compact records when a job group is cancelled.
We are yet to do this for finished job groups. See the linked issue for a more
detailed motivation.

This change adds two background tasks:

  1. finds uncompacted groups of records for finished job groups and
    compacts them by summing across the token field.
  2. finds compacted records for finished job groups and deletes them if
    all associated resources are 0.

The results of both tasks converge to a fixed point where the only remaining
records are for those jobs groups that are unfinished, cancelled or have
resources outstanding.

I've taken care to optimise the underlying SQL queries as best as I can. Both
make heavy use of lateral joins to avoid explodes - the natural implementation
of both are prohibitively expensive.

I've tested these tasks in a dev deploy where I created a number of batches and
observed that records from this table have indeed been compacted and destroyed
on completion. It's not immediately obvious to me how to automate testing for
these. AFAICT, we lack any automated integration testing for these background
tasks.

Resolves: #14623

@ehigham ehigham marked this pull request as ready for review July 31, 2024 18:07
) AS R ON TRUE
WHERE G.time_completed IS NOT NULL
AND C.id IS NULL
LIMIT 1000;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling out the limit on both queries. Seems other queries also limit to 1000 but not sure where this comes from. Without compacting, the query to find compacted rows takes for ever as it scans through a large chunk of the db. On the other hand, there are millions of rows so reducing this number would make the background task take longer to churn through records. Suggestions?

Copy link
Member Author

@ehigham ehigham Aug 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, my 2-week old prod snapshot has 173561655 rows in job_group_inst_coll_cancellable_resources and 8567769 job_groups. Assuming (incorrectly) instant execution, It'll will take 100 days to churn through the db.

@ehigham ehigham force-pushed the ehigham/14623-compact-job_group_inst_coll_cancellable_resources branch from 339c9fe to af92cd6 Compare July 31, 2024 19:09
Copy link
Collaborator

@chrisvittal chrisvittal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'm convinced that the SQL works as advertised. I'd love to see an easy query to just delete all the unnecessary records from job_group_inst_coll_cancellable_resources, but I certainly haven't spent enough time understanding the stored procedures and triggers that cause this table to be updated and what invariants hold for it.

Guess that's a future project that will only be relevant if this table grows faster than we can delete records from it.

@ehigham ehigham added the WIP label Aug 5, 2024
@ehigham ehigham removed the WIP label Aug 5, 2024
@ehigham ehigham requested a review from chrisvittal August 5, 2024 19:07
@ehigham
Copy link
Member Author

ehigham commented Aug 5, 2024

Updated queries to return job groups that do not have an ancestor or self job group that has been cancelled. This logic now mirrors that of delete_prev_cancelled_job_group_cancellable_resources_records, only in anti-join form.
Previous query returned those job groups that do not have a cancellation record for itself or a descendent job group.
SQL is hard.

Copy link
Collaborator

@chrisvittal chrisvittal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newest SQL update LGTM

@ehigham
Copy link
Member Author

ehigham commented Aug 9, 2024

@daniel-goldstein @jigold I believe I've implemented this faithfully to the issue but I'm not confident about any fallout if I've got something wrong. Would you mind taking a look (sorry to drag you into this)?

I've grepped through the codebase as @daniel-goldstein suggested and AFAICT, these records are unused after a job group finish.
Why do we need them after they've finished? When would a job group terminate with non-zero resources?
Thanks so much for your help!

@daniel-goldstein
Copy link
Contributor

Why do we need them after they've finished? When would a job group terminate with non-zero resources?

I can take a look at this on Monday probably but AFAIK we don't and it wouldn't, hence the copious amounts of garbage.

hail-ci-robot pushed a commit that referenced this pull request Aug 25, 2024
Fixes #14660 by using the graphQL API to query github directly. Replaces
our current parallel interpretation of reviews into a review decision,
which is brittle if we ever change review requirements in github again.

Tested by manually updating the live CI to use the test batch generated
image. Results:
- Review decisions correctly fetched from github, not based on CI's
parallel interpretation of individual reviews:

![image](https://github.com/user-attachments/assets/67c03aa9-000a-44e7-91aa-3a42d04238dc)
- No merge candidate was being incorrectly nominated (in particular,
#14645 is now considered pending, rather than approved, which is what we
are currently, incorrectly, calculating)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[batch] Drop dead rows from job_group_inst_coll_cancellable_resources
4 participants