plugin: add enforcement of max running jobs limit for a queue per-association #491

Open
wants to merge 4 commits into
base: master

cmoussa1
Member

Problem

The priority plugin does not support enforcing a max running jobs limit per queue for an association, i.e., the only max running jobs limit currently enforced for an association is one across all of its running jobs, regardless of queue.


This PR adds enforcement of a new limit to the priority plugin: max running jobs in a queue per-association. To achieve this, new members are added to the Queue class and the Association class:

Queue
  • max_running_jobs: the max number of running jobs an association can have in a queue
Association
  • queue_usage: a hash map from queue name to the number of running jobs the association has in that queue
  • queue_held_jobs: a hash map from queue name to a list of the held jobs the association has in that queue
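A minimal sketch of how these members might look in C++. The struct layouts, the integer types, and the choice of std::unordered_map and std::vector are assumptions for illustration; the plugin's actual Queue and Association classes carry other members and may be declared differently.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, trimmed-down stand-ins for the plugin's classes.
struct Queue {
    // Max number of running jobs one association may have in this queue.
    int max_running_jobs = 0;
};

struct Association {
    // Queue name -> number of jobs this association is running there.
    std::unordered_map<std::string, int> queue_usage;
    // Queue name -> IDs of this association's held jobs, oldest first.
    std::unordered_map<std::string, std::vector<std::uint64_t>> queue_held_jobs;
};
```

Keying both maps by queue name keeps the bookkeeping for one queue independent of every other queue the association submits to.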

If a queue is specified on submission, the priority plugin tracks the association's running jobs in that queue, incrementing a per-queue counter when a job enters job.state.run and decrementing it when the job enters job.state.inactive.
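The counting described above could be sketched roughly as follows. Here on_run and on_inactive are hypothetical stand-ins for the plugin's state callbacks, an empty string stands for "no queue specified", and the sketch assumes on_inactive is only reached by jobs that previously ran.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the plugin's Association class.
struct Association {
    // Queue name -> number of jobs this association is running there.
    std::unordered_map<std::string, int> queue_usage;
};

// Job entered job.state.run: count it against its queue, if it has one.
void on_run(Association &a, const std::string &queue)
{
    if (queue.empty())
        return;
    // operator[] value-initializes a missing entry to 0 before the increment.
    ++a.queue_usage[queue];
}

// Job entered job.state.inactive: give its slot in the queue back.
void on_inactive(Association &a, const std::string &queue)
{
    if (queue.empty())
        return;
    auto it = a.queue_usage.find(queue);
    // Never let the counter drop below zero.
    if (it != a.queue_usage.end() && it->second > 0)
        --it->second;
}
```

Using find in on_inactive avoids creating an entry for a queue the association never ran a job in.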

If an association already has the max number of running jobs in a queue and submits another job to that queue, a dependency is added to the new job in job.state.depend. The job ID is stored in the Association object, and the job is held until one of the association's running jobs in that queue transitions to job.state.inactive.

In job.state.inactive, if a queue is specified for the job, a check is performed to see whether the association has any held jobs waiting to run in that queue. If it does (and the association is now under the max running jobs limit), the dependency is removed from the first held job in that queue so it can proceed to run.
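The hold-and-release flow described above might be sketched like this. All function names (running_jobs, hold_if_at_limit, job_started, job_finished) are hypothetical, the structs are trimmed-down stand-ins for the plugin's classes, and the actual adding and removing of the job dependency is left as a comment rather than a real call into flux-core.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical, trimmed-down stand-ins for the plugin's classes.
struct Queue {
    int max_running_jobs = 0;
};

struct Association {
    std::unordered_map<std::string, int> queue_usage;
    std::unordered_map<std::string, std::vector<std::uint64_t>> queue_held_jobs;
};

// Number of jobs the association is running in `queue` (0 if none recorded).
int running_jobs(const Association &a, const std::string &queue)
{
    auto it = a.queue_usage.find(queue);
    return it == a.queue_usage.end() ? 0 : it->second;
}

// job.state.depend: hold the job if the association is at the queue's limit.
// Returns true when the job was held; the plugin would add the dependency
// to the job at this point. The sketch compares with >= rather than strict
// equality, as a defensive choice.
bool hold_if_at_limit(Association &a, const Queue &q,
                      const std::string &queue, std::uint64_t job_id)
{
    if (queue.empty() || running_jobs(a, queue) < q.max_running_jobs)
        return false;
    a.queue_held_jobs[queue].push_back(job_id);
    return true;
}

// job.state.run: count the job against its queue.
void job_started(Association &a, const std::string &queue)
{
    if (!queue.empty())
        ++a.queue_usage[queue];
}

// job.state.inactive: free the slot, then release the oldest held job in
// the same queue if the association is now under the limit. Returns true
// and sets released_id when a job was released; the plugin would remove
// that job's dependency at this point.
bool job_finished(Association &a, const Queue &q,
                  const std::string &queue, std::uint64_t &released_id)
{
    if (queue.empty())
        return false;
    auto usage = a.queue_usage.find(queue);
    if (usage != a.queue_usage.end() && usage->second > 0)
        --usage->second;
    auto held = a.queue_held_jobs.find(queue);
    if (held == a.queue_held_jobs.end() || held->second.empty()
        || running_jobs(a, queue) >= q.max_running_jobs)
        return false;
    released_id = held->second.front();
    held->second.erase(held->second.begin());
    return true;
}
```

Because each queue has its own counter and its own list of held jobs, an association that is at the limit in one queue can still start jobs in another, and held jobs are released in submission order.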

I've added some basic tests that simulate an association submitting the max number of jobs to a queue and having a dependency added to the next job it submits there. While at this limit, the association can still submit jobs to other queues and have them run. Once a running job in the queue where a job is held completes, the held job can transition to run.

cmoussa1 added the "new feature" and "plugin" (related to the multi-factor priority plugin) labels on Sep 27, 2024
cmoussa1 changed the title from "[WIP] priority plugin: add enforcement of max running jobs limit for a queue per-association" to "[WIP] plugin: add enforcement of max running jobs limit for a queue per-association" on Sep 27, 2024
@cmoussa1
Member Author

OK, I've added a couple more tests here and I think this might be ready for some initial review. I don't expect this to make this month's release, so no rush here. Let me know if the second and third commits should be squashed. I tried to keep them separate so they'd be easier to review, but I can squash them if we feel that would make more sense.

cmoussa1 changed the title from "[WIP] plugin: add enforcement of max running jobs limit for a queue per-association" to "plugin: add enforcement of max running jobs limit for a queue per-association" on Sep 30, 2024
cmoussa1 marked this pull request as ready for review on September 30, 2024 17:04

Problem: There is no definition for a max running jobs limit for a queue
in the flux-accounting database, but there exists a need to limit the
number of jobs an association can run under a certain queue.

Add a new column to the queue_table in the flux-accounting DB:
max_running_jobs, which limits the number of running jobs an association
can have under a particular queue.

Add max_running_jobs to the set of information sent by the flux
account-priority-update command, unpack it in the priority plugin, and
store it in an attribute of the Queue class.

Problem: The priority plugin has no way to keep track of the number of
jobs an association is running under each queue.

Add a new member to the Association class: queue_usage, a map whose
key-value pairs consist of a string representing the queue name and an
integer representing the number of running jobs the association has
under that queue.

In the callback for job.state.run, increment the number of running jobs
the association has for a given queue if one is specified.

In the callback for job.state.inactive, decrement the number of running
jobs the association has for a given queue if one is specified.

Adjust the unit tests for the Association class to account for the
addition of the "queue_usage" member.

Problem: The priority plugin has no way of enforcing a max running jobs
limit for an association on a per-queue basis.

Add a new member to the Association class: queue_held_jobs, a hash map
of key-value pairs where the key is the name of the queue the held job
is supposed to run under and the value is a vector of job IDs.

Add a helper function, max_run_jobs_per_queue (), to fetch the max
number of running jobs for an association in a queue.

In the callback for job.state.depend, compare the number of running jobs
the association has in the queue against the queue's max running jobs
limit. If they are equal, add a dependency to the newly submitted job to
hold it until a currently running job finishes. Push the held job's ID
onto the vector of held jobs stored in the Association object.

In the callback for job.state.inactive, add a check for the release of
any held jobs in a queue for an association *after* a currently running
job in that queue completes. If the association is now under that limit,
grab the first job ID in the held jobs queue and remove the dependency
from it. Remove that job ID from the vector that holds IDs of held jobs
for that queue.

Problem: There are no tests that check the enforcement of max running
jobs limit in a queue for an association.

Add some tests.
@cmoussa1
Member Author

cmoussa1 commented Oct 3, 2024

@ryanday36 could you give this PR a high-level look and confirm that it solves the case you brought up in #402? Let me know if I understood the requirements correctly.

Labels
new feature; plugin (related to the multi-factor priority plugin)