plugin: add enforcement of max running jobs limit for a queue per-association #491
base: master
OK, I've added a couple more tests here and I think this might be ready for some initial review. I don't expect this to make this month's release, so no rush here. Let me know if the second and third commit could perhaps be squashed? I tried to keep them separate so they'd be easier to review, but can squash if we feel like it'd make more sense.
Problem: There is no definition for a max running jobs limit for a queue in the flux-accounting database, but there exists a need to limit the number of jobs an association can run under a certain queue. Add a new column to the queue_table in the flux-accounting DB: max_running_jobs, which limits the number of running jobs an association can have under a particular queue. Add max_running_jobs to the set of information sent by the flux account-priority-update command, unpack it in the priority plugin, and store it in an attribute of the Queue class.
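As a rough illustration of the schema change described in this commit, the sketch below adds a max_running_jobs column to a stand-in queue_table using Python's sqlite3 module. Only the table and column names come from the commit message; the single "queue" column, the default value of 100, and the helper name add_max_running_jobs_column are assumptions made for the example, and flux-accounting's actual schema and migration code may differ.

```python
import sqlite3


def add_max_running_jobs_column(conn, default=100):
    """Add max_running_jobs to queue_table if it is not already present.

    Hypothetical helper: the default value is an assumption for this sketch.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    cols = [row[1] for row in conn.execute("PRAGMA table_info(queue_table)")]
    if "max_running_jobs" not in cols:
        conn.execute(
            "ALTER TABLE queue_table "
            f"ADD COLUMN max_running_jobs INTEGER DEFAULT {int(default)}"
        )
    conn.commit()


# Stand-in database with a minimal, assumed queue_table layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue_table (queue TEXT PRIMARY KEY)")
conn.execute("INSERT INTO queue_table (queue) VALUES ('bronze')")

add_max_running_jobs_column(conn)
add_max_running_jobs_column(conn)  # calling it again is a no-op

# Limit an association to 2 running jobs in the 'bronze' queue.
conn.execute("UPDATE queue_table SET max_running_jobs = 2 WHERE queue = 'bronze'")
limits = dict(conn.execute("SELECT queue, max_running_jobs FROM queue_table"))
```

Here `limits` maps each queue name to its per-association running-jobs limit, which is roughly the information the commit describes sending to the priority plugin.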
Problem: The priority plugin has no way to keep track of the number of jobs an association is running under each queue. Add a new member to the Association class: queue_usage, a map whose key-value pairs consist of a string representing the queue name and an integer representing the number of running jobs the association has under that queue. In the callback for job.state.run, increment the number of running jobs the association has for a given queue if one is specified. In the callback for job.state.inactive, decrement the number of running jobs the association has for a given queue if one is specified. Adjust the unit tests for the Association class to account for the addition of the "queue_usage" member.
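The bookkeeping described in this commit can be sketched as follows. This is a small Python model of the idea, not the plugin's code: the function names on_job_run and on_job_inactive are hypothetical stand-ins for the job.state.run and job.state.inactive callbacks, and the guard against decrementing below zero is an assumption added for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Association:
    # queue name -> number of jobs the association is running in that queue
    queue_usage: Dict[str, int] = field(
        default_factory=lambda: defaultdict(int)
    )


def on_job_run(assoc: Association, queue: Optional[str]) -> None:
    """Model of job.state.run: count the job only if a queue was specified."""
    if queue is not None:
        assoc.queue_usage[queue] += 1


def on_job_inactive(assoc: Association, queue: Optional[str]) -> None:
    """Model of job.state.inactive: release the job's slot in its queue."""
    # Assumption: never let the counter drop below zero.
    if queue is not None and assoc.queue_usage[queue] > 0:
        assoc.queue_usage[queue] -= 1


assoc = Association()
on_job_run(assoc, "bronze")
on_job_run(assoc, "bronze")
on_job_run(assoc, None)        # no queue specified: nothing is counted
on_job_inactive(assoc, "bronze")
bronze_running = assoc.queue_usage["bronze"]
```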
Problem: The priority plugin has no way of enforcing a max running jobs limit for an association on a per-queue basis. Add a new member to the Association class: queue_held_jobs, a hash map of key-value pairs where the key is the name of the queue the held job is supposed to run under and the value is a vector of job IDs. Add a helper function, max_run_jobs_per_queue (), to fetch the max number of running jobs for an association in a queue. In the callback for job.state.depend, add a check for the number of currently running jobs an association has in a queue compared to the limit of running jobs a queue can have. If they are equal, add a dependency to the currently submitted job to hold it until a currently running job finishes. Push back the held job ID onto the vector of held jobs and store it in the Association object. In the callback for job.state.inactive, add a check for the release of any held jobs in a queue for an association *after* a currently running job in that queue completes. If the association is now under that limit, grab the first job ID in the held jobs queue and remove the dependency from it. Remove that job ID from the vector that holds IDs of held jobs for that queue.
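The hold-and-release flow described in this commit can be modeled in a few lines of Python. This is a sketch under stated assumptions rather than the plugin's implementation: on_depend, on_run, and on_inactive are hypothetical stand-ins for the job.state.* callbacks, adding or removing a dependency is represented only by a return value, and the limit check uses >= defensively where the commit message describes an equality check.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Queue:
    max_running_jobs: int  # per-association limit for this queue


@dataclass
class Association:
    queue_usage: Dict[str, int] = field(default_factory=dict)
    queue_held_jobs: Dict[str, List[int]] = field(default_factory=dict)


def on_depend(assoc, queues, queue: Optional[str], job_id: int) -> bool:
    """Model of job.state.depend. Returns True if the job is held."""
    if queue is None or queue not in queues:
        return False
    if assoc.queue_usage.get(queue, 0) >= queues[queue].max_running_jobs:
        # At the limit: remember the job so it can be released later.
        assoc.queue_held_jobs.setdefault(queue, []).append(job_id)
        return True
    return False


def on_run(assoc, queue: Optional[str]) -> None:
    """Model of job.state.run."""
    if queue is not None:
        assoc.queue_usage[queue] = assoc.queue_usage.get(queue, 0) + 1


def on_inactive(assoc, queues, queue: Optional[str]) -> Optional[int]:
    """Model of job.state.inactive. Returns the released job ID, if any."""
    if queue is None or queue not in queues:
        return None
    assoc.queue_usage[queue] = max(assoc.queue_usage.get(queue, 0) - 1, 0)
    held = assoc.queue_held_jobs.get(queue, [])
    if held and assoc.queue_usage[queue] < queues[queue].max_running_jobs:
        return held.pop(0)  # first held job in the queue goes first
    return None


queues = {"bronze": Queue(max_running_jobs=2), "silver": Queue(max_running_jobs=2)}
assoc = Association()

for job_id in (1, 2):                      # two jobs fill the bronze limit
    on_depend(assoc, queues, "bronze", job_id)
    on_run(assoc, "bronze")

third_held = on_depend(assoc, queues, "bronze", 3)   # held at the limit
other_held = on_depend(assoc, queues, "silver", 4)   # other queues unaffected
released = on_inactive(assoc, queues, "bronze")      # job 1 finishes
```

In this model the third bronze job is held, the silver job is not, and finishing one bronze job releases job 3. The model does not count the released job as running until on_run is called for it.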
Problem: There are no tests that check the enforcement of max running jobs limit in a queue for an association. Add some tests.
@ryanday36 would you like to give this PR just a high-level look and confirm that this solves the case you brought up in #402? Let me know if I understood the requirements correctly.
Problem
The priority plugin does not support enforcement of a max running jobs limit in a queue for an association, i.e., the only max running jobs limit currently enforced for an association is one across all of their running jobs, regardless of queue.
This PR looks to add enforcement of a new limit to the priority plugin: max running jobs in a queue per-association. To achieve this, new members are added to the Queue class and the Association class:

Queue
    max_running_jobs: the max number of running jobs an association can have in a queue

Association
    queue_usage: a hash-map storing the name of the queue and the number of running jobs the association has in that queue
    queue_held_jobs: a hash-map storing the name of the queue and a list of any held jobs the association has in that queue

If a queue is specified on submission, the priority plugin will keep track of these running jobs per-association by incrementing a running jobs counter when a job enters job.state.run and decrementing it when the job enters job.state.inactive.

If an association has the max number of running jobs in a queue and submits another job to that queue, a dependency is added in job.state.depend. The job ID will be stored in the Association object and held until a currently running job in that queue transitions to job.state.inactive.

In job.state.inactive, if a queue is specified for a job, a check is performed to see if the association has any other jobs waiting to be run in that queue. If there is one (and the association is now under the max running jobs limit), the dependency is removed from the first held job in that queue and it can proceed to run.

I've added some basic tests that simulate an association submitting the max number of jobs to a queue and having a dependency added to another submitted job. While at this limit, the association can still successfully submit jobs to other queues and have them run. Once a currently running job in the queue where a job is held completes, the held job can transition to run.