Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deleting an history should cancel running workflow invocations #10442

Closed
nsoranzo opened this issue Oct 16, 2020 · 7 comments
Closed

Deleting an history should cancel running workflow invocations #10442

nsoranzo opened this issue Oct 16, 2020 · 7 comments

Comments

@nsoranzo
Copy link
Member

This is especially an issue if the inputs for the workflow are from another non-deleted history, meaning that scheduled jobs won't even be paused.

@jmchilton
Copy link
Member

We do cancel new scheduling iterations because of deleted histories (https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/workflow/run.py#L170). So this must be within a scheduling iteration I assume. I'm nervous about simply rechecking the history between each step, between each job, etc... but clearly scheduling iterations are too long right now if this is a problem. I assume the jobs don't run at least?

@nsoranzo
Copy link
Member Author

We do cancel new scheduling iterations because of deleted histories (https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/workflow/run.py#L170). So this must be within a scheduling iteration I assume.

Yes, the user deleted the history while a very large mapping step was being scheduled.

I'm nervous about simply rechecking the history between each step, between each job, etc... but clearly scheduling iterations are too long right now if this is a problem. I assume the jobs don't run at least?

I didn't check at the time if the jobs were run (I have separate workflow and job handlers), and now this is too buried in the logs for me to be confident in the answer, unfortunately. But, as you mention, the unnecessary workflow scheduling of such large steps is any way a problem for us.

A possible set of solutions for this could be:

  • from galaxy.managers.histories.HistoryManager.delete() call galaxy.managers.workflows.WorkflowsManager.cancel_invocation() on each invocation running on the history to delete
  • change the default of maximum_workflow_jobs_per_scheduling_iteration from -1 to a large but positive number.

The use of maximum_workflow_jobs_per_scheduling_iteration was suggested by @mvdbeek , but changing its default is my idea (so don't blame him if the idea is stupid :D ).
As mentioned on Gitter, I am not sure the current implementation of maximum_workflow_jobs_per_scheduling_iteration is working as expected though. By looking on how it's passed as max_num_jobs to https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tools/execute.py#L29 , jobs_executed seems to be set to 0 and never updated (could be partially due to #7449 ).

@jmchilton
Copy link
Member

from galaxy.managers.histories.HistoryManager.delete() call galaxy.managers.workflows.WorkflowsManager.cancel_invocation() on each invocation running on the history to delete

This wouldn't change anything - since the invocation will cancel itself if it is over a deleted history. The problem is this happens in the middle of an invocation scheduling step. I guess we could also check the invocation is cancelled after each step - that might be a slight improvement over repeatedly checking the history.

I thought maximum_workflow_jobs_per_scheduling_iteration was working when I implemented it, but it is hard to test and may have regressed. It is worth fixing.

Hopefully @mvdbeek's recent job scheduling enhancements will reduce this scope of this problem. The fast we schedule jobs the more we can free resources to do more checking and the less likely conflicts like this will be to occur.

@mvdbeek
Copy link
Member

mvdbeek commented Oct 21, 2020

I'm working on it now, and the test was not executed under pytest because the test method didn't start with test. I did find a bunch of things to fix, will have a PR later today.

@jmchilton
Copy link
Member

How do you do so much - you're amazing. Good luck let me know if I can help.

@mvdbeek
Copy link
Member

mvdbeek commented Oct 21, 2020

#10490 restores maximum_workflow_jobs_per_scheduling_iteration. We could also check at each scheduling step that the history is not deleted. That should at most result in one extra job per deleted history.

@mvdbeek
Copy link
Member

mvdbeek commented Nov 30, 2023

That should be fixed in #16252, which will also cancel the invocation's already scheduled jobs.

@mvdbeek mvdbeek closed this as completed Nov 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants