running workflows may appear stopped on network filesystems #6506

Open
oliver-sanders opened this issue Nov 29, 2024 · 3 comments
Labels
bug Something is wrong :(
Comments

@oliver-sanders
Member

Regression of: #2943
Bug report: https://cylc.discourse.group/t/cylc-set-slow-to-be-able-to-run-after-workflow-started/1073/5

On network filesystems, there is a lag between a file being created on one host and it appearing on another.

The Cylc contact file contains the details of the running scheduler and is how we detect if a workflow is running or not.

By running fsync on the contact file and then listing the directory that contains it, we can force the network filesystem to synchronize the file, reducing this lag.
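As a rough illustration of the fsync-then-list approach described above, here is a minimal Python sketch. It is not the Cylc implementation: the function name `write_and_sync` and its arguments are hypothetical, and whether the directory listing actually prompts other hosts to see the file depends on the NFS client's caching behaviour.

```python
import os
from pathlib import Path


def write_and_sync(path: Path, text: str) -> None:
    """Write text to path, then nudge the filesystem to publish it.

    Hypothetical helper for illustration only.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as handle:
        handle.write(text)
        handle.flush()             # push Python's buffer to the OS
        os.fsync(handle.fileno())  # ask the OS to commit the file to storage
    # List the containing directory. On some network filesystems this
    # refreshes the directory entry so the new file shows up sooner
    # (an assumption that depends on the client and mount options).
    os.listdir(path.parent)
```

The listing result is discarded on purpose: only the side effect of reading the directory matters here.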

By default, Cylc installs each run of a workflow into a numbered directory (run1, run2, ...) with a runN symlink pointing at the most recent run number. It would appear that we need to perform a filesystem listing on this symlink before the contact file is synchronized to other nodes on the network.

There is another symlink that might trip things up which is configured by global.cylc[install][symlink dirs]run. We will likely need to perform an additional filesystem listing if this symlink is created.
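One way to picture the extra listings is to walk up from the contact file and list every directory level on the way, going through symlinks such as runN rather than resolving them first. The sketch below is an assumption about how that could look, not what Cylc does: `list_path_levels` and `stop_at` are hypothetical names, and the caller decides how far up the tree to go.

```python
import os
from pathlib import Path


def list_path_levels(contact_file: Path, stop_at: Path) -> list[Path]:
    """List each directory from the contact file's parent up to stop_at.

    Hypothetical helper: the path is walked as given (unresolved), so a
    level that is a symlink to a directory is listed via the symlink.
    """
    listed = []
    current = contact_file.parent
    while True:
        os.listdir(current)  # follows a symlink that points at a directory
        listed.append(current)
        # Stop at the requested top level, or at the filesystem root.
        if current == stop_at or current == current.parent:
            break
        current = current.parent
    return listed
```

If a symlink configured via [install][symlink dirs] sits further up the tree, setting `stop_at` above it would include that level in the walk too.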

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Nov 29, 2024
@oliver-sanders oliver-sanders added this to the 8.3.x milestone Nov 29, 2024
@hjoliver
Member

hjoliver commented Dec 2, 2024

(Calling this a bug in Cylc seems a little harsh! but fair enough)

@ColemanTom
Contributor

We're seeing another potential spot. I could be wrong about it, but:

  1. Cylc submits a job (so ssh + mkdir + qsub)
  2. qsub submits the job and it starts trying to run on a different node
  3. The PBS options include -k, so job.out and job.err should go straight into the log/job directory
  4. PBS is failing to start running the job because it is failing to open the job.out file
  5. The log folder tree is on NFS on the HPC

To me, that suggests the log folder may not have synced across NFS. Does that sound reasonable/possible?

@dpmatthews
Contributor

Sounds possible but, if so, that would have to be addressed by PBS.
