Combination of 'storeDir' with optional output results in process always being skipped #4123

Open
mcallaway opened this issue Jul 22, 2023 · 5 comments · May be fixed by #5651

Comments

@mcallaway

mcallaway commented Jul 22, 2023

Bug report

Expected behavior and actual behavior

Given a process with an optional output and the storeDir directive, the process should run if the output file is not present in the storeDir. If the script produces no output file, it should not be an error.

The actual behavior is that the process is always skipped, even when the output file is not present in the storeDir.

Steps to reproduce the problem

Here is a process definition:

nextflow.enable.dsl=2

process one {
    debug params.debug
    storeDir params.thedir

    output:
    path "outputfile.txt", optional: true

    script:
    """
    date > outputfile.txt
    """
}

workflow {
    one().view()
}

Program output

❯ nextflow run ./nextflow/test.nf --thedir $HOME/tmp/ --debug
N E X T F L O W  ~  version 23.04.2
Launching `./nextflow/test.nf` [voluminous_leibniz] DSL2 - revision: 90de946046
[skipped  ] process > one [100%] 1 of 1, stored: 1 ✔
[skipping] Stored process > one

.nextflow.log shows:

Jul-22 13:28:18.813 [main] DEBUG nextflow.Session - Session await
Jul-22 13:28:18.842 [Actor Thread 4] DEBUG nextflow.processor.TaskProcessor - Process `one` is unable to find [UnixPath]: `/Users/gmmqs/tmp/outputfile.txt` (pattern: `outputfile.txt`)
Jul-22 13:28:18.843 [Actor Thread 4] INFO  nextflow.processor.TaskProcessor - [skipping] Stored process > one
Jul-22 13:28:18.890 [Actor Thread 4] DEBUG nextflow.processor.TaskProcessor - Process one > Skipping output binding because one or more optional files are missing: fileoutparam<0>
Jul-22 13:28:18.892 [main] DEBUG nextflow.Session - Session await > all processes finished

Environment

  • Nextflow version: version 23.04.2 build 5870
  • Java version: openjdk version "17.0.7" 2023-04-18
  • Operating system: macOS
  • Bash version: zsh 5.9 (x86_64-apple-darwin22.0)

Additional context

None

@mcallaway
Author

Note that I'm aware of publishDir, but that creates a symbolic link in the publishDir to the real file within workDir. I don't want a symlink to workDir because that link will be broken if workDir gets cleaned.

@mribeirodantas
Member

> Note that I'm aware of publishDir, but that creates a symbolic link in the publishDir to the real file within workDir. I don't want a symlink to workDir because that link will be broken if workDir gets cleaned.

Hello, @mcallaway. You can set publishDir to move or copy files, instead of symlinking them. You can read more about it here: https://www.nextflow.io/docs/latest/process.html#publishdir

Snippet below:

process foo {
    publishDir '/data/chunks', mode: 'copy', overwrite: false

    output:
    path 'chunk_*'

    '''
    printf 'Hola' | split -b 1 - chunk_
    '''
}

@schorlton-bugseq

@bentsherman and @pditommaso - storeDir is a great feature, yet it took me a long time to work out why it was skipping a process; the cause turned out to be the optional output. If this cannot be fixed (which I hope it can!), is it possible to add a warning that storeDir does not work for optional outputs? Thanks for your consideration!

@bentsherman
Member

I think it is a fundamental limitation of storeDir, because there is no cache metadata to verify whether the optional output should be there from a previous run. The same is true for an output with a variable number of files, as there is no way to verify the number of files produced by a previous run.

For now I think it's worth documenting these limitations. We are investigating some ideas that I think could replace storeDir in the long-term.

bentsherman linked a pull request on Jan 7, 2025 that will close this issue
@bentsherman
Member

The best workaround that I can think of is that a process that uses storeDir should always have at least one non-optional file output. This will guarantee that the process is executed at least once. You might have to store a dummy output file to make it work.

I believe there are also cases where a process has multiple optional outputs, but in practice at least one of those outputs is expected to be present. That is more of a modeling problem that we need to solve in the language. For example, instead of having two optional outputs for BAM and CRAM, there should be one required output that is somehow modeled as "either BAM or CRAM". Invalid states should be unrepresentable.
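
For illustration, the dummy-output workaround might look roughly like the sketch below, applied to the process from the original report. The marker file name `one.done` and the `emit` names `result` and `marker` are assumptions made up for this example; they do not come from the thread or from the Nextflow documentation, and the snippet has not been verified against a specific Nextflow release.

```nextflow
process one {
    debug params.debug
    storeDir params.thedir

    output:
    // The optional output from the original report, now named for clarity
    path "outputfile.txt", optional: true, emit: result
    // Hypothetical required marker file: because it is not optional,
    // its absence from storeDir should force the process to run
    path "one.done", emit: marker

    script:
    """
    date > outputfile.txt
    touch one.done
    """
}

workflow {
    one()
    // View only the real result, not the marker file
    one.out.result.view()
}
```

The intent is that the first run executes the script and stores both files, while later runs are skipped only once `one.done` exists in the storeDir. Deleting the marker file should then be enough to make the process run again.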
