Resolve Fusion symlinks when publishing files #4348

Merged: pditommaso merged 20 commits into master from 4309-fix-publish-fusion-symlink on Nov 13, 2023

Conversation

bentsherman
Member

Close #4309

@netlify

netlify bot commented Sep 26, 2023

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit eb24f72
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/6552821b7268cb000860cbf8

@bentsherman
Member Author

I updated Jordi's example script as follows:

nextflow.enable.dsl=2

process CREATE {

    output:
    path "data.txt"

    script:
    """
    echo HELLO > data.txt
    """
}

process FORWARD {

    input:
    path "data.txt"

    output:
    path "data.txt"

    script:
    """
    echo AND >> data.txt
    """
}

process PUBLISH {
    publishDir "${params.outdir}"

    input:
    path "data.txt"

    output:
    path "data.txt"

    script:
    '''
    echo BYE >> data.txt
    '''
}

workflow {
    CREATE | FORWARD | PUBLISH
}

And I found that the output file in PUBLISH links to the output from FORWARD rather than CREATE. I will try to fix that in Nextflow.

@bentsherman
Member Author

Actually, it might not be so simple. When an input is forwarded as an output multiple times, there are conflicting needs:

  • linking to the original file regardless of the degrees of separation is nice to avoid multiple levels of links
  • however, these links are needed to determine task provenance (and the automatic work directory cleanup that is WIP)

If we can find a way to resolve the actual symlinks but preserve Nextflow's metadata, then maybe we can have it both ways. For example, we could add a field to FileHolder, e.g. originalStorePath, that points to the original file regardless of intermediate links.
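
A minimal sketch of that idea, written as a standalone Groovy script. HolderSketch, forwardTo and the work directory paths are hypothetical and only illustrate the proposal; this is not Nextflow's actual FileHolder class:

```groovy
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical stand-in for a file holder that tracks two paths:
// where the file is staged for the current task, and where it was first created.
class HolderSketch {
    final Path storePath
    final Path originalStorePath

    HolderSketch(Path storePath, Path originalStorePath = null) {
        this.storePath = storePath
        // a freshly created file is its own original
        this.originalStorePath = originalStorePath != null ? originalStorePath : storePath
    }

    // Forwarding into another task directory adds a new link level
    // but carries the original location along unchanged.
    HolderSketch forwardTo(Path taskDir) {
        new HolderSketch(taskDir.resolve(storePath.fileName), originalStorePath)
    }
}

def created   = new HolderSketch(Paths.get('/work/aa/111/data.txt'))
def forwarded = created.forwardTo(Paths.get('/work/bb/222'))
def published = forwarded.forwardTo(Paths.get('/work/cc/333'))

println "staged at ${published.storePath}, original at ${published.originalStorePath}"
```

With this shape, the per-task storePath chain is still available for provenance, while publishing could read from originalStorePath directly.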

@bentsherman
Member Author

I worked around the previous problem by using the .fusion.symlinks files. However, there is one more issue:

def target = sourceDir ? sourceDir.relativize(source) : source.getFileName()

This line does not work with the resolved Fusion symlink, because now the resolved file is not in the current task directory. So instead of publishing to results/data.txt it goes to something like results/dc/14a14e8ef9bc91cbf92e14e33715da/data.txt.

I think this line is meant to deal with files that are in a subdirectory of the task directory. I should be able to fix it by resolving the Fusion symlink after this line.
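
To illustrate the ordering issue, here is a small sketch. publishTarget wraps the line quoted above (with an explicit null check); the helper name and the directory names are made up for illustration, and Unix-style paths are assumed:

```groovy
import java.nio.file.Path
import java.nio.file.Paths

// Same logic as the line quoted above, wrapped in a helper (hypothetical name)
Path publishTarget(Path sourceDir, Path source) {
    sourceDir != null ? sourceDir.relativize(source) : source.getFileName()
}

// Hypothetical task directory and resolved symlink target
Path taskDir  = Paths.get('/work/ab/123456')
Path local    = taskDir.resolve('data.txt')
Path resolved = Paths.get('/work/dc/14a14e/data.txt')

// Compute the publish target from the task-local path first ...
Path target = publishTarget(taskDir, local)
// ... and only then swap in the resolved file as the copy source
Path source = resolved

println "publish ${source} as ${target}"
```

Relativizing the resolved path against the task directory instead would yield a path that walks out of the task directory, which is why the target name has to be computed before the symlink is resolved.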

@bentsherman bentsherman marked this pull request as ready for review September 29, 2023 20:03
@bentsherman
Member Author

PublishOpTest is failing, but since PublishOp is not used and we decided not to make a publish operator, maybe it's better to delete it.

bentsherman and others added 2 commits October 12, 2023 12:55
Signed-off-by: Ben Sherman <[email protected]>
@jordeu
Collaborator

jordeu commented Oct 28, 2023

We have some open tickets related to this problem. It'd be good to get this merged into master.

Signed-off-by: Ben Sherman <[email protected]>
Member

@pditommaso pditommaso left a comment

Unit tests should be added for key logic

@pditommaso pditommaso force-pushed the 4309-fix-publish-fusion-symlink branch from c367fd8 to dd154a0 Compare October 30, 2023 21:15
@bentsherman
Member Author

Is there a simple way to mock the S3 interactions here? Or do you want to use real S3 objects?

@marcodelapierre
Member

note: also backporting to latest stable

Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso
Member

Regex "compilation" can be slow. I moved the pattern into a separate constant, so it's compiled only once, the very first time it is used 👉 0d853ea
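
The pattern-as-constant idea can be sketched as follows. The class name, the pattern itself and the /fusion/scheme/bucket/key path layout are assumptions for illustration, not the code from the commit:

```groovy
import java.util.regex.Matcher
import java.util.regex.Pattern

class FusionPathSketch {
    // Compiled once when the class is initialised, then reused on every call.
    // Hypothetical pattern: /fusion/<scheme>/<bucket and key>
    private static final Pattern FUSION_PATH = Pattern.compile('^/fusion/([^/]+)/(.+)$')

    // Returns the remote URI for a Fusion-style path, or null if it does not match
    static String toRemoteUri(String path) {
        Matcher m = FUSION_PATH.matcher(path)
        return m.matches() ? "${m.group(1)}://${m.group(2)}".toString() : null
    }
}

println FusionPathSketch.toRemoteUri('/fusion/s3/my-bucket/work/ab/data.txt')
```

Calling Pattern.compile inside toRemoteUri would give the same result but recompile the pattern on each invocation.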

@bentsherman bentsherman added this to the 23.11.0-edge milestone Nov 7, 2023
Signed-off-by: Ben Sherman <[email protected]>
@pditommaso
Member

The test is failing

Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso
Member

Still failing

@bentsherman
Member Author

I think the integration tests are not set up to use Fusion. It seems like the AWS credentials are not provided.

@pditommaso
Member

Umm, exportStorageCredentials should be added to this config (a good opportunity to update it here as well):

https://github.com/nextflow-io/nextflow/blob/816216614dd90b83a0261192729fccf85fc55e76/validation/wave-tests/example6/nextflow.config#L1-L21
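
For reference, a minimal sketch of a Fusion-enabled config with that option set. It only shows the settings discussed here and is not a copy of the linked file, which may contain other options:

```groovy
// nextflow.config (sketch, not the linked file)
wave {
    enabled = true
}

fusion {
    enabled = true
    // export the object storage credentials to the task environment
    exportStorageCredentials = true
}
```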

bentsherman and others added 3 commits November 13, 2023 10:48
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso pditommaso merged commit 89f09fe into master Nov 13, 2023
20 checks passed
@pditommaso pditommaso deleted the 4309-fix-publish-fusion-symlink branch November 13, 2023 21:01
pditommaso added a commit that referenced this pull request Dec 17, 2023
This commit is a supplement to PR #4348
to fix the issue #4309

Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Co-authored-by: Paolo Di Tommaso <[email protected]>
pditommaso added a commit that referenced this pull request Jan 12, 2024
This commit fixes the invalid resolution of Fusion symlinks in the publishDir directive
when the output file is the same as an input file.

Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Co-authored-by: Robert Syme <[email protected]>
Co-authored-by: Paolo Di Tommaso <[email protected]>
pditommaso added a commit that referenced this pull request Jan 12, 2024
This commit is a supplement to PR #4348
to fix the issue #4309

Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Co-authored-by: Paolo Di Tommaso <[email protected]>
Development

Successfully merging this pull request may close these issues.

Incorrect content when a process's published file is also an input and Fusion is enabled
5 participants