ERROR ~ Error executing process > 'pipeline:reference_assembly:map_reads (1)' #121

Closed
physnano opened this issue Sep 30, 2024 · 9 comments
Labels
question Further information is requested

Comments

@physnano

physnano commented Sep 30, 2024

My workflow keeps failing at the reference_assembly:map_reads step:

ERROR ~ Error executing process > 'pipeline:reference_assembly:map_reads (1)'

Caused by:
  Process `pipeline:reference_assembly:map_reads (1)` terminated with an error exit status (140)

Command executed:

  minimap2 -t 1 -ax splice -uf genome_index.mmi seqs.fastq.gz        | samtools view -q 40 -F 2304 -Sb -        | seqkit bam -j 1 -x -T 'AlnContext: { Ref: "GRCh38.primary_assembly.genome.fa", LeftShift: -24,
      RightShift: 24, RegexEnd: "[Aa]{8,}",
      Stranded: True,Invert: True, Tsv: "internal_priming_fail.tsv"} ' -        | samtools sort --write-index -@ 1 -o "E3_rep2_reads_aln_sorted.bam##idx##E3_rep2_reads_aln_sorted.bam.bai" - ;
  ((cat "E3_rep2_reads_aln_sorted.bam" | seqkit bam -s -j 1 - 2>&1)  | tee E3_rep2_read_aln_stats.tsv ) || true
  
  # Add sample id header and column
  sed "s/$/E3_rep2/" "E3_rep2_read_aln_stats.tsv"         | sed "1 s/E3_rep2/sample_id/" > tmp
  mv tmp "E3_rep2_read_aln_stats.tsv"
  
  if [[ -s "internal_priming_fail.tsv" ]];
      then
          tail -n +2 "internal_priming_fail.tsv" | awk '{print ">" $1 "\n" $4 }' - > "context_internal_priming_fail_start.fasta"
          tail -n +2 "internal_priming_fail.tsv" | awk '{print ">" $1 "\n" $6 }' - > "context_internal_priming_fail_end.fasta"
  fi

Command exit status:
  140

Command output:
  (empty)

Error code 140 suggests a memory or CPU constraint. However, adding the following to the config file has not resolved the issue:

process {
    withName: 'makeReport' {
    queue = 'himem'
    memory = '512.GB'
    }

    withName: 'reference_assembly:map_reads' {
    memory = '32.GB'
    } 
}

This results in the following warning:

WARN: There's no process matching config selector: reference_assembly:map_reads

physnano added the question label (Further information is requested) on Sep 30, 2024

@nrhorner
Contributor

Hi @physnano

Only the process name itself should be used in the process selector, without the workflow prefix, like so:

    withName: 'map_reads' {
        memory = '32.GB'
    }
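
For anyone hitting the same "no process matching config selector" warning, here is a minimal sketch of that override written out as its own file. The file name custom.config is an arbitrary choice for illustration, not something the workflow expects. The selector and memory value are the ones from this thread.

```shell
#!/bin/sh
set -eu

# Write a small override config. The file name is an assumption;
# any path works as long as it is passed to nextflow with -c.
cat > custom.config <<'EOF'
process {
    // The selector uses the bare process name, without the
    // 'pipeline:reference_assembly:' workflow prefix.
    withName: 'map_reads' {
        memory = '32.GB'
    }
}
EOF

# Show the selector line(s) that were written.
grep -n "withName" custom.config
```

The file can then be supplied to the run with `-c custom.config`. If the selector still does not match, Nextflow prints the same WARN line at startup, so it is worth checking the top of the log before waiting for the job to fail.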

@physnano
Author

physnano commented Oct 3, 2024

Thanks @nrhorner, that, along with clusterOptions = '--qos=long', seemed to help. However, I am now seeing the following:

ERROR ~ Error executing process > 'pipeline:split_bam (2)'

Caused by:
  Process `pipeline:split_bam (2)` terminated with an error exit status (137)

Command executed:

  n=`samtools view -c isob11_rep2_reads_aln_sorted.bam`
  if [[ n -lt 1 ]]
  then
      echo 'There are no reads mapping for isob11_rep2. Exiting!'
      exit 1
  fi
  
  re='^[0-9]+$'
  
  if [[ 50000 =~ $re ]]
  then
      echo "Bundling up the bams"
      seqkit bam -j 4 -N 50000 isob11_rep2_reads_aln_sorted.bam -o  bam_bundles/
      let i=1
      for b in bam_bundles/*.bam; do
          echo $b
          newname="isob11_rep2_batch_${i}.bam"
          mv $b $newname
         ((i++))
      done
  else
      echo 'no bundling'
      ln -s isob11_rep2_reads_aln_sorted.bam isob11_rep2_batch_1.bam
  fi

Command exit status:
  137

It seems that many of the steps of this workflow do not have sufficient default memory allocated to the (sub)processes...
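
Both statuses reported in this thread (140 and 137) are above 128, which by shell convention means the task was terminated by a signal whose number is the status minus 128. Below is a minimal sketch for decoding them. decode_status is a hypothetical helper name, and signal numbers other than 9 (KILL) can differ between platforms, so the name printed for 140 depends on the system it runs on.

```shell
#!/bin/sh
# Decode an exit status into the signal that (by convention) caused it.
# decode_status is a hypothetical helper, not part of the workflow.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix.
        echo "exit ${status}: signal ${sig} (SIG$(kill -l "$sig"))"
    else
        echo "exit ${status}: not a signal-based status"
    fi
}

decode_status 137
decode_status 140
```

Status 137 corresponds to signal 9 (SIGKILL), which on a cluster is often, though not always, the scheduler or kernel killing a job that exceeded its memory. Which limit produced the 140 is not established here beyond the observation that the corrected selector together with --qos=long helped.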

@nrhorner
Contributor

Hi @physnano

OK, thanks for the update. We will review the memory allocations for this workflow. Would you be able to share a bit of information about your data? How many samples and how many reads in total are you using? Also, which version of the workflow are you running, and what command did you use?

Thanks,

Neil

@physnano
Author

Hi @nrhorner, in my case 3 replicates for 2 samples (6 total) were split across two PromethION flow cells, so ~40-50M raw reads per individual barcode. The makeReport step spikes to ~200GB according to my monitoring. I am using the latest version, v1.4.0. Command used:

nextflow run ${wfPath}wf-transcriptomes \
    --fastq ${fqPath} \
    --de_analysis \
    --ref_genome ${refPath}GRCh38.primary_assembly.genome.fa \
    --ref_annotation ${refPath}gencode.v46.primary_assembly.annotation.gtf \
    --ref_transcriptome ${refPath}gencode.v46.transcripts.fa \
    --sample_sheet ${wfPath}sample_sheet.csv \
    --cdna_kit SQK-PCB114 \
    --out_dir ${wfPath}outdir-de \
    -profile singularity \
    -c ${wfPath}wf-transcriptomes/nextflow.config \
    --threads 4 \
    -resume

@nrhorner
Contributor

nrhorner commented Nov 6, 2024

Hi @physnano

It's not good that the report generation step is using so much memory. I will investigate this.

@nrhorner
Contributor

@physnano

Would you be able to try version 1.6.0 and see whether memory consumption has been reduced, please?

@physnano
Author

Hi @nrhorner, I am rerunning on v1.6.0 today and will let you know how it goes when it completes!

@physnano
Author

physnano commented Dec 29, 2024

Hi @nrhorner, I have run v1.6.0 and it completes. However, since I am running the workflow via the singularity profile on a cluster, I needed to specify job runtimes via the config profile (clusterOptions = '--qos=long'), mainly for the map_reads step.

Also, I am noticing that my "results_dge.tsv" file contains raw read counts (I reran the script and got the same result) instead of the "gene" "logFC" "logCPM" "F" "PValue" "FDR" columns expected from the DGE analysis. The weird thing is that this did not happen when I processed a different dataset (PacBio reads) with a nearly identical script, so I am confused as to why it occurs here. Any ideas why this would be the case? (I can share the log file if needed.)
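
A quick way to tell the two cases apart is to look at the header row of the file. The sketch below checks for the statistics columns listed above. It assumes the file is tab-separated with optionally double-quoted column names, which this thread does not confirm. has_dge_columns and the demo file are hypothetical and exist only to exercise the check.

```shell
#!/bin/sh
# Return success if the first line of a TSV names all DGE statistics columns.
# Assumption: tab-separated header, column names optionally double-quoted.
has_dge_columns() {
    header=$(head -n 1 "$1" | tr -d '"')
    for col in logFC logCPM F PValue FDR; do
        # One column name per line, then require an exact whole-line match.
        printf '%s\n' "$header" | tr '\t' '\n' | grep -qx "$col" || return 1
    done
}

# Hypothetical demo file, written only so the example is self-contained.
printf 'gene\tlogFC\tlogCPM\tF\tPValue\tFDR\n' > demo_dge.tsv

if has_dge_columns demo_dge.tsv; then
    echo "header looks like DGE statistics"
else
    echo "header is missing DGE columns"
fi
```

Pointing the function at the real results_dge.tsv from the output directory, instead of the demo file, shows whether the header carries the statistics columns or something else, such as per-sample counts.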

@physnano
Author

Closing, as this final issue is identified in #139.


2 participants