Merge pull request galaxyproject#580 from pavanvidem/rnaseq-sr-0.10
Update RNA-seq single-end workflow
lldelisle authored Nov 19, 2024
2 parents 45843b7 + 4921cc1 commit eb46ef9
Showing 5 changed files with 2,076 additions and 592 deletions.
2 changes: 2 additions & 0 deletions workflows/transcriptomics/rnaseq-sr/.dockstore.yml
@@ -9,3 +9,5 @@ workflows:
authors:
- name: Lucille Delisle
orcid: 0000-0002-1964-4960
- name: Pavankumar Videm
orcid: 0000-0002-5192-126X
13 changes: 13 additions & 0 deletions workflows/transcriptomics/rnaseq-sr/CHANGELOG.md
@@ -1,5 +1,18 @@
# Changelog

## [1.0] 2024-10-22

### Changes in workflows
- Add an optional subworkflow with more QC: FastQC, Picard, read distribution on genomic features, gene body coverage, reads per chromosome.
- Add featureCounts as an alternative way to generate count files
- Use fastp instead of cutadapt. fastp uses pair overlap and allows the adapter sequences to be optional.

### Tool update
- `toolshed.g2.bx.psu.edu/repos/devteam/cufflinks/cufflinks/2.2.1.3` was updated to `toolshed.g2.bx.psu.edu/repos/devteam/cufflinks/cufflinks/2.2.1.4`

### Test dataset
- Use a new subsampled yeast test dataset from Zenodo record https://zenodo.org/records/13987631

## [0.9] 2024-09-23

### Automatic update
24 changes: 16 additions & 8 deletions workflows/transcriptomics/rnaseq-sr/README.md
@@ -2,9 +2,9 @@

## Inputs dataset

- The workflow needs a list of datasets of fastqsanger.
- As well as a gtf file with genes
- Optional, but recommended: a gtf file with regions to exclude from normalization in Cufflinks.
- Collection of FASTQ files: The workflow needs a list collection of fastqsanger datasets.
- GTF file of annotation: A GTF file with gene annotations.
- GTF with regions to exclude from FPKM normalization with Cufflinks: Optional, but recommended. A GTF file with regions to exclude from normalization in Cufflinks.

- For instance a gtf that masks chrM for the mm10 genome:

@@ -15,11 +15,13 @@ chrM chrM_gene exon 0 16299 . - . gene_id "chrM_gene_minus"; transcript_id "chrM

## Inputs values

- forward adapter sequence: this depends on the library preparation. Usually classical Illumina RNA libraries are Truseq and ISML (relatively new Illumina library) is Nextera. If you don't know, use FastQC to determine if it is Truseq or Nextera. If the read length is relatively short (50bp), there is probably no adapter so it will not impact your results.
- reference_genome: this field will be adapted to the genomes available for STAR
- strandedness: For stranded RNA, reverse means that the read is complementary to the coding sequence, forward means that the read is in the same orientation as the coding sequence. This will only count alignments that are compatible with your library preparation strategy. This is also used for the stranded coverage and for FPKM computation with cufflinks/StringTie.
- cufflinks_FPKM: Whether you want to get FPKM with Cufflinks (pretty long)
- stringtie_FPKM: Whether you want to get FPKM/TPM etc... with Stringtie.
- Forward adapter (optional): If not provided, fastp will try to guess the adapter sequence from the data. The sequence depends on the library preparation. Classical Illumina RNA libraries usually use TruSeq adapters, while ISML (a relatively new Illumina library) uses Nextera. If you don't know, use FastQC to determine whether it is TruSeq or Nextera. If the reads are relatively short (50 bp), there is probably no adapter, so the choice will not impact your results.
- Generate additional QC reports: Whether to compute additional QC: FastQC, Picard, read distribution on genomic features, gene body coverage, reads per chromosome.
- Reference genome: this field will be adapted to the genomes available for STAR.
- Strandedness: For stranded RNA, reverse means that the read is complementary to the coding sequence, forward means that the read is in the same orientation as the coding sequence. This will only count alignments that are compatible with your library preparation strategy. This is also used for the stranded coverage and for FPKM computation with cufflinks/StringTie.
- Use featureCounts for generating count tables: Whether to use count tables from featureCounts instead of from STAR.
- Compute Cufflinks FPKM: Whether you want to get FPKM with Cufflinks (this step takes a long time).
- Compute StringTie FPKM: Whether you want to get FPKM, TPM, etc. with StringTie.
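
To make the strandedness setting concrete, here is a minimal Python sketch of how such a choice could map onto the columns of STAR's `ReadsPerGene.out.tab` output. The STAR manual describes column 2 as unstranded counts, column 3 as counts for reads on the same strand as the RNA, and column 4 as counts for reads on the opposite strand. The helper name `select_counts` and the label `stranded - reverse` are assumptions made for illustration, and this is not necessarily how the workflow itself selects counts. The two sample rows come from the assertions of the previous test file.

```python
# Column index in STAR's ReadsPerGene.out.tab for each strandedness choice.
# The labels mirror the workflow parameter; "stranded - reverse" is assumed.
STAR_COLUMN = {
    "unstranded": 1,
    "stranded - forward": 2,
    "stranded - reverse": 3,
}


def select_counts(lines, strandedness):
    """Return {gene_id: count}, skipping STAR's N_* summary rows."""
    column = STAR_COLUMN[strandedness]
    counts = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0].startswith("N_"):
            continue  # N_unmapped, N_ambiguous, ... are not genes
        counts[fields[0]] = int(fields[column])
    return counts


# Rows taken from the assertions of the previous (Drosophila) test file.
rows = ["N_ambiguous\t20961\t5272\t4705", "FBgn0010247\t14\t6\t8"]
print(select_counts(rows, "unstranded"))
print(select_counts(rows, "stranded - reverse"))
```

Picking the wrong column for a stranded library typically shows up as very low counts for most genes, which is why the parameter has to match the library preparation.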

## Processing

@@ -41,6 +43,12 @@ chrM chrM_gene exon 0 16299 . - . gene_id "chrM_gene_minus"; transcript_id "chrM

## Contribution

### Version 0.1

@lldelisle wrote the workflow and the tests.

@nagoue updated the tools, made it work in usegalaxy.org, and fixed some best-practice issues.

### Version 1.0

@pavanvidem added the new features (featureCounts + additional QC) and found a smaller test dataset.
123 changes: 49 additions & 74 deletions workflows/transcriptomics/rnaseq-sr/rnaseq-sr-tests.yml
@@ -1,101 +1,76 @@
- doc: Test outline for RNAseq_SR
job:
gtf:
GTF file of annotation:
class: File
location: https://zenodo.org/record/4541751/files/Drosophila_melanogaster.BDGP6.87.gtf
location: https://zenodo.org/records/13987631/files/Saccharomyces_cerevisiae.R64-1-1.113.gtf
filetype: gtf
SR fastq input:
Collection of FASTQ files:
class: Collection
collection_type: list
elements:
- class: File
identifier: GSM461177
location: https://zenodo.org/record/4541751/files/GSM461177_1_subsampled.fastqsanger
forward_adapter: GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
reference_genome: dm6
strandedness: unstranded
cufflinks_FPKM: true
stringtie_FPKM: true
identifier: SRR5085167
location: https://zenodo.org/records/13987631/files/SRR5085167_forward.fastqsanger.gz
Forward adapter: AGATCGGAAGAG
Generate additional QC reports: true
Reference genome: sacCer3
Strandedness: stranded - forward
Use featureCounts for generating count tables: true
Compute Cufflinks FPKM: true
GTF with regions to exclude from FPKM normalization with Cufflinks: null
Compute StringTie FPKM: true
outputs:
output_log:
element_tests:
GSM461177:
asserts:
- that: "has_text"
text: "Number of input reads |\t1051466"
- that: "has_text"
text: "Uniquely mapped reads number |\t871202"
- that: "has_text"
text: "Number of reads mapped to multiple loci |\t91808"
mapped-reads:
element_tests:
GSM461177:
asserts:
has_size:
value: 43037033
delta: 4000000
'MultiQC on input dataset(s): Stats':
asserts:
has_line:
line: "Sample STAR_mqc_generalstats_star_total_reads_1 STAR_mqc_generalstats_star_mapped_1 STAR_mqc_generalstats_star_mapped_percent_1 STAR_mqc_generalstats_star_uniquely_mapped_1 STAR_mqc_generalstats_star_uniquely_mapped_percent_1 STAR_mqc_generalstats_star_multimapped_1 Cutadapt_mqc_generalstats_cutadapt_percent_trimmed"
has_text_matching:
expression: "GSM461177\t1.0[0-9]*\t0.96[0-9]*\t91.[0-9]*\t0.8[0-9]*\t82.8[0-9]*\t0.091[0-9]*\t4.0[0-9]*"
MultiQC webpage:
MultiQC stats:
asserts:
- that: "has_text"
text: "GSM461177"
- that: "has_text"
text: "<a href=\"#cutadapt_filtered_reads\" class=\"nav-l2\">Filtered Reads</a>"
- that: "has_text"
text: "<a href=\"#star\" class=\"nav-l1\">STAR</a>"
reads_per_gene from STAR:
has_text_matching:
expression: "SRR5085167\t0.11[0-9]*\t18.3[0-9]*\t69.6[0-9]*\t0.3[0-9]*\t0.3[0-9]*\t94.62\t0.12[0-9]*\t34.43\t0.2[0-9]*\t28.[0-9]*\t90.[0-9]*\t16.[0-9]*\t0.36[0-9]*\t43.[0-9]*\t91.[0-9]*\t70.[0-9]*\t36.[0-9]*\t46.0\t75.0\t75\t27.27[0-9]*\t0.39[0-9]*"
Counts Table:
element_tests:
GSM461177:
SRR5085167:
asserts:
- that: "has_text"
text: "N_ambiguous\t20961\t5272\t4705"
- that: "has_text"
text: "FBgn0010247\t14\t6\t8"
HTS count like output:
has_line:
line: "YAL038W 1775"
Mapped Reads:
element_tests:
GSM461177:
SRR5085167:
asserts:
has_text:
text: "FBgn0010247\t14"
transcripts_expression_cufflinks:
has_size:
value: 31570787
delta: 3000000
Gene Abundance Estimates from StringTie:
element_tests:
GSM461177:
SRR5085167:
asserts:
has_text:
text: "FBtr0078104\t-\t-\tFBgn0031217\tCG11377\t-\tchr2L:102379-104142\t1583\t0.626702\t18.291\t9.78605\t26.796\tOK"
genes_expression_cufflinks:
has_text_matching:
expression: "YAL038W\tCDC19\tchrI\t\\+\t71786\t73288\t57.[0-9]*\t3575.[0-9]*\t3084.[0-9]*"
Genes Expression from Cufflinks:
element_tests:
GSM461177:
SRR5085167:
asserts:
has_text_matching:
expression: "FBgn0031217\t-\t-\tFBgn0031217\tCG11377\t-\tchr2L:102379-104142\t-\t-\t32.1016\t22.1771\t42.02[0-9]*\tOK"
genes_expression_stringtie:
has_line:
line: "YAL038W - - YAL038W CDC19 - chrI:71785-73288 - - 3375.85 3161.36 3590.33 OK"
Transcripts Expression from Cufflinks:
element_tests:
GSM461177:
SRR5085167:
asserts:
has_text:
text: "FBgn0031220\tCG4822\tchr2L\t-\t116970\t121754\t1.067492\t36.649773\t74.784904"
both strands coverage:
has_line:
line: "YAL038W_mRNA - - YAL038W CDC19 - chrI:71785-73288 1503 57.5601 3375.85 3161.36 3590.33 OK"
Stranded Coverage:
element_tests:
GSM461177:
SRR5085167_forward:
asserts:
has_size:
value: 6075761
delta: 600000
stranded coverage:
element_tests:
GSM461177_reverse:
value: 555489
delta: 50000
SRR5085167_reverse:
asserts:
has_size:
value: 3103918
delta: 300000
GSM461177_forward:
value: 526952
delta: 50000
Unstranded Coverage:
element_tests:
SRR5085167:
asserts:
has_size:
value: 3103918
delta: 300000
value: 978542
delta: 90000
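
The tests above rely mostly on two kinds of assertion: `has_size` with a `value` and an absolute `delta`, and `has_text_matching` with a regular `expression`. The Python sketch below illustrates the logic of both checks under simplifying assumptions. The function bodies are hypothetical stand-ins, not Galaxy's implementation, which supports further options not modelled here. The sample output line is invented for illustration, and the file size passed to `has_size` is an arbitrary example.

```python
import re


def has_size(size, value, delta):
    """True when size is within an absolute tolerance of value."""
    return abs(size - value) <= delta


def has_text_matching(text, expression):
    """True when the regular expression matches anywhere in text."""
    return re.search(expression, text) is not None


# value and delta copied from the has_size assertion on the mapped reads;
# the size 30_000_000 is an arbitrary example.
print(has_size(30_000_000, value=31570787, delta=3000000))

# Expression copied from the test file. Python and double-quoted YAML
# both turn "\t" into a tab and "\\+" into an escaped plus sign.
expression = (
    "YAL038W\tCDC19\tchrI\t\\+\t71786\t73288"
    "\t57.[0-9]*\t3575.[0-9]*\t3084.[0-9]*"
)
# Hypothetical output line, invented for illustration only.
line = "YAL038W\tCDC19\tchrI\t+\t71786\t73288\t57.5\t3575.1\t3084.2"
print(has_text_matching(line, expression))
```

Note that the `.` in `57.[0-9]*` is unescaped, so it matches any character rather than only a decimal point. That is slightly looser than intended, but harmless for a tolerance check on floating-point output.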