documentation update to v2.1.2

dieterich-lab · Jan 11, 2022 · 454058c · 454058c
1 parent 36a3279
commit 454058c
Showing 5 changed files with 87 additions and 25 deletions.
diff --git a/docs/index.md b/docs/index.md
@@ -6,6 +6,7 @@
 Baltica is a framework that facilitates the execution and enables the integration of results from multiple differential junction usage (DJU) methods. The core of the framework consists of Snakemake workflows [@M_lder_2021], a Python command-line interface, and R/Bioconductor scripts for analysis [@r_core][@Lawrence_2013][@Lawrence_2009][@Wickham_2019]. The workflows include methods for RNA-Seq quality control [@wang2012][@andrews2012][@ewels_2016] and four DJU methods: RMATs [@Shen_2014], JunctionSeq [@Hartley2016], Majiq [@VaqueroGarcia2016], and Leafcutter [@Li2017]. We use Stringtie2 [@Kovaka_2019] _de novo_ transcriptome assembly to re-annotate the results. Baltica's main goal is to provide an integrative view of the results of these methods. To do so, Baltica produces an RMarkdown report with the integrated results and links to UCSC GenomeBrowser for further exploration.
 
 ## Features
+
     - Snakemake workflows for DJU: junctionseq, majiq, rmats, and leafcutter
     - Snakemake workflow for de novo transcriptome annotation with stringtie
     - Process, integrate and annotate the results from the methods
@@ -17,16 +18,18 @@ Baltica is a framework that facilitates the execution and enables the integratio
 
 ## Citation
 
-Thiago Britto-Borges, Volker Boehm, Niels H. Gehring and Christoph Dieterich (2020) __Baltica: integrated splice junction usage analysis__. 
-Manuscript in preparation.
+Britto-Borges T, Boehm V, Gehring NH, Dieterich C. Baltica: integrated splice junction usage analysis. bioRxiv. 2021 Jan 1. doi: https://doi.org/10.1101/2021.12.23.473966 
 
 Baltica is based on the work of many scientists and developers. Thus, if you use the results of their tools in your analysis, consider citing their work.
 
 ## License
+
 Baltica is free, open-source software released under an [MIT License](https://github.com/dieterich-lab/Baltica/blob/master/LICENSE).
 
 ## Contact
+
 Please get in touch with us via [the GitHub issue tracker](https://github.com/dieterich-lab/Baltica/issues).
 
 ## References
-\bibliography
+
+\bibliography
diff --git a/docs/integration.md b/docs/integration.md
@@ -21,22 +21,21 @@ The first step in the analysis workflow is parsing and processing the DJU method
 
 ## Result integration
 
-One challenge for the integration of DJU results is that the methods use different genomic coordinate systems. 
-The coordinates system's differences are due to the method implementation: methods can be 0-indexed (BED format) versus 1-indexed (GTF format) or use the exonic versus intronic coordinates to represent the SJ genomic position.   
+One challenge for the integration of DJU results is that the methods use different genomic coordinate systems.
+The differences stem from how each method is implemented: methods can be 0-indexed (BED format) or 1-indexed (GTF format), and can use either exonic or intronic coordinates to represent the SJ genomic position.
 We propose a `filter_hits_by_diff` function to find overlapping features and then discard any overlaps with more than two bp differences, which accounts for the multiple genomic coordinate systems.
-The multiple hits form a graph, which is then partitioned into the clusters, and each cluster represents an intron. 
+The multiple hits form a graph, which is then partitioned into clusters, where each cluster represents an intron.
 This feature enables the reconciliation of the multiple DJU results.
 
-
 ## Annotating the results
 
-We annotate the results with information from genes and transcripts hosting the SJ. 
-For this, we use the _de novo_ transcript annotation at `stringtie/merged/merged.combined.gtf`. 
+We annotate the results with information from genes and transcripts hosting the SJ.
+For this, we use the _de novo_ transcript annotation at `stringtie/merged/merged.combined.gtf`.
 Commonly, multiple transcripts share an intron so that a single intron may be annotated with multiple transcripts.
 
 These are the columns assigned after the annotation:
 
-__Table 1: Annotation description__  
+### Table 1: Annotation description
 
 Column name | Description |
 ------------|-------------|
@@ -57,35 +56,40 @@ class_code | association between reference transcript and novel transcript ([seq
     The section below was obtained in a previous Baltica release, using stringtie v1.2.X, but we don't expect major changes in the current version.
 
 We found that the parameters used to obtain the _de novo_ transcriptome are critical for maximum integration between the GTF and the SJ from DJU methods.
-__Fig 1__ shows a parameter scan where we vary the group, `-j` (minimum junction coverage), `-c` (minimum coverage), and `-f` (minimum isoform proportion) and compute the number of transcripts that match with SJ called significantly. 
-As expected, the merged annotation and not the group-specific annotation have the highest rate of annotated introns. 
-The crucial result here is the dependency of the `-f` parameter, which is also associated with an increased number of annotated introns. 
-As we confirmed this behavior in other datasets, we decided to use `-c 3 -j 3 -f 0.01` as default values in Baltica. 
+__Fig 1__ shows a parameter scan where we vary the group, `-j` (minimum junction coverage), `-c` (minimum coverage), and `-f` (minimum isoform proportion) and compute the number of transcripts that match with SJ called significantly.
+As expected, the merged annotation, rather than the group-specific annotation, has the highest rate of annotated introns.
+The crucial result here is the dependency on the `-f` parameter: lower values are associated with an increased number of annotated introns.
+As we confirmed this behavior in other datasets, we decided to use `-c 3 -j 3 -f 0.01` as default values in Baltica.
 The higher coverage (`-c` and `-j`) values counter the potential noise of transcripts with low abundance.
 
 ![](img/stringtie_parameter_scan_heatmap.png)  
 __Fig 1__: Parameter scan to maximize the number of introns annotated.
-We have run Stringtie with multipleparameters of merged annotation or group annotation; junction coverage of 1, 2, or 3; coverage of 1, 2, 3, and minimum isoform fraction of 0.1, 0.01, or 0.001. 
+We ran Stringtie with multiple parameter combinations: merged or group annotation; junction coverage of 1, 2, or 3; coverage of 1, 2, or 3; and minimum isoform fraction of 0.1, 0.01, or 0.001.
 The result shows a dependency on the minimum isoform fraction parameter, which needs to be minimized to increase the proportion of annotated SJ, as expected.
 
 ## Assigning AS type
 
 ### Biological motivation
+
 Identifying the type of AS is critical to understanding a potential molecular mechanism for AS events. [SRSF2](https://www.uniprot.org/uniprot/Q01130) is a relevant example in this context. SRSF2 is a splicing factor from the SR family, whose members are known for auto-regulation. In certain conditions, the SRSF2 transcript can trigger nonsense-mediated decay by including either a new exon containing a premature stop codon or an intron in the 3' UTR. These changes lead to transcript degradation and an overall reduction of gene expression. In turn, the reduction of the SRSF2 protein level leads to widespread exon skipping. Identifying such patterns is critical to understanding which splicing regulators are driving the observed splicing changes, and it enables further analysis of AS events of a specific type.
 
 ### Implementation
+
 In Baltica, we use a geometric approach to define AS in three classes:
+
 - ES, for exon skipping
 - A3SS, for alternative 3' splice-site
 - A5SS, for alternative 5' splice-site
 
 Figure 2 details how we use the distances between the start and end positions of features to determine the AS type.
 
 ![](img/Baltica_as_type.png){ : .center width=70% }  
-__Fig 2__: AS type assignment in Baltica. Baltica uses the genomic coordinates from the SJ and its overlapping exons to assigning AS type to SJ and its overlapping exons. Because many exons may be affected, multiple assignments are output. For example, donor and acceptor exons are assigned as JS and JE, respectively. 
+__Fig 2__: AS type assignment in Baltica. Baltica uses the genomic coordinates of the SJ and its overlapping exons to assign an AS type to the SJ and those exons. Because many exons may be affected, multiple assignments are output. For example, donor and acceptor exons are assigned as JS and JE, respectively.
 
 ## Simplify the AS event
-Because most of the final users are only interested in the list of genomic ranges, gene names, or event types, we offer a simplified output that removes redundant information. This step helps generate a final report. 
+
+Because most end users are only interested in the list of genomic ranges, gene names, or event types, we offer a simplified output that removes redundant information. This step helps generate the final report.
 
 ## References
+
 \bibliography
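
The reconciliation step described in the `docs/integration.md` changes above (keep overlapping hits that differ by at most two bp, then partition the resulting graph into clusters, one per intron) can be sketched in Python. This is a minimal illustration, not Baltica's own code: the names `within_tolerance` and `cluster_junctions`, the dictionary keys, and the choice to apply the two bp tolerance to each end separately are all assumptions.

```python
from itertools import combinations


def within_tolerance(a, b, max_diff=2):
    """Return True when two junctions are on the same chromosome and strand
    and both their start and end differ by at most max_diff bp.
    Applying the tolerance per end is an assumption of this sketch."""
    return (
        a["chrom"] == b["chrom"]
        and a["strand"] == b["strand"]
        and abs(a["start"] - b["start"]) <= max_diff
        and abs(a["end"] - b["end"]) <= max_diff
    )


def cluster_junctions(junctions, max_diff=2):
    """Group junctions into clusters (connected components of the hit graph).

    Each returned cluster is a sorted list of indices into `junctions`
    and stands for one intron reported by one or more methods."""
    parent = list(range(len(junctions)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison is quadratic; fine for a sketch, not for large inputs.
    for i, j in combinations(range(len(junctions)), 2):
        if within_tolerance(junctions[i], junctions[j], max_diff):
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(junctions)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Calling `cluster_junctions` on a list of dictionaries with `chrom`, `start`, `end`, and `strand` keys returns index groups, so calls from different methods that describe the same intron in slightly shifted coordinates end up in the same group.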
diff --git a/docs/release-notes.md b/docs/release-notes.md
@@ -1,24 +1,33 @@
 ## Change log
 
-### v1.1 <small> July 23, 2021 (released in September 7 2021) </small>
+### v1.1.2 <small> Unreleased </small>
+
+* Add support for unstranded RNA-seq data
+* Add scripts for benchmark
+* Add a new configuration `bind_paths` that allows integrating bam files from different projects
+
+### v1.1.1 <small> July 23, 2021 (released on September 7, 2021) </small>
+
 * Add rmats workflow
 * Add scripts for parsing rmats output and update the analysis to support the method
 * Create the benchmark with the ONT Nanopore-seq
 * Update benchmarks, including difference comparison for SIRV benchmark
-* Splite annotation and AS type assigment functions
-* Update baltica table algorithm 
+* Split annotation and AS type assignment functions
+* Update baltica table algorithm
 * Add support for singularity container via snakemake, with container recipes `baltica qc config.yaml --use-singularity`
 * Add parsing method for gffcompare tracking output
 * Update configuration file to expose important parameters from the DJU methods
 * Add end-to-end analysis with `baltica all config`  
-* Experiment with meta-score (gradient boosted trees) 
-* Add baltica report and improved on report summaries 
+* Experiment with meta-score (gradient boosted trees)
+* Add baltica report and improve report summaries
 * Add orthogonal dataset use-case, to integrate third generation sequencing to the baltica table
-* Change strand parameter to "fr-firststrand": "reverse", "fr-secondstrand": "forward" or unstranded, fix error in rmats strand 
+* Change strand parameter to "fr-firststrand": "reverse", "fr-secondstrand": "forward" or unstranded, fix error in rmats strand
 
 ### v1.0 <small> September 17, 2020</small>
+
 * Add `is_novel` column, indicating introns not in the reference annotation
-* Remove unitended columns (X1, ...) from merge
+* Remove unintended columns (X1, ...) from the report
 
 ### v1.0 <small>- July 23, 2020</small>
+
 * First public release, comprising the DJU methods Leafcutter, Junctionseq, and Majiq; Stringtie for *de novo* transcriptome assembly; and FastQC and MultiQC (#1).
diff --git a/docs/report.md b/docs/report.md
@@ -1,2 +1,47 @@
-# Report 
+# Baltica output
 
+The Baltica framework produces two files as output:  
+    - an R Markdown report
+    - an Excel spreadsheet  
+
+!!! note
+    If available, the orthogonal dataset is treated as a new method named `orthogonal`.
+
+## Baltica table spreadsheet
+
+- `results/baltica_table_{proj_name}.xlsx`
+
+The spreadsheet contains the complete set of coordinates output by the methods and comparisons. In addition, there is one column for each combination of method and comparison, plus the annotation columns:
+
+- coordinates: junction genomic coordinate in the format: `{chr}:{start}-{end}:{strand}` (strand omitted if none)
+- score columns: in the format: `{method}_{comparisons}`
+- is_novel: whether the splice junction is novel, i.e. not in the reference annotation
+- gene_name: the gene name obtained from the de novo annotation workflow
+- transcript_name: transcript name from the de novo annotation
+- class_code: transcript class association to the reference annotation transcript, please see [Fig 1 in the GFF Utilities paper](https://f1000research.com/articles/9-304/v2) for details
+- exon_number: pairs of exon numbers from the de novo annotation. The first of the pair is the donor exon if the feature is on the positive strand; otherwise, it is the acceptor exon
+- as_type: type of AS for each junction: exon skipping (ES), alternative 3' splice site (A3SS), or alternative 5' splice site (A5SS)
+
+Currently, the HTML report comprises two sections:
+
+## Common splice junctions
+
+The [upset plot](https://jokergoo.github.io/ComplexHeatmap-reference/book/upset-plot.html#upset-plot) shows the combination of distinct sets of calls (score > 0.95) from each method and contrast. The plot helps to compare the common calls among sets. The complement sets are ignored, as they are usually very large.
+
+## Baltica table
+
+This interactive HTML table provides the top 1,000 rows (or the number set by `baltica_max_table` in the configuration file), sorted by the sum of the scores. Extra annotation is available upon clicking on ▶. In addition, the coordinates column links to the UCSC genome browser. A regional UCSC genome browser URL can be selected with the `ucsc_url` option, and the assembly with the `assembly` option.
+
+## Baltica report configuration
+
+Set the following options in your project configuration to customize the report:  
+
+- project_authors: names of the people running the project
+- project_title: name of the files and report title
+- baltica_max_table: maximum number of rows on the HTML table
+- assembly: assembly used for linking with the genome browser
+- ucsc_url: URL for the genome browser, like `http://genome-euro.ucsc.edu` for the European mirror
+
+## Reproducibility
+
+This section provides the information necessary to reproduce the report, including the project configuration and R package versions.  
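
Two conventions from the `docs/report.md` additions above, the `{chr}:{start}-{end}:{strand}` coordinate string and the selection of the top rows by the sum of the scores, can be sketched in Python. This is an illustration under stated assumptions, not the code that builds the Baltica table: the function names are hypothetical, the sort is assumed to be descending, and missing scores are assumed to count as zero.

```python
def format_coordinates(chrom, start, end, strand=None):
    """Build a coordinate string in the `{chr}:{start}-{end}:{strand}` format.
    The strand is omitted when it is not '+' or '-'."""
    base = f"{chrom}:{start}-{end}"
    return f"{base}:{strand}" if strand in ("+", "-") else base


def top_rows(rows, score_columns, max_rows=1000):
    """Return at most max_rows rows, ordered by the sum of their scores.

    `rows` is a list of dictionaries and `score_columns` lists the
    `{method}_{comparisons}` keys to add up. Treating a missing or empty
    score as 0 and sorting in descending order are assumptions of this sketch."""
    def total(row):
        return sum(row.get(col) or 0.0 for col in score_columns)

    return sorted(rows, key=total, reverse=True)[:max_rows]
```

Here `max_rows` plays the role of the `baltica_max_table` option: a call such as `top_rows(rows, ["majiq_a-vs-b", "rmats_a-vs-b"], max_rows=1000)` would keep the 1,000 junctions with the highest combined score, where the column names are made-up examples of the `{method}_{comparisons}` pattern.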
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -34,6 +34,7 @@ nav:
   - Workflow implementation: workflows.md 
   - DJU methods result integration: integration.md 
   - Development guidelines: dev_guide.md
+  - Baltica output: report.md
   - Frequently asked questions: faq.md
   - Tutorial: tutorial.md
   - References: bibliography.md