From c2fa5dfc031f33a74dc1a5c897b01e6d492ed433 Mon Sep 17 00:00:00 2001
From: Bo Li
Date: Mon, 19 Oct 2015 16:41:46 -0700
Subject: [PATCH] Removed all *.html

---
 README.html                       | 491 ------------------------
 convert-sam-for-rsem.html         |  88 -----
 rsem-calculate-expression.html    | 619 ------------------------------
 rsem-control-fdr.html             |  94 -----
 rsem-generate-ngvector.html       | 111 ------
 rsem-plot-transcript-wiggles.html | 131 -------
 rsem-prepare-reference.html       | 253 ------------
 rsem-run-ebseq.html               | 125 ------
 updates.html                      |  62 ---
 9 files changed, 1974 deletions(-)
 delete mode 100644 README.html
 delete mode 100644 convert-sam-for-rsem.html
 delete mode 100644 rsem-calculate-expression.html
 delete mode 100644 rsem-control-fdr.html
 delete mode 100644 rsem-generate-ngvector.html
 delete mode 100644 rsem-plot-transcript-wiggles.html
 delete mode 100644 rsem-prepare-reference.html
 delete mode 100644 rsem-run-ebseq.html
 delete mode 100644 updates.html

diff --git a/README.html b/README.html
deleted file mode 100644
index 4a53c60..0000000
--- a/README.html
+++ /dev/null
@@ -1,491 +0,0 @@
-

README for RSEM

- -

Bo Li (bli at cs dot wisc dot edu)

- -
- -

Table of Contents

- - - -
- -

Introduction

- -

RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data. The RSEM package provides a user-friendly interface and supports threads for parallel computation of the EM algorithm, single-end and paired-end read data, quality scores, variable-length reads and RSPD estimation. In addition, it provides posterior mean and 95% credibility interval estimates for expression levels. For visualization, it can generate BAM and Wiggle files in both transcript and genomic coordinates. Genomic-coordinate files can be visualized by both the UCSC Genome Browser and the Broad Institute’s Integrative Genomics Viewer (IGV). Transcript-coordinate files can be visualized by IGV. RSEM also has its own scripts to generate transcript read depth plots in PDF format. A unique feature of RSEM is that the read depth plots can be stacked, with read depth contributed by unique reads shown in black and read depth contributed by multi-reads shown in red. In addition, models learned from data can also be visualized. Last but not least, RSEM contains a simulator.

- -

Compilation & Installation

- -

To compile RSEM, simply run

- -
make
-
- -

Cygwin users: please uncomment the 3rd and 7th lines in ‘sam/Makefile’ before you run ‘make’.

- -

To compile EBSeq, which is included in the RSEM package, run

- -
make ebseq
-
- -

To install, simply add the rsem directory to your environment’s PATH variable.

- -

If you prefer to put all RSEM executables in a bin directory, please also remember to put ‘rsem_perl_utils.pm’ and ‘WHAT_IS_NEW’ in the same bin directory. ‘rsem_perl_utils.pm’ is required by most of RSEM’s Perl scripts and ‘WHAT_IS_NEW’ contains the RSEM version information.

- -

Prerequisites

- -

A C++ compiler, Perl and R must be installed.

- -

To take advantage of RSEM’s built-in support for the Bowtie/Bowtie 2 alignment program, you must have Bowtie and/or Bowtie 2 installed.

- -

Usage

- -

I. Preparing Reference Sequences

- -

RSEM can extract reference transcripts from a genome if you provide it with gene annotations in a GTF file. Alternatively, you can provide RSEM with transcript sequences directly.

- -

Please note that GTF files generated from the UCSC Table Browser do not contain isoform-gene relationship information. However, if you use the UCSC Genes annotation track, this information can be recovered by downloading the knownIsoforms.txt file for the appropriate genome.

- -

To prepare the reference sequences, you should run the ‘rsem-prepare-reference’ program. Run

- -
rsem-prepare-reference --help
-
- -

to get usage information or visit the rsem-prepare-reference documentation page.

- -

II. Calculating Expression Values

- -

To calculate expression values, you should run the ‘rsem-calculate-expression’ program. Run

- -
rsem-calculate-expression --help
-
- -

to get usage information or visit the rsem-calculate-expression documentation page.

- -

Calculating expression values from single-end data

- -

For single-end models, users have the option of providing a fragment length distribution via the ‘--fragment-length-mean’ and ‘--fragment-length-sd’ options. Specifying an accurate fragment length distribution is important for the accuracy of expression level estimates from single-end data. If the fragment length mean and sd are not provided, RSEM will not take a fragment length distribution into consideration.

- -

Using an alternative aligner

- -

By default, RSEM automates the alignment of reads to reference transcripts using the Bowtie aligner. Turning on ‘--bowtie2’ for both ‘rsem-prepare-reference’ and ‘rsem-calculate-expression’ allows RSEM to use the Bowtie 2 alignment program instead. Please note that indel alignments, local alignments and discordant alignments are disallowed when RSEM uses Bowtie 2, since RSEM currently cannot handle them. See the description of the ‘--bowtie2’ option in ‘rsem-calculate-expression’ for more details. Similarly, turning on ‘--star’ allows RSEM to use the STAR aligner. To use an alternative alignment program, align the input reads against the file ‘reference_name.idx.fa’ generated by ‘rsem-prepare-reference’, and format the alignment output in SAM or BAM format. Then, instead of providing reads to ‘rsem-calculate-expression’, specify the ‘--sam’ or ‘--bam’ option and provide the SAM or BAM file as an argument.

- -

RSEM requires the alignments of a read to be adjacent. For paired-end reads, RSEM also requires the two mates of any alignment to be adjacent. To check whether your SAM/BAM file satisfies these requirements, please run

- -
rsem-sam-validator <input.sam/input.bam>
-
- -

If your file does not satisfy the requirements, you can use ‘convert-sam-for-rsem’ to convert it into a BAM file which RSEM can process. Please run

- -
convert-sam-for-rsem --help
-
- -

to get usage information or visit the convert-sam-for-rsem documentation page.

- -

However, please note that RSEM does not support gapped alignments, so make sure that your aligner does not produce alignments with insertions/deletions. Also, please make sure that you use ‘reference_name.idx.fa’, which is generated by RSEM, to build your aligner’s indices.

- -

III. Visualization

- -

RSEM contains a version of samtools in the ‘sam’ subdirectory. RSEM will always produce three files: ‘sample_name.transcript.bam’, the unsorted BAM file, and ‘sample_name.transcript.sorted.bam’ and ‘sample_name.transcript.sorted.bam.bai’, the sorted BAM file and its index, generated by the included samtools. All three files are in transcript coordinates. When users specify the ‘--output-genome-bam’ option, RSEM will produce three more files: ‘sample_name.genome.bam’, the unsorted BAM file, and ‘sample_name.genome.sorted.bam’ and ‘sample_name.genome.sorted.bam.bai’, the sorted BAM file and its index, generated by the included samtools. All these files are in genomic coordinates.

- -

a) Converting transcript BAM file into genome BAM file

- -

Normally, RSEM will do this for you via the ‘--output-genome-bam’ option of ‘rsem-calculate-expression’. However, if you have run ‘rsem-prepare-reference’ and used ‘reference_name.idx.fa’ to build indices for your aligner, you can use ‘rsem-tbam2gbam’ to convert your transcript-coordinate BAM alignments file into a genomic-coordinate BAM alignments file without running the whole RSEM pipeline.

- -

Usage:

- -
rsem-tbam2gbam reference_name unsorted_transcript_bam_input genome_bam_output
-
- -

reference_name : The name of reference built by ‘rsem-prepare-reference’
-unsorted_transcript_bam_input : This file should satisfy: 1) the alignments of the same read are grouped together, 2) for any paired-end alignment, the two mates are adjacent to each other, 3) the file has not been sorted by samtools
-genome_bam_output : The output genomic coordinate BAM file’s name
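As an illustration of requirement 1 above, the following minimal sketch checks whether a sequence of read names has each read's alignments in one contiguous run. The function name `alignments_grouped_by_read` is hypothetical and not part of RSEM, and the input is assumed to be the read-name (QNAME) column taken from the alignment file in order. It does not check mate adjacency and is not a substitute for ‘rsem-sam-validator’.

```python
def alignments_grouped_by_read(read_names):
    """Return True if every read's alignments form one contiguous run.

    read_names: iterable of read names in file order (hypothetical input,
    e.g. the QNAME column of a SAM file).
    """
    seen = set()       # read names whose run has already started
    previous = None    # read name of the previous alignment
    for name in read_names:
        if name != previous:
            if name in seen:
                # This read appeared earlier and was interrupted by another read.
                return False
            seen.add(name)
            previous = name
    return True
```

A coordinate-sorted file typically fails this check, because alignments of the same read to different transcripts end up far apart.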

- -

b) Generating a Wiggle file

- -

A wiggle plot representing the expected number of reads overlapping each position in the genome/transcript set can be generated from the sorted genome/transcript BAM file output. To generate the wiggle plot, run the ‘rsem-bam2wig’ program on the ‘sample_name.genome.sorted.bam’/‘sample_name.transcript.sorted.bam’ file.

- -

Usage:

- -
rsem-bam2wig sorted_bam_input wig_output wiggle_name [--no-fractional-weight]
-
- -

sorted_bam_input : Input BAM format file, must be sorted
-wig_output : Output wiggle file’s name, e.g. output.wig
-wiggle_name : The name of this wiggle plot
---no-fractional-weight : If this is set, RSEM will not look for the “ZW” tag and each alignment appearing in the BAM file has weight 1. Set this if your BAM file was not generated by RSEM. Please note that this option must be at the end of the command line
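To make the effect of fractional weights concrete, here is a minimal sketch of how a per-position depth could be accumulated with and without them. This is not the ‘rsem-bam2wig’ implementation: the function `read_depth` and the `(start, end, weight)` tuples (0-based, end exclusive) are assumptions made for illustration, with `weight` standing in for the value of the “ZW” tag.

```python
def read_depth(alignments, length, fractional=True):
    """Accumulate read depth over a reference of the given length.

    alignments: iterable of (start, end, weight), 0-based, end exclusive.
    fractional: if False, every alignment counts with weight 1, mirroring
                the behaviour described for --no-fractional-weight.
    """
    depth = [0.0] * length
    for start, end, weight in alignments:
        w = weight if fractional else 1.0
        # Clip the alignment to the reference before adding its weight.
        for pos in range(max(start, 0), min(end, length)):
            depth[pos] += w
    return depth
```

With fractional weights, a multi-read split 0.25/0.75 between two locations contributes a total of one read to the plot; with weight 1 per alignment it is counted twice.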

- -

c) Loading a BAM and/or Wiggle file into the UCSC Genome Browser or Integrative Genomics Viewer(IGV)

- -

For the UCSC Genome Browser, please refer to the UCSC custom track help page.

- -

For the Integrative Genomics Viewer, please refer to the IGV home page. Note: Although IGV can generate a read depth plot from a given BAM file, it does not recognize the “ZW” tag that RSEM adds. IGV therefore counts each alignment with weight 1 instead of its expected weight in the plot it generates. We recommend using the wiggle file generated by RSEM for read depth visualization.

- -

Here is some guidance for visualizing transcript coordinate files using IGV:

- -

1) Import the transcript sequences as a genome

- -

Select File -> Import Genome, then fill in ID, Name and Fasta file. The Fasta file should be ‘reference_name.idx.fa’. After that, click the Save button. If ID is set to ‘reference_name’, a file called ‘reference_name.genome’ will be generated. Next time, you can use File -> Load Genome and select ‘reference_name.genome’.

- -

2) Load visualization files

- -

Select File -> Load from File, then choose one transcript coordinate visualization file generated by RSEM. IGV might require you to convert the wiggle file to a tdf file. You should use igvtools to perform this task. One way to perform the conversion is to use the following command:

- -
igvtools tile reference_name.transcript.wig reference_name.transcript.tdf reference_name.genome   
-
- -

d) Generating Transcript Wiggle Plots

- -

To generate transcript wiggle plots, you should run the ‘rsem-plot-transcript-wiggles’ program. Run

- -
rsem-plot-transcript-wiggles --help
-
- -

to get usage information or visit the rsem-plot-transcript-wiggles documentation page.

- -

e) Visualize the model learned by RSEM

- -

RSEM provides an R script, ‘rsem-plot-model’, for visualizing the model learned.

- -

Usage:

- -
rsem-plot-model sample_name output_plot_file
-
- -

sample_name: the name of the sample analyzed
-output_plot_file: the file name for plots generated from the model. It is a pdf file

- -

The plots generated depend on read type and user configuration. They may include the fragment length distribution, mate length distribution, read start position distribution (RSPD), quality score vs. observed quality given a reference base, position vs. percentage of sequencing error given a reference base, and a histogram of reads with different numbers of alignments.

- -

Fragment length distribution and mate length distribution: x-axis is fragment/mate length, y-axis is the probability of generating a fragment/mate with the associated length

- -

RSPD: Read Start Position Distribution. x-axis is bin number, y-axis is the probability of each bin. RSPD can be used as an indicator of 3’ bias

- -

Quality score vs. observed quality given a reference base: x-axis is the Phred quality score associated with the data, y-axis is the “observed quality”, the Phred quality score learned by RSEM from the data. Q = -10 log_10(P), where Q is the Phred quality score and P is the probability of a sequencing error for a particular base

- -

Position vs. percentage sequencing error given a reference base: x-axis is position and y-axis is percentage sequencing error

- -

Histogram of reads with different numbers of alignments: x-axis is the number of alignments a read has and y-axis is the number of such reads. The ‘inf’ entry on the x-axis gives the number of reads filtered out for having too many alignments
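The Phred relation used in the quality score plot, Q = -10 log_10(P), can be tried out with the short sketch below. The function names are chosen for illustration only and are not part of RSEM.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to a Phred score Q = -10 log10(P)."""
    return -10 * math.log10(p)
```

For example, a Phred score of 20 corresponds to an error probability of 0.01, and an error probability of 0.001 corresponds to a Phred score of 30. A point above the diagonal in the plot therefore means the observed error rate is lower than the reported quality suggests.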

- -

Example

- -

Suppose we download the mouse genome from the UCSC Genome Browser. We do not add poly(A) tails and use ‘/ref/mouse_0’ as the reference name. We have a FASTQ-formatted file, ‘mmliver.fq’, containing single-end reads from one sample, which we call ‘mmliver_single_quals’. We want to estimate expression values by using the single-end model with a fragment length distribution. We know that the fragment length distribution is approximated by a normal distribution with a mean of 150 and a standard deviation of 35. We wish to generate 95% credibility intervals in addition to maximum likelihood estimates. RSEM will be allowed 1 GB of memory for the credibility interval calculation. We will visualize the probabilistic read mappings generated by RSEM on the UCSC Genome Browser. We will generate transcript wiggle plots for a list of genes, given in ‘gene_ids.txt’, and write them to ‘output.pdf’. We will visualize the models learned in ‘mmliver_single_quals.models.pdf’.

- -

The commands for this scenario are as follows:

- -
rsem-prepare-reference --gtf mm9.gtf --mapping knownIsoforms.txt --bowtie --bowtie-path /sw/bowtie /data/mm9 /ref/mouse_0
-rsem-calculate-expression --bowtie-path /sw/bowtie --phred64-quals --fragment-length-mean 150.0 --fragment-length-sd 35.0 -p 8 --output-genome-bam --calc-ci --memory-allocate 1024 /data/mmliver.fq /ref/mouse_0 mmliver_single_quals
-rsem-bam2wig mmliver_single_quals.genome.sorted.bam mmliver_single_quals.sorted.wig mmliver_single_quals
-rsem-plot-transcript-wiggles --gene-list --show-unique mmliver_single_quals gene_ids.txt output.pdf 
-rsem-plot-model mmliver_single_quals mmliver_single_quals.models.pdf
-
- -

Simulation

- -

RSEM provides the ‘rsem-simulate-reads’ program to simulate RNA-Seq data based on parameters learned from real data sets. Run

- -
rsem-simulate-reads
-
- -

to get usage information or read the following subsections.

- -

Usage:

- -
rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results theta0 N output_name [-q]
-
- -

reference_name: The name of RSEM references, which should be already generated by ‘rsem-prepare-reference’

- -

estimated_model_file: This file describes how the RNA-Seq reads will be sequenced given the expression levels. It determines what kind of reads will be simulated (single-end or paired-end, with or without quality scores) and includes parameters for the fragment length distribution, read start position distribution, sequencing error models, etc. Normally, this file should be learned from real data using ‘rsem-calculate-expression’. The file can be found in the ‘sample_name.stat’ folder under the name ‘sample_name.model’. ‘model_file_description.txt’ describes the format and meaning of this file.

- -

estimated_isoform_results: This file contains expression levels for all isoforms recorded in the reference. It can be learned from real data using ‘rsem-calculate-expression’. The corresponding file is ‘sample_name.isoforms.results’. To simulate from a user-designed expression profile, start from a learned ‘sample_name.isoforms.results’ file and modify only the ‘TPM’ column. The simulator reads only the TPM column, but the file format must be kept the same. If the RSEM references were built with allele-specific transcripts, ‘sample_name.alleles.results’ should be used instead.

- -

theta0: This parameter determines the fraction of reads that are coming from background “noise” (instead of from a transcript). It can also be estimated using ‘rsem-calculate-expression’ from real data. Users can find it as the first value of the third line of the file ‘sample_name.stat/sample_name.theta’.

- -

N: The total number of reads to be simulated. If ‘rsem-calculate-expression’ is executed on a real data set, the total number of reads can be found as the 4th number of the first line of the file ‘sample_name.stat/sample_name.cnt’.

- -

output_name: Prefix for all output files.

- -

--seed seed: Set the seed for the random number generator used in simulation. The seed should be a 32-bit unsigned integer.

- -

-q: If set, the program stops outputting intermediate information.
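The locations given above for theta0 and N can be read programmatically. The sketch below assumes the values in ‘sample_name.theta’ and ‘sample_name.cnt’ are whitespace-separated, which the text above does not state explicitly; the function names are hypothetical.

```python
def read_theta0(theta_path):
    """Return theta0: the first value on the third line of sample_name.theta.

    Assumes whitespace-separated values (an assumption of this sketch).
    """
    with open(theta_path) as fh:
        lines = fh.read().splitlines()
    return float(lines[2].split()[0])


def read_total_reads(cnt_path):
    """Return N: the 4th number on the first line of sample_name.cnt.

    Assumes whitespace-separated values (an assumption of this sketch).
    """
    with open(cnt_path) as fh:
        first_line = fh.readline()
    return int(first_line.split()[3])
```

The two return values can then be passed as the `theta0` and `N` arguments of ‘rsem-simulate-reads’ when the goal is to mimic the original data set.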

- -

Outputs:

- -

output_name.sim.isoforms.results, output_name.sim.genes.results: Expression levels estimated by counting where each simulated read comes from.
-output_name.sim.alleles.results: Allele-specific expression levels estimated by counting where each simulated read comes from.

- -

output_name.fa if single-end without quality score;
-output_name.fq if single-end with quality score;
-output_name_1.fa & output_name_2.fa if paired-end without quality score;
-output_name_1.fq & output_name_2.fq if paired-end with quality score.

- -

Format of the header line: Each simulated read’s header line encodes where it comes from. The header line has the format:

- -
{>/@}_rid_dir_sid_pos[_insertL]
-
- -

{>/@}: Either ‘>’ or ‘@’ must appear. ‘>’ appears if FASTA files are generated and ‘@’ appears if FASTQ files are generated

- -

rid: Simulated read’s index, numbered from 0

- -

dir: The direction of the simulated read. 0 refers to forward strand (‘+’) and 1 refers to reverse strand (‘-’)

- -

sid: Represents which transcript this read is simulated from. It ranges between 0 and M, where M is the total number of transcripts. If sid=0, the read is simulated from the background noise. Otherwise, the read is simulated from the transcript with index sid. The name of transcript sid can be found in the ‘transcript_id’ column of the ‘sample_name.isoforms.results’ file (at line sid + 1; line 1 is for column names)

- -

pos: The start position of the simulated read in strand dir of transcript sid. It is numbered from 0

- -

insertL: Only appears for paired-end reads. It gives the insert length of the simulated read.
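The header fields described above can be decoded with a small parser. This is a sketch under assumptions: the names `SimReadOrigin` and `parse_sim_header` are hypothetical, and the parser accepts the fields with or without an underscore directly after the ‘>’/‘@’ marker, since the format line shows one but real output may differ. Any suffix that a simulator might append to the read name is not handled.

```python
from typing import NamedTuple, Optional


class SimReadOrigin(NamedTuple):
    rid: int                   # read index, numbered from 0
    strand: str                # '+' for dir 0, '-' for dir 1
    sid: int                   # 0 = background noise, otherwise transcript index
    pos: int                   # 0-based start position on strand dir of transcript sid
    insert_len: Optional[int]  # insert length, present only for paired-end reads


def parse_sim_header(header):
    """Parse a simulated read header of the form {>/@}_rid_dir_sid_pos[_insertL]."""
    header = header.strip()
    if not header or header[0] not in ">@":
        raise ValueError("header must start with '>' or '@'")
    # Tolerate an optional underscore right after the marker (assumption).
    fields = header[1:].lstrip("_").split("_")
    if len(fields) not in (4, 5):
        raise ValueError("expected 4 or 5 underscore-separated fields")
    rid, direction, sid, pos = (int(x) for x in fields[:4])
    if direction not in (0, 1):
        raise ValueError("dir must be 0 or 1")
    insert_len = int(fields[4]) if len(fields) == 5 else None
    return SimReadOrigin(rid, "+" if direction == 0 else "-", sid, pos, insert_len)
```

Counting the parsed `sid` values over all reads gives per-transcript read counts for the simulated data, with `sid == 0` collecting the background noise reads.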

- -

Example:

- -

Suppose we want to simulate 50 million single-end reads with quality scores and use the parameters learned in the Example section. In addition, we set theta0 to 0.2 and output_name to ‘simulated_reads’. The command is:

- -
rsem-simulate-reads /ref/mouse_0 mmliver_single_quals.stat/mmliver_single_quals.model mmliver_single_quals.isoforms.results 0.2 50000000 simulated_reads
-
- -

Generate Transcript-to-Gene-Map from Trinity Output

- -

For Trinity users, RSEM provides a Perl script to generate a transcript-to-gene-map file from the FASTA file produced by Trinity.

- -

Usage:

- -
extract-transcript-to-gene-map-from-trinity trinity_fasta_file map_file
-
- -

trinity_fasta_file: the fasta file produced by trinity, which contains all transcripts assembled.
-map_file: transcript-to-gene-map file’s name.

- -

Differential Expression Analysis

- -

Popular differential expression (DE) analysis tools such as edgeR and DESeq do not take variance due to read mapping uncertainty into consideration. Because read mapping ambiguity is prevalent among isoforms and de novo assembled transcripts, these tools are not ideal for DE detection in such conditions.

- -

EBSeq, an empirical Bayesian DE analysis tool developed at UW-Madison, can take variance due to read mapping ambiguity into consideration by grouping isoforms according to their parent gene’s number of isoforms. In addition, it is more robust to outliers. For more information about EBSeq (including the paper describing the method), please visit EBSeq’s website.

- -

RSEM includes EBSeq in its folder named ‘EBSeq’. To use it, first type

- -
make ebseq
-
- -

to compile the EBSeq-related code.

- -

EBSeq requires gene-isoform relationships for its isoform DE detection. However, for a de novo assembled transcriptome, it is hard to obtain an accurate gene-isoform relationship. Instead, RSEM provides a script, ‘rsem-generate-ngvector’, which clusters transcripts based on measures directly relating to read mapping ambiguity. First, it calculates the ‘unmappability’ of each transcript. The ‘unmappability’ of a transcript is the ratio between the number of its k-mers with at least one perfect match to other transcripts and the total number of k-mers of this transcript, where k is a parameter. Then, the Ng vector is generated by applying the k-means algorithm to the ‘unmappability’ values with the number of clusters set to 3. The program makes sure the mean ‘unmappability’ scores of the clusters are in ascending order. All transcripts whose lengths are less than k are assigned to cluster 3. Run

- -
rsem-generate-ngvector --help
-
- -

to get usage information or visit the rsem-generate-ngvector documentation page.
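The ‘unmappability’ ratio defined above can be sketched in a few lines. This is an illustration of the definition only, not the ‘rsem-generate-ngvector’ implementation: the function name is hypothetical, only forward-strand exact k-mer matches are considered (an assumption), and the k-means clustering step is left out. Transcripts shorter than k get `None`, since the text assigns them directly to cluster 3.

```python
from collections import defaultdict


def unmappability(transcripts, k):
    """Compute the fraction of each transcript's k-mers shared with other transcripts.

    transcripts: dict mapping transcript id -> sequence string.
    Returns a dict mapping transcript id -> ratio, or None if shorter than k.
    """
    # Record which transcripts contain each k-mer.
    owners = defaultdict(set)
    for tid, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(tid)

    scores = {}
    for tid, seq in transcripts.items():
        total = len(seq) - k + 1
        if total <= 0:
            scores[tid] = None  # shorter than k; no k-mers to score
            continue
        # A k-mer is shared if at least one other transcript also contains it.
        shared = sum(1 for i in range(total) if len(owners[seq[i:i + k]]) > 1)
        scores[tid] = shared / total
    return scores
```

A score near 1 means almost every k-mer of the transcript also occurs elsewhere, so reads from it are likely to map ambiguously; a score of 0 means all of its k-mers are unique to it.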

- -

If your reference is a de novo assembled transcript set, you should run ‘rsem-generate-ngvector’ first. Then load the resulting ‘output_name.ngvec’ into R. For example, you can use

- -
NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")
-
- -

After that, set “NgVector = NgVec” for your differential expression test (either ‘EBTest’ or ‘EBMultiTest’).

- -

For users’ convenience, RSEM also provides a script, ‘rsem-generate-data-matrix’, to extract an input matrix from expression results:

- -
rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.counts.matrix
-
- -

The results files are required to be either all gene-level results or all isoform-level results. You can load the matrix into R by

- -
IsoMat <- data.matrix(read.table(file="output_name.counts.matrix"))
-
- -

before running either ‘EBTest’ or ‘EBMultiTest’.

- -

Lastly, RSEM provides two scripts, ‘rsem-run-ebseq’ and ‘rsem-control-fdr’, to help users find differentially expressed genes/transcripts. First, ‘rsem-run-ebseq’ calls EBSeq to calculate related statistics for all genes/transcripts. Run

- -
rsem-run-ebseq --help
-
- -

to get usage information or visit the rsem-run-ebseq documentation page. Second, ‘rsem-control-fdr’ takes the result of ‘rsem-run-ebseq’ and reports called differentially expressed genes/transcripts by controlling the false discovery rate. Run

- -
rsem-control-fdr --help
-
- -

to get usage information or visit the rsem-control-fdr documentation page. These two scripts can perform DE analysis on either 2 conditions or multiple conditions.

- -

Please note that ‘rsem-run-ebseq’ and ‘rsem-control-fdr’ use EBSeq’s default parameters. For advanced use of EBSeq or information about how EBSeq works, please refer to EBSeq’s manual.

- -

Questions related to EBSeq should be sent to Ning Leng.

- -

Authors

- -

Bo Li and Colin Dewey designed the RSEM algorithm. Bo Li implemented the RSEM software. Peng Liu contributed the STAR aligner options.

- -

Acknowledgements

- -

RSEM uses the Boost C++ and samtools libraries. RSEM includes EBSeq for differential expression analysis.

- -

We thank earonesty and Dr. Samuel Arvidsson for contributing patches.

- -

We thank Han Lin, j.miller, Joël Fillon, Dr. Samuel G. Younkin and Malcolm Cook for suggesting possible fixes.

- -

License

- -

RSEM is licensed under the GNU General Public License v3.

diff --git a/convert-sam-for-rsem.html b/convert-sam-for-rsem.html
deleted file mode 100644
index d69263f..0000000
--- a/convert-sam-for-rsem.html
+++ /dev/null
@@ -1,88 +0,0 @@
-convert-sam-for-rsem
-

NAME

- -

convert-sam-for-rsem

- -

SYNOPSIS

- -

convert-sam-for-rsem [options] <input.sam/input.bam> output_file_name

- -

ARGUMENTS

- -
- -
input.sam/input.bam
-
- -

The SAM or BAM file generated by the user's aligner. This file must contain the header section. If the input is a SAM file, its name must end with the suffix 'sam' (case insensitive). If the input is a BAM file, its name must end with the suffix 'bam' (case insensitive).

- -
-
output_file_name
-
- -

The output name for the converted file. 'convert-sam-for-rsem' will output a BAM with the name 'output_file_name.bam'.

- -
-
- -

OPTIONS

- -
- -
-T/--temporary-directory <directory>
-
- -

'convert-sam-for-rsem' calls the 'sort' command, and this value is passed as the '-T/--temporary-directory' option of 'sort'. The description from 'sort' is: "use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories".

- -
-
-h/--help
-
- -

Show help information.

- -
-
- -

DESCRIPTION

- -

This program converts the SAM/BAM file generated by the user's aligner into a BAM file which RSEM can process. However, users should make sure that their aligners use 'reference_name.idx.fa', generated by 'rsem-prepare-reference', as the reference and that they output header sections. This program will create a temporary directory called 'output_file_name.bam.temp' to store the intermediate files. The directory will be deleted automatically after the conversion. After the conversion, this program will call 'rsem-sam-validator' to validate the resulting BAM file.

- -

Note: You do not need to run this script if 'rsem-sam-validator' reports that your SAM/BAM file is valid.

- -

Note: This program does not check the correctness of the input file. You should make sure the input is a valid SAM/BAM format file.

- -

EXAMPLES

- -

Suppose the input is 'input.sam' and the output file name is 'output':

- -
 convert-sam-for-rsem input.sam output
- -

We will get a file called 'output.bam' as output.

-
diff --git a/rsem-calculate-expression.html b/rsem-calculate-expression.html
deleted file mode 100644
index b0033a0..0000000
--- a/rsem-calculate-expression.html
+++ /dev/null
@@ -1,619 +0,0 @@
-rsem-calculate-expression
-

NAME

- -

rsem-calculate-expression

- -

SYNOPSIS

- -
 rsem-calculate-expression [options] upstream_read_file(s) reference_name sample_name 
- rsem-calculate-expression [options] --paired-end upstream_read_file(s) downstream_read_file(s) reference_name sample_name 
- rsem-calculate-expression [options] --sam/--bam [--paired-end] input reference_name sample_name
- -

ARGUMENTS

- -
- -
upstream_read_file(s)
-
- -

Comma-separated list of files containing single-end reads or upstream reads for paired-end data. By default, these files are assumed to be in FASTQ format. If the --no-qualities option is specified, then FASTA format is expected.

- -
-
downstream_read_file(s)
-
- -

Comma-separated list of files containing downstream reads which are paired with the upstream reads. By default, these files are assumed to be in FASTQ format. If the --no-qualities option is specified, then FASTA format is expected.

- -
-
input
-
- -

SAM/BAM formatted input file. If "-" is specified for the filename, SAM/BAM input is instead assumed to come from standard input. RSEM requires all alignments of the same read to be grouped together. For paired-end reads, RSEM also requires the two mates of any alignment to be adjacent. See the Description section for how to make the input file obey RSEM's requirements.

- -
-
reference_name
-
- -

The name of the reference used. The user must have run 'rsem-prepare-reference' with this reference_name before running this program.

- -
-
sample_name
-
- -

The name of the sample analyzed. All output files are prefixed by this name (e.g., sample_name.genes.results)

- -
-
- -

BASIC OPTIONS

- -
- -
--paired-end
-
- -

Input reads are paired-end reads. (Default: off)

- -
-
--no-qualities
-
- -

Input reads do not contain quality scores. (Default: off)

- -
-
--strand-specific
-
- -

The RNA-Seq protocol used to generate the reads is strand specific, i.e., all (upstream) reads are derived from the forward strand. This option is equivalent to --forward-prob=1.0. With this option set, if RSEM runs the Bowtie/Bowtie 2 aligner, the '--norc' Bowtie/Bowtie 2 option will be used, which disables alignment to the reverse strand of transcripts. (Default: off)

- -
-
--bowtie2
-
- -

Use Bowtie 2 instead of Bowtie to align reads. Since RSEM currently does not handle indel, local and discordant alignments, the Bowtie 2 parameters are set in a way that avoids those alignments. In particular, we use the options '--sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1' by default. The last parameter of '--score-min', '-0.1', is the negative of the maximum mismatch rate. This rate can be set with the option '--bowtie2-mismatch-rate'. If reads are paired-end, we additionally use the options '--no-mixed' and '--no-discordant'. (Default: off)

- -
-
--star
-
- -

Use STAR to align reads. Alignment parameters are from ENCODE3's STAR-RSEM pipeline. To save computational time and memory resources, STAR's output BAM file is unsorted. It is stored in RSEM's temporary directory under the name 'sample_name.bam'. Each STAR job will have its own private copy of the genome in memory. (Default: off)

- -
-
--star-path <path>
-
- -

The path to STAR's executable. (Default: the path to STAR executable is assumed to be in user's PATH environment variable)

- -
-
--sam
-
- -

Input file is in SAM format. (Default: off)

- -
-
--bam
-
- -

Input file is in BAM format. (Default: off)

- -
-
-p/--num-threads <int>
-
- -

Number of threads to use. Bowtie/Bowtie 2, expression estimation and 'samtools sort' will all use this many threads. (Default: 1)

- -
-
--no-bam-output
-
- -

Do not output any BAM file. (Default: off)

- -
-
--output-genome-bam
-
- -

Generate a BAM file, 'sample_name.genome.bam', with alignments mapped to genomic coordinates and annotated with their posterior probabilities. In addition, RSEM will call samtools (included in RSEM package) to sort and index the bam file. 'sample_name.genome.sorted.bam' and 'sample_name.genome.sorted.bam.bai' will be generated. (Default: off)

- -
-
--sampling-for-bam
-
- -

When RSEM generates a BAM file, instead of outputting all alignments a read has with their posterior probabilities, one alignment is sampled according to the posterior probabilities. The sampling procedure includes the alignment to the "noise" transcript, which does not appear in the BAM file. Only the sampled alignment has a weight of 1. All other alignments have weight 0. If the "noise" transcript is sampled, all alignments appearing in the BAM file have weight 0. (Default: off)

- -
-
--seed <uint32>
-
- -

Set the seed for the random number generators used in calculating posterior mean estimates and credibility intervals. The seed must be a non-negative 32-bit integer. (Default: off)

- -
-
--calc-pme
-
- -

Run RSEM's collapsed Gibbs sampler to calculate posterior mean estimates. (Default: off)

- -
-
--calc-ci
-
- -

Calculate 95% credibility intervals and posterior mean estimates. The credibility level can be changed by setting '--ci-credibility-level'. (Default: off)

- -
-
-q/--quiet
-
- -

Suppress the output of logging information. (Default: off)

- -
-
-h/--help
-
- -

Show help information.

- -
-
--version
-
- -

Show version information.
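The sampling behaviour described for '--sampling-for-bam' above can be illustrated with the following minimal sketch. It is not RSEM's implementation. The function `sample_alignment` is hypothetical, and it assumes that the posterior probabilities of a read's listed alignments plus the probability of the "noise" transcript sum to 1, so the noise probability is taken as the remainder.

```python
import random


def sample_alignment(posteriors, rng):
    """Sample at most one alignment according to posterior probabilities.

    posteriors: probabilities of the alignments that appear in the BAM file;
                the remainder 1 - sum(posteriors) is treated as the "noise"
                transcript (an assumption of this sketch).
    rng: a random.Random instance, so the draw is reproducible with a seed.
    Returns a list of 0/1 weights, all 0 if the noise transcript is sampled.
    """
    listed = sum(posteriors)
    noise = max(0.0, 1.0 - listed)
    draw = rng.random() * (listed + noise)
    weights = [0] * len(posteriors)
    cumulative = 0.0
    for i, p in enumerate(posteriors):
        cumulative += p
        if draw < cumulative:
            weights[i] = 1  # only the sampled alignment gets weight 1
            break
    return weights
```

For example, `sample_alignment([0.5, 0.3], random.Random(42))` returns a list with at most one entry set to 1; the remaining 0.2 of probability mass corresponds to the noise transcript, in which case both weights are 0.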

- -
-
- -

ADVANCED OPTIONS

- -
- -
--sam-header-info <file>
-
- -

RSEM reads header information from input by default. If this option is on, header information is read from the specified file. For the format of the file, please see SAM official website. (Default: "")

- -
-
--seed-length <int>
-
- -

Seed length used by the read aligner. Providing the correct value is important for RSEM. If RSEM runs Bowtie, it uses this value for Bowtie's seed length parameter. Any read whose length (or, for paired-end reads, the length of at least one of its mates) is less than this value will be ignored. If poly(A) tails are not added to the references, the minimum allowed value is 5; otherwise, the minimum allowed value is 25. Note that this script only checks that the value is >= 5 and gives a warning message if the value is < 25 but >= 5. (Default: 25)

- -
-
--tag <string>
-
- -

The name of the optional field used in the SAM input for identifying a read with too many valid alignments. The field should have the format <tagName>:i:<value>, where a <value> bigger than 0 indicates a read with too many alignments. (Default: "")

- -
-
--bowtie-path <path>
-
- -

The path to the Bowtie executables. (Default: the path to the Bowtie executables is assumed to be in the user's PATH environment variable)

- -
-
--bowtie-n <int>
-
- -

(Bowtie parameter) max # of mismatches in the seed. (Range: 0-3, Default: 2)

- -
-
--bowtie-e <int>
-
- -

(Bowtie parameter) max sum of mismatch quality scores across the alignment. (Default: 99999999)

- -
-
--bowtie-m <int>
-
- -

(Bowtie parameter) suppress all alignments for a read if > <int> valid alignments exist. (Default: 200)

- -
-
--bowtie-chunkmbs <int>
-
- -

(Bowtie parameter) memory allocated for best first alignment calculation (Default: 0 - use Bowtie's default)

- -
-
--phred33-quals
-
- -

Input quality scores are encoded as Phred+33. (Default: on)

- -
-
--phred64-quals
-
- -

Input quality scores are encoded as Phred+64 (default for GA Pipeline ver. >= 1.3). (Default: off)

- -
-
--solexa-quals
-
- -

Input quality scores are solexa encoded (from GA Pipeline ver. < 1.3). (Default: off)

- -
-
--bowtie2-path <path>
-
- -

(Bowtie 2 parameter) The path to the Bowtie 2 executables. (Default: the path to the Bowtie 2 executables is assumed to be in the user's PATH environment variable)

- -
-
--bowtie2-mismatch-rate <double>
-
- -

(Bowtie 2 parameter) The maximum mismatch rate allowed. (Default: 0.1)

- -
-
--bowtie2-k <int>
-
- -

(Bowtie 2 parameter) Find up to <int> alignments per read. (Default: 200)

- -
-
--bowtie2-sensitivity-level <string>
-
- -

(Bowtie 2 parameter) Set Bowtie 2's preset options in --end-to-end mode. This option controls how hard Bowtie 2 tries to find alignments. <string> must be one of "very_fast", "fast", "sensitive" and "very_sensitive". The four candidates correspond to Bowtie 2's "--very-fast", "--fast", "--sensitive" and "--very-sensitive" options. (Default: "sensitive" - use Bowtie 2's default)

- -
-
--gzipped-read-file
-
- -

The input read file(s) are compressed by gzip. This option can only be used when aligning reads with STAR, i.e. --star-genome-path <path> is defined. (Default: off)

- -
-
--bzipped-read-file
-
- -

The input read file(s) are compressed by bzip2. This option can only be used when aligning reads with STAR, i.e. --star-genome-path <path> is defined. (Default: off)

- -
-
--output-star-genome-bam
-
- -

Save the BAM file from STAR alignment under genomic coordinate to 'sample_name.STAR.genome.bam'. This file is NOT sorted by genomic coordinate. In this file, according to STAR's manual, 'paired ends of an alignment are always adjacent, and multiple alignments of a read are adjacent as well'. (Default: off)

- -
-
--sort-bam-by-read-name
-
- -

Sort the BAM file aligned under transcript coordinates by read name. Setting this option on will produce deterministic maximum likelihood estimates from independent runs. Note that sorting will take a long time and a lot of memory. (Default: off)

- -
-
--sort-bam-buffer-size <string>
-
- -

Size of the main memory buffer used when sorting the BAM file. It can be any string acceptable to GNU sort's '-S' option. See "sort --help" for details. (Default: '60G')

- -
-
--forward-prob <double>
-
- -

Probability of generating a read from the forward strand of a transcript. Set to 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand, 0 for a strand-specific protocol where all (upstream) reads are derived from the reverse strand, or 0.5 for a non-strand-specific protocol. (Default: 0.5)

- -
-
--fragment-length-min <int>
-
- -

Minimum read/insert length allowed. This is also the value for the Bowtie/Bowtie 2 -I option. (Default: 1)

- -
-
--fragment-length-max <int>
-
- -

Maximum read/insert length allowed. This is also the value for the Bowtie/Bowtie 2 -X option. (Default: 1000)

- -
-
--fragment-length-mean <double>
-
- -

(single-end data only) The mean of the fragment length distribution, which is assumed to be a Gaussian. (Default: -1, which disables use of the fragment length distribution)

- -
-
--fragment-length-sd <double>
-
- -

(single-end data only) The standard deviation of the fragment length distribution, which is assumed to be a Gaussian. (Default: 0, which assumes that all fragments are of the same length, given by the rounded value of --fragment-length-mean)

- -
-
--estimate-rspd
-
- -

Set this option if you want to estimate the read start position distribution (RSPD) from data. Otherwise, RSEM will use a uniform RSPD. (Default: off)

- -
-
--num-rspd-bins <int>
-
- -

Number of bins in the RSPD. Only relevant when '--estimate-rspd' is specified. Use of the default setting is recommended. (Default: 20)

- -
-
--gibbs-burnin <int>
-
- -

The number of burn-in rounds for RSEM's Gibbs sampler. Each round passes over the entire data set once. If RSEM can use multiple threads, multiple Gibbs samplers will start at the same time and all samplers share the same burn-in number. (Default: 200)

- -
-
--gibbs-number-of-samples <int>
-
- -

The total number of count vectors RSEM will collect from its Gibbs samplers. (Default: 1000)

- -
-
--gibbs-sampling-gap <int>
-
- -

The number of rounds between two successive count vectors RSEM collects. If the count vector after round N is collected, the count vector after round N + <int> will also be collected. (Default: 1)

- -
-
--ci-credibility-level <double>
-
- -

The credibility level for credibility intervals. (Default: 0.95)

- -
-
--ci-memory <int>
-
- -

Maximum size (in memory, MB) of the auxiliary buffer used for computing credibility intervals (CI). Set it larger for a faster CI calculation. However, leaving 2 GB memory free for other usage is recommended. (Default: 1024)

- -
-
--ci-number-of-samples-per-count-vector <int>
-
- -

The number of read generating probability vectors sampled per sampled count vector. The credibility intervals are calculated by first sampling P(C | D) and then sampling P(Theta | C) for each sampled count vector. This option controls how many Theta vectors are sampled per sampled count vector. (Default: 50)

- -
-
--samtools-sort-mem <string>
-
- -

Set the maximum memory per thread that can be used by 'samtools sort'. <string> represents the memory and accepts suffixes 'K/M/G'. RSEM will pass <string> to the '-m' option of 'samtools sort'. Please note that the default used here is different from the default used by samtools. (Default: 1G)

- -
-
--keep-intermediate-files
-
- -

Keep temporary files generated by RSEM. RSEM creates a temporary directory, 'sample_name.temp', into which it puts all intermediate output files. If this directory already exists, RSEM overwrites all files generated by previous RSEM runs inside of it. By default, after RSEM finishes, the temporary directory is deleted. Set this option to prevent the deletion of this directory and the intermediate files inside of it. (Default: off)

- -
-
--temporary-folder <string>
-
- -

Set where to put the temporary files generated by RSEM. If the folder specified does not exist, RSEM will try to create it. (Default: sample_name.temp)

- -
-
--time
-
- -

Output time consumed by each step of RSEM to 'sample_name.time'. (Default: off)

- -
-
- -

DESCRIPTION

- -

In its default mode, this program aligns input reads against a reference transcriptome with Bowtie and calculates expression values using the alignments. RSEM assumes the data are single-end reads with quality scores, unless the '--paired-end' or '--no-qualities' options are specified. Alternatively, users can use STAR to align reads using the '--star' option. RSEM provides options in 'rsem-prepare-reference' to prepare STAR's genome indices. Users may use an alternative aligner by specifying one of the --sam and --bam options, and providing an alignment file in the specified format. However, users should make sure that they align against the indices generated by 'rsem-prepare-reference' and that the alignment file satisfies the requirements mentioned in the ARGUMENTS section.

- -

One simple way to make the alignment file satisfy RSEM's requirements (assuming the aligner used puts the mates of a paired-end read adjacent to each other) is to use the 'convert-sam-for-rsem' script. This script only accepts SAM format files as input. If a BAM format file is obtained, please use samtools to convert it to a SAM file first. For example, if '/ref/mouse_125' is the 'reference_name' and the SAM file is named 'input.sam', you can run the following command:

- -
  convert-sam-for-rsem /ref/mouse_125 input.sam -o input_for_rsem.sam  
- -

For details, please refer to 'convert-sam-for-rsem's documentation page.

- -

The SAM/BAM format RSEM uses is v1.4, but it is compatible with older SAM/BAM formats. However, RSEM cannot recognize 0x100 in the FLAG field. In addition, RSEM requires that SEQ and QUAL are not '*'.

- -

The user must run 'rsem-prepare-reference' with the appropriate reference before using this program.

- -

For single-end data, it is strongly recommended that the user provide the fragment length distribution parameters (--fragment-length-mean and --fragment-length-sd). For paired-end data, RSEM will automatically learn a fragment length distribution from the data.

- -

Please note that some of the default values for the Bowtie parameters are not the same as those defined for Bowtie itself.

- -

The temporary directory and all intermediate files will be removed when RSEM finishes unless '--keep-intermediate-files' is specified.

- -

With the '--calc-pme' option, posterior mean estimates will be calculated in addition to maximum likelihood estimates.

- -

With the '--calc-ci' option, 95% credibility intervals and posterior mean estimates will be calculated in addition to maximum likelihood estimates.

- -

OUTPUT

- -
- -
sample_name.isoforms.results
-
- -

File containing isoform level expression estimates. The first line contains column names separated by the tab character. The format of each line in the rest of this file is:

- -

transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct [posterior_mean_count posterior_standard_deviation_of_count pme_TPM pme_FPKM IsoPct_from_pme_TPM TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_bound FPKM_ci_upper_bound]

- -

Fields are separated by the tab character. Fields within "[]" are optional. They will not be present if neither '--calc-pme' nor '--calc-ci' is set.

- -

'transcript_id' is the transcript name of this transcript. 'gene_id' is the gene name of the gene which this transcript belongs to (denote this gene as its parent gene). If no gene information is provided, 'gene_id' and 'transcript_id' are the same.

- -

'length' is this transcript's sequence length (the poly(A) tail is not counted). 'effective_length' counts only the positions that can generate a valid fragment. If no poly(A) tail is added, 'effective_length' is equal to transcript length - mean fragment length + 1. If a transcript's effective length is less than 1, both its effective length and its abundance estimates are set to 0.

- -

'expected_count' is the sum, over all reads, of the posterior probability that each read comes from this transcript. Because 1) each read aligning to this transcript has a probability of being generated from background noise and 2) RSEM may filter some alignable low quality reads, the sum of expected counts over all transcripts is generally less than the total number of reads aligned.

- -

'TPM' stands for Transcripts Per Million. It is a relative measure of transcript abundance. The sum of all transcripts' TPM is 1 million. 'FPKM' stands for Fragments Per Kilobase of transcript per Million mapped reads. It is another relative measure of transcript abundance. If we define l_bar to be the mean transcript length in a sample, which can be calculated as

- -

l_bar = \sum_i TPM_i / 10^6 * effective_length_i (i goes through every transcript),

- -

the following equation holds:

- -

FPKM_i = 10^3 / l_bar * TPM_i.

- -

We can see that the sum of FPKM is not a constant across samples.

- -

'IsoPct' stands for isoform percentage. It is the percentage of this transcript's abundance over its parent gene's abundance. If its parent gene has only one isoform or the gene information is not provided, this field will be set to 100.

- -

'posterior_mean_count', 'pme_TPM', 'pme_FPKM' are posterior mean estimates calculated by RSEM's Gibbs sampler. 'posterior_standard_deviation_of_count' is the posterior standard deviation of counts. 'IsoPct_from_pme_TPM' is the isoform percentage calculated from 'pme_TPM' values.

- -

'TPM_ci_lower_bound', 'TPM_ci_upper_bound', 'FPKM_ci_lower_bound' and 'FPKM_ci_upper_bound' are lower(l) and upper(u) bounds of 95% credibility intervals for TPM and FPKM values. The bounds are inclusive (i.e. [l, u]).
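
The relation between TPM and FPKM stated above for 'sample_name.isoforms.results' can be checked with a small worked example. The sketch below only evaluates the two formulas given in this section; the function name 'fpkm_from_tpm' and the three-transcript input are made up for illustration.

```python
def fpkm_from_tpm(tpm, effective_length):
    """Convert TPM values to FPKM using l_bar, the TPM-weighted mean effective length."""
    # l_bar = sum_i TPM_i / 10^6 * effective_length_i
    l_bar = sum(t / 1e6 * l for t, l in zip(tpm, effective_length))
    # FPKM_i = 10^3 / l_bar * TPM_i
    return [1e3 / l_bar * t for t in tpm]


# Hypothetical sample with three transcripts; TPM values sum to 1 million.
tpm = [500000.0, 300000.0, 200000.0]
effective_length = [1000.0, 2000.0, 500.0]

fpkm = fpkm_from_tpm(tpm, effective_length)
print(fpkm)
print(sum(fpkm))  # equals 10^9 / l_bar, so it changes with l_bar
```

Here l_bar is 1200, so every FPKM value is the corresponding TPM value scaled by 1000 / 1200, and the FPKM total depends on the sample through l_bar.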

- -
-
sample_name.genes.results
-
- -

File containing gene level expression estimates. The first line contains column names separated by the tab character. The format of each line in the rest of this file is:

- -

gene_id transcript_id(s) length effective_length expected_count TPM FPKM [posterior_mean_count posterior_standard_deviation_of_count pme_TPM pme_FPKM TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_bound FPKM_ci_upper_bound]

- -

Fields are separated by the tab character. Fields within "[]" are optional. They will not be present if neither '--calc-pme' nor '--calc-ci' is set.

- -

'transcript_id(s)' is a comma-separated list of transcript_ids belonging to this gene. If no gene information is provided, 'gene_id' and 'transcript_id(s)' are identical (the 'transcript_id').

- -

A gene's 'length' and 'effective_length' are defined as the weighted average of its transcripts' lengths and effective lengths (weighted by 'IsoPct'). A gene's abundance estimates are just the sum of its transcripts' abundance estimates.
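
As a rough illustration of the gene-level definitions above, the following sketch derives a gene's 'length', 'effective_length' and TPM from its isoforms. The dictionary keys mirror the column names, but the function 'gene_summary' and the numbers are hypothetical, and the sketch assumes the isoforms' 'IsoPct' values sum to 100.

```python
def gene_summary(isoforms):
    """Summarize a gene from its isoforms' length, effective_length, TPM and IsoPct."""
    # IsoPct is a percentage, so divide by 100 to obtain averaging weights.
    weights = [iso["IsoPct"] / 100.0 for iso in isoforms]
    return {
        "length": sum(w * iso["length"] for w, iso in zip(weights, isoforms)),
        "effective_length": sum(
            w * iso["effective_length"] for w, iso in zip(weights, isoforms)
        ),
        # Abundance estimates are simply summed over the gene's transcripts.
        "TPM": sum(iso["TPM"] for iso in isoforms),
    }


isoforms = [
    {"length": 1000, "effective_length": 851, "TPM": 30.0, "IsoPct": 25.0},
    {"length": 2000, "effective_length": 1851, "TPM": 90.0, "IsoPct": 75.0},
]
print(gene_summary(isoforms))
```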

- -
-
sample_name.alleles.results
-
- -

Only generated when the RSEM references are built with allele-specific transcripts.

- -

This file contains allele level expression estimates for allele-specific expression calculation. The first line contains column names separated by the tab character. The format of each line in the rest of this file is:

- -

allele_id transcript_id gene_id length effective_length expected_count TPM FPKM AlleleIsoPct AlleleGenePct [posterior_mean_count posterior_standard_deviation_of_count pme_TPM pme_FPKM AlleleIsoPct_from_pme_TPM AlleleGenePct_from_pme_TPM TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_bound FPKM_ci_upper_bound]

- -

Fields are separated by the tab character. Fields within "[]" are optional. They will not be present if neither '--calc-pme' nor '--calc-ci' is set.

- -

'allele_id' is the allele-specific name of this allele-specific transcript.

- -

'AlleleIsoPct' stands for allele-specific percentage on isoform level. It is the percentage of this allele-specific transcript's abundance over its parent transcript's abundance. If its parent transcript has only one allele variant form, this field will be set to 100.

- -

'AlleleGenePct' stands for allele-specific percentage on gene level. It is the percentage of this allele-specific transcript's abundance over its parent gene's abundance.

- -

'AlleleIsoPct_from_pme_TPM' and 'AlleleGenePct_from_pme_TPM' have similar meanings. They are calculated based on posterior mean estimates.

- -

Please note that if this file is present, the fields 'length' and 'effective_length' in 'sample_name.isoforms.results' should be interpreted similarly as the corresponding definitions in 'sample_name.genes.results'.

- -
-
sample_name.transcript.bam, sample_name.transcript.sorted.bam and sample_name.transcript.sorted.bam.bai
-
- -

Only generated when --no-bam-output is not specified.

- -

'sample_name.transcript.bam' is a BAM-formatted file of read alignments in transcript coordinates. The MAPQ field of each alignment is set to min(100, floor(-10 * log10(1.0 - w) + 0.5)), where w is the posterior probability of that alignment being the true mapping of a read. In addition, RSEM adds a new tag ZW:f:value, where value is a single-precision floating-point number representing the posterior probability. Because this file contains all alignment lines produced by Bowtie or user-specified aligners, it can also be used as a replacement for the aligner-generated BAM/SAM file. For paired-end reads, if one mate has alignments but the other does not, this file marks the alignable mate as "unmappable" (flag bit 0x4) and appends an optional field "Z0:A:!".

- -

'sample_name.transcript.sorted.bam' and 'sample_name.transcript.sorted.bam.bai' are the sorted BAM file and indices generated by samtools (included in RSEM package).
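
To make the MAPQ formula above concrete, here is a minimal Python sketch that evaluates min(100, floor(-10 * log10(1.0 - w) + 0.5)) for a few posterior probabilities. The function name 'rsem_mapq' is hypothetical, and returning 100 when w is exactly 1 is an assumption of this sketch (the limit of the formula), since log10(0) is undefined.

```python
import math


def rsem_mapq(w):
    """MAPQ for an alignment with posterior probability w, per the formula above."""
    if w >= 1.0:
        # Assumption of this sketch: treat w == 1 as the capped maximum.
        return 100
    return min(100, math.floor(-10 * math.log10(1.0 - w) + 0.5))


for w in (0.0, 0.5, 0.9, 0.99, 1.0):
    print(w, rsem_mapq(w))
```

For example, w = 0.9 gives MAPQ 10 and w = 0.5 gives MAPQ 3.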

- -
-
sample_name.genome.bam, sample_name.genome.sorted.bam and sample_name.genome.sorted.bam.bai
-
- -

Only generated when --no-bam-output is not specified and --output-genome-bam is specified.

- -

'sample_name.genome.bam' is a BAM-formatted file of read alignments in genomic coordinates. Alignments of reads that have identical genomic coordinates (i.e., alignments to different isoforms that share the same genomic region) are collapsed into one alignment. The MAPQ field of each alignment is set to min(100, floor(-10 * log10(1.0 - w) + 0.5)), where w is the posterior probability of that alignment being the true mapping of a read. In addition, RSEM adds a new tag ZW:f:value, where value is a single-precision floating-point number representing the posterior probability. If an alignment is spliced, an XS:A:value tag is also added, where value is either '+' or '-', indicating the strand of the transcript it aligns to.

- -

'sample_name.genome.sorted.bam' and 'sample_name.genome.sorted.bam.bai' are the sorted BAM file and indices generated by samtools (included in RSEM package).

- -
-
sample_name.time
-
- -

Only generated when --time is specified.

- -

It contains time (in seconds) consumed by aligning reads, estimating expression levels and calculating credibility intervals.

- -
-
sample_name.stat
-
- -

This is a folder instead of a file. All model-related statistics are stored in this folder. 'rsem-plot-model' can generate plots using this folder.

- -

'sample_name.stat/sample_name.cnt' contains alignment statistics. The format and meaning of each field are described in 'cnt_file_description.txt' under the RSEM directory.

- -

'sample_name.stat/sample_name.model' stores RNA-Seq model parameters learned from the data. The format and meaning of each field of this file are described in 'model_file_description.txt' under the RSEM directory.

- -
-
- -

EXAMPLES

- -

Assume the path to the bowtie executables is in the user's PATH environment variable. Reference files are under '/ref' with name 'mouse_125'.

- -

1) '/data/mmliver.fq', single-end reads with quality scores. Quality scores are encoded as for 'GA pipeline version >= 1.3'. We want to use 8 threads and generate a genome BAM file:

- -
 rsem-calculate-expression --phred64-quals \
-                           -p 8 \
-                           --output-genome-bam \
-                           /data/mmliver.fq \
-                           /ref/mouse_125 \
-                           mmliver_single_quals
- -

2) '/data/mmliver_1.fq' and '/data/mmliver_2.fq', paired-end reads with quality scores. Quality scores are in SANGER format. We want to use 8 threads and do not generate a genome BAM file:

- -
 rsem-calculate-expression -p 8 \
-                           --paired-end \
-                           /data/mmliver_1.fq \
-                           /data/mmliver_2.fq \
-                           /ref/mouse_125 \
-                           mmliver_paired_end_quals
- -

3) '/data/mmliver.fa', single-end reads without quality scores. We want to use 8 threads:

- -
 rsem-calculate-expression -p 8 \
-                           --no-qualities \
-                           /data/mmliver.fa \
-                           /ref/mouse_125 \
-                           mmliver_single_without_quals
- -

4) Data are the same as 1). This time we assume the bowtie executables are under '/sw/bowtie'. We want to take a fragment length distribution into consideration. We set the fragment length mean to 150 and the standard deviation to 35. In addition to a BAM file, we also want to generate credibility intervals. We allow RSEM to use 1GB of memory for CI calculation:

- -
 rsem-calculate-expression --bowtie-path /sw/bowtie \
-                           --phred64-quals \
-                           --fragment-length-mean 150.0 \
-                           --fragment-length-sd 35.0 \
-                           -p 8 \
-                           --output-genome-bam \
-                           --calc-ci \
-                           --ci-memory 1024 \
-                           /data/mmliver.fq \
-                           /ref/mouse_125 \
-                           mmliver_single_quals
- -

5) '/data/mmliver_paired_end_quals.bam', paired-end reads with quality scores. We want to use 8 threads:

- -
 rsem-calculate-expression --paired-end \
-                           --bam \
-                           -p 8 \
-                           /data/mmliver_paired_end_quals.bam \
-                           /ref/mouse_125 \
-                           mmliver_paired_end_quals
- -

6) '/data/mmliver_1.fq.gz' and '/data/mmliver_2.fq.gz', paired-end reads with quality scores; the read files are compressed by gzip. We want to use STAR to align reads and assume the STAR executable is '/sw/STAR'. Suppose we want to use 8 threads and do not generate a genome BAM file:

- -
 rsem-calculate-expression --paired-end \
-                           --star \
-                           --star-path /sw/STAR \
-                           --gzipped-read-file \
-                           -p 8 \
-                           /data/mmliver_1.fq.gz \
-                           /data/mmliver_2.fq.gz \
-                           /ref/mouse_125 \
-                           mmliver_paired_end_quals
- - - - - - - diff --git a/rsem-control-fdr.html b/rsem-control-fdr.html deleted file mode 100644 index f4078a2..0000000 --- a/rsem-control-fdr.html +++ /dev/null @@ -1,94 +0,0 @@ - - - - -rsem-control-fdr - - - - - - - - - - -

NAME

- -

rsem-control-fdr

- -

SYNOPSIS

- -

rsem-control-fdr [options] input_file fdr_rate output_file

- -

ARGUMENTS

- -
- -
input_file
-
- -

This should be the main result file generated by 'rsem-run-ebseq', which contains all genes/transcripts and their associated statistics.

- -
-
fdr_rate
-
- -

The desired false discovery rate (FDR).

- -
-
output_file
-
- -

This file is a subset of the 'input_file'. It only contains the genes/transcripts called as differentially expressed (DE). When more than two conditions exist, a gene/transcript is defined as DE if it is not equally expressed across all conditions. Because statistical significance does not necessarily mean biological significance, users should also refer to the fold changes to decide which genes/transcripts are biologically significant. When more than two conditions exist, this file will not contain fold change information and users need to calculate it from 'input_file.condmeans' by themselves.

- -
-
- -

OPTIONS

- -
- -
--hard-threshold
-
- -

Use the hard threshold method to control FDR. If this option is set, only those genes/transcripts with PPDE >= 1 - fdr_rate are called as DE. (Default: on)

- -
-
--soft-threshold
-
- -

Use the soft threshold method to control FDR. If this option is set, this program will try to report as many genes/transcripts as possible, as long as their average PPDE >= 1 - fdr_rate. This option is equivalent to using EBSeq's 'crit_fun' for FDR control. (Default: off)

- -
-
-h/--help
-
- -

Show help information.

- -
-
- -

DESCRIPTION

- -

This program controls the false discovery rate and reports differentially expressed genes/transcripts.
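
The difference between the '--hard-threshold' and '--soft-threshold' rules described under OPTIONS can be sketched as follows. This is only an illustration of the two rules as worded in this page, not the code of 'rsem-control-fdr' or of EBSeq's 'crit_fun'; the function names and the PPDE values are hypothetical.

```python
def hard_threshold(ppde, fdr_rate):
    """Call DE every gene/transcript whose own PPDE is >= 1 - fdr_rate."""
    return [name for name, p in ppde.items() if p >= 1.0 - fdr_rate]


def soft_threshold(ppde, fdr_rate):
    """Call DE the largest top-ranked set whose average PPDE is >= 1 - fdr_rate."""
    ranked = sorted(ppde.items(), key=lambda item: item[1], reverse=True)
    called, total = [], 0.0
    for name, p in ranked:
        # With PPDE sorted in descending order the running mean never increases,
        # so the first failure marks the largest admissible set.
        if (total + p) / (len(called) + 1) < 1.0 - fdr_rate:
            break
        total += p
        called.append(name)
    return called


ppde = {"g1": 0.99, "g2": 0.97, "g3": 0.93, "g4": 0.60}
print(hard_threshold(ppde, 0.05))
print(soft_threshold(ppde, 0.05))
```

With these made-up values the hard threshold calls g1 and g2, while the soft threshold also admits g3 because the average PPDE of the three genes is still above 0.95.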

- -

EXAMPLES

- -

We assume that we have 'GeneMat.results' as input. We want to control FDR at 0.05 using the hard threshold method and name the output file 'GeneMat.de.txt':

- -
 rsem-control-fdr GeneMat.results 0.05 GeneMat.de.txt
- - - - - - - diff --git a/rsem-generate-ngvector.html b/rsem-generate-ngvector.html deleted file mode 100644 index 759428f..0000000 --- a/rsem-generate-ngvector.html +++ /dev/null @@ -1,111 +0,0 @@ - - - - -rsem-generate-ngvector - - - - - - - - - - -

NAME

- -

rsem-generate-ngvector

- -

SYNOPSIS

- -

rsem-generate-ngvector [options] input_fasta_file output_name

- -

ARGUMENTS

- -
- -
input_fasta_file
-
- -

The fasta file containing all reference transcripts. The transcripts must be in the same order as those in expression value files. Thus, 'reference_name.transcripts.fa' generated by 'rsem-prepare-reference' should be used.

- -
-
output_name
-
- -

The name of all output files. The Ng vector will be stored as 'output_name.ngvec'.

- -
-
- -

OPTIONS

- -
- -
-k <int>
-
- -

k-mer length. See the DESCRIPTION section. (Default: 25)

- -
-
-h/--help
-
- -

Show help information.

- -
-
- -

DESCRIPTION

- -

This program generates the Ng vector required by EBSeq for isoform level differential expression analysis based on reference sequences only. EBSeq can take variance due to read mapping ambiguity into consideration by grouping isoforms by their parent gene's number of isoforms. However, for a de novo assembled transcriptome, it is hard to obtain an accurate gene-isoform relationship. Instead, this program groups isoforms by using measures of read mapping ambiguity directly. First, it calculates the 'unmappability' of each transcript. The 'unmappability' of a transcript is the ratio between the number of k-mers with at least one perfect match to other transcripts and the total number of k-mers of this transcript, where k is a parameter. Then, the Ng vector is generated by applying the K-means algorithm to the 'unmappability' values with the number of clusters set to 3. 'rsem-generate-ngvector' will make sure the mean 'unmappability' scores for the clusters are in ascending order. All transcripts whose lengths are less than k are assigned to cluster 3.
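
The 'unmappability' ratio defined above can be illustrated with a short Python sketch. It is not the algorithm used by 'rsem-generate-ngvector': it assumes exact, forward-strand, case-sensitive k-mer matches, it omits the clustering step, and the function name and toy sequences are hypothetical.

```python
def unmappability(transcripts, k):
    """Fraction of each transcript's k-mers that also occur in another transcript."""
    # Record which transcripts contain each k-mer.
    owners = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            owners.setdefault(seq[i:i + k], set()).add(name)

    scores = {}
    for name, seq in transcripts.items():
        total = len(seq) - k + 1
        if total <= 0:
            # Shorter than k: no k-mers, so no ratio (assigned to cluster 3 per the text).
            scores[name] = None
            continue
        shared = sum(1 for i in range(total) if len(owners[seq[i:i + k]]) > 1)
        scores[name] = shared / total
    return scores


toy = {"t1": "ACGTACGTAA", "t2": "ACGTACGTCC", "t3": "GGGGGGGGGG", "t4": "AC"}
print(unmappability(toy, k=4))
```

In this toy input, t1 shares 6 of its 7 k-mers with t2, t2 shares 5 of 7 with t1, and t3 shares none.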

- -

If your reference is a de novo assembled transcript set, you should run 'rsem-generate-ngvector' first. Then load the resulting 'output_name.ngvec' into R. For example, you can use

- -
 NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")
- -

After that, replace 'IsoNgTrun' with 'NgVec' in the second line of section 3.2.5 (Page 10) of EBSeq's vignette:

- -
 IsoEBres=EBTest(Data=IsoMat, NgVector=NgVec, ...)
- -

This program only needs to run once per RSEM reference.

- -

OUTPUT

- -
- -
output_name.ump
-
- -

'unmappability' scores for each transcript. This file contains two columns. The first column is transcript name and the second column is 'unmappability' score.

- -
-
output_name.ngvec
-
- -

Ng vector generated by this program.

- -
-
- -

EXAMPLES

- -

Suppose the reference sequences file is '/ref/mouse_125/mouse_125.transcripts.fa' and we set the output_name as 'mouse_125':

- -
 rsem-generate-ngvector /ref/mouse_125/mouse_125.transcripts.fa mouse_125
- - - - - - - diff --git a/rsem-plot-transcript-wiggles.html b/rsem-plot-transcript-wiggles.html deleted file mode 100644 index f5fb9e9..0000000 --- a/rsem-plot-transcript-wiggles.html +++ /dev/null @@ -1,131 +0,0 @@ - - - - -rsem-plot-transcript-wiggles - - - - - - - - - - -

NAME

- -

rsem-plot-transcript-wiggles

- -

SYNOPSIS

- -

rsem-plot-transcript-wiggles [options] sample_name input_list output_plot_file

- -

ARGUMENTS

- -
- -
sample_name
-
- -

The name of the sample analyzed.

- -
-
input_list
-
- -

A list of transcript ids or gene ids. But it cannot be a mixture of transcript & gene ids. Each id occupies one line without extra spaces.

- -
-
output_plot_file
-
- -

The file name of the pdf file which contains all plots.

- -
-
- -

OPTIONS

- -
- -
--gene-list
-
- -

The input_list is a list of gene ids. (Default: off)

- -
-
--transcript-list
-
- -

The input_list is a list of transcript ids. This option can only be turned on if allele-specific expression is calculated. (Default: off)

- -
-
--show-unique
-
- -

Show the wiggle plots as stacked bar plots. See description section for details. (Default: off)

- -
-
-h/--help
-
- -

Show help information.

- -
-
- -

DESCRIPTION

- -

This program generates transcript wiggle plots and outputs them in a pdf file. This program can accept either a list of transcript ids or a list of gene ids (if transcript-to-gene mapping information is provided) and has two modes of showing wiggle plots. If '--show-unique' is not specified, the wiggle plot for each transcript is a histogram where each position has the expected read depth at this position as its height. If '--show-unique' is specified, a stacked bar plot is generated for each transcript. For each position, the read depth of unique reads, which have only one alignment, is shown in black. The read depth of multi-reads, which align to more than one place, is shown in red on top of the read depth of unique reads. This program will use some files RSEM generated previously, so please do not delete or move any file that 'rsem-calculate-expression' generated. If allele-specific expression is calculated, the basic unit for plotting is an allele-specific transcript and plots can be grouped by either transcript ids (--transcript-list) or gene ids (--gene-list).

- -

OUTPUT

- -
- -
output_plot_file
-
- -

This is a pdf file containing all plots generated. If a list of transcript ids is provided, each page displays at most 6 plots in 3 rows and 2 columns. If gene ids are provided, each page displays a gene. The gene's id is shown at the top and all its transcripts' wiggle plots are shown on this page. The arrangement of plots is determined automatically. For each transcript wiggle plot, the transcript id is displayed as the title. The x-axis is the position in the transcript and the y-axis is the read depth. If allele-specific expression is calculated, the basic unit becomes an allele-specific transcript and transcript ids and gene ids can be used to group allele-specific transcripts.

- -
-
sample_name.transcript.sorted.bam and sample_name.transcript.readdepth
-
- -

If these files do not exist, 'rsem-plot-transcript-wiggles' will automatically generate them.

- -
-
sample_name.uniq.transcript.bam, sample_name.uniq.transcript.sorted.bam and sample_name.uniq.transcript.readdepth
-
- -

If '--show-unique' option is specified and these files do not exist, 'rsem-plot-transcript-wiggles' will automatically generate them.

- -
-
- -

EXAMPLES

- -

Suppose sample_name and output_plot_file are set to 'mmliver_single_quals' and 'output.pdf' respectively. input_list is set to 'transcript_ids.txt' if transcript ids are provided, and is set to 'gene_ids.txt' if gene ids are provided.

- -

1) Transcript ids are provided and we just want normal wiggle plots:

- -
 rsem-plot-transcript-wiggles mmliver_single_quals transcript_ids.txt output.pdf
- -

2) Gene ids are provided and we want to show stacked bar plots:

- -
 rsem-plot-transcript-wiggles --gene-list --show-unique mmliver_single_quals gene_ids.txt output.pdf 
- - - - - - - diff --git a/rsem-prepare-reference.html b/rsem-prepare-reference.html deleted file mode 100644 index 4dcc392..0000000 --- a/rsem-prepare-reference.html +++ /dev/null @@ -1,253 +0,0 @@ - - - - -rsem-prepare-reference - - - - - - - - - - -

NAME

- -

rsem-prepare-reference

- -

SYNOPSIS

- -

rsem-prepare-reference [options] reference_fasta_file(s) reference_name

- -

ARGUMENTS

- -
- -
reference_fasta_file(s)
-
- -

Either a comma-separated list of Multi-FASTA formatted files OR a directory name. If a directory name is specified, RSEM will read all files with suffix ".fa" or ".fasta" in this directory. The files should contain either the sequences of transcripts or an entire genome, depending on whether the --gtf option is used.

- -
-
reference_name
-
- -

The name of the reference used. RSEM will generate several reference-related files that are prefixed by this name. This name can contain path information (e.g. /ref/mm9).

- -
-
- -

OPTIONS

- -
- -
--gtf <file>
-
- -

If this option is on, RSEM assumes that 'reference_fasta_file(s)' contains the sequence of a genome, and will extract transcript reference sequences using the gene annotations specified in <file>, which should be in GTF format.

- -

If this option is off, RSEM will assume 'reference_fasta_file(s)' contains the reference transcripts. In this case, RSEM assumes that name of each sequence in the Multi-FASTA files is its transcript_id.

- -

(Default: off)

- -
-
--transcript-to-gene-map <file>
-
- -

Use information from <file> to map from transcript (isoform) ids to gene ids. Each line of <file> should be of the form:

- -

gene_id transcript_id

- -

with the two fields separated by a tab character.

- -

If you are using a GTF file for the "UCSC Genes" gene set from the UCSC Genome Browser, then the "knownIsoforms.txt" file (obtained from the "Downloads" section of the UCSC Genome Browser site) is of this format.

- -

If this option is off, then the mapping of isoforms to genes depends on whether the --gtf option is specified. If --gtf is specified, then RSEM uses the "gene_id" and "transcript_id" attributes in the GTF file. Otherwise, RSEM assumes that each sequence in the reference sequence files is a separate gene.

- -

(Default: off)

- -
-
--allele-to-gene-map <file>
-
- -

Use information from <file> to provide gene_id and transcript_id information for each allele-specific transcript. Each line of <file> should be of the form:

- -

gene_id transcript_id allele_id

- -

with the fields separated by a tab character.

- -

This option is designed for quantifying allele-specific expression. It is only valid if '--gtf' option is not specified. allele_id should be the sequence names presented in the Multi-FASTA-formatted files.

- -

(Default: off)

- -
-
--polyA
-
- -

Add poly(A) tails to the end of all reference isoforms. The length of poly(A) tail added is specified by '--polyA-length' option. STAR aligner users may not want to use this option. (Default: do not add poly(A) tail to any of the isoforms)

- -
-
--polyA-length <int>
-
- -

The length of the poly(A) tails to be added. (Default: 125)

- -
-
--no-polyA-subset <file>
-
- -

Only meaningful if '--polyA' is specified. Do not add poly(A) tails to those transcripts listed in <file>. <file> is a file containing a list of transcript_ids. (Default: off)

- -
-
--bowtie
-
- -

Build Bowtie indices. (Default: off)

- -
-
--bowtie-path <path>
-
- -

The path to the Bowtie executables. (Default: the path to Bowtie executables is assumed to be in the user's PATH environment variable)

- -
-
--bowtie2
-
- -

Build Bowtie 2 indices. (Default: off)

- -
-
--bowtie2-path <path>
-
- -

The path to the Bowtie 2 executables. (Default: the path to Bowtie 2 executables is assumed to be in the user's PATH environment variable)

- -
-
--star
-
- -

Build STAR indices. (Default: off)

- -
-
--star-path <path>
-
- -

The path to STAR's executable. (Default: the path to the STAR executable is assumed to be in the user's PATH environment variable)

- -
-
--star-sjdboverhang <int>
-
- -

Length of the genomic sequence around annotated junctions. It is only used by STAR to build its splice junction database and is not needed for Bowtie or Bowtie 2. It will be passed as the --sjdbOverhang option to STAR. According to STAR's manual, its ideal value is max(ReadLength)-1, e.g. for 2x101 paired-end reads, the ideal value is 101-1=100. In most cases, the default value of 100 will work as well as the ideal value. (Default: 100)

- -
-
-p/--num-threads <int>
-
- -

Number of threads to use for building STAR's genome indices. (Default: 1)

- -
-
-q/--quiet
-
- -

Suppress the output of logging information. (Default: off)

- -
-
-h/--help
-
- -

Show help information.
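The tab-separated layout expected by '--transcript-to-gene-map' (gene_id first, transcript_id second) is easy to get wrong, for example by separating the fields with spaces. The following is a minimal sketch of a pre-flight check, not part of RSEM; the function name 'check_t2g_map' and the sample ids are hypothetical.

```shell
#!/bin/sh
# check_t2g_map FILE
# Succeeds only if every line of FILE has exactly two non-empty fields
# separated by a single tab: gene_id, then transcript_id.
# Hypothetical helper for illustration; RSEM does not ship it.
check_t2g_map() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected gene_id<TAB>transcript_id\n", NR
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example: one gene with two isoforms and one gene with a single isoform.
map_file=$(mktemp)
printf 'geneA\ttxA1\ngeneA\ttxA2\ngeneB\ttxB1\n' > "$map_file"
check_t2g_map "$map_file" && echo "map file looks well-formed"
rm -f "$map_file"
```

The check only looks at the field layout; it does not verify that the transcript ids actually occur in the FASTA or GTF input.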

- -
-
- -

DESCRIPTION

- -

This program extracts/preprocesses the reference sequences for RSEM. It can optionally build Bowtie indices (with '--bowtie' option) and/or Bowtie 2 indices (with '--bowtie2' option) using their default parameters. It can also optionally build STAR indices (with '--star' option) using parameters from ENCODE3's STAR-RSEM pipeline. If an alternative aligner is to be used, indices for that particular aligner can be built from either 'reference_name.idx.fa' or 'reference_name.n2g.idx.fa' (see OUTPUT for details). This program is used in conjunction with the 'rsem-calculate-expression' program.

- -

OUTPUT

- -

This program will generate 'reference_name.grp', 'reference_name.ti', 'reference_name.transcripts.fa', 'reference_name.seq', 'reference_name.chrlist' (if '--gtf' is on), 'reference_name.idx.fa', 'reference_name.n2g.idx.fa', optional Bowtie/Bowtie 2 index files, and optional STAR index files.

- -

'reference_name.grp', 'reference_name.ti', 'reference_name.seq', and 'reference_name.chrlist' are used by RSEM internally.

- -

'reference_name.transcripts.fa' contains the extracted reference transcripts in Multi-FASTA format. Poly(A) tails are not added and it may contain lower case bases in its sequences if the corresponding genomic regions are soft-masked.

- -

'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' are used by aligners to build their own indices. In these two files, all sequence bases are converted into upper case. In addition, poly(A) tails are added if '--polyA' option is set. The only difference between 'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' is that 'reference_name.n2g.idx.fa' in addition converts all 'N' characters to 'G' characters. This conversion is in particular desired for aligners (e.g. Bowtie) that do not allow reads to overlap with 'N' characters in the reference sequences. Otherwise, 'reference_name.idx.fa' should be used to build the aligner's index files. RSEM uses 'reference_name.idx.fa' to build Bowtie 2 indices and 'reference_name.n2g.idx.fa' to build Bowtie indices. For visualizing the transcript-coordinate-based BAM files generated by RSEM in IGV, 'reference_name.idx.fa' should be imported as a "genome" (see Visualization section in README.md for details).
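To make the relationship between the two index FASTA files concrete, here is a rough sketch of the transformation described above: upper-casing every base and turning 'N' into 'G', while leaving FASTA header lines alone. It is an illustration only and not RSEM's own code; the function name 'to_n2g' is hypothetical, and poly(A) handling is not shown.

```shell
#!/bin/sh
# to_n2g [FILE...]
# Reads Multi-FASTA from the given files (or stdin), prints header lines
# unchanged, and prints sequence lines upper-cased with every N replaced by G.
# Hypothetical helper for illustration; RSEM does not ship it.
to_n2g() {
    awk '
        /^>/ { print; next }          # header line: keep as is
        {
            s = toupper($0)           # soft-masked bases become upper case
            gsub(/N/, "G", s)         # ambiguous bases become G
            print s
        }
    ' "$@"
}

# Example: a soft-masked sequence that contains ambiguous bases.
printf '>tx1 example\nacgtnNac\n' | to_n2g
```

For the example input, the sequence line 'acgtnNac' comes out as 'ACGTGGAC' and the header is unchanged.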

- -

EXAMPLES

- -

1) Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version of the mouse genome. We have downloaded the UCSC Genes transcript annotations in GTF format (as mm9.gtf) using the Table Browser and the knownIsoforms.txt file for mm9 from the UCSC Downloads. We also have all chromosome files for mm9 in the directory '/data/mm9'. We want to put the generated reference files under '/ref' with name 'mouse_0'. We do not add any poly(A) tails. Please note that GTF files generated from UCSC's Table Browser do not contain isoform-gene relationship information. For the UCSC Genes annotation, this information can be obtained from the knownIsoforms.txt file. Suppose we want to build Bowtie indices and Bowtie executables are found in '/sw/bowtie'.

- -

There are two ways to write the command:

- -
 rsem-prepare-reference --gtf mm9.gtf \
-                        --transcript-to-gene-map knownIsoforms.txt \
-                        --bowtie \
-                        --bowtie-path /sw/bowtie \
-                        /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
-                        /ref/mouse_0
- -

OR

- -
 rsem-prepare-reference --gtf mm9.gtf \
-                        --transcript-to-gene-map knownIsoforms.txt \
-                        --bowtie \
-                        --bowtie-path /sw/bowtie \
-                        /data/mm9 \
-                        /ref/mouse_0
- -

2) Suppose we also want to build Bowtie 2 indices in the above example, and Bowtie 2 executables are found in '/sw/bowtie2'. The command will be:

- -
 rsem-prepare-reference --gtf mm9.gtf \
-                        --transcript-to-gene-map knownIsoforms.txt \
-                        --bowtie \
-                        --bowtie-path /sw/bowtie \
-                        --bowtie2 \
-                        --bowtie2-path /sw/bowtie2 \
-                        /data/mm9 \
-                        /ref/mouse_0
- -

3) Suppose we want to build STAR indices in the above example and save index files under '/ref' with name 'mouse_0'. Assuming STAR executable is '/sw/STAR', the command will be:

- -
 rsem-prepare-reference --gtf mm9.gtf \
-                        --star \
-                        --star-path /sw/STAR \
-                        -p 8 \
-                        /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
-                        /ref/mouse_0
- -

OR

- -
 rsem-prepare-reference --gtf mm9.gtf \
-                        --star \
-                        --star-path /sw/STAR \
-                        -p 8 \
-                        /data/mm9 \
-                        /ref/mouse_0
- -

STAR genome index files will be saved under '/ref/'.
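If the reads are not 101 bases long, the value for '--star-sjdboverhang' can be derived from the data using the max(ReadLength)-1 rule quoted in the option description. The sketch below assumes an uncompressed FASTQ file with exactly four lines per record; the function name 'sjdb_overhang' is hypothetical.

```shell
#!/bin/sh
# sjdb_overhang [FASTQ...]
# Prints max(ReadLength) - 1 over all reads in the given uncompressed,
# four-lines-per-record FASTQ files (or stdin).
# Hypothetical helper for illustration; RSEM does not ship it.
sjdb_overhang() {
    awk '
        NR % 4 == 2 {                         # second line of each record is the sequence
            if (length($0) > max) max = length($0)
        }
        END { print max - 1 }
    ' "$@"
}

# Example: two reads of length 5 and 8, so the result is 8 - 1 = 7.
printf '@r1\nACGTA\n+\nIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' | sjdb_overhang
```

The printed number could then be supplied as the argument of '--star-sjdboverhang'. For paired-end data, run the helper over both mate files so the longest mate is taken into account.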

- -

4) Suppose we only have transcripts from EST tags stored in 'mm9.fasta' and isoform-gene information stored in 'mapping.txt'. We want to add 125bp long poly(A) tails to all transcripts. The reference_name is set as 'mouse_125'. In addition, we do not want to build Bowtie/Bowtie 2 indices, and will use an alternative aligner to align reads against either 'mouse_125.idx.fa' or 'mouse_125.n2g.idx.fa':

- -
 rsem-prepare-reference --transcript-to-gene-map mapping.txt \
-                        --polyA \
-                        mm9.fasta \
-                        mouse_125
- - - - - - - diff --git a/rsem-run-ebseq.html b/rsem-run-ebseq.html deleted file mode 100644 index 0f233ec..0000000 --- a/rsem-run-ebseq.html +++ /dev/null @@ -1,125 +0,0 @@ - - - - -rsem-run-ebseq - - - - - - - - - - -

NAME

- -

rsem-run-ebseq

- -

SYNOPSIS

- -

rsem-run-ebseq [options] data_matrix_file conditions output_file

- -

ARGUMENTS

- -
- -
data_matrix_file
-
- -

This file is an m by n matrix, where m is the number of genes/transcripts and n is the total number of samples. Each element in the matrix represents the expected count for a particular gene/transcript in a particular sample. Users can use 'rsem-generate-data-matrix' to generate this file from expression result files.

- -
-
conditions
-
- -

Comma-separated list of values representing the number of replicates for each condition. For example, "3,3" means the data set contains 2 conditions and each condition has 3 replicates. "2,3,3" means the data set contains 3 conditions, with 2, 3, and 3 replicates for each condition respectively.

- -
-
output_file
-
- -

Output file name.
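The 'conditions' string encodes both the number of conditions and, implicitly, the total number of samples, which presumably has to match the n columns of 'data_matrix_file'. A small sketch for sanity-checking a conditions string before a run follows; the function name 'describe_conditions' is hypothetical.

```shell
#!/bin/sh
# describe_conditions "2,3,3"
# Prints "<number of conditions> <total number of samples>" for a
# comma-separated list of replicate counts.
# Hypothetical helper for illustration; RSEM does not ship it.
describe_conditions() {
    printf '%s\n' "$1" | awk -F ',' '{
        total = 0
        for (i = 1; i <= NF; i++) total += $i   # sum replicates over conditions
        print NF, total
    }'
}

describe_conditions "3,3"      # 2 conditions, 6 samples in total
describe_conditions "2,3,3"    # 3 conditions, 8 samples in total
```

Comparing the second number with the number of sample columns in the data matrix catches a mistyped conditions string early.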

- -
-
- -

OPTIONS

- -
- -
--ngvector <file>
-
- -

This option provides the grouping information required by EBSeq for isoform-level differential expression analysis. The file can be generated by 'rsem-generate-ngvector'. Turning this option on is highly recommended for isoform-level differential expression analysis. (Default: off)

- -
-
-h/--help
-
- -

Show help information.

- -
-
- -

DESCRIPTION

- -

This program is a wrapper over EBSeq. It performs differential expression analysis and can work on two or more conditions. All genes/transcripts and their associated statistics are reported in one output file. This program does not control the false discovery rate or call differentially expressed genes/transcripts. Please use 'rsem-control-fdr' to control the false discovery rate after this program has finished.

- -

OUTPUT

- -
- -
output_file
-
- -

This file reports the calculated statistics for all genes/transcripts. It is written as a matrix with row and column names. The row names are the genes'/transcripts' names. The column names are for the reported statistics.

- -

If there are only 2 different conditions among the samples, four statistics (columns) will be reported for each gene/transcript. They are "PPEE", "PPDE", "PostFC" and "RealFC". "PPEE" is the posterior probability (estimated by EBSeq) that a gene/transcript is equally expressed. "PPDE" is the posterior probability that a gene/transcript is differentially expressed. "PostFC" is the posterior fold change (condition 1 over condition 2) for a gene/transcript. It is defined as the ratio between posterior mean expression estimates of the gene/transcript for each condition. "RealFC" is the real fold change (condition 1 over condition 2) for a gene/transcript. It is the ratio of the normalized within-condition-1 mean count to the normalized within-condition-2 mean count for the gene/transcript. Fold changes are calculated using EBSeq's 'PostFC' function. The genes/transcripts are reported in descending order of their "PPDE" values.

- -

If there are more than 2 different conditions among the samples, the output format is different. For differential expression analysis with more than 2 conditions, EBSeq will enumerate all possible expression patterns (on which conditions are equally expressed and which conditions are not). If there are k different patterns, the first k columns of the output file give the posterior probability that each expression pattern is true. Patterns are defined in a separate file, 'output_file.pattern'. Column k+1 gives the maximum a posteriori (MAP) expression pattern for each gene/transcript. Column k+2 gives the posterior probability that not all conditions are equally expressed (column name "PPDE"). The genes/transcripts are reported in descending order of their "PPDE" column values. For details on how EBSeq works for more than 2 conditions, please refer to EBSeq's manual.

- -
-
output_file.pattern
-
- -

This file is only generated when there are more than 2 conditions. It defines all possible expression patterns over the conditions using a matrix with names. Each row of the matrix refers to a different expression pattern and each column gives the expression status of a different condition. Two conditions are equally expressed if and only if their statuses are the same.

- -
-
output_file.condmeans
-
- -

This file is only generated when there are more than 2 conditions. It gives the normalized mean count value for each gene/transcript at each condition. It is formatted as a matrix with names. Each row represents a gene/transcript and each column represents a condition. The order of genes/transcripts is the same as in 'output_file'. This file can be used to calculate fold changes between any conditions of interest.
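As a rough sketch of how fold changes might be derived from 'output_file.condmeans', the helper below divides one condition's mean by another's for every row. It assumes a tab-separated file whose first line holds the column names and whose remaining lines start with the gene/transcript name followed by one value per condition; that layout is an assumption and should be checked against a real file first. The function name 'fold_change' is hypothetical.

```shell
#!/bin/sh
# fold_change FILE A B
# For every gene/transcript row, prints its name and mean(A) / mean(B),
# where A and B are 1-based condition numbers. Prints NA when the
# denominator is zero.
# Hypothetical helper for illustration; RSEM does not ship it.
fold_change() {
    awk -F '\t' -v a="$2" -v b="$3" '
        NR == 1 { next }                      # skip the column-name row
        {
            num = $(a + 1)                    # field 1 is the row name
            den = $(b + 1)
            if (den == 0) print $1 "\tNA"
            else printf "%s\t%.4f\n", $1, num / den
        }
    ' "$1"
}

# Example with three conditions and two genes (made-up numbers).
means=$(mktemp)
printf 'C1\tC2\tC3\ngeneA\t10\t20\t40\ngeneB\t5\t0\t5\n' > "$means"
fold_change "$means" 3 1    # condition 3 over condition 1
rm -f "$means"
```

Because these are ratios of normalized mean counts, they correspond to the "real" fold change described for the two-condition case rather than to a posterior fold change.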

- -
-
- -

EXAMPLES

- -

1) We're interested in isoform-level differential expression analysis and there are two conditions. Each condition has 5 replicates. We have already collected the data matrix as 'IsoMat' and generated ngvector as 'ngvector.ngvec':

- -
 rsem-run-ebseq --ngvector ngvector.ngvec IsoMat 5,5 IsoMat.results
- -

The results will be in 'IsoMat.results'.

- -

2) We're interested in gene-level analysis and there are 3 conditions. The first condition has 3 replicates and the other two have 4 replicates each. The data matrix is named 'GeneMat':

- -
 rsem-run-ebseq GeneMat 3,4,4 GeneMat.results
- -

Three files, 'GeneMat.results', 'GeneMat.results.pattern', and 'GeneMat.results.condmeans', will be generated.

- - - - - - - diff --git a/updates.html b/updates.html deleted file mode 100644 index d20a5b9..0000000 --- a/updates.html +++ /dev/null @@ -1,62 +0,0 @@ - - -Update information for RSEM - - -

Updates

-
-

Jul 27, 2015   RSEM v1.2.22 is online now. Added options to run the STAR aligner.

-

May 6, 2015   RSEM v1.2.21 is online now. Strip read names of extra words to avoid mismatches of paired-end read names.

-

Mar 23, 2015   RSEM v1.2.20 is online now. Fixed a problem that can lead to assertion error if any paired-end read's insert size > 32767 (by changing the type of insertL in PairedEndHit.h from short to int).

-

Nov 5, 2014   RSEM v1.2.19 is online now. Modified 'rsem-prepare-reference' such that by default it does not add any poly(A) tails. To add poly(A) tails, use '--polyA' option. Added an annotation of the 'sample_name.stat/sample_name.cnt' file, see 'cnt_file_description.txt'.

-

Sept 29, 2014   RSEM v1.2.18 is online now. Only generate warning message if two mates of a read pair have different names. Only parse attributes of a GTF record if its feature is "exon" to avoid unnecessary warning messages.

-

Sept 4, 2014   RSEM v1.2.17 is online now. Added error detection for cases such as a read's two mates having different names or a read is both alignable and unalignable.

-

Aug 18, 2014   RSEM v1.2.16 is online now. Corrected a typo in 'rsem-generate-data-matrix', this script extracts 'expected_count' column instead of 'TPM' column.

-

Jun 16, 2014   RSEM v1.2.15 is online now. Allowed for a subset of reference sequences to be declared in an input SAM/BAM file. For any transcript not declared in the SAM/BAM file, its PME estimates and credibility intervals are set to zero. Added advanced options for customizing Gibbs sampler and credibility interval calculation behaviors. Split options in 'rsem-calculate-expression' into basic and advanced options.

-

Jun 8, 2014   RSEM v1.2.14 is online now. Changed RSEM's behaviors for building Bowtie/Bowtie 2 indices. In 'rsem-prepare-reference', '--no-bowtie' and '--no-ntog' options are removed. By default, RSEM does not build either Bowtie or Bowtie 2 indices. Instead, it generates two index Multi-FASTA files, 'reference_name.idx.fa' and 'reference_name.n2g.idx.fa'. Compared to the former file, the latter one in addition converts all 'N's into 'G's. These two files can be used to build aligner indices for customized aligners. In addition, 'reference_name.transcripts.fa' does not have poly(A) tails added. To enable RSEM to build Bowtie/Bowtie 2 indices, '--bowtie' or '--bowtie2' must be set explicitly. The most significant benefit of this change is that now we can build Bowtie and Bowtie 2 indices simultaneously by turning both '--bowtie' and '--bowtie2' on. Type 'rsem-prepare-reference --help' for more information. If transcript coordinate files are visualized using IGV, 'reference_name.idx.fa' should be imported as a genome (instead of 'reference_name.transcripts.fa'). For more information, see the third subsection of Visualization in 'README.md'. Modified RSEM perl scripts so that the RSEM directory will be added at the beginning of the PATH variable. This also means RSEM will try to use its own samtools first. Added --seed option to set random number generator seeds in 'rsem-calculate-expression'. Added posterior standard deviation of counts as output if either '--calc-pme' or '--calc-ci' is set. Updated boost to v1.55.0. Renamed makefile as Makefile. If '--output-genome-bam' is set, in the genome BAM file, each alignment's 'MD' field will be adjusted to match the CIGAR string. 'XS:A:value' field is required by Cufflinks for spliced alignments. If '--output-genome-bam' is set, in the genome BAM file, first each alignment's 'XS' field will be deleted. Then if the alignment is a spliced alignment, an 'XS:A:value' field will be added accordingly.
Added instructions for users who want to put all RSEM executables into a bin directory (see Compilation & Installation section of 'README.md').

-

May 26, 2014   RSEM v1.2.13 is online now. Allowed users to use the SAMtools in the PATH first and enabled RSEM to find its executables via a symbolic link. Changed the behavior of parsing GTF files. Now if a GTF line's feature is not "exon" and it does not contain a "gene_id" or "transcript_id" attribute, only a warning message will be produced (instead of RSEM failing).

-

Mar 27, 2014   RSEM v1.2.12 is online now. Enabled allele-specific expression estimation. Added '--calc-pme' option for 'rsem-calculate-expression' to calculate posterior mean estimates only (no credibility intervals). Modified the shebang line of RSEM perl scripts to make them more portable. Added '--seed' option for 'rsem-simulate-reads' to let users set the seed of the random number generator used by the simulation. Modified the transcript extraction behavior of 'rsem-prepare-reference'. For transcripts that cannot be extracted, a warning is produced instead of failing the whole script, and those transcripts are ignored.

-

Feb 14, 2014   RSEM v1.2.11 is online now. Enabled RSEM to use Bowtie 2 aligner (indel, local and discordant alignments are not supported yet). Changed option names '--bowtie-phred33-quals', '--bowtie-phred64-quals' and '--bowtie-solexa-quals' back to '--phred33-quals', '--phred64-quals' and '--solexa-quals'.

-

Jan 31, 2014   RSEM v1.2.10 is online now. Fixed a bug which will lead to out-of-memory error when RSEM computes ngvector for EBSeq.

-

Jan 8, 2014   RSEM v1.2.9 is online now. Fixed a compilation error problem in Mac OS. Fixed a problem in makefile that affects 'make ebseq'. Added 'model_file_description.txt', which describes the format and meanings of file 'sample_name.stat/sample_name.model'. Updated samtools to version 0.1.19.

-

Nov 22, 2013   RSEM v1.2.8 is online now. Provided a more detailed description for how to simulate RNA-Seq data using 'rsem-simulate-reads'. Provided more user-friendly error message if RSEM fails to extract transcript sequences due to the failure of reading certain chromosome sequences.

-

Sept 25, 2013   RSEM v1.2.7 has a minor update. One line is added to the 'WHAT_IS_NEW' file to reflect a change made but forgotten to put in to 'WHAT_IS_NEW'. The line added is "Renamed '--phred33-quals', '--phred64-quals', and '--solexa-quals' in 'rsem-calculate-expression' to '--bowtie-phred33-quals', '--bowtie-phred64-quals', and '--bowtie-solex-quals' to avoid confusion".

-

Sept 7, 2013   RSEM v1.2.7 is online now. 'rsem-find-DE' is replaced by 'rsem-run-ebseq' and 'rsem-control-fdr' for a more friendly user experience. Added support for differential expression testing on more than 2 conditions in RSEM's EBSeq wrappers 'rsem-run-ebseq' and 'rsem-control-fdr'.

-

Jul 31, 2013   RSEM v1.2.6 is online now. Install the latest version of EBSeq from Bioconductor and if fails, try to install EBSeq v1.1.5 locally. Fixed a bug in 'rsem-gen-transcript-plots', which makes 'rsem-plot-transcript-wiggles' fail.

-

Jun 26, 2013   RSEM v1.2.5 is online now. Updated EBSeq from v1.1.5 to v1.1.6 . Fixed a bug in 'rsem-generate-data-matrix', which can cause 'rsem-find-DE' to crash.

-

Apr 15, 2013   RSEM v1.2.4 is online now. Fixed a bug that leads to poor parallelization performance in Mac OS systems. Fixed a problem that may halt the 'rsem-gen-transcript-plots', thanks Han Lin for pointing out the problem and suggesting possible fixes. Added some user-friendly error messages for converting transcript BAM files into genomic BAM files. Modified rsem-tbam2gbam so that the original alignment quality MAPQ will be preserved if the input bam is not from RSEM. Added user-friendly error messages if users forget to compile the source codes.

-

Jan 8, 2013   RSEM v1.2.3 is online now. Fixed a bug in 'EBSeq/rsem-for-ebseq-generate-ngvector-from-clustering-info' which may crash the script.

-

Jan 8, 2013   RSEM v1.2.2 is online now. Updated EBSeq to v1.1.5 . Modified 'rsem-find-DE' to generate extra output files (type 'rsem-find-DE' to see more information).

-

Nov 29, 2012   RSEM v1.2.1 is online now. Added poly(A) tails to 'reference_name.transcripts.fa' so that the RSEM generated transcript unsorted BAM file can be fed into RSEM as an input file. However, users need to rebuild their references if they want to visualize the transcript level wiggle files and BAM files using IGV. Modified 'rsem-tbam2gbam' to convert users' alignments from transcript BAM files into genome BAM files, provided users use 'reference_name.idx.fa' to build indices for their aligners. Updated EBSeq from v1.1.3 to v1.1.4. Corrected several typos in warning messages.

-

Sept 11, 2012   RSEM v1.2.0 is online now. Changed output formats, added an FPKM field, etc. Fixed a bug related to paired-end read data. Added a script to run EBSeq automatically and updated EBSeq to v1.1.3.

-

Jul 2, 2012   RSEM v1.1.21 is online now. Removed optional field "Z0:A:!" in the BAM outputs. Added --no-fractional-weight option to rsem-bam2wig, if the BAM file is not generated by RSEM, this option is recommended to be set. Fixed a bug for generating transcript level wiggle files using 'rsem-plot-transcript-wiggles'.

-

Jun 5, 2012   RSEM v1.1.20 is online now. Added an option to set the temporary folder name. Removed sample_name.sam.gz. Instead, RSEM uses samtools to convert bowtie outputted SAM file into a BAM file under the temporary folder. RSEM generated BAM files now contains all alignment lines produced by bowtie or user-specified aligners, including unalignable reads. Please note that for paired-end reads, if one mate has alignments but the other does not, RSEM will mark the alignable mate as "unmappable" (flag bit 0x4) and append an optional field "Z0:A:!".

-

Apr 26, 2012   RSEM v1.1.19 is online now. Allowed > 2^31 hits. Added some instructions on how to visualize transcript coordinate BAM/WIG files using IGV. Included EBSeq for downstream differential expression analysis.

-

Mar 11, 2012   RSEM v1.1.18-modified is online now. This modified version solved a compilation problem for GCC version 4.5.3 or higher.

-

Mar 11, 2012   RSEM v1.1.18 is online now. Added some user-friendly error messages. Added program 'rsem-sam-validator', users can use this program to check if RSEM can process their SAM/BAM files. Modified 'convert-sam-for-rsem' so that this program will convert users' SAM/BAM files into acceptable BAM files for RSEM.

-

Feb 8, 2012   RSEM v1.1.17 is online now. Fixed a bug related to parallelization of credibility interval calculation. Added --no-bam-output option to rsem-calculate-expression. The order of @SQ tags in SAM/BAM files can be arbitrary now.

-

Jan 31, 2012   RSEM v1.1.16 is online now. Added --time option to show time consumed by each phase. Moved the alignment file out of the temporary folder. Enabled pthreads for calculating credibility intervals.

-

Jan 26, 2012   RSEM v1.1.15 is online now. Fixed several bugs causing compilation error. Modified samtools' Makefile for cygwin. For cygwin users, please uncomment the 4th and 8th lines in sam/Makefile before compiling RSEM.

-

Jan 17, 2012   RSEM v1.1.14 is online now. Several things are updated, including transcript coordinate wiggle plot generation. For details, please see what_is_new included in the package.

-

Nov 5, 2011   RSEM v1.1.13 is online now. Speed up EM algorithm by only updating model parameters for first 10 iterations. Skip reads with its (or at least one of its mates', for paired-end reads) length < 25bp.

-

Oct 18, 2011   RSEM v1.1.12 is online now. Add a script for generating transcript-to-gene-map file from Trinity output. Claim the minimum seed length explicitly. Allow empty sequences in the reference fasta file.

-

Aug 25, 2011   RSEM v1.1.11 is online now. A bug related to expected counts calculation is fixed. Allow spaces for field seqname and attributes gene_id and transcript_id in GTF files. This version changed reference indices format used by RSEM, so please rebuild the references by using 'rsem-prepare-reference'.

-

Aug 10, 2011   RSEM v1.1.10 is online now. Add some user-friendly error messages.

-

May 27, 2011   RSEM v1.1.9 is online now. "rsem-plot-model" is modified. A bug related to simulation is fixed.

-

Apr 18, 2011   RSEM v1.1.8 is online now. Added sanity checks for the user-chosen aligner, including whether the reference used and the alignments produced follow RSEM's requirements. Added a new plot for "rsem-plot-model", a histogram of reads with different numbers of alignments. A bug in "rsem-plot-model" is fixed.

-

Apr 12, 2011   RSEM v1.1.7 is online now. A bug under the transcript set model (i.e. using a transcript set as reference) is fixed. This bug led to a crash under the transcript set model if the user tried to generate BAM format output.

-

Apr 7, 2011   RSEM v1.1.6 is online now. One bug in 'rsem-plot-model' is fixed. A bug related to assigning task to multiple cpus is fixed.

-

Mar 21, 2011   RSEM v1.1.5 is online now. Several bugs are fixed. One critical bug relates to setting the correct probability of generating a read from the forward strand. If users let RSEM run bowtie and do not set the "--strand-specific" option, the bug would make RSEM throw away all alignments aligned to the forward strand (by setting "--forward-prob" to 0). The other two are related to confidence interval calculation. These bugs have existed since RSEM v1.1.0. It is important for users to update to this version.

-

Mar 10, 2011   RSEM v1.1.4 is online now. Provided a script for visualizing the model learned from data.

-

Mar 9, 2011   RSEM v1.1.3 is online now. Fixed a bug related to output format. Granted read/execution permissions to files.

-

Feb 17, 2011   RSEM v1.1.2 is online now. There are two important changes. First, Normalized Read Fraction (nrf) values are eliminated from outputs and posterior mean counts are added. Second, a bug causing RSEM to fail when estimating RSPD is fixed. In addition, RSEM now supports SAM Spec v1.3 as well as the old specs.

-

Feb 10, 2011   RSEM v1.1.1 is online now. Bugs for single end read model are fixed.

-

Feb 8, 2011   RSEM v1.1.0 is online now. The interfaces are redesigned, correction scheme added and bugs for generating wiggle plots are fixed. Please re-download if you have downloaded this version before Feb 9, 2011. Help information is changed a bit.

-

Nov 18, 2010   RSEM v1.0.8 is online now. RSEM can calculate credibility intervals now.

-

Oct 31, 2010   RSEM v1.0.7 is online now. Fix one bug for 32bit machines and one underflow problem.

-

Sept 27, 2010   RSEM v1.0.6 is online now. Fix a bug in rsem-simulate-reads.

-

Sept 24, 2010   RSEM v1.0.5 is online now. This is a brand new version of RSEM! The previous version of RSEM is obsolete. Old page is here.

-
- -