diff --git a/README.html b/README.html
index efbd8f2..4a53c60 100644
--- a/README.html
+++ b/README.html
@@ -1,14 +1,3 @@
- - - - - - - - - -

README for RSEM

Bo Li (bli at cs dot wisc dot edu)

@@ -55,14 +44,16 @@

Com

To compile RSEM, simply run

-
make
+
make
+

For Cygwin users, please uncomment the 3rd and 7th lines in ‘sam/Makefile’ before you run ‘make’.

To compile EBSeq, which is included in the RSEM package, run

-
make ebseq
+
make ebseq
+

To install, simply put the rsem directory in your environment’s PATH variable.

@@ -86,18 +77,19 @@

Usage

I. Preparing Reference Sequences

RSEM can extract reference transcripts from a genome if you provide it
-with gene annotations in a GTF file. Alternatively, you can provide
+with gene annotations in a GTF file. Alternatively, you can provide
RSEM with transcript sequences directly.

Please note that GTF files generated from the UCSC Table Browser do not
-contain isoform-gene relationship information. However, if you use the
+contain isoform-gene relationship information. However, if you use the
UCSC Genes annotation track, this information can be recovered by downloading the knownIsoforms.txt file for the appropriate genome.

To prepare the reference sequences, you should run the
-‘rsem-prepare-reference’ program. Run

+‘rsem-prepare-reference’ program. Run

-
rsem-prepare-reference --help
+
rsem-prepare-reference --help
+

to get usage information or visit the rsem-prepare-reference documentation page.

@@ -105,9 +97,10 @@

I. Preparing Reference Sequences

II. Calculating Expression Values

To calculate expression values, you should run the
-‘rsem-calculate-expression’ program. Run

+‘rsem-calculate-expression’ program. Run

-
rsem-calculate-expression --help
+
rsem-calculate-expression --help
+

to get usage information or visit the rsem-calculate-expression documentation page.

@@ -116,49 +109,49 @@

Calculating expression va

For single-end models, users have the option of providing a fragment length distribution via the ‘–fragment-length-mean’ and
-‘–fragment-length-sd’ options. The specification of an accurate fragment
+‘–fragment-length-sd’ options. The specification of an accurate fragment
length distribution is important for the accuracy of expression level
-estimates from single-end data. If the fragment length mean and sd are
+estimates from single-end data. If the fragment length mean and sd are
not provided, RSEM will not take a fragment length distribution into consideration.

Using an alternative aligner

By default, RSEM automates the alignment of reads to reference
-transcripts using the Bowtie alignment program. Turn on ‘–bowtie2’
-for ‘rsem-prepare-reference’ and ‘rsem-calculate-expression’ will
-allow RSEM to use the Bowtie 2 alignment program instead. Please note
-that indel alignments, local alignments and discordant alignments are
+transcripts using the Bowtie aligner. Turn on ‘–bowtie2’ for
+‘rsem-prepare-reference’ and ‘rsem-calculate-expression’ will allow
+RSEM to use the Bowtie 2 alignment program instead. Please note that
+indel alignments, local alignments and discordant alignments are
disallowed when RSEM uses Bowtie 2 since RSEM currently cannot handle them. See the description of ‘–bowtie2’ option in
-‘rsem-calculate-expression’ for more details. To use an alternative
-alignment program, align the input reads against the file
+‘rsem-calculate-expression’ for more details. Similarly, turn on
+‘–star’ will allow RSEM to use the STAR aligner. To use an
+alternative alignment program, align the input reads against the file
‘reference_name.idx.fa’ generated by ‘rsem-prepare-reference’, and
-format the alignment output in SAM or BAM format. Then, instead of
+format the alignment output in SAM or BAM format. Then, instead of
providing reads to ‘rsem-calculate-expression’, specify the ‘–sam’ or
-‘–bam’ option and provide the SAM or BAM file as an argument. When
-using an alternative aligner, you may also want to provide the
-‘–no-bowtie’ option to ‘rsem-prepare-reference’ so that the Bowtie
-indices are not built.

+‘–bam’ option and provide the SAM or BAM file as an argument.

RSEM requires the alignments of a read to be adjacent. For paired-end reads, RSEM also requires the two mates of any alignment to be adjacent. To check whether your SAM/BAM file satisfies these requirements, please run

-
rsem-sam-validator <input.sam/input.bam>
+
rsem-sam-validator <input.sam/input.bam>
+

If your file does not satisfy the requirements, you can use ‘convert-sam-for-rsem’ to convert it into a BAM file which RSEM can process. Please run

-
convert-sam-for-rsem --help
+
convert-sam-for-rsem --help
+

to get usage information or visit the convert-sam-for-rsem documentation page.
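
As an illustration of the adjacency requirement described above, the following Python sketch checks whether all alignments of each read occupy one contiguous run of records. It is only a sketch, not how ‘rsem-sam-validator’ is implemented: the function name `reads_are_grouped` is hypothetical, the only assumption about the input is that the read name is the first tab-separated field of each non-header SAM line, and mate adjacency for paired-end reads is not checked.

```python
def reads_are_grouped(sam_lines):
    """Return True if every read name forms one contiguous run of records."""
    seen = set()     # read names whose run has already ended
    current = None   # read name of the run currently being scanned
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        name = line.split("\t", 1)[0]
        if name == current:
            continue  # still inside the same read's run
        if name in seen:
            return False  # this read reappears after another read intervened
        if current is not None:
            seen.add(current)
        current = name
    return True
```

A file for which this returns False would be a candidate for ‘convert-sam-for-rsem’; the authoritative check remains ‘rsem-sam-validator’.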

-

However, please note that RSEM does * not * support gapped +

However, please note that RSEM does ** not ** support gapped alignments. So make sure that your aligner does not produce alignments with insertions/deletions. Also, please make sure that you use ‘reference_name.idx.fa’, which is generated by RSEM, to build your
@@ -190,28 +183,30 @@

a) Converting transcript

Usage:

-
rsem-tbam2gbam reference_name unsorted_transcript_bam_input genome_bam_output
+
rsem-tbam2gbam reference_name unsorted_transcript_bam_input genome_bam_output
+
-

reference_name : The name of reference built by ‘rsem-prepare-reference’
-unsorted_transcript_bam_input : This file should satisfy: 1) the alignments of the same read are grouped together, 2) for any paired-end alignment, the two mates should be adjacent to each other, 3) this file should not be sorted by samtools
-genome_bam_output : The output genomic coordinate BAM file’s name

+

reference_name : The name of reference built by ‘rsem-prepare-reference’
+unsorted_transcript_bam_input : This file should satisfy: 1) the alignments of the same read are grouped together, 2) for any paired-end alignment, the two mates should be adjacent to each other, 3) this file should not be sorted by samtools
+genome_bam_output : The output genomic coordinate BAM file’s name

b) Generating a Wiggle file

A wiggle plot representing the expected number of reads overlapping each position in the genome/transcript set can be generated from the
-sorted genome/transcript BAM file output. To generate the wiggle
+sorted genome/transcript BAM file output. To generate the wiggle
plot, run the ‘rsem-bam2wig’ program on the
-‘sample_name.genome.sorted.bam’/’sample_name.transcript.sorted.bam’ file.

+‘sample_name.genome.sorted.bam’/‘sample_name.transcript.sorted.bam’ file.

-

Usage:

+

Usage:

-
rsem-bam2wig sorted_bam_input wig_output wiggle_name [--no-fractional-weight]
+
rsem-bam2wig sorted_bam_input wig_output wiggle_name [--no-fractional-weight]
+
-

sorted_bam_input : Input BAM format file, must be sorted
-wig_output : Output wiggle file’s name, e.g. output.wig
-wiggle_name : The name of this wiggle plot
-–no-fractional-weight : If this is set, RSEM will not look for the “ZW” tag and each alignment that appears in the BAM file has weight 1. Set this if your BAM file is not generated by RSEM. Please note that this option must be at the end of the command line

+

sorted_bam_input : Input BAM format file, must be sorted
+wig_output : Output wiggle file’s name, e.g. output.wig
+wiggle_name : The name of this wiggle plot
+–no-fractional-weight : If this is set, RSEM will not look for the “ZW” tag and each alignment that appears in the BAM file has weight 1. Set this if your BAM file is not generated by RSEM. Please note that this option must be at the end of the command line

c) Loading a BAM and/or Wiggle file into the UCSC Genome Browser or Integrative Genomics Viewer(IGV)

@@ -223,20 +218,22 @@

d) Generating Transcript Wiggle Plots

To generate transcript wiggle plots, you should run the
-‘rsem-plot-transcript-wiggles’ program. Run

+‘rsem-plot-transcript-wiggles’ program. Run

-
rsem-plot-transcript-wiggles --help
+
rsem-plot-transcript-wiggles --help
+

to get usage information or visit the rsem-plot-transcript-wiggles documentation page.

@@ -247,10 +244,11 @@

e) Visualize the model learned by RSEM<

Usage:

-
rsem-plot-model sample_name output_plot_file
+
rsem-plot-model sample_name output_plot_file
+
-

sample_name: the name of the sample analyzed
-output_plot_file: the file name for plots generated from the model. It is a pdf file

+

sample_name: the name of the sample analyzed
+output_plot_file: the file name for plots generated from the model. It is a pdf file

The plots generated depend on read type and user configuration. They may include fragment length distribution, mate length distribution,
@@ -263,7 +261,7 @@

e) Visualize the model learned by RSEM<

RSPD: Read Start Position Distribution. x-axis is bin number, y-axis is the probability of each bin. RSPD can be used as an indicator of 3’ bias

-

Quality score vs. observed quality given a reference base: x-axis is Phred quality scores associated with data, y-axis is the “observed quality”, Phred quality scores learned by RSEM from the data. Q = -10log_10(P), where Q is Phred quality score and P is the probability of sequencing error for a particular base

+

Quality score vs. observed quality given a reference base: x-axis is Phred quality scores associated with data, y-axis is the “observed quality”, Phred quality scores learned by RSEM from the data. Q = –10log_10(P), where Q is Phred quality score and P is the probability of sequencing error for a particular base

Position vs. percentage sequencing error given a reference base: x-axis is position and y-axis is percentage sequencing error
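
The relation Q = -10log_10(P) used for the quality-score plot above can be worked through numerically. The following Python sketch is only an illustration of that formula; the helper names `phred_to_error_prob` and `error_prob_to_phred` are hypothetical and are not part of RSEM.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q into an error probability P."""
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) into a Phred score Q."""
    return -10.0 * math.log10(p)
```

For example, a Phred score of 20 corresponds to an error probability of 0.01, and an error probability of 0.001 corresponds to a Phred score of 30.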

@@ -271,17 +269,17 @@

e) Visualize the model learned by RSEM<

Example

-

Suppose we download the mouse genome from UCSC Genome Browser. We do +

Suppose we download the mouse genome from UCSC Genome Browser. We do
not add poly(A) tails and use ‘/ref/mouse_0’ as the reference name. We have a FASTQ-formatted file, ‘mmliver.fq’, containing single-end
-reads from one sample, which we call ‘mmliver_single_quals’. We want
+reads from one sample, which we call ‘mmliver_single_quals’. We want
to estimate expression values by using the single-end model with a fragment length distribution. We know that the fragment length distribution is approximated by a normal distribution with a mean of 150 and a standard deviation of 35. We wish to generate 95% credibility intervals in addition to maximum likelihood estimates. RSEM will be allowed 1G of memory for the credibility interval
-calculation. We will visualize the probabilistic read mappings
+calculation. We will visualize the probabilistic read mappings
generated by RSEM on UCSC genome browser. We will generate a list of genes’ transcript wiggle plots in ‘output.pdf’. The list is ‘gene_ids.txt’. We will visualize the models learned in
@@ -293,68 +291,73 @@

Example

rsem-calculate-expression --bowtie-path /sw/bowtie --phred64-quals --fragment-length-mean 150.0 --fragment-length-sd 35.0 -p 8 --output-genome-bam --calc-ci --memory-allocate 1024 /data/mmliver.fq /ref/mouse_0 mmliver_single_quals
rsem-bam2wig mmliver_single_quals.sorted.bam mmliver_single_quals.sorted.wig mmliver_single_quals
rsem-plot-transcript-wiggles --gene-list --show-unique mmliver_single_quals gene_ids.txt output.pdf
-rsem-plot-model mmliver_single_quals mmliver_single_quals.models.pdf
+rsem-plot-model mmliver_single_quals mmliver_single_quals.models.pdf
+

Simulation

RSEM provides users the ‘rsem-simulate-reads’ program to simulate RNA-Seq data based on parameters learned from real data sets. Run

-
rsem-simulate-reads
+
rsem-simulate-reads
+

to get usage information or read the following subsections.

Usage:

-
rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results theta0 N output_name [-q]
+
rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results theta0 N output_name [-q]
+
-

reference_name: The name of RSEM references, which should be already generated by ‘rsem-prepare-reference’

+

reference_name: The name of RSEM references, which should be already generated by ‘rsem-prepare-reference’

-

estimated_model_file: This file describes how the RNA-Seq reads will be sequenced given the expression levels. It determines what kind of reads will be simulated (single-end/paired-end, w/o quality score) and includes parameters for fragment length distribution, read start position distribution, sequencing error models, etc. Normally, this file should be learned from real data using ‘rsem-calculate-expression’. The file can be found under the ‘sample_name.stat’ folder with the name of ‘sample_name.model’. ‘model_file_description.txt’ provides the format and meanings of this file.

+

estimated_model_file: This file describes how the RNA-Seq reads will be sequenced given the expression levels. It determines what kind of reads will be simulated (single-end/paired-end, w/o quality score) and includes parameters for fragment length distribution, read start position distribution, sequencing error models, etc. Normally, this file should be learned from real data using ‘rsem-calculate-expression’. The file can be found under the ‘sample_name.stat’ folder with the name of ‘sample_name.model’. ‘model_file_description.txt’ provides the format and meanings of this file.

-

estimated_isoform_results: This file contains expression levels for all isoforms recorded in the reference. It can be learned using ‘rsem-calculate-expression’ from real data. The corresponding file users want to use is ‘sample_name.isoforms.results’. If simulating from user-designed expression profile is desired, start from a learned ‘sample_name.isoforms.results’ file and only modify the ‘TPM’ column. The simulator only reads the TPM column. But keeping the file format the same is required. If the RSEM references built are aware of allele-specific transcripts, ‘sample_name.alleles.results’ should be used instead.

+

estimated_isoform_results: This file contains expression levels for all isoforms recorded in the reference. It can be learned using ‘rsem-calculate-expression’ from real data. The corresponding file users want to use is ‘sample_name.isoforms.results’. If simulating from user-designed expression profile is desired, start from a learned ‘sample_name.isoforms.results’ file and only modify the ‘TPM’ column. The simulator only reads the TPM column. But keeping the file format the same is required. If the RSEM references built are aware of allele-specific transcripts, ‘sample_name.alleles.results’ should be used instead.

-

theta0: This parameter determines the fraction of reads that are coming from background “noise” (instead of from a transcript). It can also be estimated using ‘rsem-calculate-expression’ from real data. Users can find it as the first value of the third line of the file ‘sample_name.stat/sample_name.theta’.

+

theta0: This parameter determines the fraction of reads that are coming from background “noise” (instead of from a transcript). It can also be estimated using ‘rsem-calculate-expression’ from real data. Users can find it as the first value of the third line of the file ‘sample_name.stat/sample_name.theta’.

-

N: The total number of reads to be simulated. If ‘rsem-calculate-expression’ is executed on a real data set, the total number of reads can be found as the 4th number of the first line of the file ‘sample_name.stat/sample_name.cnt’.

+

N: The total number of reads to be simulated. If ‘rsem-calculate-expression’ is executed on a real data set, the total number of reads can be found as the 4th number of the first line of the file ‘sample_name.stat/sample_name.cnt’.

-

output_name: Prefix for all output files.

+

output_name: Prefix for all output files.

–seed seed: Set seed for the random number generator used in simulation. The seed should be a 32-bit unsigned integer.

-

-q: If set, intermediate information will not be output.

+

-q: If set, intermediate information will not be output.
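
The locations given above for theta0 (first value of the third line of ‘sample_name.stat/sample_name.theta’) and N (4th number of the first line of ‘sample_name.stat/sample_name.cnt’) can be read programmatically. The Python sketch below is a minimal illustration under the assumption that the values on each line are separated by whitespace; the function names are hypothetical, and nothing beyond the two positions stated above is assumed about the file formats.

```python
def theta0_from_text(theta_text):
    """Return the first value on the third line of a '.theta' file's text."""
    lines = theta_text.splitlines()
    return float(lines[2].split()[0])


def total_reads_from_text(cnt_text):
    """Return the 4th number on the first line of a '.cnt' file's text."""
    first_line = cnt_text.splitlines()[0]
    return int(first_line.split()[3])
```

In practice the text would come from reading the two files, for example `open("sample_name.stat/sample_name.theta").read()`, and the returned values would be passed as the theta0 and N arguments of ‘rsem-simulate-reads’.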

Outputs:

output_name.sim.isoforms.results, output_name.sim.genes.results: Expression levels estimated by counting where each simulated read comes from.
output_name.sim.alleles.results: Allele-specific expression levels estimated by counting where each simulated read comes from.

-

output_name.fa if single-end without quality score;
-output_name.fq if single-end with quality score;
+

output_name.fa if single-end without quality score;
+output_name.fq if single-end with quality score;
output_name_1.fa & output_name_2.fa if paired-end without quality -score;
-output_name_1.fq & output_name_2.fq if paired-end with quality score.

+score;
+output_name_1.fq & output_name_2.fq if paired-end with quality score.

Format of the header line: Each simulated read’s header line encodes where it comes from. The header line has the format:

-
{>/@}_rid_dir_sid_pos[_insertL]
+
{>/@}_rid_dir_sid_pos[_insertL]
+
-

{>/@}: Either ‘>’ or ‘@’ must appear. ‘>’ appears if FASTA files are generated and ‘@’ appears if FASTQ files are generated

+

{>/@}: Either ‘>’ or ‘@’ must appear. ‘>’ appears if FASTA files are generated and ‘@’ appears if FASTQ files are generated

-

rid: Simulated read’s index, numbered from 0

+

rid: Simulated read’s index, numbered from 0

-

dir: The direction of the simulated read. 0 refers to forward strand (‘+’) and 1 refers to reverse strand (‘-‘)

+

dir: The direction of the simulated read. 0 refers to forward strand (‘+’) and 1 refers to reverse strand (‘-’)

-

sid: Represents which transcript this read is simulated from. It ranges between 0 and M, where M is the total number of transcripts. If sid=0, the read is simulated from the background noise. Otherwise, the read is simulated from a transcript with index sid. Transcript sid’s transcript name can be found in the ‘transcript_id’ column of the ‘sample_name.isoforms.results’ file (at line sid + 1, line 1 is for column names)

+

sid: Represents which transcript this read is simulated from. It ranges between 0 and M, where M is the total number of transcripts. If sid=0, the read is simulated from the background noise. Otherwise, the read is simulated from a transcript with index sid. Transcript sid’s transcript name can be found in the ‘transcript_id’ column of the ‘sample_name.isoforms.results’ file (at line sid + 1, line 1 is for column names)

-

pos: The start position of the simulated read in strand dir of transcript sid. It is numbered from 0

+

pos: The start position of the simulated read in strand dir of transcript sid. It is numbered from 0

-

insertL: Only appears for paired-end reads. It gives the insert length of the simulated read.

+

insertL: Only appears for paired-end reads. It gives the insert length of the simulated read.
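
The header fields described above can be unpacked with a few lines of Python. This is a sketch under stated assumptions rather than a parser shipped with RSEM: the names `SimReadOrigin` and `parse_sim_header` are hypothetical, the header is assumed to contain exactly the underscore-separated integer fields listed above, and an underscore directly after the leading ‘>’ or ‘@’ is tolerated whether or not it is present.

```python
from typing import NamedTuple, Optional


class SimReadOrigin(NamedTuple):
    rid: int                    # simulated read's index, numbered from 0
    direction: int              # 0 = forward strand, 1 = reverse strand
    sid: int                    # 0 = background noise, else transcript index
    pos: int                    # 0-based start position in strand dir
    insert_len: Optional[int]   # insert length, paired-end reads only


def parse_sim_header(header):
    """Parse '{>/@}_rid_dir_sid_pos[_insertL]' into a SimReadOrigin."""
    if not header or header[0] not in ">@":
        raise ValueError("header must start with '>' or '@'")
    body = header[1:].strip().lstrip("_")
    fields = [int(x) for x in body.split("_")]
    if len(fields) not in (4, 5):
        raise ValueError("expected 4 or 5 underscore-separated fields")
    rid, direction, sid, pos = fields[:4]
    insert_len = fields[4] if len(fields) == 5 else None
    return SimReadOrigin(rid, direction, sid, pos, insert_len)
```

A read whose parsed `sid` is 0 came from the background noise; any other value indexes a transcript as described for sid above.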

Example:

Suppose we want to simulate 50 million single-end reads with quality scores and use the parameters learned from Example. In addition, we set theta0 as 0.2 and output_name as ‘simulated_reads’. The command is:

-
rsem-simulate-reads /ref/mouse_0 mmliver_single_quals.stat/mmliver_single_quals.model mmliver_single_quals.isoforms.results 0.2 50000000 simulated_reads
+
rsem-simulate-reads /ref/mouse_0 mmliver_single_quals.stat/mmliver_single_quals.model mmliver_single_quals.isoforms.results 0.2 50000000 simulated_reads
+

Generate Transcript-to-Gene-Map from Trinity Output

@@ -362,10 +365,11 @@

Usage:

-
extract-transcript-to-gene-map-from-trinity trinity_fasta_file map_file
+
extract-transcript-to-gene-map-from-trinity trinity_fasta_file map_file
+
-

trinity_fasta_file: the fasta file produced by trinity, which contains all transcripts assembled.
-map_file: transcript-to-gene-map file’s name.

+

trinity_fasta_file: the fasta file produced by trinity, which contains all transcripts assembled.
+map_file: transcript-to-gene-map file’s name.

Differential Expression Analysis

@@ -384,7 +388,8 @@

Differential E

RSEM includes EBSeq in its folder named ‘EBSeq’. To use it, first type

-
make ebseq
+
make ebseq
+

to compile the EBSeq related codes.

@@ -403,7 +408,8 @@

Differential E transcripts whose lengths are less than k are assigned to cluster 3. Run

-
rsem-generate-ngvector --help
+
rsem-generate-ngvector --help
+

to get usage information or visit the rsem-generate-ngvector documentation
@@ -413,7 +419,8 @@

Differential E run ‘rsem-generate-ngvector’ first. Then load the resulting ‘output_name.ngvec’ into R. For example, you can use

-
NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")
+
NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")
+

. After that, set “NgVector = NgVec” for your differential expression test (either ‘EBTest’ or ‘EBMultiTest’).

@@ -422,12 +429,14 @@

Differential E ‘rsem-generate-data-matrix’ to extract input matrix from expression results:

-
rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.counts.matrix
+
rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.counts.matrix
+

The results files are required to be either all gene level results or all isoform level results. You can load the matrix into R by

-
IsoMat <- data.matrix(read.table(file="output_name.counts.matrix"))
+
IsoMat <- data.matrix(read.table(file="output_name.counts.matrix"))
+

before running either ‘EBTest’ or ‘EBMultiTest’.

@@ -436,15 +445,17 @@

Differential E genes/transcripts. First, ‘rsem-run-ebseq’ calls EBSeq to calculate related statistics for all genes/transcripts. Run

-
rsem-run-ebseq --help
+
rsem-run-ebseq --help
+

to get usage information or visit the rsem-run-ebseq documentation page. Second,
-‘rsem-control-fdr’ takes ‘rsem-run-ebseq’ ‘s result and reports called
+‘rsem-control-fdr’ takes ‘rsem-run-ebseq’ ’s result and reports called
differentially expressed genes/transcripts by controlling the false discovery rate. Run

-
rsem-control-fdr --help
+
rsem-control-fdr --help
+

to get usage information or visit the rsem-control-fdr documentation page. These
@@ -478,5 +489,3 @@

License

RSEM is licensed under the GNU General Public License v3.

- -
\ No newline at end of file
diff --git a/convert-sam-for-rsem.html b/convert-sam-for-rsem.html
index bea955d..d69263f 100644
--- a/convert-sam-for-rsem.html
+++ b/convert-sam-for-rsem.html
@@ -4,87 +4,85 @@
convert-sam-for-rsem
- + - -
-

-
- +

NAME

-

-

-

NAME

convert-sam-for-rsem

-

-

-
-

SYNOPSIS

+ +

SYNOPSIS

+

convert-sam-for-rsem [options] <input.sam/input.bam> output_file_name

-

-

-
-

ARGUMENTS

+ +

ARGUMENTS

+
-
input.sam/input.bam
+
input.sam/input.bam
-

The SAM or BAM file generated by the user's aligner. This file must contain the header section. If input is a SAM file, it must end with suffix 'sam' (case insensitive). If input is a BAM file, it must end with suffix 'bam' (case insensitive).

-
-
output_file_name
+

The SAM or BAM file generated by the user's aligner. This file must contain the header section. If input is a SAM file, it must end with suffix 'sam' (case insensitive). If input is a BAM file, it must end with suffix 'bam' (case insensitive).

+ + +
output_file_name
-

The output name for the converted file. 'convert-sam-for-rsem' will output a BAM with the name 'output_file_name.bam'.

+ +

The output name for the converted file. 'convert-sam-for-rsem' will output a BAM with the name 'output_file_name.bam'.

+
-

-

-
-

OPTIONS

+ +

OPTIONS

+
-
-T/--temporary-directory <directory>
+
-T/--temporary-directory <directory>
-

'convert-sam-for-rsem' will call 'sort' command and this is the '-T/--temporary-directory' option of 'sort' command. The following is the description from 'sort' : "use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories".

-
-
-h/--help
+

'convert-sam-for-rsem' will call 'sort' command and this is the '-T/--temporary-directory' option of 'sort' command. The following is the description from 'sort' : "use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories".

+ + +
-h/--help
+

Show help information.

+
-

-

-
-

DESCRIPTION

-

This program converts the SAM/BAM file generated by user's aligner into a BAM file which RSEM can process. However, users should make sure their aligners use 'reference_name.idx.fa' generated by 'rsem-prepare-reference' as their references and output header sections. This program will create a temporary directory called 'output_file_name.bam.temp' to store the intermediate files. The directory will be deleted automatically after the conversion. After the conversion, this program will call 'rsem-sam-validator' to validate the resulting BAM file.

-

Note: You do not need to run this script if `rsem-sam-validator' reports that your SAM/BAM file is valid.

+ +

DESCRIPTION

+ +

This program converts the SAM/BAM file generated by user's aligner into a BAM file which RSEM can process. However, users should make sure their aligners use 'reference_name.idx.fa' generated by 'rsem-prepare-reference' as their references and output header sections. This program will create a temporary directory called 'output_file_name.bam.temp' to store the intermediate files. The directory will be deleted automatically after the conversion. After the conversion, this program will call 'rsem-sam-validator' to validate the resulting BAM file.

+ +

Note: You do not need to run this script if `rsem-sam-validator' reports that your SAM/BAM file is valid.

+

Note: This program does not check the correctness of input file. You should make sure the input is a valid SAM/BAM format file.

-

-

-
-

EXAMPLES

-

Suppose input is set to 'input.sam' and output file name is "output"

-
- convert-sam-for-rsem input.sam output
-

We will get a file called 'output.bam' as output.

+ +

EXAMPLES

+ +

Suppose input is set to 'input.sam' and output file name is "output"

+ +
 convert-sam-for-rsem input.sam output
+ +

We will get a file called 'output.bam' as output.

+ + +
diff --git a/index.html b/index.html
deleted file mode 100644
index 8dcde23..0000000
--- a/index.html
+++ /dev/null
@@ -1,78 +0,0 @@
- -
-RNA-Seq gene expression estimation with read
-mapping uncertainty
- - -

RSEM (RNA-Seq by Expectation-Maximization)

-
-

Updates

-

Jul 27, 2015   RSEM v1.2.22 is online now. Added options to run the STAR aligner.

-

May 6, 2015   RSEM v1.2.21 is online now. Strip read names of extra words to avoid mismatches of paired-end read names.

-

Mar 23, 2015   RSEM v1.2.20 is online now. Fixed a problem that can lead to assertion error if any paired-end read's insert size > 32767 (by changing the type of insertL in PairedEndHit.h from short to int).

-

Click here for full update information.

-

Authors

-

Bo Li and Colin Dewey designed the RSEM algorithm. Bo Li implemented the RSEM software. Peng Liu contributed the STAR aligner options.

-

License

-

RSEM is under the GNU General Public License

-

Source Code

- -

Documentation

-

README

-

Prebuilt RSEM Indices (RSEM v1.1.17) for Galaxy Wrapper

-

These indices are based on RefSeq containing NM accession numbers only.
-That means only curated genes (no experimental, no miRNA, no noncoding).
-Only mature RNAs. In addition, 125bp Poly(A) tails are added at the end of each transcript.

-

Mouse Indices, extracted from mouse genome mm9

-

Human Indices, extracted from human genome hg18

-

Reference annotations and Simulation Data used in the paper

-

RefSeq and Ensembl annotation GTF files used in the paper

-

Simulation Data using Refseq set as reference

-

Simulation Data using Ensembl set as reference

-

Google Users and Announce Groups

- - - - - - - -
- Google Groups -
- Subscribe to RSEM Announce -
- Email: - -
- Visit this group -
- - - - - - - -
- Google Groups -
- Subscribe to RSEM Users -
- Email: - -
- Visit this group -
-
-

(last modified on Jul 27, 2015)

- -
diff --git a/rsem-calculate-expression.html b/rsem-calculate-expression.html
index bd6e24d..b0033a0 100644
--- a/rsem-calculate-expression.html
+++ b/rsem-calculate-expression.html
@@ -4,571 +4,581 @@
rsem-calculate-expression
- + - -
-

-
- +

NAME

-

-

-

NAME

rsem-calculate-expression

-

-

-
-

SYNOPSIS

-
- rsem-calculate-expression [options] upstream_read_file(s) reference_name sample_name 
+
+

SYNOPSIS

+ +
 rsem-calculate-expression [options] upstream_read_file(s) reference_name sample_name 
  rsem-calculate-expression [options] --paired-end upstream_read_file(s) downstream_read_file(s) reference_name sample_name 
- rsem-calculate-expression [options] --sam/--bam [--paired-end] input reference_name sample_name
-

-

-
-

ARGUMENTS

+ rsem-calculate-expression [options] --sam/--bam [--paired-end] input reference_name sample_name
+ +

ARGUMENTS

+
-
upstream_read_file(s)
+
upstream_read_file(s)
-

Comma-separated list of files containing single-end reads or upstream reads for paired-end data. By default, these files are assumed to be in FASTQ format. If the --no-qualities option is specified, then FASTA format is expected.

-
-
downstream_read_file(s)
-
-

Comma-separated list of files containing downstream reads which are paired with the upstream reads. By default, these files are assumed to be in FASTQ format. If the --no-qualities option is specified, then FASTA format is expected.

-
-
input
+

Comma-separated list of files containing single-end reads or upstream reads for paired-end data. By default, these files are assumed to be in FASTQ format. If the --no-qualities option is specified, then FASTA format is expected.

-
-

SAM/BAM formatted input file. If "-" is specified for the filename, SAM/BAM input is instead assumed to come from standard input. RSEM requires all alignments of the same read to be grouped together. For paired-end reads, RSEM also requires the two mates of any alignment to be adjacent. See the Description section for how to make the input file obey RSEM's requirements.

-
reference_name
+
downstream_read_file(s)
+
+ +

Comma-separated list of files containing downstream reads which are paired with the upstream reads. By default, these files are assumed to be in FASTQ format. If the --no-qualities option is specified, then FASTA format is expected.

+
+
input
-

The name of the reference used. The user must have run 'rsem-prepare-reference' with this reference_name before running this program.

+ +

SAM/BAM formatted input file. If "-" is specified for the filename, SAM/BAM input is instead assumed to come from standard input. RSEM requires all alignments of the same read to be grouped together. For paired-end reads, RSEM also requires the two mates of any alignment to be adjacent. See the Description section for how to make the input file obey RSEM's requirements.

+
-
sample_name
+
reference_name
+
+

The name of the reference used. The user must have run 'rsem-prepare-reference' with this reference_name before running this program.

+ +
+
sample_name
+

The name of the sample analyzed. All output files are prefixed by this name (e.g., sample_name.genes.results)

+
-

-

-
-

BASIC OPTIONS

+ +

BASIC OPTIONS

+
-
--paired-end
+
--paired-end
+

Input reads are paired-end reads. (Default: off)

-
-
--no-qualities
+ +
--no-qualities
+

Input reads do not contain quality scores. (Default: off)

-
-
--strand-specific
-
-

The RNA-Seq protocol used to generate the reads is strand specific, i.e., all (upstream) reads are derived from the forward strand. This option is equivalent to --forward-prob=1.0. With this option set, if RSEM runs the Bowtie/Bowtie 2 aligner, the '--norc' Bowtie/Bowtie 2 option will be used, which disables alignment to the reverse strand of transcripts. (Default: off)

-
--bowtie2
- +
--strand-specific
-

Use Bowtie 2 instead of Bowtie to align reads. Since currently RSEM does not handle indel, local and discordant alignments, the Bowtie2 parameters are set in a way to avoid those alignments. In particular, we use options '--sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1' by default. The last parameter of '--score-min', '-0.1', is the negative of maximum mismatch rate. This rate can be set by option '--bowtie2-mismatch-rate'. If reads are paired-end, we additionally use options '--no-mixed' and '--no-discordant'. (Default: off)

-
-
--star
-
-

Use STAR to align reads. Alignment parameters are from ENCODE3's STAR-RSEM pipeline. To save computational time and memory resources, STAR's output BAM file is unsorted. It is stored in RSEM's temporary directory under the name 'sample_name.bam'. Each STAR job will have its own private copy of the genome in memory. (Default: off)

+

The RNA-Seq protocol used to generate the reads is strand specific, i.e., all (upstream) reads are derived from the forward strand. This option is equivalent to --forward-prob=1.0. With this option set, if RSEM runs the Bowtie/Bowtie 2 aligner, the '--norc' Bowtie/Bowtie 2 option will be used, which disables alignment to the reverse strand of transcripts. (Default: off)

+
-
--star-path <path>
+
--bowtie2
+
+ +

Use Bowtie 2 instead of Bowtie to align reads. Since currently RSEM does not handle indel, local and discordant alignments, the Bowtie2 parameters are set in a way to avoid those alignments. In particular, we use options '--sensitive --dpad 0 --gbar 99999999 --mp 1,1 --np 1 --score-min L,0,-0.1' by default. The last parameter of '--score-min', '-0.1', is the negative of maximum mismatch rate. This rate can be set by option '--bowtie2-mismatch-rate'. If reads are paired-end, we additionally use options '--no-mixed' and '--no-discordant'. (Default: off)

+
+
--star
-

The path to STAR's executable. (Default: the path to STAR executable is assumed to be in user's PATH environment variable)

+ +

Use STAR to align reads. Alignment parameters are from ENCODE3's STAR-RSEM pipeline. To save computational time and memory resources, STAR's output BAM file is unsorted. It is stored in RSEM's temporary directory under the name 'sample_name.bam'. Each STAR job will have its own private copy of the genome in memory. (Default: off)

+
-
--sam
+
--star-path <path>
+
+ +

The path to STAR's executable. (Default: the path to STAR executable is assumed to be in user's PATH environment variable)

+
+
--sam
+

Input file is in SAM format. (Default: off)

-
-
--bam
+ +
--bam
+

Input file is in BAM format. (Default: off)

-
-
-p/--num-threads <int>
-
-

Number of threads to use. Bowtie/Bowtie 2, expression estimation and 'samtools sort' will all use this many threads. (Default: 1)

-
--no-bam-output
+
-p/--num-threads <int>
+
+

Number of threads to use. Bowtie/Bowtie 2, expression estimation and 'samtools sort' will all use this many threads. (Default: 1)

+ +
+
--no-bam-output
+

Do not output any BAM file. (Default: off)

-
-
--output-genome-bam
-
-

Generate a BAM file, 'sample_name.genome.bam', with alignments mapped to genomic coordinates and annotated with their posterior probabilities. In addition, RSEM will call samtools (included in RSEM package) to sort and index the bam file. 'sample_name.genome.sorted.bam' and 'sample_name.genome.sorted.bam.bai' will be generated. (Default: off)

-
--sampling-for-bam
+
--output-genome-bam
+
+

Generate a BAM file, 'sample_name.genome.bam', with alignments mapped to genomic coordinates and annotated with their posterior probabilities. In addition, RSEM will call samtools (included in RSEM package) to sort and index the bam file. 'sample_name.genome.sorted.bam' and 'sample_name.genome.sorted.bam.bai' will be generated. (Default: off)

+ +
+
--sampling-for-bam
+

When RSEM generates a BAM file, instead of outputting all alignments a read has with their posterior probabilities, one alignment is sampled according to the posterior probabilities. The sampling procedure includes the alignment to the "noise" transcript, which does not appear in the BAM file. Only the sampled alignment has a weight of 1. All other alignments have weight 0. If the "noise" transcript is sampled, all alignments that appear in the BAM file have weight 0. (Default: off)

-
-
--seed <uint32>
+ +
--seed <uint32>
+

Set the seed for the random number generators used in calculating posterior mean estimates and credibility intervals. The seed must be a non-negative 32-bit integer. (Default: off)

-
-
--calc-pme
-
-

Run RSEM's collapsed Gibbs sampler to calculate posterior mean estimates. (Default: off)

-
--calc-ci
- +
--calc-pme
-

Calculate 95% credibility intervals and posterior mean estimates. The credibility level can be changed by setting '--ci-credibility-level'. (Default: off)

+ +

Run RSEM's collapsed Gibbs sampler to calculate posterior mean estimates. (Default: off)

+
-
-q/--quiet
+
--calc-ci
+
+

Calculate 95% credibility intervals and posterior mean estimates. The credibility level can be changed by setting '--ci-credibility-level'. (Default: off)

+ +
+
-q/--quiet
+

Suppress the output of logging information. (Default: off)

-
-
-h/--help
+ +
-h/--help
+

Show help information.

-
-
--version
+ +
--version
+

Show version information.

+
-

-

-
-

ADVANCED OPTIONS

+ +

ADVANCED OPTIONS

+
-
--sam-header-info <file>
+
--sam-header-info <file>
+

RSEM reads header information from input by default. If this option is on, header information is read from the specified file. For the format of the file, please see SAM official website. (Default: "")

-
-
--seed-length <int>
-
-

Seed length used by the read aligner. Providing the correct value is important for RSEM. If RSEM runs Bowtie, it uses this value for Bowtie's seed length parameter. Any read whose length, or for paired-end reads the length of at least one of its mates, is less than this value will be ignored. If poly(A) tails are not added to the references, the minimum allowed value is 5; otherwise, the minimum allowed value is 25. Note that this script will only check if the value >= 5 and give a warning message if the value < 25 but >= 5. (Default: 25)

-
--tag <string>
+
--seed-length <int>
+
+ +

Seed length used by the read aligner. Providing the correct value is important for RSEM. If RSEM runs Bowtie, it uses this value for Bowtie's seed length parameter. Any read whose length, or for paired-end reads the length of at least one of its mates, is less than this value will be ignored. If poly(A) tails are not added to the references, the minimum allowed value is 5; otherwise, the minimum allowed value is 25. Note that this script will only check if the value >= 5 and give a warning message if the value < 25 but >= 5. (Default: 25)

+
+
--tag <string>
+

The name of the optional field used in the SAM input for identifying a read with too many valid alignments. The field should have the format <tagName>:i:<value>, where a <value> bigger than 0 indicates a read with too many alignments. (Default: "")

-
-
--bowtie-path <path>
-
-

The path to the Bowtie executables. (Default: the path to the Bowtie executables is assumed to be in the user's PATH environment variable)

-
--bowtie-n <int>
+
--bowtie-path <path>
+
+ +

The path to the Bowtie executables. (Default: the path to the Bowtie executables is assumed to be in the user's PATH environment variable)

+
+
--bowtie-n <int>
+

(Bowtie parameter) max # of mismatches in the seed. (Range: 0-3, Default: 2)

-
-
--bowtie-e <int>
+ +
--bowtie-e <int>
+

(Bowtie parameter) max sum of mismatch quality scores across the alignment. (Default: 99999999)

-
-
--bowtie-m <int>
+ +
--bowtie-m <int>
+

(Bowtie parameter) suppress all alignments for a read if > <int> valid alignments exist. (Default: 200)

-
-
--bowtie-chunkmbs <int>
-
-

(Bowtie parameter) memory allocated for best first alignment calculation (Default: 0 - use Bowtie's default)

-
--phred33-quals
+
--bowtie-chunkmbs <int>
+
+ +

(Bowtie parameter) memory allocated for best first alignment calculation (Default: 0 - use Bowtie's default)

+
+
--phred33-quals
+

Input quality scores are encoded as Phred+33. (Default: on)

-
-
--phred64-quals
+ +
--phred64-quals
+

Input quality scores are encoded as Phred+64 (default for GA Pipeline ver. >= 1.3). (Default: off)

-
-
--solexa-quals
+ +
--solexa-quals
+

Input quality scores are solexa encoded (from GA Pipeline ver. < 1.3). (Default: off)

-
-
--bowtie2-path <path>
-
-

(Bowtie 2 parameter) The path to the Bowtie 2 executables. (Default: the path to the Bowtie 2 executables is assumed to be in the user's PATH environment variable)

-
--bowtie2-mismatch-rate <double>
+
--bowtie2-path <path>
+
+ +

(Bowtie 2 parameter) The path to the Bowtie 2 executables. (Default: the path to the Bowtie 2 executables is assumed to be in the user's PATH environment variable)

+
+
--bowtie2-mismatch-rate <double>
+

(Bowtie 2 parameter) The maximum mismatch rate allowed. (Default: 0.1)

-
-
--bowtie2-k <int>
+ +
--bowtie2-k <int>
+

(Bowtie 2 parameter) Find up to <int> alignments per read. (Default: 200)

-
-
--bowtie2-sensitivity-level <string>
-
-

(Bowtie 2 parameter) Set Bowtie 2's preset options in --end-to-end mode. This option controls how hard Bowtie 2 tries to find alignments. <string> must be one of "very_fast", "fast", "sensitive" and "very_sensitive". The four candidates correspond to Bowtie 2's "--very-fast", "--fast", "--sensitive" and "--very-sensitive" options. (Default: "sensitive" - use Bowtie 2's default)

-
--gzipped-read-file
+
--bowtie2-sensitivity-level <string>
+
+

(Bowtie 2 parameter) Set Bowtie 2's preset options in --end-to-end mode. This option controls how hard Bowtie 2 tries to find alignments. <string> must be one of "very_fast", "fast", "sensitive" and "very_sensitive". The four candidates correspond to Bowtie 2's "--very-fast", "--fast", "--sensitive" and "--very-sensitive" options. (Default: "sensitive" - use Bowtie 2's default)

+ +
+
--gzipped-read-file
+

Input read file(s) are compressed by gzip. This option can only be used when aligning reads with STAR, i.e. when --star-genome-path <path> is defined. (Default: off)

-
-
--bzipped-read-file
+ +
--bzipped-read-file
+

Input read file(s) are compressed by bzip2. This option can only be used when aligning reads with STAR, i.e. when --star-genome-path <path> is defined. (Default: off)

-
-
--output-star-genome-bam
-
-

Save the BAM file from the STAR alignment in genomic coordinates to 'sample_name.STAR.genome.bam'. This file is NOT sorted by genomic coordinate. In this file, according to STAR's manual, 'paired ends of an alignment are always adjacent, and multiple alignments of a read are adjacent as well'. (Default: off)

-
--sort-bam-by-read-name
+
--output-star-genome-bam
+
+ +

Save the BAM file from the STAR alignment in genomic coordinates to 'sample_name.STAR.genome.bam'. This file is NOT sorted by genomic coordinate. In this file, according to STAR's manual, 'paired ends of an alignment are always adjacent, and multiple alignments of a read are adjacent as well'. (Default: off)

+
+
--sort-bam-by-read-name
+

Sort the BAM file aligned in transcript coordinates by read name. Setting this option will produce deterministic maximum likelihood estimates from independent runs. Note that sorting takes a long time and a lot of memory. (Default: off)

-
-
--sort-bam-buffer-size <string>
-
-

Size of the main memory buffer used when sorting the BAM file. It can be any string acceptable to GNU sort's '-S' option. See "sort --help" for details. (Default: '60G')

-
--forward-prob <double>
+
--sort-bam-buffer-size <string>
+
+

Size of the main memory buffer used when sorting the BAM file. It can be any string acceptable to GNU sort's '-S' option. See "sort --help" for details. (Default: '60G')

+ +
+
--forward-prob <double>
+

Probability of generating a read from the forward strand of a transcript. Set to 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand, 0 for a strand-specific protocol where all (upstream) read are derived from the reverse strand, or 0.5 for a non-strand-specific protocol. (Default: 0.5)

-
-
--fragment-length-min <int>
+ +
--fragment-length-min <int>
+

Minimum read/insert length allowed. This is also the value for the Bowtie/Bowtie2 -I option. (Default: 1)

-
-
--fragment-length-max <int>
+ +
--fragment-length-max <int>
+

Maximum read/insert length allowed. This is also the value for the Bowtie/Bowtie 2 -X option. (Default: 1000)

-
-
--fragment-length-mean <double>
+ +
--fragment-length-mean <double>
+

(single-end data only) The mean of the fragment length distribution, which is assumed to be a Gaussian. (Default: -1, which disables use of the fragment length distribution)

-
-
--fragment-length-sd <double>
-
-

(single-end data only) The standard deviation of the fragment length distribution, which is assumed to be a Gaussian. (Default: 0, which assumes that all fragments are of the same length, given by the rounded value of --fragment-length-mean)

-
--estimate-rspd
+
--fragment-length-sd <double>
+
+ +

(single-end data only) The standard deviation of the fragment length distribution, which is assumed to be a Gaussian. (Default: 0, which assumes that all fragments are of the same length, given by the rounded value of --fragment-length-mean)

+
+
--estimate-rspd
+

Set this option if you want to estimate the read start position distribution (RSPD) from data. Otherwise, RSEM will use a uniform RSPD. (Default: off)

-
-
--num-rspd-bins <int>
-
-

Number of bins in the RSPD. Only relevant when '--estimate-rspd' is specified. Use of the default setting is recommended. (Default: 20)

-
--gibbs-burnin <int>
- +
--num-rspd-bins <int>
-

The number of burn-in rounds for RSEM's Gibbs sampler. Each round passes over the entire data set once. If RSEM can use multiple threads, multiple Gibbs samplers will start at the same time and all samplers share the same burn-in number. (Default: 200)

+ +

Number of bins in the RSPD. Only relevant when '--estimate-rspd' is specified. Use of the default setting is recommended. (Default: 20)

+
-
--gibbs-number-of-samples <int>
+
--gibbs-burnin <int>
+
+

The number of burn-in rounds for RSEM's Gibbs sampler. Each round passes over the entire data set once. If RSEM can use multiple threads, multiple Gibbs samplers will start at the same time and all samplers share the same burn-in number. (Default: 200)

+ +
+
--gibbs-number-of-samples <int>
+

The total number of count vectors RSEM will collect from its Gibbs samplers. (Default: 1000)

-
-
--gibbs-sampling-gap <int>
+ +
--gibbs-sampling-gap <int>
+

The number of rounds between two successive count vectors RSEM collects. If the count vector after round N is collected, the count vector after round N + <int> will also be collected. (Default: 1)

-
-
--ci-credibility-level <double>
+ +
--ci-credibility-level <double>
+

The credibility level for credibility intervals. (Default: 0.95)

-
-
--ci-memory <int>
+ +
--ci-memory <int>
+

Maximum size (in memory, MB) of the auxiliary buffer used for computing credibility intervals (CI). Set it larger for a faster CI calculation. However, leaving 2 GB memory free for other usage is recommended. (Default: 1024)

-
-
--ci-number-of-samples-per-count-vector <int>
+ +
--ci-number-of-samples-per-count-vector <int>
+

The number of read generating probability vectors sampled per sampled count vector. The credibility intervals are calculated by first sampling P(C | D) and then sampling P(Theta | C) for each sampled count vector. This option controls how many Theta vectors are sampled per sampled count vector. (Default: 50)

-
-
--samtools-sort-mem <string>
-
-

Set the maximum memory per thread that can be used by 'samtools sort'. <string> represents the memory and accepts the suffixes 'K/M/G'. RSEM will pass <string> to the '-m' option of 'samtools sort'. Please note that the default used here is different from the default used by samtools. (Default: 1G)

-
--keep-intermediate-files
- +
--samtools-sort-mem <string>
-

Keep temporary files generated by RSEM. RSEM creates a temporary directory, 'sample_name.temp', into which it puts all intermediate output files. If this directory already exists, RSEM overwrites all files generated by previous RSEM runs inside of it. By default, after RSEM finishes, the temporary directory is deleted. Set this option to prevent the deletion of this directory and the intermediate files inside of it. (Default: off)

+ +

Set the maximum memory per thread that can be used by 'samtools sort'. <string> represents the memory and accepts the suffixes 'K/M/G'. RSEM will pass <string> to the '-m' option of 'samtools sort'. Please note that the default used here is different from the default used by samtools. (Default: 1G)

+
-
--temporary-folder <string>
+
--keep-intermediate-files
+
+

Keep temporary files generated by RSEM. RSEM creates a temporary directory, 'sample_name.temp', into which it puts all intermediate output files. If this directory already exists, RSEM overwrites all files generated by previous RSEM runs inside of it. By default, after RSEM finishes, the temporary directory is deleted. Set this option to prevent the deletion of this directory and the intermediate files inside of it. (Default: off)

+ +
+
--temporary-folder <string>
+

Set where to put the temporary files generated by RSEM. If the folder specified does not exist, RSEM will try to create it. (Default: sample_name.temp)

-
-
--time
+ +
--time
-

Output time consumed by each step of RSEM to 'sample_name.time'. (Default: off)

+ +

Output time consumed by each step of RSEM to 'sample_name.time'. (Default: off)

+
-

-

-
-

DESCRIPTION

-

In its default mode, this program aligns input reads against a reference transcriptome with Bowtie and calculates expression values using the alignments. RSEM assumes the data are single-end reads with quality scores, unless the '--paired-end' or '--no-qualities' options are specified. Alternatively, users can use STAR to align reads using the '--star' option. RSEM has provided options in 'rsem-prepare-reference' to prepare STAR's genome indices. Users may use an alternative aligner by specifying one of the --sam and --bam options, and providing an alignment file in the specified format. However, users should make sure that they align against the indices generated by 'rsem-prepare-reference' and the alignment file satisfies the requirements mentioned in ARGUMENTS section.

-

One simple way to make the alignment file satisfy RSEM's requirements (assuming the aligner used puts the mates of a paired-end read adjacent to each other) is to use the 'convert-sam-for-rsem' script. This script only accepts SAM format files as input. If you have a BAM format file, please use samtools to convert it to a SAM file first. For example, if '/ref/mouse_125' is the 'reference_name' and the SAM file is named 'input.sam', you can run the following command:

-
-  convert-sam-for-rsem /ref/mouse_125 input.sam -o input_for_rsem.sam
-

For details, please refer to 'convert-sam-for-rsem's documentation page.

-

RSEM uses SAM/BAM format v1.4, but it is also compatible with older SAM/BAM formats. However, RSEM cannot recognize 0x100 in the FLAG field. In addition, RSEM requires that SEQ and QUAL are not '*'.

-

The user must run 'rsem-prepare-reference' with the appropriate reference before using this program.

-

For single-end data, it is strongly recommended that the user provide the fragment length distribution parameters (--fragment-length-mean and --fragment-length-sd). For paired-end data, RSEM will automatically learn a fragment length distribution from the data.

+ +

DESCRIPTION

+ +

In its default mode, this program aligns input reads against a reference transcriptome with Bowtie and calculates expression values using the alignments. RSEM assumes the data are single-end reads with quality scores, unless the '--paired-end' or '--no-qualities' options are specified. Alternatively, users can use STAR to align reads using the '--star' option. RSEM has provided options in 'rsem-prepare-reference' to prepare STAR's genome indices. Users may use an alternative aligner by specifying one of the --sam and --bam options, and providing an alignment file in the specified format. However, users should make sure that they align against the indices generated by 'rsem-prepare-reference' and the alignment file satisfies the requirements mentioned in ARGUMENTS section.

+ +

One simple way to make the alignment file satisfy RSEM's requirements (assuming the aligner used puts the mates of a paired-end read adjacent to each other) is to use the 'convert-sam-for-rsem' script. This script only accepts SAM format files as input. If you have a BAM format file, please use samtools to convert it to a SAM file first. For example, if '/ref/mouse_125' is the 'reference_name' and the SAM file is named 'input.sam', you can run the following command:

+ +
  convert-sam-for-rsem /ref/mouse_125 input.sam -o input_for_rsem.sam  
+ +

For details, please refer to 'convert-sam-for-rsem's documentation page.

+ +

RSEM uses SAM/BAM format v1.4, but it is also compatible with older SAM/BAM formats. However, RSEM cannot recognize 0x100 in the FLAG field. In addition, RSEM requires that SEQ and QUAL are not '*'.

+ +

The user must run 'rsem-prepare-reference' with the appropriate reference before using this program.

+ +

For single-end data, it is strongly recommended that the user provide the fragment length distribution parameters (--fragment-length-mean and --fragment-length-sd). For paired-end data, RSEM will automatically learn a fragment length distribution from the data.

+

Please note that some of the default values for the Bowtie parameters are not the same as those defined for Bowtie itself.

-

The temporary directory and all intermediate files will be removed when RSEM finishes unless '--keep-intermediate-files' is specified.

-

With the '--calc-pme' option, posterior mean estimates will be calculated in addition to maximum likelihood estimates.

-

With the '--calc-ci' option, 95% credibility intervals and posterior mean estimates will be calculated in addition to maximum likelihood estimates.

-

-

-
-

OUTPUT

+ +

The temporary directory and all intermediate files will be removed when RSEM finishes unless '--keep-intermediate-files' is specified.

+ +

With the '--calc-pme' option, posterior mean estimates will be calculated in addition to maximum likelihood estimates.

+ +

With the '--calc-ci' option, 95% credibility intervals and posterior mean estimates will be calculated in addition to maximum likelihood estimates.

+ +

OUTPUT

+
-
sample_name.isoforms.results
+
sample_name.isoforms.results
-

File containing isoform level expression estimates. The first line -contains column names separated by the tab character. The format of -each line in the rest of this file is:

+ +

File containing isoform level expression estimates. The first line contains column names separated by the tab character. The format of each line in the rest of this file is:

+

transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct [posterior_mean_count posterior_standard_deviation_of_count pme_TPM pme_FPKM IsoPct_from_pme_TPM TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_bound FPKM_ci_upper_bound]

-

Fields are separated by the tab character. Fields within "[]" are -optional. They will not be presented if neither '--calc-pme' nor -'--calc-ci' is set.

-

'transcript_id' is the transcript name of this transcript. 'gene_id' -is the gene name of the gene which this transcript belongs to (denote -this gene as its parent gene). If no gene information is provided, -'gene_id' and 'transcript_id' are the same.

-

'length' is this transcript's sequence length (poly(A) tail is not -counted). 'effective_length' counts only the positions that can -generate a valid fragment. If no poly(A) tail is added, -'effective_length' is equal to transcript length - mean fragment -length + 1. If one transcript's effective length is less than 1, this -transcript's both effective length and abundance estimates are set to -0.

-

'expected_count' is the sum of the posterior probability of each read -comes from this transcript over all reads. Because 1) each read -aligning to this transcript has a probability of being generated from -background noise; 2) RSEM may filter some alignable low quality reads, -the sum of expected counts for all transcript are generally less than -the total number of reads aligned.

-

'TPM' stands for Transcripts Per Million. It is a relative measure of -transcript abundance. The sum of all transcripts' TPM is 1 -million. 'FPKM' stands for Fragments Per Kilobase of transcript per -Million mapped reads. It is another relative measure of transcript -abundance. If we define l_bar be the mean transcript length in a -sample, which can be calculated as

+ +

Fields are separated by the tab character. Fields within "[]" are optional. They will not be presented if neither '--calc-pme' nor '--calc-ci' is set.

+ +

'transcript_id' is the transcript name of this transcript. 'gene_id' is the gene name of the gene which this transcript belongs to (denote this gene as its parent gene). If no gene information is provided, 'gene_id' and 'transcript_id' are the same.

+ +

'length' is this transcript's sequence length (poly(A) tail is not counted). 'effective_length' counts only the positions that can generate a valid fragment. If no poly(A) tail is added, 'effective_length' is equal to transcript length - mean fragment length + 1. If a transcript's effective length is less than 1, both its effective length and its abundance estimates are set to 0.
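
The rule above can be illustrated with a minimal Python sketch. It only follows the simplified formula stated in this paragraph for the case where no poly(A) tail is added; the function name `effective_length` is hypothetical and is not part of RSEM.

```python
def effective_length(transcript_length: int, mean_fragment_length: float) -> float:
    """Simplified effective length for a reference without poly(A) tails.

    Implements: transcript length - mean fragment length + 1.
    Values below 1 are set to 0, as described in the text above.
    """
    eff = transcript_length - mean_fragment_length + 1
    return eff if eff >= 1 else 0.0


# A 1000 nt transcript with a mean fragment length of 200 nt.
long_transcript = effective_length(1000, 200)
# A 150 nt transcript is shorter than the mean fragment, so the value is clamped to 0.
short_transcript = effective_length(150, 200)
print(long_transcript, short_transcript)
```

A transcript whose effective length is 0 in this sketch corresponds to the case where RSEM also sets the abundance estimates to 0.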

+ +

'expected_count' is the sum, over all reads, of the posterior probability that each read comes from this transcript. Because 1) each read aligning to this transcript has a probability of being generated from background noise, and 2) RSEM may filter some alignable low quality reads, the sum of expected counts for all transcripts is generally less than the total number of reads aligned.

+ +

'TPM' stands for Transcripts Per Million. It is a relative measure of transcript abundance. The sum of all transcripts' TPM is 1 million. 'FPKM' stands for Fragments Per Kilobase of transcript per Million mapped reads. It is another relative measure of transcript abundance. If we define l_bar to be the mean transcript length in a sample, which can be calculated as

+

l_bar = \sum_i TPM_i / 10^6 * effective_length_i (i goes through every transcript),

+

the following equation holds:

+

FPKM_i = 10^3 / l_bar * TPM_i.

+

We can see that the sum of FPKM is not a constant across samples.
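
As a worked illustration of the two formulas above, the following Python sketch converts TPM values to FPKM. The function name `fpkm_from_tpm` and the example numbers are hypothetical and chosen only for illustration; this is not RSEM's own code.

```python
from typing import List, Sequence


def fpkm_from_tpm(tpm: Sequence[float], effective_length: Sequence[float]) -> List[float]:
    """Convert TPM to FPKM with FPKM_i = 10^3 / l_bar * TPM_i."""
    if len(tpm) != len(effective_length):
        raise ValueError("tpm and effective_length must have the same length")
    # l_bar = sum_i TPM_i / 10^6 * effective_length_i
    l_bar = sum(t / 1e6 * l for t, l in zip(tpm, effective_length))
    if l_bar <= 0:
        raise ValueError("mean transcript length must be positive")
    return [1e3 / l_bar * t for t in tpm]


# Two made-up samples with identical TPM values but different effective lengths.
sample_a = fpkm_from_tpm([500_000, 500_000], [1000, 3000])
sample_b = fpkm_from_tpm([500_000, 500_000], [500, 500])
print(sum(sample_a), sum(sample_b))
```

Both samples have TPM values that sum to 1 million, but their FPKM sums differ because l_bar differs, which is the point made in the sentence above.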

-

'IsoPct' stands for isoform percentage. It is the percentage of this -transcript's abandunce over its parent gene's abandunce. If its parent -gene has only one isoform or the gene information is not provided, -this field will be set to 100.

-

'posterior_mean_count', 'pme_TPM', 'pme_FPKM' are posterior mean -estimates calculated by RSEM's Gibbs -sampler. 'posterior_standard_deviation_of_count' is the posterior -standard deviation of counts. 'IsoPct_from_pme_TPM' is the isoform -percentage calculated from 'pme_TPM' values.

-

'TPM_ci_lower_bound', 'TPM_ci_upper_bound', 'FPKM_ci_lower_bound' and -'FPKM_ci_upper_bound' are lower(l) and upper(u) bounds of 95% -credibility intervals for TPM and FPKM values. The bounds are -inclusive (i.e. [l, u]).

-
-
sample_name.genes.results
- -
-

File containing gene level expression estimates. The first line -contains column names separated by the tab character. The format of -each line in the rest of this file is:

+ +

'IsoPct' stands for isoform percentage. It is the percentage of this transcript's abundance over its parent gene's abundance. If its parent gene has only one isoform or the gene information is not provided, this field will be set to 100.

+ +

'posterior_mean_count', 'pme_TPM', 'pme_FPKM' are posterior mean estimates calculated by RSEM's Gibbs sampler. 'posterior_standard_deviation_of_count' is the posterior standard deviation of counts. 'IsoPct_from_pme_TPM' is the isoform percentage calculated from 'pme_TPM' values.

+ +

'TPM_ci_lower_bound', 'TPM_ci_upper_bound', 'FPKM_ci_lower_bound' and 'FPKM_ci_upper_bound' are lower(l) and upper(u) bounds of 95% credibility intervals for TPM and FPKM values. The bounds are inclusive (i.e. [l, u]).

+ +
+
sample_name.genes.results
+
+ +

File containing gene level expression estimates. The first line contains column names separated by the tab character. The format of each line in the rest of this file is:

+

gene_id transcript_id(s) length effective_length expected_count TPM FPKM [posterior_mean_count posterior_standard_deviation_of_count pme_TPM pme_FPKM TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_bound FPKM_ci_upper_bound]

-

Fields are separated by the tab character. Fields within "[]" are -optional. They will not be presented if neither '--calc-pme' nor -'--calc-ci' is set.

-

'transcript_id(s)' is a comma-separated list of transcript_ids -belonging to this gene. If no gene information is provided, 'gene_id' -and 'transcript_id(s)' are identical (the 'transcript_id').

-

A gene's 'length' and 'effective_length' are -defined as the weighted average of its transcripts' lengths and -effective lengths (weighted by 'IsoPct'). A gene's abundance estimates -are just the sum of its transcripts' abundance estimates.

-
-
sample_name.alleles.results
- -
-

Only generated when the RSEM references are built with allele-specific -transcripts.

-

This file contains allele level expression estimates for -allele-specific expression calculation. The first line -contains column names separated by the tab character. The format of -each line in the rest of this file is:

+ +

Fields are separated by the tab character. Fields within "[]" are optional. They will not be presented if neither '--calc-pme' nor '--calc-ci' is set.

+ +

'transcript_id(s)' is a comma-separated list of transcript_ids belonging to this gene. If no gene information is provided, 'gene_id' and 'transcript_id(s)' are identical (the 'transcript_id').

+ +

A gene's 'length' and 'effective_length' are defined as the weighted average of its transcripts' lengths and effective lengths (weighted by 'IsoPct'). A gene's abundance estimates are just the sum of its transcripts' abundance estimates.
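
The gene-level definitions above can be sketched in Python as follows. The function name `gene_level_summary`, the dictionary layout and the example values are hypothetical. The fallback to a plain average when every 'IsoPct' is 0 is an assumption of this sketch and is not taken from RSEM.

```python
from typing import Dict, Mapping, Sequence


def gene_level_summary(isoforms: Sequence[Mapping[str, float]]) -> Dict[str, float]:
    """Aggregate isoform rows into gene-level length, effective_length and TPM."""
    if not isoforms:
        raise ValueError("a gene needs at least one isoform")
    total_pct = sum(iso["IsoPct"] for iso in isoforms)
    if total_pct > 0:
        weights = [iso["IsoPct"] / total_pct for iso in isoforms]
    else:
        # Assumption of this sketch: use equal weights for an unexpressed gene.
        weights = [1.0 / len(isoforms)] * len(isoforms)
    return {
        # Weighted averages, weighted by IsoPct.
        "length": sum(w * iso["length"] for w, iso in zip(weights, isoforms)),
        "effective_length": sum(
            w * iso["effective_length"] for w, iso in zip(weights, isoforms)
        ),
        # Abundance estimates are summed over the gene's transcripts.
        "TPM": sum(iso["TPM"] for iso in isoforms),
    }


gene = gene_level_summary([
    {"length": 1000, "effective_length": 800, "IsoPct": 25.0, "TPM": 10.0},
    {"length": 2000, "effective_length": 1800, "IsoPct": 75.0, "TPM": 30.0},
])
print(gene)
```

The same summation applies to the other abundance columns (for example 'expected_count' and 'FPKM'); only 'TPM' is shown to keep the sketch short.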

+ +
+
sample_name.alleles.results
+
+ +

Only generated when the RSEM references are built with allele-specific transcripts.

+ +

This file contains allele level expression estimates for allele-specific expression calculation. The first line contains column names separated by the tab character. The format of each line in the rest of this file is:

+

allele_id transcript_id gene_id length effective_length expected_count TPM FPKM AlleleIsoPct AlleleGenePct [posterior_mean_count posterior_standard_deviation_of_count pme_TPM pme_FPKM AlleleIsoPct_from_pme_TPM AlleleGenePct_from_pme_TPM TPM_ci_lower_bound TPM_ci_upper_bound FPKM_ci_lower_bound FPKM_ci_upper_bound]

-

Fields are separated by the tab character. Fields within "[]" are -optional. They will not be presented if neither '--calc-pme' nor -'--calc-ci' is set.

-

'allele_id' is the allele-specific name of this allele-specific transcript.

-

'AlleleIsoPct' stands for allele-specific percentage on isoform -level. It is the percentage of this allele-specific transcript's -abundance over its parent transcript's abundance. If its parent -transcript has only one allele variant form, this field will be set to -100.

-

'AlleleGenePct' stands for allele-specific percentage on gene -level. It is the percentage of this allele-specific transcript's -abundance over its parent gene's abundance.

-

'AlleleIsoPct_from_pme_TPM' and 'AlleleGenePct_from_pme_TPM' have -similar meanings. They are calculated based on posterior mean -estimates.

-

Please note that if this file is present, the fields 'length' and -'effective_length' in 'sample_name.isoforms.results' should be -interpreted similarly as the corresponding definitions in -'sample_name.genes.results'.

-
-
sample_name.transcript.bam, sample_name.transcript.sorted.bam and sample_name.transcript.sorted.bam.bai
+

Fields are separated by the tab character. Fields within "[]" are optional. They will not be presented if neither '--calc-pme' nor '--calc-ci' is set.

+ +

'allele_id' is the allele-specific name of this allele-specific transcript.

+ +

'AlleleIsoPct' stands for allele-specific percentage on isoform level. It is the percentage of this allele-specific transcript's abundance over its parent transcript's abundance. If its parent transcript has only one allele variant form, this field will be set to 100.

+ +

'AlleleGenePct' stands for allele-specific percentage on gene level. It is the percentage of this allele-specific transcript's abundance over its parent gene's abundance.

+ +

'AlleleIsoPct_from_pme_TPM' and 'AlleleGenePct_from_pme_TPM' have similar meanings. They are calculated based on posterior mean estimates.

+ +

Please note that if this file is present, the fields 'length' and 'effective_length' in 'sample_name.isoforms.results' should be interpreted similarly as the corresponding definitions in 'sample_name.genes.results'.
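The 'AlleleIsoPct' definition above can be illustrated with a short Python sketch. It is not taken from the RSEM source: the function name, the use of TPM as the "abundance", and the value returned when a multi-allele parent transcript has zero total abundance are all assumptions made for illustration.

```python
from collections import defaultdict


def allele_iso_pct(rows):
    """Compute an AlleleIsoPct-like value for each allele-specific transcript.

    rows: list of (allele_id, transcript_id, tpm) tuples (hypothetical input shape).
    Returns a dict mapping allele_id to a percentage.
    """
    totals = defaultdict(float)  # summed TPM per parent transcript
    counts = defaultdict(int)    # number of allele variants per parent transcript
    for _, transcript_id, tpm in rows:
        totals[transcript_id] += tpm
        counts[transcript_id] += 1

    result = {}
    for allele_id, transcript_id, tpm in rows:
        if counts[transcript_id] == 1:
            # Documented behaviour: a single allele variant is set to 100.
            result[allele_id] = 100.0
        elif totals[transcript_id] == 0.0:
            # Assumption: report 0 when the parent transcript has no abundance.
            result[allele_id] = 0.0
        else:
            result[allele_id] = 100.0 * tpm / totals[transcript_id]
    return result
```

'AlleleGenePct' would follow the same pattern with the parent gene, rather than the parent transcript, as the grouping key.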

+ + +
sample_name.transcript.bam, sample_name.transcript.sorted.bam and sample_name.transcript.sorted.bam.bai
+

Only generated when --no-bam-output is not specified.

-

'sample_name.transcript.bam' is a BAM-formatted file of read -alignments in transcript coordinates. The MAPQ field of each alignment -is set to min(100, floor(-10 * log10(1.0 - w) + 0.5)), where w is the -posterior probability of that alignment being the true mapping of a -read. In addition, RSEM pads a new tag ZW:f:value, where value is a -single precision floating number representing the posterior -probability. Because this file contains all alignment lines produced -by bowtie or user-specified aligners, it can also be used as a -replacement of the aligner generated BAM/SAM file. For paired-end -reads, if one mate has alignments but the other does not, this file -marks the alignable mate as "unmappable" (flag bit 0x4) and appends an -optional field "Z0:A:!".

-

'sample_name.transcript.sorted.bam' and -'sample_name.transcript.sorted.bam.bai' are the sorted BAM file and -indices generated by samtools (included in RSEM package).

-
-
sample_name.genome.bam, sample_name.genome.sorted.bam and sample_name.genome.sorted.bam.bai
+

'sample_name.transcript.bam' is a BAM-formatted file of read alignments in transcript coordinates. The MAPQ field of each alignment is set to min(100, floor(-10 * log10(1.0 - w) + 0.5)), where w is the posterior probability of that alignment being the true mapping of a read. In addition, RSEM pads a new tag ZW:f:value, where value is a single precision floating number representing the posterior probability. Because this file contains all alignment lines produced by bowtie or user-specified aligners, it can also be used as a replacement of the aligner generated BAM/SAM file. For paired-end reads, if one mate has alignments but the other does not, this file marks the alignable mate as "unmappable" (flag bit 0x4) and appends an optional field "Z0:A:!".
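As a worked illustration of the MAPQ formula above, the following Python sketch evaluates min(100, floor(-10 * log10(1.0 - w) + 0.5)) for a given posterior probability w. The function name is hypothetical, and returning 100 directly for w = 1 (where the logarithm is undefined) is an assumption consistent with the cap in the formula.

```python
import math


def rsem_mapq(w):
    """Evaluate min(100, floor(-10 * log10(1.0 - w) + 0.5)) for posterior w."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("posterior probability must lie in [0, 1]")
    if w >= 1.0:
        # log10(0) is undefined; assume the value saturates at the cap of 100.
        return 100
    return min(100, math.floor(-10.0 * math.log10(1.0 - w) + 0.5))
```

For example, a posterior of 0.9 gives a MAPQ of 10 and a posterior of 0.5 gives a MAPQ of 3.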

+ +

'sample_name.transcript.sorted.bam' and 'sample_name.transcript.sorted.bam.bai' are the sorted BAM file and indices generated by samtools (included in RSEM package).

+ + +
sample_name.genome.bam, sample_name.genome.sorted.bam and sample_name.genome.sorted.bam.bai
+

Only generated when --no-bam-output is not specified and --output-genome-bam is specified.

-

'sample_name.genome.bam' is a BAM-formatted file of read alignments in -genomic coordinates. Alignments of reads that have identical genomic -coordinates (i.e., alignments to different isoforms that share the -same genomic region) are collapsed into one alignment. The MAPQ field -of each alignment is set to min(100, floor(-10 * log10(1.0 - w) + -0.5)), where w is the posterior probability of that alignment being -the true mapping of a read. In addition, RSEM pads a new tag -ZW:f:value, where value is a single precision floating number -representing the posterior probability. If an alignment is spliced, a -XS:A:value tag is also added, where value is either '+' or '-' -indicating the strand of the transcript it aligns to.

-

'sample_name.genome.sorted.bam' and 'sample_name.genome.sorted.bam.bai' are the -sorted BAM file and indices generated by samtools (included in RSEM package).

-
-
sample_name.time
+

'sample_name.genome.bam' is a BAM-formatted file of read alignments in genomic coordinates. Alignments of reads that have identical genomic coordinates (i.e., alignments to different isoforms that share the same genomic region) are collapsed into one alignment. The MAPQ field of each alignment is set to min(100, floor(-10 * log10(1.0 - w) + 0.5)), where w is the posterior probability of that alignment being the true mapping of a read. In addition, RSEM pads a new tag ZW:f:value, where value is a single precision floating number representing the posterior probability. If an alignment is spliced, a XS:A:value tag is also added, where value is either '+' or '-' indicating the strand of the transcript it aligns to.
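To show how the ZW tag described above might be consumed, here is a minimal Python sketch that reads the posterior probability from one SAM-formatted text line (for instance, a line printed by 'samtools view'). The function name is hypothetical; the sketch relies only on the SAM convention that optional tags follow the 11 mandatory tab-separated fields.

```python
def posterior_from_sam_line(line):
    """Return the ZW:f: posterior from one SAM text line, or None if absent."""
    fields = line.rstrip("\n").split("\t")
    # Optional tags start after the 11 mandatory SAM columns.
    for tag in fields[11:]:
        if tag.startswith("ZW:f:"):
            return float(tag[len("ZW:f:"):])
    return None
```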

+ +

'sample_name.genome.sorted.bam' and 'sample_name.genome.sorted.bam.bai' are the sorted BAM file and indices generated by samtools (included in RSEM package).

+ + +
sample_name.time
+

Only generated when --time is specified.

+

It contains time (in seconds) consumed by aligning reads, estimating expression levels and calculating credibility intervals.

-
-
sample_name.stat
+ +
sample_name.stat
-

This is a folder instead of a file. All model-related statistics are stored in this folder. 'rsem-plot-model' can generate plots from the contents of this folder.

-

'sample_name.stat/sample_name.cnt' contains alignment statistics. The format and meanings of each field are described in 'cnt_file_description.txt' under RSEM directory.

-

'sample_name.stat/sample_name.model' stores RNA-Seq model parameters learned from the data. The format and meanings of each field of this file are described in 'model_file_description.txt' under RSEM directory.

+ +

This is a folder instead of a file. All model-related statistics are stored in this folder. 'rsem-plot-model' can generate plots from the contents of this folder.

+ +

'sample_name.stat/sample_name.cnt' contains alignment statistics. The format and meanings of each field are described in 'cnt_file_description.txt' under RSEM directory.

+ +

'sample_name.stat/sample_name.model' stores RNA-Seq model parameters learned from the data. The format and meanings of each field of this file are described in 'model_file_description.txt' under RSEM directory.

+
-

-

-
-

EXAMPLES

-

Assume the path to the bowtie executables is in the user's PATH environment variable. Reference files are under '/ref' with name 'mouse_125'.

-

1) '/data/mmliver.fq', single-end reads with quality scores. Quality scores are encoded as for 'GA pipeline version >= 1.3'. We want to use 8 threads and generate a genome BAM file:

-
- rsem-calculate-expression --phred64-quals \
+
+

EXAMPLES

+ +

Assume the path to the bowtie executables is in the user's PATH environment variable. Reference files are under '/ref' with name 'mouse_125'.

+ +

1) '/data/mmliver.fq', single-end reads with quality scores. Quality scores are encoded as for 'GA pipeline version >= 1.3'. We want to use 8 threads and generate a genome BAM file:

+ +
 rsem-calculate-expression --phred64-quals \
                            -p 8 \
                            --output-genome-bam \
                            /data/mmliver.fq \
                            /ref/mouse_125 \
-                           mmliver_single_quals
+                           mmliver_single_quals
+ +

2) '/data/mmliver_1.fq' and '/data/mmliver_2.fq', paired-end reads with quality scores. Quality scores are in SANGER format. We want to use 8 threads and do not generate a genome BAM file:

-
-

2) '/data/mmliver_1.fq' and '/data/mmliver_2.fq', paired-end reads with quality scores. Quality scores are in SANGER format. We want to use 8 threads and do not generate a genome BAM file:

-
- rsem-calculate-expression -p 8 \
+
 rsem-calculate-expression -p 8 \
                            --paired-end \
                            /data/mmliver_1.fq \
                            /data/mmliver_2.fq \
                            /ref/mouse_125 \
-                           mmliver_paired_end_quals
+                           mmliver_paired_end_quals
-
-

3) '/data/mmliver.fa', single-end reads without quality scores. We want to use 8 threads:

-
- rsem-calculate-expression -p 8 \
+

3) '/data/mmliver.fa', single-end reads without quality scores. We want to use 8 threads:

+ +
 rsem-calculate-expression -p 8 \
                            --no-qualities \
                            /data/mmliver.fa \
                            /ref/mouse_125 \
-                           mmliver_single_without_quals
+                           mmliver_single_without_quals
+ +

4) Data are the same as 1). This time we assume the bowtie executables are under '/sw/bowtie'. We want to take a fragment length distribution into consideration. We set the fragment length mean to 150 and the standard deviation to 35. In addition to a BAM file, we also want to generate credibility intervals. We allow RSEM to use 1GB of memory for CI calculation:

-
-

4) Data are the same as 1). This time we assume the bowtie executables are under '/sw/bowtie'. We want to take a fragment length distribution into consideration. We set the fragment length mean to 150 and the standard deviation to 35. In addition to a BAM file, we also want to generate credibility intervals. We allow RSEM to use 1GB of memory for CI calculation:

-
- rsem-calculate-expression --bowtie-path /sw/bowtie \
+
 rsem-calculate-expression --bowtie-path /sw/bowtie \
                            --phred64-quals \
                            --fragment-length-mean 150.0 \
                            --fragment-length-sd 35.0 \
@@ -578,32 +588,32 @@ 

EXAMPLES

--ci-memory 1024 \ /data/mmliver.fq \ /ref/mouse_125 \ - mmliver_single_quals + mmliver_single_quals
-
-

5) '/data/mmliver_paired_end_quals.bam', paired-end reads with quality scores. We want to use 8 threads:

-
- rsem-calculate-expression --paired-end \
+

5) '/data/mmliver_paired_end_quals.bam', paired-end reads with quality scores. We want to use 8 threads:

+ +
 rsem-calculate-expression --paired-end \
                            --bam \
                            -p 8 \
                            /data/mmliver_paired_end_quals.bam \
                            /ref/mouse_125 \
-                           mmliver_paired_end_quals
+                           mmliver_paired_end_quals
+ +

6) '/data/mmliver_1.fq.gz' and '/data/mmliver_2.fq.gz', paired-end reads with quality scores; the read files are compressed by gzip. We want to use STAR to align the reads and assume the STAR executable is '/sw/STAR'. Suppose we want to use 8 threads and do not generate a genome BAM file:

-
-

6) '/data/mmliver_1.fq.gz' and '/data/mmliver_2.fq.gz', paired-end reads with quality scores; the read files are compressed by gzip. We want to use STAR to align the reads and assume the STAR executable is '/sw/STAR'. Suppose we want to use 8 threads and do not generate a genome BAM file:

-
- rsem-calculate-expression --star \
+
 rsem-calculate-expression --paired-end \
+                           --star \
                            --star-path /sw/STAR \
                            --gzipped-read-file \
                            -p 8 \
                            /data/mmliver_1.fq.gz \
                            /data/mmliver_2.fq.gz \
                            /ref/mouse_125 \
-                           mmliver_paired_end_quals
+                           mmliver_paired_end_quals
-
+ + diff --git a/rsem-control-fdr.html b/rsem-control-fdr.html index a1ccfee..f4078a2 100644 --- a/rsem-control-fdr.html +++ b/rsem-control-fdr.html @@ -4,94 +4,91 @@ rsem-control-fdr - + - -
-

-
- +

NAME

-

-

-

NAME

rsem-control-fdr

-

-

-
-

SYNOPSIS

+ +

SYNOPSIS

+

rsem-control-fdr [options] input_file fdr_rate output_file

-

-

-
-

ARGUMENTS

+ +

ARGUMENTS

+
-
input_file
+
input_file
-

This should be the main result file generated by 'rsem-run-ebseq', which contains all genes/transcripts and their associated statistics.

-
-
fdr_rate
+

This should be the main result file generated by 'rsem-run-ebseq', which contains all genes/transcripts and their associated statistics.

+ + +
fdr_rate
+

The desired false discovery rate (FDR).

-
-
output_file
+ +
output_file
-

This file is a subset of the 'input_file'. It only contains the genes/transcripts called as differentially expressed (DE). When more than 2 conditions exist, DE is defined as not all conditions are equally expressed. Because statistical significance does not necessarily mean biological significance, users should also refer to the fold changes to decide which genes/transcripts are biologically significant. When more than two conditions exist, this file will not contain fold change information and users need to calculate it from 'input_file.condmeans' by themselves.

+ +

This file is a subset of the 'input_file'. It only contains the genes/transcripts called as differentially expressed (DE). When more than 2 conditions exist, DE is defined as not all conditions are equally expressed. Because statistical significance does not necessarily mean biological significance, users should also refer to the fold changes to decide which genes/transcripts are biologically significant. When more than two conditions exist, this file will not contain fold change information and users need to calculate it from 'input_file.condmeans' by themselves.

+
-

-

-
-

OPTIONS

+ +

OPTIONS

+
-
--hard-threshold
+
--hard-threshold
+

Use hard threshold method to control FDR. If this option is set, only those genes/transcripts with their PPDE >= 1 - fdr_rate are called as DE. (Default: on)

-
-
--soft-threshold
-
-

Use soft threshold method to control FDR. If this option is set, this program will try to report as many genes/transcripts as possible, as long as their average PPDE >= 1 - fdr_rate. This option is equivalent to using EBSeq's 'crit_fun' for FDR control. (Default: off)

-
-h/--help
+
--soft-threshold
+
+ +

Use soft threshold method to control FDR. If this option is set, this program will try to report as many genes/transcripts as possible, as long as their average PPDE >= 1 - fdr_rate. This option is equivalent to using EBSeq's 'crit_fun' for FDR control. (Default: off)

+
+
-h/--help
+

Show help information.

+
-

-

-
-

DESCRIPTION

+ +

DESCRIPTION

+

This program controls the false discovery rate and reports differentially expressed genes/transcripts.
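The difference between the '--hard-threshold' and '--soft-threshold' options can be sketched in Python as follows. This is one reading of the option descriptions, not code from RSEM or EBSeq: the function names and the dictionary input (gene/transcript name mapped to PPDE) are assumptions for illustration.

```python
def hard_threshold(ppde, fdr_rate):
    """Keep every entry whose own PPDE is at least 1 - fdr_rate."""
    cutoff = 1.0 - fdr_rate
    return [name for name, p in ppde.items() if p >= cutoff]


def soft_threshold(ppde, fdr_rate):
    """Keep the largest top-ranked set whose average PPDE is at least 1 - fdr_rate."""
    cutoff = 1.0 - fdr_rate
    # Rank by PPDE, highest first, so the running average can only decrease.
    ranked = sorted(ppde.items(), key=lambda item: item[1], reverse=True)
    selected = []
    total = 0.0
    for name, p in ranked:
        if (total + p) / (len(selected) + 1) < cutoff:
            break
        total += p
        selected.append(name)
    return selected
```

With PPDE values of 0.99, 0.97, 0.90 and 0.60 and an fdr_rate of 0.05, the hard threshold keeps the first two entries, while the soft threshold also keeps the third because the average of the top three is still above 0.95.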

-

-

-
-

EXAMPLES

-

We assume that we have 'GeneMat.results' as input. We want to control FDR at 0.05 using hard threshold method and name the output file as 'GeneMat.de.txt':

-
- rsem-control-fdr GeneMat.results 0.05 GeneMat.de.txt
+ +

EXAMPLES

+ +

We assume that we have 'GeneMat.results' as input. We want to control FDR at 0.05 using hard threshold method and name the output file as 'GeneMat.de.txt':

+ +
 rsem-control-fdr GeneMat.results 0.05 GeneMat.de.txt
+ + + diff --git a/rsem-generate-ngvector.html b/rsem-generate-ngvector.html index 7450ee3..759428f 100644 --- a/rsem-generate-ngvector.html +++ b/rsem-generate-ngvector.html @@ -4,108 +4,108 @@ rsem-generate-ngvector - + - -
-

-
- +

NAME

-

-

-

NAME

rsem-generate-ngvector

-

-

-
-

SYNOPSIS

+ +

SYNOPSIS

+

rsem-generate-ngvector [options] input_fasta_file output_name

-

-

-
-

ARGUMENTS

+ +

ARGUMENTS

+
-
input_fasta_file
+
input_fasta_file
-

The fasta file containing all reference transcripts. The transcripts must be in the same order as those in expression value files. Thus, 'reference_name.transcripts.fa' generated by 'rsem-prepare-reference' should be used.

-
-
output_name
+

The fasta file containing all reference transcripts. The transcripts must be in the same order as those in expression value files. Thus, 'reference_name.transcripts.fa' generated by 'rsem-prepare-reference' should be used.

+ + +
output_name
-

The name of all output files. The Ng vector will be stored as 'output_name.ngvec'.

+ +

The name of all output files. The Ng vector will be stored as 'output_name.ngvec'.

+
-

-

-
-

OPTIONS

+ +

OPTIONS

+
-
-k <int>
+
-k <int>
+

k mer length. See description section. (Default: 25)

-
-
-h/--help
+ +
-h/--help
+

Show help information.

+
-

-

-
-

DESCRIPTION

-

This program generates the Ng vector required by EBSeq for isoform level differential expression analysis based on reference sequences only. EBSeq can take variance due to read mapping ambiguity into consideration by grouping isoforms with parent gene's number of isoforms. However, for a de novo assembled transcriptome, it is hard to obtain an accurate gene-isoform relationship. Instead, this program groups isoforms by using measures on read mapping ambiguity directly. First, it calculates the 'unmappability' of each transcript. The 'unmappability' of a transcript is the ratio between the number of k mers with at least one perfect match to other transcripts and the total number of k mers of this transcript, where k is a parameter. Then, Ng vector is generated by applying Kmeans algorithm to the 'unmappability' values with number of clusters set as 3. 'rsem-generate-ngvector' will make sure the mean 'unmappability' scores for clusters are in ascending order. All transcripts whose lengths are less than k are assigned to cluster 3.

-

If your reference is a de novo assembled transcript set, you should run 'rsem-generate-ngvector' first. Then load the resulting 'output_name.ngvec' into R. For example, you can use

-
- NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")
-

. After that, replace 'IsoNgTrun' with 'NgVec' in the second line of section 3.2.5 (Page 10) of EBSeq's vignette:

-
- IsoEBres=EBTest(Data=IsoMat, NgVector=NgVec, ...)
+ +

DESCRIPTION

+ +

This program generates the Ng vector required by EBSeq for isoform level differential expression analysis based on reference sequences only. EBSeq can take variance due to read mapping ambiguity into consideration by grouping isoforms with parent gene's number of isoforms. However, for a de novo assembled transcriptome, it is hard to obtain an accurate gene-isoform relationship. Instead, this program groups isoforms by using measures on read mapping ambiguity directly. First, it calculates the 'unmappability' of each transcript. The 'unmappability' of a transcript is the ratio between the number of k mers with at least one perfect match to other transcripts and the total number of k mers of this transcript, where k is a parameter. Then, Ng vector is generated by applying Kmeans algorithm to the 'unmappability' values with number of clusters set as 3. 'rsem-generate-ngvector' will make sure the mean 'unmappability' scores for clusters are in ascending order. All transcripts whose lengths are less than k are assigned to cluster 3.
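The 'unmappability' ratio defined above can be illustrated with a small Python sketch. It is not the 'rsem-generate-ngvector' implementation: the function name and input shape are hypothetical, matching is assumed to be on the forward strand only, and the Kmeans clustering step is left out.

```python
from collections import defaultdict


def unmappability(transcripts, k=25):
    """Return {name: ratio} of k-mers that also occur in another transcript.

    transcripts: dict mapping transcript name to its sequence string.
    Transcripts shorter than k have no k-mers and get None; the documentation
    states that such transcripts are assigned to cluster 3.
    """
    # Record which transcripts contain each k-mer.
    owners = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)

    scores = {}
    for name, seq in transcripts.items():
        total = len(seq) - k + 1
        if total <= 0:
            scores[name] = None
            continue
        # A k-mer is shared when some other transcript also contains it.
        shared = sum(1 for i in range(total) if len(owners[seq[i:i + k]]) > 1)
        scores[name] = shared / total
    return scores
```

With k = 4, the sequences 'AAACGT' and 'ACGTTT' share only the k-mer 'ACGT', so each has an 'unmappability' of 1/3.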

+ +

If your reference is a de novo assembled transcript set, you should run 'rsem-generate-ngvector' first. Then load the resulting 'output_name.ngvec' into R. For example, you can use

+ +
 NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")
+ +

. After that, replace 'IsoNgTrun' with 'NgVec' in the second line of section 3.2.5 (Page 10) of EBSeq's vignette:

+ +
 IsoEBres=EBTest(Data=IsoMat, NgVector=NgVec, ...)
+

This program only needs to run once per RSEM reference.

-

-

-
-

OUTPUT

+ +

OUTPUT

+
-
output_name.ump
+
output_name.ump
-

'unmappability' scores for each transcript. This file contains two columns. The first column is transcript name and the second column is 'unmappability' score.

-
-
output_name.ngvec
+

'unmappability' scores for each transcript. This file contains two columns. The first column is transcript name and the second column is 'unmappability' score.

+ + +
output_name.ngvec
+

Ng vector generated by this program.

+
-

-

-
-

EXAMPLES

-

Suppose the reference sequences file is '/ref/mouse_125/mouse_125.transcripts.fa' and we set the output_name as 'mouse_125':

-
- rsem-generate-ngvector /ref/mouse_125/mouse_125.transcripts.fa mouse_125
+ +

EXAMPLES

+ +

Suppose the reference sequences file is '/ref/mouse_125/mouse_125.transcripts.fa' and we set the output_name as 'mouse_125':

+ +
 rsem-generate-ngvector /ref/mouse_125/mouse_125.transcripts.fa mouse_125
+ + + diff --git a/rsem-plot-transcript-wiggles.html b/rsem-plot-transcript-wiggles.html index 45824c7..f5fb9e9 100644 --- a/rsem-plot-transcript-wiggles.html +++ b/rsem-plot-transcript-wiggles.html @@ -4,125 +4,128 @@ rsem-plot-transcript-wiggles - + - -
-

-
- +

NAME

-

-

-

NAME

rsem-plot-transcript-wiggles

-

-

-
-

SYNOPSIS

+ +

SYNOPSIS

+

rsem-plot-transcript-wiggles [options] sample_name input_list output_plot_file

-

-

-
-

ARGUMENTS

+ +

ARGUMENTS

+
-
sample_name
+
sample_name
+

The name of the sample analyzed.

-
-
input_list
+ +
input_list
+

A list of transcript ids or gene ids. But it cannot be a mixture of transcript & gene ids. Each id occupies one line without extra spaces.

-
-
output_plot_file
+ +
output_plot_file
+

The file name of the pdf file which contains all plots.

+
-

-

-
-

OPTIONS

+ +

OPTIONS

+
-
--gene-list
+
--gene-list
+

The input-list is a list of gene ids. (Default: off)

-
-
--transcript-list
+ +
--transcript-list
+

The input-list is a list of transcript ids. This option can only be turned on if allele-specific expression is calculated. (Default: off)

-
-
--show-unique
+ +
--show-unique
+

Show the wiggle plots as stacked bar plots. See description section for details. (Default: off)

-
-
-h/--help
+ +
-h/--help
+

Show help information.

+
-

-

-
-

DESCRIPTION

-

This program generates transcript wiggle plots and outputs them in a pdf file. This program can accept either a list of transcript ids or gene ids (if transcript to gene mapping information is provided) and has two modes of showing wiggle plots. If '--show-unique' is not specified, the wiggle plot for each transcript is a histogram where each position has the expected read depth at this position as its height. If '--show-unique' is specified, for each transcript a stacked bar plot is generated. For each position, the read depth of unique reads, which have only one alignment, is shown in black. The read depth of multi-reads, which align to more than one place, is shown in red on top of the read depth of unique reads. This program will use some files RSEM generated previously, so please do not delete/move any file 'rsem-calculate-expression' generated. If allele-specific expression is calculated, the basic unit for plotting is an allele-specific transcript and plots can be grouped by either transcript ids (--transcript-list) or gene ids (--gene-list).

-

-

-
-

OUTPUT

+ +

DESCRIPTION

+ +

This program generates transcript wiggle plots and outputs them in a pdf file. This program can accept either a list of transcript ids or gene ids (if transcript to gene mapping information is provided) and has two modes of showing wiggle plots. If '--show-unique' is not specified, the wiggle plot for each transcript is a histogram where each position has the expected read depth at this position as its height. If '--show-unique' is specified, for each transcript a stacked bar plot is generated. For each position, the read depth of unique reads, which have only one alignment, is shown in black. The read depth of multi-reads, which align to more than one place, is shown in red on top of the read depth of unique reads. This program will use some files RSEM generated previously, so please do not delete/move any file 'rsem-calculate-expression' generated. If allele-specific expression is calculated, the basic unit for plotting is an allele-specific transcript and plots can be grouped by either transcript ids (--transcript-list) or gene ids (--gene-list).
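The two layers of the '--show-unique' stacked bar plot can be sketched in Python as two per-position depth arrays. This is an illustration only: the function name, the 0-based half-open coordinates, and the weighting of each alignment by its posterior probability are assumptions, not a description of how 'rsem-plot-transcript-wiggles' computes its read depth files.

```python
def stacked_depth(length, alignments):
    """Return (unique_depth, multi_depth) lists for one transcript.

    alignments: list of (start, end, weight, is_unique) tuples, where
    [start, end) are 0-based positions on the transcript (assumed convention)
    and weight is the alignment's posterior probability (assumed weighting).
    """
    unique_depth = [0.0] * length
    multi_depth = [0.0] * length
    for start, end, weight, is_unique in alignments:
        target = unique_depth if is_unique else multi_depth
        # Clip the alignment to the transcript boundaries.
        for pos in range(max(start, 0), min(end, length)):
            target[pos] += weight
    return unique_depth, multi_depth
```

In a stacked plot, the height at each position would then be the unique depth (black) with the multi-read depth (red) drawn on top of it.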

+ +

OUTPUT

+
-
output_plot_file
+
output_plot_file
-

This is a pdf file containing all plots generated. If a list of transcript ids is provided, each page displays at most 6 plots in 3 rows and 2 columns. If gene ids are provided, each page displays a gene. The gene's id is shown at the top and all its transcripts' wiggle plots are shown on this page. The arrangement of plots is determined automatically. For each transcript wiggle plot, the transcript id is displayed as title. x-axis is position in the transcript and y-axis is read depth. If allele-specific expression is calculated, the basic unit becomes an allele-specific transcript and transcript ids and gene ids can be used to group allele-specific transcripts.

-
-
sample_name.transcript.sorted.bam and sample_name.transcript.readdepth
-
-

If these files do not exist, 'rsem-plot-transcript-wiggles' will automatically generate them.

+

This is a pdf file containing all plots generated. If a list of transcript ids is provided, each page displays at most 6 plots in 3 rows and 2 columns. If gene ids are provided, each page displays a gene. The gene's id is shown at the top and all its transcripts' wiggle plots are shown on this page. The arrangement of plots is determined automatically. For each transcript wiggle plot, the transcript id is displayed as title. x-axis is position in the transcript and y-axis is read depth. If allele-specific expression is calculated, the basic unit becomes an allele-specific transcript and transcript ids and gene ids can be used to group allele-specific transcripts.

+
-
sample_name.uniq.transcript.bam, sample_name.uniq.transcript.sorted.bam and sample_name.uniq.transcript.readdepth
+
sample_name.transcript.sorted.bam and sample_name.transcript.readdepth
+
+ +

If these files do not exist, 'rsem-plot-transcript-wiggles' will automatically generate them.

+
+
sample_name.uniq.transcript.bam, sample_name.uniq.transcript.sorted.bam and sample_name.uniq.transcript.readdepth
-

If '--show-unique' option is specified and these files do not exist, 'rsem-plot-transcript-wiggles' will automatically generate them.

+ +

If '--show-unique' option is specified and these files do not exist, 'rsem-plot-transcript-wiggles' will automatically generate them.

+
-

-

-
-

EXAMPLES

-

Suppose sample_name and output_plot_file are set to 'mmliver_single_quals' and 'output.pdf' respectively. input_list is set to 'transcript_ids.txt' if transcript ids are provided, and is set to 'gene_ids.txt' if gene ids are provided.

+ +

EXAMPLES

+ +

Suppose sample_name and output_plot_file are set to 'mmliver_single_quals' and 'output.pdf' respectively. input_list is set to 'transcript_ids.txt' if transcript ids are provided, and is set to 'gene_ids.txt' if gene ids are provided.

+

1) Transcript ids are provided and we just want normal wiggle plots:

-
- rsem-plot-transcript-wiggles mmliver_single_quals transcript_ids.txt output.pdf
+ +
 rsem-plot-transcript-wiggles mmliver_single_quals transcript_ids.txt output.pdf
+

2) Gene ids are provided and we want to show stacked bar plots:

-
- rsem-plot-transcript-wiggles --gene-list --show-unique mmliver_single_quals gene_ids.txt output.pdf
+ +
 rsem-plot-transcript-wiggles --gene-list --show-unique mmliver_single_quals gene_ids.txt output.pdf 
+ + + diff --git a/rsem-prepare-reference.html b/rsem-prepare-reference.html index cfb9403..4dcc392 100644 --- a/rsem-prepare-reference.html +++ b/rsem-prepare-reference.html @@ -4,225 +4,250 @@ rsem-prepare-reference - + - -
-

-
- +

NAME

-

-

-

NAME

rsem-prepare-reference

-

-

-
-

SYNOPSIS

+ +

SYNOPSIS

+

rsem-prepare-reference [options] reference_fasta_file(s) reference_name

-

-

-
-

ARGUMENTS

+ +

ARGUMENTS

+
-
reference_fasta_file(s)
+
reference_fasta_file(s)
-

Either a comma-separated list of Multi-FASTA formatted files OR a directory name. If a directory name is specified, RSEM will read all files with suffix ".fa" or ".fasta" in this directory. The files should contain either the sequences of transcripts or an entire genome, depending on whether the --gtf option is used.

-
-
reference name
+

Either a comma-separated list of Multi-FASTA formatted files OR a directory name. If a directory name is specified, RSEM will read all files with suffix ".fa" or ".fasta" in this directory. The files should contain either the sequences of transcripts or an entire genome, depending on whether the --gtf option is used.

+ + +
reference_name
+

The name of the reference used. RSEM will generate several reference-related files that are prefixed by this name. This name can contain path information (e.g. /ref/mm9).

+
-

-

-
-

OPTIONS

+ +

OPTIONS

+
-
--gtf <file>
+
--gtf <file>
-

If this option is on, RSEM assumes that 'reference_fasta_file(s)' contains the sequence of a genome, and will extract transcript reference sequences using the gene annotations specified in <file>, which should be in GTF format.

-

If this option is off, RSEM will assume 'reference_fasta_file(s)' contains the reference transcripts. In this case, RSEM assumes that name of each sequence in the Multi-FASTA files is its transcript_id.

+ +

If this option is on, RSEM assumes that 'reference_fasta_file(s)' contains the sequence of a genome, and will extract transcript reference sequences using the gene annotations specified in <file>, which should be in GTF format.

+ +

If this option is off, RSEM will assume 'reference_fasta_file(s)' contains the reference transcripts. In this case, RSEM assumes that name of each sequence in the Multi-FASTA files is its transcript_id.

+

(Default: off)

-
-
--transcript-to-gene-map <file>
+ +
--transcript-to-gene-map <file>
-

Use information from <file> to map from transcript (isoform) ids to gene ids. -Each line of <file> should be of the form:

+ +

Use information from <file> to map from transcript (isoform) ids to gene ids. Each line of <file> should be of the form:

+

gene_id transcript_id

+

with the two fields separated by a tab character.
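A file in this two-column format can be read with a few lines of Python. The sketch below is an illustration of the format only, not RSEM's own parser; the function name and the choice to skip blank lines are assumptions.

```python
def parse_transcript_to_gene_map(lines):
    """Parse 'gene_id<TAB>transcript_id' lines into {transcript_id: gene_id}."""
    mapping = {}
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines are ignored
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 tab-separated fields")
        gene_id, transcript_id = fields
        mapping[transcript_id] = gene_id
    return mapping
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `parse_transcript_to_gene_map(open("knownIsoforms.txt"))`.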

+

If you are using a GTF file for the "UCSC Genes" gene set from the UCSC Genome Browser, then the "knownIsoforms.txt" file (obtained from the "Downloads" section of the UCSC Genome Browser site) is of this format.

-

If this option is off, then the mapping of isoforms to genes depends on whether the --gtf option is specified. If --gtf is specified, then RSEM uses the "gene_id" and "transcript_id" attributes in the GTF file. Otherwise, RSEM assumes that each sequence in the reference sequence files is a separate gene.

+ +

If this option is off, then the mapping of isoforms to genes depends on whether the --gtf option is specified. If --gtf is specified, then RSEM uses the "gene_id" and "transcript_id" attributes in the GTF file. Otherwise, RSEM assumes that each sequence in the reference sequence files is a separate gene.

+

(Default: off)

-
-
--allele-to-gene-map <file>
+ +
--allele-to-gene-map <file>
-

Use information from <file> to provide gene_id and transcript_id information for each allele-specific transcript. -Each line of <file> should be of the form:

+ +

Use information from <file> to provide gene_id and transcript_id information for each allele-specific transcript. Each line of <file> should be of the form:

+

gene_id transcript_id allele_id

+

with the fields separated by a tab character.

-

This option is designed for quantifying allele-specific expression. It is only valid if '--gtf' option is not specified. allele_id should be the sequence names presented in the Multi-FASTA-formatted files.

+ +

This option is designed for quantifying allele-specific expression. It is only valid if '--gtf' option is not specified. allele_id should be the sequence names presented in the Multi-FASTA-formatted files.

+

(Default: off)

-
-
--polyA
-
-

Add poly(A) tails to the end of all reference isoforms. The length of poly(A) tail added is specified by '--polyA-length' option. STAR aligner users may not want to use this option. (Default: do not add poly(A) tail to any of the isoforms)

-
--polyA-length <int>
+
--polyA
+
+

Add poly(A) tails to the end of all reference isoforms. The length of poly(A) tail added is specified by '--polyA-length' option. STAR aligner users may not want to use this option. (Default: do not add poly(A) tail to any of the isoforms)

+ +
+
--polyA-length <int>
+

The length of the poly(A) tails to be added. (Default: 125)

-
-
--no-polyA-subset <file>
-
-

Only meaningful if '--polyA' is specified. Do not add poly(A) tails to those transcripts listed in <file>. <file> is a file containing a list of transcript_ids. (Default: off)

-
--bowtie
+
--no-polyA-subset <file>
+
+

Only meaningful if '--polyA' is specified. Do not add poly(A) tails to those transcripts listed in <file>. <file> is a file containing a list of transcript_ids. (Default: off)

+ +
+
--bowtie
+

Build Bowtie indices. (Default: off)

-
-
--bowtie-path <path>
-
-

The path to the Bowtie executables. (Default: the path to Bowtie executables is assumed to be in the user's PATH environment variable)

-
--bowtie2
+
--bowtie-path <path>
+
+

The path to the Bowtie executables. (Default: the path to Bowtie executables is assumed to be in the user's PATH environment variable)

+ +
+
--bowtie2
+

Build Bowtie 2 indices. (Default: off)

-
-
--bowtie2-path
-
-

The path to the Bowtie 2 executables. (Default: the path to Bowtie 2 executables is assumed to be in the user's PATH environment variable)

-
--star
+
--bowtie2-path
+
+ +

The path to the Bowtie 2 executables. (Default: the path to Bowtie 2 executables is assumed to be in the user's PATH environment variable)

+
+
--star
+

Build STAR indices. (Default: off)

-
-
--star-path <path>
-
-

The path to STAR's executable. (Default: the path to the STAR executable is assumed to be in the user's PATH environment variable)

-
--star-sjdboverhang <int>
- +
--star-path <path>
-

Length of the genomic sequence around annotated junctions. It is only used by STAR to build its splice junction database and is not needed for Bowtie or Bowtie 2. It will be passed as the --sjdbOverhang option to STAR. According to STAR's manual, its ideal value is max(ReadLength)-1, e.g. for 2x101 paired-end reads, the ideal value is 101-1=100. In most cases, the default value of 100 will work as well as the ideal value. (Default: 100)

-
-
-p/--num-threads <int>
+

The path to STAR's executable. (Default: the path to the STAR executable is assumed to be in the user's PATH environment variable)

+ + +
--star-sjdboverhang <int>
-

Number of threads to use for building STAR's genome indices. (Default: 1)

+ +

Length of the genomic sequence around annotated junctions. It is only used by STAR to build its splice junction database and is not needed for Bowtie or Bowtie 2. It will be passed as the --sjdbOverhang option to STAR. According to STAR's manual, its ideal value is max(ReadLength)-1, e.g. for 2x101 paired-end reads, the ideal value is 101-1=100. In most cases, the default value of 100 will work as well as the ideal value. (Default: 100)

+
-
-q/--quiet
+
-p/--num-threads <int>
+
+ +

Number of threads to use for building STAR's genome indices. (Default: 1)

+
+
-q/--quiet
+

Suppress the output of logging information. (Default: off)

-
-
-h/--help
+ +
-h/--help
+

Show help information.
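The three-column format expected by '--allele-to-gene-map' (described above) can be sketched as follows. This is only an illustration: the gene, transcript, and allele IDs are hypothetical, and the helper function names are not part of RSEM.

```shell
# Hypothetical IDs for illustration only; each allele_id must match a
# sequence name in the Multi-FASTA reference files.
make_allele_map() {
    # Each record: gene_id <TAB> transcript_id <TAB> allele_id
    printf '%s\t%s\t%s\n' \
        geneA txA1 txA1_maternal \
        geneA txA1 txA1_paternal \
        geneB txB1 txB1_maternal
}

# Succeed only if every line has exactly three non-empty, tab-separated fields.
check_allele_map() {
    awk -F'\t' 'NF != 3 || $1 == "" || $2 == "" || $3 == "" { bad = 1 } END { exit bad }' "$1"
}

map_file=$(mktemp)
make_allele_map > "$map_file"
check_allele_map "$map_file" && echo "allele map looks well-formed"
```

The resulting file would then be passed with '--allele-to-gene-map', without '--gtf'.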

+
-

-

-
-

DESCRIPTION

-

This program extracts/preprocesses the reference sequences for RSEM. It can optionally build Bowtie indices (with '--bowtie' option) and/or Bowtie 2 indices (with '--bowtie2' option) using their default parameters. It can also optionally build STAR indices (with '--star' option) using parameters from ENCODE3's STAR-RSEM pipeline. If an alternative aligner is to be used, indices for that particular aligner can be built from either 'reference_name.idx.fa' or 'reference_name.n2g.idx.fa' (see OUTPUT for details). This program is used in conjunction with the 'rsem-calculate-expression' program.

-

-

-
-

OUTPUT

-

This program will generate 'reference_name.grp', 'reference_name.ti', 'reference_name.transcripts.fa', 'reference_name.seq', 'reference_name.chrlist' (if '--gtf' is on), 'reference_name.idx.fa', 'reference_name.n2g.idx.fa', optional Bowtie/Bowtie 2 index files, and optional STAR index files.

-

'reference_name.grp', 'reference_name.ti', 'reference_name.seq', and 'reference_name.chrlist' are used by RSEM internally.

-

'reference_name.transcripts.fa' contains the extracted reference transcripts in Multi-FASTA format. Poly(A) tails are not added and it may contain lower case bases in its sequences if the corresponding genomic regions are soft-masked.

-

'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' are used by aligners to build their own indices. In these two files, all sequence bases are converted into upper case. In addition, poly(A) tails are added if '--polyA' option is set. The only difference between 'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' is that 'reference_name.n2g.idx.fa' in addition converts all 'N' characters to 'G' characters. This conversion is in particular desired for aligners (e.g. Bowtie) that do not allow reads to overlap with 'N' characters in the reference sequences. Otherwise, 'reference_name.idx.fa' should be used to build the aligner's index files. RSEM uses 'reference_name.idx.fa' to build Bowtie 2 indices and 'reference_name.n2g.idx.fa' to build Bowtie indices. For visualizing the transcript-coordinate-based BAM files generated by RSEM in IGV, 'reference_name.idx.fa' should be imported as a "genome" (see Visualization section in README.md for details).

-

-

-
-

EXAMPLES

-

1) Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version of the mouse genome. We have downloaded the UCSC Genes transcript annotations in GTF format (as mm9.gtf) using the Table Browser and the knownIsoforms.txt file for mm9 from the UCSC Downloads. We also have all chromosome files for mm9 in the directory '/data/mm9'. We want to put the generated reference files under '/ref' with name 'mouse_0'. We do not add any poly(A) tails. Please note that GTF files generated from UCSC's Table Browser do not contain isoform-gene relationship information. For the UCSC Genes annotation, this information can be obtained from the knownIsoforms.txt file. Suppose we want to build Bowtie indices and Bowtie executables are found in '/sw/bowtie'.

+ +

DESCRIPTION

+ +

This program extracts/preprocesses the reference sequences for RSEM. It can optionally build Bowtie indices (with '--bowtie' option) and/or Bowtie 2 indices (with '--bowtie2' option) using their default parameters. It can also optionally build STAR indices (with '--star' option) using parameters from ENCODE3's STAR-RSEM pipeline. If an alternative aligner is to be used, indices for that particular aligner can be built from either 'reference_name.idx.fa' or 'reference_name.n2g.idx.fa' (see OUTPUT for details). This program is used in conjunction with the 'rsem-calculate-expression' program.

+ +

OUTPUT

+ +

This program will generate 'reference_name.grp', 'reference_name.ti', 'reference_name.transcripts.fa', 'reference_name.seq', 'reference_name.chrlist' (if '--gtf' is on), 'reference_name.idx.fa', 'reference_name.n2g.idx.fa', optional Bowtie/Bowtie 2 index files, and optional STAR index files.

+ +

'reference_name.grp', 'reference_name.ti', 'reference_name.seq', and 'reference_name.chrlist' are used by RSEM internally.

+ +

'reference_name.transcripts.fa' contains the extracted reference transcripts in Multi-FASTA format. Poly(A) tails are not added and it may contain lower case bases in its sequences if the corresponding genomic regions are soft-masked.

+ +

'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' are used by aligners to build their own indices. In these two files, all sequence bases are converted into upper case. In addition, poly(A) tails are added if '--polyA' option is set. The only difference between 'reference_name.idx.fa' and 'reference_name.n2g.idx.fa' is that 'reference_name.n2g.idx.fa' in addition converts all 'N' characters to 'G' characters. This conversion is in particular desired for aligners (e.g. Bowtie) that do not allow reads to overlap with 'N' characters in the reference sequences. Otherwise, 'reference_name.idx.fa' should be used to build the aligner's index files. RSEM uses 'reference_name.idx.fa' to build Bowtie 2 indices and 'reference_name.n2g.idx.fa' to build Bowtie indices. For visualizing the transcript-coordinate-based BAM files generated by RSEM in IGV, 'reference_name.idx.fa' should be imported as a "genome" (see Visualization section in README.md for details).
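The two conversions described above (upper-casing, and the additional 'N' to 'G' replacement) can be illustrated with a small sketch. This is not RSEM's own code: the function names are made up for the example, and the optional poly(A) tails are left out.

```shell
# Upper-case all bases; header lines (starting with '>') are left untouched.
to_idx_fa() {
    awk '/^>/ { print; next } { print toupper($0) }'
}

# Same as above, then additionally turn every 'N' base into 'G'.
to_n2g_idx_fa() {
    to_idx_fa | awk '/^>/ { print; next } { gsub(/N/, "G"); print }'
}

# A soft-masked toy transcript with two ambiguous bases.
printf '>tx1\nacgtNNacgt\n' | to_idx_fa
printf '>tx1\nacgtNNacgt\n' | to_n2g_idx_fa
```

The first call keeps the 'N' bases, as in 'reference_name.idx.fa'. The second replaces them, as in 'reference_name.n2g.idx.fa'.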

+ +

EXAMPLES

+ +

1) Suppose we have mouse RNA-Seq data and want to use the UCSC mm9 version of the mouse genome. We have downloaded the UCSC Genes transcript annotations in GTF format (as mm9.gtf) using the Table Browser and the knownIsoforms.txt file for mm9 from the UCSC Downloads. We also have all chromosome files for mm9 in the directory '/data/mm9'. We want to put the generated reference files under '/ref' with name 'mouse_0'. We do not add any poly(A) tails. Please note that GTF files generated from UCSC's Table Browser do not contain isoform-gene relationship information. For the UCSC Genes annotation, this information can be obtained from the knownIsoforms.txt file. Suppose we want to build Bowtie indices and Bowtie executables are found in '/sw/bowtie'.

+

There are two ways to write the command:

-
- rsem-prepare-reference --gtf mm9.gtf \
+
+
 rsem-prepare-reference --gtf mm9.gtf \
                         --transcript-to-gene-map knownIsoforms.txt \
                         --bowtie \
                         --bowtie-path /sw/bowtie \
                         /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
-                        /ref/mouse_0
+ /ref/mouse_0
+

OR

-
- rsem-prepare-reference --gtf mm9.gtf \
+
+
 rsem-prepare-reference --gtf mm9.gtf \
                         --transcript-to-gene-map knownIsoforms.txt \
                         --bowtie \
                         --bowtie-path /sw/bowtie \
                         /data/mm9 \
-                        /ref/mouse_0
-

2) Suppose we also want to build Bowtie 2 indices in the above example and Bowtie 2 executables are found in '/sw/bowtie2'. The command will be:

-
- rsem-prepare-reference --gtf mm9.gtf \
+                        /ref/mouse_0
+ +

2) Suppose we also want to build Bowtie 2 indices in the above example and Bowtie 2 executables are found in '/sw/bowtie2'. The command will be:

+ +
 rsem-prepare-reference --gtf mm9.gtf \
                         --transcript-to-gene-map knownIsoforms.txt \
                         --bowtie \
                         --bowtie-path /sw/bowtie \
                         --bowtie2 \
                         --bowtie2-path /sw/bowtie2 \
                         /data/mm9 \
-                        /ref/mouse_0
-

3) Suppose we want to build STAR indices in the above example and save index files under '/ref' with name 'mouse_0'. Assuming STAR executable is '/sw/STAR', the command will be:

-
- rsem-prepare-reference --gtf mm9.gtf \
+                        /ref/mouse_0
+ +

3) Suppose we want to build STAR indices in the above example and save index files under '/ref' with name 'mouse_0'. Assuming STAR executable is '/sw/STAR', the command will be:

+ +
 rsem-prepare-reference --gtf mm9.gtf \
                         --star \
                         --star-path /sw/STAR \
                         -p 8 \
                         /data/mm9/chr1.fa,/data/mm9/chr2.fa,...,/data/mm9/chrM.fa \
-                        /ref/mouse_0
+ /ref/mouse_0
+

OR

-
- rsem-prepare-reference --gtf mm9.gtf \
+
+
 rsem-prepare-reference --gtf mm9.gtf \
                         --star \
                         --star-path /sw/STAR \
                         -p 8 \
                         /data/mm9 \
-                        /ref/mouse_0
-

STAR genome index files will be saved under '/ref/'.

-

4) Suppose we only have transcripts from EST tags stored in 'mm9.fasta' and isoform-gene information stored in 'mapping.txt'. We want to add 125bp long poly(A) tails to all transcripts. The reference_name is set as 'mouse_125'. In addition, we do not want to build Bowtie/Bowtie 2 indices, and will use an alternative aligner to align reads against either 'mouse_125.idx.fa' or 'mouse_125.n2g.idx.fa':

-
- rsem-prepare-reference --transcript-to-gene-map mapping.txt \
+                        /ref/mouse_0
+ +

STAR genome index files will be saved under '/ref/'.
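If the reads are not close to 101 bases long, the value for '--star-sjdboverhang' can be derived from the rule quoted in the option description, max(ReadLength)-1. A minimal sketch, where 'sjdb_overhang' is a hypothetical helper and not an RSEM or STAR command:

```shell
# Print max(ReadLength) - 1 for one or more read lengths given as arguments.
sjdb_overhang() {
    max=0
    for len in "$@"; do
        if [ "$len" -gt "$max" ]; then
            max=$len
        fi
    done
    echo $((max - 1))
}

# 2x101 paired-end reads, as in the option description.
sjdb_overhang 101 101
```

The printed number is what would be supplied to '--star-sjdboverhang'. As noted in the option description, the default of 100 works as well in most cases.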

+ +

4) Suppose we only have transcripts from EST tags stored in 'mm9.fasta' and isoform-gene information stored in 'mapping.txt'. We want to add 125bp long poly(A) tails to all transcripts. The reference_name is set as 'mouse_125'. In addition, we do not want to build Bowtie/Bowtie 2 indices, and will use an alternative aligner to align reads against either 'mouse_125.idx.fa' or 'mouse_125.n2g.idx.fa':

+ +
 rsem-prepare-reference --transcript-to-gene-map mapping.txt \
                         --polyA \
                         mm9.fasta \
-                        mouse_125
+                        mouse_125
-
+ + diff --git a/rsem-run-ebseq.html b/rsem-run-ebseq.html index 89e9750..0f233ec 100644 --- a/rsem-run-ebseq.html +++ b/rsem-run-ebseq.html @@ -4,118 +4,122 @@ rsem-run-ebseq - + - -
-

-
- +

NAME

-

-

-

NAME

rsem-run-ebseq

-

-

-
-

SYNOPSIS

+ +

SYNOPSIS

+

rsem-run-ebseq [options] data_matrix_file conditions output_file

-

-

-
-

ARGUMENTS

+ +

ARGUMENTS

+
-
data_matrix_file
+
data_matrix_file
-

This file is an m by n matrix. m is the number of genes/transcripts and n is the number of total samples. Each element in the matrix represents the expected count for a particular gene/transcript in a particular sample. Users can use 'rsem-generate-data-matrix' to generate this file from expression result files.

-
-
conditions
+

This file is an m by n matrix. m is the number of genes/transcripts and n is the number of total samples. Each element in the matrix represents the expected count for a particular gene/transcript in a particular sample. Users can use 'rsem-generate-data-matrix' to generate this file from expression result files.

+ + +
conditions
+

Comma-separated list of values representing the number of replicates for each condition. For example, "3,3" means the data set contains 2 conditions and each condition has 3 replicates. "2,3,3" means the data set contains 3 conditions, with 2, 3, and 3 replicates for each condition respectively.

-
-
output_file
+ +
output_file
+

Output file name.

+
-

-

-
-

OPTIONS

+ +

OPTIONS

+
-
--ngvector <file>
+
--ngvector <file>
-

This option provides the grouping information required by EBSeq for isoform-level differential expression analysis. The file can be generated by 'rsem-generate-ngvector'. Turning this option on is highly recommended for isoform-level differential expression analysis. (Default: off)

-
-
-h/--help
+

This option provides the grouping information required by EBSeq for isoform-level differential expression analysis. The file can be generated by 'rsem-generate-ngvector'. Turning this option on is highly recommended for isoform-level differential expression analysis. (Default: off)

+ + +
-h/--help
+

Show help information.

+
-

-

-
-

DESCRIPTION

-

This program is a wrapper over EBSeq. It performs differential expression analysis and can work on two or more conditions. All genes/transcripts and their associated statistics are reported in one output file. This program does not control the false discovery rate or call differentially expressed genes/transcripts. Please use 'rsem-control-fdr' to control the false discovery rate after this program has finished.

-

-

-
-

OUTPUT

+ +

DESCRIPTION

+ +

This program is a wrapper over EBSeq. It performs differential expression analysis and can work on two or more conditions. All genes/transcripts and their associated statistics are reported in one output file. This program does not control the false discovery rate or call differentially expressed genes/transcripts. Please use 'rsem-control-fdr' to control the false discovery rate after this program has finished.

+ +

OUTPUT

+
-
output_file
+
output_file
-

This file reports the calculated statistics for all genes/transcripts. It is written as a matrix with row and column names. The row names are the genes'/transcripts' names. The column names are for the reported statistics.

-

If there are only 2 different conditions among the samples, four statistics (columns) will be reported for each gene/transcript. They are "PPEE", "PPDE", "PostFC" and "RealFC". "PPEE" is the posterior probability (estimated by EBSeq) that a gene/transcript is equally expressed. "PPDE" is the posterior probability that a gene/transcript is differentially expressed. "PostFC" is the posterior fold change (condition 1 over condition 2) for a gene/transcript. It is defined as the ratio between posterior mean expression estimates of the gene/transcript for each condition. "RealFC" is the real fold change (condition 1 over condition 2) for a gene/transcript. It is the ratio of the normalized within-condition-1 mean count to the normalized within-condition-2 mean count for the gene/transcript. Fold changes are calculated using EBSeq's 'PostFC' function. The genes/transcripts are reported in descending order of their "PPDE" values.

-

If there are more than 2 different conditions among the samples, the output format is different. For differential expression analysis with more than 2 conditions, EBSeq will enumerate all possible expression patterns (that is, which conditions are equally expressed and which are not). Suppose there are k different patterns. The first k columns of the output file give the posterior probability that each expression pattern is true. Patterns are defined in a separate file, 'output_file.pattern'. Column k+1 gives the maximum a posteriori (MAP) expression pattern for each gene/transcript. Column k+2 gives the posterior probability that not all conditions are equally expressed (column name "PPDE"). The genes/transcripts are reported in descending order of their "PPDE" column values. For details on how EBSeq works for more than 2 conditions, please refer to EBSeq's manual.

-
-
output_file.pattern
+

This file reports the calculated statistics for all genes/transcripts. It is written as a matrix with row and column names. The row names are the genes'/transcripts' names. The column names are for the reported statistics.

+ +

If there are only 2 different conditions among the samples, four statistics (columns) will be reported for each gene/transcript. They are "PPEE", "PPDE", "PostFC" and "RealFC". "PPEE" is the posterior probability (estimated by EBSeq) that a gene/transcript is equally expressed. "PPDE" is the posterior probability that a gene/transcript is differentially expressed. "PostFC" is the posterior fold change (condition 1 over condition 2) for a gene/transcript. It is defined as the ratio between posterior mean expression estimates of the gene/transcript for each condition. "RealFC" is the real fold change (condition 1 over condition 2) for a gene/transcript. It is the ratio of the normalized within-condition-1 mean count to the normalized within-condition-2 mean count for the gene/transcript. Fold changes are calculated using EBSeq's 'PostFC' function. The genes/transcripts are reported in descending order of their "PPDE" values.

+ +

If there are more than 2 different conditions among the samples, the output format is different. For differential expression analysis with more than 2 conditions, EBSeq will enumerate all possible expression patterns (that is, which conditions are equally expressed and which are not). Suppose there are k different patterns. The first k columns of the output file give the posterior probability that each expression pattern is true. Patterns are defined in a separate file, 'output_file.pattern'. Column k+1 gives the maximum a posteriori (MAP) expression pattern for each gene/transcript. Column k+2 gives the posterior probability that not all conditions are equally expressed (column name "PPDE"). The genes/transcripts are reported in descending order of their "PPDE" column values. For details on how EBSeq works for more than 2 conditions, please refer to EBSeq's manual.

+ + +
output_file.pattern
+

This file is only generated when there are more than 2 conditions. It defines all possible expression patterns over the conditions using a matrix with names. Each row of the matrix refers to a different expression pattern and each column gives the expression status of a different condition. Two conditions are equally expressed if and only if their statuses are the same.

-
-
output_file.condmeans
+ +
output_file.condmeans
-

This file is only generated when there are more than 2 conditions. It gives the normalized mean count value for each gene/transcript at each condition. It is formatted as a matrix with names. Each row represents a gene/transcript and each column represents a condition. The order of genes/transcripts is the same as in 'output_file'. This file can be used to calculate fold changes between the conditions that users are interested in.

+ +

This file is only generated when there are more than 2 conditions. It gives the normalized mean count value for each gene/transcript at each condition. It is formatted as a matrix with names. Each row represents a gene/transcript and each column represents a condition. The order of genes/transcripts is the same as in 'output_file'. This file can be used to calculate fold changes between the conditions that users are interested in.
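As a rough sketch of the fold-change calculation mentioned above, the ratio of two condition means can be taken per row with awk. The layout assumed here (tab-separated, one header line, row name in the first field of each data row) is an assumption about 'output_file.condmeans', and 'fold_change' is a hypothetical helper, so check the actual file before relying on it.

```shell
# fold_change FILE A B: print "name<TAB>mean_A/mean_B" for each data row,
# where A and B are 1-based condition numbers.
fold_change() {
    awk -F'\t' -v a="$2" -v b="$3" '
        NR == 1 { next }                      # skip the header line
        {
            num = $(a + 1); den = $(b + 1)    # field 1 holds the row name
            if (den == 0) print $1 "\tNA"     # avoid dividing by zero
            else          printf "%s\t%.4f\n", $1, num / den
        }' "$1"
}

# Toy file with made-up values for three conditions.
demo=$(mktemp)
printf 'C1\tC2\tC3\n'        >  "$demo"
printf 'gene1\t10\t20\t40\n' >> "$demo"
printf 'gene2\t8\t0\t2\n'    >> "$demo"

# Fold change of condition 3 over condition 1.
fold_change "$demo" 3 1
```

Rows whose denominator mean is zero are reported as NA.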

+
-

-

-
-

EXAMPLES

-

1) We're interested in isoform-level differential expression analysis and there are two conditions. Each condition has 5 replicates. We have already collected the data matrix as 'IsoMat' and generated ngvector as 'ngvector.ngvec':

-
- rsem-run-ebseq --ngvector ngvector.ngvec IsoMat 5,5 IsoMat.results
-

The results will be in 'IsoMat.results'.

-

2) We're interested in gene-level analysis and there are 3 conditions. The first condition has 3 replicates and the other two have 4 replicates each. The data matrix is named 'GeneMat':

-
- rsem-run-ebseq GeneMat 3,4,4 GeneMat.results
-

Three files, 'GeneMat.results', 'GeneMat.results.pattern', and 'GeneMat.results.condmeans', will be generated.

+ +

EXAMPLES

+ +

1) We're interested in isoform-level differential expression analysis and there are two conditions. Each condition has 5 replicates. We have already collected the data matrix as 'IsoMat' and generated ngvector as 'ngvector.ngvec':

+ +
 rsem-run-ebseq --ngvector ngvector.ngvec IsoMat 5,5 IsoMat.results
+ +

The results will be in 'IsoMat.results'.

+ +

2) We're interested in gene-level analysis and there are 3 conditions. The first condition has 3 replicates and the other two have 4 replicates each. The data matrix is named 'GeneMat':

+ +
 rsem-run-ebseq GeneMat 3,4,4 GeneMat.results
+ +

Three files, 'GeneMat.results', 'GeneMat.results.pattern', and 'GeneMat.results.condmeans', will be generated.
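Before running either example, it can help to confirm that the 'conditions' argument adds up to the number of sample columns (n) in the data matrix. A small sketch, with 'summarize_conditions' as a hypothetical helper name:

```shell
# Print "<number of conditions> <total number of samples>" for a
# comma-separated conditions string such as "3,4,4".
summarize_conditions() {
    echo "$1" | awk -F',' '{ total = 0; for (i = 1; i <= NF; i++) total += $i; print NF, total }'
}

summarize_conditions 3,4,4
```

For "3,4,4" this reports 3 conditions and 11 samples in total, so 'GeneMat' in example 2 is expected to have 11 sample columns.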

+ + +