diff --git a/README.html b/README.html index efbd8f2..4a53c60 100644 --- a/README.html +++ b/README.html @@ -1,14 +1,3 @@ - - - - -
- - - - -Bo Li (bli at cs dot wisc dot edu)
@@ -55,14 +44,16 @@To compile RSEM, simply run
-make
+make
+
For Cygwin users, please uncomment the 3rd and 7th lines in ‘sam/Makefile’ before you run ‘make’.
To compile EBSeq, which is included in the RSEM package, run
-make ebseq
+make ebseq
+
To install, simply put the rsem directory in your environment’s PATH variable.
@@ -86,18 +77,19 @@RSEM can extract reference transcripts from a genome if you provide it -with gene annotations in a GTF file. Alternatively, you can provide +with gene annotations in a GTF file. Alternatively, you can provide RSEM with transcript sequences directly.
Please note that GTF files generated from the UCSC Table Browser do not -contain isoform-gene relationship information. However, if you use the +contain isoform-gene relationship information. However, if you use the UCSC Genes annotation track, this information can be recovered by downloading the knownIsoforms.txt file for the appropriate genome.
To prepare the reference sequences, you should run the -‘rsem-prepare-reference’ program. Run
+‘rsem-prepare-reference’ program. Run -rsem-prepare-reference --help
+rsem-prepare-reference --help
+
to get usage information or visit the rsem-prepare-reference documentation page.
@@ -105,9 +97,10 @@To calculate expression values, you should run the -‘rsem-calculate-expression’ program. Run
+‘rsem-calculate-expression’ program. Run -rsem-calculate-expression --help
+rsem-calculate-expression --help
+
to get usage information or visit the rsem-calculate-expression documentation page.
@@ -116,49 +109,49 @@For single-end models, users have the option of providing a fragment length distribution via the ‘–fragment-length-mean’ and -‘–fragment-length-sd’ options. The specification of an accurate fragment +‘–fragment-length-sd’ options. The specification of an accurate fragment length distribution is important for the accuracy of expression level -estimates from single-end data. If the fragment length mean and sd are +estimates from single-end data. If the fragment length mean and sd are not provided, RSEM will not take a fragment length distribution into consideration.
By default, RSEM automates the alignment of reads to reference -transcripts using the Bowtie alignment program. Turn on ‘–bowtie2’ -for ‘rsem-prepare-reference’ and ‘rsem-calculate-expression’ will -allow RSEM to use the Bowtie 2 alignment program instead. Please note -that indel alignments, local alignments and discordant alignments are +transcripts using the Bowtie aligner. Turning on ‘–bowtie2’ for +‘rsem-prepare-reference’ and ‘rsem-calculate-expression’ allows +RSEM to use the Bowtie 2 alignment program instead. Please note that +indel alignments, local alignments and discordant alignments are disallowed when RSEM uses Bowtie 2 since RSEM currently cannot handle them. See the description of the ‘–bowtie2’ option in -‘rsem-calculate-expression’ for more details. To use an alternative -alignment program, align the input reads against the file +‘rsem-calculate-expression’ for more details. Similarly, turning on +‘–star’ allows RSEM to use the STAR aligner. To use an +alternative alignment program, align the input reads against the file ‘reference_name.idx.fa’ generated by ‘rsem-prepare-reference’, and -format the alignment output in SAM or BAM format. Then, instead of +format the alignment output in SAM or BAM format. Then, instead of providing reads to ‘rsem-calculate-expression’, specify the ‘–sam’ or -‘–bam’ option and provide the SAM or BAM file as an argument. When -using an alternative aligner, you may also want to provide the -‘–no-bowtie’ option to ‘rsem-prepare-reference’ so that the Bowtie -indices are not built.
+‘–bam’ option and provide the SAM or BAM file as an argument. RSEM requires the alignments of a read to be adjacent. For paired-end reads, RSEM also requires the two mates of any alignment to be adjacent. To check whether your SAM/BAM file satisfies these requirements, please run
-rsem-sam-validator <input.sam/input.bam>
+rsem-sam-validator <input.sam/input.bam>
+
If your file does not satisfy the requirements, you can use ‘convert-sam-for-rsem’ to convert it into a BAM file which RSEM can process. Please run
-convert-sam-for-rsem --help
+convert-sam-for-rsem --help
+
to get usage information or visit the convert-sam-for-rsem documentation page.
-However, please note that RSEM does * not * support gapped +
However, please note that RSEM does ** not ** support gapped alignments. So make sure that your aligner does not produce alignments with insertions/deletions. Also, please make sure that you use ‘reference_name.idx.fa’, which is generated by RSEM, to build your @@ -190,28 +183,30 @@
Usage:
-rsem-tbam2gbam reference_name unsorted_transcript_bam_input genome_bam_output
+rsem-tbam2gbam reference_name unsorted_transcript_bam_input genome_bam_output
+
-reference_name : The name of reference built by ‘rsem-prepare-reference’
-unsorted_transcript_bam_input : This file should satisfy: 1) the alignments of a same read are grouped together, 2) for any paired-end alignment, the two mates should be adjacent to each other, 3) this file should not be sorted by samtools
-genome_bam_output : The output genomic coordinate BAM file’s name
reference_name : The name of reference built by ‘rsem-prepare-reference’
+unsorted_transcript_bam_input : This file should satisfy: 1) the alignments of the same read are grouped together, 2) for any paired-end alignment, the two mates should be adjacent to each other, 3) this file should not be sorted by samtools
+genome_bam_output : The output genomic coordinate BAM file’s name
A wiggle plot representing the expected number of reads overlapping each position in the genome/transcript set can be generated from the -sorted genome/transcript BAM file output. To generate the wiggle +sorted genome/transcript BAM file output. To generate the wiggle plot, run the ‘rsem-bam2wig’ program on the -‘sample_name.genome.sorted.bam’/’sample_name.transcript.sorted.bam’ file.
+‘sample_name.genome.sorted.bam’/‘sample_name.transcript.sorted.bam’ file. -Usage:
+Usage:
-rsem-bam2wig sorted_bam_input wig_output wiggle_name [--no-fractional-weight]
+rsem-bam2wig sorted_bam_input wig_output wiggle_name [--no-fractional-weight]
+
-sorted_bam_input : Input BAM format file, must be sorted
-wig_output : Output wiggle file’s name, e.g. output.wig
-wiggle_name : The name of this wiggle plot
-–no-fractional-weight : If this is set, RSEM will not look for “ZW” tag and each alignment appeared in the BAM file has weight 1. Set this if your BAM file is not generated by RSEM. Please note that this option must be at the end of the command line
sorted_bam_input : Input BAM format file, must be sorted
+wig_output : Output wiggle file’s name, e.g. output.wig
+wiggle_name : The name of this wiggle plot
+–no-fractional-weight : If this is set, RSEM will not look for the “ZW” tag, and each alignment that appears in the BAM file has weight 1. Set this if your BAM file was not generated by RSEM. Please note that this option must be at the end of the command line
To generate transcript wiggle plots, you should run the -‘rsem-plot-transcript-wiggles’ program. Run
+‘rsem-plot-transcript-wiggles’ program. Run -rsem-plot-transcript-wiggles --help
+rsem-plot-transcript-wiggles --help
+
to get usage information or visit the rsem-plot-transcript-wiggles documentation page.
@@ -247,10 +244,11 @@Usage:
-rsem-plot-model sample_name output_plot_file
+rsem-plot-model sample_name output_plot_file
+
-sample_name: the name of the sample analyzed
-output_plot_file: the file name for plots generated from the model. It is a pdf file
sample_name: the name of the sample analyzed
+output_plot_file: the file name for plots generated from the model. It is a pdf file
The plots generated depend on read type and user configuration. They may include fragment length distribution, mate length distribution, @@ -263,7 +261,7 @@
RSPD: Read Start Position Distribution. x-axis is bin number, y-axis is the probability of each bin. RSPD can be used as an indicator of 3’ bias
-Quality score vs. observed quality given a reference base: x-axis is Phred quality scores associated with data, y-axis is the “observed quality”, Phred quality scores learned by RSEM from the data. Q = -10log_10(P), where Q is Phred quality score and P is the probability of sequencing error for a particular base
+Quality score vs. observed quality given a reference base: x-axis is Phred quality scores associated with data, y-axis is the “observed quality”, Phred quality scores learned by RSEM from the data. Q = –10log_10(P), where Q is Phred quality score and P is the probability of sequencing error for a particular base
Position vs. percentage sequencing error given a reference base: x-axis is position and y-axis is percentage sequencing error
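The relation Q = -10 log_10(P) given for the quality score plot can be inverted to read an error probability off a Phred score. The following is a minimal shell sketch using awk; the function names `phred_to_prob` and `prob_to_phred` are illustrative and are not part of RSEM.

```shell
#!/bin/sh
# Illustrative helpers (not part of RSEM): convert between a Phred quality
# score Q and a per-base error probability P, using Q = -10 * log10(P).

# phred_to_prob Q  ->  P = 10^(-Q/10), printed with six decimals
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.6f\n", exp(-q / 10 * log(10)) }'
}

# prob_to_phred P  ->  Q = -10 * log10(P), rounded to the nearest integer
prob_to_phred() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

phred_to_prob 20      # error probability for Q = 20
prob_to_phred 0.001   # Phred score for P = 0.001
```

A point on the plot's diagonal means the observed quality learned from the data agrees with the quality score reported with the data.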
@@ -271,17 +269,17 @@Suppose we download the mouse genome from UCSC Genome Browser. We do +
Suppose we download the mouse genome from UCSC Genome Browser. We do not add poly(A) tails and use ‘/ref/mouse_0’ as the reference name. We have a FASTQ-formatted file, ‘mmliver.fq’, containing single-end -reads from one sample, which we call ‘mmliver_single_quals’. We want +reads from one sample, which we call ‘mmliver_single_quals’. We want to estimate expression values by using the single-end model with a fragment length distribution. We know that the fragment length distribution is approximated by a normal distribution with a mean of 150 and a standard deviation of 35. We wish to generate 95% credibility intervals in addition to maximum likelihood estimates. RSEM will be allowed 1G of memory for the credibility interval -calculation. We will visualize the probabilistic read mappings +calculation. We will visualize the probabilistic read mappings generated by RSEM on UCSC genome browser. We will generate a list of genes’ transcript wiggle plots in ‘output.pdf’. The list is ‘gene_ids.txt’. We will visualize the models learned in @@ -293,68 +291,73 @@
RSEM provides users the ‘rsem-simulate-reads’ program to simulate RNA-Seq data based on parameters learned from real data sets. Run
-rsem-simulate-reads
+rsem-simulate-reads
+
to get usage information or read the following subsections.
rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results theta0 N output_name [-q]
+rsem-simulate-reads reference_name estimated_model_file estimated_isoform_results theta0 N output_name [-q]
+
-reference_name: The name of RSEM references, which should be already generated by ‘rsem-prepare-reference’
+reference_name: The name of RSEM references, which should be already generated by ‘rsem-prepare-reference’
-estimated_model_file: This file describes how the RNA-Seq reads will be sequenced given the expression levels. It determines what kind of reads will be simulated (single-end/paired-end, w/o quality score) and includes parameters for fragment length distribution, read start position distribution, sequencing error models, etc. Normally, this file should be learned from real data using ‘rsem-calculate-expression’. The file can be found under the ‘sample_name.stat’ folder with the name of ‘sample_name.model’. ‘model_file_description.txt’ provides the format and meanings of this file.
+estimated_model_file: This file describes how the RNA-Seq reads will be sequenced given the expression levels. It determines what kind of reads will be simulated (single-end/paired-end, w/o quality score) and includes parameters for fragment length distribution, read start position distribution, sequencing error models, etc. Normally, this file should be learned from real data using ‘rsem-calculate-expression’. The file can be found under the ‘sample_name.stat’ folder with the name of ‘sample_name.model’. ‘model_file_description.txt’ provides the format and meanings of this file.
-estimated_isoform_results: This file contains expression levels for all isoforms recorded in the reference. It can be learned using ‘rsem-calculate-expression’ from real data. The corresponding file users want to use is ‘sample_name.isoforms.results’. If simulating from user-designed expression profile is desired, start from a learned ‘sample_name.isoforms.results’ file and only modify the ‘TPM’ column. The simulator only reads the TPM column. But keeping the file format the same is required. If the RSEM references built are aware of allele-specific transcripts, ‘sample_name.alleles.results’ should be used instead.
+estimated_isoform_results: This file contains expression levels for all isoforms recorded in the reference. It can be learned using ‘rsem-calculate-expression’ from real data. The corresponding file users want to use is ‘sample_name.isoforms.results’. If simulating from user-designed expression profile is desired, start from a learned ‘sample_name.isoforms.results’ file and only modify the ‘TPM’ column. The simulator only reads the TPM column. But keeping the file format the same is required. If the RSEM references built are aware of allele-specific transcripts, ‘sample_name.alleles.results’ should be used instead.
-theta0: This parameter determines the fraction of reads that are coming from background “noise” (instead of from a transcript). It can also be estimated using ‘rsem-calculate-expression’ from real data. Users can find it as the first value of the third line of the file ‘sample_name.stat/sample_name.theta’.
+theta0: This parameter determines the fraction of reads that are coming from background “noise” (instead of from a transcript). It can also be estimated using ‘rsem-calculate-expression’ from real data. Users can find it as the first value of the third line of the file ‘sample_name.stat/sample_name.theta’.
-N: The total number of reads to be simulated. If ‘rsem-calculate-expression’ is executed on a real data set, the total number of reads can be found as the 4th number of the first line of the file ‘sample_name.stat/sample_name.cnt’.
+N: The total number of reads to be simulated. If ‘rsem-calculate-expression’ is executed on a real data set, the total number of reads can be found as the 4th number of the first line of the file ‘sample_name.stat/sample_name.cnt’.
-output_name: Prefix for all output files.
+output_name: Prefix for all output files.
–seed seed: Set the seed for the random number generator used in simulation. The seed should be a 32-bit unsigned integer.
--q: Set it will stop outputting intermediate information.
+-q: Setting this option stops the output of intermediate information.
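The descriptions of theta0 and N above say where each value sits in the statistics files. As a rough sketch, assuming those files are whitespace-separated text, the two values could be pulled out with awk. The helper names `rsem_theta0` and `rsem_total_reads` are hypothetical and are not shipped with RSEM.

```shell
#!/bin/sh
# Illustrative helpers (not part of RSEM) for pulling simulation inputs out of
# the statistics files written by 'rsem-calculate-expression'.
# Assumption: both files are whitespace-separated text.

# rsem_theta0 THETA_FILE -> first value on the third line
rsem_theta0() {
    awk 'NR == 3 { print $1; exit }' "$1"
}

# rsem_total_reads CNT_FILE -> fourth value on the first line
rsem_total_reads() {
    awk 'NR == 1 { print $4; exit }' "$1"
}

# Example usage, guarded so the script does nothing when the files are absent.
sample=mmliver_single_quals
if [ -f "$sample.stat/$sample.theta" ] && [ -f "$sample.stat/$sample.cnt" ]; then
    echo "theta0 = $(rsem_theta0 "$sample.stat/$sample.theta")"
    echo "N      = $(rsem_total_reads "$sample.stat/$sample.cnt")"
fi
```

The printed values can then be passed as the theta0 and N arguments of ‘rsem-simulate-reads’.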
output_name.sim.isoforms.results, output_name.sim.genes.results: Expression levels estimated by counting where each simulated read comes from. output_name.sim.alleles.results: Allele-specific expression levels estimated by counting where each simulated read comes from.
-output_name.fa if single-end without quality score;
-output_name.fq if single-end with quality score;
+
output_name.fa if single-end without quality score;
+output_name.fq if single-end with quality score;
output_name_1.fa & output_name_2.fa if paired-end without quality
-score;
-output_name_1.fq & output_name_2.fq if paired-end with quality score.
Format of the header line: Each simulated read’s header line encodes where it comes from. The header line has the format:
-{>/@}_rid_dir_sid_pos[_insertL]
+{>/@}_rid_dir_sid_pos[_insertL]
+
-{>/@}: Either ‘>’ or ‘@’ must appear. ‘>’ appears if FASTA files are generated and ‘@’ appears if FASTQ files are generated
+{>/@}: Either ‘>’ or ‘@’ must appear. ‘>’ appears if FASTA files are generated and ‘@’ appears if FASTQ files are generated
-rid: Simulated read’s index, numbered from 0
+rid: Simulated read’s index, numbered from 0
-dir: The direction of the simulated read. 0 refers to forward strand (‘+’) and 1 refers to reverse strand (‘-‘)
+dir: The direction of the simulated read. 0 refers to forward strand (‘+’) and 1 refers to reverse strand (‘-’)
-sid: Represent which transcript this read is simulated from. It ranges between 0 and M, where M is the total number of transcripts. If sid=0, the read is simulated from the background noise. Otherwise, the read is simulated from a transcript with index sid. Transcript sid’s transcript name can be found in the ‘transcript_id’ column of the ‘sample_name.isoforms.results’ file (at line sid + 1, line 1 is for column names)
+sid: Represents which transcript this read is simulated from. It ranges between 0 and M, where M is the total number of transcripts. If sid=0, the read is simulated from the background noise. Otherwise, the read is simulated from the transcript with index sid. The name of transcript sid can be found in the ‘transcript_id’ column of the ‘sample_name.isoforms.results’ file (at line sid + 1; line 1 holds the column names)
-pos: The start position of the simulated read in strand dir of transcript sid. It is numbered from 0
+pos: The start position of the simulated read in strand dir of transcript sid. It is numbered from 0
-insertL: Only appear for paired-end reads. It gives the insert length of the simulated read.
+insertL: Only appears for paired-end reads. It gives the insert length of the simulated read.
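The header fields described above can be split apart with plain shell parameter expansion. The sketch below is only an illustration: `parse_sim_header` is a hypothetical helper, and it assumes the header consists of exactly the underscore-separated fields listed, with an optional underscore directly after the ‘>’ or ‘@’ marker.

```shell
#!/bin/sh
# Illustrative helper (not part of RSEM): split a simulated read's header line
# of the form {>/@}_rid_dir_sid_pos[_insertL] into named fields.
parse_sim_header() {
    header=$1
    # Drop the leading '>' (FASTA) or '@' (FASTQ) marker.
    case $header in
        '>'*|'@'*) header=${header#?} ;;
        *) echo "not a FASTA/FASTQ header: $1" >&2; return 1 ;;
    esac
    # Assumption: tolerate an optional underscore right after the marker.
    header=${header#_}
    # Split the remaining text on underscores into positional parameters.
    old_ifs=$IFS
    IFS=_
    set -- $header
    IFS=$old_ifs
    [ $# -ge 4 ] || { echo "too few fields in header" >&2; return 1; }
    # dir: 0 is the forward strand, 1 is the reverse strand.
    if [ "$2" = 0 ]; then strand='+'; else strand='-'; fi
    printf 'rid=%s strand=%s sid=%s pos=%s' "$1" "$strand" "$3" "$4"
    # insertL is present only for paired-end reads.
    [ $# -ge 5 ] && printf ' insertL=%s' "$5"
    printf '\n'
}

parse_sim_header '@_12_1_345_67_200'
```

A reported sid of 0 marks a read drawn from the background noise rather than from a transcript.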
Suppose we want to simulate 50 million single-end reads with quality scores and use the parameters learned from Example. In addition, we set theta0 to 0.2 and output_name to ‘simulated_reads’. The command is:
-rsem-simulate-reads /ref/mouse_0 mmliver_single_quals.stat/mmliver_single_quals.model mmliver_single_quals.isoforms.results 0.2 50000000 simulated_reads
+rsem-simulate-reads /ref/mouse_0 mmliver_single_quals.stat/mmliver_single_quals.model mmliver_single_quals.isoforms.results 0.2 50000000 simulated_reads
+
extract-transcript-to-gene-map-from-trinity trinity_fasta_file map_file
+extract-transcript-to-gene-map-from-trinity trinity_fasta_file map_file
+
-trinity_fasta_file: the fasta file produced by trinity, which contains all transcripts assembled.
-map_file: transcript-to-gene-map file’s name.
trinity_fasta_file: the fasta file produced by trinity, which contains all transcripts assembled.
+map_file: transcript-to-gene-map file’s name.
RSEM includes EBSeq in its folder named ‘EBSeq’. To use it, first type
-make ebseq
+make ebseq
+
to compile the EBSeq-related code.
@@ -403,7 +408,8 @@rsem-generate-ngvector --help
+rsem-generate-ngvector --help
+
to get usage information or visit the rsem-generate-ngvector
documentation
@@ -413,7 +419,8 @@ Differential E
run ‘rsem-generate-ngvector’ first. Then load the resulting
‘output_name.ngvec’ into R. For example, you can use
NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")
+NgVec <- scan(file="output_name.ngvec", what=0, sep="\n")
+
. After that, set “NgVector = NgVec” for your differential expression test (either ‘EBTest’ or ‘EBMultiTest’).
@@ -422,12 +429,14 @@rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.counts.matrix
+rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.counts.matrix
+
The results files are required to be either all gene-level results or all isoform-level results. You can load the matrix into R by
-IsoMat <- data.matrix(read.table(file="output_name.counts.matrix"))
+IsoMat <- data.matrix(read.table(file="output_name.counts.matrix"))
+
before running either ‘EBTest’ or ‘EBMultiTest’.
@@ -436,15 +445,17 @@rsem-run-ebseq --help
+rsem-run-ebseq --help
+
to get usage information or visit the rsem-run-ebseq documentation page. Second, -‘rsem-control-fdr’ takes ‘rsem-run-ebseq’ ‘s result and reports called +‘rsem-control-fdr’ takes ‘rsem-run-ebseq’ ’s result and reports called differentially expressed genes/transcripts by controlling the false discovery rate. Run
-rsem-control-fdr --help
+rsem-control-fdr --help
+
to get usage information or visit the rsem-control-fdr documentation page. These @@ -478,5 +489,3 @@
RSEM is licensed under the GNU General Public License v3.
- - \ No newline at end of file diff --git a/convert-sam-for-rsem.html b/convert-sam-for-rsem.html index bea955d..d69263f 100644 --- a/convert-sam-for-rsem.html +++ b/convert-sam-for-rsem.html @@ -4,87 +4,85 @@-
-convert-sam-for-rsem
--
-convert-sam-for-rsem [options] <input.sam/input.bam> output_file_name
--
-The SAM or BAM file generated by user's aligner. We require this file contains the header section. If input is a SAM file, it must end with suffix 'sam' (case insensitive). If input is a BAM file, it must end with suffix 'bam' (case insensitive).
-The SAM or BAM file generated by user's aligner. We require this file contains the header section. If input is a SAM file, it must end with suffix 'sam' (case insensitive). If input is a BAM file, it must end with suffix 'bam' (case insensitive).
+ + +The output name for the converted file. 'convert-sam-for-rsem' will output a BAM with the name 'output_file_name.bam'.
+ +The output name for the converted file. 'convert-sam-for-rsem' will output a BAM with the name 'output_file_name.bam'.
+-
-'convert-sam-for-rsem' will call 'sort' command and this is the '-T/--temporary-directory' option of 'sort' command. The following is the description from 'sort' : "use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories".
-'convert-sam-for-rsem' will call 'sort' command and this is the '-T/--temporary-directory' option of 'sort' command. The following is the description from 'sort' : "use DIR for temporaries, not $TMPDIR or /tmp; multiple options specify multiple directories".
+ + +Show help information.
+-
-This program converts the SAM/BAM file generated by user's aligner into a BAM file which RSEM can process. However, users should make sure their aligners use 'reference_name.idx.fa' generated by 'rsem-prepare-reference' as their references and output header sections. This program will create a temporary directory called 'output_file_name.bam.temp' to store the intermediate files. The directory will be deleted automatically after the conversion. After the conversion, this program will call 'rsem-sam-validator' to validate the resulting BAM file.
-Note: You do not need to run this script if `rsem-sam-validator' reports that your SAM/BAM file is valid.
+ +This program converts the SAM/BAM file generated by user's aligner into a BAM file which RSEM can process. However, users should make sure their aligners use 'reference_name.idx.fa' generated by 'rsem-prepare-reference' as their references and output header sections. This program will create a temporary directory called 'output_file_name.bam.temp' to store the intermediate files. The directory will be deleted automatically after the conversion. After the conversion, this program will call 'rsem-sam-validator' to validate the resulting BAM file.
+ +Note: You do not need to run this script if `rsem-sam-validator' reports that your SAM/BAM file is valid.
+Note: This program does not check the correctness of input file. You should make sure the input is a valid SAM/BAM format file.
--
-Suppose input is set to 'input.sam' and output file name is "output"
-- convert-sam-for-rsem input.sam output-
We will get a file called 'output.bam' as output.
+ +Suppose input is set to 'input.sam' and output file name is "output"
+ + convert-sam-for-rsem input.sam output
+
+We will get a file called 'output.bam' as output.
+