RNAFeatures

Module Bio::EnsEMBL::EGPipeline::PipeConfig::RNAFeatures_conf

The RNA features pipeline aligns RNA covariance models against a genome. By default it runs cmscan with the latest Rfam covariance models, tRNAscan for tRNA, and loads miRNA genes from miRBase. If you have other sets of covariance models, they can also be passed to cmscan. The results are stored as alignments

Prerequisites

A registry file with the locations of the core database server(s) and the production database (or -production_db $PROD_DB_URL specified).

Prepared Rfam data (see below).

How to run

init_pipeline.pl Bio::EnsEMBL::EGPipeline::PipeConfig::RNAFeatures_conf \
    $($CMD details script) \
    -hive_force_init 1\
    -registry $REG_FILE \
    -production_db "$($PROD_SERVER details url)""$PROD_DBNAME" \
    -pipeline_dir $OUT_DIR/rna_features \
    -species $SPECIES \
    -eg_pipelines_dir $ENS_DIR/ensembl-production-imported \
    ${OTHER_OPTIONS} \
    2> $OUT_DIR/init.stderr \
    1> $OUT_DIR/init.stdout

SYNC_CMD=$(cat $OUT_DIR/init.stdout | grep -- -sync'$' | perl -pe 's/^\s*//; s/"//g')
# should get something like
#   beekeeper.pl -url $url -sync

LOOP_CMD=$(cat $OUT_DIR/init.stdout | grep -- -loop | perl -pe 's/^\s*//; s/\s*#.*$//; s/"//g')
# should get something like
#   beekeeper.pl -url $url -reg_file $REG_FILE -loop

$SYNC_CMD 2> $OUT_DIR/sync.stderr 1> $OUT_DIR/sync.stdout
$LOOP_CMD 2> $OUT_DIR/loop.stderr 1> $OUT_DIR/loop.stdout

Parameters / Options

option	default value	meaning
`-species`		species to process, several `-species` options are possible
`-pipeline_dir`		directory to store results to
`-run_trnascan`	1	run `tRNAscan`; 0 -- to disable
`-run_cmscan`	1	run `cmscan`
`-rfam_rrna`	1	include rRNA set of Rfam covariance models; 0 -- to exclude
`-rfam_trna`	0	include tRNA set of Rfam covariance models; 1 -- to include
`-cmscan_cm_file ${SPECIES}\|_all_=/path/to/lib/file`		Custom set of models to use instead of default Rfam covariance models (`-rfam_cm_file`); can be used several time; all to use the same for all species
`-cmscan_logic_name ${SPECIES}\|_all_=`	all=cmscan_custom	Set custom logic name; default can be overidden on a per-species basis (several options), or by using `all=` -- for all species
`-cmscan_db_name ${SPECIES}\|_all_=${ALT_DB_NAME}`	`all=Rfam`	use `external_db` (pre-existinf in prod db) to associate alignments with
`-cmscan_cpu`	3	Number of cores to be use by `cmscan` (sets LSF's `-n` option)
`-cmscan_heuristics`	default	sets trade-off between sensitivity and run-time; can be one of `slowest`, `slower`, `slow`, `default`, `faster`, `fastest`, `rfam_official` (recommended for bacteria runs) (see Bio::EnsEMBL::Analysis::Runnable::CMScan and cmscan man for mappings
`-cmscan_threshold`	0.001	E-value threshold for reporting alignments
`-taxonomic_filtering`	1	Enable taxonomic filtering of Rfam models (see below for details); 0 -- to disable
`-taxonomic_threshold`	0.02	(fraction 0 to 1; effective when `-taxonomic_filtering 1`) Only allow Rfam model if ratio of number of species within the devision taxonomic level (or `-taxonomic_levels` parameter) to the total number of species associated with this model is higher than threshold
`-rfam_taxonomy_file`	`${rfam_dir}/taxonomic_levels.txt`	Path to the file with counts of Rfam sequences at division-specific taxonomic levels
`-taxonomic_levels`		List of taxomomic levels to be used for taxonomic filtering
`-taxonomic_lca`	0	Enable LCA in addition to the division level filtering; 1 -- to enable
`-load_mirbase`	1	look for miRBase files, if there are none then nothing will happen; 0 -- to disable
`-mirbase_file ${SPECIES}=/path/to/file.gff3`		specify which miRBase file to use for which species; should be loaded in advance (see mirBase species range); can be outdated, check assembly versions
`-delete_existing`	1	Delete pre-existing analysis data; 0 to disable
`-max_hive_capacity`	50	The default hive capacity (i.e. the maximum number of concurrent workers) (no enforced upper limit, but above 150 you might run into problems with locking or database connections).
`-cmscan_exe`	`cmscan`	Path to the cmscan executable file.
`-cmscan_parameters`		Parameters that are passed directly to the cmscan executable
`-rfam_version`	RFAM_VERSION	set `rfam_version` to be included into default `-rfam_dir` value (see RNAFeatures_conf.pm)
`-rfam_dir`	RFAM_VERSIONS_DIR/`${rfam_version}`	path to directory containing Rfam files
`-rfam_cm_file`	`${rfam_dir}/Rfam.cm`	Path to the Rfam CM file
`-rfam_logic_name`	`cmscan_rfam_${rfam_version}`	Logic name for the Rfam analysis
`-rfam_db_name`	`RFAM`	external_db name of the Rfam record
`-trnascan_exe`	`tRNAscan-SE`	Path to the trnascan executable file
`-trnascan_logic_name`	`trnascan_align`	logic name for the the tRNAscan analysis
`-trnascan_db_name`	`TRNASCAN_SE`	external_db name of the tRNAscan
`-trnascan_pseudo`	0	include prediction of tRNA pseudogenes; 1 -- to include
`-trnascan_parameters`		parameters passed to the trnascan executable
`-mirbase_logic_name`	`mirbase`	logic name for the the miRBase analysis
`-mirbase_db_name`	`miRBase`	external_db name of the miRBase
`-mirbase_version`	see RNAFeatures_conf.pm	version of miRBase files
`-pipeline_name`	`rna_features_${ENS_VERSION}`	The hive database name will be `${USER}_${pipeline_name}`
`-production_lookup`	1	Fetch analysis display name, description and web data from the production database; 0 -- to disable

Notes

Taxonomic filtering

Most of the covariance models (CMs) in Rfam have been constructed with reference to a particular set of species. Thus allowing primates only CMs not to be annotated in protists. But in practice some RNA genes (being derived from transposons) get many hundreds of hits in inappropriate species.

Rfam doesn't provide information on the taxonomic restrictedness of CMs, but the taxonomic spread of the sequences associated with each model can be calculated. This data has been precalculated at the division-level (the default) level of filtering.

To decide if an Rfam model can be applied to a particular species, the pipeline checks whether >2% (see -taxonomic_threshold parameter) of the species associated with the model are within the taxonomic level equivalent to the species' division.

Taxonomic filtering is switched on by default, but can be switched off with the -taxonomic_filtering parameter. The 2% threshold can be changed via the -taxonomic_threshold parameter, which takes values from 0 to 1.

Strict taxonomic filtering

The pipeline allows you to filter on LCA in addition to the division level filtering, using the -taxonomic_lca parameter (default 0). This level of filtering is in addition to the threshold mentioned above, and the blacklists and whitelists mentioned below.

Using the LCA option appends _lca to the analysis logic_name. Thus the pipeline can be run twice, with and without LCA filtering, to get two separate tracks for display.

Customised taxonomic filtering

More control over the taxonomic filtering can be achived using the scripts/rna_features/taxonomic_levels.pl script (see below).

Blacklisting and whitelisting Rfam models

Some inappropriate models may slip past the taxonomic filtering. For the current blacklist see rfam_blacklist definition in RNAFeatures_conf.pm.

Sometimes we might want to always use a model that is excluded by the default taxonomic filtering. For the current whitelist see rfam_whitelist definition in RNAFeatures_conf.pm.

Running the pipeline on collections (Bacteria,Fungi, Protists)

When running the pipeline on collections (ie.- bacteria, fungi, protists...) using -cmscan_heuristics rfam_official reduces by a factor of 6x/7x the run-time of the CMScan analysis. The cost in sensitivity is totally acceptable ( RFAM themselves use this parameter to run cmScan on their DB ).

The following parameters were recommended by Rfam, and they are passed automatically when choosing the heuristic 'rfam_official':

-rfam -- significantly increase the speed of the run without affecting much of the results quality
-cut_ga option -- automatically adapt the threshold according to the values used by the Rfam curators for each specific family, which will result in better fit cutoffs, and lead to better runtime in average.
-fit 2 -- specify the kind of output file, adding two extra columns with additional flags that help filtering the results, reducing the number of false positives.
-clanin Rfam.clanin -- file with groups of homologous models and some of the hits to those models will overlap (reduces false positives). Should be be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin

For more details and full explanation of the Infernal software options please see: https://docs.rfam.org/en/latest/genome-annotation.html

Example init command for bacterial collection:

CLANIN_FILE=/path/to/clanin/file # or whatever

init_pipeline.pl Bio::EnsEMBL::EGPipeline::PipeConfig::RNAFeatures_conf \
  $($CMD details script) \
  -hive_force_init 1
  -registry $REG_FILE \
  -pipeline_dir $PIPELINE_DIR \
  -eg_pipelines_dir $EG_PIPELINES_DIR  \
  -division bacteria \
  -cmscan_heuristics rfam_official \
  -production_db ensembl_production \
  -clanin_file $CLANIN_FILE \

Prepare Rfam data for release

We need to preprocess Rfam data when a new version is released, in order to format the data suitably for the pipeline. Additionaly a file for use with customised taxonomic filters can be generated.

N.B. This should be done only once for the release.

scripts/rna_features/add_rfam_desc.pl is used to add model descritions (reported by cmscan) into the file with the Rfam covariance models (Rfam.cm). Usage instructions are given at the top of the script. Sample usage is shown below.
scripts/rna_features/taxonomic_levels.pl is used to extract data about the species that have sequences aligned for each Rfam model, cross-references with the ncbi_taxonomy database to get counts of each species (or subspecies) at the level of division and store this information into a single file. Usage instructions are given at the top of the script. Sample use cases areshown below.

Here's an example of a typical setup process: (don't forget to get/edit/add RFAM_VERSIONS_DIR )

# initialize with proper values
RFAM_VERSION=14.5
RFAM_VERSION_DIR=... # dir to store Rfam data in

ENS_ROOT_DIR=... # dir with Ensembl repos

PROD_SERVER=... # server alias (RO) to get 'ncbi_taxonomy' DB from
CMD_W=... # server alias (RW) to store Rfam DB clone

# get data
RFAM_DIR=${RFAM_VERSION_DIR}/${RFAM_VERSION}

mkdir -p ${RFAM_DIR}
cd ${RFAM_DIR}

lftp -e 'mirror database_files; mget README Rfam.full_region.gz Rfam.clanin Rfam.cm.gz; bye' \
  ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT

gunzip Rfam.cm.gz Rfam.full_region.gz database_files/family.txt.gz database_files/rfamseq.txt.gz

# source production environment settings if you have any
source $ENS_ROOT_DIR/setup.sh

# for the sake of it
md5sum Rfam.cm

# patch Rfam with family info
perl ${ENS_ROOT_DIR}/ensembl-production-imported/scripts/rna_features/add_rfam_desc.pl \
  -cm_file Rfam.cm \
  -family_file database_files/family.txt

cmpress Rfam.cm

# generate taxonomic levels
perl ${ENS_ROOT_DIR}/ensembl-production-imported/scripts/rna_features/taxonomic_levels.pl \
  $(${PROD_SERVER} details script) \
  -dbname 'ncbi_taxonomy' \
  -full_region_file Rfam.full_region \
  -rfamseq_file database_files/rfamseq.txt \
  -rfam2taxonomy_file rfam2taxonomy.txt \
  > taxonomic_levels.txt

# generate database, if you need one
pushd database_files
  RFAM_VERSION_U=$(echo ${RFAM_VERSION} | perl -pe 's/\./_/') # replace '.' with '_'
  RFAM_DB="rfam_${RFAM_VERSION_U}"

  ${CMD_W} -e "drop database if exists ${RFAM_DB}; create database ${RFAM_DB};"

  ls *.sql | cut -f 1 -d . | sort | uniq |
    xargs -n 1 -I XXX \
      ${CMD_W} -D ${RFAM_DB} -e 'set foreign_key_checks = 0; source XXX.sql; set foreign_key_checks = 1'

  gunzip *txt.gz
  ls *.sql | cut -f 1 -d . | sort | uniq |
    xargs -n 1 -I XXX \
      ${CMD_W} -D ${RFAM_DB} -e 'set foreign_key_checks = 0; load data local infile "XXX.txt" into table XXX; set foreign_key_checks = 1;'

  gzip *txt # if you need to preserve raw data
popd

gzip "${RFAM_DIR}/Rfam.full_region"

# drop raw (useless) data
# rm "${RFAM_DIR}/Rfam.full_region.gz"
# rm -rf "${RFAM_DIR}/database_files"  # if you don't need raw data

Now you can update RFAM_VERSION in PrivateConfDetails::Impl

Custom taxonomic filters preparation

Once you have initial ${RFAM_DIR}/taxonomic_levels.txt and ${RFAM_DIR}/rfam2taxonomy.txt files generated, you can create a custom taxonomy filter:

perl ${ENS_ROOT_DIR}/ensembl-production-imported/scripts/rna_features/taxonomic_levels.pl \
  -rfam2taxonomy_file ${RFAM_DIR}/rfam2taxonomy.txt \
  -levels Diptera \
  -levels Drosophila \
  -root Eukaryota \
  > /path/to/custom/taxonomic_levels/diptera.txt

This custom filter can be passed to init command using -rfam_taxonomy_file (and corresponding -taxonomic_levels) option, i.e.:

   -rfam_taxonomy_file /path/to/custom/taxonomic_levels/diptera.txt \
   -taxonomic_levels Diptera

Parts

A few generic from Common::RunnableDB.

A few from RNAFeatures.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RNAFeatures.md

RNAFeatures.md

RNAFeatures

Module Bio::EnsEMBL::EGPipeline::PipeConfig::RNAFeatures_conf

Prerequisites

How to run

Parameters / Options

Notes

Taxonomic filtering

Strict taxonomic filtering

Customised taxonomic filtering

Blacklisting and whitelisting Rfam models

Running the pipeline on collections (Bacteria,Fungi, Protists)

Prepare Rfam data for release

Custom taxonomic filters preparation

Parts

Files

RNAFeatures.md

Latest commit

History

RNAFeatures.md

File metadata and controls

RNAFeatures

Module Bio::EnsEMBL::EGPipeline::PipeConfig::RNAFeatures_conf

Prerequisites

How to run

Parameters / Options

Notes

Taxonomic filtering

Strict taxonomic filtering

Customised taxonomic filtering

Blacklisting and whitelisting Rfam models

Running the pipeline on collections (Bacteria,Fungi, Protists)

Prepare Rfam data for release

Custom taxonomic filters preparation

Parts