Stephen Leong Koan edited this page Apr 24, 2014 · 1 revision

Description

This project aims to build a database of 2D and 3D structures of pre-miRNAs using MC-Fold/MC-FlashFold and MC-Sym.

The code is kept on GitHub at https://github.com/major-lab/decoydb.

General Protocol

The protocol has 5 major steps:

  1. Link each precursor to its mature(s)
  2. "Run" MC-Fold/MC-FlashFold for those precursors using masks to make sure the matures are on the right stem and make a "perfect" hairpin
  3. "Run" MC-Sym on those 2D structures until we find 3D structures that fit with precomputed 3D templates
  4. Retrieve those 3D structures and compute base-pairing statistics from them to get information about the "dynamic" parts of the molecule
  5. Filter the 2D structures to keep only those that have all of the "static" base pairs computed previously
    • Generate color-coded 2D representations that show the statistics and where the mature microRNAs are
    • "Re-run" MC-Sym on those 2D structures
      • For each microRNA, align the 3D structures and merge them into the same PDB file

Link each precursor to its mature(s)

This part is achieved using the first script, 1_organize_precursor_mature, which takes as arguments:

  • hairpin_fa: the hairpin.fa file downloaded from miRBase
  • mature_fa: the mature.fa file downloaded from miRBase
  • mirna_dat: the miRNA.dat file downloaded from miRBase
  • out_dir: a path where the script will:
    • create, if they don't exist yet, the directories called 'hairpin' and 'mature', in which the sequences of the precursor miRNA are written in FASTA format. The files are named after the precursor's accession ID
    • write a pickle file containing the same data. The pickle object structure is:
[dict(header=fasta_header_of_the_precursor,
      accession=accession_of_the_precursor,
      name=name_of_the_precursor,
      sequence=sequence_of_the_precursor,
      matures=[dict(header=fasta_header_of_the_mature,
                    name=name_of_the_mature,
                    accession=accession_of_the_mature,
                    sequence=sequence_of_the_mature)])]
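
To make the layout above concrete, here is a minimal sketch that builds one such record and round-trips it through pickle. Every value (accessions, names, sequences) is a made-up placeholder rather than miRBase data, and the in-memory round trip stands in for the file written by the script, whose name is not given here.

```python
import pickle

# Placeholder record following the structure described above.
# All values are invented for illustration only.
precursors = [
    dict(
        header=">example-mir-1 MI0000000 placeholder precursor",
        accession="MI0000000",
        name="example-mir-1",
        sequence="GGGAAACUUCGGUUUCCC",
        matures=[
            dict(
                header=">example-miR-1-5p MIMAT0000000 placeholder mature",
                name="example-miR-1-5p",
                accession="MIMAT0000000",
                sequence="GGGAAAC",
            )
        ],
    )
]

# Serialize and read back, as a consumer of the pickle file would.
blob = pickle.dumps(precursors)
loaded = pickle.loads(blob)

for precursor in loaded:
    for mature in precursor["matures"]:
        # Locate the mature inside its precursor (-1 if it is absent).
        start = precursor["sequence"].find(mature["sequence"])
        print(precursor["accession"], mature["name"], start)
```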

When the mature name doesn't have 5p or 3p as a suffix, we compare the lengths of the precursor sequence before and after the mature. If len(before) < len(after), we assume the mature is on the 5p side; otherwise we assume it is on the 3p side.
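
The rule above can be sketched as follows. The function name `infer_arm` and its signature are hypothetical and are not taken from the project's script; the sketch also assumes the mature sequence occurs verbatim in the precursor.

```python
def infer_arm(precursor_seq, mature_seq, mature_name):
    """Return '5p' or '3p' for a mature (hypothetical helper)."""
    # Trust an explicit suffix in the name when present.
    for arm in ("5p", "3p"):
        if mature_name.endswith(arm):
            return arm

    start = precursor_seq.find(mature_seq)
    if start < 0:
        raise ValueError("mature sequence not found in precursor")

    # Compare what lies before and after the mature in the precursor.
    before = precursor_seq[:start]
    after = precursor_seq[start + len(mature_seq):]
    return "5p" if len(before) < len(after) else "3p"


# Placeholder sequences: 2 nt before the mature, 8 nt after it.
print(infer_arm("AAGGGGCCCCCCCC", "GGGG", "example-miR-1"))
```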

MC-Fold/MC-FlashFold part 1

This part contains 2 steps:

  1. Generate 2D structures using a mask
  2. Filter those 2D structures to make sure the matures are in hairpins and there is no stem between them

Generate mask

This part is achieved by 2_1_mcflashfold_mask_generator.py, which takes as arguments:

  • digested_data: The pickle file generated by 1_organize_precursor_mature
  • mcfold_cmd: a template string of the MC-Fold/MC-FlashFold command with the following formatting:
'/u/admc/MC-Flashfold/mcff --tables /u/admc/MC-Flashfold/tables -s "{seq}" -um "{mask}" -t 5 -ns > ./2d/{accession}'

Basically, we put "p" in the 5' mature area, "q" in the 3' mature area, and "x" everywhere else. p, q and x respectively mean "not reverse paired", "not forward paired" and "don't care".
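
As a rough illustration of the masking rule, the sketch below builds a mask and fills in the command template quoted above. The helper name `build_mask` and the example sequences and accession are hypothetical; the sketch locates each mature by plain substring search, which is an assumption about how positions are found.

```python
def build_mask(precursor_seq, mature5p_seq=None, mature3p_seq=None):
    """Return a p/q/x mask of the same length as the precursor (hypothetical helper)."""
    mask = ["x"] * len(precursor_seq)

    if mature5p_seq:
        start = precursor_seq.find(mature5p_seq)
        if start < 0:
            raise ValueError("5' mature not found in precursor")
        mask[start:start + len(mature5p_seq)] = "p" * len(mature5p_seq)

    if mature3p_seq:
        # Search from the right so a repeated motif resolves to the 3' side.
        start = precursor_seq.rfind(mature3p_seq)
        if start < 0:
            raise ValueError("3' mature not found in precursor")
        mask[start:start + len(mature3p_seq)] = "q" * len(mature3p_seq)

    return "".join(mask)


# Command template as quoted above; it is only formatted here, not executed.
MCFOLD_CMD = ('/u/admc/MC-Flashfold/mcff --tables /u/admc/MC-Flashfold/tables '
              '-s "{seq}" -um "{mask}" -t 5 -ns > ./2d/{accession}')

seq = "GGGAAACUUCGGUUUCCC"  # placeholder precursor
mask = build_mask(seq, mature5p_seq="GGGAAAC", mature3p_seq="GUUUCCC")
cmd = MCFOLD_CMD.format(seq=seq, mask=mask, accession="MI0000000")
print(mask)
print(cmd)
```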

2D structures filtering

This part is achieved by 2_2_mcflashfold_structure_filter.py, which takes as arguments:

  • hairpin_fasta: the precursor FASTA file
  • mature_fasta: the mature FASTA file
  • mcfold_output: the output of MC-Fold/MC-FlashFold
The above script allows us to get rid of duplicate structures and of structures in which the mature(s) are not on the same stem.
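
One possible reading of this filter, on dot-bracket structures, is sketched below. It is not the project's implementation: the function names are hypothetical, and the "same stem" test is simplified to three checks (no closing bracket up to the end of the 5' mature, no opening bracket from the start of the 3' mature, and no second stem opening between the two).

```python
def matures_on_same_stem(structure, end5p, start3p):
    """Simplified check on a dot-bracket string (hypothetical helper).

    end5p:   index just past the 5' mature
    start3p: index where the 3' mature starts
    """
    between = structure[end5p:start3p]
    first_close = between.find(")")
    # Once the strand starts closing, it must not open a new stem.
    no_inner_stem = first_close < 0 or "(" not in between[first_close:]
    return (")" not in structure[:end5p]
            and "(" not in structure[start3p:]
            and no_inner_stem)


def filter_structures(structures, end5p, start3p):
    """Drop duplicates (keeping first occurrences) and structures failing the check."""
    seen = set()
    kept = []
    for structure in structures:
        if structure in seen:
            continue
        seen.add(structure)
        if matures_on_same_stem(structure, end5p, start3p):
            kept.append(structure)
    return kept


candidates = [
    "(((((((....)))))))",  # single hairpin
    "(((((((....)))))))",  # duplicate of the first
    "((((.(.)..(.).))))",  # two small stems between the matures
]
print(filter_structures(candidates, end5p=4, start3p=14))
```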

MC-Sym part 1

The filtered structures are then "fed" to MC-Sym until we find one that fits the template, with various degrees of tolerance depending on the minimization stage:

  • 5.0 for a structure without any minimization
  • 4.0 for a refined structure
  • 3.56 for a relieved structure
  • 2.0 for a brushed-up structure
The structure is kept only if it is within all of these thresholds.
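
The acceptance rule can be sketched as below. The stage keys, the function name, and the example scores are hypothetical, and treating "within" as less than or equal to the threshold is an assumption; the quantity being compared is whatever measure of fit to the template the verification script reports.

```python
# Thresholds listed above, keyed by hypothetical stage names.
THRESHOLDS = {
    "unminimized": 5.0,
    "refined": 4.0,
    "relieved": 3.56,
    "brushed_up": 2.0,
}


def passes_all_stages(scores):
    """True only if every stage has a score within its threshold (assumed inclusive)."""
    return all(
        stage in scores and scores[stage] <= limit
        for stage, limit in THRESHOLDS.items()
    )


# Made-up scores for two candidate structures.
good = {"unminimized": 4.2, "refined": 3.9, "relieved": 3.0, "brushed_up": 1.5}
bad = {"unminimized": 4.2, "refined": 3.9, "relieved": 3.0, "brushed_up": 2.5}
print(passes_all_stages(good), passes_all_stages(bad))
```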

The verification is achieved using 3_structure_verificator.py, which takes as arguments:

  • hairpin_seq: the precursor's sequence
  • mature5p_seq: the 5' mature sequence, if there is one
  • mature3p_seq: the 3' mature sequence, if there is one
  • structure: the structure of the molecule
  • decoy_dir: where the decoys are located
  • refine_script: the location of the bash script called for a refine operation
  • relieve_script: the location of the bash script called for a relieve operation
  • brushup_script: the location of the bash script called for a brushup operation
  • out_dir: where the valid PDB files are written

2D representation of the valid structures

  1. For each precursor, we take the first valid structure as the "good" structure
  2. We then choose 5 structures from the MC-Fold/MC-FlashFold suboptimals that contain exactly the same pairings for the mature(s) as the "good" structure chosen above
  3. A pairing is broken if its nucleotide is unpaired in more than 50% of the above set and is not involved in a pairing that includes the matures
  4. The resulting structure is drawn using PseudoViewer with the following parameters:
    • Save as: SVG
    • Draw options: Standard view, Additional line, Box layout, Scale: 1.5
    • Numbering: Interval: 5
  5. The returned SVGs are then post-processed to highlight the matures and to add a color gradient reflecting the percentage of non-pairing
Steps 1-4 are achieved by 4_find_best_struct.py

Step 5 is achieved by 5_reformat_svg.py
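
The pair-breaking rule in step 3 can be sketched on dot-bracket strings as follows. This is an interpretation, not the code of 4_find_best_struct.py: the function names are hypothetical, a pair is broken when either partner is unpaired in more than half of the selected structures, and pairs touching a mature position are always kept.

```python
def pair_table(structure):
    """Map each index to its partner index, or None if unpaired."""
    table = [None] * len(structure)
    stack = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            table[i], table[j] = j, i
    return table


def unpaired_fraction(structures):
    """Per-position fraction of structures in which the nucleotide is unpaired."""
    total = len(structures)
    return [sum(s[i] == "." for s in structures) / total
            for i in range(len(structures[0]))]


def break_dynamic_pairs(good, suboptimals, mature_positions, cutoff=0.5):
    """Return the 'good' structure with its dynamic pairs opened, plus the fractions."""
    fractions = unpaired_fraction(suboptimals)
    result = list(good)
    for i, j in enumerate(pair_table(good)):
        if j is None or i > j:
            continue  # unpaired, or pair already handled from its 5' side
        touches_mature = i in mature_positions or j in mature_positions
        if not touches_mature and (fractions[i] > cutoff or fractions[j] > cutoff):
            result[i] = result[j] = "."
    return "".join(result), fractions


# Made-up example: the outermost pair is open in 3 of the 5 structures.
subopts = ["((((....))))", ".(((....))).", ".(((....))).",
           "((((....))))", ".(((....)))."]
consensus, fractions = break_dynamic_pairs("((((....))))", subopts, {1, 2, 3})
print(consensus)
```

The per-position fractions returned alongside the structure are the kind of values a color gradient in step 5 could be driven by.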

Building the decoys

This section is divided into 2 main steps:

  1. Extract the mature part of the previously found PDB
  2. Use the extracted part as a library for subsequent MC-Sym runs on the 5 "good" structures previously isolated

Extract mature part

This is achieved using 6_extract_mature_library which takes as parameters:

  • best_struct_dir: the directory that was populated in the previous step
  • out_dir: the directory in which the isolated library will be dumped
N.B.: a positions.txt file is created alongside the pdb.gz. It is needed when generating new MC-Sym scripts, because the end positions are required to use the pdb.gz file as a library.