Home

Table of Contents

- Description
- General Protocol
- Link each precursor to its mature(s)
- MC-Fold/MC-FlashFold part. 1
  - Generate mask
  - 2D structures filtering
- MC-Sym part.1
- 2D representation of the valid structures
- Building the decoys
  - Extract mature part

Description

This project aims to build a database of 2D and 3D structures of pre-miRNAs using MC-Fold/MC-FlashFold and MC-Sym.

The code is kept on GitHub at https://github.com/major-lab/decoydb.

General Protocol

The protocol has 5 major steps:

Link each precursor to its mature(s)
"Run" MC-Fold/MC-FlashFold on those precursors, using masks to make sure the matures are on the right stem and form a "perfect" hairpin
"Run" MC-Sym on those 2D structures until we find 3D structures that fit precomputed 3D templates
Retrieve those 3D structures and compute base-pairing statistics from them to get information about the "dynamic" parts of the molecule
Filter the 2D structures to keep only those that have all of the "static" base pairs computed previously
- Generate color-coded 2D representations that show the statistics and where the mature microRNAs are
- "Re-run" MC-Sym on those 2D structures
  - For each microRNA, align the 3D structures and merge them into the same PDB file

Link each precursor to its mature(s)

This part is performed by the first script, 1_organize_precursor_mature, which takes the following arguments:

hairpin_fa: the hairpin.fa file downloaded from miRBase
mature_fa: the mature.fa file downloaded from miRBase
mirna_dat: the miRNA.dat file downloaded from miRBase
out_dir: a path where the script will:
- create, if they do not exist yet, the directories 'hairpin' and 'mature', in which the precursor and mature miRNA sequences are written in FASTA format. The files are named after the precursor's accession ID
- write a pickle file containing the same data. The structure of the pickled object is:

[dict(header=fasta_header_of_the_precursor,
      accession=accession_of_the_precursor,
      name=name_of_the_precursor,
      sequence=sequence_of_the_precursor,
      matures=[dict(header=fasta_header_of_the_mature,
                    name=name_of_the_mature,
                    accession=accession_of_the_mature,
                    sequence=sequence_of_the_mature)])]
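
As an illustration of the layout above, the following minimal sketch builds one such record and round-trips it through pickle. The helper name make_precursor_record and every value (headers, accessions, names, sequences) are hypothetical placeholders, not miRBase data or code from the repository.

```python
import pickle


def make_precursor_record(header, accession, name, sequence, matures):
    """Build one precursor entry using the keys listed above."""
    return dict(
        header=header,
        accession=accession,
        name=name,
        sequence=sequence,
        # Each mature is itself a small dict with the same four keys.
        matures=[
            dict(
                header=m["header"],
                name=m["name"],
                accession=m["accession"],
                sequence=m["sequence"],
            )
            for m in matures
        ],
    )


# Hypothetical placeholder values, for illustration only.
records = [
    make_precursor_record(
        header=">EXAMPLE_PRE example precursor",
        accession="EXAMPLE_PRE",
        name="example-mir-0",
        sequence="GGGAAACCCUUUGGG",
        matures=[
            dict(
                header=">EXAMPLE_MAT example mature",
                name="example-miR-0-5p",
                accession="EXAMPLE_MAT",
                sequence="GGGAAA",
            )
        ],
    )
]

# The pickle file holds the whole list; here it is serialized in memory.
payload = pickle.dumps(records)
restored = pickle.loads(payload)
```

Reading the real file would use pickle.load on an open binary file handle instead of pickle.loads.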

When the mature name does not end with a 5p or 3p suffix, we compare the lengths of the precursor segments located before and after the mature sequence. If len(before) < len(after), we assume the mature is on the 5p side; otherwise we assume it is on the 3p side.
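
The rule above can be sketched as follows. This is a minimal illustration, not the repository's code: the function name infer_arm is hypothetical, and locating the mature with the first exact substring match is an assumption.

```python
def infer_arm(precursor_seq, mature_seq, mature_name=""):
    """Return '5p' or '3p' for a mature sequence found in a precursor."""
    # An explicit suffix in the mature name takes priority.
    for arm in ("5p", "3p"):
        if mature_name.endswith(arm):
            return arm

    # Assumption: the mature is located by its first exact occurrence.
    start = precursor_seq.find(mature_seq)
    if start < 0:
        raise ValueError("mature sequence not found in precursor")

    before = precursor_seq[:start]
    after = precursor_seq[start + len(mature_seq):]
    # Shorter flank before the mature means it sits on the 5p side.
    return "5p" if len(before) < len(after) else "3p"
```

For example, a mature that starts 3 nucleotides into the precursor and is followed by 8 nucleotides is assigned to the 5p side.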

MC-Fold/MC-FlashFold part. 1

This part contains 2 steps:

Generate 2D structures using a mask
Filter those 2D structures to make sure the matures are in hairpins and that there is no stem between them

Generate mask

This part is achieved by 2_1_mcflashfold_mask_generator.py, which takes as arguments:

digested_data: The pickle file generated by 1_organize_precursor_mature
mcfold_cmd: a template string of the MC-Fold/MC-FlashFold command with the following formatting:

'/u/admc/MC-Flashfold/mcff --tables /u/admc/MC-Flashfold/tables -s "{seq}" -um "{mask}" -t 5 -ns > ./2d/{accession}'

Basically, we put "p" in the 5' mature area, "q" in the 3' mature area, and "x" everywhere else. p, q and x mean not reverse paired, not forward paired and don't care, respectively.
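
A minimal sketch of this mask construction is shown below. The function name build_mask is hypothetical, and locating the 5' mature with the first match and the 3' mature with the last match is an assumption; the command template is the one quoted above, and the sequences and accession are placeholders.

```python
def build_mask(precursor_seq, mature_5p=None, mature_3p=None):
    """Return a mask with 'p' over the 5' mature, 'q' over the 3' mature, 'x' elsewhere."""
    mask = ["x"] * len(precursor_seq)

    def paint(mature, symbol, start):
        if start < 0:
            raise ValueError("mature sequence not found in precursor")
        for i in range(start, start + len(mature)):
            mask[i] = symbol

    # Assumption: first occurrence for the 5' mature, last for the 3' mature.
    if mature_5p:
        paint(mature_5p, "p", precursor_seq.find(mature_5p))
    if mature_3p:
        paint(mature_3p, "q", precursor_seq.rfind(mature_3p))
    return "".join(mask)


# Command template as quoted in this page.
MCFOLD_CMD = (
    '/u/admc/MC-Flashfold/mcff --tables /u/admc/MC-Flashfold/tables '
    '-s "{seq}" -um "{mask}" -t 5 -ns > ./2d/{accession}'
)

seq = "AAGGGAAAACCCAA"  # placeholder sequence
mask = build_mask(seq, mature_5p="GGG", mature_3p="CCC")
command = MCFOLD_CMD.format(seq=seq, mask=mask, accession="EXAMPLE_PRE")
```

The resulting string is what would be passed to a shell; the sketch only formats it and does not run MC-FlashFold.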

2D structures filtering

This part is achieved by 2_2_mcflashfold_structure_filter.py, which takes as arguments:

hairpin_fasta: the precursor FASTA file
mature_fasta: the mature FASTA file
mcfold_output: the output of MC-Fold/MC-FlashFold

This script removes duplicate structures as well as structures in which the mature(s) are not on the same stem.
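
The kind of filtering described here could look like the sketch below. It is not the logic of 2_2_mcflashfold_structure_filter.py: it assumes the structures are already available as plain dot-bracket strings, and it approximates "same stem" by requiring a single stem-loop, meaning no opening bracket after the first closing bracket. The function names are hypothetical.

```python
def is_single_hairpin(structure):
    """True if the dot-bracket string forms exactly one stem-loop."""
    first_close = structure.find(")")
    if first_close < 0:
        return False  # no base pair at all
    # A '(' after the first ')' would start a second stem.
    return "(" not in structure[first_close:]


def filter_structures(structures):
    """Drop duplicates (keeping first occurrences) and multi-stem structures."""
    seen = set()
    kept = []
    for structure in structures:
        if structure in seen:
            continue
        seen.add(structure)
        if is_single_hairpin(structure):
            kept.append(structure)
    return kept


candidates = ["((..))", "((..))", "(.)(.)", "......", "(((..).))"]
filtered = filter_structures(candidates)
```

Internal loops and bulges still pass this test, since they do not open a new stem after the hairpin loop has closed.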

MC-Sym part.1

The filtered structures are then "fed" to MC-Sym until we find one that fits the template, with various degrees of tolerance depending on the minimization stage:

5.0 for a structure without any minimization
4.0 for a refined structure
3.56 for a relieved structure
2.0 for a brushed-up structure

The structure is kept only if it is within all of these thresholds.
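
The acceptance rule can be sketched as below. The page does not name the quantity being compared to the thresholds, so the sketch simply calls it a score measuring deviation from the template; the stage keys and the function name are hypothetical, and "within" is assumed to mean less than or equal to.

```python
# Tolerances listed above, keyed by hypothetical stage names.
THRESHOLDS = {
    "raw": 5.0,         # structure without any minimization
    "refined": 4.0,     # refined structure
    "relieved": 3.56,   # relieved structure
    "brushed_up": 2.0,  # brushed-up structure
}


def passes_all_stages(scores):
    """True only if every stage has a score within its threshold."""
    return all(
        stage in scores and scores[stage] <= limit
        for stage, limit in THRESHOLDS.items()
    )


good = {"raw": 4.2, "refined": 3.9, "relieved": 3.0, "brushed_up": 1.5}
bad = {"raw": 4.2, "refined": 3.9, "relieved": 3.0, "brushed_up": 2.5}
```

A structure with a missing stage is rejected in this sketch, on the assumption that all four stages must have been evaluated.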

The verification is achieved using 3_structure_verificator.py, which takes as arguments:

hairpin_seq: the precursor's sequence
mature5p_seq: the 5' mature sequence, if any
mature3p_seq: the 3' mature sequence, if any
structure: the structure of the molecule
decoy_dir: where the decoys are located
refine_script: the location of the bash script called for a refine operation
relieve_script: the location of the bash script called for a relieve operation
brushup_script: the location of the bash script called for a brushup operation
out_dir: where the valid PDB files are written

2D representation of the valid structures

1. For each precursor, we take the first valid structure as the "good" structure
2. We then choose 5 structures from the MC-Fold/MC-FlashFold suboptimals that contain exactly the same pairings for the mature(s) as the "good" structure chosen above
3. A pairing is broken if the nucleotide is unpaired in more than 50% of the above set and is not involved in a pairing with the matures
4. The resulting structure is drawn using PseudoViewer with the following parameters:
   - Save as: SVG
   - Draw options: Standard view, Additional line, Box layout, Scale: 1.5
   - Numbering: Interval: 5
5. The returned SVG files are then modified to highlight the matures and to add a color gradient showing the percentage of non-pairing

Steps 1 to 4 are performed by 4_find_best_struct.py.

Step 5 is performed by 5_reformat_svg.py.
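
The pair-breaking rule and the non-pairing percentage used for the color gradient can be sketched as follows. This is an illustrative reading of the rule, not the code of 4_find_best_struct.py: the function names are hypothetical, structures are assumed to be equal-length dot-bracket strings using only "(", ")" and ".", and a pair is broken when either of its two nucleotides exceeds the cutoff.

```python
def pair_table(structure):
    """Map each paired position to its partner in a dot-bracket string."""
    stack, partner = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def unpaired_fraction(structures):
    """Fraction of structures in which each position is unpaired."""
    length = len(structures[0])
    return [
        sum(s[i] == "." for s in structures) / len(structures)
        for i in range(length)
    ]


def break_dynamic_pairs(good, suboptimals, mature_positions, cutoff=0.5):
    """Open pairs of `good` that are mostly unpaired and do not touch a mature."""
    frac = unpaired_fraction(suboptimals)
    result = list(good)
    for i, j in pair_table(good).items():
        if i > j:
            continue  # handle each pair once
        if i in mature_positions or j in mature_positions:
            continue  # pairings involving the matures are kept
        if frac[i] > cutoff or frac[j] > cutoff:
            result[i] = result[j] = "."
    return "".join(result)


good = "(((...)))"
subopts = ["(((...)))", ".((...)).", ".((...)).", ".((...)).", "(((...)))"]
consensus = break_dynamic_pairs(good, subopts, mature_positions={1, 2})
```

The list returned by unpaired_fraction could also drive the color gradient, one value per nucleotide.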

Building the decoys

This section is divided into 2 main steps:

Extract the mature part of the previously found PDB
Use the extracted part as a library for subsequent MC-Sym runs on the 5 "good" structures previously isolated

Extract mature part

This is achieved using 6_extract_mature_library which takes as parameters:

best_struct_dir: that was populated in the previous step
out_dir: the directory in which the isolated library will be dumped

N.B.: a positions.txt file is created alongside the pdb.gz file. It is needed when generating new MC-Sym scripts, because the end positions are required to use the pdb.gz file as a library.
