Home
This project aims to build a database of 2D and 3D structures of pre-miRNAs using MC-Fold/MC-FlashFold and MC-Sym.
The code is kept on GitHub at https://github.com/major-lab/decoydb.
The protocol has the following major steps:
- Link each precursor to its mature(s)
- "Run" MC-Fold/MC-FlashFold on those precursors, using masks to make sure the matures are on the right stem and form a "perfect" hairpin
- "Run" MC-Sym on those 2D structures until we find 3D structures that fit precomputed 3D templates
- Retrieve those 3D structures and compute base-pairing statistics from them to identify the "dynamic" parts of the molecule
- Filter the 2D structures to keep only those that have all of the "static" base pairs computed previously
- Generate color-coded 2D representations that show the statistics and where the mature microRNAs are located
- "Re-run" MC-Sym on those 2D structures
- For each microRNA, align the 3D structures and merge them into a single PDB file
This part is achieved by the first script, 1_organize_precursor_mature, which takes as arguments:
- hairpin_fa: the hairpin.fa file downloaded from miRBase
- mature_fa: the mature.fa file downloaded from miRBase
- mirna_dat: the miRNA.dat file downloaded from miRBase
- out_dir: a path where the script will:
  - create, if they don't exist yet, the directories called 'hairpin' and 'mature', in which the sequences of the precursor miRNA are stored in FASTA format. The files are named with the precursor's accession ID
  - write a pickle file containing the same data. The pickle object structure is:
[dict(header=fasta_header_of_the_precursor,
      accession=accession_of_the_precursor,
      name=name_of_the_precursor,
      sequence=sequence_of_the_precursor,
      matures=[dict(header=fasta_header_of_the_mature,
                    name=name_of_the_mature,
                    accession=accession_of_the_mature,
                    sequence=sequence_of_the_mature)])]
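To make the layout concrete, here is a minimal sketch of one record in that structure, round-tripped through pickle. All values are placeholders invented for illustration (they are not miRBase entries), and `matures_by_precursor` is a hypothetical helper, not a function from the repository.

```python
import io
import pickle

# One precursor record with one mature; every value is a placeholder.
records = [
    dict(
        header=">example-precursor EXAMPLE0001 placeholder header",
        accession="EXAMPLE0001",
        name="example-precursor",
        sequence="GGGAAACCCUUUGGGAAACCC",
        matures=[
            dict(
                header=">example-mature-5p EXAMPLEM0001 placeholder header",
                name="example-mature-5p",
                accession="EXAMPLEM0001",
                sequence="GGGAAACC",
            )
        ],
    )
]

# Round-trip through pickle using an in-memory buffer instead of a file.
buffer = io.BytesIO()
pickle.dump(records, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)


def matures_by_precursor(records):
    """Map each precursor accession to the names of its matures."""
    return {r["accession"]: [m["name"] for m in r["matures"]] for r in records}
```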
When the mature name doesn't have 5p or 3p as a suffix, we compare the lengths of the precursor sequence before and after the mature. If len(before) < len(after), we assume the mature is on the 5p side, and on the 3p side otherwise.
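The side-assignment rule can be sketched as follows. This is an illustration only: `infer_arm` is a hypothetical name, and the sketch assumes the mature sequence occurs verbatim in the precursor and that an explicit suffix is written as `-5p` or `-3p`.

```python
def infer_arm(precursor, mature, mature_name):
    """Return '5p' or '3p' for a mature miRNA within its precursor."""
    # Trust an explicit suffix in the mature name when there is one.
    for arm in ("5p", "3p"):
        if mature_name.endswith("-" + arm):
            return arm
    start = precursor.find(mature)
    if start < 0:
        raise ValueError("mature sequence not found in precursor")
    # Compare what lies before and after the mature in the precursor.
    before = precursor[:start]
    after = precursor[start + len(mature):]
    return "5p" if len(before) < len(after) else "3p"
```

For example, a mature located 3 nucleotides from the start of a precursor and 10 nucleotides from its end is assigned to the 5p side.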
This part contains 2 steps:
- Generate 2D structures using masks
- Filter those 2D structures to make sure the matures are in hairpins and there is no stem between them
This part is achieved by 2_1_mcflashfold_mask_generator.py, which takes as arguments:
- digested_data: The pickle file generated by 1_organize_precursor_mature
- mcfold_cmd: a template string of the MC-Fold/MC-FlashFold command with the following formatting:
'/u/admc/MC-Flashfold/mcff --tables /u/admc/MC-Flashfold/tables -s "{seq}" -um "{mask}" -t 5 -ns > ./2d/{accession}'
Basically, we put "p" in the 5' mature area, "q" in the 3' mature area and "x" everywhere else. p, q and x respectively mean "not reverse paired", "not forward paired" and "don't care".
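A minimal sketch of how such a mask could be built and substituted into the command template. The function names and the 0-based, half-open `(start, end)` intervals are assumptions made for illustration; only the `{seq}`, `{mask}` and `{accession}` placeholders and the p/q/x letters come from the description above.

```python
def build_mask(length, mature5p=None, mature3p=None):
    """Build a mask string of the given length.

    mature5p and mature3p are (start, end) intervals, 0-based and
    half-open, or None when the precursor has no mature on that side.
    """
    mask = ["x"] * length              # "x": don't care
    if mature5p is not None:
        for i in range(*mature5p):
            mask[i] = "p"              # 5' mature area
    if mature3p is not None:
        for i in range(*mature3p):
            mask[i] = "q"              # 3' mature area
    return "".join(mask)


def build_command(template, seq, mask, accession):
    """Fill the command template with one precursor's values."""
    return template.format(seq=seq, mask=mask, accession=accession)
```

With a precursor of length 10, a 5' mature at `(1, 4)` and a 3' mature at `(6, 9)`, `build_mask` returns `xpppxxqqqx`.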
This part is achieved by 2_2_mcflashfold_structure_filter.py, which takes as arguments:
- hairpin_fasta: the precursor FASTA file
- mature_fasta: the mature FASTA file
- mcfold_output: the output of MC-Fold/MC-FlashFold
The filtered structures are then "fed" to MC-Sym until we find one that fits the template, with various degrees of tolerance depending on the minimization stage:
- 5.0 for a structure without any minimization
- 4.0 for a refined structure
- 3.56 for a relieved structure
- 2.0 for a brushed-up structure
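The stage-dependent tolerances can be kept in a small lookup table, as in the sketch below. It assumes that the value compared against each tolerance is a deviation from the template where lower is better; the text above does not name the metric. The stage keys and the function name are hypothetical.

```python
# Tolerance per minimization stage, taken from the list above.
# The stage keys are illustrative names, not identifiers from the scripts.
TOLERANCE = {
    "none": 5.0,      # structure without any minimization
    "refine": 4.0,    # refined structure
    "relieve": 3.56,  # relieved structure
    "brushup": 2.0,   # brushed-up structure
}


def within_tolerance(stage, deviation):
    """Return True if the deviation is acceptable at the given stage.

    Assumes lower deviation means a better fit to the template.
    """
    return deviation <= TOLERANCE[stage]
```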
The verification is achieved using 3_structure_verificator.py, which takes as arguments:
- hairpin_seq: the precursor's sequence
- mature5p_seq: the 5' mature sequence, if there is one
- mature3p_seq: the 3' mature sequence, if there is one
- structure: the structure of the molecule
- decoy_dir: where the decoys are located
- refine_script: the location of the bash script called for a refine operation
- relieve_script: the location of the bash script called for a relieve operation
- brushup_script: the location of the bash script called for a brushup operation
- out_dir: where the valid PDB files will be written
- For each precursor, we take the first valid structure as the "good" structure
- We then choose 5 structures from the MC-Fold/MC-FlashFold suboptimals that contain exactly the same pairings for the mature(s) as the "good" structure chosen above
- A pairing is broken if the nucleotide is unpaired in more than 50% of the above set and is not involved in a pairing that includes a mature nucleotide
- The resulting structure is drawn using PseudoViewer with the following parameters:
  - Save as: SVG
  - Draw options: Standard view, Additional line, Box layout, Scale: 1.5
  - Numbering: Interval: 5
- The returned SVGs are then converted to highlight the matures and to add a color gradient showing the percentage of non-pairing
The SVG conversion in the last item above is achieved by 5_reformat_svg.py
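The pairing-breaking rule from the list above can be sketched as follows. The sketch assumes that structures are dot-bracket strings of equal length, that "the above set" is the list of selected suboptimal structures, and that a pair is broken when either of its two nucleotides meets the 50% criterion. All function names are hypothetical and not taken from the repository.

```python
def pair_table(structure):
    """Map each paired index to its partner in a dot-bracket string."""
    stack, partner = [], {}
    for i, char in enumerate(structure):
        if char == "(":
            stack.append(i)
        elif char == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced dot-bracket structure")
    return partner


def unpaired_fraction(structures):
    """Fraction of structures in which each position is unpaired."""
    length = len(structures[0])
    if any(len(s) != length for s in structures):
        raise ValueError("structures must have the same length")
    counts = [0] * length
    for structure in structures:
        for i, char in enumerate(structure):
            if char == ".":
                counts[i] += 1
    return [count / len(structures) for count in counts]


def break_dynamic_pairs(reference, structures, mature_positions, threshold=0.5):
    """Open the pairs of `reference` that are mostly unpaired in the set.

    Pairs that involve a position in `mature_positions` are never broken.
    """
    fraction = unpaired_fraction(structures)
    result = list(reference)
    for i, j in pair_table(reference).items():
        if i > j:
            continue  # handle each pair once, from its 5' side
        protected = i in mature_positions or j in mature_positions
        if not protected and (fraction[i] > threshold or fraction[j] > threshold):
            result[i] = result[j] = "."
    return "".join(result)
```

The per-position values returned by `unpaired_fraction` are also the kind of quantity the color gradient would encode.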
This section is divided into 2 main steps:
- Extract the mature part of the previously found PDB
- Use the extracted part as a library for subsequent MC-Sym runs on the 5 "good" structures previously isolated
This is achieved using 6_extract_mature_library which takes as parameters:
- best_struct_dir: that was populated in the previous step
- out_dir: the directory in which the isolated library will be dumped
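As an illustration of the extraction step, the sketch below keeps only the atoms of a residue range from PDB text. It relies on the fixed-column PDB format, where the residue sequence number occupies columns 23 to 26, and assumes that the residue numbering in the MC-Sym output matches positions in the precursor sequence. `extract_residues` is a hypothetical name, not the script's actual interface.

```python
def extract_residues(pdb_text, first, last):
    """Keep the ATOM/HETATM lines whose residue number is in [first, last]."""
    kept = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Columns 23-26 (1-based) hold the residue sequence number.
            residue_number = int(line[22:26])
            if first <= residue_number <= last:
                kept.append(line)
    return "\n".join(kept)
```

Calling it with the first and last positions of a mature in the precursor would yield the fragment to store in out_dir.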