Check Newick tree structural integrity (#35)
* Checkpoint for Newick tree structural integrity

- All internal nodes in the input (Newick) tree
  must have exactly two child nodes
- If an internal node has one child node or
  three (or more), the checkpoint exits
  and prompts the user to adjust the tree
  in the configuration file
- To help locate the problem, the
  structural error in the input tree is
  shown to the user both as a string and as an
  ASCII drawing of the subtree that contains
  the incorrect internal node
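
The rule described above can be tried out without ete3. The following is a minimal, stdlib-only sketch, not the code of this commit: `parse_newick_topology`, `to_newick` and `find_non_binary_nodes` are hypothetical helper names, and the parser assumes a topology-only Newick string (no branch lengths, no internal node labels).

```python
def parse_newick_topology(newick):
    """Parse a topology-only Newick string into nested lists (leaves are str)."""
    text = newick.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ")"
            return children
        # Leaf: read the name up to the next structural character
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos].strip()

    tree = parse_node()
    if pos != len(text):
        raise ValueError(f"unexpected character at position {pos}")
    return tree


def to_newick(node):
    """Serialize a nested-list node back to a Newick substring."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


def find_non_binary_nodes(tree):
    """Return (one_child, three_or_more): offending subtrees as Newick strings."""
    one_child, three_or_more = [], []
    stack = [tree]
    while stack:
        node = stack.pop()
        if isinstance(node, str):
            continue  # leaves have no children to count
        if len(node) == 1:
            one_child.append(to_newick(node))
        elif len(node) > 2:
            three_or_more.append(to_newick(node))
        stack.extend(reversed(node))  # visit children left to right
    return one_child, three_or_more
```

Calling `find_non_binary_nodes(parse_newick_topology(...))` on a tree string returns two lists of subtree strings, one per error class; both are empty for a strictly bifurcating tree.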
Cecilia-Sensalari authored Mar 23, 2022
1 parent b1b1d9c commit bc9ea47
Showing 3 changed files with 68 additions and 1 deletion.
1 change: 1 addition & 0 deletions ksrates/fc_configfile.py
@@ -116,6 +116,7 @@ def get_species(self):
    def get_newick_tree(self):
        """
        Gets the config file field of the Newick tree.
        Checks and exits if the species' names in the Newick tree contain illegal characters (underscore or spaces).
        :return tree_string: the tree object by ete3
        """
67 changes: 66 additions & 1 deletion ksrates/fc_manipulate_trees.py
@@ -161,7 +161,72 @@ def find_missing_pairs_for_tree_rates(tree, species, species_history, latin_name
missing_pairs_with_latin_names.append([sorted_latin_tag, sorted([leaf1.name, leaf2.name], key=str.casefold)])
missing_pairs.append(sorted([leaf1.name, leaf2.name], key=str.casefold))
return missing_pairs_with_latin_names, missing_pairs



def check_integrity_newick_tree(tree):
    r"""
    :param tree: the original tree object

    Checks if there are syntax errors in the newick_tree input. Exits if there are errors.

    - Case 1: The presence of extra unnecessary pairs of parentheses generates internal nodes with only one child node,
      instead of two child nodes; this will cause problems during the parsing of the tree to obtain the species trios.
      Therefore, the code exits and prompts the user to remove such unnecessary parentheses.
      Example: the input Newick tree contains a subtree whose outermost pair of parentheses has to be removed.
      String visualization of the subtree: (((elaeis,oryza),asparagus))
      ASCII visualization of the subtree - note the extra node at the base of this subtree:

                   /-elaeis
                /-|
          -- /-|   \-oryza
               |
                \-asparagus

    - Case 2: In the presence of unresolved phylogeny (i.e. three or more child nodes branching off from an internal node)
      there will be problems in downstream analysis due to ambiguous outgroup relationships.
      Therefore, the code exits and prompts the user to rearrange the node(s).
      Example: the input Newick tree contains a subtree where the basal node has three child nodes.
      String visualization of the subtree: (elaeis,oryza,maize)
      ASCII visualization of the subtree:

             /-elaeis
            |
          --|--oryza
            |
             \-maize
    """
    # For each internal node, check integrity (must have exactly two children)
    logging.info("Checking structural integrity of input Newick tree...")
    trigger_exit = False
    internal_nodes_with_one_child, internal_nodes_with_three_children = [], []
    for node in tree.traverse():
        if not node.is_leaf():
            number_of_children_nodes = len(node.get_children())
            if number_of_children_nodes == 1:
                internal_nodes_with_one_child.append(node)
            elif number_of_children_nodes > 2:
                # Three or more children: unresolved phylogeny
                internal_nodes_with_three_children.append(node)

    if len(internal_nodes_with_one_child) != 0:
        logging.error(f'The tree structure provided in "newick_tree" configuration file field has one or more incomplete internal nodes:')
        logging.error(f"likely there are unnecessary pairs of parentheses that generate internal nodes with only one child node instead of two child nodes")
        logging.error(f"Please adjust the input tree in the configuration file as suggested below and rerun the analysis")
        logging.error(f"Such syntax error can be solved by removing the unnecessary outermost pair of parentheses in the following subtree(s):\n")
        for node in internal_nodes_with_one_child:
            logging.error(f'Subtree {internal_nodes_with_one_child.index(node)+1}: {node.write(format=9).rstrip(";")}{node}\n')
        trigger_exit = True

    if len(internal_nodes_with_three_children) != 0:
        logging.error(f'The tree structure provided in "newick_tree" configuration file field contains unresolved phylogenetic relationships')
        logging.error(f"Please adjust the tree so that each internal node has exactly two child nodes")
        logging.error(f"Such structural issue has been encountered at the base of the following subtree(s):\n")
        for node in internal_nodes_with_three_children:
            logging.error(f'Subtree {internal_nodes_with_three_children.index(node)+1}: {node.write(format=9).rstrip(";")}{node}\n')
        trigger_exit = True

    # Exit only after both kinds of problems have been reported
    if trigger_exit:
        sys.exit(1)


def reorder_tree_leaves(tree, species):
    """
1 change: 1 addition & 0 deletions ksrates/setup_correction.py
@@ -19,6 +19,7 @@ def setup_correction(config_file, nextflow_flag):
    # Check configfile
    species_of_interest = config.get_species()
    original_tree = config.get_newick_tree()
    fcTree.check_integrity_newick_tree(original_tree)
    tree = fcTree.reorder_tree_leaves(original_tree, species_of_interest)  # focal species is the top leaf
    latin_names = config.get_latin_names()
    paranome = config.get_paranome()
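
The checkpoint reports both classes of problems before exiting once, so a user sees every offending subtree in a single run. A possible variation, sketched here under assumptions and not part of this commit, separates the reporting from the exit so that the reporting step can be unit-tested. `report_tree_problems` is a hypothetical name, and it takes subtrees already rendered as strings rather than ete3 nodes.

```python
import logging


def report_tree_problems(one_child_subtrees, multifurcating_subtrees):
    """Log every structural problem; return True if the caller should exit.

    Hypothetical helper: both arguments are lists of Newick substrings.
    """
    trigger_exit = False

    if one_child_subtrees:
        logging.error("Internal nodes with a single child node were found; "
                      "remove the outermost pair of parentheses in:")
        for number, subtree in enumerate(one_child_subtrees, start=1):
            logging.error("Subtree %d: %s", number, subtree)
        trigger_exit = True

    if multifurcating_subtrees:
        logging.error("Internal nodes with three or more child nodes were found "
                      "at the base of:")
        for number, subtree in enumerate(multifurcating_subtrees, start=1):
            logging.error("Subtree %d: %s", number, subtree)
        trigger_exit = True

    return trigger_exit
```

A caller would then write `if report_tree_problems(a, b): sys.exit(1)`, which keeps the single exit point of the original function while leaving the logging logic free of side effects on the interpreter.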
