AlanKerstjens/MoleculeAutoCorrect

About

Spell checker for your molecular graphs. A virtual library of reference molecules that are considered correct is used to build a dictionary of allowed chemical features. The chemical features of input molecules are compared against this dictionary. If any invalid features are present, the molecule is modified in a controlled way to find a closely related valid molecule.
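
The dictionary idea can be illustrated with a toy sketch. It is not MoleculeAutoCorrect's implementation: the molecule representation (a list of element symbols plus a list of bonded index pairs) and the choice of "feature" (the element pair of each bond) are assumptions made purely to keep the example short.

```python
# Toy illustration only. The "features" here are element pairs of bonded
# atoms, an assumption for brevity, not MoleculeAutoCorrect's feature set.

def bond_features(molecule):
    """Return the set of sorted element pairs, one per bond of a toy molecule."""
    atoms, bonds = molecule
    return {tuple(sorted((atoms[i], atoms[j]))) for i, j in bonds}

def build_dictionary(reference_molecules):
    """Collect every feature seen anywhere in the reference library."""
    dictionary = set()
    for molecule in reference_molecules:
        dictionary |= bond_features(molecule)
    return dictionary

def foreign_features(molecule, dictionary):
    """Return the features of a molecule that the reference library never contains."""
    return bond_features(molecule) - dictionary

# Reference library: ethanol (C-C-O) and methylamine (C-N), hydrogens omitted.
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
methylamine = (["C", "N"], [(0, 1)])
dictionary = build_dictionary([ethanol, methylamine])

# An input with an O-O bond, which never occurs in the reference library.
peroxide = (["C", "O", "O"], [(0, 1), (1, 2)])
print(foreign_features(peroxide, dictionary))  # prints {('O', 'O')}
```

An empty result means every feature of the input is known. A non-empty result marks the parts of the molecule that a correction step would need to change.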

For more information on the algorithm please read the accompanying paper: Kerstjens, A., De Winter, H. Molecule auto-correction to facilitate molecular design. J Comput Aided Mol Des 38, 10 (2024).

Installation

Installation from source

Prerequisites

Ensure the following dependencies are installed:

  • RDKit
  • Molpert
  • Boost. You already have this if you installed the RDKit. If you'd like to build the Python bindings, make sure Boost.Python is installed as well.
  • CMake

Instructions

The following instructions are for GNU/Linux. On other operating systems you'll have to adapt these commands slightly.

git clone https://github.com/AlanKerstjens/MoleculeAutoCorrect.git
export MOLECULE_AUTO_CORRECT="$(pwd)/MoleculeAutoCorrect"
mkdir ${MOLECULE_AUTO_CORRECT}/build && cd ${MOLECULE_AUTO_CORRECT}/build

You need to point CMake to your Molpert installation. Assuming it's installed at ${MOLPERT}:

cmake -DMolpert_INCLUDE_DIRS=${MOLPERT}/source ..
make install

To be able to import the library from Python, add ${MOLECULE_AUTO_CORRECT}/lib to your ${PYTHONPATH}. Consider doing so in your ~/.bash_profile. Otherwise you'll have to extend ${PYTHONPATH} manually every time you open a new shell.

export PYTHONPATH="${PYTHONPATH}:${MOLECULE_AUTO_CORRECT}/lib"
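
To confirm the export took effect, a small check along the following lines may help. It only inspects the interpreter's search path and does not import the library. The helper name library_dir_on_path is made up for this sketch.

```python
import os
import sys

def library_dir_on_path(root, search_path):
    """Return True if <root>/lib is one of the entries in search_path."""
    lib_dir = os.path.normpath(os.path.join(root, "lib"))
    return any(os.path.normpath(entry) == lib_dir for entry in search_path if entry)

# MOLECULE_AUTO_CORRECT is the variable exported in the instructions above.
root = os.environ.get("MOLECULE_AUTO_CORRECT")
if root is None:
    print("MOLECULE_AUTO_CORRECT is not set in this shell")
elif library_dir_on_path(root, sys.path):
    print("lib directory is on the Python search path")
else:
    print("lib directory is missing from the Python search path")
```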

Troubleshooting

CMake will try to find the rest of the dependencies for you. To avoid problems, ensure you build the software with the same Boost and Python versions that you used to build Molpert and the RDKit. If CMake finds a different Boost or Python installation, you'll need to point it to the correct one, as described in the CMake documentation (see the FindBoost and FindPython modules).

CMake will search for the RDKit in the active Anaconda environment (if you have one) and at ${RDBASE} if set. If neither applies, you need to specify the path to the RDKit yourself. Replace the above CMake command with the one below, substituting the <path/to/...> placeholders with your own paths.

cmake -DRDKit_ROOT=<path/to/rdkit> -DMolpert_INCLUDE_DIRS=<path/to/molpert> ..

Quick start

Get your hands on a virtual library of molecules you would like to use as a reference of correct chemistry (here chembl.smi). Then use this library to create a dictionary of chemical features (here chembl.dict). You can specify the radius of circular atomic environments as the last argument (here 1).

${MOLECULE_AUTO_CORRECT}/bin/MakeChemicalDictionary chembl.smi chembl.dict 1
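
To build intuition for the radius argument, here is a minimal sketch. It assumes "circular atomic environment" carries its usual cheminformatics meaning: a central atom together with every atom reachable within the given number of bonds. The graph encoding and the function circular_environment are hypothetical and do not reflect how MakeChemicalDictionary represents environments internally.

```python
from collections import deque

def circular_environment(adjacency, elements, center, radius):
    """Return sorted (distance, element) pairs for atoms within `radius` bonds of `center`."""
    distances = {center: 0}
    queue = deque([center])
    while queue:
        atom = queue.popleft()
        if distances[atom] == radius:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency[atom]:
            if neighbor not in distances:
                distances[neighbor] = distances[atom] + 1
                queue.append(neighbor)
    return sorted((d, elements[a]) for a, d in distances.items())

# Toy heavy-atom graph of ethanol: C0 - C1 - O2
elements = ["C", "C", "O"]
adjacency = {0: [1], 1: [0, 2], 2: [1]}

for radius in (0, 1, 2):
    print(radius, circular_environment(adjacency, elements, 0, radius))
# 0 [(0, 'C')]
# 1 [(0, 'C'), (1, 'C')]
# 2 [(0, 'C'), (1, 'C'), (2, 'O')]
```

A larger radius makes each environment more specific, so fewer environments of an input molecule will match those collected from the reference library.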

Given the SMILES string of a molecule (here [OH2+]C1C([O-])[C]11=C(C#P)C2=S(=C1)C=NC=N2) you can check whether it has any issues:

${MOLECULE_AUTO_CORRECT}/bin/HighlightMoleculeErrors chembl.dict "[OH2+]C1C([O-])[C]11=C(C#P)C2=S(=C1)C=NC=N2" molecule_errors.svg

If it has issues you can proceed to try correcting them:

python ${MOLECULE_AUTO_CORRECT}/AutoCorrectMolecule.py chembl.dict "[OH2+]C1C([O-])[C]11=C(C#P)C2=S(=C1)C=NC=N2"
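
If you prefer to drive these steps from a script, the commands above can be assembled as argument lists, which sidesteps shell quoting of SMILES characters such as brackets, parentheses and #. This is a sketch only: quick_start_commands and run_quick_start are hypothetical helpers, the fallback directory name is an assumption, and only the positional arguments shown in this Quick start are used.

```python
import os
import shlex
import subprocess

def quick_start_commands(root, library, dictionary, smiles,
                         radius=1, image="molecule_errors.svg"):
    """Return the three Quick start commands as argument lists."""
    return [
        [os.path.join(root, "bin", "MakeChemicalDictionary"), library, dictionary, str(radius)],
        [os.path.join(root, "bin", "HighlightMoleculeErrors"), dictionary, smiles, image],
        ["python", os.path.join(root, "AutoCorrectMolecule.py"), dictionary, smiles],
    ]

def run_quick_start(commands):
    """Run the commands in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)

# Assumption: fall back to a relative directory if the variable is not set.
root = os.environ.get("MOLECULE_AUTO_CORRECT", "MoleculeAutoCorrect")
commands = quick_start_commands(
    root, "chembl.smi", "chembl.dict",
    "[OH2+]C1C([O-])[C]11=C(C#P)C2=S(=C1)C=NC=N2",
)

# Dry run: print the commands. Call run_quick_start(commands) to execute them.
for command in commands:
    print(shlex.join(command))
```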

You can experiment with different settings, including tree policies. Pass --help for more information.

python ${MOLECULE_AUTO_CORRECT}/AutoCorrectMolecule.py --help