Architecture
The central focus of tmol is the PoseStack - a batch of structures. At its heart, tmol is a library for creating, scoring, manipulating, and exporting PoseStacks.
Under the hood, all PoseStack creation goes through a common function: tmol.pose_stack_from_canonical_form. Other PoseStack creation functions, such as loading from a PDB file or importing from RoseTTAFold2 or OpenFold, work by first converting the source data into a common representation - the CanonicalForm.
Because tmol represents structures with a higher chemical granularity than most other ML packages, it has to resolve the chemical structure of the molecules coming into it from other sources. Even the process of reading in a PDB file requires this chemical type resolution.
The CanonicalForm is a structure batch format that lets us represent data in tmol while deferring the chemical resolution. This makes loading the source data into tmol easier, and lets the chemical type resolution step use the same machinery regardless of data source.
Once a batch of structures has been converted into a CanonicalForm, it can be combined with several other data-source specific objects to resolve the chemical structure and create a PoseStack:
- The PackedBlockTypes object, which contains chemical information for the set of chemical types used by the data source.
- The CanonicalOrdering object, which describes the mapping of chemical types to integers and, for each type, the mapping of its atom names to unique integers.
Variants of these two objects for each data source are stored in the database, and more can be added in order to support PoseStack creation from new data sources.
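To make the role of the integer mappings concrete, here is a minimal sketch in plain Python. It is not tmol's CanonicalOrdering or CanonicalForm: the dictionaries, the function to_canonical_form, and the choice of NaN for atoms missing from the input are all assumptions made for illustration.

```python
import math

# Hypothetical, heavily simplified stand-in for a CanonicalOrdering:
# residue-type names -> integers, and per-type atom names -> integers.
RES_TYPE_INDEX = {"ALA": 0, "GLY": 1}
ATOM_INDEX = {
    "ALA": {"N": 0, "CA": 1, "C": 2, "O": 3, "CB": 4},
    "GLY": {"N": 0, "CA": 1, "C": 2, "O": 3},
}
MAX_ATOMS = max(len(atoms) for atoms in ATOM_INDEX.values())
NAN3 = (math.nan, math.nan, math.nan)


def to_canonical_form(residues):
    """Convert parsed residues into integer types plus padded coordinates.

    `residues` is a list of (res_name, {atom_name: (x, y, z)}) pairs.
    Atoms absent from the input stay NaN, so a later chemical-resolution
    step can decide how to handle them (an assumption of this sketch).
    """
    res_types = []
    coords = []
    for res_name, atoms in residues:
        res_types.append(RES_TYPE_INDEX[res_name])
        slots = [NAN3] * MAX_ATOMS
        for atom_name, xyz in atoms.items():
            slots[ATOM_INDEX[res_name][atom_name]] = tuple(xyz)
        coords.append(slots)
    return res_types, coords


residues = [
    ("GLY", {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0)}),
    ("ALA", {"CA": (3.0, 1.0, 0.0), "CB": (3.5, 2.4, 0.0)}),
]
res_types, coords = to_canonical_form(residues)
```

The data source only has to supply names and coordinates. Everything chemical (which atoms a type should have, how they bond) is left to the later resolution step.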
tmol can evaluate the energy of a PoseStack using a ScoreFunction that is composed of one or more EnergyTerms. EnergyTerms pull any necessary parameters from the database, populate the PackedBlockTypes object with preprocessed data, and then render a torch Module that can be used to do the actual scoring.
Warning
Not really sure what we should say about the database here, but it seems like it should be here
Before scoring a PoseStack, some precomputation must happen to ensure that EnergyTerms have the data they need for every block type in the PoseStack.
The first step is for every EnergyTerm to precompute the data it needs for each RefinedResidueType and store it in the RefinedResidueType objects.
The second step is for every EnergyTerm to take that precomputed data and serialize it into compact tensors, which are then stored in the PackedBlockTypes object.
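The two steps can be sketched with toy classes. Every name here (BlockType, PackedTypes, HeavyAtomTerm, and the two setup methods) is a hypothetical stand-in rather than tmol's actual API, and nested Python lists stand in for tensors.

```python
class BlockType:
    """Toy stand-in for a RefinedResidueType: a name plus its atom names."""

    def __init__(self, name, atom_names):
        self.name = name
        self.atom_names = atom_names
        self.cache = {}  # per-term precomputed data is stored on the type


class PackedTypes:
    """Toy stand-in for PackedBlockTypes: the active types plus packed data."""

    def __init__(self, block_types):
        self.block_types = block_types
        self.packed = {}


class HeavyAtomTerm:
    """Hypothetical energy term that needs a per-atom 'is heavy atom' flag."""

    key = "is_heavy"

    def setup_block_type(self, block_type):
        # Step 1: compute once per type and store the result on the type.
        if self.key not in block_type.cache:
            block_type.cache[self.key] = [
                0 if name.startswith("H") else 1 for name in block_type.atom_names
            ]

    def setup_packed_block_types(self, packed):
        # Step 2: gather the per-type data into one padded, rectangular table.
        if self.key in packed.packed:
            return
        max_atoms = max(len(bt.atom_names) for bt in packed.block_types)
        table = []
        for bt in packed.block_types:
            self.setup_block_type(bt)
            row = bt.cache[self.key]
            table.append(row + [-1] * (max_atoms - len(row)))  # -1 marks padding
        packed.packed[self.key] = table


gly = BlockType("GLY", ["N", "CA", "C", "O", "H"])
ala = BlockType("ALA", ["N", "CA", "C", "O", "CB", "H", "HA"])
packed = PackedTypes([gly, ala])
HeavyAtomTerm().setup_packed_block_types(packed)
```

Both steps check for existing data before doing any work, so in this sketch repeated setup calls for the same types cost almost nothing.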
In order for torch to actually use our EnergyTerms, we have to create a torch Module. The EnergyTerms use the function render_whole_pose_scoring_module to instantiate a module that is configured for running with the preprocessed data.
The ScoringModule itself defines a forward function that does the actual computation on the atom coordinates. This computation can be written either in pure torch Python code or in C++.
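The split between rendering and scoring can be illustrated with a torch-free sketch. The class names, the harmonic distance restraint, and the n_atoms argument are assumptions chosen for illustration. Only the method name render_whole_pose_scoring_module comes from the text above, and its real signature is not shown here.

```python
import math


class DistanceRestraintModule:
    """Torch-free stand-in for a rendered scoring module.

    It is configured once with preprocessed parameters, then called
    repeatedly on coordinates, mirroring a Module's forward function.
    """

    def __init__(self, pairs, target, weight):
        self.pairs = pairs      # atom index pairs, fixed at render time
        self.target = target    # ideal distance
        self.weight = weight

    def forward(self, coords):
        # Harmonic penalty summed over the configured atom pairs.
        total = 0.0
        for i, j in self.pairs:
            d = math.dist(coords[i], coords[j])
            total += self.weight * (d - self.target) ** 2
        return total

    __call__ = forward


class DistanceRestraintTerm:
    """Hypothetical energy term that renders the module above."""

    def __init__(self, target=1.5, weight=1.0):
        self.target = target
        self.weight = weight

    def render_whole_pose_scoring_module(self, n_atoms):
        # Preprocessing happens here, not in forward: pick consecutive pairs.
        pairs = [(i, i + 1) for i in range(n_atoms - 1)]
        return DistanceRestraintModule(pairs, self.target, self.weight)


coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.5, 0.0)]
module = DistanceRestraintTerm().render_whole_pose_scoring_module(len(coords))
score = module(coords)
```

Everything that does not depend on the coordinates is fixed at render time, so forward only does arithmetic on the atom positions.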
Warning
Some sort of description of how the ScoringModules set up parameters (_p())
tmol is primarily written in Python, with C++/CUDA being used to write optimized low-level code for specific operations (most EnergyTerms, for example). C++ functions are exported to Python by pybind.
When C++/CUDA is used, both a CPU and a CUDA version are compiled. This compilation is done Just-In-Time (JIT) by Ninja when the code is used. tmol makes use of a 'diamond' structure to share the implementation code between the C++ and CUDA builds. Note that this means implementation code may only use functions that are available in both C++17 and CUDA (critically, things like std::cout are missing).
Warning
There is currently a bug in the CUDA compilation where the JIT compiling may fail to recognize updates to the code. If you notice a difference between the behavior of your C++ and CUDA implementations, you may need to delete the local cached object files to force a recompile.
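A small helper along these lines can make the cleanup less error-prone. It is a sketch under stated assumptions: PyTorch's extension loader honors the TORCH_EXTENSIONS_DIR environment variable, but whether tmol's JIT build writes to that location, and what the cached directories are called, is not confirmed by this document. The default path below is typical of recent Linux installs only. Run it with dry_run=True first and check what it would delete.

```python
import os
import shutil
from pathlib import Path


def default_cache_root():
    # TORCH_EXTENSIONS_DIR is honored by torch.utils.cpp_extension; whether
    # tmol's JIT build uses the same location is an assumption of this sketch.
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    return Path(override) if override else Path.home() / ".cache" / "torch_extensions"


def clear_build_cache(root, name_fragment, dry_run=True):
    """Remove cached build directories under `root` whose name contains `name_fragment`.

    Returns the matching directories. With dry_run=True nothing is deleted.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    matches = sorted(
        p for p in root.rglob("*") if p.is_dir() and name_fragment in p.name
    )
    if not dry_run:
        for path in matches:
            # A nested match may already be gone once its parent is removed.
            shutil.rmtree(path, ignore_errors=True)
    return matches
```

For example, clear_build_cache(default_cache_root(), "tmol") lists the candidate directories without removing anything. The "tmol" fragment is a guess at the directory naming.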
- The ScoreFunction - A function that will evaluate a PoseStack with one or more EnergyTerms.
- The Minimizer - A gradient-descent algorithm that modifies degrees-of-freedom of a PoseStack to minimize the value of a ScoreFunction.
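How these two components interact can be shown with a toy example. This is a sketch, not tmol's ScoreFunction or Minimizer: the weighted-sum form, the finite-difference gradient, and the fixed step size are assumptions made to keep the example self-contained in plain Python.

```python
class WeightedScoreFunction:
    """Toy score function: a weighted sum of term callables over the DOFs."""

    def __init__(self, terms):
        self.terms = terms  # list of (weight, callable) pairs

    def __call__(self, dofs):
        return sum(weight * term(dofs) for weight, term in self.terms)


def numerical_gradient(fn, dofs, eps=1e-6):
    # Central finite differences; in a torch setting autograd would
    # normally supply the gradient instead.
    grad = []
    for i in range(len(dofs)):
        up = dofs[:i] + [dofs[i] + eps] + dofs[i + 1:]
        down = dofs[:i] + [dofs[i] - eps] + dofs[i + 1:]
        grad.append((fn(up) - fn(down)) / (2 * eps))
    return grad


def minimize(fn, dofs, step=0.1, n_steps=200):
    # Plain fixed-step gradient descent on the degrees of freedom.
    dofs = list(dofs)
    for _ in range(n_steps):
        grad = numerical_gradient(fn, dofs)
        dofs = [x - step * g for x, g in zip(dofs, grad)]
    return dofs


score = WeightedScoreFunction([
    (1.0, lambda d: (d[0] - 2.0) ** 2),
    (0.5, lambda d: (d[1] + 1.0) ** 2),
])
start = [0.0, 0.0]
final = minimize(score, start)
```

The minimizer only needs to evaluate the score function and its gradient. It never needs to know which terms the function is built from.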