
Architecture

Jeff Flatten edited this page Jun 20, 2024 · 40 revisions

Overview

The central focus of tmol is the PoseStack - a batch of structures. At its heart, tmol is a library for creating, scoring, manipulating, and exporting PoseStacks.

Creating PoseStacks

Under the hood, all PoseStack creation goes through a common function: tmol.pose_stack_from_canonical_form. Other PoseStack creation functions, such as loading from a PDB file or importing from RosettaFold2 or OpenFold, work by first converting the source data into a common representation - the CanonicalForm.

CanonicalForm

Because tmol represents structures with a higher chemical granularity than most other ML packages, it has to resolve the chemical structure of the molecules coming into it from other sources. Even the process of reading in a PDB file requires this chemical type resolution.

The CanonicalForm is a structure batch format that lets us represent data in tmol while deferring the chemical resolution. This makes loading the source data into tmol easier, and lets us make the chemical type resolution step use the same machinery, regardless of data source.
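As a rough illustration of what deferring the chemical resolution can look like, the sketch below stores a batch using only residue-type indices and fixed per-residue atom slots, with absent atoms marked as NaN. The class name ToyCanonicalForm, its fields, and the small name tables are hypothetical and are not tmol's actual data layout; plain lists stand in for tensors.

```python
import math
from dataclasses import dataclass

# Hypothetical lookup tables; tmol's real tables come from its database.
RES_TYPES = ["ALA", "GLY"]
ATOM_SLOTS = ["N", "CA", "C", "O", "CB"]
NAN3 = (math.nan, math.nan, math.nan)


@dataclass
class ToyCanonicalForm:
    # res_types[pose][residue] -> index into RES_TYPES, or -1 for padding
    res_types: list
    # coords[pose][residue][slot] -> (x, y, z), NaN when the atom is absent
    coords: list


def to_canonical(poses):
    """Convert poses given as [(res_name, {atom_name: xyz}), ...] into a padded batch."""
    max_res = max(len(pose) for pose in poses)
    res_types, coords = [], []
    for pose in poses:
        types = [RES_TYPES.index(name) for name, _ in pose]
        xyz = [[atoms.get(slot, NAN3) for slot in ATOM_SLOTS] for _, atoms in pose]
        pad = max_res - len(pose)
        res_types.append(types + [-1] * pad)
        coords.append(xyz + [[NAN3] * len(ATOM_SLOTS)] * pad)
    return ToyCanonicalForm(res_types, coords)


poses = [
    [("ALA", {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0), "CB": (2.0, 1.4, 0.0)}),
     ("GLY", {"N": (3.0, 0.0, 0.0), "CA": (4.5, 0.0, 0.0)})],
    [("GLY", {"CA": (0.0, 0.0, 0.0)})],
]
cf = to_canonical(poses)
```

Nothing in this toy form says which atoms a residue should have or how they are bonded; that information is only supplied later, during chemical type resolution.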

Chemical Type Resolution

Once a batch of structures has been converted into a CanonicalForm, it can be combined with several other data-source specific objects to resolve the chemical structure and create a PoseStack:

  • The PackedBlockTypes object, which contains chemical information for the set of chemical types used by the data source.
  • The CanonicalOrdering object, which maps each chemical type to an integer and, within each type, maps each atom name to a unique integer.

Variants of these two objects for each data source are stored in the database, and more can be added in order to support PoseStack creation from new data sources.
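The two mappings that a CanonicalOrdering provides can be pictured as follows. This is a minimal sketch with a hypothetical ToyOrdering class and made-up residue tables; the real object is built from the tmol database and may carry more information.

```python
class ToyOrdering:
    """Hypothetical stand-in: chemical type -> int, and per-type atom name -> int."""

    def __init__(self, atoms_by_type):
        # Sort the type names so the integer assignment is reproducible.
        self.type_index = {name: i for i, name in enumerate(sorted(atoms_by_type))}
        # Atom indices are unique within each type, in the order given.
        self.atom_index = {
            name: {atom: j for j, atom in enumerate(atoms)}
            for name, atoms in atoms_by_type.items()
        }

    def encode(self, type_name, atom_name):
        """Return the (type, atom) integer pair for a named atom."""
        return self.type_index[type_name], self.atom_index[type_name][atom_name]


ordering = ToyOrdering({
    "GLY": ["N", "CA", "C", "O"],
    "ALA": ["N", "CA", "C", "O", "CB"],
})
print(ordering.encode("ALA", "CB"))  # (0, 4)
print(ordering.encode("GLY", "CA"))  # (1, 1)
```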

Scoring PoseStacks

tmol can evaluate the energy of a PoseStack using a ScoreFunction that is composed of one or more EnergyTerms. EnergyTerms pull any necessary parameters from the database, populate the PackedBlockTypes object with preprocessed data, and then render a torch Module that can be used to do the actual scoring.

Database

Warning

Not really sure what we should say about the database here, but it seems like it should be here

Precomputation

Before scoring a PoseStack, some precomputation must happen to ensure that the EnergyTerms have the data they need for every block type in the PoseStack.

The first step is to have every EnergyTerm precompute data that it needs for each RefinedResidueType, which it will then store in the RefinedResidueType objects.

The second step is for every EnergyTerm to take that precomputed data and serialize it into compact tensors that are then stored in the PackedBlockTypes object.
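The two steps can be sketched as below. All class and method names here (ToyResidueType, ToyPackedBlockTypes, ToyChargeTerm and its two setup methods) are illustrative stand-ins rather than tmol's API, and a zero-padded nested list takes the place of a tensor.

```python
class ToyResidueType:
    def __init__(self, name, atom_charges):
        self.name = name
        self.atom_charges = atom_charges  # {atom_name: partial charge}
        self.cache = {}                   # per-term precomputed data lands here


class ToyPackedBlockTypes:
    def __init__(self, residue_types):
        self.residue_types = residue_types
        self.max_atoms = max(len(rt.atom_charges) for rt in residue_types)
        self.tables = {}                  # per-term packed tables land here


class ToyChargeTerm:
    key = "toy_charge"

    def setup_block_type(self, rt):
        # Step 1: compute once per residue type and store the result on the type.
        if self.key not in rt.cache:
            rt.cache[self.key] = list(rt.atom_charges.values())

    def setup_packed_block_types(self, packed):
        # Step 2: gather the per-type data into one zero-padded table.
        if self.key in packed.tables:
            return
        table = []
        for rt in packed.residue_types:
            self.setup_block_type(rt)
            row = rt.cache[self.key]
            table.append(row + [0.0] * (packed.max_atoms - len(row)))
        packed.tables[self.key] = table


gly = ToyResidueType("GLY", {"N": -0.5, "CA": 0.1, "C": 0.6, "O": -0.6})
ala = ToyResidueType("ALA", {"N": -0.5, "CA": 0.1, "C": 0.6, "O": -0.6, "CB": -0.2})
packed = ToyPackedBlockTypes([gly, ala])
ToyChargeTerm().setup_packed_block_types(packed)
```

Caching under a per-term key means each step runs at most once per object, so repeated scoring of PoseStacks that share block types does not repeat the work.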

Rendering a ScoringModule

In order for torch to actually use our EnergyTerms, we have to create a torch Module. The EnergyTerms use the function render_whole_pose_scoring_module to instantiate a module that is configured for running with the preprocessed data.

The ScoringModule itself defines a forward function that performs the computation on the atom coordinates. This computation can be written either in pure torch Python code or in C++.
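The render pattern can be sketched roughly as follows: the term holds the preprocessed parameters, and rendering hands them to an object whose forward function needs only coordinates. To stay dependency-free, this sketch uses a plain Python class in place of a torch Module; ToyPairTerm, ToyScoringModule, render_module, and the harmonic distance penalty are all hypothetical and do not reflect the signature of render_whole_pose_scoring_module.

```python
import math


class ToyScoringModule:
    """Plain-Python stand-in for a torch Module; parameters are fixed at render time."""

    def __init__(self, ideal_dist, weight):
        self.ideal_dist = ideal_dist
        self.weight = weight

    def forward(self, coords):
        # coords[pose][atom] -> (x, y, z); returns one score per pose.
        scores = []
        for pose in coords:
            total = 0.0
            for i in range(len(pose)):
                for j in range(i + 1, len(pose)):
                    d = math.dist(pose[i], pose[j])
                    total += self.weight * (d - self.ideal_dist) ** 2
            scores.append(total)
        return scores

    # Calling the module runs forward, loosely mirroring how torch Modules are used.
    __call__ = forward


class ToyPairTerm:
    def __init__(self, ideal_dist=1.5, weight=1.0):
        self.ideal_dist = ideal_dist
        self.weight = weight

    def render_module(self):
        # Hand the preprocessed parameters to a module that only needs coordinates.
        return ToyScoringModule(self.ideal_dist, self.weight)


module = ToyPairTerm(ideal_dist=1.5, weight=2.0).render_module()
scores = module([
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],  # pair at the ideal distance
    [(0.0, 0.0, 0.0), (2.5, 0.0, 0.0)],  # pair 1.0 too far apart
])
```

Separating setup from evaluation this way keeps the per-call work limited to the coordinate-dependent part.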

Warning

Some sort of description of how the ScoringModules set up parameters (_p())

Extending the Database

Python, C++, and CUDA

tmol is primarily written in Python, with C++/CUDA used for optimized low-level code for specific operations (most EnergyTerms, for example). C++ functions are exported to Python by pybind.

When C++/CUDA is used, both a CPU and a CUDA version are compiled. The compilation is done Just-In-Time (JIT) by Ninja the first time the code is used. tmol uses a 'diamond' structure to share the implementation code between the C++ and CUDA builds. Note that this means implementation code may only use functions that are available in both C++17 and CUDA (critically, things like std::cout are missing).

Warning

There is currently a bug in the CUDA compilation where the JIT compiler may fail to recognize updates to the code. If you notice a difference between the behavior of your C++ and CUDA implementations, you may need to delete the locally cached object files to force a recompile.
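A small helper for clearing a build cache might look like the sketch below. It rests on an assumption this page does not confirm: that tmol's JIT builds land in PyTorch's C++ extension cache, which can be redirected with the TORCH_EXTENSIONS_DIR environment variable and on Linux typically defaults to ~/.cache/torch_extensions. Both function names are hypothetical. Check where your object files actually live before deleting anything.

```python
import os
import shutil
from pathlib import Path


def extension_cache_dir():
    """Best-guess cache location (an assumption); TORCH_EXTENSIONS_DIR overrides it."""
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "torch_extensions"


def clear_cache(cache_dir, dry_run=True):
    """Return the entry names under cache_dir and, unless dry_run, delete the directory."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    entries = sorted(p.name for p in cache_dir.iterdir())
    if not dry_run:
        shutil.rmtree(cache_dir)
    return entries
```

Calling clear_cache(extension_cache_dir()) only lists what would be removed; pass dry_run=False to delete the directory so that the next import triggers a fresh compile.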

TODO

  • The ScoreFunction - A function that will evaluate a PoseStack with one or more EnergyTerms.
  • The Minimizer - A gradient-descent algorithm that modifies degrees-of-freedom of a PoseStack to minimize the value of a ScoreFunction.