OpenFold 3 Technical Deep Dive
- 1 A Brief Overview of OpenFold 3
- 2 OpenFold 3 Local Setup
- 3 OpenFold 3 Technical Deep Dive (you are here)
- 4 Contributing to OpenFold 3: A Primer
A Technical Deep Dive
Before you can contribute, it's necessary to build a strong understanding of the overall codebase. Here I will detail the shape of the codebase, the tools, libraries, and utilities it uses, and some other details that I think will be helpful to understand. In particular I'll cover the approach to testing and CI throughout the codebase. I'll wrap up with a basic overview of inference vs training and detail where and when each takes place. The inference section will be more detailed, since most of us (myself included) will never really spend much time (read: no time) dealing with the training runs or infra.
The next post will be more of a tactical primer on contributing. I'll cover codebase/commit/PR etiquette, and the high-level "lanes" or areas available to budding contributors to focus on. Stay tuned!
Language and libraries
Python and PyTorch
The main language, unsurprisingly, is Python. A realistic range right now seems to be 3.10 - 3.13. There are a few reasons why 3.14 is not your best bet right now. First, the project uses snakemake for data analysis workflows, and apparently the pinned version does not support 3.14 yet (it may in a newer version; I'm unfamiliar with snakemake). More importantly, the project depends (at least historically) on various features of TorchScript, an older PyTorch utility that has been essentially replaced by torch.compile. You can read more in the [pytorch torch.compiler docs].
The quick version is that PyTorch does some fancy stuff to compile Python code in a few different ways. TorchScript was an older way to do this, but PyTorch now offers a full JIT compiler. TorchScript is a frozen feature that is essentially no longer supported, with the PyTorch team actively steering users away from it and towards the newer torch.compile. For this reason you'll see warnings during inference if you use 3.14, since anything using TorchScript seems to not actually compile and instead runs as normal Python. This means you're also likely leaving some performance on the table.
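If you want to fail fast on an unsupported interpreter, a guard like the one below is one way to do it. This is a minimal sketch: the 3.10 - 3.13 window is just the range I described above, not something read from the project's packaging metadata, and `is_supported_python` is a hypothetical helper that does not exist in the codebase.

```python
import sys

# Assumption: the window below mirrors the 3.10 - 3.13 range discussed in
# this post. It is not read from the project's packaging metadata.
MIN_SUPPORTED = (3, 10)
MAX_SUPPORTED = (3, 13)


def is_supported_python(version=None):
    """Return True if (major, minor) falls inside the assumed window."""
    if version is None:
        version = sys.version_info
    major_minor = (version[0], version[1])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if not is_supported_python():
    print(
        f"Python {sys.version_info[0]}.{sys.version_info[1]} is outside the "
        "3.10 - 3.13 window; expect TorchScript-related warnings."
    )
```

Comparing tuples keeps the check readable and avoids string parsing of the version number.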
As you explore the codebase, you'll come across uses of jit/TorchScript such as:
# openfold3/core/model/primitives/attention.py:104
@torch.jit.ignore
def _deepspeed_evo_attn(...):
    ...

# attention.py:123 (the scripting itself is currently disabled)
# @torch.jit.script
def _attention(query, key, value, biases, ...):
    ...
PyTorch Lightning
PyTorch Lightning from lightning.ai is used as the training and inference harness (trainer, prediction loop, callbacks, etc.). I'll cover this in more detail later, but it's a good idea to browse the docs briefly to get a general idea of what's going on.
Biology Ecosystem
Biotite is OpenFold 3's core structural-biology toolkit. Biotite's AtomArray is the in-memory representation of molecular structures throughout the data and output pipeline. Main uses:
- The structure data model (biotite.structure, 47 uses). AtomArray / Atom / BondList are the canonical objects passed around in core/data/primitives/structure/ (query.py, tokenization.py, metadata.py, unresolved.py, template.py) for tokenization, bond handling, and building model inputs. Also used in metrics (rasa.py for solvent accessibility) and tensor conversion (tensor_utils.py).
- File I/O (biotite.structure.io.pdbx, .io). Reading/writing CIF / BinaryCIF / PDB: CIFFile, CIFBlock, CIFCategory, BinaryCIFFile. Used to parse input structures and templates, and the output writer (core/runners/writer.py) serializes predictions to CIF/PDB.
- The CCD - Chemical Component Dictionary (biotite.setup_ccd). setup_openfold.py and dev scripts call setup_ccd to install/build the local components.bcif, the reference ligand/residue chemistry the pipeline looks up.
- Chemistry helpers. biotite.structure.info for bond/link types (patches.py, metadata.py), biotite.interface.rdkit (from_mol/to_mol) to bridge to RDKit for ligand handling, and biotite.database.rcsb to fetch structures from the PDB.
- Tests. Heavily used to construct synthetic AtomArrays and assert on structures across the test suite.
In short: biotite is the structural representation + CIF/PDB I/O + CCD chemistry layer; RDKit and the model tensors sit on either side of it.
The docs stack
Generator: Sphinx. The standard Python documentation tool. It reads source files under docs/source/, is configured by docs/source/conf.py, and is built via docs/Makefile (e.g. make html). Output is HTML.
Source format: Markdown via myst-parser. Rather than Sphinx's default reStructuredText, this project writes docs in Markdown using the MyST parser. Three MyST extensions are enabled in conf.py:
- colon_fence - :::-style fenced blocks for admonitions/directives.
- dollarmath - inline/display math with $...$ and $$...$$.
- amsmath - LaTeX environments for multi-line equations.
Theme: furo. A clean, modern, responsive HTML theme (html_theme = "furo"), also used by many well-known Python projects.
Diagrams: sphinxcontrib-mermaid. Lets you embed Mermaid diagrams (flowcharts, etc.) directly in the docs, rendered at build time.
Hosting: Read the Docs. Published at openfold-3.readthedocs.io; RTD rebuilds the Sphinx site on changes.
Dependencies. Declared in two places: the optional docs extra in pyproject.toml (sphinx, myst-parser, furo) and the pixi env (pixi.toml), which also pins sphinx, myst-parser, furo (plus the mermaid extension that conf.py imports). docs/environment.yml exists for a conda-based build too.
Notably absent: no autodoc/napoleon API-from-docstrings extensions are enabled, so the docs are hand-written prose/Markdown, not auto-generated from the code.
Various
- ml_collections is used for config
- click is used for the CLI
- setuptools_scm is used for versioning
- cibuildwheel for shipping wheels with compiled kernels
The shape of the codebase
At the top level there are three directories worth knowing, and they map cleanly onto three different jobs.
openfold3/core is the actual library: the reusable, model-agnostic machinery. Inside it you'll find the pieces you'd expect from a model of this size:
- config - configuration plumbing and the linear-init defaults
- data - the entire data pipeline (this turns out to be a much bigger world than the rest, more on that later)
- kernels - the GPU kernel wrappers (cuEquivariance and Triton)
- loss - the training losses (diffusion, distogram, confidence)
- metrics - quality and confidence scoring, and sample ranking
- model - the network itself, split into embedders, trunk, layers, primitives, structure, and heads
- runners - the Lightning glue (the ModelRunner base class and the output writer)
- utils - the grab bag (chunking, checkpointing, atomization, EMA, schedulers, and so on)
openfold3/entry_points is the command-and-control layer: the ExperimentRunner classes that stand up a PyTorch Lightning trainer for either training or inference, plus input validation and parameter download.
openfold3/projects is where an abstract pile of core components becomes a specific, runnable model. Right now there is one project, of3_all_atom, and it bundles a concrete model.py, a runner.py (OpenFold3AllAtom, which subclasses the core ModelRunner), and a config/ directory holding the real model config and the preset YAML. The entry point, project_entry.py, is the thing that hands you a fully-composed config via get_model_config_with_presets.
Why the indirection? core does not know anything about "OpenFold 3" specifically. It knows about embedders and trunks and diffusion modules in the abstract. A project is what pins down which of those you use, at what sizes, and with which config, so that the same training and inference machinery can in principle host more than one model. If you are coming to contribute, the practical takeaway is this: read projects/of3_all_atom first to see how the model is actually wired together, then drop into core to read the one piece you care about.
The model, end to end
If you trace a single prediction through the network, it moves through five stages, and the directories under core/model follow them almost one-to-one.
- Embedding the inputs (model/feature_embedders). The raw features (sequence, MSA, templates) get turned into the model's working representations. input_embedders.py holds the all-atom input embedder and the MSA module embedder; template_embedders.py handles structural templates.
- The trunk (model/latent). This is the heart of the AlphaFold-style architecture, where the model iterates on two representations at once: a per-token "single" representation and a pairwise "pair" representation. msa_module.py mixes information out of the MSA, and pairformer.py is the Pairformer stack (the successor to AlphaFold 2's Evoformer, which also still lives here as evoformer.py). This is where the expensive triangle operations run.
- The primitives and layers (model/primitives, model/layers) are what the trunk is built from. Primitives are the small reusable pieces: attention, LayerNorm (normalization.py; a normalization layer that rescales activations to keep training stable, which OpenFold 3 leans on heavily), linear layers, and activations. Layers are the bigger named blocks straight out of the paper: triangular_attention.py, triangular_multiplicative_update.py, outer_product_mean.py, attention_pair_bias.py, and the transitions.
- Structure generation (model/structure). diffusion_module.py is the diffusion model that actually produces 3D coordinates, denoising from noise into a structure conditioned on the trunk's representations. This is the big architectural shift from AlphaFold 2, and it is why "diffusion samples" showed up as a flag back in the setup post.
- The heads (model/heads). prediction_heads.py and head_modules.py produce the confidence outputs: the pLDDT, pTM, and PAE scores, plus the distogram.
So the whole flow is: features go in, the embedders lift them into single and pair representations, the trunk refines those, the diffusion module turns them into coordinates, and the heads score the result. Every structure I rendered in this series came out the far end of exactly this pipeline.
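To make that flow concrete, here is a toy sketch that tracks only tensor shapes through the stages. Everything in it is a stand-in: the function names are hypothetical, and the channel widths and sample count are illustrative assumptions, not values read from the OpenFold 3 config.

```python
# Illustrative assumptions for this sketch only; not read from the real config.
C_SINGLE = 384   # assumed width of the per-token "single" representation
C_PAIR = 128     # assumed width of the pairwise "pair" representation
N_SAMPLES = 5    # assumed number of diffusion samples

# Confidence outputs named in the text above.
HEAD_OUTPUTS = ("plddt", "ptm", "pae", "distogram")


def embed_inputs(n_tokens):
    """Embedders: lift raw features into single and pair representations."""
    return {
        "single": (n_tokens, C_SINGLE),
        "pair": (n_tokens, n_tokens, C_PAIR),
    }


def run_trunk(reps):
    """Trunk: refines both representations without changing their shapes."""
    return dict(reps)


def sample_structures(reps, n_atoms, n_samples=N_SAMPLES):
    """Diffusion module: one set of 3D coordinates per sample.

    A real module would condition on `reps`; here it is only passed through
    to show the dependency.
    """
    return (n_samples, n_atoms, 3)


def predict_shapes(n_tokens, n_atoms):
    """Run the toy stages in order and report the shapes at each step."""
    reps = run_trunk(embed_inputs(n_tokens))
    coords = sample_structures(reps, n_atoms)
    return {
        "single": reps["single"],
        "pair": reps["pair"],
        "coords": coords,
        "heads": HEAD_OUTPUTS,
    }


shapes = predict_shapes(n_tokens=100, n_atoms=800)
print(shapes["pair"], shapes["coords"])
```

The point to notice is the pair representation: it is the only shape with two token axes, which is what drives the memory story in the next section.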
Kernels and performance
A model this size lives or dies on a handful of hand-optimized GPU kernels, and OpenFold 3 leans on two families. core/kernels/cueq_utils.py wraps NVIDIA's cuEquivariance kernels for the triangle operations, and core/kernels/triton/ holds a set of Triton kernels (a fused softmax, a SwiGLU, and an Evoformer kernel). On top of that, attention can route through DeepSpeed's EvoformerAttention, which is the @torch.jit.ignore-wrapped path you saw in primitives/attention.py earlier.
The reason all of this exists is the same reason the setup post hit a memory wall: the pair and triangle tensors scale with the square of the sequence length, so both compute and memory blow up fast. The model fights back with a couple of levers you can see directly in model_setting_presets.yml: a chunk_size that splits the big operations into smaller pieces, and an offload_inference flag that pushes activations off to the CPU instead of holding them on the GPU. The predict preset turns these on modestly, and the low_mem preset turns them up.
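A quick back-of-the-envelope calculation shows how fast the square-of-length scaling bites. This is a sketch under stated assumptions: the 128-channel pair width is an illustrative value I picked, not one read from the config, and both helpers are hypothetical. It counts a single pair tensor only, while a real run holds many intermediates of this shape.

```python
import math


def pair_tensor_bytes(n_tokens, channels=128, bytes_per_element=4):
    """Memory for one [n_tokens, n_tokens, channels] tensor.

    channels=128 is an assumed width; bytes_per_element=4 corresponds to fp32.
    """
    return n_tokens * n_tokens * channels * bytes_per_element


def num_chunks(n_tokens, chunk_size):
    """How many pieces one token axis splits into for a given chunk size."""
    return math.ceil(n_tokens / chunk_size)


for n in (256, 512, 1024, 2048):
    gib = pair_tensor_bytes(n) / 2**30
    print(f"{n:5d} tokens -> {gib:.3f} GiB per fp32 pair tensor")
```

Doubling the token count quadruples the footprint, which is why splitting the work into chunks and offloading to the CPU buy so much headroom on a small card.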
One more thing worth knowing if you ever go chasing speed: inference defaults to full fp32 (precision: "32-true" in the trainer args). On a memory-bound card that is a real lever left untouched, and it is one of the threads I want to pull on in a future, perf-focused post. For now the mental model is simple. It is correct first, and not yet squeezed.
Configuration system
Configuration is ml_collections-based, and the thing that makes it tractable is presets. projects/of3_all_atom/config/model_setting_presets.yml defines a few named settings blocks, train, predict, and low_mem, each one toggling things like chunk sizes and offloading. You compose them: inference stacks predict, and on a tight GPU you stack low_mem on top, which is exactly the model_update.presets: [predict, low_mem] line from the setup post. project_entry.get_model_config_with_presets is what resolves all of that into one concrete config.
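The stacking behaviour is easiest to picture as a series of nested dictionary updates, applied in order. The sketch below uses plain dicts, not ml_collections, and is not the project's implementation: `deep_update` and `compose_presets` are hypothetical helpers, and the preset values (and the `settings` and `model` keys) are invented for illustration. Only the `chunk_size` and `offload_inference` names come from the preset file discussed above.

```python
import copy

# Hypothetical preset contents: the values are invented for illustration.
PRESETS = {
    "predict": {"settings": {"chunk_size": 256}},
    "low_mem": {"settings": {"chunk_size": 64, "offload_inference": True}},
}


def deep_update(base, override):
    """Return a copy of `base` with `override` merged in, recursing into dicts."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_update(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def compose_presets(base_config, preset_names):
    """Apply presets left to right, so later presets win on conflicts."""
    config = copy.deepcopy(base_config)
    for name in preset_names:
        config = deep_update(config, PRESETS[name])
    return config


base = {
    "settings": {"chunk_size": None, "offload_inference": False},
    "model": {"name": "toy"},
}
config = compose_presets(base, ["predict", "low_mem"])
print(config["settings"])
```

Because presets apply left to right, the order in the list matters: low_mem placed after predict overrides the values they share and leaves everything else alone.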
Separately, the PyTorch Lightning trainer has its own small typed config in entry_points/validator.py (PlTrainerArgs), which is where things like precision (defaulting to 32-true) and the profiler live. If you want to flip the model into bf16 or attach a profiler, that is the surface to do it from, via a runner YAML and no code change.
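For a feel of what a small typed trainer config does, here is a stand-in built on a stdlib dataclass. It is a sketch only: `ToyTrainerArgs` and `apply_overrides` are hypothetical, and I am not claiming the real PlTrainerArgs is implemented this way. The precision strings are ones PyTorch Lightning accepts, and the override dict plays the role of the values a runner YAML would supply.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Precision strings accepted by PyTorch Lightning's Trainer.
VALID_PRECISIONS = {
    "64-true", "32-true", "16-true", "bf16-true", "16-mixed", "bf16-mixed",
}


@dataclass(frozen=True)
class ToyTrainerArgs:
    """Hypothetical stand-in for a typed trainer config."""

    precision: str = "32-true"       # default described in this post
    profiler: Optional[str] = None   # e.g. "simple" or "advanced"

    def __post_init__(self):
        # Reject typos early instead of failing deep inside the trainer.
        if self.precision not in VALID_PRECISIONS:
            raise ValueError(f"unknown precision: {self.precision!r}")


def apply_overrides(args, overrides):
    """Return a new config with the given fields replaced and re-validated."""
    return replace(args, **overrides)


defaults = ToyTrainerArgs()
tuned = apply_overrides(defaults, {"precision": "bf16-mixed", "profiler": "simple"})
print(defaults.precision, "->", tuned.precision)
```

The value of the typed layer is that a bad override fails at load time with a clear message, long before any GPU work starts.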
How inference runs
Here is the path a prediction actually takes, top to bottom:
- run_openfold predict (the click CLI in run_openfold.py) parses your flags and your query JSON.
- That hands off to an InferenceExperimentRunner in entry_points/experiment_runner.py. Its setup builds the model and the data, and its run does the one thing that matters: self.trainer.predict(...).
- The trainer is a plain PyTorch Lightning pl.Trainer, constructed from the PlTrainerArgs mentioned above.
- The LightningModule it drives is core/runners/model_runner.py (ModelRunner), subclassed by projects/of3_all_atom/runner.py (OpenFold3AllAtom). The interesting hook is predict_step, where a batch becomes a structure.
- Output is handled by a writer callback in core/runners/writer.py, which serializes each prediction to CIF (and PDB) using biotite.
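The control flow in the steps above can be sketched with plain Python stand-ins. None of these classes are the real ones: the Toy* names are hypothetical, the callback signature is simplified compared with Lightning's actual hook, and the "model" just counts residues. What the sketch does preserve is the hand-off order: the runner sets things up, the trainer loops over batches, predict_step produces an output, and the writer callback receives it.

```python
class ToyWriter:
    """Stand-in for the writer callback; records outputs, writes no files."""

    def __init__(self):
        self.written = []

    def on_predict_batch_end(self, outputs, batch):
        # Simplified signature; the real Lightning hook takes more arguments.
        self.written.append((batch["query_id"], outputs))


class ToyModelRunner:
    """Stand-in for the LightningModule."""

    def predict_step(self, batch):
        # A real predict_step would run the network and return a structure.
        return {"n_residues": len(batch["sequence"])}


class ToyTrainer:
    """Stand-in for pl.Trainer: drives predict_step and fires callbacks."""

    def __init__(self, callbacks):
        self.callbacks = callbacks

    def predict(self, model, batches):
        results = []
        for batch in batches:
            outputs = model.predict_step(batch)
            for callback in self.callbacks:
                callback.on_predict_batch_end(outputs, batch)
            results.append(outputs)
        return results


class ToyInferenceRunner:
    """Stand-in for the experiment runner: setup builds, run predicts."""

    def __init__(self, batches):
        self.batches = batches

    def setup(self):
        self.writer = ToyWriter()
        self.model = ToyModelRunner()
        self.trainer = ToyTrainer(callbacks=[self.writer])

    def run(self):
        return self.trainer.predict(self.model, self.batches)


runner = ToyInferenceRunner([{"query_id": "q1", "sequence": "MKT"}])
runner.setup()
results = runner.run()
print(results, runner.writer.written)
```

Keeping output in a callback means the model code never has to know about file formats, which is a large part of why the writer can live in core and be shared.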
The one branch worth calling out is the MSA step, because it is the single biggest lever on quality (as the setup post showed). The tooling lives in core/data/tools: colabfold_msa_server.py is the hosted path you get with --use_msa_server, while jackhmmer.py, hhblits.py, and hhsearch.py are there for the precomputed, run-it-yourself route. Turn the server off and provide nothing, and you fall back to a single-sequence dummy MSA, which is fast and offline and, as we saw, a lot less accurate.
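The fallback logic reads naturally as a three-way decision. This is a sketch of the behaviour as I described it, not the project's code: `choose_msa_source` and `dummy_msa` are hypothetical helpers, the returned labels are invented, and the precedence when both a server and precomputed alignments are supplied is my assumption.

```python
def choose_msa_source(use_msa_server, precomputed_msa_path=None):
    """Pick where the MSA comes from.

    Assumption: the server wins if both options are supplied. The real
    precedence may differ.
    """
    if use_msa_server:
        return "colabfold_server"
    if precomputed_msa_path is not None:
        return "precomputed"
    return "single_sequence"


def dummy_msa(query_sequence):
    """A single-sequence 'MSA': the query itself is the only row."""
    return [query_sequence]


print(choose_msa_source(use_msa_server=False))
print(dummy_msa("MKTAYIAK"))
```

The dummy MSA carries no evolutionary information at all, which is why that branch is fast and offline but noticeably less accurate.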
How training runs
I will keep this short, both because the docs cover it well in training.md and because, like most people running inference, it is not where I spend my time.
The training side mirrors the inference side: a TrainingExperimentRunner stands up the same kind of Lightning trainer, just pointed at the losses in core/loss (diffusion, distogram, and confidence) and the metrics in core/metrics (quality, confidence, sample ranking). The heavy lifting that makes training even possible is in core/data, which is a much bigger world than the inference path lets on: a full preprocessing and featurization pipeline plus a dataset cache system, because you cannot re-derive features for millions of structures on every step.
The part most worth knowing, and the part the overview post already gestured at, is that OpenFold reproduced the AlphaFold 3 training recipe, including the large MGnify-based distillation dataset. Training runs distributed through Lightning, and the conda-and-pixi parity that CI enforces (more on that next) matters a lot more here than it does for a one-off local inference.
Testing and CI
Testing is pytest, run in parallel with pytest-xdist, and the plugins it leans on tell you what the project cares about. pytest-regressions drives the numerical snapshot tests (the ones that compare arrays against committed baselines, and the ones that bit me on a non-reference GPU in the setup post). There is also pytest-benchmark for performance regressions and pytest-recording for replaying network interactions. A shared seeded_rng fixture keeps the randomness deterministic, which, as my flaky-test PR showed, is not optional for a model full of random initialization.
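Two ideas from that paragraph are worth seeing in miniature: seeding the generator so random initialization is repeatable, and comparing results against a stored baseline with a tolerance. The sketch below uses only the standard library and is not the project's fixture or the pytest-regressions API; `make_seeded_rng`, `random_init`, and `matches_baseline` are hypothetical helpers. In a real suite the first would be wrapped in a pytest fixture.

```python
import math
import random


def make_seeded_rng(seed=0):
    """Return a private generator so tests do not share global random state."""
    return random.Random(seed)


def random_init(rng, n):
    """Draw n Gaussian values, standing in for a weight initialization."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def matches_baseline(values, baseline, rel_tol=1e-6, abs_tol=1e-9):
    """Element-wise comparison against a stored snapshot, with tolerance."""
    if len(values) != len(baseline):
        return False
    return all(
        math.isclose(v, b, rel_tol=rel_tol, abs_tol=abs_tol)
        for v, b in zip(values, baseline)
    )


first = random_init(make_seeded_rng(42), 5)
second = random_init(make_seeded_rng(42), 5)
assert first == second  # same seed, same draws

# Tiny numerical drift (as between different GPUs) passes a tolerant check.
drifted = [v * (1 + 1e-9) for v in first]
assert matches_baseline(drifted, first)
```

The tolerance is the knob that decides whether a different GPU produces a failure or a pass, so it is worth checking before assuming a snapshot mismatch is a real regression.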
Linting and formatting are ruff, configured with an 88-character line length and the E/F/UP/B/SIM/I/TID rule sets. Two choices stand out: relative imports are banned outright (ban-relative-imports = "all"), so everything is imported by full path, and tests are exempt from the line-length rule.
CI is where you really see that this is a systems project and not just a model. There are dual test pipelines, one conda and one pixi, that build a Docker image and run the suite on cloud GPU runners, pushing images to GHCR. There is a heavier integration-test path for the slow, full-fat tests, a workflow that caches the model parameters from S3 so every run is not re-downloading gigabytes, a ruff gate, and a PyPI publish pipeline that uses cibuildwheel to ship the compiled kernels as wheels. If you contribute, the conda-and-pixi duplication is the thing to keep in mind: a change that works in one environment has to work in the other, because CI checks both.
Wrap
This ended up being a lot longer than I expected. Make no mistake: I myself understand only the very surface level of many of these details, as I'm still ramping up on the project. Don't be intimidated by the amount of information. Just use this as an overview, and focus on what's relevant to you as you work on contributions. Next up I'll dive into everything relevant to contributing and how to pick a lane to swim in.
Folding is fun
The structure for this post is the obvious one for a deep dive: ATP synthase subunit β, the catalytic heart of the machine that converts the mitochondrial proton gradient into ATP. The whole respiratory chain, the electron relay that cytochrome c and cytochrome c1 are part of in the other posts, exists to pump protons across the inner membrane. ATP synthase is what cashes that gradient back in. Subunit β is where the chemistry happens.
The large blue body is the nucleotide-binding fold, predicted with high confidence. The long orange strand peeling off it is the N-terminal mitochondrial targeting presequence, the tag that gets the protein imported and is then cleaved; OpenFold 3 correctly has no confidence in its position, because in the mature protein it does not exist. Superposed against AlphaFold's model of the same sequence (UniProt P06576), the confident core matches to 0.47 Å backbone RMSD over all 472 residues, which is about as close as two independent predictions of the same fold get. Same caveat as the rest of the series: the public AlphaFold database is AlphaFold 2, since there is no public bulk download of AlphaFold 3 structures, but for a conserved monomer like this it is a fair reference.
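For anyone who has not computed one by hand, RMSD is just the square root of the mean squared distance between matched atoms. The sketch below is a bare-bones version under one big assumption: the two coordinate sets are already superposed and listed in the same atom order. It does no alignment, and `rmsd` here is a hypothetical helper, not the tool I used for the number above.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists.

    Assumes both structures are already superposed and that atom i in
    coords_a corresponds to atom i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(reference, shifted))  # every atom moved by 1 unit, so RMSD is 1.0
```

Because the deviations are squared before averaging, a few badly placed atoms dominate the score, which is why a floppy presequence would wreck the number if it were counted on equal terms with the core.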