
Can You Write Down a Protein Model's Mind in English?


In the deep dive I ended on the sparse autoencoders, because they let you ask what ESMC has actually learned and get back something a biologist could read. Around the same time, Anthropic published a piece on natural language autoencoders that comes at the same question from a different angle. I wanted to see whether the idea carried over to a protein model. This post is that experiment. It is the one part of this series that is mine rather than a walkthrough of someone else’s release.

The idea I was borrowing

A sparse autoencoder (an unsupervised network trained to re-express a model’s dense internal activations as a sparse combination of many interpretable "features") tells you which interpretable features fire for a given input. Only a handful of the ~16,000 features fire for any given residue, and each one tends to correspond to a recognizable piece of biology. Anthropic’s natural language autoencoder asks a harder question. Instead of a list of features, can plain English be the bottleneck?

The setup has two halves. A verbalizer turns a model’s internal activation into a text description. A reconstructor turns that description back into the activation. You train them so the round trip is faithful. The clever part is the constraint: if you can rebuild the activation from the text alone, the text must have actually captured what the activation encodes. The English is not a loose annotation. It is a lossy compression you can measure.

Anthropic ran this on Claude to surface reasoning and hidden motivations. My question was narrower and, I think, unexplored. Protein language models already have SAEs and even auto-generated English descriptions of every feature. Nobody, as far as I can find, has treated that English as a reconstructable bottleneck and measured how much of the protein model’s representation survives the trip through a sentence.

What I could actually build

I want to be honest about scope up front. Anthropic trains a verbalizer and a reconstructor jointly with reinforcement learning, generating hundreds of tokens per activation. That is not a weekend project on one GPU. What I built is a lightweight instantiation of the same idea, using parts I already had running from earlier in this series.

  • Verbalizer. For a protein, I take ESMC-6B’s activation, read off its top interpretable SAE features at layer 60, and hand their descriptions to claude -p with instructions to write one plain paragraph about the protein. The text is derived only from the activation. The verbalizer never sees the sequence or the protein’s name.
  • Reconstructor. I freeze a small sentence encoder, embed the paragraph, and train a little MLP to map that text embedding back to ESMC’s own embedding for the protein. (An embedding is a protein language model’s internal vector representation of each residue: a list of numbers per amino acid that encodes what the model has learned about that position in context. These vectors are what downstream models, like a folding trunk, actually consume.) The training signal is cosine similarity to the real embedding.
  • Corpus. 1,000 Swiss-Prot proteins between 40 and 200 residues, split 800 for training and 200 for test.
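
The "read off its top interpretable SAE features" step can be sketched roughly as follows. This is a minimal illustration, not the code behind the results in this post. The function name, the `descriptions` dictionary format, and the `generic_ids` set are all hypothetical, and summing activations over residues is an assumption, since the post only says the activations are aggregated. The default of 12 features comes from the pipeline details.

```python
import numpy as np

def top_described_features(acts, descriptions, generic_ids=frozenset(), k=12):
    """Pick the k strongest SAE features that have a usable description.

    acts:         (n_residues, n_features) array of SAE activations for one protein
    descriptions: dict mapping feature index -> description text (hypothetical format)
    generic_ids:  feature indices whose descriptions count as generic (hypothetical)
    """
    # Assumption: aggregate over residues by summing; the post does not say how.
    scores = np.asarray(acts, dtype=float).sum(axis=0)
    # Stable sort on negated scores: strongest first, ties broken by lower index.
    order = np.argsort(-scores, kind="stable")
    picked = []
    for idx in order:
        idx = int(idx)
        if scores[idx] <= 0:
            break  # remaining features have no positive aggregate activation
        if idx in generic_ids or idx not in descriptions:
            continue  # skip features without a specific description
        picked.append((idx, descriptions[idx]))
        if len(picked) == k:
            break
    return picked
```

The returned descriptions are what would be handed to the verbalizer. Nothing here touches the sequence or the protein's name.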

Exact pipeline and hyperparameters

  • Backbone: ESMC-6B, mean-pooled final-layer hidden state (2560-d) as the reconstruction target.
  • SAE: ESMC-6B-sae-k64-codebook16384, layer 60. Per protein I aggregate feature activations over residues and keep the top 12 that have a non-generic description.
  • Verbalizer: claude -p, prompted with the feature categories and summaries only, asked for a 3 to 5 sentence paragraph, told not to mention features or invent specifics.
  • Text encoder: all-mpnet-base-v2 (768-d), mean-pooled, frozen.
  • Reconstructor: MLP 768 to 1024 to 1024 to 2560, GELU, trained 600 steps with a cosine loss against the L2-normalized ESMC embedding.
  • Metrics on the held-out 200: reconstruction cosine, and nearest-neighbor retrieval (does the reconstructed vector’s closest match in the test pool turn out to be the right protein). Chance retrieval is 1 in 200.
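
For concreteness, here is a hedged sketch of a reconstructor with the layer sizes listed above (768 to 1024 to 1024 to 2560, GELU) trained with a cosine loss in PyTorch. The class and function names are illustrative. The Adam optimizer, the learning rate, full-batch updates, and writing the loss as one minus cosine similarity are assumptions, because the post does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Reconstructor(nn.Module):
    """MLP mapping a text embedding to a predicted protein embedding."""

    def __init__(self, text_dim=768, hidden=1024, target_dim=2560):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, target_dim),
        )

    def forward(self, x):
        return self.net(x)

def cosine_loss(pred, target):
    # 1 - cosine similarity, averaged over the batch (assumed form of the loss).
    return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()

def train(model, text_emb, target_emb, steps=600, lr=1e-3):
    """Full-batch training; returns the loss computed at the last step."""
    target = F.normalize(target_emb, dim=-1)  # L2-normalize the targets
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer is an assumption
    loss = None
    for _ in range(steps):
        opt.zero_grad()
        loss = cosine_loss(model(text_emb), target)
        loss.backward()
        opt.step()
    return None if loss is None else loss.item()
```

With frozen text embeddings of shape (800, 768) and pooled ESMC embeddings of shape (800, 2560), `train(Reconstructor(), text_emb, esmc_emb)` would run the 600 steps described above.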

The English is real

Before any numbers, the thing that made me want to keep going. The verbalizations are good, and they come entirely from the activation. Two examples, picked because the protein is recognizable from its identifier.

A bacterial 50S ribosomal protein L30:

This is a core component of the cell’s machinery for reading and expressing genetic information, most likely a ribosomal protein, predominantly of the large-subunit (L-family) type, that is integral to the structured body of the ribosome rather than a cleavable targeting tag. It folds into a compact, basic alpha/beta domain whose Lysine- and Arginine-rich surfaces, flexible tails, and low-complexity loops form cationic patches that grip the negatively charged ribosomal RNA.

A RING-finger E3 ubiquitin ligase:

This protein is most readily recognized as a RING/U-box type E3 ubiquitin (or SUMO) ligase: a small Cys/His-rich, zinc-coordinating catalytic core flanked by a short pre-RING helix and an adjacent helical scaffold that positions the domain and engages a ubiquitin-conjugating (E2) enzyme. It is almost certainly membrane or organelle associated, carrying an N-terminal targeting region.

Both are correct, and neither ESMC nor the verbalizer was told what the protein was. Those are ESMC’s own learned features, read out in English.

What survives the trip through English

Now the measurement. I compared the English bottleneck against the thing it was built from, the raw sparse feature code, both trained to reconstruct the same ESMC embedding. I also checked a shuffled control and, as a reference for how compressible these embeddings are at all, plain PCA.

Bottleneck                        Reconstruction cosine   Retrieval top-1   Retrieval top-5
English paragraph (NLA)           0.79                    6.5%              18%
Raw SAE code (same 12 features)   0.81                    29%               54%
Shuffled control                  n/a                     ~0%               n/a
Chance                            n/a                     0.5%              2.5%

For reference, PCA of the embeddings themselves hits 0.90 cosine at 8 dimensions and 0.96 at 64, so ESMC’s pooled representation is very low rank to begin with.
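
A PCA reference of this kind can be computed in a few lines of NumPy. The sketch below is one way to do it, under assumptions the post does not confirm: the components are fit on the training embeddings, the data are mean-centered, and the reported number is the mean cosine between each held-out embedding and its rank-k reconstruction. The function name is illustrative.

```python
import numpy as np

def pca_reconstruction_cosine(train, test, n_components):
    """Mean cosine between held-out vectors and their rank-k PCA reconstruction."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal directions of the centered training data.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    basis = vt[:n_components]                        # (k, d)
    # Project onto the top-k directions, then map back to the original space.
    recon = (test - mean) @ basis.T @ basis + mean   # (n_test, d)
    num = (recon * test).sum(axis=1)
    den = np.linalg.norm(recon, axis=1) * np.linalg.norm(test, axis=1)
    return float((num / den).mean())
```

Calling it with `n_components=8` and `n_components=64` on the pooled embeddings would give figures comparable to the two quoted above, if the assumptions match how they were computed.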

The telling result is the gap between the two real rows. On reconstruction cosine they are almost tied, 0.79 against 0.81, but I would not lean on that. Cosine is a forgiving bar here: 8 dimensions of plain PCA already clear both at 0.90, so a near-tie mostly confirms how low-rank these embeddings are. Retrieval is the real test, and there the English collapses: 6.5% top-1 against the code’s 29%, 18% top-5 against 54%.
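
The retrieval metric is simple enough to write down. Below is a minimal sketch, assuming that "closest match" means highest cosine similarity (the post does not name the distance) and that row i of the reconstructed matrix corresponds to row i of the true matrix. The function name is illustrative.

```python
import numpy as np

def retrieval_at_k(recon, true, ks=(1, 5)):
    """Fraction of reconstructed vectors whose true embedding ranks in the top k.

    recon, true: (n, d) arrays; row i of recon should retrieve row i of true.
    """
    r = recon / np.linalg.norm(recon, axis=1, keepdims=True)
    t = true / np.linalg.norm(true, axis=1, keepdims=True)
    sims = r @ t.T                           # (n, n) cosine similarities
    correct = np.diag(sims)[:, None]         # similarity to the right protein
    # Rank = number of pool entries scoring strictly higher than the right one.
    rank = (sims > correct).sum(axis=1)
    return {k: float((rank < k).mean()) for k in ks}

def chance_at_k(pool_size, k):
    # A random guesser places the right protein in the top k with probability k / n.
    return k / pool_size
```

For a pool of 200, `chance_at_k(200, 1)` is 0.005 and `chance_at_k(200, 5)` is 0.025, matching the chance row of the table.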

So the sentence keeps the gist and loses the fingerprint. English is enough to place a protein in roughly the right region of ESMC’s representation space, which is why the cosine holds up and why top-1 retrieval still beats chance by more than tenfold (6.5% against 0.5%). It is not enough to tell two ribosomal proteins apart, which is why retrieval falls off a cliff relative to the raw code. The act of writing the features down as fluent prose smooths away exactly the fine distinctions that make one protein a specific point rather than a neighborhood.

That lines up with Anthropic’s own caveat that these descriptions capture high-level patterns better than precise detail. It is satisfying to see the same shape fall out of a protein model, where “high-level pattern” means a fold family and “precise detail” means which member of it.

Limitations, plainly

This is a lightweight version of the idea, and the numbers should be read that way. The verbalizer reuses the shipped feature descriptions instead of being trained for reconstruction, the reconstructor is a small supervised MLP rather than a jointly trained decoder, and the sentence encoder is a generic one that was never meant for dense biological prose. A stronger encoder or an actually-trained verbalizer would move the retrieval number, probably a lot. The verbalizations can also assert things the features do not strictly support, which is the protein version of the hallucination problem Anthropic flags, and reconstruction fidelity is the only guard I have against it.

What I am fairly confident in is the qualitative finding, because it shows up cleanly against the right control: turning a protein model’s internal state into English preserves what kind of protein it is and blurs which protein it is. That is a real and, to me, slightly poetic property of natural language as an interpretability tool, and it is the kind of thing I wanted to be able to take from one field and test in another.

The structure behind the words

The first verbalization above, and the hero for this post, is that bacterial 50S ribosomal protein L30. Since I had ESMC loaded anyway, I folded it.

[Figure: 3D structure of 50S ribosomal protein L30 from Streptococcus agalactiae (Q3K3V1), ESMFold2 single-sequence; avg pLDDT ≈ 78, pTM ≈ 0.67. Colored by OpenFold 3 pLDDT on the AlphaFold confidence scale (blue high, orange low).]

It comes back as the compact alpha/beta fold the verbalization described, at moderate confidence. Small single-domain proteins like this land in the middle: better than the beta-barrel that defeated it in the deep dive, short of the near-crystal accuracy it reached on myoglobin. It is the same lesson at medium volume: how well the model folds a protein alone tracks how much it could lean on local structure instead of long-range pairing it never got an MSA for.

Wrap

This started as a question about whether a method built for a chatbot would say anything about a protein model. It does. A sentence holds the shape of ESMC’s representation but not its fingerprint. I would like to push it further some time, with a verbalizer actually trained for reconstruction and a target taken at the SAE’s own layer rather than the final one. For now I am happy to have a small, honest result that is mine.