Trading accuracy for speed using language-based models:
> Meta’s network, called ESMFold, isn’t quite as accurate as AlphaFold, Rives’ team reported earlier this summer, but it is about 60 times faster at predicting structures, he says. “What this means is that we can scale structure prediction to much larger databases.”
> Burkhard Rost, a computational biologist at the Technical University of Munich in Germany, is impressed with the combination of speed and accuracy of Meta’s model. But he questions whether it really offers an advantage over AlphaFold’s precision when it comes to predicting proteins from metagenomic databases. Language model-based prediction methods — including one developed by his team — are better suited to quickly determine how mutations alter protein structure, which is not possible with AlphaFold. “We will see structure prediction become leaner, simpler, cheaper and that will open the door for new things,” he says.
Structure prediction is embarrassingly parallel, and it is rare for any specific protein to need a prediction in an extremely short period of time; predictions can be run at any time, or even precomputed in bulk.
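To make "embarrassingly parallel" concrete, here is a minimal sketch, assuming a hypothetical predict_structure() wrapper around whatever folding model you prefer (the function name and layout are mine, not any real API); since each sequence is independent, a batch farms out trivially across processes or machines:

```python
import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

def predict_structure(sequence: str) -> str:
    """Hypothetical stand-in: run one sequence through a folding model, return PDB text."""
    raise NotImplementedError("wrap ESMFold, AlphaFold, etc. here")

def precompute(sequences: list[str], out_dir: str = "predictions") -> None:
    os.makedirs(out_dir, exist_ok=True)
    # No sequence depends on any other, so this scales linearly with workers;
    # swap ProcessPoolExecutor for a cluster scheduler at larger scales.
    with ProcessPoolExecutor() as pool:
        for seq, pdb in zip(sequences, pool.map(predict_structure, sequences)):
            name = hashlib.sha1(seq.encode()).hexdigest()[:16]
            with open(os.path.join(out_dir, name + ".pdb"), "w") as fh:
                fh.write(pdb)
```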
DeepMind has no trouble getting inference time at Google to compute predictions with their model, whether for any single protein of interest or for the largest databases.
Thus, a faster but less accurate system is not really desirable unless it genuinely provides another feature, such as (as Rost says) predicting how mutations alter structure. But if AlphaFold works, then it would indeed be capable of predicting how mutations alter structure.
As an outsider to both biology and AI, I feel like there's a more fundamental problem than structure estimation from pattern-matching (over-fitting?) known data: learning how the molecular dynamics work. I'm fascinated by how the folding might actually happen. The intermediate steps; how the backbone and side chains react to changes in temperature; how much motion is going on in the side chains in response to constant interaction with water molecules, etc. How well does a classical rigidity/elasticity model apply to covalently bonded molecules? How much do various parts of the protein flex in real time? How do we make a model that folds (or unfolds) ab initio without being tuned to experimental data? How many of the quantum interactions can we dodge by working at the atomic scale rather than the electron scale? How accurately can we model using dipoles, hydrophobic interactions, simplified hydrogen-bond models, etc.? How much of the picture are we missing by only looking at the static aspect of the folded protein, i.e. experimentally or AI-determined static atom coordinates as the model?
A number of people, myself included, have railed against the terminology that frequently surrounds these otherwise great achievements, because it does a disservice, precisely on this topic.
AlphaFold (I haven't looked at Meta AI's model yet, beyond a cursory glance) is solving for the state of the folded protein, based primarily on the state of other, similar folded proteins. There's some, but extremely limited, modeling of unknown states. The major breakthrough for AlphaFold appears to be that it is substantially better at detecting meaningful signal in homology-based sequence alignments. That means it can figure out the parts of a protein that are similar to other known folded proteins with much greater success than previous models (2x-3x better, IIRC). As far as modeling proteins or portions of proteins for which there are no known homologs, the model becomes significantly weaker (although not nonexistent).
In my opinion, and also in the opinion of a number of my colleagues, advertising AlphaFold as having solved the "protein folding problem", and other similar language (used by both the media and the AlphaFold press releases), is completely disingenuous. We are effectively no closer to understanding how proteins fold, nor are we closer to being able to predict how a protein folds without any a priori information. Furthermore, homology modeling, while quite successful so far for a lot of proteins, breaks down on edge cases, higher-order structures, and unknown folds. It seems unlikely that it will ever completely solve the problem, and therefore an entirely different approach will be needed to model protein folding in a way that encompasses the entire field, edge cases and all (a quite lofty goal that may also never be reached).
All this is to say: AlphaFold is a great tool that I use frequently and am grateful exists, but it hasn't solved protein folding, the thing it has solved is intrinsically limited, and we should probably use different language to describe what it's done. Either way, I'm glad to see progress being made here, and I eagerly await finding out how proteins actually fold.
> We are effectively no closer to understanding how proteins fold, nor are we closer to being able to predict how a protein folds without any a priori information.
Isn't that a bit like saying of image classifiers that we still don't _understand_ how to recognize cat pictures, and that we couldn't do it without a large set of training data?
(To an extent this is true; that's also why a lot of work goes into interpretability and attribution in models.)
AlphaFold knows very little about the physicochemical mechanisms governing protein folding and doesn’t give any new insights either. The way it samples conformational space is very different from Folding@home’s molecular dynamics based approach, for instance.
To describe the folding of a protein chain, a “picture” of its lowest-energy conformation (which is the goal of AlphaFold, and of any structural modelling tool) is not enough. Rather, we need a “movie” showing the steps said chain undergoes, from an unfolded polypeptide to its final, folded, native state (the “picture”).
Yes, it is exactly like that. The objection is that a number of people and organisations are talking as if they had solved the recognition problem. The parent comment was remarking that they want to see how a protein folds; it's unlikely that AlphaFold will ever reveal that.
How do we know that you aren't missing a piece of the puzzle that others see, or may come to see?
It's a genuine question. A lot of experts have blind spots, especially around prediction, rate of growth, potential, and vision. History is littered with people who are really good in a field but fail to see 2/3/5/10 years ahead.
>It seems unlikely that it will ever completely solve the problem, and therefore, an entirely different approach will be needed to modeling protein folding that encompasses the entire field, edge cases and all (a quite lofty goal that may also never be reached).
Of course I'm fallible, but that doesn't change the fact that right now this is the case: AlphaFold teaches us essentially nothing about how proteins actually fold, and instead solves for the folded state. If someone can backtrack from there, I'll be nothing but ecstatic.
is it also accurate to say that proteins may fold differently based on temperature, pH, and other factors?
meaning a protein could fold in multiple ways (similar to a Swiss Army knife "folding" into different shapes), and AlphaFold only predicts a subset of these for now (which is still amazing).
Yes, absolutely. Proteins can fold to different target structures (that are very distinct from each other) and get stuck in those states for long periods of time, even if the state isn't the global energy minimum.
The list of reasons for this (both functional and inadvertent) is extremely long. But the long and short of it is that the ability of proteins to reproducibly fold to a single structure was figured out with this technique: put an already folded protein into a very strong denaturing solution (of urea or guanidinium chloride), detect that it has "unfolded", then put it back into salty water and watch it re-form the same original structure.
Even more detail here: https://en.wikipedia.org/wiki/Hofmeister_series which is basically a series of progressively stronger solutions that interfere with water/protein interactions and disrupt or enhance folding. This data was key to establishing that hydrophobic collapse (one of the dominant models for how proteins spontaneously form structure) is a significant force in driving folding free energy.
We can use molecular dynamics to model how proteins fold and interact with different molecules and solvents. It's very computationally expensive, so it doesn't scale to these huge datasets, but it's not like we have no tools at all for understanding these processes.
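For a sense of what that looks like in practice, here is a minimal sketch using OpenMM (a widely used open-source MD engine); it assumes you already have a solvated, hydrogen-complete structure in input.pdb, and it runs plain equilibrium dynamics rather than any folding protocol:

```python
# Minimal OpenMM molecular-dynamics sketch: energy-minimize a system and run
# 1 ns of Langevin dynamics. Illustrative only; real folding studies need far
# longer timescales and more careful setup.
from openmm import LangevinMiddleIntegrator
from openmm.app import (PDBFile, ForceField, Simulation, StateDataReporter,
                        PME, HBonds)
from openmm.unit import kelvin, nanometer, picosecond, picoseconds

pdb = PDBFile("input.pdb")  # assumed: solvated structure with hydrogens
forcefield = ForceField("amber14-all.xml", "amber14/tip3pfb.xml")
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME,
                                 nonbondedCutoff=1.0 * nanometer,
                                 constraints=HBonds)
integrator = LangevinMiddleIntegrator(300 * kelvin,        # temperature
                                      1 / picosecond,      # friction
                                      0.002 * picoseconds) # 2 fs timestep
sim = Simulation(pdb.topology, system, integrator)
sim.context.setPositions(pdb.positions)
sim.minimizeEnergy()
sim.reporters.append(StateDataReporter("log.csv", 1_000, step=True,
                                       potentialEnergy=True, temperature=True))
sim.step(500_000)  # 1 ns; folding events are typically >10^6 times longer
```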
As an insider to both fields (and specifically to their junction with molecular dynamics):
What you are discussing is of course being studied, but the problem is that it is a lot more computationally expensive. We do have simulations of simple (read: very small) proteins folding and unfolding, but for larger ones the computational time to watch them fold can be gigantic, if not prohibitive, in part because proteins often fold as they are being made. That means folding simulations can end up having to include a much larger process on top of the folding itself, which further stresses computational resources.
This computational problem is so enormous that a company at the cutting edge of research, D. E. Shaw Research, built a specialized computer (Anton) solely for simulating proteins. Also, most of the software used for this had, until recently, abandoned multi-GPU parallelism because it didn't scale well. The pandemic created the need to simulate the virus on the entirety of Summit and brought some work back along that route, but it is still specialized (and wouldn't help for systems below a certain size anyway).
Also, my previous points were about atomistic models (i.e. we treat everything more like Newtonian particles and ignore quantum effects); some things definitely need more resolution, and at that level you are lucky to see protein fluctuations, let alone folding.
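A back-of-the-envelope calculation shows why (the throughput and timescale numbers below are rough assumptions, not benchmarks):

```python
# Rough arithmetic on why watching a single folding event in atomistic MD is
# so expensive. All numbers are order-of-magnitude assumptions.
timestep_fs = 2          # typical timestep with constrained bonds
fold_time_ms = 1         # many proteins fold on millisecond timescales
ns_per_day = 100         # plausible single-GPU throughput, mid-size system

steps = (fold_time_ms * 1e-3) / (timestep_fs * 1e-15)   # seconds / seconds
gpu_days = (fold_time_ms * 1e6) / ns_per_day            # ms -> ns, then / rate
print(f"{steps:.0e} timesteps, ~{gpu_days:,.0f} GPU-days per folding event")
# -> 5e+11 timesteps, ~10,000 GPU-days: hence Anton-style special hardware.
```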
> This computational problem is so enormous that a company at the cutting edge of research, D. E. Shaw Research, built a specialized computer (Anton) solely for simulating proteins.
My memory is a little rusty, but I believe yes. If I remember correctly, simulation of the virus helped medically in a few ways. Specifically, I think it gave insight into the mRNA vaccines and which sequences to use to make them effective (basically by making a slightly worse spike protein). I am sure it helped in drug discovery too, as I know our lab used simulations to suggest some potential drug pockets. There were some really good talks about it at NVIDIA's GTC last year or so (maybe more at this recent GTC, but I had too much going on personally to watch the VODs).
These questions are better addressed by Folding@home, and even there we're still very much computationally limited in our ability to answer them.
Do you want to answer these questions for the satisfaction of understanding the underlying physical rules that drive folding? Why? It's unclear that knowing those things would actually make a large impact in any industrially or medically useful context. It takes huge amounts of CPU to sample these systems accurately enough to replace actual physical experiments on protein motion. It just doesn't seem like an effective investment of brain or computer time.
(I say this as somebody whose entire career was predicated on using MD to answer these questions; see https://www.nature.com/articles/nchem.1821 for our attempt in that space)
The folding@home approach is very limited to short individual simulation times (a millisecond total, maybe, but 100 disconnected nanoseconds at a time), so it then relies on various 'enhanced sampling' techniques to try to put your thumb on the scale to bias things into exploring interesting dynamics. It seems like it is probably more effective the more you already know about a given protein target. Meta's approach (which seems like AF2, but faster/worse?) seems to have a similar problem, in that it's even less trustworthy when you apply it to a new target you have relatively little concrete information about.
That's what the AlphaFold team has been working on for some time now. The only difference is that it will be relevant for drug research, so I don't expect Alphabet to give this one away for free as well.
This is incredible work, honestly. Is the theory here that humans seem to enjoy an intuitive understanding of protein folding? I seem to remember an online game (maybe https://en.wikipedia.org/wiki/Foldit) that exploited this intuition by crowdsourcing human suggestions for brute-force style protein folding work.
And am I right that the goal is to create a best-effort shortcut to the final stage of brute-force folding, to get a real result?
It's just about making simplifications where possible. You don't need a molecular dynamics simulation to predict where a pool ball will go. That simplification is easy to see and understand. Scientists and engineers have been making useful simplifications for a long time.
There are other simplifications where a bunch of forces and masses can be cancelled out, or aggregated - but we can't see them. Can a computer see them with enough processing? It seems like it.
Can anyone describe how/where this actually fits into Meta's businesses?
"As a test case, they decided to wield their model on a database of bulk-sequenced ‘metagenomic’ DNA from environmental sources including soil, seawater, the human gut, skin and other microbial habitats."
Meta has a lot of data, but I'm unaware of them having a presence in the environmental/medical diagnostics industry which is where I assume this would be applied
Perhaps they are going for a Bell Labs kind of structure?
Improving the newsfeed algorithm. Automating content moderation. Creating content for the upcoming Metaverse. Realistic virtual characters for the Metaverse.
Whether or not we like, or agree with, or believe in their goals (I don't), I think it's hard to argue that competence in AI is not useful for them.
Yes; however, all of that research was directly applicable to the products of the Meta family of companies, most specifically around NLP, image processing and computer vision, with some work in RL. So it was applicable to the company, in addition to being good for recruiting.
I think they were asking why Meta cares about protein structure. It doesn't, directly, but I suppose it does help them keep up with the state of the art in ML.
Cool stuff, but more likely useful as a primary filter or search tool rather than for detailed understanding of protein structure and function. See this quote:
> "Sergey Ovchinnikov, an evolutionary biologist at Harvard University in Cambridge, Massachusetts, wonders about the hundreds of millions of predictions that ESMFold made with low-confidence. Some might lack a defined structure, at least in isolation, whereas others might be non-coding DNA mistaken as a protein-coding material."
Understanding of how these proteins function requires high-resolution information about bond angles and atom-atom distances, particularly for non-structural proteins (i.e. interesting catalytic enzymes). Hence, wet-lab work and protein structure characterization via X-ray and NMR methods aren't going anywhere.
Medium confidence is often close enough that you can formulate a hypothesis, but high confidence still isn't close enough that you'd stake millions of dollars of experiments on it anyway. Most of the people in structural biology in pharma that I have chatted with say they're still solving the structures of their targets even with high-confidence AlphaFold models.
For me, as someone solving structures on a monthly basis, AlphaFold is great because after I get back my electron density map, I dump my sequence into AlphaFold, get a model (of any confidence) and most of it fits well enough into my density map that I don't have to start trying to model from nothing, and it saves me honestly days of work.
What's really funny about using AlphaFold predictions to bootstrap a model into a density map is that your structure, based on an AF prediction, will eventually be folded into the dataset used to train the next version of AlphaFold. Talk about test-set/training-set leakage!
That's a cool way of using it that I had never thought about. Just for my own curiosity, do you know if that is a common use case in the structure-solving field? (I'd imagine it would also be useful in NMR experiments for getting initial point labels.)
It is quite common to use predicted models as an aid for phasing and molecular replacement; even Foldit models have been adopted for that purpose: https://www.nature.com/articles/nsmb.2119
I'm fairly certain that the model was not at all trained on language data. It's just that the "neural architecture" (specifically, the Transformer architecture) was first discovered in the NLP realm.
If your question is why an architecture designed for NLP works well for amino-acid sequences, it turns out that the Transformer architecture is surprisingly versatile. It works amazingly well for other sequence-like data (like audio), and even for images (which, to me, is surprising, since images are not a one-dimensional sequence like text or audio).
The Transformer architecture (particularly the multi-head self-attention[0] mechanism within the Transformer) is, in my estimation, one of the key innovations in deep learning in the past 5 years. It's used in pretty much everything in deep learning (GPT-3, AlphaFold, DALL-E 2/Stable Diffusion, and OpenAI's Whisper come to mind).
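For anyone curious what that core mechanism actually computes, here is a minimal single-head scaled dot-product self-attention in plain NumPy (random matrices stand in for learned weights; multi-head attention just runs several of these in parallel and concatenates the results):

```python
# Minimal scaled dot-product self-attention: each position builds a weighted
# mix of every position's "value", with weights from query-key similarity.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv            # project tokens
    scores = (q @ k.T) / np.sqrt(k.shape[-1])   # all-pairs similarity
    return softmax(scores) @ v                  # attention-weighted values

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=(10, d))  # 10 tokens, e.g. amino-acid embeddings
out = self_attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (10, 64): one contextualized vector per token
```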
IIRC they don't really "understand" physics. They just recall known folded structures really, really well. They can't discover new protein structures that have yet to be verified by cryo-EM.
Meta has become a creepy metaphor for facebook: briefly the new cool kid on the block, but now awkwardly shows up at high school parties with their letter jacket on, talking about the time they almost made it to State.
That's a harsh take. FAIR (the Facebook AI research group) is a respected group in the field and they put out high quality research on a variety of topics.
This would be like condemning research for AT&T/Bell Labs because the company was (even at the time) a terrible monopolistic corporation.