Reading this, it sounds like 'AI' is when you build a heuristic model (which we've had for a while now) but pass some threshold of cost in terms of input data, GPUs, energy, and training.
The classical approach was to understand how genes transcribe to mRNA, and how mRNA translates to polypeptides; how those are cleaved by the cell, and fold in 3D space; and how those 3D shapes result in actual biological function. It required real-world measurement, experiment, and modeling in silico using biophysical models. Those are all hard research efforts. And it seems like the mindset now is: we've done enough hard research, let's feed what we know into a model, hope we've chosen the right hyperparameters, and see what we get. Hidden in the weights and biases of the model will be that deeper map of the real world that we have not yet fully grasped through research.
But the AI cannot provide a 'why'. Its network of weights and biases is as unintelligible to us as the underlying scientific principles of the real world we gave up trying to understand along the way. When AI produces a result that is surprising, we still have to validate it in the real world, and work backwards through the hard research to understand why we are surprised.
If AI is just a tool for a shotgun approach to discovery, that may be fine. However, I fear it is sucking a lot of air out of the room from the classical approaches. When 'AI' produces incorrect, misleading, or underwhelming results? Well, throw more GPUs at it; more tokens; more joules; more parameters. We have blind faith it'll work itself out.
But because the AI can never provide a guarantee of correctness, it is only useful to those with the infrastructure to carry out those real-world validations on its output, so it's not really going to create a paradigm shift. It can provide only a marginal improvement at the top of the funnel for existing discovery pipelines. And because AI is very expensive and getting more so, there's a pretty hard cap on how valuable it would be to a drugmaker.
I know I'm not the only one worried about a bubble here.
You're using "AI" quite broadly here. Here's a perspective from computer vision (my field).
For decades, CV was focused on trying to 'understand' how to do the task. This meant a lot of hand-crafting of low-level features that are common in images, and finding clever ways to make them invariant to typical 3D transformations. This works well for some tasks, and is still used today in things like robotics, SLAM etc. However - when we then want to add an extra level of complexity - e.g. to try and model an abstract concept like "cat", we hit a bit of a brick wall. This happens to be a task where feeding a large dataset into a (mostly) unconstrained machine learning model does very well.
> The classical approach was to understand how genes transcribe to mRNA, and how mRNA translates to polypeptides; how those are cleaved by the cell, and fold in 3D space; and how those 3D shapes result in actual biological function.
I don't have the expertise to critique this, but it does sound like we're in the extreme 'high complexity' zone to me. Some questions for you:
- how accurate does each stage of this need to be to get useful performance? Are you sure there are no brick walls here? How long do you think this approach will take to deliver results?
- do you not have to validate a surprising classical finding in the same way that you would an AI model - i.e. how much does the "why" matter? "the AI can never provide a guarantee of correctness" is true, but what if it was merely extremely accurate, in the same way that many computer vision models are?
> do you not have to validate a surprising classical finding in the same way that you would an AI model - i.e. how much does the "why" matter? "the AI can never provide a guarantee of correctness" is true, but what if it was merely extremely accurate, in the same way that many computer vision models are?
The lack of asking "why" is one of my biggest frustrations with much of the research I have seen in biology and genetics today. The why is hugely important: without knowing why something happens or how it works, we're left only with knowing what happened. When we go to use that as knowledge, we have no idea what unintended side effects may occur and no real information telling us where to look or how to identify side effects should they occur.
Researching what happens when we throw crap at the wall can occasionally lead to a sellable product but is a far cry from the scientific method.
I mean - it's more than a sellable product; the reason we're doing this is to be able to advance medicine. A good understanding of the "why" would be great, but if we can advance medicine quicker in the here and now without it, I think that's worth doing?
> When we go to use that as knowledge we have no idea what unintended side effects may occur and no real information telling us where to look or how to identify side effects should they occur.
Alright and what if this is also a lot quicker to solve with AI?
> I mean - it's more than a sellable product; the reason we're doing this is to be able to advance medicine
I get this approach for trauma care, but that's not really what we're talking about here. With medicine, how do we know we aren't making things worse without knowing how and why it works? We can focus on immediate symptom relief, but that's a very narrow window with regards to unintended harm.
> Alright and what if this is also a lot quicker to solve with AI?
Can we really call it solved if we don't know how or why it works, or what the limitations are?
It's extremely important to remember that we don't have Artificial Intelligence today; we have LLMs and similar tools designed to mimic human behaviors. An LLM will never invent a medical treatment or medication, or more precisely, it may invent one by complete accident and it will look exactly like all the wrong answers it gave along the way. LLMs are tasked with answering questions in a way that statistically matches what humans might say, with variance based on randomness factors and a few other control knobs.
If we do get to actual AI, that's a different story. It takes intelligence to invent the new miracle cures we hope they will invent. The AI has to reason about how the human body works, about complex interactions between the body, environment, and any interventions, and it has to reason through the necessary mechanisms for a novel treatment. It would also need to understand how to model these complex systems in ways that humans have yet to figure out; if we could already model the human body in a computer algorithm, we wouldn't need AI to do it for us.
Even at that point, let's say an AI invents a cure for cancer. Is that really worth all the potential downsides of all the dangerous things such a powerful AI could do? Is a cure for cancer worth knowing that the same AI could also be used to create bioweapons on a level that no human would be able to create? And that doesn't even get into the unknown risks of what an AI would want to do for itself, what its motivations would be, or what emotions and consciousness would look like when they emerge in an entirely new evolutionary system separate from biological life.
> how much does the "why" matter? [...] merely extremely accurate, in the same way that many computer vision models are?
Because without a "why" (causal reasoning) they cannot generalize, and their accuracy is always liable to tank when they encounter out-of-(training)-distribution samples. And when an ML system is deployed among other live actors, they are highly incentivized to figure out how to perturb inputs to exploit the system. Adversarial examples in computer vision, adversarial prompts / jailbreaks for large language models, etc.
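For a concrete (if minimal) illustration of the computer-vision case, here is a sketch of the fast gradient sign method, one standard way adversarial examples are constructed; the model, input, label, and epsilon here are all placeholders, not anything from the article:

    import torch
    import torch.nn.functional as F

    def fgsm_perturb(model, x, label, eps=0.01):
        # Nudge every input value a tiny step in the direction that most
        # increases the loss; the change is imperceptible to a human but
        # can flip the prediction of a model that hasn't learned the "why".
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), label)
        loss.backward()
        return (x + eps * x.grad.sign()).detach()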
"AI" has always been a marketing term first, there's a great clip of John McCarthy on twitter/X basically pointing out that he invented the term "Artificial Intelligence" for marketing purposes [0].
Don't read too deeply into what exactly is AI. Likewise I recommend not being too cynical about it either. McCarthy and those around him absolutely did pioneer some incredible, world changing work under that moniker.
Regarding your particular critiques, natural intelligence very often also cannot provide a "why". If you follow any particular technical field deep enough, it's not uncommon to come across ideas that currently have deep rigorous proofs behind them but basically started as a hunch. Consider how the very idea of "correlation", which seems rooted in mathematical truth, was basically invented by Galton because he couldn't find causal methods to prove his theories of eugenics (it was his student Pearson who later took this idea and refined it further).
Are we in an AI bubble? Very likely, but that doesn't mean there aren't incredible finds to be had with all this cash flowing around. AI winters can be just as irrational (remember that the book Perceptrons basically caused the first AI winter by exposing the XOR problem, despite the fact that it was well known this could be solved with trivial modifications).
I feel like the invariant with "AI" is the software engineer saying "with enough data and statistics I can avoid understanding the problem domain." It's fundamentally a rejection of expertise.
Take weather prediction for instance. This is something that the AI companies are pushing hard on. There are very good physics-based weather prediction models. They have been improved incrementally over many years and are probably pretty close to the theoretical peak accuracy given the available initial state data. They are run by governments and their output is often freely accessible by the public.
So firstly, where on earth is the business model when your competition is free?
Secondly, how do you think you will do better than the current state of the art? Oh yeah, because AI is magic. All those people studying fluid dynamics were just wasting their time when they could have just cut a check to nvidia.
> I feel like the invariant with "AI" is the software engineer saying "with enough data and statistics I can avoid understanding the problem domain." It's fundamentally a rejection of expertise.
Nature doesn't understand the problem domain, and yet it produced us. Capable of extraordinary achievements.
> The classical approach was to understand how genes transcribe to mRNA, and how mRNA translates to polypeptides; how those are cleaved by the cell, and fold in 3D space; and how those 3D shapes result in actual biological function.
Do you have references for this approach? It’s my understanding that structure solutions mostly lag drug development quite significantly, and that the underlying biological understanding is typically either pre-existing or doesn't arrive for the drug until later. Case in point: look at recent Alzheimer’s drugs, where the biological hypothesis has even been straight up disproven.
Hopefully the bubble will pop when the GenAI bubble does. (Not that it probably should, since this both predates and is unrelated to it… but hype isn't rational to begin with.)
> matching less than 60% of the sequence of the most closely related fluorescent protein
> When the researchers made around 100 of the resulting designs, several were as bright as natural GFPs, which are still vastly dimmer than lab-engineered variants.
So they didn't come up with better functionality, unlike what some commentators imply. They basically introduced a bunch of mutations while preserving the overall function.
60% is freaking nothing in protein space. Like, that's not something to brag about. I could go through and hand-edit GFP to 60% identity and end up with something that is still basically functional GFP.
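For readers outside the field, "sequence identity" here just means the fraction of aligned positions carrying the same amino acid; a toy sketch (ignoring gaps and alignment, and with a made-up second sequence):

    def sequence_identity(a: str, b: str) -> float:
        # Assumes the sequences are already aligned and of equal length;
        # real comparisons would run a proper alignment first.
        matches = sum(1 for x, y in zip(a, b) if x == y)
        return matches / len(a)

    # First ten residues of GFP vs. an invented variant: 7/10 -> 70% identity
    print(sequence_identity("MSKGEELFTG", "MSKGAALFSG"))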
In grad school, I worked on proteins that were 12%-15% sequence identical but had sub 4 Å RMSDs once the structures were solved.
EDIT: Actually, bragging about it could be a Nigerian prince thing, where they do it to scare away investors who might actually hold them to some standard.
If the mutations were non-synonymous, resulting in different amino acids, the fact that they keep the natural function is still kinda cool. Very much a pure research result AFAICT, but worth a little something.
This is already a well known fact: protein structure (and consequently function) is much more conserved than sequence, mostly due to biophysical constraints.
If non-synonymous mutations do not change the biophysical features of the amino acid residues, then the structure is usually kept. Alternatively, it can be the case that a disruptive mutation is compensated by another one that keeps the structure/function/phenotype. This is the basis for evolutionary-coupling-based structure prediction methods, such as AlphaFold.
It's fairly common for two proteins to have almost identical structures but quite different sequences (down to 30% or lower sequence identity), and it's also possible to mess up a nice protein that folds easily with a single amino acid change.
That's still a pretty significant result. Imagine how much more effective directed evolution could be if it weren't driven by random mutation, or if the random mutations were applied on top of already working variants.
That feels like deliberately cute wording. What was the similarity to the starting molecule? I could probably hand-pick a few point mutations substituting one small hydrophobic AA for another without impacting function.
Very nice work. We need brighter fluorescent protein tags that are more compact, in particular in the far-red spectrum. The size of current fluorescent protein coding DNA sequences puts them out of reach of prime editing, so inserting them still relies on less efficient gene editing technology.
There is only so much that you can get out of tinkering with naturally evolved proteins. I suspect that these kinds of tools can be used to generate smaller and brighter FPs in the near future.
Are there any good resources for understanding models like this? Specifically a "protein language model". I have a basic grasp on how LLMs tokenize and encode natural language, but what does a protein language actually look like? An LLM can produce results that look correct but are actually incorrect, how are proteins produced by this model validated? Are the outputs run through some other software to determine whether the proteins are valid?
Proteins are linear molecules consisting of sequences of (mostly) 20 amino acids. You can see the list of amino acids here: https://en.wikipedia.org/wiki/Amino_acid#Table_of_standard_a.... There is a standard encoding of amino acids using single letters, A for alanine, etc. Earlier versions of ESM (I haven't read the ESM3 paper yet) use one token per amino acid, plus a few control tokens (beginning of sequence, end of sequence, class token, mask, etc.) Earlier versions of ESM were BERT-style models focused on understanding, not GPT-style generative models.
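As a rough sketch of what that tokenization looks like (the control tokens and ID assignments below are illustrative, not ESM's actual vocabulary):

    # One token per amino acid, plus a few control tokens, BERT-style.
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"            # the 20 standard residues
    SPECIALS = ["<cls>", "<eos>", "<mask>", "<pad>"]
    VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}

    def tokenize(seq: str) -> list[int]:
        # Map a one-letter amino-acid string to integer token IDs.
        return [VOCAB["<cls>"]] + [VOCAB[aa] for aa in seq] + [VOCAB["<eos>"]]

    print(tokenize("MSKGEEL"))   # first few residues of GFP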
Agreed, would be interested if someone with more knowledge could comment.
My layman's understanding of LLMs is that they are essentially "fancy autocomplete". That is, you take a whole corpus of text, then train the model to determine the statistical relationships between those words (more accurately, tokens), so that given a list of tokens of length N, the LLM will find the next most likely token for N + 1, and then to generate whole sentences/paragraphs, you just recursively repeat this process.
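That recursive step, in sketch form (the model here is a placeholder for any network that returns next-token logits):

    import torch

    def generate(model, token_ids, n_new):
        # Greedy "fancy autocomplete": repeatedly append the single most
        # likely next token given everything generated so far. Real systems
        # usually sample with some randomness rather than taking the argmax.
        ids = list(token_ids)
        for _ in range(n_new):
            logits = model(torch.tensor([ids]))    # shape (1, len(ids), vocab)
            ids.append(int(logits[0, -1].argmax()))
        return ids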
I certainly understand encoding proteins as just a linear sequence of tokens representing their amino acids, but how does that then map to a human-language description of the function of those proteins?
Most protein language models are not able to understand human-language descriptions of proteins. Mostly they just predict the next amino acid in a sequence and sometimes they can understand certain structured metadata tags.
Can they understand the functional impact of different protein chains, or are they just predicting what amino acid would come next based on the training set with no concern for how the protein would function?
The way you would use a protein language model is different from how you would use a regular LLM like chatgpt. Normally, you aren't looking for one correct answer to your query but rather you would like thousands of ideas to try out in the lab. Biologists have techniques for trying out thousands or tens of thousands of proteins in a lab and filtering it down to a single candidate that's the best solution to whatever they are trying to achieve.
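A toy sketch of that generate-then-filter loop (the "model" here is just random mutation of a parent sequence, and the scoring function is a stand-in for whatever cheap in-silico metric you would use to rank candidates before wet-lab screening):

    import random

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def sample_variant(parent: str, n_mutations: int = 3) -> str:
        # Stand-in for a generative protein model: mutate a few positions.
        seq = list(parent)
        for pos in random.sample(range(len(seq)), n_mutations):
            seq[pos] = random.choice(AMINO_ACIDS)
        return "".join(seq)

    def shortlist(parent: str, score_fn, n_candidates: int = 10_000, keep: int = 100):
        # Generate many candidates, rank with a cheap computed score, and keep
        # a small shortlist that then goes on to actual lab validation.
        candidates = {sample_variant(parent) for _ in range(n_candidates)}
        return sorted(candidates, key=score_fn, reverse=True)[:keep]

    # Toy usage: the score (count of hydrophobic residues) is biologically
    # meaningless; in practice it might be a model likelihood or a predicted
    # stability/brightness metric.
    picks = shortlist("MSKGEELFTGVVPILVELDGD", lambda s: sum(c in "AVILMFW" for c in s))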
> For a smaller open-source version, certain sequences, such as those from viruses and a US government list of worrying pathogens and toxins, were excluded from training. Neither can ESM3-open — which scientists anywhere can download and run independently — be prompted to generate such proteins.
> However, its amino-acid sequence is vastly different, matching less than 60% of the sequence of the most closely related fluorescent protein in its training data set.
Not to downplay this achievement, but 60% sequence identity is nowhere near “vastly different”.
The twilight zone is 20-35%: https://pubmed.ncbi.nlm.nih.gov/10195279/ (Incidentally, the author was on my thesis committee, but this isn't precisely my field of expertise.)
Proteins from the same family (and thus the same fold) can share less than 10% identity and still keep the same functionality as shown by e.g. profile Hidden Markov Model comparisons.
Tangential - the laws of nature discovered by our brain usually involve just a few quantities, like F=ma. On one side, this is a great ability of our brain for analytical reduction; on the other, it is just an inability to deal with complex multiparameter phenomena without such a reduction. I wonder whether, by pumping more and more data into NNs, we'd be able to distill emergent multiparameter correlations which happen to be new laws of nature irreducible to simpler ones.
I disagree; our brains almost always deal with complex multiparameter phenomena.
The part of us that can't cope with that is our System 2 reasoning: our minds' rules-based thinking is great at F=ma but cannot enumerate the rules necessary, at the level of retina cell activation patterns, to recognise a tiger. Our System 1 can, and always does; we just call it "intuition" or "common sense" or "a gut feeling".
Eh, not in this context at least. Nature is a big term. Physics just "is". This makes it a good place for simple mechanics to appear and be derived. You can argue about how universal that is all you want, but proteins themselves are the result of an enormously large search of a hugely high-dimensional space. It doesn't seem possible for them to ever be reduced.
I wonder if this will ever come full circle (or.. spiral) and the AI tools we've created will in turn lead the way to discovering / inventing new proteins / cells / life forms that eventually outsmart and outcompete us.
"Rives sees ESM3’s generation of new proteins by iterating through various sequences as analogous to evolution."
Except for the part where a sequence is actually deemed more fit, ie natural selection? And the part where mutations are random, instead of sampled from the training data manifold, so much more constrained?
...so really it's a worse version of random search?