Yep, the intention is to change the design by getting rid of it. HN used to just render entire threads in one go, and once we release some performance improvements we hope to do that again.
Radically overgeneralizing from that, it seems likely that the pinned comment at the top helps a bit in terms of directing people to later pages. How that compares to the mammoth-single-page scenario is hard to say because we don't know how many readers would be scrolling down that far to see those comments. There's likely a power-law dropoff no matter what we do.
If you are fine serving more data, you could trigger a "More" automatically when the user minimizes a comment thread, and add the "More" comments to the bottom of the page?
Also, to encourage people to explore "More" comments, maybe some comments inside the 2-3 megathreads showing on the first page could be minimized at first?
I wonder what the thoughts are on these design ideas. I really like how Hacker News is designed, thank you!
My thoughts: the first suggestion is too complex for HN and would amount to a sort of infinite scroll, which users here would probably hate (many have said so pre-emptively!). The second suggestion is probably a good idea. I worked on it at one point but it turned out to be a little harder to get right than I expected. Will probably return to it.
Unrelated, but the reason I opened the link was the .cgi file. It's been a very long time since I visited one. The page uses a 2006 YUI library and a 2010 jQuery version. The amazing part is that it still works in Firefox.
Two years ago, after DeepMind submitted its first set of predictions to CASP (Critical Assessment of protein Structure Prediction), Mohammed AlQuraishi, an expert in the field, asked, "What just happened?"
Now that the problem of static protein structure prediction has been solved (prediction errors are below the threshold that is considered acceptable in experimental measurements), we can confidently answer AlQuraishi's question:
Protein Folding just had its "ImageNet moment."
In hindsight, AlphaFold v1 represented for protein structure prediction in 2018 what AlexNet represented for visual recognition in 2012.
> CASP14 #s just came out and they’re astounding—DeepMind looks to have solved protein structure prediction. Median GDT_TS went from 68.5 (CASP13) to 92.4!!!! Cf. their 2nd best CASP13 struct scored 92.8 (out of 100). Median RMSD is 2.1Å. I think it's over https://predictioncenter.org/casp14/zscores_final.cgi
Standard distance measure in most atomic-scale condensed-matter fields. Certainly inorganic crystallography/materials science/condensed matter physics.
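For anyone unfamiliar with the two numbers quoted above, here is a rough sketch of how RMSD (in Å) and a simplified GDT_TS can be computed. It assumes the model and reference C-alpha coordinates are already superimposed and matched residue by residue; the real GDT procedure searches over many superpositions, which this sketch skips. The function names and the toy coordinates are made up for illustration.

```python
import math

def distances(model, reference):
    """Per-residue Euclidean distances (in Å) between matching C-alpha atoms."""
    if len(model) != len(reference):
        raise ValueError("model and reference must have the same number of residues")
    return [math.dist(m, r) for m, r in zip(model, reference)]

def rmsd(model, reference):
    """Root-mean-square deviation over all matched residues."""
    d = distances(model, reference)
    return math.sqrt(sum(x * x for x in d) / len(d))

def gdt_ts(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS: average, over the four cutoffs, of the percentage
    of residues within that distance of the reference position.
    Assumption: structures are already superimposed (no superposition search)."""
    d = distances(model, reference)
    fractions = [sum(1 for x in d if x <= c) / len(d) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Toy, made-up coordinates: four residues displaced by 0.5, 1.5, 3.0 and 9.0 Å.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]

print(f"RMSD   = {rmsd(model, reference):.2f} Å")
print(f"GDT_TS = {gdt_ts(model, reference):.2f}")
```

Note how the single badly placed residue dominates the RMSD, while GDT_TS only counts it as a miss at every cutoff. That is part of why the two numbers are usually reported together.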
> I don’t think we would do ourselves a service by not recognizing that what just happened presents a serious indictment of academic science.
As in other fields, I'm beginning to question the academic structure for making advances. It appears something is rotten in the state of academia. Oddly, it's academia doing incremental improvements to existing methods but industry making novel leaps and bounds... The other major case in point being NLP
Academia is for generating problem solvers. Teams are small and made of people who will be there for around 5 years.
A better comparison would be to national labs, but they are tasked with projects that make no sense for industry to tackle.
The system is working as intended, all players are needed. The team at Alphafold busted their chops in academia and went on to working on problems they could spend decades on.
People seem to forget that you need a system like academia that's allowed to fail. Most companies aren't allowed to fail when they need to have quarterly returns. Of course academia has become more and more competitive. But tbh I think the answer is that the funding hasn't increased equally with the number of quality people who could stay in academia. But who knows.
Stability "like academia" is rich, given all we've heard about "publish or perish". Modern academia is a poor fit for increasingly any case you can think of besides maintaining the status of academia. But sure, there needs to be some stability and ability to "fail"/i.e. produce something worthless. Corporate research departments provide this -- if they didn't, they wouldn't have a research department and indeed many don't, nor do they need to, but this has little to do with quarterly returns.
We've also seen a rise of VC-backed research startups (like DeepMind but many others) whose value proposition (to the VC) only makes sense if the goal is to demonstrate a research capacity to get them bought out by a big company, or as a moonshot to out-compete them on an actual product made possible by the research. Investing in these little research startups themselves is also giving companies a way to push research without having to deal with having the researchers as direct employees, and I'm sure makes some of the startup employees feel a bit safer since there's a separation of money and operation influence. One similarity with modern academia is it selects for those who can do good work but who are also good bureaucrats (write grant proposals well, advising politicians, etc), startups have a selection for good work + good at courting VCs. But the startup just needs a few of them, then they can hire people who just want to do good work.
Another thing that makes corporate even better is they can occasionally spin off research developments into products, they can have some nice advances that only come when you try to productize, and among other reasons by not having to bother with external publishing (which takes time + fights with lawyers and business people) they can routinely be 10+ years ahead of whatever the state of the art in academia is.
That's all nice in theory, but what's the compelling empirical evidence for corporate science vs academic research? A professor might conversely argue the open nature of scholarship and freedom of inquiry as being essential to basic science, and capitalist businesses fundamentally cannot provide that. So it goes back to empirical support. And last I checked, companies still need a pool of trained PhDs to choose from, and those come from academia, for good reason.
Elsewhere the thread makes plenty of cases for corporate advances, even (or perhaps especially?) in the 20th century with e.g. Bell Labs et al. I think the empirical results are pretty good for corporate science.
As for 'needing' PhDs, I'm not sure. Having some can be convenient, yes, but in many cases not necessary. In some fields the only way to get caught up (i.e. no corporation will train you directly) may be with academic foundations but is a PhD necessary or just some relevant graduate work?
As an example N=1 to show a PhD is not needed always, Jeff Jonas formerly held the title of IBM Chief Scientist where he did some state of the art work in entity resolution. He didn't even finish high school.
That's an interesting way to make selection bias sound like "pretty good"; I disagree on the face of it, even if I am open to considering the idea of completely doing away with academia.
I don't know and am skeptical, but intuitively, the day that would happen is the day that Nobel prizes are routinely awarded to FAANG companies and not to academics.
Further, since your original position was strongly that academia is useless, it is your onus to back up the implication that PhDs are not (generally) needed/useful, and using "well hmm, not sure if absolutely necessary" logic is fallacious and clouds the issue.
We are in a period of historically low interest rates. If and when interest rates rise, these moonshot research startups will get dropped like a hot potato.
Because in addition to funding moonshot project plays, low interest rates also fund a lot of really stupid investments that should never have been funded, and that go belly-up at the first sign of a tightening fiscal market.
When too many of those really stupid investments go belly up at once, it's called a bubble popping, and it is catastrophic to the economy.
The article then goes on to describe a not very general set of circumstances
> But in part due to the canonicalization of CASP, protein structure prediction effectively has a two-year clock cycle, where separate research groups guard their discoveries until after CASP results are announced.
and further noting
> As I discussed earlier, it is clear that between the Xu and Zhang groups enough was known to develop a system that would have perhaps rivaled AlphaFold.
Finally, and rather crushingly for your thesis, there are the points made about the real industrial groups:
> What is worse than academic groups getting scooped by DeepMind? The fact that the collective powers of Novartis, Pfizer, etc, with their hundreds of thousands (~million?) of employees, let an industrial lab that is a complete outsider to the field, with virtually no prior molecular sciences experience, come in and thoroughly beat them on a problem that is, quite frankly, of far greater importance to pharmaceuticals than it is to Alphabet. It is an indictment of the laughable “basic research” groups of these companies, which pay lip service to fundamental science but focus myopically on target-driven research that they managed to so badly embarrass themselves in this episode.
I used to work in one of the top labs doing protein folding, in fact I recognize some lab mates who survived to have their own labs in this year's CASP top ten ranking. Something is a bit rotten though because I'd guess about half of us burned out of academia entirely, and the state of the art at the time was regrettable. I remember one model doing rather well except the weights of the various physical forces were wrong, so much that some signs were negative and what should have been electrostatic repulsion was actually electrostatic attraction between similar charges. This was 10 years ago of course, but all of the field improvements were relatively incremental and small until AlphaFold came along.
Modern academia exists to further modern academia, no more and no less. I became disillusioned of any search for truth and progress during my time there, it was really just about showing up the Baker lab at the next CASP and getting the next round of grants secured.
This is exactly my point. There is something rotten in the research labs where society is not getting return on investment for basic research.
If the goal is to just produce researchers that can work at corporate research labs, then I feel we could get more bang for the buck.
If the goal is to do more research in the public good, something needs to change. Maybe it’s the fact there’s too little money out there, and it causes everyone to chase meager grant money. Or that there’s too much competition between groups. Or a million other reasons.
But I’d love to see it fixed and have more faith in public investment in basic science.
The problem is that people in academia and outside of it were saying the same thing in the 70s, 80s, 90s, 00's ... so if you really want to make the claim that "society is not getting return on investment for basic research", you need to claim with a straight face that this also applies to the last 50 years of academic basic research (at least). Alternatively, explain what has changed and when.
Citing increases in fraud and retractions, this article makes the case that as funding has increased, scientific quality has worsened. (Which is counter to what I guessed above!) Pursuit of scientific knowledge for its own sake replaced by obsession with grant cycles.
But lots of other things have also changed that may or may not be causes:
- many fewer tenure track positions
- many more jobs in industry that require or value a 4 year or advanced degree
- obtaining a college degree as a rite of passage seen as increasingly essential to young adulthood
- the rise of data science: more jobs in industry that have access to lots of data and demand scientific rigor
- the rise of private cloud supercomputing (i.e. DeepMind) vs, say, a public university's cluster
- the obsession in some foreign countries of getting advanced degrees from American universities, creating essentially a guest visa workforce that is easily abused
- rise of Big Tech which has money to throw at things like protein folding
I think while funding has been increased overall due to more people in academia, the processes to acquire said funding got more complex so that a significant part of work is actually making sure the next grant can be secured. Similar situation in publishing. The research might only get a backseat.
That said, this is primarily a computational problem, so the advances here might not be applicable to basic research.
From my experience working as an engineer in academia, there is a big problem in anything AI related, in that anyone decent can instantly increase their pay by 200-1000% just by leaving for one of the big tech companies. Half the PhD students don't even bother finishing before being given an offer they can't refuse. How can academia compete with that, given sky-high student debt, an extremely uncertain path to tenure etc etc
Speaking of which, Google Translate was launched in 2006, but when did the "learning from data" approach become an accepted idea in machine translation? I think the earlier attempts at machine translation were more about trying to codify grammar rules in software than doing statistical learning from large text corpora? I remember that in 2002, the approach of learning protein substructures from data was already the best-performing approach to the protein folding problem.
Not really. Using a statistical approach to text modelling, specifically using Markov Chains, was proposed by Shannon in 1948. But yeah, there's a point in the 2000s where generative grammar/ symbolic approaches were pretty much left behind by NN methods.
When we discuss Google's input in NLP, the most important contribution is certainly the "Attention Is All You Need" paper, which paved the way for BERT and GPT (AlphaFold also uses attention networks, btw)
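To make the Markov chain idea mentioned above concrete, here is a minimal sketch of a word-level Markov text model: count which word follows each state, then sample from those counts. This is only an illustration of the general technique, not a reconstruction of Shannon's experiments; the toy corpus and the function names are made up.

```python
import random
from collections import defaultdict

def build_chain(text, order=1):
    """Map each tuple of `order` consecutive words to the list of words
    observed right after it (duplicates kept, so sampling is frequency-weighted)."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain

def generate(chain, start, length, seed=0):
    """Generate up to `length` extra words by repeatedly sampling a follower
    of the current state; stops early if the state was never seen."""
    rng = random.Random(seed)
    state = tuple(start)
    out = list(state)
    for _ in range(length):
        followers = chain.get(state)
        if not followers:
            break
        out.append(rng.choice(followers))
        state = tuple(out[-len(state):])
    return out

# Toy, made-up corpus.
corpus = "the cat sat on the mat and the cat ate the fish"
chain = build_chain(corpus, order=1)
print(" ".join(generate(chain, ["the"], length=8)))
```

Raising `order` makes the output look more like the training text, at the price of needing far more data to see each state often enough.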
> generative grammar/ symbolic approaches were pretty much left behind by NN methods
Which is the same thing as hand-engineered feature stacks being left behind in vision problems, really. The story in every field is more or less "you're not clever enough to engineer good features"; "you might be clever enough to define good symmetries for the feature space in which the features live... maybe" (convolutional neural networks in image problems); "... but maybe not even that" (attention mechanisms).
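For readers who haven't seen it, the core of an attention mechanism is small enough to sketch in plain Python. This is the scaled dot-product form, softmax(QK^T / sqrt(d_k)) V, written with lists rather than a tensor library; it is an illustration only and says nothing about how AlphaFold or any particular model actually wires it up. The toy inputs are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention.
    Each query is scored against every key, the scores are turned into
    weights with softmax, and the output is the weighted sum of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Toy, made-up inputs: two identical keys, so the query weights both values equally.
queries = [[1.0, 1.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
print(attention(queries, keys, values))
```

The point relevant to the comment above: nothing here encodes a hand-picked symmetry such as locality. Which inputs matter to which is decided by the learned queries and keys.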
The hand generated features are still superior for SfM style problems, where the geometry is well defined but would need to be learned by the NN from scratch.
I think Netflix's model of simplifying and then translating should work better with internet forums and blog posts. Looking at how Google works, I was hoping someone at Google would adopt it and release a competing product against Google Translate
https://arxiv.org/abs/2005.11197
Specifically the linked article wonders about the research environment of academia compared to industry. Why teams of hundreds in academia with their own super computing resources couldn’t make the same advances. He posits there’s something not great going on about how academic research environments make advances, the poor incentive structures, the abuse and burnout of PhDs, the lack of open sharing of findings, the obsession with publication quantity over quality...
There’s a reason these PhDs at Google/DeepMind aren’t in academia doing the same work, after all
Academia is doing a lot of advances every year. The fact it didn't make _this one_ is not really relevant to postulate that academia is inefficient.
It happens that the team at DeepMind is apparently pretty damn good at deep learning problems, so they're going faster than comparable academic labs.
It's not to say that academia has none of the problems you mentioned, but it's imo unreasonable to expect that, in a world where both public and private labs exist, only public ones would make advances.
I would say that is most likely due to a massively higher salary and no teaching responsibilities. Academia can’t compete on salary with industry in AI / data science.
Academia keeps employing people who have done well in classes and within fine bounds. It's a careerist track. Industry cares about results, it's more meritocratic
> Industry cares about results, it's more meritocratic
Industry cares about positive results. If you're not allowed to fail, you will be afraid to explore. That's what Academia is. Then, the industry reaps the fruit of that exploration, which is as it should be.
If academia actually allowed failure we wouldn't be getting so many tiny incremental growth papers just for the sake of it as in deep learning and machine learning.
This is actually something I feel strongly about. The absent-minded creative professor is the one who traditionally has made the breakthroughs. Recent years have instead seen the straight-A student with no curiosity making it into programs, when they really have no business doing novel research and are better suited as orderly wage slaves
There's a lot of fanciful stuff about "creative" and "absent-minded" people doing the best work, but what actually makes a good researcher is the same as in any other field: (a) curiosity (b) determination, and (c) hard work.
PhD programs don't take people who just have a good GPA; you have to have a research record before you're even in the consideration. I've been on an admissions committee, so this is not conjecture.
Right, but a modern research record is about incremental improvement. The argument being that low hanging fruit is often picked, and so the incremental is natural. My argument is that far too many people are gaming the academic system, using it as a form of status credentialing, which is hurting true academic research.
Is this a scientific advance or a technological one though? Academia doesn't have the capital that industry or government has to implement the latter. In America it's small groups of young students led by a professor. Not full-grown PhDs with Google levels of staff and money.
I would claim that this is a technological advance that is likely to lead to many scientific advances.
Perhaps a good analogy is the invention of the microscope and the telescope. They were advances in technology, which then led to advances in science. I don't know if this will have the same effect as the microscope and telescope, but it would be great if it did. It certainly seems extremely promising.
> something is rotten in the state of academia. Oddly it's academia doing incremental improvements to existing methods but industry making novel leaps and bounds... The other major case in point being NLP
You have to realize that corporate research labs had a high level of recognition back in the 20th century. Privately funded labs like Bell Labs, RCA Laboratories, or IBM Research had a reputation that met or exceeded the standard of not-for-profit or publicly funded academic research institutions. They made some of the most important discoveries in the electronics industry of the 20th century, like the point-contact transistor, the MOSFET, VLSI, or the UNIX operating system. They were considered a part of academia, and many scientists were their employees. It's only since the 1980s, after their death, that people have had the impression that "important research must come from academia, industry is for incremental changes." So, I'd argue that the division between industry and academia is large, but actually smaller than people perceive. If you consider privately funded research by industry as a part of academia, the current situation is totally normal, nothing unusual.
Interestingly, for those labs to exist, being a monopolistic megacorp is a requirement. It appears to me that today's FAANG monopoly allowed the creation of Google Deepmind and OpenAI, perhaps it's simply a beginning of the repetition of history.
The article "The death of corporate research labs" had an interesting review. I highly recommend reading the article:
To summarize, those great labs existed and made great contributions because of (1) corporate monopoly on the industry, and (2) the pressure from anti-trust laws. First, due to monopoly, the gigantic size allowed the labs to be the center of gravity and to concentrate all talents and projects into a single place, with a huge budget for basic research. Second, the pressure from anti-trust laws also forced corporations to invest more in basic research to grow the business, because mergers and acquisitions were restricted. In some cases, the pressure from anti-trust laws also made the corporate labs share their discoveries in a more open manner; examples include advances in semiconductors [1] and the Unix source code.
Note: as HN comments pointed out, somewhat ironically, the success of corporate labs relies on anti-trust pressure, but not on actual monopoly-busting enforcement. The breakup of Bell caused the death of Bell Labs.
Finally their decline,
> The more relaxed antitrust environment in the 1980s, however, changed this status quo. Growth through acquisitions became a more viable alternative to internal research, and hence the need to invest in internal research was reduced.
And it turns out that managing a corporate research lab without losing money is a tricky problem to solve. If the research is too goal-oriented, short-termism will dominate and basic research in the lab will be ignored. Thus, basic research in the lab must be independent. However, a lab too isolated from the business can also cause great losses.
> Research in corporations is difficult to manage profitably. Research projects have long horizons and few intermediate milestones that are meaningful to non-experts. As a result, research inside companies can only survive if insulated from the short-term performance requirements of business divisions. However, insulating research from business also has perils. [...] Walking this tightrope has been extremely difficult. Greater product market competition, shorter technology life cycles, and more demanding investors have added to this challenge. Companies have increasingly concluded that they can do better by sourcing knowledge from outside, rather than betting on making game-changing discoveries in-house.
And the author argued the death of corporate labs decreased productivity.
>> An unintended consequence of abandoning anti-trust enforcement was thus a slowing of productivity growth, because this new division of labor wasn't as effective as the labs:
> a new division of innovative labor, with universities focusing on research, large firms focusing on development and commercialization, and spinoffs, startups, and university technology licensing offices responsible for connecting the two.
> The translation of scientific knowledge generated in universities to productivity enhancing technical progress has proved to be more difficult to accomplish in practice than expected. Spinoffs, startups, and university licensing offices have not fully filled the gap left by the decline of the corporate lab. Corporate research has a number of characteristics that make it very valuable for science-based innovation and growth. Large corporations have access to significant resources, can more easily integrate multiple knowledge streams, and direct their research toward solving specific practical problems, which makes it more likely for them to produce commercial applications. University research has tended to be curiosity-driven rather than mission-focused. It has favored insight rather than solutions to specific problems, and partly as a consequence, university research has required additional integration and transformation to become economically useful.
> Honeywell brought a lawsuit against us and said you can’t selectively choose people to divulge your technology to. It’s too important. And if you divulge it to anyone, you’ve got to divulge it to everybody. They filed a lawsuit, and the government came down on their side. And RCA basically had to open up all of its patents to everybody if they opened them up to anybody.
Thank you for writing this. Some key ingredients for organizational driven scientific advancement include: long term funding, risk tolerance, enough reputation to get people involved and aware, team work / collegiality, some degree of openness (depending on the problem), and of course being in the right place at the right time with the right skills, management, and people.
> Interestingly, for those labs to exist, being a monopolistic megacorp is a requirement. It appears to me that today's FAANG monopoly allowed the creation of Google Deepmind and OpenAI,
AFAIK OpenAI is still independent, despite its recent closeness with Microsoft. DeepMind existed and was active well before being acquired by Google. All these two examples prove is that today's big, monopolistic corporations tend to acquire research labs, not that they are a requirement for their existence or successful activity.
Yeah, but both labs burn hundreds of millions of dollars. Without Google/Microsoft (not to mention Tesla/YC money), they would have died before bringing these kinds of results to market.
Industry is not going to fund the overwhelming majority of research areas in biology, physics, chemistry, mathematics, etc. Data science and AI are an exception, where people in industry are much better paid, and can get access to much better resources that would be hard to afford in academia... It’s not surprising this type of advance came from an industry funded group. On the other hand, it is academia and its structure that has enabled so many other discoveries, for example, Crispr DNA tech, our understanding of gravitational waves, or the proof of the Poincaré conjecture.
DeepMind does not train its own future PhDs, but, yes, it offers a more rewarding environment for those PhDs to flourish in after they get the basic training.
I think so, too. Linear algebra, control theory and quantum mechanics haven't gotten us anywhere and ivory towers prevail as this machine learning solution to a problem in biological chemistry clearly demonstrates.
/s
Almost every single one of the tens of thousands of papers by the hundreds of tenured academics in the field of protein folding has been made obsolete by 10 Google engineers.
This is what it's like when someone really moves the needle. And academic science cannot get its head around it.
And yet, none of these scientists will suffer any career consequences. Their irrelevant work will be healthily cited by all the other scientists who are doing and have done irrelevant work. They'll retcon a story in their lit reviews about how their irrelevant work led to this.
The career consequences are saved for those who had their eye on the real ball for the last five years, but didn't get there first. For them, the comfortably irrelevant will have the gall to ask in accusatory tones: "What have you been doing these last 5 years?".
I particularly like the rant on pharmaceutical companies' lack of basic research. My impression has been that medical progress has been slow for quite some time; nice to see that there is some truth to that.
In the end software and tech companies might just eat up the pharmaceutical industry as well. - It's all just code at some level.
The DeepMind team did this with:
"We trained this system on publicly available data consisting of ~170,000 protein structures from the protein data bank together with large databases containing protein sequences of unknown structure. It uses approximately 128 TPUv3 cores (roughly equivalent to ~100-200 GPUs) run over a few weeks, which is a relatively modest amount of compute in the context of most large state-of-the-art models used in machine learning today."
So it wasn't out of reach for academia, pharmaceuticals, or others with a bit of resources.
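As a back-of-the-envelope check on "relatively modest", the quoted figures can be turned into device-hours. The 128 TPUv3 cores and the 100-200 GPU equivalence come from the quote; "a few weeks" is not quantified there, so the 3 weeks below is purely my assumption, and no prices are included because the quote gives none.

```python
# Figures taken from the quoted DeepMind statement.
TPU_CORES = 128
GPU_EQUIV_RANGE = (100, 200)

# Assumption: "a few weeks" is taken as 3 weeks. Change this to taste.
ASSUMED_WEEKS = 3

HOURS_PER_WEEK = 7 * 24

def device_hours(devices, weeks):
    """Total device-hours if `devices` run continuously for `weeks` weeks."""
    return devices * weeks * HOURS_PER_WEEK

tpu_core_hours = device_hours(TPU_CORES, ASSUMED_WEEKS)
gpu_hours_low, gpu_hours_high = (device_hours(n, ASSUMED_WEEKS) for n in GPU_EQUIV_RANGE)

print(f"TPUv3 core-hours: {tpu_core_hours:,}")
print(f"GPU-hours (equivalent): {gpu_hours_low:,} to {gpu_hours_high:,}")
```

This only covers a single training run of the final system under that assumed duration, not any of the experimentation that led up to it.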
This is the cost of training the final architecture with all the refinements enabled by years of research.
These years of research involved trying many different architectures, many of which received as much or more compute time than the final system.
The price of training the final architecture is meaningless. Researching and training AlphaGo was expensive but it enabled the ideas and development of AlphaZero which is more computationally tractable.
To have any chance, an academic team would need the same compute resources as what the DeepMind protein folding team used during the whole development of the architecture during the last few years, not only the resources used to train the final system. And I bet this funding is not available to most if not all academic teams.
Even if you try to account for the overall R&D cost, DeepMind isn't that large an organization by the standards of biomedical research. It's very big and well funded for a computer science research organization, yes, and most CS departments can't match its resources. But the NIH budget is $40 billion, and private pharmaceutical companies do another $80 billion in annual R&D. It's interesting that this kind of breakthrough didn't come from those sectors.
DeepMind is taking advantage of NIH's funding.
For example, Anfinsen who demonstrated that proteins fold spontaneously and reproducibly (https://en.wikipedia.org/wiki/Anfinsen%27s_dogma) ran a lab at NIH. Levinthal (who postulated an early and easily refutable model of protein folding) was funded by NIH for decades. Most of the competitors at CASP are supported by NIH and its investments have contributed to the modern results significantly.
That said, I think the academic and pharma communities had engineered themselves into a corner and weren't going to see huge gains (even though they are exploring similar ideas) for a number of banal reasons.
That's a good point; this system certainly didn't come from nowhere! The protein datasets they used also mostly came out of various NIH-funded projects.
What I meant to focus on was that I think DeepMind has less of a pure money/scale advantage in this area than in some others. In something like Go or Atari game-playing, there are many academic groups researching similar things, but their resources are laughably small compared to what DeepMind threw at it. So you might argue that they got good results there in part because they directed 1000x the personnel and compute at the problem compared to what any academic group could afford. In biomed though, their peers in academia and industry are also pretty well-funded.
Personally I think a major part of the secret sauce is Google's internal compute infrastructure. When I was an academic, 50% of my time went to building infra to do my science. At Google, petabytes of storage, millions of cores, algorithms, and brains were all easily tappable within a common software repo and cluster infrastructure. That immediately translates to higher scientific productivity.
Mostly? I left google to work at a biotech startup working in a related area and found that the big three cloud providers have built systems that greatly improve computational science. That said, it's still a lot of work to get productive, many in the field are really resistant to changes like version control, continuous integration, testing, and architecting distributed systems for handling complex lab production environments.
It seems like spending these government funds on creating new challenges like CASP and ImageNet could have an enormous ROI. Don’t let them try to choose the winner, just let them define the game
> The price of training the final architecture is meaningless.
The research is the giant shoulders you stand on, the compute cost is the price of the tool you need to do the present-day work.
Both are relevant, but the shoulders of giants are generally more accessible, particularly if we’re talking about published research and not proprietary tech.
A competing team is not starting from the same place the DeepMind team started at 5 or 10 years ago.
To expand on this: after fully reading AlQuraishi's "What Just Happened" post from a couple of years ago, here is the point he made:
> I don’t think we would do ourselves a service by not recognizing that what just happened presents a serious indictment of academic science. There are dozens of academic groups, with researchers likely numbering in the (low) hundreds, working on protein structure prediction. We have been working on this problem for decades, with vast expertise built up on both sides of the Atlantic and Pacific, and not insignificant computational resources when measured collectively. For DeepMind’s group of ~10 researchers, with primarily (but certainly not exclusively) ML expertise, to so thoroughly route everyone surely demonstrates the structural inefficiency of academic science. This is not Go, which had a handful of researchers working on the problem, and which had no direct applications beyond the core problem itself. Protein folding is a central problem of biochemistry, with profound implications for the biological and chemical sciences. How can a problem of such vital importance be so badly neglected?
In short, academia got utterly schooled by a small group at Google spending a relatively small dollar amount on compute, using techniques that in hindsight are fairly described as "simplistic". There's no way around it.
I don't think AlQuraishi really hits the mark in his critique. The mere fact that hundreds or thousands of people have worked on a problem for decades doesn't account for the fact that the field of machine learning has been growing extremely rapidly over the last decade, that the compute power available has grown exponentially, and that the people working on the problem simply weren't looking at the problem in the way that the DeepMind people were looking at it.
If you were trying to get across the Atlantic, this would be like getting upset at a group of bridgebuilders for trying to solve the problem by building a bridge across instead of by inventing the airplane. The approaches are that different.
> and the people working on the problem simply weren't looking at the problem in the way that the deepmind people were looking at it.
>The approaches are that different.
I'm not sure if that analogy applies here. DeepMind wasn't the first group tackling structure prediction with machine learning. Their success lies in the innovations that they implemented (predicting interresidue distances as opposed to contacts, for example).
To be fair, I'm not sure that they are "simplistic" in the sense that, e.g., writing a neural network to recognise cat pictures is now simplistic. I don't know how many people have DeepMind levels of expertise in ML, or could implement what they have done, but I doubt it is many, and they are thinly spread amongst many interesting problems.
> The price of training the final architecture is meaningless.
Meaningless in historical terms, but meaningful in future terms. It's meaningless how long the training took because there were countless resources spent to get to that point. It's meaningful in the future, because we know that training times are fairly short, and iteration can be done fairly quickly.
I mean, credit where credit is due. Google employs some of the greatest names in artificial intelligence and the DeepMind team had a huge chunk of them working on this problem. While the resources may have been available, I don’t think any other single institution had the level of brain power.
It also makes one reconsider the notion that monopolies are entirely bad. This essentially appears to be a vanity project for Google. Though of course they'll benefit from it in many ways, but it's not like they're doing this as the core product of their service. It's a pretty awesome achievement.
Look at all of the incredible things that came out of Bell Labs during their monopolistic reign. I think a better way to put it is that not all monopolies are bad for research and progress, but many are bad for other social and economic reasons. Like any position of power, it depends on how it is used and who is using it.
> It also makes one reconsider the notion that monopolies are entirely bad.
Much like political dictators, they can be exceedingly efficient and have resources (and authority) to do things in spite of opposing interests.
People who are faced with the narrative that countries have a monopoly on a number of aspects of life find that monopolies are not a BAD THING(tm) in themselves, but that they are bad for a consumer market, as a monopoly eventually blockades aspects of the market.
I think there's some merit to the idea that huge corporate monopolies have the resources to accomplish undertakings that smaller companies cannot. But it's often a what-if, because we don't know what the alternative might have been.
Big companies can suck up all the air in the room by monopolizing talent and making it harder for startups to pay the kinds of salaries needed for top tier AI research. Xerox PARC came up with all kinds of groundbreaking inventions that were never commercialized (by them). For every invention that comes out of a big company, it's worth thinking about whether it might have actually come out faster if it was born of competition instead of a side project. Or, in the grand scheme of things, if corporate taxes were higher and the money was given to a university research lab.
I think the best results may come from the middle ground. Smaller/medium companies are so worried about staying afloat or hitting their quarterly earnings that they have trouble making long term investments. Large companies are diverse and profitable enough that they can afford to blow money on things that might not pan out, but they don't have the same drive -- and in fact have some pressure to avoid being "too" innovative because it could cannibalize their existing products.
It's kind of like a modern day Bell Labs where they have so much excess profit from adtech that they can fund lots of "basic research" or the computer science equivalent of that.
You've just described why many Socialists 100 years ago were very skeptical of anti-trust, seeing it as an attempt to sacrifice modernity to prop up a romanticized notion of the past as disaggregated pure-petit-bourgeois capitalism. Really not that different from the criticism of the Luddites 100 years before that.
This line of argument reminds me of Haldane's point that economic planning can often work for the same reasons why large corporations and monopolies often work well too.
Imagine we lived in a culture that did not believe "government is always bad at everything". Government could then pay Google-level salaries and provide Google-level resources to the top minds in the world and give them free rein to tackle problems like this. It's worked in the past, such as with the Manhattan Project or the moon landing. But I don't think it's doable nowadays because of the anti-government political culture. Even when government is fully funding things these days, the work has to be farmed out to private interests.
It'll take more than just belief in the government. We'd need people to actually care about making government better.
Most people just show up to vote once every 4 years (or less) and make their decision based on the party affiliation or the wedge issue du jour, and the rest of the time pretty much ignore what's going on or don't have the power to do anything about it, which gives a lot of leeway for special interests to slide things in under the radar.
Not even a little bit. There is nothing here that would require Google to be a monopoly to accomplish. If anything companies become lazy without competition.
I feel like that is not too far from saying it makes one reconsider communism because good things can happen with authoritarian control.
Absolutely. The capability to "create" the breakthrough is extremely rare. Perhaps only DeepMind, OpenAI, and Google Brain can assemble these types of teams. Luckily, the capability to replicate and exploit the breakthrough is far more 'common', though still very rare.
Excited to see how follow-on use of these models, by many more teams, researchers, and companies, plays out over the next two decades.
Yeah, it was a big slap in the face. But, to be fair, most of the scientific and technological advances (sequencing efforts, structural genomics projects, etc.) that generated the data used by DeepMind came from academia and, to a lesser extent, the pharma industry.
I think the lesson here is that most of the big data genomic, metabolic, pharmacologic and other research will all be driven by deep learning. The models themselves, however, require 100+ GPUs, so we are sort of back in that phase where you need large compute systems to even compete. A single lab will have issues unless they can leverage a cloud and then also get grant funding to spend that money on the cloud compute... which may be difficult b/c it's basically a consumable now and you don't have any hardware left over.
In a prior(/n) life I worked on Protein folding, and participated in CASP.
This was a/the "holy grail" problem of molecular biology, long thought to be an automatic Nobel. It's somewhat unfair to characterise developments prior to this as insignificant. In fact by the time I was working on it, that "automatic Nobel" was no longer assumed, because the field had made quite a bit of progress, in many tiny steps by many different groups, and the assumption was it would continue in this slog until reaching some state of sufficiency for practical applications without ever seeing the sort of singular achievement that would be worthy of praise and prize.
Far more went into this breakthrough, obviously, than those TPU-hours: the development of those TPUs, for example, and assembling a team that can make use of them. The protein folding problem requires very little knowledge of biology or physics to understand and was always pre-destined for some outsider to sweep. Indeed, there was a game that allowed people to solve structures by intuition alone, and, IIRC, some 13-year-old Mexican kid cleaned everyone's clock some years back.
Why didn't some research group do this first? Most of them just don't have the budget. We were five people, total, IIRC, and felt pretty rich because we were computer-people getting the same budget for materials as everyone at our institution, which was all wetlab, otherwise. So I was a student being paid $20/h but with a $50,000/p.a. hardware budget. How many false starts does it take before you do that run with 128 TPUs "for a few weeks" that works? If you blow your budget on one gigantic Google invoice, what's going to happen to you when it doesn't pan out, and the whole institute laughs at you? Etc...
There are quite a few rather good things this problem has inspired over the years, though. Among them is CASP itself: the idea of instituting a regular competition that gives unequivocal feedback on the state of the field and every group working on it is rather rare, I believe, and it's been successful. Indeed, it would seem that CASP was necessary to attract outside groups like DeepMind, i.e. deep-pocketed industry groups striving to prove themselves on a clearly defined problem. Chess, Jeopardy, CASP: maybe it would be worthwhile to explore not <solving x>, but <stating X as a problem that attracts Google/IBM/etc.-scale money> as a superior strategy in some cases.
There was also folding@home, pioneering the distributed-donated-computing model, and the aforementioned gamification of the problem, and hundreds of the most intricate, custom-tailored, more-or-less insane ideas people devoted months and/or careers and/or careers of their most promising post-docs to that didn't pan out.
Like cellular automata. They don't work for this, trust me. (Great hit for interactive poster sessions, though)
From what I can gather, Google bought DeepMind for 500 million USD in 2014, and as of 2019 DeepMind had outstanding debt to its parent company of 1.3 billion USD.
And they had income around 100 million in 2019 but it's all against Google, so looks like a 2 billion +/- 0.5 operation so far, and who knows if they pay for compute.
Other articles place the runrate at 500 million per year in 2019.
Which means 500 million * 6 years = 3 bn, plus the 0.5 bn purchase price = 3.5 bn.
So somewhere in the 2.5 - 3.5 billion range seems likely as the total cost so far.
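For anyone who wants to redo that back-of-envelope sum, here is a minimal sketch. Every input is a figure quoted in the comments above, and applying the 2019 run rate flatly to all six years is an assumption of the estimate itself, so treat the result as an upper bound rather than an audited number.

```python
# Back-of-envelope reconstruction of the upper estimate quoted above.
# All inputs are the commenter's figures, not audited numbers.
PURCHASE_PRICE_BN = 0.5      # reported 2014 acquisition price, billions USD
RUN_RATE_BN_PER_YEAR = 0.5   # reported 2019 run rate, assumed flat for every year
YEARS = 6                    # 2014 through 2019


def total_cost_bn(purchase_bn, run_rate_bn, years):
    """Purchase price plus a flat yearly run rate, in billions USD."""
    return purchase_bn + run_rate_bn * years


upper_estimate_bn = total_cost_bn(PURCHASE_PRICE_BN, RUN_RATE_BN_PER_YEAR, YEARS)
print(f"upper estimate: {upper_estimate_bn:.1f} bn USD")
```

Since the run rate was presumably lower in the early years, the true figure would sit below this, which is consistent with the 2.5 - 3.5 billion range.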
Nevertheless doesn't seem out of reach for a multinational.
It would still be a significant amount of money for a lot of companies.
Remember, we are looking in hindsight that it seemingly paid off. A few years ago, this was just an educated bet; only the richest companies with money to burn (from selling ads) would be willing to take on that kind of a risk.
I appreciate this tremendous 3.5B subsidy that Google brought to basic ML research and R&D.
There is barely any multinational that has the freedom Google had of planning to spend 3.5B with no ROI. Their shareholders would sue and vote the managers out.
Also, pharma does not really have a huge incentive to work on this problem. Solving the protein folding problem does not automatically translate to new drugs just in the same way CRISPR or DNA sequencing did not. It's another tool in the toolbox (which to be clear is a big deal).
The competition requires enough revealing about the methodology for other teams to replicate it so open implementations are going to be available for sure.
It also looks like they came up with a brand new jiggling algorithm, which is probably just V1 now; this really changes things in a significant way!
I expect this to be quickly replicated once published. Training data is public and training compute is not enormous and AlphaFold of 2018 did get replicated.
CASP typically works this way: one person "wins" by getting a slightly higher score than everybody else. Two years later, the top teams have all duplicated the previous winner's tech, and two years after that, there's a github you can download and run on your GPU to reproduce everything.
How do you define enormous? "It uses approximately 128 TPUv3 cores (roughly equivalent to ~100-200 GPUs) run over a few weeks". Also last time it took about a year for good replications to pop up.
A year is a fast time to replication in many scientific fields.
While substantial, the resources here are well within reach of many labs, research institutes, and organizations. For a result this big, I'd guess we'll have 2-6 additional implementations in the next 18 months. The problem has been 'open' for 40+ years, so that's lightning fast!
A couple of hundred GPUs is well within the reach of many even moderately well-heeled research institutes. It'd seem that about 3 weeks of compute time with 128 TPU v3s would be about $170,311.68.
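A quick sanity check of that dollar figure, as a hedged sketch: the hourly rate below is not taken from any price list, it is simply back-calculated from the quoted total, 128 units, and 3 weeks of wall-clock time. The function name and constants are made up for illustration.

```python
# Back-calculate the per-unit hourly rate implied by the quoted total.
# The rate is inferred from the comment's own numbers, not from a price list.
HOURS_PER_WEEK = 7 * 24

QUOTED_TOTAL_USD = 170_311.68
UNITS = 128   # "128 TPU v3's" as quoted
WEEKS = 3     # "about 3 weeks"


def training_cost_usd(units, weeks, usd_per_unit_hour):
    """Cost of running `units` accelerators continuously for `weeks` weeks."""
    return units * weeks * HOURS_PER_WEEK * usd_per_unit_hour


implied_rate = QUOTED_TOTAL_USD / (UNITS * WEEKS * HOURS_PER_WEEK)
print(f"implied rate: ${implied_rate:.2f} per unit-hour")
print(f"check: ${training_cost_usd(UNITS, WEEKS, implied_rate):,.2f}")
```

Plugging in a different rate or duration shows how sensitive the total is to "a few weeks" being three, four, or six.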
But of course that cost would only be for the final model. Anyway, I think I am just living in a different world... :-) We could never compete with that
Yah, big grant money. Now the grad students programming the open source clones will only make approximately $0.56, or 4.2 Ramen packs, for their effort. ;)
Also worth keeping in mind that once a good open source model is available, researchers with fewer resources can still use it to fine-tune and get new results for far cheaper than training a new model from scratch.
A lot of labs have access to the various strategic supercomputers of the USA.
Ex: Summit has 27,648 V100 GPUs (and those V100s have Tensor units). If you're saying that only 200 GPUs are needed to replicate the experiment, that doesn't even use up 1% of Summit's available utilization.
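The "not even 1%" claim is easy to check. This is just the arithmetic from the comment, using the ~100-200 GPU range quoted elsewhere in the thread as an assumed GPU-equivalent of the TPU run:

```python
# Fraction of Summit's V100s that a ~100-200 GPU job would occupy.
# The GPU-equivalent counts are the rough range quoted in the thread.
SUMMIT_V100_GPUS = 27_648


def summit_fraction(gpus_needed):
    """Share of Summit's GPUs used by a job of the given size."""
    return gpus_needed / SUMMIT_V100_GPUS


fraction_low = summit_fraction(100)
fraction_high = summit_fraction(200)
print(f"{fraction_low:.2%} to {fraction_high:.2%} of Summit")
```

This says nothing about queue time or allocation policy, only about raw capacity.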
> However, if the (AlphaFold-adjusted) trend in the above figure were to continue, then perhaps in two CASPs, i.e. four years, we’ll actually get to a point where the problem can be called solved, in terms of gross topology (mean GDT_TS ~ 85% or so).
Interesting prediction within.
It turned out to take only one more CASP (two years) instead of two CASPs (four years), depending on whether getting to the ~90 range counts as "solved".
I'm curious to see if AlphaFold can do even better the next two years.
Those last mile percentages always tend to be small anyway.
> Now that the problem of static protein structure prediction has been solved (prediction errors are below the threshold that is considered acceptable in experimental measurements)
This seems premature. Even though it does very well on average, there may be some areas where it struggles, and those areas may turn out to be important.
Sometimes announcements like this are a bit over-the-top. But what really, to me, cements the 'big-deal' of this is the "Median Free-Modelling Accuracy" graph half way down the page.
Scores of 30-45 for 15 years. Now scores of 87-92.
This isn't a minor improvement, it's a leap forward.
That is an impressive improvement, but I think you've missed the most important point:
>a score of around 90 GDT is informally considered to be competitive with results obtained from experimental methods
So DeepMind is to the point where it's a question of whether their generated model or the experimentally determined structure is closest to the actual physical structure.
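For readers unfamiliar with the metric: GDT_TS is usually described as the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues whose C-alpha atom lies within that cutoff of the reference. A simplified sketch follows. It assumes the per-residue distances are already computed under one fixed superposition, whereas the real score searches over superpositions, and the toy distances are invented.

```python
# Simplified GDT_TS: average fraction of residues within 1, 2, 4 and 8 angstroms.
# Assumes distances come from a single fixed superposition (a simplification).
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(distances):
    """distances: per-residue C-alpha deviations (angstroms), model vs. reference."""
    if not distances:
        raise ValueError("need at least one residue")
    n = len(distances)
    fractions = [sum(d <= cutoff for d in distances) / n for cutoff in GDT_TS_CUTOFFS]
    return 100.0 * sum(fractions) / len(fractions)


toy_distances = [0.4, 0.8, 1.5, 3.0, 9.0]  # hypothetical deviations
toy_score = gdt_ts(toy_distances)
print(f"GDT_TS = {toy_score:.1f}")
```

Note that one badly placed residue (the 9.0 here) costs a fixed amount no matter how far off it is, which is part of why the score is more forgiving of outliers than RMSD.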
Then we get the really fun question: if the experimentally determined structure is only 90% accurate, can machine learning actually reach 100%? Can you learn exact truth from inexact examples?
Which gets into the concept of whether the ML model has actually learned some deeper conceptual ideas than we have, some deeper truth about how this works. If so, can we somehow extract that truth, or is it truly a black box that does the thing we want?
I'm reminded of a sci-fi book I read long ago in which humans are discussing the fact that the science they are utilizing is beyond the scope of a human mind to comprehend: only the AIs can intuitively deal with 12-dimensional manifolds (or something to that effect). Maybe we've reached the doorstep of that future.
If you have an experimental error that is somewhat normally distributed around the mean, then the AI should, with enough examples, learn the rules that are closest to the mean, because it will minimize the sum of errors.
So I do think the results could be more accurate than measurement.
I don’t think we can assume the errors are normally distributed. It’s possible researchers are biased in a particular “direction”, away from 0 on all dimensions of this problem.
That's fine. It's still a normal distribution with a different mean. The Gaussian is characterized by having only the first two moments: mean and variance.
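One way to read this exchange is as the difference between random noise and a systematic offset. A small simulation sketch (the true value, bias, and noise level are made-up numbers) suggests that averaging many measurements shrinks the random part, but the average settles on the shifted mean, so a constant bias is learned rather than removed:

```python
import random
import statistics

# Toy illustration: averaging removes random noise but not a systematic offset.
# TRUE_VALUE, the bias and sigma are arbitrary assumptions for the demo.
TRUE_VALUE = 10.0


def simulate_measurements(true_value, bias, sigma, n, seed=0):
    """Draw n Gaussian measurements centred on true_value + bias."""
    rng = random.Random(seed)
    return [rng.gauss(true_value + bias, sigma) for _ in range(n)]


unbiased = simulate_measurements(TRUE_VALUE, bias=0.0, sigma=1.0, n=10_000)
biased = simulate_measurements(TRUE_VALUE, bias=0.5, sigma=1.0, n=10_000)

err_unbiased = abs(statistics.fmean(unbiased) - TRUE_VALUE)
err_biased = abs(statistics.fmean(biased) - TRUE_VALUE)
print(f"error of mean, unbiased data: {err_unbiased:.3f}")
print(f"error of mean, biased data:   {err_biased:.3f}")
```

So "still a normal distribution with a different mean" holds, but whether that is "fine" depends on whether the different mean is the thing you wanted to learn.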
> Which gets into the concept of whether the ML model has actually learned some deeper conceptual ideas than we have, some deeper truth about how this works.
Well, I think that the results speak for themselves; ultimately the question you raise is one of semantics. ML models don't think in terms of "conceptual ideas" like humans do; these models simply perform at such a massive statistical scale that they can identify patterns far beyond any human conception. Clearly, the model embodies some verifiably reliable information about the way the world works, but this is "just" a trick of statistics, not anything resembling actual "understanding" in the way the word is typically used when referring to human understanding.
I have a related question about this. If experimental methods produce results around a score of 90, what is the baseline we are comparing the DeepMind results against? If the experimental error is equal to the observed DeepMind error, how can we say which one is actually more erroneous?
You really can't compare stats like that. Those are independent, uncorrelated measurements. When you take RMSD measurements on a molecule they are not independent (for example, atoms near the core are less likely to be "inaccurate").
The "experiments" here use X-Ray Crystallography. Like most methods of measuring anything, we have a pretty good idea of its accuracy under various conditions.
Think of it like satellite imagery of a tree: A score of zero would be a single green-ish pixel, while a score of 100 would show each leaf within the range it naturally moves in due to wind etc. (proteins tend to wiggle quite a bit under natural conditions, as well)
Finding the energy of each configuration should be much easier than finding the lowest-energy configuration. Can that be calculated ab-initio or is it still too expensive?
The problem with ab-initio methods in this context is the sheer number of non-covalent interactions present in these large proteins. A simple protein would require a hybrid quantum mechanics/molecular mechanics simulation to even approximate the vibrational energy required to validate equilibrium.
These proteins are so massive that we often use Daltons [1] as an averaged measure of molecular weight.
Conceptually one of the most promising applications of quantum computing is theoretical chemistry, and we are only now starting to make progress in this avenue [2]. I anticipate it would require quantum computing to explicitly optimise large folded proteins.
"So DeepMind is to the point where it's a question of whether their generated model or the experimentally determined structure is closest to the actual physical structure."
While this is an accomplishment, nobody is going to be confusing these models for structures produced experimentally. The CASP metric is for backbone atoms. To have a useful model of protein structure, you really need to have the positions of the protein side-chain atoms modeled correctly. Experimental methods will do that, but this method, as I understand it, does not.
So it's a really good start, but nobody is going to be throwing these structures into molecular docking simulations for drug discovery or etc just yet. But hopefully those details can be worked out soon enough.
Yeah, there's a huge difference between a 1Å all-atom RMSD structure, and a 1Å backbone RMSD structure. The non-backbone atoms in a protein make up most of the mass and volume. When structural biologists talk about RMSD, this is what they mean.
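To make the backbone vs. all-atom distinction concrete, here is a toy sketch. The five "atoms" and their coordinates are invented, and the two structures are assumed to be already superposed, so no alignment step is performed. The backbone atoms are nearly right while the side-chain atoms are 2 Å off:

```python
import math

# Toy comparison of backbone-only RMSD vs. all-atom RMSD.
# Coordinates are invented; structures are assumed to be already superposed.


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and the same length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# (atom name, is_backbone, reference xyz, model xyz)
atoms = [
    ("N",  True,  (0.0, 0.0, 0.0),  (0.1, 0.0, 0.0)),
    ("CA", True,  (1.5, 0.0, 0.0),  (1.5, 0.1, 0.0)),
    ("C",  True,  (2.5, 1.0, 0.0),  (2.5, 1.0, 0.1)),
    ("CB", False, (1.5, -1.5, 0.0), (1.5, -1.5, 2.0)),
    ("CG", False, (1.5, -3.0, 0.0), (3.5, -3.0, 0.0)),
]

backbone = [a for a in atoms if a[1]]
backbone_rmsd = rmsd([a[2] for a in backbone], [a[3] for a in backbone])
all_atom_rmsd = rmsd([a[2] for a in atoms], [a[3] for a in atoms])
print(f"backbone RMSD: {backbone_rmsd:.2f} A, all-atom RMSD: {all_atom_rmsd:.2f} A")
```

The same model looks excellent by one number and mediocre by the other, which is the point being made about side chains.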
I don't have a background in biology, and that quote confused me.
What's an experimental method for protein folding and why is it so good? Are they talking about creating an actual, physical protein in a lab and observing how it folds?
> Are they talking about creating an actual, physical protein in a lab and observing how it folds?
Exactly. Researchers purify the folded protein and then use methods such as X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy to determine its three-dimensional atomic structure.
I’m not a protein crystallographer, but here’s my generalist take.
We understand the physics of e.g. X-ray diffraction pretty well, so we can fit pretty decent forward models for the x-ray data given a proposed structure. The hardest task here is getting a good enough guess at the structure to optimize the physical model, and it’s my impression that people use an iterative model refinement workflow. At least that’s how it’s done in condensed matter materials.
There are many sources of experimental uncertainty, like the non-ideal nature of the x-ray source and optics, and the fact that the atoms in the protein are not static but have some thermal fluctuations. So at the end of the refinement you still have some uncertainty on your model parameters (the interatomic distances for proteins, I guess), but if you are careful you can calibrate these uncertainties pretty well.
X-ray diffraction is pretty nutty, too. You're taking the diffraction pattern, which is the Fourier transform of the electron density. Fourier transform results are complex-valued data. Unfortunately, we don't really have X-ray lasers, so you can only get the intensities and not the phases of those diffraction spots. Since mother science hates us, it is of course the case that, in a Fourier transform, "more information" is contained in the phases than in the intensities.
So you "make guesses at what the phases are". The best choice is to bootstrap these phases from ones measured with another technique (you can introduce crystal defects that do allow you to guess at what the phases are).
Less scrupulous is to use a computer-generated model, like fitting another protein "that you guess is related": you model the electron density and take the phases of that.
In any case you take these "phase" guesses and apply them to your intensities, re-run the Fourier transform, refine your electron densities, twiddle the locations where you think the atoms are, then repeat with your new model. This process repeats until you converge on a structure that you're happy with.
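A tiny 1D illustration of why losing the phases hurts, as a sketch only: the "densities" below are invented eight-point signals, and a plain discrete Fourier transform stands in for the real diffraction physics. Two different densities give exactly the same intensities, and only the phases tell them apart.

```python
import cmath

# Toy 1D phase problem: different densities, identical intensities.
# The signals are invented; a plain DFT stands in for real diffraction.


def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


density_a = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
density_b = density_a[-3:] + density_a[:-3]  # same peak, shifted by 3 positions

coeffs_a = dft(density_a)
coeffs_b = dft(density_b)

intensities_a = [abs(c) ** 2 for c in coeffs_a]
intensities_b = [abs(c) ** 2 for c in coeffs_b]

max_intensity_gap = max(abs(x - y) for x, y in zip(intensities_a, intensities_b))
phase_gap_k1 = abs(cmath.phase(coeffs_a[1]) - cmath.phase(coeffs_b[1]))
print(f"largest intensity difference: {max_intensity_gap:.2e}")
print(f"phase difference at k=1: {phase_gap_k1:.2f} rad")
```

A pure shift is the most benign ambiguity, but the same loss of information is what forces the guess-and-refine loop described above.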
Now alarm bells should be screaming in your head right now: Yes, it's entirely possible to converge on a wrong structure, especially if you're a young up-and-comer professor seeking tenure who has no ethical problems with "suggesting" that grad students sleep in the lab and work 100-hour weeks, and who is willing to do slipshod work to get tenure: https://www.sciencedirect.com/science/article/pii/S002228360...
Best is an orthogonal process (like NMR). Cryo-EM is getting better too, so maybe that will start to be viable. Sometimes that's not possible, but you can use secondary evidence: "we know these three amino acids are important and hey look, they touch in our model".
I'm not a biologist but I'm not sure that follows. It could be that the experimentally-derived structure is 100% accurate to the actual physical structure but getting 90% of your predicted residues to match that is enough to get an accurate prediction of protein behavior and hence "competitive."
Something like this comes up in assessing the accuracy of automated segmentation results of brain regions e.g. the hippocampus. Human-machine reliability is approaching the human to human reliability, so it becomes harder to improve the automated methods.
I don’t think you can say DeepMind could ever be more accurate to the true physical structure since it was built on the same experimental structures that it is being compared to. The limit of accuracy is the experimental data. However, I think we can say that a DeepMind prediction could at least be as good as a new experimental structure.
This seems like an obvious assumption to make, but it isn't always true. It is easier to see why if you are measuring a single value multiple times in order to get a more accurate estimate of the true value. In that case your "model" is simply the mean of all measurements made and can exceed the accuracy of a single measurement.
In this case, the model is predicting values of multiple structures, but patterns could still theoretically be found which allow for predictions beyond the accuracy of a single measurement.
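The single-value case is easy to simulate. A minimal sketch, with an arbitrary noise level and sample count chosen only for the demo: the error of the mean of 25 noisy measurements is compared against the error of one measurement, over many repeated trials.

```python
import math
import random

# Toy check: the mean of n noisy measurements beats a single measurement.
# sigma, n_measurements and n_trials are arbitrary choices for the demo.


def rmse_single_vs_mean(sigma=1.0, n_measurements=25, n_trials=2000, seed=1):
    """Return (RMSE of one measurement, RMSE of the mean of n measurements)."""
    rng = random.Random(seed)
    true_value = 0.0
    sq_single = 0.0
    sq_mean = 0.0
    for _ in range(n_trials):
        samples = [rng.gauss(true_value, sigma) for _ in range(n_measurements)]
        sq_single += (samples[0] - true_value) ** 2
        sq_mean += (sum(samples) / n_measurements - true_value) ** 2
    return math.sqrt(sq_single / n_trials), math.sqrt(sq_mean / n_trials)


single_rmse, mean_rmse = rmse_single_vs_mean()
print(f"single measurement RMSE: {single_rmse:.2f}")
print(f"mean of 25 RMSE:         {mean_rmse:.2f}")
```

For independent noise the improvement scales roughly as one over the square root of the number of measurements. Whether anything comparable carries over to a model pooling patterns across many different structures is the open question in this subthread.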
DM is merging several kinds of experimental data: known x-ray structures and evolutionary data. The experimental method (xray) doesn't take advantage of the evolutionary data. And it also doesn't model the underlying protein behavior accurately (xray basically assumes a single static model with atoms fluctuating in little gaussian "puffs" around the atomic centers, but that's not how most proteins behave).
But DeepMind could be used to find errors in the training set.
Let’s say you have 100000 proteins in the training set. Now remove #1 and train on 99999, and then check that it still predicts the same protein result for #1 as the experimental result.
Or remove from training whole sets of proteins by particular teams to find systematic errors made by teams?
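The hold-one-out idea can be sketched on toy data. Here a plain average of the remaining values stands in for "retrain the model without item i", which is of course nothing like retraining a structure predictor. The numbers and the 3x threshold are invented for illustration.

```python
import statistics

# Toy leave-one-out check for suspicious training entries.
# The "model" is just the mean of the remaining values (a stand-in assumption).


def leave_one_out_residuals(values):
    """For each item, predict it from all the others and return the absolute error."""
    residuals = []
    for i, held_out in enumerate(values):
        rest = values[:i] + values[i + 1:]
        prediction = statistics.fmean(rest)  # "trained" without item i
        residuals.append(abs(prediction - held_out))
    return residuals


def flag_suspects(values, factor=3.0):
    """Indices whose held-out error exceeds `factor` times the median error."""
    residuals = leave_one_out_residuals(values)
    typical = statistics.median(residuals)
    return [i for i, r in enumerate(residuals) if r > factor * typical]


measurements = [5.1, 4.9, 5.0, 5.2, 9.0, 4.8, 5.0]  # hypothetical values
suspects = flag_suspects(measurements)
print(f"suspect indices: {suspects}")
```

A flagged entry is only a candidate: it could be an experimental error, or a genuinely unusual protein that the model cannot predict from the rest. Holding out all entries from one team at a time would be the grouped version of the same loop.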
Is that true? I thought fundamentally, the simulation tries to find the state of lowest energy, which is defined by physics. So, your result can be better than the data set used for training.
This reminds me of AlphaGo and AlphaZero. DeepMind was able to produce a very solid model on their first attempt, at both protein folding and at Go (and Starcraft2 as well). Their second models, however, seemed to blow their first out of the water.
This bodes extremely well for the future of computational biology, I'm very excited thinking about the prospects. If we know how a protein folds, we know its shape, meaning we know which shaped/charged molecules are needed to act as suppressors/enhancers of those proteins.
One difference to AlphaZero though, if my understanding is correct, is that AlphaFold is trained on a predetermined data set and hence didn’t learn how “arbitrary” proteins fold in general, but just how the kinds of proteins fold for which we already know how they fold. To work more like AlphaZero, AlphaFold would have to be able to synthesize arbitrary proteins and run the experiments on them to verify and correct its predictions. Therefore it’s conceivable that AlphaFold is biased by the existing training data and doesn’t fully generalize to all proteins we would want to apply it to. Maybe that won’t be a problem in practice, but nevertheless it makes for a significant difference from what AlphaZero was about, being solely self-trained.
> AlphaFold would have to be able to synthesize arbitrary proteins and run the experiments on them to verify and correct its predictions.
Could this lead to a virtuous cycle where AlphaFold is used to generate a ton of random sequences where it has low confidence, those are then screened for ease of synthesis, measured, and the results used to improve the model?
Edit: nevermind, according to another comment[0] there are still plenty of real proteins without experimental data left to explore.
> AlphaFold would have to be able to synthesize arbitrary proteins and run the experiments on them to verify and correct its predictions.
It can verify how much it minimizes the potential energy, which may not always line up with how it would fold in the real world but is a strong indicator.
Not to mention the fact that two years ago they took it from 45% to >60%. If they can continue improving, even with an exponential decay in rate of improvement, this is certainly a stunning example of technological disruption.
Even without any improvement, the amount of grunt-work the AI can pre-do and get down to a short-list - that in itself will see changes in progress speeding research up.
Why is the graph not monotonically increasing? Does the complexity of the problem to be solved increase each time? If so, does that make the relative improvement from the previous result even more impressive?
That's quite interesting ... I believe the test set size is not constant year to year but rather a function of how many new structures have been experimentally discovered since the last contest?
Does seem like the contest structure could include quite a bit of risk for hiding the effect of overfitting ... I wonder if there is anything inherent about the problem that reduces that risk ...?
My understanding is that it's always 100 new structures, which is a small fraction of the total structures identified in that year.
The reason why the top score in one year can be lower than in the previous year is that the test (the 100 structures to guess) is always new and different, so it can end up being 'harder' than the year before. Luck will also play a small role.
Another explanation for a reduction in the top score would be that previous winners are not re-submitted unchanged. For instance, AlphaFold v1 seems not to have been submitted to the latest competition.
Only 100 new structures each test cycle? That seems a very small test set size ...
Is it really possible to select 100 new structures which together are likely to represent a meaningful increase in the sample generalization versus the prior year's test set ...?
Given that we only know the structure of on the order of 100k proteins, we might only get another 10k new ones per year. I guess.
Using 1% of those (presumably from the more-often-reproduced subset) for this challenge seems reasonable? Note that the structures have to remain secret up until the challenge, and presumably all those teams uncovering the structures don't want to have to wait up to 2 years every time to actually make their results public.
Interesting ... plenty of opportunity then potentially for the 100 samples to have prediction similarity to the set of published discoveries (for expected or unknown reasons)?
I suppose it will take a few more years of repetition for the challenge to confirm that the problem has been solved -- but I wonder if a new version of the contest is going to be needed as well? Maybe the model accuracy is now high enough to invert the contest to a form where models generate predictions for randomly selected unknown samples -- and experimental teams are then expected to make observations for those particular sequences over the next two years, alongside the experimental workload otherwise selected by their research agenda?
There are different categories of samples, namely FM and TBM targets. FM targets don't have any similarity to known structures. Roughly a quarter were FM targets. I think the more interesting thing to look at is the size of the multiple sequence alignments (MSAs) which is the basis of this and essentially all methods. They seem to do very well with few MSAs, which bodes well for other targets, although there are families of proteins with few MSAs.
100 structures with 100+ amino acids each, so it's not quite as bad. Part of the folding information is contained within a distance of a few amino acids, while some (the harder part and crux of the problem) is farther away.
But yeah, compared to other fields, the size of training/test sets is sometimes pretty small in ML for life sciences.
Not knowing a lot about biotechnology, I read the article and it sounds great, but how big is this as a gamechanger? Can someone comment on how big are the implications of this in, let’s say, 5 years from now, on day to day life? Does this mean that biotech is going to explode? Or just that drugs will come to market faster, perhaps cheaper for rare diseases, but from the same industry structure as always?
Protein folding is a big and important problem, so this is certainly big news if it works as well as it seems. But I wouldn't assume that this changes everything, we can already determine how proteins fold by experimental work. The disadvantage is that this is a lot of work, though the methods there also improved a lot.
One question is how robust the predictions are that DeepMind produces. I would also assume that right now it can't e.g. determine protein structures in the presence of other small molecules, or protein complexes. A lot of the interesting stuff lies in the interactions between molecules.
And in general in life sciences any new development will take at least a decade until it hits day to day life, likely even more. We're living with an exception to this rule right now due to the pandemic, but in general things take quite a bit of time in that space.
We can already determine how a few proteins (170k — which sounds like a lot, but which is only 0.09% of all currently-catalogued protein sequences) fold by experimental work.
What an accurate model of protein folding allows us to do, is to take our big database of DNA, predict protein foldings for all of it, and then stand up a search index for this database, keying each amino-acid "row" by the "words" of its predicted protein's structural features.
We could then, with a simple search query that executes in O(log n) time, find DNA targets that produce molecules with interesting structures that might be worthy of study.
This would, for example, be a game-changer in how biopharmaceutical macromolecule-therapy R&D is conducted. Right now we have to notice that some bacterium or another produces some interesting protein, and then engineer a bioreactor to get more of that protein. With this tech, we can work backward from an entirely hypothetical, under-specified "interesting protein", to figure out what catalogued-but-unstudied DNA sequences produce never-before-catalogued proteins that fit that particular functional "shape", and therefore might do the interesting thing. Then we can either directly synthesize that same DNA, or find the organism we originally sampled it from and study it more.
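To make the "search index" idea above concrete, here is a minimal sketch of a sorted index with binary-search lookup. Everything in it is an assumption for illustration: the feature labels, the sequence ids, and the idea that a predicted structure can be reduced to a single string key are all made up, and nothing here comes from a real folding model or database.

```python
import bisect

# Hypothetical records: (structural-feature key, sequence id). The keys are
# invented labels for illustration, not output of any real folding model.
records = [
    ("beta-barrel", "seq_004"),
    ("helix-bundle", "seq_001"),
    ("helix-bundle", "seq_007"),
    ("tim-barrel", "seq_002"),
    ("zinc-finger", "seq_003"),
]

def build_index(rows):
    """Sort once so that later lookups can use binary search."""
    return sorted(rows)

def lookup(index, feature):
    """Return all sequence ids whose key equals `feature`.

    Binary search finds the first candidate in O(log n); collecting the
    k matching rows adds O(k).
    """
    # A 1-tuple sorts before any 2-tuple sharing the same first element,
    # so this lands on the first row with the requested key.
    i = bisect.bisect_left(index, (feature,))
    hits = []
    while i < len(index) and index[i][0] == feature:
        hits.append(index[i][1])
        i += 1
    return hits

index = build_index(records)
print(lookup(index, "helix-bundle"))  # ['seq_001', 'seq_007']
```

A real system would presumably need fuzzy or multi-feature matching rather than exact string keys, which is a harder problem than this sketch suggests.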
"A few" does appear quite dismissive of the enormous amounts of effort in structural biology so far. There are more than 170,000 structures in the PDB right now.
To determine potential targets for drugs we have to understand what the proteins do. Having the structure is not really enough for that, it doesn't tell you the purpose of the protein (though it certainly can give you some hints).
In most cases the proteins were determined to be interesting by other experiments, and then people decided to try and solve their structure. So the structures we already solved are also biased towards the more biologically relevant proteins.
170k is "a few" compared to 180 million (i.e. the size of the PDB as soon as someone runs AlphaFold over everything in the UniProt.)
> In most cases the proteins were determined to be interesting by other experiments, and then people decided to try and solve their structure.
Yes, that's what we're doing right now, because structure is not a useful predictor, because we don't have structure available in advance of studies on the protein itself. There was no point to a "functional taxonomy" of proteins, because we were never trying to predict with protein-structure as the only data available.
In a world where protein structure is "on tap" in a data warehouse, part of the game of bioinformatics will become "structural analysis" of classes of known-function proteins, to find functional sub-units that do similar things among all studied proteins, allowing searches to be conducted for other proteins that express similar functional sub-units.
Determining what a protein structure does might be even harder than folding. Right now we can't really do that ab initio; you have to determine the activity in the lab and then look at the structure. And that allows you to potentially identify this motif in other proteins.
If someone produces an AI that you give a sequence and it tells you what the protein does exactly, I'd be extremely impressed. I don't see that happening soon.
The specifics matter a lot here. We can often determine rough functions for subdomains by homology alone. But that really doesn't tell you the full story, it only gives you some hints on what that protein actually does.
"If someone produces an AI that you give a sequence and it tells you the protein conformation, I'd be extremely impressed".
Sure there are many more things to solve in this space; but that doesn't take away that this is an impressive achievement and does unlock quite a few things (including making more tractable the problem you just brought up). I'm excited to see what DeepMind works on now and what the new state of the world will be just five years from now.
I think I have to clarify that my response was to a large part to the "this will change all our lives" part, and might look too negative on its own. I'm very, very impressed by these results, but that still doesn't mean that we just solved biology. If this works that well on folding, this could mean that a lot of other stuff that simply didn't work well in silico might come into reach.
I'm maybe overcompensating for the tech-centric population here, with some comments speculating for very near and drastic impacts from discoveries like this. Biology and life sciences are much slower, and there's always more complexity below every breakthrough. That does tend to push me towards commenting with the more skeptical and sober view here.
My understanding of this is not perfect, but wouldn't answering the "actually does" question require a full biomolecular model of the cell, or even the whole organism? If so I see what you mean. I suppose that it might be possible to get around this by improving the theory of catalysts so that you could look at a site and say, "oh, this will act in such a way..." Dynamic quantum simulation of a few atoms at the active site is hardly easy but a far sight easier than the other.
It's a step forward for sure, but structures change over time to perform their function. The method described here only returns a static structure. Much more research and development is needed to be able to predict the dynamic behavior and interplay with other proteins or RNA.
> as soon as someone runs AlphaFold over everything in the UniProt
It'll take a while before those results can be trusted, though, right? There's probably a selection bias in the training data for proteins which are easy to crystallize, so many proteins probably aren't well represented by the training examples.
170,000 is three orders of magnitude less than the number of recorded protein sequences. I don't think it's dismissive to describe that as comparatively few.
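For anyone who wants to check the numbers being argued over, a quick sketch using only the two figures quoted in this thread (170,000 solved structures and 180 million catalogued sequences); neither figure is independently verified here.

```python
import math

solved_structures = 170_000      # figure quoted in the thread
known_sequences = 180_000_000    # figure quoted in the thread

fraction = solved_structures / known_sequences
orders_of_magnitude = math.log10(known_sequences / solved_structures)

print(f"{fraction:.2%} of sequences have a solved structure")      # 0.09%
print(f"gap: about {orders_of_magnitude:.1f} orders of magnitude")  # 3.0
```

So both framings hold on these figures: roughly 0.09% coverage, and a gap of about three orders of magnitude.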
Structure is much, much more conserved than sequence. In other words, protein sequences with low sequence identity can fold similarly due to the physical constraints that guide protein folding.
I also don't know the field and the opposite concern is that 170,000 sounds like a lot, but, apparently, it's a relatively small amount compared to the number of proteins there are. It makes sense to me to refer to it as a small number - e.g. "That hard drive is tiny." "No, it stores several million bytes..."
We can already determine how a few proteins fold by experimental work.
Where "a few" is around 0.1% of the known 180 million proteins. So a relative few and a whole lot.
But the catch is which proteins could we figure out by experiment, and which not. In particular membrane proteins are hard to experimentally determine. But knowing how they fold is very important for figuring out how to get things to react with or get through membranes such as cell walls. Which is an important problem for everything from understanding how viruses work to targeted delivery of drugs. We now have a way to find those structures.
There are post-translational modifications to proteins. This means that for many (most?) proteins, the amino acid chain sequence is different from what you would predict from the DNA. These modifications are dependent on the state of the cell at the time of translation, and so cannot be predicted from the DNA alone. Even with a 100% accurate folding model, we cannot simply know the shapes of all the proteins inside the human body based on the genome.
Considering that this system "uses approximately 128 TPUv3 cores (roughly equivalent to ~100-200 GPUs) run over a few weeks" to determine a single protein structure, making predictions for all proteins encoded in a human genome seems impractical at this stage. With luck, this advance will help lead to discovery and definition of new folding rules and optimizations that will make protein folding predictions for the whole human genome more tractable.
I think it is possible to make predictions for all proteins encoded in the human genome. Perhaps you misread a very long and confusing sentence?
Background: neural networks have two modes: 1) training, where you learn all the model weights, and 2) inference, where you run the model once on new data. Training takes a long time, because you're computing derivatives to implement update rules on millions or billions of parameters based on iteratively examining massive datasets. Inference is extremely fast because you're just running matrix multiplies of those parameters on new data. And TPUs/GPUs are specially designed to compute matrix multiplies.
The article said: "We trained this system [...] over a few weeks." I searched for, but did not see them identify the inference time. I do expect inference time to be well under one second, though I'm not personally experienced with running inference on this type of network architecture.
For comparison, GPT-3 and AlphaStar have month long training times and real-time (sub-second) inference times.
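The training/inference distinction described above can be sketched with a toy model. This is a two-parameter linear fit on made-up data, chosen only to show that training loops over the data many times while inference is a single forward pass. It says nothing about how long AlphaFold itself takes in either mode.

```python
# Toy data for y = 2x + 1 (made up for illustration).
data = [(x, 2.0 * x + 1.0) for x in range(-5, 6)]

def train(samples, steps=2000, lr=0.01):
    """Training: many passes over the data, nudging parameters by gradient."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(steps):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def infer(params, x):
    """Inference: one cheap forward pass with the learned parameters."""
    w, b = params
    return w * x + b

params = train(data)           # 2000 sweeps over the dataset
prediction = infer(params, 10.0)  # a single multiply-add
print(prediction)              # close to 21.0 for this toy data
```

The cost asymmetry here is 2000 sweeps versus one multiply-add; for large networks the absolute inference cost depends heavily on the architecture, so the sub-second expectation above remains the commenter's guess.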
Still much faster than synthesizing the protein and then doing NMR or crystallography to solve the structure puzzle, which easily takes half a year or more (and very expensive equipment).
Entire classes of diseases may become history. Creutzfeldt-Jakob and other prion diseases can now be completely understood. Precision targeting of cancerous cells will become trivial (in theory). Minimal life projects (simplest cell possible) will require less trial and error. In general, it will provide a magnitude level improvement to biotechnology, akin to moving from Aristotelian physics to Newtonian.
Some possibilities: artificial muscles for robots, man-made blood substitutes, designer enzymes to break down plastic and other compounds. Software defined biology, where the pipeline from DNA code to actual protein can now be modeled in silico ahead of time. The biology classes of the future may be less observation of animals and more training in usage of whatever the equivalent of Autodesk for biology will be. Healthcare economics in developing nations will be changed as biochemistry itself may finally become deterministic (to some extent). Orphan drug development price would drop (and if you take into account right to try laws and ignore ethics in favor of progress, then people with rare disease may be cured en masse without bankrupting the health insurance company).
The most accurate technique in computational drug discovery is protein-ligand binding prediction (https://blogs.sciencemag.org/pipeline/archives/2015/02/23/is...). Given the protein structure, you can predict which molecules will bind with it, even for molecules which have never been synthesized. Many protein targets have not been amenable to this because we don't know what the potential binding pockets look like. That set of proteins will now drastically shrink. We're going to have a lot of new drug candidates, and with any luck new drugs, come out of this.
I never worked directly with protein folding or structure, but worked a bit in proteomics on teams measuring gene expression (which you could roughly think of as how much of each protein is found in this cell). IIRC there are 50,000 - potentially millions of "kinds" of proteins found in a human, and the "shape" of most of them is unknown, and that determines a lot about how they work.
So imagine you gave an iPhone to someone in the 1800's, they wouldn't understand how most of it works, but this may be analogous to them finally figuring out some key aspects of the transistor. So it's another tool in the toolbelt and like all good tools will be used in all sorts of unpredictable ways.
Someone else I'm sure could do a lot better at explaining how important shape is to understanding the function and behavior of proteins.
IMO, this is huge. One of the biggest applications of ML to science that I know of for sure. People used to manually crystallize proteins at great effort to solve for structures.
Of course, there is a caveat. The static, crystallized structure is only one aspect of a protein. The dynamic behavior dissolved in H2O, at different pH, different ionic strength, with different ligands/cofactors are all also important, and not (afaik) directly addressed by this research.
The industry process will not change. You still need industrial biologists to generate and validate AlphaFold structures, interpret the results as part of the bigger picture, and to finally design the drugs. And, then, of course you still need to validate the drugs in experimental systems (first the test tube, then mice, then humans).
So your second guess is correct - one of the steps is much cheaper now, which marginally improves the entire pipeline. As a result, drugs should now arrive to the market faster.
As a side note, I am curious what happens to the field of structural biology in 10 to 15 years from now. Every research university has a large structural biology department with super expensive X-ray/NMR/Cryo-EM machines, and armies of students who routinely spend 4-6 years of their PhD trying to solve a structure of a single protein. If AlphaFold works as advertised, NIH will gradually shift funding to other problems.
(It was predicted that it'd be taxi drivers, not professors, that AI got first. Ironic.)
> "armies of students who routinely spend 4-6 years of their PhD trying to solve a structure of a single protein"
Back in the 1990s, when I worked on structure data, I remember that at least some crystallizations were easy enough they could be done as a rotation project.
> Macromolecular crystallography evolved enormously from the pioneering days, when structures were solved by “wizards” performing all complicated procedures almost by hand. In the current situation crystal structures of large systems can be often solved very effectively by various powerful automatic programs in days or hours, or even minutes. Such progress is to a large extent coupled to the advances in many other fields, such as genetic engineering, computer technology, availability of synchrotron beam lines and many other techniques, creating the highly interdisciplinary science of macromolecular crystallography. Due to this unprecedented success crystallography is often treated as one of the analytical methods and practiced by researchers interested in structures of macromolecules, but not highly competent in the procedures involved in the process of structure determination.
Certainly some proteins are extremely hard to crystallize, and the new single-atom EM work will help a lot. But are there really "armies of students who routinely spend 4-6 years of their PhD trying to solve a structure of a single protein" these days?
I honestly don't know. I'm sure some do. But if so, that army is pretty small compared to the vast numbers who more routinely use crystallography.
I had a friend who solved the structure of 2 or 3 new proteins pretty much by himself his senior year of college. I also had an acquaintance who was a PhD student in the same lab, who said (jokingly) that she hated him because she had spent 5 years on a single protein and got way worse results than he did. I got the sense from talking to them that the process of figuring out how to get a protein to crystallize is basically just trial and error over and over—my friend himself said he basically got very lucky several times in a row (though he is also a brilliant biochemist).
Anyway that anecdote is pretty much the entire sum of my protein crystallography knowledge, but perhaps it explains how your experience and GP's statement can both be true?
Also, one important thing to realize is that AlphaFold was trained largely on proteins that we were able to crystallize. I'd be very curious to see how its performance fares as a function of 'ease of crystallization'.
You aren't wrong. I got caught up making the comparison between structural biologists and taxi drivers being run out of business by AI, so I ended up exaggerating the work load that's addressed by AlphaFold. I should have been more precise.
Getting DNA structure from tissue samples is relatively straightforward.
DNA -> RNA -> unfolded protein is basically one-to-one mapping in most cases.
How a protein functions depends on how it folds into itself. Once you solve protein folding, you can take a DNA sample and see the structure of the molecule without working in a lab using crystallography techniques.
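The DNA -> RNA -> unfolded protein mapping mentioned above can be sketched in a few lines using the standard genetic code. This is a simplified illustration: it assumes the input is already a coding-strand reading frame, and it ignores splicing, alternative codon tables and post-translational modifications, which is why the mapping is only one-to-one "in most cases". The function names and the example sequence are made up.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order; '*' = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(dna):
    """DNA coding strand -> mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    """mRNA -> amino-acid string, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # The lookup table is keyed by DNA letters, so map U back to T.
        codon = mrna[i:i + 3].replace("U", "T")
        aa = CODON_TABLE[codon]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate(transcribe("ATGGCCTGGTAA")))  # MAW
```

The output is only the unfolded amino-acid chain; going from that string to a 3D structure is the folding problem the rest of the thread is about.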
Solving protein folding is a huge, Nobel-in-chemistry-scale achievement. It would be a massive leap for biochemistry.
It seems that DeepMind solved the competition benchmark and made a huge leap, but it's just a partial solution that works on a limited set.
After you have solved protein folding, there is still the problem of solving chemical interactions between molecules accurately. Quantum chemistry is extremely compute intensive.
It seems unlikely there will be any large changes in life from solving protein folding. Knowing the structure of a protein (or really, its dynamics) is useful for identifying drugs that bind, but the real bottlenecks in drug discovery and biotech are elsewhere.
If folding and docking, along with dynamics simulations, start getting commodified, that might change things significantly though. I can already start imagining project workflows that are significantly streamlined without much thought; god knows what other scientists would dream up when we reach those steps.
One young lady I knew worked on neural algos recognition of X-ray images.
They always had single digit, bizarre artifacts, where the program can't sometimes recognise the very data it was trained on with most minute differences.
Another artifact was that the most "stereotypical cases" were least reliably recognised, and they got a lot of flak for screwed up live demos, where a radiologist put a very, very obvious tumor shot onto the scanner, and it didn't work without half an hour of wiggling the film and the camera.
The "bruteforce" solution may well be always, 80-85% off, but off consistently, and always. NN algo so far beat them, but fail with double digit frequencies on "artifacts" which they themselves can't do anything about.
How well it deals with the latter is what I believe will measure its real world usefulness.
I find this disingenuous. Yes, it's important that the algos can perform well on real world data, but the framing of this post begins with an anecdote about one person who had a bad model, and implicitly extrapolates that these problems are generalized throughout all neural nets.
One could say the same thing about programmers automating a task, or a number of other trivial examples. I would lean towards assuming deep mind has competent model validation teams vs. not, even if data science is hard.
May well be, but if you spend more compute and human time checking for those corner cases than if you went with another, more consistent exhaustive search algorithm, then the method loses to it economically.
This is more the case the closer to bruteforce you come, like encryption cracking. Imagine spending years of HPC cluster time trying to break a password, while knowing you have a single digit chance to miss the right key, in a way which would be completely impossible with a conventional solution.
This will allow us to discover much more about the structure of the cell (of "life") at an unprecedented speed. We should find many, many more mechanisms and targets for medicine, but it takes 10-20 years to bring a new medicine to market.
So in 5 years you'll see exactly zero new medicines pop up.
No new medicines, but way more biotech tools. Higher yield GMO plants, foundational research into disease, science backed recommendations for lifestyle changes to avoid disease that previously eluded us, some crazy stuff happening in animal models. The progress in biotech the past 20 years makes Moore's law look slow.
I agree. The speed at which products of this advancement are deployed will likely be limited mainly by local policies. Though, given just how profound some of the impacts on medicine might be, the speed at which they can be deployed might become a matter of national security (a healthier population bodes well for a healthier economy which in turn strengthens national security). Hopefully this competition shortens the time-to-market for all these new medicines.
In short, a core problem of biochem (the wagon) was just hitched to Moore's law (the horse). Our understanding of proteins will now grow exponentially not linearly, helping us to move up a level of abstraction to higher level biochemistry and biology problems.
She can still work on complexes, binding modes, and engineered biomolecules (eg, protein–drug conjugates and antisense oligonucleotide dimers) where the training data isn't really there.
> The organizers even worried DeepMind may have been cheating somehow. So Lupas set a special challenge: a membrane protein from a species of archaea, an ancient group of microbes. For 10 years, his research team tried every trick in the book to get an x-ray crystal structure of the protein. “We couldn’t solve it.”
> But AlphaFold had no trouble. It returned a detailed image of a three-part protein with two long helical arms in the middle. The model enabled Lupas and his colleagues to make sense of their x-ray data; within half an hour, they had fit their experimental results to AlphaFold’s predicted structure. “It’s almost perfect,” Lupas says. “They could not possibly have cheated on this. I don’t know how they do it.”
Like the old Arthur C. Clarke quote goes: “Any sufficiently advanced technology is indistinguishable from magic” -- unless it might be cheating in which case throw them a curve ball.
Kudos to the DeepMind team for making magic happen.
I am happy you mention this. I was reading the article and thinking “wow the amount of scientific knowledge these guys need to know to understand what they are doing is way beyond me”. I work in health care and I always talk to clients about all the cool things they witnessed in their life. Cell phones, TVs, microwaves are some obvious ones I like to talk about. I sit and wonder what are the things my generation will get to look back on and say “I was alive when that happened”. I guess for many of us we will talk about how the internet was vs what it surely will be in the future, a shell of its initial glory.
From the perspective of a human observer, an AI more or less does exist outside of space and time. It can travel at the speed of light through radio broadcast (with some caveats). It can spend the equivalent of a lifetime learning a topic in just a few days.
That is a plot of a novel. Humanity has to restart the entire global electrical grid to deal with an AI worm that accidentally causes epoch ending havoc :)
I just had a discussion with a friend about this! It's indeed a very difficult question. We ended on the conclusion that God can't possibly exist outside of time and space in the Abrahamic tradition because he precedes the creation of the Universe, but I'm sure there's a twist we missed somewhere.
The conclusion we came to is that such a being would have to have its own, metaphysically superior, time and space, and our time would be a subtime of it as well as our space would be a subspace of it.
The concept of a being subject to causality presupposes something akin to time.
No. The concept is that there is an infinite, indescribable void from which emerges consciousness manifested in matter.
And conscious matter creates universes constantly, that appear out of the void, hence creating time and space. This would also imply that it's possible not only to change the future, but also the past.
The masters advise, however, that these are distractions and that the journey inevitably leads to experiencing the void itself, hence stepping out of time and space, and consciously experiencing any aspect of time and space and consciousness one wishes to, while always remembering that one is beyond it.
Ah, ok. My thought is to distinguish between “logically prior” or “causally prior” or something like that, and “prior in time”,
But, I suppose one might consider those things to be a kind of “time” in some sense.
One cannot define the notion of “exist” without explicitly or implicitly referring to time. So the question of existence without time reduces to absurdity like can one exist without existing?
Does the solution to an equation exist before one stated the equation? Does a particular phrase exist before one has written it? Does it continue to exist after all its copies were erased? And even if one answers positively to such questions based on a belief (it is not provable), like "yes, they always exist", one has to define what "always" means.
Their existence is not at all dependent on such things. (Note that I didn’t use the word “always”.)
If the answer to a question of “Why [...]_1?” would be “Because there is a [...]_2 such that [...]_3 .”, then there is such a [...]_2 such that [...]_3 . (Possible exception : if “because you asked that question” would be part of the answer.)
Consider if this universe exists only as a simulation (I’ve met some theists making this category of comparison but with different language). The laws of physics are identical in both directions for the arrow of time, so the starting point of the simulation from the point of view of the outside universe can be any point in time from the perspective of the inhabitants.
In this case, the “before” in the outer universe is a logical rather than temporal one from the point of view of us inhabitants.
(Disclaimer: my philosophy qualification is really bad)
Well yes, from our point of view. But I'd understand that there still must be something like time from the point of view of those that wrote the simulation.
There is an in depth explanation in Zen, Buddhism and Hinduism of this phenomenon.
I would suggest reading anything by Nisargadatta Maharaj, who expanded on this in detail to questioners from all over the world who came to his humble dwellings in a Mumbai tenement to observe him. I’d suggest starting with ‘I am That’, available on Amazon, iBooks etc.
He claimed to be outside of time and space himself.
I think you could actually argue that it does; it just solved a problem in a relatively short amount of time (iirc the folding@home project has been crunching numbers for over a decade and barely got close), and it doesn't occupy 'real' space since it lives on various computers - it could occupy a whole datacenter, or be contained to a single chip, either way it's in a scale that humans themselves can never exist at.
I used to work in scientific HPC, and seeing the amount of computing resources researchers used for folding was staggering. How much this will speed up research in the coming years remains to be seen. I am really hopeful.
I continue to be impressed by how quickly DeepMind has managed to progress in such a short time.
CASP13 was a shocker to all of us I think, but many were skeptical as to the longevity of the performance DeepMind was able to achieve. I believe with CASP14 rankings now released, it's safe to say that they've proven themselves.
Congratulations to the team! This work will have far reaching impacts, and I hope that you continue to invest heavily in this area of research.
> but many were skeptical as to the longevity of the performance DeepMind was able to achieve
For a non-biologist, on what is this skepticism based?
Just purely based on following ML news it looks like the trend for ML solutions has been that they've overtaken expert-systems once they've gained a solid foothold in a field.
Maybe this is some perception bias. Are there any cases where ML performed decently but then hit a ceiling while expert systems kept improving?
It's because for many researchers ML is just to take a standard Keras or scikit-learn model, shove their data in, get some table or number out, and see if that solves their problem. If that's your only ML experience then I suppose this is how sceptical you'd be of ML in general.
It looks like DeepMind invented a completely new method for this round that's not just an extension of their previous work, showing how much you can gain if you don't shoebox yourself into just trying to improve existing methods.
That all the scientists were highly skeptical about the scope of ML (and these are computer scientists to begin with mind you) just shows how little they knew of what they did know of what a computer or a program can possibly do, which is a bit appalling to be honest.
"It looks like DeepMind invented a completely new method for this round that's not just an extension of their previous work, showing how much you can gain if you don't shoebox yourself into just trying to improve existing methods. That all the scientists were highly skeptical about the scope of ML (and these are computer scientists to begin with mind you) just shows how little they knew of what they did know of what a computer or a program can possibly do, which is a bit appalling to be honest."
My PhD (now over a decade ago...yikes) was in applying much simpler ML methods to these kinds of problems (I started in protein folding, finished in protein / nucleic acid recognition, but my real interest was always protein design). Even back then, it was clear that ML methods had a lot more potential for structural biology (pun unintended) than for which they were being given credit. But it was hard to get interest from a research community that cared little about non-physical solutions. No matter how well you did, people would dismiss it as a "black box solution", and that pretty much limited your impact.
Some of this is understandable: even today, it's not at all clear that a custom-built ML model for protein folding is of much use to anyone -- particularly a model that doesn't consider all of the atoms in the protein. The traditional justification for research in this area is that if you could develop a sufficiently general model of protein physics, it would also allow you to do all sorts of other stuff that is much more interesting: rational protein design, drug binding, etc.
The AlphaFold model is not really useful for any of this, so in a way, it's kind of like the Wienermobile of science: cool and impressive when done well ("hey! a giant hot dog on wheels!"), but not really useful outside of the niche for which it was designed. So it's hard to blame researchers in this field -- who generally have to chase funding and justify their existence -- for not pursuing the application of deep learning to this one, narrow problem domain.
Obviously there will now be a wave of follow-on research, and it's impossible to know what methods this will spawn. Maybe this will revolutionize computational structural biology, maybe not. But I think it's a little unfair to demonize the entire field. Protein folding just traditionally hasn't been a very useful or interesting area, and like all "pure science", it leads to a lot of small-stakes, tribal thinking amongst the few players who can afford to compete. This is right out of Thomas Kuhn: a newcomer sweeps into a field, glances at the work of the past, then bashes it over the head, dismissively.
We don't know too much about the exact model they made but it looks sufficiently generalizable to be able to give a candidate protein structure for any given sequence. It doesn't automatically cure cancer and inject the drug but that by itself is an amazing tool that if available to everyone will revolutionize biology experimentation.
I will definitely blame the protein structure field on multiple levels though. It was always frustrating to me to open up Nature or Science and see it filled with papers about structure - like they are innovating so much that half of the top science magazines every week have papers in that field, yet it's not going anywhere? Or is it simply just a bunch of professors tooting their own horns about ostensible progress in a field that's archaic by years if not decades? The overall protein structure field internalised some dogmas in self-defeating ways to everyone's detriment and finally events like this (and cryo-EM, maybe) will jolt them out or make them fully irrelevant so we can move on. It's only doubly ironic that this came from a team in a company with minimal academic ties, showing how toxic that entire system is. I only feel pity for the graduate students still trying to crystallize proteins in this day and age.
The reason for your second paragraph is pretty straightforward. There has been an immense amount of support for proteins as "the workhorses of the cell" for a hundred+ years. I call it the "protein bias". We've seen it many times: for example, when it was first hypothesized and then proved that DNA, rather than protein, is the heredity-encoding material, and again in the denial that RNA could act as an enzyme or that the functional core of the ribosome could be a ribozyme.
I think what basically happened is a very influential group of scientists, mainly in Cambridge around the 50s and 60s, convinced everybody that reductionist molecular biology would be able to crystallize proteins and "understand precisely how they function" by inspecting the structures carefully enough.
What I learned, after reading all those breathless papers about individual structures and how they explain the function of a protein, is that in the vast majority of cases they don't have enough data to speculate responsibly about the behavior of proteins and how they implement their functions.
There are definitely cases where an elucidated structure immediately led to an improved understanding of function:
"It has not escaped our notice (12) that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material."
but most papers about how cytochrome "works" aren't really illuminating at all.
"We don't know too much about the exact model they made but it looks sufficiently generalizable to be able to give a candidate protein structure for any given sequence. It doesn't automatically cure cancer and inject the drug but that by itself is an amazing tool that if available to everyone will revolutionize biology experimentation."
They say on their own press-release page that side-chains are a future research problem, and nothing about their method description makes me believe they've innovated on all-atom modeling. This software seems able to generate good models of protein backbones; these kinds of models certainly have uses, but a backbone model is not enough for drug design.
This is certainly an advancement, but you're exaggerating the scope of the accomplishment.
" I only feel pity for the graduate students still trying to crystallize proteins in this day and age."
Nothing about this changes the fact that protein crystallography is a gold-standard method for determining a protein structure. CryoEM has made it possible to obtain good structures for classes of proteins we could never achieve before, and it's certainly interesting if we can run a computer for a few days to get a 1Å ab initio model for a protein sequence, but we could already do that for a large class of proteins with homology modeling. These predicted structures still aren't generally that useful for drug design, where tiny details of molecular interactions matter.
To put it in perspective: protein energetics are measured on the scale of tens of kcal / mol. Protein-drug interactions are measured in fractions of a kcal. A single hydrogen bond or cation-pi interaction or displaced water molecule can make the difference between a drug candidate and an abandoned lead. Tiny changes in backbone position make the difference between a good structure and a bad one. Alphafold isn't doing that kind of modeling.
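For a rough sense of why fractions of a kcal matter, here is a minimal sketch using the standard thermodynamic relation ΔG = -RT ln K to turn a free-energy difference into a fold change in binding constant. The temperature (298.15 K) and the function names are assumptions for illustration; nothing here comes from AlphaFold or any docking tool.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def affinity_fold_change(ddg_kcal_per_mol, temperature_k=298.15):
    """Fold change in binding constant implied by a free-energy difference."""
    return math.exp(ddg_kcal_per_mol / (R_KCAL * temperature_k))

def ddg_for_fold_change(fold, temperature_k=298.15):
    """Free-energy difference (kcal/mol) implied by a fold change in affinity."""
    return R_KCAL * temperature_k * math.log(fold)

# Illustrative energy differences, not measured values.
for ddg in (0.5, 1.0, 1.4, 3.0):
    print(f"{ddg:.1f} kcal/mol -> {affinity_fold_change(ddg):.1f}x change in affinity")
```

Under these assumptions a tenfold change in affinity corresponds to a bit under 1.4 kcal/mol, which is the scale of a single weak interaction.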
Of course, they haven't solved everything, but you seem to be doing exactly what I accuse that entire field (and academia in general) of doing - which is to insist a problem is intractable or hard and undermine someone potentially challenging that. When they released the 2018 results the field did embrace it (for sure I'd consider the groups organizing CASP as at least forward thinking) but was still skeptical on how much more progress it can make; now they blow everyone's minds again by a monumental leap and again people want to come say of course this is the last big jump!
I understand the self preservation instincts that kick in when there's a suggestion that the entire field has been in a dark age for a while, but I hope you can see that there might be something fundamentally wrong with how research is done in academia and that is to blame for why this didn't happen sooner, and why it's so hard for many to embrace it.
Regarding your comments on the inapplicability of this current solution for docking, I'm sure that's the next project they're taking up, and let's see where that goes.
This is exactly the same type of progression that happened with Go, where when their software beat a professional player everyone was like "yeah but I bet he wasn't that good". A few years later and Lee Sedol just decided to retire. I am interested to see what happens to that entire academic field in a similar vein, though my interests are more in knowing how science can advance from more people thinking this way.
> Nothing about this changes the fact that protein crystallography is a gold-standard method for determining a protein structure.
Yes it does. Protein crystallography is/was the gold-standard. Once this result is verified and accepted by the scientific community as a whole, that changes.
There are definitely cases where machine learned statistical solutions do not perform as well as the systems tuned by the experts, but if you can define the task well and get the data for a deep solution, usually those will overtake.
This is likely because linear regression meets most widely accepted definitions of machine learning. [0][1] It is simple and very effective when learning in linear space.
> Are there any cases where ML performed decently but then hit a ceiling while expert systems kept improving?
Yes, this describes the entire history of AI, including several boom-bust cycles. In particular the 80's come to mind. Yes, the practitioners think that there are no technical barriers stopping them from eating the world, but that's exactly what people thought about other so-called revolutionary advances.
Although to be pedantic, "expert systems" is the technology behind AI boom of the 80's. At the time people were saying expert systems can't be as good as existing algorithms (including what we would now call "machine learning" techniques), then suddenly the expert systems were better and there was rampant speculation real AI was around the corner. Then they plateaued.
We appear to be at the tail end of the maximum hype part of the boom-bust cycle. Thinking that the rapid gains being made by the current deep learning approaches will soon hit a wall is a reasonable outside-view prediction to make: nearly every time we've had a similarly transformative technology in the AI space and elsewhere, hitting the wall is exactly what happened. The onus would be on practitioners to show that this time really is different.
I think the disconnect this time around is in productionization. We're getting breakthroughs in a wide range of problems, and translating those gains in the problem space into 'real' stable, practical solutions we can use in the world is the remaining gap, and often takes years of additional effort. It's still really expensive to launch this stuff, and often requires domain expertise that the ML research team doesn't have.
We're seeing a lot of this pattern: ML Researcher shows up, says 'hey gimme your hardest problem in a nice parseable format' and then knocks a solution out of the park. The ML researcher then goes to the next field of study, leaving (say) the doctors or whatever to try to bridge the gap between the nice competition data and actual medical records. It also turns out that there's a host of closely related but different problems that ALSO need to be solved for the competition problem to really be useful.
I don't think this means that the ML has failed, though; it's probably similar to the situation for accounting software circa 1980: everything was on paper, so using a computerized system was more trouble than it was worth. But today the situation in accounting has completely flipped. Apply N+1 years of consistent effort improving data ecosystems, and the ML might be a lot easier to use on generic real world problems.
Next time you fly through a busy airport, think about the system which assigns planes to gates in realtime based on a large number of variable factors in order to maximize utilization and minimize waits. This is an expert system designed in the 80's, which allowed a huge increase in the number of planes handled per day at the busiest airports.
Or when you drive your car, think about the lights-out factory that built it, using robotics technologies developed in the 80's and 90's, and the freeways which largely operate without choke points, again due to expert system models used by city planners.
These advances were just as revolutionary before, and people were just as excited about AI technologies eating the world. Still, it largely didn't happen. To continue the example of robotics, we don't have an equivalent of the Jetsons' home robot Rosey. We can make a robot assemble a $50,000 car, but we can't get it to fold the laundry.
These rapid successes you see aren't literally "any problem from any field" -- it's specific problems chosen specifically for their likely ease in solving using current methods. DeepMind didn't decide to take on protein folding at random; they looked around and picked a problem that they thought they could solve. Don't expect them to have as much success on every problem they put their minds to.
No, machine learning is not trivially solving the hardest problems in every field. Not even close. In biomedicine, for example, protein folding is probably one of the easiest challenges. It's a hard problem, yes, but it's self-contained: given an amino acid sequence, predict the structure. Unlike, say, predicting the metabolism of a drug applied to a living system, which requires understanding an extremely dense network of existing metabolic pathways and their interdependencies on local cell function. There's no magic ML pixie dust that can make that hard problem go away.
Well, we can agree that world peace is off the table!
Beyond that, let's notice that expert systems did indeed change how airports and freeways work: They improved the areas where they solved problems. Deployment happened.
What we're seeing now is new classes of previously unsolvable problems falling. Deployment in medicine is known to be particularly hard, but not impossible. My read on the situation is that there have been a number of ML applications in the current round that have been kinda-successful 'in vitro' and failed in deployment. That doesn't mean that all deployments will fail.
Furthermore... Neil Lawrence points out that in most cases we change the world to fit new technologies. For example, mechanized tomato pickers suck, so we develop a more machine-resistant tomato. Cars break easily on dirt roads, so we pave half the planet. ML/AI somehow flips people's expectations of how technology works, and they expect the algorithms to adapt to the world. This is almost certainly wrong.
"it's specific problems chosen specifically for their likely ease in solving using current methods. DeepMind didn't decide to take on protein folding at random; they looked around and picked a problem that they thought they could solve."
I'm actually not sure this is at all true. Protein folding is a long-standing grand challenge on which no current methods were working. My guess is that it was initially chosen for potential impact, and chased with more resources after some initial success.
> We appear to be at the tail end of the maximum hype part of the boom-bust cycle. Thinking that the rapid gains being made by the current deep learning approaches will soon hit a wall is a reasonable outside-view prediction to make: nearly every time we've had a similarly transformative technology in the AI space and elsewhere, hitting the wall is exactly what happened. The onus would be on practitioners to show that this time really is different.
What a take. Neural networks just took a huge bite out of protein folding and your take is: This just in, the Deep Learning boom is about to go bust! Asinine.
Which part of genetics are you thinking of? Much of genetics isn’t amenable to this kind of ML, because it isn’t some kind of optimisation problem. And many other parts don’t require ML because they can be modelled very closely using exact methods. ML does get used here, and sometimes to great effect (e.g. DeepVariant, which often outperforms other methods, but not by much — not because DeepVariant isn’t good, but rather because we have very efficient approximations to the exact solution).
Genetics is amenable because the genome is a sequence that can be language modeled/auto-regressed for depth of understanding by the network.
There are plenty of inferences that you would want to do on genetic sequences that we can't model exactly and there is some past work on doing stuff like this, although biology is usually a few years behind.
Not sure what you mean by that. Genetics is a field of research. The genome is a sequence. And yes, that sequence can be modelled for various purposes but without a specific purpose there’s no point in doing so (and furthermore doing so without specific purpose is trivial — e.g. via markov chains or even simpler stochastic processes — but not informative).
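To make the "trivial but not informative" point concrete, here is a minimal sketch of a first-order Markov chain fitted to a nucleotide string. The toy sequence, the pseudocount, and the function names are all made up for illustration. The model happily assigns likelihoods to sequences, but by itself it answers no biological question.

```python
from collections import Counter, defaultdict
import math

def fit_markov_chain(seq, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(next base | previous base) from one sequence, with smoothing."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(seq, seq[1:]):
        counts[prev][nxt] += 1
    probs = {}
    for prev in alphabet:
        total = sum(counts[prev][n] for n in alphabet) + pseudocount * len(alphabet)
        probs[prev] = {n: (counts[prev][n] + pseudocount) / total for n in alphabet}
    return probs

def log_likelihood(seq, probs):
    """Log-probability of the transitions in seq under the fitted chain."""
    return sum(math.log(probs[p][n]) for p, n in zip(seq, seq[1:]))

toy = "ATGCGCGATATATGCGC"  # made-up sequence, not real genomic data
model = fit_markov_chain(toy)
print(log_likelihood("CGCGCG", model), log_likelihood("CCCCCC", model))
```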
> There are plenty of inferences that you would want to do on genetic sequences
I’m aware (I’m in the field). But, again, I was looking for specific examples where you’d expect ML to provide breakthroughs. Because so far, the reason why ML hasn’t provided many breakthroughs is less about the lack of research and more because it’s not as suitable here as for other hard questions. For instance, polygenic risk scores (arguably the current “hotness” in the general field of genetics) can already be calculated fairly precisely using GWAS, it just requires a ton of clinical data. GWAS arguably already uses ML but, more to the point, throwing more ML at the problem won’t lead to breakthroughs because the problem isn’t compute bound or vague, it’s purely limited by data availability.
I could imagine that ML can help improve spatial resolution of single-cell expression data (once again ML is already used here) but, again, I don’t think we’ll see improvements worthy of being called breakthroughs, since we’re already fairly good.
I spoke loosely, my mind skipped ahead of my writing, and I didn't realize that we were parsing so closely. "Genetics (the field) is amenable because the object of its study (the genome) is a sequence" would have been more correct but I thought it was implied.
> without a specific purpose there’s no point in doing so
Well yes, prior to the success of transfer learning I could see why you would think that is the case, but if you've been following deep sequence research recently then you would know there are actually immense benefits to doing so because the embeddings learned can then be portably used on downstream tasks.
> it’s purely limited by data availability.
Yes, and transfer learning on models pre-trained on unsupervised sequence tasks provides a (so-far under-explored) path around labeled data availability problems.
I already linked to a paper showing a task that these sorts of approaches outperform, and that is without using the most recent techniques in sequence modeling.
Maybe read the paper in Nature that uses this exact LM technique to predict the effect of mutations before assuming that it doesn't work: https://sci-hub.do/10.1038/s41592-018-0138-4
I am not directly in the field, you are right - but I think you are also being overconfident if you think that these approaches are exactly the same as the HMM/markov chain approaches that came before.
Thanks for the paper, I’ll check it out; this isn’t my speciality so I’m definitely learning something. Just one minor clarification:
> Maybe read the paper … before assuming that it doesn't work
I don’t assume that. In fact, I know that using ML works on many problems in genetics. What I’m less convinced by is that we can expect a breakthrough due to ML any time soon, partly because conventional techniques (including ML) already have a handle on some current problems in genetics, and because there isn’t really a specific (or flashy) hard, algorithmic problem like there is in structural biology. Rather, there’s lots of stuff where I expect to see steady incremental improvement. In fact, in Wikipedia’s list of unsolved biological problems [1] there isn’t a single one that I’d characterise specifically as a question from the field of genetics (as a geneticist, that’s slightly depressing).
But my question was even more innocent than that: I’m not even that sceptical, I’m just not aware of anything and genuinely wanted an answer. And the paper you’ve posted might provide just that, so I’ll go and do my research now.
Not being in the field, I would term what I see in this story as a ‘bottom up’ approach to understanding genetics/molecular biology. More akin to applied sciences than medicine or health. This, for example, seems to be very important but it still leaves us with a jello jigsaw puzzle with 200 million pieces and probably far removed from immediate utility in health outcomes.
Then there’s the more clinically oriented approaches of looking at effects, trying to find associated genes/mutations whatever mechanisms exist in between to cause a desirable or undesirable outcome. I’d call that ‘top down’.
I’m sure the lines get blurred more every day, but is there a meaningful distinction into these and/or more categories that are working the problem from both ends? If so, are there associated terms of art for them?
I cannot give constructive feedback to something which is incomprehensible.
"the genome is a sequence that can be language modeled/auto-regressed for depth of understanding by the network"
The genome is not a sequence so much as a discrete set of genes which are themselves sequences which specify construction plans for proteins. That distinction is important.
Language modeling in the context of machine learning typically means NLP methods. Genetics is nothing like natural language.
Auto-regression is using (typically time series) information to predict the next codon. This makes very little sense in the context of genetics since, again, the genetic code is not an information carrying medium in the same sense as human language. Being able to predict the next codon tells you zilch in terms of useable information.
"Depth of understanding by the network" ... what does that even mean???
The above sentence is a bunch of popular technical jargon from an unrelated field thrown together in a nonsensical way. AKA word salad.
> The genome is not a sequence so much as a discrete set of genes which are themselves sequences which specify construction plans for proteins. That distinction is important.
aka a sequence. "a book is not a sequence so much as a discrete set of chapters which are themselves sequences of paragraphs which are themselves sequences of sentences" -> still a sequence
these techniques are already being used, such as in the paper I just linked.
> Being able to predict the next codon tells you zilch in terms of useable information.
You have absolutely no way of knowing that a priori. And autoregressive tasks can be more sophisticated than just next codon.
> bunch of popular technical jargon from an unrelated field thrown together in a nonsensical way
Okay, feel free to think that.
There's always this assumption of it "will never work on my field." I've done work on NLP and on proteins and read others' work on genetics. I think you will end up being surprised, although it might take a few years.
It is incomprehensible to you because you simply do not understand what your parent is talking about. You are the ignorant one here and indeed quite rude. It doesn't matter that genetics is not natural language. The point is we can train large transformers autoregressively and the representation they learn turns out to be useful for a) all kinds of supervised downstream tasks with minimal fine-tuning and b) interpreting the data by analysing the attention weights. There is a huge amount of literature on this topic and what your parent says is quite sensible.
That statement you quote is completely understandable.
Let's say you have discrete sequences that are a product of a particular distribution.
Unsupervised methods are able, by just reading these sequences, to construct a compact representation of that distribution.
The model has managed to untangle the sequences into a compact representation (weights in a neural network) that allows you to use it for other, higher level supervised tasks.
For example, the transformer model in NLP allowed us to not have to do part-of-speech tagging, dependency parsing, named entity recognition or entity relationship extraction for a successful language-pair translation system. The compact transformer model managed to remap the sequences into a representation that allows direct translation (people have inspected these models and figured out the internal workings of it and realized it does have latent information about a parse tree of a sentence or part-of-speech of a word).
Another interesting note is that designers of the transformer architecture did not incorporate any prior linguistic knowledge when they were designing it (meaning that the model is not designed to model language but just a discrete sequence).
FWIW, transformers are to sequences what convnets are to grids, modulo important considerations like kernel size and normalization. Think of transformers as really wide (N) and really short (1) convolutions. Both are instances of graphnets with a suitable neighbor function. Once normalization was cracked by transformers, all sorts of interesting graphnets became possible, though it's possible that stacked k-dimensional convolutions are sufficient in practice.
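As a toy comparison only, the sketch below puts a single self-attention step next to a 1-D convolution. It assumes identity query/key/value projections, one head, and a fixed scalar kernel, none of which matches a real transformer or convnet layer. The attention output at each position mixes all N positions with weights computed from the data, while the convolution mixes a fixed local window with fixed weights.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(xs):
    """Single-head self-attention with identity projections (a simplification)."""
    d = len(xs[0])
    out = []
    for q in xs:
        # Weights over ALL positions, computed from the data itself.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in xs])
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out

def conv1d(xs, kernel):
    """1-D convolution over positions: fixed weights, local window, zero padding."""
    half = len(kernel) // 2
    d = len(xs[0])
    out = []
    for i in range(len(xs)):
        acc = [0.0] * d
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(xs):
                for c in range(d):
                    acc[c] += w * xs[j][c]
        out.append(acc)
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # made-up embeddings
print(self_attention(seq))
print(conv1d(seq, [0.25, 0.5, 0.25]))
```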
I work in the field, I don't need the difference explained to me.
> Think of transformers as really wide (N) and really short (1) convolutions
Modern transformer networks are not "really short" and you're also conflating the difference between intra- and inter- attention.
There is still a pitched battle being waged between convnets and transformers for sequences: although it looks like transformers have the upper hand accuracy-wise right now, convnets are competitive speed-wise.
Just to add to this whole "It's not solved! Yes it is!" discussion. Note that
>According to Professor Moult, a score of around 90 GDT is informally considered to be competitive with results obtained from experimental methods.
So if we go by >= 90 as solved:
>In the results from the 14th CASP assessment, released today, our latest AlphaFold system achieves a median score of 92.4 GDT overall across all targets.
they solved for their targets, but
>Even for the very hardest protein targets, those in the most challenging free-modelling category, AlphaFold achieves a median score of 87.0 GDT (data available here).
They basically admit they still haven't "solved" it for "most challenging free-modelling category"
Take that as you will, not sure how useful the ">= 90 is solved" criterion is since they call it "informal" themselves.
What do you mean you're not sure how useful ">= 90" is as a criterion?
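For readers unfamiliar with the score being argued over, here is a simplified sketch of GDT_TS: the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues whose predicted position lies within the cutoff of the experimental one. The real measure searches over superpositions of the two structures; this sketch assumes the coordinates are already superimposed, and the coordinates below are made up.

```python
import math

def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for already-superimposed per-residue coordinates."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty coordinate lists of equal length")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    # Fraction of residues within each cutoff, then averaged over cutoffs.
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Made-up coordinates (in Å) for four residues.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]
print(gdt_ts(predicted, reference))
```

One residue that is badly off drags the score down at every cutoff, which is why a median in the high 80s still allows individual targets with poorly placed regions.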
You literally said why it is useful in your comment:
> 90 GDT is informally considered to be competitive with results obtained from experimental methods.
It's informal because we don't have a true "gold-standard" for determining a protein's folded structure – the best we have is experimental methods of trying to determine the structure which still have a great deal of error (compared to other things we can measure).
So all we can do is say "the GDT between two experimental measurements (of the same protein) is often around 90, so if we get there with predictive models that's pretty much just as good".
As soon as we have better experimental methods for determining protein tertiary structure, you can be sure we will require predictive models to deliver better results too. Until then, the point is that the delta between any two experimental determinations of folded structure is approximately the same as the delta between an experimental determination and an AlphaFold guess. So the AlphaFold guess may as well be an experimental measurement. Except the AlphaFold guess happens fairly trivially (once you give it the DNA sequence[1]), whereas the experimental method is involved and expensive.
[1] Or the primary structure, I'm unsure what inputs are given to AlphaFold.
Just to add to my own comment. Why does HN like being so pedantic about the definitions of words? This is an interesting post regarding AI and cellular biochemistry.
Do we really need to add a philosophical debate about the meaning of "solution"? Personally I think anyone who can't add to the discussion about AI and protein folding should just not comment, instead of settling for adding to the "what does solution mean" debate. I'd love to see a blanket rule flagging pedantic posts.
That’s shifting goal posts. The hardest structures are also going to be harder experimentally.
What makes them hard to predict is the very close energies involved in different folding pathways. Those close energies mean there will be more variant structures, which will also change depending on the experimental approach used.
CASP (Critical Assessment of protein Structure Prediction) is calling it a solution. To quote from the article:
"We have been stuck on this one problem – how do proteins fold up – for nearly 50 years. To see DeepMind produce a solution for this, having worked personally on this problem for so long and after so many stops and starts, wondering if we’d ever get there, is a very special moment."
--Professor John Moult
Co-founder and chair of CASP
It's an improvement - and a big one - but not a solution to the problem. It mainly shows just how stuck the community had gotten with their techniques and how recent improvements in DNNs and information theory methods can be exploited if you have lots of TPU time.
Well, it's not. Nature does not have a committee, sorry. Proteins are delicate "machines" where even a small change in the sequence (and thus the 3D structure), as small as a few amino acids, can effectively change the structure and the function. On top of that, proteins are dynamic beasts. In any case, it's a great advance, but DM, like many companies, likes a little bit too much to toot its own horn.
I think that missed the mark, regardless of the rest of the discussion. It's like saying that the winner of the DARPA Grand Challenge for self-driving cars "solved" autonomous driving back in 2010.
This benchmark may be solved, but simultaneously, there remain other open problems relating to protein folding which are unsolved and which may not even have benchmarks yet :)
Said differently, there's vast space between having a great result on a specific benchmark (this) and solving all interesting problems in a scientific field.
This is an issue of the more subtle aspects of English.
"To see DeepMind produce a solution for this" does not imply something is solved. I can produce a bad solution. I can produce a really good solution. All without solving a problem.
This is a really good solution. Of course, there's still room for more research and better methods in the future, but now computational protein structure prediction can compete with experiments actually measuring the structure.
Laypersons often use the word "solution" in situations where an academic would say "method" or "approach": we did something useful, but it may not be the best possible way.
In pure math, "solution" means determining whether a logical statement is true or false. For example, in (asymptotic, worst-case) analysis of algorithms, the logical statements take the form "there exists an algorithm to compute X with asymptotic complexity O(f(n)), and no algorithm with lower complexity exists." These are crisp notions with no room for debate.
In this competition, they defined "solved" as achieving 90% accuracy. This is somewhere in between the two definitions. It's technically a valid problem statement, but it can become obsolete in a weird way. If someone else solves the problem of achieving 95% accuracy, then suddenly the 90% solution doesn't look so good. Compare to e.g. sorting. If we add the requirement of a stable sort, it becomes a new problem. Stable sorting algorithms are not automatically "better" than unstable ones.
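The sorting comparison can be made concrete with a small sketch. Python's built-in `sorted` is guaranteed stable, while the textbook swap-based selection sort below is not. Both solve "sort by key", but only the first solves "stable sort by key". The records are made up.

```python
def selection_sort(items, key):
    """Swap-based selection sort: a correct sort, but not a stable one."""
    items = list(items)
    for i in range(len(items)):
        smallest = i
        for j in range(i + 1, len(items)):
            if key(items[j]) < key(items[smallest]):
                smallest = j
        items[i], items[smallest] = items[smallest], items[i]
    return items

records = [("a", 2), ("b", 2), ("c", 1)]  # made-up (name, score) pairs

stable = sorted(records, key=lambda r: r[1])
unstable = selection_sort(records, key=lambda r: r[1])

print(stable)    # ties keep their original order: a before b
print(unstable)  # ties may be reordered: here b ends up before a
```

Neither result is "more sorted" than the other. Which one counts as a solution depends entirely on how the problem was stated.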
"AlphaFold achieves a median score of 87.0 GDT". Game changing, and a huge improvement, but not 100% solved. Also this is for static folding. Dynamic folding and interaction is a much harder problem. Those need to be tackled too before I would consider protein folding 'solved'.
They solved the latest folding competition benchmark set.
Shorter problems are easy to solve. Median score is mix of easier hand harder problems. Next year competition will have new set of much bigger and harder problems to solve.
This seems like a leap, not solved as in having solution that just works and scales.
It's probably never going to be solved though, right? To truly solve protein folding we'd have to have a program that can simulate a small but still significant system at the QM level. It looks like deep learning can get us 60% (conservatively estimating the whole problem domain) but not all the edge cases, just as it did in other problem domains.
Despite this breakthrough by DeepMind, at this point we still do not understand protein folding. That makes it very hard to say precisely which features would be required to do the simulation correctly.
DeepMind/AlphaFold might have something to contribute there too, depending on how interpretable their network model(s?) are.
They seem to have a completely new attention algorithm that's doing the heavy lifting now, so it's likely we will learn much about how folding practically works from these results as well.
It remains unclear whether QM is required to fold proteins accurately. So far classical methods have shown they require far less computer power to get far closer to the right structure.
'Never' is a long timespan :) It will be solved, sooner or later. The universe will be fully understood and manipulated. By us, a modified version of us, or some other entity, perhaps even one we created. 300 years ago 'electricity' wasn't even a word. We can imagine what 500 years into the future will be, with an exponentially more advanced tech, worse than a caveman could imagine the concept of 'machine learning'.
>Those need to be tackled too before I would consider protein folding 'solved'
Semantics. From a systems-theoretical point of view, dynamic folding is an abstraction of static folding; solve (i.e. understand the underlying mechanisms of) static folding and you can start progressing on dynamic folding, building on your previously achieved solution.
Whether it's solved or not depends on whether you mean `general folding` or the `entire spectrum of folding` when considering the problem.
My intuition for deep learning was exactly that: statistical inference of underlying mechanisms. But I haven't read the paper yet, so you might be right.
12-13 years ago in a classroom the professor for my intro to bioinformatics class said if you were to solve this problem, you would win a Nobel prize. Congrats to the team! What an achievement.
Man, I remember running folding@home years ago on my terrible laptop. Now this was done with what they say is equivalent to only 100-200 GPUs. Crazy to see how far we've come in just a short amount of time.
Pretty interesting that they only used about $15k worth of resources (retail price) to achieve this. It's not a technique that would have been out of reach for other organizations based only on not being able to afford the compute.
That’s only for the final model. To find it, they’d need to run 1,000 experiments, trying many high-level approaches, many architectures for each component, hyperparameter search, and multiple seeds. Large machine learning projects need $10M in capital.
The vast majority of structures in the protein data bank are determined by crystallography, which involves putting the protein in a chemical cocktail that causes it to crystallize. The cocktail is very different from the chemical environment in which the protein functions, so an open question is whether the protein structure determined by crystallography (and hence learned by AlphaFold) is representative of the structure in its natural environment.
It would be very interesting if there was a way to use computational techniques to go beyond what crystallography and other experimental techniques (Cryo) can accomplish and determine the protein structure in its true biological setting. Research into experimental methods for this includes high power X-ray pulses.
This is a lot bigger than people are assuming. If protein folding can be done quickly and cheaply it will trickle down to a lot more than medicine. It is going to advance biofuels, food production and a lot more.
My conclusion reading this is that a gradient is a gradient is a gradient. If you can minimize one, you can minimize them all. The hard work would seem to be figuring out how to transform into a gradient that your hardware can solve. It will also be interesting to see the kinds of systematic errors that will come as a result of the biases in the training set, and whether it can be used to predict what the structures would look like under slightly different conditions (e.g. pH).
I worked in the lab that helped develop folding@home, as well as the game where the crowd was the chaotically trained machine that folded and unfolded one amino acid at a time. This feels like a pretty significant new chapter in the humanity movie.
A few times, I get immense pangs of jealousy for younger people a generation or a half before me. And I'm only 30! This is one of those times.
This is amazing, if we can simulate multi-protein interactions, you could imagine in our lifetimes being able to see a fully computation driven simulation of a human blood cell. That would be a huge breakthrough.
What amazed me most was that they used hundreds of millions of unlabelled protein scans. This means we can collect massive data in a new modality, besides the usual suspects: images, video, audio, text, lidar and sensors. Soon I expect neural implant data to be massive as well.
They surely did unsupervised training on raw data and then fine-tuning on the 170K labelled sequences. I expect the data volume could be increased by orders of magnitude in the next couple of years and we'll see a GPT-3 like jump.
This is a big step forward, but the outstanding question as far as to whether or not this is useful for evaluating novel proteins, is going to be how good is the confidence metric at telling the user to trust or not trust the results. You can see from their examples, that AlphaFold is very good but not perfect. I imagine for some proteins it will still give misleading or erroneous results and if you can’t tell when that happens without verifying the structure experimentally then this will likely not be that useful for new science.
> the outstanding question as far as to whether or not this is useful for evaluating novel proteins
That is not an outstanding question. The test on which DeepMind scored high marks is a test of how well the algorithm folds novel proteins -- proteins whose ground-truth structure has not yet been published.
We’d have to see the distribution of GDT scores evaluated on unknown proteins to say anything about how confident we can be. If the distribution is tightly distributed around the median then great, this works really well. If the variance is large though then you’re going to have a hard time using this for meaningful predictions.
According to the article there's a confidence score as well. As long as this is sufficiently predictive of errors either a tight or wide distribution is likely acceptable.
We need to see the relationship between confidence and GDT score. If you have a nice relationship then again everything is great. But... most confidence metrics from neural networks do not have a nice relationship to the primary metric.
You don't generally look at neural network output like that.
There is generally a threshold: below X, the input is not in the class; at X or above, it is. Then you run the network with the same threshold on a known data set and compute a confusion matrix, which tells you about the error. I don't even want to know what a confusion matrix analogue for 3D geometry would look like, but I'm sure they have something.
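The threshold-then-confusion-matrix procedure described above can be sketched in a few lines of Python. This is only the binary-classification version; the scores and labels are invented for illustration, and nothing here reflects how CASP or AlphaFold actually evaluates 3D structures.

```python
def confusion_matrix(scores, labels, threshold):
    """Count outcomes for a classifier that predicts the class when score >= threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for score, is_positive in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and is_positive:
            counts["tp"] += 1  # predicted the class, and it is the class
        elif predicted_positive and not is_positive:
            counts["fp"] += 1  # predicted the class, but it is not
        elif not predicted_positive and is_positive:
            counts["fn"] += 1  # missed a real member of the class
        else:
            counts["tn"] += 1  # correctly rejected
    return counts


# Hypothetical network outputs on a data set with known answers.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False]

matrix = confusion_matrix(scores, labels, threshold=0.5)
print(matrix)
```

Moving the threshold trades false positives against false negatives, which is why the same threshold has to be used on the known data set and on new inputs.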
This is literally the process one goes through when taking part in this. And the error rate (specifically the lack of errors) is what everybody is talking about. 90 is just as accurate as we can get with experimental measurement. It's likely at this point the source of error is in the data set (we can only train on data we experimentally measure and these are not perfect measurements). It's also possible, at this point, that the model generalized so well that when it deviates from experimental measurements it's actually correct and the experimental value was the one that was wrong.
So no, the outstanding question is not "is going to be how good is the confidence metric at telling the user to trust or not trust the results." Nobody is going to be looking at confidence values when the model is giving an output; they are going to be looking at the overall error rate across a broad spectrum of proteins to get a sense of its accuracy.
Scientists can verify that an AlphaFold-predicted structure is correct, or at least useful, without being able to get the structure experimentally. For instance, we could use the AlphaFold-predicted structure to do protein-ligand binding calculations for a bunch of known molecules. If these calculations agree with experimental protein-ligand binding (which they generally do for proteins with known structures), then we can say with high confidence that we've got a good structure.
The way computer scientists do it, yes, it is. In the CS situation you define an energy function (in this case representing the physical behavior of the protein in water) and find a heuristic to approximate the coordinates of the lowest energy configuration; done, problem solved.
In reality, that's not how it works at all. The energy functions we have are crappy and require too much sampling before we can find the lowest energy configuration. And more importantly, it doesn't look like proteins typically fold to their lowest energy configuration (with the exception of some small fast two-state folders), but rather explore a kinetically accessible region around there (or even somewhere else entirely, if the energy cost to transition is too high).
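A toy illustration of the "define an energy function and minimize it" recipe, and of why the result depends on where you start. This is a sketch under loud assumptions: the one-dimensional function below is an arbitrary double well chosen for illustration, not a protein force field, and plain gradient descent stands in for whatever heuristic a real method would use.

```python
def energy(x):
    """Toy double-well 'energy landscape' (illustrative only)."""
    return x**4 - 3 * x**2 + x


def energy_gradient(x):
    """Analytic derivative of the toy energy."""
    return 4 * x**3 - 6 * x + 1


def minimize(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent: slides downhill into whichever basin it starts in."""
    x = start
    for _ in range(steps):
        x -= learning_rate * energy_gradient(x)
    return x


trapped = minimize(start=2.0)   # ends in the shallower well on the right
found = minimize(start=-2.0)    # ends in the deeper well on the left

print(f"from +2.0: x = {trapped:.3f}, energy = {energy(trapped):.3f}")
print(f"from -2.0: x = {found:.3f}, energy = {energy(found):.3f}")
```

The run that starts on the right never crosses the barrier, so it reports a higher-energy minimum. That is only a loose analogy for a chain settling into a kinetically accessible state rather than the global minimum.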
Methods like AF depend heavily on large amount of information correlation from evolutionary data, which has historically been of the highest value for making decisions about protein structure.
I was wondering the same thing. But I also wonder if having good guesses makes the x-ray crystallography and other experiments to verify a given protein easier/cheaper/quicker? I don't know enough about the actual techniques to have an informed opinion but I would think it would be helpful.
Every simulator is going to have error. In this case this biennial challenge represents the computational state of the art, with scores of 30-40 over the last decade. The AlphaFold2 model sends that score up to 87, with errors about the width of an atom. You can actually see the difference between their prediction and the actual result and it’s stunning. This is all on the blog site so I recommend reading before throwing shade.
There's a difference between being a random commentator on HN and being one of the several experts in the field quoted in the article, among other things, predicting a mass exodus from the computational biology field as the major problem of that field is now solved.
It’s a good question, and I’m not a domain expert here.
The article did claim:
> According to Professor Moult, a score of around 90 GDT is informally considered to be competitive with results obtained from experimental methods.
So perhaps their score of 87 GDT is pretty significant. But “competitive with” is not the same as “always in agreement with”, as you point out. Could be the failure modes are problematic.
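For readers wondering what a GDT number measures, here is a simplified sketch of GDT_TS: the average, over cutoffs of 1, 2, 4 and 8 angstroms, of the percentage of residues whose C-alpha atom lies within that cutoff of the experimental position. The big assumption is that the two structures are already superimposed; the real measure searches over superpositions, so treat this as an approximation. The coordinates are invented.

```python
import math


def gdt_ts(predicted, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS for two pre-aligned lists of C-alpha (x, y, z) coordinates."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("structures must be non-empty and the same length")
    # Per-residue distance between predicted and reference positions.
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    # Percentage of residues within each cutoff (in angstroms).
    percentages = [
        100.0 * sum(d <= cutoff for d in distances) / len(distances)
        for cutoff in cutoffs
    ]
    return sum(percentages) / len(percentages)


# Hypothetical four-residue example: errors of 0.5, 1.5, 3.0 and 10.0 angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 10.0, 0.0)]

score = gdt_ts(predicted, reference)
print(score)  # (25 + 50 + 75 + 75) / 4 = 56.25
```

A single badly placed residue caps the score without saying where the error is, which is one reason a per-region reliability estimate matters alongside the headline number.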
There are other experimental methods that are much cheaper that can be used to assist validation. Also the models look damn impressive, even down to the sidechain packing.
I think you have this backwards in practice. It was in the 80s that I first read a paper about a de-novo protein design engineered for a specific stable conformation. Natural proteins have no reason to be particularly predictable, just as genetic programming produces hard-to-understand programs relative to human-written ones. In fact making the structure especially stable against perturbations seems like it'd make it less responsive to changing evolutionary pressures.
The forward folding problem lets you determine structures from a known genetic sequence. So for example you could very quickly sequence the genome of a virus and figure out how it worked much faster than current methods allow.
The reverse folding problem lets you specify a structure and then make a genetic sequence to produce it. For example you could look at this virus to see how it infects its host, then design a custom protein to act as an anti-body stopping it, which is a capability we don't currently have.
Forward folding is certainly useful, but reverse folding would be revolutionary.
The set of all proteins which can potentially be expressed in an organism is known. Now maybe we also get decent (static) structure information for these. But the interaction of a virus with the host cell is much more complex. There is much more than just an amino acid sequence involved. And these parts are all moving, so a static picture as we now can create faster than before does not contain all the information necessary to fully understand the functions.
No, the genome of the host is much smaller than the theoretical number of combinations. There are about 20 to 30k different proteins in a human cell (about 20k directly encoded on the DNA).
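A quick back-of-the-envelope check of that gap, as a sketch. The 20 standard amino acids are textbook; the chain length of 100 is an assumption picked as a smallish protein, and 30,000 is the upper figure from the comment above.

```python
AMINO_ACIDS = 20          # standard proteinogenic amino acids
chain_length = 100        # assumed length of a fairly small protein
human_proteins = 30_000   # upper estimate quoted in the comment

# Number of distinct sequences of that length.
possible_sequences = AMINO_ACIDS ** chain_length

print(f"possible sequences: {possible_sequences:.2e}")
print(f"digits in that number: {len(str(possible_sequences))}")
print(f"fraction actually used: {human_proteins / possible_sequences:.2e}")
```

That works out to roughly 1.27e130 possible sequences for a single length, so the proteins an organism actually encodes are a vanishingly small sample of sequence space.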
Right, but you made the example with the virus docking at a known organism. If you do synthetic biology and modify bacteria to produce any proteins then the situation is different of course.
The other comment mentioned the example of making proteins that bind a structure. Here's an extension: a general understanding of how an enzyme works to catalyze chemical reactions is that it binds the reaction intermediate with higher affinity than the two substrates. Thus, if we have this reverse ability, we can start inventing enzymes that can catalyze any arbitrary chemical reaction, even ones that need energy input, so you could imagine, for example, enzyme systems that can convert plastic to fuel!
Ok, then this is about enzymes which do not yet exist in the organism. You could then modify bacteria so they produce this enzyme and feed on plastic, I see.
But producing fuel as the fellow suggested would then be another function to be added to the bacterium; and maybe it should work on different kinds of plastic.
Of course, that's why I focused on degradation. There's plenty of room for improvement. For instance, PETase is not very efficient actually, and many research groups are working on its engineering.
Considering the resource requirements for this AI approach mentioned in the article, it's unlikely that it's been tested on more than a few tens to hundreds of proteins. This may only work on a subset of the proteome, so I would think it worth it to continue playing if you find it to be a fun pastime.
fold.it was always more geared towards being edutainment than actually contributing solutions. Of the ~20 publications made related to fold.it over a decade, ~5 of them seem to have contributed to solving structures, while the rest of them are about the game itself.
For-profit corporations that value protein engineering will beat a path to DeepMind's door ASAP, like pharmas.
Protein conformation prediction is essential when engineering new small-molecule drug compounds that must 'dock' with the specific proteins that regulate disease. Knowing how to create a protein with the precise shape to become biologically active has soaked up a lot of R&D funding toward pie-in-the-sky techniques that promise to advance that agenda (like quantum or DNA computing).
If this method works as DeepMind says, it will immediately be adopted by every pharma to assess and tweak the shape of candidate proteins.
you give pharma too much credit. I had built a previous system to do something similar to this that produced excellent results and tried to give it away for free to Genentech, which ignored me. They said it didn't work for their purchasing department.
I feel that the "produced excellent results" has a lot to unpack there.
It obviously wasn't scoring 90+ in CASP.
Actually, after reading your linked blog post, it's pretty obvious why they weren't exactly champing at the bit:
"To gain insights into the receptor’s dynamics, Kai performed detailed molecular simulations using hundreds of millions of core hours on Google’s infrastructure, generating hundreds of terabytes of valuable molecular dynamics data."
Hmm, yes, hundreds of millions of hours of CPU time, hundreds of terabytes of data, who says no to that? It doesn't even seem to attack generalized protein folding. It really seems like the plan was, "let's attack this problem with a Google-sized firehose" rather than creating a fundamentally different algorithm that had game-changing results.
Comparing your system to AlphaFold seems like you're really bending the truth here.
If you note, the paper has an enormous number of citations from pharma, since modelling protein dynamics, rather than static structure, is key to understanding ligand binding.
You can see another paper we published where attacking the problem with a firehose helped unlock a long-standing problem: https://pubmed.ncbi.nlm.nih.gov/24265211/ In this case, we showed that bond angles need to be 'free' to move rather than fully constrained, to build the most accurate models. This paper is also heavily cited amongst protein modellers.
It is correct that the MD simulations don't directly work for CASP; in a sense, the results they produce directly disagree with CASP's mental model of protein structure and function.
I don't believe you, but I look forward to you showing proof of this with some links (and if you tried giving it for free, I assume you just open sourced the whole deal, so I look forward to a repo link or the like).
we offered the service for free to Genentech since I used to work there and knew they could probably use it to get some good publications.
We didn't open source the distributed computing framework, but the underlying technology (Folding@Home) is based on gromacs, which is open source. It's the scale at which it ran, and the processing pipeline for filtering the results that had the real value.
> What are the immediate real-world applications of this?
A protein is actually a linear sequence of amino acids, but in a cell this sequence has a three-dimensional arrangement like a clew of thread. The arrangement is not random, but dependent on the specific composition of the sequence (i.e. selection and order of amino acids) and some other factors. To understand the function of a protein, we need to know this three-dimensional arrangement (i.e. structure). Up to now the structure determination process was mostly manual, complex, time-consuming (several months up to more than a year) and error prone. If structure determination by DNN is reliable, this is a big win for life science. There are still a lot of problems open: e.g. the structure is not constant over time but there are "moving parts" in the structure which are important for its function.
Given the DNA code for one of the "machines" that run cells, we can generate an atomic model of that machine. This means we can "compile" (one part of) the DNA code. It was already possible, but so slow that entire datacenters would spend months calculating this for a single protein and even then we can't use them on the really complex ones at all, necessitating things like neutron spectroscopy which are totally insane, and only work on like 1% of proteins.
This is useful because for example chemical simulation tools don't run on DNA code, but on atomic models. And also to produce "images" of the molecules (images between quotes because most proteins are too small to interact with reasonable photons, and no interaction with photons means you can't see them in any way)
DNA has other parts that are really important but we don't understand at all yet, where this doesn't help at all. This applies to sections of DNA sent to ribosomes, to produce actual molecules. Besides that, there are pieces of DNA that "index" the DNA, pointers (from one gene to another), triggers (that for instance start production of an enzyme based on some external influence, like detection of a marker molecule) and export markers (that tell you what to do once the protein is produced, for example, mark a protein to be removed from the cell, incorporated into the cell membrane, or for instance used inside the cell nucleus, and there's also one that essentially says "at this point stop producing a protein and instead couple the rest of the DNA code to the end of the protein you just made").
The full chain is DNA -> mRNA -> Ribosome -> tRNA combinations -> amino acid chain -> protein.
It's true that in nature there are many steps between DNA and proteins (this list doesn't even include the steps that mediate the translation, ie. start it, stop it, slow it down, ...), but the structure of a protein is fully determined by the DNA code.
Protein folding means you start from the DNA code that is fed into the ribosome, ignoring all the meta information, and come up with an atomic model (a VERY long list like "H atom at 3.27, 2.17, 12.18, C atom at 2.87, 2.19, 12.33, ..."). Now there's a million niceties we've discovered to make this problem simpler and nicer looking, but that's what it boils down to.
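The easy half of that chain, going from coding DNA to an amino acid sequence, is a table lookup, which a short Python sketch can show. The codon table below is deliberately partial (only the standard-genetic-code entries needed for the made-up example sequence), and the function name `translate` is just an illustration.

```python
# Partial codon table from the standard genetic code (sense strand, DNA letters).
CODON_TABLE = {
    "ATG": "M",  # methionine, also the usual start codon
    "GCC": "A",  # alanine
    "AAA": "K",  # lysine
    "TGG": "W",  # tryptophan
    "GAA": "E",  # glutamate
    "TTT": "F",  # phenylalanine
    "TAA": "*", "TAG": "*", "TGA": "*",  # stop codons
}


def translate(coding_dna):
    """Read codons left to right and return one-letter amino acids up to the first stop."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        codon = coding_dna[i:i + 3]
        residue = CODON_TABLE[codon]  # KeyError for codons missing from this partial table
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


# Hypothetical coding sequence: ATG GCC AAA TGG GAA TTT TAA
sequence = translate("ATGGCCAAATGGGAATTTTAA")
print(sequence)  # MAKWEF
```

Everything after this step, turning a string like `MAKWEF` into 3D coordinates for every atom, is the part that has no lookup table and is what structure prediction is about.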
Thank you very much; almost forgot I did a Phd on the subject ;-)
But anyway your answer does not contradict my statement. What you say belongs to the basics of molecular biology, but does not justify that DNA should be considered when determining the structure of proteins. In practice, the amino acid sequence is always already present.
For the sceptics: if you read the referenced article, you will see that it is about protein structure determination by means of deep neural networks. It's not about gene expression, which is a different topic. What benefit does it have to respond to the question "What are the immediate real-world applications of this" (see above) by reciting some molecular biology dogmas from text books mixed with misconceptions, instead of responding to the real question?
Nobody is suggesting that this research has anything to do with gene expression or anything like that. Their point was simply that we now have better tools to actually see the meaning/effect of a given DNA sequence.
Also, there is no need to passive-aggressively highlight your credentials. I already researched them before replying.
I rather think most people comment without even having a look at the referenced article. And since when is the reference to a qualification considered aggressive? If your doctor hangs his doctor's certificate on the wall, is he "passive-aggressive"? Pretty weird.
> that we now have better tools to actually see the meaning/effect of a given DNA sequence
Note that the "meaning/effect" of a DNA segment encoding a protein is known and unrelated to the protein folding process. The protein gets its conformation after the translation process.
> Note that the "meaning/effect" of a DNA segment encoding a protein [...]
The "meaning" of a DNA segment is not to encode a protein. The "meaning" is to describe a mechanism in the host organism (by way of encoding a protein). That is a complex process which involves gene expression AND protein folding.
For example would you say that the "meaning" of some Java code is to generate bytecode? Of course not, the "meaning" is to run some algorithm on the computer that executes it
So what? The DNA only codes for the RNA and amino acid sequence. Structure determination is yet another topic. When we determine the protein structure we already know the sequence. Nor does DeepMind have to look at the DNA to train their DNN.
Have you read the article? It's about protein structure determination. The DNA only determines the RNA and amino acid sequence. But who cares. I will get a bit less work and citations because http://cara.nmr.ch/doku.php will be less used in future.
We indeed stand on the shoulders of a small number of giants! I'm infinitely thankful for the work DeepMind is doing. Let's maybe celebrate this accomplishment for one day and start being worried about big tech again tomorrow. Many of the comments here usually suggest that we should live in worry and fear, but to my knowledge there is not too much historical evidence for these kinds of companies turning evil.
This sounds big, like really really big. At least from my old times providing my idle computing resources to Folding@Home and following that project, this seems like the major golden milestone for protein folding.
So, sorry to be a philistine but what specific discoveries will this lead to... will it make it easier to produce antivirals or even molecular machines?
This is a huge jump forward. Last year's performance already was a big step up over the previous, and this seems to go much further. So big kudos to the research team.
Nonetheless, I'd like to hear more from specialists outside the context of a marketing blog post before I fully buy into a claim of a solution.
There's also a rabbit hole about what 'solution' actually means. Is the performance sufficient for any protein folding prediction application that might arise in the future?
Anyone care to muse about appropriate investment strategies based on the not previously feasible research approaches that might now be possible?
Should we expect to see faster progress in large well capitalized bioscience companies -- or a sudden increase in the viability of smaller biotech and/or biotech startups ...? Are we gonna see top talent fleeing the old biotech companies to start their own ventures with a new belief that the potential for huge reward might suddenly seem achievable?
What kind of companies do we think will be the first that are able to translate this new knowledge into profits?
GDT_TS for AlphaFold is now comparable to experimental levels, but that's based on the class of proteins for which we've been able to determine the 3D structure, for which there might be selection bias.
I wonder if we can determine if this extends to proteins that aren't as keen to determining their 3D structure?
For example, certain proteins are more crystallizable than others. For these non-crystallizable proteins, I wonder if we can say that AlphaFold would generate accurate 3D models? And if possible, might there be a way to map out this uncertainty?
> I wonder if we can determine if this extends to proteins that aren't as keen to determining their 3D structure?
This has already happened.
"An AlphaFold prediction helped to determine the structure of a bacterial protein that Lupas’s lab has been trying to crack for years. Lupas’s team had previously collected raw X-ray diffraction data, but transforming these Rorschach-like patterns into a structure requires some information about the shape of the protein. Tricks for getting this information, as well as other prediction tools, had failed. “The model from group 427 gave us our structure in half an hour, after we had spent a decade trying everything,” Lupas says."
Agree this is great to hear, but the fact that they had X-ray diffraction data indicates this protein was indeed crystallizable no?
Though the next paragraph in the article shows that DeepMind is indeed working on mapping out reliability:
"Demis Hassabis, DeepMind’s co-founder and chief executive, says that the company plans to make AlphaFold useful so other scientists can employ it. (It previously published enough details about the first version of AlphaFold for other scientists to replicate the approach.) It can take AlphaFold days to come up with a predicted structure, which includes estimates on the reliability of different regions of the protein. “We’re just starting to understand what biologists would want,” adds Hassabis, who sees drug discovery and protein design as potential applications."
> Agree this is great to hear, but the fact that they had X-ray diffraction data indicates this protein was indeed crystallizable no?
Yes. CASP uses as targets proteins with no known published structure but a solved or soon-to-be-solved one. They are then kept on hold until the end of the competition.
From what I gather, training was done on 170,000 Amino Acids (features) and the resultant protein structure (labels). This is out of 200 million possible proteins.
As usual with ML, I now wonder how “similar” the test set is to the training set, compared to the examples that are neither in the training set, nor the test set:
200 million - 170,000 training - 100 test ~= 199.8 million proteins
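Spelling out that subtraction, taking the figures in this comment at face value (variable names are just for illustration):

```python
total_proteins = 200_000_000   # "200 million possible proteins" from the comment
training_examples = 170_000    # training set size from the comment
test_targets = 100             # approximate number of test targets from the comment

# Proteins that are in neither the training set nor the test set.
unseen = total_proteins - training_examples - test_targets
training_fraction = training_examples / total_proteins

print(f"neither trained on nor tested: {unseen:,}")
print(f"share of all proteins used for training: {training_fraction:.3%}")
```

So the training set is under a tenth of a percent of the total, which is why the similarity between test targets and training examples matters so much.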
They trained on 170k sequences/ structures/ proteins, each sequence has 10s to 100s or even 1000s amino acids. Structure is much more conserved than sequence. Out of the 100 targets, roughly 1/4th have no similarity to known structures, so there shouldn't be an overlap for those with the training set. They did very well on those targets.
What happens when AI is better at everything measurable than humans?
Better at conversation. Better at making people laugh, and generate attraction or other emotions, better at motivating them, and organizing movements, etc.
Clearly we are not ready for such an efficient system... it would be a big disruption to all human organizations and relations. It would start with Twitter botnets and directing sentiment.
They still suck pretty bad at many physical things. Bipedal robots are a joke. They also dent and rust. It'll be that way for a while. They don't reproduce.
But virtual world, say they're better at math. Say they prove all the Clay Millennium Problems. Say they go way beyond those problems and produce some math far beyond human's ability to understand it.
I've been thinking about that for a while and have decided it's fine. Math as a profession will still exist. Fact is, there's already a proof for everything mathematicians are investigating (or a proof that there's no proof, recursively), out there somewhere. Mathematicians are just searching for it, so that it can be understood and translated to human language. The fact that AI already knows the answers doesn't mean that human mathematicians are useless: they are still required to uncover the meaning of these results and translate them into human language. AI then is still just a tool that mathematicians use to help them in their search. Similar to how biologists will use AlphaFold. I guess.
Like, this is awesome and a huge advancement, but one thing that worries me with an AI solution is that it doesn't really draw us any closer to the why. Why do proteins fold the way they do? We can predict the resulting structure, which is extremely significant, but we have no clue why. While we get the insight of being able to predict some structures we don't get the insight of why things are happening the way they are. In some cases like this it might not matter, but in other cases that insight might actually be way more significant than answering the problem to begin with. Of course we can review the problem with the additional predictions that AI gives us, but this can be hazardous: what if there is a specific sequence that spins in some certain way that we, and thus the AI, have never seen, and it gets missed? I'm not a biologist, so I can't say this is possible, but I know this kind of edge case can come up, and what rabbit holes will we go down because we only have the AI-implied insight?
disclaimer I think the contributions are super useful for science but they do come with worries as does every path of discovery
I think the why is pretty clearly understood (https://en.wikipedia.org/wiki/Protein_folding), in the same way that we understand the mechanism behind the three body problem in physics or quantum computing. But that does not necessarily imply that there is an efficient way for us to simulate/predict the results of having nature play out those mechanisms.
There are two threads here. The first is that it would not be surprising to learn that describing the way that proteins fold is a very hard thing for humans to understand. See e.g. 4CT [1] and its computational proofs.
The second is that explainability in ML is much more tractable than it was 10 years ago. This is not to say that it's solved, but having solved the predictive problem -- I would expect model simplifications and SME research to proceed more quickly towards understanding the how now. I did some work w/ an Astrophysics postdoc using beta-VAEs [2] to classify astronomical observations, and simplifying models in order to achieve human-explainability proved to not cost as much predictive power as you might expect. It might be that the same holds true here.
> While we get the insight of being able to predict some structures we don't get the insight of why things are happening the way they are.
This isn't something specific to AI, but to science itself. We know the value of c, but not why c has that value. Sure, we can point to something like the Lorentz transformation, but we can't and probably won't ever be able to explain why it has these particular constants; we just know that we can measure them and that this is what they are.
Science isn't in the business of answering why. A successful scientific theory does two things: A) makes useful predictions, B) is correct in its predictions. It'd be wrong to call a NN a scientific theory, but it certainly does make predictions and, as these results show, it is correct in its predictions.
Sometime soon, humanity is going to have to come to terms with the fact that we will soon enter (or perhaps already have entered) an age where mankind is not the only source of new knowledge. AI-derived knowledge will only increase as the future unfolds, and the analysis of such knowledge will likely become its own branch of study.
I agree as long as science is a business. But why is science a business?
If science is not meant to answer why, does this mean we cannot know why?
should we just give up on having story-like (narrative) explanations for why and how things work? it seems like we are headed to a world where the computer just tells us what to do and where to go. a world in which we are free from having to think about why we are being told to do whatever it is we're doing. click (or tap) buttons, get tokens to buy food and pay rent.
These are predictions. Presumably the proteins will be inspected and the model refined and updated before we start using DNA without first checking the output.
It could be that more complex phenomena don't have a simple explanation. It could be that they do. But just because I would like a why doesn't mean that there is one. (Personally I think there is a why.)
AI solves the process but doesn't give a whole lot of insight into the formulas and the description of what's going on. We as humans have found that E = mc^2. An AI, however, would give us E or m but black-box us away from seeing that c, aka the speed of light, was involved (unless we implied that beforehand). There might be interesting, useful relationships that AI unintentionally masks and that could be groundbreaking if only we could understand the process more holistically. I think a different commenter alluded to this: in this case we think we understand protein folding well, we just struggle to synthesize it in a compact mathematical way, even though with AI we can simulate the process well for known examples.
The issue with AI is that we don't know if our current example set includes every case. What if there is a strange sequence of amino acids that causes something "weird" to happen that we haven't seen? AI cannot predict something novel that neither it nor we have seen, which is the issue. The process (if it exists) by which one could solve this problem might also be exportable to other fields if it were formalized with math rather than estimated with AI.
I've long been an AI/ML positivist in the field of protein structure prediction (but not in drug discovery in general), admittedly a bit surprised it was now and not 3-4 years from now... And for a long time I have been saying that a "heuristic" model for folding is going to win (and it looks like it has). However, I would also caution that there are going to be protein structures that are not in the corpus of known structures (being able to solve the structure at all is itself a biasing factor) and AlphaFold's capability to figure those out will be interesting. I would not necessarily be confident it could. (Think of issues like face detection algorithms not being able to correctly identify minorities, e.g.)
IMO, AlphaFold 2 is a great example of industrial research labs making huge breakthroughs. I'm not sure if AlphaFold 2 is overhyped or not (because I don't know anything about protein folding), but given how a lot of computational biologists reacted to the results (the co-founder of CASP seems very impressed :')), I suppose this is a big deal. I hope DeepMind becomes the Bell Labs for AI. Bell Labs is the best example of industrial research labs making huge strides. Of course, AI doesn't exist yet, and deep "learning" is nothing but curve-fitting done in fancy ways, but I would not be surprised if DeepMind results in a few Turing and Nobel laureates.
Been out of the field for a while, could someone currently in it qualify these results? Hyperbolic title notwithstanding, they approach 90% median free modeling accuracy. The "other 90%" still remains to be solved...
I don't think anyone on HN is going to have more authority to qualify the results than the independent experts quoted in the linked article. Among whom are numbered a Nobel laureate, the president of the group that designs the tests of protein folding systems, and the former CEO of Genentech+current CEO of Calico.
I would imagine that he is not assessing this advancement merely using his own personal expertise, but rather the combined expertise of the resources he represents. CEOs don't just look at problems and potential solutions. They have people who look at those things, and then tell them their opinion. In any case, you've picked a nit with one of the three people quoted. Any objections to the other two?
My main objection to Venki (the Nobel Prize winner) is that the prize in that case should have gone to my advisor, Harry Noller. John Moult... he's a nice guy but I think he's being a bit breathless here.
CASP is not "the organization that tests protein folding". It's an organization that every two years does a blind prediction and publishes the results (I've competed, some 20 years ago). John's a protein expert, no question about it.
I knew him moderately well back in the day because our advisors moved in similar circles.
The method relies on multiple-sequence-alignment (MSA) of homologous proteins. This cannot fold arbitrary proteins, only biologically relevant ones that have high quality MSAs available. It's also worth pointing out that the gold-standard for validating MSAs relies on PDBs of folded proteins. This is exciting work that will assist NMR and XRay crystallographers, but it's not a panacea of protein folding.
It doesn't matter so much how they perform the feature extraction, so much as what their inputs to the feature extraction are.
This model requires a collection of wild-type proteins in an accurate MSA. Producing an accurate MSA is hard even if you have many homologs.
They require protein homologs which means they can "only" do this for wild-type proteins. This work is useless with mutant and synthetic proteins. This is a big advancement that will assist crystallographers and NMR structural biologists with difficult wild-type proteins, but it doesn't "solve protein folding" by any stretch of the imagination.
> Producing an accurate MSA is hard even if you have many homologs.
To assess co-evolutionary couplings, the number of homologs in the MSA is not as important as the number of effective sequences (i.e. sequence depth and diversity) in it.
> They require protein homologs which means they can "only" do this for wild-type proteins.
Even remote homologs work, as shown by the widespread use of HMM-based methods in the prediction pipelines.
> This work is useless with mutant and synthetic proteins.
Unless you generate a flurry of data with them using deep mutational scanning for example. As long as correlated mutations are present in the MSA the technique should work as expected no matter where the protein sequences originated.
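For readers who haven't met the "number of effective sequences" idea mentioned above, here is a minimal sketch of one common way to estimate it: each sequence is down-weighted by how many near-duplicates it has in the alignment. The function names and the 80% identity threshold are assumptions made for illustration; real pipelines differ in the threshold, gap handling, and normalization they use.

```python
from typing import List


def sequence_identity(a: str, b: str) -> float:
    """Fraction of aligned columns with the same residue (gap-gap columns are ignored)."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def effective_sequences(msa: List[str], threshold: float = 0.8) -> float:
    """Sum of per-sequence weights, where weight = 1 / (number of sequences at
    or above `threshold` identity to it, counting itself).

    The 0.8 threshold is an assumed, commonly used value, not a fixed standard.
    """
    total = 0.0
    for seq in msa:
        neighbours = sum(sequence_identity(seq, other) >= threshold for other in msa)
        total += 1.0 / neighbours
    return total


if __name__ == "__main__":
    # Three identical sequences plus one unrelated sequence: 4 rows in the
    # alignment, but only about 2 "effective" sequences.
    toy_msa = ["ACDEF", "ACDEF", "ACDEF", "WYWYW"]
    print(effective_sequences(toy_msa))
```

The point of the toy example is that adding more copies of the same homolog barely moves the number, while adding a diverse sequence does.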
I'm honestly not familiar with "deep mutational scanning." Can you share a link? I'm first author on papers related to the structural biology of coevolution and I competed in CASP about a decade ago, but I haven't kept up much since then.
Can anyone (yet) provide a sketch of how this works? I saw a mention of "attention", which I vaguely take to be a surrogate for some form of structural information. It's an astonishing result. How does it work?
This will undoubtedly change our understanding of human health and biology in many impactful ways in the years to come!
The same information we get through x-ray diffraction will now be available 100x or even 1000x cheaper, and using this model can even aid the interpretation of x-ray diffraction data!
What excites me most isn't doing what we can do now, for cheaper (which will surely lead to more effective research methods), but the potential to gain a systematic view of protein structures, either across the genome, species, or through time which will give us a deeper and more fundamental understanding of biology.
> We trained this system on publicly available data consisting of ~170,000 protein structures from the protein data bank together with large databases containing protein sequences of unknown structure. It uses approximately 128 TPUv3 cores (roughly equivalent to ~100-200 GPUs) run over a few weeks
Actually, no! Or at least, the budget they used (<$100k at retail prices to train the model) is well within the feasible range for other research institutions.
In other words, it's less like GPT3 and more like ImageNet.
Is the cost to train really the relevant metric for developing this? It seems like the salaries involved are probably at least 10x whatever they spent on hardware.
I was replying in the context of the grandparent:
> Did AlphaFold2 also have the biggest budget? :)
And then the parent
> Actually, no! Or at least, the budget they used (<$100k at retail prices to train the model) is well within the feasible range for other research institutions.
I'm not sure how the cost of replicating the model in the future is relevant in this context. We appear to be discussing the cost of developing this model from scratch, such as what it would have taken an alternate team to create and submit this if DeepMind never got involved.
I don't know - tens of thousands per train is not accessible for most academic institutions when you consider the necessity of ablation studies, experimentation, etc.
Has anyone got any other good references for this? After some of the dodgy experiments related to AlphaZero (comparing to purposefully degraded chess systems), I'd love to see some independent analysis.
The article in Science implies that we have independent confirmation of predictions yielding useful results, beyond the challenge itself:
> The organizers even worried DeepMind may have been cheating somehow. So Lupas set a special challenge: a membrane protein from a species of archaea, an ancient group of microbes. For 10 years, his research team tried every trick in the book to get an x-ray crystal structure of the protein. “We couldn’t solve it.”
> But AlphaFold had no trouble. It returned a detailed image of a three-part protein with two long helical arms in the middle. The model enabled Lupas and his colleagues to make sense of their x-ray data; within half an hour, they had fit their experimental results to AlphaFold’s predicted structure. “It’s almost perfect,” Lupas says. “They could not possibly have cheated on this. I don’t know how they do it.”
True, but I haven't seen an independent discussion of the CASP results. There is a good chance this is great, but I don't trust DeepMind press releases.
The "dodgy experiments" (setting the per-turn computation time to a fixed value) in the chess system were only in the pre-print. In the actual publication, they allowed for full time control of the most up-to-date version of Stockfish.
Yes, and then Leela Chess Zero, an open source implementation of AlphaZero, beat the latest Stockfish in the de facto engine championship (TCEC). Since AlphaZero, the TCEC finals have been traded back and forth between Stockfish and LCZero.
This last season, Stockfish won by using NNUE, a neural network based evaluation function.
At Sun back in the day our workstations tended to have fairly promiscuous login settings, so one of my coworkers took the liberty to launch folding@home on every machine in the org. Listing running processes one day, I saw this thing pegging my CPU; asked around and others had it too. A virus!?! Then he fessed up. Kinda miffed at first but ultimately really cool, so we let the thing keep running. That was my introduction to the whole protein folding problem, and it's really great to see this milestone!
I ran Folding@Home at Google on hundreds of thousands of fast Xeon cores for over a year. I concluded at the end that unbiased MD simulations are not an effective use of computer time.
If this is the only thing that comes from AI, or if this is the only lasting application of the technology, then all of the research and time and code and frustration will have been worth it.
After I had some time to think about it, I came to a different conclusion. Contrary to my first assumption, Bio NMR (in contrast to crystallography) will become more and more important, since the method allows us to study the dynamic properties of proteins. With the structure predicted by DNNs, the chemical shifts to be expected in the NMR spectra can be calculated; the assignment problem is thus largely eliminated. Bio NMR can then be used specifically to study the "parts that move".
Where will development go from here? We have been working for the last few months on a geometrical approach to the same problem that avoids the curse of dimensionality. Now I wonder whether it makes sense to continue at all (we were and are clearly not ready to participate in the challenge yet). So what remains unresolved? Exact position of side chains? Can their approach be used for protein-protein interaction too?
This is great and I feel weirdly relieved (considering I don't actually really gain anything from that).
That, on the other hand, makes me feel sad and almost depressed every time:
> It uses approximately 16 TPUv3s (which is 128 TPUv3 cores or roughly equivalent to ~100-200 GPUs) run over a few weeks, a relatively modest amount of compute in the context of most large state-of-the-art models used in machine learning today
Does this produce the various different foldings that each protein can often "sit" in?
Can it take temperature and other environmental conditions into account?
Can you specify that a particular ligand or electrical current is present so that you can see the resultant shape change?
Is all the source code for this available so that other scientists can build on top of this, or will we have to go through a paid or SaaS google API to use it?
Fascinating work. I wonder if this approach works to model interactions (no reason it shouldn’t).
The interactions of proteins with other proteins, as well as with molecules like lipids, water and electrolytes, form the basis for cellular processes. If that can be inferred correctly, you are looking at the building blocks of a “human simulator”.
Very interesting. However, the problem now becomes how to characterize such machine learning approaches. With traditional simulation methods, the authors can usually explain easily in which situations a specific approach is good or bad; with neural networks we don't really have a good way to analyze the quality of the prediction.
There's something I don't understand about protein shapes. There are tons of software solutions – on the web and offline – to visualize the shape of proteins from their sequence of amino acids. How do these work then, if we don't know how the atoms might be arranged in space?
For example, this[1] is the code for SARS-CoV-2's Spike (S) protein. From what I understand of this page it's pretty short, only ~1,273 amino acids (corresponding to ~3,821 bases in RNA, at three bases per amino acid).
And here[2] is a visualization of it in 3D. You'll likely recognize the characteristic mushroom shape that's been portrayed in 3D models of SARS-CoV-2 in the media. How does this software work if there's no real way to tell how the protein is arranged?
Thanks for the answer! That explains it. I was looking just at the amino acid sequence and missing a whole lot.
I read the protein folding and X-ray crystallography articles on Wikipedia and they had most of the answers I was looking for. I also saw a request being made by this JavaScript 3D viewer to fetch the PDB (Protein Data Bank) file for the model, which is a text file with tens of thousands of lines describing the coordinates of atoms in space as well as their bonds and other structures. It even has some metadata about the way the data was collected.
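Since the PDB file really is just fixed-width text, pulling the atom coordinates out of it takes only a few lines. The sketch below is a minimal illustration, not a full parser: the `Atom` type and `parse_atoms` function are names made up here, and it only reads the `ATOM`/`HETATM` coordinate records using the column positions from the PDB format description, ignoring everything else (bonds, metadata, alternate locations, multiple models).

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    name: str            # atom name, e.g. "CA" for an alpha carbon
    residue: str         # three-letter residue name, e.g. "MET"
    chain: str           # chain identifier, e.g. "A"
    residue_number: int  # residue sequence number
    x: float             # orthogonal coordinates in angstroms
    y: float
    z: float


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Extract atoms from the ATOM/HETATM records of PDB-format text."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM  ", "HETATM")):
            atoms.append(
                Atom(
                    name=line[12:16].strip(),         # columns 13-16
                    residue=line[17:20].strip(),      # columns 18-20
                    chain=line[21],                   # column 22
                    residue_number=int(line[22:26]),  # columns 23-26
                    x=float(line[30:38]),             # columns 31-38
                    y=float(line[38:46]),             # columns 39-46
                    z=float(line[46:54]),             # columns 47-54
                )
            )
    return atoms
```

A 3D viewer does essentially this and then draws the points (plus bonds and ribbons) on screen; the hard part was always obtaining the coordinates in the first place.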
The article implies that the "ground-truth" (experimentally determined) structure has an accuracy interval as well. Above 90% is the same accuracy as what you get from experimentally determined results, hence the "solved" claim.
Solving the inverse problem would be even more valuable -- given a specific shape (and other biochemical desiderata), what sequence of amino acids would create that protein?
As hard as the protein folding problem is, the inverse problem is harder still. THAT is the one true grail.
We "solved" this at Google years ago using Exacycle. We ran Rosetta (the premier protein design tool) at scale. The visiting scientist (who later joined Google and created DeepDream) said it worked really well: "I could just watch a folder and good designs would show up as PDB files in a directory."
The protein folding problem is predicated on the idea that there is a ground truth (a single static set of atomic coordinates with positional variances). If your point is that even experimental methods can't truly reach 100% (either because of underlying motion in the protein, or because the structure can't be determined), that's more or less what Moult is saying (they more or less arbitrarily define ~1 Å resolution and a GDT of 90 as the "threshold at which the problem is solved").
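To make the "GDT of 90" number a bit more concrete, here is a rough sketch of how a GDT_TS-style score is computed: the percentage of residues whose predicted position lies within 1, 2, 4 and 8 Å of the experimental one, averaged over the four cutoffs. This is a simplification under a loud assumption: the two structures are taken to be already superimposed, whereas the real measure searches over superpositions to maximize each percentage. The function name is made up for this example.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def gdt_ts(predicted: Sequence[Point], reference: Sequence[Point]) -> float:
    """GDT_TS-style score (0-100) for two already-superimposed sets of
    per-residue coordinates (e.g. alpha carbons), given in angstroms.

    Assumption: no superposition search is done here, so this can only
    underestimate what a full GDT_TS computation would report.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty coordinate lists of equal length")
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    fractions = [sum(d <= c for d in distances) / len(distances) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0)] * 4
    predicted = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
    print(gdt_ts(predicted, reference))  # 56.25
```

Note that a score of 100 only requires every residue to be within 1 Å, which is why the score is tied to experimental resolution rather than to "exact" coordinates.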
The title here is not merely breathless clickbait, it also has very little to do with the headline of the actual article, which is "AlphaFold: a solution to a 50-year-old grand challenge in biology".
I thought the #1 criterion for titles was that they should match the original if at all reasonable...?
Can it simulate two proteins interacting? I.e., search for the simplest protein sequence that a) doesn't affect folding geometry when paired with every known human protein and b) causes the greatest deviation when paired with covid-19 proteins?
My question exactly; or Rosetta @ home, or any of the other protein folding "@home"s. I participate in a few, but would gladly donate my compute resources elsewhere if this is no longer necessary.
I think this is the interesting part because there aren't going to be the same regulatory hurdles for using ribosomes to manufacture technology as there are for medicines. Synthetic organelles that weave fibers, build metamaterials, etc could lead to pretty magical advances in our capability.
Far from an expert here, but your comment makes me think of Michael Crichton's 'Prey', if you've not already read it.
Not that I wish to add to your apprehension.
Fascinating! AlphaFold (and other competitors) seem to use MSA (Multiple Sequence Alignment) and this (brilliant) idea of co-evolving residues to build an initial graph of sections of protein chain that are likely proximal. This seems like a useful trick for predicting existing biological structures (i.e. ones that evolved) from genomic data. I wonder (as very much a non-biologist), do MSA-based approaches also help understand "first-principles" folding physics any better? and to what degree? If I write a random genetic sequence (think drug discovery) that has many aligned sequences, without the strong assumption of co-evolution at my disposal, there does not seem any good reason for the aligned sequences to also be proximal. Please pardon my admittedly deep knowledge gaps.
> do MSA-based approaches also help understand "first-principles" folding physics any better?
Not really. MSA-based approaches, as most structure prediction methods, have as a goal to find the lowest energy conformation of the protein chain, disregarding folding kinetics and basically all dynamic aspects of protein structure.
> If I write a random genetic sequence (think drug discovery) that has many aligned sequences, without the strong assumption of co-evolution at my disposal, there does not seem any good reason for the aligned sequences to also be proximal.
I don't think I fully understood this, but I'll give it a shot anyway. If your artificial sequence aligns with others, there's a chance that it will fold like them, depending on the quality and accuracy of the multiple sequence alignment. Since multiple sequence alignments are built under the assumption of homology (all sequences have a common ancestor), it's a matter of how far from the "sequence sampling space" your sequence is located compared to the others.
> I don't think I fully understood this, but I'll give it a shot anyway. If your artificial sequence aligns with others, there's a chance that it will fold like them, depending on the quality and accuracy of the multiple sequence alignment. Since multiple sequence alignments are built under the assumption of homology (all sequences have a common ancestor), it's a matter of how far from the "sequence sampling space" your sequence is located compared to the others.
I understand that similar sequences may fold similarly (although as length increases, I highly doubt it, but IDK). I'm talking about aligned sub-sequences within one chain and their ultimate distance from each other in the final structure. Co-evolution suggests that aligned sub-sequences are also proximal. But manufactured chains did not evolve, therefore the assumption is no longer useful.
Oh, I see! Yes, an intrachain alignment of an artificial sequence does not by itself give any information about co-evolution, especially since you don't know whether your protein is actually folding. To assess co-evolution you need a multiple sequence alignment between protein homologs containing correlated mutations.
> I understand that similar sequences may fold similarly (although as length increases, I highly doubt it, but IDK).
As long as the sequence similarity is kept between those sequences, length is not an issue.
> Co-evolution suggests that aligned sub-sequences are also proximal
What do you mean by "proximal"? Close in space, or similar in structure?
> To assess co-evolution you need a multiple sequence alignment between protein homologs containing correlated mutations.
That makes sense. So in the CASP competition, when teams are given a sequence, do their algorithms do something like the following?
1. Search database for homologs of given sequence
2. Look at MSA and correlated mutations of homologs
3. Look for similar correlated mutations in given sequence
I imagine 1-3 could somehow be embedded in a NN after training on a protein database.
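As a toy illustration of what "correlated mutations" in step 2 can mean, here is a sketch that scores a pair of alignment columns by their mutual information: columns that change together across homologs score high, columns that vary independently (or not at all) score near zero. This is a classic, very simple co-variation measure used here as an assumption for illustration; it is not a description of what AlphaFold or any CASP entrant actually computes, and the function name is invented.

```python
import math
from collections import Counter
from typing import List


def mutual_information(msa: List[str], i: int, j: int) -> float:
    """Mutual information (in bits) between columns i and j of an alignment.

    `msa` is a list of equal-length aligned sequences. No sequence weighting
    or background correction is applied in this sketch.
    """
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


if __name__ == "__main__":
    # Columns 0 and 1 always mutate together (A with K, D with E);
    # column 2 never changes, so it carries no co-variation signal.
    toy_msa = ["AKL", "AKL", "DEL", "DEL"]
    print(mutual_information(toy_msa, 0, 1))  # 1.0
    print(mutual_information(toy_msa, 0, 2))  # 0.0
```

The intuition behind using such a signal for structure is that a pair of positions in contact tends to need compensating mutations, so strong co-variation hints at spatial proximity.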
> What do you mean by "proximal"? Close in space, or similar in structure?
This is a really insightful question and I need to take some time to fully understand the ensuing discussion.
If my speculation is correct, then drug discovery should use a process of genetic programming, using something like this to score the resulting amino acid sequences. I'm wondering if an artificial process of evolution would be sufficient to satisfy the co-evolution assumption here.
"It has occurred decades before many people in the field would have predicted. It will be exciting to see the many ways in which it will fundamentally change biological research."
Assuming optimistic further progress, what are the implications of accurately predicting protein folding? What are we hoping to discover, or succeed in doing?
I am puzzled about “AI-knowledge”. Have we really learnt anything? Is distilling the knowledge from AlphaFold just as hard a problem as solving protein folding?
I feel like DeepMind has a disproportionately large scientific impact relative to its resource pool. How would one (or a group) go about replicating its success?
I think the key to replicating the success is deploying deep learning effectively. But I would argue that DeepMind's resource pool is immense; it's backed by Google. GPUs (and more advanced TPUs) are in abundance... not to mention the many brilliant PhD scientists who work there.
Accurate modelling on such a detailed level becomes intractable due to long time scales needed for folding, and the presence of forces that are not adequately described at the "ball and spring" level of abstraction that molecular mechanics simulation usually employs.
It's better to abstract everything away by a neural net, apparently...
Is this going to put biologists that study this problem (and there are a lot of them, right?) out of business?
This is the tipping point where I think the AI singularity may have teeth. I could see math proofs being the next thing to fall. If AI solves the remaining six millennium problems in the next few years, what does that mean for math researchers?
In short - certain ones, yes. This should be one step (that was a bottleneck) in helping a company with a fixed budget do an order of magnitude more 'experiments' with the same amount of resources. Lab resources are expensive and fixed, so if you can pre-compute what you need, you can get right to the more powerful results.
We design proteins for immunotherapies - this kind of thing would help us more rapidly design our proteins (and more efficiently use our wet-lab resources to speed existing projects). For others, some drugs are hard to build without knowing how they will interact - this could both provide new 'targets' to go after and help prevent projects that would otherwise accidentally target an important protein.
Realistically speaking, if you are a scientist who could use this and you mailed DeepMind, they would probably run it for free and send you the result. It would be good PR.
After reading this I can't stop thinking about a possible future where we have predicted all possible permutations of different diseases and created vaccines for them. Maybe our kids one day will get an all in one vaccine preventing all viral and bacterial disease.
As so many times recently, the HN crowd proves to be completely clueless and uneducated when it comes to AI. This is a miracle. It is THE achievement we'll remember from the past decade when it comes to AI. If you don't understand why, I recommend learning and reading. The level of ignorance, and often proud ignorance, here is frightening to me. People who downplay this are either stupid in biochemistry or AI or both; please don't listen to them. This right here is the single biggest news of 2020.
If there was a headline, "Company X with Product Y cured cancer" and it turns out that product Y actually only cured 90% of cancer, I'm pretty sure most people would be happy with the headline.
Oh, and to be a true parallel example, in this case the remaining 10% of cancers might not even be cancers: experimentally determined protein structures are only ~90% accurate, so the model could very well be more accurate than our current ability to experimentally determine protein structure.
I really interpreted that headline as "found a general solution to the protein-folding question", not as the also interesting, but less so, "can be used to solve protein-folding problems".
No, they are very crude (but useful!) models of reality. General relativity and quantum electrodynamics are much better corresponding models, respectively, and even those are just approximations.
If you can explain how gravity works at a quantum level, you'd deserve a Nobel. It's not 100% solved; Newton's Laws of Motion are a model, not a solution. Just like the vast majority of science.
But experimental methods have not solved protein folding either. AlphaFold hasn't solved protein folding, but I can't wait to see their progress for AlphaFold 3.
What would be informatively useful would be to know how much accuracy is needed on average for drug engineers. I'd say that 99% is more likely to be the minimum to make solid inferences.
> but experimental methods have not solved protein folding either.
I might be missing something here, but isn't "experimental methods" just shorthand for "our best knowledge of a protein's structure, obtained via NMR or X-ray crystallography"? In that case, I'm not sure what "solving" protein folding even means - literally zero mean error? We can't know/solve anything beyond our best knowledge, that's tautological.
> What would be informatively useful would be to know how much accuracy is needed on average for drug engineers.
Yeah that would be interesting, but:
> I'd say that 99% is more likely to be the minimum to make solid inferences
It's pretty clear what solving means, it means to have an exact representation of the 3D structure. Our partial knowledge obtained from such techniques is what it is: partial. We need new metrology that increases observational accuracy and completeness, OR better deterministic models from sequences.
"We can't know/solve anything beyond our best knowledge, that's tautological." yes it is indeed tautological if you assume that experimental methods can't get better then guess what? It follows that they can't get better!
"what are you basing this on?" on nothing solid, that's why I say it would be interesting.
A 1% error rate (99% accuracy) is non-negligible, given that proteins generally don't have a very high atom count and that a protein will be produced an enormous number of times; the 1% error then propagates and can, a priori, easily break the system.
But this guess is not solid as I'm not an expert.
A 1% error is significant for simple (low atom count) proteins, but could be negligible for proteins with a very high atom count.
> It's pretty clear what solving means, it means to have an exact representation of the 3D structure.
That's not clear at all, because perfect measurement doesn't exist. I agree that improving is always a worthy goal, but clearly we don't need 100% accuracy to consider something "solved" for the purposes of science. Also, "3D structure" of a protein is not a fixed truth, the parts are in motion all the time and may even have multiple semi-stable conformations. Rather than focusing on X,Y,Z perfection, I would imagine getting the angles between bonds, or the general topological conformation right would be more valuable.
> if you assume that experimental methods can't get better ...
I'm saying that if your definition for "solved" is "perfect knowledge", then we might as well not discuss whether method X or Y solves the problem, because they obviously do not.
The more I think about it, the more I think we should just drop the whole debate over the word "solved". Clearly different experiments and different proteins will have different requirements which may or may not be met by this or by other techniques - I agree that I would be interested to hear an expert weigh in on those requirements.
Is this immune to things like adversarial examples? E.g. will we get a situation where we flip one nucleotide or amino acid, and suddenly AlphaFold is making completely incorrect predictions?
Glad to see AI is progressing beyond annoying customer support chatbots and marketing tools. At this rate it will predict the covid pandemic any time now.
Let's imagine that as a researcher I make a breakthrough NN model, but that I need a lot of TPUs/GPUs in order to test it. Is there a service for temporarily lending such hardware to me for free or for not much (e.g. Google Colab)?
Otherwise researchers will plateau with their hardware budget.
According to [1], they must release enough information for others to replicate the AI model: "As a condition of entering CASP, DeepMind—like all groups—agreed to reveal sufficient details about its method for other groups to re-create it. That will be a boon for experimentalists, who will be able to use accurate structure predictions to make sense of opaque x-ray and cryo-EM data."
I suspect DM will sell this as a service, especially to corporations like pharmas who create small molecule drugs. If their method works as advertised, it may rejuvenate the flagging prospects of Rational Drug Design, the guiding R&D drug development methodology behind most new molecular entities (drugs) for the past ~25 years, which has not proven to be the clear economic win that had been hoped.
Whenever DeepMind comes up with something like this, my first instinct is to say "yay for humanity" ... then I remember who they work for, and the second instinct is to say "Ah. Crap."
Far from an expert on complexity theory, but NP-hard problems can be approximated in polynomial time. With Deep Learning you are doing approximation. So this is nothing ground breaking in that respect.
That actually isn't totally true. Approximation algorithms, in the formal sense, require a guarantee that they perform within X of the optimal solution. Not all NP-hard problems have polynomial-time approximations, and the methods shown here are likely not approximations in that sense, because they very likely provide no guarantees on performance.
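To make the "guarantee" point concrete, here is a minimal sketch of a textbook approximation algorithm in the formal sense: the maximal-matching heuristic for minimum vertex cover (an NP-hard problem), whose result is provably at most twice the optimal size. The function names and the tiny example graph are illustrative assumptions, not anything from the AlphaFold work; the brute-force reference is exponential and only there for comparison on toy inputs.

```python
from itertools import combinations


def approx_vertex_cover(edges):
    """Greedy maximal matching: take both endpoints of every uncovered edge.

    The chosen edges form a matching, and any cover needs at least one
    endpoint per matched edge, so the result is at most 2x the optimum.
    """
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover


def optimal_vertex_cover(edges):
    """Exact answer by brute force (exponential; toy graphs only)."""
    vertices = sorted({x for edge in edges for x in edge})
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            chosen = set(subset)
            if all(u in chosen or v in chosen for u, v in edges):
                return chosen
    return set(vertices)


def is_cover(edges, cover):
    """True if every edge has at least one endpoint in `cover`."""
    return all(u in cover or v in cover for u, v in edges)


if __name__ == "__main__":
    star = [(0, 1), (0, 2), (0, 3), (0, 4)]  # hypothetical example graph
    approx = approx_vertex_cover(star)
    best = optimal_vertex_cover(star)
    print(len(approx), len(best), len(approx) <= 2 * len(best))
```

The contrast with a trained network is that the factor-of-two bound here holds for every input by proof, not just on inputs resembling a benchmark.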
I think I'm almost as uninformed as you, but I believe it comes down to the difference between perfect solutions and close enough solutions. Consider the classic NP problem, the traveling salesman problem.
"[Modern heuristic and approximation algorithms] can find solutions for extremely large problems (millions of cities) within a reasonable time which are with a high probability just 2–3% away from the optimal solution." [0]
When close enough is enough, NP problems can often be solved in P time, and I suspect this is one of those cases. For crypto however, close enough is not enough.
No. It really is just heuristic building. A core problem with using ML in this sort of use case is that it is often brittle. Once it gets outside of the context it was trained in, it may or may not be able to generalize its training to new contexts. We may have difficulty knowing when it is very wrong.
I think ML in research science could be viewed as a very good intuitive oracle. Even if it is right 95% of the time, you still have to do the work of proving the result the long way every time, because that 5% matters. The real utility is in "scanning the field" to better focus research on things likely to bear fruit.
I think that this is a heuristic "near optimal" method rather than an exact analytic method (I have little to no idea of what that would be in protein folding). A domain I do understand a bit, and which is NP-hard, is the travelling salesman problem. Computing an exact solution is unrealistic, but doing heuristic searches that get you to 99% of the optimal 99% of the time is relatively doable.
But you don't know that you are 1% from the solution, even if you are pretty confident that you are. It's possible (if unlikely) that you are way off the optimal, but if you have a decent solution, that's OK.
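As a rough illustration of the heuristic-versus-exact distinction for the travelling salesman problem, here is a minimal sketch: a nearest-neighbour tour improved with 2-opt, compared against brute force on a tiny random instance. The instance size, the seed, and all function names are assumptions made for the example. Note that the gap to the optimum can only be measured here because the instance is small enough to solve exactly; on large instances the heuristic alone gives no such certificate.

```python
import itertools
import math
import random


def tour_length(points, order):
    """Length of the closed tour visiting `points` in `order`."""
    n = len(order)
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % n]])
        for i in range(n)
    )


def nearest_neighbour(points):
    """Start at city 0 and always move to the closest unvisited city."""
    unvisited = set(range(1, len(points)))
    order = [0]
    while unvisited:
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(last, points[j]))
        unvisited.remove(nxt)
        order.append(nxt)
    return order


def two_opt(points, order):
    """Reverse tour segments while doing so strictly shortens the tour."""
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(order) - 1):
            for j in range(i + 1, len(order)):
                candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
                if tour_length(points, candidate) < tour_length(points, order) - 1e-12:
                    order = candidate
                    improved = True
    return order


def brute_force(points):
    """Exact optimum by trying every permutation (factorial time)."""
    rest = range(1, len(points))
    best = min(
        itertools.permutations(rest),
        key=lambda perm: tour_length(points, (0,) + perm),
    )
    return [0, *best]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    cities = [(rng.random(), rng.random()) for _ in range(8)]
    heuristic = two_opt(cities, nearest_neighbour(cities))
    optimal = brute_force(cities)
    ratio = tour_length(cities, heuristic) / tour_length(cities, optimal)
    print(f"heuristic / optimal tour length: {ratio:.4f}")
```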
NP-hard doesn’t say how hard it is to solve finite problems. Even for n = 1,000,000, O(e^n) isn’t necessarily problematic, if the constant is small enough, or if you throw enough hardware at it.
This “uses approximately 128 TPUv3 cores (roughly equivalent to ~100-200 GPUs) run over a few weeks”. That is a moderate amount of hardware for this kind of work, so it seems they have a more efficient algorithm.
Also, this algorithm doesn’t solve protein folding in the mathematical sense; it ‘just’ produces good approximations.
> But wasn't protein folding supposed to be NP-hard?
Yeah, at least some variations of it are NP-hard. SAT is THE NP-complete problem, but there are some really good SAT solvers around. This basically means: They have a solution that mostly does very well on most instances. But because (probably) P != NP, you will never have a polynomial time algorithm for this.
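For readers who have not seen one, here is a minimal sketch of the classic DPLL procedure (unit propagation plus backtracking) that SAT solvers grew out of. It is a toy under stated assumptions: clauses are lists of non-zero integers in the DIMACS style (`-3` means "not x3"), the function names are made up for this example, and production solvers add much more (clause learning, restarts, branching heuristics). The worst case is still exponential, which is consistent with the point above: good behaviour on most instances, no polynomial guarantee.

```python
def dpll(clauses, assignment=None):
    """Return a satisfying assignment {var: bool}, or None if unsatisfiable."""
    assignment = dict(assignment or {})
    clauses = [list(clause) for clause in clauses]

    # Unit propagation: simplify until no single-literal clause remains.
    while True:
        simplified = []
        unit = None
        for clause in clauses:
            # A clause with a true literal is satisfied and can be dropped.
            if any(assignment.get(abs(lit)) == (lit > 0) for lit in clause):
                continue
            # Literals on assigned variables are false here; keep the rest.
            remaining = [lit for lit in clause if abs(lit) not in assignment]
            if not remaining:
                return None  # every literal is false: conflict
            if len(remaining) == 1 and unit is None:
                unit = remaining[0]
            simplified.append(remaining)
        clauses = simplified
        if unit is None:
            break
        assignment[abs(unit)] = unit > 0

    if not clauses:
        return assignment  # all clauses satisfied

    # Branch on the first variable of the first open clause.
    var = abs(clauses[0][0])
    for value in (True, False):
        result = dpll(clauses, {**assignment, var: value})
        if result is not None:
            return result
    return None


def satisfies(clauses, assignment):
    """Check that each clause contains at least one true literal."""
    return all(
        any(assignment.get(abs(lit)) == (lit > 0) for lit in clause)
        for clause in clauses
    )


if __name__ == "__main__":
    formula = [[1, 2], [-1, 3], [-2, -3]]  # hypothetical example formula
    model = dpll(formula)
    print(model, model is not None and satisfies(formula, model))
```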
There is probably a team at DeepMind working on cracking simple crypto.
Problem is, it can be difficult to cast the problem properly/“correctly”. How does a one-way function get represented?
I hate headlines like “X has solved Y.” How often have we seen computer vision and natural language solved at this point, whenever a model does well enough in a benchmark? Their own article doesn’t even have that headline. This is a massively cool thing that’s happened. Why ruin it with a massively hyperbolic headline?
Because only the experts in this field get to tell us, the laymen, what "solving the protein folding problem means", and they defined it not as "perfect" but as "more than good enough to be acceptable as correct result". Which this did.
X has actually solved Y. That's not so much "massively cool", that's historical.
I think the “they” you’re referring to is only whatever PR person wrote the headline. Nowhere in the substance of this (PR!) post does it refer to it as anything but a great leap. When an expert in the field outside of DeepMind says protein folding has been solved, I’ll believe it.
No, they didn't. They approximated a solution to protein folding.
The two are different concepts -- this isn't the typical HN pedantry.
"Solving" the problem would entail developing an interpretable algorithm for taking a string of amino acids and determining the 3D structure once folded.
Approximating a solution would entail simulating that algorithm, which is what their neural network is doing. It is of course usually accurate, but you would expect this with any suitable universal function approximator.
Props to DeepMind and congrats to CASP but is it not obvious that this is more hype-rhetoric for public consumption?
The distinction you're making between "solved" and "closely approximated" makes logical sense to me. However, if I'm interpreting the AlphaFold results correctly, this distinction isn't practically significant, right?
If you can approximate an algorithm with error that is "below the threshold that is considered acceptable in experimental measurements" (to quote another HN comment), then you have something as good as the algorithm itself for all intents and purposes.
Therefore the use of the word "solve" doesn't qualify as hype-rhetoric, and the distinction you're making does seem somewhat pedantic (even if technically true).
(I'm speaking as someone with only the tiniest amount of stats/ML experience, so I could be totally wrong!)
It might be the case that the relevant, practical threshold now tightens. For example, perhaps it is easier to experimentally verify a protein shape predicted by an algorithm than it is to experimentally determine the protein shape?
“The organizers even worried DeepMind may have been cheating somehow. So Lupas set a special challenge: a membrane protein from a species of archaea, an ancient group of microbes. For 10 years, his research team tried every trick in the book to get an x-ray crystal structure of the protein. ‘We couldn’t solve it.’”
“But AlphaFold had no trouble. It returned a detailed image of a three-part protein with two long helical arms in the middle. The model enabled Lupas and his colleagues to make sense of their x-ray data; within half an hour, they had fit their experimental results to AlphaFold’s predicted structure. ‘It is almost perfect,’ Lupas says.”
Exactly. Even an incomplete map with somewhat limited resolution makes navigation a hell of a lot easier than flying blind. This is effectively a data reduction solution: if you have a fuzzy shape of the thing you are trying to model, and you learn the mechanics better with each thing you model, your ability to quickly and accurately reach a goal improves.
> "Solving" the problem would entail developing an interpretable algorithm
It looks like you'd like a grokable solution, but the problem might be just too complex to grasp for the human brain. "Solved" means they solved the protein puzzles on the official benchmark.
> but you would expect this with any suitable universal function approximator
Yeah, it's just that easy. Function approximator, engage! It took a team of DeepMind researchers, two years and God knows how much compute. The universal function approximation theorem also doesn't say how to find that network.
"Pedantry" implies that the distinction is not meaningful.
This is true if you're only paying attention to how this system can be utilized to answer questions posed to it.
This achievement by itself, however, does not do much to push the science of protein folding much further. Those advances will come when people poke, prod, and break the model to develop a unified theory for protein folding.
The "science" of protein folding has a primary goal: to predict the structure of a protein given its constituent parts.
This is what AlphaFold does, and it's been verified to produce results at an apparent accuracy at or above something like X-ray protein crystallography. The advances will come, after these results are validated and accepted by the scientific community as a whole, simply when groups start using this technique to immediately access the structure of proteins that in the past would have been prohibitively expensive and time-consuming, or downright impossible, to access, and then use that knowledge to do their work.
You seem to think the first thought a researcher will have after this becomes widely available is, "Oh hey, I can now accurately predict the shape of an arbitrary protein which unlocks untold potential scientific progress on numerous scientific fronts, but the thing I want to spend my time on is trying to replicate the results of the network myself, so I can do it manually thousands of times slower...", which is patently inane.
This is exactly right. It's like saying you solved chess because for each configuration of pieces on the board you can use machine learning to predict whether that position can be achieved with valid chess moves. With 90% accuracy.
https://news.ycombinator.com/item?id=25253488&p=2
We changed the URL from https://predictioncenter.org/casp14/zscores_final.cgi to the blog post, which has more background info.