AlphaFold: a solution to a 50-year-old grand challenge in biology

dang · on Nov 30, 2020

All: there are multiple pages of comments; if you're curious to read them, click More at the bottom of the page, or like this:

https://news.ycombinator.com/item?id=25253488&p=2

We changed the URL from https://predictioncenter.org/casp14/zscores_final.cgi to the blog post, which has more background info.

kovek · on Nov 30, 2020

I've seen you mention this [More] comment a few times now. I like it, though what if you change the design of the More functionality?

dang · on Nov 30, 2020

Yep, the intention is to change the design by getting rid of it. HN used to just render entire threads in one go, and once we release some performance improvements we hope to do that again.

jvolkman · on Dec 2, 2020

In the meantime, how about just placing the More link at the top of the comments section in addition to the bottom so it stands out better?

nathancahill · on Nov 30, 2020

Also, what do the traffic stats look like for the second/third pages of big threads like this one? Pretty steep falloff?

dang · on Nov 30, 2020

I assume so, but haven't looked recently. I'll try to do that and report back here later. Feel free to ping me at hn@ycombinator.com if I forget.

Edit: ok, for this thread so far, 95% of views are page 1, 4% are page 2, 1% are page 3.

For https://news.ycombinator.com/item?id=25065026, which had a "more pages" comment at the top: 93% of views were page 1, 5% page 2, 2% viewed page 3.

For https://news.ycombinator.com/item?id=23155647, which did not have a "more pages" comment at the top: 96% of views were page 1, 3% page 2, 0.5% page 3.

Radically overgeneralizing from that, it seems likely that the pinned comment at the top helps a bit in terms of directing people to later pages. How that compares to the mammoth-single-page scenario is hard to say because we don't know how many readers would be scrolling down that far to see those comments. There's likely a power-law dropoff no matter what we do.

kovek · on Dec 1, 2020

If you are fine serving more data, you could trigger a "More" automatically when the user minimizes a comment thread, and add the "More" comments to the bottom of the page?

Also, to encourage people exploring "More" comments, maybe some comments inside the 2-3 megathreads showing in the first page can be minimized at first?

I wonder what are the thoughts on these design. I really like how Hacker News is designed, thank you!

dang · on Dec 2, 2020

My thoughts: the first suggestion is too complex for HN and would amount to a sort of infinite scroll, which users here would probably hate (many have said so pre-emptively!). The second suggestion is probably a good idea. I worked on it at one point but it turned out to be a little harder to get right than I expected. Will probably return to it.

thawab · on Nov 30, 2020

unrelated but the reason i opened the link was because the .cgi file. It's been a very long time since i visited one. The page uses a 2006 YUI library and a 2010 jquery version. The amazing part is it still works in Firefox.

cs702 · on Nov 30, 2020

Two years ago, after DeepMind submitted its first set of predictions to CASP (Critical Assessment of protein Structure Prediction), Mohammed AlQuraishi, an expert in the field, asked, "What just happened?"

https://moalquraishi.wordpress.com/2018/12/09/alphafold-casp...

Now that the problem of static protein structure prediction has been solved (prediction errors are below the threshold that is considered acceptable in experimental measurements), we can confidently answer AlQuraishi's question:

Protein Folding just had its "ImageNet moment."

In hindsight, AlphaFold v1 represented for protein structure prediction in 2018 what AlexNet represented for visual recognition in 2012.

xral · on Nov 30, 2020

AlQuraishi's tweet [0] about this:

> CASP14 #s just came out and they’re astounding—DeepMind looks to have solved protein structure prediction. Median GDT_TS went from 68.5 (CASP13) to 92.4!!!! Cf. their 2nd best CASP13 struct scored 92.8 (out of 100). Median RMSD is 2.1Å. I think it's over https://predictioncenter.org/casp14/zscores_final.cgi

[0]: https://twitter.com/MoAlQuraishi/status/1333383634649313280

elwell · on Nov 30, 2020

> https://predictioncenter.org/casp14/zscores_final.cgi

`.cgi`... we've come full circle

matsemann · on Nov 30, 2020

What does that Å mean? Never seen our letter been used in a scientific context.

flobosg · on Nov 30, 2020

Ångström, a length unit. 1 Å = 0.1 nm.

seslattery · on Nov 30, 2020

It's the symbol for Angstrom, a unit of length 10^-10m https://en.wikipedia.org/wiki/Angstrom

kolinko · on Nov 30, 2020

0.1nm - approximately a size of an atom - used in organic chemistry often.

adw · on Dec 1, 2020

Standard distance measure in most atomic-scale condensed-matter fields. Certainly inorganic crystallography/materials science/condensed matter physics.

softwaredoug · on Nov 30, 2020

> I don’t think we would do ourselves a service by not recognizing that what just happened presents a serious indictment of academic science.

Much like other fields, I do begin to question the academic structure to making advances. It appears something is rotten in the state of academia. Oddly it's academia doing incremental improvements to existing methods but industry making novel leaps and bounds... The other major case in point being NLP

mensetmanusman · on Nov 30, 2020

Academia is for generating problem solvers. Teams are small and made of people who will be there for around 5 years.

A better comparison would be to national labs, but they are tasked with projects that make no sense for industry to tackle.

The system is working as intended, all players are needed. The team at Alphafold busted their chops in academia and went on to working on problems they could spend decades on.

mnky9800n · on Nov 30, 2020

People seem to forget that you need a system like academia that's allowed to fail. Most companies aren't allowed to fail when they need to have quarterly returns. Of course academia has become more and more competitive. But tbh I think the answer is that the funding hasn't increased equally with the number of quality people who could stay in academia. But who knows.

Jach · on Dec 1, 2020

Stability "like academia" is rich, given all we've heard about "publish or perish". Modern academia is a poor fit for increasingly any case you can think of besides maintaining the status of academia. But sure, there needs to be some stability and ability to "fail"/i.e. produce something worthless. Corporate research departments provide this -- if they didn't, they wouldn't have a research department and indeed many don't, nor do they need to, but this has little to do with quarterly returns.

We've also seen a rise of VC-backed research startups (like DeepMind but many others) whose value proposition (to the VC) only makes sense if the goal is to demonstrate a research capacity to get them bought out by a big company, or as a moonshot to out-compete them on an actual product made possible by the research. Investing in these little research startups themselves is also giving companies a way to push research without having to deal with having the researchers as direct employees, and I'm sure makes some of the startup employees feel a bit safer since there's a separation of money and operation influence. One similarity with modern academia is it selects for those who can do good work but who are also good bureaucrats (write grant proposals well, advising politicians, etc), startups have a selection for good work + good at courting VCs. But the startup just needs a few of them, then they can hire people who just want to do good work.

Another thing that makes corporate even better is they can occasionally spin off research developments into products, they can have some nice advances that only come when you try to productize, and among other reasons by not having to bother with external publishing (which takes time + fights with lawyers and business people) they can routinely be 10+ years ahead of whatever the state of the art in academia is.

calf · on Dec 1, 2020

That's all nice in theory, but what's the compelling empirical evidence for corporate science vs academic research? A professor might conversely argue the open nature of scholarship and freedom of inquiry as being essential to basic science, and capitalist businesses fundamentally cannot provide that. So it goes back to empirical support. And last I checked, companies still need a pool of trained PhDs to choose from, and those come from academia, for good reason.

Jach · on Dec 2, 2020

Elsewhere in the thread makes plenty of cases for corporate advances, even (or perhaps especially?) in the 20th century with e.g. Bell Labs et al. I think the empirical results are pretty good for corporate science.

As for 'needing' PhDs, I'm not sure. Having some can be convenient, yes, but in many cases not necessary. In some fields the only way to get caught up (i.e. no corporation will train you directly) may be with academic foundations but is a PhD necessary or just some relevant graduate work?

As an example N=1 to show a PhD is not needed always, Jeff Jonas formerly held the title of IBM Chief Scientist where he did some state of the art work in entity resolution. He didn't even finish high school.

calf · on Dec 3, 2020

That's an interesting way to make selection bias sound like "pretty good"; I disagree on the face of it, even if I am open to considering the idea of completely doing away with academia. I don't know and am skeptical, but intuitively, the day that would happen is the day that Nobel prizes are routinely awarded to FAANG companies and not to academics.

Further, since your original position was strongly that academia is useless, it is your onus to back up the implication that PhDs are not (generally) needed/useful, and using "well hmm, not sure if absolutely necessary" logic is fallacious and clouds the issue.

qqqwerty · on Dec 2, 2020

We are in a period of historically low interest rates. If and when interest rates rise, these moonshot research startups will get dropped like a hot potato.

peacefulhat · on Dec 2, 2020

Why should they rise? Maybe they should stay low if it is so productive.

vkou · on Dec 2, 2020

Because in addition to funding moonshot project plays, low interest rates also fund a lot of really stupid investments that should never have been funded, and that go belly-up at the first sign of a tightening fiscal market.

When too many of those really stupid investments go belly up at once, it's called a bubble popping, and it is catastrophic to the economy.

SiempreViernes · on Dec 2, 2020

The article then goes on to describe a not very general set of circumstances

> But in part due to the canonicalization of CASP, protein structure prediction effectively has a two-year clock cycle, where separate research groups guard their discoveries until after CASP results are announced.

and further noting

> As I discussed earlier, it is clear that between the Xu and Zhang groups enough was known to develop a system that would have perhaps rivaled AlphaFold.

Finally, and rather crushingly for your thesis, is the points made about the real industrial groups:

> What is worse than academic groups getting scooped by DeepMind? The fact that the collective powers of Novartis, Pfizer, etc, with their hundreds of thousands (~million?) of employees, let an industrial lab that is a complete outsider to the field, with virtually no prior molecular sciences experience, come in and thoroughly beat them on a problem that is, quite frankly, of far greater importance to pharmaceuticals than it is to Alphabet. It is an indictment of the laughable “basic research” groups of these companies, which pay lip service to fundamental science but focus myopically on target-driven research that they managed to so badly embarrass themselves in this episode.

jboggan · on Dec 1, 2020

I used to work in one of the top labs doing protein folding, in fact I recognize some lab mates who survived to have their own labs in this year's CASP top ten ranking. Something is a bit rotten though because I'd guess about half of us burned out of academia entirely, and the state of the art at the time was regrettable. I remember one model doing rather well except the weights of the various physical forces were wrong, so much that some signs were negative and what should have been electrostatic repulsion was actually electrostatic attraction between similar charges. This was 10 years ago of course, but all of the field improvements were relatively incremental and small until AlphaFold came along.

Modern academia exists to further modern academia, no more and no less. I became disillusioned of any search for truth and progress during my time there, it was really just about showing up the Baker lab at the next CASP and getting the next round of grants secured.

softwaredoug · on Dec 1, 2020

This is exactly my point. There is something rotten in the research labs where society is not getting return on investment for basic research.

If the goal is to just produce researchers that can work at corporate research labs, then I feel we could get more bang for the buck.

If the goal is to do move research in the public good, something needs to change. Maybe it’s the fact there’s too little money out there, and it causes everyone to chase meager grant money. Or that there’s too much competition between groups. Or a million other reasons.

But I’d love to see it fixed and have more faith in public investment in basic science.

PaulDavisThe1st · on Dec 2, 2020

The problem is that people in academia and outside of it were saying the same thing in the 70s, 80s, 90s, 00's ... so if you really want to make the claim that "society is not getting return on investment for basic research", you need to claim with a straight face that this also applies to the last 50 years of academic basic research (at least). Alternatively, explain what has changed and when.

softwaredoug · on Dec 2, 2020

Citing increases in fraud and retractions, this article makes the case that as funding has increased, scientific quality has worsened. (Which is counter to what I guessed above!) Pursuit of scientific knowledge for its own sake replaced by obsession with grant cycles.

https://www.jamesgmartin.center/2020/01/the-intellectual-and...

But lots of other things have also changed that may or may not be causes:

- many fewer tenure track positions

- many more jobs in industry that require or value a 4 year or advanced degree

- obtaining a college degree as a right of passage seen as increasingly essential to you g adulthood

- the rise of data science: more jobs in industry that have access to lots of data and demand scientific rigor

- the rise of private cloud supercomputing (is deep mind) vs, say, a public university's cluster

- the obsession in some foreign countries of getting advanced degrees from American universities, creating essentially a guest visa workforce that is easily abused

- rise of Big Tech which has money to throw at things like protein folding

raxxorrax · on Dec 2, 2020

I think while funding has been increased overall due to more people in academia, the processes to acquire said funding got more complex so that a significant part of work is actually making sure the next grant can be secured. Similar situation in publishing. The research might only get a backseat.

That said, this is primarily a computational problem, so the advances here might not be applicable to basic research.

gwicks56 · on Dec 2, 2020

From my experience working as an engineer in Academia, there is a big problem in anything AI related, in that anyone decent can instantly increase their pay by 200 -1000% just by leaving for one of the big tech companies. Half the phD students don't even bother finishing, before being given an offer they can't refuse. How can academia compete with that, given sky high student debt, an extremely uncertain path to tenure etc etc

sampo · on Nov 30, 2020

> The other major case in point being NLP

Speaking of which, Google Translate was published in 2006, but when did the "learning from data" approach became an accepted idea in machine translation? I think the earlier attempts at machine translation were more about trying to codify grammar rules in software, than doing statistical learning from large text corpuses? I remember in 2002, the approach of leaning protein substructures from data was already the best performing approach in the protein folding problem.

Polygator · on Dec 1, 2020

Not really. Using a statistical approach to text modelling, specifically using Markov Chains, was proposed by Shannon in 1948. But yeah, there's a point in the 2000s where generative grammar/ symbolic approaches were pretty much left behind by NN methods.

When we discuss Google's input in NLP, the most important contribution is certainly the "Attention is all you Need" paper, which paved the way for BERT and GTP (Alphafold also uses Attention networks, btw)

adw · on Dec 1, 2020

> generative grammar/ symbolic approaches were pretty much left behind by NN methods

Which is the same thing as hand-engineered feature stacks being left behind in vision problems, really. The story in every field is more or less "you're not clever enough to engineer good features"; "you might be clever enough to define good symmetries for the feature space in which the features live... maybe" (convolutional neural networks in image problems); "... but maybe not even that" (attention mechanisms).

snovv_crash · on Dec 1, 2020

The hand generated features are still superior for SfM style problems, where the geometry is well defined but would need to be learned by the NN from scratch.

std_badalloc · on Dec 1, 2020

I think it was with the seq2seq paper of Sutskever, Vinyals and Le in 2014: https://arxiv.org/abs/1409.3215

People were doing a mix of learning from data and hand engineered solutions before this, but this was the first system learned end-to-end, afaik.

xbmcuser · on Dec 1, 2020

I think Netflix model of simplifying then translating should work better with internet forums and blog posts. Looking at how google works I was hoping someone in google wold adopt it and release a competing product against google translate https://arxiv.org/abs/2005.11197

C4stor · on Dec 1, 2020

The deepmind research team is essentially all PhD though, so it seems academia isn't doing such a bad job.

softwaredoug · on Dec 1, 2020

Specifically the linked article wonders about the research environment of academia compared to industry. Why teams of hundreds in academia with their own super computing resources couldn’t make the same advances. He posits there’s something not great going on about how academic research environments make advances, the poor incentive structures, the abuse and burnout of PhDs, the lack of open sharing of findings, the obsession with publication quantity over quality...

There’s a reason these PhDs at Google mind aren’t in academia after all doing the same work

C4stor · on Dec 2, 2020

Academia is doing a lot of advances every year. The fact it didn't make _this one_ is not really relevant to postulate that academia is inefficient.

It happens that the team at deepmind is apparently pretty damn good at deep learning problems, so they're going faster than matching academia labs.

It's not to say that academia has none of the problems you mentioned, but it's imo unreasonable to expect that, in a world where both public and private labs exist, only public ones would make advances.

krull10 · on Dec 2, 2020

I would say that is most likely due to a massively higher salary and no teaching responsibilities. Academia can’t compete on salary with industry in AI / data science.

codingslave · on Nov 30, 2020

Academia keeps employing people who have done well in classes and within fine bounds. Its a careerist track. Industry cares about results, its more meritocratic

rtsil · on Dec 1, 2020

> Industry cares about results, its more meritocratic

Industry cares about positive results. If you're not allowed to fail, you will be afraid to explore. That's what Academia is. Then, the industry reaps the fruit of that exploration, which is as it should be.

haihaibye · on Dec 1, 2020

High impact journals care about positive results and academia very much cares about journals.

You need to win grants to survive.

JPLeRouzic · on Dec 1, 2020

In my (one point) experience of industry R&D more projects failed than succeeded, but project leaders were not blamed.

How could one predict success or failure in risky project proposals? In both cases you learn something.

_8gfe · on Dec 1, 2020

If academia actually allowed failure we wouldn't be getting so many tiny incremental growth papers just for the sake of it as in deep learning and machine learning.

Ar-Curunir · on Dec 1, 2020

I wish people would stop making reductive and false statements. Life isn't this simple.

codingslave · on Dec 1, 2020

This is actually something I feel strongly about. The absent minded creative professor is the one who traditionally has made the breakthroughs. Recent years has instead seen the straight A student with no curiosity making it into programs, when they really have no business doing novel research and are better suited as orderly wage slaves

Ar-Curunir · on Dec 2, 2020

There's a lot of fanciful stuff about "creative" and "absent-minded" people doing the best work, but what actually makes a good researcher is the same as in any other field: (a) curiosity (b) determination, and (c) hard work.

PhD programs don't take people who just have a good GPA; you have to have a research record before you're even in the consideration. I've been on an admissions committee, so this is not conjecture.

codingslave · on Dec 3, 2020

Right, but a modern research record is about incremental improvement. The argument being that low hanging fruit is often picked, and so the incremental is natural. My argument is that far too many people are gaming the academic system, using it as a form of status credentialing, which is hurting true academic research.

calf · on Nov 30, 2020

Is this a scientific advance or a technological one though? Academia doesn't have the capital like industry or government to implement the latter. In America it's small small groups of young students led by a professor. Not full grown PhDs with Google levels of staff and money.

dwheeler · on Dec 1, 2020

I would claim that this is a technological advance that is likely to lead to many scientific advances.

Perhaps a good analogy are the inventions of the microscope and telescope. They were advances in technology, which then led to advances in science. I don't know if this will have the same effect as the microscope and telescope, but it would be great if it did. It certainly seems extremely promising.

adwn · on Dec 1, 2020

> Is this a scientific advance or a technological one though?

Where's the difference?

(this is a genuine question, I'm not trying to trick you)

segfaultbuserr · on Nov 30, 2020

> something is rotten in the state of academia. Oddly it's academia doing incremental improvements to existing methods but industry making novel leaps and bounds... The other major case in point being NLP

You have to realize that corporate research labs had a high level of recognition back in the 20th century. Labs like the Bell Labs, the RCA Laboratories, or the IBM Research, privately-funded, had a reputation that met or exceeded the standard of not-for-profit or public-funded academic research institutions. They made some of the most important discoveries in the electronics industry of 20th century, like the point-contact transistor, the MOSFET, VLSI, or the UNIX operating system. They were considered a part of the academia, many scientists were their employees. It's only the 1980s after their death that people had the impression that "important research must come from academia, industry is for incremental changes." So, I'd argue that the division between industry and academia is large, but actually smaller than people's perception. If you consider privately-funded researches by the industry as a part of the academia, the current situation is totally normal, nothing unusual.

Interestingly, for those labs to exist, being a monopolistic megacorp is a requirement. It appears to me that today's FAANG monopoly allowed the creation of Google Deepmind and OpenAI, perhaps it's simply a beginning of the repetition of history.

The article The death of corporate research labs had an interesting review. I highly recommend to read the article:

* The death of corporate research labs

> https://blog.dshr.org/2020/05/the-death-of-corporate-researc...

(HN comment: https://news.ycombinator.com/item?id=232466722)

To summarize, those great labs existed and made great contributions because of (1) corporate monopoly on the industry, and (2) the pressure from anti-trust laws. First, due to monopoly, the gigantic size allowed the labs to be the center of gravity and to concentrate all talents and projects into a single place, with a huge research budget for basic research. Second, the pressure from anti-trust laws also forced corporations to invent more into basic research to grow the business, because mergers and acquisitions were restricted. In some cases, the pressure from anti-trust laws also made the corporate labs to share their discoveries in a more open manner, examples included advances in semiconductor [1], or the Unix source code.

Note: but as HN comments pointed out, somewhat ironically, the success of corporate labs relies on anti-trust pressures, but not the actual monopoly-busting enforcement. The breakup of Bell caused the death of the Bell Labs.

Finally their decline,

> The more relaxed antitrust environment in the 1980s, however, changed this status quo. Growth through acquisitions became a more viable alternative to internal research, and hence the need to invest in internal research was reduced.

And it turns out that managing a corporate research labs without losing money is a tricky problem to solve. If the researches are too goal-oriented, short-termism will dominate, basic research in the labs will be ignored. Thus, basic research in the lab must be independent. However, a lab too isolated from the business can also cause great loss.

> Research in corporations is difficult to manage profitably. Research projects have long horizons and few intermediate milestones that are meaningful to non-experts. As a result, research inside companies can only survive if insulated from the short-term performance requirements of business divisions. However, insulating research from business also has perils. [...] Walking this tightrope has been extremely difficult. Greater product market competition, shorter technology life cycles, and more demanding investors have added to this challenge. Companies have increasingly concluded that they can do better by sourcing knowledge from outside, rather than betting on making game-changing discoveries in-house.

And the author argued the death of corporate labs decreased productivity.

>> An unintended consequence of abandoning anti-trust enforcement was thus a slowing of productivity growth, because the this new division of labor wasn't as effective as the labs:

> a new division of innovative labor, with universities focusing on research, large firms focusing on development and commercialization, and spinoffs, startups, and university technology licensing offices responsible for connecting the two.

> The translation of scientific knowledge generated in universities to productivity enhancing technical progress has proved to be more difficult to accomplish in practice than expected. Spinoffs, startups, and university licensing offices have not fully filled the gap left by the decline of the corporate lab. Corporate research has a number of characteristics that make it very valuable for science-based innovation and growth. Large corporations have access to significant resources, can more easily integrate multiple knowledge streams, and direct their research toward solving specific practical problems, which makes it more likely for them to produce commercial applications. University research has tended to be curiosity-driven rather than mission-focused. It has favored insight rather than solutions to specific problems, and partly as a consequence, university research has required additional integration and transformation to become economically useful.

---

[0] https://www.eetimes.com/podcasts/six-words-that-built-the-ic...

> Honeywell brought a lawsuit against us and said you can’t selectively choose people to divulge your technology to. It’s too important. And if you divulge it to anyone, you’ve got to divulge it to everybody. They filed a lawsuit, and the government came down on their side. And RCA basically had to open up all of its patents to everybody if they opened them up to anybody.

xpe · on Nov 30, 2020

Thank you for writing this. Some key ingredients for organizational driven scientific advancement include: long term funding, risk tolerance, enough reputation to get people involved and aware, team work / collegiality, some degree of openness (depending on the problem), and of course being in the right place at the right time with the right skills, management, and people.

rtsil · on Dec 1, 2020

> Interestingly, for those labs to exist, being a monopolistic megacorp is a requirement. It appears to me that today's FAANG monopoly allowed the creation of Google Deepmind and OpenAI,

AFAIK OpenAI is still independent, despite its recent closeness with Microsoft. Deepmind existed and was active well before being acquired by Google. All these to examples prove is that today's big, monopolistic corporations tend to acquire research labs, not that they are a requirement for their existence or successful activity.

nmfisher · on Dec 1, 2020

Yeah, but both labs burn hundreds of millions of dollars. Without Google/Microsoft (not to mention Tesla/YC money), they would have died before bringing these kind of results to market.

krull10 · on Dec 2, 2020

Industry is not going to fund the overwhelming majority of research areas in biology, physics, chemistry, mathematics, etc. Data science and AI are an exception, where people in industry are much better paid, and can get access to much better resources that would be hard to afford in academia... It’s not surprising this type of advance came from an industry funded group. On the other hand, it is academia and its structure that has enabled so many other discoveries, for example, Crispr DNA tech, our understanding of gravitational waves, or the proof of the Poincaré conjecture.

xpe · on Nov 30, 2020

It is not always either/or.

It is not always A versus B.

These fallacies are too common.

lambdamore · on Dec 1, 2020

Deepmind does not foster its future PhD, but, yes, they offer a better rewarding environment for those PhDs to flourish after they get the basic training.

twic · on Dec 2, 2020

The academic groups working on this have a tiny fraction of the resources that Google do.

ac42 · on Nov 30, 2020

I think so, too. Linear algebra, control theory and quantum mechanics haven't gotten us anywhere and ivory towers prevail as this machine learning solution to a problem in biological chemistry clearly demonstrates. /s

Tomminn · on Nov 30, 2020

Almost every single of the tens of thousand papers by the hundreds of tenured academics in the field of protein folding are made obsolete by 10 google engineers.

This is what it's like when someone really moves the needle. And academic science cannot get it's head around it.

And yet, none of these scientists will suffer any career consequences. Their irrelevant work will be healthily cited by all the other scientists who are doing and have done irrelevant work. They'll retcon a story in their lit reviews about how their irrelevant work led to this.

The career consequences are saved for those who had their eye on the real ball for the last five years, but didn't get there first. For them, the comfortably irrelevant will have the gall to ask in accusatory tones: "What have you been doing these last 5 years?".

mnky9800n · on Dec 1, 2020

Did you miss the part where a bunch of the deepmind authors did phds and postdocs in protein folding related fields?

They pushed the needle because they understood the field and the new tools. This is how pushing the needle works.

flobosg · on Nov 30, 2020

AlQuraishi described the progress made in CASP13 (2018) as “two CASPs in one”. This one is an even bigger breakthrough.

Seanambers · on Nov 30, 2020

I particularly like the rant on pharmaceuticals companies lack of basic research. My impression has been that medical progression have been slow for quite some time, nice to see that there are some truth to that.

In the end software and tech companies might just eat up the pharmaceutical industry as well. - It's all just code at some level.

The Deepmind team did this with ;

"We trained this system on publicly available data consisting of ~170,000 protein structures from the protein data bank together with large databases containing protein sequences of unknown structure. It uses approximately 128 TPUv3 cores (roughly equivalent to ~100-200 GPUs) run over a few weeks, which is a relatively modest amount of compute in the context of most large state-of-the-art models used in machine learning today."

So it wasn't out of reach for academia, pharmaceuticals, or others with a bit of resources.

throwawayiionqz · on Nov 30, 2020

This is the cost of training the final architecture with all the refinements enabled by years of research.

These years of research involved trying many different architectures, many of which received as much or more compute time than the final system.

The price of training the final architecture is meaningless. Researching and training AlphaGo was expensive but it enabled the ideas and development of AlphaZero which is more computationally tractable.

To have any chance, an academic team would need the same compute resources as what the DeepMind protein folding team used during the whole development of the architecture during the last few years, not only the resources used to train the final system. And I bet this funding is not available to most if not all academic teams.

mjn · on Nov 30, 2020

Even if you try to account for the overall R&D cost, DeepMind isn't that large an organization by the standards of biomedical research. It's very big and well funded for a computer science research organization, yes, and most CS departments can't match its resources. But the NIH budget is $40 billion, and private pharmaceutical companies do another $80 billion in annual R&D. It's interesting that this kind of breakthrough didn't come from those sectors.

dekhn · on Nov 30, 2020

DeepMind is taking advantage of NIH's funding. For example, Anfinsen who demonstrated that proteins fold spontaneously and reproducibly (https://en.wikipedia.org/wiki/Anfinsen%27s_dogma) ran a lab at NIH. Levinthal (who postulated an early and easily refutable model of protein folding) was funded by NIH for decades. Most of the competitors at CASP are supported by NIH and its investments have contributed to the modern results significantly.

That said I think the academic and pharma communities had engineered themselves into a corner and weren't going to see huge gains (even thogh they are exploring similar ideas) for a number of banal reasons.

mjn · on Nov 30, 2020

That's a good point; this system certainly didn't come from nowhere! The protein datasets they used also mostly came out of various NIH-funded projects.

What I meant to focus on was that I think DeepMind has less of a pure money/scale advantage in this area than in some others. In something like Go or Atari game-playing, there are many academic groups researching similar things, but their resources are laughably small compared to what DeepMind threw at it. So you might argue that they got good results there in part because they directed 1000x the personnel and compute at the problem compared to what any academic group could afford. In biomed though, their peers in academia and industry are also pretty well-funded.

dekhn · on Nov 30, 2020

Personally I think a major part of the secret sauce is Google's internal compute infrastructure. When I was an academic, 50% of my time went to building infra to do my science. At Google, petabytes of storage, millions of cores, algorithms, and brains were all easily tappable within a common software repo and cluster infrastructure. That immediately translates to higher scientific productivity.

MaxBarraclough · on Nov 30, 2020

Has cloud computing changed this?

dekhn · on Nov 30, 2020

Mostly? I left google to work at a biotech startup working in a related area and found that the big three cloud providers have built systems that greatly improve computational science. That said, it's still a lot of work to get productive, many in the field are really resistant to changes like version control, continuous integration, testing, and architecting distributed systems for handling complex lab production environments.

Here's an exemplar of how I think it evolved well in a cloud world: https://gnomad.broadinstitute.org/

that project adopts many concepts from google and others and greatly improved our analytic capabilities for large-scale genomics.

asah · on Nov 30, 2020

Having recently experienced both, 1000x this.

t_serpico · on Nov 30, 2020

You hit the nail on the head here.

WanderPanda · on Nov 30, 2020

It seems like spending these government funds on creating new challenges like CASP and ImageNet could have an enormous ROI. Don’t let them try to choose the winner, just let them define the game

zaroth · on Nov 30, 2020

> The price of training the final architecture is meaningless.

The research is the giant shoulders you stand on, the compute cost is the price of the tool you need to do the present-day work.

Both are relevant but the shoulder’s of giants are generally more accessible, particularly if we’re talking about published research and not proprietary tech.

A competing team is not starting from the same place the DeepMind team started at 5 or 10 years ago.

zaroth · on Nov 30, 2020

To expand on this, after fully reading AlQuraishi's "What Just Happened" post from a couple years ago, was this point that he made;

> I don’t think we would do ourselves a service by not recognizing that what just happened presents a serious indictment of academic science. There are dozens of academic groups, with researchers likely numbering in the (low) hundreds, working on protein structure prediction. We have been working on this problem for decades, with vast expertise built up on both sides of the Atlantic and Pacific, and not insignificant computational resources when measured collectively. For DeepMind’s group of ~10 researchers, with primarily (but certainly not exclusively) ML expertise, to so thoroughly route everyone surely demonstrates the structural inefficiency of academic science. This is not Go, which had a handful of researchers working on the problem, and which had no direct applications beyond the core problem itself. Protein folding is a central problem of biochemistry, with profound implications for the biological and chemical sciences. How can a problem of such vital importance be so badly neglected?

In short, academia got utterly schooled by a small group at Google spending a relatively small dollar amount on compute, using techniques that in hindsight are fairly described as "simplistic". There's no way around it.

Invictus0 · on Nov 30, 2020

I don't think AlQuraishi really hits the mark in his critique. The mere fact that hundreds or thousands of people working on a problem for decades doesn't account for the fact that the field of machine learning has been growing extremely rapidly over the last decade, the compute power available has grown exponentially, and the people working on the problem simply weren't looking at the problem in the way that the deepmind people were looking at it.

If you were trying to get across the Atlantic, this would be like getting upset at a group of bridgebuilders for trying to solve the problem by building a bridge across instead of by inventing the airplane. The approaches are that different.

flobosg · on Nov 30, 2020

> and the people working on the problem simply weren't looking at the problem in the way that the deepmind people were looking at it.

>The approaches are that different.

I'm not sure if that analogy applies here. DeepMind wasn't the first group tackling structure prediction with machine learning. Their success lies in the innovations that they implemented (predicting interresidue distances as opposed to contacts, for example).

dash2 · on Nov 30, 2020

To be fair, I'm not sure that they are "simplistic" in the sense that, e.g., writing a neural network to recognise cat pictures is now simplistic. I don't know how many people have Deepmind levels of expertise in ML, or could implement what they have done, but I doubt it is many, and they are thinly spread amongst many interesting problems.

craftinator · on Nov 30, 2020

> The price of training the final architecture is meaningless.

Meaningless in historical terms, but meaningful in future terms. It's meaningless how long the training took because there were countless resources spent to get to that point. It's meaningful in the future, because we know that training times are fairly short, and iteration can be done fairly quickly.

beowulfey · on Nov 30, 2020

I mean, credit where credit is due. Google employs some of the greatest names in artificial intelligence and the DeepMind team had a huge chunk of them working on this problem. While the resources may have been available, I don’t think any other single institution had the level of brain power.

elcritch · on Nov 30, 2020

It also makes one reconsider the notion that monopolies are entirely bad. This essentially appears to be a vanity project for Google. Though of course they'll benefit from it in many ways, but it's not like they're doing this as the core product of their service. It's a pretty awesome achievement.

xzel · on Nov 30, 2020

Look at all of the incredible things that came out of Bell labs during their monopolistic reign. I think a better way to put it is not all monopolies are bad for research and progress but many are bad for other social and economic reasons. Like any position of power, it depends on how it is used snd who is using it.

Supermancho · on Nov 30, 2020

> It also makes one reconsider the notion that monopolies are entirely bad.

Much like political dictators, they can be exceedingly efficient and have resources (and authority) to do things in spite of opposing interests.

People who faced with the narrative that countries have a monopoly on a number of aspects of life find monopolies are not a BAD THING(tm), but that they are bad for a consumer market - as a monopoly eventually blockades aspects of the market.

bawolff · on Nov 30, 2020

Or to put another way, the kings and queens of yesteryear funded a staggering amount of beautiful art, etc.

e_y_ · on Nov 30, 2020

I think there's some merit to the idea that huge corporate monopolies have the resources to accomplish undertakings that smaller companies cannot. But it's often a what-if, because we don't know what the alternative might have been.

Big companies can suck up all the air in the room by monopolizing talent and making it harder for startups to pay the kinds of salaries needed for top tier AI research. Xerox PARC came up with all kinds of groundbreaking inventions that were never commercialized (by them). For every invention that comes out of a big company, it's worth thinking about whether it might have actually come out faster if it was borne of competition instead of a side project. Or in the grand scheme of things, if corporate taxes were higher and the money was given to a university research lab.

I think the best results may come from the middle ground. Smaller/medium companies are so worried about staying afloat or hitting their quarterly earnings that they have trouble making long term investments. Large companies are diverse and profitable enough that they can afford to blow money on things that might not pan out, but they don't have the same drive -- and in fact have some pressure to avoid being "too" innovative because it could cannibalize their existing products.

generalizations · on Nov 30, 2020

Note that Bell Labs is another example of the corporate monopoly research lab producing things that others couldn't / didn't.

soup10 · on Nov 30, 2020

It's kind of like a modern day Bell Labs where they have so much excess profit from adtech that they can fund lots of "basic research" or the computer science equivalent of that.

Ericson2314 · on Nov 30, 2020

You've just describe why many Socialists 100 years were very skeptical of anti-trust as trying to sacrifice modernity to proper up a romanticized notion of the past as disaggregated pure-petit-bourgeois capitalism. Really not that different than the critism of the Luddites 100 years before that.

See https://ilr.law.uiowa.edu/print/volume-100-issue-5/all-i-rea...

sudosysgen · on Dec 1, 2020

This line of argument reminds me of Haldane's point that economic planning can often work for the same reasons why large corporations and monopolies often work well too.

Ericson2314 · on Dec 1, 2020

"The People’s Republic of Walmart"

bosswipe · on Nov 30, 2020

Imagine we lived in a culture that did not believe "government is always bad at everything". Government could then pay Google-level salaries and provide Google-level resources to the top minds in the world and give them free rein to tackle problems like this. It's worked in the past, such as Manhattan project or moon landing. But I don't think it's doable nowadays because of the anti-government political culture. Even when government is fully funding things these days the work has to be farmed out to private interests.

e_y_ · on Dec 1, 2020

It'll take more than just belief in the government. We'd need people to actually care about making government better.

Most people just show up to vote once every 4 years (or less) and make their decision based on the party affiliation or the wedge issue du jour, and the rest of the time pretty much ignore what's going on or don't have the power to do anything about it, which gives a lot of leeway for special interests to slide things in under the radar.

nightski · on Nov 30, 2020

Not even a little bit. There is nothing here that would require Google to be a monopoly to accomplish. If anything companies become lazy without competition.

I feel like that is not too far from saying it makes one reconsider communism because good things can happen with authoritarian control.

mrDmrTmrJ · on Nov 30, 2020

Absolutely. The capability to "create" the breakthrough is extremely rare. Perhaps only DeepMind, OpenAI, and GoogleBrain can assemble these types of teams. Luckily, the capability to replicate and exploit the breakthrough is far more 'common'; though still very rare.

Excited to see how follow on use of these models, by many more teams, researchers, and companies plays out over the next two decades.

This is a foundational advance!

flobosg · on Nov 30, 2020

Yeah, it was a big slap in the face. But, to be fair, most of the scientific and technological advances (sequencing efforts, structural genomics projects, etc.) that generated the data used by DeepMind came from academia and, to a lesser extent, the pharma industry.

sjg007 · on Nov 30, 2020

I think the lesson here is that most of the big data genomic, metabolic, pharmacologic and other research will all be driven by deep learning. The models themselves however require 100+ gpus so we are sort of back in that phase where you need large compute systems to even compete. A single lab will have issues unless they can leverage a cloud and then also get grant funding to spend that money on the cloud compute... which may be difficult b/c its basically a consumable now and you don't have any hardware leftover.

IfOnlyYouKnew · on Nov 30, 2020

In a prior(/n) life I worked on Protein folding, and participated in CASP.

This was a/the "holy grail" problem of molecular biology, long thought to be an automatic Nobel. It's somewhat unfair to characterise developments prior to this as insignificant. In fact by the time I was working on it, that "automatic Nobel" was no longer assumed, because the field had made quite a bit of progress, in many tiny steps by many different groups, and the assumption was it would continue in this slog until reaching some state of sufficiency for practical applications without ever seeing the sort of singular achievement that would be worthy of praise and prize.

Far more went into this breakthrough, obviously, than those TPU-hours: the development of those TPUs, for example, and assembling a team that can make use of them. The protein folding problem requires very little knowledge of biology or physics to understand and was always pre-destined for some outsider to sweep. Indeed, there was game that allowed people to solve structures by intuition alone, and, IIRC, some 13-year old Mexican kid cleaned everyone's clock some years back.

Why didn't some research group do this first? Most of them just don't have the budget. We were five people, total, IIRC, and felt pretty rich because we were computer-people getting the same budget for materials as everyone at our institution, which was all wetlab, otherwise. So I was a student being paid $20/h but with a $50,000/p.a. hardware budget. How many false start does it take before you do that run with 128TPUs "for a few weeks" that works? If you blow your budget on one gigantic Google invoice, what's going to happen to you when it doesn't pan out, and the whole institute laughs at you? Etc...

There are quite a few rather good things this problem has inspired over the years, though. Among them is CASP itself: the idea of instituting a yearly competition that gives unequivocal feedback on the state of the field and every group working on it is rather rare, I believe, and it's been successful. Indeed, it would seem that CASP was necessary to attract outside groups like Deepmind, i. e. deep-pocketed industry groups striving to prove themselves on a clearly defined problem. Chess, Jeopardy, CASP: maybe it would be worthwhile to explore not <solving x>, but <stating X as a problem that attracts Google/IBM/etc.-scale money> as a superior strategy in some cases.

There was also folding@home, pioneering the distributed-donated-computing model, and the aforementioned gamification of the problem, and hundreds of the most intricate, custom-tailed, more-or-less insane ideas people devoted months and/or careers and/or careers of their most promising post-docs to that didn't pan out.

Like cellular automata. They don't work for this, trust me. (Great hit for interactive poster sessions, though)

emmelaich · on Dec 1, 2020

The game was https://fold.it/ presumably.

nl · on Dec 1, 2020

> How many false start does it take before you do that run with 128TPUs "for a few weeks" that works?

This is a big issue that most people miss. Having easy access to vast computational power makes such a difference for experimentation.

tonfa · on Nov 30, 2020

> So it wasn't out of reach for academia, pharmaceuticals, or others with a bit of resources.

How much does hiring a deepmind-like team cost though? (massively more than the TPU resources?)

Still within reach of pharmaceutical industry I guess, but maybe not so easy for academia.

Seanambers · on Nov 30, 2020

From what I can gather, Google bought Deepmind for 500 million USD in 2014, they have outstanding debt to its parent company as of 2019 of 1.3 billion USD.

And they had income around 100 million in 2019 but it's all against Google, so looks like a 2 billion +/- 0.5 operation so far, and who knows if they pay for compute.

Other articles place the runrate at 500 million per year in 2019.

Which means 500 million * 6 years = 3 bn + 0.5 purchase price. = 3.5 bn. So somewhere in the 2.5 - 3.5 billion range its seems likely as total cost so far.

Nevertheless doesn't seem out of reach for a multinational.

sseagull · on Nov 30, 2020

It would still be a significant amount of money for a lot of companies.

Remember, we are looking in hindsight that it seemingly paid off. A few years ago, this was just an educated bet; only the richest companies with money to burn (from selling ads) would be willing to take on that kind of a risk.

xbmcuser · on Dec 1, 2020

Only the energy cost savings google got from Deepmind probably already makes it a very profitable acquisition https://deepmind.com/blog/article/deepmind-ai-reduces-google...

ovi256 · on Dec 1, 2020

I appreciate this tremendous 3.5B subsidy that Google brought to basic ML research and R&D.

There is barely any multinational that has the freedom Google had of planning to spend 3.5B with no ROI. Their shareholders would sue and vote the managers out.

TulliusCicero · on Nov 30, 2020

That's the cost of running DeepMind as a whole, right? Which includes all the other stuff they've worked on, like games.

Seanambers · on Nov 30, 2020

Yeah, as far as I can tell, that's the whole lot of it.

t_serpico · on Nov 30, 2020

Also, pharma does not really have a huge incentive to work on this problem. Solving the protein folding problem does not automatically translate to new drugs just in the same way CRISPR or DNA sequencing did not. It's another tool in the toolbox (which to be clear is a big deal).

mwcampbell · on Nov 30, 2020

How far does the similarity extend? Specifically, the big question for me is whether AlphaFold will be freely available like ImageNet, or proprietary.

ramraj07 · on Nov 30, 2020

The competition requires enough revealing about the methodology for other teams to replicate it so open implementations are going to be available for sure.

It also looks like they came up with a brand new jiggling algorithm which is probably just V1 now, this really changes things in a significant way!

sanxiyn · on Nov 30, 2020

I expect this to be quickly replicated once published. Training data is public and training compute is not enormous and AlphaFold of 2018 did get replicated.

dekhn · on Nov 30, 2020

CASP typically works this way: one person "wins" by getting a slightly higher score than everybody else. Two years later, the top teams have all duplicated the previous winner's tech, and two years after that, there's a github you can download and run on your GPU to reproduce everything.

kxs · on Nov 30, 2020

How do you define enormous? "It uses approximately 128 TPUv3 cores (roughly equivalent to ~100-200 GPUs) run over a few weeks". Also last time it took about a year for good replications to pop up.

mrDmrTmrJ · on Nov 30, 2020

A year is a fast time to replication in many scientific fields.

While substantial, the resources here are well within reach of many labs, research institutes, and organizations. For this result this big, I'd guess we'll have 2-6 additional implementations in the next 18 months. The problem has been 'open' for 40+ years, so that's lightening fast!

elcritch · on Nov 30, 2020

A couple of hundred GPU's is well within the reach of many even moderately well heeled research institutes. It'd seem that about 3 weeks of compute time with 128 TPU v3's would be about $170,311.68.

kxs · on Nov 30, 2020

But of course that cost would only be for the final model. Anyway, I think I am just living in a different world... :-) We could never compete with that

elcritch · on Nov 30, 2020

Yah, big grant money. Now the grad students programming the open source clones will only make approximately $0.56, or 4.2 Ramen packs, for their effort. ;)

sdenton4 · on Nov 30, 2020

Also with keeping in mind that once a good open source model is available, researchers with less resources can still use it to fine tune and get new results for far cheaper than training a new model from scratch.

intpx · on Nov 30, 2020

or cryptominers

dragontamer · on Nov 30, 2020

A lot of labs have access to the various strategic supercomputers of the USA.

Ex: Summit has 27,648 V100 GPUs (and those V100s have Tensor units). If you're saying that only 200 GPUs are needed to replicate the experiment, that doesn't even use up 1% of Summit's available utilization.

0-_-0 · on Nov 30, 2020

ImageNet is a competition and a dataset, AlphaFold is a neural network.

dmix · on Nov 30, 2020

> However, if the (AlphaFold-adjusted) trend in the above figure were to continue, then perhaps in two CASPs, i.e. four years, we’ll actually get to a point where the problem can be called solved, in terms of gross topology (mean GDT_TS ~ 85% or so). Interesting prediction within.

It turned out only to be one more year instead of four (depending on whether getting to the 90~ range is "solved".

I'm curious to see if AlphaFold can do even better the next two years.

Those last mile percentages always tend to be small anyway.

im3w1l · on Dec 2, 2020

> Now that the problem of static protein structure prediction has been solved (prediction errors are below the threshold that is considered acceptable in experimental measurements)

This seems premature. Even though it does very well on average, there may be some areas where it struggles, and those areas may turn out to be important.

mabbo · on Nov 30, 2020

Sometimes announcements like this are a bit over-the-top. But what really, to me, cements the 'big-deal' of this is the "Median Free-Modelling Accuracy" graph half way down the page.

Scores of 30-45 for 15 years. Now scores of 87-92.

This isn't a minor improvement, it's a leap forward.

treis · on Nov 30, 2020

That is an impressive improvement, but I think you've missed the most important point:

>a score of around 90 GDT is informally considered to be competitive with results obtained from experimental methods

So DeepMind is to the point where it's a question of whether their generated model or the experimentally determined structure is closest to the actual physical structure.

mabbo · on Nov 30, 2020

Then we get the really fun question: if the experimentally determined structure is only 90% accurate, can machine learning actually reach 100%? Can you learn exact truth from inexact examples?

Which gets into the concept of whether the ML model has actually learned some deeper conceptual ideas than we have, some deeper truth about how this works. If so, can we somehow extract that truth, or is it truly a black box that does the thing we want?

I'm reminded of a sci-fi book I read long ago in which humans are discussing the fact that the science they are utilizing is beyond the scope of a human mind to comprehend- only the AIs can intuitively deal with 12-dimensional manifolds (or something to that extent). Maybe we've reached the doorstep of that future.

carlmr · on Nov 30, 2020

If you have an experimental error that is somewhat normally distributed around the mean, the the AI should, with enough examples, learn what the rules are that are closest to the mean. Because it will minimize the sum of errors.

So i do think the results could be more accurate than measurement.

willj · on Dec 1, 2020

I don’t think we can assume the errors are normally distributed. It’s possible researchers are biased in a particular “direction”, away from 0 on all dimensions of this problem.

bollu · on Dec 1, 2020

That's fine. It's still a normal distribution with a different mean. The Gaussian is characterized by having only the first two moments: mean and variance.

hoseja · on Dec 1, 2020

There is no 100%. Proteins are flexible. Curious how this deals with that even more fiendishly difficult fact.

root_axis · on Dec 1, 2020

> Which gets into the concept of whether the ML model has actually learned some deeper conceptual ideas than we have, some deeper truth about how this works.

Well I think that the results speak for themselves; ultimately the question you raise is one of semantics. ML models don't think in terms of "conceptual ideas" like humans do, these models simply perform at such a massive statistical scale that they can identify patterns far beyond any human conception. Clearly, the model embodies some verifiably reliable information about the way the world works, but this is "just" a trick of statistics not anything resembling actual "understanding" in the way the word is typically used when referring to human understanding.

asdfasgasdgasdg · on Nov 30, 2020

I have a related question about this. If experimental methods produce results around a score of 90, what is the baseline we are comparing the DeepMind results against? If the experimental error is equal to the observed DeepMind error, how can we say which one is actually more erroneous?

mrDmrTmrJ · on Nov 30, 2020

Excellent question. At somepoint, I think the only answer is, "have a bunch of different people run a bunch of experiments on the same protein."

The threshold for "real" in particle physics is +5 sigma. Which takes a lot of data.

dnautics · on Dec 1, 2020

you really can't compare stats like that. Those are independent, uncorrelated measurements. When you take RMSD measurements on a molecule they are not independent (for example, atoms near the core are less likely to be "inaccurate").

radioactivist · on Nov 30, 2020

I think it's that a score of >90 means the result is within the error bars of whatever particular experiment was chosen to be the "reference".

IfOnlyYouKnew · on Nov 30, 2020

The "experiments" here use X-Ray Crystallography. Like most methods of measuring anything, we have a pretty good idea of its accuracy under various conditions.

Think of it like satellite imagery of a tree: A score of zero would be a single green-ish pixel, while a score of 100 would show each leaf within the range it naturally moves in due to wind etc. (proteins tend to wiggle quite a bit under natural conditions, as well)

0-_-0 · on Nov 30, 2020

That's a damn good question, it looks like we don't know how much above 90 AlphaFold is.

cpeterso · on Nov 30, 2020

And is it even meaningful for DeepMind to score better than experimental results? How are DeepMind’s results scored then?

marcosdumay · on Nov 30, 2020

Finding the energy of each configuration should be much easier than finding the lowest-energy configuration. Can that be calculated ab-initio or it is still too expensive?

crispycrafter2 · on Nov 30, 2020

The problem with ab-initio methods in this context is the sheer number of non-covalent interactions present in these large proteins. A simple protein would require a hybrid quantum mechanic/molecular mechanics simulation to even approximate the vibrational energy required to validate equilibrium.

These proteins are so massive that we often use Daltons [1] as an averaged measure of molecular weight.

Conceptually one of the most promising applications of quantum computing is theoretical chemistry, and we are only now starting to make progress in this avenue [2]. I anticipate it would require quantum computing to explicitly optimise large folded proteins.

1. https://en.m.wikipedia.org/wiki/Dalton_(unit) 2. https://arxiv.org/abs/2004.04174

timr · on Nov 30, 2020

"So DeepMind is to the point where it's a question of whether their generated model or the experimentally determined structure is closest to the actual physical structure."

While this is an accomplishment, nobody is going to be confusing these models for structures produced experimentally. The CASP metric is for backbone atoms. To have a useful model of protein structure, you really need to have the positions of the protein side-chain atoms modeled correctly. Experimental methods will do that, but this method, as I understand it, does not.

jey · on Nov 30, 2020

So it's a really good start, but nobody is going to be throwing these structures into molecular docking simulations for drug discovery or etc just yet. But hopefully those details can be worked out soon enough.

timr · on Nov 30, 2020

Yeah, there's a huge difference between a 1Å all-atom RMSD structure, and a 1Å backbone RMSD structure. The non-backbone atoms in a protein make up most of the mass and volume. When structural biologists talk about RMSD, this is what they mean.

mFixman · on Nov 30, 2020

I don't have a background in biology, and that quote confused me.

What's an experimental method for protein folding and why is it so good? Are they talking about creating an actual, physical protein in a lab and observing how it folds?

flobosg · on Nov 30, 2020

> Are they talking about creating an actual, physical protein in a lab and observing how it folds?

Exactly. Researches purify the folded protein and then use methods such as X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy to determine its three-dimensional atomic structure.

selestify · on Nov 30, 2020

If even the experimental approach is only 90% accurate, how do they know which 90% is accurate?

rsfern · on Dec 1, 2020

I’m not a protein crystallographer, but here’s my generalist take.

We understand the physics of e.g. X-ray diffraction pretty well, so we can fit pretty decent forward models for the x-ray data given a proposed structure. The hardest task here is getting a good enough guess at the structure to optimize the physical model, and it’s my impression that people use an iterative model refinement workflow. At least that’s how it’s done in condensed matter materials.

There are many sources of experimental uncertainty, like the non-ideal nature of the x-ray source and optics, and the fact that the atoms in the protein are not static but have some thermal fluctuations. so at the end of the refinement you still have some uncertainty on your model parameters (the interatomic distances for proteins I guess), but if you are careful you can calibrate these uncertainties pretty well.

This paper looks like a really good detailed discussion: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080831/

dnautics · on Dec 1, 2020

X-ray diffraction is pretty nutty, too. You're taking the diffraction pattern, which is the fourier transform of the electron density. Fourier transform results are complex-valued data. Unfortunately, we don't really have X-Ray lasers, so you can only get the intensities and not the phases of those diffraction spots. Since mother science hates us, it of course the case that, in a fourier transform "more information" is contained in the phases than in the intensities.

So you "make guesses at what the phases are", the best choice is to bootstrapping these phases measured with another technique (you can introduce crystal defects that do allow you to guess at what the phases are).

Less scrupulous is to use a computer generated model, like fitting another protein "that you guess is related", then you model the electron density, take the phases of that.

In any case you take these "phase" guesses, and then apply it to your intensities, re-run the fourier transform, refine your electron densities, twiddle the location where you think the atoms, are, then repeat with your new model. This process repeats until you converge on a structure that you're happy with.

Now alarm bells should be screaming in your head right now: Yes, it's entirely possible to converge on a wrong structure, especially if you're a young up-and-comer professor seeking tenure that has no ethical problems with "suggesting" their grad students to sleep in the lab and work 100 hour weeks and willing to do slipshod work to get you tenure: https://www.sciencedirect.com/science/article/pii/S002228360...

selestify · on Dec 1, 2020

> Yes, it's entirely possible to converge on a wrong structure

I guess my question is, how do you know if you’ve converged on the right structure or not? Is there a different experiment you could do?

dnautics · on Dec 2, 2020

Best is an orthogonal process (like NMR). Cryo-EM is getting better too so maybe that will start to be viable. Sometimes that's not possible, but you can use secondary evidence: "we know these three amino acids are important band hey look they touch in our model".

mactrey · on Nov 30, 2020

I'm not a biologist but I'm not sure that follows. It could be that the experimentally-derived structure is 100% accurate to the actual physical structure but getting 90% of your predicted residues to match that is enough to get an accurate prediction of protein behavior and hence "competitive."

SubiculumCode · on Nov 30, 2020

Something like this comes up in assessing the accuracy of automated segmentation results of brain regions e.g. the hippocampus. Human-machine reliability is approaching the human to human reliability, so it becomes harder to improve the automated methods.

contravariant · on Nov 30, 2020

Of course this may no longer be the case for methods solely trained to optimize that particular metric.

beowulfey · on Nov 30, 2020

I don’t think you can say DeepMind could ever be more accurate to the true physical structure since it was built on the same experimental structures that it is being compared to. The limit of accuracy is the experimental data. However, I think we can say that a DeepMind prediction could at least be as good as a new experimental structure.

dwiel · on Nov 30, 2020

This seems like an obvious assumption to make, but it isnt always true. It is easier to see why if you are measuring a single value multiple times in order to get a more accurate estimate of the true value. In that case your "model" is simply the mean of all measurements made and can exceed the accuracy of a single measurement.

In this case, the model is predicting values of multiple structures, but patterns could still theoretically be found which allow for predictions beyond the accuracy of a single measurement.

dekhn · on Nov 30, 2020

DM is merging several experimental data: known x-ray structures, and evolutionary data. The experimental method (xray) doesn't take advantage of the evolutionary data. And it also doesn't model the underlying protein behavior accurately (xray basically assumes a single static model with atoms fluctuating in little gaussian "puffs" around the atomic centers, but that's not how most proteins behave).

robocat · on Nov 30, 2020

But DeepMind could be used to find errors in the training set.

Let’s say you have 100000 proteins in the training set. Now remove #1 and train on 99999, and then check that it still predicts the same protein result for #1 as the experimental result.

Or remove from training whole sets of proteins by particular teams to find systematic errors made by teams?

FrojoS · on Nov 30, 2020

Is that true? I thought fundamentally, the simulation tries to find the state of lowest energy, which is defined by physics. So, your result can be better than the data set used for training.

WhompingWindows · on Nov 30, 2020

This reminds me of AlphaGo and AlphaZero. DeepMind was able to produce a very solid model on their first attempt, at both protein folding and at Go (and Starcraft2 as well). Their second models, however, seemed to blow their first out of the water.

This bodes extremely well for the future of computational biology, I'm very excited thinking about the prospects. If we know how a protein folds, we know its shape, meaning we know which shaped/charged molecules are needed to act as suppressors/enhancers of those proteins.

layer8 · on Nov 30, 2020

One difference to AlphaZero though, if my understanding is correct, is that AlphaFold is trained on a predetermined data set and hence didn’t learn how “arbitrary” proteins fold in general, but just how the kinds of proteins fold for which we already know how they fold. To work more like AlphaZero, AlphaFold would have to be able to synthesize arbitrary proteins and run the experiments on them to verify and correct its predictions. Therefore it’s conceivable that AlphaFold is biased by the existing training data and doesn’t fully generalize to all proteins we would want to apply it to. Maybe that won’t be a problem in practice, but nevertheless it makes for a significant difference from what AlphaZero was about, being solely self-trained.

the8472 · on Nov 30, 2020

> AlphaFold would have to be able to synthesize arbitrary proteins and run the experiments on them to verify and correct its predictions.

Could this lead to a virtuous cycle where AlphaFold is used generate a ton of random sequences where it has low confidence, those are then screened for ease of synthesis, measured and the results used to improve the model?

Edit: nevermind, according to another comment[0] there are still plenty of real proteins without experimental data left to explore.

[0] https://news.ycombinator.com/item?id=25255601

cma · on Dec 1, 2020

> AlphaFold would have to be able to synthesize arbitrary proteins and run the experiments on them to verify and correct its predictions.

It can verify how much it minimizes the potential energy, which may not always line up with how it would fold in the real world but is a strong indicator.

entropicdrifter · on Nov 30, 2020

Not to mention the fact that two years ago they took it from 45% to >60%. If they can continue improving, even with an exponential decay in rate of improvement, this is certainly a stunning example of technological disruption.

Zenst · on Nov 30, 2020

Even without any improvement, the amount of grunt-work the AI can pre-do and get down to a short-list - that in itself will see changes in progress speeding research up.

kordlessagain · on Nov 30, 2020

> and get down to a short-list

There's no reason to believe the list will contain all solutions, however.

patagurbon · on Nov 30, 2020

No but it will hopefully contain some. Which for many if not most problems is all that matters

martinpw · on Nov 30, 2020

Why is the graph not monotonically increasing? Does the complexity of the problem to be solved increase each time? If so, does that make the relative improvement from the previous result even more impressive?

breatheoften · on Nov 30, 2020

That's quite interesting ... I believe the test set size is not constant year to year but rather a function of how many new structures have been experimentally discovered since the last contest?

Does seem like the contest structure could include quite a bit of risk for hiding the effect of overfitting ... I wonder if there is anything inherent about the problem that reduces that risk ...?

FrojoS · on Nov 30, 2020

My understanding is, that it's always 100 new structures, which is a small fraction of the total structures identified in that year.

The reason why the top score in one year, can be lower than in the previous year, is that the test (the 100 structures to guess) is always new and different, so it can end up being 'harder' than the year before. Luck will also play a small role.

Another explanation for a reduction in the top score would be, that previous winners are not re-submitted unchanged. For instance AlphaFold v1 seems to not have been submitted to the latest competition.

breatheoften · on Nov 30, 2020

Only 100 new structures each test cycle? That seems a very small test set size ...

Is it really possible to select 100 new structures which together are likely to represent a meaningful increase in the sample generalization versus the prior years test set ...?

MauranKilom · on Nov 30, 2020

Given that we only know the structure of on the order of 100k proteins, we might only get another 10k new ones per year. I guess.

Using 1% of those (presumably from the more-often-reproduced subset) for this challenge seems reasonable? Note that the structures have to remain secret up until the challenge, and presumably all those teams uncovering the structures don't want to have to wait up to 2 years every time to actually make their results public.

breatheoften · on Nov 30, 2020

Interesting ... plenty of opportunity then potentially for the 100 samples to have prediction similarity to the set of published discoveries (for expected or unknown reasons)?

I suppose it will take a few more years of repetition for the challenge to confirm that the problem has been been solved -- but I wonder if a new version of the contest is going to be needed as well? Maybe the model accuracy is now high enough to invert the contest to a form where models generate predictions for randomly selected unknown samples -- and experimental teams are then expected to make observations for those particular sequences over the next two years as part of their otherwise research agenda selected experimental workload?

kxs · on Dec 1, 2020

There are different categories of samples, namely FM and TBM targets. FM targets don't have any similarity to known structures. Roughly a quarter were FM targets. I think the more interesting thing to look at is the size of the multiple sequence alignments (MSAs) which is the basis of this and essentially all methods. They seem to do very well with few MSAs, which bodes well for other targets, although there are families of proteins with few MSAs.

hobofan · on Dec 1, 2020

100 structure with 100+ amino acids each, so it's not quite as bad. Part of the folding information is contained within a distance of a few amino acids, while some (the harder part and crux of the problem) is farther away.

But yeah, compared to other fields, the size of training/test sets is sometimes pretty small in ML for life sciences.

harperlee · on Nov 30, 2020

Not knowing a lot about biotechnology, I read the article and it sounds great, but how big is this as a gamechanger? Can someone comment on how big are the implications of this in, let’s say, 5 years from now, on day to day life? Does this mean that biotech is going to explode? Or just that drugs will come to market faster, perhaps cheaper for rare diseases, but from the same industry structure as always?

fabian2k · on Nov 30, 2020

Protein folding is a big and important problem, so this is certainly big news if it works as well as it seems. But I wouldn't assume that this changes everything, we can already determine how proteins fold by experimental work. The disadvantage is that this is a lot of work, though the methods there also improved a lot.

One question is how robust the predictions are that DeepMind produces. I would also assume that right now it can't e.g. determine protein structures in the present of other small molecules, or protein complexes. A lot of the interesting stuff lies in the interactions between molecules.

And in general in life sciences any new development will take at least a decade until it hits day to day life, likely even more. We're living with a exception to this rule right now due to the pandemic, but in general things take quite a bit of time in that space.

derefr · on Nov 30, 2020

We can already determine how a few proteins (170k — which sounds like a lot, but which is only 0.09% of all currently-catalogued protein sequences) fold by experimental work.

What an accurate model of protein folding allows us to do, is to take our big database of DNA, predict protein foldings for all of it, and then stand up a search index for this database, keying each amino-acid "row" by the "words" of its predicted protein's structural features.

We could then, with a simple search query that executes in O(log n) time, find DNA targets that produce molecules with interesting structures that might be worthy of study.

This would, for example, be a game-changer in how biopharmaceutical macromolecule-therapy R&D is conducted. Right now we have to notice that some bacterium or another produces some interesting protein, and then engineer a bioreactor to get more of that protein. With this tech, we can work backward from an entirely hyothetical, under-specified "interesting protein", to figure out what catalogued-but-unstudied DNA sequences produce never-before-catalogued proteins that fit that particular functional "shape", and therefore might do the interesting thing. Then we can either directly synthesize that same DNA, or find the organism we originally sampled it from and study it more.

fabian2k · on Nov 30, 2020

"A few" does appear quite dismissive of the enormous amounts of effort in structural biology so far. There are more than 170,000 structures in the PDB right now.

To determine potential targets for drugs we have to understand what the proteins do. Having the structure is not really enough for that, it doesn't tell you the purpose of the protein (though it certainly can give you some hints).

In most cases the proteins were determined to be interesting by other experiments, and then people decided to try and solve their structure. So the structures we already solved are also biased towards the more biologically relevant proteins.

derefr · on Nov 30, 2020

170k is "a few" compared to 180 million (i.e. the size of the PDB as soon as someone runs AlphaFold over everything in the UniProt.)

> In most cases the proteins were determined to be interesting by other experiments, and then people decided to try and solve their structure.

Yes, that's what we're doing right now, because structure is not a useful predictor, because we don't have structure available in advance of studies on the protein itself. There was no point to a "functional taxonomy" of proteins, because we were never trying to predict with protein-structure as the only data available.

In a world where protein structure is "on tap" in a data warehouse, part of the game of bioinformatics will become "structural analysis" of classes of known-function proteins, to find functional sub-units that do similar things among all studied proteins, allowing searches to be conducted for other proteins that express similar functional sub-units.

fabian2k · on Nov 30, 2020

Determining what a protein structure does might be even harder than folding. Right now we can't really do that ab initio, you have determine the activity in the lab and then look at the structure. And that allows you to potentially identify this motif in other proteins.

If someone produces an AI that you give a sequence and it tells you what the protein does exactly, I'd be extremely impressed. I don't see that happening soon.

The specifics matter a lot here. We can often determine rough functions for subdomains by homology alone. But that really doesn't tell you the full story, it only gives you some hints on what that protein actually does.

jeffxtreme · on Nov 30, 2020

Five years ago, I would have said the following:

"If someone produces an AI that you give a sequence and it tells you the protein conformation, I'd be extremely impressed".

Sure there are many more things to solve in this space; but that doesn't take away that this is an impressive achievement and does unlock quite a few things (including making more tractable the problem you just brought up). I'm excited to see what DeepMind works on now and what the new state of the world will be just five years from now.

fabian2k · on Nov 30, 2020

I think I have to clarify that my response was to a large part to the "this will change all our lives" part, and might look too negative on its own. I'm very, very impressed by these results, but that still doesn't mean that we just solved biology. If this works that well on folding, this could mean that a lot of other stuff that simply didn't work well in silico might come into reach.

I'm maybe overcompensating for the tech-centric population here, with some comments speculating for very near and drastic impacts from discoveries like this. Biology and life sciences are much slower, and there's always more complexity below every breakthrough. That does tend to push me towards commenting with the more skeptical and sober view here.

whatshisface · on Nov 30, 2020

My understanding of this is not perfect, but wouldn't answering the "actually does" question require a full biomolecular model of the cell, or even the whole organism? If so I see what you mean. I suppose that it might be possible to get around this by improving the theory of catalysts so that you could look at a site and say, "oh, this will act in such a way..." Dynamic quantum simulation of a few atoms at the active site is hardly easy but a far sight easier than the other.

Rochus · on Nov 30, 2020

It's a step forward for sure, but structures change over time to perform their function. The method described here only returns a static structure. Much more research and development is needed to be able to predict the dynamic behavior and interplay with other proteins or RNA.

AlexCoventry · on Nov 30, 2020

> as soon as someone runs AlphaFold over everything in the UniProt

It'll take a while before those results can be trusted, though, right? There's probably a selection bias in the training data for proteins which are easy to crystallize, so many proteins probably aren't well represented by the training examples.

entropicdrifter · on Nov 30, 2020

170,000 is three orders of magnitude less than the number of recorded protein sequences. I don't think it's dismissive to describe that as comparatively few.

flobosg · on Nov 30, 2020

Structure is much, much more conserved than sequence. In other words, protein sequences with low sequence identity can fold similarly due to the physical constraints that guide protein folding.

ClumsyPilot · on Nov 30, 2020

I don't know the field, and I understood 'a few' as like a dozen, certainly not in the thousands.

Anyone uninitiated with think the same, and thise already informed. Well, they are already informed.

ALittleLight · on Nov 30, 2020

I also don't know the field and the opposite concern is that 170,000 sounds like a lot, but, apparently, it's a relatively small amount compared to the number of proteins there are. It makes sense to me to refer to it as a small number - e.g. "That hard drive is tiny." "No, it stores several million bytes..."

btilly · on Nov 30, 2020

We can already determine how a few proteins fold by experimental work.

Where "a few" is around 0.1% of the known 180 million proteins. So a relative few and a whole lot.

But the catch is which proteins could we figure out by experiment, and which not. In particular membrane proteins are hard to experimentally determine. But knowing how they fold is very important for figuring out how to get things to react with or get through membranes such as cell walls. Which is an important problem for everything from understanding how viruses work to targeted delivery of drugs. We now have a way to find those structures.

FredFS456 · on Nov 30, 2020

There are post-translational modifications to proteins. This means that for many (most?) proteins, the amino acid chain sequence is different from what you would predict from the DNA. These modifications are dependent on the state of the cell at the time of translation, and so cannot be predicted from the DNA alone. Even with a 100% accurate folding model, we cannot simply know the shapes of all the proteins inside the human body based on the genome.

carlob · on Nov 30, 2020

Here is another interesting approach in synthetic protein building:

https://science.sciencemag.org/content/369/6502/440.abstract

ghostpepper · on Nov 30, 2020

This does indeed sound like a game changer then, if true

IgniteTheSun · on Nov 30, 2020

Considering that this system "uses approximately 128 TPUv3 cores (roughly equivalent to ~100-200 GPUs) run over a few weeks" to determine a single protein structure, making predictions for all proteins encoded in a human genome seems impractical at this stage. With luck, this advance will help lead to discovery and definition of new folding rules and optimizations that will make protein folding predictions for the whole human genome more tractable.

mrDmrTmrJ · on Nov 30, 2020

I think it is possible to make predictions for all proteins encoded in the human genome. Perhaps you misread a very long and confusing sentence?

Background, Neural networks have two modes 1) training - where you learn all the model weights and 2) inference - where you run the model once on new data. Training takes takes a long time, because you're computing derivatives to implement updates rules on millions or billions of parameters based on iteratively examining massive datasets. Inference is extremely fast because you're just running matrix multiplies of those parameters on new data. And TPUs/GPUs are specially designed to compute matrix multiplies.

The article said: "We trained this system [...] over a few weeks." I searched for, but did not see them identify the inference time. I do expect inference time to be well under one second, though I'm not personally experienced with running inference on this type of network architecture.

For comparison, GPT-3 and AlphaStar have month long training times and real-time (sub-second) inference times.