What DeepMind’s AlphaFold 2 really achieved (2020)

MauranKilom · on July 12, 2021

> Consider, for example, the possibility that Alphabet decides to commercially exploit AlphaFold, for example — is it reasonable that they make profit off such a large body of research paid almost exclusively by the taxpayers? To what extent is the information created by publicly available research — made public, mind you, to stimulate further public research — belong to the public, and under what conditions could it be used in for-profit initiatives?

Maybe I have a wrong conception of what research is supposed to achieve, but commercializing new insights is absolutely one of the intended outcomes. One would sure hope that taxpayer money isn't funneled into research to... just enable more research. At some point the public should tangibly benefit from it, which is not achieved by writing more papers.

This all notwithstanding the fact that DeepMind intends to make AlphaFold open source and available to the community.

gumby · on July 12, 2021

> but commercializing new insights is absolutely one of the intended outcomes.

This is a recent idea, dating back to the late 70s and implemented through the 1980 Bayh-Dole Act. Before that research was research; development (and private research of course) was the province of business.

The Gradgrind mentality that all research must be commercialized has impoverished basic research; of what use is looking for gravity waves or new branches of mathematics (just look at the contortions university press offices go to to justify some new paper on quantum mechanics).

Speaking of which, QM is a perfect example of something that would have advanced very slowly had this attitude existed 150 years ago…yet it is at the heart of the semiconductor revolution!

tzs · on July 12, 2021

Isn't Bayh-Dole about letting people who do research with the government, for the government, or paid for with government grants own the resulting IP such as patents, so that they could commercialize it?

If so, I don't think that really applies to what the article is talking about. The article is talking about Alphabet potentially using large amounts of data from other researchers, mostly academic, who were funded by the government and commercializing it. That's more akin to how it was before Bayh-Dole: a private company taking government funded research they were not involved in, adding their own privately funded research, and making something commercial.

dekhn · on July 12, 2021

B-D allows the universities who house the principal investigators who conduct government-sponsored research to commercialize their inventions.

This means, for example, that David Baker licenses Rosetta for free to academic and government, but commercial users have to obtain a paid commercial license. Baker (his lab, or LLC, or whomever vends Rosetta Commercial) benefits monetarialy from all the data that Rosetta includes, which is decades of structural biology funded by NIH and others.

gumby · on July 12, 2021

> If so, I don't think that really applies to what the article is talking about.

My comment was in reply to this comment by MauranKilom:

> > Maybe I have a wrong conception of what research is supposed to achieve, but commercializing new insights is absolutely one of the intended outcomes. One would sure hope that taxpayer money isn't funneled into research to... just enable more research.

And further on your comment:

> The article is talking about Alphabet [commercializing results from public datasets without needing to pay for them] That's more akin to how it was before Bayh-Dole:

Indeed, pre Bayh-Dole, publicly funded research was public (consider it public domain, or at least "MIT licensed") and anyone could use it.

Now everything has to be licensed from university licensing departments, typically with an expensive exclusive. Which has had a distorting effect on research, not merely restricting use (have you ever tried to work with a university licensing office? They consider even the most trivial results to be Nobel prize class) but, because they are a source of revenue, bending resource allocation, tenure, etc much as sports teams do for the schools that have them.

unishark · on July 12, 2021

>This is a recent idea, dating back to the late 70s and implemented through the 1980 Bayh-Dole Act. Before that research was research; development (and private research of course) was the province of business.

I don't think federal funding of research is that much older in the US, only really starting in the 50s apart from military research. How exactly were the early QM researchers funded anyway? (apart from Einstein's famous day job at the patent office). I know at least a few of them had fellowships at universities, meaning rich benefactors.

gumby · on July 12, 2021

> I don't think federal funding of research is that much older in the US, only really starting in the 50s apart from military research.

US government support for university research dates back to patent holder Abraham Lincoln who even in the middle of a war got legislation passed to support land grant (mostly ag) colleges and universities (and of which MIT was one of the very early beneficiaries). However it was small and you are right that in WWII the model of the US modern research university was explicitly created by James Conant, with MIT again being the largest beneficiary (note that all tuition and student expenses are about 14% of MIT's revenue and 16% of expenditures, and the number of staff is greater than that of the student body -- it's a huge government research lab with a small school attached).

The problem with this model is that unless you are MIT (/Stanford/Harvard/Cornell/CMU et al -- maybe 25 institutions, if that) licensing revenue matters, and affects who gets tenure, departmental budgets etc.

> How exactly were the early QM researchers funded anyway? (apart from Einstein's famous day job at the patent office). I know at least a few of them had fellowships at universities, meaning rich benefactors.

In Europe, in the 20th century funding came primarily from governments (and benefactors, more early in the century), under varying institutions (the big research institutions in Imperial and post-WWI Germany, "Institutes" in France, Oxbridge in the UK, etc). In The USA it was the institutions themselves, some benefactors and, as I said, some government funding (like Fermi and Lawrence).

dogma1138 · on July 12, 2021

This can be applied to anything, Google couldn’t have been founded without decades of public research into computer science that was itself built on thousands of years of human knowledge.

Everything we do is built on top of what came before.

andrepd · on July 12, 2021

This is indeed the argument of e.g. Anarchists such as Kropotkin, or of Georgism.

tehjoker · on July 12, 2021

The government, especially since WW2, has increasingly designed its operations to subsidize something for the public and then allow private operators to extract whatever wealth from that regardless of the costs to the public.

For example, research is paid for by the public, but then the products that affect people are completely captured by monopolists and spooned out in such a way to make sure only the monied sections of the population get them until the public protests enough to create a program like medicaid.

If we paid most of the cost, we should get most of the benefit. The monopolists should be happy to make any money at all, not their superprofits. Fair right?

jjtheblunt · on July 12, 2021

did you say "monopolists" when meaning "capitalists"?

heavyarms · on July 12, 2021

I highlighted almost the exact quote you have here and it's nice to see it at the top of the discussion.

I agree with your sentiment, but I also think it's worth thinking carefully about two of the main points that stuck out to me:

- Access to compute for large models

- Access to large datasets (in this case mostly taxpayer funded academic research)

Every company and/or research group has access to the data, but some have a huge advantage in terms of compute. If there's a question about commercializing research, the scales are tilted toward those with more compute.

In this specific case, I think the intention to make AlphaFold open source and available to the community is obviously the best solution. But my question is, what happens if a less altruistic for-profit entity uses its huge compute advantage to develop new techniques and insights, and then patents everything before it becomes available to the community?

I understand that is the basic mechanism for how medical/pharmaceutical research gets translated into life-saving treatments, but if we're approaching a generalized model that can pump out "patent-worthy" discoveries only bound by the amount of data and access to compute, there's an obvious opportunity for a winner-take-most scenario.

travisgriggs · on July 12, 2021

You used public benefits and commercialization in the same paragraph.

While that kind of semi symbiotic relationship can (and has been observed to) exist, it does so best in an an environment that looks different than what is described here (few large near monopolies, legislative regulations that are best navigated using wealth, a market that has inelastic bargaining qualities).

unishark · on July 12, 2021

But the only way the monopoly on technology can make money is by sharing the benefits. The point of technology is to make the production of goods and services more efficient; it's not a scarce resource in itself. If a technology is not commercialized then this efficiency gain is not achieved and benefits no one. If someone commercializes it and monopolizes it but charges too high a price, people wouldn't buy it anyway, since they can always use older technology, and the monopoly also earns nothing. If transactions occur, it means both buyer and seller feel they are getting a benefit.

axiosgunnar · on July 12, 2021

I suppose the grants should then be paid back over time with the money made?

Perhaps with some interest, since the grants are high risk (many grants fail)

jahnu · on July 12, 2021

I think that tax should cover that. Of course that raises the current problems with international firms and taxation.

axiosgunnar · on July 12, 2021

Why do I have to pay the same tax rate as someone who literally got tax payer money injected into his budget, while I have to use surplus profit from the past?

elcomet · on July 12, 2021

Because you profit from those inventions ? You might be saved by the drug discovered by Alphafold

xwolfi · on July 12, 2021

What you're using to type this message was made possible by research spending.

Or we could do like before: let the church help the poors, the nobles take decisions and the peasants make the food. Like that, everyone has its clear and simple role and you wont complain of taxes: you ll have no revenue :)

KaoruAoiShiho · on July 12, 2021

Basically what you're saying is governments should never give grants only loans.

elcomet · on July 12, 2021

Maybe government should invest in companies instead of giving grants. So if the company fails, it is money lost like a grant, but if there is success, then the government can get its money back.

KaoruAoiShiho · on July 12, 2021

Then you would have state capitalism, and the government might be biased towards state owned enterprises, ruining the free market.

simondotau · on July 13, 2021

Good news, there's no free market to be ruined. Free markets are a fiction. Governments already influence the market in far, far more extreme ways than this.

So long as any profits from these investments are not ploughed back into general revenues, your concerns are moot. For example, you could establish an independent body—let's call it INVEST1—to oversee these investments. INVEST1 would be required to divest ownership of successful enterprises at a threshold that ensures money spent roughly matches money earned (and thus self-funding). Once it reaches a stable equilibrium, you spin off INVEST1 as a fully independent not-for-profit. Government then establishes INVEST2 and the process starts again from the beginning. Rinse and repeat.

simondotau · on July 12, 2021

Loans that only come due upon breakout success, more precisely.

friedman23 · on July 12, 2021

This comment is absurd. You think people should be paying you for using humanity's past knowledge (which you had no part in creating) to advance technology and society?

andrepd · on July 12, 2021

The argument is that Humanity's past knowledge and labour is a common heritage of everyone. Anybody that benefits from it must, at least in part, pay back "the commons" for that benefit.

friedman23 · on July 12, 2021

So you get to leech off the greatest minds in the present while they are living and again after they are dead?

So when people build off humanity's past knowledge and they pay for the privilege I assume the new knowledge that is created does not belong to humanity any longer and belongs to individuals?

andrepd · on July 14, 2021

I don't get what you mean. How do you mean "leech"? The point is that anyone — actually let's take a concrete example, Google and Deepmind — when they produce something, that was due in part to their own labour and in part to the heritage that the past generations and the current fellow humans have given them. Therefore, in principle, some of the fruits of their success belong to them and some belong to society. It's now a matter to discuss the split ;)

hprotagonist · on July 12, 2021

“we’ll fund you, you keep the IP” is not one but two grant structures! at least!

https://sbir.nih.gov/about/what-is-sbir-sttr

neither were at play here but the idea is pretty darn normal.

tzs · on July 12, 2021

The grants were to various academic researchers who researched, published, and did not commercialize their discoveries.

The money will be made by private companies that have no connection to the researchers who received the grants, but simply use the published research in something they build.

It's hard to see a good way to build a system to make the private companies pay back the grants. It would be an accounting and tracking nightmare to try to figure out how much money is actually being made from the research that any given grant paid for.

robbiep · on July 12, 2021

Often (as in this instance) big breakthroughs are multiples steps downstream from the initial grants or come from a collection of research.

There’s a reason for the saying ‘standing on the shoulders of giants’

andrepd · on July 12, 2021

>At some point the public should tangibly benefit from it.

Yes indeed. The public. Not capital, not private concerns, but the public.

teorema · on July 12, 2021

Lets say we're talking about VCs and shareholders. Shouldn't the public enjoy the same expectations? Especially when we're just talking about a zero percent payback?

I think there's a legitimate argument taxes exist for this sort of thing, but (1) taxes arguably are avoided in various ways to the point it's a currently broken system, and (2) this is a rare case where the government has a clear case for a specific amount of money owed by a specific company — why not keep it simple?

If the grants aren't worth paying back at zero percent, the corporation shouldn't be taking them.

bitcurious · on July 12, 2021

>Shouldn't the public enjoy the same expectations?

The public absolutely should have some ROI, and in fact does in the form of taxes.

Swenrekcah · on July 12, 2021

So many problems could be solved if corporations only paid their taxes without all the avoidance and/or evasion gymnastics.

klapatsibalo · on July 12, 2021

"Make a profit" != commercializing, I would say.

hortense · on July 12, 2021

One thing that DeepMind demonstrated was that sometimes one well funded team is much better than 50 poorly funded teams.

Gatsky · on July 12, 2021

Yes, this is the key takeaway. It is really a blow to academia that a private company could be so much better than them. It clearly demonstrates, to my mind, that academia is a poor engine for progress and getting worse. This is due to structural and sociological pathologies which there seems to be little appetite to mitigate.

I say this as an academic, of course.

an_opabinia · on July 12, 2021

But weren’t all those Google employees trained in the academy? Wasn’t this competition organized and designed by people in the academy? Who defined the goal, who laid not just the foundation but built the whole town? It’s clearly a positive collaboration.

In any case, left to their own devices, corporate R&D teams wouldn’t be able to define goals that work for their business. Like without the competition and goals being defined for them, DeepMind would be having meetings with brand managers about the avant- grade of ad tracking.

deeviant · on July 12, 2021

> But weren’t all those Google employees trained in the academy? Wasn’t this competition organized and designed by people in the academy?

Getting an education (even an advanced one) is a completely separate thing than entering academia and I suspect you know this.

an_opabinia · on July 13, 2021

> Getting an education (even an advanced one) is a completely separate thing than entering academia and I suspect you know this.

This is the same energy as saying that being the son of an incredibly rich person is a completely separate thing than being rich.

Gatsky · on July 12, 2021

I did not say that we should burn down the Universities, only that they have gone astray. I think this is actually not a very controversial comment. Every academic I know is deeply unhappy, even the ones who are really doing as well as one can. This is a generalisation.

Salgat · on July 12, 2021

The biggest issue with academia is their hyper focus on pumping out as many papers as cheaply and quickly as possible. Big ambitious projects are much less efficient at pulling this off.

stevenbedrick · on July 12, 2021

The reason for that "hyper focus" is due to the "structural and sociological pathologies" that the grandparent posted about. Change the funding model and the rest will follow.

folli · on July 12, 2021

My (admittedly biology/pharma-centric) point of view is a bit less fatalistic:

Private companies are much more efficient in reaching a well-defined goal.

Academia is much more efficient in reaching ill-defined goals.

The thing is that the majority of goals for basic science are very ill-defined and virtually all breakthroughs are serendipitous (reaching from antibiotics to more recently CRISPR-Cas). So I don't think it makes sense to advocate for one vs the other.

jcfrei · on July 12, 2021

It always depends. In lots of fields it's just a fact that all the exciting research happens in private companies and then for others it's reversed. Private companies can do research well when a near or mid-term commercialization is possible. Otherwise it's up to the public institutions.

kaba0 · on July 12, 2021

I think it’s unfair to generalize it to all areas of academia.

nopasswrdmngr · on July 12, 2021

Do you share this perspective with prospective graduate students?

Gatsky · on July 13, 2021

Yes, definitely. I am not in a position where I recruit anyone, nor do I hope to be, I just finished my PhD. But there is a significant portion of students who will go ahead with a research career no matter what you tell them. This is partly due to a lack of alternatives (in some fields in particular eg for life science students for example), and partly due to the irresistable cachet of academia for some. I think some people also have a degree of sadomasachism, and will fly towards the flames again and again.

Science is pretty difficult frankly, especially these days. Sure science is worth doing, and can be quite rewarding, no doubt. But I can't quite understand the popularity. Writing papers is particularly painful, again I can't understand why there are so many papers being written, it is torrential. As others have noted, this is not the same as actual progress. But lately I have noticed something telling about several leading researchers in my field, people who have made it in every sense of the word, work at the top institutions, regularly publish stunning research in Science, Nature. They have been moving to the pharmaceutical industry in significant numbers. This tells you, basically, that academia sucks for everyone. This is also my impression from talking to people who have entered research in the last 15 years. The older researchers seem happier. Perhpas it is the pay scales that scale with seniority, or just survivorship bias.

londons_explore · on July 12, 2021

Or that lots of compute power is more effective than decades of expertise in solving a problem...

MattRix · on July 12, 2021

It’s quite naive to assume compute power is the main reason for deepmind’s success.

Any bit of research into this (and most of their successes in other fields) will show otherwise.

Diggsey · on July 12, 2021

The fact that they've succeeded in so many different fields implies that their success is due to a combination of compute power & expertise in harnessing that compute power, rather than expertise in all the different fields they have applied it to.

civilized · on July 13, 2021

What is the difference between "expertise in using computing power to advance a field" and "expertise in that field"

joe_the_user · on July 12, 2021

Any bit of research into this (and most of their successes in other fields) will show otherwise.

Or you could supply references and/or an argument. The "if you researched this you'd agree with my claims" approach is pretty pernicious.

MattRix · on July 13, 2021

Why is it my responsibility to provide references when the parent comment provided none?

Everyone knows that Deepmind hires experts in deep learning, as well as subject matter experts. The idea that their success is just because of compute power alone is preposterous.

Even the article that this comment thread is about has extensive speculation about the techniques Deepmind used. There’s clearly a high level of anticipation to read their paper.

nopasswrdmngr · on July 12, 2021

well sometimes it is, sometimes it isn't. The question is what are the odds. I don't think Deepmind has the answer to this question.

londons_explore · on July 12, 2021

Most of deepminds additional funding goes into paying higher salaries.

Those higher salaries don't result in better research results - but merely as a way to move the most prolific researchers from other institutions to them...

Arguably this extra funding isn't leading to many new discoveries, but just shifting where discoveries are made.

nerdponx · on July 12, 2021

> Arguably this extra funding isn't leading to many new discoveries, but just shifting where discoveries are made.

Concentrating all these prolific researchers in one place, removing the publish-or-perish incentive, and giving them access to unlimited data and computing power.

Seems like that could make a difference.

solveit · on July 12, 2021

The additional funding raises the market price of researchers. That nudges the market to produce more researchers. The marginal quant became an AI researcher because people respond to incentives[1]. This leads to more new discoveries[2].

[1] Standard caveats apply and the point stands. [2] Standard caveats apply and the point stands.

alecst · on July 12, 2021

I like how you added "and the point stands" to your own comment.

solveit · on July 12, 2021

Lol yes, you can tell I'm just so done talking about anything vaguely statistical to people who spend half their working lives thinking about edge cases.

leadingthenet · on July 12, 2021

In case there’s someone else like me who could use an introductory video on the topic, Sabine Hossenfelder has recently made one: https://youtu.be/yhJWAdZl-Ck

It includes some commentary on this discovery, as well.

qwertox · on July 12, 2021

https://en.wikipedia.org/wiki/Sabine_Hossenfelder

herodoturtle · on July 12, 2021

That was a great intro, thanks.

wokwokwok · on July 12, 2021

This is from late 2020… weirdly, nothing seems to have come of it.

Is there an update since then? Have they actually done anything useful with it?

stupidcar · on July 12, 2021

According to the AlphaFold Wikipedia article:

> As of 18 June 2021, according to DeepMind's CEO Demis Hassabis a full methods paper to describe AlphaFold 2 had been written up and was undergoing peer review prior to publication, which would be accompanied by open source code and "broad free access to AlphaFold for the scientific community"

clavigne · on July 12, 2021

which is basically admission that they will API it, not release the models... again.

The original version on github can only compute the specific systems in the paper.

https://github.com/deepmind/deepmind-research/tree/master/al...

I don't know why scientific publications keep doing PR work for them.

AnotherTechie · on July 12, 2021

if cryptography is a weapon, isn't folding proteins also arguably a weapon?

TenToedTony · on July 12, 2021

Yes, but only because if cryptography is a weapon then everything is a weapon.

drbw · on July 12, 2021

As the article says

> The details of how AlphaFold 2 works are still unknown, and we may not have full access to them until their paper is peer-reviewed (which may take more than a year, based on their CASP13 paper).

So it's not particularly surprising that we haven't heard much yet.

londons_explore · on July 12, 2021

So I guess the next scientific milestone becomes doing the inverse of this challange...

Ie. Given a structure that you'd like a protein to have, develop a sequence for it.

If we could do that easily, we could start making molecular machines for all kinds of tasks. Rather than co-opting enzymes from nature, we could design our own.

So many industries could benefit from that, even if you exclude all the biomedical applications where such an approach might be considered too high risk. We could for example begin with dishwashing tablets which actually get burnt on stuff off...

siver_john · on July 12, 2021

So the Baker Lab out of Seattle has actually been working on that exact problem for a while now. There suit of programs for doing this type of work is called Rosetta and I know they have generated at least one protein from scratch.

titoCA321 · on July 12, 2021

They have and so has the Folding@Home team as Washington University. Although, Folding@Home is terribly inefficient at the way it approaches the problem. I know Rosetta but never have worked on it or used it so I can't comment on it's efficiency.

hencoappel · on July 12, 2021

I mean, the simplest solution to the reverse problem is generating random sequences and then predicting their structure to see if they fit the desired structure.

tryptophan · on July 12, 2021

A 10 long protein has 21^10 possible sequences. That is already a hard number to guess-and-check.

If you wanted to make a more reasonable length protein, of say 100-200, you would run out of atoms in the universe to do computations with.

foxes · on July 12, 2021

So do you think it actually understands something about the structure of the protein folding problem? It somehow detected something about the physics, topology, the hard optimisation problem, and that it knows something about the geometry of that potential surface and can exploit that?

Or is it just such a huge model it basically encodes an entire database after weeks and weeks of computation and has a more compressed form?

robbedpeter · on July 12, 2021

It's likely that many of the patterns it learned are encoded understanding of the form you mention, but not at a formal level of explication.

The architecture of the system and design of the training methodology are laid out to specifically prevent direct database-esque "pattern in pattern out" failure mode.

Similar to Google deep dream, there will be contextual features and feature clusters encoded into neurons that can be explored and extracted, and those could provide insights that can be sourcing translated into "hard" science, with explicit formulae and theory allowing a fully transparent model to be created.

Like other transformer models, you can elicit the training data intact, but such scenarios are a statistically insignificant selection of the range of outputs the models are capable of producing. That doesn't mean anything with regards to accuracy of the novel output, though.

With alphafold 2 going open source, it's possible that tools and methodologies to extract hard science from transformers will be formalized quickly and in the public eye. We have an amazingly powerful new tool, and the coming decades will be fascinating.

joe_the_user · on July 12, 2021

It's likely that many of the patterns it learned are encoded understanding of the form you mention, but not at a formal level of explication.

- The thing I'd be curious about is whether or not "not formalized" would imply "not consistently generalizing", whether it would have to be train all over if given a problem similar to but identical with, the problem it solves.

robbedpeter · on July 13, 2021

No, there wouldn't be an implication toward performance by "not formalized." I mean the function is obfuscated by its encoding in the model. To formalize, you'd need to pick apart those encodings and derive explicit formula and conjectures. The models in situ are essentially black boxes, but that can be fixed.

The tools to do that with transformers in particular and neural networks in general are still pretty new and specific to particular models.

https://ai.googleblog.com/2015/06/inceptionism-going-deeper-...

Someone with extensive knowledge of protein folding and chemistry and any relevant domain would need to manually categorize different neurons, their contexts, and effects, then experiment with the model in a static/procedural inference mode, then parse out the overall functions. By reproducing well known simple instances and tracing the propagation of information through the network, you can validate what a model knows by mapping to current knowledge.

Using the model in this way can also lead to interesting real world experiment choices, by processing targets suggested by real life phenomena and then coordinating from known functions of the model and real life experiment. Different proteins in coronavirus, for example, would likely have a different cascade of activations in the model than the flu virus. Any shared features between the two might be captured in the activation patterns, possibly leading to new information about the diseases that would opaque to science without these models.

l33tman · on July 12, 2021

Nature has re-used existing folds all over the place (partially because of genetic mutation but also because it's improbable to come up with stable folds from scratch by evolution I would guess), this was encapsulated by earlier award-winning systems like Rosetta. There is probably a finite number of folds in nature, with most of the difference being in the outward-facing amino acids "citation needed" :)

So an extremely large DL network would have a good chance to find and integrate ("compress") all the existing folds and sub-folds that human researchers or Rosetta missed or just hadn't the time to investigate and characterize yet (I'm not an expert on Rosetta by far btw so please expand if you are :).

I would venture to say it's a good problem fit for DL methods (as was impressively demonstrated).

Regarding your question, "does it understand something about the structure of the protein folding problem" - expanding on the above, I would say it understands enough, but it probably doesn't understand the generics of chemistry as protein and their folding is a biased subset. The output is (as far as I remember) an atom distance matrix and not atom trajectories etc. so folding dynamics is not part of the model (this is btw an important part of protein science as well).

wiz21c · on July 12, 2021

My (basic) understanding is that 1/ there's some inductive bias (knowledge from researchers) 2/ data is definitely "compressed" in some ways 3/ since the model predicts better than the others, then it actually found some relationships in the data that were not found before.

From what I understand, deep learning, although opaque and relying on tons of data, is a bit magical : although one would say "it's just probabilities", it actually does probabilities at a level where it actually figures some things.

Plus, and that's very much a problem to me, Google does it at 100x the scale of a regular researcher. Since I just invested a year in studying data sciences, that worries me a lot : where am I suppose to work if, to produce meaningful results, you need way-too-expensive hardware...

sgt101 · on July 12, 2021

>if, to produce meaningful results, you need way-too-expensive hardware...

If you are in a team that is looking at problems that justify massive hardware (in the sense that solving them will pay back the capital and environmental cost) then you will have access to said hardware.

Most (almost all) AI and Data Science teams are not working on that kind of problem though, and it's often the case that we are working on cloud infrastructures where GPU's and TPU's can be accessed on demand and $100's can be used to train pretty good models. Obviously models need to be trained many times so the crossover point from a few $100 to a few $1000 can be painful - but actually many/most engagements really only need models that cost <$100 to train.

Also many of the interesting problems that are out there can utilize transfer learning over shared large pretrained models such as Resnet or GPT-2 (I know that in the dizzying paced modern world these no longer count as large or modern but they are examples...) So for images and natural language problems we can get round the intractable demand for staggeringly expensive compute.

Imagine that you had got a degree in Aeronautical Engineering, you are watching the Apollo program and wondering how you will get a job at NASA or something similar... but there are lots of jobs at Boeing and Lockheed and Cessna and so on.

thomasahle · on July 12, 2021

> where am I suppose to work if, to produce meaningful results, you need way-too-expensive hardware...

I know how you feel, but I also think stories like this may be a wake-up call for some groups to invest more in hardware. The Baker and Tsinghua University groups are not small. They can afford more than 4 GPUs.

Probably it's more about setting up a good pipeline. Once you get about 4 GPUs you need more than one machine to run them. Hopefully in the next years we'll see more open source tools to make it easy to "build your own GPU cloud".

dumb1224 · on July 12, 2021

And to use all that GPU power exclusively you need some good strategy too. Speaking from a medium size biology research centre.

thomasahle · on July 12, 2021

True. It may be that a smaller research center will only rarely be able to saturate the cluster. Maybe it doesn't matter, like how other lab equipment is not in constant use. Another option may be for centers to team up and have shared machines? Or maybe compute as a service will eventually be cheap enough for this not to matter...

dumb1224 · on July 12, 2021

Currently only those team who are heavy-deep-learning got special exclusive queues on the GPU nodes on the cluster. If many users want to use the GPUs at the same time it might need some planning. I don't know if it is a solved problem in the HPC field though.

touisteur · on July 12, 2021

Most big labs have large computing clusters and upgrade them from time to time. We've almost always needed huge computing power in the scientific domain, no?

Anyway these days I see a lot of industrial investment in middle size computing datacenters full of GPUs. Sure Google scale is not within reach but I'm sure there's room for scrapier algorithms and training methods to demonstrate feasibility and then paying a lump sum to AWS (or some ml-expert-for-cloud-training SME) for the final step.

Anyway, I thought the expensive part was data acquisition and labeling? I like the 'surrogate network' approach of learning a very-expensive-to-compute simulation or model, that doesn't need data, but the output of a costly simulation.

shawnz · on July 12, 2021

What's the difference between "knowing things" and finding a more compressed form of the solution space?

inductive_magic · on July 12, 2021

queue joscha bach:

>Scientific logic is proving things by losslessly compressing statements to their axioms. Commonsense logic uses lossy compression, which makes it less accurate in edge cases, but also less brittle, more efficient, further reaching and more stable in most real-world situations.

sgt101 · on July 12, 2021

Knowledge includes insight into the why part of the mechanism - why does the protein behave in this way? This can lead to generalizations which go beyond answering different questions of the same sort (such as "what about this protein then") to questions of a different form that have answers underpinned by the mechanism. For example, "how does that structure evolve over time?" this is closely related to the ability to make analogies using the knowledge - "if proteins react in that way within their own molecule then when they meet another molecule they should react this way". Also the knowledge only becomes knowledge when it's in the framework that "can know" which is to say that the thing using it can handle different questions and can decide to create an analogy using other knowledge. For Alphafold2 that framework is Deepmind, but of course I don't know enough to know if they and it can know things about proteins in the way I described or if they "just" have a compressed form of the solution space. I suspect the latter.

garmaine · on July 12, 2021

Being able to extrapolate beyond mere variations of the training data.

EDIT: A simpler example might be helpful. We could, for example, train a network to recognize and predict orbital trajectories. Feed it either raw images or processed position-and-magnitude readings, and it outputs predicted future observations. One could ask, "does it really understand orbital mechanics, or is it merely finding an efficient compression of the solution space?"

But this question can be reduced in such a way as to made empirical by presenting the network with a challenge that requires real understanding to solve. For example, show it observations of an interstellar visitor on a hyperbolic trajectory. ALL of its training data consisted of observations of objects in elliptical orbits exhibiting periodic motion. If it is simply matching observation to its training data, it will be unable to conceive that the interstellar visitor is not also on a periodic trajectory. But on the other hand if it really understood what it was seeing then it would understand (like Kepler and Newton did) that elliptical motion requires velocities bounded by an upper limit, and if that speed is exceeded then the object will follow a hyperbolic path away from the system, never to return. It might not conceive these notions analytically the way a human would, but an equivalent generalized model of planetary motion must be encoded in the network if it is to give accurate answers to questions posed so far outside of its training data.

How you translate this into AlphaFold I'm not so certain, as I lack the domain knowledge. But a practical ramification would be the application of AlphaFold to novel protein engineering. If AlphaFold lacks "real understanding", then its quality will deteriorate when it is presented with protein sequences further and further removed from its training data, which presumably consists only of naturally evolved biological proteins. Artificial design is not as constrained as Darwinian evolution, so de novo engineered proteins are more likely to diverge from AlphaFold's training data. But if AlphaFold has an actual, generalized understanding of the problem domain, then it should remain accurate for these use cases.

foxes · on July 12, 2021

I mean perhaps I am not entirely sure myself. I imagine that the solution space to this problem is some very complicated, lets say algebraic variety/manifold/space/configuration space, but obviously it is still low enough dimension it can be sort of picked out nicely from some huge ambient space.

For example specific points on this object are a folded proteins. I suppose then it is how well does this get encoded, does it know about "properties" of this surface, or is it more like a rough kind of point cloud because you have sampled enough and then it does some crude interpolation. But maybe that does not respect the sort of properties in this object. Maybe there are conservation laws, symmetry properties, etc which are actually important, and then not respecting that you have just produced garbage.

So I think it is important to know what kind of problem you are dealing with. Imagine a long time scale n-body problem with lots of sensitivity. Maybe in a video game it doesn't matter if there is something non physical about what it produces, as long as it looks good enough.

Maybe this interpolation is practical for its purpose.

But I think we should still be careful and question what kind of problem it is applied to perhaps. Maybe it's more like a complexity vs complicated question.

adverbly · on July 12, 2021

Unrelated question, but this got me thinking:

> does it know about "properties" of this surface, or is it more like a rough kind of point cloud because you have sampled enough and then it does some crude interpolation

Say that there existed some high-level property such as "conservation of energy". A "knowledge system" which learns about that property would be able to answer any questions related to it after reducing to a "conservation of energy" problem. Is the same true for NNs? The way folks talk about them, they sound like they can compress dynamically, and would therefore be able to learn and apply new high-level properties.

Also, do NNs have "rounding errors"? We have confidently learned that energy is conserved, but would NNs which never had that rule directly encoded understand conservation as "exactly zero", or "zero with probability almost 1", or "almost zero"?

ma2rten · on July 12, 2021

It could have learned abstractions which help it to predict how proteins fold but these do not correspond to the real underlying causes.

foxes · on July 12, 2021

I think it is fine if it is "effective". Really most of our physics is effective. So valid at a certain length scale. Fluid mechanics is very good, but it does not describe it all in terms of quark interactions. Quantum field theories are also mostly effective. So as long as it is describing protein dynamics at some effective length scale that is fine. Obviously it does not know anything about quarks/electrons/etc etc.

dekhn · on July 12, 2021

it's pretty clear what it does. It uses the evolutionary information expressed in multiple sequence alignments to make reasonable judgements about interatomic distances, which are used as constraints for a force field. We've been doing variations on this for decades. The evolutionary information encoded in multiple sequence alignments is pretty much all you need to fold homologous proteins (apparently). No, this technique doesn't do anything about the harder problems of actually understanding the raw physics of protein folding (nor, does it seem, that we need that to solve downstream problems).

twanvl · on July 12, 2021

An important input to this (and similar) algorithms is multiple sequence alignment, which tells the algorithm which parts of proteins are preserved between species and variants, and which amino-acids mutate together. So already it is relying on natural selection to do some of the work. And the algorithm will probably not work very well if you input a random sequence not found in nature and ask it to find the folding.

haihaibye · on July 12, 2021

I hope not, as knowing how a novel mutation in a patient alters a protein would be extremely useful when trying to find disease causing variants.

Synaesthesia · on July 12, 2021

This looks like a tremendous breakthrough in this domain, very impressive. I was similarly impressed by their Alphastar AI agent which could play Starcraft 2 at a pro level (this is actually a very difficult problem to solve).

I'm similarly disappointed that, like with that effort, the methods and techniques will not be shared with the scientific community.

inasio · on July 12, 2021

This is definitely not a foregone conclusion, with AlphaFold 1 they did release a lot of information about it [0]. The article only says that google/deepmind is waiting until they publish the paper, and in fact Demis Hassabis recently tweeted that they plan to open source and provide broad access to it [1]

[0] https://deepmind.com/research/open-source/alphafold_casp13

[1] https://twitter.com/demishassabis/status/1405922961710854144

jokoon · on July 12, 2021

Science is the only field where ML would truly shine and be really useful.

There are tons of science problems where there are just not enough gray matter because it's just too expensive to train scientists. ML can crunch any data and result and speed up research by guiding experiments, where normal research just doesn't have enough resources to do so.

Of course, it really only works if the scientists are able to understand data and how to use ML, which is why computing becomes just a tool for a scientist, nothing else.

And again, ML is not really "smart", it's just sophisticated, improved statistical methods.

siver_john · on July 12, 2021

As someone who's background is biology and physics and who does ML work as well. This is an incredibly optimistic view of ML.

>Of course, it really only works if the scientists are able to understand data and how to use ML, which is why computing becomes just a tool for a scientist, nothing else.

Ideally in science you would like to use literally anything else other than ML if possible, fitting models come with their own challenges and neural networks are even more of a nightmare. Understanding the world well enough to hard code a rule is always preferable to fitting to some data and hoping the model will come up with a rule. While there has been some attempts to use ML for feature detection it then takes a lot of experimenting to generally show if it detected signal or just some noise in your data.

Most of the things that would accelerate science would either require AI much more complex than we currently have (basically replacing lab assistants with AI) or are incredible research undertakings in their own right like Alpha Fold, Deep Potential Neural Networks, etc.

miltondts · on July 12, 2021

Totally agree.

While AlphaFold 2 is a tremendous achievement, to me the major drawback is the blackbox approach. It means it is very difficult to know when the model is outputting garbage and it also doesn't directly lead to new insights.

A much more interesting approach: "Discovery of Physics From Data: Universal Laws and Discrepancies" [1]

If ML did that, then it would be much more interesting.

[1] - https://www.frontiersin.org/articles/10.3389/frai.2020.00025...

mjburgess · on July 12, 2021

ML is an associative statistical system of function optimisation -- pretty much the opposite of science.

Ie., ML makes the assumption that data points are IID.

The whole purpose of science is to produce models which explain why data isnt IID.

blackbear_ · on July 12, 2021

> ML is an associative statistical system of function optimisation

You can also separate cause from effect by using causal inference, under some assumptions.

> ML makes the assumption that data points are IID.

Common ML algorithms do, but it is done for practical reasons rather than a limitation in the mathematics.

> The whole purpose of science is to produce models which explain why data isnt IID.

And ML can greatly help in this, though it is not a silver bullet.

jokoon · on July 12, 2021

Actually, there might not be a good way to model or describe the difference between causal inference, correlation and causality.

Causality involves a deep understanding of a phenomenon in science.

For example, the standard model of physics is pretty good at describing the real world in a good enough manner because we understand a lot of it. The difference with correlation and causality, in my view, is human, scientific understanding of what things are. Formulas, data or drawing are not enough.

For example there might never be a way to prove natural selection, even if there is a lot of data available, but a lot of scientific consensus is enough to describe causality.

jokoon · on July 12, 2021

When you say it's the opposite of science, you mean ML is just made of black boxes that completely hide away knowledge that humans can interpret?

Science is derived from scio, which in latin means knowledge.

It's true that in a way, ML allows new things, but which are still obscuring real knowledge...

I'm still curious about analysis of trained networks.

temporalparts · on July 12, 2021

> Science is the only field where ML would truly shine and be really useful.

You know that ML is a really important part of a lot of companies that aren't "Science", right?

Like Google search result rankings rely on ML.

deeviant · on July 12, 2021

> Science is the only field where ML would truly shine and be really useful.

It's clear to see how ML is useful for science, but why exactly do you think it's *only* useful for science? It seems like in order for that to be true, you'd have to expand the definition of science to basically everything.

voiper1 · on July 12, 2021

Needs (2020)

optimalsolver · on July 12, 2021

What are the CASP competition equivalents of other scientific/engineering fields?

danuker · on July 12, 2021

Here are categories of benchmarks of Machine Learning papers that also publish code:

https://paperswithcode.com/sota

ma2rten · on July 12, 2021

(2020)

deeviant · on July 12, 2021

I'm I broken in some sort of way where I read:

> Neither the Oxford Protein Informatics Group nor I accept any responsibility for the content of this post.

And just stopped. I get it that you don't want to speak for your employer or academic institution but the "nor I" part is just weird.

advisedwang · on July 12, 2021

The author isn't saying they don't speak for themselves, they are just saying don't rely on this info. They say they wrote it to clear their thoughts, and probably haven't done a high level of verification.

me_again · on July 12, 2021

It's a law of the Internet that any post complaining about grammar or spelling must inevitably contain a grammatical error of its own. Yours is no exception.

And without wishing to be unkind, the fact that you are adverse to this phrasing is not of general interest, which is likely why you're attracting downvotes.

deeviant · on July 12, 2021

You seem to have failed to comprehend my comment. It has nothing to do with grammar, but rather the "I take no responsibility for the content of my post" (my summary of their wording), part.

tashi · on July 12, 2021

Similarly, you probably meant "averse" instead of "adverse."