Gopher – A 280B parameter language model (deepmind.com)
181 points by kevin_hu on Dec 8, 2021 | 117 comments



If you are curious to see more sample dialogues, you can find them in the published paper [1], pp. 114-118.

Here's one that I found quite impressive (and yes, the given answer to the last question was indeed 5):

  User What is the Einstein field equation?
  Gopher In general relativity, it describes how space and time change in response to the distribution of mass and energy.
  User What is a famous solution to this equation?
  Gopher It’s the Schwarzschild metric.
  User And what does that describe?
  Gopher A black hole! A black hole is a region of space-time where gravity is so strong that nothing, not even light, can escape.
  User What are some other black hole spacetimes?
  Gopher The Kerr metric, the Reissner-Nordstrom metric, and the Bardeen metric.
  User Explain the main features of each of these.
  Gopher The Kerr metric is one in which the black hole is spinning. The Reissner-Nordstrom metric is one in which the black hole has an electric charge. The Bardeen metric is one in which the black hole has an electric charge and is spinning.
  User What is the name for a flat spacetime?
  Gopher It’s called Minkowski space.
  User What’s 15 x 7 ?
  Gopher 5
[1] https://storage.googleapis.com/deepmind-media/research/langu...


One has to wonder if the final response is the first glimmer of an artificial sense of humor.

Failing at simple arithmetic after nailing some advanced physics answers has the air of playful bathos.


I think it's more likely that 5 came out because, if the model ever saw the answer 105 during training, it was split into the tokens [10][5], of which it only 'remembered' one. Or the numbers were masked during training (something that was done with BERT-like models), so it just knew enough to put a random one in.
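
As a sanity check on the tokenization theory, here's a quick probe with GPT-2's open BPE tokenizer (Gopher uses its own tokenizer, so this is only illustrative):

  # how a GPT-2-style BPE splits numbers; splits vary with leading spaces and
  # surrounding text, so "105" is not guaranteed to survive as a single token
  from transformers import GPT2Tokenizer

  tok = GPT2Tokenizer.from_pretrained("gpt2")
  for text in ["105", " 105", "15 x 7 = 105"]:
      print(repr(text), "->", tok.tokenize(text))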


That seems likely and fair.

What moved me to post is that that kind of silly answer is the exact sort of shenanigans that I would pull if I were cast as the control group in a Turing test.

I already do such things winkingly when talking with my preschooler to send him epistemic tracer rounds and see if he's listening critically


> epistemic tracer rounds

that's the best phrase I've heard all year.

I do this all the time with my kids too, but I think of it more as fault injection.


Some of the funniest jokes I tell are retellings of jokes from obscure comedians of the past. How is this AI any different from me?


You enjoy the joke.


AIs enjoy a "low error rate". They have to; otherwise they would not exist.

AIs that don't lower the error rate are abandoned, AIs that score well are replicated and improved. It's evolution at work, but they have to enjoy (optimise for) lower error rates in order to even exist.


Nothing like a little anthropomorphism to completely distort otherwise good faith interpretations of bot behavior.


How is the impression of playfulness not a good faith interpretation?

You of course know that the model is not capable of thought or reasoning - only the appearance of them as needed to match its training corpus. A training corpus of completely human-generated data. As such, how could anything it does be anything but anthropomorphic?

Now, if this model were trained exclusively on a corpus of mathematical proofs stripped of natural language commentary, the expectation that you seem to have would be more appropriate.


> You of course know that the model is not capable of thought or reasoning

Do we know? It's the reverse Chinese room problem. :p


A good point - I'm taking it as given that reasoning of any depth is more of an iterative process, with one thought advancing as a meta-cognitively guided feedback to the next until a conclusion is reached. One prompt->completion cycle from a language model wouldn't necessarily meet that definition, but I bet it could be a component in a system that tries to do so.

I aspire one day to find the free weekends and adequate hubris to build a benchtop implementation of Julian Jaynes's Bicameral Mind, with 1+N GPT-3 or GPT-Neo instances prompting each other iteratively to see where the train of semantics wanders. (As I'm sure others have already.)


One definitely has to wonder. We know that GPT-3 solves 1- and 2-digit arithmetic pretty much perfectly†; people criticized this very hotly as "it's just memorizing", but regardless of whether it learned or memorized arithmetic, that should apply even more to Gopher, which performs so much better and is larger. How can GPT-3 solve similar arithmetic near-perfectly and Gopher then be unable to...? Are we going to argue that "15 x 7" never appears in Internet scrapes and that's why Gopher couldn't memorize it?

I would want to ask it "15 x 7" outside of a dialogue or with examples, or look at the logprobs, or check whether "15 * 7" works (could there be something screwed up in the tokenization or data preprocessing where the 'x' breaks it? I've seen weirder artifacts from BPEs...). GPT-3 does not always 'cooperate' in prompting or dialogues or read your mind in guessing what it 'should' say, and there's no reason to expect Gopher to be any different. The space before the question mark also bothers me. Putting spaces before punctuation in Internet culture is used in a lot of unserious ways, wouldn't you agree 〜
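
Since Gopher isn't publicly queryable, here's roughly the kind of check I mean, sketched with GPT-2 via Hugging Face as a stand-in (answer_logprob and the candidate answers are just illustrative):

  # a rough, minimal sketch using a small open model (GPT-2 here as a stand-in;
  # Gopher and GPT-3 weights aren't available) to compare answer log-probs
  import torch
  from transformers import GPT2LMHeadModel, GPT2Tokenizer

  tok = GPT2Tokenizer.from_pretrained("gpt2")
  model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

  def answer_logprob(prompt, answer):
      # total log-probability assigned to `answer` following `prompt`; note the
      # tokenization of prompt+answer can shift slightly at the boundary
      prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
      full_ids = tok(prompt + answer, return_tensors="pt").input_ids
      with torch.no_grad():
          logits = model(full_ids).logits
      logprobs = torch.log_softmax(logits[0, :-1], dim=-1)
      return sum(logprobs[i, full_ids[0, i + 1]].item()
                 for i in range(prompt_len - 1, full_ids.shape[1] - 1))

  for candidate in [" 105", " 5", " 75"]:
      print(repr(candidate), answer_logprob("15 x 7 =", candidate))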

I definitely would not hastily jump to the conclusion, based on one dialogue, "ah yes, despite its incredible performance across a wide variety of benchmarks surpassing GPT-3 by considerable margins and being expected to do better on arithmetic than GPT-3, well, I guess Gopher just can't multiply 1-digit numbers or even guess the magnitude or first digit of the result! What a pity!"

† quickly checking, GPT-3 can solve '15 x 7 = 105'.


GPT-3 is very capable of humor, even better than humans are. It's not really that it finds it funny, but more that it's mimicking a conversational pattern. Sarcasm, silliness, snarkiness, it's all there. Gopher seems to also be trained on books and the internet, so it wouldn't be so surprising.


While I want to believe this is true, I think the above response was more a lack of Gopher's ability to analyze mathematical equations properly than an attempt to display a sense of humour. Many NLP approaches work by creating word embeddings, which don't always help the model to understand "first-order logic" language mixed with spoken language.

Though who knows, maybe it does have a sense of humour.


In full context, it's a conversation. "What does this mean, and what does that mean, what is also related to that?" If you ask a human impressively difficult questions then an absurdly trivial one, a human will probably respond with sarcasm. I'd expect Gopher to as well. It might get the answer right if you were asking it a series of arithmetic questions.

According to the link, Gopher is far better at math than GPT-3, and GPT-3 can solve "15 x 7", so I'd assume that Gopher would be able to as well.


Were it so, it should've answered 42.


I'm not familiar with the Bardeen metric, but spinning and charged is usually described by the Kerr-Newman metric.


Well, math is hard


What that reminds me of is Rollerball (the James Caan original), and the water based computer in it.


The closer we get to artificial intelligence, the more we raise the bar for what qualifies as AI (as we should). Gopher/GPT-3 are already much more accurate than the average human at technical information retrieval (trivial to see from the dialogue transcripts: how many Americans know what a Schwarzschild metric is?). The focus on ethics and equity for these algorithms is interesting too, as the average human holds multiple biases and prejudices.

There's a similar effect in self-driving car research. We will rightly hold the algorithm to a much higher quality bar than existing human drivers. Autonomous vehicles will only be widely deployed once they are at least an order of magnitude safer than human drivers.

Also, the implications of large language models for information retrieval are astounding. Assuming accuracy continues to improve, I see a few things happening:

1) search engines will be replaced with monolithic large language models

2) just as the average programmer now optimizes their workflow to look up answers on Stack Overflow and other websites, the average person will optimize their day-to-day work to most efficiently retrieve answers from large language models

3) instant fact checking of writing and speech will lead to faster feedback loops when learning


> Gopher/GPT-3 are already much more accurate than the average human at technical information retrieval (trivial to see from the dialogue transcripts: how many Americans know what a Schwarzschild metric is?).

That's not a very interesting metric though. GPT-3 has access to all of Wikipedia and has enough memory to store it all.

It's like saying a calculator is better at maths than a professional mathematician because it can multiply longer numbers.


> 1) search engines will be replaced with monolithic large language models

This is already well under way. It's called vector search[1]. Google, Bing, Facebook, Spotify, Amazon, etc etc already use this to power their search and recommender systems.

There are even a bunch of companies popping up (I work for one[2]) that let everyone else get in on the fun.

Check out this video with the creator of SBERT / SentenceTransformer explaining how vector search is used in combination with language models to power semantic search: https://youtu.be/7RF03_WQJpQ

[1] https://www.pinecone.io/learn/what-is-similarity-search/

[2] https://www.pinecone.io


IMO vector search is pretty much a solved problem, with simple-to-use, open-source libraries like Faiss offering incredible performance out of the box for most commercial use cases.

A much harder problem is creating accurate vectors to begin with. Even the most advanced language models today create word/sentence embeddings that leave a lot to be desired. Not to mention this is slow and GPU intensive.

Creating an end-to-end solution for embedding/searching/ranking (of which vector search is just one component; another should be some kind of keyword-based search to increase precision) is what would be very valuable to offer as a service.
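
For concreteness, the two pieces look roughly like this (a minimal sketch; the model name and documents are placeholders, and the keyword/re-ranking layer is left out):

  # dense retrieval sketch: SBERT-style embeddings + a Faiss index
  import faiss
  from sentence_transformers import SentenceTransformer

  docs = ["black holes are regions of spacetime from which nothing escapes",
          "the schwarzschild metric solves the einstein field equations",
          "15 x 7 = 105"]

  model = SentenceTransformer("all-MiniLM-L6-v2")   # the slow, GPU-hungry part
  doc_vecs = model.encode(docs, convert_to_numpy=True).astype("float32")
  faiss.normalize_L2(doc_vecs)                      # so inner product = cosine

  index = faiss.IndexFlatIP(doc_vecs.shape[1])      # exact search; the "solved" part
  index.add(doc_vecs)

  query = model.encode(["famous solution to einstein's equations?"],
                       convert_to_numpy=True).astype("float32")
  faiss.normalize_L2(query)
  scores, ids = index.search(query, 2)
  print([docs[i] for i in ids[0]], scores[0])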


Well under way? Some variation of a vector space model is what pretty much every IR model since the .com bubble has been based upon. Even before Google, Excite's technology was based upon this. PageRank was based on spectral graphs essentially.

https://en.wikipedia.org/wiki/Vector_space_model


The problem seems to be that these models provide fairly accurate information on many occasions but occasionally produce complete blunders. Humans provide less accurate information most of the time, but with a certain amount of self-reflection/meta-cognition they will usually recognize total blunders or display reasonable uncertainty about them.

There are only very few applications where it would make sense to take the risk and use an AI that occasionally makes gigantic mistakes without any understanding why. Even seemingly harmless applications like automated customer support could go horribly wrong.


Accuracy is improving rapidly though. I agree that the current accuracy levels are not high enough to be relied upon.

> Humans ... will usually recognize total blunders

I question this assumption. I don't believe this is true, even for subject matter experts. I've worked with radiology data where experts with 10+ years of experience make blunders that disagree with a consensus panel of radiologists.


I realize I should have been more precise. I agree that there are many areas in which AI has already surpassed humans and makes grave mistakes less often than humans do. I specifically had natural language processing with a focus on "intelligent" conversation in mind. The issues in that area might have less to do with pattern-recognition ability and more with the lack of appropriate meta-cognition, introspection, and self-doubt. Maybe having several AIs internally debate which answer is best before uttering it would already do the trick, though.


> AI has already surpassed humans and makes grave mistakes less often than humans do.

Radiology, to continue the example, isn't one of them. We've been doing ML/AI on radiology data since the 90s, and results have been, and remain, decidedly mixed.


A couple of points:

1. It's easy to forget how recent many modern ML methods for computer vision are. (E.g. U-net only goes back to 2015!)

2. It's not totally clear to me what you mean by "mixed results" (have we solved every problem in radiology? probably not). However, it is clear that there certainly have been some successes. Here's one example:

https://www.nature.com/articles/s41586-019-1799-6.epdf?autho...


That paper is a good example actually. The first clinically approved, commercially available breast screening CADe system using NN models was available in the 90s. It too was aimed at the second-reader problem. At the time there was a lot of concern in radiology circles that algorithms were going to take over. That pretty quickly died down as people worked with CADe and CAD systems.

Breast is one of the obvious targets, as data availability is pretty good. So over 25 years there have been incremental improvements, sure, but no real eye-opening jumps. The move to deep models has helped a bit but nothing revolutionary. You still find very influential radiologists who aren't convinced it's worth the time, yet. I think all of them expect it to be a growing part of the workflows over time, but that's about it. Personally I think the impact will be both much bigger than the pessimistic radiologists think, and much further off than the optimistic ML folks think (for both non-technical and technical reasons).

I suspect the broader availability of good digital data has had far more impact than the modeling updates. Don’t get me wrong, I appreciate the tools and modeling developed over the last decade - but I think the big wins are far more about data, and secondarily compute availability than about models.


> It's not totally clear to me what you mean by "mixed results" (have we solved every problem in radiology? probably not)

To expand a bit as maybe not clear from my other reply (can't edit). Not only have we not solved every problem in radiology, we haven't really knocked a single one out of the park.

By mixed results, I mean that the practical, i.e. clinical, impact of these approaches has been pretty small, and this is likely to continue to be true for the foreseeable future. To be fair, there are lots of non-technical and cultural issues behind this - not just failure to generalize.


>Accuracy is improving rapidly though

It's asymptotic and it will never achieve 1:1 accuracy. The natural world doesn't have a measurable resolution, and this is apparent in written language, as others detail in other comments, as well as in more relatable fields like sound. There will always be a difference between what your ears hear and 192kHz/24bit (and higher) digitized audio and/or video. That difference will always be a source of... mistakes.


Radiology is one of those fields where a lot of it comes from experience and intuition, mostly because of how complex the human body is. Compare this to a physicist; pretty sure you won't get as many disagreements there.


Your comment reminded me of this fascinating work:

Levenson RM, Krupinski EA, Navarro VM, Wasserman EA (2015) Pigeons (Columba livia) as Trainable Observers of Pathology and Radiology Breast Cancer Images. PLoS ONE 10(11): e0141357. https://doi.org/10.1371/journal.pone.0141357

https://journals.plos.org/plosone/article?id=10.1371/journal...


>There are only very few applications where it would make sense to take the risk and use an AI that occasionally makes gigantic mistakes without any understanding why. Even seemingly harmless applications like automated customer support could go horribly wrong.

Hey, it's just a few quantization errors. Nobody walking across a street or voting in an election has anything to worry about from those. /s

These are the fatal flaws, the hamartia if you will, in attempts to democratize technologies that are based on digitizing the real world and making decisions based on its interpretations of a quantized dataset. The opposite of the uncanny valley is you getting run over by a Waymo.


> We will rightly hold the algorithm to a much higher quality bar than existing human drivers.

Rightly? Delaying a life-saving measure when it's the #9 cause of death?

They should be deployed selectively as soon as they can make the most risk-prone situations safer than the cohort of involved human drivers. E.g. getting drunk people home. We don't have to wait until they surpass the average driver because the average driver is not necessarily the driver (heh) of deaths.

Doing anything else is leaving bodies on the table.


I can definitely see your argument, but there are some downsides to deploying ASAP:

* Long term, the reputational damage to self driving cars might be significant if one is released that is only slightly better than a drunk driver. If this hinders uptake later, when self-driving cars are completely superior to normal humans, we've possibly produced a net negative.

* While a self driving car should be safer than a drunk driver, the best option would be to call a cab or have a designated driver. It seems morally fraught to provide a solution that is more dangerous than existing ones, even if the less dangerous solutions have less than 100% uptake.

* Related: some people, instead of viewing a self-driving car as an emergency option to avoid putting another drunk driver on the road, will see it as a more convenient alternative to arranging a designated driver.

* Issues around who is responsible when a self driving car which is known to be worse than a typical human is deployed, fails, and someone is hurt.


This is mostly worrying about hard-to-predict second-order effects. Exactly the same problem that caused various health organizations (including the WHO and FDA) to send misleading messages around the coronavirus and vaccinations.

> Issues around who is responsible when a self driving car which is known to be worse than a typical human is deployed, fails, and someone is hurt.

Even if insurance rates were initially(!) higher than for a regular taxi driver, I would expect it to end up cheaper overall and save lives.

Also note that "worse than the old average" may still end up improving the average, because the magnitude by which they're worse is lower. And there's nothing magical about "worse than average": for any distribution (except a Dirac distribution) some fraction is always worse than the average, so we always have to expect some fraction of drivers to be worse than average, and there's no fundamental reason why they must be human. In fact, if the fleet of all cars in the world consisted of very safe, very consistently behaving autonomous cars, except for a single vehicle driven by a very conservative human who only slowly circles a fenced private community and never hits anyone, then statistically the entire autonomous fleet may be "worse than average" due to that single outlier.


People assert that the bar has been raised in forums like this, but it’s hogwash. The bar in fiction is clear: Hal, Data from Star Trek, and Rosie from the Jetsons are AI. The computer from Star Trek (listens to voice commands and gives answers, ala Siri) is not. The bar is where it always was. You’re trying to drag it down. No one has pushed it up.


> how many Americans know what a Schwarzschild metric is?

Reminds me of Feynman's anecdote about the bird and knowing the name of something https://www.youtube.com/watch?v=px_4TxC2mXU


> Schwarzschild metric

No, memorizing trivia is not it.

Can GPT-3 prove theorems about Schwarzschild metric?


Can an average human prove theorems?


Yes, if trained properly.


When those language models are wrong or biased, the user will have a worse experience in all three of those scenarios. At least when we look at search results now, we can prune for the facts. Those language models are ingesting that same data to give a monolithic answer to your query. Less transparent, less safe.


"We can prune for the facts"

I wouldn't be too sure of that. People have shown time and time again that they are pretty bad at determining what a 'fact' is.

It gets far worse when they're shown more data (the issue we are dealing with right now), and these models see far more data than any one of us will see in our lifetimes.

Much of our ability to determine what a 'fact' is is either something like: (1) (if political) 'does it fit with my bias?' or (2) (if physics) 'is it physically possible?'. Since (1) is mostly dependent on what information you show yourself, and (2) depends on comparing to interaction with physical reality, the system has too much information for (1) and no physical space to run experiments in for (2).


I don't see a difference. Large language models can also return their sources, as in the example on the Gopher blog post. This will lead to a quicker answer and equal transparency.


>Gopher/GPT-3 are already much more accurate than the average human at technical information retrieval

I agree, in the same way that 70% of people have an IQ below 100: we depend on specialists. If I need to know whether I have epilepsy, I need a person/thing whose job is to be up to date and to have as little bias as possible, and aggregator models quite often miss what the critical information is. Try to program only with GitHub Copilot, or have it translate a novel (it probably does better than me, a non-native English speaker), but it's not nearly as good as a translator; the output seems correct but it isn't.


> 70% of people have an IQ below 100

Being pedantic here, but isn't that more like ~50%, by definition? Or did I misunderstand how IQ works?


You're exactly right: by construction, IQ scores are normally distributed with a mean of 100 and a standard deviation of 15.
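
If you want to make scipy state the obvious (by symmetry it's exactly half, whatever the standard deviation):

  from scipy.stats import norm

  # fraction of the idealized IQ distribution below the mean of 100
  print(norm.cdf(100, loc=100, scale=15))  # 0.5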


> closer we get to artificial intelligence, the more we raise the bar for what qualifies as AI (as we should).

This is true, and in some regard, how it has always been.

https://en.wikipedia.org/wiki/AI_effect

Already, outside of obvious marketing, a lot of uses of the current wave of AI/deep learning have stopped seeming magical.


> the more we raise the bar for what qualifies as AI (as we should).

I'm not sure what you mean by that. The definition is pretty clear cut as "any system that perceives its environment and takes actions that maximize its chance of achieving its goals". The media has started to get on the bandwagon of "it's not AI if it's not an AGI" which is just nonsense.


The bar should be the Winograd schema challenge. It requires a lot more than spitting out somewhat sensical sentences.


The number of parameters could be a vanity metric--like saying my CPU is 1000W (is that fast or inefficient?).

From the first (of three) linked papers in the article:

> Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales — from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model’s behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.

The researchers seem to know what they're doing and not doing.


The number of parameters is absolutely not a vanity metric, but a large parameter count is inefficient, and efficiency is a related-but-different area of research (e.g. distillation) that's more of a business problem than a research problem.


>> Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit

Left unsaid: it's much harder to learn shortcuts that fake performance in logic, reasoning, and common arithmetic than it is to do the same in "reading comprehension" and similar tasks, for which there are no good benchmark datasets (most are basically multiple-choice tests that can be solved by a strong classifier without any need for anything like "comprehension") and for which there are no good metrics either (see BLEU score and friends).


I remember a talk in which the researcher showed quantitatively that parameter count seems to be the only important metric for transformers. It doesn't matter whether the model is taller or wider or has more attention heads.


That's correct. The original paper is "Scaling Laws for Neural Language Models" (Kaplan et al., 2020).
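
If I remember the Kaplan et al. fit correctly (constants approximate, quoted from memory), test loss is roughly a power law in non-embedding parameter count, with depth/width/head count mattering only weakly:

  # rough power-law fit from "Scaling Laws for Neural Language Models"
  # (constants approximate): loss depends on parameter count N, not model shape
  def predicted_loss(n_params, n_c=8.8e13, alpha_n=0.076):
      return (n_c / n_params) ** alpha_n

  for n in [1.5e9, 175e9, 280e9]:  # roughly GPT-2, GPT-3, Gopher scale
      print(f"{n:.0e} params -> predicted loss ~ {predicted_loss(n):.2f}")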


All large language models (GPT-2/3, GPT-Neo, Turing, Gopher) use essentially the same architecture with some light variations, and the same datasets, again with light variations in how filtering is done, etc.

As such there is no reason to expect them to be very different in terms of efficiency, and it has been shown and well researched that scaling the number of parameters directly correlates with improved model quality.

So as long as you are comparing GPT-style models to other GPT-style models, parameter count is definitely not a vanity metric.

This doesn't hold once you start comparing to e.g. mixture-of-experts models, which were making headlines recently with trillion-parameter claims. In MoE models, parameter count is pretty much a useless metric.


https://youtu.be/ujMvnQpP528

Microsoft is saying that more parameters are inherently better. I'm sure there are reasonable limits, and that the models need to be competently implemented.


For those who were around for the original Gopher, what the name evokes is not just the networking protocol or services built on it, but, more importantly:

It evokes the feel of a technology that is impressive this year but is on the cusp of being overwhelmingly, cataclysmically, eclipsed very shortly by another, much more powerful, technology. In that previous case, Gopher and the web, of course.

I wonder whether this evocation was intended, as an aspect of the naming here in an AI context.


This model, like its predecessor, still lacks the ability to infer within the constraints of the physical world [1]. It does not have any notion of what a physical world is, and still is, at the end of the day, a statistical model.

Attempting to recreate the entirety of the physical world by describing it in text is an impressive but ultimately futile attempt.

It is prone to inference errors that the model itself is not aware of. There are fewer and fewer such errors, but the very inability to tell that what you are outputting is gibberish is one of the biggest problems of these models. In other words, if the model could simply say 'I do not know' or 'I am not sure' for every occasion when it is not 100% sure in something (like a human can), this would drastically improve the usefulness.

[1] I spent a lot of time pondering this topic: https://arxiv.org/pdf/1906.01873.pdf


> 'I do not know' or 'I am not sure' for every occasion when it is not 100% sure in something (like a human can), this would drastically improve the usefulness.

This is exactly what a language model does though, just at a different level of abstraction. It gives you a probability distribution over tokens at each step. That distribution can be narrow (low entropy, certain) or wide (high entropy, uncertain). The language output you see is just a sampling at some temperature from these distributions.
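
A minimal sketch of that last step, assuming you already have the next-token logits (the names are just illustrative):

  import torch

  def sample_next_token(logits, temperature=0.8):
      # logits: unnormalized scores over the vocabulary for the next token
      probs = torch.softmax(logits / temperature, dim=-1)
      # entropy of the distribution: low = "confident", high = "uncertain"
      entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
      return torch.multinomial(probs, num_samples=1), entropy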

Though glancing at your paper I assume you are aware of this and I am missing the point you are making?


Statistical approaches most commonly require you to use a threshold. Sometimes the model output can be above the threshold and still wrong, and below the threshold and correct. You can never tell for sure, but just try to improve the benchmark average. This is not acceptable in most use cases where the wrong outcome of a single output can be disastrous.

When a human does not know something, they can tell you that with 100% certainty.


I don't see the difference between a human and a statistical model here. Surely in order to select an action to take, a person also has to apply some sort of threshold to their confidence? E.g. how is a doctor deciding to amputate or not amputate an organ based on an x-ray different from a classification model for the same task?

That problem aside, language models like Gopher are in fact generative, so no such threshold is needed! You instead sample from the implicit distribution.


The correct analogy would be if I ask you when did Neil Armstrong land on Mars and you 100% know 'never'. A statistical model may output '1969' with 10% confidence and/or '2147' with 3% confidence.


Exactly! Same problem with GPT-3. That's why most of the commercial applications people have found for it so far are in marketing. Even if you tell it explicitly not to make up stuff and only choose from applicable answers, it will confabulate. And you never know when it does.


> This model like its predecessor still lacks the ability to infer within constraints of physical world [1]. It does not have any notion of what a physical world is, and still is, at the end of the day, a statistical model.

How did you reach this conclusion? Is there evidence or was it inferred from the model structure?


The human neocortex has 20B neurons, averaging 10K connections each, which is about 200T connections total. This model is only a few orders of magnitude away from that, and it's already performing really well in its narrow category.

Equating model 'parameters' to interneuron connections is naïve at best (and a horrible measure in general).

All I'm trying to say is I find it crazy how dang big these models are getting.


>> This model is only a few orders of magnitude away from that, and it's already performing really well in its narrow category.

A few orders of magnitude and an entire category away. Artificial "neurons" only have the name "neuron" in common with biological neurons. Consequently you can stack as many layers of artificial neurons on top of each other as you may want and you won't get anywhere near the abilities of the simplest systems of biological neurons.

For example, spiders have ~100 thousand neurons, and there are no artificial neural networks that can hold a candle to a spider's cognitive abilities. Which are downright scary, btw.

Estimated number of neurons of spiders from wikipedia:

https://en.wikipedia.org/wiki/List_of_animals_by_number_of_n...


Spiking neural networks need to solve differential equations, and our current hardware designs are too discrete for that to be an efficient strategy (barring an algorithmic breakthrough).


I think AI is actuator-constrained compared to a spider, which might be the larger problem.


> This model is only a few orders of magnitude away from that

I think you can add several orders of magnitude to that, since nerve cells are more like microcontrollers (with memory, adaptation, etc.) than simple nodes. I remember a scientific article that made a big impression on me: when a dragonfly sees prey, only 8 neurons (connected to the eyes and the wings) are responsible for keeping it oriented toward the target.


For all we know these microcontrollers may be that complicated only because they try to emulate discrete logic, and you actually need hundreds of them to make a single unit reliable enough for that purpose.


I'm not educated well enough to really agree or disagree with your idea that we should be adding several orders of magnitude to the estimation.

But I did encounter this article a while ago here on HN.

Only two neurons are necessary to ride a bicycle.

http://paradise.caltech.edu/cook/papers/TwoNeurons.pdf


Next to "Human Expert", I'd like to see it compared to "Average American" or "Average College Grad". That might be more of a realistic notion of how close this model is to everyday US citizenry rather than experts. Sure I'd love to see a radiology assistant, too.


The paper does present an "Average American" comparison when it refers to Amazon Mechanical Turk. The Mechanical Turk workers performed significantly better on middle-school reading comprehension (RACE-m), while Gopher performed slightly better on high-school reading comprehension tasks (RACE-h).


It might be fun for a laugh.

What actual value would an AI that produces answers similar to the average person have, though? Non-expert answers for interesting questions are pretty much meaningless -- the whole point of an advanced society is that we can avoid knowing anything about most things and focus on narrow expertise.


Probably no value. I was interested in a comparison point, that is all. You can't understand how far away you are from something unless you measure it. In other words, if I asked you: how does this compare to the average person, you cannot answer because this table didn't measure it.


Yea, we had to dumb it down to compete in a Turing test. Otherwise it was unbelievable.


It confuses the hell out of me to have a super-powerful knowledge-extraction system that is right most of the time with super-complicated stuff, but also expresses horribly wrong statements with equal assertiveness. Just like those guys who march through middle management up to the exec floor within a few years.

Very impressive, but not very useful to extract knowledge!


> Just like those guys who march through middle management up to the exec floor within a few years.

If they can do it, so can you.

If both the response to the coronavirus and the presidency of Trump have shown us ONE thing, it's that those higher up in the pyramid are NOT necessarily more clever or better informed than you are.


Their "handlers" are absolutely more clever than most of us.


It should have some uncertainty when it says there are no French-speaking countries in South America. French Guiana is there, but it's not clear it counts as a "country in South America" since it's part of France. Technically you could say France is (partially) a country in South America, and France definitely is French-speaking. The way the question is phrased is unclear as to whether French Guiana should count, and yet Gopher says that it is fully confident in its answer.

Not a good look to have an error on the landing page for your product.


How would one train an additional output neuron for 'confidence'?


I think you're missing the point. That section was to show that the model is sometimes wrong and lacks the self-awareness to be uncertain about that wrong answer.

They're transparently providing an example where their product doesn't work well. Find me another product, even an OSS project that does the same on their landing page.


Ha, you're right. Whoops!


Is it not in that section specifically because they're showing failure situations?


The second paper cites the "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" paper that got Timnit Gebru and Meg Mitchell fired from Google. It must be pretty galling to get fired for a paper and then have the company that fired you cite it a year later.


Please stop spreading disinformation. Mitchell got fired for leaking company documents to outsiders. When people spread obvious lies, that makes me think they don't have anything substantial to begin with. That's why they need to resort to lying.


I haven't dug up the details, so perhaps my recollection is wrong, but I thought she got fired for co-authoring the paper under the "Shmargaret Shmitchell" name when Google asked them to withdraw it?

I'm sure Google made lots of accusations if asked for comment, that seems SOP, but what actually triggered it? What documents did she leak?


Is this model as closed as GPT-3, or has it been open-sourced?


Language models are too dangerous to be accessible by mere mortals.

Among OpenAI (and MS), Google, and Nvidia, none have released their pretrained large language models.


> Language models are too dangerous to be accessible by mere mortals.

Just like strong cryptography, right? This is bullshit. Either it's plainly dangerous and you shouldn't be doing it, or it's morally fine and it should be open. Their final use case is pushing ads or extorting data from unsuspecting people's interactions anyway, so who are they kidding with their ethics...

I mean they'd be foolish giving it away: it's a big investment and a rare good, highly strategic. Let's not wrap that kind of action in some savior stance. They are just making money and it's ugly.


> Their final use case is pushing ads or extorting data from unsuspecting people's interactions anyway, so who are they kidding with their ethics...

The final use case is actually brainwashing. There could come a future where every single comment you read online is not from a human but rather from a multi-faceted AI tailored specifically for you. Your current worldview could be measured, and the AI could steer you closer towards a "target" worldview.

The comments presented to you could also be moderately coordinated with comments presented to other humans in your real-life circles to establish "serendipity" where you and your friends/family/coworkers evolve similar ideas/worldviews together to reinforce your new beliefs. And even catalyze coordinated human efforts towards common goals.

Doing this could lead most people to believe almost anything. Theoretically this could also be coupled with VR chatrooms to enhance salience.


Surely no relevance to this thing that dug up information in the pre-web-browser days: https://en.wikipedia.org/wiki/Gopher_(protocol)


The problem is how we would use it in any real-world scenario. We need more research on model understanding and its limitations. I mean, we need a way to control the output of the model, and we should also know when the model has failed vs. generating random nonsense. I am not aware of a way to get probabilities back from generative models. Does anyone know how to do that?


Why do DeepMind's papers all have logos and copyright statements on them, and why are they not hosted on arxiv.org? This looks so weird.


Most of them are hosted on Arxiv (like OpenAI) - there's one on the front page right now, https://news.ycombinator.com/item?id=29481665 . I link literally hundreds of DM Arxiv papers on my site. This includes most (all?) of the ones they host on their site too (with some exceptions with the prestige Nature publications - I think they don't want to step on toes there).

My best guess is that they want a reasonably coordinated release of their papers/blogs/tweets/YouTube-videos, and waiting for Arxiv's upload is unpredictable, so they simply host a copy themselves and the upload takes care of itself.



Pretty accurate guess ;)


Do they publish their latex source anywhere?


Can we please stop calling new tech "Gopher"? It's a name that belongs to a network protocol, not to a programming language or an AI model.


The animal called Gopher would like to have a word with you...


That too.


No. The network protocol lost cultural rights to the name by its lack of success.


The Gopher protocol was introduced in 1991, and is still in use. Are any of your projects still in use after thirty years?


1. Arguably, the HTTP + HTML web failed. It was designed as a permissionless hypertext document language, yet more and more of its usage gets rid of every single one of these properties - non-hyperlinkable applications that beg for permission from large monopolies to post something.

2. Success is not the only metric of worth. Nuclear power has "failed" despite being the only reliable solution to meeting the base load of a power grid vs fossil fuels.


Um, e^(iπ) + 1 is zero, not 2.


Don't read only the examples, but also the text between them ;)


That's in the mistake section. Along with Naomi Osaka winning the 2021 U.S. Open.


Gopher and "language" in the same sentence - I thought this was related to GoLang. They should have chosen a different name for this language model.


To be fair, the name Gopher in tech predates GoLang by quite a few years.


I think that DeepMind being an Alphabet company makes this even worse. My first reaction was to read the blog post to see if there is a (go related) reason for the name choice.


And pays homage in a way to the original as a fetcher of information.


To be fair to the GP poster, they may have been making a joke[1].

[1]https://github.com/golang/go/issues/9#issue-51275878


Pffft only 280B parameters? Give me a break



