I enjoy a good visualization, but at best they're high-level graphical PowerPoints, and in this case I found the animations more distracting than useful.
Also, if you're going to do a 30k foot view of a technical topic, you might want to tell people what GPT3 is somewhere in there.
I agree that in this case the animated parts of the graphics were not needed; it's an easy pitfall, when crafting visualisations, to get carried away by their beautiful aspects.
I feel the need to defend the author though, it's hard to make research accessible while still distilling valuable insight. I think his post on transformer networks [1] did a good job for example, and you'll appreciate the lack of animations.
Yes this seems like an early work in progress, compared to Jay's previous Transformer articles.
In addition to your link, I've found a really good Transformer explanation here (backed by a GitHub repo with lively discussion in the Issues): http://www.peterbloem.nl/blog/transformers
I feel this comment is overly negative. Just to provide a counter-datapoint, I have seen quite a bit of GPT3 on HN lately but could not understand the research papers at all. They're too abstract, and I often fail to see what they really mean.
This article and the animations definitely helped me a lot in understanding this. I learned quite a few things, so thanks a lot to the author!
Opening OP's page on a slow 4G connection via hotspot from my smartphone, the whole page makes no sense because I can't tell whether I should wait for something to move or carry on.
My head was getting dizzy and I had to stop midway. People were smart enough to create animations but not sensitive enough to know when it is too much.
Please don't use terms like "magic" when trying to explain things to people. They never point out where the "magic" part lines up with the rest of their explanation.
Author here. Thank you. I feel an important element of this type of writing is what complexity to show and what to hide at different points. "Magic" is just to say "don't worry about the contents of this box yet, we'll get to it". It's what we discuss right after the visual. Sorry that came out as confusing. I'll add a note to the following figure saying that's the magic.
I get the sense that you're trying to mask the simplicity of predicting the next most likely word after training your app, à la Markov chains, under the guise of "magical AI." Providing an error threshold when it spits out the wrong response to a phrase seems to be worsening its natural ability as well.
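For reference, here is a toy version of the "Markov chain next-word prediction" being alluded to; this is purely an illustrative sketch, not anything from the article. GPT-3 is likewise a next-token predictor, just conditioned on up to 2048 prior tokens rather than one.

    # Toy bigram "next most likely word" predictor, for illustration only.
    from collections import Counter, defaultdict

    def train_bigram(words):
        followers = defaultdict(Counter)
        for prev, nxt in zip(words, words[1:]):
            followers[prev][nxt] += 1          # count how often nxt follows prev
        return followers

    def predict_next(followers, word):
        counts = followers.get(word)
        return counts.most_common(1)[0][0] if counts else None

    corpus = "the cat sat on the mat and the cat slept".split()
    model = train_bigram(corpus)
    print(predict_next(model, "the"))  # "cat" is the most frequent follower of "the"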
That's how it's often taught, which is a real shame. Paul Lockhart wrote an elegant piece about this, titled A Mathematician's Lament [0]:
> Nevertheless, the fact is that there is nothing as dreamy and poetic, nothing as radical, subversive, and psychedelic, as mathematics. It is every bit as mind blowing as cosmology or physics (mathematicians conceived of black holes long before astronomers actually found any), and allows more freedom of expression than poetry, art, or music (which depend heavily on properties of the physical universe). Mathematics is the purest of the arts, as well as the most misunderstood.
> ...
> This is why it is so heartbreaking to see what is being done to mathematics in school. This rich and fascinating adventure of the imagination has been reduced to a sterile set of “facts” to be memorized and procedures to be followed. In place of a simple and natural question about shapes, and a creative and rewarding process of invention and discovery, students are treated to this: Triangle Area Formula - A = 1/2 b h
> “The area of a triangle is equal to one-half its base times its height.” Students are asked to memorize this formula and then “apply” it over and over in the “exercises.” Gone is the thrill, the joy, even the pain and frustration of the creative act. There is not even a problem anymore. The question has been asked and answered at the same time — there is nothing left for the student to do.
I love the amount of effort Jay puts into his posts to develop intuition. And I wonder if there are open-source projects out there to help researchers who like to blog make simple animations.
I'm curious, what can I, as a full-stack developer, do to prepare for things like GPT-X eventually making a lot of the work I do obsolete in the next 10-20 years?
Seeing all these demonstrations is starting to make me a little bit nervous and I feel it is time for a long term plan.
The parts of programming that are going to get automated are going to be the parts that require little skill, take a long time, and are boring as hell: writing boilerplate CRUD code, wiring up buttons to actions, etc.
Automating the harder and more interesting parts of programming is many orders of magnitude more difficult. This requires a true understanding of the problem domain and the ability to "think." GPT-3 and similar are just really good prediction engines that can extrapolate based on training data of what's already been done.
The answer therefore is the same as "how do I stay competitive vs. lower skill offshore labor?" You need to level up and become skilled in higher-order thinking and problem solving, not just grinding out glue code and grunt work.
Ruby on Rails scaffolding didn't make backend developers obsolete. I know you said GPT-X, but GPT-3 is at the boundaries of technology. The jump to GPT-4 will either take much longer or be much less impressive than the jump from GPT-2 to GPT-3. I would say that your job is safe from automation from GPT. But the technology that might put you out of job, which I personally think will not be something like a neural network, might be spontaneously discovered in the next 10-20 years just like the spontaneity of smart phones. To answer your question, be a human; be adaptable, be useful.
On the other hand, when it’s good enough to replace us, it’s also good enough to replace basically any job where you transform a written request into some written output, e.g. law, politics, pharmacology, hedge fund management, and writing books.
I have no idea how to prepare, only that I should.
(Edit: what makes us redundant may well not be in the GPT family, but I do expect some form of AGI to be good-enough in 20 years).
There’s a good book called “Rebooting AI” that does some fundamental analysis of the current state of deep learning and its applications.
The biggest problem with GPT or any massive neural net is explainability. When it doesn’t do the correct thing, no one quite knows why. GPT makes all sorts of silly mistakes.
The human brain, albeit being a form of neural net, can do some very deep symbolic reasoning about things. Artificial neural nets just don’t do that (yet). We haven’t figured that out, nor have I seen a system that is close. We haven’t got generic neural nets that can perform arithmetic operations to arbitrary precision. For computers to learn language properly, they have to embed themselves in the world for years like children do and learn the relationships between objects in the world.
So if I were a fake comment house, I’d worry about GPT. Not so much if I were a programmer or a lawyer. We do some very deep symbolic thinking to produce our work. If computers are able to replace us, they can probably replace a large part of humanity. At which point we have way bigger problems to worry about.
Symbolic reasoning is a very hard problem to crack. Take something like “how old was Obama’s 2nd child when the US hit 4 digit deaths due to covid-19?”. Answering that question not only requires context (e.g. that “4 digits” means at least 1,000), it requires a bunch of lookups and the ability to break a big problem into smaller problems.
Siri/Ok Google/Alexa/Cortana/GPT3 - all of them fail.
They can’t even answer “Find fast food restaurants around me that aren’t mcdonalds”.
I'm actually looking forward to more code generation tools. Things like wiring up a button aren't stimulating and I wouldn't mind that level of programming becoming automated.
That’s what I loved about Visual Basic. You could just draw your user interface and specify actions and then just fill in the one or two lines of code that need to run when that button is pressed.
I’m surprised React doesn’t have something like that. At least not that I’m aware of. Is there a GUI interface builder for React?
I am as well, especially since I saw [1]. It's a small test that someone tried with GPT-3 that translates natural-language descriptions and phrases into shell commands.
Some of the examples from the tweets:
> Q: find occurrences of the string "pepsi" in every file in the current directory recursively
> A: grep -r "pepsi"*
> Q: run prettier against every file in this directory recursively, rewriting the files
> A: prettier --write "*.js"
It seems to work the other way as well: you can set it up to give it a shell command and have it write a plain-English description of it.
Granted, sometimes the results are wrong; in a video I saw of someone playing with it, roughly 1 in 10 of the commands were subtly wrong or didn't have enough context to generate what you really meant. But as a starting point it seems like such a powerful tool!
I personally spend a lot of time looking up shell command flags, thinking of ways to combine tools to get the data I want out of a log or something, or running help commands to figure out the kubectl incantation that will just let me force a deployment to redeploy with the latest image.
Imagine having a VS Code-style command palette where I can just type a "plain English" description of what I'm trying to do and have it generate a command that I can tweak or just run. Turning a 10-minute process of recalling esoteric flags or finding documentation into 10 seconds of typing.
If it's really as good as it seems, imagine being able to type stuff like "setup test scaffolding for the LoginPage component" and having it just generate a "close enough" starting point!
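For anyone curious what such a palette might look like in practice, here is a rough, hypothetical sketch using the OpenAI completions API of the time; the few-shot examples and parameters are my own assumptions, not the setup from the linked tweets.

    # Hypothetical English -> shell-command helper. Assumes the (2020-era)
    # OpenAI Python client and an API key configured in the environment.
    import openai

    # A couple of made-up few-shot examples to establish the Q/A format.
    FEW_SHOT = (
        'Q: find occurrences of the string "pepsi" in every file in the current directory recursively\n'
        'A: grep -r "pepsi" .\n'
        "Q: list files in this directory sorted by size\n"
        "A: ls -lS\n"
    )

    def english_to_shell(description):
        prompt = FEW_SHOT + "Q: " + description + "\nA:"
        resp = openai.Completion.create(
            engine="davinci",   # the base GPT-3 model used in early demos
            prompt=prompt,
            max_tokens=64,
            temperature=0.0,    # keep the generated command as deterministic as possible
            stop="\n",          # stop at the end of the generated command
        )
        return resp.choices[0].text.strip()

    # english_to_shell("show disk usage of the current directory")
    # might return something like "du -sh ." (always review before running!)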
Get good at specifying and documenting product requirements apparently.
Also remember that ultimately even if GPT-X is successful at transforming text into working code, all that's done is essentially define a new programming language. Instead of writing Python, you'll write GPT-X-code at a higher level.
GPT-X will be able to perform most copy-and-paste operations soon enough, so those are the kinds of jobs it will make obsolete. Low-code and point-and-click jobs are the ones that will follow. At first it will be "aiding" developers by suggesting code, and then GPT's successors will finally deliver the "no code, only a business description" promise that has been hanging over the industry for decades.
Of course GPT-3 is not there yet, but it's only a matter of time: the capabilities are there. You are already thinking in decades, which is the right mindset. Fortunately, tech is not something that will ever be "done", so there will always be opportunities, just not in the fields we are looking at right now: digital products like web or mobile apps will be as exciting as a custom invoicing Windows app in a matter of years, but then you have IoT, autonomous vehicles, blockchain, and whatnot. Stay ahead of the ball as an engineer.
Of course you can also move up the food chain and become a manager or technical architect or lead.
Being worried about new potentially disruptive tech is legitimate, it's hard to see our place in an environment we can't predict.
However, particularly as a full-stack dev, I think it will create more job opportunities than competition. You mention 10-20 years ahead; if you look back over that same horizon (I wasn't working then), it seems the job also changed significantly without making devs obsolete.
AGI might happen in our lifetime (I hope so), but I'm dubious that it will happen through a singularity [1]. Therefore, I'm not worried that as tech experts we won't have time to adapt.
It's not clear how much the demos have been gamed for presentation, and it seems more of an opportunity than a threat - it will still need devs to put stuff together, and (assuming it is as impressive as demoed) will take a lot of the donkey work away.
A significant chunk of what devs are paid for is the donkey work. Making every dev significantly more productive increases the supply of dev power relative to the demand, dropping the price.
It may be that deep learning as we know it (TF, PyTorch) is going to be replaced by prompting large models, thus making most applications straightforward for anyone to use.
Your main value add as a developer is understanding the problem domain. Machines won't be able to do this in your, or your children's lifetime, outside some important, but very constrained niches.
AFAIK, the only thing new about GPT-3 is its massive size; the architecture is completely conventional, the same as the ones you've seen from a few years ago.
The visualizations seem to show non-recurrent networks whereas my understanding is that one of the important differences between GPT1 and GPT2 & 3 is the use of recurrent networks.
This allows the output to loop backwards, providing a rudimentary form of memory / context beyond just the input vector.
Just curious: what languages (human languages) were used in the training data set of GPT-3? Is it trained only on English text and grammar, or does it transcend language barriers?
My (very limited) understanding of AI models is that the input "shape" has to be well defined.
E.g. a vision network expects one input per pixel (or more, to encode color), so it's up to you to "format" your image into what the model expects.
But what about GPT-3, which takes in "free text"? The animations in the post show 2048 input nodes; does this mean it can only take in a maximum of 2048 tokens? Or will it somehow scale beyond that?
Correct, you can only input up to 2048 tokens total (this is a big improvement over GPT-2's 1024 input size). You can use sliding windows to continue generating beyond that.
However, model training scales quadratically as input size increases, which makes building larger models more difficult (and is why Reformer is trying workarounds to increase the input size).
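A minimal sketch of that sliding-window idea; generate_next_token is a stand-in for a call to the model, not a real API:

    # Generate past the fixed context length by always feeding the model the
    # most recent CONTEXT_SIZE tokens. Purely illustrative.
    CONTEXT_SIZE = 2048

    def generate(prompt_tokens, n_new_tokens, generate_next_token):
        tokens = list(prompt_tokens)
        for _ in range(n_new_tokens):
            window = tokens[-CONTEXT_SIZE:]            # the model never sees more than this
            tokens.append(generate_next_token(window))
        return tokens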
Yes, there is a limited amount of input. In addition, each token may be a word or only part of a word, depending on how common it is. Common words get one token and uncommon words are divided into pieces, each of which gets a token.
The answer is yes and no. First, it will produce an output for any input. What you really mean is whether it will answer a query correctly.
It goes about doing that the same way it works in general, which is memorizing similar sequences and outputting the corresponding sequence that follows. For example, if the training data has something like "These are the countries that have over a million people: <countries>", I would not be surprised if it returned <countries> for your query. However, if your query was "less than a million", I would be very surprised if it returned the other countries.
If you only give it "Here are the countries whose population exceeds 1 million:", the model has a chance of going off on a tangent, producing inconsistently structured output, or producing inconsistent values in the output (examples when generating at temp=0.7: https://gist.github.com/minimaxir/86e09253f9e05058eb1e96de2b... )
If you give it the same prompt with a double line break and a "1.", it behaves much better.
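Roughly, the two prompt styles being compared look like this (the strings here are illustrative, not taken from the gist):

    # The bare prompt leaves the model free to wander off into prose.
    bare_prompt = "Here are the countries whose population exceeds 1 million:"

    # Adding a blank line and a leading "1." nudges it to continue a numbered
    # list, which tends to keep the output format consistent.
    structured_prompt = (
        "Here are the countries whose population exceeds 1 million:\n"
        "\n"
        "1."
    )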
This is fun but IMO too simplified. For example, it's really important to know that GPT-3 does not see "words"; it sees byte-pair encodings, which are for the most part smaller than words but larger than individual characters. This has immediate implications for what GPT-3 can and cannot do. It can reverse a sentence (This cat is cute -> cute is cat this) but it cannot reliably reverse a word (allegorical -> lacirogella).
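A quick way to see this, assuming the Hugging Face transformers package (GPT-3 uses the same byte-pair encoding vocabulary as GPT-2, so the GPT-2 tokenizer is a reasonable proxy):

    # Inspect how BPE segments text: common words tend to map to single
    # tokens, rarer words get split into sub-word pieces, and individual
    # letters are generally never exposed to the model.
    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")

    print(tok.tokenize("This cat is cute"))   # likely one token per word
    print(tok.tokenize("allegorical"))        # likely several sub-word pieces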
Interesting to consider whether this limitation of BPE points to a more fundamental issue with the model. Does GPT-3 "fail" when BPE is replaced with the conventional English alphabet as input symbols (for various definitions of "fail")?
If so, wouldn't this be evidence that the model is using its mind-blowingly large latent space to memorize surface patterns that bear no real relationship to the underlying language (as most people suspect)?
I suppose this comes back to my question about Transformer models in general - the use of a very large attention window of BPE tokens.
When I finish reading a paragraph, I can probably use my own words to explain it. But there's no chance I could even try to recreate the sentences using the exact words I just read. So I doubt our brains are keeping some running stack of the last XXXX words, or even some smaller distributed representation thereof.
It's more plausible that we're using some kind of natural hierarchical compression/comprehension mechanism that operates on the character/word/sentence/paragraph level.
It certainly feels like GPT-3 is using a huge parameter space to bypass this mechanism and simply learn a "reconstitutable" representation.
Either way, I'd be really interested to see how it handles character-level input symbols.
> Does GPT-3 "fail" when BPE is replaced with the conventional English alphabet as input symbols (for various definitions of "fail")?
The attention mechanism is quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples to apples comparison.
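To make the quadratic part concrete, here is a toy numpy sketch of scaled dot-product attention; the (n, n) score matrix is where the cost comes from:

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q, K, V: (n, d) arrays, one row per input symbol.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # (n, n): one score per pair of symbols
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V                               # (n, d)

    n, d = 2048, 128
    Q = K = V = np.random.randn(n, d)
    out = scaled_dot_product_attention(Q, K, V)
    # The (n, n) matrix already has ~4.2M entries at n=2048; character-level
    # inputs would need a much larger n for the same text, blowing this up further.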
> When I finish reading a paragraph, I can probably use my own words to explain it. But there's no chance I could even try to recreate the sentences using the exact words I just read.
Sure you could: you could look up and copy it, which is an ability GPT-3 also needs to model if it's to successfully learn from the internet, where people do that all the time. :)
> the use of a very large attention window of BPE
You're also able to remember chunks from not long before. You just don't remember all of them. I'm sure people working on transformers would _prefer_ to not have it remember everything for a window (and instead spend those resource costs elsewhere), but it's necessary that the attention mechanism be differentiable for training, and that excludes obvious constructions. (E.g. you can't just bolt a nearest-key->value database on the side and simply expect it to learn to use it).
> The attention mechanism is quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples to apples comparison.
That's ultimately my point. If an alphabet-based model can't achieve nearly the same results as a BPE-based model (even if appropriately scaled up to accommodate the expanded cost), doesn't that suggest that Transformers really are just a neat memorization hack?
> Sure you could, you could look up and copy it which is an ability GPT-3 also needs to model if its to successfully learn from the internet where people do that all the time. :)
That's right - but then we're just talking about memorization and regurgitation. Sure, it's impressive when done on a large scale, but is it really a research direction worth throwing millions of dollars at?
> I'm sure people working on transformers would _prefer_ to not have it remember everything for a window (and instead spend those resource costs elsewhere), but it's necessary that the attention mechanism be differentiable for training, and that excludes obvious constructions.
Of course, but all of my whinging about Transformers is a roundabout way of saying "I'm not convinced that the One True AI will unquestionably use some variant of differentiation/backpropagation".
> The attention mechanism is quadratic cost in the number of input symbols. Restricting it to a tiny alphabet would radically blow up the model cost, so it's difficult to make an apples to apples comparison.
> That's ultimately my point. If an alphabet-based model can't achieve nearly the same results as a BPE-based model (even if appropriately scaled up to accommodate the expanded cost), doesn't that suggest that Transformers really are just a neat memorization hack?
BPE's aren't even words for the most part. Are all native Chinese authors non-conscious memorization hacks? :)
Yes, GPT-2 could also do that. It generally works best if you give it some examples to start off. For example, I actually ran the following prompt through the full GPT-2. Everything after "How long ago did Elasmosaurus live?" is GPT-2 talking.
Elasmosaurus is a genus of plesiosaur that lived in North America during the Campanian stage of the Late Cretaceous period, about 80.5 million years ago. The first specimen was discovered in 1867 near Fort Wallace, Kansas, US, and was sent to the American paleontologist Edward Drinker Cope, who named it E. platyurus in 1868. The generic name means "thin-plate reptile", and the specific name means "flat-tailed". Cope originally reconstructed the skeleton of Elasmosaurus with the skull at the end of the tail, an error which was made light of by the paleontologist Othniel Charles Marsh, and became part of their "Bone Wars" rivalry. Only one incomplete Elasmosaurus skeleton is definitely known, consisting of a fragmentary skull, the spine, and the pectoral and pelvic girdles, and a single species is recognized today; other species are now considered invalid or have been moved to other genera.
Where did the Elasomosaurus live?
North America
Where was the first Elasomosaurus discovered?
Fort Wallace, Kansas
How long ago did Elasmosaurus live?
80.5 million years ago
When was Elasomosaurus discovered?
1867
Was Elasmosaurus capable of leaping?
Yes, the two small, sharp teeth on either side of the lower jaw contain the necessary enzymes to propel the animal upwards by using muscles. However, in the developing skeleton, the upper and lower jaws had a tendency to grip the body,
It will give an answer, whatever is formed by extending the input you give it. The answer will be based on the text you provide, so in that sense it "cares" about it. Whether the answer is any good is another matter. But maybe it will find something based on its training data that relates.
Yeah, mainly because it's generally getting more difficult to distill ideas and keep people interested; highly applicable to communication in engineering.