There's a deeper, more troubling problem being exposed here: deep learning systems are at least an order of magnitude less data-efficient than the systems they hope to replicate.
GPT-3 175B is trained on 499 billion tokens[1]. Let's assume token = word for the sake of this argument[2]. The average adult reads at a rate of 238 wpm[3]. Then a human who read 24 hours/day from birth until their 18th birthday would read a total of 2.2 billion words[4], or 0.45% of the words GPT-3 was trained on.
Humans simply do much more with much less. So what gives? I don't disagree that we still haven't reached the end of what scaling can do, but there is a creeping suspicion that we've gotten something fundamentally wrong on the way there.
1. https://lambdalabs.com/blog/demystifying-gpt-3/
2. GPT-based models use BPE; while we could dive into the actual dictionary of tokens and work out a word-token relationship, we can both agree that although this isn't a 1-to-1 relationship, it won't change the conclusion. https://huggingface.co/docs/transformers/tokenizer_summary
3. https://psyarxiv.com/xynwg/
4. 238*60*24*365*18 = 2,251,670,400
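As a quick sanity check on footnote [4], here's the arithmetic as a minimal Python snippet (nothing beyond the numbers already cited above):

    # Words a person would read at 238 wpm, 24 hours/day, for 18 years,
    # versus the ~499B tokens GPT-3 was trained on (treating 1 token ~= 1 word).
    words_per_minute = 238
    minutes_in_18_years = 60 * 24 * 365 * 18
    human_words = words_per_minute * minutes_in_18_years
    gpt3_tokens = 499e9
    print(f"{human_words:,} words read")                       # 2,251,670,400
    print(f"{human_words / gpt3_tokens:.2%} of GPT-3's data")  # ~0.45%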
Humans take in a tremendously high bitrate of data via other senses and are able to connect those to the much lower amount of language input such that the language can go much further.
GPT-3 is learning everything it knows about the entire universe just from text.
Imagine we received a 1TB information dump from a civilization that lives in an alternate universe with entirely different physics. How much could we learn just from this information dump?
And from our point of view, it could be absurdly exotic. Maybe their universe doesn't have gravity or electromagnetic radiation. Maybe the life forms in that universe spontaneously merge consciousnesses with other life forms and separate randomly, so whatever writing we have received is in a style that assumes the reader can effortlessly deduce that the author is actually a froth of many consciousnesses. And in the grand spectrum of how weird things could get, this "exotic" universe I have described is really basically identical to our own, because my imagination is limited.
Learning about a whole exotic universe from just an info dump is the task of GPT-3. For instance, tons of our writing takes for granted that solid objects don't pass through each other. I dropped the book. Where is the book? On the floor. Very few bits of GPT-3's training set include the statements "a book is a solid object", "the floor is a solid object", "solid objects don't pass through each other", but it can infer this principle and others like it.
From this point of view, its shortcomings make a lot of sense. Some things GPT fails at are obvious to us having grown up in this universe. I imagine we're going to see an explosion of intelligence once researchers figure out how to feed AI systems large swaths of YouTube and such, because then they will have a much higher bandwidth way to learn about the universe and how things interact, connecting language to physical reality.
This is a fantastically good point. I think things will get even more interesting once the ML tools have access to more than just text, audio and image/video information. They will be able to draw inferences that humans will generally be unaware of. For example, maybe something happens in the infrared range that humans are generally oblivious to, or maybe inferences can be drawn based on how radio waves bounce around an object.
"The universe" according to most human experience misses SO much information and it will be interesting to see what happens once we have agents that can use all this extra stuff in realtime and "see" things we cannot.
As far as I know, all sensory evolution prior to this point has been driven by incremental gains in fitting a changing environment.
True vision requires motive and embodied self. I’m ignorant about the state of the art here, but I’m way more terrified of what these things don’t see than interested in what they could show us. It seems to me that the only human motives accessible to machines are extremely superficial and behavioral based.
Knowledge is not some disconnected map of symbols that results in easily measurable behavior, it has a deep and fundamental relation to conscious and unconscious human motivation.
I don’t see any possible way to give a machine that same set of motives without having it go through our same evolutionary and cultural history, and strongly believe most of our true motives are under many protective layers of behavioral feints and tests and require voluntary connection and deep introspection to fractionally expose to our conscious selves, let alone others, let alone a computer.
These models seem to be amazingly good at combining maps of already travelled territory. Trying to use them to create maps for territory that is new to us seems incredibly dangerous.
Am I missing something here, or is it not true that AI models operate purely on bias? What we choose to measure and train the model on seems to predetermine the outcome; it’s not actually empirical because it can’t evaluate whether its predictions make sense outside of that model. At some point it’s always dependent on a human saying “success/fail”, and seems more like an incredibly complicated kaleidoscope. Maybe they can cause humans to see patterns we didn’t see before, but I don’t think it’s something that could actually make new discoveries on its own.
I think your point is more interesting, but the problem is tabula-rasa starting knowledge. A human isn't born knowing about quantum mechanics, Christoffel symbols or what pushforward measures are. If there were just a method to learn facts from scratch as cheaply as brilliant humans do, it would be so amazing. Even if you count from elementary school years, humans still end up with less energy spent by several orders of magnitude.
Transformers themselves are a lot more effective compared to n-gram models or non-contextual word vectors. I imagine there is something that is to Transformers as Transformers are to word2vec.
Google's Imagen was trained on about as many images as a 6 year old would have seen over their lifetime at 24fps and a whole lot more text. It can draw a lot better and probably has a better visual vocabulary but is also way outclassed in many ways.
Poverty of the stimulus is a real problem and may mean our starting-point architecture from genetics has a lot of learning built in, rather than just being a bunch of uninitialized weights randomly connected. In many species a newborn animal can get up and walk right away.
Definitely. I do think video is much more important than images, because video implicitly encodes physics, which is a huge deal.
And, as you say, there are probably some structural/architectural improvements to be made in the neural network as well. The mammalian brain has had a few hundred million years to evolve such a structure.
It also remains unclear how important learning causal influence is. These networks are essentially "locked in" from inception. They can only take the world in. Whereas animals actively probe and influence their world to learn causality.
The mammalian brain has had a few hundred million years to evolve neural plasticity [1], which is the key function missing in AI. The brain’s structure isn’t set in stone but develops over one’s lifetime and can even carry out major restructuring on a short time scale in some cases of massive brain damage.
Neural plasticity is the algorithm running on top of our neural networks that optimizes their structure as we learn so not only do we get more data, but our brains get better tailored to handle that kind of data. This process continues from birth to death and physical experimentation in youth is a key part of that development, as is social experimentation in social animals.
I think “it remains unclear” only to the ML field. From the perspective of neuroscientists, current neural networks aren’t even superficially at the complexity of axon-dendrite connections with ion channels and threshold potentials, let alone the whole system.
A family member’s doctoral thesis was on the potentiation of signals, and based on my understanding of it, every neuron takes part in the process with its own “memory” of sorts, and the potentiation she studied was just one tiny piece of the neural plasticity story. We’d need to turn every component in the hidden layers of a neural network into its own massive NN with its own memory to even begin to approach that kind of complexity.
> our starting point architecture from genetics has a lot of learning built in
I don't doubt that evolution provided us with great priors to help us be fast learners, but there are two more things to consider.
One is scale - the brain is still 10,000x more complex than large language models. We know that smaller models need more training data, thus our brain being many orders of magnitude larger than GPT-3 naturally learns faster.
The second is social embedding - we are not isolated, our environment is made of human beings, similarly an AI would need to be trained as part of human society, or even as part of an AI society, but not alone.
> Google's Imagen was trained on about as many images as a 6 year old would have seen over their lifetime at 24fps
The six year old has the advantage of being immersed in a persistent world where images have continuity and don’t jump around randomly. For example infants learn very quickly that most objects stay put even when they aren’t being observed. In contrast a dataset of images on the internet doesn’t really demonstrate how the world works.
Drawing involves taking a mental image and converting it into a sequence of actions that replicate the image on a physical surface. Imagen does not do that. I think the images it generates are more analogous to the image a person creates in their mind before drawing something.
I was too loose with that. There is CLIPDraw and others that operate at the stroke/action level but haven't been trained on as much data. Still impressive at the time:
One of the more interesting things I have seen recently is the combination of different domains in models / datasets. The top network of Stable Diffusion combines text-based descriptions with image-based descriptions, where the model learns to represent either text or images in the same embedding; a picture, or a caption for that picture, lead to similar embeddings.
Effectively, this can broaden the context the network can learn. There are relationships that are readily apparent to something that learned images that might not be apparent to something trained only on text, or vice versa.
It will be interesting to see where that goes. Will it be possible to make a singular multi-domain encoder, that can take a wide range of inputs and create an embedding (an "mental model" of the input), and have this one model be usable as the input for a wide variety of tasks? Can something trained on multi-domains learn new concepts faster than a network that is single-domain?
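To make the shared-embedding idea concrete, here's a toy, hand-rolled sketch of a CLIP-style contrastive objective in PyTorch. It is not Stable Diffusion's actual text encoder; the module names and dimensions below are made up for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyJointEmbedder(nn.Module):
        """Map image features and caption features into one shared embedding space."""
        def __init__(self, img_dim=2048, txt_dim=768, emb_dim=512):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, emb_dim)   # stand-in for a vision encoder head
            self.txt_proj = nn.Linear(txt_dim, emb_dim)   # stand-in for a text encoder head
            self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature

        def forward(self, img_feats, txt_feats):
            img = F.normalize(self.img_proj(img_feats), dim=-1)
            txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
            # Similarity matrix: matched image/caption pairs sit on the diagonal.
            logits = self.logit_scale.exp() * img @ txt.t()
            targets = torch.arange(len(img))
            # Symmetric contrastive loss pulls matched pairs together, pushes mismatches apart.
            return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

    model = ToyJointEmbedder()
    loss = model(torch.randn(8, 2048), torch.randn(8, 768))  # a batch of 8 fake image/caption pairs
    loss.backward()

After training, a picture and a caption for that picture land near each other in the shared space, which is what lets text steer image generation.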
They haven't even figured out basic math, so not sure what you would expect to find there. They aren't smart enough to generate structure that doesn't already exist.
Depends on the method. Evolutionary methods can absolutely find structure that we missed, and they often go hand in hand with learning. Like AlphaGo move 37.
AlphaGo had a lot of driver code involved to make it tick, it wasn't just a big network deciding what to do. You would need something similar here; without someone figuring out that driver code you aren't revolutionizing anything with today's neural networks.
Yes, since Go is a very simple game. Making a proper driver for much more complex domains like engineering blueprints is not something we know how to do today.
Edit: Also, you are missing the Go engine in that comment; it can't train without a Go engine to train against that evaluates the results of each move. That Go engine is part of the training algorithm and thus also part of the driver code, so you would need to produce something similar to train a similar AI for other domains. We don't know how to write similar blueprint engines or text evaluation engines, so we can't expect such AI models to produce similar results.
The hypothesis that you can't learn some things from text alone - that you need real-life experience - is intuitive, and I used to think it was true. But there are interesting results from just a few days ago saying that text by itself is also enough:
> We test a stronger hypothesis: that the conceptual representations learned by text only models are functionally equivalent (up to a linear transformation) to those learned by models trained on vision tasks. Specifically, we show that the image representations from vision models can be transferred as continuous prompts to frozen LMs by training only a single linear projection.
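A minimal sketch of what that quote describes, assuming nothing about the paper's actual models: everything is frozen except one linear layer that maps image features into the LM's embedding space as a "continuous prompt". The encoders below are random stand-ins.

    import torch
    import torch.nn as nn

    vision_dim, lm_dim, prefix_len, vocab = 768, 1024, 4, 50000

    # Stand-ins for a frozen vision model and a frozen language model's input embeddings.
    frozen_vision_encoder = nn.Linear(3 * 224 * 224, vision_dim).requires_grad_(False)
    frozen_lm_embeddings = nn.Embedding(vocab, lm_dim).requires_grad_(False)

    # The only trainable piece: a linear map from image features to prompt vectors.
    projection = nn.Linear(vision_dim, lm_dim * prefix_len)

    image = torch.randn(1, 3 * 224 * 224)
    token_ids = torch.tensor([[11, 42, 7]])            # some caption tokens

    img_feats = frozen_vision_encoder(image)
    prefix = projection(img_feats).view(1, prefix_len, lm_dim)   # image -> prompt embeddings
    text_embeds = frozen_lm_embeddings(token_ids)

    # The frozen LM would consume [prefix ; text_embeds]; only `projection` gets updated.
    lm_input = torch.cat([prefix, text_embeds], dim=1)
    print(lm_input.shape)   # torch.Size([1, 7, 1024])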
The claim isn’t that you can’t learn it from text, but rather that this is why models require so much text to train on - because they’re learning the stuff that humans learn from video.
The key issue is learning effort (such as energy vs time). Congenitally deaf-blind humans with no accompanying mental disabilities as a shared cause can learn as children just fine without any video or sound from comparatively low bandwidth channels like proprioception and touch.
Another issue is that what we really care about is scientific reasoning, and there, if anything, nature has given an anti-bias, at least at the level of interfacing with facts. People aren't born biased towards learning metric tensors and Christoffel symbols, but it takes only a few years at a handful of hours a day, using a small number of joules, for many humans to get it (I'm counting from all grade-school prerequisites vs. GPU watts x time). Much fewer for genius children.
I'm testing this argument out, but doesn't this apply to all tasks, not just language? I can learn to paint from scratch in what, like 300 attempts? 1000 attempts? It takes far more examples to train a guided diffusion model. I'd struggle to believe that our brains are hardwired for painting.
> Humans take in a tremendously high bitrate of data via other senses and are able to connect those to the much lower amount of language input such that the language can go much further.
They don't. Human bitrates are quite low, all things considered. The eyes, which by far produce the most information, only have a bitrate equivalent to ~2 kbps:
The rest of the input nerves don't bring us over 20 kbps.
The average image recognition system has access to more data and can tell the difference between a cat and a banana. A human has somewhat more capability than that.
I think the link says a single synapse does 2 kbps, not the whole visual cortex. There are 6 trillion (6x10^12) synapses (3 trillion per hemisphere) in the visual cortex according to https://pubmed.ncbi.nlm.nih.gov/7244322/
If we play "the bus filled with ping-pongs" with that information: it is a 3D structure so if you assume cortex is a perfect cube that feeds to something right behind it, you will get (10^12)^(2 (dimensions)/3 (of 3 dimensions)) channels, e.g. 10^8 channels 2kbps each. E.g. about 25GB/s. Which is less than an order of magnitude off from an estimate you would get from 8000x8000 resolution per eye True Color at 24fps - 9GB/s.
Humans also have millions of years of evolution that have effectively pre-trained the structure and learning ability of the brain. A baby isn't born knowing a language but is born with the ability to efficiently learn them.
Indeed, there is a certain hardcoding that can efficiently synthesize language. Doesn't that raise the question: what is the missing hardcoding for AI that would enable it to learn from much smaller samples?
There is a great paper, Weight Agnostic Neural Networks [0], that explores this topic. They experiment with using a single shared weight for a network while using an evolutionary algorithm to find architectures that are themselves biased towards being effective on specific problems.
The upshot is that once you've found an architecture that is already biased towards solving a specific problem, then the training of the weights is faster and results in better performance.
From the abstract, "...In this work, we question to what extent neural network architectures alone, without learning any weight parameters, can encode solutions for a given task.... We demonstrate that our method can find minimal neural network architectures that can perform several reinforcement learning tasks without weight training. On a supervised learning domain, we find network architectures that achieve much higher than chance accuracy on MNIST using random weights."
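A toy version of the weight-agnostic evaluation (this is my own sketch, not the paper's search code): fix an architecture, tie every connection to a single shared weight, and score the architecture by how well it does across a sweep of that weight. A real WANN run would then evolve the connectivity masks to maximize this score.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy task: classify whether a 2D point lies inside the unit circle.
    X = rng.uniform(-2, 2, size=(500, 2))
    y = (np.linalg.norm(X, axis=1) < 1).astype(int)

    def evaluate_architecture(mask_hidden, mask_out, shared_w):
        """Run a fixed 2-layer net where every active connection has the same weight."""
        h = np.tanh(X @ (mask_hidden * shared_w))     # hidden layer, tied weights
        logits = h @ (mask_out * shared_w)            # output layer, tied weights
        preds = (logits.ravel() > 0).astype(int)
        return (preds == y).mean()

    # A candidate "architecture" is just a connectivity pattern (which edges exist).
    mask_hidden = rng.integers(0, 2, size=(2, 8)).astype(float)
    mask_out = rng.integers(0, 2, size=(8, 1)).astype(float)

    # Weight-agnostic score: average accuracy over a sweep of shared weight values.
    scores = [evaluate_architecture(mask_hidden, mask_out, w)
              for w in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)]
    print(f"mean accuracy with untrained, tied weights: {np.mean(scores):.2f}")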
I agree. If you look at animals, it's also clear that the scaling hypothesis breaks down at some point, as all measures of brain size (brain-to-body mass ratio, etc.) fail to capture intelligence. And animals have natural neural networks.
If you think about it, neural networks have roamed the earth for millions of years - including a genetic algorithm for optimizing the hardware. And yet only extremely recently did something like humans happen. Why?
The amount of training and processing power which happened naturally through evolution beats current AI research by several orders of magnitude. Yes, evolution isn't intelligent design. But the current approach to AI isn't intelligent design either.
The brain has about 1,000T synapses and GPT-3 has 175B parameters, even though a parameter is much simpler than a synapse. So the scale of the brain is at least 5700x that of GPT-3. It seems normal to have to compensate by using 200x more training data.
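For the arithmetic (the synapse count is the rough estimate used above, not a precise figure):

    brain_synapses = 1e15          # ~a quadrillion synapses
    gpt3_params = 175e9
    print(brain_synapses / gpt3_params)   # ~5714x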
You're right about reference [2]; it can shift the numbers a bit (a word is typically a bit more than one token, with rare words splitting into several).
Additionally as others have pointed out, we don't live entirely in the text world. So, we have the nice benefit of understanding objects from visual and proprioceptive inputs, which is huge.
The poverty-of-the-stimulus argument made well known by Noam Chomsky et al. is certainly worth discussing in academia; however, I am not as moved by these arguments about the stark differences in input required between humans and ML as I once was.
In image processing, for example, sending 10k images in rapid succession with no other proprioceptive inputs, time dependencies, or agent-driven exploration of spaces puts these systems at an enormous disadvantage for learning certain phenomena (classes of objects or otherwise).
Of course there are differences between the systems, but I'm becoming more skeptical of the claim that the newer ML systems can't learn as much as biological systems given the same input (obviously this is where a lot is hidden).
Thank you for the tokens-to-words factor! Much appreciated.
I'm definitely in agreement that multi-task models represent an ability to learn more than any one specialized model, but I think it's a bit of an open question whether multi-task learning alone can fully close the digital-biological gap. Of course I'd be very happy to be proven wrong on this though by empirical evidence in my lifetime :)
What’s missing is interaction/causation, and the reason is that we can scale things more easily without interaction in the data gathering loop. Training a model with data gathering in the loop requires gathering more data every time the model takes a learning step. It’s slow and expensive. Training a model on pre-existing data is much simpler, and it’s unclear whether we’ve reached the limits of that yet.
My prediction is we’ll get ‘good enough for prod’ without interactive data, which will let us put interactive systems in the real world at scale, at which point the field’s focus will be able to shift.
One way to look at it is active learning. We all know the game where I think of a number between 0 and 100 and you have to guess it, and I’ll tell you if it’s higher or lower. You’ll start by guessing 50, then maybe 25, and so on, bisecting the intervals. If you want to get within +/-1 of the number I’m thinking of, you need about six data points. On the other hand, if you don’t do this interactively, and just gather a bunch of data before seeing any answers, then to get within +/-1 you need about 50 data points. The interactivity means you can refine your questions in response to whatever you’ve learned, saving huge amounts of time.
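Here's that game as a toy script, just to make the 6-vs-50 comparison concrete (the numbers match the example above, nothing more is implied):

    def interactive_guesses(secret, lo=0, hi=100):
        """Binary search with higher/lower feedback; returns the number of guesses used."""
        guesses = 0
        while hi - lo > 2:                    # stop once the interval pins the secret to +/-1
            guesses += 1
            mid = (lo + hi) // 2
            if mid < secret:
                lo = mid
            elif mid > secret:
                hi = mid
            else:
                return guesses
        return guesses

    def passive_samples(lo=0, hi=100):
        """Without feedback you must pre-commit to probing every other integer."""
        return len(range(lo + 1, hi, 2))      # 50 probes to guarantee being within +/-1

    print(max(interactive_guesses(s) for s in range(0, 101)))   # 6 guesses in the worst case
    print(passive_samples())                                    # 50 probes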
Another way to look at it is like randomized controlled trials. To learn a compact idea (more X means more Y), you can randomize X and gather just enough data on Y to be confident that the relationship isn’t a coincidence. The alternative (observational causal inference) is famously harder. You have to look at a bunch of X’s and Y’s, and also all the Z’s that might affect them, and then get enough data to be confident in this entire structure you’ve put together involving lots of variables.
The way ML has progressed is really a function of what’s easy. If you want a model to learn to speak english, do you want it to be embodied in the real world for two years with humans teaching it full time how the world and language relate? Or is it faster to just show it a terabyte of english?
tl;dr observational learning is much much harder than interactive learning, but we can scale observational learning in ways we can’t scale interactive learning.
>deep learning systems are at least an order of magnitude less data efficient than the systems they hope to replicate.
While true on the surface, you have to also consider that there is a vast quantity of training data expressed in our DNA. Our 'self' is a conscious thought, sure, but it's also unconscious action and instinct, all of which is indirect lived experience of our forebear organisms. The ones that had a slightly better twitch response to the feel of an insect crawling on their arm were able to survive the incident, etc. Our 'lizard brains' are the result of the largest set of training data we could possibly imagine - the evolutionary history of life on earth.
I think comparing to humans is a bit of a distraction, unless what you care about is replicating the way human intelligence works in AI. The mechanisms by which learning is done (in these cases self-supervised and supervised learning) are not at all the same as humans have, so it's unsurprising the qualitative aspects are different.
It may be argued we need more human-like learning mechanisms. Then again, if we need internet-scale data to achieve human-level general intelligence, so what? If it works it works. Of course, the comparison has some value in terms of knowing what can be improved and so on, especially for RL. But I wouldn't call this a 'troubling problem'.
Brains do not actually work very similarly to artificial neural networks. The connectionist approach is no longer favored, and human brains are not arranged in regular grids of fully interconnected layers. ANNs were inspired by how people thought the brain worked more than 50 years ago. Of course, ANNs are meant to work and solve practical problems with the technology we have. They're not simulations.
Because the whole industry is wrong. ML is incapable of general intelligence, because that's not what intelligence is. ML is the essential component with which one interfaces with the universe, but it's not intelligence, and never will be.
As a ML Vision researcher, I find these scaling hypothesis claims quite ridiculous. I understand that the NLP world has made large strides by adding more attention layers, but I'm not an NLP person and I suspect there's more than just more layers. We won't even talk about the human brain and just address that "scaling is sufficient" hypothesis.
With vision, pointing to Parti and DALL-E as scaling is quite dumb. They perform similarly but are DRASTICALLY different in size. Parti has configurations with 350M, 750M, 3B, and 20B parameters. DALL-E 2 has 3.5B. Imagen uses T5-XXL, which alone has 11B parameters, just for the text part.
Not only this, there are major architecture changes. If scaling was all you needed then all these networks would still be using CNNs. But we shifted to transformers. THEN we shifted to diffusion-based models. Not to mention that Parti, DALL-E, and Imagen have different architectures. It isn't just about scale. Architecture matters here.
And to address concerns: diffusion (invented decades ago) didn't work because we just scaled it up. It worked because of engineering. It was largely ignored previously because no one got it to work better than GANs. I think this lesson should really stand out: we need to consider the advantages and disadvantages of different architectures and learn how to make ALL of them work effectively. In that manner we can combine them in ideal ways. Even LeCun is coming around to this point of view despite previously being on the scaling side.
But maybe you NLP folks disagree. The experience in vision, though, is far richer than just scaling.
I agree - I think scaling laws and scaling hypothesis are quite distinct personally. Scaling hypothesis is 'just go bigger with what we have and we'll get AGI', vs scaling laws are 'for these tasks and these models types, these are the empirical trends in performance we see'. I think scaling laws are still really valuable for vision research, but as you say we should not just abandon thinking about things beyond scaling even if we observe good scaling trends.
Yeah I agree with this position. It is also what I see within my own research. But also in my own research I see the vast importance of architecture search. This may not be what the public sees, but I think it is well known to the research community or anyone with hands on experience with these types of models.
DALL-E's model is a multimodal implementation of GPT-3 with 12 billion parameters which "swaps text for pixels", trained on text-image pairs from the Internet. DALL-E 2 uses 3.5 billion parameters, a smaller number than its predecessor.
At the other extreme, some recent works [1,2] show why it’s sometimes better to scale down instead of up, especially for some humanlike capabilities like generalization:
If greater parameterization leads to memorization rather than generalization, it's likely a failure of our current architectures and loss formulations rather than an inherent benefit of "fewer parameters" improving generalization. Other animals do not generalize better than humans despite having fewer neurons (or their generalizations betray a misunderstanding of the number and depth of subcategories there are for things, like when a dog barks at everything that passes by the window).
I think something that has concerned me with the concept of scaling to AGI is the concept of "adversarial examples". Small tweaks that can be made to cause unpredictable behavior in the system. At a high level these are caused by unexpected paths in high dimensional model weight space that don't align with our intuition. This problem in general seems to get worse as the size of the weights grow.
From a value perspective a very high fidelity model with extremely unexpected behavior seems really low value since you need a human there full time to make sure that the model doesn't go haywire that 1-5% of the time
Suppose an adversarial example is recognized correctly by a human, yet a model recognizes it wrongly.
Therefore, the model uses other information than humans in order to classify, and the information it uses is wrong.
Therefore, the model needs to be trained ON THE ADVERSARIAL EXAMPLES in order to gain robustness.
Similarly to GANs using a classifier adversary for augmenting a generative network, one could use a generative adversary for augmenting a classifier network.
You can repeat until the adversarial examples look ambiguous even to a human.
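A minimal sketch of that loop in PyTorch, using FGSM as the "generative adversary" for simplicity; the model and data here are placeholders, not a recommendation for any particular setup:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    eps = 0.1                                   # adversarial perturbation budget

    def fgsm(x, y):
        """Craft adversarial examples by nudging inputs along the sign of the loss gradient."""
        x = x.clone().requires_grad_(True)
        loss_fn(model(x), y).backward()
        return (x + eps * x.grad.sign()).detach()

    # Placeholder batch; in practice this comes from real training data.
    x, y = torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,))

    for step in range(10):
        x_adv = fgsm(x, y)                      # examples the current model tends to get wrong
        # Train on clean AND adversarial inputs so the model stops relying on the exploited features.
        loss = loss_fn(model(x), y) + loss_fn(model(x_adv), y)
        opt.zero_grad()
        loss.backward()
        opt.step()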
It would be great to see more focus on Chinchilla's result that most large models were quite undertrained with respect to optimal reduction in test loss.
The largest models have ~10^11 parameters? There is still a lot of "scaling" to be done if we want to reach ~10^15 parameters, the rough number of synapses in the brain.
If one FLOP is one parameter, we do have the capacity with modern-day supercomputers? The bottleneck then probably becomes data, unless overparameterization holds?
I think this is true for operations where inference time is not very critical. Think content generation.
For example, my work in grad school is deploying AI-based control on drones such that they are fault-tolerant. We can have a stronger model at the ground station communicating with the vehicle, which introduces lag. Or the ground station model is more supervisory, and bare metal control is deferred to your classical algorithms. Or we run a smaller, faster network(s) on board which can act and learn fast.
The other thing is model interpretability. For some of my work with drones or smart buildings, the stakeholders wouldn't let a neural network run on a real vehicle or building until it was thoroughly certified through whatever criteria they had. Easier with a smaller network.
I’m grateful for recent open release of models like whisper and stable diffusion which you can run on your own hardware.
However the core also seems to be training data. SEER from meta/Facebook is a billion images, far bigger than imagenet. OpenAI doesn’t release its training dataset for whisper.
It seems we are in a place where AI will most likely increase the digital divide and wealth gap because the big datasets + large model training ability will only be accessible to mega corps.
"Scaling" means increasing the number of parameters. Parameters are just the database of the system. At 300GB of parameters, we're talking models which remember compressed versions of all books ever written.
This is not a path to "AGI", this is just building a search engine with a little better querying power.
"AI" systems today are little more than superpositions of google search results, with their parameters being a compression of billions of images/documents.
This isn't even on the road to intelligence, let alone an instance of it. "General intelligence" does not solve problems by induction over billions of examples of their prior solutions.
And exponential scaling in the amount of such remembering required is a fatal trajectory for AI, and likewise an indication that it doesn't deserve the term.
No intelligence is exponential in an answer-space; indeed, I'd say that's *the whole point* of intelligence!
We already know that if you compress all possible {(Question, Answer)} pairs, you can "solve" any problem trivially.
That's why the Chinchilla paper (given a single paragraph in the article) is so important; it gives a scaling equation that puts a limit on the effect of increasing parameters. Generally, for the known transformer models, the reduction in loss from having infinite parameters is significantly less than the reduction in loss from training on infinite data. Most large models are very undertrained.
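For reference, the parametric form fit in the Chinchilla paper is L(N, D) = E + A/N^alpha + B/D^beta, with N the parameter count and D the training tokens. The constants below are approximately the published fit as I recall it, so treat the exact outputs as illustrative:

    # Chinchilla-style parametric loss; constants roughly the Hoffmann et al. (2022) fit.
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    def loss(N, D):
        return E + A / N**alpha + B / D**beta

    print(loss(175e9, 500e9))         # roughly GPT-3 scale, with the token count cited upthread
    print(loss(float("inf"), 500e9))  # what infinite parameters buys you at the same data budget
    print(loss(70e9, 1.4e12))         # Chinchilla scale: fewer parameters, far more data

The infinite-parameter line only removes the A/N^alpha term, which is exactly the "limit on the effect of increasing parameters" mentioned above; the data term keeps the loss floored until D grows.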
My issue is less with the strategy as NN optimisation, and more, as the OP says, with it being a road to general intelligence.
There's an a priori "limit" on the form of any time/space complexity "intelligence" can have: it cannot be exponential.
We know that if you exhaust all of the infinitely many ways of describing a system by facts (/data), then you don't need intelligence. Whatever intelligence is, it's a strategy animals use for competent action in the face of their bodies not being able to exponentially expend energy to solve every trivial problem.
Exponential complexity is a killer to the project of building intelligence in every respect. The universe, a priori, cannot require an exponential space complexity description of it to be parsable: since that would require exponentially many universes.
Exponential scaling, for me (via ordinary conceptual analysis), seems a hallmark of any system not using intelligence. Exp. scaling is "the shortcut" to solving the problem without intelligence. It's the hallmark of the system not building O(log_{large number}) rich representations of the world.
Everything of interest in ML networks is occurring in the abstractions that emerge in training in deep multi-layer networks.
At the crudest level this immediately provides for more than canned lookup as asserted; analogical reasoning is a much-documented emergent property.
But analogies are merely the simplest, first-order abstraction, which are easy for humans to point at.
Inference and abstraction across multiple levels mean the behavior of these systems is utterly unlike simple stores. One clear demonstration of this is the effective "compression" of image gen networks. They don't compress images. For lack of any better vocabulary, they understand them well enough to produce them.
The hot topic is precisely whether there are boundaries to what sorts of implicit reasoning can occur through scale, and, what other architectures need to be present to effect agency and planning of the kind hacked at in traditional symbolic systems AI.
It might be worthwhile to read contemporary work to get up to speed. Things are already a lot weirder than we have had time to internalize.
Can they be said to understand the images if a style transfer model they produce is image dependent with an unstable threshold boundary?
Or when they make an error similar to pareidolia all the time, seeing faces where there are none?
When they do not understand how to paint even roughly fake text?
Serious answer: they can, just not as well as we do.
The problem is not really what they are doing, which is more akin to what, e.g., our own visual processing stack does...
...it's that we don't have language for a lot of the territory these systems operate in.
We are used to a world defined by very clear boundaries between agents (which e.g. have state, and agency, and carry world representations which we can infer and reason about and behave against) and tools (potentially complex but inert, at best narrowly-constrained),
these exist in the in-between where until now we only really had "animals" or maybe "institutions" but that is more of an abstract analogy.
Like, say, an insect: they understand, but poorly, compared to us as the gold standard. And we have silos of deep specialization (language) and world knowledge beyond the purely visual. Eventually it's clear there will be a knitting together of systems which provide those, as we cobble together something that "understands" more than just the visual.
But already when you look closer it's clear they already understand some things as well as we do, or better.
Pareidolia is a fabulous word and frame, I've used the behavior of these models also to explain what hallucinations are like (and are) when on psychedelics to people who haven't experienced them.
Anyway. What does a toddler understand? Or a dog? Or...
On the bright side I am seeing language come into use before our eyes. The most established and interesting one to emerge in common use this year IMO is the use of "latent" to describe that which these things understand... they render that which is latent in their complex encodings. A great adaptation of a domain specific term of art into lay use.
The tone of this betrays a possibly more argumentative than collaborative conversation style than that which I may want to engage with further (as seems common, I've noticed, amongst anti-connectionists), but I did find one point interesting for discussion.
> Parameters are just the database of the system.
Would any equation's parameters be considered just the database then? The C in E=MC^2, the 2 in a^2+b^2=c^2?
I suppose those numbers are basically a database, but the relationships (connections) they have to the other variables (inputs) represent a demonstrable truth about the universe.
To some degree every parameter in a nn is also representing some truth about the universe. How general and compact that representation is currently is not known (likely less than we'd like of both traits).
There's a very literal sense in which NN parameters are just a DB. As in, it's fairly trivial to get copyrighted verbatim output from a trained NN (e.g., Quake source code from GitHub Copilot, etc.).
"Connectionists" always want to reduce everything to formulae with no natural semantics and then equivocate this with science. Science isnt mathematics. Mathematics is just a short hand for a description of the world made true by the semantics of that description.
E=mc^2 isn't true because it's a polynomial, and it doesn't mean a polynomial, and it doesn't have "polynomial properties", because it isn't about mathematics. It's about the world.
E stands for energy, m for mass, and c for a geometric constant of spacetime. If they were to stand for other properties of the world, in general, the formulae would be false.
I find this "connectionist supernaturalism" about mathematics deeply irritating, it has all the hubris and numerology of religions but wandering around in a stolen lab coat. Hence the tone.
What can one say or feel in the face of the overtaking of science by pseudoscience? It seems plausible to say now, today, more pseudoscientific papers are written than scientific ones. A generation of researchers are doing little more than analysing ink-blot patterns and calling them "models".
The insistence, without explanation, that this is a reasonable activity pushes one past tolerance on these matters. It's exasperating... from psychometrics to AI, the whole world of intellectual life has been taken over by a pseudoscientific analysis of non-experimental post-hoc datasets.
This discussion (the GP and your response) perhaps suggests that a way to evaluate the intelligence of an AI may need to be more than the generation of some content, but also citations and supporting work for that content. I guess I'm suggesting that the field could benefit from a shift towards explainability-first models.
I'm not anti-connectionist, but if I were to put myself in their shoes, I'd respond by pointing out that in E=MC^2, C is a value which directly correlates with empirical results. If all of humanity were to suddenly disappear, a future advanced civilization would re-discover the same constant, though maybe with different units. Their neural networks, on the other hand, probably would be meaningfully different.
Also, the C in E=MC^2 has units which define what it means in physical terms. How can you define a "unit" for a neural network's output?
Now, my thoughts on this are contrary to what I've said so far. Even though neural network outputs aren't easily defined currently, there's some experimental results showing neurons in neural networks demonstrating symbolic-like higher-level behavior:
Part of the confusion likely comes from how neural networks represent information -- often by superimposing multiple different representations. A very nice paper from Anthropic and Harvard delved into this recently:
300 GB is nothing compared to the vastness of information in the universe (hence it fitting on a disk). AI is approximating a function, and the function they are now learning to approximate is us.
From [1], with my own editing...
When comparing current model performance with human performance:
> ...[humans] can achieve closer to 0.7 bits per character. What is in that missing >0.4?
> Well—everything! Everything that the model misses. While just babbling random words was good enough at the beginning, at the end, it needs to be able to reason [its] way through the most difficult textual scenarios requiring causality or commonsense reasoning... every time that it lacks the theory of mind to compress novel scenes describing the Machiavellian scheming of a dozen individuals at dinner jockeying for power as they talk...
> If we trained a model which reached that loss of <0.7, which could predict text indistinguishable from a human, whether in a dialogue ...how could we say that it doesn’t truly understand everything?
We could make it scale in a very human way. I could foresee a near future where a person is not sentenced in court but rather branded an outlaw to an AI online, which is then deployed on their life. It learns as it goes to find and inflict psychological and social manipulation tactics as criminal penalties, up to and including attempting to cause death. It is a way for judges and society both to not actively murder, for ethical reasons, and to not make their own populations individually more violent by having their citizens hunt the convicted for bounties like they may have in traditional and ancient outlaw exile. It spreads the source of punishment evenly among participants, like parading someone in stocks through the streets in front of the mob, but on the micro, psychological scale.
The AI stalks the convicted and finds ways to cause suffering, but learning from performing it against thousands of people, many simultaneously and even adversarially, optimizing for the most reflected cruelty it interprets from posts it gauges sentiment of in its environment. It's a plausible reason for why Roko's Basilisk would be made to exist, as we would have to think it was something we were inflicting on the other or another. Criminal punishment via AI as a means to wash our hands of our judgments seems like just the sort of thing we might sadly do. Surely it will only be reserved for the most serious of criminals.
You want to make sure the offender doesn't cause any more suffering to others, and you want to help them become somebody better. I know the US prison system exists, but we officially dropped torture as a punishment some time ago.