Every model learned by gradient descent is approximately a kernel machine (2020) (arxiv.org)
176 points by Anon84 11 months ago | 136 comments



This connects to a suspicion I have that a lot of the most impressive results for LLMs and Sora, the ones that have caught public attention, are due to memorization. This is not to say that they can't generalize or mix learned patterns - I think they can do that too, which is quite impressive to a researcher - but those abilities are demonstrated on tasks that the public might not care as much about, and the performance there isn't always great either.

Combined with opacity in train/test splits, this suggests a type of data laundering, where the extreme case is that much of Sora (for example) is regurgitating things seen in training, content perhaps originally made by a human using a game engine, or drone video, but it's only news because a model generated it. Of course, I don't have strong evidence for this. But one of the most impressive parts of Sora is the level of detail, much of which is not specified by the text prompt and still can't fully be accounted for by our knowledge that OAI expands text prompts for image/video generation behind the scenes. Where is this precise detail coming from exactly? I suggest it's from memorization.


François Chollet, author of Keras, lays it out pretty clearly that LLMs are just memorizing [0] (there are more links if you follow that thread).

I think LLMs are incredible, and spend most of my days working with them closely, but they are not nearly as close to “AGI” as people think, primarily due to their inability to really generalize.

At the end of the day LLMs aren’t that different from old school n-gram Markov chains, except rather than working on n-grams, they’re working in a (very sophisticated) latent space. Their power really lies in these incredible latent language spaces we’re still just starting to understand.

In all my years of tech, the “AI” space is the most curious hype-bubble, since the things people expect to happen are entirely out of line with what is possible, while at the same time the potential of these models is still, imho, underexplored and largely ignored by the vast majority of people attempting to build things with them.

99% of the people I know working in this space are just calling APIs and trying to do some variant of code generation, while a small minority of people are really trying to figure out what’s going on in these models and what can be done with them successfully.

0. https://twitter.com/fchollet/status/1755250582334709970


We can prove that transformers can do computation beyond memorization, and that they can learn it from examples, unless you think learning an algorithm is memorization.

https://arxiv.org/abs/2301.05217

https://arxiv.org/abs/2310.16028


Surely learning an algorithm is memorization? I still remember the physics equations from when I was at school (a long time ago) because they were drilled into us repeatedly. Stuff like v = u + at, or s = ut + 1/2 at^2, etc.

I don't really recall how they were derived, but I did memorize them. It definitely doesn't mean I'm good at physics or understand the deeper meaning of these equations. I also had to memorize bubble sort, merge sort, heap sort, quicksort, etc. when I was in university, but I don't think I could invent these sorts from first principles without looking them up.

Memorization doesn't really equal understanding, it's just memorization.


> Surely learning an algorithm is memorization?

Not in this context, no. An LLM is never given any algorithm to memorise. It is given only input->output sets, and it "learns" what the algorithm is from those sets.

We know that it does this because it is able to generalize that algorithm to inputs and outputs outside of its set of training examples. So we know that it doesn't only memorise which input connects to which output, and regurgitate that information. It has come to "understand" the formula that connects the input to the output, without ever being given that formula.


It’s fitting a function to data and guessing the value of the function for a new input. It does not know what is under the hood of the function, nor can it see the implications when the function produces a wrong value, etc.


It can do that. Verifying an answer is just another algorithm it can learn.

LLMs mostly can't do math but that, like most of their other flaws, is because of the tokenizer.


> Surely learning an algorithm is memorization?

Yes. In a sense, every algorithm can be reduced to memorization, where you precompute the output for every possible input and store the pairs in a giant lookup table.
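To make that concrete with a toy Python sketch (my own example, nothing to do with any particular model): over a bounded input space, the "real" algorithm and the lookup table built from it are indistinguishable from the outside.

  # A real algorithm: addition of two small non-negative integers
  def add(a, b):
      return a + b

  # The "memorized" version: precompute every input/output pair once
  LOOKUP = {(a, b): add(a, b) for a in range(100) for b in range(100)}

  def add_memorized(a, b):
      # No arithmetic happens here, only retrieval
      return LOOKUP[(a, b)]

  assert all(add(a, b) == add_memorized(a, b)
             for a in range(100) for b in range(100))

The catch is that the table only covers the inputs that were enumerated up front; the interesting question in this thread is whether gradient-descent models merely do this, or also interpolate sensibly in between.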


Maybe they're doing something we don't really have a name for yet? This is why it causes so much controversy.

Not really generalizing, not memorizing, maybe approximating ?


There is certainly a lot of discussion around semantics to be had before we can agree on what exactly an LLM is capable of.

We know that LLMs tend not to be good at math. Some people will say that the fact that an LLM cannot sum two numbers demonstrates they cannot generalise. Yet others would say the way LLMs calculate sums is generalisation because it mirrors our own process of addition which works mostly by memorisation and a lot of double checking.

My perspective is that the "thinking" that LLMs do is a lot closer to the kind that humans do which is to say a lot of pattern matching but with no fundamentally precise logic underlying it. If LLMs are flawed in some way then humans are also flawed in the same way.


There is a name for this already: associative memory. That’s how you catch a ball: you condition your memory of catching a ball with proprioceptive and visual input, much like a multimodal transformer. There’s no thinking involved - you wouldn’t be able to catch it if you had to think.


An associative memory can model the truth tables of logic circuits, so going from associative memory to arbitrary computation pretty much just requires adding state and a loop. And we do provide state and a loop. As such, while that does not prove that any given model is capable of reasoning, describing what goes on at any step as associative memory says nothing meaningful about the computational power of the wider system.
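A toy Python sketch of what I mean (my own example, not any particular model): the "memory" here is just eight stored input->output pairs for a 1-bit full adder, yet with one carry bit of state and a loop it adds binary numbers of any width.

  # The "associative memory": 8 memorized input -> output pairs (1-bit full adder)
  FULL_ADDER = {
      (0, 0, 0): (0, 0), (0, 0, 1): (1, 0), (0, 1, 0): (1, 0), (0, 1, 1): (0, 1),
      (1, 0, 0): (1, 0), (1, 0, 1): (0, 1), (1, 1, 0): (0, 1), (1, 1, 1): (1, 1),
  }

  def add_bits(a_bits, b_bits):
      # Add two equal-length little-endian bit lists using only table lookups,
      # one carry bit of state, and a loop.
      carry, out = 0, []
      for a, b in zip(a_bits, b_bits):
          s, carry = FULL_ADDER[(a, b, carry)]
          out.append(s)
      return out + [carry]

  # 8 memorized rows suffice to add numbers of any width: 11 + 13 = 24
  assert add_bits([1, 1, 0, 1], [1, 0, 1, 1]) == [0, 0, 0, 1, 1]

Nothing in the table "knows" arithmetic; the capability comes from the state and the loop wrapped around it, which is exactly the point about the wider system.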

It's very possible the way we train current models will turn out to place fundamental limits on the abilities the models will have, but that will not be because the models act as associative memories.


I think the current models will be a fundamental part of larger, more comprehensive models, and they will handle the “easy” parts, much like we, humans, use our associative memory to be able to do things in realtime, “without thinking”. Yann LeCun wrote a paper on this which more people should read: https://openreview.net/pdf?id=BZ5a1r-kVsf. Moreover, Meta’s research is starting to move in that direction https://ai.meta.com/blog/v-jepa-yann-lecun-ai-model-video-jo...


> Not really generalizing, not memorizing, maybe approximating ?

The term you are searching for is confabulating.


"Vibing"?


No need to come up with any new name, we already have it:

1. autocompletion or

2. next token prediction and

3. (reverse) diffusion


They might do "computation" and seem to be "intelligent", however they do lack logic and rigor, and cannot explain their reasoning. In contrast to old-school AI.

Such "intelligent" monkeys might be able to write reports, lead governments or lead wars, but in engineering you need more skills than that. Which leads us to the lack of a proper definition of AGI. (https://en.wikipedia.org/wiki/Artificial_general_intelligenc...)


I guess what constitutes memorization depends on what you consider "learning an algorithm". Memorizing to humans doesn't really mean learning the exact input/output pairs per se. Like a student might "learn" an algorithm for differentiation (d/dx x^n = n x^(n-1)) and then differentiate 732638 x^2 just fine despite never seeing it before, but then tell you the derivative of yx^2 with respect to y is 2yx, or something. Did they really learn how to differentiate or did they just learn a common vibe around it? When teachers see that sort of regurgitation, they call it memorization, despite the input being unique from what the student had seen in the past.


A lot of the words he uses in that tweet aren't well defined. e.g. memorization, dataset (does he mean the literal words/tokens or any token that is close in space after embedding?), pattern, category, program. The tweet is practically meaningless. I'm not criticizing him because his blog post is nuanced and he clearly understands what he's talking about, but that tweet almost certainly means something quite specific to him and he's communicating quite poorly.

As you mention, there is a sophisticated representation of the tokens. It's so sophisticated that one may reasonably stop calling them tokens (or, even data) and start calling them "concepts". Now, if someone (or something) has memorized how all the concepts go together... that's pretty darn intelligent.


I think there are a lot of people who got into tech in 2020 - new programmers and technologists who made a life change that year.

I think they were a big part of the crypto bubble. Lots of talent, hungry for that sweet startup gold, but without the technical background to really know what’s going on.

I believe these same groups are operating in the same way with AI. Recklessly bashing together APIs and cloud services to create MVPs.

It’s all the worst parts of startup culture concentrated.

Anyway, that’s why i think most of the AI space rn is just people calling APIs and acting like they discovered fire.

</salty rant>


Francois is not a researcher, however. LLMs aren’t just plain “memories”. They are very explicitly _associative_ memories. And it just so happens that this is mostly what our own brains do, too.

Actual cognition is slow and expensive for us, and we try to use it as little as possible, filling in what we can with easy, associative, low energy, near instant stuff.

Therein lies the reason why AI can be considered a boon for us humans. If machines took over the mundane work that just drains our energy and doesn't add much value, we could finally have the time to actually do what they can’t - think deeply about stuff, with their help where we find our faculties lacking. Rocket for the mind, if you will, rather than a bicycle.


I think even "actual cognition" is highly unlikely to need anything much more than associative memories plus some state and a loop. The expense being having to "execute" a large number of steps rather than "just" effectively pulling learned results from "cache".

To be very reductive, an associative memory can hold a truth table. Put minimal state, IO and a loop around that and you have a universal Turing machine. Which is why the "it's just memory" or "it's just Markov chains" is so tedious - it says near nothing about the computational power of the system including the model.

There's plenty of reason, of course, to question to what extent we know the abilities of the models, but when people dismiss it as "memory" or "just statistics" or "just Markov chains" I usually take it as a signal they don't understand how few limitations that imposes.


I wonder if cognition or intelligence is actually "brain damage". Your bad memory leads you down erroneous paths, but some of these will produce a eureka. Genetic evolution and natural selection are essentially that.

If your memory is too bad, then you are either insane or in an advanced state of Alzheimer's. If you have enough stable paths to lead a quasi-normal life, then you become an inventor or an artist, or something atypical.

Hallucination is the feature, not the bug.


You must have a pretty impressive resume if you don’t consider François a researcher! Wikipedia would disagree with you [0], as would anyone who has had any interaction with him on the subject.

0. https://en.wikipedia.org/wiki/Fran%C3%A7ois_Chollet


Try to find a single paper on Transformers or LLMs in general in Francois’ scientific output: https://dblp.org/pid/116/8242.html

Don’t get me wrong, Keras is impressive, and Francois is impressive as well. But for insight on LLMs you should probably listen to people who specialize in them.


If you measure knowledge by the number of publications, then LLMs know nothing.


They made the argument he's not a researcher in this field, not that he doesn't know anything. And an LLM is indeed not a researcher in this field either.


It appears that my previous comment lacked clarity. I assumed the inherent absurdity and illogicality of the statement would be self-evident. Dismissing François's opinion, a seasoned professional with extensive expertise in model development and the creator of ARC, solely because he asserts that LLMs are far from achieving AGI and using publication count as the metric for evaluating him is not a well-reasoned argument. While LLMs and transformers are indeed remarkable achievements and will undoubtedly reveal more properties, they have yet to exhibit any true signs of "intelligence."


Acknowledging that they have not yet shown signs of "true intelligence" is a vastly more moderate claim than the person above ascribed to him.


An AGI might be something that can harness the LLM but also self learn.


More like AI will finally be the bicycle for the mind.


IDK, credit where credit is due: traditional computers got us much further than we’d be able to go on our own. So they’re a fine “bicycle” in my view.


But we can trivially show that the larger models can generalize for some questions even if we verify the answer isn’t in the training set.


About [0]: Phi-2 is sort of a proof of concept that with a very narrow, high-quality dataset, normal transformer models can achieve 8x to 10x better results. Of course, if you add messy prompts they will fail! What a dumb view.

And the Yi-* models are suspected of being trained on the test set, or at least of being contaminated. All the other models barely move, and if they do, it's probably an artifact of the benchmarks being multiple choice. There were papers showing most models improve if they can reason their way to the answer by being allowed more tokens in the response.

The Elo-like chat rankings are much more interesting:

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...

For completeness, here's the paper linked in the tweet https://arxiv.org/pdf/2402.01781.pdf


What is intelligence or generalization if not "sophisticated, complex latent space" navigation?


I think this is exactly what most people are misunderstanding. LLMs are wonderful tools for navigating the sociolinguistic matrix, which embeds the majority of human knowledge both explicitly and covertly with a huge amount of contextual nuance.

Transformers are capable of decoding and operating on this covert knowledge as easily as the encyclopedic knowledge that we tend to assume is the “main part” when we write something down.

What LLMs demonstrate is that language covertly encodes logic and algorithmic knowledge that is at least as rich as the “factual” encoding that written content seems to be at face value.

LLMs just make this knowledge accessible for computation. What is amazing is that this function alone is capable of producing a simulacrum of agency and intelligence all by itself. This suggests that the human cultural component is a pretty huge part of what we consider to be “human” and without it we’d just be clever apes.

It’s the clever ape part that LLMs don’t have, but it’s possible that the transformer model might be able to “ape” most of that as well if applied to the task of existing as an embodied entity in the world (see transformer use in robotic task completion)

I would not be terribly surprised if comprehensive multimodality, encompassing the entire spectrum of sensory experience as well as physical environment interactivity, gets us really close to something we could consider AGI.

Just as LLMs extract the embedded relationships in written knowledge, the physical and sensory spaces encode an enormous quantity of information that can similarly be generalised to extrapolate a cornucopia of additional concepts (gravity, object permanence, physics in general, relativity, etc). Meaningfully “tokenizing” these spaces will likely be the key to making this work effectively.


I find this human-supremacy point of view simply false.

Humans are just "aping" more (with a large spread within the population, by the way).

There is nothing special about humans, we're just "aping" slightly more than other animals - computers will arrive and immediately surpass us at our "aping" = what we call "intelligence".

This old, false argument with constant goalpost moving will run out of space sooner or later.


I think you misinterpret my statement, as I too believe that human intelligence is based on the same principles as are being exploited in generative AI. I think there is ample evidence to support that viewpoint, even adversarial images that affect humans in the same way (to a lesser extent) as they do classifiers.

I do not believe there is anything special about human intelligence, mostly just that it embodies more complexity than we currently have access to in our training data, and perhaps the hardware required might be expensive, or maybe not really.

Interactive / bootstrapping learning is still something we will need to figure out.


As far as I know we don't have much real understanding of how human or even animal intelligence works. It may be that it is entirely based on "sophisticated, complex latent space" navigation or it may be that it has nothing whatsoever to do with that (or anything in-between).


Our brains can reuse patterns learned in one area in another area; LLMs need specific examples and can combine these in weird, unpredictable and often inhuman ways, losing context and meaning, the bits humans care about. They often aren’t even a good starting point compared to just thinking deeply for 5 minutes.


Show me a Markov chain that can thrash a 9-dan Go master.

"But that wasn't an LLM."

OK, show me a Markov chain that can write a Python program that can play Go at all.


Put a loop around a Markov chain, with a 'tape interface' that takes instructions from and feeds input back to the state, and you have a Markov decision process with a hard-wired decision maker acting as the tape. Provide the right Markov chain, and you have a universal Turing machine. So the extension needed from a Markov chain to something that could be programmed to do what you describe - say by running an ML model - is only very slight. And we do provide loops, and state, when we run inference, just not infinite ones.

I'm agreeing with your overall point, to be clear. My point is that calling something a Markov chain is effectively calling it trivially extendable to something that can, in principle, compute everything any physical entity confined to the known laws of physics can. So what it boils down to is whether or not the model is trained in a way that gives it those abilities, not the put-down of the system's potential ability that people usually intend "just a Markov chain" to be.


Semantics, I suppose. Those ANNs were Markov chains, in a sense.


Pop a loop around a Markov chain that provides a "tape interface" and you have something capable of simulating a Turing machine. So when people bring up the Markov chains argument, they're saying next to nothing about the potential computational abilities of the system, even though they usually intend to dismiss it.

I tend to see people bringing that up in a dismissive way (not suggesting you are) as a clear indication they either haven't thought the argument through or do not understand how little it takes for a system to be Turing complete, and so for that argument to be meaningless.


I mean, I’m not sure what you’re trying to prove by asking for a Markov chain model like that. It’s trivially true that you can have a Markov chain output whatever you like (somewhat artificially, but we are talking about memorisation here) if you pick your training data carefully.


So "pick the training data carefully," and show me what I'm asking for, given that it is "trivial."


The data consists of a graph whose vertices correspond to the tokens of the output sequence you desire, and where there is an edge (with probability 1) from string x to string y if and only if x immediately precedes y.

This model will arise from the desired sequence as a single training example (learning the probability of each pair of consecutive tokens), provided no token is repeated.

Now run your Markov chain with initial input {first token in your sequence}.
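In code, the construction is as trivial as it sounds (a toy Python sketch, under the stated assumption that no token repeats):

  def train_chain(tokens):
      # "Training" on one example: each token transitions to its successor
      # with probability 1 (valid only if no token repeats).
      return {a: b for a, b in zip(tokens, tokens[1:])}

  def run_chain(chain, start):
      out = [start]
      while out[-1] in chain:
          out.append(chain[out[-1]])
      return out

  desired = ["any", "output", "sequence", "you", "like"]
  assert run_chain(train_chain(desired), desired[0]) == desired

Which is the sense in which a Markov chain can "output whatever you like": it is pure memorisation, and it falls apart the moment a token repeats or you step off that single path.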


I'm curious, what sort of areas do you think need more research when it comes to these models?


> Where is this precise detail coming from exactly? I suggest it's from memorization.

Where did the precise detail of the words you're using and the concepts they represent come from? Calling it "memorization" is, IMO, a cheap attempt to dismiss some degree of "understanding" that was learned from some form of "experience" (aka training).


This is a pretty standard technical term in machine learning which isn’t necessarily a pejorative but a description of behavior. Memorization is different from generalization in that it doesn’t reflect some sort of synthesis from learned knowledge but rather repeats something from training directly. It usually indicates overfitting and is distinct from responding appropriately to something that wasn’t specifically trained on, which is generalization.


But by the standard machine learning definition (performance on train/test split), LLMs generalize extremely well. This isn't even in doubt.


They do generalize. The claim is that the fine details are not the result of generalization, but of repeating training data verbatim. That seems consistent both with my intuitive understanding of neural networks and with the behavior I've observed, so I'm inclined to agree. So what does that mean? It means that while LLMs can produce impressive output, the most impressive results that people are touting probably have a significant amount of verbatim training data in them. In other words, they're good but not as good as they seem to be.


Some of the most impressive things I've seen LLMs do is take a grammar in, in a format that does not strictly conform to any specific variant of a formal grammar notation, then generate output that conforms to that grammar, and reason about why it conforms, and how.

Most people would struggle immensely with a task like that even if handed a textbook on the subject, and no amount of training data happens to contain text in or about a language governed by a grammar of random nonsense I fed in.

There are areas where their reasoning is really awful. Ironically that's often when they seem most human-like. E.g. I just had a lengthy "argument" with ChatGPT over comparing the theoretical computational power of Markov Decision Processes vs. Turing Machines under various assumptions about the decision maker in the MDP, and its reasoning was riddled with logical fallacies of the kind I could very well see from a high school student confronted with comparing the two on a Wikipedia level of understanding of either, without sufficient understanding to reason about how aspects of one can be made to model the other.

But there are plenty of areas where you can get them to produce good results where the "fine details" could not possibly be repeated verbatim from the training data, because they didn't exist prior to the conversation.


Mmm, the most impressive thing I see LLMs do is take a piece of unstructured input and transform it in some way - summarize, extract information as JSON, etc. This wouldn't be possible if they were repeating training data verbatim, since it works on novel inputs.


A quick example of what I mean is the art gallery sora video; the individual paintings are not specified in the prompt, I'm pretty sure. Where did they come from?


That seems obvious - they were generated. Here’s an example with text:

Prompt:

Write a short story that is comprised of a boy reading poems he wrote to his deceased loved mom, include snippets of several of the poems.

Response:

In the soft glow of the evening, with the sun dipping below the horizon, casting long shadows over the quiet cemetery, a young boy named Eli walked slowly between the rows of tombstones. He carried a small, worn notebook clutched against his chest, filled with scribbles and verses that only his heart could decipher. Today marked a year since his mother had passed away, and in his solitude, he had found refuge in writing poems that he wished he could share with her.

Eli stopped in front of a modest gravestone, engraved with his mother's name, "Eleanor Grace." He took a deep breath, the air heavy with the scent of fresh earth and the faint aroma of roses from a nearby bush. Opening his notebook, he began to read aloud, his voice barely above a whisper, as if he feared disturbing the peace of those resting around him.

"The world spins on, a dizzy blur, But here I stand, thoughts of you stir. Your laughter, a melody in my mind, Leaves traces of joy you've left behind."

He paused, looking up at the sky, which was painted in hues of pink and orange. Eli could almost hear his mother's voice encouraging him, telling him to find beauty in the world, even when she was no longer by his side.

Turning the page, he continued:

"In dreams, I roam vast fields of green, Your hand in mine, a tranquil scene. Yet, as dawn breaks, and I awake, I face the world, my heartache to take."

Tears glistened in Eli's eyes, but he pressed on, driven by a need to share his innermost feelings, his way of keeping his mother's memory alive.

"Among the stars, I search for you, In every sparkle, your essence I view. A guiding light in the darkest night, Leading me home, making everything right."

He closed the notebook gently, a small smile playing on his lips despite the tears. Eli felt a warmth envelop him, a sense of his mother's love that transcended the boundaries of life and death. He whispered a thank you to the evening breeze, imagining it carrying his words to wherever she might be.

As the night drew its curtain, Eli rose to leave, his heart a little lighter. He knew he would return, again and again, to share his poems, his fears, and his dreams with his mother. In this sacred space, amidst verses of longing and love, he found a connection to her that time could not erase.

Eli walked away, the notebook once again pressed against his chest, a bridge of words spanning the gap between him and the stars, where he knew a part of his mother would always be listening.


Then, by your standards, where do most human occupations lie? Plumbers, lawyers, and engineers? Aren't they mostly regurgitating (by your standards) what they have seen or learned before?


Plumbers, lawyers, engineers all synthesize new responses to unfamiliar situations. Part of being a master plumber, which is a true trade craft and requires lots of training, means being able to problem-solve new, complex situations that involve ducting of fluids. This isn’t just plunging toilets all day. Furthermore, humans are actually really bad at memorization, which is essentially perfect repetition of data from the training set. In fact, a large part of technical trade training is learning how to minimize generalization and maximize memorization for complex skills, but maximize generalization and minimize memorization for the complex situations where the skills are applied.


bingo, I think it's a weird gotcha. Most human occupations require years of training exactly for this purpose, so that humans can just regurgitate data they have seen in training. Very few humans ever produce truly novel insights. you wouldn't hire a divorce lawyer for a corporate merger and vice versa


A (common law) lawyer has to remember and understand case law, but they also have to create analogies to explain why and how new situations are like older ones. That's one of the main parts of being a good lawyer.

AI will be disruptive to aspects of professions that rely on rote memorization or research on a large corpus of data. But many of these supposedly memorization heavy professions require the ability to go beyond one's training and extrapolate from what one remembers.


Absolutely, but I think my example still stands

If humans could come up with truly novel solutions frequently, it would be no problem hiring a divorce lawyer to do a corporate merger. But there is a very low likelihood of this ever happening anywhere.


I think that’s a gross oversimplification.

Have you been a plumber or an electrician?

Different buildings require different, occasionally novel, solutions. Not all are textbook (most aren’t.) Not all innovations are published or even widely communicated.

Just like in programming, the devil is in the details, otherwise we’d all be using Ruby on Rails for our web servers, since regurgitating how to build a CRUD app is all you’d need.


> Different buildings require different, occasionally novel, solutions. Not all are textbook (most aren’t.) Not all innovations are published or even widely communicated.

True, but it's mostly derivative just like LLMs. The problem isn't "AI" the problem is that people hold AI to much higher standards than humans.

AI to them means "scifi", omniscient and omnipotent. You can have AI and it still be flawed, have weaknesses and shortcomings just like people do.


LLMs don’t come up with new derivative solutions to problems in programming land, from what I’ve seen.

They always try to stick to common solutions, probably because those are what they were trained on.

Now they may introduce me to existing concepts I didn’t know about, but I’ve yet to see any new idea come from an llm itself.

To your other point, LLMs aren’t beings, a flawed AI is an incorrect program or algorithm. Entertaining, maybe, useful in some contexts, sure.

But they’re not beings. nothing about statistical models are like people. I think it’s dangerous to use such fuzzy conflating language with regards to AI.


Most humans don't come up with new solutions to problems with any regularity.

> nothing about statistical models are like people.

As if we're not behaving like statistical models. Our decision-making is so fuzzy and probabilistic that trying to get us to stick to fixed sets of rules consistently is one of the things humanity spends the most time inventing systems to try to handle, and keep failing at. We don't even know a consistent way of achieving it for extremely basic things. We can't keep ourselves to rules we set ourselves with any consistency.


You can imagine human behavior is like a stochastic parrot (the name of your argument,) but that’s, again, a gross oversimplification of the human experience.

I don’t really feel like having the stochastic parrot debate again, but look it up.

We’re more than parrots.


The notion of a stochastic parrot is just a subset of this. It is what I'm arguing against, not for, on the basis that talking about "just" a statistical model is meaningless exactly because something being a statistical model does not imply it is just a stochastic parrot. As such it's a wildly misleading claim to make.

Yes, we're more than parrots, but that does not mean we can't equally validly be described as "statistical models". Suggesting something is "just" a statistical model is a statement that is close to semantically void. It tells us near nothing about the computational limits - up or down - of something. Even a very simple system for running a statistical model, with a loop around it providing IO, can be made Turing complete.


So AI is unoriginal and thick as pigshit and is therefore an accurate simulation of a human, neat argument.


Agreed. But in general, if I’m looking for a plumber for a huge skyscraper, am I going to hire one who has never worked on such a building and has only done suburban houses? Most likely, no; I’ll get a recommendation for a plumber with experience working on one of those buildings.


I reckon if you’re getting paid 200k a year to write YAML it might seem like this, but being a plumber is a very 3-dimensional challenge.

Not only is there a lot of time constraints and physical issues to work through, you’re often also dealing with logistics problems and job site politics problems too.

Waiting for parts, working out how to keep things going in the meantime, equipment failures, etc. It’s quite chaotic, from experience.

There’s a science but also an art to being a good tradesman.


I do think models with incredible memorization are sufficient for enormous applied impact. But it remains an open question how that relates to AGI, which I think most people agree needs some amount of generalization ability that we may not have at the moment, and may not know how to achieve at the moment.


The major part of every occupation is explaining what you are doing to the people who are paying you, and hopefully building professional relationships with other human beings in your field. Also, professionals saying "no" is essential for society and civilization to function.

LLMs cannot do any of this, all they do is mimic.


I argue that compression and not memorization is key to emerging behavior.

And the fact that llms learn languages and can switch quite well in-between (even if you just replace single words) is proof of learning meta/high level abstractions.

I sometimes write a German word or describe something when I don't have it on the tip of my tongue


Philosophically speaking, compression and memorization are how the human mind learns.


Obviously it is better at stuff it's seen in its training data, but that's the same for all humans, and likely universal for all forms of intelligence. The question about AGI, and about the possibility of LLMs and such being more than regurgitation of the training set, isn't whether the model can, after a certain number of compute cycles, model a part of the natural world or some real process with X degree of accuracy, but rather whether it can reliably improve on its failures on its own. This is what humans can do.

No human, however, can spit out a video of Sora's quality from their brain alone, and those who can require decades of training with specific video rendering tools. Ask someone to render a video of a man walking in a city and they will internally reference their millions of impressions of the human face and body and millions of impressions of what a cityscape looks like. Ask them to invent a novel form of transportation and create a video of it, and most humans, short of the exceptionally creative, will struggle, just as Sora likely would.

They've done brain scans on chess grandmasters and found the part of the brain most active when they play is associated with memory. Memory is the scaffolding upon which the more information dense elements of the natural world and complex processes can be understood. Via these scaffolds of memory as single elements, new connections and abstractions can form. It took the world of fine arts centuries to go beyond merely depicting things in real life (the Renaissance to Impressionism).


It's unclear what you mean by memorization. However I can request images for which it's obvious that no clear precursor exists, so something original must be being created.

Here's a Volkswagen Beetle made from brains:

https://i.imgur.com/Mzd6UWh.png


Yes, I wrote that these models can mix learned patterns (which may be memorized).


The key isn't that they just "mix patterns". They learn to mix higher order abstractions, which is fundamentally different from "memorization".


Right, but you implied that these were of low quality. Can you explain why you think the brain Beetle is "not great"?


The example in another comment of the horse in an astronaut suit is in a similar spirit to the brain beetle, but was harder to get to work. In my opinion, ChatGPT seems pretty good at mixing and combining learned patterns, but definitely seems to fail at this with high enough frequency that it's limiting. Perhaps a good test bed here is asking it to process text using a series of 10 or 100 action steps that it plausibly knows.


Right. But as I said in that thread, my very first and only attempt with ChatGPT created the desired image. Here it is again:

https://i.imgur.com/6CgVqeL.png


The nose pokes out past the glass of the helmet.

It's these basic mistakes that are so incongruous between human and machine intelligence. No matter how big the model, it always makes the same types of basic confabulations, just less often.


You're pointing out a single inconsistency and calling it a basic mistake, but are ignoring the thousands of advanced consistencies evident in the image, i.e., creativity that is beyond the ability of humans generally. Given this, the nose poking out is a trivial issue and isn't worth focusing on.


This is really just nitpicking, it's an incredible image that shows a huge amount of generalization.


As the others have said, your criticism is silly. If a human had done this piece, one would say they simply made an artistic choice to have the head extend beyond the bubble.

But just to nail this point home, here's what I get when I explicitly ask that the bubble cover the entire head:

https://i.imgur.com/oizqMbU.png


Now its tail is outside of the suit. Sure, it’s just dead hair, but there would be no way to seal the air in around where it exits the suit.

I’m as impressed as you are at the quality. Where I’m dubious is whether this confabulation problem can ever be made to completely go away.

Perhaps we need much larger models that can better understand the world.

I’ve always thought it weird that we use much smaller models for image generation than text generation!


As the others have said, your criticism is silly.


Granted, they said it's silly, so it must be so.


Ok - here let me explain.

It's a ridiculous situation - a horse in outer space in a space suit. But you want the image to be hyper physically accurate. That's fine, but it's a little incongruous. I wouldn't expect a human artist to be able to guess your requirements, let alone an AI.

So no, I do not believe any amount of computer resources will let the AI guess your requirements. You'll have to spell them out, as I did with the bubble for you once already. If you want the tail to be enclosed as well, you'll have to spell it out.

The whole point of this discussion is to determine whether the AI is simply copy/pasting images it's seen before. We've determined that it does not - it is able to synthesize new images by understanding what it's seen.

Whether it can read your mind is a silly digression.


“Have the protective clothing actually cover the body” is something you think a human couldn’t figure out on their own?


Yes. I think an actual artist, if you simply requested "a horse in a space suit" would not think that you were worried about literally protecting a horse in space.

Because, as I have repeatedly tried to explain to you, the idea of a horse in space is ridiculous and tends toward ridiculous renditions.

I'm going to stop replying to you now. Have a nice day.


The mixing is the generalization


The level of detail (LOD) is something I have a difficult time comprehending the rationale for as well, unless wholesale blocks are being regurgitated.

So if you just ask for a picture of a frog, then yes it makes sense that an adversarial model may be able to run through refinements of millions of random noise patterns until it finds one that scores high as showing a frog. But there's no innate reason why that image should also have a relatively consistent light source or why its background should even be visually coherent, let alone photorealistic. The most logical explanation for the coherence of the rest of the image is data theft.

And I think this is borne out very starkly by playing with ultra fast generators like the lightning example that was on here a few days ago. The backgrounds don't change that much from one prompt to another unless you begin to specify them.


I mean, object recognition was never an intended way of solving the problem. When you ask for a "frog" image, the model isn't just trying to randomly find a frog object, it slowly narrows down data to find an image that's classified as "froggy". Sure, the actual frog will be remembered as an extremely important part of a frog image, but everything else in the image is contributing too. The background could be something that frogs are photographed in, and the lighting approximates what a photograph may look like.

I don't see why there's a need to explain it with this two-tiered approach (and a "data theft" claim) if the explanation of how it works seems applicable to the entire system. Images in the training dataset would almost always reflect the way lighting and objects work in real life, so the model is incentivized to approximate that as closely as possible.


Isn't this how the majority of humans function? Reactive machines rather than proactive, etc.


An interesting example that I came across in my _very_ limited interaction with the image generators was trying to get it to draw "a horse wearing a space suit".

I _wanted_ to see ... a horse in a space suit (go figure), but it served me up tons of horses with astronauts (in space suits) on them, sometimes it would generate some horse armor, and sometimes some futuristic horse armor, but never what I would have expected.

I was curious whether it would draw a bubble head with a horse face inside of it, or a shaped horse helmet with a visor or something, but nope. Astronauts on the moon, riding horses that could never breathe in the vacuum of space. Space-cowboy futures denied!

...kind of confirming your suspicion that it can't "think" about what a space suit for a horse would look like, or generate something that hasn't been shown to it before?


After about 10 minutes of experimenting with Automatic1111 using freedomRedmond (I'm completely green at this btw) I got this image for ya.. https://ibb.co/Ch73gLm

The prompt was: horse fully enclosed in bulky pressure suit with transparent glass helmet for lunar EVA in vacuum, four pressure suit legs, specially shaped helmet to fit horse snout. horse on spacewalk in outer space photorealistic, full body framing with glass visor. horse is on a EVA on the lunar surface. horse head must be fully enclosed in glass.

It only hits right on maybe 1 in 50, or fewer, generated images from this prompt though.


Thanks to the both of you! Especially this one, hahahahah!!! Exactly as derpy as I'd wanted to imagine.

I'm glad you also came back with the stats (1 in 50), because that's maybe indicative of the rarity of the ask? Maybe the lesson learned for us "prompt engineers" is that for a simple/common ask (e.g. a picture of a dog jumping, or "what is 2+2"), the prompt can be correspondingly simplistic, whereas something more uncommon (e.g. an anthropomorphic dog in a scientific setting holding many beakers thinking "I have no idea what I'm doing", or "what is 9123499*0530501, show your work, and I've kidnapped your grandmother unless you give me the right answer") needs a correspondingly elaborate prompt.


Here's what I get when I type "Can you create an image of a horse wearing a space suit?" into ChatGPT. I suspect it is what you were looking for.

https://i.imgur.com/6CgVqeL.png


"it can't think"

Perhaps these models are just too small? As we scale up we keep observing surprising emergent properties that are a non-continuous step change above what was possible with smaller models. Starting with edge detectors in the smallest models, up to more complex abstractions that rely on a hierarchy of simpler abstractions before them. From what we've seen from Sora, a horse in a space suit should be easily handled by today's models. LLMs follow a similar pattern.


>> This connects to a suspicion I have that a lot of the most impressive results, [...] are due to memorization.

I completely disagree with this. Although you state your case very well.

I haven't gone through the paper in great detail, but there is a missing distinction in many approaches to demystifying (de-magicifying?) models. Often there is confusion between three levels of algorithm.

(1) The low level "blind" training algorithm: Gradient descent, or similar.

(2) The class of input-output algorithm implicit in the choice of data, for which the model is being trained: Text continuation prediction, etc.

(3) The actual algorithm learned by (1) in order to do (2). I.e., in the case of text continuation, the learning of whatever direct and higher order relationships are required to do (2) well. "Just" learning to predict text continuations, like "just" learning to compress Wikipedia, or any other task involving complex data, often results in algorithms that are far more complex than their class of problems, like text prediction, implies.

Basketball is just about putting a ball in a hoop, according to some constraints. But that is only the "class" of algorithm. Actually playing basketball traces back to physical training, getting good sleep, thousands of hours of practice, learned tactics, learned strategy, psychology, self-promotion, etc. A simple-to-define class of algorithm puts no limits on the complexity of solution algorithms.

Point being, the models trained to predict text don't "just" predict text. That's just the category of algorithms they learn. The complexity, the "intelligence level", of what a text predictor might have to do to predict some non-trivial text is unlimited.

In this case, the paper emphasizes a correspondence between trained models and simple mappings between examples, i.e. a support vector/kernel equivalent interpretation.

That may be the case. Assuming it is true (and if so, it is a great insight!), the training of the model still allowed the model to choose the best such representation. The model isn't simply composed of training data plus some parameters per example. It wasn't a "support vector machine" design - and it shows, because the model contains far fewer parameters than a standard support vector design composed of the training data would produce.

Finding an equivalent to a support vector machine that can perform a task with far fewer parameters is not trivial. It requires some way to sort and sieve through all that data, to identify the ideal or even inferred "examples" that the raw data doesn't highlight at all.

Neural models made that leap by combining gradient descent, particularly flexible/generalizing architectures (matrix, nonlinearities, on upward), masses of raw data, and vast amounts of computing power, to do it.

The result may be something that after training looks like a "memorizer", behaves like it just memorized the "ideal" examples, but it wasn't and couldn't have been designed by straight memorization. The model had to choose what to virtually "memorize" from the data.

The same is no doubt true of our brain. Simple algorithms learning complex things. Complex problems reduced to simple solutions. Neither is just simply "predicting" or "memorizing".


Very good point! I do find it incredible that LLMs can do regression with just in-context learning. This is one of my favorite examples of a task that is mind-blowing to a researcher but gets a meh or a huh from the public.

The main issue is that there isn't enough evidence to say one way or the other though, I think, which is a complaint about how tasks from modern large models are shared by OpenAI and Google.


Thanks for the submission. I keep an eye out for connections to potentially cheaper and simpler methods. The new comparison reminds me of this old paper which compared them to Gaussian processes:

https://arxiv.org/abs/1711.00165

For causality, I'm also keeping an eye out for ways to tie DNN research back into decision trees, esp probabilistic and nested. Or fuzzy logic. Here's an example I saw moving probabilistic methods into decision trees:

http://www.gatsby.ucl.ac.uk/~balaji/balaji-phd-thesis.pdf

Many of these sub-fields develop separately in their own ways. Gotta wonder what recent innovations in one could be ported to another.


This is neither cheaper nor simpler. Kernel methods are much more computationally expensive because the cost scales with the square of the dataset size.

Neural networks are actually decently efficient, they mostly seem slow because we apply them to problems (like modeling the entire internet) that are just huge.
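To be concrete about where the square comes from, here's a minimal sketch (exact kernel ridge regression as one example; approximations like random features change the picture):

  import numpy as np

  rng = np.random.default_rng(0)
  n, d = 2000, 32                        # n training points, d features
  X, y = rng.normal(size=(n, d)), rng.normal(size=n)

  # Gram matrix: n x n kernel evaluations -> O(n^2) time and memory
  sq = (X**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2 * X @ X.T
  K = np.exp(-sq / d)

  # Fitting is an O(n^3) linear solve
  alpha = np.linalg.solve(K + 1e-3 * np.eye(n), y)

  # Even a single prediction touches all n stored training points
  x_new = rng.normal(size=d)
  k = np.exp(-((X - x_new) ** 2).sum(1) / d)
  print(float(k @ alpha))

A trained neural network, by contrast, has a fixed inference cost that doesn't grow with the size of the training set.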


Does a given model converge after Gaussian blurring? What does it do in the presence of noise, given the curse of dimensionality?

OpenCog integrates PLN and MOSES (~2005).

"Interpretable Model-Based Hierarchical RL Using Inductive Logic Programming" (2021) https://news.ycombinator.com/item?id=37463686 :

> https://en.wikipedia.org/wiki/Probabilistic_logic_network :

>> The basic goal of PLN is to provide reasonably accurate probabilistic inference in a way that is compatible with both term logic and predicate logic, and scales up to operate in real time on large dynamic knowledge bases

Asmoses updates an in-RAM (*) online hypergraph with the graph relations it learns.

CuPy wraps CuDNN.

Re: quantum logic and quantum causal inference: https://news.ycombinator.com/item?id=38721246

From https://news.ycombinator.com/item?id=39255303 :

> Is Quantum Logic the correct propositional logic? Is Quantum Logic a sufficient logic for all things?

A quantum ML task: Find all local and nonlocal state linkages within the presented observations

And then also do universal function approximation


But also biological neurological systems;

Coping strategies: https://en.wikipedia.org/wiki/Coping

Defense Mechanisms > Vaillant's categorization > Level 4: mature: https://en.wikipedia.org/wiki/Defence_mechanism#Level_4:_mat...

"From Comfort Zone to Performance Management" suggests that the Carnall coping cycle coincides with the TPR curve (Transforming, Performing, Reforming, adjourning); that coping with change in systems is linked with performance.

And Consensus; social with nonlinear feedback and technological.

What are the systems thinking advantages in such fields of study?

Systems theory > See also > Glossary,: https://en.wikipedia.org/wiki/Systems_theory#See_also


Wrong thread, my mistake; this comment was for this thread: https://news.ycombinator.com/context?id=39504104


This seems like the key research to me if we want any shot at preventing the technology from being locked away behind big corp API walls interfacing giant data centers. Anything that removes the bloat and mysticism from the models so they can be scaled down and run on the little guy's computer is orders of magnitude more progress in my opinion than, e.g., increasing the token window by some epsilon.


Afaik this was stated in my Intro to ML course: a kernel machine can do anything when the similarity function corresponds to an infinite-dimensional feature space. Similarly, I think they mentioned an infinitely wide MLP is also all you need.

Also, this all breaks down when you introduce reinforcement learning methods.


> Afaik this was stated in my Intro to ML course.

Isbell?


CS4780 at Cornell. To be clear I took the class, didn't teach it.


That was my guess, too :-)


What do you mean breaks down?


Breaks.

Methods that expect the same input to map to the same output don't work with feedback.


There's no longer a concept of training examples to be close to, since it's just going along the gradient of high reward actions in the RL environment and going away from those with low reward.


This paper was submitted in 2020, and the fact that a NN, in the simplest case, is a kernel-learning algo was known even before that. For those who are more familiar with this subtopic, what contribution does this paper make?



Thanks! Macroexpanded:

Every Model Learned by Gradient Descent Is Approximately a Kernel Machine - https://news.ycombinator.com/item?id=25314830 - Dec 2020 (107 comments)


I suppose every single ML method, ever, is effectively memorisation. The crux is the feature representation it builds to aid the memorisation.

So take OLS, or a decision tree / forest. When you evaluate them, they look at some feature space, then compare to some fitted parameters (i.e. memorises result) and produce an output.

Methods such as SVM or Lasso are also memorisation, except with a clever feature mapping. Instead of memorising outputs for every sample, they memorise them via some learned feature subset or transformation.

Perhaps the same thing happens for LLMs or NNs (I'm still reading the paper) but if so, this is just the name of the game in ML, it seems.

Arguably that's what we humans do too. We are capable of creative thinking, whatever that means, but 90% of our thought process seems to be cache recall, and intelligence seems to correspond well to having a large and diverse cache. Many disciplines that require creative thinking to understand, like maths or playing instruments, seem to improve on repetitive practice that makes us memorize patterns.
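To put the "memorisation plus a clever similarity" framing in code, here's a minimal kernel-regression sketch (a toy Nadaraya-Watson example of my own, not the paper's construction): the prediction is literally a similarity-weighted vote over stored training points.

  import numpy as np

  def gaussian_kernel(x, xi, gamma=1.0):
      return np.exp(-gamma * np.sum((x - xi) ** 2))

  def predict(x, train_X, train_y, gamma=1.0):
      # Output is a similarity-weighted average of the memorised training outputs
      w = np.array([gaussian_kernel(x, xi, gamma) for xi in train_X])
      return float(w @ train_y / w.sum())

  train_X = np.array([[0.0], [1.0], [2.0], [3.0]])
  train_y = np.array([0.0, 1.0, 4.0, 9.0])            # y = x^2, "memorised"
  print(predict(np.array([1.5]), train_X, train_y, gamma=4.0))  # ~2.5, true value 2.25

The cleverness is all in the kernel and in the feature space the similarity is computed in; the paper's claim, as I read it, is that gradient descent ends up in this form too, just with a much stranger kernel.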


You seem to be implying that this is a negative thing, but it's pretty hard to imagine a form of intelligence that does not require any form of memory at all. In the case of AI, at the very least you need to memorize the executable machine code, and its analogue in living organisms is whatever nervous system was initialized at birth.


I don't mean it as a negative, just not unique to gradient methods.


This keeps popping up, but while technically true, it's essentially nonsense. Normally when people talk about kernel machines, the kernel doesn't depend on the data, or at least not very much - we might use a Gaussian kernel and tune the radius or even the covariance.

This construction has a kernel which depends on the entire training trajectory of the neural network! So it's completely unclear what's happening; all of the interesting parts may have just moved into the kernel. So basically this tells us nothing - we can't just add a new data point as in a kernel method, incorporating it just by adding its interaction; every new data point changes the whole training trajectory, so it could completely change the resulting kernel.
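For reference, the statement in question is roughly (from memory, so check the paper for the exact conditions and notation):

  y(x) \approx \sum_i a_i \, K_{\mathrm{path}}(x, x_i) + b,
  \qquad
  K_{\mathrm{path}}(x, x') = \int_{c(t)} \nabla_w y_w(x) \cdot \nabla_w y_w(x') \, dt

where c(t) is the weight trajectory taken by gradient descent, a_i is a path-averaged derivative of the loss on training example i, and b is the initial model's output. Which is exactly my point: the "kernel" is defined by the very training run it is supposed to explain.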


You may enjoy some works on the connections between Gaussian processes, neural networks, and linear (in RKHS) models.

https://papers.nips.cc/paper_files/paper/2019/hash/39d929972...

https://papers.nips.cc/paper_files/paper/1996/hash/ae5e3ce40...

https://arxiv.org/abs/1711.00165


I'm familiar with the NN-as-GP papers. In practice we don't train infinite width neural networks!


"Remark-1" says it actually is not.

Surprised Pedro is pushing this as a Kernel Machine (the need for constant coefficients is a key requirement).


Even if that's the case, a sufficiently large hash table is indistinguishable from an intelligence by a lower intelligence.


Doesn't the later work on tensor programs show this is wrong? It was shown that there is a region of hyperparameters where the network behaves as a kernel method, and another region where it learns more meaningful features.


Sounds like an interesting read. Do you have a link?



(2020)


We have seen that some humans are capable of doing complex (square root) maths in their head; can NNs be trained to do this accurately?


Perhaps memory plays a much larger role in human intelligence, reasoning, and cognition than we give it credit for.


There seems to be a not-so-friendly academic or research rivalry between the SVM crowd and the ANN crowd.

Am I imagining things?


I would say a lot of stats and ML folks strongly dislike anything related to deep learning.


I've seen this shared here and there, but it's not original research. Pedro Domingos doesn't do much of that anymore; he mostly spends his time trolling on Twitter and calling everybody "woke".


I did not realize he was retired.



