My theory is that architecture doesn't matter - convolutional, transformer or recurrent, as long as you can efficiently train models of the same size, what counts is the dataset.
Similarly, humans achieve about the same results when they have the same training. Small variations. What matters is not the brain but the education they get.
Of course I am exaggerating a bit, just saying there are a multitude of architectures of brain and neural nets with similar abilities, and the differentiating factor is the data not the model.
For years we have seen hundreds of papers trying to propose sub-quadratic attention. They all failed to get traction; big labs still use an almost vanilla transformer. At some point a paper declared "mixing is all you need" (MLP-Mixer) to replace "attention is all you need". Just mixing; the optimiser adapts to what it gets.
If you think about it, maybe language creates a virtual layer where language operations are performed. And this works similarly in humans and AIs. That's why the architecture doesn't matter, because it is running the language-OS on top. Similarly for vision.
I place 90% of the merits of AI on language and 10% on the model architecture. Finding intelligence was inevitable: it was hiding in language, and that's how we get to be intelligent as well. A human raised without language is even worse off than a primitive. Intelligence is encoded in software, not hardware. Our language software has more breadth and depth than any one of us can create or contain.
> Similarly, humans achieve about the same results when they have the same training. Small variations. What matters is not the brain but the education they get.
That's misleading. Small average variation overall, but some outliers are dramatically better and dramatically worse. Some brains produce Fields Medals, some can barely count change. Unfortunate but true.
Over the years, I've found that the "nurture is the only thing that matters" camp is typically intensely self-delusional.
Despite hard evidence to the contrary, some people really really want to pretend that all human brains are equally capable.
For example, instances of cognitive disability due to macroscopically visible brain structure deviations or brain damage are usually brushed away as "Well that's just a disability and therefore it doesn't count! All the other brains are the same!"
An eye-opening version of this: a friend of mine has twin girls. Both have the same parents and took all the same classes with the same teachers, and neither got access to something the other couldn't, but one is dramatically better at school, has a higher measured IQ (not just a couple of points) and just seems "sharper" when you deal with her. She just somehow seems to pick up a lot of things quicker than the other. Otherwise they are basically identical when it comes to sports and the sorts of things they are into. They both got the same "nurture", but "nature" had something to do with it somewhere in the mix that made them pretty different.
I have a sibling very close in age, and in my experience just coming from the same home and parents doesn't mean a whole lot. Siblings can have wildly different experiences and be treated quite differently by the adults around them. Not to mention there are forces which make you want to differentiate yourself from your sibling and intentionally be different. I don't think that anecdote is nearly informative enough to draw any conclusions.
>siblings can have wildly different experiences and be treated by the adults around them quite different. Not to mention there are forces which make you want to differentiate from your sibling and intentionally be different.
That's true, but is that due to all the adults and "forces" around them conspiring to treat them differently, or is it due to them behaving in a way that elicits certain responses? Surely some of each, no? So then, nature may have a way of manipulating nurture in a sense.
Twins can have dramatically different experiences in life, i.e. "nurture", and the experiences you are exposed to end up meaning that while your genetics are the same, the phenotype that is expressed is ultimately quite different. For example, one twin could be badly bullied and hiding it, hurting their confidence and also damaging brain development due to high levels of stress hormones.
Also, there are cases where one identical twin had low birth weight due to a placenta problem, and grew up with developmental delays. Some differences really are congenital.
Yes but they all stood on the shoulders of giants. Alone they couldn't have done it. Intelligence is a collective affair.
Ideas are self replicators - usually by being useful to people, but their lifecycle is different from that of people, they evolve much faster. It's an evolutionary process. Modern man and that of 10,000 years ago are biologically not much different, the difference comes from cultural evolution.
> Alone they couldn't have done it. Intelligence is a collective affair.
Yes, but the Gausses and Ramanujans of the world disprove the claim that architecture doesn't matter, and that only training data matters. The long tails that describe human variance I think show that architecture can matter a lot.
I think you might be right that architecture for some machine learning algorithms doesn't matter so much given enough data, but the fact that they need so much data to perform well I think implies that there's a better architecture awaiting discovery.
If by "basic architecture", you mean auditory, visual, and motor centres, sure, but we're clearly talking about higher cognitive abilities here. I'm not at all sure that they have the same architecture given Ramanujan had even less high school math than most American students, yet clearly took the same knowledge much further.
If you're saying Ramanujan had a drastically different brain architecture than other humans, it's an extraordinary claim which will require extraordinary evidence. Einstein's brain is well studied, and fundamentally human...
Not to defend the original argument (because comparing the differences between people, who belong to the exact same animal species, with the differences between radically different model architectures makes zero sense).
> Small average variation overall, but some outliers are dramatically better and dramatically worse. Some brains produce fields medals, some can barely count change. Unfortunate but true.
We don't actually know which part of the “can barely count change” situation comes from the “brain” compared to the situation people were raised in. Sure, you'd always have outliers (good or bad) for purely genetic reasons, but if you looked at sports performance, for instance, you'd find that the theoretical difference between people is much lower than the practical one for many environmental reasons: in theory, once you factor in rare medical conditions, almost every human between 18 and 40 should be able to run a marathon. The reason why most people cannot is entirely environmental (not enough physical activity, overweight, work-related health issues, etc.).
And it's also true in terms of performance: almost any human ought to be able to run a 100m in less than 14 seconds, which is only 30% slower than the world record. Yet, for purely environmental reasons, only a small fraction of the population is actually able to do it.
I suspect that in terms of intelligence the results would be roughly in the same ballpark: almost everyone, except rare mental development disorders, ought to be 70% as intelligent as the most intelligent person in history (and the enormous majority fitting in a much narrower band), but most end up far worse for environmental reasons alone.
For some reason (mostly politico-philosophical), most people in the “nature” camp of the “nature vs nurture” debate are completely oblivious to the evidence for this.
> We don't actually know which part of the “can barely count change” situation comes from the “brain” compared to the situation people were raises i
I think this too is misleading. We've quantified the heritability of intelligence to quite a significant degree. Studies of twins who have been raised apart even show that outcomes are largely the same despite different environments. Nurture is very much oversold. Basically, as long as you don't traumatize one of two twins they will develop very similarly.
> And it's also true in terms of performance: almost any human ought to be able to run a 100m in less than 14 seconds
But it's literally not true of all humans, and it won't be true if you drop that below 13 or 12 seconds. Which shows that the specific biomechanics you're born with will influence your natural physical abilities. By extension, we also have natural cognitive abilities with similar consequences.
> most people in the “nature” camp of the “nature vs nurture” debate are completely oblivious to the evidence for this
I think you're way overselling this claim. "It's all nurture" is a far more common opinion, and far less evidence supports it.
> We've quantified the heritability of intelligence to quite a significant degree
Not really. I can't find the link again, but there was a survey of cognitive science experts, and the heritability of intelligence is still an open question with lots of conflicting opinions in the academic community. Also, “heritability” doesn't mean what you think it means.
> Twin studies of twins who have been raised apart even shows that outcomes are largely the same despite different environments.
I'd love to know your source, because I doubt the number of “twins that have been raised apart” is large enough to be a statistically significant cohort even if you could have them all in your study.
> But it's literally not true of all humans
Yes, some people are born without legs. And some are born with Down syndrome, nobody disputes that.
> and it won't be true if you drop that below 13 or 12 seconds. Which shows that the specific biomechanics you're born with will influence your natural physical abilities. By extension, we also have natural cognitive abilities with similar consequences.
Nobody questions that either (although 13 secs is probably close to what's reachable by anyone, and 12 by any male), the question is about scale: if you take a random sample of 20 Americans of the right age, you'd have insane disparities, with a significant number not being able to run the distance at all and struggling to finish in less than a minute, despite their completely functional genome.
> I think you're way overselling this claim. "It's all nurture" is a far more common opinion, and far less evidence supports it.
That's a strawman, a very common one in the mouths of folks in the “nature” camp. Nobody argues that it's literally “all nurture”; everyone knows that Down syndrome exists.
The actual argument is that when, in my previous 100m example, you have genetic variation with an impact of at best 30%, and an observed variation of more than 500%, it's fair to say nurture is the dominant factor.
Doesn't mean nature has no role at all, but it doesn't have an explanatory role: the reason for observed variation is essentially nurture.
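For what it's worth, the two percentages in this 100m argument can be checked with a few lines of arithmetic. This is only a sketch: it assumes the men's world record of 9.58 s (a figure not given in the thread) and uses the 14 s and one-minute times mentioned above; `compare_to_record` is a made-up helper name.

```python
# Assumption: 9.58 s is the men's 100 m world record (not stated in the thread).
WORLD_RECORD_S = 9.58
DISTANCE_M = 100.0


def compare_to_record(time_s: float) -> dict:
    """Compare a 100 m time to the record, both by finishing time and by speed."""
    # How much longer the run takes than the record, in percent.
    extra_time_pct = (time_s / WORLD_RECORD_S - 1.0) * 100.0
    # How much lower the average speed is than the record speed, in percent.
    record_speed = DISTANCE_M / WORLD_RECORD_S
    speed = DISTANCE_M / time_s
    lower_speed_pct = (1.0 - speed / record_speed) * 100.0
    return {"extra_time_pct": extra_time_pct, "lower_speed_pct": lower_speed_pct}


for t in (14.0, 60.0):
    r = compare_to_record(t)
    print(f"{t:5.1f} s: {r['extra_time_pct']:.0f}% more time, "
          f"{r['lower_speed_pct']:.0f}% lower speed")
```

Under that assumption, "30% slower" roughly fits the 14 s runner when measured as average speed, while "more than 500%" fits the one-minute runner when measured as finishing time, so the two figures sit on different scales.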
Stop taking The Bell Curve seriously. This book was written in bad faith by people with a political agenda (which is even stated with a straight face in the book). The reason why these people argue for the genetic impact trumping everything is that they don't like the idea of state intervention to reduce poverty. When you say “these people are poor because they are genetically dumb” then you can claim there's no point in helping them, and unsurprisingly that's exactly what Herrnstein and Murray are advocating in their book…
> I'd love to know your source, because I doubt the number of “twins that have been raised apart ” is large enough to be a statically significant cohort even if you could have them all in your study.
> Yes, some people are born without legs. And some are born with Down syndrome, nobody disputes that.
I see you haven't debated many blank slaters. The original post I responded to said, "What matters is not the brain but the education they get". That is literally false.
> if you take a random sample of 20 Americans of the right age, you'd have insane disparities, with a significant number not being able to run the distance at all and struggling to finish in less than a minute, despite their completely functional genome.
Yes, and you will find cases of people who ate basically the same diet and were just as active or sedentary who fall on both sides of that line of finishing/not finishing. Some people will just have a more efficient metabolism, or a more robust skeletal structure, or greater lung capacity, or more type II muscle fibers, and the positive confluence of all of these factors makes a great natural athlete, and a negative confluence on all of them makes for an unlucky individual who will likely never be athletic no matter how hard they try. Which doesn't reduce the importance of being fit, or reduce their human worth, but there's no point denying the obvious.
All of these factors fall within natural human variability, and all of these factors are influenced by environment of course, because genetics and environment can't be trivially separated, but the point is that it seems like these factors have multiplicative effects. Such factors exist for cognition as well, and the same conclusions follow. This counts as architecture.
> The actual argument is that when in my previous 100m example you have genetics variation with an impact of at best 30%, and a observed variation of more than 500%, it's fair to say nurture is the dominant factor.
Nurture is probably the dominant factor in almost all cases throughout the world because of malnutrition. It remains the case that even if you eliminate all environmental disparities, humans will still exhibit considerable variability on all traits. Some will still be naturally gifted athletes and naturally gifted mathematicians, some significant fraction will simply be below average in many categories, and some unlucky few will be below average in all categories, and not due to any obvious deficiency like Down's.
I'm not even going to bother addressing the irrelevant comments on Murray and his book, mostly because I haven't even read it. This is all standard statistics and genetics.
The original matter I disputed was literally false, as I said above, but the spirit of the comment was that data matters more than architecture. I think that's a very premature conclusion. It would be more correct to say that a) you can equate outcomes by varying data despite the architecture, but b) some architectures scale better (do more with less data), and I would add c) no known architecture scales as well as human cognition yet, which suggests better architectures remain to be found, and this suggests d) that architecture does matter.
> What matters is not the brain but the education they get". That is literally false.
No, it's oversimplified, but it's still less false than the opposite view you're trying to promote.
> and the positive confluence of all of these factors makes a great natural athlete, and a negative confluence on all of them makes for an unlucky individual who will likely never be athletic no matter how hard they try.
Only if you count as “athletic” the performances that are outliers by a few tens of percent, but not by reasonable standards for being athletic. Almost anybody who starts practicing endurance running before they are 40 can be among the top 5% of the whole human population (and be “the athletic guy”), just because the enormous majority is desperately bad for environmental reasons (including: they have something else to do in their damn life than running 2 hours a day). Likewise, everyone who can read can read a book a week and quickly become among the 5% most cultivated. Except for the few disabled, there's no genetic barrier you cannot overcome with some work, and so win the game that the majority isn't even playing.
> It remains the case that even if you eliminate all environmental disparities, humans will still exhibit considerable variability on all traits.
Yes, 30% is still considerable by most metrics (you're never ever gonna win a race against someone who's 30% faster than you), but it's also a very small difference compared to the disparities in the real world. If everyone was within the expected 30% margin, then pretty much everyone would be over 150 IQ (whatever this metric means). Sure, there would still be a few people above 200 IQ, but the world would still be unrecognizable.
> This counts as architecture.
No, this is barely akin to hyperparameters in the ML world, but any comparison between deep learning models and the brain is completely meaningless anyway. The original take was very bad, in the ML topic, but yours is equally bad on human cognition. Yet another example that one can refute a poor argument with an even worse one.
> I'm not even going to bother addressing the irrelevant comments on Murray and his book, mostly because I haven't even read it. This is all standard statistics and genetics.
It's not irrelevant, it's still the main source of inspiration for most people who share your arguments and opinion, and it's likely that you have been at least indirectly influenced by it through different means.
If you look at Fields medalists, I think most of them got interested in math super young and got exposed earlier to much more material than average mathematicians.
So they may have an IQ advantage, but they definitely also have a data/training advantage.
You need to be so smart that you jump the chasm between what school can teach you and what you can teach yourself from undergrad textbooks and beyond. Given the pressure to pass exams, you also need to have enough in the tank that learning more maths doesn’t put at risk the hoop jumping needed to get into university. Ironically, you need to do well in many things that aren’t maths to be allowed to study maths at an elite university.
I am sure it helps a lot if your parents are professors!
> If you look at fields medalists I think most of them got interested in math super young and got exposed earlier to much more material than average mathematicians
Even if that were true, you'd need only one Ramanujan to disprove the claim that architecture doesn't matter. We've seen many more than one.
Not quite a goatherd, but Ramanujan's father was a clerk in a sari shop. Ramanujan himself had little formal training in mathematics and was entirely self taught, and his research was in isolation until he sent an unsolicited letter to G.H. Hardy.
It's as if you trained an LLM on children's books and it converged to Terence Tao.
Describing Ramanujan as self-taught downplays that, within colonial India, he had access to some decent resources, and probably more than the median of the era: primary and secondary schooling, some interaction with older students, and a few of the books on higher mathematics. That was enough to open up all the possibilities he needed to "think mathematically". By his late teens it had become a total obsession, he ceased studying other subjects and failed out of academia. That is, the things he had already encountered were sufficient to get him started, and then he made things harder on himself because he couldn't play by the rules. But he replaced that disadvantage with persistence and trial and error, submitting what he had to whomever he could contact until, by the time it reached Hardy, it started to resemble formal language that other mathematicians understood.
I wish someone performed a large scale experiment to evaluate all these alternate architectures. I kind of feel that they get drowned out by new sota results from openai and others. What I wish is something that tries to see if emergent behaviors pop up with enough data and parameters.
Maybe vision is special enough that convnets can approach transformer-level performance, or maybe it could be generalized to any modality. I haven't read enough papers to know if someone has already done something like this, but everywhere I look on the application side of things, vanilla transformers seem to be dominating.
> My theory is that architecture doesn't matter - convolutional, transformer or recurrent, as long as you can efficiently train models of the same size, what counts is the dataset
I assume you mean architecture doesn’t matter much, as long as it’s roughly sensibly designed.
The basics of architecture do matter though: a two-layer fully connected network won’t get very far no matter how many parameters it has. Architectural improvements like ResNets have definitely made a difference to the field.
Part of what made the transformer so powerful when compared to RNNs and CNNs in NLP was the fact that they are highly parallelizable. That means you can optimize more parameters in the same amount of training time, which means you can model a more complex function.
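To make the parallelism point concrete, here is a toy sketch with scalar "tokens" (nothing like a real model). The function names and weights are invented for illustration. The recurrent update needs the previous hidden state, so its steps have to run in order, while each attention output reads only the inputs and can be computed in any order, or all at once.

```python
import math


def rnn_states(xs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN: h_t depends on h_{t-1}, so the loop is inherently sequential."""
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        out.append(h)
    return out


def attention_at(xs, i):
    """Toy scalar self-attention output for position i (no masking, no learned weights).

    It reads only the inputs, never another position's output.
    """
    scores = [xs[i] * xj for xj in xs]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * xj for w, xj in zip(weights, xs)) / total


xs = [0.1, -0.4, 0.7, 0.2]
in_order = [attention_at(xs, i) for i in range(len(xs))]
# Positions computed in an arbitrary order give the same answers,
# which is what lets hardware evaluate them simultaneously.
any_order = {i: attention_at(xs, i) for i in (3, 0, 2, 1)}
print(in_order == [any_order[i] for i in range(len(xs))])
print(rnn_states(xs))
```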
The slope of the power law is determined by the problem and dataset. Compute, parameter count, and data move you along the curve. Change in architecture/bias is a constant offset.
So architecture can give an advantage, but that advantage can be overcome by scale.
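A rough way to see the "constant offset" claim: assume loss follows a power law in scale, `loss = a * n ** (-b)`, where the exponent `b` is set by the problem and the prefactor `a` by the architecture. The functional form and all the numbers below are illustrative assumptions, not measured values.

```python
def loss(n: float, a: float, b: float) -> float:
    """Assumed power-law scaling curve: loss = a * n^(-b)."""
    return a * n ** (-b)


def scale_to_match(a_worse: float, a_better: float, b: float) -> float:
    """Factor by which the worse architecture must scale up to reach the same loss."""
    return (a_worse / a_better) ** (1.0 / b)


# Hypothetical values: same exponent, prefactor 20% worse for architecture B.
B_EXPONENT = 0.1
A_GOOD, A_BAD = 1.0, 1.2
N = 1e9

k = scale_to_match(A_BAD, A_GOOD, B_EXPONENT)
print(f"scale-up needed: {k:.2f}x")
print(loss(N, A_GOOD, B_EXPONENT), loss(N * k, A_BAD, B_EXPONENT))
```

On a log-log plot the two curves are parallel lines. With a shallow exponent, even a modest offset costs a large multiple of scale, which is one way to read "can be overcome by scale" as true but expensive.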
Agreed, it's in my first sentence: "as long as you can efficiently train models of the same size, what counts is the dataset". But there are just a few useful sizes: 7, 13, 35, 70, 120B, because they are targeted at various families of GPUs. A 2T model that I can't run, or that is too expensive to use via APIs, is of no use. And it's not just dataset size: data quality matters just as much, and so does diversity.
I believe LLMs will train mostly on synthetic data engineered to have extreme diversity and very high quality. This kind of data confers 5x gains in efficiency as demonstrated by Microsoft in the Phi-1.5 paper.
To add, if you have a model that's too large for the dataset, you end up in a situation where it just memorizes the training dataset and performs poorly on the test data. That's why GPT works so well: you only pass the training data through it once, and there's so much good training data that it works in spite of the model being absolutely massive.
I kind of agree, architecture does seem to matter in broad strokes but neuron-local learning mechanics will take it from there and the rest is up to the dataset.
This is great, but what is a possible use-case of these massive classifier models? I'm guessing they won't be running at the edge, which precludes them from real-time applications like self-driving cars, smartphones, or military. So then what? Facial recognition for police/governments or targeted advertisement based on your Instagram/Google photos? I'm genuinely curious.
This is nice because convolutional models seem better for some vision tasks like segmentation which are less obvious how to do with ViTs. Convolution seems like something you fundamentally want to do in order to model translation invariance in vision.
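A small sketch of the property in question, using a 1D circular signal and plain Python (the signal, the kernel and the helper names are made up). Strictly speaking, what convolution gives you is translation equivariance: shift the input and the feature map shifts with it. Invariance shows up once you pool over positions.

```python
def circular_conv(signal, kernel):
    """Sliding dot product with wrap-around (cross-correlation, as in deep learning)."""
    n = len(signal)
    return [
        sum(kernel[k] * signal[(i + k) % n] for k in range(len(kernel)))
        for i in range(n)
    ]


def shift(signal, s):
    """Shift a signal to the right by s positions, wrapping around."""
    n = len(signal)
    return [signal[(i - s) % n] for i in range(n)]


signal = [0, 0, 1, 2, 1, 0, 0, 0]
kernel = [1, 0, -1]  # simple edge-like filter

features = circular_conv(signal, kernel)
features_of_shifted = circular_conv(shift(signal, 3), kernel)

# Equivariance: convolving the shifted input equals shifting the features.
print(features_of_shifted == shift(features, 3))
# Invariance after global max pooling: the pooled value ignores the shift.
print(max(features_of_shifted) == max(features))
```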
I haven't fully read the paper yet. Isn't the strength of Vision Transformers in unsupervised learning, meaning that the data doesn't need labels? And don't ResNets require labeled data?
It's possible to adapt ResNets for unsupervised learning. A popular approach is to use self-supervised learning techniques, where the model is initially trained to predict some aspect of the data (e.g., predicting a missing part of an image) without explicit labels. This is similar to masked language modeling.
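As a minimal sketch of how such a label-free training pair could be built (the patch representation, the `MASK` placeholder and the function name are all assumptions for illustration, not any particular paper's recipe): hide a random subset of patches and keep the hidden ones as prediction targets.

```python
import random

MASK = None  # stands in for a mask token or blanked-out patch (assumption)


def make_masked_example(patches, mask_ratio=0.5, seed=0):
    """Hide a random subset of patches; the hidden patches become the targets."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(patches) * mask_ratio))
    masked_idx = set(rng.sample(range(len(patches)), n_mask))
    inputs = [MASK if i in masked_idx else p for i, p in enumerate(patches)]
    targets = {i: patches[i] for i in masked_idx}
    return inputs, targets


# Letters stand in for image patches here.
patches = list("abcdefgh")
inputs, targets = make_masked_example(patches, mask_ratio=0.25)
print(inputs)
print(targets)
```

The model (ResNet or otherwise) would then be trained to predict `targets` from `inputs`, so the supervision signal comes from the data itself rather than from human labels.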
That is one of the goals, yes. In addition, it seems like you get neural architecture search (architecture is optimized), faster training, inference and interpretability. I'm working it out as we speak.
Ironically, convolution provides some unification in physics too, e.g. renormalization is a convolution.
It kind of is. The commenter has been working on this formalism for a year or more. I'm sure he will come by soon with the link to the Discord channel where he discusses it and finds collaborators.