Yikes, I’m old. There was a lot of NN work and a lot of books available on NN’s back in the mid and late 90’s. ‘Soft computing’ was the all-encompassing term for NN, genetic algorithms, AI, expert systems, fuzzy logic, ALife and all sorts of nascent computational areas back then. I still have a bunch of issues of the monthly AI Expert magazine one could buy at a decent magazine stand. Small data-sets were definitely a limiting factor as well as limited computer power. I remember certain applied fields did embrace NN’s early on, like some civil engineers and hydrologists, who were finding some use for them. At the U of Toronto, I considered doing a PhD with a biologist who was using them to investigate vision (and got help from Hinton). Physiology was one area where you could generate “long” time-series in a relatively short period of time. Those were still the days when Intel 286/386/486 and lowly Pentium machines were still common currency. Computer scientists at the time didn’t yet have clear break-through commercial applications which would have attracted crazy funding. A lot of theory, little real action.
>"Small data-sets were definitely a limiting factor as well as limited computer power."
Not just small data-sets and limited computer power, but also very few libraries to help you out - although you could download something like xerion from ftp.cs.toronto.edu and join their email list, it was generally a case of retyping examples or implementing algorithms from printed textbooks. And it was all in C, presumably for performance reasons, while most of the symbolic AI folks came from Lisp or Prolog backgrounds.
While my experience is not from the 90s, I think I can speak to some of why this is. For some context, I first got into neural networks in the early 2000s during my undergrad research, and my first job (mid 2000s) was at an early pioneer that developed their V1 neural network models in the 90s (there is a good chance models I evolved from those V1 models influenced decisions that impacted you, however small).
* First off, there was no major issue with computation. Adding more units or more layers isn't that much more expensive. Vanishing gradients and poor regularization were a challenge and meant that increasing network size rarely improved performance empirically. This was a well-known challenge up until the mid/later 2000s.
* There was a major 'AI winter' going on in the 90s after neural networks failed to live up to their hype in the 80s. Computer vision and NLP researchers - fields that have most famously recently been benefiting from huge neural networks - largely abandoned neural networks in the 90s. My undergrad PI at a computer vision lab told me in no uncertain terms he had no interest in neural networks, but was happy to support my interest in them. My grad school advisors had similar takes.
* A lot of the problems that did benefit from neural networks in the 90s/early 2000s just needed a non-linear model, but did not need huge neural networks to do well. You can very roughly consider the first layer of a 2-layer neural network to be a series of classifiers, each tackling a different aspect of the problem (e.g. the first neuron of a spam model may activate if you have never received an email from the sender, the second if the sender is tagged as spam a lot, etc). These kinds of problems didn't need deep, large networks, and 10-50 neuron 2-layer networks were often more than enough to fully capture the complexity of the problem. Nowadays many practitioners would throw a GBM at problems like that and can get away with O(100) shallow trees, which isn't very different from what the small neural networks were doing back then (a rough sketch of that comparison follows at the end of this comment).
Combined, what this means from a rough perspective, is that the researchers who really could have used larger neural networks abandoned them, and almost everyone else was fine with the small networks that were readily available. The recent surge in AI is being fueled by smarter approaches and more computation, but arguably much more importantly from a ton more data that the internet made available. That last point is the real story IMO.
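To make that point about scale concrete, here is a minimal sketch, assuming scikit-learn and a synthetic dataset (nothing here is from the actual 90s-era models): a one-hidden-layer network with a couple dozen units next to a gradient-boosted model with O(100) shallow trees.

```python
# A sketch of the scale described above, on made-up data: a small
# "2-layer" network (one hidden layer of ~20 units) next to a GBM
# with ~100 shallow trees. All hyperparameters are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One hidden layer, a few dozen units: the 90s-scale network.
mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
mlp.fit(X_tr, y_tr)

# The modern default for the same kind of tabular problem: O(100) shallow trees.
gbm = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)
gbm.fit(X_tr, y_tr)

print("small MLP accuracy:", mlp.score(X_te, y_te))
print("shallow GBM accuracy:", gbm.score(X_te, y_te))
```

The point is just the scale: both models are tiny by modern standards.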
The funny thing is that the authors of the paper he linked actually answer his question in the first paragraph, when they say that the input dataset needs to be significantly larger than the number of weights to achieve good generalisation, but there is usually not enough data available.
Data, data, data, data. The 1990s didn't have Wikipedia, YouTube, megapixel cameras every which where, every single adult human hooked up to a sensor package 24 hours a day, and who knows what else. I know as a 1990s guy I would never have imagined the amount of data we would eventually all throw up into the ether even ten years later, to say nothing of today. Without that corpus...
And none of those examples except Wikipedia were used to train the various LLMs. I wonder how much better multi-modal models are going to get if they start incorporating the 24/7 sensor data from billions of people.
On a side note, a long time ago I saw someone who made a bot trained on a selected sample of chats between people on the internet - and the tool swore a lot.
Nice link! I never saw that page before. This quote surprised me:
>"it should be noted that the amount of text added to Wikipedia articles every year has been constant since 2006, at roughly 1 gigabyte of (compressed) text added per year."
Yes, Wikipedia is surprisingly small. You can fit the whole thing on an iPad and access all of it without internet. Plenty of rabbit holes to fill even the longest airplane flight.
Highly recommend the exercises in Rumelhart and McClelland - Parallel Distributed Processing: Explorations in the Microstructure of Cognition from 1986-1987 (two volumes)
I was studying computer science and AI in 1987-1990; I didn't know it was the deepest, darkest pit of AI research despair.
I found the two Rumelhart & McClelland books, just a single copy on the shelf at Cody's Books, soon after publication. I worked through the examples, and was immediately convinced that this low-level approach was a way forward.
For some reason, none of the stressed out Comp Sci professors wanted to listen to a weirdo undergraduate, a lousy student.
I'm glad I was there at a reboot of AI, but my timing was lousy.
We were missing two architecture patterns that were needed to get deeper nets to converge: residual nets [1] which solved gradient propagation, and batch normalization [2] which solved initialization.
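For anyone who hasn't run into those two patterns, here is a minimal sketch, assuming PyTorch; the channel counts and kernel sizes are arbitrary and not taken from [1] or [2].

```python
# Minimal residual block with batch normalization (PyTorch-style sketch).
# Channel counts and kernel sizes are arbitrary, for illustration only.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)  # normalizes activations, easing initialization
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Skip connection: gradients always have an identity path back through the sum.
        return self.relu(out + x)

x = torch.randn(1, 16, 32, 32)
print(ResidualBlock(16)(x).shape)  # torch.Size([1, 16, 32, 32])
```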
Also quasi-linear activation functions (prevent vanishing gradients), tons of regularisation (e.g. convolutions) and more adaptive gradient descent (faster convergence). As late as the early 2010s I still met people who tried to make neural networks work using only a few dozen units. Academia is pretty slow. What people also forget is that libraries like PyTorch or TensorFlow simply didn't exist. I wrote my own neural network stacks, complete with backpropagation, from scratch in C++ back then.
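Not the C++ stack described above, just a rough sketch of what "backpropagation from scratch" with a quasi-linear (ReLU) activation looks like in NumPy, on made-up toy data:

```python
# One-hidden-layer network trained with hand-written backprop (NumPy sketch).
# Toy data and hyperparameters are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))                  # toy inputs
y = (X.sum(axis=1, keepdims=True) > 0) * 1.0   # toy targets

W1 = rng.normal(scale=0.1, size=(4, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.1, size=(16, 1)); b2 = np.zeros(1)
lr = 0.1

for step in range(500):
    # Forward pass: ReLU hidden layer, sigmoid output.
    h_pre = X @ W1 + b1
    h = np.maximum(h_pre, 0.0)                 # ReLU: derivative is 0 or 1, so it doesn't vanish
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

    # Backward pass (squared-error loss, averaged over the batch).
    d_out = (p - y) * p * (1 - p) / len(X)
    dW2 = h.T @ d_out; db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (h_pre > 0)         # chain rule through the ReLU
    dW1 = X.T @ d_h;  db1 = d_h.sum(axis=0)

    # Plain gradient-descent update.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

print("training accuracy:", ((p > 0.5) == y).mean())
```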
Rosenblatt had a working perceptron for classifying images in the 1950s (!). And yet it took 60 years before the theory and compute power had developed enough for all of this to be interesting outside of small, purely academic experiments.
And yet classical OCR techniques continued to dominate. Nothing happened in the industry on that front for over 20 years. That's as academic as it gets.
Do you think Carmack, deep down, wonders why he let himself miss the boat on the LLM revolution? He spent golden years toiling away in Facebook, only to finally announce he was quitting to focus on AGI... only for the world to be taken by storm by transformers, GPT, Midjourney, etc.
If anyone could have been at the forefront of this wave, it could've been him.
And now the landscape has utterly changed and no one is even convinced they need "AGI". Just a continually refined LLM hooked up to tools and other endpoints.
I sometimes wonder what could've happened if he had stuck to the 3D graphics space. He was once a great innovator - Wolfenstein, Doom, then Quake - and he did some innovation in Rage / id Tech 5 with infinite texture streaming, but it was full of technical issues. Ultimately, around Doom 3 / Rage, it felt like id Software wasn't anything special anymore; they were bought out and then he left id.
Now the last major innovation in the space has come from Epic Games / Unreal Engine.
He did his best work when he wrote the entire engine alone. That's no longer possible. You can however plausibly invent AGI alone. He said that an AGI implementation is likely simple (meaning not complex), and I agree. The difficulty is in the method not lines of code, so it's work that fits him.
The biggest problem with AGI is definitional. How will we know when we see it?
Once that little detail gets solved, who’s to say that “refined LLM hooked up to tools and other specialized LLMs” won’t be it? Sure could be.
But it also could not be! AGI has been right around the corner my whole life and even longer. 50 years at least. Every new AI discovery is on the verge of AGI until a few years later it hits a wall. Research is hard like that.
With everything Carmack achieved, two things dumbfounded me: his sycophantic relationship with Jobs (who apparently almost succeeded in getting him to postpone his wedding so that he could appear at some Apple event) and that he would go near Facebook at all.
Talk about having "fuck you" money but just not willing to say "fuck you".
I got exposed to programming neural networks in the early 90s. They solved certain problems incredibly fast, like the traveling salesman problem. I was tinkering with 3D graphics and fractals and map pathfinding. Though it didn’t occur to me how much more power was there.
“Data” was so much smaller then. I had a minuscule hard drive if any, no internet, 8-bit graphics but nothing photorealistic, glimpses of Windows and OS/2, and barely a mouse. In retrospect, it was like embedded programming.
I believe the issue was not a lack of computational power, but rather that people at the time didn't think large models with many parameters would effect meaningful change. This was even true three years ago, albeit on a different scale. As Ilya Sutskever expressed, people were not convinced there was still room to increase the scale. For the status quo to shift, two things could happen: a substantial reduction in computing costs, making large-scale experiments less a matter of conviction and more a matter of course; or the emergence of individuals with the resources and conviction to undertake larger experiments.
My favorite comparison for the accessibility of power is looking at a weird computer in the top 500 from a while back.
System X, in 2004, was the 7th most powerful computer in the world. It was 1100 PowerPC 970 Macs with 2200 cores and claimed an Rmax of 12k GFlops. https://www.top500.org/system/173736/
An M1 MacBook Air hits 900 GFlops (https://news.ycombinator.com/item?id=26333369). A dozen MacBook Airs - about what you'd expect in a grade school computer lab - roughly matches the 7th most powerful computer system in the world from 2 decades ago.
The reason I like the comparison (and "here's this giant computer and now it fits on a card that you can get at Micro Center" is another reasonable comparison) is that it compares like-ish with like-ish.
It was a Mac back then - 1100 of them, but it was a Mac. You could walk into a store and buy one... or two. They might have some issue with buying a thousand of them, but they were consumer commodity equipment - it was the rack mounted version of the PowerMac G5 if I read things correctly. You might have one of them in the media lab for a high school.
And now, it's a dozen M1 MacBook Airs (or Mac minis). Still a Mac. Still something you could walk into the store and buy. But now, instead of "maybe there are 1000 of them in all the grade and high schools in the state" (though that would be stretching it), it's "now this is an acceptably outfitted grade school computer lab."
No regular person was ever going to get a proper fraction of the nodes of Blue Gene from the DOE (though it was running the PowerPC 440 2C instead... but 32,768 of them), or do anything with it if they did. https://en.wikipedia.org/wiki/IBM_Blue_Gene
Comparing "that massive thing" to "this card" is impressive - but the "that massive thing" is inconceivable to the average person.
Thus the "you could have gotten a fraction of System X at a store and used it" comparison.
I see your logic, but the Apple hardware is super expensive; a single MacBook is the same cost as a single RTX 4090 (not the MBA maybe, but definitely the MBP). So it's not that wide a stretch to say that the 4090 in a normal PC is also a fair comparison as a "widely available" computer.
Computers are undoubtedly more powerful now than they were in the 90s. Although computing capabilities of the 90s seem weak compared to today's standards, they were not so inadequate that we couldn't train and run a network comprising thousands of parameters. I vividly recall the early 2000s when I was in college. Neural networks were seen as a sort of "fringe" technology in a series of statistics courses. We were mostly shown examples with 6 or 12 neurons, and nobody mentioned the possibility of scaling up to hundreds of neurons. Around that time, we already had sophisticated games like The Elder Scrolls III. We could have easily scaled up the network size by at least an order of magnitude at home, not to mention the capabilities that big companies possessed at that time.
> but rather that people at the time didn't think large models with many parameters would effect meaningful change. This was even true three years ago, albeit on a different scale.
I've also noticed this, and want to ask: who are these people? Do they not have (~80-billion-neuron) brains? (And that's neurons, with by most estimates thousands of synapses each; so you're actually talking on the order of tens to hundreds of trillions of neural network parameters before you reach parity with biological examples.)
In the early 2000's, it was believed that the topology of a neural network was a major factor in getting it to work well, and that throwing more neurons and computing power alone would not suffice. In a sense that was not wrong: convolutional nets were an early example of a neural network topology that enforced translation invariance while being parsimonious in tunable parameters.
Another factor was that SVMs were all the rage back then, because they had nice math and fitted the computational resources of a contemporary workstation.
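A rough, made-up illustration of that parsimony point (sizes are arbitrary; PyTorch is used here only for the parameter counting):

```python
# A conv layer shares one small filter bank across the whole image; a dense
# layer producing the same-sized output does not. Sizes are arbitrary.
import torch.nn as nn

conv = nn.Conv2d(1, 8, kernel_size=5)      # on a 28x28 input -> 8 x 24 x 24 output
dense = nn.Linear(28 * 28, 8 * 24 * 24)    # same output size, no weight sharing

count = lambda m: sum(p.numel() for p in m.parameters())
print("conv params: ", count(conv))    # 8*1*5*5 + 8      = 208
print("dense params:", count(dense))   # 784*4608 + 4608  ~= 3.6 million
```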
Are you referring to other threads? No. However, I wouldn't be surprised if other people developed similar beliefs following recent advances in large language models (LLMs). Of course, we wouldn't achieve GPT-4 level results using only technology available before 2020, but with sufficient data and computational power, we could have accomplished much more than what was generally believed to be possible in the machine learning field at the time.
I thought I had read almost those exact words before. I've been known to repeat myself on here before.
In fact I've been so nuanced that I've had people use something I've said to disagree with me and then I've had to point out that the original thing is also by me.
I've been unable to determine whether I'm actually influential or am just unknowingly expressing part of a generalized changing sentiment. Confidence is the first trapping of fools.
I think it's more that modern automatic differentiation abstractions weren't well known to researchers. From what I remember, even in the early 2000s when I went to school, backpropagation was basically hand coded.
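For contrast, here is roughly what "not hand-coding backpropagation" means with a modern autodiff framework (a PyTorch sketch on toy tensors; nothing specific to the era being described):

```python
# Automatic differentiation: write only the forward computation and let the
# framework derive the gradients.
import torch

w = torch.randn(3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

loss = ((w * x).sum() - 1.0) ** 2   # forward pass only
loss.backward()                     # gradients computed automatically

print(w.grad)                       # d(loss)/d(w), same shape as w
```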
The only ML that I ever did was a single undergrad NN class around ~2001. That was a long time ago, but I vaguely remember being taught at that time that adding more nodes rarely helped, that you were just going to overfit to your dataset and have worse results on items outside the dataset, or, worse, end up with a completely degenerate NN - e.g. that best practice was to use the minimum number of nodes that would do the job.
On the contrary, there was a mathematical proof that a one-hidden-layer neural network with a nonlinearity is enough to approximate any (continuous) function. Using more than 1 hidden layer seemed a waste.
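The result being recalled is presumably the universal approximation theorem (Cybenko 1989; Hornik 1991). Note it is an approximation guarantee rather than exact representation: for any continuous $f$ on a compact set $K \subset \mathbb{R}^n$, a fixed non-polynomial (e.g. sigmoidal) activation $\sigma$, and any $\varepsilon > 0$, there exist $N$ and parameters $v_i, b_i \in \mathbb{R}$, $w_i \in \mathbb{R}^n$ with

$$\sup_{x \in K} \Bigl| f(x) - \sum_{i=1}^{N} v_i \, \sigma(w_i^{\top} x + b_i) \Bigr| < \varepsilon.$$

It says nothing about how large $N$ must be or how to find the weights, which is part of why depth still ended up mattering in practice.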
To experiment with SGD and back-propagation on 4096x4096 32-bit matrices, you would need a machine with hundreds of megabytes of RAM in the 90s (a single matrix that size is already 4096 x 4096 x 4 bytes = 64 MB, and you need several of them plus gradients). In terms of software, you would need to be comfortable with C/C++ or maybe Fortran to be able to experiment quickly enough to land on effective hyperparameters.
Probably too many low-probability events chained together.
But I think they discovered most of the interesting things that small networks can do? For example, TD-Gammon from 1992: https://en.wikipedia.org/wiki/TD-Gammon .
In 1999, our “computer vision” guy - a masters student - struggled mightily to recognize very simple things in a video stream from a UAV. Today, we would take this for granted. But back then, the computation was for all intents and purposes entirely non-existent. At best he was hoping to apply an edge detection kernel maybe once every two seconds and see if he could identify some lines and arcs and then hand code some logic to recognize things.
What? There were Pentium 2 and 3 machines back then that could certainly do more than an edge detection kernel every 2 seconds.
Or do you mean on an embedded CPU?
Software is iterative. At least when I was studying in the mid 90s people had really only just gotten the idea to do a Fourier transform of an image and look for high frequencies to indicate borders. Add ~3 decades of each generation of grad student doing slightly better than the last one.
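A sketch of that "Fourier transform the image and keep the high frequencies" idea, with a synthetic image and an arbitrary cutoff radius (NumPy only; nothing here is from the original mid-90s code):

```python
# High-pass filtering in the frequency domain to bring out borders.
# The synthetic 128x128 image and cutoff radius are made up for illustration.
import numpy as np

img = np.zeros((128, 128))
img[32:96, 32:96] = 1.0                         # a bright square on a dark background

F = np.fft.fftshift(np.fft.fft2(img))           # FFT with DC component in the centre

yy, xx = np.mgrid[:128, :128]
F[np.hypot(yy - 64, xx - 64) < 8] = 0           # zero out a disc of low frequencies

edges = np.abs(np.fft.ifft2(np.fft.ifftshift(F)))
print("strongest rows:", sorted(np.argsort(edges.max(axis=1))[-4:]))  # near rows 32 and 95
```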
Yeah, good times! The other day I was browsing for the 999th time Steve Smith's book "The Scientist and Engineer's Guide to Digital Signal Processing"[1] and stumbled upon the chapter on NN[2]: I remember reading this when I was a student and being able to make sense of it and why it worked, but reading it 15 years later I find it is explained so clearly compared to other resources! (maybe experience is playing in my favor too)
You get a BASIC code snippet for training and inference and, most of all, there is an explicit use-case for digital filter approximation! At the time NNs were treated as a tool among other ones, not an "answer-to-everything" type of thing.
I know Deep Learning opened new possibilities, but a lot of the time CNNs/RNNs/Transformers are definitely not needed: working on the data instead and using "linear" models can go really far (my 2 cents)
In the early 90s, not only was there lower computing power, but there was also not that much internet connectivity, low bandwidth, and no digital cameras, so there were not that many images online, and the images you had were low res and low color depth. Internet giants didn't yet exist and didn't yet collect massive amounts of data.
I personally made a Quake 2 bot using neural networks in 1999; I think it had several hundred neurons and several thousand 'synapses' (parameters).
At the time that felt like a lot of parameters. Computation wasn't much of a limit though, I could run several NNs faster than realtime.
I have one of the early PhDs in neural networks (graduated in 1992). However, my work was analytical - I was able to prove a couple of theorems about backpropagation. I just needed a simple implementation to prove that my ideas worked, so I wrote my code from scratch in C.
I followed a Scientific American article in 1992 as a high schooler and got digit recognition and basic arithmetic working on a 386. What the popsci press said at the time was that we were limited by memory bandwidth (cache size), training data, and to some extent pointer-chasing (and other inefficiencies) in graph algos
On the topic of AI history, I would like to set up a demo of old AI and/or general CS research on late 90s/early 00s Sun Ultra machines.
Does anyone have suggestions (and links to code!) for what would be a cool demo? I’m thinking of a Haar classifier to show some object recognition/face detection, but would appreciate more options!
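Not era-appropriate for a Sun Ultra, but as a sketch of what a Haar-classifier face-detection demo would show, here is the modern OpenCV version (the image path is a placeholder; the cascade file ships with opencv-python). It would need porting or an era-appropriate implementation to actually run on the Ultra.

```python
# Haar-cascade face detection sketch with OpenCV. "your_photo.jpg" is a placeholder.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("your_photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:                      # box around each detected face
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces_detected.jpg", img)
```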
Definitely saw NN code in the 1990s; I recall a hardback book with a mostly red cover... not sure of the title. Prominent and rigorous code implementations were associated with MIT at that time (the Random Forest guy was at Berkeley in the stats department)
edit: yes, almost certainly Neural Networks for Pattern Recognition (1995), thx!
In 2012, results were published from research on visual processing in the brain which (among other things, like the retina compressing the input) found that the visual cortex uses convolution. That got mimicked and became a breakthrough in image recognition NNs, which sparked life into the whole field.
I knew someone in the early 90s who was making a neural network on a chip for his PhD. The chip fitted 1 neuron. Yes he might have used float16 to cram more in but those techniques were not known at the time.
There really wasn't the compute power around at the time, and as others have pointed out there wasn't the training data, or the cameras.
Reading through the Twitter thread and these comments reminds me of all of the back and forth when HN discusses Psychology.
One side, holding a pipe, 'well actually, back in 1954, I put together an analog variant of a neuron perceptron built out of old speaker cables and car parts, strung it across the living room and it could say 10 words and fetch my slippers'. 'Really', 'Yes, Indubitably'.
"Elmer and Elsie, or the "tortoises" as they were known, were constructed between 1948 and 1949 using war surplus materials and old alarm clocks."
"The robots were designed to show the interaction between both light-sensitive and touch-sensitive control mechanisms which were basically two nerve cells with visual and tactile inputs."
A very bad comment, that failed to make a point, and that wasn't very humorous.
I meant to draw a relationship between Psychology and Machine Learning.
Psychology, the study of the mind, with questionable scientific methods and a replication problem.
And
Machine Learning (which takes the mind as a model), with questionable scientific methods, a replication problem, and the addition of corporate hype machines.
Often in the last few months we stand in awe of what AI achieves, but it produces questionable results and has a lot of problems. Machine learning is worshiped.
And yet, also in the last few months, posts on Psychology are railed on and the field called full of con men and BS artists.
Why the duality? Both are young fields and stretching. Rapidly making progress, hitting dead ends, and changing course. The scientific method isn't a straight path. But Psychology doesn't seem to be given much leeway to make errors and course correct.
I just find it hitting a peak right now because the study of the Human Mind (wet net) and the Machine Mind (electric net) seem to be hitting a lot of the same issues. There are so many parallels in how they are spoken of, in the common problems, and in how they are framed within each field.
Wonder how long until we just openly talk about a field of Psychology of Machines, where we use the same tools to try and understand what the Neural Nets are thinking.