The trick of performing binary addition by using analog voltages was used in the IAS family of computers (1952), designed by John von Neumann. It implemented a full adder by converting two input bits and a carry-in bit into voltages that were summed. Vacuum tubes converted the analog voltage back into bits, using a threshold to generate the carry-out and more complex thresholds to generate the sum-out bit.
The Man from the Future: The Visionary Life of John von Neumann by Ananyo Bhattacharya
I knew of von Neumann because his name shows up many, many times when studying computers. But I had no idea he had several equally monumental bodies of work in such a wide range of subjects.
It's that rare biography that helps you understand the history of multiple different disciplines: quantum mechanics, game theory, computer science and more.
Nifty, makes one wonder if logarithmic or sigmoid functions for ML could be done using this method. Especially as we approach the node size limit, perhaps dealing with fuzzy analog will become more valuable.
Analog pid controllers: https://www.youtube.com/watch?v=Ps9iD738rUg
Soviet IR-tracking missile heads too. Stumbled upon one on eBay once and was surprised the control system was analog. Not sure which one it was, and I'm very surprised to find a Raytheon AIM-9 Sidewinder missile kit there: https://www.ebay.com/itm/273992271867
According to this table on Wikipedia, it's possible to use electron charge (instead of 'spin') to do quantum logic with qubits.
How is doing quantum logical computations with electron charge different from what e.g. Cirq or Tequila do (optionally with simulated noise to simulate the quantum computing hardware)?
FWIU, analog and digital component qualities are not within sufficient tolerance to do precise analog computation? (Though that's probably debatable for certain applications at least, but not for general-purpose computing architectures?) That is, while you can build adders out of voltage potentials quantified more finely than 0 or 1, you probably shouldn't without sufficient component spec tolerances, because of noise and thus error.
IMHO, Turing Tumble and Spintronics are neat analog computer games.
(Are qubits, by Church-Turing-Deutsch, sufficient to: 1) simulate arbitrary quantum physical systems; or 2) run quantum logical simulations as circuits with low error due to high coherence? https://en.wikipedia.org/wiki/Church%E2%80%93Turing%E2%80%93... )
There's a limit to this process. At a certain number of bits, the sum of voltages will be too high to represent safely. Yet adding more bits on a digital computer is trivial.
That's awesome - the network's solution is essentially to convert the input to analog, perform the actual addition in analog, and then convert that back to digital. And the first two parts of that all happened in the input weights, no less.
An 8-bit adder is too small and allows such a 'hack'. He should try training a 32- or 64-bit adder, decrease the weight precision to bfloat16, and introduce dropout regularization (or another kind of noise) to get more 'interesting' results.
Addendum: another interesting variation to try is a small transformer network, and feeding the bits sequentially as symbols. This kind of architecture could compute bignum-sized integers.
For a naive adder, sure. Actual adder circuits use carry lookahead. The basic idea is to “parallelize” carry computation as much as possible. As with anything else, there is a cost: area and power.
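For the curious, a rough Python sketch of the generate/propagate idea behind carry lookahead (function name and bit width are just illustrative). In hardware, each carry expands into a flat AND-OR of the inputs, so all of them can be computed in parallel instead of rippling:

    def carry_lookahead_add(a, b, n=8):
        a_bits = [(a >> i) & 1 for i in range(n)]
        b_bits = [(b >> i) & 1 for i in range(n)]
        g = [x & y for x, y in zip(a_bits, b_bits)]  # generate: this position creates a carry
        p = [x ^ y for x, y in zip(a_bits, b_bits)]  # propagate: this position passes a carry on
        c = [0] * (n + 1)
        for i in range(n):
            # c[i+1] = g[i] | (p[i] & g[i-1]) | (p[i] & p[i-1] & g[i-2]) | ...
            # written recursively here for brevity; the expanded form has no serial dependency
            c[i + 1] = g[i] | (p[i] & c[i])
        s = [p[i] ^ c[i] for i in range(n)]
        return sum(bit << i for i, bit in enumerate(s)) + (c[n] << n)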
In my EE class that discussed adders, we learned this wasn't exactly true. A lot of adders compute the carry bit separately because it takes time to propagate if you actually capture it as output.
Yeah, the current solution is similar to overfitting; this won't generalize to harder math where the operation doesn't correspond to the activation function of the network.
Is this true? If it can add, it can probably subtract? If it can add and subtract it may be able to multiply (repeated addition) and divide? If it can multiply it can do exponents?
I don't know, but I can't jump to your conclusion without much more domain knowledge.
An 8x8-bit multiplication only requires 7 additions, either in parallel or sequentially. Remember long-form multiplication? [1] It's the same principle. Of course, high-speed digital multiplication circuits use a much more optimized, much more complex implementation.
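The naive software equivalent in Python (purely illustrative): one partial product per set bit of the multiplier, and combining the 8 partial products takes 7 additions since the first one is just an assignment.

    def mul8(a, b):
        # long-form multiplication in base 2: add a shifted copy of `a`
        # for every set bit of `b`
        result = 0
        for i in range(8):
            if (b >> i) & 1:
                result += a << i
        return result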
There is one savant who can do crazy math but is otherwise normal, after an epileptic fit.
His system is base 10000: he can add any two numbers below 5000 and come up with a single-digit response in one loop, then convert to base 10 for the rest of us.
Each digit in his base 10000 system has a different visual representation, like we have 0-9.
It's integer accurate so I don't think it uses sine wave approximations.
Guy can remember pi for 24 hours.
I don't think the boffins or he himself got close to understanding what is going on in his head.
I was interested in reading more about this, but the only Google result is your comment. Do you remember a name or any other details that might help find the story?
Does this relate to the addition theorem of Fourier transforms? The weights of his NN not only looked like sine waves but were also crudely orthogonal.
Maybe in the future we will think of the word "digital" as archaic and "analog" will be the moniker for advanced tech.
Impressive. Am I right that what is really going on here is that the network is implementing the "+" operator by actually having the CPU of the host carrying out the "+" when executing the network?
I.e., the network converts the binary input and output to floating point, and then it is the CPU of the host the network is running on that really does the addition in floating point.
So usually one does a bunch of FLOPs to get "emerging" behaviour that isn't doing arithmetic. But in this case, instead the network does a bunch of transforms in and out so that in a critical point, the addition executed in the network runner is used for exactly its original purpose: Addition.
And I guess saying it is "analog" is a good analogy for this..
I'd say it's more that the network "discovered" that addition is something its own structure does "naturally", and reduced the problem into decoding binary input, and encoding binary output. In particular, I don't think it's specifically about the CPU having an addition operator.
My intuition is as follows: if I were to train this network with pencil, paper and a slide rule, I'd expect the same result. Addition (or maybe rather integration) is embedded in the abstract structure of a neural network as a computation artifact.
Sure, specifics of the substrate may "leak through" - e.g. in the pen-and-paper case, were I to round everything to the first decimal place, or in the computer case, were the network implemented with 4-bit floats, I'd expect it not to converge because of loss of precision (or maybe to figure out the logic-gate solution). But if the substrate can execute the mathematical model of a neural network to sufficient precision, I'd expect the same result to occur regardless of whether the network is run on paper, on a CPU, on a fluid-based analog computer, or with a beam of light and a clever arrangement of semi-transparent plastic plates.
You know how the basic node does a weighted sum of inputs and feeds the result to a nonlinear function for output? This network exploited the addition operation implicit in that first part, the weighted sum.
Fascinating! Is it really exploitation? It seems to me that the node should optimize to the minimal inputs (weights) in the happy path. I.e. it's optimal for all non-operand inputs to be weighted to zero.
I'm curious how reproducible this is. I'm also curious how networks form abstractions and if we can verify those abstractions in isolation (or have other networks verify them, like this) then the building blocks become far less opaque.
That's an accurate enough description of what's going on here, yeah, but I'm very curious if this could be implemented in hardware. What we've got is an inexact not-quite-binary adder, but one that's potentially smaller and faster than the regular binary ones.
So in other words, this is analog computation on digital hardware on an analog substrate - the analog-to-digital step eliminates the noise of our physical reality, and the subsequent digital-to-analog step reintroduces a certain flexibility of design thinking.
I wonder, is there ever a case analog-on-digital is better to work with as an abstraction layer, or is it always easier to work with digital signals directly?
Yes, at least I think so. That's why we so often try to approximate analog with pseudo-continuous data patterns like samples and floats. Even going beyond electronic computing, we invented calculus due to similar limitations with discrete data in mathematics.
Of course, these are all just approximations of analog. Much like emulation, there are limitations, penalties, and inaccuracies that will inevitably kneecap applications when compared to a native implementation running on bare metal (though, it seems that humans lack a proper math coprocessor, so calculus could be considered no more "native" than traditional algebra).
We do sometimes see specialized hardware emerge (DSPs in the case of audio signal processing, FPUs in the case of floats, NPUs/TPUs in the case of neural networks), but these are almost always highly optimized for operations specific to analog-like data patterns (e.g.: Fourier transforms) rather than true analog operations. This is probably because scalable/reliable/fast analog memory remains an unsolved problem (semiconductors are simply too useful).
Right, but there exists software that would be robust to noise, e.g. ...neural networks. I'm curious if not-quite-perfect adders might be beneficial in some cases.
Analog addition is actually really easy if you can tolerate noise and heat and can convert to/from representing the signal as an analog current. Using Kirchhoff's Current Law, addition of N currents from current sources is achieved by joining those N wires to a shared exit wire which contains the summed current.
There is a valid use case for when designers simply need "fuzzy logic" or "fuzzy math" that doesn't need exact results but can tolerate inaccurate ones. If using fuzzy hardware saves resources (time, energy, die space, etc.), then it might be a valid tradeoff to use inaccurate fuzzy hardware instead of exact hardware.
For instance, instead of evaluating an exponential function in digital logic, it might be quicker and more energy-efficient to just evaluate it using a diode, if the value is available as an analog voltage and if some noise is tolerable, as done in analog computers.
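A toy numerical model of the two tricks mentioned above (Python; the component values are made up, and the Shockley-equation constants are typical textbook numbers):

    from math import exp

    I_S = 1e-12   # diode saturation current (A), illustrative
    V_T = 0.025   # thermal voltage at room temperature (V)

    def node_sum(currents):
        # Kirchhoff's Current Law: currents joining at a node simply add
        return sum(currents)

    def diode_current(v):
        # Shockley diode equation: current is (nearly) exponential in voltage,
        # which is how an analog computer gets exp() almost for free
        return I_S * (exp(v / V_T) - 1.0)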
These stories remind me of a story from Discover Magazine https://www.discovermagazine.com/technology/evolving-a-consc...
A researcher was using a process to "evolve" an FPGA, and the result was a circuit that was super efficient but worked in unexpected ways: part of the circuit seemed unconnected to the rest, yet if it was removed the whole thing stopped working, and it would only work at a specific temperature.
> As with Thompson's FPGA exploiting some subtle physical properties of the device, Layzell found that evolved circuits could rely on external factors. For example, whilst trying to evolve an oscillator Bird and Layzell discovered that evolution was using part of the circuit for a radio antenna, and picking up emissions from the environment [18]. Layzell also found that evolved circuits were sensitive to whether or not a soldering iron was plugged in (not even switched on) in another part of the room[19]. ...
However, OP was actually referring to an experiment by Dr Adrian Thompson which was different but also sort of similar. The FPGA evolved by Thompson ended up depending on parts of the circuit that were disconnected from the main circuit but still affected its operation. It probably relied on electromagnetic properties, so was sort of a radio, but it did not rely on the clock of a nearby computer.
I mistakenly made two radios in my electronics classes, from trying to make an amplifier and a PLL. The FM radio was mistakenly made while trying to make an AM radio. :-|
I can't believe someone dropped a link to this story. I remember reading this and feeling like it broadened my sense of what evolutionary processes can produce.
I believe I remember reading about that in the book "The Emperor's New Mind" by Roger Penrose back in the late 90s. It was something that has stuck with me ever since, even though I didn't follow up on the AI route.
It's interesting to see these unconventional solutions. Genetic algorithms evolving antenna designs produce similarly illogical but very efficient designs. Humans are drawn to aesthetics; robots have no such limitations.
One thing that the essay doesn't consider is the importance of efficiency in computation. Efficiency is important because in practice it is often the factor which most limits the scalability of a computational system.
The human brain only consumes around 20W [1], but for numerical calculations it is massively outclassed by an ARM chip consuming a tenth of that. Conversely, digital models of neural networks need a huge power budget to get anywhere close to a brain; this estimate [2] puts training GPT-3 at about a TWh, which is about six million years' of power for a single brain.
Six million years of power for a single brain is one year of power for six million brains. The knowledge that ChatGPT contains is far higher than the knowledge a random sample of six million people have produced in a year.
With that said, I think we should apply ML and the results of the bitter lesson to things which are more intractable than searching the web or playing Go. Have you ever talked to a systems biologist? Ask them about how anything works, and they'll start with "oh, it's so complicated. You have X, and then Y, and then Z, and nobody knows about W. And then how they work together? Madness!"
> Six million years of power for a single brain is one year of power for six million brains. The knowledge that ChatGPT contains is far higher than the knowledge a random sample of six million people have produced in a year.
On the other hand, I don't think it would be larger than that of six million people selected more carefully (though I suppose you'd have to include the costs of the selection process in the tally). I also imagine it wouldn't be larger than that of six thousand people specifically raised and taught in a coordinated fashion to fulfill this role.
On the other other hand, a human brain does a lot more than learning language, ideas and conversion between one and the other. It also, simultaneously, learns a lot of video and audio processing, not to mention smell, proprioception, touch (including pain and temperature), and... a bunch of other stuff (I was actually surprised by the size of the "human" part of the Wikipedia infobox here: https://en.wikipedia.org/wiki/Template:Sensation_and_percept...). So it's probably hard to compare to specialized NN models until we learn to better classify and isolate how biological brains learn and encode all the various things they do.
> a TWh, which is about six million years' of power
I'm not a physics guy but wanted to check this - a watt is a per-second measure, and a terawatt is 1 trillion watts, so 1 TWh is 50 billion seconds of 20 watts, which is 1,585 years of power for a single brain, not 6 million.
I'm sure I got this wrong as I'm not a physics guy, but where did I go wrong here?
A more neutral article (that doesn't have a clear "AI is harming our planet" agenda) estimates closer to 1404 MWh to train GPT-3: https://blog.scaleway.com/doing-ai-without-breaking-the-bank... I don't know either way, but I'd like as good an estimate as possible since this seems an important number.
Watts are already a rate (joules per second); note that the TWh has "hour" affixed to the end. This is 1 terawatt over the course of one hour, not one second. Your numbers are off by a factor of 3600.
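Working the numbers through in Python, using the 20 W brain and the 1 TWh figure quoted above:

    joules_per_twh = 1e12 * 3600                    # 1 TWh = 10^12 W x 3600 s = 3.6e15 J
    brain_joules_per_year = 20 * 3600 * 24 * 365    # ~6.3e8 J per year at 20 W
    print(joules_per_twh / brain_joules_per_year)   # ~5.7 million years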
Thanks for explaining the calculation! There is a huge error though, which is that I mis-read the units in the second link I posted. The actual power estimate for GPT-3 is more like 1 GWh (not 1TWh), so about 6000 years and not 6 million...!
Yeah, the other thing that I should have noticed was the completely unreasonable cost of 1 TWh of energy. If 1 kWh cost about $0.15 at the time they trained the model, 1TWh would have been like $150M. Probably twice that after accounting for cooling. Doh.
Thank you both. I was googling this and found some Stack Exchange-type answers and still got confused. Apparently it is quite common to see the "per time" part of the definition of a watt and get it mixed up with watt-hours. I think I got it now, thank you.
It is. The main culprit, arguably, is the ubiquitous use of kilowatt-hours as a unit of energy, particularly of electricity in the context of the power company billing you for it - kilowatt-hours are what you see on the power meter and on your power bill.
If you're going to reinterpret watts on one side you have to do it on both sides. So, by your thinking, which is not wrong, 1 h × 1 TJ/s vs 20 J/s. As you can see, you can drop J/s from both sides and just get 1/20th of a trillion hours.
In the years since this came out I have shifted my opinion from 100% agreement to kind of the opposite in a lot of cases, a bitter lesson from using AI to solve complex end-to-end tasks. If you want to build a system that will get the best score on a test set - he's right, get all the data you possibly can, try to make your model as e2e as possible. This often has the advantage of being the easiest approach.
The problem is: your input dataset definitely has biases that you don't know about, and the model is going to learn spurious things that you don't want it to. It can make some indefensible decisions with high confidence and you may not know why. Some domains your model may be basically useless, and you won't know this until it happens. This often isn't good enough in industry.
To stop batshit things coming out of your system, you may have to do the opposite - use domain knowledge to break the problem the model is trying to solve down into steps you can reason about. Use this to improve your data. Use this to stop the system from doing anything you know makes no sense. This is really hard and time consuming, but IMO complex e2e models are something to be wary of.
Your understanding that problem solving has to do with "steps" is the correct one in my opinion.
Also, with data-driven approaches, the model doesn't necessarily learn in any meaningful way. If you train with certain inputs to get certain realistic-looking outputs, you can build sophisticated parrots or chameleons. That's what GPT and Stable Diffusion are: compressed knowledge bases that produce an output without a knowledge model. (No, language models are not knowledge models.)
Thinking, rational or otherwise, requires causal steps. Since none of these data driven approaches have even fuzzy causal models, they require memorizing infinite universes to see if search can find a particular universe that's seen this before. That's why they're not intelligent and never will be. An intelligent animal only needs one universe and limited experiences to solve a problem because it knows the latent structure of the problem and can generate approaches. Intelligent entities, unlike these autistic Rain Man automatons, do not need to memorize every book in the library to multiply two large numbers together.
I'd say it has aged well and there are probably lots of new takes on it based on some of the achievements of the last 6 months.
At the same time, I don't agree with it. To simplify, the opposite of "don't over-optimize" isn't "it's too complicated to understand so never try". It is thought provoking though
I think there is no doubt that there must be more efficient model architectures out there; take, for example, the sample efficiency of GPT-3:
> If you think about what a human, a human probably in a human’s lifetime, 70 years, processes probably about a half a billion words, maybe a billion, let’s say a billion. So when you think about it, GPT-3 has been trained on 57 billion times the number of words that a human in his or her lifetime will ever perceive.[0]
I cannot wrap my head around that. I listened to the audio to check it wasn't a transcription error and don't think it is.
He is claiming GPT-3 was trained on 57 billion billion words. The training dataset is something like 500B tokens, and not all of that is used (Common Crawl is processed less than once), and I'm damn near certain it wasn't trained for a hundred million epochs. Their original paper says the largest model was trained on 300B tokens [0]
Assuming a token is a word, as we're going for orders of magnitude, you're actually looking at about a few hundred times more text. The point kind of stands, it's more, but not billions of times.
I wouldn't be surprised if I'm wrong here because they seem to be an expert, but this didn't pass the sniff test, and to me looking into it doesn't support what they're saying.
It's at about 7 minutes into the video; they really do say it several times in a few ways. He starts by saying it's trained on 570 billion megabytes, which is probably where this confusion starts. Looking again at the paper, Common Crawl after filtering is 570 GB, or 570 billion bytes. So he makes two main mistakes: one is straight up multiplying by another million, the other is assuming one byte is equivalent to one word. Then it's off by a bit more because less than half of it is used. That probably puts it out by a factor of about ten million or more.
300B is then the "training budget" in a sense, not every dataset is used in its entirety, some are processed more than once, but each of the GPT3 sizes were trained on 300B tokens.
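Rough orders of magnitude in Python, using the figures from the thread above and treating tokens and words as roughly interchangeable:

    human_lifetime_words = 1e9        # the ~1 billion words from the quote
    gpt3_training_tokens = 300e9      # training budget cited from the GPT-3 paper
    print(gpt3_training_tokens / human_lifetime_words)   # ~300x, not 57 billion x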
What do you mean? That we are not trained ONLY on words? Or that the training we receive on words is not the same as the training of a NN? Or something else?
Recent work on NNs in which applying Kolmogorov complexity optimization leads to a network that solves binary addition with a carry-over counter and 100% provable accuracy (see the Results section):
Not too surprising. Each layer can multiply each input by a set of weights, and add them up. That's all you need, in order to convert from base 2 to base 10. In base 2, we calculate the value of a set of digits by multiplying each by 2^0, 2^1, ..., 2^n. Each weight is twice the previous. The activation function is just making it easier to center those factors on zero to reduce the magnitude of the regularization term. I am guessing you could probably get the job done with just two neurons in the first layer, which output the real value of each input, one neuron in the second layer to do the addition (weights 1, 1), and 8 neurons in the final layer to convert back to binary digits.
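Something like the following hand-wired sketch (NumPy; names and weights are made up for illustration, this is not the trained network from the post) captures that idea, except that the final decode back to bits can't be purely linear: it needs something periodic or threshold-like, which is what the post's sin-like activation ends up providing.

    import numpy as np

    def hand_built_adder(a_bits, b_bits):
        # a_bits, b_bits: 8 bits each, least-significant first
        powers = 2.0 ** np.arange(8)
        # "first/second layer": weighted sums turn each input into its value, then add them
        n = a_bits @ powers + b_bits @ powers
        # "output layer": 9 neurons, each a sine of a different frequency plus a hard
        # threshold, pick off one bit of the sum (the step that needs the periodic part)
        k = np.arange(9)
        return (1 - np.sign(np.sin(np.pi * (n + 0.5) / 2.0 ** k))) / 2

    print(hand_built_adder(np.array([1, 1, 0, 0, 0, 0, 0, 0]),     # 3
                           np.array([1, 0, 1, 0, 0, 0, 0, 0])))    # 5 -> bits of 8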
1. How much of this outcome is due to the unusual (pseudo) periodic activation function? Seems like a lot of the DAC-like behavior is coming from the periodicity of the first layer’s output, which seems to be due to the unique activation function.
2. Would the behavior of the network change if the binary strings were encoded differently? The author encodes them as 1D arrays with 1 corresponding to 1 and 0 corresponding to -1, which is an unusual way of doing things. What if the author encoded them as literal binary arrays (i.e. 1->1, 0->0)? What about one-hot arrays (i.e. 2D arrays with 1->[1, 0] and 0->[0, 1]), which is the most common way to encode categorical data?
For 1., the author used Ameo (the weird activation function) in the first layer and tanh for the others, and later notes:
"While playing around with this setup, I tried re-training the network with the activation function for the first layer replaced with sin(x) and it ends up working pretty much the same way. Interestingly, the weights learned in that case are fractions of π rather than 1."
By the looks of it, any activation function that maps onto both a positive and a negative range should work. Haven't tested that myself. The 1 vs π is likely due to the peaks of the functions: Ameo at 1 and sine at π/2.
I think that the activation function definitely is important for this particular case, but it's not actually periodic; it saturates (with a configurable amount of leakyness) on both ends. The periodic behavior happens due to patterns in individual bits as you count up.
As for the encoding, I think it's a pretty normal way to encode binary inputs like this. Having the values be -1 and 1 is pretty common since it makes the data centered at 0 rather than 0.5 which can lead to better training results.
1. The author has prior blog posts talking about this activation function, and apparently it does help to learn binary logic tasks.
2. I doubt this matters at all here. For some architectures having inputs be 0 on average is useful, so the author probably just picked it as the default choice.
In an NN context, given that you already have “transform with a matrix” as a primitive, probably something very much like sticking a https://en.wikipedia.org/wiki/DFT_matrix somewhere. (You are already extremely familiar with the 2-input DFT, for example: it’s the (x, y) ↦ (x+y, x−y) map.)
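A quick NumPy illustration of that 2-input case (the n-point matrix is the same pattern built from complex roots of unity):

    import numpy as np

    def dft_matrix(n):
        # W[j, k] = exp(-2*pi*i*j*k / n), the standard (unnormalized) DFT matrix
        j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
        return np.exp(-2j * np.pi * j * k / n)

    print(dft_matrix(2).real)   # [[ 1.  1.] [ 1. -1.]]  i.e. (x, y) -> (x+y, x-y)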
If you want a physical implementation of a Fourier transform, it gets a little more fun. A sibling comment already mentioned one possibility. Another is that far-field (i.e. long-distance; “Fraunhofer”) diffraction of coherent light on a semi-transparent planar screen gives you the Fourier transform of the transmissivity (i.e. transparency) of that screen[1]. That’s extremely neat and covers all textbook examples of diffraction (e.g. a finite-width slit gives a sinc for the usual reasons), but probably impractical to mention in an introductory course because the derivation is to get a gnarly general formula then apply the far-field approximation to it.
A related application is any time the “reciprocal lattice” is mentioned in solid-state physics; e.g. in X-ray crystallography, what you see on the CRT screen in the simplest case once the X-rays have passed through the sample is the (continuous) Fourier transform of (a bunch of Dirac deltas stuck at each center of) its crystal lattice[2], and that’s because it’s basically the same thing as the Fraunhofer diffraction in the previous paragraph.
Of course, the mammalian inner ear is also a spectral analyzer[3].
> Another is that far-field (i.e. long-distance; “Fraunhofer”) diffraction of coherent light on a semi-transparent planar screen gives you the Fourier transform of the transmissivity (i.e. transparency) of that screen
Ooh, yes, I’d forgotten that! And I’ve actually done this experiment myself — it works impressively well when you set it up right. I even recall being able to create filters (low-pass, high-pass etc.) simply by blocking the appropriate part of the light beam and reconstituting the final image using another lens. Should have mentioned it in my comment…
It looks like Albert Michelson's Harmonic Analyzer: see https://engineerguy.com/fourier/ (and don’t miss the accompanying video series! It’s pretty cool.)
In a similar vein, in the 1980s a "superoptimizer" was built for optimizing snippets of assembler by doing an exhaustive search of all combinations of instructions. It found several delightful and unexpected optimizations, which quickly became incorporated into compiler code generators.
So important. Here’s a prediction. Work like this on model explainability will continue and eventually will reveal patterns in the input that reliably produce features in the resulting models. That knowledge will tell us about nature, in addition to ML.
Maybe not exactly what you had in mind, but there is a lot of literature in general on trying to extract interpretations from neural network models, and to a lesser extent from other complicated nonlinear models like gradient-boosted trees.
Somewhat famously, you can plot the weight activations on a heatmap from a CNN for image processing and obtain a visual representation of the "filter" that the model has learned, which the model (conceptually) slides across the image until it matches something. For example: https://towardsdatascience.com/convolutional-neural-network-...
Many techniques don't look directly at the numbers in the model. Instead, they construct inputs to the model that attempt to trace out its behavior under various constraints. Examples include Partial Dependence, LIME, and SHAP.
Also, those "deep dream" images that were popular a couple years ago are generated by running parts of a deep NN model without running the whole thing.
Yes, more generally this entails looking at the numbers learned by a model not as singular "weights" (which doesn't scale as the model gets larger) but more as a way of approximating a non-parametric representation (i.e. something function-like or perhaps even more general). There's a well-established field of variously "explainable" semi-parametric and non-parametric modeling and statistics, which aims to seamlessly scale to large volumes of data and modeling complexity much like NN's do.
The field of NN explainability tries, but usually there's only handwavy things to be done (because too many weights). This project involved intentionally building very very small networks that can be understood completely.
> One thought that occurred to me after this investigation was the premise that the immense bleeding-edge models of today with billions of parameters might be able to be built using orders of magnitude fewer network resources by using more efficient or custom-designed architectures.
Transformer units themselves are already specialized things. Wikipedia says that GPT-3 is a standard transformer network, so I'm sure there is additional room for specialization. But that's not a new idea either, and it's often the case that after a model is released, smaller versions tend to follow.
Are transformers not already very specialized to the task of learning from sequences of word vectors? I'm sure there is more that can be done with them other than making the input sequences really long, but my point was that LLMs are hardly lacking in design specialized to their purpose.
> Are transformers not already very specialized to the task of learning from sequences of word vectors?
No, you can use transformers for vision, image generation, audio generation/recognition, etc.
They are 'specialized' in that they are for working with sequences of data, but almost everything can be nicely encoded as a sequence. In order to input images, for example, you typically split the image into blocks and then use a CNN to produce a token for each block. Then you concatenate the tokens and feed them into a transformer.
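A minimal sketch of the patch-splitting step (NumPy; names are illustrative, and in ViT-style models the per-patch "token" is then produced by a learned projection, which can be implemented as a small conv):

    import numpy as np

    def image_to_patches(img, patch=16):
        # img: (H, W, C) array; returns (num_patches, patch*patch*C) flattened blocks,
        # ready to be projected into tokens and fed to a transformer
        H, W, C = img.shape
        img = img[:H - H % patch, :W - W % patch]        # crop to a multiple of the patch size
        rows, cols = img.shape[0] // patch, img.shape[1] // patch
        blocks = img.reshape(rows, patch, cols, patch, C).swapaxes(1, 2)
        return blocks.reshape(rows * cols, patch * patch * C)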
That's fair, and I didn't realize people were using them for images that way. I would still argue that they are at least somewhat more specialized than plain fully-connected layers, much like convolutional layers.
It is definitely interesting that we can do so much with a relatively small number of generic "primitive" components in these big models, but I suppose that's part of the point.
Small nit: the analog DAC circuit has MSB and LSB reversed. That is, the LSB goes into the top 128k resistor. If you think of it one bit at a time, the gain of the circuit is -Rf/R, where R is one of the resistors in the ladder. (Rf is the feedback resistor, which is unlabeled and unvalued in the article.) Clearly, you get a bigger output when R is smaller.
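A quick numeric check in Python (Rf and Vref are made-up values, since the article leaves Rf unlabeled; bits[0] is the LSB):

    Rf, Vref = 1e3, 1.0                          # illustrative values
    R = [128e3 / (2 ** i) for i in range(8)]     # bit 0 (LSB) -> 128k ... bit 7 (MSB) -> 1k

    def dac_out(bits):
        # inverting summing amplifier: Vout = -Rf * sum(Vi / Ri); the smaller the
        # resistor, the bigger that bit's contribution, so the MSB gets the 1k
        return -Rf * Vref * sum(b / r for b, r in zip(bits, R))

    print(dac_out([1, 0, 0, 0, 0, 0, 0, 0]))     # LSB alone: ~-0.0078 V
    print(dac_out([0, 0, 0, 0, 0, 0, 0, 1]))     # MSB alone: -1.0 V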
Putting aside the idea that "analog" doesn't have overflow: any physical implementation of that analog signal would have limits in the real world, if nothing else then how much current or voltage your PSU can supply, or how much current can go through your wire.
But that is not even the real issue. The real issue is that it is not analog; it is just "analog" with quotes. The neural network is executed by a digital computer, on digital hardware. So that "analog" is just the underlying float type of the CPU or GPU executing the network. That is still very much digital, just on a different abstraction level.
I don't see how this is surprising. It's cool but come on, you have a network consisting of addition and multiplication and you expect it to not use them? What else is it supposed to do? Memorize every input output pair like a look up table?
> One thought that occurred to me after this investigation was the premise that the immense bleeding-edge models of today with billions of parameters might be able to be built using orders of magnitude fewer network resources by using more efficient or custom-designed architectures.
I'm pretty convinced that something equivalent to GPT could run on consumer hardware today, and the only reason it doesn't is because OpenAI has a vested interest in selling it as a service.
It's the same as Dall-E and Stable Diffusion - Dall-E makes no attempt to run on consumer hardware because it benefits OpenAI to make it so large that you must rely on someone with huge resources (i.e. them) to use it. Then some new research shows that effectively the same thing can be done on a consumer GPU.
I'm aware that there's plenty of other GPT-like models available on Huggingface, but (to my knowledge) there is nothing that reaches the same quality that can run on consumer hardware - yet.