Reverse engineering a neural network's clever solution to binary addition (cprimozic.net)
562 points by Ameo on Jan 16, 2023 | 151 comments



The trick of performing binary addition by using analog voltages was used in the IAS family of computers (1952), designed by John von Neumann. It implemented a full adder by converting two input bits and a carry in bit into voltages that were summed. Vacuum tubes converted the analog voltage back into bits by using a threshold to generate the carry-out and more complex thresholds to generate the sum-out bit.
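A toy Python sketch of that thresholding scheme (purely illustrative, not the actual IAS circuitry): the three input bits are summed as if they were voltages, and the outputs are recovered by thresholding the total.

    def analog_full_adder(a, b, cin):
        v = a + b + cin                      # "analog" summing node: 0..3
        carry_out = 1 if v >= 2 else 0       # simple threshold at 2
        # the "more complex threshold": sum bit is high when v is 1 or 3
        sum_out = 1 if (v >= 3 or (1 <= v < 2)) else 0
        return sum_out, carry_out

    # sanity check against ordinary binary addition
    for a in (0, 1):
        for b in (0, 1):
            for cin in (0, 1):
                s, c = analog_full_adder(a, b, cin)
                assert 2 * c + s == a + b + cin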


> designed by John von Neumann.

I can strongly recommend an excellent biography:

The Man from the Future: The Visionary Life of John von Neumann by Ananyo Bhattacharya

I knew of von Neumann because his name shows up many many times when studying computers. But I had no idea he had several equally monumental bodies of work in such a wide range of subjects.

It’s that rare biography that helps you understand the history of multiple disciplines: quantum mechanics, game theory, computer science and more.


Nifty, makes one wonder if logarithmic or sigmoid functions for ML could be done using this method. Especially as we approach the node size limit, perhaps dealing with fuzzy analog will become more valuable.


There are a few startups making analog ML compute. Mythic & Aspinity for example

https://www.eetimes.com/aspinity-puts-neural-networks-back-t...


Analog PID controllers: https://www.youtube.com/watch?v=Ps9iD738rUg Soviet IR-tracking missile heads too. Stumbled upon one on ebay once; was surprised the control system was analog. Not sure which one it was, and I'm very surprised to find a Raytheon AIM-9 Sidewinder missile kit there: https://www.ebay.com/itm/273992271867


Veritasium has a video featuring Mythic - Future Computers Will Be Radically Different (Analog Computing) - https://youtu.be/GVsUOuSjvcg


From "Faraday and Babbage: Semiconductors and Computing in 1833" https://news.ycombinator.com/item?id=32888210 and then "Qubit: Quantum register: Qudits and qutrits" https://news.ycombinator.com/item?id=31983110:

>>> The following is an incomplete list of physical implementations of qubits, and the choices of basis are by convention only: [...] Qubit#Physical_implementations: https://en.wikipedia.org/wiki/Qubit#Physical_implementations

> - note the "electrons" row of the table

According to this table on Wikipedia, it's possible to use electron charge (instead of 'spin') to do quantum logic with qubits.

How is doing quantum logical computations with electron charge different from what e.g. Cirq or Tequila do (optionally with simulated noise to simulate the Quantum Computer Engineering hardware)?

FWIU, analog and digital component qualities are not within sufficient tolerance to do precise analog computation? (Though that's probably debatable for certain applications at least, but not for general-purpose computing architectures?) That is, while you can build adders out of voltage potentials quantified more finely than 0 or 1, you probably shouldn't without sufficiently tight component tolerances, because of noise and thus error.

IMHO, Turing Tumble and Spintronics are neat analog computer games.

(Are Qubits, by Church-Turing-Deutsch, sufficient to: 1) simulate arbitrary quantum physical systems; or 2) run quantum logical simulations as circuits with low error due to high coherence? https://en.wikipedia.org/wiki/Church%E2%80%93Turing%E2%80%93... )

>> See also: "Quantum logic gate" https://en.wikipedia.org/wiki/Quantum_logic_gate

Analog computers > Electronic analog computers aren't Electronic digital computers: https://en.wikipedia.org/wiki/Analog_computer#Electronic_ana...


There's a limit to this process. At a certain number of bits, the sum of voltages will be too high to represent safely. Yet adding more bits on a digital computer is trivial.


That's awesome - the network's solution is essentially to convert the input to analog, perform the actual addition in analog, and then convert that back to digital. And the first two parts of that all happened in the input weights, no less.
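A rough numpy sketch of that pipeline (not the author's actual network): the binary-weighted dot product in the "input layer" is the digital-to-analog step, ordinary addition does the work, and the decode back to bits stands in for the part the network actually had to learn.

    import numpy as np

    dac_weights = 2 ** np.arange(8)              # [1, 2, 4, ..., 128]

    def add_via_analog(a_bits, b_bits):
        a_analog = np.dot(a_bits, dac_weights)   # digital -> "analog"
        b_analog = np.dot(b_bits, dac_weights)
        total = int(a_analog + b_analog)         # the actual addition
        # The hard part the network learned: "analog" -> digital.
        # Here we cheat with exact bit extraction; the trained net
        # approximates this with its saturating activations.
        return np.array([(total >> k) & 1 for k in range(9)])   # 9 bits incl. carry

    a = np.array([1, 0, 1, 1, 0, 0, 1, 0])       # LSB first
    b = np.array([0, 1, 1, 0, 1, 0, 0, 1])
    out = add_via_analog(a, b)
    assert np.dot(out, 2 ** np.arange(9)) == np.dot(a, dac_weights) + np.dot(b, dac_weights)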


An 8-bit adder is too small and allows such a 'hack'. He should try training a 32- or 64-bit adder, decreasing the weight precision to bfloat16, and introducing dropout regularization (or another kind of noise) to get more 'interesting' results.

Addendum: another interesting variation to try is a small transformer network, feeding the bits in sequentially as symbols. That kind of architecture could handle bignum-sized integers.


It's converted the input into its native language, in a way.


A 32-bit adder is just 4 8-bit adders with carry connected. I don't see why it'd be significantly more difficult.
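In code, the composition is just chaining the carry; a minimal sketch:

    def add8(a, b, carry_in=0):
        total = a + b + carry_in
        return total & 0xFF, total >> 8          # (8-bit sum, carry out)

    def add32(a, b):
        result, carry = 0, 0
        for i in range(4):                       # four 8-bit adds, carries chained
            s, carry = add8((a >> 8 * i) & 0xFF, (b >> 8 * i) & 0xFF, carry)
            result |= s << 8 * i
        return result

    assert add32(0xFFFFFFFF, 1) == 0             # wraps like a real 32-bit adder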


For a naive adder, sure. Actual adder circuits use carry lookahead. The basic idea is to “parallelize” carry computation as much as possible. As with anything else, there is a cost: area and power.

Edit: This wiki page has a good explanation of the concept: https://en.wikipedia.org/wiki/Carry-lookahead_adder
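A 4-bit sketch of the generate/propagate idea in Python (illustrative only): every carry is written directly in terms of g and p, so nothing has to ripple.

    def cla_add4(a_bits, b_bits):                     # bits are LSB-first lists
        g = [a & b for a, b in zip(a_bits, b_bits)]   # generate
        p = [a ^ b for a, b in zip(a_bits, b_bits)]   # propagate
        # c[i+1] = g[i] | (p[i] & c[i]); expanding with c0 = 0 removes the
        # serial dependency -- each carry is a flat AND/OR expression:
        c1 = g[0]
        c2 = g[1] | (p[1] & g[0])
        c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
        c4 = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
        c = [0, c1, c2, c3]
        return [p[i] ^ c[i] for i in range(4)], c4    # sum bits, carry out

    for a in range(16):
        for b in range(16):
            s, cout = cla_add4([a >> i & 1 for i in range(4)], [b >> i & 1 for i in range(4)])
            assert sum(bit << i for i, bit in enumerate(s)) + (cout << 4) == a + b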


It's not harder computationally but finding that solution with SGD might be hard. I'd be really interested in the results.


In my EE class that discussed adders, we learned this wasn't exactly true. A lot of adders compute the carry bit separately because it takes time to propagate if you actually capture it as output.


Yes, but the adder that the NN came up with has no carry.


It's better to say the NN "converged to" something than "came up with" something.


Why?


There is no intelligence or reasoning involved in stochastic gradient descent


Carry is just the 9th bit?


Not really, to compose adders you need specific lines for carry.


Yeah, the current solution is similar to overfitting; this won't generalize to harder math where the operation doesn't correspond to the network's activation function.


Is this true? If it can add, it can probably subtract? If it can add and subtract it may be able to multiply (repeated addition) and divide? If it can multiply it can do exponents?

I don’t know, but cannot jump to your conclusion without much more domain knowledge.


> If it can add and subtract it may be able to multiply (repeated addition) and divide? If it can multiply it can do exponents?

It was just a simple feed-forward network. It can't do arbitrary amounts of repeated addition (nor repeat any other operation arbitrarily often).


None of these require arbitrary amounts of repeated addition though. E.g. multiplying two 8 bit numbers requires at most 7 additions.


Yeah I did not mean a loop, repeated in the network.


Are you doing the equivalent of repeated squaring here?

Otherwise, you'd need up to 255 (or so) additions to multiply two 8 bit numbers, I think?


An 8x8 bit multiplication only requires 7 additions, either in parallel or sequentially. Remember long-form multiplication? [1] It's the same principle. Of course, high-speed digital multiplication circuits use a much more optimized, much more complex implementation.

[1] https://en.wikipedia.org/wiki/Multiplication_algorithm#Examp...
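For reference, the shift-and-add version in Python: eight shifted partial products, combined with at most seven additions.

    def mul8(a, b):
        result = 0
        for i in range(8):
            if (b >> i) & 1:
                result += a << i     # one shifted copy of a per set bit of b
        return result

    assert mul8(200, 37) == 200 * 37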


OK. It sounded to me like the comment I originally replied to (https://news.ycombinator.com/item?id=34400296) suggested to implement multiplication as naive repeated addition.


Repeated additions may be hard, but computing a*b as exp(log(a)+log(b)) should be learnable.


A sequential one-bit adder, like in pocket calculators or the earliest computers with delay-line memory, could likely be invented in the process.


Sometimes when doing arithmetic with sed I use the unary representation, but usually it's better to use lookup tables: https://en.wikipedia.org/wiki/IBM_1620#Transferred_to_San_Jo...


If you want to have some fun, implement code to do arbitrary precision arithmetic the way the CADET did it. It's remarkably simple and reliable.


So that's how savants do it ...


There is one savant who can do crazy math but is otherwise normal, after an epileptic fit.

His system is base 10000: he can add any two numbers below 5000 and come up with a single-digit response in one loop, then convert to base 10 for the rest of us. Each digit in his base-10000 system has a different visual representation, like we have 0-9.

It's integer accurate so I don't think it uses sine wave approximations. Guy can remember pi for 24 hours.

I don't think the boffins or he himself got close to understanding what is going on in his head.


I was interested in reading more about this, but the only google result is your comment. Could you remember a name or any other details that might help find the story?


Does this relate to the addition theorem of Fourier transforms? The weights of his NN not only looked like sine waves but were crudely orthogonal.

Maybe in the future we will think of the word "digital" as archaic and "analog" will be the moniker for advanced tech.


Can someone please explain? I don't get the "converted the input to analog" part: converted how, and where?


Impressive. Am I right that what is really going on here is that the network is implementing the "+" operator by actually having the CPU of the host carrying out the "+" when executing the network?

I.e., the network converts the binary input and output to floating point, and then it is the CPU of the host the network is running on that really does the addition in floating point.

So usually one does a bunch of FLOPs to get "emerging" behaviour that isn't doing arithmetic. But in this case, instead the network does a bunch of transforms in and out so that in a critical point, the addition executed in the network runner is used for exactly its original purpose: Addition.

And I guess saying it is "analog" is a good analogy for this.


I'd say it's more that the network "discovered" that addition is something its own structure does "naturally", and reduced the problem into decoding binary input, and encoding binary output. In particular, I don't think it's specifically about the CPU having an addition operator.

My intuition is as follows: if I were to train this network with pencil, paper and a slide rule, I'd expect the same result. Addition (or maybe rather integration) is embedded in the abstract structure of a neural network as a computation artifact.

Sure, specifics of the substrate may "leak through" - e.g. in the pen-and-paper case, were I to round everything to the first decimal place, or in the computer case, were the network implemented with 4-bit floats, I'd expect it not to converge because of loss of precision range (or maybe to figure out the logic-gate solution). But if the substrate can execute the mathematical model of a neural network to sufficient precision, I'd expect the same result to occur regardless of whether the network is run on paper, on a CPU, on a fluid-based analog computer, or with a beam of light and a clever arrangement of semi-transparent plastic plates.


You know how the basic node does a weighted sum of inputs, and feeds the result to a nonlinear function for output? This network exploited the addition operation implicit in that first part, the weighted sum.


Fascinating! Is it really exploitation? It seems to me that the node should optimize to the minimal inputs (weights) in the happy path. I.e. it's optimal for all non-operand inputs to be weighted to zero.

I'm curious how reproducible this is. I'm also curious how networks form abstractions and if we can verify those abstractions in isolation (or have other networks verify them, like this) then the building blocks become far less opaque.


That's an accurate enough description of what's going on here, yeah, but I'm very curious if this could be implemented in hardware. What we've got is an inexact not-quite-binary adder, but one that's potentially smaller and faster than the regular binary ones.


Why it works here is that the "analog" representation is not influenced by noise, because it's simulated on digital hardware.

On the other hand, the reason we use digital hardware is precisely that it's robust against the noise present in real analog circuits.


So in other words, this is analog computation on digital hardware on an analog substrate: the analog-to-digital step eliminates the noise of our physical reality, and the subsequent digital-to-analog step reintroduces a certain flexibility of design thinking.

I wonder, is there ever a case where analog-on-digital is better to work with as an abstraction layer, or is it always easier to work with digital signals directly?


Yes, at least I think so. That's why we so often try to approximate analog with pseudo-continuous data patterns like samples and floats. Even going beyond electronic computing, we invented calculus due to similar limitations with discrete data in mathematics.

Of course, these are all just approximations of analog. Much like emulation, there are limitations, penalties, and inaccuracies that will inevitably kneecap applications when compared to a native implementation running on bare metal (though, it seems that humans lack a proper math coprocessor, so calculus could be considered no more "native" than traditional algebra).

We do sometimes see specialized hardware emerge (DSPs in the case of audio signal processing, FPUs in the case of floats, NPUs/TPUs in the case of neural networks), but these are almost always highly optimized for operations specific to analog-like data patterns (e.g. Fourier transforms) rather than true analog operations. This is probably because scalable/reliable/fast analog memory remains an unsolved problem (semiconductors are simply too useful).


Sorry but those signals aren't analog. Floating point numbers are still digital.


Right, but there exists software that would be robust to noise, e.g. ...neural networks. I'm curious if not-quite-perfect adders might be beneficial in some cases.


It can be implemented in hardware but the implementation would be more complex than a digital adder based on logical gates.


Analog addition is actually really easy if you can tolerate noise and heat and can convert to/from representing the signal as an analog current. Using Kirchhoff's Current Law, addition of N currents from current sources is achieved by joining those N wires to a shared exit wire which contains the summed current.


Addition is easy, DAC is relatively easy, but ADC is not.


Quad-level flash memory cells (https://en.wikipedia.org/wiki/Multi-level_cell#Quad-level_ce...) can store 4 bits for months.

Because of that, I would (hand-wavingly) expect it can be made to work for a three-bit adder (with four bits of output).


There is a valid use case for when designers simply need "fuzzy logic" or "fuzzy math" which doesn't need exact results and can tolerate inaccurate ones. If using fuzzy hardware saves resources (time, energy, die space, etc.) then it might be a valid tradeoff to use inaccurate fuzzy hardware instead of exact hardware.

For instance, instead of evaluating an exponential function in digital logic, it might be quicker and more energy-efficient to just evaluate it using a diode, if the value is available as an analog voltage and if some noise is tolerable, as is done with analog computers.
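The reason the diode trick works is the Shockley diode equation: for forward voltages well above the thermal voltage, the current is essentially exponential in the applied voltage.

    I = I_S \left( e^{V/(n V_T)} - 1 \right) \approx I_S \, e^{V/(n V_T)} \quad \text{for } V \gg n V_T

(Here I_S is the saturation current, n the ideality factor, and V_T ≈ 26 mV the thermal voltage.)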


These stories remind me of a story from Discover Magazine https://www.discovermagazine.com/technology/evolving-a-consc... A researcher was using a process to "evolve" an FPGA, and the result was a circuit that was super efficient but worked in unexpected ways: part of the circuit seemed unconnected to the rest, yet if it was removed the whole thing stopped working, and the circuit only worked at a specific temperature.


Yeah, IIRC it grew an antenna that was utilizing the clock in the computer it was sitting next to.


You're referring to [this](https://www.researchgate.net/publication/3949367_The_evolved...)

Which was also really cool!

> As with Thompson's FPGA exploiting some subtle physical properties of the device, Layzell found that evolved circuits could rely on external factors. For example, whilst trying to evolve an oscillator Bird and Layzell discovered that evolution was using part of the circuit for a radio antenna, and picking up emissions from the environment [18]. Layzell also found that evolved circuits were sensitive to whether or not a soldering iron was plugged in (not even switched on) in another part of the room[19]. ...

However, OP was actually referring to an experiment by Dr Adrian Thompson which was different but also sort of similar. The FPGA evolved by Thompson ended up depending on parts of the circuit that were disconnected from the main circuit but still affected its operation. It probably relied on electromagnetic properties, so was sort of a radio, but it did not rely on the clock of a nearby computer.

[damninteresting.com did a really interesting writeup about this](https://www.damninteresting.com/on-the-origin-of-circuits/)

Both were really cool, unexpected behaviors of evolved hardware systems.


That's right! Thanks for tracking it down...


Primary source (or at least one of several published versions): https://cse-robotics.engr.tamu.edu/dshell/cs625/Thompson96Ev...


Ah, the good old radio component you can program into FPGAs.


Very easy to design a radio component into electronics, much harder to design it out.


I’ve had all kinds of cheap electronics that came with unadvertised radio features for free.


It's sad that those pesky regulators from the FCC make it so hard to get devices with unintended radio functionality these days!


I know, if people really cared about devices rejecting interference the free market would sort that out, amirite?


The free market doesn't allow you to pour sewage onto your neighbor's property, either.

Think "property rights" are required for a free market.


OMG you mean the libertarians are mistaken? Sacré bleu!


I don't know what you heard, but you're mistaken about what free markets are.


My fillings tune in Radio Moscow. Keeps me awake at night.


I mistakenly made two AM radios in my electronics classes, from trying to make an amplifier and a PLL. The FM radio was mistakenly made while trying to make an AM radio. :-|


I can't believe someone dropped a link to this story. I remember reading this and feeling like it broadened my sense of what evolutionary processes can produce.


Reading that article made me start playing with evolutionary algorithms and start to think they were the future, as opposed to neural nets. Oops!


oh weird, I first heard about a story like this in https://www.damninteresting.com/on-the-origin-of-circuits/, I think perhaps they're both about the same lab.


Same guy: Adrian Thompson


I believe I remember reading about that in the book "The Emperor's New Mind" by Roger Penrose back in the late 90s. It was something that has stuck with me ever since, even though I didn't follow up on the AI route.


Overfitting IRL


It's interesting to see these unconventional solutions. Genetic algorithms evolving antenna designs produce similarly illogical but very efficient results. Humans are drawn to aesthetics; robots don't have such limitations.


The essay linked from the article is interesting: http://www.incompleteideas.net/IncIdeas/BitterLesson.html


One thing that the essay doesn't consider is the importance of efficiency in computation. Efficiency is important because in practice it is often the factor which most limits the scalability of a computational system.

The human brain only consumes around 20W [1], but for numerical calculations it is massively outclassed by an ARM chip consuming a tenth of that. Conversely, digital models of neural networks need a huge power budget to get anywhere close to a brain; this estimate [2] puts training GPT-3 at about a TWh, which is about six million years' worth of power for a single brain.

[1] https://www.pnas.org/doi/10.1073/pnas.2107022118

[2] https://www.numenta.com/blog/2022/05/24/ai-is-harming-our-pl....


Thanks for those numbers! They are frankly scary.

Six million years of power for a single brain is one year of power for six million brains. The knowledge that ChatGPT contains is far greater than the knowledge a random sample of six million people have produced in a year.

With that said, I think we should apply ML and the results of the bitter lesson to things which are more intractable than searching the web or playing Go. Have you ever talked to a systems biologist? Ask them about how anything works, and they'll start with "oh, it's so complicated. You have X, and then Y, and then Z, and nobody knows about W. And then how they work together? Madness!"


> Six million years of power for a single brain is one year of power for six million brains. The knowledge that ChatGPT contains is far greater than the knowledge a random sample of six million people have produced in a year.

On the other hand, I don't think it would be larger than that of six million people selected more carefully (though I suppose you'd have to include the cost of the selection process in the tally). I also imagine it wouldn't be larger than that of six thousand people specifically raised and taught in a coordinated fashion to fulfill this role.

On the other other hand, a human brain does a lot more than learning language, ideas and conversion between one and the other. It also, simultaneously, learns a lot of video and audio processing, not to mention smell, proprioception, touch (including pain and temperature), and... a bunch of other stuff (I was actually surprised by the size of the "human" part of the Wikipedia infobox here: https://en.wikipedia.org/wiki/Template:Sensation_and_percept...). So it's probably hard to compare to specialized NN models until we learn to better classify and isolate how biological brains learn and encode all the various things they do.


> On the other hand, [...] On the other other hand,

See: "on the gripping hand" (-:


> a TWh, which is about six million years' of power

i'm not a physics guy but wanted to check this - a Watt is a per second measure, and a Terawatt is 1 trillion watts, so 1 TWh is 50 billion seconds of 20 Watts, which is 1585 years of power for a single brain, not 6 million.

i'm sure i got this wrong as i'm not a physics guy but where did i go wrong here?

a more neutral article (that doesn't have a clear "AI is harming our planet" agenda) estimates closer to 1404 MWh to train GPT-3: https://blog.scaleway.com/doing-ai-without-breaking-the-bank... i don't know either way but i'd like as good an estimate as possible since this seems an important number.


Watt-hours are a unit of energy, not a rate; note that the TWh has "hour" affixed to the end. This is one terawatt sustained for one hour, not one second, so your numbers are off by a factor of 3600.

1TWh / 20 Watt brain = 50,000,000,000 (50 Billion) Hours.

50 Billion Hours / (24h * 365.25) = 5,703,855.8 Years


Thanks for explaining the calculation! There is a huge error though, which is that I mis-read the units in the second link I posted. The actual power estimate for GPT-3 is more like 1 GWh (not 1TWh), so about 6000 years and not 6 million...!


lol, this is more like the other estimate’s ballpark… 1 TWh sounded way off and i was like “wait i gotta check this”


Yeah, the other thing that I should have noticed was the completely unreasonable cost of 1 TWh of energy. If 1 kWh cost about $0.15 at the time they trained the model, 1TWh would have been like $150M. Probably twice that after accounting for cooling. Doh.


thank you both.. was googling this and found some stackexchange-type answers and still got confused. apparently it is quite common to see the "per time" part of the definition of a watt and get it mixed up with watt-hours. i think i got it now, thank you


It is. The main culprit, arguably, is the ubiquitous use of kilowatt hours as unit of energy, particularly of electricity in context of the power company billing you for it - kilowatt hours are what you see on the power meter and on your power bill.


If you’re going to reinterpret watts on one side you have to do it on both sides. So, by your thinking (which is not wrong), it's 1 h × 1 TJ/s vs 20 J/s. Drop the J/s from both sides and you get 1/20th of a trillion hours.


A watt is not a per-second measure; watt-hours are the energy measurement, with W being the power. TWh = 10^12 Wh, which means a trillion watts sustained for one hour.

10^12 Wh / 20 W (power of a brain) / 24 (hours in a day) / 365 (days in a year) ≈ 5,707,762 years.


One watt is one joule per second. What exactly do you mean by "a per-second measure"?


The seconds are not in the denominator; OP was treating TWh / W as if W were (1/3600) · Wh, which is not the case.


I can't edit my post any more, but I'm a moron: the article says about a GWh, not a TWh. So the calculation is out by a factor of 1000.


In the years since this came out I have shifted my opinion from 100% agreement to kind of the opposite in a lot of cases, a bitter lesson from using AI to solve complex end-to-end tasks. If you want to build a system that will get the best score on a test set - he's right, get all the data you possibly can, try to make your model as e2e as possible. This often has the advantage of being the easiest approach.

The problem is: your input dataset definitely has biases that you don't know about, and the model is going to learn spurious things that you don't want it to. It can make some indefensible decisions with high confidence and you may not know why. In some domains your model may be basically useless, and you won't know this until it happens. This often isn't good enough in industry.

To stop batshit things coming out of your system, you may have to do the opposite - use domain knowledge to break the problem the model is trying to solve down into steps you can reason about. Use this to improve your data. Use this to stop the system from doing anything you know makes no sense. This is really hard and time consuming, but IMO complex e2e models are something to be wary of.


Your understanding that problem solving has to do with "steps" is the correct one in my opinion.

Also, with data driven approaches, the model isn't necessarily learned in any meaningful way. If you train with certain inputs to get certain realistic looking outputs, you can build sophisticated parrots or chameleons. That's what GPT and stable diffusion are: compressed knowledge bases to get an output without a knowledge model. (No, language models are not knowledge models).

Thinking, rational or otherwise, requires causal steps. Since none of these data driven approaches have even fuzzy causal models, they require memorizing infinite universes to see if search can find a particular universe that's seen this before. That's why they're not intelligent and never will be. An intelligent animal only needs one universe and limited experiences to solve a problem because it knows the latent structure of the problem and can generate approaches. Intelligent entities, unlike these autistic Rain Man automatons, do not need to memorize every book in the library to multiply two large numbers together.


Yes I just read it too. It appears to have been discussed here multiple times:

https://hn.algolia.com/?q=bitter+lesson

I'd say it has aged well and there are probably lots of new takes on it based on some of the achievements of the last 6 months.

At the same time, I don't agree with it. To simplify, the opposite of "don't over-optimize" isn't "it's too complicated to understand so never try". It is thought provoking though


I think there is no doubt that there must be more efficient model architectures out there; take, for example, the sample efficiency of GPT-3:

> If you think about what a human, a human probably in a human’s lifetime, 70 years, processes probably about a half a billion words, maybe a billion, let’s say a billion. So when you think about it, GPT-3 has been trained on 57 billion times the number of words that a human in his or her lifetime will ever perceive.[0]

0. https://hai.stanford.edu/news/gpt-3-intelligent-directors-co...


I cannot wrap my head around that. I listened to the audio to check it wasn't a transcription error and don't think it is.

He is claiming GPT3 was trained on 57 billion billion words. The training dataset is something like 500B tokens and not all of that is used (common crawl is processed less than once), and I'm damn near certain that it wasn't trained for a hundred million epochs. Their original paper says the largest model was trained on 300B tokens [0]

Assuming a token is a word, since we're going for orders of magnitude, you're actually looking at a few hundred times more text. The point kind of stands: it's more, but not billions of times more.

I wouldn't be surprised if I'm wrong here because they seem to be an expert but this didn't pass the sniff test and looking into it doesn't support what they're saying to me.

[0] https://arxiv.org/pdf/2005.14165.pdf appendix D


I didn't even catch that. Surely they meant 57 billion words.

Some words are broken up into several tokens, which might explain the 300B tokens.


It's at about 7 minutes into the video; they really do say it several times in a few ways. He starts by saying it's trained on 570 billion megabytes, which is probably where this confusion starts. Looking again at the paper, Common Crawl after filtering is 570GB, or 570 billion bytes. So he makes two main mistakes: one is straight-up multiplying by another million, the other is assuming one byte is equivalent to one word. Then a bit more, because less than half of it is used. That probably puts it out by a factor of about ten million or more.

300B is then the "training budget" in a sense, not every dataset is used in its entirety, some are processed more than once, but each of the GPT3 sizes were trained on 300B tokens.


Humans are not trained on words.


What do you mean? That we are not trained ONLY on words? Or that the training we receive on words is not the same as the training of a NN? Or something else?


We are not trained only on words.

In fact, we are not trained on words at all. We are trained on images, sounds, tactile input, and a few other types of sources.


Seems odd to say we're trained on sounds but not words, especially when many of the sounds we hear while we are developing are words being spoken.


Recent work on NNs shows that applying Kolmogorov complexity optimization leads to a network that solves binary addition with a carry-over counter and 100% provable accuracy (see the Results section):

https://arxiv.org/abs/2111.00600


Related: an independent reverse engineering of a network that "grokked" modular addition:

https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mec...

One interesting thing is that this network similarly does addition using a Fourier transform.


Not too surprising. Each layer can multiply each input by a set of weights, and add them up. That's all you need, in order to convert from base 2 to base 10. In base 2, we calculate the value of a set of digits by multiplying each by 2^0, 2^1, ..., 2^n. Each weight is twice the previous. The activation function is just making it easier to center those factors on zero to reduce the magnitude of the regularization term. I am guessing you could probably get the job done with just two neurons in the first layer, which output the real value of each input, one neuron in the second layer to do the addition (weights 1, 1), and 8 neurons in the final layer to convert back to binary digits.


While your analysis sounds good, I don’t think you mean base 10, you probably mean a “single float”?


A couple questions:

1. How much of this outcome is due to the unusual (pseudo) periodic activation function? Seems like a lot of the DAC-like behavior is coming from the periodicity of the first layer’s output, which seems to be due to the unique activation function.

2. Would the behavior of the network change if the binary strings were encoded differently? The author encodes them as 1D arrays with 1 corresponding to 1 and 0 corresponding to -1, which is an unusual way of doing things. What if the author encoded them as literal binary arrays (i.e. 1->1, 0->0)? What about one-hot arrays (i.e. 2D arrays with 1->[1, 0] and 0->[0, 1]), which is the most common way to encode categorical data?


For 1., the author used Ameo (the weird activation function) in the first layer and tanh for the others, and later notes:

"While playing around with this setup, I tried re-training the network with the activation function for the first layer replaced with sin(x) and it ends up working pretty much the same way. Interestingly, the weights learned in that case are fractions of π rather than 1."

By the looks of it, any activation function that maps a positive and negative range should work. Haven't tested that myself. The 1 vs π is likely due to the peaks of the functions, Ameo at 1 and sine at π/2.

Regardless, it's not Ameo.


I think that the activation function definitely is important for this particular case, but it's not actually periodic; it saturates (with a configurable amount of leakiness) on both ends. The periodic behavior happens due to patterns in the individual bits as you count up.

As for the encoding, I think it's a pretty normal way to encode binary inputs like this. Having the values be -1 and 1 is pretty common since it makes the data centered at 0 rather than 0.5 which can lead to better training results.


1. The author has prior blog posts talking about this activation function, and apparently it does help to learn binary logic tasks.

2. I doubt this matters at all here. For some architectures having inputs be 0 on average is useful, so the author probably just picked it as the default choice.


Reminded me of the 0xFFFF0000 0xFF00FF00 0xF0F0F0F0 0xCCCCCCCC 0xAAAAAAAA trick, the use of which I don't currently recall...

edit: several, but probably not very relevant https://graphics.stanford.edu/~seander/bithacks.html


When you expect your network to learn a binary adder and it learns FFT addition instead. :)

(For an ANN FFT is more natural as it's a projection algorithm.)


Awesome reverse engineering project. Next up: reverse engineer a NN's solution to Fourier Transform!


The Fourier transform is also linear, so the same solution should work. No clue if an NN would find it though.


What's an analog implementation of a Fourier transform look like? It sounds interesting!


Depends on what kind of “analog” you want.

In an NN context, given that you already have "transform with a matrix" as a primitive, probably something very much like sticking a https://en.wikipedia.org/wiki/DFT_matrix somewhere. (You are already extremely familiar with the 2-input DFT, for example: it's the (x, y) ↦ (x+y, x−y) map.)
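A minimal numpy illustration of that "transform with a matrix" point (illustrative only): the DFT really is just one fixed matrix multiply, which is why a linear layer could in principle represent it directly.

    import numpy as np

    def dft_matrix(n):
        j, k = np.meshgrid(np.arange(n), np.arange(n))
        return np.exp(-2j * np.pi * j * k / n)   # W[k, j] = exp(-2*pi*i*j*k/n)

    x = np.random.randn(8)
    assert np.allclose(dft_matrix(8) @ x, np.fft.fft(x))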

If you want a physical implementation of a Fourier transform, it gets a little more fun. A sibling comment already mentioned one possibility. Another is that far-field (i.e. long-distance; “Fraunhofer”) diffraction of coherent light on a semi-transparent planar screen gives you the Fourier transform of the transmissivity (i.e. transparency) of that screen[1]. That’s extremely neat and covers all textbook examples of diffraction (e.g. a finite-width slit gives a sinc for the usual reasons), but probably impractical to mention in an introductory course because the derivation is to get a gnarly general formula then apply the far-field approximation to it.

A related application is any time the “reciprocal lattice” is mentioned in solid-state physics; e.g. in X-ray crystallography, what you see on the CRT screen in the simplest case once the X-rays have passed through the sample is the (continuous) Fourier transform of (a bunch of Dirac deltas stuck at each center of) its crystal lattice[2], and that’s because it’s basically the same thing as the Fraunhofer diffraction in the previous paragraph.

Of course, the mammalian inner ear is also a spectral analyzer[3].

[1] https://en.wikipedia.org/wiki/Fourier_optics#The_far_field_a...

[2] https://en.wikipedia.org/wiki/Laue_equations

[3] https://en.wikipedia.org/wiki/Basilar_membrane#Frequency_dis...


> Another is that far-field (i.e. long-distance; “Fraunhofer”) diffraction of coherent light on a semi-transparent planar screen gives you the Fourier transform of the transmissivity (i.e. transparency) of that screen

Ooh, yes, I’d forgotten that! And I’ve actually done this experiment myself — it works impressively well when you set it up right. I even recall being able to create filters (low-pass, high-pass etc.) simply by blocking the appropriate part of the light beam and reconstituting the final image using another lens. Should have mentioned it in my comment…


It looks like Albert Michelson's Harmonic Analyzer: see https://engineerguy.com/fourier/ (and don’t miss the accompanying video series! It’s pretty cool.)


A prism and a CCD sensor. Probably a bit more fuckery to get phase info out of that tho.


In a similar vein, in the 1980s a "superoptimizer" was built for optimizing snippets of assembler by doing an exhaustive search of all combinations of instructions. It found several delightful and unexpected optimizations, which quickly became incorporated into compiler code generators.
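A toy illustration of the exhaustive-search idea (not Massalin's actual superoptimizer, which searched sequences of real machine instructions): here a made-up three-instruction set is searched for the shortest program matching a target function on a set of test inputs.

    from itertools import product

    OPS = {"inc": lambda x: x + 1, "dec": lambda x: x - 1, "dbl": lambda x: x * 2}
    target = lambda x: 4 * x + 4
    tests = range(-10, 11)

    def superoptimize(max_len=4):
        for length in range(1, max_len + 1):          # shortest programs first
            for prog in product(OPS, repeat=length):
                def run(x, prog=prog):
                    for op in prog:
                        x = OPS[op](x)
                    return x
                if all(run(t) == target(t) for t in tests):
                    return prog
        return None

    print(superoptimize())    # ('inc', 'dbl', 'dbl'): (x+1)*4 == 4x+4 in 3 ops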


you can mention Massalin by name. she's pretty brilliant, idk where she is now


I don't recall who wrote the paper, thanks for the tip, enabling me to find the paper:

H. Massalin, “Superoptimizer - A Look at the Smallest Program,” ACM SIGARCH Comput. Archit. News, pp. 122–126, 1987.

https://web.stanford.edu/class/cs343/resources/superoptimize...

The H is for "Henry".



So important. Here’s a prediction. Work like this on model explainability will continue and eventually will reveal patterns in the input that reliably produce features in the resulting models. That knowledge will tell us about nature, in addition to ML.


Interesting idea to reverse engineer the network. Are there other sources that have done this?


Maybe not exactly what you had in mind, but there is a lot of literature in general on trying to extract interpretations from neural network models, and to a lesser extent from other complicated nonlinear models like gradient-boosted trees.

Somewhat famously, you can plot the weight activations on a heatmap from a CNN for image processing and obtain a visual representation of the "filter" that the model has learned, which the model (conceptually) slides across the image until it matches something. For example: https://towardsdatascience.com/convolutional-neural-network-...

Many techniques don't look directly at the numbers in the model. Instead, they construct inputs to the model that attempt to trace out its behavior under various constraints. Examples include Partial Dependence, LIME, and SHAP.

Also, those "deep dream" images that were popular a couple years ago are generated by running parts of a deep NN model without running the whole thing.
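A minimal matplotlib sketch of the filter-heatmap idea; the weights here are a random stand-in with the shape a first conv layer would have, whereas in practice you'd pull the tensor from a trained model.

    import numpy as np
    import matplotlib.pyplot as plt

    filters = np.random.randn(16, 3, 5, 5)   # stand-in: (out_ch, in_ch, k, k)

    fig, axes = plt.subplots(4, 4, figsize=(6, 6))
    for ax, f in zip(axes.flat, filters):
        ax.imshow(f.mean(axis=0), cmap="coolwarm")   # average over input channels
        ax.axis("off")
    plt.show()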


Yes, more generally this entails looking at the numbers learned by a model not as singular "weights" (which doesn't scale as the model gets larger) but more as a way of approximating a non-parametric representation (i.e. something function-like or perhaps even more general). There's a well-established field of variously "explainable" semi-parametric and non-parametric modeling and statistics, which aims to seamlessly scale to large volumes of data and modeling complexity much like NN's do.


The field of NN explainability tries, but usually there's only handwavy things to be done (because too many weights). This project involved intentionally building very very small networks that can be understood completely.


I liked this article a lot, but

> One thought that occurred to me after this investigation was the premise that the immense bleeding-edge models of today with billions of parameters might be able to be built using orders of magnitude fewer network resources by using more efficient or custom-designed architectures.

Transformer units themselves are already specialized things. Wikipedia says that GPT-3 is a standard transformer network, so I'm sure there is additional room for specialization. But that's not a new idea either, and it's often the case that after a model is released, smaller versions tend to follow.


Transformers are specialized to run on our hardware, whereas the article is suggesting architectures which are specialized for a specific task.


Are transformers not already very specialized to the task of learning from sequences of word vectors? I'm sure there is more that can be done with them other than making the input sequences really long, but my point was that LLMs are hardly lacking in design specialized to their purpose.


> Are transformers not already very specialized to the task of learning from sequences of word vectors?

No, you can use transformers for vision, image generation, audio generation/recognition, etc.

They are 'specialized' in that they are for working with sequences of data, but almost everything can be nicely encoded as a sequence. In order to input images, for example, you typically split the image into blocks and then use a CNN to produce a token for each block. Then you concatenate the tokens and feed them into a transformer.


That's fair, and I didn't realize people were using them for images that way. I would still argue that they are at least somewhat more specialized than plain fully-connected layers, much like convolutional layers.

It is definitely interesting that we can do so much with a relatively small number of generic "primitive" components in these big models, but I suppose that's part of the point.


Small nit: the analog DAC circuit has MSB and LSB reversed. That is, the LSB goes into the top 128k resistor. If you think of it one bit at a time, the gain of the circuit = -Rf/R --- where R is one of the resistors in the ladder. (Rf is the feedback resistor, which is unlabeled and un-valued in the article.) Clearly, you get bigger output when R is smaller.
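For reference, the summing-amplifier relation behind that nit, with R_f the feedback resistor and R_k the resistor on bit k's input:

    V_{\text{out}} = -R_f \sum_k \frac{V_k}{R_k}

Each bit's contribution scales as 1/R_k, so binary weighting requires the MSB to drive the smallest resistor and the LSB the largest one (here, the 128k).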


Very interesting. But I missed how the network handled the overflow.

Spotted one small typo: digital to audio converter should be digital to analog.


It converted the inputs to analog so there isn't really a notion of 'overflow'.


Putting aside the idea that "analog" doesn't have overflow. (Any physical implementation of that analog signal would have limits in the real world. If nothing else, then how much current or voltage your PSU can supply, or how much current can go through your wire.)

But that is not even the real issue. The real issue is that it is not analog; it is just "analog", in quotes. The neural network is executed by a digital computer, on digital hardware, so that "analog" value is just the underlying float type of the CPU or GPU executing the network. That is still very much digital, just on a different abstraction level.


Are there any true analog adder circuits? If so, are they also less error-prone, like digital ones?


> Are there any true analog adder circuits?

Sure. Here is a schematic with an op-amp: https://www.tutorialspoint.com/linear_integrated_circuits_ap...


Of course it is; there is a whole business around overflowing analog signals in the right way in analog guitar effects pedals.


Do you know how DAC/ADC circuits work? Build one that can handle one more bit of input than you have and you have overflow handling.


Oh nice catch on the typo - thanks!!


A more theoretical but increasingly popular approach to identifying behavior like this is interchange intervention: https://ai.stanford.edu/blog/causal-abstraction/


I don't see how this is surprising. It's cool but come on, you have a network consisting of addition and multiplication and you expect it to not use them? What else is it supposed to do? Memorize every input output pair like a look up table?


Relevant: reverse engineering modular addition https://twitter.com/NeelNanda5/status/1559060507524403200


Is this going to change how 8-bit adders are implemented in chips?


No, there was a time where we had analog calculators but there are good reasons why digital electronics won in the computing space.


totally off topic: Any pointers to tutorials/articles that teach implementation of common neural network models in javascript/rust (from scratch) ?


I wonder if it is possible to reverse engineer passwords. One could input passwords and train on known values. What should one expect?


> One thought that occurred to me after this investigation was the premise that the immense bleeding-edge models of today with billions of parameters might be able to be built using orders of magnitude fewer network resources by using more efficient or custom-designed architectures.

I'm pretty convinced that something equivalent to GPT could run on consumer hardware today, and the only reason it doesn't is because OpenAI has a vested interest in selling it as a service.

It's the same as Dall-E and Stable Diffusion - Dall-E makes no attempt to run on consumer hardware because it benefits OpenAI to make it so large that you must rely on someone with huge resources (i.e. them) to use it. Then some new research shows that effectively the same thing can be done on a consumer GPU.

I'm aware that there's plenty of other GPT-like models available on Huggingface, but (to my knowledge) there is nothing that reaches the same quality that can run on consumer hardware - yet.



