It's likely that computing with random pulses would increase power. That's because the vast majority of power in processors today is spent transferring information, not computing arithmetic operations. The actual arithmetic operations are very cheap energy-wise compared to the cost of reading and writing data from memory, as well as sending it around the chip.
The cost of moving data is proportional to the number of toggles you have to push through the wire. And with this representation, the number of toggles is immense. Bill Dally noted at SysML the other week that this kind of data format is almost the worst possible number format you could construct, if you care about power, because on average there must be lots of toggles.
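To make the toggle point concrete, here's a back-of-the-envelope sketch (a toy of mine in Python, not a power model): a unipolar stochastic stream encoding a value p flips on average 2p(1-p) of the time between adjacent bits, so pushing an 8-bit value down a wire as a 256-bit stream causes far more transitions than clocking out the 8 bits once.

```python
import random

def stochastic_stream(p, n):
    """Unipolar stochastic encoding: each bit is 1 with probability p."""
    return [1 if random.random() < p else 0 for _ in range(n)]

def toggles(bits):
    """Count 0->1 / 1->0 transitions, i.e. how often the wire has to switch."""
    return sum(a != b for a, b in zip(bits, bits[1:]))

p = 200 / 256  # the value 200 as an 8-bit fraction
print(toggles(stochastic_stream(p, 256)))             # ~2*p*(1-p)*255, roughly 87
print(toggles([int(b) for b in format(200, "08b")]))  # at most 7 for the plain word
```

Real bus power also depends on capacitance, voltage and how many wires switch in parallel, but the transition count is what the "lots of toggles" point is about.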
It's too bad the article doesn't include any comparisons to see why the authors believe this is a good idea. The neuromorphic story is getting a little threadbare at this point, without data to back it up.
A few points: data movement is expensive, but the reasons are not so simple:
1) x86's RAM needs to move huge amounts of data cheaply and extremely reliably. That requires a lot of power to fight noise. Once you're allowed to have a few bit flips, the bus voltages can be made much lower, directly lowering losses.
2) A significant portion of the large energy cost is actually in the computation associated with memory movement (according to Sohmers from REX Computing, about 40%). Encoding, queuing on the processor, decoding, fetching, queuing on the memory controller, and caching again on the CPU take a lot of logic.
I think it's quite possible this approach is superior for low precision floating point applications that currently use CPUs/GPUs. Note this allows lowering voltages for the whole system, including interfaces and memory.
Also, data movement doesn't dominate the energy cost of all applications; there are plenty of operations that are still compute-bound (e.g. some evolutionary methods), it's just that the applications currently in vogue happen to be dominated by data movement. There are plenty of applications that are currently done in DSPs and ASICs because of heavy computational costs (error-correction decoding, as mentioned in the article) that might benefit from this.
---
As far as I know, it is still an open question whether it is possible to generalize this approach for data-efficiency. I have a few ideas I've been playing around with but none that seem to work yet.
I don't see where the authors are advocating that stochastic computers would wholesale replace general purpose digital computers. It looks more like they've found specific cases where stochastic computing could be a big benefit and are doing basic research on development techniques and other possible applications.
> That's because the vast majority of power in processors today is spent transferring information, not computing arithmetic operations.
As an aside: I found this aspect (and the amount of dark silicon in modern CPUs) very interesting when watching Simon Knowles' talk on their Graphcore design. They try to reduce this cost, maximise compute and minimise dark silicon by using a far more localised memory architecture (granted, what they are doing is naturally more parallel than a CPU).
I think you are correct when thinking about dropping this into a normal CPU design as a bolt-on discrete block of logic... but perhaps it's better used as a hardware building block for specialised processing. I.e. rather than exposing it as an actual instruction on its own, use it as the building block of an algorithm implemented in hardware that requires really fast arithmetic but does not need to be perfectly deterministic, with no back and forth to registers until the end result. Keep in mind it truly can be fast inside a single clock, because fixed-length streams (i.e. words) can be processed in parallel.
I guess in short I'm saying: avoid the serialisation process until necessary, chain the probabilistic arithmetic until done, _then_ do the expensive de-serialisation. My intuition tells me it would have value then, but I don't know.
“An AND gate is a digital circuit with two inputs and one output that gives a 1 only if both inputs are 1. It consists of just a few transistors and requires very little energy to operate. Being able to do multiplications with it—rather than, say, programming a microprocessor that contains thousands if not millions of transistors—results in enormous energy savings.”
To be fair, you don’t need thousands or millions of transistors for a functional ALU, such as the ones we built in college.
I’d imagine the reason it’s more practical to use something with 1M transistors rather than something with 10 or 100 is the ease of integration, (amazingly) simplified manufacturing, and supply chains already existing. I.e. why not throw 1M transistors in when it’s actually cheaper?
So then, I’m sure there is value in what’s proposed, but for a layperson, what is the real, practical advantage of this?
> To be fair, you don’t need thousands or millions of transistors for a functional ALU, such as the ones we built in college.
Indeed, and some of the variants of early mainframes were bit-serial in order to reduce cost, with a corresponding (significant) decrease in performance too. An ALU can be less than a dozen transistors or so if it's completely serial --- which seems to be what this article is going towards, but there's a reason bit-serial processors have basically become extinct: they're horribly slow. Any power savings are negated by every operation taking several times longer, so the total energy does not decrease. In fact, since the frequency now has to increase, and power consumption increases superlinearly with frequency, it's a net loss.
One memorable microcontroller with a bit-serial ALU is the CDP1802 --- and it takes 16 clock cycles for a single 8-bit add instruction.
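For anyone who hasn't seen one: a bit-serial adder is just a single full adder plus a carry flip-flop, clocked once per bit, which is why the hardware is tiny but an 8-bit add needs at least 8 cycles. A rough Python sketch of the idea (not the 1802's actual circuit):

```python
def bit_serial_add(a_bits, b_bits):
    """Add two numbers presented LSB-first, one bit pair per 'clock'.
    Hardware equivalent: one full adder plus a flip-flop holding the carry."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):         # each iteration = one clock cycle
        out.append(a ^ b ^ carry)            # sum bit for this position
        carry = (a & b) | (carry & (a ^ b))  # carry latched for the next cycle
    out.append(carry)
    return out

# 5 + 6, LSB first: [1,0,1] is 5, [0,1,1] is 6 -> [1,1,0,1] is 11
print(bit_serial_add([1, 0, 1], [0, 1, 1]))
```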
> An ALU can be less than a dozen transistors or so if it's completely serial --- which seems to be what this article is going towards.
Each bit being completely independent seems to make it trivial to parallelise. I know they describe it as "serial" but I don't see anything inherently serial about it.
High-precision numbers being impractical anyway means that in practical use people would probably settle on predefined word lengths rather than open-ended "bit streams". You can multiply your 256-bit word (an 8-bit probabilistic integer) in the time it takes to propagate through a single AND gate - the shortest possible, giving you the fastest clock - or, if sharing the clock with other far more time-consuming blocks in a traditional CPU, it allows you to do a great many probabilistic multiplications in a single clock with the same silicon in the same block.
Admittedly what I'm suggesting uses 256 gates, not a dozen, but the performance will be significantly better in terms of how much of the clock it takes up. I think the real problem with this method is the cost of serialising into the probabilistic bit stream. That needs to be avoided as much as possible to make this useful.
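A software sketch of what I mean (toy assumptions: 256-bit words, independently generated streams; in hardware the multiply really is one AND gate per bit, all in parallel):

```python
import random

N = 256  # fixed "word" length, roughly 8 bits of probabilistic precision

def encode(p):
    """Pack a probability p into an N-bit integer, one random bit per position."""
    return sum((random.random() < p) << i for i in range(N))

def decode(word):
    """The expensive step: count the ones and divide by the word length."""
    return bin(word).count("1") / N

a, b = encode(0.5), encode(0.25)
product = a & b          # stochastic multiply: a single bitwise AND over the word
print(decode(product))   # ~0.125, within the precision the stream length allows
```

In hardware you'd keep the AND results flowing straight into the next gate and only decode once at the very end, which is the "avoid the de-serialisation until necessary" point from upthread.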
I wonder how the technique described in the article would compare to working with DSP chips? Say you have two continuous streams of numbers and want to add them. So, the processor is always on, it's constantly receiving numbers and adding them, and there is no "race to idle"; it's more a matter of power consumption and response time.
The practical advantage the article describes is in power consumption. Transistors are small and cheap, but in aggregate they're power hungry. This makes them work less well on the tiny voltages you can extract inside someone's eye, for example.
Mind you, modern low power chips use absolutely minuscule amounts of power. I suspect this technique will need a lot of refinement before it's actually competitive.
> I suspect this technique will need a lot of refinement before it's actually competitive.
The main trouble with this technique, I suspect, is that it doesn't get at the root of where power is really spent. Yes, anything costs power, but is that where all the juice is really going? A famous stat from the P4 days was that 40% of chip power was spent just driving the clocks. I'm sure it's better today, but how much? What about all the leakage in the SRAMs?
Basically, stochastic computing takes exponentially more time and energy to perform the same computation at the same precision, and has a higher error rate. If you want to save energy on precision, it would be better to use bit or digit serial operators.
Note of disclosure, this paper was written by my adviser and I am currently writing two papers on digit-serial arithmetic operators.
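The "exponentially" part falls out of a basic sampling argument (my paraphrase, not taken from the paper): estimating a probability p from an N-bit stream has standard error about sqrt(p(1-p)/N), so resolving the last of n bits of precision needs a stream on the order of 2^(2n) bits long.

```python
# Rough sampling argument: a step of 2**-n in the estimated value needs the
# standard error ~0.5/sqrt(N) to shrink below it, i.e. N on the order of 4**n.
for n in (4, 8, 12, 16):
    print(f"{n:>2} bits of precision -> stream length ~ {4 ** n:,} bits")
```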
Fascinating! I don't think the article went into what physically encoded numbers into the probability streams to begin with, only mentioning that this is simple. Does anyone know about that? Is this happening on the analog to digital conversion?
A natural noise process (e.g. thermal) with a variable threshold could do it. PRNGs spend a lot of effort on statistical properties that may not matter for this use case. You absolutely must, however, guarantee the correct bias, so such a noise process would need to be tightly controlled, likely temperature controlled.
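In software terms the thresholding is just a comparator between the value to encode and a fresh noise sample each clock (a toy sketch; the hard part is keeping the noise distribution that well controlled):

```python
import random

def noisy_bit(value):
    """Emit 1 when the value exceeds a fresh noise sample.
    Hardware analogue: a comparator fed by the signal and a noise source."""
    return 1 if random.random() < value else 0

# the running average of the bits converges on the encoded value
print(sum(noisy_bit(0.3) for _ in range(10_000)) / 10_000)  # ~0.3
```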
If you wanted to process two coherent signals, I would assume you’d need to randomize the distribution of 1’s and 0’s within each stream. Yeah, I wonder how that is done.
I imagine the internal registers simply clock out their stored data to the bitstream in sequence rather than carrying it in a parallel bus to adders and the like.
I suppose that wouldn't be a random pattern though, each iteration through it would be the same.
Could you have a true random number generator that with each clock cycle had a 50% chance of producing a 1 or 0, then XOR that stream with the data stream?
How expensive are RNGs to implement? I'm not sure how they work. I think your idea would work though.
You might make do with a pattern with a very long prime length, like 4073 bits, that just clocks around endlessly and will rarely line up meaningfully with anything else.
It does seem memory intensive though; maybe these minor issues aren't a big deal in systems simple enough for this to ever work anyway.
Maybe you just use 127 bits or something to randomize it, probably more than enough since the goal is likely mostly to avoid aliasing. Even with multiple incoming synchronized sources if you use different mixing patterns it might work out fine.
Nice! I see the noise spectra; I am guessing the long tail out at high frequencies means that the recovery time is short enough that subsequent bits aren't correlated?
I don't know how truly random a CRC function is, but they're really cheap to implement in an FPGA, especially if you're just pulling the bit off the end for a bitstream, and you can just feed the output, or some polynomial of it, back into the input instead of an input stream. But that's where my thoughts end.
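CRC hardware is basically an LFSR, and an LFSR is the usual cheap trick here: a handful of flip-flops and one XOR give you a pseudo-random bit per clock. A toy sketch (16-bit Fibonacci LFSR with a common maximal-length tap choice, nothing specific to the article):

```python
def lfsr16(seed=0xACE1):
    """16-bit Fibonacci LFSR, taps 16/14/13/11; one pseudo-random bit per
    'clock', period 2**16 - 1. In an FPGA this is a shift register plus an XOR."""
    state = seed
    while True:
        yield state & 1                                       # bit shifted out onto the stream
        fb = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (fb << 15)                     # shift in the feedback bit

gen = lfsr16()
print([next(gen) for _ in range(16)])
```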
The representation they suggest sounds a lot like a dithered PDM[1] bitstream. There is a slight difference because PDM is usually representing a continuously varying value.
Sigma-delta ADCs internally generate something like this from the analog signal (though they often use more than one bit) and then sum the values over a window to get the actual value that they output.
[edit]
This actually makes me wonder if you could get decent power savings with 2-bit add/multipliers and streams of 2-bit values; that would significantly decrease the stream-length required for more precise calculations.
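For comparison, a first-order sigma-delta style modulator is only a few lines (deterministic, so it produces a regular PDM pattern rather than a stochastic stream; a 2-bit version would just quantise to four levels instead of two):

```python
def sigma_delta(samples):
    """First-order sigma-delta modulator: a 1-bit output whose running average
    tracks the input. Inputs assumed to lie in [0, 1]."""
    acc, out = 0.0, []
    for x in samples:
        acc += x                       # integrate the input
        bit = 1 if acc >= 1.0 else 0   # 1-bit quantiser
        acc -= bit                     # feed the quantised value back
        out.append(bit)
    return out

bits = sigma_delta([0.3] * 20)
print(bits, sum(bits) / len(bits))     # the window averages to ~0.3
```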
I like the connection to biology. But it seems incredibly inefficient to store or transmit this representation. A normal 8 bit integer is blown up into 32 bytes. How would you implement anything that needs a cache?
Somebody correct me if I'm wrong, but the stream isn't an integer, it's a probability, or a number between 0 and 1. The point of it is that you shouldn't need to cache it, that would defeat the point of it being a stream.
Still, it seems like if you wanted reduced power, and could accept reduced precision, wouldn't you just do the same thing in binary by using only a few bits? Not even 8 bits, but like 3 or 4 bits?
We're not talking about replacing every aspect of a computer with streams. Just the ones that make sense. So, for example, once you're done processing the streams then you might store them as the calculated probabilities.
Let's say you have an image sensor where the 1s in each stream represent arrival of a photon at a particular location. Then do your image processing with the streams. Finally convert to a traditional image format for storage.
It would still be inefficient to transfer these streams from the sensor to the processor. You'd have to fab the processor on the same chip as the sensor, and even transferring the data on-chip from the pixels to the processing units might be inefficient.
They said the stochastic neural net had a factor of 10x reduction in power usage. What kinds of power reductions are theoretically possible with this tech?
How would I estimate this? Bitcoin mining: compare the hashes per watt of a CPU to the hashes per watt of an ASIC. The difference can be entirely attributed to the efficiency of the CPU circuit design. It is nearly impossible to design a better ASIC, because the specification for SHA256 practically is the circuit design.
> Half the time, a bit sampled from the first input is transferred to the output; otherwise a bit from the second input is used, effectively averaging the two inputs. The circuit that accomplishes this sampling is again a very rudimentary one, called a multiplexer. With it, addition becomes very easy.
A multiplexer with a magic random input? If you merely alternate, then you have a non-i.i.d. sequence, and your next gate can mess up.
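To make that concrete, a toy comparison of the two select strategies (unipolar streams, illustrative values only):

```python
import random

def stream(p, n=10_000):
    return [random.random() < p for _ in range(n)]

def mux_add(a, b, select):
    """Scaled addition: per cycle, pass a bit from a or from b -> mean (a+b)/2."""
    return [x if s else y for x, y, s in zip(a, b, select)]

a, b = stream(0.8), stream(0.4)
coin = stream(0.5)                               # independent random select
alternate = [i % 2 == 0 for i in range(10_000)]  # merely alternating select

print(sum(mux_add(a, b, coin)) / 10_000)       # ~0.6, and the output stays i.i.d.
print(sum(mux_add(a, b, alternate)) / 10_000)  # also ~0.6 on average -- but now
# even positions come from a and odd ones from b, so the sequence is no longer
# i.i.d. and a downstream gate that assumes independence can end up biased.
```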
I'm more interested in how stochastic computing can do more interesting things such as "quantum computing lite", similar to how some applications of chaos analog computing can do rough approximations of what quantum annealers are capable of.
What's the difference between what the article described and PWM? I didn't see any specific justification for randomness. The math parts all work as described with deterministic and regular PWM signals...
I believe you need the randomness to avoid artifacts in the calculations.
But more importantly, the source of data in the applications are analog themselves and they are skipping the step of manually calculating an exact value before doing calculations.
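A quick toy illustration of the artifact (mine, not from the article): two deterministic 50% PWM patterns are either perfectly aligned or perfectly anti-aligned, so an AND "multiply" gives 0.5 or 0.0 instead of 0.25, whereas independently generated random streams land near the right answer.

```python
import random

n = 1024
pwm_a = [i % 2 for i in range(n)]        # 0101... : deterministic 50% duty cycle
pwm_b = [(i + 1) % 2 for i in range(n)]  # 1010... : same value, opposite phase
rnd_a = [random.random() < 0.5 for _ in range(n)]
rnd_b = [random.random() < 0.5 for _ in range(n)]

avg = lambda bits: sum(bits) / len(bits)
print(avg([x & y for x, y in zip(pwm_a, pwm_a)]))  # 0.5  -- identical phase
print(avg([x & y for x, y in zip(pwm_a, pwm_b)]))  # 0.0  -- opposite phase
print(avg([x & y for x, y in zip(rnd_a, rnd_b)]))  # ~0.25, the intended product
```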
Makes sense. Yes, aliasing and bias can happen with PWM math. But, you can directly encode analog signals in PWM, and doing so (using an analog source) mostly avoids the aliasing and bias problems.
That was my first thought too. Neural networks are stochastic by their very nature. I've been reading about binary neural networks for embedded devices and FPGAs recently because I'm convinced machine learning is about to get a whole lot smaller and more distributed, so this article really piqued my interest.