Making floating point math highly efficient for AI hardware (fb.com)
152 points by probdist on Nov 9, 2018 | 32 comments



Here's the bottom line for anyone who doesn't want to read the whole article.

> Using a commercially available 28-nanometer ASIC process technology, we have profiled (8, 1, 5, 5, 7) log ELMA as 0.96x the power of int8/32 multiply-add for a standalone processing element (PE).

> Extended to 16 bits this method uses 0.59x the power and 0.68x the area of IEEE 754 half-precision FMA

In other words, interesting but not earth shattering. Great to see people working in this area though!


At least 69% more multiply-add flops at the same power, iso-process, is nothing to sneeze at (we're largely power/heat bound at this point), and unlike normal floating point (IEEE or posit or whatever), multiplication, division/inverse and square root are more or less free in terms of power, area and latency. This is not a pure LNS or pure floating point, because it is a hybrid of "linear" floating point (FP being itself hybrid log/linear, but with a linear significand) and LNS log representations for the summation.
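
To make the "more or less free" claim concrete, here's a toy Python sketch of a pure log-domain representation (an illustration of mine, not the (8, 1, 5, 5, 7) format from the paper; no sign, no tapering, no Kulisch accumulator):

    import math

    # Toy log-number representation: value = 2**(log_fixed / SCALE), positive
    # values only. Shows why multiply/divide/sqrt are cheap in the log domain.
    SCALE = 1 << 4  # 4 fractional bits in the fixed-point base-2 log

    def encode(x):
        return round(math.log2(x) * SCALE)

    def decode(log_fixed):
        return 2.0 ** (log_fixed / SCALE)

    def lns_mul(a, b):   # multiply = integer add of the logs
        return a + b

    def lns_div(a, b):   # divide = integer subtract of the logs
        return a - b

    def lns_sqrt(a):     # square root = halve the log (exact when the log is even)
        return a >> 1

    print(decode(lns_mul(encode(3.0), encode(7.0))))  # ~21, up to log rounding
    print(decode(lns_sqrt(encode(2.0))))              # ~1.414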

Latency is also a lot lower than an IEEE or posit floating point FMA (not in the paper, but the results were only at 500 MHz because the float FMA couldn't meet timing closure at 750 MHz or higher in a single cycle; the paper had to be short for a deadline, so we couldn't explore the whole frontier and show 1-cycle vs 2-cycle vs N-cycle pipelined implementations).

The floating point tapering trick applied on top of this can help with the primary chip power problem, which is moving bits around, so you can solve more problems with a smaller word size because your encoding matches your data distribution better. Posits are a partial but not complete answer to this problem if you are willing to spend more area/energy on the encoding/decoding (I have a short mention about a learned encoding on this matter).
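
For a concrete picture of what tapering means, here is a toy decoder for positive posit bit patterns (an illustrative Python sketch with parameters I picked for the example, not the encoding from the paper; zero/NaR and sign handling are omitted). The run-length coded regime gives values near 1.0 more fraction bits and values far from 1.0 more dynamic range:

    def decode_positive_posit(bits, nbits, es):
        """Decode a positive posit given as an integer bit pattern (sketch only)."""
        # Bits after the sign bit, MSB first.
        field = [(bits >> (nbits - 1 - i)) & 1 for i in range(1, nbits)]
        # Regime: a run of identical bits terminated by the opposite bit.
        first, run = field[0], 1
        while run < len(field) and field[run] == first:
            run += 1
        k = run - 1 if first == 1 else -run
        rest = field[run + 1:]                      # bits after the regime terminator
        exp_bits, frac_bits = rest[:es], rest[es:]
        e = int(''.join(map(str, exp_bits)) or '0', 2)
        f = int(''.join(map(str, frac_bits)) or '0', 2)
        frac = 1 + f / (1 << len(frac_bits)) if frac_bits else 1.0
        return (2 ** (2 ** es)) ** k * 2 ** e * frac

    # 8-bit posits with es=1: short regimes near 1.0 leave more fraction bits.
    print(decode_positive_posit(0b01000000, 8, 1))  # regime k=0, e=0, frac=1 -> 1.0
    print(decode_positive_posit(0b01110000, 8, 1))  # longer regime, fewer frac bits -> 16.0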

A floating point implementation that is more efficient than typical integer math but in which one can still do lots of interesting work is very useful too (providing an alternative for cases where you are tempted to use a wider bit width fixed point representation for dynamic range, or a 16+ bit floating point format).


The work is definitely great and I have no doubt we'll see new representations used in the future. But at least on the chip I work on, this would be a <5% power improvement in the very best case. For the risk/complexity involved, I would hope for a lot more.


Wait, 0.59x isn't Earth shattering? That's almost half the power, and at only 2/3 the area. Those are _huge_ differences at data center scale!


Only a fraction of the total power goes to the actual ALUs, even on an ML chip, so the actual top-line impact is probably small. Not that it's bad, just that this is a fairly complex change for the amount of power saved. Plus this requires (unproven) changes on the modeling side, which isn't desirable.


Not sure I understand the "Plus [...]" part: this is new research, so obviously no one is going to implement this at scale until people have had time to go over the approach and either confirm the maths is solid or find problems with it. But that is universally true for any new low-level design; I assume we all understand that "it's still in peer review" implies "so now it needs to be put to the test", not "and now we all use it without question" =)


Suppose you have three proposed optimizations, all of which save the same amount of power. Change A requires no modeling changes, change B requires modeling changes with well understood impacts, and change C requires modeling changes where the impact is unknown. If you could only implement one, your priority would probably be A > B > C.

Since it's not clear to me what the consequences are on the modeling here, I'd put this in category C. If lots of people start using it, it could move to B. The ideal would still be A though.


False analogy: this is not in the set {A, B, C} yet. Instead, suppose we have one established way A of doing things, and two research papers B and C with alternatives that come with benefits but may require changing either software or hardware standards. You stick with A until B, C, or both have been proven to hold up and someone has done the actual migration cost/benefit analysis; only then do they even enter your thoughts about optimization, because until that work has been done by someone else they are not part of any real-world solution.


The Kulisch accumulator and entropy coding of the floating point words (tapering) address this particular issue.

They allow you to get away with much smaller word sizes than would otherwise be the case while preserving dynamic range (and precision!). That is what the "word size"/tapering discussion in the blog is about. This is the thing that makes 8-bit floating point work in this case as a drop-in replacement via round-to-nearest-even. You have to change significantly more to get 8-bit FP to work without either the Kulisch accumulator or entropy coding, as you have to make much different tradeoffs between precision and dynamic range.
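
Roughly, the Kulisch idea in toy Python form (an illustration of mine, not the hardware design): every product lands exactly on a very wide fixed-point grid, the accumulation is exact integer addition, and rounding happens only once at the end.

    from fractions import Fraction

    # Toy float values m * 2**e with small significands/exponents.
    ACC_LSB_EXP = -60   # the accumulator counts multiples of 2**-60 (toy choice)

    def to_kulisch(m, e):
        """Exact fixed-point image of m * 2**e on the accumulator grid."""
        shift = e - ACC_LSB_EXP
        assert shift >= 0, "toy accumulator is not wide enough for this exponent"
        return m << shift

    products = [((13, -4), (7, -5)), ((9, -7), (7, -5)), ((-5, -3), (7, -5))]
    acc = 0
    for (m1, e1), (m2, e2) in products:
        acc += to_kulisch(m1 * m2, e1 + e2)   # products and sums are exact integers

    exact_sum = acc * Fraction(2) ** ACC_LSB_EXP
    print(float(exact_sum))                   # a single rounding at the very end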

"Users of floating point are seldom concerned simultaneously with with loss of accuracy and with overflow" (or underflow for that matter) [1]

The paper and blog post consider 4-5 different things/techniques, not all of which need be combined and some of which can be considered completely independently. The paper is a little bit gimmicky in that I combine all of them together, but that need not be the case.

(log significand fraction map (LNS), posit/Huffman/other entropy encoding, Kulisch accumulation, ELMA hybrid log/linear multiply-add as a replacement for pure log domain)

[1] Morris, Tapered floating point: a new floating point representation (1971) https://ieeexplore.ieee.org/abstract/document/1671767


You are wrong that this is not 'earth shattering'. A 40% efficiency increase is roughly what you'd get from a process node step, and given that those are rather few and far between now, this is the equivalent of extending Moore's law by another 5 to 10 years.


That's for the actual number crunching, but the real power cost is often in bandwidth (as discussed earlier in the OP). If you can reliably use lower-precision types for training, you get 4x the flops for a halving of the bandwidth costs due to matrix mult being O(n^2).


They don't show a comparison to bfloat16 PEs/FMA. IEEE half precision uses a larger mantissa than bfloat16, and the cost of multiplication is proportional to the square of the mantissa size. I'd expect much lower gains relative to bfloat16.
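
To put a rough number on that (my estimate, not the article's): IEEE half precision stores 10 mantissa bits (11 with the hidden bit) versus bfloat16's 7 (8 with the hidden bit), so a quadratic multiplier-cost model puts the fp16 multiplier at roughly (11/8)^2 ≈ 1.9x the bfloat16 one, which shrinks the headroom available to the proposed format accordingly.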


Not sure why this isn't getting more votes, but it's a good avenue of research and the authors should be commended. That said, this approach to optimizing floating point implementations has a lot of history at Imagination Technologies, ARM and similar low-power inferencing chipset providers. I especially like the Synopsys ASIP Designer [0] tool, which leverages the open-source (although not yet IEEE ratified) LISA 2.0 Architecture Design Language [1] to iterate on these design issues.

Interesting times...

[0] https://www.synopsys.com/dw/ipdir.php?ds=asip-designer [1] https://en.wikipedia.org/wiki/LISA_(Language_for_Instruction...


A bit off-topic, but I remember some studies about 'under-powered' ASICs, i.e. running with lower-than-required voltage and just letting the chip fail sometimes. I guess the outcome was that you can run with 0.1x the power and get 0.9x the correctness. Usually chips are designed so that they never fail, and that requires using substantially more energy than is needed in the average case. If the application is probabilistic or noisy in general, additional 'computation noise' could be allowed for better energy efficiency.


That sounds awful for verification, debugging, reproducibility and safety-critical systems. Imagine this in a self-driving car. Scary.


Well, you could simply not use these in a self-driving vehicle.


Wow! It's kind of a weird feeling to see some research I worked on get some traction in the real world!! The ELMA lookup problem for 32 bits could be fixed by using the posit standard, which just has "simple" adders for the section past the Golomb-encoded section, though you may have to worry about spending transistors on the barrel shifter.


The ELMA LUT problem is in the log -> linear approximation to perform sums in the linear domain. This avoids the issue that LNS implementations have had in the past, which is in trying to keep the sum in the log domain, requiring an even bigger LUT or piecewise approximation of the sum and difference non-linear functions.
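
Roughly, the flow in toy Python form (an illustration with small made-up bit widths, not the actual hardware): multiply by adding logs, convert each product once to the linear domain via a small 2^x LUT, accumulate exactly there, and convert back once at the end.

    import math

    LOG_FRAC_BITS = 4                     # fractional bits of the base-2 log
    LIN_FRAC_BITS = LOG_FRAC_BITS + 1     # the 'alpha' linear fraction bits

    # The only LUT: log-domain fraction -> linear-domain fraction of 2**f.
    POW2_LUT = [round((2 ** (i / 2 ** LOG_FRAC_BITS) - 1) * 2 ** LIN_FRAC_BITS)
                for i in range(2 ** LOG_FRAC_BITS)]

    def log_to_linear(int_part, frac_part):
        sig = (1 << LIN_FRAC_BITS) + POW2_LUT[frac_part]   # 1.xxxxx fixed point
        return sig << int_part                             # toy: int_part >= 0 only

    def elma_dot(logs_a, logs_b):
        acc = 0                                            # wide linear accumulator
        for (ia, fa), (ib, fb) in zip(logs_a, logs_b):
            # Multiply in the log domain: add the fixed-point logs (with carry).
            f = fa + fb
            i = ia + ib + f // 2 ** LOG_FRAC_BITS
            # One approximate log->linear conversion per product, then exact sums.
            acc += log_to_linear(i, f % 2 ** LOG_FRAC_BITS)
        # A single approximate conversion back to the log domain at the end.
        return math.log2(acc / 2 ** LIN_FRAC_BITS)

    # Both vectors are [2.0, 4.0] in (integer, fraction) log form; dot product is 20.
    print(elma_dot([(1, 0), (2, 0)], [(1, 0), (2, 0)]))    # log2(20) ≈ 4.32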

This is independent of any kind of posit or other encoding issue (i.e. it has nothing to do with posits).

(I'm the author)


Thanks for your work!! (And citing us ofc)

Do you think there might be an analytic trick that you could use for larger ELMA numbers that yields semi-accurate results for machine learning purposes? Although to be honest, I still think that with a Kulisch FMA and an extra operation for fused exponent add (softmax, e.g.) you can cover most things you'd need 32 bits for with 8.


I've thought of that, but the problem is that it needs to linearly interpolate between the more accurate values, and depending upon how finely grained the linear interpolation is, you would need a pretty big fixed point multiplier to do that interpolation accurately.

If you didn't want to interpolate with an accurate slope, and just use a linear interpolation with a slope of 1 (using the approximations 2^x ~= 1+x and log_2(x+1) ~= x for x \in [0, 1)), then there's the issue that I discuss with the LUTs.

In the paper I mention that you need at least one more bit in the linear domain than the log domain (i.e., the `alpha` parameter in the paper is 1 + log significand fractional precision) for the values to be unique (such that log(linear(log_value)) == log_value) because the slope varies significantly from 1, but if you just took the remainder bits and used that as a linear extension with a slope of 1 (i.e., just paste the remainder bits on the end, and `alpha` == log significand fractional precision), then log(linear(log_value)) != log_value everywhere. Whether or not this is a real problem is debatable though, but probably has some effect on numerical stability if you don't preserve the identity.
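
A quick way to see this numerically (a sanity-check sketch in Python, with double precision standing in for the hardware LUTs; not the paper's derivation):

    import math

    def roundtrip_ok(frac_bits, alpha):
        """Check log -> linear -> log identity for every log fraction value.

        Each log fraction i / 2**frac_bits is rounded to a linear fraction with
        `alpha` fractional bits, then converted back with round-to-nearest on
        frac_bits. The identity holds with alpha = frac_bits + 1 and breaks
        with alpha = frac_bits for these sizes.
        """
        for i in range(2 ** frac_bits):
            x = i / 2 ** frac_bits
            lin = round((2 ** x - 1) * 2 ** alpha) / 2 ** alpha
            back = round(math.log2(1 + lin) * 2 ** frac_bits) / 2 ** frac_bits
            if back != x:
                return False
        return True

    for f in (3, 4, 5, 6, 7):
        print(f, roundtrip_ok(f, f), roundtrip_ok(f, f + 1))   # False, True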

Based on my tests I'm skeptical about training in 8 bits for general problems even with the exact linear addition; it doesn't work well. If you know what the behavior of the network should be, then you can tweak things enough to make it work (as people can do today with simulated quantization during training, or with int8 quantization for instance), but generally today when someone tries something new and it doesn't work, they tend to blame their architecture rather than the numerical behavior of IEEE 754 binary32 floating point. There are some things even today in ML (e.g., Poincaré embeddings) that can have issues even at 32 bits (in both dynamic range and precision). It would be a lot harder to know what the problem is in 8 bits when everything is under question if you don't know what the outcome should be.

This math type can and should also be used for many more things than neural network inference or training though.


> It would be a lot harder to know what the problem is in 8 bits when everything is under question if you don't know what the outcome should be.

I might have a solution for that: I work on methods to both quantify the impact of your precision on the result and locate the sections of your code that introduce significant numerical errors (as long as your numeric representation respects the IEEE standard).

However, my method is designed to test or debug the numerical stability of a code, not to be used in production (as it impacts performance).


None of the representations considered in the paper (log or linear posit or log posit) respect the IEEE standard, deliberately so :)


You drop denormals and change the distribution, but do you keep the 0.5 ULP (round to nearest) guarantee from the IEEE standard? And are your rounding errors exact numbers in your representation (can you build error-free transforms)?


For (linear) posit, what the "last place" is varies. Versus a fixed-size significand, there is no 0.5 ulp guarantee. If you are in the regime of full precision, then there is a 0.5 ulp guarantee. The rounding also becomes logarithmic rather than linear in some domains (towards 0 and +/- inf), in which case it is 0.5 ulp log scale rather than linear, when the exponent scale is not 0.

For my log numbers under ELMA (with or without posit-ing), the sum of 2 numbers alone cannot be analyzed in a simple ulp framework I think, given the hybrid log/linear nature. Two numbers summed are both approximated in the linear domain (to 0.5 ulp linear domain, assuming alpha >= frac + 1), then summed exactly, but conversion back to the log domain when done is approximate, to 0.5 ulp in the log domain. But the result is of course not necessarily 0.5 ulp in the log domain. Multiplication, division and square root are always the exact answer however (no rounding). The sum of two log numbers could of course also be done via traditional LNS summation, in which case there is <0.5 ulp log domain error.

Kulisch accumulation throws another wrench in the issue. Summation of many log domain numbers via ELMA will usually be way more accurate than 0.5 (log domain) ulp rounding via LNS traditional summation techniques, because the compounding of error is minimized, especially when you are summing numbers of different (or slightly different) magnitudes. Kulisch accumulation for linear numbers is of course exact, so the sum of any set of numbers rounded back to traditional floating point is accurate to 0.5 ulp.


The rounding errors for linear posit are exact numbers (excepting division), assuming higher precision. The rounding errors for LNS add/sub are not exact numbers in the representation in the general case. 2 and sqrt(2) are represented exactly, but (2 + sqrt(2)) is not.
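
For reference, this is the kind of error-free transform in question, in the IEEE binary setting (standard Knuth TwoSum, sketched in Python with binary64 doubles; it doesn't carry over as-is to the log representations above):

    def two_sum(a, b):
        """Knuth's TwoSum: s + e == a + b exactly, with s = fl(a + b).

        Works because with IEEE round-to-nearest, the rounding error of an
        addition is itself representable as a float.
        """
        s = a + b
        bb = s - a
        e = (a - (s - bb)) + (b - bb)
        return s, e

    s, e = two_sum(1e16, 1.0)
    print(s, e)   # 1e16 and 1.0: the rounding error of the sum is recovered exactly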


For those interested in the general area, I saw a good talk about representing and manipulating floating point numbers in Julia at CSAIL last week by Jiahao Chen. The code, with some good documentation, is on his GitHub.

https://github.com/jiahao/ArbRadixFloatingPoints.jl


Caveat: I haven't finished reading the entire FB announcement yet.

Google announced something along these lines at their AI conference last September and released the video today on YouTube. Here's the link to the segment where their approach is discussed: https://www.youtube.com/watch?v=ot4RWfGTtOg&t=330s


> Significands are fixed point, and fixed point adders, multipliers, and dividers on these are needed for arithmetic operations... Hardware multipliers and dividers are usually much more resource-intensive

It's been a number of years since I've implemented low-level arithmetic, but when you use fixed point, don't you usually choose a power of 2? I don't see why you'd need multiplication/division instead of bit shifters.


Multiplication or division by a power of 2 can be done with a bit shift, assuming the binary digits represent a base-2 number, i.e. not a beta-expansion (https://en.wikipedia.org/wiki/Non-integer_representation) where the digit string is interpreted in base 1.5 or base sqrt(2) or base (pi-2) or whatever (in which case multiplication or division by powers of 1.5 or sqrt(2) or (pi-2) could be done via bit shift).

But when multiplying two arbitrary floating point numbers, your typical case is multiplying base-2 numbers not powers of 2, like 1.01110110 by 1.10010101, which requires a real multiplier.

General floating point addition, multiplication and division thus require fixed-point adders, multipliers and dividers on the significands.


Fixed point might usually put the "binary point" in between bits, but when doing a multiply between two of them you still have to do at least an integer multiply before the bit shift. Ditto division.
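
For example (a toy Q-format sketch of mine, using the significands from the comment above): the fixed-point multiply is an integer multiply followed by a shift back to the chosen radix point, whereas multiplying by a power of 2 really is just a shift.

    FRAC_BITS = 8                      # Q-format: 8 fractional bits

    def to_fixed(x):
        return round(x * (1 << FRAC_BITS))

    def fixed_mul(a, b):
        # The full integer product has 2*FRAC_BITS fractional bits; shift to renormalize.
        return (a * b) >> FRAC_BITS

    a = to_fixed(1.4609375)            # 1.01110110 in binary
    b = to_fixed(1.58203125)           # 1.10010101 in binary
    print(fixed_mul(a, b) / (1 << FRAC_BITS))   # ≈ 2.31 (truncated): needs a real multiplier
    print((a << 1) / (1 << FRAC_BITS))          # multiply by 2: just a shift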


I find it interesting that they were able to find improvements even on hardware that is presumably optimized for IEEE-754 floating point numbers.


It is a trade-off: they find improvements by losing precision where they believe it is not useful for their use case.



