Intel Prepares to Graft Google’s Bfloat16 onto Processors (nextplatform.com)
270 points by rbanffy on July 20, 2019 | 131 comments



ISA: https://software.intel.com/sites/default/files/managed/c5/15... Look for anything marked with the AVX512_BF16 CPUID feature flag.

Numerical details: https://software.intel.com/sites/default/files/managed/40/8b...

Support for bfloat16 is already present in MKL-DNN (https://github.com/intel/mkl-dnn)

Disclaimer: I work for Intel


Please take my bug report:

Dropping denormals is a huge mistake. This is easy to see if you draw out a number line for a very tiny floating-point format, for example with a 2-bit exponent and a 2-bit fraction. (do this on a sheet of graph paper) Without denormals, there is a huge gap surrounding zero.
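
Here's that exercise as a quick Python sketch instead of graph paper (a toy format; I'm assuming an IEEE-style bias and reserving the top exponent for inf/NaN):

    def tiny_float_values(ebits=2, fbits=2, subnormals=True):
        bias = (1 << (ebits - 1)) - 1          # IEEE-style bias, here 1
        emax = (1 << ebits) - 1                # top exponent reserved for inf/NaN
        values = []
        for e in range(emax):                  # skip the inf/NaN encodings
            for f in range(1 << fbits):
                if e == 0:
                    if subnormals:
                        values.append((f / (1 << fbits)) * 2.0 ** (1 - bias))
                    elif f == 0:
                        values.append(0.0)     # only exact zero survives
                else:
                    values.append((1 + f / (1 << fbits)) * 2.0 ** (e - bias))
        return sorted(set(values))

    print(tiny_float_values(subnormals=True))
    # [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5]
    print(tiny_float_values(subnormals=False))
    # [0.0, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5]  -- nothing between 0 and 1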

Strangely, the infinities were kept. Treating these as NaN is far less harmful than dropping denormals. Treating -0.0 as 0.0 and never producing -0.0 would be less harmful. (the PDF didn't say what happens) Even treating NaN values as normal numbers is probably less harmful than screwing up the denormals.

IEEE floating point has lots of crazy stuff to annoy hardware vendors. Most of it isn't all that important, but denormals matter.


bfloat16 is only really intended for ML purposes, and denormals don't really matter there, especially given they accurately multiply-accumulate into 32 bit floats.


Would it hurt ML usage if they added them? Would it make the implementation much harder? If not, it could really increase adoption.


I'm not an expert in the nitty gritty, and I've heard conflicting information, but my rough understanding is that it would be affordable and not too difficult for Xeon processors, but relatively more expensive for the dedicated Nervana neural network processors.


Sure, I can see that being the case. My point is that the Xeon should support denormals.

The Bfloat16 format does allow for denormals. Intel's implementation mangles them, changing them to 0.0 on both input and output.


But then you need either new denormal-accepting instructions or, worse, a new global state bit enabling bfloat16 denormals, all to support use cases probably over two orders of magnitude less common than ML. What's the compelling reason to bother? Note that you need to support the denormal-disabled case because you'll want compatibility with Nervana.


Intel did add a global state bit. It just isn't useful because you can't modify it.

Nervana is discontinued, isn't it? Compatibility doesn't matter. It's pretty compatible anyway, as long as you aren't demanding bit-identical output.


> Intel did add a global state bit. It just isn't useful because you can't modify it.

You mean the CPUID bit? That's free. Toggling denormals isn't.

> Nervana is discontinued, isn't it? Compatibility doesn't matter. It's pretty compatible anyway, as long as you aren't demanding bit-identical output.

Nervana isn't discontinued according to their website[1], and bitwise compatibility does matter, certainly more than denormals do.

[1] https://www.intel.ai/ai-at-ces/


I mean the bit to toggle denormals, not the one to identify support for the opcodes.

Denormals are far more important than bitwise compatibility. To be clear, you would still be able to load a Nervana-produced number into a processor that supports denormals, and the other way would work too. You'd just avoid mangling numbers that are near zero.

If you still think denormals don't matter, seriously do what I suggested: draw it out on graph paper. They matter.


> I mean the bit to toggle denormals, not the one to identify support for the opcodes.

bfloat16 doesn't handle denormals, so why is there a bit to toggle it? What's it called (so I can Ctrl-F for it)?

> If you still think denormals don't matter, seriously do what I suggested: draw it out on graph paper. They matter.

No, I get how denormals work; I know what you're pointing at. But ML genuinely doesn't care, neural nets don't give a damn about mathematical purity[1]. In contrast, compatibility matters because ML doesn't give you any guarantee that it's not depending on the behaviour at small values, and minor differences in rounding do cause issues. For example, Leela Chess Zero had difficulties with reproducibility because different GPUs round floats differently.

[1] Fun but relevant aside: https://openai.com/blog/nonlinear-computation-in-linear-netw...


It's two bits, DAZ and FTZ. (seems like "denormals are zero" and "flush to zero")

Bfloat16 obviously can handle denormals. The encoding is possible. There would be no need to handle the issue if the encoding did not exist.

As hex, these would be denormal: 0x0001 to 0x007F, and 0x8001 to 0x807F. It's the same as plain old 32-bit IEEE, with half the bits lopped off.
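
A quick way to check that at the bit level, assuming the standard bfloat16 layout (1 sign, 8 exponent, 7 fraction bits, i.e. the top half of a float32):

    import struct

    def is_bf16_denormal(bits):
        exp = (bits >> 7) & 0xFF
        frac = bits & 0x7F
        return exp == 0 and frac != 0

    def bf16_to_float(bits):
        # Pad with 16 zero bits and reinterpret as a float32.
        return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]

    print(is_bf16_denormal(0x0001), bf16_to_float(0x0001))  # True, ~9.2e-41
    print(is_bf16_denormal(0x007F), bf16_to_float(0x007F))  # True, largest denormal
    print(is_bf16_denormal(0x0080), bf16_to_float(0x0080))  # False, 2**-126, smallest normal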

ML is all about difficulties with reproducibility. I don't see a reason to get upset about denormals when a 3D-printed turtle can be confused with a rifle.


> It's two bits, DAZ and FTZ. (seems like "denormals are zero" and "flush to zero")

You mean the standard ones for normal floats? You certainly wouldn't want to reuse that for bfloat16s.

> ML is all about difficulties with reproducibility. I don't see a reason to get upset about denormals when a 3D-printed turtle can be confused with a rifle.

These are different things, despite the similarity in terminology.


I met Naveen Rao after Intel bought Nervana. He seemed pretty adamant about getting stuff shipped fast. In contrast, the Xeon folks own all the politics and seem to want the transition to be very gradual. Plus the Phi folks get phased out. They had done a Nervana trial at Facebook but then flaked on other trials. Clearly Intel is trying to desperately manage their books.

Having Nervana and friends on a Xeon chip could be a huge positive change for software. Not only could we toss out the issue of GPU memory transfer, but Nvidia GPUs aren’t so great with concurrency, and here with the linux kernel we might have a chance to beat Nvidia. Naveen sure would like that... Nervana once had a Maxwell compiler that was better than Nvidia’s.


They had an assembler where one person wrote kernels that were faster than cublas in a lot of cases. Afaik, nobody ever released anything else with that assembler, and Nvidia caught up to that performance quickly. In talking with the cublas devs, it seemed more that maxas kernels were highly tuned for specific sizes, whereas cublas/cudnn had to be more general.

Nowadays it's really a moot point with Nvidia's Cutlass being open source.


True story-- the Nervana Maxwell stuff didn't go very far-- but it was noteworthy because they had both that small win as well as their own hardware platform.

One other thought about the Nervana-Xeon convergence is that the support for more memory (thru DDR, Optane, or even just mmap'ed NVME) will be a big win for modeling and large-minibatch SGD. For example, the minibatch fetching could be pushed to the hardware / OS instead of Tensorflow (or the crazy guy behind Tensorpack) using a threadpool. A lot of training is still I/O bound at some level, and processors only support so many PCI-e lanes...


A V100 GPU gets 900GB/s of memory bandwidth. I am less of a CPU expert but afaict you'll be lucky to get much more than 10% of that out of a CPU. This is going to make a huge difference that Intel can't make up with bigger execution units.

bfloat helps with this because the data is half as large. But of course if you're doing Nvidia you're probably already doing (IEEE) fp16.


Correct, GPUs can do SGD iterations much much faster once the data is in GPU RAM. But if you have a ton of data relative to compute, you might be better off with a shallower model and/or fewer iterations. Here’s one start-up serving these sorts of projects: https://www.memverge.com/

When GPUs kickstarted deep learning research in 2012, people had already studied shallow models on mapreduce for a decade or so. Once NVME / Optane and modern CPUs get 1-10TB of useful “memory” in the hands of grad students, there should be another wave of new research. To date, my experience has been that 1TB of “memory” is only commonly available in industry.


That sounded a bit low. This

https://en.wikichip.org/wiki/intel/microarchitectures/cooper...

says "Higher bandwidth (174.84 GiB/s, up from 119.209 GiB/s)"

I don't know if memory bandwidth matters for this type of job, though.


Is that intel ARK published spec sheet bandwidth, or actual usable bandwidth? There is a difference.

I've found I get about 75-80% of the advertised bandwidth both from my real app (TLS crypto) and a toy memory copy benchmark using AVX256 instructions. The toy memory copy benchmark is how I realized that my bottleneck was actually memory bandwidth and not CPU horsepower on Broadwell based servers.


I haven't a clue. I just got it off the link quoted. It's a good question and when there's a difference, you know what marketing will say.

To make a stab, I suppose it might depend on whether all requests are coming from a single memory bank or spread evenly across all memory banks, assuming fully populated (again from the link "Octa-channel (up from hexa-channel)")


On Intel, achievable bandwidth is dominated by how many cores/threads are accessing memory simultaneously, so with many scientific libraries you can get about 80% of the throughput. And the V100 will not give you 900GB/s; that's the theoretical peak, but in practice it's about 750GB/s.


My project, XLA, will get you quite close to the nominal 900GB/s. :)


Do you have more information?


Can't deep learning be done using 16 bit fixed point instead?


If you're using fixed point, wouldn't you just as easily substitute 16-bit integers instead?


Cutlass doesn't match the performance of cublas, even on Nvidia's benchmarks. Hand assembly is still alive and well!


That's somewhat nitpicking. Cutlass is 90-95% of cublas in many cases, and there are things Cutlass can do that cublas can't. Cublas will always be marginally faster, but it won't matter in most cases, especially if that means saving a kernel call.


maxas still exists, and works for Pascal too. A Volta/Turing assembler like maxas would be nice, but there is no Nervana doing the microbenchmarking to identify the instruction encoding well enough.


> At this point, Intel doesn’t have bfloat16 implemented in any of its processors, so they used current AVX512 vector hardware present in its existing processor to emulate the format and the requisite operations. According to the researchers, this resulted in “only a very slight performance tax.”

Why implement bfloat if you get just slightly less performance emulating it with AVX512, which already exists? Maybe it’s an “us too” claim?


The reason to use float16 isn't to make individual operations faster, it's to fit more numbers in vector registers. For example, AVX512 vector registers are 512 bits. That's 8 64-bit floats, 16 32-bit ones, or 32 16-bit ones.

=> A vector multiply using bfloat16 may not be much faster than one using float32, but it will do more multiplications.


> The reason to use float16 isn't to make individual operations faster, it's to fit more numbers in vector registers. For example, AVX512 vector registers are 512 bits. That's 8 64-bit floats, 16 32-bit ones, or 32 16-bit ones.

I'm unclear on the advantage you are trying to explain. If both AVX and bfloat are SIMD instructions, that cannot be the reason implementing bfloat is better. I'm expecting something like "bfloat16 is more specialised so it can have larger registers" or something?

[Edit]

Sorry, re-reading your comment, I think you are trying to say that. The key part being:

> fit more numbers in vector registers

(vs AVX I assume), so adding bfloat16 would provide more registers vs AVX with similar gate usage, due to greater specialization?


Not larger vector registers, smaller numbers, so that you can fit more numbers in a vector register without having to make the vector register larger.

For CPU-bound algorithms, one would expect that bfloat16 in 512 bit vector registers would be about equal in speed to float32 in (hypothetical) 1024 bit vector registers.

Also, for algorithms that are memory-bandwidth bound, halving the size of your numbers will (about) halve memory pressure.


> Not larger vector registers, smaller numbers, so that you can fit more numbers in a vector register without having to make the vector register larger.

Sorry, I'm pretty ignorant of AVX so trying to understand... is this because the smallest word size in AVX is 32-bit? compared to bfloat16 it is using twice the register space? Or rather with the same register space in bfloat16 you can have twice the numbers? (with no negative effects on convergence in NN due to exponent size.)


AVX is SIMD, and each AVX register is a large number of bits (256 or 512) that is packed full of a given data type, which can all be operated on at the same time with the same instruction. So if you have a 512-bit register, it could hold either sixteen 32-bit floating point numbers or 32 bfloat16s. In the latter case, this means every instruction you do is operating on 32 different values.
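
A back-of-envelope sketch of that packing arithmetic (assuming 512-bit registers and ignoring everything else that affects throughput):

    REG_BITS = 512
    N = 1_000_000  # elements to process

    for name, bits in [("float64", 64), ("float32", 32), ("bfloat16", 16)]:
        lanes = REG_BITS // bits                 # elements per register
        vector_ops = -(-N // lanes)              # ceil(N / lanes)
        print(f"{name}: {lanes} lanes/register, {vector_ops} vector ops for {N} elements")
    # float64: 8 lanes/register, 125000 vector ops for 1000000 elements
    # float32: 16 lanes/register, 62500 vector ops for 1000000 elements
    # bfloat16: 32 lanes/register, 31250 vector ops for 1000000 elements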


They did not specify what the tax was relative to. Maybe they meant relative to 32-bit float?

AVX512 is expensive. I believe if you have an AVX512-heavy workload it can cause the processor to throttle.


Yes, there's a BIOS setting to control this. It basically underclocks the core while AVX units are under load.


Taking my 7980xe as an example: When it runs non-avx512 loads, I currently have it set to run at 4.1 GHz (all-core). When running avx-512 heavy loads, it instead runs at 3.6 GHz -- and tends to get much hotter (70-80C instead of 50-60C). 3.6 GHz is a mild overclock; Silicon Lottery reports 100% can achieve that speed for avx512 loads.[1]

Running programs doing the same thing (eg, Hamiltonian Monte Carlo where the likelihood function has or has not been vectorized), the avx512 version is far faster than scalar, and routinely 50%+ faster than avx2.

The avx512 instruction set itself also provides conveniences that make it easier to explicitly vectorize, even if most compilers don't take advantage of them on their own. Masked load and store operations in particular (avx512 is better about masking to handle branches).

On why avx512 vs a graphics card: I need double precision, and my code routinely has maximum widths smaller than the 32 or 64 a graphics card would want to compute in parallel.

[1] https://siliconlottery.com/pages/statistics


Yeah, people tend to completely exaggerate the impact of throttling from AVX512. It's only an issue when you do short bursts of AVX512 and the rest is not AVX512. If you do math and your math can be done in AVX512, even with throttling it's going to be substantially faster. That it runs hotter doesn't concern me at all. Intel's claimed safe Tjunction is something like 105C. EEs tend to take the published component specifications seriously (e.g. your 1000v diode is guaranteed to withstand at least 1KV of reverse voltage), so I trust Intel when they say things are fine up to that temperature. Even beyond that it won't burn out, it'll just thermal throttle.


Maximum Tjunction for an STM32F303 (just happened to have the datasheet open) is 150C, as it is for most other ICs I've seen.

So is 105C just a very conservative number, compensating for the probe location, or are there process specific things which brings it down to 105C?


From what I understand, the newer very-high-density processes are far more sensitive to voltage and temperature than the older, larger ones.


Makes sense. The STM32G series, which still has 150C Tjmax, is ST's first 90nm MCU[1] so yeah.

[1]: https://blog.st.com/stm32g0-mainstream-90-nm-mcu/


What are you using to vectorize avx512 for HMC? Do you have a lot of element wise ops on big arrays?

When running Stan (NUTS/HMC) on Xeon Phi, telling Eigen to use avx512 provided a noticeable speed up but I didn't look at the assembly to be sure.


I've been using Julia. I've been working on a front end meant to help specify vectorized models and their gradients. It is alpha-quality software (far from production ready), but here is the github: https://github.com/chriselrod/ProbabilityModels.jl

In the example I give there, the logdensity and gradient evaluation was about 25x faster than Stan, and sampling was about 20x faster. A simulation fitting many data sets for my dissertation took about 9 hours. 20x is the difference between running overnight, and taking a week.

If I understand correctly, one problem Stan has is that it uses a var datatype for its arrays, which interleaves the values (Scalar) with pointers (vi_). https://github.com/stan-dev/math/blob/master/stan/math/rev/c...

This interleaving is going to cause problems to an autovectorizer. To get a SIMD vector of the scalars, you'd probably have to load two vectors, and then blend them.
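
A hedged numpy illustration of the layout problem (Stan's var is C++, so the field names here are just stand-ins): interleaving values with pointers leaves the values strided in memory, so a vector load can't pick up eight consecutive values in one packed load.

    import numpy as np

    aos = np.zeros(1024, dtype=[("val", np.float64), ("vi", np.uint64)])  # interleaved
    soa = np.zeros(1024, dtype=np.float64)                                # contiguous

    print(aos["val"].strides)  # (16,) -- one value every 16 bytes: gather/blend territory
    print(soa.strides)         # (8,)  -- densely packed: one 512-bit load per 8 doubles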

Even with arrays of doubles, I found Eigen's fixed-size arrays to get about 3-8x worse performance than my Julia library (for Mx32 * 32xN products, over combinations of M and N = 3,...,32): https://bayeswatch.org/2019/06/06/small-matrix-multiplicatio...

I compiled the Eigen benchmarks with: g++ -O3 -fno-signed-zeros -fno-trapping-math -fassociative-math -march=native -mprefer-vector-width=512 -shared -fPIC -I/usr/include/eigen3 eigen_mul.cpp -o libeigenmul.so

How did you tell Eigen to use avx512? At the time, I was getting errors when specifying -DEIGEN_ENABLE_AVX512. http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1705


> here is the github: https://github.com/chriselrod/ProbabilityModels.jl. In the example I give there, the logdensity and gradient evaluation was about 25x faster than Stan, and sampling was about 20x faster.

that looks pretty cool, though I don't yet know enough Julia to understand all of it. The speedups make sense given that Stan's compiler/math lib doesn't do much in the way of smart data layout. I would still keep in mind that the metric worth using for benchmarking is the number of effective samples per second, and this also depends on the HMC variant you use.

> Eigen's fixed size arrays to get about 3-8x worse performance than my Julia library

seems unsurprising that Julia can specialize a lot better than verbose C++ templating, no? (still, good job, very worth checking out)

> I was getting errors when specifying -DEIGEN_ENABLE_AVX512

I used this flag with Eigen 3.3.1, I think, on GCC 6 or 7. This was for Xeon Phi, so I tried to use icc but despite supporting C++11 it doesn't handle Stan or Eigen's template metaprogramming.

This is all the more reason to use Julia, but my graduate student days are long past..


> I would still keep in mind that the metric worth using for benchmarking is the number of effective samples per second, and this also depends on the HMC variant you use.

I was getting similar effective sample sizes/sample size in both after switching to a diagonal mass matrix, like Stan uses, from the dense mass matrix DynamicHMC.jl uses by default (the HMC backend library I'm using).

Given how common it is for folks to run Stan overnight or for a week to study prior sensitivity, internal coverage, type I and II errors, etc., via Monte Carlo, I think a focus on speed is worthwhile.

> seems unsurprising that Julia can specialize a lot better than verbose C++ templating, no? (still, good job, very worth checking out)

The C++ library Blaze did a lot better than Eigen, but still not as well. But yes, Julia's metaprogramming is much easier to work with. Julia expressions are Julia objects that you can manipulate like anything else, so I can write all the functions I want describing how to generate matmul kernels as a function of matrix size and CPU info, and how to loop over them.

That approach feels much more straightforward. I haven't looked at the code bases of Eigen or Blaze, nor am I that familiar with template meta-programming. But I'd guess they define matmul recursively for arbitrary fixed sizes, and then have some templates defined for specific sizes (the kernels) -- or ideally have some clever way of generating the kernels from there.

Regardless, I agree that this is much easier in Julia. Aggressive specialization is also better aligned with Julia's compilation model in general, because methods get compiled just before they're used. Defining a million possible specializations doesn't have the cost of compiling a million specializations.

> This is all the more reason to use Julia, but my graduate student days are long past..

I'm defending this week, and next Monday will be my first day in an industry job. They expressed openness to Julia, but my biggest fear is that they'll renege so that I'll only be able to work on or use Julia in my spare time at home.


> switching to a diagonal mass matrix

That's an interesting comment. We've always used Stan's default of a diagonal, but I think we'd benefit from mixed metrics, which doesn't seem possible in Stan, but looks somewhat doable in some of the HMC libs in Julia.

> Given how common it is for folks to run Stan over night or for a week to study prior sensitivity

Yes, we changed the walltime on our Slurm cluster to support Stan jobs running up to a week long, and have used multiple millions of core-hours on this. Stan still isn't so shabby, but it's a hard problem.

> I'm defending this week, and next Monday will be my first day in an industry job.

Good luck and congrats on the job. You'll probably have to bite your tongue and look for opportunities where Julia's advanced compilation model (as you described well above) is going to more than pay for the cost of deployment/extra language etc.


All the newly added instructions to support BF16 (VCVTNE2PS2BF16, VCVTNEPS2BF16, VDPBF16PS) are AVX512 instructions.


I guess if it's easy enough to do with software, it won't be very difficult to implement with hardware either. So it doesn't really cost much additional die space.


As the exponent is the same size, it is mostly a matter of truncating the mantissa. However, custom hardware could have the ability to do more parallel 16-bit ops in AVX mode. Presumably they are emulating by expanding to 32-bit in AVX512 registers.
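
Something like this numpy sketch is roughly what that emulation would look like (my own guess, not Intel's code; I use plain truncation, while the real VCVTNEPS2BF16 rounds to nearest-even, going by the NE in the mnemonic):

    import numpy as np

    def bf16_to_f32(x_u16):
        # bfloat16 is the top half of a float32, so widening is a 16-bit shift.
        return (x_u16.astype(np.uint32) << 16).view(np.float32)

    def f32_to_bf16(x_f32):
        # Truncate away the low 16 mantissa bits.
        return (x_f32.view(np.uint32) >> 16).astype(np.uint16)

    a = f32_to_bf16(np.array([1.5, 3.14159, 1e-3], dtype=np.float32))
    b = f32_to_bf16(np.array([2.0, 2.71828, 1e+3], dtype=np.float32))

    prod = bf16_to_f32(a) * bf16_to_f32(b)      # the math happens in float32
    print(bf16_to_f32(f32_to_bf16(prod)))       # truncated back to bfloat16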


How would the researchers know how fast a native implementation would be? By making it architectural, Intel can optimize it in future generations.


There’s a tradeoff between neural net performance and CPU performance.


Does anyone know what the story is on the software side? CUDA is basically the industry standard now.

The TensorFlow OpenCL support bug [1] has been open for FOUR years now (with the discussion devolving into an Intel PlaidML flame war).

AMD OpenCL is now ROCm?

At the end of the day, I can't run ANY accelerated workloads using Intel graphics or AMD... because there's simply no software support anywhere.

OTOH, if you have an Nvidia stack... boom, you get accelerated Python: https://developer.nvidia.com/how-to-cuda-python

Are you running containerized workloads on Kubernetes? It has BAKED-IN support for nvidia-docker (https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus...)

Is anything going to change ?

[1] https://github.com/tensorflow/tensorflow/issues/22


> At the end of the day, I can't run ANY accelerated workloads using Intel graphics or AMD... because there's simply no software support anywhere.

You can. There's an Intel framework, OpenVINO, that targets Intel hardware (processors and HD graphics); you can convert a TF graph to Vino and use their inference server, which mimics TF Serving.

There's a TF ROCm port by AMD as well.


Huh, that's strange. I remember in the infancy of cryptocurrency mining (back before specialized ASIC hardware), OpenCL was far superior to CUDA, and AMD cards were doing 10X the hashrate for the price compared to Nvidia cards. What changed in the interim? Is SHA256^2 just a completely different workload than Tensorflow, or has Nvidia pulled ahead?


It's a different workload. Afaik, AMD cards are still better hash-per-dollar for the coins that are gpu-mineable.


I'm pessimistic. I doubt things will change unless all the major platform owners can come to an agreement on a common API for GPU compute. If NVIDIA were the only holdout, they might be forced to support it.

As it stands, CUDA is the only API with a decent implementation available on the three major desktop OSes. It's ridiculous.


Does AMD/Intel have any hardware worth switching for or is it just an ideological question?


Is Tensorflow+OpenCL a feature that a lot of folks are demanding? shrugs


Yes, in the sense that it creates actual competition, and will presumably mean that datacenter cards for ML will lose Nvidia's $5-10k markup.


There is zero demand, since there is ROCm support for TensorFlow.


This is super exciting! Brings a bit of competition to NVIDIA for ML-related tasks, while being more "open" (to some extent) than the TPU ASICs (because you won't have single-cloud lock-in). In any case, good to see Intel finally waking up.


Is Intel planning TPUs or GPUs?

I don't see how a CPU with AVX512 can compete with TPUs or GPUs, BF16 or not.


> (because you won't have single-cloud lock-in)

You don't really anyway. A GPU instance in Azure will beat whatever instruction-extension acceleration Intel does on CPUs by a mile either way.

Where there is lock-in is one level higher... for ML, all the other players are going FPGA, which I reckon is a bad move.


I wonder why they chose it over Facebook's 8-bit posit: https://code.fb.com/ai-research/floating-point-math/


‘(8, 1, alpha, beta, gamma) log’ isn't a posit; it just shares the encoding of the exponent.


bfloat16 is very similar to traditional IEEE 754 floating point, and Google+Intel already have history building hardware for it.


What does the "int8/32" mean in that paper?


4x8-bit values in a 32-bit value? SIMD does this to perform an operation on 4 8-bit values in a single 32-bit register. There are other configurations, IIRC.

https://en.wikipedia.org/wiki/SIMD#Software


In that context it's likely 8 bit operands and a 32 bit accumulator.
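
If that reading is right, the pattern would be something like this numpy sketch (my own toy example, not the paper's code): 8-bit operands multiplied pairwise, with the products widened and summed in 32 bits so they can't overflow.

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.integers(-128, 128, size=64, dtype=np.int8)
    b = rng.integers(-128, 128, size=64, dtype=np.int8)

    # Widen the 8-bit operands, multiply, and accumulate in int32; the worst
    # case here is 64 * 128 * 128, which fits comfortably in 32 bits.
    acc = np.sum(a.astype(np.int32) * b.astype(np.int32), dtype=np.int32)
    print(acc)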


bfloat16 sounds like it could be supported with minimal changes to existing floating point units, maybe with just some improved microcode.

FB's approach, on the other hand, requires entirely redesigned and separate execution units. That's harder to justify; that silicon will remain dark for non-DL usage.


It looks like a standard 32-bit float but with the mantissa truncated to fit into 16 bits of storage.

I imagine it still keeps most of the performance benefits, since it eliminates around two-thirds of the longest binary component (the mantissa).


They're not really "grafting" it, they're implementing it as a first-class data type.


Definitely feels click-bait-y.

Ditto for them calling it "Brain Floating Point" [1]. I mean it appears to be a typical floating-point numeric data type with some of the precision truncated to reduce its cost.

I guess they might be trying to make it sound super-fancy to preempt people seeing it as cheap and/or low-tech?

[1]: https://en.wikipedia.org/wiki/Bfloat16_floating-point_format


It's called brain floating point because it was developed by Google's "Google Brain" machine learning project.


Yeah, I get the marketing perspective; just, we typically describe primitive data types in terms of what they are from a technical perspective rather than a marketing perspective.


fp16 & fp8 refer to IEEE 754-style floating point, which has notable differences from bfloat16 and makes them perform worse for machine learning.


Yeah the point is 'reduced precision FP16' would make a lot more sense - since it actually describes what it does. 'Brain' floating point makes you wonder how Neuralink are planning to use this datatype.


It's not really first class if you have to CVT BF16 to FP32 if you want to use FMA.


FMA on BF16 uses FP32 for the accumulator for accuracy reasons, but you can have instructions for "Rm = Rm + low(Rn) * low(Rp)" and "Rm = Rm + high(Rn) * high(Rp)", and they can be faster than FP32 "Rm=Rm+Rn*Rp" because there are fewer bits in the mantissas.

Also, converting BF16 to FP32 and back is just a vector shuffle that sticks/drops 16 extra mantissa bits at the end of each BF16 value, so it's cheaper than other floating point conversions. This means that even if you occasionally have to escape to FP32, the overhead is low and you keep all of the memory bandwidth benefits.
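
To make the accumulator point concrete, here's a pure-Python sketch (the helper names are mine, and truncation stands in for the real rounding): accumulating bfloat16 products into a bfloat16 running total stalls once the total dwarfs each product, while a wide accumulator keeps them and only the final result gets truncated.

    import struct

    def f32_to_bf16_bits(x):
        return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

    def bf16_bits_to_f32(b):
        return struct.unpack(">f", struct.pack(">I", b << 16))[0]

    def round_bf16(x):
        # Truncate toward zero to the nearest representable bfloat16.
        return bf16_bits_to_f32(f32_to_bf16_bits(x))

    a = [round_bf16(0.01)] * 1000
    b = [round_bf16(0.01)] * 1000

    acc16 = 0.0
    for x, y in zip(a, b):
        acc16 = round_bf16(acc16 + x * y)   # keep the accumulator in bfloat16

    acc32 = 0.0
    for x, y in zip(a, b):
        acc32 += x * y                      # keep the accumulator wide (FP32 on hardware)

    print(acc16, round_bf16(acc32))         # the bfloat16 accumulator stalls far short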


That's a good analysis. I particularly like the idea that if the overhead is low it gets hidden in the memory BW savings.


Newbie question: what are the typical and extreme values (excluding +/-infinity, and are these used too?) that can occur in training/running of NNs? Also, what level of accuracy is needed?

It may well be a stupid q but I really don't know, and always assumed they would be [-1..+1] and that fixed point would suffice. Clearly not.


Nope, no standardization at that level.


7 bits of precision.... or 2.1 decimal digits.

So not quite as good as a 6-inch slide rule.

Edit: voters in this thread seem pretty uptight.


What would be the wattage of a typical six-inch slide rule performing calculations as quickly as a top-of-the-line Intel microchip? Or, if that would be physically impossible because of speed-of-light considerations, what would be the wattage of n slide rules in parallel performing such computations such that it adds up to the throughput of a microchip?

Speaking of boiling the ocean . . .


I am sorry for Intel. Perhaps John Gustafson’s 16 bit posits or unums would have made a better choice.


Why? Google has certainly researched their floats before committing an entire line of silicon chips. It's easy to just enumerate all possible float16 configurations in a simulator to see which one performs best on a wide range of neural network applications. Then pick the best one. Big data driven organizations do this all the time (brute force through an entire line of solutions, pick best results).
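
A sketch of the format-level part of that search (the real evaluation would of course run actual training workloads, which this doesn't): enumerate every way to split 16 bits into sign/exponent/fraction and compare dynamic range and precision.

    import math

    for ebits in range(2, 9):
        fbits = 15 - ebits                      # 1 sign bit
        bias = (1 << (ebits - 1)) - 1
        max_normal = (2 - 2.0 ** -fbits) * 2.0 ** ((1 << ebits) - 2 - bias)
        min_normal = 2.0 ** (1 - bias)
        digits = (fbits + 1) * math.log10(2)    # includes the hidden bit
        tag = {5: " <- IEEE fp16", 8: " <- bfloat16"}.get(ebits, "")
        print(f"e={ebits} f={fbits}: range ~[{min_normal:.1e}, {max_normal:.1e}], "
              f"~{digits:.1f} decimal digits{tag}")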


Also, a format that's just a slight tweak on IEEE-754 is going to be way easier to implement on existing hardware than a wholly new format.


bfloat16 is "best" primarily because it has the same exponent range as float32. That makes it easy to port models that have been developed using float32 to bfloat16. (As opposed to using int8 or float16, both of which have a smaller exponent range.)

It's possible that some other custom format is better in absolute terms for models trained specifically for the custom format. But for the current ecosystem, where models are trained primarily using float32, bfloat16 is a very good choice.


I know nothing about ASIC or CPU simulators, but I suspect that it's not as easy as you make it sound: for machine-learning related tasks, performance doesn't only come from raw compute numbers. You'll also want to model the actual data movement costs across the cache hierarchy and registers, because a lot of the time training is not necessarily compute-bound: the relative cost of data transfer (vs compute) can be quite high, or even dominate.


fp16 and bfloat16 are the same size, so there's no difference in data transfer rates.


You can also implement them both on an FPGA and take performance numbers that way.


Why would you want to introduce a floating point type with completely different and incompatible behavior when you can just change the mantissa of an existing one and reuse everything already built around it?


For the same number of bits, posits are quite a bit more expensive to implement in terms of area than traditional floats.


That's not true, as I have implemented both in an FPGA. The INRIA implementation missed key optimizations in the adder and multiplier.


The INRIA paper was indeed my reference.

It seems like a huge mistake if they missed key optimizations, but I'm happy to take your word for it.

Are there write-ups that go in detail about these mistakes? It's the kind of somebody-is-wrong-on-the-Internet topic that would result in flaming blog posts. :-)


Not really. If you look at the INRIA pseudocode, they check if the posit is negative or positive before doing addition, and convert, in the style of 754's sign-magnitude encoding, but you shouldn't need to do that with posits since the encoding is two's complement.

I mean, I helped design the posit spec, and the two's complement treatment is something not even John Gustafson understands... The key insight is that the hidden bit is -2 for negative numbers (instead of 1, as it is for positive numbers). It's kind of nonobvious and I happened upon it by accident one night while fooling around with circuit diagrams. If people really get serious about it, I'm sure, though, that it will get rediscovered by EDA folks smarter than I am.


That paper used High-Level Synthesis, which is the equivalent of coding something in Ruby and comparing it with another algorithm written in optimized assembly.


That's surprising; I wouldn't have guessed it is something they would do so easily, especially considering Intel processors aren't something you generally use to train NNs.

But I'd be rather glad if they implemented unums already.


I absolutely train on Intel from time to time at work. If your model fits, cpu is a dream. The driver never breaks, they never crash and lock up your display...


It looks somewhat similar to the minimal size of compact float [1] (a format I developed for data communication), but with 2 more bits because I use them for sizing.

[1] https://github.com/kstenerud/compact-float/blob/master/compa...


Does bfloat16 have any other uses than deep learning?


It isn't really very good for machine learning - machine learning doesn't need 8 bits of exponent.

For weights during training, 7 bits of mantissa also seems a bit low - it's common for weights to adjust much less than 1% during a single batch of training, which this couldn't represent.

I think this is more a "we want something which is faster but is compatible with existing code written for fp32's".
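
A quick illustration of that concern (bfloat16 rounding done by hand via truncation here; no framework involved): a weight of 1.0 can't absorb an update smaller than its bfloat16 spacing of 1/128.

    import struct

    def to_bf16(x):
        # Truncate a float to the nearest-toward-zero bfloat16 value.
        bits = struct.unpack(">I", struct.pack(">f", x))[0] & 0xFFFF0000
        return struct.unpack(">f", struct.pack(">I", bits))[0]

    w = to_bf16(1.0)
    print(to_bf16(w + 0.003))  # 1.0       -- a 0.3% update is lost entirely
    print(to_bf16(w + 0.01))   # 1.0078125 -- a 1% update survives, coarsely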


Bfloat is specifically designed for ML. It is the native type in Google's TPUs. It is quite good at ML; most models that work with fp32 work with bfloat with no adjustments; that's in contrast to IEEE fp16.

You're right that the mantissa is small. The trick is that you always accumulate into fp32 and then truncate down to 16 bits at the end. You'd do this for any 16-bit floating type.

Source: I work on this at Google.

https://cloud.google.com/tpu/docs/bfloat16


Have you evaluated alternatives such as posits, Kulisch accumulation, and zfp? https://arxiv.org/pdf/1805.08624.pdf https://arxiv.org/pdf/1811.01721.pdf https://insidehpc.com/2018/05/universal-coding-reals-alterna...

In particular the latter describes a generic framework that can be used to generate a lot of different number systems. Could hardware implement this, allowing us to compose and choose the number system by just setting some simple flags?


Is there a principled way to not overfit? Or is the best we can do not overtraining or reducing the precision of the calculations?


regularization, dropout, early stopping etc.


Why "graft"? It's not something that's foreign to them. This promises to essentially double the performance of Intel chips on an increasingly important workload, and also simplify the modeling work, because the models don't experience any accuracy drop when simply converted to bfloat16, unlike with quantization, where it's model dependent and finicky AF. I'd much rather do fp16 or bfloat16 at inference time, without constant pain that is quantization. I hope ARM, AMD and RISCV pay attention and implement this in the exact same way, so that models could be portable.


I read this article as saying, "hey we can emulate bfloat16 pretty well in software on top of our existing hardware features". That's what "graft" and "minimal impact" mean to me.

Intel (for better or worse) takes a very experiment-results-driven approach to choosing which features to implement in hardware. So this result -- that software emulation of a feature works almost as well as a hardware implementation would -- probably makes Intel less likely to implement the feature in hardware.

ARM, AMD, RISCV etc, will probably come to similar conclusions.


The article links to another saying that bfloat16 is coming to Xeons. Also, the title of the current article says it too. There's nothing implying it's less likely if you read the articles.

https://www.nextplatform.com/2018/12/16/intel-unfolds-roadma...


Ouch, you're right! I misunderstood.


Could we get a Bfloat32?


Hard to infer humor here sometimes -- were you kidding?

Just in case, since I can see someone else being serious about this, I think the gist is that neural-networks tend to be fairly approximate things such that we're not particularly concerned with having a lot of precision in many cases. This use-case wouldn't seem to demand a 32-bit variant too often.

But.. if you want it anyway...

Higher-bit extensions would seem to be floating-point values that favor range-over-precision more than typical floating-point numerics with the same bit-count.

If we take that to an extreme, we can talk about ranges over infinities and infinitesimals -- this is much like the hyperreal number system [1].

And ya know what's funny?

Some guy's been pushing for such a primitive numeric data type [2] since the early-2000's [3]!

[1]: https://en.wikipedia.org/wiki/Hyperreal_number

[2]: http://wwwinfo.deis.unical.it/yaro/EMSS_Sergeyev.pdf

[3]: https://patents.google.com/patent/US7860914


You already have that; it's more formally called IEEE 754 single precision. Requires double the memory bandwidth (bad) for extra precision that back propagation would have corrected for anyways.


Seems like BFloat32 would have the same exponent range as Float64 (double precision), and a correspondingly shorter significand.
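
For fun, a purely hypothetical "bfloat32" along those lines (the top 32 bits of an IEEE double: 1 sign, 11 exponent, 20 fraction bits; nothing like this actually exists in hardware):

    import math
    import struct

    def to_b32(x):
        # Keep only the top 32 bits of the float64 encoding.
        bits = struct.unpack(">Q", struct.pack(">d", x))[0] & 0xFFFFFFFF00000000
        return struct.unpack(">d", struct.pack(">Q", bits))[0]

    print(to_b32(math.pi))                 # ~6 decimal digits survive
    print(to_b32(1e300), to_b32(1e-300))   # float64's full exponent range survives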


Ohh intel is awesome


I desperately want Chinese companies to finally begin designing and producing general-purpose CPUs, GPUs, and other types of accelerators. The current situation is terrible: duopolies and monopolies, a slow pace of innovation, and low reliability. We need more players.


Finally Bloomberg will be able to report on real hardware backdoors!


They already do.


Something about Google being able to influence features in consumer grade CPUs rubs me the wrong way.


I'd say that Google did the numerical analyses of the format, then proved out the memory bandwidth improvement and the behavior with regard to large production machine learning models with their TPUs. So they derisked the numerical format. That, plus how simple it is to implement given you already support IEEE 754 single precision (just fewer bits for the significand), and the lower overhead to convert to and from floats (relative to fp16), makes this format a no-brainer for Intel.


Compared to enterprises requesting hardware backdoors... err "out of band management" (Intel AMT), adding an instruction or two is relatively tame.


Ever since Intel has had major customers, those customers have been influencing the CPU roadmap in a major way (i.e., getting the features they want).


Not surprising though; new features show up where there's money to be made.


English is not my first language. I had never heard the term "graft", even though I consider myself quite literate in English. So here you go, for everybody else in my situation:

Graft, as understood in American English, is a form of political corruption, being the unscrupulous use of a politician's authority for personal gain.

Edit: by the way, I really couldn't reconcile the term with the article, and realized I was probably looking at the wrong definition. This one might be much more apt:

a shoot or twig inserted into a slit on the trunk or stem of a living plant, from which it receives sap.


That definition definitely does not apply to this usage. You are looking for "to join (one thing) to another as if by grafting, so as to bring about a close union." (etymology 1, verb, definition 4 on Wiktionary [1]).

[1] https://en.m.wiktionary.org/wiki/graft


I believe this version is more relevant: "transplant (living tissue) as a graft."


I really don't get it: I was downvoted (currently at -4) on the parent comment because I was confused by a word and tried to clarify what it meant.

Does that really deserve downvoting? If so, explain it to me please, because I just can't see why.


No, you don't deserve it. In its non-horticultural sense, "graft" is a particularly tricky word, because it's familiar to most native English speakers, but can mean almost opposing things in England and America. So people often think they understand what is being said, while actually misunderstanding each other. Here's an article on the topic: https://separatedbyacommonlanguage.blogspot.com/2012/01/graf....

On the other hand, many of the readers of this blog are native English speakers, and to most of us the metaphorical meaning of the title was clear. A very useful function of voting is to re-order the comments on the page. This is a useful comment, but only for a small subset of readers. As such, it's perfectly reasonable that it would appear after the other more technical comments.

Which is to say that while you don't deserve to be downvoted, the comment arguably does. I personally upvoted it, because of my personal experience with intercontinental miscommunication involving the word "graft", but I can see why others would want to prioritize other comments. Not because it's a bad comment --- I'm sure a few people found it really helpful --- but because it's a meta-comment on a technical site that helps only a small portion of the audience.

So you should proudly keep making helpful comments like this, and not take it at all personally when they are downvoted to the bottom of the page. This is a case where you should feel confident that you did the right thing, despite the apparent feedback.


Thanks for sharing your point of view, I appreciate it.

In this situation I would personally not downvote, but rather upvote the other more technical comments, which provides a similar result, without penalizing the commenter.

Your interpretation might be right, but a downvote will often be interpreted as "I didn't contribute to the conversation". The fact that "graft" is confusing, and that I am trying to shed light on it, is a way for me to try contributing, and therefore, in my view, shouldn't be penalized.


insert or fix (something) permanently to something else, typically in a way considered inappropriate.

This is the definition they most likely meant.



