Dropping denormals is a huge mistake. This is easy to see if you draw out a number line for a very tiny floating-point format, for example with a 2-bit exponent and a 2-bit fraction. (do this on a sheet of graph paper) Without denormals, there is a huge gap surrounding zero.
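If you don't have graph paper handy, here's the same exercise in a few lines of Python (a sketch assuming an IEEE-style layout: exponent bias 1, all-ones exponent reserved for inf/NaN):

    # Toy format: 1 implicit sign, 2-bit exponent (bias 1), 2-bit fraction.
    def values(with_denormals):
        vals = {0.0}
        for e in range(3):                    # exponent fields 0..2; 3 would be inf/NaN
            for f in range(4):                # 2-bit fraction
                if e == 0:
                    if with_denormals and f:
                        vals.add((f / 4) * 2 ** (1 - 1))      # (0.f) * 2^(1 - bias)
                else:
                    vals.add((1 + f / 4) * 2 ** (e - 1))      # (1.f) * 2^(e - bias)
        return sorted(vals)

    print(values(True))    # [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5]
    print(values(False))   # [0.0, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5]  <- gap from 0 to 1.0

Without denormals the gap between zero and the smallest positive number is four times the spacing just above it.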
Strangely, the infinities were kept. Treating these as NaN is far less harmful than dropping denormals. Treating -0.0 as 0.0 and never producing -0.0 would be less harmful. (the PDF didn't say what happens) Even treating NaN values as normal numbers is probably less harmful than screwing up the denormals.
IEEE floating point has lots of crazy stuff to annoy hardware vendors. Most of it isn't all that important, but denormals matter.
bfloat16 is only really intended for ML purposes, and denormals don't really matter there, especially given they accurately multiply-accumulate into 32 bit floats.
I'm not an expert in the nitty gritty, and I've heard conflicting information, but my rough understanding is that it would be affordable and not too difficult for Xeon processors, but relatively more expensive for the dedicated Nervana neural network processors.
But then you need either new denormal-accepting instructions or, worse, a new global state bit enabling bfloat16 denormals, all to support use cases probably over two orders of magnitude less common than ML. What's the compelling reason to bother? Note that you need to support the denormal-disabled case because you'll want compatibility with Nervana.
> Intel did add a global state bit. It just isn't useful because you can't modify it.
You mean the CPUID bit? That's free. Toggling denormals isn't.
> Nervana is discontinued, isn't it? Compatibility doesn't matter. It's pretty compatible anyway, as long as you aren't demanding bit-identical output.
Nervana isn't discontinued according to their website[1], and bitwise compatibility does matter, certainly more than denormals do.
I mean the bit to toggle denormals, not the one to identify support for the opcodes.
Denormals are far more important than bitwise compatibility. To be clear, you would still be able to load a Nervana-produced number into a processor that supports denormals, and the other way would work too. You'd just avoid mangling numbers that are near zero.
If you still think denormals don't matter, seriously do what I suggested: draw it out on graph paper. They matter.
> I mean the bit to toggle denormals, not the one to identify support for the opcodes.
bfloat16 doesn't handle denormals, so why is there a bit to toggle it? What's it called (so I can Ctrl-F for it)?
> If you still think denormals don't matter, seriously do what I suggested: draw it out on graph paper. They matter.
No, I get how denormals work; I know what you're pointing at. But ML genuinely doesn't care, and neural nets don't give a damn about mathematical purity[1]. In contrast, compatibility matters because ML gives you no guarantee that a model isn't depending on the behaviour at small values, and minor differences in rounding do cause issues. For example, Leela Chess Zero had difficulties with reproducibility because different GPUs round floats differently.
It's two bits, DAZ and FTZ. (seems like "denormals are zero" and "flush to zero")
Bfloat16 obviously can handle denormals; the encodings exist. There would be no need to handle the issue at all if they didn't.
As hex, these would be denormal: 0x0001 to 0x007F, and 0x8001 to 0x807F. It's the same as plain old 32-bit IEEE with half the bits lopped off.
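You can check by widening them to float32 (quick Python sketch; a bfloat16 is just the top half of an IEEE 754 single, so append 16 zero bits):

    import struct

    def bf16_to_f32(bits16):
        # reinterpret the bfloat16 bit pattern as the high 16 bits of a float32
        return struct.unpack('>f', struct.pack('>I', bits16 << 16))[0]

    print(bf16_to_f32(0x0001))   # ~9.2e-41, smallest positive denormal
    print(bf16_to_f32(0x007F))   # ~1.17e-38, largest denormal
    print(bf16_to_f32(0x0080))   # 2^-126 ~= 1.18e-38, smallest positive normal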
ML is all about difficulties with reproducibility. I don't see a reason to get upset about denormals when a 3D-printed turtle can be confused with a rifle.
> It's two bits, DAZ and FTZ. (seems like "denormals are zero" and "flush to zero")
You mean the standard ones for normal floats? You certainly wouldn't want to reuse that for bfloat16s.
> ML is all about difficulties with reproducibility. I don't see a reason to get upset about denormals when a 3D-printed turtle can be confused with a rifle.
These are different things, despite the similarity in terminology.
I met Naveen Rao after Intel bought Nervana. He seemed pretty adamant about getting stuff shipped fast. In contrast, the Xeon folks own all the politics and seem to want the transition to be very gradual. Plus the Phi folks get phased out. They had done a Nervana trial at Facebook but then flaked on other trials. Clearly Intel is desperately trying to manage their books.
Having Nervana and friends on a Xeon chip could be a huge positive change for software. Not only could we toss out the issue of GPU memory transfer, but Nvidia GPUs aren’t so great with concurrency, and here with the linux kernel we might have a chance to beat Nvidia. Naveen sure would like that... Nervana once had a Maxwell compiler that was better than Nvidia’s.
They had an assembler where one person wrote kernels that were faster than cublas in a lot of cases. Afaik, nobody ever released anything else with that assembler, and Nvidia caught up to that performance quickly. In talking with the cublas devs, it seemed more that maxas kernels were highly tuned for specific sizes, whereas cublas/cudnn had to be more general.
Nowadays it's really a moot point with Nvidia's Cutlass being open source.
True story-- the Nervana Maxwell stuff didn't go very far-- but it was noteworthy because they had both that small win as well as their own hardware platform.
One other thought about the Nervana-Xeon convergence is that the support for more memory (thru DDR, Optane, or even just mmap'ed NVME) will be a big win for modeling and large-minibatch SGD. For example, the minibatch fetching could be pushed to the hardware / OS instead of Tensorflow (or the crazy guy behind Tensorpack) using a threadpool. A lot of training is still I/O bound at some level, and processors only support so many PCI-e lanes...
A V100 GPU gets 900GB/s of memory bandwidth. I am less of a CPU expert but afaict you'll be lucky to get much more than 10% of that out of a CPU. This is going to make a huge difference that Intel can't make up with bigger execution units.
bfloat helps with this because the data is half as large. But of course if you're doing Nvidia you're probably already doing (IEEE) fp16.
Correct, GPUs can do SGD iterations much much faster once the data is in GPU RAM. But if you have a ton of data relative to compute, you might be better off with a shallower model and/or fewer iterations. Here’s one start-up serving these sorts of projects: https://www.memverge.com/
When GPUs kickstarted deep learning research in 2012, people had already studied shallow models on mapreduce for a decade or so. Once NVME / Optane and modern CPUs get 1-10TB of useful “memory” in the hands of grad students, there should be another wave of new research. To date, my experience has been that 1TB of “memory” is only commonly available in industry.
Is that intel ARK published spec sheet bandwidth, or actual usable bandwidth? There is a difference.
I've found I get about 75-80% of the advertised bandwidth both from my real app (TLS crypto) and a toy memory copy benchmark using AVX256 instructions. The toy memory copy benchmark is how I realized that my bottleneck was actually memory bandwidth and not CPU horsepower on Broadwell based servers.
I haven't a clue. I just got it off the link quoted. It's a good question and when there's a difference, you know what marketing will say.
To make a stab, I suppose it might depend on whether all requests are coming from a single memory bank or spread evenly across all memory banks, assuming fully populated (again from the link "Octa-channel (up from hexa-channel)")
Intel's achievable bandwidth is dominated by how many cores/threads are accessing memory simultaneously. So with many scientific libraries you can get about 80% of the rated throughput. And the V100 will not give you 900GB/s either; that's the theoretical figure, but in practice it's about 750GB/s.
That's somewhat nitpicking. Cutlass is 90-95% of cublas in many cases, and there are things Cutlass can do that cublas can't. Cublas will always be marginally faster, but it won't matter in most cases, especially if that means saving a kernel call.
maxas still exists, and works for Pascal too. A Volta/Turing assembler like maxas would be nice, but there is no Nervana doing the microbenchmarking to identify the instruction encoding well enough.
> At this point, Intel doesn’t have bfloat16 implemented in any of its processors, so they used current AVX512 vector hardware present in its existing processor to emulate the format and the requisite operations. According to the researchers, this resulted in “only a very slight performance tax.”
Why implement bfloat if you get just slightly less performance emulating it with AVX512, which already exists? Maybe it’s an “us too” claim?
The reason to use float16 isn’t to make individual operations faster, it's to fit more numbers in vector registers. For example, AVX512 vector registers are 512 bits. That’s 8 64-bit floats, 16 32-bit ones, or 32 16-bit ones.
So a vector multiply using bfloat16 may not be much faster than one using float32, but it will do more multiplications per instruction.
> The reason to use float16 isn’t to make individual operations faster, it's to fit more numbers in vector registers. For example, AVX512 vector registers are 512 bits. That’s 8 64-bit floats, 16 32-bit ones, or 32 16-bit ones.
I'm unclear on the advantage you are trying to explain. If both AVX and bfloat16 are SIMD, that by itself can't be the reason implementing bfloat16 is better. I'm expecting something like "bfloat16 is more specialised so it can have larger registers" or something?
[Edit]
Sorry re-reading your comment, I think you are trying to say that. The key part being:
> fit more numbers in vector registers
(vs AVX I assume), so adding bfloat16 would provide more registers vs AVX with similar gate usage, due to greater specialization?
Not larger vector registers, smaller numbers, so that you can fit more numbers in a vector register without having to make the vector register larger.
For CPU-bound algorithms, one would expect that bfloat16 in 512 bit vector registers would be about equal in speed to float32 in (hypothetical) 1024 bit vector registers.
Also, for algorithms that are memory-bandwidth bound, halving the size of your numbers will (about) halve memory pressure.
> Not larger vector registers, smaller numbers, so that you can fit more numbers in a vector register without having to make the vector register larger.
Sorry, I'm pretty ignorant of AVX so I'm trying to understand... is this because the smallest float size in AVX is 32-bit, which takes twice the register space compared to bfloat16? Or rather, that with the same register space bfloat16 lets you hold twice as many numbers? (with no negative effect on convergence in NNs due to the exponent size)
AVX is SIMD; each AVX register is a large number of bits (256 or 512) that is packed full of a given data type, all of which can be operated on at the same time with the same instruction. So a 512-bit register could hold either 16 32-bit floating point numbers or 32 bfloat16s. In the latter case, every instruction you execute is operating on 32 different values.
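The lane counts, spelled out (just an illustration, not a benchmark; numpy has no native bfloat16, but it packs the same as float16):

    import numpy as np

    reg_bytes = 64                                      # one 512-bit AVX512 register
    for dt in (np.float64, np.float32, np.float16):     # bfloat16 packs like float16
        lanes = reg_bytes // np.dtype(dt).itemsize
        print(np.dtype(dt).name, "->", lanes, "elements per vector instruction")
    # float64 -> 8, float32 -> 16, float16 -> 32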
Taking my 7980xe as an example:
When it runs non-avx512 loads, I currently have it set to run at 4.1 GHz (all-core).
When running avx-512 heavy loads, it instead runs at 3.6 GHz -- and tends to get much hotter (70-80C instead of 50-60C). 3.6 GHz is a mild overclock; Silicon Lottery reports 100% can achieve that speed for avx512 loads.[1]
Running programs doing the same thing (eg, Hamiltonian Monte Carlo where the likelihood function has or has not been vectorized), the avx512 version is far faster than scalar, and routinely 50%+ faster than avx2.
The avx512 instruction set itself also provides conveniences that make it easier to explicitly vectorize, even if most compilers don't take advantage of them on their own. Masked load and store operations in particular (avx512 is better about masking to handle branches than earlier vector extensions).
On why avx512 vs a graphics card:
I need double precision, and my code routinely has maximum widths smaller than the 32 or 64 a graphics card would want to compute in parallel.
Yeah, people tend to completely exaggerate the impact of throttling from AVX512. It's only an issue when you do short bursts of AVX512 and the rest is not AVX512. If you do math and your math can be done in AVX512, even with throttling it's going to be substantially faster. That it runs hotter doesn't concern me at all. Intel's claimed safe Tjunction is something like 105C. EEs tend to take the published component specifications seriously (e.g. your 1000v diode is guaranteed to withstand at least 1KV of reverse voltage), so I trust Intel when they say things are fine up to that temperature. Even beyond that it won't burn out, it'll just thermal throttle.
I've been using Julia. I've been working on a front end meant to help specify vectorized models and their gradients. It is alpha-quality software (far from production ready), but here is the github:
https://github.com/chriselrod/ProbabilityModels.jl
In the example I give there, the logdensity and gradient evaluation was about 25x faster than Stan, and sampling was about 20x faster.
A simulation fitting many data sets for my dissertation took about 9 hours. 20x is the difference between running overnight, and taking a week.
This interleaving is going to cause problems for an autovectorizer. To get a SIMD vector of the scalars, you'd probably have to load two vectors and then blend them.
Even with arrays of doubles, I found Eigen's fixed-size arrays to get about 3-8x worse performance than my Julia library (for Mx32 * 32xN, over combinations of M and N = 3,...,32):
https://bayeswatch.org/2019/06/06/small-matrix-multiplicatio...
I compiled the Eigen benchmarks with:
g++ -O3 -fno-signed-zeros -fno-trapping-math -fassociative-math -march=native -mprefer-vector-width=512 -shared -fPIC -I/usr/include/eigen3 eigen_mul.cpp -o libeigenmul.so
> here is the github: https://github.com/chriselrod/ProbabilityModels.jl. In the example I give there, the logdensity and gradient evaluation was about 25x faster than Stan, and sampling was about 20x faster.
that looks pretty cool, though I don't yet know enough Julia to understand all of it. The speedups make sense given that Stan's compiler/math lib doesn't do much in the way of smart data layout. I would still keep in mind that the metric worth using for benchmarking is the number of effective samples per second, and this also depends on the HMC variant you use.
> Eigen's fixed size arrays to get about 3-8x worse performance than my Julia library
seems unsurprising that Julia can specialize a lot better than verbose C++ templating, no? (still, good job, very worth checking out)
> I was getting errors when specifying -DEIGEN_ENABLE_AVX512
I used this flag with Eigen 3.3.1, I think, on GCC 6 or 7. This was for Xeon Phi, so I tried to use icc but despite supporting C++11 it doesn't handle Stan or Eigen's template metaprogramming.
This is all the more reason to use Julia, but my graduate student days are long past..
> I would still keep in mind that the metric worth using for benchmarking is the number of effective samples per second, and this also depends on the HMC variant you use.
I was getting similar effective sample sizes per sample in both after switching to a diagonal mass matrix, like Stan uses, from the dense mass matrix that DynamicHMC.jl (the HMC backend I'm using) uses by default.
Given how common it is for folks to run Stan overnight or for a week to study prior sensitivity, internal coverage, type I and II errors, etc., via Monte Carlo, I think a focus on speed is worthwhile.
> seems unsurprising that Julia can specialize a lot better than verbose C++ templating, no? (still, good job, very worth checking out)
The C++ library Blaze did a lot better than Eigen, but still not as well.
But yes, Julia's metaprogramming is much easier to work with. Julia expressions are Julia objects that you can manipulate like anything else, so I can write all the functions I want describing how to generate matmul kernels as a function of matrix size and CPU info, and how to loop over them.
That approach feels much more straightforward. I haven't looked at the code bases of Eigen or Blaze, nor am I that familiar with template meta-programming. But I'd guess they define matmul recursively for arbitrary fixed sizes, and then have some templates defined for specific sizes (the kernels) -- or ideally have some clever way of generating the kernels from there.
Regardless, I agree that this is much easier in Julia. Aggressive specialization is also better aligned with Julia's compilation model in general, because methods get compiled just before they're used. Defining a million possible specializations doesn't have the cost of compiling a million specializations.
> This is all the more reason to use Julia, but my graduate student days are long past..
I'm defending this week, and next Monday will be my first day in an industry job.
They expressed openness to Julia, but my biggest fear is that they'll renege so that I'll only be able to work on or use Julia in my spare time at home.
That's an interesting comment. We've always used Stan's default of a diagonal, but I think we'd benefit from mixed metrics, which doesn't seem possible in Stan, but looks somewhat doable in some of the HMC libs in Julia.
> Given how common it is for folks to run Stan over night or for a week to study prior sensitivity
Yes, we changed the walltime on our Slurm cluster to support Stan jobs running up to a week long, and have used multiple millions of core-hours on this. Stan still isn't so shabby, but it's a hard problem.
> I'm defending this week, and next Monday will be my first day in an industry job.
Good luck and congrats on the job. You'll probably have to bite your tongue and look for opportunities where Julia's advanced compilation model (as you described well above) is going to more than pay for the cost of deployment/extra language etc.
I guess if it's easy enough to do with software, it won't be very difficult to implement with hardware either. So it doesn't really cost much additional die space.
As the exponent is the same size, it is mostly a matter of truncating. However, custom hardware could do more parallel 16-bit ops per AVX register. Presumably they are emulating by expanding to 32-bit in AVX512 registers.
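Presumably something like this (a numpy sketch of the widen-compute-narrow idea; truncation on the way back down, whereas real hardware would typically round to nearest even):

    import numpy as np

    def bf16_to_f32(x_u16):
        # widen: a bfloat16 is the high half of a float32, so shift left by 16
        return (x_u16.astype(np.uint32) << 16).view(np.float32)

    def f32_to_bf16(x_f32):
        # narrow by dropping the low 16 bits (truncation, for simplicity)
        return (x_f32.view(np.uint32) >> 16).astype(np.uint16)

    a = f32_to_bf16(np.float32([1.5, 3.14159, 1e-3]))
    b = f32_to_bf16(np.float32([2.0, 2.71828, 1e+3]))
    prod = bf16_to_f32(a) * bf16_to_f32(b)   # the actual math happens in float32
    print(prod)                              # narrow again with f32_to_bf16(prod) if needed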
> At the end of the day, I cant run ANY accelerated workloads using Intel graphics or AMD ....because there's simply no software support anywhere.
You can; there's an Intel framework, OpenVINO, that targets Intel hardware (processors and HD graphics). You can convert a TF graph to OpenVINO and use their inference server, which mimics TF Serving.
Huh, that's strange. I remember in the infancy of cryptocurrency mining (back before specialized ASIC hardware), OpenCL was far superior to CUDA, and AMD cards were doing 10x the hashrate for the price compared to Nvidia cards. What changed in the interim? Is SHA256^2 just a completely different workload than Tensorflow, or has Nvidia pulled ahead?
I'm pessimistic. I doubt things will change unless all the major platform owners can come to an agreement on a common API for GPU compute. If NVIDIA were the only holdout, they might be forced to support it.
As it stands, CUDA is the only API with a decent implementation available on the three major desktop OSes. It's ridiculous.
This is super exciting! Brings a bit of competition to NVIDIA for ML-related tasks, while being more "open" (to some extent) than the TPU ASICs (because you won't have single-cloud lock-in). In any case, good to see Intel finally waking up.
4x8-bit values in a 32-bit value? SIMD does this to perform an operation on 4 8-bit values in a single 32-bit register. There are other configurations, IIRC.
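One classic flavor of this, in case it helps (a sketch of SWAR, not necessarily what the parent had in mind): add four packed 8-bit lanes inside one 32-bit word without letting carries leak between lanes.

    H = 0x80808080        # the high bit of each 8-bit lane
    LOW = 0x7F7F7F7F      # the low 7 bits of each lane

    def add_bytes_swar(x, y):
        # per-lane add modulo 256, carries never cross lane boundaries
        return ((x & LOW) + (y & LOW)) ^ ((x ^ y) & H)

    print(hex(add_bytes_swar(0x01FF0304, 0x01010101)))  # 0x2000405: 01+01, FF+01, 03+01, 04+01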
bfloat16 sounds like it could be supported with minimal changes to existing floating point units, maybe with just some improved microcode.
FB's approach, on the other hand, requires entirely redesigned and separate execution units. That's harder to justify, since that silicon will remain dark for non-DL usage.
Ditto for them calling it "Brain Floating Point" [1]. I mean it appears to be a typical floating-point numeric data type with some of the precision truncated to reduce its cost.
I guess they might be trying to make it sound super-fancy to preempt people seeing it as cheap and/or low-tech?
Yeah, I get the marketing perspective; just, we typically describe primitive data types in terms of what they are from a technical perspective rather than a marketing perspective.
Yeah the point is 'reduced precision FP16' would make a lot more sense - since it actually describes what it does. 'Brain' floating point makes you wonder how Neuralink are planning to use this datatype.
FMA on BF16 uses FP32 for the accumulator for accuracy reasons, but you can have instructions for "Rm = Rm + low(Rn) * low(Rp)" and "Rm = Rm + high(Rn) * high(Rp)", and they can be faster than FP32 "Rm=Rm+Rn*Rp" because there are fewer bits in the mantissas.
Also, converting BF16 to FP32 and back is just a vector shuffle that sticks/drops 16 extra mantissa bits at the end of each BF16 value, so it's cheaper than other floating point conversions. This means that even if you occasionally have to escape to FP32, the overhead is low and you keep all of the memory bandwidth benefits.
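A quick way to convince yourself the escape is harmless (Python sketch: widening appends 16 zero bits, narrowing drops them, and BF16 -> FP32 -> BF16 is bit-exact):

    import struct

    def widen(bf16_bits):                      # BF16 -> FP32: append 16 zero bits
        return struct.unpack('>f', struct.pack('>I', bf16_bits << 16))[0]

    def narrow(f32_val):                       # FP32 -> BF16: drop the low 16 bits
        return struct.unpack('>I', struct.pack('>f', f32_val))[0] >> 16

    for bits in (0x3F80, 0x4049, 0x0001, 0xC2F7):   # 1.0, ~3.14, a tiny denormal, -123.5
        assert narrow(widen(bits)) == bits           # the round trip is exact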
Newbie question: what are the typical and extreme values (excluding +/- infinity; are these used too?) that can occur in training/running of NNs? Also, what level of accuracy is needed?
It may well be a stupid q but I really don't know, and always assumed they would be [-1..+1] and that fixed point would suffice. Clearly not.
What would be the wattage of a typical six-inch slide rule performing calculations as quickly as a top-of-the-line Intel microchip? Or, if that would be physically impossible because of speed-of-light considerations, what would be the wattage of n slide rules in parallel whose combined throughput adds up to that of a microchip?
Why? Google has certainly researched their floats before committing to an entire line of silicon chips. It's easy to just enumerate all possible float16 configurations in a simulator to see which one performs best on a wide range of neural network applications. Then pick the best one. Big data-driven organizations do this all the time (brute-force through the entire space of solutions, pick the best results).
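Even a back-of-the-envelope sweep over just the exponent/mantissa split is cheap (a toy sketch; a real search would obviously score each candidate on actual training workloads):

    # For each IEEE-style split of a 16-bit float (top exponent reserved for inf/NaN),
    # print the largest finite value and the relative precision near 1.0.
    for exp_bits in range(4, 9):
        man_bits = 15 - exp_bits                     # one sign bit
        bias = 2 ** (exp_bits - 1) - 1
        max_val = (2 - 2 ** -man_bits) * 2.0 ** (2 ** exp_bits - 2 - bias)
        print(f"e{exp_bits}m{man_bits}: max ~ {max_val:.3g}, ulp(1.0) = 2^-{man_bits}")
    # e5m10 is IEEE fp16 (max ~6.55e4); e8m7 is bfloat16 (max ~3.39e38, same range as fp32)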
bfloat16 is "best" primarily because it has the same exponent range as float32. That makes it easy to port models that have been developed using float32 to bfloat16. (As opposed to using int8 or float16, both of which have a smaller exponent range.)
It's possible that some other custom format is better in absolute terms for models trained specifically for the custom format. But for the current ecosystem, where models are trained primarily using float32, bfloat16 is a very good choice.
I know nothing about ASIC or CPU simulators, but I suspect it's not as easy as you make it sound: for machine-learning tasks, performance doesn't only come from raw compute numbers; you also want to model the actual data movement costs across the cache hierarchy and registers. A lot of the time, training is not necessarily compute-bound: the relative cost of data transfer (vs compute) can be quite high, or even dominate.
Why would you want to introduce a floating point type with completely different and incompatible behavior when you can just change the mantissa of an existing one and reuse everything already build around it?
It seems like a huge mistake if they missed key optimizations, but I'm happy to take your word for it.
Are there write-ups that go in detail about these mistakes? It's the kind of somebody-is-wrong-on-the-Internet topic that would result in flaming blog posts. :-)
Not really. If you look at the INRIA pseudocode, they check whether the posit is negative or positive before doing addition, and convert, in the style of 754's sign-magnitude encoding, but you shouldn't need to do that with posits since the encoding is two's complement.
I mean, I helped design the posit spec, and the two's complement treatment is something not even John Gustafson understands... The key insight is that the hidden bit is -2 for negative numbers (instead of 1, as it is for positive numbers). It's kind of nonobvious, and I happened upon it by accident one night while fooling around with circuit diagrams. If people really get serious about it, I'm sure it will get rediscovered by EDA folks smarter than I am.
That paper used High Level Synthesis which would be the equivalent of coding something in ruby and comparing it with another algorithm written in optimized assembly.
That's surprising; I wouldn't have guessed it's something they would do so easily, especially considering Intel processors aren't something you generally use to train NNs.
But I'd be rather glad if they implemented unums already.
I absolutely train on Intel from time to time at work. If your model fits, cpu is a dream. The driver never breaks, they never crash and lock up your display...
It looks somewhat similar to the minimal size of compact float [1] (a format I developed for data communication), but with 2 more bits because I use them for sizing.
It isn't really very good for machine learning - machine learning doesn't need 8 bits of exponent.
For weights during training, 7 bits of mantissa also seems a bit low - it's common for weights to adjust much less than 1% during a single batch of training, which this couldn't represent.
I think this is more a "we want something which is faster but is compatible with existing code written for fp32's".
Bfloat is specifically designed for ML. It is the native type in Google's TPUs. It is quite good at ML; most models that work with fp32 work with bfloat with no adjustments; that's in contrast to IEEE fp16.
You're right that the mantissa is small. The trick is that you always accumulate into fp32 and then truncate down to 16 bits at the end. You'd do this for any 16-bit floating type.
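The effect is easy to see at toy scale: a small weight update vanishes if you round the accumulator to bfloat16 at every step, but survives in an fp32 accumulator (numpy sketch, truncating to bfloat16 precision; rounding to nearest even gives the same outcome here):

    import numpy as np

    def round_to_bf16(x):
        # keep only the top 16 bits of each float32 value
        u = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (u & np.uint32(0xFFFF0000)).view(np.float32)

    w, update = np.float32(1.0), np.float32(1e-3)      # a 0.1% weight update
    print(round_to_bf16(round_to_bf16(w) + update))    # 1.0   -- update lost in bfloat16
    print(np.float32(w + update))                      # 1.001 -- kept by the fp32 accumulator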
In particular the latter describes a generic framework that can be used to generate a lot of different number systems.
Could hardware implement this, allowing us to compose and choose the number system by just setting some simple flags?
Why "graft"? It's not something that's foreign to them. This promises to essentially double the performance of Intel chips on an increasingly important workload, and also simplify the modeling work, because the models don't experience any accuracy drop when simply converted to bfloat16, unlike with quantization, where it's model dependent and finicky AF. I'd much rather do fp16 or bfloat16 at inference time, without constant pain that is quantization. I hope ARM, AMD and RISCV pay attention and implement this in the exact same way, so that models could be portable.
I read this article as saying, "hey we can emulate bfloat16 pretty well in software on top of our existing hardware features". That's what "graft" and "minimal impact" mean to me.
Intel (for better or worse) takes a very experiment-results-driven approach to choosing which features to implement in hardware. So this result -- that software emulation of a feature works almost as well as a hardware implementation would -- probably makes Intel less likely to implement the feature in hardware.
ARM, AMD, RISCV etc, will probably come to similar conclusions.
The article links to another saying that bfloat16 is coming to Xeons. Also, the title of the current article says it too. There's nothing implying "less likely" if you read the articles.
Hard to infer humor here sometimes -- were you kidding?
Just in case, since I can see someone else being serious about this: I think the gist is that neural networks tend to be fairly approximate things, such that we're not particularly concerned with having a lot of precision in many cases. This use case wouldn't seem to demand a 32-bit variant too often.
But.. if you want it anyway...
Higher-bit extensions would presumably be floating-point formats that favor range over precision more than typical floating-point numerics with the same bit count.
If we take that to an extreme, we can talk about ranges over infinities and infinitesimals -- that is, much like the hyperreal number system [1].
And ya know what's funny?
Some guy's been pushing for such a primitive numeric data type [2] since the early-2000's [3]!
You already have that; it's more formally called IEEE 754 single precision. Requires double the memory bandwidth (bad) for extra precision that back propagation would have corrected for anyways.
I desperately want Chinese companies to finally begin designing and producing general-purpose CPUs, GPUs, and other types of accelerators. The current situation is terrible: duopolies and monopolies, a slow pace of innovation, and low reliability. We need more players.
I'd say that Google did the numerical analysis of the format, then proved the memory bandwidth improvement and the behavior with large production machine learning models on their TPUs. So they derisked the numerical format. That, plus how simple it is to implement given you already support IEEE 754 single precision (just fewer bits for the significand), and the lower overhead to convert to and from floats (relative to fp16), makes this format a no-brainer for Intel.
English is not my first language. I have never heard the term "Graft", even if I consider myself quite literate in English. So here you go, for everybody else in my situation:
Graft, as understood in American English, is a form of political corruption, being the unscrupulous use of a politician's authority for personal gain.
Edit: by the way, I really couldn't square that definition with the article, and realized I was probably looking at the wrong one. This one might be much more apt:
a shoot or twig inserted into a slit on the trunk or stem of a living plant, from which it receives sap.
That definition definitely does not apply to this usage. You are looking for "to join (one thing) to another as if by grafting, so as to bring about a close union." (etymology 1, verb, definition 4 on Wiktionary [1]).
No, you don't deserve it. In its non-horticultural sense, "graft" is a particularly tricky word, because it's familiar to most native English speakers, but can mean almost opposing things in England and America. So people often think they understand what is being said, while actually misunderstanding each other. Here's an article on the topic: https://separatedbyacommonlanguage.blogspot.com/2012/01/graf....
On the other hand, many of the readers of this blog are native English speakers, and to most of us the metaphorical meaning of the title was clear. A very useful function of voting is to re-order the comments on the page. This is a useful comment, but only for a small subset of readers. As such, it's perfectly reasonable that it would appear after the other more technical comments.
Which is to say that while you don't deserve to be downvoted, the comment arguably does. I personally upvoted it, because of my personal experience with intercontinental miscommunication involving the word "graft", but I can see why others would want to prioritize other comments. Not because it's a bad comment --- I'm sure a few people found it really helpful --- but because it's a meta-comment on a technical site that helps only a small portion of the audience.
So you should proudly keep making helpful comments like this, and not take it at all personally when they are downvoted to the bottom of the page. This is a case where you should feel confident that you did the right thing, despite the apparent feedback.
Thanks for sharing your point of view, I appreciate it.
In this situation I would personally not downvote, but rather upvote the other more technical comments, which provides a similar result, without penalizing the commenter.
Your interpretation might be right, but a downvote will often be interpreted as "I didn't contribute to the conversation". The fact that "graft" is confusing, and that I am trying to shed light on it, is a way for me to try contributing, and therefore, in my view, shouldn't be penalized.
Numerical details: https://software.intel.com/sites/default/files/managed/40/8b...
Support for bfloat16 is already present in MKL-DNN (https://github.com/intel/mkl-dnn)
Disclaimer: I work for Intel