Supporting half-precision floats is annoying (futhark-lang.org)
61 points by Athas on Aug 6, 2021 | 39 comments



The big problem in this post seems to be compiling via C (and the awkward GPU extensions thereof), which has lousy Float16 support. LLVM has adequate support for at least representing Float16s. Because of this (and a lot of work by a lot of people), Julia has pretty good support for Float16. If you're running on hardware with native Float16 support like a GPU, it works and is fast; if you're running on hardware without native Float16 support, operations are implemented by converting to Float32 and back, which is slow but gives the same results. So you can run the program either way and get the same results, and it's fast if your hardware has native support. Same deal with BFloat16 [1], which is the native 16-bit floating point type on Google's TPUs.

[1] https://github.com/JuliaMath/BFloat16s.jl
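
A minimal C sketch of that "widen to f32, operate, narrow back" emulation strategy, assuming a compiler with the _Float16 type (f16_add is just a made-up helper):

    /* For a single +, -, * or / this is safe: f32 carries far more
       precision than f16 operands need, so the result rounds the same. */
    static _Float16 f16_add(_Float16 a, _Float16 b)
    {
        return (_Float16)((float)a + (float)b);
    }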


For what it's worth, the Arm "alternative half-precision" format is not very different from IEEE binary16. The difference is that instead of using the maximum exponent value (0x1f) to represent infinities and NaNs, the AHP format uses it the same way as any other non-zero exponent, to represent normalized fp values. The tradeoff is that you lose NaNs and infinities, but you roughly double the range of numbers you can represent (the max value goes from 65504 to 131008).

You have to select AHP by setting an FP config register bit, so (unlike bfloat16 vs binary16) it's a "for this whole chunk of code I am going to use this format" choice.
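
A rough C sketch of how the two interpretations differ (only the all-ones exponent case changes; half_to_float is a made-up helper, not an Arm API):

    #include <math.h>
    #include <stdint.h>

    /* Decode a 16-bit pattern as IEEE binary16, or as Arm AHP if ahp != 0. */
    static float half_to_float(uint16_t h, int ahp)
    {
        float    s    = (h & 0x8000) ? -1.0f : 1.0f;
        int      exp  = (h >> 10) & 0x1f;
        unsigned frac = h & 0x3ff;

        if (exp == 0)                     /* zeros and subnormals: identical */
            return s * ldexpf((float)frac, -24);
        if (exp == 0x1f && !ahp)          /* IEEE: infinities and NaNs */
            return frac ? NAN : s * INFINITY;
        /* Normal number; in AHP mode exp == 0x1f is just another exponent,
           which is how the maximum grows from 65504 to 131008. */
        return s * ldexpf(1.0f + frac / 1024.0f, exp - 15);
    }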


What happens on overflow or 0/0 with this format?


The alternative half-precision format only applies to conversions (i.e. to/from single and double precision); data-processing operations on fp16 ignore the AHP bit and always assume IEEE format. If you convert a NaN from single/double into AHP you get a zero with the same sign as the input; if you convert an infinity you get the max/min representable number; in both cases the InvalidOp fp exception bit is set.

The upshot is that it's basically an in-memory storage format, and all the actual data-processing gets done at either single or double precision.


The article makes a lot of good points. But there are some cases where f16 is very useful. In the context of deep learning it's frequently useful to move from f32 -> f16. This lets you fit models twice as large in the same memory (system or GPU/TPU). Since network size is often a determinant of performance, doubling the number of parameters/activations in your model can make a big difference.


Some years ago I used them to transfer real-time waveforms from a medical device via USB and BLE links, since they provided more than enough precision for the clinical application. The effective 2x bandwidth gain, without resorting to compression (and its computational overhead), allowed us to meet the project specifications just by changing the type of the data array and recompiling.
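
As a rough illustration of the idea (not our actual code, which was really just a type change; this sketch assumes a compiler with _Float16 support and uses a made-up pack_samples helper):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Convert f32 samples to 16-bit halves before sending them over the link. */
    static void pack_samples(const float *in, uint16_t *out, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            _Float16 h = (_Float16)in[i];   /* round to nearest half */
            memcpy(&out[i], &h, sizeof h);
        }
    }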


Ranged float, or would unorm/snorm have worked as well?


Depends on what you want to do and how you interpret your data. IIRC we only needed to normalize one parameter on the host, displaying it as a percentage in the UI instead. For the rest, everything translated without modification since f16 covered the max/min parameter ranges.

Mind you, this was for a regulated medical device, so any change in the software was burdensome. The change didn't affect the internal calculations since it was only for the communication protocol and UI, and the compiler support allowed minimal changes to both the embedded and host source code.


Great rant.

"just don't use f16" seems like the course of wisdom here.

I had a packed structure of three 5-bit values in something where memory pressure was severe, and it was such a bitch on a SPARC to deal with 16-bit quantities that running the data in two passes using more memory wound up being an immensely better approach. The Alphas would diddle yer bits any way you liked at speed, but that was clearly an aberrant ability.


"just don't use f16", when that is an option, tends to mean throwing away half of what your hardware is capable of. I'm not sure how wise that is.


Only after BWX landed - before that, you had to ensure all accesses were 4 or 8 bytes and aligned on a 4-byte boundary, then do the appropriate bitmasking and shifting.


Right. It was a noticeable pain point in early versions, so they fixed it good and hard for the last few.


16-bit floating point (as a weight storage format, not used for math) is essential for fast Winograd and Fourier convolution on AVX-512 CPUs. See https://NN-512.com


How is a storage format that's not used for math important for speeding up actual computations? I'm curious.


Many of these algorithms are bandwidth- or cache-limited on modern machines, so you can get significant speedup by storing your data in fewer bytes, even if you expand it in registers before actually doing computation on it.


We're reaching a point where it's often faster to store pages in RAM compressed with a fast algorithm like LZ4 and decompress them on access than to simply copy them uncompressed from RAM to L1 cache.


Exactly. In this case, the limit is memory bandwidth.


Wow, that's illuminating. I naively thought the overhead of converting between datatypes would mostly cancel out the savings from fewer cache misses. Though does this also have anything to do with the AVX-512 instructions?


Yes, this AVX-512F instruction makes fp16 to fp32 conversion efficient:

https://software.intel.com/sites/landingpage/IntrinsicsGuide...

The result is that Winograd convolutions can achieve an effective FMA rate of twice the peak rate of the CPU.

The Winograd transform reduces the required number of FMAs by a factor of 5x, but you can only do FMAs at half peak rate (because you are bandwidth limited), so you come out ahead by a factor of 2.5x in theory (2x in practice).

Without fp16, that 2x advantage would be lost.
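
A minimal sketch of the load-as-fp16, expand-in-registers pattern (not NN-512's actual code; assumes AVX-512F, and fma_fp16_weights is a made-up name):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Weights stay in memory as fp16; expand to fp32 in registers with
       VCVTPH2PS, then FMA at full width.  n is assumed a multiple of 16. */
    void fma_fp16_weights(const uint16_t *w16, const float *x,
                          float *acc, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            __m256i wh = _mm256_loadu_si256((const __m256i *)(w16 + i));
            __m512  w  = _mm512_cvtph_ps(wh);        /* fp16 -> fp32 */
            __m512  xv = _mm512_loadu_ps(x + i);
            __m512  a  = _mm512_loadu_ps(acc + i);
            _mm512_storeu_ps(acc + i, _mm512_fmadd_ps(w, xv, a));
        }
    }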


If the CPU only has 32-bit ALUs (afaik all CPUs today have 32+ bit ALUs), there is no reason to "support" half-floats, other than converting floats to half-floats for the GPU, which doesn't need to be fast or pretty since you do it beforehand and send the data directly to the GPU from the model file format.

On the GPU, on the other hand, 16-bit floats are becoming the standard (the M1 GPU for instance has more 16-bit ALUs than 32-bit ones). They have enough precision for plausible resolutions/world sizes, and you save 2x the memory, which makes it a no-brainer really.


If your CPU has vector/SIMD instructions then you might want fp16 format support for data processing so that you can operate on (say) a 128-bit vector of 8 fp16 values at once, rather than having to work with 4 fp32 values at a time. But I agree that there's a lot you can do with just load/store/conversion support.
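
For example, on AArch64 with the Armv8.2-A fp16 extension (a sketch, compiled with something like -march=armv8.2-a+fp16; add_halves is a made-up helper):

    #include <arm_neon.h>

    /* Eight fp16 lanes per 128-bit register instead of four fp32 lanes. */
    void add_halves(const float16_t *a, const float16_t *b,
                    float16_t *out, int n)
    {
        for (int i = 0; i + 8 <= n; i += 8) {
            float16x8_t va = vld1q_f16(a + i);
            float16x8_t vb = vld1q_f16(b + i);
            vst1q_f16(out + i, vaddq_f16(va, vb));
        }
    }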


The assertion that C doesn't support fp16 is just plain wrong. _Float16 is defined in standards committee work. There's also the __fp16 storage-only type widely supported by compilers.

The main issue is that many compilers have problems on x86 platforms due to Intel's bizarre slowness in defining how to pass fp16 parameters in their official ABI.


Do you have more details? I read https://gcc.gnu.org/onlinedocs/gcc/Half-Precision.html which made it look like a rather exotic feature that would not be available on all systems (and https://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html#Float... also makes it sound like __fp16/_Float16 is only supported on AArch64).


My experience is mainly with clang on arm64 systems, where it works just fine (give or take some terrible codegen at times when using LLVM's extended vector support).

Intel support is a bit more dubious, especially if you're stuck on older LLVM versions, but __fp16 works for most purposes if all you want is storage. You just have to pass any fp16 function parameters as pointers/references to work around the aforementioned ABI issue. Architecturally Intel has supported the fp16 conversions necessary for __fp16 for quite a while now, so there's no reason for it not to work (whereas full fp16 arithmetic is restricted to some less common variants of AVX-512 iirc, so the distinction between __fp16 and _Float16 doesn't matter much on most Intel systems).

It should be fairly straightforward to write a quick test program and see what your local compilers can cope with.
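
Something like this works as a quick probe (a sketch: __fp16 used for storage only, arithmetic done in fp32, halves passed by pointer to sidestep the ABI issue; sum_halves is a made-up name):

    #include <stdio.h>

    /* Pass by pointer rather than by value; arithmetic happens in fp32. */
    static float sum_halves(const __fp16 *v, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += (float)v[i];
        return s;
    }

    int main(void)
    {
        __fp16 v[4] = {1.0f, 0.5f, 0.25f, 65504.0f};  /* 65504 = max binary16 */
        printf("%f\n", sum_halves(v, 4));
        return 0;
    }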


This reminds me that at one point Lucene was using an 8-bit floating point format for its normalization factors (I don't know if they still do):

https://lucene.apache.org/core/3_0_3/fileformats.html#N107EF
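
For illustration, a minimal decoder for a generic unsigned 8-bit minifloat (here 5 exponent bits, 3 mantissa bits, bias 15; not necessarily Lucene's exact layout, just to show how little logic such a format needs; mini8_to_float is a made-up helper):

    #include <math.h>
    #include <stdint.h>

    static float mini8_to_float(uint8_t b)
    {
        int      exp  = (b >> 3) & 0x1f;
        unsigned frac = b & 0x7;
        if (exp == 0)                               /* zeros and subnormals */
            return ldexpf((float)frac, 1 - 15 - 3);
        return ldexpf(1.0f + frac / 8.0f, exp - 15);
    }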


Athas, now that you’ve added fp16, why not add bf16 as well? (A100s support bf16 natively, as do upcoming server CPUs).


Sure, why not. I don't think it would be that difficult, and can be emulated with single-precision just as well as fp16 for the systems that don't support it in hardware.
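
The emulation really is cheap, since bf16 is just the top half of an f32's bit pattern. A minimal sketch (truncation shown; real implementations typically round to nearest even):

    #include <stdint.h>
    #include <string.h>

    static uint16_t f32_to_bf16(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        return (uint16_t)(bits >> 16);     /* drop the low 16 mantissa bits */
    }

    static float bf16_to_f32(uint16_t h)
    {
        uint32_t bits = (uint32_t)h << 16;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }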


.NET recently landed (storage-only) support for f16 as well. Their article is nowhere near as interesting as TFA, but for reference: https://devblogs.microsoft.com/dotnet/introducing-the-half-t...


What are the reasons for changing the allocation of bits in bf16 vs. f16? Why are there no (few?) similar alternative allocation schemes for f32 and f64? Was IEEE's choice perfect for f32 / f64? How did they know? Why not for f16?

Does any hardware offer "configurable" bit allocation like f16[e=4,m=11]?


I think it's not so much that the IEEE f32/f64 choice was necessarily perfect as that it was "good enough", so it's not worth the hardware cost of handling multiple formats or the headache of picking a single choice that's something else. With f16, because you only have 16 bits, the tradeoffs are suddenly much sharper: you don't have enough bits for both a reasonable representable range (large exponent field) and reasonable precision (large mantissa field). For example, binary16 (5 exponent bits) tops out at 65504 with roughly 3 decimal digits of precision, while bfloat16 (8 exponent bits) covers the full f32 range of roughly 3.4e38 with barely more than 2 decimal digits. So you must trade one against the other, and it can be worth the extra hardware to support two points in the tradeoff range.


Huh? Half precision floats speed up machine learning training and inference enormously. It's weird to me to argue that we should give up hours of shaved-off waiting time in favor of preserving the ability to write a new language over a weekend.


The entire blog post is literally an explanation of why implementing support for fp16 in Futhark was annoying. Let me repeat the key bit: support for fp16 in Futhark. As in, it exists, and you can use it. Right now! Nobody is stopping you from using it, or encouraging you not to use it. In fact the author literally gave you the ability to use it in his language, because clearly people want to use it. The literal second paragraph acknowledges not only that fp16 is popular in machine learning, but that there are actually two popular formats now, including bf16!

The post is actually quite relatable if you have experience in silly toolchain issues, i.e. worked on a compiler or something like that. It's all fairly straightforward. Please try to read the article next time, you might surprise yourself.


No need for a tone like that


computation will be slower but writing new languages will be faster. Think of how many new languages you can implement while you wait for your computation to finish!!


Sometimes you need float packed in memory efficiently.

This might be useful in graphics, for example.

In general, I believe it is good to give people options. You never know what people will find useful.

But if you don't want to support it and you feel it will make your language better, just don't support it.

But please, quit bitchin' about it.

Java does not support unsigned integer types and everybody is fine.

And Java devs don't write rants about how unsigned integers are useless, supporting them is annoying and everybody should just forget about them.


The article didn't argue against f16 or anything like that. It just talks about some complications when implementing f16 support. I'm not sure what exactly you want the author not to bitch about.


Since Java 8 you have a way to treat an int as unsigned: https://docs.oracle.com/javase/tutorial/java/nutsandbolts/da...


Only technically.

But still, Java does not have unsigned 32- or 64-bit integer types, just a way to treat signed ones as unsigned if you need it badly enough to go out of your way, for example when you need to parse a binary format containing unsigned integers.

There is no way to declare an unsigned 32- or 64-bit integer.


The author literally implemented fp16 support in his language. You clearly did not even read the post, which was just about the minor technical hurdles to doing so. Please learn to read next time instead of wasting everyone's time with your own useless bitching.



