It's interesting today to see people act as though half (f16) is a completely normal, obvious type, whereas when I was first writing code nobody had this type; it was unheard of, and it was this weird new special case when the 3DFX Voodoo used it (hardware-accelerated 3D for the PC; as a special case, the "pass-through" for 2D video was, at first, a physical cable). The giveaway, if you're younger, is that it's sometimes called half precision. That's because single precision (32-bit floating point) was seen as the starting point years before.
I remember this whenever somebody says f128 will never be a thing, because if you'd asked me in 1990 whether f16 would be a thing I'd have laughed. What use is a 16-bit floating point number? It's not precise enough to be much use in the narrow range it can represent at all. In hindsight the application is obvious of course, but that's hindsight for you.
It's very popular in hardware-accelerated computer graphics. It has much more range, and a bit more precision, than the traditional integer 8-bits-per-channel representation of colour, so it is used for High Dynamic Range framebuffers and textures.
It's also ubiquitous as an arithmetic type in mobile GPU shaders, where it's used for things (like colours) that need to be floats but don't need full 32-bit precision. In many cases it doesn't just save memory bandwidth and register space, but also the shader core may have higher throughput for half precision.
Makes me wonder if there's a use case for "dynamic fixed point" numbers: say, for a 16-bit value, the upper 2 bits are one of four values that say where the point sits in the remaining 14. Say 0 (essentially an int), two spots in the middle, and 14 (all fraction). The CPU arithmetic for any operation is (bitshift + math), which should be an order of magnitude faster than any float operation. The range isn't nearly as dynamic, but it would allow for fractional influence. Maybe such a system would lack enough precision to be accurate?
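A rough C sketch of the idea (the type name, the 2-bit selector layout, and the 0/4/8/14 shift amounts are all just made-up choices for illustration, and overflow is ignored):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical "dynamic fixed point": top 2 bits select one of four
       scales for the remaining 14-bit two's-complement value. */
    static const int frac_bits[4] = {0, 4, 8, 14};

    typedef uint16_t dynfix;

    static dynfix dynfix_make(int sel, int16_t raw) {
        return (dynfix)(((sel & 0x3) << 14) | ((uint16_t)raw & 0x3FFF));
    }

    static int16_t dynfix_raw(dynfix x) {        /* sign-extend the low 14 bits */
        return (int16_t)(x << 2) >> 2;
    }

    static double dynfix_to_double(dynfix x) {
        return (double)dynfix_raw(x) / (1 << frac_bits[x >> 14]);
    }

    /* multiply is an integer multiply plus a shift by one operand's scale;
       the result keeps the other operand's scale */
    static dynfix dynfix_mul(dynfix a, dynfix b) {
        int32_t prod = ((int32_t)dynfix_raw(a) * dynfix_raw(b)) >> frac_bits[b >> 14];
        return dynfix_make(a >> 14, (int16_t)prod);
    }

    int main(void) {
        dynfix six     = dynfix_make(1, 6 << 4);  /* 6.0 with 4 fraction bits   */
        dynfix quarter = dynfix_make(3, 1 << 12); /* 0.25 with 14 fraction bits */
        printf("%g * %g = %g\n", dynfix_to_double(six), dynfix_to_double(quarter),
               dynfix_to_double(dynfix_mul(six, quarter)));   /* 6 * 0.25 = 1.5 */
        return 0;
    }

The frac_bits selector here is effectively playing the role of a (very coarse) exponent, which is what the reply below is pointing out.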
What you just described is exactly floating point numbers, you're just using a different split for the exponent and mantissa and not using the "integer part is zero" simplification.
Yeah, floating point is nothing more than your standard scientific notation of numbers, e.g.
digit.xyz... * 10 ^^ +/- some exponent
The exponent is simply shifting where the decimal point is. The only difference for floating point is that everything is base 2, because computers :D
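For a concrete example in C (frexp splits a double into a significand and a power of two; note it normalizes to the 0.x convention rather than the 1.x convention IEEE storage uses, so the exponent comes out one higher):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        int exp;
        /* 6.25 = 1.5625 * 2^2 in binary scientific notation;
           frexp reports the equivalent 0.78125 * 2^3 */
        double sig = frexp(6.25, &exp);
        printf("6.25 = %g * 2^%d\n", sig, exp);
        return 0;
    }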
Interestingly, you're right that a bunch of fp functions can be faster than their integer equivalents (although I'm still not convinced that this isn't simply due to the reduced number of bits involved), and, more fun, the relative performance of operations can actually change vs what it would be in integers. Also, this is in the context of doing it in software vs hardware, where again the perf costs of things change.
1. Since the normalized significand will always be 1.bbbb, the '1' bit is stripped from the significand representation, except:
2. To extend the range, the lowest 'zero' value of the exponent drops the leading '1'. This is referred to as the subnormal range
3. The highest exponent value, when the significand is zero, is used to represent positive and negative infinity
4. The highest exponent value with a non-zero significand is used to represent NaN
5. There are many different values usable for NaN by software, including a differentiation between 'quiet' NaNs and a (I believe implementation optional) 'signaling' variant, which will raise an interrupt when used. The idea is that these can be used to convey additional information, and that the signaling variant as well as the right interrupt handlers can be used to add additional functionality such as variable substitution.
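To make those rules concrete, here's a sketch of a binary16 decoder in C (bias 15, 10 stored significand bits; not a production converter, and it folds all NaN payloads into a single NAN):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    static double half_to_double(uint16_t h) {
        int sign = h >> 15;
        int exp  = (h >> 10) & 0x1F;   /* 5-bit exponent field */
        int frac = h & 0x3FF;          /* 10 stored significand bits */
        double s = sign ? -1.0 : 1.0;

        if (exp == 0x1F)                       /* rules 3 and 4 */
            return frac == 0 ? s * INFINITY : NAN;
        if (exp == 0)                          /* rule 2: subnormal, no implicit 1 */
            return s * ldexp(frac / 1024.0, -14);
        return s * ldexp(1.0 + frac / 1024.0, exp - 15);   /* rule 1 */
    }

    int main(void) {
        printf("%g %g %g %g\n",
               half_to_double(0x3C00),   /* 1.0 */
               half_to_double(0x0001),   /* smallest subnormal, 2^-24 */
               half_to_double(0x7C00),   /* +inf */
               half_to_double(0x7C01));  /* NaN */
        return 0;
    }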
Yes, I was giving a simplified description to convey how to translate what floating point is to something people are more familiar with.
The technical details of how it handles _every_ case weren’t particularly relevant.
However, just to address 1 and 2 with some hilariousness (autocorrect wants this to be "hilarious mess", which may be more correct).
IEEE 754's 80-bit format was the first widely deployed format, and was largely used by Intel to get the other manufacturers to stop trying to reduce the functionality of IEEE floating point because "it couldn't be implemented, couldn't be implemented efficiently, etc.". However, because it came first, it has a quirk that was fixed for fp32, fp64, etc.
FP80 uses an explicit bit for the leading 1. That means it can encode 1.0 * 2 ^^ N, or 0.1 * 2 ^^ N; it should hopefully be immediately obvious why this could be a problem :)
Not only do the multiple representations of a single value result in sadness, they also give us a variety of concepts like pseudo-denormals, pseudo-normals, pseudo-infinities, pseudo-NaNs, etc., all of which cause their own problems.
Mercifully, the only hardware fp80 implementation now defaults (since the 387, maybe?) to just treating them as invalid and converting them to NaN. But you can set a flag to make it treat them as it did originally.
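A toy decode of the 80-bit layout (sign, 15-bit exponent with bias 16383, 64-bit significand whose top bit is the explicit integer bit) shows the redundancy; this just applies the original value mapping and ignores how modern hardware classifies these patterns, and fp80_value is a made-up helper, not anything the hardware exposes:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Value an fp80 bit pattern denotes under the original interpretation. */
    static double fp80_value(int sign, int biased_exp, uint64_t sig) {
        double m = (double)sig / 0x1p63;   /* bit 63, the explicit integer bit, has weight 1 */
        return (sign ? -1.0 : 1.0) * ldexp(m, biased_exp - 16383);
    }

    int main(void) {
        /* normal encoding of 1.0: 1.0 * 2^0, integer bit set */
        printf("%g\n", fp80_value(0, 16383, 0x8000000000000000ull));
        /* "unnormal" encoding of 1.0: 0.5 * 2^1, integer bit clear */
        printf("%g\n", fp80_value(0, 16384, 0x4000000000000000ull));
        return 0;
    }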
ML applications apparently don't need the full precision, but they do need very large amounts of them, and process enough of them that the perf win from fewer bits is meaningful and the cost of f16->f32 is large enough to also be meaningful.
Presumably they do benefit from the dynamic range as otherwise you'd think int16 would be sufficient, and not suffer the conversion costs.
float16 and float128 (as well as longer formats) were standardized in IEEE 754 in 2008. Half is now built into C# and many other languages, and it's gaining ground in hardware support.
I definitely don't think half precision is a completely normal, obvious type, except maybe in certain circles. None of the current mainstream languages support it as a primitive type, for one. No mainstream CPUs have hardware support for half precision.
C# supports it. CUDA supports it. ARM and related C/C++ compilers support it. Intel is adding hardware support in upcoming chips, so expect a lot more languages to add it.
Julia also supports it. It led to a really funny graph when doing performance tests on Fujitsu, because Julia is the only language that supports fp16 and is fast enough to write BLAS in, so there were some graphs where it was the only entry, because C and Fortran didn't bother showing up.
Most Intel and AMD CPUs produced in the last ~10 years have hardware support for converting 16-bit floats to and from 32-bit. That's not full hardware support of course, but you don't need it to be able to make good use of them. https://en.wikipedia.org/wiki/F16C
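For example, with GCC or Clang on x86 you can use the F16C intrinsics directly (compile with -mf16c); the usual pattern is to store as 16 bits and compute in 32:

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        unsigned short h = _cvtss_sh(3.14159f, 0);  /* 0 = round to nearest even */
        float back = _cvtsh_ss(h);
        printf("0x%04x -> %f\n", h, back);          /* 3.140625: only ~3 decimal digits survive */
        return 0;
    }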