Hacker News

I have used fp16 buffers frequently on NVIDIA GPUs with OpenGL, in generations ranging from the GTX 760 (GK104) to the Titan X (GM200 and GP102), as well as mobile GPUs like the GT 730M (GK208M). I do this for things like ray-casting volume rendering, where the tradeoffs between dynamic range, precision, and storage space matter a great deal.

My custom shaders performing texture-mapping and blending implicitly perform the same underlying half-load and half-store operations to work with these storage formats. In the OpenGL shader model you work in normalized floating-point math while the storage format stays hidden; it is controlled independently by format flags at buffer-allocation time, so the format conversions happen implicitly during loads and stores on the buffers. Shaders on fp16 data perform very well, and this is with non-sequential access patterns where individual 3- or 4-wide vectors are loaded and stored for individual multi-channel voxels and pixels.

If I remember correctly, I found only one case where the OpenGL stack seemed to handle this badly: something like a 2-channel fp16 buffer, where performance would fall off a cliff. 1-, 3-, or 4-channel buffers (even with padding) performed fairly consistently across uint8, uint16, fp16, and fp32 storage formats. Possibly the driver just lacks a properly tuned 2-channel texture-sampling routine; I've never had a need to explore 2-wide vector access in OpenCL.



