If it's data size in GPU RAM you are concerned about, couldn't you store fp16 and cast to fp32 just inside the kernel? In OpenCL you would do this with vload_halfN and vstore_halfN, which convert during the load and store.
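A minimal kernel sketch of that pattern (the buffer names and the axpy-style math are just placeholders for illustration):

```c
// OpenCL C kernel sketch: buffers hold half, arithmetic runs in float.
// vload_half4 widens 4 packed halfs to a float4 on load;
// vstore_half4 rounds a float4 back to 4 halfs on store.
// Note: declaring __global half* and using only vload/vstore_half on it
// does not require the cl_khr_fp16 extension.
__kernel void scale_add(__global const half *a,
                        __global const half *b,
                        __global half *out,
                        const float alpha)
{
    size_t i = get_global_id(0);       // one float4 (4 halfs) per work-item
    float4 va = vload_half4(i, a);     // fp16 -> fp32 on load
    float4 vb = vload_half4(i, b);
    vstore_half4(alpha * va + vb, i, out);  // fp32 -> fp16 on store
}
```

All the math between the load and the store is ordinary fp32; only the buffer traffic is halved.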
You won't get double throughput compared to fp32, but you shouldn't fall back to some terribly slow path either.
I haven't looked into it myself, but it could be that with all this massaging you'd lose more on throughput than you gain on memory use. It's similar to doing 8-bit quantized work on general-purpose CPUs: it's very hard to make it any faster than float32 because of all the futzing that has to be done before and after the (rather small) computation. In comparison, no futzing at all is needed for float32: load 4/8/16 elements at a time (depending on the ISA), do your stuff, store 4/8/16 elements at a time. Simple.
Right, you won't exceed the fp32 calculation rate, unless perhaps it was bandwidth-starved accessing the slowest memory tier. You are doing all fp32 operations after all. What you can do is fit twice the number of array elements into whichever GPU memory tier you decide to store as fp16 (for example, a 512^3 single-channel volume drops from 512 MiB to 256 MiB), so you can potentially work on half as many blocks of a decomposed problem, or double the practical problem size of a non-decomposed one.
You can separately decide which cache tier to target with the converting loads and stores, and then use fp32 with normalized 0-1 range math in that inner tier. You only have to introduce saturation or rounding yourself if your math depends heavily on the fp16 representation. The load and store routines are vectorized: you load one fp32 SIMD vector from one fp16 SIMD vector of the same size (edit: I mean the same number of elements) and vice versa.
I have used fp16 buffers frequently on NVIDIA GPUs with OpenGL in generations ranging from GTX 760 (GK104) to Titan X (GM200 and GP102) as well as mobile GPUs like GT 730M (GK208M). I do this for things like ray-casting volume rendering, where the dynamic range, precision, and space tradeoffs are very important.
My custom texture-mapping and blending shaders implicitly perform the same underlying half-load and half-store operations to work with these storage formats. In the OpenGL shader model you work in normalized fp math and the storage format is hidden: it's controlled independently with format flags at buffer allocation, so the conversions happen implicitly during loads and stores on the buffers. Shaders on fp16 data perform very well, and this is with non-sequential access patterns where individual 3- or 4-wide vectors are loaded and stored for individual multi-channel voxels and pixels.
If I remember correctly, I only found one bad case where the OpenGL stack seemed to fail at this, and it was something like a 2-channel fp16 buffer where performance would fall off a cliff. Using 1-, 3-, or 4-channel buffers (even with padding) performed pretty consistently with uint8, uint16, fp16, or fp32 storage formats. It's possible they just don't have a properly tuned 2-channel texture-sampling routine in their driver; I've never had a need to explore 2-wide vector access in OpenCL.