I have a feeling this could combine really well with John Gustafson's posits. I bring this up partly because of the paper he wrote with Isaac Yonemoto, in which an 8-bit posit variant is used to build a cheap approximation of the sigmoid function for neural networks[0], but also because with posits one can tweak the number of fraction and exponent bits to suit the required dynamic range and accuracy.
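For the curious: with es = 0, the sigmoid shortcut from the paper is literally a sign-bit flip followed by a two-bit shift of the posit's bit pattern. Here's a rough Python sketch of how that plays out for 8-bit posits (the decoder is my own quick illustration, not production code):

    import math

    def posit8_to_float(bits: int) -> float:
        # Decode an 8-bit posit, es = 0 (useed = 2), into a float.
        bits &= 0xFF
        if bits == 0x00:
            return 0.0
        if bits == 0x80:
            return float("nan")              # NaR ("not a real")
        sign = -1.0 if bits & 0x80 else 1.0
        if bits & 0x80:
            bits = (-bits) & 0xFF            # negative posits are two's-complemented
        body, first, run, i = bits & 0x7F, (bits >> 6) & 1, 0, 6
        while i >= 0 and ((body >> i) & 1) == first:
            run, i = run + 1, i - 1          # count the regime run
        k = run - 1 if first else -run       # regime value; es = 0, so scale = 2**k
        frac_bits = max(i, 0)                # bits left after the regime terminator
        frac = body & ((1 << frac_bits) - 1)
        return sign * 2.0 ** k * (1.0 + frac / (1 << frac_bits))

    def fast_sigmoid_posit8(bits: int) -> int:
        # The paper's trick: flip the sign bit, then logical shift right by 2.
        return ((bits ^ 0x80) >> 2) & 0xFF

    for x_bits in (0x81, 0x00, 0x40, 0x7F):  # posits for -64, 0, 1, 64
        x = posit8_to_float(x_bits)
        y = posit8_to_float(fast_sigmoid_posit8(x_bits))
        print(f"x = {x:7.1f}   approx = {y:.4f}   exact = {1 / (1 + math.exp(-x)):.4f}")

At x = 1 the shortcut gives 0.75 against an exact 0.7311, and it gets closer toward the tails; not bad for two bitwise operations.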
The sigmoid approximation is nice, but not 100% necessary (I don't know the cost of computing a sigmoid on a GPU, but I suspect modern NVIDIA ones have some way of doing it relatively fast). Note that sigmoid and tanh are still used in GRUs and LSTMs, and that sigmoid, tanh, and their derivatives all have approximate shortcuts in posits.
There is a single critical step in training which I find gets stuck in the early phases of learning due to accumulation issues (I solved it by expanding to 16 bits during that phase, which should be "on-chip"). It's nice to see another method that achieves the same thing! I suspect that batch normalization (which seems similar to what HALP does) after each step would also help.
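To make the accumulation issue concrete, here's a toy numpy illustration (not my actual training code, and not HALP): once the running sum grows large enough, each individual update falls below half a ULP of the accumulator and gets rounded away, so learning stalls; a wider accumulator doesn't have that problem.

    import numpy as np

    updates = np.full(10_000, 1e-4, dtype=np.float16)  # many tiny gradient steps

    acc16 = np.float16(1.0)
    for u in updates:
        acc16 = np.float16(acc16 + u)      # low-precision accumulator

    acc32 = np.float32(1.0)
    for u in updates:
        acc32 += np.float32(u)             # widened ("on-chip") accumulator

    print(acc16)  # 1.0 -- each 1e-4 step is below half a ULP at 1.0 and vanishes
    print(acc32)  # ~2.0, as expected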
Yes, while I haven't done anything with ML myself, I heard via the 3Blue1Brown video on Deep Learning[0] that the sigmoid function isn't really used that much anymore. But I figured that being able to tweak the dynamic range could make posits a good fit for the rescaling and recentering approach.
You know what, I'll just go ahead and post a link to this article on the Unum Google group; perhaps someone there can add some thoughts[1].
Ah, thank you for that correction. I guess the video was accidentally misleading (technically they didn't say anything about other ML approaches, but overgeneralising the way I did isn't that big of a leap).
My takeaway from this article is a worrying trend: AI ASICs being developed behind closed doors and used only internally in the developers' own datacenters, or rented out through an abstraction, as a service.
Given that it's 2018 and we still don't have a single FOSS mobile phone in production, it's very possible that this technology will never leave the thin-client model, and that we'll never see retail sales of (open/programmable) neuromorphic hardware.
It's not just changing the exponent bias, but also adjusting the center of the representable range for gradients to be closer to the current value of the parameter (which is stored in full precision).
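As I read it (simplified, and the numbers below are made up for illustration): you keep a full-precision master copy, and store the low-precision value as a small offset from a periodically refreshed center, with a scale picked to match how far the iterate can drift before the next re-centering. Something like:

    import numpy as np

    def quantize(offset, scale):
        # Map a float offset onto a signed 8-bit grid.
        return np.clip(np.round(offset / scale), -127, 127).astype(np.int8)

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    center = np.float32(0.3712)   # full-precision parameter at the last re-centering
    scale  = np.float32(1e-3)     # made-up bound on drift until the next re-centering

    w_true = center + np.float32(2.4e-3)      # parameter after a few SGD steps
    q      = quantize(w_true - center, scale) # int8 offset: round(2.4) = 2
    w_low  = center + dequantize(q, scale)
    print(w_low - w_true)   # ~ -4e-4 of error; shrink scale as training converges

The appeal is that the low-precision grid tracks the solution: as the gradients shrink you can shrink the scale too, so the same handful of bits buys more and more resolution around the optimum.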
[0] http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf