I have a feeling this could combine really well with John Gustafson's posits. I bring this up partly because of the paper he wrote with Isaac Yonemoto, in which an 8-bit posit variant is used to build a cheap approximation of the sigmoid function for neural networks[0], but also because with posits one can tweak the number of fraction and exponent bits to suit the required dynamic range and accuracy.
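For the curious: with es = 0, the sigmoid shortcut from the paper is literally a sign-bit flip followed by a two-bit shift of the posit's bit pattern. Here's a rough Python sketch of how that plays out for 8-bit posits (the decoder is my own quick illustration, not production code):

    import math

    def posit8_to_float(bits: int) -> float:
        # Decode an 8-bit posit, es = 0 (useed = 2), into a float.
        bits &= 0xFF
        if bits == 0x00:
            return 0.0
        if bits == 0x80:
            return float("nan")              # NaR ("not a real")
        sign = -1.0 if bits & 0x80 else 1.0
        if bits & 0x80:
            bits = (-bits) & 0xFF            # negative posits are two's-complemented
        body, first, run, i = bits & 0x7F, (bits >> 6) & 1, 0, 6
        while i >= 0 and ((body >> i) & 1) == first:
            run, i = run + 1, i - 1          # count the regime run
        k = run - 1 if first else -run       # regime value; es = 0, so scale = 2**k
        frac_bits = max(i, 0)                # bits left after the regime terminator
        frac = body & ((1 << frac_bits) - 1)
        return sign * 2.0 ** k * (1.0 + frac / (1 << frac_bits))

    def fast_sigmoid_posit8(bits: int) -> int:
        # The paper's trick: flip the sign bit, then logical shift right by 2.
        return ((bits ^ 0x80) >> 2) & 0xFF

    for x_bits in (0x81, 0x00, 0x40, 0x7F):  # posits for -64, 0, 1, 64
        x = posit8_to_float(x_bits)
        y = posit8_to_float(fast_sigmoid_posit8(x_bits))
        print(f"x = {x:7.1f}   approx = {y:.4f}   exact = {1 / (1 + math.exp(-x)):.4f}")

At x = 1 the shortcut gives 0.75 against an exact 0.7311, and it gets closer toward the tails; not bad for two bitwise operations.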
The sigmoid approximation is nice, but not 100% necessary (I don't know the cost of computing a sigmoid on a GPU, but I suspect modern NVIDIA ones have some way of doing it relatively fast). Note that sigmoid and tanh are still used in GRUs and LSTMs, and that sigmoid, tanh, and their derivatives all have approximate shortcuts in posits.
There is a single critical step in training which I find gets stuck in the early phases of learning due to accumulation issues (I solved it by expanding to 16 bits during that phase, which should be "on-chip"). It's nice to see another method that achieves the same thing! I suspect that batch normalization (which seems similar to what HALP does) after each step would also help.
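To make the accumulation issue concrete, here's a toy numpy illustration (not my actual training code, and not HALP): once the running sum grows large enough, each individual update falls below half a ULP of the accumulator and gets rounded away, so learning stalls; a wider accumulator doesn't have that problem.

    import numpy as np

    updates = np.full(10_000, 1e-4, dtype=np.float16)  # many tiny gradient steps

    acc16 = np.float16(1.0)
    for u in updates:
        acc16 = np.float16(acc16 + u)      # low-precision accumulator

    acc32 = np.float32(1.0)
    for u in updates:
        acc32 += np.float32(u)             # widened ("on-chip") accumulator

    print(acc16)  # 1.0 -- each 1e-4 step is below half a ULP at 1.0 and vanishes
    print(acc32)  # ~2.0, as expected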
Yes, while I haven't done anything with ML myself, I heard via the 3Blue1Brown video on Deep Learning[0] that the sigmoid function isn't really used that much anymore. But I figured that being able to tweak the dynamic range could make posits a good fit for the rescaling and recentering approach.
You know what, I'll just go ahead and post a link to this article on the Unum Google group; perhaps someone there can add some thoughts[1].
Ah, thank you for that correction. I guess the video was accidentally misleading (technically they didn't say anything about other ML approaches, but overgeneralising the way I did isn't that big of a leap).
My takeaway from this article is a worrying trend: AI ASICs being developed behind closed doors and used only internally in the developers' own datacenters, or rented out through an abstraction, as a service.
Given that it's 2018 and we still don't have a single FOSS mobile phone in production, it's very possible that this technology will never leave the thin-client model, and that we'll never see retail sales of (open/programmable) neuromorphic hardware.
It's not just changing the exponent bias, but also adjusting the center of the representable range for gradients to be closer to the current value of the parameter (which is stored in full precision).
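As I read it (simplified, and the numbers below are made up for illustration): you keep a full-precision master copy, and store the low-precision value as a small offset from a periodically refreshed center, with a scale picked to match how far the iterate can drift before the next re-centering. Something like:

    import numpy as np

    def quantize(offset, scale):
        # Map a float offset onto a signed 8-bit grid.
        return np.clip(np.round(offset / scale), -127, 127).astype(np.int8)

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    center = np.float32(0.3712)   # full-precision parameter at the last re-centering
    scale  = np.float32(1e-3)     # made-up bound on drift until the next re-centering

    w_true = center + np.float32(2.4e-3)      # parameter after a few SGD steps
    q      = quantize(w_true - center, scale) # int8 offset: round(2.4) = 2
    w_low  = center + dequantize(q, scale)
    print(w_low - w_true)   # ~ -4e-4 of error; shrink scale as training converges

The appeal is that the low-precision grid tracks the solution: as the gradients shrink you can shrink the scale too, so the same handful of bits buys more and more resolution around the optimum.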
[0] http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf