I've occasionally thought that there's a lot of time to be saved in machine learning systems by making the floating point operations a bit noisier. Use maybe eight-bit (or even four-bit) numbers for the matrix entries, with the philosophy that the system has enough overall degrees of freedom to make up for this local rigidity, while you gain a lot in terms of both storage space and evaluation time. (A toy sketch of what this could look like is at the end of this comment.)
Along these lines, I saw a great demo of a robot a while ago which used 256 binary devices for movement. Each of these devices is the diagonal of a quadrilateral; when the device is in its 'on' state, the diagonal is long, and when it's in the 'off' state the diagonal is short. Chain a lot of these together, and you can get pretty good accuracy aiming for specific points in a two-dimensional space. The demo I saw was a three-dimensional version of the same concept, of course.
So likewise, we chain together lots of linear operations to build (say) a deep learning system. The weight space is very high-dimensional, and the decision space is the much lower dimensional classification/regression problem: one probably needs a lot less than 64 bits of accuracy in the weight space to approximate a point in the decision space.
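Here's that toy sketch in C: store the weights as 8-bit integers plus a single per-matrix scale, and expand them on the fly. The scheme and names here are purely illustrative assumptions, not how any particular framework actually does it.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative 8-bit quantization: weight ~= q * scale. */
    static void quantize8(const float *w, int n, int8_t *q, float *scale)
    {
        float maxabs = 0.0f;
        for (int i = 0; i < n; i++)
            if (fabsf(w[i]) > maxabs)
                maxabs = fabsf(w[i]);
        *scale = (maxabs > 0.0f) ? maxabs / 127.0f : 1.0f;
        for (int i = 0; i < n; i++)
            q[i] = (int8_t)lrintf(w[i] / *scale);   /* nearest 8-bit step */
    }

    int main(void)
    {
        float w[4] = { 0.8f, -0.03f, 0.41f, -1.2f };
        int8_t q[4];
        float s;
        quantize8(w, 4, q, &s);
        for (int i = 0; i < 4; i++)
            printf("%g -> %d (expands back to %g)\n", w[i], q[i], q[i] * s);
        return 0;
    }

Each entry costs one byte instead of four or eight; the price is a worst-case error of half a quantization step per entry, which is the 'local rigidity' mentioned above.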
Indeed, 64 bits of accuracy is overkill for a lot of ML algorithms, where there is so much noise that the additional quantization noise due to low precision is negligible. Most deep neural nets are already being trained in single precision because that is what NVIDIA GPUs can do the fastest. There has been quite a bit of research on reducing the bit depth further over the last couple of years; see e.g. http://arxiv.org/abs/1502.02551 and http://arxiv.org/abs/1412.7024.
The current generation of GPUs already has limited support for half-precision operations, and the tools for using these operations in neural networks (dot product, convolution implementation) are starting to become available as well, which is awesome (see e.g. https://github.com/NervanaSystems/nervanagpu).
The number of bits does seem to affect neural network training, and it really kills it when you go very low (like 8 to 12 bits). The problem is that very small numbers are rounded down to zero, and when they are summed together the result is wildly inaccurate.
Ideally you could do some kind of stochastic rounding, where the fractional part gives the probability of being rounded up rather than down (see the sketch below). On average the totals would still be the same. You could probably get the required number of bits really low then, perhaps even down to one or two!
Something to look into when these algorithms are put into FPGAs and ASICs, which is starting to happen.
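Here's a minimal sketch of that stochastic rounding idea, just to make it concrete (the grid size and the toy summation are my own assumptions, not any particular hardware scheme):

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Round x to a multiple of 'step' so that the result is correct on
     * average: the fractional part is the probability of rounding up. */
    static double stochastic_round(double x, double step)
    {
        double scaled = x / step;
        double lo = floor(scaled);
        double frac = scaled - lo;                        /* in [0, 1) */
        double u = rand() / ((double)RAND_MAX + 1.0);     /* uniform in [0, 1) */
        return (u < frac ? lo + 1.0 : lo) * step;
    }

    int main(void)
    {
        /* Summing 10000 copies of 0.0003 on a grid of 0.001: round-to-nearest
         * turns every term into 0, stochastic rounding stays near 3 on average. */
        double sum_det = 0.0, sum_sto = 0.0;
        for (int i = 0; i < 10000; i++) {
            sum_det += floor(0.0003 / 0.001 + 0.5) * 0.001;
            sum_sto += stochastic_round(0.0003, 0.001);
        }
        printf("round-to-nearest: %g   stochastic: %g   exact: 3\n", sum_det, sum_sto);
        return 0;
    }

The individual values get noisier, but the systematic bias that kills low-precision accumulation goes away, which is the point.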
Using the log domain is cool, but log summation still has the same problem. Logs just transform numbers into a different range; they don't magically add more bits of precision. That is, adding a very small number in the log domain to a much larger number in the log domain will still just get rounded away.
However, if the division isn't too expensive, a (2,2) approximant also works, with even better precision:
(0.00383323 + 1.42412 x + 0.336161 x^2) / (1 + 0.704317 x + 0.0598015 x^2)
max error: -1.33902 × 10^-7
As for the [0.75, 1.5) interval, I get
0.108715 + 1.05621 x - 0.165315 x^2
which gives a max error of -0.000651425; the max relative error is around 0.0006, which is again way better than what he gives in the write-up, although the error is non-zero at the points 0.75 and 1.5 (making it vanish there is not really a necessary constraint, but an artifact of the particular polynomial he chose). The overall plot of the relative error over a wider range is smooth.
Bottom line:
1) Unless you are an expert in numerical methods, consult software, an article, or a book by an expert in the field (or the expert him/herself, if you can) at some point.
2) While I appreciate his efforts and the write-up, I wouldn't recommend using his polynomials in practice.
(BTW, one can use frexp instead of bit-twiddling to make the C code friendlier since not everyone is familiar with the bit layout of IEEE 754 numbers.)
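For example, here's a minimal demonstration of what frexp buys you (just the decomposition; the polynomial part is a separate matter):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* frexpf splits a float into mantissa and exponent without any manual
         * IEEE-754 bit manipulation: x = m * 2^e with m in [0.5, 1). */
        int e;
        float m = frexpf(6.0f, &e);
        printf("6 = %g * 2^%d\n", m, e);   /* prints: 6 = 0.75 * 2^3 */
        return 0;
    }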
I realized I'd missed that he's talking about log2(x) rather than log(x+1), which changes the interval.
Since MiniMaxApproximation gets quirky when the function is 0 somewhere in the given interval (it minimizes the relative error, so a division by 0 occurs unless you cancel it), I'm using the interval [0.25, 0.5] with t=1/4 instead:
-3.64631 + 7.93796 x - 5.30406 x^2
This gives a maximum relative error of 0.00334047, which is still an order of magnitude better than his.
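To make that concrete, here is a rough sketch (my own, not code from the write-up) of how such a quadratic slots into frexp-based range reduction; like the article's code, it assumes x is positive and finite:

    #include <math.h>
    #include <stdio.h>

    /* frexpf gives m in [0.5, 1) with x = m * 2^e, so halving m and bumping
     * the exponent reduces to [0.25, 0.5): log2(x) = (e + 1) + log2(m / 2).
     * The quadratic is the [0.25, 0.5] approximant quoted above. */
    static float approx_log2(float x)
    {
        int e;
        float m = 0.5f * frexpf(x, &e);                   /* m in [0.25, 0.5) */
        float p = -3.64631f + m * (7.93796f + m * -5.30406f);
        return (float)(e + 1) + p;
    }

    int main(void)
    {
        for (float x = 0.1f; x < 100.0f; x *= 2.3f)
            printf("x=%-8g approx=%-10g exact=%-10g\n", x, approx_log2(x), log2f(x));
        return 0;
    }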
As for the (2,2) approximant:
(-5.83261 - 16.7012 x + 22.394 x^2) / (1 + 11.2638 x + 7.81124 x^2)
max rel error: 1.69763 × 10^-6
I checked several other values of t, and the max rel error for the (2,0) approximant is roughly the same. This leads me to believe that the reason he is getting 10 times larger error is that he is imposing that the errors should vanish at certain points. Not only does this increase the max rel error, it also distorts the smoothness of the function in a bad way (the kinks in the plots are a bad sign in general).
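And, for completeness, a sketch of the (2,2) rational above dropped into the same frexpf reduction as the quadratic sketch (at the cost of one division per call):

    #include <math.h>
    #include <stdio.h>

    /* Same reduction as before (m in [0.25, 0.5), log2(x) = (e+1) + log2(m/2)),
     * but using the (2,2) approximant quoted above, evaluated via Horner. */
    static float approx_log2_rat(float x)
    {
        int e;
        float m = 0.5f * frexpf(x, &e);
        float num = -5.83261f + m * (-16.7012f + m * 22.394f);
        float den = 1.0f + m * (11.2638f + m * 7.81124f);
        return (float)(e + 1) + num / den;
    }

    int main(void)
    {
        printf("%g vs %g\n", approx_log2_rat(10.0f), log2f(10.0f));
        return 0;
    }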
Your polynomial, since it doesn't get it exactly right at 1/4 & 1/2, will have an enormous relative error when you use it for computing logs of numbers very close to 1.
Goldberg's initial function b ignores this issue, and a glance at the graph of its absolute errors will confirm that it's the minimax poly for absolute error or very close to it. His next candidate t half-fixes this, and again looking at the graph of the relative error makes it clear that it's right. (Note that avoiding intervals containing zero will indeed allow you to get better max relative errors -- keeping the relative error finite where the function is zero effectively reduces your polynomial degree by 1.) Then finally he fully fixes it with the polynomial g, solving the more interesting problem of minimizing the max relative error for the log function everywhere, and chooses a better functional form by shifting the interval he reduces to with the polynomial s.
(The fact that his earlier polynomials do the wrong thing is clearly not a mistake but a pedagogical strategy: start with something simple and incrementally improve it.)
My apologies for spelling all this out, but it really appears that you're accusing him of making elementary mistakes without actually understanding what he's doing.
"Imposing that errors should vanish at certain points" is not a mistake, it's a necessity if you're implementing log by argument reduction plus local approximation and you care about relative error in the resulting approx_log function, and you are getting better numbers than Goldberg by solving the wrong problem.
Your (2,2) rational approximant will be much more expensive to compute than the polynomial and won't yield the speed gains Goldberg is looking for. And, again, it will give you a log function with infinite relative error at 1.
And sorry for spelling this out for you, but maybe it is you who is "accusing of making elementary mistakes without actually understanding what he's doing."
Here is why:
1)
> Your polynomial, since it doesn't get it exactly right at 1/4 & 1/2, will have an enormous relative error when you use it for computing logs of numbers very close to 1.
In case you missed it, Mathematica's (not mine) polynomial gives a maximum relative error of 0.00334047. And the points with 0.00334047 error are actually the edge points (which do not include 1).
The important point is that it is not enormous. In fact, his polynomial gives a 10 times larger error at certain points, so it is he who has an enormous error. And Mathematica's polynomial is continuous, so even when you go slightly outside the intended range, you're still getting an error on the order of 10^-3.
2)
> Your (2,2) rational approximant will be much more expensive to compute than the polynomial and won't yield the speed gains Goldberg is looking for.
Again, since you apparently missed it, the (2,2) approximant I gave is for the [0.25, 0.5] interval with t=1/4, so yes, bad things can and will happen if you use that polynomial outside of that range (the same thing happens with his polynomials too, obviously). Instead, what one should do is move that 1/4 part into the exponent and stick to [0.25, 0.5].
(About the speed vs accuracy trade-off, I already pointed that out in my original post by saying "if the division isn't too expensive"; I'm not sure why you are emphasizing it again here.)
Going back to your statement: when computing logs very close to 1, you subtract 2 from the exponent and instead compute within [0.25, 0.5] using that polynomial.
3)
> "Imposing that errors should vanish at certain points" is not a mistake, it's a necessity if you're implementing log by argument reduction plus local approximation and you care about relative error in the resulting approx_log function, and you are getting better numbers than Goldberg by solving the wrong problem.
"argument reduction plus local approximation" means range reduction to [t,2t) for log function and this has nothing to do with the behavior of approximation polynomial at the edges; you may very well have non-zero errors at t and 2t and there is nothing wrong or alarming about it (see Handbook of Floating Point Arithmetic, Chapter 11, for example).
The point is, the max rel error should be kept small, which Mathematica does, 10 times better.
Such an artificial condition is necessary only when the relative error function becomes discontinuous and starts making sudden jumps (see Figs 6-7 from the top in the write-up). He is trying to remedy a problem that appeared due to his choice of t, at the cost of losing degrees of freedom in the polynomial (see also the comments at http://reference.wolfram.com/language/FunctionApproximations... about what to do when the function becomes zero in the interval, in which case you need to deal with those points by making them special -- note from the example that this doesn't have to mean you lose 2 degrees of freedom to avoid a single point).
So, in short, I am solving the right problem.
4) Which brings me to the other unmentioned problem with some of his polynomials: the derivative is very bumpy and discontinuous, which spells disaster in general math applications.
(Clearly, the derivative of his function will have enormous minimax (L^\infty) and RMS (L^2) errors, unlike Mathematica's polynomial.)
Summary:
1) Mathematica's polynomial gives ~10 times smaller maximum error than the write-up's within the range.
2) Mathematica's polynomial gives 10x smaller RMS error within the range.
3) Mathematica's polynomial is smooth. (I hope by now it is clear that Mathematica's polynomial nowhere gives an enormous error; quite the opposite, the error is much lower.)
4) If you use approximant polynomials only within their intended range and move the remaining factors into the exponent, everything works out.
5) My pointing out a bad mathematical practice wasn't personal, so please don't steer the discussion in that direction. I understand that you respect him, which may motivate this. My credentials in maths aren't shabby at all (I have more peer-reviewed research papers than him, in higher-impact journals, etc.), and if Mathematica says one thing and a human says another, in most cases Mathematica is correct. But credentials don't matter in front of solid results; in fact, accepting a scientific result (partly) based on credentials is considered toxic in science. Which brings me back to my original point. I stand by what I said earlier: I don't recommend using the polynomials in the write-up.
On the interval from 1/4 to 1/2. But the whole point of this exercise is to make an approximation for the log function over its whole domain. If you take Mathematica's/your polynomial and do that with it, you get an infinitely large relative error at 1 because log(1)=0 but your approx_log(1) is not 0.
(This is the central point I was trying to make in my comment, and the fact that you missed it suggests I wasn't clear enough. I'm sorry about that; in particular, clearly I should have made it more explicit that I understand how argument reduction works -- it hadn't occurred to me that that was in question, since it's a standard technique, described perfectly adequately in the article we're discussing.)
> if you use the polynomial outside of that range
I never for an instant suggested using the polynomial outside that range. I am assuming the obvious argument-reduction scheme where you say log2(2^a.b) = a + log2(b) and arrange for b to lie within some interval [t,2t). In other words, I am assuming the same thing that Goldberg described and the same thing that you are assuming (but apparently missing a key consequence of).
So. Suppose you compute log2(1) using your polynomial. You argument-reduce: log2(1) = 2 + log2(0.25). Then you use your polynomial: log2(1) = 2 + (-3.64631 + 7.93796/4 - 5.30406/16) ~= 0.0066. The actual value you are looking for is log2(1) = 0. Your relative error is infinite.
(You can use the other end of the interval instead. Then you get -0.0035; again, the relative error is infinite.)
Goldberg avoids having this happen by forcing the absolute error to be zero at the endpoints (when he uses a polynomial on [1,2)) or at the relevant point in the middle (when he uses one on [0.75,1.5)). This extra constraint is what forces him to accept larger relative errors on his [t,2t) interval; that's the price he has to pay for finite relative error on the whole domain of the log2 function.
> and this has nothing to do with the behaviour of approximation polynomial at the edges
Indeed, the fact that the point where you run into trouble is at the end of the interval is irrelevant, and it would be an interior point that mattered if you did as Goldberg did and used an interval like [0.75,1.5).
> So, in short, I am solving the right problem.
No, really, you're not. Goldberg wants his relative error for the approx log2 function not to blow up when computing log2(1), which requires that it evaluate to exactly 0 when fed exactly 1. Your polynomial doesn't have that property.
(Of course it's questionable whether relative error is what one ought to care about for an approximate logarithm function. Maybe it isn't. I don't know what applications Goldberg actually has in mind, and quite possibly he isn't at liberty to tell us. But the problem he's stated is to compute an approximation to log2 with reasonable relative error, and it's simply incorrect to say he's done that badly because you can get reasonable relative error on a smaller interval, if your relative error goes to infinity elsewhere. Which it does.)
> the derivative is very bumpy and discontinuous
Yeah, I dislike that too. Again, how much this matters depends on the applications you have in mind.
> My pointing out a bad mathematical practice wasn't personal, so please don't steer the discussion into that direction. [...] accepting a scientific result (partly-)based on credentials is considered to be toxic in science.
I never said anything even slightly personal; I have made no comment on either Goldberg's credentials or yours (or for that matter mine); only one person in this discussion has appealed to credentials, and it's you. Twice. (Once earlier: "Unless you are an expert in numerical methods ...". Once just now, comparing your credentials with Goldberg's.)
Anyway: of course I strongly agree: credentials are less important than results. In this case, Goldberg made an approximation to log2 whose relative error is bounded on its entire domain of applicability, and you didn't. In particular, #4 from your summary is flatly incorrect.
> If you take Mathematica's/your polynomial and do that with it, you get an infinitely large relative error at 1 because log(1)=0 but your approx_log(1) is not 0.
So what?
Instead of accepting that his polynomial is more erroneous, you are steering this whole discussion in a totally synthetic direction.
Essentially, your whole argument is that the global relative minimax error must be finite at every single point in the domain.
I don't know where you got this idea/mantra, but the point here is to have a rough and fast sketch with small overall error in applications, not mathematical rigor, which you can't have with a 2nd-degree polynomial anyway. His polynomials give infinitely huge L^1 and L^2 errors near the origin; the minimax error is finite (1), but that doesn't make an infinitely big error OK. Which is why such simple-minded mantras are wrong and bound to fail.
While minimax is the only thing that gets mentioned there, the world is a bigger place. There are many error measures. The concept of minimax approximation breaks down at points where the function is 0 (you can be off by an arbitrarily small amount and still be doomed according to minimax). This is not the end of the world; in many applications, the L^2 (RMS) error is more relevant, in which case Mathematica's approximant is again 10 times better.
Do I want to make my errors 10 times larger over the whole domain, and do I want a bumpy approximant, just to make minimax happy at one single point in the whole floating-point range? No, thanks.
If we're going to pick nits and be rigorous, his function fails big time at x=0 and x<0, and for x=\pm\infty as well as NaN. Now, if this is a real concern, people just put a conditional there and live on happily. You can do that for the x==1 case if it is a serious concern in your application -- even though that is very unlikely (this shouldn't bother you since you're OK with bumpy functions, and it's going to be a small bump in this case anyway, while still beating his polynomial by an order of magnitude globally).
The whole point here is to have some rough sketch of log that works fast and yields small errors overall in the actual calculation. If I wanted rigor, I'd use libm log2 or http://www.netlib.org/fdlibm/e_log.c which is also quite fast.
Understand this: the relative flavor of minimax is a measure that is rarely directly relevant, and it is hardly suited to the log function. It's not God's absolute commandment, and the requirements of domain-specific rough approximants depend on the actual usage: whether the errors are going to end up being additive, multiplicative, etc.
Anyway, I'm not going to waste more time on this.
Here are the unwavering facts since the beginning:
1) His polynomial has 10x worse relative error, and worse RMS error, in the whole domain
2) Your whole argument of "but what about x==1" is misplaced, since the approximant you have been fiercely defending here itself suffers from infinite pathological errors at x <= 0; if it bothers you so much nevertheless, just put an if (x == 1) return 0 there, and there you go, I still have a 10x better function.
So "produce a fast approx log function with small relative error everywhere" is the problem Goldberg was trying to solve from the outset. (He wants to be able to give guarantees like "accurate to 7 bits".) The fact that he gets larger errors than you is completely explained by this.
Of course no one else is obliged to be interested in solving that problem, but that's the problem he's trying to solve.
(If you'd said "I think Goldberg is solving the wrong problem; no one cares about relative error rather than absolute error in an approximate logarithm function" I'd have been inclined to agree. But no, it was "haha, he computed a crappy polynomial; that's what happens when you let amateurs do work that should be left to the professionals" with no sign of any understanding of what constraints he was working with that led him to compute what he did.)
> Instead of accepting that his polynomial is more erroneous, you are steering this whole discussion in a totally synthetic direction.
As well as agreeing that his polynomial gives larger errors on the range to which it's applied directly (which I have, explicitly, agreed; go back and check if you don't believe me), I am pointing out that it does (infinitely) better than yours according to the criterion Goldberg was actually using, namely relative error over the whole domain in the log function it produces.
> mantra
Wow.
(I got it from the start of Goldberg's article, where he explains what problem he's trying to solve.)
> infinitely huge L^1 and L^2 errors near origin
(I take it you mean in the derivative.) Yours does too; it is discontinuous at every integer power of 2. Obviously any piecewise-polynomial scheme like this one is going to have discontinuities in some derivative (and you can push up which derivative at the usual cost of either using higher-degree polynomials or getting worse errors).
> There are many error measures.
It appears that you think I am much more ignorant in this field than I actually am. (Or perhaps that you're pretending to for rhetorical effect, but I hope not.)
> just to make minimax happy at one, single point
If the relative error is infinite at x=1 then it is also very large when x is very close to 1.
> fails big time at x=0 and x<0
You're complaining that he isn't generating NaNs and infinities when passed inputs that don't have a finite log? Fair enough, I guess, but in a post subtitled "Part I: the basics" it seems enough to me that he has the following warning sentence: "The code does not check that x>0, much less check for infinities or NaNs. This may be appropriate for a fast version of log." (Which he does.)
> fiercely defending
I think you may be mistaken about where the fierceness is coming from in this discussion.
> just put an if (x==1) return 0 and there you go, I still have a 10x better function
Nope. You've got rid of the actually infinite relative errors but you still have extremely large relative errors for x very close to 1 but not equal.
Again: by all means argue that bounding the relative error in the return value of a log function is the Wrong Thing for almost all purposes. I will probably agree. But that isn't what you did.
I've learned to be extra careful with fast log approximations. I was the tools programmer at a game studio, and a few years ago I was given the task of speeding things up a bit. There was a process that summed neighbouring pixels, did some filtering/pre-processing, then calculated a value through log. I profiled it, confirmed this was the slowdown, and decided to provide a faster log version. After some googling and testing, it seemed everything should've worked.
A few days later, black mipmaps started to appear. Turns out I was testing the wrong way: I thought I was comparing new to old, but I was actually comparing old to old output created by another machine (slightly different results were to be expected). All because I forgot to turn off caching.
And my approximation of log2 did not really work for most of the numbers involved. It was using floats, not doubles, to begin with, and was producing a lot of NaNs and infinities (hence the black colors).
Anyway, it was reverted, and we had to reconvert everything to get back to normal.
Why is Ebay using the log function and why is it used so heavily that it causes a performance bottleneck for them? Is this an inherited code base from their Hunch purchase?
Boost used to have a minimax approximation algorithm in its codebase. Unfortunately, nobody knows enough about how it works to maintain it, and you have to look in much older releases of Boost to get it.
I used some bits from it to help me in building my own log approximation but I didn't have time to dig into it.
I ended up doing a Chebyshev approximation over the [1, 2) interval, but I extended the interval slightly to make sure that the error at 1.0 is minimal (you can also minimise the error at 2.0 if you want some smoothness between the [1, 2) and [2, 4) intervals); a rough sketch of this route is below.
It's easy to calculate new interval values for a Chebyshev approximation, and it shouldn't be a problem to modify Remez, but I couldn't find any Remez implementation that would let me do what I needed.
Usually, the interval ends in the Remez algorithm are fixed, while internal points can be moved by the algorithm. I needed a way to fix one or two internal points (1.0 and optionally 2.0) but let the algorithm move the interval ends within certain limits.
If anyone knows of any tool that let me do it, I'd be happy to play with it and publish my code. My approach is a bit similar to [1] but I'd more likely use Intel assembly or DynASM.
PS The author of [1] uses remez from the Sollya tool, but it doesn't let me fix internal points. I looked at the implementation, but it's more obscure than boost/remez.
UPDATE: it's not clear from the above what my objectives were. I wanted to produce exact values for integer powers of 2, including log2(1). This approach also gives a small relative error around x=1.
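Here is the rough sketch of that Chebyshev route mentioned above. The degree and the exact amount of interval extension ([0.99, 2.01]) are just illustrative picks, not tuned values:

    #include <math.h>
    #include <stdio.h>

    #define N  4       /* number of Chebyshev coefficients (a cubic fit) */
    #define A  0.99    /* interval extended slightly past [1, 2] */
    #define B  2.01

    /* Chebyshev coefficients of log2 on [A, B], sampled at Chebyshev nodes. */
    static void cheb_coeffs(double c[N])
    {
        const double PI = acos(-1.0);
        for (int k = 0; k < N; k++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++) {
                double t = cos(PI * (j + 0.5) / N);            /* node in [-1, 1] */
                double x = 0.5 * (B - A) * t + 0.5 * (B + A);  /* mapped to [A, B] */
                sum += log2(x) * cos(PI * k * (j + 0.5) / N);
            }
            c[k] = 2.0 * sum / N;
        }
    }

    /* Evaluate the truncated Chebyshev series with Clenshaw's recurrence. */
    static double cheb_eval(const double c[N], double x)
    {
        double t = (2.0 * x - A - B) / (B - A);   /* map [A, B] back to [-1, 1] */
        double d = 0.0, dd = 0.0;
        for (int k = N - 1; k >= 1; k--) {
            double sv = d;
            d = 2.0 * t * d - dd + c[k];
            dd = sv;
        }
        return t * d - dd + 0.5 * c[0];
    }

    int main(void)
    {
        double c[N];
        cheb_coeffs(c);
        for (double x = 1.0; x <= 2.0; x += 0.125)
            printf("x=%.3f  approx=%.6f  exact=%.6f\n", x, cheb_eval(c, x), log2(x));
        return 0;
    }

This is plain Chebyshev fitting rather than Remez, so it isn't an exact minimax solution; it's only meant to show the trick the comment describes of widening the interval a touch so the error near 1.0 stays small.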
Yes, that one. It comes with a driver program using the Boost.Test framework that allows you to change the range and fix the values at the interval ends if that is an objective.
Their framework does not allow you to arbitrarily fix internal values, as per your spec however.
But why is this important? Can't computers already do this? Even my 1978 TI pocket calculator can do logs very fast. It's not like this is an important unsolved mathematics puzzle.
You could do better by splitting the function into different ranges and using a lookup table for a formula that fits each range the best. This is how many fast approximations are done. See how the residual error fits nicely into a few different curves? Eureqa doesn't support this though.
The original posting ended with a second-order polynomial, a few bitwise operations, and an if statement.
All of the ones you just posted are more computationally difficult. The non-trig ones, for example, include a division, and the best fit is a higher-order polynomial. There's no reason to believe it will be as fast to evaluate, so the question would be, is it more accurate and is the accuracy worth the additional time? It would be ill-advised to use one of those if it ended up being both less accurate than log() and slower.
Concerning lookup tables, as tgbrter answered in this thread, they are likely slower. It still needs the scaling, and also needs two value lookups followed by an interpolation. The cost of doing the interpolation will be about the same as evaluating the second order polynomial, but the pipeline will be waiting on the two memory lookups.
Assuming everything is in L2 cache, that might not be a problem. But depending on the problem, it may be in L3 or (shudder) main memory, which is eons away. So, as tgbrter also pointed out, you may end up with variable calculation time depending on the cache.
Eureqa lets you choose what functions you use and what cost each has. I need to know what the expense is for each operation and I can just feed it in. I removed division and increased the cost for multiplication: https://i.imgur.com/8C9OBEY.png?1 (I'm using the author's error metric of relative error, which is what the 0 = abs(y-...)/y means.)
And of course the best fit has extra complexity, that's why it gives you many choices along the complexity tradeoff.
I wasn't suggesting a lookup table to approximate the function, but to choose the formula itself. E.g. you might have one formula which fits the first half of the function very well, and another which fits the second half very well. I think that branching code is slower, so alternatively you can do a lookup for the constants of the formula. Some constants will be optimized for one part of the function, and another set of constants will be optimized for another part.
I don't think that kind of lookup would be too slow. You only need to fit a few elements in the cache, and interpolation is a simple operation. It's just a subtraction, a multiplication, and an addition.
EDIT: In the last paragraph I mean I don't think a complete lookup table and interpolation, like you mentioned, would be too slow. Not referring to the lookup table of formulas I suggested, but computed log values.
I am confused about why you need to think that certain operations are/are not expensive, or why you think a table is faster.
I think they are expensive, and it won't be faster. Hence we're at an impasse. That impasse was predictable, and easily resolved by taking the equation you have and testing it against the reference code.
If you test it out, you'll know the answer. There's no need to think you know the answer when you can just measure it, and overrule any incorrect beliefs I may have.
As I said, I don't have a table of how fast each operation is. I don't think any of my assumptions were unreasonable, but I did explicitly state they were assumptions. And I don't have any way of testing it.
I am interested in function approximation mainly for finding functions for when there is no closed form solution, not so much tiny optimization.
You need only one table look-up, a couple quick operations and no conditionals or multiplying.
The table size required to beat the accuracy of fastlog2 is 512 elements.
edit: The speed ratio is actually about 0.6 in favor of the look-up versus fastlog2. Enabling -O3 changes the ratio to 0.9. This is of course not measured in a real-world program where the cache is shared by other stuff.
edit2: I have removed the if statement from fastlog2 and compiled with -O3. The ratio was 1.2.
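For reference, here's a hedged sketch of the kind of table look-up being compared here: 512 entries indexed by the top mantissa bits, no conditionals, no multiply in the lookup path. It assumes positive, finite, normal inputs (the table size comes from the comment above; everything else is my own guesswork, not the actual benchmarked code):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define TABLE_BITS 9                    /* 512 entries */
    #define TABLE_SIZE (1 << TABLE_BITS)

    static float log2_table[TABLE_SIZE];

    /* Table entry i holds log2 of a mantissa whose top TABLE_BITS bits are i. */
    static void init_table(void)
    {
        for (int i = 0; i < TABLE_SIZE; i++)
            log2_table[i] = log2f(1.0f + (float)i / TABLE_SIZE);
    }

    /* Pull the exponent and top mantissa bits straight out of the IEEE-754
     * representation; the result is exponent + table[top mantissa bits]. */
    static float table_log2(float x)
    {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);
        int e = (int)(bits >> 23) - 127;                        /* unbiased exponent */
        uint32_t idx = (bits >> (23 - TABLE_BITS)) & (TABLE_SIZE - 1);
        return (float)e + log2_table[idx];
    }

    int main(void)
    {
        init_table();
        for (float x = 0.3f; x < 100.0f; x *= 3.1f)
            printf("x=%-8g table=%-10g exact=%-10g\n", x, table_log2(x), log2f(x));
        return 0;
    }

Whether this actually beats the polynomial depends entirely on whether those 2 KB stay in cache, which is exactly the caveat raised in the thread.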