GCC optimized code gives strange floating point results (gcc.gnu.org)
145 points by tbrowbdidnso on Feb 26, 2017 | 119 comments



I met William Kahan about 15 years ago and asked him about this. His response: "If your program says double then yes, of course it's a bug if the compiler does long-double precision arithmetic instead. Isn't that obvious? Why are you even asking this?"

If the author of the IEEE 754 standard says that the way you do floating-point arithmetic is obviously a bug, you're probably doing it wrong.


> If the author of the IEEE 754 standard says that the way you do floating-point arithmetic is obviously a bug, you're probably doing it wrong.

That's interesting, because it seems to contradict his attitude toward FLT_EVAL_METHOD.

Kahan is a strong supporter of FLT_EVAL_METHOD==2—in fact, he's quite possibly the single person who's most responsible for its existence—and FLT_EVAL_METHOD==2 has been resoundingly rejected by compilers on newer architectures due to being totally unintuitive. (For those who don't know, FLT_EVAL_METHOD==2 means that "x∗y+z" can have a different result from "double tmp = x∗y; tmp + z", even if x, y, and z are doubles. It's a very similar situation to the bug, just with expression temporaries vs. explicit storage locations instead of register-memory spills.)
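
To make the distinction concrete, a minimal sketch (illustrative only; whether the two results actually differ depends on the target and flags, e.g. 32-bit x87 code generation):

    // Sketch: under FLT_EVAL_METHOD==2 the product in "direct" may be kept
    // at 80-bit precision, while the assignment to "tmp" rounds it to
    // double first, so the two results can differ.
    #include <cstdio>

    void compare(double x, double y, double z) {
        double direct = x * y + z;   // intermediate may carry excess precision
        double tmp = x * y;          // assignment rounds the product to double
        double via_tmp = tmp + z;
        if (direct != via_tmp)
            std::printf("differ: %.20g vs %.20g\n", direct, via_tmp);
    }

    int main() {
        compare(1.0 / 3.0, 3.0, -1.0);
        return 0;
    }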

In particular, Kahan wrote a widely-shared, hyperbolic essay called "Why Java Floating Point Hurts Everyone Everywhere" (which is still sometimes shared around today as a way to attack Java, despite being obsolete) castigating Java for having FLT_EVAL_METHOD==0. Java added the little-used "strictfp" in response, but history proved Java's original behavior right—with the advent of SSE on x86-64, GCC switched to FLT_EVAL_METHOD==0 by default, matching Java.

Besides, Kahan's x87 FPU design is arguably responsible for this bug to begin with. The 80-bit FP register stack design makes sense when you consider ancient C compilers that didn't keep variables in registers unless explicitly told to do so. In that case, the register stack allows for a simple implementation of FLT_EVAL_METHOD==2. But if assignment to a variable doesn't result in a spill (as in any non-toy compiler since 1990?) then this becomes awkward, because register spills no longer coincide with those evaluation rules. This is just a design mistake in the x87 FPU, and it's a large part of the reason why we have scalar FP math in SSE.


I did a fast "reread" of "Why Java Floating Point Hurts Everyone Everywhere" and I can't see FLT_EVAL_METHOD mentioned anywhere. What I see is that Kahan argues that Java should let him use 80 bits if he wants to, and that float arithmetic, unless the programmer explicitly says otherwise, should be done at least "as in K&R C", i.e. always as doubles. Basically his argument is "don't limit what I can program to only your strange subset of what's already in the standard and in the hardware, and also provide defaults such that when the programmer doesn't specify the behavior explicitly, the behavior is the numerically safer one." Which in the case of the sample program can't mean "one sum is different from another sum" when both times the double is the explicit result of the sum operation. It's not a question of "should it spill or not"; the compiler has to respect the size which the programmer specified.

P.S. answer to pcwalton's post below: no, it's certainly not "pretty much" "hey, you lose precision when you evaluate in fewer bits." It's: let me do the full IEEE standard in Java. It's: if the programmer doesn't specify what the sizes of the calculations in the expressions are, do these calculations with the biggest sizes (but predictably!), and if he specifies, respect what he specified (1). In the sample program we discuss, doubles as the results are explicit. These doubles can then be compared even in 80 bits, but they will still be the same. It's not a "spill"; it's an explicit cast that should happen.

1) "For example, tan(x) should delivers a double result no matter whether x is float or double, but tan((float)x) should deliver a float result provided a suitable tan method is available. Thus, a programmer who gives no thought to the question gets the safer default." This is obviously not "hey, you lose precision when you evaluate in fewer bits" he wants to be able to choose.


FLT_EVAL_METHOD is the modern term. What he calls the Java semantics we now call FLT_EVAL_METHOD==0, and what he calls K&R C we now call FLT_EVAL_METHOD==2. Contrary to that essay, for years now C on the most popular architectures by default follows the Java semantics.

His argument is pretty much "hey, you lose precision when you evaluate in fewer bits". Well, yeah, of course you do. The obvious counterargument is "if you need long double, write 'long double'; don't count on the compiler to automatically widen in expressions". The minor precision benefits in code that was fragile to begin with if written in C are by no means worth the giant refactoring hazard of "introducing temporaries can break your code".

> It's: if the programmer doesn't specify what the sizes of the calculations in the expressions are, do these calculations with the biggest sizes (but predictably!), and if he specifies, respect what he specified(1).

The programmer is specifying. When you say "tan(x)" with x as a float, then most programmers would expect to get a float out. In Haskell speak, people expect the type of "tan" to be "Floating a => a -> a" (which is, unsurprisingly, exactly what the type of "tan" is in Haskell [1]). Why? Because it's consistent with the rest of the language: everyone knows that x/y with x and y both ints yields an int, not a double. Likewise, x/y with x and y both floats intuitively yields a float, not a double (and with FLT_EVAL_METHOD==0, the norm in most cases nowadays, this is in fact how it works).

Saying that FLT_EVAL_METHOD==0 robs you of the ability to choose to evaluate in high precision makes no more sense than saying that the way that x/y truncates the remainder robs you of the ability to do floating point division. As all C programmers know, if you want a fractional result from integer division, you write (double)x/y. Likewise, if you want the double-precision tangent of the single-precision x, you would expect to write tan((double)x).

There is a slightly stronger argument that Java doesn't expose access to the 80-bit precision in the x86 FPU. That's true. But the right solution to that is to introduce "long double" in Java (which is what C did), not to complicate the semantics. An implementation of 80-bit floating point that doesn't allow you to store an 80-bit value in a variable is pretty much a toy implementation no matter what. You will need first-class "long double" in order to claim to have true 80-bit support, and once you have it, you have no need for FLT_EVAL_METHOD==2.

[1]: http://zvon.org/other/haskell/Outputprelude/tan_f.html


> Saying that FLT_EVAL_METHOD==0 robs you of the ability to choose

Kahan never defined something like FLT_EVAL_METHOD==0; he wrote the PDF you linked to, and the PDF doesn't contain claims in the form you say it does, and I'm not making them either. He says in his suggestion to "fix" Java, among other things: calculate the partial results which aren't specified explicitly in higher precision, and respect the target sizes as the programmer specified them. That is not what the example program produced with the -O flag. And I don't see anywhere that what Kahan suggested is equivalent to

http://en.cppreference.com/w/cpp/types/climits/FLT_EVAL_METH...

"FLT_EVAL_METHOD==2 all operations and constants evaluate in the range and precision of long double. Additionally, both float_t and double_t are equivalent to long double"

He never mentions anything like "float_t" and "double_t" there, as far as I saw. He mentions K&R C, which never had any of these, and where float x = something meant x is 32 bits, tan( x ) meant pass the 32-bit x to the double tan and return a double, and float y = tan( x ) meant store the double returned from tan as 32 bits.

The program in the example that started the discussion is explicitly producing two double sums (that is, that's what the programmer wrote) so it should never compare one 80-bit and one 64-bit sum. That behavior is surely not something that you can find in Kahan's suggestions.

Please read Kahan's document once, carefully, to see if he writes what you believe he writes (because you keep referring to a definition he didn't make, and he explicitly knew that not "everything should be long double"). I claim he doesn't.

> An implementation of 80-bit floating point that doesn't allow you to store an 80-bit value in a variable is pretty much a toy implementation no matter what.

x87 instructions have it: FSTP m80fp and Kahan wanted that in the higher level languages too (pg. 77).

"Names for primitive floating-point types or for Borneo floating-point classes:"

"long double = 10+-byte IEEE 754 Double Extended with at least 64 sig. bits etc."

He also writes (pg. 43 of his PDF):

"Of course explicit casts and assignments to a narrower precision must round superfluous digits away as the programmer directs"

And on page 80, regarding false arguments:

"At first sight, a frequently occurring assignment X = Y¤Z involving floats X, Y, Z in just one algebraic operation ¤ appears to require that Y and Z be converted to double, and that Y¤Z be computed and rounded to double, and then rounded again to float to be stored in X . The same result X ( and the same exceptions if any ) are obtained sooner by rounding Y¤Z to float directly." That is, the standard and the x87 are already designed that it can be done directly.


> "calculate the partial results which aren't specified explicitly to higher precision"

I understand the paper. That is exactly what I'm arguing is the wrong behavior, and C compilers have stopped doing it on popular hardware too.

"Partial results which aren't specified explicitly" is really misleading terminology in a typed language. Given int x and int y, the type of "x/y" is specified explicitly given the typing rules of the language. And that's why programmers rightly expect that "int x = 5, y = 2; double z = x/y;" yields z=2.0, not z=2.5.

> "At first sight, a frequently occurring assignment X = Y¤Z involving floats X, Y, Z in just one algebraic operation ¤ appears to require that Y and Z be converted to double, and that Y¤Z be computed and rounded to double, and then rounded again to float to be stored in X . The same result X ( and the same exceptions if any ) are obtained sooner by rounding Y¤Z to float directly."

Yes, the unintuitive FLT_EVAL_METHOD==2 semantics can be optimized into the less confusing FLT_EVAL_METHOD==0 semantics in many cases. That doesn't change the fact that they're confusing semantics.


> Given int x and int y, the type of "x/y" is specified explicitly given the typing rules of the language. And that's why programmers rightly expect that "int x = 5, y = 2; double z = x/y;" yields z=2.0, not z=2.5

Red herring. Kahan advocated the approach of K&R C. K&R C didn't perform the division of two integers in floating point. Your example would be 2 in K&R C too.

> the unintuitive FLT_EVAL_METHOD==2 semantics can be optimized into the less confusing FLT_EVAL_METHOD==0 semantics in many cases.

The example given is not "optimization" at all, it's required by the definition of IEEE 754, mostly designed by Kahan.

> That doesn't change the fact that they're confusing semantics.

I see that as confusion from you repeatedly misdirecting the discussion to "FLT_EVAL_METHOD", which is neither what Kahan wrote in his Java paper, nor what's in IEEE 754, and even less what's in K&R C.

Disclaimer: I actually studied IEEE 754 shortly after it was standardized, and followed Kahan's work afterwards. The first C I've learned was K&R C, directly from their first book. I've implemented some hardware doing some IEEE 754 operations completely to the letter of the standard (to the last bit, rounding direction and trap, the control of rounding directions and "meaningful" trap handling is also what Kahan complained that Java didn't do) and implemented some compilers that used x87.


> Red herring. Kahan advocated the approach of K&R C. K&R C didn't perform the division of two integers in floating point. Your example would be 2 in K&R C too.

You missed the point. Kahan's proposed semantics are inconsistent with what programmers expect. The 5/2 example is intended to show why programmers expect Java-like semantics.

> The example given is not "optimization" at all, it's required by the definition of IEEE 754, mostly designed by Kahan.

How does an optimization being guaranteed to be correct (which all optimizations sure ought to be!) make it not an optimization?

> I see that as confusion from you repeatedly misdirecting the discussion to the "FLT_EVAL_METHOD" which is neither what Kahan wrote in his Java paper, nor what's in IEEE 754 and even less in K&R C.

We're talking past each other at this point. FLT_EVAL_METHOD is precisely what Kahan is talking about in the Java paper: it is the term introduced in the C99 spec for the precision that intermediate floating-point computations are performed in. The difference between Java and K&R C is that Java uses FLT_EVAL_METHOD==0, while K&R C uses FLT_EVAL_METHOD==1. I don't know how to make that clearer.
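
For readers following along: FLT_EVAL_METHOD is just a macro exposed by <float.h> (or <cfloat> in C++) since C99, so you can print it to see which regime your compiler targets. A minimal check:

    // Prints the intermediate-evaluation mode the implementation uses:
    //   0 - evaluate each operation in the precision of its type
    //   1 - evaluate float and double operations in double
    //   2 - evaluate all operations in long double (classic x87 behavior)
    //  -1 - indeterminable
    #include <cfloat>
    #include <cstdio>

    int main() {
        std::printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
        return 0;
    }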


I can imagine that is "confusing" to you, but "what programmers expect" is not what the people who talk about "spills" (like you did) expect. I understand where you're coming from when you mention them. But spills etc. are all irrelevant to how we started this discussion:

When cpreciva stated that Kahan said that the behavior of the given example (the code where two sums of some numbers are computed as doubles and compared for equality) should never give "false", you wrote "it seems to contradict his attitude ..." and your "proof" was Kahan's Java paper.

I showed that there's nothing anywhere in Kahan's Java paper or K&R that would support such behavior. I quoted explicit lines from Kahan's paper which explicitly say that explicit evaluation to doubles should properly round away the bits that don't fit. And you continued to claim "but but FLT_EVAL_METHOD" whatever.

> We're talking past each other at this point.

I agree about that. You also gave the false example 5/2, and you now explain it was your "intention" even though it's neither K&R nor what Kahan argued. You also use the name "optimization" for a basic design property of IEEE 754 operations, probably the best feature of IEEE 754 compared to all the "common practices" of the time it was designed. I'd be glad to explain to you what it is, but you'd actually have to make the effort to understand the design principles of IEEE 754. If you're really interested, please post some place where we could start a new thread for that topic; I won't post in this thread any more. There we can also discuss, if you are really interested, what Kahan actually says in the paper you posted.


The integer example is not a red herring - it's an argument for consistency by comparison with an isomorphic evaluation semantics.

If you look at it closely though, you might find material for your position in integer promotion. But I think you're missing the point of what is being argued vis-à-vis refactoring of expression trees changing calculation results.


> the point of what is being argued vis-à-vis refactoring of expression trees changing calculation results

That was not the topic that started the discussion, and I never discussed it here. Check for yourself. It started with pcwalton's (I simplify) "the bug is because of Kahan, see his Java paper" and I showed the opposite: whoever reads Kahan's paper can see that, following his suggestions in the paper, the bug can only be a bug. I also cited the exact pages.

I know there are a lot of people who would like to handle FP arithmetic identically to integer arithmetic, but that can't work by the definition of FP. And please see my other post here -- it's surely time to continue the discussion somewhere else.


I have also met Kahan. He gives new meaning to the word "pedantic".

The reality is that numerical analysts want the computer to do exactly what they tell it to, in precisely the order described. The rest of the human race wants approximately the right answer reasonably quickly. (Floating point is a bad choice if you want exactly the right answer.) And the people who make purchase decisions read benchmark reports they don't understand.


> The rest of the human race wants approximately the right answer reasonably quickly

But numerical analysts really require being able to specify the semantics of calculations precisely in order to write libraries that mere mortals can use to get the approximate right answer.

A quick perusal of any numerical analysis textbook should convince you that order of evaluation, order of truncation, overflow modes, and the like are very relevant for being able to write a library that has the right order of error.

Something like Kahan summation will have vastly different results depending on whether intermediate calculations are done at a higher precision or whether the compiler is allowed to reorder expressions. In fact, an overly aggressive optimizing compiler will make Kahan summation simply fail [1].

If all you're doing is adding a few numbers together, by all means, you don't need or want to think about precise semantics. But if you've ever used BLAS or Numpy or any other math package, you very much depend on languages and compilers being pedantic as possible in spelling out how floating point gets evaluated.

[1] https://en.wikipedia.org/wiki/Kahan_summation_algorithm
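
For reference, a minimal sketch of compensated (Kahan) summation as usually written; the compensation step is exactly the kind of code that value-changing FP optimizations can silently delete:

    // Kahan (compensated) summation: c accumulates the low-order bits lost
    // in each addition. A compiler allowed to reassociate FP math (e.g.
    // under -ffast-math) may simplify (t - sum) - y to zero and defeat it.
    #include <vector>

    double kahan_sum(const std::vector<double>& xs) {
        double sum = 0.0;
        double c = 0.0;              // running compensation
        for (double x : xs) {
            double y = x - c;
            double t = sum + y;
            c = (t - sum) - y;       // algebraically zero, numerically not
            sum = t;
        }
        return sum;
    }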


>But numerical analysts really require being able to specify the semantics of calculations precisely in order to write libraries that mere mortals can use to get the approximate right answer.

>A quick perusal of any numerical analysis textbook should convince you that order of evaluation, order of truncation, overflow modes, and the like are very relevant for being able to write a library that has the right order of error.

I see your point. I've done more than peruse numerical analysis textbooks, and as an old-school CPU logic designer, I've forgotten more about implementing floating point than most programmers know to start with.

But your point is not in conflict with mine. Numerical analysis requires precise operation ordering. Most people not only don't care, they don't have sufficient understanding of the issues that you could even convince them to care.


Actually, BLAS implementations usually do guarantee error bounds, but the implementation is almost always tailored for speed over precision. That said, you can always opt for extended precision at the cost of performance.


The reality is that numerical analysts want the computer to do exactly what they tell it to, in precisely the order described.

Not at all. As far as we're concerned, it's just fine if the computer works by consulting an oracle.

But we want the answers to be exactly the same as if the computer did what we told it to do. :-)


Kahan has another great paper, https://people.eecs.berkeley.edu/~wkahan/Mindless.pdf , which discusses the futility of attempting to rely on floating-point numbers.


Interesting. This reminds me of a bug I had to track down when rewriting a bunch of Java image analysis code in C# (yeah I would have preferred neither language for this sort of thing, but that's a different story.)

Java has a flag which allows you to specify how 32-bit floating point arithmetic happens (STRICT_FP or something.) C# does not, and was doing extended precision math on floats (80-bit register) and then truncating the result. This led to small differences in feature extraction output.


And I believe he's right, if he was asked specifically what that program should do. If all the statements in the program are really just:

    #include <stdio.h>

    void test(double x, double y)
    {
      const double y2 = x + 1.0;
      if (y != y2) printf("error\n");
    }

    void main()
    {
      const double x = .012;
      const double y = x + 1.0;

      test(x, y);
    }
Then the comparison has to be done on doubles, and "but it's extended precision internally" mentioned in some answers is the wrong answer. All the variables in the program are stated to be doubles. If I wrote the x86 and x87 asm instructions myself, I could easily write them so that they work as expected, even without SSE2.

However if you compile with some "special" optimizations that are documented to relax some specific code generation rules, then it can be OK that these specific optimizations behave differently, but then: you asked for it by choosing that documented optimization.

In the answers there, Richard Henderson wrote 2001-01-16 06:30:04 UTC:

"You are seeing the results of excess precision in the FPU. Either change the rounding precision in the FPCR, or work around the problem with -ffloat-store."

He's also right, if -ffloat-store is documented to be turned off by -O and if it's documented that the flag also affects the precomputed calculations that are done at compile time. If the second detail is not documented, the results could be unexpected, but again it depends on what you expect from the optimizer. A properly written compiler and optimizer should perform the calculations done during code generation correctly, to the definition, and not cut corners. If the generated code actually performs the != at run time, then the compile-time calculations are irrelevant, but the sane default should be something like -ffloat-store when not using SSE2, even with -O, if -O has the semantics "optimizations that should not produce really unexpected effects." It sounds complicated, but it has its reasons.
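
For concreteness, a sketch of the manual equivalent of what -ffloat-store is meant to do: forcing each intermediate through a 64-bit memory slot (here via volatile, assuming x87 code generation) so the excess precision is rounded away before the comparison:

    // Sketch: forcing values through 64-bit memory discards the x87's
    // excess precision, so both sums are rounded identically before the
    // comparison. This is the effect -ffloat-store aims to automate.
    #include <cstdio>

    int main() {
        const double x = .012;
        volatile double y  = x + 1.0;   // store rounds the sum to double
        volatile double y2 = x + 1.0;   // same operation, same rounding
        if (y != y2) std::printf("error\n");
        return 0;
    }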

P.S. answer to nsajko's post below: I mention -ffloat-store, not -ffast-math. Please read the description of -ffloat-store:

https://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Optimize-Options....

and also "Note on x86 and m68080 floating-point math" here https://gcc.gnu.org/wiki/FloatingPointMath


With -O2 gcc actually compiles the program to

    xorl %eax, %eax
    ret
That is, it optimized away everything.


Yes, that's the correct optimization. That code obviously can't perform printf("error\n") to manifest the bug, and obviously both additions were done at compile time, producing results that are equal.


The relevant option is actually -fexcess-precision=standard, which is implied by -std=c99.


Nonstandard compiler options (ffast-math) are not relevant to this bug.


I remember that in university we had a test on a toy 16-bit floating point ALU. We had to walk through each of the steps in the ALU for the usual operations, showing the bit patterns at each step and keeping track of precision loss, etc. Because we had to track state and some aspects of the test were randomized, there were a few cases where identical operations on identical inputs resulted in different outputs due to the residual state.

During the next lecture, someone sticks their hand up and says, "this FPU thing is so nondeterministic, who in their right mind would use floating point for 99% of the kinds of work that we use computers for?"

The professor says, "you should think real hard about that the next time you decide to use a floating point type in your program."


> identical operations on identical inputs resulted in different outputs due to the residual state.

What? What residual state? Floating-point math is tricky, but on a particular machine/compiler/compile-options set you should expect that identical operations on identical inputs should give identical results.

What is this "residual state" you mention?

If you have an FPU that lets you change precision or rounding modes then you can get different results from the same operations - is that what you're talking about?

I get frustrated when people think that floating-point just randomly changes bits for no reason. It doesn't. It randomly changes bits for specific reasons, which can be understood.


trelliscoded probably meant bits in the scratchpad, which are not accessible to the programmer. I read somewhere that those hidden bits may not be saved and restored properly when the CPU resumes execution of the program (task switching). I do not know if that can happen, but even if not, achieving consistency on subsequent runs is quite difficult - apparently one has to make sure the program and all libraries it uses were compiled with special compiler flags that prevent data alignment from influencing the results of FPU operations.

https://software.intel.com/articles/consistency-of-floating-...


Probably the bug report with the most duplicates?

> The x87 FPU was originally designed in (or before) 1980. I think that's why it is quite simple: it has only one unit for all FP data types. Of course, the precision must be of the widest type, which is the 80-bit long double.

> Consider you have a program, where all the FP variables are of the type double. You perform some FP operations and one of them is e.g. 1e-300/1e300, which results in 1e-600. Despite this value cannot be held by a "double", it is stored in an 80-bit FPU register as the result. Consider you use the variable "x" to hold that result. If the program has been compiled with optimization, the value need not be stored in RAM. So, say, it is still in the register.

> Consider you need x to be nonzero, so you perform the test x != 0. Since 1e-600 is not zero, the test yields true. While you perform some other computations, the value is moved to RAM and converted to 0 because x is of type "double". Now you want to use your certainly nonzero x... Hard luck :-(

> Note that if the result doesn't have its corresponding variable and you perform the test directly on an expression, the problem can come to light even without optimization.

> It could seem that performing all FP operations in extended precision can bring benefits only. But it introduces a serious pitfall: moving a value may change the value!!!

TIL long double is a thing.
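
A minimal sketch of the scenario quoted above; whether it actually reproduces depends on x87 code generation and optimization level (e.g. gcc -m32 -O2):

    // 1e-300 / 1e300 = 1e-600 underflows double but fits in an 80-bit x87
    // register, so a register-held x can compare nonzero and then become
    // zero once it is spilled to a 64-bit memory slot.
    #include <cstdio>

    int main() {
        double a = 1e-300, b = 1e300;
        double x = a / b;                  // representable only with x87 excess precision
        if (x != 0) {
            // ...intervening work may force x out to memory as a double...
            std::printf("x = %g\n", x);    // may print 0 despite the test above
        }
        return 0;
    }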


So would turning off denormals fix the problem? That concept has given me 100 problems in the past and has helped me 0 times.


The solution is to compile with SSE2 on x86. (flags: -mfpmath=sse -msse -msse2)

On x86-64, the compiler should default to SSE2.

SSE2 is ~16 years old so compatibility shouldn't be an issue.


Technically, you only actually need the instructions from the original SSE set to do floating point operations. SSE2 adds a bunch of really useful integer and floating point instructions.

But the only extra CPUs that gets you are the Pentium III, AMD Athlon XP, and AMD Duron.

SSE2 is supported on every single x86 cpu released after those, such as the Pentium 4, Pentium M, and Athlon 64.

It's a real shame that people are still using CPUs that don't support SSE4, such as the AMD Phenom and Phenom II CPUs; otherwise everyone would have moved exclusively to SSE4.


SSE1 is single-precision only. SSE2 added double precision.

So the bug will still appear for 'double' using just SSE1.


Some Atoms only support up to SSSE3 too.


Turning off denormals won't fix the issue that a long double has a wider exponent field than a double, and can represent smaller magnitudes without relying on denormals.


In case anyone is wondering why this showed up right now, it's because tbrowbdidnso was reading the 'examples of unexpected mathematical images' article [1] and came across the one about FPU-dependent aliasing artifacts [2].

[1] https://news.ycombinator.com/item?id=13736568

[2] http://www.cgl.uwaterloo.ca/csk/projects/alias/


Eh, was wondering how I wandered to the page myself.


    *DIAGNOSTIC* THE TEST FOR EQUALITY BETWEEN NON-INTEGERS MAY NOT BE MEANINGFUL
UNIVAC 1107 FORTRAN IV error message for this, 1967.


This is the bug that makes std::set< float > crash randomly under GCC with default compile options.

The fact that they consider this reasonable is insane.


Is having a set of floats reasonable?


That question should be immaterial from the perspective of the implementation. The question that they need to answer is, "does the standard for the language allow a set of floats, and require a certain behavior for it?". The answer is "yes" to both, and for the second answer, that behavior certainly doesn't include crashes.


Yes. There is nothing about floats that should make them unusable in a set.

In particular though, it's fairly common to use std::sets (or std::maps which use the same underlying data structure) to maintain something in a sorted order.

I ran into this myself when implementing A*. I was using an std::map with the float as key for the priority queue. Under GCC with default compile arguments it would randomly hang or crash due to GCC breaking the set invariant.


> Yes. There is nothing about floats that should make them unusable in a set.

Strictly speaking, there is. The < operator for floating-point types does not impose a strict weak order because NaNs are a thing.

In any case, comparing floating-point values for strict equality (instead of using an appropriate epsilon) is more often a bad idea than not.


But std::set doesn't use operator== for comparing values; it uses !(a<b) && !(b<a).

http://www.cplusplus.com/reference/set/set:

"Compare

A binary predicate that takes two arguments of the same type as the elements and returns a bool. The expression comp(a,b), where comp is an object of this type and a and b are key values, shall return true if a is considered to go before b in the strict weak ordering the function defines. The set object uses this expression to determine both the order the elements follow in the container and whether two element keys are equivalent (by comparing them reflexively: they are equivalent if !comp(a,b) && !comp(b,a)). No two elements in a set container can be equivalent. This can be a function pointer or a function object (see constructor for an example). This defaults to less<T>, which returns the same as applying the less-than operator (a<b)."

I think that guarantees that std::set works for IEEE floating point values, and that NaNs are at the end of the set.


"I think that guarantees that std::set works for IEEE floating point values, and that NaNs are at the end of the set. "

No, this is, by definition, not a strict weak order because it's not transitive. It also doesn't have transitivity of equivalence.

It's at best a partial ordering. "Strict weak" order is, by the way, such an incredibly stupid name it's hard to know where to begin (it's what they call it in math, but that doesn't make it any more sensible).

It's much easier to understand in terms of the invariants.

Partial orders: f(x, x) must be false.

f(x, y) implies !f(y, x)

f(x, y) and f(y, z) imply f(x, z).

Strict weak ordering: all of the above, plus: define equivalence as f(x, y) and f(y, x) both being false. Then if x is equivalent to y, and y is equivalent to z, x must be equivalent to z.

IE for all x, y, z: if f(x, y) and f(y, x) are both false, and f(y, z) and f(z,y) are both false, then f(x,z) and f(z, x) must both be false.

It's strict because it's not reflexive, and it's weak because !(a < b) && !(b < a) does not imply a == b. It just implies they are "equivalent", and as mentioned, you get a total ordering on the equivalence classes.


A strict weak order partitions a set of values into a collection of equivalence classes. Two values belong to different equivalence classes if and only if a<b or b<a. If either a or b is NaN, that condition is always false. Thus, every NaN is equivalent to every other value (regardless of the fact that no NaN is equal to any other value!)

However, any equivalence relation must be transitive, that is, if a≡b and b≡c then a≡c. Let a=1, c=2, and b=(any)NaN. We now have 1≡NaN and 2≡NaN, which implies that 1≡2, which is clearly false. Thus, the < operator for floating-point types is not a strict weak order in the presence of NaNs.


Well the thing is, it should not fail because of obscure reasons completely unrelated to NaNs...


It's also arguably a bug that a standard ordering isn't defined for NaNs.


So, you want to redefine IEEE floats? 'cause it says that there is no ordering for NaNs.

And if you do redefine IEEE floats, where do NaNs go, and why is that better?


I'm saying IEEE shouldn't have defined them that way, yes. It was an arbitrary choice. As to where, before negative infinity or after positive infinity; I don't really care which. But the fact that we can't have good things (e.g. a standards-compliant std::set<float>) because a standards committee punted on making that choice makes me sad :(


I do wonder why they didn't special-case std::less for floats. They already did it for pointers - operator< for pointers isn't even defined, much less a strict weak order, if the operands do not point to the same object (region of storage). This is in order to facilitate non-flat addressing schemes. std::less, however, is explicitly specified to work with any pair of pointers.
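
In the meantime, a user-supplied comparator can restore a genuine strict weak order by sending every NaN after all ordinary values. A sketch (not a standard facility, and it addresses only NaN, not the excess-precision bug discussed above):

    // Custom comparator for std::set<float>: all non-NaN values order
    // normally, and every NaN falls into a single equivalence class that
    // sorts after them, so the strict-weak-order requirements hold.
    #include <cmath>
    #include <set>

    struct NanAfterLess {
        bool operator()(float a, float b) const {
            const bool an = std::isnan(a), bn = std::isnan(b);
            if (an || bn) return !an && bn;  // non-NaN < NaN; NaN vs NaN: equivalent
            return a < b;
        }
    };

    int main() {
        std::set<float, NanAfterLess> s;
        s.insert(1.0f);
        s.insert(std::nanf(""));   // collapses with any other NaN inserted later
        s.insert(2.0f);
        return 0;
    }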


Thanks for the example. But I'd be wary of anything that involves comparing floats.

I wonder if you could make a template specialization for set of floats that would fix the problem. I suspect not easily.

I'll wander off and look at the source now ...


I suspect you are only wary of comparing floats due to Bug #323 in the first place though! Programmers have built up a lot of superstition about floats that floats don't really deserve, largely because of it.

There is a bizarre notion of non-determinism that plagues their reputation. People think that adding the same two numbers on two different CPUs can produce different results, or even that adding the same two numbers on the same CPU can have different results.

No wonder when bug #323 causes floats to compare differently when you come back and read them later!

When you use SSE math exclusively, the weirdness of doing calculations in a different precision than what you store goes away.


If I was going to use a std::set of float, I'd also implement my own comparison operator for the set in order to avoid these problems.


Hah! That is such a great read, all of the "this isn't a bug" and the tombstones of all the "duplicate" bug reports. Reminds me how challenging it is to do decimal arithmetic in binary floating point. Mike Cowlishaw could go on for literally hours about how broken just the concept is, and he wrote and shipped a decimal-based floating point library which has none of these issues [1].

I still remember when I was told to never compare my floating point result to zero because even when it should be it probably won't be, always compare it to be less than a small sigma.

[1] http://speleotrove.com/decimal/


> always compare it to be less than a small sigma.

Unfortunately, this approach can result in problems with the usual expectation that equality is transitive. Consider 0 < x < y < sigma. 0 == x, 0 == y, but x != y. This problem applies generally any time you're doing equality comparisons with sigma.


You aren't following the instructions. You have to rewrite all comparisons to use a small sigma.

`0.f == x` becomes `SIGMA > x`

`0.f == y` becomes `SIGMA > y`

`x == y` becomes `SIGMA > abs(x - y)`

All the expression work if written correctly.


I probably should have given a better example instead of the hasty one. -sigma < x < 0 < y < sigma, abs(x - y) > sigma. I hope we can agree that this causes problems with transitivity :-)

There are also problems with granularity. Suppose x and y are both much, much larger than sigma, and the smallest representable difference between x and y is e.g. 1.0. Now there's no possible way that abs(x - y) < sigma.

I suspect that the resolution to this is to compare equality within some number of ULPs, but: 1) You still have the transitivity problem and 2) This is not my field of expertise, and there may be a more appropriate solution.
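
For what it's worth, a sketch of the "within N ULPs" comparison (with the usual caveats: it is still not transitive, and NaN and infinity need their own handling):

    // Compare doubles by the number of representable values between them.
    // Reinterpreting the bits as integers works because, for two finite
    // doubles of the same sign, the difference of their bit patterns is
    // exactly their distance in ULPs.
    #include <cmath>
    #include <cstdint>
    #include <cstring>

    bool nearly_equal(double a, double b, std::int64_t max_ulps = 4) {
        if (std::isnan(a) || std::isnan(b)) return false;
        std::int64_t ia, ib;
        std::memcpy(&ia, &a, sizeof a);           // avoid aliasing UB
        std::memcpy(&ib, &b, sizeof b);
        if ((ia < 0) != (ib < 0)) return a == b;  // different signs: only +0 == -0
        const std::int64_t diff = ia > ib ? ia - ib : ib - ia;
        return diff <= max_ulps;
    }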


> -sigma < x < 0 < y < sigma, abs(x - y) > sigma

This statement works fine. If the numbers are farther than sigma apart, then `abs(x - y) > sigma` is true, as it should be. It's their distance from each other that matters, not their distance to 0.

> There are also problems with granularity. Suppose x and y are both much, much larger than sigma

This is part of the same problem. You have to choose a maximum acceptable value. And ensure you don't overflow it, and then you can choose a sigma based on that maximum value such that the difference between any two numbers never exceeds sigma. In practice you already have a maximum value for your domain that the simulation should never exceed or it means it is incorrect anyway.


I have been following this bug in its various forms for 30 years, and it is covered in pages 55-62 of The End of Error: Unum Computing. It is the "hidden scratchpad" bug, where designers of a computing environment try to be helpful by providing more precision than what is explicitly requested by the programmer. If you read William Kahan's blog postings about this issue, you will find he speaks fervently on both sides, sometimes defending it and other times vehemently deriding it. Thus the confusion. Kahan favors any technique that improves accuracy, even if it creates behavior that can only be explained through careful numerical analysis (which very few programmers have the time or skill set to perform).

Unums and posits, incidentally, are designed to eliminate all sources of irreproducibility and inexplicable behavior like this gcc bug. Hardware support for posits is on the way, maybe before the year is out. There is a way out of this mess, but it will take a few years to become mainstream.


No, Kahan advocates for what the C standard does which GCC doesn't implement correctly.


After reading much of the discussion, my conclusion is that it's past time to deprecate the x87 FPU. Ever since x86_64, it is expected that SSE2 will be available and compilers are expected to use it over the old FPU. Face it, this is an old Intel hardware "feature" that causes problems and has been obsolete for a long time.


The -fexcess-precision=standard option, which is enabled by -std=c99 (requesting ISO C99 compliance), forces excess precision to be rounded away on all assignments and casts to narrower precision types.

See https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#ind...



Always a good read:

David Goldberg: What Every Computer Scientist Should Know About Floating-Point Arithmetic (1991)

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.h...


I'm very surprised nobody has brought up John Gustafson's Unum work. His latest iteration is the posit, which he presented at Stanford earlier this month[0][1]

EDIT: apologies to CalChris[2]. What I should have said is that nobody has discussed Gustafson's Unum work.

[0] https://www.youtube.com/watch?v=aP0Y1uAA-2Y (relevant question at 1h22m37s)

[1] http://web.stanford.edu/class/ee380/Abstracts/170201.html

[2] https://news.ycombinator.com/item?id=13739617


Because I write a lot of numerical code in Fortran, using the Intel compiler on Windows for production and GCC/GNU Fortran for prototyping on Linux, I was a bit scared by the title. This is a problem with equality comparison of doubles, something I try to avoid as much as possible, as this was the first thing I learnt 25 years ago... never compare doubles for equality.

Anyway, the bug report links to a very nice research paper [0]: "The pitfalls of verifying floating-point computations", which, even though very technical, is a good read on the subject.

[0]: https://hal.archives-ouvertes.fr/hal-00128124


I think that this is a case of a mismatch between two standards, not a bug. The C standard allows higher precision values to be used in place of lesser ones (extended precision vs double) and REQUIRES conversion to the correct precision during assignment (or casting). Also, the C abstract machine has the option to store those values when/if it deems it opportune; it won't change the observables as defined there.

On the other side, IEEE 754 allows extended precision to be used in place of double and of course requires that any chosen precision be kept or else.

But the C Standard mandates IEEE 754, too!

It seems to me that modern C deliberately chose to ignore this kind of mismatch in the name of (substantial) performance gains. K&R, good or bad, was way simpler!


> But the C Standard mandates IEEE 754, too!

Are you sure? I can't find it in the standard, specifically not in [1] page 28., section 5.2.4.2.2.

[1] (the final draft of C1X, dated 12 April 2011) http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf


You are absolutely correct, "mandates" is too strong a word. Annex F, which is normative, has a way out by not setting __STDC_IEC_559__; GCC 6.3.0 does set a slightly different GCC_IEC_559. Yet I think that my argument still holds:

floating point calculations can be executed at wider range (§5.1.2.3.12);

assignments and casts are under obligation to wipe out that extra precision (§ same paragraph);

I am no language lawyer, but... given the issue of program observable effects (§5.1.2.3.12) and the implied "as-if" rule that governs optimizations, how could equality possibly be stable over time?


Yeah, you are right. It seems to me that the problem is that we cram two standards into too few types. In addition to float, double, and long double there could be optional iec559_single_t, iec559_double_t and iec559_extended_t types (similar to the stdint.h integer types). The operations on variables of these types could be strictly iec559 conformant.


For the PathScale compiler's initial release on x86, in the unlikely event that the x87 was used at all, I set -mpc64 as the default. We never got a customer complaint about it, despite having many numerically-savvy customers.


More proof that floats are evil. Moral of the story is to never use equality to compare floats.


That's not the issue here at all. The IEEE 754 standard does in principle make floating-point operations completely deterministic, i.e. if you do a specific sequence of floating-point operations, then the result may indeed be slightly off from the result you would expect with infinite precision arithmetic, but it should be off in just the same way every time. For example (x + 0.1) + 0.1 and x + 0.2 can give different results due to rounding, but the respective result should be constant for a given x regardless of the machine, compiler, or phase of the moon. Knowing this, there are situations where exact comparisons are appropriate.

The "bug" here is that a single operation x + 0.1 and x + 0.1 can give two different results depending on the mood of the compiler.

It's not actually a bug, because the IEEE 754 standard permits this kind of sloppiness by default. You can get stricter semantics, but you have to ask for it explicitly by setting the right compiler flags. This is why I say that the IEEE 754 standard makes floating-point arithmetic deterministic "in principle" - in practice, you have to work for it. (Changes to the rounding mode are another big issue here.)
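
To make the first point concrete, a small self-contained illustration; for a given x each expression is reproducible run after run, but the two operation sequences need not agree with each other:

    // Whether a and b compare equal depends on how each step rounds, but
    // for a given x the values are the same on every conforming run.
    #include <cstdio>

    int main() {
        const double x = 0.3;
        const double a = (x + 0.1) + 0.1;   // two roundings
        const double b = x + 0.2;           // one rounding (plus the rounded constant 0.2)
        std::printf("%.17g %.17g -> %s\n", a, b, a == b ? "equal" : "differ");
        return 0;
    }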

An extra error of 1 ulp somewhere might not seem like much for your average numerical code, but it's a big deal in some situations:

* Assumptions about the precise behavior of floating-point cancellations are often used in numerical libraries, and deviations can cause disastrous loss of accuracy. For example, Kahan summation is a widely used algorithm for computing sums with increased accuracy. If compilers optimize the code too aggressively, the algorithm doesn't work and you get worse accuracy. And ironically, x87 long doubles can make general numerical calculations less accurate (not more accurate as intended) due to the double rounding issue.

* If you use floats in game logic, you might want it to behave exactly the same way on different platforms, e.g. for synchronization or replays.

* In scientific computing, you might want results to be precisely reproducible by others.

* Debugging!


In practice, I don't think IEEE 754 floating point is deterministic, especially not across multiple ISAs, multiple vendors of the same ISA (Intel, AMD), different generations of the same vendor's cpus, different compilers and different math libraries, even just C's libm.

So just a warning to everyone out there: unless you are very, very careful your floating point code will not be deterministic. If you thought you can just use floats and expect it to "just work" the same, then you're out of luck.

Here's a couple of articles which go into more detail about problems with floating point determinism:

http://gafferongames.com/networking-for-game-programmers/flo...

https://randomascii.wordpress.com/2013/07/16/floating-point-...


As long as the unit is IEEE 754, results are expected to be exact to the last ulp [1]. There might be issues in the lowering of high level languages to low level assembly, but that's a different story (i.e. the implementation is not IEEE 754 compliant).

[1] transcendental functions IIRC are not required to be exact to the full precision and historically they aren't. But there are (portable) math libraries that guarantee full precision.


Pretty much. If you want to compare two floats a and b within some precision eps, use:

    abs(a - b) <= eps*0.5*(a + b)
or even

    (abs(a - b) <= eps*abs(a)) && (abs(a - b) <= eps*abs(b))
which is implemented in Boost [0].

[0] http://www.boost.org/doc/libs/1_34_0/libs/test/doc/component...


I thought it was

(a / b) != (a / b)


Floats are not evil. They are just not as intuitive as integers because it is impossible to represent every possible rational number in the range with a fixed number of bits. Once you get that, everything makes sense.


That does not explain getting a failure from

assert x/y == x/y


I agree. Explaining that assertion requires explaining that the x87 FPU is providing a leaky abstraction. If you ask for double math, it gives you a result which is a long double while it resides in a register, and then becomes a double when moved to RAM.

There are some cases where direct comparison of floating point values is what you want, and this issue can still bite you even then.


For x87, it's not a leaky abstraction - it fully supports 80-bit floats as a type, and documents that its registers are 80-bit, and so are operations on them. There's nothing abstract about it.

The bug is that GCC is utilizing the (documented, by-design) x87 behavior in a way that causes it to produce the wrong result, for the sake of performance.


>compiler developers do unintuitive but technically legal thing in the name of performance

I still think the relationship between compiler developers and developer-users is toxic as fuck, but as long as we keep comparing compilers by increasingly anal microbenchmarks instead of increasingly anal "correct in practice" tests, I don't see why we continue to find this a surprise.

When we demand better performance, compiler developers are going to exploit undefined behavior as well as other language "loopholes" that don't make intuitive sense. Perhaps a new focus on correctness in practice would change this.


Properly implemented floats will never fail that (barring NaN). That GCC sometimes produces code that does is a compiler bug, not a float problem.


> More proof that floats are evil.

Just because someone doesn't know the basics of floating point numbers it doesn't mean that the floating point format is somehow evil.

Testing floats for equality is one of those no-nos which is taught right in the first class of introduction to FP numbers.

Another very basic issue covered right in the intro classes is the problem caused by using decimal values to define numbers encoded in binary format, which is often explained through the famous "0.1 doesn't exist in binary" trivia bit.


Part of my day job is to test floating point hardware and write compilers and assemblers. Testing equality---with some care---is completely fine. I can assure you this is a bug in GCC. The GCC bug is compounded by the nonstandard behavior of x87. The x87 unit is not "fixable" but GCC's behavior is.


That's interesting, you make it sound like the production of new FPUs and new compilers is common. Which industry do you work in?


There are numbers that are safe to compare against. Storing 1.0 (or 0.0 or 0.25 or any small integer), and then later comparing for equality to check "Did I assign this value before?" is completely fine.

Of course, it's not fine to ask "Does the result of this computation equal this value?", but everyone knows that.


> Of course, it's not fine to ask "Does the result of this computation equal this value?", but everyone knows that.

How about "Does the result of this computation equal the result of this exact same computation?"? That's the issue here.


Floats are evil? o.O What do you propose instead if you want to deal with non-integer values?


There's always fixed point integers[0]. They're less flexible than floats, but at least the amount of precision available is much more clear and uniform, and all the rounding is pretty explicit.

https://en.wikipedia.org/wiki/Fixed-point_arithmetic

https://en.wikipedia.org/wiki/Q_%28number_format%29
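
As an illustration, a tiny sketch of a hypothetical Q16.16 format (16 integer bits, 16 fractional bits); the helper names are made up for the example, not from any particular library:

    // Q16.16 fixed point: values are int32_t scaled by 2^16, so precision
    // is uniform across the range and every rounding step is explicit.
    #include <cstdint>
    #include <cstdio>

    using q16_16 = std::int32_t;

    q16_16 to_fixed(double x)  { return static_cast<q16_16>(x * 65536.0); }
    double to_double(q16_16 x) { return x / 65536.0; }

    q16_16 fixed_mul(q16_16 a, q16_16 b) {
        // Widen to 64 bits for the product, then shift the scale back out.
        return static_cast<q16_16>((static_cast<std::int64_t>(a) * b) >> 16);
    }

    int main() {
        const q16_16 a = to_fixed(1.5), b = to_fixed(2.25);
        std::printf("%f\n", to_double(fixed_mul(a, b)));   // 3.375000
        return 0;
    }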


Back in the late eighties, early nineties, I was of the opinion that fixed point is really the way to go, the only thing holding it back was lack of adequate compiler support. Until I realized what would actually happen if the world started using fixed point: programmers would always set the dynamic range too small, so we'd end up with a world full of programs that were constantly failing because they needed to be run on data larger than the test cases. The headaches of floating point, annoying though they can be at times, are much less onerous.


> Until I realized what would actually happen if the world started using fixed point: programmers would always set the dynamic range too small, so we'd end up with a world full of programs that were constantly failing because they needed to be run on data larger than the test cases.

Floating point doesn't actually fix this issue. A fun bug I dealt with was a black screen caused by a few pixels having NaN values - which were then picked up by a blur pass which would poison basically the entire back buffer. The original NaNs were caused by divides by near-zero values causing out-of-range results.

You now add precision issues: Multiple coordinate systems are a common hack-around to maintain local precision when using single-precision based physics engines. Stuttering effects when time is represented as "floating point seconds since uptime" when integer wraparound would avoid any problem in the first place. And this is before even adding e.g. half precision or worse floats into the mix.

There's a reason floating point isn't used for a lot of serious banking calculations.


To clarify, any time I in any way praise floating point, I'm always talking about double precision. I think single precision floating point should never have been invented, and given that it unfortunately was, it's best to behave as though it doesn't exist. Don't get me started about half precision.

Financial calculations are another matter; they usually require decimal arithmetic, not because it doesn't give rounding errors, but because it gives the same rounding errors an accountant gets when he checks the result with his desk calculator, and that's the requirement.


> To clarify, any time I in any way praise floating point, I'm always talking about double precision.

64-bit fixed point solves a lot of problems too :)

> I think single precision floating point should never have been invented, and given that it unfortunately was, it's best to behave as though it doesn't exist. Don't get me started about half precision.

A 4x memory bandwidth & storage penalty for HDR buffers would be absolute murder (bloating a single 4k RGBA surface to 265MB if I've done my math right), so I'm glad we have the option. And in the historical context lower precision options were invented in, RAM wasn't nearly as cheap as it is today. Tradeoffs.


Continued fractions are perfect if you have infinite memory.


A real number can contain an infinite amount of information.

If you have infinite memory, you would probably start by asking whether you have countably infinite or uncountably infinite memory.


What would you suggest as an alternative to float?



128-bit decimal.


I have found that I often end up forsaking the performance and using a decimal type. It is definitely slower, but the correctness is often more desirable.


A friend mentioned being in a meeting where Mike Cowlishaw argued that Javascript's floating point type should be decimal floats not binary floats. Everyone including my friend thought he was out to lunch. My friend says now, 'a great opportunity was pissed away'

Because almost no one actually wants binary floating point.


Why not? Decimal floats have definite advantages if you are dealing with decimal data, but that is not the only type of data that floats are used for, and it probably isn't the majority. Binary floating point is better than decimal floating point if you don't care about decimal-interop (more precision per bit, more consistent precision), and decimal floating point has all the same issues (except for confusing conversions to decimals).

So, if you are dealing with raw sensor data, neural nets, character models in a game, numerical simulations, and many other types of data, the extra precision of binary floating-point is far more important than any confusion that arises during those (rare) conversions to decimal.


It's all about costs and benefits though. I do think that a decimal type should be standard alongside a float so that the most appropriate one can be chosen. But in many contexts, one needs to go outside the immediate ecosystem to third party libraries to get an alternative to floating point. Currency is a good example of a type that should not use floating point arithmetic but far too often does.


I think the difference is binary floating point tends to be on massive automatically generated sets of data. And binary floats are mostly perfect for that.

However the bias towards scientific computing is a problem, because there are large numbers of use cases where the inputs and outputs are decimal floats, because you're dealing with humans and their preference for base 10.


This is well-known by everyone who works with float determinism. You have to set the FPU flag with assembler to force the desired width (among other things). See: fldcw, LDMXCSR, etc.

Take into account rounding modes and denormals too, and make sure to disable float optimisations that break determinism in your compiler.
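
For the x87 side, a sketch of what setting that control word looks like with GCC-style inline assembly (x86 only; this changes only the rounding precision of the mantissa, not the extended exponent range, so it is at best a partial fix):

    // Rewrite the precision-control field (bits 8-9) of the x87 control
    // word so every result is rounded to a 53-bit mantissa, as for double.
    // SSE code paths are controlled separately (see LDMXCSR).
    #include <cstdint>

    void set_x87_double_precision() {
        std::uint16_t cw;
        asm volatile("fnstcw %0" : "=m"(cw));   // read the control word
        cw = static_cast<std::uint16_t>((cw & ~0x0300) | 0x0200);  // PC = 10b (double)
        asm volatile("fldcw %0" : : "m"(cw));   // load it back
    }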


> Regardless of where one wishes to put the blame, this problem will _not_ be fixed. Period.

I realise that floating-point numbers are, to use the technical term, "whack", and I also realise that triaging the issue seems to point to a hardware error, not a software error, but I can't get past this "yeah, there is a problem, so what?" attitude. Am I missing something critical here that makes this unreasonable to work around or fix?


According [indirectly] to one of the comments, the issue is that there is no solution which solves all the problems in the context.

Reference:

https://gcc.gnu.org/ml/gcc/2003-08/msg01257.html

Specifically:

    There are two things missing. The abililty to turn on the workaround in c (i.e. emit FP rounding instruction after every operation), and a bug fix for the register spills.
    
    Even with those things, I think we are still in trouble. In the first case, having explicit rounding instructions eliminates the excess precision problem, but it introduces a double-rounding problem. So we have lost again there. This is probably not fixable without changing the hardware. In the second case, fixing the problem with reload spills eliminates one source of unpredictable rounding, however, we still have the problem that local variables get rounded if they are allocated to the stack and not rounded if they are allocated to an FP reg-stack register. Thus we still have the problem that we get different results at different optimization levels. So we still lose there again also. This might be fixable by promoting all stack locals to long double to avoid unexpected rounding, but that will probably cause other problems in turn. It will break all programs that expect assignments doubles will round to double for instance. If we don't promote stack locals, then we need explicit rounding for them, and then we have double-rounding again.
    
    I really see no way to fix this problem other than by fixing the hardware. The hardware has to have explicit float and double operations.


It's not a hardware error at all. If you want to use IEEE 754 on a processor, you have to learn how it works; there are some subtleties whose root cause is attempting to approximate infinite-precision mathematics on a finite-precision processor.


Comment 109 has a good exposition, which I'll just copy here:

Jan Lachnitt 2008-05-20 16:59:31 UTC

I also encountered such problems and was going to report it as a bug in GCC... But in the GCC bug (not) reporting guide, there is fortunately a link to this page, and here (comment #96) is a link to David Monniaux's paper about floating-point computations. That explains it in detail, but it is maybe too long. I have almost read it and hope I have understood it properly. So I'll give a brief explanation (for those who don't know it yet) of the reasons for this strange behaviour. Then I'll assess where the bug actually is (in GCC or the CPU). Then I'll write the solution (!) and finally a few recommendations to the GCC team.

EXPLANATION

The x87 FPU was originally designed in (or before) 1980. I think that's why it is quite simple: it has only one unit for all FP data types. Of course, the precision must be that of the widest type, which is the 80-bit long double. Suppose you have a program where all the FP variables are of type double. You perform some FP operations and one of them is e.g. 1e-300/1e300, which results in 1e-600. Although this value cannot be held by a "double", it is stored in an 80-bit FPU register as the result. Suppose you use the variable "x" to hold that result. If the program has been compiled with optimization, the value need not be stored in RAM. So, say, it is still in the register. Suppose you need x to be nonzero, so you perform the test x != 0. Since 1e-600 is not zero, the test yields true. While you perform some other computations, the value is moved to RAM and converted to 0 because x is of type "double". Now you want to use your certainly nonzero x... Hard luck :-( Note that if the result doesn't have its own variable and you perform the test directly on an expression, the problem can come to light even without optimization. It might seem that performing all FP operations in extended precision can only bring benefits, but it introduces a serious pitfall: moving a value may change the value!!!
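A minimal C sketch of that scenario (the surprising behaviour is only expected on 32-bit x86 using the x87 FPU with optimization, e.g. gcc -O2 -mfpmath=387; with SSE math both observations agree and x is simply 0):

    #include <stdio.h>

    int main(void)
    {
        volatile double a = 1e-300, b = 1e300;   /* volatile: keep the division at run time */
        double x = a / b;   /* exact value 1e-600: fits in an 80-bit x87 register,
                               but underflows to 0 when rounded to double          */

        if (x != 0.0)            /* may be evaluated on the 80-bit register value... */
            printf("x compares as nonzero\n");

        printf("x = %g\n", x);   /* ...yet print 0 once x is spilled to memory */
        return 0;
    }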

WHERE'S THE BUG

This is really not a GCC bug. The bug is actually in the x87 FPU because it doesn't obey the IEEE standard.

SOLUTION

The x87 FPU is still present in contemporary processors (including AMD) for compatibility. I think most PC software still uses it. But newer processors also have another FPU, the SSE unit, and this one does obey the IEEE standard. GCC in 32-bit mode compiles for x87 by default, but it is able to compile for SSE too. So the solution is to add these options to the compilation command: -march=* -msse -mfpmath=sse

Yes, this definitely resolves the problem - but not for all processors. The * can be one of the following: pentium3, pentium3m, pentium-m, pentium4, pentium4m, prescott, nocona, athlon-4, athlon-xp, athlon-mp, k8, opteron, athlon64, athlon-fx and c3-2 (I'm unsure about athlon and athlon-tbird). Besides -msse, you can also add some of -mmmx, -msse2, -msse3 and -m3dnow, if the CPU supports them (see the GCC doc or the CPU doc).

If you wish to compile for processors which don't have SSE, you have a few possibilities:

(1) A very simple solution: Use long double everywhere. (But be careful when transferring binary data in long double format between computers, because this format is not standardized, so the concrete bit representations vary between different CPU architectures.)

(2) A partial but simple solution: Do comparisons on volatile variables only.

(3) A similar solution: Try to implement a "discard_extended_precision" function suggested by Egon in comment #88.

(4) A complex solution: Before doing any mathematical operation or comparison, put the operands into variables and also put the result into a variable (i.e. don't use complex expressions). For example, instead of { c = 2*(a+b); } , write { double s = a+b; c = 2*s; } . I'm unsure about arrays, but I think they should be OK. When you have modified your code in this manner, then compile it either without optimization or, when using optimization, use -ffloat-store. In order to avoid double rounding (i.e. rounding twice), it is also good to decrease the FPU precision by changing its control word at the beginning of your program (see comment #60). Then you should also apply -frounding-math.

(5) A radical solution: Find a job/hobby where computers are not used at all.
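A minimal sketch of workarounds (2) and (4), forcing intermediate results through a double-sized memory location so they are rounded before use (the helper name is just illustrative; -ffloat-store has a similar effect for ordinary assignments):

    #include <stdio.h>

    /* Store/reload through a volatile double to round away any
       excess x87 precision. */
    static double force_double(double x)
    {
        volatile double v = x;
        return v;
    }

    int main(void)
    {
        volatile double a = 1e-300, b = 1e300;
        double x = force_double(a / b);   /* rounded to double: underflows to 0 */
        printf("%d\n", x != 0.0);         /* prints 0 on x87 and SSE alike      */
        return 0;
    }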


This is a good explanation, but the conclusion that this isn't a GCC bug is false. y and y2 both have type annotations of "double" at the point they're compared. They should be computed using an intermediate of type "long double" and then cast to double. But GCC is incorrectly comparing them based on those intermediate values instead of their final values.

In a more extreme case, let's say I had two strings: v="1.0" and w="+1.0". atof(v) and atof(w) should ABSOLUTELY compare equal. The logic GCC follows effectively says that it's OK to compare them by their intermediate string representations instead, though, in which case they would compare as non-equal.


On a related note, testing floating point values for exact equality should be recognized as a recipe for unexpected results.



This idea is related to the strictfp modifier in Java. https://en.wikipedia.org/wiki/Strictfp


As a system administrator, my favorite part of the Bug 323 thread is the mail bounces from postmaster.


Should add the date to the title (2000)


People need to understand that floating point comparison at floating precision is undefined, so it is not a bug. A proper floating-point representation would track its own precision. Since that is more complicated and potentially hurts performance, until then the programmer should manage precision tracking on their own (similar to memory allocations); that is, one should do:

  if (f_a < f_b - f_eps)        /* definitely less than    */
  if (f_a > f_b + f_eps)        /* definitely greater than */
  if (fabs(f_a - f_b) < f_eps)  /* "equal" within the tolerance f_eps */
If you don't do this, then you are on your own. Sometimes it is fine, but sometimes it may give you grief, just like the dangers of other undefined behaviors.
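A minimal sketch of those comparisons as C helpers (the value of f_eps is application-specific; 1e-9 below is only a placeholder):

  #include <math.h>
  #include <stdio.h>

  static const double f_eps = 1e-9;                                     /* placeholder tolerance */

  static int lt_eps(double a, double b) { return a < b - f_eps; }       /* definitely less    */
  static int gt_eps(double a, double b) { return a > b + f_eps; }       /* definitely greater */
  static int eq_eps(double a, double b) { return fabs(a - b) < f_eps; } /* equal within f_eps */

  int main(void)
  {
      printf("%d %d\n", eq_eps(0.1 + 0.2, 0.3), (0.1 + 0.2) == 0.3);    /* prints 1 0 */
      return 0;
  }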

PS: I also have a problem with people thinking "undefined behavior" is undefined because certain standards say so. Undefined behavior comes from a lack of common-sense agreement, or from defying common sense. Take divide by zero: there is no common-sense meaning for it, and defining its behavior does not change anything. There is no common-sense agreement on signed integer overflow either. The common-sense response to undefined behavior is to avoid it. (On the other hand, there seems to be some common-sense agreement on unsigned integer overflow, and people write code based on that behavior. But that agreement is weak, so I would avoid relying on it if I can, regardless of what the standards say.)

PS2: The reason floating point comparison is undefined behavior is that, unlike fixed-point or integer values, floating point numbers have floating precision. Floating precision has no common-sense correspondence. In any real application (which is where common sense is grounded), a finite and, above all, consistent precision is always assumed.

And with the solution of manually specifying an f_eps, we bring the comparison back into a common-sense understanding: you know the result of a comparison is only meaningful above f_eps. With that common sense, having an internal calculation carried out in higher precision is always OK -- isn't that common sense?

PS3: The question is, why not turn undefined behaviors into errors? Undefined behavior does not mean there is no behavior; the behavior can be used, or more often ignored, to achieve a certain efficiency. When programmers write code with undefined behavior, they are taking their code out of the common-sense domain and into a special-sense domain. That is practically fine. Everyone has their own efficient quirks; just do not expect your quirks to be understood or supported at large, where common sense rules (unless you carefully define and document them). A language that allows undefined behavior allows for individual efficiency, which is sometimes rather good.

For example, it is OK to write programs with direct floating point comparisons as long as you understand that your program does not care when the comparison falls into the ambiguous range. Only you, who understand your special application, can assess that. If you are the only one responsible for the result of that code, then it is perfectly fine!


Essentially everything you say here is false.

Floating point numbers represent rational numbers, and their arithmetic is straightforward and follows common sense: you compute the exact operation using rational numbers, and the result is the closest floating point number available.

Floating point operations are never "undefined" either; they are perfectly deterministic and completely specified by the IEEE 754 standards.

There are three floating point values that do not correspond to rationals: INF, -INF and NAN. There is also "-0", which corresponds to the same rational as "0". The arithmetic of these values is common sense to anyone who knows some calculus: 1/0 = INF, INF*0 = NAN, INF+INF = INF, INF-INF = NAN, etc.
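Those rules are easy to check directly (a sketch; the divide by zero below raises the IEEE divide-by-zero flag but does not trap under default settings):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double zero = 0.0;
        double inf  = 1.0 / zero;          /* +inf          */

        printf("%g\n", inf);               /* inf           */
        printf("%g\n", inf * zero);        /* nan (or -nan) */
        printf("%g\n", inf + inf);         /* inf           */
        printf("%d\n", isnan(inf - inf));  /* prints 1      */
        return 0;
    }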

There is no ambiguity whatsoever in floating point numbers.


You say 1/0 = INF, but the 754 standard says negative zero is different, and 1/(-0) = -INF. Yet it says -0 is numerically equal to 0. It also specifies that the square root of -0 is -0. This is neither straightforward nor common sense.

They also do not round to the nearest representable rational number. If you double MAXREAL, it will round to INF, even though MAXREAL is infinitely closer to 2*MAXREAL than INF is.
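Both points are easy to demonstrate (a sketch; DBL_MAX plays the role of MAXREAL, and the signed-zero results follow IEEE 754's default rounding):

    #include <float.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Signed zero: dividing by -0.0 gives -inf, and sqrt(-0.0) is -0.0. */
        double neg_zero = -0.0;
        printf("%g %g\n", 1.0 / neg_zero, sqrt(neg_zero));   /* -inf -0 */

        /* Overflow: under round-to-nearest, 2*DBL_MAX rounds to +inf even
           though DBL_MAX is the nearest finite representable value. */
        double big = DBL_MAX;
        printf("%d\n", isinf(big * 2.0));                    /* prints 1 */
        return 0;
    }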

Despite efforts in the 2008 update to the 754 Standard, there still is no guarantee that the Standard will produce reproducible, bitwise identical results. The Standards committee admits this. It is important for everyone to understand the shortcomings of the Standard.


Well, "common sense" is not the same for everybody then. For a mathematician, the rules 1/0 = INF, 1/(-0) = -INF are perfectly straightforward and common sense (if you think about the limits).

Regarding sqrt(-0)=-0, I concede it looks a bit strange. Yet, as explained in the famous Kahan article "much ado about nothing's sign bit", it turns out to be natural when you work with complex numbers.

As for the nondeterminacy, I suppose you are talking about transcendental functions? The basic arithmetic should be bitwise reproducible.


> Essentially everything you say here is false.

That is OK, but I hope you can still get my main point, whether it is true or false. If all you see is "everything is false", I am not sure you understood what I was saying. You have to understand what I was saying in order to argue back -- unless all you care about is right or wrong rather than making progress on understanding.

My main point is to point out why people keep filing these bug reports about floating point results that compilers simply will not fix. That is because, in real applications, we expect numbers to have persistent precisions. Persistent precisions make sense in the real world. For example, if you take a measurement using one ruler in mm, that precision does not change whether you are measuring rice or an elephant. When you are dealing with money, one always expects the precision not to go beyond a cent, regardless of what your internal calculation says (5.001 dollars is equal to 5 dollars). And in real life, calculating with higher precision than required will never result in error. My career is in scientific computing and we routinely deal with very small (or very large) numbers. However, in any given application, a persistent precision is always expected, and numbers within that precision become ambiguous in comparison. For example, if my precision is around 0.001, a holds 5, and b holds 5.0008, is a<b? It depends. And if your critical application -- e.g. a feedback control loop -- depends on whatever the floating point operation happens to define, your application can be very unpredictable.

I hate to repeat myself, but your reply didn't address my main point at all -- the expectation of fixed precision in reality vs. arbitrary precision in a computer vs. infinite precision in math. Common sense is rooted in reality.

Yes, I understand that floating point operations performed by a computer at a fixed width are not really undefined. That is why I made the extra points about what we mean by "undefined" in computer language design. Any computer architecture with any given compiler implementation will always give definite results for any input. Undefined behavior has its root meaning from the language designer's point of view. Take division: we understand what we intend division to do, whether in the integer case or the floating point case. But do we understand what we want when the divisor is 0 or near 0? We understand we don't want the computer to always crash, so we don't specify that it is an error, but we don't know what it should do. So we specify that it is undefined behavior. I understand that there is the INF that certain mathematicians prefer. But are you sure that is the common-sense behavior? For some of my applications I would rather it were a valid big number, and for others I would rather it were an error. Instead of wishing, I have learned to put checks around critical areas. And, please hear this: no application actually wants an INF or -INF result. There is no common-sense interpretation for INF other than treating it the same as an error.

So do understand what ambiguity means here. In math, you never have ambiguity. In a computer, you never have ambiguity. In real-life common sense, ambiguity arises when we can't decide on a favorable general solution.



