Examples of floating point problems (jvns.ca)
255 points by grappler on Jan 13, 2023 | 169 comments



> NaN/infinity values can propagate and cause chaos

NaN is the most misunderstood feature of IEEE floating point. Most people react to a NaN like they'd react to the dentist telling them they need a root canal. But NaN is actually a very valuable and useful tool!

NaN is just a value that represents an invalid floating point result. The result of any operation on a NaN is a NaN. This means that NaNs propagate from the source of the original NaN to the final printed result.

"This sounds terrible" you might think.

But let's study it a bit. Suppose you are searching an array for a value, and the value is not in the array. What do you return for an index into the array? People often use -1 as the "not found" value. But then what happens when the -1 value is not noticed? It winds up corrupting further attempts to use it. The problem is that integers do not have a NaN value to use for this.

What's the result of sqrt(-1.0)? It's not a number, so it's a NaN. If a NaN appears in your results, you know you've got a mistake in your algorithm or initial values. Yes, I know, it can be clumsy to trace it back to its source, but I submit it is better than having a bad result go unrecognized.

NaN has value beyond that. Suppose you have an array of sensors. One of those sensors goes bad (like they always do). What value do you use for the bad sensor? NaN. Then, when the data is crunched, if the result is NaN, you know that your result comes from bad data. Compare with setting the bad input to 0.0. You never know how that affects your results.

This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.

NaN is your friend!
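A tiny Python sketch of the sensor scenario (the readings here are invented):

    import math

    readings = [20.1, 19.8, float('nan'), 20.3]   # third sensor went bad
    average = sum(readings) / len(readings)
    print(average)              # nan -- the bad reading poisons the result
    print(math.isnan(average))  # True, so we know not to trust this batch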


> But let's study it a bit. Suppose you are searching an array for a value, and the value is not in the array. What do you return for an index into the array? People often use -1 as the "not found" value. But then what happens when the -1 value is not noticed? It winds up corrupting further attempts to use it. The problem is that integers do not have a NaN value to use for this.

You return (value, found), or (value,error) or Result<T>.

> NaN has value beyond that. Suppose you have an array of sensors. One of those sensors goes bad (like they always do). What value do you use for the bad sensor? NaN. Then, when the data is crunched, if the result is NaN, you know that your result comes from bad data. Compare with setting the bad input to 0.0. You never know how that affects your results.

You return an error and handle the error. You want to know when a sensor is wonky or returns bad data.

Also, you technically should use a signalling NaN for "this is an error" and a quiet NaN for "this is just an impossible math result", which makes it even more error prone. Just return a fucking error if it is a function.

Sure, it's useful for expressions, but the handling should happen there and then, and if a function can produce an error it should return it explicitly as an error; otherwise you have different error handling for different types of functions.

> This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.

I'd like to see how much of the code actually uses that as a feature rather than just setting it to 0.0 (or initializing it right away).


> You return (value, found), or (value,error) or Result<T>.

And this is great for environments that can support it, but as the levels get lower and lower, such safety nets become prohibitively expensive.

Take data formats, for example. Say we have a small device that records ieee754 binary float32 readings. A simple format might be something like this:

    record     = reading* terminator;
    reading    = float(32, ~) | invalid;
    invalid    = float(32, snan);
    terminator = uint(32, 0xffffffff);
We use a signaling NaN to record an error in the sensor reading, and we use the encoding 0xffffffff (which is a quiet NaN) to mark the end of the record.

If we wanted the validity signaling to be out-of-band, we'd need to encode it as such; perhaps as a "validity" bit preceding each record:

    record     = reading* terminator;
    reading    = valid_bit & float(32, ~);
    valid_bit  = uint(1, ~);
    terminator = uint(1, 1) & uint(32, 0xffffffff);
Now the format is more complicated, and we also have alignment problems due to each record entry being 33 bits. We could use a byte instead and lose a little to bloat:

    record     = reading* terminator;
    reading    = valid_bit & float(32, ~);
    valid_bit  = uint(8, ~);
    terminator = uint(8, 1) & uint(32, 0xffffffff);
But we're still unaligned (40 bits per record), which will slow down ingestion. We could fix that by using a 32-bit validity "bit":

    record     = reading* terminator;
    reading    = valid_bit & float(32, ~);
    valid_bit  = uint(32, ~);
    terminator = uint(32, 1) & uint(32, 0xffffffff);
But now we've doubled the size of the data format.

Or perhaps we keep it as a separate bit array, padded to a 32-bit boundary to deal with alignment issues:

    record      = bind(count,count_field) & pad(32, validity{count.value}, padding*) & reading{count.value};
    count_field = uint(32,bind(value,~));
    reading     = float(32, ~);
    validity    = uint(1, ~);
    padding     = uint(1, 0);
But now we've lost the ability to do ad-hoc appends (we have to precede each record with a length), and the format is becoming a lot more complicated.


> And this is great for environments that can support it, but as the levels get lower and lower, such safety nets become prohibitively expensive.

I don't think there is a low enough level here anymore. $0.3 micros, maybe? Why are you doing floats on them? $2-3 micros can already have 64kB of flash. You're just inventing imaginary cases to support the "let's just do it worse and uglier" position.


Float-with-NaN is essentially (float-without-NaN, boolean error), in an efficient, hardware-optimized structure.


> What do you return for an index into the array?

None in an Optional/Maybe, an Error in a Result, Null in a nullable type, an exception...


I concur. I would use an `Option`, `Result`, or enum for all of those scenarios, depending on the details.


No way. It'd be ridiculously slow to constantly check for NaN; let it propagate and then use a result or option at a higher level.

Adding a branch like that to low level number crunching would be bonkers.


Indexing an array isn't low level number crunching, and whatever you produce from that is going to be the result of a branch unless you just don't bounds check arrays and return some random bit of memory (or crash because you indexed out of program memory.)

Honestly, ideally you have dependent types and indexing out of the array is a type error caught at compile time, but where it's not, “NaN” is not the most logical result, even for floats, because very often you will want to distinguish “the index given was out of the array” from “the index given was in the array and the value stored was a NaN”; special values IN the domain of the values stored in the array are fundamentally problematic for that reason.


I think the concept of NaNs is sound, but I think relying on them is fraught with peril, made so by the unobvious test for NaN-ness in many languages (i.e., "if (x != x)"), and the lure of people who want to turn on "fast math" optimizations which do things like assume NaNs aren't possible and then dead-code-eliminate everything that's guarded by an "x != x" test.
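For reference, the unobvious self-inequality test and the clearer library call, in Python:

    import math

    x = float('nan')
    print(x == x)         # False -- NaN compares unequal to everything, itself included
    print(x != x)         # True  -- the classic "is this NaN?" trick
    print(math.isnan(x))  # True  -- the intention-revealing version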

Really though, I'm a fan, I just think that we need better means for checking them in legacy languages and we need to entirely do away with "fast math" optimizations.


I call them "buggy math" optimizations. The dmd D compiler does not have a switch to enable buggy math.


> made so by the unobvious test for NaN-ness in many languages (i.e., "if (x != x)")

Which languages do not have a function to test for NaN?

> and the lure of people who want to turn on "fast math" optimizations which do things like assume NaNs aren't possible and then dead-code-eliminate everything that's guarded by an "x != x" test.

This is not unique to NaNs. There are plenty of potential floating point problems if you enable those flags.


> Which languages do not have a function to test for NaN?

Both C and C++.

> This is not unique to NaNs. There are plenty of potential floating point problems if you enable those flags.

That's why I said in the second part that we need to do away with them.


C has had isnan() for over two decades. It’s technically a macro, but that doesn’t matter for this use case.


> Both C and C++.

Seems C++ only added it in C++11. Surprising.

> That's why I said in the second part that we need to do away with them.

Do away entirely with fast math calculations? That would be horrible. They exist for a very good reason: Some applications are too slow without them. Enabling subnormal numbers can really slow things down.

I'd wager that for the majority of programs written, the fast math is as good as the accurate math.


> I'd wager that for the majority of programs written, the fast math is as good as the accurate math.

I'd take that wager. I spent 13 years working on video games and video game technology, a domain where floating point performance is critical, and by and large we never used fast-math because of the problems it created for us.


This is surprising to me! Can you explain what problems you encountered? My (limited) understanding is that the main effect of fast-math is to turn off support for subnormals. Since subnormals are only used to represent extremely small values, I wouldn't expect them to have much effect in the real world.


fast-math can result in many things you may not expect, e.g. treating FP operations as being associative and distributive, which they aren’t.
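A concrete illustration of the non-associativity (Python, but any IEEE double behaves the same):

    print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
    print(0.1 + (0.2 + 0.3))   # 0.6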


Sure but when does that matter in video game math?


Happy to take that wager, because the majority of programs written are not video games related :-)


I’m more curious about how it caused problems for you if you ever used it


> Seems C++ only added it in C++11

I added it to Digital Mars C and C++ in the early 1990s. Also full NaN support in all the math.h functions. AFAIK it was the first C compiler to do it.

It was based on a specification worked out by NCEG (Numerical C Extensions Group), a long forgotten group that tried to modernize C numerics support.

> the fast math is as good as the accurate math

I was more interested in correct results than fast wrong answers.


Yes, but for most C++ applications, they prefer the faster wrong answers, because the wrong answers are not wrong enough to cause any bugs.


> Some applications are too slow without them.

I fully expect the intersection between programs where floating point instructions are a speed bottleneck and programs where the numerical instability that fast_math can cause is not a problem to be the empty set.


IIRC we used -ffast-math for numerical simulations of radar cross sections. Fast math was a decent perf win but didn't have any negative effects on the results.

Most programs don't care about the difference between ((a+b)+c) vs (a+(b+c)). Why bother with NaNs if you know you can't get them? Etc.


Something akin to fast-math is the default for some shading languages (although some have switched to fast-math-but-preserve-NaN-and-INF).


isnan() has been around since C99.


isnan is a macro, not a function


isnan is a macro in C, function in C++. But either way, it’s an abstraction that provides a standard test for NaNs in those languages.


I believe the reason he is pointing out this distinction is that if it is a macro, it can be optimized out causing bugs.


Sure, with -ffast-math or similar compiler options without a way to also say “preserve NaNs”, that could happen with macros, but it can also happen with inline functions where the body is in a header file.


Not in a sound implementation. E.g. in macOS/iOS’s libm, we replace the isnan() macro with a function call if you compile with fast-math precisely to prevent this.


NaNs are incredibly helpful in numpy. Often I want to take a mean over a collection of lists but the size of each list isn't constant. np.nanmean works great.
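Something like this, for example (the ragged rows here are made up; the short row is padded with NaN so everything stacks into one array):

    import numpy as np

    data = np.array([[1.0, 2.0, 3.0],
                     [4.0, 5.0, np.nan]])   # second list was shorter, padded with NaN
    print(np.mean(data, axis=1))     # [2. nan] -- the padding poisons the second row
    print(np.nanmean(data, axis=1))  # [2. 4.5] -- NaNs treated as missing data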


> I think the concept of NaNs are sound, but I think relying on them is fraught with peril

Well, allegedly in D at least https://www.reddit.com/r/rust/comments/a1w75c/the_bug_i_did_...


> What's the result of 1.0/0.0? It's not a number, so it's a NaN

It's not often that I get to correct Mr D himself, but 1.0/0.0 is...


You're right. I'll fix it.


What is 1.0/0.0 in D? A DivideByZero exception of some sort?


Infinity.


I often find myself wondering wtf? when I see discussions like this.

root -1 is i - and we get into complex numbers. If I ever end up needing to deal with a square root of a number that might be -1 then I'll do it the old fashioned way and test for that and sort it. What is the problem here? root -1 is well defined!

Equally (lol) 1.0/0.0 and 1/0 can be tricky to handle but not beyond the wit of person. In those cases it is down to syntax and a few other things. Is 1.0 === 1.00 etc? Define your rules, work in the space that you have defined and all will be fine.

It is complete nonsense to require "real" world maths to be correct in your program. Your program is your little world. If it needs to deal with math-like concepts then it is down to you how that works. In your programming language you get some methods and functions and they often have familiar names but they do not work like the "real thing" unless you get them to.

numpy and the like exist for a very good reason: A general purpose programming language will cock up very soon.

Do basic and very simple arithmetic in your chosen language if it is well formed, switch to sophisticated addons as soon as is practicable.


> root -1 is well defined!

Yes, if you're using complex number types. Nope if you're using reals:

    double x = sqrt(-1.0);   // NaN with real (double) arithmetic, not i


> This means that NaNs propagate from the source of the original NaN to the final printed result.

An exception would be better. Then you stop immediately at the first problem instead of having to trace the observed problem back through its lifetime to find the original one.


Yeah, or using a type system that lets you avoid making NaN a float to both force the case to be handled and prevent NaN from leaking into the math.


> An exception would be better.

It depends on the context. Sometimes NaNs are expected and to be ignored. Sometimes they signal a problem.


Definitely. Unfortunately, language implementations that guaranteed exceptions were not in wide use at the time. Also, to have a chance at being implemented on more than one CPU, it had to work in C and assembly.


You can get an exception! Just enable FP exceptions for the invalid operation exception, and compile all of your code such that it's FP-exception aware.


> This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.

This is definitely something I like about D, but I'd much prefer a compiler error. double x; double y = x+1.5; is less than optimal.


I don't find this convincing.

> What do you return for an index into the array?

An option/maybe type would solve this much better.

> Yes, I know, it can be clumsy to trace it back to its source

An exception would be much better, alerting you to the exact spot where the problem occurred.


> An option/maybe type would solve this much better.

NaNs are already an option type, although implemented in hardware. The checking comes for free.

> An exception would be much better

You can configure the FPU to cause an Invalid Operation Exception, but I personally don't find that attractive.


The missing bit is language tooling. The regular floating point APIs exposed by most languages don't force handling of NaNs.

The benefit of the option type is not necessarily just the extra value, but also the fact that the API forces you to handle the None value. It's the difference between null and Option.

Even if the API was better, I think there’s value in expressing it as Option<FloatGuaranteedToNotBeNaN> which compiles down to using NaNs for the extra value to keep it similar to other Option specialisations and not have to remember about this special primitive type that has option built in.


Yeah. You should be very explicit about it. Certainly not treat it like, “ooh, here are some free bits that I can use to tag things in ad hoc ways (like -1 for missing index)”.

https://internals.rust-lang.org/t/pre-rfc-nonnan-type/8418


> NaNs are already an option type, although implemented in hardware

The compromise with this is that it makes it impossible to represent a non-optional float, which leads to the same issues as null pointers in c++/java/etc.

The impacts of NaN are almost certainly not as bad (in aggregate) as `null`, but it'd still be nice if more languages had ways to guarantee that certain numbers aren't NaN (e.g. with a richer set of number types).


> The impacts of NaN are almost certainly not as bad (in aggregate) as `null`, but it'd still be nice if more languages had ways to guarantee that certain numbers aren't NaN (e.g. with a richer set of number types).

The problem with that is that to guarantee arithmetic does not result in a NaN, you need to guarantee that 0 and infinity are not valid values, and those values can still arise from underflow/overflow of regular computation. Basically, there's no subset of floating-point numbers that forms a closed set under +, -, *, or / that doesn't include NaN. So you can define FiniteF32 (e.g.), but you can't really do anything with it without the result becoming a full-on float.
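The escape hatches are easy to hit even starting from perfectly ordinary finite values (Python doubles here):

    inf = float('inf')
    print(1e308 * 10)      # inf -- overflow of two finite values
    print(1e-308 / 1e308)  # 0.0 -- underflow to zero
    print(inf - inf)       # nan
    print(0.0 * inf)       # nan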


As far as I'm aware, there's no equivalent to a stack trace with NaN, so finding the origin of a NaN can be extremely tedious.


I've never found it to be particularly difficult.

The extremely difficult problems to find are uninitialized data and threading bugs, mainly because they appear and disappear.


Good points!


Exceptions are actually part of floats; they're called "signalling NaNs".

So technically Python is correct when it decided that 0.0/0.0 should raise an exception instead of just quietly returning NaN. Raising an exception is a standards-conforming option.

https://stackoverflow.com/questions/18118408/what-is-the-dif...


In practice, I've found signalling NaNs to be completely unworkable and gave up on them. The trouble is they eagerly convert to quiet NaNs, too eagerly.


I am firmly in the belief that sNaNs were a mistake in IEEE 754, and all they really serve to do is to create hard trivia questions for compiler writers.


Technically I guess it should return an sNaN (so the app can check for it if it wants to handle it differently) and raise an exception if the sNaN is used in a (non-comparison) operation.


> > What do you return for an index into the array?

> An option/maybe type would solve this much better.

Only if optional<float> is the same size as float.


Only if you’re working with pure enough functions, I think? Otherwise I might be polluting state with NaN along the way. And therefore need to error handle the whole way. So I may rather prefer to catch NaN up front and throw an error.

Loud and clear on the other point. I work with ROS a lot and the message definitions do not support NaN or null. So we have these magic numbers all over the place for things like “the robot does not yet know where it is.” (therefore: on the moon.)


> The result of any operation on a NaN is a NaN.

That's not true! maxNum(nan, x) = x.


The standard for comparison with NaNs is to always return false [0].

So that entirely depends on how max() is implemented. A naive implementation of max() might just as easily instead return NaN for that. Or if max(NaN, x) is x, then it may give NaN for max(x, NaN).

Note that the fact that comparisons always return false also means that sorting an array of floating point values that contain NaNs can very easily break a comparison-based sorting algorithm!! (I've seen std::sort() in C++, for example, crash because NaNs break the strict weak ordering [1] requirement.)

[0] https://en.wikipedia.org/wiki/NaN#Comparison_with_NaN

[1] https://en.cppreference.com/w/cpp/named_req/Compare
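The hazard is easy to reproduce in Python: the comparisons quietly answer False, and the sort silently produces garbage rather than crashing (the exact output can vary by implementation):

    nan = float('nan')
    print(nan < 1.0, nan > 1.0, nan == nan)   # False False False
    print(sorted([3.0, nan, 1.0, 2.0]))       # e.g. [3.0, nan, 1.0, 2.0] -- not sorted at all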


maxNum is (was - as another poster mentioned it is now deprecated) a specific function defined in the IEEE specification. This isn't a language level comparison function but a specific operation on floating points that has to behave the way I described. It is not a comparison function.

maximumNumber (and a few similar ones) replaces it but behaves similarly.



Looks like maximumNumber and maximumMagnitudeNumber have the same relevant property (NaN is not viral).


If you have only a couple of minutes to develop a mental model of floating-point numbers (and you have none currently), the most valuable thing IMO would be to spend them staring at a diagram like this one: https://upload.wikimedia.org/wikipedia/commons/b/b6/Floating... (uploaded to Wikipedia by user Joeleoj123 in 2020, made using Microsoft Paint) — it already covers the main things you need to know about floating-point, namely there are only finitely many discrete representable values (the green lines), and the gaps between them are narrower near 0 and wider further away.

With just that understanding, you can understand the reason for most of the examples in this post. You avoid both the extreme of thinking that floating-point numbers are mathematical (exact) real numbers, and the extreme of "superstition" like believing that floating-point numbers are some kind of fuzzy blurry values and that any operation always has some error / is "random", etc. You won't find it surprising why 0.1 + 0.2 ≠ 0.3, but 1.0 + 2.0 will always give 3.0, but 100000000000000000000000.0 + 200000000000000000000000.0 ≠ 300000000000000000000000.0. :-) (Sure this confidence may turn out to be dangerous, but it's better than "superstition".) The second-most valuable thing, if you have 5–10 minutes, may be to go to https://float.exposed/ and play with it for a while.
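To make those three cases concrete in Python:

    print(0.1 + 0.2 == 0.3)     # False -- the two sides land on different nearby ticks
    print(1.0 + 2.0 == 3.0)     # True  -- small integers sit exactly on ticks
    print(1e23 + 2e23 == 3e23)  # False -- out here the ticks are tens of millions apart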

Anyway, great post as always from Julia Evans. Apart from the technical content, her attitude is really inspiring to me as well, e.g. the contents of the “that’s all for now” section at the end.

The page layout example ("example 7") illustrates the kind of issue because of which Knuth avoided floating-point arithmetic in TeX (except where it doesn't matter) and does everything with scaled integers (fixed-point arithmetic). (It was even worse back then, before IEEE 754.)

I think things like fixed-point arithmetic, decimal arithmetic, and maybe even exact real arithmetic / interval arithmetic are actually more feasible these days, and it's no longer obvious to me that floating-point should be the default that programming languages guide programmers towards.


If you have even less time, just think of them as representing physical measurements made with practical instruments and the math done with analog equipment.

The common cause of floating point problems is usually treating them as a mathematical ideal. The quirks appear at the extremes when you try to to un-physical things with them. You can't measure exactly 0 V with a voltmeter, or use an instrument for measuring the distance to stars then add a length obtained from a micrometer without entirely losing the latter's contribution.


Thanks, I actually edited my post (made the second paragraph longer) after seeing your comment. The "physical" / "analog" idea does help in one direction (prevents us from relying on floating-point numbers in unsafe ways) but I think it brings us too close to the "superstition" end of the spectrum, where we start to think that floating-point operations are non-deterministic, start doubting whether we can rely on (say) the operation 2.0 + 3.0 giving exactly 5.0 (we can!), whether addition is commutative (it is, if working with non-NaN floats) and so on.

You could argue that it's "safe" to distrust floating-point entirely, but I find it more comforting to be able to take at least some things as solid and reason about them, to refine my mental model of when errors can happen and not happen, etc.

Edit: See also the floating point isn’t “bad” or random section that the author just added to the post (https://twitter.com/b0rk/status/1613986022534135809).


> whether we can rely on (say) the operation 2.0 + 3.0 giving exactly 5.0 (we can!)

Can we rely upon 2.3 + 2.3 giving exactly 4.6 though?

Can we rely upon LargeInt.0 + LargeInt.0 giving exactly 2xLargeInt.0 for all integers?


That's exactly my point: when you internalize the diagram, you'll be able to reason confidently about what happens:

• In the case of "2.3 + 2.3", each "2.3" is “snapped” to the nearest representable value (green line in the diagram), then their sum snapped to the nearest green line. In this case, because the two summands are equal and we're using binary floating-point, the result will also be a green line. If you knew more about the details of binary32 aka float32, you could confidently say that "2.3" means 2.29999995231628417969 (https://float.exposed/0x40133333) and be sure of what "2.3 + 2.3" would give (4.59999990463256835938 = https://float.exposed/0x40933333) and that this is indeed the closest representable value to 4.6 (so yes, we can rely on 2.3 + 2.3 giving the same value as what “4.6” would be stored as, i.e. "2.3 + 2.3 == 4.6" evaluating to True), but even without learning the details you can go pretty far. For instance, you know you can rely on "x + x" and "2 * x" giving the same value for any (non-NaN) value x.

• I already gave the example of 100000000000000000000000.0 + 200000000000000000000000.0 ≠ 300000000000000000000000.0 above, but for the specific case of "x + x" and "2 * x" yes we can rely upon them evaluating to the same value (unless 2*x is Infinity or NaN), though of course the large integer x may itself not be representable exactly. Again, with the mental model, you'll be in a better position to state what you expect by "exactly".


> Again, with the mental model, you'll be in a better position to state what you expect by "exactly".

I'm long in the tooth, if I want exactly I'll use a symbolic algebra program (such as Cayley|Magma).

I don't expect exactly when using floats (or doubles, etc) and my mental model includes the image of the "fine teeth of coverage" having uneven spacing. That the float graticule across the reals is uneven and at best semi-quasi-regular has largely been my greatest issue (in geophysical computation engine development).


> If you knew more about the details of binary32 aka float32, you could confidently say that "2.3" means 2.29999995231628417969 […]

Then I think writing

    let f = 2.3;
Should be a compile error. The compiler should force you to write the “snapped” value in order to not mislead. :)


Now try writing the "snapped" value for 2.3 as a finite decimal. :-)


Something other than 2.29999995231628417969 ?


Example 4 mentions that the result might be different with the same code. Here is an example that is particularly counter-intuitive.

Some CPUs have the instruction FMA(a,b,c) = ab + c and it is guaranteed to be rounded to the nearest float. You might think that using FMA will lead to more accurate results, which is true most of the time.

However, assume that you want to compute a dot product between 2 orthogonal vectors, say (u,v) and (w,u) where w = -v. You will write:

p = uv + wu

Without FMA, that amounts to two products and an addition between two opposite numbers. This results in p = 0, which is the expected result.

With FMA, the compiler might optimize this code to:

p = FMA(u, v, wu)

That is one FMA and one product. Now the issue is that wu is rounded to the nearest float, say x, which is not exactly -vu. So the result will be the nearest float to uv + x, which is not zero!

So even for a simple formula like this, testing if two vectors are orthogonal would not necessary work by testing if the result is exactly zero. One recommended workaround in this case is to test if the dot product has an absolute value smaller than a small threshold.
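If you have math.fma (added in Python 3.13; an assumption, older versions don't have it), the effect is easy to reproduce:

    import math

    u, v = 0.1, 0.3
    w = -v
    # Two rounded multiplies: u*v and w*u round to the same magnitude and cancel exactly.
    print(u*v + w*u)            # 0.0
    # One fused multiply-add: u*v is kept exact inside the FMA, so the rounding error
    # of the separately computed w*u no longer cancels.
    print(math.fma(u, v, w*u))  # a tiny nonzero value instead of 0.0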


Note that with gcc/clang you can control the auto-use of fma with compile flags (-ffp-contract=off). It is pretty crazy imho that gcc defaults to using fma


> It is pretty crazy imho that gcc defaults to using fma

Yes! Different people can make different performance-vs-correctness trade-offs, but I also think reproducible-by-default would be better.

Fortunately, specifying a proper standard (e.g. -std=c99 or -std=c++11) implies -ffp-contract=off. I guess specifying such a standard is probably a good idea independently when we care about reproducibility.

Edit: Thinking about it, in the days of 80-bit x87 FPUs, strictly following the standard (specifically, always rounding to 64 bits after every operation) may have been prohibitively expensive. This may explain gcc's GNU mode defaulting to -ffast-math.


GCC doesn't default to non-conforming behaviour like -ffast-math -- that's Intel (at least a similar option). That's usually why people mistakenly think GCC vectorization is deficient if they don't use -funsafe-math-optimizations in particular.


Indeed GCC does not enable -ffast-math by default. Unfortunately, -ffast-math and -funsafe-math-optimizations (despite the name) are not the only options that prevent bit-for-bit-reproducible floating point. For example, -ffp-contract=fast is enabled by default [1], and it will lead to different floating-point roundings: Compare [2] which generates an FMA instruction, to [3] when -std=c99 is specified. As another example, -fexcess-precision=fast is also enabled by default. Similarly, [4] does intermediate calculations in the 80-bit x87 registers, while [5] has additional loads and stores to reduce the precision of intermediate results to 64 bits. In both examples, GCC generates code that does not conform to IEEE-754, unless -std=c99 is specified.

[1] From the man page:

    -ffp-contract=style
           -ffp-contract=off disables floating-point expression
           contraction.  -ffp-contract=fast enables floating-point
           expression contraction such as forming of fused multiply-
           add operations if the target has native support for them.
           -ffp-contract=on enables floating-point expression
           contraction if allowed by the language standard.  This is
           currently not implemented and treated equal to
           -ffp-contract=off.
           
           The default is -ffp-contract=fast.
[2] https://godbolt.org/z/GKb7G4nW9

[3] https://godbolt.org/z/KTnqcT6aW

[4] https://godbolt.org/z/4q31oEe14

[5] https://godbolt.org/z/qdf4hceca


> Edit: Thinking about it, in the days of 80-bit x87 FPUs, strictly following the standard (specifically, always rounding to 64 bits after every operation) may have been prohibitively expensive

afaik you could just set the precision of x87 to 32/64/80 bits and there would not be any extra cost to the operations


Why is it crazy? Some of us don't want to lose a factor of two on linear algebra (and also care about correctness). I remember testing for correctness against Kahan's tests after FMA became available in RS/6000.


If you do want the precision improvement of fma, then it makes far more sense to explicitly call fma instead of relying on the compiler doing a transformation that might not happen for any number of reasons. The key here is predictability: if it was actually guaranteed that expressions in the form of (x*y) + z are always done with fma, then it'd be less crazy. But now you have no way of knowing without looking at the produced assembly if fma is used or not in any particular expression.


As I implied, we're normally interested in FMA for speed, not numerical properties. I don't know in what circumstances GCC wouldn't use it when vectorizing, but I haven't seen them.


My take is not using the higher precision operation is crazy.


If you want higher precision then long double exists for that purpose


In general with reals, with any source of error anywhere, this caution about equality is always correct. The odds of two reals being equal are zero.


I have an exception that proves the rule. I thought about responding to Julia's call, but decided this was too subtle. But here we go...

A central primitive in 2D computational geometry is the orientation problem; in this case deciding whether a point lies to the left or right of a line. In real arithmetic, the classic way to solve it is to set up the line equation (so the value is zero for points on the line), then evaluate that for the given point and test the sign.

The problem is of course that for points very near the line, roundoff error can give the wrong answer, it is in fact an example of cancellation. The problem has an exact answer, and can be solved with rational numbers, or in a related technique detecting when you're in the danger zone and upping the floating point precision just in those cases. (This technique is the basis of Jonathan Shewchuk's thesis).

However, in work I'm doing, I want to take a different approach. If the y coordinate of the point matches the y coordinate of one of the endpoints of the line, then you can tell orientation exactly by comparing the x coordinates. In other cases, either you're far enough away that you know you won't get the wrong answer due to roundoff, or you can subdivide the line at that y coordinate. Then you get an orientation result that is not necessarily exactly correct wrt the original line, but you can count on it being consistent, which is what you really care about.

So the ironic thing is that if you had a lint that said, "exact floating point equality is dangerous, you should use a within-epsilon test instead," it would break the reasoning outlined above, and you could no longer count on the orientations being consistent.

As I said, though, this is a very special case. Almost always, it is better to use a fuzzy test over exact equality, and I can also list times I've been bitten by that (especially in fastmath conditions, which are hard to avoid when you're doing GPU programming).


Yes, and this is not just a theoretical concern: There was an article here [1] in 2021 claiming that Apple M1's FMA implementation had "flaws". There was actually no such flaw. Instead, the author was caught off guard by the very phenomenon you are describing.

[1] https://news.ycombinator.com/item?id=27880461


Although funnily enough the Windows fma implementation is significantly flawed.


Story time.

Back in university I was taking part in a programming competition. I don't remember the exact details of the problem, but it was expected to be solved as a dynamic programming problem with dp[n][n] as the answer, n < 1000. But, wrangling some numbers around, one could show that dp[n][n] = dp[n-1][n-1] + 1/n, and the answer was just the sum of the first N elements of the harmonic series. Unluckily for us the intended solution had worse precision and our solution failed.


They didn't take into account that floats come with an estimated uncertainty, and that values that are the same within the limits of experimental error are identical? That's a really badly set problem!


I think in that particular case they just didn't do error analysis.

The task was to output the answer with `10^-6` precision, which their solution didn't achieve. Funnily enough a number of other teams went the "correct" route and passed (as they were doing additions in the same order as the original solution).


> if you add very big values to very small values, you can get inaccurate results (the small numbers get lost!)

There is a simple workaround for this:

https://en.wikipedia.org/wiki/Kahan_summation

It's usually only needed when adding billions of values together and the accumulated truncation errors would be at an unacceptable level.
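For reference, a minimal Python version (the ten-element demo is just an illustration; the payoff is on long sums):

    def kahan_sum(values):
        total = 0.0
        c = 0.0                  # running compensation for lost low-order bits
        for x in values:
            y = x - c            # apply the correction left over from last time
            t = total + y        # big + small: the low bits of y get dropped here...
            c = (t - total) - y  # ...and recovered here
            total = t
        return total

    print(sum([0.1] * 10))        # 0.9999999999999999
    print(kahan_sum([0.1] * 10))  # 1.0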


It can also come up in simple control systems. A simple low-pass filter can fail to converge to a steady state value if the time constant is long and the sample rate is high.

Y += (X-Y) * alpha * dt

When dt is small and alpha is too, the right hand side can be too small to affect the 24bit mantissa of the left.

I prefer a 16/32bit fixed point version that guarantees convergence to any 16bit steady state. This happened in a power conversion system where dt=1/40000 and I needed a filter in the 10's of Hz.
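A small numpy sketch of that failure mode (the numbers are invented; float32 to match the 24-bit mantissa):

    import numpy as np

    y = np.float32(1.0)          # filter state
    x = np.float32(2.0)          # input the filter should converge toward
    alpha_dt = np.float32(1e-8)  # alpha * dt, well below the ~1.2e-7 ulp of 1.0

    for _ in range(100_000):
        y = y + (x - y) * alpha_dt   # the increment rounds away to nothing

    print(y)   # still 1.0 -- the filter never moves toward 2.0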


This is a very important application and a tougher problem than most would guess. There is a huge variety of ways to numerically implement even a simple transfer function, and they can have very different consequences in terms of rounding and overflow. Especially if you want to not only guarantee that it converges to a steady-state, but furthermore that the steady-state has no error. I spent a lot of time working on this problem for nanometre-accurate servo controls. Floating and fixed point each have advantages depending on the nature and dynamic range of the variable (eg. location parameter vs scale parameter).


> It's usually only needed when adding billions of values together and the accumulated truncation errors would be at an unacceptable level.

OTOH, it’s easy to implement, so I have a couple of functions to do it easily, and I got quite a lot of use out of them. It’s probably overkill sometimes, but sometimes it’s useful.


I had one issue where pdftotext would produce different output on different machines (Linux vs Mac). It broke some of our tests.

I tracked down where it was happening (involving an ==), but it magically stopped when I added print statements or looked at it in the debugger.

It turns out the x86 was running the math at a higher precision and truncating when it moved values out of registers - as soon as it hit memory, things were equal. MacOS was defaulting to -ffloat-store to get consistency (their UI library is float based).

There were too many instances of == in that code base (which IMO is a bad idea with floats), so I just added -ffloat-store to the Linux build and called it a day.


x86 (x87) FP is notoriously inconsistent because of the 80 bit extended precision that may or may not be used. In a JITed language like Java/C# it’s even less fun as it can theoretically be inconsistent even for the same compiled program on different machines.

Thankfully the solution to that problem came when x86 (32 bit) mostly disappeared.


macOS doesn't use -ffloat-store, it uses SSE for floats instead of x87 because it only supports machines with SSSE3. You wanted -mfpmath=sse or -march=native.


Could be, but that's my recollection of the situation. I had added -ffloat-store on the Linux side, and I had read somewhere that was being done on the osx side.

Note that this was back in January of 2011, possibly earlier. Back then gcc was still the default compiler on macos, and I think Apple was still 32-bit. (I don't remember when they switched to 64-bit, but I have a 32-bit only game that I bought near the end of that year.)

The situation is quite different these days, I don't think any compiler still uses the old registers by default.


One thing that pains me about this kind of zoo of problems is that people often have the takeaway, "floating point is full of unknowable, random errors, never use floating point, you will never understand it."

Floating point is amazingly useful! There's a reason why it's implemented in hardware in all modern computers and why every programming language has a built-in type for floats. You should use it! And you should understand that most of its limitations are an inherent mathematical and fundamental limitation, it is logically impossible to do better on most of its limitations:

1. Numerical error is a fact of life, you can only delay it or move it to another part of your computation, but you cannot get rid of it.

2. You cannot avoid working with very small or very large things because your users are going to try, and floating point or not, you'd better have a plan ready.

3. You might not like that floats are in binary, which makes decimal arithmetic look weird. But doing decimal arithmetic does not get rid of numerical error, see point 1 (and binary arithmetic thinks your decimal arithmetic looks weird too).

But sure, don't use floats for ID numbers, that's always a problem. In fact, don't use bigints either, nor any other arithmetic type for something you won't be doing arithmetic on.


> Floating point is amazingly useful! There's a reason why it's implemented in hardware in all modern computers and why every programming language has a built-in type for floats.

I completely agree with you even though I go out of my way to avoid FP, and even though, due to what I usually work on, I can often get away with avoiding FP (often fixed point works -- for me).

IEEE-754 is a marvelous standard. It's a short, easy to understand standard attached to an absolutely mind-boggling number of special cases and explanations as to why certain decisions in the simple standard were actually incredibly important (and often really smart and non-obvious). It's the product of some very smart people who had, through their careers, made FP implementations and discovered why various decisions turned out to have been bad ones.

I'm glad it's in hardware, and not just because FP used to be quite slow and different on every machine. I'm glad it's in hardware because chip designers (unlike most software developers) are anal about getting things right, and implementing FP properly is hard -- harder than using it!


> And you should understand that most of its limitations are an inherent mathematical and fundamental limitation, it is logically impossible to do better on most of its limitations

You can do exact real arithmetic. But this is only done by people who prove theorems with computers - or by the Android calculator! https://en.wikipedia.org/wiki/Computable_analysis

Other alternatives (also niche) are exact rational arithmetic, computer algebra, arbitrary precision arithmetic.

Fixed point sometimes gets used instead of floats because some operations lose no precision over them, but most operations still do.


In my opinion, that's in the realm of "you can only delay it". Sure, you can treat real numbers via purely logical deductions like a human mathematician would, but at some point someone's going to ask, "so, where is the number on this plot?" and that's when it's time to pay the fiddler.

Same for arbitrary-precision calculations like big rationals. That just gives you as much precision as your computer can fit in memory. You will still run out of precision, just later rather than sooner.


> Same for arbitrary-precision calculations like big rationals. That just gives you as much precision as your computer can fit in memory. You will still run out of precision, later rather than sooner.

Oh, absolutely. This actually shows that floats are (in some sense) more rigorous than more idealised mathematical approaches, because they explicitly deal with finite memory.

Oh, I remembered! There's also interval arithmetic, and variants of it like affine arithmetic. At least you know when you're losing precision. Why don't these get used more? These seem more ideal, somehow.


Because the interval, on average, grows exponentially with the number of basic operations. So it quickly becomes practically useless.


If x is the interval [-1, 1], the typical implementation of IA will

evaluate x-x to [-2, 2] (instead of [0, 0]), and

evaluate x*x to [-1, 1] (instead of [0, 1]).

Therefore the intervals become too conservative to be useful.


I wouldn't call computable reals the reals. They are a subset of measure zero. Perhaps all we sentient beings can aspire to use, but still short of the glory of the completed infinities that even one arbitrary real represents.

One half : )


These are only relevant in some circumstances. For example, a calculator is typically bounded in the number of operations you can perform to a small number (humans don’t add millions of numbers). This allows for certain representations that don’t make sense elsewhere.


> 3. You might not like that floats are in binary, which makes decimal arithmetic look weird. But doing decimal arithmetic does not get rid of numerical error, see point 1 (and binary arithmetic thinks your decimal arithmetic looks weird too).

One thing that I suspect trips people up a lot is decimal string/literal <-> (binary) float conversions instead of the floating point math itself. This includes the classic 0.1+0.2 thing, and many of the problems in the article.

I think these days using floating point hex strings/literals more would help a lot. There are also decimal floating point numbers that people largely ignore despite being standard for over 15 years


The only implementation of IEEE754 decimals I've ever seen is in Python's Decimal package. Is there an easily-available implementation anywhere else?


I don't think Python's Decimal is IEEE 754; instead it's some sort of arbitrary precision thingy.

GCC has builtin support for decimal floats: https://gcc.gnu.org/onlinedocs/gcc/Decimal-Float.html

There are also library implementations floating around, some of them are mentioned in this thread: https://discourse.llvm.org/t/rfc-decimal-floating-point-supp...

decNumber also has Rust wrappers if you are so inclined.


Python's decimal absolutely is IEEE 754 (well, based on the older standard, which has now been absorbed into IEEE 754):

https://github.com/python/cpython/blob/main/Lib/_pydecimal.p...

Cool, didn't know that gcc had built-in support. But is it really as incomplete as it says there?


Huh, I didn't know it was that close, I'll grant that. But I'd say still no cigar.

One of the most elementary requirements of IEEE754 is:

> A programming environment conforms to this standard, in a particular radix, by implementing one or more of the basic formats of that radix as both a supported arithmetic format and a supported interchange format.

(Section 3.1.2)

While you could argue that you may configure Decimal's context parameters to match those of some IEEE754 format and thus claim conformance as an arithmetic format, Python has absolutely no support for the specified interchange formats.

To be honest, seeing this I'm a bit befuddled as to why closer conformance with IEEE754 is not sought. A quick search found e.g. this issue report on adding an IEEE754-parametrized context, which is a trivial patch, and it has been just sitting there for 10 years: https://github.com/python/cpython/issues/53032

Adding code to import/export the BID/DPD formats, while maybe not as trivial, seems like a comparatively small task and would improve interoperability significantly imho.


> One thing that pains me about this kind of zoo of problems is that people often have the takeaway, "floating point is full of unknowable, random errors, never use floating point, you will never understand it."

> Floating point is amazingly useful!

Another thing about floats is that they are for the most part actually very predictable. In particular, all basic operations should produce bit-exact results down to the last ulp. Also, because they are a language-independent standard, you can generally get the same behavior in different languages and on different platforms. This makes learning floats properly worthwhile, because the knowledge is so widely applicable.


> In particular, all basic operations should produce bit-exact results down to the last ulp.

As long as you are not using a compiler that utilizes x87's extended precision floats for intermediate calculations, silently rounding whenever it transfers to memory (that used to be a common issue), and as long as you are not doing dumb stuff with compiler math flags.

Also if you have any code anywhere in your program that relies on correct subnormal handling, then you need to be absolutely sure no code is compiled with `-ffast-math`, including in any dynamically loaded code in your entire program, or your math will break: https://simonbyrne.github.io/notes/fastmath/#flushing_subnor...

And of course if you are doing anything complicated with floating point numbers, there are entire fields of study about creating numerically stable algorithms, and determining the precision of algorithms with floating point numbers.


Floating point is a goofy hacky kludge.

> There's a reason why it's implemented in hardware in all modern computers

Yah, legacy.

The reason we used it originally is that computers were small and slow. Now that they're big and fast we could do without it, except that there is already so much hardware and software out there that it will never happen.


What replacement would you propose? They all have different tradeoffs.


(I just tried to delete my comment and couldn't because of your reply. Such is life.)

ogogmad made a much more constructive comment than mine: https://news.ycombinator.com/item?id=34370745

It really depends on your use case.


Turning all your fixed-size numeric types into variable-sized numeric types introduces some really exciting performance and security issues. (At least if you consider DoS security.)

I think fixed-point math is underrated though.


Related: In numerical analysis, I found the distinction between forwards and backwards numerical error to be an interesting concept. The forwards error initially seems like the only right kind, but is often impossible to keep small in numerical linear algebra. In particular, Singular Value Decomposition cannot be computed with small forwards error. But the SVD can be computed with small backwards error.

Also: The JSON example is nasty. Should IDs then always be strings?


> The JSON example is nasty.

Specs, vs. their implementations, vs. backwards compat. JSON just defines a number type, and neither the grammar nor the spec places limits on it (though the spec does call out exactly this problem). So the JSON is to-spec valid. But implementations have limits as to what they'll decode: JS's is that it decodes to number (a double) by default, and thus, loses precision.

(I feel like this issue is pretty well known, but I suppose it probably bites everyone once.)

JS does have the BigInt type, nowadays. Unfortunately, while the JSON.parse API includes a "reviver" parameter, the way it ends up working means that it can't actually take advantage of BigInt.

> Should IDs then always be strings?

That's a decent-ish solution; as it side-steps the interop issues. String, to me, is not unreasonable for an ID, as you're not going to be doing math on it.


IIRC, forward error: the error between the given answer and the right answer to the given question.

Backward error: the error between the given question, and the question whose right answer is the given answer.

Easier to parse like this: a small forward error means that you give an answer close to the right one.

A small backward error means that the answer you give is the right answer for a nearby question.


My 'favourite' is that the quadratic formula (-b±sqrt(b²-4ac))/2a falls apart when you solve for the positive solution using floating point for cases where ε=4ac/b² is small, the workaround being to use the binomial expansion -b/2a*(0.5ε+0.125ε²+O(ε³))


This one is fun! However, that particular formula is not my favorite, as it doesn't behave nicely when a is small. The art here is deciding what to special case.

The way I approach this[1] is to choose the sign of the ± to be the same as -b to compute the first root x1. Then the second root is c/(a * x1). There are a few other checks in the code for things overflowing, but that basically gives you very accurate answers across the range.

This is explained a bit in a good Math Exchange post[2]. Jim Blinn's "How to solve a quadratic equation" also makes good reading.

Wait til you get to cubics and quartics.

[1]: https://docs.rs/kurbo/latest/src/kurbo/common.rs.html#116-16...

[2]: https://math.stackexchange.com/questions/866331/numerically-...
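A sketch of that recipe in Python (assumes real roots and a != 0; the test case is just an ill-conditioned example):

    import math

    def quadratic_roots(a, b, c):
        d = math.sqrt(b*b - 4.0*a*c)
        q = -0.5 * (b + math.copysign(d, b))  # sign chosen to match -b, so no cancellation
        return q / a, c / q                   # second root via the product of roots, c/a

    print(quadratic_roots(1.0, 1e8, 1.0))  # roughly (-1e8, -1e-8), both accurate
    d = math.sqrt(1e8 * 1e8 - 4.0)
    print((-1e8 + d) / 2.0)                # naive formula for the small root: about -7.45e-9, far off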


> example 4: different languages sometimes do the same floating point calculation differently

It's worse than that: You can use the same language, compiler, library, machine, and still get different results if your OS is different.

I forget all the details, but it boils down to how intermediate results are handled. When you compute certain functions, there are several intermediate calculations before it spits out the result. You get more accuracy if you allow those intermediate calculations to happen in a higher precision format (e.g. you're computing in 32 bits, so it will compute the intermediate values in 64 bits). But that is also slower.

OS's make a "default" choice. I think Linux defaults to slower, but more accurate, and BSD defaults to faster, but less accurate.

There may be flags you can set to force one configuration regardless of the OS, but you shouldn't assume your libraries do that.

> In principle you might think that different implementations should work the same way because of the IEEE 754 standard for floating point, but here are a couple of caveats that were mentioned:

> math operations in libc (like sin/log) behave differently in different implementations. So code using glibc could give you different results than code using musl

IEEE 754 doesn't mandate a certain level of accuracy for transcendental functions like sin/log. You shouldn't expect different libraries to give you the same value. If you're doing 64 bit calculations, I would imagine most math libraries will give results accurate enough for 99.99% of math applications, even if only the first 45 bits are correct (and this would be considered "very inaccurate" by FP standards).


See also "Floating Point visually explained": https://fabiensanglard.net/floating_point_visually_explained...


Regarding denormal/subnormal numbers mentioned as "weird": the main issue with them is that their hardware implementation is awfully slow, to the point of being unusable for most computation cases with even moderate FLOPs


Love it. I actually use Excel, which even power users take for granted, to highlight that people really need to understand the underlying system, or the system needs to have guard rails to prevent people from stubbing their toes. Microsoft even had to write a page explaining what might happen [1] with floating point weirdness.

[1] https://docs.microsoft.com/en-us/office/troubleshoot/excel/f...


> addition isn’t associative (x + (y + z)) is different from (x + y) + z))

A thousand thanks for not saying "addition is not commutative".

(Addition is commutative in floating point. It merely is not associative).


> Javascript only has floating point numbers – it doesn’t have an integer type.

Can anyone justify this? Do JS developers prefer not having exact integers, or is this something that everyone just kinda deals with?


Nowadays, it has BigInt.

If you're very careful, a double can be an integer type. (A 53-bit one, I think?) (I don't love this line of thinking. It has a lot of sharp edges. But JS programmers effectively do this all the time, often without thinking too hard about it.)

(And even before BigInt, there's an odd u32-esque "type" in JS; it's not a real type — it doesn't appear in the JS type system, but rather an internal one that certain operations will be converted to internally. That's why (0x100000000 | 0) == 0 ; even though 0x100000000 (and every other number in that expression, and the right answer) is precisely representable as a f64. This doesn't matter for JSON decoding, though, … and most other things.)


I believe this is technically inaccurate; while Javascript groups most of the number values under, well, "number", modern underlying implementations may resort to performing integer operations when they recognize it is possible. There are also a couple hacks you can do with bit operations to "work" with integers, although I don't remember them off the top of my head - typically used for truncating and whatnot, and mainly a performance thing.

Also there are typed arrays and bigints if we can throw those in, too.


The way runtimes optimize arithmetic is an implementation detail and must conform to IEEE-754.


Fair point, I have been taking smis for granted


> not having exact integers

What do you mean? Floating-point arithmetic is, by design, exact for small integers. The result of adding 2.0 to 3.0 is exactly 5.0. This is one of the few cases where it is perfectly legitimate to compare floats for equality.

In fact, using 64-bit doubles to represent ints you get way more ints than using plain 32-bit ints. Thus, choosing doubles to represent integers makes perfect sense (unless you worry about wasting a bit of memory and performance).


Having something work most of the time but break without warning - without at least writing to a log that an overflow has happened - goes against a lot of SWE best practices AFAICT. Imagine debugging that without knowing it can happen.

Does JS at least write ".0" at the end when converting a number to a string? Or switch to scientific notation for large numbers?


You can use doubles to store and calculate exact integer values. You just won't get 2^64 integers; instead you get the range +/-2^53.


Example 7 really got me, can anyone explain that? I’m not sure how “modulo” operation would be implemented in hardware, if it is a native instruction or not, but one would hope it would give a result consistent with the matching divide operation.

Edit: x87 has FPREM1 which can calculate a remainder (accurately one hopes), but I can’t find an equivalent in modern SSE or AVX. So I guess you are at the mercy of your language’s library and/or compiler? Is this a library/language bug rather than a Floating Point gotcha?


Based on the nearest numbers that floats represent, the two numbers are Y = 13.715999603271484375 (https://float.exposed/0x415b74bc) and X = 4.57200002670288085938 (https://float.exposed/0x40924dd3).

The division of these numbers is 2.9999998957049091386350361962468173875300478102103478639802753918, but the nearest float to that is 3. (Exactly 3.) [2]

The modulo operation can (presumably) determine that 3X > Y, so the modulo is Y - 2X, as normal.

This gives inconsistent results, if you don't know that every float is actually a range, and "3" as a float includes some numbers that are smaller than 3.

[1] https://www.wolframalpha.com/input?i=13.715999603271484375+%... [2] https://www.wolframalpha.com/input?i=2.999999895704909138635..., then https://float.exposed/0x40400000


This is useful but note that Python uses 64-bit floats (aka "double"), so the right values are:

• "13.716" means 13.7159999999999993037 (https://float.exposed/0x402b6e978d4fdf3b)

• "4.572" means 4.57200000000000006395 (https://float.exposed/0x401249ba5e353f7d)

• "13.716 / 4.572" means the nearest representable value to 13.7159999999999993037 / 4.57200000000000006395 which (https://www.wolframalpha.com/input?i=13.7159999999999993037+...) is 3.0 (https://float.exposed/0x4008000000000000)

• "13.716 % 4.572" means the nearest representable value to 13.7159999999999993037 % 4.57200000000000006395 namely to 4.5719999999999991758 (https://www.wolframalpha.com/input?i=13.7159999999999993037+...), which is 4.57199999999999917577 (https://float.exposed/0x401249ba5e353f7c) printed as 4.571999999999999.

————————

Edit: For a useful analogy (answering the GP), imagine you're working in decimal fixed-point arithmetic with two decimal digits (like dollars and cents), and someone asks you for 10.01/3.34 and 10.01%3.34. Well,

• 10.01 / 3.34 is well over 2.99 (it's over 2.997 in fact) so you'd be justified in answering 3.00 (the nearest representable value).

• 10.01 % 3.34 is 3.33 (which you can represent exactly), so you'd answer 3.33 to that one.

(For an even bigger difference: try 19.99 and 6.67 to get 3.00 as quotient, but 6.65 as remainder.)
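
For reference, reproducing the article's example directly in Python (64-bit floats, as above) shows exactly that pair of results:

    >>> 13.716 / 4.572
    3.0
    >>> 13.716 % 4.572
    4.571999999999999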


This has nothing to do with the definition or implementation of the remainder or modulo function.

It is a problem that appears whenever you compose an inexact function, like the conversion from decimal to binary, with a function that is not continuous, like the remainder a.k.a. modulo function.

In decimal, 13.716 is exactly 3 times 4.572, so any kind of remainder must be zero, but after conversion from decimal to binary that relationship is no longer true, and because the remainder is not a continuous function its value may be wildly different from the correct value.

When you compute with approximate numbers, like floating-point numbers, the error in the final result remains bounded as long as you compose only continuous functions, and small errors in the inputs lead to correspondingly small errors in the output.

However, it is enough to insert one discontinuous function into the computation chain to lose any guarantee about the magnitude of the error in the final result.

The conclusion is that whenever computing with approximate numbers (which may also use other representations, not only floating-point) you have to be exceedingly cautious when using any function that is not continuous.
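
A small Python sketch of that point (not the article's example, just an illustration): nudge the input of a continuous function and the output barely moves; nudge it across a discontinuity of the modulo function and the output jumps by almost the whole modulus.

    import math

    # Continuous function: a tiny change in the input gives a tiny change in the output.
    print(math.sqrt(8.999999999), math.sqrt(9.000000001))   # both very close to 3.0

    # Discontinuous function: inputs a hair apart straddle the jump of x % 3,
    # so the outputs differ by almost the whole modulus.
    print(8.999999999 % 3.0, 9.000000001 % 3.0)             # roughly 3.0 vs roughly 0.0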


Here is the Go version. Works exactly as expected, no surprises. People just need to grow up and use a modern language, not a 50-year-old out-of-date language:

    package main
    
    import "fmt"
    
    func main() {
       var i, iterations, meters float64
       for iterations = 100_000_000; i < iterations; i++ {
          meters += 0.01
       }
       // Expected: 1000.000000 km
       fmt.Printf("Expected: %f km\n", 0.01 * iterations / 1000)
       // Got: 1000.000001 km
       fmt.Printf("Got: %f km \n", meters / 1000)
    }


AI already has led to a rethinking of computer architectures, in which the conventional von Neumann structure is replaced by near-compute and at-memory floorplans. But novel layouts aren’t enough to achieve the power reductions and speed increases required for deep learning networks. The industry also is updating the standards for floating-point (FP) arithmetic. https://semiengineering.com/will-floating-point-8-solve-ai-m...


Muller's Recurrence is my favorite example of floating point weirdness. See https://scipython.com/blog/mullers-recurrence/ and https://latkin.org/blog/2014/11/22/mullers-recurrence-roundo...
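
For anyone who hasn't seen it, a minimal Python sketch of the recurrence from those posts: the exact sequence converges to 5, but in 64-bit floats the rounding errors take over and the iterates head to 100 instead.

    def muller(n):
        x_prev, x = 4.0, 4.25
        for _ in range(n):
            x_prev, x = x, 108 - (815 - 1500 / x_prev) / x
        return x

    print(muller(5))    # still tracking the exact sequence (which converges to 5)
    print(muller(30))   # essentially 100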


We work hard to retain floating point reproducibility in climate models. I have a presentation on this, if anyone is interested.

https://www.marshallward.org/fortrancon2021/#/title-slide

https://m.youtube.com/watch?v=wQQuFXm6ZqU&t=27s


I'm not on Mastodon, so I'll share here: I inherited some numerical software that was used primarily to prototype new algorithms and check errors for a hardware product that solved the same problem. It was known that different versions of the software produced slightly different answers, for seemingly no reason. The hardware engineer who handed it off to me didn't seem to be bothered by it. He wasn't using version control, so I couldn't dig into it immediately, but I couldn't stop thinking about it.

Soon enough I had two consecutive releases in hand, which produced different results, and which had identical numerical code. The only code I had changed that ran during the numerical calculations was code that ran between iterations of the numerical parts of the code. IIRC, it printed out some status information like how long it had been running, how many calculations it had done, the percent completed, and the predicted time remaining.

How could that be affecting the numerical calculations??? My first thought was a memory bug (the code was in C-flavored C++, with manual memory management) but I got nowhere looking for one. Unfortunately, I don't remember the process by which I figured out the answer, but at some point I wondered what instructions were used to do the floating-point calculations. The Makefile didn't specify any architecture at all, and for that compiler, on that architecture, that meant using x87 floating-point instructions.

The x87 instruction set was originally created for floating point coprocessors that were designed to work in tandem with Intel CPUs. The 8087 coprocessor worked with the 8086, the 287 with the 286, the 387 with the 386. Starting with the 486 generation, the implementation was moved into the CPU.

Crucially, the x87 instruction set includes a stack of eight 80-bit registers. Your C code may specify 64-bit floating point numbers, but since the compiled code has to copy those values into the x87 registers to execute floating-point instructions, the calculations are done with 80-bit precision. Then the values are copied back into 64-bit registers. If you are doing multiple calculations, a smart compiler will keep intermediate values in the 80-bit registers, saving cycles and gaining a little bit of precision as a bonus.

Of course, the number of registers is limited, so intermediate values may need to be copied to a 64-bit register temporarily to make room for another calculation to happen, rounding them in the process. And that's how code interleaved with numerical calculations can affect the results even if it semantically doesn't change any of the values. Calculating percent completed, printing a progress bar -- the compiler may need to move values out of the 80-bit registers to make room for these calculations, and when the code changes (like you decide to also print out an estimated time remaining) the compiler might change which intermediate values are bumped out of the 80-bit registers and rounded to 64 bits.

It was silly that we were executing these ancient instructions in 2004 on Opteron workstations, which supported SSE2, so I added a compiler flag to enable SSE2 instructions, and voila, the numerical results matched exactly from build to build. We also got a considerable speedup. I later found out that there's a bit you can flip to force x87 arithmetic to always round results to 64 bits, probably to solve exactly the problem I encountered, but I never circled back to try it.


Oh man, those 80-bit registers on 32-bit machines were weird. I was very confused as an undergrad when I ran the basic program to find machine epsilon, and was getting a much smaller epsilon than I expected on a 64-bit float. Turns out, the compiler had optimised all of my code to run on registers and I was getting the machine epsilon of the registers instead.
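
The textbook loop in question looks something like this in Python (which keeps its floats as 64-bit doubles in memory, so it reports the expected epsilon); compiled code that keeps x87 80-bit intermediates in registers can report the much smaller extended-precision epsilon instead.

    eps = 1.0
    while 1.0 + eps / 2 != 1.0:
        eps /= 2
    print(eps)   # 2.220446049250313e-16 for IEEE 64-bit floats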


On the hardware, it's fundamentally integer arithmetic under the hood.

Floating point itself is an implementation on top of that.

With a slide rule it's basically truncated integers all the way, you're very limited on significant figures, you float the point in your head or some other way, and apply it afterward. You're constantly aware of any step where your significant figures drop below 3 because that's such a drastic drop from 3 down to 2. Or down to 1 which can really slap you in the face.

The number of decimal places also is an integer naturally, which is related to scientific notation. Whether a result is positive or negative is a simple flag.

On a proper computer, integer addition, subtraction, and multiplication are exact. It's division that is not exact: results are usually truncated as they are written to memory at the bitness in use at the time, storing only the whole-number portion of the quotient.

Integer arithmetic can usually be made as exact as you want and it helps to consciously minimize divisions, and if necessary offset/scale the operands beforehand such that the full range of quotients is never close enough to zero for the truncation to affect your immediate result within the number of significant figures necessary for your ultimate result to be completely reliable.

The number of significant figures available on even an 8-bit computer just blows away the slide rule but not if you don't take full advantage of it.

What sometimes happens is that zero-crossing functions, when the quotients are not offset, will fluctuate: they naturally take advantage of the enhanced bitness of the hardware for large values (blowing away the slide rule a huge amount of the time), while "periodically" dropping below the accuracy of a slide rule when some quotient, especially an intermediate one, is too near zero. Floating point or integer.

If there's nobody keeping track of not just the magnitude of the possible values, but also the significant figures being carried at each point, your equations might not come out as good as a slide rule sometimes.

Edit: IOW when the final reportable result is actually a floating point number where you need to have an accurate idea how many figures to the right of the decimal place are truly valid, it might be possible to use all integer calculations to your advantage from the beginning, and confidently place the decimal point as a final act.
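
A toy Python sketch of that approach (with made-up numbers): scale before dividing so truncation happens far below the digits you care about, then place the decimal point at the end.

    a, b = 22, 7

    print(a // b)                             # 3: plain integer division truncates hard
    q = (a * 10**6) // b                      # 3142857: six extra digits carried through
    print(f"{q // 10**6}.{q % 10**6:06d}")    # 3.142857, decimal point placed as a final act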


Regarding example 2.1:

I thought the JSON spec was that a number is of any precision and that it’s up to the parser to say, “best I can do is 754.”

That is, I think you can deserialize very long decimal numbers to higher accuracy in some languages.
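
The JSON grammar itself puts no limit on precision; what you get depends on the parser. In Python's standard json module, for instance, you can hand number tokens to Decimal instead of letting them collapse to 64-bit floats:

    import json
    from decimal import Decimal

    s = '{"price": 0.10000000000000000000000000001}'

    print(json.loads(s)["price"])                       # 0.1 (collapsed to a 64-bit float)
    print(json.loads(s, parse_float=Decimal)["price"])  # 0.10000000000000000000000000001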


Got bitten by a big number being added to a small number when converting code from double precision to single precision; hard to track down when subtle errors get introduced due to this.


In computer graphics, some people try to accumulate the transformations in one matrix, so A_{n+1} = T_{n} * A_{n} where T_{n} is a small transformation like a rotation around an axis.

They learn by experience that they also slowly accumulate errors and end up with a transformation matrix A that's no longer orthogonal and will skew the image.

Or people try to solve large linear systems of floating-point numbers with a naive Gaussian elimination approach and end up with noise. Same with naive iterative eigenvector calculations.
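
A rough numpy sketch of the first problem (single precision and a made-up step count, just to make the drift show up quickly): keep multiplying a tiny rotation onto an accumulated matrix and check how far A^T A drifts from the identity, which it would equal exactly if A stayed orthogonal.

    import numpy as np

    theta = 1e-4
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]], dtype=np.float32)

    A = np.eye(2, dtype=np.float32)
    for _ in range(100_000):
        A = R @ A                             # A_{n+1} = T_n * A_n

    print(np.abs(A.T @ A - np.eye(2)).max())  # noticeably nonzero in float32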


moreover.... floating point addition is not commutative, nor is it associative.


What would be an example of a floating-point addition operation that doesn't commute?


    >>> x = 7.00000000000001
    >>> x + 1 - x
    1.0000000000000009
    >>> x - x + 1
    1.0


Addition is commutative with floating point numbers, what you show is actually an associativity problem in disguise.

Subtraction is obviously not commutative, so let's rewrite as an addition.

  x + 1 - x => x + 1 + (-x)
  x - x + 1 => x + (-x) + 1
Now write the implicit parenthesis

  x + 1 + (-x) => (x + 1) + (-x)
  x + (-x) + 1 => (x + (-x)) + 1
Now, since we assumed addition is commutative, let's swap a few things

  (x + 1) + (-x) => (-x) + (x + 1)
  (x + (-x)) + 1 => ((-x) + x) + 1
And you get the classic associativity problem. If swapping things had given us the same expression (while the two evaluate to different results), that would have been a contradiction proving that addition is not commutative, but that is not the case here.

It means that if addition is not associative, which we know is the case, then your example doesn't prove that addition is not commutative.

Of course, I didn't prove that addition is commutative in general, but it has no reason not to be.


You're right, it's non-associativity in disguise. You might not like the next example either:

    NaN + 1 <> 1 + NaN 
(as NaN <> NaN)

IIRC it also breaks down with signed zeros.


I may be overlooking your point. x-x+1 is evaluated left to right like any other additive expression, and after the x-x part is evaluated, the intermediate result is zero. This would be the case with any numeric type, wouldn't it?


If you need to sum a large array of numbers, it might not be a bad idea to sort that array first.
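
A quick Python illustration of why: summing the small-magnitude terms first lets them accumulate before they meet the big ones, and math.fsum sidesteps the ordering question entirely.

    import math

    xs = [1e16] + [1.0] * 1_000_000

    print(sum(xs))           # 1e+16: each 1.0 is swallowed by the huge running total
    print(sum(sorted(xs)))   # 1.0000000001e+16: the small terms accumulate first
    print(math.fsum(xs))     # 1.0000000001e+16: correctly rounded, order-independent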


All numbers in JavaScript are floats, unless you make an array with Int8Array(). https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

I wonder if people sometimes make a one-element integer array this way so they can have an integer to work with.


My favorite floating point weirdness is that 0.1 can't be exactly represented in floating point.


Isn't it equally weird that 1/3 can't be exactly represented in decimal?


Indeed, 0.1 can be represented exactly in decimal floating point, and can't be represented in binary fixed point. It's just that fractional values are currently almost always represented using binary floating point, so the two get conflated.


The reason why the 0.1 case is weird (unexpected) is that we use decimal notation in floating-point constants (in source code, in formats like JSON, and in UI number inputs), but the value that the constant actually ends up representing is really the closest binary number, where in addition the closeness depends on the FP precision used. If we would write FP values in binary or hexadecimal (which some languages support), the issue wouldn’t arise.
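
You can ask Python to show exactly which binary value the literal 0.1 turns into, and the hexadecimal form mentioned above makes the rounding explicit:

    >>> from decimal import Decimal
    >>> Decimal(0.1)
    Decimal('0.1000000000000000055511151231257827021181583404541015625')
    >>> (0.1).hex()
    '0x1.999999999999ap-4'
    >>> 0.1 + 0.2
    0.30000000000000004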


Yep! Too bad humanity has settled on decimal instead of dozenal (base 12).


Imperial units were right all along.


Used to run into these problems all the time when I was doing work in numerical analysis.

The PATRIOT missile error (it wasn't a disaster) was more due to the handling of timestamps than just floating point deviation. There were several concurrent failures that allowed the SCUD to hit its target. IIRC the clock drift was significant and was magnified by being converted to floating point and, importantly, truncated into a 24-bit register. Moreover, they weren't "slightly off". The clock drift alone put the missile considerably off target.

While I don't claim that floating points didn't have a hand in this error it's likely the correct handling of timestamps would not have introduced the problem in the first place. Unlike the other examples given this one is a better example of knowing your system and problem domain rather than simply forgetting to calculate a delta or being unaware of the limitations of IEEE 754. "Good enough for government work" struck again here.


> but I wanted to mention it because:

> 1. it has a funny name

Reasoning accepted!



