Do not taunt happy fun branch predictor (2023) (mattkeeter.com)
279 points by fanf2 3 days ago | 138 comments





Was summarizing this article for a group of friends who largely met during the Apple II days and wanted to repost a bit of that here:

The optimized code at the end takes 94 nanoseconds to sum an array of 1024 32-bit floating point numbers.

In 94 nanoseconds, our old friend the 1 MHz 6502 would be just starting to consider signaling the memory chips that maybe they ought to try digging up the first byte of the first instruction in the program.

Worth mentioning, that code is entirely dependent on running in cache. Otherwise even the mighty M1 Max in the post would still be stuck waiting on that first memory fetch. DRAM is slow. :-)


Luckily, our total L1 cache sizes are about as big as the entire addressable memory of the 6502.

We truly live in amazing times.


Truly. And I'm also always amazed by how much slower (in terms of wall time) modern software is than it ought to be. CPUs sure don't feel three to four orders of magnitude faster than they were 50 years ago, because software has gotten four to five orders of magnitude more wasteful to compensate. Argh...

I recently moved a project from a Phenom II 945 to an I7 8700. The project took around 1:30 to compile on the Phenom, 15s to compile on the I7. Others working on the project with even more modern systems are compiling in half that time.

The major advantage I had was running the same compilers, same interpreters and same tooling, just on better hardware.

On the other hand, I always felt KDE 2 and 3 were way more responsive on way less hardware than one of the modern versions or Gnome. Part of it is UI design - giving feedback that an operation is going to take some time immediately instead of blocking until the operation is done before doing anything.


Plenty of that software is fairly well optimized, but most software is not that optimized. Microsoft Teams, Discord, Slack, basically anything else that also uses Electron... it's not UI design, it's legitimately wasted work, and tons of it.

So many programmers saw "Premature optimization is the root of all evil" and thought it means "Caring about performance makes you a heretic".

You can't hotspot optimize a Fiat Multipla into a Formula 1 car. When every software you run creates a dozen factories to replace one for-loop, you get the modern desktop experience.


    > So many programmers saw "Premature optimization is the root of all evil" and thought it means "Caring about performance makes you a heretic".
This is very well said. Also, with enough experience, many good programmers will "know" where the hotspots will be, while writing the code. So, they can be "sloppy" in areas where it doesn't matter, and "care more about performance" in areas where it will likely matter much more.

    > most software is not that optimized
I would say: For good reason. The value is too low.

    > basically anything else that also uses Electron... 
To me, the point of Electron is a lovely platform with very high developer productivity. This is partly the reason why Java/C# was able to displace so much C&C++ code inside big enterprise corps writing CRUD apps. Sure, Java/C# is a bit bloated/slow compared to C&C++, but the developer productivity is way higher.

    > it's not UI design, it's legitimately wasted work, and tons of it.
I don't understand this part. Can you explain more? Maybe an example would help.

> I would say: For good reason. The value is too low.

Oh my god, absolutely not. The value is extremely high to me, the consumer, even if the company doesn't care. Sure, they can technically still get their value proposition out even if the app is slow and bloated and sucky because every app these days is slow and bloated and sucky. And you can say they only care about money all you want, but that's not going to convince me that's a good reason for an app to perform terribly.

> To me, the point of Electron is a lovely platform with very high developer productivity. This is partly the reason why Java/C# was able to displace so much C&C++ code inside big enterprise corps writing CRUD apps. Sure, Java/C# is a bit bloated/slow compared to C&C++, but the developer productivity is way higher.

I use Tauri in place of Electron and it's a lot snappier than Electron is, and you can get much of the same benefits in terms of developer productivity, because it's also just a webview. The host side needs to be a Rust application, but I wouldn't say it's much more difficult to get started with than Electron. Obviously, what you put inside the webview still matters, but in general, I'd say Electron is just an inferior choice in most cases, even if you do legitimately want a webview.

> I don't understand this part. Can you explain more? Maybe an example would help.

Parent said the lack of progress bars in GNOME make it feel slow. I argue that while progress bars could be a hack to make the user more open to waiting, the issue is that the user probably shouldn't need to wait at all. There are definitely cases where a progress bar is a good idea, but they shouldn't be needed in most cases.

For example, if I right-click a message in Discord, sometimes the context menu contains a loading spinner. I have to wait for the context menu to finish loading. Is a loading spinner the right solution here? Surely the user wouldn't want for the menu to not open, or for it to simply be blank until it loads. I think none of those are quite right. The context menu shouldn't have to load at all, it should simply open.


> the context menu contains a loading spinner

How the fuck is that possible, permissible, tolerated?

Another example. I am looking for a different file browser because Dolphin waits while something happens when I select a bunch of files and right-click. I know that it is downloading something very slowly because I can see the network activity but I have no idea why.


> I am looking for a different file browser because Dolphin waits while something happens when I select a bunch of files and right-click.

File browsers being slow seems like a common issue. When I tell Windows 11 Explorer to sort by modified date, it has to slowly load the list and populate the files one by one. What the fuck it is doing I have absolutely no clue. All I know is that it's doing something it shouldn't, because the last modified date should be returned instantly by the directory listing, and no additional metadata should be required to perform the sort.

And back on macOS, around five years ago, whenever I opened a folder, Finder would wait for a few seconds and then play the uncollapse animation for every folder I have expanded, shifting the position of basically every item in the list (and also causing tons of lag).

I think the second one is a well-intentioned feature that just needs some work, but the first one is just garbage, that's not how it should work.


Now you're making me wonder if there's a 6502 emulator making use of that fact.

Hah, I had the same thought. What kind of hacks can I do to convince the processor to keep as much of the emulator + opcodes in L1 as I can...

Back in the nineties, 3DFX had a synthetic rendering benchmark that relied on keeping the entire benchmark in L1 but the secret was taking over the entire machine so that no interrupt or other mechanism could pollute the cache.

A bit of ignorance on my part, but would the L1 be holding the data and the instructions? In which case we would be trying to fit our entire 6502 emulator in less than 64K of memory alongside the emulated RAM/ROM?

https://en.wikipedia.org/wiki/Apple_M1#CPU:

“The high-performance cores have an unusually large 192 KB of L1 instruction cache and 128 KB of L1 data cache and share a 12 MB L2 cache; the energy-efficient cores have a 128 KB L1 instruction cache, 64 KB L1 data cache, and a shared 4 MB L2 cache.”

⇒ chances are you don’t even have to try to fit an emulator of an 8-bit machine and its memory into the L1 cache.


I think you would very much have to try to fit a complete emulator of, say, the Game Boy into 128 + 64KB.

There's plenty of behaviour that is self-evident on real silicon but verbose and tricky to express in software.


Real question about L1 caches. For a long time, x86 (Intel & AMD) L1 caches have been pretty much pegged at 32KB. Do you know why they didn't make them larger? My guess: There is a trade-off between economics and performance.

There is a trade-off between cache size and latency.

Ok, so why do the new Mx chips from Apple have an L1 cache size greater than 32KB? Did they solve some long standing design issue?

The CPU decides what goes in there and when. You can only pray and offer sacrifices and guess at when and how.

Depends on the precise architecture, but ARM (and other RISC designs) usually have separate data and instruction L1 caches. You may need to be aware of this if writing self-modifying code, because cache coherence may not be automatic.
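
For example, a minimal sketch (assuming GCC or Clang; the buffer and helper name are made up) of the maintenance needed after writing out fresh machine code:

  #include <stddef.h>

  /* Hypothetical helper: after writing freshly generated machine code into
   * buf, flush the data cache and invalidate the instruction cache for that
   * range before jumping to it. On AArch64 the I-cache is not automatically
   * coherent with the D-cache, so skipping this can execute stale bytes. */
  static void finish_code_patch(void *buf, size_t len)
  {
      char *begin = (char *)buf;
      char *end   = begin + len;

      /* GCC/Clang builtin; on AArch64 it emits the DC CVAU / IC IVAU /
       * DSB / ISB sequence (or a syscall where required). On x86 it is
       * effectively a no-op because I- and D-caches are kept coherent. */
      __builtin___clear_cache(begin, end);
  }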

The same thing was covered by Raymond Chen almost 2 decades ago: https://devblogs.microsoft.com/oldnewthing/20041216-00/?p=36...

As someone who has the old dead tree version of Intel’s x86 and 64 architecture instruction set reference (the fat blue books), and in general as someone who carefully reads the data sheets and documentation and looks for guidance from the engineers and staff who wrote the said data sheets, I always have reservations when I hear that “intuitively you would expect X but Y happens.” There’s nothing intuitive about any of this except, maybe, a reasonable understanding of the semi-conductive nature of the silicon and the various dopants in the process. Unless you’ve seen the die schematic, the traces, and you know the paths, there is little to no reason to have any sort of expectations that Thing A is faster than Thing B unless the engineering staff and data sheets explicitly tell you.

There are exceptions, but just my 2c. Especially with ARM.


"Intuitively" here should be taken to mean approximately the same as "naively" – as in, the intuition that most of us learn at first that CPUs work ("as if") by executing one instruction at a time, strictly mechanistically, exactly corresponding to the assembly code. The way a toy architecture on a first-year intro to microprocessors course – or indeed a 6502 or 8086 or 68000 – would do it. Which is to say, no pipelining, no superscalar, no prefetching, no out-of-order execution, no branch prediction, no speculative execution, and so on.

Respectfully, I disagree. CPU architecture optimization is in a continuous dance with compiler optimization, where the former tries to adapt to the patterns most commonly produced by the latter, and the latter tries to adjust its optimizations according to what performs fastest on the former.

Therefore, it is not unreasonable to make assumptions based on the premise of "does this code look like something that could be reasonably produced by GCC/LLVM?".

It is true that as cores get simpler and cheaper, they get more edge cases - something really big like Firestorm (A14/M1) can afford to have very consistent and tight latencies for all of its SIMD instructions regardless of the element/lane size and even hide complex dependencies or alignment artifacts wherever possible. But compare that with simpler and cheaper Neoverse N1, and it's a different story entirely, where trivial algorithm changes lead to significant slowdown - ADDV Vn.16B is way slower than Vn.4H, so you have to work around it. This is only exacerbated if you look at much smaller cores.

LLVM and GCC deal with this by being able to use relatively precise knowledge of CPU's (-mtune) fetch, reorder, load and store queue/buffer depths, as well as latencies and dependency penalty cost of opcodes of the ISA it implements, and other details like loop alignment requirements, branch predictor limitations.

Generally, it's difficult to do better in straight-line code with local data than such compilers assuming that whatever you are doing doesn't make concessions that a compiler is not allowed to make.

Nonetheless, the mindset for writing a performant algorithm implementation is going to be the same as long as you are doing so for the same class of CPU cores - loop unrolling, using cmovs, and scheduling operations in advance, or ensuring that should spills happen, the load and store operations have matching arguments - all of that will be profitable on AMD's Zen 4, Intel's Golden Cove, Apple's Firestorm or ARM's Neoverse V3.


He did, and that's a fantastic article which is worth the read and provides good context for interpreting this post.

One thing this post adds is the simple rectification of replacing ret with another br instruction, so the pairs are again "mirrored", and you get to have your cake and eat it too - slightly faster code without breaking the branch predictor.


Raymond Chen is a treasure. I'm so glad that Microsoft gives him the leeway to write that blog, I have learned a ton from it.

This seems no longer to be true for recent x86 processors: https://news.ycombinator.com/item?id=40767676

> After checking for loop termination, we fall through into the foo function (without a branch!)

Literally just reading this line made me go "well, there's your problem." I thought this was going to be something deep about fancy branch predictor heuristics, but it turns out to be a violation of basic heuristics.

Folks, don't think you're going to get amazing speedups by using mismatched call/ret instructions. The branch predictor keeping a shadow stack of return addresses is something that's been around for decades.


It's great that you're so knowledgeable about branch predictor behavior, but many people aren't and for them this is new, maybe even valuable, information.

I guess this article's just not for you then, and that's okay.


I don't think the complaint is that the article exists, it's that it perhaps needs more context. The article does not present the level of skill assumed about the reader or of the author, so comments like these help put that in perspective.

The commentary from "old salts" isn't meant to insult or discourage you but to inform your world view with hard earned experience.

It all has a value that's greater than the sum of its parts and there's no good reason to be critical or reactionary about it.


If you're actually an old salt, it will be immediately obvious when an article is not meant for your level of expertise. Whining that "the article needs more context" is like Michelangelo complaining that a coloring book didn't warn him it was for six year old children.

And this will screw up program execution more profoundly (i.e. crash) on systems with architectural shadow call stacks as a security feature.

Speaking of "security feature", this insanely complex behavior (for a CPU) is a recipe for disasters like Specter and Meltdown.

On the one hand, a design goal of RISC is to improve the performance of compiled code at the expense of most other things. As such, this sort of hazard should be documented, but the designers should be able to assume that anyone directly writing assembler has read the documentation.

On the other hand, Sophie Wilson wrote an implementation of BBC BASIC for the original ARM (but it didn't have a branch predictor). While that is 32-bit and plays by different rules, I wonder how AArch64 slows down code when the architectural assumptions change.


And yet they also showed how they did actually accomplish this and other optimizations. It’s an informative read.

> The SIMD code does come with one asterisk, though: because floating-point addition is not associative, and it performs the summation in a different order, it may not get the same result as straight-line code. In retrospect, this is likely why the compiler doesn't generate SIMD instructions to compute the sum!

> Does this matter for your use case? Only you can know!

Anything is possible of course, and the typical way of writing a loop to sum up an array is literally telling the computer to go element by element and accumulate the values.

But it seems pretty unlikely that doing, say, four accumulations in parallel with simd and then summing them up at the end is any more wrong than going element by element.

Summing floats should by default be taken to have error bounds, and any answer in those bounds is valid. If you know something special about the floats you are inputting, the language should have some means to explicitly encode that. It shouldn’t be the most basic loop, that’s the default case, so it should give the best performance.
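
Roughly what the SIMD version amounts to, as a scalar sketch (four accumulators chosen arbitrarily; not meant to be the article's exact code):

  #include <stddef.h>

  /* Same mathematical sum as the element-by-element loop, but the additions
   * are associated differently, so the rounded result can differ slightly. */
  float sum4(const float *data, size_t n)
  {
      float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
      size_t i = 0;
      for (; i + 4 <= n; i += 4) {
          a0 += data[i + 0];
          a1 += data[i + 1];
          a2 += data[i + 2];
          a3 += data[i + 3];
      }
      for (; i < n; i++)   /* leftover elements */
          a0 += data[i];
      return (a0 + a1) + (a2 + a3);
  }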


Surprisingly there are different algorithms for doing something as simple as summing up a list of numbers.

The naïve way of adding numbers one-by-one in a loop is an obvious way, but there are more sophisticated methods that give better bounds of the total accumulated error; Kahan summation[1] being one of the better-known ones.

Like most things, it can get complicated depending on the specific context. For streaming data, adding numbers one at a time may be the only option. But what if one could use a fixed-size buffer of N numbers? When a new number arrives what should be done? Take a partial sum of some subset of the N numbers in the buffer and add that to a cumulative total? If a subset is chosen, how? Are there provable (improved) error bounds for the subset selection method?

Fun stuff.

[1] https://en.wikipedia.org/wiki/Kahan_summation_algorithm
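
For reference, the textbook form of [1] is only a few lines (a sketch, not tuned for speed; note that fast-math-style reassociation can optimize the compensation away entirely):

  #include <stddef.h>

  /* Kahan (compensated) summation: c carries the low-order bits that each
   * addition would otherwise lose. */
  float kahan_sum(const float *data, size_t n)
  {
      float sum = 0.0f, c = 0.0f;
      for (size_t i = 0; i < n; i++) {
          float y = data[i] - c;   /* apply the stored correction */
          float t = sum + y;       /* low bits of y are lost here... */
          c = (t - sum) - y;       /* ...and recovered here */
          sum = t;
      }
      return sum;
  }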


As far as I know, xsum [1] more or less solves the problem completely: order invariant and exact, at less than 2x the cost of the naive summation. Further speeding it up may be the only avenue left for improvement.

[1] https://gitlab.com/radfordneal/xsum


I was not aware of this (2015) work -- very nice!

A couple of pull-quotes from the paper to summarize:

Much work has been done on trying to improve the accuracy of summation. Some methods aim to somewhat improve accuracy at little computational cost, but do not guarantee that the result is the correctly rounded exact sum.

Many methods have been developed that instead compute the exact sum of a set of floating-point values, and then correctly round this exact sum to the closest floating-point value. This obviously would be preferable to any non-exact method, if the exact computation could be done sufficiently quickly

Exact summation methods fall into two classes — those implemented using standard floating point arithmetic operations available in hardware on most current processors, such as the methods of Zhu and Hayes (2010), and those that instead perform the summation with integer arithmetic, using a “superaccumulator”.

I present two new methods for exactly summing a set of floating-point numbers, and then correctly rounding to the nearest floating-point number. ... One method uses a “small” superaccumulator with sixty-seven 64-bit chunks, each with 32-bit overlap with the next chunk, allowing carry propagation to be done infrequently. The small superaccumulator is used alone when summing a small number of terms. For big summations, a “large” superaccumulator is used as well. It consists of 4096 64-bit chunks, one for every possible combination of exponent bits and sign bit, plus counts of when each chunk needs to be transferred to the small superaccumulator.

On modern 64-bit processors, exactly summing a large array using this combination of large and small superaccumulators takes less than twice the time of simple, inexact, ordered summation, with a serial implementation


Thanks for the summary. I kinda low-key love the idea of converting floats into a fixed point representation that covers the entire range represented by the float type. I mean the accumulator is only 32 KB, which is likely to be in L1 the entire time on modern hardware, and any given float is only going to need two 64 bit words, + 13 bits (12 bits for offset, and 1 for sign) to be represented in this scheme.

It turns into a serious problem if the floats are of different magnitude. Consider [1e50, -1e50, 1e3, 1e3]. Evaluating it as (((1e50 + -1e50) + 1e3) + 1e3) gives 2e3, evaluating it as ((1e50 + 1e3) + (-1e50 + 1e3)) gives 0.

You get similar issues if you're trying to add a large number of small-magnitude numbers to a single larger-magnitude number. (((1e3 + 1e3) + 1e3) ... + 1e50) is substantially different from (((1e50 + 1e3) + 1e3) ... + 1e3).
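
The same thing as a quick runnable check (doubles behave identically here):

  #include <stdio.h>

  int main(void)
  {
      double left  = ((1e50 + -1e50) + 1e3) + 1e3;  /* big terms cancel first */
      double right = (1e50 + 1e3) + (-1e50 + 1e3);  /* 1e3 vanishes into 1e50 */
      printf("%g vs %g\n", left, right);            /* prints "2000 vs 0" */
      return 0;
  }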


Adding and subtracting numbers of wildly different magnitudes is always cursed. If you're doing this I'd say the majority of the time you are making an error.

Your equation ends up being like "We calculated the diameter of the observable universe, but what if we added the Empire State Building to one side, how does that affect the result?"


I don't remember the exact numbers, but I remember seeing a CAD file that was not working well a few years ago, and it turned out the coordinate system set up for the part was roughly equivalent to building a part the size of a man's fist but measuring it from the center of the Earth (in meters) while putting the part in Earth orbit.

It’s also what we do when incrementing a counter, if it’s a very big counter. Operations like that don’t have a physical meaning, but can happen in things like random number generators where it’s a way of mixing the bits. Or maybe allocating unique ids, if you started from a random number.

Of course it is just an example so we shouldn’t nitpick you too much, but that sounds more like a job for an integer.

If all you’ve got is JS then every problem looks like a double.

Yes, absolutely. Though in JavaScript we often make do with the safe integer range. (I don’t know how fast BigInt is for 64 bit integers.)

I don’t know JavaScript so I’m sure to shoot my own foot here, but surely a BigInt whose value can fit in 64 bits is just a normal 64 bit int (from hardware) with some extra bookkeeping around it on most platforms, right?

I can't speak for any other engines, but in SpiderMonkey a BigInt is a GC-managed object with a one-word header containing the sign bit, the length, and some GC bookkeeping bits, along with a second word that is either a single inline 64-bit digit, or a pointer to an array of 64-bit digits. So the memory overhead of a BigInt that fits in 64 bits is "only" 2x, but doing math with BigInt requires more work than a native integer because it needs to handle larger numbers, manually manage sign bits, and so on.

(https://searchfox.org/mozilla-central/source/js/src/vm/BigIn...)


Whether you care about the order or not depends on the intent of the programmer. I think the intent is usually to sum up the array for this sort of code. So, at first glance your first example looks more like a result of something like 0 +- 1e35 in double to me (added one to the exponent to avoid having to do math).

The intent of the programmer always has to be interpreted by the language. I’m saying the default interpretation should be one that doesn’t impose an ordering unless specifically requested. This is much more sympathetic to modern deeply parallel machines.


Very often there's no particular intent, other than to be consistent. You haven't lived until you've traced discrepancies in an ML model deployed across a bunch of platforms back to which SIMD instructions each happens to support (usually squirreled away in some dependency).

But you haven't demonstrated that the SIMD approach produces a meaningfully different and more accurate result than the naive method.

Just like processor magic made the theoretically more performant code perform worse, it's possible that the processor has facilities for bucketing the data such that results from such a dataset are the same or more accurate than the naive approach.

This is absolutely a situation that requires empirical testing.


> it's possible that the processor has facilities for bucketing the data such that results from such a dataset are the same or more accurate than the naive approach.

It absolutely is not. Floating-point addition is completely defined (save for some of the bits in NaN results), and no CPU will start reordering your addition instructions in a way that changes the result. Even if it's an ordering that gives you better accuracy. Having SIMD do four (or more) of them in parallel does not change this.


> it's possible that the processor has facilities for bucketing the data such that results from such a dataset are the same or more accurate

Not realistically. The processor very much needs to do what you tell it.


Yeah, no.

If I know what my data look like, I can choose an order of summation that reduces the error. I wouldn't want the compiler by default to assume associativity and introduce bugs. There's a reason this reordering is disabled by default.


Not the least that you should get the same result from your code every time, even if you compile it with a different compiler. Doesn't matter if that result is somehow better or worse or “difference is so small that it doesn't matter”; it needs to be exactly the same, every time.

I think there’s often a disconnect in these discussions between people that work on libraries and people that work on the application code.

If you have written a library, the user will provide some inputs. While the rounding behavior of floating point operations is well defined, for arbitrary user input, you can’t usually guarantee that it’ll go either way. Therefore, you need to do the numerical analysis given users inputs from some range, if you want to be at all rigorous. This will give results with error bounds, not exact bit patterns.

If you want exact matches for your tests, maybe identify the bits that are essentially meaningless and write them to some arbitrary value.

Edit: that said I don’t think anybody particularly owns rigor on this front, given that most libraries don’t actually do the analysis, lol.


> Summing floats should by default be taken to have error bounds, and any answer in those bounds is valid. If you know something special about the floats you are inputting, the language should have some means to explicitly encode that. It shouldn’t be the most basic loop, that’s the default case, so it should give the best performance.

This is a lot of "should" for something that basically doesn't happen. The only information you give is the order of arithmetic in the original expression.

It is a total nightmare if arithmetic is not stable between builds. Rebuilding the software and running it on the same input should not produce different results.

(Long ago I encountered the special Intel version of this: because the FPU used 80-bit registers internally but 64-bit in memory, changing when a register fill/spill happened would change when your answer got rounded, and thereby change the results. You can set a global FPU flag at the start of the program to force rounding on every operation.)
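
For the curious, the flag in question is the precision-control field of the x87 control word; something like this (GCC-style inline asm on x86, an approximation from memory) forces every x87 operation to round the mantissa to 53 bits:

  /* Set the x87 precision-control field (bits 8-9 of the control word) to
   * 53-bit "double" precision, so results held in the 80-bit registers are
   * rounded to double's mantissa width on every operation. As noted below,
   * this limits the mantissa only; the extended exponent range remains. */
  static void set_x87_double_precision(void)
  {
      unsigned short cw;
      __asm__ __volatile__("fnstcw %0" : "=m"(cw));
      cw = (unsigned short)((cw & ~0x0300u) | 0x0200u);  /* PC = 10b (53-bit) */
      __asm__ __volatile__("fldcw %0" : : "m"(cw));
  }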


It does happen in languages and libraries with a higher level of abstraction. MKL for example will do whatever it wants for accumulations (which practically means that it’ll take advantage of SIMD because that’s a big reason why people use the library) unless you specifically request otherwise via their “Conditional Numerical Reproducibility” flag.

I think that was the right way to do it. BLAS made the right decision by defining these things in terms of sums and dot products instead of step-by-step instructions.

It will always be possible to write programs that run differently on different hardware or with different optimization levels. If somebody is writing code for floating point computations and expects exact bit patterns—it is possible, of course, all the rounding behavior is clearly specified. But usually this is an error.


> You can set a global FPU flag at the start of the program to force rounding on every operation

This doesn’t do quite the same thing. It still uses the wider exponent range of the 80-bit type.


Sorting the floats reduces the error. I think using multiple accumulators then reduces the accuracy. And sorted data is not uncommon.

I think there is always a correct answer and the compiler shouldn’t make changes at least by default that are wrong.

But ways for the programmer to express his intent more clearly are always welcome.


Not necessarily. If the array is large and the values are similar, using multiple accumulators would yield a more accurate result than a single one.

Multiple accumulators increase accuracy. See pairwise summation, for example.
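
A sketch of pairwise summation, for anyone unfamiliar (the base-case size is arbitrary):

  #include <stddef.h>

  /* Pairwise (cascade) summation: recursively sum each half and add the
   * results. Worst-case error grows roughly with log2(n), rather than with
   * n for the plain left-to-right loop. */
  float pairwise_sum(const float *data, size_t n)
  {
      if (n <= 8) {                /* small base case: plain loop */
          float s = 0.0f;
          for (size_t i = 0; i < n; i++)
              s += data[i];
          return s;
      }
      size_t half = n / 2;
      return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
  }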

SIMD sums are going to typically be much more accurate than a naive sum.


Not necessarily either. It's not particularly hard to create a vector where in-order addition is the most accurate way to sum its terms. All you need is a sequence where the next term is close to the sum of all prior ones.

There just isn't a one-size-fit-all solution to be had here.


> There just isn't a one-size-fit-all solution to be had here.

But there is: https://news.ycombinator.com/item?id=40867842


That’s a one size fits some solution. Not more than 2x slower than native floats is pretty slow for cases where you don’t need the extra precision.

It might be the case that it is the best solution if you do need the extra precision. But that’s just one case, not all.


A lot of code relies on FP operations being deterministic, often within confines of specific ISA. Applying SIMD to loops with FP could have been a default but it breaks a lot of existing code, makes the output frequently straight-up non-deterministic and as a result is something that a programmer has to explicitly choose to use.

Moreover, many programmers may not be even aware of these, so when a `float Sum(float[] values)` starts to return something different, they may not have any way to know that it is the vectorization that makes it do so. Which is why, for example, .NET's standard library will use SIMD for `integers.Sum()` but not for `floats.Sum()`.


> A lot of code relies on FP operations being deterministic, often within confines of specific ISA

How much of this is true in practice? I vaguely recall reading about a hard-to-diagnose bug where a seemingly unrelated code change meant that an intermediate value was no longer stored in the x87 extended-precision register but in memory instead, leading to a different result. Wouldn't you run into stuff like that all the time?


> A lot of code relies on FP operations being deterministic

A lot, but still quite the minority. People do run into situations where the same code with the same data produces slightly different results all the time because of changes in compiler, compiler settings, cpu arch, etc… They just don’t notice.

But, then they run into a situation where it matters and it’s a huge PITA to nail down.


Yes this is exactly it. The gold standard is bit-for-bit matching when shifting code between systems. If you move your code to a new system and get different results --> is it due to different summing order, or something else wrong? By ruling out the former you're more likely to get the bit-for-bit match and thus avoid a lot of debugging effort -- or spend it solving a real bug.

This is one of many reasons why we generally don't use the x87 registers anymore.

> this is likely why the compiler doesn't generate SIMD instructions to compute the sum!

Isn't this the type of thing ffast-math is supposed to enable?


-ffast-math can be read as -fwrong-math if the accuracy of the answers matters. For some applications accuracy might not matter much. But the differences you get by pretending that floating point math is associative can be enormous, and scientific and engineering code is often carefully designed to take the likely scale of values into account, and rearranging it is not a good idea.

For code that's been explicitly manually optimized already, indeed you wouldn't want to turn on -ffast-math; and I'd hope anyone who has actually learned what benefits from manual optimization and how to properly do that would also at some point have read something about not coupling that with -ffast-math. But I really doubt such meaningfully-manually-optimized code is at all frequent, and where it isn't it's a rather simple improvement (potentially using some subset flags to allow NaN or infinity if desired).

Assuming associativity is not anti-accuracy in general, it just gets in the way of certain clever algorithms.

Rearranging, like sorting before, is actually a good idea to minimize errors.

Fast math does a lot of risky things in addition to using basic math rules.

Maybe the “fun safe math” flag could be used?

Classic SNL reference: "Do not taunt happy fun ball"

https://www.youtube.com/watch?v=GmqeZl8OI2M


If happy fun branch predictor starts to smoke, seek shelter immediately.

> Happy Fun Ball has been shipped to our troops in Saudi Arabia and is also being dropped by our warplanes in Iraq.

"What YEAR is it!?" /robin-williams-in-jumanji


Sometime in the 1990s, judging by how young Mike Myers is. The Iraq reference makes me think they’re joking about Gulf I not Gulf II, so probably circa 1991.

Still legal in sixteen states!

https://www.youtube.com/watch?v=2AzAFqrxfeY


Don't miss the 2023! The article really is already a little outdated: since Rust 1.78, the compiler uses some more aggressive loop unrolling (and a little bit of SIMD): https://godbolt.org/z/zhbobW7rr

OP wrote: "Looking at the assembly, it's doing some loop unrolling.", linking to https://godbolt.org/z/Kv77abW6c, which is using "Rust Nightly", a moving target. By now, there's more loop unrolling.

Loop unrolling started with Rust 1.59: https://godbolt.org/z/5PTnWrWf7

Update: The code on Github shows they were using Rust version 1.67.0-nightly, 2022-11-27.


Good catch, I've updated the link to select Rust 1.67 specifically!

Rust 1.67.0, which is likely what OP was looking at, gives: https://godbolt.org/z/4Y61d9seh

I ran the benchmarks locally on the same hardware, using latest nightly Rust 1.81 (aggressive loop unrolling): no difference. Same speed as 1.5 years ago.


I'm not super familiar with ARM / ARM64 assembly and was confused as to how x0 was incremented. Was going to ask here, but decided to not be lazy and just look it up.

  const float f = *data++;


  ldr s1, [x0], #4

Turns out this instruction loads and increments x0 by 4 at the same time. It looks like you can use negative values too, so could iterate over something in reverse.

Kind of cool, I don't think x86_64 has a single instruction that can load and increment in one go.


lods and stos do load/store + increment of `rsi` or `rdi` respectively. There's also movs to copy between two memory addresses + increment

Usually seen in conjunction with rep, which repeats the above instruction `rcx` times.

A simple memset of 10 bytes:

    mov rcx, 10
    mov rdi, dest
    mov rax, 0
    rep stosb

You can use the w, d or q suffixes to advance by 2, 4 or 8 bytes.

Oh cool, and it looks like it can also decrement by 1/2/4/8.

> After the byte, word, or doubleword is transferred from the memory location into the AL, AX, or EAX register, the (E)SI register is incremented or decremented automatically according to the setting of the DF flag in the EFLAGS register. (If the DF flag is 0, the (E)SI register is incremented; if the DF flag is 1, the ESI register is decremented.) The (E)SI register is incremented or decremented by 1 for byte operations, by 2 for word operations, or by 4 for doubleword operations.

https://www.felixcloutier.com/x86/lods:lodsb:lodsw:lodsd:lod...



Thanks! Macroexpanded:

Do not taunt happy fun branch predictor - https://news.ycombinator.com/item?id=34520498 - Jan 2023 (171 comments)

(Reposts are fine after a year or so; links to past threads are just to satisfy extra-curious readers)


I'm surprised they didn't try something less clever to optimize their code first. The assembly code could be rewritten like this, so the test requires one branch at the bottom of the loop instead of two (one to jump to the top and one to branch conditionally), as well as one instead of two ALU operations per loop for X1 (one subtraction to compare X1 to zero, and one to decrement X1, as opposed to using the result of the latter).

      stp   x29, x30, [sp, #-16]!
      fmov s0, #0.0

      cbnz x1, exit // Skip loop if x1 == 0.
  loop:
      ldr s1, [x0], #4
  
      bl foo
      // Decrement x1. If x1 != 0, repeat the loop.
      subs x1, x1, #1
      b.ne loop
      // Fall through to exit after the loop finishes.
  exit:
      ldp   x29, x30, [sp], #16
      ret
  
  foo:
      // ...
      ret

Going further, you could just inline foo and omit the RET instruction without doing any trickery with mismatched BL and RET instructions.

I haven't benchmarked this, though, so I don't actually know how much this speeds the code up.


Typo: that line with "cbnz" should be "cbz". The CBZ instruction branches to a label if a register is zero, and CBNZ branches if the register isn't zero.

This was a lovely article. I only wish the author wouldn’t have kept switching units (between µs and ns) making it hard to scan the tables for comparison.

They also switched from C to Rust halfway through the article, which was a bit jarring...

It looks like that was an addendum that was added later and also benchmarked against the original C version

1000ns = 1us :)

We know.

On a phone at normal reading distance, with the article’s styling, it’s really hard to tell the difference between n and u without zooming, and the decimal points get lost - scanning the tables is hard.


Tangential question, perhaps a dumb one: why do most hardware schedulers try to predict the load on their own, instead of letting users explicitly signal their intent? It's everywhere, from CPUs to storage.

The user doesn't know the capabilities of the system. In particular, anything compiled in is set in stone: you don't know the capabilities of systems yet unbuilt. New systems don't sell well unless they make existing code run faster, and sell badly if they require you to rebuild everything and make it non-backwards-compatible.

The peak example of "let the user do the scheduling" was Itanium, which had a VLIW system which scheduled three instructions at once. It was a huge commercial failure, and allowed AMD to instead define the instruction set people wanted: X64 is IA32 but wider and with more registers, but not substantially different and certainly not as weird as Itanium.


I've lost count of the times I've sped up some existing code by removing the hand-optimized assembly and just used a plain C implementation or similar. Sure, it was hand-optimized assembly, but for a 10+ year old CPU.

If you're going to do hand-optimized code for a given platform, include the baseline code and measure at runtime to pick the implementation.


Hand optimized assembly was necessary 30 years ago, and good for several optimizations 20 years ago. But today's computers are SO FAST, it's just not necessary in most situations.

You should look into the history of Itanium, it was designed around the idea that the compiler would do pretty much exactly that. It looked great on paper, but in practice nobody figured out how to actually write a compiler capable of doing it without constantly running into weird edge cases.

X86 does have "prefetch" instructions, which tell the CPU that you want to use some data in the near future. There are also "branch hint" instructions which tell the CPU if a branch is usually taken or not. The problem is that they tend to make your code slower: the CPU is already more than capable of predicting it by itself, and the extra instructions slow down the overall execution because they take up cache space and have to be decoded too.


VLIW is pretty successful for loopy DSP and now AI/ML. However, Itanium was trying to work for general purpose code and then as you say, it constantly ran into weird edge cases. It seemed as if VLIW succumbed to the Peter Principle, it had risen to its level of incompetence. But as long as you use it appropriately, VLIW is a best practice and LLVM supports it.

BTW, CS252 at Berkeley and Onur Mutlu's ETH lectures give a conventionally disparaging view of VLIW without pointing out its successes.


Adding on, VLIW is only successful in AI/ML because GPUs are incapable of doing branching in a good way, let alone prediction. I would guess the same story applies to DSPs. If someone figures out how to stick a branch predictor in those pipelines, I'm guessing the VLIW nature of those platforms will disappear overnight.

The defining character of VLIW is to have the brilliant compiler software schedule dumb parallel hardware instructions statically and then not depend on power/transistor expensive dynamic branch prediction and OOO execution.

In a perfect VLIW world that would mean you don't spend any transistors or power on branch prediction or out of order instruction searches. Indeed the original VLIW paper [1] spends the vast majority of its paragraphs on solving the (hard) compiler instruction scheduling problem with trace scheduling, which is still used. The VLIW hardware itself is dead simple.

[1] https://safari.ethz.ch/architecture/fall2022/lib/exe/fetch.p...

So if VLIW fits the problem it has fantastic performance characteristics. If it doesn't fit, and far and away most don't, then VLIW is terrible. VLIW is very brittle.

I need to make a caveat about the Mill CPU which is a VLIW but I see I've written too much already.


VLIW, they’re still out there, doing mad stuff.. VLIW is cool and as a DSP programmer I love/live that time a VLIW comes my way :)

https://chipsandcheese.com/2023/10/04/qualcomms-hexagon-dsp-...


It's supported by some CPUs: the linux kernel has likely/unlikely macros to indicate whether a branch is expected to be taken or not in the common case (which uses GCC attributes which allow this to be propagated through to codegen). It's just that it's generally less effective: firstly because not all branches get annotated, some that are annotated are annotated incorrectly, and in a lot of cases the branching behaviour is data-dependent enough that you can't make an optimal static prediction at all.

1. "Explicitly signaling your intent" takes space and time. Decoding isn't free, and even if there were no other downsides you'd want to be careful with that change.

2. It's a bit historical. Early CPUs didn't have a RAM/CPU discrepancy and didn't need pipelining. Code wasn't written to account for pipelining. As CPUs got a little faster relative to RAM, you added a few prediction mechanisms so that most consumers and workloads could actually use your brand new 2x faster gigaterrawatthournewton CPUs without having to rewrite all their code. Iterate 10-20 generations, where most software has never been written to care about branch prediction and where modern languages don't even expose the concept, and you have the current status quo.


Because it can't actually see that return instruction until way too late.

Apple M1 CPU here loads and decodes 8 instructions per cycle, and the instruction cache access, plus the decoding takes multiple cycles. I don't know exactly how many cycles (more cycles allow for bigger instruction caches, and Apple gave the M1 a massive 192KB instruction cache), but the absolute minimum is 3 cycles.

And until the instruction is decoded, it has no idea that there is even a return instruction there.

3 cycles at 8 instructions per cycle is a full 24 instructions, so the M1 CPU will be a full 24 instructions past the return before it even knows the return exists. The loop body in this article is only 7 instructions long, so overrunning by an extra 24 instructions on every iteration of the loop would be a massive performance hit.

What actually happens is that the branch predictor remembers the fact that there is a return at that address, and on the very next cycle after starting the icache fetch of the return instruction, it pops the return address off its return address prediction stack and starts fetching code from there.

For short functions, the branch predictor might need to follow the call on one cycle, and the return on the next cycle. The call instruction hasn't even finished decoding yet, but the branch predictor has already executed both the call and the return.


GCC has the __builtin_expect builtin to give hints for branch prediction. C++20 added something similar with the [[likely]] and [[unlikely]] attributes.

These are usually for the compiler rather than the CPU e.g. I don't want my error-handling code inlined! (But the compiler can't work this out without a hint)
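
The usual pattern looks something like this (kernel-style macro names; the error check is just a stand-in):

  /* Kernel-style wrappers around the GCC/Clang builtin. The hint mostly
   * affects code layout: the compiler keeps the expected path fall-through
   * and hot, and moves the unlikely path (e.g. error handling) out of line. */
  #define likely(x)   __builtin_expect(!!(x), 1)
  #define unlikely(x) __builtin_expect(!!(x), 0)

  int process(int fd)
  {
      if (unlikely(fd < 0)) {
          /* cold path: kept out of the hot instruction stream */
          return -1;
      }
      /* hot path continues here */
      return 0;
  }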

Yes, there actually isn't a way to explicitly hint branch predictors on modern CPUs.

In this case, I'd consider BL (call) a fairly explicit signal of intent to RET?

Yep. For returns, the more important thing in the article, the "ret"[0] instruction behaves exactly identically to "br x30"[1], just with the hint that it's expected to match a "bl".

On x86 things are less pretty as call/ret push/pop the ip from the stack, with no non-matching-hinted versions, but having such would also just not be particularly useful as unpaired call/ret would result in the stack pointer continually drifting without manual fixup (at which point jmp is just clearly better).

[0]: https://developer.arm.com/documentation/dui0802/a/A64-Genera...

[1]: https://developer.arm.com/documentation/dui0802/a/A64-Genera...


On 32-bit x86, there is the trap of trying to use call/pop to get the absolute instruction pointer. It will work correctly, but it will mess up any call stack prediction and cause great performance penalties. Hence why compiler-generated EIP shims use call/mov/ret instead. (But this is not such a big issue in x86-64 with its RIP-relative addressing.)

Often, but not always, optimal scheduling is data dependent so it can't be statically computed at compile time.

We could solve a lot of problems if every user had to take EE180.

They want to be able to guess long, long, before actually executing the instructions.

I was working on an x86 clone back in the 90s (sadly never saw the light of day), one of the things I did early on was to write code to emulate various potential microarchitectural designs by capturing long streams of real world instructions - our real bugbear was one of the C++ compilers (used to compile a popular web browser from memory) that did virtual method dispatch by pushing an address onto the stack and returning to it - broke return predictors every time

I was confused as to why a compiler might do that.

Surely a jump to register would always be faster than writing it out to memory.

Then I remembered we are probably talking about 16bit code (for Win16), and that far pointers were a major thing. And turns out there is no "far jump to register" instruction in x86. The far jump instruction requires the 32bit pointer to be in memory. I guess some clever compiler engineer came up with a trick to use two pushes and a far return instruction to avoid needing to allocate a temporary 32bit variable on the stack.

But that raises a deeper question: Why is this compiler using this clever "far jump to value in register" trick for virtual method dispatch? The full 32bit far pointer should already be sitting in the vtable, presumably it just loaded it from there? It would be a lot faster to just use the proper far jump instruction directly on the value in the vtable rather than loading it and writing it back out to the stack.

I suspect this compiler was deficient, and missing the optimisation that would avoid the need for this "clever trick" most of the time.


Main reason is probably just that x86s have so very few registers.

Most C++s have a pointer to the current object (bp), virtual objects then contain a pointer to a vtable (to save space) so it's a double indirect load - also I suspect (it was a long time ago) that parameters were already being passed in registers, they would have had to leave a free reg to use as a temp


<tangent> IIRC, one of the ways the 6502 would allow jumps past its normal boundaries was to PUSH the out of bounds address you wanted to go to onto the stack, then do the asm "return from subroutine" opcode.

Perhaps that's old hat and not even allowed any more, but I thought it was pretty clever when I first read about it.


Everything I know, or need to know, about branch prediction, I learned from James Mickens.

https://scholar.harvard.edu/files/mickens/files/theslowwinte...


> Why do we need a special function return instruction? Functionally, BR LR would do the same job as RET. Using RET tells the processor that this is a function return.

I'm not sure this is a good design. Those two opcodes perform the same logic function but have different branch prediction hints.

Meanwhile, there are other cases where an existing instruction is reinterpreted with new semantics:

* On x86, `XCHG eax, eax` is `NOP`.

* On x86, `XOR reg, reg` is `MOV reg, 0` and breaks dependencies for the purposes of register renaming.

* On x86, various examples of macro-op fusion and micro-op fusion.

* On various RISC architectures, `ADD r1, r0, 1234` is `MOV r1, 1234`.

* On some RISC architectures, conditionally branching if r0 == 0 is the only way to express an unconditional branch.

I see no reason why `BR LR` can't be the standard function return instruction and involve the branch predictor.


In AArch64, `BR LR` and `RET` do not perform the same logic when FEAT_BTI (Branch Target Identification) is present.

Yeah. This effect is why, also, using a bare CALL without a RET ends up being a slow way to get the program counter on 32-bit x86, which lacks native PC-relative addressing: it confuses the CPU's tracking of calls and returns. Basically, you always want to balance them. No exceptions.

Wasn't an article posted a few days ago showing that CPUs special-case CALL+0 to not update the prediction stack, as it was commonly used to get the IP?


Okay, fair enough. It used to be a pessimization before CPUs recognized that pattern. :-)

ok, how many people here under 45 years old actually GET the "taunt happy fun" reference?

The title of this article is a reference to "Happy Fun Ball", my favorite SNL skit: https://www.youtube.com/watch?v=GmqeZl8OI2M

> const float f = *data++;

cursed code. people might say "oh that's normal idiomatic C", I don't care. I am glad Go does not allow code like this


The C language seems to really like reading/writing data and then modifying it, or modifying data and then reading/writing it, which can make for some dense one-liners full of side-effects. The pre- and post- increment and decrement operators, and the fact that assignments are expressions, seem to encourage it - especially in combination with the short-circuiting boolean operators.

In old C, they would write code like the function below. This is based on a function from The C Programming Language, and it makes the assumption that bufp points to a buffer that's at least one byte long.

  /* Read a line into a lim-sized buffer at bufp, returning strlen(bufp) */
  int getline(char *bufp, int lim) {
      int c;
      char *p = bufp;

      while (--lim > 0 && (c = getchar()) != EOF && (*p++ = c) != '\n')
          ;
      *p = '\0';
      return p - bufp;
  }

But nowadays, in modern C code, I think most people don't write such terse code. I guess it's because the only people who still code in C are those who have no choice, and so they prefer clarity over compactness.

I think the one liner was supposed to show how short and simple programming can be, but I think it ultimately has the opposite effect.

It's basically a showcase of terrible programming idioms that you should never use.


To be fair, that's not actually in the book. I said it was based on a function from that book, but below is the version they actually write in the book; I apologize for unintentionally misleading you. I wrote the above function myself (based on this one below) to showcase the type of style I'm talking about.

  /* getline: read a line into s, return length */
  int getline(char s[], int lim)
  {
      int c, i;
      for (i=0; i<lim-1 && (c=getchar())!=EOF && c!='\n'; ++i)
          s[i] = c;
      if (c == '\n') {
          s[i] = c;
          ++i;
      }
      s[i] = '\0';
      return i;
  }

They specifically don't use pointer arithmetic or the post-increment operator, because at that point in the book, they haven't introduced those yet.

However, that function I wrote isn't far off from the real Unix V7 version of fgets [0], which is a similar function to getline:

  char *
  fgets(s, n, iop)
  char *s;
  register FILE *iop;
  {
      register c;
      register char *cs;
  
      cs = s;
      while (--n>0 && (c = getc(iop))>=0) {
          *cs++ = c;
          if (c=='\n')
              break;
      }
      if (c<0 && cs==s)
          return(NULL);
      *cs++ = '\0';
      return(s);
  }

Like I said, the creators and first users of C seemed to really like expressions with side-effects, pointer arithmetic, and while-loops with empty bodies (though fgets doesn't have that). It's interesting how Go is different in this regard, considering Ken Thompson was a co-creator of both languages.

[0]: https://minnie.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/lib...



