Do not taunt happy fun branch predictor (mattkeeter.com)
314 points by mkeeter on Jan 25, 2023 | 171 comments



This is another good example of how our CPUs are in many ways specialized C processors. C is a structured programming language that uses functions, so our processors like functions. If you jump out of that paradigm, even if the assembly instructions nominally seem to allow it, you'll run more slowly. Even when it seems like what you're offering is a shortcut to the CPU.

This is neither praise nor criticism of the current CPU paradigm; it's just something you need to understand if you want the best performance out of our machines.

A different paradigm, like a concatenative language, might naively be compiled into code that looks more like what the author tried: jumping between implementations of the stack operators without them actually being "functions". One can imagine processors that would be "happier" with that, and more bothered by things that look like function returns. But those aren't the CPUs we have.


It's just a natural side effect of software and hardware forming a symbiosis and that they cannot really be separated. New software is (or should be!) built to run well on existing hardware, and new hardware is built to run existing software well. Besides, subroutine call instructions were a thing long before C became mainstream and manual assembly coding ruled supreme.


If you want to encode a series of indirect jumps between concatenated functions, you can do that already and the return address stack won’t get involved; you’ll simply get the normal branch predictors.

But generalized branch prediction is expensive (multiple entries in the BTB hashed by the previous program flow, and mispredicts if the next function in the chain changes); the point of an RAS is that it’s a very cheap way to keep some 100% predictable branches from using those general resources. Concatenation pretty much requires branches without unique characteristics, so no fast path shortcuts and everything is a bit slower.


> Concatenation pretty much requires branches without unique characteristics, so no fast path

A discussion the other day, about optimizing dispatch overhead in lightweight-word VMs where JIT isn't allowed, raised a brainstormy idea: have multiple copies of some words, with the compiler juggling them with awareness of the branch-predictor state of each copy's tail-call-to-next-word dispatch. Sort of a dynamic version of reducing dispatch by identifying hot phrases and creating new multi-word composite words for them.


Would that be feasible on something like iOS that doesn't allow third-party JIT?

Could this allow for translating WASM or emulator workloads to something that runs fast with those restrictions?


Re iOS, yes. But runs fast? The words have to be small enough that the branch misprediction penalty (order-10 cycles) is significant. So for example, if a word is a single 3 cycle op then the dispatch matters, but with a dependency chain just a few ops long, the potential win drops under 50%. And having multiple copies of words will increase pressure on the instruction cache, so there will be a tradeoff, which may or may not be a win.

So... perhaps this could be a way to dynamically optimize loop interiors? Or... perhaps to permit low-cost register shuffling words to glue together larger words with varied hard-wired register expectations? Or... insert creative idea here?

There are blog posts around where someone writes a simple vm and incrementally optimizes it. But I don't recall seeing one yet with an iOS-ish focus of no-JIT, lots of registers, modern branch prediction... and so throws creative stuff against the wall and measures if anything sticks. It might be interesting to take such a blog post/series and do a followup?
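
To make the idea concrete, here's a rough, untested Rust sketch (all names hypothetical) of direct-threaded dispatch where each word ends by dispatching to the next one itself, and where a hot word exists in two copies so each copy's dispatch branch gets its own predictor history. Rust doesn't guarantee tail calls, so treat it purely as an illustration of the shape:

    type Word = fn(&mut Vm);

    struct Vm {
        code: Vec<Word>,
        pc: usize,
        stack: Vec<i64>,
    }

    // Inlined into every handler, so each handler carries its own indirect call site.
    #[inline(always)]
    fn next(vm: &mut Vm) {
        let w = vm.code[vm.pc];
        vm.pc += 1;
        w(vm);
    }

    fn push1(vm: &mut Vm) {
        vm.stack.push(1);
        next(vm);
    }

    // Two behaviourally identical copies of "add"; a compiler aware of predictor
    // state could alternate them so back-to-back adds don't share one dispatch site.
    fn add_a(vm: &mut Vm) {
        let (b, a) = (vm.stack.pop().unwrap(), vm.stack.pop().unwrap());
        vm.stack.push(a + b);
        next(vm);
    }
    fn add_b(vm: &mut Vm) {
        let (b, a) = (vm.stack.pop().unwrap(), vm.stack.pop().unwrap());
        vm.stack.push(a + b);
        next(vm);
    }

    fn halt(_vm: &mut Vm) {} // stop recursing

    fn main() {
        let mut vm = Vm {
            code: vec![push1 as Word, push1, add_a, push1, add_b, halt],
            pc: 0,
            stack: Vec::new(),
        };
        next(&mut vm);
        println!("{:?}", vm.stack); // [3]
    }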


Author was not jumping out of the paradigm here, they were deliberately misusing constructs specialised for the paradigm (br/ret). That’s like saying the toolcase is a specialised screw processor because you’re trying to drive nails using a screwdriver and it does not go well.

And C is hardly the first or only procedural langage.


No, but C is a better model of most processors' assembly than most other procedural languages.


Only when we are talking about PDP-11.

Thankfully nowadays we have Compiler Explorer supporting almost every flavour of mainstream compiled languages to dispel myths.


This PDP-11 vs C thing always seems to come down to the PDP-11's auto-increment addressing modes vs C's ++ and --. Are there any other PDP-11 features baked into C? Because this increment/decrement thing is hardly unique (the 68k CPUs had that too, and every CPU with a stack pointer also does it; some CPUs just made it a general addressing mode accessible in the instruction set).

While today's compilers do a lot more transformations, which can result in surprising output, it's still fairly straightforward to map C statements to CPU instructions.


Autoincrement and autodecrement didn't come from the PDP-11

https://web.archive.org/web/19980220175804/http://cm.bell-la...

> People often guess that they were created to use the auto-increment and auto-decrement address modes provided by the DEC PDP-11 on which C and Unix first became popular. This is historically impossible, since there was no PDP-11 when B was developed. The PDP-7, however, did have a few `auto-increment' memory cells, with the property that an indirect memory reference through them incremented the cell. This feature probably suggested such operators to Thompson; the generalization to make them both prefix and postfix was his own. Indeed, the auto-increment cells were not used directly in implementation of the operators, and a stronger motivation for the innovation was probably his observation that the translation of ++x was smaller than that of x=x+1.

--- Dennis Ritchie


The point is that C maps quite poorly onto the ISAs of most processors other than a PDP-11. There are various features of even 8086 that are not easily available in C (such as overflow detection) and huge numbers of instructions in newer ISAs that don't have any direct mapping to C code.

Additionally, modern processors have execution models that are entirely different from the C machine model, but this is often hidden even from the ISA. For example, modern x86-64 cores execute code by splitting up instructions into micro-instructions, determining data dependencies between them, then queueing them up on any of the available execution units (of various kinds) in parallel - which is about as far from the C model of executing instructions 1 by 1 in a series as it is from Haskell.


Well from the pov of machine or assembly code, C is without a doubt a high level language.

But at the same time it's the lowest-level high-level language (that's popular at least).

I'm also not aware of any high-level programming languages that allow access to the CPU's status flags (for instance to check for overflow).
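
(The closest you usually get is indirect: e.g. Rust's checked/overflowing arithmetic hands you the overflow bit as a value, and compilers typically lower it to a test of the flags. A tiny sketch, not actual flag access:)

    fn main() {
        // overflowing_add returns the overflow bit directly; compilers typically
        // lower this to a check of the CPU flags (e.g. adds + an overflow-condition
        // test on aarch64), even though the flag register itself is never exposed.
        let (sum, overflowed) = i32::MAX.overflowing_add(1);
        println!("sum = {sum}, overflowed = {overflowed}"); // wraps, overflowed = true
        assert_eq!(i32::MAX.checked_add(1), None); // or get None on overflow instead
    }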

(there are a couple of interesting 'mid-level' languages for 8-bit processors though, like Millfork: https://github.com/KarolS/millfork)

I'd be all over a proper mid-level language that's closer to modern CPUs than C and has less 'optimization magic'. But this idea doesn't seem to be very popular amongst the compiler writer crowd.


How do you map C statements to AVX512 instructions?


...by doing the only sensible thing and using intrinsics. Auto-vectorisation is exactly where compiler optimisations stop being useful.
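
For the article's f32 sum, an intrinsics version in Rust might look roughly like the sketch below (aarch64/NEON only, untested; like the hand-written assembly, it reassociates the additions, so the result can differ slightly from a strict left-to-right sum):

    #[cfg(target_arch = "aarch64")]
    fn sum_neon(xs: &[f32]) -> f32 {
        use std::arch::aarch64::*;
        let chunks = xs.chunks_exact(4);
        let tail = chunks.remainder();
        // accumulate four partial sums in one vector register
        let mut acc = unsafe { vdupq_n_f32(0.0) };
        for c in chunks {
            unsafe { acc = vaddq_f32(acc, vld1q_f32(c.as_ptr())) };
        }
        // horizontal add of the four lanes, then mop up the tail
        let mut sum = unsafe { vaddvq_f32(acc) };
        for &x in tail {
            sum += x;
        }
        sum
    }

    fn main() {
        #[cfg(target_arch = "aarch64")]
        println!("{}", sum_neon(&vec![1.0f32; 1000])); // 1000
    }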


Hence, C is only "portable assembly" for processors similar to a PDP-11, not for modern processors with modern ISAs.


Intrinsics are C language extensions outside the standard, but they are still an integral part of the "C language" that a specific compiler implements. In practice it really doesn't make much sense to separate the "standard C parts" and the "non-standard language extension" of a specific compiler.


On which page of ISO C can I find the intrinsics specification?


Not sure what your point is. There are no instruction-level parts to the standard. The C standard doesn't dictate how individual CPUs should operate.


The point is in reference to flohofwoe's earlier comment which stated "it's still fairly straightforward to map C statements to CPU instructions.". By pointing out the existence of vectorized instructions, which are entirely absent from the C standard, pjmlp was making an argument by counter-example. That while you can map C statements to CPU instructions, most compilers typically don't apply such a 1-1 mapping.

The seeming non-sequitur referring to the ISO C standard was because flohofwoe responded to the rhetorical question as if it were a normal question, leading pjmlp to restate the rhetorical question in a stronger form.


The point is that C is very far away from being a "portable assembly" for any kind of modern processor. For the PDP-11 and other processors from that era, there was indeed some pretty simple 1:1 mapping between C instructions and their assembly. This has not been true for decades.


The C standard only matters for compiler writers. What matters for compiler users is what specific compilers actually implement (and it's pretty much impossible to write a non-trivial C program which is 100% standard compliant, the Windows headers alone are full of MSVC idiosyncrasies).


> The C standard only matters for compiler writers.

There are two types of projects: those that are so big that any compiler writer will test against them to guard against Hyrum's Law, and those that are not. The former have no need for the C standard, as compiler writers would be loath to break compatibility with them. The latter have a strong need for the C standard, as anything outside the C standard may be broken unknowingly by compiler writers.

> What matters for compiler users is what specific compilers actually implement

What matters for me is the likelihood that a compiler upgrade will break my code. If I stay within the standard, then any such breakage is a bug. If I stray outside the standard, then all I know is that this specific compiler, in this specific context, on this specific machine, for this specific compilation, has produced some machine code.

My home projects are not so large that gcc/clang/msvc authors would test their upgrades against my project, so the bounds of the standard are the limits that I can trust.

> the Windows headers alone are full of MSVC idiosyncrasies

I'd put the Windows headers in the category of sufficiently-large projects that compiler writers would test against them, rather than the other way around.


> My home projects are not so large that gcc/clang/msvc authors would test their upgrades against my project, so the bounds of the standard are the limits that I can trust.

You'll need to check your code against each new compiler release you want to support anyway, because each release adds a shitton of new warnings which are also not covered by the C standard but should be fixed nonetheless.


Absolutely, and it's really fantastic seeing how many bugs get caught from the new warnings. However, the effort required to check warnings is far, far less than the effort required to validate the assembly output, and so I think the point still stands.


Unless you know a widely used high-level language that is a better model of newer assembly, then C would still remain “the best”, no?

“Not as good lately”, doesn’t change that status without another language superseding it, right?


Not sure how that is relevant. The GP asserts (and bemoans) the reverse cause-and-effect.


> C is a structured programming language that uses functions, so our processors like functions.

The main alternative paradigms that are ever mooted are LISP machines, which elevate the importance of functions, and maybe stack-based programming languages like Forth, which emphasize the feature of functions seen here (that they keep popping things off the stack).

It's hard to argue that C has led us down to a local minimum because it's just too functional.


This has more to do with the ISA than C I'd assume. C was built as an "easy assembler". Furthermore, the computer doesn't really care about structure (in the structured programming sense).

In this case the implementation of the ISA stores information on each 'ret' depending on some 'bl' that came before. One can imagine that a different optimization technique which actually leads to a speedup exists. Without a branch-predictor, the program that Matt wrote might've been faster.

Imo, this has nothing to do with paradigms, but how control-flow interacts with the system. This interaction is different between different implementations of an architecture.

Code written for cache locality, and paradigms that work well with it, for example, only became a "winner" after caches were widely implemented. Before that, the cost of sequential array access and random array access was identical. With caches, a linear search for an element can be faster than a binary search, even though it requires a lot more memory accesses. Thus, the optimal implementation for finding an element in an array is now dependent on size as well. (I.e. after an ordered array reaches a certain size, binary search should become faster on average.)
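
To put rough code to that last point (the crossover size is machine-dependent, so treat this as purely illustrative):

    // Same query two ways: the linear scan touches memory sequentially
    // (prefetch-friendly, branch-predictable), the binary search hops around.
    // For small sorted arrays the scan often wins despite more comparisons.
    fn main() {
        let xs: Vec<u32> = (0..64).map(|i| i * 2).collect();
        let linear = xs.iter().position(|&x| x == 40);
        let binary = xs.binary_search(&40).ok();
        assert_eq!(linear, binary); // both Some(20)
        println!("{:?}", linear);
    }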


The point is that the ISA is assuming that BL and RET opcodes come in pairs - which is an assumption that does reflect what structured code is typically compiled down to. Going by the semantics of the opcodes themselves, there's no reason why RET should be treated differently from a simple indirect jump here.


> there's no reason why RET should be treated differently from a simple indirect jump here.

There’s its very existence. If you’re looking for a general purpose indirect jump, use that.


As far as I understand, the ISA explicitly intends for BL and RET to come in pairs. The only point of RET is to jump to the address last stored by BL. If you don't need this behavior, there's no reason to use RET - as the article itself shows, B x30 does the exact same job, and doesn't come with the extra assumptions.


That's my point - the ISA is designed around the notion that function calls (a structured programming concept!) are a thing that BL will be used for, and not arbitrary jumps where you happen to have some clever use of the link address.

And while it's not something that the example code in the article demonstrates, based on the description of how it confuses the branch predictor, it sounds like having multiple BLs without a matching RET for each would also be a problem.


No, the ISA overall is not designed around that notion. The designers merely recognized that this is a common pattern, and added 1 or 2 instructions specifically for it (it's not clear to me whether using BL for other purposes would have the same detrimental effects as using RET does in the article). This would likely be a very good addition even in a purely hypothetical world where it wasn't a common pattern in 99.9% of languages.

Also, I'm not sure why you believe that there is any kind of language where this pattern isn't common. Even something like Forth relies heavily on jumping into subroutines and back.


Subroutines predate structured programming.


I was using the terminology as OP did, and yes, it is not quite right. But the point as I understood it was that branch predictors optimize around specific "structured" usage patterns of opcodes - in this case, the particular way to use BL/RET to implement function calls typical of C - to the point where any other use is too slow to be practical for anything.


what's the difference between a subroutine and a proper C function??

they're so close to what they are in principle.

IMO, C functions are the answer to the problem exposed in the famous quote "goto considered harmful".

assembly is all about GOTOs.. C provides a *structured* way to deal with them without getting bored to death (it all becomes way too much too soon)

but I think (for reasons that I wish I could get into) that in the end, the goto-based assembly code can go up to multiplication, but then with C and their computer-functions one can go beyond exponentiation.

What is there beyond, I can only name by reference but wouldn't say I understand it... which is tetration.

so riddle me this: why is 2[op]2=4 for all these operations: addition, multiplication, exponentiation, tetration, pentation!?

imma go keep being insane. thxbai


Early subroutines were really just a branch/return - no stack frames and thus no proper argument passing or recursion or locals. If you've seen a BASIC dialect with GOSUB in it, that's pretty much it. It's a bit more structured than simple jumps, but still much lower level than C functions.

Structured programming is traditionally considered subroutines + conditionals + loops. In its purest form, it also means no branches other than the ones implicit in this list (i.e. not only no goto, but also no return, no break/continue etc).


> Going by the semantics of the opcodes themselves, there's no reason why RET should be treated differently from a simple indirect jump here.

RET is documented as a subroutine return hint in the official ARM architecture reference manual. That's the only reason it has a distinct opcode from BR.

BL is also documented as a subroutine call hint.


> Going by the semantics of the opcodes themselves,

This is RISC wishful thinking that makes it sound like BL is not CALL. Maybe in ARM 1 it wasn't, it is now.


I agree, because this semantic difference between br x30 and ret doesn't exist in many other RISCs which also mostly run C-like languages. Power just has blr, and MIPS jr $ra, for example. This feels more like a well-intentioned footgun in the ISA.


The potential population for that footgun is compiler writers, who should safely fall into 'know what they are doing' territory.


> C is a structured programming language that uses functions, so our processors like functions.

So were most of the structured programming languages from the 1960s.

PL/I, ESPOL, NEWP, ALGOL, Pascal, ...


General-purpose CPUs began as universal Turing machines for running any sort of mathematical calculation automatically. This was done in binary on punch cards and/or with panel cable settings. Then came textual assemblers to make that easier. Imperative procedural languages were then grafted on to simplify programming, rather than writing in any sort of assembly language.

Most processors have accumulated functionality beyond minimal instructions for the simplification and acceleration of operating system services, hardware interfacing, cryptography, vector & matrix integer and floating-point math, arbitrary precision math, string processing, virtualization (IO and CPU), secure computing, and virtual memory management; just to name a few. :)

A future era of green field development closer to the technological singularity will look across the software-hardware interface boundaries to optimize both silicon (or superconductors) and compilers to generate fast, small, and/or power efficient designs and code without as many limitations. Generative AI and holistic algorithms with enormous computing power will make this possible. It's almost possible now, it's just a few leaps beyond what EDA and compilers are already doing.


> C is a structured programming language that uses functions, so our processors like functions.

An interesting take on this is the (vaporware) Mill CPU, in which the machine code defines EBBs (Extended Basic Blocks), which can only be entered at the top. You cannot express jumping into the middle of a block from the outside, in their ISA.

https://en.wikipedia.org/wiki/Extended_basic_block

https://millcomputing.com/topic/introduction-to-the-mill-cpu...


> More specifically, the branch predictor probably keeps an internal stack of function return addresses, which is pushed to whenever a bl is executed. When the branch predictor sees a ret coming down the pipeline, it assumes that you're returning to the address associated with the most recent bl (and begins prefetching / speculative execution / whatever), then pops that top address from its internal stack.

There's no need for "probably" here. The micro-architectural mechanism is known as a return stack buffer[1] and is generally separate from the branch predictor unit, though the processor may make use of indirect branch prediction entries for returns as well.

[1] It is, indeed, a tiny little stack of return addresses and indeed, the article hit performance issues by misaligning it. The (Intel chips') RSB is behind the Retbleed vulnerabilities.


Well, of course, the Return Address Stack (RAS) predictor maintains its own call stack and you need to understand how it works. However, there's a subtler way to break it: recurse too deeply. The RAS only has a fixed, small, and implementation-dependent length. If you use deep recursion with non-trivial control flow (in particular multiple call sites), then the RAS will start missing once you return from beyond that limit.

Another consequence of the RAS is that co-routine switching is more expensive than it might appear at first. RISC-V has encoding hints to mark calls (jal) / returns that are actually co-routine switches, but the full cost can't be avoided.


You can mitigate the cost by not 'call'-ing into your coroutine switch function but inlining the code into the surrounding coroutine. As a bonus you get a bit better branch prediction on your yield, because distinct yields will share less state.

Of course there is always going to be a penalty for stackful coroutines that yield deep into a callstack.


Unfortunately, this is very difficult to do above the assembly level because it requires a custom calling convention that doesn’t yet seem to be supported by any systems programming language compiler. You have to use an assembler macro, or pipe the assembler output through sed, to patch the call and ret instructions:

https://stackoverflow.com/questions/43894511


GCC inline assembly can be coerced to do the right thing.


It’s not great, though. Modern C compilers will generally fight you if you try to subvert structured programming.


Interestingly, Matt has invented a variant on the retpoline [1], which _intentionally_ missteers the branch predictor to prevent various speculative attacks. (Invented by my former Google manager.) It's pretty cool how much simpler a retpoline would be in aarch64, since we have explicit control over the link register rather than having to play stupid games with stacks.

(Real retpolines have a little more magic, naturally.)

[1]https://stackoverflow.com/questions/48089426/what-is-a-retpo...


> The SIMD code does come with one asterisk, though: because floating-point addition is not associative, and it performs the summation in a different order, it may not get the same result as straight-line code. In retrospect, this is likely why the compiler doesn't generate SIMD instructions to compute the sum!

What if you set -funsafe-math-optimizations, which allows "optimizations that allow arbitrary reassociations and transformations with no accuracy guarantees"?


Based on the "We can get faster still by trusting the compiler" part of the article, the author's using Rust. It doesn't have a global flag like `-funsafe-math-optimizations` or `-ffast-math` so the change would have to be a bit more involved. They'd have to change their use of `+`, `*` etc operators on f32 to `std::intrinsics::fadd_fast`, `std::intrinsics::fmul_fast`, etc functions.

So, putting it into https://rust.godbolt.org/z/jT6Mb1K13 , it seems to indeed be using the SIMD instructions.
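
Something along those lines, as a sketch (nightly-only, and the intrinsic is unsafe because the operands must be finite; the exact intrinsic paths have moved around between toolchain versions):

    #![feature(core_intrinsics)]

    // Letting LLVM reassociate (and thus vectorize) the sum via the fast-math add
    // intrinsic. Unsafe: the operands must be finite, or the behaviour is undefined.
    fn sum_fast(xs: &[f32]) -> f32 {
        xs.iter()
            .fold(0.0f32, |acc, &x| unsafe { std::intrinsics::fadd_fast(acc, x) })
    }

    fn main() {
        let xs = vec![0.5f32; 1024];
        println!("{}", sum_fast(&xs)); // 512
    }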


They're asking for sum() on a slice of f32s. The sum() function actually works via a trait for this specific purpose, Sum, so you could go like this...

New type wrapper for f32 called something like FastFloat, marked #[repr(transparent)], and if necessary (not sure) have the compiler promise you you're getting the same in-memory representation as an actual f32.

Implement Sum over FastFloat by having it use the faster SIMD intrinsics for this work to give you an answer, accepting the potential loss of accuracy.

Now, unsafely transmute the f32 slice into a FastFloat slice (in principle this is zero instructions, it just satisfies the type checking) and ordinary sum() goes real fast because it's now Sum on the slice of FastFloats.


If you want to go the newtype + Sum impl route, you don't have to make it `#[repr(transparent)]` or transmute the slice. You can just `impl Sum<FastFloat> for f32` and do `f.iter().copied().map(FastFloat).sum()`

https://rust.godbolt.org/z/b9s3dna6r
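
Spelled out, with the actual fast reduction elided to a plain fold just to show the trait plumbing:

    #[derive(Copy, Clone)]
    struct FastFloat(f32);

    impl std::iter::Sum<FastFloat> for f32 {
        fn sum<I: Iterator<Item = FastFloat>>(iter: I) -> f32 {
            // a real impl would do the reassociated / SIMD reduction here
            iter.fold(0.0, |acc, FastFloat(x)| acc + x)
        }
    }

    fn main() {
        let xs = vec![1.0f32, 2.0, 3.0];
        let s: f32 = xs.iter().copied().map(FastFloat).sum();
        println!("{s}"); // 6
    }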


Oh, I didn't think of that, clever.

ETA: The attraction of a newtype plus trait impl is that it is re-usable. You could imagine (particularly if it was stable, which your approach isn't yet) packaging up several speed-ups like this in a crate, enabling people to get faster arithmetic where they can afford the accuracy trade-off, without needing to know anything about SIMD and without (like the C or C++ compiler flags) affecting unrelated code where accuracy may be critical.



Nice, although I notice it doesn't implement Sum or Product :D


It's a bad idea to scramble the order of floating point operations unless you can tolerate extreme inaccuracy, and in some applications accurate FP results don't matter, but the non-associativity of FP isn't just a technicality: you can lose all of your significant digits if the order of operations in well-written scientific code is changed.


But if we aren't assuming the original order was particularly nice, we are simply substituting one random (or arbitrary) order for another. No reason to expect it to be any worse or better.


The key is that if you do this, different optimization levels produce different numerical results, and also some problems in the code become unfixable; you can't group computations according to their expected value ranges, because the compiler will ungroup them again, incorrectly assuming that FP addition and multiplication are associative. Certainly for some applications it's fine, but a test of whether it's fine would be, for example, that it works just as well with single precision as with double precision, and that even 16-bit floats would be fine, for example weights in an NN.
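
The textbook example of such a grouping is compensated (Kahan) summation, which only works because FP addition isn't associative; it's exactly the kind of code a reassociating fast-math flag silently breaks. A quick sketch:

    // Kahan summation: `c` recovers the low-order bits each add would lose.
    // A fast-math compiler may simplify (t - sum) - y to 0 and destroy it.
    fn kahan_sum(xs: &[f32]) -> f32 {
        let (mut sum, mut c) = (0.0f32, 0.0f32);
        for &x in xs {
            let y = x - c;
            let t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        sum
    }

    fn main() {
        // 1.0 followed by ten million values too small to register individually
        let mut xs = vec![1.0f32];
        xs.extend(std::iter::repeat(1e-8f32).take(10_000_000));
        let naive: f32 = xs.iter().sum();
        println!("naive = {naive}, kahan = {}", kahan_sum(&xs)); // ~1.0 vs ~1.1
    }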


You could just turn gcc to 11 and use -Ofast


Both clang and gcc vectorize if you ask them nicely: https://gcc.godbolt.org/z/xvjY8P4cM


Minor erratum: Floating point addition actually is commutative; it's in fact non-associative.


with some significant exceptions, such as NaNs.


Can you think of some `x` where `x + NaN` is not identical to `NaN + x`? I can't.


It depends on your model of NaN. If you think there is only one NaN value, then the two values return the same result. If there are multiple distinct NaN values (i.e., you distinguish between NaNs with different payloads), then the resulting payload of an arithmetic operation is... not well-specified.

Most hardware architectures (x86 and ARM fall into this category) pick the rule that the payload is that of the Nth operand (usually the first, sometimes the second) when multiple inputs are NaN. I believe there's some hardware where the lesser of the payloads (converted to an integer) is picked instead. RISC-V dispenses with NaN payload propagation entirely. There is theoretically the ability for hardware to generate a new unique NaN with the address of the instruction in the payload, but I don't believe any hardware actually does it.

Most programming languages do not generally model NaN with greater fidelity than "there is a NaN value" or maybe "there is a distinction between qNaN and sNaN."


If x is another NaN with a different payload, the result is likely a NaN with a payload corresponding to the left side of the addition.


That is correct, here is a playground link which shows that result: https://play.rust-lang.org/?version=stable&mode=debug&editio...
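
For reference, the experiment is roughly the following (payload propagation is architecture-dependent, as the replies note, so the exact bits may differ on other hardware):

    fn main() {
        use std::hint::black_box;
        // two quiet NaNs with different payloads (f32 quiet-NaN base is 0x7FC00000)
        let a = black_box(f32::from_bits(0x7FC0_0001));
        let b = black_box(f32::from_bits(0x7FC0_0002));
        // on typical x86-64 / aarch64 hardware the left operand's payload survives
        println!("{:#010x}", (a + b).to_bits()); // usually 0x7fc00001
        println!("{:#010x}", (b + a).to_bits()); // usually 0x7fc00002
    }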


This varies tremendously across architectures and uarches and compilers. The result will be a quiet NaN, and you can’t say anything beyond that.


you mean, like 1 + NaN = NaN and NaN + 1 = NaN, but NaN != NaN? (I'm not a numerical expert, just repeating what others have told me)


Yes. A NaN in IEEE754 has all 1's in the exponent, and then the high bit of the mantissa determines whether it's quiet or signalling, but the rest of the mantissa is/can be a “payload”.


Sign bit determines signalling


Please double check your facts before disagreeing with somebody so abruptly.

Sign bit is NOT the signalling/quiet bit. Bit 51 (edit or bit 50 - damn ***** IEEE for not publishing important standards for free public access) is according to the first result I looked at: https://craftinginterpreters.com/optimization.html

Edit 2 from IEEE 754 (2008 version):

  6.2.1 NaN encodings in binary formats
  This subclause further specifies the encodings of NaNs as bit strings when they are the results of operations. When encoded, all NaNs have a sign bit and a pattern of bits necessary to identify the encoding as a NaN and which determines its kind (sNaN vs. qNaN). The remaining bits, which are in the trailing significand field, encode the payload, which might be diagnostic information (see above).
  All binary NaN bit strings have all the bits of the biased exponent field E set to 1 (see 3.4). A quiet NaN bit string should be encoded with the first bit (d1) of the trailing significand field T being 1. A signaling NaN bit string should be encoded with the first bit of the trailing significand field being 0. If the first bit of the trailing significand field is 0, some other bit of the trailing significand field must be non-zero to distinguish the NaN from infinity. In the preferred encoding just described, a signaling NaN shall be quieted by setting d1 to 1, leaving the remaining bits of T unchanged.

  6.3 The sign bit
  When either an input or result is NaN, this standard does not interpret the sign of a NaN. Note, however, that operations on bit strings—copy, negate, abs, copySign—specify the sign bit of a NaN result, sometimes based upon the sign bit of a NaN operand. The logical predicate totalOrder is also affected by the sign bit of a NaN operand. For all other operations, this standard does not specify the sign bit of a NaN result, even when there is only one input NaN, or when the NaN is produced from an invalid operation.
  When neither the inputs nor result are NaN, the sign of a product or quotient is the exclusive OR of the operands’ signs; the sign of a sum, or of a difference x−y regarded as a sum x+(−y), differs from at most one of the addends’ signs; and the sign of the result of conversions, the quantize operation, the roundTo- Integral operations, and the roundToIntegralExact (see 5.3.1) is the sign of the first or only operand. These rules shall apply even when operands or results are zero or infinite.
  When the sum of two operands with opposite signs (or the difference of two operands with like signs) is exactly zero, the sign of that sum (or difference) shall be +0 in all rounding-direction attributes except roundTowardNegative; under that attribute, the sign of an exact zero sum (or difference) shall be −0. However, x + x = x − (−x) retains the same sign as x even when x is zero.
  When (a×b)+c is exactly zero, the sign of fusedMultiplyAdd(a, b, c) shall be determined by the rules above for a sum of operands. When the exact result of (a × b) + c is non-zero yet the result of fusedMultiplyAdd is zero because of rounding, the zero result takes the sign of the exact result.
  Except that squareRoot(−0) shall be −0, every numeric squareRoot result shall have a positive sign.
I.e. you are definitely wrong. The sign bit can be + or - for NaN (presumably a side-effect of the encoding for +/-Infinity ). And then that leads to a bunch of arse (section 6.3) because the spec needs to decide what happens to the sign bit in a bunch of different situations. PS: fucking infinity. Infinity should have been NaN. Infinity ≠ Infinity, except in in the egghead-land IEEE (side note: egghead is a compliment IMHO). Mind you, easy to see mistakes in retrospect, but corner cases are shit in programming. I do like NaN, although reading comments here, and the IEEE spec, forces me to learn how little I now about NaN encodings. Oh, and any NaN should equal any other NaN. Mathematically obviously not, but logically yes and IEEE is for programming. NaN is already defined as a nonsense, so at least keep the nonsense consistent. Changing if = to if ≠ should not introduce subtle logic bugs.

Ranting edit #755: and while we are at it, -0 is an abomination in the eyes of the Great Architect in the Matrix - it should never have been allowed - perhaps -0 should have been NaN with signalling bits in the exponent (even though that would prevent some language virtual machine optimisations where 53 bits of NaN get used to pack other information, but the win would be compelling because reducing bugs due special cases is huge IMHO). How many developers understand IEEE corner cases: fuck all in my long experience.


The source I had in my head as I was replying was https://posithub.org/docs/Posits4.pdf pages 31/32 which implies that the sign bit is responsible for signalling-ness. Neither of these are a primary source however.

A random stackoverflow answer without sources seems to confirm your pov. Now I don't know what to believe. How is the sign bit used in NaNs? Would they really waste that bit?


Hi, former IEEE 754 committee member here: languages and architectures are allowed to choose how they encode signalingness, but they cannot use the sign bit to do it. The most common choice is to use the high-order bit of the significand field, but most choices you might make are represented by some architecture.


And I thought signaling NaNs were bad enough already…so you’re saying there’s no portable way of testing whether a bit pattern is going to signal if you operate on it without actually doing the operation? This is incredibly cursed :/


There’s the 754 (2008) isSignaling operation, bound to the issignaling macro in C, which makes it the platform owner’s problem instead of yours.


If only the platform owner provided this operation…


Awesome! So the important part of section 6.2.1 (2008) is that the “should” is not a “must”?

  “A quiet NaN bit string should be encoded with the first bit (d1) of the trailing significand field T being 1. A signaling NaN bit string should be encoded with the first bit of the trailing significand field being 0.“ 
Or is it that some implementations before 2008 used other bits?

PS: I might be complaining, but I do sincerely thank you for your hard work. I certainly would not want to be on a standards committee, so I sincerely appreciate the efforts of the committed who fight so hard to make things better. Your comment just goes to show how many corner cases there are to the corner cases!!


It’s a “should”, hence recommended practice. Things required for conformance are “shall”.


Traditionally, PPC used a different encoding of qNaN than, well, everybody else. PPC itself would later include a hardware mode that aligned to the more universal interpretation of qNaN.


IIRC MIPS does the reverse of x86 and ARM too.


Sorry, you're right, it's MIPS that had the two encodings--PPC had the double-double type for long double that makes things "interesting." (It seems all architectures have their own unique floating-point weirdness).


> The source I had in my head as I was replying was https://posithub.org/docs/Posits4.pdf pages 31/32 which implies that the sign bit is responsible for signalling-ness.

Citing the person who came up with posits for anything related to IEEE 754 is a pretty poor decision; he is prone to a lot of misunderstanding here. There's a lot of decent explanations of IEEE 754 (including Wikipedia), so you shouldn't need to resort to the explanations of someone whose major shtick is telling people IEEE 754 sucks.

> How is the sign bit used in NaNs? Would they really waste that bit?

Sign bits have no meaning for NaNs, although some C libraries (e.g., glibc) will distinguish between a NaN with the sign bit set and not set when printing floating-point numbers, which gives the illusion that it matters more than any other bit. Using the sign bit for qNaN-versus-sNaN is a bad idea from a design perspective, since there are a few FP operations explicitly specified to modify only the sign bit (fneg, fabs), and this would mean that you get regular operations that could generate sNaNs, which defeats the design goal of sNaN.


Sign bits of NaN is one of the dumbest parts of the 754 standard. 6.3 specifies that "the standard does not interpret the sign bit of a NaN" and then the very next sentence lists four cases where the sign bit of NaN has semantic meaning.


really all of the NaN behavior is really dumb.

"we gave you enough different NaNs to uniquely represent every grain of sand on the planet uniquely"

how does math work on them?

"we have no idea"

how do we tell what type of nan we have?

"we have no idea"

what should these bajillion values represent?

"I don't know, probably something"

the fact that they didn't just make there be a single NaN and make it equal to itself (and while you're at it make it so the real numbers have a total ordering)


> how does math work on them?

For almost all operations, if any input is a NaN, the result is a NaN. It's pretty well-specified by IEEE 754, the only thing that isn't is what the payload is, but there's no general expectation that the payload is preserved through computation.

> how do we tell what type of nan we have?

`getpayload()`

> what should these bajillion values represent?

Diagnostic information that tells you which operation caused the NaN in the first place. (This is Kahan's standard argument for why having many NaNs can be useful.)

> the fact that they didn't just make there be a single NaN

What's the alternative? You'd either have to have lots of illegal floating point values, or you'd have to make the number of finite floating-point numbers with the highest exponent value slightly smaller than for all other exponent values, which complicates a good deal of numerical analysis.

> make it equal to itself

Yeah, this is the big, nasty mistake of IEEE 754.

> and while you're at it make it so the real numbers have a total ordering

There is a totalOrder predicate specified by IEEE 754. Representing floating-point comparisons as a partial order is arguably better than giving it a total order, but the lack of reflexivity of equals makes the existing predicates not a partial order.


But in the end we found a great use case for all those NaN values because NaN-boxing is awesome.


except that it breaks on processors that propagate nan bits differently (e.g. m1)


Thanks, fixed!


It's not commutative; -0.0 + 0.0 = -0.0 but 0.0 + -0.0 = 0.0.


Um, IEEE 754 specifies that -0.0 + 0.0 is 0.0, not -0.0. It's still commutative.


Yes, it's commutative. But it is associative.

Make a number with 3 bits of mantissa, and go add a thousand repetitions of 1. You can get hundreds of different results depending on the order you add them up.


> But it is associative.

You're describing a non-associative operation.



Oops. Yes, you are right.


While I was in undergrad I toyed around with abusing the branch predictor on a few different machines, compiling something like the following with optimizations off - it performs an identical computation regardless of branch outcome:

    void branchLoop(unsigned int condition, unsigned int &sum)
    {
        // put something suitably large here
        unsigned int loopCount = 0x0fffffff;

        unsigned int i;

        // compile with -O0 or this gets optimized away
        for (i = 0; i < loopCount; i++)
            if ((i & condition) == 0)
                sum++;
            else
                sum++;
    }
The Core Duo on my Thinkpad T60 had some very distinct slowdowns on certain bit patterns, which were not repeatable on the handful of other CPUs I had access to at the time. I haven't tried this with more modern CPUs, however.


Predictors are getting better and better at recognizing long patterns (sometimes at the cost of not being optimal with short patterns).


Title is a reference to this old SNL skit: https://www.youtube.com/watch?v=GmqeZl8OI2M


It's definitely a classic - if you haven't seen it I'd recommend it even if it wasn't related to the article :D


It is actually linked from the article. In the sentence "In conclusion, do not taunt happy fun branch predictor with asymmetric usage of bl and ret instructions." the words "do not taunt happy fun branch predictor" are a link to the video on YouTube.



Observation: Almost any code, when micro-optimized, can gain about 10x performance.

So, if we had the time and energy, we could probably make all of computing at least 10x faster.

But we don't have the time or energy to dedicate that much effort to every line of code... But perhaps AI does?


I don’t think that is generally true. He only got a large speedup because he used SIMD, which has nothing to do with micro optimization. I would say a better take away is that micro optimization is really hard and you will often make things worse if you don’t know what you are doing. Even if you do, you are only going to get a few percentage points.


My experience micro optimizing things is that even without SIMD, most software can get at least a 5x in performance. With SIMD, you can often get 50x improvements.

The reason why people think "Even if you do, you are only going to get a few percentage points." is because it generally takes 5-50x the developer time to optimize such code. If it takes half a day to write naive code to do something like validate UTF-8, it probably takes ~25 workdays to make a fast SIMD version. If you instead spend an extra half a day, there is a good chance you get a 10-50% speedup using normal code.


This is true for a few streaming applications (such as parsing).

And most of the speedup is because of tricks to avoid doing beaches. There is a great blog post from one of the authors of JSON SIMD discussing this.

I'm on mobile, there is a link for the blog post on the simd JSON github repository.


*avoid branches

The blog post I mentioned:

Paper: Parsing Gigabytes of JSON per Second https://branchfree.org/2019/02/25/paper-parsing-gigabytes-of...

Another related post from Lemire:

Ridiculously fast unicode (UTF-8) validation https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-...

Those algorithms are fast. But to put them in perspective: a single x86 CPU can write 64 B per cycle. At 5 GHz, the theoretical maximum bandwidth is 320 GB/s. IIRC, the read bandwidth is twice that.

There are other bottlenecks, and it is very hard to write code that writes on every cycle.

An interesting consequence is that the theoretical maximum bandwidth is logarithmic in the number of cycles. Again, talking about branchless streaming applications.


It also can be weird and non-obvious. For example, depending on the instruction mix and hardware, it might not be worth getting dinged by AVX-512 clocks. And if you are, say, writing the UTF validation code as a library (more favorable to put effort into a library!) you might not know where the code is being used, so you might not even know the instruction mix…


> Almost any code, when micro-optimized, can gain about 10x performance.

well, it depends on where you start from. the speedups can become arbitrarily impressive sounding when the starting point is arbitrarily inefficient.

e.g. if you're starting with a python script that wasn't written with performance in mind and has ended up being compute-bound in pure python number crunching code, if you rewrite the thing in C while thinking a little about appropriate data structures and memory allocation/access -- i.e. replacing a festival of python dict lookups and very frequent memory allocation for tiny intermediate results with indexing into preallocated arrays or so on -- it's quite common to see speedups of 500x - 1000x. this is before micro-optimisation, before introducing SIMD or multi-threading or setting the compiler to build for your exact CPU or so on.


Whenever you read/hear/think "AI" replace it with "a probabilistic process". If you can validate the output then it's really a beam search ( https://en.wikipedia.org/wiki/Beam_search ) of the solution space, and not really "AI".

If you can't, then it's really a crap shoot as to whether the output is at all valid. If we want to use more opaque processes, I think we need more transparent outputs. If a neural net can produce a machine checkable proof or supply it with the optimisation that's great, but otherwise it's just hot air.


He didn’t gain 10x from a micro optimization, he gained that much by converting it to use SIMD which is a macro optimization. You usually have to structure your program in such a way to be SIMD friendly. In this case it was already simd-friendly (adding a large number of floats).


> In this case it was already simd-friendly (adding a large number of floats).

So it was micro optimization afterall!


It was a micro optimization because it was a toy example. Doing floating-point arithmetic (and not already using BLAS etc.) is very niche in real code.


It is still micro optimization to ensure that those instructions are used when the compiler is too dumb to use them. You can say that they are niche, but that is different from it being micro optimization.


It's not the compiler being dumb, it won't use those extensions because they reassociate the arithmetic and change the result - and setting that compiler flag is very much a macro rather than micro change.


But he didn't set the compiler option, he rewrote it to explicitly use those instructions, that is micro optimization no matter how you look at it. The reason micro optimization is so easy to get better than compiler results is that you know things about the data the compiler doesn't, like in this case you know it is fine to reorder the result while the compiler is too dumb to know that. You need much smarter compilers than we have today in order for such micro optimizations to stop paying dividends.


Making sure it will behave correctly with that change (which changes the result of the function!) is a macro change - you have to trace the implications all the way through the program.


Not true by a long shot, unfortunately. Getting even a 100% performance gain out of manual optimization is unusual. Usually the only way to get significant gains is by either switching algorithms or by relaxing problem requirements.


It... depends. Often things like changing memory layout, intelligent prefetching, etc. can make a pretty big difference.


Interesting.

I thought it would be interesting to compare the behaviour of (very) different AArch64 processors on this code.

I ran your code on an Oracle Cloud Ampere Altra A1:

  sum_slice                  time:   [677.45 ns 684.25 ns 695.67 ns]
  sum_ptr                    time:   [689.11 ns 689.42 ns 689.81 ns]
  sum_ptr_asm_matched        time:   [1.3773 µs 1.3787 µs 1.3806 µs]
  sum_ptr_asm_mismatched     time:   [1.0405 µs 1.0421 µs 1.0441 µs]
  sum_ptr_asm_mismatched_br  time:   [699.79 ns 700.38 ns 701.02 ns]
  sum_ptr_asm_branch         time:   [695.80 ns 696.61 ns 697.56 ns]
  sum_ptr_asm_simd           time:   [131.28 ns 131.42 ns 131.59 ns]
It looks like there's no penalty on this processor, though I would be surprised if it does not have a branch predictor / return stack tracking at all. In general there's less variance here than the M1. The SIMD version is indeed much faster, but by a smaller factor.

And on the relatively (very) slow Rockchip RK3399 on OrangePi 4 LTS (1.8GHz Cortex-A72):

  sum_slice                  time:   [1.7149 µs 1.7149 µs 1.7149 µs]
  sum_ptr                    time:   [1.7165 µs 1.7165 µs 1.7166 µs]
  sum_ptr_asm_matched        time:   [3.4290 µs 3.4291 µs 3.4292 µs]
  sum_ptr_asm_mismatched     time:   [1.7284 µs 1.7294 µs 1.7304 µs]
  sum_ptr_asm_mismatched_br  time:   [1.7384 µs 1.7441 µs 1.7519 µs]
  sum_ptr_asm_branch         time:   [1.7777 µs 1.7980 µs 1.8202 µs]
  sum_ptr_asm_simd           time:   [421.93 ns 422.63 ns 423.30 ns]
Similar to the Ampere processor, but here we pay much more for the extra instructions to create matching pairs. Interesting here that the mismatched branching is faster than the single branch.

I guess absolute numbers are not too meaningful here, but a bit interesting that Ampere Altra is also the fastest of the 3 except in SIMD where M1 wins. I would have expected that with 80 of these cores on die they'd be more power constrained than M1, but I guess not.

Edit: I took the liberty of allowing LLVM to do the SIMD vectorization rather than OP's hand-built code (using the fadd_fast intrinsic and fold() instead of sum()). It is considerably faster still:

Ampere Altra:

  sum_slice               time:   [86.382 ns 86.515 ns 86.715 ns]
RK3399:

  sum_slice               time:   [306.94 ns 306.94 ns 306.95 ns]


>It looks like there's no penalty on this processor, though I would be surprised if it does not have a branch predictor / return stack tracking at all

One potential explanation is that a lot of processors have "meta" predictors. That is, they have multiple distinct predictors, plus a second-level predictor that decides when and how each one is used. This is really useful because some predictors perform very well in certain cases but very poorly in others. Therefore, what may be happening is that the RAS is getting overridden by another structure, since the predictor detects that the RAS is frequently wrong but another predictor is frequently right.


If you’re only heavily using one of the cores, that core is free to use a lot more power, and can probably push its clock speed much higher than if this were an all-core workload. So I’d actually expect the opposite, that the ampere would be allowed to use a lot more power than the M1 (since it’s not a laptop).


Ampere's processors are more or less fixed clock, which makes sense to me in a processor designed for cloud servers. You don't really want unpredictable performance, and in a multi-tenant situation like Oracle Cloud where I ran it, you don't want customer A's workload to affect customer B's performance running on the same physical processor.

With 10x the number of cores as the M1 Max and a TDP of 250W (maybe 5x? Apple doesn't publish numbers), the average power limit per core is likely significantly less, and the M1 might be able to leverage 'turbo' here for this short benchmark.

Still, this is not really a meaningful benchmark, just interesting.


I think that the days when hand-tuned assembly language outperforms compiler-generated code are largely behind us (let loose the contrary anecdotes).

Compilers and microprocessors are way more complex than they were back in the 80s or 90s, and compiler engineers know way more about how instructions are actually executed than the vast majority of programmers.


I think about it like this:

If I ask you to write assembler faster than what the compiler emits for one very specific CPU model, you'll pretty much always be able to do that (given enough time).

But as CPUs become more complex, and compilers get better, achieving that win requires more and more "overfitting" to implementation details of the specific hardware, in ways that are not helpful or even counterproductive on older or newer models of the same CPU family.

It's still worth it sometimes, of course. It's just more work and a lower value proposition on average.


I used to think that was true. Then I had a crypto course in uni which required us to write and optimize 3 different hashing and encryption algorithms.

I was stunned by the first one, which I first did in C and then moved to ASM. The C code was pretty straightforward. But it was trivial to beat in ASM by something like 120%.

That course taught me how bad compilers (or rather GCC) are at high-level register optimization and using the barrel shifter. Basically all the performance I squeezed out of ASM was because the compiler just wasn't figuring basic stuff out. Mind you, I was (am) not an ASM expert. That piece of code was the first thing I ever wrote in ASM. And yet I was able to easily beat a decades-old, world-class compiler.


... and yet in this article, the author beats the compiled C code by 10x with hand-tuned assembly. (By using SIMD and unrolling, which the compiler did not. Granted, the compiler's straight-line code is faster than the hand-written straight-line assembly.)


The author beats the compiler because of floating-point constraints; with -ffast-math the compiler vectorizes the code. I don't have arm64 hardware to test the result, but it's probably again pretty fast: https://gcc.godbolt.org/z/xvjY8P4cM


The only exception (pointed out in this post) is leveraging SIMD instructions, which aren't nearly as well exploited by compilers. I doubt, of course, that this will last all that long, but for now there are too many subtle differences that might matter and totally different instruction sets between cpus and even generations.


I feel like I've been hearing complaints that compilers aren't leveraging SIMD instructions for years. I wonder when this will be solved.


Probably a decade after the interfaces stabilize and everyone figures out which minor quirks of behavior they're willing to permit with the right optimization flag. As of now things are highly dependent on platform- I've had great luck with the older x86 SIMD sets, but on ARM there's still a ways to go.


compiler autovectorization is poor, people often outperform it when writing SIMD code. intrinsics are so low level they might as well be assembly.


I've had really good results from the two vectorizers in LLVM (one works on loops, one within basic blocks). I got an optimal loop body when the pointers had alignment metadata attached, though at the time it failed to unroll the tail.

Using intrinsics with the control flow in C or C++ works really well - you get the right instruction selection from the intrinsics, easy to reason about control flow and the compiler deals with register allocation (and possibly instruction scheduling) which are a pain to do by hand.



If you're optimising for size, you can still very easily beat the compiler.


If an HN reader wanted to play around with similar digging, what would be the essential tools to be aware of and where best could he start?

Assuming prior knowledge of assembly/C but without much experience decompiling or testing speed.


Learn how to use a decent profiler. If you're running Linux, that's probably perf:

https://man7.org/linux/man-pages/man1/perf.1.html

https://www.brendangregg.com/perf.html

Here's a fun article from the cloudflare blog that gives an example of using of perf to diagnose performance of a small utility: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

Matt Godbolt's compiler explorer is also worth checking out: https://godbolt.org/


Your compiler can spit out assembly, you just need to know how to read it. Sounds like the author was also using Xcode Instruments https://help.apple.com/instruments/mac/current/#/dev7b09c84f... to check CPU counters. And they were using criterion https://crates.io/crates/criterion to microbenchmark.

My guess would be that the author is porting some C code to Rust and making sure not to regress performance along the way (probably hopefully trying to increase it). Likely their program was written in Rust and the section they were trying to optimize called some old C code. Sounds like they rewrote the section in Rust, since Rust <-> C FFI calls break out of the happy realm the Rust compiler likes and end up causing a performance hit themselves. You can write inline assembly in Rust using the macro https://doc.rust-lang.org/reference/inline-assembly.html.
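
A toy aarch64-only example of that macro, just to show the shape (nothing to do with the article's actual kernels):

    // Toy use of Rust inline asm: add two u64s with an explicit `add` instruction.
    #[cfg(target_arch = "aarch64")]
    fn add_asm(a: u64, b: u64) -> u64 {
        use std::arch::asm;
        let out: u64;
        unsafe {
            asm!("add {out}, {a}, {b}",
                a = in(reg) a, b = in(reg) b, out = lateout(reg) out);
        }
        out
    }

    fn main() {
        #[cfg(target_arch = "aarch64")]
        println!("{}", add_asm(2, 3)); // 5
    }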


Yes, never push and ret. Here is something I wrote (/me checks calendar) more than 15 years ago about optimizing coroutine control flow: https://www.crystalclearsoftware.com/soc/coroutine/coroutine...


Thank you for posting this.

Somewhat OT, but I miss Phil Hartman[0][1][2]. You may remember him[3] from shows like Saturday Night Live, The Simpsons and "Planet of the Apes."[4]

[0] https://en.wikipedia.org/wiki/Phil_Hartman

[1] https://en.wikipedia.org/wiki/Happy_Fun_Ball

[2] https://www.youtube.com/watch?v=GmqeZl8OI2M

[3] https://screenrant.com/the-simpsons-funniest-troy-mcclure-qu...

[4] https://www.youtube.com/watch?v=yOeUXEpxzcc


Great investigative work! The stack structure you refer to here is called the Return Address Stack (RAS).


An interesting application of this is massaging the branch predictor using tail calls to speed up a parser/interpreter:

https://blog.reverberate.org/2021/04/21/musttail-efficient-i...


The wins from tail calls generally come more from skipping the overhead of a function call than from better branch prediction.


It's mentioned in passing at the end of the "the trouble with interpreter loops" section.

A traditional switch/goto loop can thrash the branch predictor. Separating into different tail calling functions gives you more slots and allows the branch predictor to learn relationships between ops.

Not to discount the many other benefits of tail calls.

Edit: I misspoke slightly; computed gotos can also split up the dispatch jump, but less reliably [0].

[0] https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html
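
A minimal sketch of that dispatch style, assuming Clang's musttail attribute (the opcodes, the dispatch table, and the three-argument handler signature are invented for illustration, not taken from the linked post). Each handler ends in a guaranteed tail call, so every opcode gets its own indirect branch site for the predictor to learn:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH1, OP_ADD, OP_PRINT_HALT };

    typedef void op_fn(const uint8_t *pc, int64_t *stack, size_t sp);
    static op_fn *dispatch[3];

    /* Tail-call into the handler for the next opcode; musttail turns the call into a jump. */
    #define NEXT(pc, stack, sp) \
        __attribute__((musttail)) return dispatch[*(pc)]((pc) + 1, (stack), (sp))

    static void op_push1(const uint8_t *pc, int64_t *stack, size_t sp) {
        stack[sp++] = 1;
        NEXT(pc, stack, sp);
    }

    static void op_add(const uint8_t *pc, int64_t *stack, size_t sp) {
        stack[sp - 2] += stack[sp - 1];
        sp--;
        NEXT(pc, stack, sp);
    }

    static void op_print_halt(const uint8_t *pc, int64_t *stack, size_t sp) {
        (void)pc;
        printf("%lld\n", (long long)stack[sp - 1]);
    }

    int main(void) {
        dispatch[OP_PUSH1] = op_push1;
        dispatch[OP_ADD] = op_add;
        dispatch[OP_PRINT_HALT] = op_print_halt;

        const uint8_t prog[] = { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT_HALT };
        int64_t stack[16];
        dispatch[prog[0]](prog + 1, stack, 0);    /* pushes 1 twice, adds, prints 2 */
        return 0;
    }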


Since about the Pentium/Pentium Pro era, hand-coded assembly has generally been premature optimization (and wasted effort).

Once upon a time(tm), you could write self-modifying code or guess at keeping pipelines occupied by manually reordering instructions, but cache line invalidation and out-of-order execution make these moot.

The problem is that with a pipeline stall (a mispredicted branch) in hyper-deep pipelines, the penalty is enormous: waiting for the condition calculation to percolate through the pipeline's stages.

Processors are optimized for the mainstream, usually the current or last generation of compilers when the processors were designed.

Generating the fastest possible code in general would require an incremental JIT with history, one that can permute and mutate its bitcode based on runtime metrics. That's beyond HotSpot(tm), LLVM, or anything of the sort.


Ow. My head hurts.

And this is why optimizing compilers are some of the most complex programs there are. (or so I have been taught)


It's also why the modern rule of thumb is "don't optimize by writing your own assembly."

The rule is a boiled-down version of the larger notion "Don't optimize by writing your own assembly, because even with domain knowledge of the problem you're trying to solve, you're probably not more clever than the engineer-decades that went into building your compiler toolchain and processor architecture, unless you're an expert in both fields, in which case good luck and shoulder that maintenance burden."

The rule of thumb drops a lot of detail on the ground but is a good first approximation.


Maybe better expressed as "don't write your own assembly unless you know why it might be better than the compiler".

Trying to beat the compiler at optimizing normal code is kind of like trying to beat a calculator with paper and pencil - computers are just better at that sort of thing than people are.

One use case is where you want to use bizarre CPU functions (popcount, encryption, load CR3, etc.) that no one's taught the compiler how to generate, although for some of them you might be better off using compiler intrinsics.
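
For the popcount case, for example (assuming GCC or Clang), a builtin is usually all it takes, and it lowers to the hardware instruction when the target supports it:

    #include <stdint.h>

    /* Compiles to a single popcnt on x86-64 with -mpopcnt, or a short
       cnt/addv sequence on AArch64; no hand-written assembly needed. */
    static inline unsigned popcount64(uint64_t x)
    {
        return (unsigned)__builtin_popcountll(x);
    }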

Another is when you're dealing with things underneath the language abstraction, like the Boost co-routines mentioned in a link a few comments above. Of course, if the language folks decide to add the capability you want (e.g. C++20 coroutines), you're back to being better off using the compiler.

Finally there are pedagogical reasons, e.g. sometimes I show my classes the world's simplest "Hello World", using a few lines of assembler to invoke the write and exit syscalls.
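
A sketch of that last idea, written here with GCC/Clang extended inline asm on x86-64 Linux rather than a standalone .s file (the details are illustrative and may differ from the classroom version): call write(2) and then exit(2) directly, with no libc.

    /* Build with: cc -nostdlib -static hello.c */
    static const char msg[] = "Hello, world\n";

    void _start(void)
    {
        long ret;
        /* write(1, msg, len): syscall number 1 in rax, args in rdi/rsi/rdx */
        __asm__ volatile ("syscall"
                          : "=a"(ret)
                          : "a"(1L), "D"(1L), "S"(msg), "d"(sizeof msg - 1)
                          : "rcx", "r11", "memory");
        (void)ret;
        /* exit(0): syscall number 60 */
        __asm__ volatile ("syscall"
                          :
                          : "a"(60L), "D"(0L)
                          : "rcx", "r11", "memory");
        __builtin_unreachable();
    }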


The Boost coroutines mentioned are not the same thing as the C++20 ones. Boost captures the stack; that's what the register shuffling is for. C++20's version is a compiler transform that moves some state onto the heap (but probably not the whole stack) and builds a switch-dispatch-style thing to transform the control flow. This is why C++20 comes with the co_* annotations and Boost coroutines don't.


I would just phrase it as "if you optimize by writing your own assembly, don't expect your program or its performance to be portable."


... Which is kind of worrying; is it really a good thing that processors are so complex that you need "decades" of expertise to use them fully? Bottom line, you end up with chaotic (in the "sensitive to the slightest change" sense) performance behavior.

OTOH this reminds me of another saying, "don't roll your own crypto". But all those "don't"s are a bit frustrating.


I've seen people get frustrated by the "don't"s before, but I think that's generally taking the first degree approximation too literally. Feel free to hand-write assembly or roll your own crypto, but don't depend on it for anything serious unless you are an expert or have it reviewed by one. Doing so for learning and fun is fine, if that's clearly called out such that no one accidentally depends on it for something serious. There's only one way to become good at something, and that's good practice!

In a professional setting there's a responsibility to the end user which generally precludes doing these dangerous things - that is, one should feel free to take up woodworking as a hobby but shouldn't offer to build someone's house unless you're a licensed professional.


But you don't need decades of experience: we have compilers and optimizers to do that.


It's interesting that the impedance the author was experiencing came from the CPU incorrectly "speculating" about the intent. We as readers are left to speculate about the problem being solved by the author.

Based on the content of his recent articles, we could assume he is continuing his development of a JIT for a Rust port of his graphics engine. Given that assumption, I would argue that the compiler writers are lagging here, specifically in the lack of dynamic compilation. For example, is there a JIT compiler for the Rust language? I was thinking about reproducing his experiment in SBCL, which does have dynamic compilation, although it wouldn't be a real "apples to apples" comparison because my work machine is an x86_64.


JITs have very different constraints (but also many advantages) vs. AOT compilers, so I don't think a language compiler can "support" JITs directly in the general case. LLVM/Clang, for instance, have a JIT mode... for compiling C/C++, not some other language.

Also, obviously the compiler writers (AOT or JIT) need to know about and understand very minute details of how a CPU behaves. I was responding to a person saying it was worrying that devs need that kind of knowledge and experience by saying that it isn't true, because the tools exist (I feel there's some analogy to what knowledge you need for a car: changing oil, servicing the engine, replacing the engine, designing an engine...). Once you are writing a JIT you're in the "making the tools" group, so you now need to know more than the vast majority of devs (not just more than the "average" dev); that's inescapable.


Another is that your assembly will never target newer CPUs, but the compiler will.

I've gained a lot of performance by replacing handwritten assembly functions with plain code versions, just because CPUs and compilers have evolved over the last 15-20 years, while that assembly code is what it is.


I will never understand the trend of “using quotes around things”. Which is a shorthand version for “using quotes around things that I wanted to point to and say, hey, this is something that “goes over here”, you know, inside these quotes, to make sure that you understand exactly what I’m delimiting, since using commas, and semicolons, and colons wouldn’t fit for some reason. Oh look the thing that I’m quoting now consumes 80% of this paragraph. But this is way better than just saying “the modern rule of thumb is to not optimize by writing your own assembly.” Because then it isn’t 100% clear that the rule of thumb is delimited by (exactly) “[do] not optimize by writing your own assembly.” Ya see?


There's a sense in which they're complicated. It's a sequence of graph transforms which mostly deal with non-polynomial time problems using heuristics, where mistakes in the transforms can manifest quite a long way away from the error. There's a significant risk that they're implemented in languages unique to that compiler toolchain, as compiler devs are quite prone to solving problems by writing compilers.

There's also a sense in which they're really simple. The input format and output format are (usually) well defined. The user interface is largely printing things to stderr and giving up, possibly in a retry loop when there's an editor involved. The program dependency graph is often quite small so the bug you're looking at is probably in the source code you checked out. Security is not the dominant concern you have elsewhere.


The state of modern compilers is pretty staggering to me. I've seen some code folding in RyuJIT that makes me feel inferior as a developer.

You've got a few compilers (Java, .NET, et al.) which are capable of re-compiling hot-path code during live execution and then seamlessly transitioning to those paths. This recompilation can be based upon the statistics of the live process, so it's almost like a sort of adaptive AI. Which paths are hot in production does not need to be known at compile time with these approaches.


Optimizing compilers don't model things like branch prediction well, and aren't great at autovectorizing either. They work just well enough.

In general I think they aren't that complicated since they have a pass structure that's relatively easy to inspect and helps avoid spaghetti code.


It took me a while to understand the mismatched bl/ret pairs because I'm used to reading matched bl/ret pairs. This confusion has to be similar to what the silicon is 'thinking': Human, I see matched bl/ret pairs all the time and I'm good at them. Why are you giving me mismatched pairs? I'll do the right thing but I'm not so good at them.

Still, this seems like function inlining. But why not just inline and use a regular branch loop? Is foo() also being called from elsewhere? Is space at a premium?


> Upon seeing this program, it's a common reaction to ask "why is foo a subroutine at all?"

> The answer is "because this is a didactic example, not code that's trying to go as fast as possible".


I love how "rewrite it in Rust" is an actual thing they tried, and it actually performed pretty well given the circumstances.


I write assembly mainly _not_ because it is faster, but because I don't depend on an absurdly complex and massive compiler.


So... how does that work? What sort of work do you do that you have the time to write raw ASM and still be productive? I'm asking in earnest, because I'm curious what sort of workflows still allow for writing ASM directly outside of very specific cases (such as initializing embedded MCUs, for example).


Any audio or video work.

You don't want to be forced to trick the compiler into using the SIMD instructions you are aware of, so you write an assembly function instead.


Forcing SIMD instructions seems like a pretty reasonable, but specialized use-case that would warrant using ASM. But from what I understand, you'd still be using a compiler for whatever higher-level language (say, C/C++) for most of the work and ASM for the really performance sensitive portions of the code (or when trying to force the usage of some CPU extension). My interpretation of GP was that they exclusively write in ASM, though that may not have been correct.


Okay, I can see how you thought that. And we do use something higher-level for all the other parts. Look at FFmpeg for an example: https://github.com/ffmpeg/ffmpeg (the GitHub mirror says a mere 6.6% is assembly).


I feel like this treatment is incomplete without having tested the scenario where the unmatched ret is replaced with a br lr.

EDIT: Reading the documentation after the fact, it appears that that is exactly what br x30 is (x30 is the link register); naively, I had interpreted the hex as a fixed offset to a label.



