This reminds me of an issue that I ran into with Rust [0] when I was trying to optimize some machine learning code. Rust's über-strict type-safe math operations have you use matching types to get the Euclidean non-negative remainder of x mod y. When you have a floating-point x value but an integral y value, the operation can be performed much more cheaply than when y is also a floating-point value.
The problem is that you end up promoting y from an integer to an f64 and get a much slower operation. I ended up writing my own `rem_i64(self: &f64, divisor: i64) -> f64` routine that was ~35x faster (a huge win when crunching massive arrays), but since there are range limitations (f64::MAX > i64::MAX) you can't naively replace all call sites based on the type signatures. However, with some support from the compiler it would be completely doable any time the compiler can infer an upper/lower bound on the f64 dividend, when the result of the operation is coerced to an integer afterwards, or when the dividend is a constant value that doesn't exceed that range.
So now I copy that function around from ML project to ML project, because what else can I do?
(A “workaround” was to use a slower-but-still-faster `rem_i128(self: &f64, divisor: i128) -> f64` to raise the functional limits of the operation, but you're never going to match the range of a 64-bit floating point value until you use 512-bit integral math!)
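The shape of the trick, for anyone curious, is roughly this (a minimal sketch in C++ since the thread's article is about C/C++; the names and fix-up step are approximations of my routine, assuming y > 0 and |x| < 2^63):

```cpp
#include <cmath>
#include <cstdint>

// Sketch: Euclidean (non-negative) remainder of a double by an int64_t,
// without promoting the divisor to double and calling fmod.
// Only valid while y > 0 and |x| < 2^63 (the range limitation above).
double rem_i64(double x, int64_t y) {
    double  ipart = std::trunc(x);       // integral part; fits in int64_t by assumption
    double  frac  = x - ipart;           // fractional part, carries x's sign
    int64_t t     = (int64_t)ipart % y;  // cheap integer remainder (truncated)
    double  r     = (double)t + frac;
    if (r < 0.0) r += (double)y;         // truncated -> Euclidean fix-up
    return r;
}
```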
I package unrelated classes, functions, utilities like yours into one module/crate/library and just keep adding stuff to it. Sort of like my own standard library.
That works for a time, but it's ultimately not great for new/external contributors, who are faced with code full of unfamiliar idioms and utility functions coming from a single kitchen-sink package.
You do still see it though; a bunch of the core infrastructure packages in ROS depend on this grab bag of utilities related to console capture, cli, and colouring:
I do tend to stumble into it with hardware-related libs like ROS. Web-adjacent libraries/frameworks, IMO, tend to package things a little nicer (though the JS ecosystem just said YOLO).
If that crate becomes successful, it will basically greenlight a lot of functionality that's only there because it happened to be made by the same author. And all that extra functionality will only reduce the chances of the really useful function getting into the spotlight. Both are bad for the ecosystem.
If the GP has related useful little functions, yes, pack them together. Otherwise, I'd say a crate for a small little function isn't a problem at all.
Pet peeve: this isn't "your own constant folder in C/C++"… it's "your own enabling constant folding in C/C++"…
With "own constant folder" I expected a GCC/clang plugin that does tree/control flow analysis and some fancy logic in order to determine when, where and how to constant fold…
This seems like a great example of why people don't like C/C++, and probably a good example of why some people _do_ like it.
How is a non-expert in the language supposed to learn tricks/... things like this? I'm asking as a C++ developer of 6+ years in high performance settings, most of this article is esoteric to me.
They have a computation to do, and they want to do it using a particular instruction. That's the only reason they are fiddling with the compiler. You would have to do so too, if you had decided to solve a problem at hand using a similar method.
The author is talking about a way to get a particular (version of) C/C++ compiler to emit the desired instruction. So I'd call this clang-18.1.0-specific but not C/C++-specific since this has nothing to do with the language.
Also, such solutions are neither portable nor stable, since optimization behavior changes between compiler versions. As far as I can tell, they'd also have to implement a compiler-level unit test that ensures the desired machine code is emitted as toolchain versions change.
Is this really C++ specific though? It seems like the optimisations are happening on a lower level, and so would 'infect' other languages too.
Whatever the language, at some point in performance tweaking you will end up having to look at the assembly produced by your compiler, and discovering all kinds of surprises.
LLVM isn't perfect, but the problem here is that there's a C++ compiler flag (-ffast-math) which says OK, disregard how arithmetic actually works, we're going to promise across the entire codebase that we don't actually care and that's fine.
This is nonsense, but it's really common, distressingly common, for C and C++ programmers to use this sort of inappropriate global modelling. It's something which cannot scale; it works OK for one-man projects: "Oh, I use the Special Goose Mode to make routine A better, so even though normal Elephants can't Teleport I need to remember that in Special Goose Mode the Elephants in routine B might Teleport". In practice you'll screw this up, but it feels like you'll get it right often enough to be valuable.
In a large project where we're doing software engineering this is complete nonsense, now Jenny, the newest member of the team working on routine A, will see that obviously Special Goose Mode is a great idea, and turn it on, whereupon the entirely different team handling routine B find that their fucking Elephants can now Teleport. WTF.
The need to never do this is why I was glad to see Rust stabilize (e.g.) u32::unchecked_add fairly recently. This (unsafe, obviously) method says: no, I don't want checked arithmetic, or wrapping, or saturating, I want you to assume this cannot overflow. I am formally promising that this addition is never going to overflow, in order to squeeze out the last drops of performance.
Notice that's not a global flag. I can write let a = unsafe { b.unchecked_add(c) }; in just one place in a 50MLOC system, and for just that one place the compiler can go absolutely wild optimising for the promise that overflows never happen. Yet right next door, even on the next line, I can write let x = y + z; and that still gets the kid gloves: if it overflows, nothing catches on fire. That's how granular this needs to be to be useful, unlike C++'s -ffast-math.
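For what it's worth, C++ can express a similar per-call-site promise with compiler builtins rather than a global flag. A sketch (the helper name is mine, not a standard API):

```cpp
#include <cstdint>

// Per-call-site "this never overflows" promise: if it ever does, we reach
// __builtin_unreachable() (UB), so the optimizer may assume it cannot.
inline uint32_t unchecked_add(uint32_t a, uint32_t b) {
    uint32_t r;
    if (__builtin_add_overflow(a, b, &r))
        __builtin_unreachable();  // the caller promises this path is impossible
    return r;
}
```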
Because the language works by textual inclusion, a "translation unit" isn't really just your code, so this is much more likely to result in nasty surprises, up to and including ODR violations.
LLVM actually also supports instruction-level granularity for fast-math (using essentially the same mechanism as things like unchecked_add), but Clang doesn't expose that level of control.
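What Clang does expose is block-scope control via pragmas, which is at least much finer than a per-TU flag. A sketch (the pragma is a Clang extension):

```cpp
// FP reassociation is permitted only inside this function body;
// the rest of the translation unit keeps strict FP semantics.
float dot(const float* a, const float* b, int n) {
#pragma clang fp reassociate(on)
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}
```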
Actually, floating-point math is mostly deterministic. There is an exception for rounding errors in transcendental functions.
The perception of nondeterminism came specifically from x87, which had 80-bit native floating-point registers, different from every other platform's 64-bit default. Forcing values to 64 bits all the time cost performance, so compilers quietly computed at a different precision when targeting x87, thereby giving different results. It would be like the compiler for ARM secretly changing every use of 'float' into 'double'.
> This is perfectly appropriate in many settings, but _especially_ video games where such deterministic results are completely irrelevant.
I'm not sure I'd agree. Off the top of my head, a potentially significant benefit to deterministic floating-point is allowing you to send updates/inputs instead of world state in multiplayer games, which could be a substantial improvement in network traffic demand. It would also allow for smaller/simpler cross-machine/platform replays, though I don't know how much that feature is desired in comparison.
Indeed, but this isn't hypothetical. A large fraction of multiplayer games operate by sending inputs and requiring the simulation to be deterministic.
(Some of those are single-platform or lack cross-play support, and thus only need consistency between different machines running the same build. That makes compiler optimizations less of an issue. However, some do support cross-play, and thus need consistency between different builds – using different compilers – of the same source code.)
The existence of situations where the optimization is not appropriate, such as multiplayer input replay, does not invalidate the many situations where it is appropriate.
Sure, but I'm not sure anyone is trying to argue that -ffast-math is never appropriate. I was just trying to point out that your original claim that deterministic floating point is "completely irrelevant" to games was too strong.
Everything in this post applies to C too, so it's not C++ specific. And the same gotchas apply for every case when you use inline assembly. I wouldn't call it a trick... Just an interesting workaround for a performance bug in Clang.
The post can be boiled down to "Clang doesn't compile this intrinsic nicely, so just use inline asm directly. But remember that you need to have a non-asm special case to optimize constants too, and you can achieve this with __builtin_constant_p".
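In code, the pattern looks roughly like this (my sketch of the idea, not the article's exact code):

```cpp
#include <immintrin.h>
#include <cmath>

static inline __m128 sqrt_ps(__m128 x) {
    // Constant path: plain code the compiler can fold at compile time.
    // (GCC/Clang allow subscripting vector types like __m128.)
    if (__builtin_constant_p(x[0]) && __builtin_constant_p(x[1]) &&
        __builtin_constant_p(x[2]) && __builtin_constant_p(x[3])) {
        return _mm_setr_ps(std::sqrt(x[0]), std::sqrt(x[1]),
                           std::sqrt(x[2]), std::sqrt(x[3]));
    }
    // Runtime path: inline asm pins the instruction, so -ffast-math
    // can't rewrite it into rsqrtps plus a Newton-Raphson step.
    __asm__("sqrtps %1, %0" : "=x"(x) : "x"(x));
    return x;
}
```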
I would challenge you to find a processor on which rsqrt plus two Newton-Raphson iterations is not slower than a plain sqrt. (We don't know what -mtune the author used.)
The author probably didn't use any mtune setting, which is likely the problem. If you look at older cores on Agner's instruction tables, SQRT has been getting steadily faster over time. This implementation is slightly faster on old Intel machines, for example.
A lesson I learned very early on: don't use -ffast-math unless you know what it does. It has a very appealing name that suggests nice things. You probably won't ever need it.
It's a feature flag; they're not all necessarily floating-point related. As an example though, I still intentionally, humorously misread -funroll-loops as "funroll loops", even though it's f(eature) unroll loops.
And even if you do know what it does, it’s very impractical given that it’s a global option. Turning it on for selected portions of code would be a completely different game.
You can/would just use it in the translation units where you want it; usually for numerical code where you want certain optimizations or behaviors and know that the tradeoffs are irrelevant.
It's mostly harmless for everyday application math anyway, so enabling it for your whole application isn't a catastrophe, but it's not what people who know what they're doing would usually do. It's usually used for a specific file or perhaps a specific support library.
-ffast-math affects other translation units too: when it's passed at link time, GCC and Clang pull in a startup object (crtfastmath.o) whose global initializer sets the FTZ/DAZ bits in the x86 MXCSR register, flushing denormals to zero process-wide. You can't really contain it to a single TU.
This is a Clang or even LLVM code generation issue and entirely unrelated to the C and C++ standards (-ffast-math, intrinsics and inline assembly are all compiler-specific features not covered by the standards).
Most other language ecosystems most likely suffer from a similar problem if you look under the hood.
At least compilers like Clang also give you the tools to work around such issues, as demonstrated by the article.
> How is a non-expert in the language supposed to learn tricks/... things like this?
just like everything else in life that's complex: slowly and diligently.
i hate to break it to you - C++ is complex not for fun but because it has to both be modern (support lots of modern syntax/sugar/primitives) and compile to target code for an enormous range of architectures, and modern architectures are horribly complex and varied (x86/ARM isn't the only game in town). forgo either of those and it could be a much simpler language, but it also wouldn't be nearly as compelling.
I'm a security student. My main experience has been Python and Java... but I have started to learn C to better understand how low-level stuff works without so much abstraction.
My understanding is that C is a great language, but I also get that it's not for everyone. It's really powerful, and yet you can easily make mistakes.
For me, I'm just learning how to use C; I'm not trying to understand the compiler or Makefiles yet. From what I get, the compiler is how you can achieve even better performance, but you need to understand how it does its black magic... otherwise you just might make your code slower or more inefficient.
First-order optimization is always overall program architecture, then hotspots, then fixing architectural issues in the code (e.g. getting rid of misspeculated branches, reducing instructions, etc.), and then optimizing the code the compiler is generating. And at no point does it require knowing the internals of the optimization passes or how they generate the code.
As for the compiler’s role in C, it’s equivalent to javac - it’s taking your source and creating machine code, except the machine code isn’t an abstract bytecode but the exact machine instructions intended to run on the CPU.
The issues with C and C++ are around memory safety. Practice has repeatedly shown that the defect rate with these languages is high enough that it results in lots of easily exploitable vulnerabilities. That's a bit more serious than a personal preference. That's why there are pushes to shift the professional industry itself away from C and C++ in favor of Rust or even Go.
Sure, you can mess up your performance by picking bad compiler options, but most of the time you're fine just enabling the default optimizations and letting the compiler do its thing. No need to understand the black magic behind it.
This is only really necessary if you want to squeeze the last bit of performance out of a piece of code. And honestly, how often does this occur in day-to-day coding unless you write a video or audio codec?
* -mtune/-march - specifying a value of native optimizes for the current machine; x86-64-v1/v2/v3/v4 select generations, or you can name a specific CPU (ARM has different naming conventions). Recommendation: use the generation if distributing binaries, native if building and running locally, unless you can get much more specific
* -O2 / -O3 - turn on most optimizations for speed. Alternatively -Os/-Oz for smaller binaries (sometimes faster, particularly on ARM)
* -flto=thin - get most of the benefits of LTO with minimal compile-time overhead
* PGO - if you have a representative workload, you can use this to replace compiler heuristics with real-world measurements. AutoFDO is the next evolution of this, making it easier to connect data from production environments to compile time.
* math: -fno-math-errno and -fno-trapping-math are "safe" subsets of -ffast-math (i.e. they don't alter numerical accuracy). -fno-signed-zeros can also be considered if valuable.
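Pulling those together, a typical invocation might look like this (file names hypothetical; swap in -march=native for local-only builds):

```
clang++ -O3 -march=x86-64-v3 -flto=thin \
        -fno-math-errno -fno-trapping-math \
        hot_kernel.cpp -o hot_kernel
```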
Agreed. I like to compile most translation units with -O3 and then only compile the translation units that I want to debug with -O0. That way I can often end up with a binary that's reasonably fast but still debuggable for the parts that I care about.
Yup that’s what I’ve resorted to (in Rust I do it at the crate level instead of translation unit). The only downside is forgetting about it and then wondering why stepping through is failing to work properly.
Indeed, C++ is different from most languages (other than C) because "knowing C++" does not mean just knowing the syntax and the standard library API; it implies understanding how the source code is turned into bytes inside an executable image. A Python or Java or whatever programmer could be writing books and praised as a language expert without the slightest idea of how memory is allocated; a C++ programmer who does not know that is probably going to work in some legacy shop supporting an ancient MFC/Qt data-entry app.
In my opinion, `__builtin_constant_p()` is not that obscure of a feature. In C, it is used in macros to imitate constant functions, and in C++ it is useful for determining which alternative of a `union` is currently within its lifetime inside a constexpr function. Granted, `__builtin_is_constant_evaluated()` has obsoleted its primary purpose, but there are enough ways it's still useful that I see it from time to time.
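The classic C-side idiom looks something like this (names hypothetical, in the style of the Linux kernel's bit helpers):

```cpp
// Out-of-line routine used when the argument is only known at runtime.
unsigned long hweight_runtime(unsigned long w);

// Folds to a literal for compile-time constants, calls out otherwise.
#define hweight(w) \
    (__builtin_constant_p(w) ? (unsigned long)__builtin_popcountl(w) \
                             : hweight_runtime(w))
```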
> How is a non-expert in the language supposed to learn tricks/... things like this?
By learning C and inline asm. For a C developer, this is nothing out of the ordinary. C++ focuses too much on new abstractions and hiding everything in the stdlib++, where the actual implementations of course use all of this and look more like C, which includes using (OMG!) raw pointers.
arguably the vast-vast-vaaast majority of projects, problems, solutions, and developers simply don't need this.
but yes, given the premise of the article is that a friend wants to use a specific CPU instruction, yeah, at least minimum knowledge of one's stack is required (and usually the path leads through Assembly, some C/Rust and an FFI interface - like JNI for Java, cffi/Cython for Python, and so on)
Not only that but this is so far away from the actual problem being solved that it’s just kind of annoying.
I wish there was some sensible way for code that’s purely about optimization to live entirely separated from the code that’s about solving the problem at hand…
I mean, you don’t have to care about this unless you have an application where you do. And if you do there is enough transparency (ie ability to inspect the assembly and ask questions) that you can solve this one issue without knowing everything under the sun.
If you had an application where this sort of thing made a difference in JavaScript, the problem would likely still be there, you'd just have a lot less visibility into it.
I guess you’re still right - at the end of the day you see discussions like this far more often in C, so it impacts the feel of programming in C more.
If you understand that C/C++'s original purpose was to write an OS, you're somewhat aware of this... but that would depend on your CS classroom exposure...
In my case it was by accident as I picked up assembly and machine language before I touched C in the late 1980s.
I think the real problem is "really really wanted the sqrtps to be used in some code they were writing" is at odds with -ffast-math.
Clang transforms sqrtps(x) to x * rsqrtps(x) when -ffast-math is set because it's often faster (See [1] section 15.12). It isn't faster for some architectures, but if you tell clang what architecture you're targeting (with -mtune), it appears to make the right choice for the architecture[2].
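Concretely, the rewrite amounts to something like this (a sketch of the standard refinement, not Clang's literal output):

```cpp
#include <immintrin.h>

// sqrt(x) ~= x * rsqrt(x), refined with one Newton-Raphson step:
//   y' = y * (1.5 - 0.5 * x * y * y)
// Note it yields NaN for x == 0 (0 * inf), which -ffast-math assumes away.
static inline __m128 fast_sqrt_ps(__m128 x) {
    __m128 y      = _mm_rsqrt_ps(x);  // ~12-bit estimate of 1/sqrt(x)
    __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
    y = _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f),
                                 _mm_mul_ps(half_x, _mm_mul_ps(y, y))));
    return _mm_mul_ps(x, y);
}
```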
Back in the day, the answer was `__builtin_constant_p`.
But with C++20 it is possible to use `std::is_constant_evaluated()`, or `if consteval` with C++23.
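A sketch of what that looks like in C++20 (the constexpr fallback here is my own, not from any library):

```cpp
#include <type_traits>

// Compile-time path: plain Newton iteration (x >= 0 assumed); fine as a sketch.
constexpr float csqrt(float x) {
    float y = x > 1.0f ? x : 1.0f;
    for (int i = 0; i < 200; ++i) {
        float next = 0.5f * (y + x / y);
        if (next == y) break;  // converged
        y = next;
    }
    return y;
}

constexpr float my_sqrt(float x) {
    if (std::is_constant_evaluated()) {  // `if consteval` in C++23
        return csqrt(x);
    }
    // C++20 permits asm in a constexpr function as long as it's never
    // reached during constant evaluation.
    float r;
    __asm__("sqrtss %1, %0" : "=x"(r) : "x"(x));  // runtime path
    return r;
}

static_assert(my_sqrt(4.0f) == 2.0f);  // folded at compile time
```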
But this is for the scenario where you still want to keep code quality high (maybe when you write a multiprecision math library, not when you hack around a compiler flag), which inline assembly violates for many reasons:
1) major issue: instead of dealing with `-ffast-math` with inline asm, just remove `-ffast-math`
3) the inline asm randomly slapped into the example uses non-VEX encoding; you will likely forget to call vzeroupper on the transition. And in general, this limits the code to x86 (forget about x86-64/ARM)
4) provided example (best_always_inline) does not work in GCC as expected
Offtopic, but the original title is "Your Own Constant Folder in C/C++". I am guessing Hacker News cut out the "your" for some reason, but at first glance I thought it was some kind of read-only directory implemented in C/C++. It's like a truncated garden path sentence.
The real fix is to not use -ffast-math. It breaks all sorts of things by design.
If you really want to relax the correctness of your code to get some potential speedup, either do the optimizations yourself instead of letting the compiler do them, or locally enable fast-math equivalents.
As for the "is constant evaluated" GCC extensions, that stuff is natively available in standard C++ nowadays.
Forced constant folding is something I use on occasion in Zig. To declare that a computation needs to happen at compile-time, just prefix it with `comptime`.
Compile-time execution emulates the target architecture, which has pros and cons. The most notable con is that inline assembly isn't supported (yet?). The complete solution to this particular problem then additionally requires `return if (isComptime()) normal_code() else assembly_magic();` (which has no runtime cost because that branch is constant-folded). Given that inline assembly usually also branches on target architecture, that winds up not adding much complexity -- especially given that you probably had exactly that same code as a fallback for architectures you didn't explicitly handle.
I was going to label this a fine example of yak shaving. Usually when I'm working this hard against the tool, I've missed something, though the author here sounds expert.
If you care that much, you use intrinsics, asm in the C file, or asm to create an object you call into.
Most of the code doesn't need nearly that level of optimization, so you write it in a higher level language.
Making your own compiler is a lot of work compared to just doing what I've outlined above. So you don't. It's not worth it, and you'd still end up back where you are.
Writing assembly really isn't that scary, especially in small doses for performance-critical code. That was a very common practice in the 80s/90s, though it's faded as compiler optimizations have improved.
It is a compiler intrinsic named after the SQRTPS instruction it's intended to generate. The "PS" refers to the vectorization type and stands for packed single.
That being said, the naming of Intel's vector intrinsics _is_ pretty bad, such as the pi64 vs. epi64 difference. ARM's vector intrinsic naming, despite looking like it came from a cat sitting on the home-row keys, is more consistent and has fewer gotchas.
I have not found __builtin_constant_p to be very reliable when I want to fold multiple times. Is there a better way to do this trick using C++ constexpr, I wonder?
> In Intel microarchitectures prior to Skylake, the SSE divide and square root instructions DIVPS and SQRTPS have a latency of 14 cycles (or the neighborhood) and they are not pipelined. This means that the throughput of these instructions is one in every fourteen cycles. The 256-bit Intel AVX instructions VDIVPS and VSQRTPS execute with 128-bit data path and have a latency of twenty-eight cycles and they are not pipelined as well.
> In microarchitectures that provide DIVPS/SQRTPS with high latency and low throughput, it is possible to speed up single-precision divide and square root calculations using the (V)RSQRTPS and (V)RCPPS instructions. For example, with 128-bit RCPPS/RSQRTPS at five-cycle latency and one-cycle throughput or with 256-bit implementation of these instructions at seven-cycle latency and two-cycle throughput, a single Newton-Raphson iteration or Taylor approximation can achieve almost the same precision as the (V)DIVPS and (V)SQRTPS instructions
That's a different case: there one is computing 1/sqrt(x) (or 1/x) directly via rsqrt, whereas the article involves computing sqrt(x) as x*rsqrt(x). In the former you avoid the divps instruction entirely; for the sqrt case you instead need an additional mul instruction.
Only on some older hardware, really. Pretty dumb optimization from clang; people writing SIMD probably already know this trick and have a version of sqrt that is exactly this, so going behind their backs and forcing it when they requested sqrt_ps is very uncool.
__builtin_constant_p(vec) is not inquiring whether the contents of vec are constant. The compilers are not being fickle; the statement is simply not asking the question that the developer intended.
You can never trust an optimizing compiler. I’ve seen similar issues in pretty much every language I’ve written performance-sensitive code in. Point revisions change optimizations all the time, sometimes with horribly disastrous results. One example: every time rust upgrades the version of LLVM underlying the compiler backend, the team prepares itself for a deluge of optimization regression reports (though this has been somewhat "mitigated" by a massive regression in rustc's ability to convey sufficient information to the LLVM backend for it to optimize to begin with, thanks to a change made to improve compile times [0]).
This is almost universally true across all languages. The behaviors you see are usually the result of countless heuristics stacked one atop the other, and a tiny change here can end up with huge changes there.
...so does LLVM, unless you specifically instruct it to do otherwise with -ffast-math. Don't be fooled by the name of this flag, if it were possible to just do math faster without compromising anything, it wouldn't be a flag, it would be the default. It seems that Julia has a flag with the same name but a different behavior. Okay? What's your point exactly?
@fastmath in Julia is telling LLVM (basically) the same thing. I think the only difference is that Julia doesn't think rsqrt(x)*x with Newton iteration is a good way of computing sqrt
But in this case, -ffast-math actually results in slower math, so the flag is badly named. Naming things is important (and hard). The naive expectation of a flag named -ffast-math is that the math should never be slower, sometimes be faster, and potentially less accurate and/or error prone. The fact that it can also be sometimes slower means the flag really should be named -fdifferent-math or (uncharitably) -funreliable-math.
It’s too bad that it hasn’t acquired the name of an author to describe this alternative functionality more concisely without being judgy (like Nagle’s algorithm). I looked around, and it seems that there isn’t any one designer/author that you could give credit to.
I mean, the fiddly-ness comes from wanting to use inline assembly, which is often considered to be niche and second-class in most modern languages. Even writing the asm in the first place is full of footguns [0]. There are ways to reduce the optimization barrier of asm, and implementing constant folding is one such example. But I can see why most compiler writers wouldn't be that interested in it.
[0]: https://github.com/rust-lang/rust/issues/83973
Godbolt link: https://godbolt.org/z/EqrEqExnc