This reminds me of an issue that I ran into with Rust [0] when I was trying to optimize some machine learning code. Rust's über-strict type-safe math operations have you use matching types to get the Euclidean non-negative remainder of x mod y. When you have a floating-point x value but an integral y value, the operation can be performed much more cheaply than when y is also a floating-point value.
The problem is that you end up promoting y from an integer to an f64 and get a much slower operation. I ended up writing my own `rem_i64(self: &f64, divisor: i64) -> f64` routine that was ~35x faster (a huge win when crunching massive arrays), but since there are range limitations (f64::MAX > i64::MAX) you can't naively replace all call sites based on the type signatures. However, with some support from the compiler it would be completely doable any time the compiler can infer an upper/lower bound on the f64 dividend, when the result of the operation is coerced to an integer afterwards, or when the dividend is a constant value that doesn't exceed that range.
So now I copy that function around from ML project to ML project, because what else can I do?
(A “workaround” was to use a slower-but-still-faster `rem_i128(self: &f64, divisor: i128) -> f64` to raise the functional limits of the operation, but you're never going to match the range of a 64-bit floating point value until you use 512-bit integral math!)
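The shape of the trick, for anyone curious, is roughly this (a minimal sketch in C++ since the thread's article is about C/C++; the names and fix-up step are approximations of my routine, assuming y > 0 and |x| < 2^63):

```cpp
#include <cmath>
#include <cstdint>

// Sketch: Euclidean (non-negative) remainder of a double by an int64_t,
// without promoting the divisor to double and calling fmod.
// Only valid while y > 0 and |x| < 2^63 (the range limitation above).
double rem_i64(double x, int64_t y) {
    double  ipart = std::trunc(x);       // integral part; fits in int64_t by assumption
    double  frac  = x - ipart;           // fractional part, carries x's sign
    int64_t t     = (int64_t)ipart % y;  // cheap integer remainder (truncated)
    double  r     = (double)t + frac;
    if (r < 0.0) r += (double)y;         // truncated -> Euclidean fix-up
    return r;
}
```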
I package unrelated classes, functions, utilities like yours into one module/crate/library and just keep adding stuff to it. Sort of like my own standard library.
That works for a time, but it's ultimately not great for new/external contributors, who are faced with code full of unfamiliar idioms and utility functions coming from a single kitchen-sink package.
You do still see it though; a bunch of the core infrastructure packages in ROS depend on this grab bag of utilities related to console capture, cli, and colouring:
I do tend to stumble into it with hardware-related libs like ROS. Web-adjacent libraries/frameworks, IMO, tend to package things a little nicer (though the JS ecosystem just said YOLO).
If that crate becomes successful, it will basically greenlight a lot of functionality that's only there because it happened to be made by the same author. And all that extra functionality will only reduce the chances of the really useful function getting into the spotlight. Both are bad for the ecosystem.
If the GP has related useful little functions, yes, pack them together. Otherwise, I'd say a crate for a small little function isn't a problem at all.
Pet peeve: this isn't "your own constant folder in C/C++"… it's "your own enabling constant folding in C/C++"…
With "own constant folder" I expected a GCC/clang plugin that does tree/control flow analysis and some fancy logic in order to determine when, where and how to constant fold…
This seems like a great example of why people don't like C/C++, and probably a good example of why some people _do_ like it.
How is a non-expert in the language supposed to learn tricks/... things like this? I'm asking as a C++ developer of 6+ years in high performance settings, most of this article is esoteric to me.
They have a computation to do, and they want to do it using a particular instruction. That's the only reason they are fiddling with the compiler. You would have to do so too, if you had decided to solve a problem at hand using a similar method.
The author is talking about a way to get a particular (version of) C/C++ compiler to emit the desired instruction. So I'd call this clang-18.1.0-specific but not C/C++-specific since this has nothing to do with the language.
Also, such solutions are neither portable nor stable, since optimization behavior changes between compiler versions. As far as I can tell, they'd also have to implement a compiler-level unit test that ensures the desired machine code is emitted as toolchain versions change.
Is this really C++ specific though? It seems like the optimisations are happening on a lower level, and so would 'infect' other languages too.
Whatever the language, at some point in performance tweaking you will end up having to look at the assembly produced by your compiler, and discovering all kinds of surprises.
LLVM isn't perfect, but the problem here is that there's a C++ compiler flag (-ffast-math) which says OK, disregard how arithmetic actually works, we're going to promise across the entire codebase that we don't actually care and that's fine.
This is nonsense, but it's really common, distressingly common, for C and C++ programmers to use this sort of inappropriate global modelling. It's something which cannot scale; it works OK for one-man projects: "Oh, I use the Special Goose Mode to make routine A better, so even though normal Elephants can't Teleport I need to remember that in Special Goose Mode the Elephants in routine B might Teleport". In practice you'll screw this up, but it feels like you'll get it right often enough to be valuable.
In a large project where we're doing software engineering this is complete nonsense, now Jenny, the newest member of the team working on routine A, will see that obviously Special Goose Mode is a great idea, and turn it on, whereupon the entirely different team handling routine B find that their fucking Elephants can now Teleport. WTF.
The need to never do this is why I was glad to see Rust stabilize (e.g.) u32::unchecked_add fairly recently. This (unsafe, obviously) method says: no, I don't want checked arithmetic, or wrapping, or saturating, I want you to assume this cannot overflow. I am formally promising that this addition is never going to overflow, in order to squeeze out the last drops of performance.
Notice that's not a global flag. I can write let a = unsafe { b.unchecked_add(c) }; in just one place in a 50MLOC system, and for just that one place the compiler can go absolutely wild optimising for the promise that overflows never happen. Yet right next door, even on the next line, I can write let x = y + z; and that still gets the kid gloves: if it overflows, nothing catches on fire. That's how granular this needs to be to be useful, unlike C++'s -ffast-math.
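For what it's worth, C++ can express a similar per-call-site promise with compiler builtins rather than a global flag. A sketch (the helper name is mine, not a standard API):

```cpp
#include <cstdint>

// Per-call-site "this never overflows" promise: if it ever does, we reach
// __builtin_unreachable() (UB), so the optimizer may assume it cannot.
inline uint32_t unchecked_add(uint32_t a, uint32_t b) {
    uint32_t r;
    if (__builtin_add_overflow(a, b, &r))
        __builtin_unreachable();  // the caller promises this path is impossible
    return r;
}
```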
Because the language works by textual inclusion, a "translation unit" isn't really just your code, so this is much more likely to result in nasty surprises, up to and including ODR violations.
LLVM actually also supports instruction-level granularity for fast-math (using essentially the same mechanism as things like unchecked_add), but Clang doesn't expose that level of control.
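What Clang does expose is block-scope control via pragmas, which is at least much finer than a per-TU flag. A sketch (the pragma is a Clang extension):

```cpp
// FP reassociation is permitted only inside this function body;
// the rest of the translation unit keeps strict FP semantics.
float dot(const float* a, const float* b, int n) {
#pragma clang fp reassociate(on)
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i] * b[i];
    return s;
}
```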
Actually, floating-point math is mostly deterministic. There is an exception for rounding errors in transcendental functions.
The perception of nondeterminism came specifically from x87, which had 80-bit native floating-point registers, different from every other platform's 64-bit default. Forcing values to 64 bits all the time cost performance, so compilers quietly computed at a different precision when targeting x87, thereby giving different results. It would be like the compiler for ARM secretly changing every use of 'float' into 'double'.
> This is perfectly appropriate in many settings, but _especially_ video games where such deterministic results are completely irrelevant.
I'm not sure I'd agree. Off the top of my head, a potentially significant benefit to deterministic floating-point is allowing you to send updates/inputs instead of world state in multiplayer games, which could be a substantial improvement in network traffic demand. It would also allow for smaller/simpler cross-machine/platform replays, though I don't know how much that feature is desired in comparison.
Indeed, but this isn't hypothetical. A large fraction of multiplayer games operate by sending inputs and requiring the simulation to be deterministic.
(Some of those are single-platform or lack cross-play support, and thus only need consistency between different machines running the same build. That makes compiler optimizations less of an issue. However, some do support cross-play, and thus need consistency between different builds – using different compilers – of the same source code.)
The existence of situations where the optimization is not appropriate, such as multiplayer input replay, does not invalidate the many situations where it is appropriate.
Sure, but I'm not sure anyone is trying to argue that -ffast-math is never appropriate. I was just trying to point out that your original claim that deterministic floating point is "completely irrelevant" to games was too strong.
Everything in this post applies to C too, so it's not C++ specific. And the same gotchas apply for every case when you use inline assembly. I wouldn't call it a trick... Just an interesting workaround for a performance bug in Clang.
The post can be boiled down to "Clang doesn't compile this intrinsic nicely, so just use inline asm directly. But remember that you need to have a non-asm special case to optimize constants too, and you can achieve this with __builtin_constant_p".
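In code, the pattern looks roughly like this (my sketch of the idea, not the article's exact code):

```cpp
#include <immintrin.h>
#include <cmath>

static inline __m128 sqrt_ps(__m128 x) {
    // Constant path: plain code the compiler can fold at compile time.
    // (GCC/Clang allow subscripting vector types like __m128.)
    if (__builtin_constant_p(x[0]) && __builtin_constant_p(x[1]) &&
        __builtin_constant_p(x[2]) && __builtin_constant_p(x[3])) {
        return _mm_setr_ps(std::sqrt(x[0]), std::sqrt(x[1]),
                           std::sqrt(x[2]), std::sqrt(x[3]));
    }
    // Runtime path: inline asm pins the instruction, so -ffast-math
    // can't rewrite it into rsqrtps plus a Newton-Raphson step.
    __asm__("sqrtps %1, %0" : "=x"(x) : "x"(x));
    return x;
}
```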
I would challenge you to find a processor on which rsqrt plus two Newton-Raphson iterations is not slower than a plain sqrt. (We don't know what -mtune the author used.)
The author probably didn't use any mtune setting, which is likely the problem. If you look at older cores on Agner's instruction tables, SQRT has been getting steadily faster over time. This implementation is slightly faster on old Intel machines, for example.
A lesson I learned very early on: don't use -ffast-math unless you know what it does. It has a very appealing name that suggests nice things. You probably won't ever need it.
It's a feature flag; they're not all necessarily floating-point related. As an example though, I still intentionally, humorously misread -funroll-loops as "funroll loops", even though it's f(eature) unroll loops.
And even if you do know what it does, it’s very impractical given that it’s a global option. Turning it on for selected portions of code would be a completely different game.
You can/would just use it in the translation units where you want it; usually for numerical code where you want certain optimizations or behaviors and know that the tradeoffs are irrelevant.
It's mostly harmless for everyday application math anyway, so enabling it for your whole application isn't a catastrophe, but it's not what people who know what they're doing would usually do. It's usually used for a specific file or perhaps a specific support library.
-ffast-math affects other translation units too: when it's passed at link time, GCC and Clang pull in a startup object (crtfastmath.o) whose global initializer sets the FTZ/DAZ bits in the x86 MXCSR register, flushing denormals to zero process-wide. You can't really contain it to a single TU.
This is a Clang or even LLVM code generation issue and entirely unrelated to the C and C++ standards (-ffast-math, intrinsics and inline assembly are all compiler-specific features not covered by the standards).
Most other language ecosystems most likely suffer from a similar problem if you look under the hood.
At least compilers like Clang also give you the tools to work around such issues, as demonstrated by the article.
> How is a non-expert in the language supposed to learn tricks/... things like this?
just like everything else in life that's complex: slowly and diligently.
i hate to break it to you - C++ is complex not for fun but because it has to both be modern (support lots of modern syntax/sugar/primitives) and compile to target code for an enormous range of architectures, and modern architectures are horribly complex and varied (x86/ARM isn't the only game in town). forgo either of those and it could be a much simpler language, but it also wouldn't be nearly as compelling.
I'm a security student. My main experience has been Python and Java... but I have started to learn C to better understand how low-level stuff works without so much abstraction.
My understanding is that C is a great language, but I also get that it's not for everyone. It's really powerful, and yet you can easily make mistakes.
For me, I'm just learning how to use C; I'm not trying to understand the compiler or Makefiles yet. From what I get, the compiler is how you can achieve even better performance, but you need to understand how it does its black magic... otherwise you just might make your code slower or more inefficient.
First-order optimization is always overall program architecture, then hotspots, then fixing architectural issues in the code (e.g. getting rid of misspeculated branches, reducing instructions, etc.), and then optimizing the code the compiler is generating. And at no point does it require knowing the internals of the optimization passes or how they generate the code.
As for the compiler’s role in C, it’s equivalent to javac - it’s taking your source and creating machine code, except the machine code isn’t an abstract bytecode but the exact machine instructions intended to run on the CPU.
The issues with C and C++ are around memory safety. Practice has repeatedly shown that the defect rate with these languages is high enough that it results in lots of easily exploitable vulnerabilities. That's a bit more serious than a personal preference. That's why there are pushes to shift the professional industry itself away from C and C++ in favor of Rust or even Go.
Sure, you can mess up your performance by picking bad compiler options, but most of the time you're fine just enabling the default optimizations and letting the compiler do its thing. No need to understand the black magic behind it.
This is only really necessary if you want to squeeze the last bit of performance out of a piece of code. And honestly, how often does this occur in day-to-day coding unless you write a video or audio codec?
* -mtune/-march - specifying a value of native optimizes for the current machine; x86-64-v1/v2/v3/v4 select generations, or you can name a specific CPU (ARM has different naming conventions). Recommendation: use the generation if distributing binaries, native if building and running locally, unless you can get much more specific
* -O2 / -O3 - turn on most optimizations for speed. Alternatively -Os/-Oz for smaller binaries (sometimes faster, particularly on ARM)
* -flto=thin - get most of the benefits of LTO with minimal compile-time overhead
* PGO - if you have a representative workload, you can use this to replace compiler heuristics with real-world measurements. AutoFDO is the next evolution of this, making it easier to connect data from production environments to compile time.
* math: -fno-math-errno and -fno-trapping-math are "safe" subsets of -ffast-math (i.e. they don't alter numerical accuracy). -fno-signed-zeros can also be considered if valuable.
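Pulling those together, a typical invocation might look like this (file names hypothetical; swap in -march=native for local-only builds):

```
clang++ -O3 -march=x86-64-v3 -flto=thin \
        -fno-math-errno -fno-trapping-math \
        hot_kernel.cpp -o hot_kernel
```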
Agreed. I like to compile most translation units with -O3 and then only compile the translation units that I want to debug with -O0. That way I can often end up with a binary that's reasonably fast but still debuggable for the parts that I care about.
Yup that’s what I’ve resorted to (in Rust I do it at the crate level instead of translation unit). The only downside is forgetting about it and then wondering why stepping through is failing to work properly.
Indeed, C++ is different from most languages (other than C) because "knowing C++" does not mean just knowing the syntax and the standard library API; it implies understanding how the source code is turned into bytes inside an executable image. A Python or Java or whatever programmer could be writing books and praised as a language expert without the slightest idea of how memory is allocated; a C++ programmer who does not know that is probably going to work in some legacy shop supporting an ancient MFC/Qt data-entry app.
In my opinion, `__builtin_constant_p()` is not that obscure of a feature. In C, it is used in macros to imitate constant functions, and in C++ it is useful for determining which alternative of a `union` is currently within its lifetime inside a constexpr function. Granted, `__builtin_is_constant_evaluated()` has obsoleted its primary purpose, but there are enough ways it's still useful that I see it from time to time.
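The classic C-side idiom looks something like this (names hypothetical, in the style of the Linux kernel's bit helpers):

```cpp
// Out-of-line routine used when the argument is only known at runtime.
unsigned long hweight_runtime(unsigned long w);

// Folds to a literal for compile-time constants, calls out otherwise.
#define hweight(w) \
    (__builtin_constant_p(w) ? (unsigned long)__builtin_popcountl(w) \
                             : hweight_runtime(w))
```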
> How is a non-expert in the language supposed to learn tricks/... things like this?
By learning C and inline asm. For a C developer, this is nothing out of the ordinary. C++ focuses too much on new abstractions and hiding everything in the stdlib++, where the actual implementations of course use all of this and look more like C, which includes using (OMG!) raw pointers.
arguably the vast-vast-vaaast majority of projects, problems, solutions, and developers simply don't need this.
but yes, given the premise of the article is that a friend wants to use a specific CPU instruction, yeah, at least minimum knowledge of one's stack is required (and usually the path leads through Assembly, some C/Rust and an FFI interface - like JNI for Java, cffi/Cython for Python, and so on)
Not only that but this is so far away from the actual problem being solved that it’s just kind of annoying.
I wish there was some sensible way for code that’s purely about optimization to live entirely separated from the code that’s about solving the problem at hand…
I mean, you don’t have to care about this unless you have an application where you do. And if you do there is enough transparency (ie ability to inspect the assembly and ask questions) that you can solve this one issue without knowing everything under the sun.
If you had an application where this sort of thing made a difference in JavaScript, the problem would likely still be there, you'd just have a lot less visibility into it.
I guess you’re still right - at the end of the day you see discussions like this far more often in C, so it impacts the feel of programming in C more.
If you understand that C/C++'s original purpose was to write an OS, you're somewhat aware of this... but that would depend on your CS classroom exposure...
In my case it was by accident as I picked up assembly and machine language before I touched C in the late 1980s.
I think the real problem is "really really wanted the sqrtps to be used in some code they were writing" is at odds with -ffast-math.
Clang transforms sqrtps(x) to x * rsqrtps(x) when -ffast-math is set because it's often faster (See [1] section 15.12). It isn't faster for some architectures, but if you tell clang what architecture you're targeting (with -mtune), it appears to make the right choice for the architecture[2].
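Concretely, the rewrite amounts to something like this (a sketch of the standard refinement, not Clang's literal output):

```cpp
#include <immintrin.h>

// sqrt(x) ~= x * rsqrt(x), refined with one Newton-Raphson step:
//   y' = y * (1.5 - 0.5 * x * y * y)
// Note it yields NaN for x == 0 (0 * inf), which -ffast-math assumes away.
static inline __m128 fast_sqrt_ps(__m128 x) {
    __m128 y      = _mm_rsqrt_ps(x);  // ~12-bit estimate of 1/sqrt(x)
    __m128 half_x = _mm_mul_ps(_mm_set1_ps(0.5f), x);
    y = _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f),
                                 _mm_mul_ps(half_x, _mm_mul_ps(y, y))));
    return _mm_mul_ps(x, y);
}
```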
Back in the day, the answer was `__builtin_constant_p`.
But with C++20 it is possible to use `std::is_constant_evaluated()`, or `if consteval` with C++23.
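A sketch of what that looks like in C++20 (the constexpr fallback here is my own, not from any library):

```cpp
#include <type_traits>

// Compile-time path: plain Newton iteration (x >= 0 assumed); fine as a sketch.
constexpr float csqrt(float x) {
    float y = x > 1.0f ? x : 1.0f;
    for (int i = 0; i < 200; ++i) {
        float next = 0.5f * (y + x / y);
        if (next == y) break;  // converged
        y = next;
    }
    return y;
}

constexpr float my_sqrt(float x) {
    if (std::is_constant_evaluated()) {  // `if consteval` in C++23
        return csqrt(x);
    }
    // C++20 permits asm in a constexpr function as long as it's never
    // reached during constant evaluation.
    float r;
    __asm__("sqrtss %1, %0" : "=x"(r) : "x"(x));  // runtime path
    return r;
}

static_assert(my_sqrt(4.0f) == 2.0f);  // folded at compile time
```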
But this is for the scenario where you still want to keep code quality high (maybe when you write a multiprecision math library, not when you hack around a compiler flag), which inline assembly violates for many reasons:
1) major issue: instead of dealing with `-ffast-math` with inline asm, just remove `-ffast-math`
3) the inline asm randomly slapped into the example uses non-VEX encoding; you will likely forget to call vzeroupper on the transition. And in general, this limits the code to x86 (forget about x86-64/ARM)
4) provided example (best_always_inline) does not work in GCC as expected
Offtopic, but the original title is "Your Own Constant Folder in C/C++". I am guessing Hacker News cut out the "your" for some reason, but at first glance I thought it was some kind of read-only directory implemented in C/C++. It's like a truncated garden path sentence.
The real fix is to not use -ffast-math. It breaks all sorts of things by design.
If you really want to relax the correctness of your code to get some potential speedup, either do the optimizations yourself instead of letting the compiler do them, or locally enable fast-math equivalents.
As for the "is constant evaluated" GCC extensions, that stuff is natively available in standard C++ nowadays.
Forced constant folding is something I use on occasion in Zig. To declare that a computation needs to happen at compile-time, just prefix it with `comptime`.
Compile-time execution emulates the target architecture, which has pros and cons. The most notable con is that inline assembly isn't supported (yet?). The complete solution to this particular problem then additionally requires `return if (isComptime()) normal_code() else assembly_magic();` (which has no runtime cost because that branch is constant-folded). Given that inline assembly usually also branches on target architecture, that winds up not adding much complexity -- especially given that you probably had exactly that same code as a fallback for architectures you didn't explicitly handle.
I was going to label this a fine example of yak shaving. Usually when I'm working this hard against the tool, I've missed something, though the author here sounds expert.
If you care that much, you use intrinsics, asm in the C file, or asm to create an object you call into.
Most of the code doesn't need nearly that level of optimization, so you write it in a higher level language.
Making your own compiler is a lot of work compared to just doing what I've outlined above. So you don't. It's not worth it, and you'd still end up back where you are.
Writing assembly really isn't that scary, especially in small doses for performance-critical code. That was a very common practice in the 80s/90s, though it's faded as compiler optimizations have improved.
It is a compiler intrinsic named after the SQRTPS instruction it's intended to generate. The "PS" refers to the vectorization type and stands for packed single.
That being said, the naming of Intel's vector intrinsics _is_ pretty bad, such as the pi64 vs. epi64 difference. ARM's vector intrinsic naming, despite looking like it came from a cat sitting on the home-row keys, is more consistent and has fewer gotchas.
I have not found __builtin_constant_p to be very reliable when I want to fold multiple times. Is there a better way to do this trick using C++ constexpr, I wonder?
> In Intel microarchitectures prior to Skylake, the SSE divide and square root instructions DIVPS and SQRTPS have a latency of 14 cycles (or the neighborhood) and they are not pipelined. This means that the throughput of these instructions is one in every fourteen cycles. The 256-bit Intel AVX instructions VDIVPS and VSQRTPS execute with 128-bit data path and have a latency of twenty-eight cycles and they are not pipelined as well.
> In microarchitectures that provide DIVPS/SQRTPS with high latency and low throughput, it is possible to speed up single-precision divide and square root calculations using the (V)RSQRTPS and (V)RCPPS instructions. For example, with 128-bit RCPPS/RSQRTPS at five-cycle latency and one-cycle throughput or with 256-bit implementation of these instructions at seven-cycle latency and two-cycle throughput, a single Newton-Raphson iteration or Taylor approximation can achieve almost the same precision as the (V)DIVPS and (V)SQRTPS instructions
That's a different case: there one is computing 1/sqrt(x) (or 1/x) directly via rsqrt, whereas the article involves computing sqrt(x) as x*rsqrt(x). In the former you avoid the divps instruction entirely; for the sqrt case you instead need an additional mul instruction.
Only on some older hardware, really. Pretty dumb optimization from clang; people writing SIMD probably already know this trick and have a version of sqrt that is exactly this, so going behind their backs and forcing it when they requested sqrt_ps is very uncool.
__builtin_constant_p(vec) is not inquiring whether the contents of vec are constant. The compilers are not being fickle; the statement is simply not asking the question that the developer intended.
You can never trust an optimizing compiler. I’ve seen similar issues in pretty much every language I’ve written performance-sensitive code in. Point revisions change optimizations all the time, sometimes with horribly disastrous results. One example: every time rust upgrades the version of LLVM underlying the compiler backend, the team prepares itself for a deluge of optimization regression reports (though this has been somewhat "mitigated" by a massive regression in rustc's ability to convey sufficient information to the LLVM backend for it to optimize to begin with, thanks to a change made to improve compile times [0]).
This is almost universally true across all languages. The behaviors you see are usually the result of countless heuristics stacked one atop the other, and a tiny change here can end up with huge changes there.
...so does LLVM, unless you specifically instruct it to do otherwise with -ffast-math. Don't be fooled by the name of this flag, if it were possible to just do math faster without compromising anything, it wouldn't be a flag, it would be the default. It seems that Julia has a flag with the same name but a different behavior. Okay? What's your point exactly?
@fastmath in Julia is telling LLVM (basically) the same thing. I think the only difference is that Julia doesn't think rsqrt(x)*x with Newton iteration is a good way of computing sqrt
But in this case, -ffast-math actually results in slower math, so the flag is badly named. Naming things is important (and hard). The naive expectation of a flag named -ffast-math is that the math should never be slower, sometimes be faster, and potentially less accurate and/or error prone. The fact that it can also be sometimes slower means the flag really should be named -fdifferent-math or (uncharitably) -funreliable-math.
It’s too bad that it hasn’t acquired the name of an author to describe this alternative functionality more concisely without being judgy (like Nagle’s algorithm). I looked around, and it seems that there isn’t any one designer/author that you could give credit to.
I mean, the fiddly-ness comes from wanting to use inline assembly, which is often considered to be niche and second-class in most modern languages. Even writing the asm in the first place is full of footguns [0]. There are ways to reduce the optimization barrier of asm, and implementing constant folding is one such example. But I can see why most compiler writers wouldn't be that interested in it.
[0]: https://github.com/rust-lang/rust/issues/83973
Godbolt link: https://godbolt.org/z/EqrEqExnc