Why those particular integer multiplies? (fgiesen.wordpress.com)
75 points by luu 18 days ago | 37 comments



How can software run on different CPUs when they support different operations?

When you download "debian-live-12.7.0-amd64-kde.iso", all the programs in the repos support all current Intel and AMD CPUs, right? Do they just target the lowest common denominator of operations? Or do they somehow adapt to the operations supported by the user's CPU?

Do dynamic languages (Javascript, Python, PHP...) get a speed boost because they can compile just in time and use all the features of the user's CPU?


> Do they just target the lowest common denominator of operations? Or do they somehow adapt to the operations supported by the user's CPU?

Mostly the former. Some highly optimized bits of software do the latter—they are built with multiple code paths optimized for different hardware capabilities, and select which one to use at runtime.

> Do dynamic languages (Javascript, Python, PHP...) get a speed boost because they can compile just in time and use all the features of the user's CPU?

Hypothetically yes, but in practice no for the languages you mentioned because they don't map well to things like SIMD. Some JIT-based numerical computing systems as well as JIT-based ML compilers do reap those benefits.


.NET/C# does pretty well with SIMD for a high-level language: it has portable SIMD primitives that get JITed to whatever the system supports at runtime, and they're used quite extensively throughout the stdlib, so you benefit even if you're not writing SIMD routines yourself.

They tried to do something similar in Javascript but it added way too much complexity to the runtimes and ended up getting dropped in favor of WASM SIMD.


It's possibly worth mentioning that Java is getting a vector API which explicitly abstracts over some of the details of SIMD, including width. You have a type Vector<T> which represents enough of some type T to fill a vector register (eg eight 32-bit numbers in a 256-bit register), operations on Vector<T> which produce another Vector<T>, and some way to break arrays up into Vectors of the right size for the platform. The API is a bit clunky, but you write code with it, the compiler performs a miracle, and efficient platform-specific vector code comes out.

https://docs.oracle.com/en/java/javase/23/docs/api/jdk.incub...

https://docs.oracle.com/en/java/javase/23/docs/api/jdk.incub...


Though it's pretty much incubating forever, or until Valhalla, whichever comes first.


SSE2 is a requirement for x86-64, which gives at least a reasonable (128-bit wide SIMD) baseline.

SSE4 is from 2008, so making it a requirement isn't unreasonable.

Even AVX2 is from 2013, so some apps require it nowadays.

It is extremely difficult for a compiler to convert scalar code to SIMD automatically; even static C++ compilers really suck at it.

A dynamic compiler for javascript would have no real hope of any meaningful gains.
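
For a concrete sense of what tends to vectorize and what doesn't, a small sketch (function names are made up; results vary by compiler and flags). The elementwise add usually auto-vectorizes fine, while the prefix sum has a loop-carried dependency that scalar-to-SIMD conversion generally can't handle without rethinking the algorithm:

  #include <cstddef>

  // Independent iterations: compilers usually vectorize this.
  void add(float* out, const float* a, const float* b, std::size_t n) {
      for (std::size_t i = 0; i < n; ++i)
          out[i] = a[i] + b[i];
  }

  // Each iteration depends on the previous one; auto-vectorizers
  // typically give up here even though SIMD prefix sums exist.
  void prefix_sum(float* out, const float* a, std::size_t n) {
      float acc = 0.0f;
      for (std::size_t i = 0; i < n; ++i) {
          acc += a[i];
          out[i] = acc;
      }
  }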


The problem is that there were CPUs made well after 2008 that don't support SSE4. In particular, Phenom II was fairly popular, sold up until 2012, and doesn't even support SSSE3 (much less SSE4.1 and SSE4.2; only an AMD-specific variant known as SSE4a).


Another issue with lots of older processors is that they would reduce their clock speed so much when using SIMD instructions that there was effectively no performance gain. You had to be very careful about which instructions you actually used.


AFAIK this was only ever really true for AVX-512 (when touching the actual wider registers). Most others have had a moderate downclock, but still normally worth it.


You're right, it's been a while since I dabbled in SIMD.


The early Atoms too only supported up to SSSE3.


I still have an old Samsung from approximately 2008. The battery lasts like 10 minutes, a few keys are dead, the fan makes a weird sound, so it's 99.9% retired. I still use it every few years when I need an old version of MS Office.


The other thing about AVX2 is that, because of the timing, requiring it in practice also gets you FMA.


Others gave you the general answer, but in OP's line of work they just manually rewrite and tune all of the core algorithms a dozen times for different CPU architectures and dispatch to the most suitable one at runtime. I don't have a link to hand, but IIRC they go a step beyond dispatching based on CPU features and dispatch different code paths for CPUs that have the same features but significantly different instruction costs.

RAD's codecs are expensive, but that's the expertise you're paying for.


A recent example of the "feature detection vs. specific CPUs with different costs for the same features" thing is pext on Zen 2. It's implemented in microcode, and the implementation is so slow that we'd honestly be better off if the chips reported that they did not have the feature.
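
A hypothetical sketch of the kind of workaround that forces (the function is made up; assumes a GCC/Clang recent enough to know the "znver1"/"znver2" names for __builtin_cpu_is, and AFAIK Zen 1 has the same microcoded pext/pdep problem):

  // Don't trust the BMI2 feature bit alone: on Zen 1/Zen 2 pext/pdep are
  // microcoded and very slow, so prefer a software fallback there.
  bool use_hardware_pext() {
      __builtin_cpu_init();
      if (!__builtin_cpu_supports("bmi2"))
          return false;                 // no pext at all
      if (__builtin_cpu_is("znver1") || __builtin_cpu_is("znver2"))
          return false;                 // feature bit set, but not worth using
      return true;
  }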


I recently implemented a runtime for `__builtin_cpu_init()`, `__builtin_cpu_supports()`, and `__builtin_cpu_is()` for x86-64. Using these compiler intrinsics, or a higher level feature such as `[[gnu::cpu_dispatch]]`, you can write functions that behave differently on different CPUs. Fortunately the implementation isn't terribly complex. On x86, it's based around a neat `cpuid` instruction, and other ISAs have similar features.

https://github.com/Cons-Cat/libCat/blob/main/src%2Flibraries...
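
A minimal usage sketch (not the libCat implementation, just how the builtins get used; the path strings are placeholders):

  #include <cstdio>

  static const char* pick_path() {
      __builtin_cpu_init();                       // populate the feature/model data
      if (__builtin_cpu_supports("avx2"))   return "AVX2 path";
      if (__builtin_cpu_supports("sse4.2")) return "SSE4.2 path";
      return "baseline SSE2 path";                // SSE2 is guaranteed on x86-64
  }

  int main() {
      std::printf("selected: %s\n", pick_path());
  }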


> Or do they somehow adapt to the operations supported by the user's CPU?

This is called runtime dispatch. You can do it manually or use a library, like Google Highway. GCC supports multiversioning where you write separate versions of a function and the right one is selected at runtime.

https://github.com/google/highway

https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/Function-Multiv...
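
A sketch of what the multiversioning from the second link looks like in practice (dot is a made-up example and the listed targets are just ones I picked): the compiler emits one clone per target plus a resolver that picks the best available one when the program loads.

  #include <cstddef>

  __attribute__((target_clones("avx2", "sse4.2", "default")))
  float dot(const float* a, const float* b, std::size_t n) {
      float acc = 0.0f;
      for (std::size_t i = 0; i < n; ++i)
          acc += a[i] * b[i];           // each clone may vectorize differently
      return acc;
  }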


> Do they just target the lowest common denominator of operations? Or do they somehow adapt to the operations supported by the user's CPU?

Mostly the former, some specialized software does the latter. The lowest common denominator is called the baseline, and it differs over time and between distributions. Debian for example still supports x86-64-v1 (the original 64-bit extension to x86), but RHEL 10 will require x86-64-v3, which includes SSE4 and AVX2 support.
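
For what the baseline means at compile time, a small sketch (the macros are the ones GCC/Clang predefine; which branch you get depends on the -march baseline the package was built with, e.g. -march=x86-64-v3):

  #include <cstdio>

  int main() {
      // Compile-time view of the baseline: these predefined macros reflect
      // the features the compiler may assume on every target machine.
  #if defined(__AVX2__)
      std::puts("built with an AVX2 baseline (x86-64-v3 or higher)");
  #elif defined(__SSE4_2__)
      std::puts("built with an SSE4.2 baseline (x86-64-v2 or higher)");
  #else
      std::puts("built with the original x86-64 baseline (SSE2 only)");
  #endif
  }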


It is very easy to rebuild your own kernel from the exact version you downloaded, using the same sources. The kernel build process allows you to tailor the build to exactly the processor and hardware you have. (Of course, there's some assembly in there; I'm not sure how much of it is #ifdef'ed for tuning.)

In the first years of Linux this is what you did; it's the way it worked more or less by default with Slackware, the first popular "distro". Red Hat and Debian came out and streamlined away from it, but it's still easy to do on those systems. I don't use Debian much, but with RH/Fedora you download the source RPM and use rpmbuild to unpack and build it, then you can go poke around.

I say it's easy, and it is, but it is a bit... not fussy, but detailed, very detailed. You answer many questions, most of which you'd need to do a little research for, but you can explore the ones you want and accept the defaults for the rest.

Caveat: I have not done this in half a dozen years; I hope it still works.


> the lowest common denominator of operations?

Note that in recent years the chosen LCD for some distros has changed - they're starting to target the v2 feature set rather than the original.

See https://developers.redhat.com/blog/2021/01/05/building-red-h...

> Do dynamic languages (Javascript, Python, PHP...) get a speed boost because they can compile just in time and use all the features of the user's CPU?

Dynamically-typed languages can't benefit from this at all (they may include a C library that uses runtime dispatch though). Statically-typed JIT'ed languages like Java can (and you see occasional "look, Java is faster than C" benchmarks citing this), but only if you avoid classes and use only arrays. C# can do better than Java but still suffers from its Windows-centric history.


> but only if you avoid classes and use only arrays

Please do look into the kind of codegen emitted by OpenJDK and .NET before assuming this. It's a bit difficult with -XX:+PrintAssembly and much easier with DOTNET_JitDisasm='pattern'/Disasmo/NativeAOT+Ghidra. Once you do, you will clearly see how the exact set of ISA extensions influences instruction selection for all sorts of operations like stack zeroing, loads/stores, loop vectorization (automatic or manual), etc. .NET has extensive intrinsics and portable SIMD APIs that use effectively static dispatch even if the path is still picked at runtime, but just once during JIT compilation.

> still suffers from its Windows-centric history.

This is provably wrong, especially in performance-related scenarios.


In non-performance contexts, C# still suffers from the fact that I can't just reach out and install it from any random distro. The only other major languages with a comparable problem are Kotlin and Swift, which suffer from a similar association with Android and Mac OS, respectively.


You can, except on Debian, but that's Debian for you.

  sudo apt/dnf install dotnet-sdk-8.0
For Debian use this: https://learn.microsoft.com/en-us/dotnet/core/install/linux-...

It is a better user experience than dealing with C/C++ or Java tooling too.

For shipping packages you don’t even need this since you can just publish them as self-contained or as native binaries.


Just to add, Debian has a nice alternatives system that can tailor the correct version of libraries to your specific system, which happens for a few performance-sensitive ones.

But yeah, it's mostly code compiled to the lowest common spec, and a bit of code with dynamic dispatching.


I suspect Intel uses 32x32b multipliers instead of his theorised 16x16b, just that there's only one in every second lane. It lines up more closely with VPMULLQ, and it seems odd that PMULUDQ would be one uOp vs PMULLD's two.

PMULLD is probably just doing 2x PMULUDQ and discarding the high bits.

(I tried commenting on his blog but it's awaiting moderation - I don't know if that's ever checked, or just sits in the queue forever)


Makes sense to me. I have some code that uses a lot of mullo, so I get to pay twice the latency compared to if I wanted full multiplies...


Found a bug in the article.

Maximum for signed bytes is +127, not +128. Minimum is correct, it's -128.


BTW, this asymmetry makes unary negation in C an unexpected source of Undefined Behavior.
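
A minimal illustration (using int, since smaller types get promoted before negation, so the classic trap is INT_MIN rather than SCHAR_MIN):

  #include <climits>

  int negate(int x) {
      // If x == INT_MIN, the mathematical result +2147483648 does not fit
      // in int, so the negation overflows: undefined behavior.
      return -x;
  }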


You can always tell when someone counts on their fingers.


It's a shame that SIMD is still a dark art. I've looked at writing a few simple algorithms with it, but I have to do it in my own time as it would be difficult to justify to my employer. I do know that gcc is generally terrible at auto-vectorising code; clang is much better but far from perfect. Using intrinsics directly will just lead to code that's unmaintainable by others not versed in the dark art. Even wrappers over intrinsics don't help much here. I feel there's a lot of efficiency being left on the table because these instructions aren't being used more.


The problem is that the different SIMD instruction sets are genuinely... different. The basics of “8-bit unsigned add” and similar are possible to abstract over, but for a lot of cases, you may have to switch your entire algorithm around between different CPUs to get reasonable performance (or even gain over the scalar code at all). There's no way a compiler or SIMD abstraction library will do that for you.


Re: SIMD

Suggest you look at the Julia language: high-level, but still capable of C-like speed.

It has built-in support for SIMD (and GPU) processing.

Julia is designed to support scientific computing, with a growing library ecosystem spanning different domains.

https://docs.julialang.org/en/v1/


> PMADDUBSW produces a word result which, in turns out, does not quite work. The problem is that multiplying unsigned by signed bytes means the individual product terms are in range [-128*255, 128*255] = [-32640,32640]. Our result is supposed to be a signed word, which means its value range is [-32768,32767]. If the two individual products are either near the negative or positive end of the possible output range, the sum overflows.

can someone explain this to me? isn't 32640 < 32767? how's this an overflow?


The output of the instruction is, for each 16-bit lane, the sum of two products of one i8 and one u8.

32640 * 2 > 32767

As an aside, the quoted section of the article seems to have an error. The maximum value of an i8 is 127 and the maximum value of one of these products is 32385.
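
A scalar model of one lane makes the arithmetic concrete (a sketch using the corrected maxima; IIRC the real instruction saturates the sum rather than wrapping, which is why it "does not quite work" when you need the exact value):

  #include <cstdint>
  #include <cstdio>

  int main() {
      std::uint8_t u0 = 255, u1 = 255;  // unsigned inputs at their maximum
      std::int8_t  s0 = 127, s1 = 127;  // signed inputs at their maximum (+127)
      int sum = u0 * s0 + u1 * s1;      // 32385 + 32385 = 64770
      std::printf("%d does not fit in [%d, %d]\n", sum, INT16_MIN, INT16_MAX);
  }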


Maybe it's just me this morning, but for some reason I found it a very hard read for a text about CPU instructions. It feels like it loads you up with details for ages.


New to ryg blog posts? :)


Not sure what was so wrong with that or why people like it so much, but yeah.



