I wouldn't hold my breath, though - proper support for a high-level portable SIMD abstraction requires quite a lot of compiler complexity, both because of how wide (heh) the API surface of SIMD extensions is in most ISAs, and because of the details you have to get right to keep data in the appropriate (vector and/or mask) registers. This, naturally, goes in the complete opposite direction to the design philosophy of Go's compiler. Instead, you are expected to write Go's custom ASM syntax, with byte literals used to encode opcodes the assembler doesn't support natively (which is common).
https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd... (uses platform intrinsics, but showcases that you can go one abstraction level lower, retaining the same Vector128<T> type, if you need to specialize a particular part of your algorithm for a platform, without having to maintain a separate copy for each one)
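To make that concrete, here's a hedged C# sketch (the helper is my own, not from the article) of specializing one hot step per platform while the rest of the algorithm keeps flowing through the same Vector128<byte> values:

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.Arm;
    using System.Runtime.Intrinsics.X86;

    static class ShuffleHelper
    {
        // One algorithm, three lowerings. Assumes each index byte is either
        // < 16 or has its top bit set - the case where PSHUFB, TBL and the
        // portable fallback all agree (out-of-range lanes become zero).
        public static Vector128<byte> ShuffleBytes(Vector128<byte> v, Vector128<byte> idx)
        {
            if (Ssse3.IsSupported)
                return Ssse3.Shuffle(v, idx);                   // PSHUFB on x86
            if (AdvSimd.Arm64.IsSupported)
                return AdvSimd.Arm64.VectorTableLookup(v, idx); // TBL on Arm64
            return Vector128.Shuffle(v, idx);                   // portable fallback
        }
    }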
Mojo's effort to bring a portable SIMD abstraction to the Python audience is commendable. I'm looking forward to it being open-sourced so I can try it out!
For anyone curious, the reason I'm mostly talking about C# above is that its Vector API is the most accessible and mature portable SIMD abstraction that ships as part of the standard library / comes out of the box. Compared with more complex alternatives, you really only need to install the SDK and run `dotnet new console` to start working with it.
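For instance, a complete vectorized sum looks like this - a minimal sketch of mine (.NET 8+, plain `dotnet new console` project, no extra packages):

    using System;
    using System.Runtime.Intrinsics;

    Console.WriteLine(SumVectorized(new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 })); // 45

    static int SumVectorized(ReadOnlySpan<int> data)
    {
        var acc = Vector128<int>.Zero;
        int i = 0;
        // 4 ints per iteration on any platform with 128-bit SIMD
        for (; i <= data.Length - Vector128<int>.Count; i += Vector128<int>.Count)
            acc += Vector128.Create(data.Slice(i, Vector128<int>.Count));
        int sum = Vector128.Sum(acc);  // horizontal add of the lanes
        for (; i < data.Length; i++)   // scalar tail
            sum += data[i];
        return sum;
    }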
This is the first time I've heard of HighLoad. It seems really interesting at first glance. I personally find SIMD and ISA/μarch-specific optimizations more rewarding than pure algorithmic challenges (Codeforces and such).
Though Haswell seems like a pretty obsolete platform to optimize for at this point. Even Skylake will be a decade old next year.
Realistically, beyond Haswell there hasn't been a ton of advancement in SIMD. Haswell introduced AVX2, which is what this blog post uses. AVX-512 is certainly more powerful, but it's not even available in the majority of Intel CPUs, even brand-new ones.
AVX-512 has been ubiquitous on Intel server CPUs for a long time, and servers are the primary application for AVX-512 - most people don't run high-performance, throughput-oriented code on consumer-grade CPUs with no ECC. AVX-512 is also a markedly better ISA than AVX2, aside from being wider.
AVX-512 is no longer ubiquitous on Intel servers; these days it is ubiquitous only on new AMD servers.
Even earlier, there were cheap Intel servers with CPUs using Atom cores, for example the Denverton, Snow Ridge, Denverton Refresh, Snow Ridge Refresh, Parker Ridge and Arizona Beach series of server CPUs. None of these supported AVX-512, and many did not even support AVX.
However, now that the Sierra Forest server CPUs have launched (to be followed next year by the Clearwater Forest server CPUs), the Atom cores have expanded up to the biggest Intel server CPUs. Such server CPUs are intended for applications where computations using array operations matter less, like Web servers or hosting many small virtual machines, but the fragmentation of the Intel ISA is extremely annoying - especially when AMD demonstrates how the same ISA can be implemented at different levels of performance (by varying the number of vector pipelines and the maximum achievable clock frequency), both in laptop CPUs and in desktop/server CPUs, and both in compact cores with low clock frequencies and in big cores with high clock frequencies.
At least for me, the lack of AVX-512 support is what made me stop buying Intel CPUs some years ago. There are features of Intel CPUs that I prefer over AMD's (like TSC deadline), but none of them compensates for the missing AVX-512 support.
The greatest advantage of AVX-512 is not its greater width, but the mask registers and a more complete set of instructions, which simplify many algorithms. So when Intel supports AVX10/256 across all their CPUs, that will partially restore the competitiveness of Intel CPUs - but that is not likely to happen before 2026.
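For a flavor of what the mask registers buy you - per-lane predication - here's a hedged C# illustration (my code; the portable API does not expose k-registers directly, but compare + ConditionalSelect is the idiom that a JIT can lower to a k-register compare plus a single masked instruction on AVX-512 hardware, versus a compare + blend pair on AVX2):

    using System.Runtime.Intrinsics;

    // Clamp negative lanes to zero, branchlessly, across all 16 int lanes.
    static Vector512<int> ClampNegativesToZero(Vector512<int> v)
    {
        Vector512<int> positive = Vector512.GreaterThan(v, Vector512<int>.Zero);
        return Vector512.ConditionalSelect(positive, v, Vector512<int>.Zero);
    }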
Intel's inconsistent SIMD support across its x86 chips seems crazy to me.
It's already a hard sell to get programmers to write architecture-specific SIMD code.
But Intel then made different subsets of SIMD instructions available on different models. And it muddied the terminology by making "AVX-512" ambiguous.
Maybe it was somehow a win in terms of profitability, but it sure made me reluctant to write code to exploit the new instructions.
Yeah, I buy AMD Epyc almost exclusively, primarily because it's so freaking annoying to even figure out which Intel CPU will work for a given application. Plus you can usually get current- or previous-generation Epyc parts new for less than half the MSRP.
As part of that, Haswell brought FMA support, which was a boon to those of us doing a lot of multiplication and addition (it made those workloads twice as fast).
It does depend a little on the ratio of additions to multiplications you had. Haswell dropped down to one execution unit capable of floating-point addition, so for addition-heavy workloads you basically had to replace half the additions with FMA instructions just to keep your old performance from dropping by 2x.
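A hedged sketch of that trick (my code, assuming AVX2 + FMA hardware and .NET 8+; add IsSupported guards in real code): Haswell has one plain FP adder but two FMA ports, and x * 1.0 + acc is just an addition, so splitting the adds across both kinds of units restores throughput:

    using System;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static double SumAdditionHeavy(ReadOnlySpan<double> data)
    {
        var accAdd = Vector256<double>.Zero; // fed through the FP adder
        var accFma = Vector256<double>.Zero; // fed through an FMA unit
        var ones = Vector256.Create(1.0);
        int i = 0;
        for (; i <= data.Length - 8; i += 8)
        {
            accAdd = Avx.Add(accAdd, Vector256.Create(data.Slice(i, 4)));
            accFma = Fma.MultiplyAdd(Vector256.Create(data.Slice(i + 4, 4)), ones, accFma);
        }
        double sum = Vector256.Sum(Avx.Add(accAdd, accFma));
        for (; i < data.Length; i++) sum += data[i]; // scalar tail
        return sum;
    }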
To be clear, it's not dereferencing unmapped memory - I just haven't shown how it's being mapped, because it's a little complex. As I note in the post, you can imagine that I mmap all the necessary addresses at the start of the program.
Given that the input is "integers uniformly sampled from [0, 2³¹−1]", couldn't you use a LUT for the ~99.5% case of 10-, 9- and 8-digit numbers, and have a cold branch handle the much rarer shorter ones?
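Something like this hedged sketch, perhaps (.NET 8+; all names and the table layout are my invention - the masks right-align each number's ASCII digits into a fixed layout, and it assumes 16 readable bytes starting at each number):

    using System;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static class UniformParse
    {
        // One PSHUFB mask per common length (8, 9, 10 digits), right-aligning
        // the ASCII digits into the last `len` lanes; 0x80 zeroes a lane.
        static readonly Vector128<byte>[] ShuffleForLen = BuildTable();

        static Vector128<byte>[] BuildTable()
        {
            var table = new Vector128<byte>[3];
            for (int len = 8; len <= 10; len++)
            {
                Span<byte> idx = stackalloc byte[16];
                idx.Fill(0x80);
                for (int k = 0; k < len; k++)
                    idx[16 - len + k] = (byte)k;
                table[len - 8] = Vector128.Create((ReadOnlySpan<byte>)idx);
            }
            return table;
        }

        public static long ParseOne(ReadOnlySpan<byte> input, int pos, int len)
        {
            if (len >= 8) // 8-10 digits: the ~99.5% case
            {
                var raw = Vector128.Create(input.Slice(pos, 16));
                var aligned = Ssse3.Shuffle(raw, ShuffleForLen[len - 8]);
                // scalar stand-in for the conversion; a real kernel would
                // finish with SIMD multiply-accumulates here too
                long value = 0;
                for (int k = 0; k < 16; k++)
                {
                    byte b = aligned.GetElement(k);
                    value = value * 10 + (b != 0 ? b - '0' : 0);
                }
                return value;
            }
            long v = 0; // cold branch: 1-7 digit numbers are rare
            for (int k = 0; k < len; k++)
                v = v * 10 + (input[pos + k] - '0');
            return v;
        }
    }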
Couldn't you organize the accumulators into 8-byte chunks and leave the upper byte unused? Then you map consecutive digits to those chunks and use 64-bit addition for the accumulation. Overflow between the bytes would then keep the correct result, if you do the shuffles correctly, and you have a full byte of overflow buffer.
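A hedged sketch of the carry-free reading of that idea (my code; the full version would additionally let carries ripple into the spare top byte): one 64-bit add advances all eight byte lanes at once, and flushing every 28 adds (28 × 9 = 252 ≤ 255) keeps every lane exact:

    using System;

    static class SwarAccumulate
    {
        // `chunks` is assumed to be pre-shuffled: one decimal digit (0..9)
        // per byte lane, with the top byte left unused as headroom.
        public static void SumLanes(ReadOnlySpan<ulong> chunks, long[] laneSums)
        {
            ulong acc = 0;
            for (int i = 0; i < chunks.Length; i++)
            {
                acc += chunks[i];     // one add, eight lanes
                if (i % 28 == 27)     // flush before any lane can exceed 255
                    Flush(laneSums, ref acc);
            }
            Flush(laneSums, ref acc); // flush the remainder
        }

        static void Flush(long[] laneSums, ref ulong acc)
        {
            for (int lane = 0; lane < 8; lane++)
                laneSums[lane] += (long)((acc >> (8 * lane)) & 0xFF);
            acc = 0;
        }
    }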
AVX-512 has evolved from the Larrabee New Instructions (2009), passing through Knights Ferry (2010), Knights Corner (2012) and Knights Landing (2016), to reach Skylake Server (2017), whose set of AVX-512 instructions has remained a subset of the instruction sets of all later CPUs with AVX-512 support.
At each step from Larrabee to Skylake Server, some instructions were lost: the initial instruction set was more complete, in order to enable writing efficient GPU algorithms, while later Intel believed that for a general-purpose CPU it could reduce costs by omitting some of those instructions.
(Nevertheless, later they have added many other instructions, some of which may be less useful and more expensive than the original instructions that have been removed.)
Among the original Larrabee instructions that were deleted was addition with unsigned overflow (a.k.a. carry), where the output overflow flags were stored in a mask register, enabling their use in a later conditional SIMD instruction.
Signed overflow can be implemented in hardware with negligible additional complexity (a single gate per result element), so it would have been easy to also give Larrabee/AVX-512 an addition instruction that stores signed-overflow flags in a mask register. Even when only unsigned overflow is available, it is possible to preprocess the operands in such a way that signed overflow can be detected from the unsigned overflow bits, but that requires multiple instructions, slowing the algorithm down a lot.
However, in this problem the numbers being added are non-negative, so the addition with unsigned overflow of the original Larrabee ISA would have been sufficient, had Intel not removed it from AVX-512.
As a minor note, overflow checking for add & sub can be done reasonably efficiently in software by comparing the wrapping result with a saturating one. That works for i8/i16/u8/u16 on x86, and for signed overflow it should be roughly as fast as repurposing hardware unsigned overflow would be (though of course worse than a native instruction would be for unsigned).
And, regardless, this would be at least one more uop in the core loop (and a somewhat-unpredictable branch at that) which you'd still want to avoid.
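For reference, a minimal sketch of that wrapping-vs-saturating trick on u8 lanes (my code, not from the parent): the saturating sum clamps at 255, so any lane where it disagrees with the wrapping sum must have overflowed.

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static Vector128<byte> OverflowMask(Vector128<byte> a, Vector128<byte> b)
    {
        Vector128<byte> wrapped   = Sse2.Add(a, b);         // wraps mod 256
        Vector128<byte> saturated = Sse2.AddSaturate(a, b); // clamps at 255
        return ~Vector128.Equals(wrapped, saturated);       // 0xFF where a lane overflowed
    }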
You could have a separate accumulator for each shuffle, which should allow 28 iterations (and merge those together at the end of the 28-iteration loop by widening to u16, at which point you could have an outer loop accumulating in u16 until that runs out).
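A hedged sketch of that scheme (my code; `digits` is assumed to already hold one decimal digit value 0..9 per byte - producing that layout is the shuffle step):

    using System;
    using System.Runtime.Intrinsics;

    static (Vector256<ushort>, Vector256<ushort>) SumDigits(ReadOnlySpan<byte> digits)
    {
        var lo16 = Vector256<ushort>.Zero;
        var hi16 = Vector256<ushort>.Zero;
        int i = 0;
        while (i + 32 * 28 <= digits.Length)
        {
            var acc8 = Vector256<byte>.Zero;
            for (int inner = 0; inner < 28; inner++, i += 32) // 28 * 9 = 252 <= 255
                acc8 += Vector256.Create(digits.Slice(i, 32));
            // merge by widening to u16; each u16 lane in turn survives ~260 blocks
            (Vector256<ushort> lo, Vector256<ushort> hi) = Vector256.Widen(acc8);
            lo16 += lo;
            hi16 += hi;
        }
        return (lo16, hi16);
    }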
It's worse: Pr[correct output | hard input] = 0, even though they estimate that Pr[correct output | random input] ~ 1. This means that you can't, for example, amplify your success probability by repeating the algorithm a bunch of times and taking the majority vote.
The challenge is to get the right answer. It's much less interesting if you relax the challenge to no longer require the right answer. Here's a really fast approximate answer: 50000000*(2^31-1)/2
Thinking in parallel gets you enormous speed benefits across a wide range of algorithms: https://mcyoung.xyz/2023/11/27/simd-base64/