I wouldn't hold my breath, though - proper support for a high-level portable SIMD abstraction requires quite a lot of compiler complexity, both because of how wide (heh) the API surface of SIMD extensions is in most ISAs, and because of the details you have to get right to keep data in the appropriate (vector and/or mask) registers. This, naturally, goes in the complete opposite direction to the design philosophy of Go's compiler. Instead, you are expected to write Go's custom ASM syntax, with byte literals used to encode opcodes the assembler doesn't support natively (which is common).
https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd... (uses platform intrinsics, but showcases that you can go one abstraction level lower, retaining the same Vector128<T> type, if you need to specialize a particular part of your algorithm for a platform, without having to maintain a separate copy for each one)
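To make that concrete, here's a hedged C# sketch (the helper is my own, not from the article) of specializing one hot step per platform while the rest of the algorithm keeps flowing through the same Vector128<byte> values:

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.Arm;
    using System.Runtime.Intrinsics.X86;

    static class ShuffleHelper
    {
        // One algorithm, three lowerings. Assumes each index byte is either
        // < 16 or has its top bit set - the case where PSHUFB, TBL and the
        // portable fallback all agree (out-of-range lanes become zero).
        public static Vector128<byte> ShuffleBytes(Vector128<byte> v, Vector128<byte> idx)
        {
            if (Ssse3.IsSupported)
                return Ssse3.Shuffle(v, idx);                   // PSHUFB on x86
            if (AdvSimd.Arm64.IsSupported)
                return AdvSimd.Arm64.VectorTableLookup(v, idx); // TBL on Arm64
            return Vector128.Shuffle(v, idx);                   // portable fallback
        }
    }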
Mojo's effort to bring a portable SIMD abstraction to the Python audience is commendable. I'm looking forward to it being open-sourced so I can try it out!
For anyone curious, the reason I'm mostly talking about C# above is that its Vector API is the most accessible and mature portable SIMD abstraction that ships as part of the standard library / comes out of the box. Compared with more complex alternatives, you really only need to install the SDK and run `dotnet new console` to start working with it.
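For instance, a complete vectorized sum looks like this - a minimal sketch of mine (.NET 8+, plain `dotnet new console` project, no extra packages):

    using System;
    using System.Runtime.Intrinsics;

    Console.WriteLine(SumVectorized(new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 })); // 45

    static int SumVectorized(ReadOnlySpan<int> data)
    {
        var acc = Vector128<int>.Zero;
        int i = 0;
        // 4 ints per iteration on any platform with 128-bit SIMD
        for (; i <= data.Length - Vector128<int>.Count; i += Vector128<int>.Count)
            acc += Vector128.Create(data.Slice(i, Vector128<int>.Count));
        int sum = Vector128.Sum(acc);  // horizontal add of the lanes
        for (; i < data.Length; i++)   // scalar tail
            sum += data[i];
        return sum;
    }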
This is the first time I've heard of HighLoad. It seems really interesting at first glance. I personally find SIMD and ISA/μarch-specific optimizations more rewarding than pure algorithmic challenges (Codeforces and such).
Though Haswell seems like a pretty obsolete platform to optimize for at this point. Even Skylake will be a decade old next year.
Realistically, beyond Haswell there hasn't been a ton of advancement in SIMD. Haswell introduced AVX2, which is what this blog post uses. AVX-512 is certainly more powerful, but it's not even available in the majority of Intel CPUs, even brand-new ones.
AVX-512 has been ubiquitous on Intel server CPUs for a long time, and servers are the primary application for AVX-512 - most people don't run high-performance, throughput-oriented code on consumer-grade CPUs with no ECC. AVX-512 is also a markedly better ISA than AVX2, aside from being wider.
AVX-512 is no longer ubiquitous on Intel servers; these days it is ubiquitous only on new AMD servers.
Even earlier, there were cheap Intel servers with CPUs using Atom cores, for example the Denverton, Snow Ridge, Denverton Refresh, Snow Ridge Refresh, Parker Ridge and Arizona Beach series of server CPUs. None of these supported AVX-512, and many did not even support AVX.
However, now that the Sierra Forest server CPUs have launched (to be followed next year by the Clearwater Forest server CPUs), the Atom cores have expanded up to the biggest Intel server CPUs. Such server CPUs are intended for applications where computations using array operations matter less, like Web servers or hosting many small virtual machines, but the fragmentation of the Intel ISA is extremely annoying - especially when AMD demonstrates how the same ISA can be implemented at different levels of performance (by varying the number of vector pipelines and the maximum achievable clock frequency), both in laptop CPUs and in desktop/server CPUs, and both in compact cores with low clock frequencies and in big cores with high clock frequencies.
At least for me, the lack of AVX-512 support is what made me stop buying Intel CPUs some years ago. There are features of Intel CPUs that I prefer over AMD's (like TSC deadline), but none of them compensates for the missing AVX-512 support.
The greatest advantage of AVX-512 is not its greater width, but the mask registers and a more complete set of instructions, which simplify many algorithms. So when Intel supports AVX10/256 across all their CPUs, that will partially restore the competitiveness of Intel CPUs - but that is not likely to happen before 2026.
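For a flavor of what the mask registers buy you - per-lane predication - here's a hedged C# illustration (my code; the portable API does not expose k-registers directly, but compare + ConditionalSelect is the idiom that a JIT can lower to a k-register compare plus a single masked instruction on AVX-512 hardware, versus a compare + blend pair on AVX2):

    using System.Runtime.Intrinsics;

    // Clamp negative lanes to zero, branchlessly, across all 16 int lanes.
    static Vector512<int> ClampNegativesToZero(Vector512<int> v)
    {
        Vector512<int> positive = Vector512.GreaterThan(v, Vector512<int>.Zero);
        return Vector512.ConditionalSelect(positive, v, Vector512<int>.Zero);
    }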
Intel's inconsistent SIMD support across its x86 chips seems crazy to me.
It's already a hard sell to get programmers to write architecture-specific SIMD code.
But Intel then made different subsets of SIMD instructions available on different models. And it muddied the terminology by making "AVX-512" ambiguous.
Maybe it was somehow a win in terms of profitability, but it sure made me reluctant to write code to exploit the new instructions.
Yeah, I buy AMD Epyc almost exclusively, primarily because it's so freaking annoying to even figure out which Intel CPU will work for a given application. Plus you can usually get current- or previous-generation Epyc parts new for less than half the MSRP.
As part of that, Haswell brought FMA support, which was a boon to those of us doing a lot of multiplication and addition (it made those workloads twice as fast).
It does depend a little on the ratio of additions to multiplications you had. Haswell dropped down to one execution unit capable of floating-point addition, so for addition-heavy workloads you basically had to replace half the additions with FMA instructions just to keep your old performance from dropping by 2x.
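A hedged sketch of that trick (my code, assuming AVX2 + FMA hardware and .NET 8+; add IsSupported guards in real code): Haswell has one plain FP adder but two FMA ports, and x * 1.0 + acc is just an addition, so splitting the adds across both kinds of units restores throughput:

    using System;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static double SumAdditionHeavy(ReadOnlySpan<double> data)
    {
        var accAdd = Vector256<double>.Zero; // fed through the FP adder
        var accFma = Vector256<double>.Zero; // fed through an FMA unit
        var ones = Vector256.Create(1.0);
        int i = 0;
        for (; i <= data.Length - 8; i += 8)
        {
            accAdd = Avx.Add(accAdd, Vector256.Create(data.Slice(i, 4)));
            accFma = Fma.MultiplyAdd(Vector256.Create(data.Slice(i + 4, 4)), ones, accFma);
        }
        double sum = Vector256.Sum(Avx.Add(accAdd, accFma));
        for (; i < data.Length; i++) sum += data[i]; // scalar tail
        return sum;
    }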
To be clear, it's not dereferencing unmapped memory - I just haven't shown how it's being mapped, because it's a little complex. As I note in the post, you can imagine that I mmap all the necessary addresses at the start of the program.
Given that the input is "integers uniformly sampled from [0, 2³¹−1]", couldn't you use a LUT for the ~99.5% case of 10-, 9- and 8-digit numbers, and have a cold branch handle the much rarer shorter ones?
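Something like this hedged sketch, perhaps (.NET 8+; all names and the table layout are my invention - the masks right-align each number's ASCII digits into a fixed layout, and it assumes 16 readable bytes starting at each number):

    using System;
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static class UniformParse
    {
        // One PSHUFB mask per common length (8, 9, 10 digits), right-aligning
        // the ASCII digits into the last `len` lanes; 0x80 zeroes a lane.
        static readonly Vector128<byte>[] ShuffleForLen = BuildTable();

        static Vector128<byte>[] BuildTable()
        {
            var table = new Vector128<byte>[3];
            for (int len = 8; len <= 10; len++)
            {
                Span<byte> idx = stackalloc byte[16];
                idx.Fill(0x80);
                for (int k = 0; k < len; k++)
                    idx[16 - len + k] = (byte)k;
                table[len - 8] = Vector128.Create((ReadOnlySpan<byte>)idx);
            }
            return table;
        }

        public static long ParseOne(ReadOnlySpan<byte> input, int pos, int len)
        {
            if (len >= 8) // 8-10 digits: the ~99.5% case
            {
                var raw = Vector128.Create(input.Slice(pos, 16));
                var aligned = Ssse3.Shuffle(raw, ShuffleForLen[len - 8]);
                // scalar stand-in for the conversion; a real kernel would
                // finish with SIMD multiply-accumulates here too
                long value = 0;
                for (int k = 0; k < 16; k++)
                {
                    byte b = aligned.GetElement(k);
                    value = value * 10 + (b != 0 ? b - '0' : 0);
                }
                return value;
            }
            long v = 0; // cold branch: 1-7 digit numbers are rare
            for (int k = 0; k < len; k++)
                v = v * 10 + (input[pos + k] - '0');
            return v;
        }
    }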
Couldn't you organize the accumulators into 8-byte chunks and leave the upper byte unused? Then you map consecutive digits to those chunks and use 64-bit addition for the accumulation. Overflow between the bytes would then keep the correct result, if you do the shuffles correctly, and you have a full byte of overflow buffer.
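A hedged sketch of the carry-free reading of that idea (my code; the full version would additionally let carries ripple into the spare top byte): one 64-bit add advances all eight byte lanes at once, and flushing every 28 adds (28 × 9 = 252 ≤ 255) keeps every lane exact:

    using System;

    static class SwarAccumulate
    {
        // `chunks` is assumed to be pre-shuffled: one decimal digit (0..9)
        // per byte lane, with the top byte left unused as headroom.
        public static void SumLanes(ReadOnlySpan<ulong> chunks, long[] laneSums)
        {
            ulong acc = 0;
            for (int i = 0; i < chunks.Length; i++)
            {
                acc += chunks[i];     // one add, eight lanes
                if (i % 28 == 27)     // flush before any lane can exceed 255
                    Flush(laneSums, ref acc);
            }
            Flush(laneSums, ref acc); // flush the remainder
        }

        static void Flush(long[] laneSums, ref ulong acc)
        {
            for (int lane = 0; lane < 8; lane++)
                laneSums[lane] += (long)((acc >> (8 * lane)) & 0xFF);
            acc = 0;
        }
    }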
AVX-512 has evolved from the Larrabee New Instructions (2009), passing through Knights Ferry (2010), Knights Corner (2012) and Knights Landing (2016), to reach Skylake Server (2017), whose set of AVX-512 instructions has remained a subset of the instruction sets of all later CPUs with AVX-512 support.
At each step from Larrabee to Skylake Server, some instructions were lost: the initial instruction set was more complete, in order to enable writing efficient GPU algorithms, while later Intel believed that for a general-purpose CPU it could reduce costs by omitting some of those instructions.
(Nevertheless, later they have added many other instructions, some of which may be less useful and more expensive than the original instructions that have been removed.)
Among the original Larrabee instructions that were deleted was addition with unsigned overflow (a.k.a. carry), where the output overflow flags were stored in a mask register, enabling their use in a later conditional SIMD instruction.
Signed overflow can be implemented in hardware with negligible additional complexity (a single gate per result element), so it would have been easy to also give Larrabee/AVX-512 an addition instruction that stores signed-overflow flags in a mask register. Even when only unsigned overflow is available, it is possible to preprocess the operands in such a way that signed overflow can be detected from the unsigned overflow bits, but that requires multiple instructions, slowing the algorithm down a lot.
However, in this problem the numbers being added are non-negative, so the addition with unsigned overflow of the original Larrabee ISA would have been sufficient, had Intel not removed it from AVX-512.
As a minor note, overflow checking for add & sub can be done reasonably efficiently in software by comparing the wrapping result with a saturating one. That works for i8/i16/u8/u16 on x86, and for signed overflow it should be roughly as fast as repurposing hardware unsigned overflow would be (though of course worse than a native instruction would be for unsigned).
And, regardless, this would be at least one more uop in the core loop (and a somewhat-unpredictable branch at that) which you'd still want to avoid.
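For reference, a minimal sketch of that wrapping-vs-saturating trick on u8 lanes (my code, not from the parent): the saturating sum clamps at 255, so any lane where it disagrees with the wrapping sum must have overflowed.

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    static Vector128<byte> OverflowMask(Vector128<byte> a, Vector128<byte> b)
    {
        Vector128<byte> wrapped   = Sse2.Add(a, b);         // wraps mod 256
        Vector128<byte> saturated = Sse2.AddSaturate(a, b); // clamps at 255
        return ~Vector128.Equals(wrapped, saturated);       // 0xFF where a lane overflowed
    }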
You could have a separate accumulator for each shuffle, which should allow 28 iterations (and merge those together at the end of the 28-iteration loop by widening to u16, at which point you could have an outer loop accumulating in u16 until that runs out).
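A hedged sketch of that scheme (my code; `digits` is assumed to already hold one decimal digit value 0..9 per byte - producing that layout is the shuffle step):

    using System;
    using System.Runtime.Intrinsics;

    static (Vector256<ushort>, Vector256<ushort>) SumDigits(ReadOnlySpan<byte> digits)
    {
        var lo16 = Vector256<ushort>.Zero;
        var hi16 = Vector256<ushort>.Zero;
        int i = 0;
        while (i + 32 * 28 <= digits.Length)
        {
            var acc8 = Vector256<byte>.Zero;
            for (int inner = 0; inner < 28; inner++, i += 32) // 28 * 9 = 252 <= 255
                acc8 += Vector256.Create(digits.Slice(i, 32));
            // merge by widening to u16; each u16 lane in turn survives ~260 blocks
            (Vector256<ushort> lo, Vector256<ushort> hi) = Vector256.Widen(acc8);
            lo16 += lo;
            hi16 += hi;
        }
        return (lo16, hi16);
    }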
It's worse: Pr[correct output | hard input] = 0, even though they estimate that Pr[correct output | random input] ~ 1. This means that you can't, for example, amplify your success probability by repeating the algorithm a bunch of times and taking the majority vote.
The challenge is to get the right answer. It's much less interesting if you relax the challenge to no longer require the right answer. Here's a really fast approximate answer: 50000000*(2^31-1)/2
Thinking in parallel gets you enormous speed benefits across a wide range of algorithms: https://mcyoung.xyz/2023/11/27/simd-base64/