Counting bytes faster than you'd think possible

anonymoushn · 2024-07-26T18:01:25.000000Z

My own solution which is ~1ms faster uses some other pattern that was found experimentally, but I cannot seem to get it to go any faster by tuning the parameters, and the #1 spot remains slightly out of reach.

Alexander Monakov has called the attention of the highload Telegram chat to this paper[0], saying:

  Haswell is tricky for memory bw tuning, as even at fixed core frequency, uncore frequency is not fixed, and depends on factors such as hardware-measured stall cycles:

  > According to the respective patent [15], the uncore frequency depends on the stall cycles of the cores, the EPB of the cores, and c-states

  > ... uncore frequencies–in addition to EPB and stall cycles–depend on the core frequency of the fastest active core on the system. Moreover, the uncore frequency is also a target of power limitations.

So one wonders if it's not really a matter of reading the RAM in the right pattern to appease the prefetchers but of using values in the right pattern to create the right pattern of stalls to get the highest frequency.

[0]: https://tu-dresden.de/zih/forschung/ressourcen/dateien/proje...

sYnfo · 2024-07-26T20:53:35.000000Z

FYI vien [0] figured out that simply compiling with "-static -fno-pie" and _exit(0)-ing at the end puts the solution presented here to 15000 points and hence #4 on the leaderboard. Pretty funny.

[0] https://news.ycombinator.com/user?id=vient

a_t48 · 2024-07-27T00:21:41.000000Z

Optimizing the leftovers loop to

    #pragma clang loop vectorize(enable)
    #pragma clang loop interleave(enable)
    for (; offset < length; offset += 4) {
        const auto x = ((uint32_t\*)start)[offset / 4];
        count += ((x & 0xFF) == 0x7F);
        count += ((x & 0xFF00) == 0x7F00);
        count += ((x & 0xFF0000) == 0x7F0000);
        count += ((x & 0xFF000000) == 0x7F000000);
    }

also gives some points. It'd probably be more if I could be bothered to break apart your assembly. :)

dinobones · 2024-07-26T19:05:54.000000Z

Is there a path forward for compilers to eek out these optimization gains eventually? Is there even a path?

550x gains with some C ++ / mixed gnarly low level assembly vs standard C++ is pretty shocking to me.

monktastic1 · 2024-07-26T21:21:15.000000Z

FYI, "eke." "Eek" is an expression of alarm or surprise.

bozey07 · 2024-07-27T23:10:04.000000Z

C++ alarms and surprises me, to be fair.

vient · 2024-07-26T19:19:32.000000Z

Note that "standard C++" solution uses std::cin while optimized one uses mmap - completely different things, a lot of speed comes just from that. Would've been nice to compare with solution having optimized input and otherwise standard summing loop.

anonymoushn · 2024-07-26T19:26:19.000000Z

For a solution that contains this stuff

  const u8 *file_lo;
  file_lo = (const u8*)mmap(0,250000000ull,PROT_READ,MAP_PRIVATE|MAP_POPULATE,0,0);
  const u8 *file_hi = file_lo + 250000000ull;
  u64 count = 0;
  while (file_lo < file_hi) {
    if (*file_lo == 127) {
        ++count;
    }
    file_lo++;
  }

I got a bit under 54ms. The solution in the article runs in a bit under 16ms.

vient · 2024-07-26T19:44:05.000000Z

Nice. Did some quick tests with your code on site, got score of ~34000 - best solution is around 14700, so this one is only 2.3 times slower.

Used clang with -Ofast -march=native -static. Funnily, gcc gets only 54000 with the same options, 1.6 times slower.

vient · 2024-07-26T19:58:08.000000Z

Wow, changing `count` type from uint64_t to uint32_t or int radically changes results - now gcc gets 26500 and clang gets 25000, that's just 1.7 times slower than current best solution.

So you can get 25k with following code, clang -Ofast -std=c++17 -march=native -static

    #include <iostream>
    #include <cstdint>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
      auto file_lo = (const uint8_t*)mmap(0,250000000ull,PROT_READ,MAP_PRIVATE|MAP_POPULATE,STDIN_FILENO,0);
      int count = 0;
      for (uint32_t i = 0; i < 250000000; ++i) {
        if (file_lo[i] == 127) {
            ++count;
        }
      }
      std::cout << count << std::endl;
      _exit(0);
      return 0;
    }

sYnfo · 2024-07-26T20:04:30.000000Z

Neat! I'll add the best solution without explicit SIMD/asm in this thread to the post after I wake up, it's a great datapoint.

_a_a_a_ · 2024-07-27T07:16:40.000000Z

Bit rusty here but what if you replaced

    if (*file_lo == 127) {
        ++count;

with

    count += (*file_lo == 127);

That might save you the occasional branch mis-prediction, and might possibly open up some hardware-level loop optimisations. Any difference?

maxbond · 2024-07-27T06:35:06.000000Z

Usually, it's fair game to use all of the information presented in an exam-style question to derive your answer.

With that in mind, I propose the following solution.

`print(976563)`

anonymoushn · 2024-07-27T06:52:32.000000Z

You can submit that, it will fail.

lumb63 · 2024-07-26T19:58:10.000000Z

Does anyone have any tips for similar wizardry-level SIMD optimization on ARM?

anonymoushn · 2024-07-26T20:23:01.000000Z

If you learn AVX2 programming via highload, my impression is that NEON is quite similar. The main difference is the lack of movemask. You can read these[0] articles[1] about what to do instead.

For SVE, prior to very recent versions of SVE, there was no tblq (pshufb equivalent) so I didn't have much hope for using it for general-purpose programming, though of course it would be fine for stuff like TFA.

[0]: https://community.arm.com/arm-community-blogs/b/infrastructu...

[1]: https://www.corsix.org/content/whirlwind-tour-aarch64-vector...

TinkersW · 2024-07-26T22:26:07.000000Z

It isn't as wide on ARM so the gains will be smaller(most ARM is only 128 bit wide neon)

rini17 · 2024-07-26T17:58:18.000000Z

Can this optimization be applied to matmult for us, critters who are running llama on cpu? XD

twoodfin · 2024-07-26T18:18:54.000000Z

I don't think this really helps: It's a trick to drive maximum possible memory bandwidth in service of a single executing thread which can process data as fast as it's being delivered.

A parallel matrix multiply running across every core shouldn't have any trouble maximizing memory bandwidth utilization.

owlbite · 2024-07-26T18:27:51.000000Z

KV-cache based LLM inference is normally significantly memory bound on the matrix-vector multiply. This is (part of) why the quantization-based approaches are so popular.

twoodfin · 2024-07-26T19:14:36.000000Z

It's memory bound and single threaded?

rini17 · 2024-07-26T19:38:29.000000Z

I found it memory bound so that it was fastest with 6 threads on my 8 thread xeon.

twoodfin · 2024-07-26T22:52:39.000000Z

Right: So this trick probably doesn’t help much. You only need prefetch when there isn’t enough fetch otherwise.

_a_a_a_ · 2024-07-27T07:06:08.000000Z

"The solution presented here is ~550x faster than the following naive program."

   ... std::cin >> v; ...

Oh come on! That's I/O for every item, I'm surprised it's not even slower.

gpderetta · 2024-07-27T17:05:56.000000Z

Cin is buffered, do it is not technically doing I/O for every time. It is still dog slow of course.

_a_a_a_ · 2024-07-28T11:24:31.000000Z

Oops, buffered of course, you're right.

BoardsOfCanada · 2024-07-27T07:50:52.000000Z

Yeah, how does the proposed solution solve the problem as stated: "Print the number of bytes whose value equals 127 in a 250MB stream of bytes uniformly sampled from [0, 255] sent to standard input."

anonymoushn · 2024-07-27T07:55:07.000000Z

When stdin is a file, as it is in this case, you can mmap it.

Iwan-Zotow · 2024-07-27T22:41:32.000000Z

technically not an equivalent problem - you can't map a stream

anonymoushn · 2024-07-28T01:06:57.000000Z

The whole problem statement isn't equivalent to the substring of the problem statement that GP quoted? I agree. For example, the whole problem statement includes a part that specifies that stdin is a file stored in hugetlbfs.

TacticalCoder · 2024-07-26T21:06:53.000000Z

Le met hazard a guess: that blog post was not written by a LLM!?