Is the vector method really faster than mask/subtract/compare? They never really...

Retr0id · 2024-05-22T18:18:53 1716401933

I haven't benchmarked it, but almost certainly yes - The unpacking shouldn't be that big of a deal, but three branches vs 0 branches likely is.

And, the function as a whole is much more vectoriseable - if its in an inner loop, the compiler ought to be able to inline it and autovectorise the loop.

slaymaker1907 · 2024-05-22T19:10:46 1716405046

You can reduce the branches by just eliminating the early return returns. And short circuits, but you can use bitwise and instead to encourage the compiler not to generate extra branches (the compiler is generally free to do this anyway as there isn't any side effect in the example code which would prohibit optimization). Therefore, what we should really compare the improved version against is:

bool IsEveryComponentGreaterThanOrEqual(uint16_t x, uint16_t y) { int rv; auto xr = x & 0xF100; auto yr = y & 0xF100; rv = xr >= yr;

auto xg = x & 0x07E0; auto yg = y & 0x07E0; rv &= xg >= yg;

auto xb = x & 0x001F; auto yb = y & 0x001F; rv &= xb >= yb;

return rv != 0; }

According to the Compiler Explorer, the one presented in the article still has far fewer instructions on x86_64. Interestingly, the optimization of eliminating the bitshifts doesn't change how it is compiled which says something about trusting your compiler for some of this stuff. https://godbolt.org/z/zWf5GnPPb

However, one more interesting thing is that the modified version I made is nearly identical in size on ARMv8 for the one in the article vs my version (12 vs 11 instructions). https://godbolt.org/z/d4TPqGhKe

granularity · 2024-05-22T22:07:43 1716415663

Thanks for the links! Seems like we don't need to mask prior to the first compare, since the low order bits only matter if the high order bits are equal (in which case, we can also safely compare lower bits). Or am I missing something?

  bool IsEveryComponentGreaterThanOrEqual(uint16_t x, uint16_t y) {
    int rv = x >= y;
    auto xg = x & 0x07E0; auto yg = y & 0x07E0; rv &= xg >= yg;
    auto xb = x & 0x001F; auto yb = y & 0x001F; rv &= xb >= yb;
    return rv != 0;
  }

snarg · 2024-05-22T19:07:56 1716404876

Those branches are unnecessary and easily avoided.

yau8edq12i · 2024-05-22T19:01:14 1716404474

Based on my quick and dirty benchmarking, it's similar in terms of performance. The vector comparison is consistent, the unpacking comparison is about 1.5x faster if it can exit early, but 4x slower otherwise. Here are the results: https://pastebin.com/QL4sNzhj (the code is straight from the article, the benchmarking done with google benchmark).

angiosperm · 2024-05-22T20:52:32 1716411152

It is much worse than just the three extra test-and-branch operations, because they will be serialized through one ALU.