SWAR: Find any byte from set

zX41ZdbW · on March 8, 2023

Interesting to compare the SWAR approach with 64-bit register, a similar approach with 128-bit register (SSE-2), and string search instructions (SSE-4.2). This implementation in ClickHouse can be used as a reference:

https://github.com/ClickHouse/ClickHouse/blob/a04b38db907b8e...

Obviously, SWAR should be slower (due to register size), but how slower?

stabbles · on March 8, 2023

Sometimes you can do this in two instructions per 16 / 32 / 64 chars (sse / avx / avx-512 resp): shuffle & compare.

The gist of it is

    lookup[hash(char)] == char

where 'lookup[hash(char)]' corresponds to _mm_shuffle_epi8 with a carefully crafted constant and input data, and '==' corresponds to _mm_cmpeq_epi8.

The intrinsic c = _mm_shuffle_epi8(a, b) pretty much computes c[i] = b[a[i] & 0x0F], so if there are no collisions in a[i] & 0x0F, you can construct a special b so that

    b[hash('@')] = '@'
    b[hash('/')] = '/'
    b[hash('?')] = '?'
    b[hash('\\')] = '\\'
    b[x] = 0x00 # otherwise

Unfortunately in this case, it doesn't work, since `/` and `?` have a collision, they both map to `0x0F`.

If you would just search for say @, / and \, it would be possible, and shuffle + cmp is sufficient.

Sesse__ · on March 9, 2023

Shift them to the right one bit :-)

anonymoushn · on March 9, 2023

Unfortunately there's no 8-bit srli, and if you use the 16-bit srli, you could end up shifting a 1 into the high bit of the low 8 bits of your 16-bit word. pshufb treats a 1 in the high bit as a signal to map the input to 0 instead of using the lower 4 bits as a lookup table. So the full thing ends up being (for AVX-2):

  _mm256_srli_epi16   // by 1
  _mm256_and_si256    // to mask off the high bit
  _mm256_shuffle_epi8 //         b[a[i] & 0x0F]
  _mm256_cmpeq_epi8   // a[i] ==

This is still pretty good! If you end up needing more instructions to get map your chosen bytes to 16 unique low nibbles though, some other technique may pull ahead. See here for some techniques: http://0x80.pl/articles/simd-byte-lookup.html

Note that the above article is a bit too conservative about the applicability of some of these techniques and does not end up telling you to do cmpeq(input, pshufb(constant, input)) even though this is what you want if all the bytes of interest are <128 (i.e. they are ascii characters) which is often true.

hummus_bae · on March 8, 2023

All modern CPU architectures have builtin fast string compare instructions.

Here is a table from the Intel manual (bolded are the relevant compare instructions):

https://msdk.intel.com/en-us/library/cc286809.aspx

> SSE-4.2: Compare strings using dqa (32-bit mode), dqq (64-bit mode), dqu, dsx

> SSE-3: Compare strings using cmpsdx, cmpsd, scansigndx, scansignd

> SSE-2: Compare strings using pmaxsw, pand, pmin, pmins

> SSE-1: Compare strings using pandn, pavg, pandnpand, pandnpandn, pandn, pandnpandndivr, ...

anonymoushn · on March 9, 2023

If you want to write a fast parser you mostly don't end up using these. You use pshufb, as detailed in a sibling comment by stabbles.

amelius · on March 9, 2023

But this is not strictly speaking a string compare operation, I suppose.

jakobnissen · on March 9, 2023

I wrote a Julia package to do this some time ago (https://github.com/jakobnissen/ScanByte.jl). It selects the right algorithm at compile time depdending on the chosen byte set and the available instructions.

It turns out that for specific sets, you can significantly better than the generic algoritm. For example, to find the location of any byte in `abc`, this is the inner loop generated by Julia, checking 32 bytes per iteration:

    L48:
     vmovdqu 8(%rsi,%rcx), %ymm1
     vpshufb %ymm1, %ymm0, %ymm2
     vpcmpeqb %ymm2, %ymm1, %ymm1
     vpmovmskb %ymm1, %edx
     testl %edx, %edx
     jne L83
     addq $32, %rcx
     cmpq %rcx, %rax
     jae L48

anonymoushn · on March 9, 2023

This is super cool! As a user of software that is slow, I'd really like these techniques to be used more broadly, and while I have plans to tell people about them I don't have much hope for "telling people about techniques they can use in low-level languages" translating into fast parsers in software I spend my time waiting for anytime soon. Consider e.g. that simdjson has not been wrapped and highly adopted in many high-level languages. Your thing just generates the correct instructions without the user having to be a knower or a user of a library implementing a specific parser in terms of these techniques. Great!

Asooka · on March 9, 2023

    auto broadcast = [](uint8_t v) -> uint64_t { return 0x101010101010101 * v; };

This is actually undefined behaviour. Due to integer promotion rules, v will be promoted to an int64, which is signed, so if it is 128 or more, the product would result in signed overflow. You need to explicitly cast it to an uint64.

david2ndaccount · on March 9, 2023

Instead of a cast, you can also use a `u` suffix on the constant to force it to be unsigned. Then the `v` will be promoted to the same size .

account42 · on March 9, 2023

Can't you delay the swap_bytes_if_big_endian call until you have actually found a word with a match since the byte order only matters for index_of_first_set_byte?

estebarb · on March 9, 2023

There is a good guide on how to use SIMD instructions? I would like to learn more about it, but I feel that the documentation assumes that you know what to do.

anonymoushn · on March 9, 2023

The web site containing the article has a collection of other articles about this sort of thing. You can also read about this in some parts of Algorithms for Modern Hardware[0]. highload.fun includes some other links about this stuff[1]. In general there isn't a great guide and it's best to get your hands dirty. highload.fun happens to be a good place to do that :)

Edit: looking again, Algorithms for Modern Hardware has a whole chapter on this stuff[2].

[0]: https://en.algorithmica.org/hpc/algorithms/prefix/

[1]: https://github.com/Highload-fun/platform/wiki

[2]: https://en.algorithmica.org/hpc/simd/intrinsics/

xXx_uwu_xXx · on March 9, 2023

The HTML says it's in Polish but it's clearly in English. Such disrespect!

amelius · on March 9, 2023

If the bottleneck is memory access, then does this approach help?

anonymoushn · on March 9, 2023

If you are actually saturating the machine's capacity to load data, then maybe. You'd need to check whether the alternative branchier code also does this or not. For most modern machines, you'd need to use wider loads before such questions were relevant.