Hacker News new | past | comments | ask | show | jobs | submit login
SWAR: Find any byte from set (0x80.pl)
78 points by g0xA52A2A on March 8, 2023 | hide | past | favorite | 17 comments



Interesting to compare the SWAR approach with 64-bit register, a similar approach with 128-bit register (SSE-2), and string search instructions (SSE-4.2). This implementation in ClickHouse can be used as a reference:

https://github.com/ClickHouse/ClickHouse/blob/a04b38db907b8e...

Obviously, SWAR should be slower (due to register size), but how slower?


Sometimes you can do this in two instructions per 16 / 32 / 64 chars (sse / avx / avx-512 resp): shuffle & compare.

The gist of it is

    lookup[hash(char)] == char
where 'lookup[hash(char)]' corresponds to _mm_shuffle_epi8 with a carefully crafted constant and input data, and '==' corresponds to _mm_cmpeq_epi8.

The intrinsic c = _mm_shuffle_epi8(a, b) pretty much computes c[i] = b[a[i] & 0x0F], so if there are no collisions in a[i] & 0x0F, you can construct a special b so that

    b[hash('@')] = '@'
    b[hash('/')] = '/'
    b[hash('?')] = '?'
    b[hash('\\')] = '\\'
    b[x] = 0x00 # otherwise
Unfortunately in this case, it doesn't work, since `/` and `?` have a collision, they both map to `0x0F`.

If you would just search for say @, / and \, it would be possible, and shuffle + cmp is sufficient.


Shift them to the right one bit :-)


Unfortunately there's no 8-bit srli, and if you use the 16-bit srli, you could end up shifting a 1 into the high bit of the low 8 bits of your 16-bit word. pshufb treats a 1 in the high bit as a signal to map the input to 0 instead of using the lower 4 bits as a lookup table. So the full thing ends up being (for AVX-2):

  _mm256_srli_epi16   // by 1
  _mm256_and_si256    // to mask off the high bit
  _mm256_shuffle_epi8 //         b[a[i] & 0x0F]
  _mm256_cmpeq_epi8   // a[i] ==
This is still pretty good! If you end up needing more instructions to get map your chosen bytes to 16 unique low nibbles though, some other technique may pull ahead. See here for some techniques: http://0x80.pl/articles/simd-byte-lookup.html

Note that the above article is a bit too conservative about the applicability of some of these techniques and does not end up telling you to do cmpeq(input, pshufb(constant, input)) even though this is what you want if all the bytes of interest are <128 (i.e. they are ascii characters) which is often true.


All modern CPU architectures have builtin fast string compare instructions.

Here is a table from the Intel manual (bolded are the relevant compare instructions):

https://msdk.intel.com/en-us/library/cc286809.aspx

> SSE-4.2: Compare strings using dqa (32-bit mode), dqq (64-bit mode), dqu, dsx

> SSE-3: Compare strings using cmpsdx, cmpsd, scansigndx, scansignd

> SSE-2: Compare strings using pmaxsw, pand, pmin, pmins

> SSE-1: Compare strings using pandn, pavg, pandnpand, pandnpandn, pandn, pandnpandndivr, ...


If you want to write a fast parser you mostly don't end up using these. You use pshufb, as detailed in a sibling comment by stabbles.


But this is not strictly speaking a string compare operation, I suppose.


I wrote a Julia package to do this some time ago (https://github.com/jakobnissen/ScanByte.jl). It selects the right algorithm at compile time depdending on the chosen byte set and the available instructions.

It turns out that for specific sets, you can significantly better than the generic algoritm. For example, to find the location of any byte in `abc`, this is the inner loop generated by Julia, checking 32 bytes per iteration:

    L48:
     vmovdqu 8(%rsi,%rcx), %ymm1
     vpshufb %ymm1, %ymm0, %ymm2
     vpcmpeqb %ymm2, %ymm1, %ymm1
     vpmovmskb %ymm1, %edx
     testl %edx, %edx
     jne L83
     addq $32, %rcx
     cmpq %rcx, %rax
     jae L48


This is super cool! As a user of software that is slow, I'd really like these techniques to be used more broadly, and while I have plans to tell people about them I don't have much hope for "telling people about techniques they can use in low-level languages" translating into fast parsers in software I spend my time waiting for anytime soon. Consider e.g. that simdjson has not been wrapped and highly adopted in many high-level languages. Your thing just generates the correct instructions without the user having to be a knower or a user of a library implementing a specific parser in terms of these techniques. Great!


    auto broadcast = [](uint8_t v) -> uint64_t { return 0x101010101010101 * v; };
This is actually undefined behaviour. Due to integer promotion rules, v will be promoted to an int64, which is signed, so if it is 128 or more, the product would result in signed overflow. You need to explicitly cast it to an uint64.


Instead of a cast, you can also use a `u` suffix on the constant to force it to be unsigned. Then the `v` will be promoted to the same size .


Can't you delay the swap_bytes_if_big_endian call until you have actually found a word with a match since the byte order only matters for index_of_first_set_byte?


There is a good guide on how to use SIMD instructions? I would like to learn more about it, but I feel that the documentation assumes that you know what to do.


The web site containing the article has a collection of other articles about this sort of thing. You can also read about this in some parts of Algorithms for Modern Hardware[0]. highload.fun includes some other links about this stuff[1]. In general there isn't a great guide and it's best to get your hands dirty. highload.fun happens to be a good place to do that :)

Edit: looking again, Algorithms for Modern Hardware has a whole chapter on this stuff[2].

[0]: https://en.algorithmica.org/hpc/algorithms/prefix/

[1]: https://github.com/Highload-fun/platform/wiki

[2]: https://en.algorithmica.org/hpc/simd/intrinsics/


The HTML says it's in Polish but it's clearly in English. Such disrespect!


If the bottleneck is memory access, then does this approach help?


If you are actually saturating the machine's capacity to load data, then maybe. You'd need to check whether the alternative branchier code also does this or not. For most modern machines, you'd need to use wider loads before such questions were relevant.




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: