Interesting to compare the SWAR approach with 64-bit register, a similar approach with 128-bit register (SSE-2), and string search instructions (SSE-4.2). This implementation in ClickHouse can be used as a reference:
Sometimes you can do this in two instructions per 16 / 32 / 64 chars (sse / avx / avx-512 resp): shuffle & compare.
The gist of it is
lookup[hash(char)] == char
where 'lookup[hash(char)]' corresponds to _mm_shuffle_epi8 with a carefully crafted constant and input data, and '==' corresponds to _mm_cmpeq_epi8.
The intrinsic c = _mm_shuffle_epi8(a, b) pretty much computes c[i] = b[a[i] & 0x0F], so if there are no collisions in a[i] & 0x0F, you can construct a special b so that
Unfortunately there's no 8-bit srli, and if you use the 16-bit srli, you could end up shifting a 1 into the high bit of the low 8 bits of your 16-bit word. pshufb treats a 1 in the high bit as a signal to map the input to 0 instead of using the lower 4 bits as a lookup table. So the full thing ends up being (for AVX-2):
_mm256_srli_epi16 // by 1
_mm256_and_si256 // to mask off the high bit
_mm256_shuffle_epi8 // b[a[i] & 0x0F]
_mm256_cmpeq_epi8 // a[i] ==
This is still pretty good! If you end up needing more instructions to get map your chosen bytes to 16 unique low nibbles though, some other technique may pull ahead. See here for some techniques: http://0x80.pl/articles/simd-byte-lookup.html
Note that the above article is a bit too conservative about the applicability of some of these techniques and does not end up telling you to do cmpeq(input, pshufb(constant, input)) even though this is what you want if all the bytes of interest are <128 (i.e. they are ascii characters) which is often true.
I wrote a Julia package to do this some time ago (https://github.com/jakobnissen/ScanByte.jl). It selects the right algorithm at compile time depdending on the chosen byte set and the available instructions.
It turns out that for specific sets, you can significantly better than the generic algoritm. For example, to find the location of any byte in `abc`, this is the inner loop generated by Julia, checking 32 bytes per iteration:
This is super cool! As a user of software that is slow, I'd really like these techniques to be used more broadly, and while I have plans to tell people about them I don't have much hope for "telling people about techniques they can use in low-level languages" translating into fast parsers in software I spend my time waiting for anytime soon. Consider e.g. that simdjson has not been wrapped and highly adopted in many high-level languages. Your thing just generates the correct instructions without the user having to be a knower or a user of a library implementing a specific parser in terms of these techniques. Great!
This is actually undefined behaviour. Due to integer promotion rules, v will be promoted to an int64, which is signed, so if it is 128 or more, the product would result in signed overflow. You need to explicitly cast it to an uint64.
Can't you delay the swap_bytes_if_big_endian call until you have actually found a word with a match since the byte order only matters for index_of_first_set_byte?
There is a good guide on how to use SIMD instructions? I would like to learn more about it, but I feel that the documentation assumes that you know what to do.
The web site containing the article has a collection of other articles about this sort of thing. You can also read about this in some parts of Algorithms for Modern Hardware[0]. highload.fun includes some other links about this stuff[1]. In general there isn't a great guide and it's best to get your hands dirty. highload.fun happens to be a good place to do that :)
Edit: looking again, Algorithms for Modern Hardware has a whole chapter on this stuff[2].
If you are actually saturating the machine's capacity to load data, then maybe. You'd need to check whether the alternative branchier code also does this or not. For most modern machines, you'd need to use wider loads before such questions were relevant.
https://github.com/ClickHouse/ClickHouse/blob/a04b38db907b8e...
Obviously, SWAR should be slower (due to register size), but how slower?