Optimizing Rabin-Karp Hashing

stabbles · 2024-03-09T08:52:47 1709974367

Regarding "A final optimization", there's another way:

`_mm256_cmpeq_epi32` produces `0xFFFFFFFF`, the article suggests shifting to produce a `1` and then add.

Instead you can interpret `0xFFFFFFFF` as negative one, and subtract. That saves a shift.

Flip the sign when accumulating.

In general I think this is a pretty common counting trick. I don't think those shift operations even exists for epi8, so there you really need to use it to avoid reduction to a narrow register. Also, in the case of epi8 you need to deal with overflow, so the pattern is like this in pseudo code:

    v[1:32] = 0
    total = 0
    for j = 0 to N / 256
        for i = 0 to 255
            v[1:32] -= cmpeq(..., ...)
        end
        total += sum(v)
    end

namibj · 2024-03-09T17:39:15 1710005955

8bit permutations are part of GF-NI; you select which source bits to XOR together for each output bit, and for each output bit whether to start with a 0 or a 1.

pvg · 2024-03-09T04:43:39 1709959419

Short related discussion a month ago: https://news.ycombinator.com/item?id=39254176

perihelions · 2024-03-09T14:14:11 1709993651

I wonder what a designed-for-SIMD rolling hash would look like. Like—just a trivial toy experiment—suppose hash ↤ (hash * input[j]) << 1. That's a (very crappy) rolling-hash function with a window size of sizeof(hash). It's also algebraically equivalent to hash ↤ hash * (input[j] << 1), which is a parallelizable horizontal-product over the input bytevector. This shows that you can keep the "rolling window" property, without the "evaluates serially from left-to-right" requirement: it's possible to rearrange things algebraically so that the "folding" step can be evaluated in any order.

Are there nontrivial constructions with this property?

EdSchouten · 2024-03-09T04:14:59 1709957699

Are there actually practical cases where Rabin-Karp hashing is what dominates the running time of an application? The naive implementation already gives you 0.75 GB/s. Seems pretty fast.

def-lkb · 2024-03-09T08:39:14 1709973554

Compression, synchronization and backup systems often use rolling hash to implement "content-defined chunking", an effective form of deduplication.

In optimized implementations, Rabin-Karp is likely to be the bottleneck. See for instance https://github.com/facebook/zstd/pull/2483 which replaces a Rabin-Karp variant by a >2x faster Gear-Hashing.

mis1988 · 2024-03-09T08:41:37 1709973697

I found it to be a good pedagogical tool. In terms of actual use cases, maybe something like rsync'ing a large directory where only a few files have changed? It starts to get pretty contrived though.

dehrmann · 2024-03-09T07:53:34 1709970814

Maybe implementing `foo LIKE '%bar%'` in a columnar DB backed by SSDs. Even if you're not strictly CPU-bound, there's a scale where you might care about the power use.

mis1988 · 2024-03-07T09:12:27 1709802747

A response to a question posed by Daniel Lemire.