The hardware is not that dated; they specifically target then-modern hardware that has the fast multiplication everyone is on about. And while multiplication has gotten faster, the size of the L1 cache has of course also grown. It is per-core so there is not going to be a whole lot of contention going on. A modern budget CPU has about 32 KB per core, so it's not going to be a squeeze.
I am no expert in math but the algorithm is claimed to be better than another one because it is in a class that is better than the class the other is in. It’s not because a run of 100 shows some distribution.
You have a lot of demands for exhaustive testing but when you are asked to provide the same, it’s all too much to ask. ‘I wouldn’t mind someone plotting graphs’ yeah thanks, I wouldn’t mind someone else doing the work.
Then again, someone else has probably done the work more recently, or more in line with what you want to see. I found this paper in a few minutes of web searching; I'm sure you can spare the time to find a better one.
> It is per-core so there is not going to be a whole lot of contention going on.
That's only true if nothing else happens between requests to the hash table. That might be the case, or it might not. Depending on your workload that might make no difference or totally ruin your performance characteristics.
> I am no expert in math but the algorithm is claimed to be better than another one because it is in a class that is better than the class the other is in. It’s not because a run of 100 shows some distribution.
The problem with this is that while it may well have better characteristics than multiply-and-shift on average, the quality of the distribution of multiply-and-shift hashes can vary by many orders of magnitude depending on the choice of multiplication factor. Put another way: multiplying by 1 and shifting is a perfectly valid multiply-and-shift hash function. It's a very stupid one. The performance characteristics of a hash table doing that are nothing like the performance characteristics of what is proposed in the original article. I have no doubt that the table-based approach will beat a large proportion of multiply-and-shift hashes. But so do other multiply-and-shift hashes, by large factors. As such, without actually comparing against a known set of some of the best multiply-shift hashes, we learn very little about whether or not it'll do well against good ones.
To me, the fact that they chose random factors is very suspicious. Nobody uses random factors. The effort spent on choosing good factors over the years has been very extensive, and even hacky, ad hoc attempts will tend to use large prime numbers.
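For concreteness, here's a minimal sketch (names and constants are mine, not from the paper) of a 64-bit multiply-shift hash, showing how much the factor choice matters: a = 1 is "valid" but collapses regularly spaced keys into a single bucket, while a large odd constant (here the 64-bit golden-ratio constant) spreads the same keys across most of the table.

```python
# Sketch of multiply-shift hashing: h(x) = (a * x mod 2^64) >> (64 - m),
# mapping keys into 2^m buckets. Constant and key set chosen for illustration.
MASK64 = (1 << 64) - 1

def multiply_shift(x: int, a: int, m: int) -> int:
    """Top m bits of the 64-bit product a * x."""
    return ((a * x) & MASK64) >> (64 - m)

def bucket_spread(keys, a, m):
    """How many distinct buckets the keys land in (out of 2**m)."""
    return len({multiply_shift(k, a, m) for k in keys})

keys = list(range(0, 1 << 16, 8))  # regularly spaced keys, a common real-world pattern
m = 10                             # 1024 buckets

# a = 1 keeps only the high bits, which are all zero for small keys:
# every key collapses into bucket 0.
print(bucket_spread(keys, 1, m))

# A large odd factor (2^64 / golden ratio) mixes low bits into the high
# bits, so the same keys spread across most of the 1024 buckets.
print(bucket_spread(keys, 0x9E3779B97F4A7C15, m))
```

Both are multiply-shift hashes in exactly the same "class"; the gap between them is the whole point about factor choice.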
> You have a lot of demands for exhaustive testing but when you are asked to provide the same, it’s all too much to ask. ‘I wouldn’t mind someone plotting graphs’ yeah thanks, I wouldn’t mind someone else doing the work.
I've not made demands for anything. I've pointed out that making a blanket claim that multiplication is bad, when its performance characteristics have changed as much as they have, is unreasonable, and a paper like this tells us pretty much nothing more. It's absolutely reasonable to consider table-based approaches; it's quite possible, even likely, that they'll have desirable properties for various sets of inputs. There is no such thing as a perfect hash function for all inputs: sometimes you care most about pathological worst-case scenarios, sometimes you care about averages, sometimes you know what data you will or won't see. That it performs as well as they've shown means there are almost certainly situations where it will be a good choice. But because of the choices they made in that paper, we can't really tell when and where that would be, and that's a shame.
What is not reasonable is just writing off the use of multiplication on the basis of hardware characteristics that are decades out of date. That doesn't mean you should blindly use it either.
If there's one thing people should know about working with hash tables it's that you should test and measure for the actual type of data you expect to see.
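As a sketch of what "test and measure" might look like in practice (the helper and sample keys are hypothetical, assuming a chained hash table):

```python
from collections import Counter

def chain_stats(keys, hash_fn, buckets):
    """Longest and mean chain length if these keys were inserted into a
    chained hash table with `buckets` slots under `hash_fn`."""
    counts = Counter(hash_fn(k) % buckets for k in keys)
    longest = max(counts.values())
    mean = len(keys) / buckets
    return longest, mean

# Run this against a sample of the *actual* keys your workload produces,
# not synthetic random ones. "user:N" strings stand in for real keys here.
sample = ["user:%d" % i for i in range(10000)]
longest, mean = chain_stats(sample, hash, 1024)
print(longest, mean)
```

If the longest chain is far above the mean for your real key distribution, the hash function is a bad fit for your data regardless of what any benchmark on random inputs says.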