
Once upon a time it was 'brief'... that is to say circa 2001 :D


@bogomipz: The explanation provided by psyc is spot on. Please do note that I've updated the article with explanations for both 'Internal State' and the corresponding 'Finalize' process.
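(For readers who haven't seen the article: here is a minimal, illustrative sketch of what is meant by an internal state that is mixed per byte and then put through a 'finalize' step. This is not the article's code, and the finalizer constants below are arbitrary.)

    #include <cstdint>
    #include <string>

    // Illustrative only: an FNV-1a style per-byte mix followed by a simple
    // xor/shift/multiply 'finalize' step that scrambles the internal state
    // one last time before it is returned as the hash value.
    std::uint32_t toy_hash(const std::string& data)
    {
       std::uint32_t state = 0x811C9DC5u;   // internal state (FNV offset basis)
       for (unsigned char c : data)
       {
          state ^= c;                       // mix the next byte into the state
          state *= 0x01000193u;             // FNV prime
       }
       state ^= state >> 15;                // finalize: extra scrambling of the
       state *= 0x2C1B3C6Du;                // final state (constants illustrative)
       state ^= state >> 12;
       return state;
    }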


@demerphq (Yves Orton):

> "They aren't the same thing at all, and you should not confuse them. "

You're absolutely right. The term "ideal hash function" was what I was after all along - I just couldn't find a formal definition of it. In any case I've updated the article accordingly. Thank-you for your review and comment.


You are welcome. I wanted to add another point: rurban does have a bit of a point. The hash functions you referenced in your article are generally old, low quality, and slow. FNV is neither a good hash, nor is it fast. FNV_yt is very fast, but not the best quality hash function; it fails many tests in SMHasher, including some I personally consider dealbreakers: failing avalanche tests is unacceptable.

Both rurban and I have (independently) spent considerable time testing about 100 hash functions using the hash test suite smhasher. In addition I have greatly enhanced smhasher, and included test results and graphs in my version. Included in my fork is the --stream option so you can test different hash functions in counter mode with tools like the dieharder RNG test suite.

https://github.com/demerphq/smhasher https://github.com/rurban/smhasher
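(For illustration, a hypothetical C++ equivalent of that 'counter mode' idea: hash consecutive counter values and write the raw digests to stdout so they can be piped into dieharder, e.g. ./counter_stream | dieharder -a -g 200. This is only a sketch of the concept, not demerphq's actual --stream implementation, and the mixer below is just a stand-in for whatever hash is under test.)

    #include <cstdint>
    #include <cstdio>

    // Stand-in 64-bit mixer representing the hash function under test.
    static std::uint64_t hash_under_test(std::uint64_t x)
    {
       x ^= x >> 33; x *= 0xFF51AFD7ED558CCDULL;
       x ^= x >> 33; x *= 0xC4CEB9FE1A85EC53ULL;
       x ^= x >> 33;
       return x;
    }

    int main()
    {
       // Hash the counter values 0,1,2,... and emit the raw digests so the
       // byte stream can be consumed by an RNG test suite over a pipe.
       for (std::uint64_t counter = 0; ; ++counter)
       {
          const std::uint64_t h = hash_under_test(counter);
          if (std::fwrite(&h, sizeof(h), 1, stdout) != 1)
             return 0; // downstream consumer closed the pipe
       }
    }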

These are the FNV test results from my version of SMHasher:

FNV1a 32 32 32 FAILED 99 of 183 tests. Avalanche: 17-41, Crc-MultiCollision: 81, 83, 85-86, 88-91, 94, 96-98, 100, Cyclic: 42-52, Hi-Lo: 157-158, HiBit-Null: 150-152, Highbits2: 147-149, Highbits: 144-146, LowBit-Null: 153-155, Lowbits: 142-143, Murmur2-MultiCollision: 101-110, Murmur3A-MultiCollision: 111, 113, 117, 119-120, Murmur3F-MultiCollision: 122-127, 129-130, OverAll: 183, Text: 160-165, TwoBytes: 54-56, 63, Zeroes: 167-168

FNV1a_YT 32 32 32 FAILED 67 of 181 tests. Avalanche: 15-39, Differential: 9-14, Hi-Lo: 154-156, HiBit-Null: 148-150, Highbits2: 145-147, Highbits: 142-144, Lowbits: 139-141, OverAll: 181, Sparse: 74-78, Text: 157-158, 160-163, TwoBytes: 51, 53-61

FNV64 64 64 64 FAILED 66 of 183 tests. Avalanche: 17-41, Cyclic: 43-45, 47, 49, 51-52, Hi-Lo: 157-158, HiBit-Null: 151-152, Highbits2: 148-149, Highbits: 145-146, LowBit-Null: 154-155, Lowbits: 142-143, OverAll: 183, Sparse: 65-67, 69, 71, 73, 75, 77, 79-80, Text: 160-162, 164-165, TwoBytes: 54-56, 58, 60, 62-63, Words: 182, Zeroes: 167-168

You can see a text view of the full list at:

https://github.com/demerphq/smhasher/blob/master/doc/summary...

Also, after reading this thread I added some graphs comparing the performance of other hashes, including some that I have developed using genetic algorithms for use in Perl. Some of those hashes have hash-time performance at the top of the envelope (StadtX is very fast).

https://github.com/demerphq/smhasher/blob/master/doc/graph/F... https://github.com/demerphq/smhasher/blob/master/doc/graph/F... https://github.com/demerphq/smhasher/blob/master/doc/graph/F... https://github.com/demerphq/smhasher/blob/master/doc/graph/F...

You may also be interested in looking at

https://github.com/demerphq/smhasher/blob/master/HashAnatomy...

Good luck!


@demerphq: Thank-you for these stats, and the links. I'll definitely be updating the article with the missing hash functions and adding some of the links you've provided to the links section.

My intent is to over time make the article as comprehensive as possible, but to also keep it accessible to the general public.


Thanks for saying this! It is incredible how widespread the belief still is that FNV is a good hash function, when much better ones exist that are faster even on short keys.


I am only repeating what my tests say. Frankly, given FNV's reputation, I was surprised to see it fail so many tests.

Also I put a lot of effort into improving the quality of testing that smhasher does, so there are many hashes that would pass the old version which will fail the new one.


Wow! How did you create BeagleHash? I took a look at your repos but could only find the end result.


BeagleHash IMO is less interesting than Zaphod64 and StadtX.

With all of them I used genetic algorithm techniques to search the space of possible minimal mix functions to find ones that minimized error from the strict avalanche test.

I will publish my tooling to do this at some point, but it is a bit too hacky to release just now.
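(For context, a rough sketch of the kind of strict-avalanche error metric such a search would minimise - hypothetical, not the actual tooling: flip each input bit in turn and measure how far each output bit is from flipping exactly half of the time. A GA would evolve the shift amounts and multiplier constants of the mix function to drive this error towards zero.)

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstdlib>
    #include <iostream>

    // A candidate mix function; the shifts and constants are what a GA would evolve.
    static std::uint64_t mix(std::uint64_t x)
    {
       x ^= x >> 31; x *= 0x7FB5D329728EA185ULL;
       x ^= x >> 27; x *= 0x81DADEF4BC2DD44DULL;
       x ^= x >> 33;
       return x;
    }

    // Estimate the strict avalanche error: for random inputs, flipping input
    // bit i should flip each output bit j with probability 0.5.  Return the
    // worst deviation from 0.5 over all (i,j) pairs.
    double avalanche_error(unsigned trials = 10000)
    {
       unsigned counts[64][64] = {};
       for (unsigned t = 0; t < trials; ++t)
       {
          const std::uint64_t x = (std::uint64_t(std::rand()) << 32) ^ std::uint64_t(std::rand());
          const std::uint64_t h = mix(x);
          for (unsigned i = 0; i < 64; ++i)
          {
             const std::uint64_t d = h ^ mix(x ^ (1ULL << i));
             for (unsigned j = 0; j < 64; ++j)
                counts[i][j] += unsigned((d >> j) & 1u);
          }
       }
       double worst = 0.0;
       for (unsigned i = 0; i < 64; ++i)
          for (unsigned j = 0; j < 64; ++j)
             worst = std::max(worst, std::fabs(counts[i][j] / double(trials) - 0.5));
       return worst; // the GA's objective: minimise this over candidate mixes
    }

    int main()
    {
       std::cout << "worst avalanche bias: " << avalanche_error() << "\n";
    }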


@Rurban (Reini Urban): wrt your first point, general purpose hash functions being modeled as PRNGs, I believe that statement to be correct. It is definitely the case when analysing hash function behaviour in contexts such as bloom filters, hash tables, etc.

With regards to the definition of perfect hash functions - if you had read the sentence before the one you quoted, you would have noticed that the formal definition of a perfect hash function is given in the article as:

> "In general there is a theoretical hash function known as the perfect hash function for any group of data. The perfect hash function by definition states that no collisions will occur meaning no repeating hash values will arise from different elements of the group."

It then goes on to talk about the practicalities of perfect hash functions (deriving them etc), and then states that, in practice, a perfect hash function is one that produces the fewest collisions. I agree that this is an intentional weakening of the formal definition, but it is generally the most practical one, as there is no guarantee that one can always find a "perfect hash function" for an arbitrary set of values in sub-polynomial time.
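(As a trivial, made-up illustration of the formal definition: for the fixed set { "jan", "feb", "mar" } the function below maps every member to a distinct value, so it is a perfect - in fact minimal perfect - hash for that set, while saying nothing about keys outside the set.)

    #include <stdexcept>
    #include <string>

    // Perfect hash for the fixed set { "jan", "feb", "mar" }: each member of
    // the set maps to a distinct index in [0,2], so no collisions can occur
    // within the set.  Keys outside the set are simply not its concern.
    unsigned month_index(const std::string& key)
    {
       switch (key.empty() ? '\0' : key[0])
       {
          case 'j' : return 0; // "jan"
          case 'f' : return 1; // "feb"
          case 'm' : return 2; // "mar"
          default  : throw std::out_of_range("key not in the fixed set");
       }
    }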

----

wrt the collection of hash functions, take note that the article is broken into the following sections:

1. What is a general purpose hash function

2. Designing and implementing general purpose hash functions

3. Uses of hash functions

4. Example general purpose hash functions (popular ones or written by leaders in the field, and dummies such as myself)

5. Downloads

So could you clarify what it is you meant by:

> " last section is a bit misleading. First he talks about hashing in general and then he introduces his implementations of hash functions"

What specifically was misleading in the article?

----

wrt: > "DJB: "It is one of the most efficient hash functions ever published."

The statement is _not_ incorrect; DJB is indeed one of the most efficient hash functions out there. It may not be the MOST efficient - in other words "the number one hash function" - but it is near the top. I think you may have jumped the gun on that one too.

To further that, regarding your comment: "Nonsense. FNV is considered the generally most efficient"

Let's have a look and see if that's true. The following are pseudo-code representations of the per-byte mixes for each of the hash functions:

DJB: hash = ((hash << 5) + hash) + msg_byte

FNV1: hash = (hash * FNV_prime) ^ msg_byte

Unless you have a really bad shift operation (e.g. no barrel shifter present), I can't see how FNV1 can be significantly more efficient than DJB - but then again, the compiler may well optimize the shift left by 5 plus the add into a multiply by 33 on targets where the multiply is faster than the shift...
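(For reference, a sketch of the two complete per-byte loops being compared, using the commonly published 32-bit constants; both do a comparable amount of work per byte, which is really the point.)

    #include <cstdint>
    #include <string>

    // djb2: one shift and two additions per byte (i.e. hash * 33 + c).
    std::uint32_t djb2(const std::string& data)
    {
       std::uint32_t hash = 5381;
       for (unsigned char c : data)
          hash = ((hash << 5) + hash) + c;
       return hash;
    }

    // FNV-1: one multiply and one xor per byte.
    std::uint32_t fnv1(const std::string& data)
    {
       std::uint32_t hash = 0x811C9DC5u;   // 32-bit FNV offset basis
       for (unsigned char c : data)
       {
          hash *= 0x01000193u;             // 32-bit FNV prime
          hash ^= c;
       }
       return hash;
    }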

But let's step back a bit and consider the following:

Is hash function efficiency that important when neither DJB nor FNV1 possesses good distributive qualities?

https://research.neustar.biz/tag/hash-function/


ad 1: https://stackoverflow.com/questions/31954237/pseudo-random-g... has some answers.

ad 2: In most measurable stats FNV1 is better than DJB: practical throughput, time when inlined, cachegrind cost model (4x!). FNV also produces fewer collisions with average keys. Only in pure cycles/hash is DJB a bit better.

https://github.com/rurban/smhasher#smhasher https://github.com/rurban/perl-hash-stats#perl-hash-stats

> ad 3: Is hash function efficiency that important when neither DJB nor FNV1 possesses good distributive qualities?

It depends entirely on the use case. Usage in hash tables is totally different from usage as a file or block checksum, or in crypto. Besides looking at how many bits of the keys are really used, or how easy it is to create collisions, it also depends on whether the keys are random-length strings, aligned buffers or plain integers. Both DJB and FNV1 are considered bad with regard to their stats, but they are small and fast.


How about something like this: http://www.wykobi.com/faq.html#FAQ06


The C++ String Toolkit Library (StrTk) consists of robust, optimized and portable string processing algorithms for the C++ language. StrTk is designed to be easy to use and integrate within existing code bases. Furthermore the library has a rich set of features that makes light work of any kind of string processing task.

A simple tutorial describing some of the uses of StrTk can be found here: http://www.codeproject.com/KB/recipes/Tokenizer.aspx
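(For example, a minimal sketch using StrTk's parse facility, assuming the strtk::parse interface from the library's documentation: it tokenises a delimited string and converts each token in a single call.)

    #include <iostream>
    #include <string>
    #include "strtk.hpp"

    int main()
    {
       const std::string data = "123,456.789,abc";

       int         i = 0;
       double      d = 0.0;
       std::string s;

       // Tokenise 'data' on ',' and convert each token into the
       // corresponding variable in one call.
       if (strtk::parse(data, ",", i, d, s))
          std::cout << i << " " << d << " " << s << "\n";

       return 0;
    }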


The C++ Mathematical Expression Library (ExprTk) is a simple to use, easy to integrate and extremely efficient mathematical expression parsing and evaluation engine. The parsing engine supports various kinds of functional and logic processing semantics and is easily extendible.
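(A minimal sketch of typical ExprTk usage: register a variable with a symbol table, compile an expression string, then evaluate it as the variable changes.)

    #include <cstdio>
    #include <string>
    #include "exprtk.hpp"

    int main()
    {
       typedef exprtk::symbol_table<double> symbol_table_t;
       typedef exprtk::expression<double>   expression_t;
       typedef exprtk::parser<double>       parser_t;

       double x = 0.0;

       symbol_table_t symbol_table;
       symbol_table.add_variable("x", x);
       symbol_table.add_constants();            // pi, epsilon, inf

       expression_t expression;
       expression.register_symbol_table(symbol_table);

       parser_t parser;
       parser.compile("sin(2 * pi * x) + cos(x / 2 * pi)", expression);

       for (x = -1.0; x <= 1.0; x += 0.25)
       {
          std::printf("f(%5.2f) = %8.5f\n", x, expression.value());
       }

       return 0;
    }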

