Brief Introduction To General Purpose Hash Functions

demerphq · on March 25, 2017

I think you have confused "ideal hash function" and "perfect hash function". A perfect hash function is one that maps a finite set of keys into the same number of buckets with no collisions. They have to be tailor constructed for the keys they are intended for and are not for general purpose use.

An ideal hash function on the other hand produces output that is indistinguishable from choosing the hash value for every key independently and randomly. In particular an ideal hash function should be a PRF, and it should handle any input size, and any keyset.

They aren't the same thing at all, and you should not confuse them. You cannot substitute an ideal hash function for a perfect hash function, or vice versa.

ArashPartow · on March 25, 2017

@demerphq (Yves Orton):

> "They aren't the same thing at all, and you should not confuse them. "

You're absolutely right. The term "idea hash function" was what I was after all along - I just couldn't find a formal definition of it. In any case I've updated the article accordingly. Thank-you for your review and comment.

demerphq · on March 25, 2017

You are welcome. I wanted add another point. rurban does have a bit of a point. The hash functions you referenced in your article are generally old and low quality, and slow. FNV is neither a good hash, nor is it fast. FNV_yt is very fast, but not the best quality hash function, it fails many tests in SMHasher, including some I personally consider dealbreaker: failing avalanche tests is unacceptable.

Both rurban and I have (independently) spent considerable time testing about 100 hash functions using the hash test suite smhasher. In addition I have greatly enhanced smhasher, and included test results and graphs in my version. Included in my fork is the --stream option so you can test different hash functions in counter mode with tools like the dieharder RNG test suite.

https://github.com/demerphq/smhasher https://github.com/rurban/smhasher

These are the FNV test results from my version of SMHasher:

FNV1a 32 32 32 FAILED 99 of 183 tests. Avalanche: 17-41, Crc-MultiCollision: 81, 83, 85-86, 88-91, 94, 96-98, 100, Cyclic: 42-52, Hi-Lo: 157-158, HiBit-Null: 150-152, Highbits2: 147-149, Highbits: 144-146, LowBit-Null: 153-155, Lowbits: 142-143, Murmur2-MultiCollision: 101-110, Murmur3A-MultiCollision: 111, 113, 117, 119-120, Murmur3F-MultiCollision: 122-127, 129-130, OverAll: 183, Text: 160-165, TwoBytes: 54-56, 63, Zeroes: 167-168

FNV1a_YT 32 32 32 FAILED 67 of 181 tests. Avalanche: 15-39, Differential: 9-14, Hi-Lo: 154-156, HiBit-Null: 148-150, Highbits2: 145-147, Highbits: 142-144, Lowbits: 139-141, OverAll: 181, Sparse: 74-78, Text: 157-158, 160-163, TwoBytes: 51, 53-61

FNV64 64 64 64 FAILED 66 of 183 tests. Avalanche: 17-41, Cyclic: 43-45, 47, 49, 51-52, Hi-Lo: 157-158, HiBit-Null: 151-152, Highbits2: 148-149, Highbits: 145-146, LowBit-Null: 154-155, Lowbits: 142-143, OverAll: 183, Sparse: 65-67, 69, 71, 73, 75, 77, 79-80, Text: 160-162, 164-165, TwoBytes: 54-56, 58, 60, 62-63, Words: 182, Zeroes: 167-168

You can see text view of the full list at:

https://github.com/demerphq/smhasher/blob/master/doc/summary...

Also after reading this thread I added some graphs comparing performance of other hashes, including some that I have developed using genetic algorithm for use in Perl. Some of those hashes have hash-time performance at the top of the envelope (StadtX is very fast).

https://github.com/demerphq/smhasher/blob/master/doc/graph/F... https://github.com/demerphq/smhasher/blob/master/doc/graph/F... https://github.com/demerphq/smhasher/blob/master/doc/graph/F... https://github.com/demerphq/smhasher/blob/master/doc/graph/F...

You may also be interested in looking at

https://github.com/demerphq/smhasher/blob/master/HashAnatomy...

Good luck!

ArashPartow · on March 26, 2017

@demerphq: Thank-you for these stats, and the links. I'll definitely be updating the article with the missing hash functions and adding some of the links you've provided to the links section.

My intent is to over time make the article as comprehensive as possible, but to also keep it accessible to the general public.

ot · on March 25, 2017

Thanks for saying this! It is incredible how widespread still is the belief that FNV is a good hash function, when much better ones exist that are faster even on short keys.

demerphq · on March 26, 2017

I am only repeating what my tests say. I was surprised frankly, given FNV's reputation, to see it fail so many tests.

Also I put a lot of effort into improving the quality of testing that smhasher does, so there are many hashes that would pass the old version which will fail the new one.

andai · on March 25, 2017

Wow! How did you create BeagleHash? I took a look at your repos but could only find the end result.

demerphq · on March 26, 2017

BeagleHash IMO is less interesting than Zaphod64 and StadtX.

With all of them I used genetic algorithm techniques to search the space of possible minimal mix functions to find ones that minimized error from the strict avalanche test.

I will publish my tooling to do this at one point, but it is a bit hacky to release just now.

ot · on March 25, 2017

> A perfect hash function is one that maps a finite set of keys into the same number of buckets with no collisions.

To be even more precise, that's a minimal perfect hash function. "Perfect" only means that it produces no collisions on that key set (F is injective). It is also "minimal" if the number of buckets (the codomain) has the same size as the input set (F is bijective).

demerphq · on March 26, 2017

Thanks for the correction!

rurban · on March 25, 2017

Just read the first two intro paragraphs, and already stumbled on 2 major mistakes:

"Hash functions are by definition and implementation pseudo random number generators (PRNG)."

No.

> "In practice it is generally recognized that a perfect hash function is the hash function that produces the least amount of collisions for a particular set of data."

No. A perfect hash function produces no collisions. "Least collisions" => better but not perfect.

I briefly looked over the rest, and this looks much better.

Just the last section is a bit misleading. First he talks about hashing in general and then he introduces his implementations of hash functions as "Available Hash Functions". This is not a generalization as above, it's merely his collection.

DJB: "It is one of the most efficient hash functions ever published." Nonsense. FNV is considered the generally most efficient simple hash function. DJB ranks in the middle of generic unaligned string hash functions.

ArashPartow · on March 25, 2017

@Rurban (Reini Urban): wrt to your first point, general purpose hash functions being modeled as PRNGs, I believe that statement to be correct. It is definitely the case when analysing hash function behaviour in contexts such as bloom filters, hash tables etc.

With regards to the definition of perfect hash functions - if you had read the sentence before the one you have quoted, you would have notice that the formal definition for a perfect hash function has been given in the article as:

> "In general there is a theoretical hash function known as the perfect hash function for any group of data. The perfect hash function by definition states that no collisions will occur meaning no repeating hash values will arise from different elements of the group."

It then goes on to talk about the practicalities of perfect hash functions (deriving them etc), then makes a statement that generally a perfect hash function is one that has the least collisions - I agree that it is an intentional weakening of the formal definition, but it is generally the most practical one - as there is no guarantee that one can always find a "perfect hash function" for any arbitrary set of values in sub-polynomial time.

----

wrt the collection of hash function, take note that the article is broken into the following sections:

1. What is a general purpose hash function

2. Designing and implementing general purpose hash functions

3. Uses of hash functions

4. Example general purpose hash functions (popular ones or written by leaders in the field, and dummies such as myself)

5. Downloads

So could you clarify what it is you meant by:

> " last section is a bit misleading. First he talks about hashing in general and then he introduces his implementations of hash functions"

what specifically was misleading in the article? I'm confused, could you please clarify.

----

wrt: > "DJB: "It is one of the most efficient hash functions ever published."

The statement is _not_ incorrect, it is indeed one of the most efficient hash functions out there, it may not be the MOST efficient or in other words "the number one hash function", but it's there at the top - I think you may have jumped the gun on that one too.

To further that, your comment: " Nonsense. FNV is considered the generally most efficient"

Lets have a look and see if that's true, the following are pseudo code representations of the mixes for each of the hash functions:

DJB: hash = ((hash << 5) + hash) + msg_byte

FNV1: hash = hash ^ msg_byte * FNV_prime

Unless you have a really bad shift operation (eg: having no barrel shifter present), then I can't see how FNV1 can be significantly more efficient than DJB - but then again, the compiler may optimize the shift right by 5 as a multiply by 32 for targets where the mul is faster than the shift...

But lets step back a bit and consider the following:

Is hash function efficiency that important when neither of the hash functions DJB or FNV1 posses good distributive qualities?

https://research.neustar.biz/tag/hash-function/

rurban · on March 25, 2017

ad 1: https://stackoverflow.com/questions/31954237/pseudo-random-g... has some answers.

ad 2: in most measurable stats fnv1 is better than djb. practical throughput, time if inlined, cachegrind cost model (4x!). fnv also produces less collisions with average keys. only in pure cycle/hash djb is a bit better.

https://github.com/rurban/smhasher#smhasher https://github.com/rurban/perl-hash-stats#perl-hash-stats

> ad 3: Is hash function efficiency that important when neither of the hash functions DJB or FNV1 posses good distributive qualities?

It depends entirely on the use case. Usage in hash tables is totally different to usage as file or block checksum or crypto. Besides looking how many bis of the keys are really used or how easy it is to create collisions, it also depends if the keys are random length strings, aligned buffers or plain integers. Both DJB and FNV1 are considered bad regarding their stats. But they are small and fast.

panic · on March 25, 2017

It might be worth mentioning popular functions like FNV (http://isthe.com/chongo/tech/comp/fnv/) MurmurHash (https://github.com/aappleby/smhasher) and SipHash (https://131002.net/siphash/).

bogomipz · on March 25, 2017

The post uses the term "internal state" pretty early on(initialize internal state, etc) and continues to reference it frequently for the algorithmic steps but nowhere is "internal state" defined. Can someone explain to me what this "internal state"?

psyc · on March 25, 2017

Internal state is typically a small-ish bag of bits that mixes, and gets mixed deterministically, as the hash mechanism is used. See the internals of the Mersenne Twister[0] for an example/analogy.

[0] https://en.wikipedia.org/wiki/Mersenne_Twister#Initializatio...

demerphq · on March 26, 2017

It is only small in older hashes. Modern hashes tend to have a state larger than both the seed and the input block.

See:

https://github.com/demerphq/smhasher/blob/master/HashAnatomy...

It actually turns out there are performance benefit by having a larger seed, but not too large. (If you think about it this makes sense consider how many bits you can change with an average combining operation, and with a worse case combining operation.)

bogomipz · on March 25, 2017

Thanks for the explanation and link. This makes total sense. Cheers.

ArashPartow · on March 26, 2017

@bogomipz: The explanation provided by psyc is spot on. Please do note that I've updated the article with explanations for both 'Internal State' and the corresponding 'Finalize' process.

jnwatson · on March 25, 2017

A great resource for hash function performance: https://www.strchr.com/hash_functions

It includes a bunch of more modern hash functions the author missed.

duckqlz · on March 25, 2017

+1 "Brief" Introduction lol

ArashPartow · on March 26, 2017

Once upon a time it was 'brief'... that is to say circa 2001 :D