The Smallest Hash Table (orlp.net)
192 points by zdw on March 6, 2023 | 39 comments



About finding a minimal perfect hash function for a small set by brute force ("you'd be surprised to see how often you get lucky"): A few years ago, I spent (far too much) time in this rabbit hole and invented a new algorithm and data structure for minimal perfect hash functions of arbitrary size. I invented what I think is still the world's best algorithm in terms of space usage. I then tried to publish a paper. As I'm not in academia (a bachelor's degree is all I have), I had no chance; my papers were rejected. Finally, a professor from Italy picked this up together with his student, and then, finally, a paper was published: https://epubs.siam.org/doi/pdf/10.1137/1.9781611976007.14 - I was even able to present it at the conference, a few weeks before the pandemic started.

The paper goes into quite a lot of detail about the probability of finding a minimal perfect hash function for small sets. I hope it is somewhat readable... the section for small sets is 5.2 "Searching for bijections": the average number of trials is m^m / m!, with m being the size of the set. The rest of the paper extends this idea to larger sets, by splitting the sets recursively until they are small enough to be mapped in this way.
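To make the brute force concrete, here is a minimal Rust sketch (my own illustration, not the paper's algorithm): try multipliers until one maps a small key set bijectively onto [0, m). By the m^m / m! estimate, for m = 4 this takes roughly 256/24 ≈ 11 trials on average.

    // Brute-force a multiplier c so that hashing with c is a bijection
    // on a small key set (expected number of trials is roughly m^m / m!).
    fn find_multiplier(keys: &[u32]) -> u32 {
        let m = keys.len() as u32;
        (1u32..)
            .find(|&c| {
                let mut seen = 0u32; // bitmask of buckets hit so far
                keys.iter().all(|&k| {
                    let bucket = (k.wrapping_mul(c) >> 16) % m;
                    let fresh = seen & (1 << bucket) == 0;
                    seen |= 1 << bucket;
                    fresh
                })
            })
            .expect("found a multiplier")
    }

    // e.g. find_multiplier(&[3, 17, 42, 99]) returns a suitable multiplier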


I knew I recognized that name, nice to see you here Thomas, I love XOR filters! In fact I was thinking about writing a blog post (series) about random linear coding, fountain codes, XOR filters and ribbon filters...


Let me know if I can help! If you are interested in new developments in this area, I can recommend reading the papers of Stefan Walzer. His work is quite amazing.


I was just looking at this paper of his per your comment: https://arxiv.org/pdf/2210.01560.pdf

You know, looking at figures 2 and 3, I was thinking: sure, it's a mapping structure, but it really looks like a tiny, tiny database! I had no idea PHFs were so involved structurally. Reminds me of delicate tiny clockworks.


I've implemented your RecSplit method in my MPHF benchmark library (written in C#). The suite is not yet public, but I do want to say thank you for your fantastic method/code/paper. In my own rabbit-hole research, I stumbled upon several artifacts of your rabbit-hole trail, most notably the stuff on StackOverflow, which helped my own research.

I've released a set of fast hash functions[1] to help gain an understanding of speed vs. quality. My biggest takeaway is that most generic hash functions can be specialized for integer inputs[2], which often reduces latency by quite a lot, making MPHFs more attractive than simple iteration on small sets, as the overhead of hashing is considerably smaller (see the sketch after the links below).

[1] https://github.com/Genbox/FastHash

[2] https://github.com/Genbox/FastHash/blob/master/src/FastHash/...
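To illustrate what "specialized for integer inputs" can look like (my own sketch, not the FastHash code): instead of a general byte-stream hash, a fixed-width integer can be finished with a short mixing sequence, e.g. a murmur3-style 64-bit finalizer.

    // Mixes the bits of a 64-bit integer; a few multiplies and shifts
    // replace a whole variable-length hashing loop.
    fn mix64(mut x: u64) -> u64 {
        x ^= x >> 33;
        x = x.wrapping_mul(0xff51afd7ed558ccd);
        x ^= x >> 33;
        x = x.wrapping_mul(0xc4ceb9fe1a85ec53);
        x ^= x >> 33;
        x
    }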


Do you suspect it's primarily the 'prestige' that was missing or the analysis (or maybe even something like layout), which made the initial paper submissions fail?

If papers are rejected purely on prestige rather than merit that would be quite sad.


Even for journal publications, the reviewers are anonymous. So there is really no reason not to just express your true opinion on a submission (as a reviewer). I have literally approved for publication dozens of submissions from unknown universities and non-affiliated contributors.

I find it extremely hard to believe that the submission was rejected BECAUSE of the lack of university affiliation.

But what is indeed true is that different conferences/journals have different expectations from a submission. Some prefer more theoretical analysis, others more computational work and benchmarking. And this is where a seasoned Professor offers value. They can judge where to submit a paper and have a high probability of success.


The conferences to which I've submitted papers have all had blind review -- you literally take your names off the paper and reviewers don't know the authors. Like OP, I don't have an academic background but it's been immensely helpful to have a PhD as a coauthor, as not only does he know all the conventions about paper-writing, he's also a wiz with LaTeX.


Exactly. The reviews were blind in my case (even though I've heard that's not always the case, especially for journals). In my case, there were multiple problems, like missing proofs (probabilities and examples are not enough), which I agree with. A lot of great feedback! Actually, in retrospect, it's amazing how much feedback I got. Only one of the 6 reviews was about the tone, and rather useless.


Reviewers are also often quick to dismiss a paper when it hasn't done its homework of thoroughly reviewing all the related literature and explaining how the proposed result compares. This may seem pedantic, but it is important to limit reinvention of the wheel, and to give credit where credit is due (academia runs on credit).


Pretty interesting actually (just skimmed through some of the higher-level ideas, have bookmarked for later reading). Is there also reference code available that you can share? Thanks for sharing!


The paper says code is available as part of the Sux project at: http://sux.di.unimi.it/


Yes! That's the implementation from the paper. There's also a Java implementation: https://github.com/thomasmueller/minperf


Note that some smart people in Karlsruhe developed a highly parallel construction procedure for RecSplit, so now you can build large low-space maps very quickly.

https://arxiv.org/abs/2212.09562


Neat! I'll have to look this over. Another MPHF-related rabbit hole, which isn't even properly published:

https://docs.rs/compressed_map/latest/compressed_map/

This isn't quite an MPHF in that it doesn't map keys to a unique bucket, but instead maps them directly to an output. So unlike an MPHF, it's only suitable when you are sure for some other reason that the input really is a key of the map (or don't care if you get a garbage answer when it isn't). However, for this reason the output can be smaller than an MPHF. The motivating use case is checking whether certificates are still valid, but I'm sure there are others (chess endgames??).

I think you can also use it to build an MPHF with a similar space usage to yours, with constant expected lookup time but slower (and superlinear) construction time.

The compressed_map crate isn't properly optimized for compressing small maps, because it stores a bunch of padding and metadata that's negligible for huge maps.


That stuff is implemented in Java by the Sux4J project (https://sux4j.di.unimi.it/). Happy to see this is getting into the mainstream (I am the proud inventor of the name "static function" for the problem :). And yes, you can use it to build an MPHF (https://sux4j.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/GOVMin...).

Note that an MPHF, just like a static function, does not store the keys. A static function assigning a different output (e.g., a rank) to every key will always use more space than an MPHF, because with an MPHF you can choose the order.


Ha ha, it's fun to be described as "a professor from Italy". Probably for most people in this community I'm the designer of the JavaScript PRNG (V8, Node.js, Chrome, Firefox, etc.) or the author of fastutil. LOL.


Could a large language model help with this? Like, train it on papers, then feed it your rejected draft and ask it to generate an improved draft that sounded more like the training data?

The idea would be to use the expected phrasings and idioms, format, etc. Of course, all content would have to be checked, as LLMs are prone to BS. But if the problems were mostly that it didn't follow the conventions of that academic subgenre, this seems exactly what an LLM could do. If the problem was more about e.g. not citing previous work, the LLM probably wouldn't do a good job there.


Actually I think it would help quite a lot in what I struggled with!

> as LLMs are prone to BS

I agree. The output sounds very convincing, and on a high-level it might make sense. If there are 3 layers: (a) high-level idea, (b) more detailed ideas, and (c) choosing the right words, tone, syntax, grammar, then LLMs seem to be very good in (a) and (c), but quite bad at (b). For somebody not familiar with the topic, it's hard to detect this.


There are also good ways of using LLMs to write better papers that don't even require the LLM to contribute to the text!

Consider feeding your outline through the LLM alongside prompts like:

- "What's missing from this draft that a reader would expect to find in a top-tier conference paper?"

- "Do you see any problems with the {experiments,discussion,background,related work} section?"

- "What previous work should this paper cite that an expert would find relevant in this field?" (be sure to double-check the sources, LLMs tend to hallucinate with prompts like this)

- "What points is an expert in the field likely to raise during peer review?"

- "Which parts should be rephrased to make the paper sound more natural to a native English speaker?"

As an occasional peer reviewer, I could imagine conferences having a problem with GPT-generated paper submissions clogging up an already highly competitive acceptance pipeline. In case conferences start using detection tools to filter such manuscripts (and I'm torn on whether they should tbh), you can still use ChatGPT like a "friendly, expert colleague" who offers suggestions that you then draft yourself.

I think the application of ChatGPT that I like the most is where ChatGPT is a "friendly, expert editor that always has time to give you suggestions and offer advice," tightening the feedback loop between author and editor to help the author improve their writing. Think of it like a "human-learner-in-the-loop" AI system rather than a "replaces-a-human" AI system.


This is coming up a lot, here [1] and here [2]. Not a lot of discussion. I enjoyed it, since it was refreshingly hard-core and low-level.

Absurd amounts of actual run-time optimization (as opposed to golfing the source size) are informative and cool and I guess more people should engage in that kind of activity. :)

[1]: https://news.ycombinator.com/item?id=35021860

[2]: https://news.ycombinator.com/item?id=35029536


I'm curious as to how the multiple submissions happen. When I submit content that happens to have been recently submitted, I get redirected to the previous submission instead. Why didn't that happen here? They all seem to be the same URL, after all.


If you can confine yourself to a table of 16 byte values, then Intel's SSSE3 instruction pshufb lets you keep the table in a register, with 16 parallel lookups in one clock cycle (assuming your inputs are the low nibbles in another SSE register). More here: http://www.0x80.pl/notesen/2016-03-13-simd-lookup-pshufb.htm...
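A rough Rust sketch of that trick (my own illustration; the linked article uses C intrinsics, and this assumes SSSE3 support):

    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_shuffle_epi8};

    // 16 parallel table lookups: each byte of `low_nibbles` (0..16)
    // selects the corresponding byte of `table`.
    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "ssse3")]
    unsafe fn lookup16(table: &[u8; 16], low_nibbles: __m128i) -> __m128i {
        let t = _mm_loadu_si128(table.as_ptr() as *const __m128i);
        _mm_shuffle_epi8(t, low_nibbles)
    }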


I like Rust, but I'm constantly surprised by how hard simple math involving subtraction and/or division is, and how there's little to no advice in the docs about best practices. The math code ends up being littered with `as another-number-type`, and the built-in operators are almost never what you need (instead I need to use the saturating_ methods etc.). I want my code to handle overflows and precision errors, but I would like to spend less time figuring out how to do it.
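For readers who haven't hit this yet, a small example of the kind of ceremony the parent means (my own illustration):

    // Plain `budget - spent` panics in debug builds when spent > budget;
    // the intent "clamp at zero" has to be spelled out explicitly.
    fn remaining(budget: u32, spent: u32) -> u32 {
        budget.saturating_sub(spent)
    }

    // Division by zero panics; checked_div turns it into an Option instead.
    fn average(total: u64, count: u64) -> Option<u64> {
        total.checked_div(count)
    }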


Rust's handling of numeric conversions is broken IMO. Implicit conversions to wider types that can never overflow should be allowed, and the "as" operator shouldn't silently ignore overflow. I should also be allowed to assume that usize is at least 32-bit, instead of requiring a fallible conversion. I really doubt pervasive support for 16-bit architectures is important enough to justify inconveniencing everyone else.
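For example (my own illustration of the complaint):

    fn main() {
        let big: u64 = 300;
        let lossy = big as u8;            // 44: `as` silently truncates
        let checked = u8::try_from(big);  // Err(..): overflow made explicit
        let widened = u64::from(42u32);   // lossless widening, yet still explicit
        println!("{lossy} {checked:?} {widened}");
    }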


Coming from Ada, with its bounded integer and floating-point (and, in the standard, fixed-point) operations defined per type, with explicit conversions, I disagree. The correct typing of variables of physical characteristics and their operators has saved my ass so many times, and the use of static and runtime checks is a boon in scientific code. Sadly our internal math library isn't amenable to dimensional analysis, or it would also be another great bug buster. Look up 'gnat' at https://gmpreussner.com/research/dimensional-analysis-in-pro... and follow the links.

Of course nobody prevents you from using Float, Long_Float, Integer everywhere, but it is actually discouraged. Define a specific type. Is it a count, a speed, a quantity of apples, a modular value (i.e., you actually want wraparound semantics)? What is its minimal value, its maximum value? The actual binary representation is more of an implementation detail (and can be coerced into many, many things through 'aspects', pragmas, or just representation clauses).
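A very rough analogue in Rust rather than Ada (just to illustrate the idea of a dedicated bounded type; `AppleCount` and its bound are hypothetical):

    // A newtype with a checked constructor stands in for Ada's
    // range-constrained integer types.
    #[derive(Copy, Clone, Debug)]
    pub struct AppleCount(u16);

    impl AppleCount {
        pub fn new(n: u16) -> Option<Self> {
            if n <= 10_000 { Some(AppleCount(n)) } else { None }
        }
    }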

If I could change one thing there, it's the operator visibility rules, which are a perennial complaint of the Ada developer. But, to dissent from my Adaist brethren, I'd ask the standard to move to more explicit operator choice. Make it explicit in the code, and for those with RSI, make the IDE do the inference and print it in the darned code. Please?


As far as I can tell, the correct way to handle units of measurement in a programming language is by extending unification to abelian groups, as in Andrew Kennedy's work that was later implemented in F#:

https://www.microsoft.com/en-us/research/publication/relatio...

This seems superior to the GNAT solution. And support for dimensionality is somewhat orthogonal to the ergonomics of converting e.g. a u32 to a u64 that comes up in systems programming.


You're right that it is orthogonal to the size of the variable, but thinking in bounds, precision, and yes, types (that can carry many notions, including dimensionality) is more interesting, to me, than 'just' u32 to u64. In my experience, being prompted to think about what you actually mean and have a tool to express it and check it is a good way to write understandable and correct code.


16-bit microcontrollers are pervasive in "embedded industries" and greatly benefit from Rust's memory safety. That's why AdaCore and AUTOSAR are officially trying to gather some Rust (hrm).

That said, Rust should indeed work on fixing numeric conversions, not just for these small word sizes but for everybody.


> There are fully generic methods for constructing (minimal) perfect hash functions [...] However, they tend to use lookup tables to construct the hash itself.

To be slightly pedantic, this solution uses a lookup table too, but it's small enough that it fits in a machine integer (and thus can be loaded as an immediate value).

In general, there is an information-theoretic lower bound on the size of the auxiliary data you need to build a perfect hash function, and it's Omega(n) bits for n keys. The constant depends on the load factor: when the load factor is 1, the hashes are in [0, n), the perfect hash function is called minimal, and the space lower bound is log2(e)*n bits.
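To put a number on it: log2(e) ≈ 1.44, so a minimal perfect hash function over a million keys needs at least about 1.44 million bits (~180 KB) of auxiliary data, no matter how large the keys themselves are.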

For very small n, brute-forcing like this generally yields good functions. In fact, it is one of the strategies used by gperf.


Using a 64-bit value for the shift solution would allow even tighter scalar code; dunno about vectorization.

    pub fn phf_shift64(x: u32) -> u8 {
        ((0x714258693u64 >> (x * 4)) & 0b1111) as u8
    }


This particular code will have a bounds check for x * 4 and panic if it's too large, thus inhibiting vectorization in most cases.

Just as a tidbit, to dig deeper into Rust.

How can I provide some quick evidence for this? Well there's a method called u64::unchecked_shr that acts as contrasting evidence.
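For illustration, one way to avoid that panic branch (my own variant, not from the article or this thread's benchmarks) is to mask the shift amount so it can never exceed 63, at the cost of silently aliasing out-of-range inputs:

    pub fn phf_shift64_masked(x: u32) -> u8 {
        // (x & 15) * 4 is at most 60, so the shift can never overflow.
        ((0x714258693u64 >> ((x & 15) * 4)) & 0b1111) as u8
    }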


You might be able to get rid of two instructions if you can find a function that takes the input to the output directly.


    pub fn phf_noshift(x: u32) -> u8 {
        (x.wrapping_mul(0x9e4c340a)%13) as u8 // Not a hash table
    }


(I used the wrong strings: "A X" not "A X\n")

Many constants work with 13: 0x93f0687f, 0x8c38599e, 0x8f5e14c2, 0x932d2c11. Some with 71: 0xd9526a52, ... Couldn't find one for mod 923.


Perfect hash functions! My favorite topic, and in my mind one of the optimisations compilers will make automatically within the next few years.


pub fn phf_shift(x: u32) -> u8 { x.wrapping_mul(31) as u8 }


smaller: pub fn phf_shift(x: u32) -> u8 { x as u8 }





