
About finding a minimal perfect hash function for a small set by brute force ("you'd be surprised to see how often you get lucky"): A few years ago, I spent (far too much) time in this rabbit hole and invented a new algorithm and data structure for minimal perfect hash functions of arbitrary size. I invented what I think is still the world's best algorithm in terms of space usage. I then tried to publish a paper. As I'm not in academia (a bachelor's degree is all I have), I had no chance; my papers were rejected. Finally, a professor from Italy picked this up together with his student, and a paper was finally published: https://epubs.siam.org/doi/pdf/10.1137/1.9781611976007.14 - I was even able to present it at the conference, a few weeks before the pandemic started.

The paper goes into quite a bit of detail about the probability of finding a minimal perfect hash function for small sets. I hope it is somewhat readable... the section on small sets is 5.2, "Searching for bijections": the average number of trials is m^m / m!, with m being the size of the set. The rest of the paper extends this idea to larger sets by splitting the sets recursively until they are small enough to be mapped in this way.
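To make the brute-force idea concrete, here is a minimal Java sketch (the mixer and the seed-search loop are illustrative assumptions, not the construction from the paper): keep trying seeds until the seeded hash maps all m keys to m distinct slots in [0, m).

    // Brute-force search for a seed making key -> h(key, seed) mod m a
    // bijection on m keys. The expected number of trials is m^m / m!,
    // e.g. about 416 for m = 8, so this finishes almost instantly.
    public class BijectionSearch {

        // splitmix64 finalizer; an illustrative choice, any good 64-bit mixer works
        static long mix(long z) {
            z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
            z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
            return z ^ (z >>> 31);
        }

        static long findSeed(long[] keys) {
            int m = keys.length;
            for (long seed = 0; ; seed++) {
                boolean[] used = new boolean[m];
                boolean bijective = true;
                for (long key : keys) {
                    int slot = (int) Long.remainderUnsigned(mix(key ^ seed), m);
                    if (used[slot]) { bijective = false; break; }
                    used[slot] = true;
                }
                if (bijective) return seed;
            }
        }

        public static void main(String[] args) {
            long[] keys = {3, 17, 42, 99, 1234, 5678, 910, 1112};
            System.out.println("seed = " + findSeed(keys));
        }
    }

Once found, the seed is the whole data structure for the small set, which is why the recursive splitting pays off: storing one small seed per leaf is cheap.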




I knew I recognized that name; nice to see you here, Thomas. I love XOR filters! In fact, I was thinking about writing a blog post (series) about random linear coding, fountain codes, XOR filters, and ribbon filters...


Let me know if I can help! If you are interested in new developments in this area, I can recommend reading the papers of Stefan Walzer. His work is quite amazing.


I was just looking at this paper of his per your comment: https://arxiv.org/pdf/2210.01560.pdf

You know, looking at figures 2 and 3, I was thinking: sure, it's a mapping structure, but it really looks like a tiny, tiny database! I had no idea PHFs were so involved structurally. Reminds me of delicate tiny clockworks.


I've implemented your RecSplit method in my MPHF benchmark library (written in C#). The suite is not yet public, but I do want to say thank you for your fantastic method/code/paper. In my own rabbit-hole research, I stumbled upon several artifacts of your trail, most notably the material on Stack Overflow, which helped my own research.

I've released a set of fast hash functions[1] to help gain an understanding of speed vs. quality. My biggest takeaway is that most generic hash functions can be specialized for integer inputs[2], which often reduces latency by quite a lot, making MPHFs more attractive than simple iteration on small sets, as the overhead of hashing is considerably smaller (see the sketch after the links below).

[1] https://github.com/Genbox/FastHash

[2] https://github.com/Genbox/FastHash/blob/master/src/FastHash/...
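To illustrate the kind of integer specialization meant above (a sketch, not the FastHash API): a general byte-stream hash has to loop over the input and handle lengths and tail bytes, while a fixed 64-bit key can be hashed with just a finalizer-strength mixer, here murmur3's well-known fmix64.

    public final class IntHash {

        // murmur3's fmix64: a handful of shifts and multiplies,
        // no loops, no byte-by-byte processing
        public static long fmix64(long k) {
            k ^= k >>> 33;
            k *= 0xFF51AFD7ED558CCDL;
            k ^= k >>> 33;
            k *= 0xC4CEB9FE1A85EC53L;
            k ^= k >>> 33;
            return k;
        }
    }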


Do you suspect it was primarily the 'prestige' that was missing, or the analysis (or maybe even something like the layout), that made the initial paper submissions fail?

If papers are rejected purely on prestige rather than merit that would be quite sad.


Even for journal publications, the reviewers are anonymous, so there is really no reason for a reviewer not to express their true opinion on a submission. I have literally approved for publication dozens of submissions from unknown universities and non-affiliated contributors.

It is extremely hard to believe that the submission was not accepted BECAUSE of the lack of university affiliation.

But what is indeed true is that different conferences/journals have different expectations of a submission. Some prefer more theoretical analysis, others more computational work and benchmarking. This is where a seasoned professor offers value: they can judge where to submit a paper so that it has a high probability of success.


The conferences to which I've submitted papers have all had blind review -- you literally take your names off the paper and reviewers don't know the authors. Like OP, I don't have an academic background but it's been immensely helpful to have a PhD as a coauthor, as not only does he know all the conventions about paper-writing, he's also a wiz with LaTeX.


Exactly. The reviews were blind in my case (even though I've heard that's not always the case, especially for journals). There were multiple problems, like missing proofs (probabilities and examples are not enough), which I agree with. A lot of great feedback! In retrospect, it's amazing how much feedback I got. Only one of the six reviews was about the tone, and that one was rather useless.


Reviewers are also often quick to dismiss a paper when it hasn't done its homework of thoroughly reviewing all the related literature and explaining how the proposed result compares. This may seem pedantic, but it is important to limit reinvention of the wheel, and to give credit where credit is due (academia runs on credit).


Pretty interesting, actually (I just skimmed some of the higher-level ideas and have bookmarked it for later reading). Is there reference code available that you can share? Thanks for sharing!


The paper says code is available as part of the Sux project at: http://sux.di.unimi.it/


Yes! That's the implementation from the paper. There's also a Java implementation: https://github.com/thomasmueller/minperf


Note that some smart people in Karlsruhe developed a highly parallel construction procedure for RecSplit, so now you can build large low-space maps very quickly.

https://arxiv.org/abs/2212.09562


Neat! I'll have to look this over. Another MPHF-related rabbit hole, which isn't even properly published:

https://docs.rs/compressed_map/latest/compressed_map/

This isn't quite an MPHF in that it doesn't map keys to a unique bucket; instead, it maps them directly to an output. So unlike an MPHF, it's only suitable when you are sure for some other reason that the input really is a key of the map (or don't care if you get a garbage answer when it isn't). For the same reason, though, the structure can be smaller than an MPHF. The motivating use case is checking whether certificates are still valid, but I'm sure there are others (chess endgames??).
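For a flavor of how such a "static function" can work, here is a minimal Java sketch of the classic MWHC-style construction (names, sizing constants, and the retry loop are illustrative assumptions; real libraries like Sux4J and compressed_map are far more refined): each key's value is the XOR of three array cells, and construction peels the random 3-hypergraph to solve the resulting equations.

    import java.util.ArrayDeque;

    // Sketch of an MWHC-style static function: get(key) XORs three cells,
    // so non-keys return arbitrary garbage, exactly as described above.
    public class TinyStaticFunction {
        private final long[] cells;
        private final long seed;

        private TinyStaticFunction(long[] cells, long seed) {
            this.cells = cells;
            this.seed = seed;
        }

        // splitmix64 finalizer; an illustrative choice of mixer
        private static long mix(long z) {
            z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
            z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
            return z ^ (z >>> 31);
        }

        // three seeded cell indices per key
        private static int[] positions(long key, long seed, int m) {
            long h = mix(key ^ seed);
            int[] p = new int[3];
            for (int i = 0; i < 3; i++) {
                p[i] = (int) Long.remainderUnsigned(h, m);
                h = mix(h + 0x9E3779B97F4A7C15L);
            }
            return p;
        }

        public long get(long key) {
            int[] p = positions(key, seed, cells.length);
            return cells[p[0]] ^ cells[p[1]] ^ cells[p[2]];
        }

        public static TinyStaticFunction build(long[] keys, long[] values) {
            int n = keys.length;
            int m = (int) Math.ceil(1.23 * n) + 2; // ~1.23n cells keep the 3-hypergraph peelable w.h.p.
            seedLoop:
            for (long seed = 0; seed < 10_000; seed++) {
                int[][] pos = new int[n][];
                int[] count = new int[m];   // remaining keys touching each cell
                int[] edgeXor = new int[m]; // XOR of the ids of those keys
                for (int e = 0; e < n; e++) {
                    int[] p = positions(keys[e], seed, m);
                    if (p[0] == p[1] || p[0] == p[2] || p[1] == p[2]) continue seedLoop;
                    pos[e] = p;
                    for (int c : p) { count[c]++; edgeXor[c] ^= e; }
                }
                // peel: repeatedly detach a key that is the only one using some cell
                ArrayDeque<int[]> order = new ArrayDeque<>();
                ArrayDeque<Integer> queue = new ArrayDeque<>();
                for (int c = 0; c < m; c++) if (count[c] == 1) queue.add(c);
                while (!queue.isEmpty()) {
                    int c = queue.poll();
                    if (count[c] != 1) continue;
                    int e = edgeXor[c]; // the single remaining key using cell c
                    order.push(new int[] { e, c });
                    for (int q : pos[e]) {
                        count[q]--;
                        edgeXor[q] ^= e;
                        if (count[q] == 1) queue.add(q);
                    }
                }
                if (order.size() < n) continue; // not peelable, try another seed
                // assign cells in reverse peel order so every equation holds
                long[] cells = new long[m];
                while (!order.isEmpty()) {
                    int[] ec = order.pop();
                    long v = values[ec[0]];
                    for (int q : pos[ec[0]]) if (q != ec[1]) v ^= cells[q];
                    cells[ec[1]] = v;
                }
                return new TinyStaticFunction(cells, seed);
            }
            throw new IllegalStateException("no peelable seed found");
        }
    }

Querying any of the keys returns its value; any other input returns an XOR of three unrelated cells. With b-bit values, this naive form costs roughly 1.23*b bits per key, independent of key size, which is where the compression comes from.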

I think you can also use it to build an MPHF with a similar space usage to yours, with constant expected lookup time but slower (and superlinear) construction time.

The compressed_map crate isn't properly optimized for compressing small maps, because it stores a bunch of padding and metadata that's negligible for huge maps.


That stuff is implemented in Java by the Sux4J project (https://sux4j.di.unimi.it/). Happy to see this is getting into the mainstream (I am the proud inventor of the name "static function" for the problem :). And yes, you can use it to build an MPHF (https://sux4j.di.unimi.it/docs/it/unimi/dsi/sux4j/mph/GOVMin...).

Note that an MPHF, like a static function, does not store the keys. A static function assigning a prescribed, distinct output (e.g., a rank) to every key will always use more space than an MPHF, because with an MPHF you are free to choose the order: for a million keys, storing ranks costs about 20 bits per key, whereas the theoretical minimum for an MPHF is about 1.44 bits per key.


Ha ha, it's fun to be described as "a professor from Italy". For most people in this community I'm probably the designer of the JavaScript PRNG (V8, Node.js, Chrome, Firefox, etc.) or the author of fastutil. LOL.


Could a large language model help with this? Like, train it on papers, then feed it your rejected draft and ask it to generate an improved draft that sounded more like the training data?

The idea would be to use the expected phrasings and idioms, format, etc. Of course, all content would have to be checked, as LLMs are prone to BS. But if the problems were mostly that it didn't follow the conventions of that academic subgenre, this seems exactly what an LLM could do. If the problem was more about e.g. not citing previous work, the LLM probably wouldn't do a good job there.


Actually I think it would help quite a lot in what I struggled with!

> as LLMs are prone to BS

I agree. The output sounds very convincing, and at a high level it might make sense. If there are three layers: (a) the high-level idea, (b) the more detailed ideas, and (c) choosing the right words, tone, syntax, and grammar, then LLMs seem to be very good at (a) and (c), but quite bad at (b). For somebody not familiar with the topic, that's hard to detect.


There are also good ways of using LLMs to write better papers that don't even require the LLM to contribute to the text!

Consider feeding your outline through the LLM alongside prompts like:

- "What's missing from this draft that a reader would expect to find in a top-tier conference paper?"

- "Do you see any problems with the {experiments,discussion,background,related work} section?"

- "What previous work should this paper cite that an expert would find relevant in this field?" (be sure to double-check the sources, LLMs tend to hallucinate with prompts like this)

- "What points is an expert in the field likely to raise during peer review?"

- "Which parts should be rephrased to make the paper sound more natural to a native English speaker?"

As an occasional peer reviewer, I could imagine conferences having a problem with GPT-generated paper submissions clogging up an already highly competitive acceptance pipeline. In case conferences start using detection tools to filter such manuscripts (and I'm torn on whether they should tbh), you can still use ChatGPT like a "friendly, expert colleague" who offers suggestions that you then draft yourself.

I think the application of ChatGPT that I like the most is where ChatGPT is a "friendly, expert editor that always has time to give you suggestions and offer advice," tightening the feedback loop between author and editor to help the author improve their writing. Think of it like a "human-learner-in-the-loop" AI system rather than a "replaces-a-human" AI system.



