I often see discussions about hash maps on HN, but they mostly focus on the performance (memory and speed) of one huge hash map with millions of records. In my numerical calculations I often need lots of small hash maps (say, up to 100 elements). Currently I use unordered_map from the C++ standard library. Does anyone know what hash map implementation is recommended for my scenario?
I'd expect the same hash map implementations to work well in both scenarios (many hash maps with few entries and one hash map with many entries). I'd try dense_hash_map, and if you're feeling experimental, this new thing.
A more significant factor may be sizeof(pair<Key, Value>). std::unordered_map has a separate allocation for each entry, which is quite horrible with small values in terms of memory usage and CPU cache effectiveness. With larger values, keep in mind that because of the load factor of these hashes, you end up with more buckets than you have entries. It may be better to have that pointer indirection (either through std::unordered_map or maybe dense_hash_map<Key, std::unique_ptr<Value>>) than to have the empty buckets each be sizeof(pair<Key, Value>). Or, if you can find something like Rust's ordermap, which handles that indirection with a 32-bit index (i.e., half the size of a pointer on 64-bit platforms) into an array, it can be more memory-efficient. Having the entries adjacent in the array also helps with cache effectiveness.
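As a rough illustration of that index-into-an-array idea (loosely what ordermap does internally, not its actual implementation), here's a minimal sketch using std::unordered_map as the index structure; the IndexedMap name and interface are made up for the example, and it skips erase to stay short:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the indirection idea: keep the values packed in a contiguous
// array and let the hash table map keys to a 32-bit slot index instead of
// holding the (possibly large) value or a full 64-bit pointer per bucket.
template <typename Key, typename Value>
class IndexedMap {
public:
    void insert(const Key& key, Value value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            values_[it->second] = std::move(value);  // overwrite existing
        } else {
            index_.emplace(key, static_cast<std::uint32_t>(values_.size()));
            values_.push_back(std::move(value));
        }
    }

    Value* find(const Key& key) {
        auto it = index_.find(key);
        return it == index_.end() ? nullptr : &values_[it->second];
    }

private:
    std::unordered_map<Key, std::uint32_t> index_;  // key -> slot in values_
    std::vector<Value> values_;                     // values stay adjacent
};

int main() {
    IndexedMap<std::string, std::string> m;
    m.insert("greeting", "hello");
    if (auto* v = m.find("greeting")) { /* use *v */ }
}
```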
Another factor might be the cost of hashing each value. If it's cheap (ints or the like), your ideal hash map class probably doesn't store the hash values. If it's expensive (long strings, for example) and you have the memory to burn, your ideal hash map class probably does store them, so that collisions cost less on lookup and so that resizing on insert costs less.
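For the expensive-hash case, here's a minimal sketch of what storing the hash buys you; the Entry layout and bucket_find helper are hypothetical, not taken from any particular library:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Entry layout that caches the hash next to the key/value, so collisions
// can be rejected by comparing hashes before comparing (possibly long)
// keys, and resizing doesn't have to re-hash every key.
struct Entry {
    std::size_t hash;
    std::string key;
    int value;
};

// Scan one bucket: the full key comparison only runs when the cached
// hashes already agree.
const int* bucket_find(const std::vector<Entry>& bucket,
                       const std::string& key, std::size_t hash) {
    for (const auto& e : bucket) {
        if (e.hash == hash && e.key == key) return &e.value;
    }
    return nullptr;
}

int main() {
    std::hash<std::string> hasher;
    std::vector<Entry> bucket = {
        {hasher("a long string key"), "a long string key", 42}};
    const int* v = bucket_find(bucket, "a long string key",
                               hasher("a long string key"));
    (void)v;
}
```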
As an alternative, if you're just holding ints or something, you might be better off with a sorted std::vector<Entry>.
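A minimal sketch of that, assuming int keys and values and using std::lower_bound for lookups; the find/insert helpers are just for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sorted-vector "map" for small int keys: O(log n) lookup via binary
// search, with all entries in one contiguous allocation.
using Entry = std::pair<std::int32_t, std::int32_t>;  // (key, value)

const std::int32_t* find(const std::vector<Entry>& v, std::int32_t key) {
    auto it = std::lower_bound(v.begin(), v.end(), key,
        [](const Entry& e, std::int32_t k) { return e.first < k; });
    return (it != v.end() && it->first == key) ? &it->second : nullptr;
}

void insert(std::vector<Entry>& v, std::int32_t key, std::int32_t value) {
    auto it = std::lower_bound(v.begin(), v.end(), key,
        [](const Entry& e, std::int32_t k) { return e.first < k; });
    if (it != v.end() && it->first == key) it->second = value;
    else v.insert(it, Entry{key, value});  // shifts the tail; fine for ~100 entries
}

int main() {
    std::vector<Entry> m;
    insert(m, 7, 70);
    insert(m, 3, 30);
    const std::int32_t* v = find(m, 7);  // points at 70
    (void)v;
}
```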
Hi, I'm the author of this post.
My general guideline is: figure out where your bottleneck is and optimize the hotspot first. From your description, I can only guess where the bottleneck might be...
First of all: "I often need lots of small hash maps". Do you maintain many small hash maps at the same time? If so, the data structure holding all these hash maps may be your bottleneck rather than the hash maps themselves. Another possibility is that you have small hash maps that get created and dropped rapidly. In that case you might want to keep one alive and reuse it, so you don't pay the allocation overhead every time.
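A rough sketch of that reuse pattern, assuming the maps are created once per iteration of some loop; clear() typically keeps the bucket array allocated (though the standard doesn't strictly guarantee it), and process_batches is a made-up example function:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Reuse one map across iterations instead of constructing and destroying
// a fresh std::unordered_map each time around the loop.
void process_batches(const std::vector<std::vector<std::string>>& batches) {
    std::unordered_map<std::string, int> counts;
    counts.reserve(128);  // sized for the ~100-entry case
    for (const auto& batch : batches) {
        counts.clear();   // reuse the buckets from the previous iteration
        for (const auto& key : batch) ++counts[key];
        // ... use counts for this batch ...
    }
}

int main() {
    process_batches({{"a", "b", "a"}, {"c"}});
}
```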
Other than that, the remaining optimization I can think of is making your hash table compact (like what I did in this post). Reserving buckets together with the max load factor should give you good control over your hash table's size. Reference:
http://www.cplusplus.com/reference/unordered_map/unordered_m...
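A quick sketch of what that looks like with std::unordered_map, assuming roughly 100 elements and a max load factor of 0.7 picked arbitrarily:

```cpp
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    // Cap the load factor, then reserve space for the expected number of
    // elements; reserve(n) allocates enough buckets so that inserting n
    // elements won't trigger a rehash given the current max_load_factor.
    m.max_load_factor(0.7f);
    m.reserve(100);
    for (int i = 0; i < 100; ++i) m.emplace(i, i * i);  // no rehash expected
}
```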
For a very small number of elements, an array or a list can be more efficient than a hash map. My rule of thumb is to avoid hash maps when I know I'll never have more than 100 entries, and then look for the bottleneck later on. Using hash maps for everything isn't always the best solution.
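For example, a minimal sketch of the flat-array approach, assuming string keys and only a handful of entries (the find helper is hypothetical):

```cpp
#include <string>
#include <utility>
#include <vector>

// For a handful of entries, an unsorted vector with a linear scan is often
// faster than a hash map because everything fits in a cache line or two.
using Entry = std::pair<std::string, int>;

int* find(std::vector<Entry>& entries, const std::string& key) {
    for (auto& e : entries) {
        if (e.first == key) return &e.second;
    }
    return nullptr;
}

int main() {
    std::vector<Entry> entries = {{"x", 1}, {"y", 2}};
    if (int* v = find(entries, "y")) ++*v;
}
```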
Using a sequence instead of a map is fine if you know it will never hold more than some small number of entries, but we're wrong about how many entries will end up in a table often enough that the O(N²) cost of using a sequence as a map kills performance unexpectedly.
You don't replace every map in your code with a distributed key-value store just in case it might need to store terabytes in the future. Most of the time you have a pretty good idea whether a map has to store a hundred or a hundred thousand elements.
If you're looking up elements by key, a linear search is going to be really slow. Hash maps are designed precisely to handle that scenario, and devolving to linear scans is only going to make things worse.
Just keep doing what you're doing and create a lot of small ones. Design simplicity and consistency are more important, unless you find that these hash tables are actually bottlenecking you (which is extremely unlikely).