I often see discussions about hash maps on HN, but they mostly focus on the performance (memory and speed) of one huge hash map with millions of records. In my numerical calculations I often need lots of small hash maps (say, up to 100 elements). Currently I use unordered_map from the C++ standard library. Does anyone know what hash map implementation is recommended for my scenario?
I'd expect the same hash map implementations to work well in both scenarios (many hash maps with few entries and one hash map with many entries). I'd try dense_hash_map, and if you're feeling experimental, this new thing.
A more significant factor may be sizeof(pair<Key, Value>). std::unordered_map has a separate allocation for each entry, which is quite horrible with small values in terms of memory usage and CPU cache effectiveness. With larger values, keep in mind that because of the load factor of these hashes, you end up with more buckets than you have entries. It may be better to have that pointer indirection (either through std::unordered_map or maybe dense_hash_map<Key, std::unique_ptr<Value>>) than to have the empty buckets each be sizeof(pair<Key, Value>). Or, if you can find something like Rust's ordermap, which handles that indirection with a 32-bit index (i.e., half the size of a pointer on 64-bit platforms) into an array, it can be more memory-efficient. Having the entries adjacent in the array also helps with cache effectiveness.
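As a rough illustration of that index-into-an-array idea (loosely what ordermap does internally, not its actual implementation), here's a minimal sketch using std::unordered_map as the index structure; the IndexedMap name and interface are made up for the example, and it skips erase to stay short:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of the indirection idea: keep the values packed in a contiguous
// array and let the hash table map keys to a 32-bit slot index instead of
// holding the (possibly large) value or a full 64-bit pointer per bucket.
template <typename Key, typename Value>
class IndexedMap {
public:
    void insert(const Key& key, Value value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            values_[it->second] = std::move(value);  // overwrite existing
        } else {
            index_.emplace(key, static_cast<std::uint32_t>(values_.size()));
            values_.push_back(std::move(value));
        }
    }

    Value* find(const Key& key) {
        auto it = index_.find(key);
        return it == index_.end() ? nullptr : &values_[it->second];
    }

private:
    std::unordered_map<Key, std::uint32_t> index_;  // key -> slot in values_
    std::vector<Value> values_;                     // values stay adjacent
};

int main() {
    IndexedMap<std::string, std::string> m;
    m.insert("greeting", "hello");
    if (auto* v = m.find("greeting")) { /* use *v */ }
}
```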
Another factor might be the cost of hashing each value. If it's cheap (ints or the like), your ideal hash map class probably doesn't store the hash values. If it's expensive (long strings, for example) and you have the memory to burn, your ideal hash map class probably does store them, so that collisions cost less on lookup and so that resizing on insert costs less.
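For the expensive-hash case, here's a minimal sketch of what storing the hash buys you; the Entry layout and bucket_find helper are hypothetical, not taken from any particular library:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Entry layout that caches the hash next to the key/value, so collisions
// can be rejected by comparing hashes before comparing (possibly long)
// keys, and resizing doesn't have to re-hash every key.
struct Entry {
    std::size_t hash;
    std::string key;
    int value;
};

// Scan one bucket: the full key comparison only runs when the cached
// hashes already agree.
const int* bucket_find(const std::vector<Entry>& bucket,
                       const std::string& key, std::size_t hash) {
    for (const auto& e : bucket) {
        if (e.hash == hash && e.key == key) return &e.value;
    }
    return nullptr;
}

int main() {
    std::hash<std::string> hasher;
    std::vector<Entry> bucket = {
        {hasher("a long string key"), "a long string key", 42}};
    const int* v = bucket_find(bucket, "a long string key",
                               hasher("a long string key"));
    (void)v;
}
```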
As an alternative, if you're just holding ints or something, you might be better off with a sorted std::vector<Entry>.
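A minimal sketch of that, assuming int keys and values and using std::lower_bound for lookups; the find/insert helpers are just for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sorted-vector "map" for small int keys: O(log n) lookup via binary
// search, with all entries in one contiguous allocation.
using Entry = std::pair<std::int32_t, std::int32_t>;  // (key, value)

const std::int32_t* find(const std::vector<Entry>& v, std::int32_t key) {
    auto it = std::lower_bound(v.begin(), v.end(), key,
        [](const Entry& e, std::int32_t k) { return e.first < k; });
    return (it != v.end() && it->first == key) ? &it->second : nullptr;
}

void insert(std::vector<Entry>& v, std::int32_t key, std::int32_t value) {
    auto it = std::lower_bound(v.begin(), v.end(), key,
        [](const Entry& e, std::int32_t k) { return e.first < k; });
    if (it != v.end() && it->first == key) it->second = value;
    else v.insert(it, Entry{key, value});  // shifts the tail; fine for ~100 entries
}

int main() {
    std::vector<Entry> m;
    insert(m, 7, 70);
    insert(m, 3, 30);
    const std::int32_t* v = find(m, 7);  // points at 70
    (void)v;
}
```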
Hi, I'm the author of this post.
My general guideline is: figure out where your bottleneck is and optimize the hotspot first. From your description, I can only guess where the bottleneck might be...
First of all: "I often need lots of small hash maps". Do you maintain many small hash maps at the same time? If so, the data structure holding all these hash maps may be your bottleneck rather than the hash maps themselves. Another possibility is that you have small hash maps that get created and dropped rapidly. In that case you might want to keep one alive and reuse it, so you don't pay the allocation overhead every time.
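A rough sketch of that reuse pattern, assuming the maps are created once per iteration of some loop; clear() typically keeps the bucket array allocated (though the standard doesn't strictly guarantee it), and process_batches is a made-up example function:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Reuse one map across iterations instead of constructing and destroying
// a fresh std::unordered_map each time around the loop.
void process_batches(const std::vector<std::vector<std::string>>& batches) {
    std::unordered_map<std::string, int> counts;
    counts.reserve(128);  // sized for the ~100-entry case
    for (const auto& batch : batches) {
        counts.clear();   // reuse the buckets from the previous iteration
        for (const auto& key : batch) ++counts[key];
        // ... use counts for this batch ...
    }
}

int main() {
    process_batches({{"a", "b", "a"}, {"c"}});
}
```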
Other than that, the remaining optimization I can think of is making your hash table compact (like what I did in this post). Reserving buckets together with the max load factor should give you good control over your hash table's size. Reference:
http://www.cplusplus.com/reference/unordered_map/unordered_m...
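A quick sketch of what that looks like with std::unordered_map, assuming roughly 100 elements and a max load factor of 0.7 picked arbitrarily:

```cpp
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    // Cap the load factor, then reserve space for the expected number of
    // elements; reserve(n) allocates enough buckets so that inserting n
    // elements won't trigger a rehash given the current max_load_factor.
    m.max_load_factor(0.7f);
    m.reserve(100);
    for (int i = 0; i < 100; ++i) m.emplace(i, i * i);  // no rehash expected
}
```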
For a very small number of elements, an array or a list can be more efficient than a hash map. My rule of thumb is to avoid hash maps when I know I'll never have more than 100 entries, and then look for the bottleneck later on. Using hash maps for everything isn't always the best solution.
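For example, a minimal sketch of the flat-array approach, assuming string keys and only a handful of entries (the find helper is hypothetical):

```cpp
#include <string>
#include <utility>
#include <vector>

// For a handful of entries, an unsorted vector with a linear scan is often
// faster than a hash map because everything fits in a cache line or two.
using Entry = std::pair<std::string, int>;

int* find(std::vector<Entry>& entries, const std::string& key) {
    for (auto& e : entries) {
        if (e.first == key) return &e.second;
    }
    return nullptr;
}

int main() {
    std::vector<Entry> entries = {{"x", 1}, {"y", 2}};
    if (int* v = find(entries, "y")) ++*v;
}
```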
Using a sequence instead of a map is fine if you know it will never hold more than some small number of entries, but we're wrong about how many entries will end up in a table often enough that the O(N²) cost of using a sequence as a map kills performance unexpectedly.
You don't replace every map in your code with a distributed key-value store just in case it might need to store terabytes in the future. Most of the time you have a pretty good idea whether a map has to store a hundred or a hundred thousand elements.
If you're looking up elements by key, a linear search is going to be really slow. Hash maps are designed precisely to handle that scenario, and devolving to linear scans is only going to make things worse.
Just keep doing what you're doing and create a lot of small ones. Design simplicity and consistency are more important, unless you find that these hash tables are actually bottlenecking you (which is extremely unlikely).