If you care about performance, it's important to understand the underlying representation of a dictionary data structure.
For instance, std::unordered_map in C++ is based around buckets implemented as linked lists of key/value nodes. This means iterating through the keys or values is very slow (lots of cache misses!), but insertion is fast. Retrieving a key is O(log n), but a very slow O(log n), because of the cache misses.
Another implementation is to keep a sorted vector of keys and a corresponding vector of values. Loki::AssocVector and boost::container::flat_map do this, for instance. Now insertion is slow, but iteration and retrieval are very fast (a fast O(log n) by binary search with few cache misses). It's also memory efficient, since there are no pointers between elements.
If you have one big dictionary used throughout an application and know the data up front, a good strategy is to reserve the necessary memory in an array, fill it with keys/values, sort it once by key, and apply the same permutation to the coindexed value array. Now you have a memory-efficient and extremely fast dictionary data structure.
One other strategy any intermediate coder can implement is to keep two unsorted coindexed arrays. You don't even need a hash function for this. Now iteration over and insertion into the table are extremely fast, and it is memory efficient, but finding a key is a (fast) O(n) linear scan, so this is good for smaller tables. In C++ you could implement it as a std::pair<vector<key>, vector<value>>. If you need a quick small map in a function, this is often the fastest data structure you can implement without too many headaches.
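To make both flavours concrete, here is a rough sketch in Python of the two layouts (the class names SortedFlatDict and SmallFlatDict are made up for illustration; a real C++ version would use the vectors described above):

    import bisect

    class SortedFlatDict:
        """Sorted key array + coindexed value array: slow insert, fast O(log n) lookup."""
        def __init__(self, pairs):
            pairs = sorted(pairs)                 # sort once over the keys
            self._keys = [k for k, _ in pairs]
            self._values = [v for _, v in pairs]  # coindexed with self._keys

        def get(self, key, default=None):
            i = bisect.bisect_left(self._keys, key)   # binary search, few cache misses
            if i < len(self._keys) and self._keys[i] == key:
                return self._values[i]
            return default

    class SmallFlatDict:
        """Two unsorted coindexed arrays: fast insert/iteration, O(n) lookup, no hashing."""
        def __init__(self):
            self._keys, self._values = [], []

        def insert(self, key, value):
            self._keys.append(key)                # fast: just append to both arrays
            self._values.append(value)

        def get(self, key, default=None):
            for i, k in enumerate(self._keys):    # linear scan; fine for small tables
                if k == key:
                    return self._values[i]
            return default

    d = SortedFlatDict([("b", 2), ("a", 1), ("c", 3)])
    assert d.get("b") == 2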
Chandler Carruth talks about unordered_map performance[1] in his CppCon 2014 talk "Efficiency with Algorithms, Performance with Data Structures". He endorses most of the other alternatives you mentioned.
> sorted vector of keys . . . reserve necessary memory and sort it . . . unsorted coindexed arrays . . .
Most of the things you mentioned are not hash tables, but members of a parent concept, dictionaries. Hash tables, by definition, all involve some sort of hashing of the key. The two main categories of hash table are chained hash tables (std::unordered_map does this, at least in the implementations I'm aware of) and open-addressed hash tables, which use probing instead of secondary data structures to resolve collisions.
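For anyone who hasn't seen open addressing, here's a toy sketch in Python of what "probing instead of secondary data structures" looks like (the class name and the fixed, non-resizing capacity are just for illustration):

    class OpenAddressedMap:
        """Toy open-addressed hash table with linear probing (no resizing, no deletion)."""
        _EMPTY = object()

        def __init__(self, capacity=16):
            self._slots = [self._EMPTY] * capacity     # one flat array of (key, value) slots

        def put(self, key, value):
            i = hash(key) % len(self._slots)
            while self._slots[i] is not self._EMPTY and self._slots[i][0] != key:
                i = (i + 1) % len(self._slots)         # collision: probe the next slot
            self._slots[i] = (key, value)

        def get(self, key, default=None):
            i = hash(key) % len(self._slots)
            while self._slots[i] is not self._EMPTY:
                if self._slots[i][0] == key:
                    return self._slots[i][1]
                i = (i + 1) % len(self._slots)         # keep probing until an empty slot
            return default

    m = OpenAddressedMap()
    m.put("x", 1)
    assert m.get("x") == 1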
You can implement the sorted vector of keys with a hash function instead of a coindexed vector of values. It still keeps most of the properties we like, especially if doing the coindex-sort operation is too expensive for some reason.
> You can implement the sorted vector of keys with a hash function instead of a coindexed vector of values
So you're saying you're going to hash the keys, then sort them according to the hash, with tie breaking on the key itself? I'm not aware of any sorted table that does this, but I'm sure some exist. I suppose you'd get something of a win if N was large, and the keys had long common prefixes, and you didn't care about the ordering property.
But in that case you'd probably use an actual hash table, not the algorithm you just described. Unless there's something I'm missing.
Sorry, I misspoke. Lookup is O(1) (on average) in a hash table. But this could be a "weak dict", that is, you don't actually store the keys; as long as you can reference a key through its hash, you can do the lookup.
In the proper "data structure", we usually store both the key and the value -- iteration through keys, values, and/or pairs is probably a supported operation.
The cost of finding a key in the structure (whether by binary tree search, binary search on a sorted array, or linear lookup in an array) varies. So do iteration and most other operations.
It's an interesting fiction that we tell everyone that hash tables are O(1), when in reality hash operations are typically O(n) in the key size, which is O(log n) in the size of the value space. It's just much cheaper to have your O(log n) operations be reading sequential bits rather than following pointers, reading whole keys, etc.
Probably not something people usually run into, but it does show that constants matter.
Equal indices refer to the same entity in both places, i.e. two arrays, one for the keys, one for the values: the value for the key at index n in the key array is found at the same index n in the value array.
You mean insertion into a bucket when there's a collision? And your n is the number of entries in the list? But n should always be very small relative to the total number of items in the hash map such that overall insertion and lookups are O(1).
I think you might be a little confused. Even in hash tables with chaining, one does not tend to spend much time traversing linked lists, because the typical bucket will have only one member. This depends on the load factor of the table, but most practical implementations will eventually rebucket and rehash if their load factor grows too high. "Getting the key loaded" -- i.e. finding the memory location that contains the key -- is O(1) on average in all practical hash tables. It does not typically require any traversal at all.
You keep on talking about ordered tables like red black trees, as in this comment, which is another sign that makes me wonder if you might be confused.
The early versions of String.hashCode in Java only looked at the first 16 characters of the string - when used for things like URLs this led to rather poor performance!
Even if the typical bucket has 100 members, as long as this number is constant and does not go up with the size of the hash table, the performance is still O(1). And the same applies for cache misses. All these things don't really matter except if you are using hash tables with not very many elements in them.
> Even if the typical bucket has 100 members, as long as this number is constant and does not go up with the size of the hash table, the performance is still O(1).
If you mean in a chained hash table, where you chase up to 100 pointers to get the value, the performance is atrocious.
Friendly reminder: traversal of a linked list and a contiguous array are both O(n). In the real world one is two orders of magnitude faster than the other.
> All these things don't really matter except if you are using hash tables with not very many elements in them.
The lookup of a value given a key is probably the least affected operation. If all you care about is a "weak dictionary" (you only really need to store values and lookup from keys), all of this is mostly jabber. If you store the keys to iterate through, or access for some reason, all those things start to matter a whole lot.
For instance, std::unordered_map in C++ is based around buckets implemented as linked lists of key/value nodes. This means iterating through the keys or values is very slow (lots of cache misses!), but insertion is fast.
Has this changed recently? Last time I used unordered_map, insertion was also slow because it had to allocate a linked list entry for every item in the hash table.
This is one of those data structures that everyone should try building at least once, kind of like linked lists, etc. Semi-unrelated, I built a couple of toy implementations a few years ago using the same basic ideas:
They were kind of fun to build and I'd recommend giving it a try, especially if you have some skill but don't think you have the chops. Getting a working toy isn't really all that hard once you understand the principles.
> And just use the built-in ones don't invent your own unless there is a very good reason for it.
Uhm, hash tables are one of these things where plenty can be gained by tailoring the structure to the application. It's not a one-size-fits-all, and it's not nearly as easy as it would seem to make a good general-purpose HT.
Built-in implementations usually are kind of slow, waste memory, or have broken iterators [1], even in modern languages. If any of those things is important, it's better to do some research and maybe even invent your own.
[1] when you can't both iterate and insert items consistently in the same loop
It's often true due to a few factors. One is that they are safe, whereas a specific use case may not need the same level of safety. They generally optimize for the general case as well, but in your own code you can optimize for the very specific use case you have. One I ran into many years ago was implementing my own singly linked list by adding the next pointer to another piece of data. In this one specific case it was worth removing another layer of indirection. I was still young though, so there were probably even better ways of handling it.
I have only encountered these scenarios a small handful of times, but I'm not a very low level developer.
Hash Table has basically been elevated to a general idea lately. Pretty much anyone that is looking for an association between keys and values is using what they will call a hash table. To that end, few people actually know anything about how they are implemented. (And apologies if this comes across as negative, I do not mean it as a value judgement.)
This was different back when you would pick something like an alist to store associations. There the implementation stared you in the face. Same for the times you had to implement your own hash table. I don't exactly yearn for those days. Though, I am curious if alists actually win in speed for a large number of smaller hash tables.
Depends on implementations of the alist and of the hash table. It is even fully possible that the small hash table will use an alist for the cases with few (and small) elements.
Fair. I was making a huge assumption that the alist would be implemented as you statically see it in code.
My point was supposed to be that an alist really has an obvious implementation, whereas a hashtable actually does not. My main objection being that there is a ton of glossing over what goes into an actual hashtable. While I would expect someone to be able to do a basic alist implementation, I have grown away from expecting folks to do a basic hash table.
"this is a data structure that is at the guts of basically every python object. Classes are dicts. Modules are dicts. Packages are probably dicts. Don't worry though, it's underrated!"
Quick question: are there special hash functions that are optimized for use in hash tables? Or do typical hash table implementations in e.g. Python just use standard hash functions like MD5?
Edit: It's 100% clear now. Thanks for the great answers everyone!
Typically the hash functions that you are familiar with from cryptographic or data-consistency use (the SHA family, MD family, etc.) do not make for good hash table choices, because they produce hashes that are much larger than needed and are slow to compute, in exchange for better cryptographic properties (extremely low collision rates, no information leakage about inputs, difficulty of guessing inputs). When picking a hash function for a hash table, you want a function that makes a hash just big enough, with low enough collisions, while still being fast and easily dealing with variable-length keys. This could be something as simple as byte-wise XOR or addition with some shifting as you iterate the key, followed by a mod or even a bitwise AND mask to pick an index.
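As one concrete example of such a function (not necessarily what any particular language uses), here is FNV-1a in Python, followed by the bitwise-AND reduction to a bucket index:

    def fnv1a_32(data: bytes) -> int:
        """FNV-1a: a simple, fast, non-cryptographic hash often used for table lookups."""
        h = 2166136261
        for b in data:
            h ^= b                          # mix each byte into the hash
            h = (h * 16777619) & 0xFFFFFFFF
        return h

    # Reduce the hash to a bucket index with a bitwise AND (table size is a power of two).
    table_size = 1 << 10
    index = fnv1a_32(b"example key") & (table_size - 1)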
However, collision resistance must be still quite good for use in a general-purpose hash table or a HT that is possibly exposed to attackers, otherwise denial-of-service attacks become very easy.
Many "modern" implementations (Python, Ruby, Perl, Rust, Redis, ...) use SipHash with a random seed for this very reason.
In order to use a standard hash function, you would first need to serialize the object. I am not aware of any programming language that does this.
Instead, the approach taken by (at least) Java and Python is to define a "hash" function on objects that classes can override. The standard way of implementing such a function is to combine the hashes of the object's fields.
Python advises doing this by wrapping the fields in a tuple and returning hash((self.a, self.b, ...)). [1]
Java takes a similar approach, but does not make an explicit recommendation on how to implement hashCode(). In my experience, most programmers just XOR the hashes of the fields, which (depending on the object) could be very sub-optimal, but is often good enough. Based on the docs, the typical implementation of Object.hashCode is to just take the memory address of the object.
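A minimal Python sketch of the tuple-wrapping idiom mentioned above (the Point class and its fields are made up): hash exactly the fields that participate in equality.

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y

        def __eq__(self, other):
            return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

        def __hash__(self):
            # Combine the same fields used in __eq__ by wrapping them in a tuple.
            return hash((self.x, self.y))

    assert hash(Point(1, 2)) == hash(Point(1, 2))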
There's also FarmHash, whose 32-bit version is 2x as fast as xxHash on little endian machines (at least according to this benchmark suite https://github.com/rurban/smhasher).
I recall reading at some point about Go having the option to use the CPU's AES instructions for fast, high-quality hashes, which is cool.
This article explains "2-Level Hashing" which is apparently "dynamic perfect hashing." While interesting, I'm wondering how much it's actually used? Or is it just of theoretical interest?
I think when the author says "underrated", what they mean is "I didn't realize how important this is". Hash tables are used everywhere, by everyone, for a lot of things.
Maybe they don't give it enough time in school for people to realize it is the king of practical software development.
The author of the article does not say underrated at all from what I can see; just something the OP included in his title. Your point still stands, though.
Nearly every expression in high-level languages relies on multiple hash lookups. This is part of the reason these languages are regarded as "slow." I suppose you could use a tree in its place and get reasonable performance. However, the hash table's nature allows you to pretty much throw more memory at the problem in exchange for speed (though this is hardly unique to this particular data structure).
For instance, `a = foo.bar.baz` in Python involves 3 hash gets (the local scope, then foo's namespace, then bar's namespace) and a single set operation (the local scope). This is part of the reason Python programs can be optimized by assigning a deep attribute lookup to a local variable outside a loop: it yields improved performance relative to repeating the deep attribute lookup inside the loop.
When the name `a` is assigned the value of `baz`, Python is setting a key name `a` in the local dictionary (hash table) `locals()`.
Basically `a = 1` is syntactic sugar for `locals()['a'] = 1`
>>> a
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'a' is not defined
>>> locals()['a'] = 113
>>> a
113
>>>
One interesting side-effect of this is that you can explicitly assign names that are not valid Python identifiers.
For example:
>>> locals()['foo-bar'] = 1
>>> locals()['foo-bar']
1
>>> foo-bar
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'foo' is not defined
>>>
The name `foo-bar` can't be literally referenced, because the interpreter attempts to interpret it as the subtraction operation `foo - bar`.
locals() is presented to the user as a dictionary, but is that the way CPython actually works with it internally? I've run into weird GC issues that imply it's not a normal dictionary and it's just presented to the user as one for programmatic access.
No, that's not how it works internally. The Python compiler resolves all local variables at compile-time. You can see this for yourself by defining a simple function f, import dis, and calling dis.dis(f).
Global variables use a dictionary, however. The disassembly actually looks similar for both.
That's a good question, I'm pretty sure it's not a normal dictionary. However, I'd have to go through the CPython source to confirm. Maybe someone who's more familiar with CPython's implementation will chime in.
The CPython implementation is a stack-based virtual machine. The first instruction here, LOAD_NAME, looks up the name 'foo' (in the local scope, then globals, then builtins) and pushes its value to the top of the stack. That's at least one hash table lookup, since the scope is a hash table mapping names to values. Then LOAD_ATTR replaces the top-of-stack value with the value of its 'bar' attribute. That's another hash table lookup, since attributes are stored in a hash table mapping names to values. Then another LOAD_ATTR to get 'baz'; that's another hash table lookup. Then STORE_NAME binds the value at the top of the stack to the name given as an argument ('a'). That's an insert or update of a hash table.
So, the expression 'a = foo.bar.baz' involves at least three hash table lookups and one insert or update.
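You can check this yourself with the dis module; roughly (exact opcode names and numbering vary between CPython versions):

    import dis

    # Module-level code: names are resolved through dictionaries at runtime.
    # Expect something like LOAD_NAME (foo), LOAD_ATTR (bar), LOAD_ATTR (baz), STORE_NAME (a).
    dis.dis("a = foo.bar.baz")

    def f(foo):
        # Inside a function, local names are resolved at compile time to array slots,
        # so 'foo' and 'a' use LOAD_FAST/STORE_FAST instead of dictionary lookups
        # (the attribute lookups are still dictionary-based).
        a = foo.bar.baz
        return a

    dis.dis(f)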
Since Python is an interpreter, it keeps a dictionary/hash table of the local and global variables. These can be accessed via the functions `locals()` and `globals()`.
`a = ... ` can be thought of as `locals()["a"] = ...`
(although you should not actually modify local variables this way)
In fact, what performance oriented devs need to be taught about these days is how to not use hash tables, or, more generally, dictionaries. They are a super-easy data structure to reach for, but often it is possible to arrange your data such that their use is unnecessary.
One example that was brought up elsewhere in the thread is the way python looks up variables in successive scope dictionaries. This is obviously terrible for performance, and that's a big part of why Python is slow.
But how are other languages fast? Doesn't every language have to resolve variable names to memory locations? Well, yes, but fast languages do this mostly at compile time. How? Well, I'm not an expert in this, but at a high level, each variable is assigned a location in memory or registers, and then future references to that variable are rewritten to refer to the memory location by register name or memory address. This takes the whole "looking up a name" issue out of the path of work that has to get done at runtime. And you've switched from looking up things in tables to operating directly on registers and memory locations.
BTW, this has nothing to do with high-versus-low level. It's more about how mutable the process of name resolution is after program compilation. One could theoretically write an assembly language where memory locations are names, not numbers. If reflective features like runtime scope editing are available, this would be a very low-level language that still requires some kind of dictionary lookup at runtime.
> In fact, what performance oriented devs need to be taught about these days is how to not use hash tables, or, more generally, dictionaries. They are a super-easy data structure to reach for, but often it is possible to arrange your data such that their use is unnecessary.
A lot of devs are completely unaware of what's happening at the layer of abstraction below them and this is one of the ways that comes out. The number of elements it takes before hash tables are faster than iterating through an array can be surprisingly large and yet it's one of the first "optimizations" that get made.
Some other related problems are not knowing how expensive network calls are, not knowing what the ORM is doing, caching in memory even though the database already does, and more. They just don't think about performance issues in code they don't write.
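If you're curious where the crossover sits on your own machine, a quick-and-dirty way to measure it is something like the following (the sizes and keys are arbitrary, and this includes interpreter overhead, so treat it as a rough indicator only):

    import timeit

    for n in (4, 16, 64, 256):
        keys = [f"key{i}" for i in range(n)]
        as_list = list(keys)
        as_dict = {k: None for k in keys}
        probe = keys[-1]                # worst case for the linear scan
        t_list = timeit.timeit(lambda: probe in as_list, number=100_000)
        t_dict = timeit.timeit(lambda: probe in as_dict, number=100_000)
        print(f"n={n:4d}  list: {t_list:.4f}s  dict: {t_dict:.4f}s")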
99.999% of projects are not going to have any meaningful hot path in variable resolution.
If your program is sensitive to that, even a simple GC pause is going to destroy your performance and you need to be out of managed memory languages at that point.
There are a lot of reasons python can be slow, but this is far from one of them.
> 99.999% of projects are not going to have any meaningful hot path in variable resolution.
It won't show up in the hot path because the performance cost is so pervasive that profiling tools will ignore it; it's everywhere and you can't escape it without compiler optimizations. This cost will be in the hot and cold paths. This and data locality are the two biggest performance issues in dynamic languages, and a lot of effort goes into compiling it out.
For statically compiled languages it can show up, but often you'll have to write a dictionary-free version to see it. Some profiling I've done with C# at various times over the years shows that a dictionary is slower than a list until you have more than 50 to 100 items. The caveat is that I normally keep the dictionary version because it's more semantically correct.
Eh, one of the standard performance tips for hot loops is to store everything relevant, including functions, in local variables because the interpreter looks in the local dictionary first. The last time I optimized Python code that made a significant difference.
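For anyone who hasn't seen the idiom, a minimal sketch (math.sqrt is just a stand-in for whatever global or attribute is being looked up repeatedly):

    import math

    def slow(values):
        total = 0.0
        for v in values:
            total += math.sqrt(v)       # global lookup + attribute lookup on every iteration
        return total

    def fast(values):
        sqrt = math.sqrt                # hoist the lookup into a local, once
        total = 0.0
        for v in values:
            total += sqrt(v)            # local access, no dictionary lookups in the loop
        return total

    assert slow(range(10)) == fast(range(10))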
To be clear, I don't think this is the only thing that makes python slow. It's probably one of the top five or ten things preventing it from being fast, though.
> Well, I'm not an expert in this, but at a high level, each variable is assigned a location in memory or registers, and then future references to that variable are rewritten to refer to the memory location by register name or memory address. This takes the whole "looking up a name" issue out of the path of work that has to get done at runtime.
Python does this for local variable names in functions. Because of this, moving a tight loop from global scope to a function can make it a couple percent faster.
Are you serious that dictionaries are a problem for performance oriented developers?
Performance oriented devs should be concerned with bottlenecks, not incredibly minute details. There's almost no situation I can think of where smallish dictionaries are much better or worse than any other data structure when it comes to performance.
Of course, if you're writing a compiler then it can be a serious difference. Most developers don't write compilers though.
Yes, if you write performance sensitive code you have to be very careful with dictionaries. See for example this nice talk by Chandler Carruth https://www.youtube.com/watch?v=fHNmRkzxHWs
Well of course. :) If you don't write perf sensitive code, don't worry about perf. But if you do, in many cases avoiding hash tables can become important.
Though once you have hashes/maps/dicts, sets are within reach by making sets from hashes where the keys are the elements and the values are just true or 1 or something like that.
But I think you probably meant having and using set operations effectively in day-to-day tasks, as in "make 2 sets and do a set difference operation" instead of "do a for loop over the first hash, check whether each element is in the second, then put the results in an accumulator".
Another thing is to think about sets of sets. Can that be useful sometimes? Implementing it is slightly trickier: you'd need to be able to get a hash of a set. Python has frozenset https://docs.python.org/3/library/stdtypes.html#frozenset. I've used those on occasion.
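A tiny illustration (the values are made up): plain sets are mutable and unhashable, so the inner sets have to be frozensets.

    # Elements of a set of sets must be frozensets.
    groups = {frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 2})}
    assert len(groups) == 2                # the duplicate {1, 2} collapses

    assert frozenset({1, 2}) in groups     # membership works as usual
    try:
        groups.add({3, 4})                 # adding a plain (mutable) set fails
    except TypeError:
        pass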
Then of course there is Erlang's sofs (sets of sets) module. Stumbled on it by accident. Oh my, it comes complete with an introduction to set theory and relational algebra:
Yes, while a set can be derived or mimicked if you have a hashtable for example, I'm specifically referring to all the common set operations you'd typically want to have.
Using a map/hashtable as a set is just the tip of the iceberg.
In many cases you don't need a key and a value, and you can get rid of a lot of loopy and complicated code if you have a simple set to utilize with its useful operations.
>"Though once you have hash/maps/dicts set are with reach by making sets from hashes were keys are the element and values are just true or 1 or something like that."
Can you elaborate on how you can derive a set from hashes by using values of True or 1? Might you have a link? Thanks.
> Can you elaborate on how you can derive a set from hashes by using values of True or 1? Might you have a link? Thanks.
Sure. Np!
I mean that a simplified set is just a hash where the elements of the set are the keys of the hash table and the values can be anything. I used 1 or True as an example.
As in adding an element would be:
my_dict[element] = 1
Then membership check is:
if element in my_dict
Then removal is deleting:
del my_dict[element]
and so on.
In other words, the reason sets are sometimes not explicitly there is because they are easy to implement on top of existing data structures.
Basic operations like union, difference, intersection between two sets can be done with a few simple for loops.
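For instance, over the dict-backed "sets" sketched above, rough versions of those loops might look like this (just a sketch; real set types provide these operations, usually faster):

    def union(a, b):
        out = dict(a)                  # copy the first "set"
        for k in b:
            out[k] = 1
        return out

    def intersection(a, b):
        return {k: 1 for k in a if k in b}

    def difference(a, b):
        return {k: 1 for k in a if k not in b}

    s1 = {"x": 1, "y": 1}
    s2 = {"y": 1, "z": 1}
    assert set(union(s1, s2)) == {"x", "y", "z"}
    assert set(intersection(s1, s2)) == {"y"}
    assert set(difference(s1, s2)) == {"x"}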
But like I mentioned in another comment, there is one interesting aspect to sets (and hashes) in that the elements now have to be hashable. That kind of depends on how mutability and identity work in the particular language.
While that's true you could implement one that way, it's very nice to have the set operations implemented. Intersection, union, subtraction, etc. And having a uniform set type makes API signatures more consistent.
Plus you may be able to optimize Set<T> to use less space than Map<T, bool>.
I think I remember reading a claim that they pushed for a small API surface in order to make sure it got through. Now that it's part of the language, anyone else can work on getting those extra features in.
You can implement most basic functionality easily enough, MDN even has an example [0]. Although I agree that it should really be part of the language.
Does it really matter what the 'value' is? A set is surely implementable as a supertype of a hash, where the value is totally arbitrary; it only matters that the entry exists.
With:
{'A': true, 'B': false}
you seem to be suggesting `B` is 'not in the set'. What's `C`?
They're not really the same thing then. In OJFord's implementation, there's a one-to-one correspondence between states: the element is a member of the set iff it exists as a key in the hash table.
In the true/false implementation, two distinct hash table states (false or key not set) map to one set state (not a member). The program must check whether the key is set, and then check its value.