when you have string values, it's even worse, as you describe. But it's not clear that an overly-clever implementation, which caches numeric ranks of strings etc., is a good idea to have.
Ordering is not the only concern here. std::set actually provides a logarithmic worst-case guarantee, whereas std::unordered_set does not. This is a factor to consider depending on the application, regardless of whether ordering is necessary. Whichever one prefers in any case, though, is beside my point—I'm merely trying to use trees and hashtables to illustrate a far more general CS phenomenon that can occur in lots of data structures and languages.
If performance is a concern then you should still avoid std::set though by default. Logarithmic worst case when it's just always slow isn't really useful.
There may be a benchmark out there where std::set can beat std::unordered_set, but you'll be really hard pressed to find it
Imho "but technically..." is not a valid opinion while in practice the access is O(1) on average. Yea sure, it becomes linear if your hash is "return 42;", it can't grantee that you supplied a good hasher.
It's not technically. Worst-case guarantee is required for some applications. Hash Maps have better amortized complexity but some implementations have bad worst case time.
Java uses balanced trees instead of linked list in their chained-hashtable implementation, if I recall correctly.
There are lots of reasons to prefer trees (and, correspondingly, lots of reasons to prefer hashtables); I just pointed out ordering isn't the only one, and I merely gave another (worst-case performance guarantee). For example, yet another one is iterator stability; insertion can invalidate iterators for unordered containers, making them useless for some applications that need to analyze data efficiently as they insert.
I could go on, but it's very much detracting from the point of the paper to argue about the data structure choice when the paper isn't on this topic at all or trying to analyze any particular application to begin with.
Technically is a very valid opinion if you're writing software that needs to be robust and secure in the face of user input. std::set is guaranteed to be logarithmic strictly in the number of comparisons in the worst case without exception.
std::unordered_set has some very difficult to reason about worst case scenarios that must factor in both the size of the collection and the hash function. Unless you are very careful about the type of hashing you use, you can easily be vulnerable to DDOS attacks that scale on the order of O(k * n) for key size k and container size n.
Note that this isn't just some user-supplied bad hashers. Hash for std::string in GNU library is not much harder to hack for collisions than your example. And also it uses the whole string, so is slow on long strings. I didn't investigate if other implementations of STL are any better.
> People often use set instead of unordered_set (and same for map) despite not needing order. This slows things down.
Aren't unordered_set and unordered_map quite new (IIRC, they came only with C++0x)? For most of C++'s history, if you preferred to use the standard library, what you had was only ordered sets and maps.
Actually, C++ has barely been around at all. It is only with C++20, IIANM, that they key features Bjarne Stroustrup wanted to imbue it with are in place (IIRC; see his "Design & Evolution of C++" book from 1994). Some might even argue that the language C++ _should_ be is not even all here; it's still gradually arriving.
By now, this is no longer "quite new". A recent C++ community survey suggests that under 10% of developers currently use C++98/C++03 the most. Naturally this is not a valid sample of the whole userbase, but it's a good indication.
Also, in 2005 IIANM, Google published their dense and sparse hash map implementations, which were fast-ish and quite usable.
Why is the default set implementation ordered in the first place? The formal data structure is unordered, which probably informs people's assumptions about its performance characteristics. Should it not be "set" and "ordered_set"?
One reason would be because std::set guarantees O(log n) complexity for each operation on the worst case. std::unordered_set has average complexity O(1) and O(size) for the worst case (it being a hash table) which can be unexpected to debug in those rare cases.
Probably because the underlying implementation is a binary search tree which requires a comparison operator. Haskell likewise has Data.Set which requires an Ord instance. There's HashSet which requires a Hashable instance which can be automatically derived for any data type, so in principle you don't need ordering, but I would imagine in C++ you would have to provide your own hashing function from whatever data type you're trying to put in a hash set.
2. The C++ standard library's maps and sets are known to be rather slow. See, for example:
https://stackoverflow.com/q/42588264/1593077
when you have string values, it's even worse, as you describe. But it's not clear that an overly-clever implementation, which caches numeric ranks of strings etc., is a good idea to have.