1. People often use set instead of unordered_set (and same for map) despite not ...

mehrdadn · on March 4, 2021

Ordering is not the only concern here. std::set actually provides a logarithmic worst-case guarantee, whereas std::unordered_set does not. This is a factor to consider depending on the application, regardless of whether ordering is necessary. Whichever one prefers in any case, though, is beside my point—I'm merely trying to use trees and hashtables to illustrate a far more general CS phenomenon that can occur in lots of data structures and languages.

kllrnohj · on March 4, 2021

If performance is a concern then you should still avoid std::set though by default. Logarithmic worst case when it's just always slow isn't really useful.

There may be a benchmark out there where std::set can beat std::unordered_set, but you'll be really hard pressed to find it

mehrdadn · on March 4, 2021

Sure, but this isn't a benchmarking paper.

shultays · on March 4, 2021

Imho "but technically..." is not a valid opinion while in practice the access is O(1) on average. Yea sure, it becomes linear if your hash is "return 42;", it can't grantee that you supplied a good hasher.

majjgepolja · on March 4, 2021

It's not technically. Worst-case guarantee is required for some applications. Hash Maps have better amortized complexity but some implementations have bad worst case time.

Java uses balanced trees instead of linked list in their chained-hashtable implementation, if I recall correctly.

mehrdadn · on March 4, 2021

There are lots of reasons to prefer trees (and, correspondingly, lots of reasons to prefer hashtables); I just pointed out ordering isn't the only one, and I merely gave another (worst-case performance guarantee). For example, yet another one is iterator stability; insertion can invalidate iterators for unordered containers, making them useless for some applications that need to analyze data efficiently as they insert.

I could go on, but it's very much detracting from the point of the paper to argue about the data structure choice when the paper isn't on this topic at all or trying to analyze any particular application to begin with.

Kranar · on March 4, 2021

Technically is a very valid opinion if you're writing software that needs to be robust and secure in the face of user input. std::set is guaranteed to be logarithmic strictly in the number of comparisons in the worst case without exception.

std::unordered_set has some very difficult to reason about worst case scenarios that must factor in both the size of the collection and the hash function. Unless you are very careful about the type of hashing you use, you can easily be vulnerable to DDOS attacks that scale on the order of O(k * n) for key size k and container size n.

owl57 · on March 4, 2021

Note that this isn't just some user-supplied bad hashers. Hash for std::string in GNU library is not much harder to hack for collisions than your example. And also it uses the whole string, so is slow on long strings. I didn't investigate if other implementations of STL are any better.

cesarb · on March 4, 2021

> People often use set instead of unordered_set (and same for map) despite not needing order. This slows things down.

Aren't unordered_set and unordered_map quite new (IIRC, they came only with C++0x)? For most of C++'s history, if you preferred to use the standard library, what you had was only ordered sets and maps.

vlovich123 · on March 4, 2021

Depends what’s your definition of “quite new”. That’s over 10 years go now :)

asdfasgasdgasdg · on March 4, 2021

They've been in the language for about a decade now, and before that we had the tr1 hash_map classes that were available in most environments.

mcguire · on March 4, 2021

C++: the language that's been around for 40 years and is completely new!

einpoklum · on March 5, 2021

Actually, C++ has barely been around at all. It is only with C++20, IIANM, that they key features Bjarne Stroustrup wanted to imbue it with are in place (IIRC; see his "Design & Evolution of C++" book from 1994). Some might even argue that the language C++ _should_ be is not even all here; it's still gradually arriving.

einpoklum · on March 4, 2021

By now, this is no longer "quite new". A recent C++ community survey suggests that under 10% of developers currently use C++98/C++03 the most. Naturally this is not a valid sample of the whole userbase, but it's a good indication.

Also, in 2005 IIANM, Google published their dense and sparse hash map implementations, which were fast-ish and quite usable.

stonemetal12 · on March 4, 2021

They were in TR1 in 2005, if I am not mistaken.

brundolf · on March 4, 2021

Why is the default set implementation ordered in the first place? The formal data structure is unordered, which probably informs people's assumptions about its performance characteristics. Should it not be "set" and "ordered_set"?

hashingroll · on March 4, 2021

One reason would be because std::set guarantees O(log n) complexity for each operation on the worst case. std::unordered_set has average complexity O(1) and O(size) for the worst case (it being a hash table) which can be unexpected to debug in those rare cases.

mannerheim · on March 4, 2021

Probably because the underlying implementation is a binary search tree which requires a comparison operator. Haskell likewise has Data.Set which requires an Ord instance. There's HashSet which requires a Hashable instance which can be automatically derived for any data type, so in principle you don't need ordering, but I would imagine in C++ you would have to provide your own hashing function from whatever data type you're trying to put in a hash set.

masklinn · on March 4, 2021

> Why is the default set implementation ordered in the first place?

"Sorted" rather than ordered (an order can be arbitrary and I'd personally associate the word with insertion order).

> The formal data structure is unordered

The formal data structure doesn't have complexity bounds, the STL does.

einpoklum · on March 6, 2021

It's a mis-feature. And since C++ is very committed to backwards compatibility, they didn't change it later on.

If/when `std2::` happens, this is one of the things I assume would change.