You could use a faster sort implementation, such as radix sort which is O(n) and also probably faster in practice when well-implemented.
One option is this one [1] which I actually wrote as part of this exact task: de-duplicating a list of integers, as part of working on [2].
I wrote about the types of speedups you can expect with radix sort here [3] - it depends on the size of the input elements and their distribution (e.g., if many top bits are zero, radix sort will be much faster), but in my test case I see a ~5x speedup for moderate or large input sizes (thousands of elements or more).
Of course, the same could be said about hash tables...
Patricia is O(N) in the size of the input data to be sorted, even without a bound on the key size, as are several more recently discovered suffix array sorting algorithms. Some other radix sorts are not, it's true.
Patricia tries are less local than typical Radix tree implementations.
In typical usage of thousands of items, locality wins, low constant factors win. Simple lazily defragmented list can be faster than constant size hash table thanks to constant factors, faster than sorting.
Unique objects have a perfect hashing function, but it still has to be fast. Sometimes making it an intrusive structure is a major speedup due to cache locality - let the unique object track where it is.
If you cannot know the maximum size (which you will with unique objects), then get more adaptive.
I think the first linear-time suffix-sorting algorithm was the "skew algorithm" introduced in "Linear work suffix array construction," J. Kärkkäinen, P. Sanders, and S. Burkhardt. Journal of the ACM, 53(6):918–936, 2006, although I think Kärkkäinen and Sanders published it in 2003 ("Simple linear work suffix array construction", J.C.M. Baeten et al. (Eds.): ICALP 2003, LNCS 2719, pp. 943–955, 2003.)
In 2016 Gonzalo Navarro and a couple of other guys published https://arxiv.org/abs/1607.04346, "Space-Efficient Construction of Compressed Indexes in Deterministic Linear Time", J. Ian Munro, Gonzalo Navarro, Yakov Nekrich
(Submitted on 15 Jul 2016 (v1), last revised 14 Nov 2016 (this version, v2)). This constructs not only the suffix array but in fact an entire compressed index similar to the FM-index. This seems to follow up 2014 work by Belazzougui.
I think there's a third totally different linear-time suffix-array construction algorithm from the mid-oughties but I can't remember it.
Paper A (Two efficient algorithms for linear time suffix array construction, https://scholar.google.com/scholar?as_sdt=0,5&btnG=&hl=en&q=...) is the main research field of Pro. Ge Nong (issng@mail.sysu.edu.cn, Sun Yat-sen University, Guangzhou, 中山大学). It is also his first paper in this field. Sequent papers relevant of him are all based on this one. Paper B is Space Efficient Linear Time Construction of Suffix Arrays(http://alumni.cs.ucr.edu/~rakthant/cs234/03_KA_Simple%20Line...). Paper A is found suspected of plagiarizing the idea of the algorithm of paper B.
I started my Ph.D. program at the school of data and computer science, Sun Yat-sen University(SYSU, Guanzhou), in 2012. Student ID:12110646. I am one of the first batch Ph.D. students(total 2) of Pro. Ge Nong (issng@mail.sysu.edu.cn).
In 2012, after I entered into SYSU, Pro. talked to me times implying bribe demand, no avail. Then, I had to research alone for 7 years and must be close to his research field according to the clauses of the university, no direction, no advice, no machine, no support even there is national funding provided to the supervisor for each PhD student.
In Apr. 2019, in the case of meeting the requirement of graduation, my last chance to apply for the degree, Pro. Ge Nong refused to sign relevant documents. And the local police station was asked to treat me as a concerned-object after my complaint to the department head.
Later in Apr. 2019, I complained to the President's Office and the Office of Government Ethics of the university, then my thesis was sent out for review, by the department, later than all the other applicants.
In May 2019, after the review results of all the other applicants came out, the department asked me to cooperate with their investigation otherwise they would not let me know my review results. During the inquiry, they told me that I did not pass the review and asked me to write a declaration “The case of implying bribe demand with regards to Pro. is insufficient of evidence and falls short of facts”, otherwise I can only apply for the certificate of completion, not the certificate of graduation. Yield to the pressure, I had to give in.
In Aug. 2019, my thesis was sent out for review again. On Sep. 2, I was told that I did not pass the review again and can not continue the progress of graduation. Comparing with all the other applications of this time, the review time cycle of my thesis is abnormally different. The graduate school is suspected of operating under the table. Meanwhile, I was asked to apply for the certificate of completion by the department again.
In Oct. 2019, I was rejected to apply for graduation and asked to apply for the certificate of completion or withdrawal, and leave the campus. Otherwise, I would be dropt out with an official document promulgated by the university.
In the case of meeting the requirement of graduation, the whole process of my application for graduation suffered resistance and backroom operation time and time again from the beginning, just because of not meeting the professor’s bribe demand, and about to being dropt out by the university in the end. I do not know whether all the thing is normal. I either do not know whether Pro. Ge Nong is qualified to be a teacher.
The issues were complained to the Ministry of Education of China in early Sep. 2019. I am waiting for the reply.
Appendix
1. Different from many kindred tragedies, I am an alive sample at present and suffering from institutionalized pressure and threat. Two places of names in the post are hidden for circumventing illegal persecution in the name of law.
People say that because it's true, for a bounded key size, and in practice many radix sort implementations have a large fixed key size, so the log factor just doesn't come up.
E.g., if you use a 64-bit key, with a default implementation, you support up to 2^64 elements with your O(n) sort. No one has a machine with more than 2^64 elements today, so the the sort is in practical terms very much like O(n).
OTOH the O(n log n) of comparison based sorts are very real: the log term is in n, the number of inputs, and it is very obvious when you plot sort performance.
What is interesting though is when you start optimizing radix sort in certain ways, e.g., eliminating passes based on the high bits being zero (in an LSD sort) or using an MSD sort which stops when the bucket size gets smaller than some threshold, the log n factor actually appears again: since you often need ~log n passes to complete the sort rather than say a fixed number like log 64 (base = digit bits) = 8 for 64-bit keys. The base of the logarithm is larger though (2^radix bits), so the hidden constant factor is lower than comparison sort where the base is generally 2.
Your idea is interesting and might be useful in some applications! But calculating a SHA256 takes time linear in the length of the thing you're hashing. Even if you assume fixed-size keys, SHA256 isn't fast. It's not designed to be fast.
You'd likely need an extremely high N before O(N) SHA256's is faster than O(N log N) comparisons.
> But calculating a SHA256 takes time linear in the length of the thing you're hashing.
Ahem, so does hashing.
And so does sort comparison.
The GP's idea is actually quite effective for medium-to-large objects.
As a bonus, you can cache and store the SHA256 of each object for future in-set testing and uniqueness-finding without looking at the objects themselves. (You cannot do this with non-cryptographic hashes or sorting by comparisons.)
Well SHA256 is a red herring here: one could suggest any suitable fast hash, non-crytopgraphic hash instead.
Yes, it takes time linear in the length of the thing you are sorting, but comparison sort is generally worse in that respect: comparison also takes up to linear time in the comprands, and you have to compare each element not just once but log n times [1].
So the time to hash all the elements once (if the hash is not already cached in the element) is going to be a small part of the sort time.
[1] There are ways around repeatedly re-comparing the entire key, but they work mostly in specialized cases.
You shouldn't accept O(n log n) as the norm: radix sort often gives you O(n) in practice (yes it is really something like O(n log k) for key size k, but k is often fixed, e.g., at 64 bits).
I think radix sorts are probably less popular since, the standard for sorting has always been to provide a "comparator" that compares two objects, and radix sort wants something different entirely: an key you can extract bits from instead. So radix sort was never a drop-in replacement e.g., for qsort in C.
Often you can provide such a key object (and indeed in trivial cases like sorting integers, the entire object is simply the key), but sometimes it would need to be longer than 32 or 64 bits, so the best interface isn't clear: especially in oldschool languages like C which want to pass function pointers as the comprand.
This was kind of cemented when e.g., C++ standard library made == and < the standard things you need to implement not just for std::sort, but also std::map and std::set and many algorithms that rely on order. It's similar in a way to how std::unordered_set/map wasn't a thing until much later when the hashing concept was introduced.
When you can use it, though, radix sort is great: sometimes an order of magnitude faster than comparison sort.
A little bit, yeah: it's like a bit like single-pass radix sort, i.e., where the "digit" in the radix sort is selected to be large enough that one digit encompasses the whole key. Another difference is bucket management strategy is different: using a fixed set of buckets with chaining rather than counting ahead of time the buckets you'll need and laying them out linearly.
Note that locality of reference is so important that even though "single pass" radix sort does the least work, the ideal radix turns out to be 6-10 bits usually, even if that means e.g., 5-10x more instructions executed for 64-bit keys: because sparsely writing into buckets spread across memory is just really slow.
TL;DR: I guess the hash unique approach is technically a radix sort in base-64 if you hash the inputs before applying the radix sort, but a realistic radix sort will use far fewer buckets giving it much better cache locality and won't unnecessarily hash the buckets.
Also it's possible to write a radix sort that doesn't use any extra indirection, but that's a much more complex implementation and I doubt it really has any performance benefits.
All this article has shown is that std::sort is much more optimal than std::unordered_map on the C++ standard library used.
Every std::unordered_map I've used is surprisingly slow. Normally hash-tables are slow because it's so easy to write a slow hash-table, but std::unordered_map is slow for other reasons that I have had explained to me and then quickly forgot :(.
I also find it strange that unordered_map is used rather than unordered_set. Not sure if it would make a performance difference, but if we are going for naive implementations, that should be at the top of the list.
I believe unordered_map does invalidate integrators on resize, but does not invalidate references to keys/data, which means the data can’t be stored inline with the hash table structure.
For the OP’s use case, replacing std::unordered_set<T> with CAtlMap<T,bool> improves the hashtable method performance by a factor of ~4, consistent over different input sizes.
Agreed on use of std::unordered_map, but I appreciated that the depth this went to was a simple comparison between the two obvious STL approaches. I think you could probably speed things up tremendously via micro-optimization, but it’s arguably more useful to know which of these two approaches to use when writing the code for the first time than how to optimize it to the teeth.
It'd be interesting to see how std::map (implemented with a red-black tree) would do on these same benchmarks. A high ratio of inserts to lookups could favor the tree-based map.
For a hash map to be faster, you'd need to trade memory for speed, i.e have significantly more buckets than the expected size. This may however collide with good caching performance which I suspect is what the author observes.
I've seen this before, where big-O obsessed co-workers love to make every algorithm and task O(1) by using hash maps, etc. but are then flabbergasted when "inferior" O(n) code written by someone else outperforms their theoretical perfection by a long shot.
I have to remind them that O(1) is just a rule of thumb for scaling and that technically a dictionary lookup + sleep(5) is still O(1)
Using hashmaps by default isn't usually about raw performance for some small N, but to ensure your algorithm doesn't blow up if someone decides to take it out of context some day and dramatically increases the N (this is especially fun if it is called from some loop that also scales with input size in some way or another, which is very common). Yes you trade some raw performance for small N, but very rarely this is relevant or even measurable, and you get scalability in return.
Now what this article is showing is something else, which is that contrary to the claims in the spec, the benchmarked std::unordered_map implementation does not actually have amortized O(1) insertion time and that you can beat it with an O(N log N) sort if you want to deduplicate integers, for any input size. Which is interesting, but very specific, and not an observation that would carry over to other algorithms (or data types, for that matter. For example, it would be interesting to see what the graph would look like for types that have a more complicated/expensive way to calculate which one of two values is smaller).
Big-O is a way to talk about scaling, but in general, the average CS program does a disservice by not talking more about the critical question: "of what?" Bright students will connect their algorithm course with their systems course and recognize that, maybe, ALU ops aren't a very useful measure in a real world context. Interested students may take a rigorous complexity theory class that will get into various computation models and their nuanced interaction with time/space/communication complexity. However, enough students seem to go through a CS program without learning this that I'm convinced this is a failing of the curriculum and not the students.
I understand the sentiment, but O(n) vs O(1) and O(n^2) vs O(n log n) are huge jumps in complexity. Even with relatively small sizes like N=100, you're already starting with 2 orders of magnitude disadvantage.
The example in this post is a log N factor. log N is a relatively small growth, you'd need 1000 elements to get 1 order of magnitude and you'll never run into 2 orders of magnitude.
If you can come up with reasonable code where an order N complexity slower algorithm is faster in practice - I'd love to see it.
The issue here is mostly about what you are counting.
Almost all instances I've seen of big O scaling not being realistic count every memory access as 1 operation. In reality, memory accesses have a huge variation in how much time they take. Moreover, this does have a dependence on the program.
This is of course due to caches. As such, an O(n^2) algorithm that has very local memory accesses can actually beat an O(n) algorithm that spreads out its memory accesses randomly over the entire memory space.
It has come to the point where I care more about the expected cache hits of an algorithm than the big O scaling.
I was about to say "the example in the post is sorting so it's n * log(n)".
Upon thinking about it, the example is comparing the relative speed of hashing every item (O(n) * O(1)) == O(n) and sorting (O(n * log(n)).
That means the ratio of these two is a log n factor n * log(n) / n == log(n).
good point.
As a side point, I think the main point of the post is that the time taken for the hash solution, increases at a greater rate than the sort + unique solution. Now, that doesn't make sense with our big 0 notation.
This is because we don't have a simple hashmap, what we have is a dynamically resizing hashmap, which we don't set a capacity on, and the resize operations are order n. Now, how often can we expect to resize? I think most hashmaps double their capacity, so there will be log(n) resize operations that each take O(m) (where m is the size of the hashmap when it needs to be resized).
Now at this point, I can't be bothered to think further about it. That feels like it should be less than O(n * log(n)), but it's kinda close. Either way, it's definitely larger than the simpler O(n) case we're thinking about.
Dynamic resizing a hash table by doubling its size when it reaches a threshold capacity is amortised O(1) per insert.
If there are deletions, both operations are amortised O(1) provided they have been sensible and made the deletion threshold a lower factor.
It's the same sort of pattern as dynamically resizing a growing string. As long as you resize by a constant > 1 factor, rather than an addition, the number of resizes is low enough that the O(n) cost of copying is amortised over prior appends.
> Either way, it's definitely larger than the simpler O(n) case we're thinking about.
It actually isn't larger than O(n), setting aside constant factors.
For a hash map's time to grow more than O(n) for n insertions starting from empty, suggests an implementation problem.
I should have thought about things more. I think this is a case of confirmation bias on my part.
I thought "what about dynamically resizing" and quickly googled complexity of rehashing a hash which confirmed my thought that it was n.
I guess I didn't think that since we must have inserted n values in at this point, we could just say "insertion is O(1)" and the constant factor would be "performing two hashes" if we are going to a point where it's resizing, i.e. pay the cost of hashing once and rehashing once.
that feels like it makes sense. I'm being a bit hand-wavy as I don't want to sit down with pen and paper.
I retract my incorrect statement above. (I can no longer edit it to add postscript retraction.)
Sure. But, in practice it’s never a debate between O(1) vs. a O(large n). It’s always either a heavy O(1) vs. a light O(small n) —-hash vs. linear search. Or, a heavy O(large n) vs. a light O(large n log n) —-hash everything vs. sort everything. In both cases, the second option usually wins.
Spitballing here, but I’d suspect a “hash table” implemented as a vector of tuples of key/value pairs with linear time search for the keys would outperform a canonical hashtable for simple JRPC APIs where the vast majority of your JSON objects have fewer than a dozen elements.
Would be interesting to build a hashtable with an equivalent “small string optimization” approach where you have a fixed sized array that is replaced by a canonical hashtable once it grows past some threshold.
It's not only about the constant factor, but also about the fact that O is only valid toward infinity, and potentially about how computer actually work (e.g. the extreme discrepancy between the latency of L1 vs RAM)
I have met many that are generally bad at comparing constant factors and even worse at understanding what computers are fast at.
At school I had a friend implement a Fibonacci heap (which is O(1) or amortized O(1) in every step) for a simple thing and didn't bother to benchmark it against a simple vector based binary heap.
It was slower by one order or magnitude. Only twice have I had to use something else than a simple binary heap due to cache issues.
Small correction: a Fibonacci heap has O(log n) amortised delete-min, otherwise you'd break the lower bound on comparison-based sorting (roughly: insert all elements, then delete the minimum n times, et voilà, you've sorted the input based solely on pairwise comparisons, therefore either insert or delete-min or both must take (amortised) logarithmic time).
But yeah, they're one of the classical examples of "faster on paper, slower in practice" algorithms. Just use a binary (or d-ary) heap.
I think of Fibonacci heaps as essentially just a academic exercise to lower the big-O complexity of Dijkstra's algorithm with very little actual practical use. It always seemed tailor made for that.
Like, Fibonacci heaps guarantee an amortized O(1) "decrease key" operation, which is a stupendously obscure priority queue operation. Except, of course, in Dijkstra's algorithm, where the lowest runtime complexity bounds depends on it.
I run into this case all the time with my React apps that handle huge amounts of mapping data. You know what's beautiful? Just keep it simple, measure it, and add the extra complexity (eg. Array to dict memoized selectors) where performance is actually an issue.
I'm not saying all cases can be deferred until later. Sometimes you really need to address it up front. But I think most can be refactored later.
The default hash functions in most C++ implementations are surprisingly slow—implementations tend to aim for hash quality rather than speed.
I can be totally wrong, but I suggest trying a simpler hash such as FNV. It might outperform the sort-based approach.
I also suggest replacing the hash table implementation with something better, such as absl::flat_hash_map. The C++ std::unordered_map is hampered by compatibility requirements: the designers wanted these unordered containers to be a good substitute for the preexisting std::map, and necessitated certains design decisions such as pointer stability which is not necessary for most applications.
A slow hash-function doesn't explain the increasing time-per-element as the hash table grows though. A poor fit between the keys and hash-function might though.
The hash table implementation can be much faster. Dyalog APL's Unique (∪) computes exactly the same function using a dedicated hash table.
∪ 'AABBADCAB'
ABDC
≢∪ a←{⍵[?⍨≢⍵]} {⍵,⍵[(?⍴)⍨≢⍵]} 5e5?2e9 ⍝ 1e6 ints with 50% unique
500000
cmpx '∪a' ⍝ Time taken by ∪a in seconds
1.8E¯2
So 18ns per element on a million elements (at 3.5GHz). Characteristics of the hash table we use include:
- Hashing with the Murmur3 mixing function (at the top of [1]), not the full hash function
- Open addressing with ints stored directly in the hash table
- Sentinel value for empty slots chosen to be outside the argument's range
- [edit] Preallocated fixed-sized table. It should be resizable when the argument is large enough; I will fix that some day.
We know the range of the argument since we check it in order to possibly apply small-range code, which can use an ordinary lookup table rather than a hash table. In 18.0, which features a largely rewritten Unique implementation (not much faster here), I applied some careful logic in order to use the first entry of the argument as the sentinel rather than trying to find a value not in the hash table.
But Dyalog's sorting actually beats its own Unique!
cmpx '{(1,2≠/⍵)/⍵} {⍵[⍋⍵]} a'
1.3E¯2
({(1,2≠/⍵)/⍵} {⍵[⍋⍵]} a) ≡ ({⍵[⍋⍵]} ∪a)
1
The function {⍵[⍋⍵]} (index the argument by its own grade) sorts a vector ascending. {(1,2≠/⍵)/⍵} gets unique elements from a sorted numeric vector by taking all those elements which are first or unequal to their predecessor: 2≠/⍵ tests for inequality on all pairs of adjacent elements, and / uses a boolean vector on the left to filter elements on the right.
The second line tests that this unique-sort code gives the same result as sorting the result of Unique.
We use a radix sort for vectors of 4-byte or larger numbers. Unfortunately we can't use sorting to implement Unique in this way because Unique has to preserve the ordering in the original argument. However, it could be used to implement the sort-Unique or Unique-sort combination.
std::unordered_map is usually slow while std::sort is quite fast. std::unordered_map's API includes pointer stability across map resizes, which requires storing the contents of the map out-of-line in a separately allocated memory block.
This results in poor cache utilization and higher memory management overhead.
for more on the optimizations that make them fast.
I wrote a quick benchmark to compare sorting, std::set, std::unordered_set, and ska::flat_hash_set (an older version of SwissTable I believe) and flat_hash_set was generally ~2.4x faster, even up to 100M integers.
What he didn't show in the main article, only in subsequent code, is that std::unordered_map is way too slow to be useful.
With a proper fast map, he called it custom_map, it outperforms sort-unique until the data set exceeds the L3 cache size.
Right well, its clear that at least in java, he doesn't pre-allocate the size of the hashmap. Thus, when the hashmap hits its maximum size, of course the resize operation has to copy all of the data into a new map, which makes this no longer O(N) but something like O(NlogN) or O(N^2) depending.
If you do O(N) work every time the hashmap is "full" and double the size of the hashmap, it's still O(N).
Hand-wavy proof:
If your last insert caused a rehash, then you have done O(N) work plus O(N/2) plus O(N/4)... the limit of which is equal to O(2N), and thus only a constant factor more.
No, it's still O(N) amortized. Imagine if the hashmap doubles in capacity whenever it is full or hits its max load factor. Then you end up having O(N) copy operations. O(2 * N) = O(N)
While your objection stands, repeated trainings of a hash table can be very impactful on the actual walltime spent, and it is a bad benchmark not to include the optimization of preallocating.
It's interesting they predicted sort would surpass hashing when the registers got to 512b, and we got AVX512 nowadays. TBH, their SSE4 sort implementation was quite complex (but beautiful).
This is a case where the optimal strategy depends a lot on your input I think.
For example, long ago I wrote a small program to find file duplicates. My first step was to use the fact that files with different lengths can't be duplicates. Thus only files which had the same length had to be checked further.
For files on your average hard drive, that simple test screens the vast majority of them.
If the keys are in random order, and the hash table is larger than L3, I bet you can make the hash version faster by looking ahead N items and issuing __builtin_prefetch() on the hash table array. (There are papers on this for hash joins in databases.)
You could use a modified version of merge sort where if an element already exists you don't add it (you end up adding only unique elements), this will save the extra O(n) where you remove the duplicates.
I've made large joins faster in MySQL by adding gratuitous order by to derived table queries, which lead to probing the index in the join in index order instead of a different, more random, order. The cost of the extra redundant sort was less than the cost of the random btree traversals.
Implicit ordering can outsource it to users and result in some awful to use software.
I think I've told this story before here but the best bug I've ever fixed is adding an order by clause to a query. The client was trying to find items in a combo box that were essentially randomly ordered, it was literally taking hours out of their day to find stuff in this list and forcing them to do overtime to keep up with the workload, they literally cried when I fixed it. I had to sneak the change past management but I'm not sure if I was successful, it may have played a role in my short tenure there.
That's a particularly bad example but like perl4ever I've found this to be more common and in practice a much bigger deal than gratuitous order by clauses.
I've seen many variations where the natural ordering is fine but once it's combined with top, or joined, or filtered the results become very random to users.
One option is this one [1] which I actually wrote as part of this exact task: de-duplicating a list of integers, as part of working on [2].
I wrote about the types of speedups you can expect with radix sort here [3] - it depends on the size of the input elements and their distribution (e.g., if many top bits are zero, radix sort will be much faster), but in my test case I see a ~5x speedup for moderate or large input sizes (thousands of elements or more).
Of course, the same could be said about hash tables...
[1] https://github.com/travisdowns/sort-bench/blob/master/radix7...
[2] https://lemire.me/blog/2019/05/07/almost-picking-n-distinct-...
[3] https://travisdowns.github.io/blog/2019/05/22/sorting.html