But isn't it really as much of an oversimplification as saying they're O(1)?
No, because big-oh notation implies worst case. It's an asymptotic upper bound. If you want to talk about average case, then you need to qualify what your assumptions are. So insertion into a hash table is O(n), but if we assume an even distribution of keys with a sufficiently large table, then insertion is O(1).
Big-O doesn't imply anything, it talks about functions, not algorithms. There are no laws of the English language that say a big-O bound implicitly refers to the function f(n) = max { T(input) : input is of size n }. Or with hash table insertion it's more like f(n) = max { T(h, input) : h is a hash table of size n, input is of size g(n) } where g is some Theta(log(n)) function. The proof that there's no guarantee that big-O notation refers to some worst-of-worst-case running time function is that people are using it in a way that makes you angry.
Yes, it talks about functions, and when used in the context of algorithms it describes the asymptotic growth of the function that describes the runtime behavior of an algorithm. There are no laws of English, but there are conventions of Computer Science that say that when you call an algorithm O(n), that implies that in the worst case it will be linear.
For any given algorithm, the worst-case runtime behavior of an algorithm and the average-case runtime behavior are described by totally different functions, each of which has its own big-O.
There are no such conventions, you have just implanted in your mind this idea that there should be such conventions. People say that quicksort is O(n log n) all the time. They say that hash tables are O(1). It is normal for O notation to refer to average case or a set of cases with probability 1 of occurring.
Sure, but they should not say that hash tables are "O(1)" (perhaps "average case O(1)" is better). Unlike qsort's n^2 worst case, the conditions that degrade hash tables are very common: all you have to do is specify a crappy hash function for your load, or size the table poorly.
This is kind of a silly semantics argument... but if you interview for a job and look like a deer in the headlights when I say hash tables aren't O(1), NO HIRE.
Unlike qsort's n^2 worst case, the conditions that degrade hash tables are very common: all you have to do is specify a crappy hash function for your load, or size the table poorly.
(I'll assume you're talking about the quicksort algorithm rather than the qsort() function because you're comparing it to hash table accesses in general rather than a specific implementation of them.)
But I'm not sure I agree. The worst case for quicksort is an already-sorted list. I run into this scenario all the time, though usually in a slightly different form. When I iterate through the elements of one of the tree-based containers std::set and std::map they emerge in sorted order. If, inside my loop, I then insert them into a similar container I end up with worst-case performance. The other day I replaced std::set with std::unordered_set and saw a dramatic increase in performance, although it may not have been entirely due to this effect.
On the other hand, a non-crappy hash function should be available to everyone at this point. Personally I like FNV-1a, but haven't found an issue with whatever my C++ standard library is supplying. I have spoken with other programmers though who didn't realize hash tables needed some empty space to perform well or had stumbled into a pathological case with their oversimplified hash function.
This is kind of a silly semantics argument... but if you interview for a job and look like a deer in the headlights when I say hash tables aren't O(1), NO HIRE.
Gee Ptacek, what do you have against deer? I can just imagine some poor interviewee with a bright desk lamp shining into his eyes...
There are such conventions. Take a look at an algorithms textbook, or a theory paper. That some people are semantically lazy does not change the fact that these terms have established definitions within computer science.
Well, the idea of conventions is that people agree to use them. I see a lot of variation in the degree to which people know about or are attached to this one.
While there may be a set of textbooks and papers for which the editors would have flagged "worst case O(...)" as redundant, I think it's relevant to point out that the use of big-oh notation in mathematics predates computer science. So to the extent algorithm analysis is a branch of mathematics, those who use big-oh in this more general way are in fact consistent with the larger body of work.
Those are the conventions in Computer Science, but the conventions in Software Engineering are different. Words having different meanings in the jargons of related fields is a really old problem, and one which has bitten me from time to time. At least we aren't arguing over whether a factor of 10 increase in something is a 10 dB or a 20 dB change.
Knuth, TAOCP 1.2.11.2: People often abuse O-notation by assuming that it gives an exact order of growth; they use it as if it specifies a lower bound as well as an upper bound. [It looks to me like Knuth does this on page 389, 2.3.4.4, right after equation 14.] For example, an algorithm to sort n numbers might be called inefficient "because its running time is O(n^2)." But a running time of O(n^2) does not necessarily imply that the running time is not also O(n). There's another notation, Big Omega, for lower bounds.
I find this a bit confusing because it's unclear to me what Knuth actually endorses for use in average case description.
Wikipedia: Although developed as a part of pure mathematics, this notation is now frequently also used in the analysis of algorithms to describe an algorithm's usage of computational resources: the worst case or average case running time or memory usage of an algorithm is often expressed as a function of the length of its input using big O notation. Wikipedia and the web in general have many examples of "worst case O(...)" which would be redundant under your convention.
So clearly one needs to be vigilant about assumptions when discussing average case, but common usage does not agree that big-oh notation implies worst case.
That's important to keep in mind, particularly for DoS resistance. But isn't it really as much of an oversimplification as saying they're O(1)?
In practice, there are inexpensive ways to consistently avoid that worst case. But you do have to know to do it.
Hmm reminds me of some code I should probably go double check...