You don't need the lookup table. All you are asked for is min/mean/max, which can all be computed in one pass without storing the data. All you need is a hash table with 400 entries, each holding 3 floats (running min, mean and max) and an int (count, for updating the running mean). That's just 16 bytes per entry; even with another 16 bytes for the station name, everything fits under 16K.
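Something along these lines (a rough sketch only; the struct layout and the incremental-mean update are just one way to do it):

    #include <stdint.h>
    #include <float.h>

    /* One entry per station: 3 floats + 1 int = 16 bytes of stats. */
    struct station_stats {
        float   min;    /* running minimum */
        float   max;    /* running maximum */
        float   mean;   /* running mean */
        int32_t count;  /* rows seen so far, needed to update the mean */
    };

    static void stats_init(struct station_stats *s) {
        s->min   = FLT_MAX;
        s->max   = -FLT_MAX;
        s->mean  = 0.0f;
        s->count = 0;
    }

    /* Fold one temperature reading into the running stats. */
    static void stats_update(struct station_stats *s, float temp) {
        if (temp < s->min) s->min = temp;
        if (temp > s->max) s->max = temp;
        s->count += 1;
        s->mean  += (temp - s->mean) / (float)s->count;  /* incremental mean */
    }

(In practice you'd probably keep a 64-bit integer sum of tenths of a degree instead of a float mean, to avoid rounding drift over a billion rows - see the overflow discussion below.)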
IO will dominate the running time for this, and JSON parsing will be second.
I don't have a CS background, and when I eventually had to conduct interviews for Google, "calculate mean/median/mode of temperatures" was the interview question I went with, intending to avoid BS leetcode*
I always worried it was too easy, but I'm heartened by how many comments miss the insight I always looked for and you named: you don't need to store a single thing.
I do wonder if it'll work here, at least as simply as the tracking vars you mention. With so many rows, and the implicit spirit of "you should be able to handle _all_ data", overflow might become a legitimate concern - e.g. we might not be able to track the mean as simply as maintaining sum + count vars.
* "oh sorry times up, but the right answer is a red black self balancing terenary tree recursive breadth-first search", ironically, was one of my Apple interviews.
In general, median and mode are much harder than min/avg/max. You can't compute them with constant memory in one pass (you can do an approximate median, but not an exact one).
(Here the temperature has a restricted range with only 1999 possible values (-99.9 to 99.9 in 0.1 increments), so you could do it in constant memory; you'd need something like 4*1999 bytes per unique place name.)
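Roughly like this, assuming readings have already been parsed into integer tenths of a degree in [-999, 999] (names and layout here are just illustrative, not anything from the challenge):

    #include <stdint.h>

    #define NUM_BUCKETS 1999                      /* -99.9 .. 99.9 in 0.1 steps */
    #define BUCKET_OF(tenths) ((tenths) + 999)    /* map -999..999 to 0..1998 */

    /* Per-station histogram: one counter per possible temperature,
       i.e. about 4*1999 bytes per unique place name. */
    struct temp_hist {
        uint32_t count[NUM_BUCKETS];
        uint64_t total;                           /* readings seen for this station */
    };

    static void hist_add(struct temp_hist *h, int tenths) {
        h->count[BUCKET_OF(tenths)]++;
        h->total++;
    }

    /* Exact median (lower median for an even count), in tenths of a degree. */
    static int hist_median_tenths(const struct temp_hist *h) {
        if (h->total == 0)
            return 0;                             /* no readings */
        uint64_t target = (h->total + 1) / 2;     /* rank of the median element */
        uint64_t seen = 0;
        for (int b = 0; b < NUM_BUCKETS; b++) {
            seen += h->count[b];
            if (seen >= target)
                return b - 999;
        }
        return 0;                                 /* unreachable */
    }

The same per-station histogram also gives you an exact mode for free: just take the bucket with the largest count.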
For the sum, overflow is not an issue if you use 64-bit integers. Parse everything to integers in tenths of a degree, and even if all 1 billion rows are a 99.9 temperature for the same place name (the worst possible case), the sum is only 10^9 * 999 ~ 10^12, very far from the ~9.2*10^18 limit of a signed 64-bit integer.
I am silly and wrote 'mode' and shouldn't have :P (wetware error: saw list of 3 items corresponding to leetcode and temperature dataset, my 3 were min/max/average, their 3 are mean/median/mode)
Maintaining sum+count is more expensive than you'd imagine... because to maintain the sum, you need to convert the numbers from ASCII to decimal, and to do that you at least need to know the memory alignment of the number and the position of the decimal point.
All far more expensive than the state machine approach, which is pretty much "alignment doesn't matter, ASCII doesn't matter, we're just going to handle all possible states that could end up in a 32-bit register".
Streaming calculation of the exact median with no storage at all is non-trivial at best in the general case, and I'm not aware of any way to calculate the mode at all. Any pointers to the methods you used in your interview answers?
If you came up with them on the fly, then... well, sign here for your bonus. Can you start Monday?
> Streaming calculation of the exact median with no storage at all is non-trivial at best in the general case
It's not "non-trivial" it's impossible. I'm not sure why people think median can be approximated at all. You need to look at every data point and store a counter for the lesser of: (a) all possible values, or (b) all elements. Just consider a data set with 1 million ones and 999,999 zeroes. Throwing away (or failing to look at) one single number can give you an error of 50%, or 100% for two numbers. If you want to make it really nasty, throw in another million random 64-bit floats between [-0.1, 0) and a million between (1, 1.1]. Four million elements, two million of them unique, in the range [-0.1, 1.1], and failing to account for two elements can get your answer wrong by +/- 1.0.
Unless you start by sorting the list, but I wouldn't call that a "streaming" calculation.
Yeah, that's why I hedged so heavily; AFAIK it's impossible to compute a streaming median (having spent some time trying).
If someone on HN knows how to do it, they will jump in and tell me exactly why it's "trivial," and I'll get some nice R&D work for free. Of course, it will probably involve mounting an FTP account, spinning up a CA and running rsync...
> If someone on HN knows how to do it, they will jump in and tell me exactly why it's "trivial,"
n.b. to you and your parent comment's commenter: you're the only two who used the word "trivial" in the entire comments section, modulo another comment not in this thread that contains "SQL in theory makes this trivial...". The fantasy of an HNer claiming they could build a streaming median function in a weekend might not come to fruition :(
No, it is not possible to approximate the median in the general case. The linked paper does no such thing. Their algorithm can be completely defeated by carefully tailored inputs; they only ever test it against very well-behaved random distributions.
But it brings up an important point: real data that you actually encounter is not a random sampling from "the general case". It is often possible to do a pretty good job approximating the median from real-world data, if you have some understanding of the distribution of that data. You just have to accept the fact that you might be totally wrong, if the data behaves in ways you don't expect. But of course, the more you have to know about your data before running the algorithm, the less useful the algorithm itself is.
The difference is whether you can make guarantees or not. Most algorithm design is concerned about such guarantees: on all possible inputs, this algorithm has at worst such-and-such performance. Hence the reliance on Big-O. You cannot ever make such guarantees on a "median approximator" without specifying some sort of precondition on the inputs.
What guarantees do we want to make? With "an estimator", it'd be nice to say something like: we'd like our approximation to get better given more values. That is: the more values we see, the less the next single value should be able to change our approximation. If you've looked at a billion values, all between [min, max], it'd be nice if you knew that looking at the next one or two values could only have an effect of at most 1 / f(1 billion) for some monotonically increasing f. Median does not have that property: looking at just two more values (even still within the range of [min, max]) could move your final answer all the way from max to min. If you stopped 2 data points earlier, your answer would be as wrong as possible. This remains true, for some inputs, even if you've looked at 10^10^10^10^300 values. The next 2 might change your answer by 100%.
Merely finding the start/end of each line will use more computation (in the inner loop) than the approach I outlined, let alone converting the number from ASCII to a float, or looking up the place name in a hash table (oh, and the name is variable length, so you're going to need to find how long the name is first).
I deleted two previous comments because I realized I misunderstood your proposal.
I understand it better now, but I am still confused about something...
Your state machine would need at least 2*160,000 states (you need an extra bit to flag whether you have reached a newline in the last word and need to increment a counter or not), correct? And you are assuming the input is 4 bytes, so won't your transition table need (2^32)*2*160,000 ~ 10^15 entries (at least 3 bytes each)?
The states don't need to map 1:1 with cities or temperatures. They merely need to encode all information collected so far that is still relevant. They also don't need to represent all possible situations - anything that is super rare (e.g. a temperature of 95C) can simply be diverted to a special "invalid" state which triggers regular code to take over for those few entries.
Hmm, still doesn't seem feasible. Even if you only have 256 "relevant" states (which I think you'll agree is far fewer than you need), then given a 32-bit input your state transition table is 2^32 * 256 = 2^40 entries - a terabyte even at one byte per entry.
You could shrink your input size to 2 bytes but then you can't work on a word at a time, and for a realistic number of relevant states your transition table is still way bigger than you can fit in even L3 cache.
Unless I am missing something very basic, this doesn't seem like a viable approach.
> IO will dominate the running time for this, and JSON parsing will be second.
Memory bandwidth might dominate, but probably not I/O. The input file is ~12GB, the machine being used for the test has 32GB, and the fastest and slowest of five runs are discarded. The slowest run will usually be the first run (if the file is not already cached in memory), after which there should be little or no file I/O.
Having a quick look at your code, a couple of thoughts:
- You shouldn't bother with parsing and validating UTF-8. Just pretend it's ASCII. Non-ASCII characters are only going to show up in the station name anyway, and all you are doing with the name is hashing and copying it.
- You are first chopping the file into line chunks and then parsing each line. You can do it in one go: just look at each character byte by byte until you hit a semicolon, and compute a running hash as you go. You can also parse the number into an int (ignoring the decimal point) using custom code and be faster than the generic float parser.
- If instead of reading the file with the standard library you mmap it, that should also speed things up a bit (rough sketch of all three points below).
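Something like this, as a sketch only - error handling dropped, "measurements.txt" and the FNV-1a hash are just placeholders, and it assumes every line (including the last) ends with '\n':

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Placeholder for the real hash-table update (min/max/sum/count per station). */
    static void record(uint64_t name_hash, const char *name, int name_len, int temp_tenths) {
        (void)name_hash; (void)name; (void)name_len; (void)temp_tenths;
    }

    static void process(const char *buf, size_t len) {
        size_t i = 0;
        while (i < len) {
            /* Station name: hash it byte by byte until the ';' (FNV-1a here). */
            const char *name = buf + i;
            uint64_t h = 14695981039346656037ULL;
            while (buf[i] != ';') {
                h = (h ^ (uint8_t)buf[i]) * 1099511628211ULL;
                i++;
            }
            int name_len = (int)(buf + i - name);
            i++;                                   /* skip ';' */

            /* Temperature: custom parse into tenths of a degree, no float parser. */
            int sign = 1;
            if (buf[i] == '-') { sign = -1; i++; }
            int v = 0;
            while (buf[i] != '\n') {
                if (buf[i] != '.')                 /* ignore the decimal point */
                    v = v * 10 + (buf[i] - '0');
                i++;
            }
            i++;                                   /* skip '\n' */

            record(h, name, name_len, sign * v);
        }
    }

    int main(void) {
        int fd = open("measurements.txt", O_RDONLY);   /* assumed file name */
        struct stat st;
        fstat(fd, &st);
        const char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        process(buf, st.st_size);
        munmap((void *)buf, st.st_size);
        close(fd);
        return 0;
    }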