
Unless you perform a proper statistical analysis, it's unfair to draw a conclusion from a single run.

Furthermore, when I see a second run that's faster than the first one, I immediately wonder if it's the cache being cold for the first run and warm for the second.
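A minimal Python sketch of the usual workaround (the make -B build command is a placeholder for whatever your project uses): time a throwaway first run and only average the warm ones.

    import statistics
    import subprocess
    import time

    BUILD_CMD = ["make", "-B"]  # placeholder; -B forces a full rebuild

    times = []
    for _ in range(4):
        start = time.perf_counter()
        subprocess.run(BUILD_CMD, check=True, capture_output=True)
        times.append(time.perf_counter() - start)

    # Run 0 paid for cold caches; report it separately from the warm runs.
    cold, warm = times[0], times[1:]
    print(f"cold: {cold:.2f}s, warm mean: {statistics.mean(warm):.2f}s")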

While I have your attention, https://zedshaw.com/archive/programmers-need-to-learn-statis... is worth reading.




In fairness, the phrase he used was "looks like". I don't think his comment was intended to suggest that he'd done rigorous and exhaustive wide-spectrum analysis of compile times and executable size, just that expectations matched the result for his project.


Thanks :) I'm no stranger to the scrutiny of Hacker News: I did 3 builds in a row and threw out the 1st one (cache); the last two were within 0.1s of each other, so I copied & pasted the latter.


So basically there's no speedup.


I'm pretty sure he means the last two runs of the same compiler.


"Programmers Need To Learn Statistics Or I Will Kill Them All"... What an insufferable asshat.

PSA: There is no reason to behave like this, and it's an incredibly effective way to alienate a bunch of people. You either offend people directly with the murder implication, or they don't take you seriously because you sound like you're throwing a temper tantrum so extended that you managed to write it all up as a blog post.


Or you can stop being offended by words put out on the internet by strangers... which is what I always recommend to basically everyone.


Or, you can not be offended and still criticise someone for being an asshat.


I'm not offended. I'm just not going to waste my time reading an article by someone behaving like a child.


It's like Doonesbury, but it came from the '80s: http://imgur.com/82QXoAj


... maybe it's meant to be a bit ironic/salty/sarcastic/venting?


So, honest question from a non-statistician:

How, concretely, should I go about doing this particular analysis of compile time for one project? How many times should I run the build for each of the 2 compilers, and what should I do with the results so I could: 1. draw a conclusion, and 2. come up with fair numbers for how they compare?

I would hope someone could teach this hopefully simple and very concrete thing to the HN crowd, and I do hope the answer is not "go learn statistics".


You first need to create a clean slate each time you run the experiment: no build cache, no filesystem cache, etc. Maybe a tonne of single-use Docker images? Even then, filesystem caches will mess you up a little.

Beyond that, you need to run the same build "several" times to see what the variance is. Without getting specific, if the builds are within a couple percent of each other, do "a few" and take the mean. If they're all over the place do "lots" and only stop once the mean stabilises. There are specific methods to define "lots" and "a few" but it's usually obvious for large effects and you don't need to worry too much about it.
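Something like this, as a rough Python sketch of that stopping rule (the build command is a placeholder, and the 1% threshold is an arbitrary choice, not a standard):

    import statistics
    import subprocess
    import time

    BUILD_CMD = ["make", "-B"]   # placeholder build command
    REL_TOL = 0.01               # stop once the SEM is within 1% of the mean
    MIN_RUNS, MAX_RUNS = 5, 50

    times = []
    while len(times) < MAX_RUNS:
        start = time.perf_counter()
        subprocess.run(BUILD_CMD, check=True, capture_output=True)
        times.append(time.perf_counter() - start)
        if len(times) >= MIN_RUNS:
            mean = statistics.mean(times)
            sem = statistics.stdev(times) / len(times) ** 0.5
            if sem / mean < REL_TOL:
                break

    print(f"{len(times)} runs: {statistics.mean(times):.2f}s "
          f"± {statistics.stdev(times):.2f}s")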

If you're trying to prove that you've made a 0.1s improvement on an underlying process that is normally distributed with a stddev of, like, 2s, then you're going to have to run it a lot and do some maths to show when to stop and accept the result.
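Back-of-the-envelope version of that maths, using the textbook normal-approximation sample-size formula (the 5% significance and 80% power targets are conventional defaults, not from this thread):

    # n per group = 2 * ((z_alpha/2 + z_beta) * sigma / delta)^2
    Z_ALPHA = 1.96   # two-sided 5% significance
    Z_BETA = 0.84    # 80% power
    sigma = 2.0      # per-run stddev, seconds
    delta = 0.1      # true difference we want to detect, seconds

    n = 2 * ((Z_ALPHA + Z_BETA) * sigma / delta) ** 2
    print(f"~{n:.0f} runs per compiler")  # ~6272 runs each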


I want measurements with filesystem cache because I'm interested in estimating the speed of the compile-test-edit cycle. If you want to estimate the impact on emerge then you'll want no filesystem cache.

It's all about measuring based on what you intend to use the measurements for.
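On Linux you can force the cold-cache case by dropping the page cache between runs (root required); a rough, non-portable sketch:

    import subprocess

    def drop_fs_caches():
        """Linux-only, root required: flush dirty pages, then drop the
        page cache, dentries and inodes before a cold-cache run."""
        subprocess.run(["sync"], check=True)
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")

    # Cold-cache numbers: call drop_fs_caches() before each timed build.
    # Warm-cache numbers (compile-test-edit cycle): do one untimed build
    # first, then time the runs that follow.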


If the measurements are all over the place, why not take the fastest? The average is no good, because it'll be influenced by the times it wasn't running as fast as possible.

I don't myself lose much sleep over worrying about the times it runs faster than possible.


I agree with this sentiment. Any time worse than the fastest is due to noise in the system (schedulers, etc.), so the fastest is the lowest-noise run.

Of course, as I said in another comment, it depends what you want to do with the measurement. If you plan to estimate how long a run will take on an existing system, then you need to accept the noise and use the mean (or median).
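A toy illustration with made-up numbers of how the different summaries answer different questions:

    import statistics

    # Hypothetical build times; one run caught a background-noise spike.
    times = [12.1, 11.8, 11.9, 14.3, 11.8, 12.0]

    print(f"min:    {min(times):.2f}s")                # noise floor / best case
    print(f"median: {statistics.median(times):.2f}s")  # robust to the outlier
    print(f"mean:   {statistics.mean(times):.2f}s")    # expected wall-clock cost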


There are people who have thought about this, e.g., http://onlinelibrary.wiley.com/doi/10.1002/cpe.2939/full

Personally I think it's a better idea to instrument your programs and count the number of memory (block) accesses or something. That metric might actually be useful to a reader a few years in the future. The fact that your program was running faster on a modern x86 processor from the year 2010 tells me nothing about how it would perform today, unless the difference was so large that you never needed statistical testing in the first place...

edit: I'm not sure if this paper is accessible to everyone, so here is an alternate link https://hal.inria.fr/inria-00443839v1/document
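If you go the instrumentation route, one readily available option is Valgrind's cachegrind tool, which simulates the cache hierarchy and reports instruction and data reference counts that are far more stable than wall-clock time; a minimal sketch, with ./prog standing in for the binary under test:

    import subprocess

    # Cachegrind prints instruction refs, data refs and simulated miss
    # counts when the program exits; those totals are (mostly)
    # deterministic across runs, unlike wall-clock time.
    subprocess.run(["valgrind", "--tool=cachegrind", "./prog"], check=True)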



