While this is nice, it seems like without bucketing you would run into complexity issues with large amounts of data, right? i.e. to plot a true eCDF you need a sorted list of all the collected datapoints. I guess for actual plotting you have to effectively bucketize based on the number of pixels in your plot, but that seems fairly arbitrary.
Histograms are nice in that they effectively compress non-trivial datasets (at least those that have a reasonable bounded domain) to something quite manageable.
I guess there is nothing stopping you from doing the same thing here, but it does kind of discount the author's claim of not being able to go between histogram and eCDF.
If you have more data points than horizontal pixels, yes, you will bucket the data on your display resolution. That happens with any kind of plotting.
That is a completely different thing from the arbitrary bucketing for histograms. A CDF doesn't go to zero or become misleading if you bucket it wrong. You just lose detail.
My point is more that for an eCDF you need to store n values (where n is the number of samples), or 2k if there are dupes (where k is the number of distinct values); and if you stored that anyway, you could generate a histogram from the same data.
If there are duplicate sample values, you can still store a sorted list of (sample,count) here and generate either a histogram OR an eCDF, or any other plot really.
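As a rough sketch of that idea (sample values and bucket edges here are made up for illustration), a single sorted list of (value, count) pairs can be folded into either an eCDF or a histogram:

```python
from collections import Counter
import bisect

# Hypothetical data; any stream of samples works the same way.
samples = [0.2, 0.5, 0.2, 0.9, 0.5, 0.5, 1.3]

counts = Counter(samples)            # value -> number of occurrences
pairs = sorted(counts.items())       # the sorted (value, count) store

# eCDF: cumulative counts divided by the total sample count.
n = sum(c for _, c in pairs)
ecdf = []
running = 0
for value, count in pairs:
    running += count
    ecdf.append((value, running / n))

# Histogram: fold the same pairs into (arbitrary) buckets.
edges = [0.0, 0.5, 1.0, 1.5]         # arbitrary bucket edges
hist = [0] * (len(edges) - 1)
for value, count in pairs:
    i = bisect.bisect_right(edges, value) - 1
    if 0 <= i < len(hist):
        hist[i] += count

print(ecdf)   # last entry reaches probability 1.0
print(hist)
```

Both plots come from the same 2k numbers, which is the point: the (value, count) representation loses nothing relative to either one.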
Effectively it is not fair to compare the two methods, since both have storage tradeoffs that are not really discussed.
Nobody is discussing the computing performance of calculating those plots. It's taken for granted that you have more than enough resources to do anything you want with the data. If you don't, you are really in big data territory, and you will start to get all kinds of interesting tradeoffs that are completely different from one place to another.
The entire discussion is about the quality of the information the plot communicates to you. Histograms can be completely misleading, CDFs can't. Finding a bucketing so that a histogram faithfully communicates the underlying data is a non-trivial problem; for CDFs, it's not a problem at all.
Adding elements one at a time into an already sorted structure (a balanced tree, say) is O(log n) per element. But producing a complete sorted list requires doing that n times, so you end up with n log n anyways. Am I missing something?