While this is nice, it seems like without bucketing you would run into complexity issues with large amounts of data, right? i.e. to plot a true eCDF you need a sorted list of all the collected datapoints. I guess for actual plotting you have to effectively bucketize based on the number of pixels in your plot, but that seems fairly arbitrary.
Histograms are nice in that they effectively compress non-trivial datasets (at least those that have a reasonable bounded domain) to something quite manageable.
I guess there is nothing stopping you from doing the same thing here, but it does kind of discount the author's claim of not being able to go between histogram and eCDF.
If you have more data points than horizontal pixels, yes, you will bucket the data on your display resolution. That happens with any kind of plotting.
That is a completely different thing from the arbitrary bucketing for histograms. A CDF doesn't go to zero or become misleading if you bucket it wrong. You just lose detail.
My point is more that for an eCDF you need to store n values (where n is the number of samples), or 2k if there are dupes (where k is the number of distinct values); and if you stored that anyway, you could generate a histogram from the same data.
If there are duplicate sample values, you can still store a sorted list of (sample,count) here and generate either a histogram OR an eCDF, or any other plot really.
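As a rough sketch of that idea (sample values and bucket edges here are made up for illustration), a single sorted list of (value, count) pairs can be folded into either an eCDF or a histogram:

```python
from collections import Counter
import bisect

# Hypothetical data; any stream of samples works the same way.
samples = [0.2, 0.5, 0.2, 0.9, 0.5, 0.5, 1.3]

counts = Counter(samples)            # value -> number of occurrences
pairs = sorted(counts.items())       # the sorted (value, count) store

# eCDF: cumulative counts divided by the total sample count.
n = sum(c for _, c in pairs)
ecdf = []
running = 0
for value, count in pairs:
    running += count
    ecdf.append((value, running / n))

# Histogram: fold the same pairs into (arbitrary) buckets.
edges = [0.0, 0.5, 1.0, 1.5]         # arbitrary bucket edges
hist = [0] * (len(edges) - 1)
for value, count in pairs:
    i = bisect.bisect_right(edges, value) - 1
    if 0 <= i < len(hist):
        hist[i] += count

print(ecdf)   # last entry reaches probability 1.0
print(hist)
```

Both plots come from the same 2k numbers, which is the point: the (value, count) representation loses nothing relative to either one.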
Effectively it is not fair to compare the two methods, since both have storage tradeoffs that are not really discussed.
Nobody is discussing the computing performance of calculating those plots. It's taken for granted that you have more than enough resources to do anything you want with the data. If you don't, you are really in big data territory, and you will start to get all kinds of interesting tradeoffs that are completely different from one place to another.
The entire discussion is about the quality of the information the plot communicates to you. Histograms can be completely misleading, CDFs can't. Finding a bucketing so that a histogram faithfully communicates the underlying data is a non-trivial problem; for CDFs, it's not a problem at all.
Adding elements one at a time into an already sorted structure (a balanced tree, say) is O(log n) per element. But producing a complete sorted list requires doing that n times, so you end up with n log n anyways. Am I missing something?