The OP has a problem: The algorithms don't give any significant, meaningful results. E.g., we can run one of the algorithms, and it can report that there is an outlier or not, and in either case we have no idea if the answer is any better than reading tea leaves.
For sorting numbers, an algorithm is appropriate, e.g., heap sort, once we have checked the algorithm and shown that it actually does sort. For outlier detection, we have no such simple criterion that the algorithm did anything meaningful.
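To make that contrast concrete, here is a minimal sketch (my own illustration, not from the OP's slides; the helper name is made up): for a sort we can test the defining property directly, while for an outlier flag there is nothing comparable to check.

```python
from collections import Counter

def is_correct_sort(original, result):
    """Verify a claimed sort: output is ordered and is a permutation of the input."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    same_items = Counter(original) == Counter(result)
    return ordered and same_items

# For a sort we can test the defining property directly:
data = [5, 3, 8, 1]
assert is_correct_sort(data, sorted(data))

# For an outlier detector there is no analogous check: given only the data and
# a flag ("outlier" / "not an outlier"), nothing here tells us whether the flag
# means anything.
```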
For some first progress, we can consider the framework of statistical hypothesis testing with Type I error, Type II error, and the probability of Type I error. If we can do some derivations so that we know the probability of Type I error, then we have a meaningful way to know how seriously to take the result.
E.g., suppose we are receiving points one at a time. Our null hypothesis is that the points are independent and have the same distribution. Then a Type I error is saying that a point violates the null hypothesis, i.e., is an outlier, when in fact it does not. A Type II error is saying that the point satisfies the null hypothesis when in fact it does not.
Generally we want to be able to adjust, set, and know the probability of Type I error. Then we have an outlier technique with some meaning. If in addition, e.g., via the classic Neyman-Pearson result, we can show that for the probability of Type I error we have selected we get the lowest possible probability of Type II error, then we have the best possible test.
The OP is too eager to work with a Gaussian assumption. Too commonly in practice, that assumption is absurd. Instead, we should be able to work with a distribution-free assumption, that is, the null hypothesis assumes that all the data has the same distribution, but we don't know what that distribution is.
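For what a distribution-free test with a known probability of Type I error can look like, here is a minimal sketch (my own illustration, not anything from the OP or the slides): under the null hypothesis that the points are independent draws from one unknown continuous distribution, a new point is equally likely to fall at any rank among the last n points, so flagging it when it exceeds all n of them has Type I error probability exactly 1/(n + 1), with no Gaussian assumption anywhere.

```python
import random

def rank_outlier_flag(reference, new_point):
    """Flag new_point as an outlier if it exceeds every reference point.

    Under the null (all points i.i.d. from one unknown continuous
    distribution), the probability of this flag, i.e. the Type I error,
    is exactly 1 / (len(reference) + 1), whatever the distribution is.
    """
    return all(new_point > x for x in reference)

# Check the Type I error empirically under an arbitrary (non-Gaussian) null.
random.seed(0)
n = 19          # 1 / (n + 1) = 0.05
trials = 100_000
false_alarms = 0
for _ in range(trials):
    reference = [random.lognormvariate(0, 1) for _ in range(n)]
    new_point = random.lognormvariate(0, 1)   # drawn from the same distribution
    false_alarms += rank_outlier_flag(reference, new_point)
print(false_alarms / trials)  # close to 0.05, with no distributional assumption
```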
I have nothing to do with the slide-deck but the statements
> The OP has a problem: The algorithms don't give any significant, meaningful results.
and
> The OP is too eager to work with a Gaussian assumption.
demonstrate a strong mischaracterization and are supremely superficial. If I may be a little blunt, given your specialization and comment history I have come to expect better.
Compared to the real estate of the slide deck spent on non-parametric methods, Gaussian was just a passing mention. This slide deck is supposed to be a resource to get acquainted with the topic -- not a technical publication. Of course there are no proofs, but there's a list of references at the end which I presume do have the proofs. Although I have not personally verified each of them, from the description of the techniques in the slides it's not that hard to imagine how one can give performance guarantees in a non-parametric setting for many of them -- for example, the nearest-neighbor-based methods are bog standard in this respect.
I saw lots of Gaussian in there and, sorry, missed the sign that said "just a passing mention" :-). From the classic limit theorems, Gaussian is crucial, but expecting real data from complicated situations to be Gaussian is a big stretch. Prof. U. Grenander made this point to me in his office in the Division of Applied Mathematics at Brown when I visited -- as Director of Operations Research at FedEx, I'd gone to Lynn, MA to talk to GE about jet engines, so I took off an afternoon, went down to Providence to see if I wanted to go to Brown for my Ph.D., and talked to Infanti, Kushner, Grenander, Fleming, and a few others.
Yes, probability distributions exist; generally, though, we can't find them. One of the places we can find some probability distributions is the Poisson process, and, as I posted recently here, there are some qualitative, "axiomatic", criteria and also a limit theorem for justifying a Poisson assumption. So, we can get the distribution of times between arrivals, the distribution of the number of arrivals in, say, one hour, and more. The formulas for combinations and permutations permit knowing some more distributions. Still, tough to assume Gaussian.
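As a small sketch of the distributions the Poisson assumption hands us (my own illustration; the arrival rate is a made-up number): with rate lambda per hour, the times between arrivals are exponential with mean 1/lambda, and the count of arrivals in one hour is Poisson(lambda).

```python
import math

lam = 12.0  # assumed arrival rate: 12 arrivals per hour (made-up number)

def prob_gap_exceeds(t_hours, rate=lam):
    """P(time between arrivals > t): exponential tail, exp(-rate * t)."""
    return math.exp(-rate * t_hours)

def prob_count_in_hour(k, rate=lam):
    """P(exactly k arrivals in one hour): Poisson pmf, e^-rate * rate^k / k!."""
    return math.exp(-rate) * rate**k / math.factorial(k)

print(prob_gap_exceeds(10 / 60))          # chance of a gap longer than 10 minutes
print(prob_count_in_hour(15))             # chance of exactly 15 arrivals in an hour
print(sum(prob_count_in_hour(k) for k in range(21)))  # P(at most 20 arrivals)
```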
Sorry, having statistics assume Gaussian for real data, e.g., when looking for outliers, especially when looking for outliers, is a sore point with me. "Especially" is because even if a Gaussian fit looks good on an eyeball comparison of average and standard deviation, or even passes a goodness-of-fit test, looking for outliers means considering the tails of the Gaussian, and that is just where, with just some real data, we are in a lot of doubt.
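To make the point about tails concrete, a small sketch with simulated heavy-tailed data (my own illustration, not from the slides): a Gaussian fitted by mean and standard deviation can look roughly fine in the middle and still badly understate how often a 4-sigma "outlier" actually occurs.

```python
import math
import random

random.seed(1)

# Simulated "real" data: Student-t with 3 degrees of freedom -- roughly
# bell-shaped, but with heavy tails.
def student_t3():
    z = random.gauss(0, 1)
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(3))
    return z / math.sqrt(chi2 / 3)

data = [student_t3() for _ in range(200_000)]
mean = sum(data) / len(data)
std = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

threshold = mean + 4 * std
empirical_tail = sum(x > threshold for x in data) / len(data)
gaussian_tail = 0.5 * math.erfc(4 / math.sqrt(2))   # P(Z > 4) for a standard normal

print(f"empirical P(X > mean + 4*std): {empirical_tail:.2e}")
print(f"Gaussian  P(X > mean + 4*std): {gaussian_tail:.2e}")
# The empirical tail is typically one to two orders of magnitude larger, so a
# 4-sigma rule flags "outliers" far more often than the Gaussian model promises.
```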
Getting a hypothesis test from those algorithms will usually mean doing a derivation to find the probability of Type I error, and such a derivation is commonly difficult or not done yet. Sometimes in practice we have more information that permits a good estimate of the probability of Type I error, e.g., for some graveyard humor, wait ten years and see how many patients died.
Could you point to the pages where you saw Gaussian being assumed? To me it talked more heavily about multidimensional non-parametric methods than parametric ones, let alone Gaussian. It does use Gaussian as a warm-up motivating example in the beginning.
The rest of what you say is pretty standard stuff and should be elementary knowledge for anyone working seriously in this domain (*). Am I missing something?
> Getting a hypothesis test from those algorithms will usually mean doing a derivation to find the probability of Type I error, and such a derivation is commonly difficult or not done yet
Sure it's difficult; that doesn't mean it's not been done. This is bread and butter in ML and non-parametric stats. In fact, Type I is easy, and Type I alone is nowhere near enough. Setting up an RNG (random number generator) will satisfy the requirements of Type I error trivially. The key is Type II, or in other words the key is the proof that the proposed method is better than the bespoke RNG; stated yet another way, if the power is abysmal it counts for squat.
(*) Gratuitous assumptions of Gaussianity and incorrect applications of the CLT do not qualify as "working seriously in this domain" for me.
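A tiny sketch of the RNG point above (my own illustration): a detector that ignores the data and flags with probability alpha has Type I error exactly alpha, but its power against any alternative is also alpha, so controlling Type I alone counts for nothing.

```python
import random

def rng_detector(point, alpha=0.05):
    """'Detector' that ignores the data and flags with probability alpha.

    Its Type I error is exactly alpha by construction, but its power against
    any alternative whatsoever is also alpha -- useless despite a perfectly
    controlled Type I error.
    """
    return random.random() < alpha

random.seed(0)
trials = 100_000
# Type I error: rate of flags on perfectly ordinary points.
type_i = sum(rng_detector(random.gauss(0, 1)) for _ in range(trials)) / trials
# "Power": rate of flags on wildly anomalous points -- no better.
power = sum(rng_detector(1e9) for _ in range(trials)) / trials
print(type_i, power)  # both approximately 0.05
```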
What you said about a random number generator is a trivial test and has big problems with the power (one minus the probability of Type II error) of the test.
For power, I did mention Neyman-Pearson.
In one paper I wrote on such things, multi-dimensional and distribution-free, I did some derivations, not easy to do (the paper was published in Information Sciences), for the probability of Type I error, the significance of the test, and used a classic result of S. Ulam, which Le Cam called "tightness", to show that the power of the test was better than that of a trivial test. Ulam's result is in P. Billingsley, Convergence of Probability Measures. Billingsley was long at U. Chicago.
> What you said about a random number generator is a trivial test and has big problems with the power (one minus the probability of Type II error) of the test.
Erm, that was exactly the last paragraph of my comment.
In ML, distribution-free multivariate tests have a slightly different flavor. The way those work is that you compare the empirical expectations of all functions in a class (typically a Hilbert space) across the two samples. This is reminiscent of Cramér-Wold. The separability properties of a reproducing kernel Hilbert space make it tractable to compare these.
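For readers who want to see the shape of that idea, here is a minimal sketch of a kernel two-sample test (an MMD-style statistic with an RBF kernel and a permutation threshold; my own illustration with made-up parameter values, not taken from the slides): the statistic compares the empirical means of the two samples in the reproducing kernel Hilbert space, and the permutation step calibrates the threshold with no distributional assumption.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2(X, Y, gamma=0.5):
    """Biased estimate of squared MMD between samples X and Y."""
    kxx = rbf_kernel(X, X, gamma).mean()
    kyy = rbf_kernel(Y, Y, gamma).mean()
    kxy = rbf_kernel(X, Y, gamma).mean()
    return kxx + kyy - 2 * kxy

def mmd_permutation_test(X, Y, n_perm=500, gamma=0.5, seed=0):
    """Distribution-free p-value: permute the pooled sample to calibrate MMD^2."""
    rng = np.random.default_rng(seed)
    observed = mmd2(X, Y, gamma)
    pooled = np.vstack([X, Y])
    n = len(X)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        count += mmd2(pooled[perm[:n]], pooled[perm[n:]], gamma) >= observed
    return (count + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
X = rng.normal(0, 1, size=(100, 3))          # reference sample
Y = rng.normal(0.7, 1, size=(100, 3))        # shifted sample: should be detected
print(mmd_permutation_test(X, Y))            # small p-value: distributions differ
print(mmd_permutation_test(X, rng.normal(0, 1, size=(100, 3))))  # large p-value
```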
Good to know. That is WAY beyond anything in the statistics in the pure/applied math department where I got my Ph.D. But in that department I ignored their interests in statistics as much as I could, which was nearly totally.
As I mentioned, I got an, admittedly weak, power result using Ulam's tightness. Yes, it's weak, but it's quite general.
If I return to the mathematical statistics of hypothesis testing, I'll look into what you mentioned.
From what you mentioned, it appears that some of ML is actually being mathematical. I continue to believe that computer science is 'out of gas' and needs some new directions, and those just about have to be from, with, in pure/applied math, applied to problems newly important because of computing, but, still, pure/applied math.
The "learning theory" part of ML has always been rigorous in their arguments. Same for top tier conferences and journals. ML is really applied math that stands on the shoulders of functional analysis, optimization, probability theory and algorithms. The key way in which the flavor of ML is different from the flavor of statistics is that ML theorems are about guarantees on prediction quality whereas in statistics guarantees are on recovering parameters (which could be infinite dimensional). This is a broad generalization so there will be edge cases.
Have you actually tried to implement outlier detection in production on any serious level? Most who do realize pretty quickly that deep learning is way overkill and finicky, and pretty soon head for resources like this.