OTOH, we are merely circa Year Five into deep reinforcement learning research.
It started with a cluster of 16,000 CPU cores that taught itself to recognize a cat 95% of the time after training on 10 million YouTube images.
And we are now at One-Shot Imitation Learning, "a general system that can turn any demonstrations into robust policies that can accomplish an overwhelming variety of tasks".
Not really, no. Saying we are at year five of Deep RL is about as informative as saying we are at year five of deep learning. Reinforcement learning as a field goes back decades.
But now we have GPUs, which makes it entirely different. /s
And it kinda does, but in an engineering way rather than a statistics way.
Like, reinforcement learning from pixels is pretty new (I would be really interested if you have 10+ year old citations), and pretty amazing. I've been looking at RL (through OpenAI Gym) and realising that I "just" need to annotate a bunch of images and then train a network that predicts fire/no-fire in Doom from those pixels, and that I can then add another network that builds some history onto this net (like an RNN), and this might actually work -- which is kinda amazing.
I'm still not sure I believe that it's always a good approach, but some of my initial experiments with my own (mostly image so far) data have been pretty promising.
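For what it's worth, here's a minimal sketch of that fire/no-fire idea in PyTorch -- all names, shapes, hyperparameters, and the stand-in training data are made up, and the RNN-for-history part is left out:

    import torch
    import torch.nn as nn

    # tiny CNN: raw pixels in, fire/no-fire logits out
    class FirePolicy(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(32 * 9 * 9, 2),  # 2 classes: no-fire / fire
            )

        def forward(self, frames):  # frames: (batch, 3, 84, 84)
            return self.net(frames)

    model = FirePolicy()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # stand-ins for hand-annotated screen grabs and their labels
    frames = torch.rand(64, 3, 84, 84)
    labels = torch.randint(0, 2, (64,))  # 0 = no fire, 1 = fire

    for _ in range(10):  # plain supervised training on the annotations
        opt.zero_grad()
        loss = loss_fn(model(frames), labels)
        loss.backward()
        opt.step()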
The hype is pretty annoying though, especially if you've been interested in these things for years.
The bar to entry for these kinds of applications has been significantly lowered, which means we'll see more of it. I guess, in some sense, it's similar to the explosion of computer programs following the advent of personal computers (maybe, I haven't thought deeply about this part).
I'd like to believe that GPUs and the cloud might allow for more scientific exploration of the "hows" of learning: many small experiments gradually revealing limitations and characteristics until, finally, insight.
Using high-speed hardware can let someone do tens or scores of runs a day. If you are doing one run every two weeks or so, it's really, really hard to make any progress at all, because you daren't take risks. So the productivity of 80 runs a day vs. 2 per month isn't just the ~1200x it looks like on paper; it's lots and lots more.
Also, as you say, it's lowered the bar, which means teams can onboard grad students and interns and get them doing something useful - it may be trivial, but it's useful.
It's easy to recognize a cat 95% of the time. I can write a program in 30 seconds that will recognize a cat 95% of the time. No, wait, this just in! My program will recognize a cat 100% of the time! The program has just one line:
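(In Python, say:)

    print("cat")  # whatever the input, the answer is "cat"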
Tutorial: So, with that program, whenever the picture is a cat, the program DOES recognize it. So the program DOES recognize a cat 100% of the time. The OP only claimed 95% of the time.
Uh, we need TWO (2), that's TWO numbers:
conditional probability of recognizing a cat when there is one (detection rate)
conditional probability of claiming there is a cat when there isn't one.
The second is the false alarm rate, also known as the conditional probability of a false alarm, the conditional probability of Type I error, or the significance level of the test -- closely related to the p-value, the most heavily used quantity in all of statistics.
One minus the detection rate is the conditional probability of Type II error.
Typically we can adjust the false alarm rate, and, if we are willing to accept a higher false alarm rate, then we can get a higher detection rate.
With my little program, the false alarm rate is also 100%. So, as a detector, my little program is worthless. But the program does have a 100% detection rate, and that's 5% better than the OP claimed.
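To make those two numbers concrete, here's a toy scoring of the say-cat-always program in Python (the labeled examples are made up):

    def always_cat(image):
        return True  # my one-line detector: everything is a cat

    # made-up labeled test set: (image, is_actually_a_cat)
    examples = [("img1", True), ("img2", True), ("img3", True),
                ("img4", False), ("img5", False)]

    cats     = [img for img, is_cat in examples if is_cat]
    non_cats = [img for img, is_cat in examples if not is_cat]

    # detection rate: P(say "cat" | there IS a cat)
    detection_rate = sum(always_cat(img) for img in cats) / len(cats)
    # false alarm rate: P(say "cat" | there is NO cat)
    false_alarm_rate = sum(always_cat(img) for img in non_cats) / len(non_cats)

    print(detection_rate, false_alarm_rate)  # 1.0 1.0 -- perfect detection, worthless detector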
If we focus ONLY on detection rate, that is, on recognizing a cat when there is one, then it's easy to get a 100% detection rate with just a trivial test -- just say everything is a cat, as I did.
What's tricky is to have the detection rate high and the false alarm rate low. The best way to do that is given by the classic Neyman-Pearson lemma. A good proof is possible using the Hahn decomposition from the Radon-Nikodym theorem of measure theory; von Neumann's famous proof of Radon-Nikodym is in W. Rudin, Real and Complex Analysis.
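For reference, here's the lemma in its simplest form, in my notation (a paraphrase, not a quote): write f_0 for the density when there is no cat, f_1 for the density when there is one, and alpha for the false alarm rate we will tolerate. Then

    \[
      \text{say "cat" exactly when } \Lambda(x) = \frac{f_1(x)}{f_0(x)} > k,
      \qquad k \text{ chosen so that } P_0\big(\Lambda(X) > k\big) = \alpha,
    \]

and among all tests with false alarm rate at most alpha, this likelihood ratio test has the highest detection rate \(P_1(\Lambda(X) > k)\).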
My little program was correct and not a joke.
Again, to evaluate a detector, we need TWO, that's two, or 1 + 1 = 2, numbers.
What about a detector that is overall 95% correct? That's easy, too: just show my detector cats 95% of the time. It's right on all the cats and wrong only on the 5% non-cats, so it's 95% correct overall -- and still worthless as a detector.
If we are to be good at computer science, data science, ML/AI, and dip our toes into a huge ocean of beautifully done applied math, then we need to understand Type I and Type II errors. Sorry 'bout that.
Here is statistical hypothesis testing 101 in a nutshell:

Say you have a kitty cat, and your vet does a blood count, whatever that is, and gets a number. Now you want to know if your cat is sick or healthy.

Okay. From a lot of data on what appear to be healthy cats, we know the probability distribution of the blood count number.

So, we make a hypothesis that our cat is healthy. With this hypothesis, presto, bingo, we know the distribution of the number we got. We call this the null hypothesis, because we are assuming the situation is null, that is, nothing is wrong, that is, our cat is healthy.

Now, suppose our number falls way out in a tail of that distribution. So we say: either (A) our cat is healthy and we have observed something rare, or (B) the rare is too rare for us to believe, so we reject the null hypothesis and conclude that our cat is sick.
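A minimal sketch of that procedure in Python, assuming (purely for illustration) that healthy blood counts are roughly normal with a known mean and standard deviation -- the numbers are made up:

    from math import erfc, sqrt

    # made-up healthy-cat statistics: mean and standard deviation
    # of the blood count among cats that appear healthy
    HEALTHY_MEAN = 10.0
    HEALTHY_SD = 1.5

    def test_cat(blood_count, alpha=0.05):
        # how many standard deviations from the healthy mean?
        z = abs(blood_count - HEALTHY_MEAN) / HEALTHY_SD
        # two-sided tail probability under the null (healthy) hypothesis
        p = erfc(z / sqrt(2))
        # reject the null if the observation is too rare to believe
        return "sick?" if p < alpha else "no evidence of sickness"

    print(test_cat(14.2))  # way out in a tail  -> "sick?"
    print(test_cat(10.7))  # near the middle    -> "no evidence of sickness"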
Historically that little procedure worked great for testing whether a roulette wheel was crooked.
So, like many before you, if you think about that little procedure too long, you start to have questions! A lot of good math people don't believe in statistical hypothesis testing; but typically, when it's their father, mother, wife, cat, son, or daughter being tested, they DO start to believe!
Issues:
(1) Which tail of the distribution, the left or the right? Maybe in some context, with some more information, we will know. E.g., for blood pressure in the elderly we consider the upper tail, that is, blood pressure too high; for a sick patient maybe we consider blood pressure too low, unless they are sick from, say, cocaine, in which case we may consider too high. So, which tail to use is not in the little two-step dance I gave. Hmm, purists may be offended -- often the case when statistics is looked at too carefully! But, again, if it's your dear, total angel of a perfect daughter, then ...!
(2) If we have data on healthy kitty cats, what about data on sick ones, too? Could we use it? Yes, and we should. But in some real situations all we have a shot at getting is data on the healthy case -- e.g., maybe we have oceans of data on the healthy case (say, a high-end server farm) but darned little data on the sick cases, e.g., the next really obscure virus attack.
(3) Why the tails at all? Why not just any region of low probability? Hmm .... Partly because we worship at the altar of central tendency?
Another reason is a bit heuristic: by going for the tails, for any selected false alarm rate, we maximize the geographical area of the rejection region, and so, roughly, our chances of detection.
Okay, then we can generalize that to multidimensional data, e.g., as we might get from several variables from a kitty cat, dear angel perfect daughter, or a big server farm. That is, the distribution of the data in the healthy case looks like the Catskill Mountains. Then we pour in water to create lakes (assume they all seek the same level). The false alarm rate is the probability of the ground area under the lakes. A detection is a point in a lake. For a lower false alarm rate, we drain out some of the water. We maximize the geographical area for the false alarm rate we are willing to tolerate.
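Here's a rough numpy rendering of that picture -- the "healthy" density, the grid, and alpha are all made up:

    import numpy as np

    grid = np.linspace(-4, 4, 200)
    x, y = np.meshgrid(grid, grid)

    # a made-up "healthy" density: two bumps, our little Catskills
    density = np.exp(-((x - 1) ** 2 + y ** 2)) + np.exp(-((x + 1) ** 2 + (y - 1) ** 2))
    density /= density.sum()  # normalize to a probability mass on the grid

    alpha = 0.05  # the false alarm rate we are willing to tolerate
    # pour in water: find the water level such that the probability mass
    # of the ground under the lakes (low-density cells) is exactly alpha
    order = np.argsort(density, axis=None)      # lowest ground first
    cum = np.cumsum(density.ravel()[order])
    level = density.ravel()[order][np.searchsorted(cum, alpha)]

    def in_a_lake(px, py):
        # a new observation is a detection if it lands in a lake,
        # i.e., where the healthy density is below the water level
        j = np.abs(grid - px).argmin()  # column index (x)
        i = np.abs(grid - py).argmin()  # row index (y)
        return density[i, j] < level

    print(in_a_lake(3.5, -3.5))  # far from both bumps -> True (detection)
    print(in_a_lake(1.0, 0.0))   # atop a bump -> False (healthy-looking)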
Well, I cheated -- that same nutshell also covers some of semester 102.
For more, the big name is E. Lehmann, long at Berkeley.
One-Shot Imitation Learning
https://arxiv.org/abs/1703.07326