From information theory:
A mathematical theory of communication (information theory, Claude E. Shannon) [1]
Three approaches to the quantitative definition of information (A. N. Kolmogorov) [2]
And from inductive inference and computational learning theory:
A formal theory of inductive inference (Ray Solomonoff, 1964) [3]
Language identification in the limit (Mark E. Gold, 1967) [4]
Inductive Inference of formal languages from positive data (Dana Angluin, 1980) [5]
A theory of the learnable (PAC learning, Leslie Valiant, 1984) [6]
Occam's Razor (Blumer et al, 1987) [7]
Bonus: a deep learning paper
Long Short-Term Memory (Hochreiter and Schmidhuber, 1997)
The first two papers - well, one launched information theory and the other is the paper
where Kolmogorov introduced Kolmogorov complexity (roughly, the length of the shortest
program that outputs a given string).
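To put a little flesh on those two ideas (my own illustration, not from either paper): Shannon entropy measures the average surprise of a symbol under a distribution, while Kolmogorov complexity is the length of the shortest program producing a string - uncomputable, so in practice you can only upper-bound it with an actual short generator. A small Python sketch:

    import math
    from collections import Counter

    def empirical_entropy(s):
        """Shannon entropy (bits per symbol) of the empirical symbol distribution of s."""
        counts = Counter(s)
        n = len(s)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Symbol-by-symbol, this string looks as random as a fair coin: 1 bit/symbol.
    s = "ab" * 500
    print(empirical_entropy(s))          # 1.0

    # Yet a very short program generates the whole 1000-character string, so its
    # Kolmogorov complexity is tiny - that gap between the statistical view and
    # the "shortest description" view is exactly what Kolmogorov's notion captures.
    generator = "print('ab' * 500)"
    print(len(generator), "<<", len(s))  # 17 << 1000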
The second batch of papers starts with Solomonoff's inductive inference work, kinda
important if you want to learn things from other things. Mark Gold's paper proves that
even the regular languages cannot be identified in the limit from positive examples alone
(the negative result covers any class containing all the finite languages plus at least
one infinite one). Dana Angluin's follow-up extends this with learnability results for
various classes of CFGs. Any time someone claims that their deep neural net has learned
a CFG, point them to these two papers.
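To see what "identification in the limit" even means, here's a toy sketch (mine, not Gold's): a learner for the class of finite languages that simply guesses the set of strings it has seen so far. On any positive presentation of a finite language it eventually converges and never changes its mind again; the docstring notes why this strategy is doomed once the class also contains an infinite language, which is the heart of Gold's negative result.

    def finite_language_learner(positive_stream):
        """Identification in the limit from positive data, for the class of
        FINITE languages only: after each example, guess "the set of strings
        seen so far".  For a finite target language, an exhaustive positive
        presentation eventually shows every member, after which the guess
        never changes - the learner has identified the language in the limit.
        Gold's theorem: no learner can do this for a class containing all
        finite languages plus one infinite language (e.g. the regular
        languages), because on positive data alone it can never safely decide
        between the infinite language and its ever-growing finite subsets."""
        guess = set()
        for example in positive_stream:
            guess.add(example)
            yield frozenset(guess)

    # A positive presentation of the finite language {"a", "ab", "abb"}:
    stream = ["a", "ab", "a", "abb", "ab", "a", "abb"]
    for hypothesis in finite_language_learner(stream):
        print(sorted(hypothesis))
    # The guess stabilises at ['a', 'ab', 'abb'] and never changes again.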
Valiant's paper is the theory of machine learning as we know it today. It basically
relaxes the assumptions made in inductive inference and introduces the notion of error:
if you can't learn some concept perfectly, what degree of error is likely from a given
set of training data? Blumer's paper gives a bound on that error for learners with an
Occamist bias (simpler hypotheses are better) and is a basis for understanding
overfitting (the bound loosens as the hypothesis space grows).
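For flavour, here is the standard finite-hypothesis-class version of that kind of bound (stated from memory, not quoted from Blumer et al.): with m >= (1/eps) * (ln|H| + ln(1/delta)) i.i.d. examples, any hypothesis consistent with the sample has true error at most eps with probability at least 1 - delta. The ln|H| term is the Occamist part - the bigger the hypothesis space, the more data you need. A quick Python sketch:

    import math

    def consistent_learner_bound(hypothesis_space_size, epsilon, delta):
        """Sample size m such that, with probability >= 1 - delta, every
        hypothesis in a finite class H that is consistent with m i.i.d.
        examples has true error <= epsilon:
            m >= (ln|H| + ln(1/delta)) / epsilon
        The ln|H| term is why huge hypothesis spaces overfit small samples."""
        return math.ceil((math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon)

    # Boolean conjunctions over 10 variables: |H| = 3^10 (each variable is
    # positive, negated, or absent).
    print(consistent_learner_bound(3 ** 10, epsilon=0.05, delta=0.01))    # 312
    # All boolean functions over 10 variables: |H| = 2^1024.
    print(consistent_learner_bound(2 ** 1024, epsilon=0.05, delta=0.01))  # 14288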
These two sets of papers probably look disconnected - but, learning is compression.
Compression, with generalisation, I guess. Anyway, no, they're not unrelated.
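One crude way to see the slogan (again my illustration, not anyone's paper): data with learnable regularities compresses, noise doesn't.

    import os
    import zlib

    # Structured data: a simple pattern repeated many times.
    structured = b"the cat sat on the mat. " * 400     # 9600 bytes
    # Unstructured data of the same length: pure noise.
    noise = os.urandom(len(structured))

    print(len(zlib.compress(structured)), "of", len(structured))  # a few dozen bytes of 9600
    print(len(zlib.compress(noise)), "of", len(noise))            # roughly 9600 of 9600

    # A compressor that does well on the structured stream has, in effect,
    # learned its regularities; generalisation is the further demand that
    # those regularities keep paying off on data it hasn't seen yet.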
The final paper is the one that introduced LSTMs and the, er, "constant error carousel"
(the solution to vanishing gradients, which this paper is worth reading for).
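If you want the carousel in a handful of lines, here is a minimal numpy sketch of a single LSTM step in the spirit of the 1997 paper - input and output gates only, no forget gate (that was a later addition); the parameter names and shapes are mine. The point is the additive cell update: the previous cell state is carried forward as-is plus a gated increment, so the error signal flowing back along the cell state isn't repeatedly squashed the way it is in a plain RNN.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        """One LSTM step, 1997-style (input gate, output gate, no forget gate).
        W = (W_i, W_o, W_g), U = (U_i, U_o, U_g), b = (b_i, b_o, b_g)."""
        W_i, W_o, W_g = W
        U_i, U_o, U_g = U
        b_i, b_o, b_g = b

        i = sigmoid(W_i @ x + U_i @ h_prev + b_i)   # input gate
        o = sigmoid(W_o @ x + U_o @ h_prev + b_o)   # output gate
        g = np.tanh(W_g @ x + U_g @ h_prev + b_g)   # candidate cell input

        # Constant error carousel: c depends on c_prev additively, with a
        # derivative of 1 along that path, so gradients can travel across
        # many time steps without vanishing.
        c = c_prev + i * g
        h = o * np.tanh(c)
        return h, c

    # Toy usage with made-up sizes (4-dim inputs, 3-dim state):
    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 3, 4))
    U = rng.normal(size=(3, 3, 3))
    b = np.zeros((3, 3))
    h, c = np.zeros(3), np.zeros(3)
    for t in range(5):
        h, c = lstm_step(rng.normal(size=4), h, c, W, U, b)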
These are papers that anyone interested in machine learning must read - carefully so.
They're not even "old" papers - more like, essential ones.
I'm totally omitting a whole bunch of others, obviously.
___________
Online pdfs (not all free):
[1] http://www.math.harvard.edu/~ctm/home/text/others/shannon/en...
[2] http://alexander.shen.free.fr/library/Kolmogorov65_Three-App...
[3.1] https://www.sciencedirect.com/science/article/pii/S001999586... (Part 1)
[3.2] https://www.sciencedirect.com/science/article/pii/S001999586...
[4] https://www.sciencedirect.com/science/article/pii/S001999586...
[5] http://www-personal.umich.edu/~yinw/papers/Angluin80.pdf
[7] https://www.sciencedirect.com/science/article/pii/0020019087...