From information theory:
A mathematical theory of communication (information theory, Claude E. Shannon) [1]
Three approaches to the quantitative definition of information (A. N. Kolmogorov) [2]
And from inductive inference and computational learning theory:
A formal theory of inductive inference (Ray Solomonoff, 1964) [3]
Language identification in the limit (Mark E. Gold, 1967) [4]
Inductive Inference of formal languages from positive data (Dana Angluin, 1980) [5]
A theory of the learnable (PAC learning, Leslie Valiant, 1984) [6]
Occam's Razor (Blumer et al, 1987) [7]
Bonus: a deep learning paper
Long Short-Term Memory (Hochreiter and Schmidhuber, 1997)
The first two papers - well, one launched information theory and the other is the paper
where Kolmogorov introduced Kolmogorov complexity (roughly, the length of the shortest
program that outputs a given string).
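To put a little flesh on those two ideas (my own illustration, not from either paper): Shannon entropy measures the average surprise of a symbol under a distribution, while Kolmogorov complexity is the length of the shortest program producing a string - uncomputable, so in practice you can only upper-bound it with an actual short generator. A small Python sketch:

    import math
    from collections import Counter

    def empirical_entropy(s):
        """Shannon entropy (bits per symbol) of the empirical symbol distribution of s."""
        counts = Counter(s)
        n = len(s)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Symbol-by-symbol, this string looks as random as a fair coin: 1 bit/symbol.
    s = "ab" * 500
    print(empirical_entropy(s))          # 1.0

    # Yet a very short program generates the whole 1000-character string, so its
    # Kolmogorov complexity is tiny - that gap between the statistical view and
    # the "shortest description" view is exactly what Kolmogorov's notion captures.
    generator = "print('ab' * 500)"
    print(len(generator), "<<", len(s))  # 17 << 1000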
The second batch of papers starts with Solomonoff's inductive inference work, kinda
important if you want to learn things from other things. Mark Gold's paper proves that
even the regular languages cannot be identified in the limit from positive examples alone
(the negative result covers any class containing all the finite languages plus at least
one infinite one). Dana Angluin's follow-up extends this with learnability results for
various classes of CFGs. Any time someone claims that their deep neural net has learned
a CFG, point them to these two papers.
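To see what "identification in the limit" even means, here's a toy sketch (mine, not Gold's): a learner for the class of finite languages that simply guesses the set of strings it has seen so far. On any positive presentation of a finite language it eventually converges and never changes its mind again; the docstring notes why this strategy is doomed once the class also contains an infinite language, which is the heart of Gold's negative result.

    def finite_language_learner(positive_stream):
        """Identification in the limit from positive data, for the class of
        FINITE languages only: after each example, guess "the set of strings
        seen so far".  For a finite target language, an exhaustive positive
        presentation eventually shows every member, after which the guess
        never changes - the learner has identified the language in the limit.
        Gold's theorem: no learner can do this for a class containing all
        finite languages plus one infinite language (e.g. the regular
        languages), because on positive data alone it can never safely decide
        between the infinite language and its ever-growing finite subsets."""
        guess = set()
        for example in positive_stream:
            guess.add(example)
            yield frozenset(guess)

    # A positive presentation of the finite language {"a", "ab", "abb"}:
    stream = ["a", "ab", "a", "abb", "ab", "a", "abb"]
    for hypothesis in finite_language_learner(stream):
        print(sorted(hypothesis))
    # The guess stabilises at ['a', 'ab', 'abb'] and never changes again.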
Valiant's paper is the theory of machine learning as we know it today. It basically
relaxes the assumptions made in inductive inference and introduces the notion of error:
if you can't learn some concept perfectly, what degree of error is likely from a given
set of training data? Blumer's paper gives a bound on that error for learners with an
Occamist bias (simpler hypotheses are better) and is a basis for understanding
overfitting (the bound loosens as the hypothesis space grows).
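For flavour, here is the standard finite-hypothesis-class version of that kind of bound (stated from memory, not quoted from Blumer et al.): with m >= (1/eps) * (ln|H| + ln(1/delta)) i.i.d. examples, any hypothesis consistent with the sample has true error at most eps with probability at least 1 - delta. The ln|H| term is the Occamist part - the bigger the hypothesis space, the more data you need. A quick Python sketch:

    import math

    def consistent_learner_bound(hypothesis_space_size, epsilon, delta):
        """Sample size m such that, with probability >= 1 - delta, every
        hypothesis in a finite class H that is consistent with m i.i.d.
        examples has true error <= epsilon:
            m >= (ln|H| + ln(1/delta)) / epsilon
        The ln|H| term is why huge hypothesis spaces overfit small samples."""
        return math.ceil((math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon)

    # Boolean conjunctions over 10 variables: |H| = 3^10 (each variable is
    # positive, negated, or absent).
    print(consistent_learner_bound(3 ** 10, epsilon=0.05, delta=0.01))    # 312
    # All boolean functions over 10 variables: |H| = 2^1024.
    print(consistent_learner_bound(2 ** 1024, epsilon=0.05, delta=0.01))  # 14288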
These two sets of papers probably look disconnected - but, learning is compression.
Compression, with generalisation, I guess. Anyway, no, they're not unrelated.
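One crude way to see the slogan (again my illustration, not anyone's paper): data with learnable regularities compresses, noise doesn't.

    import os
    import zlib

    # Structured data: a simple pattern repeated many times.
    structured = b"the cat sat on the mat. " * 400     # 9600 bytes
    # Unstructured data of the same length: pure noise.
    noise = os.urandom(len(structured))

    print(len(zlib.compress(structured)), "of", len(structured))  # a few dozen bytes of 9600
    print(len(zlib.compress(noise)), "of", len(noise))            # roughly 9600 of 9600

    # A compressor that does well on the structured stream has, in effect,
    # learned its regularities; generalisation is the further demand that
    # those regularities keep paying off on data it hasn't seen yet.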
The final paper is the one that introduced LSTMs and the, er, "constant error carousel"
(the solution to vanishing gradients, which this paper is worth reading for).
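If you want the carousel in a handful of lines, here is a minimal numpy sketch of a single LSTM step in the spirit of the 1997 paper - input and output gates only, no forget gate (that was a later addition); the parameter names and shapes are mine. The point is the additive cell update: the previous cell state is carried forward as-is plus a gated increment, so the error signal flowing back along the cell state isn't repeatedly squashed the way it is in a plain RNN.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        """One LSTM step, 1997-style (input gate, output gate, no forget gate).
        W = (W_i, W_o, W_g), U = (U_i, U_o, U_g), b = (b_i, b_o, b_g)."""
        W_i, W_o, W_g = W
        U_i, U_o, U_g = U
        b_i, b_o, b_g = b

        i = sigmoid(W_i @ x + U_i @ h_prev + b_i)   # input gate
        o = sigmoid(W_o @ x + U_o @ h_prev + b_o)   # output gate
        g = np.tanh(W_g @ x + U_g @ h_prev + b_g)   # candidate cell input

        # Constant error carousel: c depends on c_prev additively, with a
        # derivative of 1 along that path, so gradients can travel across
        # many time steps without vanishing.
        c = c_prev + i * g
        h = o * np.tanh(c)
        return h, c

    # Toy usage with made-up sizes (4-dim inputs, 3-dim state):
    rng = np.random.default_rng(0)
    W = rng.normal(size=(3, 3, 4))
    U = rng.normal(size=(3, 3, 3))
    b = np.zeros((3, 3))
    h, c = np.zeros(3), np.zeros(3)
    for t in range(5):
        h, c = lstm_step(rng.normal(size=4), h, c, W, U, b)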
These are papers that anyone interested in machine learning must read - carefully so.
They're not even "old" papers - more like, essential ones.
I'm totally omitting a whole bunch of others, obviously.
___________
Online pdfs (not all free):
[1] http://www.math.harvard.edu/~ctm/home/text/others/shannon/en...
[2] http://alexander.shen.free.fr/library/Kolmogorov65_Three-App...
[3.1] https://www.sciencedirect.com/science/article/pii/S001999586... (Part 1)
[3.2] https://www.sciencedirect.com/science/article/pii/S001999586...
[4] https://www.sciencedirect.com/science/article/pii/S001999586...
[5] http://www-personal.umich.edu/~yinw/papers/Angluin80.pdf
[7] https://www.sciencedirect.com/science/article/pii/0020019087...