In the experiment on teaching an LSTM to count, it's useful to note that the examples it's trained on are derivations [1] from a grammar for a^nb^n (with n > 0), a classic example of a Context-Free Grammar (CFG).
It's well understood that CFGs cannot be induced from examples [2], which accounts for the fact that LSTMs cannot learn "counting" in this manner, nor indeed can any other learning method that learns from examples.
_______________
[1] "Strings generated from"
[2] The same goes for any formal grammars other than finite ones, as in simpler than regular.
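To make the setup concrete, here's a minimal sketch (my own illustration, not code from the article) of what the training data for such an experiment looks like: flat strings drawn from a^nb^n, with all of the derivation structure thrown away.

```python
# Sketch: the learner only ever sees flat strings like these --
# the CFG's derivation trees are not part of the training data.
def anbn(n):
    """Return the unique string a^n b^n for a given n > 0."""
    return "a" * n + "b" * n

examples = [anbn(n) for n in range(1, 6)]
print(examples)  # ['ab', 'aabb', 'aaabbb', 'aaaabbbb', 'aaaaabbbbb']
```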
Aw man! That 1994 paper by Sakakibara is a piece of history! It concludes by
saying that it's "part of the work in the major R&D of the Fifth Generation
Computer Project conducted under the program set up by MITI" [1]. Plus, Sakakibara
is one of my grammar induction heroes :0
However, his algorithm learns CFGs from structural data, which is to say, derivation trees (think parse trees). So it's completely irrelevant to the example in the article, which attempts to learn a^nb^n from examples of its derivations, and that remains impossible.
As to the other paper, by Chen, Tseng and Chen, that's about learning a CFG that reproduces the strings in a corpus, i.e. learning a CFG of a corpus as opposed to the grammar of a context-free language (therefore, a context-free grammar), which, again, remains impossible.
From a quick glance, the example doesn't quite learn regular grammars - rather, it reduces them to finite grammars (e.g. the finite grammar of all US presidential winners and losers) and learns (or, more accurately, invents) a regular expression for them.
Finite grammar induction from positive examples only is feasible in polynomial time, so Peter Norvig's notebook will not cause the fabric of the space-time continuum to be torn asunder, I am sure.
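A toy illustration (my own, not Norvig's actual code) of why finite-language induction from positive examples is tractable: when the target language is finite, simply memorizing the sample and emitting an alternation already yields a correct regular expression in time linear in the sample size.

```python
import re

def learn_finite(examples):
    """Trivially 'learn' a finite language from positive examples:
    return a regex that is the alternation of the (escaped) examples.
    Runs in time linear in the total size of the sample."""
    return "^(" + "|".join(re.escape(s) for s in sorted(set(examples))) + ")$"

pattern = learn_finite(["ab", "aabb", "aaabbb"])
assert re.match(pattern, "aabb")
assert not re.match(pattern, "aaaabbbb")  # anything outside the sample is rejected
```

Norvig's notebook of course does something far more interesting (compressing the expression), but trivial alternation is enough to show the polynomial-time feasibility claim.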
It's not clear that the learnability results about formal grammars are useful for describing what can be learned by practical algorithms, where the grammar does not have to be identified with perfect confidence.
Counting is one thing, and I'm not sure whether that is possible to learn or not (my understanding is that it's not), but the example in the original article is specifically trying to learn to reproduce strings of exactly n a's followed by exactly n b's, for arbitrary positive n.
That's learning the CFG of a^nb^n for n > 0 in a nutshell, and therefore impossible, and indeed the LSTM in the article falters once test-time n exceeds training-time n by a few units.
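What the LSTM would have to represent, in effect, is an unbounded counter. A hand-written recognizer for a^nb^n makes that explicit (a sketch of my own, not from the article):

```python
def is_anbn(s):
    """Recognize a^n b^n (n > 0) using one unbounded counter --
    exactly the resource a finite-state learner lacks."""
    count = 0
    i = 0
    while i < len(s) and s[i] == "a":  # count up over the a's
        count += 1
        i += 1
    while i < len(s) and s[i] == "b":  # count down over the b's
        count -= 1
        i += 1
    # accept only if the whole string was consumed, the counts
    # balanced out, and the string was non-empty (n > 0)
    return i == len(s) and count == 0 and len(s) > 0

assert is_anbn("aaabbb")
assert not is_anbn("aabbb")
assert not is_anbn("abab")
```

Any device with only bounded state can imitate this up to some fixed n, which is consistent with the LSTM working on training-range n and failing a few units beyond it.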
The paper you link to discusses subject matter I'm not familiar with (bootstrapping as in psychology, say) so I'm not really qualified to opine on it. However, from a very quick look it seems like they're learning to associate counting words with numbers (sorry if I'm wrong about that). Still, that's not "learning to count", where you have to be able to make the inductive leap that n + 1 > n, for all n in your target set of numbers, in addition to learning the recursive structure of number strings.
Finally, from the little I've seen of attempts to model counting with machine learning, there hasn't yet been much success, probably because it's impossible to learn a language with both structure and meaning from examples only of its structure and no access to their meaning.