I spent about ten years working on Markov-based chat programs. I gave up on them when I realized that no matter how sophisticated your statistical model, it will never be more than a statistical analysis of text unless it includes some rich, rule-based model of mental processes and mental objects. It may be that such a model of mental processes must itself be fuzzy and probabilistic, but it must exist. Therefore I come down firmly on the side of Chomsky in this debate: we should pursue theories of intelligence, and statistical models without any theory do not advance our scientific understanding of AI, however practical their applications may be at present. This is not to say statistical methods do not work; of course they work. What I am saying is that it is not a path that leads to true understanding of intelligence, any more than spectral analysis of the EMF emissions of a running computer would lead to a theory of computation.
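For anyone who hasn't played with one, the kind of Markov chatter I mean is tiny to sketch. This is a toy word-level bigram chain (the corpus and function names are just illustrative), and it makes the limitation obvious: every step depends only on the current word.

```python
import random
from collections import defaultdict

def train_bigram_chain(text):
    """Map each word to the list of words observed to follow it."""
    chain = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length=8, seed=0):
    """Walk the chain; each step depends only on the current word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug"
chain = train_bigram_chain(corpus)
print(generate(chain, "the"))
```

Every adjacent pair in the output occurred in the corpus, so it looks locally plausible, but nothing ties the end of a sentence to its beginning.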
Just because you chose Markov chains as your modeling mechanism doesn't mean that there is no statistical modeling method that is capable of developing something passing for what we'd call "meaning".
This is the same argument that was used against artificial neural networks. Neural network of type A can't do X, therefore neural networks will never do Y.
Language is immensely complex, and real human language involves things which are not encoded in text (and I'd remind you that you were trying to infer meaning from text specifically, not the full multi-channel robustness of humans communicating). We don't even have a full handle on what all of the cognitive processes and factors are that go into the production and understanding of language (although a lot of interesting work has been done toward those ends).
So hearing folks give up and claim that Chomsky is correct because our current tools aren't up to the job is a bit puzzling, because we don't even have a complete understanding of what sort of thing language is, or of what sorts of things we are as systems which can use language.
Chomsky has opinions (and some facts) about what language is and what we are, but he does not have solid proof to confirm his specific conjectures. Is human language context-free? Context-sensitive? Something else? (Chomsky's minimalist program uses movement along a tree to preserve referentiality, plus a bunch of other machinery; alternative syntactic frameworks such as HPSG use directed graphs as the basis of their language modeling. Still others do weirder things with higher-order combinatory logics. And unfortunately, none of the theoretical frameworks appear to be without their drawbacks.)
I am not a specialist, but as far as I know, Chomsky's argument here was that the existence of recursion showed that a Markov approach had to be wrong. Surely a similar argument can be made for statistical approaches? There is no way to represent a reference to some other part of the statement in a purely statistical method. If they work, they work basically by accident.
Just blue-skying here, but it seems to me that if I knew enough about how a statistical program worked, I could craft a sentence that would utterly confuse it, even though it was perfectly intelligible to a normal English speaker. A putative strong-AI program could not be fooled in this way.
Except that his argument is somewhat moot as a practical matter, because there are no infinitely recursive sentences (given that all sentences are finite).
Long-distance dependencies are an issue in language modeling that does need to be accounted for, but all that tells me is that Markov chains aren't the right structure to model language (unless, maybe, you had a MASSIVE amount of data and a Markov chain of an order high enough to account for the majority of sentences. Maybe.)
You can statistically build a model that has recursion. It's just that such a model cannot be sure it has induced the right grammar - that was Chomsky's argument. I think the obvious counter-argument is: so what? Given other constraints like parsimony, you can certainly reliably induce a grammar.
Two experiences have made it clear to me that humans don't understand language that well, without context:
(1) Raising a child. My dad often remarks that he's surprised that my daughter knows how to use a word in just the right context. I'm not, because this is a natural product of mimicry: if you copy what others say, you usually use the words in the correct context. As with computer-generated text, the exceptions are often hilarious.
(2) Song lyrics. I had a very clear experience where I just could not understand the refrain of Gold Guns Girls. It sounded completely unintelligible until I read the lyrics. After that, it sounded crystal clear. Why would reading the lyrics make the song sound different? Context.
There is no valid argument leading FROM your disenchantment with Markov based chat programs, TO a conclusion that machine learning is invalid. Markov chatters are a toy.
Did you produce a better chat bot based on UG? - No? Then on what basis are you junking machine learning?
Machine translation is far from a solved problem. But Chomsky's school claimed they were going to solve it in the 60s or perhaps the 70s. Do you know what is the basis for the most successful current approach to machine translation?
Analysis of text. (But not the kind of simplistic junk one does in a Markov chatter)
All you have done is suggest that some beautiful perfect text model exists natively in every person. (Presumably this evolved somehow - or if you are Chomsky, it just developed like a crystal for no apparent reason). This isn't an explanation of anything unless you actually find that model instantiated in the brain. But this is just not happening. So either our instruments are still too crude to detect it, or it's not really there.
Appealing to an as-yet unknown perfect universal text model does not build a better chat bot or a better explanation of human behavior.
True understanding of intelligence must incorporate an understanding of how learning occurs. Because anyone who watches children sees learning occurring, and only doctrinaire Chomskyists deny that it occurs (because it is not beautiful enough and some abductive argument is claimed to show that it is not sufficient).
The fact that Markov chains are by definition memoryless isn't an argument in favor of Chomsky or magical thinking. Sure, if you want to improve your output you can use (n+1)-grams instead of n-grams, but the curse of dimensionality is going to catch up with you quickly. Language smoothing will help for a little while. Over a long enough horizon, all Markov chain output is gibberish. None of these obvious limitations are an argument against statistical models.
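The curse-of-dimensionality point is just arithmetic. With the illustrative numbers below (a 50,000-word vocabulary and a billion training tokens - both assumed, not from any particular system), the fraction of possible n-grams you could ever observe collapses as n grows, which is why smoothing only buys you a little:

```python
V = 50_000        # assumed vocabulary size
corpus = 10**9    # assumed training corpus size, in tokens

for n in range(1, 6):
    possible = V ** n
    # Even if every token position yielded a distinct n-gram,
    # coverage of the n-gram space is at most corpus / possible.
    coverage = min(1.0, corpus / possible)
    print(f"{n}-grams: {possible:.1e} possible, coverage <= {coverage:.1e}")
```

By n = 3 you can have seen at most about one in a hundred thousand of the possible trigrams, and it gets exponentially worse from there.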
Where is the data showing that statistical methods don't 'advance our understanding'? What does an EEG tell us about how the brain works?
Running a Markov chain model as-is to generate text produces gibberish. You correctly point out that the gibberish can be much higher quality.
Fundamentally it is gibberish not for any simple algorithmic reason, but because generation is occurring without any respect to context or meaning beyond what randomly emerges from the graceful juxtaposition of randomly chosen words.
It is purely about the combinations of words (in that sense, syntax). This shouldn't be surprising - who ever actually expected that generating a kind of syntax model would result in coherent thoughts? At most it can generate texts like weird dreams; it is no surprise that the result is not a cogent discussion of current events.
This does not mean that the same information cannot be used in more sophisticated ways. But these wouldn't be a Markov chatbot. The Markov model would effectively be a component in a larger system that needed to use words. It isn't at all clear that the Markov model is the best possible one, but it is just groundless dogma to insist that learning can't have anything to do with real performance.
It strikes me that the theories are not mutually incompatible at all. They have very different purposes. Chomsky is trying to understand meaning and intelligence at a deep level. Norvig is trying to build models that help people right now (and incidentally, help his company make more money). Any new insights from either path will help refine the other.
So it sounds like what you decided was that you wanted to explore one path and not the other. Nothing wrong with that, but it's a very different statement.
As far as language modeling, this is a recent paper that models language on the character level rather than word level and can track long term dependencies and even generate plausible sounding non-words from time to time: http://www.cs.toronto.edu/~ilya/pubs/2011/LANG-RNN.pdf
The state of the art is improving a bit, although this method still knows nothing of meaning, so it can often generate some strange sentences. Still, I wouldn't write off the whole field yet--just because something didn't work with the tools of years ago doesn't mean it isn't possible.
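The "plausible non-words" effect doesn't even need the RNN from that paper - this is not their method, just a character-level chain on a toy corpus, but it shows the flavor: the model recombines character sequences it has seen, so it can emit letter strings that look wordlike without being words.

```python
import random
from collections import defaultdict

def train_char_model(text, order=3):
    """Map each length-`order` character context to observed next chars."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model

def generate(model, seed_text, length=40, seed=0):
    """Extend seed_text one character at a time from the model."""
    rng = random.Random(seed)
    order = len(next(iter(model)))
    out = seed_text
    for _ in range(length):
        followers = model.get(out[-order:])
        if not followers:
            break
        out += rng.choice(followers)
    return out

text = "the cat chased the dog and the dog chased the cat around the mat"
model = train_char_model(text)
print(generate(model, "the "))
```

On a real corpus the same mechanism stitches together fragments like "chas" + "sed" from different words, which is where the plausible-sounding non-words come from.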