As a side note, if you want an "industrial-strength" applied version, at the cost of losing the small/elegant implementation, there's been considerable work in recent years on building robust Markov-language-modeling packages. I mention it here mostly because I didn't realize until recently that there was much to gain on top of the vanilla versions, which are fairly straightforward to DIY.
One active area of research is smoothing the models, so they capture regularities without just memorizing the input text. The vanilla approach chooses a fixed order of Markov chain (e.g. 2-grams or 3-grams), where you pick n either ad hoc or (ideally) to balance underfitting (missing context you could use) against overfitting (memorizing verbatim text strings because the data is too sparse). But the right n can also vary with context, since different words show up with different frequencies in the training data and are more or less variable in what kinds of words follow them. So there are a number of approaches to choosing n adaptively and/or smoothing the counts ("Katz's backoff model", "Kneser-Ney smoothing", etc.).
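To make the backoff idea concrete, here's a rough sketch in Go (to match the submission) of the simplest possible scheme: count successors under every context length up to some maximum, then at generation time fall back from the longest context to shorter ones whenever the longer one has no observed successors. To be clear, this is not Katz's backoff (which discounts and redistributes probability mass) or Kneser-Ney, just the bare idea, and all the names below are made up for illustration:

package main

import (
    "fmt"
    "math/rand"
    "strings"
)

// counts maps a space-joined context to every word observed right after it.
type counts map[string][]string

// model keeps one table per context length: tables[k] is keyed by k preceding words.
type model struct {
    maxContext int
    tables     []counts
}

func newModel(maxContext int) *model {
    m := &model{maxContext: maxContext, tables: make([]counts, maxContext+1)}
    for k := range m.tables {
        m.tables[k] = make(counts)
    }
    return m
}

// Train records each word under every context length from 0 up to maxContext.
func (m *model) Train(text string) {
    words := strings.Fields(text)
    for i, w := range words {
        for k := 0; k <= m.maxContext && k <= i; k++ {
            ctx := strings.Join(words[i-k:i], " ")
            m.tables[k][ctx] = append(m.tables[k][ctx], w)
        }
    }
}

// Next tries the longest usable context first and backs off to shorter ones,
// bottoming out at the unigram table (empty context), which is never empty
// once the model has seen any text.
func (m *model) Next(prev []string) string {
    k := len(prev)
    if k > m.maxContext {
        k = m.maxContext
    }
    for ; k >= 0; k-- {
        ctx := strings.Join(prev[len(prev)-k:], " ")
        if succ := m.tables[k][ctx]; len(succ) > 0 {
            return succ[rand.Intn(len(succ))]
        }
    }
    return "" // only reachable on an untrained model
}

func main() {
    m := newModel(2)
    m.Train("the quick brown fox jumps over the lazy dog and the quick red fox naps")
    out := []string{"the", "quick"}
    for i := 0; i < 10; i++ {
        out = append(out, m.Next(out))
    }
    fmt.Println(strings.Join(out, " "))
}

Cranking up maxContext on a small corpus quickly degenerates into quoting the source verbatim, which is exactly the overfitting trade-off mentioned above.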
From a more engineering point of view, there's also considerable work on pruning the models and representing them efficiently, if you want to work with things of the size of, say, the Google Books ngram dataset.
A few free-software packages that implement a good portion of the recent work:
http://kheafield.com/code/kenlm/
http://code.google.com/p/berkeleylm/
I created a dumb bigram version that also memorizes leaps. Not much theory behind it, but the resulting text has an uneven level of delirium, which makes it somewhat funnier.
Thanks! I was hoping to find a bunch of examples of the text your solution generates, but sadly there's only one. Installing it seems to require some effort, too, so it's a bit hard to get a quick demo.
Various kinds of Markov models are pretty interesting and have many applications that are not too complicated. It's a pity this particular example is being done to death, since it doesn't give much insight into the topic. Shannon's classic paper on information theory is very readable and shows the usefulness of the concept:
Yes! Read the paper. Or, if you are too "busy" on HN, at least skip ahead to sections I.3, THE SERIES OF APPROXIMATIONS TO ENGLISH, and I.4, GRAPHICAL REPRESENTATION OF A MARKOFF PROCESS.
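If you want to reproduce the flavor of the section I.3 experiments yourself, that kind of approximation takes only a few lines. Here is a rough sketch in Go (keeping with the thread's language); it works on raw bytes and hard-codes a toy sample, which are simplifications for illustration rather than anything from the paper:

package main

import (
    "fmt"
    "math/rand"
)

// approximate generates up to n characters, each drawn according to how
// often it followed the previous k characters somewhere in src.
func approximate(src string, k, n int) string {
    // Record, for every k-character window, the characters that followed it.
    next := make(map[string][]byte)
    for i := 0; i+k < len(src); i++ {
        next[src[i:i+k]] = append(next[src[i:i+k]], src[i+k])
    }

    out := []byte(src[:k]) // seed with the opening k characters
    for len(out) < n {
        succ := next[string(out[len(out)-k:])]
        if len(succ) == 0 {
            break // hit a window that only occurs at the very end of src
        }
        out = append(out, succ[rand.Intn(len(succ))])
    }
    return string(out)
}

func main() {
    // Any longish plain-ASCII text will do; larger samples give better "English".
    text := "there is no reverse gear on this motorcycle there is no reverse gear"
    fmt.Println(approximate(text, 3, 80))
}

Raising k makes the output look more and more like the source, which is the "series of approximations" Shannon walks through in that section.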
I simply do not understand HN's fascination with this type of post. "Something that could have been implemented in any other Turing-complete language, NOW IN GO".
Here's a small Node.js state machine / Markov chain tool package that tries to make what's going on under the hood a little more obvious, so as to allow for tinkering. It contains an example for name generation.
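I haven't looked at that package's internals, so this isn't its API, but the name-generation trick is easy to sketch in Go (to match the submission rather than the package's JavaScript): train a character-level chain over a list of names with an explicit end-of-name marker, then walk the chain until the marker comes up. The training names, the two-character context, and the sentinel below are all made up for illustration:

package main

import (
    "fmt"
    "math/rand"
)

const end = '\x00' // sentinel marking the end of a name

// buildChain maps each pair of consecutive characters to the characters
// (or the end sentinel) that followed that pair in the training names.
func buildChain(names []string) map[string][]rune {
    chain := make(map[string][]rune)
    for _, name := range names {
        runes := append([]rune("  "+name), end) // pad so every rune has two predecessors
        for i := 2; i < len(runes); i++ {
            ctx := string(runes[i-2 : i])
            chain[ctx] = append(chain[ctx], runes[i])
        }
    }
    return chain
}

// generate walks the chain from the padded start state until it draws the sentinel.
func generate(chain map[string][]rune) string {
    state := []rune("  ")
    var out []rune
    for {
        succ := chain[string(state)]
        r := succ[rand.Intn(len(succ))]
        if r == end {
            return string(out)
        }
        out = append(out, r)
        state = []rune{state[1], r}
    }
}

func main() {
    // A toy training list; real use would read a longer list of names.
    chain := buildChain([]string{"brandon", "brenda", "belinda", "rolanda", "roland"})
    for i := 0; i < 5; i++ {
        fmt.Println(generate(chain))
    }
}

The end-of-name sentinel is what lets the generator decide when a name is finished instead of running on forever.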
I was pretty sure this was the point of the submission.
I'd like to see more of this: a fresh, interesting take on representing programs, which is especially welcome because it isn't served up as yet another Eclipse plugin or a multi-screen .vimrc.
Edit: the thread above me is talking about bigram implementations; mine supports n-grams by setting the variable $degree to anything you like when you train. By default, it's 1-gram.
Your complaint is a bit knee-jerk. Aside from the conventional "i is for index", the only one-letter identifiers in TFA are the method receivers. It is idiomatic Go to use a single-character variable as the receiver for a method, a la:
type MyStruct struct {
    wobbly int
}

func (m *MyStruct) IsReallyWobbly() bool {
    return m.wobbly > 42
}
So far, I've found this convention straightforward to work with, and there's little reason I can see to make the receiver identifier more elaborate, especially in light of popular languages that omit an explicit "self" identifier entirely.
This seems to be a great example of how one-character variable names can be done well. 'p' for a Prefix, 'c' for a Chain, 's' for a string... it all works very well with this level of consistency, and at this level of code complexity and length. Anything longer would be needlessly verbose.
Unless they are loop counters or array indices, single-letter names are never descriptive or intuitive outside of a five-line function. I write a lot of Go (understatement), and for everything but those, my minimum is three letters, rarely two. I also buck the "system" and use "this" for my receivers :)
Nothing is worse than seeing an "r" in the body of some function, or as a struct member, and not knowing whether it is a reader, response, or request. So rather than using "r", I'll go with rdr, resp, req, or this (where applicable). That is hardly verbose, and our code is immensely more readable than something "idiomatic" like the Go standard library, IMHO. I don't think this should really be considered an issue of idiomatic-ness; if it were, gofmt or go vet should care about it.
The short of it is: if you like Go but not single-letter var names, great, use Go with longer variable names! I realize this is a religious issue, so I can already feel the flames burning :)
I don't consider req a long variable name. What I'm trying to communicate is that req is better than request, and res better than response, since they communicate the same thing and are easier to read and write. When it makes sense, i.e. when it can't lead to confusion, the variables have one-letter names; oftentimes this is e.g. i, j for loop variables or n for a number, and so on. Also, it is very common for a method on a type Bar to bind the Bar to a variable b, i.e. func (b Bar) Foo() { ... }
This is how the entire Go stdlib is written, and how most packages are written.
It's OK to use one-letter variable names when the letter is mere notation and doesn't really mean much. In this context, it's somewhat analogous to "self" all over the place in Python, and it's idiomatic.