As a side note, if you want an "industrial-strength" applied version, at the cost of losing the small/elegant implementation, there's been considerable work in recent years on building robust Markov-language-modeling packages. I mention it here mostly because I didn't realize until recently that there was much to gain on top of the vanilla versions, which are fairly straightforward to DIY.
One active area of research is smoothing the models, so they capture regularities without just memorizing the input text. The vanilla approach chooses a fixed order of Markov chain (e.g. 2-grams or 3-grams), where you pick n either ad hoc or (ideally) to balance underfitting (missing context you could use) against overfitting (memorizing verbatim text strings because the data is too sparse). But the right n can also vary with context, since different words show up with different frequencies in the training data and are more or less variable in what kinds of words follow them. So there are a number of approaches to choosing n adaptively and/or smoothing the counts ("Katz's backoff model", "Kneser-Ney smoothing", etc.).
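To make the backoff idea concrete, here's a rough sketch in Go (to match the submission) of the simplest possible scheme: count successors under every context length up to some maximum, then at generation time fall back from the longest context to shorter ones whenever the longer one has no observed successors. To be clear, this is not Katz's backoff (which discounts and redistributes probability mass) or Kneser-Ney, just the bare idea, and all the names below are made up for illustration:

package main

import (
    "fmt"
    "math/rand"
    "strings"
)

// counts maps a space-joined context to every word observed right after it.
type counts map[string][]string

// model keeps one table per context length: tables[k] is keyed by k preceding words.
type model struct {
    maxContext int
    tables     []counts
}

func newModel(maxContext int) *model {
    m := &model{maxContext: maxContext, tables: make([]counts, maxContext+1)}
    for k := range m.tables {
        m.tables[k] = make(counts)
    }
    return m
}

// Train records each word under every context length from 0 up to maxContext.
func (m *model) Train(text string) {
    words := strings.Fields(text)
    for i, w := range words {
        for k := 0; k <= m.maxContext && k <= i; k++ {
            ctx := strings.Join(words[i-k:i], " ")
            m.tables[k][ctx] = append(m.tables[k][ctx], w)
        }
    }
}

// Next tries the longest usable context first and backs off to shorter ones,
// bottoming out at the unigram table (empty context), which is never empty
// once the model has seen any text.
func (m *model) Next(prev []string) string {
    k := len(prev)
    if k > m.maxContext {
        k = m.maxContext
    }
    for ; k >= 0; k-- {
        ctx := strings.Join(prev[len(prev)-k:], " ")
        if succ := m.tables[k][ctx]; len(succ) > 0 {
            return succ[rand.Intn(len(succ))]
        }
    }
    return "" // only reachable on an untrained model
}

func main() {
    m := newModel(2)
    m.Train("the quick brown fox jumps over the lazy dog and the quick red fox naps")
    out := []string{"the", "quick"}
    for i := 0; i < 10; i++ {
        out = append(out, m.Next(out))
    }
    fmt.Println(strings.Join(out, " "))
}

Cranking up maxContext on a small corpus quickly degenerates into quoting the source verbatim, which is exactly the overfitting trade-off mentioned above.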
From a more engineering point of view, there's also considerable work on pruning the models and representing them efficiently, if you want to work with things of the size of, say, the Google Books ngram dataset.
A few free-software packages that implement a good portion of the recent work:
http://kheafield.com/code/kenlm/
http://code.google.com/p/berkeleylm/
I created a dumb bigram version that also memorizes leaps. Not much theory behind it, but the resulting text has an uneven level of delirium, which makes it somewhat funnier.
Thanks! I was hoping to find a bunch of examples of the text your solution generates, but sadly there's only one. Installing it seems to require some effort, too, so it's a bit hard to get a quick demo.
Various kinds of Markov models are pretty interesting and have many applications that are not too complicated. It's a pity this particular example is being done to death, since it doesn't give much insight into the topic. Shannon's classic paper on information theory is very readable and shows the usefulness of the concept:
Yes! Read the paper. Or, if you are too "busy" on HN, at least skip ahead to sections I.3, THE SERIES OF APPROXIMATIONS TO ENGLISH, and I.4, GRAPHICAL REPRESENTATION OF A MARKOFF PROCESS.
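If you want to reproduce the flavor of the section I.3 experiments yourself, that kind of approximation takes only a few lines. Here is a rough sketch in Go (keeping with the thread's language); it works on raw bytes and hard-codes a toy sample, which are simplifications for illustration rather than anything from the paper:

package main

import (
    "fmt"
    "math/rand"
)

// approximate generates up to n characters, each drawn according to how
// often it followed the previous k characters somewhere in src.
func approximate(src string, k, n int) string {
    // Record, for every k-character window, the characters that followed it.
    next := make(map[string][]byte)
    for i := 0; i+k < len(src); i++ {
        next[src[i:i+k]] = append(next[src[i:i+k]], src[i+k])
    }

    out := []byte(src[:k]) // seed with the opening k characters
    for len(out) < n {
        succ := next[string(out[len(out)-k:])]
        if len(succ) == 0 {
            break // hit a window that only occurs at the very end of src
        }
        out = append(out, succ[rand.Intn(len(succ))])
    }
    return string(out)
}

func main() {
    // Any longish plain-ASCII text will do; larger samples give better "English".
    text := "there is no reverse gear on this motorcycle there is no reverse gear"
    fmt.Println(approximate(text, 3, 80))
}

Raising k makes the output look more and more like the source, which is the "series of approximations" Shannon walks through in that section.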
I simply do not understand HN's fascination with this type of post. "Something that could have been implemented in any other Turing-complete language, NOW IN GO".
Here's a small Node.js state machine / Markov chain tool package that tries to make what's going on under the hood a little more obvious, so as to allow for tinkering. It contains an example for name generation.
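I haven't looked at that package's internals, so this isn't its API, but the name-generation trick is easy to sketch in Go (to match the submission rather than the package's JavaScript): train a character-level chain over a list of names with an explicit end-of-name marker, then walk the chain until the marker comes up. The training names, the two-character context, and the sentinel below are all made up for illustration:

package main

import (
    "fmt"
    "math/rand"
)

const end = '\x00' // sentinel marking the end of a name

// buildChain maps each pair of consecutive characters to the characters
// (or the end sentinel) that followed that pair in the training names.
func buildChain(names []string) map[string][]rune {
    chain := make(map[string][]rune)
    for _, name := range names {
        runes := append([]rune("  "+name), end) // pad so every rune has two predecessors
        for i := 2; i < len(runes); i++ {
            ctx := string(runes[i-2 : i])
            chain[ctx] = append(chain[ctx], runes[i])
        }
    }
    return chain
}

// generate walks the chain from the padded start state until it draws the sentinel.
func generate(chain map[string][]rune) string {
    state := []rune("  ")
    var out []rune
    for {
        succ := chain[string(state)]
        r := succ[rand.Intn(len(succ))]
        if r == end {
            return string(out)
        }
        out = append(out, r)
        state = []rune{state[1], r}
    }
}

func main() {
    // A toy training list; real use would read a longer list of names.
    chain := buildChain([]string{"brandon", "brenda", "belinda", "rolanda", "roland"})
    for i := 0; i < 5; i++ {
        fmt.Println(generate(chain))
    }
}

The end-of-name sentinel is what lets the generator decide when a name is finished instead of running on forever.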
I was pretty sure this was the point of the submission.
I'd like to see more of this: a fresh, interesting take on representing programs, which is especially welcome because it isn't served up as yet another Eclipse plugin or a multi-screen .vimrc.
Edit: the thread above me is talking about bigram implementations; mine supports n-grams by setting the variable $degree to anything you like when you train. By default, it's 1-gram.
Your complaint is a bit knee-jerk. Aside from the conventional "i is for index", the only one-letter identifiers in TFA are the method receivers. It is idiomatic Go to use a single-character variable as the receiver for a method, a la:
type MyStruct struct {
    wobbly int
}

func (m *MyStruct) IsReallyWobbly() bool {
    return m.wobbly > 42
}
So far, I've found this convention straightforward to work with, and there's little reason I can see to make the receiver identifier more elaborate, especially in light of popular languages that omit an explicit "self" identifier entirely.
This seems to be a great example of how one-character variable names can be done well. 'p' for a Prefix, 'c' for a Chain, 's' for a string... it all works very well with this level of consistency, and at this level of code complexity and length. Anything longer would be needlessly verbose.
Unless they are loop counters or array indices, single-letter names are never descriptive or intuitive outside of a five-line function. I write a lot of Go (understatement), and for everything but those, my minimum is three letters, rarely two. I also buck the "system" and use "this" for my receivers :)
Nothing is worse than seeing an "r" in the body of some function, or as a struct member, and not knowing whether it is a reader, response, or request. So rather than using "r", I'll go with rdr, resp, req, or this (where applicable). That is hardly verbose, and our code is immensely more readable than something "idiomatic" like the Go standard library, IMHO. I don't think this should really be considered an issue of idiomatic-ness; if it were, gofmt or go vet should care about it.
The short of it is: if you like Go but not single-letter var names, great, use Go with longer variable names! I realize this is a religious issue, so I can already feel the flames burning :)
I don't consider req a long variable name. What I'm trying to communicate is that req is better than request, and res better than response, since they communicate the same thing and are easier to read and write. When it makes sense, i.e. when it can't lead to confusion, the variables have one-letter names; oftentimes this is e.g. i, j for loop variables or n for a number, and so on. Also, it is very common for a method on a type Bar to bind the Bar to a variable b, i.e. func (b Bar) Foo() { ... }
This is how the entire Go stdlib is written, and how most packages are written.
It's OK to use one-letter variable names when the letter is mere notation and doesn't really mean much. In this context, it's somewhat analogous to "self" all over the place in Python, and it's idiomatic.