It's a little long, but this is basically the same sort of thing you see in OO languages with object.chaining.methods(with, arguments).together.forever(), all in the name of succinctness.
Monoids are not complicated. They're the _easiest way to talk about combinable groups of things generically_, and lots of programs use echoes of that concept when abstracting over collection objects. Monoids just finish the job by saying: anything which can be empty and combined associatively is a monoid (think: adding integers, where 0 is the empty element).
These abstractions make your life easier, not harder. But they can be a bit confusing at first because they're so universal, and it can be surprising how much mileage the same code gets just by swapping monoids.
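For a concrete taste, here's a minimal sketch (plain Data.Monoid, nothing article-specific) of that mileage: one fold, two behaviors, chosen entirely by the monoid.

import Data.Monoid (Sum(..), Product(..))

-- One fold written against the Monoid interface; each newtype
-- picks a different "combine" and "empty" for the same code.
totalOf, productOf :: [Int] -> Int
totalOf   = getSum     . foldMap Sum      -- combine = (+), empty = 0
productOf = getProduct . foldMap Product  -- combine = (*), empty = 1

main :: IO ()
main = do
    print (totalOf   [1, 2, 3, 4])  -- 10
    print (productOf [1, 2, 3, 4])  -- 24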
I'm porting some code from Haskell to Clojure right now for various reasons, and man do I miss monoids and the Writer monad. Producing machine-readable concatenated outputs as you go through a computation with ease is so powerful, and a lot of my time is spent writing less-powerful re-implementations of this feature.
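Here's the shape of what I keep rebuilding, as a minimal sketch (plain Control.Monad.Writer from mtl, not my actual codebase):

import Control.Monad.Writer

-- Toy example: each step produces a value and tells a log entry.
-- The [String] log monoid could be swapped for any other Monoid.
step :: Int -> Writer [String] Int
step x = do
    tell ["saw " ++ show x]
    return (x * 2)

main :: IO ()
main = do
    let (results, logs) = runWriter (mapM step [1, 2, 3])
    print results        -- [2,4,6]
    mapM_ putStrLn logs  -- saw 1, saw 2, saw 3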
> What was actually difficult to understand about the code?
Really? How about the CSV reading, from the article:
import Data.Csv
import Control.Lens  -- not shown in the article's snippet, but needed for (^.), _1, _4
import qualified Data.Vector as V
import qualified Data.ByteString.Lazy.Char8 as BS

main = do
    Right rawdata <- fmap (fmap V.toList . decode True) $ BS.readFile "nukes-list.csv"
        :: IO (Either String [(String, String, String, Int)])
    let list_usa    = fmap (\row -> row^._4) $ filter (\row -> (row^._1)=="USA"   ) rawdata
    let list_uk     = fmap (\row -> row^._4) $ filter (\row -> (row^._1)=="UK"    ) rawdata
    let list_france = fmap (\row -> row^._4) $ filter (\row -> (row^._1)=="France") rawdata
    let list_russia = fmap (\row -> row^._4) $ filter (\row -> (row^._1)=="Russia") rawdata
    let list_china  = fmap (\row -> row^._4) $ filter (\row -> (row^._1)=="China" ) rawdata
    putStrLn $ "List of American nuclear weapon sizes = " ++ show list_usa
Here's how I'd do it in Python (note I haven't tried this code, but it'd be pretty close to this):
import csv

countries = dict()
with open("nukes-list.csv", 'rb') as f:
    csvfile = csv.reader(f)
    for row in csvfile:
        tmp = countries.get(row[0], list())
        tmp.append(int(row[3]))  # fourth column (index 3), matching the Haskell _4
        countries[row[0]] = tmp
print("List of American nuclear weapon sizes =", countries['USA'])
How can anybody argue the Haskell version is easier to understand?
The rest of the code is basically calling sum and filter on the data or calling library functions, and it would be almost exactly the same in Python, or even C++.
I'm pretty sure this isn't the idiomatic way to parse a CSV file in Haskell. It's just what I found from a quick Google search. I'm sure someone else could do better.
> The rest of the code is basically calling sum and filter on the data, and it would be almost exactly the same in Python, or even C++.
This is definitely NOT true. The rest of the code is about algebraic manipulations, like group operations on the data structures. This is certainly possible in other languages; it's just never done, and it's not idiomatic.
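To make "algebraic manipulations" concrete, here's a toy sketch of the flavor (the Stats type and train function are mine, not the article's library): a trained summary that is itself a monoid, so models combine the same way their data does.

import Data.Semigroup

-- Toy "model", not the article's library: just enough to recover a mean.
data Stats = Stats { size :: Int, total :: Int } deriving Show

instance Semigroup Stats where
    Stats c1 t1 <> Stats c2 t2 = Stats (c1 + c2) (t1 + t2)

instance Monoid Stats where
    mempty = Stats 0 0

train :: [Int] -> Stats
train xs = Stats (length xs) (sum xs)

-- train is a monoid homomorphism:
--   train (xs ++ ys) == train xs <> train ys
-- so you can train on shards and combine the results.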
FWIW, I really enjoyed the article, and appreciated the example dataset -- usually this literature is quite dry. Do you know if there is any monoid documentation that enumerates the various properties as something like C# interfaces and then explains what the derived types are? I think that would help me and a lot of other people who don't know much abstract algebra or find it hard to connect with their day-to-day work.
> I'm pretty sure this isn't the idiomatic way to parse a CSV file in Haskell. It's just what I found from a quick Google search. I'm sure someone else could do better.
Isn't that the point, though? In Haskell it was so difficult to read a CSV file that you had to Google it. In Python it was immediately obvious how it should be done.
> How can anybody argue the Haskell version is easier to understand?
You're not comparing apples to apples here. This is the part of the code that does the CSV parsing:
Right rawdata <- fmap (fmap V.toList . decode True) $ BS.readFile "nukes-list.csv" -- decode is a cassava function.
The actual CSV parsing invocation and subsequent rendering to list is:
fmap (fmap V.toList . decode True)
The code of yours that does that is:
countries = dict()
with open("nukes-list.csv", 'rb') as f:
    csvfile = csv.reader(f)
    for row in csvfile:
        tmp = countries.get(row[0], list())
        tmp.append(int(row[3]))
        countries[row[0]] = tmp
As for fetching the actual subsets, I am not sure why he didn't write it this way:
let rows_from name = filter (\x -> x ^. _1 == name) rawdata
let list_usa = rows_from "USA"
let list_uk  = rows_from "UK"
-- ...
putStrLn $ "List of American nuclear weapon sizes = " ++
    show (list_usa ^.. traverse . _4)
Particularly since he was going to do the same operation over and over. But this is not Haskell; this is just the author not writing beautiful code and instead copy-pasting code from a Real Thing, probably while playing around with Edward Kmett's new lens library, which everyone loves but which everyone is also still coming to terms with.
Of course, you took a shortcut because you "knew" that you'd only be using the data indexed by country name. We didn't make that assumption in this code, but it'd be trivial to add.
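For what it's worth, the grouped-by-country version is about one line of Data.Map (a sketch against the same row tuples, not code from the thread):

import qualified Data.Map as Map

-- Hypothetical helper: sizes grouped by country, the analogue of
-- the Python dict-building loop.
byCountry :: [(String, String, String, Int)] -> Map.Map String [Int]
byCountry rows = Map.fromListWith (++) [ (c, [n]) | (c, _, _, n) <- rows ]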
> I'm sorry, but it's not at all obvious that code is parsing a CSV file.
What, exactly, is your complaint?
Could this fellow have written cleaner code? Sure, we agree. Are you saying this is some fundamental problem with Haskell? Because I'm trying to tell you it's not. For example, on that line a better practice would be to do exactly what your Python script did: import the CSV library qualified, so the call is labelled CSV.decode the same way yours is labelled csv.reader.
So let's refresh the comparison with the proper style and without the decision to convert to lists from vectors:
-- Haskell, cleaner but longer.
-- Many people prefer this style when dealing in IO.
-- Please note that 'readFile' here sucks nearly as
-- badly as your Python version's 'open' does.
-- They are terrible.
f <- BS.readFile "somefile.csv"
case CSV.decode True f of
    Left err     -> putStrLn err
    Right tuples -> do
        -- ...
# Python
with open("nukes-list.csv", 'rb') as f:
    try:
        csvfile = csv.reader(f)
        # ...
    except csv.Error as e:
        print e
Huh. Similar linecount, similar line density.
You might argue, "Why were all the extra fmaps there?" Well, ostensibly they're there to do what your code doesn't do, namely deal with error conditions (your code just ignores them). But of course the author's poor style throws that benefit away on the same line with the pattern match on "Right _", though we've already conceded the author's code could be cleaner.
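Spelled out, here's what the two fmaps are each doing (a sketch using the same old decode-takes-a-Bool cassava API as the thread's code; newer cassava takes a HasHeader value instead):

import Data.Csv (decode)
import qualified Data.Vector as V
import qualified Data.ByteString.Lazy.Char8 as BS

-- BS.readFile gives IO ByteString; decode True gives
-- Either String (V.Vector row). The outer fmap maps under IO,
-- the inner fmap maps under Either, so a Left error flows
-- through untouched rather than crashing the program.
parsed :: IO (Either String [(String, String, String, Int)])
parsed = fmap (fmap V.toList . decode True) (BS.readFile "nukes-list.csv")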
My point: You're blaming Haskell for the author's decisions. You're also crediting Python with your own personal familiarity. It is not Python that is natural and obvious; your many hours of hard work with Python have made that language's solution easy for you to perceive. Give yourself some credit, and stop pretending Python is somehow "naturally" clearer.
If you'd like to see examples of good, clean, fast Haskell code then they're easy to provide. I have a funny feeling that's not your goal here, though.
I wish I didn't agree with you. I'm still a Haskell novice, and I've enjoyed the time I've spent learning it, but this post was just terrifying. All I could think was how simple all that would be in R (or any statistical package), or even Ruby (or any popular scripting language).
Of course, one can write overly complicated code in any language. I think the purpose of the post was more to show off conceptually advanced techniques rather than to actually analyze the dataset in a straightforward manner.
> I think the purpose of the post was more to show off conceptually advanced techniques rather than to actually analyze the dataset in a straightforward manner.
Correct. I admit it turned out to be a bit too much for one post, but I wanted to use some real-world data (the nukes) to demonstrate the techniques.
Edit: I could be wrong, but I don't think R (or any other stats package) supports "group subtraction" of distributions, which is how we calculate the survivable nuclear weapons.
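The idea, in sketch form (simplified names, not the exact classes from the library): a group is a monoid whose elements have inverses, so "un-training" one distribution from another is just combining with an inverse.

-- Simplified Group class, not the library's actual API.
class Monoid g => Group g where
    inverse :: g -> g

-- e.g. all weapons `minus` destroyed weapons = survivable weapons
minus :: Group g => g -> g -> g
minus x y = x `mappend` inverse y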
The "cool stuff" is in the commutative diagrams at the end of the article. They show that there are many different ways to train a distribution from data, some of which will be more suited to different tasks. In most languages, we would have to program each of those paths separately. In Haskell, we get all of those paths "for free."
Haskell has a way of making the hard things easy and the easy things hard. The high-level, pretty stuff is great. The ugly stuff is, for example, getting what we want out of `rawdata`:
fmap (\row -> row^._4) $ filter (\row -> (row^._1)=="USA") rawdata
which is just saying: column 1 is the country, and we only want the rows that are "USA," and from those rows, we only want the fourth column. It would've been simpler if it were just arrays or lists, but that wouldn't be as powerful or as composable as lenses.
A better, more readable way would've been to not use those lambdas, since he's reusing the accessors over and over anyway:
let getKaboomieColumn row = row^._4
let trueIfUSA row = row^._1 == "USA"
let list_usa = fmap getKaboomieColumn $ filter trueIfUSA rawdata
...
When you read it that way, it looks a bit less intimidating.
Well, it's not exactly a simple task. There are a few simple ways to approach it, but isolating the various general structures and creating the maps between them is the bigger task.
I'm curious why you say that the code would be simpler in Magma or GAP? The post doesn't actually implement any group theory, it just applies it, which is exactly what you'd have to do anywhere.
Of course, it's possible that the internal code of the library would be simpler, but I'm a little skeptical.