High-Performance Log Parsing in Haskell: Part One

fnord123 · on March 31, 2015

""" We write a lot of Python here at Safari, in fact it’s our most widely-used language. Because it’s a very high-level language, it was likely it would satisfy my second requirement. But my experience with writing log parsers in Python before–even when using tricks like lazy evaluation–led me to suspect that it would not do so well in satisfying the first requirement. """

How can one say one parser is high performance and another isn't if no comparison is made? e.g. the author could use pylogsparser and benchmark it:

https://pypi.python.org/pypi/pylogsparser/0.8

Then we can get an idea of whether the project met its goal of being quicker than Python.

In any event, I'd be interested to see how luajit-lpeg compares to their Haskell impl. It even has a nice online test tool:

http://lpeg.trink.com/share/syslog

mazelife · on March 31, 2015

Author here. Thanks for pointing me to pylogsparser, I'll definitely take a look at that. Your point is well taken: without building a parallel implementation in Python, we don't have a way of knowing for sure if Haskell is faster. The only data points I have are that we've built a couple of logparsers for custom formats in Python before this, and the number of lines parsed/second was far smaller than the attoparsec-based parser.[1] It's not apples-to-apples, since the formats differ a bit, but I don't think that it has no predictive value. So in the second part of this post, which I'm working on now, I'm hoping to be able to provide a fully-functional NCSA combined log format parser in Haskell alongside the blog post. I think that would be fairly easy to benchmark since it's a common-enough log format.

[1] that's just measuring the time to parse log files into some sort of structured data, not necessarily to do anything with it

mijoharas · on March 31, 2015

Thanks for the article, I think you're missing some code, it seems that every time you have a `do` block in your code samples most of the code is cut off (I assume).

mazelife · on March 31, 2015

Thanks for pointing that out; I think the syntax highlighter may have clobbered some code. It's now fixed.

slashnull · on March 31, 2015

Hey, that's pretty cool!

After playing with Haskell for the last two years, I figured that it would be pretty damn hard to write conventional LAMP/JS webpages in Haskell, but I still wanted to try to solve some of my real world problems with it.

So I'm parsing the log files that my LAMP stuff produces.

So far it's a very fun and frustrating waste of time, but given my recent progress, it will probably become a slightly less fun and frustrating time saver.

I just recently cleared up some confusion and type errors coming from ByteString/Lazy ByteString/Text/Whatever conversion issues (which made up 95% of the recent frustration, as every other language I'm using right now has one single sort of string (which more often than not serves to contain ints ;))

... and database CRUD and JSON typeclass instances have appeared as if by magic.

Awesome!

And now I can celebrate by unleashing that stuff on tens of megs of logs, opening top, and then seeing the executable instantly eat up all my RAM, before slowly conceding space to the postgres server.

Hopefully your post will help me implement streaming.

Cheers!
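For what "streaming" could look like at its simplest, here is a minimal sketch using only `base`. It is not the article's approach: `countMatching` and `countInFile` are hypothetical names, and it works on `String` lines for the sake of being self-contained, where a real log parser would more likely use `ByteString` and a parsing library. The point is only that reading one line at a time with a strict accumulator keeps a single line and a counter in memory, rather than the whole file.

```haskell
import Data.List (isInfixOf)
import System.IO (Handle, IOMode (ReadMode), hGetLine, hIsEOF, withFile)

-- Count the lines that contain a marker, reading one line at a time.
-- Only the current line and the running count are kept in memory.
countMatching :: String -> Handle -> IO Int
countMatching marker h = go 0
  where
    go n = do
      done <- hIsEOF h
      if done
        then return n
        else do
          line <- hGetLine h
          let n' = if marker `isInfixOf` line then n + 1 else n
          -- Force the counter so unevaluated thunks do not pile up.
          n' `seq` go n'

-- Hypothetical convenience wrapper: open a file and count matching lines.
countInFile :: String -> FilePath -> IO Int
countInFile marker path = withFile path ReadMode (countMatching marker)
```

Something like `countInFile " 500" "access.log"` (file name made up) would then count lines containing that substring without loading the file in one go.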

codygman · on March 31, 2015

> After playing with Haskell for the last two years, I figured that it would be pretty damn hard to write conventional LAMP/JS webpages in Haskell, but I still wanted to try to solve some of my real world problems with it.

Really?

You should check out:

http://adit.io/posts/2013-04-15-making-a-website-with-haskel...

slashnull · on March 31, 2015

I already got a few libraries and frameworks up and running (including scotty, actually, which is really cute), but somehow I can never have everything I want at the same time...

Some tutorials will have a few routes and a DB, some other will have an intricate rendering and templating architecture and basically no data storage solution, some are more or less a hello world with a session manager... And there's Yesod, which seems to do everything out of the box, but which has such a colossal amount of dependencies that I never got anything to build beyond fresh yesod-init setups, which still manage to fail on my current setup, despite using Stackage and sandboxes.

Among what I've seen in Haskell, ORMs and web frameworks are the two types of libraries that carry the largest and most cumbersome monad transformer stacks around; making a website involves having to typecheck a huge monad transformer into another huge monad transformer. This is not easy.

Not to mention that, in the world of open-source web development, nearly all the documentation, expertise, copy-pastable code examples and standard practices are in dynamic OOP languages.

Perhaps the ease of RoR, PHP and Node has let everybody forget how much stuff is involved in making modern webpages : )

codygman · on March 31, 2015

> including scotty, actually, which is really cute

You know people use Scotty in production right? It's more than cute, though I agree that most Haskell library/framework documentation could use some work.

> And there's Yesod, which seems to do everything out of the box, but which has such a colossal amount of dependencies that I never got anything to build beyond fresh yesod-init setups, which still manage to fail on my current setup, despite using Stackage and sandboxes.

Hm, did you try only using Stackage or sandboxes? If you're up for another solution, I know Halcyon[0] is supposed to be very frictionless.

> Among what I've seen in Haskell, ORMs and web frameworks are the two types of libraries that carry the largest and most cumbersome monad transformer stacks around; making a website involves having to typecheck a huge monad transformer into another huge monad transformer. This is not easy.

Matter of opinion? I find Scotty's and Snap's monad transformer stacks to be pretty simple. This is something that becomes easy with experience I think, just like getting used to using composition over inheritance.

> Not to mention that, in the world of open-source web development, nearly all the documentation, expertise, copy-pastable code examples and standard practices are in dynamic OOP languages.

Mostly a function of manpower available I'm guessing. I find when you ask for examples or file bugs against libraries/frameworks they are responded to quickly though.

> Perhaps the ease of RoR, PHP and Node has let everybody forget how much stuff is involved in making modern webpages : )

You just have to remember how much polish they have and how it was nowhere near this easy in the beginning.

slashnull · on March 31, 2015

Yeah, I meant "cute" more in the sense of "elegant". Unfortunately it felt too barebones and there wasn't enough documentation to let me get everything running. I like that lib, though.

...

Halcyon, I'll remember that. I still have to try Nix one day, too

...

Matter of experience, I guess, yeah

...

Precisely, there's a virtuous cycle going around popular platforms due to the larger amounts of features and documentation getting written, and in turn, the larger amount of new devs getting into it. Shame those platforms are built on such bad languages... It's cool to see the web evolving in a direction where Haskell can solve very specific problems without disrupting people's stacks.

slashnull · on March 31, 2015

See, at the bottom of that link, I have three links that are supposed to solve my session problems:

- one is dead

- another is the raw Hackage page for a session library spun off from another project

- the last is a cute hello world for the framework that underpins Scotty.

No Node.js tutorial would finish without a working session manager.

codygman · on March 31, 2015

Try using the Scotty-inspired Spock[0] with Spock-auth[1] or consider using Snap[2] or Yesod[3].

Here is a Yesod example[4] that uses Auth.

Here is a Snap example[5] that uses Auth.

> No Node.js tutorial would finish without a working session manager.

Yeah, many Haskell library/framework writers tend to see end-to-end tutorials as hand holding, or see something like Auth as so simple it needs no explanation.

It was a roadblock for me personally.

0: https://github.com/agrafix/Spock

1: http://hackage.haskell.org/package/Spock-auth

2: http://snapframework.com/

3: http://www.yesodweb.com/

4: http://www.yesodweb.com/book/blog-example-advanced

5: http://www.christopherbiscardi.com/2014/01/07/getting-starte...

slashnull · on March 31, 2015

Thank you!

I have a nice lil' prototype going on in Snap, I have Persistent running, and I'm this close to having auth working.

But in the meantime, I have lots of logs to parse ; )

dllthomas · on March 31, 2015

"In type-system theory this is called a "Sum Type" because we can define all possible representations"

It is indeed a sum type, but I don't think that is why (or true of all sum types, or less true of product types). It sounds like they are conflating it with "enum"?

As an example of a product type for which we can define every possible representation: ((),()) has precisely one representation.

As an example of a sum type for which there are infinite representations: (Either String Integer)

My understanding has it that it is called a sum type because 1) for finite types the number of representations is the sum of the number of representations under each tag, and 2) it behaves more generally like a sum (a^x * a^y = a^(x+y) is an equality in arithmetic, (x -> a, y -> a) is isomorphic to ((Either x y) -> a) in programming).
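Both points can be sketched in a few lines of Haskell. The names here (`Tagged`, `allTagged`, `allPairs`, `toSum`, `fromSum`) are made up for illustration and do not come from the article. The first half counts values: a sum of `Bool` and `Ordering` has 2 + 3 inhabitants, while their product has 2 * 3. The second half writes out the two directions of the isomorphism between `(x -> a, y -> a)` and `Either x y -> a`.

```haskell
-- A sum type over Bool and Ordering: one tag per alternative.
data Tagged = B Bool | O Ordering
  deriving (Eq, Show)

-- Every value of the sum type: 2 (Bool) + 3 (Ordering) = 5.
allTagged :: [Tagged]
allTagged = map B [minBound .. maxBound] ++ map O [minBound .. maxBound]

-- Every value of the corresponding product type: 2 * 3 = 6.
allPairs :: [(Bool, Ordering)]
allPairs = [(b, o) | b <- [minBound .. maxBound], o <- [minBound .. maxBound]]

-- a^x * a^y = a^(x+y): a pair of functions becomes one function on Either.
toSum :: (x -> a, y -> a) -> (Either x y -> a)
toSum (f, g) = either f g

-- The inverse direction: split a function on Either into its two halves.
fromSum :: (Either x y -> a) -> (x -> a, y -> a)
fromSum h = (h . Left, h . Right)
```

`toSum` and `fromSum` undo each other, which is what "isomorphic" means here; `Either String Integer` fits the same pattern even though neither side can be enumerated.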

mazelife · on March 31, 2015

That's a good point. A sum type/disjoint union (by my understanding) means the type is in the union of subsets and that the subsets are pairwise disjoint, but it doesn't say that the subsets have to be finite. Since this post is really aimed at Haskell beginners, I was trying to avoid going down the rabbit hole as regards type system or set theory and I might have oversimplified in the process.

sriku · on April 1, 2015

> Sequencing parsers applicatively allows the compiler to perform static analysis on a parser without running it. This knowledge can be used to avoid things like backtracking that may slow your parser down. This is not possible when sequencing parsers monadically because the grammar of each parser depends on the previous one. However, performance results in this case are probably negligible; don’t hesitate to choose do-notation if you find it easier to read.

Would be good to talk more about this and quantify too. Very interesting topic.
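To make the two styles concrete, here is a small sketch. It deliberately uses `Text.ParserCombinators.ReadP` from `base` rather than attoparsec so that it needs no extra packages, and the `Entry` type and parser names are hypothetical. It shows only the difference in how parsers are sequenced; it does not demonstrate or measure any performance difference. `entryA` and `entryM` describe the same grammar, while `lengthPrefixed` is a case where a later parser genuinely depends on an earlier result, so it cannot be written with `<$>` and `<*>` alone.

```haskell
import Data.Char (isDigit)
import Text.ParserCombinators.ReadP

-- Hypothetical record for a "status bytes" fragment of a log line.
data Entry = Entry { status :: Int, bytes :: Int }
  deriving (Eq, Show)

-- One or more digits, read as an Int.
number :: ReadP Int
number = read <$> munch1 isDigit

-- Applicative style: the structure is fixed before any input is seen;
-- no step inspects the result of an earlier one.
entryA :: ReadP Entry
entryA = Entry <$> number <* char ' ' <*> number

-- Monadic style: the same grammar in do-notation.
entryM :: ReadP Entry
entryM = do
  s <- number
  _ <- char ' '
  b <- number
  return (Entry s b)

-- Monadic sequencing is needed here: how many characters to read
-- depends on the number parsed first, e.g. "3:abc".
lengthPrefixed :: ReadP String
lengthPrefixed = do
  n <- number
  _ <- char ':'
  count n get

-- Run a parser, accepting only a parse that consumes the whole input.
parseAll :: ReadP a -> String -> Maybe a
parseAll p s = case [x | (x, "") <- readP_to_S (p <* eof) s] of
  (x : _) -> Just x
  []      -> Nothing
```

On an input such as `"200 512"`, `parseAll entryA` and `parseAll entryM` give the same `Entry`; only the notation differs.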

codygman · on April 1, 2015

Maybe this paper[0] on using Applicatives for performance and concurrency in Haskell's Haxl will help.

Also note that there is a feature proposal[1] for Applicative Do notation.

0: http://community.haskell.org/~simonmar/papers/haxl-icfp14.pd...

1: https://ghc.haskell.org/trac/ghc/wiki/ApplicativeDo