Haskell improves log processing 4x over Python (bu.mp)
114 points by jmintz on Jan 24, 2011 | 41 comments



The work sounds very cool (and they are hiring), but (only) a factor of 4 speedup over Python is (to repeat a phrase from elsewhere today) like boasting that you're the tallest midget ;o)


Hi, article author here.

It's important to note that this particular job is largely bound by a.) I/O and b.) format serialization. Both Python's BSON and JSON libraries are mature and have their critical sections written in C, so a 4x speedup is still noteworthy. The Haskell version, on the other hand, is pure Haskell.
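
For illustration only (this is not their actual code; the event fields are invented), the pure-Haskell side of that comparison could look something like this with the aeson library:

    {-# LANGUAGE OverloadedStrings #-}
    import Control.Applicative ((<$>), (<*>))
    import Data.Aeson
    import qualified Data.ByteString.Lazy as L

    -- Hypothetical log event; the real schema isn't shown in the post.
    data Event = Event { user :: String, action :: String }
      deriving Show

    instance FromJSON Event where
      parseJSON (Object o) = Event <$> o .: "user" <*> o .: "action"
      parseJSON _          = fail "expected an object"

    main :: IO ()
    main = do
      raw <- L.readFile "events.json"    -- one event per file, for simplicity
      print (decode raw :: Maybe Event)  -- the decoder is pure Haskell, no C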


Neat - thanks.


Agreed. Even where you can optimize the hot code in C, Python is no speed demon. Cassandra's Java stress test can push out about 10x as many ops/s as the Python one, even though the Thrift C extension for Python is quite good.

/still a Python fan


Yeah, Haskell is roughly as fast as Java. I imagine, with the tuning I allude to in the blog post, we could restore about a 10x improvement, even on a single core. After all, 10x was about what we had with raw Redis ops/s before the serialization libraries got involved.

/also still a python fan :-)


I think it's difficult to compare Java to Haskell. Java is JIT compiled and trades startup time for optimized performance. Haskell is compiled to native code and has a much faster startup time, but the optimizer doesn't have as good information because it runs when the code is compiled rather than at runtime.

In general, Java tends to be better for a long-running process, while Haskell might be better for a one-off job (not that either couldn't handle the other kind of job, though).


> Haskell is compiled to native code and has a much faster startup time, but the optimizer doesn't have as good information because it runs when the code is compiled rather than at runtime.

I'm not that familiar with JIT, but wouldn't compiling everything to native code ahead of time be at least as good as compiling the parts that turn out to be slow, just in time? Is it that you don't know which optimizations to apply until runtime?

Also, would this situation change given the fact that they're switching to LLVM?


A JIT has strictly more information than an ahead-of-time compiler running without good profiling data (which should be based on the real workload, not a dev test). JITs can notice things like paths never taken and compile the code into a set of sanity checks plus a single, completely inlined tight codepath for the common case, falling back to normal code when the checks fail.

JIT does, however, use cycles at runtime to do the actual compile. While I have seen benchmarks where JIT wins over AoT, I'd still bet that the balance favors traditional compilers.


Right, but those extra cycles tend to be a one-time cost. In a program that runs for a short amount of time, the time spent compiling might negate any speed gains. In a longer-lived application (like a web app), JIT can make a bigger difference.


The JIT has more information about runtime behaviour. The question, though, is whether Haskell's type system (or similar slightly more advanced type systems) can compensate for that.


There are many inlining opportunities that depend on knowing which branches are followed in the program at run time. A JIT has better access to this information, and so can be more aggressive in leveraging it.

I'm not familiar with how common these techniques are in production JVMs, but as they're becoming common in JavaScript implementations, I'd assume they're equally common there.


> Also, would this situation change given the fact that they're switching to LLVM?

LLVM can both use a JIT and compile to native code. From what I've seen, GHC doesn't seem to be doing anything much different, so I'd assume it's still compiling to native code.


This is a pretty awesome example of using Python for its quick development time to get a system in place, then going in later and rewriting the hot bits in a faster, compiled language when growth overtakes you.


Sounds great. I'm a very big Haskell fan.

I'd love to point people to this when trying to convey some advantages of Haskell. To make it more compelling, can you expand some on the downsides and maybe obstacles you encountered?

The thing I'm unsure about is how difficult it would be for (very) talented developers to just jump in. We have really talented developers, and everyone is super time-constrained, so many are wary of diving into a language as different as Haskell. Was it hard for your developers to figure Haskell out? Did your previous use of Scala help? How long did it take them to dive in?


I would say the two real barriers to writing effective Haskell projects are a.) "getting" monads, and b.) understanding the implications of laziness, especially with regard to space leaks and unconsumed thunks. Everything else isn't that big of a deal.
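
A classic small example of b.): a lazy left fold quietly builds a chain of unevaluated thunks, while its strict cousin runs in constant space.

    import Data.List (foldl')

    -- foldl is lazy in its accumulator: this builds the thunk chain
    -- (((0+1)+2)+3)+... and can blow the heap before anything is forced.
    leaky :: Integer
    leaky = foldl (+) 0 [1..10000000]

    -- foldl' forces the accumulator at every step; constant space.
    fine :: Integer
    fine = foldl' (+) 0 [1..10000000]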

It's all much easier to digest, though, even for "really talented developers", if they have some experience with another functional language first. OCaml is a nice stepping stone before digging into the abstractions involved in understanding Haskell's powerful type system. Scala is good too, but having the object stuff mixed in can lead you to rely on patterns that aren't available in a non-OOP language. I think the Scheme/Clojure path isn't bad either, but it's probably ideal to spend some time in the "statically typed" wing of the functional universe before going to Haskell.


Could you say more about why "getting" monads was needed?

I came to Haskell with no understanding of monads, started writing code, and eventually used my knowledge of Haskell to learn about monads. Not understanding monads just meant I was lacking a useful design pattern, and found certain API docs confusing, but it didn't stop me from writing reasonable code in most circumstances.

On the other hand what you describe in your (awesome) blog post is a more significant Haskell project than any I've worked on, so I'd be interested to hear your experience.

I've not really written my own monad, or properly looked into monad transformer stacks, and I'm aware that I could probably clean up a lot of code using them - is that the sort of thing you mean?


Sure, I can say more. Put bluntly, and despite anything you might hear to the contrary, you basically do need to grok monads and monad transformers to tackle any nontrivial project in Haskell, even if only to understand the code and APIs of the libraries your application will interact with.

Basically, you can't swing a dead cat without hitting monads in the Haskell library ecosystem; therefore, you'll need to know what they are.
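
To give a flavor of what that means in practice, here's a toy monad transformer stack of the sort you meet in almost any real application (all names invented for illustration):

    import Control.Monad (when)
    import Control.Monad.Reader
    import Control.Monad.State

    data Config = Config { verbose :: Bool }

    -- A typical application monad: read-only config, a counter, IO at the base.
    type App = ReaderT Config (StateT Int IO)

    logEvent :: String -> App ()
    logEvent msg = do
      cfg <- ask                -- ReaderT layer
      modify (+ 1)              -- StateT layer
      when (verbose cfg) $
        liftIO (putStrLn msg)   -- IO at the bottom of the stack

    runApp :: Config -> App a -> IO (a, Int)
    runApp cfg app = runStateT (runReaderT app cfg) 0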


Catch phrase thief!


You're just jealous I worked it in so naturally.


I agree. I dove into Haskell without any of that prior experience, and it was like running into a brick wall. Persistence paid off in my case, but I do wonder how I would have handled it if I had spent time with OCaml beforehand.


From personal experience: I didn't make much progress in Haskell until I stopped using Scala. The problem is that Scala allows you to mix and match different paradigms, and if you come from a mostly imperative/OO background, you tend to use Scala as an OO language with some functional constructs.

To learn to program purely functionally, it's best to jump into Haskell cold turkey, since you will have to learn to think in FP.

When learning Haskell, optimization in a lazy world was the most difficult task for me. I often still have trouble predicting how efficient a particular piece of code will be. The complexity of monads is somewhat overstated, though it doesn't help that some tutorials make something big and esoteric out of them. A monad is nothing more than a type class that specifies how to combine computations that result in some 'boxed' value.
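
For reference, the whole core of the class is just this (ignoring the minor methods):

    class Monad m where
      return :: a -> m a                    -- wrap a plain value in the "box"
      (>>=)  :: m a -> (a -> m b) -> m b    -- run one step, feed its result on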


The author is mostly right about the use cases of Haskell, but saying simply "systems" is a bit misleading, because there are certain performance characteristics of lazy programs that make them bad choices for some systems programs. Any type of real-time system, for example, can suffer unpredictable performance in critical sections, which is pretty undesirable.


Hard real time systems are probably the primary thing for which Haskell-as-is is directly unsuitable.

Haskell as an EDSL for generating hard real-time code, however, is very viable: http://corp.galois.com/blog/2010/9/22/copilot-a-dsl-for-moni...


Not to argue the example, but Python's garbage collection disqualifies it for real-time systems as well. In fact, I'm having a hard time finding a "system" task for which Python (as a language) is qualified but Haskell is not.


Python is not a systems programming language.


Maybe not, but it is certainly used as one. Everything from package managers (yum) to operating system installers (Anaconda) has been written in Python.

Besides that, the grandparent is right: in virtually every situation where Python has been used as a systems programming language, Haskell could fit in (and more).


I think you're confused about what "systems" programming entails. User-space packaging with a bunch of scripts is not systems programming. OS installers are not systems programming.

To give you an example of what systems programming is: I have helped develop operating system kernels, virtual machine monitors, and distributed networked systems. All of these would be considered systems programming.

See http://en.wikipedia.org/wiki/System_programming for more information.


That's just one definition. Read up on Ousterhout's dichotomy:

http://en.wikipedia.org/wiki/Ousterhouts_dichotomy


You should just read http://home.pacbell.net/ouster/scripting.html (his original paper), because all of his statements about systems programming languages reinforce my claim that Python is not a systems programming language.


While I agree with you that Haskell (or, really, any GC'd language) is unsuitable for real-time systems, I disagree that my statement about its excellent suitability for systems programming in general is misleading. There are many, many domains (read: most) that, in my experience, are called "systems programming" that have nothing to do with hard or soft real-time requirements.

Now, if I had stated that all conceivable systems programming domains are addressable with Haskell, that would have indeed been foolish.


Hm, good point -- I agree.


Are the logs being read from disk? In my experience, Python is highly optimized for reading (possibly compressed) files from disk. If your infrastructure keeps logs in memory, Python will lose this advantage and compete on computational performance, where Haskell has the advantage. This is important for those of us who grind logs on disk and might be considering a language switch.


What do you mean by optimized? Python makes the same read and write syscalls everyone else does.

What you're probably observing is Python's slow code generation being masked by the inherent slowness of I/O.


> Python makes the same read and write syscalls everyone else does

Except, when Python's pants are on, it makes gold records.

I haven't looked to see if there are any explicit optimizations, but your statement is ridiculous; an effective IO strategy can have an enormous effect on performance.


I'm sorry, what? Just being disagreeable is not an answer. What does "IO strategy" mean? You are being incredibly vague and unhelpful.

Reading data from a file handle into a buffer is a trivial operation. It's what you do with that data afterwards that is important. In C (or Go) you have complete control over what happens next. As for Python, I don't know what happens, but I don't see how it could possibly be more efficient than any other sane language.


> Reading data from a file handle into a buffer is a trivial operation.

If that is all you're doing, then yes; there isn't a much more efficient way of doing that.

> In C (or Go) you have complete control over what happens next.

It is up to the programmer to know what to do next. Does Haskell strike you as a language of micromanagement? Python can sometimes be multiples faster than command-line grep. I haven't looked into why, but I have some ideas.

> You are being incredibly vague and unhelpful.

If you don't understand what I'm talking about, it doesn't make /me/ wrong, and just saying so is not an appropriate response either. I had a specific question that was answered by the OP and was useful to me.

Hacker News comment threads are rarely a place of education, but I will reinterpret what you said as a question.

An IO strategy is how and when you make those system calls. Reading from disk takes a vast amount of time, during which you can be doing computation. To be fast, you should be asking for the appropriate amount of lookahead, at the correct offset. Is that 4k? 1Meg? 100Megs? 1GB? Do you use threads for this? Can you skip any of the input stream? Do you let the operating system, programming language, library, or program code decide how big the read is? Where that data is stored after being read from disk is also important. Especially fast strategies use mmap to avoid copying from kernel space into user space. And of course everything is always chunked at specific intervals, so knowing where those are can sometimes reduce the number of calls. The ability to optimize for these is one thing that makes dedicated database software so successful.
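
To put that in Haskell terms (a sketch of one knob, not a claim about what Python does internally), even the standard bytestring library exposes the one-big-buffer versus chunked-on-demand choice directly:

    import qualified Data.ByteString as S        -- strict: whole file, one buffer
    import qualified Data.ByteString.Lazy as L   -- lazy: ~32k chunks on demand

    -- One big read: simple, but the entire file sits in memory before use.
    sizeStrict :: FilePath -> IO Int
    sizeStrict path = fmap S.length (S.readFile path)

    -- Chunked reads: I/O interleaves with consumption; memory stays flat.
    sizeLazy :: FilePath -> IO Int
    sizeLazy path = fmap (fromIntegral . L.length) (L.readFile path)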

It is a dark art, and it is not expected that the average person know these things. If you happen to be the kind of person working on a programming language, it could be useful to be aware of them. Here are some quick links, but there is a vast amount written on the subject.

[http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...] [http://tangentsoft.net/wskfaq/articles/io-strategies.html]


Nope, this is a process that BLPOPs logs from a Redis queue, does some processing on them, then writes them to disk.
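
For the curious, a minimal sketch of that loop's shape, assuming the hedis client and invented queue/file names (this is not the actual production code):

    {-# LANGUAGE OverloadedStrings #-}
    import Control.Monad.IO.Class (liftIO)
    import qualified Data.ByteString.Char8 as B
    import Database.Redis

    main :: IO ()
    main = do
      conn <- connect defaultConnectInfo
      runRedis conn loop
      where
        loop = do
          -- Block until an entry arrives on the (hypothetical) "logs" queue.
          res <- blpop ["logs"] 0
          case res of
            Right (Just (_queue, entry)) ->
              liftIO $ B.appendFile "events.log" (entry `B.snoc` '\n')
            _ -> return ()   -- ignore timeouts/errors in this sketch
          loop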


I'd be interested in hearing more about how the author is using the resulting data set. Doing extractions at event-generation time can be very useful if you know what you are after in advance, but it's not so good for ad-hoc analysis.

Any reason why you didn't use Hadoop for this, then run batch jobs to extract summaries?


Yeah, the whole pipeline is actually more multifaceted than can be deduced from this summary. This stage just persists the events into a consolidated transaction log. Then there are secondary processes that scan these transaction logs (in batch) and distribute data into various databases for system, business, and user analytics. I can't go into too much detail there, but the actual digesting and reporting side is more involved.


I'd like to hear more about the use case if you have time, and can talk about it. I'm kordless at loggly dot com.


Awesome work. If you haven't heard of Tim Bray's WideFinder challenge, it's worth a look; it was really interesting.

http://tartarus.org/james/diary/2008/06/17/widefinder-final-...



