Libraries like Weka and Mahout are no more *toys* than any other library that im...

dvcat · on Nov 14, 2011

Dunno about Weka, but my last experience (5 months back) with Mahout was not good. There are still quite a few bugs, and the fact that the entire code base is in Java makes it extremely unpleasant for someone who wants to hack and modify the code to jump right in and start tweaking stuff. However, in its defense, it is open source, is probably the only hadoopified ML library out there, and has given me a ton of good ideas on how to write custom code.

mark_l_watson · on Nov 14, 2011

Wow, we disagree. As much as I like to do my own development in dynamic languages like Clojure and JRuby, for me:

I would much rather that library and framework code that someone else has written, debugged, and supports be written in Java: easy to browse in a good IDE, statically typed, lots of unit tests so you can hack away with some protection, etc.

dvcat · on Nov 15, 2011

Maybe my point wasn't clear enough:

1. I am comfortable using someone else's library rather than reinventing the wheel, but I want to know exactly what I am getting into without having to browse through tons of Java code. There are zillions of variants of algorithm X, and I want to know exactly which implementation/variant Mahout uses without going through the source code. Unfortunately the docs (at least 4 months back) were pretty bad.

2. Their unit test coverage was not good enough, which incidentally is how I found that there were bugs. The problem with trying to contribute back to the community by fixing these bugs? When I read the source code, I get the feeling that each algorithm is owned to a great extent by one developer who brings in their own idiosyncrasies, which means that you need to really study the code to make sure you don't accidentally add more bugs. The other disadvantage of this approach is that questions regarding potential bugs and puzzling issues can go unanswered, or be answered in an unsatisfactory manner (mainly because one developer writes most of the code).

Having said all this, I want to be charitable and chalk these up to growing pains. But if I were building something critical and big dataish, I would use either Python (dumbo) or Scala, which are much more concise languages in which it is easier to express math without introducing bugs.

law · on Nov 14, 2011

You're correct to identify the point of libraries like Weka and Mahout, which are both written in Java, as providing a solid framework for interaction between and among your program and other algorithms. However, Java isn't the right solution for everyone. Moreover, in Weka's case, the GPL licensing may not comport with everyone's requirements. Mahout's license is more friendly to proprietary software, so it's admittedly a non-issue there.

I agree that Hadoop is certainly not a toy, but using Mahout on Hadoop clusters works better for analyzing large data sets that you've already collected and pre-processed. If you're doing any kind of active learning, or are designing software to run on a client's computer based on feedback that they provide, Mahout probably isn't the best choice.

In the end, it requires understanding your problem completely enough to justify your decision.

mark_l_watson · on Nov 14, 2011

re: Weka GPL: one of my customers simply bought a commercial license. Easy.