The Best Free and Open Source Data Mining Software

tibbon · on Dec 1, 2010

If I may, I'd like to plug 140kit (http://140kit.com), an open source Twitter mining and analytics tool that uses the streaming API to make downloading millions of tweets in easily accessible formats quick and easy for anyone.

(Doh, it looks like it's down for some reason. Well, when it's back up, it's there. I'm sure you could find the GitHub repo and run it yourself, since it is open source.) Written in Rails.

mark_l_watson · on Dec 1, 2010

This subdomain is up: http://hackfest.140kit.com/

tibbon · on Dec 1, 2010

Oh, thank you! I should have known that.

kanak · on Dec 1, 2010

I guess R is a bit more DIY than these frameworks, but it has a very large collection of tools. I've found libraries for everything from CART (classification and regression trees), to SVM, to HMM learning, to clustering, to EM. R with libraries from CRAN is my go-to tool for statistical learning.

mcgin · on Dec 1, 2010

I am amazed R didn't make it to the list.

ericac · on Dec 1, 2010

R is great with hundreds of thousands of records but fails miserably with millions. This seems especially true for clustering and regression. With the explosion of data collection, tools have to be able to easily take in and clean millions of records quickly.

Does anybody have experience with Orange or RapidMiner?

ez77 · on Dec 1, 2010

Something like SAS or DAP would be better suited for large data sets, as R tends to load everything into RAM.

From http://www.gnu.org/software/dap/ : "Because Dap processes files one line at a time, rather than reading entire files into memory, it can be, and has been, used on data sets that have very many lines and/or very many variables."
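
To illustrate the line-at-a-time idea quoted above (this is not Dap's code, just a minimal Python sketch of the same approach), the mean and variance of a column can be computed without ever holding the whole file in memory. The function name and the one-number-per-line input format are assumptions made for the example; the running update is Welford's method.

```python
import io


def streaming_mean_variance(lines):
    """Return (count, mean, sample variance), reading one value per line."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        value = float(text)
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


# A file object yields one line at a time, so open("data.txt") would work
# the same way; io.StringIO stands in for a large file here.
sample = io.StringIO("2\n4\n4\n4\n5\n5\n7\n9\n")
print(streaming_mean_variance(sample))
```

Memory use stays constant no matter how many lines the input has, which is the property the Dap description is pointing at.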

bkj123 · on Dec 1, 2010

Regarding R's processing power, I haven't found it to be an issue. When building a model and testing, I use a sample of the data which is usually less than 100,000 observations. I use samples even when using a tool like SAS Enterprise Miner.

As far as scoring, I usually export using PMML and run it natively on the database. Makes for fast execution. PMML is available in R, RapidMiner, and other packages.
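
For the sampling step mentioned above, one way to draw a fixed-size sample from a data set too large to load is reservoir sampling. This is only a sketch in Python under assumed names (`reservoir_sample`, and `rows` as a stand-in for a real table), not what any of the tools named in this thread do internally.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return k records chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, record in enumerate(records):
        if index < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = record
    return reservoir


# Hypothetical stand-in for a table with a million rows.
rows = range(1_000_000)
sample = reservoir_sample(rows, k=1000, seed=42)
print(len(sample))
```

Only the k sampled records are held in memory at any time, so the same function works whether the source has a hundred thousand rows or a hundred million.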

kanak · on Dec 1, 2010

This is definitely a problem with R, although the biggest problem IMO is that a lot of libraries aren't multicore capable. Fixing the memory problem was just a matter of adding lots of RAM to our workstation; we can't fix the "can't use more than one core at a time" problem as easily.

nivertech · on Dec 1, 2010

Yes you can: just run it in many single-core VMs. This is what a vendor of legacy single-threaded software recommended to me. They were the best in their field, so they never really tried to port it to multicore ;(
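
The same idea can be applied at the process level rather than with VMs, assuming the job can be split into independent pieces whose results are merged afterwards. The Python sketch below is only an illustration of that pattern; all function names are made up for the example, and the "work" is a trivial mean.

```python
from concurrent.futures import ProcessPoolExecutor


def split_into_chunks(values, parts):
    """Split a list into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(values), parts)
    chunks = []
    start = 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def summarize_chunk(chunk):
    """Single-threaded work on one chunk: return (count, total)."""
    return len(chunk), sum(chunk)


def merge_summaries(summaries):
    """Combine per-chunk (count, total) pairs into an overall mean."""
    count = sum(c for c, _ in summaries)
    total = sum(t for _, t in summaries)
    return total / count if count else 0.0


def parallel_mean(values, workers=4):
    """Run summarize_chunk in `workers` separate processes and merge the results."""
    chunks = split_into_chunks(values, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        summaries = list(pool.map(summarize_chunk, chunks))
    return merge_summaries(summaries)
```

Calling `parallel_mean(list(range(1_000_000)), workers=4)` from under an `if __name__ == "__main__":` guard runs each chunk in its own process. This only helps when the algorithm decomposes this way; a clustering or regression routine that needs all the data at once would not.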

bluedevil2k · on Dec 1, 2010

Weka isn't limited to the GUI if you want to mine your data. It's a regular JAR file you can drop in your server-side web application and make calls to it like any other Java library. I've used it in some of my apps for some clustering algorithms (the easy stuff, since it can get complicated).

I've also written a few articles on Weka if you want to read a few nice tutorials on how to use it. I'm not a Data Mining Expert, but I've had a few semesters of it in grad school.

http://www.ibm.com/developerworks/opensource/library/os-weka...

https://www.ibm.com/developerworks/opensource/library/os-wek...

Swizec · on Dec 1, 2010

How cool is that! I'm studying Computer Science at the University that makes Orange (the first on the list). And the professor who originally came up with it is an advisor for my startup.

ams6110 · on Dec 1, 2010

What is your professor advising you on, specifically? I'm curious because the computer science profs I know would not have the first clue about running a startup. Not that there's anything wrong with that...

Swizec · on Dec 1, 2010

Actually a lot of my startup is based on the idea that we can create the machine-learning/data-mining algorithms. He's the lead of the faculty department that deals primarily with data mining.

So essentially, he's advising on algorithm stuff.

earle · on Dec 1, 2010

Mahout should clearly be on this list!

mark_l_watson · on Dec 1, 2010

Good list, but I would add NLTK.

fogus · on Dec 1, 2010

This is precisely why I posted this to HN. While it's a nice list, I can't help but think that HN can fill in the missing pieces.

gtani · on Dec 1, 2010

From my bookmarks:

http://gate.ac.uk/

http://mallet.cs.umass.edu/download.php

http://alias-i.com/lingpipe/index.html

http://incubator.apache.org/uima/

http://elefant.developer.nicta.com.au/

lenley · on Dec 1, 2010

I don't think lingpipe is open source

uxp · on Dec 1, 2010

The Royalty Free version is open source; however, it appears to be a highly restricted version of the GPL v1 license, where any application that uses LingPipe, by API calls or even by separately using the output of LingPipe, must comply with an Open Source Initiative license.

It's a somewhat interesting clause that may restrict "freedom", but it nonetheless still allows LingPipe to be open source.

http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt

elblanco · on Dec 1, 2010

I've seen some great stuff done with RapidMiner. Really cool package -- plus I've heard it supports all the Weka components.

thingsilearned · on Dec 1, 2010

Chart.io (YCS10) is building something like these as a service. If you're interested in getting on the private beta email me at dave@chart.io.

http://chart.io

NHQ · on Dec 2, 2010

Google bought a company within the last couple of years that had made a really smart open source data app that ran in the browser or something. Anybody know what it was called?

albahk · on Dec 2, 2010

Did it become Google Refine? It's a downloadable app that creates a local web server to run in a browser.

http://code.google.com/p/google-refine/

NHQ · on Dec 2, 2010

Yeah that's it. It was Freebase before.