If I may plug 140kit (http://140kit.com): it's an open source Twitter mining and analytics solution that uses the streaming API to make downloading millions of tweets in easily accessible formats quick and easy for anyone.
(Doh, it looks like it's down for some reason. Well, when it's back up, it's there. In the meantime you could find the GitHub repo and run it yourself, since it's open source.) Written in Rails.
I guess R is a bit more DIY than these frameworks, but it has a very large collection of tools. I've found libraries for everything from CART (classification and regression trees), to SVMs, to HMM learning, to clustering, to EM. R with libraries from CRAN is my go-to tool for statistical learning.
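To give a flavour, here's a minimal sketch of the kind of thing CRAN covers (the package names are just the common choices, not the only ones):

    # rough sketch: the usual CRAN picks for the techniques above
    # install.packages(c("rpart", "e1071"))
    library(rpart)   # CART-style classification and regression trees
    library(e1071)   # SVMs via the libsvm interface

    data(iris)
    tree_fit <- rpart(Species ~ ., data = iris)     # decision tree
    svm_fit  <- svm(Species ~ ., data = iris)       # support vector machine
    km_fit   <- kmeans(iris[, 1:4], centers = 3)    # clustering, in base R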
R is great at hundreds of thousands of records but fails miserably at millions. This seems especially true for clustering and regression. With the explosion in data collection, tools have to be able to take in and clean millions of records quickly and easily.
Does anybody have experience with Orange or RapidMiner?
Something like SAS or DAP would be better suited for large data sets, as R tends to load everything into RAM.
From http://www.gnu.org/software/dap/ : "Because Dap processes files one line at a time, rather than reading entire files into memory, it can be, and has been, used on data sets that have very many lines and/or very many variables."
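For comparison, you can approximate that line-at-a-time behaviour in R by reading the file in chunks instead of all at once. A rough sketch, with the file name and the per-chunk work made up:

    con <- file("big.csv", open = "r")
    header <- readLines(con, n = 1)                  # keep the header around
    rows_seen <- 0
    repeat {
      chunk <- readLines(con, n = 100000)            # 100k lines per pass
      if (length(chunk) == 0) break
      rows_seen <- rows_seen + length(chunk)         # swap in real per-chunk processing here
    }
    close(con)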
Regarding R's processing power, I haven't found it to be an issue. When building a model and testing, I use a sample of the data which is usually less than 100,000 observations. I use samples even when using a tool like SAS Enterprise Miner.
As for scoring, I usually export the model as PMML and run it natively on the database, which makes for fast execution. PMML is available in R, RapidMiner, and other packages.
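Roughly, that workflow looks like this in R (the data frame and column names are made up, and newer versions of the pmml package may want xml2::write_xml instead of saveXML):

    library(rpart)
    library(pmml)    # CRAN package that converts many R model types to PMML
    library(XML)

    idx <- sample(nrow(big_df), 100000)                  # model on a sample, not the full table
    fit <- rpart(churned ~ ., data = big_df[idx, ])
    saveXML(pmml(fit), file = "churn_model.pmml")        # ship this to the database for scoring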
This is definitely a problem with R, although the biggest problem IMO is that a lot of libraries aren't multicore capable. Fixing the memory problem was just a matter of adding lots of RAM to our workstation; we can't fix the "can't use more than one core at a time" problem as easily.
Yes you can: just run it in many single-core VMs. This is what was recommended to me by a vendor making legacy single-threaded software. They were the best in their field, so they never really tried to port it to multicore ;(
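You can also skip the VMs and just fork extra R processes: a single library call still runs on one core, but independent runs (CV folds, bootstrap samples, parameter sweeps) can be spread across cores with the parallel package. A minimal sketch, where the worker function is a stand-in:

    library(parallel)   # ships with R >= 2.14

    fit_one <- function(seed) {                        # stand-in for your single-threaded job
      set.seed(seed)
      kmeans(matrix(rnorm(1e5), ncol = 10), centers = 5)
    }
    results <- mclapply(1:8, fit_one, mc.cores = 8)    # one forked R process per core
    # note: mclapply forks, so mc.cores > 1 is a no-op on Windows; use makeCluster/parLapply there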
Weka isn't limited to the GUI if you want to mine your data. It's a regular JAR file you can drop into your server-side web application and call like any other Java library. I've used it in some of my apps for some clustering algorithms (the easy stuff, since it can get complicated).
I've also written a few articles on Weka if you want some nice tutorials on how to use it. I'm not a data mining expert, but I've had a few semesters of it in grad school.
How cool is that! I'm studying computer science at the university that makes Orange (the first on the list), and the professor who originally came up with it is an advisor for my startup.
What is your professor advising you on, specifically? I'm curious because the computer science profs I know would not have the first clue about running a startup. Not that there's anything wrong with that...
Actually, a lot of my startup is based on the idea that we can build the machine-learning/data-mining algorithms ourselves. He heads the faculty department that deals primarily with data mining.
The Royalty Free version is open source; however, it appears to be a highly restricted version of the GPL v1 license, under which any application that uses LingPipe, via API calls or even by separately using LingPipe's output, must comply with an Open Source Initiative license.
It's a somewhat interesting clause that may restrict "freedom", but it nonetheless still allows it to be open source.
Google bought a company within the last couple of years that had made a really smart open source data app that ran in the browser or something. Anybody know what it was called?