Hacker News
H2O: fast statistical, machine learning and math runtime for big data (github.com/0xdata)
59 points by ColinWright on April 20, 2013 | 16 comments



The scientific advisory council is made up of 3 big guns of the ML (research) community: Boyd, Tibshirani, Hastie; so I guess it's not "just another ML lib"…


Yeah, those are baller advisors to have. On the flip side, while that means they will write interesting algorithms/modelling tools correctly, that doesn't mean they will build the right tools.

For good or for ill, many academic folks are technological curmudgeons.

Let me in fact make a much bolder but still true statement: very, very few of the folks who are strong both algorithmically and mathematically are also adept at helping engineer tools that are transformatively better to use. Why? Because doing any subset of the math, CS and engineering things well requires deliberate practice in that entire subset of skills. Deliberate practice of all three together in tandem does not happen in an academic environment, ever. (Measure-zero-sized counterexamples exist, admittedly.)


Innaresting. There was some blogging last year about consistency vs. availability issues, similar to what Akka, Riak, Cloud Haskell etc. are confronting. It occurs to me you need to do Bayesian inference to order packets on partition.

http://www.cliffc.org/blog/2012/05/

http://www.cliffc.org/blog/2012/06/21/progress-vacation/

___________

Very readable code; also, no generics in the few files I looked at. And, like Twitter, scalaz, Akka etc., they write their own Futures.

https://github.com/0xdata/h2o/blob/8f6d81abeea8ef3221969d8aa...


It looks promising, but it's still in its early stage of development. At least the documentation is poor. The google group link doesn't seem to work, and I have no idea how to format and load my data into this framework.


Thanks for the mention: I work for H2O by 0xdata. We want to bring change to the world through math. As it turns out, Math is very widely applicable. We are building a high-scale math library & prediction engine that developers can extend or embed for gaining actionable insights from their data.

Math has been packaged for way too long and modeling has been sampling driven. We envision a world where math is free and one does not have to choose between Big Data or Better Algorithms - get them both.

(answering some of the comments)

- One of our goals is to continue to be extensible and easy to use. H2O is extensible today via JSON, R (& Python) or via simple Java (see package hex). The core platform is very scalable and fast, thanks again to an amazing team of devoted hackers.

- We are inspired by prior art and lessons from Mahout & efforts from RHIPE (Thanks, Saptarshi!) and think of ourselves as the next generation, fulfilling the promise with one simple stack for ML & Math on Big Data that is open, useful and production grade (performance & testing).

- We also believe that great math systems can be built by great systems engineers surrounded by great math people and domain experts (and by starting with the end-user experience, especially one for Big Data Science). Our team reflects some of that thinking. We welcome data analysts, scientists, math people, distributed systems engineers and domain experts to use, critique and extend our product.

- Data ingest into our system can be via SQL, NoSQL, HDFS and plain old filesystem. H2O ingests regular CSV, xls or Hive-delimited files. Almost all commands are JSON directives and can be easily programmed via Python; see our test bed in action here - http://test.0xdata.com
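For readers wondering what "JSON directives programmed via Python" might look like in practice, here is a minimal sketch. It does not use H2O's actual API: the directive shape, the command name `import_csv`, the `source` parameter and the file name `data.csv` are all hypothetical, chosen only to illustrate the idea of inspecting a CSV and serializing a command as JSON with the Python standard library.

```python
import csv
import io
import json


def build_directive(command, **params):
    """Serialize a command name and its parameters as a JSON string.

    The {"command": ..., "params": ...} layout is a hypothetical shape,
    not H2O's real wire format.
    """
    return json.dumps({"command": command, "params": params}, sort_keys=True)


def csv_summary(text):
    """Parse CSV text and return its header row and the number of data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    return {"columns": header, "rows": len(data)}


# A tiny in-memory CSV stands in for a real data file.
sample = "x,y\n1,2\n3,4\n"
summary = csv_summary(sample)

# "import_csv" and "data.csv" are made-up names for illustration only.
directive = build_directive(
    "import_csv", source="data.csv", columns=summary["columns"]
)
print(directive)
```

In a real setup the resulting string would be sent to whatever endpoint the server exposes; the endpoint and the accepted fields would have to come from the project's own documentation.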

Above all, we are grateful for the attention by HN and would like to welcome and nurture a community of users, doers and data enthusiasts who can use, patch, add to the docs and give feedback through your data experiences.

A product is not complete without its community. Come join us on the refreshing journey ahead!


I'm not sure I understand. Is this something (to be) like Wakari.io?


Wakari (by the Continuum folks) seems to be more of a web notebook on top of Python, where you run the Python code.

The 0xdata stuff seems to be more of an "interact with a Hadoop cluster via a RESTful API via R" kinda thing.

the 0xdata stuff seems to be less easily extensible because of that strong separation / siloing. (though it looks like they have some really smart interesting folks on board!)

[edit: I'm working on building some scalable extensible numerical / data analysis tools myself, and whenever I see that partitioning between the tools for extending vs the tools for using, it just screams "wrong" to me. That said, the more everyone else focuses on businesses where that partitioning is normal, the more left for me :) ]


Thanks for the clarification! How do they approach interactivity with "big data"/Hadoop (HBase, I presume)? Apache Drill is still draft/WIP, Impala is not that fast from what I hear (for interactive use)... unless they pre-calculate some use cases, but how would that constitute (semi-)real-time interaction with Hadoop? MR by its nature is not very real-timey.


I'm not sure which you are referring to. But the best way is to go and read the source!


As far as I can tell, this is a different vision than that of Wakari. This effort seems to be about building a souped-up parallel ML-oriented calculator around Hadoop. There is an R binding, and there is a nascent Python binding (that "doesn't include the Python scientific frameworks because [they] are not that familiar with them"[1]).

Wakari is an infrastructure for doing repeatable, collaborative analytics with Python. Our existing tools include a simple browser-based environment for code editing, IPython notebook, and sharing of plots and workflows. Our goal is to support multiple applications on that infrastructure, and not necessarily all written by Continuum.

[1] https://github.com/0xdata/h2o/blob/master/py/h2o/h2o/cloud.p...


How is this different from RHIPE? http://www.datadr.org/

Also some "big guns" on that project.


What's new here? Mahout already did it.


Rather than simply making offhand, disparaging comments, perhaps you could actually provide some details and/or references.

Thanks.


It's just a Green Troll :)


I admit my first comment was a bit rough. My apologies.

My point is that this library provides the same features as Apache Mahout (which has existed for quite a long time), so why duplicate the work, and why not contribute to Mahout?

Duplicate features:

- Hadoop support

- RandomForest

- Generalized Linear Modeling

The list of algorithms supported by Mahout is way bigger than that of 0xdata: https://cwiki.apache.org/confluence/display/MAHOUT/Algorithm...

I would say the only interesting feature is the R support. So why not add R support to Mahout?

There is no explanation given, although the existence of Mahout is pretty obvious to anybody in the field. So without any proper argument, it seems that the authors are deliberately ignoring the major competing tool in their field.


Well, Green trolls are fun too, hehe. Thx for this comment; I am looking for solutions to experiment with ML and Data Mining.



