Tools to get started in machine learning (k2company.com)
89 points by nwenzel on Sept 6, 2012 | 35 comments



While I think Python is great, I think the author discounts R too casually. I share his frustrations about R, but what's nice is R (well, S) was designed to be a statistical computing tool, and it does anything related to that quite well. Especially for data analysis, which is where I spend a lot of my time (perhaps most) in the whole model building process, R is amazing. Also, R is very usable out of the box for math and data visualization, whereas Python requires many libraries. It's good to know R.

I think R and Python can be used in conjunction very effectively. Analyze your data in R, prototype your algorithms in either, build products in Python.


With rpy2 you can even embed R code in Python.


>I found the syntax baffling, the documentation copious, but written for mathematicians instead of hackers

I'm always surprised how much people hate the syntax of R. I primarily work in Python, but I use R once or twice a week... and the syntax seems very clean to me. Can someone give me an example of what you dislike about R's syntax?

I'm even more surprised to hear complaints about the documentation in R. The help files in R are much more complete and well organized than docstrings in the Python libraries we use. Even the web usually lacks anything as useful as what I get from the vignette function in my R interpreter.


Personally it's not so much syntax as the confusing data model that gets me in R. So many different but very similar data types - lists, data frames, matrices, tables, vectors - all very similar but slightly different syntax, very frequently converted silently from one to the other when you call functions but resulting in strange quirks that are extremely hard to debug at the other end. The combination of loose data typing and this plethora of similar data types makes it a nightmare to work with at times. On the other hand when you grok it and it works for you ... it's amazing.


The complaints about R still seem counterintuitive to me. Python has the same data types listed above, and many more.

A list in R is a list in Python. A data frame in R is a data frame in the pandas library. A matrix in R is a matrix in numpy. A vector in R is a 1-dimensional ndarray in numpy.

But Python adds dictionaries, tuples, iterators, sets, and a bunch of other data types that aren't used in R.

R's lists and vectors are relatively similar... but you could say the same thing about numpy's matrix and ndarray. You could probably say the same thing about python's sets, tuples and lists.

To be honest, I'd have said the strength of python is that it has many more data types than R... rather than fewer data types.
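
For readers who have not used the built-in Python types named above, here is a minimal sketch using only the standard library. The values are made up for illustration; numpy and pandas types are left out on purpose.

```python
# Illustrative values only; none of this data comes from the thread.
measurements = [2.5, 3.1, 2.5, 4.0]   # list: ordered, mutable
point = (1.0, 2.0)                    # tuple: ordered, immutable
counts = {"a": 3, "b": 1}             # dict: key -> value mapping
unique = set(measurements)            # set: unordered, duplicates dropped
stream = iter(measurements)           # iterator: yields items lazily, one pass only

first = next(stream)                  # consumes the first item
rest = list(stream)                   # only the remaining items are left
```

Note that the iterator can only be walked once, which is one of the behaviours that has no direct counterpart among the R types listed above.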


(Author here) - I don't have any specific examples, but I was learning R and Python at the same time. I found Python to be very practical and easy to learn. When learning R, I kept getting tripped up. Maybe it isn't that R was harder, but that I had a head start on Python. And as for documentation, R was certainly very complete, but once again, I found it harder. I think because R is written by, and probably for, professional statisticians and mathematicians, it needs to have a different level of rigor than the Python documentation. Anyway, sorry for the lack of specificity.


R does have some weirdness (it took me ages to understand the index and slice notations), but it is very expressive.

I'm not a mathematician, and I've been programming Python for 15 years, but I'd always pick R for its stated problem domain given a choice.

I highly recommend "The Art Of R Programming" for learning R as a programming language. The statistical side of things is then easier to layer on top of that.


If you're using a Mac, please don't install all of this individually. Instead, install the Scipy Superpack: http://fonnesbeck.github.com/ScipySuperpack/


The superpack is a great one-click option. For staying up to date, I find a package manager (like macports or homebrew) more effective for managing my python packages.


Does anyone have any experience with Octave and how it compares to the Python setup that OP suggests?

What are the benefits/pitfalls?


Octave sits at an awkward half-way point between MATLAB and Python/R/Julia/etc. You get the shitty syntax of MATLAB at not quite MATLAB's speed and miss out on support as well as various incompatible toolboxes. So unless you have hard dependencies or lots of MATLAB code sitting around, Octave isn't that attractive an option.

It's great for matrix/vector maths, though. If that's all you do -- go for it. Everything over and above that is a royal pain in MATLAB and, conversely, in Octave.


I have used Octave but only for the Coursera ML Class, not actual production use. My understanding is that it is the open source version of Matlab capable of running many Matlab programs.

I learned Octave first, then R, then Python.

Octave to me feels like Python + NumPy. I'd say Octave has more in common with R than with Python.

Given the choice between Octave and R, I'd choose R for the more robust user community and incredibly diverse and thorough selections of libraries.


A nice list of tools. We tried to use Python 3.0, but had the same problems as the author...

And if you are lucky enough to have free-ish access to MATLAB, here's a free, BSD, open-source, github repo'd machine learning toolbox to help you get started:

http://www.newfolderconsulting.com/prt/

Full disclosure: I'm involved.


If you are going to use Python for ML, use the Python distribution from Enthought [http://www.enthought.com/products/epd_free.php]

It has most of the libraries, out of the box.


(Author) - Nice tip. Thanks - I'll update my list. (Wish I had known at the start of my learning!)


Don't forget gensim (and its tutorials): http://radimrehurek.com/gensim

Plus it's the only one on that list that will scale beyond the "My Data Set" size.


>If you are a real data scientist or expert, skip this

I am a "real data scientist" and need some advanced data analysis techniques and machine learning. Can someone recommend an introduction for me?


What's your background? (i.e. what do you already know?)


Probably not much. I started working with bigger data some months ago and now I notice that I need some real techniques and not just my "ok try this and this". I'm coding in C (where I "create" the data (numerical integration of stochastic differential equations)) and Python (plotting). I need methods/algorithms/techniques to analyse the data "on the fly" because I can't save it all (it's too much data).


Just a small tip which may ease the search for methods. The general term for "on the fly" learning is online learning [1]. The rest depends on your problem, but there are often online variants of offline methods, e.g. when you work with Gaussian process regression.

[1]: http://en.wikipedia.org/wiki/Online_machine_learning
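
As a minimal sketch of what "online" means in practice: Welford's algorithm keeps a running mean and variance in constant memory, so each value can be discarded once it has been seen. This is a generic illustration, not something proposed in the thread; the class name and the sample values are assumptions standing in for a real simulation stream.

```python
import math


class RunningStats:
    """Running mean and variance via Welford's algorithm (one pass, O(1) memory)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined for fewer than two observations.
        return self._m2 / (self.n - 1) if self.n > 1 else float("nan")

    @property
    def std(self):
        return math.sqrt(self.variance)


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stand-in for streamed output
    stats.update(value)
```

The same pattern (update a small state per observation, never store the history) carries over to online versions of many estimators.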


This might be of interest: http://noelwelsh.com/streaming-algorithms/2012/08/29/lean-da...

Don't have a great deal of time right now so drop me an email if you'd like more info (see profile) and I'll get on it tomorrow.


Solid summary of some powerful tools. No wonder Python is the new default for academic data analysis.


Lush is an excellent platform for machine learning. There are bindings to gnuplot, opencv, lapack, gsl, an optimization library for gradient descent, a machine learning framework, a neural network simulator.

It also has very nice matrix and vector manipulations features built in to the language and is very easy to bind to C code.


Some people really do seem to get a lot done in Lush, so I'm not discounting its utility, but the language is sort of a mess. I took Yann's class and gave up in frustration after a few homeworks. I was very happy working in Matlab and relieved to never see a 'bloop', 'eloop', or whatever-loop again.


Lush's purpose is a little different from Matlab's. Its abstractions are a little lower level, for instance. But then again, you can compile your functions directly to machine code. There are trade-offs to everything in life.

Matlab, Octave, R, and S are great, but if you need to be closer to the metal, Lush offers a very good compromise.


Andrew Ng, in his machine learning class [1], urges people to use Matlab instead of Python, because in his experience people develop faster with Matlab than with any other tool/language.

Personally, I am experienced with Matlab but not so much with Python, so I am not able to judge. I definitely hate the fact that Matlab is proprietary and partly closed source. Also, I think Python syntax is much more pretty while Matlab is not even designed to be a real language. But alas, it comes with very powerful functionality out of the box.

Note that the OP mentioned in the introduction that he had no access to Matlab through his employer or university and hence dismissed it.

[1] https://www.coursera.org/ml (one of the first videos)


Some good points for comparison:

http://www.scipy.org/NumPy_for_Matlab_Users

IMHO, it's best to prototype in Octave and then build in Python. I find that the Matlab/Octave syntax is too focused on linear algebra, so it's better for small prototypes (and for people coming from non-SW fields). For big projects, I prefer the 0-based arrays, more than one function per file, and all the rest of the Python goodies. I estimate that 70% of my time is usually spent preparing the data (e.g. parsing xml, or some other files, etc), for which I find Python more suitable.

In fact, I usually work with them side by side, testing ideas in Octave, then implementing these pieces into a large python project.
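
To give a flavour of the data-preparation work mentioned above, here is a rough sketch that parses a small XML document into feature and label lists using only the standard library. The element names, attribute names, and values are hypothetical; real files will look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; the element and attribute names are made up for illustration.
RAW = """
<samples>
  <sample label="1"><x>0.5</x><y>1.5</y></sample>
  <sample label="0"><x>2.0</x><y>0.25</y></sample>
</samples>
"""


def load_samples(xml_text):
    """Return (features, labels) as plain lists parsed from an XML string."""
    root = ET.fromstring(xml_text.strip())
    features, labels = [], []
    for node in root.findall("sample"):
        features.append([float(node.findtext("x")), float(node.findtext("y"))])
        labels.append(int(node.get("label")))
    return features, labels


features, labels = load_samples(RAW)
first_sample = features[0]  # 0-based indexing: index 0 is the first sample
```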

Edit: this has also been discussed here before, e.g.

http://news.ycombinator.com/item?id=363096

http://news.ycombinator.com/item?id=689183


>Also, I think Python syntax is much more pretty while Matlab is not even designed to be a real language.

If you're only coding the core of an algorithm (rather than a full-featured library with lots of plumbing), and your logic fits naturally into Matlab's native array operators, then using Matlab is a joy.


> Andrew Ng, in his machine learning class [1], urges people to use Matlab instead of Python, because in his experience people develop faster with Matlab than with any other tool/language.

Never trust academics when it comes to programming. :)

Seriously, Matlab might be a tiny bit better for scientific programming than Python. But: if you start building an ecosystem around your machine learning code (distributed evaluation of models, email reporting of results, web reporting of results, online tracking of training progress, database related things, web services for other people, proper documentation, ...) you will be happy you chose Python.

Also, there is theano for python which has auto diff, transparent GPU/CPU use and symbolic optimization. It makes your life easy if you are using complicated models.
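
For anyone unfamiliar with the term "auto diff", here is a toy sketch of the idea in plain Python. It uses forward-mode differentiation with dual numbers, which is a much simpler technique than Theano's symbolic graphs and does not use Theano's API at all; the `Dual` class, `derivative` helper, and `model` function are all hypothetical names for illustration.

```python
class Dual:
    """A number a + b*eps with eps**2 == 0; the eps part carries the derivative."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def _lift(self, other):
        # Treat plain numbers as constants (derivative 0).
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = self._lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def derivative(f, x):
    """Evaluate df/dx at x by seeding the derivative part with 1."""
    return f(Dual(x, 1.0)).deriv


def model(w):
    # Toy function standing in for a loss: 3*w^2 + 2*w + 1
    return 3 * w * w + 2 * w + 1


slope = derivative(model, 2.0)  # analytic derivative is 6*w + 2
```

The point is only that the derivative falls out of running the function itself, with no hand-written gradient code.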


It's too bad the OP didn't find Octave, which is an open-source Matlab clone that was also used in Andrew Ng's ML course.


I am not sure that Octave performs as fast as Matlab; for example, I think Matlab does a better job of parallelizing non-vectorized code.


If you have taken Andrew Ng's Machine Learning class, the handwriting recognition system that is mentioned in the course was implemented in Lush. I think the original code is even included in the demos distributed with Lush.


Word. Given Cython, Jython, etc.... this list keeps going (I use Weka in Python).


Can I get your email address? Check my profile for mine please. Thank you!


Why?



