I've been using Python for what I guess you could call pre-processing for data science.
One thing people might want to consider before investing time in Python is that I found it to be quite memory inefficient: data structures take up a lot of space, and the garbage collection didn't seem to be as effective as in other languages (I spent some time studying/improving GC in JVMs). So if you're dealing with large amounts of data and/or complex data structures, I wonder if Matlab might be more appropriate (AFAIK R is also not very good at memory management yet).
As a rule of thumb, if you run out of memory in python, you will also run out of memory in matlab. I've done a lot of work in both, and found that while python's memory performance may not be ideal, at least when you run into trouble in python you have options. With matlab, when I got that dreaded "Out of memory" message at the prompt, there was little I could do. The internals are completely opaque, shipping my code out to C is a pain in the ass, and there are very few language constructs to help you control how you use memory.
In most cases, running out of memory in matlab meant either making the problem smaller or running it on a beefier machine. I think this is why you see a lot of university labs with machines that have 96GB of memory, even though their datasets seem to be much smaller.
FWIW, as far as processing lots of data is concerned, python is not without issues. If you do it naively you will run out of memory really quickly. But by picking your tools correctly you can go a long way. Use pandas and/or scipy sparse arrays whenever possible. Learn how numpy broadcasting operations can contribute to memory explosions. Take a gander at the source of that sklearn method you're using, since it's often quite obvious that the particular implementation will choke.
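To make the broadcasting point concrete, here's a minimal sketch (sizes are made up for illustration): a pairwise difference written with broadcasting materializes the full (n, n, d) intermediate, while a chunked version caps the peak allocation:

```python
import numpy as np

n, d = 500, 8                    # small enough to actually run; scale mentally
x = np.random.rand(n, d)

# The broadcast pairwise difference materializes an (n, n, d) intermediate.
# At n=100_000, d=50 that would be 100_000**2 * 50 * 8 bytes -- petabytes.
diff = x[:, None, :] - x[None, :, :]
full = (diff ** 2).sum(axis=-1)
print(diff.nbytes)               # 500 * 500 * 8 * 8 = 16_000_000 bytes already

# Chunking the outer axis caps the peak intermediate at (chunk, n, d):
def pairwise_sq_dists(x, chunk=64):
    n = x.shape[0]
    out = np.empty((n, n))
    for i in range(0, n, chunk):
        block = x[i:i+chunk, None, :] - x[None, :, :]
        out[i:i+chunk] = (block ** 2).sum(axis=-1)
    return out

assert np.allclose(pairwise_sq_dists(x), full)
```

Same answer either way; the only difference is how much scratch space numpy has to allocate at once.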
I've found that these days I try my best to avoid loading datasets into memory. This is second nature for people who work with 'big data', but it's an m.o. that takes some getting used to. That is, blocking and/or streaming your data, and appropriately subdividing your problem for distributed computation. It's worth mentioning that this problem with python is under active research and development. The folks at Continuum developed IOpro to deal with the issue of memory efficiency when loading data, and to make streaming data from flat files/S3/mongodb/whatever easier and more stable. Also, their (very young) project called Blaze is meant to be a drop-in replacement for numpy, but is designed for efficiency and specifically for dealing with out-of-core computation. We'll see...
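One cheap version of the blocking/streaming approach is pandas' `chunksize` option. A sketch (the `StringIO` here just stands in for a file too big to load whole; the column names are made up):

```python
import io
import pandas as pd

# Stand-in for a file too big to load in one go; in practice this is a path.
csv = io.StringIO(
    "user,value\n" + "\n".join(f"u{i % 3},{i}" for i in range(10_000))
)

# Stream the file in blocks and fold each block into a running aggregate,
# so peak memory is one chunk rather than the whole dataset.
totals = {}
for chunk in pd.read_csv(csv, chunksize=1_000):
    for user, s in chunk.groupby("user")["value"].sum().items():
        totals[user] = totals.get(user, 0) + s

print(sorted(totals))            # ['u0', 'u1', 'u2']
```

The per-group partial sums combine exactly, so you get the same answer as a full in-memory groupby without ever holding more than one chunk.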
I'm not very experienced with Matlab, but I did see a seminar given by some of its engineers, and they seemed to have given a lot of thought to optimising the software for large datasets.
> I've found that these days I try my best to avoid loading datasets into memory. This is second nature for people who work with 'big data', but it's an m.o. that takes some getting used to.
Yes, couldn't agree more. I'm still getting used to working this way.
I also recently learned that pandas can talk to a PyTables HDF5 data store on disk. This should be fairly seamless and can help reduce some memory issues with larger datasets. Yves Hilpisch did a nice talk on "Performance Python" that goes into this: http://hilpisch.com/YH_Performance_Python_Slides.html
Yes, I almost exclusively use pytables+HDF5 when I'm permitted. As I commented on some thread earlier this week, it strikes a great balance between simplicity, performance and flexibility.
> ...I found it to be quite memory inefficient: data structures take up a lot of space...
Just for whatever it's worth, this is the whole point of numpy. Lists, dicts, etc. are not memory-efficient, but numpy arrays are.
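A quick way to see the difference (exact numbers vary by CPython version and platform, so treat these as ballpark):

```python
import sys
import numpy as np

n = 100_000
lst = list(range(n))
arr = np.arange(n, dtype=np.int64)

# The list holds pointers to boxed int objects; the array holds raw
# 8-byte machine integers in one contiguous buffer.
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst)
print(list_bytes)                # a few MB on a 64-bit CPython
print(arr.nbytes)                # exactly 800_000 bytes
```

The list version is several times larger, and that's before counting pointer-chasing costs when you actually compute with it.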
The main reason I switched from Matlab to Python is Matlab's excessive memory usage (essentially every operation makes a copy). With numpy, you have a lot more control over memory usage than you do in Matlab.
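A small sketch of the kind of control I mean, using standard numpy semantics:

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)   # one 8 MB buffer

# Basic slicing returns a view -- no data is copied.
v = a[::2]
assert np.shares_memory(a, v)

# In-place operators reuse the existing buffer instead of allocating:
a *= 2            # no temporary
# b = a * 2       # this WOULD allocate a second 8 MB array

# Fancy indexing, by contrast, always copies:
c = a[[0, 1, 2]]
assert not np.shares_memory(a, c)
```

Knowing which operations are views and which are copies is most of the battle; in Matlab you mostly don't get to make that choice.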
This is not a major problem for data science. We usually use numpy arrays (or scipy sparse arrays), which are memory efficient, and there are some other memory-efficient data structures for python as well.
Have you looked at pytables? It does a good job of presenting a fairly pythonic interface for dealing with datasets that don't fit in memory. Many standard operations are basically the same as when dealing with in-memory numpy structures.
I had terrible memory problems using Python on larger graphs (1M+ vertices/edges). I saw weird memory usage, and even 64GB filled up fast.
For example, the everything-is-a-dictionary approach that the widely used NetworkX library takes was a horrible memory gobbler on large graphs. It sounds very nice in theory and works well on smaller graphs, but is near useless for larger ones.
The best graph tool that I found for larger graphs was graph-tool, which is heavily NumPy based.
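A toy illustration of why the dict approach gobbles memory (this is not NetworkX's actual structure, which is a dict of dicts of dicts, and it only counts container overhead, not the boxed keys and values):

```python
import sys
import numpy as np

# A dense 300-node graph as dict-of-dicts vs a numpy adjacency matrix.
n = 300
dod = {u: {v: 1.0 for v in range(n)} for u in range(n)}
adj = np.ones((n, n), dtype=np.float64)

# Hash tables alone already dwarf the flat 300*300*8-byte array.
dod_bytes = sys.getsizeof(dod) + sum(sys.getsizeof(inner) for inner in dod.values())
print(dod_bytes, adj.nbytes)     # the dict version is several times larger
```

For sparse graphs the comparison is less lopsided, but the per-entry hash-table overhead never goes away, which is roughly what I was seeing at the 1M+ scale.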
When I write in C, at least I know where my memory goes.