I just want to point this out because I feel like there's a good chance a lot of people won't have gotten this far:
Because our implementation does not explicitly depend on Python we are able to overcome many of the shortcomings of the Python runtime such as running without the GIL and utilising real threads to dispatch custom Numba kernels running at near C speed without the performance limitations of Python.
Yes, using Numba we can just-in-time compile numeric Python logic straight down to machine code, so naturally we can achieve some pretty impressive numbers on kernel execution.
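For anyone who hasn't seen what that looks like in practice, here is a minimal sketch of JIT-compiling a numeric loop with Numba's `njit` decorator. It is only an illustration and not code from the Blaze repo: the function name `sum_of_squares` is made up, and the `ImportError` fallback is my own addition so the snippet still runs as plain Python when Numba isn't installed.

```python
try:
    from numba import njit  # third-party package, assumed to be installed
except ImportError:
    def njit(func):
        # Fallback (assumption for this sketch): run as ordinary Python.
        return func


@njit
def sum_of_squares(n):
    # A plain numeric loop; with Numba present it is compiled to machine
    # code the first time the function is called.
    total = 0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))
```

The first call pays the compilation cost, so any timing comparison should look at the second and later calls.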
In case many people didn't reach the bottom here are the links to the repo and the docs. The project is still in early stages, but is public and released under a BSD license.
I taught High Performance Python covering the tools you mention at PyCon 2012 (and EuroPython last year), maybe my videos+write-up will be helpful. I also cover profiling, shedskin, pyCUDA etc:
Depends on your application. Ideally you want to change your code so as much computation as possible can happen in pure-C code and pure-C data types (using Cython). If you have a big class tree with many callbacks and work spread over hundreds of methods, that can be difficult.
Before you go that far, I'd recommend making sure you know all the Python gotchas (for example, maybe you have some inner loop that does for x in range(100000) all the time) and that your algorithms are in order. Sometimes even silly microoptimization can make a difference if a small function accounts for a significant share of your runtime. Using multiple processes with e.g. the multiprocessing module can be an option too.
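To make the inner-loop point concrete, here is a small sketch comparing two ways of writing the same hot loop. The function names and the test data are invented for illustration; the timings depend on your interpreter and machine, so measure rather than assume.

```python
import timeit


def slow_sum_squares(rows):
    # Indexes by position and calls the built-in pow() for every element.
    out = []
    for row in rows:
        total = 0
        for i in range(len(row)):
            total += pow(row[i], 2)
        out.append(total)
    return out


def fast_sum_squares(rows):
    # Iterates over the values directly and hands the accumulation to sum().
    return [sum(x * x for x in row) for row in rows]


rows = [list(range(200)) for _ in range(50)]  # made-up test data

if __name__ == "__main__":
    for fn in (slow_sum_squares, fast_sum_squares):
        seconds = timeit.timeit(lambda: fn(rows), number=20)
        print(fn.__name__, round(seconds, 4))
```

Both functions return the same result, so the rewrite only changes how the work is done, which is what a microoptimization should do.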
Depending on what data types you operate on, numpy (and now this new thing) can do some amazing things.
Cython is helpful, but you have to spell out a lot of type information that is not strictly necessary. You might also try Numba; the easiest way to get it is via Anaconda CE or Wakari. Both are available at http://continuum.io
Lua has some areas where it excels in performance, but using Python you can leverage 10 years of work on numeric libraries that are unrivaled in any other general purpose language. NumPy and SciPy are extremely powerful.
Cython is good, but sometimes it's a bit tricky to bend it to do exactly what you want[1]. You'll probably still want to write that hot piece in C... But gluing it with Cython is IMHO much nicer than using the plain Python API.
[1] - on the other hand, it comes with a tool explaining exactly how each of your lines of Cython looks in resulting C, with color-coding for high level overview of which pieces translated smoothly
I would go with C/C++ as the ways to address performance are well studied. There are many tools out there like callgrind or nvvp that will make it pain-free.
I can narrow down performance in C/C++ quite quickly, but neither I nor anybody I know has done much of this for Python. Many people who I work with consider a Python implementation a prototype, while Fortran/C/C++ is mature real code worthy of attention.
The only real downside is that C/C++ requires a little knowledge of POSIX/Linux or Windows. This represents a learning curve, but once you are over it, you have quite durable, long-lasting skills.
I think that, in terms of development effort, a more sensible approach is to build the whole thing in Python, then profile your application to find the performance bottlenecks, and only then get your hands a little dirty with the Python C API. This way you can gain good performance without wasting too much time.
So is there anyone using Python for machine learning in production systems (i.e. not just for prototyping)? I would love to do it, but it seems Java/Mahout is a safer choice, performance-wise.
I wonder whether Blaze is a step in that direction.
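The "profile first" step can be sketched with the standard library's cProfile and pstats modules. Everything named below (`parse`, `hot_spot`, `main`, `profile_report`) is a hypothetical stand-in for a real application; the point is only to show how the report singles out the function worth rewriting.

```python
import cProfile
import io
import pstats


def parse(n):
    # Cheap step: build some string items.
    return [str(i) for i in range(n)]


def hot_spot(items):
    # Deliberately quadratic: each membership test scans the whole list.
    seen = []
    for item in items:
        if item not in seen:
            seen.append(item)
    return seen


def main():
    return len(hot_spot(parse(2000)))


def profile_report(func, top=5):
    # Run func under the profiler and return the top entries by own time.
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("tottime").print_stats(top)
    return buf.getvalue()


report = profile_report(main)
print(report)
```

In a report like this, the function that dominates the tottime column is the candidate for a better algorithm or a C extension; the rest can usually stay in Python.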
I use Python for nearly all of my ETL processes that involve text processing. Even in production systems, I'd be hard-pressed to admit any significant performance issues. Python facilitates implementing algorithms in a functional style, which I tend to prefer over the imperative style (i.e., Java). With C++11 and boost, I'm able to translate my Python code to C++ while preserving the functional style, which has immensely simplified prototyping/deploying NLP/ML algorithms while simultaneously begetting enormous performance gains. I see Python as an extremely viable alternative to Java.
You got me a bit confused here. If I understand correctly what you're saying, you're still using Python for prototyping the core algorithms and C++ in actual production systems. I'm not saying Python is not good for production systems in general; I'm wondering whether it is good enough for real-world implementations of machine learning algorithms.
Also, I believe most people would consider Java as an alternative to C++, hence all the Java-based Apache projects, such as Mahout, Solr etc.
I use Python in production for text pre-processing and other ETL-related processes, which is part of a larger reinforcement learning approach. Additionally, I use Python to prototype the core ML algorithms, which I sometimes re-implement in C++. However, for many of those algorithms, numpy actually performs identically to BLAS in C++.
Have you tried Scala? It might let you write in a functional style and then not have to translate it to something else. Please don't interpret this as a troll; I'm genuinely curious what the pros/cons of these approaches are.
I've never tried Scala, but I suppose I should give it a chance. I'm a fan of Lisp, and the two languages seem to have a lot in common. Scala's expressive type system seems like it has the potential to be both a blessing and a curse, but admittedly, I know next to nothing about the language.
I may be missing something here, but if you're a fan of lisp and want easy interaction with libraries on the JVM, please tell me you've heard of Clojure. It's a modern lisp that strongly favors functional programming, and that has great concurrency support. Plus, there is already a data analysis / statistical platform built on top of it called Incanter.
We also use python in production at plotwatt for machine learning. We started by prototyping in matlab and then porting to c++, but have since found it much much easier to just do everything in python and numpy. When speed was an issue, we slightly changed the way we implemented the algorithm rather than implement the same algorithm in a faster language. Admittedly this isn't always possible.
It would be great to eventually have a GPU version as well (as in the cases of Matlab and R). I saw a brief demo of Matlab on a Mac Retina Pro 15 where the GPU version ran 30x faster than the CPU version.
I read about continuum a few days ago, after the fellow who developed numpy left to work on it. I am curious to see actual projects using continuum, so some sort of write-up would be welcome.
in general, i like (ie i don't see a better solution than) the idea of having an AST constructed via an embedded language that is implemented by a library. but it does have downsides - integration with other python features is going to be much more limited (it gives the illusion of a python solution, but in practice you're off in some other world that only looks like python).
are there more details? i guess the AST is fed to something that does the work. and that something will have an API and be replaceable. but is that something also composable? does it have, say, a part related to moving data and another to evaluating data? so that you can combine "distributed across local machines" with "evaluate on GPU"?
> how does this compare to theano? it seems like some of the ideas are similar?
It's quite similar; we just take some of the ideas further and try to generalize the data storage to include storage backends that data scientists use more frequently (e.g. SQL, CSV, S3, etc.). We're very friendly with the Theano developers and hope to bridge the projects with a compatibility layer at some point.
> (it gives the illusion of a python solution, but in practice you're off in some other world that only looks like python).
I would argue that's what makes Python a great numeric language, and NumPy so successful. You get a high-level language in which you can express domain knowledge, together with a 1:1 mapping onto fast code execution at the C level. Blaze is the continuation of that vision.
> i guess the AST is fed to something that does the work. and that something will have an API and be replaceable.
Precisely. We build up an intermediate form called ATerm out of the constructed expression objects, do type inference and graph rewriting, and then pattern match our layout, metadata, and type information against a number of backends to find the best one to perform execution. If needed, we build a custom kernel with Numba, informed by all the type and data layout information we've inferred from the graph.
We don't aim to solve all the subproblems in this area (expression optimization passes, distributed scheduling), but I think we have a robust enough system that others can build extensions to Blaze to do expression evaluation in whatever fashion they like.
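As a rough illustration of that pipeline (deferred expression objects, a term-like intermediate form, type inference, then backend selection), here is a toy sketch in plain Python. To be clear, none of this is the Blaze API or the real ATerm format: the `Array`, `Op`, `to_term`, `infer_dtype`, `BACKENDS`, and `evaluate` names are all invented for the example, and graph rewriting is left out entirely.

```python
import operator


class Expr:
    """Arithmetic on expressions builds graph nodes instead of computing."""

    def __add__(self, other):
        return Op("add", self, other)

    def __mul__(self, other):
        return Op("mul", self, other)


class Array(Expr):
    def __init__(self, values):
        self.values = list(values)
        has_float = any(isinstance(v, float) for v in self.values)
        self.dtype = "float" if has_float else "int"


class Op(Expr):
    def __init__(self, name, left, right):
        self.name, self.left, self.right = name, left, right


def to_term(node):
    # Render the graph as a nested term string (toy intermediate form).
    if isinstance(node, Array):
        return "Array(%s)" % node.dtype
    return "%s(%s, %s)" % (node.name, to_term(node.left), to_term(node.right))


def infer_dtype(node):
    # Toy type inference: any float operand promotes the result to float.
    if isinstance(node, Array):
        return node.dtype
    kinds = {infer_dtype(node.left), infer_dtype(node.right)}
    return "float" if "float" in kinds else "int"


def _run(node, ops):
    # Walk the graph and apply the chosen backend's elementwise operations.
    if isinstance(node, Array):
        return node.values
    left, right = _run(node.left, ops), _run(node.right, ops)
    return [ops[node.name](a, b) for a, b in zip(left, right)]


INT_OPS = {"add": operator.add, "mul": operator.mul}
FLOAT_OPS = {
    "add": lambda a, b: float(a) + float(b),
    "mul": lambda a, b: float(a) * float(b),
}

# Each backend: (name, predicate on the inferred dtype, operation table).
BACKENDS = [
    ("int-backend", lambda dtype: dtype == "int", INT_OPS),
    ("float-backend", lambda dtype: True, FLOAT_OPS),
]


def evaluate(node):
    # Pick the first backend whose predicate matches the inferred dtype.
    dtype = infer_dtype(node)
    for name, accepts, ops in BACKENDS:
        if accepts(dtype):
            return name, _run(node, ops)
    raise TypeError("no backend accepts dtype %r" % dtype)


a = Array([1, 2, 3])
b = Array([4, 5, 6])
expr = a * b + a  # nothing is computed yet
print(to_term(expr))
print(evaluate(expr))
```

Nothing is evaluated until `evaluate` is called, which is what leaves room for a real system to inspect the graph and choose where and how to run it.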
That's not what I meant. Both this and SA turn python expressions into expressions to be run elsewhere, on data that isn't necessarily in the process' memory.
Because our implementation does not explicitly depend on Python we are able to overcome many of the shortcomings of the Python runtime such as running without the GIL and utilising real threads to dispatch custom Numba kernels running at near C speed without the performance limitations of Python.