IIRC, this is false regarding NumPy. The fundamental problem NumPy has is that it only vectorizes one operation per pass through the data (or occasionally 2-3, like dot product). It can't do arbitrary hybrid operations (like, idk, maybe a[i] * b[i] / (|a[i]| + |b[i]|)) in one pass through your data. This is an inherent limitation of NumPy, and one that Numba doesn't share. So you very much can expect speedups in some cases -- it just depends on what operations you are performing and how big your data is.
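To make that concrete (array names made up), here's roughly what NumPy does with that expression: every operator is a separate pass over the data with its own temporary array.

    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    # NumPy evaluates this as ~5 separate passes, each allocating a temporary:
    #   t1 = a * b;  t2 = abs(a);  t3 = abs(b);  t4 = t2 + t3;  out = t1 / t4
    out = a * b / (np.abs(a) + np.abs(b))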
You're right, apart from np.einsum() and numexpr, which is a separate package (albeit less drastic to adopt than Numba, because with numexpr you don't write your own loops).
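For example (a quick sketch, same made-up arrays as above), numexpr can fuse the whole expression into a single pass:

    import numexpr as ne
    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    # numexpr compiles the string into one fused kernel and streams through
    # the data once, without materializing the intermediate temporaries:
    out = ne.evaluate("a * b / (abs(a) + abs(b))")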
I meant "occasionally 2-3, like dot product" to include stuff like einsum. FWIW I found einsum was actually slower than tensordot last time I tried, so you may want to use that instead if this is still the case.
Yes, but it can be difficult to figure out the best way to compose those carefully optimized numpy operations for anything moderately complicated. It can also be difficult or impossible to avoid extra memory allocations in numpy. Sometimes it's just quicker to bang out the obvious loop-based implementation in numba or cython.
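E.g. the hybrid expression from upthread as the obvious loop. A rough sketch (names made up), but numba compiles it to a single fused pass with no intermediate arrays:

    import numpy as np
    from numba import njit

    @njit
    def hybrid(a, b):
        # One pass through the data, no temporaries:
        out = np.empty_like(a)
        for i in range(a.size):
            out[i] = a[i] * b[i] / (abs(a[i]) + abs(b[i]))
        return out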
Also, numba does in fact support targeting the GPU ~~but I think it requires a license for numbapro (i.e. not free, though last I used it they had free licenses for students)~~ (edit: it's free now, see below).
Numba is not made by Nvidia. It is [1] made by Anaconda (formerly Continuum Analytics), which was co-founded by Travis Oliphant, the primary author of NumPy.
They aren't drivers; it's just the ability to generate CUDA kernels in Numba. It has nothing to do with Nvidia supporting it; they were not involved, AFAIK.
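For anyone curious, a minimal sketch of what that looks like (it assumes a CUDA-capable GPU and toolkit; the names here are made up):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_kernel(x, y, out):
        i = cuda.grid(1)  # global thread index
        if i < x.size:
            out[i] = x[i] + y[i]

    x = np.arange(1024, dtype=np.float32)
    y = 2 * x
    out = np.zeros_like(x)
    threads_per_block = 128
    blocks = (x.size + threads_per_block - 1) // threads_per_block
    add_kernel[blocks, threads_per_block](x, y, out)  # host arrays are copied to/from the device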
Interesting, thanks :-) Either way, we're all set to accelerate Python code on the GPU! Personally, I intend to focus my efforts here rather than learning Julia.
The main use I've found for numba (I'm a theoretical physics/maths undergrad) is avoiding memory allocations in tight loops. There are some cases where I've found it hard to wrangle and arrange my numpy/scipy code such that all the vectorized operations happen in-place without extraneous allocations (in my case the difficulties have been with scipy's sparse arrays, although I can't remember the exact problems).
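The pure-numpy workaround is threading out= buffers through every ufunc call, which gets unwieldy fast. A rough sketch (arrays made up):

    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    buf = np.empty_like(a)  # allocated once, outside the loop

    for step in range(100):
        # Every intermediate has to be written in-place to avoid a fresh
        # allocation per iteration; in numba you'd just write the loop:
        np.multiply(a, b, out=buf)
        np.add(buf, a, out=buf)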
In particular, if you find you cannot use vectorized functions in numpy or scipy and absolutely MUST index, then typing the array in Cython is a lifesaver. Indexed operations on numpy arrays without Cython are very slow. (e.g. https://stackoverflow.com/q/22239199/300539)
Agree. It's a bit surprising to see that numpy indexed operations are even slower than those on the built-in list in this example. It seems the idiomatic numpy way to perform the iterations is through vectorization, but that often leads to code that is not straightforward to reason about. For this example, I'd prefer the simplicity of the Cython for-loop when it comes to optimization.
That's because with generic python code the values in a numpy array need to be boxed into python objects, leading to extra memory allocations (whereas they would already be boxed in the built-in list case).
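A sketch of the contrast, using numba's @njit as a stand-in for typed Cython (both remove the per-element boxing; function names made up):

    import numpy as np
    from numba import njit

    def total_boxed(a):
        # Each a[i] read boxes the value into a fresh Python float object.
        s = 0.0
        for i in range(a.size):
            s += a[i]
        return s

    @njit
    def total_typed(a):
        # Same loop, compiled against the typed array: reads stay raw doubles.
        s = 0.0
        for i in range(a.size):
            s += a[i]
        return s

    a = np.random.rand(1_000_000)
    total_typed(a)  # first call compiles; later calls run at machine speed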