IIRC, this is false regarding NumPy. The fundamental problem NumPy has is that it only vectorizes one operation per pass through the data (or occasionally 2-3, like dot product). It can't do arbitrary hybrid operations (like, idk, maybe a[i] * b[i] / (|a[i]| + |b[i]|)) in one pass through your data. This is an inherent limitation of NumPy, and one that Numba doesn't share. So you very much can expect speedups in some cases -- it just depends on what operations you are performing and how big your data is.
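To make that concrete (array names made up), here's roughly what NumPy does with that expression: every operator is a separate pass over the data with its own temporary array.

    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    # NumPy evaluates this as ~5 separate passes, each allocating a temporary:
    #   t1 = a * b;  t2 = abs(a);  t3 = abs(b);  t4 = t2 + t3;  out = t1 / t4
    out = a * b / (np.abs(a) + np.abs(b))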
You're right, apart from np.einsum() and numexpr, which is a separate package (albeit less drastic to adopt than Numba, because with numexpr you don't write your own loops).
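For example (a quick sketch, same made-up arrays as above), numexpr can fuse the whole expression into a single pass:

    import numexpr as ne
    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    # numexpr compiles the string into one fused kernel and streams through
    # the data once, without materializing the intermediate temporaries:
    out = ne.evaluate("a * b / (abs(a) + abs(b))")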
I meant "occasionally 2-3, like dot product" to include stuff like einsum. FWIW I found einsum was actually slower than tensordot last time I tried, so you may want to use that instead if this is still the case.
Yes, but it can be difficult to figure out the best way to compose those carefully optimized numpy operations for anything moderately complicated. It can also be difficult or impossible to avoid extra memory allocations in numpy. Sometimes it's just quicker to bang out the obvious loop-based implementation in numba or cython.
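E.g. the hybrid expression from upthread as the obvious loop. A rough sketch (names made up), but numba compiles it to a single fused pass with no intermediate arrays:

    import numpy as np
    from numba import njit

    @njit
    def hybrid(a, b):
        # One pass through the data, no temporaries:
        out = np.empty_like(a)
        for i in range(a.size):
            out[i] = a[i] * b[i] / (abs(a[i]) + abs(b[i]))
        return out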
Also, numba does in fact support targeting the GPU ~~but I think it requires a license for numbapro (i.e. not free, though last I used it they had free licenses for students)~~ (edit: it's free now, see below).
Numba is not made by Nvidia. It is [1] made by Anaconda (formerly Continuum Analytics), which was co-founded by Travis Oliphant, the primary author of NumPy.
They aren't drivers; it's just the ability to generate CUDA kernels in Numba. It has nothing to do with Nvidia supporting it; they were not involved, AFAIK.
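For anyone curious, a minimal sketch of what that looks like (it assumes a CUDA-capable GPU and toolkit; the names here are made up):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_kernel(x, y, out):
        i = cuda.grid(1)  # global thread index
        if i < x.size:
            out[i] = x[i] + y[i]

    x = np.arange(1024, dtype=np.float32)
    y = 2 * x
    out = np.zeros_like(x)
    threads_per_block = 128
    blocks = (x.size + threads_per_block - 1) // threads_per_block
    add_kernel[blocks, threads_per_block](x, y, out)  # host arrays are copied to/from the device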
Interesting, thanks :-) Either way, we're all set to accelerate Python code on the GPU! Personally, I intend to focus my efforts here rather than learning Julia.
The main use I've found for numba (I'm a theoretical physics/maths undergrad) is avoiding memory allocations in tight loops. There are some cases where I've found it hard to wrangle and arrange my numpy/scipy code such that all the vectorized operations happen in-place without extraneous allocations (in my case the difficulties have been with scipy's sparse arrays, although I can't remember the exact problems).
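The pure-numpy workaround is threading out= buffers through every ufunc call, which gets unwieldy fast. A rough sketch (arrays made up):

    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    buf = np.empty_like(a)  # allocated once, outside the loop

    for step in range(100):
        # Every intermediate has to be written in-place to avoid a fresh
        # allocation per iteration; in numba you'd just write the loop:
        np.multiply(a, b, out=buf)
        np.add(buf, a, out=buf)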
In particular, if you find you cannot use vectorized functions in numpy or scipy and absolutely MUST index, then typing the array in Cython is a lifesaver. Indexed operations on numpy arrays without Cython are very slow. (e.g. https://stackoverflow.com/q/22239199/300539)
Agree. It's a bit surprising to see that numpy indexed operations are even slower than those on the built-in list in this example. It seems the idiomatic numpy way to perform the iterations is through vectorization, but that often leads to code that is not straightforward to reason about. For this example, I'd prefer the simplicity of the Cython for-loop when it comes to optimization.
That's because with generic python code the values in a numpy array need to be boxed into python objects, leading to extra memory allocations (whereas they would already be boxed in the built-in list case).
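A sketch of the contrast, using numba's @njit as a stand-in for typed Cython (both remove the per-element boxing; function names made up):

    import numpy as np
    from numba import njit

    def total_boxed(a):
        # Each a[i] read boxes the value into a fresh Python float object.
        s = 0.0
        for i in range(a.size):
            s += a[i]
        return s

    @njit
    def total_typed(a):
        # Same loop, compiled against the typed array: reads stay raw doubles.
        s = 0.0
        for i in range(a.size):
            s += a[i]
        return s

    a = np.random.rand(1_000_000)
    total_typed(a)  # first call compiles; later calls run at machine speed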