
They usually cannot improve carefully optimized numpy operations, nor GPU operations. The use cases are thus limited.



IIRC, this is false regarding NumPy. The fundamental problem NumPy has is that it only vectorizes one operation per pass through the data (or occasionally 2-3, like dot product). It can't do arbitrary compound operations (like, say, a[i] * b[i] / (|a[i]| + |b[i]|)) in one pass through your data. This is an inherent limitation of NumPy, and one that Numba doesn't share. So you very much can expect speedups in some cases -- it just depends on what operations you are performing and how big your data is.
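A minimal sketch of the difference (assuming numba is installed; fused() is my own name for the example):

    import numpy as np
    from numba import njit

    @njit
    def fused(a, b):
        # One pass over the data: the whole expression compiles into a
        # single loop with no intermediate arrays.
        out = np.empty_like(a)
        for i in range(a.shape[0]):
            out[i] = a[i] * b[i] / (abs(a[i]) + abs(b[i]))
        return out

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    # The NumPy equivalent makes several passes and allocates
    # temporaries for a*b, |a|, |b|, and their sum:
    assert np.allclose(fused(a, b), a * b / (np.abs(a) + np.abs(b)))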


You're right, apart from np.einsum() and numexpr, which is a separate package (albeit a less drastic change than Numba, because with numexpr you don't write your own loops).
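For reference, a numexpr version of the earlier example might look like this (a sketch, assuming numexpr is installed):

    import numpy as np
    import numexpr as ne

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    # numexpr compiles the whole expression and evaluates it in one
    # (multithreaded) pass, without an explicit Python loop:
    result = ne.evaluate("a * b / (abs(a) + abs(b))")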


I meant "occasionally 2-3, like dot product" to include stuff like einsum. FWIW, I found einsum was actually slower than tensordot last time I tried, so if that's still the case you may want to use tensordot instead.
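A sketch of the comparison (both compute the same contraction; whether the speed gap still exists is worth re-measuring, and newer numpy also offers einsum's optimize flag):

    import numpy as np

    A = np.random.rand(200, 300)
    B = np.random.rand(300, 400)

    C1 = np.einsum('ij,jk->ik', A, B)         # einsum form
    C2 = np.tensordot(A, B, axes=([1], [0]))  # tensordot form
    assert np.allclose(C1, C2)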


Yes, but it can be difficult to figure out how to express anything moderately complicated as said carefully optimized numpy operations. It can also be difficult or impossible to avoid extra memory allocations in numpy. Sometimes it's just quicker to bang out the obvious loop-based implementation in numba or cython.

Also, numba does in fact support targeting the GPU, but I think it requires a license for numbapro (i.e. not free, though last I used it they had free licenses for students). (edit: it's free now, see below).


Also, numba does in fact support targeting the GPU

Numba CUDA is free with Anaconda


Thanks, I remember it used to cost money but I couldn't remember if it still did.


I guess NVidia figure they can make more money selling cards than charging for the library!


Numba is not made by NVidia. It is [1] made by Anaconda (formerly Continuum Analytics), which was co-founded by Travis Oliphant, the primary author of NumPy.

[1] (primarily - it's open source)


Right, but the NVidia drivers are free for it now, whereas they didn't use to be, according to the OP.


They aren't drivers; it's just the ability to generate CUDA kernels in Numba. It has nothing to do with Nvidia supporting it; they were not involved, AFAIK.


Interesting, thanks :-) Either way, we’re all set to accelerate Python code on the GPU! Personally I intend to focus my efforts here rather than learning Julia.


In order to achieve good performance, all numpy code should be vectorized.

Yeah, but sometimes it's pretty hard to avoid simple loops in numpy code. This is where these tools can drastically speed up your code.
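For instance, a loop-carried recurrence like an exponentially weighted moving average (my example, not from the thread) has no simple vectorized form in plain numpy, but JIT-compiles cleanly:

    import numpy as np
    from numba import njit  # assumes numba is installed

    @njit
    def ewma(x, alpha):
        # Each output depends on the previous one, so there is no
        # single-pass vectorized numpy equivalent.
        out = np.empty_like(x)
        out[0] = x[0]
        for i in range(1, x.shape[0]):
            out[i] = alpha * x[i] + (1.0 - alpha) * out[i - 1]
        return out

    y = ewma(np.random.rand(1_000_000), 0.1)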


The main use I've found for numba (I'm a theoretical physics/maths undergrad), is avoiding memory allocations in tight loops. There are some cases where I've found it hard to wrangle and arrange my numpy/scipy code such that all the vectorized operations happen in-place without extraneous allocations (in my case the difficulties have been with scipy's sparse arrays, although I can't remember the exact problems).
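A small sketch of the wrangling involved (buffer names are mine): even with numpy's out= parameters, it's easy for a stray temporary to sneak back in.

    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)
    tmp = np.empty_like(a)
    out = np.empty_like(a)

    np.multiply(a, b, out=out)  # in-place, no new allocation
    np.abs(a, out=tmp)          # in-place, no new allocation
    out /= tmp + np.abs(b)      # oops: the right-hand side still
                                # allocates two temporaries

A numba loop sidesteps this entirely, since the compiled loop has no temporaries at all.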


You can use Numpy in Cython.


Second this. Cython's memory views for numpy arrays are great.


In particular, if you find you cannot use vectorized functions in numpy or scipy and absolutely MUST index, then typing the array in Cython is a life saver. Indexed operations on numpy arrays without Cython are very slow. (e.g. https://stackoverflow.com/q/22239199/300539)
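A sketch of what that looks like in a .pyx file (function and variable names are mine):

    # cython: boundscheck=False, wraparound=False
    import numpy as np

    def row_sums(double[:, :] a):
        # Typing `a` as a memoryview turns each a[i, j] access into a
        # raw memory read instead of a boxed Python-object lookup.
        cdef Py_ssize_t i, j
        out_np = np.zeros(a.shape[0])
        cdef double[:] out = out_np
        for i in range(a.shape[0]):
            for j in range(a.shape[1]):
                out[i] += a[i, j]
        return out_np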


Agreed. It's a bit surprising that numpy indexed operations are even slower than the built-in list in this example. The idiomatic numpy way to perform these iterations is through vectorization, but that often leads to code that is not straightforward to reason about. For this example, I'd prefer the simplicity of the Cython for-loop when it comes to optimization.


That's because with generic Python code the values in a numpy array need to be boxed into Python objects, leading to extra memory allocations (whereas they would already be boxed in the built-in list case).
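A quick way to see the boxing (types, not timings; a sketch, details may vary by version):

    import numpy as np

    a = np.arange(3)
    lst = [0, 1, 2]

    x = a[0]        # allocates a fresh numpy scalar object (boxing)
    y = lst[0]      # just returns a reference to an existing int
    print(type(x))  # <class 'numpy.int64'>
    print(type(y))  # <class 'int'>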





