Faster Python calculations with Numba (pythonspeed.com)
148 points by rbanffy on Feb 19, 2022 | 66 comments



Numba fits very few use cases, but where it does fit it's awesome.

I've been using it in a python graph library to write graph traversal routines and it's done me very well: https://github.com/VHRanger/nodevectors

The best part is the native openMP support on for loops IMO. Makes parallelism in data work very efficient compared to python alternatives that use processes (instead of threads)
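A minimal sketch of what a threaded for loop looks like in Numba, assuming Numba is installed. The function name `row_sums` is made up for illustration, and the `try`/`except` fallback exists only so the snippet still runs (serially and slowly) without Numba.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback (assumption for illustration): plain Python, no parallelism.
    prange = range

    def njit(*args, **kwargs):
        # Support both @njit and @njit(parallel=True).
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda func: func


@njit(parallel=True)
def row_sums(matrix):
    out = np.zeros(matrix.shape[0])
    # prange marks the loop as safe to split across threads; each
    # iteration writes only to its own slot in `out`.
    for i in prange(matrix.shape[0]):
        total = 0.0
        for j in range(matrix.shape[1]):
            total += matrix[i, j]
        out[i] = total
    return out


data = np.arange(12, dtype=np.float64).reshape(3, 4)
result = row_sums(data)
```

The threads share the same array in memory, which is the contrast with process-based approaches that have to copy or pickle data between workers.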


Why do you say it fits very few use cases?


Numba can compile in “nopython” mode only with a subset of Python. E.g. support for classes is limited and still experimental. Also, I think string manipulation is slow, but the docs have details.

If you want to specify the types (for example for AOT, or just because you want to make them clear) then the call signature is less flexible.

In short, pick any random Python library and you’ll find there are very few places where you can JIT-accelerate something effectively. It is for numeric code.

Even for numerical code, it is more like writing C functions than, say, C++ (with classes etc).

But it does make accelerating vectorized code very easy. Even if you have a function that uses NumPy, it is likely you can speed it up using Numba with a decorator only.

But when it doesn’t work, it is often not very clear why until you get some experience.


The ahead of time compilation output is... Well.. let's say difficult to package _properly_ (compare it to Cython where it's well supported and documented). That makes it useless for production, unless you want to ship giant containers with compilers etc


In theory, a compiler toolchain is not required since Numba already comes with LLVM, i.e. for JIT compilation, no additional compiler is necessary.

In the past, that was also possible for AOT compilation [1], but that technique broke during some update and it seems like there is no one left who knows how to fix this.

[1] https://stackoverflow.com/a/42198101


Jax also has experimental support for persisting its JIT cache on the filesystem in the ‘jax.experimental.compilation_cache’ module.


How does jax compare to numba?


numba is more general. Any change to the shapes of arrays triggers a JIT recompilation in jax, numba is a bit more forgiving. jax has autodiff that numba doesn't. Also, JAX supports TPUs, which numba doesn't support (yet).


What??? Numba has more usage in the AI/ML community than Cython has ever had by anyone, ever.

"Fits very few use cases" LOL okay without numba there's no UMAP and HDBScan and those are pretty popular and important libraries that come to mind just off the top of my head...

Also, claiming Cython is well documented also gets a huge LOL from me as someone who's actually written a bit of Cython.


I have written quite a bit of cython code as well and at least the last time I looked cython was much better documented than numba (it has been a couple of years though so things might have improved on the numba side), and I would agree with the previous poster that it is generally quite well documented.


Also I am specifically referring to the documentation on creating ahead of time compiled packages using Numba vs Cython, but perhaps that was unclear.


FWIW Numba's JIT caches the compiled function as long as you don't call it again with different type signatures (eg. int32[] vs int64[])

I've successfully deployed numba code in an AWS lambda for instance -- llvmlite takes a lot of your 250mb package budget, but once the lambda is "warm" the jit lag isn't an issue.

That said, if you absolutely want AOT you'll have to use Cython or some horrible hack dumping the compiled function binary.


Exactly!


You realize that Scikit-learn is written mostly in Cython (where high performance is needed)? It is a part of the most influential ML library in existence.


Also pandas


You're aware pandas and most of scipy is cython, right?

I like numba, but cython is clearly used more in the popular packages


It’s not really gonna be used in your database rest api is it.


I assume the parent comment was talking about the context of computations where numba is supposed to be a drop-in for wherever numpy is used.

And I agree that it's not actually usable everywhere, since the support for numpy's feature set is actually quite limited, especially around multidimensional arrays. I had to effectively rewrite my logic to make use of numba. Still, it is pretty worth it imo, given how it can add parallelism for free. And conforming to numba's allowed subset of numpy usually results in simpler and more efficient code. In my case I had to work around the lack of support for multidimensional arrays, but ended up with a more efficient solution that relies on low-dimensional arrays being broadcast, which removed a lot of duplicate computations.


I've had success with numba speeding up code that worked on apache arrow returned by duckdb, which might just go into a rest api


In my experience, numba's killer feature is fast looping over numpy arrays. It does this extremely well. I work in an applied physics research lab, where we eschew matlab wherever possible. So we rely a lot on python, numpy and scipy. Even if we wanted to move to something like julia, it wouldn’t be practical. Every time we buy a new piece of equipment that has a python api and outputs a lot of data (we do a lot of work with picosecond resolution event timing), it’s going to use numpy.

Though if any numba developers come across this, I’d advise them to plan their upgrade process a little more carefully. The current numba conversion for python lists is due to be deprecated in favor of typed lists. But the typed list (List()) method is still too buggy. I experienced huge delays when popping and appending elements. Please flesh out these new methods before adding a bunch of warning messages about deprecation to the current builds.


Have you tried JAX? How does it compare?


JAX recompiles functions every time you call them with an array of a new shape, i.e. if called with an array of shape (7,10) and then one with shape (17, 12) the function is compiled twice. This is generally fine for deep learning and some other numerical applications where you do the same computations again and again over arrays with the same shape. But for data exploration like Pandas, in my experience, your data shapes are different with each call, so the repeated recompilations make it unattractive.

In Numba it only recompiles for dimension changes, i.e. if the shape changes from (7, 10) to (2, 10, 12). However, numba does not integrate AD, so if you need that, JAX is probably your best bet unless Enzyme matures and you want to look into integrating it with numba.


JAX is a numerics library combined with autograd for machine learning. you string together operations in python in a functional style, which is then passed into XLA which is google's optimizing compiler that can target cpus, gpus and tpus to generate optimized machine code for those architectures.

numba lets you inline with a subset of python which is then compiled with llvm producing something very similar to what you would get if you applied a bunch of regexes to that subset of python to convert it from python to C. (with special bindings for numpy arrays, since they have special importance in these domains)

numba is specifically targeted at things like core numerical algorithms that are typically coded in C and fortran, and are typically comprised of solely for loops and basic arithmetic. JAX is more targeted at high level machine learning applications where the end user is stringing together more high level numerical algorithms.

i suspect that JAX would be a bad fit for custom computer vision or numerical algorithms that are used outside of the use-case of doing neural networks work.


JAX is actually lower level than deep learning (despite including some specialized constructs) which makes it an almost drop-in replacement for numpy that has the ability to jit your python code.

I am currently doing some tests introducing JAX in a large numerical code base (that was previously using C++ extensions); we are not using autograd nor any deep learning specific functionalities. Having seen actual numbers, I can tell you that JAX on CPU is competitive with C++ but produces more readable code, with the added benefit of also running on GPU. However, it does introduce some constraints (array sizes cannot be too dynamic) so, if you are not also planning on targeting GPU, I would probably focus on numba.


i actually poked around a bit as a contributor a few years ago (before i had to start a real job) and remember it being a thin layer on top of XLA among a few other things. interesting to learn that it is growing into something that people are using as a fully fledged numerical computing library.

also a little bit surprising to see how immature and fragmented the python gpu numerical computing ecosystem is. everybody bags on matlab, but it has been automatically shipping relevant operations over to available gpus for years.


There are also tricks to get around the array shape dynamics. Like padding up your shapes to some common format. Everything between 6 and 10 becomes 10, everything 11-20 pads to 20, etc.
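A rough sketch of the padding trick, using NumPy only so it runs anywhere. The bucket sizes and the helper name `pad_to_bucket` are assumptions for illustration, not part of JAX. The idea is that a shape-specialized compiled function then sees only a handful of distinct shapes.

```python
import numpy as np

# Hypothetical bucket sizes; pick whatever matches your data.
BUCKETS = (5, 10, 20, 40)


def pad_to_bucket(values, buckets=BUCKETS):
    """Zero-pad a 1-D array up to the smallest bucket that fits it."""
    n = values.shape[0]
    for size in buckets:
        if n <= size:
            padded = np.zeros(size, dtype=values.dtype)
            padded[:n] = values
            # Return the true length too, so padding can be masked out later.
            return padded, n
    raise ValueError(f"length {n} exceeds largest bucket {buckets[-1]}")


padded, true_len = pad_to_bucket(np.arange(7))
```

The padding value has to be neutral for whatever is computed afterwards (zeros for a sum, for example), or the result has to be masked using the returned true length.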

Jax is a great general purpose numerical computing library.


Jax offers much lower-level control, it's almost bare metal, and can be used for all sorts of things besides Deep Learning. I am currently using it to implement a better `scipy.optimize` library.


I would love to learn more about that library!


The article lists alternatives but doesn’t mention Cython. That’s a miss. I’ve done both. If numba works, it’s great. I found it a bear to install. Cython is more work, did what I needed - allowed me to incrementally type Python to get C code performance for general numeric code where memory allocations mattered (large arrays).


It's easy to miss, but the second bullet point about "coding in a low-level language directly" links to another of the author's articles entitled Cython, Rust, and more: choosing a language for Python extensions [1] which discusses the pros and cons of using Cython.

I've done a pretty extensive amount of development in Cython. I agree, it's very nice to be able to incrementally type Python code but I've found that using the language to its full potential requires a good understanding of C programming, especially for debugging.

[1] https://pythonspeed.com/articles/rust-cython-python-extensio...


Fwiw: Cython does NOT offer many of the killer features that Numba includes, such as parallel for loops, automatic vectorization, automatic GPU support and GPU parallelization, and taking advantage of MKL and other Intel goodies. To do that with Cython will be much much harder than in numba and at a significantly lower level of abstraction.

I don't think that they even compete or solve the same use case tbh. Not sure why they're being constantly compared to each other.


That's not quite true: cython supports parallel for loops either by directly calling openmp or via cython.parallel.prange etc. (which wraps openmp AFAIK). You can also get simd speed ups via directly calling simd intrinsics or compiling with simd support and relying on the compiler (which admittedly is not as sophisticated as numba's simd support). For GPU support I would likely go and use cupy or arrayfire.


I wish cython had a better debugging story. If it were more ergonomic, it could have been an amazing language in its own right.


Cython does have a plugin for gdb https://cython.readthedocs.io/en/latest/src/userguide/debugg... however I have not used it. Generally I don't like writing cython code directly however and rather start with working python code and add types etc., cython -a really helps in those situations.


We’re heavy numba users at my company, and I’ve made a small contribution to the library. It’s a phenomenal library for developing novel computationally intensive algorithms on numpy arrays. In presentations, I’ve heard Leland McInnes credits numba often when he speaks of his development of UMAP.

We build a very computationally intensive portion of our application with it and it has been running in production, stable, for several years now.

It’s not suitable for all use cases. But I highly highly recommend it if you need to do somewhat complex calculations iterating over numpy arrays for which standard numpy or scipy functions don’t exist. Even then, often we were surprised that we could speed up some of those calculations by placing them inside numba.

Edit: ex of a very small function I wrote with numba that speeds up an existing numpy function (note - written years ago and numba has undergone quite some amount of changes since!): https://github.com/grej/pure_numba_alias_sampling


What's with people going Python is slow and then we have you telling us you've written a computationally intensive portion of your application in python? Can you eli5? I'm very confused.


I’ve got a concrete example because I just recently used numba to speed up some code. I’m working on a stereo camera object triangulation and tracking system. We wanted to make triangulation faster so I profiled the code and found a few hot spots. In some cases it was a matter of vectorizing some computations, or avoiding tearing apart numpy arrays and allocating new ones, but aside from those kinds of things, the system was spending most of its time in two functions: calculate Euclidean distance between two points and calculate distance of a point from a line (represented by two points). There wasn’t much refactoring possible because they were mostly pure numpy. I annotated them with numba’s JIT decorator and after the initial penalty of the JIT compiler, the functions are way faster.

Because my code is parallelized, I had to figure out how to trigger the compiler before forking, because otherwise each worker process had to do the compilation itself. There’s some decorator syntax to tell numba to precompile for specific numeric types, but I found it easier to just call the functions when you import them.
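A minimal sketch of the "call it at import time" warm-up, assuming Numba is installed. The function body is a guess at what a Euclidean distance helper might look like, not the poster's actual code, and the fallback decorator is there only so the snippet runs without Numba.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption for illustration): leave the function uncompiled.
    def njit(func):
        return func


@njit
def euclidean_distance(p, q):
    total = 0.0
    for i in range(p.shape[0]):
        diff = p[i] - q[i]
        total += diff * diff
    return np.sqrt(total)


# Warm-up call at module import: the JIT compiles here, in the parent
# process, so workers forked afterwards inherit the compiled function.
_warmup = np.zeros(3)
euclidean_distance(_warmup, _warmup)
```

The warm-up only compiles for the argument types it is called with (here 1-D float64 arrays); calling with other types later would trigger another compilation.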

All told, it’s way easier than it would have been to implement this hotspot in C++ or something.


Sure :)

The ELI5 is that numba allows you to designate specific functions in your application for compilation. That tells the library to run a just in time compiler on it. If you can do that successfully, you can wind up with that portion of your code base running at or near C-speed. There are a lot of gotchas and limitations, and it only works for a subset of the python language. But when it works, it can make loops over numpy arrays very fast, when they would normally be extremely slow in pure python. In a lot of cases, just a few compute bound processes are holding back the whole application. Numba allows you to improve those portions in a very targeted way, if you have a suitable use case.


Others have good explanations, but let me put it super simply.

Suppose we have two very large vectors and we want to take the dot product of them. If we store these vectors in a standard Python list, write pure Python, and try to run it in CPython, it is very, very slow, at least relative to how fast almost any other non-dynamic scripting language could do it. (I have to add all those caveats because this is easy mode for a JIT and PyPy might blast through this. But CPython will be very slow.)

Suppose instead we have a data type that is a C array under the hood, and all we do is this:

    array1 = load_from_file("something")
    array2 = load_from_file("other_file")
    dot_product = array1.dot_product(array2)

Suppose "load_from_file" loads into an optimized data structure that provides Python access, but under the hood is stored efficiently, and the "dot_product" method leaves Python and directly performs the dot product in some more efficient language.

In this situation, you're running a bare handful of slow Python opcodes, a few dozen at the most, and doing vast quantities of numerical computation at whatever the speed of the underlying language is, which in this case can easily use SIMD or whatever other acceleration technologies are available, even though Python has no idea what those are.

That is the core technique used to make Python go fast. There are several ways that things get from Python to that fast code and underlying structure. The simplest and the oldest is to simply do what I laid out above directly, with external, non-Python implementations of all the really fast stuff. I believe this is the core of NumPy, though I'm not sure, but it's certainly been the core of a lot of other things in Python for decades.
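As a concrete version of the dot product example, here is a small sketch that assumes NumPy plays the role of the C-array-backed data type. The helper name `dot_pure_python` and the sample data are made up for illustration.

```python
import numpy as np


def dot_pure_python(xs, ys):
    # Each multiply-add here costs several interpreter opcodes.
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


list1 = [float(i) for i in range(1000)]
list2 = [2.0] * 1000

# Same data, stored as contiguous C arrays.
array1 = np.array(list1)
array2 = np.array(list2)

slow = dot_pure_python(list1, list2)
# One Python-level call; the loop itself runs in compiled code.
fast = float(array1.dot(array2))
```

Both compute the same number; the difference is how many Python opcodes run per element.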

There are also approaches involving writing in a Python-like language, but one that can be compiled down (Cython), approaches around trying to compile Python directly with a JIT into something that under the hood is no longer using CPython's interpretation structure (PyPy), and other interesting hybrid approaches.

CPython is generally a slow language, but it isn't always a big deal in practice because if you write Python code where a reasonable amount of most lines you write go out to faster code, the slowness of Python itself is no big deal. For example, you can write decent, performant image manipulation code in Python as long as you're stringing together image manipulation commands to some underlying library, because in that case, for each "expensive" Python opcode, you're doing a lot of "real work" at max speed. I scare-quote "expensive" because in that case, the Python opcodes may be a vanishing percentage of the cost of the work. The problem with numerical manipulation in pure CPython is that it swings the other way; each Python opcode requires many assembly-level instructions to implement, but it's inefficient to run all those instructions just to add two integers together, a single assembly opcode.

Arguably the dominant factor in sitting down with any programming language and writing reasonably performant code on your first pass is understanding the relative costs of things and just being careful not to write code that unnecessarily amplifies the costs of doing the "real work" with too much bookkeeping. The same 1000 cycles' worth of "bookkeeping" can be anything from a total irrelevancy if your "payload" for that bookkeeping is many millions of cycles' worth of work, to a crippling amplified slowdown if your payload is only one cycle's worth of work. Even a "fast" language like C++ can be brought to a crawl if you manage to use enough abstractions and bounce through enough non-inlinable method calls and through the requisite creation and teardown of function frames, etc. just to add two numbers together, and then do it all again a few million times.


I use numba quite a bit at work and it's fantastic. I recently, however, did a comparison between numba, cython, pythran and rust (ndarray) for a toy problem, and it yielded some interesting results:

https://github.com/synapticarbors/ndarray_comparison/blob/ma...

Most surprising among them was how fast pythran was with little more effort than is required of numba (still required an aot compilation step with a setup.py, but minimal changes in the code). All of the usual caveats should be applied to a simple benchmark like this.


I didn't see your comment until I wrote mine, yes I did the same comparison and pythran really is amazing. One of the great things is that it also runs as reasonably fast regular python code (as you can just leave most numpy functions in place), which makes debugging and prototyping so much easier.


Quite a few years ago I worked on a particle in cell simulation software during my bachelor thesis[0]. We used numba to accelerate the code and most importantly write GPU kernels for the heavy parts. I remember spending hours optimising my code to eke out the most performance possible (which eventually meant using atomics and manually unrolling many loops because somehow this was giving us the best performance) but honestly I was really happy that I didn't need to write cuda kernels in C and generally it was pretty easy to work with. I remember back then the documentation was sometimes a little rough around the edges but the numba team was incredibly helpful and responsive. Overall I had a great time.

[0] https://github.com/fbpic/fbpic


Somehow I always prefer cffi over numba and cython. When I write a performance-critical piece of code in C/C++, interfacing it with cffi I know it'll be very fast, while with numba/cython I always feel there is so much uncertainty and it won't be close to pure C anyway.


You can invert the solution and inline Python code into your Rust as well.

https://docs.rs/inline-python/latest/inline_python/

    use inline_python::python;

    let who = "world";
    let n = 5;
    python! {
        for i in range('n):
            print(i, "Hello", 'who)
        print("Goodbye")
    }

The nice thing about this is that you get the power of Python to handle your data loading, scrubbing, validation, etc. and then spend the bulk of your high performance work in Rust.

Layout is king, so structuring your code to have compact, native layout will not only make it much easier to call compiled code, it will vastly reduce your memory requirements.

    import cffi
    ffi = cffi.FFI()
    chunk = ffi.new(f'int[{5_000_000}]')

chunk can now be used as a fixed size array. And chunk can be passed by reference in native code (Zig, Rust, Fortran, etc).


I never understood the use case for embedding python in rust. In my experience the actual high performance code is a small portion of the overall code base, and writing all the glue code, file processing and plotting is typically much faster in python due to the excellent ecosystem. I further forgo the ability to use the python repl if I do it this way around.

Don't get me wrong, I like rust, but I would always write the high performance code as a module that can be imported in python, rather than write everything in rust and then use python for some specific functions only.


I used to think that too, and both techniques have merit. The extending vs embedding debate has gone on since Python was created, but much of that has been shaped by the difficulty in actually embedding Python, so I found myself convincing myself that embedding is the wrong choice.

Contrast that with Lua where embedding and extending are about the same difficulty, there one is more likely to pick the technique that best fits the situation.

One thing that is really nice about calling into Python when you need it but otherwise using your calling language and runtime is you get all the semantics of your calling language to structure the codebase and use Python as a library where needed. This is esp nice when you have security or concurrency properties that you need to hold, or if you were bootstrapping a project with Python and all of its wonderful libraries but eventually will replace or reduce the script code.

One semantic nit: you probably do understand it, you just don't like it. If I have a big ETL that has lots of parts that need to run in Rust, but the cleanup code would be much easier to write in Python, I don't need to turn my ETL into a library.


I've played with numba a bit, and my feeling is that if you're going to accept the disadvantages of just in time compilation, why not just write in Julia (where the experience is a lot more seamless)?

Julia was also faster in my experiments.


I know this might be unpopular on HN, but I find the "use Julia" comments that appear every time scientific code is discussed are getting a bit old.

I have tried Julia but I found the promised "everything will be as easy to write as python but as fast as C" is often not true. There are quite a few pitfalls (e.g. don't use abstract data types), which one has to consider, and code can start looking more like cython than python if one wants to optimise for speed. I also dislike some of the design decisions they made (in particular making multiply a matrix operation and using the '.' for broadcasting, which is incredibly easy to overlook and one of the main sources of bugs in matlab code in my experience).

That said there are great things happening in the Julia ecosystem, just don't promote it as the be all of scientific computing all the time.


Oh, if you check my comment history, you'll see me complaining a bit about Julia. I'm not saying it's perfect. But numba is so clunky and fickle that I think, just in terms of comparing the two, Julia is pretty clearly superior if you don't have a strong reason to use Python.


The rest of the lab/company uses python and collaborating with Julia code is difficult.


I think typically people don't start out thinking they'll use Numba, they start out using Python (for a multitude of valid reasons). Then eventually they want to optimize a single bottleneck at which point Numba makes a lot more sense than rewriting everything in Julia/Rust/whatever or calling some language from Python for which no convenient mechanism exists.

And when thinking about something more than optimizing a single function, but about a large codebase and the context in which you need it, Julia is rarely a compelling option.


Julia looks promising, but it has a long way to go to catch up with Python's ecosystem. Python's libraries, communities, knowledge base, and developer availability are almost unmatched (maybe a few languages have it beat). Especially in the scientific, ML, and data science fields.


Because Julia doesn't have anything close to the ecosystem of libraries around it for ML/AI, particularly for subdomains like NLP, among other reasons...


There's a package for using R libraries (https://juliainterop.github.io/RCall.jl/stable/) and one for Python libraries (https://github.com/JuliaPy/PyCall.jl). It's pretty seamless.


It has excellent packages for nonlinear programming though. Both JuMP and JuliaSmoothOptimizers are an excellent ecosystem for optimization. I find it odd though that nonlinear programming is a subset of AI and not the reverse.


I like numba, but I found that I often had to rewrite my code (like write out vectorized numpy code). Recently I have mainly been using pythran which IMO is very underrated. It manages to even beat optimized cython code with just a comment in front of the function definition.

I did a comparison to cython and Julia here: https://jochenschroeder.com/blog/articles/DSP_with_Python2/


That is a really good comparison, thank you for this article! I have somehow never heard about pythran before. Interesting to see the comparison to Julia as well.

Did you have a look at JAX? It would be interesting to see how well it would compare.


Pythran sounds awesome. I've somehow always glossed over it before. That might be due to the name, which made me think it was some Fortran-interop library (and those tend to be old and not what I need)...


If you write quite a bit of numerical code you can often achieve impressive speed ups using pythran with very few changes; it's like numexpr + cython on steroids. I highly recommend checking it out. The community is quite small, so there are still some bugs, but Serge, the main author, is really responsive.


Hmm, I suspect the julia code in that post is not optimized. Perhaps try asking on https://discourse.julialang.org/


I would not be surprised. I did post it to HN at the time, and received some feedback on how I should better optimise (which I mention at the bottom of the post).

The exercise for me was to see how fast I could get the julia code with my background. I am quite honest about the fact that I have less experience in julia. However, for me the whole reason to look for something other than cython is that it takes so much background knowledge to get the ultimate speed; if I have to do the same for julia I don't really gain anything. I would argue the number of optimisations in the julia code is already more than what I had to do in pythran.


I hit a need for numba for the first time the other day. It installed perfectly with pip install, and took about an hour of fiddling to speed up some numpy stuff by a factor of about 30. The numpy stuff was already vectorized but still slow (i think) because it required intermediate allocations.
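A hedged illustration of the intermediate-allocation point, not the poster's actual calculation: the vectorized expression below can create several full-size temporary arrays, while the explicit loop keeps only a scalar accumulator. Function names are made up, and the fallback decorator is there only so the snippet runs without Numba.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption for illustration): leave the function uncompiled.
    def njit(func):
        return func


def norm_sum_vectorized(x, y):
    # x**2, y**2, their sum and the sqrt may each need a temporary array.
    return np.sqrt(x**2 + y**2).sum()


@njit
def norm_sum_loop(x, y):
    # Single pass over the data, no temporary arrays.
    total = 0.0
    for i in range(x.shape[0]):
        total += np.sqrt(x[i] * x[i] + y[i] * y[i])
    return total


x = np.full(1000, 3.0)
y = np.full(1000, 4.0)
```

Whether the loop version actually wins depends on array size and memory bandwidth, so it is worth timing both on real data.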

Numba just worked, with a very straightforward implementation of my calculation as well.

Really cool piece of kit.


That particular for-loop in a dynamically typed language is pretty much a worst-case scenario for the interpreter.

I think one should just vectorize the solution from the get-go! np.maximum.accumulate looks like the right way to go about things in this example.

Btw the J implementation would be >. /\ and the k implementation would be |\ which is I think much better than the for loop anyway.
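A small sketch of the vectorized version, assuming the loop in question computes a running maximum (which is what the J and k spellings above denote). In NumPy that scan is spelled `np.maximum.accumulate`; the loop version and sample values are made up for comparison.

```python
import numpy as np


def running_max_loop(a):
    # Explicit loop: the kind of code the interpreter handles badly.
    out = np.empty_like(a)
    current = a[0]
    for i in range(len(a)):
        if a[i] > current:
            current = a[i]
        out[i] = current
    return out


values = np.array([1, 2, 1, 3, 3, 5, 4, 6])
looped = running_max_loop(values)
# The same scan as one call into compiled code.
vectorized = np.maximum.accumulate(values)
```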


I tried using numba to speed up an AI I wrote for a board game. Any function that involved numpy arrays and for loops got optimized beautifully. There were other functions that took in python enums. They were not faster than normal functions, even when numba successfully compiled them in nopython mode.


For reference, here is the repo: https://github.com/numba/numba


Now do it in Rust!



