
We’re heavy numba users at my company, and I’ve made a small contribution to the library. It’s a phenomenal library for developing novel, computationally intensive algorithms on numpy arrays. In presentations, I’ve heard Leland McInnes credit numba often when he speaks about his development of UMAP.

We built a very computationally intensive portion of our application with it, and it has been running stably in production for several years now.

It’s not suitable for all use cases, but I highly recommend it if you need to do somewhat complex calculations iterating over numpy arrays for which standard numpy or scipy functions don’t exist. Even where standard functions did exist, we were often surprised to find we could speed some of those calculations up by moving them inside numba.

Edit: an example of a very small function I wrote with numba that speeds up an existing numpy function (note: written years ago, and numba has undergone quite a lot of change since!): https://github.com/grej/pure_numba_alias_sampling




What's with people saying Python is slow, and then here you are telling us you've written a computationally intensive portion of your application in Python? Can you ELI5? I'm very confused.


I’ve got a concrete example, because I just recently used numba to speed up some code. I’m working on a stereo-camera object triangulation and tracking system. We wanted to make triangulation faster, so I profiled the code and found a few hot spots. In some cases it was a matter of vectorizing some computations, or avoiding tearing apart numpy arrays and allocating new ones, but aside from those kinds of things, the system was spending most of its time in two functions: calculating the Euclidean distance between two points, and calculating the distance of a point from a line (represented by two points). There wasn’t much refactoring possible, because they were mostly pure numpy already. I annotated them with numba’s JIT decorator, and after the initial penalty of the JIT compiler, the functions are way faster.
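To give a flavor of what that looks like, here is a sketch of two functions of that shape (the names and the 3D assumption are mine for illustration, not the actual production code):

    import numpy as np
    from numba import njit

    @njit(cache=True)
    def euclidean_distance(p, q):
        # a plain Python loop; numba compiles it to tight machine code
        total = 0.0
        for i in range(p.shape[0]):
            d = p[i] - q[i]
            total += d * d
        return np.sqrt(total)

    @njit(cache=True)
    def point_line_distance(p, a, b):
        # distance from point p to the line through a and b (3D),
        # via |AP x AB| / |AB|; the cross product is written out
        # element by element to stay in numba-friendly territory
        ap = p - a
        ab = b - a
        cx = ap[1] * ab[2] - ap[2] * ab[1]
        cy = ap[2] * ab[0] - ap[0] * ab[2]
        cz = ap[0] * ab[1] - ap[1] * ab[0]
        cross_norm = np.sqrt(cx * cx + cy * cy + cz * cz)
        ab_norm = np.sqrt(ab[0] ** 2 + ab[1] ** 2 + ab[2] ** 2)
        return cross_norm / ab_norm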

Because my code is parallelized, I had to figure out how to trigger the compiler before forking; otherwise each worker process had to do the compilation itself. There’s decorator syntax to tell numba to precompile for specific numeric types, but I found it easier to just call the functions once when you import them.
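Both options in sketch form (the function is a stand-in, but the signature string is real numba syntax for eager compilation):

    import numpy as np
    from numba import njit

    # Option 1: an explicit signature makes numba compile eagerly when
    # the module is imported, instead of lazily on the first call
    @njit("float64(float64[:], float64[:])")
    def euclidean_distance(p, q):
        total = 0.0
        for i in range(p.shape[0]):
            d = p[i] - q[i]
            total += d * d
        return np.sqrt(total)

    # Option 2: with a plain @njit (no signature), a throwaway call at
    # import time forces compilation before worker processes fork
    _w = np.zeros(3)
    euclidean_distance(_w, _w)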

All told, it’s way easier than it would have been to implement this hotspot in C++ or something.


Sure :)

The ELI5 is that numba allows you to designate specific functions in your application for compilation. That tells the library to run a just-in-time compiler on them. If you can do that successfully, you can wind up with that portion of your code base running at or near C speed. There are a lot of gotchas and limitations, and it only works for a subset of the Python language, but when it works, it can make loops over numpy arrays very fast when they would normally be extremely slow in pure Python. In a lot of cases, just a few compute-bound sections are holding back the whole application, and numba allows you to improve those portions in a very targeted way, if you have a suitable use case.
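A minimal sketch of what that looks like (the function here is just an illustration):

    import numpy as np
    from numba import njit

    @njit  # mark this function for just-in-time compilation
    def moving_sum(a, window):
        # an explicit loop over a numpy array: extremely slow in pure
        # Python, at or near C speed once numba compiles it
        out = np.empty(a.shape[0] - window + 1)
        for i in range(out.shape[0]):
            s = 0.0
            for j in range(window):
                s += a[i + j]
            out[i] = s
        return out

    x = np.random.rand(1_000_000)
    result = moving_sum(x, 50)  # first call compiles; later calls are fast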


Others have good explanations, but let me put it super simply.

Suppose we have two very large vectors and we want to take their dot product. If we store these vectors in a standard Python list, write pure Python, and try to run it in CPython, it is very, very slow, at least relative to how fast almost any language that isn't a dynamic scripting language could do it. (I have to add all those caveats because this is easy mode for a JIT, and PyPy might blast through this. But CPython will be very slow.)

Suppose instead we have a data type that is a C array under the hood, and all we do is this:

    array1 = load_from_file("something")
    array2 = load_from_file("other_file")
    dot_product = array1.dot_product(array2)
Suppose "load_from_file" loads into an optimized data structure that provides Python access, but under the hood is stored efficiently, and the "dot_product" method leaves Python and directly performs the dot product in some more efficient language.

In this situation, you're running a bare handful of slow Python opcodes, a few dozen at the most, and doing vast quantities of numerical computation at whatever the speed of the underlying language is, which in this case can easily use SIMD or whatever other acceleration technologies are available, even though Python has no idea what those are.

That is the core technique used to make Python go fast. There are several ways that things get from Python to that fast code and underlying structure. The simplest and oldest is to do exactly what I laid out above, with external, non-Python implementations of all the really fast stuff. I believe this is the core of NumPy, though I'm not sure, but it's certainly been the core of a lot of other things in Python for decades.

There are also approaches involving writing in a Python-like language that can be compiled down (Cython); approaches that try to compile Python directly with a JIT into something that, under the hood, is no longer using CPython's interpreter loop (PyPy); and other interesting hybrid approaches.

CPython is generally slow, but in practice that often isn't a big deal, because if most of the lines you write hand off their real work to faster code, the slowness of Python itself hardly matters. For example, you can write decent, performant image manipulation code in Python as long as you're stringing together image manipulation commands to some underlying library, because in that case, for each "expensive" Python opcode, you're doing a lot of "real work" at max speed. I scare-quote "expensive" because in that case the Python opcodes may be a vanishing percentage of the cost of the work.

The problem with numerical manipulation in pure CPython is that it swings the other way: each Python opcode takes many machine-level instructions to execute, and it's inefficient to run all of those just to add two integers together, which is a single machine instruction.
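The contrast is easy to demonstrate with something like this (exact numbers vary by machine, but the gap is typically one to two orders of magnitude):

    import timeit
    import numpy as np

    a = list(range(1_000_000))
    b = np.arange(1_000_000)

    # every element pays full interpreter overhead for one multiply
    t_pure = timeit.timeit(lambda: [x * 2 for x in a], number=10)

    # a handful of Python opcodes dispatch one compiled, vectorized loop
    t_numpy = timeit.timeit(lambda: b * 2, number=10)

    print(f"pure Python: {t_pure:.3f}s, numpy: {t_numpy:.3f}s")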

Arguably, the dominant factor in sitting down with any programming language and writing reasonably performant code on your first pass is understanding the relative costs of things, and being careful not to write code that unnecessarily amplifies the cost of the "real work" with too much bookkeeping. The same 1,000 cycles' worth of "bookkeeping" can be anything from a total irrelevancy, if the "payload" it carries is many millions of cycles' worth of work, to a crippling amplified slowdown, if the payload is only one cycle's worth of work. Even a "fast" language like C++ can be brought to a crawl if you manage to use enough abstractions, bouncing through enough non-inlinable method calls and the requisite creation and teardown of function frames, just to add two numbers together, and then do it all again a few million times.



