Beating NumPy matrix multiplication in 150 lines of C (salykova.github.io)
392 points by p1esk 3 months ago | 81 comments



If the point of this article is that there's generally performance left on the table, then if anything it understates how much room there usually is for improvement, considering how much more effort goes into matmul libraries than into most other software.

Getting a 10-1000x or more improvement on existing code is very common without putting in a ton of effort, provided the code was not already heavily optimized. These are listed roughly in order of importance, but performance is often such a non-consideration for most developers that a little effort goes a long way.

1. Most importantly, is the algorithm a good choice? Can we eliminate some work entirely? (this is what algo interviews are testing for)

2. Can we eliminate round trips to the kernel and similar heavy operations? The most common huge gain here is replacing tons of malloc calls with a custom allocator.

3. Can we vectorize? Explicit vector intrinsics like in the blog post are great, but you can often get the same machine code by reorganizing your data into arrays / structs of arrays rather than arrays of structs (see the sketch after this list).

4. Can we optimize for cache efficiency? If you already reorganized for vectors this might already be handled, but this can get more complicated with parallel code if you can't isolate data to one thread (false sharing, etc.)

5. Can we do anything else that's hardware specific? This can be anything from using intrinsics to hand-coding assembly.
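
To make #3 concrete, here's a minimal sketch (types and field names made up for illustration) of the same update over an array-of-structs layout versus a struct-of-arrays layout. Compilers generally auto-vectorize the second form much more readily:

    #include <stddef.h>

    /* Array of structs: fields interleaved, so the loop strides over memory. */
    typedef struct { float x, y, z; } PointAoS;

    void scale_x_aos(PointAoS *p, size_t n, float s) {
        for (size_t i = 0; i < n; i++)
            p[i].x *= s;            /* 12-byte stride between loads */
    }

    /* Struct of arrays: each field contiguous, unit-stride access. */
    typedef struct { float *x, *y, *z; } PointsSoA;

    void scale_x_soa(PointsSoA *p, size_t n, float s) {
        for (size_t i = 0; i < n; i++)
            p->x[i] *= s;           /* contiguous loads/stores, vectorizes cleanly */
    }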


Don't forget the impact of network. I managed to get a several hundred times performance improvement on one occasion because I found a distributed query that was pulling back roughly 1M rows over the network and then doing a join that dropped all but 5 - 10 of them. I restructured the query so the join occurred on the remote server and only 5 - 10 rows were sent over the network and, boom, suddenly it's fast.

There's always going to be some fixed overhead and latency (and there's a great article about the impact of latency on performance called "It's the latency, stupid" that's worth a read: http://www.stuartcheshire.org/rants/latency.html) but sending far more data than is needed over a network connection will sooner or later kill performance.

Overall though, I agree with your considerations, and roughly with their order.


I meant this kind of thing to fall under #1. "Don't do work that can be avoided" includes pulling 1M rows (times a bunch of columns you don't need) over the network.

From your description though, it doesn't sound like something I'd classify as a network issue. That's just classic ORM nonsense. I guess I don't know what you mean by "distributed query", but it sounds terrible.

The most classic network performance issue is forgetting to disable Nagle's algorithm.
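
For reference, disabling Nagle is a single setsockopt call on a connected TCP socket; a minimal sketch (error handling left to the caller):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable Nagle's algorithm so small writes aren't coalesced and delayed.
       Returns 0 on success, -1 on failure (check errno). */
    static int disable_nagle(int fd) {
        int one = 1;
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
    }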

The most classic sql performance issue is not using an index.


A distributed query is something you execute over multiple instances of your DBMS. I actually would disagree with you that this specific issue was an instance of (1). There wasn't anything wrong with the query per se but rather the issue was with where the bulk of the work in the query was being done: move that work to the right place and the query becomes fast.

When considering performance issues, in my experience it's a mistake not to explicitly consider the network.


> I actually would disagree with you that this specific issue was an instance of (1). There wasn't anything wrong with the query per se

I think OP's #1 agrees that there's nothing "technically" wrong with such a query (or an algo). It just generated work you didn't have to do. Work takes time. So you used time you didn't have to use. I also think this is the number 1 way of improving performance in general (computer, life).

A perfectly valid query of fetching 1M rows turned into 99.xxx% unnecessary work when you only needed a handful of rows. The query wasn't slow, it was just generating more work than you actually needed. The network also wasn't slow, it simply had to transfer (even at peak theoretical efficiency) a lot of data you never used.

You then used an equally valid query that wasn't even necessarily fast, it just generated much less work. This query (quote from #1) "eliminate[d] some work entirely", the work of carrying over unnecessary data.


> 1. Most importantly, is the algorithm a good choice? Can we eliminate some work entirely? (this is what algo interviews are testing for)

Unfortunately this has turned into a cargo cult in practice. There are plenty of cases where doing more work results in better performance, because the "faster" algorithm has some pretty horrible constants in practice.

A lot of interviews turn into a pop quiz about rote memorization of obscure algorithms because "that's what Google does", rather than actually focusing on being able to reason and benchmark why an implementation is slow and what approaches could be taken to fix that.


I did not mean to endorse current software interviewing practices by pointing out a small overlap with good optimization fundamentals. The current status quo "google" style interview is basically a joke. It started from a good place, but fizz buzz eventually became invert a binary tree on the whiteboard, which by this point has been gamified to an absurd degree that it means very little, and likely optimizes for the wrong types of candidates entirely.


Most common coding patterns leave a lot of performance unclaimed, by not fully specializing to the hardware. This article is an interesting example. For another interesting demonstration, see this CS classic "There's plenty of room at the top"

https://www.science.org/doi/10.1126/science.aam9744



Thanks for sharing. That was a great read


The papers referenced in the BLIS repo are the authoritative reference to understand this stuff. I've no idea why people think optimized BLASes aren't performant; you should expect >90% of CPU peak for sufficiently large matrices, and serial OpenBLAS was generally on a par with MKL last I saw. (BLAS implements GEMM, not matmul, as the basic linear algebra building block.) I don't understand using numpy rather than the usual benchmark frameworks, and on Zen surely you should compare with AMD's BLAS (based on BLIS). BLIS had a better parallelization story than OpenBLAS last I knew. AMD BLIS also has the implementation switch for "small" dimensions, and I don't know if current OpenBLAS does.

SIMD intrinsics aren't needed for micro-kernel vectorization, as a decent C compiler will fully vectorize and unroll it. BLIS' pure C micro-kernel gets >80% of the performance of the hand-optimized implementation on Haswell with appropriate block sizes. The difference is likely to be due to prefetch, but I don't properly understand it.
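
For a sense of what that looks like, here's a rough pure-C sketch of a 16x6 micro-kernel over packed panels (my own illustration, not BLIS' actual kernel). With restrict-qualified pointers and fixed trip counts, gcc/clang at -O3 will typically turn the inner loops into unrolled FMA code:

    /* C (16x6 tile) += A_packed (kc x 16) * B_packed (kc x 6), C column-major. */
    void ukernel_16x6(const float *restrict a, const float *restrict b,
                      float *restrict c, int kc, int ldc) {
        float acc[6][16] = {{0}};          /* accumulators the compiler keeps in registers */
        for (int p = 0; p < kc; p++)
            for (int j = 0; j < 6; j++)
                for (int i = 0; i < 16; i++)
                    acc[j][i] += a[p * 16 + i] * b[p * 6 + j];
        for (int j = 0; j < 6; j++)
            for (int i = 0; i < 16; i++)
                c[j * ldc + i] += acc[j][i];
    }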


SIMD intrinsics and manually unrolled loops are surely needed. That's the reason why all BLAS libraries vectorize and unroll loops manually. Even modern compilers can't properly auto-vectorize and unroll with 100% success rate.


This looks like a nice write-up and implementation. I'm left wondering: what is the "trick"? How does it manage to beat OpenBLAS, which is assembly+C optimized over decades for this exact problem? It goes into detail about caching, etc. - is BLAS not taking advantage of these things, or is this more tuned to this specific processor, etc.?


- OpenBLAS isn't _that_ optimized for any specific modern architecture.

- The matrices weren't that big. Numpy has cffi overhead.

- The perf difference was much more noticeable with _peak_ throughput rather than _mean_ throughput, which matters for almost no applications (a few, admittedly, but even where "peak" is close to the right measure you usually want something like the mean of the top-k results or the proportion with under some latency, ...).

- The benchmarking code they displayed runs through Python's allocator for numpy and is suggestive of not going through any allocator for the C implementation. Everything might be fine, but that'd be the first place I checked for microbenchmarking errors or discrepancies (most numpy routines allow in-place operations; given that that's known to be a bottleneck in some applications of numpy, I'd be tempted to explicitly examine benchmarks for in-place versions of both).

- Numpy has some bounds checking and error handling code which runs regardless of the underlying implementation. That's part of why it's so bleedingly slow for small matrices compared to even vanilla Python lists (they tested bigger matrices too, so this isn't the only effect, but I'll mention it anyway). It's hard to make something faster when you add a few thousand cycles of pure overhead.

- This was a very principled approach to saturating the relevant caches. It's "obvious" in some sense, but clear engineering improvements are worth highlighting in discussions like this, in the sense that OpenBLAS, even with many man-hours, likely hasn't thought of everything.

And so on. Anyone can rattle off differences. A proper explanation requires an actual deep-dive into both chunks of code.


To your third point - it looks as though, even if the lines of mean values were averaged, this post's code would still be a clear winner.


Beating OpenBLAS is neither surprising nor unheard of; for example, the D language linear algebra library Mir did just that many years ago [1]. For C++ and C implementations, please check these metaprogramming approaches [2], [3].

What's really surprising is that many modern languages still depend on OpenBLAS, for example Matlab, Julia, Mojo, etc. But to be fair, they probably have their own reasons, right?

[1] Numeric age for D: Mir GLAS is faster than OpenBLAS and Eigen (2016):

http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...

[2] Vastly outperforming LAPACK with C++ metaprogramming (2018):

https://wordsandbuttons.online/vastly_outperforming_lapack_w...

[3] Outperforming LAPACK with C metaprogramming (2018):

https://wordsandbuttons.online/outperforming_lapack_with_c_m...


Maybe -march=native gives it an edge as it compiles for this exact CPU model whereas numpy is compiled for a more generic (older) x86-64. -march=native would probably get v4 on a Ryzen CPU where numpy is probably targeting v1 or v2.

https://en.wikipedia.org/wiki/X86-64#Microarchitecture_level...


Doesn’t numpy have runtime SIMD dispatching and whatnot based on CPU flags?

E.g. https://github.com/numpy/numpy/blob/main/numpy/_core/src/com...
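
In C terms, that sort of runtime dispatch generally looks something like this sketch (function names are hypothetical; this is not numpy's actual mechanism):

    /* Hypothetical runtime-dispatch sketch: pick a kernel once, based on CPU
       features, instead of committing to one ISA level at compile time. */
    #include <stdio.h>

    static void matmul_avx2(void)    { puts("AVX2+FMA kernel"); }   /* stand-in */
    static void matmul_generic(void) { puts("baseline kernel"); }   /* stand-in */

    typedef void (*matmul_fn)(void);

    static matmul_fn select_matmul(void) {
        __builtin_cpu_init();            /* gcc/clang builtin */
        if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
            return matmul_avx2;
        return matmul_generic;
    }

    int main(void) {
        select_matmul()();
        return 0;
    }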


np.matmul just uses whatever blas library your NumPy distribution was configured for/shipped with.

Could be MKL (I believe the conda version comes with it), but it could also be an ancient version of OpenBLAS you already had installed. So yeah, being faster than np.matmul probably just means your NumPy is not installed optimally.


Comparison with numpy 2.0 should be better for numpy because it integrates Google Highway for better SIMD across different microarchitectures.


Highway TL here. The integration is a work in progress; what has been integrated so far is VQSort :)


Reading the numpy 2.0 change log I assumed Highway was deeply integrated, thanks for the correction.


Good writeup and commendable of you to make your benchmark so easily repeatable. On my 16-core Xeon(R) W-2245 CPU @ 3.90GHz matmul.c takes about 1.41 seconds to multiply 8192x8192 matrices when compiled with gcc -O3 and 1.47 seconds when compiled with clang -O2, while NumPy does it in 1.07 seconds. I believe an avx512 kernel would be significantly faster. Another reason for the lackluster performance may be omp. IME, you can reduce overhead by managing the thread pool explicitly with pthreads (and use sysconf(_SC_NPROCESSORS_ONLN) instead of hard-coding).
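
For that last point, something like this instead of a hard-coded NTHREADS (a minimal sketch):

    #include <unistd.h>

    /* Derive the thread count from the machine at runtime. */
    static int hw_thread_count(void) {
        long n = sysconf(_SC_NPROCESSORS_ONLN);   /* logical CPUs currently online */
        return n > 0 ? (int)n : 1;
    }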


There is no reason to burden one side with Python while the other side is C, when they could just as easily have performed an apples-to-apples comparison where both sides are written in C, one calling a BLAS library while the other calls this other implementation.


Python is the right thing to compare to here, because it is easily the most popular way to perform these computations in the modern day. Specifically using numpy. The overhead isn't that high, but as mentioned elsewhere in this thread, calling it correctly is important. Pitting naive numpy code against tuned C code is definitely not a fair comparison.


> Python is the right thing to compare to here, because it is easily the most popular way to perform these computations in the modern day. Specifically using numpy.

By that reasoning, wouldn't it make more sense to wrap their C code and maybe even make it operate on numpy's array representation, so it can be called from Python?


Popular does not mean best.

Suppose that this blog post were part of a series that questions the axiom (largely bolstered by academic marketing) that one needs Python to do array computing. Then it is valid to compare C directly to NumPy.

It isn't even far fetched. The quality of understanding something after having implemented it in C is far greater than the understanding gained by rearranging PyTorch or NumPy snippets.

That said, the Python overhead should not be very high if M=1000, N=1000, K=1000 was used. The article is a bit silent on the array sizes; this figure comes from somewhere in the middle of the article.


Python is popular precisely because non-programmers are able to rearrange snippets and write rudimentary programs without a huge time investment in learning the language or tooling. It’s a very high level language with syntactic sugar that has a lot of data science libraries which can call C code for performance, which makes it great for data scientists.

It would be a huge detriment and time sink for these data scientists to take the time to learn to write an equivalent C program if their ultimate goal is to do data science.


I think it’s okay to say “This is the benchmark, now I’m going to compare it against something else.” It’s up to the reader to decide if a 3% (or 300%) improvement is worth the investment if it involves learning a whole other language.


It's a muddy comparison given that NumPy is commonly used with other BLAS implementations, which the author even lists, but doesn't properly address. Anaconda defaults to Intel oneAPI MKL, for example, and that's a widely used distribution. Not that I think MKL would do great on AMD hardware; BLIS is probably a better alternative.

The author also says "(...) implementation follows the BLIS design", but then proceeds to compare *only* with OpenBLAS. I'd love to see a more thorough analysis, and using C directly would make it easier to compare multiple BLAS libs.


Their implementation outperforms not only the recent version of OpenBLAS but also MKL on their machine (these are the DEFAULT BLAS libraries shipped with numpy). What's the point of comparing against BLIS if numpy doesn't use it by default? The authors explicitly say: "We compare against NumPy". They use matrix sizes up to M=N=K=5000, so the FFI overhead is, in fact, negligible.


If that was the goal, they should have compared NumPy to BLAS. What they did was compare OpenBLAS wrapped in NumPy with their C code. It is not a reasonable comparison to make.

Look, I'm trying to be charitable to the authors, hard as that might be.


There is some reason in this comparison. You might want to answer the question: "if I pick the common approach to matrix multiplication in the world of Data Science (numpy), how far off is the performance from some potential ideal reference implementation?"

I actually do have that question niggling in the back of my mind when I use something like NumPy. I don't necessarily care exactly _where_ the overhead comes from, I might just be interested whether it's close to ideal or not.


If that was your question, you would compare against a number of BLAS libraries, which are already well optimized.

What they are doing here is patting themselves on the back after handicapping the competition. Not to mention that they have given themselves the chance to cherry pick the very best hyperparameters for this particular comparison while BLAS is limited to using heuristics to guess which of their kernels will suit this particular combination of hardware and parameters.

The authors need to be called out for this contrived comparison.


Are OpenBLAS and MKL not well optimized, lol? They literally compared against OpenBLAS/MKL and posted the results in the article. As someone already mentioned, this implementation is faster than MKL even on an Intel Xeon with 96 cores. Maybe you missed the point, but the purpose of the article was to show HOW to implement matmul without FORTRAN/ASSEMBLY code with NumPy-like performance, NOT how to write a BLIS-competitive library. So the article and the code LGTM.


Look, it's indeed a reasonable comparison. They use matrix sizes up to M=N=K=5000, so the FFI overhead is negligible. What's the point of comparing NumPy with BLAS if NumPy uses BLAS under the hood?


Though not at all part of the hot path, the inefficiency of the mask generation ('bit_mask' usage) nags me. Some more efficient methods include creating a global constant array of {-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1, 0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0} and loading from it at element offsets 16-m and 8-m, or comparing constant vector {0,1,2,3,4,...} with broadcasted m and m-8.
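
The compare-based variant as a sketch (lane i stays enabled iff i < n, where n would be m for the low half and m - 8 for the high half; the helper name is mine):

    #include <immintrin.h>

    static inline __m256i lane_mask(int n) {
        const __m256i iota = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        return _mm256_cmpgt_epi32(_mm256_set1_epi32(n), iota);  /* n > i => lane all-ones */
    }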

Very moot nitpick though, given that this is for only one column of the matrix, the following loops of maskload/maskstore will take significantly more time (esp. store, which is still slow on Zen 4[1] despite the AVX-512 instruction (whose only difference is taking the mask in a mask register) being 6x faster), and clang autovectorizes the shifting anyways (maybe like 2-3x slower than my suggestions).

[1]: https://uops.info/table.html?search=vmaskmovps&cb_lat=on&cb_...


Hi! I'm the author of the article. It's really my first time optimizing C code and using intrinsics, so I'm definitely not an expert in this area, but I'm willing to learn more! Many thanks for your feedback; I truly appreciate comments that provide new perspectives.

Regarding "creating a constant global array and loading from it" - if I recall correctly, I've tested this approach and it was a bit slower than bit mask shifting. But let me re-test this to be 100% sure.

"Comparing a constant vector {0, 1, 2, 3, 4, ...} with broadcasted m and m-8" - good idea, I will try it!


> creating a global constant array

Note you can keep int8_t elements in that array, and sign extend bytes into int32_t while loading. The _mm_loadu_si64 / _mm256_cvtepi8_epi32 combo should compile into a single vpmovsxbd instruction with a memory operand. This way the entire constant array fits in a single cache line, as long as it’s aligned properly with alignas(32)

This is a good fit for the OP's use case because they need two masks; the second vpmovsxbd instruction will be a guaranteed L1D cache hit.
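
A minimal sketch of that idea (the array layout and helper name are mine; n is how many of the 8 lanes should stay enabled):

    #include <immintrin.h>
    #include <stdalign.h>
    #include <stdint.h>

    /* 16 bytes of -1 followed by 16 bytes of 0: one cache line of constants. */
    alignas(32) static const int8_t mask_src[32] = {
        -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0
    };

    /* Enable the first n of 8 lanes, n in [0, 8]; the load plus sign extension
       should fold into a single vpmovsxbd ymm, [mem]. */
    static inline __m256i lane_mask8(int n) {
        return _mm256_cvtepi8_epi32(_mm_loadu_si64(mask_src + 16 - n));
    }

The two maskload/maskstore masks would then be lane_mask8(m < 8 ? m : 8) and lane_mask8(m > 8 ? m - 8 : 0).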


vpmovsxbd ymm,[…] still presumably decomposes back into two uops (definitely does on intel, but uops.info doesn't show memory uops for AMD); still better than broadcast+compare though (which does still have a load for the constant; and, for that matter, the original shifting version also has multiple loads). Additionally, the int8_t elements mean no cacheline-crossing loads. (there's the more compressed option of only having a {8×-1, 8×0} array, at the cost of more scalar offset computation)


> definitely does on intel, but uops.info doesn't show memory uops for AMD

Indeed, but it reveals something else interesting. On Zen2 and Zen3 processors, the throughput of vpmovsxbd ymm, [...] is more than twice as efficient compared to sign extension from another vector register i.e. vpmovsxbd ymm, xmm

> the original shifting version also has multiple loads

I believe _mm256_setr_epi32 like that is typically compiled into a sequence of vmovd / vpinsrd / vinserti128 instructions. These involve no loads, just multiple instructions assembling the vector from int32 pieces produced in scalar registers.


Oh yeah, I did forget that, despite being separated into uops, sign-extend-mem is still more efficient than literally being separated as such (some other similar things include memory insert/extract, and, perhaps most significantly, broadcast; with various results across intel & AMD). I imagine the memory subsystem is able to simultaneously supply ≤128-bit results to both 128-bit ALU halves, thus avoiding the need for cross-128-bit transfers.

The _mm256_setr_epi32 by itself would indeed be very inefficient, but clang manages to vectorize it[1] to vpaddd+vpsllvd, which require some constant loads (also it generates some weird blends, idk).

[1]: https://godbolt.org/z/7jq4z39GT - L833-847 or so in the assembly, or on L67 in the source, right click → "Reveal linked code"



We were actively chatting with Justine yesterday; it seems the implementation is at least 2x faster than tinyBLAS on her workstation. The whole discussion is in the Mozilla AI Discord: https://discord.com/invite/NSnjHmT5xY


"off-topic" channel


What is the point of making the matrix multiplication itself multithreaded (other than benchmarking)? Wouldn't it be more beneficial in practice to have the multithreading in the algorithm that uses the multiplication?


That's indeed what's typically done in HPC. However, substituting a parallel BLAS can speed up the right sort of R code with no code changes, for instance; but HPC codes typically aren't bottlenecked on GEMM.


I've only skimmed so far, but this post has a lot of detail and explanation. Looks like a pretty great description of how fast matrix multiplications are implemented to take into account architectural concerns. Goes on my reading list!


I don’t save articles often, but maybe one time in a few months I see something I know I will enjoy reading again even after 1 or 2 years. Keep up the great work OP!


Very nice write up. Those are the kind of matrix sizes that MKL is fairly good at; might be worth a comparison as well?

Also, if you were designing for smaller cases, say MNK=16 or 32, how would you approach it differently? I'm implementing neural ODEs and this is one point I've been considering.


For very small sizes on amd64, you can, and likely should, use libxsmm. MKL's improved performance in that region was due to libxsmm originally showing it up, but this isn't an Intel CPU anyway.


Don't forget BLIS itself!


In the README, it says:

> Important! Please don’t expect peak performance without fine-tuning the hyperparameters, such as the number of threads, kernel and block sizes, unless you are running it on a Ryzen 7700(X). More on this in the tutorial.

I think I'll need a TL;DR on what to change all these values to.

I have a Ryzen 7950X and as a first test I tried to only change NTHREADS to 32 in benchmark.c, but matmul.c performed worse than NumPy on my machine.

So I took a look at the other values present in the benchmark.c, but MC and NC are already calculated via the number of threads (so these are probably already 'fine-tuned'?), and I couldn't really understand how KC = 1000 fits the 7700(X) (the author's CPU) and how I'd need to adjust it for the 7950X (with the information from the article).


Actually, leaving it on 16 threads performs a bit better than setting it to 32 threads.

16 threads: https://0x0.st/XaDB.png

32 threads: https://0x0.st/XaDM.png

But still not as fast as it ran on your 7700(X) and NumPy is 2-3x faster than matmul.c on my PC.

I also changed KC to some other values (500: https://0x0.st/XaD9.png, 2000: https://0x0.st/XaDp.png), but it didn't change much performance wise.


as we discussed earlier, the code really needs Clang to attain high performance



This was a great read, thanks a lot! On a side note, does anyone have a good guess what tool/software they used to create the visualisations for the matrix multiplications or the memory outline?


excalidraw <3


Does it make sense to compare a C executable with an interpreted Python program that calls a compiled library? Is the difference due to the algorithm or the call stack?


Yes, as far as I can tell the array sizes were large enough to make the wrapping overhead negligible.


Yes


Is there any explanation for the drops and rises in GFLOP/S at certain matrix sizes, as shown in the plots?


This only proves even more how Python is magnificently fast. Only 5% difference from the native code...


Python is just interfacing with native code in these benchmarks. Matrix multiplication written in pure Python would probably be slower by at least 2 orders of magnitude.


Yes, but Python was made to interface with complicated C backends using a modern, friendly, mathematics-aware syntax. That's why Python is so good for e.g. machine learning, and nobody does it in Java, for example.


in terms of comparing to numpy, how much overhead would there be from Python (vs. running the numpy C code alone)?


Python can efficiently call C libs if you use ctypes and native pointers, which numpy uses. Of course it depends on the expected layout.

If you want to convert to Python lists, it is going to take time. Not sure about Python arrays.


If you use numpy, then you use an ndarray, which you can create from a C array for 'free' (no copying, just set a pointer).


I don't recall the link, but there was a GitHub repo with comparisons of CFFI implementations in different languages, and from what I remember Python was about 3 orders of magnitude slower than, say, Lua or OCaml.

Edit: ha, found it https://news.ycombinator.com/from?site=github.com/dyu


I'm not sure which specific benchmark you are referring to, but it appears it was measuring the speed of Python looping and not the speed of the FFI.

As far as I can tell the FFI itself is not expensive as long as the underlying type does not have to be converted. But if you expect to call it millions of times a second you're going to have trouble. The solution is to move the loop inside the C code.

For example, suppose you want to FFT a bunch of signals. You don't repeatedly call FFT for each of the signals - you pass the entire data structure to the C code.


> #define min(x, y) ((x) < (y) ? (x) : (y))

cursed code. I checked and every single invocation is just `int`, so why do this? you can just write a function:

    static inline int min(int x, int y) {
        if (x < y) return x;
        return y;
    }
and keep type safety


Heck, you can also use the min instruction, which avoids a branch.


there's no scalar integer min instruction on x86 - the ternary in the macro already compiles to a branchless cmp+cmov (vector forms like minps/pminsd do exist, but they don't apply here)




The article claims this is portable C. Given the use of intel intrinsics, what happens if you try to compile it for ARM64?


I think they mean portable between modern x86 CPUs as opposed to truly portable.


You'd first need to convert it to portable SIMD intrinsics; there are several libraries for that.


Apparently it also needs Clang to achieve the same performance: https://news.ycombinator.com/item?id=40875968


Does numpy's implementation actually use multithreading? If not, then the comparison is not fair.


> This is my first time writing a blog post. If you enjoy it, please subscribe and share it!

Great job! Self-publishing things like this was a hallmark of the early internet that I, for one, sorely miss.



