Switching from GMP to GCC's __int128 reduced run time by 95% (2016) (nu42.com)
71 points by optimalsolver on July 28, 2021 | 9 comments



Note that soon after I posted this, Someone noticed[1,2] that it is possible to avoid a whole lot of work, but I am still glad I realized how much performance one can gain by dropping GMP in favor of `__int128`.

It seems that for numbers on this side of 1024 bits, writing purpose-built routines using `__int128` or SSE/AVX instructions rather than general-purpose functionality is likely to pay off (rough sketch below, after the links).

[1]: https://news.ycombinator.com/item?id=10915461

[2]: https://www.nu42.com/2016/01/excellent-numbers-explicit-solu...
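For illustration, here is a minimal sketch (my own, not the program from the article) of what a purpose-built `__int128` routine looks like: a 64x64 -> 128-bit product stays entirely in registers, with no limb loops and no heap allocation. It needs GCC or Clang on a 64-bit target.

    /* Minimal sketch (illustration only): GCC/Clang's unsigned __int128 keeps
       the whole computation in registers; no limb loops, no heap allocation. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t a = 0xDEADBEEFCAFEF00DULL;
        uint64_t b = 0x123456789ABCDEF0ULL;

        /* Full 64x64 -> 128-bit product; compiles to a single widening multiply. */
        unsigned __int128 p = (unsigned __int128)a * b;

        /* printf has no __int128 conversion, so print the two 64-bit halves. */
        printf("high = %016llx  low = %016llx\n",
               (unsigned long long)(p >> 64), (unsigned long long)p);
        return 0;
    }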


Fun "large number handling" facts:

- GMP has limits (compare e.g. https://gmplib.org/list-archives/gmp-discuss/2004-April/0011...). 2^31 limbs of 64 bits each means you cannot process numbers larger than ~2^37 bits (2^34 bytes = 16 GiB) with the standard mpz functions.

- Multiplication of large numbers is its own science, with different algorithms optimal at different sizes. (https://gmplib.org/manual/Multiplication-Algorithms)


This is presumably due to eliminating many/most function calls (I can't imagine division being inlined, but maybe I'm wrong) and to fixed-size allocation.

My guess is that the rest of the overhead comes from things like GMP needing to allocate heap storage, since it doesn't have the benefit of fixed-size storage.


Yep, the program referenced in the blog post https://github.com/briandfoy/excellent_numbers/blob/master/c... is using GMP's variable-width numbers (the mpz_* routines). If you know the bit lengths in advance, you can provide pre-allocated C arrays to the mpn_* routines: https://gmplib.org/manual/Low_002dlevel-Functions . This is likely to be much faster (you avoid the bookkeeping and malloc inside mpz_*), and it is what e.g. libsnark does for its general Fp arithmetic. (E.g. addition becomes https://github.com/scipr-lab/libff/blob/develop/libff/algebr... ) Of course, the mpn_* routines still do loops, which __int128 ones would presumably unroll, but the bulk of the gain should come from reducing the mpz_* overheads.
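To make that concrete, here is a hedged sketch of the mpn_* style (my own illustration, not the code from libsnark or the linked repository): with the width fixed at, say, 4 limbs (256 bits), the operands live in plain stack arrays and mpn_add_n works on them directly, with no malloc involved. Build with -lgmp.

    #include <gmp.h>
    #include <stdio.h>

    #define LIMBS 4   /* 4 x 64-bit limbs = 256-bit numbers, width known in advance */

    int main(void)
    {
        /* Caller-provided, fixed-size storage: no allocation, no mpz bookkeeping. */
        mp_limb_t a[LIMBS] = { 1, 0, 0, 0 };
        mp_limb_t b[LIMBS] = { ~(mp_limb_t)0, ~(mp_limb_t)0, 0, 0 };
        mp_limb_t r[LIMBS];

        /* mpn_add_n adds two LIMBS-limb numbers and returns the carry out. */
        mp_limb_t carry = mpn_add_n(r, a, b, LIMBS);

        printf("carry out = %lu, limb 2 = %lu\n",
               (unsigned long)carry, (unsigned long)r[2]);
        return 0;
    }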


What about MPI?


I don't know much about MPI, but I thought it was about message passing/distributed workloads rather than arithmetic performance?


Parent must have mistaken GMP for OMP and wanted to fire off a quick reply, no need to read the article! Some research should be done: is it because HNers browse HN just before bed?

To be honest, I made the same mistake at a glance… fortunately, I am not mentally exhausted…


I meant distributing the load of the calculations...


Given that the author mentions multiple cores being available, I'd guess you could use any method, including MPI, to distribute the computation. But whether you used 1 core or 10k cores, it would be nice to have a 20x speedup on each core via this arithmetic/fixed-size optimization. Since that's the focus of the article, communication technologies feel pretty unrelated.



