Switching from GMP to GCC's __int128 reduced run time by 95% (2016) (nu42.com)
71 points by optimalsolver on July 28, 2021 | 9 comments



Note that soon after I posted this, Someone noticed[1,2] that it is possible to avoid a whole lot of work, but I am still glad I realized how much performance one can gain by dropping GMP in favor of `__int128`.

It seems that for numbers on this side of 1024 bits, writing purpose-built routines using `__int128` or SSE/AVX instructions rather than general-purpose functionality is likely to pay off (rough sketch below, after the links).

[1]: https://news.ycombinator.com/item?id=10915461

[2]: https://www.nu42.com/2016/01/excellent-numbers-explicit-solu...
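For illustration, here is a minimal sketch (my own, not the program from the article) of what a purpose-built `__int128` routine looks like: a 64x64 -> 128-bit product stays entirely in registers, with no limb loops and no heap allocation. It needs GCC or Clang on a 64-bit target.

    /* Minimal sketch (illustration only): GCC/Clang's unsigned __int128 keeps
       the whole computation in registers; no limb loops, no heap allocation. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t a = 0xDEADBEEFCAFEF00DULL;
        uint64_t b = 0x123456789ABCDEF0ULL;

        /* Full 64x64 -> 128-bit product; compiles to a single widening multiply. */
        unsigned __int128 p = (unsigned __int128)a * b;

        /* printf has no __int128 conversion, so print the two 64-bit halves. */
        printf("high = %016llx  low = %016llx\n",
               (unsigned long long)(p >> 64), (unsigned long long)p);
        return 0;
    }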


Fun "large number handling" facts:

- GMP has limits (compare e.g. https://gmplib.org/list-archives/gmp-discuss/2004-April/0011...). 2^31 limbs of 64 bits each means you cannot process numbers larger than ~2^37 bits (2^34 bytes = 16 GiB) with the standard mpz functions.

- Multiplication of large numbers is its own science, with different algorithms optimal at different sizes. (https://gmplib.org/manual/Multiplication-Algorithms)


This is presumably due to eliminating many/most function calls (I can't imagine division being inlined, but maybe I'm wrong) and to fixed-size allocation.

My guess is that the rest of the overhead comes from things like GMP needing to allocate heap storage, since it doesn't have the benefit of fixed-size storage.


Yep, the program referenced in the blog post https://github.com/briandfoy/excellent_numbers/blob/master/c... is using GMP's variable-width numbers (the mpz_* routines). If you know the bit lengths in advance, you can provide pre-allocated C arrays to the mpn_* routines: https://gmplib.org/manual/Low_002dlevel-Functions . This is likely to be much faster (you avoid the bookkeeping and malloc inside mpz_*), and it is what e.g. libsnark does for its general Fp arithmetic. (E.g. addition becomes https://github.com/scipr-lab/libff/blob/develop/libff/algebr... ) Of course, the mpn_* routines still do loops, which __int128 ones would presumably unroll, but the bulk of the gain should come from reducing the mpz_* overheads.
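To make that concrete, here is a hedged sketch of the mpn_* style (my own illustration, not the code from libsnark or the linked repository): with the width fixed at, say, 4 limbs (256 bits), the operands live in plain stack arrays and mpn_add_n works on them directly, with no malloc involved. Build with -lgmp.

    #include <gmp.h>
    #include <stdio.h>

    #define LIMBS 4   /* 4 x 64-bit limbs = 256-bit numbers, width known in advance */

    int main(void)
    {
        /* Caller-provided, fixed-size storage: no allocation, no mpz bookkeeping. */
        mp_limb_t a[LIMBS] = { 1, 0, 0, 0 };
        mp_limb_t b[LIMBS] = { ~(mp_limb_t)0, ~(mp_limb_t)0, 0, 0 };
        mp_limb_t r[LIMBS];

        /* mpn_add_n adds two LIMBS-limb numbers and returns the carry out. */
        mp_limb_t carry = mpn_add_n(r, a, b, LIMBS);

        printf("carry out = %lu, limb 2 = %lu\n",
               (unsigned long)carry, (unsigned long)r[2]);
        return 0;
    }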


What about MPI?


I don't know much about MPI, but I thought it was about message passing/distributed workloads rather than arithmetic performance?


Parent must have mistaken GMP for OMP and wanted to fire off a quick reply, no need to read the article! Some research should be done: is it because HNers browse HN just before bed?

To be honest, I made the same mistake at a glance… fortunately, I am not mentally exhausted…


I meant distributing the load of the calculations...


Given that the author mentions multiple cores being available, I'd guess you could use any method, including MPI, to distribute the computation. But whether you used 1 core or 10k cores, it would be nice to have a 20x speedup on each core via this arithmetic/fixed-size optimization. Since that's the focus of the article, communication technologies feel pretty unrelated.



