Sorry, I don't have a reference for the "optimal" algorithm--it was explained to me in person by someone working in the field, and they didn't point me to a specific place to track it down. The gist is that you "bin" the numbers by their approximate order of magnitude, and then within each bin you keep a separate running estimate of the accumulated error. It's a kind of divide-and-conquer approach. AFAIR it is "optimal" with respect to minimizing error, but not optimally efficient.
The non-portability of long double is just bonkers, though. Floating point is super-tricky to get right at every level, and programmers shouldn't be constantly surprised--and often infuriated--when their compiler introduces crazy amounts of nondeterminism by "helpfully" computing with 80 bits underneath.
I am actually pretty OK with just having an 80-bit float type in the language, as long as it is always 80 bits, and no 64-bit floats get silently promoted. But C never had that, because a fixed 80-bit type isn't what x87 does the fastest. C just has whatever crap is the fastest. I hate that! I imagine that you did a better job of it in D.
In designing WebAssembly we were very careful to go for as much determinism as possible and stick closely to IEEE 754. There were loud voices calling for nondeterminism (some even pushed for it to be nondeterministic by default whether FTZ mode was enabled!), but in the end the portability considerations won out. I was very heavily on the determinism side.
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
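For reference, the algorithm at that link is only a few lines: a second variable carries the rounding error lost at each step and folds it back in on the next one. A straightforward C version:

```c
#include <stddef.h>

/* Kahan compensated summation: c accumulates the low-order bits
 * that are lost when a small term is added to a large running sum.
 * Note: aggressive optimization (-ffast-math) can algebraically
 * cancel the compensation and silently turn this into naive summation. */
double kahan_sum(const double *x, size_t n) {
    double sum = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;   /* fold the previous step's error back in */
        double t = sum + y;    /* big + small: low bits of y may be lost */
        c = (t - sum) - y;     /* recover exactly what was lost */
        sum = t;
    }
    return sum;
}
```

For example, summing {1e16, 1.0, 1.0} naively left-to-right yields 1e16 (both 1.0 terms round away); kahan_sum recovers 1e16 + 2.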
This paper from 1993 analyzes five algorithms:
https://www.semanticscholar.org/paper/The-Accuracy-of-Floati...
It's apparently a hard problem!