There's an old point, and an old trick, about linear algebra and much else in scientific and engineering computing:
The point is that it is very common to
need to take inner products of two
n-tuples of numbers.
E.g., the inner product of 3-tuple (a, b,
c) with 3-tuple (x, y, z) is
ax + by + cz
The trick is to accumulate the sum of the
products in twice the numerical precision
of the input data.
This should not be too expensive, since the exact product of two p-bit numbers fits in 2p bits anyway; so the terms being summed are already naturally available in twice the precision of the input data.
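For example, with NumPy the whole trick for 32-bit data is a short function (just a sketch; dot_wide is my own name, not a library routine):

    import numpy as np

    def dot_wide(x, y):
        """Inner product of float32 vectors accumulated in float64.

        The inputs are widened once, so every product and every partial
        sum is formed in float64; only the final result is rounded back
        to float32.
        """
        return np.float32(np.dot(x.astype(np.float64), y.astype(np.float64)))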
Why? The main problem in numerical computing is loss of precision, and the main cause of that loss is subtracting two large numbers whose difference is small.
In particular, for an inner product it is common for the two vectors to be nearly orthogonal, in which case there is a lot of cancellation: individually large products whose sum is small.
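A tiny demonstration with NumPy: the products below are on the order of 1e8 but the exact inner product is 1, and float32 accumulation loses it entirely while float64 accumulation keeps it:

    import numpy as np

    # Products are 1e8, 1, -1e8; the exact inner product is 1.
    x = np.array([1.0e4, 1.0, -1.0e4], dtype=np.float32)
    y = np.array([1.0e4, 1.0,  1.0e4], dtype=np.float32)

    acc32 = np.float32(0.0)
    acc64 = np.float64(0.0)
    for a, b in zip(x, y):
        acc32 += a * b                          # float32 sum: 1e8 + 1 rounds back to 1e8
        acc64 += np.float64(a) * np.float64(b)  # float64 sum: the 1 survives

    print(acc32, acc64)   # 0.0 versus 1.0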
So, in the OP, I was disappointed to see that the precision mentioned was only 32 bits and 64 bits. IMHO, in linear algebra computations, 32 bit precision is essentially useless. So our data should be in 64 bit precision, and an inner product should be computed with 128 bit precision and then, likely, rounded back to 64 bit precision when the result is stored out of the processor registers into main memory and the program's data structures.
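Common CPUs have no 128 bit floating point hardware, but the workflow described above (64 bit data, wider accumulation, round back to 64 bits for storage) can at least be sketched with NumPy's np.longdouble, which on x86 is usually the 80-bit x87 extended format rather than a true binary128 (and on some platforms is just float64), so treat this as an illustration only:

    import numpy as np

    def dot_extended(x, y):
        """Inner product of float64 vectors accumulated in np.longdouble
        (typically 80-bit extended precision on x86), then rounded back
        to float64 for storage.
        """
        acc = np.longdouble(0.0)
        for a, b in zip(x, y):
            acc += np.longdouble(a) * np.longdouble(b)
        return np.float64(acc)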
Correct me if I'm wrong, but I don't know of any current SIMD hardware that supports more than 64 bits of precision. An exception would be the x87 FPU (80-bit extended precision), but you can't do SIMD with that, so I don't think anybody interested in high-performance linalg routines uses it.
While I don't disagree with you that precision beyond 64 bits is often useful, be careful not to overgeneralize that point. Many applications are just fine with 32 bits or even less and benefit greatly from the associated speedups, hence the proliferation of GPGPU computing (where single precision has a huge advantage). Neural networks are the popular example, but plenty of physics/engineering simulations (molecular dynamics (particularly with symplectic methods), lattice Boltzmann hydrodynamics, etc.) are on that list as well.
I agree, but I specifically restricted my remarks to numerical linear algebra. Double precision inner product accumulation is a standard idea in numerical linear algebra, e.g., as in
George E. Forsythe and Cleve B. Moler, Computer Solution of Linear Algebraic Systems.
People have long struggled with accuracy in numerical linear algebra. At one time at the US National Bureau of Standards, the struggles were so severe that M. Newman wrote software for numerically exact solution of systems of linear equations using only machine-precision integer arithmetic. How? In the given system, he multiplied by powers of 10 to move all the decimal points to the right so that he had only whole numbers. Then, for each prime in a list of some dozens of primes, he solved the system in the finite field of the integers modulo that prime, with multiplicative inverses from the extended Euclidean greatest common divisor algorithm. Finally, using the Chinese remainder theorem, he constructed the multiprecision integer numerators and denominators of the exact rational solution. It was nice work and somewhat popular.
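A rough Python sketch of that scheme (not Newman's code; it assumes the system has already been scaled to whole numbers, and it sidesteps general rational reconstruction by using the Cramer's-rule fact that det(A) times the solution is an integer vector):

    from fractions import Fraction
    from math import prod

    def solve_mod_p(A, b, p):
        """Gauss-Jordan elimination over GF(p).

        Returns (x mod p, det(A) mod p), or None if p divides det(A).
        """
        n = len(A)
        M = [[a % p for a in row] + [bi % p] for row, bi in zip(A, b)]
        det = 1
        for col in range(n):
            piv = next((r for r in range(col, n) if M[r][col]), None)
            if piv is None:
                return None                       # unlucky prime: p divides det(A)
            if piv != col:
                M[col], M[piv] = M[piv], M[col]
                det = -det % p
            det = det * M[col][col] % p
            inv = pow(M[col][col], -1, p)         # inverse via the extended Euclidean algorithm
            M[col] = [v * inv % p for v in M[col]]
            for r in range(n):
                if r != col and M[r][col]:
                    f = M[r][col]
                    M[r] = [(v - f * w) % p for v, w in zip(M[r], M[col])]
        return [row[n] for row in M], det

    def crt_signed(residues, primes):
        """Chinese remainder theorem, returning the symmetric (signed) representative."""
        P = prod(primes)
        x = 0
        for r, p in zip(residues, primes):
            Pi = P // p
            x = (x + r * Pi * pow(Pi, -1, p)) % P
        return x - P if x > P // 2 else x

    def exact_solve(A, b, primes):
        """Exact rational solution of an integer system A x = b.

        By Cramer's rule, det(A) * x is an integer vector, so we recover
        det(A) and det(A) * x by CRT from their images mod each prime and
        then divide.  The product of the primes must exceed twice the
        largest magnitude among det(A) and the entries of det(A) * x.
        """
        n = len(A)
        used, dets, dx = [], [], [[] for _ in range(n)]
        for p in primes:
            res = solve_mod_p(A, b, p)
            if res is None:
                continue
            xp, dp = res
            used.append(p)
            dets.append(dp)
            for i in range(n):
                dx[i].append(dp * xp[i] % p)
        d = crt_signed(dets, used)
        return [Fraction(crt_signed(r, used), d) for r in dx]

    # A small integer system; the exact solution is (-33/37, 32/37, 52/37).
    A = [[3, 1, 2], [2, 6, -1], [1, -2, 4]]
    b = [1, 2, 3]
    print(exact_solve(A, b, [10007, 10009, 10037, 10039]))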
Careful work on condition number estimation and exploitation has also helped; the condition number can be the basis of error estimates.
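The usual rule of thumb: the relative error of a computed solution of A x = b is roughly the condition number times the unit roundoff of the working precision. A quick NumPy illustration:

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 1.0001]])
    kappa = np.linalg.cond(A)        # 2-norm condition number, about 4e4 here
    u = np.finfo(np.float64).eps     # unit roundoff of float64

    # Expect roughly kappa * u relative error in a computed solution,
    # i.e., about 4 of the ~16 decimal digits of float64 are suspect for this A.
    print(kappa, kappa * u)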
Sure; for, say, the fast Fourier transform, 32 bits is mostly fine in practice.
I don't know of any either. So, I'm saying I wish there were some. But then we would also want 128 bit floating point addition instructions. It's been a while since I looked at the x86 instruction set, registers, etc., but from other comments here it appears that x86 doesn't have 128 bit floating point addition.
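Lacking such an instruction, the accumulation can be emulated in software with error-free transformations. A minimal sketch using the classic two-sum trick (note the products themselves are still rounded to 64 bits here; a full twice-precision dot product would also split each product, e.g., with an FMA-based two-product):

    def two_sum(a, b):
        """Error-free transformation: a + b == s + e exactly, with s = fl(a + b)."""
        s = a + b
        t = s - a
        e = (a - (s - t)) + (b - t)
        return s, e

    def compensated_dot(x, y):
        """Dot product with the rounding errors of the running sum recovered
        and added back at the end (Kahan/Neumaier-style compensation)."""
        s, comp = 0.0, 0.0
        for a, b in zip(x, y):
            s, e = two_sum(s, a * b)
            comp += e                 # accumulate the error terms of the sums
        return s + comp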
Here are my notes for those who want them: they added some 68 instructions for linalg to RISC-V. Their architecture has specialized reduce/aggregate logic and associated memory for vectors and matrices. They also made a C library for their instruction set. They saw some impressive speedups compared to plain RISC-V (~12x?), x86 (~5x?), and Fermi GPUs (~10x?) in the HPC Challenge benchmarks (where, obviously, they did have to update some of the code to use their instructions).
I was disappointed in not seeing 128 bit sums.