There's an old point, and an old trick, about linear algebra and much else in scientific and engineering computing:
The point is that it is very common to
need to take inner products of two
n-tuples of numbers.
E.g., the inner product of 3-tuple (a, b,
c) with 3-tuple (x, y, z) is
ax + by + cz
The trick is to accumulate the sum of the
products in twice the numerical precision
of the input data.
This should not be too expensive, since the exact product of two p-bit numbers fits in 2p bits anyway; so the terms being summed are already naturally available in twice the precision of the input data.
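For example, with NumPy the whole trick for 32-bit data is a short function (just a sketch; dot_wide is my own name, not a library routine):

    import numpy as np

    def dot_wide(x, y):
        """Inner product of float32 vectors accumulated in float64.

        The inputs are widened once, so every product and every partial
        sum is formed in float64; only the final result is rounded back
        to float32.
        """
        return np.float32(np.dot(x.astype(np.float64), y.astype(np.float64)))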
Why? The main problem in numerical computing is loss of precision, and the main cause of that loss is subtracting two large numbers whose difference is small.
In particular, for an inner product it is common for the two vectors to be nearly orthogonal, in which case there is a lot of cancellation: individually large products whose sum is small.
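A tiny demonstration with NumPy: the products below are on the order of 1e8 but the exact inner product is 1, and float32 accumulation loses it entirely while float64 accumulation keeps it:

    import numpy as np

    # Products are 1e8, 1, -1e8; the exact inner product is 1.
    x = np.array([1.0e4, 1.0, -1.0e4], dtype=np.float32)
    y = np.array([1.0e4, 1.0,  1.0e4], dtype=np.float32)

    acc32 = np.float32(0.0)
    acc64 = np.float64(0.0)
    for a, b in zip(x, y):
        acc32 += a * b                          # float32 sum: 1e8 + 1 rounds back to 1e8
        acc64 += np.float64(a) * np.float64(b)  # float64 sum: the 1 survives

    print(acc32, acc64)   # 0.0 versus 1.0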
So, in the OP, I was disappointed to see that the precision mentioned was only 32 bits and 64 bits. IMHO, in linear algebra computations, 32 bit precision is essentially useless. So our data should be in 64 bit precision, and an inner product should be computed with 128 bit precision and then, likely, rounded back to 64 bit precision when the result is stored out of the processor registers into main memory and the program's data structures.
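Common CPUs have no 128 bit floating point hardware, but the workflow described above (64 bit data, wider accumulation, round back to 64 bits for storage) can at least be sketched with NumPy's np.longdouble, which on x86 is usually the 80-bit x87 extended format rather than a true binary128 (and on some platforms is just float64), so treat this as an illustration only:

    import numpy as np

    def dot_extended(x, y):
        """Inner product of float64 vectors accumulated in np.longdouble
        (typically 80-bit extended precision on x86), then rounded back
        to float64 for storage.
        """
        acc = np.longdouble(0.0)
        for a, b in zip(x, y):
            acc += np.longdouble(a) * np.longdouble(b)
        return np.float64(acc)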
Correct me if I'm wrong, but I don't know of any current SIMD hardware that supports more than 64 bits of precision. An exception would be the x87 FPU (80-bit extended precision), but you can't do SIMD with that, so I don't think anybody interested in high-performance linalg routines uses it.
While I don't disagree with you that precision beyond 64 bits is often useful, be careful not to overgeneralize that point. Many applications are just fine with 32 bits or even less and benefit greatly from the associated speedups, hence the proliferation of GPGPU computing (where single precision has a huge advantage). Neural networks are the popular example, but plenty of physics/engineering simulations (molecular dynamics (particularly with symplectic methods), lattice Boltzmann hydrodynamics, etc.) are on that list as well.
I agree, but I specifically restricted my remarks to numerical linear algebra. Double precision inner product accumulation is a standard idea in numerical linear algebra, e.g., as in
George E. Forsythe and Cleve B. Moler, Computer Solution of Linear Algebraic Systems.
People have long struggled with accuracy in numerical linear algebra. At one time at the US National Bureau of Standards, the struggles were so severe that M. Newman wrote software for numerically exact solution of systems of linear equations using only machine-precision integer arithmetic. How? In the given system, he multiplied by powers of 10 to move all the decimal points to the right so that he had only whole numbers. Then, for each prime in a list of some dozens of primes, he solved the system in the finite field of the integers modulo that prime, with multiplicative inverses from the extended Euclidean greatest common divisor algorithm. Finally, using the Chinese remainder theorem, he constructed the multiprecision integer numerators and denominators of the exact rational solution. It was nice work and somewhat popular.
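A rough Python sketch of that scheme (not Newman's code; it assumes the system has already been scaled to whole numbers, and it sidesteps general rational reconstruction by using the Cramer's-rule fact that det(A) times the solution is an integer vector):

    from fractions import Fraction
    from math import prod

    def solve_mod_p(A, b, p):
        """Gauss-Jordan elimination over GF(p).

        Returns (x mod p, det(A) mod p), or None if p divides det(A).
        """
        n = len(A)
        M = [[a % p for a in row] + [bi % p] for row, bi in zip(A, b)]
        det = 1
        for col in range(n):
            piv = next((r for r in range(col, n) if M[r][col]), None)
            if piv is None:
                return None                       # unlucky prime: p divides det(A)
            if piv != col:
                M[col], M[piv] = M[piv], M[col]
                det = -det % p
            det = det * M[col][col] % p
            inv = pow(M[col][col], -1, p)         # inverse via the extended Euclidean algorithm
            M[col] = [v * inv % p for v in M[col]]
            for r in range(n):
                if r != col and M[r][col]:
                    f = M[r][col]
                    M[r] = [(v - f * w) % p for v, w in zip(M[r], M[col])]
        return [row[n] for row in M], det

    def crt_signed(residues, primes):
        """Chinese remainder theorem, returning the symmetric (signed) representative."""
        P = prod(primes)
        x = 0
        for r, p in zip(residues, primes):
            Pi = P // p
            x = (x + r * Pi * pow(Pi, -1, p)) % P
        return x - P if x > P // 2 else x

    def exact_solve(A, b, primes):
        """Exact rational solution of an integer system A x = b.

        By Cramer's rule, det(A) * x is an integer vector, so we recover
        det(A) and det(A) * x by CRT from their images mod each prime and
        then divide.  The product of the primes must exceed twice the
        largest magnitude among det(A) and the entries of det(A) * x.
        """
        n = len(A)
        used, dets, dx = [], [], [[] for _ in range(n)]
        for p in primes:
            res = solve_mod_p(A, b, p)
            if res is None:
                continue
            xp, dp = res
            used.append(p)
            dets.append(dp)
            for i in range(n):
                dx[i].append(dp * xp[i] % p)
        d = crt_signed(dets, used)
        return [Fraction(crt_signed(r, used), d) for r in dx]

    # A small integer system; the exact solution is (-33/37, 32/37, 52/37).
    A = [[3, 1, 2], [2, 6, -1], [1, -2, 4]]
    b = [1, 2, 3]
    print(exact_solve(A, b, [10007, 10009, 10037, 10039]))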
Careful work on condition number estimation and exploitation has also helped; the condition number can be the basis of error estimates.
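The usual rule of thumb: the relative error of a computed solution of A x = b is roughly the condition number times the unit roundoff of the working precision. A quick NumPy illustration:

    import numpy as np

    A = np.array([[1.0, 1.0],
                  [1.0, 1.0001]])
    kappa = np.linalg.cond(A)        # 2-norm condition number, about 4e4 here
    u = np.finfo(np.float64).eps     # unit roundoff of float64

    # Expect roughly kappa * u relative error in a computed solution,
    # i.e., about 4 of the ~16 decimal digits of float64 are suspect for this A.
    print(kappa, kappa * u)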
Sure; for, say, the fast Fourier transform, 32 bits is mostly fine in practice.
I don't know of any either. So, I'm saying I wish there were some. But then we would also want 128 bit floating point addition instructions. It's been a while since I looked at the x86 instruction set, registers, etc., but from other comments here it appears that x86 doesn't have 128 bit floating point addition.
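Lacking such an instruction, the accumulation can be emulated in software with error-free transformations. A minimal sketch using the classic two-sum trick (note the products themselves are still rounded to 64 bits here; a full twice-precision dot product would also split each product, e.g., with an FMA-based two-product):

    def two_sum(a, b):
        """Error-free transformation: a + b == s + e exactly, with s = fl(a + b)."""
        s = a + b
        t = s - a
        e = (a - (s - t)) + (b - t)
        return s, e

    def compensated_dot(x, y):
        """Dot product with the rounding errors of the running sum recovered
        and added back at the end (Kahan/Neumaier-style compensation)."""
        s, comp = 0.0, 0.0
        for a, b in zip(x, y):
            s, e = two_sum(s, a * b)
            comp += e                 # accumulate the error terms of the sums
        return s + comp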
Here are my notes for those who want them: they added some 68 instructions for linalg to RISC-V. Their architecture has specialized reduce/aggregate logic and associated memory for vectors and matrices. They also made a C library for their instruction set. They saw some impressive speedups compared to plain RISC-V (~12x?), x86 (~5x?), and Fermi GPUs (~10x?) in the HPC Challenge benchmarks (where, obviously, they did have to update some of the code to use their instructions).
I was disappointed in not seeing 128 bit sums.