Performance: SIMD, Vectorization and Performance Tuning [video]

exDM69 · on Jan 21, 2017

Watching this makes me sad that there are no languages with first-class SIMD vector types that enable writing portable SIMD code for different CPU instruction sets (SSE, AVX, NEON, etc.). The closest thing to what I want is the C vector extensions available in GCC and Clang [0] (you still need some compiler-specific #ifdefs). GPU and shader languages (GLSL, OpenCL C) have somewhat better support, but I want that on the CPU too.

Here's a list of my requirements:

1. Built-in types for floating point and integer vectors (compile time constant width). E.g. float32x4_t or int64x2_t. Maybe have some matrix types too.

2. Normal infix operators for arithmetic (+, -, *, /). You can do this with C [1]. Built-in syntax for vector shuffles (can't do this in C) [2].

3. Compile-time polymorphism to make vector-width-agnostic code. If you write sin4f and sin8f (in C), they are line-by-line identical except for the types. You should be able to write a single sin() function that works for any vector width.

4. A standard library that has all the usual libm math functions (sin, cos, log, exp, asin, atanh, etc). I could do with less-than-perfect precision for performance (at least if -ffast-math is enabled).

5. A standard library for some basic vector and matrix operations for static-sized vectors and matrices. E.g. dot product, matrix-matrix product, matrix-vector product, inverse matrix, etc.
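As a rough illustration of requirements 3 and 5 with today's tools, one workaround in plain C is to stamp out the same function body for each vector type with a macro. The sketch below assumes GCC or Clang vector extensions; `DEFINE_VEC_OPS`, `lerp_f4`, `dot_f4` and friends are hypothetical names made up for this example, not part of any library. Without AVX enabled the compiler is expected to lower the 8-wide type to narrower operations (GCC may also print an ABI note).

```c
#include <stddef.h>

typedef float float32x4_t __attribute__((vector_size(16)));
typedef float float32x8_t __attribute__((vector_size(32)));

/* Hypothetical helper macro: generates the same function bodies for a
   given vector type, so the 4-wide and 8-wide versions share one source. */
#define DEFINE_VEC_OPS(suffix, vec_t)                                   \
    /* Lane-wise linear interpolation: a + (b - a) * t */               \
    static inline vec_t lerp_##suffix(vec_t a, vec_t b, vec_t t)        \
    {                                                                   \
        return a + (b - a) * t;                                         \
    }                                                                   \
    /* Dot product: multiply lane-wise, then sum the lanes */           \
    static inline float dot_##suffix(vec_t a, vec_t b)                  \
    {                                                                   \
        vec_t p = a * b;                                                \
        float sum = 0.0f;                                               \
        for (size_t i = 0; i < sizeof(vec_t) / sizeof(float); i++)      \
            sum += p[i];                                                \
        return sum;                                                     \
    }

DEFINE_VEC_OPS(f4, float32x4_t) /* defines lerp_f4 and dot_f4 */
DEFINE_VEC_OPS(f8, float32x8_t) /* defines lerp_f8 and dot_f8 */
```

This gets the "write it once" part, but not the single overloaded name the list asks for: callers still have to pick `dot_f4` or `dot_f8` by hand.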

I have some hope for Rust, which has been working on some SIMD stuff. But the current iteration doesn't fulfill most of my requirements.

[0] https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

[1] You can do this:

    typedef float float32x4_t __attribute__((vector_size(16)));
    float32x4_t a = { 1, 2, 3, 4 }, b = { 5, 6, 7, 8 }, c = (a+b)*(a-b);

[2] You'll need some #ifdefs around __builtin_shuffle (GCC) and __builtin_shufflevector (Clang). Something like my_vec.xxyy, similar to GLSL, would be nicer.
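A minimal sketch of what those #ifdefs might look like, assuming 4-wide float vectors and constant lane indices. `SWIZZLE4` and `swizzle_xxyy` are hypothetical names for this example; only `__builtin_shuffle` and `__builtin_shufflevector` come from the compilers. Clang is checked first because it also defines `__GNUC__`.

```c
#include <stdint.h>

typedef float   float32x4_t __attribute__((vector_size(16)));
typedef int32_t int32x4_t   __attribute__((vector_size(16)));

#if defined(__clang__)
/* Clang: lane indices are passed as constant integer arguments */
#define SWIZZLE4(v, i0, i1, i2, i3) \
    __builtin_shufflevector((v), (v), (i0), (i1), (i2), (i3))
#else
/* GCC: lane indices are passed as an integer vector mask */
#define SWIZZLE4(v, i0, i1, i2, i3) \
    __builtin_shuffle((v), (int32x4_t){ (i0), (i1), (i2), (i3) })
#endif

/* Rough equivalent of GLSL's v.xxyy */
static inline float32x4_t swizzle_xxyy(float32x4_t v)
{
    return SWIZZLE4(v, 0, 0, 1, 1);
}
```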

nuntius · on Jan 21, 2017

Take a look at Halide. I think it is an excellent DSL, covering the base of what you describe, and well positioned for extension to the rest. If nothing else, the documentation is a summary of a wide range of optimization techniques. Written as a C++ library, it also supports dumping an object file with C-style header.

http://halide-lang.org/

Another option in the area is OpenMP. It has a wider base of users and contributors, but I think the abstractions are not as good.

http://www.openmp.org/

If you can move to a completely new programming language, then Chapel is built to easily scale up across large supercomputer clusters.

http://chapel.cray.com/

exDM69 · on Jan 21, 2017

These are all great options for massive parallelism, but that's not what I'm after.

I want explicit SIMD with 2/4/8/16 wide vectors, primarily to be used with 3d graphics and physics calculations.
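To make "explicit SIMD" concrete, here is a small sketch of the kind of code meant, again using the GCC/Clang vector extensions: one integration step for a particle whose position and velocity are each a 4-wide vector. `particle_t` and `euler_step` are hypothetical names invented for this example.

```c
typedef float float32x4_t __attribute__((vector_size(16)));

/* Hypothetical particle: x, y, z in lanes 0..2, lane 3 unused */
typedef struct {
    float32x4_t pos;
    float32x4_t vel;
} particle_t;

/* One semi-implicit Euler step: update velocity first, then position */
static inline particle_t euler_step(particle_t p, float32x4_t accel, float dt)
{
    float32x4_t dtv = { dt, dt, dt, dt }; /* broadcast dt to all lanes */
    p.vel = p.vel + accel * dtv;
    p.pos = p.pos + p.vel * dtv;
    return p;
}
```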

trendia · on Jan 21, 2017

I use SIMDPP [0], which allows you to explicitly write SIMD instructions in a portable way. See the documentation [1] for the available commands. Specifically, I write code to be used on both x86 and ARM systems.

> libsimdpp is a portable header-only zero-overhead C++ wrapper around single-instruction multiple-data (SIMD) intrinsics found in many compilers. The library presents a single interface over several instruction sets in such a way that the same source code may be compiled for different instruction sets. The resulting object files then may be hooked into internal dynamic dispatch mechanism.

> The library resolves differences between instruction sets by implementing the missing functionality as a combination of several intrinsics. Moreover, the library supplies a lot of additional, commonly used functionality, such as various variants of matrix transpositions, interleaving loads/stores, optimized compile-time shuffling instructions, etc. Each of these are implemented in the most efficient manner for the target instruction set. Finally, it's possible to fall back to native intrinsics when necessary, without compromising maintainability.

[0] https://github.com/p12tic/libsimdpp

[1] http://p12tic.github.io/libsimdpp/v2.0%7Erc2/libsimdpp/

nuntius · on Jan 22, 2017

Halide is for explicit SIMD, and a couple of the others provide good support for it as well. These tools are made by graphics and physics optimization people. Look at the examples.

CyberDildonics · on Jan 21, 2017

You should take a good look at ISPC, a C-like language bought by Intel for exactly this purpose. It compiles to tiny .o files or intrinsics-filled (.h/.c?) files.

exDM69 · on Jan 21, 2017

My understanding is that ISPC is a special compiler that takes a function and emits a SIMD function that computes the original function 4/8 times (and repeats). And it's Intel only.

It might be great for some uses, but it's not what I'm looking for.

CyberDildonics · on Jan 21, 2017

I don't think that's true. It compiles C-like programs to use SIMD units more effectively using a 'varying' keyword. I've used it to make programs that run very fast. It may be x64 only.
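For readers unfamiliar with the per-lane model being argued about here, the following is a hand-written C sketch (not ISPC code, and not what ISPC emits) of how a scalar branch can be evaluated for four values at once: both sides are computed and a comparison mask picks the result per lane. It assumes GCC/Clang vector extensions; `select4` and `branchy4` are hypothetical names for this example.

```c
#include <stdint.h>

typedef float   float32x4_t __attribute__((vector_size(16)));
typedef int32_t int32x4_t   __attribute__((vector_size(16)));

/* Scalar version: what you would write for a single value */
static inline float branchy1(float x)
{
    return x < 0.0f ? -x : x * 2.0f;
}

/* Per-lane select: take a where mask is all ones, b where it is zero */
static inline float32x4_t select4(int32x4_t mask, float32x4_t a, float32x4_t b)
{
    /* Casts between same-sized vector types reinterpret the bits */
    return (float32x4_t)(((int32x4_t)a & mask) | ((int32x4_t)b & ~mask));
}

/* 4-wide version: compute both branches, then pick per lane */
static inline float32x4_t branchy4(float32x4_t x)
{
    float32x4_t zero = { 0.0f, 0.0f, 0.0f, 0.0f };
    float32x4_t two  = { 2.0f, 2.0f, 2.0f, 2.0f };
    int32x4_t   neg  = x < zero; /* -1 in lanes where x < 0, else 0 */
    return select4(neg, -x, x * two);
}
```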

jackmott · on Jan 21, 2017

.NET languages come close to what you want with System.Numerics, but it lacks complete coverage of SIMD operations. But the basics are there, and it can be very useful at times.

https://msdn.microsoft.com/en-us/library/dn858218(v=vs.111)....

dnautics · on Jan 21, 2017

It seems like what you are looking for would be fairly easy to implement in Julia, if it hasn't been already.

Intel has contributed to the @simd macro in the language, and I think that if you just use the built-in type it "knows what to do". You do have to prefix with the @simd macro (feels similar to the "#pragma"s talked about in the talk).

https://software.intel.com/en-us/articles/vectorization-in-j...

dragandj · on Jan 21, 2017

You'll find almost all of those in OpenCL.

exDM69 · on Jan 21, 2017

I know. I want to use them on the CPU, without having to deal with a big runtime system like OpenCL.

OpenCL is still missing the vector width polymorphism. The others are more or less covered.

This is a matter of language standardization and some implementation work. I'm not aware of anyone working on this.

theparanoid · on Jan 21, 2017

Mike Acton's CppCon talk "Data-Oriented Design and C++" [1] is also good.

[1] https://www.youtube.com/watch?v=rX0ItVEVjHc