
> Code written for too high a SIMD capability will generally crash (it’s undefined behavior), while code written for too low a SIMD capability will fall short of the performance available.

I can't help but think that SIMD intrinsics are the wrong level of abstraction. They're literally assembly moved into a high-level language, but without the ability for the compiler to optimize it well. Hand-written intrinsics can really be fine-tuned for a given microarchitecture, but I think XLA's approach of having a higher level of abstraction, then JIT compiling to the device in question (whether it be SIMD of some width, a GPU, or a TPU), is the right way to go.

Whenever I see hand coded assembly, I can't help but think to myself:

* Did the author actually have instruction scheduling information on hand when writing this?

* For what generation of chips was this found to be optimal, and is it still? Invariably, you wind up with hand-coded assembly routines that were written a long time ago still in use, without anyone revisiting the code or its performance.



The compiler can optimize intrinsics, though. Why shouldn't it be able to?

I somewhat agree that with AVX2, and especially with AVX-512, there's probably less reason to write intrinsics if you have a vector language. But for now SSE2, SSSE3, and SSE4 are still the bread and butter for SIMD on x86, and there are some important instructions (Fabien Giesen goes into detail at [1]) that you really have to think about how to use effectively. For example (this is mentioned at the end), all horizontal adds on x86 are awful except for PSADBW, which is really limited: Intel designed it in a fit of myopia to target only motion estimation in contemporary codecs, and it requires you to basically design your whole algorithm around it.

[1]: https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/
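
To make the PSADBW trick concrete, here's a rough sketch (SSE2 only; the helper name is mine) of the usual way to horizontally sum 16 bytes by taking the sum of absolute differences against zero:

  #include <emmintrin.h>  // SSE2
  #include <stdint.h>

  // Horizontal sum of 16 unsigned bytes via PSADBW: the sum of absolute
  // differences against an all-zero vector is just the sum of the bytes,
  // delivered as two 64-bit partial sums (one per 8-byte half).
  static inline uint32_t sum_bytes_sse2(__m128i v)
  {
      __m128i sad = _mm_sad_epu8(v, _mm_setzero_si128());
      __m128i hi  = _mm_unpackhi_epi64(sad, sad);     // move upper half down
      return (uint32_t)_mm_cvtsi128_si32(_mm_add_epi64(sad, hi));
  }

Anything wider than bytes gets no such shortcut; you're back to chains of shuffles and adds.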


I find intrinsics to be a useful level of abstraction because it allows me to control and automate the code generation in C++ without having to know the details of register scheduling, etc. It also means I don't have to hand write intrinsic algorithms for every architecture -- I have template libraries that can take a rough specification of the architecture characteristics and generate approximately appropriate intrinsic code. It is neither hand writing all the architecture-specific intrinsic code nor trusting the compiler to do a good job. I started working this way when I used to target code at a really weird mix of exotic vector/parallel computing hardware, and it worked nicely.
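
As a rough sketch of what I mean (the trait and kernel names here are hypothetical, not from any particular library): the architecture characteristics collapse into a handful of trait specializations, and the algorithm is written once against the trait.

  #include <immintrin.h>
  #include <cstddef>

  // Describe each target once as a trait...
  template <int Lanes> struct simd_traits;

  template <> struct simd_traits<4> {                    // SSE
      using reg = __m128;
      static reg  load (const float* p)  { return _mm_loadu_ps(p); }
      static reg  add  (reg a, reg b)    { return _mm_add_ps(a, b); }
      static void store(float* p, reg v) { _mm_storeu_ps(p, v); }
  };

  #ifdef __AVX__
  template <> struct simd_traits<8> {                    // AVX
      using reg = __m256;
      static reg  load (const float* p)  { return _mm256_loadu_ps(p); }
      static reg  add  (reg a, reg b)    { return _mm256_add_ps(a, b); }
      static void store(float* p, reg v) { _mm256_storeu_ps(p, v); }
  };
  #endif

  // ...and write the kernel once, parameterized on the lane count.
  template <int Lanes>
  void vec_add(float* dst, const float* a, const float* b, std::size_t n) {
      using T = simd_traits<Lanes>;
      std::size_t i = 0;
      for (; i + Lanes <= n; i += Lanes)
          T::store(dst + i, T::add(T::load(a + i), T::load(b + i)));
      for (; i < n; ++i)                                 // scalar tail
          dst[i] = a[i] + b[i];
  }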

The real problem is that, in my experience, compilers universally do a poor job of auto-vectorization/parallelization unless the code pattern is blindingly obvious. All of the "higher level abstractions" I've seen are no better at finding vectorizable patterns than the compiler is -- which makes sense, since they are both just software. It ends up being syntactic sugar. The amount of effort required to trick compilers into doing the right thing, plus the things they won't do at all, is such that the complexity is worse than just writing the intrinsic code directly. Writing a lot of intrinsic code directly is far from optimal, but fortunately those intrinsics are available in languages like C++, which can abstract much of that away.

Compilers will not consistently generate competent auto-vectorized code for the foreseeable future. I've seen very little forward progress in the decade I've been exposed to the problem. There are still some relatively simple non-vectorized code patterns that compilers currently have a difficult time detecting and generating optimized code for.


> All of the "higher level abstractions" I've seen are no better at finding vectorizable patterns than the compiler does

I think a lot of that comes from C/C++ language corners.

Have you seen XLA (https://www.tensorflow.org/performance/xla/) or Glow (https://facebook.ai/developers/tools/glow)? These are very high level abstractions called from C++ that can do a great job vectorizing high level operations for the given device.


As far as I know from reading papers, Fortran compilers do a much better job thanks to the language semantics.


For the simple SIMD cases where the computation doesn't have any horizontal dependencies (except for a total reduction step), there's little need for any sort of intrinsic usage. (Although you may need some compiler hints to push it to autovectorize).
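
For example, something like this dot product usually autovectorizes fine without intrinsics (a sketch; __restrict and the OpenMP simd pragma are the kind of hints I mean, and the float reassociation may also need -ffast-math):

  #include <cstddef>

  // Purely "vertical" work plus one total reduction: no intrinsics required.
  float dot(const float* __restrict a, const float* __restrict b, std::size_t n)
  {
      float sum = 0.0f;
      #pragma omp simd reduction(+:sum)
      for (std::size_t i = 0; i < n; i++)
          sum += a[i] * b[i];
      return sum;
  }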

The real problem is that SIMD instruction sets tend to have lots of instructions that mix horizontal lanes. The scope and performance of these mixing instructions vary greatly from implementation to implementation, and these kinds of instructions are hard for compilers to pick up automatically. It's that case where you need SIMD intrinsics to be exposed in order to use the hardware effectively.


I think so too, that SIMD is too low-level to be utilized effectively. Luckily joe_the_user enlightened me with this little gem on a previous post: https://news.ycombinator.com/item?id=17419917

Or for the lazy:

I don't know about TensorFlow in particular, but there are little-known methods of running "general purpose" parallel programs on GPUs. Specifically, H. Dietz' MOG, "Mimd on GPU". It's a shame the project hasn't gotten more attention imo.

http://aggregate.org/MOG/

See: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy for explanations of terms.


Cool!

I have "evangelized" one person at least.

The thing about MOG is it's a demonstrated technique that Dietz has not yet developed into a fully releasable product. His latest comment says it's six months from release but lacks funding.

I think the problem might be that most "MIMD" programming is for things like weather simulations, where access to a MIMD supercomputer is standard and price isn't that much of an issue.

That said, cheap SIMD parallel computing jump-started today's deep learning advances, and so cheap MIMD might do things no one anticipated.

Still, very niche. But thanks for the mention.

Edit: Note, MOG specifically targets GPUs. Dietz did earlier work on other machines in the 90s, but this is GPU-specific (though for a less flexible architecture than what Nvidia has evolved to now).


What do you mean by effectively? Too hard for the programmer? Because what most of us find when we try to do SIMD stuff is that anything higher-level than intrinsics is too hard to use effectively, where effective means good runtime performance.

I agree there should be something higher level we could use but it doesn't exist yet, to my knowledge.


Ya too hard for the programmer. What it comes down to is that there are a few main categories of multiprocessing from lowest to highest level:

1. SIMD - manually deal with packed elements (like in SSE, MMX etc)

2. DSP - Maybe someone knows a better term for this, but treating each slice of the data array as an independent serial stream (shaders, OpenCL, CUDA)

3. MIMD - freeform vector/matrix operations that get compiled down to the first two categories (MATLAB, GNU Octave, Scilab)

I'm not sure if TensorFlow fits best in 2 or 3 but my gut feeling is that it's closest to 2. The problem with the lower level abstractions is that it's more work to format the data for the problem space.

So with MATLAB, everything is a vector and operations applied to each vector happen in parallel across all elements. Then if you need to, you can drop down to less efficient code and operate on elements of the vector manually with C-like code.

Unfortunately that becomes more difficult in shaders, because you can't just magically access a neighboring element. And getting general computation to work with SIMD is often infeasible because you have to rewrite your code and potentially alter the layout of the data in memory to achieve better performance.
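
A small illustration of the layout point (hypothetical types, just to show the shape of it): the same data as array-of-structures versus structure-of-arrays, where only the latter gives each field a contiguous, vectorizable stream.

  #include <vector>

  // Array-of-structures: fields are interleaved, so vectorizing a loop over
  // just .x means gathers or shuffles.
  struct ParticleAoS { float x, y, z, w; };
  std::vector<ParticleAoS> particles_aos;

  // Structure-of-arrays: each field is contiguous, so a loop over xs[i]
  // maps directly onto full-width vector loads.
  struct ParticlesSoA {
      std::vector<float> xs, ys, zs, ws;
  };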

I also agree that currently nothing really exists to give us general-purpose vector math akin to MATLAB within a language like C.


The Eigen C++ library comes somewhat close. But remember, MATLAB is terrible at using multiple processors, so most people using MATLAB won't see anywhere near the performance people get with C intrinsics. This isn't because MATLAB isn't capable; it's that I've found virtually no MATLAB users try to write parallel code.


Half the point of using SIMD intrinsics rather than embedding assembly is that register allocation and instruction scheduling are performed by the compiler. Using intrinsics certainly does not guarantee that exactly the corresponding instructions will be emitted in exactly the given order.


> Using intrinsics certainly does not guarantee that exactly the corresponding instructions will be emitted in exactly the given order.

I hope they will fix the compilers to add such guarantees. Currently, every time compilers try to mess with manually-written intrinsics, they decrease performance.

See this bug about LLVM failing to emit the exact instructions, decreasing the performance: https://bugs.llvm.org/show_bug.cgi?id=26491

See this bug about VC++ failing to emit them in the given order, again decreasing performance: https://developercommunity.visualstudio.com/content/problem/...


As usual, things aren't quite that simple. While it certainly can happen that the compiler can pessimize carefully crafted SIMD code by applying undesirable transformations, simply not optimizing intrinsics would lead to other problems.

If you consider SIMD code that spans more than just one small function, then it is quite useful if the compiler can apply SROA (scalar replacement of aggregates: e.g. you pass around an array of vectors instead of having 8 arguments to each function, but still want them to end up in registers eventually), LICM (loop-invariant code motion: you're calling a function that needs a constant vector in a loop) or otherwise optimize your code.
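
A minimal sketch of the LICM case (illustrative only): the splat below is loop-invariant, and a compiler that's allowed to optimize intrinsics can hoist it out of the loop, while one that treats intrinsics as opaque has to re-materialize it every iteration.

  #include <immintrin.h>
  #include <cstddef>

  void scale(float* dst, const float* src, std::size_t n)
  {
      for (std::size_t i = 0; i + 4 <= n; i += 4) {      // tail handling omitted
          __m128 k = _mm_set1_ps(2.5f);                  // loop-invariant splat
          _mm_storeu_ps(dst + i, _mm_mul_ps(_mm_loadu_ps(src + i), k));
      }
  }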

If you want the compiler to leave alone your intrinsics, link in an assembly file.


> it is quite useful if the compiler can SROA

__attribute__((always_inline)) / __forceinline usually help.

> LICM (you're calling a function that needs a constant vector in a loop)

I can calculate that vector outside of the loop.

> link in an assembly file.

Way more complex: I'd need to write the outer scalar code in assembly too and manually allocate registers, and C++ templates are sometimes very useful for that kind of code.

In 99% of cases intrinsics are good enough for me, but I would love the compilers to leave alone my intrinsics.


Can’t you just isolate those in a compilation unit built with -O0?


Interesting idea, will try next time.

I usually want to optimize the scalar code outside of the manually vectorized loop bodies. A function call into that external compilation unit will be slower than the inlining I get when everything is in the same unit. However, it could be that the call overhead is small enough; obviously I need to profile.


On GCC-like compilers, you could just use inline assembly. That definitely won’t get rewritten into some other instruction sequence, and it can handle things like register allocation and loads/stores for you. Downsides include that the compiler won’t be able to estimate instruction timings, the ease of screwing up the input/output notation, and that MSVC doesn’t support inline assembly on x64 at all.
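
A minimal sketch of that style (GCC/Clang extended asm; the "x" constraint means "any XMM register", so the compiler still does register allocation around the pinned instruction):

  #include <emmintrin.h>

  // The instruction is fixed, but operand selection, spills, and the
  // surrounding loads/stores are still up to the compiler.
  static inline __m128i add_i32(__m128i a, __m128i b)
  {
      asm("paddd %1, %0" : "+x"(a) : "x"(b));
      return a;
  }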


As someone coming from CUDA I think you're on the right path. CUDA unifies multicore and vectorization parallelism. Crucial here is hardware support for branching (including early returns), even if it degrades performance. This allows a CUDA kernel to be programmed in a programmer-friendly way that is often already fairly performant, while the actual mapping to hardware is deferred to kernel invocation, where it's straightforward to specialize. I think this is something CPU programming could learn from GPUs. Why not adopt the concept of kernels?


OpenCL on CPU is exactly this and does very well. Last real world test I did, an 8 core Haswell and a GTX 970 were similar in performance, for the kernel I was running. Caveats apply of course but it was refreshing to have a single programming model.


I did some evaluation of (Linux based) OpenCL implementations two years ago (but I don't think anything changed significantly since then).

I had a few takeaways:

- OpenCL on GPUs and OpenCL on CPUs have little to do with each other (in terms of performance characteristics), and if you tune well for one of them, the other one will suffer.

- Vectorization of work items doesn't really work well unless your kernel is so simple that normal compiler auto vectorization with a loop would have probably worked just as well if not better.

- Nvidia intentionally makes OpenCL a second class citizen vs. CUDA. I had nearly identical (simple) kernels running on both platforms and only the CUDA one managed to saturate memory throughput.

- The whole ecosystem is mostly more effort than it's worth. Portability between different OpenCL implementations is a gamble, some will even silently compute invalid results (I'm looking at you Intel...). I had kernel hangs with both Nvidia and AMD.


> if you tune well for one of them, the other one will suffer.

This is one of the things that bothers me the most about OpenCL. It attempts to offer this uniform abstraction over a generic compute accelerator, which can be CPU vector extensions, GPUs, or FPGAs, but these things are different enough that you have to develop for a specific type if you want reasonable performance. So you get none of the benefits of an accelerator-specific abstraction while still writing accelerator-specific code.

There is a real cost to a generic abstraction, and distinct languages/platforms would in my mind be better than different "dialects" of the same language/platform that pretend to be compatible but really aren't.

I like that CUDA is very clearly designed to run only on GPUs - it provides a clarity that OpenCL lacks.


The Intel OpenCL driver vectorized work items really well, better than the same code provided to ICC as C. It was still more fragile than CUDA.

The other points are spot on and I would add that debugging code in OpenCL is a bad experience.


I agree that this space needs to be explored more. ISPC is basically this, and it had some great ideas, it just needs to be more broadly available and integrated into build systems.


Would a better example be OpenCL (at least the vision of it)? OpenCL kernels can run on both CPU and GPU, and unlike CUDA, it's not vendor locked. Though I've only really used CUDA, I don't know how comparable OpenCL is these days.


One problem is that although OpenCL can run on both CPU and GPU, OpenCL code that performs well on CPU and OpenCL code that performs well on GPU can be different.


I think what you're looking for is ISPC.[0][1] It distinguishes between 'uniform' (scalar) vs 'varying' (vector) variables[2] and applies SIMD instructions pretty much transparently. It defaults to varying, whereas I think for a general-purpose C-like compiler you'd want scalar by default and explicitly declare eg:

  vector float add2(float base,vector float offs)
    {
    return base + 2*offs;
    }
  double sum(double* ar,size_t ln)
    {
    vector double x = (vector)0.0;
    while(ln >= sizeof vector) /* == # of simd lanes */
      {
      x += *ar; /* maybe this needs some kind of cast? */
      ln -= sizeof vector; /* == 1 on arches with no simd support */
      }
    /* this part's definitely not done properly */
    double buf[sizeof vector]; *buf = x;
    double ret = 0.0; /* slurp up the last few */
    for(size_t i=0; i<ln ;i++) ret += buf[i] + ar[i];
    return ret;
    }
0: https://pharr.org/matt/blog/2018/04/18/ispc-origins.html

1: https://ispc.github.io/

2: https://pharr.org/matt/blog/2018/04/26/ispc-volta-more-on-pe...


Yes, SIMD intrinsic programming is often worthless unless implemented for every target architecture. Let me give you an example case: my target production architecture rarely changes, but I want my program to run on every developer's computer, where the CPUs are heterogeneous.

Now, I think the problem is that it ties optimizations to instructions rather than to CPU architectures. When I ported code from an i7 to an i9, it seemed that instruction latencies changed. Still, if you are only supporting limited computational targets, any hooks at all help.

But why intrinsics rather than assembly? Because usually the vector extensions in GCC work well enough, and when they don't, intrinsics are easier for me (a man who rarely needs to write assembly and doesn't have the time to beat the compiler in 99.99% of cases).
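
As a rough sketch of what I mean by the vector extensions being good enough (the typedef and the memcpy-as-unaligned-load idiom are the usual GCC/Clang conventions; the names are mine):

  #include <cstddef>
  #include <cstring>

  typedef float v8f __attribute__((vector_size(32)));   // 8 floats

  void saxpy(float* dst, const float* x, const float* y, float a, std::size_t n)
  {
      v8f va = { a, a, a, a, a, a, a, a };               // splat the scalar
      std::size_t i = 0;
      for (; i + 8 <= n; i += 8) {
          v8f vx, vy;
          std::memcpy(&vx, x + i, sizeof vx);            // unaligned load
          std::memcpy(&vy, y + i, sizeof vy);
          v8f r = va * vx + vy;                          // element-wise ops
          std::memcpy(dst + i, &r, sizeof r);
      }
      for (; i < n; ++i) dst[i] = a * x[i] + y[i];       // scalar tail
  }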

And a minor point... I really wish GCC had a pragma which let me specify that a function should be optimized for multiple targets, with the dispatch thunks installed automatically.


GCC has function multiversioning. See: https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/Function-Multiv... and since release 6 quite comfortably: https://lwn.net/Articles/691932/
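
For example (a sketch; as far as I know this needs GCC 6+ and an ifunc-capable target such as x86-64 with glibc):

  #include <cstddef>

  // GCC emits one clone per listed target plus a resolver that picks the
  // best one at load time: effectively the "pragma plus automatic thunks"
  // asked for above.
  __attribute__((target_clones("avx2", "sse4.2", "default")))
  void add(float* dst, const float* a, const float* b, std::size_t n)
  {
      for (std::size_t i = 0; i < n; ++i)
          dst[i] = a[i] + b[i];
  }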


Thank you so much for sharing this, I did not know about it. I guess I will have to look for the GCC version which supported this on x86_64... I might have to upgrade distros for GCC 6 (older versions were i386).

Btw, this was a nice presentation of the idea: https://blog.linuxplumbersconf.org/2016/ocw/system/presentat...


Clang supports vector types for C and they're still nowhere near as fast as using intrinsics. For whatever various reasons, compilers tend to do pretty poorly at producing properly vectorized code. I think part of this is that it's hard to encode all the information you need to vectorize properly (this loop will run some power of 2 number of times, these pointers won't alias, etc)

Even using intrinsics is dicey sometimes - compilers spill registers more often than makes sense. I think the ideal situation would be if you could somehow hint at register scheduling without writing asm yourself, but compilers are still a long way from that.


I wonder how it would look with Fortran code, though.


Right, I've run into this a number of times, especially when porting code originally written for 32 bit to 64 bit. Besides compilers getting better, I think there's another factor - it used to be that a compelling reason to write asm is to get good utilization of a limited number of registers. But especially on aarch64 (and, in the not too distant future, AVX-512), there are a lot more.

Higher level abstractions are a good idea, but the problem is, are you able to exploit the capabilities that the chip exposes? For simple things like computing a scalar function elementwise over a vector, no problem, but a lot of the more interesting problems don't fit such simple templates.



