
I spent a lot of the past month improving the RenderScript (Android data-parallel compute, using C99 plus vectors) codegen to better work with the LLVM vectorizers, so I have a fair amount of experience with this exact question.

The vector types help in some ways--if you use only vectors and do arithmetic only on entire vectors, the SIMD codegen already works great--but they don't really help more than an intelligent vectorizer could. For example, the LLVM loop vectorizer can only handle loops where the induction variable changes by 1 or -1 each iteration. As a result, a for loop that sets A[i], A[i+1], and A[i+2] with i += 3 each iteration won't be vectorized. If it could be, you wouldn't really need vec3 in the first place. (Also, the presence of any vector types whatsoever prevents any vectorization in LLVM right now, so...)
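To make the loop shape concrete, here is a minimal C sketch of the stride-3 pattern described above, next to the same work written with an induction variable that steps by 1. The function names and the `gain` parameter are made up for illustration, and the sketch makes no claim about which form any particular LLVM version will actually vectorize.

```c
#include <stddef.h>

/* Stride-3 form: the induction variable i advances by 3 per iteration,
 * which is the shape the comment says the loop vectorizer rejects. */
void scale_rgb_stride3(float *a, size_t n_pixels, const float gain[3]) {
    for (size_t i = 0; i < 3 * n_pixels; i += 3) {
        a[i]     *= gain[0];
        a[i + 1] *= gain[1];
        a[i + 2] *= gain[2];
    }
}

/* Same work with a unit-step induction variable p (one pixel per
 * iteration). The memory accesses are still strided by 3 floats. */
void scale_rgb_per_pixel(float *a, size_t n_pixels, const float gain[3]) {
    for (size_t p = 0; p < n_pixels; ++p) {
        a[3 * p + 0] *= gain[0];
        a[3 * p + 1] *= gain[1];
        a[3 * p + 2] *= gain[2];
    }
}
```

Both functions compute the same result; the only difference is how the loop counter moves, which is the part the vectorizer cares about here.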

Another issue is that using vectors for storage and vectors for computation are very different things. I don't think anyone actually likes using vectors for computation, but vectors for storage make a lot of sense in some fields like image processing. However, as soon as you start operating on single channels, things get messy. Let's say you have a for loop where each iteration operates on a float4 as individual channels. Let's also say you want to turn each of those single channel operations into its own float4 operation, vectorizing across four iterations of the loop. Given most current SIMD ISAs, you're going to have to do four 32-bit loads of A[i].x, A[i+1].x, etc., then pack that into a vector, do your arithmetic ops, unpack the vector, and do four 32-bit writes to memory. Unsurprisingly, this is not particularly fast, and if you're doing one mul or FMA per channel, you shouldn't be vectorizing at all. This is why you see cost models in vectorizers (to prevent this sort of packing/unpacking from killing your performance when you enable your vectorizer) as well as why you see newer SIMD ISAs like AVX2 in Haswell including support for scatter/gather and permute.
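A rough scalar C sketch of that pack/compute/unpack sequence, assuming a hypothetical `float4_t` struct standing in for a float4 and a made-up `scale_x_aos` function. The temporary `lane` array plays the role of the SIMD register; real codegen would use the target's vector instructions instead.

```c
#include <stddef.h>

/* Hypothetical stand-in for a 4-channel float vector stored in memory. */
typedef struct { float x, y, z, w; } float4_t;

/* Scale only the x channel, "vectorized" across four loop iterations. */
void scale_x_aos(float4_t *a, size_t n, float k) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float lane[4];
        /* Pack: four separate 32-bit loads, each 16 bytes apart. */
        for (int j = 0; j < 4; ++j) lane[j] = a[i + j].x;
        /* The single arithmetic op that vectorization actually speeds up. */
        for (int j = 0; j < 4; ++j) lane[j] *= k;
        /* Unpack: four separate 32-bit stores back to memory. */
        for (int j = 0; j < 4; ++j) a[i + j].x = lane[j];
    }
    /* Scalar remainder when n is not a multiple of 4. */
    for (; i < n; ++i) a[i].x *= k;
}
```

With eight memory operations wrapped around one multiply, it is easy to see why a cost model would refuse to vectorize this.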

The last issue is that it's trivial to break a vectorizer because of easy-to-overlook things like aliasing. Missing a single restrict will prevent vectorization if the compiler can't prove that pointers won't alias, for example (why do you think people still use Fortran in HPC?). There's actually been a lot of great work here in LLVM over the past few months with new aliasing metadata for LLVM IR (http://llvm.org/docs/LangRef.html#noalias-and-alias-scope-me...), which is what I used to make the SLP vectorizer work with RenderScript in a way similar to ISPC (except in the compiler back end instead of the front end, because we don't even know the target ISA at the time the RS source is compiled to LLVM IR). I'll probably get the patch in AOSP in the next week or two if you want to keep an eye on that; it needs a newer LLVM version than what shipped in L and we're finishing up that rebase.
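As a small illustration of the restrict point, here is a C99 sketch with two made-up functions that differ only in their pointer qualifiers. Whether a given compiler vectorizes the first one anyway (for example by emitting a runtime overlap check) depends on the compiler and flags, so treat this as showing the source-level difference only.

```c
#include <stddef.h>

/* Without restrict, the compiler must assume dst and src may overlap,
 * so a store to dst[i] could change a later src[i + 1]. */
void add_may_alias(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}

/* With restrict, the caller promises the two arrays do not overlap,
 * so iterations are independent. Breaking that promise is undefined
 * behavior. */
void add_no_alias(float *restrict dst, const float *restrict src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] += src[i];
}
```

For non-overlapping arrays the two functions give identical results; the qualifier only changes what the compiler is allowed to assume.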

(honestly, I think that if you're trying to get good server CPU performance and you know exactly what CPU you're going to be using at compile time, you should be looking at ispc instead of doing SSE intrinsics yourself: https://ispc.github.io/ )




Fabulous comment, please keep posting more and sharing your knowledge/experience. And thanks for the re-reminder of 'ispc'. A couple tiny additions:

While loading each element with an individual 32-bit load usually negates the advantage of vectorization, if you are using 8-bit data, doing a single vector load and shuffling bytes within a 128-bit vector is really fast with AVX. And shuffling double- and quad-words within 256-bits with AVX2 can be a useful alternative even if you have to do it twice and combine. But even better is to rearrange your data layout to match the operations you are going to be doing.
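One common way to rearrange the layout is to go from an array of structs to a struct of arrays, so that each channel is contiguous in memory. A minimal C sketch, assuming hypothetical `pixel_aos` and `image_soa` types and helper names invented for this example:

```c
#include <stddef.h>

/* Hypothetical interleaved layout: x y z w x y z w ... */
typedef struct { float x, y, z, w; } pixel_aos;

/* Hypothetical planar layout: one contiguous array per channel.
 * The caller owns the four buffers, each holding n floats. */
typedef struct { float *x, *y, *z, *w; } image_soa;

/* One-time conversion from interleaved to planar storage. */
void aos_to_soa(const pixel_aos *in, image_soa out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out.x[i] = in[i].x;
        out.y[i] = in[i].y;
        out.z[i] = in[i].z;
        out.w[i] = in[i].w;
    }
}

/* Per-channel work is now a unit-stride loop over contiguous floats,
 * which needs no packing or unpacking to fill a vector register. */
void scale_x_soa(image_soa img, size_t n, float k) {
    for (size_t i = 0; i < n; ++i)
        img.x[i] *= k;
}
```

The conversion itself costs a pass over the data, so this pays off mainly when several per-channel operations run on the planar form before converting back.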

AVX2 doesn't actually offer scatter support, although AVX-512 will. And AVX2 gather on Haswell normally isn't any faster than multiple individual loads, although it's no worse and the expectation is that future generations will improve on this significantly. Also worth pointing out is that 256-bit YMM vectors are conceptually split in two, so apart from the slightly slower 32/64/128-bit permute instructions, it's not possible to move data from one 128-bit 'lane' to the other.


Sorry, the desktop Intel SIMD ISAs are generally not something I use on a day-to-day basis, so I get them mixed up a lot. (LRBni was the last one I looked at for any serious length of time.)



