Hacker News

I had to look up Devectorize.jl: https://github.com/lindahua/Devectorize.jl

The point, I think, is not that vectorizing is slow. It's that you don't want to be myopic when you do it. Your point about time spent filling and copying vectors is independent of what Devectorize is trying to do.

Yes, when you vectorize code, you need a prologue and an epilogue that get the data into the correct format for the SIMD operations. This prologue and epilogue have a cost, and if it's larger than the gain from the SIMD operations, then it's not worth vectorizing. (This can happen if, say, you only have 4 elements to vectorize.) However, we have to do this cost analysis all the time. Figuring out when it's beneficial to vectorize code has been a problem for as long as we've vectorized code.

Devectorize.jl is not about that problem. Rather, it's about when you have a chain of expressions, and the operators in those expressions have implied loops. The naive thing to do is to myopically execute each one of those loops, creating and passing around temporary vectors. The Devectorize framework is given an entire expression, and is able to analyze where the implied loops are, and figures out how to express that computation in a single loop.
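A toy sketch of that difference in Python (the names here are illustrative, not Devectorize.jl's actual API): each operator running its own implied loop versus the whole expression fused into a single loop.

```python
# Naive "vectorized" evaluation: each operator runs its own implied loop
# and materializes a temporary list before the next operator sees it.
def naive_eval(x, mu):
    t1 = [xi - mu for xi in x]      # temporary #1: x - mu
    t2 = [ti * ti for ti in t1]     # temporary #2: (x - mu)^2
    return sum(t2)                  # third pass over the data

# Fused evaluation: the whole expression computed in one loop with no
# intermediate vectors -- the transformation a devectorizer performs
# on the expression tree.
def fused_eval(x, mu):
    acc = 0.0
    for xi in x:
        d = xi - mu
        acc += d * d
    return acc

x = [1.0, 2.0, 3.0, 4.0]
mu = sum(x) / len(x)                # 2.5
assert naive_eval(x, mu) == fused_eval(x, mu) == 5.0
```

Both produce the same result; the fused version simply touches memory once instead of three times.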

We can tell these two concepts are independent because the ideal situation is to first use Devectorize on an expression, and then vectorize the result!

For the record, the approach taken by Devectorize.jl is similar to the problem that expression templates (http://en.wikipedia.org/wiki/Expression_templates) in C++ try to solve. Through template trickery and operator overloading, we can put off evaluating an expression, avoiding the temporaries and unnecessary memory traversal that a naive execution implies. The Boost library uBLAS (micro Base Linear Algebra Subprograms, http://www.boost.org/doc/libs/1_55_0/libs/numeric/ublas/doc/...) uses this technique.
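Expression templates are a C++ technique, but the core idea, deferring evaluation through operator overloading so a whole expression can be evaluated element by element in one pass, can be sketched in a few lines of Python (all class names here are illustrative):

```python
class Vec:
    """A vector whose arithmetic builds lazy expression nodes
    instead of materializing temporaries."""
    def __init__(self, data):
        self.data = data
    def __getitem__(self, i):
        return self.data[i]
    def __len__(self):
        return len(self.data)
    def __add__(self, other):
        return LazyAdd(self, other)

class LazyAdd:
    """Node representing a + b; elements are computed on demand, so a
    chained sum like a + b + c traverses memory only once."""
    def __init__(self, a, b):
        self.a, self.b = a, b
    def __getitem__(self, i):
        return self.a[i] + self.b[i]
    def __len__(self):
        return len(self.a)
    def __add__(self, other):
        return LazyAdd(self, other)

def evaluate(expr):
    # One loop over the whole expression tree: no temporaries.
    return Vec([expr[i] for i in range(len(expr))])

a, b, c = Vec([1, 2]), Vec([10, 20]), Vec([100, 200])
result = evaluate(a + b + c)   # single pass, no intermediate vectors
assert result.data == [111, 222]
```

In C++ the same deferral happens at compile time through the type system, so the fused loop is visible to the optimizer with zero runtime dispatch.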




I think you are getting at this above and what I've written below is somewhat redundant with srean's followup, but to clarify, writing vectorized code is mostly orthogonal to using SIMD vector instructions to compute the result. A sufficiently smart compiler can turn a loop into SIMD instructions, and these days most compilers (including LLVM, which Julia uses) are pretty good at that.

The problem that Julia solves here is that many scientific programming environments require you to write vectorized code, because they are interpreted and so looping through the elements in a vector incurs massive interpreter overhead. In some cases, this can make your code much more difficult to understand than if you wrote an explicit loop, but even when it does not, it often forces you to give up performance that could be obtained if you could write the code in a devectorized manner.

For example, consider the problem of calculating the variance of a sample. To calculate the variance in MATLAB, one might write:

  mu = mean(x)
  sum((x - mu).^2) / (length(x) - 1)
Although I don't know the details of how MATLAB optimizes this code, I do know that it has roughly equivalent performance characteristics to Julia code that first computes x - mu, then computes abs2(x - mu), then sums the result. With an interpreter, you can't do much better, and this is actually how MATLAB's var function does it. In Julia, we compute the variance as:

  mu = mean(x)
  v = 0.
  for i = 1:length(x)
      v += abs2(x[i] - mu)
  end
  v / (length(x) - 1)
This is several times faster than MATLAB's included var function, because the computations performed on each element are cheap relative to the cost of memory access. If you wrap the loop in the @simd macro and add @inbounds to eliminate the bounds check when accessing x, the LLVM loop vectorizer will actually take that loop and translate it into SIMD instructions, making it even faster. (The @simd macro is necessary to tell LLVM that it's okay to vectorize, since accumulating into n variables and summing them at the end gives different [but usually more accurate] results compared to accumulating into a single variable due to floating point rounding.)
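To see why LLVM needs explicit permission, here is a Python sketch (illustrative only, not what @simd actually emits) of the reassociation a SIMD sum performs: accumulating into several independent lanes and combining them at the end, which is only bit-identical to a sequential sum when floating-point addition is associative, which it is not.

```python
def sequential_sum(x):
    # The order the source code specifies: one running accumulator.
    acc = 0.0
    for xi in x:
        acc += xi
    return acc

def lanewise_sum(x, lanes=4):
    # What a 4-wide SIMD sum effectively does: 4 independent partial
    # sums, combined at the end. This reassociates the additions.
    acc = [0.0] * lanes
    for i, xi in enumerate(x):
        acc[i % lanes] += xi
    return sum(acc)

x = [0.1] * 12
s1 = sequential_sum(x)
s2 = lanewise_sum(x)
# Both are very close to 1.2, but they need not be bit-identical,
# which is why the compiler must be told the reordering is allowed.
assert abs(s1 - 1.2) < 1e-9 and abs(s2 - 1.2) < 1e-9
```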

The promise of Devectorize.jl is that you will be able to write the first code fragment and have it transformed to the second, which would be neat indeed.


Thanks for providing more detail and context. I wonder how the clash in terminology came about! Who would have imagined naming things uniquely would be so hard (pun intended).

I understand that your example is entirely pedagogic, but here is a cautionary note for the unwary: (a) although it is tempting to fold the variance calculation into a single loop (accumulating the totals of x and x^2), (b) neither that obvious single-loop version nor the code above is a good way to compute variance if one cares about preserving precision, all the more so if x has a wide dynamic range. In large-scale problems this does raise its ugly head, and these bugs are difficult to catch because the code is mathematically correct. Using double mitigates the problem to an extent (but then floats are faster for SIMD vectorization).
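A toy illustration of that precision hazard in Python (the data is made up for the demonstration): for samples with a large mean, the single-loop sum-of-squares formula cancels catastrophically, while the two-pass version survives.

```python
def one_pass_var(x):
    # Textbook single-loop formula (sum of x and x^2 accumulated
    # together): ends by subtracting two huge, nearly equal numbers.
    n = len(x)
    s = sq = 0.0
    for xi in x:
        s += xi
        sq += xi * xi
    mean = s / n
    return (sq - n * mean * mean) / (n - 1)

def two_pass_var(x):
    # Compute the mean first, then sum squared deviations.
    n = len(x)
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / (n - 1)

# Three points whose variance is exactly 1, shifted by 1e8:
x = [1e8, 1e8 + 1, 1e8 + 2]
assert two_pass_var(x) == 1.0            # two-pass gets it right
assert abs(one_pass_var(x) - 1.0) > 0.4  # one-pass is visibly wrong
```

Shifting the same three points to a mean near zero makes both formulas agree, which is exactly what makes the bug so data-dependent and hard to catch.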

On another side note, I have gradually come to realize and appreciate the unique position that Fortran holds. It is not often that you have compiler writers with a background in numerical analysis, or vice versa. (I, BTW, have a background in neither, and sorely miss it.)


The way I compute the variance above is, to my knowledge, the standard algorithm implemented by most software packages (apparently including MATLAB). (The single-pass computation you mention is subject to catastrophic cancellation, and thus pretty terrible unless the mean of your data is very small relative to the variance.) However, the "real" implementation in the Julia standard library is indeed a bit better: it performs pairwise summation, which has O(log n) error growth instead of O(n) error growth at negligible performance cost (see http://en.wikipedia.org/wiki/Pairwise_summation).
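A minimal sketch of pairwise summation in Python (Julia's actual implementation differs in details such as the block-size cutoff, which is chosen so the recursion overhead stays negligible):

```python
def pairwise_sum(x, lo=0, hi=None):
    # Recursively split the range in half and add the two halves.
    # Rounding errors accumulate along a tree of depth O(log n)
    # instead of a sequential chain of length O(n).
    if hi is None:
        hi = len(x)
    n = hi - lo
    if n <= 8:  # small base case; real implementations use a larger cutoff
        acc = 0.0
        for i in range(lo, hi):
            acc += x[i]
        return acc
    mid = lo + n // 2
    return pairwise_sum(x, lo, mid) + pairwise_sum(x, mid, hi)

assert pairwise_sum([1.0, 2.0, 3.0, 4.0, 5.0]) == 15.0
assert pairwise_sum(list(range(100))) == 4950.0
```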


Also, the hope is that in a future version of Julia, we can have a devectorization pass within the compiler at least to handle the most common cases. This is common in Fortran compilers.


Not sure that I understand the point/tone of your comment. All that you say is correct, and I am not unaware of any of those points, nor do I contradict them in mine, but it seems that you are offering it as a correction. From your comment it seems that you are aware of the two different meanings of "vectorization" but that you are choosing the wrong one to set up a strawman. So I am a bit puzzled. In any case, let me offer some clarifications:

The canonical example of expression templates (ETs), and the one that started it all, is Blitz++ (unlike its modern descendants, it actually handles full n-dimensional tensors). More recent solutions are Blaze, Eigen and Armadillo. I prefer these over uBLAS (which is more verbose and not as performant). Blaze is typically the fastest in this class as long as you are dealing with 2D arrays and linear algebraic operations. Since Julia is homoiconic and has hygienic macros, one does not have to resort to such syntactically complex ways of doing metaprogramming as ETs in C++. ETs are great to use but not very pleasant to write the machinery for, although things like Boost.Spirit and Fusion make it more tolerable. Heaven forbid that you have to understand the error messages, though.

Now, to another point: looping is so pathetically slow in R, Numpy and Matlab (in that order) that the traditional solution offered in these languages is to vectorize the loops into vector-expressions (in the Matlab/Numpy/R sense, not in the SIMD sense). Often this needs the help of extra prefilled vectors/matrices (some languages handle this smartly with 'broadcasting'; some were late to pick it up, looking at you Matlab). Even with broadcasting tricks, the extra stride indirection in the iteration slows down the code from how fast it could have been. That is but one aspect of it; the other is chaining several binary and unary operations together. This also creates temporaries and unnecessary loops (unless you use solutions of somewhat limited scope like numexpr, which BTW is lovely when it is applicable; it also does thread-level parallelization as well as SIMD if you can afford to link with Intel MKL).

In Julia there is no need to avoid loops, because they are inherently fast. In fact they are faster than the vectorized expressions, because of the inefficiencies mentioned above. However, loops are a lot more verbose than expressions. This is where Devectorize comes in: it transforms those expressions into loops (much like ETs), which can then be JITed to obtain superior performance. Now with SIMD vectorization thrown in on top, one can gain even more, because one can benefit further from instruction-level parallelism. So yes, you devectorize and then vectorize (and you already know this); having two different concepts that both go by the name "vectorization" makes the clash in terminology rather unfortunate.

EDIT: @scott_s no offense taken, and upvoted. As I said, this post was just to offer some clarifications; the clash in terminology indeed gets very confusing. So, thanks for making me improve my comment; if it confused you, I am sure it would have confused others as well.


I'm sorry if you read any snark into my tone, none was intended.

This paragraph made me think you saw a conflict with the two approaches: "Reading between the lines, devectorize.jl and the vectorization framework will make for a heady mix. I am glad that Julia is challenging the traditional mantra of "vectorize (as in the R/Numpy/Matlab sense) the loops". It is heart warming to see that the language designers get it that this style has inherent speed limitations (time is spent filling and copying temporary vectors)."

The difficulty, I think, is what you identified. I was thinking of the classic compiler optimization called "vectorizing".



