Hacker News

Not sure that I understand the point/tone of your comment. Everything you say is correct, I am well aware of all of those points, and nothing in my comment contradicts them, yet you seem to be offering it as a correction. From your comment it seems that you are aware of the two different meanings of "vectorization" but that you are choosing the wrong one to set up a strawman, so I am a bit puzzled. In any case, let me offer some clarifications:

The canonical example of ETs, and the one that started it all, is Blitz++ (unlike its modern descendants it actually handles full n-dimensional tensors). More recent solutions are Blaze, Eigen and Armadillo. I prefer these over uBLAS (which is more verbose and not as performant). Blaze is typically the fastest in this class as long as you are dealing with 2D arrays and linear algebraic operations. Since Julia is homoiconic and has hygienic macros, one does not have to resort to such syntactically complex ways of doing metaprogramming as ETs in C++. ETs are great to use but not very pleasant to write the machinery for, although libraries like Boost.Spirit and Boost.Fusion make it more tolerable. Heaven forbid that you have to understand the error messages, though.
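To make the ET mechanism concrete without the C++ template machinery: operators return lazy expression objects instead of computing, and evaluation fuses the whole chain into a single loop with no temporary arrays. Here is a toy Python analogue (all class and function names here are made up for illustration; Blitz++/Blaze/Eigen implement this with C++ templates at compile time):

```python
# Sketch of the expression-template idea: operators build a lazy
# expression tree; evaluation runs one fused loop, so no temporary
# array is materialized for the intermediate result y * y.
class Expr:
    def __add__(self, other):
        return BinOp(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return BinOp(lambda a, b: a * b, self, other)

class Vec(Expr):
    def __init__(self, data):
        self.data = list(data)

    def __getitem__(self, i):
        return self.data[i]

    def __len__(self):
        return len(self.data)

class BinOp(Expr):
    def __init__(self, op, lhs, rhs):
        self.op, self.lhs, self.rhs = op, lhs, rhs

    def __getitem__(self, i):
        # Evaluate element i on demand, recursing through the tree.
        return self.op(self.lhs[i], self.rhs[i])

    def __len__(self):
        return len(self.lhs)

def evaluate(expr):
    # One fused loop over the whole expression tree.
    return Vec(expr[i] for i in range(len(expr)))

x = Vec([1.0, 2.0, 3.0])
y = Vec([4.0, 5.0, 6.0])
z = evaluate(x + y * y)  # -> [17.0, 27.0, 39.0], no temporary for y*y
```

The real C++ libraries do the same thing, but resolve the tree at compile time so the fused loop is as fast as hand-written code.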

Now, to another point: looping is so pathetically slow in R, NumPy and Matlab (in that order) that the traditional solution offered in these languages is to vectorize the loops (in the Matlab/NumPy/R sense, not in the SIMD sense) into vector expressions. Often this needs the help of extra prefilled vectors/matrices (some languages do it smartly with 'broadcasting'; some were late to pick it up, looking at you Matlab). Even with broadcasting tricks, the extra stride indirection in the iteration slows the code down from how fast it could have been. That is but one aspect of it; the other is chaining several binary and unary operations together. This also creates temporaries and unnecessary loops (unless you use solutions of somewhat limited scope like numexpr, which BTW is lovely when it is applicable; it also does thread-level parallelization, as well as SIMD if you can afford to link against Intel MKL).
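To make the temporaries point concrete, a small NumPy sketch: each binary operation in a chained expression allocates a fresh intermediate array, while the equivalent single loop (the kind of fused code numexpr or an ET library effectively generates) touches each element once. The pure-Python loop below is only there to show the access pattern; interpreted, it is of course far slower than either alternative:

```python
import numpy as np

a = np.arange(10_000, dtype=float)
b = np.arange(10_000, dtype=float)

# Chained vectorized expression: roughly one temporary array per
# operation (one for a * b, then another for the final sum).
r1 = a * b + a

# The fused equivalent: one pass, no intermediate arrays. This is
# the shape of the loop that numexpr compiles and runs in blocks.
r2 = np.empty_like(a)
for i in range(len(a)):
    r2[i] = a[i] * b[i] + a[i]
```

Both produce the same result; the difference is memory traffic, since each temporary is a full-size array that has to be written out and read back.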

In Julia there is no need to avoid loops because they are inherently fast. In fact they are faster than the vectorized expressions, because of the inefficiencies mentioned above. However, loops are a lot more verbose than expressions. This is where devectorize comes in: it transforms those expressions into loops (much like ETs), which can then be JITed to obtain superior performance. With SIMD vectorization thrown in on top, one can gain even more, because now one can benefit further from instruction-level parallelism. So yes, you devectorize and then vectorize (and you already know this); the term "vectorization" refers to two different concepts here, and the clash in terminology is rather unfortunate.
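The transform itself can be sketched as code generation: take an elementwise expression and emit one fused loop, which a JIT can then compile and SIMD-vectorize. devectorize.jl does this properly on Julia ASTs via macros; the toy below is only a string-rewriting analogue in Python (the `devectorize` function and its naive `str.replace` substitution are made up for illustration and would break on overlapping names):

```python
def devectorize(expr, names):
    # Rewrite each array name into an indexed access, then wrap the
    # expression in a single fused loop. Purely illustrative string
    # rewriting, not a real parser.
    body = expr
    for n in names:
        body = body.replace(n, f"{n}[_i]")
    src = (
        f"def _fused({', '.join(names)}, out):\n"
        f"    for _i in range(len(out)):\n"
        f"        out[_i] = {body}\n"
        f"    return out\n"
    )
    env = {}
    exec(src, env)
    return env["_fused"]

fused = devectorize("a * b + a", ["a", "b"])
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
out = fused(a, b, [0.0] * 3)  # -> [5.0, 12.0, 21.0]
```

A macro-based version operates on the parsed expression tree rather than strings, so it composes safely; the generated loop body is exactly what the JIT then sees and can vectorize.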

EDIT: @scott_s no offense taken and upvoted. As I said, this post was just to offer some clarifications, the clash in terminology indeed gets very confusing. So, thanks for making me improve my comment, if it confused you I am sure it would have confused others as well.




I'm sorry if you read any snark into my tone, none was intended.

This paragraph made me think you saw a conflict with the two approaches: "Reading between the lines, devectorize.jl and the vectorization framework will make for a heady mix. I am glad that Julia is challenging the traditional mantra of "vectorize (as in the R/Numpy/Matlab sense) the loops". It is heart warming to see that the language designers get it that this style has inherent speed limitations (time is spent filling and copying temporary vectors)."

The difficulty, I think, is what you identified. I was thinking of the classic compiler optimization called "vectorizing".



