My Experience at JuliaCon (johnmyleswhite.com)
114 points by ViralBShah on July 1, 2014 | 35 comments



Will the videos, slides (and corresponding source repos) be uploaded at some point? Would love to see those.

Reading between the lines, Devectorize.jl and the vectorization framework will make for a heady mix. I am glad that Julia is challenging the traditional mantra of "vectorize (as in the R/Numpy/Matlab sense) the loops". It is heartwarming to see that the language designers get that this style has inherent speed limitations (time is spent filling and copying temporary vectors).

If any Julia core developer is seeing this, just a shout-out: (i) you are doing awesome work, and (ii) the @ macros available in the standard library are not documented well. It would be good to have a page that describes what is available now.

EDIT: Just clarifying my point. Macros as a feature are quite well described in the docs. What I miss is an index of the macros that come built in with the standard library. They would help both with pedagogy and with putting them to actual use, for example removing bounds checks. If they were a little more discoverable than they are now, that would be very helpful to a newcomer. For a new-ish language the documentation is otherwise quite admirable. @ViralBShah I filed an issue.


Slides are on GitHub: https://github.com/JuliaCon/presentations

I believe the video is heading for their YouTube channel: https://www.youtube.com/user/JuliaLanguage/featured


I had to look up Devectorize.jl: https://github.com/lindahua/Devectorize.jl

The point, I think, is not that vectorizing is slow. It's that you don't want to be myopic when you do it. Your point about time spent filling and copying vectors is independent of what Devectorize is trying to do.

Yes, when you vectorize code, you need a prologue and an epilogue that get the data into the correct format for the SIMD operations. This prologue and epilogue have a cost, and if it's larger than the gain from the SIMD operations, then it's not worth vectorizing. (This can happen if, say, you only have 4 elements to vectorize.) However, we have to do this cost analysis all the time; figuring out when it's beneficial to vectorize code is a problem as old as vectorization itself.

Devectorize.jl is not about that problem. Rather, it's about when you have a chain of expressions, and the operators in those expressions have implied loops. The naive thing to do is to myopically execute each one of those loops, creating and passing around temporary vectors. The Devectorize framework is given an entire expression, and is able to analyze where the implied loops are, and figures out how to express that computation in a single loop.
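For instance, an illustrative sketch (not from the thread; @devec is the macro Devectorize.jl provides, and the loop below shows roughly what it expands to):

  using Devectorize

  a = rand(1000); b = rand(1000); c = rand(1000)

  # One fused loop, no temporary allocated for a .* b:
  @devec r = a .* b + c

  # ... roughly equivalent to:
  r = similar(a)
  for i = 1:length(a)
      r[i] = a[i] * b[i] + c[i]
  end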

We can tell these two concepts are independent because the ideal situation is to first use Devectorize on an expression, and then vectorize the result!

For the record, the approach taken by Devectorize.jl is similar to the problem that expression templates (http://en.wikipedia.org/wiki/Expression_templates) in C++ try to solve. Through template trickery and operator overloading, we can put off evaluating an expression, avoiding the temporaries and unnecessary memory traversal that a naive execution implies. The Boost library uBLAS (micro Base Linear Algebra Subprograms, http://www.boost.org/doc/libs/1_55_0/libs/numeric/ublas/doc/...) uses this technique.
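To make the deferred-evaluation idea concrete in Julia terms, here is a toy sketch (0.3-era syntax; this is not how uBLAS or Devectorize.jl is actually implemented, just the flavor of the technique):

  # Building a LazySum allocates no array; elements are computed on demand,
  # so a nested expression is evaluated in a single pass with no temporaries.
  immutable LazySum{A,B}
      a::A
      b::B
  end
  Base.getindex(s::LazySum, i::Int) = s.a[i] + s.b[i]
  Base.length(s::LazySum) = length(s.a)

  materialize(s) = [s[i] for i = 1:length(s)]

  expr = LazySum(LazySum([1.0, 2.0], [3.0, 4.0]), [5.0, 6.0])
  materialize(expr)  # => [9.0, 12.0]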


I think you are getting at this above and what I've written below is somewhat redundant with srean's followup, but to clarify, writing vectorized code is mostly orthogonal to using SIMD vector instructions to compute the result. A sufficiently smart compiler can turn a loop into SIMD instructions, and these days most compilers (including LLVM, which Julia uses) are pretty good at that.

The problem that Julia solves here is that many scientific programming environments require you to write vectorized code, because they are interpreted and so looping through the elements in a vector incurs massive interpreter overhead. In some cases, this can make your code much more difficult to understand than if you wrote an explicit loop, but even when it does not, it often forces you to give up performance that could be obtained if you could write the code in a devectorized manner.

For example, consider the problem of calculating the variance of a sample. To calculate the variance in MATLAB, one might write:

  mu = mean(x)
  sum((x - mu).^2) / (length(x) - 1)
Although I don't know the details of how MATLAB optimizes this code, I do know that it has roughly equivalent performance characteristics to Julia code that first computes x - mu, then computes abs2(x - mu), then sums the result. With an interpreter, you can't do much better, and this is actually how the MATLAB var function does it. In Julia, we compute the variance as:

  mu = mean(x)
  v = 0.0
  for i = 1:length(x)
      v += abs2(x[i] - mu)  # abs2(z) == z^2, with no temporary array
  end
  v / (length(x) - 1)
This is several times faster than MATLAB's included var function, because the computations performed on each element are cheap relative to the cost of memory access. If you wrap the loop in the @simd macro and add @inbounds to eliminate the bounds check when accessing x, the LLVM loop vectorizer will actually take that loop and translate it into SIMD instructions, making it even faster. (The @simd macro is necessary to tell LLVM that it's okay to vectorize, since accumulating into n variables and summing them at the end gives different [but usually more accurate] results compared to accumulating into a single variable due to floating point rounding.)
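Concretely, the annotated version of the loop looks like this (same computation as above):

  mu = mean(x)
  v = 0.0
  @inbounds @simd for i = 1:length(x)
      v += abs2(x[i] - mu)
  end
  v / (length(x) - 1)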

The promise of Devectorize.jl is that you will be able to write the vectorized fragment and have it transformed into the explicit loop automatically, which would be neat indeed.


Thanks for providing more detail and context. I wonder how the clash in terminology came about! Who would have imagined naming things uniquely would be so hard (pun intended).

I understand that your example is entirely pedagogic, but just a cautionary note for the unwary: (a) although it is tempting to fold the variance calculation into a single loop (accumulating the totals of x and x^2), (b) neither that obvious single-loop version nor the code above is a good way to compute variance if one cares about preserving precision, more so if x has a wide dynamic range. In large-scale problems this does rear its ugly head, and these bugs are difficult to catch because the code is mathematically correct. Using doubles mitigates the problem to an extent (but then floats are faster for SIMD vectorization).
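For reference, a numerically stable one-pass alternative is Welford's online algorithm. A minimal Julia sketch (illustrative, not from the thread):

  function welford_var(x)
      n = 0
      m = 0.0                    # running mean
      s = 0.0                    # running sum of squared deviations
      for xi in x
          n += 1
          delta = xi - m
          m += delta / n
          s += delta * (xi - m)  # note: uses the updated mean
      end
      s / (n - 1)                # sample variance
  end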

Another side note: I have gradually come to realize and appreciate the unique position that Fortran holds. It is not often that you have compiler writers with a background in numerical analysis, or vice versa. I, BTW, have a background in neither and sorely miss it.


The way I compute the variance above is, to my knowledge, the standard algorithm implemented by most software packages (apparently including MATLAB). (The single-pass computation you mention is subject to catastrophic cancellation, and thus pretty terrible unless the mean of your data is very small relative to the variance.) However, the "real" implementation in the Julia standard library is indeed a bit better: it performs pairwise summation, which has O(log n) error growth instead of O(n) error growth at negligible performance cost (see http://en.wikipedia.org/wiki/Pairwise_summation).
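For the curious, the idea in a minimal Julia sketch (the stdlib version is more careful about the base-case cutoff and blocking; this just shows the shape of it):

  function pairwise_sum(x, lo=1, hi=length(x))
      if hi - lo < 128          # below a cutoff, a plain loop is accurate enough
          s = 0.0
          for i = lo:hi
              s += x[i]
          end
          return s
      end
      mid = (lo + hi) >>> 1     # split in half; error grows as O(log n)
      pairwise_sum(x, lo, mid) + pairwise_sum(x, mid + 1, hi)
  end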


Also, the hope is that in a future version of Julia, we can have a devectorization pass within the compiler at least to handle the most common cases. This is common in Fortran compilers.


Not sure that I understand the point/tone of your comment. All that you say is correct, and I am not unaware of any of those points, nor do I contradict them in mine, but it seems you are offering it as a correction. From your comment it seems you are aware of the two different meanings of "vectorization" but are choosing the wrong one to set up a strawman, so I am a bit puzzled. In any case, let me offer some clarifications:

The canonical example of ET, and the one that started it all, is Blitz++ (unlike its modern descendants, it actually handles full n-dimensional tensors). More recent solutions are Blaze, Eigen and Armadillo. I prefer these over uBLAS (which is more verbose and not as performant). Blaze is typically the fastest in this class, as long as you are dealing with 2D arrays and linear-algebraic operations. Since Julia is homoiconic and has hygienic macros, one does not have to resort to syntactically complex metaprogramming machinery like ETs in C++. ETs are great to use but not very pleasant to write the machinery for, although things like Boost::spirit and fusion make it more tolerable. Heaven forbid that you have to understand the error messages, though.

Now, to another point: looping is so pathetically slow in R, Numpy and Matlab (in that order) that the traditional solution offered in these languages is to vectorize the loops (in the Matlab/numpy/R sense, not the SIMD sense) into vector expressions. Often this needs the help of extra prefilled vectors/matrices (some languages do it smartly with 'broadcasting'; some were late to pick it up, looking at you Matlab). Even with broadcasting tricks, the extra stride indirection in the iteration slows the code down from how fast it could have been. That is but one aspect of it; the other is chaining several binary and unary operations together. This also creates temporaries and unnecessary loops (unless you use solutions of somewhat limited scope like numexpr, which, BTW, is lovely when applicable; it also does thread-level parallelization, as well as SIMD if you can afford to link with Intel MKL).

In Julia there is no need to avoid loops, because they are inherently fast. In fact they are faster than the vectorized expressions, because of the inefficiencies mentioned. However, loops are a lot more verbose than expressions. This is where Devectorize comes in: it transforms those expressions into loops (much like ETs), which can then be JITed to obtain superior performance. With SIMD vectorization thrown in on top, one can gain even more, because one can further benefit from instruction-level parallelism. So yes, you devectorize and then vectorize (and you already know this); two different concepts are both named "vectorization", and the clash in terminology is rather unfortunate.

EDIT: @scott_s no offense taken, and upvoted. As I said, this post was just to offer some clarifications; the clash in terminology indeed gets very confusing. So thanks for making me improve my comment; if it confused you, I am sure it would have confused others as well.


I'm sorry if you read any snark into my tone, none was intended.

This paragraph made me think you saw a conflict between the two approaches: "Reading between the lines, Devectorize.jl and the vectorization framework will make for a heady mix. I am glad that Julia is challenging the traditional mantra of "vectorize (as in the R/Numpy/Matlab sense) the loops". It is heartwarming to see that the language designers get that this style has inherent speed limitations (time is spent filling and copying temporary vectors)."

The difficulty, I think, is what you identified. I was thinking of the classic compiler optimization called "vectorizing".


The videos are being edited and will be available shortly. We'll post them on juliacon.org and julialang.org.

Does the Metaprogramming section of the manual not adequately describe macros? It would be great if you could submit a pull request or an issue describing what is missing.


An index of macros in the standard library is a good idea. Would you mind filing an issue?


Macros are documented along with everything else in the standard library:

http://docs.julialang.org/en/latest/stdlib/base/

The stdlib docs are organized by subject matter. What would the benefit of having a separate index of macros be? If one wants a list of exported macros, that's just a matter of doing

    grep @ base/exports.jl
in a Julia repo.


It's unfortunate the Julia community doesn't have enough resources to maintain their website.

The available version was 0.2 for ages (I see now they offer a pre-release 0.3), and the blog hasn't been updated for 9 months.

Some simple posts every month or so would signal that this is a live, worked-on language (which it is, and interesting stuff happens all the time).

It was difficult to even find out about JuliaCon -- no banner, no blog post, nothing, just a small link in the main site navigation that one could easily miss.


To get a full sense of the level of activity, see here: https://github.com/JuliaLang/julia/pulse (and that is only the core - there is also considerable activity in the packages)

More frequent posts on the main blog would indeed be a good idea. There will certainly be a few GSoC student posts as the summer progresses.

As of a few days ago there is also a Julia blog aggregator highlighting community posts:

http://www.juliabloggers.com/

I believe we've had 0.3-pre nightlies posted since at least January.

As for JuliaCon, it was announced on the mailing lists and did show up on HN once, I believe. It sold out very quickly, so it wasn't really actively promoted after that.


Julia development is GitHub/mailing-list centric. The source+community links should show you this.


Julia looks amazing BUT as an R user:

1) RStudio is just such an amazing tool with rmarkdown and pandoc. I run my statistics in a document and can output to HTML, PDF or Word. I can't tell you how important the ability to print out to Word is in my corporate setting.

2) The libraries and tools are vast (it seems like Linux: tons of options, but really only one or two are perfect for an individual use case)

3) 99% of what I do takes less than 2 seconds with R

Personally, if I have a need for large datasets, I will look to Julia if I can't get a solution with R and data.table.


Personally, I don't see Julia replacing R for a while. As you note, the R community has much of what it needs already. When I initially switched from R to Julia, it was because R wasn't really usable for the kind of work I do. It's since become clear to me that I work on problems that don't come up for most R users: problems involving large sparse matrices, including large-scale optimization and MCMC. If you don't deal with those kinds of things, R is a good choice.


It seems like MCMC is a potentially killer app for Julia. Any thoughts as to how Julia's MCMC facilities compare to, say, something like Stan?


I think MCMC.jl is pretty immature compared with Stan. But there is work on a Stan wrapper. Personally, I mostly write MCMC code by hand these days.
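For a sense of what hand-written MCMC can look like, here is a minimal random-walk Metropolis sketch in Julia (illustrative only; logpost and the step size sigma are placeholders, not from any package):

  # logpost: log of the unnormalized target density; sigma: proposal std. dev.
  function metropolis(logpost, theta0, sigma, niter)
      chain = zeros(niter)
      theta = theta0
      lp = logpost(theta)
      for i = 1:niter
          proposal = theta + sigma * randn()
          lp_new = logpost(proposal)
          if log(rand()) < lp_new - lp  # accept w.p. min(1, exp(lp_new - lp))
              theta, lp = proposal, lp_new
          end
          chain[i] = theta
      end
      chain
  end

  # e.g. sample a standard normal:
  draws = metropolis(t -> -t^2 / 2, 0.0, 1.0, 10000)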


Granted I don't have much experience with Stan, but from what little I've poked around RStan, the workflow left something to be desired. Setting up a C++ environment, and then embedding Stan code (which appears to be a C++ DSL of sorts) as strings inside your R code seems... unpleasant.

Perhaps if MCMC.jl matures, and if it can offer performance competitive with BUGS, JAGS, Stan, etc., I could see Julia's statistical fortunes rise along with Bayesian methods generally. I'm working on a PhD in a discipline that is just now beginning to dip its toes in Bayesian waters. I get the feeling that the adoption surge is yet to come.

Interesting that you're implementing your own MCMC methods mainly. As part of my coursework I did a little bit of that, but perpetually felt like I wasn't smart enough to do anything sophisticated--either in terms of exotic sampling methods, or working with very complex posterior distributions. It may just be my frequentist R toolbox experience messing with me, but I know I feel more comfortable with the safety blanket of a framework.


Second that. The purpose of Julia is not to replace every other language that people use for technical computing. I personally use Python for other projects (www.circuitscape.org), and only recommend using Julia for new classes of problems that need Julia's performance and capabilities. That class of problems keeps expanding over time.


I think the best route is to create shared libraries using Julia and call them from R.


While it is a nice tool, RStudio is AGPL, which means it can't be used in many companies due to contamination fears; this includes Google and, I understand, many other companies as well.


Well, I would think they would pay for RStudio Server. I know $10,000 seems steep, but I am sure Google could afford that.


I agree on RStudio. I used ESS for a while, but couldn't in good conscience justify it for new users. So, RStudio became my mantra for new R users, and I've followed suit.

I've wondered what it would take to add support for Julia in RStudio. RStudio is open source, but I don't really know how generically it can interface to an arbitrary REPL.


RStudio looks really impressive. I have heard reliably that they use the AGPL license and legal threats to get people to pay up for RStudio licenses. I have nothing against people charging for good software, but this sounds like some sort of bait and switch. I wonder if someone can confirm whether this is actually true.


I'm not sure I understand how the AGPL allows them to do that. Doesn't the AGPL just require people "releasing" derived works as part of a service to also make their modifications free? I don't really know much about the AGPL other than what I could find on Wikipedia.

Assuming I understand the added requirement of the AGPL, wouldn't RStudio's legal recourse apply only to derived works that didn't release their code?


I wonder if any code written in RStudio gets contaminated with AGPL. I also believe that many companies will just pay the fees out of fear of litigation.


Why would you think anything of the sort? Is everything written in Emacs GPL? Is everything written in Eclipse under the Eclipse license?

As with (almost?) all FLOSS dev tools, there's no connection between the tool the code was written in and the license of the resulting code. Suggesting otherwise (even by way of a comment) is rather unfair.


I agree. I don't see Julia getting much traction in the statistics community. There is the legacy thing (the tools, existing code, and knowledge) but there is also the fact that R is designed specifically for statistics. Julia is a good replacement for Matlab, but there is a reason Matlab isn't used much in statistics.


Anyone know how Julia's performance for optimization compares to, say, ILOG CPLEX or Gurobi?


It interfaces to a number of solvers, both commercial and open source. According to this page [1] it currently supports:

COIN Cbc/Clp, GNU GLPK, Gurobi, Ipopt, Mosek, NLopt

So you can use it to generate models that are solved in Gurobi if you want. Or you can use the available open source solvers.

[1] http://www.juliaopt.org/


One can easily use such solvers in Julia via an excellent modeling language, JuMP [1]. You can specify optimization problems in human-readable form and have them passed to the solvers; speed is quite good.

https://github.com/JuliaOpt/JuMP.jl
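To give a flavor, a minimal sketch using JuMP's macro API circa 2014 (the toy bounds, objective, and constraint here are made up):

  using JuMP

  m = Model()                     # picks up an installed default solver
  @defVar(m, 0 <= x <= 2)
  @defVar(m, 0 <= y <= 30)
  @setObjective(m, Max, 5x + 3y)
  @addConstraint(m, x + 5y <= 3)

  status = solve(m)
  println("objective = ", getObjectiveValue(m))
  println("x = ", getValue(x), ", y = ", getValue(y))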


There are excellent Julia bindings to CPLEX, Gurobi, and many other optimization libraries: http://www.juliaopt.org/. At some point there may be pure Julia linear programming libraries, but initially it's much more fruitful to provide a consistent shared interface to existing libraries than to reimplement them.


In addition to the pointers in the comments here, also see the fast-developing CVX.jl package: https://github.com/cvxgrp/CVX.jl



