It's hard to overstate what a big deal this is (and how much effort went into it). The only mainstream languages with this kind of composable parallelism are Go and Erlang, both of which target writing servers, rather than high-performance computing, and which therefore optimize for I/O throughput rather than CPU throughput.
The only systems which have supported compute-oriented composable parallelism like this are Cilk and TBB, which are both really excellent and successful—and provided major inspiration for this work. However, they both seem to have fallen short of widespread adoption partly due to being non-free (until recently) and also, I suspect, because they are non-standard C/C++ language/compiler extensions, which seem to never catch on that broadly.
This is a brave new world for high-performance technical computing, not just for Julia.
For completeness' sake, Haskell has had this for over a decade :) An SO thread that gives a nice overview [0], the paper [1], the library [2]. Here's what a parallel fib looks like:
    import Control.Parallel (par, pseq)

    fib :: Int -> Int
    fib n
      | n <= 1    = 1
      | otherwise = x `par` (y `pseq` x + y + 1)
      where
        x = fib (n - 1)
        y = fib (n - 2)
What do you mean? That's my point exactly. While Haskell has that now, they didn't always have it.
If a programming language is created in 2019, it can certainly incorporate insights from existing technology into its design and plans right away. However, that doesn't mean that any programming language/compiler project started today will instantly have all of those features. Things take time in each new framework.
Haskell didn’t have it originally. Ok. But what do you think that says about whether it’s a new idea or not? What’s the connection? Why mention it in this thread about whether it's a new idea or not?
At what point did anyone claim it was a new idea? The blog post cites prior art in the very first paragraph. Not in a footnote—in the main text. The top post here cites several other languages with a similar model. This entire thread reads like a bunch of dudes who are really dying to "well, actually" someone.
It's extremely debatable that Haskell is compute-oriented. I know that Haskell people like to talk about doing numerical computing in Haskell, but in reality it does not seem to be a thing. No one has ever used Haskell to implement large scale scientific computations on supercomputers, for example.
This particular example appears to be an implementation of a 'Computer algebra system', which seems to be a type of symbolic computing rather than numerical computing.
> this kind of composable parallelism are Go and Erlang, ... which therefore optimize for I/O throughput rather than CPU throughput
Parallelism is always about CPU throughput, which is precisely why you don't run more OS threads than the number of cores.
Concurrency primitives (whether in Go, Erlang or any other programming language) are meant for I/O-bound tasks. Let's not conflate concurrency primitives enabling I/O-bound tasks and parallelism primitives enabling CPU-bound tasks.
> precisely why you don't run any more OS threads than the number of cores
It’s often worth running more than your number of cores, in order to keep execution units busy during memory fetches for example, or integer units busy while float units are a bottleneck otherwise. (Did a PhD partially on this kind of thing.)
You can still use Erlang-style concurrency (or Go-style as well) to truly parallelize CPU-bound tasks. It's a bit less efficient due to Erlang's design, where you can't readily share mutable data between processes, but it will still enable true parallel calculations on all cores. By default BEAM runs one execution thread per core. For example, it'd be trivial to write a true parallel merge sort in Erlang/Elixir.
Now, NodeJS- or Python asyncio-style concurrency _is_ truly limited by the VM and can't do native parallel computations in a single process.
(Since this is basically now a comment where everyone "me too"s in a reply...)
F# also has facilities for this, depending on what you're after. The _MailboxProcessor_ type, though not as "for real" as the green threading in Go, allows for some pretty high throughput with a relatively sane message-passing abstraction.
And .NET itself offers many means to do parallel programming (which can get quite optimized), for which F# has some nice abstractions.
So I don't think it's quite fair to say that only Go and Erlang have good support for parallelism. I also don't think their abstractions are typically used for parallelism (but I could be wrong).
As a physicist, it's been really cool seeing the awareness of Julia in the community gradually increase.
Large swathes of the physics community have been migrating their code to Python over the past few years but it really seems now that Julia is getting so many amazing features and packages that it can't be ignored.
Same here, I worked in a computational biophysics department with lots of python/bash/R and I was the only one who wrote lots of high-performance code in Julia. People were curious about the language but it was still very much unknown. Hope we will see a broader adoption of Julia in the future - it's just that much better for the stuff we do on a daily basis.
Great to see this finally land - thanks for all the team's work. Looking forward to giving it a whirl.
Threading is something of a prerequisite for acceptance as a serious language among many folks. So great to not just check that box, but to stick the pen right through it.
The devil is always in the details, but from the doc the interface looks pretty nice (given threaded code generally looks like it was written by Cthulhu).
This is a bit over the top isn’t it? It’s just task parallelism. It’s exactly what for example Haskell has been doing for decades isn’t it? It’s even the same demo and interface!
It's literally just futures. (Still very hard to implement, of course, and in particular IMO this kind of implementation (M-to-N threading) is much better than the async/await implementations.) But CS is pretty settled on futures being the "best" model; even C++ has them!
In my experience most work-stealing schedulers for task parallelism like this are going to be depth-first. I’ve written several myself. Primarily for temporal locality - if you just created a task it’ll be resident in cache.
I'm still trying to access the relevant papers, but from bits of discussions here it seems to me that the difference between the Julia scheduler and Cilk is that the former will try to distribute sibling leaf tasks among executors, enforcing as close to a global depth-first view as possible, while Cilk does depth-first on each executor, but different executors will be exploring distinct parts of the problem space, branching off as close to the root as possible (due to the steal-the-oldest-ancestor behaviour). It is a compromise between better cross-executor cache locality and increased contention, I guess.
Futures in C++ are really meant for concurrency, especially for I/O. They are currently way too heavyweight for fine-grained parallelism (although there is ongoing work on that). Also, futures are about packaging tasks; all the stuff being discussed here is about optimal scheduling, which is completely orthogonal.
Do people use Cilk heavily in the high-performance computing world? They did not when I was part of it. I think OpenMP (https://www.openmp.org/) is what more people would reach for. It started as a way to parallelize loops, but has had a task framework for a long time which I believe also allows composable parallelism.
OpenMP, however, depends on not just compiler extensions, but pragmas, so it tends to not find traction outside of the HPC world.
While extremely elegant and minimalistic, Cilk never gained much traction, I think. GCC has deprecated it and is removing it. I think that OpenMP was too entrenched in the language-extension space (and has since grown at least the potential for Cilk-like functionality), while more portable libraries (TBB for example) covered everything else.
Yes, Cilk has been enormously influential. I have personally cited it many times. But I always had the impression it was a little too early for its time, and was never picked up by the HPC community, hence my question.
The real gain is with MPI, anyway... shared memory is a trap. All the projects I've worked on have exploited parallelism through message passing, or a battle tested multithreaded linear algebra library.
There are many techniques and calling any of them the one solution and all others a trap is extremely myopic. Shared memory is extremely useful for IPC and fork join parallelism.
My understanding is that, at least in C/Fortran/C++ land, common practice is OpenMP within a machine and MPI across machines. I've done some work of the sort, though I'm far from an expert.
GPUs are the way to go these days to get the most parallelism, and that is definitely the shared-memory variety. Message passing is the go-to when parallelism needs to cross machine boundaries.
GPU parallelism is highly specialized and requires very specific workloads (e.g., many linear algebra workloads). GPUs are a bad choice for general parallelized tasks.
A quibble: it's not necessarily "heavy numeric" problems, but problems where the calculations involve extensive data-reuse. (Of which matrix operations are the most obvious.) The GPU is massively parallel, but the up-front cost of transferring data to the GPU is quite high.
Hence, in order for that up-front cost to be worth it, the algorithms run on the GPU need to re-use that data many times. If the algorithms running on the GPU don't make extensive reuse of the data sent to it, it would be faster to just do the calculation on the host CPU.
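To put rough numbers on that (these are assumed, order-of-magnitude figures, not from the thread):

    pcie_bytes_per_s = 16e9               # roughly PCIe 3.0 x16 host-to-device bandwidth
    gpu_flop_per_s   = 10e12              # roughly a modern GPU's single-precision throughput
    gpu_flop_per_s / pcie_bytes_per_s     # ~625 floating-point ops per transferred byte just to keep the ALUs fed

So unless each byte you ship over the bus participates in hundreds of operations, the transfer dominates.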
You're missing C#/.NET which has had async/parallel programming constructs for a long time, and served as the model for await/async copied by many other languages.
It's gone through several design patterns from the original APM (async programming model) to EAP (event based pattern) to TPL (task parallel library) to the all purpose Task/ValueTask embedded into the language. It's very fast and efficient, and you can build your own scheduler too. Combined with hardware intrinsics in the latest .NET Core, it's very good for computational and server work.
I'm surprised Julia considers this such a big update; it does seem to be an overstatement to say this changes technical computing.
Julia has had the equivalent of async/await forever, except without needing to annotate anything with keywords—that’s just how I/O works. If you call a blocking function, it yields back into the scheduler. In Julia (as of 1.3), tasks are mapped onto multiple threads which run at the same time (in parallel) and threads that are idle can steal work efficiently. This is similar (as was said) to Cilk, TBB and Go. It is not the same as async/await in C#, which is concurrent but not parallel.
What is 100% new and does not exist in any other system is that the task scheduling is depth first (rather than breadth first), which is a much better order for doing compute-heavy workloads. This ability is new and was the culmination of an Intel research project, applied to Julia.
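Concretely, the spawn/fetch style looks like this in 1.3 (essentially the parallel fib demo from the announcement):

    import Base.Threads.@spawn

    function pfib(n::Int)
        n <= 1 && return n
        a = @spawn pfib(n - 2)   # this task may be run by any thread
        b = pfib(n - 1)          # the current task keeps working meanwhile
        return fetch(a) + b
    end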
Hmm, isn't Cilk-like work stealing depth-first? Idle executors steal the oldest ancestor, and busy executors eagerly execute recursive calls while pushing the parent onto the local job stack?
Proof of optimality was in the early Cilk papers in the '90s.
I'm sure the same algorithms made it into Julia via Intel (which bought Cilk in the early '00s), but it is hardly new, except in the 'everything old is new again' sense.
Also IIRC OpenMP supports (or at least allows) work stealing although at least GCC does not implement it yet.
It's all about better L3 utilization. Cilk is good for using L1 and L2, i.e. caches that are core-local, and suboptimal for using L3, i.e. caches that are shared between cores.
In the presentation posted here there was a paper citation that compared the model used (parallel depth-first scheduling) to work stealing in terms of cache behaviour, but I haven't had time to read it yet.
What do you mean by depth first vs breadth first? Naively I would assume task scheduling necessarily to be depth first in some sense because you have to evaluate your tasks' dependencies first.
Consider a machine with N cores running a program which spawns N parallel tasks, each of which in turn spawns N parallel tasks. Which subtasks do the cores work on initially? If it's the first subtask of each top-level task, that's breadth-first scheduling. If it's all N subtasks of the first top-level task, that's depth-first scheduling. The latter is better for high-performance computing because it tends to be the lower-level libraries that are really carefully tuned to make optimal use of the machine (think BLAS/FFT/parallel sum/prefix sum). Previously it was unknown how to implement this kind of scheduling efficiently. See this talk by Kiran Pamnany for more details:
Ah I see what you mean! I really hate to pile on with the Haskell observation but the 1995 paper by Blelloch that is mentioned in that talk is exactly the heart of GHC's data parallelism as well (under the name nested data parallelism).
Later versions of Blelloch's language are referred to in other parallel Haskell papers.
FWIW I think this is nonetheless fantastic news for Julia! It's certainly poised to do great work in areas that I doubt will get any Haskell traction anytime soon (if ever) for good reasons.
EDIT: Interesting, I think Julia is taking a coarser grained approach here that breaks down a bit for irregularly balanced execution trees, but relies less on fusion (i.e. doesn't use fusion at all) which is nice. I'm curious to see how this particular implementation of the "depth first" idea goes!
EDIT EDIT: Actually I'm not sure how much fusion is necessary for GHC's approach... Time to investigate more closely. Also call out to anyone more familiar with Data Parallel Haskell to chime in. I'm getting out of my depth.
I'm not surprised that Haskell supports this kind of parallelism, although I wasn't aware of it previously. After all, if you're going to the trouble of being that pure, why not be parallel?
Isn't a blocking function, by definition, something that blocks and doesn't yield? How could the language know to automatically yield on something without annotations?
> "async/await in C#, which is concurrent but not parallel"
C# is multithreaded. Things can only be parallel if you have 2 separate paths of execution that don't depend on each other or any serial order, but you can also kick off millions of individual tasks in a single function and await them all too. These tasks are all scheduled onto threads in a managed threadpool that also share work and run at the same time. I fail to see how this is any different or less capable than the others you mentioned.
C# has threads but async/await does not use them. If you think that `async` in C# does the same thing as `@spawn` in Julia and `go` in Go, you are mistaken. It does not. There is also an `@async` macro in Julia that does what the `async` keyword in C# does (which, again, does not involve threads).
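To make the distinction concrete (do_io and crunch here are placeholder functions, not real APIs):

    t_io  = @async do_io()            # cooperative task; stays on the thread that created it
    t_cpu = Threads.@spawn crunch()   # task the scheduler may run on any thread, in parallel
    wait(t_io); fetch(t_cpu)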
I'm very familiar with C# and I'm not talking about threads, only that multi-threading is a basic requirement and C# has a threadpool running threads in parallel.
I said that C# also uses Tasks which are scheduled onto these threads, which is what you said Julia does, so it's the same. C# can also start new threads just as easily, or use channels like Go with the TPL/Dataflow, System.Threading.Channels, actor frameworks, Reactive Extensions, or just chaining tasks together. Other than the depth-first algorithm, it seems like C# has the same capabilities as Julia and the others, if not more, and is much more "mainstream".
Can you explain how blocking functions automatically yield without keywords, though? I can't find any documentation on this that doesn't mention the @async macro.
I really dislike using Tasks over threads due to the reusability issues. Specifically that Tasks are not reusable and threads are. It makes parallel computational work a headache, because you need to implement the reusability in a higher scope than tasks, then make an effort to prevent race-conditions, double checkouts, etc.
Tasks are really meant for responsive computing, not proactive computing (yes, even TPL and Dataflow), and thus I don't think they are well suited to parallelized computation in areas like e.g. simulation.
When a program is composed as a Dataflow pipeline, you don't handle tasks at all (unless you want to). I'm not sure what reusability issues you're referring to. Dataflow is very applicable to high CPU-throughput (i.e. efficient) pipelines. Concurrent by default, parallel on-demand. In practice, any I/O in my pipelines takes place in singleton consumers at the end of complex, highly-parallel, CPU-intensive code.
C# code that uses Tasks and C# programs built on Dataflow are very different. The latter benefits you most when you commit entirely to that paradigm.
If you want to get more low-level and low-overhead (e.g., for simulations), the same thing applies to things like Threading Building Blocks [1] in C++. I'd even argue that TBB is one of the best applications of this theory, going so far as to ship with dedicated tooling for the whole lifecycle of designing, implementing, and profiling task- and graph-parallel programs [2]. Furthermore, it works well with other HPC libraries. Here [3] is an example of mixing it with OpenMP.
In summary, both Dataflow and TBB (et al) can readily be used to create CPU-intensive, proactive computing pipelines today. Even though they're "just" fancy Task schedulers, which are in turn "just" fancy thread multiplexers.
Reusable as in: I can create a data structure in a thread, do some computation, yield control from the thread (wait for something else to tell it to go again), then come back to the thread with the same data structure and keep going. With Tasks, you are supposed to let tasks complete rather than pause/sync them, so anything you declare locally is lost when you want to restart. The workaround is to declare things in a higher scope and pass them into tasks as they are rerun. This makes tasks ill-suited to anything where you want to persist data, since tasks are opinionated in that they are only meant to contain logic, not state, so you have to add your own external logic to manage the allocation of state to the tasks you keep recreating.
This trait of tasks makes sense when writing things like web servers because it prevents developers from shooting themselves in the foot by writing something that isn't stateless.
By proactive computing, I mean some program that always has control of the entire execution graph. For example, a simulation may initialize a bunch of datastructures, separate some of them into threads, run the threads, wait for all of them to complete x ticks, do some synchronization or measurement of the total system, then continue y times, and exit.
Because sometimes you need to sync the entire program state, or some subset of the state, to make measurements or to let disparate parts of the system interact
I'm not certain it's an overstatement, but I'm not convinced it isn't, either. As you note, these ideas are old (they mostly predate C# and .NET by decades). On the other hand, to a first-order approximation, C# has made zero impact in this area. There are a bunch of reasons, but that's the way it is.
Fair or not, Julia does seem to have the potential to be the first really interesting shift in this area in ages. Doesn’t mean it will get there, but there is interest.
I was pointing out another very mainstream language (even more than Go and far more than Erlang) that has the same abilities.
What do you mean by impact/shift exactly? I guess I'm not understanding what's suddenly going to change now since, as you say, these ideas are old and implemented several times before.
Task concurrency is probably not a great idea for heavy numeric work anyways, right? I’m scratching my head a bit about what the use case for this is in Julia given that there are much better ways of doing parallelism already.
The interface does look a lot like Cilk, which is a definite plus as that is an absolute pleasure to use.
This however offers a lot more configurability than Cilk (or at least more accessible configuration). Whether that’s a curse or blessing... I guess we’ll find out! :)
Also, have you looked into implementing Cilk-style reducers at all? [0]
A big difference is that Cilk appears to typically synchronize on function return, while Julia needs an explicit wait/fetch (there is also a @sync macro). That is, spawn/wait can be matched from the AST in Cilk, while Julia demands that the user communicate and store a reference to the started task for later waiting.
It is easy to forget to wait on a task, or even lose reference to it. The current scheduler will typically execute this task quickly, even if nobody ever waits on it; until load changes and you have races everywhere. There are currently no warnings comparable to Python's "RuntimeWarning: coroutine foo was never awaited".
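For what it's worth, the @sync form does give you something close to Cilk's sync-on-scope-exit, at least for tasks spawned lexically inside the block. A minimal sketch (f is a placeholder function):

    @sync begin
        Threads.@spawn f(1)
        Threads.@spawn f(2)
    end    # both spawned tasks are guaranteed to have finished here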
Cilk does indeed sync on function return. You can also manually call 'cilk_sync'. I do think that Cilk’s approach is better (more in line with “make error states unrepresentable”, which I’m a big fan of), but Julia’s approach seems to be more in line with their overall goal of giving the programmer as much control as possible. Not sure if that’s a good thing or not, but as said above... we’ll find out soon!
>The only mainstream languages with this kind of composable parallelism are Go and Erlang, both of which target writing servers, rather than high-performance computing, and which therefore optimize for I/O throughput rather than CPU throughput.
Rust, while perhaps not as mainstream and not necessarily targeting scientific computing users, has data-parallel programming like this via the rayon library.
I have something in the works which is very similar - but is language agnostic. Pro: no need to switch to another language; Con: not high performance.
We are using this in production for now over 1 year, and it helps tremendously. We are currently only supporting a ruby client, but other languages should be more or less easy to bring into the mix. A wider release is planned for later this year - feel free to ask me anything via email!
If you're talking about the Parallel Computing Toolbox (PCT), then, no, you do not get access to multithreading through that, only multiprocessing and GPU processing. I recently considered buying the PCT, and this is one of the main reasons I probably won't.
The only type of multithreading in Matlab is the type that is built into the core language, and you don't get any control over that.
Finally, the claim to originality here isn't just the introduction of multithreading (which isn't even new in Julia!), but the particular threading model and scheduler, which lends itself particularly well to composable multithreading.
I'm super excited about this! For me the most exciting part is these four sentences:
> The model is portable and free from low-level details. You don’t need to explicitly start and stop threads, and you don’t even need to know how many processors or threads there are (though you can find out if you want).
> The model is nestable and composable: you can start parallel tasks that call library functions that themselves start parallel tasks, and everything works. Your CPUs will not be over-subscribed with threads.
It means that you can put threaded parallelism inside libraries without worrying. It means that Julia itself will start making its builtins threaded. It means that Julia itself is much more threadsafe by default. It means that packages can implement threaded algorithms — and you can use multiple threaded packages together at the same time without worrying. And it means that you can put threaded parallelism in your application code and all these things will work together beautifully.
It's going to be really neat to watch what the Julia package ecosystem comes up with now that they have access to truly composable parallelism at the language level. Up until now, most parallelism in the language had to live at the application level rather than being in library code since parallelism either relied on spawning new processes or the Threads.@threads macro which couldn't be nested.
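A small sketch of what that composability means in practice (colnorms is a made-up library function, not from the post):

    import Base.Threads.@spawn

    # a library function that is internally parallel...
    function colnorms(A)
        tasks = [@spawn sqrt(sum(abs2, @view(A[:, j]))) for j in axes(A, 2)]
        return fetch.(tasks)
    end

    # ...can itself be called from parallel application code without oversubscribing the machine:
    results = fetch.([@spawn colnorms(rand(1000, 1000)) for _ in 1:4])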
Not only is this huge for Julia, but they've just thrown down the gauntlet. The status quo has been upset. I expect Julia to start eating everyone's lunch starting with Python. Every language can use good concurrency & parallelism support and this is the biggest news for all dynamic languages.
Julia is already insanely productive for CS and statistics research and this will only make it more so. Super excited to see how this will be used, especially in PPLs, as algorithms like PMCMC benefit hugely from this kind of cooperative, interruptible concurrency.
Why isn't some big company putting its weight behind Julia considering what it has to offer in the field of data science? I just don't get why it's still playing second fiddle to Python. Tech is full of inertia, paradoxically.
Julia has raised millions of dollars in funding, and its backers include some pretty big names: the National Science Foundation, DARPA, and a pretty big chunk from venture capital as well. They don't have a benevolent sponsor like Python has enjoyed from Google, but they're doing alright.
They aren't as big as Python because Python has been around for nearly 30 years. Julia is just a baby compared to that. As it grows it will increase its share. It may not ever surpass Python in terms of raw users, because Julia is focused on numerical computing, while Python is more general purpose.
Julia is focused on numerical computing, not exclusively for numerical computing. Note that Julia has built in n-D arrays and useful operations, while Python requires the numpy package (or other similar package) for them. Julia's Matlab-esque syntax makes it clear where its priorities lie.
You could call any Turing-complete language with I/O a general purpose language, but some are clearly designed for certain tasks.
Julia will be general purpose when it has its own popular web framework. To be an all-round language it must not be second rate with text processing for the sake of being excellent with math.
Because almost every business problem is already solved in Python (meaning in C++/C/Fortran with a Python interface) and most companies do not want to re-implement those libraries just for the sake of using a new language.
Since I stopped using Julia I've kept an eye on the HN Who's Hiring threads for mentions of it. I see one consistent listing from Gambit Research, and occasionally one or two others that come and go. Anecdotal to only sample from HN and not larger listing sites like Indeed etc, but it reinforces my skepticism about industry adoption.
That's not the same as Google's support for Python, Go and Kotlin or Microsoft's support for Typescript and maybe soon Rust. I would have thought the case for supporting Julia was very strong considering Python's performance and concurrency problems.
Python's performance is less of an issue when you use libraries like Numpy that are basically running optimized C code.
Depending on what you're doing, Python's performance is usually fine when the time it saves the user writing code is more valuable than the performance itself (a lot of exploratory scientific research falls into this category).
Does anyone have any recommendations for using Julia on real world projects at this point? Is it mature enough to use and debug on a day to day basis for example? Mainly I’ve been using R for context.
In terms of "canonical ways" and "community approved libraries" here is what the current state looks like to me:
- as a substitute to something as low-level as numpy it is superb
- for anything dealing with differential equations, the ecosystem is unsurpassed
- plotting is a bit chaotic (and the community has not settled on a "one true way to plot")
- a good dataframe implementation already exists
- serialization is a bit chaotic (plenty of ways to do it but no "one true way" yet)
- automatic differentiation is on the cusp of being unsurpassed, but a lot of the tools are still experimental (and the community still has not settled on the best way forward)
- the Juno IDE looks amazing
- the profiling toolkit is amazing
- it is ridiculously easy to use other languages from within Julia
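For instance, a quick sketch with PyCall.jl (assuming the package is installed; RCall.jl and the built-in ccall cover R and C similarly):

    using PyCall
    np = pyimport("numpy")
    np.linspace(0, 1, 5)    # calls straight into the numpy of your local Python installation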
Agreed on all points, but you will pay for all this with usability issues and early-adopter bugs. The error messages are horrendous, documentation is lacking, and if you run code on HPC, saving the data is probably going to cause you data loss at least a few times until you find the right obscure GitHub issues.
This is my experience too; hopefully now that 1.x is home and dry we will gradually see the pain of old libraries suddenly failing ebb away, and documentation become more coherent and accurate. Also, thanks for calling out the error messages - this has been something I've struggled with and it needs some careful attention from the Julia community.
On the other hand, our expectations are huge. I remember early Java: the level of support that you get with Julia now took a very long time to appear for Java, and that was backed by Sun.
If I were an Intel exec I'd be putting money into this though; I can see it gradually crushing the life out of nVidia and AMD, although I think that there will be a few smiles at ARM towers when they look at this too! I wonder what people could do with 2000 ARM cores in a rack?
I started using it in earnest after 1.0 and can honestly say that it's super stable and easy to write and debug. The only "bugs" I've had with it have stemmed from abuse/misunderstanding of the dispatch system, and you can write a lot of Julia without running into those kinds of user mistakes.
Mainly, am I going to spend hours fussing with the basics of the language? For example, R is too slow, so I had to write Rcpp, only to discover there's no real way to debug it from RStudio. Then down the rabbit hole: lldb, RInside, everything that can crash does, terse error messages without line numbers, opaque SEXP types Xcode can't decode, etc. So basically, are the normal everyday use cases fraught with quirks where you need to know more about the internals of Julia, or can you just be a user? If that makes sense.
“Writes like Python runs like C” is one tagline I’ve seen them use. Julia gives you a lot of control over how fast you want your code to be. You can write some code without really thinking too much and it’ll be fast and work just fine. If you get to a point where it’s not fast enough then there’s a whole host of tools to help you speed it up.
At any stage you can call @code_{warntype, lowered,...} and inspect exactly what code has been produced by the compiler to find bottlenecks.
I’ve not written much R before but in my experience I’ve never had to go to those lengths to debug Julia code! (Although the error messages can sometimes be a bit of a mouthful, I think because it uses LLVM)
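A tiny (hypothetical) example of the kind of check that means, run in the REPL:

    f(x) = x > 0 ? x : 0          # can return the Int 0 or a Float64, so it's type-unstable
    @code_warntype f(1.5)         # the Union{Float64, Int64} in the output is flagged as a warning
    g(x) = x > 0 ? x : zero(x)    # type-stable fix
    @code_warntype g(1.5)         # body inferred cleanly as Float64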
As a personal answer, I really like Unitful.jl for keeping track of physical units, and libraries like MLStyle.jl or Match.jl which add stuff like pattern matching to the language (though they are more general helper libraries compared to stuff like Turing.jl and DifferentialEquations.jl):
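A small sketch of the Unitful style (assuming the package is installed):

    using Unitful
    v = 3.0u"m" / 1.5u"s"     # 2.0 m s^-1
    E = 0.5 * 80u"kg" * v^2   # 160.0 kg m^2 s^-2
    uconvert(u"J", E)         # 160.0 J
    # 1.0u"m" + 1.0u"s"       # would throw a DimensionError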
Maybe I misunderstand, but: in R, allocating and copying huge return vectors for small changes on input vectors ruins performance. My understanding of Julia was that one advantage over R is the possibility to mutate input vectors in place. Referring to the psort! example for Julia's multithreading, I wonder why "We simply need to allocate one initially-empty array per thread", which to me looks like allocating huge vectors for small changes down the call recursion (and wastes memory proportional to the number of threads). If Julia can mutate psort!'s first argument, the global vector 'v', from multiple threads, why can't Julia also mutate a single allocated 'temp' argument global vector? Bonus: then psort! could just merge from one argument to the other and save copying back on each recursion level (along the lines of Sedgewick's nocopy mergesort)?
I found out myself: Julia can mutate a single temp buffer across all threads, and implementing Sedgewick's nocopy mergesort (with ping-pong merge) is faster, particularly when passing down preallocated buffer memory to the single-threaded sorting routine once N <= BIG_THRESHOLD. The following are measurements run with JULIA_NUM_THREADS=4 on an Intel(R) Core(TM) i5-4300M CPU @ 2.60GHz, 2594 MHz, 2 physical cores, 4 logical cores; replication code follows. Note that not only is the runtime of nocopy mergesort better, it also causes less garbage collection time. Conclusion: it would make sense for Julia's sorting interface to allow passing in preallocated buffer memory and to use nocopy mergesort behind the curtain. Of course true parallel sorting also requires a parallel merge.
Best
julia> # 4 threads, mergesort, dynamic allocation
julia> @time for i=1:100 c = copy(a); psort!(c); end;
166.270792 seconds (756.58 k allocations: 89.470 GiB, 15.34% gc time)
julia> # 4 threads, mergesort, preallocation per thread, re-allocation in single threaded subroutine
julia> @time for i=1:100 c = copy(a); tsort!(c); end;
143.973485 seconds (658.00 k allocations: 36.972 GiB, 5.89% gc time)
julia> # 4 threads, nocopy mergesort, one preallocation, re-allocation in single threaded subroutine
julia> @time for i=1:100 c = copy(a); ksort!(c); end;
129.715467 seconds (654.51 k allocations: 37.312 GiB, 6.43% gc time)
julia> # 4 threads, nocopy mergesort, one preallocation reused in single threaded subroutine
julia> @time for i=1:100 c = copy(a); jsort!(c); end;
119.139551 seconds (449.91 k allocations: 29.846 GiB, 5.14% gc time)
import Base.Threads.@spawn
const SMALL_THRESHOLD = 2^6
const BIG_THRESHOLD = 2^16
# insertionsort
function isort!(v, lo::Int=1, hi::Int=length(v))
@inbounds for i = lo+1:hi
j = i
x = v[i]
while j > lo
if x < v[j-1]
v[j] = v[j-1]
j -= 1
continue
end
break
end
v[j] = x
end
return v
end
# single threaded merge (from t to v)
function nmerge!(v, lo::Int, mid::Int, hi::Int, t)
i, k, j = lo, lo, mid+1
@inbounds while i <= mid && j <= hi
if t[j] < t[i]
v[k] = t[j]
j += 1
else
v[k] = t[i]
i += 1
end
k += 1
end
@inbounds while i <= mid
v[k] = t[i]
k += 1
i += 1
end
@inbounds while j <= hi
v[k] = t[j]
k += 1
j += 1
end
return v
end
# single threaded nocopy mergesort, one preallocation
function nsort!(v, lo::Int, hi::Int, t)
@inbounds if hi-lo <= SMALL_THRESHOLD
return isort!(v, lo, hi)
end
mid = (lo+hi)>>>1
nsort!(t, lo, mid, v)
nsort!(t, mid+1, hi, v)
nmerge!(v, lo, mid, hi, t)
return v
end
function nsort!(v, lo::Int=1, hi::Int=length(v))
t = copy(v)
nsort!(v,lo,hi,t)
return v
end
# multithreaded mergesort, dynamic allocation
function psort!(v, lo::Int=1, hi::Int=length(v))
@inbounds if hi-lo <= BIG_THRESHOLD
sort!(view(v, lo:hi), alg = MergeSort)
return v
end
mid = (lo+hi)>>>1 # find the midpoint
half = @spawn psort!(v, lo, mid) # task to sort the lower half; will run
psort!(v, mid+1, hi) # in parallel with the current call sorting
# the upper half
wait(half) # wait for the lower half to finish
temp = v[lo:mid] # workspace for merging
i, k, j = 1, lo, mid+1 # merge the two sorted sub-arrays
@inbounds while k < j <= hi
if v[j] < temp[i]
v[k] = v[j]
j += 1
else
v[k] = temp[i]
i += 1
end
k += 1
end
@inbounds while k < j
v[k] = temp[i]
k += 1
i += 1
end
return v
end
# mergesort, preallocation per thread, re-allocation in single threaded subroutine
function tsort!(v, lo::Int=1, hi::Int=length(v), temps=[similar(v, 0) for i = 1:Threads.nthreads()])
if hi - lo < BIG_THRESHOLD # below some cutoff, run in serial
sort!(view(v, lo:hi), alg = MergeSort)
return v
end
mid = (lo+hi)>>>1 # find the midpoint
half = @spawn tsort!(v, lo, mid, temps) # task to sort the lower half; will run
tsort!(v, mid+1, hi, temps) # in parallel with the current call sorting
# the upper half
wait(half) # wait for the lower half to finish
temp = temps[Threads.threadid()] # workspace for merging
length(temp) < mid-lo+1 && resize!(temp, mid-lo+1)
copyto!(temp, 1, v, lo, mid-lo+1)
i, k, j = 1, lo, mid+1 # merge the two sorted sub-arrays
@inbounds while k < j <= hi
if v[j] < temp[i]
v[k] = v[j]
j += 1
else
v[k] = temp[i]
i += 1
end
k += 1
end
@inbounds while k < j
v[k] = temp[i]
k += 1
i += 1
end
return v
end
# multithreaded nocopy mergesort, one preallocation, re-allocation in single threaded subroutine
function ksort!(v, lo::Int, hi::Int, t)
if hi - lo < BIG_THRESHOLD # below some cutoff, run in serial
sort!(view(v, lo:hi), alg = MergeSort)
return v
end
mid = (lo+hi)>>>1 # find the midpoint
half = @spawn ksort!(t, lo, mid, v) # task to sort the lower half; will run
ksort!(t, mid+1, hi, v) # in parallel with the current call sorting
wait(half) # wait for the lower half to finish
nmerge!(v, lo, mid, hi, t)
return v
end
function ksort!(v, lo::Int=1, hi::Int=length(v))
t = copy(v)
ksort!(v,lo,hi,t)
return v
end
# multithreaded nocopy mergesort, one preallocation reused in single threaded subroutine
function jsort!(v, lo::Int, hi::Int, t)
if hi - lo < BIG_THRESHOLD # below some cutoff, run in serial
return nsort!(v, lo, hi, t)
end
mid = (lo+hi)>>>1 # find the midpoint
half = @spawn jsort!(t, lo, mid, v) # task to sort the lower half; will run
jsort!(t, mid+1, hi, v) # in parallel with the current call sorting
wait(half) # wait for the lower half to finish
nmerge!(v, lo, mid, hi, t)
return v
end
function jsort!(v, lo::Int=1, hi::Int=length(v))
t = copy(v)
jsort!(v,lo,hi,t)
return v
end
a = rand(2000);
b = copy(a); @time sort!(b, alg = MergeSort); # reference
c = copy(a); @time isort!(c); # insertionsort
all(b==c) # regression test
a = rand(20000000);
b = copy(a); @time sort!(b, alg = MergeSort); # single-threaded system mergesort used by psort and tsort
c = copy(a); @time nsort!(c); # single threaded nocopy mergesort, one preallocation
all(b==c) # regression test
c = copy(a); @time psort!(c); # 4 threads, mergesort, dynamic allocation
all(b==c) # regression test
c = copy(a); @time tsort!(c); # 4 threads, mergesort, preallocation per thread, re-allocation in single threaded subroutine
all(b==c) # regression test
c = copy(a); @time ksort!(c); # 4 threads, nocopy mergesort, one preallocation, re-allocation in single threaded subroutine
all(b==c) # regression test
c = copy(a); @time jsort!(c); # 4 threads, nocopy mergesort, one preallocation reused in single threaded subroutine
all(b==c) # regression test
# 4 threads, mergesort, dynamic allocation
@time for i=1:100 c = copy(a); psort!(c); end;
# 4 threads, mergesort, preallocation per thread, re-allocation in single threaded subroutine
@time for i=1:100 c = copy(a); tsort!(c); end;
# 4 threads, nocopy mergesort, one preallocation, re-allocation in single threaded subroutine
@time for i=1:100 c = copy(a); ksort!(c); end;
# 4 threads, nocopy mergesort, one preallocation reused in single threaded subroutine
@time for i=1:100 c = copy(a); jsort!(c); end;
I believe the `psort!` code that was shown in the blog post was a 'toy example', meant to demonstrate how to use different parallel techniques, one of them being the use of thread-local state.
There's probably going to be a lot of development of parallelized code going forward. Maybe you could help? :)
Providing Julia code for in-memory sorting, i.e. a binary mergesort with parallel branches and a parallel merge, plus a specialized sort for unstable in-place sorting (a better quicksort, parallel branches only), is not a huge step from the code above: I could contribute that. If, however, the expectation is that the sort is more general, i.e. k-ary and performant for data actually on SSD or even on disk (JuliaDB?), that's a completely different story: one would need to experiment with much more complicated algorithms such as cache-oblivious funnelsort, parallel superscalar samplesort, real disk sorts ...
I don't have time for that. How would I avoid duplicating the efforts of others?
Thanks! We do our best to make sure that people who want to get ahold of the latest Julia binaries are always able to. Our downloads page [0] always offers the latest stable builds, whereas the nightlies page [1] allows you to download binaries off of the tip of the `master` branch, which usually lag behind the github repository by only a couple of hours.