It's hard to overstate what a big deal this is (and how much effort went into it). The only mainstream languages with this kind of composable parallelism are Go and Erlang, both of which target writing servers, rather than high-performance computing, and which therefore optimize for I/O throughput rather than CPU throughput.
The only systems which have supported compute-oriented composable parallelism like this are Cilk and TBB, which are both really excellent and successful—and provided major inspiration for this work. However, they both seem to have fallen short of widespread adoption partly due to being non-free (until recently) and also, I suspect, because they are non-standard C/C++ language/compiler extensions, which seem to never catch on that broadly.
This is a brave new world for high-performance technical computing, not just for Julia.
For completeness' sake, Haskell has had this for over a decade :) An SO thread that gives a nice overview [0], the paper [1], the library [2]. Here's what a parallel fib looks like:
    import Control.Parallel (par, pseq)

    fib :: Int -> Int
    fib n
      | n <= 1    = 1
      | otherwise = x `par` (y `pseq` x + y + 1)
      where
        x = fib (n - 1)
        y = fib (n - 2)
What do you mean? That's my point exactly. While Haskell has that now, they didn't always have it.
If a programming language is created in 2019, it can certainly incorporate insights from existing technology into its design and plans right away. However, that doesn't mean that any programming language/compiler project started today will instantly have all of those features. Things take time in each new framework.
Haskell didn’t have it originally. Ok. But what do you think that says about whether it’s a new idea or not? What’s the connection? Why mention it in this thread about whether it's a new idea or not?
At what point did anyone claim it was a new idea? The blog post cites prior art in the very first paragraph. Not in a footnote—in the main text. The top post here cites several other languages with a similar model. This entire thread reads like a bunch of dudes who are really dying to "well, actually" someone.
It's extremely debatable that Haskell is compute-oriented. I know that Haskell people like to talk about doing numerical computing in Haskell, but in reality it does not seem to be a thing. No one has ever used Haskell to implement large scale scientific computations on supercomputers, for example.
This particular example appears to be an implementation of a 'Computer algebra system', which seems to be a type of symbolic computing rather than numerical computing.
> this kind of composable parallelism are Go and Erlang, ... which therefore optimize for I/O throughput rather than CPU throughput
Parallelism is always about CPU throughput, which is precisely why you don't run more OS threads than the number of cores.
Concurrency primitives (whether in Go, Erlang or any other programming language) are meant for I/O-bound tasks. Let's not conflate concurrency primitives enabling I/O-bound tasks and parallelism primitives enabling CPU-bound tasks.
> precisely why you don't run any more OS threads than the number of cores
It’s often worth running more than your number of cores, in order to keep execution units busy during memory fetches for example, or integer units busy while float units are a bottleneck otherwise. (Did a PhD partially on this kind of thing.)
You can still use Erlang-style concurrency (or Go-style as well) to truly parallelize CPU-bound tasks. It's a bit less efficient due to Erlang's design, where you can't readily share mutable data between processes, but it will still enable true parallel calculations on all cores. By default BEAM runs one execution thread per core. For example, it'd be trivial to write a true parallel merge sort in Erlang/Elixir.
Now, NodeJS- or Python asyncio-style concurrency _is_ truly limited by the VM and can't do native parallel computations in a single process.
(Since this is basically now a comment where everyone "me too"s in a reply...)
F# also has facilities for this, depending on what you're after. The _MailboxProcessor_ type, though not as "for real" as the green threading in Go, allows for some pretty high throughput with a relatively sane message-passing abstraction.
And .NET itself offers many means to do parallel programming (which can get quite optimized), for which F# has some nice abstractions.
So I don't think it's quite fair to say that only Go and Erlang have good support for parallelism. I also don't think their abstractions are typically used for parallelism (but I could be wrong).
As a physicist, it's been really cool seeing the awareness of Julia in the community gradually increase.
Large swathes of the physics community have been migrating their code to Python over the past few years but it really seems now that Julia is getting so many amazing features and packages that it can't be ignored.
Same here, I worked in a computational biophysics department with lots of python/bash/R and I was the only one who wrote lots of high-performance code in Julia. People were curious about the language but it was still very much unknown. Hope we will see a broader adoption of Julia in the future - it's just that much better for the stuff we do on a daily basis.
Great to see this finally land - thanks for all the team's work. Looking forward to giving it a whirl.
Threading is something of a prerequisite for acceptance as a serious language among many folks. So great to not just check that box, but to stick the pen right through it.
The devil is always in the details, but from the doc the interface looks pretty nice (given threaded code generally looks like it was written by Cthulhu).
This is a bit over the top isn’t it? It’s just task parallelism. It’s exactly what for example Haskell has been doing for decades isn’t it? It’s even the same demo and interface!
It's literally just futures. (Still very hard to implement, of course, and in particular IMO this kind of implementation (M-to-N threading) is much better than the async/await implementations.) But CS is pretty settled on futures being the "best" model; even C++ has them!
In my experience most work-stealing schedulers for task parallelism like this are going to be depth-first. I’ve written several myself. Primarily for temporal locality - if you just created a task it’ll be resident in cache.
I'm still trying to access the relevant papers, but from bits of discussions here it seems to me that the difference between the Julia scheduler and Cilk is that the former will try to distribute sibling leaf tasks among executors, enforcing as close to a global depth-first view as possible, while Cilk does depth-first on each executor, but different executors will be exploring distinct parts of the problem space, branching off as close to the root as possible (due to the steal-the-oldest-ancestor behaviour). It is a compromise between better cross-executor cache locality and increased contention, I guess.
Futures in C++ are really meant for concurrency, especially for I/O. They are currently way too heavyweight for fine-grained parallelism (although there is ongoing work on that). Also, futures are about packaging tasks; all the stuff being discussed here is about optimal scheduling, which is completely orthogonal.
Do people use Cilk heavily in the high-performance computing world? They did not when I was part of it. I think OpenMP (https://www.openmp.org/) is what more people would reach for. It started as a way to parallelize loops, but has had a task framework for a long time which I believe also allows composable parallelism.
OpenMP, however, depends on not just compiler extensions, but pragmas, so it tends to not find traction outside of the HPC world.
While extremely elegant and minimalistic, Cilk never gained much traction, I think. GCC has deprecated it and is removing it. I think that OpenMP was too entrenched in the language-extension space (and has since grown at least the potential for Cilk-like functionality), while more portable libraries (TBB for example) covered everything else.
Yes, Cilk has been enormously influential. I have personally cited it many times. But I always had the impression it was a little too early for its time, and was never picked up by the HPC community, hence my question.
The real gain is with MPI, anyway... shared memory is a trap. All the projects I've worked on have exploited parallelism through message passing, or a battle tested multithreaded linear algebra library.
There are many techniques and calling any of them the one solution and all others a trap is extremely myopic. Shared memory is extremely useful for IPC and fork join parallelism.
My understanding is that, at least in C/Fortran/C++ land, common practice is OpenMP within a machine and MPI across machines. I've done some work of the sort, though I'm far from an expert.
GPUs are the way to go these days to get the most parallelism, and that is definitely the shared-memory variety. Message passing is the go-to when parallelism needs to cross machine boundaries.
GPU parallelism is highly specialized and requires very specific workloads (e.g., many linear algebra workloads). GPUs are a bad choice for general parallelized tasks.
A quibble: it's not necessarily "heavy numeric" problems, but problems where the calculations involve extensive data-reuse. (Of which matrix operations are the most obvious.) The GPU is massively parallel, but the up-front cost of transferring data to the GPU is quite high.
Hence, in order for that up-front cost to be worth it, the algorithms run on the GPU need to re-use that data many times. If the algorithms running on the GPU don't make extensive reuse of the data sent to it, it would be faster to just do the calculation on the host CPU.
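To put rough numbers on that (these are assumed, order-of-magnitude figures, not from the thread):

    pcie_bytes_per_s = 16e9               # roughly PCIe 3.0 x16 host-to-device bandwidth
    gpu_flop_per_s   = 10e12              # roughly a modern GPU's single-precision throughput
    gpu_flop_per_s / pcie_bytes_per_s     # ~625 floating-point ops per transferred byte just to keep the ALUs fed

So unless each byte you ship over the bus participates in hundreds of operations, the transfer dominates.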
You're missing C#/.NET which has had async/parallel programming constructs for a long time, and served as the model for await/async copied by many other languages.
It's gone through several design patterns from the original APM (async programming model) to EAP (event based pattern) to TPL (task parallel library) to the all purpose Task/ValueTask embedded into the language. It's very fast and efficient, and you can build your own scheduler too. Combined with hardware intrinsics in the latest .NET Core, it's very good for computational and server work.
I'm surprised Julia considers this such a big update; it does seem to be an overstatement to say this changes technical computing.
Julia has had the equivalent of async/await forever, except without needing to annotate anything with keywords—that’s just how I/O works. If you call a blocking function, it yields back into the scheduler. In Julia (as of 1.3), tasks are mapped onto multiple threads which run at the same time (in parallel) and threads that are idle can steal work efficiently. This is similar (as was said) to Cilk, TBB and Go. It is not the same as async/await in C#, which is concurrent but not parallel.
What is 100% new and does not exist in any other system is that the task scheduling is depth first (rather than breadth first), which is a much better order for doing compute-heavy workloads. This ability is new and was the culmination of an Intel research project, applied to Julia.
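Concretely, the spawn/fetch style looks like this in 1.3 (essentially the parallel fib demo from the announcement):

    import Base.Threads.@spawn

    function pfib(n::Int)
        n <= 1 && return n
        a = @spawn pfib(n - 2)   # this task may be run by any thread
        b = pfib(n - 1)          # the current task keeps working meanwhile
        return fetch(a) + b
    end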
Hmm, isn't Cilk-like work stealing depth-first? Idle executors steal the oldest ancestor, and busy executors eagerly execute recursive calls while pushing the parent onto the local job stack?
Proof of optimality was in the early Cilk papers in the '90s.
I'm sure the same algorithms made it into Julia via Intel (which bought Cilk in the early '00s), but it is hardly new, except in the 'everything old is new again' sense.
Also IIRC OpenMP supports (or at least allows) work stealing although at least GCC does not implement it yet.
It's all about better L3 utilization. Cilk is good for using L1 and L2, i.e. caches that are core-local, and suboptimal for using L3, i.e. caches that are shared between cores.
In the presentation posted here there was a paper citation that compared the model used (parallel depth-first scheduling) to work stealing in terms of cache behaviour, but I haven't had time to read it yet.
What do you mean by depth first vs breadth first? Naively I would assume task scheduling necessarily to be depth first in some sense because you have to evaluate your tasks' dependencies first.
Consider a machine with N cores running a program which spawns N parallel tasks, each of which in turn spawns N parallel tasks. Which subtasks do the cores work on initially? If it's the first subtask of each top-level task, that's breadth-first scheduling. If it's all N subtasks of the first top-level task, that's depth-first scheduling. The latter is better for high-performance computing because it tends to be the lower-level libraries that are really carefully tuned to make optimal use of the machine (think BLAS/FFT/parallel sum/prefix sum). Previously it was unknown how to implement this kind of scheduling efficiently. See this talk by Kiran Pamnany for more details:
Ah I see what you mean! I really hate to pile on with the Haskell observation but the 1995 paper by Blelloch that is mentioned in that talk is exactly the heart of GHC's data parallelism as well (under the name nested data parallelism).
Later versions of Blelloch's language are referred to in other parallel Haskell papers.
FWIW I think this is nonetheless fantastic news for Julia! It's certainly poised to do great work in areas that I doubt will get any Haskell traction anytime soon (if ever) for good reasons.
EDIT: Interesting, I think Julia is taking a coarser grained approach here that breaks down a bit for irregularly balanced execution trees, but relies less on fusion (i.e. doesn't use fusion at all) which is nice. I'm curious to see how this particular implementation of the "depth first" idea goes!
EDIT EDIT: Actually I'm not sure how much fusion is necessary for GHC's approach... Time to investigate more closely. Also call out to anyone more familiar with Data Parallel Haskell to chime in. I'm getting out of my depth.
I'm not surprised that Haskell supports this kind of parallelism, although I wasn't aware of it previously. After all, if you're going to the trouble of being that pure, why not be parallel?
Isn't a blocking function, by definition, something that blocks and doesn't yield? How could the language know to automatically yield on something without annotations?
> "async/await in C#, which is concurrent but not parallel"
C# is multithreaded. Things can only be parallel if you have 2 separate paths of execution that don't depend on each other or any serial order, but you can also kick off millions of individual tasks in a single function and await them all too. These tasks are all scheduled onto threads in a managed threadpool that also share work and run at the same time. I fail to see how this is any different or less capable than the others you mentioned.
C# has threads but async/await does not use them. If you think that `async` in C# does the same thing as `@spawn` in Julia and `go` in Go, you are mistaken. It does not. There is also an `@async` macro in Julia that does what the `async` keyword in C# does (which, again, does not involve threads).
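To make the distinction concrete (do_io and crunch here are placeholder functions, not real APIs):

    t_io  = @async do_io()            # cooperative task; stays on the thread that created it
    t_cpu = Threads.@spawn crunch()   # task the scheduler may run on any thread, in parallel
    wait(t_io); fetch(t_cpu)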
I'm very familiar with C# and I'm not talking about threads, only that multi-threading is a basic requirement and C# has a threadpool running threads in parallel.
I said that C# also uses Tasks which are scheduled onto these threads, which is what you said Julia does, so it's the same. C# can also start new threads just as easily, or use channels like Go with the TPL/Dataflow, System.Threading.Channels, actor frameworks, Reactive Extensions, or just chaining tasks together. Other than the depth-first algorithm, it seems like C# has the same capabilities as Julia and the others, if not more, and is much more "mainstream".
Can you explain how blocking functions automatically yield without keywords, though? I can't find any documentation on this that doesn't mention the @async macro.
I really dislike using Tasks over threads due to the reusability issues. Specifically that Tasks are not reusable and threads are. It makes parallel computational work a headache, because you need to implement the reusability in a higher scope than tasks, then make an effort to prevent race-conditions, double checkouts, etc.
Tasks are really meant for responsive computing, not proactive computing (yes, even TPL and Dataflow), and thus I don't think they are well suited to parallelized computation in areas like e.g. simulation.
When a program is composed as a Dataflow pipeline, you don't handle tasks at all (unless you want to). I'm not sure what reusability issues you're referring to. Dataflow is very applicable to high CPU-throughput (i.e. efficient) pipelines. Concurrent by default, parallel on-demand. In practice, any I/O in my pipelines takes place in singleton consumers at the end of complex, highly-parallel, CPU-intensive code.
C# code that uses Tasks and C# programs built on Dataflow are very different. The latter benefits you most when you commit entirely to that paradigm.
If you want to get more low-level and low-overhead (e.g., for simulations), the same thing applies to things like Threading Building Blocks [1] in C++. I'd even argue that TBB is one of the best applications of this theory, going so far as to ship with dedicated tooling for the whole lifecycle of designing, implementing, and profiling task- and graph-parallel programs [2]. Furthermore, it works well with other HPC libraries. Here [3] is an example of mixing it with OpenMP.
In summary, both Dataflow and TBB (et al) can readily be used to create CPU-intensive, proactive computing pipelines today. Even though they're "just" fancy Task schedulers, which are in turn "just" fancy thread multiplexers.
Reusable as in: I can create a data structure in a thread, do some computation, yield control from the thread (wait for something else to tell it to go again), then come back to the thread with the same data structure and keep going. With Tasks, you are supposed to let tasks complete rather than pause/sync them, so anything you declare locally is lost when you want to restart. The workaround is to declare things in a higher scope and pass them into tasks as they are rerun. This makes tasks ill-suited to anything where you want to persist data, since tasks are opinionated in that they are only meant to contain logic, not state, so you have to add your own external logic to manage the allocation of state to the tasks you keep recreating.
This trait of tasks makes sense when writing things like web servers because it prevents developers from shooting themselves in the foot by writing something that isn't stateless.
By proactive computing, I mean some program that always has control of the entire execution graph. For example, a simulation may initialize a bunch of datastructures, separate some of them into threads, run the threads, wait for all of them to complete x ticks, do some synchronization or measurement of the total system, then continue y times, and exit.
Because sometimes you need to sync the entire program state, or some subset of the state, to make measurements or to let disparate parts of the system interact
I'm not certain it's an overstatement, but I'm not convinced it isn't, either. As you note, these ideas are old (they mostly predate C# and .NET by decades). On the other hand, to a first-order approximation, C# has made zero impact in this area. There are a bunch of reasons, but that's the way it is.
Fair or not, Julia does seem to have the potential to be the first really interesting shift in this area in ages. Doesn’t mean it will get there, but there is interest.
I was pointing out another very mainstream language (even more than Go and far more than Erlang) that has the same abilities.
What do you mean by impact/shift exactly? I guess I'm not understanding what's suddenly going to change now since, as you say, these ideas are old and implemented several times before.
Task concurrency is probably not a great idea for heavy numeric work anyways, right? I’m scratching my head a bit about what the use case for this is in Julia given that there are much better ways of doing parallelism already.
The interface does look a lot like Cilk, which is a definite plus as that is an absolute pleasure to use.
This however offers a lot more configurability than Cilk (or at least more accessible configuration). Whether that’s a curse or blessing... I guess we’ll find out! :)
Also, have you looked into implementing Cilk-style reducers at all? [0]
A big difference is that Cilk appears to typically synchronize on function return, while Julia needs an explicit wait/fetch (there is also a @sync macro). That is, spawn/wait can be matched from the AST in Cilk, while Julia demands that the user communicate and store a reference to the started task for later waiting.
It is easy to forget to wait on a task, or even lose reference to it. The current scheduler will typically execute this task quickly, even if nobody ever waits on it; until load changes and you have races everywhere. There are currently no warnings comparable to Python's "RuntimeWarning: coroutine foo was never awaited".
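For what it's worth, the @sync form does give you something close to Cilk's sync-on-scope-exit, at least for tasks spawned lexically inside the block. A minimal sketch (f is a placeholder function):

    @sync begin
        Threads.@spawn f(1)
        Threads.@spawn f(2)
    end    # both spawned tasks are guaranteed to have finished here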
Cilk does indeed sync on function return. You can also manually call 'cilk_sync'. I do think that Cilk’s approach is better (more in line with “make error states unrepresentable”, which I’m a big fan of), but Julia’s approach seems to be more in line with their overall goal of giving the programmer as much control as possible. Not sure if that’s a good thing or not, but as said above... we’ll find out soon!
>The only mainstream languages with this kind of composable parallelism are Go and Erlang, both of which target writing servers, rather than high-performance computing, and which therefore optimize for I/O throughput rather than CPU throughput.
Rust, while perhaps not as mainstream and not necessarily targeting scientific computing users, has data-parallel programming like this via the rayon library.
I have something in the works which is very similar - but is language agnostic. Pro: no need to switch to another language; Con: not high performance.
We are using this in production for now over 1 year, and it helps tremendously. We are currently only supporting a ruby client, but other languages should be more or less easy to bring into the mix. A wider release is planned for later this year - feel free to ask me anything via email!
If you're talking about the Parallel Computing Toolbox (PCT), then, no, you do not get access to multithreading through that, only multiprocessing and GPU processing. I recently considered buying the PCT, and this is one of the main reasons I probably won't.
The only type of multithreading in Matlab is the type that is built into the core language, and you don't get any control over that.
Finally, the claim to originality here isn't just the introduction of multithreading (which isn't even new in Julia!), but the particular threading model and scheduler, which lends itself particularly well to composable multithreading.
I'm super excited about this! For me the most exciting part is these four sentences:
> The model is portable and free from low-level details. You don’t need to explicitly start and stop threads, and you don’t even need to know how many processors or threads there are (though you can find out if you want).
> The model is nestable and composable: you can start parallel tasks that call library functions that themselves start parallel tasks, and everything works. Your CPUs will not be over-subscribed with threads.
It means that you can put threaded parallelism inside libraries without worrying. It means that Julia itself will start making its builtins threaded. It means that Julia itself is much more threadsafe by default. It means that packages can implement threaded algorithms — and you can use multiple threaded packages together at the same time without worrying. And it means that you can put threaded parallelism in your application code and all these things will work together beautifully.
It's going to be really neat to watch what the Julia package ecosystem comes up with now that they have access to truly composable parallelism at the language level. Up until now, most parallelism in the language had to live at the application level rather than being in library code since parallelism either relied on spawning new processes or the Threads.@threads macro which couldn't be nested.
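A small sketch of what that composability means in practice (colnorms is a made-up library function, not from the post):

    import Base.Threads.@spawn

    # a library function that is internally parallel...
    function colnorms(A)
        tasks = [@spawn sqrt(sum(abs2, @view(A[:, j]))) for j in axes(A, 2)]
        return fetch.(tasks)
    end

    # ...can itself be called from parallel application code without oversubscribing the machine:
    results = fetch.([@spawn colnorms(rand(1000, 1000)) for _ in 1:4])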
Not only is this huge for Julia, but they've just thrown down the gauntlet. The status quo has been upset. I expect Julia to start eating everyone's lunch starting with Python. Every language can use good concurrency & parallelism support and this is the biggest news for all dynamic languages.
Julia is already insanely productive for CS and statistics research and this will only make it more so. Super excited to see how this will be used, especially in PPLs, as algorithms like PMCMC benefit hugely from this kind of cooperative, interruptible concurrency.
Why isn't some big company putting its weight behind Julia considering what it has to offer in the field of data science? I just don't get why it's still playing second fiddle to Python. Tech is full of inertia, paradoxically.
Julia has raised millions of dollars in funding, and its backers include some pretty big names: the National Science Foundation, DARPA, and a pretty big chunk from venture capital as well. They don't have a benevolent sponsor like Python has enjoyed from Google, but they're doing alright.
They aren't as big as Python because Python has been around for nearly 30 years. Julia is just a baby compared to that. As it grows it will increase its share. It may not ever surpass Python in terms of raw users, because Julia is focused on numerical computing, while Python is more general purpose.
Julia is focused on numerical computing, not exclusively for numerical computing. Note that Julia has built in n-D arrays and useful operations, while Python requires the numpy package (or other similar package) for them. Julia's Matlab-esque syntax makes it clear where its priorities lie.
You could call any Turing-complete language with I/O a general purpose language, but some are clearly designed for certain tasks.
Julia will be general purpose when it has its own popular web framework. To be an all-round language it must not be second rate with text processing for the sake of being excellent with math.
Because almost every business problem is already solved in Python (meaning in C++/C/Fortran with a Python interface) and most companies do not want to re-implement those libraries just for the sake of using a new language.
Since I stopped using Julia I've kept an eye on the HN Who's Hiring threads for mentions of it. I see one consistent listing from Gambit Research, and occasionally one or two others that come and go. Anecdotal to only sample from HN and not larger listing sites like Indeed etc, but it reinforces my skepticism about industry adoption.
That's not the same as Google's support for Python, Go and Kotlin or Microsoft's support for Typescript and maybe soon Rust. I would have thought the case for supporting Julia was very strong considering Python's performance and concurrency problems.
Python's performance is less of an issue when you use libraries like Numpy that are basically running optimized C code.
Depending on what you're doing, Python's performance is usually fine when the time it saves the user writing code is more valuable than the performance itself (a lot of exploratory scientific research falls into this category).
Does anyone have any recommendations for using Julia on real world projects at this point? Is it mature enough to use and debug on a day to day basis for example? Mainly I’ve been using R for context.
In terms of "canonical ways" and "community approved libraries" here is what the current state looks like to me:
- as a substitute to something as low-level as numpy it is superb
- for anything dealing with differential equations, the ecosystem is unsurpassed
- plotting is a bit chaotic (and the community has not settled on a "one true way to plot")
- a good dataframe implementation already exists
- serialization is a bit chaotic (plenty of ways to do it but no "one true way" yet)
- automatic differentiation is on the cusp of being unsurpassed, but a lot of the tools are still experimental (and the community still has not settled on the best way forward)
- the Juno IDE looks amazing
- the profiling toolkit is amazing
- it is ridiculously easy to use other languages from within Julia
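For instance, a quick sketch with PyCall.jl (assuming the package is installed; RCall.jl and the built-in ccall cover R and C similarly):

    using PyCall
    np = pyimport("numpy")
    np.linspace(0, 1, 5)    # calls straight into the numpy of your local Python installation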
Agreed on all points, but you will pay for all this with usability issues and early-adopter bugs. The error messages are horrendous, documentation is lacking, and if you run code on HPC, saving the data is probably going to cause you data loss at least a few times until you find the right obscure GitHub issues.
This is my experience too; hopefully now that 1.x is home and dry we will gradually see the pain of old libraries suddenly failing ebb away, and documentation become more coherent and accurate. Also, thanks for calling out the error messages - this has been something I've struggled with and it needs some careful attention from the Julia community.
On the other hand, our expectations are huge. I remember early Java: the level of support that you get with Julia now took a very long time to appear for Java, and that was backed by Sun.
If I were an Intel exec I'd be putting money into this though; I can see it gradually crushing the life out of nVidia and AMD, although I think that there will be a few smiles at ARM towers when they look at this too! I wonder what people could do with 2000 ARM cores in a rack?
I started using it in earnest after 1.0 and can honestly say that it's super stable and easy to write and debug. The only "bugs" I've had with it have stemmed from abuse/misunderstanding of the dispatch system, and you can write a lot of Julia without running into those kinds of user mistakes.
Mainly, am I going to spend hours fussing with the basics of the language? For example, R is too slow, so I had to write Rcpp, only to discover there's no real way to debug it from RStudio. Then down the rabbit hole: lldb, RInside, everything that can crash does, terse error messages without line numbers, opaque SEXP types Xcode can't decode, etc. So basically, are the normal everyday use cases fraught with quirks where you need to know more about the internals of Julia, or can you just be a user? If that makes sense.
“Writes like Python runs like C” is one tagline I’ve seen them use. Julia gives you a lot of control over how fast you want your code to be. You can write some code without really thinking too much and it’ll be fast and work just fine. If you get to a point where it’s not fast enough then there’s a whole host of tools to help you speed it up.
At any stage you can call @code_{warntype, lowered,...} and inspect exactly what code has been produced by the compiler to find bottlenecks.
I’ve not written much R before but in my experience I’ve never had to go to those lengths to debug Julia code! (Although the error messages can sometimes be a bit of a mouthful, I think because it uses LLVM)
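A tiny (hypothetical) example of the kind of check that means, run in the REPL:

    f(x) = x > 0 ? x : 0          # can return the Int 0 or a Float64, so it's type-unstable
    @code_warntype f(1.5)         # the Union{Float64, Int64} in the output is flagged as a warning
    g(x) = x > 0 ? x : zero(x)    # type-stable fix
    @code_warntype g(1.5)         # body inferred cleanly as Float64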
As a personal answer, I really like Unitful.jl for keeping track of physical units, and libraries like MLStyle.jl or Match.jl which add stuff like pattern matching to the language (though they are more general helper libraries compared to stuff like Turing.jl and DifferentialEquations.jl):
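A small sketch of the Unitful style (assuming the package is installed):

    using Unitful
    v = 3.0u"m" / 1.5u"s"     # 2.0 m s^-1
    E = 0.5 * 80u"kg" * v^2   # 160.0 kg m^2 s^-2
    uconvert(u"J", E)         # 160.0 J
    # 1.0u"m" + 1.0u"s"       # would throw a DimensionError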
Maybe I misunderstand, but: in R, allocating and copying huge return vectors for small changes on input vectors ruins performance. My understanding of Julia was that one advantage over R is the possibility to mutate input vectors in place. Referring to the psort! example for Julia's multithreading, I wonder why "We simply need to allocate one initially-empty array per thread", which to me looks like allocating huge vectors for small changes down the call recursion (and wastes memory proportional to the number of threads). If Julia can mutate psort!'s first argument, the global vector 'v', from multiple threads, why can't Julia also mutate a single allocated 'temp' argument global vector? Bonus: then psort! could just merge from one argument to the other and save copying back on each recursion level (along the lines of Sedgewick's nocopy mergesort)?
I found out myself: Julia can mutate a single temp buffer across all threads, and implementing Sedgewick's nocopy mergesort (with ping-pong merge) is faster, particularly when passing down preallocated buffer memory to the single-threaded sorting routine once N <= BIG_THRESHOLD. The following are measurements run with JULIA_NUM_THREADS=4 on an Intel(R) Core(TM) i5-4300M CPU @ 2.60GHz, 2594 MHz, 2 physical cores, 4 logical cores; replication code follows. Note that not only is the runtime of nocopy mergesort better, it also causes less garbage collection time. Conclusion: it would make sense for Julia's sorting interface to allow passing in preallocated buffer memory and to use nocopy mergesort behind the curtain. Of course true parallel sorting also requires a parallel merge.
Best
julia> # 4 threads, mergesort, dynamic allocation
julia> @time for i=1:100 c = copy(a); psort!(c); end;
166.270792 seconds (756.58 k allocations: 89.470 GiB, 15.34% gc time)
julia> # 4 threads, mergesort, preallocation per thread, re-allocation in single threaded subroutine
julia> @time for i=1:100 c = copy(a); tsort!(c); end;
143.973485 seconds (658.00 k allocations: 36.972 GiB, 5.89% gc time)
julia> # 4 threads, nocopy mergesort, one preallocation, re-allocation in single threaded subroutine
julia> @time for i=1:100 c = copy(a); ksort!(c); end;
129.715467 seconds (654.51 k allocations: 37.312 GiB, 6.43% gc time)
julia> # 4 threads, nocopy mergesort, one preallocation reused in single threaded subroutine
julia> @time for i=1:100 c = copy(a); jsort!(c); end;
119.139551 seconds (449.91 k allocations: 29.846 GiB, 5.14% gc time)
import Base.Threads.@spawn
const SMALL_THRESHOLD = 2^6
const BIG_THRESHOLD = 2^16
# insertionsort
function isort!(v, lo::Int=1, hi::Int=length(v))
@inbounds for i = lo+1:hi
j = i
x = v[i]
while j > lo
if x < v[j-1]
v[j] = v[j-1]
j -= 1
continue
end
break
end
v[j] = x
end
return v
end
# single threaded merge (from t to v)
function nmerge!(v, lo::Int, mid::Int, hi::Int, t)
i, k, j = lo, lo, mid+1
@inbounds while i <= mid && j <= hi
if t[j] < t[i]
v[k] = t[j]
j += 1
else
v[k] = t[i]
i += 1
end
k += 1
end
@inbounds while i <= mid
v[k] = t[i]
k += 1
i += 1
end
@inbounds while j <= hi
v[k] = t[j]
k += 1
j += 1
end
return v
end
# single threaded nocopy mergesort, one preallocation
function nsort!(v, lo::Int, hi::Int, t)
@inbounds if hi-lo <= SMALL_THRESHOLD
return isort!(v, lo, hi)
end
mid = (lo+hi)>>>1
nsort!(t, lo, mid, v)
nsort!(t, mid+1, hi, v)
nmerge!(v, lo, mid, hi, t)
return v
end
function nsort!(v, lo::Int=1, hi::Int=length(v))
t = copy(v)
nsort!(v,lo,hi,t)
return v
end
# multithreaded mergesort, dynamic allocation
function psort!(v, lo::Int=1, hi::Int=length(v))
@inbounds if hi-lo <= BIG_THRESHOLD
sort!(view(v, lo:hi), alg = MergeSort)
return v
end
mid = (lo+hi)>>>1 # find the midpoint
half = @spawn psort!(v, lo, mid) # task to sort the lower half; will run
psort!(v, mid+1, hi) # in parallel with the current call sorting
# the upper half
wait(half) # wait for the lower half to finish
temp = v[lo:mid] # workspace for merging
i, k, j = 1, lo, mid+1 # merge the two sorted sub-arrays
@inbounds while k < j <= hi
if v[j] < temp[i]
v[k] = v[j]
j += 1
else
v[k] = temp[i]
i += 1
end
k += 1
end
@inbounds while k < j
v[k] = temp[i]
k += 1
i += 1
end
return v
end
# mergesort, preallocation per thread, re-allocation in single threaded subroutine
function tsort!(v, lo::Int=1, hi::Int=length(v), temps=[similar(v, 0) for i = 1:Threads.nthreads()])
if hi - lo < BIG_THRESHOLD # below some cutoff, run in serial
sort!(view(v, lo:hi), alg = MergeSort)
return v
end
mid = (lo+hi)>>>1 # find the midpoint
half = @spawn tsort!(v, lo, mid, temps) # task to sort the lower half; will run
tsort!(v, mid+1, hi, temps) # in parallel with the current call sorting
# the upper half
wait(half) # wait for the lower half to finish
temp = temps[Threads.threadid()] # workspace for merging
length(temp) < mid-lo+1 && resize!(temp, mid-lo+1)
copyto!(temp, 1, v, lo, mid-lo+1)
i, k, j = 1, lo, mid+1 # merge the two sorted sub-arrays
@inbounds while k < j <= hi
if v[j] < temp[i]
v[k] = v[j]
j += 1
else
v[k] = temp[i]
i += 1
end
k += 1
end
@inbounds while k < j
v[k] = temp[i]
k += 1
i += 1
end
return v
end
# multithreaded nocopy mergesort, one preallocation, re-allocation in single threaded subroutine
function ksort!(v, lo::Int, hi::Int, t)
if hi - lo < BIG_THRESHOLD # below some cutoff, run in serial
sort!(view(v, lo:hi), alg = MergeSort)
return v
end
mid = (lo+hi)>>>1 # find the midpoint
half = @spawn ksort!(t, lo, mid, v) # task to sort the lower half; will run
ksort!(t, mid+1, hi, v) # in parallel with the current call sorting
wait(half) # wait for the lower half to finish
nmerge!(v, lo, mid, hi, t)
return v
end
function ksort!(v, lo::Int=1, hi::Int=length(v))
t = copy(v)
ksort!(v,lo,hi,t)
return v
end
# multithreaded nocopy mergesort, one preallocation reused in single threaded subroutine
function jsort!(v, lo::Int, hi::Int, t)
if hi - lo < BIG_THRESHOLD # below some cutoff, run in serial
return nsort!(v, lo, hi, t)
end
mid = (lo+hi)>>>1 # find the midpoint
half = @spawn jsort!(t, lo, mid, v) # task to sort the lower half; will run
jsort!(t, mid+1, hi, v) # in parallel with the current call sorting
wait(half) # wait for the lower half to finish
nmerge!(v, lo, mid, hi, t)
return v
end
function jsort!(v, lo::Int=1, hi::Int=length(v))
t = copy(v)
jsort!(v,lo,hi,t)
return v
end
a = rand(2000);
b = copy(a); @time sort!(b, alg = MergeSort); # reference
c = copy(a); @time isort!(c); # insertionsort
all(b==c) # regression test
a = rand(20000000);
b = copy(a); @time sort!(b, alg = MergeSort); # single-threaded system mergesort used by psort and tsort
c = copy(a); @time nsort!(c); # single threaded nocopy mergesort, one preallocation
all(b==c) # regression test
c = copy(a); @time psort!(c); # 4 threads, mergesort, dynamic allocation
all(b==c) # regression test
c = copy(a); @time tsort!(c); # 4 threads, mergesort, preallocation per thread, re-allocation in single threaded subroutine
all(b==c) # regression test
c = copy(a); @time ksort!(c); # 4 threads, nocopy mergesort, one preallocation, re-allocation in single threaded subroutine
all(b==c) # regression test
c = copy(a); @time jsort!(c); # 4 threads, nocopy mergesort, one preallocation reused in single threaded subroutine
all(b==c) # regression test
# 4 threads, mergesort, dynamic allocation
@time for i=1:100 c = copy(a); psort!(c); end;
# 4 threads, mergesort, preallocation per thread, re-allocation in single threaded subroutine
@time for i=1:100 c = copy(a); tsort!(c); end;
# 4 threads, nocopy mergesort, one preallocation, re-allocation in single threaded subroutine
@time for i=1:100 c = copy(a); ksort!(c); end;
# 4 threads, nocopy mergesort, one preallocation reused in single threaded subroutine
@time for i=1:100 c = copy(a); jsort!(c); end;
I believe the `psort!` code that was shown in the blog post was a 'toy example', meant to demonstrate how to use different parallel techniques, one of them being the use of thread-local state.
There's probably going to be a lot of development of parallelized code going forward. Maybe you could help? :)
Providing Julia code for in-memory sorting, i.e. a binary mergesort with parallel branches and a parallel merge, plus a specialized sort for unstable in-place sorting (a better quicksort, parallel branches only), is not a huge step from the code above: I could contribute that. If, however, the expectation is that the sort is more general, i.e. k-ary and performant for data actually on SSD or even on disk (JuliaDB?), that's a completely different story: one would need to experiment with much more complicated algorithms such as cache-oblivious funnelsort, parallel superscalar samplesort, real disk sorts ...
I don't have time for that. How would I avoid duplicating the efforts of others?
Thanks! We do our best to make sure that people who want to get ahold of the latest Julia binaries are always able to. Our downloads page [0] always offers the latest stable builds, whereas the nightlies page [1] allows you to download binaries off of the tip of the `master` branch, which usually lag behind the github repository by only a couple of hours.