Think Before You Parallelize: A tour of parallel abstractions (jackmott.github.io)
188 points by adamnemecek on Sept 2, 2016 | 48 comments



Just a small nit:

> Another scenario, say you are working on 3D game, you have some trick physics math where you need to crunch numbers, maybe adding realistic building physics to Minecraft. But separate threads are already handling procedural generation of new chunks, rendering, networking, and player input. If these are keeping most of the system’s cores busy, then parallelizing your physics code isn’t going to help overall.

Generally game engines have been migrating towards work-stealing task architectures. Monolithic thread-based systems (one thread each for physics, rendering, gameplay, etc.) were great for migrating from the single-threaded games of old, but they quite often lead to idle threads.

This was even more critical in the PS3 era, where you had SPUs with just ~256 KB of local memory. Overall it leads to an architecture that scales well to whatever platform you end up targeting, since the CPU/compute capabilities of various platforms can be pretty disparate.
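
For a flavor of what that looks like, here's a minimal, hypothetical sketch in Rust using Rayon (a work-stealing thread pool), not any particular engine's API: instead of one permanent thread per subsystem, each frame's work is expressed as fork-join tasks that any idle worker can steal.

    use rayon::join;

    // Hypothetical per-frame subsystem work.
    fn update_physics() { /* ... */ }
    fn update_animation() { /* ... */ }
    fn generate_chunks() { /* ... */ }
    fn update_ai() { /* ... */ }

    fn run_frame() {
        // Fork-join tasks on a shared work-stealing pool: whichever workers
        // are idle pick these up, so no core sits parked on a dedicated
        // "physics thread" with nothing to do.
        join(
            || join(|| update_physics(), || update_animation()),
            || join(|| generate_chunks(), || update_ai()),
        );
    }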


This is actually a really amazing talk about parallelizing game engines.

http://www.gdcvault.com/play/1022186/Parallelizing-the-Naugh...


Yup, this type of stuff is exactly what I was referring to. ND are top of their class when it comes to engine design.


Like the time they wrote their own Lisp dialect to run on the PS2.

https://en.wikipedia.org/wiki/Game_Oriented_Assembly_Lisp


> Generally game engines have been migrating towards work-stealing task architectures.

How are Golang goroutines for implementing work-stealing?


From what I understand of goroutines, the runtime already handles distributing them across threads, so there's no work-stealing for you to implement.

For a game engine where task-based structure is important, Go would be a pretty poor choice. You don't want a GC in your inner engine loop. In addition, Go has pretty poor semantics for explicit memory layout, something that can be critical for performance. It's very common to issue cache prefetches for the next task as one is spinning down, for instance.


> For a game engine where task-based structure is important, Go would be a pretty poor choice.

I have a game server written in Go.

> You don't want a GC in your inner engine loop.

Why not? I expect my typical GC pause times to be 1 or 2 ms under Go 1.7. But if my server had pause times of 20ms, I wouldn't care, so long as it didn't happen more often than once a second or so.

> It's very common to issue cache prefetches for the next task as one is spinning down, for instance.

Yeah, I've had Go guys say I should give every agent its own goroutine. As it turns out, this can cost 150-300 microseconds every time one of them wakes up. I have a traditional game loop right now.


> I have a game server written in Go.

Game engine != game server.

> Why not? I expect my typical GC pause times to be 1 or 2 ms under Go 1.7. But if my server had pause times of 20ms, I wouldn't care, so long as it didn't happen more often than once a second or so.

We're talking about client-side latency here. You've got 16.6 ms to do everything in a game engine. Even 2 ms is 1/8th of your frame and a significant amount of work. 20 ms is two missed frames, which is really bad.

The types of engines where work stealing and task-based architectures are prevalent are usually of the AAA variety, where performance is critical and the scenes are either very complex or very large. In this space, GC-based languages have been limited in scope to gameplay scripting, and even then are usually very fast (Lua or equivalent).


> Even 2ms is 1/8th of your frame and a significant amount of work.

And this is targeting a mere 60 Hz. Oculus overclocked the DK2's panel to run higher than spec (75 Hz) and shipped even higher refresh rates in the CV1 (90 Hz) because of its importance in reducing nausea. To say nothing of the enthusiasts running 120 Hz or 144 Hz displays - a 144 Hz frame is only ~6.9 ms, so a 2 ms pause has blown through more than 1/4th of your frame budget 'at random' (read: you must assume it could happen in any frame) unless you can control when GC occurs.

> 20ms is two missed frames which is really bad.

If you're doing VR, this is bad enough that some of your customers may hurl. I've gone after single-frame hitches caused by as little as an extra 1 ms spike at exactly the wrong time in non-VR games. Unfixable 20 ms spikes would be a total non-starter.

There are a lot of applications where you can tolerate an extra 1-2 ms spike. A game server, where network latency is an order of magnitude or two larger, almost always counts. And if the improved productivity from using the language lets you optimize away a 1-2 ms cost elsewhere that you wouldn't have had time for otherwise, it can even be worth it.

For me, there's so much stuff I use to boost my productivity that's missing from Go - by design, no less - that my productivity is going to go the wrong way.


I believe that goroutines are scheduled with work-stealing under the hood. Idle threads pick up runnable goroutines from other threads.

There was a design document about this from 4 years ago, but I'm not sure how well it reflects the current implementation:

https://docs.google.com/document/d/1TTj4T2JO42uD5ID9e89oa0sL...


> Don’t Parallelize when it is already Parallelized

For the uninitiated -- there's probably a large list of "don'ts" that would be good here, too. But maybe a prerequisite question is better: why are you trying to write [this code] to execute in parallel? "It's slow" isn't good enough. You can save yourself a lot of time and headache if you use a profiler to understand what the system is doing that makes it slow.

Using that profiler will at least help partition the problem into one of two big domains: compute-resource bound (CPU/bus/memory/cache, etc.), or just waiting on results from some other async task in the system. The latter happens much more often than you might expect. Filesystems, critical locks/exclusion, networks, databases, IPC -- there's a lot of stuff for which running multiple tasks in parallel might not help much.


Some other bits:

- Always profile first! Then go for low-hanging single-threaded fruit. If you also parallelise later, you'll get a multiplicative effect on the speedup.

- Use task manager/activity monitor/top frequently to check what CPU usage your program is actually getting. If your code is using N * 99% (for N cores) continuously, then you're doing fine. You should be suspicious if your usage has spikes, which implies e.g. an IO bottleneck somewhere.

- Rule of thumb: parallelisation alone will at best give you a linear speedup. Super-linear speedup is possible, but only for very specific problems like linear algebra, and even then it's usually some memory effect, like more of your working set fitting in L1 cache instead of RAM.

- Build some intuition for how fast your code ought to run. I've chased down many bottlenecks only to realise that the thing I thought was the culprit was actually extremely fast (I'm sure we've all done this). http://norvig.com/21-days.html#answers is a good start, but also make your own measurements of the time required to save files, modify images, zip things, etc. You can catch this definitively by profiling, of course, but it's nice to be able to guess where to start prodding.

- On the flip side, learn to recognise when things are embarrassingly parallel and act accordingly. Here a profiler won't help you. Sometimes you're just lucky and you can get linear gains for free.

- Think about what level you want to parallelise at. If you know that your single-threaded code is as good as it gets, can you get away with running multiple instances of it? Often it's far simpler to write a wrapper than to explicitly parallelise low-level code (see the sketch below).
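
For that last point, a tiny hypothetical sketch of the wrapper approach in Rust: launch one single-threaded process per work item and let the OS spread them across cores. The tool name and file list here are made up.

    use std::process::Command;

    fn main() {
        // Hypothetical work items, one per process.
        let inputs = ["a.dat", "b.dat", "c.dat", "d.dat"];

        // Launch an independent single-threaded instance of the tool per input.
        let children: Vec<_> = inputs
            .iter()
            .map(|f| Command::new("my_tool").arg(f).spawn().expect("failed to start"))
            .collect();

        // Wait for them all; the OS scheduler spreads the processes across cores.
        for mut child in children {
            child.wait().expect("child failed");
        }
    }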


There were times/places when/where programmers in the boonies of the US had funny ideas about premature optimization and dynamic languages. (Specifically, Ohio!) Basically meetup hecklers who had the weird idea that you should write super-hacky-optimized code 100% of the time, or it would never be fast enough. So I asked the room to raise their hand if they had ever used a profiler on a real production system. Then I asked them to leave their hand up if they'd never ever been surprised by what the profiler told them!


So the lesson is.... Write super hacky optimized code 100% of the time, but constantly profile it too!


The lesson is: If you have a TARDIS and you find yourself in Ohio in the late 90's, don't bother going to a programmer's meetup.


Now we have the opposite problem, at least in my experience. Everyone just says "premature optimization is the root of all evil" and then uses that as an excuse not to have a sane architecture just to save a little effort up front. After all, that would slow down the rate at which they could complete user stories for the week...


Yet another swing of the pendulum.


Good article. Disagree with this:

> What you forgot was that the web server was already parallelizing things at a higher level, using all 24 production cores to handle multiple requests simultaneously.

The reason you didn't see improved performance is that your server's load is high; it's maxed out on the work it can do. Obviously no amount of parallelism could fix that.

Double the servers processing requests and then your nifty in-memory parallelism can make the difference you intended. Note: had you doubled the servers _without_ adding your extra parallelism, the extra hardware would not necessarily have made a difference.

You did a good thing; you were just dealing with overloaded hardware.


I'm positively surprised that Rust does so well. Not only are the performance results good, I also like the brevity and elegance of the code. The only downside is that the examples don't work out of the box, but require an external library.


Yeah it's true. This is just true of Rust in general; we've chosen to have a relatively small standard library.

At the same time, one bit that the post didn't talk about (and for good reason, it's not the point of the post) is that a strength of Rust is that Rayon enjoys the same safety guarantees as the threading stuff that _is_ built into the standard library, thanks to the trait system. Rayon will still prevent data races at compile time, just like all Rust code.


… an external library that is really easy to add. One line to Cargo.toml (plus another, “[dependencies]”, if you don’t have any dependencies already), one line to src/lib.rs and one line importing the appropriate trait to any module you wish to use `.par_iter()` in.
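
Concretely, it's roughly this (the version number is whatever's current; the `sum_of_squares` example is essentially the one from Rayon's README):

    # Cargo.toml
    [dependencies]
    rayon = "0.4"

    // src/lib.rs
    extern crate rayon;

    // in whatever module you want parallel iterators in
    use rayon::prelude::*;

    fn sum_of_squares(input: &[i32]) -> i32 {
        input.par_iter()        // was .iter()
             .map(|&i| i * i)
             .sum()
    }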


I did put it below Java in the table because it wasn't built in, but yes, the experience of adding it was very easy. I learned Rust... well, I wouldn't even say I have learned it yet. Didn't have any trouble.

SIMDArray for F#/C# will be similarly easy when it is a NuGet package. It just puts extension methods on Array.


But, for fairness' sake, shouldn't he have made sure LLVM wasn't using SIMD instructions for the Rust version?


I did, it isn't. Or to be more precise, it uses SSE adds and muls but only 1 lane, the same as all the other 64-bit compilers in the comparison, so it isn't vectorized. You can get Rust to vectorize it with unsafe intrinsics if you use the nightly build. I'm told better support is coming.


Yup! There's some technical reasons why it isn't in stable yet, but it's not an inherent problem, just needs someone to do the work. It is a thing we care about though, since speed is a major goal of Rust.


One more: Don't parallelize when you're going to blow L1 caches for no good reason.

I wrote a gradient descent solver a while back, before I realized Vowpal Wabbit existed.

Vowpal Wabbit had a dedicated thread to read in records and pass them through a queue to another thread doing the math. I didn't, because it was a quick POC impl and I couldn't be arsed. My impl was faster.
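
(For context, a rough sketch of that reader-thread-plus-queue shape in Rust - not VW's actual code, and the record format is made up:)

    use std::io::BufRead;
    use std::sync::mpsc::channel;
    use std::thread;

    fn main() {
        let (tx, rx) = channel::<Vec<f64>>();

        // Dedicated reader thread: parse records and push them over a queue.
        let reader = thread::spawn(move || {
            let stdin = std::io::stdin();
            for line in stdin.lock().lines() {
                let record: Vec<f64> = line
                    .unwrap()
                    .split_whitespace()
                    .filter_map(|s| s.parse().ok())
                    .collect();
                tx.send(record).unwrap();
            }
        });

        // The "math" thread (here just the main thread) consumes from the queue.
        let mut total = 0.0;
        for record in rx {
            total += record.iter().sum::<f64>(); // stand-in for the gradient step
        }

        reader.join().unwrap();
        println!("{}", total);
    }

The point being that the hand-off itself isn't free: every record migrates between cores and out of the reader's L1 cache.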

This advice probably goes for most usages of auto-parallel collection impls that the author was pointing out.


That is a good point. If you have around N things going on, and N cores, and each thing can saturate one core pretty well, it is better to keep them each on one core so they can rip through their own L1 cache rather than ruin each other's.

The loop abstractions don't give you the control to ensure that, though I would hope the good ones are smart enough to at least keep each thread on one core and on its own chunk of an array.

That would be interesting to explore.
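
For what it's worth, a rough sketch of forcing coarse chunks with Rayon (assuming a reasonably recent version; the chunk count is arbitrary). It doesn't pin threads to cores, but it does give each worker its own contiguous chunk to stream through:

    use rayon::prelude::*;

    fn sum_chunked(data: &[f64]) -> f64 {
        // Split into a handful of large contiguous chunks rather than
        // letting the scheduler split the work as finely as it likes.
        let chunk_size = (data.len() / 8).max(1);
        data.par_chunks(chunk_size)
            .map(|chunk| chunk.iter().sum::<f64>()) // each chunk summed sequentially
            .sum()
    }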


I call this heterogeneous vs. homogeneous threads/processes. Heterogeneous threads are more icache and dcache friendly. They also make your program more modular.

Linux has controls (e.g. taskset and sched_setaffinity) to pin threads to particular cores.


Interesting, but hard to draw any conclusions given that there were probably several other things that were different between the two implementations.


I must say this is a remarkably cogent and thorough treatment of parallelism for performance gains in JavaScript especially. Well done!


If I blogged on medium.com I would have had an animated gif from southpark in that section.


I am guessing pfffffft is the sound of JavaScript blowing away the competition


I would also add a caution to constrain the scope of parallelism in the implementation to only those bits of code that actually benefit from it, and to be extra cautious when refactoring an existing serial implementation into a parallel algorithm.


You should also think about the context in which the statement will be used. Is your statement normally the bottleneck? If not, does it matter if it's 3x slower for light workloads (which are fast anyway) as long as it's 4x faster for heavy workloads (where the performance difference can be felt)?


> SIMDArray is ‘cheating’ here as it also does SIMD operations, but I include it because I wrote it, so I do what I want. All of these out perform core library functions above.

I love the brief interjection in the otherwise objective discussion.


> So think about the system your code is running in, if things are getting parallelized at a higher level, it may not do any good to do it again at a lower level. Instead, focus on algorithms that run as efficiently as possible on a single core.

Generally this is extremely bad advice. Especially with:

> What you forgot was that the web server was already parallelizing things at a higher level

Apache parallelized like that, and it was bad.

Nginx then introduced an event loop, which was way better. However, your application still needs to deal with it, i.e. if one request blocks (takes too much time), your nginx worker will block, too. If that happens on all workers, your website will be extremely slow.

An event-loop mechanism always means that your stuff should be non-blocking, and that is hard. This is especially hard when dealing with external resources, i.e. databases, external services, caches, file IO. When you just say "well, my webserver will handle that", you will be in really, really bad shape.


> Nginx then introduced an event loop, which was way better.

> An event-loop mechanism always means that your stuff should be non-blocking, and that is hard.

hmm ;)

So you are saying you need to think about the system your code is running in?

Even on an event-loop server, if CPU utilization is sufficiently high, parallelizing a function can be counterproductive due to overhead and cache thrashing.


> if CPU utilization is sufficiently high

That won't be the case in an event-loop system; it could be the case if you just have too little hardware. But then even your non-parallelized code will fail.


It seems to be the same single abstraction, just in several languages


This is totally awesome. And I'm always blown away at how performant Java is. For some reason people give it a bad rap, but it's still awesome.

Though, one thing I wish was included is memory utilization. I would expect Rust and C++ to be significantly better than Java in this realm. (Due to the overhead of the JVM)


> For some reason people give it a bad rap,

A lot of people have outdated opinions about Java; I know I did for a long time. Initially, the performance wasn't there, so people held that opinion without updating it when Java improved. Things are very different today.


> Things are very different today.

Java has been fast for at least ten years.

Except start-up time. Java still has way too long a startup time, which you notice in desktop apps implemented in Java.


HelloWorld.java runs in about 50msec on my machine.

However, there are lots of crap desktop apps written in Java that don't care about startup time. You can write apps with bad startup time in any language: look at OpenOffice (C++).


Even the JVM startup time isn't that bad now. It's actually a lot worse in my experience when using alternate JVM languages like Clojure and Scala.


Startup time can often be dramatically reduced if you unpack the jars. Jars are zip files. Loading a lot of jars can cause a significant slowdown.


Actually, once Valhalla lands (expect it late 2022), memory utilization will change heavily, and the memory model, too.


Are there any studies that benchmark the difference between modern C++ and modern Java?


Anyone trying to do this in JavaScript (universal) could check out Task.js [0] for a simple interface.

- [0] https://github.com/icodeforlove/task.js



