Of course, but concurrent access to shared data in Rust (let alone when that dat...

Jweb_Guru · on June 19, 2015

Hazard pointers are by no means fast, but a single hazard pointer being set per operation on a shared data structure barely registers compared to the cost of the rest of a nontrivial concurrent access most of the time. Scoped threads don't require bumping a reference count on access, either. Concurrent immutable access in Rust requires literally no overhead at all. In the long run, throughput might drop if you have to take a lock on the entire data structure to perform something like a resizing operation, of course, but for many applications this can be completely avoided (and for those that can't, Java isn't much better).

I generally agree with you on GC but, assuming you haven't, I think you would benefit from actually trying to write some of these algorithms in a manually memory managed system. It's really not as hard as you consistently make it out to be in a language like Rust with very precise control over lifetime (which can be exploited just as well for the hazard pointer itself as it can for anything else).

pron · on June 21, 2015

Two things:

1. The use-case I have in mind is an in-memory database because a. that is a significant part of what I do and am most familiar with and b. because in the age of large RAM, its most effective use is embedding large parts of the database within the application process (whether as an actual product, or replicating some part of a database's functionality ad-hoc). While doing so in an environment without a GC is certainly possible, it is both much harder and/or (depending on the effort spent) less scalable with increasing numbers of cores. I have come to that conclusion after spending many years trying doing that in C++.

2. I have no doubt that Rust -- and I take great joy in this fact -- makes a whole lot of things much easier than C++. But there is also no question that Rust is harder to program than most GCed languages. The question from my perspective, then, is not "why not do it in Rust?" but "why not in Java?". There is a good answer to that question: Rust is much more suitable than Java (in terms of performance and RAM footprint) when running in constrained environments. But unless you're targeting a constrained environment, why work even a little harder at all?

More generally, I believe that the most efficient form of memory management in unconstrained environments, in terms of both throughput and worst-case latency -- and I've had this discussion on HN with pcwalton -- is an environment that allows three types of memory: stack, scoped arenas (including a permanent scope), and a GCed heap. Real-time Java (RTSJ) provides exactly that. Combining arenas with Rust's beautiful scoped threads is a great idea. I believe that reference counting has no place in such unconstrained environments.

frankmcsherry · on June 19, 2015

In my experience, concurrent access to shared data is relatively rare in a well-designed data-parallel platform. Data are exchanged rather than shared, at which point ownership of memory transfers from one worker to another. Rust is great at understanding ownership, and at managing memory in this sort of situation.

pron · on June 19, 2015

Again, that depends on how batchy things are. If all the datais available in advance -- you're right. If it's streaming -in, it's not so simple any more.

frankmcsherry · on June 19, 2015

I'm sorry, this hasn't been my experience, or the experience of anyone I know writing stream processing systems. Having written one each over a GC and not (Naiad, and the timely dataflow in Rust linked above), dev time on the former was disproportionately spent fighting GC perf issues.

The folks I know doing stream processing over the JVM have garbage collection (and avoiding it) as perf issue #1. For example, you can read about Flink's approach to avoiding GC at

https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-...

You can read a more scientific discussion of GC vs region allocation here:

https://www.usenix.org/system/files/conference/hotos15/hotos...

pron · on June 19, 2015

Thanks, I'll read your references, but there are two general points to make:

1. Memory with very clear lifespan is obviously easier to manage without a GC.

2. The fact that GCs are a performance issue, does not mean that using a non-GC environment would yield better performance. It is often easier to avoid GC in GCed environments than to handle concurrent access to data in non-GC environments.

vitalyd · on June 19, 2015

GC definitely makes some types of systems easier to develop and reason about, but it also tends to make people be a bit sloppy with respect to memory management. For systems where managing data is one of their reasons for existence, it creates a negative effect. GC'd runtimes also bloat memory consumption to support GC, as well as perform additional "hidden" operations in support of GC (e.g. card marking), even if you manage to not trigger GCs. In addition, GC works well when the generational hypothesis holds. For data intensive applications with mutation/churn, that doesn't tend to be the case.

I disagree that avoiding GC is easier; you're swimming upstream the entire time. Managing memory manually in GC environments is annoying because the platform wasn't designed for that.

As Frank said, you want to avoid sharing mutable data irrespective of GC or not - that won't scale well with larger core counts. Instead, you want to delineate ownership so you have single writer and many readers, and share the data that way.

pron · on June 19, 2015

The key is that most of the effort is not where most of the data is. It's the small percentage of data (say, database index) that requires the biggest development effort, and you want the best mechanism to assist not where most bytes are stored, but where the most effort is invested.

> For systems where managing data is one of their reasons for existence, it creates a negative effect.

I think that people who author such systems/libraries know when to rely on the GC and when not to.

> GC'd runtimes also bloat memory consumption to support GC

When you write, say, an in-memory OLTP database (that's the use-case I know best), you keep the index -- that's where concurrency is managed -- on the heap and the user data off-heap. User data is the bulk of the RAM, but you only pay the GC footprint-overhead for the index.

> you're swimming upstream the entire time.

Going off-heap for large, known-lifespan data sets in Java has been widespread -- idiomatic, almost -- for at least five years now, if not ten. There are so many libraries that do that, that it's very, very simple. Again -- that's just where most data lies, not where the most effort is required. You end up with a very small percentage of your code that is not quite "beginner's Java" -- that's not swimming against the current; that's hitting a few strokes for one tiny section of the journey. It's managing the transactions and update to the indexes that requires 95% of the code, and where you end up thanking your lucky stars for a good GC.

I haven't measured exactly, but the entire off-heap code in our database is under 3%, even though over 85-95% of the data is stored off-heap.

> Instead, you want to delineate ownership so you have single writer and many readers, and share the data that way.

Even single writer/many readers still require a concurrent access mechanism. A GC tremendously helps with this pattern.

Besides, the best way to manage data -- and that's based on how caching mechanisms work at every level -- is single-writer/multiple-readers approximately, namely the single-writer may change identity at any time, but you ensure that it happens rarely. Once you try to make ownership permanent, system-wide locking is required. It's that flexibility that helps with amortizing the costs and making sure you're never hit with the worst case at every transaction[1]. There's some literature touching on that required flexibility, but here's my own take: http://highscalability.com/blog/2012/8/20/the-performance-of... (that flexibility is required to keep the B-tree evenly distributed across nodes -- they can be across the network or even NUMA sockets).

[1]: Changing the owner is the amortization pressure-release valve (all amortized algorithms work on that idea of slowly increasing pressure, that is then released seldom but regularly). It replaces sharding's rebalancing, which usually requires a stop-the-world lock (or most of the world).

vitalyd · on June 19, 2015

>The key is that most of the effort is not where most of the data is. It's the small percentage of data (say, database index) that requires the biggest development effort, and you want the best mechanism to assist not where most bytes are stored, but where the most effort is invested.

There's also substantial amount of code around dealing with i/o efficiently (and not churning garbage), both net and disk (if there's any persistence involved in the product).

>I think that people who author such systems/libraries know when to rely on the GC and when not to.

Perhaps there's more awareness now, but there are plenty of JVM based systems where moving off-heap (and/or changing code to ease GC workload) came much later in the dev cycle.

>Going off-heap for large, known-lifespan data sets in Java has been widespread -- idiomatic, almost -- for at least five years now, if not ten. There are so many libraries that do that, that it's very, very simple. Again -- that's just where most data lies, not where the most effort is required. You end up with a very small percentage of your code that is not quite "beginner's Java" -- that's not swimming against the current; that's hitting a few strokes for one tiny section of the journey. It's managing the transactions and update to the indexes that requires 95% of the code, and where you end up thanking your lucky stars for a good GC.

IME, you end up not using a good bulk of JDK and 3rd party libs because they allocate frivolously. You end up having to look at internals of other code to see if they allocate or not since nobody in the java space cares about documenting that aspect (because allocation is mundane). If you then need to tune GC knobs, may the force be with you. On top of that, once you go off-heap, you're doing roughly the same style of block memory management that you'd do natively, except less efficiently (e.g. some Unsafe operations aren't JIT optimized as well, some operations aren't even intrinsics so you get JNI call overhead, you always get zero'd memory if you call Unsafe.allocate() even if you then overwrite everything, etc). You're probably going to now tell me that getting 90% of the performance is fine ...

I suggest you look at some non-managed OSS systems that deal with high-perf workloads. Your posts always make it sound like it's nigh near impossible to build anything of the sort without GC, which is ridiculous. If you want to see what a modern C++ rendition would look like that attempts to avoid sharing mutable data, have a look at https://github.com/cloudius-systems/seastar

pron · on June 19, 2015

> IME, you end up not using a good bulk of JDK and 3rd party libs because they allocate frivolously.

That's more of your concern because you care about worst-case (very) low-latency domains. Short-lived objects are not a big concern for those who just wish to avoid the large pauses (and can live with the very occasional query taking 40ms instead of the average 50us).

> with i/o efficiently (and not churning garbage), both net and disk

At least one of the most popular Java IO libraries don't create much garbage (Netty 4), and besides, that's not a major concern at all because short-lived etc...

> you're doing roughly the same style of block memory management that you'd do natively, except less efficiently

Only for very simple lifespans. The heap data guards the offheap data wrt concurrency. And so what? All of it still amounts to a tiny, almost negligible, portion of the code, and that portion turns out to be quite simple. I'd rather write a small simple package for stuff that would be taken care of automatically by a non-GC language than write lots of complex, bug-prone code in a non-GC language for stuff that's taken care of automatically in GC runtimes.

> you always get zero'd memory if you call Unsafe.allocate()

But you don't allocate offheap slabs often at all.

> Your posts always make it sound like it's nigh near impossible to build anything of the sort without GC

It's much more work, and requires placing and reasoning about various constraints that may prove problematic. An example later.

> If you want to see what a modern C++ rendition would look like that attempts to avoid sharing mutable data, have a look at https://github.com/cloudius-systems/seastar

Thank you, I will. The example I like to give is that of VoltDB, a "NewSQL" database written in Java and C++ (C++ for the core) which uses internal sharding. Every thread is responsible for a slice of the data. But that database (designed or supervised by none other than Michael Stonebraker) requires locking for any join (even they couldn't have come up with a better way to do it). But their approach displays results that are far from groundbreaking (for modern, in-memory databases), and places a lot of burden on the user to carefully choose a sharding strategy. As another example, MemSQL don't shard, and use a lock-free index (they use concurrent skip-lists). They wrote the code in C++, so obviously that's possible, but put a lot of effort into just implementing the skip-lists (they use hazard pointers), and ended up with results that could have been just as good -- or better -- with a general-purpose GC. BTW, another thing they missed by picking C++, is that in order to make queries faster, they compile SQL to C++ at runtime, compile that C++ and load the resulting code. Java would have given them a much more efficient JIT.

> You're probably going to now tell me that getting 90% of the performance is fine ...

As a general case -- absolutely. In this case, Java gives you more like 96-97% performance for significantly less effort.

EDIT:

I've now read about seastar (and realized I know the people behind it). That approach is the absolute best way to get the worst-case lowest latency possible, if you're willing to pay a significant cost in design (how do you transactionally update data on two shards? -- you can't. Imagine a game object moving from one shard to another -- it will either show up twice or disappear momentarily). But if you're willing to relax the worst-case latency, you can gain significant gains in both throughput and ease of development (and still keep latency low enough for almost all domains). That's certainly how I'd design video-streaming or a network/phone switch -- not so much an online game, IoT sensor-fusion, or a group-chat app.