Go GC: Latency Problem Solved [pdf] (golang.org)
101 points by xkarga00 on July 19, 2015 | 99 comments



The LISP community went through this in the 1980s. They had to; the original Symbolics LISP machine had 45-minute garbage collections, as the GC fought with the virtual memory. There's a long list of tricks. This one is to write-protect data memory during the GC's marking phase, so marking and computation can proceed simultaneously. When the code stores into a write-protected page, the store is trapped and that pointer is logged for GC attention later. This works as long as the GC's marker is faster than the application's pointer changing. There are programs for which this approach is a lose. A large sort of a tree, where pointers are being retargeted with little computation between changes, is such a program.
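
A minimal sketch of the software analogue of that trick, with hypothetical names (not any runtime's real API): while the collector is marking, every pointer store is also logged so the marker can revisit it; the trap on a write-protected page plays the same role in the hardware scheme.

    package main

    type object struct{ next *object }

    var marking bool         // set by the collector during its marking phase
    var remembered []*object // pointers stored while marking was in progress

    // writePointer stands in for "*slot = val" on heap objects.
    func writePointer(slot **object, val *object) {
        if marking {
            remembered = append(remembered, val) // log it for the marker
        }
        *slot = val
    }

    func main() {
        root := &object{}
        marking = true
        writePointer(&root.next, &object{}) // this store gets logged
        _ = remembered
    }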

If they're getting 3ms stalls on a 500MB heap, they're doing pretty well. That the stall time doesn't increase with heap size is impressive.

Re "avoid fragmentation to begin with by storing objects of the same size in the same memory span." That's easy today, because we have so much memory and address space. The simplest version of that is to allocate memory in units of powers of 2, with each MMU page containing only one size of block. The size round-up wastes memory, of course. But you can use any exponent in the range 1..2, and have, for example, block sizes every 20%. This approach is popular with conservative garbage collectors (ones that don't know what's a pointer and what's just data that looks like a pointer) because the size of a block can be determined from the pointer alone.


You thought of the LISP machines too, eh? I'll mark GC results under my meme: failure to learn from the past. One exception is Azul Systems' Vega machines, which have a hardware-assisted, concurrent, pauseless GC. What do you think of their approach? Worth emulating in a smaller hardware project?

Note: As usual, I'm thinking along the lines of a safer processor design with possible GC support. Needs to be a tad faster than the i432. ;)


Symbolics added an Ephemeral GC in 1985. It kept a bitmap of modified memory pages with ephemeral objects in RAM. The Ephemeral GC then looked only at those pages.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.125....

Macintosh Common Lisp later used a similar scheme on 68k machines with MMU.


Thanks for the link. This jumped out at me:

"The importance of designing the architecture and the hardware to facilitate garbage collection is stressed."

Replace GC with reliability, security, concurrency, and so on, and it's still true. It's why I advocated safe, high-level languages and RISC processors. I figured they'd be easier to modify at the compiler or hardware level as people invented solutions to these problems. Doing the same on x86, Windows, and C++? I almost gave up...


In the mid-90s, while at university, I got to learn about Oberon and eventually had Native Oberon running on my PC.

It opened my mind about using GC-enabled systems programming languages. Eventually I became an Oberon addict and also a Modula-3 one, as a side effect of discovering a book about it gathering dust in a technical library.

OS vendors just need to care the same way they do about improving JavaScript JIT compilers, for example.


Yeah it was pretty nice. The SPIN OS team wrote a whole OS in Modula-3 that supported safe, dynamic linking of code into the kernel for performance boosts. A recent discussion here showed that the Go language was partly an attempt to re-create one of its authors' experience of coding in Oberon. The combination of systems programming, safety, and productivity disappeared when he switched to C.

As far as JS goes, that would help, but it's also a good example of why it won't happen. There were many attempts, like Juice [1], to replace JavaScript in the browsers with something better. There were also attempts to solve, at the OS level, the problems the Web solved. All of them were ignored, to the point that we eventually got stuck with JavaScript and browsers as the only presentation and computation layer that runs on all devices. So, while still ignoring better options, vendors continued to improve the speed of JavaScript engines and their new JIT schemes.

So, the OS's could certainly benefit from the kind of activity that led to JavaScript JIT performance. However, their existence says more in the other direction.

[1] http://www.modulaware.com/mdlt69.htm


Yeah sure, I just wanted to make the point that some technologies can be improved if the companies that matter in the IT world bother to put money into them.

Somehow I like to think that Android, WinRT/.NET Native, Swift's introduction, and Mirage OS are all little steps in that direction.


Certainly they could be improved. Swift and MirageOS are great examples. IBM's mainframes and the AS/400 line are good examples too, given that IBM adapted them to the most useful, popular technologies while staying backward compatible. Remember, though, that backward compatibility constrains the biggest and oldest systems in ways that prevent architectural improvements. All the biggest companies lining Microsoft, Oracle, and SAP's pockets would have to throw away apps that they can't even rebuild. Not happening.

So, our best bet is that small to midsized firms with more flexibility keep adopting these technologies. That fuels investment into them, to eventually get them to a level like Microsoft's and Oracle's. That enterprises have switched to a service model helps, in that they keep some services on old tech but implement some on newer, better stuff (e.g. Python at BofA). So, these trends are what we have to bank on.

One thing, though: better get it right the first time, as your newer, better tech will eventually be legacy tech someone else is maintaining. It's why I now focus a lot more on readability, interfaces, and type safety for maintenance concerns.


This page adds some context to the slides: https://sourcegraph.com/blog/live/gophercon2015/123574706480

It was posted here 10 days ago: https://news.ycombinator.com/item?id=9854408


Garbage collecting seems to get solved in each new release of Go and Java, apparently.


Do you actually follow Go development? 1.5 is a major improvement to the GC.

The only other stable release that had any change to the GC worthy of mention in the release notes was 1.3, which was a minor change.

The only possible way to interpret your statement as true, would be if "get solved" meant ANY change or bug fix was applied to the GC. Which is a statement that would apply to so much code (not just the GC) that it would make the statement completely meaningless.


Yeah, I'm being unfair in naming Go & Java specifically. But these stories of 'fixing' garbage collection come up all too often.

I wonder when we'll see a further GC update that trades latency for throughput...

The problem seems to be that no matter how you tweak GC, you will always have a class of program that it performs terribly for (and it seems to impact a large group of programs, never just some obscure corner case). So I suspect that this latest GC tweak will have unexpected results on some other class of program, leading to another tweak, and so on...


> The problem seems to be that no matter how you tweak GC, you will always have a class of program that it performs terribly for

For casual use, most programs can treat GC like magic, but if you are doing serious work in a language with GC, then you should learn about the GC's characteristics. That bit of due diligence and up front design effort is still often going to be tons cheaper than doing the manual memory management.

Reducing latency in exchange for throughput is the right decision for the vast majority of programs that will be written in Go. It was already a very attractive language for writing a multiplayer game server, so long as I didn't have very large heaps. (Even so, I can still support 150-250 players and tens of thousands of entities.) With the "tweak," that limitation is much relaxed.


> often going to be tons cheaper than doing the manual memory management.

And on top of that, manual memory management is not free. I maintain a simple but high-throughput C++ server at Google, and tcmalloc is never less than 10-15% of our profiles.

Don't get me wrong, I'm not saying that Go is faster than C++ or ever will be. I'm just trying to counter the notion that "GC is expensive, manual memory management is near zero runtime cost."


I bet that if someone who knew what they were doing decided to optimize that, you'd get the cost WAY down, possibly almost to zero. (If you are using std::string, that is your problem right there).

But the very important difference here is that in your case you have a choice, and it is possible to optimize the cost away and to otherwise control when you pay this cost. In GC systems it is never possible to do this completely. You can only sort of kind of try to prevent GC. It's not just a difference in magnitude, it's a categorical difference.


Perhaps. The team is a group of seasoned veterans of high performance server engineering. But perhaps there are others who could improve on our efforts by a significant margin.

Of course we do not use std::string.


If you really really want to, you can allocate a buffer for all your data.


This solves little. What do you think the system allocator is doing under the covers?


It's doing a lot less, if you're allocating one buffer for your data instead of many.


Just curious: Have you tried jemalloc, and what numbers did you get?


We haven't. Google infrastructure uses tcmalloc. Is there a reason to believe it offers a significant win?


I'd expect similar performance but less fragmentation, and less memory used by the process if you aren't regularly calling MallocExtension::instance()->ReleaseFreeMemory() as a tcmalloc user.

The first answer at https://www.quora.com/Is-tcmalloc-stable-enough-for-producti... (by Keith Adams) is completely consistent with what I've seen. Rust went with jemalloc for some reason too.


IIRC jemalloc is somewhat better about releasing memory in a timely fashion, at least by default.


"That bit of due diligence and up front design effort is still often going to be tons cheaper than doing the manual memory management"

That's just a pipe dream. I say this having spent inordinate amounts of time trying to tune myriad parameters in JVM GC for large heap systems without ultimate success. What it always comes down to is, how much extra physical RAM you're willing to burn to get some sort of predictable and acceptable pauses for GC. It's usually an unacceptable amount.


> That's just a pipe dream. I say this having spent inordinate amounts of time trying to tune myriad parameters in JVM GC for large heap systems without ultimate success.

Patient: Doctor, it hurts when I do this!

Doctor: Don't do that!

Possibly, divide your heap into smaller pieces with their own GC? Restructure your system, such that most of your heap is persistent and exempt from GC? I don't know the details of the system you're trying to build, of course. It sounds interesting and challenging.


"Possibly, divide your heap into smaller pieces with their own GC? Restructure your system"

That's the common recommendation (resisting calling it a "pat answer"). Suffice it to say, this is not always possible. Apart from all the business-related issues with rewriting a complex system from scratch, breaking up a large shared-memory system into smaller, communicating processes multiplies the software complexity (roughly by O(N^2), where N is the number of new components created) as well as the hardware requirements in its own right -- think of all the overhead of marshalling/demarshalling, communication latencies, thread management, and increased cache misses from fragmenting that nice giant cache you were hosting in that big JVM heap.


I'm curious how much physical ram is an unacceptable expense to you, given how cheap it is.


Even the amount of RAM parceled out for virtual servers is an embarrassment of riches, provided you pay for something other than the bottom tier!

In the context of games, and other ones as well, I think there's too much attention paid to pushing the envelope and not enough to how much awesome can be had for what is readily available.


> That bit of due diligence and up front design effort is still often going to be tons cheaper than doing the manual memory management.

Calling shenanigans. No it's not, unless the person doing the manual solution is a novice.


Despite the drastic page limit in the category I was submitting in, I made sure to include a paragraph about how GC enables sharing, and how the only reasonable alternative when implementing a similar system in a non-GC language is a lot of gratuitous copying to solve ownership issues, in http://frama-c.com/u3cat/download/CuoqICFP09.pdf

(The page limit was 4. Organizers only raised it to 6 after seeing submitted papers.)

I can also confirm the “bit of due diligence” part, and the fact that it's cheaper than the aggravation of not having automatic memory management at all. In the example that I can contribute to the discussion, the due diligence amounted to two more short articles: http://cristal.inria.fr/~doligez/publications/cuoq-doligez-m... and http://blog.frama-c.com/public/unmarshal.pdf


> GC enable sharing and how the only reasonable alternative when implementing a similar system in a non-GC language is a lot of gratuitous copying to solve ownership issues

The solution to unclear or shared ownership is generally reference counting. There's a reason why shared_ptr is called that.


With the usual set of locks, cache contention, and pauses on cascading deletions of deep data structures that it brings.


You don't need locks to RC immutable structures, just atomics (and not even that if the system is single-threaded)


Reference counting is a garbage-collection system like the others (and if you are going to use a garbage-collection system, you can for many usecases do better than reference counting).


> Reference counting is a garbage-collection system like the others

Reference counting is a form of automated memory management which can easily be integrated and used in a manually-managed system, and can be used for a specific subset of the in-memory structures (again see shared_ptr). Not so for more complex garbage collection systems which tend to interact badly with manual or ownership-based memory management. Putting the lie to your assertion that the only way to implement sharing in a non-GC language is "gratuitous copying".


Yes, it's a shame that you were not a reviewer, mid-2009, of my article published in September 2009.


It's not the writing of manual memory management in the usual case/happy path that's the problem. It's the very occasional mistake and the debugging time involved. (Though to be fair, automated static analysis tools have taken great strides, and this is not as big a problem as it used to be.)

What GC often gets you is a program that doesn't crash but instead has performance problems, but these are often more easily profiled and found and less severe than a crash. (Manual memory management isn't immune from the same performance problems in any case.)

In other words, GC gets you to "Step 1 -- Get it Correct" faster so you can play with running code faster. The cost/benefit may not fit your situation. In that case, use a different tool.


> I wonder when we'll see a further GC update that trades latency for throughput...

This GC update in Go already makes that trade-off, sacrificing throughput for latency, because of the added write barrier.

There is no free lunch in GC. Most features that reduce latency reduce throughput. For example, Azul C4 has lower throughput than HotSpot's GC does.


I hope you realize that malloc is far from free in a non-GC world right? (In a GC world allocating is just moving a pointer forward.) You pay the cost somewhere.
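
For reference, the "moving a pointer forward" allocation looks roughly like this (a toy sketch; a real collector triggers a collection or grabs a new space when the buffer fills):

    package main

    // Toy bump-pointer allocator, as used in copying/compacting nurseries:
    // allocation is a bounds check plus advancing a cursor.
    type bumpSpace struct {
        buf []byte
        off int
    }

    func (s *bumpSpace) alloc(n int) []byte {
        if s.off+n > len(s.buf) {
            return nil // real GC: collect or start a new space here
        }
        p := s.buf[s.off : s.off+n]
        s.off += n
        return p
    }

    func main() {
        s := &bumpSpace{buf: make([]byte, 1<<20)}
        _ = s.alloc(64) // "allocation" is just moving s.off forward
    }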

The CLR has also done a lot of GC work to enable concurrent GC, thread-local heaps, and "zero pause" (in reality extremely low constant time pauses).

The only way to avoid paying the cost for managing memory is to allocate everything you need once and never release it.


I hope you realize that stack allocation can replace a lot of allocation that would be done by a GC? And that having control over memory layout can lend itself to better performance? And that naively mallocing everywhere is not the only or fastest way to manually manage memory, and sometimes isn't even the easiest.


Go also has stack allocation for objects, based on escape analysis; basically, if the compiler can prove that a variable doesn't escape, it is allocated on the stack, otherwise on the heap. Improvements to escape analysis in the compiler thus also reduce the heap size by allocating more things on the stack.
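
A small illustration; with the Go toolchain you can see the compiler's decisions via go build -gcflags=-m (output wording varies by version):

    package main

    type point struct{ x, y int }

    // p never escapes sum, so it can live on the stack.
    func sum(a, b int) int {
        p := point{a, b}
        return p.x + p.y
    }

    // leak returns a pointer to p, so p escapes and is heap-allocated.
    func leak(a, b int) *point {
        p := point{a, b}
        return &p
    }

    func main() {
        _ = sum(1, 2)
        _ = leak(3, 4)
    }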


Many GC'd languages also have stack allocation (see dynamic-extent in Common Lisp, for example); when talking about GC vs malloc (already an over-simplified dichotomy), we should be talking about heap allocations of indefinite extent.


Many GC languages have stack and global static memory allocation as well.

Go being one of them.

Others, Oberon family of languages, Modula-3, D, Eiffel and even .NET to a certain extent.

Having managed heap doesn't mean other allocation types aren't available.


Too true! I've written a couple of different mallocs before, and I'd recommend it as a project to anyone who thinks malloc() is just a simple, lightweight operation.

It's not an either/or choice though, picking malloc or GC. There is a whole spectrum of allocation styles you can do that might be better for a particular application. For example, a server could use per-request memory pools, which effectively can turn related mallocs into a 'move the pointer forward' operation and the whole lot can be free()d together.
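
A minimal sketch of that per-request pool idea (hypothetical helper, not a library API): allocations within a request bump a cursor, and "freeing" the whole lot is a single reset.

    package main

    // requestArena is a toy per-request pool: Alloc bumps a cursor, and
    // everything allocated for the request is released together by Reset.
    type requestArena struct {
        buf []byte
        off int
    }

    func (a *requestArena) Alloc(n int) []byte {
        if a.off+n > len(a.buf) {
            a.buf = append(a.buf, make([]byte, n)...) // grow; toy strategy
        }
        p := a.buf[a.off : a.off+n]
        a.off += n
        return p
    }

    // Reset releases every allocation from the request in O(1).
    func (a *requestArena) Reset() { a.off = 0 }

    func main() {
        a := &requestArena{buf: make([]byte, 64<<10)}
        _ = a.Alloc(128) // per-request scratch space
        a.Reset()        // whole request's memory "freed" at once
    }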

I'm not saying GC is worthless. I just have a distaste for GC because it doesn't truly deliver on the promise of removing worries about memory management. You still pay the cost and can be tripped up by nasty GC performance. Even worse, the garbage collector behaviour can change between language versions and a well-tested application can suddenly hit dire performance problems. Once you have to consider GC problems, IMO you might well be better off doing old fashioned app-controlled memory allocation.


"malloc is far from free". Hehe. Pun intended?

I always thought that malloc was further from free with GC :P


> In a GC world allocating is just moving a pointer forward.

which is usually for short lived objects only, and in a non-GC world these get put on the stack anyway.


I think he means people tout every release as the final solution to their GC issues. Not officially, necessarily, but some vocal group says it.


That's not true either though. Seriously go take a look at the release announcements on hackernews.


I never said it was.


What about 1.4? "The release focuses primarily on implementation work, improving the garbage collector ..." https://golang.org/doc/go1.4


1.4 focused on getting the GC ready for the changes in 1.5.

First, the code was rewritten from C to Go. Second, the GC was made precise, which is a major improvement to a GC.

The GC was not made truly concurrent however, and short pause times were not addressed either. These concerns are addressed in the 1.5 release.

The trade-off in 1.5 is to drastically reduce pause times for slightly worse throughput. For the programs Go is designed to handle, this trade-off is probably fine.


Newer java GCs generally aim for larger heaps or higher throughput than the previous ones while keeping pause times low. The goals are shifting.

I don't even see a mention of parallelism in that PDF. There's probably a lot of room left to squeeze out performance for Go.


The situation for Go and Java isn't comparable.

Java has expanded into the big data space in recent years and underpins Hadoop, Spark, HBase, Cassandra etc where they are dealing with heap sizes up to 1TB. The previous GC algorithms are fine for smaller heaps (<32GB) but they needed something for bigger ones hence the introduction of G1GC.

Go is very much just trying to get their foundation GC perfected.


Is there a perfect GC? Seems it would free memory as soon as it's no longer used.


There was a LISP machine that did that. It embedded the GC into the memory interface where the apps didn't even know it existed. GC activity ran concurrently with memory operations.

Best modern one might be Azul Systems' Vega Machines:

http://www.azulsystems.com/products/vega/overview

The LISP and Vega machines just try to solve the problem at its core. Works pretty well when you do that. The modern systems try to solve these hard problems on architectures, OS's, and apps that inherently suck at them. That's a difficult problem that requires constant attention by very smart people. Whole Ph.D.'s worth of effort were spent on getting this far with GC's.


There's actually always a duality between:

   - Reference counting/mark and sweep
   - Throughput/latency
   - RAII/GC
A good read: https://www.cs.virginia.edu/~cs415/reading/bacon-garbage.pdf

   We present a formulation of the two algorithms that shows that
   they are in fact duals of each other. Intuitively, the difference is that
   tracing operates on live objects, or “matter”, while reference counting
   operates on dead objects, or “anti-matter”.


Depends on the circumstances. If you would free memory as soon as it's no longer used, it could lead to delays when freeing a large object graph.

Also, a lot of performance can be gained be re-using frequently-allocated short-lived objects' memory, rather than freeing them without special treatment.


> If you would free memory as soon as it's no longer used, it could lead to delays when freeing a large object graph.

A GC still has those same delays when freeing a large object graph (assuming that large object graph is in the tenured generation). They're just sometimes incrementalized or done in a background thread. Your malloc implementation could do that too. If this were actually much of a problem, batched deallocation APIs that work on a background thread could be easily added to (or layered on top of!) the popular modern mallocs.

> Also, a lot of performance can be gained be re-using frequently-allocated short-lived objects' memory, rather than freeing them without special treatment.

Your malloc is already doing this, if it's any good. free is usually implemented with a free list.
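
For what it's worth, the explicit version of that reuse in Go itself is sync.Pool, essentially a GC-aware free list; a minimal sketch:

    package main

    import (
        "bytes"
        "sync"
    )

    // bufPool hands back recently released buffers instead of forcing a
    // fresh allocation (and eventual GC work) for every request.
    var bufPool = sync.Pool{
        New: func() interface{} { return new(bytes.Buffer) },
    }

    func handle() {
        b := bufPool.Get().(*bytes.Buffer)
        b.Reset()
        b.WriteString("scratch work for this request")
        bufPool.Put(b) // return it for the next caller to reuse
    }

    func main() { handle() }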


> A GC still has those same delays when freeing a large object graph (assuming that large object graph is in the tenured generation). They're just sometimes incrementalized or done in a background thread. Your malloc implementation could do that too.

Of course, I was just making an argument against 'immediately deallocating is always the best strategy'. It's clear that malloc()/free() could implement the same behaviour.

> free is usually implemented with a free list.

Sorry, I was a bit vague: I was referring to moving garbage collectors, where one of the motivations was to be more performant with a lot of short-lived objects.


Yeah, but who would ever do that? If you know that the performance of freeing the entire graph is important, then group the entire thing in an arena and free the whole arena in O(1) time.


The biggest problem IMO is that GC is usually used in languages where it gets no free information about variable scope.

GC could actually be really useful in C++ or Rust for postponing the destruction of many small objects and doing it in one big sweep instead.


> GC could actually be really useful in C++ or Rust for postponing the destruction of many small objects and doing it in one big sweep instead.

I've seen this claim before, but I've never understood it. You could add delayed reclamation to your system malloc if it actually helped things: just replace free() with a function that adds to a free list but doesn't actually recycle the memory. That wouldn't require more than a couple of writes and a TLS lookup.

I suspect that the reason why mallocs don't do this is that there's little benefit compared to just recycling the memory right away. Prompt reclamation is actually really nice for cache reasons, and adding memory blocks to a free list is what free's fast path does in the first place.

Now that's not to say that GC wouldn't be useful in C++ or Rust. I think what a GC would be useful for is for lock-free data structures and long-lived structures with dynamic lifetimes, especially ones shared between threads, to eliminate reference counting traffic. But for short-lived objects, if you have any sort of mark phase, you've already lost. Fundamentally, it's really hard to beat a system that precomputes the object lifetimes at compile time—which is what manual memory management is—with a dynamic system that has to compute them at runtime.


> But for short-lived objects, if you have any sort of mark phase, you've already lost.

A GC with compaction gives you bump-pointer allocation though. No need to traverse free lists. With malloc you potentially have short-lived and long-lived objects interspersed with each other, creating lots of holes/fragmentation.

> That wouldn't require more than a couple of writes and a TLS lookup.

While you can just null a reference with a single, unfenced write to something that's probably in your L1 cache already. Plus thread-local lists to avoid contention. Complexity grows quickly. It's not exactly free lunch.


> A GC with compactation gives you bump pointer allocation though. No need to traverse free lists. With malloc you potentially have short-lived and long-lived interspersed with each other, creating lots of holes/fragmentation.

You only get bump pointer allocation in the nursery, not in the tenured generation (unless you want an inefficient tenured generation). In a manually-managed system, short-lived objects usually aren't going to be allocated on the heap at all—they're on the stack. Even for the few that are allocated on the heap, the way free lists work does a great job of keeping them in cache.

Fragmentation really isn't much of a problem anymore with modern mallocs like jemalloc and tcmalloc.

> While you can just null a reference with a single, unfenced write to something that's probably in your L1 cache already. Plus thread-local lists to avoid contention.

I have a hard time believing that the cost of a TLS lookup and a couple of writes per freed object is more expensive than a Cheney scan for objects in the nursery.


> You only get bump pointer allocation in the nursery, not in the tenured generation (unless you want an inefficient tenured generation).

That's not universally true; e.g. V8 can and does allocate pretenured objects in the old generation with bump-pointer allocation.


Surely that can't be done in all circumstances without severe memory usage issues or fragmentation problems though. You have to fill those holes eventually, if not during allocation then during compaction.


Sure, the GC eventually triggers compaction and that by design will produce a lot of contiguous free memory.


> I have a hard time believing that the cost of a TLS lookup and a couple of writes per freed object is more expensive than a Cheney scan for objects in the nursery

This paper describes how it is possible if you are happy to throw a lot of memory at the problem:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.49....

Section 3 is titled "Explicit freeing is more expensive".


As with many late 80s memory management papers, cache effects obsolete the conclusions in this one. In fact, the compiler's ability to coalesce all the stack allocations in a single procedure call into a single pair of stack manipulation instructions makes the conclusions almost meaningless in practice: inlining optimizations and modern CPUs make that cost essentially zero nowadays. I'd almost argue that the methodology in this paper, combined with 2015 assumptions, offers an effective argument against GC. Appel's memory management papers really need to be taken with a grain of salt :)


C++ and Rust don't precompute object lifetimes, they make the developer manage them. The only place I know of object lifetimes being statically analysed and computed is in compilers that do escape analysis (the JVM, Go, maybe others, though Go's escape analysis is very limited).


Do some reference chasing on region inference. There is a branch of research that was spawned by Cyclone, and papers are still being published today about inference of regions.

E.g. see "Safe and Efficient hybrid memory management for Java."

http://dl.acm.org/citation.cfm?id=2754185&CFID=694525936&CFT...


By "precomputing object lifetimes" I just mean that the object lifetimes are determined at compile time, without runtime scans. Human assistance is still required, of course.


> GC could actually be really useful in C++

Here is a very stable one for C and C++:

http://www.hboehm.info/gc/


It might be stable, but it surely isn't fast, given it is conservative.


Yes, but it is still better than the Go GC. Both are super conservative (i.e. not moving), precise (i.e ptr overhead), and just mark & sweep (i.e. full heap scans).

But the Boehm-Demers-Weiser GC can parallelize marking, and save and restore the sweep phase for lower latencies, i.e. in incremental mode. http://www.hboehm.info/gc/gcdescr.html

Better GCs, mark & compact and copying as used in more mature languages, are still on the horizon for Go (if you go for larger heaps and more speed). I'd put it at just 30% done. Go controls its ABI, so it can easily adopt better GCs in the future.


Oops, I mixed up the two phases. Marking is incremental (also in v8), and sweeping is parallelizable.

http://jayconrod.com/posts/55/a-tour-of-v8-garbage-collectio... gives a good overview of good GC techniques, championed in earlier lisps and functional languages.


1TB heap?! In one shard of a massively parallel system?

So you are going to have programs using 1PB of RAM across the cluster?


Still slowish. Far far from "solved." The charts they zoom in on only go to about 500MB in heap, showing 2 ms pause times. It makes me suspicious that the nice linear trend he's showing doesn't hold up under more reasonable values -- my IDE takes up 500 MB and my web browser over a GB.

So if by his possibly rosy calculations, a basic 3GB heap is still pausing 6 ms. God forbid I use a 500 GB heap and now we're into the one second range again. This is assuming the linear relationship holds up, but given his choice of graph domain, I have a suspicion that there are issues to the right.

This seems typical of Google technology. They say they care about performance, but I have yet to see a piece of Google tech that is actually useful if you care about performance. People automatically assume Google is synonymous with performance, but it definitely isn't.

Remember, he says this improved GC pause time is going to come at the expense of Go's top-line speed. Your Go will get slower, and you still will have second-long pauses with any serious work.


The first chart goes up to 20 GB heap size.

> This is assuming the linear relationship holds up

Their goal for 1.6 is to make it constant, not linear:

"Zooming in, there is still a slight positive correlation between heap size and GC pauses. But they know what the issue is and it will be fixed in Go 1.6." https://sourcegraph.com/blog/live/gophercon2015/123574706480


His version of "slight" to me isn't so slight. He's brushing under the rug that Go isn't suitable for many of the low latency, memory hungry domains that comprise modern systems.


> many of the low latency, memory hungry domains that comprise modern systems.

I would say it really depends on the situation, and especially on what your definition of low latency is.

For background: we run a betting exchange. Customers will notice and complain if any action with their money takes more than ~100ms. This threshold aligns quite well with old research about human response times [0].

On the other hand, if we were running an interactive chat/forum system, it would be acceptable to have >500ms latencies from click to comment display. When it comes to communications, reliable persistence tends to be more important than raw latency. (Or to put it another way: it is okay to delay displaying a fresh comment until it has been stored. That way the user knows they do not need to rewrite their contribution.)

I personally have a background in embedded systems, where hard latency limits are the norm for user experience. Developers, end users and companies are all willing to sacrifice throughput for near-immediate feedback .. and doubly so when the system in question happens to control a vehicle dashboard.

At 60 frames per second, one frame refresh is about 17ms. When you need to provide visibly immediate feedback to the user, you have at most 6 frames to display it. Because the data must be available before the rendering of the 6th frame starts, you actually have on average no more than 5.5 * 17ms ≈ 93ms to calculate the response.

The real trick is figuring out where you can get away with non-immediate latency requirements. And incidentally, this has knock-on effects: if hard low latency is not necessary, some GC spikes should be tolerable. Spend the engineering effort where it is crucial, not where it might be nice.

At least until you have more workforce than engineering problems.

0: http://www.nngroup.com/articles/response-times-3-important-l... (Nielsen, 1993)


In the talk Rick described the problem as the data structure responsible for tracking finalizers. The fix is apparently reasonably straightforward, but the cause was not discovered until we were in the Go 1.5 tree freeze, so it has to wait for Go 1.6 in six months.

Hopefully the video will be available soon, it contains a lot of information not in the slides.


Hope we have unicorns and rainbows in Go 1.6, besides 1.4 compiling speed.


But no one is suggesting Go for your web browser or IDE.

Go is brilliant for faceless fast-response highly-concurrent situations - like APIs and the like.

It's also absurd to accuse Google of not caring about performance when they can search the wealth of information on the Internet in milliseconds.


Yeah... the Google search engine is not written in Go. The folks who really, deeply care about sub-ms latencies (like this guy: https://www.youtube.com/watch?v=S9twUcX1Zp0) are not the same folks pushing Go.


Except they are running their downloads site and much of YouTube on Go, among many others. Indexing the world's content may not be smart to write in Go, but Google's other properties are also heavily hit and seem to run fine.


>my IDE takes up 500 MB and my web browser over a GB.

>So if by his possibly rosy calculations, a basic 3GB heap is still pausing 7 ms.

How did you arrive at 3GB? Heap spaces of different processes don't add up, you know?


I just gave a typical low-end estimate for what we use at work. We used to be constrained by 32-bit Java, since 64-bit was slow, so we coded for that range. With compressed OOPs we were able to expand, but we still have a lot of code on that low side. Now hundred-GB heaps are much more the norm.


The Go 1.5 GC is designed to stop the program for at most 10ms out of any 50ms window.
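
A quick way to see the pause behaviour on your own workload is to run with GODEBUG=gctrace=1, or to sample runtime.MemStats directly, e.g.:

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        // PauseNs is a circular buffer of recent stop-the-world pause times;
        // the most recent entry is at index (NumGC+255)%256.
        last := m.PauseNs[(m.NumGC+255)%256]
        fmt.Printf("GCs: %d  last pause: %dµs  total pause: %dms\n",
            m.NumGC, last/1e3, m.PauseTotalNs/1e6)
    }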


Yeah, I'm really getting tired of these imbeciles.

It's really convenient how they switch to _SECONDS_ on the y-axis.

Sorry, if they need a y-axis in seconds, GC is far from solved.


> Yeah, I'm really getting tired of these imbeclies.

Please don't include unsubstantive swipes in HN comments. It degrades the discourse for all of us.

https://news.ycombinator.com/newsguidelines.html


The scale of the old GC requires seconds, so the chart with old and new needs to show seconds. The new one doesn't, and shows milliseconds. The title doesn't say that all of GC is solved, just the latency problem, which for most Go use cases is at least arguable.


I thought the issue with Go garbage collectors wasn't so much speed as correctness (the Go team historically has gotten GC speed by sacrificing correctness; or is correctness a goal past version 1.3?).


The Go garbage collector is precise since Go 1.3: https://golang.org/doc/go1.3


I prefer the RAII approach to GC.


Explain?


Not OP, but RAII provides deterministic destruction. That is, it is provable exactly when the object will be deleted.

From wikipedia: "Object destruction varies, however – in some languages, notably C++, automatic and dynamic objects are destroyed at deterministic times, such as scope exit, explicit destruction (via manual memory management), or reference count reaching zero; while in other languages, such as C#, Java, and Python, these objects are destroyed at non-deterministic times, depending on the garbage collector, and object resurrection may occur during destruction, extending the lifetime."

https://en.wikipedia.org/wiki/Object_lifetime#Determinism


You can have RAII in C#, Java and Python by making use of using/try-with-resources/with, or HOF with monadic constructs.

Yes, it isn't as easy as declaring a templated handle manager class on the stack (it won't work for heap objects) in C++, but it also gets the job done.


It provides deterministic method of automated memory management and avoids this unpredictable latency issue.



