D’s Garbage Collector Problem (pointersgonewild.wordpress.com)
110 points by copx on Sept 10, 2014 | 53 comments



It's OK to point out the potential problem. It may well be that the GC in D can be improved.

But for this specific case: whenever I want to make a fast compiler (and I've made two in my career), I don't set my goal to produce a gazillion objects and expect some fairy to make it fast. I take care to use constructs where one allocation produces a lot of stuff: like arrays. You probably know that addressing elements of an array in machine code is very fast: there are instructions that do it very efficiently. So don't be afraid to use arrays, and use indexes instead of pointers. Your code won't be much less readable, but you can achieve a significant speedup, provided you don't have other algorithms that blow up exponentially (and there's some chance you have those too).
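To make that concrete, here's a minimal sketch of the idea in D (the Node type is made up, it's not the Higgs code): keep all the nodes in one growable array and link them by index instead of by pointer.

    struct Node
    {
        int kind;        // what kind of AST node this is
        uint lhs, rhs;   // children as indexes into `nodes`, not pointers
    }

    Node[] nodes;        // every node lives in this one array

    uint addNode(int kind, uint lhs = 0, uint rhs = 0)
    {
        nodes ~= Node(kind, lhs, rhs);        // amortized O(1) append
        return cast(uint)(nodes.length - 1);  // the "pointer" is just an index
    }

    void main()
    {
        nodes.reserve(100_000);               // one big allocation up front
        auto a    = addNode(1);
        auto b    = addNode(1);
        auto plus = addNode(2, a, b);         // link children by index
    }

The GC then sees one large, mostly pointer-free block instead of a hundred thousand little objects.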

One of the benefits of arrays is that you also deallocate a lot of stuff at once, which makes the whole process much faster. And whenever you feel you're not able to program comfortably enough this way, think about the first Fortran compiler, produced in the fifties: it generated quite good machine code, almost as fast as the hand coded one, and it had only a few kilobytes of RAM to produce such a result.


"it generated quite good machine code, almost as fast as the hand coded one"

I'm sorry, but this is complete bullshit. What is expected of a compiler these days (SSA, advanced instruction scheduling, smart inlining decisions) is not comparable with what the first optimizing compilers did.


The first Fortran competed with handcrafted code on extremely limited machines so it had to be almost as good to gain any traction. It did help that Fortran at that time was a barebones language so the translation was pretty straightforward.

http://polaris.cs.uiuc.edu/publications/c1070.pdf


Thanks a lot, kryptiskt. It's exactly the article I needed as a reference. For those who haven't seen what's behind the link: it's called "The FORTRAN I Compiler", written by David Padua in 2000. A must-read.


It may also have helped that processor designs were incredibly simple back then; the typical demands on a modern optimizer bear almost no resemblance to the demands on a compiler of that era.


If you want a more modern comparison between what the article complains about and what's possible with a language that has a GC, then look at this: Fabrice Bellard implemented a whole x86 emulator in JavaScript, and with it Linux boots (measured on my machine and in my browser) in three seconds. The Linux kernel it boots is approximately 2 MB.

http://bellard.org/jslinux/

Whereas Maxime Chevalier-Boisvert (the author of the article we're all commenting on) compiles

  a = 0;
  a;
  a;
  … // (a is used 60000 times)
  a;
in four seconds with her D code. A program with no loops and a single variable, and just under 200 KB of input.


That is not a fair comparison at all.

1. An x86 emulator doesn't need to allocate much. It's probably an interpreter. It can parse x86 instructions straight from a byte vector, and use another byte vector for the RAM.

2. V8 and Firefox have much better garbage collectors than D, because they need their GC to be fast, and so they've made sure it is.

3. Higgs is a much more complex piece of software than this.

4. The 4 seconds time, as I pointed out, is at least 75% GC (but possibly more). That's the whole point of the blog post. It shouldn't be possible for the GC to take most of the execution time.


How many allocations do you actually make to compile a single variable used 60000 times? I still claim you don't have to allocate much for that particular input either, as long as you use vectors/arrays, the ones you yourself recognize Bellard probably uses, and which I mentioned in my top comment as the right way to make your code faster. It's really as simple as that.


This isn't C, it's JavaScript.

The semantics of a global variable read are more complicated than you think. Global variables reside on the global object, which can grow dynamically. Global properties can also be deleted at any time. Accessing a property which doesn't exist should throw an exception. Properties could also be getter functions, which would then result in a function call for every property read.

Have you written a JS compiler? Have you written a JIT compiler? Any kind of compiler for a real-world language? A garbage collector?

If someone knows of a way to get allocation statistics, then I can answer the question of how many allocations occur.


> Global variables reside on the global object, which can grow dynamically.

But in the example you give there's only one single variable.

> Global properties can also be deleted at any time.

Also doesn't happen in your example.

> Accessing a property which doesn't exist should throw an exception.

I agree.

> Properties could also be getter functions, which would then result in a function call for every property read.

But they aren't in your example.

And all these aspects result in a proportional increase in the resulting code, not in some unexpected complexity.

> If someone knows of a way to get allocation statistics

Run a Perl script (or a Python script, or do it by hand) that adds an increment of a global nAllocations counter in front of every statement where you do a "new" of something. (If most of the allocations are implicit, the counting may have to be added to D's runtime, if it doesn't already exist there. At least the runtime is open source, AFAIK.) Run the program. Print that number. My guess is that it must be in the millions for the performance you measure. Which doesn't seem necessary for processing 60000 evaluations of a single variable, if more arrays of fixed-size things were used in the process. I know for you it's a "doctor, it hurts when I press here" / "well, don't press there" moment, but really, that's how it is. A lot of small objects allocated separately make a lot of work for a GC, and that's true in every language. The question is whether the language gives you a way to do it more efficiently. D does: structs and arrays.
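For the counting itself, a minimal sketch of what I mean in D (the alloc helper is hypothetical; you'd route your explicit allocations through something like it instead of calling new directly):

    __gshared size_t nAllocations;   // not thread-safe; fine for a one-off measurement

    // Hypothetical wrapper: write `alloc!Node(args)` wherever you'd write `new Node(args)`.
    T alloc(T, Args...)(Args args) if (is(T == class))
    {
        ++nAllocations;
        return new T(args);
    }

    class Node { int kind; this(int k) { kind = k; } }

    void main()
    {
        import std.stdio : writeln;
        auto n = alloc!Node(1);
        writeln("allocations so far: ", nAllocations);  // prints 1
    }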

Yes, my bias is that of a C programmer, but there's a reason why well-written C code is fast: the number of allocations is small, and most of the stuff is preallocated before it's used. D lets you do the same.

Note: changing the memory representation of the "objects" you use doesn't mean you have to change your existing algorithms. It's just how the stuff is stored in memory. The changes to your code aren't fundamentally significant, but you'll certainly get the initial speedup you need.


I don't think anyone would dispute that the compiler in question is far from optimal for compiling this particular program.

However, if you are writing a general purpose compiler, you need to accommodate any program. Each of the properties (e.g. no use of getters, only one variable) that you point out about this program would take a compiler pass to establish.

I haven't thought too much about it, but some of the properties are very likely to be undecidable by static analysis for many programs.

It seems likely that spending time determining whether the program being compiled has these properties would not be efficient for most real-world programs, even if it did provide enough information to choose a more optimal compilation strategy in some cases.


Indeed, static analysis is generally rather costly. It's usually a fixed-point computation that iterates over the code multiple times. You generally need to allocate many objects to keep track of facts the analysis produces about different instructions, which need to be propagated around. Running a general-purpose static analysis on this test program would probably be more expensive than what I'm already doing.

In the case of my minimalist test program, it's easy to say "well, ha! that's obviously not an accessor!", but in the general case, it's much harder. Consider, for example, that there could be load() or eval() statements in the code. That means invisible code you haven't seen yet, and can't analyze. Then consider a much more trivial case where you find this somewhere in the code:

    function foo(o) { delete o.a; }

Now, you know that "o" is very unlikely to be the global object... But can you prove that with absolute certainty? It's possible you can't, so then you must assume that maybe the global variable "a" doesn't exist, and accesses to it might throw an exception.

TL;DR: compiler design is no easy task.


Read the article, please.

"My implementation of a liveness analysis scales poorly with function size. It allocates a huge bit matrix (vector of bit sets) for liveness computation. This grows quadratically with the function size, which means that in some cases, obscene amounts of memory are required, and this data structure doesn’t fit in the cache."
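For a rough sense of scale, a back-of-the-envelope computation (assuming one bit per (instruction, value) pair and taking n = 60,000 as a stand-in for the function size; the real numbers in Higgs will differ):

    void main()
    {
        import std.stdio : writefln;
        enum n = 60_000;
        enum bits  = cast(ulong) n * n;            // n x n liveness bit matrix
        enum bytes = bits / 8;
        writefln("%s MB", bytes / (1024 * 1024));  // ~429 MB, far beyond any cache
    }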


The "doesn't fit in the cache" argument has nothing to do with too many small allocations which obviously happen (otherwise GC effects wouldn't be noticeable) and which almost certainly can be replaced with much less allocations of some arrays of things. Which it what I suggest since my top comment.


From the thread on r/programming:

"The main issue from the post is quite simple: if you allocate a lot of small objects, D runtime will place them inside chunks (called pools) of fixed size (like 4 MB) that it requested from the OS. Each time a pool is filled up it will do a GC cycle and only if there's still no room for new object it will allocate a new pool. That means number of GC cycles in this situation is proportional to number of objects you allocate, while each GC cycle time is proportional to the managed heap size. It leads to quadratic complexity and awful slowness. If runtime allocated pools not of fixed size but adapted their size to the allocation rate, this scenario could be significantly faster."


My point was that the first optimizing compiler isn't comparable with what is expected of a modern compiler, so please don't compare their code & memory footprints. With the first Fortran compiler it was garbage in, garbage out. A modern compiler can (often) turn garbage into fast code.

Nowadays, inside a compiler it's 90% graph processing & reprocessing. Since you're dealing with (cyclic) graphs, garbage collection is very useful, and if you don't use some automated GC, you'll probably roll your own in one way or another.

And an x86 interpreter isn't comparable with a compiler either. I doubt that there was any GC going on in your example; everything probably runs in a single array that represents the memory. Maybe some temporary objects, but JS engines have good generational GCs these days. I don't think D uses a generational GC algorithm.


Exactly. Someone told me I should "avoid using the GC completely". I answered that this was basically a way of saying that I should implement my own application specific GC. It's shifting the burden onto the application developer. What I'm saying is that the D GC could probably be several times faster. That much should be obvious.


Just use structs, unions and arrays of them:

http://dlang.org/struct.html

http://dlang.org/arrays.html

Preallocate them and don't increase their size one at a time. And in them, use indexes to positions instead of pointers. Then the GC won't give you problems anymore. It will still be used, but it won't be a bottleneck, as it will simply get far fewer but bigger chunks of memory and far fewer pointers to worry about.
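Something along these lines, roughly (the types are made up, not taken from Higgs): a tagged union stored in one preallocated array, with slots handed out by index.

    struct Value
    {
        int tag;                                  // which union member is live
        union { long i; double d; uint strIdx; }  // strIdx = index into a string table
    }

    void main()
    {
        auto values = new Value[](1 << 20);       // one big up-front allocation
        size_t used;

        // Hand out slots by index instead of allocating objects one by one.
        uint push(long x)
        {
            values[used].tag = 0;
            values[used].i   = x;
            return cast(uint) used++;
        }

        auto a = push(42);
        auto b = push(7);
        assert(values[a].i + values[b].i == 49);
    }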


(her)


Sorry, I didn't check whether the author might be a "she" before I wrote the sentence. The probability is low, even if I'd prefer it were otherwise. Corrected.


I had a 3 minute look at the code in the github repo.

The parser uses ~= (D's array concatenation operator) a lot. This always copies and will lead to quadratic performance for parsing e.g. blocks of statements. Similarly, the switch parser uses code like this:

    curStmts.length += 1;
    (*curStmts)[curStmts.length-1] = parseStmt(input);
This is similarly quadratic - it incrementally grows the statement array.

Look at algorithm performance first.


>This always copies and will lead to quadratic performance

In D, "a ~ b" always copies but "a ~= b" might not. Also, appending using ~= will grow the array exponentially, so the amoritized performance will be better than quadratic.

Doing "x.length += 1", however, does incremental growth and will have poor performance.

(See http://dlang.org/arrays.html and http://forum.dlang.org/post/op.wuic5nateav7ka@stevens-macboo... )
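If the growth strategy is the worry, std.array.Appender (or just reserving up front) is the usual idiom; a quick sketch, not the actual Higgs code:

    import std.array : appender;

    void main()
    {
        auto stmts = appender!(int[])();  // stand-in element type for the sketch
        stmts.reserve(10_000);            // one allocation if the size is roughly known
        foreach (i; 0 .. 10_000)
            stmts.put(i);                 // amortized O(1), no per-element reallocation
        auto all = stmts.data;            // the built-up array
    }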


If it's a GC thing, you should also try using various D compilers to compare performance.

There is:

dmd (official D compiler),

ldc (llvm D compiler),

gdc (GNU D compiler).

When I was messing around with D about a year ago, dmd compiled code the fastest, gdc by far the slowest, but gdc produced the most performant code. ldc was middle ground for both compile speed and program performance. Perhaps the GC issue you are having is just a quirk of the current compiler you are using.


The garbage collector is shared by the three projects. You will not get a better GC this way.

Maybe ldc/gdc can optimize a few allocations away compared to dmd. I do not believe this is the case with compiler code like the Higgs code discussed here.


It'd be interesting to compare this to code running on the JVM. It's not enormously surprising that D's garbage collector can be a performance bottleneck, given the incredible amount of effort and manpower that has gone into making the Java GCs fast and scalable. It's just a really hard problem.

Does D expose any kind of GC stats? I'd imagine that with D a lot of heap pressure could be removed by stack allocating as much as possible, as you'd do in a C++ program. Part of the reason Java needs such beastly GCs is because it generates lots of tiny short lived objects. Although HotSpot can stack allocate these days if a function is hot enough, with D the programmer should be able to do it manually.


D cannot copy Java's GC strategies. Being used for low-level work, D by design cannot move objects.


This is incorrect. D can move objects, although the current GC does not do so.


I'm not an expert on GC, but I do work on a compiler written in a GC'd language that cares quite a bit about performance. I can share a few guidelines that may prove useful to the author or anyone else having similar trouble.

1) Profile. This does not mean gathering a few charts from a primitive profiling tool that simply tells you which functions have the most time spent in them. You need a real profiling tool that gives you detailed analysis of where your time is going. You need CPU stacks, time blocked, GC stacks -- everything you can get your hands on. Your profile needs to tell you where, what, and how much you're allocating. It needs to tell you what kind of allocations you are doing and how long they are living.

2) Once you have your information, you need to figure out what you can do to improve your performance. To do this, you need to know the characteristics of your program and all the platforms it runs on. Is your code being JITed? How long does your program spend in the JIT? Can you mitigate this by doing AOT compilation, optimization profiles, or producing more JIT-friendly code? For GC, what kind of GC are you running on? Is it conservative -- do you have to worry about memory leaks? Is it stop-the-world -- do you have to worry about pauses? Is it generational -- does the behavior change at inflection points? Do you have to adapt your allocations to the sections of your program's run time? Does allocation in one portion affect allocations in another portion, e.g. will your allocations produce extra memory fragmentation when running on a non-moving GC?

3) You need to own your code and its performance. If you're allocating in a hot loop and the GC can't keep up, you need to rewrite that section to not allocate. If you're fragmenting the heap in non-critical paths, you need to stop allocating there, as well. If you're constantly recreating re-usable ephemeral objects, you need to look at object pooling. If you need to refactor sections of your code to avoid allocations and stop-the-world collections, you need to do that. Blaming the GC will not make your application faster. If you need to break out a debugger and jump into the GC to see what's taking so much time, that's what you need to do.

4) After you've optimized, re-evaluate. Are there actionable items you can hand to the GC devs? Can you point out general points where GC strategy has a known bad result and a known better implementation? File these on the GC. Maybe you'll be able to delete some optimizations in future versions. :)

The first version of Roslyn was 30x slower than the old C# compiler written in C++. We're now within 2x on 10-year-old hardware and beating it on newer. Perf optimization works -- but not all programming is a stroll through a park. :)


Why the performance discrepancy between new and old hardware? Do the immutable trees require more RAM, which is more likely to be available on newer hardware? Also, I would suspect going immutable would require more tuning in terms of allocation.


Yeah -- immutability and high concurrency tend to hurt single threaded performance (especially when compared to optimized C++), but we gain a ton from new highly parallel hardware. Also, GC can do analysis on spare cores.


I really should look at the architecture; I'm interested in how concurrent it is. I use an alternative approach that is much more imperative (using dependency tracing and re-execution to ensure consistency).


> Your profile needs to tell you where, what, and how much you're allocating.

Which languages offer such advanced profilers, though? I am using Lua (a GCed language) right now, and I am not aware of any Lua profiler which profiles allocation/the GC. I think D's profiler does not feature that either.

In fact I have only ever heard of Java having such advanced profiling tools, and I guess Microsoft built the same thing for their version of Java (.NET). But again, which other languages have tools which allow you to profile allocation/GC behavior?


I am on mobile right now, but I would say Oberon (a bit basic), Haskell, Go, OCaml (not sure), JavaScript and all commercial implementations of Smalltalk and Common Lisp.

Ah, regarding .NET there are a few commercial profilers out there besides what Microsoft offers. Like the JVM, the CLR exposes an API for such kind of tooling.


Bruce Dawson has some excellent posts about profiling performance issues on Windows: http://randomascii.wordpress.com/


> We're now within 2x on 10-year-old hardware and beating it on newer.

Is it an apples-to-apples comparison though? If you use the same programming approaches in C++ as you're using in Roslyn (immutable structures, etc.), do you expect you would still keep beating it?


Yup. Because a project that's never done can't beat a project that will be.


Here's a separate "GC issue" that Walter Bright has mentioned: http://www.reddit.com/r/programming/comments/22jwcu/how_i_ca...


When I see things like that, I can't help but think that if you want performance, you've gotta go C.

Otherwise, you get stuff like this, that's just impossible to reason about.


You can write D programs that only do C style malloc/free, or even do Fortran style no allocation whatsoever (static and stack only).
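A minimal sketch of that style (the Buf type is made up); nothing here touches the GC heap at all:

    import core.stdc.stdlib : malloc, free;

    struct Buf { int[64] data; }

    void main()
    {
        auto p = cast(Buf*) malloc(Buf.sizeof);   // C-style allocation, invisible to the GC
        scope (exit) free(p);                     // and C-style deallocation
        p.data[0] = 42;
        // (If Buf held pointers into the GC heap, you'd also need GC.addRange.)
    }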


The nice thing with D: use the GC to develop quickly. If the GC becomes a bottleneck, rewrite in C style where necessary.

The author considers this too much work for the benefit.

Nevertheless, everybody agrees that the D GC needs to improve.


It is unfortunately difficult to reason about. For instance, I use D's associative arrays in my program. I know that dynamic allocations must occur inside them, but I don't know what algorithm is being used or how it allocates memory.


I wonder if selectively disabling D's collector would help? Turn it off during the allocation phase to prevent thrashing. The runtime would be forced to allocate from the OS rather than try to collect. It could then be enabled and run when there was known to be garbage to collect.
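Something like this, presumably; core.memory.GC does expose disable/enable/collect, though whether it actually helps in this case is an open question:

    import core.memory : GC;

    void main()
    {
        GC.disable();             // pause collections; the runtime allocates more from the OS instead
        scope (exit)
        {
            GC.enable();          // turn the collector back on...
            GC.collect();         // ...and collect once, now that there's known garbage
        }

        auto junk = new int[][](100_000);
        foreach (ref a; junk)
            a = new int[](8);     // lots of small allocations, with no GC pauses in between
    }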

Another option would be to fall back on pre-allocated object pools rather than naively calling new. I know it is a fairly common optimization in Java/C# when the GC is causing performance issues.


Unfortunately, there are no "fire and forget" performant memory management schemes. For an application that does a lot of allocation, it pays huge dividends to analyze the code to figure out where memory is being allocated, and why.

Often, data structures can be redone to minimize allocation, allocations can be moved to the stack, object pools can be used, etc. There is a long list of techniques that can be applied to improve performance. I know that much of the effort I put into optimizing code goes into dealing with memory management issues, regardless of what the implementation language is.
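For instance, a bare-bones free-list pool might look like this in D (a sketch with a made-up Node type, not anyone's production code):

    class Node { int kind; Node next; }   // `next` doubles as the free-list link

    struct NodePool
    {
        private Node freeList;

        Node acquire()
        {
            if (freeList is null)
                return new Node;          // only allocate when the pool is empty
            auto n = freeList;
            freeList = n.next;
            n.next = null;
            return n;
        }

        void release(Node n)              // hand objects back instead of dropping them
        {
            n.next = freeList;
            freeList = n;
        }
    }

    void main()
    {
        NodePool pool;
        auto a = pool.acquire();
        pool.release(a);
        auto b = pool.acquire();          // reuses a's memory, no new allocation
        assert(a is b);
    }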


If you write programs that allocate memory dynamically it is no surprise that memory allocation and garbage collection are a big deal.


I program in Scala and I wish we'd be able to delete objects manually, and then have the GC clean up the stuff we forget. That'd be awesome; enabling the perfect fusion of managed and unmanaged memory. In the meantime starting out with huge heaps and setting references to null after use seems like the best we JVM folks can do.

Edited to specify I'm using Scala.


Are you setting references to null in Scala? :(


No, not really. But it's a last resort if you need to make things collectable a bit sooner.


core.memory.GC.free(void* p) should do that for you.
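Roughly like this, assuming the block really was allocated by the GC and nothing else still references it (the Big class is just for illustration):

    import core.memory : GC;

    class Big { ubyte[4096] payload; }

    void main()
    {
        auto obj = new Big;
        // ... use obj ...
        destroy(obj);                 // run the destructor, if any
        GC.free(cast(void*) obj);     // return the block now, without waiting for a collection
    }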


That's great! Unfortunately I program in Scala, so D-specific advice doesn't really help me. Edited top post.


What happens if another object holds a reference to p? Bad things?


Yes, bad things, i.e. memory corruption.

Manual memory management means it's up to the programmer to get it right.


Yeah. Some sort of nullpointerexceptions, I would imagine.


[SpiderMonkey dev here. I remember the talk you gave at Mozilla, quite a good talk. I work on the GC much more than the compiler, which I don't know that much about, but I'm only going to talk about the compiler. Ironic, given that you were talking about the GC.]

What's your actual goal here? You gave an example of a pathological function that takes a long time to compile. Does that matter?

We care a lot about compilation time. It matters a lot for Web workloads, and distinguishes JS compilers from AOT compilers. But why do you care, especially in this particular case? Do you have "real" code that is hitting the same bottlenecks? Because if not, you have two easy solutions to this case: first, leave everything as is. In a production system, you can be doing this sort of compile in the background, and only switch over to it from the fast-start interpreter when it's done. The second straightforward option is to just give up. Fail to compile if things get too big or slow, and let it run in the interpreter (or some other lower tier.)

If this does resemble a realistic scenario, then we can get a little more nuanced. We face the same problem in SpiderMonkey -- big functions can take too long to compile and/or will blow out memory (which, honestly, is the bigger concern. More on that later.) In JaegerMonkey days, we had "chunked" compilation, where it artificially broke big functions down into pieces and compiled them separately, essentially spilling all state between them. I know we've talked about doing the same for IonMonkey; I'm not sure if it has been implemented. In particular, asm.js hits this a lot because autogenerated functions can easily get enormous. Luke has definitely talked a lot about chunking asm.js compilation; I don't know if he ever got around to implementing it yet. [You remember Luke; you sat behind him for a while in the Mountain View office.]

Or how about doing it at a finer granularity? We have a number of places where we start out trying to do the smart and clever thing, but when things get out of hand we abort and fall back on something stupid but fast and small. Here, that might mean being very pessimistic about your live ranges and pessimistically chopping them up into small pieces. Could you keep your 2d matrix, but cap its dimensions? You only track up to K variables at a time, and if you exceed K you spill one and reuse its slot for the new one. Obviously, your phi nodes (join points) would need to agree on a set of variables, so maybe this gets too complicated or just doesn't work.

But anyway, I'm still wondering what you're after. You have a research compiler. There is no need for it to be better than a production compiler in all ways. Pick something it is better at, such as producing really fast code. The learnings from that will be more likely to be useful in a production system anyway, because all current leading JS compilers are multi-tier -- you have a fast-start interpreter, usually a still pretty quick-start baseline compiler, and then an optimizing compiler. The distribution of static code across those is something like 90%/9%/1% anyway -- the majority of downloaded code is executed at most once anyway, so there's no point in spending time optimizing it. But that means the optimizing compiler is free to spend some time thinking about things. If your compiler is really slow, just claim it's meant for a 4th tier that only kicks in for really hot code. :) And it can compile in the background.

Unless it burns too much memory. That won't stay isolated from the execution thread. It'll pollute caches, hog the memory bus, and make fun of your mother. So I'd worry more about capping memory usage to a reasonable level than worrying about compile time, even if it means giving up or dropping down to a dumber algorithm when things start to get big.

One other effect of being a later tier is that you can stick to compiling simpler programs. By the time it makes it to your tier, you'll have observed the actual types and behavior of the function for a while, and so you can assume that it'll continue to behave the same way. This allows you to ignore many of the numerous quirks of the JS language. You don't need to handle getters if your hot code never sees them (or rather, you only have to handle them in the parts that do see them.) The global object probably won't get mutated, and almost certainly won't have properties deleted. You will need mechanisms to discard your compiled program if any of these things change, and you'll have to be clever about noticing these sorts of mutations, but it's great to be able to assume that nothing too crazy will happen. Even if you have to guard all those assumptions.

The GC does matter for both time and space, though. An advantage of a generational collector isn't just shorter pauses, it's also that you can use simple bump allocation in the nursery since you never need to do partial deletes. Also, if you're creating a lot of short-lived junk then it can reduce your peak memory usage by quite a bit, which is important for a 3rd or 4th tier background compiler.



