
In a real time system you probably shouldn't run out of RAM (or it should be a recoverable error, like dropping a packet).

But GC pause is inevitable if the GC can't reliably keep up (like a spike in workload).



> GC pause is inevitable

No, it isn't. That's the magic of hard-real-time GC. It isn't easy, but it is possible.

[EDIT: A more accurate claim is that GC pause is no more inevitable than an allocation failure in a malloc/free system.]


GC pauses are inevitable unless:

- you think it’s ok to crash your program with an OOM just because the GC couldn’t keep up with allocation rate. To my knowledge, nobody ever thinks this is ok, so:

- you inevitably pause. Maybe it’s rare. Maybe you are “real time” only in the sense that you want responsiveness 99.99% of the time rather than 100% of the time.

- you prove schedulability. The “good” alternative that I would assume few people do is prove that the GC can keep up (i.e., prove that it is schedulable). To do that you need to have WCET and WCAR (worst case allocation rate) analyses and you need to feed those into a schedulability analysis. Then there’s a bunch of ridiculous math. If you’re lucky, you can prove you’re schedulable, but it seems like a strictly worse dev plan than just writing C/Ada/Rust/whatever code that doesn’t GC.

I list the third option because I know it’s theoretically possible to do it, but I don’t recommend it and I don’t consider it to be a serious option if you’re building something that lives will depend on.
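For a sense of what the simplest version of that schedulability check looks like, here's a toy sketch (the function name, parameters, and numbers are all hypothetical; a real analysis is vastly more involved than this):

```python
def gc_is_schedulable(wcet_s, period_s, wcar_bytes_per_s, gc_rate_bytes_per_s):
    """Toy schedulability check: can the GC keep up in the CPU slack?

    wcet_s:              worst-case execution time of the task per period
    period_s:            task period
    wcar_bytes_per_s:    worst-case allocation rate (WCAR)
    gc_rate_bytes_per_s: guaranteed worst-case GC reclamation rate at 100% CPU
    """
    slack = 1.0 - wcet_s / period_s  # CPU fraction left over for the GC
    # The collector, running only in the leftover CPU time, must reclaim
    # at least as fast as the program can allocate in the worst case.
    return gc_rate_bytes_per_s * slack >= wcar_bytes_per_s

# Task uses 10% of each period, allocates at most 1 MB/s,
# GC reclaims at least 50 MB/s at full CPU: schedulable.
print(gc_is_schedulable(0.001, 0.010, 1_000_000, 50_000_000))    # True

# Same GC, but the task now allocates 100 MB/s: not schedulable.
print(gc_is_schedulable(0.001, 0.010, 100_000_000, 50_000_000))  # False
```

The “ridiculous math” comes from the fact that none of these four numbers is a constant in a real system; each one is itself the output of a hard analysis.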


I agree it's very hard to do that when your code needs 90% of the CPU to keep up.

But lots of systems need less than 10% for the real-time code. Or less than 1%. In that case, it can be easy to convince yourself that straightforward incremental GCs can keep up.

Perhaps you'd argue those systems are over-provisioned, if their CPU is under 10% busy. But there may be good reasons to have extra speed, like supporting a management UI (running at non-RT priority). Or it may be a worthwhile tradeoff for reduced engineering effort. Gigaflop CPUs are like $50 these days, so if your realtime code only needs a megaflop you're likely to end up with very low CPU utilization.


> But lots of systems need less than 10% for the real-time code. Or less than 1%. In that case, it can be easy to convince yourself that straightforward incremental GCs can keep up.

Basically this, although it applies to memory as well, not just CPU. If you've got enough headroom, it's easy to prove that the application physically cannot allocate fast enough to outpace the GC. (Along the lines of: each call to cons does 1/N of a mark-sweep cycle, and you have more than N+M cons cells available, where M is the maximum the application ever has live simultaneously.) Reducing headroom from there is just a performance optimization.

(Proving the program never uses more than M units of memory is nontrivial, but you'd have to do that with malloc anyway, so GC doesn't cost any extra in that regard.)
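A toy simulation of that invariant (deliberately simplified: it assumes the collector reclaims everything dead at the end of each cycle, and N and M are arbitrary):

```python
N = 100  # GC increments per full mark-sweep cycle (one increment per cons)
M = 50   # maximum cells the program ever has live simultaneously

free = N + M     # total heap cells, all free initially
live = 0
gc_progress = 0  # increments completed toward the current cycle

for step in range(10_000):
    # Allocate one cell, dropping an old one if we're at the live cap.
    if live == M:
        live -= 1  # cell becomes garbage; reclaimed at end of GC cycle
    free -= 1
    live += 1
    assert free >= 0, "out of memory -- invariant violated"

    # Each allocation also performs 1/N of a mark-sweep cycle.
    gc_progress += 1
    if gc_progress == N:
        gc_progress = 0
        free = (N + M) - live  # cycle complete: all garbage reclaimed

print("survived 10000 allocations on a heap of", N + M, "cells")
```

In steady state the free count oscillates between N and 0 but never goes negative: with N+M cells, the cycle always completes before allocation can exhaust the heap.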


If you’re running on the softer side of real time then it’s acceptable to just run an experiment that confirms that you really only use a small percentage of CPU and that to your knowledge the GC never blew up and made you miss a deadline.

But on the harder side of real time you’re going to have to prove this. And that is super hard even if you’re overprovisioning like crazy.


Agreed, for a sufficiently hard definition of hard real time.

Hard real time gets a lot of theoretical attention, because it's sometimes provable. But whether the decisions you're making by the deadline are sensible (and not, say, cranking the horizontal stabilizer all the way down when one of the angle-of-attack sensors is broken) is far more consequential in most systems than missing a tick.

Companies definitely shoot themselves in the foot by focusing so hard on the hard real time constraint that they make it 10x harder to reason about the actual behavior of the system, and then they get the more important thing wrong. I've seen this in a few systems, where it's almost impossible to discover what the control loop behavior is from reading the code.


The problem with GC is that its preferred mode of operation is to stop the world for much more than one tick.

Scanning a respectable-size heap on a respectably fast machine, sans fancy GC optimizations, could easily take 30 seconds. Modern production GCs rarely pause you for 30 seconds. Real time GCs certainly try very hard to avoid ever doing that. But:

- The optimizations that make a GC run faster than 30 sec are speculative: they target empirically found common cases. Not common cases of something the programmer has control over, but common cases of heap shape, which is a chaotic function of the GC itself, the way the OS lays out memory, the program, the program’s input, and lots of other stuff. Those common case optimizations are successful enough that GC cycle times often look more like 30 milliseconds than 30 seconds. So, the terrifying thought if you’re using a GC in real time is: what if, at the worst possible time, the heap shape becomes something that isn’t common case, and the GC that you thought took 30 ms now takes 30 sec?

- Real time GCs can let the program do work while the GC is happening, so even if it takes 30 seconds, the program can keep chugging along. But here’s the catch: no memory is being reclaimed until the GC reaches the end, and some or all memory allocated during the GC cycle will remain unfreed until the next GC cycle. So if your 30ms concurrent collector decides to take 30sec instead, you’ll either run out of memory and crash or run out of memory and pause the world for 30sec.
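The arithmetic behind that second point is simple enough to sketch (the rates and times here are made up for illustration):

```python
def headroom_needed_mb(alloc_rate_mb_s, cycle_time_s):
    # A concurrent collector reclaims nothing until the cycle ends, so
    # everything allocated during the cycle must fit in spare heap.
    return alloc_rate_mb_s * cycle_time_s

# At a steady 10 MB/s of allocation:
print(headroom_needed_mb(10, 0.030))  # 30 ms cycle: about 0.3 MB of slack
print(headroom_needed_mb(10, 30.0))   # 30 s cycle: 300.0 MB of slack
```

A thousandfold blowup in cycle time means a thousandfold blowup in required headroom, and you had to provision that memory up front for a case you believed was impossible.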

Basically, the more you know about RTGC, the less you’ll want to use them when lives matter.


Isn’t this where the “cruise missile program will run out of RAM in an hour but the maximum flight time before detonation is 30m so we good” story goes?



