Dismissing Python Garbage Collection (2017) (instagram-engineering.com)
51 points by rbanffy on Dec 17, 2020 | 26 comments



It's nice to see a hack turn into an API extension that addresses the underlying problem.
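For reference, the extension is gc.freeze(), added in Python 3.7. A minimal sketch of the documented pre-fork pattern (the worker setup here is hypothetical):

    import gc
    import os

    gc.disable()      # avoid collections (and GC bookkeeping writes) while the parent warms up
    # ... import modules, load config, build the state the workers will share ...
    gc.freeze()       # move everything tracked so far into the permanent generation

    pid = os.fork()
    if pid == 0:
        gc.enable()   # children collect their own garbage; frozen objects are ignored

The point is that later collections skip the frozen objects entirely, so the collector never writes into their (shared, copy-on-write) pages.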


Only half similar, but Java has a no-collection memory management strategy called "Epsilon GC": https://blogs.oracle.com/javamagazine/epsilon-the-jdks-do-no...


(2017)

Original discussion: https://news.ycombinator.com/item?id=13421464

I'd be curious to know if this hack has survived for four years.


Now just imagine if someone replaced the horrible refcounting with a real copying GC. At least 2x faster, I would assume.


Why couldn't they use PyPy? The GC for that has higher throughput.


The easy answer is C extensions. PyPy does not support C extensions, and therefore does not support a wide range of native, accelerated libraries. Facebook infra is heavily dependent on wrapping C/C++ libraries (like Thrift) for use in many languages, and not having C extensions (or Cython) would cut off a large portion of our shared codebase.


PyPy does support C extensions, but it has to partially emulate the CPython API, so it's rather slow and doesn't work for 100% of modules.

If your C library is wrapped with FFI or ctypes, it'll work perfectly fine under PyPy.
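For example, a ctypes binding like this (assuming a libm can be found on the system) runs unchanged on CPython and PyPy:

    import ctypes
    import ctypes.util

    # Call cos() from the C math library through ctypes; no CPython C API involved.
    libm = ctypes.CDLL(ctypes.util.find_library("m"))
    libm.cos.restype = ctypes.c_double
    libm.cos.argtypes = [ctypes.c_double]

    print(libm.cos(0.0))  # 1.0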


The more important question is whether PyPy's GC uses intrusive data structures to maintain the metadata for each object. If it does the same thing as CPython, just faster, then the process would run out of memory faster as the GC writing to COW pages converts them into owned memory.


(2017)


Added. Thanks!


This article is an excellent argument against reference counting.


As the article says, reference counting was not the problem. They got their improvements by disabling the garbage collector sweeps, but reference counting was still being used.
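As I recall, the knob they ended up with was gc.set_threshold(0) rather than gc.disable(), since a stray gc.enable() from a third-party library can't undo it. Roughly:

    import gc

    # Turning off collection passes does not turn off reference counting:
    # objects whose refcount drops to zero are still freed immediately.
    gc.set_threshold(0)   # with a zero threshold, automatic collections never trigger
    # (gc.disable() works too, but a library calling gc.enable() would revert it)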


My point was that CPython does reference counting, resulting in the Copy-On-Read behavior described in the article. That isn't something that can be easily changed, not without writing a whole new CPython system (or switching to another Python implementation). Their solution to the problem was the cleanest, under the circumstances.


Most tracing GC solutions would run into the same problem. A common implementation strategy for the classic tri-color collector puts the grey bit in the object header (and often the white/black bit too). A Cheney-style copying collector avoids an explicit grey bit (the scan and free pointers track grey objects implicitly), but it moves objects around, which also triggers CoW on shared pages.

If you want a CoW-friendly GC, you need to move the color bits / reference counts out of the object headers and pack them together in the headers of your GC arenas. That way, the high-mutation bits live in a small number of pages. Those pages with the color bits/counters still end up unshared across processes, but at least they're packed together, minimizing the number of affected pages.
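A toy sketch of that layout (not any real collector's implementation): the mark bits live in a per-arena side table, so the mark phase only dirties the bitmap's pages, never the objects'.

    # Toy illustration only: per-arena side table of mark bits.
    class Arena:
        def __init__(self, capacity):
            self.slots = [None] * capacity     # object storage (shared, read-only after fork)
            self.marks = bytearray(capacity)   # all the high-churn bits, packed together

        def mark(self, index):
            self.marks[index] = 1              # dirties only the bitmap page, not the object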


The article points out that their theory about copy-on-read being the problem didn't pan out: when they disabled refcounting on code objects, shared memory usage didn't change. It then goes on to identify the real culprit, which was the cyclic GC.

So, no, this article is not an excellent argument against reference counting.
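As an aside, the shared-memory numbers are easy to check on Linux; here's a hypothetical helper that sums the Shared_* fields from /proc/<pid>/smaps:

    # Hypothetical helper: how much of a process's memory is still shared
    # (e.g. with the parent it was forked from), according to /proc/<pid>/smaps.
    def shared_kb(pid):
        total = 0
        with open(f"/proc/{pid}/smaps") as f:
            for line in f:
                if line.startswith(("Shared_Clean:", "Shared_Dirty:")):
                    total += int(line.split()[1])  # field values are in kB
        return total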


Copy on read doesn't even make sense, since a copy is a read.

Most problems people ascribe to reference counting have far more to do with excessive memory allocation. Instead of fixing that and getting massive speedups of 7x or more, people fiddle with memory management schemes to gain 5% here or there, at the expense of large unpredictable pauses.


Is Python a good arena for showing off any language feature? Just as I wouldn't use C++ to show off garbage collection (conservative mark-and-sweep), Python seems like a bad place to show off reference counting, because the compiler can't spend a practically unlimited amount of time on data-flow analysis to elide refcount operations.



Or C, like real men. \s


Python honestly reminds me of C more than any other language.

I'd never really used it much before a year ago, so I was quite surprised by how flat most abstractions are.

I wouldn't write my website in C, but equally it wouldn't need many features to be Python-esque.


C is not ideal for this task, but last time I checked, FB runs on a combination of C++ and PHP.


There's more than that. PHP, Erlang, C++, Java and a smattering of others.


Yes, and they had to build their own PHP-to-C++ transpiler, then their own virtual machine, and finally their own language.

Truly the real-man solution. The only way this could have been better is if the VM for Hack were written in Brainfuck.


I'd say it was probably the optimal solution, allowing them to scale more or less gracefully. It might not be the most elegant, but it definitely works, and at that scale it's an engineering feat.

I get why people hate C, but once you get used to a couple of good libraries, things get much easier - we're not in the 90s anymore, and there's plenty to choose from, some of it with excellent-quality code. And especially in cases like this, where it turns out you can benefit from delayed freeing of memory, C's manual memory management is an asset.


I'm surprised this hasn't vanished into the memory hole for containing harmful terms like "master process", even if it is from way back in 2017.



