"What's possible, though, is accumulating Python objects in memory and keeping strong references to them12. For instance, this happens when we build a cache (for example, a dict) that we never clear. The cache will hold references to every single item and the GC will never be able to destroy them, that is, unless the cache goes out of scope."
Back when I worked with a Java memory profiling tool (JProbe!) we called these "lingerers". Not leaks, but the behaviour was similar.
I can't find the documentation now, but I had a similar problem with ASP.NET's Master Page. The example of using data binding to dynamically adjust the menus had the binding go to a backing Page instance. That seemed logical. Unfortunately, after about 200 visits to the site, the whole thing fell over. Turns out master pages hold a reference to the backing page even once the whole page is rendered. This caused them to stack up in memory. The fix was to use a static method to provide the data. It was in a user note at the bottom of the page.
I don't have that much experience in Java, but you're absolutely right. Those "lingerers" can happen in pretty much any language with a GC, but they're technically not leaks.
> The cache will hold references to every single item and the GC will never be able to destroy them, that is, unless the cache goes out of scope
Had the reverse happen a while ago and it's nasty as well: some C++ objects were holding references to Python objects, but because the GC doesn't scan those memory regions (they're not Python-owned, after all), the Python objects would get GC'd, and all hell broke loose when the C++ code tried to access the now-dead Python objects. The solution is forced 'lingering', i.e. applying some RAII: adding the Python objects to a global dict and removing them when the C++ objects go out of scope.
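The Python side of that trick is tiny. A rough sketch, assuming the C++ side calls retain/release from its constructor and destructor (the names are mine, not a real API):

    # Keep-alive registry: while an object sits in this dict, the GC
    # (refcounting or mark & sweep) considers it reachable.
    _keepalive = {}

    def retain(obj):
        # Called when the C++ side stores a pointer to obj.
        _keepalive[id(obj)] = obj
        return id(obj)

    def release(handle):
        # Called from the C++ object's destructor (the RAII part).
        _keepalive.pop(handle, None)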
Python is reference counted. This was happening because you didn't increment the reference count for python objects you were referencing from C++. Your RAII should increment/decrement reference counts on the object, not place objects into a global dict. That's the "correct" way to reference python objects.
Python's GC doesn't scan memory in the same way other languages do. Instead, it detects cycles between python objects. As long as you follow the reference counting rules correctly, you shouldn't have to worry about it. (Unless you need to detect cycles involving your C++ objects.)
You assume CPython, or some other reference-counted implementation, though. I'm talking about MicroPython, which uses mark & sweep. Which I obviously should have mentioned, but I guess I was tired and I didn't.
That was my first thought when I saw the headline "I bet it's lxml". I have lost count of the number of times I've had to do special nonsense to handle lxml's memory leaks. (I've even had to go so far as to launch parsing code in a subprocess just so the part that created an etree object would quit & return all its RAM.)
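For reference, the workaround looks roughly like this (a sketch using multiprocessing; parse_and_summarize stands in for the real parsing code):

    import multiprocessing

    def parse_and_summarize(path, queue):
        # All lxml allocations happen in the child process...
        from lxml import etree
        tree = etree.parse(path)
        queue.put(len(tree.getroot()))
        # ...and the memory goes back to the OS when the child exits.

    def parse_isolated(path):
        queue = multiprocessing.Queue()
        proc = multiprocessing.Process(
            target=parse_and_summarize, args=(path, queue))
        proc.start()
        result = queue.get()
        proc.join()
        return result

Clunky, but it guarantees the RAM comes back no matter what the C level does.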
This sort of thing is why calling C from Python is a marginal idea. Debugging such C is tough, especially if it manipulates Python objects in complex ways. There are many invariants of the Python data structures which must be carefully maintained by hand in C code. Getting one wrong will result in obscure bugs like the parent.
"Pickle", Python's old seralizer, had a similar leak problem. It has a cache. You can reuse a Pickle stream, which is done in interprocess communication. But the cache of previously-send objects wasn't being cleared at the end of each Pickle block. I found that, did a workaround, and submitted a bug report. Not sure it was ever fixed properly.
I still have a bug report in on cPickle, a "faster" implementation of Pickle written in C.[1] In a complicated situation with multiple threads using cPickle, memory becomes corrupted and the program will crash. It doesn't happen using the pure-Python Pickle, so I just quit using cPickle. The bug report got the usual "reproduce in a simpler situation" reply to make it go away, and the bug report remains open. It may be the same bug as this one [2] from 2012, although I doubt it.
For parsing HTML, I use html5lib. It's slower, but it's all in Python.
I've had quite a few issues with cPickle myself, so I see what you mean.
Indeed, Python packages built on top of C extensions can be quite hard to debug, as seen with lxml. But what makes it even harder is the fact that lxml is partly built using Cython... so you deal with Python code, C code generated by Cython, and pure C code (libxml2).
While Cython - in my experience - doesn't have many bugs, it still has some, and can, at times, generate glaringly wrong C code (IIRC one simple example: slice assignment to a uint8_t* tries to call a Python runtime function on the uint8_t* as a PyObject*).
Good to know! My only experience so far with Cython comes from lxml. It's a weird language that seems to have a lot of corner cases, though. Just like C++/CLI.
Yes. Yes it does. Also, because it works on two type systems, it has many cases where you'll want to take a look at the generated C code to verify that the "cheap path" was taken and no intermediary Python objects are constructed or Py operators are used -- if performance matters, that is.
On the other hand, it is a radically simpler way to write bindings that also contain logic, or to write rather fast code without straying too far from Python. Plus, it can cythonize almost all code, even very dynamic code with closures, which will still often improve performance on its own (no interpreter, but still the Python runtime for every op). And that is then a nice base for further optimizations.
No kidding. I actually once had a freaky memory leak of my own that I was never able to fix. I was in fact using libxml2 there; I wonder if that was the cause.
The JVM community tends to prefer pure Java implementations of everything, rather than using existing C libraries like Python and Ruby. Some may see this as a bad thing, but it definitely has its benefits. One particularly relevant benefit in the context of this article is that the amount of code that can leak memory, in the conventional sense, is dramatically reduced. I suppose the same thing is happening in the Node.js ecosystem, though I don't recall if Node uses native code to parse XML.
Ironically, C memory leaks are often significantly easier to debug than Java ones. In C, libraries like libumem basically do postmortem GC that can point straight to the leaking callstack (depending on how much debug info you can tolerate).
In GC'd languages, there's no real way for the VM to identify a leak.
If you don't mind potentially slow code, that's a fine thing. Once you've measured and discovered that you're losing out on a lot of performance, it's worth evaluating whether the risk of leaks can be mitigated via careful testing, which doesn't appear to have been done at all in the library used in this blog post.
If you say so. That's not what I've observed, especially if you're talking about code where Python is the glue and all the heavy lifting is done in C or C++.
I am comparing Java to Python, not Java to Python code wrapping C++ code. Because in that case, we would really be comparing Java to C++. Which I am happy to do, but not what I did.
tl;dr: libxml2's C implementation leaked memory, and the author tracked it down. Kudos to the author for their persistence in digging down to the root of the problem. A lot of people would throw their hands up and decide to recycle the process every <N> seconds rather than analyze it to the depth the author did.
> "But if we're strictly speaking about Python objects within pure Python code, then no, memory leaks are not possible - at least not in the traditional sense of the term. The reason is that Python has its own garbage collector (GC), so it should take care of cleaning up unused objects."
I have a hard time believing this. Java can have memory leaks, so why couldn't Python?
I think that the author is defining memory leaks as permanently out-of-scope but not deallocated memory. In that sense I don't know of anything in vanilla Python, or Java, that would qualify as a memory leak. In the more intuitive sense of a memory leak being any failure to make objects available to garbage collection (such as by retaining references to them in an unexpected place), leading to unchecked increases in a program's memory footprint, memory leaks are possible in either language.
There are indeed many interpretations of what a memory leak is in Python.
In C/C++, you can forget to free memory, thus causing memory leaks. For example, you may call malloc(), but forget to call free(). In Python or Java, you can't do that; you don't need to explicitly "free" objects, as there's a GC.
Sure, you could leave rogue Python or Java objects in memory, but in my mind this isn't a "leak" in the same sense as a leak in C/C++. The Python interpreter (which is written in C) or some C extension may themselves cause real leaks, though.
I'm the author of the post. Maybe my reasoning was not explained clearly enough.
> Weak references are the right way to deal with cache objects, imho.
And yet, I disagree ;) Whether a weakref is the correct thing to use or not depends entirely on the purpose of the cache. I often find myself using caches where a weakref would not be very useful, because entries would be collected too eagerly and that would cool the cache a lot.
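For caches where weakrefs do fit, the standard tool is weakref.WeakValueDictionary; entries silently disappear as soon as nothing else holds the value, which is exactly the "cooling" effect I mean. A quick illustration:

    import weakref

    class Document:
        def __init__(self, name):
            self.name = name

    cache = weakref.WeakValueDictionary()

    doc = Document("report")
    cache["report"] = doc

    print("report" in cache)  # True: doc is still strongly referenced
    del doc
    print("report" in cache)  # False: the entry vanished with its value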
It's hard to tell the difference between a real memory leak and Python objects accumulating indefinitely in memory - at least if we rely only on the memory use of a process. That's why we need to use either gc or objgraph as a first step.
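That first step usually looks something like this (objgraph is a third-party package; the interpretation in the comments is the heuristic I use, not a hard rule):

    import gc
    import objgraph  # pip install objgraph

    gc.collect()  # flush anything merely waiting for a cycle pass

    # If these counts keep climbing between snapshots, Python objects
    # are accumulating; if the process RSS grows while they stay flat,
    # suspect a C-level leak instead.
    objgraph.show_most_common_types(limit=10)
    objgraph.show_growth(limit=10)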
Would be curious for detail on your experiences; we do this a lot (we developed mdb_v8) and we've continued to extend/develop mdb_v8 to make it easier -- but trying to debug node memory growth is not something I would ever characterize as simple, friendly or happy (despite our best efforts).