Hacker News
Tracking Down a Python Memory Leak (benbernardblog.com)
150 points by bbernard on Dec 7, 2016 | 42 comments



"What's possible, though, is accumulating Python objects in memory and keeping strong references to them. For instance, this happens when we build a cache (for example, a dict) that we never clear. The cache will hold references to every single item and the GC will never be able to destroy them, that is, unless the cache goes out of scope."

Back when I worked with a Java memory profiling tool (JProbe!) we called these "lingerers". Not leaks, but the behaviour was similar.
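
For illustration, a minimal sketch of such a "lingerer" in CPython: a module-level dict used as a cache that is never cleared. `Item`, `lookup` and `_cache` are made-up names for this example, not anything from the article.

```python
import gc


class Item:
    """Hypothetical payload type, used only for this illustration."""

    def __init__(self, key):
        self.key = key


_cache = {}  # module-level cache that is never cleared


def lookup(key):
    # Every new key adds an entry that stays strongly referenced by _cache.
    if key not in _cache:
        _cache[key] = Item(key)
    return _cache[key]


def live_items():
    # Count the Item instances the garbage collector still tracks.
    return sum(1 for o in gc.get_objects() if type(o) is Item)


for i in range(1000):
    lookup(i)
gc.collect()
count_while_cached = live_items()  # all items are still reachable via _cache

_cache.clear()
gc.collect()
count_after_clear = live_items()  # nothing references the items any more
```

Nothing here is unreachable, so the GC is behaving correctly; the memory only comes back once the cache lets go.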


I can't find the documentation now, but I had a similar problem with ASP.NET's Master Page. The example of using data binding to dynamically adjust the menus had the binding go to a backing Page instance. That seemed logical. Unfortunately, after about 200 visits to the site, the whole thing fell over. It turns out master pages hold a reference to the backing page even once the whole page is rendered. This caused them to stack up in memory. The fix was to use a static method to provide the data. It was in a user note at the bottom of the page.


I don't have that much experience in Java, but you're absolutely right. Those "lingerers" can happen in pretty much any language with a GC, but they're technically not leaks.


> The cache will hold references to every single item and the GC will never be able to destroy them, that is, unless the cache goes out of scope

Had the reverse happen a while ago, and it's nasty as well: some C++ objects were holding references to Python objects, but because the GC doesn't scan those memory regions (they're not Python-owned after all), the Python objects would get GC'd and all hell broke loose when the C++ code tried to access the now-dead Python objects. The solution is forced 'lingering', i.e. applying some RAII: adding the Python objects to a global dict and removing them when the C++ objects go out of scope.


Ugh... No. That's not how Python's GC works.

Python is reference counted. This was happening because you didn't increment the reference count for python objects you were referencing from C++. Your RAII should increment/decrement reference counts on the object, not place objects into a global dict. That's the "correct" way to reference python objects.

Python's GC doesn't scan memory in the same way other languages do. Instead, it detects cycles between python objects. As long as you follow the reference counting rules correctly, you shouldn't have to worry about it. (Unless you need to detect cycles involving your C++ objects.)
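
A small sketch of the two mechanisms described above, assuming CPython (other implementations may free objects at different times). `Node` is a made-up class, and the automatic collector is disabled only to keep the demo deterministic.

```python
import gc
import weakref


class Node:
    """Hypothetical class for the illustration."""


gc.disable()  # keep the cycle detector from running on its own during the demo

# Case 1: no cycle. Dropping the last reference frees the object immediately.
a = Node()
a_alive = weakref.ref(a)
del a
freed_by_refcount = a_alive() is None

# Case 2: a reference cycle. The reference counts never reach zero...
b, c = Node(), Node()
b.other, c.other = c, b
b_alive = weakref.ref(b)
del b, c
survives_refcount = b_alive() is not None

# ...until the cycle detector runs and finds the unreachable pair.
gc.collect()
freed_by_gc = b_alive() is None

gc.enable()
```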


> Ugh... No. That's not how Python's GC works.

You assume CPython, or some other reference counted implementation though. I'm talking about MicroPython which uses mark&sweep. And which I obviously should have mentioned but I guess I was tired and I didn't.


Shouldn't the C++ notify Python of the extra references by incrementing the ref counts?


Spoiler alert, the leak is in libxml2, not Python code.


That was my first thought when I saw the headline "I bet it's lxml". I have lost count of the number of times I've had to do special nonsense to handle lxml's memory leaks. (I've even had to go so far as to launch parsing code in a subprocess just so the part that created an etree object would quit & return all its RAM.)
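
A rough sketch of that subprocess trick. The assumptions are mine: `xml.etree` stands in for lxml so the example needs only the standard library, and `CHILD` and `extract_titles` are hypothetical names.

```python
import json
import subprocess
import sys

# Code run in the child process. xml.etree stands in for lxml here
# (an assumption, so that the sketch has no third-party dependency).
CHILD = """
import json, sys
from xml.etree import ElementTree
root = ElementTree.fromstring(sys.stdin.buffer.read())
json.dump([el.text for el in root.iter("title")], sys.stdout)
"""


def extract_titles(xml_bytes):
    # All parser allocations live and die inside the child process;
    # only the small JSON result crosses back to the parent.
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=xml_bytes,
        capture_output=True,
        check=True,
    )
    return json.loads(proc.stdout)


titles = extract_titles(b"<feed><title>a</title><title>b</title></feed>")
```

Whatever the parser leaks is returned to the OS when the child exits, at the cost of a process launch per call.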


This sort of thing is why calling C from Python is a marginal idea. Debugging such C is tough, especially if it manipulates Python objects in complex ways. There are many invariants of the Python data structures which must be carefully maintained by hand in C code. Getting one wrong will result in obscure bugs like the parent.

"Pickle", Python's old seralizer, had a similar leak problem. It has a cache. You can reuse a Pickle stream, which is done in interprocess communication. But the cache of previously-send objects wasn't being cleared at the end of each Pickle block. I found that, did a workaround, and submitted a bug report. Not sure it was ever fixed properly.

I still have a bug report in on CPickle, a "faster" implementation of Pickle written in C.[1] In a complicated situation with multiple threads using CPickle, memory becomes corrupted and the program will crash. It doesn't happen using Python Pickle, so I just quit using CPickle. The bug report got the usual "reproduce in a simpler situation" reply to make it go away, and the bug report remains open. It may be the same bug as this one [2] from 2012, although I doubt it.

For parsing HTML, I use html4lib. It's slower, but it's all in Python.

[1] http://bugs.python.org/issue23655 [2] http://bugs.python.org/issue12680
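
On the Pickle cache mentioned above: in today's standard library that cache is the pickler's "memo", and `pickle.Pickler.clear_memo()` resets it. The sketch below only shows the memo at work on a reused pickler. It is not a reproduction of the bug described, and the variable names are made up.

```python
import io
import pickle

buf = io.BytesIO()
pickler = pickle.Pickler(buf)
obj = list(range(1000))

pickler.dump(obj)
first = buf.tell()            # full serialization of the list

pickler.dump(obj)
second = buf.tell() - first   # only a short back-reference: the memo remembers obj

pickler.clear_memo()          # forget previously seen objects
pickler.dump(obj)
third = buf.tell() - first - second  # full serialization again
```

As long as the memo is not cleared, it also keeps every object it has seen alive, which is how a long-lived pickler can grow without bound.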


I've had quite a few issues with cPickle myself, so I see what you mean.

Indeed, Python packages built over C extensions can be quite hard to debug, as seen with lxml. But what makes it even harder is the fact that lxml is partly built using Cython... so you deal with Python code, C code generated with Cython, and pure C code (libxml2).


While Cython - in my experience - doesn't have many bugs, it still has some, and can, at times, generate glaringly wrong C code (iirc one simple example: slice assignment to a uint8_t* tries to call a Python runtime function on the uint8_t* as a PyObject*).


Good to know! My only experience so far with Cython comes from lxml. It's a weird language that seems to have a lot of corner cases, though. Just like C++/CLI.


Yes. Yes it does. Also, because it works on two type systems it has many cases where you'll want to take a look at the generated C code to verify that the "cheap path" was taken and no intermediary Python objects are constructed or Py operators are used -- if performance matters, that is.

On the other hand it is a radically simpler way to write bindings that also contain logic, or to write rather fast code without straying too far from Python. Plus, it can cythonize almost all code, even very dynamic code with closures, which will still often improve performance on its own (no interpreter, but still Python runtime for every op). And that is then a nice base to do further optimizations.


(So many typos above. "html4lib" should be "html5lib". "Seralizer" should be "serializer". "Previously-send" should be "previously-sent". Sorry.)


No kidding. I actually once had a freaky memory leak of my own that I was never able to fix. I was in fact using libxml2 there, I wonder if that was the cause.


There's more than one leak actually. It's quite a party :)


Exact same thing. Odd unsolved memory leak, using lxml for HTML parsing.

I think I'll stay away from it in the future. Unfortunate, since it's the fastest HTML parser in Python that I know of.


Have you compared it to the C implementation of ElementTree?

With stanzas like

    for event, n in ElementTree.iterparse(xmlfile, events=("end",)):
        if n.tag == "blar":
            # do stuff
            n.clear()
it doesn't even take much memory to parse files of arbitrary size.
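
A self-contained variant of the stanza above, as a sketch: the in-memory document and the `seen` list are made up so that it runs as-is, and "do stuff" becomes recording an attribute.

```python
import io
from xml.etree import ElementTree

# A small in-memory document; "blar" is the tag name from the stanza above.
xmlfile = io.BytesIO(
    b"<root>"
    + b"".join(b"<blar id='%d'>x</blar>" % i for i in range(5))
    + b"</root>"
)

seen = []
for event, n in ElementTree.iterparse(xmlfile, events=("end",)):
    if n.tag == "blar":
        seen.append(n.get("id"))  # "do stuff" with the finished element
        n.clear()  # drop the element's text, attributes and children
```

Note that `clear()` empties each element but leaves it attached to its parent, so for very large files the root still collects the empty shells.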


In libxml2? That's a reasonable guess, but I would be open to other possibilities if I were you, hehe :) Stay tuned!


The post's tags, displayed at the top, spoiled it for me.


The JVM community tends to prefer pure Java implementations of everything, rather than wrapping existing C libraries as Python and Ruby do. Some may see this as a bad thing, but it definitely has its benefits. One particularly relevant benefit in the context of this article is that the amount of code that can leak memory, in the conventional sense, is dramatically reduced. I suppose the same thing is happening in the Node.js ecosystem, though I don't recall if Node uses native code to parse XML.


Ironically, C memory leaks are often significantly easier to debug than Java ones. In C, libraries like libumem basically do postmortem GC that can point straight to the leaking callstack (depending on how much debug info you can tolerate).

In GC'd languages, there's no real way for the VM to identify a leak.


If you don't mind potentially slow code, that's a fine thing. Once you've measured and discovered that you're losing out on a lot of performance, it's worth evaluating whether the risk of leaks can be baked away via careful testing, which doesn't appear to have been done at all in the library used in this blog post.


Well pure Java code is 10x to 100x faster than pure Python code. So you aren't exactly accepting slowness in that case.


If you say so. That's not what I've observed, especially if you're talking about code where Python is the glue and all the heavy lifting is done in C or C++.


I am comparing Java to Python. Not Java to Python code wrapping C++ code. Because in that case, we would be really comparing Java to C++ code. Which I am happy to do, but not what I did.


This whole thread was about memory leaks caused by not sticking to a single language.


A pure Python implementation of lxml would be exceedingly slow.


Not with a good JIT compiler like PyPy.


tl;dr: libxml2's C implementation leaked memory, author tracked it down. Kudos to the author for their persistence in digging down to the root of the problem. A lot of people would throw their hands up and decide to recycle the process every <N> seconds rather than analyze it to the depth the author did.


I'm the author of the post, so thanks a lot for your kind remarks.

Now, the problem appears to be in libxml2, but... it's only partly true. I assure you that the best is yet to come :)


> "But if we're strictly speaking about Python objects within pure Python code, then no, memory leaks are not possible - at least not in the traditional sense of the term. The reason is that Python has its own garbage collector (GC), so it should take care of cleaning up unused objects."

I have a hard time believing this. Java can have memory leaks, so why couldn't Python?


I think that the author is defining memory leaks as permanently out of scope but not deallocated memory. In that sense I don't know of anything in vanilla Python, or Java, that would qualify as a memory leak. In the more intuitive sense of a memory leak being any failure to make objects available to garbage collection, (such as by retaining references to them in an unexpected place) leading to unchecked increases in a program's memory footprint, memory leaks are possible in either language.


Check out this SO thread: http://stackoverflow.com/questions/2017381/is-it-possible-to....

There are indeed many interpretations of what a memory leak is in Python.

In C/C++, you can forget to free memory, thus causing memory leaks. For example, you may call malloc(), but forget to call free(). In Python or Java, you can't do that; you don't need to explicitly "free" objects, as there's a GC.

Sure, you could leave rogue Python or Java objects in memory, but in my mind this isn't a "leak" in the same sense as a leak in C/C++. The Python interpreter (which is written in C) or some C extension may themselves cause real leaks, though.

I'm the author of the post. Maybe my reasoning was not explained clearly enough.


Different definitions of memory leak.

In the post's definition, a memory leak is memory you can no longer reach but is still allocated. That's eliminated by Java and Python.

Of course, you can still waste memory in Java or Python in a variety of ways. But that's a different definition of memory leak than the post is using.


I've used the gc module, with get_referrers and get_referents, to track down various leaks. This only really helps with Python-allocated objects.

It's trivial to end up with an unexpected strong reference. Weak references are the right way to deal with cache objects, imho.
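
A minimal sketch of that kind of hunt with `gc.get_referrers`; `Session` and `registry` are hypothetical names standing in for the object that won't die and the place that forgot to release it.

```python
import gc


class Session:
    """Hypothetical object we expected to be freed."""


registry = {"s1": Session()}  # the forgotten strong reference
suspect = registry["s1"]

# Ask the collector which container objects refer directly to the suspect.
holders = [r for r in gc.get_referrers(suspect) if isinstance(r, dict)]
found_in_registry = any(r is registry for r in holders)
```

The result also includes the namespace dict that holds the `suspect` variable itself, so expect some noise to filter out.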


> Weak references are the right way to deal with cache objects, imho.

Yet, I disagree ;) Whether a weakref is the correct thing to use or not depends entirely on the purpose of the cache. I often find myself using caches where weakref would not be very useful, because it would cool the cache a lot.
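
To make the "cooling" concrete, a small sketch with `weakref.WeakValueDictionary`, assuming CPython (where the entry disappears as soon as the last strong reference is dropped). `Result` is a made-up class.

```python
import weakref


class Result:
    """Hypothetical expensive-to-build value."""

    def __init__(self, value):
        self.value = value


cache = weakref.WeakValueDictionary()

r = Result(42)
cache["answer"] = r
hit_while_referenced = "answer" in cache  # the caller still holds r

del r  # the last strong reference is gone
hit_after_release = "answer" in cache  # the entry vanished with the object
```

So a weak cache only helps while someone else keeps the value alive; if the point of the cache is to avoid recomputing values nobody currently holds, it will mostly miss.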


Interesting approach!

It's hard to tell the difference between a real memory leak and Python objects being accumulated infinitely in memory - at least if we rely only on the memory use of a process. That's why we need to use either gc or objgraph as a first step.
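
As a rough idea of what that first step can look like with only the gc module (objgraph offers similar reports): snapshot per-type object counts twice and diff them. `type_counts`, `Leaky` and `hoard` are made-up names for this sketch.

```python
import gc
from collections import Counter


def type_counts():
    # Count GC-tracked objects by type name after a full collection.
    gc.collect()
    return Counter(type(o).__name__ for o in gc.get_objects())


class Leaky:
    """Hypothetical type that accumulates."""


before = type_counts()
hoard = [Leaky() for _ in range(500)]  # simulated accumulation
after = type_counts()

# Counter subtraction keeps only the types whose count grew.
growth = after - before
top_growth = growth.most_common(3)
```

If the process keeps growing while a diff like this stays flat, the growth is probably outside Python-managed objects, for example in a C extension.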


Reminds me of tracing a memory leak in a Node app by loading a core dump into an illumos VM with mdb_v8. Not so simple/friendly/happy after all.

(You could argue that I could have generated a heap snapshot with v8-profiler, but I was short on time.)


Would be curious for detail on your experiences; we do this a lot (we developed mdb_v8) and we've continued to extend/develop mdb_v8 to make it easier -- but trying to debug node memory growth is not something I would ever characterize as simple, friendly or happy (despite our best efforts).


Looks like a lot of fun! :)

I agree that some memory leaks are rather hard to find.



