Removing Python's GIL: The Gilectomy [video] (youtube.com)
168 points by jahan on June 5, 2016 | 99 comments



This was a great talk, and was presented to a packed house (see below). If you don't want to watch the whole video, the issue really boils down to this:

1) Lots and lots of fine-grained locks. One on every collection. Oooof.

2) All that locking and unlocking absolutely thrashes the cache with Larry's current prototype implementation. I was surprised at how little time was spent doing the actual locking/unlocking itself. I was blown away by the performance impact of maintaining cache-coherent locks (at least in Larry's current implementation). And for reference, I was a logic designer on mainframes in the 1980s, where we paid attention to making locks perform well across independent caches, so I'm no newb around this issue, but it was still striking to me.

(packed house) Larry's joke at the beginning about "practicing for getting on your plane later tonight" is a reference to the packed seating. The Portland convention center staff were taking the Portland fire marshal's directives quite seriously. We spent several minutes making sure every seat was occupied, and then the staff evicted the standees :/


> 2) All that locking and unlocking absolutely thrashes the cache with Larry's current prototype implementation. I was surprised at how little time was spent doing the actual locking/unlocking itself. I was blown away by the performance impact of maintaining cache-coherent locks (at least in Larry's current implementation). And for reference, I was a logic designer on mainframes in the 1980s, where we paid attention to making locks perform well across independent caches, so I'm no newb around this issue, but it was still striking to me.

Sadly I'm not optimistic about the approach. Fine-grained locking usually has costs people don't think about. Using reference counting converts every read into two writes (to update ref counts). Then we get into second-order effects where unrelated data structures share a cache line with the lock & reference counts. You're just fighting the machine architecture.
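
Concretely, here's roughly what a mere read turns into once refcounts have to be thread-safe (a sketch only; `dealloc` is a hypothetical destructor, and under the GIL today Py_INCREF is just a plain non-atomic increment):

    /* Sketch: what a read costs with thread-safe refcounts. Every
       atomic RMW is a store, so "reading" an object dirties its
       cache line and drags it exclusive to this core. */
    #include <stdatomic.h>

    typedef struct obj {
        atomic_long ob_refcnt;
        /* ... payload ... */
    } obj_t;

    extern void dealloc(obj_t *);   /* hypothetical destructor */

    static obj_t *read_field(_Atomic(obj_t *) *slot)
    {
        obj_t *o = atomic_load(slot);
        atomic_fetch_add(&o->ob_refcnt, 1);   /* write #1 */
        return o;
    }

    static void done_with(obj_t *o)
    {
        if (atomic_fetch_sub(&o->ob_refcnt, 1) == 1)   /* write #2 */
            dealloc(o);
    }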

Even with additional optimizations, I just can't see getting acceptable results with the approach as taken. With the constraints Guido has put in place around GIL elimination, it looks like Python on multicore is going to stay what we have today.


Multi-interpreter is another possibility that is currently being explored. You still need to load one interpreter per core, but there's no more serialization to share objects.


How so? Sharing arbitrary objects via memory requires some form of coordinated memory management. So you still need a process-wide GC, meaning you're probably stuck with that 30% hit from thread-safe refcounting.

And all the other locking issues are still in play. How would you share a list with another interpreter such that they don't both have access to the same list items? Any solution that looks like a deep copy isn't going to be much better than just serializing across a process boundary.


Right now multiprocessing uses very expensive serialization to send objects from one worker to another. Shared memory is much cheaper.

Even if you didn't have the GIL, you would still need locks with threads to manipulate any mutable data structure. The usual concurrency-safety advice still applies.


I wonder if you could completely elide the CAS needed by userspace locks (the bit which thrashes the cache I presume) while there is only one thread running. Then when a new thread is created for the first time (if it ever is), pause the program and promote and acquire all the locks.

Like how the JVM assumes classes are final and removes virtual calls until it first sees a subclass.
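
Concretely, I'm imagining something like this (a rough sketch; `threads_started` is a hypothetical flag the runtime would flip, under a stop-the-world pause, when the first extra thread is spawned):

    /* Sketch: skip the atomic CAS entirely while the process is
       known to be single-threaded. Until threads_started flips,
       locks degenerate to plain relaxed stores. */
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool threads_started = false;

    typedef struct { atomic_int held; } lock_t;   /* 0 free, 1 held */

    static void lock_acquire(lock_t *l)
    {
        if (!atomic_load_explicit(&threads_started, memory_order_relaxed)) {
            /* Single-threaded fast path: no CAS, no cache-line ping-pong. */
            atomic_store_explicit(&l->held, 1, memory_order_relaxed);
            return;
        }
        int expected = 0;
        while (!atomic_compare_exchange_weak(&l->held, &expected, 1))
            expected = 0;   /* a real lock would futex-wait here */
    }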


Late in the talk he discusses strategies to avoid thrashing that amount to making locks as local as possible, ideally such that they behave the same way they would in the current unsynchronized environment. The big problem is that all the locks are global, so a change to a lock triggers a gigantic cascade of cache invalidation (and voided branch prediction, etc).

It's a great talk; the video is really worth watching. At the end, he says he welcomes anyone to join him in the sprints, but basically don't bother unless you really know cpython and multithreaded programming, because this is very advanced stuff.


Right, but I'm suggesting: why bother with atomic operations and cache invalidation if there is known to be just one thread? That's the case we want the lowest overhead on. Why not start doing that stuff only when the second thread is started?


Ok, I think I see the point you're making, but what I took from Hastings' talk is that devolving locking to as local a scope as possible is doing just that: wherever possible, locking in an analytically single-threaded context, which is trivial and basically cost-free.

The cache invalidation isn't done by cpython, it's done at the chip level when global atomic incr and decr instructions are triggered. Hastings was pretty clear that the performance hit was very closely tied to the way that modern processors depend very heavily on caching, pipelining, branch prediction, etc.

I think that the kind of implementation you're suggesting, where the VM adopts a locking strategy based on whether or not it's actually multi-threaded, might be possible, but it sounds like it conflicts with Guido's edict that no GIL-less solution will be accepted that overly complicates the cpython codebase.

I'm probably making a hash of the talk, but I saw it in person, and it was surprisingly clear to someone like me who isn't a serious C programmer, but remembers their class in operating systems and kernel design.


The complexity requirement is the tricky one then, yes. I think I'd disagree with Guido there - a little complexity in the VM managed by a few experts is probably a good cost to accept for millions of users.


It's not Guido's point. He means that CPython, being the reference implementation, must remain easy to understand.


Although, this suggestion is kind of close to Hastings' implementation plan of having separate compilations for GIL and GILless python, so that the default Python with a GIL continues to be the norm in most cases, but a GILless python, with extensions taking advantage of true threaded parallelism, is available as a 'high-performance' python for scientists and number crunchers and such.


Some JVMs do exactly that. The optimisation is called biased locking and would likely work in Python as well.


I think that was one of the suggestions from the after-talk conversation. The trouble is that you have to do it on thread creation, which is sort of a cheesy way to ensure single-threaded performance doesn't go down: it still leaves multi-threaded applications performing worse than under the GIL, which isn't super helpful. Ideally you'd do it on the first access to the object by a second thread, but there's no particular interface that makes it clear that's what's happening: the second thread just knows about the object somehow, and reads it at some point. And there are enough global variables in Python (e.g., every module) that simply promoting it when it's visible to multiple threads is too aggressive.


There's a lot you can do. There's not much you can do in cpython as it currently stands without breaking a lot of code, especially when it comes to extensions.


I work on a new Ruby interpreter, and our solution to this problem has been to interpret the C code of extensions. That way we can give it one API, but really implement another.


Sorry if I keep responding, but there's a lot of interesting points in your question.

> Like how the JVM assumes classes are final and removes virtual calls until it first sees a subclass.

AFAIK, the python VM doesn't do any JITting at all. A runtime decision in the VM about handling locks is impossible at this point, I think.


Well, the userspace locks he discusses already read a variable and branch between uncontended and contended cases. Specialising speculatively for the single-threaded case would just be a third cannot-be-contended case. But yes, it'd work better with a JIT - yet another reason to write one.


There is a version of Python that JITs -- I forget its name at the moment, though.


there are several. we (microsoft) are also trying to get an api into cpython that anyone can use to build JITs for python. our sample implementation is at:

https://github.com/Microsoft/Pyjion

pycon 2016 talk:

https://www.youtube.com/watch?v=1DAIzO3QXcA


PyPy


Sort of.

    if (threads)
         lock(...);
should still be quite a bit faster than fine-grained locks everywhere, since your atomics aren't broadcasting and shooting down cache lines all over the place.


Nothing is truly thread-local in Python. You can muck with anything from another thread, finding it with "dir" and altering it with "setattr" if necessary. This implies worst-case locking.

To get past all that locking, you need to be able to determine which objects are never visible to more than one thread. They don't need locks. PyPy might be able to do this; it does some global analysis.


I have some thoughts:

- Doing better than atomic inc/dec for those reference counts is going to be hard. All of those techniques are still in an "unconfirmed myth" state AFAICT: someone published a paper but nobody has confirmed that the result holds on a broader set of machines, workloads, baselines, etc.

- Great call on userspace locking. Note that you can do this portably; you don't need special OS support (a minimal sketch follows this list). See https://webkit.org/blog/6161/locking-in-webkit/

- Seems like lots of the locking use cases can indeed be made lock-free if you are willing to roll up your sleeves and get dirty. That's what I would do.

- I still bet that the dominant cost is lock contention and he is not analyzing his data correctly. He appears to claim that it can't be locks because the total CPU time is greater than the total length of critical sections and because some analysis tools tell him that there is massive cache thrashing. But that's exactly what happens if you contend on locks too often. Lock contention causes context switches and thread migrations. Both of those things require cache flushes. So the code that runs under contention will report massive cache thrashing because it will have a high probability of being on a cold cache. Programs will indeed run slower under contention than without it, and while his slow-down is extreme, I've seen worse and fixed it by removing some contention. He should find every contended lock and kill it with fire.

- The dip at 4 cores doesn't surprise me. Computers are strange and most "scalability" charts (X axis is CPUs, Y axis is some measure of perf) I've made had weirdly reproducible dips and jerks.
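
Here's roughly what that portable userspace fast path looks like (a minimal sketch of the general idea, not WebKit's actual code): one CAS when uncontended, and only under contention do you fall back to parking the thread.

    /* Sketch of a portable userspace lock: the uncontended path is
       a single CAS with no syscall. WebKit parks contended threads
       in a global "parking lot" keyed by the lock's address;
       sched_yield() stands in for that here. */
    #include <stdatomic.h>
    #include <sched.h>

    typedef struct { atomic_int state; } wordlock_t;   /* 0 free, 1 held */

    static void wordlock_lock(wordlock_t *l)
    {
        int expected = 0;
        if (atomic_compare_exchange_strong(&l->state, &expected, 1))
            return;                /* fast path: no OS involvement */
        do {
            sched_yield();         /* slow path: would park the thread */
            expected = 0;
        } while (!atomic_compare_exchange_weak(&l->state, &expected, 1));
    }

    static void wordlock_unlock(wordlock_t *l)
    {
        atomic_store(&l->state, 0);   /* would also unpark one waiter */
    }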


I asked about lock-free data structures from two major open-source libraries, but Larry found major issues with them. He may consider some of these data structures in the future. I opened a bounty for this.

See closed github issues for gilectomy.


Not lock-free data structures. That's usually a fool's errand. You need lock-free hacks.

Exhibit #1: the lock to protect lazy initialization. That just needs a CAS on the slow path and a load-load fence on the fast path. Delete the object you created if you lose the race and try again.

Exhibit #2: you can probably do dirty things to make loading from dicts/lists not require locks even though storing to them does.


How maintainable are these hacks in the long run? Is there a firm theoretical basis? If yes, sources?


The lazy initialization example is a classic algorithm. Usually concurrency hackers rediscover it from first principles. It's easy to maintain because the concurrency protocol it uses is easy to understand. Here's how I'd write it:

    template<typename T, typename... Args>
    T* constructSingleton(Atomic<T*>& singletonRef, Args&&... args)
    {
        for (;;) {
            T* oldValue = singletonRef.load();
            if (oldValue)
                return oldValue;
            T* newValue = new T(std::forward<Args>(args)...);
            if (singletonRef.compareExchangeWeak(nullptr, newValue))
                return newValue;
            // Lost the race (or the weak CAS failed spuriously):
            // discard our copy and loop to try again.
            delete newValue;
        }
    }
This works so long as the singletons are happy to be deleted. I'm assuming that's true here.


I think this is C++11, while CPython is using C89 and may update to use some C99 features supported by major compilers (GCC, Clang, MSVC).


Then write it in C89. It's not hard. My first lock-free algo was in C89.
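
For example, something like this (a sketch only; C89 itself has no standard atomics, so this leans on GCC/Clang's __sync builtins, and make_thing/destroy_thing are hypothetical stand-ins for the real ctor/dtor):

    #include <stddef.h>

    extern void *make_thing(void);       /* hypothetical ctor */
    extern void destroy_thing(void *);   /* hypothetical dtor */

    static void *singleton = NULL;

    void *get_singleton(void)
    {
        void *old;
        void *fresh;
        for (;;) {
            old = singleton;
            if (old != NULL) {
                __sync_synchronize();    /* fence on the fast path */
                return old;
            }
            fresh = make_thing();
            if (__sync_bool_compare_and_swap(&singleton, NULL, fresh))
                return fresh;
            destroy_thing(fresh);        /* lost the race: discard, retry */
        }
    }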


What a nice talk. It's a shame the only question we got to hear him answer from the audience (there was only time for one) was such a lame troll. If you attend conferences, please don't waste everyone's time with nonsense.

This is a python conference talking about a long-known perf issue and the guy just asked why not use another language. It's like going to WWDC and asking why everyone isn't using Android.


I have been subjected to more than a decade of Python propaganda about how the GIL is a good thing because:

* it prevents deadlocks

* it prevents livelocks

* it means you don't have to lock when you share across threads

and how it's not a big deal because

* you're IO-bound anyway

* you can easily write performance critical parts in C

* you can use multiple processes, in fact that's better design anyway

So I would have asked more critical questions. Feel free to call that trolling if you want.


It seems to be a common mindset among python developers. You often hear that the reason you can't have something in Python is that it's theoretically impossible, impractical, or too difficult - but then it happens in another language. I was convinced that dynamic languages would always be slow (see https://wiki.python.org/moin/Why%20is%20Python%20slower%20th...) but then Google gave us V8.


I almost didn't watch this because it's over my head, but the speaker was enthusiastic and made it fun. It's a shame there wasn't time for questions. I'm now going to watch the original Larry Hastings GIL talk: https://www.youtube.com/watch?v=4zeHStBowEk


I think it's a very valid question. Given that the constraints set by the Python BDFL rule out any reasonable solution, why not move on?


The problem is that the question is in the wrong context. Somebody is trying very hard to do something despite those constraints. For free. Why would you even try to discourage him from doing so?


To save him time and frustration due to his project going nowhere?


Since when has trolling convinced anybody of anything?

This was definitely not the intent of the troll, just a joke. But a badly executed one. It's like going to a political meeting, hearing somebody's plan to improve the economy, and asking him as a question why he doesn't just change countries. It makes little sense.

Plus, everybody should consider the guy's résumé, because he is not just a newcomer trying to show off. He is a hell of a good dev.


I don't believe that the question was meant to be a joke or a trolling attempt.

The guy's competence isn't the issue. It's the requirements of what the fix can and cannot do.

_Everything_ is a mess due to the massive technical debt of never improving key parts of the runtime.

I mean, what's the point of a runtime where adding threads makes things orders of magnitude slower than running in a single thread?

"But that's just the state right now, they will fix this!"

No, they won't. Just as a perspective: People usually fight over a single percent of improvement or less when working on runtime concurrency. Nobody just goes in and fixes _magnitudes_ of performance issues without rethinking what has been done and picking a completely different approach. This is not about making things "faster". This is literally going from doing necessary work to discovering a way of not having to do the work at all anymore. That's just not going to happen. Neither locks nor reference counting have anywhere near this optimization potential given the current language semantics.

I think the whole (C)Python community missed the train more than a decade ago. Things that necessarily had to happen just didn't happen. Many communities and groups of developers "professionalize" over time. This doesn't seem to have happened in Python.

Just an example: C extensions. It has been clear for at least a decade that the existing C interface won't work for threading, "real" garbage collection, etc.

What would have been smart: Providing a better interface 10 years ago, effectively giving C ext devs 10 years of time to migrate their code. This way efforts to remove the GIL today would have to satisfy one fewer constraint that's currently crippling all of the efforts.

Same with a lot of other things ... getting rid of the GIL would have required changes to various parts of the stack over the years (GC, language -> Python 3?, APIs ...), turning the actual removal of the GIL just into a final act of a multi-step process.

What actually happened: Nothing. And now they just try to break the large lock into millions of smaller locks ... in 2016. WAT? This just tells me that key people in the Python community never really gave a damn about the issue and therefore this guy will waste his time.

I have never expected Python's demise, but it now seems that largely its culture and not its lacking technology brought it down for good. From my perspective, they should have never released Python 3 with a GIL. That's what broke Python's neck, finally.


Interesting train of thought. So if I understand correctly, the gist of it is that it has become more costly to fix Python than to use another language that already has threading... so why not just use another language.


Costly: yes.

Why not use another language? I think it's a perfectly valid question for that guy to ask.

My personal opinion on "Why not use another language" is a bit different though:

Adopting a different language might make sense – not necessarily for technical reasons but for cultural ones. The technical issues could have been addressed in time, and the language would be in a much better situation today.

The fact that the technical issues have not been addressed for such a long time shows a clear lack of leadership and focus. I think these cultural issues are a bigger problem than the technical issues. You can fix code, but you can't fix people.


I hope this is a fruitful endeavor, but I can't help feeling that the pressure to not have another Py2 -> 3 situation will win out and the GIL will remain. Which, I think, is completely fine.

For me Python is the ultimate glue code - it's the duct tape of my programming world. If I know multi-core performance is going to be an issue up front, I would pick another language.


Hastings proposed avoiding the breaking changes issues by having separate extension integration points for python with the GIL and without. In other words, cpython would be compilable without the GIL, and extensions wanting true threaded parallelism would implement additional integration points to take advantage of that if run in a GIL-less python. So numpy, for example, would continue to work with the GILled python, but a data scientist would also have the GIL-less python available, set up a virtualenv for it, install numpy, and see all those benefits, without ever causing breaking changes for extension authors.

Dunno if that'll work in practice, but it seems like the best possible plan going forward to get the change without causing strife.


> If I know multi-core performance is going to be an issue up front, I would pick another language.

This is why I have shifted most of my code (I work in data warehousing) to Go. Python is handy for little scripts and data mining.


Ideally Golang.

No other language really does it right like Go - unifying event based and traditional multi-threading paradigms in a way that transparently utilizes all the cores on your system, while allowing you to write plain old iterative, blocking code.

Go may be less than ideal in many other regards (i.e. the rest of the language), but it gets this right.


> No other language really does it right like Go

Not trying to start a war but... Erlang, Haskell, F#, and a few others abstracted Evented IO and parallelism before Go even existed.


Even good old c# with async/await does async I/O pretty well.


C# with async/await is a different approach: it's a kind of cooperative multitasking where "yield points" are explicit. Go uses a kind of preemptive multitasking, in which goroutines don't have to explicitly "yield" control to the scheduler.


I should have added mainstream, non-functional, compiled language.


If Erlang breaks, there is a 50% chance you won't be able to see cat pictures on your smartphone anymore. It might not be cool with the startup crowd or the Google-following one, but it is an industrially used language in many products (RabbitMQ, CouchDB, Riak, Ejabberd, WhatsApp messaging, payment systems in Europe, and others).

It is cool that you like Go; it is a nice language. But before you claim "it is the only language that does X", you should learn about other languages as well.


I know about Erlang and what it can do, and I maintain Go is the only mainstream, non-functional language that does abstraction of event based/threaded programming well, or at all. I'm not even sure Rust does...the best I can find is it was planned to, but it didn't make it in.

And I don't consider any functional language as a viable choice for the vast majority of mainstream users.


Outside of HackerNews and Google I wouldn't consider Go to be very mainstream at all.


Moving the goalposts a bit there, aren't you?


No, I think they are common assumptions.


They obviously aren't - this whole thread is about Python, which is not a compiled language. So clearly nobody moving from Python to Go already had a requirement for it to be compiled.


I was talking about the points "mainstream, non-functional" which excluded all the previous counter-examples. I only added "compiled" as it is a notable feature of Go that may be seen as beneficial for performance and other reasons.


Rust?

I would never consider Go mainstream myself though. Erlang is practically more mainstream.


Rust offers almost everything that Go does, but with a (much) nicer compiler, tooling that makes me drool, memory safety, and all with zero-cost abstractions that keep Rust as fast or faster than the naive C++ equivalent.

So even though I work in a majority Go company, and I feel like Go is quite mainstream, I firmly believe that Rust is the future!

Btw, sorry if this Rust enthusiasm comes across as a bit over the top, I'm quite tired. I just really like it is all.


Rust is a great and powerful language, probably the best alternative to C++ nowadays, but I think you're overselling it.

Rust definitely does things Go doesn't, but Go also offers things Rust doesn't: builtin concurrency and parallelism with goroutines and channels, garbage collection (which is useful when your app can tolerate its moderate overhead), very fast compilation, and great tooling.


Rust has builtin concurrency and parallelism in the standard library via threads, channels, and more, it just doesn't have green threads in the stdlib. As well, Rust also comes bundled with great tooling.


> it just doesn't have green threads in the stdlib

Green Threads is one of the most important features in golang - to pretend that "Rust offers almost everything that Go does", but ignore the number 1 feature is dishonest.


I'm not the one who claimed that "Rust offers almost everything that Go does", but to say that concurrency and tooling aren't things that Rust excels at is equally dishonest. :P


Rust may well be, but it is not out yet?

Erlang is mainstream, but being functional it's probably not open to consideration for 99% of developers.


Rust, version 1.0.0, was released in May 2015, so over a year ago. 1.9 came out 10 days ago. Is that not out enough for you?


I worry about the bus factor of rust. I was at the fosdem rust meet up / devroom and it was 95% Mozilla folk


Less than half of the people listed at https://www.rust-lang.org/team.html are Mozilla employees, and the rust-lang/rust repo has over 1,400 contributors. I wouldn't worry too much about all of Mozilla getting hit by a bus. :)


There might be a bias in that metric: Mozilla employees are probably more likely to afford the trip to FOSDEM.


@smegel, yes it is.


I have no idea. Is the language spec fully implemented by a robust and stable compiler/runtime?

If so, I'll use it today!


Yes, the compiler has fully implemented everything the language is intended to do since the 1.0 release a bit over a year ago. (Work focuses on the compiler and standard library itself; there isn't a separate spec doc that is updated to do things that aren't yet implemented or being implemented, so it's more like Python or GHC than R6RS or C++.) The compiler and the language are both stable; there's been no "runtime" in a strict sense since a few months after 1.0, but there's a standard library which is usually statically linked in, and both the language and the standard library are covered by a very strong stability guarantee (more-or-less, code that works today and isn't deliberately doing something stupid will work indefinitely, as there are no plans for a Rust 2.0). Plenty of people are using it in production. So I'd go with "Yes."


I had a personal conversation with Larry Hastings (the presenter) at PyCon. Here are a couple of notes from that chat, phrased neutrally. Some of these points may be reïterated in the linked video:

- We can view this work as a revisiting of Greg Stein's GIL-removal attempt in Python 1.4:

http://dabeaz.blogspot.com/2011/08/inside-look-at-gil-remova...

It seems wholly reasonable to revisit the approach in light of how the language and ecosystem have changed since 1999.

There are demands made of CPython core developers to remove or address the problem of the GIL, and these efforts demonstrate how much work is necessary to do that successfully.

- Comparing single-threaded performance in a GIL implementation against single-threaded performance in a GIL-less implementation is considered an unfair comparison. A GIL-less implementation will do extra book-keeping that will necessarily result in slower single-threaded performance.


> - Comparing single-threaded performance in a GIL implementation against single-threaded performance in a GIL-less implementation is considered an unfair comparison. A GIL-less implementation will do extra book-keeping that will necessarily result in slower single-threaded performance.

Unless you can choose between having the GIL or not, I think it's perfectly reasonable to compare the performance. If you can choose, then I think it's still useful to know the kind of overhead you're adding.


Is there anything inherent to the language that makes this global lock thing difficult to be without in Python, but not in other languages?


It was ground in to CPython at an early date and every Python C extension ends up with it ground in, too, because it's in the C API. Some things are very difficult to retrofit on to an existing system. [1]

If you create a new Python implementation from scratch it doesn't need to be there. It is not intrinsic to Python; it is intrinsic to the specific CPython implementation.

It is also, broadly speaking, easy to remove the GIL. However the bar the CPython developers have set is that they will not accept a GIL-removal patch that harms single-threaded performance. This is the bar that has proved difficult to hurdle.

It is also fair to point out that while it's a perennial topic in the Python community, it is not the only scripting language implementation with a GIL or moral equivalent in it.

[1]: For similar reasons, none of the core implementations of the 1990s-era dynamic scripting languages have a very good true multi-threading story. (Some of the alternates can do it, like Jython.) They don't all even have implementations, and last I knew, none of them have anything that you ought to use in production. I don't think this is a fundamental limitation of any of the languages in question, it's just that it's really hard to retrofit threading onto a non-threaded code base after ~ten very heavy years of development. There's a set of other things that are hard to retrofit in to an existing code base if they don't start there from day one, like "unit testability" or "proper string handling".


This is not true - it is intrinsic to Python, not just CPython. Part of it is the garbage collector: reference counting is expensive to do without the GIL - but that can be changed; IronPython, Jython, and some translations of PyPy all attest to that. The other part is that C methods, which include all operations on Python collections like lists and dicts, are atomic, making those collections safe to share and modify across threads. Most Python code in the wild that utilizes threads makes use of those guarantees. So you either use fine-grained locking, or lock-free data structures, or STM, etc. to protect ALL collection instances, or you've basically made a Python 3000 that breaks compatibility with all existing Python code. So every Python implementation that removes the GIL goes the first route of pessimistically making everything threadsafe - which was tried in Java way back when and was, and is, a miserable failure. You give up too much performance for it to be worthwhile.

So the problem in Python will never be fixed in my opinion - the problem is Python and there's no saving it. Without breaking backwards compatibility it will always be better to use processes instead of threads and keep the GIL.


If you carefully read what I said, I think you'll find there's less disagreement between what you posted and what I said than your tone suggests you think. I did, for instance, mention C methods, but you seem to imply that that matters to all Python implementations. It only matters to CPython. Barring what last I knew was some fairly experimental stuff from PyPy, Python extensions are not generally compatible between the implementations. (Where you see libraries that appear to work on multiple implementations, AIUI you're actually seeing multiple implementations of the binding code.)

Personally my feeling is that if you care enough about performance to be threading in Python, you care enough about performance to not be using Python anyhow. It's a nice language, it's by far my favorite of the scripting languages of its family, but even with all the PyPy JIT and other magic tech words you can throw at it, it's still a slow language. When that doesn't matter, great; when it does, you eventually reach the point where you're stuck. (I've never been overwhelmed by the "implement it in C then" argument, for many reasons, and nowadays to an increasing degree if you're going to implement your "core" in another language for speed, there's a language you can choose that is both nice to use and already fast for your task, and that's only going to get more true as time goes on.)


The C methods do unfortunately matter to all implementations. They gave thread safety and almost all Python code in the wild relies on that property, so it's become an undocumented property of the Python language specification and an intrinsic flaw of the language itself.


Reference counting and the C API are properties of the CPython implementation, not Python the language. Other Python implementations do NOT use reference counting and don't offer the C API, but do make guarantees about atomic operations. I think the problem with backward compatibility is with wanting to preserve the C API and the existing extension modules implemented in C, not necessarily with Python the language.


I should have been clearer that the other Pythons don't use reference counting. Preserving the C API can be done; see Ironclad[1], which allows using C extensions in IronPython.

But assuming threadsafe collections is still a problem - that will never match single-threaded GIL Python running in multiple processes, unless some efficient way can be found to optimistically make the collections not threadsafe and only pay the synchronization costs when they're really shared between threads. I think that's a solvable problem, but not with the STM approach PyPy took. Still, performance will likely lag the multi-process solution, which doesn't have that cost.

I would rather see a Python that introduces a way to allocate PyObjects in shared memory and with no thread safety. Since it's a new system, that can be done without breaking compatibility. Then sharing between processes could be as easy as sharing between threads and just as efficient. The GIL could stay in that case and the C API could be left unchanged.

[1] https://github.com/IronLanguages/ironclad


> all operations on Python collections like lists and dicts, are atomic, making those collections safe to share and modify across threads. Most python code in the wild that utilizes threads makes use of those guarantees. So you either use fine-grained locking, or lock-free data structures, or STM, etc to protect ALL collection instances, or you've basically made a Python 3000 that breaks compatibility with all existing Python code.

Do IronPython and Jython, Python implementations which do not have a GIL, use fine-grained locking? How is multi-threaded performance on these implementations? Surely not as bad as this talk implies CPython with fine-grained locking would be?


IronPython uses a combination of lock-free techniques and fine-grained locking. Reference counting is of course replaced with the .NET runtime GC. Single-threaded and multi-threaded performance suffers since this approach means pessimistically synchronizing all accesses, including those that are single-threaded.


You should watch the video. I won't try to repeat it all here, but Hastings identifies the performance problem areas and lays out some plausible directions for tackling it, and a plausible implementation/changeover path that doesn't create another big breaking change.


Nothing inherent, no. Hastings discusses this in the talk: the short version is that having a GIL gets python a fairly simple, straightforward implementation (also for extension authors); not having it requires a lot of work to achieve comparable performance. This is why so many interpreted languages use a GIL (like Ruby). No language requires it, it just makes the language much easier to develop and extend. Implementations like Jython don't have one because the hard work is already done in the underlying VM.

It only took him about a week to remove the GIL (and the talk includes the steps); all this work is making a GIL-less python work as well as one with a GIL.


Not really. Other languages have a GIL or something GIL-like.

The backstory of Python's GIL is a tradeoff made nearly a quarter-century ago. Python came from the Unix world, where forking processes was the common way of having multiple lines of execution, but in the 90s threading was starting to take off as an alternative (insert standard anecdote about Java adopting threading since it had to run on set-top TV boxes that couldn't handle multi-process concurrency).

So Python needed to gain support for threading, but also needed to not horribly break a lot of existing code -- particularly extensions written in C -- and not destroy performance. Which was an issue since Python's memory management and garbage collection internals were not thread-safe, and it would've been both a major slowdown and a major breaking change to just say "all right, we'll do this in the theoretically best possible way".

The trade-off solution was to introduce this global lock which needed to be acquired and released. The side effects of the lock are twofold:

1. Only a single thread can be executing at a time within a given Python interpreter process, and

2. The bookkeeping around the lock introduces overhead. This overhead is negligible in some cases but ruinous in others.

The tradeoff seemed reasonable at the time, though. For one thing, hardware capable of exposing the single-thread limit as a relevant limit wasn't exactly common in 1992. For another, the overhead of the GIL mainly bites you in CPU-bound workloads, and at the time people wanted threading to write I/O-bound applications like networking daemons.

Here in 2016, of course, everybody has a multi-core computer in their pocket and we see plenty of CPU-bound workloads that we'd like to parallelize. Which is why people complain about the GIL. But now it's even more of a breaking change to try to remove it, and still has performance implications.


Inherent? No. Neither Jython nor IronPython has a GIL.


> Is there anything inherent to the language that makes this global lock thing difficult to be without in Python, but not in other languages?

Both language and interpreter design. I've been wanting to give a talk, or at least write, about some of the internals of the interpreter and language and how to do it better next time.

In case there is interest in this, I might actually get around to doing that for once.

(I played around with a different form of GIL-less execution a few years back, based on independent interpreters and message passing with thread-bound objects, but I ran into so many issues with the interpreter :()


I would certainly be interested to learn about better language and interpreter design.

Do you think anyone will ever use the lessons learned about interpreters and write a "better next time" implementation of Python, or do you only see an improved runtime appearing alongside a significantly different language?


> In case there is interest in this, I might actually get around to doing that for once.

Yes, I'm very interested in your thoughts on this, particularly wrt interpreter design.


The biggest problem is the use of reference counting.

Consider this simple statement:

    tmp = p.f; o.f = tmp;

In a garbage-collected language that supports concurrency and has no GIL, this is a wait-free operation with no locks. You load p.f. You store o.f. Maybe there is a GC barrier, but that fast-paths to a load-branch combo 99% of the time. It's all easy and fast.

In a reference-counted language that supports concurrency and has no GIL, you're in a world of hurt.

First you have to make sure that you atomically increment the reference count on 'tmp'. This is guaranteed to have to store to the memory that 'tmp' points to. Note that in the GC world, no such store happened. Already here you have a problem: you're storing to memory much more often. Stores are the things that parallel CPUs hate, because the hardware now has to do work to propagate the value you stored to the other CPUs.

But wait, there's more. You also have to atomically decrement the reference count of the object that 'o.f' used to point to. That's another store that will have to be propagated. Now your algorithm for "o.f = tmp" looks like this:

1. Atomically increment the reference count of 'tmp'.

2. Load the old value of o.f and atomically decrement the reference count. If it drops to zero, delete the object.

3. Store 'tmp' into 'o.f'.

But wait, you're not done yet. You also have to handle the dec race on 'o.f'. If two threads simultaneously store to 'o.f', you have a race between steps 2 and 3. You'd think that you can just use an atomic swap in (2), but you'd be wrong: you don't want two threads to both load the same old object from 'o.f' (let's call that one 'oldfart') and dec it. Here's an example:

Thread #1: o.f = tmp;

Thread #2: o.f = thingy;

If the interleaving is Thread #1 executes step (2) then Thread #2 executes step (2) then Thread #1 executes step (3) then Thread #2 executes step (3), then you will dec 'oldfart' twice, inc 'tmp' once, and inc 'thingy' once. This means that you've over-released 'oldfart' and over-retained 'tmp'. Bad news!

So, you have to put a lock around steps (2) and (3). Basically every field in memory has to have a lock around it. That sucks!

It's kind of funny to me that people who are stuck with reference counting say things like "yeah, garbage collection can be made to be as fast as reference counting". That makes me lol. Garbage collection is so much faster than reference counting, particularly in concurrency scenarios, because:

In a GC: "o.f = p.f" is a load and a store, with maybe an extra load-branch if you have a barrier. That's it. There's no locks! There's no atomics! Notably, the only thing being stored to is "o.f", just as the user intended.

In reference counting: "o.f = p.f" means atomic ops on 'tmp', a lock around "o.f", and atomic ops on 'oldfart'. The lock will require one or two atomic ops. There will be atomic stores to the cache line of "o.f", the cache line of "tmp", and the cache line of "oldfart". What a mess!

TL;DR. Garbage collection is a slam dunk of overwhelming awesomeness.
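
To make steps (1)-(3) concrete, the write barrier ends up looking something like this (a sketch in C11 style, not anyone's actual implementation; field_lock and free_obj are hypothetical):

    /* Sketch of "o->f = tmp" under GIL-less reference counting.
       field_lock keeps steps (2) and (3) from interleaving, which
       is the 'oldfart' double-dec race described above. */
    #include <stdatomic.h>
    #include <threads.h>

    typedef struct Obj Obj;
    struct Obj {
        atomic_long refcount;
        _Atomic(Obj *) f;
    };

    extern void free_obj(Obj *);   /* hypothetical destructor */

    static void store_field(Obj *o, Obj *tmp, mtx_t *field_lock)
    {
        Obj *oldfart;

        atomic_fetch_add(&tmp->refcount, 1);    /* step 1: inc 'tmp' */

        mtx_lock(field_lock);                   /* steps 2+3, atomically */
        oldfart = atomic_load(&o->f);
        atomic_store(&o->f, tmp);
        mtx_unlock(field_lock);

        /* Safe outside the lock: we own the reference we removed. */
        if (oldfart && atomic_fetch_sub(&oldfart->refcount, 1) == 1)
            free_obj(oldfart);

        /* Under a tracing GC this whole function is "o->f = tmp;"
           plus, at most, a cheap write barrier. */
    }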


It seems to me that the option of going with a tracing garbage collector is preferable. Removing the GIL will require a lot of changes anyway, and it may be better to go all the way so as to preserve performance. It would affect the C API, and extension modules would have to be reworked for this, but on the other hand you avoid all the issues with reference counting. If extension modules rely on code generation, such as Cython, a move to a radically different C API might be less painful than expected.

From what I know, most current managed languages are using GC, not reference counting, so perhaps it's an inherently better approach.


Most of the Python implementations other than CPython use GC rather than reference counting. PyPy does. Iron Python did.[1]

There is a version of PyPy without a GIL[2], but it runs much slower on ordinary code and is still under development. The developers are looking for financial support.[3] The approach is to identify large blocks of code as transactions, and run them in parallel. If they try to access the same data, one transaction fails and is backed out. It's like database rollback.

But you have to write your code like this:

    from transaction import TransactionQueue

    tr = TransactionQueue()
    for key, value in bigdict.items():
        tr.add(func, key, value)
    tr.run()

[1] http://doc.pypy.org/en/latest/cpython_differences.html

[2] http://doc.pypy.org/en/latest/stm.html

[3] http://pypy.org/tmdonate2.html


Semi related to this, is it possible to run multiple CPython interpreters in the same process, but with restrictions on shared memory? The idea being that each interpreter would still have its own GIL, each interpreter would have restrictions on shared memory (like sharing immutable structures only through message passing). Note, I'm not a big Python user so if this already exists, has been discussed, etc I am not aware.


I'm not deeply familiar with the CPython API, but that should be completely straightforward as long as you don't trade native CPython structures between interpreters. If you start with something like Numpy's array interface, and implement an array in C that knows that it could be accessed by multiple Python interpreters and needs to become immutable before it's passed, it should work. But I wouldn't expect it to be easy to take a regular PyObject and move it between interpreters, so the cost of serialization and deserialization might be an issue.


Sort of.

Note that Python has support for shared memory:

https://docs.python.org/2/library/multiprocessing.html#shari...

In fact, `numpy` has its own mechanisms to support shared memory between processes:

https://bitbucket.org/cleemesser/numpy-sharedmem

Neither of these approaches seems to be used very commonly in practice.

Python itself has some "sub-interpreter" support. There was a long conversation about this last year:

https://mail.python.org/pipermail/python-ideas/2015-June/034...
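
For reference, the stock sub-interpreter API looks roughly like this from C (a minimal sketch; sub-interpreters get separate module state but still share one GIL and one heap, so this buys isolation, not parallelism):

    #include <Python.h>

    int main(void)
    {
        PyThreadState *main_ts, *sub_ts;

        Py_Initialize();
        main_ts = PyThreadState_Get();

        sub_ts = Py_NewInterpreter();   /* fresh interpreter, same process */
        PyRun_SimpleString("import sys; print(len(sys.modules))");
        Py_EndInterpreter(sub_ts);      /* tear it back down */

        PyThreadState_Swap(main_ts);    /* back to the main interpreter */
        Py_Finalize();
        return 0;
    }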

Finally, I have a working approach using `dlmopen` to host multiple interpreters within the same process:

https://gist.github.com/dutc/eba9b2f7980f400f6287

- the approach is so bizarre because it's a very naïve multiple-embedding. It was intended to prove that you could run a Python 2 and a Python 3 together in the same process as part of a dare. This was thought impossible, since there are symbols with non-unique names that the dynamic linker would be unable to distinguish (which led me to the `RTLD_DEEPBIND` flag for `dlopen`), and there is global state in a Python interpreter that interacts in undesirable ways (which led me to `dlmopen` and linker namespaces).

- this approach is stronger than the traditional subinterpreter approach, since I can host multiple interpreters of distinct versions. i.e., I can host a Python 1.5 inside a Python 2.7 inside a Python 3.5.

- the approach is stronger in that I completely isolate C libraries. There's a good amount of functionality provided by C libraries that maintain global state. e.g., `locale.setlocale` is a wrapping of C stdlib locale and is globally scoped.

- this approach is weaker in that it requires a dynamic linker that supports linker namespaces, which effectively limits its use on Windows

- this approach is weaker in that it's not complete: there's insufficient interest in this approach for me to actually write the shims to allow communication between processes.

- this approach is weaker in that it has some weird restrictions such as being able to spawn only 15 sub-interpreters before running out of thread-local storage space

I suppose the premise is that the GIL-removal efforts involve pessimistic coördination. A sub-interpreter approach might have a lighter touch and allow the user to handle coördination between processes (perhaps even requiring/allowing them to handle locks themselves.)


I followed the sub-interpreter thread with great interest, but after the implementation went in I haven't seen anyone build the kind of multiprocess tooling that it was designed to enable. Have you heard of anything?


For the performance improvements, I think I'm missing something, but why does coalesced reference counting have a high overhead?

Is it normally implemented along with buffered reference counting? It feels like those fit together very neatly, one thread managing the counts and receiving updates from the other threads, and each other thread tries to only send updates it needs to.

Is it simply a case of doing something basic a lot of times is faster than doing something smarter a few times because computers are just really fast at basic things? Or is there something more to this?
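
To be clear about what I mean, here's roughly the scheme I have in mind (a sketch of the general technique only; flush_log_to_collector and adjust_refcount are hypothetical):

    /* Sketch of buffered reference counting: mutator threads append
       +1/-1 deltas to a thread-local log instead of touching the
       shared refcount word; one collector thread owns the real
       counts and applies the logs, so it needs no atomics.
       Coalescing would additionally cancel matching +1/-1 pairs
       before the flush. */
    enum { LOG_CAP = 4096 };

    struct rc_log {
        void *obj[LOG_CAP];
        int   delta[LOG_CAP];    /* +1 or -1 */
        int   len;
    };

    extern void flush_log_to_collector(struct rc_log *);  /* hands off, resets len */
    extern void adjust_refcount(void *obj, int delta);

    static _Thread_local struct rc_log local_log;

    static void log_rc(void *obj, int delta)
    {
        if (local_log.len == LOG_CAP)
            flush_log_to_collector(&local_log);
        local_log.obj[local_log.len] = obj;
        local_log.delta[local_log.len] = delta;
        local_log.len++;
    }

    /* Collector thread, single-threaded over the real counts: */
    static void collector_apply(const struct rc_log *log)
    {
        int i;
        for (i = 0; i < log->len; i++)
            adjust_refcount(log->obj[i], log->delta[i]);
    }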


Linux's pthread locks are already implemented on top of futex and have a fast user space path for the non-contended case.


I'm just glad that Guido hasn't gone on record saying there will never be an official python without a GIL.


Guido's on record as having three "political" requirements for any GIL-less Python:

1. Same or better performance

2. No breaking existing extensions

3. Not overly-complicating the cpython implementation

These are tough requirements, but obviously sensible, and Hastings' discussion of trying to meet them is interesting.



