As far as I understand, it just changes the way thread scheduling works, but doesn't make Python "properly multithreaded". That means it's still only one active non-native-extension thread running at any time. Could someone confirm it?
Edit: I guess janzer confirmed this by posting at the same time.
No, Python doesn't work like that even with the traditional GIL. The problem is that even when you have multiple OS threads, they all end up competing for the same lock which kills throughput (but not as badly as the fully-serialized scenario you described). By using a scheduler the locking order can at least be controlled a bit more to improve throughput. [As far as I know; been a few years since I dug through CPython.]
What do you mean by "not as badly as the fully-serialized scenario"? I thought that Python threads are fully serialized, apart from extension code, which can spawn its own threads and release the GIL during operations that don't touch Python memory (mainly IO). The interpreter still switches between Python threads using the GIL, but the Python code itself never runs in parallel.
Are we talking about the same thing, or is there some other non-serialized scenario?
I'm pretty sure that there's a good chunk of stuff you can do in Python without acquiring the GIL -- the problem is that in practice you end up doing a lot of I/O and stuff that requires at least momentarily acquiring the GIL, leading to contention. So if you stuck to the operations that didn't require locking any GIL-protected data, you could run at full throughput. It's at least not the case that the GIL is held all the time while running a Python thread -- the problem is instead that your threads end up having to acquire it often.
Actually, the GIL is needed to execute Python code (well, access Python objects). It is released by I/O- or computation-heavy C code, so e.g. SciPy or reading files allows some level of parallelism, but pure-Python code will be serial.
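A rough way to check the "pure-Python code will be serial" part yourself. This is only a sketch: count_down, time_sequential and time_threaded are names I made up for illustration, and the numbers depend on your machine and interpreter build.

```python
import threading
import time


def count_down(n):
    # Pure-Python CPU-bound loop; it never calls into code that releases the GIL.
    while n > 0:
        n -= 1


def time_sequential(n, workers):
    # Run the same loop `workers` times, one after another, on the main thread.
    start = time.perf_counter()
    for _ in range(workers):
        count_down(n)
    return time.perf_counter() - start


def time_threaded(n, workers):
    # Run the same loop in `workers` OS threads at once.
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(workers)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n, workers = 5_000_000, 2
    print("sequential:", time_sequential(n, workers))
    print("threaded:  ", time_threaded(n, workers))
```

On a standard GIL build of CPython I'd expect the two timings to be in the same ballpark, since only one thread can execute bytecode at a time. A build without the GIL would behave differently.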
Pure Python code can do a lot. Large parts of the stdlib are extension modules, and can release the GIL (e.g. one of the tests on the bug tracker used time.sleep()). In practice, if your code is IO-bound, it's doing work in an extension module, and if it's CPU-bound, it should be doing its work in an extension module. So there's actually not a huge problem.
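For the time.sleep() case, a minimal sketch (timed_parallel_sleep is a hypothetical helper name, not something from the bug tracker test): several threads each sleep for the same duration, and because time.sleep() releases the GIL while waiting, the total wall time should be close to one sleep rather than the sum of all of them.

```python
import threading
import time


def timed_parallel_sleep(n_threads=4, seconds=0.2):
    # Each thread just sleeps; time.sleep() releases the GIL while it waits.
    threads = [
        threading.Thread(target=time.sleep, args=(seconds,))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Wall-clock time for all threads to finish.
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = timed_parallel_sleep(4, 0.2)
    print(f"4 threads x 0.2s sleep took {elapsed:.2f}s")
```

If the sleeps were serialized you would see roughly n_threads * seconds instead.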
fork() isn't that great for a lot of situations. If you are thinking of taking advantage of your operating system's copy-on-write paging by loading a large chunk of data to be used read-only, forking a bunch of processes, processing the data in each process, and finally 'reducing' the results of all of the forks into some sort of output, don't bother.
What happens is that when you read an object in one process, Python increments its reference count, which writes to the memory page holding the object, which forces the OS to copy that page, thus screwing you.
(However, compacting garbage collection turns out to have more or less the same problem, since moving objects also writes to the pages they live on.)
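To see the reference-count effect in isolation (no fork involved), here is a small CPython-specific sketch. refcount_delta_on_read is a made-up name, and sys.getrefcount is only a rough probe, so treat the exact numbers as illustrative.

```python
import sys


def refcount_delta_on_read(items):
    # Reference counts before anything else refers to the objects.
    before = [sys.getrefcount(x) for x in items]
    # "Just reading" the data: collecting the objects into another list
    # stores one new reference to each, which writes to each object's header.
    seen = [x for x in items]
    after = [sys.getrefcount(x) for x in items]
    # Difference per object while `seen` is still alive.
    return [a - b for a, b in zip(after, before)]


if __name__ == "__main__":
    data = [object() for _ in range(3)]
    print(refcount_delta_on_read(data))
```

The count lives inside the object itself, so in a forked child even this kind of read-only access dirties the page and triggers a copy.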