Python 2.6 and 3.0 ship with a new library called multiprocessing which provides...

rbanffy · on Dec 4, 2008

I know. The library is available to 2.5 as "processing", but starting OS processes when all you wanted were threads is somewhat ugly. I agree it's not possible to do it properly with the GIL in place and attempts to remove it did have less than amazing results on single-thread applications.

That's annoying... Even more annoying when I think it's easier to do in Java ;-)

scott_s · on Dec 5, 2008

If you're running on Linux, then all threads are processes.

fauigerzigerk · on Dec 5, 2008

For some definitions of "are" maybe. Fact is that processes don't share address spaces and thus pointers, which means that all data structures have to be serialised in some shape or form before they can be accessed by multiple processes. This makes some use cases infeasible in Python and other GIL-based languages.
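
To make the distinction concrete, here is a minimal sketch. It does not start a real second process; as a stated simplification it mimics the process boundary with a pickle round-trip, which is roughly what has to happen to an object before another process can see it. The names `shared`, `bump` and `copy_for_other_process` are made up for illustration.

```python
import pickle
import threading

shared = {"count": 0}

def bump(d):
    """Mutate the dict in place."""
    d["count"] += 1

# A thread receives a reference to the very same dict object,
# so its mutation is visible to the caller.
t = threading.Thread(target=bump, args=(shared,))
t.start()
t.join()

# Assumption: stand in for a process boundary with serialise + deserialise.
# The other side gets an independent copy, not the original object.
copy_for_other_process = pickle.loads(pickle.dumps(shared))
bump(copy_for_other_process)

print(shared["count"])                  # 1 (only the thread touched it)
print(copy_for_other_process["count"])  # 2 (the copy diverged)
```

Any change made to the copy has to be serialised and sent back explicitly if the original side is supposed to see it.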

This is important and I'm concerned that so many advocates of Python (and Ruby and PHP) try to talk this huge issue away. I do use and love Python myself and I'd like to do more with it, so this is not pointless language war fighting.

scott_s · on Dec 5, 2008

Under Linux, the only difference between a thread and a process is whether the address space is shared. Both POSIX threading libraries for Linux (NPTL is the current one, LinuxThreads is the old one) do a clone() system call for every pthread_create(). The clone() system call is basically fork() with extra functionality (such as sharing the address space).

Inside the kernel, your "threads" and "processes" are both represented by the task_struct data structure. So, on Linux, threads and processes are the same thing in different flavors.
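
One way to observe this from user space, as a rough sketch: a thread reports the same process ID as its parent, but it has its own kernel-assigned thread ID. This assumes Python 3.8 or later for `threading.get_native_id()`; the `results` dict and the `describe` and `worker` helpers are hypothetical names used only here.

```python
import os
import threading

results = {}

def describe():
    """Return (process ID, kernel-assigned ID of the calling thread)."""
    return os.getpid(), threading.get_native_id()

def worker():
    results["worker"] = describe()

results["main"] = describe()

t = threading.Thread(target=worker)
t.start()
t.join()

# Same PID for both, but each thread has its own native ID.
print("main:  ", results["main"])
print("worker:", results["worker"])
```

On Linux the main thread's native ID normally equals the PID, which fits the picture of one kernel task per thread.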

I brought this up since the parent said "starting OS processes when all you wanted were threads is somewhat ugly." My point is that on Linux, starting a "process" and starting a "thread" are fundamentally the same action and have the same cost.

rbanffy · on Dec 5, 2008

They are not the same thing. Threads share address space and processes don't and that's a crucial difference.

scott_s · on Dec 5, 2008

I said same thing with different flavors. Literally, they are both represented by the same data structure in the kernel. On the salient point of the discussion - cost of forking a process versus cost of spawning a thread - the two are the same. They require different styles of programming, but I was responding to your point on cost.

If you ever need to reason about scheduling in the Linux kernel, then you need to understand this concept. In the eyes of the scheduler, they're all the same.

rbanffy · on Dec 6, 2008

I understand I was less than clear on my concern about the cost of starting a process versus starting a thread, but that was not all of my point. Besides the convenient shared memory space, threads also should take a little less time to start because they don't need a new memory context. In long running threads, this is not much of a problem but if you decide to start a lot of threads, that could make a difference. I don't know if the Linux kernel makes a big deal out of this but I am sure that, as programs get more threaded, the cost of starting a thread will approach the theoretical minimum thanks to our skillful kernel developer friends.
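
The start-up cost question can be measured rather than argued. The sketch below is one hedged way to do it on a POSIX system: it times starting and joining a no-op thread against forking and reaping a child that exits immediately. The function names and the iteration count are assumptions for illustration, and the numbers include Python interpreter overhead, so they only hint at what the kernel itself is doing.

```python
import os
import threading
import time

def avg_thread_start(n=200):
    """Average seconds to start and join one no-op thread."""
    start = time.perf_counter()
    for _ in range(n):
        t = threading.Thread(target=lambda: None)
        t.start()
        t.join()
    return (time.perf_counter() - start) / n

def avg_fork_start(n=200):
    """Average seconds to fork and reap one child that exits at once (POSIX only)."""
    start = time.perf_counter()
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)      # child: exit immediately, skipping interpreter cleanup
        os.waitpid(pid, 0)   # parent: reap the child
    return (time.perf_counter() - start) / n

if __name__ == "__main__":
    print(f"thread: {avg_thread_start() * 1e6:.0f} us per start")
    if hasattr(os, "fork"):
        print(f"fork:   {avg_fork_start() * 1e6:.0f} us per start")
```

Results will vary with kernel version, hardware and the size of the parent process, so treat any single run as a data point rather than a verdict.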

I am not familiar with current (less than 10 year-old) processor architectures, but, back when I was familiar with that stuff, switching context between threads of the same process was a lot less costly than switching between different processes because the memory context is the same and set-up and tear-down were somewhat costly operations. As processors start getting more cores, it also allows schedulers to keep threads of the same process on the same cores, reducing context-switch overhead (or to power down less used cores to conserve energy).

If I needed very good threading/multi-core support, I would probably go with Solaris on a 64-thread T2 SPARC thingie, not Linux on 6-thread x86. And that's a good reason for Sun to invest some money on helping convince Guido the GIL is bad and will become worse with time. Because Intel will probably catch up.

illume · on Dec 5, 2008

Not so seamless. Things need to be serialised to transfer them. You need to specially craft code so that it is pickle-safe... things don't just work automatically.

Also, serialising 100MB or so worth of objects and passing them around is really slow. So the process module is not a good fit for many use cases.
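
Both points can be illustrated with plain `pickle`, without starting any processes. This is only a sketch: `try_pickle` is a hypothetical helper, and the payload is a small stand-in of a couple of megabytes, not the 100MB mentioned above, so the timing it prints says nothing definitive about larger workloads.

```python
import pickle
import time

def try_pickle(obj):
    """Return the pickled bytes, or the exception if obj cannot be pickled."""
    try:
        return pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    except (pickle.PicklingError, AttributeError, TypeError) as exc:
        return exc

# Not pickle-safe: a lambda has no importable name, so pickling it fails.
lambda_result = try_pickle(lambda x: x + 1)
print("lambda:", type(lambda_result).__name__)

# Cost: everything sent to another process pays for a dump and a load.
payload = [float(i) for i in range(200_000)]
start = time.perf_counter()
blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
elapsed = time.perf_counter() - start

print(f"{len(blob) / 1e6:.1f} MB round-tripped in {elapsed * 1000:.1f} ms")
```

Scaling the payload up shows how quickly the serialisation step starts to dominate when large object graphs are shipped between processes.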