Interestingly enough, Plan 9 did away with threads and replaced fork() with rfork() (http://man.9front.org/2/fork).
Instead of having to deal with the duality of threads and processes, they opted to implement threading using lightweight processes that share the parent's data and bss segments. This is only turned on if you pass the RFMEM flag to rfork(); otherwise it behaves the same as Unix fork and gives the child a copy of those segments.
This simplifies the system design as you only deal with processes, which is arguably a more correct and simpler approach (the OS only has thirty-something syscalls). The API is implemented via a CSP library simply called thread. You replace main() with threadmain() and the thread API gives you all the concurrency tools. Or you can create your own threading library if you so choose.
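For a taste of the API, here's a minimal sketch of what a thread(2)-style program might look like (written from memory, so treat the details as approximate): threadmain() replaces main(), and channels carry values between threads in CSP style.

    #include <u.h>
    #include <libc.h>
    #include <thread.h>

    enum { NMSG = 3 };

    void
    counter(void *arg)
    {
        Channel *c = arg;
        int i;

        for(i = 0; i < NMSG; i++)
            sendul(c, i);        /* blocking send of a ulong */
        threadexits(nil);
    }

    void
    threadmain(int argc, char *argv[])
    {
        Channel *c = chancreate(sizeof(ulong), 0);  /* unbuffered channel */
        int i;

        /* cooperatively scheduled thread in this proc; proccreate()
           would rfork a real (RFMEM) process instead */
        threadcreate(counter, c, 8192);
        for(i = 0; i < NMSG; i++)
            print("%lud\n", recvul(c));
        threadexitsall(nil);
    }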
If you are familiar with Go's CSP concurrency then you can thank Plan 9's thread library, as Rob Pike, Ken Thompson and Russ Cox came from Bell Labs and built Plan 9. And those ideas were born in Alef and Newsqueak.
It's a cleaner design, but I don't think it addresses the core of the issue.
You can still have, say, a "multi-threaded program" with a global mutex (the threads are created with RFMEM), and then when you want to fork (without RFMEM) you can still be prone to deadlocks.
I guess the thread library has channels, and that doesn't happen? But as far as I know there are still mutexes in Go, and, I would presume, Plan 9.
Go also has problems forking since it has a multi-threaded runtime. It only exposes a combined ForkExec() (in the syscall package), not a standalone Fork().
FWIW, FreeBSD has rfork(2) as well. It’s not used for threading, though - FreeBSD has the usual, kernel-level threads instead of faking them on top of processes.
Unfortunately the current development state of Inferno is partially in Limbo (pun intended). This is why Plan 9 users like myself tend not to mention it. The code is maintained by Vita Nova and available as open source but sees little if any development. There have been various ports but nothing is maintained. There is no stable bare-metal image for i386 or ARM boards like the RPi. It also doesn't build on 64-bit platforms; you need a 32-bit chroot to build and 32-bit libs to run. It also fails to build on OpenBSD due to rthreads. I'd love to work on it but simply don't have the time. As for now, 9front boots and runs on modern hardware. I have a CPU server running on a low-power Celeron J1900 board. However, I do have Inferno running hosted on my Raspberry Pi. Unfortunately it's quite unstable and crashes frequently.
Does anyone have an example of wanting to fork? Most use cases seem covered by either threads or fork+exec process spawning.
The example given by the author:
> make it practical to introduce concurrency after development, rather than designing it in from the start
seems risky to me: even if you're safe from other threads, there is a lot of other unknown state, like fd's still being inherited. And concurrency can be introduced after development, e.g. by using message passing. So going fully threaded or re-exec'ing yourself both seem safer to me, with no real downsides.
One of the available garbage collector implementations in D (version 1) (and there's work in progress to port it to D2) is a concurrent garbage collector which performs the mark phase in a forked process going over the address space.
Also I've seen a project which persists large amounts of data to disk by forking and using a consistent snapshot of the data structures (which the new process will not mutate, so there's no need for locking).
> One of the available garbage collector implementations in D (version 1) (and there's work in progress to port it to D2) is a concurrent garbage collector which performs the mark phase in a forked process going over the address space.
I'm pretty skeptical of that approach. If there's heavy churn, the copy-on-write behavior means that the mark phase _increases_ memory use, when presumably you're doing GC because you need to _decrease_ memory use. And I assume there's some cost to these VM manipulations in general. Is this implementation competitive with other D garbage collectors in terms of memory usage, CPU time, latency, etc.? Are D's garbage collectors in general competitive with those of Go and Java?
> Also I've seen a project which persists large amounts of data to disk by forking and using a consistent snapshot of the data structures (which the new process will not mutate, so there's no need for locking).
It isn't an uncommon strategy. For example, the Factorio dedicated server can (optionally) do that if you're running it on non-Windows. The performance is great, but it's marked as experimental, probably because of issues like this article describes.
I used to work on a system that processed lots of batch jobs. The jobs themselves were built to run inside of a software framework because they needed to be customized for different purposes but still had lots of common needs.
So we had a process that would start up, initialize the framework, then look for jobs to run. For each job, it would fork, giving the job an initialized framework but also giving jobs some isolation.
Android does something similar:
"Each app process is forked from an existing process called Zygote. The Zygote process starts when the system boots and loads common framework code and resources (such as activity themes). To start a new app process, the system forks the Zygote process then loads and runs the app's code in the new process. This approach allows most of the RAM pages allocated for framework code and resources to be shared across all app processes."
fork() is not required for this. On Apple platforms the dyld shared cache performs the same function: all processes share the same copy of libSystem, Foundation, etc.
Most platforms with dynamically-loaded shared libraries do something similar. However, that only avoids having multiple copies of the library code and read-only data in memory—each process still has to run its initialization code separately, parsing configuration files, setting up global variables, etc., and any data which is written during initialization but read-only for the rest of the program is not shared. The Zygote model allows the system to perform this initialization just once and share the results with any number of child processes.
I believe the Apache web server's Prefork MPM module does something similar—fork() allows the server to spin up new child processes in a pre-initialized state, ready to respond to requests. Servers in general tend to be a good fit for this style of execution.
So redesign your software so that the results of initialization (code having run, configuration files having been parsed, global variables having been set up, and so forth) can be stored in a loadable library (or in some other mmappable file).
Emacs works this way, because it has a lot of Lisp code to load in and compile and doesn't have a central daemon (and even if it did, it would want to avoid recompiling the code when the computer is rebooted.) The compiled C binary "temacs" is instructed to load and evaluate a bunch of Lisp and then more-or-less coredump itself in an orderly way to a binary called "emacs"; when that binary is run, everything is in memory as if you had forked temacs after initialization. https://www.gnu.org/software/emacs/manual/html_node/elisp/Bu...
For your average server process reading config files, it probably suffices to have a "compiled" form of the config files in Cap'n Proto or some similar format that is directly usable as in-memory structures, and mmap that in. (Or perhaps just actual native data structures, if the format doesn't need to have a stable representation across versions of your software.)
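A sketch of what that might look like with a plain mmap'd native struct (the layout and file name are invented for illustration; a real format would want a magic number and version check before trusting the bytes):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* hypothetical "compiled config" layout */
    struct config {
        int  nworkers;
        int  listen_port;
        char datadir[256];
    };

    /* Map the pre-parsed config directly into memory: no parsing at
       startup, and the page cache shares the read-only pages across
       every process that maps the same file. */
    static const struct config *
    load_config(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if(fd < 0)
            return NULL;
        const struct config *cfg =
            mmap(NULL, sizeof *cfg, PROT_READ, MAP_SHARED, fd, 0);
        close(fd);    /* the mapping stays valid after close */
        return cfg == MAP_FAILED ? NULL : cfg;
    }

    int
    main(void)
    {
        const struct config *cfg = load_config("config.bin");
        if(cfg)
            printf("%d workers, port %d\n", cfg->nworkers, cfg->listen_port);
        return 0;
    }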
I run an online game that has instanced multiplayer.
We have a single process on each server that starts up and loads every single game resource. After this it waits for requests from the instance manager to create new instances. As new instances are required, this "spawner" process fork()s to create one.
Each process uses less memory because all the loaded resources the game needs to run are shared between all the copies.
You also get slightly faster performance because commonly used data is still in the CPU cache when switching between processes.
If we used a thread per instance, then we wouldn't have crash isolation between the different game instances, which is handy.
It's also possible to explicitly mmap a shared memory segment and load resources into that without using fork()'s copy-on-write, but it's just more tricky, and the fork() method is totally robust no matter what code other programmers might write.
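Roughly, the explicit-shared-segment alternative looks like this (sizes and data invented): map a MAP_SHARED region up front, load resources into it, and every forked instance sees the same single copy, with no CoW duplication when something writes.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        size_t len = 64 * 1024 * 1024;    /* pretend this holds game resources */
        char *resources = mmap(NULL, len, PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if(resources == MAP_FAILED)
            return 1;

        strcpy(resources, "loaded resource data");    /* "load" resources once */

        for(int i = 0; i < 4; i++){
            if(fork() == 0){    /* child: one game instance */
                printf("instance %d sees: %s\n", i, resources);
                _exit(0);
            }
        }
        while(wait(NULL) > 0)
            ;
        return 0;
    }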
A classic example is daemonizing a process. While still attached to the terminal you first parse command-line options, read config files, and do some sanity checks. Any errors caught at this point cause an error message and exit(1). Otherwise, fork+setsid to detach from the session, then fork again so you are no longer the session leader.
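As a sketch (error handling and signal details a real daemon would need are omitted), the classic double-fork sequence looks like:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void
    daemonize(void)
    {
        if(fork() > 0)
            exit(0);       /* parent returns to the shell */
        setsid();          /* become session leader, detach from the tty */
        if(fork() > 0)
            exit(0);       /* drop session leadership so we can never
                              reacquire a controlling terminal */
        chdir("/");
        umask(0);
        /* point stdin/stdout/stderr at /dev/null */
        int fd = open("/dev/null", O_RDWR);
        dup2(fd, 0);
        dup2(fd, 1);
        dup2(fd, 2);
        if(fd > 2)
            close(fd);
    }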
Apart from that, the only thing I have used fork for (apart from fork+exec) is doing concurrent blocking I/O while prototyping, before rewriting the code to use non-blocking I/O with no forking.
Copy-on-write semantics can be pretty useful. You've built up a large in-memory graph and want to execute a destructive or long running query on the data set in the background (e.g., write it out to media when the main process MUST continue to service requests), you want to generate a snapshot core without exiting, etc. etc.
I don't classify these use cases as concurrency-after-development cases. Lots of systems use this to do interesting things.
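A toy version of the snapshot trick, just to make the mechanism concrete: after fork(), the child keeps seeing the data as it was at that moment, no matter what the parent does (this is essentially how Redis's background save works, too).

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        char *data = malloc(100);
        strcpy(data, "state at snapshot time");

        pid_t pid = fork();
        if(pid == 0){
            sleep(1);    /* even after the parent has moved on... */
            /* ...the child's copy-on-write view is unchanged */
            printf("child snapshot: %s\n", data);
            _exit(0);
        }
        strcpy(data, "state keeps changing");    /* parent mutates freely */
        printf("parent now: %s\n", data);
        wait(NULL);
        return 0;
    }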
> The example given by the author seems risky to me
Ages ago I made a DNS server, and near the top of the program was a simple:
for(i=0;i<ncpus;++i)if(!fork())break;
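Unpacked, that one-liner does this: fork() returns 0 in the child, so each child breaks out of the loop immediately while the parent keeps forking. Afterwards the parent plus ncpus children all fall through into the same server code, sharing any already-bound sockets. A sketch:

    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        /* _SC_NPROCESSORS_ONLN is a widely supported extension */
        long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
        int i;

        for(i = 0; i < ncpus; ++i)
            if(!fork())
                break;    /* child: stop spawning, go serve */

        printf("pid %d serving (worker %d)\n", getpid(), i);
        /* ... request loop would go here ... */
        return 0;
    }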
> Most use cases seem covered by either threads or fork+exec process spawning.
Most of the use cases of threads are covered by fork+explicit mmap/vm_remap.
A "shared-nothing" will also contain less code, and having no need for lock/unlock keeps a lot of spam invalidations away from the memory controller making it for many tasks faster.
> Even if you're safe from other threads, there is a lot of other unknown state like fd's still inherited.
I'm open to arguments, but right now I feel that threads are (for many use cases) a poor man's fork. Like threading, forking gives you concurrency but unlike threading it also gives you isolation. There are plenty of IPC mechanisms available to communicate things back and forth when you need to.
Right, and forks are poor man's processes. All of them have pros and cons but I think forks are only useful for very exotic use cases.
> unlike threading it also gives you isolation.
If you want isolation why not launch processes? With forks, that copy-on-write thing can increase total RAM usage by a huge factor, if the parent process writes to RAM a lot.
> There are plenty of IPC mechanisms available to communicate things back and forth when you need to.
For communicating gigabytes of data back and forth you want shared memory anyway.
And even for small messages, IPC is faster with threads: (1) they are in the same address space, the TLB stays current, and context switches are faster; (2) in some cases, shared memory enables more efficient user-mode IPC methods like busy waiting and lock-free structures.
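A minimal illustration of point (2), with an invented example: with threads, a plain word in the shared address space works as a lock-free mailbox, and busy-waiting means no syscall at all on the fast path.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static _Atomic int mailbox = 0;

    static void *
    consumer(void *arg)
    {
        (void)arg;
        int v;
        while((v = atomic_load(&mailbox)) == 0)
            ;    /* spin entirely in user mode */
        printf("got %d\n", v);
        return NULL;
    }

    int
    main(void)    /* build with -pthread */
    {
        pthread_t t;
        pthread_create(&t, NULL, consumer, NULL);
        atomic_store(&mailbox, 42);    /* producer publishes a value */
        pthread_join(t, NULL);
        return 0;
    }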
All fair points. I didn't really consider creating processes in a non-fork way. I'd probably miss the easy transfer of unnamed shared objects you get with forking. How do unrelated processes usually initiate communication?
> How do unrelated processes usually initiate communication?
They aren't unrelated, they are parent and child. They know each other's PIDs. The parent can read/write the child's streams (input, output, error). The child inherits open files. The parent can set command-line arguments and environment variables for the child.
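A minimal sketch of that handoff: the parent writes state into a pipe whose read end the child inherits across fork(). Here the child just reads it back instead of exec'ing a real program; a real child might execv() with the fd number passed on the command line.

    #include <stdio.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fds[2];
        pipe(fds);

        if(fork() == 0){
            close(fds[1]);    /* child keeps the read end */
            char buf[64];
            ssize_t n = read(fds[0], buf, sizeof buf - 1);
            buf[n > 0 ? n : 0] = '\0';
            printf("child got: %s\n", buf);
            _exit(0);
        }
        close(fds[0]);        /* parent keeps the write end */
        const char *state = "shared startup state";
        write(fds[1], state, strlen(state));
        close(fds[1]);
        wait(NULL);
        return 0;
    }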
> You are better off communicating what to do with the data, rather than sharing the data.
Not if you actually want to distribute the work over several cores.
Example: Your data is partitioned into 4 segments A, B, C, and D. In step 1, node X processes A&B and node Y processes C&D; in step 2, node X processes A&D and node Y processes C&B. If A, B, C, and D are large, this is a prime example where shared memory trumps message passing in terms of performance. If you think that's a contrived case: dense matrix multiplication behaves like this.
Only if you don't have a data-parallel workload. Message passing is often a nonstarter for data-parallel workloads.
Another way to look at it is that the hardware has a very efficient word-level message-passing interface built-in, via shared memory. Why not take advantage of it?
> Another way to look at it is that the hardware has a very efficient word-level message-passing interface built-in, via shared memory. Why not take advantage of it?
Because shared mutable state is the root of all evil, or at least very difficult to reason about. Better to have clear owners for all state; if you need to conduct multiple operations on the same data you should move the code to where the data is rather than sharing the data between different processing units (this has long been a truism of distributed system design; modern multicore processors behave much the same as distributed systems if you zoom in enough, and the same techniques are appropriate).
It may be "the root of all evil", but often there's no faster way to solve the problem. We need to accept that shared mutable state is sometimes necessary and work on how to tame it to be maintainable and usable.
Shared mutable state is like goto; yes, it is sometimes necessary, but that only happens when there's a deficiency in your language's concurrency primitives.
If you have a dependency for every function, you have slow startup times. If you also have a unit test for every function, you have a very slow test cycle. You can mitigate this by forking after startup. Can't use threads because they don't provide isolation between tests.
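Something like this, as a sketch (the test names and runner are invented): do the expensive setup once, then fork per test so a crash or corrupted global state can't leak into the next test.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void test_passes(void) { /* ... assertions ... */ }
    static void test_crashes(void) { abort(); }

    static int
    run_isolated(const char *name, void (*test)(void))
    {
        pid_t pid = fork();
        if(pid == 0){
            test();
            _exit(0);    /* clean exit means pass */
        }
        int status;
        waitpid(pid, &status, 0);
        int ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
        printf("%s: %s\n", name, ok ? "PASS" : "FAIL");
        return ok;
    }

    int
    main(void)
    {
        /* expensive one-time initialization happens here, before any fork */
        run_isolated("test_passes", test_passes);
        run_isolated("test_crashes", test_crashes);    /* doesn't take the runner down */
        return 0;
    }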
I’ve found fork useful for Python multiprocessing. To reduce the overall memory usage. But this is an ancient solution to an ancient problem in the first place.
Seems like the GIL is still a very modern problem with ancient solutions not yet realized by python runtimes. What you want is threads, but you can’t use them. Bummer. Other language runtimes don’t have that problem.
This probably needs to be fixed in POSIX with the gradual deprecation of fork(2) over decades. If we start now we might be done by 2038. Maybe start by deprecating fork-after-multithreading, since that's extremely hard to get right in the first place.
fork/exec was a convenient hack for singlethreaded small programs on the PDP-11. Its semantics interact badly with all kinds of subsequent features, and it interferes with porting to other operating systems.
I don't think we'll see the fork/exec semantics going away anytime soon - inheriting privileges/file descriptors etc. in this way is quite core to all Unix-like OSs.
POSIX does provide posix_spawn() though, which an OS can implement more efficiently than fork()+exec().
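For reference, posix_spawn() in its simplest form looks like this (using the posix_spawnp() variant, which searches PATH); the libc is free to implement it with vfork(2)/clone(2) tricks rather than a full fork:

    #include <spawn.h>
    #include <stdio.h>
    #include <sys/wait.h>

    extern char **environ;

    int
    main(void)
    {
        pid_t pid;
        char *argv[] = { "echo", "hello from the child", NULL };

        /* returns an error number directly, not -1/errno */
        int err = posix_spawnp(&pid, "echo", NULL, NULL, argv, environ);
        if(err != 0){
            fprintf(stderr, "posix_spawnp failed: %d\n", err);
            return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
    }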
If you need to spawn child worker processes sharing state with the parent, then pass that state via pipe and tell the child about it via a command-line option. Or something like that.
fork(2) should absolutely be deprecated.
vfork(2) can stay. It's hard to use, but not actually harder to use safely than fork(2), and has much better performance (no copies, no CoW). vfork(2)'s reputation as nasty is undeserved.
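The one safe pattern, sketched: the child does nothing between vfork() and exec except, on failure, _exit(). The parent is suspended (and its memory borrowed) until then, so nothing is copied and no CoW is set up.

    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        pid_t pid = vfork();
        if(pid == 0){
            execlp("echo", "echo", "spawned cheaply", (char *)NULL);
            _exit(127);    /* must be _exit(), never exit(): we'd flush
                              the parent's stdio buffers otherwise */
        }
        waitpid(pid, NULL, 0);
        return 0;
    }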
Fork needs to evolve a bit for capability-based security, but other than that fork is ok. The problem is in concurrent shared memory access, which is extremely hard to get right. And the fix is to deprecate shared memory multithreading altogether. It's too hard to use, too slow, too broken and has no future either way.
> And the fix is to deprecate shared memory multithreading altogether.
Not gonna happen without a radically different CPU, OS, and application software paradigm. In other words: extremely unlikely in the next 2-3 decades (at least).
> It's too hard to use
Agree for most languages. Have you tried Rust? It makes shared memory multi-threading not just much simpler, but also safe.
> too slow
Very wrong for multi-core CPUs, where message-passing is emulated using shared memory.
> too broken
What does that even mean? With the right language-enforced mechanism, it is very much non-broken and functional.
That is simply wrong. Rust makes it easy and safe to share immutable data – it's the ability to mutate that can't be shared.
> data is owned explicitly and passed explicitly
Yes, data is owned explicitly, but the permission to read or write the data is somewhat independent of ownership – the owner of an object can grant permission to access the object, and it can pass this permission to other threads without relinquishing ownership.
> it's the ability to mutate that can't be shared.
So, I do think you're more right than your parent, but I also think this formulation misses important things.
Rust only lets you share when you cannot cause a data race. A type like Mutex lets you share something that's mutable, because the mutex is guarding the sharing. Functions like split_at_mut let you mutably share disjoint parts of a slice.
fork is a nightmare. On paper, fork/exec sounds better than Microsoft’s CreateProcess. In the real world, I’ll take the latter plus threads any day.
Fork is also stupid. 99.9% of the time (if not more), it'll be followed by exec/execve. Why go through all the pain of either copying everything or taking the performance and complexity hit of setting up a COW memory space if it'll all be thrown away in a few microseconds so a different process can be launched?
Then it leaves to the developer, as "a trivial implementation detail", the process of setting up the pgrp and inheriting or negotiating terminal access - including the nearly unavoidable race condition (avoidable only by using vfork instead and accepting its limitations) of determining whether the child has set its own pgrp before the parent has, so that tcsetpgrp can be called before the child calls exec and the child doesn't end up with SIGTTIN or SIGTTOU when it tries to use the terminal after exec.
I used to think fork(2)+exec(2) was genius and much more versatile than the WIN32 CreateProcess*() functions. The last part is still true, but I no longer think fork(2) is genius. fork(2) was genius back in the 70s when it was easier to do all the spawning work in user-land, but it's not actually a good thing, not now.
The only sane way to use fork(2) is to use it very early on in main() to start worker child processes. Or to immediately exit(2) or exec(2) on one or the other (or both) side(s) of fork(2) without calling any async-signal-UNsafe code on the child side. If you're going to exec() on the child side, then just use vfork(2) -- it's much easier and safer. Even better, if you're going to exec() on the child side then just use posix_spawn(3) and be done -- let the C library / OS figure out how best to spawn the child process. (On some systems posix_spawn() is a system call. On others it's implemented in terms of vfork(2) or clone(2) called in vfork(2)-like manner.)
The irony is that vfork(2) has the worse reputation, but it really is fork(2) that is evil. fork(2) is the root of much evil in Unix-land:
- there are fork-safety considerations galore
  - you can only call async-signal-safe functions on the child side of the fork (unless you're quite certain of the parent's state at the time of the fork, which is why it's OK for the child to continue if the fork was done early in main())
- it's EXPENSIVE
  - copying the RSS, or all the writable data, or arranging for CoW -- all of these are expensive nowadays
  - it complicates accounting of swap / VM space (JNI code that calls fork(2) in a JVM with a 32GB heap... really slows you down)
- it has terrible semantics as to signals
- every time something is added to what it means to be a process, one has to think about whether that should be inherited across fork(2) or not
fork(2) is evil. Do not use. Especially do not expose fork(2) in other languages or their system libraries.
I suppose it might be possible to fix fork() but the performance costs might be prohibitive.
The kernel would need to suspend any thread that attempts to take a lock or is not currently holding a lock. That immediately makes userland-only synchronization besides lockless impossible; the kernel has to know about the synchronization primitives.
Once all threads besides the forking thread are blocked trying to acquire a lock, or are suspended without holding any locks, the process can fork and all threads resume.
A far better payoff might be making processes cheaper to create, faster to startup, and providing easier-to-use IPC mechanisms. Those things benefit lots of programs, not just esoteric uses of fork().
I don’t think that’s true. There’s nothing inherently bad about forking a thread while it’s holding a lock. The problem, rather, is with not forking a thread while it’s holding a lock. That is – from the child process’s perspective, it’s as if all the threads other than the one that called fork() were abruptly terminated. If any of those threads happened to be holding a lock, then it will never have the chance to finish whatever work it was doing and release the lock… because it no longer exists. On the other hand, if fork() cloned all threads in the process, then they could keep synchronizing and working as usual, oblivious to the fact that they’re now in a new process.
Well, mostly. One significant exception: Synchronization mechanisms sometimes identify threads internally using a system-global ID. For example, some variants of the Linux futex() syscall require you to write your thread ID, as returned by gettid(), to the lock word to indicate that the lock is owned by you. This obviously won’t work if a thread’s ID can change at any time; you would have to switch to some other ID that’s instanced per-process. Except that would break mutexes shared between processes using shared memory, so you’d have to find a different solution for that. Solvable, but not quite trivial.
By the way, as a sort of existence proof, Linux already has a way to do something like multithreaded fork()! Namely, CRIU (Checkpoint/Restore In Userspace), used with containers, is a program that can dump a process (or set of processes) to disk and restore them later – including multithreaded processes. This is normally used for things like migrating containers across physical machines, but AFAIK, in theory you could restore the same dump multiple times to get multiple copies of the same process; it would just be slow, and massively overkill. :) The catch: it relies on PID namespaces, such that all processes and threads get the same PIDs and TIDs after being restored as they had before – but they’re only unique within the PID namespace, i.e. container. Thus it doesn’t have to deal with the possibility of TIDs changing, but the process has to be isolated from others. For a container, isolation is what you want anyway, but it’s probably not what you want if you’re just a random Unix program trying to fork.
Another alternative (if you want to write a multi-threaded runtime that allows its clients to request a fork) is to implement your own memory allocation routines that use robust mutexes, along with appropriate cleanup routines for the EOWNERDEAD case.
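Sketched, that suggestion looks roughly like this (the lock wrapper and repair step are invented names, and the state-repair logic would be entirely allocator-specific): a robust mutex makes the next locker get EOWNERDEAD instead of deadlocking when the previous owner disappeared mid-critical-section.

    #include <errno.h>
    #include <pthread.h>

    static pthread_mutex_t heap_lock;

    static void
    lock_heap(void)
    {
        int err = pthread_mutex_lock(&heap_lock);
        if(err == EOWNERDEAD){
            /* previous owner vanished while holding the lock:
               run allocator-specific repair here, then mark the
               mutex usable again */
            pthread_mutex_consistent(&heap_lock);
        }
    }

    int
    main(void)
    {
        pthread_mutexattr_t a;
        pthread_mutexattr_init(&a);
        pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
        pthread_mutex_init(&heap_lock, &a);

        lock_heap();
        pthread_mutex_unlock(&heap_lock);
        return 0;
    }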
Hm, if in POSIX systems the child inherits "the entire virtual address space of the parent is replicated in the child", does that actually mean that a context switch between threads within one process is not that costly (the TLB cache stays intact, for example)? Am I right?
It is surely faster than Windows. But I don't think the TLB stays intact: you have to redo all the page tables etc., as every memory write will cause a copy-on-write (if one process writes to memory, it gets a copy of that memory and the other process never sees the write).