I don't think you actually read what I wrote, or if you did, you willfully misunderstood it.
"The problem is that if the work is not distributed uniformly among the queues, some threads will be left idle" - this is why you use work-stealing queues! The very nature of work stealing queues is that the worker threads aren't left idle - they steal work from other threads' queues.
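To make the idea concrete, here is a minimal sketch of a work-stealing pool (illustrative names, not any particular library; a real implementation would use per-queue lock-free deques rather than one coarse lock). Each worker pops from its own deque, and when that runs dry it steals from the opposite end of a peer's deque, so uneven distribution doesn't leave anyone idle:

```python
import collections
import random
import threading

class WorkStealingPool:
    """Toy work-stealing scheduler: one deque per worker."""

    def __init__(self, num_workers):
        self.queues = [collections.deque() for _ in range(num_workers)]
        # One coarse lock keeps the sketch simple; production deques
        # (e.g. Chase-Lev) allow lock-free owner pops and thief steals.
        self.lock = threading.Lock()

    def submit(self, worker_id, task):
        with self.lock:
            self.queues[worker_id].append(task)

    def take(self, worker_id):
        """Pop from own queue, else steal from a peer; None if no work anywhere."""
        with self.lock:
            own = self.queues[worker_id]
            if own:
                return own.pop()  # owner takes LIFO from its own end
            victims = [q for i, q in enumerate(self.queues)
                       if i != worker_id and q]
            if victims:
                # thief takes FIFO from the far end of a random victim,
                # minimizing contention with the queue's owner
                return random.choice(victims).popleft()
            return None

pool = WorkStealingPool(2)
for t in range(4):
    pool.submit(0, t)       # all work lands on worker 0's queue
stolen = pool.take(1)       # worker 1 is not idle: it steals task 0
owned = pool.take(0)        # worker 0 pops its newest task, 3
```

The LIFO-owner / FIFO-thief split is the standard design: the owner keeps cache-hot recent tasks, while thieves grab the oldest (often largest) units of work.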
And CGI is not anything like the GC I talked about - if you have a process per request, where are you going to put your shared state?
But much of this discussion is beside the point. Don't forget, the OS scheduler is at its heart an event dispatcher when there are more runnable threads than CPU cores. The thread stack is little different from the context provided to a triggered event. You want to have the same number of runnable threads as CPU cores in order to avoid the kernel cost of a context switch. You can do that by having multiple single-threaded processes, or multiple threads in a single process. While neither choice of partitioning affects the degree to which you can use an eventing style to serve requests, one - the separate-process model - makes it much harder to share state. And therein lies the reason why I believe that optimal performance lies in threads, rather than processes. There are other good reasons for using processes instead - but it will be at some cost to efficiency.
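The shared-state asymmetry is easy to demonstrate (an assumed scenario for illustration, not from the thread above): workers running as threads in one process all see the same object by reference, whereas one process per request would give each worker a private copy, forcing you into pipes, sockets, or shared memory to get the same effect.

```python
import threading

cache = {}                      # shared in-process state, one copy for all threads
cache_lock = threading.Lock()   # sharing costs just a pointer plus a lock

def handle_request(key, value):
    # Every thread mutates the same dict. With a process per request,
    # this write would land in that process's private copy and the
    # "shared" cache would have to go through explicit IPC instead.
    with cache_lock:
        cache[key] = value

threads = [threading.Thread(target=handle_request, args=(i, i * i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the joins, `cache` holds all four entries, something a fork-per-request design could only achieve with an external coordination mechanism, which is exactly the efficiency cost being argued here.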
"The problem is that if the work is not distributed uniformly among the queues, some threads will be left idle" - this is why you use work-stealing queues! The very nature of work stealing queues is that the worker threads aren't left idle - they steal work from other threads' queues.
And CGI is not anything like the GC I talked about - if you have a process per request, where are you going to put your shared state?
But much of this discussion is besides the point. Don't forget, the OS scheduler is at its heart an event dispatcher when there are more runnable threads than CPU cores. The thread stack is little different than context provided to a triggered event. You want to have the same number of runnable threads as CPU cores in order to avoid the kernel cost of a context switch. You can do that by having multiple single-threaded processes, or multiple threads in a single process. While neither choice of partitioning affects the degree to which you can use an eventing style to serve requests, one - the separate process model - makes it much harder to share state. And therein lies the reason why I believe that optimal performance lies in threads, rather than processes. There are other good reasons for using processes instead - but it will be at some cost to efficiency.