I was referring to the second example, using threads. Specifically this claim:
In Clojure we don't need to use callbacks. This means for common things like talking to databases, we don't need them to have asynchronous interfaces. That's because we have really fantastic primitives in the language itself for dealing with concurrency. This code runs twice as fast as the Node.js counterpart - probably due to the excellent perf of Clojure coupled with leveraging multiple cores.
You can't tell whether something is using "threads" just by looking at it. Javascript is a fairly standard Algol language with a modestly unusual object model, and if all you know is other Algol languages you may not realize what is possible. There are numerous languages where you can write the equivalent of:
function do_something():
    socket = wait_for_a_socket()
    data = read_everything_from_socket(socket)
    process_data_long_and_hard_with_lots_of_io(data)
    write_to_disk(data)
    return_result(socket, data)
Where, as my function names imply, further code may be called that does things like talk to databases, or wait for other data, or any amount of other I/O, and you may not have to write a single asynchronous callback. Why? Because unwrapping code into continuations and managing them in the runtime is a trivial compiler transformation when you design your language to work like that from day one. Like Erlang, or, in this case, Clojure.
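Go's goroutines illustrate the same property: each handler is ordinary straight-line code that appears to block, while the runtime multiplexes goroutines onto OS threads behind the scenes. A minimal sketch (all function names are made up for illustration, with a sleep standing in for real I/O):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fetchFromDatabase simulates a blocking I/O call (a stand-in for a real
// database query). The goroutine that calls it parks; the runtime keeps
// the underlying OS thread busy running other goroutines meanwhile.
func fetchFromDatabase(id int) string {
	time.Sleep(10 * time.Millisecond) // pretend network latency
	return fmt.Sprintf("row-%d", id)
}

// handleRequest is straight-line "blocking" code: no callbacks, no
// manual continuation wiring.
func handleRequest(id int) string {
	data := fetchFromDatabase(id)
	return "processed " + data
}

func main() {
	var wg sync.WaitGroup
	results := make([]string, 5)
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func(i int) { // each concurrent request is ordinary sequential code
			defer wg.Done()
			results[i] = handleRequest(i)
		}(i)
	}
	wg.Wait()
	fmt.Println(results[0]) // prints "processed row-0"
}
```

The point is that the suspension and resumption are the runtime's job, not the programmer's.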
The fact that Node.js requires you to manually shatter your code into teensy-weensy little fragments and manually wire it back together is a hack which should not be mistaken for a feature. Javascript requires you to do that, because it's an Algol language and just doesn't work any other way. If that's what you love doing, great, but that's an awful lot of time and effort spent on writing plumbing (and debugging plumbing, and debugging nontrivial asynchronous plumbing in a mutable language isn't in the worst tier of programming tasks in the world, but it's solidly in the second...) that you could have been spending on writing code that actually solves customer problems.
(I am aware of the libraries that pretend to help this. They are a joke compared to working in a language that actually supports this. I can call functions and send messages and read files and read from databases and write functions to abstract all this and I don't spend one second wondering how I'm going to wire all the pieces together at runtime. No amount of code slathered over Javascript can match that, short of an entirely new language that compiles into Javascript. (Which is inevitable. And it will be hailed as a brilliant breakthrough.) All of the libraries I was pointed to last time I brought this up use the exact same obvious hack, which helps with the case of stringing a handful of teensy-weensy fragments of code onto a single string but can't handle anything more.)
Node uses a thread pool also. From Ryan's JSConf slides:
"Blocking (or possibly blocking) system calls are executed in the thread pool. Signal handlers and thread pool callbacks are marshaled back into the main thread via a pipe."
The point is that it has a concept of a main thread. You need to get away from that, and rather have a pool of threads ready to dispatch events, not just a pool for blocking calls to avoid the main thread blocking.
Why? Everything that runs in the main thread is Real Work to compute the response. It makes more sense to run multiple node.js processes. With a thread pool for all events, you have a lock in the critical path between accept()ing a connection and handing the fd off to a worker thread, and all worker threads pay the GC penalty. With multiple processes, you have one less lock hot spot, and the processes do their GCs concurrently.
What makes you think that a thread pool dispatching events would use a lock? No modern efficient thread pool implementation dispatching work items is serialized by a single lock, to my knowledge.
What you ought to have is multiple async accept calls outstanding, and as they complete they are entered into work queues. Worker threads (i.e. the thread pool) pulling work off those queues should steal work from other queues when their own queue is empty.
Pulling items off a queue, whether by its associated worker thread or by another worker thread stealing its work, should be lock-free.
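A simplified sketch of the idea: each worker owns a deque, the owner pops from one end, and thieves steal from the other. For brevity this version guards each deque with a mutex; a real work-stealing deque (e.g. Chase-Lev) makes these operations lock-free, which is the whole point of the design:

```go
package main

import (
	"fmt"
	"sync"
)

// workQueue is a per-worker deque. (A mutex here is for brevity only;
// production work-stealing deques are lock-free.)
type workQueue struct {
	mu    sync.Mutex
	tasks []int
}

func (q *workQueue) push(t int) {
	q.mu.Lock()
	q.tasks = append(q.tasks, t)
	q.mu.Unlock()
}

// popBottom: the owner takes from one end of its own deque.
func (q *workQueue) popBottom() (int, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	n := len(q.tasks)
	if n == 0 {
		return 0, false
	}
	t := q.tasks[n-1]
	q.tasks = q.tasks[:n-1]
	return t, true
}

// stealTop: thieves take from the opposite end, minimizing collisions
// with the owner.
func (q *workQueue) stealTop() (int, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.tasks) == 0 {
		return 0, false
	}
	t := q.tasks[0]
	q.tasks = q.tasks[1:]
	return t, true
}

func main() {
	queues := make([]*workQueue, 4)
	for i := range queues {
		queues[i] = &workQueue{}
	}
	// Non-uniform load: all 100 tasks land on queue 0.
	for t := 1; t <= 100; t++ {
		queues[0].push(t)
	}
	var mu sync.Mutex
	var sum int64
	var wg sync.WaitGroup
	for w := range queues {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for {
				t, ok := queues[w].popBottom()
				if !ok { // own queue empty: try to steal
					for v := range queues {
						if v == w {
							continue
						}
						if t, ok = queues[v].stealTop(); ok {
							break
						}
					}
				}
				if !ok {
					return // nothing left anywhere (fine for this one-shot demo)
				}
				mu.Lock()
				sum += int64(t)
				mu.Unlock()
			}
		}(w)
	}
	wg.Wait()
	fmt.Println(sum) // 1+2+...+100 = 5050, despite all work starting on queue 0
}
```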
As I mentioned elsewhere, GC is not necessarily (or even often) the most efficient approach for request/response style servers. A better approach is a heap associated with the request, which can be freed in a single go when the request completes, with all allocations associated with that request (i.e. those that don't need to persist between requests) coming out of that heap.
You can even design a GC around this principle: have one GC heap per worker thread, and collect it after every request has been processed. There should be few or no roots for this heap associated with the worker itself, which should be (very) low down in its call stack once it is done with the request. If you have write barriers on any mutations to inter-request (shared) state, you can trace those to find out which bits of the worker thread's GC heap you need to keep (copy out). Then you can simply zero the GC heap and reset the free pointer. You can make your write barriers smart, so that they are associated with that worker's heap and you don't have to wander all over the shared heap looking for roots.
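The per-request heap idea, minus the GC machinery, can be sketched as a bump allocator that is reset wholesale between requests. A toy version (assuming requests whose allocations fit a fixed-size arena; a real allocator would grow or chain blocks):

```go
package main

import "fmt"

// Arena is a per-request bump allocator: allocations come out of one
// contiguous block, and "freeing" the whole request is just resetting
// an offset.
type Arena struct {
	buf []byte
	off int
}

func NewArena(size int) *Arena { return &Arena{buf: make([]byte, size)} }

// Alloc hands out n bytes from the arena; there is no per-object free.
func (a *Arena) Alloc(n int) []byte {
	if a.off+n > len(a.buf) {
		panic("arena exhausted") // a real allocator would grow or chain blocks
	}
	p := a.buf[a.off : a.off+n : a.off+n] // cap limited so appends can't spill
	a.off += n
	return p
}

// Reset frees everything the request allocated, in O(1).
func (a *Arena) Reset() { a.off = 0 }

func main() {
	arena := NewArena(1 << 10)
	for req := 0; req < 3; req++ { // one arena reused across requests
		header := arena.Alloc(16)
		body := arena.Alloc(64)
		_ = append(header[:0], "HTTP/1.1 200"...)
		_ = body
		fmt.Println("request", req, "used", arena.off, "bytes")
		arena.Reset() // the whole request's memory freed in a single go
	}
}
```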
"What makes you think that a thread pool dispatching events would use a lock? No modern efficient thread pool implementation dispatching work items is serialized by a single lock, to my knowledge."
Are you talking about thread-safe queues implemented using atomic instructions? Those aren't free - how do you think they're implemented at the hardware level? The main advantage of atomic instructions over locks is removing the possibility of waiting on a preempted thread holding the lock (lock-freedom). They also have lower overhead than making system calls to provide locking. But the equivalent mutual exclusion logic (and contention penalties) just get moved down to the chipset level - now instead of waiting on other threads, you're waiting on other cores/CPUs.
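One common mitigation for exactly this hardware-level contention is to shard a hot atomic variable so that concurrent writers mostly touch different cache lines instead of fighting over one. A toy sketch (shard count and the omission of cache-line padding are simplifications):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// shardedCounter spreads increments across several atomic slots so that
// concurrent writers don't all contend on one location. (Padding each
// shard to cache-line size, omitted here for brevity, matters in practice.)
type shardedCounter struct {
	shards [8]int64
}

func (c *shardedCounter) Add(shard int) {
	atomic.AddInt64(&c.shards[shard%len(c.shards)], 1)
}

// Sum reads all shards; reads are cheap compared to contended writes.
func (c *shardedCounter) Sum() int64 {
	var s int64
	for i := range c.shards {
		s += atomic.LoadInt64(&c.shards[i])
	}
	return s
}

func main() {
	var c shardedCounter
	var wg sync.WaitGroup
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				c.Add(w) // each goroutine mostly touches its own shard
			}
		}(w)
	}
	wg.Wait()
	fmt.Println(c.Sum()) // 8 * 1000 = 8000
}
```

The atomic instructions are still there; the trick is arranging the data so they rarely contend.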
"What you ought to have is multiple async accept calls outstanding, and as they complete they are entered into work queues. Worker threads (i.e. the thread pool) pulling work off those queues should steal work from other queues when their own queue is empty.
See e.g. http://www.bluebytesoftware.com/blog/2008/09/17/BuildingACus..."
That article is horrible. Please do not follow the author's advice.
Besides the fact that the code deadlocks (see the first comment on the article), it's also easy to see that if one thread starts generating all the work, the "solution" degenerates into a single global queue (that thread's "local work-stealing queue"), with all the other threads looping through the global queue and then through each other's queues just to reach it!
The one good thing about that article is that the throughput gains on the toy benchmark (which doesn't deadlock or spawn tasks non-uniformly) nicely illustrate my point about the expense of contention even if using atomic instructions.
What the code in the article attempts to do is alleviate contention by partitioning the tasks among several queues. The problem is that if the work is not distributed uniformly among the queues, some threads will be left idle. The way to overcome that is to fake a global queue by having some way to synchronize the partitioned queues. The optimal solution depends not only on the particular system you're running on, but also on the application's pattern of work spawning. And all of this depends on being deadlock-free, which the article doesn't even manage!
Why would you ever go through something so horrible for an HTTP server? Multiple node.js processes are much simpler and more efficient.
I don't think you actually read what I wrote, or if you did, you willfully misunderstood it.
"The problem is that if the work is not distributed uniformly among the queues, some threads will be left idle" - this is why you use work-stealing queues! The very nature of work stealing queues is that the worker threads aren't left idle - they steal work from other threads' queues.
And CGI is not anything like the GC I talked about - if you have a process per request, where are you going to put your shared state?
But much of this discussion is beside the point. Don't forget, the OS scheduler is at its heart an event dispatcher when there are more runnable threads than CPU cores. A thread's stack is little different from the context handed to a triggered event. You want the number of runnable threads to match the number of CPU cores in order to avoid the kernel cost of context switches. You can do that by having multiple single-threaded processes, or multiple threads in a single process. Neither choice of partitioning affects the degree to which you can use an eventing style to serve requests, but one - the separate-process model - makes it much harder to share state. And therein lies the reason why I believe that optimal performance lies in threads, rather than processes. There are other good reasons for using processes instead - but it will be at some cost to efficiency.