It's a highly cited "paper" and just as wrong.
Blocking I/O is implemented via poll(), NIO uses epoll. Blocking I/O has to copy the array at least once (on the stack for <64kb, else malloc/free). Most NIO implementations use heap ByteBuffers and a multitude of copies, which is their downfall.
Blocking I/O cannot have predictable latency under load; at the very least, you are left at the mercy of the OS thread scheduler. For various reasons (e.g. mutator threads should not block the GC and compiler threads), thread priorities are not reliably honored.
I'd argue that a well-written NIO implementation (and there are virtually no good open source NIO implementations) will flat out beat any blocking one. NIO is both faster and offers better, more predictable latency under load.
Ok, now I'm curious: what is a good implementation? Also, what's wrong with ByteBuffers? I was under the impression that they are usually memory-mapped and should be zero-copy.
There are 2+1 major types of ByteBuffers:
Heap: backed by a byte[] (or char[], int[], etc.)
Direct: backed by native memory allocated via mmap (on Linux). mmap can map RAM or a file. Memory-mapped files are not an interesting case for an NIO implementation that works with sockets. (On a side note: FileChannel.transferTo(SocketChannel) doesn't involve memory mapping when the kernel supports it. Windows never supports it, though.)
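The heap/direct split above is visible directly in the API. A minimal sketch of the two allocations:

```java
import java.nio.ByteBuffer;

public class BufferKinds {
    public static void main(String[] args) {
        // Heap buffer: wraps a plain byte[] on the Java heap. The JNI socket
        // write cannot use it directly, so its contents get copied to native
        // memory on every write.
        ByteBuffer heap = ByteBuffer.allocate(4096);

        // Direct buffer: native memory outside the heap. The socket syscall
        // can read/write it in place, with no extra Java-to-native copy.
        ByteBuffer direct = ByteBuffer.allocateDirect(4096);

        System.out.println(heap.hasArray());    // backed by an accessible byte[]
        System.out.println(direct.isDirect());  // native memory
    }
}
```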
Most implementations use heap ByteBuffers; then parsing requires state machines, and often they are simplified by copying the buffers. Blocking I/O doesn't really need a state machine, as the stack serves that purpose. Then there is some reactive-like pattern (submitting tasks to an ExecutorService) that costs some more latency. Certainly it's easier to work with and reason about, yet the more hand-offs there are, the worse the performance/latency gets.
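To illustrate why non-blocking parsing needs an explicit state machine: a socket read may deliver a frame in arbitrary fragments, so the decoder must remember where it stopped between calls. A minimal sketch (class and method names are mine) for length-prefixed frames; with blocking I/O the call stack holds this state implicitly, since readInt() followed by readFully() simply doesn't return until the bytes arrive:

```java
import java.nio.ByteBuffer;

// Hypothetical decoder for frames of the form: [int length][length bytes].
final class FrameDecoder {
    private enum State { READ_LENGTH, READ_BODY }
    private State state = State.READ_LENGTH;
    private ByteBuffer body;

    /** Feed whatever bytes the channel produced; returns a frame or null. */
    byte[] feed(ByteBuffer in) {
        while (true) {
            switch (state) {
                case READ_LENGTH:
                    if (in.remaining() < 4) return null;    // wait for more bytes
                    body = ByteBuffer.allocate(in.getInt());
                    state = State.READ_BODY;
                    break;
                case READ_BODY:
                    while (body.hasRemaining() && in.hasRemaining()) {
                        body.put(in.get());
                    }
                    if (body.hasRemaining()) return null;   // partial frame, keep state
                    state = State.READ_LENGTH;
                    return body.array();
            }
        }
    }
}
```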
There are minor issues like the choice of a good queue. It is an important one, as Java lacks multi-producer/single-consumer queues out of the box, or even single-producer/single-consumer ones. Java does have MP/MC queues (ConcurrentLinkedQueue is an outstanding one) but one has to pay some extra price (including false sharing, sometimes) to use them.
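The typical shape of that queue in an NIO server is many producer threads handing tasks to the single selector thread. A sketch, assuming ConcurrentLinkedQueue as the stand-in (it works, but you pay full MP/MC CAS cost even though only one thread ever polls; dedicated MPSC queues, e.g. from JCTools, avoid that and pad against false sharing):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical helper: task hand-off from producer threads to the selector thread.
public class SelectorTasks {
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();

    // Any thread may hand work to the selector thread.
    public void submit(Runnable task) { tasks.add(task); }

    // Called only from the selector thread after select() returns.
    public int drain() {
        int ran = 0;
        Runnable t;
        while ((t = tasks.poll()) != null) { t.run(); ran++; }
        return ran;
    }
}
```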
Ultimately, blocking I/O cannot be "faster" than NIO per se, since under the hood it uses poll(2)[0] with one socket. Before that it copies the Java byte[] to a new location; for smaller arrays that's the stack. Technically one can blow up the JVM, if the stack is very small, just by entering socket.getOutputStream().write(byte[]).
Lastly, Selector.wakeup() has a silly issue: it enters a synchronized block each time, even if there is an outstanding wake-up request already. Wake-up requests are implemented via a pipe on Linux (and a socket pair on Windows), which requires a kernel mode switch. During the wakeup, all the threads attempting to submit a task block on that very selector for no real reason. It can be worked around with a CAS, so only one thread actually enters the monitor.
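A sketch of that CAS workaround (class and method names are mine): only the first thread that finds no wake-up pending calls Selector.wakeup(); everyone else sees the flag and skips the synchronized block and the pipe/socket-pair write behind it.

```java
import java.io.IOException;
import java.nio.channels.Selector;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper that deduplicates concurrent wake-up requests.
public class CasWakeup {
    private final Selector selector;
    private final AtomicBoolean wakeupPending = new AtomicBoolean(false);

    public CasWakeup(Selector selector) { this.selector = selector; }

    /** Called by producer threads after enqueueing a task. */
    public void wakeup() {
        if (wakeupPending.compareAndSet(false, true)) {
            selector.wakeup();   // at most one thread pays the syscall cost
        }
    }

    /** Called by the selector thread just before it blocks in select(). */
    public void beforeSelect() {
        wakeupPending.set(false);
    }
}
```

If a producer sets the flag after beforeSelect() but before select() blocks, its real wakeup() call makes the next select() return immediately, so no request is lost.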
I will repeat myself: blocking I/O doesn't have predictable latency, and predictable latency cannot be enforced. In the end it's all about latency, as bandwidth can be bought and more machines deployed, but you can't buy latency.
Ok, I was aware of the difference between Direct vs. Heap ByteBuffers, and I guess I understand the argument about poll/epoll. Now what I don't quite get is why most open source projects chose the slower implementation. Don't they know any better? Is it a portability issue? Bugs in certain JVM/OS combinations? Netty.io claims to be zero-copy capable, so I guess it must be one of the good ones that are available?
After seeing the message I decided to check netty.io's code, and I am pleasantly surprised. It has been ages since I last checked the project. They use almost all the tricks in the book: a CAS around Selector.wakeup(), handling zero returned keys, a ref-counting buffer allocator, even an MP/SC queue.
Only a couple of downsides: 1) there appears to be a lack of bounded queues, and it's a non-trivial one. Bounded queues are important to ensure proper back-pressure on producers and/or to kill slow peers. 2) the encoding pipeline may require serializing the same message multiple times when sending to multiple clients, even if the serialization results in the same byte stream. However, this is really a minor issue.
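The back-pressure point can be made concrete with a bounded per-peer outbound queue. A minimal sketch under assumed names: when offer() fails, the peer is not draining its writes fast enough, and the server can throttle the producer or disconnect the peer instead of buffering without limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical per-connection outbound buffer with a hard capacity.
public class PeerOutbound {
    private final BlockingQueue<byte[]> pending;

    public PeerOutbound(int maxPendingMessages) {
        this.pending = new ArrayBlockingQueue<>(maxPendingMessages);
    }

    /** Returns false when the queue is full: the peer is too slow. */
    public boolean enqueue(byte[] message) {
        return pending.offer(message);
    }

    /** Called by the writer when the channel becomes writable. */
    public byte[] nextToWrite() {
        return pending.poll();
    }
}
```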