So the basic takeaway seems to be: don't bother using async patterns for single, low-latency connections to a server on your local network.
For anything where you're dealing with thousands of connections from random Internet hosts, "just spawn a thread for it" does not cut it. If you take that approach, you're setting yourself up to be accidentally DoS'd at some point in the near future. Async, on the other hand, has more than proven itself to be apt for this kind of scenario.
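To make the contrast concrete, here's a minimal sketch of the async approach using Python's asyncio (an illustrative echo-style server, not anyone's production code): each connection is a cheap task on one event loop rather than an OS thread, so tens of thousands of idle connections cost very little.

```python
import asyncio

async def handle(reader, writer):
    # Each connection is a lightweight coroutine task, not an OS thread;
    # while this one awaits I/O, the loop services other connections.
    data = await reader.readline()
    writer.write(data.upper())
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main():
    # Port 0 asks the kernel for any free port.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    # Act as our own client to exercise the server once.
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"hello\n")
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    server.close()
    await server.wait_closed()
    return reply

reply = asyncio.run(main())
print(reply)  # b'HELLO\n'
```

The per-connection logic still reads linearly; the scheduling just happens at the `await` points instead of in the kernel's thread scheduler.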
I'd want data. The system I work on does in fact spawn a thread to handle each and every connection, and each connection thread spawns numerous child threads to exploit available parallelism within the request. The code is fully blocking and linear, and anyone can read it and see what it is doing. The system in question is one of the largest public network services on earth.
I am very skeptical of the idea that you must not handle thousands of connections with a thread per connection. High tens of thousands of threads per core is the minimum level where I would start to worry.
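For comparison, the thread-per-connection style being defended here looks something like this (a toy sketch, not the actual system described above): fully blocking, linear per-connection code.

```python
import socket
import threading

def handle(conn):
    # Fully blocking, linear per-connection logic: read, reply, close.
    # Anyone can read this top to bottom and see what it does.
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())

def serve(server):
    while True:
        conn, _ = server.accept()
        # One OS thread per connection.
        threading.Thread(target=handle, args=(conn,)).start()

server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]
threading.Thread(target=serve, args=(server,), daemon=True).start()

# Exercise the server once as a client.
client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"hello")
reply = client.recv(1024)
client.close()
```

Each thread costs stack memory and scheduler overhead, but on modern Linux that overhead is small enough that the readability argument above carries real weight until connection counts get very large.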
Additionally, a TCP connection is essentially an operating system resource: you need to set aside a port and space for a send and receive buffer. It might seem fine for a client to open hundreds of connections, but imagine being a server with thousands of clients each opening hundreds of connections to you. You very quickly run out of resources and either have to close connections or reject new ones.
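One concrete place this resource limit bites (on Unix-like systems, assuming the stdlib `resource` module is available): every open connection consumes a file descriptor, and the per-process descriptor limit is often a few thousand by default.

```python
import resource

# Each open TCP connection costs one file descriptor, plus kernel memory
# for its send and receive buffers, so RLIMIT_NOFILE caps concurrency.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)

# A server expecting many clients typically raises the soft limit
# up to the hard limit (raising the hard limit itself needs privileges).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```

Even with the descriptor limit raised, the kernel buffer memory per connection (tunable via sysctls like `net.ipv4.tcp_rmem`/`tcp_wmem`) still adds up.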
Linux starts to act weird around 200,000 concurrent connections in my experience, even with aggressive sysctl tuning. You end up with weird edge cases like netstat literally taking 15 minutes of CPU time (in kernel) before it dumps the list of connections to stdout.
The problem is that it doesn't scale linearly. There's some O(n^4) algorithm being used in netstat or some kernel syscalls or something. Once you go over 250k, things get _really_ weird.
Solutions that pull the TCP stack out of the kernel perform so much better because they're bypassing all the internal bureaucracy that the kernel otherwise performs to make it as easy as possible for userspace applications to use the network without stepping on other applications' toes.
The kernel socket API is designed so that programs have to do as little thinking as possible to get their own personal slice of the shared and noisy network. It provides an easy abstraction, and that requires the kernel to do a lot of messy stuff for you:
- When you're using TCP sockets, the kernel copies everything your application writes into a buffer and holds it there until the other side acknowledges receipt, in case it needs to be retransmitted. If the socket's send buffer fills up, your application blocks on I/O until some space is freed.
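You can watch that buffering behavior directly. Here's a small sketch (using a Unix socket pair for portability; TCP sockets behave the same way): with the socket set non-blocking, writes succeed until the kernel's send buffer is full, at which point the kernel signals "would block" instead of stalling the process.

```python
import socket

a, b = socket.socketpair()
a.setblocking(False)

chunk = b"x" * 65536
sent = 0
try:
    while True:
        # Each successful send() is the kernel copying our bytes into
        # its internal buffer; nothing has been read on the other end.
        sent += a.send(chunk)
except BlockingIOError:
    # The kernel buffer is full. A blocking socket would stall right
    # here until the peer drained some data and freed up space.
    pass

print(sent)  # total bytes the kernel buffered before pushing back
a.close()
b.close()
```

This is exactly the backpressure mechanism the bullet above describes, just made visible.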
- It holds closed connections in a lingering TIME_WAIT state long after close() returns, just in case the last bytes need to be retransmitted. This behavior can be sidestepped (e.g., via SO_LINGER), but it's the default.
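For illustration, here's a sketch of opting out of that lingering behavior with SO_LINGER (setting it before close; this is the standard sockets-API knob, shown via Python):

```python
import socket
import struct

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# struct linger { int l_onoff; int l_linger; }
# l_onoff=1, l_linger=0: close() sends an RST instead of a FIN, and the
# socket skips TIME_WAIT entirely. Use with care: in-flight data is lost.
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))

onoff, linger = struct.unpack(
    "ii", s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8)
)
print(onoff, linger)  # 1 0
s.close()
```

Servers drowning in TIME_WAIT sockets more often reach for `SO_REUSEADDR` or sysctl tuning instead, since the abortive close trades correctness for resources.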
- It takes care of all the congestion control for you, but it's tuned for the general case, and as a result there are a lot of edge cases which perform very badly for the problem they're trying to solve. Redis is probably one such edge case.
Of course, all of this is fine and desirable for general applications, but it ends up being problematic if you're trying to solve a problem where performance is the chief concern.
It's tempting to say the problem is that the kernel has to do way too much to provide that easy abstraction, but really the problem is that the kernel provides no way around it. You pretty much have the option of using its cushy stream abstraction at the cost of performance, or you use a userspace TCP stack on raw sockets, which requires running as root and disabling TCP in the kernel (otherwise the kernel stomps all over your TCP negotiations[1]).
There are some other transport layer protocols (SCTP, DCCP, etc.), as well as application layer protocols built on UDP, that remove some of the abstractions TCP provides and as a result require less in-kernel bureaucracy, but those solutions don't seem to be very popular or well-supported.
It would be nice if the kernel would provide some lower level system calls that could be selectively used to move parts of TCP into the application (e.g., retaining copies of data in case of re-transmission). Alas, I don't think there's much push for that, because a) it's hard, and b) the current situation is fine for 99% of network applications.
The primary advantage GridFTP has over simply using tar+netcat for performance is that GridFTP can multiplex transfers over multiple TCP connections. This is helpful as long as the endpoint systems limit the per-connection buffer size to some value less than the bandwidth-delay product (BDP) between them. If you've got to bug sysadmins to get GridFTP set up for you on both endpoints, you might as well just ask them to increase the maximum TCP buffer size to match the BDP.
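To put a number on that BDP claim, here's the arithmetic for a hypothetical path (the link speed and RTT below are made-up illustration values):

```python
# Bandwidth-delay product: the number of bytes that must be "in flight"
# (unacknowledged) to keep the pipe full.
# Hypothetical path: 1 Gbit/s link with a 100 ms round-trip time.
bandwidth_bits_per_s = 1_000_000_000
rtt_s = 0.100

bdp_bytes = int(bandwidth_bits_per_s / 8 * rtt_s)
print(bdp_bytes)  # 12500000, i.e. ~12.5 MB
```

So a single connection on that path needs roughly 12.5 MB of TCP buffer to saturate the link; if the buffer is capped at, say, 4 MB, you'd need several parallel connections (or a bigger buffer) to fill the pipe.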
EDIT: Sorry, "multiplex" is not the right word to describe that. It's more like GridFTP "stripes" files across multiple connections; it divides the file into chunks, sends the chunks over parallel connections, and reassembles the file at the destination.
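The striping idea is simple enough to sketch (a toy model of the chunk/reassemble logic, not GridFTP's actual protocol): tag each chunk with its offset so the receiver can place chunks correctly even when they arrive out of order over the parallel connections.

```python
def stripe(data: bytes, n: int):
    # Divide data into n contiguous chunks, each tagged with its offset,
    # as a striped transfer would before sending over n connections.
    size = -(-len(data) // n)  # ceiling division
    return [(off, data[off:off + size]) for off in range(0, len(data), size)]

def reassemble(chunks):
    # Chunks may arrive in any order; offsets say where each one goes.
    out = bytearray(sum(len(c) for _, c in chunks))
    for off, c in chunks:
        out[off:off + len(c)] = c
    return bytes(out)

data = b"abcdefghij"
shuffled = list(reversed(stripe(data, 3)))  # simulate out-of-order arrival
print(reassemble(shuffled))  # b'abcdefghij'
```

Each connection gets its own congestion window and kernel buffer, which is why striping helps when per-connection buffers are capped below the BDP.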