The performance data looks interesting, but this is work based on a pretty old kernel (originally released in 2010 or so). There have been many changes and improvements in the 3.x kernels that may overlap with this work. Publishing the code and details on GitHub is great, but working with the kernel community and merging into the mainline kernel is the only way for work like this to have a meaningful long-term existence - Google in particular has been doing a great job getting networking improvements in.
That said, it's interesting to have this kind of thing come out of large-scale production web environments in China.
> [Fastsocket] on Linux 2.6.32 achieves 470K connections per second and 83% efficiency up to 24 cores, while performance of the base 2.6.32 kernel increases non-linearly up to 12 cores and drops dramatically to 159K with 24 cores. The latest 3.13 kernel doubles the throughput to 283K when using 24 cores compared with 2.6.32. However, it has not completely solved the scalability bottlenecks, preventing performance from growing when more than 12 cores are used.
It's actually not that kernel. It's the CentOS kernel, which is the RedHat kernel, which was based on a 2.6 kernel years ago but has since had every single kernel change under the sun backported to it. It might as well be RedHat's version of 3.10. This is also why it's a bad idea to build kernel patches on top of RedHat kernels: they have nothing to do with the vanilla trees.
In any case, it doesn't matter if it's a 50-year-old kernel. If it speeds up connections per second, someone will put up a box running it on the frontend as the load balancer.
Is it possible that by using an old kernel like this one, you'd expose yourself to security vulnerabilities?
I'm new to kernel programming. Is this submission suggesting that you downgrade your kernel to a 2010-era release in order to take advantage of the performance improvements, or is the submission showing some kind of modular component which you can integrate into your current kernel?
If it's the former, then wouldn't you be pinning yourself to the old version of the kernel, so you'll have to integrate all updates by hand rather than receive them automatically during the normal update process?
This kernel is what Red Hat Enterprise Linux 6 is currently using. Red Hat maintains it, and writes patches for security vulnerabilities. It's no surprise that Sina developed, tested, and deployed this patch against what they were running in production.
By using this kernel, will you be able to automatically receive security upgrades in the future? Or will you have to apply them manually and then recompile and install the kernel yourself?
Is "developers have to apply security patches manually, then recompile and reinstall the kernel themselves rather than automatically" not a big deal in practice?
It's not a big deal, because you can automate it. As long as the patch applies cleanly (and it almost certainly will if the only vendor changes are security updates), it's going to be a pretty smooth process.
You'd need to test the new kernel before deploying in production, of course, but you'd be doing that before rolling out a vendor provided kernel change, anyway.
Old kernel perhaps, but that's still what ships with the latest CentOS 6 (and, by extension, RHEL 6). Old as it might be, it's in very wide use.
This would be a tremendous boon for those environments!
The openssl 0.9.8 with Apache/2.2.3 combo only supports TLS 1.0. I couldn't set up TLS to get better than grade "B" on Qualys' SSL Server Test. I sacrificed MSIE on WinXP, used TLS 1.0 only with just TLS_RSA_WITH_AES_128_CBC_SHA, TLS_DHE_RSA_WITH_AES_128_CBC_SHA, TLS_RSA_WITH_AES_256_CBC_SHA, and TLS_DHE_RSA_WITH_AES_256_CBC_SHA, and still got a "B". I want a Forward-Secrecy-only, AEAD-only setup. Have to upgrade to RHEL 6 for that.
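For reference, once you have a newer OpenSSL (1.0.1+, which I believe is what RHEL 6.5 ships), the forward-secrecy/AEAD-only restriction is only a few lines against the standard OpenSSL C API. A minimal sketch - make_tls12_ctx is just an illustrative name, and the cipher-string aliases may need tuning for a particular build:

    #include <stdio.h>
    #include <openssl/ssl.h>
    #include <openssl/err.h>

    /* Hypothetical helper: build a server context offering only
     * forward-secret (ECDHE/DHE) AEAD (AES-GCM) suites. Needs
     * OpenSSL >= 1.0.1 for TLS 1.2 and GCM; the 0.9.8 build
     * mentioned above cannot negotiate any of these. */
    SSL_CTX *make_tls12_ctx(void)
    {
        SSL_CTX *ctx;

        SSL_library_init();        /* required once on pre-1.1.0 OpenSSL */
        SSL_load_error_strings();

        ctx = SSL_CTX_new(SSLv23_server_method());
        if (ctx == NULL)
            return NULL;

        /* Refuse the legacy protocols outright. */
        SSL_CTX_set_options(ctx, SSL_OP_NO_SSLv2 | SSL_OP_NO_SSLv3);

        /* EECDH/EDH = ephemeral key exchange (forward secrecy),
         * AESGCM = AEAD bulk ciphers. */
        if (SSL_CTX_set_cipher_list(ctx, "EECDH+AESGCM:EDH+AESGCM") != 1) {
            ERR_print_errors_fp(stderr);
            SSL_CTX_free(ctx);
            return NULL;
        }
        return ctx;
    }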
Looks like it's based on the 2.6.32 series. I would hope they start working with the upstream kernel; otherwise this project will stay stuck in limbo like previous initiatives to improve TCP handling at the kernel level (e.g. MegaPipe).
This version does not support TCP_FASTOPEN, SO_REUSEPORT, TCP_AUTOCORKING, etc.
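For anyone unfamiliar with what's missing there: SO_REUSEPORT (mainline since 3.9) lets every worker process open its own listening socket on the same port, with the kernel spreading incoming connections across them, which attacks part of the same problem Fastsocket does. A minimal sketch of the pattern, error handling elided (make_listener is just an illustrative helper name):

    #include <stdint.h>
    #include <string.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Each worker calls this to get its own listening socket on the
     * same port. With SO_REUSEPORT set before bind(), the kernel
     * load-balances new connections across all such sockets.
     * Requires Linux >= 3.9. */
    int make_listener(uint16_t port)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        struct sockaddr_in addr;

        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 128);
        return fd;
    }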
It's RedHat's 2.6.32 kernel, which is not the vanilla Linux 2.6.32 kernel: it has been getting backported fixes ever since the tree began. RedHat does not publish the individual patches that go into their kernels, but luckily for us, Oracle maintains a project called RedPatch which publicly documents the patches going into the RHEL kernels.
As an example of how this kernel is not the vanilla 2.6 tree: on April 14, 2014, a patch [2] affecting the ipv4 subsystem was included in RedHat's 2.6.32-431.23.3.el6 kernel tree [1]. You can find that same patch [3] originally applied to the mainline Linux kernel on April 14, 2014. This is common practice, and as a result RedHat kernels more closely resemble modern kernels like 3.12 than anything else.
They provide the complete source, as required by the GPL. However, they do not provide patch sets neatly broken out like they used to; that's what the parent is referring to.
The "before and after" CPU series have nearly the same exact fit. If the data was from separate 24 hour periods, wouldn't you expect the graphs to look different? I recognize that with a large service, you'd get repetitive load patterns, but the similarity here look a little extreme.
I find the first graph peculiar on its own. Supposedly, each line is the load on one of 8 cores on the same machine. Why would some cores experience heavier load than others, very consistently, over the course of a day? I've never seen a workload exhibit that kind of long-term, core-level affinity on Linux.
Even if that were the case, there isn't normally a stable mapping between processes and physical cores. There would have to be something within the kernel itself that gives higher priority to some cores than others.
Not saying that's impossible, but I've worked on machines with more than 8 cores and never seen it happen.
It looks like there are three separate optimizations, but I think the most important one is the "enable_listen_spawn" feature. Here is how they describe it:
Fastsocket creates one local listen socket table for each CPU core. With this feature, an application process can decide to process new connections from a specific CPU core. This is done by copying the original listen socket and inserting the copy into the local listen socket table. When there is a new connection on one CPU core, the kernel tries to match a listen socket in the local listen table of that CPU core and, if there is a match, inserts the connection into the accept queue of the local listen socket. Later, the process can accept the connection from the local listen socket exclusively. This way each network softirq has its own local socket to queue new connections, and each process has its own local listen socket to pull new connections out. When the process is bound to the specified CPU core, connections delivered to that CPU core by the NIC are processed entirely by the same CPU core in all stages, including hardirq, softirq, syscall and user processing. As a result, connections are processed without contention across CPU cores, which achieves passive connection locality.
The kernel in its normal configuration will try to spread IRQs evenly across CPUs. So for their use case, where you have one worker thread per CPU handling zillions of short-lived TCP connections, they can eliminate a bunch of locking and cache thrashing that would otherwise happen when handling new connections and dispatching the related epoll events within the kernel.
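Here's a rough userspace sketch of that locality pattern: fork one worker per core, pin it with sched_setaffinity(), and give it its own listening socket (reusing the hypothetical make_listener() from the SO_REUSEPORT sketch above). Take this only as an illustration of the idea - Fastsocket's enable_listen_spawn does the per-core listen-socket copy inside the kernel, not in userspace:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/wait.h>

    int make_listener(uint16_t port);  /* from the sketch above */

    int main(void)
    {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);

        for (long i = 0; i < ncores; i++) {
            if (fork() == 0) {
                cpu_set_t set;
                CPU_ZERO(&set);
                CPU_SET((int)i, &set);
                /* Pin this worker to core i; if NIC queue i's IRQ is
                 * also steered to core i, a connection stays on one
                 * core through hardirq, softirq, syscall and the
                 * application. */
                sched_setaffinity(0, sizeof(set), &set);

                int lfd = make_listener(8080);
                for (;;) {
                    int cfd = accept(lfd, NULL, NULL);
                    if (cfd >= 0)
                        close(cfd);  /* real per-connection work here */
                }
            }
        }
        while (wait(NULL) > 0)
            ;
        return 0;
    }

Even then, plain SO_REUSEPORT hashes connections to sockets without regard to which core received the packet, so you don't get full locality; closing that gap is exactly what the per-core local listen tables are for.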
Maybe it's up to the more sensible members of the kernel "community" to reach out to the developers of code known to be interesting, and to discuss what's in it for those developers to do the work required to get it merged, given the real chance of doing a ton of work and then being ignored, etc.
There's a sense in the above of "they haven't submitted to us, so we don't care." That might not be the best way to make the kernel as good as it can be, if that is the goal of anyone active in the kernel "community" (and maybe it is).
I have a lot of sympathy for someone publishing their code and their results and then saying "I won't play stupid kernel politics; your move." I don't know if that's what is happening here, or if it's cultural differences, or something I haven't thought of. Nor do I know whether this particular development is worth merging, but hey, neither does the kernel "community", right?
This is much lower level than those. This is all about the TCP stack.
This is the OSI model, from the top down:
7) Application
6) Presentation
5) Session
4) Transport
3) Network
2) Data link
1) Physical
ZeroMQ fits in neatly at the top, layer 7 (arguably it is the presentation layer too because it uses its own protocol).
What this is talking about relates to network sockets, which sit around layers 4 and 5 (you can find lots of debate on the subject). Any speed improvement at a lower level of the stack is seen by everything in the layers above it.