Good read, but despite being a major proponent of Node.js and many of the ideals it seeks to embrace, I'm not sure I'm comfortable with calling Apache "archaic". It's not like IE6, which objectively has no redeeming value as a modern platform target -- it did back in the day for sure, and is still relevant in some spheres, but overall I don't think anyone (even Microsoft) would argue that IE6 isn't "archaic".
But to call Apache, one of the most popular and successful actively developed webservers _archaic_? I think that's a bit much. It's not inherently bad just because it's not really targeting the C10K problem... just different.
[A minor nitpick to be sure, but it bothered me nonetheless as I feel like I'm seeing this "Threads bad. Async good." rhetoric passed around as fact all over the place and it's starting to feel a bit like Animal Farm ;)]
You're perfectly correct, and Apache has served, and continues to serve, as one of the most successful and widely deployed web servers to date. That said, in the context of more conveniently architecting for high volumes of traffic, Apache was conceived in a time of fundamentally different problems, and in that respect it can be viewed as a more antiquated option when scoping out the landscape of appropriate web server software.
I did not intend any pejorative connotations by calling it "archaic". I just wanted to emphasise that it has been eclipsed by newer software following different design paradigms better suited to this kind of problem.
An Apache expert would probably prove me wrong, but I've found it much simpler to use nginx to dispatch multiple domains to different backend types. For instance, I have a couple of Node apps, one WordPress blog and lots of small Django websites.
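For a sense of what that dispatching looks like, here's a minimal nginx sketch (hostnames and ports are made up):

    # nginx.conf, inside http { ... } (hypothetical hosts and ports)
    server {
        listen 80;
        server_name blog.example.com;           # the WordPress blog
        location / {
            proxy_pass http://127.0.0.1:8080;   # e.g. Apache/PHP running WP behind nginx
        }
    }

    server {
        listen 80;
        server_name app.example.com;            # one of the Node apps
        location / {
            proxy_set_header Host $host;
            proxy_pass http://127.0.0.1:3000;
        }
    }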
Apache reverse proxies are very easy to set up once you have done it once. I've reverse proxied Tomcat, PHP and ASP.NET apps just by copy-pasting a few lines in the config.
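The few lines in question are roughly this (a sketch, assuming mod_proxy and mod_proxy_http are loaded; the backend port is made up):

    # in httpd.conf or a vhost (requires mod_proxy + mod_proxy_http)
    ProxyPreserveHost On
    ProxyPass        /app  http://127.0.0.1:8080/app
    ProxyPassReverse /app  http://127.0.0.1:8080/app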
The main strength of Apache, though, is the ability to separate apps running on the same system under different users, without a reverse proxy, via .htaccess files and stuff like mod_php.
Apache also has an excellent security track record considering its vast number of deployments and years of service.
Perhaps, though, if you are a greenfield developer with no Apache experience deploying on stuff like EC2, you may as well just skip Apache and go straight to nginx.
What nginx's security record will look like in 10 years, if it becomes as popular as Apache, remains to be seen.
On the nginx side, the author discusses tweaking sysctl.conf to cut down the number of sockets stuck in TIME_WAIT, plus some other performance tweaks, resulting in a 90% reduction in occupied sockets. On the Node.js side, the author uses the cluster module to fully utilize the available CPU cores, arriving at N-1 as the magic number of Node processes to spawn, where N is the number of CPU cores (rough sketches of both below).
Definitely suggested reading for anyone running Nginx + Node.js
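To make that concrete, the sysctl side looks something like this (illustrative values, not the article's exact numbers):

    # /etc/sysctl.conf (illustrative values; apply with `sysctl -p`)
    net.ipv4.ip_local_port_range = 10240 65535   # widen the ephemeral port range
    net.ipv4.tcp_tw_reuse = 1                    # reuse TIME_WAIT sockets for new outbound connections
    net.ipv4.tcp_fin_timeout = 15                # how long orphaned connections sit in FIN-WAIT-2

And the cluster side, in sketch form (the usual N-1 pattern; the port and response are placeholders):

    // app.js: minimal cluster sketch, N-1 workers
    var cluster = require('cluster');
    var http = require('http');
    var os = require('os');

    if (cluster.isMaster) {
      // leave one core free for nginx and the OS
      var workers = os.cpus().length - 1;
      for (var i = 0; i < workers; i++) {
        cluster.fork();
      }
    } else {
      // each worker shares the same listening socket
      http.createServer(function (req, res) {
        res.writeHead(200, { 'Content-Type': 'text/plain' });
        res.end('ok\n');
      }).listen(8000);
    }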
I believe the section about TCP_FIN_TIMEOUT is wrong. tcp_fin_timeout has nothing to do with the TIME_WAIT state at all; it controls how long orphaned connections sit in FIN-WAIT-2. TCP_TIMEWAIT_LEN is the value that determines how long the kernel holds onto the TCB.
When tcp_tw_reuse is enabled, the kernel can decide to reuse sockets in TIME_WAIT before they expire or are closed by the clients.
This can be a problem, though, because the connection could still be in use by the client, so there can be collisions in the TCP sequence numbers, especially on high-traffic servers.
The kernel can try to avoid these collisions with a technique called PAWS (Protection Against Wrapped Sequence numbers, RFC 1323).
Unfortunately, PAWS works only with tcp_timestamps enabled on both sides (client and server).
tcp_timestamps also has some overhead, so it is often disabled on high-traffic servers, which leads to potential problems.
As for tcp_tw_recycle: when it is enabled, it forces verification of this TCP timestamp.
So behind NAT, multiple clients will send different TCP timestamps to the server over the same mapped connection, which points at the TIME_WAIT socket, and because the timestamps differ, the packets will be dropped by the kernel. This is the reason it is not a good idea to enable tcp_tw_recycle when you use a load balancer or in case of NAT.
A good practice is to enable tcp_tw_reuse (instead of tcp_tw_recycle), to make sure tcp_timestamp is enabled and to decrease the size of the tcp timestamp with tcp_timewait_len.
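In /etc/sysctl.conf terms, the tunable part of that advice would be roughly:

    # sketch of the recommendation above
    net.ipv4.tcp_tw_reuse = 1     # reuse TIME_WAIT sockets for new outbound connections
    net.ipv4.tcp_timestamps = 1   # required on both ends for PAWS to work
    # Note: TCP_TIMEWAIT_LEN is a compile-time constant (include/net/tcp.h),
    # not a sysctl, so shortening TIME_WAIT itself means rebuilding the kernel.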
>>A good practice is to enable tcp_tw_reuse (instead of tcp_tw_recycle), to make sure tcp_timestamp is enabled and to decrease the size of the tcp timestamp with tcp_timewait_len.
A couple of questions: what is tcp_timestamp? I assume you are not referring to tcp_timestamps?
What effect does tcp_timewait_len have on timestamps at all? Isn't it just the amount of time the connection closer holds on to TCBs?
Should nginx only be used for serving static files? Does it have any advantage when used to serve a plain data API? I want to expose a REST API (django + uwsgi) over the web, but I'm not sure if I should use nginx for it.
I think this basically boils down to: "are you likely to have many users connected concurrently at any one time?"
If your API basically involves a client connecting, quickly getting a small JSON/XML response and then disconnecting again, you are probably absolutely fine with Apache unless you have truly enormous numbers of users.
OTOH, if the socket is likely to be held open for a while, because the API responses can take some time to be returned, or the client holds the connection open in order to get a stream of data over time, then you may get more mileage out of nginx.
The service returns quick & short JSON responses, and a huge number of users are going to hit it. So basically there will be an enormous number of concurrent connections, each getting a quick, short JSON response back. No heavy work per connection; there are just a lot of them.
There's probably nothing inherently wrong or slow with running Django through nginx.
That said, one of the most common deployment strategies is gunicorn. It's better documented [1], and it's always good to separate your web app server from your static file server/CDN.
I've had great success with nginx as the main entry point, serving static files and directing traffic to Django w/ gunicorn. Add a small supervisor setup and you've got a very simple but robust server.
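The supervisor piece is only a few lines; a sketch with made-up paths and names:

    ; /etc/supervisor/conf.d/myapp.conf (hypothetical paths and names)
    [program:myapp]
    command=/srv/myapp/venv/bin/gunicorn myproject.wsgi:application --bind 127.0.0.1:8000
    directory=/srv/myapp
    user=www-data
    autostart=true
    autorestart=true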
We use nginx - uwsgi - django and are quite pleased with the combination. Between nginx and uwsgi there are plenty of configuration options to let you optimize for your particular use case, and we don't see any potential issues in terms of adding new capabilities to our setup (minus WebSockets, but as mentioned in another comment that is coming soon / available with plugins).
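The nginx side of that stack is only a handful of lines; a sketch with a made-up host and socket path:

    # nginx server block in front of uWSGI (hypothetical names/paths)
    server {
        listen 80;
        server_name api.example.com;

        location /static/ {
            alias /srv/myapp/static/;    # nginx serves static files directly
        }

        location / {
            include uwsgi_params;        # ships with nginx
            uwsgi_pass unix:/tmp/myapp.sock;
        }
    }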