How we scaled Nginx

toast0 · on Aug 1, 2018

> Once in a while, a request gets slowed down enough to matter. My colleague Ivan Babrou ran some I/O benchmarks and saw read spikes of up to 1 second. Moreover, some of our SSDs have more performance outliers than others.

If you're running Intel DC 3x10 SSDs, check for a firmware update that improves 'maximum latency' in some cases, the update was released some years ago, but people might not have noticed it.

dev_dull · on Aug 2, 2018

Is there a place for benchmarks where we can discover the “best” SSDs? I find their performance varies wildly.

toast0 · on Aug 2, 2018

SSDs aren't as easy to model as spinning drives, but either way, it depends on your use case. Definitely get as much data from others as possible, but very few people will be running the same use cases as you.

You can probably figure out the general class of storage you need / can afford, which is interface and bits per cell, but then you probably need to try several well regarded drives (or several dodgy drives, if that's what you can afford) to see how they work for you. Or just go with whatever your hosting provider can supply ;)

linedash · on Aug 2, 2018

https://www.storagereview.com/ was always my first stop for in-depth reviews of disks

Ajedi32 · on Aug 2, 2018

ssd.userbenchmark.com has some good data. They crowdsource benchmarks from users, which lets you get a pretty good idea of not only average performance but also the distribution of performance numbers across hundreds of real-world systems.

ksec · on Aug 2, 2018

For Server usage, your best bet is Servethehome and Anandtech.

windows_tips · on Aug 2, 2018

https://www.tomshardware.com/reviews/best-ssds,3891.html

olavgg · on Aug 2, 2018

These are consumer level SSD's that are very well known by professionals to be slow. They are good for quick bursts of traffic but throttles with heavy use over time. They also have horrible sync write performance https://forums.servethehome.com/index.php?threads/did-some-w...

They belong in a desktop/gaming machine not in a production web server.

Here is an old review with enterprise drives and the Samsung 840 pro consumer drive which received excellent reviews at the time it was released, notice how it is the worst performer over time. https://www.storagereview.com/micron_m500dc_enterprise_ssd_r...

The exception is Optane 900p which is blazing fast! A gamechanger!

enitihas · on Aug 1, 2018

Non blocking disk I/O is one thing where NT is really ahead of all nix OSes. Unlike say network IO where we have all sorts of platforms(go, node) which allow you to scale by doing async IO, there aren't much options for disk I/O, primarily because of lack of nix options.

khc · on Aug 1, 2018

Author of the post (and the engineer who did the work) here.

There are ways to do non-blocking disk I/O in *nix (aio/io_submit in linux) but all of which requires you to have an open file descriptor first. Does NT allow you to open a file in an async fashion?

drewg123 · on Aug 1, 2018

Netflix kernel engineer here.. We use FreeBSD's async sendfile() and not aio, so it would be a bit harder for us to fix open latency, since we're not using aio.

I had not thought about open latency being an issue, that's fascinating. Looking at one of our busy 100G servers w/NVME storage, I see an openat syscall latency of no more than 8ms sampling over a few minute period, with most everything being below 65us. However, the workload is probably different than yours (more longer lived connections, fewer but larger files, fewer opens in general). Eg, we probably don't have nearly the "long tail" issue you do..

khc · on Aug 1, 2018

right I suspect you have way fewer files than we do and everything is in the dentry cache. Pretty sure that most of your files are bigger than 60KB too :-) (which is our p90)

the8472 · on Aug 1, 2018

At my job we have to open many small files from NFS. The latency of open() absolutely murders sequential performance (>80 seconds just to open a scene descriptions). Prewawrming the fairly shortlived NFS access caches in parallel evaporates most of the performance penalty.

drewg123 · on Aug 1, 2018

Wow, it is a different world :)

khc · on Aug 1, 2018

our machines also run many different services (the CDN product is just one of them) and isolating I/O from different products is difficult. I'd also love to have NVMe.

solarengineer · on Aug 2, 2018

Any chance you could use dtrace to isolate process specific I/O?

khc · on Aug 2, 2018

"isolate" here I meant prevent one process's IO from affecting another one

pg314 · on Aug 2, 2018

Have you looked into using something like SQLite instead of the filesystem? [1]

[1] https://www.sqlite.org/fasterthanfs.html

Kalium · on Aug 2, 2018

SQLite makes a ton of sense for systems that don't need to worry about concurrent writes. It's possible that a CDN's cache system might need to concern itself with concurrent writes.

khc · on Aug 2, 2018

Other than the concurrent write that another comment mentioned, looks like this test is done with a data set that fits entirely in RAM. I wish we have enough RAM for the entire internet :-(

pram · on Aug 2, 2018

TIL what a 'dentry' is!

crest · on Aug 2, 2018

A few more aio_* syscalls would really simplify things a lot. I suspect the most important missing aio_* syscalls are aio_open()/close() and aio_stat(). The semantics for async open/close would be tricky.

ioquatix · on Aug 2, 2018

8ms actually seems quite slow at that level.

drewg123 · on Aug 2, 2018

Yes, that's the absolute worst one out of a large sample size (far less than a fraction of a percent). I suspect that openat() was particularly unlucky, and interrupted multiple times..

shub · on Aug 2, 2018

IOCP doesn't do it[1]. Well, if it does then it's not documented. You can post custom completion packets so at first glance it looks easy to make open/close be async...I think there is probably a good reason why NT won't do that for you.

That's pretty awesome though, that you have to worry about latency of open().

[1] https://docs.microsoft.com/en-us/windows/desktop/fileio/i-o-...

rossy · on Aug 2, 2018

Nope, CreateFile and NtCreateFile are synchronous, only reading and writing are asynchronous.

derefr · on Aug 2, 2018

Why open(2) and close(2) all the time? If I hit this problem—and hacking on Nginx itself were an option—then I'd make the following Nginx changes:

1. at startup, before threads are spawned, find all static files dirs referenced in the config, and walk them, finding all the files in them, and open handles to all of those files, putting them into a hash-map keyed by path that will then be visible to all spawned threads;

2. in the code for reading a static file, replace the call to open(2) with a look up against the shared file-descriptor from the pool, and then a call to reopen(2) to get a separately seekable userland handle to the same kernel FD (i.e. to make the shared FD into a thread-specific FD, without having to hit the disk or even the VFS logic.)

3. (optionally) add fs-notify logic to discover new files added to the static dirs, and—thread-safely!—open them, and add them to the shared pool.

This assumes there aren't that many static files (say, less than a million.) If there were magnitudes more than that, in-kernel latency of modifying a huge kernel-side FD table might become a problem. At that point, I'd maybe consider simply partitioning the static file set across several Nginx processes on the same machine (similar to partitioned tables living in the same DBMS instance); and then, if even further scaling is needed, distributing those shards on a hash-ring and having a dumb+fast HTTP load-balancer [e.g. HAProxy] hash the requested path and route to those ring-nodes. (But at that point it you're somewhat reinventing what a clustered filesystem like GlusterFS does, so it might make more sense to just make the "TCP load-balancing" part be a regular Nginx LB layer, and then just mount a clustered filesystem to each machine in read-only-indefinite-cache mode. Then you've got a cheap, stateless Nginx layer, and a separate SAN layer for hosting the clustered filesystem, where your SSDs now live.)

khc · on Aug 2, 2018

I think you are underestimating cloudflare's scale here. Obviously we do shard across many machines but each one still has many more files than what's reasonable to keep open all the time.

lossolo · on Aug 5, 2018

This will not scale at CF and is not compatible with their current architecture.

kentonv · on Aug 2, 2018

The use case here isn't static files, it's an HTTP cache.

jasode · on Aug 2, 2018

>Does NT allow you to open a file in an async fashion?

No.

As a side note, there was an interesting old hn subthread about async disk i/o of philosophies Windows NT vs Linux:

https://news.ycombinator.com/item?id=11865760

zzzcpan · on Aug 1, 2018

If you mean IOCP API that requires you to hold memory hostage for it, then no, NT sucks at it. The idea behind that API only works well if there is no switch between kernel space and user space. Otherwise you can emulate that awful API with threads on unix-like systems, it will perform the same.

Don't get me wrong though, caching/file serving in nginx was always very hacky. And blowing latency by blocking on filesystem and yet relying on vfs cache underneath is just one side effect of that.

acdha · on Aug 1, 2018

> If you mean IOCP API that requires you to hold memory hostage for it, then no, NT sucks at it.

“hold memory hostage” really doesn't help make your point. It's inflammatory wording one step above spelling it Micro$oft and it makes it sound like you don't understand the engineering trade-offs which the Microsoft engineers made. It'd be much better if you explained _why_ you disagree with their choice and especially whether there are differences in the type of work which you've used it for which might explain that opinion.

brian-armstrong · on Aug 1, 2018

[flagged]

acdha · on Aug 1, 2018

I remember Slashdot in 1997, too, and would like to avoid repeating it. HN is poorer for substance-free jokes.

smt88 · on Aug 2, 2018

I've found reddit to be substantive, sometimes on par with HN, but filled with jokes. Depending on the subreddit, you may be able to scratch your itch there.

2038AD · on Aug 2, 2018

I've found most jokes on reddit seem to be attempts to get upvotes.

Then again I'm not a fan of the upvote system generally. One thing I dislike is that downvoted comments are hidden. It equates disagreement with spam and discourages people from expressing genuine beliefs and reactions.

I was going to make sure I haven't downvoted anything here but I can't find if there's a way to check.

orf · on Aug 1, 2018

I get the feeling you've had a bad experience with the API and that colours your opinion of the overall design of IOCP. It's really quite well thought out.

Also in my experience it handles slightly more concurrent connections with a fair bit less CPU usage. There are some weird and not very nice parts + limitations to the API that makes it pretty hard to write cross-platform things with it, granted.

johncolanduoni · on Aug 2, 2018

The memory is either held hostage in your buffer or the kernel’s. For that matter, if you’re making a blocking read/write call the memory is just as much a hostage.

sbjs · on Aug 1, 2018

Thought you were emphasizing a lot of your text, until I realized you just have two \*nix in your comment but didn't escape the asterisk!

the8472 · on Aug 1, 2018

Well, there's opportunistic nonblocking IO which you can do on your event loop with readaheads (assuming you know your access patterns) and then RWF_NOWAIT and fall back to a thread pool when that fails. Of course that only helps if you're reading from a file into a userspace buffer. If you want to sendfile that doesn't help.

tatersolid · on Aug 2, 2018

How can you use sendfile and TLS at the same time? Aren’t they incompatible?

crest · on Aug 2, 2018

There are patches for FreeBSD and Linux sendfile respectively to perform TLS symmetric encryption inside the kernel and use out of band signaling for the key exchange.

halayli · on Aug 1, 2018

Freebsd have had kqueue for the longest time. Unfortunately Linux didn't adopt it for political reasons and osx did a half port and dropped disk io support.

tedunangst · on Aug 1, 2018

kqueue doesn't work for disk. It always reports ready to read.

johncolanduoni · on Aug 2, 2018

If you try to use the readiness-based kqueue events that is the case, but the completion-based aio events do work with disk files on kqueue.

tedunangst · on Aug 2, 2018

Ah, that slipped my mind.

crest · on Aug 2, 2018

On FreeBSD kqueue does work for normal files if you use aio_* and SIGEV_KEVENT.

hyperman1 · on Aug 3, 2018

https://blogs.msdn.microsoft.com/oldnewthing/20180725-00/?p=...

This is an old new thing post which demonstrates it is easy tot fall off the happy path on nt. Im not sure the windows world is much better off here.

erikb · on Aug 2, 2018

I love how happy they are to work around blocking open().

This is a very common way of thinking. But in fact there are only two ways to handle I/O. And no matter what you do, you always end up with one of them:

Path 1, blocking I/O: When you have blocking I/O your process continues to the point where the I/O starts, then sends the corresponding request to the kernel and waits until it gets a respond, potentially forever. This is very low resource usage, but sometimes-hanging-forever is quite a huge price. So usually people put this I/O stuff in a thread/fork and use the parent to have a timeout waiting.

Path 2, non-blocking I/O: In this version when the process hits the I/O it will almost-immediately fail when the desired resource (e.g. file, port, whatever) is not available. So usually you are writing a loop and constantly poll for the resource to become ready. This obviously has a rather high cost on resources, because your code gets more complicated (loops, exceptions, etc) and whatever you are I/Oing to has more activity (e.g. if you poll a webserver you constantly create load on that webserver for each client process). But also an advantage is that you can't hang forever, because you usually break the loop after x seconds or y amounts of retries.

You might feel it sucks (at least I do) but there are not more options. Decide for one version that you can live with more easily, tune the variables you can fiddle with, like timeouts/retries, and then move on to other problems.

noncoml · on Aug 2, 2018

Relevant stackoverflow question, which makes for an interesting read:

https://stackoverflow.com/questions/22780822/linux-kernel-ai...

LinuxBender · on Aug 1, 2018

Does Cloudflare submit all of it's improvements to Nginx upstream? Do nginx.org accept / merge the improvements?

khc · on Aug 1, 2018

Author here.

I've talked to a nginx product manager and he's told me changes that are specific to one customer are unlikely to be accepted.

Also in our implementation we took some shortcuts so it may not be suitable for upstream as is anyway

LinuxBender · on Aug 1, 2018

Understood. Perhaps you could make your improvements modular in some cases so that people can toggle them on, either as nginx modules, or compile flags in nginx core?

khc · on Aug 1, 2018

In this case it's not possible to implement as nginx modules, but we are looking into releasing the patch as is.

devwastaken · on Aug 1, 2018

It would be really great for some public projects if your internal modifications and updates to nginx were a public repo. That repo could be compiled and packaged for use of open source projects that benefit from those modifications. I say this because I've seen multiple patches from cloudflare around, but it's very difficult for 1 person to go through all of that, know what version of nginx it's for, and modify the patch for newer versions of NginX like security updates. If you modify nginx internally I don't doubt there's lots of various changes and improvements over time that don't get organized or published publicly.

I think it'd be great if more companies released their own 'opinionated' versions that update with their infrastructure. Like if I wanted to host a OpenStreetMaps tiling server hypothetically using some features cloudflare has in their nginx builds. Makes it easy for white-hats to test, too.

IIRC I've been intersted in HPACK for small http responses where the headers are larger than the body, but if I wanted to use the HPACK patch I have to re-impliment it every time an update comes out that modifies the file.

thezilch · on Aug 2, 2018

Any chance we can expect the patches to be applied to the nginx that ships with OpenResty? ;)

LinuxBender · on Aug 1, 2018

Very nice! Thankyou for the improvements that your team is making.

bussetta · on Aug 2, 2018

looking forward to it!

always_good · on Aug 1, 2018

[flagged]

LinuxBender · on Aug 1, 2018

I ask because I have had many challenges with nginx bugs and improvements. Maxim and I have fundamental disagreements about many things. Anything Cloudflare can do to make improvements upstream are appreciated.

ksec · on Aug 2, 2018

I see. I am wondering if you ever had a look at H20, which I think is being used by Fastly.

mp3geek · on Aug 2, 2018

Submit them to your own github repo?

jgrahamc · on Aug 7, 2018

Here's the upstreamed patch: http://mailman.nginx.org/pipermail/nginx-devel/2018-August/0...

kev009 · on Aug 1, 2018

https://www.slideshare.net/facepalmtarbz2/new-sendfile-in-en...

iopuy · on Aug 2, 2018

Does nginx still force you to recompile the program to get access to the web application firewall? I remember this being a sticking point years ago when evaluating the product.

merlincorey · on Aug 2, 2018

No, NginX supports dynamic modules now.

brian-armstrong · on Aug 1, 2018

If they had written it in Rust this never would have happened

deathanatos · on Aug 1, 2018

My understanding is that most of Rust's standard library translates to the same blocking calls on Linux, so it would be plagued with the same issues as C would be. (And the solution reached in the article would similarly work in Rust.)

There are certainly async I/O libraries for Rust (e.g., Tokio), but those are going to be limited by the primitives the OS gives them. (AFAIK, Tokio's core libraries don't directly do async disk I/O; there is tokio-fs, and it does it by shunting the work to a threadpool.) The fact that disk I/O is so uniquely special on Linux effects any language, as it is an aspect of the kernel itself.

I love Rust, and while there are a ton of compelling reasons to use it, I don't think it's fair to say it would have prevented this from happening, in this particular case.

dang · on Aug 1, 2018

Maybe not, but please don't post unsubstantive comments here.

hjnilsson · on Aug 2, 2018

Perhaps, but the first version of Rust appeared 6 years of nginx was launched, so it was never a choice.

However, development of a new web server in Rust is an interesting project, and could certainly compete with nginx over time if enough effort was put into it.