Let's Do Some Science: Zed's Test Stack for poll vs. epoll

codexon · on Aug 4, 2010

Zed is probably right that poll is faster than epoll when all descriptors are active. There's no reason to doubt this since epoll has a callback for every file descriptor, while in poll, the OS only has to fill up an array of flags once per call.

http://lwn.net/Articles/14168/

However, I will have to side with jacquesm in saying that regular internet servers will usually benefit from epoll. It is unfathomable that web servers have lots of active file descriptors especially considering the "keep-alive" and that even broadband latency is an eternity for modern cpus to wait. On the other hand though, using poll makes sense for a heavily trafficked file server.

zedshaw · on Aug 4, 2010

There's nothing in anything I've said that disagrees with your assertion that regular web server (not internet servers) have lots of idle connections. What I've been saying all along is something very simple.

If the ATR > 0.6, poll wins. If ATR < 0.6, epoll wins.

That means, there's potential gains to be had by using both, and at a minimum you can use both and it won't hurt you very much.

Additionally, Mongrel2 isn't a regular server. Remember all those pesky 0MQ backends? Those basically crank data fast as shit, so this research could potentially benefit them. It also points out potential flaws in previous analyses, so people can now potentially use ATR to tune their server's work loads appropriately.

But notice I keep saying potentially? So far I'm the only one offering everything up, all data, everything for anyone to prove me wrong, and I actually don't think I'm totally right.

I don't see very many other people who criticize the idea doing the same. Just more comments on HN.

jfager · on Aug 4, 2010

What I've been saying all along is something very simple.

If the ATR > 0.6, poll wins. If ATR < 0.6, epoll wins.

You left out the part where you accused people who advocate using epoll of being bullshitting touts who perpetuate fallacies:

"why the touted benefits of epoll and most of the information out there is mostly bullshit because of some falacies people seem to have about epoll (mostly perpetuated by epoll's proponents)."

Which you proceeded to back up with a strawman argument about claims of O(1) performance and a rant about how graphs that show that epoll accomplishes exactly what epoll is designed to accomplish "flat out get it wrong".

You're using "ATR" to find the inflection point in the performance curve from epoll's constant overhead per active fd, for the purpose of micro-optimizing Mongrel2. That's fine. But when you claim that people who just use epoll because it solves the problem they actually need solved are making decisions based on "bullshit", you shouldn't act so surprised that they don't respond positively.

zedshaw · on Aug 4, 2010

The touted benefits of epoll are mostly bullshit, since it was advocated as being O(1) and frequently mentioned as being O(1) and always faster than poll by others. In fact, right after mentioning that I thought epoll wasn't faster the first thing people said was it was O(1). Every person I talked to said it.

I have evidence that contradicts both of the main assertions of epoll, so until someone comes up with counter evidence I can safely say that epoll information is bullshit, and my argument is nothing like a straw man.

So yes, it is bullshit, and if that offends you then awesome. Because nothing's worse than blindly believing some load of crap when you could have a better understanding of what you're doing.

jfager · on Aug 4, 2010

I googled 'epoll "O(1)"'. You're the first hit (congratulations), followed by some random person on stackoverflow, the documentation for a CPAN module, the repetition of that documentation on github and in various package repos, and some random bloggers and mailing list responders who are either just referring to the work that you have to do to find an active fd when the wait call returns, or are confused.

The man page says nothing about O(1):

http://www.kernel.org/doc/man-pages/online/pages/man4/epoll....

The original epoll writeup says nothing about O(1), and shows graphs explicitly for dead connection scenarios (iow, it never claims better performance per active fd):

http://www.xmailserver.org/linux-patches/nio-improve.html

The Dan Kegel writeup that's pretty much universally cited on this issue says nothing about O(1):

http://www.kegel.com/c10k.html#nb.epoll

The link you threw up yesterday that "flat out gets it wrong" says nothing about O(1) and simply shows epoll doing exactly what it was designed to do (again, it never claims better performance per active fd), including graphs that actually do show poll outperforming epoll with small numbers of dead connections:

http://lse.sourceforge.net/epoll/index.html

Can you show any authoritative sources who claim epoll is O(1)? If not, then yes, you built a fluffy little strawman and took him down hard.

I have no problem with your work on superpoll. I'm skeptical it will help much, guessing there will be situations it will hurt, and overall not convinced it's worth the effort to find out, but it's your time. Have fun with it, and I'm eager to see the results.

I just don't get why you feel the need to imply that everyone who ever used epoll without finding the ATR performance cut-off is a blind moron snowed into doing so by some shadowy malevolent epoll cabal spreading misinformation about how it works.

pphaneuf · on Aug 5, 2010

I've never perceived epoll as being O(1), always O(N), the big thing being the N being how many events are dispatched, rather than the number of watched fds.

I'm fine with the amount of time taken scaling linearly with the amount of work to do, I'm not cool with it scaling linearly with the number of flowers in the garden. :-)

rbanffy · on Aug 4, 2010

> there's potential gains to be had by using both

Only if your ATR crosses the .6 boundary frequently. If it consistently remains on either side, the performance would be worse at the cost of added complexity.

I agree with your assumption we need more instrumentation. I strongly suspect my ATRs are consistently closer to zero, but I never looked under the hood to check.

Also, the ATR should vary according to what you are serving. Longer streams would drive the ATR up while serving shorter bursts would drive it down.

zedshaw · on Aug 4, 2010

Exactly, nobody knows because they've been operating under the assumption that epoll is always faster so they never looked.

Now I've got some evicence that epoll isn't faster, and in fact it's a wash at 40% "utilization", which is sort of ridiculous. I'd want people to go and test and see what they get, or at least know the implications so they pick wisely.

Ultimately though, I'd rather have the Linux kernel just fix epoll so it's always faster than poll.

rbanffy · on Aug 4, 2010

> Ultimately though, I'd rather have the Linux kernel just fix epoll so it's always faster than poll.

Unless they break poll, that's very unlikely. The implementation of poll is much simpler and should be faster per fd. What could happen is that epoll gets some improvement that moves up the .6 boundary but I seriously doubt epoll could ever be faster than poll with 100% active fds - and I never encountered that kind of usage.

As with any optimization, we should be concerned with end result: how much slower does using epoll really makes the system. I am on the skeptical side here.

pphaneuf · on Aug 5, 2010

Hey, some idea I remembered (from some old paper): don't go from epoll to poll when the ATR gets too high, go to... NOTHING!

Just read() or write(), as if something told you they were ready! Your code has to be able to handle EAGAIN anyway, right? If you're at 1.0 ATR, you're saving the whole cost of poll or epoll!

I'm actually serious here, BTW. It would need some measurement to figure out the threshold at which this is okay, and if it's close enough to 0.6, I'm guessing you could go straight from epoll to nothing, and take the small (potential) performance hit between 0.6 and 0.X.

Hmm, I'm an idiot: this is basically epoll in edge-triggered mode.

derefr · on Aug 4, 2010

I believe that, other than in the new-for-the-sake-of-new sysadminning camp, people almost never start by deploying Mongrel. They deploy Apache, because it's easy (every distro comes with an excellent default set of LAMP packages, for obvious reasons.) Thus, Mongrel (or any other non-mindshare-majority server) is only sought out by those who don't like the one they've got—that is, by people who are not getting the performance they desire.

Mongrel is probably not used by Joe the VPS user, serving 20 simultaneous connections per hour, because Joe is doing just fine with the Apache install that came with his slice. Mongrel is deployed for high-traffic file servers, as the current weapon of choice.

It's very likely that that will continue to be its niche (though it will expand out somewhat as larger shared-hosting providers start to default to it), so it's very likely to mostly see use in high-ATR deployment scenarios. Under that assumption, Zed's experiment is solid. I would still like to see a prologue tacked on about the composition of the Mongrel user base, though, if Zed has it :)

Edit: rephrased to... sound less like Zed?

yummyfajitas · on Aug 4, 2010

Thus, Mongrel (or any other non-mindshare-majority server) is only sought out by those who don't like the one they've got - that is, by people who are not getting the performance they desire.

This is probably true for most non-apache servers now, but I don't think it will necessarily be true for Mongrel2.

Mongrel2 gives us a new development model - http calls -> messages, language doesn't matter. I can easily mix python and c++/Haskell, for instance [1]. It's rather different from what we've had before.

Mongrel isn't just a faster apache; people might use Mongrel2 for features rather than performance.

[1] Last time I tried to do this, I basically put a halfassed version of mongrel2 behind apache/django. I.e., apache/django handled http and sent messages to my c++ daemons. Fugly.

zedshaw · on Aug 4, 2010

Exactly, this is why I was curious about the platitude that epoll wins all the time.

lhorie · on Aug 4, 2010

I'd be interested to see if the idea of switching between epoll and poll at 60% ATR would have too much overhead as to be impractical.

Re: > 60% active fd's being real-life or not, I don't really see what's the problem in wanting to handle that case in a project whose goal is basically to keep humming along nicely regardless of what you throw at it.

pphaneuf · on Aug 4, 2010

Adding/removing from the poll() array is more or less free, but with epoll, you have to use epoll_ctl(), so you'd definitely want an hysteresis to avoid bouncing things around too much.

zedshaw · on Aug 4, 2010

Interesting, you're like the 3rd or 4th person who's mentioned hysteresis.

rbanffy · on Aug 4, 2010

It's a fairly common idea.

pphaneuf · on Aug 5, 2010

As in "only the 3rd or 4th" (so I'm possibly clever), or as in "a bunch of people"? I don't know your sample size. :-)

tzs · on Aug 4, 2010

The network interface in general on most operating systems seems kind of absurd from a server point of view. You've got data coming in from N clients. The data from all those clients gets physically multiplexed into a single data stream that comes in over the wire.

Then in the kernel you've got something that demultiplexes it back into N separate streams, which it accumulates in N different buffers.

The server software is aware of all N of these streams. Through some mechanism the kernel notifies the server software that data is available on some subset of the N buffers, and the server software has to ask for that data.

If the server software is using a queue and thread model, it likely then gets that data which the kernel has gone to great effort to demultiplex into N streams and provide it through N separate file descriptors--and multiplexes back onto a single queue for the worker threads to pull from!

Why not get rid of some of this demultiplexing and re-multiplexing? Let me have a single file descriptor, and everything that comes in destined to my port comes through that single file descriptor, as a message with a header that tells what remote IP address and port it came from and the length, followed by that many bytes of data.

Instead of N separate buffers in the kernel, just have one big buffer per port. You'd still have to keep track of how many bytes in that buffer are associated with each remote connection, for purposes of dealing with TCP flow control, but I think this could be made to work reasonably (yes, I'm hand waving right now).

Now you don't need poll or epoll. Heck, you don't even need to use select. I'll just have a blocking thread whose job is to just read that one file descriptor using blocking reads in a loop to pull in data and put it on my queue.

Looking at this slightly different--everything destined for, say, port 80 is going to the web server, so why does the web server want this to come in through N different file descriptors? Especially when you consider that the code its going to run is the same no matter which it comes in on? All the multiple file descriptors are really accomplishing is providing a way to tell which bytes came from which client. Is that really the best use of file descriptors, as opposed to just having the kernel tell you which client the data came from when you do your read system call?

I can see that the N file descriptor model made sense in the days before threads, when to get concurrency you had to use multiple processes. You then need to have data coming in on port X from different clients not necessarily all going to the same process.

But for servers that are going to use the "single process with multiple threads" model to handle concurrency, a single stream network model would seem to be a lot nicer and efficient.

FooBarWidget · on Aug 4, 2010

> If the server software is using a queue and thread model, it likely then gets that data which the kernel has gone to great effort to demultiplex into N streams and provide it through N separate file descriptors--and multiplexes back onto a single queue for the worker threads to pull from!

This would indeed make no sense were it not for the fact that only a subset of the data received on the network card is actually intended for your server app.

You want a single file descriptor on which you can read data only destined for your port. At that point the data must already be demultiplexed by the kernel or you'd get the data destined for all possible ports.

I'm not sure not demultiplexing data will win you that much because the kernel had to demultiplex it anyway regardless. But if you want to implement this and try it out, by all means go ahead.

Wait a minute, let me get this right... you want to write your own TCP congestion handling code?? Sounds like you're interested in raw sockets, which are already available.

axod · on Aug 4, 2010

It may sound inefficient, but I've never seen it being a bottleneck. It needs to be separated out at some point, and the kernel is the best place to do that.

maybe if you're trying to serve 50,000 video streams from a single machine or something, but then you'll probably hit the limitations of your connection before you see any CPU usage from the networking code.

This whole discussion, as I've said before, seems like a classic case of premature optimization.

Networking code doesn't eat CPU. It's negligible.

Concentrate on all the BS dynamic functional language database crap people seem to insist on using at the higher levels. That's where all the wastage is.

jacquesm · on Aug 4, 2010

Technically nothing stops you from opening the raw device and to start reading packets. Sooner or later you'll need to figure out when your packets can be forwarded to the next layer for processing so work can be done for a particular connection.

And that's what the kernel did for you all along. The abstraction of file descriptors for clients is a useful one, it allows you to decide what to do. The decision here is not 'do I want all this data in one stream' or 'do I want it demultiplexed', the kernel does a fairly good job of the demultiplexing that needs to be done anyway (re-assembling a stream of packets in to functioning connections is non-trivial) and it exports that interface to all user processes to avoid having to recreate that useful abstraction in those user processes.

It's possible there is some gain to be had by 'rolling your own' and by getting rid of the kernel abstraction by collapsing all these layers in to a single user mode program.

But I think you'd soon find that you are simply re-implementing all that code, and that instead of a clean (poll or epoll, either is fairly clean) interface you'd be bogged down in maintaining your own tcp stack.

I once ran in to a situation like the one you describe. I was the programmer in charge of coding up a driver for a high speed serial card (high speed at the time, nowadays we'd laugh at it) doing X.25. I'd nicely laid it out as two separate pieces of software talking to each other using message passing on a weird but wonderful little os called QnX. One process to handle the data link layer, one to handle the X.25 protocol.

After a series of benchmarks with 8 such cards in a single ISA industrial enclosure, together with a (for the time) very expensive micronics 33 MHz motherboard I figured that maybe we can gain some performance by merging the two processes, to save on the IPC. The end result? Identical performance, in spite of the message passing overhead the bit-twiddling, checksumming and actual IO was such a large portion of the work done that the IPC didn't matter at all performance wise.

But it did matter to the code, the second version was decidedly harder to debug and maintain, and eventually that 'branch' was scrapped in favour of the one where the layers were clearly separated.

Poll and Epoll are communications mechanisms, they communicate with the kernel about the state of a bunch of fds. Both implement the same functionality (as long as you don't use the actual events of epoll) from a programmers point of view. Both have slightly different use cases, and for some situations it may be advantageous to use the one or the other. There may be a difference, but on the whole it will probably not be a very large one once you factor in all the other stuff that needs doing.

haberman · on Aug 4, 2010

> But I think you'd soon find that you are simply re-implementing all that code, and that instead of a clean (poll or epoll, either is fairly clean) interface you'd be bogged down in maintaining your own tcp stack.

You're not understanding OP's proposal. He's not talking about implementing TCP in user-space. He said the kernel would still be handling TCP flow control, and would only be sending data destined for a specific port to this fd. So TCP is still in the kernel.

He's just proposing that the payloads for all connections come in on a single fd instead of one fd per connection. But each read from the fd tells you what remote host it came from.

haberman · on Aug 4, 2010

Probably the biggest argument against doing what you describe is that the status quo gives you control over scheduling which fds to service (of the ones that are active).

If you processed incoming network data in strict FIFO order, a single host on your LAN could totally DoS your system by flanking you with incoming data. That host would never get TCP flow-controlled because if you're processing anything, you're processing his obnoxiously voluminous input.

If he was just on FD among many, you would service him once in a while, but then you would service other clients. While you're servicing other clients, he gets TCP flow-controlled, so he doesn't singlehandedly demand all your attention.

echaozh · on Aug 4, 2010

It totally makes sense. You should also be able to choose to use more than one socket (but not so many as there are clients), in case you still want to take advantage of all the cores available.

wglb · on Aug 4, 2010

Duplicate.

See lively discussion at http://news.ycombinator.com/item?id=1570694

jacquesm · on Aug 4, 2010

Not exactly, this is where Zed presumably shows that using pipes is equivalent to real world testing and has nothing to do with using the loopback/localhost device.

Personally I think that's nonsense. I downloaded the code, checked it out, there is nothing in there that changes my mind on the whole thing, the meat is in the C code and it's the same as what was linked from the article. The rest is a wrapper script and some stuff to visualise the results, which are what he showed them to be. The premise that web servers have the majority of their fds active is where the issue lies.

FWIW I did some measuring and the results are pretty much in line with what I expected (though the spread is still puzzling, I really intend to find out the cause of that): http://news.ycombinator.com/item?id=1573145

zedshaw · on Aug 4, 2010

If the test harness is wrong than every other test that everyone doing epoll comparisons with poll has done is wrong. This is the one they've all been using. So, either you write your own test to evaluate the same thing, without confounding it with other external factors (like this test does), or you use this test and prove me wrong by finding different data.

However, I'm glad you finally went and got some metrics to show one webserver instead of the useless rhetoric you were spewing before. Not that it proves a damn thing since it has no external validity, but at least it's a step in the right direction.

I also say that your findings of 40% are much higher than what you implied in your previous comments. You made it seem like it's close to the 10% range, but 40% is damn near ready to leap over into the equal performance levels. It's interesting that even after that you still can't see the value of either making epoll faster or using both, but whatever, you can't convince every idiot on the internet, even with Science.

jacquesm · on Aug 4, 2010

You should read more carefully. 40% was the highest observed, 10% the lowest. Also, I said 'the majority', which is not like making it seem like it is only 10%. So it's not '10%' or '40%', for this particular workload, on average you're hovering somewhere between 25 and 30%, since you seem to prefer the higher number make that 30%. And if you read a bit more you would see that there are some factors that could pull that number down considerably in different situations and only one major one (large file serving, streaming) that would pull it up.

If I get around to it (I'm on a holiday right now) I'm going to instrument some more sites to measure their active-to-total ratios to see if there is some kind of number (or a set of them) that we might be able to agree on when it comes to real life web traffic. One of them is a filedump, which serves up large media files to clients, that's the use case that I can come up with in which your scheme might pay off, and in which it is possibly advantageous (even if it is a very small advantage) to use poll.

You keep 'challenging' people to come up with numbers. So, I did, from a real life webserver serving real customers, lots of them. If 60% is the cutoff point (and we're assuming that you are right about that) then 40% is comfortably under that, not 'much higher' than what I wrote in my previous comments, the majority (60 to 90%) of the fds is idle. So it's not in to 'equal performance levels' yet.

But as far as I'm concerned even that is probably moot. (But that needs more testing) because I suspect that even this fairly high traffic server spends only very small amount of time on polling the OS, and a much larger time on shovelling bits in and out of the system.

You keep using that word 'science' as though measuring anything at all is automatically science.

But what you're doing is more like selecting the data that fits your 'who needs epoll' and 'epoll is bad' hypothesis and rejecting out of hand any experience and / or experimental data that seems to prove you wrong.

That's not science. Science is to use all the data available to you, not just the juicy bits. Not just those bits that allow you to make bold claims about discoveries when in fact all you've done is to show that the underlying implementation of one systemcall is different and has different trade-offs than another, which may be specific to the way your system is set up.

I agree that to test something you need to reduce it to the minimum of code required to do your tests. But when you're going that route you are doing what so many other database, CPU and application benchmarkers forget. The real world is nothing like a benchmark.

You to suggest that this is 'not a localhost test' (http://news.ycombinator.com/item?id=1573195) and you say so pretty loud. That would mean that there may be some validity to your data that I have not been able to find to date in the stuff you've provided.

I'm challenging you to provide the data that apparently is still missing that proves that this was not 'a localhost test'.

And if you don't supply that data then I think you should take back (and try to do it gracriously for a change) that this wasn't a local host test, or you're going to have to come with some definition of what you consider a localhost test to be.

For me, that definition is simple, a localhost test is a benchmark that never generates any network traffic.

Feel free to disagree.

gthank · on Aug 4, 2010

Having followed most of the comments on both these threads, I don't think it's fair to characterize Zed's position as saying "epoll is always bad". I think a better characterization would be: "epoll is not automagically better under every imaginable scenario".

You may argue that nobody is making any such claim, but I can tell you that the sections of the blogosphere I skim via HN, reddit, etc. do imply it. Maybe the authors just assume everyone can infer that every decision has tradeoffs and there's no need to discuss them in depth for this particular case, or maybe—and the nature of the blogosphere leaves me inclined to think that this is the case—most of those authors never consciously thought about those tradeoffs because epoll happens to be a good fit for their particular problem. Either way, Zed is the first person I've seen trying to explicitly identify and quantify those trade-offs for poll vs. epoll. As such, I think he's doing the community a service.

If this research leads to a better implementation of epoll, or a server that requires less tuning to do well under a particular load, that is great for everyone. Even if it doesn't lead to any great improvements, a better-educated developer population that makes more informed decisions is still a net win for the world.

pphaneuf · on Aug 5, 2010

Right on. Myself, I always figured epoll was a bit slower than poll when most/all of the fds are ready all the time, from reading the code.

Now, there's an experiment to figure out what the threshold is, and frankly, it's a bit lower than I had hoped (I would have guessed 0.9, 0.8 at worst), which is exactly why having the actual data is awesome.

zedshaw · on Aug 4, 2010

Thank you. I very much appreciate your comments.

gthank · on Aug 4, 2010

I appreciate the work you're doing. I'm not sure I've worked on a system that would really benefit from Mongrel2, but I think it's a pretty interesting concept, and there's always a chance I work on something more interesting in the future. Even if I don't, I almost always learn something or gain a different perspective from reading your stuff, and that's reason enough to be thankful.

isaacforce · on Aug 4, 2010

>I'm challenging you to provide the data that apparently is still missing that proves that this was not 'a localhost test'.<

I think what he was saying in relation to it being not 'a localhost test' is that it isn't a test that touches the network stack, even to the loopback interface -- you seem to be equating localhost with anything that happens on a single local machine regardless of whether it has anything to do with the network or not. A difference in use of terminology, perhaps.

zedshaw · on Aug 4, 2010

Yes, nobody seems to get confounding at all. I'm testing poll vs. epoll on selecting active file descriptors to compare their performance. I'm not testing anything else, not claiming anything else. To then test that with a full on server that has an Amazon EC2 cluster blasting requests with HTTP parsing and serving files would completely confuse the analysis.

You don't measure a specific thing by inventing some full on "real world" evaluation. You test that specific thing. Why is it this simple concept seems to baffle so many coders?

pphaneuf · on Aug 5, 2010

Your results seem correct to me, even though I'm a bit disappointed at the actual number (I was hoping for 0.8 or 0.9, not 0.6), and the code looks good. I'm meaning to run it for myself, but didn't have the time to do so, I should get on it shortly.

I still of the opinion that a >0.6 ATR means you're fucked anyway (where processing all those fds can't be done as fast as data arrives), but I'm trying to come up with a good small test for that hypothesis (like you did).

Until I do, it's just an opinion/hypothesis, and it's kind of hard to argue about it strongly either way.

zedshaw · on Aug 4, 2010

You know jack squat about statistics. That's basically how I summarize your statement. You don't have a mean, median, mode, or standard deviation in your "10% 40%". You don't understand confounding and want to test the performance of poll vs. epoll on file descriptors by testing an entire server. You think science is to use available data, when it's nothing of the sort, and in fact the way you remove confounding is to remove data.

Basically, you're a FUD slinger who's got a beef and doesn't know enough to hold his own in a real discussion about measurement and analysis.

scott_s · on Aug 4, 2010

So where does his reasoning go wrong?

lrm242 · on Aug 4, 2010

Not a dupe -- this is his follow-up post.

wglb · on Aug 4, 2010

My bad. This is a follow-up article.

Apologies.