The C10K problem (kegel.com)
112 points by luu on Feb 17, 2014 | 55 comments



Last time I built a server project from scratch it was on a 300MHz quad-core Xeon base, which hosted around 150 simultaneous web users. It took some effort to make it scale like that but it was worth it. My hardware maintenance was low because we really maximized the software capacity.

In modern times, RAM and CPUs are more than 10 times bigger and faster, but I am seeing people get around 25 times LESS out of them, because they choose terrible tools, don't benchmark, and generally don't care. Now (different company) our app server pool has 150 nodes and each serves 4 users. The application's complexity is significantly lower than what I built 13 years ago.

I sincerely doubt any of my coworkers have read this document. It shows.


A lot of the calculation is the $$$ you make per connection. For a SaaS kind of thing, you can generally keep that high enough to not worry about squeezing every last drop of efficiency out of things. A much bigger worry for most startup-ish companies is finding product market fit.


While I would agree that trying to squeeze every last drop of efficiency may not be worth the time/money for the company, I think the parent is implying that people are wasting a lot of extra resources due to poor tools, poor coding, poor profiling, poor maintenance, etc.

I would agree, though, that for startups the problem is finding where the $$$ is; still, it is important to know what the cost per user/connection/instance is and to be able to control or monitor it.


"It took some effort to make it scale"

It's often a lot easier to rely on horizontal scaling, especially on a strategic level when you're deciding whether to trust your organization's technical ability (which really means management's hiring ability) or your ability to call AWS and spin up another hundred servers.

"Easier" doesn't mean "cheaper" or "faster" in the long run, but it is lower-variance for sure.


Horizontal scaling isn't just about throwing hardware at the problem. It's about distributing your application over N nodes without single points of failure.


I'd argue there's one more criterion for horizontal scalability: an increase in hardware capacity (number of nodes N) must yield a proportional increase in system throughput.

In other words, horizontal scalability means being _able_ to throw hardware at the problem :-)

Anyway, I agree with fiatmoney's sentiment that if you're in a corporate environment where your goal is growth, it's often best to focus on ensuring that your systems can grow along with demand, in a confident, easy, operations-free way -- as opposed to ensuring you've eked every bit of performance out of the hardware you have. In many business projects, the cost of human engineering time exceeds the cost of hardware.

Even when considering large-scale systems, it makes sense to scrutinize optimizations. If you had time to implement only one of these two equal-effort projects, which would you choose? (a) An optimization project expected to reduce a recurring yearly cost by $1 million (b) A new feature development project expected to earn an additional $2 million yearly recurring revenue?

I've found it effective to look at all engineering decisions as business decisions. From a business perspective, the mistake at the center of fallacies like premature optimization is the failure to consider opportunity cost.

(On the other hand sometimes pushing a technical project as far as you can is its own reward. Not everything in life has to be about business. Whether or not the results are widely relevant to typical businesses, I look forward to seeing the research into the C10K and C10M problems!)


> If you had time to implement only one of these two equal-effort projects, which would you choose?

Impossible to answer without knowing the gross margin of your business. For a big-box retailer (low margins), take the first. For a high-(gross)-margin business, take the second.


People are often too optimistic about these additional $2m, while cutting yearly cost by $1m would allow you to hire several more developers, or increase your runway to profitability. There is a business case for both...


The 10k problem is still not solved. A run of the mill VPS typically manages around 1k simultaneous connections and 10k requests/s. That is, over plain http. Over TLS this goes down to about 100 connections and 600 requests/s. A cheap amazon instance will usually be around 1/5th of all that.

With a bit of tweaking you can get the plain http case a bit higher, but the TLS route will not get much better, because TLS isn't written with speed in mind (the handshake is slow and expensive) and the prevalent implementation (OpenSSL) is not written for high performance servers (it basically dictates that you run blocking sockets in a thread per connection).

Unfortunately SPDY and the various HTTP/2 proposals rely on TLS (in order to punch through proxies), which sets server performance back about 10 years.

So it is of little surprise that companies have started offering "cloud" solutions, because the typical VPS can't handle today's high-traffic internet over TLS (worse than plain http by a factor of 10) and the typical cloud server is worse than a VPS by a factor of 5, creating a 50x performance degradation, artificially. Obviously when faced with the question of running 10 servers, or 100, most small companies turn to the even worse "cloud" solution, requiring even more servers (500).

The whole affair is a sodding mess, and we're wasting massive amounts of energy and capital by insisting on doing things inefficiently. By rights our VPS servers should easily be able to break through the 10k limit in every way, but they can't, because the OS wastes a lot of time running an inefficient network stack and because TLS and OpenSSL can't be bothered to get their act together.

And that is how, in the year of our lord 2014, more than 15 years after somebody wrote the 10k problem article, and after webservers have become at least 128x faster than they were in the 90s, most websites out there still cannot stand up to serious traffic, take ages to load, and are hosted by infrastructure (routers and whatnot) that easily buckles under even light DDoSing.


There are many commercial alternatives to OpenSSL. I've worked with one or two. There's bound to be one that suits your needs. Of course you may also need to go to a full commercial, non-FOSS stack, but I'm sure plenty of companies will have those to sell too.


When it comes to transactions and stateful sessions, C10K, or even C1K, is far from solved yet.


But that's a bit absurd, isn't it? A recent Core i7 reaches 124,850 MIPS. That means that if it takes 1/1,000th of a second to handle a connection, the CPU could execute 124 million instructions during that time. At 1/10,000th of a second it could still execute 12 million instructions. Minimal asynchronous connection handling certainly doesn't require more than, say, 10,000 instructions, so our servers are under-delivering on their performance by a factor of at least 1,200x, perhaps even 12,000x.

Our servers should be able to surpass C1K easily; even C10K shouldn't tax them. They should, by rights, only be taxed by the C10M problem.
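(Rough arithmetic behind that factor, taking the MIPS figure above at face value and assuming ~10,000 instructions per connection; both numbers come from this comment, not from measurements:)

    124,850 MIPS / 10,000 instructions per connection ≈ 12.5 million connections/s (theoretical ceiling)
    typical VPS from the grandparent comment: ~10,000 requests/s over plain HTTP
    12,500,000 / 10,000 ≈ 1,250x left on the table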


There are two numbers that have not really changed: memory latency and processor clock speed.

The latency of a main memory read is still around 100 ns, and it has been for over 10 years now. It means your CPU has to wait hundreds of clock cycles for a read from RAM if the data isn't in cache, and with huge datasets it probably isn't.

The other issue is processor clock speed. Yes, it is true that a modern i7 can reach 124,850 MIPS. However, that number comes from having 4 cores, each of which can reach up to 8 instructions per clock. You are still limited in executing dependent instructions.

That sounds like a lot, but one must remember that it reaches 8 instructions per clock only when the instructions are a good mix of float/int operations, there are no branches, and the instructions are not dependent on each other. In practice you reach maybe 1-2 instructions per clock. In some code it can even drop to 0.5 IPC (a bunch of unpredictable branches and whatnot).

Writing code that takes advantage of large memory bandwidth despite poor latency, combined with massive CPU performance when the instructions are not too dependent on each other, is almost like writing modern GPU programs.

It would be interesting to see what kind of web server perf one could get by carefully writing it in OpenCL (using a CPU target, not GPU).
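(For anyone who hasn't measured this themselves: the classic way to see that ~100 ns figure is a pointer-chasing micro-benchmark over a working set much larger than the last-level cache. A minimal, untested sketch; the 512 MB size is arbitrary and the single-cycle permutation just defeats the prefetcher:)

    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (64UL * 1024 * 1024)   /* 64M slots * 8 bytes = 512 MB, far beyond any L3 */

    int main(void) {
        size_t *next = malloc(N * sizeof *next);
        if (!next) return 1;

        /* Sattolo's algorithm builds one big random cycle, so p = next[p] visits
           every slot in an order the hardware prefetcher cannot predict. */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        size_t p = 0;
        for (size_t i = 0; i < N; i++)
            p = next[p];                      /* every load depends on the previous one */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per dependent load (ignore: %zu)\n", ns / N, p);
        free(next);
        return 0;
    }

Shrink N until it fits in cache and the same loop drops to a few nanoseconds per load, which is the whole point of the comment above in two runs.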


Yeah, I'm not disputing that there are bottlenecks in the system (it's not only the memory; the bus between NIC and CPU is also to blame).

But blaming bad I/O performance on large datasets misses the point slightly. You can perfectly well write a program that doesn't use the heap and uses less stack than those processors have L2 cache... (of course, that's a test program).

But the network performance is probably not bound by system latency as much as by an abysmally bad software stack, starting with the kernel, through the networking stack, to the sheer idea of TLS and its implementation (OpenSSL).

Indeed, there have been calls to get rid of it all: ban the OS from all but one or two cores, throw out the whole network stack and its layers, and implement the networking directly in the application that needs to do it.


> there have been calls to get rid of it all: ban the OS from all but one or two cores, throw out the whole network stack and its layers, and implement the networking directly in the application that needs to do it

You certainly know building and supporting that would cost more or less the same as building and operating a sizeable datacenter. If it succeeds.

Using all the processing power a modern CPU offers on real code with real data is almost impossible. And it's not only memory latency and instruction interdependence - there are latencies all over a PC even before you leave the rackmount chassis. The supporting network is another source of uncontrollable latencies. Most apps I manage spend 99.99% of their time waiting for something to happen, be it the next packet or the results from another server, which is actually a cluster behind one or more load balancers.

You may get some better cache hit ratios by tweaking thread/core affinity, but it won't take you to where you want to be.

If you really need that much performance, I'd suggest building your own VLIW architecture and generating the instruction mix on auxiliary CPUs, as a single continuous thread produced on the fly from all incoming requests, for the VLIW core to devour. That would be a huge undertaking, but it would also be pretty cool CompSci.


> You certainly know building and supporting that would cost more or less the same as building and operating a sizeable datacenter. If it succeeds.

There are plenty of solutions for that already. For the simplest case, user-space networking, you can pick from a number of "off the shelf" solutions:

http://lukego.github.io/blog/2013/01/04/kernel-bypass-networ...

http://www.openonload.org/


Not really true unless you use crappy software.

I run a few thousand simultaneous connections on AWS small instances without any issues. (HTTP + HTTPS).

The key is writing good software. Which a lot of programmers are still really bad at.


You know, it's cute when somebody who wasn't even born when I started writing software for servers tries to discredit me.


It's cute when you make ridiculous assumptions about when people were born.

Let's stick to the facts...

You stated that Amazon instances couldn't cope. They can. They can cope just fine with thousands of simultaneous HTTP and HTTPS connections. If you can't get your Amazon instance to do this, you're using crappy software.

> "The 10k problem is still not solved. A run of the mill VPS typically manages around 1k simultaneous connections and 10k requests/s. That is, over plain http. Over TLS this goes down to about 100 connections and 600 requests/s. A cheap amazon instance will usually be around 1/5th of all that."

Where the hell are you getting those poor figures from? Are you using Lisp or something?

> "and the prevalent implementation (OpenSSL) is not written for high performance servers (it basically dictates that you run blocking sockets in a thread per connection)."

Again - you're using crappy software. Don't use it.

The 10k problem is historically interesting, but for anyone who knows how to program, it's not really an issue.


If you use anything like Apache, lighttpd, Cherokee, Nginx, Tux, etc., they all use OpenSSL and nothing else.


Well don't use them then!

Seriously. Writing a webserver isn't a big job. It's not very hard to program something that runs better than the ones you list, for specific use cases.
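(To put a rough scale on "isn't a big job": the event-loop core of such a server really is small. A minimal, untested sketch using epoll on Linux - no HTTP parsing, no TLS, essentially no error handling, and the port number is a placeholder:)

    #define _GNU_SOURCE                       /* for accept4() and SOCK_NONBLOCK */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int lfd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
        int one = 1;
        setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof one);

        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);          /* placeholder port */
        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, SOMAXCONN);

        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
        epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

        struct epoll_event events[1024];
        char buf[4096];
        for (;;) {
            int n = epoll_wait(ep, events, 1024, -1);
            for (int i = 0; i < n; i++) {
                int fd = events[i].data.fd;
                if (fd == lfd) {
                    /* drain the accept queue; new sockets join the same epoll set */
                    int cfd;
                    while ((cfd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK)) >= 0) {
                        struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                        epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
                    }
                } else {
                    /* echo whatever arrived; a real server would parse HTTP here */
                    ssize_t r = read(fd, buf, sizeof buf);
                    if (r <= 0) close(fd);    /* close() also removes the fd from epoll */
                    else write(fd, buf, (size_t)r);
                }
            }
        }
    }

Everything that makes it a real webserver for a specific use case (parsing, timeouts, backpressure, TLS) sits on top of a loop like this, which is where the actual work and the actual performance go.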


Been there, done that, but the fact is there is no alternative to OpenSSL that I could see if you want to have SSL/TLS 1.1, 1.2 and 1.3.


Any reason why none of the open implementations listed here are suitable:

http://en.wikipedia.org/wiki/Comparison_of_TLS_implementatio...

?


If only there was a way to replace the functionality provided by OpenSSL, by writing some code :(


To rewrite OpenSSL you'll need to be a security and cryptography programmer. And even if you are, the task is daunting to say the least.


It would be better if you both gave more details on what and how you do so the possible mistakes from both sides could be pointed out and we could learn from them.

From a quick glance at your post, I too suspect there is something wrong - your performance shouldn't be that bad, but there is not enough information to pinpoint where.


That's really too long to list, I could conceivably write a book about it.

But, fortunately, it's relatively easy to test. You get whatever server you prefer, install nginx along with a bunch of performance testing tools (like siege, ab, etc.), and then test different concurrency/load scenarios.

No custom software, and it's widely regarded as the fastest webserver out there.
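(Concretely, something along these lines; example.com and the numbers are placeholders, and the flags are from memory, so check the man pages:)

    # ab: 10,000 requests at a concurrency of 1,000, plain HTTP and then TLS
    ab -n 10000 -c 1000 http://example.com/
    ab -n 10000 -c 1000 https://example.com/

    # siege: 500 concurrent users hammering the same URL for one minute
    siege -c 500 -t 1M http://example.com/

Comparing requests/second and failed connections between the HTTP and HTTPS runs gives you exactly the kind of numbers being argued about up-thread.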


As I pointed out elsewhere, most recent servers are just overgrown IBM 5150 PCs, with much faster CPUs, immense amounts of memory and storage, and somewhat faster buses. They are desktop PCs misused as servers.


> and install nginx

Yeah there's your problem...


You can replicate the exact same experience (only worse) with any other webserver like apache, lighttpd, cherokee, etc.

Tux may be a bit faster, but it can't serve dynamic content nearly as fast as it serves files, so that's pretty much useless.


Have you tried other operating systems?


Have you tried writing your own webserver? Then tuning it to your particular use case?


Yes


Hmm, so I've written a webserver and am able to get thousands of concurrent requests on a small Amazon instance without issues.

And you've written a webserver and you can't.

Interesting...


Can we see your results?


If you care about efficiency you should not use VPSs. You should use dedicated servers.


2013 update:

The C10M problem

It's time for servers to handle 10 million simultaneous connections, don't you think? After all, computers now have 1000 times the memory they had 15 years ago, when they first started handling 10 thousand connections.

Today (2013), $1200 will buy you a computer with 8 cores, 64 gigabytes of RAM, 10-gbps Ethernet, and a solid state drive. Such systems should be able to handle:

- 10 million concurrent connections
- 10 gigabits/second
- 10 million packets/second
- 10 microsecond latency
- 10 microsecond jitter
- 1 million connections/second

http://c10m.robertgraham.com/p/manifesto.html


Read the "Other Performance Metric Relationships" part at the bottom of this page[1]. Basically, just because your machine may be able to physically hold 10 million connections open, does not mean your machine could handle opening 10 million connections in a reasonable amount of time, much less handling 10 million transactions in a reasonable amount of time. If you can't open that many connections at once, or process that many transactions, just being able to keep them open becomes moot.

This article[2] breaks down the issues fairly well. In order to handle this kind of traffic, you have to basically redesign huge swaths of technology that exist because we don't want to have to implement these things more than once. I don't see how anyone would invest in this without a specific itch to be scratched (like deep packet inspection).

[1] http://www.cisco.com/web/about/security/intelligence/network...

[2] http://highscalability.com/blog/2013/5/13/the-secret-to-10-m...


Big fan of kqueue(), mentioned in the article. IIRC with sockets, it not only tells you if the socket fd is ready (say, for a non-blocking read), but also tells you the number of bytes available to read, which allows you to write efficient code (i.e. not reading into some fixed buffer and looping again to see if more data needs to be read).

Also I think for files/directories, you can listen for any changes that occur.

Wish this was officially available in Linux.
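(A rough sketch of that pattern from memory, untested; read_exact and its calling convention are made up for illustration. For EVFILT_READ on a socket, kevent() reports the number of readable bytes in the event's data field, so the buffer can be sized exactly:)

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Wait until sock_fd is readable, then read exactly the number of bytes
       the kernel says are buffered - no fixed-size buffer, no read-again loop. */
    ssize_t read_exact(int kq, int sock_fd, char **out) {
        struct kevent change, ev;
        EV_SET(&change, sock_fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, 0);

        int n = kevent(kq, &change, 1, &ev, 1, NULL);   /* block until readable */
        if (n <= 0 || (ev.flags & EV_ERROR)) return -1;

        size_t avail = (size_t)ev.data;                 /* bytes available to read */
        *out = malloc(avail);
        if (*out == NULL) return -1;
        return read(sock_fd, *out, avail);
    }

On Linux the usual approximation is epoll plus ioctl(fd, FIONREAD, &bytes), which gets you the same count, just as a second syscall rather than for free.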


[deleted]


Yes.


(1999)?


Yes, here is the original 1999 snapshot. It has been updated a bunch over the years.

http://web.archive.org/web/19990508164301/http://www.kegel.c...

I think Dan wrote this page originally when he was at Disappearing, Inc. Later he was at Google (2004-10).


Mmm. It has been updated over the years, but of course, 10K was relatively easy even back then. For fun one day, I turned off the search index on one of the production machines and hit over 90K at Napster without much trouble (production ran at ~36K). And that was on little dual processor Pentium 2 machines.


If you don't have a blog, I think you should blog some stories from those days. I'd be really interested to hear them.


Arguably (2009) at the earliest. Was certainly originally written earlier but it seems like something of a living document. The first sentence after the table of contents is:

"See Nick Black's execellent Fast UNIX Servers page for a circa-2009 look at the situation."


If you scroll down to the bottom, there's a changelog describing some changes applied during 2003-2011, and then "Copyright 1999-2014", which might give a few clues about how old the document is.


I was definitely using this page as a resource in 2000-2001. I was going to try to use the Python Medusa module (a precursor of Twisted) to scale connections.


If you decide to publish a long, serious article on the web, please consider using CSS for better readability. Don't rely on the browser's default CSS too much; the browser vendors are too busy making JavaScript fast. In particular, please don't use 100% width. Sure, our displays have far more width than height, but that width is good for movies, not for text. Hold any book up to your monitor, then reduce the text line width to match.

https://www.readability.com/articles/zrgovuxr

TW;DR: Too Wide, Didn't Read.


This article is 15 years old. There wasn't much in the way of CSS back then, much less responsive design. But the brilliant thing is it still works perfectly, because it's so simple! Pop your tab to a window and resize, or use a readability browser extension/website. Oh, you just did that.


haha yeah


I can pretty much always read sites that don't bother to override the browser defaults, whereas people's CSS efforts often make things hard to read.

It's obviously possible to make sites nicer using CSS, but if you're just trying to let people read your words rather than show off your design skills, leaving it to the browser's defaults is absolutely fine.


If more sites did this maybe we'd get something a little nicer. Til then, you can always set your own defaults.


There is a bookmarklet for this: javascript:(function(){ document.body.style.width='600px'; })()


Mostly I'm fine with the default CSS; it's only the 100% width that bothers me. Fixing that could be helpful on many old articles like this one. Thanks


Just quietly, this article has been around since 2003. It's still an extremely useful resource, and FYI, Chrome on Android renders this page just fine. To further that sentiment: coming from someone posting on Hacker News, I find your comment amusingly ironic.



