
Worth noting this is from 2003. The performance concerns of event-based servers have been greatly alleviated by both hardware and software advancements.

The test setup used for this paper was a "2x2000 MHz Xeon SMP with 1 GB of RAM running Linux 2.4.20". Solid PC server iron for 2003, but basically equivalent to a $5/month server from Digital Ocean today.

If you're looking to squeeze 100,000 concurrent tasks from that $5 server, this paper is relevant to you.




C10K (the problem of handling ten thousand concurrent clients, for which evented I/O was the standard answer) dates from the 90s. By 2003 thread-per-client was already considered obsolete and known to be very bad.

It's really quite simple: threads encourage the use of implicit program state via function calls, with attendant state expansion (stack allocation), whereas evented I/O encourages explicit program state, which means the programmer can make it as small as possible.

Smaller server program state == more concurrent clients for any given amount of memory. Evented I/O wins on this score.

But it gets better too! Smaller program state == less memory, L1/2/3 cache, and TLB pressure, which means the server can take care of each client in less time than an equivalent thread-per-client server.

So evented I/O also wins in terms of performance.

Can you write high-performance thread-per-client code? Probably, but mostly by allocating small stacks and making program state explicit just as in evented I/O, so then you might as well have done that. Indeed, async/await is a mechanism for getting thread-per-client-like sequential programming with less overhead: "context switching" becomes as cheap as a function call, while thread-per-client's context switches can never be that cheap.
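
Rough sketch of that last point, in Python with asyncio (any async/await ecosystem looks much the same; the host, port, and buffer size are just for illustration). The handler reads top to bottom like thread-per-client code, but each suspended client is a small coroutine object parked on one event loop rather than a kernel thread with its own stack:

    import asyncio

    # Reads like thread-per-client code, but each paused client is a small
    # coroutine object on one event loop, not a kernel thread with its own stack.
    async def handle(reader, writer):
        while True:
            data = await reader.read(4096)   # "context switch" == resuming a coroutine
            if not data:
                break
            writer.write(data)               # echo the data back
            await writer.drain()
        writer.close()
        await writer.wait_closed()

    async def main():
        server = await asyncio.start_server(handle, "127.0.0.1", 8888)
        async with server:
            await server.serve_forever()

    asyncio.run(main())

The thread-per-client version of handle() would be line-for-line the same shape, just with blocking socket calls and one OS thread (and stack) per connection.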

The only real questions are:

  - async/await, or hand-coded CPS callback hell?
  - for non-essential services, do you start with
    thread-per-client because it's simpler?
The answer to the first question is utterly dependent on the language ecosystem you choose. The answer to the second should be context-dependent: if you can use async/await, then always use async/await, and if not, it depends on how good you are at hand-coded CPS callback hell, and how well you can predict the future demand for the service in question.
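
For contrast on the first question, here is the same echo server sketched against a callback-style API (asyncio's Protocol layer here, but any evented framework looks similar): the per-connection state has to live explicitly on an object, and the control flow is inverted into callbacks instead of reading top to bottom.

    import asyncio

    # Callback-style: explicit per-connection state, inverted control flow.
    class Echo(asyncio.Protocol):
        def connection_made(self, transport):
            self.transport = transport      # explicit per-client state

        def data_received(self, data):
            self.transport.write(data)      # echo the data back

    async def main():
        loop = asyncio.get_running_loop()
        server = await loop.create_server(Echo, "127.0.0.1", 8888)
        async with server:
            await server.serve_forever()

    asyncio.run(main())

For a trivial echo handler the two are close; the gap widens as soon as one logical request spans several reads, writes, and timeouts, which is where "callback hell" earns its name.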


A context switch on a modern CPU takes only a few microseconds. A GB of RAM costs less than $10. So those concerns, although valid in theory, are irrelevant for most web applications.

On the other hand, simplicity in a code base usually matters. Code written against an evented API, littered with callbacks, is usually harder to read and maintain than code written sequentially against a blocking I/O API.

You can recreate a sync API on top of an evented architecture using async/await, but then you have the same performance characteristics as a blocking API, with all the evented complexity lurking underneath and leaking here and there. That seems to me a very convoluted way to arrive back at the point where we started.


A function call takes less. And using more RAM == thrashing your caches more == slowing down. The price of RAM isn't relevant to that -- this isn't about saving money on RAM but saving cycles. Yes, yes, that's saving money per-client (just not on RAM), but you know, in a commoditized services world, that counts, and it counts for a lot.


A GB of RAM only costs less than $10 if you are buying for your unpretentious gaming rig.

A GB of ECC server RAM costs more. An extra GB of RAM in the cloud can even cost you $10/mo if you have to switch to a beefier instance type.


How much does a MB of L-n cache cost?

I don’t have the answer, but you would want to measure dollars to buy it, and nanoseconds to refill it.


That's true: if you're buying OEM RAM for Dell or HP servers, it's more like $10-20/GB. However, you can buy Crucial ECC DDR4 RAM for $6/GB, so there's a hefty OEM markup.


$10/mo is far less than the cost of thinking about the issue at all.


Yes, but. Suppose you build a thread-per-client service before you realized how much you'd have to scale it. Now you can throw more money at hardware, or... much more money at a rewrite. Writing a CPS version to begin with would have been prohibitive (unless you, or your programmers, are very good at that), but writing an async/await version to begin with would not have been much more expensive than a thread-per-client one, if at all -- that's because async/await is intended to look and feel like thread-per-client while not being that.

One lesson I've learned is: a) make a library from the get-go, b) make it async/evented from the get-go. This will save you a lot of trouble down the line.


It's actually a big problem for web servers. Consider Apache, for example, which has to use one thread per connection (yes, Apache still doesn't support events for websockets in 2020).

Let's say you configure it for 2000 max connections (really not much), so that's 2000 threads. With the default 8 MB thread stack on Linux, that's 16 GB of stack address space reserved right away; it's only committed as it's touched, but the resident portion still adds up, and it obliterates all the caches.

You can reduce the thread stack to 1 MB (it might crash if you go lower), but the caches are still thrashed to death.
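
For a rough feel of that knob (a Python sketch rather than Apache's C, and the numbers are purely illustrative): the per-thread stack size is a reservation you set before spawning, and shrinking it carries exactly the "don't go too low" caveat above.

    import threading

    # Default thread stacks on Linux reserve about 8 MB of address space each;
    # shrink the reservation before spawning. Go too low and deep call chains
    # will blow the stack.
    threading.stack_size(1 * 1024 * 1024)   # 1 MB per thread

    def client_handler():
        pass                                # stand-in for serving one connection

    threads = [threading.Thread(target=client_handler) for _ in range(2000)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()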

Next challenge: how do you think concurrency works on the OS with 2000 threads? Short answer: not great.

The software makes heavy use of shared memory segments, semaphores, atomics and other synchronization features. That code is genuinely complex, worse than callbacks. Then you run into issues because these primitives are not actually efficient when contended by thousands of threads; they might even be buggy.


What's wrong with the Apache event worker?

https://httpd.apache.org/docs/2.4/mod/event.html


It's not quite event-based, really. It still requires one thread per (websocket) connection.


Ah I see you've dipped your toes into the Sea of Apache too. Horrible software. Should have died in 2000.


Evented I/O has some difficulties in practice:

* On Unix, only socket I/O is really evented; this can be solved by using a thread pool (the Flash paper, which describes this problem and its solution for httpds, dates to 1999 [1]), but it's inelegant. This is the approach Golang takes behind the scenes. A rough sketch follows this list.

* How to scale event loop work across cores; this can be solved, but adds complexity. In contrast, threads are quite simple to scale.

* How to share the accept() workload across cores; this can be solved too, but not portably (a Linux-only sketch follows the footnote below). I.e., your 100-core server may very well be accept-limited if you have a single-core accept loop, or contend heavily on a single socket object.

* Threadpools are definitely inappropriate if you have a ton of relatively idle clients (long-poll server, or even typical webserver), but are less wasteful if you have relatively few, always-busy clients.
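
Rough sketch of the first point's workaround, in Python for brevity (the file name and pool size are arbitrary): sockets stay on the event loop, and blocking file reads get handed to a thread pool, roughly what the Flash paper's helpers and Go's runtime do behind the scenes.

    import asyncio
    from concurrent.futures import ThreadPoolExecutor

    # File reads block on most Unixes even under an event loop, so push them
    # to a small thread pool and keep the loop free for socket readiness.
    pool = ThreadPoolExecutor(max_workers=8)

    def read_file_blocking(path):
        with open(path, "rb") as f:
            return f.read()

    async def handle(reader, writer):
        loop = asyncio.get_running_loop()
        body = await loop.run_in_executor(pool, read_file_blocking, "index.html")
        writer.write(body)
        await writer.drain()
        writer.close()
        await writer.wait_closed()

    async def main():
        server = await asyncio.start_server(handle, "127.0.0.1", 8080)
        async with server:
            await server.serve_forever()

    asyncio.run(main())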

I don't disagree that the synchronous threadworker design is a poor fit for high-performance services that can afford a lot of engineering time to design and find all the bugs. But thread-per-client is often a completely acceptable place to start.

[1]: https://www.usenix.org/events/usenix99/full_papers/pai/pai.p...
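
And a sketch of the accept() point: the non-portable part is SO_REUSEPORT (Linux 3.9+; the port and worker count are arbitrary). Each worker process binds its own listening socket to the same port and the kernel spreads incoming connections across them, instead of every worker contending on one shared socket.

    import os
    import socket

    def make_listener(port=8080):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Linux-specific: several sockets may bind the same port, and the
        # kernel load-balances new connections across them.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        s.bind(("0.0.0.0", port))
        s.listen(512)
        return s

    # One accept loop per core: fork the workers, then each opens its own socket.
    for _ in range((os.cpu_count() or 2) - 1):
        if os.fork() == 0:          # child becomes one of the accept loops
            break

    listener = make_listener()
    while True:
        conn, addr = listener.accept()
        conn.close()                # placeholder: hand off to the evented core here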


These are minor things, and not really accurate, and at any rate, not relevant to the point that evented I/O == explicit, thus easy-to-minimize program state, while thread-per-client == program state expansion and costlier context switches.


I don't normally care about downvotes, but if I give out a nice explanation and you don't like it, it'd be nice to get a reply. Was my response wrong? How? I might learn something from your response.


> The performance concerns of event-based servers have been greatly alleviated by both hardware and software advancements.

Threads haven't exactly stood still in that time either, especially if you include green threads, fibers, coroutines, etc.

> If you're looking to squeeze 100,000 concurrent tasks from that $5 server, this paper is relevant to you.

It's relevant regardless, as part of a long-running back and forth between threads and events. For example, Eric Brewer was one of the co-authors of this paper, but also of the SEDA paper, which was seminal in promoting event-based programming. I highly doubt that we've seen the last round of this, as technology on all sides continues to evolve, and context is a good thing to know. Those who do not learn the lessons of history...


I don't think there is back and forth anymore. Actual high-performance research (e.g. where the cost of a single mutex is more than the whole CPU budget for processing something, like, say, a packet) has been devoid of threads since they got into the mainstream, so for almost two decades already. They are still used, because they are what the hardware and OS provide to run something on each core, but not for concurrency or performance.


When your per-request time is so short, using events is easy. You don't have to worry about tying up a poller thread. (And yes, even event-based servers use threads if they want internal concurrency.) But that's a specialized domain. If requests take a substantial amount of processing or disk I/O time, naive approaches don't work. You can still use events, in a slightly more sophisticated way, if literally everything below you does async reasonably well, but any advantage over threads is much less and sometimes the situation still reverses. I work on storage servers handling multi-megabyte requests, for example, and in that milieu there is still very much back and forth.


Sure, if you are devoting the whole computer to a single microbenchmark, threads are a terrible idea. This is not necessarily the case when you have many heterogeneous applications running on a machine, though.


I'm a little confused by this- if you have multiple independent apps on a machine, would they not already be in separate OS processes?


Sure, but they still use the same kernel scheduling that threads do, and careful optimizations relying on core count = thread count are going to be basically worthless as well.


I mean, the domain you're in changes your requirements drastically. Heck, certain workloads (high frequency, low latency, nearly no concurrency) might run better on a single core of an overclocked i7 than on any Xeon processor.

If it's a web server then obviously you want more cores, and an event driven server would make a lot of sense.

Basically, if you need concurrency then you want events; if you're compute-bound then you don't want that overhead.

EDIT: Instead of i7 just imagine any high end (high frequency and IPC) consumer chip



