He mostly pans the 'less-than-expert programmers' canard, so he never explores the assertion that's at its base — that events are far better for correctness.
Rob Pike has spent the last 25 years developing evented languages, and while the Hoare-style CSP approach he's settled on allows for physical concurrency, he doesn't give a shit about bare-metal performance. The fundamental purpose is to be able to write concise programs that directly model the parallelism of the real world, written in an expository manner as to be more obviously correct.
Pike's point is that you should be getting the best true performance by working in an environment that helps you arrive at the ideal algorithm in it's purest form. Making compromises to get more local physical concurrency is a fool's errand, since at scale you're going to far outgrow single machines anyway!
When it comes to the Go language, bare-metal performance certainly seemed to be something Pike was giving a shit about when he spoke last week at OSCON and the Emerging Languages Camp I organized. Go in its current fairly young stage has performance that's not too terrible; they've mostly optimized for raw complication speed so far, which is interesting, and uniquely suited to Google's development problems. Pike has said that he wants Go to be a replacement for other systems languages. That's going to mean competing with those languages, performance-wise.
I think Clojure's concurrency model ends up more concisely and correctly expressing the "parallelism of the real world" when you consider the dimension of time. Rich Hickey has done some really important thinking there.
Finally, I think you'll find that anyone with a fixed budget for servers isn't going to think that making the most of "local physical concurrency" on every machine in their cluster is a "fool's errand". Hardware is cheaper than it used to be, but it isn't free, and deploying and maintaining it is costly and time consuming. If you can make the most of your hardware and operations investements with a little more thought-work, why not do so?
The first public release of Go made little use of physical concurrency, though that's undoubtedly improved considerably given that the model allows for it. His earlier languages didn't go there, probably because they didn't have nearly as many people working on them, and physical concurrency wasn't the focus of the research. Go is more performant and provides more static assurances than the previous iterations, but it still has global GC. Algorithms and Correctness are still the primary focal points.
I find it funny that you're hung up on local physical concurrency — for me that's the prime signifier of "Scaling in the Small"! If you're going to have to distribute your workload across multiple machines, why not just run multiple copies of your single-cpu process on each machine, and let your network-level work distribution mechanism handle it?
We're not talking about shared memory supercomputing using OpenMP and MPI on special network topologies or NUMA hardware, just commodity machines running HTTP app servers in front of data stores. We aren't curing cancer, and our workload is already embarrassingly parallel (responding to discrete requests).
I've actually disabled some intra-request concurrency (some ImageMagick operations are multithreaded by default) on a system I'm working on now because it makes the workload wildly inconsistent — when independent requests try to take advantage of all the available CPUs, the latencies spike for everybody. It's just more software to have to monitor, and the ideal returns are slim.
It's not so much that I'm "hung up" on local physical concurrency, just that I don't see any reason to ignore easy gains. You can write maintainable, correct, concurrent programs that scale across cores today if you use technologies other than Node. So why wouldn't you?
Anyway, that's a good reply, and your ImageMagick case study is interesting. It just goes to show how individualized "scaling" really is. Thanks for the taking time.
I haven't actually had a project to use Node on yet, though I have an affinity for its Purity of Essence approach, and I've used Twisted a fair deal.
If it's going to actually scale or have high-availability, a system is already going to have to have a socket pool to distribute across machines. Since that already works to distribute across processes on one machine, why bother adding another layer to pool native threads? It just adds to the complexity budget.
It could easily get you much lower (1/N) latencies on unloaded systems, but in most cases at high volume the gains aren't going to be very big compared to running another N-1 single-threaded processes per box and it would take more concerted effort to keep the request latencies consistent.
It seems obvious that it's worth compromising the model and adding locks or STM to get a N00% gain for a solitary request, but what if that's only a 10% gain when you're pumping out volume? Do stop to consider that not everyone wants their individual app processes to scale across cores — actual simultaneous execution within one address space comes at a cost.
There are probably some cases at scale where the gains from thread-pooling are substantial, but I could see a lot of them being where the work wasn't very event-ish to start with, like big-data batch-flavored stuff where Hadoop would work great (especially from data locality).
some ImageMagick operations are multithreaded by default
And a really icky default at that, for the performance reasons you mention and with the added bonus of a chance of hitting some pointless bug. I've been turning it off ever since I spent a pleasant evening staring at the stacks of wedged apaches that ended with a pile of ImageMagick functions culminating in futex-something-or-other.
It gets ickier too — IM has flags for limiting its memory usage, but instead of modifying the algorithm, it just implements its own VM system and swaps to tempfiles directly from userspace when it hits the constraint.
It is, undoubtedly, one of the more appalling pieces of software in wide use. Just to complete the wild tangent, GraphicsMagick is an interface-compatible fork that claims better performance and less cruft (while still weighing in at above 280 klocs). Flickr uses it so presumably it only gives you two brain aneurysms instead of 4.7.
I also can't help but to gush about varnish a little (a handy thing to put in front of an image processing server, for one thing), which has a number of post-1975 features beyond its memory management such as an actually useful configuration language, statistics aggregation and reporting and the inexplicably uncommon ability to manage and modify configuration without a restart.
(No, I’m not making the “callbacks turn into a pile of spaghetti code” argument, although I think you hear that time and again because it’s an actual developer pain point in async systems.)
Amen.
On my first try I was mostly underwhelmed with node precisely because of the callback hell you end up in. I've already had my share of that in twisted and despite the arguments various people make for it; thanks, but no thanks.
That's not to bash node as a whole, mind you. I'll most certainly revisit it when and if it grows a stable co-routine wrapper or similar spaghetti prevention facilities.
I completely agree. Further, as he points out, code using events is fundamentally different than code that uses threads, in ways that aren't immediately clear to someone writing a "hello world" app. Complexities arise in ways that don't in threaded code. I think many of the more naive "event-driven" boosters haven't gotten far enough to see the real potential for pain.
That's not saying threaded code doesn't have its own set of issues; but having written a tool that saw very high traffic, was written an event-driven style, and suffered performance reliability issues, I can say that the event-driven approach is no panacea.
That said, for certain applications, one or the other approach is clearly appropriate. But to make that decision requires an understand of the strengths and drawbacks of each.
As he notes, evented vs threaded is only important within a single machine.
I would argue that your architecture for distributing load across machines is more important than evented vs threaded. If we think that the difference in performance is small (regardless of which is faster,) then the question becomes one of which programming style you prefer. Node's implementation of events means you can have mutable state without worrying about locks or synchronization blocks or whatnot.
I'm a little puzzled by this fascination with shared memory concurrency myself. If you have to scale beyond one physical instance, and you generally do if you have to scale at all, your scaling mechanism is your distribution mechanism. Of course, there are apps that need maximum throughput in a single address space but they seem to be very much in the minority.
That's like suggesting that you can eliminate CPU cache because most resources are going to be in main memory. Few systems scale 1:1 when adding more systems, yet making each system more powerful has the potential to be more useful than increasing the scaling factor.
In other words if 100 machines are 60 times more powerful than 1 machine, increasing your scaling factor is less important than doubling each machines throughput (assuming you keep the same scaling factor).
It's just obvious how "threaded" can be slow even in low scale scenarios. It's less obvious how evented can be slower than the hardware allows (not saying it can't be). If nothing is known about the large scale, it still looks like a clear win for evented.
Terminology is part of the problem here. Comparing "threaded" vs. "evented" may have made sense in, say, Windows in 2003 or some context where we are clearly talking "select loop" vs. "pthreads", but there are a lot of things that could be called "threading" models now. Well, sort of a lot of things, because they're all converging now on the same core primitive, leaving us with two basic cases here: 1. Old, synchronous code and 2. code running on the local select-like primitive that freely composes with other such code.
Node.js clearly wins to the extent that it is code of type 2 replacing code of type 1. But when you compare Node.js against other technologies of type 2, I find it hard to conclude that it really stands out on any front except hype. All the type 2 technologies tend to converge on equal performance on single processors, driven by the performance of the underlying select-like primitive, so what's left is two things: comparing the coding experience, where "evented" really loses out to thread-LIKE coding that actually under-the-hood is transparently transformed into select-like-primitive-based code, but is not actually based on threads, and two, dispatching the tasks to multiple processors, where Node.js doesn't even play.
A great deal, if not nearly all, pro-Node.js arguments persistently argue against systems of type 1... but they don't get that that's not actually the competition. It's the other type 2 systems that they are really facing, and Javascript isn't really helpful there. Events are a hack around the fact that Javascript's heritage is pure type 1, not a glorious feature, and I think you really ought to stick with the languages that are natively type 2, or functional enough that they made a smooth transition. Node.js is forever going to be a platform that spent a lot of its engineering budget just getting Javascript to straddle the divide from 1 to 2 with some degree of success, and that's going to be engineering budget you can't spend on your project.
I am not a specialist on this area. But please mention some interesting competitors to Node.js for your type 2?
Personally I also like the fact that Node is Javascript. I have looked at Scala several times and superficially, it looks extremely ugly. Clojure and Haskell might be interesting, but I worry that the non-modifiable memory could bite me in the end (plus, Clojure might force me to do too much dreaded Java stuff in the end). Erlang is fun because it is so freaky and different, but I am not sure how efficient it would be for writing huge amounts of code. It is also not written for performance, but for stability (according to it's inventor).
Which leaves what great alternatives? I have to admit, no other popular environments come to my mind. Arc, maybe?
You mentioned most of the ones I consider the real winners. Go is another possibility, though I'm not sure it fits in here; I don't know what sits at the core of the goroutines scheduler, but if it isn't select-like it probably easily could be. There are also a metric shitload of other event-based frameworks like Node.js, only for other languages, at varying levels of maturity and cruftiness.
(Though part of the reason something like Twisted is crufty is that the design forces in the domain push you in that direction. Twisted is crufty, yes, but it's crufty for reasons that apply to Node.js and Node.js will naturally either have to gravitate in the same direction or hit the same program-size-scaling-walls that drove Twisted in that direction in the first place.)
None of the environments are "popular" by any reasonable meaning of the term. Don't let the hype fool you, neither is Node.js.
If you like Javascript, use it, but like I said, be aware that you are paying a big price for jamming a type 1 language into a type 2 situation. (Same goes for all of the aforementioned event-based frameworks; I am not especially down on Node.js, I'm down on the entire idea.) I am reminded of those who insisted on using Visual Basic to do their object orientated programming in. Yes. It works. But you pay. You pay in a long series of little compromises that are easy to wave off individually but add up to major ongoing friction. In the medium and long term you are better off learning something designed for that niche, not jammed into it.
(And while the original article mentions it only in passing, that spaghetti code scaling wall for this sort of code is very real. I'd liken it to something O(n^1.3) as the code size grows; sure, it doesn't seem to bite at first and may even seem to win out over O(n log n) options like Erlang at first, but unfortunately you don't hit the crossover point until you're very invested in the eventing system, at which point you're really stuck.)
(Incidentally, why am I down on the whole idea of eventing? Because twice in my experience I have started on one of these bases and managed to blow out their complexity budget as a single programmer working on a project for less than a year. And while you'll just have to take the rest of this sentence on faith, I actually use good programming practices, have above-average refactoring and DRY skills, and use unit testing fairly deeply. And I still ended up with code that was completely unmanageable. Nontrivial problems become twice as nontrivial when you try to solve them with these callback systems; this is not the sign of scalable design practices. On the other hand, I've built Erlang systems and added to Erlang systems on the same time frame and the Erlang systems simply take it in stride, it hardly costs any complexity budget at all. And Erlang's actually sort of lacking on the code abstraction front, if you ask me, it's an inferior language but the foundational primitives hold up to complexity much better.
In fact, I'm taking time out from my task right now of writing a thing in Perl in POE, a Perl event-based framework, in which I have to cut the simple task of opening a socket, sending a plaintext "hello", waiting for a reply, upgrading to SSL, and sending "hello"s within the SSL-session to various subcomponents into six or seven-ish sliced up functions with terrifically distributed error handling, and even this simple bit of code has been giving me hassles. This would be way easier in Erlang. Thank goodness the other end is in fact Erlang, where the server side of that is organized into a series of functions that are organized according to my needs, instead of my framework's needs. The Erlang side I damn near wrote correctly the first time and I've just sort of fiddled with the code organization a bit to make it flow better in the source, this is my third complete rewrite of the Perl side. And I've got at least a 5:1 experience advantage on the Perl side! The Perl side requires a bit more logic, but the difficulty of getting it right has been disproportionate to the difference in difficulty.)
> Because twice in my experience I have started on one of these bases and managed to blow out their complexity budget as a single programmer working on a project for less than a year.
Could you please share a little more about this? I'm interested to know what it is you were building solo and the tools you were using, and what made it so difficult to maintain.
Not tried POE myself but it does seem to produce a wide love/hate divide among its users. AnyEvent (http://search.cpan.org/dist/AnyEvent/) seems to be the more pragmatic choice.
Thanks for the info. I don't have the experience yet, so I can't comment. Erlang definitely sounds good, too. My impression was that the hype had subsided recently, but maybe it is still going strong.
> Clojure and Haskell might be interesting, but I worry that the non-modifiable memory could bite me in the end
For the record, this is absolutely no problem given the generational GC in the Hotspot JVM; the structural sharing in Clojure's persistent data structures generate a little extra garbage, but it's barely a blip on the radar. Laziness can have more of an impact on memory usage, but it's usually really easy to spot.
I keep hearing assertions that threaded can be slow or does not scale. Can somebody describe the reasons here or point to a good document on the subject?
I often see "scalable" and "fast" used interchangeably in these discussions. These are different concepts. Is the contention that threading is not as scalable as evented, not as fast as evented, or both?
One of the traditional threads-don't-scale arguments, which led to the early-2000s wave of evented competitors to Apache (lighthttpd, nginx), was that threads have per-thread call-stack overhead that's hard to minimize without either OS support for dynamically sized call stacks, or choosing really small static stack sizes that make writing code that doesn't stack-overflow tricky.
A different issue was that OSs used to be very bad at scheduling large numbers of threads--- if you spawned 10,000 threads, the system scheduler blew up. I believe that's mostly much improved these days.
Most of the threads-don't-scale arguments also assume OS threads; user-level threads are another story.
Seems like kind of apples-to-oranges to me as well....
Threaded, done right (which isn't always easy) - can scale on a single machine to use up all the cores you can throw at it. This can be fairly optimal on the machine in question, depending on the app.
Traditionally when dealing with threaded programming you have a whole lot of locking and whatnot to keep track of - and things can get ugly fast. It's HARD.
And all that hard work, while it might help you scale up to the biggest multi-processor beast you can find - doesn't help you one bit when it comes to scaling horizontally to multiple nodes..... I'd wager, depending on the situation, that it gets right up in your face and in the way. But tha'ts back to comparing apples and oranges.
"scalable" is about overall design - what needs to talk to what, how will it scale, will it be a purely shared-nothing environment like erlang where we're passing messages and can scale out horizontally like mad, but where every function call is effectively a full memory copy, possibly over a network? Or do we nail it down to a precisely crafted C++ threaded app that our expert designers know down to it's finest detail, that's faster than sh*t througha goose, and hey, it uses some evented stuff as well.
It seems the real rise interest in the event-driven nature of node.js is that it makes it obvious to even beginner programmers where the difference is - you are setting up some actions, and letting the OS kernel handle listening for the events for you in the fastest possible way. You could do all that on your own in another language, but here it is all neatly wrapped up, in a language you already know.
Both threaded and event-driven I/O are scalable if you throw enough hardware at it and - all other things being equal - are about as fast.
The one-thread-per-connection model is conceptually simpler at the cost of a higher resource footprint, the event-driven approach adds complexity but allows you to process more concurrent connections.
Overly broad and simplified but that's about the gist of it. Like everything non-trivial, the real answer is: It depends.
Threaded does not imply one thread per connection. As an example, Java Servlet 3.0 allows a request handler to suspend and resume execution of a request handler on a thread. This feature allows the application to relinquish the thread for the occasional long operation and use blocking threaded code for most other code. Threaded applications written in this style can support a large number of concurrent connections with modest resource use and no callback spaghetti.
Models like Java Servlet 3.0 allow threaded servers to handle a large number of concurrent connections. Are there other reasons why threaded servlets are slow or not scalable?
Most web applications don't have a large number of concurrent connections because they application request handlers do not block for a long time. The resource use for thread stacks is very modest for these applications even without a suspend & resume feature.
I don't think the Servlet 3.0 example counts as an example for thread based request handling to be working. After all, it introduces event based request handling to circumvent the problems of threaded event handling.
Because Servlet 3.0 does not dictate how an application suspends or resumes request handling, I wouldn't say that Servlet 3.0 introduces event based request handling (other than the first invocation of the handler as in Servlet 1.0).
I guess one can claim that any suspend and resume logic in the application is a form of eventing, but it's different from node.js in that it's not pervasive through the application. It's also different from node.js in that the server infrastructure is not involved in the "events".
Foremost, context switching. This is a non-trivial process: on-die caches get flushed, virtual memory has to get re-mapped, registers re-loaded. This is by definition lost time, time the OS is spent not doing your workload. If the number of context switches gets too large, a system is spending an inordinate amount of time performing context switches, and less time doing "real work" inside threads. You cite two isues, slow and not scaling.
Second, memory usage. Each thread has its own stack space, which consumes a fixed memory size.
As to "slow" and "does not scale;" slow is relative; the amount of time spent context switching is very dependent on a variety of workload factors. Threading _can_ be slow. Scaling, on the other hand, is less unequivocal. A quarter million active threads, each doing very little, is not a good thing: too much memory, too many context switches, and too high a latency.
The problem is not threading per se. It's pre-emptive multi-threading, where a programs state is transparently pulled from the CPU by the OS and serialized to someplace else, and another programs state is de-serialized and loaded in transparent; that is, without the loaded or unloaded program knowing. Erlang and tasklets are examples of userspace or lightweighs threads where pre-emption is not permitted, and the context switching penalty is negligible. Instead of threads being ignorant, and being preemptively swapped in and out, they are coordinated and actively yield themselves back to the scheduler, thereby reducing the context switching penalty.
I think traditionally, something like Tomcat used to have perhaps 30 threads? Especially with todays Ajax-driven web sites, it is very easy to exhaust them.
I can only point to a stupid app of mine, which triggered Twitter searches for a list of search results via Ajax (so one page of search results would trigger, say, 10 search reguests to Twitter via Ajax, proxied through the server). Since that app was written for a competition and non-commercial, I hosted it on the free Heroku plan (it was a Rails app). That is how I found out that apparently that Heroku plan has only one process, and Rails doesn't use threads. So not only would the search requests to Twitter only be handled sequentially, while those threads were running, no other visitors would get to use the site (the 10 Ajax requests would already use up up to 10 threads, and I had only one).
Just a stupid example - with event based request handling, it would not have been an issue because the search requests to Twitter would not have blocked everything else.
But a request to twitter blocking everything else would be an issue, right?
Aside: haproxy, configured cleverly, can deal with limiting the number of concurrent connections permitted to your app on a URL basis (or just about any other part of the request you can deconstruct) to allow your app to not get hung up on slow queries and keep fast queries going where you want them.
I haven't done a whole lot of concurrency programming, so before reading this article I didn't realize that the events vs. threads debate is still being had. The example that Ryan Dahl likes to use to illustrate the superiority of the event model whenever he talks about Node is nginx vs. Apache. That example did more in my mind to reinforce the idea of events being superior to threads (in terms of speed and memory consumption) than anything else.
Keep in mind when reading this article that Alex recently left a little company called Twitter. It's safe to say that relatively few companies will ever have to scale the way that Twitter has.
Keep in mind when reading this article that Alex recently left a little company called Twitter. It's safe to say that relatively few companies will ever have to scale the way that Twitter has.
You mean the failwhale way? ;-)
Not meaning to discredit al3x but I really don't consider twitter a success story in terms of scaling. I don't know if it's incompetence or just some truly bad early decisions that they're still suffering from.
But one thing is for sure; other sites of much higher complexity have scaled much more smoothly to similar sizes (facebook, flickr, just two from the top of my head).
Well, as you've been outed as being involved with twitter first-hand, can you shed some light on the problems they are having? Or is that internal stuff, not to be talked about?
I'm curious because I've worked with various messaging systems myself. And albeit never having pushed them to near twitter-scale, the principles of scaling those horizontally seem quite straightforward to me, unless complex routing comes into play. But I don't see complex routing at twitter.
To be more precise: For all I know twitter could (should?) probably just append those tweets to one file per user and would scale beautifully from there. Auxiliary services like search are a different story, of course, but those don't need to trigger failwhales when they break...
The complexities of scaling Twitter have been pretty thoroughly discussed elsewhere, both by Twitter staff and informed third parties.
At this point in time, I don't really want to comment any further on what issues they may or may not be having. I haven't worked there in over a couple months, and I'll bet that big parts of the system have changed in that time. I'm no longer informed about what's under the hood at Twitter, and I also don't want to second-guess my former coworkers, who I'm sure are doing their best. Sorry!
I have a feeling that hiring al3x is one of the few reasons that Twitter didn't collapse into a heap of cetacean corpses.
Twitter's original codebase started as a "my first blog" tutorial in Rails — the polar opposite of a messaging platform. That's a hell of a ship to turn around, especially live on the world stage while the userbase and datastore are growing exponentially and you're rapidly approaching the second half of the chessboard.
Facebook both very intentionally limited their growth (new colleges) and avoided their most expensive features (like photos) for as long as possible. Flickr was originally an open-ended MMO/MUD based on user-contributed content called Game NeverEnding, which was far harder to scale than the photo-sharing feature in their chat client that was extracted to base their final business on.
Node does support two models for concurrency: Non-blocking I/O and worker processes.
Non-blocking I/O is no magic pony, so if you need access to more CPU cores, you start creating worker processes that communicate via unix sockets. I would argue that this is a superior scaling path than threads, because if you can already manage the coordination between shared-nothing processes, moving to multiple machines comes natural.
Otherwise I agree with the post. Nothing will allow you to scale a huge system "easily".
There are features you miss by relying on IPC. UNIX Sockets are zero copy, sure, but you still lose performance doing the marshal/unmarshal tasks, and that could be largely avoided using shared immutable memory.
I'd much rather node have a multi-threaded Web Workers implementation of nodejs that uses immutable shared memory for message passing. Not coincidentally, I'd much rather push the real distributed scaling problems (such as coordination and marshalling/demarshalling) up to a much much higher level in my application stack. As it stands now, I need multiple marshal/unmarshallers, for the IPC layer and the app's scaling-out/distributed layer.
> Herein lies my criticism of Node’s primary stated goal: “to provide an easy way to build scalable network programs”. I fundamentally do not believe that there is an easy way to build scalable anything. What’s happening is that people are confusing easy problems for easy solutions.
Even though I disagree with much of his technical argument, this is an extraordinarily important point that I find myself agreeing with more and more upon rereading. Nothing is scalable out of the box, anything can be fucked up, and there's no silver bullet.
What people tend to forget about Node.js is that everyone knows JavaScript. This is really it's killer feature against other platforms. The ability to get a team up and running on Node is unparalleled.
the particular danger i'd cite w/r/t JS is that its a much more flexible language than many, and that there's not the same level of entrenched best practices and tooling to enforce best practices. in other words, everyone knows JS, granted, but they know very different styles of it.
consider the plethora of means of doing inheritance that exist for JS.
I think Alex confuses "scaling in the small" with performance. Rewriting Starling in Scala most probably improved Twitter but it did it make it more scalable?
To me as an outsider, it sounds more as Starling (or Kestrel) got its performance upgraded, not scalability.
May sound pedantic, but it's important to differentiate performance from scalability.
IMHO you may be able to scale node 'in the big' if you plan to scale on a level above node (i.e. in node instances and inter-node communication) instead of changing the code running in node to scale big.
I think this is a fascinating look at an emerging company diving through the discover process of picking core technologies. al3x, please keep going with these.
If, on the other hand, you’re working with a system that allows for a variety of concurrency approaches (the JVM, the CLR, C, C++, GHC, etc.), you have the flexibility to change your concurrency model as your system evolves.
I think you have his thoughts right there. Ruby and python don't allow for a variety of concurrency approaches...at least not on the threading side. You have to resort to multi-process or evented schemes.
Rob Pike has spent the last 25 years developing evented languages, and while the Hoare-style CSP approach he's settled on allows for physical concurrency, he doesn't give a shit about bare-metal performance. The fundamental purpose is to be able to write concise programs that directly model the parallelism of the real world, written in an expository manner as to be more obviously correct.
Pike's point is that you should be getting the best true performance by working in an environment that helps you arrive at the ideal algorithm in it's purest form. Making compromises to get more local physical concurrency is a fool's errand, since at scale you're going to far outgrow single machines anyway!