Sorry, I must be missing something in this blog post because the requirements here sound incredibly minimal. You just needed an HTTP service (sitting behind an Envoy proxy) to process a mere 500 requests/second (up to 1MB payload) and pipe them to Kinesis? How much data preparation is happening in Rust? It sounds like all the permission/rate-limiting/etc happens between Envoy/Redis before it ever reaches Rust?
I know this comes across as snarky but it really worries me that contemporary engineers think this is a feat worthy of a blog post. For example, take this book from 2003 [1] talking about Apache + mod_perl. Page 325 [2] shows a benchmark: "As you can see, the server was able to respond on average to 856 requests per second... and 10 milliseconds to process each request".
And just to show this isn't a NodeJS vs Rust thing, check out these web framework benchmarks covering various JS frameworks [3]. The worst performer there still does >500 rps while the best does 500,000.
They list out what is being done by the service - "It would receive the logs, communicate with an elixir service to check customer access rights, check rate limits using Redis, and then send the log to CloudWatch. There, it would trigger an event to tell our processing worker to take over."
That sounds like a decent amount of work for a service, and without more detail it's very hard to say whether a given level is efficient or inefficient (we don't know exactly what was being done; we can assume they're using pretty small Fargate instances, though, since the Node one came in at 1.5 GB). They also give some numbers: 4k RPM was their scale-out point for Node (not necessarily the maximum, but the point at which they felt load was high enough to warrant a scale-out; certainly, their graph shows an average latency > 1 second). Rewritten in Rust, that number rose to 30k RPM, with 100 MB of memory, < 40 ms average latency (and a far better max), and 2.5% CPU.
Given all that, it sounds like, yes, GC was the issue (both high memory and CPU pressure), and with the Rust implementation (no GC) they're nowhere near any CPU or memory limit, and so the 30k is likely a network bottleneck.
That said, while I agree that sounds like a terrible metric on the face of it, with the data they've provided (and nothing else), it may simply be that they're operationally dealing with very large amounts of traffic. They may want to consider optimizing the network pipe; I'm not familiar enough with Fargate, but if it's like EC2, there may be a CPU/memory sizing that also gives you a better network connection (EC2 jumps from a 1 Gbps to a 10 Gbps network card at a certain instance type).
> That sounds like a decent amount of work for a service
5+ years ago I wrote a real-time transcoding, muxing streaming radio service that did 5000 simultaneous connections with inline, per-client ad spot injection (every 30 seconds in my benchmark). Using C and Lua. On 2 Xeon E3 cores--1 core for all the stream transcoding, muxing, and HTTP/RTSP setup, 1 core for the Lua controller (which was mostly idle). The ceiling was handling all the NIC IRQs.
While I think what I did was cool, I know people can eke much more performance out of their hardware than I can. And I wasn't even trying too hard--my emphasis is always on writing clear code and simple abstractions (though that often translates into cache-friendly code).
At my day job, in the past two months I've seen two services in "scalable" k8s clusters fall over because the daemons were running with a file descriptor ulimit of 1024. "Highly concurrent" Go-based daemons. For all the emphasis on scale, apparently none of the engineers had yet hit the teeny, tiny 1024-descriptor limit.
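For what it's worth, the fix is a few lines at process start. A minimal Rust sketch using the libc crate (assumed as a dependency; this is illustrative, not code from the daemons in question):

    // Raise the soft RLIMIT_NOFILE up to the hard limit at startup so the
    // process isn't stuck at the distro default of 1024 descriptors.
    // Container/systemd settings still cap the hard limit.
    fn raise_fd_limit() -> std::io::Result<()> {
        unsafe {
            let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
            if libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) != 0 {
                return Err(std::io::Error::last_os_error());
            }
            lim.rlim_cur = lim.rlim_max; // soft limit -> hard limit
            if libc::setrlimit(libc::RLIMIT_NOFILE, &lim) != 0 {
                return Err(std::io::Error::last_os_error());
            }
        }
        Ok(())
    }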
We really do need to raise our expectations a little.
I haven't written any Rust but I have recently helped someone writing a concurrent Rust-based reverse proxy service debug their Rust code and from my vantage point I have some serious criticisms of Tokio. Some of the decisions are clearly premature optimization chosen by people who probably haven't actually developed and pushed into production a process that handles 10s of thousands of concurrent connections, single-threaded or multi-threaded. At least not without a team of people debugging things and pushing it along. For example, their choice of defaulting to edge-triggered instead of level-triggered notification shows a failure to appreciate the difficulties of managing backpressure, or debugging lost edge-triggered readiness state. These are hard lessons to learn, but people don't often learn them because in practice it's cheaper and easier to scale up with EC2 than it is to actually write a solid piece of software.
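To make the edge-triggered complaint concrete: with edge notification you are only woken on the transition to readable, so a handler has to drain the socket until WouldBlock or the leftover bytes sit there until the peer sends more. A minimal sketch of that discipline with a plain non-blocking std socket (nothing Tokio-specific; `drain` is an illustrative name):

    use std::io::{ErrorKind, Read};
    use std::net::TcpStream;

    // Assumes stream.set_nonblocking(true). Under edge-triggered readiness a
    // wakeup means "read until WouldBlock"; stop early and you may never be
    // notified about the remaining bytes. Level-triggered polling would keep
    // reporting readiness instead.
    fn drain(stream: &mut TcpStream, sink: &mut Vec<u8>) -> std::io::Result<()> {
        let mut buf = [0u8; 16 * 1024];
        loop {
            match stream.read(&mut buf) {
                Ok(0) => return Ok(()),                      // peer closed
                Ok(n) => sink.extend_from_slice(&buf[..n]),  // keep draining
                Err(e) if e.kind() == ErrorKind::WouldBlock => return Ok(()),
                Err(e) => return Err(e),
            }
        }
    }

The backpressure difficulty is exactly that this loop can't simply stop early to slow a fast sender: once you return before hitting WouldBlock, tracking that the socket is still readable becomes your problem.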
All I'm saying is that without some example of the payloads they're managing, and the logic they're performing, it's hard to say "this is inefficient". And, as I mentioned, if their CPU and memory are both very low, it's likely they're hitting a network (or, yes, OS) limit.
I've seen places hit ulimit limits... I've also seen places hit port assignment issues, where they're calling out to a downstream that can handle thousands of requests with a single instance, so only two exist, and there aren't enough port identifiers to support that (and the engineers are relying on code that isn't reusing connections properly). Those are all things worth learning to do right, agreed, and generally doing right. I'm just reluctant to call out someone for doing something wrong unless I -know- they're doing something wrong. The numbers don't tell the whole story.
They might not be doing anything wrong, per se. But if your expectation is that 500/s is a lot (or even 4000/s for log ingesting), then your architecture will reflect that.
Here's what they're doing:
> Now, when the Bearer Agent in a user's application sends log data to Bearer, it goes into the Envoy proxy. Envoy looks at the request and communicates with Redis to check things like rate limits, authorization details, and usage quotas. Next, the Rust application running alongside Envoy prepares the log data and passes it through Kinesis into an s3 bucket for storage. S3 then triggers our worker to fetch and process the data so Elastic Search can index it. At this point, our users can access the data in our dashboard.
Given their goal and their problems with GC I can tell you right off the bat probably what's the problem with their various architectures from day 1--too much simplistic string munging. If your idea of log ingestion is using in-language regex constructs to chop up strings into pieces, possibly wrapping them in abstract objects, then it's predictable you're going to have GC issues, memory bandwidth issues in general, and poor cache locality in data and code. But 99% of the time this is how people approach the issue.
What a problem like this cries out for is a streaming DFA architecture, using something like Ragel so you can operate on streams and output flat data structures. You could probably implement most of the application logic and I/O in your scripting language of choice, unoptimized GC and all, so long as you're not chopping up a gazillion log lines into a gazillion^2 strings. The latter approach will cause you grief in any language, whether it's JavaScript, Java, Go, Rust or C. The number of objects per connection should be and can be a small N. For example, at 10 distinct objects (incoming connection object, log line, data structure with decomposed metadata, output connection object, etc) per connection times 500 connections, that's 5000 objects per second. Even Python's and Ruby's GC wouldn't break a sweat handling that, even though internally it'd be closer to 10 * (2 or 3) objects.
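The "flat data structures" idea doesn't require Ragel specifically; the core of it is scanning bytes and recording offsets instead of materializing substrings. A rough sketch of that shape in Rust (the struct and field names are invented for illustration):

    /// One parsed record: byte ranges into the original buffer,
    /// no per-field string allocations.
    struct LogFields {
        timestamp: (usize, usize),
        level: (usize, usize),
        message: (usize, usize),
    }

    /// Split a space-delimited line "ts level message..." into offsets.
    /// The only allocation is the small fixed-size struct itself.
    fn scan(line: &[u8]) -> Option<LogFields> {
        let mut spaces = line.iter().enumerate().filter(|(_, b)| **b == b' ');
        let (i, _) = spaces.next()?;
        let (j, _) = spaces.next()?;
        Some(LogFields {
            timestamp: (0, i),
            level: (i + 1, j),
            message: (j + 1, line.len()),
        })
    }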
Here's a big problem today: nobody writes their own HTTP library or JSON library; everybody uses the most popular ones. So right off the bat every ingestion call is going to generate hundreds or thousands of objects because popular third-party libraries generally suck in each request and explode it into huge, deeply nested data structures. Even in Rust. You can't optimize that inefficiency away. No amount of fearless concurrency, transactional memory, fastest-in-the-world hashing library, or coolest regular expression engine can even begin to compensate. You have to avoid it from day 1. But if your expectations about what's possible are wrong (including how tractable it is with some experience), it won't even occur to you that you can do better. Instead, you'll just recapitulate the same architectural sins in the next fastest language.
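To be fair, "avoid it from day 1" is achievable even with a popular library if you take the zero-copy path it offers. For example, serde_json can deserialize into borrowed &str fields so the parsed struct points back into the request buffer rather than copying every string; a hedged sketch (field names invented, not Bearer's schema):

    use serde::Deserialize;

    // Borrowed deserialization: the &str fields reference `payload` directly
    // instead of allocating an owned String per key/value. (A JSON string
    // containing escape sequences can't be borrowed and will return an
    // error; Cow<str> handles both cases.)
    #[derive(Deserialize)]
    struct LogEvent<'a> {
        timestamp: &'a str,
        level: &'a str,
        message: &'a str,
    }

    fn parse(payload: &str) -> Result<LogEvent<'_>, serde_json::Error> {
        serde_json::from_str(payload)
    }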
"I can tell you right off the bat probably what's the problem"
Emphasis added. I don't disagree with you that they may be doing something inefficient; I'm just saying, I don't -know- what they're doing, so I'm disinclined to judge it.
I do know that, again, in Rust, whatever bottleneck they're hitting is neither CPU nor memory, despite the seemingly low throughput, which does imply that what you're proposing isn't the bottleneck in that implementation.
Seriously, 500 qps was something we used to do in interpreted languages on the Pentium Pro. But this kind of blog post is a whole genre: How [ridiculous startup name] serves [trivial traffic] using only [obscenely wasteful infrastructure] in [trendy runtime framework that's a tiny niche or totally unknown in real industry].
Then you just weren't paying attention. There's a list of approved languages for new development and Rust isn't on it. There's a list of languages that are forbidden, for which you need high-level approval to use for new work, and Rust is on that one, next to C++.
Rust support is categorized as Tier 2 at Dropbox. Do you work at Dropbox? You can go look for the language approval list, which documents this.
Tier 2 means it requires approval. That is massively different from "forbidden". There is, for example, a Tier 3 list - Java is on there, along with some other languages; they're discouraged a lot more strongly than Rust, though you'll still find them in some parts of the codebase (primarily acquired code, from what I recall). Approval is only required for business/product dev - non-product teams can very easily write Rust, and at least one is currently doing so.
Tier 2 means it has internal library support, such as the communication library. That means there is, at all times, active Rust development - and of course this is true; the most critical parts of Dropbox are written in Rust, and they rely on those libraries.
You can also very easily find out about the existing projects being written in Rust. Go ask the rust-lang channel. Last I saw, maybe 6 months ago, Rust was being used for another major component of the product - obviously I'm not commenting publicly on that further.
You've (probably unintentionally, judging by your tone) shined a light on the corrupt culture of that engineering org. It's written down that Tier 2 languages require special permission to begin new projects. It doesn't sound like we disagree on that. The corrupt cultural aspect is there exists an in-group clique of engineers who can and will start projects in any language they want, and an out-group against whom the written policy will be used to stop new efforts.
It is definitely not the case that any randomly selected backend engineer at that company can just pick up Rust and solve any problem with it, because the Tier 2 status is used as a cudgel to stop most such efforts.
I disagree entirely with the "forbidden" wording. That's it. Tier 2 is not "forbidden"; it is a statement about the level of support.
I agree with your point about political issues - the entire tier list, specifically even Rust being tier 2, was political. I saw stupid shit like that too many times at Dropbox.
But it is simply a fact that new projects are being built in Rust, regardless of the political aspects of why that is the case despite it being tier 2. "Forbidden" does not convey the state of things - everything else you've said is agreeable.
It took your comment to make me notice it wasn't 30k requests per second but per minute.
500 requests per second is what I would expect of a default PHP + Apache installation on a small Ubuntu server.
I too have a hard time grasping what's special here. For example, I've seen cached WordPress setups handle 400 to 500 requests per second. And WordPress isn't known for performance, even with caching plugins.
Great article, and thanks for sharing! There are a couple of things that stand out to me as possible architecture smells (hopefully this comes across as positive, constructive criticism :)).
As someone who has been developing on the BEAM for a long time now, it usually sticks out like a sore thumb any time I see Elixir/Erlang paired with Redis. Not that there is anything wrong with Redis, but most of the time you can save yourself the additional Ops dependency and application network hop by bringing that state into your application (BEAM languages excel at writing stateful applications).
In the article you write that you were using Redis for rate limit checks. You could have very easily bundled that validation into the Elixir application and had, for example, a single GenServer running per customer that performs the rate-limiting validation (I actually wrote a blog post on this using the leaky bucket and token bucket algorithms: https://akoutmos.com/post/rate-limiting-with-genservers/). Pair this with hot code deployments and you would not lose rate limit values across application deployments.
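For readers who haven't seen it, the token bucket in that linked post is only a handful of lines in any language; a rough Rust sketch of the per-customer state (illustrative, not taken from the article):

    use std::time::Instant;

    /// Token bucket: `rate` tokens refill per second up to `capacity`.
    /// One of these per customer is the state a GenServer (or any other
    /// process/actor) would hold.
    struct TokenBucket {
        capacity: f64,
        tokens: f64,
        rate: f64,
        last_refill: Instant,
    }

    impl TokenBucket {
        fn new(capacity: f64, rate: f64) -> Self {
            Self { capacity, tokens: capacity, rate, last_refill: Instant::now() }
        }

        /// Returns true if the request is allowed, false if rate limited.
        fn try_acquire(&mut self) -> bool {
            let now = Instant::now();
            let elapsed = now.duration_since(self.last_refill).as_secs_f64();
            self.tokens = (self.tokens + elapsed * self.rate).min(self.capacity);
            self.last_refill = now;
            if self.tokens >= 1.0 {
                self.tokens -= 1.0;
                true
            } else {
                false
            }
        }
    }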
I would be curious to see how much more mileage you could have gotten with that given that the Node application would not have to make network calls to the Elixir service and Redis.
Just wanted to share that little tidbit as it is something that I see quite often with people new to the BEAM :). Thanks again for sharing!
I would push rate limiting to the load balancer, HAProxy or Nginx, but that's just me. If you have a round-robin LB in front you just set each instance to limit at 1/nodes rate, that way you don't have to share any state.
If you're load balancing on IP hash you can set each instance to limit at full rate and not worry about it.
Shared state in rate limiting becomes a bottleneck very quickly. If you're trying to mitigate spam/DDoS you could easily get 100,000 requests a second. You're going to max out your shared-state DB way faster than your 10-gig lines.
That is definitely a valid route to go so long as your rate limiting is not dependent on much business logic. If rate limiting is per user or per user per instance/service, I would personally bring that kind of concern into the application where it is closer to the persistence layer where those things are defined (and again handling the business logic inside per customer GenServers).
I have never used this product so just speculation. But I imagine there is some sort of auth token that valid agents send to tell Bearer that this is a valid/invalid request so that things can be trivially rejected to mitigate a DoS/DDoS to an extent.
One of the interesting effects of using Rust is saving money! I also migrated an F#/.NET e-commerce backend and can run it with less RAM/CPU, which makes my bills lower.
Can you share information about your experience? I'm currently working on an F# project, enjoying the functional approach while having a lot of libraries available on the .NET platform. The |> operator is one I use all over the code, but Rust doesn't support custom operators. Is that annoying, or not at all? Is your code less functional and more imperative in style due to Rust?
LLVM should optimize tail calls and sibling calls. But tail call optimization has unexpected interactions with the extended RAII that Rust uses, because stuff has to be dropped at the end of its lifetime, so the code that's running in "tail" position is sometimes not what you expect.
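A tiny, hedged example of that interaction (illustrative only, not taken from rustc's actual codegen): the call in source "tail" position isn't the last thing that runs, because the local's destructor has to run after it returns.

    struct Guard(&'static str);

    impl Drop for Guard {
        fn drop(&mut self) {
            println!("dropping {}", self.0);
        }
    }

    fn count_down(n: u64) -> u64 {
        let _g = Guard("frame"); // must be dropped when this frame exits
        if n == 0 {
            return 0;
        }
        // Looks like a tail call, but _g's Drop runs *after* this call
        // returns, so the caller's frame can't simply be replaced.
        count_down(n - 1)
    }

    fn main() {
        count_down(3);
    }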
As a beginner to Rust I'm surprised by this. Given the Rust compiler is able to figure out the lifetimes in the recursive case, you'd think the lifetimes within the tail-optimized loop would be the same. Doesn't the lexical scope of the loop's body have the equivalent lifecycle of a recursive call (drop at the end of the loop vs. the end of the function)?
This is a point that's really attractive to me as well. A faster language and runtime means I can run on fewer, cheaper servers, which means when it's time to scale 10 or 100x, it's not a massive cost increase.
Disclaimer: not a Rust or .NET dev, but my impression is they're very different in what they're trying to do. It would be like comparing C to Python (as an example, not as an analogy).
Rust is closer to a super fancy C; it compiles natively and was made with a heavy focus on certain types of memory safety.
F# has syntax more like Haskell/ML, and is compiled to bytecode instead of a native executable. It runs on .NET and everything that entails.
That's actually why I'm asking, since the OP is transitioning backend logic from F# to Rust. So either Rust can cover a lot of the language features, or he has a use case that really warrants that performance (and F# on .NET is not exactly non-performant).
Interesting post! From what I've understood you had only one instance of NodeJS; while I agree that Rust is generally more performant, couldn't you have just added more instances?
GC of 500 requests/s could not have possibly caused a performance issue. Most likely the problem was JS code holding on to the 1MB requests for the duration of the asynchronous Kinesis request, or a bug in the Kinesis JS library itself. With a timeout of 2 minutes, you may end up with up to 30K/min x 2min x 1MB = 60GB of RAM used. GC would appear to be running hot during this time, but only because it has to scavenge memory from somewhere while up to 60GB is in use.
They didn't mention Java as a possible solution, even though its GCs are far better than anything else out there. I have nothing against Rust, but if I were at a startup I would save my innovation points for where they're mandatory.
Ah yes, a 236 word "article", that says to choose "boring old technology" and also to use Rust in the same breath.
This article should mention that rust isn't close to ready when it comes to web backends. As much as I love Rust, if I were running a startup or even a decently sized company I would always choose Rails. Now -that- is boring, old... and mature technology. Certain components could get re-written in Rust, certainly, but there's no reason to ignore a mature ecosystem from the start.
> This article should mention that rust isn't close to ready when it comes to web backends.
Actix-web works just fine. They got a new maintainer team involved that has been spending some time getting rid of all the insane unsafety that was in the code before.
Yes. It "works". However deciding to use Actix/Warp means throwing away years and years of work in the rails and ruby world.
Rails is mature, robust, and has a huge ecosystem with rubygems. Rust (when it comes to web stuff) is not. "it works" does not pass my litmus test. With Actix/Warp I have to implement stuff by hand that either comes by default with rails or already exists in a gem.
I like Rust but I'm not a zealot. People way overestimate performance when they barely have any traffic to begin with as a small startup or even a medium sized company.
You could even use rails, and use Rust to write ruby gems instead of going with actix/warp/etc.
> insane unsafety
This was overblown. Yes, the author didn't respond appropriately, but unsafe isn't inherently dangerous. This is a stupid misconception within the Rust community and caused a lot of unnecessary drama around Actix.
One step beyond good Java GCs is to write fully zero-GC Java code. The advantage of it is complete control over your performance which means your software is going to be consistently fast. The disadvantage is that it is relatively difficult to obtain.
If you want to see an example of fully zero-GC Java, you can check out QuestDB on Github [1] - Disclaimer I work for QuestDB.
> One step beyond good Java GCs is to write fully zero-GC Java code. The advantage of it is complete control over your performance which means your software is going to be consistently fast. The disadvantage is that it is relatively difficult to obtain.
I don't know that it's actually possible in the general case, as Java's support for value types remains wholly insufficient. IIRC the ixy folks never managed to remove all allocations from the Java version.
It's not as if Elixir doesn't benefit from a battle-hardened VM; the BEAM is older and was used in these kinds of high-volume scenarios before Java was.
They ruled out a language because it had a stop-the-world GC; even if switching removed the bottleneck for now, the GC would likely become one later. Java has the same issue. Rust does not.
Not sure why they'd consider Java, given that concern.
The Shenandoah and ZGC collectors have worst-case pauses of ~10ms and average pauses of 0.5ms. The average pause is sometimes faster than a malloc() in C, so you won't really be introducing more latency than C does.
The other option is to avoid allocating memory at all, which you can do in C/Rust but also in Java. The vast majority of shops that need low/no-allocation performance use Java anyway (HFT).
But those GCs also don't guarantee all garbage has been collected, nor how much processing time you'll get before they run again. So op could still end up stuck with their code barely executing, due to memory and CPU pressure, and throughput/latency drops to zilch.
"The vast majority of shops" is an interesting metric given the vast majority had to pick a language before Rust existed. Java and trying to minimize allocations, vs C/C++, I'd probably choose Java too. Java trying to minimize allocations (no way to guarantee you've done it right), vs Rust (which does guarantee no GC)...I'd probably pick Rust.
> But those GCs also don't guarantee all garbage has been collected, nor how much processing time you'll get before they run again. So op could still end up stuck with their code barely executing, due to memory and CPU pressure, and throughput/latency drops to zilch.
You could malloc() and free() so much that the code doesn't have time to do anything, too. And those operations aren't bounded in time either. Just using C doesn't save you from memory allocation; it's just done manually instead of automatically. In every system I've worked on, you won't see this kind of GC pressure unless you do something profoundly bad.
> "The vast majority of shops" is an interesting metric given the vast majority had to pick a language before Rust existed.
You could, sure. You could do (anything) and still block (other parts of the code). Point is, GC isn't necessary for executing code, malloc/free is (well, for non-trivial programs that have data whose size is unknown at compile time).
With GC, I've definitely had plenty of times where GC created very noticeable pressure on the running system. Nothing quite as bad as they describe here, true, but that would probably predispose me to ditching GC if I saw that kind of behavior, too (since it means either my case is truly deviant or my devs have done something very wrong, and neither of those is something I want to bet on being able to fix).
GC is like malloc() but automated and done on multiple threads.
With how low the pause times are with Java's new collectors (categorically not longer than malloc()), the only faster alternative is not allocating at all. Which you can do in many languages.
I agree with your point that memory allocation and GC are expensive, but that's something you see in any language when allocating. It's not unique to GC. You either pay the price with GC or with malloc()/free(). It's the same price, but one is much easier to deal with than the other.
If pause times lower than 10ms are important you could write your app in Rust which makes it much easier to systematically avoid allocations using borrowing.
I guess my point is that the new collectors make Java memory management roughly comparable to C, and if you care so much about performance that allocation costs matter, you should just use a language like Rust where it's easy to avoid allocation completely. The performance gap between allocating in C vs. Java is, for practical purposes, gone.
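"Avoid allocation completely" mostly means reusing buffers and handing out borrowed slices, so the steady state allocates nothing per item. A toy sketch of that pattern in Rust (the `pump`/`handle` names are made up for illustration):

    use std::io::{BufRead, BufReader, Read};

    // Reuse one String across iterations; read_line appends into existing
    // capacity, so after warm-up the per-line cost is a copy, not a fresh
    // heap allocation, and `handle` only ever borrows the contents.
    fn pump(input: impl Read, mut handle: impl FnMut(&str)) -> std::io::Result<()> {
        let mut reader = BufReader::new(input);
        let mut line = String::new();
        loop {
            line.clear();
            if reader.read_line(&mut line)? == 0 {
                return Ok(()); // EOF
            }
            handle(line.trim_end());
        }
    }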
They did mention Go, which also has a modern GC and doesn't need to run your app under a software VM. But ultimately, even Go is just another GC'd language. They had no real need for GC and the like, so why use Go, let alone Java?
Go's GC is notoriously low-tech. Its design dates back to papers published in the 70s. Java's Parallel collector is most similar to Go's GC, and it's deprecated for removal due to poor performance.
Go collects VERY frequently to keep average pause times low, which hurts throughput. It also has pathological worst-case pause times, which is what the author ran into.
My suggestion of Java is just because it's been used for high-performance REST for decades. Rust has Actix and a few other frameworks; in Java you have 20+ options. It's a lot easier to get something off the ground when you're 99% sure you won't have to build anything except some glue code. Yes, Rust is faster, but is 30% better performance worth treading the wilds while you're trying to keep a startup afloat? I don't think so, personally.
Using Java with the Shenandoah or ZGC collector would directly handle their GC issue with rather boring technology.
Rust doesn't have a GC; it uses an ownership model instead.
Ultimately, in a startup, it comes down to what people are comfortable with. I personally would use Rust for something like this because I am comfortable with Rust. It's a perfect use case for it as well, imho.
There are a couple of things I see in this post that I wouldn't do at all, and I maintain a couple of services with orders of magnitude higher QPS. I feel that replacing Node.js with any compiled language would have had the same positive effect.
Having never dealt with issues relating to garbage collection before, how do you go about diagnosing GC issues in a language where that’s all handled for you?
In the java realm you have very fine-grained GC logging that provides insight into the overall behavior of the GC and its different subcomponents. Then there are recording/debugging facilities that allow you to trace allocations, how long objects live, analyze the entire heap (including unreachable but not yet collected objects). And higher-level monitoring APIs separate from the logging and debugging stuff. You can also choose between collectors with different characteristics, trading between overall heap size, latency, utilization of CPU cores and other factors.
In the case of NodeJS, which uses the V8 engine [0], you have access to the diagnostics API [1] that allows you to profile your CPU or memory consumption.
There are some tools that make this easier (see [2] or [3]), but you're often left to interpret the results yourself.
There are some general tricks that are language-agnostic, like allocating a huge "buffer" object when the app starts, the size of which is some significant portion of the memory you allow the process to use, which always has a reference, then storing references to other objects you need in that big object. In other words, circumvent the garbage collector.
Of course, this has its own issues, but I've seen it done in e.g. Go before. It's likely you'll inevitably end up with leaks, but if your service is fungible and can tolerate restarts, what you're basically doing is moving the "GC pause" to be a "container restart" pause, which may be slower but happens less often. Some languages have ways to manually call the GC (Node is not one of them, afaik).
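The Rust analogue of that trick isn't about dodging a collector, but the pre-allocate-and-reuse shape is the same. A toy pool sketch (names are illustrative; a real one would need synchronization):

    /// Toy pool of reusable byte buffers: allocate capacity up front, hand
    /// buffers out, and take them back instead of freeing, so the hot path
    /// does no allocation. Single-threaded sketch.
    struct BufferPool {
        free: Vec<Vec<u8>>,
    }

    impl BufferPool {
        fn new(count: usize, size: usize) -> Self {
            Self { free: (0..count).map(|_| Vec::with_capacity(size)).collect() }
        }

        fn get(&mut self) -> Vec<u8> {
            self.free.pop().unwrap_or_default() // falls back to a fresh Vec
        }

        fn put(&mut self, mut buf: Vec<u8>) {
            buf.clear(); // keep the capacity, drop the contents
            self.free.push(buf);
        }
    }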
Node allows you to call global.gc() if you enable that functionality with a separate argument. In many cases this is an anti-pattern that would make your application behavior worse rather than better, that's why you have to opt into that.
It's a fairly common thread in game development circles. Usually game development is one of the few places with big enough constraints that it's worth doing your own memory management in languages that have garbage collectors. Other places it often makes sense to just architect around GC pauses, since you're going to want redundancy and load balancing anyways.
It's 2020, the bar needs to be much higher.
[1] https://www.amazon.com/Practical-mod_perl-Stas-Bekman/dp/059...
[2] https://books.google.com/books?id=i3Ww_7a2Ff4C&pg=PT356&lpg=...
[3] https://www.techempower.com/benchmarks/#section=data-r19&hw=...