Sorry, I must be missing something in this blog post because the requirements here sound incredibly minimal. You just needed an HTTP service (sitting behind an Envoy proxy) to process a mere 500 requests/second (up to 1MB payload) and pipe them to Kinesis? How much data preparation is happening in Rust? It sounds like all the permission/rate-limiting/etc happens between Envoy/Redis before it ever reaches Rust?
I know this comes across as snarky but it really worries me that contemporary engineers think this is a feat worthy of a blog post. For example, take this book from 2003 [1] talking about Apache + mod_perl. Page 325 [2] shows a benchmark: "As you can see, the server was able to respond on average to 856 requests per second... and 10 milliseconds to process each request".
And just to show this isn't a NodeJS vs Rust thing, check out these webframework benchmarks using various JS frameworks [3]. The worst performer on there still does >500 rps while the best does 500,000.
They list out what is being done by the service - "It would receive the logs, communicate with an elixir service to check customer access rights, check rate limits using Redis, and then send the log to CloudWatch. There, it would trigger an event to tell our processing worker to take over."
That sounds like a decent amount of work for a service, and without more detail it's very hard to say whether or not a given level is efficient or inefficient (we don't know exactly what was being done; we can assume that they're using pretty small Fargate instances though since the Node one came in at 1.5G). They also give some number; 4k RPM was their scaleout point for Node (that's not necessarily the maximum, but the point they felt load was sufficiently high to warrant a scaleout; certainly, their graph shows an average latency > 1 second). Rewriting in Rust, that number was raised to 30k RPM; 100 mb of memory, < 40ms average latency (and way better max), and 2.5% of CPU.
Given all that, it sounds like, yes, GC was the issue (both high memory and CPU pressure), and with the Rust implementation (no GC) they're nowhere near any CPU or memory limit, and so the 30k is likely a network bottleneck.
That said, while I agree that sounds like a terrible metric on the face of it, with what data they've provided (and without anything else), it also sounds like it may be due to they're just operationally dealing with very large amounts of traffic. They may want to consider optimizing the network pipe; not familiar enough with Fargate, but if it's like EC2, there may be a sizing of cpu/memory that also gives you a better network connection (EC2 goes from 1 GBPS to a 10 GBPS network card at one instance type)
> That sounds like a decent amount of work for a service
5+ years ago I wrote a real-time transcoding, muxing streaming radio service that did 5000 simultaneous connections with inline, per-client ad spot injection (every 30 seconds in my benchmark). Using C and Lua. On 2 Xeon E3 cores--1 core for all the stream transcoding, muxing, and HTTP/RTSP setup, 1 core for the Lua controller (which was mostly idle). The ceiling was handling all the NIC IRQs.
While I think what I did was cool, I know people can eke much more performance out of their hardware than I can. And I wasn't even trying too hard--my emphasis is always on writing clear code and simple abstractions (though that often translates into cache-friendly code).
At my day job, in the past two months I've seen two services in a "scalable" k8s clusters fall over because the daemons were running with file descriptor ulimits of 1024. "Highly concurrent" Go-based daemons. For all the emphasis on scale, apparently none of the engineers had yet hit the teeny, tiny 1024 descriptor limit.
We really do need to raise our expectations a little.
I haven't written any Rust but I have recently helped someone writing a concurrent Rust-based reverse proxy service debug their Rust code and from my vantage point I have some serious criticisms of Tokio. Some of the decisions are clearly premature optimization chosen by people who probably haven't actually developed and pushed into production a process that handles 10s of thousands of concurrent connections, single-threaded or multi-threaded. At least not without a team of people debugging things and pushing it along. For example, their choice of defaulting to edge-triggered instead of level-triggered notification shows a failure to appreciate the difficulties of managing backpressure, or debugging lost edge-triggered readiness state. These are hard lessons to learn, but people don't often learn them because in practice it's cheaper and easier to scale up with EC2 than it is to actually write a solid piece of software.
All I'm saying is that without some example of the payloads they're managing, and the logic they're performing, it's hard to say "this is inefficient". And, as I mentioned, if their CPU and memory are both very low, it's likely they're hitting a network (or, yes, OS) limit.
I've seen places hit ulimit limits...I've also seen places hit port assignment issues, where they're calling out to a downstream that can handle thousands of requests with a single instance, so there are two, and there aren't enough port identifiers to support that (and the engineers are relying on code that isn't reusing connections properly). Those are all things worth learning to do right, agreed, and generally doing right. I'm just reluctant to call out someone for doing something wrong unless I -know- they're doing something wrong. The numbers don't tell the whole story.
They might not be doing anything wrong, per se. But if your expectations are that 500/s is alot (or even 4000/s for log ingesting), then your architecture will reflect that.
Here's what they're doing:
> Now, when the Bearer Agent in a user's application sends log data to Bearer, it goes into the Envoy proxy. Envoy looks at the request and communicates with Redis to check things like rate limits, authorization details, and usage quotas. Next, the Rust application running alongside Envoy prepares the log data and passes it through Kinesis into an s3 bucket for storage. S3 then triggers our worker to fetch and process the data so Elastic Search can index it. At this point, our users can access the data in our dashboard.
Given their goal and their problems with GC I can tell you right off the bat probably what's the problem with their various architectures from day 1--too much simplistic string munging. If your idea of log ingestion is using in-language regex constructs to chop up strings into pieces, possibly wrapping them in abstract objects, then its predictable you're going to have GC issues, and memory bandwidth issues in general, and poor cache locality in data and code. But 99% of the time this is how people approach the issue.
What a problem like this cries out for is a streaming DFA architecture, using something like Ragel so you can operate on streams and output flat data structures. You could probably implement most of the application logic and I/O in your scripting language of choice, unoptimized GC and all, so long as you're not chopping up a gazillion log lines into a gazillion^2 strings. The latter approach will cause you grief in any language, whether it's JavaScript, Java, Go, Rust or C. The number of objects per connection should be and can be a small N. For example, at 10 distinct objects (incoming connection object, log line, data structure with decomposed metadata, output connection object, etc) per connection times 500 connections, that's 5000 objects per second. Even Python's and Ruby's GC wouldn't break a sweat handling that, even though internally it'd be closer to 10 * (2 or 3) objects.
Here's a big problem today: nobody writes their own HTTP library or JSON library; everybody uses the most popular ones. So right off the bat every ingestion call is going to generate hundreds or thousands of objects because popular third-party libraries generally suck in each request and explode it into huge, deeply nested data structures. Even in Rust. You can't optimize that inefficiency away. No amount of fearless concurrency, transactional memory, fastest-in-the-world hashing library, or coolest regular expression engine can even begin to compensate. You have to avoid it from day 1. But if your expectations about what's possible are wrong (including how tractable it is with some experience), it won't even occur to you that you can do better. Instead, you'll just recapitulate the same architectural sins in the next fastest language.
"I can tell you right off the bat probably what's the problem"
Emphasis added. I don't disagree with you that they may be doing something inefficient; I'm just saying, I don't -know- what they're doing, so I'm disinclined to judge it.
I do know that, again, in Rust, whatever bottleneck they're hitting is neither CPU nor memory, despite the seemingly low throughput, which does imply that what you're proposing isn't the bottleneck in that implementation.
Seriously, 500 qps was something we used to do in interpreted languages on the Pentium Pro. But this kind of blog post is a whole genre: How [ridiculous startup name] serves [trivial traffic] using only [obscenely wasteful infrastructure] in [trendy runtime framework that's a tiny niche or totally unknown in real industry].
Then you just weren't paying attention. There's a list of approved languages for new development and Rust isn't on it. There's a list of languages that are forbidden, for which you need high-level approval to use for new work, and Rust is on that one, next to C++.
Rust support is categorized as Tier 2 at Dropbox. Do you work at Dropbox? You can go look for the language approval list, which documents this.
Tier 2 means it requires approval. That is massively different from "forbidden". There is, for example, a Tier 3 list - Java is on there, some other languages, they're discouraged a lot more strongly than Rust though you'll still find them in some parts of the codebase (primarily acquired code from what I recall). Approval is only required for business/ product dev - non product teams can very easily write rust, at least one is currently doing so.
Tier 2 means it has internal library support, such as the communication library. That means there is, at all times, active rust development - and of course this is true, the most critical parts of Dropbox are written in Rust, and they rely on those libraries.
You can also very easily find out about the existing projects being written in Rust. Go ask the rust-lang channel. Last I saw, maybe 6 months ago, Rust was being used for another major component of the product - obviously I'm not commenting publicly on that further.
You've (probably unintentionally, judging by your tone) shined a light on the corrupt culture of that engineering org. It's written down that Tier 2 languages require special permission to begin new projects. It doesn't sound like we disagree on that. The corrupt cultural aspect is there exists an in-group clique of engineers who can and will start projects in any language they want, and an out-group against whom the written policy will be used to stop new efforts.
It is definitely not the case that any randomly selected backend engineer at that company can just pick up Rust and solve any problem with it, because the Tier 2 status is used as a cudgel to stop most such efforts.
I disagree entirely with the "forbidden" wording. That's it. Tier 2 is not "Forbidden" it is a statement on the level of support.
I agree with your point about political issues - the entire tier list, specifically even Rust being tier 2, was political. I saw stupid shit like that too many times at Dropbox.
But it is simply a fact that new projects are being built in Rust, regardless of the political aspects of why that is the case despite it being tier 2. "Forbidden" does not convey the state of things - everything else you've said is agreeable.
It took your comment to make me notice it wasn't 30k requests/second but minute instead.
500 requests per second is what I would expect of a default PHP + Apache installation on a small Ubuntu server.
I too have a hard time grasping whats special here. For example I saw cached Wordpress setups handle 400 to 500 requests per second. And Wordpress isn't known for performance even with caching plugins.
I know this comes across as snarky but it really worries me that contemporary engineers think this is a feat worthy of a blog post. For example, take this book from 2003 [1] talking about Apache + mod_perl. Page 325 [2] shows a benchmark: "As you can see, the server was able to respond on average to 856 requests per second... and 10 milliseconds to process each request".
And just to show this isn't a NodeJS vs Rust thing, check out these webframework benchmarks using various JS frameworks [3]. The worst performer on there still does >500 rps while the best does 500,000.
It's 2020, the bar needs to be much higher.
[1] https://www.amazon.com/Practical-mod_perl-Stas-Bekman/dp/059...
[2] https://books.google.com/books?id=i3Ww_7a2Ff4C&pg=PT356&lpg=...
[3] https://www.techempower.com/benchmarks/#section=data-r19&hw=...