
They list out what is being done by the service - "It would receive the logs, communicate with an elixir service to check customer access rights, check rate limits using Redis, and then send the log to CloudWatch. There, it would trigger an event to tell our processing worker to take over."

That sounds like a decent amount of work for a service, and without more detail it's very hard to say whether a given level is efficient or inefficient (we don't know exactly what was being done; we can assume they're using pretty small Fargate instances, though, since the Node one came in at 1.5G). They also give some numbers: 4k RPM (roughly 67 requests per second) was their scaleout point for Node (not necessarily the maximum, but the point at which they felt load was high enough to warrant a scaleout; certainly, their graph shows an average latency > 1 second). Rewriting in Rust raised that number to 30k RPM (500 per second), with 100 MB of memory, < 40ms average latency (and a far better max), and 2.5% CPU.

Given all that, it sounds like, yes, GC was the issue (both high memory and CPU pressure), and with the Rust implementation (no GC) they're nowhere near any CPU or memory limit, and so the 30k is likely a network bottleneck.

That said, while I agree it sounds like a terrible metric on the face of it, with the data they've provided (and without anything else), it may just be that they're operationally dealing with very large amounts of traffic. They may want to consider optimizing the network pipe; I'm not familiar enough with Fargate, but if it's like EC2, there may be a CPU/memory sizing that also gives you a better network connection (EC2 jumps from a 1 Gbps to a 10 Gbps network card at a certain instance type).




> That sounds like a decent amount of work for a service

5+ years ago I wrote a real-time transcoding, muxing streaming radio service that did 5000 simultaneous connections with inline, per-client ad spot injection (every 30 seconds in my benchmark). Using C and Lua. On 2 Xeon E3 cores--1 core for all the stream transcoding, muxing, and HTTP/RTSP setup, 1 core for the Lua controller (which was mostly idle). The ceiling was handling all the NIC IRQs.

While I think what I did was cool, I know people can eke much more performance out of their hardware than I can. And I wasn't even trying too hard--my emphasis is always on writing clear code and simple abstractions (though that often translates into cache-friendly code).

At my day job, in the past two months I've seen two services in "scalable" k8s clusters fall over because the daemons were running with file descriptor ulimits of 1024. "Highly concurrent" Go-based daemons. For all the emphasis on scale, apparently none of the engineers had yet hit the teeny, tiny 1024 descriptor limit.
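
(To make that concrete: a minimal sketch, not from those daemons, of what "raise the fd limit at startup" looks like in Rust with the libc crate; the 65,536 target and the function name are mine, for illustration only.)

    fn raise_nofile_limit(target: libc::rlim_t) -> std::io::Result<()> {
        unsafe {
            let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
            if libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) != 0 {
                return Err(std::io::Error::last_os_error());
            }
            // Only the soft limit can be raised without privileges, and only
            // up to the hard limit the operator (or systemd unit) allows.
            lim.rlim_cur = target.min(lim.rlim_max);
            if libc::setrlimit(libc::RLIMIT_NOFILE, &lim) != 0 {
                return Err(std::io::Error::last_os_error());
            }
        }
        Ok(())
    }

    fn main() {
        raise_nofile_limit(65_536).expect("couldn't raise RLIMIT_NOFILE");
    }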

We really do need to raise our expectations a little.

I haven't written any Rust but I have recently helped someone writing a concurrent Rust-based reverse proxy service debug their Rust code and from my vantage point I have some serious criticisms of Tokio. Some of the decisions are clearly premature optimization chosen by people who probably haven't actually developed and pushed into production a process that handles 10s of thousands of concurrent connections, single-threaded or multi-threaded. At least not without a team of people debugging things and pushing it along. For example, their choice of defaulting to edge-triggered instead of level-triggered notification shows a failure to appreciate the difficulties of managing backpressure, or debugging lost edge-triggered readiness state. These are hard lessons to learn, but people don't often learn them because in practice it's cheaper and easier to scale up with EC2 than it is to actually write a solid piece of software.
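
To illustrate the edge-triggered complaint: with edge-triggered readiness you're only woken on the transition to "readable", so you must drain the socket until it reports WouldBlock or you can lose the wakeup entirely, and you can't simply stop reading mid-stream to push back on a fast sender the way you can with level-triggered polling, where the event just fires again. A minimal sketch of that drain discipline (the function name and buffer size are made up, and it's generic over any non-blocking Read rather than tied to Tokio):

    use std::io::{ErrorKind, Read};

    fn drain_edge_triggered<S: Read>(sock: &mut S, sink: &mut Vec<u8>) -> std::io::Result<()> {
        let mut buf = [0u8; 4096];
        loop {
            match sock.read(&mut buf) {
                Ok(0) => return Ok(()),                      // peer closed
                Ok(n) => sink.extend_from_slice(&buf[..n]),  // must keep reading
                Err(e) if e.kind() == ErrorKind::WouldBlock => return Ok(()), // fully drained
                Err(e) if e.kind() == ErrorKind::Interrupted => continue,
                Err(e) => return Err(e),
            }
        }
    }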


All I'm saying is that without some example of the payloads they're managing, and the logic they're performing, it's hard to say "this is inefficient". And, as I mentioned, if their CPU and memory are both very low, it's likely they're hitting a network (or, yes, OS) limit.

I've seen places hit ulimit ceilings... I've also seen places hit port exhaustion issues, where they're calling out to a downstream that can handle thousands of requests with a single instance, so only two instances are running, and there aren't enough ephemeral ports to support all the outbound connections (because the engineers are relying on code that isn't reusing connections properly). Those are all things worth learning to do right, agreed, and generally doing right. I'm just reluctant to call out someone for doing something wrong unless I -know- they're doing something wrong. The numbers don't tell the whole story.
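
(For the port exhaustion case, the usual fix is just sharing one client and its connection pool instead of building a new client, and a new connection, per request. A hypothetical sketch with the reqwest crate; the host name is made up:)

    use reqwest::blocking::Client;

    fn main() -> Result<(), reqwest::Error> {
        // Anti-pattern: reqwest::blocking::get() in a hot loop opens a fresh
        // connection per call, and the closed ones pile up in TIME_WAIT until
        // the ephemeral port range runs out.

        // Better: one Client, reused; keep-alive connections are pooled.
        let client = Client::new();
        for _ in 0..100_000 {
            let _resp = client.get("http://downstream.internal/health").send()?;
        }
        Ok(())
    }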


They might not be doing anything wrong, per se. But if your expectations are that 500/s is a lot (or even 4000/s for log ingesting), then your architecture will reflect that.

Here's what they're doing:

> Now, when the Bearer Agent in a user's application sends log data to Bearer, it goes into the Envoy proxy. Envoy looks at the request and communicates with Redis to check things like rate limits, authorization details, and usage quotas. Next, the Rust application running alongside Envoy prepares the log data and passes it through Kinesis into an s3 bucket for storage. S3 then triggers our worker to fetch and process the data so Elastic Search can index it. At this point, our users can access the data in our dashboard.

Given their goal and their problems with GC, I can tell you right off the bat probably what's the problem with their various architectures from day 1--too much simplistic string munging. If your idea of log ingestion is using in-language regex constructs to chop up strings into pieces, possibly wrapping them in abstract objects, then it's predictable you're going to have GC issues, and memory bandwidth issues in general, and poor cache locality in data and code. But 99% of the time this is how people approach the issue.

What a problem like this cries out for is a streaming DFA architecture, using something like Ragel so you can operate on streams and output flat data structures. You could probably implement most of the application logic and I/O in your scripting language of choice, unoptimized GC and all, so long as you're not chopping up a gazillion log lines into a gazillion^2 strings. The latter approach will cause you grief in any language, whether it's JavaScript, Java, Go, Rust or C. The number of objects per connection should be and can be a small N. For example, at 10 distinct objects (incoming connection object, log line, data structure with decomposed metadata, output connection object, etc) per connection times 500 connections, that's 5000 objects per second. Even Python's and Ruby's GC wouldn't break a sweat handling that, even though internally it'd be closer to 10 * (2 or 3) objects.
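
A minimal sketch of the flat-data-structure idea (the field layout here is invented; a real ingester would generate the state machine with something like Ragel and run it over the socket buffer directly): scan each line once and record field boundaries as offsets into the original buffer, instead of allocating a string per field.

    // Byte ranges into the original buffer; nothing is copied per field.
    #[derive(Default, Debug)]
    struct LogRecord {
        ts: (usize, usize),
        level: (usize, usize),
        msg: (usize, usize),
    }

    // Tiny hand-rolled scanner for "<timestamp> <level> <message...>" lines.
    fn scan_line(line: &[u8]) -> Option<LogRecord> {
        let mut rec = LogRecord::default();
        let (mut field, mut start) = (0, 0);
        for (i, &b) in line.iter().enumerate() {
            if b == b' ' && field < 2 {
                match field {
                    0 => rec.ts = (start, i),
                    _ => rec.level = (start, i),
                }
                field += 1;
                start = i + 1;
            }
        }
        if field < 2 {
            return None; // malformed line
        }
        rec.msg = (start, line.len());
        Some(rec)
    }

    fn main() {
        let line = b"2019-11-07T10:15:00Z ERROR upstream timed out";
        let rec = scan_line(line).unwrap();
        // Borrow slices of the original buffer only when you need them.
        println!("{:?}", std::str::from_utf8(&line[rec.level.0..rec.level.1]));
    }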

Here's a big problem today: nobody writes their own HTTP library or JSON library; everybody uses the most popular ones. So right off the bat every ingestion call is going to generate hundreds or thousands of objects because popular third-party libraries generally suck in each request and explode it into huge, deeply nested data structures. Even in Rust. You can't optimize that inefficiency away. No amount of fearless concurrency, transactional memory, fastest-in-the-world hashing library, or coolest regular expression engine can even begin to compensate. You have to avoid it from day 1. But if your expectations about what's possible are wrong (including how tractable it is with some experience), it won't even occur to you that you can do better. Instead, you'll just recapitulate the same architectural sins in the next fastest language.
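
For what it's worth, even with a mainstream library you can keep the per-request object count down by deserializing only the fields you need and borrowing them straight out of the input buffer, rather than exploding the payload into a fully owned tree. A sketch with serde/serde_json (the field names are assumptions, not taken from their payloads), whether or not it goes as far as the hand-rolled approach above:

    use serde::Deserialize;

    // Borrowed fields: no per-field String allocations when the input has no
    // escape sequences; the struct just points into `body`.
    #[derive(Deserialize)]
    struct LogEvent<'a> {
        timestamp: &'a str,
        level: &'a str,
        message: &'a str,
    }

    fn main() -> serde_json::Result<()> {
        let body = r#"{"timestamp":"2019-11-07T10:15:00Z","level":"error","message":"upstream timed out"}"#;
        let event: LogEvent = serde_json::from_str(body)?;
        println!("{} {} {}", event.timestamp, event.level, event.message);
        Ok(())
    }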


"I can tell you right off the bat probably what's the problem"

Emphasis added. I don't disagree with you that they may be doing something inefficient; I'm just saying, I don't -know- what they're doing, so I'm disinclined to judge it.

I do know that, again, in Rust, whatever bottleneck they're hitting is neither CPU nor memory, despite the seemingly low throughput, which does imply that what you're proposing isn't the bottleneck in that implementation.


wasn't that 4k requests per minute?



