
The current popularity of the async stuff has its roots in the classic "c10k" problem. (https://en.wikipedia.org/wiki/C10k_problem)

There's a perception among some that threads are expensive, especially when "wasted" on blocking I/O, and that using them in that domain "won't scale."

Putting aside that not all of us are building web applications (heterodox here on HN, I know)...

Most people in the real world, with real applications, will not hit the limits of what is possible, efficient, and totally fine with thread-based architectures.

Plus the kernel has gotten more efficient with threads over the years.

Plus hardware has gotten way better, and better at handling concurrent access.

Plus async involves other trade-offs -- running a state machine behind the scenes that's doing the kinds of context switching the kernel & hardware already potentially do for threads, but in user space. If you ever pull up a debugger and step through an async Rust/tokio codebase, you'll get a good sense for what the overhead we're talking about here is.

That overhead is fine if you're sitting there blocking on your database server, or some HTTP socket, or some filesystem.

It's ... probably... not what you want if you're building a game or an operating system or an embedded device of some kind.

An additional problem with async in Rust right now is that it involves bringing in an async runtime, and giving it control over execution of async functions... but various things like thread spawning, channels, async locks, etc. are not standardized, and are specific per runtime. Which in the real world is always tokio.

So some piece of code you bring in via a crate uses async, and now you have to fire up a tokio runtime. Even though you were potentially not building something that has anything to do with the kinds of things that tokio is targeted for ("scalable" network services.)

So even if you find an async runtime that's optimized for some other domain (like glommio or smol or whatever) -- you're unlikely to even be able to use it with whatever famous upstream crate you want, which will have explicit dependencies on tokio.



> If you ever pull up a debugger and step through an async Rust/tokio codebase, you'll get a good sense for what the overhead we're talking about here is.

So I didn't quite do that, but the overhead was interesting to me anyway, and as I was unable to find existing benchmarks (surely they exist?), I instructed the computer to create one for me: https://github.com/eras/RustTokioBenchmark

On this wee laptop the numbers are 532 vs 6381 CPU cycles when sending a message (one way) from one async task to another (tokio) vs from one kernel thread to another (std::mpsc), when limited to one CPU. (It's limited to one CPU because rdtscp numbers are not comparable between different CPUs; I suppose pinning both threads to their own CPUs and actually measuring end-to-end delay would solve that, but this is what I have now.)
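For reference, the thread/mpsc side of such a measurement can be sketched with just the standard library (this uses Instant instead of the repo's rdtscp, so it reports nanoseconds rather than cycles, and it measures round trips rather than one-way latency):

```rust
// Rough sketch: min round-trip latency over std::sync::mpsc channels.
use std::sync::mpsc;
use std::thread;
use std::time::Instant;

fn main() {
    let (tx_ping, rx_ping) = mpsc::channel::<Instant>();
    let (tx_pong, rx_pong) = mpsc::channel::<Instant>();

    let echo = thread::spawn(move || {
        // Echo each timestamp straight back to the sender.
        for t in rx_ping {
            tx_pong.send(t).unwrap();
        }
    });

    let mut min_ns = u128::MAX;
    for _ in 0..100_000 {
        tx_ping.send(Instant::now()).unwrap();
        let sent = rx_pong.recv().unwrap();
        min_ns = min_ns.min(sent.elapsed().as_nanos());
    }
    drop(tx_ping); // close the channel so the echo thread exits
    echo.join().unwrap();
    println!("min roundtrip: {min_ns} ns");
}
```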

So this was eye-opening to me, as I expected tokio to be even faster! But still, it's roughly 10x as fast as the thread-based method. A straight-up callback would of course be a lot faster still, but that would affect the way you structure your code.

Improvements to methodology accepted via pull requests :).


I'd want to see perf stats on branch prediction misses and L1 cache evictions alongside that though. CPU cycles on their own aren't enough.


It doesn't seem my perf provides a metric for L1 cache evictions (per perf list).

Here are the results for 100000 rounds, though, for:

    taskset 1 perf record -F10000 -e branch-misses -e cache-misses \
        -e cache-references target/release/RustTokioBenchmark (a)sync
    perf report --stat

async

    Task 2 min roundtrip time: 532
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0,033 MB perf.data (117 samples) ]

    ...    
    branch-misses stats:
              SAMPLE events:         54
    cache-misses stats:
              SAMPLE events:         27
    cache-references stats:
              SAMPLE events:         36
sync

    Thread 2 min roundtrip time: 7096
    [ perf record: Woken up 5584 times to write data ]
    [ perf record: Captured and wrote 0,367 MB perf.data (7418 samples) ]

    ...
    branch-misses stats:
              SAMPLE events:       6577
    cache-misses stats:
              SAMPLE events:        159
    cache-references stats:
              SAMPLE events:        682


Interesting. Thing is, all you're benchmarking is the cost of sending a message on tokio's channels vs mpsc's channels.

It would be interesting to compare with crossbeam as well.

But I'm not sure this reflects anything like a real application workload. In some ways this is the worst possible performance scenario: just two threads spinning at the fastest speed they can, dumping messages into a channel and pulling them out. It's a benchmark of the channels themselves and whatever locking/synchronization stuff they use.

It's a benchmark of a "shared concurrent data" situation, with constant synchronization. What would be more interesting is to have longer-running jobs doing some task inside themselves and only periodically (every few seconds, say) synchronizing.

What are the tokio executor's default settings there? Multithreaded or not? I'd be curious whether tokio is actually using multiple threads here.


Actually I wasn't that interested in throughput, only the latency from when a message is sent until it is received, though indeed the throughput is also superior with tokio.

For most applications this difference doesn't really matter, but maybe some applications do a lot of small things where it does matter? In those cases it might be an easy win to switch from standard threads to tokio async and gain 10x speed, as the structure of the application remains the same.

> It's a benchmark of the channels themselves and whatever locking/synchronization stuff they use.

Yeah, in retrospect some mutex benchmark might be better, though I don't expect a message channel implemented on top of one to be noticeably slower. A mutex benchmark is probably easier to get wrong.

> What would be more interesting is to have longer running jobs doing some task inside themselves and only periodically (ever few seconds, say) synchronizing.

I don't quite see how this would give any different results. Of course, in that case the time it takes to transmit the message would be completely meaningless.

> What's the tokio executor's settings by default there? Multithreaded or not? I'd be curious how e.g. whether tokio is actually using multiple threads or not here.

It's using the multithreaded executor. I tried the benchmark with #[tokio::main(worker_threads = 1)] and 2; with =1 the result was 529, with =2 it was 566.


> Putting aside that not all of use are building web applications

Perfect moment to mention "rouille" which is a very lightweight synchronous web server framework. So even when you decide to build some web application you do not necessarily have to go down the tokio/async route. I have been using it for a while at work and for private projects and it turned out to be pretty eye-opening.


Hit the nail on the head.

Unless you're really dealing with absurd numbers of simultaneous blocking I/O, async has entirely too many drawbacks.


>now you're having to fire up a tokio runtime

I've been developing in (mostly async) Rust professionally for about a year -- I haven't written much sync Rust other than my learning projects and a raytracer I'm working on, but what are the kinds of common dependencies that pose this problem? Like wanting to use reqwest or things like that?


> Like wanting to use reqwest or things like that?

Yes. Reqwest cranks up Tokio. The amount of stuff it does for a single web request is rather large. It cranks up a thread pool, does the request, and if there's nothing else going on, shuts down the thread pool after a while. That whole reqwest/hyper/tokio stack is intended to "scale", and it's massive overkill for something that's not making large numbers of requests.

There's "ureq", if you don't want Tokio client side. Does blocking HTTP/HTTPS requests. Will set up a reusable connection pool if you want one.


reqwest also has a blocking version, which I use in projects not already using an async rt

https://docs.rs/reqwest/latest/reqwest/blocking/index.html


The blocking implementation still depends on and uses tokio, last I looked.

I've seen this with multiple Rust packages. "Yes, we offer a synchronous blocking version..." and then you look and it's calling rt.block_on behind the scenes.

Which is a pretty large facepalm IMHO


You don't have to do that, Tokio also provides a single-threaded runtime that just runs async tasks on the main thread.



