The fact that the Rust tokio version (the one that uses tokio tasks instead of t...

kprotty · on Sept 13, 2021

I don't think tokio's slowness here is "to be expected". There isn't much reason for tokio tasks to have that much overhead over normal thread pools. The I/O driver shouldn't be called in such a benchmark given there's no I/O work happening. Waker's only add a reference count inc/dec + an atomic cas for wake() which should only happen on the JoinHandle `awaits` [4] compared to just an atomic swap for the Zig case on the join handle.

Golang doesn't poll for I/O under such cases [0] and tokio should be using the `ParkThread` parker for putting threads to sleep [1] given `net` features aren't enabled (not sure if this is the actually the case) which you can force with a custom Runtime initialization instead of `#[tokio::main]` as an exercise.

`crossbeam-deque` requires heap allocation for the run queues, heap allocates on growth, and garbage collects said memory. This is an overhead I wished to avoid and is something tokio has been improvements with avoiding as well [2].

`rayon` isn't a good comparison here given `rayon::join` is optimized to hook directly into the scheduler and run the caller only until the other forked-section completes [3]. This isn't general purpose and it takes advantage of unbounded stack allocation which can technically cause a stack overflow. Zig could do this and also take advantage of batch scheduling, but it complicates the code and is unfair here given `async` usage. Tokio, golang, and the Zig benchmarks require heap allocation on spawn so I believe it makes it a fairer comparison. This is also why I used rayon scopes instead of join(): less specialization and reintroduced the heap allocation from the unbounded concurrency.

The `unsafe` there is from me copying the benchmark code from the tokio version to the rayon version and forgetting to remove the hack. In tokio, ownership of the array needed to be passed into the function given the lifetime was no longer linear from the spawn() I assume (correct me if i'm wrong here, but this is what the compile error hinted at). So I needed to recreate the array after the function, hence unsafe. If there's a better way for the tokio version, please send a PR. I see you've done so for the rayon version and I gladly merged it.

[0]: See `atomic.Load(&netpollWaiters) > 0` in https://golang.org/src/runtime/proc.go

[1]: https://github.com/tokio-rs/tokio/blob/master/tokio/src/runt...

[2]: https://tokio.rs/blog/2019-10-scheduler#a-better-run-queue

[3]: https://docs.rs/rayon-core/1.9.1/src/rayon_core/join/mod.rs....

[4]: https://github.com/tokio-rs/tokio/blob/98578a6f4a494e709f000...

Arnavion · on Sept 13, 2021

>I don't think tokio's slowness here is "to be expected".

And yet on my machine the rayon version takes ~160ms and the tokio version takes ~1350s. This isn't at the level of some minor missed performance optimization.

>There isn't much reason for tokio tasks to have that much overhead over normal thread pools.

tokio is an async runtime. tokio tasks are meant to be for distributing I/O-bound work. It would be at least a little more correct to use spawn_blocking for CPU-bound tasks, but that still doesn't work for your recursive calls because that's not what it's meant for.

In general, if you have CPU-bound work in your tokio-using application, you run it on a different threadpool - tokio's blocking one, or a completely different one.

>`rayon` isn't a good comparison here given `rayon::join` is optimized to hook directly into the scheduler and run the caller only until the other forked-section completes [3]. [...] This is also why I used rayon scopes instead of join(): less specialization and reintroduced the heap allocation from the unbounded concurrency.

My original comment was also talking about scopes, not `rayon::join`. So yes, `rayon` is absolutely a good comparison.

kprotty · on Sept 14, 2021

This actually can be at the level of a missed optimization. A run queue with a lock-shared queue amongs all the threads scales even worse than the tokio version. Sharding the run queues and changing the notification algorithm, even while keeping locks on the sharded queues improves throughput drastically.

Tokio is an async runtime, but I don't see why being an async runtime should make it worse from a throughput perspective for a thread pool. I actually started on a Rust version [0] to test out this theory of whether async-rust was the culprit, but realized that I was being nerd-sniped [1] at this point and I should continue my Zig work instead. If you're still interested, I'm open to receiving PRs and questions on that if you want to see that in action.

It's still correct to benchmark and compare tokio here given the scheduler I was designing was mean to be used with async tasks: a bunch of concurrent and small-executing work units. I mention this in the second paragraph of "Why Build Your Own?".

The thread pool in the post is meant to be used to distribute I/O bound work. A friend of mine hooked up cross-platform I/O abstractions to the thread pool [2], benchmarked it against tokio to be have greater throughput and slightly worse tail latency under a local load [3]. The thread pool serves it's purpose and the quicksort benchmark is to show how schedulers behave under relatively concurrent work-loads. I could've used a benchmark with smaller tasks than the cpu-bound partition()/insertion_sort() but this worked as a common example.

I've already mentioned why rayon isn't a good comparison: 1. It doesn't support async root concurrency. 2. scoped() waits for tasks to complete by either blocking the OS thread or using similar inline-scheduler-loop optimizations. This risks stack overflow and isn't available as a use case in other async runtimes due to primarily being a fork-join optimization.

[0]: https://github.com/kprotty/zap/blob/blog-rust/src/thread_poo...

[1]: https://xkcd.com/356/

[2]: https://github.com/lithdew/hyperia

[3]: https://gist.github.com/kprotty/5a41e9612657de00788478a7dde4...

skohan · on Sept 13, 2021

Honestly tokio is so complex, and it serves so many use-cases it's hard for me to believe it could ever be truly optimal for any one use-case.

kristoff_it · on Sept 13, 2021

Isn't Go's scheduler just as versatile?

skohan · on Sept 13, 2021

I'm not an expert on the go scheduler, but my perception is that it is more of a focused single-purpose component whereas tokio seems like a sprawling swiss-army-knife of a library if you browse the source

bbatha · on Sept 13, 2021

The tokio scheduler and the go scheduler are roughly equivalent. Much of tokios bulk is reimplementing much of the std lib in an async compatible way (io, os, blocking).

kprotty · on Sept 14, 2021

If you browse the source, the go scheduler has complexities to deal with that tokio doesn't as well. The thread pool is unified between worker threads & blocking threads. Go also does goroutine preemption via signaling/SuspendThread + a background monitor thread called sysmon. Go does garbage collection and the tracing/race-detection/semantics are tightly coupled to both its allocator and it's scheduler. Go also exposes and maintains its entire standard library which includes an http client/server (tokio the org maintains their own as hyper but its separated from tokio the runtime). It can be fair to argue that Go is just as "swiss-army-knife" of a system as tokio.

tick_tock_tick · on Sept 13, 2021

It's much more versatile but also has many more years of optimization.

gefhfff · on Sept 13, 2021

Can you explain that Tokio vs crossbeam vs rayon a bit more? Should I default to rayon or crossbeam because of that?

ModernMech · on Sept 13, 2021

It depends on what you want to do. If you are doing io-bound work, Tokio would be what you want -- you would use it as a runtime for the async capabilities in Rust. If you have cpu-bound work, then rayon is what you want to use. Rayon is a work-stealing parallelism crate -- it will schedule work to be done, and different threads will schedule portions of it as they become available to do work. It's very easy to get 100% CPU utilization across all cores use Rayon if your work is naturally paralellizable, and the interface is dead simple: anywhere you do a .iter(), you can turn it into a .par_iter(), and Rayon will parallelize the work.

Note there is some overhead to using Rayon -- you normally will be better off doing your work on a single thread unless you have a large number of elements in your collection... I found for my application I needed more than 1e6 elements before I saw an appreciable performance benefit to using Rayon.

As others said, Crossbeam is for sending and receiving messages across threads. I use it alongside of tokio and rayon.

YorickPeterse · on Sept 13, 2021

Crossbeam is a library that provides various concurrency primitives, such as a queue type that allows stealing of work.

Rayon is a library for running code in parallel. Typically you'll give it a (parallel) iterator of sorts, and it will distribute the work across a pool of threads.

Tokio is a library for async IO operations.

nicoburns · on Sept 13, 2021

Tokio is designed for concurrent io, rayon, or tokio-threadpool for concurrent cpu-bound task.

knuthsat · on Sept 13, 2021

What about smol?

ModernMech · on Sept 14, 2021

smol is in the same category as Tokio - it's a runtime for the async capabilities in Rust.