> fibers can context switch billions of times per second on a modern processor.
On a 3GHz (3 billion hertz) processor, you expect to be able to context switch billions of times per second?
I would probably accept millions without question, even though that might be pushing it for a GIL-ed runtime like Ruby has. But, unless your definition of "context switch" counts every blocked fiber that's passed over for a context switch as being implicitly context switched to and away from in the act of ignoring it, I find this hard to believe.
It takes more than one clock cycle to determine the next task that we're going to resume execution for and then actually resume it.
If I remember correctly, Go's scheduler has a global queue and a local queue per worker thread, so when you spawn a goroutine it probably has to acquire a write lock on the global queue.
Allocating a brand new goroutine stack and doing some other setup tasks has a nontrivial overhead that has nothing to do with context switching, regardless of global locks.
To properly benchmark this, I think I would start with just measuring single task switching by measuring how long it takes main to call https://golang.org/pkg/runtime/#Gosched in a loop a million times. This would measure how quickly Go can yield a thread to the scheduler and have it be resumed, although this includes the overhead of calling a function.
Then I would launch a goroutine per core doing this yield loop and see how many switches per second they did in total, and then launch several per core, just to ensure that the number hasn't changed much from the goroutine per core measurement.
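The single-task half of that measurement can also be tried on Ruby fibers. This is only a rough sketch of the same idea, not the Go benchmark described above: the helper name `fiber_switches_per_second` is made up for illustration, and the figure it prints includes the cost of the block and method calls around each switch.

```ruby
require "benchmark"

# Resume a fiber that does nothing but yield, `iterations` times, and return
# the observed number of switches per second. Each iteration counts as two
# switches: main -> fiber (resume) and fiber -> main (Fiber.yield).
def fiber_switches_per_second(iterations)
  fiber = Fiber.new { loop { Fiber.yield } }
  elapsed = Benchmark.realtime { iterations.times { fiber.resume } }
  (2 * iterations) / elapsed
end

rate = fiber_switches_per_second(1_000_000)
puts format("%.1f million switches/second", rate / 1e6)
```

The number will vary a lot between machines and Ruby versions, so treat it as an order-of-magnitude check rather than a result.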
Since Go's scheduler is not bound to a single core, it should scale pretty well with core count.
I might run this benchmark myself in a while, if I find time.
It looks like the context switching speed when you have a single Goroutine just completely outperforms any of the benchmark numbers that have been posted here for Python or Ruby, as would be expected, and it still outperforms the others even when running 256 yielding tasks for every logical core.
The cost of switching increased more with the number of goroutines than I would have expected, but it seems to become pretty constant once you pass the number of cores on the machine. Also keep in mind that this benchmark is completely unrealistic. No one is writing busy loops that just yield as quickly as possible outside of microbenchmarks.
This benchmark was run on an AMD 2700X, so, 8 physical cores and 16 logical cores.
The one additional comment I have is that this addendum doesn't involve a reactor/scheduler in the benchmark, so it excludes the process of selecting the coroutine to switch into, which is a significant task. The Go benchmark I posted above is running within a scheduler.
So, once it's decided what work to do, it's just a matter of resuming all the fibers in order.
Additionally, since fibers know what work to do next in some cases, the overhead can be very small. You sometimes don't need to yield back to the scheduler, but can resume directly another task.
He might have just phrased it badly, though. Maybe he meant that it can handle billions of concurrent tasks, not that it can switch between them billions of times in a single second xD
> It takes more than one clock cycle to determine the next task that we're going to resume execution for and then actually resume it.
Not really, if you just implement a round-robin scheduler and if none of the fibres are actually blocked (i.e. if they just yield).
Artificial benchmark? Sure. But there's no way you could ever approximate this with OS threads, even if the user-space code/threads themselves do no useful work.
The difference between stackful coroutines (i.e. fibers in this case) and Python-style coroutines is that Python's coroutines need to rebuild the stack on every resume, whereas resuming a fiber is basically a goto. The cost of yielding and resuming a coroutine in Python is O(N) for the depth of the call chain, but O(1) for fibers.
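A small Ruby sketch of the stackful side of that comparison: the fiber suspends from the bottom of a deep call chain and is later resumed at exactly that frame, without re-entering the callers above it. The helper names `descend` and `deep_yield_demo` are made up for illustration, and the depth is kept small because fiber stacks are limited.

```ruby
# Recurse `depth` frames down, then suspend the fiber from the innermost frame.
def descend(depth, log)
  if depth.zero?
    log << :suspended_at_bottom
    Fiber.yield                 # the whole call chain stays on the fiber's stack
    log << :resumed_at_bottom   # execution continues here on the next resume
  else
    descend(depth - 1, log)
  end
  depth
end

def deep_yield_demo(depth)
  log = []
  fiber = Fiber.new { descend(depth, log) }
  fiber.resume                  # runs down to the bottom, then yields
  log << :back_in_caller
  result = fiber.resume         # jumps straight back into the innermost frame
  [log, result]
end

p deep_yield_demo(100)
```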
So as you say, a simple scheduler that just walks a linked-list could resume a stackful coroutine in a few instructions, plus the cost of restoring hardware registers (which is the same cost as a non-inlined function call), and the latter is easily pipelined so would take fewer cycles than instructions.
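The round-robin case can be sketched in a few lines of Ruby. This is a toy, assuming every fiber only ever yields and never blocks; `run_round_robin` is a hypothetical helper, and an array stands in for the linked list.

```ruby
# Resume each live fiber in turn until all of them have finished.
# Returns the total number of resumes performed.
def run_round_robin(fibers)
  resumes = 0
  queue = fibers.dup
  until queue.empty?
    fiber = queue.shift
    fiber.resume
    resumes += 1
    queue.push(fiber) if fiber.alive?   # finished fibers drop out of the rotation
  end
  resumes
end

log = []
workers = %w[a b c].map do |name|
  Fiber.new do
    3.times do |step|
      log << "#{name}#{step}"
      Fiber.yield
    end
  end
end

resumes = run_round_robin(workers)
puts log.join(" ")
puts "#{resumes} resumes"
```

Picking the next task here is just a queue shift, which is why the selection cost can be close to nothing when no fiber is blocked.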
I'm a bit hesitant about this article's proposition, although I'm open to being convinced.
In the end there's a GIL. For pure CPU-bound workloads, your only true source of parallelism will be putting more processes into the mix, be it via forking or simply spawning more processes from scratch.
But inside a given process, it seems to me that no matter what you do (fibers/threads), only 1 CPU can do CPU-bound work at a time.
If I was to design a high-performance Ruby solution, I'd ditch forking, threading and fibers. I'd focus instead on creating low-memory-footprint, fast-startup processes (read: no Rails!) that one can comfortably spawn in parallel, be it in a single server, or in a cluster (think k8s). Max Ruby processes count: 1 per core.
Mostly Ruby web app workloads aren't CPU-bound. And when they are, you're using a language that is 100 to 1000 times slower than C.
That is, for a CPU-bound task, you might be able to replace your thousand node k8s cluster with one machine running code written in a faster language (before even getting into communication and load balancing and HA and all that crap.)
But abandoning everything you've written in one language for the sake of a single hot path isn't really a good idea either.
It's probably a more sensible idea to use a native extension to optimize just that hot path. For instance, the Rust bindings are really good: https://github.com/tildeio/helix
Sometimes. If your profile is "lumpy" enough that can be true, but often the performance problems of slow languages manifest as annoyingly flat death-by-a-thousand-cuts profiles. (Especially after the low hanging fruit has been picked.)
Grinding another 5x out of a flat profile can be hard work requiring more creativity and producing less readable code than a (fairly mechanical) translation. Luckily these tasks also tend to crop up when design has stabilised and team sizes are increasing, meaning it can often coincide with the benefits of static typing becoming greater.
It's not a panacea, and it's obviously both risky and costly. Shrugs.
(I've also seen the other case, FWIW. Wrote some C extension code called from a Rails app, and it was the right call at the time and we got great mileage out of it.)
Fast execution context switching is great for concurrency. Concurrency is a way to express the control flow. It can be, and often is, achieved without any parallelism, with a strictly one-thing-at-a-time execution.
Parallelism is a way to have more than one execution context execute simultaneously. In a system with a GIL, it only makes sense for contexts mostly waiting for I/O. CPU-bound tasks are obviously sequenced and limited by GIL access, and no parallelism is possible.
Title is actually "Fibers are the Right Solution" and the article ends with "Fibers are the best solution for composable, scalable, non-blocking clients and servers."
The title as posted is most definitely not true. I have worked on high volume Ruby applications where the problem with using async I/O combined with poor raw execution of Ruby code resulted in excess garbage to be collected. Ruby performance should mean the performance of executing Ruby code. Perhaps the title could have been 'Ruby webapp performance'.
> I have worked on high volume Ruby applications where the problem with using async I/O combined with poor raw execution of Ruby code resulted in excess garbage to be collected.
What async I/O approach did you use and where did the garbage come from? I/O buffers?
It was an API endpoint making a number of DB and other network requests. The garbage was the accumulation of temporary and response objects that were held for the long duration of processing, up to half of which could be in the Ruby code not waiting on I/O. If the code ran faster there would be more time for GC and for handling the incoming request volume.
> We show that fibers require minimal changes to existing application code and are thus a good approach for retrofitting existing systems.
I disagree that 'minimal changes to existing code' is a good goal for a Ruby parallelism solution. The large Ruby codebases I have dealt with have gigantic hacks to get around the GIL: complex fork management systems, subprocess management, crazy load-this-but-not-that systems for prefork memory management to optimize copy on write, and probably more. Parallel tasks in Ruby are a nightmare to work with.
Changing existing code _should_ be a goal of any Ruby parallelism solution. If we can't get rid of this kind of cruft, what are we even doing?
I still love Ruby, but I want go-style channels, not apologies for the GIL.
These are the slides for Koichi Sasada’s RubyConf 2018 (last week) talk updating the community on his progress in the design and implementation of Guilds.
Hi, I'm the author, I'm also a Ruby core committer, and yes I'm aware of Guilds. They are a model for parallelism and don't really affect concurrency at all. At best, they provide a 3rd option on top of processes/threads/guilds.
Guilds felt omitted in your post since they (should) address one of the points you make about the usability/ergonomics of existing Ruby APIs for managing non-sequential execution.
But it’s definitely a bit early to tell what Guilds will actually look like as a final product.
Any time you call a blocking function that the system provides, it should immediately yield the fiber, which makes it look preemptive. If you write a loop that just spins forever, that could block the whole system, potentially, making the abstraction leaky. In a language like Ruby, they could definitely add some true preemption, but I don't know if that's what they plan to do.
From the article: "Fibers, on the other hand, are semantically similarly to threads (excepting parallelism), but with less overhead." So, the author definitely isn't implying parallelism of the fibers.
> What's all the fuss about parallelism in the article about then?
The author was talking about different methods for handling more than one request at a given time, which include forking and threads. With Ruby's GIL, threads are a lot less attractive than they could be. A good fiber implementation can handle tons of network requests concurrently and very efficiently even on a single core, which is the case being discussed here.
At the end, the author discusses a hybrid approach of forking and fibers, where each processor core would have a fork of the Ruby program running, and each fork would have its own fiber pool, running many tasks concurrently.
In languages that don't have a GIL, forking is rarely a tool that I reach for. It really hurts your database pooling and all sorts of other small problems, but it's a common trade-off when using Ruby, Python, and Node.
> Any time you call a blocking function that the system provides, it should immediately yield the fiber, which makes it look preemptive.
In the old days we would call this cooperative to contrast it from preemptive. This is the essence of cooperative, yielding at explicit points be they IO request, timers, or waiting on a message queue. Preemptive used to mean a certain thing and this is not it at all.
Cooperative multitasking typically implies (to me, at least) that the programmer is required to explicitly / manually yield their task, which is annoying, error prone, and isn't required here. The system's blocking functions will handle that behind the scenes.
Fibers are cooperative here, but not from the programmer's point of view, and that's an important distinction to make. If you write the same code for a cooperative system as you would for a preemptive system, is there really any difference to the programmer? It looks preemptive. If anything, properly implemented cooperative systems are more efficient. Most of the time when people ask the question that is asked higher in the thread, I believe they're worried that they will be responsible for remembering to yield control.
I'm pretty sure I did a decent job in my previous comment of explaining that the system only looks preemptive, and that it is possible to block it with some uncooperative code, so I'm not sure what point you're trying to make.
It's a matter of point of view, but to me cooperative/preemptive is a property of the underlying scheduler, not of what the programmer is usually exposed to. As you correctly pointed out, it is possible to block the scheduler with uncooperative code. It's not even hard: it takes just one heavy CPU-bound computation. I write this kind of computation every day: if you sell me a system as preemptive and it's not, I will get angry...
That's always how it worked though. In the cooperative multitasking that people complain about (in early Windows and Mac for instance), "blocking I/O" called yield internally, and you only needed to call yield() manually in long running computations that didn't have any I/O.
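A toy Ruby illustration of that arrangement, where the yield lives inside the "blocking" call rather than in application code. Everything here is hypothetical: `blocking_read` stands in for a system-provided I/O call, and `run_all` is a scheduler that simply pretends every read has completed by the fiber's next turn.

```ruby
TRACE = []

# Stand-in for a blocking call provided by the system. It suspends the current
# fiber and returns whatever the scheduler passes back on resume, so the
# caller never writes Fiber.yield itself.
def blocking_read(name)
  TRACE << "#{name} waits"
  Fiber.yield(name)
end

# Application code: reads like ordinary sequential, blocking code.
def handle_request(name)
  2.times do
    data = blocking_read(name)
    TRACE << "#{name} got #{data}"
  end
end

# Toy scheduler: resumes fibers in order, handing each one its "result".
def run_all(names)
  ready = names.map { |n| [Fiber.new { handle_request(n) }, nil] }
  until ready.empty?
    fiber, value = ready.shift
    waiting_on = fiber.resume(value)
    ready << [fiber, "data for #{waiting_on}"] if fiber.alive?
  end
end

run_all(%w[A B])
puts TRACE
```

If `handle_request` instead ran a long loop with no call to `blocking_read`, nothing else would run until it finished, which is the leak in the abstraction discussed above.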
What you're describing is bog standard cooperative multitasking.
Forking is going out of fashion in Ruby land really fast due to the problems you mentioned. Now that multi-threading is the norm, it's easier to just run one process per core and live with a little extra RAM usage.
I was working on some more improvements to forked memory usage in CRuby but I don't think it's worth pursuing.
What are you basing the claim that it's going out of fashion on? Shopify and GitHub both run Unicorn, which is pre-forking as far as I know. I think some companies prefer to prevent thread safety issues and pay the extra performance cost.
The point I'm trying to make is that you can't share resources easily between all of those processes, even though they're on a single machine, so you usually open a lot more database connections than you would need with a single shared connection pool. So, people often end up dealing with PgBouncer and other inconveniences much earlier than they would otherwise need.
Trying to share a much smaller number of connections between a larger number of threads with fine-grained checkin/checkout is a nightmare, in my experience. You end up with all sorts of difficult resource and lock contention issues. As soon as you need a simple transaction you're stuck holding the connection for the duration anyway.
In my experience, it's all handled transparently behind the scenes... there is no headache. In Rust, checking a connection out is a single function call on the pool, which is easily shared among all threads, and it will automatically get checked back into the pool when the connection goes out of scope... you don't have to do a single thing to check it back in. In Go, the connection pooling is all handled transparently behind the scenes, such that you don't even need to know it's happening. I actually had to do some googling when I started using Go, as I was concerned that no one was recommending the use of a connection pool... creating and tearing down a connection per request is just wasteful when connection pools are so nice to use. It just turns out that Go embraces connection pools so deeply that I don't know of an easy way to avoid pooling your database connections.
If your application gets bottlenecked by the number of connections in your pool, it's easy enough to increase the number, but the more independent pools you have, the more overprovisioned connections (connected but not being used) you will have scattered throughout those pools. It's also usually possible to run a connection pool without an upper limit, if you trust your database to handle a large number of connections gracefully.
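For readers who haven't used one, the checkout/checkin pattern with a client-side upper limit can be sketched in Ruby. `TinyPool` is a made-up class for illustration, not the API of any real pooling library, and the "connections" are just strings.

```ruby
# Minimal fixed-size pool built on Ruby's thread-safe Queue.
class TinyPool
  def initialize(size, &factory)
    @available = Queue.new
    size.times { @available << factory.call }
  end

  # Check a connection out, yield it to the block, and always check it back
  # in afterwards, even if the block raises.
  def with
    conn = @available.pop       # blocks while every connection is checked out
    yield conn
  ensure
    @available << conn if conn
  end

  def idle
    @available.size
  end
end

ids = 0
pool = TinyPool.new(2) { "conn-#{ids += 1}" }

# Six concurrent "requests" share two connections.
results = Array.new(6) do |i|
  Thread.new { pool.with { |conn| "request #{i} used #{conn}" } }
end.map(&:value)

puts results
puts "idle connections: #{pool.idle}"
```

The pool size is the client-side maximum: threads beyond it wait in `pop` instead of opening more connections.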
Rust and Go's connection poolers will also automatically scale down the connection pool when connections are idle for a given period of time, which is nice.
I can't think of any nightmares or headaches that I've encountered with those connection poolers. It all "Just Works"... except for PgBouncer, the ultimate connection pooler. PgBouncer doesn't work with prepared statements or transactions unless you run it in transaction mode, and then you have to run every query in a transaction to use prepared statements.
I'm definitely not suggesting that you try to serve 1000 concurrent requests with 10 connections or something silly like that, but that is what often happens when you get large Ruby deployments which would attempt to establish more connections than Postgres can handle, so you route them through PgBouncer where a small fraction of the number of connections exists.
But, this is pretty off-topic at this point. I didn't mean to point the conversation in this direction.
> Rust, checking a connection out is a single function call on the pool
Still a pain in the arse when you are making function calls inside a transaction and dealing with the connection reference lifetime.
> Go, the connection pooling is all handled transparently behind the scenes
This actually has a few nasty properties. Firstly, two simple queries in seemingly sequential Go code can actually execute in parallel. Secondly, it's possible for Go's connection pooling to cause some very nasty failures. Rather than timing out at the first of a bunch of normally fast but now unusually slow queries (because of a lock etc), Go will keep spawning new connections and parking running but not yet timed out queries until everything is on fire. Max connections is definitely a good idea.
> It's easy enough to increase the number
Only if you can restart your DB. Which, if you're trying to scale up under load, is the last thing you want to do.
> automatically scale down the connection pool when connections are idle for a given period of time
PgBouncer has supported this since release.
> PgBouncer doesn't work with prepared statements
Prepared statements themselves work fine. The problem is many ORMs do fragile, non-deterministic things with caching named prepared statements to improve throughput in simple scenarios.
Using named prepared statements can also cause other issues because it signals to PG that it's OK to use a generic query in some cases. It might not be!
I'm talking about client-side connection count maximums, not server-side. It's just a setting in connection pools like Rust and Go have.
> Prepared statements themselves work fine. The problem is many ORMs do fragile, non-deterministic things with caching named prepared statements to improve throughput in simple scenarios.
Postgres specifically supports unnamed prepared statements as a feature, and PgBouncer's model cannot do anything to help those. One connection creates this statement, and another tries to execute it. In fact, PgBouncer's docs specifically say that they do not support prepared statements, and not to use them, so your claim is contrary to the docs.
I really don't want to even bother with your Rust and Go comments, since they are just nonsense. Lifetimes are not a problem with function calls involving transactions in Rust. At all. I work with Ruby, Rust, and PostgreSQL professionally at my current full-time job. I've written a lot of queries, and many of those involved transactions.
Go will not execute two seemingly sequential queries in parallel. It will execute them sequentially. When you run a query, it's a synchronous process, unless you specifically launch that query in its own separate goroutine... in which case, it is absolutely not a surprise that it runs in parallel, because you did that. Your slippery slope argument is completely nullified by this property. If you don't set a maximum database connection limit and your web server receives another request that requires a connection, it's no surprise that it tries to open another database connection to help service that web request. From the beginning, it appeared you were making the argument that connection pools should not be used, and therefore each request just handles its own connections... which would also be unbounded just like this. Fortunately, Go and Rust database pools provide an option to limit the upper bound. I worked with Go and MySQL professionally at my previous full-time job.
Then... you're defending PgBouncer?! I thought you hated connection pools? PgBouncer is great at what it does, but what it does is a painful headache to deal with, because it breaks half the features any normal Postgres client expects to work seamlessly. You can't just prop it up in front of a database and expect things to "just work".
You're presenting information like you have all this experience, but my experience clearly indicates that what you're saying is just plainly wrong. I don't see any benefit to either of us in continuing this discussion further. I'm out.
No, you're just being unnecessarily rude. I just said I thought that fine-grained checkin-checkout like Go and Rust encourage is somewhat overrated. There's no need to be hostile.
> Postgres specifically supports unnamed prepared statements as a feature, and PgBouncer's model cannot do anything to help those.
PQexecParams (single phase prepared statement) works fine over PgBouncer. That's what I'm talking about. This is different to PREPARE & EXEC. You can't actually do a single phase prepared statement from psql, AFAIK, only via client libraries. https://www.postgresql.org/docs/11/libpq-exec.html
I agree, the PgBouncer documentation could be clearer. I think they just don't want people trying it to file bug reports. PREPARE and EXEC can actually work even over statement pooling but you need to make sure your connection setup statements prepare all the necessary statements.
> it's a synchronous process, unless you specifically launch that query in its own separate goroutine
You're right, I'm getting two things mixed up here. It is a while since I dealt with this problem.
1) The auto-checkout can mean you end up executing related statements on different connections. IMHO this is highly confusing to have as default behaviour and I prefer the Rust approach.
2) By default Go will just keep piling up Goroutines blocked on slow queries and open more DB connections and kill a DB server.
I've coded multiprocess, multithread, callback and context switching on multiple OSes and languages. Golang nails it, a bed of green threads scheduled onto CPU threads with isolation. The foundation is correct.
Seems to me that one interesting approach would be to automatically rewrite a program in continuation-passing style, and then use callbacks (the lowest-overhead approach) where the callback is the continuation.
It ought to compile pretty cleanly, since a continuation can end up being compiled into an address.
I've thought about this too, and it might be the route C++ takes. That being said, to resume you need to wind through all your function calls, executing parts of the state machine, so it might actually be slower.
That being said, I'm sure there are different ways to implement it and I look forward to proposed faster/efficient implementations.
I've been trying hard not to be snarky :-) It's pretty amazing how old things become new as a combination of 1) people not knowing about it and 2) it being given a new name. The first one is a bit sad and speaks volumes about our collective lack of education as an industry. I gladly include myself in that indictment, by the way. I only console myself with the fact that the list of things I don't know (but should) is probably longer than what could fill my (mostly useless) 4 year CS degree. Of course I graduated a long time ago... I'm a bit worried that as people get into this industry in less traditional ways we are gradually losing sight of the fact that there really is something important to learn.
I'm not trying to be snarky but writing a performant server or web server in userland Ruby doesn't seem to be good engineering. We already have nginx & apache. If you want to write performant servers you probably want something like Go, Rust, C++, C, D, JVM, BEAM.
I am glad that fibers and coroutines are going mainstream the way they are. With sincere and profuse apologies to Ruby fans (lest they think this is an attempt at hijacking the discussion) let me share this:
https://felix.readthedocs.io/en/latest/fibres.html
I feel some HN'ers would be curious, as there aren't many languages that offer fibers as a first class citizen.
BTW I am in no way involved in the development of Felix. It's not a new language; it's more than 15 years old and has had fibers from the start.
Are these similar in implementation to the upcoming Java Fibers? Those basically yield automatically whenever they perform a blocking operation, so another fiber can be executed.
Ruby fibers yield control to the fiber that spawned them or transfer control to another specific fiber explicitly. That's close to the Continuation model which Loom's fibers are built on top of, but Loom's fibers hide that process of yielding from the user, instead concentrating on providing a more general scheduling API and yielding at certain points inside the core library.
There are quite a lot of other small differences as well since Ruby fibers are always run on the same thread which originally spawned them, whereas Loom's continuations and fibers can be resumed on a different thread.
It is quite possible to implement Ruby fibers in terms of Loom's continuations, and we've done a prototype of this in TruffleRuby (I think Charles Nutter has done one in JRuby as well), and it certainly allows us to support very large numbers of active fibers.
You’re still working with types, you just don’t have anything checking your work (except the tests you write). Controversial opinion, but I think many of the folks who don’t like types really just don’t want to be bothered about the bugs they’ve introduced off the happy path. This is based on my experience in a Python shop.
It doesn’t really matter what you enjoy if we are talking about what is needed to make a given language faster. That being said, Node.js on V8 is a lot faster than Ruby and dynamically typed, so I do agree with your conclusion that static typing is not necessarily the answer.
One of the goals of Ruby 3 is to have a 3x improvement in performance. If you're interested in how Ruby is planning to improve performance, this is a good read: https://blog.heroku.com/ruby-3-by-3
Getting a bit off topic here, but you sound knowledgeable about Ruby. Ruby is a language I've always admired from afar, but never spent much time trying out. I've read the poignant guide, which was amusing but didn't teach me too much. What's the SICP of Ruby? As in, a high quality book that will teach me the ins and outs of the language.
Ruby is a heavily idiomatic language. It's strongly advised to use rubocop or a similar style guide -- and to temper that with good judgment, at least when it recommends avoiding `!!`.
In addition to the Pickaxe book, I recommend Metaprogramming Ruby by Paolo Perrotta, and potentially the sequel to that book (which I have yet to read).
Ruby lends itself to a very fluid style. One of the things that you may find less common is explicit variable assignment within a method: most of the time your assignment will be in method signatures and block declarations. The following code is illustrative of technique, but not recommended practice:
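A sketch along the lines the next paragraph describes. It is a reconstruction written for illustration, not the commenter's original snippet. It assumes a word list at `/usr/share/dict/words` and falls back to a tiny built-in list when that file is absent; `DICT_PATH`, `FALLBACK_WORDS`, `load_dictionary` and `non_words` are all made-up names.

```ruby
require "set"

# Assumed word-list location; the fallback is used when the file is missing.
DICT_PATH = "/usr/share/dict/words"
FALLBACK_WORDS = %w[ruby fiber thread].freeze

# Read the dictionary once into a Set for fast membership checks.
def load_dictionary(path = DICT_PATH)
  (File.exist?(path) ? File.readlines(path, chomp: true) : FALLBACK_WORDS).to_set
end

# Lazy, infinite stream of random lowercase strings (3 to 8 letters) that do
# not appear in `dictionary`. No local variable is assigned inside the method.
def non_words(dictionary)
  Enumerator.new { |out| loop { out << Array.new(rand(3..8)) { [*"a".."z"].sample }.join } }
            .lazy
            .reject(&dictionary.method(:include?))
end

puts non_words(load_dictionary).first(5)
```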
This generates "non-words", which are guaranteed not to exist in the (Unix) system dictionary, without using explicit assignment. First it creates a (lazy) generator object which yields random "words" of varying length. In Ruby, if you are calling a method with only one argument in an inner loop, you can avoid writing the loop explicitly, which is nice here because it also avoids the performance hit of reading the dictionary file repeatedly. The `method` method lets us pass an object-method pair, to be evaluated later, and the '&' there is a sort of cast-to-block, and you'll see that used in other contexts. So, at that point, we have a lazily-evaluated collection of lazily-filtered strings, and we can take an arbitrary number of these out and print them.
The nice thing about Ruby is that you can probably express what you want in one statement. This does come at the cost of a fair amount of idiom. Some of it is functionally important, some of it is convenient (like using the `*` operator to destructure ranges), and some is pure window dressing, but enforced by convention just the same. The Pickaxe book is better than anything else that I am aware of for describing Ruby idioms. I'm not sure how well it has aged. It's probably recommended to do a lot of pair programming and code review. At times I have mentored others on the website Exercism, and I would recommend that or a similar site.
I had decided against a somewhat stronger statement; I didn't want to bash Pickaxe unnecessarily, particularly as I don't have a lot of better suggestions. I'm fairly inclined to write a book myself to address the situation, but not any time soon.
The V8 team also has the budget to make these kinds of improvements. Browsers are a massive platform, not to mention Node. I don’t expect these changes to come to Python or Ruby soon.
Crystal suffers from other problems. They diverged too much from type annotations and use macros / metaprogramming too much which immensely increases the compile times.
I don't know about Crystal's performance problems specifically, but I would be surprised if macros would take up much compile time. I would have guessed it was more to do with the fiendishly tricky type inference problem that they set for themselves (i.e. global inference + subtyping + union types + inheritance + overloaded methods).
Beware before using fibers, because they won't access Thread.current[:vars].
So if you plan on going with fibers on an existing app, double check first whether Thread.current[:vars] is a must-have for you and your dependencies (e.g. the I18n gem).
Are you sure about that?
If you search `Ruby Thread Locals are also Fiber-Local` there will be a blog post from 2012 about that, and the code sample works fine for me on Ruby 2.5.
I'm sorry, my previous comment was not clear. I should have written beware before using fibers because it won't access Thread.current[:vars] _as you might expect_.
Here is an example:
    fiber = Fiber.new do # Thread.current[] is fiber-local, so a new fiber starts with empty locals
      puts "1. #{Thread.current[:test]}"
      Thread.current[:test] = false # stored in the fiber's own locals, won't leak out
      puts "2. #{Thread.current[:test]}"
    end
    Thread.current[:test] = true
    fiber.resume
    puts "3. #{Thread.current[:test]}"
Output:
1.
2. false
3. true
So fibers come with their _own stack_ and their own set of "thread locals", yes, but that set starts out empty. It is not read from Thread.current :/ Also, writing Thread.current[] inside the fiber won't apply outside the fiber.
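For values that should be visible from every fiber on a thread, Ruby also has `Thread#thread_variable_set` and `Thread#thread_variable_get`, which are truly thread-local. A minimal sketch of the difference, where the helper name `locals_seen_in_new_fiber` and the `:request_id` key are made up for illustration:

```ruby
# Set the same key in both kinds of storage, then read both from a new fiber.
def locals_seen_in_new_fiber
  Thread.current[:request_id] = "fiber-local value"                      # fiber-local storage
  Thread.current.thread_variable_set(:request_id, "thread-local value")  # thread-local storage

  Fiber.new do
    [Thread.current[:request_id], Thread.current.thread_variable_get(:request_id)]
  end.resume
end

p locals_seen_in_new_fiber
```

Inside the new fiber, `Thread.current[:request_id]` is nil while the thread variable is still visible. Whether a dependency such as I18n uses one or the other is something to check in that library's source.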
I've been using Ruby and Rails for about 12 years now and I see all the solutions people come up with as bandages over a wound. A normal Rails app starts off with Node and Redis installed. If like me you hop onto the Elixir world it becomes refreshing to see the language you are using do all the stuff multiple tools did and doing it faster.