Where I work we don't use any process-level parallelism at the Ruby level; we hoist that up to the Kubernetes level and use signals (CPU load, job queue sizes, etc.) to increase or decrease capacity. Workloads (replica sets) are segmented along multiple dimensions (different types of API traffic, worker queues) and tuned for memory, CPU, and thread count according to their needs. Some heavy-IO workloads can exceed a single CPU ever so slightly because the DB adapter isn't bound by the GVL, but practically speaking a pod/Ruby process can only utilize one CPU, regardless of thread count.
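To make the GVL point concrete, here's a toy benchmark (illustrative only, not from our codebase) showing why CPU-bound threads can't push a Ruby process past one core, while blocking IO can overlap:

```ruby
require "benchmark"

# Naive CPU-bound work; the GVL serializes it across threads.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

# Four threads of CPU work take roughly as long as doing the work
# serially, because only one thread can hold the GVL at a time.
cpu = Benchmark.realtime do
  4.times.map { Thread.new { fib(30) } }.each(&:join)
end

# Blocking IO (modeled here with sleep) releases the GVL, so these
# threads genuinely run concurrently.
io = Benchmark.realtime do
  4.times.map { Thread.new { sleep 1 } }.each(&:join)
end

puts format("cpu-bound: %.2fs (serialized by the GVL)", cpu)
puts format("io-bound:  %.2fs (~1s, threads overlap)", io)
```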
One downside of this approach is that our app takes a long time to boot, and that, combined with the time it takes to provision new nodes, can cause pod autoscalers to flap or overprovision if we don't periodically tune our workloads.
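For a sense of what that tuning involves, the knobs are roughly these: an autoscaling/v2 HPA with a scale-down stabilization window and a rate limit, so slow-booting pods aren't churned. This is a generic sketch; the names and numbers are made up, not our actual config:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-workers            # hypothetical workload
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-workers
  minReplicas: 4
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      # Wait before scaling down, and cap how fast replicas are
      # removed, to damp flapping while new pods are still booting.
      stabilizationWindowSeconds: 600
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```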
In a perfect world we would be able to spawn processes/pods that are already warmed up/preloaded (similar to forking, but at the k8s level, with the processes detached from the root process), in a way that isn't constrained by the CPU capacity of whatever underlying k8s node they run on and instead draws from an effectively infinite pool of CPUs where we pay only for what we use. Serverless sort of offers this kind of solution if you squint, but it's not a good fit for our architecture.
In my past experience with a large Rails monolith, memory usage was always the limiting factor: just booting the app carried significant memory overhead. Using in-process concurrency would have led to massive infrastructure savings, since much of that overhead could be shared across threads, probably 2-3x the density of a single-threaded deployment.
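To sketch the density argument in a threaded server like Puma (assuming Puma purely for illustration; the numbers are invented), each extra thread costs only its own working set rather than another full copy of the booted app:

```ruby
# config/puma.rb (illustrative settings, not an actual production config)
workers 2        # separate processes, for when you do want more cores
threads 5, 5     # each process serves 5 requests concurrently, all
                 # sharing one copy of the booted app's memory
preload_app!     # boot once in the parent so forked workers share
                 # pages copy-on-write
```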
In the end we never quite got there, due to thread-safety issues. We did use a post-boot forking solution, which achieved significant memory savings thanks to copy-on-write, but it was a bit more complex.
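A minimal sketch of the post-boot forking idea, with `boot_app!` and `run_worker_loop` as hypothetical stand-ins for loading the app and doing the actual work:

```ruby
boot_app!    # hypothetical: pay the boot cost (and its memory) once

GC.compact   # compacting the heap before forking reduces later
             # copy-on-write invalidation caused by GC writes

pids = 4.times.map do
  fork do
    # The child shares the parent's booted heap copy-on-write; only
    # pages it writes to get copied.
    run_worker_loop  # hypothetical worker entry point
  end
end

pids.each { |pid| Process.waitpid(pid) }
```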
All that to say, the naive "just let Kubernetes scale it for you" approach is probably quite expensive.