Hard to say without more details, but those graphs look very similar to what you get when nproc goroutines interact with the CFS CPU scheduler of the Linux of the time. I've seen latency graphs improve significantly, sometimes entirely, simply by setting GOMAXPROCS to account for the CFS behavior. Unfortunately the blog post doesn't even make a passing mention of this.
Anecdotally, the main slowdown we saw of Go code running in Kubernetes at my previous job was not "GC stalls", but "CFS throttling". By default[1], the runtime will set GOMAXPROCS to the number of cores on the machine, not the CPU allocation for the cgroup that the container runs in. When you hand out 1 core on a 96-core machine, bad things happen. Well, you end up with non-smooth progress. Setting GOMAXPROCS to ceil(cpu allocation) alleviated a LOT of problems.
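For illustration, a minimal sketch of the ceil(cpu allocation) idea. It assumes cgroup v2, where the limit is exposed in a `cpu.max` file as "<quota> <period>" (or "max <period>" when unlimited); the path and the helper name `procsFromCPUMax` are my assumptions, not anything from the original setup.

```go
package main

import (
	"fmt"
	"math"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// procsFromCPUMax parses the contents of a cgroup v2 cpu.max file
// ("<quota> <period>" or "max <period>") and returns ceil(quota/period).
// ok is false when no limit is set or the content cannot be parsed.
// Hypothetical helper, named here for illustration only.
func procsFromCPUMax(content string) (procs int, ok bool) {
	fields := strings.Fields(content)
	if len(fields) != 2 || fields[0] == "max" {
		return 0, false
	}
	quota, err1 := strconv.ParseFloat(fields[0], 64)
	period, err2 := strconv.ParseFloat(fields[1], 64)
	if err1 != nil || err2 != nil || quota <= 0 || period <= 0 {
		return 0, false
	}
	return int(math.Ceil(quota / period)), true
}

func main() {
	// Assumed path: the cgroup v2 unified hierarchy as seen inside the container.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err == nil {
		if n, ok := procsFromCPUMax(string(data)); ok {
			runtime.GOMAXPROCS(n)
		}
	}
	// GOMAXPROCS(0) only reports the current value without changing it.
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```

On cgroup v1 the quota and period live in separate files, so this would need adjusting. Setting the GOMAXPROCS environment variable from the deployment config gets you the same result without any code.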
Similar problems with certain versions of Java and C#[1]. Java was exacerbated by a tendency for Java to make everything wake up in certain situations, so you could get to a point where the runtime was dominated by CFS throttling, with occasional work being done.
I did some experiments with a roughly 100 Hz increment of a Prometheus counter metric, and with a GOMAXPROCS of 1, the rate was steady at ~100 Hz down to a CPU allocation of about 520 millicores, then dropped off (~80 Hz down to about 410 millicores, ~60 Hz down to about 305 millicores, then I stopped doing test runs).
[1] This MAY have changed, this was a while and multiple versions of the compiler/runtime ago. I know that C# had a runtime release sometime in 2020 that should've improved things and I think Java now also does the right thing when in a cgroup.
AFAIK, it hasn't changed; this exact situation with cgroups is still something I have to tell fellow developers about. Some of them have started using [automaxprocs] to automatically detect the limit and set GOMAXPROCS.
Ah, note, said program also had one goroutine trying the stupidest-possible way of finding primes (then not actually doing anything with the found primes, apart from appending them to a slice). It literally trial-divided (well, modded) by all numbers between 2 and isqrt(n) to see if n was a multiple of any of them. Not designed to be clever, explicitly designed to suck about one core.
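A rough reconstruction of that kind of test program, as a sketch only and not the original code: a plain atomic counter stands in for the Prometheus counter, and the names `isPrime` and `burn` are made up here. One goroutine ticks at ~100 Hz, another burns CPU with naive trial division.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// isPrime does naive trial division by every d with d*d <= n,
// i.e. every candidate divisor from 2 up to isqrt(n).
func isPrime(n int) bool {
	if n < 2 {
		return false
	}
	for d := 2; d*d <= n; d++ {
		if n%d == 0 {
			return false
		}
	}
	return true
}

// burn keeps one core busy by collecting primes into a slice
// until stop is closed. The slice is never used for anything.
func burn(stop <-chan struct{}) []int {
	var primes []int
	for n := 2; ; n++ {
		select {
		case <-stop:
			return primes
		default:
		}
		if isPrime(n) {
			primes = append(primes, n)
		}
	}
}

func main() {
	var ticks int64 // stand-in for the Prometheus counter
	stop := make(chan struct{})
	go burn(stop)

	// Increment the counter at a nominal 100 Hz.
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	go func() {
		for range ticker.C {
			atomic.AddInt64(&ticks, 1)
		}
	}()

	// Report the observed rate once per second for a few seconds.
	for i := 0; i < 5; i++ {
		before := atomic.LoadInt64(&ticks)
		time.Sleep(time.Second)
		fmt.Printf("observed rate: ~%d Hz\n", atomic.LoadInt64(&ticks)-before)
	}
	close(stop)
}
```

Run it with GOMAXPROCS=1 under different CPU limits and compare the observed rate against the nominal 100 Hz; the exact numbers will depend on the kernel and the CFS period.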