I think the disappointment here is that the OP seemed not to understand the root cause of the problem and implemented, in a different language, a solution (a load-balanced worker pool) that would also have worked in the original language without a total rewrite. Then the new language is trotted out as the savior. It sounds like the file fetch workers and the request handlers were running in the same process, so the longer-running workers ended up blocking requests. The way it's discussed, it seems like pure luck that they stumbled on an architecture that fixed this (running the workers on a separate queue/process).
I totally agree that it's impressive that it only took 2 weeks to build the Go version; it just seems like it would have taken 2 days to try the worker pool implementation in Node.js.
Good point. Let me add another side to that with some general observations. So far, Golang seems to have a knack for leading devs of varying skill levels to workable answers for internet applications that other languages don't surface.
I've moved from writing code (including golang) to managing large projects. My biggest concern now is meeting the three metrics of project ecstasy: 1) correct solution, 2) on time, 3) within budget. If one language appears to get me better performance on these metrics than another, then I'm interested, whether that language has generics or not.
Another concern from the business perspective is whether a language is easy to hire for, and whether it gets devs productive in less time without creating a lot of technical debt in the process. I have a grueling time dealing with this very issue, and if golang were the basis for my toolchain, my guess is my hiring concerns would be reduced enough to have a positive impact on my business -- looping back to the metrics I mentioned.
Maybe the solution should have been kept in JS, but it wouldn't surprise me if the golang effort these folks just went through continues to pay off in a substantive way.
Yes, the root cause analysis feels off. (Though the cause might be as simple as the cost of walking through the message queue.)
Something is definitely blocking or resource-constrained and causing "thrashing": the uncontrolled number of requests allowed to spin up at a time (which creates resource contention), combined with the fan-out (1 request = 100+ S3 requests/callbacks), seems like a likely causal factor. As you said, a worker approach with a limit on the number of concurrent requests is going to end up looking a lot like the golang approach used (sketched below).
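To make that concrete, here's a minimal, hypothetical sketch of the bounded-worker idea in Go (since that's where the thread ended up): a fixed number of workers drain a job queue, so no more than numWorkers fetches are ever in flight, no matter how many requests arrive. fetchObject and the pool size are stand-ins for illustration, not anything from the OP's actual code:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // fetchObject is a stand-in for an S3 GET; a real version would
    // use an actual S3 client call here.
    func fetchObject(key string) string {
        time.Sleep(50 * time.Millisecond) // simulate network latency
        return "contents of " + key
    }

    func main() {
        const numWorkers = 8 // hard cap on concurrent fetches

        jobs := make(chan string)
        results := make(chan string)

        // Fixed pool: only numWorkers goroutines ever touch S3.
        var wg sync.WaitGroup
        for i := 0; i < numWorkers; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for key := range jobs {
                    results <- fetchObject(key)
                }
            }()
        }

        // Close results once every worker has drained the queue.
        go func() {
            wg.Wait()
            close(results)
        }()

        // One incoming request fanning out to 100+ objects.
        go func() {
            for i := 0; i < 100; i++ {
                jobs <- fmt.Sprintf("bucket/object-%03d", i)
            }
            close(jobs)
        }()

        for r := range results {
            _ = r // hand each result back to the request handler here
        }
        fmt.Println("all fetches done, never more than", numWorkers, "in flight")
    }

The same shape is maybe a couple dozen lines in Node with a promise queue, which is the 2-days-vs-2-weeks point above.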
The golang approach makes the execution time of a given request more consistent, but the overall wait time may still increase (dramatically) if the arrival rate grows too high. (Classic queueing problem.)
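For anyone who wants the queueing-theory version of that, the textbook M/M/1 result is enough to see the blow-up: with arrival rate \lambda and service rate \mu, the mean time a request spends in the system is

    W = \frac{1}{\mu - \lambda}, \qquad \lambda < \mu

so as the arrival rate approaches the service rate (or the service rate drops, e.g. because S3 slows down), W grows without bound, even though each individual request is handled at a perfectly consistent pace.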
Say "easily" fixable by adding servers? Partly true. What happens if the S3 calls slows down dramatically?
The project has only 9 stars on GitHub. While it might be a perfectly fine solution that fixes the OP's problem, it certainly doesn't inspire confidence as a battle-tested, production-ready library. On the other hand, worker pools and Golang go together very naturally.
But a process pool is far more heavyweight than threads, and that's a cost you pay in hardware. Using Go allowed them to do this with less hardware cost than Node. That's a win that continuing to use Node could not have provided.
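A rough way to see one side of that weight difference (a crude, hypothetical measurement, not a benchmark): a parked goroutine costs on the order of a few KB of stack, while each worker in a Node process pool carries an entire V8 instance weighing tens of MB. For example:

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms)
        before := ms.Sys // bytes obtained from the OS so far

        done := make(chan struct{})
        var wg sync.WaitGroup
        for i := 0; i < 10000; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                <-done // park so all 10k goroutines exist at once
            }()
        }

        runtime.ReadMemStats(&ms)
        fmt.Printf("10k parked goroutines: ~%d KB extra memory from the OS\n",
            (ms.Sys-before)/1024)

        close(done)
        wg.Wait()
    }

Ten thousand concurrent workers in a single Go process is unremarkable; ten thousand Node processes is a fleet.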