I wonder how big a part of that shitstorm was due to a "hey look, the new guy thinks he's smarter than us!" attitude. Which is often warranted; way too often, fresh blood charges at Chesterton's fences.
Except this time, the new guy was indeed smarter. Kudos.
Honestly, there were many people on the team much smarter than me. It was mostly just a case of a fresh set of eyes, and I think their egos prevented them from seriously considering that such an obvious oversight could exist in their masterpiece.
As for the shitstorm, no doubt my being new and irreverent played a part. But the real trouble was that the senior engineers had been arguing for over a year that it couldn't be on their end. They weren't stupid; it was just a few overlooked details, and, like I mentioned, ego probably got in the way. It was going to make them look quite bad regardless of who fixed it, simply because of how long it had been ignored while impacting paying customers.
There was a subtlety to the bug I didn't mention. After I had posted the tested fix for review, it came out that someone on the team had already investigated the socket-IO starvation theory and was unable to observe gaps between scheduler idles anywhere near long enough. But what they didn't do properly was measure the idle intervals per CPU. All they had checked was whether any scheduler went idle often enough. The problem with that is that every CPU had its own list of sockets associated with coroutines belonging to that CPU's scheduler. So the interval had to be checked on a per-CPU basis, and that's when my instrumentation showed the problem under load. They just missed that detail in the early investigation, and nobody ever revisited it. It was literally a mistake of using a global vs. per-CPU "last-idle-ticks" variable in their instrumentation code, oops!
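To make the shape of that mistake concrete, here's a minimal C sketch of the two instrumentation approaches. All names (`on_scheduler_idle`, `last_idle_ticks`, the tick counter, the threshold) are invented for illustration; this isn't the actual code, just the pattern:

```c
#include <stdint.h>
#include <stdio.h>

#define NCPUS 64

static uint64_t last_idle_ticks;            /* buggy: one counter shared by all CPUs */
static uint64_t last_idle_ticks_cpu[NCPUS]; /* fixed: one counter per CPU            */

/* Hypothetical hook, called whenever the scheduler on `cpu` goes idle
   at tick `now`. `threshold` is the longest acceptable gap between idles. */
void on_scheduler_idle(int cpu, uint64_t now, uint64_t threshold)
{
    /* Buggy check: as long as *any* scheduler idles often, the global gap
       stays short, so starvation on an individual CPU never shows up. */
    if (now - last_idle_ticks > threshold)
        printf("global idle gap exceeded (rarely fires under load)\n");
    last_idle_ticks = now;

    /* Correct check: measure the gap for this CPU's scheduler only, since
       each CPU services its own socket list. */
    if (now - last_idle_ticks_cpu[cpu] > threshold)
        printf("cpu %d starved its sockets for %llu ticks\n",
               cpu, (unsigned long long)(now - last_idle_ticks_cpu[cpu]));
    last_idle_ticks_cpu[cpu] = now;
}
```

With the global counter, sixty-three busy CPUs are invisible as long as one keeps idling regularly; the per-CPU version is what finally made the starvation show up under load.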