
That's not really true.

Here's an amusing personal anecdote I find relevant, and would enjoy reliving by typing it out:

Years ago I started a new job as a software engineer at a clustered database startup. Part of their onboarding process was to have new hires go through the bug tracker and resolve issues which had been flagged for this purpose. They were supposed to be low-priority, low-hanging-fruit type things that jumped all over the code base, good for acquiring some breadth of familiarity while still being useful.

Well I got bored with that pretty quickly and started looking for the oldest high-priority bugs. To my surprise there were actually quite a number of them, and it wasn't like they were ignored - they had all been touched by many hands but seemed to be difficult to reproduce according to the comments from engineers.

One in particular looked especially terrible: during the data import process, the database would close connections as if they were idle, causing the import to fail. Imports were a very common thing, since this product was supposed to replace large existing FOSS databases by providing a more scalable clustered solution. Importing would be the first thing customers did with the product, and it was failing! And the bug was over a year old.

The thing with import is that it's highly optimized, because the databases are expected to be huge, so while the logs showed the connection had been closed due to inactivity, that was very unlikely to be what really happened. But the engineers were all just pointing fingers at the customer side, saying there must be a network problem causing the timeout under all the load.

But the way the database was implemented was rather complex, with cooperatively scheduled coroutines running in their own per-CPU schedulers. It was very Golang-like, except this was not Golang, it was C, and we controlled everything. Knowing that the architecture was cooperatively scheduled, and that imports would cause a whole lot of write-bound work saturating the cluster on the backend, I figured it was probably a starvation problem: something was preventing the coroutine responsible for servicing the sockets from running in a timely fashion, and then some idle-socket timer would notice and kill the connection.

Some instrumentation code later, with a synthetic import running against a cluster on my slow personal laptop (you see, I hadn't even gotten a beastly company machine yet, so running a cluster on my old personal machine made it very easy to saturate), I easily reproduced the failure, with the instrumentation clearly showing it was indeed due to internal socket-IO coroutine starvation.

The way they had it, a given CPU's sockets only got serviced when that CPU's scheduler was idle. The mostly correct assumption was that there'd frequently be moments where no coroutine was runnable on some CPU, and that would be the best time to service that CPU's sockets. Well, it turns out sometimes that just isn't the case.

I coded a trivial fix to have a per-CPU timer service all the local CPU's sockets occasionally, like every five seconds or so, just to prevent the scheduler starvation from triggering a timeout, and submitted it for review as a fix to this ancient, horrible, and embarrassing bug.
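
To make that concrete, here's a minimal sketch in C of the before/after scheduling behavior. It is not the actual code base, and every name in it (run_next_coroutine, service_sockets, the five-second interval constant) is a hypothetical stand-in for what I described above:

    /* Minimal hypothetical sketch: a per-CPU cooperative scheduler that only
     * serviced its sockets when idle, plus the periodic-service fix. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <time.h>

    #define SOCKET_SERVICE_INTERVAL_SEC 5      /* "every five seconds or so" */

    /* Stand-ins for the real scheduler hooks. */
    static long pending_slices = 50;           /* pretend an import keeps this CPU busy */

    static bool run_next_coroutine(void)       /* false when no coroutine is runnable */
    {
        if (pending_slices <= 0)
            return false;
        pending_slices--;                      /* one cooperative time slice */
        return true;
    }

    static void service_sockets(const char *why)
    {
        printf("servicing this CPU's sockets (%s)\n", why);
    }

    /* Original behavior: a CPU saturated with runnable, write-bound coroutines
     * never reaches the idle branch, so its sockets starve until the
     * idle-connection timer kills them. */
    static void scheduler_loop_original(void)
    {
        while (run_next_coroutine())
            ;                                  /* sockets only serviced on idle */
        service_sockets("idle");
    }

    /* Fixed behavior: additionally service sockets whenever the interval
     * elapses, even if the scheduler never goes idle. */
    static void scheduler_loop_fixed(void)
    {
        time_t last_service = time(NULL);      /* per-CPU in the real system */
        for (;;) {
            time_t now = time(NULL);
            if (now - last_service >= SOCKET_SERVICE_INTERVAL_SEC) {
                service_sockets("periodic timer");
                last_service = now;
            }
            if (!run_next_coroutine()) {
                service_sockets("idle");
                break;                         /* demo only; the real loop never exits */
            }
        }
    }

    int main(void)
    {
        scheduler_loop_original();
        pending_slices = 50;                   /* reset the fake workload */
        scheduler_loop_fixed();
        return 0;
    }

The point of the fix is only the extra periodic branch; everything else stays the same.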

Well, as you can imagine, this started a bit of a shitstorm, requiring a lot of arguing back and forth: demonstrating the problem, showing the instrumentation proving it was possible, proving the patch actually fixed it, etc.

The reason I'm telling this story, however, is that after the dust settled, and an urgent release was shipped with this fix, I had some new friends in high places at that company. One of the cofounders was in charge of the operations and support department and had been fighting with engineering for ages over this particular bug.

Obviously maintenance of neglected but important things can get you promoted. Especially in a smaller company where the important decision makers are watching and still give a fuck.




I wonder how big a part of that shitstorm was due to the "hey look, the new guy thinks he's smarter than us!" attitude. Which is often warranted; way too often fresh blood charges at Chesterton's fences. Except this time, the new guy was indeed smarter. Kudos.


Honestly there were many people on the team much smarter than me. It was mostly just a case of a fresh set of eyes, and I think their egos prevented them from seriously considering such an obvious oversight could exist in their masterpiece.

As for the shitstorm, no doubt my being new and irreverent played a part. But the real trouble was that the senior engineers had been arguing for over a year that it couldn't be on their end. They weren't stupid; it was just some overlooked details, and like I mentioned, ego probably got in the way. It was going to make them look quite bad regardless of who fixed it, just because of how long it had been ignored while impacting paying customers.

There was a subtlety to the bug I didn't mention. After I had posted the tested fix for review, it became known that someone on the team had already investigated the socket-IO starvation theory and had been unable to observe anywhere near long enough intervals between schedulers idling. But what they didn't do was measure the idle intervals per CPU; all they had checked was whether any scheduler went idle often enough. The problem with that is that every CPU had its own list of sockets, associated with the coroutines belonging to that CPU's scheduler. So the interval had to be checked on a per-CPU basis, and that's where my instrumentation showed the problem under load. They just missed that detail in the early investigation, and nobody ever revisited it. It was literally a mistake of using a global vs. per-CPU "last-idle-ticks" variable in their instrumentation code, oops!
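
For anyone curious, here's a small hypothetical sketch in C of what that instrumentation mistake looks like (again, made-up names, not the real code). With a single global last-idle timestamp, any scheduler going idle resets it, so the measured gap stays tiny even while one saturated CPU starves its sockets; tracking the timestamp per CPU exposes the real gap:

    #include <stdint.h>
    #include <stdio.h>

    #define NCPU 4

    static uint64_t global_last_idle_ticks;          /* what the early investigation effectively used */
    static uint64_t per_cpu_last_idle_ticks[NCPU];   /* what it needed to use */
    static uint64_t max_gap_global, max_gap_per_cpu[NCPU];

    /* Called from a CPU's scheduler whenever it goes idle. */
    static void record_idle(int cpu, uint64_t now_ticks)
    {
        if (global_last_idle_ticks && now_ticks - global_last_idle_ticks > max_gap_global)
            max_gap_global = now_ticks - global_last_idle_ticks;
        global_last_idle_ticks = now_ticks;

        uint64_t last = per_cpu_last_idle_ticks[cpu];
        if (last && now_ticks - last > max_gap_per_cpu[cpu])
            max_gap_per_cpu[cpu] = now_ticks - last;
        per_cpu_last_idle_ticks[cpu] = now_ticks;
    }

    int main(void)
    {
        /* Simulate: all CPUs idle at tick 1; afterwards CPU 0 is saturated by
         * the import and doesn't idle again until tick 100, while CPUs 1..3
         * keep idling every tick. */
        for (int cpu = 0; cpu < NCPU; cpu++)
            record_idle(cpu, 1);
        for (uint64_t t = 2; t < 100; t++)
            for (int cpu = 1; cpu < NCPU; cpu++)
                record_idle(cpu, t);
        record_idle(0, 100);                         /* CPU 0 finally goes idle again */

        printf("max idle gap, global view:  %llu ticks\n", (unsigned long long)max_gap_global);
        printf("max idle gap, CPU 0's view: %llu ticks\n", (unsigned long long)max_gap_per_cpu[0]);
        /* The global view reports a 1-tick gap and looks healthy; CPU 0's view
         * reports a 99-tick gap, i.e. the starvation everyone missed. */
        return 0;
    }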



