Reminds me of the time I wrote a physical simulation engine back in grad school and there was a "minus" sign error. Of course, the error was rare enough that we didn't notice it until after the code was used in a real production environment. Tracking down one minus sign in several hundred thousand lines is a pain. Not to mention the uneasy feeling you get after you solve it: "How was everything ever working correctly before!? What else did we overlook?"
If I had to venture a guess, you didn't have a comprehensive set of tests at the function/method level of the code? Having that would probably have caught the bug, because you would have written a test that exercises the code in that branch.
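To make that concrete, here is a minimal sketch of the kind of function-level test that tends to catch a flipped sign. The function spring_force and the test names are hypothetical, made up for illustration; nothing here comes from the original simulation engine.

```python
import unittest


def spring_force(k: float, displacement: float) -> float:
    """Hooke's law restoring force, F = -k * x (hypothetical example function)."""
    return -k * displacement


class SpringForceTest(unittest.TestCase):
    def test_force_opposes_positive_displacement(self):
        # A stretched spring must pull back, so the force is negative.
        self.assertLess(spring_force(2.0, 0.5), 0.0)

    def test_force_opposes_negative_displacement(self):
        # The mirrored case: a compressed spring must push forward.
        self.assertGreater(spring_force(2.0, -0.5), 0.0)

    def test_magnitude(self):
        # Checking the exact value pins down both the sign and the scale.
        self.assertAlmostEqual(spring_force(2.0, 0.5), -1.0)
```

Running it with `python -m unittest` would fail all three tests if the minus sign were dropped, which is the point: the sign is asserted directly instead of being inferred from the overall behavior of the simulation.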
You're right. But it was after that painstaking experience that I became fully engrossed in using unit tests for all non-trivial functionality. Live and learn.
I'm not completely satisfied by the explanation. I still have that uneasy feeling that you get when you solve a bug, but an unsolved mystery remains. "Also, I still don't know why not all consoles connected to that PC froze."
He didn't mention how the logging was done, but if it was over a TCP connection then the send() call probably blocked until it timed out, since the sleeping computer didn't close the socket nicely, and then it had to re-establish the connection. Although reliability is nice, if I were writing a remote logger for something like a game, I think I'd use UDP.
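A rough sketch of what a fire-and-forget UDP logger could look like, assuming that losing log lines is acceptable. The name make_udp_logger and its return convention are my own invention for illustration, not anything from the article.

```python
import socket


def make_udp_logger(host: str, port: int):
    """Return a log(message) function that sends one UDP datagram per message.

    Hypothetical helper: messages are dropped, never retried, if the send fails.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setblocking(False)  # never stall the caller, even if the OS buffer is full

    def log(message: str) -> bool:
        try:
            sock.sendto(message.encode("utf-8"), (host, port))
            return True
        except OSError:
            # Covers a full send buffer and network errors: drop the line.
            return False

    return log
```

The trade-off is the obvious one: there is no connection to time out on, so a sleeping log server cannot freeze the sender, but there is also no guarantee that any given line arrives.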
Would you not use send() with MSG_DONTWAIT? You get the reliability of TCP and you get feedback if there is any potential blocking. (But I certainly am not a socket guru.)
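For what it's worth, a minimal sketch of that idea, assuming a Unix-like platform where the MSG_DONTWAIT flag is available (it is not defined on Windows). The helper name try_send is hypothetical.

```python
import socket


def try_send(sock: socket.socket, payload: bytes) -> int:
    """Try to send without blocking; return how many bytes the kernel accepted.

    Returns 0 when the call would have blocked (send buffer full).
    Hypothetical helper for illustration only.
    """
    try:
        return sock.send(payload, socket.MSG_DONTWAIT)
    except BlockingIOError:
        return 0
```

The caller still has to decide what to do with a return value smaller than len(payload): queue the remainder, drop it, or retry later. The flag only guarantees that the call returns immediately, not that the data was delivered.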
Are you trying to explain how it's possible for some of the consoles to freeze but others not while talking to the same sleeping computer? If so, I did not understand your explanation.
I believe socket writes don't block until you've filled the internal socket buffer, so it's likely that the unaffected machines simply hadn't done this yet.
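You can see the buffering effect with a small experiment. This is a sketch under assumptions: it uses a local socketpair() rather than a real TCP connection to a sleeping machine, and the function name is made up. The actual number depends on the OS and its socket buffer settings.

```python
import socket


def bytes_accepted_before_blocking(chunk_size: int = 4096) -> int:
    """Write to a socket whose peer never reads; count bytes accepted before
    a write would block. Hypothetical demo function."""
    writer, reader = socket.socketpair()
    writer.setblocking(False)  # raise BlockingIOError instead of stalling
    chunk = b"x" * chunk_size
    total = 0
    try:
        while True:
            try:
                total += writer.send(chunk)
            except BlockingIOError:
                # Buffers are full: a blocking send() would hang at this point.
                return total
    finally:
        writer.close()
        reader.close()
```

Every write up to that total returns immediately even though nobody is reading on the other end, which would fit the theory that consoles producing less log output had not yet filled their buffers.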
I'm reminded of this story of the folks who worked on LEO hunting down a similarly difficult-to-find bug that was eventually found to be caused by an unrelated external machine: the manager's elevator. https://www.youtube.com/watch?v=Lrn24SdW64I&t=2m50s
I once spent an afternoon tracking down a "bug" as to why sales tax wasn't being calculated on LedgerSMB only to find out I had set the tax rate to 0 in the tax interface.... Ok, it was working as intended. I felt pretty sheepish too.
The problem in my case is that sales tax calculation easily qualifies as a big deal, so any sense that it's not working raises all sorts of alarm bells, on top of the immediate questions of "are production versions affected? If so what do we tell customers?"
Also taxes with a rate of 0 are ignored specifically because sometimes sales tax structures change (as with HST consolidation in Canada) and consequently old taxes need to be retired.....
While I am sympathetic to this argument, I would say that is not always the case. Some configuration is usually required, and when something is set up for a specific case, and it behaves for that case, and the user simply forgot that this is what they did, then it's a bug only in the storage retrieval routines of the user's own memory.
They could have solved that bug with one developer in ten minutes by just telling the PS3 to generate a core dump and running addr2line.exe on the core dump report's callstacks.
And the report places the blame on the server instead of their code. Clearly it's their code's fault for doing blocking socket calls in a main thread.
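One common way to keep blocking I/O off the main thread is a bounded queue drained by a worker thread. A minimal sketch, assuming log lines may be dropped when the queue is full; BackgroundLogger and the sink callback are hypothetical names, not anything from the article.

```python
import queue
import threading
from typing import Callable, Optional


class BackgroundLogger:
    """Hand log lines to a worker thread so a slow sink never stalls the caller."""

    def __init__(self, sink: Callable[[str], None], max_pending: int = 1024):
        self._sink = sink  # may block, e.g. a TCP send to a log server
        self._queue: "queue.Queue[Optional[str]]" = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, message: str) -> bool:
        """Called from the main thread; returns False if the line was dropped."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            return False

    def _drain(self) -> None:
        while True:
            message = self._queue.get()
            if message is None:  # sentinel from close()
                break
            self._sink(message)

    def close(self) -> None:
        """Flush pending lines and stop the worker (waits for the sink)."""
        self._queue.put(None)
        self._worker.join()
```

With this shape, a sink that hangs on a sleeping server only stalls the worker thread. The main loop keeps running and, once the queue fills up, starts dropping lines instead of freezing.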
This looks like an interesting bug. I wonder if there are more bugs like this on the website side, such as analytics tools giving you false or misleading information, or even monitoring or performance tools?