Lamest bug we ever encountered

akg · on Dec 11, 2011

Reminds me of the time I had written a physical simulation engine back in grad school and there was a "minus" sign error. Of course, the error was rare enough that we didn't notice it until after the code was used in a real production environment. Tracking down one minus sign in several hundred thousand lines is a pain. Not to mention the uneasy feeling you get after you solve it: "How was everything ever working correctly before!? What else did we overlook?"

Confusion · on Dec 11, 2011

If I had to venture a guess, I'd say you didn't have a comprehensive set of tests at the function/method level of the code? Having that would probably have caught the bug, because you would have written a test that exercises the code in that branch.
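
To make the point concrete, here is a minimal sketch of a function-level test that would flag a flipped sign. The `spring_force` function and its values are hypothetical stand-ins, not code from the simulation engine described above.

```python
import unittest


def spring_force(k, displacement):
    """Hooke's law restoring force, F = -k * x (hypothetical example function)."""
    return -k * displacement


class SpringForceTest(unittest.TestCase):
    def test_force_opposes_displacement(self):
        # A restoring force must point against the displacement.
        # Dropping the minus sign makes both assertions fail.
        self.assertLess(spring_force(2.0, 0.5), 0.0)
        self.assertGreater(spring_force(2.0, -0.5), 0.0)

    def test_magnitude(self):
        self.assertAlmostEqual(spring_force(2.0, 0.5), -1.0)


def run_tests():
    """Run the test case and return True if every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SpringForceTest)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

A test like this only helps if the rare branch is actually exercised, which is the harder part in practice.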

akg · on Dec 11, 2011

You're right. But it was after that painstaking experience that I became fully engrossed in using unit tests for all non-trivial functionality. Live and learn.

AndyKelley · on Dec 11, 2011

I'm not completely satisfied by the explanation. I still have that uneasy feeling you get when you solve a bug but an unsolved mystery remains. "Also, I still don't know why not all consoles connected to that PC froze."

radarsat1 · on Dec 11, 2011

He didn't mention how the logging was done, but if it was over a TCP connection then the send() call probably blocked until it timed out, since the sleeping computer didn't close the socket cleanly; then it had to re-establish the connection. Although reliability is nice, if I were writing a remote logger for something like a game, I think I'd use UDP.
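
As a rough sketch of the UDP idea, assuming a plain datagram socket and a made-up log message (nothing here comes from the original article): `sendto()` hands the datagram to the kernel and returns without waiting for the receiver, so an unresponsive log machine cannot stall the sender the way a stalled TCP connection can. The trade-off is that lost messages are simply lost.

```python
import socket


def send_log_udp(sock, addr, message):
    """Fire-and-forget log line; returns the number of bytes handed to the kernel."""
    return sock.sendto(message.encode("utf-8"), addr)


def demo():
    """Send one hypothetical log line over loopback and read it back."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(2.0)
        message = "frame 42: physics step took 3 ms"  # made-up log line
        sent = send_log_udp(sender, receiver.getsockname(), message)
        data, _ = receiver.recvfrom(4096)
        return sent, data.decode("utf-8")
    finally:
        sender.close()
        receiver.close()
```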

Nick_C · on Dec 11, 2011

> although reliability is nice ... UDP

Would you not use send() with MSG_DONTWAIT? You get the reliability of TCP and you get feedback if there is any potential blocking. (But I certainly am not a socket guru.)
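
A minimal sketch of that approach, assuming a Unix-like system where Python exposes `socket.MSG_DONTWAIT`. The `try_send` helper is hypothetical, and a local `socketpair()` whose other end never reads stands in for the sleeping PC.

```python
import socket


def try_send(sock, payload):
    """Return the number of bytes sent, or None if the send would have blocked."""
    try:
        return sock.send(payload, socket.MSG_DONTWAIT)
    except BlockingIOError:
        return None  # the caller can drop or queue the log line instead of stalling


def demo():
    """Keep sending to a peer that never reads until a send would block."""
    game, sleeping_pc = socket.socketpair()
    results = []
    try:
        chunk = b"x" * 65536
        for _ in range(10000):
            result = try_send(game, chunk)
            results.append(result)
            if result is None:
                break
    finally:
        game.close()
        sleeping_pc.close()
    return results
```

Note that `send()` may accept only part of the payload, so a real logger would need to check the return value as well as the `None` case.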

radarsat1 · on Dec 11, 2011

Definitely, setting non-blocking flags in the socket options is also a good idea.

AndyKelley · on Dec 11, 2011

Are you trying to explain how it's possible for some of the consoles to freeze but others not while talking to the same sleeping computer? If so, I did not understand your explanation.

alexgartrell · on Dec 11, 2011

I believe socket writes don't block until you've filled the internal socket buffer, so it's likely that the unaffected machines simply hadn't done this yet.
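
This is easy to observe locally. The sketch below (a hypothetical helper using a local `socketpair()` rather than a real TCP link to a sleeping machine) counts how many bytes a non-blocking writer can hand over before the kernel refuses more. The exact number depends on the OS and its buffer settings.

```python
import socket


def bytes_accepted_before_blocking(chunk_size=4096, limit=1 << 30):
    """Write to a peer that never reads; return the bytes accepted before a write would block."""
    writer, idle_reader = socket.socketpair()
    writer.setblocking(False)
    total = 0
    try:
        while total < limit:
            try:
                total += writer.send(b"\0" * chunk_size)
            except BlockingIOError:
                break  # buffers are full; a blocking socket would stall here
    finally:
        writer.close()
        idle_reader.close()
    return total
```

A console that had logged less than this amount since the PC went to sleep would keep running, which fits the hypothesis above.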

AndyKelley · on Dec 11, 2011

Ah, there's the missing piece of information. Now I get it, thanks.

radarsat1 · on Dec 11, 2011

It's just a hypothesis. Obviously I don't have enough information to know for sure.

botker · on Dec 10, 2011

I'm reminded of this story of the folks who worked on LEO hunting down a similarly difficult-to-find bug that was eventually found to be caused by an unrelated external machine: the manager's elevator. https://www.youtube.com/watch?v=Lrn24SdW64I&t=2m50s

einhverfr · on Dec 11, 2011

I once spent an afternoon tracking down a "bug" where sales tax wasn't being calculated in LedgerSMB, only to find out I had set the tax rate to 0 in the tax interface... OK, it was working as intended. I felt pretty sheepish too.

decadentcactus · on Dec 11, 2011

The worst bugs are when things work as intended, but you still think it's a bug, such as your example.

Natsu · on Dec 11, 2011

It's worse when your users find these and are all mad because the computer did exactly what they told it to.

einhverfr · on Dec 11, 2011

The problem in my case is that sales tax calculation easily qualifies as a big deal, so any sense that it's not working raises all sorts of alarm bells, along with immediate questions like "Are production versions affected? If so, what do we tell customers?"

Also, taxes with a rate of 0 are ignored specifically because sometimes sales tax structures change (as with HST consolidation in Canada) and consequently old taxes need to be retired.

AndyKelley · on Dec 11, 2011

Nah, then it's a bug in your user interface.

einhverfr · on Dec 11, 2011

While I am sympathetic to this argument, I would say that is not always the case. Some configuration is usually required, and when something is set up for a specific case, behaves correctly for that case, and the user simply forgot that this is what they did, then it's a bug only in the storage retrieval routines of the user's own memory.

TwoBit · on Dec 11, 2011

They could have solved that bug with one developer in ten minutes by just telling the PS3 to generate a core dump and running addr2line.exe on the core dump report's callstacks.

And the report places the blame on the server instead of their code. Clearly it's their code's fault for making blocking socket calls on the main thread.

zitterbewegung · on Dec 10, 2011

This looks like an interesting bug. I wonder if there are more bugs like this on the website side, such as analytics tools giving you false or misleading information? Or even monitoring or performance tools?

_yobq · on Dec 11, 2011

The lamest bug you will ever encounter deletes your whole /usr.

manojlds · on Dec 11, 2011

How is that lame?

narcissus · on Dec 11, 2011

I think he's talking about this: https://github.com/MrMEEE/bumblebee/commit/a047be85247755cdb... , where the deletion of /usr was not on purpose. The bug was a space in the middle of a file path in the install script.
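
To illustrate why one stray space is so destructive, here is a small sketch that uses `shlex.split` to mimic how a POSIX shell breaks a command line into arguments. The path is a hypothetical placeholder, not the one from the linked commit, and nothing is actually deleted.

```python
import shlex


def rm_targets(command_line):
    """Return the paths an rm-style command line would operate on."""
    argv = shlex.split(command_line)  # splits on unquoted whitespace, as a shell would
    return [arg for arg in argv[1:] if not arg.startswith("-")]


# Hypothetical path, for illustration only.
intended = rm_targets("rm -rf /usr/lib/example-driver/xorg")
with_space = rm_targets("rm -rf /usr /lib/example-driver/xorg")

print(intended)    # ['/usr/lib/example-driver/xorg']
print(with_space)  # ['/usr', '/lib/example-driver/xorg']
```

With the space, the single intended path becomes two separate arguments, and the first one is all of /usr.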