Java app, deployed to four servers (by rsyncing a zip file and unzipping it). One of the four fails on startup, in some way that's hard to trace (I think it might have been a JNI crash)? But all four servers have the same OS, same version, same packages installed, same JVM, and it was the same zip file. Check the md5sum on the zip, it matches. In desperation one of my colleagues writes a script to recursively go through the unpacked version of the app and check the md5sums of all the files. Still matches perfectly, and the same files are present on all machines.
We get the dev team to try the app - there's a bit more variety in our devboxes than servers. Two of them can reproduce the failure, but there's no obvious correlation - one's java 1.5, one's java 1.6. One's Debian, one's Gentoo. For every combination there's another developer with a similar machine where it works fine.
Turns out that one server had been installed with a different filesystem from the other three (reiserfs?), which meant that the directory entries for the files were in a different order. The JVM just lists all the classes in the directory and then loads them in on-disc order, so classes were getting initialized in a different order, and it was that that was ultimately triggering the bug.
I recall averting what would have been a rather opaque security weakness by pointing out that occasionally someone would goof and vary the default locale of some JVM instances. (With a particular knock-on effect that I'll omit here.)
Paying attention to the system you'll land on is something that can be lost on even senior developers. And, therein, details, details, details...
Another reason to have a strong ops (or devops) team: Providing/enforcing proper, intended, and thoughtful context and runtime.
P.S. As I now recall, there was also the use case for locale to deliberately vary, although that use case / those instances should have remained orthogonal to those of other locales. Nonetheless, one more possible driver of a mistake that would enable this weakness to occur.
> Another reason to have a strong ops (or devops) team: Providing/enforcing proper, intended, and thoughtful context and runtime.
I have to disagree, actually. This kind of problem is exactly where you really benefit from having full-stack people who understand both sides of the system; it would have been very hard for an ops person who didn't know about Java's quirks or a pure dev who didn't know about the unix filesystem to diagnose.
I see your point. And... I had to escalate to a very senior person of that sort (full stack, or full picture -- in detail) in order to get that particular problem paid attention to and resolved.
(Such a perspective is also how I determined the problem in the first place -- which actually resulted from a botched fix to a prior problem that I'd identified.)
However, despite the variance -- or risk of same -- that I described, in general we had some very capable and dedicated devops people who put a lot of effort and care into our environments. I get rather uncomfortable considering how things would have been had that not been the case.
It's been a while, and maybe I'm mixing my stories a bit. But I left that role and product mix -- which had very significant security requirements and ramifications -- quite impressed with the role those devops folks played in keeping us safe.
Perhaps what I meant by "strong" goes somewhat in the direction of your description of "full stack". Our senior devops people tended to trend in that direction.
And we didn't have "turf wars". Instead, devops was often a partner throughout the development lifecycle. It helped make sure that the final destination was appropriately, safely, and consistently configured. (My "locale" situation aside; and in such an instance, the setting would subsequently receive heightened and sustained scrutiny, putting a curb on unintended variance of the setting as well as fixing the code that such variance would impact.)
Strong in knowledge and ability, as opposed to simply or foremost in an authority to dictate.
I think we agree that it's good to have these skills somewhere on your team, and to work together during development. I just don't think dividing people into "ops" and "dev" is helpful. At that job it was something like one guy who spent 80% of his time on traditionally-ops stuff and 20% on development, another guy who was 70% development and 30% database admin, one who was basically full-time ops but happened to be the maintainer of an open-source library that we used in our products, one who had been hired as a developer but was starting to pick up ops tasks because he preferred that...
This is a very common problem for buggy Java programs that have different versions of the same class on the classpath. The JVM will take the first instance of a class it finds on the classpath, so some orders of the classpath will work and others won't.
My favorite technical interview question is to ask the candidate about the worst bug they ever wrote. I let them choose their own definition of "worst", and see if they choose something that was hard to debug, or caused a lot of trouble or was just something they thought was a dumb mistake.
It usually provides them with an opportunity to talk technically about something they know and helps me understand how well they communicate problems and solutions. Plus, it's sometimes fun to hear the stories.
My own personal worst bug (where by worst I mean "had the worst impact") was when we disabled a large chunk of the southern Beijing cell phone system for a short time during the night whilst deploying a field test of new base station hardware. That was a stressful rollout.
My most memorable one wasn't technically complicated, but we were receiving mixed feedback on a site that I had worked on. Some of the test group thought it was working fine; the rest were having "doesn't work" issues.
It turned out that the problem came down to a single dialog box, when people tried to cancel a process on the page.
> My favorite technical interview question is to ask the candidate about the worst bug they ever wrote.
Are you sure you are getting honest answers? Some candidates may be thinking "He'll never hire me if I admit how stupid I was, so I'll use this secondhand or dumbed-down story instead."
Irrelevant - they talk about a technical issue they're familiar with. I can ask them questions about that issue. It's a good way to gauge their skill level while keeping them on familiar terrain.
It's usually fairly straightforward to see if somebody talks about their own experiences or is retelling a story they simply heard.
Exactly. Even if it's not their own story, if they are able to talk intelligently about the subject matter and explain the problem and solution well, then I've gained valuable information about the candidate.
Unfortunately I've had a couple of candidates whose response to the question was "I don't really write bugs". Those are the ones I know for sure are lying.
Memory corruption in a video game that I was developing in the 1990s. It took 2-3 days of running attract mode to trigger it, whereupon the game would crash catastrophically.
Solution: Videotaped attract mode for 2 days until it happened. Then I single frame advanced through the 4 frames between its first manifestation and the complete crash of the program. 15 minutes later I knew exactly what was going on and fixed it shortly thereafter.
These days, any bug that survives my best efforts for more than a day usually ends up being a HW/driver issue in equal measure. I've learned a lot since then.
The stuff coming out of those computer algebra things is not always exactly the form you want to be stuffing in your code. Working it out manually focuses the mind.
They also tend to do it quite badly in some non-trivial cases. I've seen Wolfram spewing a three-line result which could be further simplified to a simple expression. It also doesn't always help to simply know the result; a lot of insight about numerical problems can be gained from how the result is derived.
I also spent several days churning formulae in bars. Got a lot of weird looks.
I don't know if it was the hardest bug I've ever tracked down, but I really enjoyed discovering that HP-UX executables have an attribute bit that controls whether or not they'll segfault upon dereferencing a null pointer. If I'm not mistaken, the non-faulting behavior was the default in their toolchain.
Turns out that HP was also shipping a build of Kerberos/GSS libraries that actually relied on this behavior to function properly, and since our own project linking in these libraries did the sane thing and enabled the faults, the Kerberos code would crash.
I won't forget debugging through the library assembly code, wondering how the hell it ever worked. Luckily my education had included a good survey of the computer architecture zoo, where surely someone once upon a time had thought it was a great idea to spec a zero page at low memory addresses.
And of course today I know enough to never believe that a platform must segfault on a null pointer dereference. C undefined behavior works in mysterious ways.
I used to be in awe of articles like this. For a long time I believed some of the most difficult bugs are the ones where the debugger itself has an issue. But having worked in machine learning and distributed computing, all these bugs look like pretty stories in a kids' book. Imagine this: when your machine-learned model doesn't "work", there are no breakpoints to set, no watches to watch, not even code to reflect on. All you have is data and probabilistic statements from which you need to statistically infer the root cause. It's the same way in distributed computing. One of my scripts was over 1600 lines of dense high-level statements that ultimately get translated to hundreds of map-reduces. Again, you don't get any exceptions when something unexpected happens. All you can see is some statistical pattern that "doesn't look right", and it's always interesting to make your probabilistic "if-then" arguments leading to a root cause.
Add some consistency checks on your probability distributions (do the marginals sum up?).
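In that spirit, here's a tiny C sketch (with made-up numbers) of the kind of consistency check being suggested: build the joint table, recompute the marginals, and verify everything sums to 1 within tolerance.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical joint distribution P(X, Y) for X in {0,1,2}, Y in {0,1}. */
        double joint[3][2] = {
            {0.10, 0.20},
            {0.15, 0.25},
            {0.05, 0.25},
        };

        double total = 0.0, marginal_x[3] = {0.0, 0.0, 0.0};
        for (int x = 0; x < 3; x++)
            for (int y = 0; y < 2; y++) {
                marginal_x[x] += joint[x][y];
                total += joint[x][y];
            }

        /* The cheap sanity check: the whole table (and hence the marginals
           taken together) must sum to 1. */
        if (fabs(total - 1.0) > 1e-9)
            printf("WARNING: joint distribution sums to %f, not 1\n", total);
        for (int x = 0; x < 3; x++)
            printf("P(X=%d) = %.2f\n", x, marginal_x[x]);
        return 0;
    }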
What makes articles like this impressive is that they involve a problem at a different level of abstraction. For example, what if your model didn't work because multiplication broke for specific inputs?
OK this is my life too. But there are clever ways of unit testing statistical systems. Hey, not that I do all I say, just saying ;) Sometimes it saves many days.
Of all the "this is how we coded things", debugging stories are my favorite, and are probably the things I re-read most...especially since most of them seem to happen in low-level areas, which I don't get to deal with on a regular basis but still find the intuition and principles to be transferrable.
That said, every time I read one of these...I'm always reminded of a legendary bug report...or at least an old one, because I thought I remembered reading it in a Usenet newsgroup posting...it involved some malfunctioning hardware and the cause was related to the floor panels and the proximity of people walking around...and I can't find it to store in my bookmarks. Someone here must know what I'm referring to, yet I can't summon it on Google.
I believe that's the one...wow, my memory is horrible. You can't get much further from "Usenet newsgroup" than "on a Tumblr". I guess that the story as told took place so long in the past, my brain associated it with the days of reading from a text-browser.
I remember reading that story. IIRC, the bug was that long-running jobs would fail. They eventually traced it to electrical interference generated by a loose floor panel that would move when the user paced over it. Also can't find the link though :(
A friend of mine who works in the embedded space recently had a very similar bug. The component would crash when he stood up from his chair. Apparently the hardware connections were so sensitive that tiny bursts of static electricity through the air were enough to disrupt the whole setup.
If you put an LED, resistor, and MOSFET in series and then attach the gate to a wire, you can often switch the LED on and off by waving your hand around the wire. If you have an oscilloscope, hook it up to the resistor and see if you can find a position of your hand that will switch the transistor at 60hz :-)
This is quite common when working on unshielded electronics. Standing up from a chair, depending on clothes and chair fabric, generates a huge amount of static. Even just walking past can be enough.
Proper hardware development labs have antistatic carpet etc., but in embedded software development, one often doesn't have this luxury and has to be aware.
They say you try and suppress your worst memories, which explains why I can't remember anything much worse than this:
On Symbian OS, the window manager managed all the screen drawing. All visible apps would be asked to send draw ops to the WM and it would draw them clipped to the apps' windows.
And at UIQ we were adding theme wallpapers and memory-hungry graphics faster than our licensees were adding RAM.
And a real problem was running out of RAM drawing the screen. Doing rectangle intersections actually requires allocation, so drawing isn't constant memory.
To speed up drawing we made the WM retain the draw ops. This was transparent to the apps, but a massive performance win. We made a 'transition engine' to smoothly slide between windows and smooth scroll windows and things, at a time when some Nokians confidently told me it wasn't possible :)
But what if our cleverness caused an out-of-memory in the WM? I had a cunning plan...
We intercepted the malloc and, if it failed, we called out to a memory manager app to start zapping things. And if a second alloc attempt failed, we started discarding draw op buffers and unloading theme assets.
And this seemingly worked! By making our graphics adapt dynamically to RAM usage rather than ring-fencing it, we got much better app switching because background apps weren't getting unloaded.
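For anyone who hasn't seen the pattern, here's a minimal C sketch of that "retry after reclaiming" allocation scheme. It's an illustration only, not the Symbian code; the reclamation hooks are invented stand-ins.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical reclamation hooks standing in for the real ones. */
    static void release_noncritical_caches(void) { /* e.g. unload theme assets */ }
    static void release_draw_op_buffers(void)    { /* e.g. discard retained draw ops */ }

    static void *wm_alloc(size_t n)
    {
        void *p = malloc(n);
        if (p != NULL)
            return p;

        /* First failure: ask the memory manager to zap less important things. */
        release_noncritical_caches();
        p = malloc(n);
        if (p != NULL)
            return p;

        /* Second failure: start throwing away our own retained buffers. */
        release_draw_op_buffers();
        return malloc(n);   /* may still be NULL; the caller must cope */
    }

    int main(void)
    {
        char *buf = wm_alloc(64);
        if (buf == NULL) {
            fprintf(stderr, "out of memory even after reclaiming\n");
            return 1;
        }
        free(buf);
        return 0;
    }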
And then, just as the first phone with this tech was being tested (manually, by a small army), it would sometimes crash with meaningless stacks.
My team jumped into the challenge thinking we were elite clever sods and all bugs were shallow.
After a few days, I had to start making excuses; we were stumped. The thousand-monkey tests showed no pattern, only that it happened often. Where was the crash coming from?
A lunchtime walk cleared our heads a bit and suddenly the horrid realisation was before us: if the allocation that failed was a bitmap data block, the bitmap itself may be reaped but the stack would resume initialising RAM that malloc didn't think it had any more and in the end some other random bit of data would be interpreted as a memory address and eventually the WM would blit to it...
The phone never shipped because the plug was pulled on UIQ, but I think this bug was fixed and forgotten before then.
With some custom embedded electronics, the ADC would work fine in a lab setting but always throw out garbage in the field. Now, about 500 of these had been running fine in the field themselves since February. Turned out there was a bug in the FPGA that controlled the ADC and at somewhere around 80 degrees Fahrenheit, there was enough of a propagation delay that the ADC wouldn't start up correctly. Since the other units were started in February when it was 25 degrees and only a few had been restarted, it wasn't noticed.
That was frustrating.
Another fun one was the rapid degradation of a database when the write-back cache battery on the RAID controller failed on the write-ahead logging disk and nobody was notified.
Right now I've been battling a random corruption NFS bug for a few weeks. Recently thought it was the automounter but the bug has appeared in a few other nodes since:
Do other people who have been doing this for a while feel like those bugs end up as a blur? As new programmers we all went through hard to debug timing/memory corruption/threading issues but I like to think I've learned enough to avoid those classes of issues...
War story #1:
I designed and programmed a board based on a TI fixed point DSP (5x series). Problem is the software ran for a very short while and then the board would crash. I went through everything I could think of, the software, verifying the reset sequence, memory accesses. Everything looked good. Called TI support. Couldn't figure it out. After I think two weeks of checking everything it turned out that one of the ground pins that was supposed to be connected (it was in my schematic) was left unconnected by the PCB designer. When we brought out the PCB design we saw the via to the ground plane but a tiny little segment between the pad and the via (under the chip) was left unconnected. If you don't hook up all the Vcc and Gnd pins you get undefined behaviour...
War story #2: Odd intermittent very rare behaviour in an application we worked on. Turned out we were using some implementation of shared pointers that used interlocked increment for incrementing the count but didn't use interlocked decrement for decrementing. So very rarely two threads on two cores would hit that and someone would end up with an invalid pointer. That one also took a long time with trying to get some semi-reproducible behaviour to even know where to start looking.
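Here's a minimal C11 sketch of the class of bug in war story #2 (types and names invented for illustration): the increment is atomic, but the decrement is a plain load/store pair, so two threads releasing at the same time can both compute the same new count and end up leaking or double-freeing the object.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct shared {
        atomic_int refcount;
        void *payload;
    };

    static void retain(struct shared *s)
    {
        atomic_fetch_add(&s->refcount, 1);             /* atomic: fine */
    }

    /* The buggy release: read-modify-write in two separate steps. */
    static void release_buggy(struct shared *s)
    {
        int newcount = atomic_load(&s->refcount) - 1;  /* another thread can run here */
        atomic_store(&s->refcount, newcount);
        if (newcount == 0) {
            free(s->payload);
            free(s);
        }
    }

    /* The fix: decrement and test in one indivisible operation. */
    static void release_fixed(struct shared *s)
    {
        if (atomic_fetch_sub(&s->refcount, 1) == 1) {
            free(s->payload);
            free(s);
        }
    }

    int main(void)
    {
        struct shared *s = malloc(sizeof *s);
        atomic_init(&s->refcount, 1);
        s->payload = malloc(16);

        retain(s);         /* refcount 2 */
        release_fixed(s);  /* refcount 1 */
        release_buggy(s);  /* refcount 0: frees; harmless here because we are
                              single-threaded, but racy under concurrent releases */
        return 0;
    }

In this single-threaded demo both versions behave identically; the difference only shows up under concurrent retains and releases, which is exactly why the failure was so rare.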
EDIT: One thing I've learnt over the years is that bugs that look impossible to figure out eventually will be. The magic time period is around two weeks for those rare super-hard bugs. Even starting from no clue of what's going on and intermittent weird failures that look impossible, all you have to do is "do the time" and you can figure them out. I've seen people simply give up and live with things not working, believing that those issues are "unsolvable"...
Worst bug I saw took nearly a year with some really smart people working on it. It reproduced only after about 6 machine years (had ~100 test machines, and one of them would randomly crash every 3 weeks).
The bug I think I've spent most time on was when my Racket program would run for a while and then segfault randomly. Core dumps showed a stack overflow, but unsurprisingly, examining the racket source code didn't enlighten me. I had to git bisect through a number of commits, running them on several instances and tentatively marking them 'good' if I didn't find anything after a few hours. (One time I was too quick to mark a commit 'good'...)
The guilty commit seemed fairly innocent, but it prompted me to try running `(loop (thread (const null)))`, which immediately segfaulted. `(loop (thread (thunk null)))` didn't. At this point we handed off to the racket devs, and replaced our `(const null)` callbacks with `(thunk null)`. After a few days they worked out what was going on and fixed it.
I remember a friend of mine whom I and all our friends tried to help figure out why two lines of C code never worked when run normally, but always worked when debugging.
It was a simple loop and it took, as I remember, two weeks to spot the mistake: a missing = 0 in the
for (int c; c < x; c++)
The debugger initialized all memory to zero so it never failed when debugging :)
Around 26 years ago, I was working on some C code that was ported from assembler. It was too long ago to remember the actual bug, but I was stumped by how a certain variable "C" was getting set. It wasn't set explicitly anywhere in the code.
It turns out that it was declared like this:
int A, B, C;
and it was being set something like this:
int* pA = &A; // pointer to contents of "A"
pA[2] = 3; // WTF?
A, B, and C were assumed to be contiguous in memory, and the code was treating "C" like the 3rd item in an array starting at "A".
In the mid-eighties, I was working on some app that did serial port communications, probably under PC-DOS. The program was failing to communicate, and I spent most of a week stepping through the debugger, watching it send the characters down the cable, working perfectly. But it was failing at normal speed. It finally occurred to me that maybe the hardware wasn't quite working, so I ran a diagnostic and it immediately failed. It was probably a lightning surge from months before that made that port "special". The psychic scar of that week made such an impression on me that I bought a used 5KW 120lb ultra-isolation transformer for $75, and I have run all my various equipment behind it ever since. I've never had another hardware failure due to electrical surge, and my cat loves to sleep snuggled against it for warmth.
My favorite bug came just after starting a job. There was a C++ program that was being run 32-bit on servers with 8GB of memory, back when that was a lot, because it would crash when compiled 64-bit. I was assigned the task of making it work.
After many, many fruitless debug sessions the problem turned out to be that the structure packing was different between two different compilation units. In some compilation units a particular structure was 56 bytes, in others 48 (or something like that). This was Bad.
There was an unterminated pragma-pack which was included in some compilation units but not others. In 32-bit mode it didn't cause any problems, because the structures were optimally packed anyway, but in 64-bit mode, when pointers were 8 bytes, the structures packed differently when the unterminated pragma-pack was included in the header before them.
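A small C sketch (not the original code) of how an unterminated pragma-pack leaks into whatever is defined or included after it in the same translation unit:

    #include <stdio.h>

    #pragma pack(push, 1)
    struct wire_header {        /* genuinely needs tight packing */
        char tag;
        int  length;
    };
    /* Missing here: #pragma pack(pop)  <-- the bug */

    /* Everything defined (or included) below is now packed to 1 byte too. */
    struct record {
        char  flag;
        void *ptr;              /* an 8-byte pointer in a 64-bit build */
    };

    int main(void)
    {
        /* With the stray pack(1) in effect this prints 9 on a typical
           64-bit build; a translation unit that never saw the pragma
           would lay out struct record as 16 bytes. Passing such structs
           between the two corrupts memory. */
        printf("sizeof(struct record) = %zu\n", sizeof(struct record));
        return 0;
    }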
Worst I can remember recently, since I never really solved it, was a .NET Runtime Fatal Execution Engine Error that occurred somewhat randomly in a long-running, multi-threaded console application, but only on one machine. Eventually just moved to a different machine.
Once had an issue with a network of wireless mesh devices that I had to try to debug from across the country (I was in Philadelphia, the client was having an issue with a network installed in a hotel in Las Vegas). Getting ready to leave for a long weekend, I got a call from the client at 4:45pm (knowing this client, I suspected they did it on purpose. They had called me on my cell, so clearly they didn't expect me to be in the office). Basically, every time you'd try to reset the room for a new guest, the blinds would shake, half the lights would fail to turn on, and the door would unlock. If you kept at it, it would eventually work. But then, setting the thermostat would sometimes make half the lights go out. And the door would unlock.
They had a packet sniffer that we had built for them, so we went to trying to diagnose the issue. I'd send them new versions of the programmer tool, they'd flash the room (which required pinging each and every device in the room and tapping a physical button on it, in one of the largest single-room installations of this brand of mesh network in the world), and they'd send me the logs of the sniffed packets. We could see that the packets for any device you happened to be standing next to would be completely fine, but if you turned your back on it and walked across the room, it started acting up. But they would be fine again if you walked back to it. "It's like it knows you're watching it", said the guy on the phone.
They kept insisting that I had let a virus into their network. Never mind that it was only possible to rewrite the configuration ROM over the air, rewriting the ROM required physical access to the board in the case, clearly it was my fault for all the "unnecessary fiddling" I had been doing recently (i.e. a slew of bug fixes they had requested all involving my predecessor's lack of understanding of the pitfalls of threading and UI in .NET).
I kept telling the client that all the symptoms suggested radio interference from an outside source. They insisted they had never heard of such a thing, ever, in any context, including static on their car's radio. I being "merely" an applications developer and not an electrical engineer, they lacked faith in my explanation and insisted that I "just fix it". How I was supposed to be knowledgeable enough to fix it if I was apparently not knowledgeable to understand what was going on with it was beyond me, but whatever.
"Put it back to what it was before you started f*ing around with it." Revert the code through source control (thank God I had installed SVN when I first arrived at that company, because apparently EEs don't understand that it's not a good idea to keep dated copies of code directories around as "backups"). "This isn't working, I said give me the old version." Send them links to the installer on their own server. "You must have changed it on our server! How did you get access to our servers? This isn't working!"
I finally gave up around midnight and drove the 3 hours to my parents' house for Thanksgiving the next day. I nearly got fired for it.
We shipped them a radio spectrum analyzer and determined for sure, it was radio interference. The hotel opened the room next door and found a baby monitor, still on, fallen behind the dresser. They turned it off and the room responded flawlessly.
I should have quit then, but I needed the money and I was going through some depression issues so I really thought it was my fault. I eventually did get fired from that place, the only place I've ever gotten fired from, for "not working enough overtime" because I was only doing 50 hours a week when the intern fresh out of college needed 60 to get his much simpler tasks done (and often leaving me blocked because of it, but I wasn't allowed to help him with anything because "you're not an electrical engineer", where apparently only electrical engineers know how to code in C). I don't regret it, biggest piece of shit place I've ever been, and just the motivation I needed to get off my ass and finally change my relationship to work. I've been freelancing ever since.
I guess that's not so much "my hardest bug", but I did actually fix a bunch of bugs in the process of trying to convince them it was interference and not some mythical radio virus that could corrupt packets in mid-air. And all of it based on phone calls and emails with hex dumps of sniffed radio packets, while being "nothing more" than a lowly applications programmer.
Whenever I see a problem that just has to be a compiler or hardware bug, I keep digging in my code because I know I'm wrong. 99% of the time that turns out to be the case.
On the topic of debugging, I would just like to heartily recommend "The Medical Detectives", which has a Dr. House-esque streak to it, but it is real...
Apparently, most of the bugs categorized as hard seem to be something related to hardware, i.e. not a software mistake by some programmer. I'll narrate a couple where that was not the case.
a) I used to work on deep packet inspection software for a multicore network processor. It was a kind of C, but with restricted APIs and some unique concepts related to the multicore hardware. Among those concepts: the same binary runs on multiple cores to process packets, yet with no hardware locks, because an implicit tag (a kind of hash computed on the 5-tuple of src/dst IP, ports, and protocol) ensures that only one core gets packets from any one session / 5-tuple.
So the scenario was a protocol parser whose job was to parse some other info along with the IP and call an external API to add a subscriber. When this parser was run for 10-15 minutes on a live setup, it would segfault after processing some 60-70 million packets. The behavior was reproducible, but it didn't occur at the same time or in the same piece of code.
Narrowing it down didn't exactly work, since the crash stopped occurring when either the subscriber-addition API call or the parser was commented out. Each worked perfectly on its own.
Finally, after a couple of weeks of long debug cycles and notes, it turned out to be an IMPLICIT tag switch inside the subscriber-addition API. Since we were not locking through APIs, the tag switch could lead to the same packet being handled on multiple cores, and anywhere along the line in the follow-up code, a (now redundant) allocation, a shared-memory access, or a deletion (free) could turn into a segfault.
Now, the implicit switch of the tag in the subscriber API was also a documented and needed feature of the hardware. It's just that it should have been DOCUMENTED IN BOLD on the API, which was not the case.
b) In the same DPI product, we once added two fields to look for in the incoming traffic which should not have matched, yet they were still matching in the results. The odd thing was that they only failed when they were together; each worked fine independently.
Going deeper into that code showed a strncpy which was intended as a safety measure against strcpy, but called with MAX_STRING_SIZE. So when the actual string was much shorter, it would wipe the entire remaining length of the buffer with padded zeros, thereby overwriting the fields that had been appended earlier. The author seemed to have missed the following note in strncpy's definition:
"If the end of the source C string (which is signaled by a null-character) is found before num characters have been copied, destination is padded with zeros until a total of num characters have been written to it."
Since then, I have been really careful about reaching for strncpy instead of strcpy, as is so often mistakenly advised.
Ouchies. On the narrow subject of strncpy(): strlcpy() and friends are the "correct" API, IMHO, and it's easy enough to copy-paste them from e.g. the OpenBSD code.