The best debugging story I've ever heard (2010) (patrickthomson.tumblr.com)
185 points by thunderbong on Sept 4, 2022 | 39 comments



Many years ago I was a systems programmer for a Burroughs B6700 mainframe in New Zealand - this was the only mainframe at the University, and everyone depended on it for almost everything. One day it started doing random IOs to random devices - garbage on a printer or a console was annoying, but scribbling over disk was a particularly bad thing; we could be down for ages bringing it back. We had an engineering contract with Burroughs, and eventually they flew in the big-gun engineer from the US...

He took a stick and ran it down one row of cards in the IO processor - nothing happened. He ran it down the next row - bang! Hours later we're up again; he runs it down half the row - bang! Binary search with stick ensues, and eventually he starts pulling cards. One comes out and 3 small solder blobs fall off - they'd been sitting there on some chip's pins since it was manufactured years ago.


And of course there was the IBM Stretch, possibly the only computer ever fixed by giving it an oil change:

http://www.chilton-computing.org.uk/acl/literature/reports/p...


Nice story - but what I liked even more was how the pictures were of a computer room equipped like the first place I ever worked: even back in 1991 the mag tapes and the fleet of three Prime computers were a little antiquated, but lifting up a floor tile with the suction grip felt like being in the movies. We never had The Expert visit, but during one regular service visit an onlooker tried peering into the bowels and got a brisk bollocking from the technician: he had no tie clip, and so his polyester noose was dangling perilously over circuit boards, ready to deliver a static zap.


Debugging Behind the Iron Curtain is also a good story:

https://www.jakepoz.com/debugging-behind-the-iron-curtain/


That's the 80s though, there wasn't much of the Iron Curtain left by that time.

(Sverdlovsk == modern Ekaterinburg)


The Iron Curtain was firmly in place in the 1980s. If you tried crossing the GDR/BRD or the Czechoslovak/(BRD, Austria) border in the 1980s, you would be risking your life and, if caught, you would certainly do some time in prison.

The last person shot at the Berlin Wall died in February 1989 [0].

[0] https://en.wikipedia.org/wiki/Chris_Gueffroy


There are many similar gems in this old Usenet thread:

https://web.archive.org/web/20060327191523/http://www.speedy...

There's an overarching theme of, "computers used to be much, much, much more sensitive to interference."


> Let's see if 160Kb makes it around the Net.

Love it!


A very similar situation - we called it the "million rand muffin" problem. One store kept getting occasional misreads on barcodes where the price was embedded in the barcode.

We looked at our code and were rather stumped - it was a simple ACK/NAK RS232 serial protocol - so we replaced the serial boards.
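
For anyone picturing the setup, this is roughly the sort of exchange involved - a minimal sketch in Python using pyserial, where the port name, baud rate, framing, and resend handling are all illustrative guesses rather than the actual protocol from the story:

    import serial  # pyserial

    ACK, NAK = b"\x06", b"\x15"  # standard ASCII control codes

    def read_scan(port="/dev/ttyS0", baud=9600):
        # Read one barcode frame from the scanner and acknowledge it.
        with serial.Serial(port, baud, timeout=1) as ser:
            frame = ser.read_until(b"\r")   # assume one CR-terminated line per scan
            if frame.endswith(b"\r"):
                ser.write(ACK)              # good frame: acknowledge
                return frame.rstrip(b"\r").decode("ascii", errors="replace")
            ser.write(NAK)                  # short/garbled frame: ask for a resend
            return None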

I got sent onsite (my fate as lead programmer) and sat in the office and waited - hours later it happened, and when I looked down at the offending muffin I noticed the barcode scanner was from another (cheaper) manufacturer we normally did not use.

I sent some samples to the manufacturer - a few days later it turned out to be a firmware issue.


Your story reminded me of a barcode debugging story from my past.

I wanted to use an image, so I ended up writing it on Twitter: https://twitter.com/benji_york/status/1566421109255315458


Good on finding that one - the rotation trick was neat.

We normally used NCR barcode scanners as they had some "3D" mirror scanning, but they were rather expensive and had a larger footprint at the front line.


There was a CP/M machine that I knew of that would only work when plugged into the printer.

The reason turned out to be a ground fault. The computer used only a two-prong plug for power. The printer used a three-prong plug with a proper ground, and the Centronics cable/connection between the two units had a grounded shell and shielding foil in the cable.


The story where emails could only be sent a certain distance is fun. I haven't read that one in a while.


https://web.mit.edu/jemorris/humor/500-miles for those who have not read it - it is brilliant and solving the bug involves understanding the speed of light!
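
For anyone who wants the punchline as arithmetic: the story's misconfigured timeout works out to roughly 3 milliseconds, and light only gets so far in that time (the ~3 ms figure is from the linked write-up; the rest is just the speed of light):

    # How far does light travel in the ~3 ms timeout from the story?
    c = 299_792_458          # speed of light, metres per second
    timeout_s = 0.003        # ~3 milliseconds, per the write-up

    distance_km = c * timeout_s / 1000
    distance_miles = c * timeout_s / 1609.344
    print(f"{distance_km:.0f} km (~{distance_miles:.0f} miles)")
    # ~899 km, or about 560 miles - hence emails "only" reaching ~500 miles.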


It's on "Did you win the Putnam?" or "xkcd 927" level and does not really need a link on HN at this point.


Now I need to look up "Did you win the Putnam?". It's my lucky day to learn about this.



https://xkcd.com/1053/

I can assure you there are plenty of readers here who have not heard of it. I shared the link at work recently and was interested that not a single person knew of it (they are fairly young though).


[flagged]


You just come off so fucking cool.


The irony of referencing September while being obnoxiously overconfident that everyone has memorized the same XKCD jokes as you.


Similarly: OpenOffice cannot print on Tuesdays.

https://beza1e1.tuxen.de/lore/print_on_tuesday.html

(Date embedded in the print job and parsed by the wrong filter)


There's a whole load of stories like these linked from https://dbrgn.ch/stories-from-the-internet.html . I'd definitely recommend them.


A rather popular site had gotten a donated, top-of-the-line Compaq server system to run their operations; everything ran fine for the developers at home, but when the server was sent to the DC there were all manner of transient bugs that couldn't be diagnosed. It went back and forth a couple of times, then they gave up on that one and acquired new hardware.

I came to own the "cursed" server, and it ran fine for me for several months in heavy but noncritical tasks. I put it in my DC and sure enough the odd errors began happening again.

This was a 4U box with a row of drive bays in front and "onboard SCSI RAID" and "hotplug backplane" and all those trimmings. So we had all been using the onboard, built-in SCSI controller this whole time, which included a large SCSI cable carefully routed through the folds and byways of this complicated case.

That was the fault: the SCSI cable connector to the backplane board had cracked, and when it was cold it pulled apart enough to make poor contact on some pins, resulting in (sometimes silent!) data corruption. Getting to it required stripping the machine down further than anybody had managed to that point. After that it served me well for several more years.


I had the opportunity to pick up one of their "cluster-in-a-box" servers (two servers and a shared RAID array in a single 10U chassis) super cheaply one time, and ultimately passed on it because I would have had to run a dedicated 20A circuit for it, and it would have overwhelmed my home A/C system (nice in the winter though).


For some reason, the link doesn't open for me. Wayback machine works:

https://web.archive.org/web/20220904055009/https://patrickth...


So, for once, it actually was a hardware problem…


Hardware bugs make the best war stories. I once spent a few days debugging a robot arm that would only manifest the erroneous behavior if it was in the right room. If you physically moved the arm and all its associated cables to another room, the problem would disappear only to manifest again when the hardware was all moved back to the original location.

Turns out the problem was an undocumented revision to a passive port cover. The new design introduced a short between pins without changing external markings. The only distinguishing mark was a slight color difference between different batches of plastic.


And you... changed port covers when you moved the robot?


Yes, or omitted them. They weren't supposed (or documented) to be necessary or functional - just to keep the port clean of grime (robots operate in dirty environments) and mitigate signal loss.


I once had a bug that was triggered by sunlight. We fixed it by painting the device black.


Similarly, I worked on a team that experienced backend failures in a custom distributed hash table that some crazy person decided to deploy on the cheapest server tier available from Hetzner hosting. Every hot summer day around 3pm, we started to see cascading failures in the DHT nodes.

Turned out Hetzner were too cheap to pay for cooling for their cheapest server racks, and they all failed like clockwork when it got hot enough outside.


Is that something like erasing an EEPROM with sunlight?

I recall Raspberry Pis had an issue where certain types of flash photography would cause them to shut down :) https://www.raspberrypi.com/news/xenon-death-flash-a-free-ph...


Early in Google's life they skimped on ECC memory for their 30-thousand-something servers (this was way back then). All sorts of nasty hardware bugs… Even in my time, with ECC, there were all sorts of CPU bugs that would occasionally mess up every job that happened to land on a particular node.


Bitflips are still very much a thing, but you need to be processing datasets at least several petabytes in size to get a decent probability of a bitflip/instruction error.

It's always about the probability of a flip/error, not about whether they exist at all. They always exist.
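
A rough way to see why the dataset size matters - the per-byte rate below is a made-up illustrative number, not a measured one (real rates vary enormously with hardware and environment), but the scaling is the point:

    import math

    p_per_byte = 1e-15   # ASSUMED flip probability per byte processed, purely illustrative

    def p_at_least_one_flip(n_bytes):
        # Poisson approximation: P(>=1 flip) = 1 - e^(-expected flips)
        return -math.expm1(-n_bytes * p_per_byte)

    for name, size in [("1 GB", 1e9), ("1 TB", 1e12), ("1 PB", 1e15)]:
        print(f"{name}: P(>=1 flip) ~ {p_at_least_one_flip(size):.6f}")
    # With this assumed rate: negligible at gigabytes, ~0.1% at a terabyte,
    # and ~63% once you're pushing a petabyte through.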


When I worked at a company called Infowave Software, someone was struggling with an apparent bug: the application on a PocketPC device was crashing during transmission of packets out the wireless modem.

So of course, the intuition was that there was a crash somewhere, maybe in the wireless stack that was being worked on.

Turned out, the application's disappearance wasn't a crash but an exit. The pointer happened to be hovering over the [X] window close button, and the RF interference from the wireless transmission was triggering a phantom tap on the screen.


We had a similar situation with a guy whose old CRT monitor was "jiggly". I replaced the monitor. Same problem. Replaced the network card. Same problem. Tried replacing the whole PC. Same problem.

Finally I put his whole setup on a cart with a giant extension cord and wheeled it out into the hallway. The problem went away.

Turned out his office had an insufficiently shielded fuse box that was causing magnetic interference.


Hardware memory faults are still possible, right? And they must show up at AWS scale, with that much equipment.


Definitely. At a large company we used to be able to detect solar storms by checking the ECC error graph for different data centers around the world. The vast majority were correctable but of course probability says that a few were not.


Possible and fairly frequent - smaller components are sensitive to smaller energies. It's one of the reasons satellites don't use modern chips - they're too sensitive, bigger transistors and higher energies survive radiation better.

One significant thing that has changed since then, though, is that in many cases much of the RAM in use isn't doing something truly critical. Gigabytes are spent holding data, not behavior, so a flip manifests as a slightly miscolored pixel or a minor calculation difference. The odds of random flips changing something that'll be noticed are much lower.

And truly critical stuff often runs on multiple machines simultaneously, to detect when one of them disagrees. That's much better than ECC, as it'll catch all flips anywhere, at the point where they become relevant - assuming you're not truly unlucky and get similar flips on multiple machines.
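
A toy sketch of that "compare the replicas" idea (roughly triple modular redundancy); the three-machine setup and names here are illustrative, not any particular system's design:

    from collections import Counter

    def vote(results):
        # Majority-vote over the same computation run on several machines;
        # flag any replica whose answer differs from the winner.
        tally = Counter(results)
        winner, count = tally.most_common(1)[0]
        if count <= len(results) // 2:
            raise RuntimeError("no majority - replicas disagree")
        dissenters = [i for i, r in enumerate(results) if r != winner]
        return winner, dissenters

    # e.g. one of three machines hit a bit flip mid-computation:
    answer, suspects = vote([42, 42, 43])
    print(answer, "suspect replicas:", suspects)   # 42 suspect replicas: [2]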



