Many years ago I was a systems programmer for a Burroughs B6700 mainframe in New Zealand. It was the only mainframe at the university, and everyone depended on it for almost everything. One day it started doing random I/Os to random devices: garbage on a printer or a console was annoying, but scribbling over disk was a particularly bad thing, and we could be down for ages bringing it back. We had an engineering contract with Burroughs, and eventually they flew in the big-gun engineer from the US...
He took a stick and ran it down one row of cards in the IO processor: nothing happened. He ran it down the next row: bang! Hours later we're up again, and he runs it down half the row: bang! A binary search with the stick ensues, and eventually he starts pulling cards. One comes out, and three small solder blobs fall off: they'd been sitting there on some chip's pins since it was manufactured years ago.
Nice story - but what I liked even more was how the pictures were of a computer room equipped like the first place I ever worked: even back in 1991 the mag tapes and the fleet of three Prime computers were a little antiquated, but lifting up a floor tile with the suction grip felt like being in the movies. We never had The Expert visit, but during one regular service visit an onlooker tried peering into the bowels and got a brisk bollocking from the technician: he had no tie clip, and so his polyester noose was dangling perilously over circuit boards, ready to deliver a static zap.
The Iron Curtain was firmly in place in the 1980s. If you tried crossing the GDR/BRD or the Czechoslovak/(BRD, Austria) border in the 1980s, you would be risking your life and, if caught, you would certainly do some time in prison.
The last person shot at the Berlin Wall died in February 1989 [0].
A very similar situation - we called it the "million rand" muffin problem: one store kept getting occasional misreads on barcodes where the price was embedded in the barcode.
We looked at our code and were rather stumped - it was a simple ACK/NAK serial RS232 protocol - so we replaced the serial boards.
I got sent onsite (my fate as lead programmer) and sat in the office and waited. Hours later it happened, and when I looked at the offending muffin I noticed the barcode scanner was from another (cheaper) manufacturer we normally did not use.
I sent some samples to the manufacturer, and a few days later it turned out to be a firmware issue.
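The story only mentions "a simple ACK/NAK serial protocol", so here is a minimal sketch of what that kind of check looks like. The frame layout, checksum, and names are all hypothetical, not the actual protocol from the story:

```python
# Hypothetical sketch of a simple ACK/NAK framing check. The real protocol
# details aren't given in the story; this just illustrates the idea.
ACK, NAK = b"\x06", b"\x15"  # the standard ASCII ACK/NAK control bytes

def checksum(payload: bytes) -> int:
    """XOR of all payload bytes, a common lightweight serial checksum."""
    c = 0
    for b in payload:
        c ^= b
    return c

def receive(frame: bytes) -> bytes:
    """Verify the trailing checksum byte; reply ACK on match, NAK otherwise."""
    payload, check = frame[:-1], frame[-1]
    return ACK if checksum(payload) == check else NAK

payload = b"MUFFIN:4.50"
good = payload + bytes([checksum(payload)])
bad = good[:-1] + bytes([good[-1] ^ 0x01])  # simulate a corrupted byte
```

The point of the story is that a check like this only catches corruption on the wire; if the scanner's firmware hands over a wrong-but-well-formed barcode, the host protocol happily ACKs it.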
There was a CP/M machine that I knew of that would only work when plugged into the printer.
The reason turned out to be a ground fault. The computer used only a two prong plug for power. The printer used a three prong plug with proper ground and the centronics cable/connection between the two units had a grounded shell and shielding foil in the cable.
I can assure you there are plenty of readers here who have not heard of it. I shared the link at work recently and was interested that not a single person knew of it (they are fairly young though).
A rather popular site had gotten a donated, top of the line Compaq server system to run their operations; everything ran fine for the developers at home but when the server was sent to the DC there were all manner of transient bugs that couldn't be diagnosed. It went back and forth a couple times then they gave up on that one and acquired new hardware.
I came to own the "cursed" server, and it ran fine for me for several months in heavy but noncritical tasks. I put it in my DC and, sure enough, the odd errors began happening again.
This was a 4U box with a row of drive bays in front and "onboard SCSI raid" and "hotplug backplane" and all those trimmings. So we had all been using the onboard built in SCSI controller this whole time, which included a large SCSI cable carefully routed through the folds and byways of this complicated case.
That was the fault: the SCSI cable connector to the backplane board had cracked, and when it was cold it pulled apart enough to make poor contact on some pins, resulting in (sometimes silent!) data corruption. Getting to it required stripping the machine down further than anybody had managed to that point. After that it served me well for several more years.
I had the opportunity to pick up one of their "cluster-in-a-box" servers (two servers and a shared RAID array in a single 10U chassis) super cheaply one time, and ultimately passed on it because I would have had to run a dedicated 20A circuit for it, and it would have overwhelmed my home A/C system (nice in the winter though).
Hardware bugs make the best war stories. I once spent a few days debugging a robot arm that would only manifest the erroneous behavior if it was in the right room. If you physically moved the arm and all its associated cables to another room, the problem would disappear only to manifest again when the hardware was all moved back to the original location.
Turns out the problem was an undocumented revision to a passive port cover. The new design introduced a short between pins without changing external markings. The only distinguishing mark was a slight color difference between different batches of plastic.
Yes, or omitted them. They weren't supposed to be / documented to be necessary or functional, just keep the port clean of grime (robots operate in dirty environments) and mitigate signal loss.
Similarly, I worked on a team that experienced backend failures in a custom distributed hash table that some crazy person decided to deploy on the cheapest server tier available from Hetzner hosting. Every hot summer day around 3pm, we started to see cascading failures in the DHT nodes.
Turned out Hetzner were too cheap to pay for cooling for their cheapest server racks, and they all failed like clockwork when it got hot enough outside.
Early in Google's life they skimped on ECC memory for their thirty-thousand-odd servers (a lot back then). All sorts of nasty hardware bugs... Even in my time, with ECC, there were all sorts of CPU bugs that would occasionally mess up every job that happened to land on a particular node.
Bitflips are still very much a thing, but you need to be processing datasets at least several petabytes in size to get a decent probability of a bitflip/instruction error.
It's always about the probability of a flip/error, not about whether they exist at all. They always exist.
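The "petabytes before you notice" intuition above can be put into a back-of-the-envelope formula: the probability of at least one flip is 1 - (1 - p)^n for n bits processed at per-bit error rate p. The rate used below is purely illustrative (real rates vary enormously with hardware and environment):

```python
import math

# Back-of-the-envelope: probability of at least one bit flip while
# processing `bits` bits at an assumed, illustrative per-bit error rate.
def p_any_flip(bits: float, per_bit_rate: float) -> float:
    """P(>=1 flip) = 1 - (1 - p)^n, in a numerically stable log1p/expm1 form."""
    return -math.expm1(bits * math.log1p(-per_bit_rate))

PETABYTE_BITS = 8e15  # bits in one petabyte of data
```

At an assumed rate in the ballpark of 1e-17 per bit, one petabyte gives only a few percent chance of a flip, while tens of petabytes push it toward near-certainty, which matches the comment above: the flips always exist, only the probability of seeing one changes.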
When I worked at a company called Infowave Software, someone was struggling with an apparent bug: the application on a PocketPC device was crashing during transmission of packets out the wireless modem.
So of course the intuition was that there was a crash somewhere, maybe in the wireless stack that was being worked on.
Turned out, the application's disappearance wasn't a crash but an exit. The pointer happened to be hovering over the [X] window close button, and the RF interference from the wireless transmission was triggering a phantom tap on the screen.
We had a similar situation with a guy whose old CRT monitor was "jiggly". I replaced the monitor. Same problem. Replaced the network card. Same problem. Tried replacing the whole PC. Same problem.
Finally I put his whole set up on a cart with a giant extension cord and wheeled it out into the hallway. The problem went away.
Turned out his office had an insufficiently shielded fuse box that was causing magnetic interference.
Definitely. At a large company we used to be able to detect solar storms by checking the ECC error graph for different data centers around the world. The vast majority were correctable but of course probability says that a few were not.
Possible and fairly frequent - smaller components are sensitive to smaller energies. It's one of the reasons satellites don't use modern chips: they're too sensitive, while bigger transistors switching at higher energies survive radiation better.
One significant thing that has changed since then, though, is that in many cases much of the RAM in use isn't doing anything truly critical. Gigabytes are spent holding data, not behavior, so a flip manifests as a slightly miscolored pixel or a minor calculation difference. The odds of random flips changing something that'll be noticed are much lower.
And truly critical stuff often runs on multiple machines simultaneously, to detect when one of them disagrees. Which is much better than ECC, as it'll catch all flips anywhere, at the point they become relevant - assuming you're not truly unlucky and get similar flips on multiple machines.
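The run-it-on-multiple-machines idea above boils down to a majority vote over replica outputs. A toy illustration, with hypothetical names (real systems compare at a chosen commit point, here it's just a list of results):

```python
from collections import Counter

# Toy sketch of redundant execution: run the same computation on several
# replicas, take the majority answer, and flag any disagreement.
def vote(results):
    """Return (majority_value, unanimous) for a list of replica outputs."""
    counts = Counter(results)
    value, n = counts.most_common(1)[0]
    return value, n == len(results)

# A single flipped replica is outvoted, and the disagreement is detected:
majority, unanimous = vote([42, 42, 43])
```

This is why it catches what ECC can't: it compares the final result, so a flip in a register, cache, or bus shows up just as readily as one in DRAM, as long as the replicas don't fail identically.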