Reminds me of a similar story from a colleague about an email that couldn't be sent with a specific subject. After much troubleshooting by L1/L2 techs it turned out to be a bit pattern error / broken ASIC in a core router/switch.
I've personally twice encountered similar issues, once with FTPS and another with HTTPS, both manifesting as strange cryptographic failures, but ultimately caused by a broken network device somewhere in the path.
That reminds me of the time a shiny new enterprisey Ethernet switch delayed a release of Red Hat Linux until we figured out that it was happily corrupting packets and fixing up the CRC and checksum, causing certain stress tests being run over NFS to fail randomly. It turns out that sometimes hardware devs don't actually design for the level of reliability expected by customers in their hardware.
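To illustrate why a device that corrupts data and then "fixes up" the CRC is so hard to spot, here is a minimal sketch. It uses `zlib.crc32`, which shares its polynomial with the Ethernet FCS; the frame layout, the `buggy_switch` behaviour, and the payload are all made up for illustration and model only the CRC, not the TCP/UDP checksum.

```python
import zlib

def frame_with_fcs(payload: bytes) -> bytes:
    """Append a CRC-32 trailer to the payload (simplified frame, no headers)."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")

def fcs_ok(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailer."""
    payload, fcs = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "little") == fcs

def wire_noise(frame: bytes) -> bytes:
    """Ordinary corruption: a bit flips in flight, the old CRC stays in place."""
    damaged = bytearray(frame)
    damaged[0] ^= 0x01
    return bytes(damaged)

def buggy_switch(frame: bytes) -> bytes:
    """Hypothetical faulty device: flips a bit, then recomputes a matching CRC."""
    payload = bytearray(frame[:-4])
    payload[0] ^= 0x01
    return frame_with_fcs(bytes(payload))

original = frame_with_fcs(b"NFS write block 42")
print(fcs_ok(wire_noise(original)))                   # False: receiver drops the frame
print(fcs_ok(buggy_switch(original)))                 # True: receiver accepts bad data
print(buggy_switch(original)[:-4] == original[:-4])   # False: payload really changed
```

Only a check computed by the sender and verified by the final receiver (an application-level hash, for example) would notice the second case.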
Sounds similar to something I hit at work once. I built a release of one of our products and uploaded the binary to a file server on our Ethernet LAN for the test department to test, but the upload failed. Everything else I tried to upload worked. I found I could also not transfer that file to anything else over Ethernet.
After some experimenting I determined that there was a particular 4-byte (if I remember the length correctly) string that simply could not be sent or received by my Ethernet card.
Some searching turned up something on some obscure support forum where the maker of that card admitted that there was a bug in the hardware's checksum implementation that made it fail on that byte pattern.
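For anyone chasing something similar, the "some experimenting" step can be automated with two binary searches. This is only a sketch: `transfer_ok` is a hypothetical callback that tries to send the given bytes and reports success, and it assumes a transfer fails exactly when it contains one contiguous bad pattern (and that an empty transfer succeeds). The `BAD` pattern below is invented for the demo.

```python
from typing import Callable

def smallest_failing_slice(data: bytes, transfer_ok: Callable[[bytes], bool]) -> bytes:
    """Narrow a failing blob down to the shortest slice that still fails."""
    if transfer_ok(data):
        raise ValueError("data transfers fine; nothing to narrow down")

    # Step 1: find the shortest failing prefix.
    # Invariant: data[:lo] transfers, data[:hi] fails.
    lo, hi = 0, len(data)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if transfer_ok(data[:mid]):
            lo = mid
        else:
            hi = mid
    end = hi

    # Step 2: trim from the front while the slice keeps failing.
    # Invariant: data[lo:end] fails, data[hi:end] transfers.
    lo, hi = 0, end
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if transfer_ok(data[mid:end]):
            hi = mid
        else:
            lo = mid
    return data[lo:end]

# Simulated faulty NIC: any transfer containing BAD fails (pattern is made up).
BAD = bytes.fromhex("deadbeef")
def fake_transfer_ok(chunk: bytes) -> bool:
    return BAD not in chunk

blob = b"\x00" * 1000 + BAD + b"\x11" * 500
print(smallest_failing_slice(blob, fake_transfer_ok).hex())  # deadbeef
```

Each search needs about log2(len(data)) test transfers, so even a large binary narrows down quickly.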
A friend of mine in college had an interesting hardware data corruption bug on a computer he built for his project in the microprocessor lab we both took in 1982. We did a joint project, where together we designed a small 68k-based system and then separately each built a copy. Here's mine [1] [2].
On mine the serial ports worked fine up to 19200 bps (the limit of the ADM3a terminals we were using). On his, once you got past 2400 bps, it started getting errors.
The nature of the errors was such that only a subset of the possible ASCII characters could be sent. Characters outside that subset would be replaced by characters from the subset. The faster you went, the smaller the working subset.
What happened was that there were filter capacitors on the serial lines to filter out high-frequency noise. My filter capacitors were correct. His were not. When he went to the EE stockroom to buy filter capacitors, they gave him capacitors that were a couple of orders of magnitude bigger than he asked for.
On his system, then, if a character had too many bit transitions, or bit transitions too close together, those transitions looked like high-frequency noise to the filter capacitors and were filtered out.
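A rough simulation shows the same effect. This sketch pushes an 8N1 serial frame through a first-order RC low-pass filter and samples each bit at its midpoint. The RC time constants (2 µs for the "right" capacitor, 200 µs for one 100x too large) are assumptions picked to resemble the story, not the real component values, and the model ignores start-bit resynchronisation and framing errors.

```python
import math

def uart_bits(byte: int) -> list[int]:
    """8N1 frame on an idle-high line: start bit 0, data LSB first, stop bit 1."""
    return [0] + [(byte >> i) & 1 for i in range(8)] + [1]

def received_byte(byte: int, baud: float, rc_seconds: float,
                  steps_per_bit: int = 64) -> int:
    """Filter one frame through an RC low-pass and decode it by mid-bit sampling."""
    dt = 1.0 / baud / steps_per_bit
    alpha = 1.0 - math.exp(-dt / rc_seconds)  # exact RC step response per time step
    v = 1.0                                   # line idles high (normalised to 0..1)
    sampled = []
    for bit in uart_bits(byte):
        for step in range(steps_per_bit):
            v += alpha * (bit - v)            # capacitor charges toward the driven level
            if step == steps_per_bit // 2:
                sampled.append(1 if v > 0.5 else 0)
    data_bits = sampled[1:9]                  # drop start and stop bits
    return sum(b << i for i, b in enumerate(data_bits))

def surviving_chars(baud: float, rc_seconds: float) -> list[str]:
    """Printable ASCII characters that come out of the filter unchanged."""
    return [chr(c) for c in range(32, 127)
            if received_byte(c, baud, rc_seconds) == c]

RC_RIGHT = 2e-6   # assumed time constant with the intended capacitor
RC_WRONG = 2e-4   # assumed time constant with a capacitor 100x too large

for baud in (2400, 4800, 9600, 19200):
    print(baud, len(surviving_chars(baud, RC_RIGHT)), len(surviving_chars(baud, RC_WRONG)))
```

With these assumed values the oversized capacitor passes every printable character at 2400 bps and starts mangling some at 4800 bps, because isolated bits no longer reach the threshold by the sampling point.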
Reading this made me think “huh, where have I seen something like this before?” - an AT&T router was doing something awfully similar back in 2020 - https://news.ycombinator.com/item?id=25335936
This is a great write-up demonstrating how you can use basic tools and a knowledge of networking protocols to identify a failure on the Internet on a hop you don't control. Thanks for sharing this with us.
It’s crazy that carrier-grade routers don’t auto-detect packet data corruption. They could do it by injecting test packets on ingress ports and sniffing on egress. Something like this would be detected quickly. It’s better to shut down a router corrupting packets and let the system route around it than keep corrupting.
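A sketch of what such a self-test might look like, with the router's data path modelled as a plain function. Everything here is hypothetical: `forward` stands in for "inject on ingress, capture on egress", and `faulty_forward` is an invented fault that flips one bit in every 16th byte when that byte's high bit is set.

```python
import random
from typing import Callable, Optional

def probe_for_corruption(forward: Callable[[bytes], bytes],
                         payload_len: int = 64,
                         random_probes: int = 200,
                         seed: int = 0) -> Optional[bytes]:
    """Send known payloads through `forward`; return the first one that comes back altered."""
    rng = random.Random(seed)
    # Fixed bit patterns first, then seeded random payloads for wider coverage.
    probes = [bytes([fill]) * payload_len for fill in (0x00, 0xFF, 0x55, 0xAA)]
    probes += [bytes(rng.getrandbits(8) for _ in range(payload_len))
               for _ in range(random_probes)]
    for probe in probes:
        if forward(probe) != probe:
            return probe
    return None

def healthy_forward(payload: bytes) -> bytes:
    return payload

def faulty_forward(payload: bytes) -> bytes:
    """Invented fault: flip bit 4 of every 16th byte if its high bit is set."""
    out = bytearray(payload)
    for i in range(15, len(out), 16):
        if out[i] & 0x80:
            out[i] ^= 0x10
    return bytes(out)

print(probe_for_corruption(healthy_forward))             # None
print(probe_for_corruption(faulty_forward) is not None)  # True
```

The catch is that pattern-dependent faults only show up if a probe happens to contain the triggering bits, which is why the sketch mixes fixed patterns with random payloads. A rare 4-byte trigger could still slip past a few hundred probes.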
Whenever we have weird network problems, the first thing our networks people do is check the bad packet counters on the machines involved. It annoys me slightly, because the problem is never related to packet corruption. But ...
> A Montreal machine receiving this packet discarded it at the kernel level after realizing it’s corrupt, never passing it to the userland ssh daemon. London then re-transmitted it, going through the same corruption, getting the same silent treatment. From ssh and sshd’s perspective, the connection was at a stalemate. From tcpdump’s perspective, there was no loss, and Montreal machines appeared to be just ignoring data.
... it would have caught this! I suppose it's a good habit, because it's a cheap check for something that would be absolutely baffling otherwise.
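On Linux that check can be scripted. A minimal sketch, assuming the usual `/proc/net/snmp` layout (alternating header and value lines per protocol) and the `InCsumErrors` counters that recent kernels expose for TCP and UDP; the sample text is trimmed and made up.

```python
from pathlib import Path

def parse_snmp(text: str) -> dict[str, dict[str, int]]:
    """Parse /proc/net/snmp style text: a header line, then a matching value line."""
    lines = text.strip().splitlines()
    stats: dict[str, dict[str, int]] = {}
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        _, numbers = values.split(":", 1)
        stats[proto] = dict(zip(names.split(), map(int, numbers.split())))
    return stats

def checksum_errors(stats: dict[str, dict[str, int]]) -> dict[str, int]:
    """Pick out the per-protocol count of packets dropped for a bad checksum."""
    return {proto: counters["InCsumErrors"]
            for proto, counters in stats.items()
            if "InCsumErrors" in counters}

# Trimmed, invented sample; the real file has many more columns and protocols.
SAMPLE = """\
Tcp: InSegs OutSegs RetransSegs InErrs InCsumErrors
Tcp: 1000 900 12 3 3
Udp: InDatagrams InErrors InCsumErrors
Udp: 50 0 0
"""
print(checksum_errors(parse_snmp(SAMPLE)))  # {'Tcp': 3, 'Udp': 0}

snmp = Path("/proc/net/snmp")
if snmp.exists():  # only present on Linux
    print(checksum_errors(parse_snmp(snmp.read_text())))
```

A counter that keeps climbing while a connection is stalled is a strong hint that packets are arriving corrupted rather than being lost.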
With nearly 100% of Web traffic served over HTTPS, one would think that stories like this would become much less frequent, as the transport layer crypto is very picky about data integrity.
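To show what "picky" means in practice, here is a small sketch with an HMAC standing in for the per-record integrity check that TLS performs (real TLS uses AEAD ciphers or a MAC negotiated in the handshake). The key and record format are invented for the example.

```python
import hashlib
import hmac

KEY = b"hypothetical shared session key"
TAG_LEN = hashlib.sha256().digest_size  # 32 bytes

def seal(message: bytes) -> bytes:
    """Append an HMAC-SHA256 tag to the message."""
    return message + hmac.new(KEY, message, hashlib.sha256).digest()

def open_record(record: bytes) -> bytes:
    """Verify the tag and return the message, or raise if anything changed."""
    message, tag = record[:-TAG_LEN], record[-TAG_LEN:]
    expected = hmac.new(KEY, message, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("bad record MAC")
    return message

record = seal(b"GET /index.html")
print(open_record(record))  # b'GET /index.html'

# A single flipped bit anywhere in the record is rejected.
corrupted = bytearray(record)
corrupted[3] ^= 0x04
try:
    open_record(bytes(corrupted))
except ValueError as err:
    print("rejected:", err)
```

Unlike a CRC, a middlebox cannot "fix up" this tag after mangling the data, because it does not know the key. The failure just moves from silent corruption to a broken connection.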