I used to use "Lanner" gear for voip and these had embedded intel ethernets. I don't have any more of them to test, but I swear I've seen it on them as well. We suspected power supply problems because the link lights would just go dark every once in a blue moon and need a power cycle to set right, but then we were never able to reproduce it.
I am impressed by his original troubleshooting, but this followup seems impractical. Of his three suggestions, only the third (Intel providing improved board testing tools) even seems like it could possibly prevent this sort of problem. Asking for hardware-enforced "sane" behavior is like asking, "why doesn't my computer know I don't want my program to deadlock, segfault, or loop indefinitely?" That is, if the controller could do that then it would solve the Halting Problem. Improved drivers, his second suggestion, are always a good thing, but drivers only get patched to handle broken hardware in response to the discovery of broken hardware. There is no way to anticipate each particular way a NIC could possibly be broken ahead of time.
The market demands controllers with flexible and expandable functionality. Board manufacturers use the EEPROM to specify exactly what behavior is required. If a particular manufacturer underestimates the importance of correctness and doesn't perform the code review and testing necessary to prevent a PoD, that isn't Intel's fault.
Kristian Kielhofner here - While I understand your analogy I don't think it's an accurate one. In fact, with the release of the successor to the 82574 Intel has already implemented some of the things I suggested:
Clearly they have learned from the various EEPROM issues on previous controllers (including the 82574) and implemented (among other things) EEPROM signing, which addresses some (most?) of my concerns about sane hardware behavior. Software drivers already do some basic EEPROM checks on this hardware (I know because I've had to tweak them); I'm simply suggesting these checks go a little further and verify the various EEPROM settings that could potentially result in a scenario like this one. When the effects are as significant as they are here I hope we can all agree: more sanity checking is a good thing.
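To make the idea concrete (everything below is hypothetical -- the offsets and masks are made up for illustration, not taken from the e1000e source), the kind of check I mean is little more than a table of known-sane values for the EEPROM words that matter, consulted before the interface is brought up:

    # Hypothetical sketch of extra driver-side EEPROM sanity checks,
    # run after the existing checksum validation. Offsets and masks
    # here are invented placeholders, not real 82574L layout.
    SANE_WORDS = {
        # offset: (bits that must be clear, reason)
        0x1E: (0x8000, "reserved bit set"),
        0x30: (0x00FF, "unexpected PHY config"),
    }

    def eeprom_settings_sane(words):
        for offset, (forbidden, reason) in SANE_WORDS.items():
            if words[offset] & forbidden:
                print("EEPROM word 0x%02x rejected: %s" % (offset, reason))
                return False
        return True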
Let me preface this by saying that your epic troubleshooting effort was really cool. That's what inspires me to pay so much attention to this.
If you'll excuse my ignorance, could you identify which points made on the linked page correspond to your suggestions? I can see how signature checking, if there is in fact such a mechanism on the controller, can help ensure that an EEPROM image is a member of a particular favored set of such images, but you'll admit that that's a less general approach than "in-hardware sane behavior". I don't know anything about µC design, but it would surprise me if the mistake here were as simple as setting a "die when you see this particular byte sequence" bit. It seems more likely that the behavior is an emergent property based on a combination of flags and coded behavior. I still don't think it would be possible for the controller to prevent that result in general. It is possible to test for bad behavior, as your customers proved. It's also possible for drivers to correctly handle the bad behavior of their hardware, and I'm sure appropriate patches are welcome.
Did your board vendor inform you of Intel's findings back in October? If so, could your original article have been a bit more explicit about the fact that Intel wasn't responsible for this? If not, are you looking for another board vendor?
Let me start by saying that I'm not asking for or expecting perfect hardware or software. This does not exist. I'm looking for improvements. Sane? Let's start with "sane-er". I linked to the i210 because it offers exactly what I'm asking for: improvement (as you'd expect in 4+ years of development).
The link for the i210 was an overview for general consumption. The 862-page datasheet is here:
The description of the various memory and configuration spaces starts around page 53. When compared to what's available in the 82574L this is clearly a substantial improvement.
However, as I say in my update, we still don't /really/ know why this issue manifested the way it did. Without knowing the true underlying cause anything I offer is speculation, as are your suppositions. With that it is unknown as to whether or not the improvements in the i210 would have eliminated or even ameliorated this issue.
As far as catching this exception in driver software? Possible, but doubtful. When I was working with Intel last fall they seemed to dismiss this possibility. Current drivers report a loss of communication with the PHY, and the adapter seems to essentially disappear from the PCI bus until a full power cycle.
Neither Intel nor my board vendor reported these findings to me until this story broke last week. I reported this issue to them last fall: both of them claimed to have never seen this issue before (or since).
Meanwhile, as I've said before, other people have consistently reproduced this issue with different board manufacturers. We are pursuing a second source, but I'm not going to be any more confident in it if it also has 82574L controllers; I can't be certain it's going to behave any differently.
Thanks so much for the detailed response, and good luck in your hunt for better vendors. It seems that it's going to fall to you to test and correct the EEPROM settings. You might want to keep your results to yourself in future; you could probably get some big-money consulting work with other companies forced to use these products. It's so shitty that neither party bothered to respond until you went public with this.
> "why doesn't my computer know I don't want my program to deadlock, segfault, or loop indefinitely?"
For a device that can be checked from the outside (a second subsystem doing self-monitoring), this is actually possible to implement, and fairly common. Watchdogs are often used to restart a device automatically when it becomes completely unresponsive.
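The classic example is the Linux watchdog interface: a hardware timer resets the machine unless userspace keeps writing to /dev/watchdog. A minimal sketch (the health check is a placeholder you'd supply yourself):

    import time

    def system_is_healthy():
        return True  # placeholder: your actual health check goes here

    # Pet the hardware watchdog once a second. If this loop ever stops
    # (hang, crash, wedged device), the timer expires and the hardware
    # resets the machine on its own.
    with open("/dev/watchdog", "wb", buffering=0) as wd:
        while system_is_healthy():
            wd.write(b"\0")
            time.sleep(1)
    # Closing without writing the magic 'V' leaves the timer armed,
    # so the reset still fires -- which is exactly what you want.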
> That is, if the controller could do that then it would solve the Halting Problem
The halting problem is actually decidable for limited-memory machines, though deciding it for a machine with n bits of memory takes O(2^n) memory beyond the machine's own n.
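Concretely: a machine with n bits of memory has at most 2^n distinct states, and a deterministic machine that ever revisits a state is looping forever. So recording every state seen decides halting, at the cost of up to 2^n entries:

    def halts(initial_state, step):
        # step(state) returns the successor state, or None on halt.
        # A deterministic finite-state machine either halts or
        # revisits a state; the `seen` set (up to 2**n entries for
        # n bits of state) tells us which.
        seen = set()
        state = initial_state
        while state is not None:
            if state in seen:
                return False  # state revisited: infinite loop
            seen.add(state)
            state = step(state)
        return True  # reached a halting state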
I have a plurality of systems with Intel motherboards which demonstrate the same kind of problems. The motherboards in question have two Intel ethernet controllers, one of which is an 82574L.
The systems connect to two different networks. When the systems attach to one of the networks (but not the other) using the 82574L interface (but not the other), that interface dies after some unpredictable amount of time.
I have tried posting comments to the Intel engineer's blog post (and PM-ing the engineer directly), but they do not appear. In fact, there seem to be no comments at all on Intel's site, despite the post having nearly 6000 views (at the time of writing).
As I say in my updated post, this is a complex issue with clear combinatorial factors. More than likely it's not limited to one chip, one packet, or one EEPROM configuration. A quick reading of the web shows various unexplained issues with this family of Intel ethernet controllers randomly exhibiting the exact behavior I've described. Different controllers, different mobo OEMs, different EEPROM settings. Are all of these issues related to some kind of "packet of death"? Certainly not. However, are at least some of them? Almost certainly, even if they're not vulnerable to my (extremely specific) "packet of death". We still don't know exactly why this is happening (even in my extremely specific case).
I have another interesting (and reproducible) manifestation on Supermicro motherboards with two 82574L controllers. In this case, it is again true that we only experience problems on the first (as ordered by ascending MAC address) of the two interfaces.
That was the case, though I did not clearly state so above, on the Intel motherboard with one 82574L and one 82579LM.
A company from Taipei flashes some Intel equipment; it then appears to function correctly, but can be bricked remotely with a specially crafted incoming packet.
I think you're reading way more into this than there is to it.
Taiwan (Republic of China) is, by the way, basically its own country, with its own leadership and currency. I find it somewhat hard to lump China (People's Republic of China) and Taiwan (Republic of China) together.
Just to clarify, there's a difference between a Special Administrative Region like Hong Kong or Macau, and Taiwan. While Hong Kong is largely self-governing internally, it's still part of the PRC. Meanwhile, Taiwan (the ROC) was founded by people ousted during the revolution. It's like saying North and South Korea are 'basically' their own countries. Politically they aren't even friendly.
Indeed, I guess I was a little too fuzzy in how I phrased myself, in hindsight. Thanks - a good addition in itself.
I guess the only really suitable way of explaining the situation is "It's complicated." It's a colourful situation, in no way black or white.
I have commented about this in the past: when I was at Lockheed, in 2006, we had security debriefings about Chinese hacking attempts.
There were trojans on the network that were sending little data packets back to China... but more interestingly:
Lockheed employees were not allowed to connect their machines to any foreign network, even those of suppliers.
There was a supplier in Taiwan that employees would visit and transfer files to and from via sneakernet (USB keys). The supplier had been hacked, and the Chinese were using the Taiwanese supplier's machines to attack the Lockheed employees via the transferred USB sticks.
The point is: don't underestimate Chinese hackers or the potential vectors they are willing to exploit.
Well, I suppose the argument would go that a Taiwanese engineer would, by dint of shared language and culture, be more susceptible to coercion and bribes than someone from Europe, South Asia, etc...
And I agree that you don't want to be too paranoid about this stuff. But at the same time, if you were expecting and looking for an "illicit backdoor" in hardware, this is exactly the kind of thing you'd expect. Firmware in a place virtually no one knows about gets modified on a per-product basis to do nefarious things. And this is exactly how you'd expect such a modification to be discovered, by accidentally introducing a bug that distinguishes itself from the clean parent.
I mean, I'm not screaming "spy" here, but if I were to have read this story in a techno-thriller novel I'd be writing a post applauding the author for her excellently researched and eminently plausible plot hook.
> Well, I suppose the argument would go that a Taiwanese engineer would, by dint of shared language and culture, be more susceptible to coercion and bribes than someone from Europe, South Asia, etc...
You could try to make that argument, but I've always figured it would go the other direction. Every Taiwanese national I know is a fierce supporter of Taiwan's independence from China, and China certainly does all it can to foster that sentiment every time it tries to annex the country.
Trust, but verify. With everything that we know, it's silly to not be paranoid.
I got pretty excited about election integrity for a while.
The default position of the defenders of the status quo was "I can't believe you don't trust us. Prove there's something wrong. You 'experts' in computers, security, and elections are just a bunch of conspiracy freaks."
My default position is "show me". That skepticism merely makes me an informed consumer.
If this were a DoS backdoor, it wouldn't have been much harder to make it less discoverable. Just use two magic bytes, or three: with two fixed bytes a random packet matches only one time in 2^16, with three one in 2^24, so the chance of a false positive is virtually zero, and yet you'd still be able to use basic ICMP/ping to trigger it if needed.
A backdoor that bricks the device, but only if it's the first packet received, isn't terribly useful. The more plausible explanation is that it's a bug that was introduced by the actual backdoor that still remains undiscovered.
Firmware images usually have checksums. Was this an Intel blob suffering from bitrot, or does Intel offer some more-or-less error-prone way to build your own FW images for NICs?
I suspect NICs these days are tiny computers in their own right. As a motherboard manufacturer, you can probably program them to do all sorts of nifty things, with the possible downside of strange things happening if you get it wrong.
It is not the firmware that was corrupted. It is the EEPROM that was incorrectly programmed.
The EEPROM is typically 4kB on the 82574. When it is reprogrammed, either by the end user (e.g. via ethtool(1)) or by the manufacturer, the programming procedure recomputes a checksum over the first 128 bytes, IIRC. (When reprogramming via ethtool, the kernel driver e1000e is responsible for automatically updating the checksum.)
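If I'm reading the e1000e source right, the scheme is simple: the first 64 16-bit words (those 128 bytes) must sum, modulo 2^16, to the magic constant 0xBABA, with the word at offset 0x3F serving as the checksum that makes the sum come out right. Roughly:

    NVM_SUM = 0xBABA          # constant the first 64 words must sum to
    NVM_CHECKSUM_REG = 0x3F   # offset of the checksum word itself

    def checksum_ok(words):
        # words: the first 64 16-bit words (128 bytes) of the EEPROM
        return sum(words[:NVM_CHECKSUM_REG + 1]) & 0xFFFF == NVM_SUM

    def fix_checksum(words):
        # what the driver does automatically after an ethtool write
        partial = sum(words[:NVM_CHECKSUM_REG]) & 0xFFFF
        words[NVM_CHECKSUM_REG] = (NVM_SUM - partial) & 0xFFFF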
So all in all, no, the packet of death issue was not caused by bitrot.
http://blog.krisk.org/2013/02/packets-of-death-update.html