I worked at a large cluster computing company and we did occasionally, very occasionally, see PCIe problems. Note that a lot of people are now exporting PCIe over a cable, not just plugging into the mainboard, and that can be a source of problems ('oops, the PCIe cable was routed somewhere that exposed it to more EMI, vibration, and physical damage, and then it started to show more errors').
These sorts of problems mainly show up if you're running your own fleet, designing your own servers (poorly/aggressively) and have a budget of $10B.