Electronics (within spec) don't usually mind heat, *thermal cycling* is what the...

latchkey · 2024-02-07T17:28:58 1707326938

The system I built would auto-tune the GPUs. The failure case for a single GPU is to crash the entire system, and there are 12 in a box. The machines would reboot hundreds of times until they were tuned to stability. Through 4 seasons (including snow). Again, the majority of these cards had zero failures across multiple years.
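
To illustrate the general shape of a tune-until-stable loop, here is a minimal Python simulation. It is only a sketch under stated assumptions, not the actual system: the clock values, the step size, the `tune` and `first_unstable` names, and the idea that the crashing GPU can be identified after a reboot are all hypothetical. A real rig would learn about a crash from the reboot itself and would persist its state to disk between boots.

```python
def first_unstable(clocks, limits):
    """Simulated crash check: index of the first GPU clocked above its limit, else None.

    On real hardware the limits are unknown; a crash is the only signal.
    """
    for i, (clock, limit) in enumerate(zip(clocks, limits)):
        if clock > limit:
            return i
    return None


def tune(limits, start=1200, step=25, max_reboots=1000):
    """Back off one GPU per simulated crash until the whole box is stable.

    limits: hypothetical per-GPU maximum stable clocks (MHz), used only by the simulation.
    Returns the final clocks and the number of reboots it took.
    """
    clocks = [start] * len(limits)
    reboots = 0
    while reboots < max_reboots:
        bad = first_unstable(clocks, limits)
        if bad is None:
            return clocks, reboots  # every GPU is at or below its limit
        clocks[bad] -= step         # one bad GPU takes down the box, so back it off
        reboots += 1                # ...and count the reboot that followed
    raise RuntimeError("box did not stabilize within max_reboots")


if __name__ == "__main__":
    # Three GPUs for brevity; the boxes described above held 12.
    print(tune([1200, 1150, 1100]))
```

Because every crash costs a full reboot and only one card gets adjusted per crash, the reboot count grows with both the number of cards and how far each one sits from its stable point, which fits with a 12-GPU box needing hundreds of reboots.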

The things that were more likely to fail were things like PSUs. One time we had a bad batch of those and had to replace nearly every one of them. We cracked a few open and they were clearly hand-soldered by someone in China and shorting out internally due to failed connections. We would see far more thermal-cycling failures from that than from a GPU card assembled by a pick-and-place machine with solder paste.

yjftsjthsd-h · 2024-02-07T19:24:29 1707333869

Can I ask what this was for? I'm struggling to think what you'd be doing in the middle of a field with a crate full of mostly-GPU compute. Something with machine vision?

latchkey · 2024-02-07T19:52:20 1707335540

Mining Ethereum when it was proof of work. The middle of the field was just one location out of 7. We did "real" data centers too.