
I ran a cluster of ~30k blade-based computers booting entirely off iPXE. They didn't have any onboard SSD/disk storage or ECC memory. Every day, a few of them would randomly lock up, they'd reboot with a fresh network image and keep on humming.


> Every day, a few of them would randomly lock up, they'd reboot with a fresh network image and keep on humming.

The same ones, or random new machines every time?


Totally random.


Could easily be software or some other marginal hardware bug though.


Indeed. Although, sometimes the machine wouldn't fully crash. It was like the disk was corrupted, but apps were still running, which makes me suspect it was the lack of ECC.


How do you even get that many computers without ECC? I think all the blades I've seen have ECC as baseline spec.


Google started with "standard PCs", though I can't find any info on whether they used ECC memory.

https://blog.codinghorror.com/building-a-computer-the-google...

https://patents.google.com/patent/US6549988


They were purpose built and didn't require 100% uptime.



