Silent Data Corruptions at Scale

paulsutter · on June 12, 2021

Perhaps run a complex “idle process” that exercises the major functional areas of the CPU and can detect such failures, so that such cores/CPUs can be isolated or decommissioned. Really should be part of Linux.

Anyone know of a suitable program for such a process?

willvarfar · on June 12, 2021

There are programs, used for testing and “fuzzing” compilers, that generate random programs based on a seed.

When a node runs a random program, it doesn’t know if the output is correct. But it could report the seed and result to a central database.

Then, if you had several nodes, and two nodes ran the same seed and got different output, that would mean something was wrong and needed investigation.
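A minimal sketch of that idea in Python, under loud assumptions: `run_seed` is a hypothetical stand-in for "generate and run a random program from a seed" (here it is just seeded integer arithmetic, not a real program generator), and `find_discrepancies` plays the role of the central database query. Neither name comes from an existing tool.

```python
import hashlib
import random
from collections import defaultdict

MASK = (1 << 64) - 1  # keep the accumulator within 64 bits


def run_seed(seed: int, steps: int = 1000) -> str:
    """Hypothetical workload: deterministic arithmetic driven by a seed.

    A healthy machine should return the same digest for the same seed
    (assuming the same Python version on every node).
    """
    rng = random.Random(seed)
    acc = 0
    for _ in range(steps):
        a = rng.getrandbits(64)
        b = rng.getrandbits(64) | 1  # force odd so b is never zero
        op = rng.randrange(3)
        if op == 0:
            acc = (acc + a * b) & MASK
        elif op == 1:
            acc ^= a // b
        else:
            acc = ((acc << 7) | (acc >> 57)) & MASK  # rotate left by 7
            acc ^= a
    return hashlib.sha256(acc.to_bytes(8, "little")).hexdigest()


def find_discrepancies(reports):
    """Group (node, seed, result) reports and return seeds with >1 result.

    Returns {seed: {result: [nodes that reported it]}}.
    """
    by_seed = defaultdict(lambda: defaultdict(list))
    for node, seed, result in reports:
        by_seed[seed][result].append(node)
    return {
        seed: dict(results)
        for seed, results in by_seed.items()
        if len(results) > 1
    }
```

With only two nodes a mismatch tells you something is wrong but not which node; with three or more reports per seed, the minority result points at the suspect.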

There are also programs for reducing such programs down to a minimum test case. So once a discrepancy is found, it can be reduced to some small program that recreates it.

I once worked on a compiler backend where a CI job generated random C programs and compared the x86 output against the novel CPU simulator. Any discrepancies found were auto-reduced by these tools and then a ticket was automatically created. Lots of our bugs were found and fixed this way.

(My memory is we used C-reduce for the reductions. I can’t remember the tool we used for generating the test programs, but there are several.)

ot · on June 12, 2021

I would assume that most large fleets have background periodic tasks that perform basic self-checks (the fleets I know about certainly do).

Using "idle" cycles is not a great idea though:

- They may seem "free", but in fact you would end up using more power: CPU turns itself off during idle time, and you'd replace that with an intensive process. Power (and, as a consequence, cooling) is one of the main costs of a data center.

- Machines that are properly utilized (close to 100% CPU utilization) would get less coverage, and those are the ones that need it the most.

So it is better to allocate a certain percentage of your CPU budget to self checks, based on risk and sensitivity of the tests. And have some easy way to put a machine under stress testing if it is suspected of having rare memory or CPU errors.
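As a rough illustration of budgeting rather than using idle time, here is a Python sketch. It assumes a trivial known-answer check (the standard SHA-256 test vector for `b"abc"`) stands in for a real self-test, and that the budget is expressed as a fraction of one core's wall-clock time. The function names and the default 1% budget are hypothetical, not taken from any fleet tooling.

```python
import hashlib
import time

# Standard SHA-256 test vector for the input b"abc".
KNOWN_INPUT = b"abc"
KNOWN_SHA256 = (
    "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
)


def self_check(rounds: int = 1000) -> bool:
    """Recompute a known answer repeatedly; any mismatch is a failure."""
    for _ in range(rounds):
        if hashlib.sha256(KNOWN_INPUT).hexdigest() != KNOWN_SHA256:
            return False
    return True


def sleep_for_budget(busy_seconds: float, fraction: float) -> float:
    """Seconds to sleep so the check uses about `fraction` of wall time."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return busy_seconds * (1.0 / fraction - 1.0)


def run_budgeted(fraction: float = 0.01, iterations: int = 3,
                 sleep=time.sleep) -> int:
    """Run self-checks at a fixed CPU budget; return the failure count."""
    failures = 0
    for _ in range(iterations):
        start = time.perf_counter()
        ok = self_check()
        busy = time.perf_counter() - start
        if not ok:
            failures += 1
        sleep(sleep_for_budget(busy, fraction))
    return failures
```

Because the sleep is derived from measured busy time instead of from idleness, a machine at 100% utilization gets the same coverage as an idle one. Raising `fraction` for a suspect machine is one way to get the "put it under stress testing" mode.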

kevingadd · on June 12, 2021

There are a few game engines that do this while running to detect bad hardware. The resulting 'bad hardware flag' is tracked and forwarded in crash reports to help sort them out from the 'real' crash reports (caused by bugs), and the information is also shown to the user when they hit an issue ('you seem to have bad RAM', etc). It'd be cool to see this turned into a reusable library that could be included by various types of software that is more likely to be impacted by bad hardware and can afford to burn a little cpu/gpu while it's running - you wouldn't want it in background services, but a game or high performance server app might be able to justify it.
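A toy version of such a flag, sketched in Python. Everything here is an assumption for illustration: the function names, the report layout, and the choice of fill patterns. A user-space check like this only touches whatever memory the allocator hands out and may be served from cache, so it is far weaker than a dedicated memory tester.

```python
def memory_pattern_ok(size: int = 1 << 20,
                      patterns=(0x00, 0xFF, 0x55, 0xAA)) -> bool:
    """Fill a buffer with each byte pattern and verify it reads back."""
    buf = bytearray(size)
    for p in patterns:
        buf[:] = bytes([p]) * size
        # Every byte should still equal the pattern just written.
        if buf.count(p) != size:
            return False
    return True


def build_crash_report(error: str, hardware_ok: bool) -> dict:
    """Attach a (hypothetical) suspect-hardware flag to a crash report."""
    return {"error": error, "suspect_hardware": not hardware_ok}
```

A caller could run `memory_pattern_ok()` periodically, remember the worst result, and pass it to `build_crash_report` so that reports from flagged machines can be triaged separately.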

makomk · on June 12, 2021

The general reckoning seems to be that it wouldn't be possible to write such a program without internal knowledge available only to the CPU manufacturer, and maybe not even then - modern CPUs are too complicated and have too much stuff going on that's not under direct control of the software running on them.

tinus_hn · on June 13, 2021

Windows does this sometimes; it causes mysterious excessive CPU usage attributed to ‘memory compression’.

I presume in theory this would be using idle time only but in practice it’ll slow down your system a lot.

anotherhue · on June 12, 2021

zfs scrub might be compelling.

imperialdrive · on June 12, 2021

This article is a little scary to me, and your idea/solution sounds pretty clever!

TazeTSchnitzel · on June 12, 2021

> introduce more […] asserts statements

I agree that's a good idea, but don't people usually disable asserts in production?

jboggan · on June 12, 2021

As someone building numerous Spark workloads this is quite concerning.

rob_c · on June 12, 2021

I raise you the WLCG and OSG computing grids with distributed global datasets. Checksums of transfers and verified reproducibility work wonders.
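For anyone who hasn't seen transfer checksumming spelled out, a minimal Python sketch follows. It assumes SHA-256 and a chunked stream; this is not a description of what those grids actually run, and the function names are made up for the example.

```python
import hashlib


def stream_digest(chunks) -> str:
    """Hash an iterable of byte chunks without holding it all in memory."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def verify_transfer(source_digest: str, received_chunks) -> bool:
    """Compare the sender's digest with one recomputed by the receiver."""
    return stream_digest(received_chunks) == source_digest
```

The sender computes the digest once and ships it alongside the data; the receiver recomputes it after the transfer and rejects or retries on mismatch. For files, the chunks could come from repeated `f.read(1 << 20)` calls on a file opened in binary mode.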

hermitcrab · on June 12, 2021

I thought this was going to be about the UK government.