> And that's why, even though it sounds crazy, the best way to avoid failure is to fail constantly.
This is my biggest concern with things like large nation-states, large banks, large reinsurance companies, large RAIDs, and large nuclear plants: we centralize resources into a larger resource pool in order to reduce the chances of failure, but in doing so we make the eventual failure more severe, and we reduce our experience in coping with it and our ability to estimate its probability. In fact, we may not even be reducing the chances of failure; we may just be fooling ourselves.
Consider the problem of replicating files around a network of servers. Perhaps you have a billion files and 200 single-disk servers with an MTBF of 10 years, and it takes you three days to replace a failed server.
One approach you can use is to pair up the servers into 100 mirrored pairs and put 10 million files on each pair. Now, about 20 servers will fail every year, each time leaving ten million files un-backed-up for three days. The chance that the remaining server of that pair will fail during that time is 3/3650 = 0.08%. So a double failure will happen about once every 60 years (20 failures a year times 0.08% is about 0.016 a year), and since each one wipes out just one of the 100 pairs, the expected lifetime of the average file on your system is about 6000 years.
So it's likely that your system will hum along for decades without any problems, giving you an enormous sense of confidence in its reliability. But if you divide the files that will be lost once every 60 years (ten million) by the 60 years, you get about 170 thousand files lost per year. The system is fooling you into thinking it's reliable.
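The arithmetic for the mirrored-pair layout can be checked with a few lines of Python. This is only a back-of-envelope sketch: the constant and function names are made up for illustration, failures are assumed independent, and the chance of a second failure is approximated linearly as 3/3650, which is reasonable because three days is tiny compared to ten years.

```python
# Back-of-envelope numbers for the mirrored-pair layout.
# Names are illustrative; failures are assumed to be independent.

FILES = 1_000_000_000      # total files stored
SERVERS = 200              # single-disk servers
MTBF_DAYS = 10 * 365       # mean time between failures, per server
REPLACE_DAYS = 3           # time needed to replace a failed server


def mirrored_pairs():
    pairs = SERVERS // 2
    files_per_pair = FILES // pairs                    # 10 million
    failures_per_year = SERVERS * 365 / MTBF_DAYS      # 20
    # Chance that the surviving mirror also dies inside the 3-day window
    # (linear approximation of the true exponential probability).
    p_partner_dies = REPLACE_DAYS / MTBF_DAYS          # about 0.08%
    loss_events_per_year = failures_per_year * p_partner_dies
    files_lost_per_year = loss_events_per_year * files_per_pair
    return {
        "years_between_losses": 1 / loss_events_per_year,
        "files_per_loss": files_per_pair,
        "files_lost_per_year": files_lost_per_year,
        "file_lifetime_years": FILES / files_lost_per_year,
    }


if __name__ == "__main__":
    for name, value in mirrored_pairs().items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions the loss rate comes out a little under the rounded 170 thousand files a year quoted above, because the gap between losses is closer to 61 years than to 60.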
Suppose, instead, that you replicate each file onto two servers, but those servers are chosen at random (without replacement, so the two copies never land on the same server). When a server fails (remember, 20 times a year), there's about a one in six chance that one of the other 199 servers will fail in the three days before it's replaced. When that happens, every three or four months, only the files that had both copies on those two servers will be lost: about 10 million / 199, or about fifty thousand files, for a total data loss of about 170 thousand files a year. You will likely see this as a major problem, and you will undertake efforts to fix it, perhaps by storing each file on three or four servers instead of two.
This is despite the fact that this system loses data at the same average rate as the other one. In effect, instead of having 100 server pairs to store files on, you have 19,900 partition pairs, each partition consisting of 0.5% of a server. By making the independently failing unit much smaller, you've dramatically increased your visibility into its failure rate, and given yourself a lot of experience with coping with its failures.
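The same kind of sketch works for the random-placement layout, and it shows why the average loss rate matches that of mirrored pairs. As before, the names are invented for illustration, failures are assumed independent, files are assumed to be spread evenly over all 19,900 server pairs, and the one-in-six figure uses a linear approximation (199 × 3/3650).

```python
# Back-of-envelope numbers for random two-way replication.
# Names are illustrative; failures are assumed to be independent.
from math import comb

FILES = 1_000_000_000      # total files stored
SERVERS = 200              # single-disk servers
MTBF_DAYS = 10 * 365       # mean time between failures, per server
REPLACE_DAYS = 3           # time needed to replace a failed server


def random_placement():
    server_pairs = comb(SERVERS, 2)                    # 19,900 "partition pairs"
    files_per_loss = FILES / server_pairs              # about 50 thousand
    failures_per_year = SERVERS * 365 / MTBF_DAYS      # 20
    # Chance that any of the other 199 servers dies inside the 3-day window
    # (linear approximation, slightly overstating the true probability).
    p_second_failure = (SERVERS - 1) * REPLACE_DAYS / MTBF_DAYS
    loss_events_per_year = failures_per_year * p_second_failure
    return {
        "server_pairs": server_pairs,
        "p_second_failure": p_second_failure,
        "months_between_losses": 12 / loss_events_per_year,
        "files_per_loss": files_per_loss,
        "files_lost_per_year": loss_events_per_year * files_per_loss,
    }


if __name__ == "__main__":
    for name, value in random_placement().items():
        print(f"{name}: {value:,.3f}")
```

The factor of 199 cancels: a second failure is 199 times as likely as in the mirrored layout, but each one destroys 1/199 as many files, so the expected files lost per year is the same and only the size and frequency of the individual losses change.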
In this case, more or less by hypothesis, the failure rate is independent of the scale of the thing. That isn't generally the case. If we had a lot of half-megawatt nuclear reactors scattered around the landscape instead of a handful of ten-gigawatt reactors, it's likely that each reactor would receive a lot less human attention to keep it in good repair. When it threatened to melt down, there wouldn't be a team of 200 experienced guys onsite to fight the problem. There would be a lot more shipments of fuel, and therefore a lot more opportunities for shipments of fuel rods to crash or be hijacked. And so on.
But we might still be better off that way, because instead of having to extrapolate nuclear-reactor safety from a total of three meltdowns of production reactors (Three Mile Island, Chernobyl, and Fukushima), we'd have dozens, if not hundreds, of smaller accidents. And so we'd know which design elements were most likely to fail in practice, and how to do evacuation and decontamination most effectively. Instead of Chernobyl having produced a huge cloud of radioactive smoke that killed thousands or tens of thousands of people, perhaps it would have killed 27, like the reactor failure in K-19.
With respect to nation-states, the issue is that strong nation-states are very effective at reducing the peacetime homicide rate, which gives them the appearance of substantially improving safety. Many citizens of strong nation-states in Europe have never lived through a war in their country, leading them to think of deaths by violence as a highly unusual phenomenon. But strong nation-states also create much bigger and more destructive wars. It is not clear that the citizens of, say, Germany are at less risk of death by violence than the citizens of much weaker states such as Micronesia or Brazil, where murder rates are higher.