> their service might exhibit problems with long-lived instances that they never notice because Chaos Monkey tends to terminate instances before they hit those problems
Tangential to the main point of the article, but I've seen this exact failure before! I worked at a retailer that rebooted all point-of-sale devices daily, after stores closed. The day before the biggest sales day of the year, the team decided to leave the devices online overnight: sometimes a device would fail to come back online after a reboot, so skipping the reboot was supposed to avoid that sort of issue.
Of course, one bug or another reared its head after 24 hours of uptime. We ended up rebooting all of the devices anyway.