Distributed and highly available systems are pretty interesting, but I've noticed it's surprising how much you can get done with a centralized system (say, a single Redis instance) and how much easier it is. Some cloud VMs can run for years with only transient issues (GCE live migration is neat).
High availability is really cool as well, but I recently learned that World of Warcraft has apparently had <97% uptime for years due to weekly maintenance. That leads me to reconsider just how necessary those additional nines are. It seems like a lot of services would be better off optimizing for, say, operational efficiency rather than uptime. Doing server upgrades and rolling out patches when you have a multi-hour maintenance window sounds way easier than live updates to your servers.
Of course it depends entirely on the service you are running, but I'm still impressed that a service as large as World of Warcraft can work fine with that level of uptime.
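For a sense of scale, here's a quick back-of-the-envelope sketch of what those nines mean per week. It assumes uptime is measured over a full 168-hour week and that all downtime counts equally (planned or not); the function name is just something I made up for illustration. At 97% that works out to roughly 5 hours a week, which lines up with a multi-hour weekly maintenance window.

```python
# Rough downtime budget per week for a given uptime fraction.
# Assumption: uptime is measured over a full 7-day week.
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_downtime_hours(uptime_fraction):
    """Hours of allowed downtime per week at the given uptime (0.0 to 1.0)."""
    return (1.0 - uptime_fraction) * HOURS_PER_WEEK


if __name__ == "__main__":
    for uptime in (0.97, 0.99, 0.999, 0.9999):
        hours = weekly_downtime_hours(uptime)
        print(f"{uptime:.2%} uptime: {hours:.2f} h ({hours * 60:.1f} min) down per week")
```

Each extra nine cuts the budget by 10x, so going from 97% to 99.99% means shrinking a multi-hour window down to about a minute a week.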
I think nulltype's point is that yes, 97% is considered "low reliability" by software engineers, but Blizzard and the WoW players don't seem to mind this low reliability.
I've been reading this, and I think it's a great intro to the subject that isn't inscrutable and doesn't go into excessive detail. I also like: http://christophermeiklejohn.com/distributed/systems/2013/07... as a source for more in-depth readings on specific subjects. Many of these are seminal papers.