
This is a great article for defining terms. For some reason though, this quote made me laugh out loud:

"Excessive availability can become a problem because now it’s the expectation. Don’t make your system overly reliable if you don’t intend to commit to it to being that reliable."




The Google SREs mentioned this in their book; the Chubby locking service had uptime that was so high that folks started to neglect making their own services resilient to Chubby failures: https://landing.google.com/sre/book/chapters/service-level-o...


+1 for this book. As a junior DevOps engineer, I've found it super helpful.


The book is structured in a way that makes it easy to jump around and pick and choose which parts to read or skip, so it's not a large commitment to read it.


Mine just came in the mail today. Pretty stoked.


Still, that's bad design on the clients' part. E.g., just because malloc "never" fails doesn't mean it can't fail :) so better to error-check it.
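For illustration, a minimal sketch of the defensive pattern (the buffer size is arbitrary, not from the thread):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        size_t n = (size_t)1 << 20;      /* 1 MiB, arbitrary */
        char *buf = malloc(n);
        if (buf == NULL) {
            /* handle the failure instead of assuming success */
            fprintf(stderr, "malloc(%zu) failed\n", n);
            return EXIT_FAILURE;
        }
        memset(buf, 0, n);
        free(buf);
        return EXIT_SUCCESS;
    }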


Doesn't matter. Engineering around human failure is part of the profession.


That's a beautiful way to put it. I'd read that book.


Well, I'm a Google SRE so...


Failure of malloc() might be a bad example to pick because on linux, by default, most distros overcommit, so malloc won't fail, generally. Instead, malloc will succeed allocating the address space just fine, but the RAM will get allocated upon first use, meaning that even though malloc gave you a supposedly valid pointer rather than NULL, actually using that pointer will crash your program.
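A small demonstration of that behavior, assuming a Linux box with the default heuristic policy (vm.overcommit_memory=0); the size is made up and needs to exceed free RAM plus swap for the effect to show:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SIZE ((size_t)8 << 30)   /* 8 GiB; pick something larger than free RAM + swap */

    int main(void) {
        char *p = malloc(SIZE);
        if (p == NULL) {                  /* overcommit off, or the heuristic refused */
            fprintf(stderr, "malloc failed up front\n");
            return 1;
        }
        puts("malloc succeeded; now faulting the pages in...");
        memset(p, 1, SIZE);               /* may never return: the OOM killer can strike here */
        free(p);
        puts("survived");
        return 0;
    }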


Other distros may configure this differently and return NULL. Relying on overcommit isn't portable, and it's bad practice not to check for failure either way.


Is there a way to fix this/switch it off? I never got the rationale for this behaviour.


There's a sysctl: vm.overcommit_memory=2

What most people don't realize is that you will see more out-of-memory failures if you disable overcommit, because the kernel then has to reserve backing for every allocation up front, even for pages that are never touched.
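For reference, the standard knobs (in mode 2 the commit limit becomes swap plus vm.overcommit_ratio percent of RAM, 50 by default):

    # inspect the current policy: 0 = heuristic (default), 1 = always, 2 = never
    sysctl vm.overcommit_memory
    # switch overcommit off for the running system
    sudo sysctl -w vm.overcommit_memory=2
    # persist the setting across reboots
    echo 'vm.overcommit_memory = 2' | sudo tee /etc/sysctl.d/99-overcommit.conf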


This is actually a serious point, not a joke.

New services may be launched with provisional technology to establish or evaluate a market or pricing model. The underlying technology in the initial implementation may have different performance or availability characteristics from what's actually envisioned for the full-scale product, and care has to be taken to compensate for this, i.e. introducing synthetic delay/jitter/faults to avoid setting the wrong expectations for the product.
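A hedged sketch of that idea in C (all names, rates, and bounds here are hypothetical): wrap the real handler and inject delay and synthetic errors so clients never observe, and come to depend on, better-than-promised behavior:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    /* Hypothetical stand-in for the real service logic. */
    static int handle_request(void) {
        return 0;
    }

    /* Wrapper that injects jitter and synthetic faults; the error rate
       and delay bound are made-up illustrations, not a real policy. */
    static int handle_with_injected_faults(void) {
        if (rand() % 1000 < 5)                    /* ~0.5% synthetic errors */
            return -1;
        usleep((useconds_t)(rand() % 50000));     /* 0-50 ms synthetic delay */
        return handle_request();
    }

    int main(void) {
        srand((unsigned)time(NULL));
        int failures = 0;
        for (int i = 0; i < 1000; i++)
            if (handle_with_injected_faults() != 0)
                failures++;
        printf("synthetic failures: %d/1000\n", failures);
        return 0;
    }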


It's funny but true. All observable properties of a system will eventually become hard dependencies for someone.


Isn't that one of the reasons for Netflix's Chaos Monkey? To make sure no one thinks "my dependency will always be there"?


That's more part of ChAP: https://medium.com/netflix-techblog/chap-chaos-automation-pl... and FIT: https://medium.com/netflix-techblog/fit-failure-injection-te... -- mechanisms to artificially inject errors to understand how upstream dependencies affect your availability.


I guess what's meant here is that one should never make the mistake of assuming that a highly reliable system can be built. As you start to approach a near-100%-reliable system, you start experiencing failures caused by minute disturbances/flaws in underlying dependencies (hardware, physical location) which can't be controlled. This is what they realized while trying to push the limits to build highly reliable systems.


> I guess what's meant here is that one should never make the mistake of assuming that a highly reliable system can be built.

Wrong guess, imho. It means building highly reliable systems requires knowledge and experience. Trying to build them and solving the problems step by step is one way to understand how it can be achieved.


This is a semi-variant of Hyrum's law.



