
This is a great article for defining terms. For some reason though, this quote made me laugh out loud:

"Excessive availability can become a problem because now it’s the expectation. Don’t make your system overly reliable if you don’t intend to commit to it to being that reliable."




The Google SREs mentioned this in their book; the Chubby locking service had uptime that was so high that folks started to neglect making their own services resilient to Chubby failures: https://landing.google.com/sre/book/chapters/service-level-o...


+1 for this book. As a junior DevOps engineer, I've found it super helpful.


The book is structured in a way that makes it easy to jump around and pick and choose which parts to read or skip, so it's not a large commitment to read it.


Mine just came in the mail today. Pretty stoked.


Still, that's bad design on the clients' part. E.g., just because malloc "never" fails doesn't mean it can't fail :) so better to error-check it.
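For illustration, a minimal sketch of the defensive pattern (the buffer size is arbitrary, not from the thread):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        size_t n = (size_t)1 << 20;      /* 1 MiB, arbitrary */
        char *buf = malloc(n);
        if (buf == NULL) {
            /* handle the failure instead of assuming success */
            fprintf(stderr, "malloc(%zu) failed\n", n);
            return EXIT_FAILURE;
        }
        memset(buf, 0, n);
        free(buf);
        return EXIT_SUCCESS;
    }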


Doesn't matter. Engineering around human failure is part of the profession.


That's a beautiful way to put it. I'd read that book.


Well, I'm a Google SRE so...


Failure of malloc() might be a bad example to pick because on linux, by default, most distros overcommit, so malloc won't fail, generally. Instead, malloc will succeed allocating the address space just fine, but the RAM will get allocated upon first use, meaning that even though malloc gave you a supposedly valid pointer rather than NULL, actually using that pointer will crash your program.
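A small demonstration of that behavior, assuming a Linux box with the default heuristic policy (vm.overcommit_memory=0); the size is made up and needs to exceed free RAM plus swap for the effect to show:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SIZE ((size_t)8 << 30)   /* 8 GiB; pick something larger than free RAM + swap */

    int main(void) {
        char *p = malloc(SIZE);
        if (p == NULL) {                  /* overcommit off, or the heuristic refused */
            fprintf(stderr, "malloc failed up front\n");
            return 1;
        }
        puts("malloc succeeded; now faulting the pages in...");
        memset(p, 1, SIZE);               /* may never return: the OOM killer can strike here */
        free(p);
        puts("survived");
        return 0;
    }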


Other distros may configure this differently and return NULL. Relying on overcommit isn't portable, and it's bad practice not to check for failure either way.


Is there a way to fix this/switch it off? I never got the rationale for this behaviour.


There's a sysctl: vm.overcommit_memory=2

What most people don't realize is that you will see more out-of-memory failures if you disable overcommit, because the kernel then has to reserve backing for every allocation up front, even for pages that are never touched.
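For reference, the standard knobs (in mode 2 the commit limit becomes swap plus vm.overcommit_ratio percent of RAM, 50 by default):

    # inspect the current policy: 0 = heuristic (default), 1 = always, 2 = never
    sysctl vm.overcommit_memory
    # switch overcommit off for the running system
    sudo sysctl -w vm.overcommit_memory=2
    # persist the setting across reboots
    echo 'vm.overcommit_memory = 2' | sudo tee /etc/sysctl.d/99-overcommit.conf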


This is actually a serious point, not a joke.

New services may be launched with provisional technology to establish or evaluate a market or pricing model. The underlying technology in the initial implementation may have different performance or availability characteristics from what's actually envisioned for the full-scale product, and care has to be taken to compensate for this, i.e. introducing synthetic delay/jitter/faults to avoid setting the wrong expectations for the product.
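A hedged sketch of that idea in C (all names, rates, and bounds here are hypothetical): wrap the real handler and inject delay and synthetic errors so clients never observe, and come to depend on, better-than-promised behavior:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    /* Hypothetical stand-in for the real service logic. */
    static int handle_request(void) {
        return 0;
    }

    /* Wrapper that injects jitter and synthetic faults; the error rate
       and delay bound are made-up illustrations, not a real policy. */
    static int handle_with_injected_faults(void) {
        if (rand() % 1000 < 5)                    /* ~0.5% synthetic errors */
            return -1;
        usleep((useconds_t)(rand() % 50000));     /* 0-50 ms synthetic delay */
        return handle_request();
    }

    int main(void) {
        srand((unsigned)time(NULL));
        int failures = 0;
        for (int i = 0; i < 1000; i++)
            if (handle_with_injected_faults() != 0)
                failures++;
        printf("synthetic failures: %d/1000\n", failures);
        return 0;
    }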


It's funny but true. All observable properties of a system will eventually become hard dependencies for someone.


Isn't that one of the reasons for Netflix's Chaos Monkey? To make sure no one thinks "my dependency will always be there"?


That's more part of ChAP: https://medium.com/netflix-techblog/chap-chaos-automation-pl... and FIT: https://medium.com/netflix-techblog/fit-failure-injection-te... -- mechanisms to artificially inject errors to understand how upstream dependencies affect your availability.


I guess what's meant here is that one should never make the mistake of assuming that a highly reliable system can be built. As you start to approach a near-100%-reliable system, you start experiencing failures caused by minute disturbances/flaws in underlying dependencies (hardware, physical location) which can't be controlled. This is what they realized while trying to push the limits to build highly reliable systems.


> I guess what's meant here is that one should never make the mistake of assuming that a highly reliable system can be built.

Wrong guess, imho. It means building highly reliable systems requires knowledge and experience. Trying to build them and solving the problems step by step is one way to understand how it can be achieved.


This is a semi-variant of Hyrum's law.



