Principles of Chaos Engineering (principlesofchaos.org)
224 points by F_J_H on Jan 28, 2018 | 46 comments



There is no mention of Netflix on the site, but the term Chaos Engineering, and the popularization of the technique, seem to come from Netflix. The Chaos Monkey README even links to this site.

https://github.com/Netflix/chaosmonkey

https://medium.com/netflix-techblog/chaos-engineering-upgrad...

http://www.oreilly.com/webops-perf/free/chaos-engineering.cs...

https://en.wikipedia.org/wiki/Chaos_Monkey


I think you are indeed correct in stating that. The Chaos Monkey: cool naming :)


A related Ask HN (from 2016) that I submitted: [Ask HN: Who uses Chaos Monkey in production?](https://news.ycombinator.com/item?id=12749597). A couple of interesting responses there.


This sounds like the kind of thing that would be very powerful if built in from the start, but an absolute nightmare to try and retrofit to an existing system.

It reminds me of the Amazon "everything must be a network service" architecture. Painful to institute but with great payoffs later in terms of robustness and scalability.


We started our current project running Chaos Monkey from almost day one, and a few weeks ago, someone fucked up a tag enforcement script on our dev AWS account and started shutting down every single instance, while all of our ASGs kept restarting them. So basically every single service from Consul to Kafka to our microservices was down to a few or sometimes even no instances.

Because we had already been dealing with Chaos Monkey, everything recovered by itself within a few minutes of the script stopping, and we barely had any alerts from services going down.

Also, when AWS was pushing out the Meltdown patches, we just ignored all the notifications because we didn't care about instances being shut down; they go down randomly all the time.

It’s a pain in the ass at first to deal with, but it definitely forces you to fix a lot of potential problems unless you want to spend all your time manually fixing stuff every day.

Even aside from Chaos Monkey, we build new base AMIs and push out rolling updates daily. That'll also quickly suss out problems with not having version numbers locked, etc.
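
For reference, a minimal sketch of the kind of random instance termination described above, assuming boto3 is configured and instances opt in via a hypothetical "chaos-opt-in" tag; the real Chaos Monkey is more involved than this:

  # Minimal chaos-monkey-style termination sketch. Assumes boto3 credentials
  # are configured and instances opt in via a hypothetical "chaos-opt-in"
  # tag; the ASG is expected to replace whatever gets killed.
  import random
  import boto3

  ec2 = boto3.client("ec2", region_name="us-east-1")

  def eligible_instance_ids():
      # Only running instances that explicitly opted in are fair game.
      resp = ec2.describe_instances(Filters=[
          {"Name": "tag:chaos-opt-in", "Values": ["true"]},
          {"Name": "instance-state-name", "Values": ["running"]},
      ])
      return [i["InstanceId"]
              for r in resp["Reservations"]
              for i in r["Instances"]]

  def kill_one():
      ids = eligible_instance_ids()
      if not ids:
          return None
      victim = random.choice(ids)
      ec2.terminate_instances(InstanceIds=[victim])
      return victim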


I know what you mean here, but I have to disagree. I see constant over-engineering in teams, and one of the main reasons for this is copying patterns like this way too early in the life of the business.

I see this as a version of Monte Carlo testing - I'm not totally sold on doing it in production, but certainly if you have a load that's difficult to replicate at scale, then this is the type of thing you should be doing.

But I don't think anyone should be doing this from the start. This is something that you evolve into. Sure, it's easier to bring it in if you planned it from the start - but as an engineer, you cannot proceed on that basis. You know and understand stuff has to change as you grow, and the art of it is making (more often than not) good calls on architecture to allow that.


What you can do from the start that will help you down the line is to use a common RPC framework/library across the board, especially if it supports failure injection. It's a shame when you can't easily do chaos testing on a system because it chose to use a different RPC system.
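
As a rough illustration of the kind of hook the parent has in mind, here is a hypothetical client-side wrapper; the names and rates are invented for the example and are not any real RPC framework's API:

  # Hypothetical failure-injection wrapper for an RPC call; names and
  # default rates are illustrative, not a real framework's API.
  import random
  import time

  class InjectedFailure(Exception):
      """Raised when the chaos wrapper decides this call should fail."""

  def with_fault_injection(call, error_rate=0.01,
                           latency_rate=0.05, extra_latency_s=0.2):
      def wrapped(*args, **kwargs):
          if random.random() < latency_rate:
              time.sleep(extra_latency_s)   # simulate a slow dependency
          if random.random() < error_rate:
              raise InjectedFailure("fault injected by chaos config")
          return call(*args, **kwargs)
      return wrapped

  # Usage (hypothetical client): fetch_user = with_fault_injection(client.fetch_user)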


Very considerate that they mentioned negative impact for users in the last point. I wonder though how well that is measured and controlled in real-world applications of this design style.


I gave a talk on minimizing blast radius at Velocity last year: https://m.youtube.com/watch?feature=youtu.be&v=C11LNUEaHuo


It depends on your customer behavior. If the customer can retry, then a small failure is trivially tolerated.


I too wonder if any of these experiments actually affected real customers.


They have certainly impacted real customers. The idea is that ideally these outages can be easily stopped, and your engineers know exactly what the chaos monkey did rather than just seeing an uptick in errors.

Basically you are creating a fault at a time of your choosing to prevent the same fault from occurring at the worst possible time.


This is what an SLA is for, no?

If you have otherwise upheld your targets, you can use the remaining slack at the end of the period to run larger experiments such as these, for a long term benefit.
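
As a rough worked example of that "remaining slack" (error budget) idea, with illustrative numbers:

  # Error-budget arithmetic for a 99.9% monthly availability target.
  # The numbers are illustrative, not anyone's actual SLA.
  minutes_in_month = 30 * 24 * 60            # 43,200 minutes
  slo = 0.999
  budget = minutes_in_month * (1 - slo)      # 43.2 minutes of allowed downtime
  downtime_spent = 10                        # minutes of real incidents so far
  remaining = budget - downtime_spent        # ~33 minutes left for experiments
  print(f"{remaining:.1f} minutes of error budget remain this period")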


Not affecting real customers is kind of the point of controlled experiments.


You can very much impact real customers, but the idea is that you should be able to only apply the experiment to a small subset of customers and abort it in case something goes wrong.

For instance, you may want to check that the authorization service can function without access to the A/B testing service, so you cut the network connection between them. If authorization errors start rising, you stop the experiment and investigate why this happened. You may also find that clients did not retry properly on unexpected errors from the authorization service (e.g. HTTP 500 error codes).
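
A sketch of what that guardrail might look like in code; apply_partition, remove_partition and error_rate are hypothetical hooks into your own tooling and metrics, named here only for illustration:

  # Guarded chaos experiment: apply a fault to a small cohort, watch an
  # error-rate metric, and abort if it crosses a guardrail. The three
  # callables are hypothetical hooks into your own tooling/metrics.
  import time

  def run_experiment(apply_partition, remove_partition, error_rate,
                     baseline, threshold=1.5, duration_s=600, poll_s=15):
      apply_partition()                      # e.g. block authz -> A/B service
      try:
          deadline = time.time() + duration_s
          while time.time() < deadline:
              if error_rate() > baseline * threshold:
                  return "aborted: error rate exceeded guardrail"
              time.sleep(poll_s)
          return "completed without tripping the guardrail"
      finally:
          remove_partition()                 # always undo the injected fault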


> improper fallback settings when a service is unavailable; retry storms from improperly tuned timeouts; outages when a downstream dependency receives too much traffic; cascading failures when a single point of failure crashes;
I don't see how one couldn't replicate those in an environment other than production. They all involve bringing down complete services and sending unexpectedly high load to other services.


You can start finding and fixing issues in non-prod environments, but eventually you have to run this in prod if you want to find the good stuff. Nothing is going to replicate the particular configuration, load, mix of requests, and client behavior that you have in prod.


So did this come about based off Chaos Theory? Then the main goal of chaos engineering would be to "create order out of chaos"?


I’m not affiliated with Netflix, but the negative tone of some of the comments here is hilarious. Would the critics care to share how their traffic/uptime ratio compares to Netflix’s? On weekends, Netflix is responsible for about 50% of the packets on the Internet. On Earth. Netflix is down how often? But please, do share how you once replicated a MySQL database from prod to a staging environment.


[flagged]


Please don't post shallow dismissals of other people's work. That's emblematic of what we don't want here.

https://news.ycombinator.com/newsguidelines.html


Cooking is not rocket science, but can you cook like a world-class chef? I don’t know about you, but I can’t. There’s science in cooking. Professional work is rocket science hard.


Netflix developed new software architectures to support its products. It's more rocket science than most of what people work on in the industry.


Well, if you add up the various encodings they offer, I think it's like 8 static videos...


Chaos Monkey and Kyle Kingsbury's work on the Jensen tests inspired us to make this (https://github.com/gundb/panic-server), which lets us simulate all sorts of failure cases. It should be extremely useful to others if they enjoy stuff like Chaos Monkey, etc.!


^Jensen^Jepsen

Kyle Kingsbury == Aphyr

https://jepsen.io/


Sorry, thanks, my phone autocorrected!


It says something about the engineering maturity of the industry that destructive testing in production environments is among the more sophisticated practices.


The chaos monkey is pointedly unsophisticated, but a system that can reliably survive its attack is likely to be very sophisticated. Airline pilots are effectively faced with a chaos monkey in their recurring simulator training and as a result have a very sophisticated skill set (indeed, one most of them will never actually need, but train for anyway, for obvious reasons).

Also, I take issue with it being “destructive”. Compute resources are effectively free on the margin (at least at the scale that concerns us here) and ephemeral. Killing such a process doesn’t meaningfully destroy anything of value.


Assuming perfect systems that never fail seems more immature, no?


That the industry is finally maturing into a field that capitalizes on the unique properties of its medium?


Seems analogous to load testing a physical structure. At first, that probably seemed unnecessarily risky, but in retrospect, you want to build robustly enough that even fairly excessive/destructive loads have no real impact.


Yeah, when hardware failures are expected, simulating hardware failures is no different from simulating trucks driving over a bridge.


Why do you think this is one of the "more sophisticated" testing practices?


I don't like the name. It sounds too much like it's the Engineering of Chaotic Systems, but there is no reference to Chaos Theory: https://en.wikipedia.org/wiki/Chaos_theory

Are they using the formal definition of chaos, or just a colloquialism?


I'm not sure I can agree.

> Are they using the formal definition of chaos, or just a colloquialism?

Is it a colloquialism to use a word in its normal English sense? Chaos as a word precedes Chaos Theory by some thousands of years. If I write a theory and take over an existing word in everyday use, it seems a bit much to accuse everyone else of colloquialism when they use an existing but less strict definition?


> If I write a theory and take over an existing word in everyday use, it seems a bit much to accuse everyone else of colloquialism when they use an existing but less strict definition?

It does, but engineering is a technical profession and its practitioners are likely familiar with the mathematical concept.

I've read about applications of chaos theory in system design, and I expected 'Principles of Chaos Engineering' to be about that topic.


The problem I have with the name is that it feels like their audience is people not familiar with the mathematical concept. In other words, not engineers; in other words, non-technical managers. And that marks it as the latest marketing fluff being pitched by management consultants.


People familiar with the mathematical concept will almost always be talking about something like "nonlinear dynamics" instead, anyway.


Funny you say that. I had to restrain myself from using that term. Technically, chaotic systems are only a subset of nonlinear systems. For example, solitons are nonlinear but not chaotic.


They are using a dictionary definition, I believe. It is “Engineering of Chaos”, which describes how computer systems, especially distributed ones, can end up in near-random failure states, resulting in what appears to be a chaotic system. Network partitioning, hard drives failing, and bits randomly flipping because of physics would all be examples of such failure states.

The engineering comes in when we try to simulate such situations and create processes for hardening applications against those types of failure.
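
As a toy illustration of one of those failure states, here's a sketch that flips a random bit in a payload and checks that a checksum catches the corruption; the payload and checksum scheme are made up for the example:

  # Toy bit-flip simulation: corrupt one random bit of a payload and verify
  # that a CRC32 checksum detects it (CRC32 catches all single-bit errors).
  import random
  import zlib

  def flip_random_bit(data: bytes) -> bytes:
      i = random.randrange(len(data))
      bit = 1 << random.randrange(8)
      return data[:i] + bytes([data[i] ^ bit]) + data[i + 1:]

  payload = b"some record we pretend was read back from disk"
  stored_checksum = zlib.crc32(payload)
  corrupted = flip_random_bit(payload)
  assert zlib.crc32(corrupted) != stored_checksum  # corruption is detected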


It has nothing to do with chaos theory.

They’re just trying to co-opt “chaos” as a buzzword for fault-tolerant systems.


Yes. The chaos part is where management fails to appreciate the time and resources to make things fault-tolerant. :)


I agree. The article is not clear and I think it uses the term “chaos” where the word “error” might be better suited.


In this context, "chaos" means roughly "injecting failures/latency".


Resilience Engineering, then, would be a less buzz-wordy alternative that expresses what they mean.


I would call that "noise" rather than chaos.



