Final Root Cause Analysis of Nov 18 Azure Service Interruption (microsoft.com)
198 points by asyncwords on Dec 17, 2014 | 70 comments



Nice writeup. I hope that the engineer in question didn't get fired or anything. One of the challenges in SRE/Ops type organizations is to be responsible, take ownership, put in the extra time to fix things you break, but keep the nerve to push out changes. Once an ops team loses its willingness to push large changes, the infrastructure calcifies and you have a much bigger problem on your hand.


I agree, and am reminded of the quote from Thomas Watson (IBM) on this:

“Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?”


Almost certainly apocryphal. There are multiple versions of the quote floating around. Also: Who was Watson talking to? How did this story get from him to us?


I heard the exact same story at a Wall St. trading firm I worked for - except the story was about a trade gone wrong (trading X shares instead of $X worth of shares, causing the Dow to slide and all sorts of havoc).


It just sounds like an anecdote someone would use during a talk.


There's a pretty credible one about AOL's board not firing Steve Case after spending $5 million on him. (For those who have forgotten, Steve didn't actually found AOL; he was brought in from the outside.)


My previous boss used to say that there are two types of engineers - those who have caused a major outage and those who never have - and he preferred working with the first group, as they were naturally more careful.


That's an incomplete analysis.

A more complete one would look at what each engineer has contributed. If the one with the foul-up has contributed and continues to contribute significant value, and doesn't repeat the same mistakes, keeping them is a good call.

If the careful worker both avoids errors (and costly mistakes) and exceeds other engineers in contributing value, there's a strong argument for keeping them.

There are people who simply foul things up. And there are those who avoid mistakes by simply never taking risks. You almost certainly want to discard the first. The second's value depends on the value your organization gains from innovation.


I would not call it an "analysis" - a moderately insightful joke perhaps, or a heuristic.


At the very beginning of my career, I once made a very stupid and very costly mistake because of tight schedules and lack of proper QA process.

The first thing my boss said the next morning before even explaining what happened was that he was the one responsible and I was not to blame. Oh and he gave me a raise.


Your case is among the lucky 1%. The other 99% of cases are NOT the same as yours. They get fired.


I hope it becomes a fireable offence. Any deviation from procedure has to be signed off at the very least.


The really big missing piece that I found in this post mortem: if it only took 30 minutes to revert the original change, why did it take over ten hours to restart the Azure Blob storage servers? This was neatly elided in the last sentence of this paragraph of their writeup:

".... We reverted the change globally within 30 minutes of the start of the issue which protected many Azure Blob storage Front-Ends from experiencing the issue. The Azure Blob storage Front-Ends which already entered the infinite loop were unable to accept any configuration changes due to the infinite loop. These required a restart after reverting the configuration change, extending the time to recover."

The ten+ hours extension was the vast majority of the outage time; why wasn't the reason for this given? More importantly, what will be done to prevent a similar extension in the time Azure spends belly up if at some point in the future, the Blob servers go insane and have to be restarted?


Only a guess, but from how it's worded it seems that the storage Front-Ends which had already entered the infinite loop may have taken the ten+ hours to restart.


Mark, the Azure CTO, gives a good breakdown of the time taken for each portion of the incident recovery in this video: http://channel9.msdn.com/posts/Inside-the-Azure-Storage-Outa...

That may help address these questions. Just FYI, I am an engineer in the Azure compute team.


"These Virtual Machines were recreated by repeating the VM provisioning step. Linux Virtual Machines were not affected."

So Azure supports Linux VMs?! Microsoft does so little Azure advertising that I had to learn this fact from their RCA. Apparently they have supported it since 2012: http://www.techrepublic.com/blog/linux-and-open-source/micro... but it is likely that many non-users of Azure do not know this.


In fact, something like 20% of Azure is running Linux VMs. It works great; you can use Chef, Puppet, Vagrant, and the open-source cross-platform CLI to manage them. There are a thousand Linux VM images to choose from here: https://vmdepot.msopentech.com/List/Index


There's also the "Microsoft Loves Linux" campaign going on[1]. It's part of their new attempt to embrace the FOSS world and generally be more "open".

[1] https://twitter.com/jniccolai/status/524281997632745472


It's on the homepage (http://azure.com) as "Launch Windows Server and Linux in minutes". Plus it has an entire section in their VMs page (http://azure.microsoft.com/en-us/services/virtual-machines/).

Although yes, their advertising of Azure and Azure features isn't very prominent. The ads that I have seen are for "Microsoft Cloud" (http://www.microsoft.com/enterprise/microsoftcloud/default.a...), which is a combination of products and technologies.


Yes, Linux is very well supported. After being accepted into Microsoft's BizSpark program this fall (excellent! Have a startup idea? Then apply!) I started using Azure just about exclusively and their Linux VM support is very nice.

I feel that Microsoft is on fairly even footing with AWS and Google Cloud.


Scott Hanselman gave a really interesting talk covering .NET/Azure at Hack Summit earlier this month.

You can view the recorded video on: https://hacksummit.org/


Yup, you can get up and running with a Linux VM in seconds. I've found Azure to be cheaper than AWS for micro VM instances. Running a twitter bot on AWS cost me ~ $10/month but after switching over to a CentOS VM on Azure I'm paying about $6/month. I also have dedicated Ubuntu and Windows Server VMs for personal projects and hosting.


And their Linux VMs are cheaper than Windows.

Another minor outage though, I think. They recently changed the permissions on /mnt, which contains the temp disk. Now various services that were using that for temp space fail to start. Easy enough to fix by chmod on start but still...


> The engineer fixing the Azure Table storage performance issue believed that because the change had already been flighted on a portion of the production infrastructure for several weeks, enabling this across the infrastructure was low risk.

Ugh, I wouldn't want to be that guy (even if there would be no direct repercussions). That said, and as others have highlighted - kudos on the writeup and openness.


The sentence you quoted seems ambiguous to me, since (from what I understood reading the article) there are two separate storage mechanisms using the new feature. The two possible beliefs the engineer may have had are:

1. Since we tested the change on a subset of A for a few weeks, we can assume it will work for all of A.

2. Since we tested the change on a subset of A for a few weeks, we can assume it will work for all of A and all of B.

#1 seems reasonable, but #2 is what needed to hold true in order for there to be no problems, since the change was actually enabled for all of A and B.

But was the engineer actually advocating to enable the change in B, or was that an accident during the manual deployment?


I don't agree with your possible beliefs. What about:

3. Since we tested this change on a subset of A and a subset of B, we can assume it will work for all of A and all of B.


It's kind of shitty--it kinda seems like it had passed initial testing, and then got rushed out to full-scale production because it also fixed other customer issues (as mentioned in the writeup).

That's the thing about these kinds of bugs...they are, by definition, tricky enough to have passed testing unseen.


Forever known as the guy who knocked Azure offline.


Probably less than you think. At least among his peers.

These kinds of "almost took X offline" incidents happen All The Time; it's just that most of the time they get caught before they go too far. It's inevitable that a few will squeak through the nets.

Mistakes can and will happen anywhere we allow them to. If you want to prevent mistakes, write tools to help reduce the "attack surface" (the areas where mistakes can be made). E.g. don't want someone to be able to run "sudo reboot" accidentally? Alias reboot to something else. It won't stop hackers, but it might help fight fat fingers.
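
For instance, a tiny guard wrapper along these lines (just a sketch in Python, with made-up names) forces a confirmation step before anything destructive runs:

    #!/usr/bin/env python3
    # Hypothetical sketch: a guard you alias over a destructive command,
    # so a fat-fingered "reboot" needs an explicit confirmation first.
    import subprocess
    import sys

    def guarded_run(cmd, confirm_token="yes-really"):
        """Run cmd only if the operator types the confirmation token."""
        print("About to run: " + " ".join(cmd))
        answer = input("Type '%s' to continue: " % confirm_token).strip()
        if answer != confirm_token:
            print("Aborted.")
            return 1
        return subprocess.call(cmd)

    if __name__ == "__main__":
        # e.g. alias reboot='guard.py reboot' in your shell profile
        sys.exit(guarded_run(sys.argv[1:] or ["echo", "nothing to do"]))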


I've halted production in 14 manufacturing plants before. Ran a query against the wrong server and locked up the entire ERP system for 30 minutes.

Accidents happen.


Shit happens and at this scale it happens big. I wish everyone would provide details like this when the fan gets hit or your security fails. I'm glad I never have to deal with scale like this, it's pretty scary.


I've seen several companies where analysis like this would be for management only. I guess it's just human nature to want to sweep mistakes and accidents under the rug, but it does also speak volumes about the culture in such companies. Kudos to Microsoft and every other big player that communicates these things.


It reminds me of the NTSB's crash investigations. Instead of looking for a scapegoat or someone to blame, they look for the cause, and then look even deeper to find the root cause.

For example, they discover a pilot made a mistake. But they don't end it there; they then look at the airline's training materials, see if other pilots would repeat the same mistake, and so on until they reach a point where they have a "this won't happen again" resolution (rather than simply discovering what happened).

I feel like with Microsoft's breakdown they did the "this is what happened" post-mortem but then went to the next level and said "here's why this happened, and here is why it won't happen again."


Nitpick: Crash investigations are done by the NTSB, not the FAA.

The NTSB has no authority to enforce its recommendations. That's up to the FAA. The idea behind that is the NTSB is more likely to be impartial.


Valid correction. I've edited it in. But it did originally say FAA, not NTSB.


>I've seen several companies where analysis like this would be for management only

I've found it to be pretty standard in the hosting world. I assume because if you have unexplained outages, customers leave.


especially when it's an outage for 5-11 hours (depending on the customer) as this one was.


Eh, it's fun. After pushing to 15k servers, everything else seems unimportant!


Impressive non-jargonized report. I would have "quantified" the "small number" but kudos anyway to Microsoft for taking this path towards transparency.


This is a good effort. I do have some concerns about it.

A true root cause would go deeper and ask why is it that an engineer could solely decide to roll out to all slices?

The surface-level answer is that the Azure platform lacked tooling. Is that the cause or an effect? I think it is an effect. There are deeper root causes.

Let's ask -- why was it that the design allowed one engineer to effectively bring down Azure?

We often stop at these RCAs when it gets uncomfortable and it starts to point upwards.

I say this to the engineer who pressed the buttons: Bravo! You did something that exposed a massive hole in Azure which may have very well prevented a much bigger embarrassment.


> A true root cause would go deeper and ask why is it that an engineer could solely decide to roll out to all slices?

Because writing code which contains a large number of checks and balances is generally orders of magnitude more expensive than human trust/judgment on the Ops team. Reading the postmortem makes me think that this sort of failure could have happened to anyone, and no-one really did anything wrong. The mistake was the blob store config flag not getting flipped, which is just a natural human error. The engineer who did the roll out could have been any of us. Given what he/she knew, he/she thought they had a good soak test (and a couple of weeks is a pretty good soak test) and made a call, similar to calls he/she makes a number of times every day. This one didn't pan out.

I would hazard that most companies have a big red rollout button that is reserved for trusted engineers that will do a rollout without all the checks you're requesting.


No one is saying that it had to be code. It could be as simple as "talk to another peer or your manager before making the next step".

For critical infrastructure companies there is the usual rule of "four eyes" for roll outs.

So, while it may be the case that most companies have a trusted person with the keys to the rollout car, the more critical the mission gets, the more layers of human checks are put in.

Maybe that's what the RCA should have said -- we F-ed up designing and managing the rollout process. An engineer just fell victim to it.


Just a second level of approval can be very useful, without requiring orders of magnitude costs. In part because it usually requires that the change be explained in writing to the second approver, and that can often reveal issues.
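
Even a toy gate captures the idea - here's a minimal sketch (hypothetical names, not any real tool) where the rollout refuses to proceed without a written explanation and a second, distinct approver:

    # Minimal sketch of a "four eyes" gate. Assumes a hypothetical deploy tool
    # calls approve_rollout() before touching production; names are made up.
    class ApprovalError(Exception):
        pass

    def approve_rollout(author, approver, change_description):
        """Require a written change description and a second, distinct approver."""
        if not change_description or len(change_description.split()) < 10:
            raise ApprovalError("Explain the change in writing (at least a sentence).")
        if approver == author:
            raise ApprovalError("The second approver must be someone other than the author.")
        return True

    # e.g. approve_rollout("alice", "bob",
    #                      "Enable the new config flag on one slice only, after soak.")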


It's not clear he/she didn't notify a secondary person, who would have likely had the same knowledge he/she did. Given the same knowledge, the same push might well have happened.


I'm pretty impressed with the openness of this statement.


Ditto.


"Unfortunately, the configuration tooling did not have adequate enforcement of this policy of incrementally deploying the change across the infrastructure." They relied on tooling to do the review of the last step of the process? I would have thought there were a few layers of approval that goes along with that final push into mission critical infrastructure.


That's exactly what they mean - the workflow tool they used didn't enforce approvals from all the concerned parties.


Pros:
- They are sharing info
- They allowed some caustic comments to remain at the bottom of the page (so far)

Cons:
- This is almost 30 days after the incident
- Look at the regions, it was global!
- This was a whole chain of issues. I count it as 5 separate issues. This goes deep into how they operate and it does not paint a picture of operational maturity:

1: configuration change for the Blob Front-Ends exposed a bug in the Blob Front-Ends

2: Blob Front-Ends infinite loop delayed the fix (I count this as a separate issue though I expect some may not)

3: As part of a plan to improve performance of the Azure Storage Service, the decision was made to push the configuration change to the entire production service

4: Update was made across most regions in a short period of time due to operational error

5: Azure infrastructure issue that impacted our ability to provide timely updates via the Service Health Dashboard

That is quite a list. [Edit : formatting only]


> -This is almost 30 days after the incident

What would have been the optimal response time? They fixed the immediate problem as fast as they could and gave a preliminary RCA, then they did a longer-term RCA and fix. I feel this shows maturity: not rushing to immediate conclusions, and doing a 5-Whys drill-down to fix the underlying cause. Furthermore, they took steps to actually fix the problem; they point out that they moved the human out of the loop in one aspect, and that's always a good thing (unless the replacement software is faulty itself, of course).

Also, in response to the list, I believe [3&4] are actually the same thing, are they not? The operator was the one who made the 'decision' by accidentally ignoring the incremental config change policy that was in place and did it all at once. This was identified as a human error and they fixed it by enforcing incremental changes.


They actually posted a blog the day after it happened with their initial triage and updates. https://news.ycombinator.com/item?id=8633633

I agree with you though, and said so at the time. These issues seem systemic, not isolated. Cascading failures, some in code, some in ops procedures, indicates to me they still have work to do.


Why are they so quiet about SLA credit? Not a word for a month, and for a year I have been wasting good money on doubling up services to stay inside the SLA, plus deploying cross-region to ensure zero downtime. What a joke. Surely Azure are not hoping we will forget?


TL;DR

>In summary, Microsoft Azure had clear operating guidelines but there was a gap in the deployment tooling that relied on human decisions and protocol.


> After analysis was complete, we released an update to our deployment system tooling to enforce compliance to the above testing and flighting policies for standard updates, whether code or configuration.

Hopefully there is a way to disable this policy adherence for when you really need to push out a configuration or code change everywhere quickly.


I cannot believe how many times I have seen a PROD (or new env X) deployment go bad from configuration issues. At least they separate configuration deployments from code deployments, that's a good sign. Why not take it a step further and instead of doing config deployments, use a config server?
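
Roughly what I have in mind (a minimal sketch in Python, with a made-up endpoint): nodes pull their settings from a central store, and keep the last known-good copy if it's unreachable, instead of having config files pushed onto them.

    # Sketch of the config-server idea: each node polls a central HTTP endpoint
    # for its settings rather than receiving config deployments. URL is made up.
    import json
    import time
    import urllib.request

    CONFIG_URL = "http://config.internal.example/frontend/settings"  # hypothetical

    def fetch_config(current):
        try:
            with urllib.request.urlopen(CONFIG_URL, timeout=5) as resp:
                return json.load(resp)
        except Exception:
            return current  # keep the last known-good config if the server is down

    def main():
        config = {"new_cache_enabled": False}
        while True:
            config = fetch_config(config)
            # ... apply config to the running service here ...
            time.sleep(30)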


In the end, you would still have config deployments, but to the config server. And if you can push config to the nodes needing it, you have one less point of failure, right? I'm not too familiar with the concept of a config server.


If utility computing is to be taken seriously, then it has to institute the same kind of discipline that we see in the airline industry. Recent examples come to mind: a pilot letting a songstress hold the controls and wear the pilot's cap - fired. An airline executive overruling the pilot over macadamia nuts - a million dollar fine.

If we wish for a future where cloud computing will be considered reliable enough for air traffic control systems, then management of this infrastructure requires a level of dedication and commitment to process and training.

Failover zones need to be isolated not only physically, but also from command and control. A lone engineer should not have sufficient authority or capability to operationally control more than one zone. It is extremely unnerving for enterprises to see that a significant infrastructure like Azure has a root account which can take down the whole of Azure.


I hope the engineer in question did not get fired.

I also hope that no one who recommended Azure to their employer got fired either.


Only one question: Will the engineer be fired?


I want to know the reality. I don't want to see another PR show. Dear Information diggers, please let us know whether the guy was fired!


OSS is alive and well on the Azure Platform www.microsoft.com/openness


Does anyone else see the missing piece to this post mortem? An infinite loop made its way onto a majority(? all?) of production servers, and the immediate response is more or less 'we shouldn't have deployed to as many customers, failure should have only happened to a small subset'?

I agree that the improvements made to their deployment tooling were good and necessary; they take the human temptation to skip steps out of the equation.

But this exemplifies a major problem our industry suffers from, in that it is just taken as a given that critical errors will sometimes make their way into production servers and the best we can do is reduce the impact.

I find this absolutely unacceptable. How about we short-circuit the process and identify ways to stop that from happening? Were there enough code reviews? Did automated testing fail here? Yes, I'm familiar with the halting problem and the limitations of formal verification on Turing-complete languages, but I don't believe it's an excuse.

This is tantamount to saying "yeah sometimes our airplanes crash, so from now on we'll just make sure we have less passengers ride in the newer models".


> An infinite loop made its way onto a majority(? all?) of production servers, and the immediate response is more or less 'we shouldn't have deployed to as many customers, failure should have only happened to a small subset'?

All server software has one or more "infinite loops." It is a fundamental construct in any listener.

Plus when they say infinite loop, I assumed they meant it continuously entered a crash/restart cycle rather than a while(true) {} in a line of code.

I think the reality on the ground is that bug-free software is a myth. All you can do is have processes (like gradual deployment) to mitigate the damage it can do, rather than making it your goal to write the mythical perfect code.

> But this exemplifies a major problem our industry suffers from, in that it is just taken as a given that critical errors will sometimes make their way into production servers and the best we can do is reduce the impact. I find this absolutely unacceptable

It is a major problem. It costs billions every year. But what can be done? If there were a magic-wand solution, I'm sure people would be scrambling to deploy it, as it would save them money.

> How about we short circuit the process and identify ways to stop that from happening? Were there enough code reviews? Did automated testing fail here?

This seems like a somewhat naive view of software development in general. Like what I'd call a "mathematician's view," in the sense that they think large complex systems can be reduced to a simple quantifiable process.

Code reviews and, more importantly, unit tests can help find bugs. But the interconnections between large complex systems are harder to test, and harder to code review (because the bugs don't exist on any single line of code, or even in any single block).


> This is tantamount to saying "yeah sometimes our airplanes crash, so from now on we'll just make sure we have less passengers ride in the newer models".

Which would be a pretty reasonable thing to say if you had a large portion of the population on a single plane. What this all comes down to is: problems (small or big; stupidly simple or ridiculously complex) will happen. Isolating problems to the smallest number of people possible is the responsible course of action.

That, of course, isn't mutually exclusive with doing better from a software engineering front.


"This is tantamount to saying "yeah sometimes our airplanes crash, so from now on we'll just make sure we have less passengers ride in the newer models"."

Things break. You cannot design something to be perfect. That is why there is redundancy in every critical system. You are better off having your system gracefully recover from failure than trying to design the system perfectly. That might mean spreading the traffic across multiple instances within one data center, so when one breaks (for any reason), the others pick up the slack. The next level is spreading across multiple data centers, so if one place goes down, another picks up the slack. Arguably the next level would be going across multiple providers, but to me that seems like overkill.

Given that, if Microsoft rolls it out to 5% of servers and those crap out, that is, for individual customers who are properly spread over multiple instances, roughly equivalent to a spate of hard drive failures. This only breaks down when they roll the broken stuff out to 100% at once.


I see this type of comment across a number of fields - why isn't X safer?

Generally it comes down to the 'insurance' argument - why didn't we spend the time (read: money) to test for and prevent X?

The answer comes down to risk/benefit.

It's possible to insure your house against total loss, against any type of threat. You could build it on the shoreline of a known hurricane location, or on top of an active volcano, and insure it for full replacement. All you have to do is deposit an amount equal to the replacement cost in an account - if you suffer total loss, spend the money and replace the house.

Where people go wrong is by thinking that problems can and should be prevented at any cost. But the issue is that thinking that way leads to excessive costs for the thing in the first place. It would be possible to design a highway system where nobody ever died. However the cost would be so high that very few highways would be built, so the advantages of cheap and easy travel are lost.

Likewise, it would be possible for Microsoft to build automated testing and checking software that never made a mistake. However, that would make Azure uncompetitive or unprofitable. It's cheaper just to hire good people and accept that occasionally, something might go wrong.

Some software is actually made to never go wrong. That software is in satellites and Mars rovers and the like. Even then, mistakes happen due to the nature of complexity and probability. But the cost per line of delivered code for a satellite is orders of magnitude higher than the cost of Azure management code.

You really only need to look at problems when the cost of the fix is much lower than the cost of the potential loss. That's why planes are safer than cars - because the loss of a big plane and its passengers is a very costly event.


It's not a missing piece, it's in the release: "Azure Table storage Front-Ends, the configuration switch was incorrectly enabled for Azure Blob storage Front-Ends."

Not only was it deployed to everyone, but it was deployed, untested, to the wrong place. Otherwise the deploy would have been just as broad but successful.


> Were there enough code reviews? Did automated testing fail here?

I really do love tests and all, but they only get you so far. In fact, you're far more often bitten by things that are outside of your frame of reference, and therefore these are not the ones you take into account when designing a testing pipeline.


Ops engineer here. This is a particularly hard case because the problem involved an interaction of components across the network, and was scale-dependent. These kinds of problems are truly "emergent" in that they're enormously hard to test for. Absent an exact copy of production, with the same workload, I/O characteristics, network latencies, etc., there is always some class of scale/performance-related bugs you just won't catch until the code hits production.

One defense is a "canary" deployment process (they used the term "flighting") to ensure major changes are rolled out slowly enough to detect major performance shifts. Had their deployment process worked correctly, they may have been able to roll back the change without incident.
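
Something like this toy loop is what I mean (deploy_to_slice and error_rate are hypothetical stand-ins for real tooling): expand in stages, soak, and stop automatically if the error rate moves.

    # Toy sketch of a canary ("flighting") rollout: deploy to progressively larger
    # slices and roll back if the error rate degrades. Helpers are hypothetical.
    import time

    STAGES = [0.01, 0.05, 0.25, 1.0]  # fraction of the fleet per stage
    ERROR_THRESHOLD = 0.02            # acceptable error rate

    def canary_rollout(deploy_to_slice, error_rate, soak_seconds=600):
        for fraction in STAGES:
            deploy_to_slice(fraction)
            time.sleep(soak_seconds)      # let the change soak at this stage
            if error_rate() > ERROR_THRESHOLD:
                deploy_to_slice(0.0)      # roll everything back
                raise RuntimeError("Rollout halted at %.0f%%: error rate too high"
                                   % (fraction * 100))
        return "rolled out to 100% of the fleet"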

A second defense is proactively building "safeties" and "blowoff valves" into your software. Example: if a client notices a huge spike in errors, back off before retrying a connection request, otherwise you may put the system into a positive feedback loop. Ethernet collision detection/avoidance is a great example of a safety mechanism done well.
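
Concretely, the simplest version of that client-side "blowoff valve" is capped exponential backoff with jitter - a rough sketch, not any particular library:

    # Sketch of capped exponential backoff with full jitter, so a fleet of clients
    # doesn't hammer a struggling service in lockstep and feed the failure loop.
    import random
    import time

    def call_with_backoff(request_fn, max_attempts=6, base=0.5, cap=30.0):
        for attempt in range(max_attempts):
            try:
                return request_fn()
            except ConnectionError:
                if attempt == max_attempts - 1:
                    raise                                 # give up after the last attempt
                delay = min(cap, base * (2 ** attempt))
                time.sleep(random.uniform(0, delay))      # randomize to spread retries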

Finally, every high-scale domain has its own problems, which experienced engineers know to worry about. In my case, at an analytics provider, one of the hardest problems we face is data retention: how much to store, at what granularity, for how long, and how that interacts with our various plan tiers. OTOH we have significant latitude to be "eventually correct" or "eventually consistent" in a way a bank, stock exchange, or other transactional financial system (e.g. credit approval) can't be. I imagine in other domains like ad serving, video serving, game backend development, etc., there are similar "gotchas", but I don't know what they are.


New model airplanes always fly with a minimum crew for their first test flights. This is classic risk reduction: reduce the consequence of a problem when the probability of that problem occurring is greatest. That's a better analogy than designing them to seat fewer passengers.





