As I assumed, it was kind of a corner-case bug meets corner-case bug meets corner-case bug.
This is also why I am afraid of self-driving cars and other such life-critical software. There are going to be weird edge cases; what prevents you from reaching them?
Self-driving cars don't have to be perfect. They just have to be safer than driving is today [1].
The real question is whether society can handle the unfairness of death by random software error vs. death by negligent driving. It's easy to blame negligent driving on the driver; we're clearly not negligent, so it really doesn't affect us, right? But a software error might as well be an act of god: it's something that might actually happen to me!
Well, no.
There is an upper limit on the damage a bad driver can do by, say, crashing his car into a bus or something like that. Imagine a bug or piece of malware triggered at the same moment worldwide. It could kill millions. So it's not as simple as 'it just has to be better than a human'.
I've been itching to release this terror movie plot into the wild:
It's 2025 and more than 10% of the cars on the road in the US are self-driving. It's rush hour on a busy Friday afternoon in Washington, DC. Earlier that day, there'd been a handful of odd reports of self-driving Edsels (so as not to impugn an actual model) going haywire, and the NTSB has started its investigation.
But then, at 4:30pm, highway patrol units around the DC beltway notice three separate multi-Edsel phalanxes, drivers obviously trapped inside, each phalanx moving towards the Clara Barton Parkway, which enters DC from the west. Other units notice four more phalanxes, one comprising 20 Edsels, driving into DC from the east side, on Pennsylvania Avenue.
At this point, traffic helicopters see similar car clusters, more than two dozen, all over DC, all converging on a spot that looks to be between the Washington Monument and the White House.
We zoom in on the headquarters of the White House Secret Service. A woman is arguing vociferously that these cars have to be stopped before they get any closer to the White House. A colleague yells back that his wife is in one of those commandeered cars and she, like the rest of the "hackjacked" drivers and passengers, is innocent.
A related scenario, one that theoretically could happen today, is hacking into commercial airliners' autopilot systems and directing dozens of flights onto a target.
Setting aside the fantasy movie plot angle, how realistic is this today? Is it any more or less plausible than the millions-of-cars scenario? If people are truly concerned about the car scenario, shouldn't they be worrying about the aircraft scenario?
I will disagree with the other commenter and say that this is more plausible for the aircraft than for the cars. Modern jetliners and military aircraft (scarier yet) are purely fly-by-wire - there aren't cables running between the yokes and the control surfaces like in a Piper Cub, and if there were, no pilot would be strong enough to move them.
Yes, the autopilots can be turned off, but that's just a button, probably a button on the autopilot itself. Depending where the infection happens, the actual position of the yoke could be entirely ignored by the software. Or the motor controllers for the control surfaces themselves could be driving the plane, though I don't know how they could coordinate their actions and get feedback from an IMU.
Perhaps the pilots could rip out components and cut cables fast enough to prevent the plane from reaching its destination, and maybe they could tear out the affected component and limp back to a runway with what remains, but it's an entirely feasible movie plot.
But should we actually worry about either? No. The software sourcing, deployment and updating protocols at the various manufacturers of aircraft are certain to be secure. Right?
> Yes, the autopilots can be turned off, but that's just a button, probably a button on the autopilot itself.
Airplane components tend to have shitloads of fuses; any trained pilot knows how to pull the fuse for the autopilot system (or, in an extreme case, ALL fuses, to kill the entire airplane).
TL;DR you simulate a bunch of other planes in close proximity and the auto-pilot freaks out and tries to avoid them. As the second talk explains, the pilots would definitely notice and switch autopilot off. This is why IMO it's very important to not take ultimate control away from humans in cars. I would personally never buy one of the Google (or any other) self-driving models with no controls. It already freaks me out that many cars are drive-by-wire (for the accelerator), and now even steer-by-wire: http://www.caranddriver.com/features/electric-feel-nissan-di... #noThankYouPlease
No current airliner will automatically change course in response to a traffic conflict. If TCAS [0] gives an advisory, the pilot takes manual control or reprograms the autopilot. Spoofing transponder returns wouldn't do much to the aircraft except annoy the pilots.
Another reason traffic spoofing wouldn't cause the aircraft to deviate is that airliners fly standard approaches and departures (STAR [1] and SID [2]) and heavy traffic away from the approach paths would definitely get noticed.
Even the fly-by-wire Airbus can be flown manually using differential thrust and/or pitch trim control.
The only time I've heard of an Airbus losing control of a damaged engine is when the electrical cable was physically severed. This was Qantas QF32 [1], after one engine exploded and damaged the cables to another engine.
To "take over" an aircraft with pilots in the cockpit would require compromising multiple systems.
When you are barreling down a highway at 65 miles per hour and are not paying attention (and you wouldn't be, because the car drives itself just fine), giving you controls is much more dangerous (for you and others around you) than not.
If a fuel injection system were to fail via a fried component or even a short, it would trip a fuse and fail safe by cutting fuel and shutting off the car. Throttle cables, however, have definitely become stuck in their sheathing at WOT. It happened to my dad on the highway in a 1992 Isuzu Rodeo.
I've always seen that called EPC, for Electronic Pedal Control, but that is probably a VW-ism.
On the other hand, on an EFI car, having a mechanical throttle cable does not add much hack-safety, as the ECU always has some way to override a closed throttle (either disengaging the throttle pedal mechanically switches control of the throttle to an ECU-operated servo, or there is a completely separate throttle controlled by the ECU).
Sure, but rayval was talking about a scenario that could happen today.
Although looking at the other comments, I think I'm significantly underestimating just how much of modern airliners is dependent on software. The pilots might be able to see that they're heading for disaster, but may not be able to do anything about it.
I know for a fact that there are 3 separate computer systems from 3 separate manufacturers on each Boeing airplane. Auto-pilot always uses the consensus of the 3 machines. It's a pretty far-fetched scenario in real life so I thought we were talking fiction.
Former Boeing software engineer, worked on engineering simulators (where real hardware was in the loop):
There is an idea of triple channel autolanding, wherein the plane uses the consensus of the three autolanding systems. Should no consensus be available, then the pilot is advised that autolanding is not available.
Other than that, any sourcing from different manufacturers is happenstance. 737 avionics are sourced from a different vendor than 747/757/767/777. And different functions can come from different vendors, although vendor consolidation has cut down on that.
I'm not across what happened post 777, as I left Boeing in 1999.
Maybe it'd be waves of vehicles, the first few waves crashing through traffic to make way for the rest of the swarm... I'm surprised I hadn't heard of this doomsday scenario yet haha
You don't need self-driving cars for such a scenario to happen -- cars are increasingly drive-by-wire, and the driver-assistance features being added to cars (automatic lane keeping, automatic braking, smart cruise control, etc.) mean computers are already capable of taking over cars.
I came to that realization while driving my Leaf.
You see, with an internal combustion engine, there are several ways that you can stall the engine, even if the computer is controlling it. As long as you can stop it from rotating, it will stall.
Now, take a Leaf. The engine can't physically stall. It is completely controlled by electronics – in contrast, even an ICE with an engine control unit still has parts that are driven mechanically (the valves and driveshaft are all mechanical). This is also what causes cars to "creep" when you release the brakes, as the engine has to keep rotating. In the Leaf, the "creep" exists, but it is entirely simulated.
Similarly, the steering is also electric and controlled by algorithms (more assist in a parking lot, less on the highway).
Braking is also software-controlled. The first ones had what people called "grabby brakes" (the car would use regenerative braking under light pedal force; if you pressed harder, the brakes would suddenly "grab" the wheel). This was fixed in a software update.
Turning on and off is also a button. Can't yank the keys either.
So yeah, presumably, a Leaf could turn on, engage "drive" and start driving around, all with on-board software. It lacks sensors to do anything interesting, but the basic driving controls are there.
It's Will Smith, he's the president, and he just got finished dealing with one of his super-cute children acting up, but actually doing something noble (and getting in trouble at school for it). Upon seeing the swarm: "Aww helllll naw"
There are actually already physical barriers throughout the governmental parts of Washington DC to prevent this sort of thing. There are permanent walls, blocks, and poles along the edges of the roads (often well-integrated into the architecture), and raise-able barriers built into the road surface at intersections. Ain't no cars running over our president!
Fun. Johnny555 is right of course, and it was the Jeep hacked on the Interstate last year that first inspired me. The airplane scenario may be more likely, but I liked the additive nature of so many cars, cars a bit like insects. (You could go further in this direction by conjuring up multiple drone swarms.)
As to movie points: of course Will Smith is the hero, and we'll handle DC rush hour stasis through special effects. ;-)
Car manufacturers conduct recalls all the time. There is the possibility that a million self-driving cars could be held hostage from a remote control tower simultaneously, leading to injury or death for millions. In practice, however, as soon as an issue is discovered, there will be the equivalent of recalls (remote updates) and things like this will be fixed. People who are uncomfortable with self-driving cars will always be able to drive manually or override the automated controls. At some point, technology will progress enough that the benefits will outweigh the risks and people will adopt it.
In some cases, people are having to wait months to get new airbags because they just don't have them in stock. In the computer case, would you want to keep driving until they can get you scheduled for a software update? Remember that many cars can't update critical software OTA.
>In some cases, people are having to wait months to get new airbags because they just don't have them in stock. In the computer case, would you want to keep driving until they can get you scheduled for a software update? Remember that many cars can't update critical software OTA.
So I assume you don't own a car and you avoid them at all costs? Otherwise your paranoia becomes hypocrisy. If you cannot trust the car company to deliver software updates, you can't trust them to write the software in the first place, and modern cars are full of safety-critical software.
I also don't know why you're equating a manufacturing capacity limitation with a software update limitation. It's not as if Toyota is going to have trouble shipping bits a million times vs. a thousand times once the software update is written.
I think we can also safely assume that self-driving cars will generally be updatable OTA. But yes, you could drive it to the dealer if needed, and worst case the dealer could send people on-site to do the update.
A friend of mine works for VW's engine computer division. Yes, those engine computers. After all I've heard of their development methods (or lack thereof), I'm surprised the engines even start more often than one time out of ten.
My VW Golf has a bug where the driver's side door will be completely unresponsive after starting the ignition, with all the lights on the door being off too. After 5 to 10 seconds it will become responsive, which is a bit annoying if you're trying to open the windows to clear the damp mist on them, as you can't....
Also, if during normal routine you run through all four electric windows to close them (so passenger, driver, passenger rear, driver rear) in that order, you hear the solenoids click in a COMPLETELY different order. I am not sure if it is prioritising the messages in some way but the order that the windows "click" is not the order I press the buttons.
Also, I can get the CD player to crash.
Such minor noticeable issues make me think about the quality of the more important bits somewhat.
The breakdown on Toyota's safety code was interesting; and frightening really.
But they do know what a VCS is, and they don't re-invent lint because they want to ship broken code and need to only check for 2-3 minor issues while leaving the rest alone, as they need to rely on "magic" code exploiting undefined behaviour in certain hardware+compiler combinations.
People won't be able to override buggy software: they can't even do that now; just look at the remote Audi and BMW hacks that can brake the car on the highway.
Auto mfgs seem to be about 20-30 years behind when it comes to computers. Not really surprising that Tesla is whomping them on this front, given how SV people are scrambling to work there. You don't see that with the Big 3 or really any other car mfg.
There is something that's called failing safe(ly). In case of any error inside the car, it should slowly decelerate and pull over. Sure, it will cause a lot of traffic problems if, say, 10% of all cars did that at the same time, but the damage would not be as severe as a ghost driver entering the freeway at 140 mph.
There are a few instances where some bad Tesla batteries (the standard 12-volt batteries, ironically) failed, and the cars handled it perfectly: they slowed down so that the driver could safely pull over. Sure, it did not happen to all cars at once, and autonomous cars might not be able to do that by themselves, but we have a long way to go to reach 100% autonomous driving (i.e. without a steering wheel, and a car that drives everywhere humans drive, not only on San Francisco's perfect sunny roads where it's been thoroughly tested).
Yes, I'm eager to risk my life on self-driving car failures; it would be a tremendous step up from risking my life on human drivers (including myself!) as I do on a daily basis.
That assumes that failures are uncorrelated. My personal concern is with correlated failures, like those that occurred in GCE. What if cars from some manufacturer all fail simultaneously in the same way (say, because of a software push that rolled out more aggressively than it should have, just as in the GCE case)? That's the sort of scenario I'm really concerned about.
At least with human drivers, the failures are generally uncorrelated.
> The real question is whether society can handle the unfairness of death by random software error vs. death by negligent driving.
Most people have a greater fear of flying than driving by car although statistically you're far more at risk in a car. One cause of that fear of flying is loss of control; you have to accept placing your life in someone else's hands.
With self-driving cars, I suspect lack of control will also be a problem. Either we need to provide passengers with some vestige of control to keep them busy, or we just wait a generation until people get used to it.
> Most people have a greater fear of flying than driving by car although statistically you're far more at risk in a car. One cause of that fear of flying is loss of control; you have to accept placing your life in someone else's hands.
Really? That sounds counter intuitive. You'd think the reason people are afraid of flying is, because, you know, it's flying. Thirty thousand feet between you and the cold, hard ground. That's a long fall of agony to utmost certain death, and some magic turbo voodoo keeping you from it.
Would people with fear of flying really rather be the pilot?
Can't find it with a quick google now, but there was a study on this. Giving people fake controls, even when they knew they were fake, reduced anxiety.
Give ambulatory meat some onboard decision-making ability, and it will want to use it.
Maybe not the pilot, but simply being in the cockpit is usually enough to control the fear. Why? First, you can see what is going on, whereas in the cabin you can imagine what might be going on, and think the worst. Second, you can see the pilot is calm. You could watch the flight attendants when in the cabin but that may not be convincing. Third, if you hear or feel something unusual, you can simply ask the captain.
But, since flying in the cockpit isn't available, then what? Get a copy of "SOAR: The Breakthrough Treatment for Fear of Flying" (Amazon editors' 2014 favorite book).
Yeah, and it's funny how people seem to have some level of tolerance for death-by-design-flaws-in-hardware, but somehow software is a different kind of engineering endeavor. My guess is this will persist for some time, but eventually even out.
As you say, it can be pretty difficult to accept the unfairness of death by random software error. However, this is a situation that already exists today, just one understood and acknowledged by a very small portion of people in our societies. For a (non-life-threatening) example, thanks to the post-mortem above, we have a pretty deep understanding of what happened and why during this outage; but for some people it could have been a mere "Spotify is down, again, no more music :(".
I'm very curious to see if our understanding (as a society) of our own technology will improve over time or if people will continue to blame the internet for "not working" 20 years from now.
Well, this bug took down the entire system. What happens when self-driving software hits a similar bug? I don't think that there is any precedent for that sort of thing with manually driven cars. The scale could easily be larger than 100-car pile-ups due to poor weather conditions.
A stroke or heart attack while driving has (possibly) nationwide or worldwide reach? I somehow don't think you understand what I'm getting at.
A "Perfect Storm" of bugs could cause a systemic failure that has the possibility to affect all cars everywhere (well, probably limited to a single car-maker / model / etc). This has the possibility to affect millions. Claiming that a stroke or heart attack while driving has the possibility of a similar scope / reach doesn't make sense.
AFAIK, because most new cars today are pretty close to drive-by-wire anyway, there's little if anything that precludes such a perfect-storm scenario from occurring with the ECUs in human-driven vehicles, other than whatever internal processes manufacturers happen to have in place to avoid ECU bugs.
So in that respect, the introduction of self-driving-cars won't necessarily make such events more likely.
That would only affect the person who had a stroke or heart attack and their close vicinity. The parent is suggesting a scenario where the manual driving equivalent would be every single driver getting a stroke or heart attack at the same time.
I think the bigger worry is not the perfection part: It is the uniformity part. Considering software will be replicated (ignoring the ML ways of driving for now), updated and refreshed en masse, the impact is going to be very severe. A single nut case shoots up one school or his neighbors. The whole world turning into nut cases is going to be a walking dead scenario.
We see this ALL the time with ALL the big companies including the ones I have worked for in the past. I am very interested in possible solutions people are cooking up here.
There's also the ethical problem of life-and-death choices that will have to be programmed in advance.
When an accident is inevitable, software will decide whether private or public property should be prioritized, which action is more likely to protect driver/passenger A to the detriment of driver/passenger B, etc.
Most people wouldn't blame the outcome of a split-second decision made in the heat of the moment, but would take issue when the action is deliberate.
Is there actually some way to formally verify software that is driven mostly by machine learning? I imagine there's a small core of code that runs some trained models, but how are they being formally verified? How do we know there's not a blind spot in that model that turns out to be fatal under certain conditions?
I'd imagine that you'd have some hybrid of ML and traditional code, and may be able to reason statistically about the ML sections and use traditional (verified) code to cut the tail off the distribution.
All pipe dreams of mine, but the research potential here could be worth flaming truckloads of grant money :-)
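To make that hybrid concrete, here is a minimal sketch, assuming a hypothetical Action type, a made-up two-second rule and made-up thresholds: the ML policy proposes an action, and a small hand-written guard (simple enough to unit-test exhaustively, or even verify) overrides the proposal whenever a hard invariant is violated. This is an illustration of the idea, not anyone's actual system.

    from dataclasses import dataclass

    @dataclass
    class Action:
        steering_deg: float   # requested steering angle
        throttle: float       # 0.0 .. 1.0
        brake: float          # 0.0 .. 1.0

    MAX_STEERING_DEG = 30.0   # hard limit the guard enforces (illustrative)

    def guard(proposed: Action, speed_mps: float, gap_m: float) -> Action:
        """Clamp whatever the ML policy proposes to a region the small,
        conventionally testable guard code considers safe."""
        steering = max(-MAX_STEERING_DEG, min(MAX_STEERING_DEG, proposed.steering_deg))
        throttle, brake = proposed.throttle, proposed.brake
        # Crude two-second rule: if the gap ahead is too small for the
        # current speed, ignore the policy and command braking instead.
        if gap_m < 2.0 * speed_mps:
            throttle, brake = 0.0, max(brake, 0.6)
        return Action(steering_deg=steering, throttle=throttle, brake=brake)

    # command = guard(ml_policy(sensor_frame), speed_mps, gap_m)

The point being that only the guard, not the learned policy, needs to be formally reasoned about.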
To other replies, since we've reached max depth: yes, and that's why it makes no sense to me to equip such badly-spec'd vehicles with the self-driving bits first (Tesla S excluded). I drive a higher-than-normal performance vehicle for the exact reason that it gives me more options when I need to get out of a bad situation. I can out-brake, out-swerve (more lateral grip), and out-accelerate most other cars on the road. The stopping distance (a factor of tire grip and brake power) has definitely helped me avoid accidents. Lower body roll = more grip during more extreme maneuvers + more control during them.
Still would be nice if the software could go "shit I'm about to get into an accident" and deploy air bags etc. that much faster. Plus now that I'm thinking about it, most accidents probably occur with some option that, had it been utilized, would've prevented the incident. Imagine the insane accident-avoiding swerve maneuvers a software program could potentially pull off.
> It's not quite possible to write a car that avoids ALL accidents because a car has a speed and a turning radius and brakes only work so fast.
It's not possible to have a self-driving car avoid all accidents, but one can presumably get pretty close. The realtime data from sensors give you enough information about the car itself, its surroundings and other objects around it to continuously compute a safety envelope - a subspace of the phase space of controllable parameters (like input, steering) within which the car can stop safely - and then make one of the goals to aggressively steer the car to remain in that envelope. This approach should be able to automagically handle things like safe driving distances or pedestrians suddenly running into the street.
Of course there will be a lot of details to account for when implementing this software, but it's important to realize that we have enough computing power to let the car continuously have every possible backup plan for almost any contingency in its electronic brain.
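As a toy illustration of one axis of such an envelope (longitudinal only, straight-line braking, made-up reaction time and deceleration figures):

    def stopping_distance_m(speed_mps, reaction_s=0.2, max_decel_mps2=6.0):
        # Distance covered during the reaction delay plus braking
        # distance v^2 / (2a) under constant deceleration.
        return speed_mps * reaction_s + speed_mps ** 2 / (2.0 * max_decel_mps2)

    def inside_safety_envelope(speed_mps, clear_distance_m, margin_m=5.0):
        # True if the car could still come to a stop before the nearest
        # obstacle, with some margin to spare.
        return stopping_distance_m(speed_mps) + margin_m <= clear_distance_m

    # At 65 mph (~29 m/s) with 6 m/s^2 of braking, the car needs roughly
    # 76 m of clear road, so the controller should already be shedding
    # speed if the sensors report anything closer than that.

A real envelope would sweep steering options too, but even this one-dimensional check captures the "always keep a way to stop" idea.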
The problem with formal verification of systems like this is that checking whether software matches the specification or whether the specification is self-consistent is not the interesting problem. Whether specification matches the real world is the interesting problem and there is no formal way to verify that.
For this reason, most self-driving car development will only happen at large companies like Google, Tesla, Ford, etc. because they are the only ones who will be able to afford to purchase a massive general liability insurance policy.
I think that the other part with insurance is that insurers have no idea what the risk involved actually is (since it's not been around long) so they aim way high to cover themselves.
> Self-driving cars don't have to be perfect. They just have to be safer than driving is today
But how is Google or any other manufacturer going to test their software updates? Are they going to test-drive their cars for tens of thousands of miles over and over again for every little update?
Google's cars drive millions of miles daily in simulation already. They take sensor data (including LIDAR) from past drives and essentially redrive their entire dataset with the updated software.
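In outline, and only in outline, that kind of log-replay regression test is conceptually simple; every name below (sensor_frames, plan, the divergence thresholds) is a hypothetical stand-in for whatever the real pipeline uses:

    def replay_regression(recorded_drives, old_stack, new_stack):
        """Re-drive recorded sensor logs through both software versions
        and flag frames where the new stack's decision diverges enough
        to need human or simulator review before rollout."""
        diffs = []
        for drive in recorded_drives:
            for frame in drive.sensor_frames():      # camera/LIDAR/radar snapshot
                old_cmd = old_stack.plan(frame)
                new_cmd = new_stack.plan(frame)
                if (abs(old_cmd.brake - new_cmd.brake) > 0.2
                        or abs(old_cmd.steering_deg - new_cmd.steering_deg) > 5.0):
                    diffs.append((drive.id, frame.timestamp, old_cmd, new_cmd))
        return diffs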
But if a new update changes how the car drives, wouldn't the data that would've been captured by the updated car be different than what the old outdated car recorded?
For example if the new update makes the car more aggressive, then other real drivers might be more careful, slow down more, etc compared to the original runs?
As long as there are human drivers, Google will need to continue updating software. We'll find exploitable behavior (e.g. how to make the self driving car yield) just like we find weaknesses in their search algorithms, and they'll need to adapt.
> I'd assume the more severe the change the more they'd want to test in the real world.
I'm sorry but the important point is: how are we going to agree what needs more testing, and what can be updated without testing? If we let those big companies decide about those issues, then I'm afraid we will soon see another scandal like VW, except possibly with deadly consequences.
I can already predict the reasoning of those companies: last quarter our cars were safer than average, so we can afford some failures now.
They don't have to be safer than driving is today. They can be significantly less safe while still being an improvement for society because drivers will be able to focus on other activities while travelling instead of wasting that time focusing on driving the car.
Actually they have to be significantly safer than driving today. People would rather be unsafe and in control than not in control and a tiny bit safer. I know personally if a self driving car could only drive as well as I could then I'd still want to be the one driving.
People only have the illusion of safety when in control, and are also demonstrably incapable of judging their own ability to perform tasks. Your criterion won't be taken seriously by anyone involved in policy, because this is already well understood.
While I agree with your first sentence, I think you're ignoring the fact that when media get wind of a case like this, it's almost only the irrational opinions of masses that matter. In western democracies, politics - and thus policies - is driven by pandering to the population (and bribes^Wlobbying).
At a certain point the policy question will inevitably be: why should any regular person even be allowed to drive, given the superior abilities of the machines? There are certain ideological assumptions that will then have to be debated. Making your own mistakes is a consequence of freedom. Limiting the freedom to make mistakes for the overall benefit to society is not uncontroversial (see the gun control debate), and contributes to alienation in the Marxist sense of the term. There is more than utility involved here. Just because the trade-off doesn't matter to techno-determinists doesn't mean it doesn't exist.
Cars are already extremely regulated. You can only drive at speeds dictated by the government in directions dictated by the government, turn in ways prescribed by the government. Your car has to be identifiable in specific ways by the government. You have a lowered expectation of privacy in a car.
I hardly think the argument to just prohibit cars will be difficult to make.
You can only drive at speeds dictated by the government in directions dictated by the government, turn in ways prescribed by the government
Sure... so you're saying that your steering wheel blocks when you're trying to make an uncharted turn? Or that your throttle has a variable hard limit, depending on the road you're on?
That's a popular theory of how people behave but it doesn't play out that way. People overwhelmingly choose convenience and low price over safety all the time.
There's psychological and game theoretic factors that the safety has to overcome in order to be acceptable. Part of why human drivers are allowed today is because the people who bear the cost of driving decisions are directly involved in making those decisions. Once you give up control to a third party, they need to be significantly better to make it an acceptable choice on the individual level.
In other words, I agree that it's better for society, but that "better for society" isn't the metric that gets used for making decisions within the system.
"Here’s a different way of thinking about [the trolley] problem: if you wanted to design a car that intentionally murdered its driver under certain circumstances, how would you make sure that the driver never altered its programming so that they could be assured that their property would never intentionally murder them?"
"If self-driving cars can only be safe if we are sure no one can reconfigure them without manufacturer approval, then they will never be safe."
"Your relationship to the car you ride in, but do not own, makes all the problems mentioned even harder."
> Part of why human drivers are allowed today is because the people who bear the cost of driving decisions are directly involved in making those decisions.
This gives me weird visions of Google engineers with a necklace that explodes in the event that one of their cars causes an accident :S
I wonder if you've read Fallen Dragon by Peter F. Hamilton?
Unremovable, remotely controllable lethal necklaces are a central plot device in the mercenary invasion. They are put on randomly chosen civilians as "collateral" to ensure cooperation and disincentivize insurgency.
People in general don't willingly prioritize their own safety at the expense of convenience. That's why cars beep when you don't buckle your seatbelt. That's why people drink soda and eat fast food.
Google's take on postmortems is really nice. As the SRE book points out, they are seen as a learning tool for others. Most internal postmortems are available for anyone within the company to see and learn from. As well, they are always blameless. No fingers are pointed at the person who caused the issue in the postmortem. They explain the issue, what happened, and how it can be prevented in the future.
Pair this with the outage tracking tools and you can find all the outages that have happened across Google and what caused them.
Then there is DiRT [0] testing to try to catch problems in a controlled manner. Having things break randomly throughout Google's infrastructure so you can see whether your service's setup and oncall people handle it properly is a really awesome exercise.
It also showcases the great thing about self-driving cars. Even though accidents will happen, when it does there will be plenty of sensor data and logs which can be examined to find the exact cause in a post-mortem. An improvement to the software can then be made, and millions of cars deployed can all effectively learn from a single accident.
With humans, the amount of knowledge gained and the collective improvement of driving behavior from a single accident is low, and each accident mostly provides some data points to tracked statistics. With machines, great systematic improvements are made possible over time such that the remaining edge cases will become increasingly improbable.
What does post-mortem mean in this context? The software one (after an accident) or the human one (after a death)? I think it's crazy that the word is getting back its original meaning.
I wouldn't worry so much. I'm sure self driving cars are going to save a lot more lives than they are going to end. Humans are terrible drivers, and the software will only get better.
I know this logically. But emotionally I know how many bugs I have written in my life. I know software devs are human.. aka I know how the sausage is made.
Yes, but have you been part of the development & testing effort for mission-critical software (e.g. a class 1 or 2 medical device?). It's not true in all cases, but for the most part the level of QA that goes into the devices before release is significantly higher than that of your average product. This is why regulation is required.
GCE downtime just means people lose money, it's not life-or-death. Skimping on QA in order to reduce costs and get to market faster is a perfectly reasonable decision when the consequences are so mundane.
I understand what you mean, but that generalises too much about what people use GCE, public clouds, and self-hosted servers for, especially going forward. It is not all convenience applications, game backends, etc. What people use AWS/GCE for these days is so varied; even the public sector uses the AWS GovCloud region, for example. Downtime consequences are not just money lost; they can be life-and-death, and some applications need solid QA even if hosted in a public cloud.
It may (emphasise 'may') be how they share medical data via GCE/AWS that gets delayed just before a surgery (ok, edge case), or how they push bug fixes to a critical GPS model that happens to be used by an ambulance, or even a taxi used by a pregnant lady who is about to drop, etc. Or a simple general medical self-diagnosis information site that by chance could have saved someone in that time slot. Or any other random non-medical usage which involves a server and data of some kind that happen to be in GCE.
Yes, critical real-time systems are often on-premise or in self-hosted data centres, but more and more are not, especially those viewed as non-critical that are in some cases indirectly critical.
You make a good point, but in the end the responsibility is on the life-critical application (e.g. medical software, device, self-driving car) to ensure that it has been properly QA'd and that all of its dependencies (including any cloud services or frameworks that it is built upon) meet its safety requirements. The event of an app server or cloud service experiencing downtime would very much have to be planned for as part of a Risk Management exercise. Ignoring that possibility would be negligent.
I had the same feeling just before Y2K. Too much knowledge of the process, it all must work the first time in production, etc., etc. I was pleasantly surprised at the non-event of Y2K.
Especially since "driver-error" is the cause of 94% of motor vehicle crashes in the U.S.[1], with 32,675 people killed and 2.3 million injured in 2014.[2] Worldwide, motor-vehicle crashes cause over 1.2 million deaths each year and are the leading cause of death for people between the ages of 15-29 years old.[3]
It's estimated that self-driving cars could reduce vehicle crashes by approximately 90%! [4]
This is true for a single car, but self-driving cars introduce something that did not exist before. Imagine the majority of cars are self-driving and they all malfunction due to a bug in a software update.
> and they all malfunction due to a bug in a software update.
That's assuming everyone with a self driving car is driving the exact same model and they all updated at the exact same time. Chances are there will be many different models and manufactures so an OTA update with a bug will only affect a much smaller percentage of the self driving cars.
Yeah, remember, auto-pilot in a plane needs to be 100% reliable, or everyone dies. A car needs to be, I dunno, 80%? Compared to a bad human driver, who still drives every damn day, a computer need only be about 60% reliable to be better.
People suck at driving. Even a shitty self-driving car will save a ton of lives simply by obeying traffic laws.
An auto-pilot for an airplane is a considerably easier problem to solve. No lanes; no pedestrians; very little other traffic; three spatial degrees of freedom. That's why auto-pilots for airplanes have existed for almost a century but we're just now beginning to get self-driving cars. Humans are still better at dealing with the full panoply of crap that road driving throws at us.
Aircraft autopilots also rely on experienced and licensed pilots to operate them and be responsible for the aircraft at all times. Self-driving cars have to assume the operator is not particularly capable, nor paying attention to anything happening on the road.
Google Self-Driving Cars currently fail approximately every 1,500 miles, according to their own report. We are a LONG way away from being able to separate the operator's attention from driving.
Current generation of autopilots doesn't handle traffic avoidance or make any routing decisions - they just follow pre-programmed routes at pre-programmed speed and altitude (or climb/descent profile).
But even this relatively simple level of automation causes problems - pilots start to rely on automation too much, and when things go south they are not capable of dealing with it.
Airlines recognize it, and put more emphasis on hand-flying during training and routine operations, so pilots don't lose their basic piloting skills.
This video also inspired an excellent podcast from 99 Percent Invisible about the challenges and dangers of automation. I would highly recommend listening.
> Yeah, remember, auto-pilot in a plane needs to be 100% reliable, or everyone dies.
First, they're not anywhere near 100% reliable. They can fail on their own, and they'll also intentionally shut themselves off if the instruments they rely on fail. https://en.wikipedia.org/wiki/Air_France_Flight_447
Second, an autopilot failure shouldn't lead to death if the pilots are competent and paying attention.
What does a failed autopilot look like? Would a pilot do any better with an "aerodynamic stall"? I know little about planes and it seems like that'd be a big problem with or without a pilot driving.
A failed autopilot could look like all sorts of things, from just automatically disconnecting itself (usually with a loud warning alert) to issuing incorrect instructions (which is why the pilots are supposed to be awake and alert while it's engaged, watching the instruments).
Actually a key finding in AF447 was that pilots were not trained on how to recognize and recover from a high altitude stall. It is not like flying a Cessna 150. The junior first officer didn't realize the aircraft had stalled.
Pilots were trained on the procedure for recovering from a low altitude stall; 100% or TOGA thrust and power out of it while minimizing altitude loss. Training has now changed for both low altitude and high altitude stall recovery.
AF447 was a PILOT failure, not AUTOPILOT failure. The pilot didn't use proper stall recovery procedures, and put the aircraft into an unrecoverable stall.
But autopilot for planes is actually much easier than negotiating traffic with irrational humans with road rage. You can coordinate with air traffic for takeoffs and landings, and there is very little to run into at tens of thousands of feet.
Autopilot in a plane usually doesn't involve autonavigation, whereas autodrive in a car generally requires navigation. Autodrive in a car without navigation is basically 'cruise control'.
'Cruise control' does not (at least it didn't until very recently) even attempt to avoid collisions with neighboring cars or keep the car in its lane. Even without navigation, autodrive in a car is a considerably more difficult problem than either cruise control or autopilot in a plane.
Not sure low-probability/high-damage events are comparable to high-probability/"low"-damage ones in the first place, and that's not a trivial question to handle in real time.
A self-driving car could be better than a human on average. But as long as there are human drivers who drive better than the self-driving software, it would be a disaster for those drivers. We definitely do not want some technology that does good for the majority but does horrible things to a minority, right?
A single driver's ability isn't the only risk factor... if I'm a great driver but everyone else sucks (that's how it is for everyone already, right :D), then an overall increase in the population's driving ability helps me, right?
The overall increase helps you, indeed. But do you want to use self-driving software if you are a great driver (or you think you are)? I do not, because I want to be safer by driving myself.
If great drivers prefer to drive themselves, others will want to as well, because they do not trust those great drivers.
In every driver's eyes, there are only two kinds of drivers:
1) bad drivers, slower than me.
2) mad drivers, faster than me.
This would be true only if your driving ability only affected your chance to die, but your driving ability has an effect on everyone else's safety on the road as well!
Consider that you are a damn good driver, better than the self-driving software. Self-driving software can reduce your risk by giving you a safer environment (by replacing lots of bad drivers), but it increases your risk in emergencies (because it's not as good as you). Would you choose the self-driving software?
The point here is, no matter how good the driving environment gets, I do not want to lose any chance to survive (if I'm a good driver).
My consolation in that fact is that weird edge cases happen with human-driven cars as well. Someone has a seizure and crashes, or more commonly reaches for a cigarette, the radio, their phone. People hit ice or water and overcorrect their spin. People drive too fast. Etc., etc. Not even all edge cases; many are common modes of failure. I expect self-driving cars that kill people will be a huge emotional issue for a lot of people in accepting them, but for me, I just want them to be safer than human drivers, which isn't THAT high of a bar to cross.
Yup. People seem to be overly critical with automated car failures.
Personally, I think automated cars are going to easily be better than humans in the working cases (when both the human and the AI are conscious). Next, I expect to see fully operational backup systems.
Eg, if a monitoring system decides that the primary system is failing for whatever reason, be it bug or unhandled road condition (tree/etc), the backup system takes over and solely attempts to get the driver off the road, and into a safe location.
Humans often fail, but can often attempt to recover. And, as bad as we may be at recovering, we know to try and avoid oncoming traffic. Computers (currently) are very bad at recovering when they fail. I feel like having a computer driving, in the event of failure, is akin to a narcoleptic driver - when it goes wrong, it goes really wrong. Hence why I hope to see a backup system, completely isolated, and fully intent on safely changing lanes or finding a suitable area to pull over.
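A toy version of that watchdog/fallback split, with all class and method names invented for illustration (plan, is_sane, pull_over are not any vendor's real API):

    import time

    class Watchdog:
        """Hand control to a minimal fallback planner, whose only goal is
        to get off the road, if the primary stops producing sane output."""
        def __init__(self, primary, fallback, deadline_s=0.2):
            self.primary, self.fallback = primary, fallback
            self.deadline_s = deadline_s
            self.failed = False

        def command(self, frame):
            if not self.failed:
                start = time.monotonic()
                try:
                    cmd = self.primary.plan(frame)
                    if cmd.is_sane() and time.monotonic() - start < self.deadline_s:
                        return cmd
                except Exception:
                    pass
                self.failed = True   # latch: never hand back to a bad primary
            return self.fallback.pull_over(frame)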
Sounds good in theory. Until the bug that causes failure is also present in the monitoring system, and as such doesn't fail over to the backup system. AKA exactly what happened here to Google.
Currently, the self-driving software fails on a Google Self-Driving Car approximately every 1,500 miles. If the car suddenly stops trying to drive on the road, and the driver isn't attentive (or worse, if Google gets its way and convinces lawmakers to change the rules so they don't have to have steering wheels), that's a lot of deaths.
I'm not saying it won't get better, but pretending self-driving cars is a cure-all right now is hilarious and insane.
272 times the car had a 'system failure' and immediately returned control to the driver with only a couple seconds of warning. (Approx. every 1,558 miles.) A car mid-traffic spontaneously dropping control of the vehicle would likely create a large number of accidents.
13 car accidents prevented via human intervention (approx. every 32,615 miles), 10 of which would've been the self-driving car's fault (approx. every 42,400 miles). These virtual accidents were tested with the telemetry recorded during the incident, and it was determined that had the human test driver not intervened, an accident would've occurred.
Total of these events is 285, which is approximately every 1,487 miles driven.
For useful comparison, a rough human average (when you add a large margin to account for unreported accidents) is somewhere around one accident every 150,000 miles driven. (Insurance companies see them every 250,000 miles approximately, I believe.)
Your comment is extremely misleading. Firstly, the "couple of seconds warning" does not seem accurate. The actual report states:
"Our test drivers are trained and prepared for these events and the average driver response time of all measurable events was 0.84 seconds."
Secondly, you fail to mention:
"“Immediate manual control” disengage thresholds are set conservatively. Our objective is not to minimize disengages; rather, it is to gather as much data as possible to enable us to improve our self-driving system."
Thirdly, you fail to mention that the rate has dropped significantly:
"The rate of this type of disengagement has dropped significantly from 785 miles per disengagement in the fourth quarter of 2014 to 5318 miles per disengagement in the fourth quarter of 2015."
On the contact events, you fail to mention:
"From April 2015 to November 2015, our cars self-drove more than 230,000 miles without a single such event."
Lastly, your comparison with human drivers fails to take into account the environment:
"The setting in which our SDCs and our drivers operate most frequently is important. Mastering autonomous driving on city streets -- rather than freeways, interstates or highways -- requires us to navigate complex road environments such as multi-lane intersections or unprotected left-hand turns, a larger variety of road users including cyclists and pedestrians, and more unpredictable behavior from other road users. This differs from the driving undertaken by an average American driver who will spend a larger proportion of their driving miles on less complex roads such as freeways. Not surprisingly, 89 percent of our reportable disengagements have occurred in this complex street environment"
I don't think self-driving cars are quite ready yet, but you are not representing the state of the art accurately by making out it is as bad as you say.
> As I assumed, it was kind of a corner-case bug meets corner-case bug meets corner-case bug.
I wouldn't really agree with that. There were two pieces of code designed to perform checks on new configs and cancel them. They both failed. Neither of those checks is a corner case. If you had a spec sheet for the system that manages IP blocks, that functionality would be listed as a feature right up front.
This wasn't an edge case. It was two bugs in two sections of code both designed to recover from a serious problem. It sounds like both sections of code were not tested properly at the very least.
Sounds to me like someone just didn't bother to test the failsafe part of the code.
First bug: in a failure case, it should remove the failing config, not all of them.
Pretty hard thing to miss if you test for it with any level of basic unit test or similar.
Second bug:
A canary failure should prevent further propagation of the bad config.
A little more difficult to test with automated tests due to requiring a connection. It sounds like this was in fact tested, but the interaction between the two bits of software was not. A good integration test would have caught this, though I wouldn't call that required. I would, however, think it required that the use case of that particular code be at least manually checked, because, you know, it's a feature for disaster prevention / recovery.
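For what it's worth, the two missing tests being described would look roughly like this; ConfigStore and ConfigPusher are hypothetical stand-ins for whatever Google's config pipeline actually calls these components, not real APIs:

    def test_failure_removes_only_the_failing_config():
        # First bug: a rejected new config must be dropped on its own,
        # leaving the existing, known-good configurations untouched.
        store = ConfigStore(existing=["config-a", "config-b"])   # hypothetical component
        store.apply(new_config="bad-config", validator=lambda cfg: False)
        assert store.active() == ["config-a", "config-b"]

    def test_canary_failure_blocks_propagation():
        # Second bug: if the canary site rejects the config, it must not
        # be pushed to any of the remaining sites.
        pusher = ConfigPusher(sites=["canary", "site-1", "site-2"])  # hypothetical component
        pusher.push("bad-config", canary_result="reject")
        assert pusher.deployed_sites() == []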
There was enough information to deduce this pretty easily, although they did tend to gloss over it in the write-up, almost purposefully.
For all those spouting that this was a good postmortem: not really. It's a good covering of one's ass, a good spin, sidestepping the real root cause.
What have SLAs and "here, take credits" got to do with a postmortem?
I'm not really sure why I got downvoted for this. The post mortem was good but it wasn't something I'd aim to strive for. I like gcloud and I'll keep using it but I find the response to this thing a little bit hard to swallow.
What makes you think software that produces and rolls out configuration files is something complicated?
I don't doubt that Google's infrastructure is as complicated and nuanced as it can get. Configuration software just simply isn't.
I still don't really see the point you're trying to make here. There isn't enough detail in the two sentences they gave us on the actual cause of the problem to really say much more in any further detail than I did.
But I guess that just proves my other point. Postmortem was 90% fluff.
Yet in Google's defence, the information they gave was thorough enough for me. My only gripe was how it was being treated here. It just wasn't a very interesting situation and turned out to be something quite mundane.
Yes but how many people drive stoned, drunk, or distracted?
How many people drive aggressively, speeding, or erratically? How many people do dumb things on the road?
As a software engineer I know that there will be bugs and some will likely kill people. But as a driver who has driven many years in less civilized countries, I know that human beings are terrible drivers.
Who would you rather share the road with, computer drivers that drive like your grandma, or a bunch of humans? It's a no-brainer right?
Agreed, I think a good postmortem distinguishes great companies from just good companies. The depth on philosophy, reasoning and then action is very digestible.
> This is also why I am afraid of self-driving cars and other such life-critical software. There are going to be weird edge cases; what prevents you from reaching them?
Yes. However, the current failure rate of human drivers being improved on is the standard I care about.
> After crunching the data, Schoettle and Sivak concluded there's an average of 9.1 crashes involving self-driving vehicles per million miles traveled. That's more than double the rate of 4.1 crashes per one million miles involving conventional vehicles.
That is the only number that matters to me. Google gets that to 4.0 per million miles and I'd say they are good to go.
So crashes like that include drunk drivers, drugged drivers, and texting drivers.
What is the crashes per miles for a paying attention driver? If it is 1 per million miles, the self driving car would need to be a lot lower. Now if it was 4am and I am falling asleep at the wheel, I bet any self driving car would beat me. So cool to turn on, but maybe not for a daytime cruise...
It's not just crashes per mile - it's the severity of the crashes that needs to be considered too. 5 fender benders is a better outcome than 1 horrific wreck that kills someone.
Is crashes/mile the best metric, or mistakes/mile that could lead to a crash? I certainly make a TON of mistakes that I can correct before they lead to problems--most of them are mild, like having to brake a half second later than I'd like, but I bet you are UNDERestimating the improvement due to self-driving cars.
It's definitely crashes per mile. Unexpected stuff happens all the time on real roads -- black ice, debris, animals, flat tires -- and safety depends upon the driver's or software's ability to deal with it. It's easy for a self-driving car not to wander into another lane because it's fiddling with the radio, but it's hard for it to deal with situations its designers didn't anticipate.
Why are you comparing self-driving cars to exclusively a "paying attention driver"?
For self-driving cars to be safer than human drivers, there is no requirement that the self-driving cars should be better/safer than the best human driver... the self-driving car simply needs to be safer than the majority of humans.
> For self-driving cars to be safer than human drivers, there is no requirement that the self-driving cars should be better/safer than the best human driver... the self-driving car simply needs to be safer than the majority of humans.
That is true on the whole, but not true for ME. It needs to be safer than ME, not some hypothetical average person.
Further compounding it:
> For driving skills, 93% of the U.S. sample and 69% of the Swedish sample put themselves in the top 50% [0]
Consider that most driving is done feet away from another vehicle. If those cars start being replaced by self-driving cars, then you are safer (system safety x personal safety). The car might make more dubious decisions than you would yourself, but now it has fewer opportunities for failure due to others driving erratically.
Are BGP updates for Google's own router configurations really so frequent that they can't pay an engineer to at least monitor the propagation of configuration changes? In this case, a human would have instantly seen that the update was a) rejected (as explained in the postmortem), and b) holy shit, WHY DID THE ROUTER CHANGE ITS OWN CONFIGURATION TO BLOW AWAY ALL OF THE GCE ROUTES!?!
I'm all for automation, but WTF? Insert even a semi-competent engineer in the loop to monitor the configuration change as it propagates around and the entire problem could have been addressed almost trivially, as the human engineers eventually decided to do.
First of all, BGP is core to Google's load balancing architecture. So within a single datacenter you probably have at least a few dozen devices downstream from each edge router.
Secondly, I'm seeing just shy of 500 individual prefixes, 282 directly connected peers (other networks), and a presence at over 100 physical internet exchanges, just for one of Google's four ASes.
Would you be able to read over that configuration and tell me if it has errors?
Any sufficiently large system quickly reaches a point where a human has difficulty tracking what the system should look like.
Google has at least tens of data center locations, each of which will have multiple physical failure domains.
There are also many discontiguous routes being announced at all of their network PoPs. They have substantially more PoPs than data centers.
It very quickly gets too much to reasonably expect people to be able to keep track of what the system should look like, let alone grasping what it does look like.
This is also why I am afraid of self-driving cars and other such life-critical software. There are going to be weird edge cases; what prevents you from reaching them?
Formal systems are still built on a model of the outside world, not on the world itself, AFAIK. Even if your formal coverage is 100%, you cannot anticipate all the weird edge cases the real world can come up with.
The key thing to remember is that a bug in autonomous driving doesn't mean the car swerves off the road at 100mph. If the software crashes or fails, the car can come to a stop quite quickly without harm, allowing for human intervention.
Having said that I am still scared, I'm not sure how well Tesla auto pilot will handle a tire blowout at 70mph. Perhaps better than I would, but I would much rather I was in control.
Assuming bugs are never intentional and mostly random... maybe instead of one autopilot, self-driving cars of the future will have several, developed by completely different teams. Then a self-driving car can take some sort of average or most common output instruction (thus minimizing the risk of random bugs/edge cases, etc.).
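That is essentially N-version programming. A toy majority vote over three independently developed stacks could look like this (the Command type and the plan() interface are invented for the example):

    from collections import Counter
    from dataclasses import dataclass
    from statistics import median

    @dataclass
    class Command:
        maneuver: str        # e.g. "keep_lane", "change_left", "stop"
        steering_deg: float
        brake: float

    EMERGENCY_STOP = Command("stop", 0.0, 1.0)

    def vote(controllers, frame):
        """Majority vote on the discrete maneuver, median on the continuous
        values, so one wildly wrong stack out of three gets outvoted."""
        decisions = [c.plan(frame) for c in controllers]
        maneuver, votes = Counter(d.maneuver for d in decisions).most_common(1)[0]
        if votes <= len(decisions) // 2:
            return EMERGENCY_STOP        # no majority: fail safe
        return Command(maneuver,
                       median(d.steering_deg for d in decisions),
                       median(d.brake for d in decisions))

The well-known caveat from N-version experiments is that independently written versions still tend to fail on the same hard inputs, so the independence you actually get is weaker than the arithmetic suggests.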
When you say they "fail", I think you are referring to the disengagements, right? The report you link to there says:
“Immediate manual control” disengage thresholds are set conservatively. Our objective is not to minimize disengages; rather, it is to gather as much data as possible to enable us to improve our self-driving system.
Also, table 4 reports the number of disengagements (for any reason) each month, as well as the miles driven each month. In the most recent month in that table, it's actually 16 disengagements over 43275.9 miles. That's approximately one disengagement every 2705 miles; about the distance from Sacramento, CA to Washington, DC. At the start of 2015 it was only 343 miles per disengagement; 53 disengagements over 18192.1 miles. The pace of improvement is incredible, especially considering disengagements are set conservatively.
Can a human drive from Sacramento CA to Washington DC without a single close call or mistake along the way? I really doubt it. This technology will be saving lives soon.
It's hard to get an exact figure, particularly because of unreported accidents, and various sources. But I believe insurance companies have previously stated it's about one in every 250,000 miles. For the sake of giving a wide berth for unreported accidents, and to not give humans the benefit of the doubt, I've been using the rough figure of 150,000 miles between accidents.
I don't have a great source for it though, and if anyone finds a good source, it'd be fantastic.
Airplanes are equipped with software, and many pilots turn on the autopilot after takeoff. Bugs are everywhere, and it's just a matter of time before one is critical enough to kill people. So our best bet is better quality assurance through proof and over-testing (done incrementally!).
> There are a number of lessons to be learned from this event -- for example, that the safeguard of a progressive rollout can be undone by a system designed to mask partial failures -- ...
This is a really important point that should be more generally known. To quote Google's own "Paxos Made Live" paper, from 2007:
> In closing we point out a challenge that we faced in testing our system for which we have no systematic solution. By their very nature, fault-tolerant systems try to mask problems. Thus they can mask bugs or configuration problems while insidiously lowering their own fault-tolerance.
As developers we can try to bear this principle in mind, but as Monday's incident demonstrated, mistakes can still happen. So, has anyone managed to make progress toward a "systematic solution" in the last 9 years?
Google actually does have a systematic solution: fault injection. Google's systems are designed so that you can (manually, if you have the right privileges) tell an RPC to fail regardless of whether it would otherwise have succeeded, and then test the response of the system as a whole.
The problem is that these failure cases are exercised much less frequently than the "normal execution" code paths are. For example, every year Google does DiRT [1] exercises which test system responses to a large calamity, eg. a California earthquake that kills everyone in Mountain View and SF including the senior leadership, and also knocks out all west coast datacenters. The half-life of code at Google (in my observation) is roughly 1 year, which means that half of all code has never gone through a DiRT exercise. The same applies to other, less serious fault injection mechanisms: they may get executed once every year or two, and serious bugs can crop up in the meantime. Automated testing of fault injection isn't really feasible, because the number of potential faults grows combinatorially with the number of independent RPCs in the system.
I'd be willing to bet that the two bugs that caused this outage were less than 6 months old. In my tenure at Google, the vast majority of bugs that showed up in postmortems were introduced < 3 months before the outage.
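For concreteness, a rough sketch of the fault-injection idea described above; the names and the injection mechanism are invented for illustration and are not Google's actual API:

    # Hedged sketch of manual fault injection: wrap an RPC call site so that a
    # privileged test can force a failure even when the backend would succeed.

    FORCED_FAILURES = set()   # e.g. populated by a test harness / admin tool

    class RpcError(Exception):
        pass

    def call(method, real_call):
        if method in FORCED_FAILURES:
            raise RpcError(f"injected failure for {method}")
        return real_call()

    def fetch_config():
        return {"version": 42}

    # A test with the right "privileges" injects a failure and checks the
    # system-level response (fallback, alert, rollback) instead of the happy path.
    FORCED_FAILURES.add("ConfigService.Fetch")
    try:
        call("ConfigService.Fetch", fetch_config)
    except RpcError as e:
        print("system should now fall back AND report unhealthy:", e)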
There was a relevant section in the Google SRE book notes posted here the other day about injecting faults into Chubby, their distributed lock service, which was too reliable.
Ex: Chubby planned outages
Google found that Chubby was consistently over its SLO, and that global Chubby outages would cause unusually bad outages at Google
Chubby was so reliable that teams were incorrectly assuming that it would never be down and failing to design systems that account for failures in Chubby
Solution: take Chubby down globally when it’s too far above its SLO for a quarter to “show” teams that Chubby can go down
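The mechanics of that are just error-budget math. A toy version, with a made-up SLO and durations:

    # Illustrative error-budget check in the spirit of the Chubby story: if a
    # service has been "too reliable" this quarter, burn the remaining budget
    # with a planned outage so dependents can't assume 100% availability.

    SLO = 0.9995                      # hypothetical quarterly availability target
    QUARTER_MINUTES = 90 * 24 * 60

    def planned_outage_minutes(observed_downtime_min):
        budget = (1 - SLO) * QUARTER_MINUTES      # downtime the SLO permits
        remaining = budget - observed_downtime_min
        return max(0.0, remaining)

    print(planned_outage_minutes(observed_downtime_min=5))   # burn the rest deliberately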
Testing doesn't detect failure, it only detects the failure of a test. Real failures happen more often than test failures, for the same test on the same code with the same input and output. The best systematic solution would detect real failures, not see what happens when you fail a test.
That's monitoring, then. As Steve Yegge's Platforms Rant [1] mentioned, testing and monitoring are two sides of the same coin. Google does both, but the original thread-starter here was asking about how to detect failures when the system itself is designed to mask & recover from failures. (FWIW, most such systems do log when they've encountered a failure condition and recovered from it, and this stat is available to the monitoring system.)
Basically, yes. But we don't have to make a traditional monitor, or have it be an extra component. Monitoring all the facets of, say, a code deployment, or a software build, or performance testing, is a dynamic thing. It may fail, or it may succeed, or it might be suspicious.
Normally we design systems for humans to determine that third part; in this case, there should have been a system where humans could see the one or two pieces of unusual activity and investigate. But there wasn't, or it didn't work right. So a "fix" would be to develop software that adapts to nondeterministic behavior the way a human does. I wouldn't exactly call that monitoring, though.
This is an interesting question, and seems to get to the core of Nassim Taleb's ideas [1] about fragility and the limits of what we can understand, and how many of our attempts to create artificial stability ultimately bring about the opposite.
That said, based on this post-mortem, I think Google, and our industry as a whole, is doing a pretty good job. Periodic failures like this are inevitable, and if they serve to make it less likely that a similar failure occurs in the future, then that is a system as a whole that could be described as "anti-fragile".
> So, has anyone managed to make progress toward a "systematic solution" in the last 9 years?
That depends on how you define "solution". If development time isn't a concern, then formal verification is a pretty solid solution. AWS has used TLA+ on a subset of its systems. [0]
The standard solution in realtime safety-critical systems is to perform health monitoring in addition to robust fallbacks, such that when the system is falling back, it is reported as unhealthy.
For example, the CAN bus normally has an automatic retry feature on a variety of errors. A properly functioning CAN bus should have a bit error rate that is nearly zero. Lightly loaded, it can tolerate a very high error rate (say, due to noise, poor termination, etc). In that situation, the product would report a specific warning message to higher-level SCADA systems, such that it gets bubbled up all the way to the operators.
The approach at Google is to report the actual error rate up to the monitoring system, and then let the monitoring system decide at what threshold to alert with a warning message. This lets you catch a wide variety of errors, eg. if a single replica has a high error rate, that's probably a wildly different problem from if a whole rack of machines has a high error rate, which is different from every machine in the service having a high error rate, which is different from only the set of machines that were fed a specific query having a high error rate.
One of the bugs in this postmortem was that the process in question didn't do this, instead masking the error. Somewhat understandable, as I found the whole "execute a fallback, report the failure, and let the monitoring rules deal with it" philosophy one of the most confusing parts of being a Noogler. If you've never worked on distributed systems before, the idea that there is a monitoring system is a strange concept.
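The "execute the fallback, export the counter, let monitoring decide" pattern, as a toy sketch (metric names and the threshold are invented):

    # The serving code only counts; separate alerting rules interpret the counts.

    from collections import defaultdict

    COUNTERS = defaultdict(int)   # stand-in for an exported metrics page

    def handle_request(do_work, do_fallback):
        COUNTERS["requests"] += 1
        try:
            return do_work()
        except Exception:
            COUNTERS["fallback_used"] += 1     # don't hide it -- export it
            return do_fallback()

    def alert_rules():
        rate = COUNTERS["fallback_used"] / max(1, COUNTERS["requests"])
        if rate > 0.01:
            print(f"WARN: fallback rate {rate:.1%} above threshold")

    def flaky():
        raise IOError("backend unreachable")

    def cached():
        return "stale-but-usable"

    for _ in range(100):
        handle_request(flaky, cached)
    alert_rules()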
CAN and its automatic retransmit is actually a pretty good example of how simple transient problems can quickly grow into global system failures. On a typical CAN bus the bandwidth headroom is small enough that if all colliding/failed telegrams were blindly retransmitted, the collision rate would skyrocket and only high-priority traffic would make any progress; and since on CAN priority and purpose are intrinsically linked, from the global point of view nothing would make progress. That's why most CAN controllers have configurable per-packet retransmit behavior (drop/retry/raise an error and let the application deal with it), and partially why today's cars have multiple CAN buses.
Degraded modes of operation is one example of how to visualize masked errors. Another is to trigger an alarm on fallbacks.
As a general reflection, many distributed systems leave out the cause of their changes and only log actions. Instead of logging "new membership, new members are b,c,d" you are better off logging "node a has not responded to heartbeat in the last 30 seconds, considering it faulty". Following such a principle makes it much easier to spot masked bugs, since you can reason about the behaviour much better.
Aggregating logs to a central location and being able to analyze global behaviour in retrospect is also a great feature.
Great point! More precisely, each state transition in a system should report the old state, the new state, and the triggering event (cause) to a monitoring system (possibly just a log).
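Something as simple as this, as a toy example (the membership set and cause string are made up):

    # Minimal sketch of logging cause alongside effect: every membership change
    # records the old state, the new state, and the triggering event.

    import logging
    logging.basicConfig(level=logging.INFO, format="%(message)s")

    members = {"a", "b", "c", "d"}

    def remove_member(node, cause):
        global members
        old = sorted(members)
        members = members - {node}
        logging.info("membership %s -> %s (cause: %s)", old, sorted(members), cause)

    remove_member("a", "no heartbeat from node a for 30s; marking it faulty")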
It looks like there were at least three catastrophic bugs present:
1. Evaluated a configuration change before the change had finished syncing across all configuration files, resulting in rejecting the change.
2. When it tried to reject the change, it actually just deleted everything instead.
3. Something was supposed to catch changes that break everything, and it did detect that everything was broken, but its attempt to fix the problem failed.
It is hard to imagine that this system has good test coverage.
I'm attempting to even imagine how one would build a useful way to test this. Would they have to have a secondary, world-wide datacenter network with all their various services behind it?
Yes, in a manner of speaking; a physical or virtual lab. At Google's scale it wouldn't be unreasonable to have a completely parallel, but scaled-back, network where they test their automation and code for the happy and sad paths.
That doesn't mean that bugs can't creep in. Who knows, maybe these were all extremely unlikely bugs and Google hit an astronomically unlikely bad-luck streak. Happens.
You could have it send messages to the actual servers, but with an added flag that says "fake", which makes the servers ignore the message/send back a message saying pass/fail/whatever (testing the flag could happen first, one server at a time manually). Then check whether the program continued to push updates.
You may be able to build an elaborate system of dummy network operations to test with, but this system may wind up with bugs that mask what would be errors in the real system. And how do you test against that? A dummy network to test the dummy network operations on? What if the dummy network contains bugs that make it behave significantly differently from the real network in error cases? How do you test for that?
You can test it against the actual network; if something goes wrong, you'll have downtime, but you'll be prepared to get it all back up.
Or, to test whether the "prevent errors from going to new places" works, temporarily configure the new places to ignore new configs; if the system works, no messages will be sent there; if the system doesn't work, they ignore the message and you learn about a bug.
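One way to cash out that "fake" flag idea is a dry-run mode where receivers validate and acknowledge but never apply. An entirely hypothetical sketch:

    # Hypothetical dry_run mode: servers validate the config and report a verdict
    # without changing any state. The validation rule here is a toy.

    def apply_config(server, config, dry_run=True):
        ok = bool(config.get("gce_ip_blocks"))     # toy validation rule
        if dry_run:
            return "would-accept" if ok else "would-reject"
        # ... real apply path would go here ...
        return "applied" if ok else "rejected"

    servers = ["edge-1", "edge-2", "edge-3"]
    verdicts = [apply_config(s, {"gce_ip_blocks": []}) for s in servers]
    # An empty IP block list should never survive a dry run of the real push.
    assert all(v == "would-reject" for v in verdicts), verdicts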
You mean an ICMP request? The IPs were anycast and did not become unreachable until all edge routers had stopped announcing BGP routes. At that point the failure was global. Check out the postmortem, it's a good read.
The servers became unreachable, but the IPs weren't unreachable until all the servers were reconfigured.
My test should have caught this bug:
> In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
You fake out the connection with a faker object and give that to the code that wants to communicate to the network, and it returns streamed, deterministic data that would have been expected from the actual network, given deterministic inputs. The test uses the fake; the production code gets given the real object.
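A minimal example of that style, with a made-up protocol, just to show the shape of such a test:

    # Production code takes a connection-like object; the unit test passes a
    # deterministic fake instead of touching the real network.

    class FakeConnection:
        """Returns a scripted sequence of replies and records what was sent."""
        def __init__(self, canned_replies):
            self.canned = list(canned_replies)
            self.sent = []
        def send(self, msg):
            self.sent.append(msg)
            return self.canned.pop(0)

    def fetch_route_count(conn):
        reply = conn.send("GET /routes/count")
        if reply == "TIMEOUT":
            reply = conn.send("GET /routes/count")   # one retry on timeout
        return int(reply)

    fake = FakeConnection(["TIMEOUT", "512"])
    assert fetch_route_count(fake) == 512
    assert fake.sent == ["GET /routes/count", "GET /routes/count"]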
While testing would have been quite difficult, any simple canary release or timed release mechanism would have prevented this / limited the damage. For such mission-critical systems, applying any global change in such a manner is asking for it. DevOps can also be a SPOF, and this seems to be one such case.
They had a canary release mechanism in place. This is described in the post mortem.
> These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly, and a progressive rollout which makes changes to only a fraction of sites at a time, so that a novel failure can be caught at an early stage before it becomes widespread. In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
Taking no confirmation from the canary testing process as a signal to go ahead, though, is not just a bug but a design flaw IMO.
If you read the actual report, it mentions that they did a canary step but its effectiveness was undermined.
> In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
Seriously. This is a good postmortem, but these are hardly edge case bugs. In this case, major critical functionality just plain didn't work. Kind of shocking.
They explained the issues in layman's terms that most likely mask the true complexity of what happened. It's easy to read the final result: "tried to reject but then deleted everything" and think "Well duh, that's bad, who would build a system that does that?", but I think you're fooling yourself if you think that edge cases couldn't cause that.
> Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
It seems obvious to me that the push system should not proceed without confirmation from the management software, and the management software should not confirm the change is OK if it detects failure.
I see a straightforward defect here, not a confluence of edge cases.
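The fail-closed version is tiny: anything other than an explicit PASS, including a verdict that never arrives, should block the push. A sketch with invented verdict names:

    # Sketch of a fail-closed rollout gate: proceed only on an explicit,
    # well-formed PASS from the canary; treat silence or errors as failure.

    from enum import Enum

    class Verdict(Enum):
        PASS = "pass"
        FAIL = "fail"

    def should_continue_rollout(canary_result):
        # Anything other than an explicit PASS -- including None because the
        # result never propagated back -- blocks the push.
        return canary_result is Verdict.PASS

    assert should_continue_rollout(Verdict.PASS) is True
    assert should_continue_rollout(Verdict.FAIL) is False
    assert should_continue_rollout(None) is False          # the bug in this outage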
The take-home here is: Unit-test your failure states as well, people. Not just your happy paths!
I mean, this problem was a result of MULTIPLE untested failure states.
And yes, it IS possible to unit-test this sort of thing. You can fake out network connections and responses. I haven't yet found something that's impossible to unit-test if you think about how to do it properly.
EDIT: Why downvotes without a typewritten rebuttal? That's just not what I expect from HN (as opposed to, say, Reddit)
For progressive rollouts, what if config changes were pulled instead of pushed?
Each system would be responsible for updating itself, verifying (canary, smoke test, make sure other systems successfully updated, etc.), bouncing, and then rolling back as needed.
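Roughly this shape, as a sketch of the pull model (every name here is invented):

    # Each site polls for a new config, applies it to itself, runs its own
    # verification, and rolls back locally on failure, while reporting what happened.

    import time

    def poll_and_apply(site, fetch_config, verify, apply, rollback, interval_s=300):
        current = None
        while True:
            candidate = fetch_config()
            if candidate != current:
                apply(site, candidate)
                if verify(site):
                    current = candidate                 # accept the new config
                else:
                    rollback(site, current)             # this site only
                    report_unhealthy(site, candidate)   # let humans/monitoring see it
            time.sleep(interval_s)

    def report_unhealthy(site, candidate):
        print(f"{site}: rejected config {candidate!r}, kept previous one")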
A bunch of that's in place already, eg. all Google servers have health checks that run basic smoke tests on a configuration, and if a large number of replicas become unhealthy after a config change, the rollout process automatically aborts and rolls back to the last known good configuration.
The problem here was that there was a bug in the health check that masked the problem by assigning the last-good configuration, and then there was a bug in that code that had saved "nothing" as the last-good configuration. So rather than failing and having the error caught at the top level, it failed and buggy failure-recovery code made the problem worse.
> In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
Classic Two Generals. "No news is good news," generally isn't a good design philosophy for systems designed to detect trouble. How do we know that stealthy ninjas haven't assassinated our sentries? Well, we haven't heard anything wrong...
It may not be good design, but it might be necessary / practical design. If you have enough machines that some percentage of them are down or unreachable at any given time, you can't wait for full go-ahead before proceeding; you'll never get full go-ahead. So you're left with probabilistic solutions, and as T approaches infinity the expectation of more than zero false-positives approaches 1.
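So in practice you end up with a quorum rule rather than unanimity. A toy version, with made-up thresholds:

    # Proceed when enough canary sites answered positively and none answered
    # with an explicit failure.

    def quorum_go_ahead(acks, required_fraction=0.9):
        passes = sum(1 for v in acks.values() if v == "pass")
        fails = sum(1 for v in acks.values() if v == "fail")
        # Explicit failures veto; missing/late replies merely count against quorum.
        return fails == 0 and passes >= required_fraction * len(acks)

    print(quorum_go_ahead({"s1": "pass", "s2": "pass", "s3": "pass"}))  # True
    print(quorum_go_ahead({"s1": "pass", "s2": "pass", "s3": None}))    # False: quorum not met
    print(quorum_go_ahead({"s1": "pass", "s2": "fail", "s3": "pass"}))  # False: explicit veto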
The whole point of the canary sub-population, though, is that 1) it's not your whole population, and 2) you want to find out empirically if something's wrong.
I'm waiting for the time when they push over the air updates to airplanes in flight.
"You can fly safely, we have canaries and staged deployment"
A year forward:
"Unfortunately because the canary verification as well as the staged deployment code was broken, instead of one crash and 300 dead, an update was pushed to all aircraft, which subsequently caused them to crash, killing 70,000 people."
I'm not 100% sure why they don't do the staged deployment for Google-scale server networking over a few days (or even weeks in some cases) instead of a few hours, but I don't know the details here...
It's good that they had a manually triggerable configuration rollback and a pre-set policy, so it was resolved quickly.
The answer, of course, is that slower and less-frequent deployments mean slower progress building a better platform and delivering new features. If breakages could lead to plane crashes then, obviously, we'd want them to slow down. But if it mainly means no one can listen to Spotify for 15 minutes then that calls for a different trade-off.
As a founder of a startup that hosts services on GCE I'm happy with the trade-off they've chosen.
There are businesses that fit somewhere between Boeing and Spotify where failures still have some kind of steeper than casual cost.
On Hacker News the "move fast and break things" ethos is probably making sense for many of the people submitting and commenting, since their business is closer to casual usage anyway. But that's not the whole audience.
Shit happens. When it comes to engineering, I'd trust Google to manage systemic risk more than I'd likely trust even Boeing.
As for cars, it's a real risk, but not the same as the bugs Google experienced; I personally have experienced a "bug" driving a car at high speeds, which resulted in a number of major electronic systems failing due to custom systems installed by a well known US startup.
Depends. You design and operate for a certain "shit happens" probability and price.
That's why I brought up airliners. You can't set low reliability goals and just say "shit happens". You would have fewer buyers, and it's not even legal anymore. So the bullet was bitten and more reliable aircraft were developed. In the software world we're still more or less in the nineteen-twenties; in aviation, that changed.
Let me phrase it in a perhaps less confrontational way. I see that there could be some business value in more reliable cloud platforms.
There might be some business value in more nines in the availability percentage, that is, less downtime per year. Or maybe just fewer global outages, even if that means more cases where some of a given customer's containers or VMs or what have you might be unavailable some of the time. That can be handled by running multiple units in the same cloud and with other techniques.
But at the moment, since there seem to be single points of failure (or policies that are single points of failure, like to update everything at once), if you, as a customer, would like to have more safety, you would have to run services in two different providers' cloud platforms. That could get slightly more complicated - and expensive as well. I guess some parts of these technologies are quite new so someone will come up with easy and good solutions.
As it relates to airplanes, "shit happens" still applies. I was flying into NYC one time and air traffic control mistakenly allowed the plane I was on to attempt a landing while another plane was taking off; my pilot didn't even notice the other plane until we were over the runway. I later found out that NYC in a number of cases depends on pilots avoiding collisions by literally looking out the window for traffic in their flight path.
>> "I see that there could be some business value in more reliable cloud platforms."
Likely, though I have no idea how much Google is making with cloud services, but Amazon I believe is making tens of billions alone with its cloud services. That said, Amazon as far as I'm able to recall has had far worse issues and appears to be doing fine as a business.
Think of the canary as the last line of defense, not the first. You always aspire to deploy zero bugs into production, through good testing and other QA. But if a problem happens, you want to limit the impact as much as possible. Affecting one site isn't great, but there is enough redundancy that overall service should be unaffected.
At Google, they do these really awesome post-mortems when there's a major failure. They provide a point of reflection and are usually well-written, entertaining reads. Didn't know they made (some?) public.
Writing one is a good learning exercise, and it's more of a learning exercise than a punishment.
It's worth noting that the publicly posted postmortem is not the same as the internal postmortems (which include much more detail, specific action items, timelines etc). The SRE book (https://landing.google.com/sre/book.html) has a whole chapter on our internal postmortems, which is probably a better learning exercise in how to write one.
Source: I work on the team that writes these external postmortems.
Google publishes a public incident report for all service outages (code red) in the Cloud status dashboard. You can see some in the History page: https://status.cloud.google.com/summary
Note that the length of the report tends to correlate with the severity of the outage, and that disruptions (code orange) do not get reports.
Disclaimer: I work in Cloud Support and write some of these.
Completely off topic, but this thread is an example of why I (and a lot of people) want collapsible comments native to HN. I'm on my phone, in Safari, and I had to scroll for over 20 seconds just to reach the second comment. The first comment was a tangent about self-driving cars, which while relevant, I didn't want to read about.
>However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.
>Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
I assume the software was originally tested to make sure it works in case of failure. It would be interesting to know exactly what the bug was and why it didn't show in tests.
Network management software complexity is supposed to be one of things that SDN was built to solve (by introducing more modularity and defined interfaces). But in this case the fault was at the edge with BGP route updates, which the internet has been doing for decades. I share your curiosity in the specific bug.
However, this is a great detailed post-mortem from a service provider. Your Telco or ISP will never provide this much detail...
This is very interesting. From the little I understand (sorry for using AWS terms as I am more versed in AWS than GCE), this could happen to AWS as well, right? Even if your software is deployed to multiple AZs / multiple regions, if bad routing / network configuration makes it through the various protection mechanisms, then basically no amount of redundancy can help if your service is part of the non-functional IP block. It seems that no matter how redundant you are, there will always be a single point of failure somewhere along the line; even if it has multiple mechanisms to prevent it from failing, if all of those mechanisms fail, it's still a single point. What prevents this from happening at Azure / AWS? Is there anything that general internet routing protocols need to change to prevent it from happening?
e.g. I'm sure we will never hear that Bank of X transferred a billion dollars to an account but, because of propagation errors, published only the credit and didn't finish the debit, so now we have two billionaires. This two-or-more-phase commit is pretty much bulletproof in banking as far as I know, and banks are not known to be technologically more advanced than Google, so how come internet routing is so prone to errors that can make an entire cloud service unavailable for even a small period of time?
I'm far from knowing much about networking (although I took some graduate networking courses, I still feel I know practically nothing about it...)
So I would appreciate if someone versed in this ELI5 whether it can happen in AWS and Azure regardless of how redundant you are, (which leads to a notion of cross cloud provider redundancy which I'm sure is used in some places) and whether the banking analogy is fair and relevant, and if there are any RFCs to make world-blackout routing nightmares less likely to happen.
Of course things such as mismatched account balances (i.e. the account balance does not equal all credits - debits) or erroneous postings due to bugs happen in the banking IT world. It's just that they are not that visible because only the people affected and that are quick enough in checking their balances will learn about it and after a few hours or days, when they notice the mistake, they will just fix the entries. (And if you thought you were clever and transferred all the funny money away, they are going to sue you to get it back; see e.g. this Quora thread for bank errors and legality: https://www.quora.com/If-my-bank-mistakenly-deposits-1-000-0...)
EDIT: Also, to answer the question: I think distributed computing is hard. The bank will usually have all their account balances on one huge central mainframe in one location, so you do not need to rely on computers talking to each other. And also, a bank does not really need to publish credits and debits at the same time - they just have to make sure your account is debited at or before the other account is credited (in fact, with most money transfers between banks there will be days between these two). So they can just debit your account, check whether this has worked and then send the money on its journey afterwards and be done with it. If a bug happens and the money does not show up at the recipient, they will complain, the bank can look into it and fix it - no (or not much to the bank, anyways) harm done.
I'm not sure the AWS network follows the same setup, AWS has very distinct blocks between the US/EU/APAC compared to GCP where you can inherit the same IP if you quickly delete/recreate instances in different regions?
> . Internal monitors generated dozens of alerts in the seconds after the traffic loss became visible at 19:08 ... revert the most recent configuration changes ... the time from detection to decision to revert to the end of the outage was thus just 18 minutes.
It's certainly good that they detected it as fast as they did. But I wonder if the fix time could be improved upon? Was the majority of that time spent discussing the corrective action to be taken? Or does it take that much time to replicate the fix?
Having worked in ISP operations on BGP stuff (admittedly more than 10 years ago), it was both too slow and too fast.
If the rollout took 12 hours instead of 4 or the VPN failure to total failure was multiple hours instead of minutes, they'd have had enough time to noodle it out. Eventually at a slow enough deploy rate they'd have figured it out. It only took 18 hours to make the final report after all, so an even slower 24 hour deploy would have been slow enough, if enough resources were allocated.
On the opposite side, most of the time when you screw up routing the punishment is extremely brutal and fast. If the whole thing croaked in five minutes, "OK, who hit enter within the last ten minutes..." and five minutes later it's all undone. What happened instead was: dude hit enter, all was well hours later, although average latency was increasing very slowly as anycast sites shut down. Maybe there's even a shift change in the middle. Finally, hours later, it all hit the fan, meanwhile the guy who hit enter is thinking "it can't be me, I hit enter over four hours ago followed by three hours of normal operation... must be someone else's change or a memory leak or a novel cyberattack or ..."
Theoretically if you're going to deploy anycast you could deploy a monitoring tool that traceroutes to see that each site is up; however, you deploy anycast precisely so that it never drops... It's the Titanic effect: this thing is unsinkable, so why would you bother checking to see if it's sinking? And just like the Titanic, if you break them all in the same accident, that sucker is eventually going down, even if it takes hours to sink.
Hmm. Seems like this begs for a different way to solve the problem, like alarming on major changes to configuration files, or better recognition of invalid configs, i.e. Google should be able to make a rule that says "if I ever blackhole x% of my network then alarm"...
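e.g. something like this, where the threshold and the way reachability is measured are both made up:

    # Sketch of the "if I ever blackhole x% of my network, alarm" rule.

    def blackhole_alert(total_prefixes, withdrawn_prefixes, threshold=0.05):
        fraction = withdrawn_prefixes / total_prefixes
        return fraction >= threshold

    # A single boring prefix disappearing stays below the threshold;
    # losing most of your announced space should page someone immediately.
    print(blackhole_alert(total_prefixes=500, withdrawn_prefixes=1))    # False
    print(blackhole_alert(total_prefixes=500, withdrawn_prefixes=350))  # True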
The first one is alarm fatigue. Like the "Terror Thermometer" or whatever it's called, where we're in eternal mauve alert, meaning nothing to anyone. All our changes are color-coded as magenta now. Or it's turned down such that one boring little IP block isn't a major change. After all, it isn't. Of course you (us) developers could run crazy important multinational systems on what to us networking guys was one boring little IP block; who cares about such a small block of space.
The second one is covered in the article, their system for that purpose crashed and then the system that babysits that crashed and then whatever they use to monitor the monitors monitor system didn't notice. Probably showed up in some dude's nightly syslog file dump the next day. Oh well. If your monitor tool breaks due to complexity (as they often do) it needs to simplicate and add lightness not slather more complexity on. Usually monitoring is more complicated and less reliable than operating, its harder computationally and procedurally to decide right from wrong than to just do it.
The odds of cascaded failure happening are very low. Given fancy enough backup systems that means all problems will be weird cascaded failure modes. That might be useful in training.
When I was doing this kind of stuff I was doing higher-level support, so (see above) at least some of my stories are of the weird cascaded "impossible" variety. A slower rollout would have saved them. Working all by myself, I like to think I could have figured it out by comparing BGP looking glass results and traceroute outputs from multiple very slowly arriving latency reports to router configs, with papers all over my desk and multiple monitors, in at most maybe two days. Huh, it's almost like anycast isn't working at more sites every couple hours, huh. Of course their automated deployment completes in only 4 hours, which means all problems that take "your average dude" more than 4 hours of BAU time to fix are going to totally explode the system and result in headlines instead of a weird bump on a graph somewhere. Given that computers are infinitely patient, slowing down the rollout of automated deployments from 4 hours to 4 days would have saved them for sure. Don't forget that normal troubleshooting shops will blow the first couple hours on written procedures and scripts, because honestly most of the time those DO work. So my ability to figure it out all by myself in 24 hours is useless if the time from escalation to hitting the fan was only an hour, because they roll out so fast. Once it hit the fan, a total company effort fixed it a lot faster than I could have fixed it as an individual.
Or the strategy I proposed where computers are also infinitely fast: roll out in five minutes, one minute to say WTF, five minutes to roll back; an 11-minute outage is better than what they actually got. It's not like Google is hurting for computational power. Or money.
I'm sure there are valid justifications for the awkward four-hour rollout that's both too fast and too slow. I have no idea what they are, but the Google guys probably put some time into thinking about it.
From the rest of the post, it sounds like replication time. Datacenters started dropping an hour beforehand one by one, and they had all fallen over by 19:08. Given that you have to push the rollback to routers around the world, and that peer routers have to propagate the changes from there, 18 minutes for a change like this sounds about right.
... although once the first datacenter once again announced the prefixes into BGP, those networks would have been reachable again, from everywhere. I imagine this is what happened at 19:27 -- the first datacenter came back online.
Of course, the traffic load might have overwhelmed that single datacenter but that would be alleviated as soon as additional datacenters came back online ("announced the prefixes"). A portion of the traffic load would shift to each new datacenter as it came back online.
It could have been hours later before they were all operational again but, as far as the users were concerned, the service was up and running and back to normal as soon as the first one or two datacenters came back up.
Well, perhaps, but -- to be clear, I'm not suggesting that there is a problem, rather gathering more information in order to determine whether there's a more optimal solution.
e.g. if the detection mechanism latency is ~60s but the time-to-resolve is 18 mins, then I wonder: "how good could the best possible recovery system be?" Implicit in this question is that I think the answer to my question could just as easily be "19 minutes" as it could "5 minutes."
It's not a bias if I'm asking questions in order to improve the system. Could this fault have been predicted? Yes, IMO it could have. I believe that the fault in this case is grossly summarized as "rollback fails to rollback."
What if the major driver of the 18 minute latency was getting the right humans to agree that "execute recovery plan Q" was the right move? If that were the case then perhaps another item to learn could be "recovery policy item 23: when 'rollback fails to rollback', summon at least 3 of 5 Team Z participants and get consensus on recovery plan." And then maybe there could be a corresponding "change policy item 54: changes shall be barred until/unless 5 participants of Team Z are 'available'"
But that's all moot, if "fastest possible recovery [given XYZ constraints of BGP or whatever system] is ~16 minutes." Which it sounds like may indeed be the case.
> Finally, to underscore how seriously we are taking this event, we are offering GCE and VPN service credits to all impacted GCP applications equal to (respectively) 10% and 25% of their monthly charges for GCE and VPN.
These credits exceed what is promised by Google Cloud in their SLA's for Compute Engine and VPN service!
... which is precisely (almost word-for-word) what the post-mortem goes on to say. Is there something specific you're trying to call attention to here?
Traynor was quoted in a networkworld article last year saying they aim for three and a half nines (99.95%). But you need to read into the incidents more carefully -- figuring out actual "uptime" is quite hard. Consider the longest-lasting incident:
"On Tuesday 23 February 2016, for a duration of
10 hours and 6 minutes, 7.8% of Google Compute Engine
projects had reduced quotas. ... Any resources that
were already created were unaffected by this issue."
I'm not sure off the top of my head how I'd try to compute the overall availability #s from that one. One can possibly try to determine and sum the effects on the individual customers, but we can't from the information provided. But it's certainly less overall downtime than just counting it as a 10 hour failure.
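One way to approximate it, though certainly not how Google actually computes availability, is to weight the duration by the fraction of projects affected:

    # Assumed approximation: weighted downtime = duration * fraction affected.

    def weighted_downtime_minutes(duration_min, fraction_affected):
        return duration_min * fraction_affected

    incident = weighted_downtime_minutes(duration_min=10 * 60 + 6, fraction_affected=0.078)
    month_minutes = 30 * 24 * 60
    print(f"{incident:.0f} weighted minutes, availability ~{1 - incident / month_minutes:.5f}")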
Agreed. It is difficult to tell. But if the bug is preventing you from processing (because you can't save the existing results) then it's essentially down time for new processing. There are also connectivity issues by region and DNS issues. It is difficult to get exact downtime considering partial failures.
That said, this is the second major asia-east1 downtime in 90 days:
April's incident is unique. It was the only case (listed) that was a service outage impacting all of GCE.
The other incidents (as far as I can tell), were service disruptions at the AZ/regional level. Those disruptions don't impact the 9's, as GCE was available for other regions.
> However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration
> Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process
I'm sure the devil is in the details, but generally speaking, these are 2 instances of critical code that gets exercised infrequently, which is a good place for bugs to hide.
But once there are strong consequences for downtime the service provider is going to set up training, monitoring, oncall, etc to make sure things stay within the SLA limits. So you are effectively negotiating uptime.
The only SLAs that matter are the ones where the service provider will suffer serious $ penalties on breaking the SLA. Which rules out basically all major cloud providers, since they will simply issue credit for the downtime.
(As background, the author, MIT Prof. Nancy Leveson, summarizes decades of work in the field, offers groundbreaking new theoretical tools that scale up to some of the world's most complex accidents, and has the experience and evidence to back up their relevance e.g. via work on Therac-25, the Columbia Space Shuttle, and Deepwater Horizon to name just a few...)
> However, in this instance a previously-unseen software bug was triggered, and instead of retaining the previous known good configuration, the management software instead removed all GCE IP blocks from the new configuration and began to push this new, incomplete configuration to the network.
Always test your crash / exception handling / special case termination+recovery code in production.
I have seen this too often. Most often in "every day" cases where a service has a "nice" catch-based way of stopping and recovering, and then a separate "if killed by SIGKILL/immediate power failure" crash-and-recover path. That last bit never gets tested, yet it runs in production.
One day a power failure happens, the service restarts and tries to recover. Code that almost never runs now runs, and the whole thing goes into an unknown broken state.
> Configuration
>
> Configuration bugs, not code bugs, are the most common cause I’ve seen of really bad outages. When I looked at publicly available postmortems, searching for “global outage postmortem” returned about 50% outages caused by configuration changes. Publicly available postmortems aren’t a representative sample of all outages, but a random sampling of postmortem databases also reveals that config changes are responsible for a disproportionate fraction of extremely bad outages. As with error handling, I’m often told that it’s obvious that config changes are scary, but it’s not so obvious that most companies test and stage config changes like they do code changes.
It's a shame it's not easier or more common for people to create clones of (most|all) of their infrastructure for testing purposes.
Something like half of outages are caused by configuration oopsies.
If you accept that configuration is code, then you also come to the following disturbing conclusion: the usual test environment for critical network-related code in most environments is the production environment.
The main issue there is that "environments" are defined by configuration, so if you try to set up a configuration test environment, you run into a direct logical impasse: either your configs are production configs, and thus not a separate environment, or they're different from production configs, and thus may provide different test results from production.
While I agree with you, I think we could get closer to "production" than is common right now.
In an AWS environment, imagine a setup where all that differs is the API keys used (the API keys of the production vs test environment). What gets tricky is dealing with external dependencies, user data, and simulating traffic.
For an example more relevant to today's issue: imagine a second simulated "internet" in a globally distributed lab environment. With BGP configs, fake external BGP sessions, etc, servers receiving production traffic, etc.
I get that it's a lot of work to set up and would require ongoing work to maintain - and that it's hard/impossible to have it correctly simulate the many nuances of real-world traffic - and yet I also think in many cases it would be sufficient to prevent issues from making it into production.
For the amount this cost them, they should have bought CloudFlare. If you play with [global BGP anycast] you are bound to get burned. This is not the first time that BGP took out your entire routing. This is probably not the last time that BGP will take out your entire routing. Whoever's job it was to watch the routing, I am sorry.
Pulling your own worldwide routes because you have too much automation; it will make a good story once it's filtered down a bit! Icarus was barely up in the air, too early for a fall.
(Usual disclaimer: I speak for myself, not for my employer, etc.)
The team in charge of solving this particular problem is located in two sites in two different timezones. This is true of most critical SRE teams at Google, and it is precisely to be able to have 24h coverage in these time sensitive situations.
In the 2+ years I have spent in SRE I have never heard of a single instance of an SRE being asked or even encouraged to stay after hours (let alone overnight) for incident remediation. There is quite a lot of emphasis being put on work/life balance.
Wow, that's amazing to read, having served as a de-facto SRE (like every other SDE) at an unnamed competitor to GCE, where I was expected to stay up all night if necessary to resolve an issue (relatively few teams had follow-the-sun coverage). I swore I would never carry a pager again after that, but maybe Google really is different.
How important for redundancy/quality of service is the feature of advertising each region's IP blocks from multiple points in Google's network? It seems like region isolation is the most important quality that Google's network could provide, and their current design is what made something like this possible, not just the bugs in the configuration propagation. They mention the ability of the internet to route around failures, so why not rely on that instead?
As devops Borat was saying all along, automated propagation of an error is the main root cause here. An error (new configuration) should be rolled out site by site - ok us-east1, move onto us-west1... ok, move onto... . A canary site may be the first in the sequence, yet success ("no failure reported") can't be a big "ok" for an automated push to all sites at the same time.
I hope that one of their solutions is the obvious one; make change control testing a closed loop instead of an open loop. (Watch for /success/ reported instead of failure notification.)
Google has contributed ISIS and BGP code to Quagga in the past, as well as funding some testing at the OSRF. Presumably they use it in at least some parts of their operations.
> Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
Perhaps the progressive rollout should wait for an affirmative conclusion instead of assuming no news is good news? I'm not being snarky, there may be some reason they don't do this.
Presumably it received a false positive (or it was interpreted as such). This really seems like the root cause, and I suspect a case of happy path engineering striking again.
TLDR; they simply didn't test their (global!) custom route announcement management software. An edge case was triggered in production, and they gee-whiz-automatically went offline. Epic fail.
"In other words, they simply didn't test their (global!) custom route announcement management software. An edge case was triggered in production, and unsurprisingly they automatically went offline."
Upvoted. I think they should put a soft version of this right on the first line, instead of burying it in an ocean of "harmless", "previously unseen" text dances.
DRY
"The inconsistency was triggered by a timing quirk in the IP block removal - the IP block had been removed from one configuration file, but this change had not yet propagated to a second configuration file also used in network configuration management."
I think most people are missing the main failure point: Why does one change propagate automatically to all regions?
All this could have been contained if they deployed changes to different regions at different times. That would also help with screwing your overseas users less by running maintenance at 10am their local time :-)
> These safeguards include a canary step where the configuration is deployed at a single site and that site is verified to still be working correctly, and a progressive rollout which makes changes to only a fraction of sites at a time, so that a novel failure can be caught at an early stage before it becomes widespread. In this event, the canary step correctly identified that the new configuration was unsafe. Crucially however, a second software bug in the management software did not propagate the canary step’s conclusion back to the push process, and thus the push system concluded that the new configuration was valid and began its progressive rollout.
The system does do progressive rollouts, which are essentially what you are referring to (albeit perhaps at a different pace). The number of changes being rolled out means that it's not really feasible to hand roll out configurations to different regions, so the checks are automated. In this case, the automated checks failed as well.
Waiting a longer time between regional rollouts (so monitoring systems would have time to detect serious failures) would sacrifice deployment latency, but not deployment throughput (assuming deployments can be made in parallel). For continuous deployment, throughput really matters more than latency.
I'm not sure you really understand what I've tried to say, but it's probably my fault because of my poor grasp of the English language.
You are just confirming my previous comment. Your rollouts are automated, so pushing a change automatically configures every region, instead of configuring just one and maybe waiting a prudent amount of time, on a human scale, before the next one, because, surprise!, shit happens.
I understand your colleagues probably make lots of changes, but if that introduces risks of global outages IMHO you should reconsider your strategy.
And I'm not sure why you downvoted my previous comment. It's a perfectly valid observation, based on the published information.
Making software is hard....