Sharing details on a recent incident impacting one of our customers

snewman · 2024-05-24T21:00:47 1716584447

Given the level of impact that this incident caused, I am surprised that the remediations did not go deeper. They ensured that the same problem could not happen again in the same way, but that's all. So some equivalent glitch somewhere down the road could lead to a similar result (or worse; not all customers might have the same "robust and resilient architectural approach to managing risk of outage or failure").

Examples of things they could have done to systematically guard against inappropriate service termination / deletion in the future:

1. When terminating a service, temporarily place it in a state where the service is unavailable but all data is retained and can be restored at the push of a button. Discard the data after a few days. This provides an opportunity for the customer to report the problem.

2. Audit all deletion workflows for all services (they only mention having reviewed GCVE). Ensure that customers are notified in advance whenever any service is terminated, even if "the deletion was triggered as a result of a parameter being left blank by Google operators using the internal tool".

3. Add manual review for any termination of a service that is in active use, above a certain size.
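A minimal sketch of how those three guards could fit together, not a description of how GCP works. The names (`Service`, `terminate`, `purge_if_expired`), the 7-day retention window and the size threshold are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

RETENTION = timedelta(days=7)   # assumed grace period ("a few days")
REVIEW_THRESHOLD_GB = 1_000     # assumed size above which a human must approve


class State(Enum):
    ACTIVE = "active"
    SUSPENDED = "suspended"     # service unavailable, data retained
    PURGED = "purged"           # data discarded


@dataclass
class Service:
    name: str
    size_gb: int
    state: State = State.ACTIVE
    suspended_at: Optional[datetime] = None


def terminate(svc: Service, now: datetime,
              customer_notified: bool, human_approved: bool) -> None:
    """Suspend the service instead of deleting it outright."""
    if not customer_notified:
        raise PermissionError("customer must be notified before termination")
    if svc.size_gb >= REVIEW_THRESHOLD_GB and not human_approved:
        raise PermissionError("large service: manual review required")
    svc.state = State.SUSPENDED
    svc.suspended_at = now


def restore(svc: Service) -> None:
    """The 'push of a button' undo, valid only during the grace period."""
    if svc.state is not State.SUSPENDED:
        raise ValueError("only a suspended service can be restored")
    svc.state = State.ACTIVE
    svc.suspended_at = None


def purge_if_expired(svc: Service, now: datetime) -> bool:
    """Discard data only once the retention window has fully elapsed."""
    if svc.state is State.SUSPENDED and now - svc.suspended_at >= RETENTION:
        svc.state = State.PURGED
        return True
    return False
```

The point of the shape is that no code path goes from ACTIVE straight to PURGED; any bug or blank parameter upstream can at worst suspend the service.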

Absent these broader measures, I don't find this postmortem to be in the slightest bit reassuring. Given the are-you-f*ing-kidding-me nature of the incident, I would have expected any sensible provider who takes the slightest pride in their service, or even is merely interested in protecting their reputation, to visibly go over the top in ensuring nothing like this could happen again. Instead, they've done the bare minimum. That says something bad about the culture at Google Cloud.

steveBK123 · 2024-05-25T13:19:07 1716643147

>> 1. When terminating a service, temporarily place it in a state where the service is unavailable but all data is retained and can be restored at the push of a button. Discard the data after a few days. This provides an opportunity for the customer to report the problem.

This is so obviously "enterprise software 101" that it is telling Google is operating in 2024 without it.

Since my new hire grad days, the idea of immediately deleting data that is no longer needed was out of the question.

Soft deletes in databases with a column you mark as deleted. Move/rename data on disk until you're super duper sure you need to delete it (and maybe still let the backup remain). Etc.
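For anyone who hasn't seen the pattern, here is a rough sketch of the "column you mark" approach using Python's built-in `sqlite3`. The table, view and function names are hypothetical, and the 30-day purge window used in the example is an assumption.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " deleted_at TEXT)"          # NULL means the row is live
    )
    # The rest of the app reads from this view and never sees deleted rows.
    conn.execute(
        "CREATE VIEW live_accounts AS"
        " SELECT id, name FROM accounts WHERE deleted_at IS NULL"
    )
    return conn


def soft_delete(conn: sqlite3.Connection, account_id: int) -> None:
    """Mark the row as deleted; nothing is removed yet."""
    conn.execute(
        "UPDATE accounts SET deleted_at = datetime('now')"
        " WHERE id = ? AND deleted_at IS NULL",
        (account_id,),
    )


def undelete(conn: sqlite3.Connection, account_id: int) -> None:
    """Clear the marker so the row shows up in live_accounts again."""
    conn.execute("UPDATE accounts SET deleted_at = NULL WHERE id = ?", (account_id,))


def purge(conn: sqlite3.Connection, older_than_days: int) -> int:
    """Hard-delete rows that were soft-deleted at least N days ago."""
    cur = conn.execute(
        "DELETE FROM accounts"
        " WHERE deleted_at IS NOT NULL AND deleted_at <= datetime('now', ?)",
        (f"-{older_than_days} days",),
    )
    return cur.rowcount
```

Calling `purge(conn, 30)` from a scheduled job would be the only place a hard delete ever happens.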

crazygringo · 2024-05-25T13:55:18 1716645318

It sounds like the problem is that the deletion was configured with an internal tool that bypassed all those kinds of protections -- that went straight to the actual delete. Including warnings to the customer, etc.

Which is bizarre. Even internal tools used by reps shouldn't be performing hard deletes.

And then I'd also love to know how the heck a default value to expire in a year ever made it past code review. I think that's the biggest howler of all. How did one person ever think there should be a default like that, and how did someone else see it and say yeah that sounds good?

roughly · 2024-05-25T21:21:00 1716672060

> This is so obviously "enterprise software 101" that it is telling Google is operating in 2024 without it.

My impression of GCP generally is that they've got some very smart people working on some very impressive advanced features and all the standard boring stuff nobody wants to do is done to the absolute bare minimum required to check the spec sheet. For all its bizarre modern enterprise-ness, I don't think Google ever really grew out of its early academic lab habits.

steveBK123 · 2024-05-25T22:14:16 1716675256

I know a bunch of way-too-smart PhD types who worked at GOOG exclusively in R&D roles that they bragged to me, earnestly, were not revenue generating.

jononor · 2024-05-26T18:53:15 1716749595

It makes sense for companies to have 1-10% of resources allocated for that. At Google scale that is thousands of people.

steveBK123 · 2024-05-27T19:37:19 1716838639

What if it's a lot more than 1-10% at GOOG?

Why haven't I met anyone who proudly works on revenue generating product at GOOG compared to the several R&Ders I know from different social circles?

jononor · 2024-05-27T22:52:14 1716850334

Sure it might well be. Ads and Cloud are the main things making money at Google. Those have very high profit margins. So it might be 90% at Google are just spending the money earned by the 10% that are directly bringing in the dough :)

nikanj · 2024-05-25T13:30:16 1716643816

There are many voices in the industry arguing against soft deletes. Mostly coming from a very Chesterton's Fence perspective.

For some examples https://www.metabase.com/learn/analytics/data-model-mistakes...

https://www.cultured.systems/2024/04/24/Soft-delete/

https://brandur.org/soft-deletion

Many more can easily be found.

snewman · 2024-05-25T13:41:39 1716644499

For the use case we're discussing here, of terminating an entire service, the soft delete would typically be needed only at some high level, such as on the access list for the service. The impact on performance, etc. should be minimal.

steveBK123 · 2024-05-25T13:43:55 1716644635

Precisely, before you delete a customer account, you disable its access to the system. This is a scream test.

Once you've gone through some time and due diligence you can contemplate actually deleting the customer data and account.

Nathanba · 2024-05-26T01:10:13 1716685813

I think the reason why someone wouldn't want to do this is because it will cost Google money to keep it active on any level.

danparsonson · 2024-05-25T14:10:16 1716646216

OK, but those examples you gave all boil down to the following:

1. you might accidentally access soft-deleted data and/or the data model is more complicated

2. data protection

3. you'll never need it

to which I say

1. you'll make all kinds of mistakes if you don't understand the data model, and, it's really not that hard to tuck those details away inside data access code/SPs/etc that the rest of your app doesn't need to care about

2. you can still delete the data later on, and indeed that may be preferable as deleting under load can cause performance (e.g. locking) issues

3. at least one of those links says they never used it, then gives an example of when soft-deleted data was used to help recover an account (albeit by creating a new record as a copy, but only because they'd never tried an undelete before and were worried about breaking something; sensible but not exactly making the point they wanted to make)

So I'm gonna say I don't get it; sure it's not a panacea, yes there are alternatives, but in my opinion neither is it an anti-pattern. It's just one of dozens of trade-offs made when designing a system.

jiveturkey · 2024-05-25T23:07:16 1716678436

GDPR compliance precludes such an approach

rezonant · 2024-05-25T00:06:39 1716595599

Hard agree. They clearly were more interested in making clear that there's not a systemic problem in how GCP's operators manage the platform, which read strongly and alarmingly as though there is a systemic problem in how GCP's operators manage the platform. The lack of the common sense measures you outline in their postmortem just tells me that they aren't doing anything to fix it.

ok_dad · 2024-05-25T05:45:42 1716615942

“There’s no systemic problem.”

Meanwhile, the operators were allowed to leave a parameter blank and the default was to set a deletion time bomb.

Not systemic my butt! That’s a process failure, and every process failure like this is a systemic problem because the system shouldn’t allow a stupid error like this.
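To make the "shouldn't allow" part concrete, here is a small sketch of input handling where a blank term is an error rather than a silent default. `parse_term` and its `'never'` / `'<N>d'` syntax are invented for illustration; nothing here reflects the actual internal tool.

```python
from datetime import timedelta
from typing import Optional


def parse_term(raw: Optional[str]) -> Optional[timedelta]:
    """Parse an operator-supplied term.

    'never' -> None (no expiry)
    '<N>d'  -> N days, N > 0
    blank   -> ValueError; there is deliberately no default
    """
    if raw is None or not raw.strip():
        raise ValueError("term is required: pass 'never' or e.g. '365d'")
    value = raw.strip().lower()
    if value == "never":
        return None
    days = value[:-1]
    if value.endswith("d") and days.isdigit() and int(days) > 0:
        return timedelta(days=int(days))
    raise ValueError(f"unrecognised term {raw!r}: pass 'never' or e.g. '365d'")
```

With this shape the operator has to type out either an expiry or `never`, so a one-year time bomb can only exist if someone asked for it in so many words.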

joshuamorton · 2024-05-25T20:44:15 1716669855

If you're arguing that that was the systemic problem, then it's been fully fixed, as the manual operation was removed and so validation can no longer be bypassed.

rezonant · 2024-05-26T12:37:38 1716727058

I think you glossed over the importance of the term process failure.

The idea is that this one particular form missing the appropriate care is indicative of a wider lack of discipline amongst the engineers building it.

Definitionally, you cannot solve a process problem by fixing a specific bug.

joshuamorton · 2024-05-26T17:18:21 1716743901

"we removed the system that can enable a process failure" fixes the process failure. I didn't misunderstand anything.

cowboylowrez · 2024-05-27T21:58:25 1716847105

joshua nails it. companies do not do root cause anymore. it's just the great enshittification, google edition.

sangnoir · 2024-05-25T01:33:44 1716600824

> When terminating a service, temporarily place it in a state where the service is unavailable but all data is retained and can be restored at the push of a button. Discard the data after a few days. This provides an opportunity for the customer to report the problem

Replacing actual deletion with deletion flags may lead to other fun bugs like "Google Cloud fails to delete customer data, running afoul of EU rules". I suspect Google would err on the side of accidental deletions rather than accidental non-deletions: at least in the EU.

mcherm · 2024-05-25T09:46:56 1716630416

> I suspect Google would err on the side of accidental deletions rather than accidental non-deletions: at least in the EU.

I certainly hope not, because that would be incredibly stupid. Customers understand the significance of different kinds of risk. This story got an incredible amount of attention among the community of people who choose between different cloud services. A story about how Google had failed to delete data on time would not have gotten nearly as much attention.

But let us suppose for a moment that Google has no concern for their reputation, only for their legal liability. Under EU privacy rules, there might be some liability for failing to delete data on schedule -- although I strongly suspect that the kind of "this was an unavoidable one-off mistake" justifications that we see in this article would convince a court to reduce that liability.

But what liability would they face for the deletion? This was a hedge fund managing billions of dollars. Fortunately, they had off-site backups to restore their data. If they hadn't, and it had been impossible to restore the data, how much liability could Google have faced?

Surely, even the lawyers in charge of minimizing liability would agree: it is better to fail by keeping customers' accounts than to fail by deleting them.

shadowgovt · 2024-05-30T18:26:12 1717093572

> Customers understand the significance of different kinds of risk.

Customers do; the law does not. The GDPR introduces unintended consequences.

> Surely, even the lawyers in charge of minimizing liability would agree: it is better to fail by keeping customers' accounts than to fail by deleting them.

Not at all. Those in charge of enforcing the GDPR are heavily incentivized to assume the opposite of that. Google accidentally losing customer data is a win for privacy as far as the law's intent.

rlpb · 2024-05-25T10:51:01 1716634261

A deletion flag is acceptable under EU rules. For example, they are acceptable as a means of dealing with deletion requests for data that also exists in backups. Provided that the restore process also honors such flags.
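As a toy illustration of a restore process that honors such flags (the function name and the in-memory data shapes are assumptions, not any real backup product's API):

```python
from typing import Dict, Set


def restore_from_backup(backup: Dict[str, dict], tombstones: Set[str]) -> Dict[str, dict]:
    """Return the backup contents minus any record flagged for deletion.

    `tombstones` holds the IDs of records whose deletion was requested
    after the backup was taken; they must not reappear on restore.
    """
    return {rec_id: rec for rec_id, rec in backup.items() if rec_id not in tombstones}
```

The backup itself is never rewritten; the flag list travels alongside it and is applied every time a restore runs.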

pm90 · 2024-05-25T03:23:25 1716607405

I highly doubt this was the reason. Google has similar deletion protection for other resources eg GCP projects are soft deleted for 30 days before being nuked.

boesboes · 2024-05-25T09:27:06 1716629226

Not really how it works. GDPR protects individuals and allows them to request deletion from the data owner, who then needs to respond to any request within 60(?) days. Google has nothing to do with that beyond having to make sure their infra is secure. There even are provisions for dealing with personal data in backups.

EU law has nothing to do with this.

phito · 2024-05-25T00:08:34 1716595714

It's a joke that they're not doing these things. How can you be a giant cloud provider and not think of putting safeguards around data deletion? I guess that realistically they thought of it many times but never implemented it because it costs money.

pm90 · 2024-05-25T03:25:40 1716607540

It’s probably because implementing such safeguards wouldn't help anyone's promo packet.

I really dislike that most of our major cloud infrastructure is provided by big tech rather than e.g. infrastructure vendors. I trust Equinix a lot more than Google because that's all they do.

Thorrez · 2024-05-25T03:48:47 1716608927

I work in GCP and have seen a lot of OKRs about improving reliability. So implementing something like this would help someone's promo packet.

cbarrick · 2024-05-25T04:22:20 1716610940

This is exactly the kind of work that would get SREs promoted.

passion__desire · 2024-05-25T05:01:57 1716613317

It is funny Google has internal memegen but not ideagen. Ideate away your problems, guys.

metadat · 2024-05-25T03:32:25 1716607945

Understandable, however public clouds are a huge mix of both hardware and software, and it takes deep proficiency at both to pull it off. Equinix are definitely in the hardware and routing business.. may be tough to work upstream.

Hardware always gets commoditized to the max (sad but true).

lima · 2024-05-25T10:24:56 1716632696

As a customer of Equinix Cloud... No thank you. Infrastructure vendors are terrible software engineers.

mehulashah · 2024-05-25T14:42:52 1716648172

I’m completely baffled by Google’s “postmortem” myself. Not only is it obviously insufficient to anyone that has operated online services as you point out, but the conclusions are full of hubris. I.e. this was a one time incident, it won’t happen again, we’re very sorry, but we’re awesome and continue to be awesome. This doesn’t seem to help Google Cloud’s face-in-palm moment.

playingalong · 2024-05-25T19:04:17 1716663857

It looks like they could read the SRE book by Google. BTW available for free at https://sre.google/sre-book/table-of-contents/

A bit chaotic (a mix of short essays) and simplistic (assuming one kind of approach or design), but definitely still worth a read. No exaggeration to state it was category defining.

ajross · 2024-05-25T13:44:07 1716644647

FWIW, you're solving the bug by fiat, and that doesn't work. Surely analogs to all those protections are already in place. But a firm and obvious requirement of a software system that is capable of deleting data is the ability to delete data. And if it can do it, you can write a bug that short-circuits any architectural protection you put in place. Which is the definition of a bug.

Basically I don't see this as helpful. This is just a form of the "I would never have written this bug" postmortem response. And yeah, you would. We all would. And do.

belter · 2024-05-25T18:55:32 1716663332

Can you imagine if there was no backup? Would Google be on the hook to cover the +/- 200 billion in losses?

This is why the smart people at Berkshire Hathaway don't offer Cyber Insurance: https://youtu.be/INztpkzUaDw?t=5418

ratorx · 2024-05-25T23:49:46 1716680986

I’d be very surprised if there wasn’t legalese in the contract/ToS about liability limitations etc. Would maybe expect it to be more than infrastructure costs for a big company custom contract, but probably not unlimited/as high as that, because it seems like such a blatant legal risk…

Disclaimer: Am Googler who knows nothing real about this. This is rampant speculation on my part.

PcChip · 2024-05-25T16:02:57 1716652977

Could it have been a VMware expiration setting somewhere, and thus VMware itself deleted the customer’s tenant? If so then Google wouldn’t have a way to prove it won’t happen again except by always setting the expiration flag to “never” instead of leaving it blank

yalok · 2024-05-25T16:28:28 1716654508

I would add one more -

4. Add an option to auto-backup all the data from the account to an outside backup service of the user's choice.

This would help not just with this kind of accident, but also with any kind of data corruption/availability issue.

I would pay for this even for my personal gmail account.

2OEH8eoCRo0 · 2024-05-24T22:43:31 1716590611

That sounds reasonable. Perhaps they felt that a larger change to process would be riskier overall.

TheCleric · 2024-05-25T00:03:24 1716595404

No it would probably be even worse from Google’s perspective: more expensive.

markhahn · 2024-05-25T14:43:32 1716648212

most of this complaint is explicitly answered in the article. must have been TL...

Ocha · 2024-05-25T00:23:59 1716596639

I wouldn’t be surprised if VMware support is getting deprecated in GCP so they just don’t care - waiting for all customers to move off of it

snewman · 2024-05-25T00:58:23 1716598703

My point is that if they had this problem in their VMware support, they might have a similar problem in one of their other services. But they didn't check (or at least they didn't claim credit for having checked, which likely means they didn't check).

dekhn · 2024-05-24T22:42:06 1716590526

If you're a GCP customer with a TAM, here's how to make them squirm. Ask them what protections GCP has in place, on your account, that would prevent GCP from inadvertently deleting large amounts of resources if GCP makes an administrative error.

They'll point to something that says this specific problem was alleviated (by deprecating the tool that did it, and automating more of the process), and then you can persist: we know you've fixed this problem, so as a follow-up, will a human review a large-scale deletion before the resources are deleted?

From what I can tell (I worked for GCP aeons ago, and have been an active user of AWS for even longer) GCP's human-based protection measures are close to non-existent, and much weaker than AWS's. Either way, it's definitely worth asking your TAM about this very real risk.

RajT88 · 2024-05-24T23:58:29 1716595109

Give 'em hell.

This motivates the TAMs to learn to work the system better. They will never be able to change things on their own, but sometimes you get escalation path promises and gentlemen's agreements.

Enough screaming TAMs may eventually motivate someone high up to take action. Someday.

ethbr1 · 2024-05-25T02:56:24 1716605784

Way in which TAMs usually actually fix things:

   - Single customer complains loudly
   - TAM searches for other customers with similar concerns
   - Once total ARR is sufficient...
   - It gets added to dev's roadmap

nikanj · 2024-05-25T13:32:11 1716643931

- It gets closed as wontfix. Google never hires a human to do a job well, if AI/ML can do the same job badly

mulmen · 2024-05-25T16:35:21 1716654921

Does the ARR calculation consider the lifetime of cloud credits given to burned customers to prevent them from moving to a competitor?

In other words can UniSuper be confident in getting support from Google next time?

ethbr1 · 2024-05-25T19:46:57 1716666417

My heart says absolutely not.

RajT88 · 2024-05-25T05:52:50 1716616370

If you are lucky!

I work for a CSP (not GCP) so I may be a little cynical on the topic.

ethbr1 · 2024-05-25T12:54:23 1716641663

Steps 3 and 4 are usually the difficult ones.

DominoTree · 2024-05-25T05:49:42 1716616182

Pitch it as an opportunity for a human at Google to reach out and attempt to retain a customer when someone has their assets scheduled for deletion. Would probably get more traction internally, and has a secondary effect of ensuring it's clear to everyone that things are about to be nuked.

tkcranny · 2024-05-24T23:24:51 1716593091

> ‘Google teams worked 24x7 over several days’

I don’t know if they get what the seven means there.

shermantanktop · 2024-05-25T02:24:50 1716603890

24 engineers, 7 hours a day. Plus massages and free cafeteria food from a chef.

crazygringo · 2024-05-25T13:50:00 1716645000

Ha, you're right it's a bit nonsensical if you take it completely literally.

But of course x7 means working every day of the week. So you can absolutely work 24x7 from Thursday afternoon through Tuesday morning. It just means they didn't take the weekend off.

ReleaseCandidat · 2024-05-24T23:46:48 1716594408

They worked so much, the days felt like weeks.

rezonant · 2024-05-25T00:10:00 1716595800

I suppose if mitigation fell over the weekend it might still make sense.

Etherlord87 · 2024-05-25T10:18:17 1716632297

Perhaps the team members cycle so that the team was working on the thing without any night or weekend break. Which should be a standard thing at all times for a big project like this IMHO.

iJohnDoe · 2024-05-25T16:28:32 1716654512

Or they offloaded to India like they do for most of their stuff.

shombaboor · 2024-05-25T17:16:19 1716657379

this comment made my day

tempnow987 · 2024-05-24T20:24:15 1716582255

Wow - I was wrong. I thought this would have been something like terraform with a default to immediate delete with no recovery period or something. Still a default, but a third party thing and maybe someone in unisuper testing something and mis-scoping the delete.

Crazy that it really was google side. UniSuper must have been like WHAT THE HELL?

markmark · 2024-05-25T01:38:02 1716601082

The article describes what happened and it had nothing to do with Unisuper. Google deployed the private cloud with an internal Google tool. And that internal Google tool configured things to auto-delete after a year.

rezonant · 2024-05-25T00:10:52 1716595852

One assumes they are getting a massive credit to their GCP bill, if not an outright remediation payment from Google.

abraae · 2024-05-25T00:28:38 1716596918

The effusive praise for the customer in Google's statement makes me think they have free GCP for the next year, in exchange for not going public with their frustrations.

gnabgib · 2024-05-24T19:43:51 1716579831

Related stories:

UniSuper members go a week with no account access after Google Cloud misconfig [0] (186 points, 16 days ago, 42 comments)

Google Cloud accidentally deletes customer's account [1] (128 points, 15 days ago, 32 comments)

[0]: https://news.ycombinator.com/item?id=40304666

[1]: https://news.ycombinator.com/item?id=40313171

foobazgt · 2024-05-24T19:31:27 1716579087

Sounds like a pretty thorough review in that they didn't stop at just an investigation of the specific tool / process, but also examined the rest for any auto deletion problems and also confirmed soft delete behavior.

They could have gone one step further by reviewing all cases of default behavior for anything that might be surprising. That said, it can be difficult to assess what is "surprising", as it's often the people who know the least about a tool/API who also utilize its defaults.

rezonant · 2024-05-25T00:14:48 1716596088

> and also confirmed soft delete behavior.

Where exactly do they mention they have confirmed soft delete behavior systemically? All they said was they have ensured that this specific automatic deletion scenario can no longer happen, and it seems the main reason is because "these deployments are now automated". They were automated before, now they are even more automated. That does zero to assure me that their deletion mechanisms are consistently safe, only that there's no operator at the wheel any more.

x0x0 · 2024-05-24T20:06:29 1716581189

Sounds more like some pants browning because incidents like this are a great reason to just use aws. Like come on:

> After the end of the system-assigned 1 year period, the customer’s GCVE Private Cloud was deleted. No customer notification was sent because the deletion was triggered as a result of a parameter being left blank by Google operators using the internal tool, and not due a customer deletion request. Any customer-initiated deletion would have been preceded by a notification to the customer.

... Tada! We're so incompetent we let giant deletes happen with no human review. Thank god this customer didn't trust us and kept off-gcp backups or they'd be completely screwed.

> There has not been an incident of this nature within Google Cloud prior to this instance. It is not a systemic issue.

Translated to English: oh god, every aws and Azure salesperson has sent 3 emails to all their prospects citing our utter fuckup.

markfive · 2024-05-24T20:27:27 1716582447

> Thank god this customer didn't trust us and kept off-gcp backups or they'd be completely screwed.

Except that, from the article, the customer's backups that were used to recover were in GCP, and in the same region.

ceejayoz · 2024-05-24T20:55:31 1716584131

I'm curious about that bit.

https://www.unisuper.com.au/contact-us/outage-update says "UniSuper had backups in place with an additional service provider. These backups have minimised data loss, and significantly improved the ability of UniSuper and Google Cloud to complete the restoration."

politelemon · 2024-05-24T22:14:20 1716588860

That's the bit that's sticking out to me as contradictory. I'm inclined to not believe what GCP have said here, as an account deletion is an account deletion; why would some objects be left behind?

No doubt this little bit must be causing some annoyance among UniSuper's tech teams.

flaminHotSpeedo · 2024-05-25T03:54:38 1716609278

I'm inclined to not believe GCP because they edited their status updates retroactively and lied in their postmortem about the Clichy fire in Paris not affecting multiple "zones"

graemep · 2024-05-24T22:31:09 1716589869

They had another provider because the regulator requires it. I suspect a lot of businesses in less regulated industries do not.

skywhopper · 2024-05-25T10:09:56 1716631796

I think you misread. Here’s the relevant statement from the article:

“Data backups that were stored in Google Cloud Storage in the same region were not impacted by the deletion, and, along with third party backup software, were instrumental in aiding the rapid restoration.”

janalsncm · 2024-05-24T20:41:44 1716583304

I think it stretches credulity to say that the first time such an event happened was with a multi billion dollar mutual fund. In other words, I’m glad Unisuper’s problem was resolved, but there were probably many others which were small enough to ignore.

I can only hope this gives GCP the kick in the pants it needs.

resolutebat · 2024-05-24T20:46:54 1716583614

GCVE (managed VMware) is a pretty obscure service, it's only used by the kind of multi billion dollar companies that want to lift and shift their legacy VMware fleets into the cloud as is.

joshuamorton · 2024-05-25T20:49:11 1716670151

A critical piece of the incident here was that this involved special customization that most customers didn't have or use, and which bypassed some safety checks. As a result, it couldn't impact "normal" small customers.

crazygringo · 2024-05-25T13:57:50 1716645470

I doubt it, because even a smaller customer would have taken this to the press, which would have picked up on it.

"Google deleted our cloud service" is a major news story for a business of any size.

jawns · 2024-05-24T20:47:51 1716583671

> The customer’s CIO and technical teams deserve praise for the speed and precision with which they executed the 24x7 recovery, working closely with Google Cloud teams.

I wonder if they just get praise in a blog post, or if the customer is now sitting on a king's ransom in Google Cloud credit.

rezonant · 2024-05-25T00:18:40 1716596320

There's no reality where a competent customer isn't going to ensure Google pays for this. I'd be surprised if they have a bill at all this year.

wolfi1 · 2024-05-25T07:06:53 1716620813

there should have been some punitive damage

postatic · 2024-05-25T00:13:57 1716596037

UniSuper customer here in Aus. Didn’t know what it was but kept receiving emails every day while they were trying to resolve this. Only found out from the news what had actually happened. Feels like they downplayed the whole thing as “system downtime”. Imagine if something had actually happened to people’s money and the billions of dollars saved in their superannuation fund.

badcppdev · 2024-06-03T09:22:24 1717406544

Did you get the same emails that other people did? An email almost every day with the words "disruption", "apologies", "frustration" used multiple times.

A few days in an email titled: "A letter from the CEO"

> I am writing to provide you with an update on the disruption to our services.

> Firstly, let me begin by personally apologising for the outage, and thank you for your patience with our teams as they work around the clock to progressively get our systems back online.

I'm really not sure that you could ask for clearer communication at the time or a clearer description of what went on inside Google Cloud.

lukeschlather · 2024-05-25T00:37:56 1716597476

The initial statement on this incident was pretty misleading, it sounded like Google just accidentally deleted an entire GCP account. Reading this writeup I'm reassured, it sounds like they only lost a region's worth of virtual machines, which is absolutely something that happens (and that I think my systems can handle without too much trouble.) The original writeup made it sound like all of their GCS buckets, SQL databases, etc. in all regions were just gone which is a different thing and something I hope Google can be trusted not to do.

wmf · 2024-05-25T00:42:52 1716597772

It was a red flag when UniSuper said their subscription was deleted, not their account. Many people jumped to conclusions about that.

walrus01 · 2024-05-25T18:07:37 1716660457

The idea that you could have an automated tool delete services at the end of a term for a corporate/enterprise customer of this size and scale is absolutely absurd and inexcusable. No matter whether the parameter was set correctly or incorrectly in the first place. It should go through several levels of account manager/representative/management for manual review by a human at the google side before removal.

hiddencost · 2024-05-25T00:40:52 1716597652

> It is not a systemic issue.

I kinda think the opposite. The culture that kept these kinds of problems at bay has largely left the company or stopped trying to keep it alive, as they no longer really care about what they're building.

Morale is real bad.

kjellsbells · 2024-05-25T04:39:12 1716611952

Interesting, but I draw different lessons from the post.

Use of internal tools. Sure, everyone has internal tools, but if you are doing customer stuff, you really ought to be using the same API surface as the public tooling, which at cloud scale is guaranteed to have been exercised and tested much more than some little dev group's scripts. Was that the case here?

Passive voice. This post should have a name attached to it. Like, Thomas Kurian. Palming it off to the anonymous "customer support team" still shows a lack of understanding of how trust is maintained with customers.

The recovery seems to have been due to exceptional good fortune or foresight on the part of the customer, not Google. It seems that the customer had images or data stored outside of GCP. How many of us cloud users could say that? How many of us cloud users have encouraged customers to move further and deeper along the IaaS > PaaS > SaaS curve, making them more vulnerable to total account loss like this? There's an uncomfortable lesson here.

kleton · 2024-05-25T05:17:30 1716614250

> name attached

Blameless (and nameless) postmortems are a cultural thing at google

rohansingh · 2024-05-25T05:44:03 1716615843

That's great internally, but serious external communication with customers should have a name attached and responsibility accepted (i.e., "the buck stops here").

hluska · 2024-05-25T06:16:43 1716617803

Culture can’t just change like that.

saurik · 2024-05-25T17:46:59 1716659219

So, I read your comment and realized that I think it made me misinterpret the comment you are replying to? I thereby wrote a big paragraph explaining how even as someone who cares about personal accountability within large companies, I didn't think a name made sense to assign blame here for a variety of reasons...

...but, then I realized that that isn't what is being asked for here: the comment isn't talking about the nameless "Google operators" that aren't being blamed, it is talking about the lack of anyone who wrote this post itself! There, I think I do agree: someone should sign off on a post like this, whether it is a project lead or the CEO of the entire company... it shouldn't just be "Google Cloud Customer Support".

Having articles that aren't really written by anyone frankly makes it difficult for my monkey brain to feel there are actual humans on the inside whom I can trust to care about what is going on; and, FWIW, this hasn't always been a general part of Google's culture: if this had been a screw up in the search engine a decade ago, we would have gotten a statement from Matt Cutts, and knowing that there was that specific human who cared on the inside meant a lot to some of us.

JCM9 · 2024-05-24T21:03:52 1716584632

The quality and rigor of GCP’s engineering is not even remotely close to that of an AWS or Azure and this incident shows it.

kccqzy · 2024-05-24T22:43:49 1716590629

Azure has had a continuous stream of security breaches. I don't trust them either. It's AWS and AWS alone.

IcyWindows · 2024-05-25T03:49:02 1716608942

Huh? I have seen ones for the rest of Microsoft, but not Azure.

twisteriffic · 2024-05-25T04:34:38 1716611678

One of many https://msrc.microsoft.com/blog/2021/09/additional-guidance-...

justinclift · 2024-05-24T21:21:55 1716585715

And Azure has a very poor reputation, so that bar is not at all high.

SoftTalker · 2024-05-24T22:15:55 1716588955

Honestly I've never worked anywhere that didn't have some kind of "war story" that was told about how some admin or programmer mistake resulted in the deletion of some vast swathe of data, and then the panic-driven heroics that were needed to recover.

It shouldn't happen, but it does, all the time, because humans aren't perfect, and neither are the things we create.

20after4 · 2024-05-25T05:11:35 1716613895

Sure, it's the tone and content of their response that is worrying, more than the fact that an incident happened. What I'd expect is an honest and transparent root cause analysis with technically sound and thorough mitigations, including changes in policy with regard to defaults. Their response seems like only the most superficial, bare-minimum approximation of an appropriate response to deleting a large customer's entire account. If I were on the incident response team I'd be strongly advocating for at least these additional changes:

Make deletes opt-in rather than opt-out. Make all large-scale deletions have some review process with automated tests and a final human review. And not just by some low-level technical employee: the account managers should have seen this on their dashboard somewhere long before it happened. Finally, undertake a thorough and systematic review of other services to look for similar failure modes, especially with regard to anything which is potentially destructive and can conceivably be default-on in the absence of a supplied configuration parameter.

cebert · 2024-05-24T20:57:27 1716584247

> “Google Cloud continues to have the most resilient and stable cloud infrastructure in the world.”

I don’t think GPC has that reputation compared to AWS or Azure. They aren’t at the same level.

sa46 · 2024-05-25T00:49:52 1716598192

Microsoft is prone to severe breaches a few times per year.

https://firewalltimes.com/microsoft-data-breach-timeline/

pquki4 · 2024-05-25T09:08:46 1716628126

You can't just equate Microsoft and Azure like that.

skywhopper · 2024-05-25T10:15:18 1716632118

Azure has had multiple embarrassingly bad tenant-boundary leaks, including stuff like being able to access another customer’s metadata service, including credentials, just by changing a port number. They clearly have some major issues with lack of internal architecture review.

pm90 · 2024-05-25T03:30:33 1716607833

I have used Azure, AWS and GCP. The only reason people use Azure is because others force them to. It’s an extremely shitty cloud product. They pretend to compete with AWS but aren’t even as good as GCP.

markmark · 2024-05-25T01:39:35 1716601175

Does Azure? I think there's AWS then everyone else.

bowmessage · 2024-05-24T21:32:26 1716586346

[flagged]

pja · 2024-05-24T22:07:18 1716588438

Google Private Cloud I think.

jwnin · 2024-05-25T02:00:57 1716602457

End of day Friday disclosure before a long holiday weekend; well timed.

lopkeny12ko · 2024-05-24T22:08:44 1716588524

> Google Cloud services have strong safeguards in place with a combination of soft delete, advance notification, and human-in-the-loop, as appropriate.

I mean, clearly not? By Google's own admission, in this very article, the resources were not soft deleted, no advance notification was sent, and there was no human in the loop for approving the automated deletion.

And Google's remediation items include adding even more automation for this process. This sounds totally backward to me. Am I missing something?

jerbear4328 · 2024-05-24T23:18:29 1716592709

They automated away the part that had a human error (the internal tool with a field left blank), so that human error can't mess it up in the same way again. They should move that human labor to checking before tons of stuff gets deleted.

20after4 · 2024-05-25T04:57:15 1716613035

It seems to me that the default-delete is the real WTF. Why would a blank field result in a default auto-delete in any sane world? The delete should be opt-in, not opt-out.
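
To make "opt-in, not opt-out" concrete, here is a minimal sketch in Python. Everything in it is hypothetical (`ProvisionRequest`, `delete_after_days`, `confirm_auto_delete`, `resolve_deletion_schedule`); it is not how Google's internal tool works, only one way a blank field could be made to mean "never delete" instead of "delete on a default schedule".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProvisionRequest:
    """Hypothetical operator input for provisioning a customer environment."""
    customer: str
    # None models the operator leaving the field blank.
    delete_after_days: Optional[int] = None
    # Deletion must be confirmed separately; a number alone is not enough.
    confirm_auto_delete: bool = False


def resolve_deletion_schedule(req: ProvisionRequest) -> Optional[int]:
    """Return the number of days until auto-deletion, or None for 'never'."""
    if req.delete_after_days is None:
        # Blank field: the safe default is to keep the resources.
        return None
    if req.delete_after_days <= 0:
        raise ValueError("delete_after_days must be a positive number of days")
    if not req.confirm_auto_delete:
        # A term was supplied, but nobody explicitly opted in to deletion.
        raise ValueError("auto-delete requested without explicit confirmation")
    return req.delete_after_days
```

With this shape, `resolve_deletion_schedule(ProvisionRequest("acme"))` yields `None`, so a forgotten parameter cannot schedule a deletion; the destructive path needs two deliberate inputs.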

macintux · 2024-05-25T15:40:18 1716651618

It took me way too many years to figure out that any management script I write for myself and my co-workers should, by default, execute as a dry run operation.

I now put a -e/--execute flag on every destructive command; without that, the script will conduct some basic sanity checks and halt before making changes.

xyst · 2024-05-25T14:32:14 1716647534

Transparency for Google is releasing this incident report on the Friday of a long weekend [in the US].

I wonder if UniSuper was compensated for G’s fuckup.

“A single default parameter vs multibillion organization. The winner may surprise you!1”

l00tr · 2024-05-24T21:25:50 1716585950

If it were a small or medium business, Google wouldn't even care.

sgt101 · 2024-05-24T20:47:31 1716583651

Super motivating to have off cloud backup strategies...

tgv · 2024-05-24T22:01:50 1716588110

Or cross-cloud. S3's ingress and storage costs are low, so that's an option when you don't use AWS.

nurettin · 2024-05-25T03:05:13 1716606313

It sounds like a giant PR piece about how Google is ready to respond to a single customer and is ready to work through their problems instead of creating an auto-response account suspension infinite loop nightmare.

maximinus_thrax · 2024-05-24T22:26:26 1716589586

> Google Cloud continues to have the most resilient and stable cloud infrastructure in the world.

As a company, Google has a lot of work to do about its customer care reputation regardless of what some metrics somewhere say about whose cloud is more reliable or not. I would not trust my business to Google Cloud, I would not trust anything with money to anything with the Google logo. Anyone who's been reading Hacker News for a couple of years can remember how many times folks were asking for insider contacts to recover their accounts/data. Extrapolating this to a business would keep me up at night.

mannyv · 2024-05-24T20:39:48 1716583188

I guessed it was provisioning or keys. Looks like I was somewhat correct!

logrot · 2024-05-25T09:11:09 1716628269

Executive summary?

taspeotis · 2024-05-25T12:24:44 1716639884

Google employee scheduled the deletion of UniSuper’s resources and Google (ironically) did not cancel it.

petesergeant · 2024-05-25T09:25:09 1716629109

there's literally a tl;dr in the linked article

none_to_remain · 2024-05-25T20:20:48 1716668448

There should be an Executive Summary

emmelaich · 2024-05-25T05:57:12 1716616632

Using "TL;DR" in professional communication is a little unprofessional.

Some non-nerd exec is going to wonder what the heck that means.

logrot · 2024-05-25T09:15:12 1716628512

It used to be called an executive summary. It's brilliant, but the kids found it too formal a phrase.

IMHO almost every article should start with one.

noncoml · 2024-05-24T20:55:09 1716584109

What surprises me the most is that the customer managed to actually speak to a person from Google support. Must have been a pretty big private cloud deployment.

Edit: saw from the other replies that the customer was Unisuper. No wonder they managed to speak to an actual person.

mercurialsolo · 2024-05-25T00:48:46 1716598126

If only internal tools went through the same scrutiny as public tools.

More often than not, critical parameter mistakes or misconfigurations happen because of internal tools which work on unpublished params.

Internal tools should be treated as tech debt. You won't be able to eliminate issues, but you can vastly reduce the surface area for errors.