There was an outage August 19th, 2019 - almost 1 year ago to the day. As I posted at the time:
"Google often has a outage or two around this time of the year when all the US schools come back and millions of students log in at the same time."
My pet theory wasn't too popular but I'm going to stick with it :)
I work on the educational part of the product my company develops, and I can attest that the start of school is a stressful day, full of login attempts, assignment lookups, and other setup activities for the school year.
I wouldn't doubt that Google Classroom and other systems that use Google's SSO will be under strain from millions of students.
Google runs a disaster recovery exercise once per year... They do things like deliberately turning off datacenters with no warning. Sometimes failover systems don't work as intended.
Every year around this same time, people have to work on Perf (the internal performance review); maybe people were more focused on that than on keeping the systems up... or maybe they needed to push their latest update to get it included in their perf...
US schools don't all start on the same day, though; it's pretty staggered, with some starting in early-to-mid August and most in the Northeast starting right after Labor Day.
Right, which is exactly what I'd expect Google or any half-decent service to be able to withstand easily. It's not a sudden spike that jumps several orders of magnitude above the average weekly peak within a few minutes; it's a fairly gentle upward slope.
And if this happened last year too, you would think it would be at the top of the list of things to watch for, and to add capacity for, this year. Amazon and Walmart start planning and drilling now for their holiday season.
That's an interesting theory because the timing does correlate.
A lot of people would immediately dismiss it because Google has the resources to scale up. But having resources doesn't guarantee someone actually turns the knob that increases the number of instances. (Whether automatic or manual, the adjustment could be too slow to match an unanticipated spike in demand.)
But there's another reason I don't think that's the explanation. Gmail has 1.5 billion active users[1]. Millions of students logging in at the same time sounds like a lot, but if Gmail has 100 million more active users today than yesterday, that's not even a 10% increase!
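As a back-of-envelope check (the 1.5 billion figure is from the comment above; the 100 million surge is hypothetical):

```python
# Back-of-envelope: what would 100M returning users do to a 1.5B base?
total_active = 1_500_000_000   # Gmail active users, per the comment above
student_surge = 100_000_000    # hypothetical wave of returning students

print(f"{student_surge / total_active:.1%}")  # -> 6.7%
```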
I don't think it's the load on Gmail that's an issue. I'd point more to Google Drive, Docs and the underlying shared storage infrastructure. Also keep in mind most of those 1.5 billion users won't be very active - a few million users that have no usage at all for a few months and then all come back to being extremely active within a few days can be pretty disruptive!
IMO it's not really about having the resources to scale, but the unpredictable emergent behaviours which can happen when the load profile suddenly changes
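To make the "turning the knob too slowly" point concrete, here is a minimal sketch, with entirely hypothetical numbers, of a purely reactive autoscaler; with a fixed provisioning delay, even a gentle ramp keeps it permanently behind:

```python
# Minimal sketch of a purely reactive autoscaler trailing a steady ramp.
# All numbers are hypothetical; real autoscalers are far more sophisticated.

CAPACITY_PER_INSTANCE = 1_000   # requests/sec one instance can serve
SCALE_UP_DELAY = 5              # minutes before a requested instance is ready

instances = 10
pending = []                    # (ready_at_minute, count) of booting instances

for minute in range(20):
    load = 10_000 + 2_000 * minute                 # gentle upward slope
    instances += sum(n for t, n in pending if t <= minute)
    pending = [(t, n) for t, n in pending if t > minute]

    capacity = instances * CAPACITY_PER_INSTANCE
    status = "ok" if load <= capacity else "OVERLOADED"
    print(f"t={minute:2d}m  load={load:6d}  capacity={capacity:6d}  {status}")

    # "Turn the knob": request only what isn't already booting.
    in_flight = sum(n for _, n in pending)
    shortfall = load - (instances + in_flight) * CAPACITY_PER_INSTANCE
    if shortfall > 0:
        needed = -(-shortfall // CAPACITY_PER_INSTANCE)   # ceiling division
        pending.append((minute + SCALE_UP_DELAY, needed))
```

The specific numbers don't matter; the point is that a reactive loop trails a sustained ramp by its provisioning delay no matter how much capacity is theoretically available, which is why pre-provisioning (the "add capacity for next year" argument above) is the usual answer.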
Actually a legit observation. Sorry to say, but the naysayers on here are super dumb and missing the point with their comparisons to other Google services and their claims of a lack of evidence.
Millions of people are searching simultaneously on google.com or youtube.com, yet those servers aren't crashing. The issue isn't traffic overload; it's something else.
Press and hold the F5 key on your keyboard for 2 minutes while on gmail.com. You will get a "service unavailable" error. About 500 other people whose data happens to be co-hosted with yours will also get the same error, and all of you will be unable to send or receive email, even via IMAP, for about 10 minutes while your particular corner of the data store is restarted and its data integrity checked.
Not sure if this is still the case, but if you did this a couple of times, your account data would be permanently migrated to an instance with more CPU and RAM allocated. You'd also be in with all the other badly behaved accounts, so reliability goes down a lot. The benefit was much quicker complex searches, and being able to bulk-label or delete emails without it taking minutes or hours.
Don't believe how slow it is on a regular instance? Try going to "All mail", selecting all of your emails, and applying a label to them all. In my experience, it can only label about 50 mails per second, so it can take hours to do them all. It will keep going if you quit the browser, but will stop if the Gmail devs do a software update, which they seem to do usually on Tuesdays, but never on Fridays or weekends.
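At the quoted rate, the arithmetic is easy to check (the mailbox size is hypothetical):

```python
# Bulk-labeling time at the ~50 mails/second rate quoted above.
mails = 500_000   # hypothetical "All mail" size for a heavy account
rate = 50         # labels applied per second

print(f"{mails / rate / 3600:.1f} hours")  # -> 2.8 hours
```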
I find it hard to believe that Gmail will always serve certain users from the same machines, especially in this day and age, with “cattle, not pets” and ephemeral containers.
I’m sure they have machines that are only used to serve G Suite and Google One customers, and maybe some other VIPs, but regular heavy users? It sounds like an urban legend to me.
Gmail accounts will sometimes be automatically "hospitalized" -- assigned more than the usual amount of resources because for some reason they are chronically behind or growing without bound -- or "jailed" -- moved into isolation along with other bozo accounts, to keep from disturbing normal people's accounts. Not a legend.
The data has to be sharded somehow, though. You might not be hitting the same exact machine, especially for the frontend, but your data isn't just magically everywhere in "the cloud".
Not the same machines, but different groups of machines (pools, shards, farms, whatever). They'll certainly be able to move you around to balance the pools, to decommission pools, or to put you in a pool whose primary data is closer to where you usually access from, etc. Grouping by behavior makes sense too: separating heavy and light users lets you serve a lot more light users from one pool, and the heavy users won't impact their service.
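A minimal sketch of the kind of account-to-pool mapping being described, with hypothetical pool names; this is illustrative only, not Gmail's actual design:

```python
# Illustrative account-to-pool mapping with explicit reassignment
# (the "moved between pools" idea above; not Gmail's real design).
import hashlib

LIGHT_POOLS = ["light-1", "light-2"]   # hypothetical pool names
overrides: dict[str, str] = {}         # accounts explicitly migrated

def pool_for(account: str) -> str:
    if account in overrides:           # "hospitalized"/"jailed" accounts
        return overrides[account]
    digest = hashlib.sha256(account.encode()).digest()
    return LIGHT_POOLS[digest[0] % len(LIGHT_POOLS)]  # stable default

def migrate(account: str, pool: str) -> None:
    # In reality the data would be copied before flipping the pointer.
    overrides[account] = pool

print(pool_for("alice@example.com"))       # same pool on every request
migrate("alice@example.com", "heavy-1")    # move a heavy user out
print(pool_for("alice@example.com"))       # now served from the heavy pool
```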
That's not the "cattle vs. pet" as I understand it. The servers are identical, i.e. cattle. This is just a case of sticky sessions. It's a common pattern to help latency and keep resource usage down.
Interesting. I held F5 down for less than a minute and got an "Unusual usage - account temporarily locked down" message. It disappeared after a few seconds' pause, though...
For Gmail, you have to log in. It has to know it is you. It has to know that everything it is serving you is only for you. It has to retain copies of documents that -- as far as it knows -- are unique to you or whomever you share with. It has to do all this while keeping that information safe from other people who might want to take a look at it.
And that's just the items I can think of in real time while typing.
The Google Cloud status page offers a bit more technical detail on what happened:
"We are experiencing an issue with Google Kubernetes Engine (GKE) clusters using node auto-provisioning becoming stuck during node version upgrades. Node auto-upgrades have been disabled temporarily."
I'm guessing this outage will allow GSuite customers to claim Service Credits under the SLA - does anyone have any experience with doing so? Google's documentation is lacking in detail[0].
It depends on the specific country's laws. In Germany you cannot write anything you want into a contract and call it valid; in this case, you could still sue Google if, by the industry-standard definition, it actually was an outage.
Confirmed it's an object storage issue. I've personally had dramas uploading APKs to the Play Store. Also noticed a bunch of big websites down. Must be a decent-sized issue.
Well, this is what happens if you rely only on 3rd-party providers. My email server is still running without any issues and without any incident for the last 3 years (since a system reinstall).
Granted, no one can send me emails now that Gmail is down, since most of the world relies on a single point of failure, but that is another story.
Moral of this story is - always own your mission critical infrastructure.
> Moral of this story is - always own your mission critical infrastructure.
Sure, if you have the time and money to own it properly. How far do you need to go to say you own it? Multi-regional servers located on properties that you own?
Do you have a signature image or other attachment? It sends instantly without one, but fails with one [on the desktop web client]. The Google Drive server they save our stuff on is giving a 503.
Hmm, odd. I had a draft from this morning (I took a photo on my phone, attached it, then went to desktop later to type the rest up). It would not send, and kept popping up the error you mentioned while I typed. As soon as I deleted the attachment, I could send it. New emails with nothing attached work fine too, in all 3 accounts.
Was greeted by this just as we started work. Observations so far: Reading and receiving emails via the Gmail web UI works, sending emails via the web UI doesn't work, sending emails via SMTP works.
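For anyone who wants to try the SMTP route while the web UI is broken, here's a minimal sketch using Python's standard smtplib (addresses and credentials are placeholders; an account with 2FA needs an app password rather than the normal one):

```python
# Minimal sketch: sending through Gmail's SMTP endpoint directly,
# bypassing the web UI. Addresses and the password are placeholders.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "me@gmail.com"
msg["To"] = "you@example.com"
msg["Subject"] = "Sent via SMTP while the web UI is down"
msg.set_content("Hello from smtplib.")

with smtplib.SMTP("smtp.gmail.com", 587) as smtp:
    smtp.starttls()                    # upgrade the connection to TLS
    smtp.login("me@gmail.com", "app-password-here")
    smtp.send_message(msg)
```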
I'm not that interested in a dashboard, but I'd be interested in reading the post-mortem. Since they have paying customers for these products, they might release one.
It's nearly the end of the workday here in Australia. Very insensitive of them, they should be more timezone aware.
Edit: I seem to be able to send from some accounts but not others. It looks like emails without attachments are fine; otherwise, the Drive error messes them up.
If it makes you feel any better: the same thing happened to me. An ugly Frankenstein process that gets our data out of a third-party vendor who can only e-mail us reports failed halfway through its run, and I was about to spend hours digging through it to see why, until I saw this.
[1] https://news.ycombinator.com/item?id=20740997