Gmail and Google Drive Outage (google.com)
244 points by severb on Aug 20, 2020 | 84 comments



There was an outage on August 19th, 2019, almost 1 year ago to the day. As I posted at the time: "Google often has an outage or two around this time of the year when all the US schools come back and millions of students log in at the same time."

My pet theory wasn't too popular but I'm going to stick with it :)

1- https://news.ycombinator.com/item?id=20740997


I work on the educational part of the product my company develops and I can attest that school start is a stressful day, with login attempts, assignment lookups and other setup activities for the school period.

I wouldn't doubt that Google Classroom and other systems that use Google's SSO will be under strain from millions of students.


Google does a once per year disaster recovery training... They do things like deliberately turn off datacenters with no warning. Sometimes failover systems don't work as intended.

Was that this week?



It was not this week, sorry!


Every year around the same time people have to work on Perf (internal performance review), maybe people were more focused on that rather than keeping the systems up.... or maybe they needed to push the latest update to be included in their perf...


I like this theory too - but is performance review this week?


Yes.


See you in August 2021, good sir!


US schools don't all start on the same day though; it's pretty staggered, with some starting in early-to-mid August, and most in the Northeast starting right after Labor Day.


It's still probably a normalish distribution


Right, which I would expect Google or any half-decent service to be able to withstand easily. It's not a sudden spike that jumps within a few minutes to several orders of magnitude above the average weekly peak; this is a fairly gentle upward slope.

And if this happened last year too, you would think this would be at the top of the list of things to watch for next year and add capacity for. Amazon and Walmart start planning and drilling now for their holiday season.


That's an interesting theory because the timing does correlate.

A lot of people would immediately dismiss it because Google has the resources to scale up. But having resources doesn't guarantee someone actually turns the knob that increases the number of instances. (Whether automatic or manual, the adjustment could be too slow to match an unanticipated spike in demand.)

But there's another reason I don't think that's the explanation. Gmail has 1.5 billion active users[1]. Millions of students logging in at the same time sounds like a lot, but if Gmail has 100 million more active users today than yesterday, that's not even a 10% increase!
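A quick back-of-the-envelope check of that figure, assuming the 1.5 billion active users cited above and a purely hypothetical surge of 100 million extra users (the function name is made up for illustration):

```python
def percent_increase(baseline: float, extra: float) -> float:
    """Return `extra` as a percentage of `baseline`."""
    return 100.0 * extra / baseline


ACTIVE_USERS = 1.5e9   # figure cited in the comment above
EXTRA_USERS = 100e6    # hypothetical surge, not a measured number

increase = percent_increase(ACTIVE_USERS, EXTRA_USERS)
print(f"{increase:.1f}% increase")  # prints "6.7% increase"
```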

---

[1] Source: https://en.wikipedia.org/wiki/Gmail


I don't think it's the load on Gmail that's an issue. I'd point more to Google Drive, Docs and the underlying shared storage infrastructure. Also keep in mind most of those 1.5 billion users won't be very active - a few million users that have no usage at all for a few months and then all come back to being extremely active within a few days can be pretty disruptive!

IMO it's not really about having the resources to scale, but the unpredictable emergent behaviours which can happen when the load profile suddenly changes


Actually a legit observation. Sorry to say but the naysayers on here are super dumb and missing the point with their comparison to other Google services or claiming lack of evidence.


Millions of people are searching simultaneously at google.com or youtube.com but the servers are not crashing. The issue is not traffic overload but something else.


These are not the same products nor infrastructure


But I'm sure similar infrastructure architecture was applied to gmail.com as it was to google.com and youtube.com.

And the sysadmins follow similar maintenance practices.


Hah...

Press and hold the F5 key on your keyboard for 2 minutes while on gmail.com. You will get a "service unavailable" error. About 500 other people whose data happens to be cohosted with you will also get the same error, and all of you will be unable to send or receive email, even by IMAP, for about 10 mins while your particular corner of the data store is restarted and the data integrity checked.

That doesn't happen on Google.com


Ok, I definitely want to know how you discovered that... (and found one of those 500 people to verify?)


Not sure if this is still the case, but if you did this a couple of times, your account data would be permanently migrated to an instance with more CPU and RAM allocated - you'd also be in with all the other badly behaved accounts, so reliability goes down lots. The benefit was much quicker complex searches, and being able to bulk label or delete emails without it taking minutes or hours.

Don't believe me about how slow it is on a regular instance? Try going to "All mail", selecting all of your emails, and applying a label to them all. In my experience, it can only label about 50 mails per second, so it can take hours to do them all. It will keep going if you quit the browser, but will stop if the Gmail devs do a software update, which they usually seem to do on Tuesdays, but never on Fridays or weekends.
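For a rough sense of scale, here is the arithmetic at the roughly 50 mails per second quoted above. The mailbox sizes are hypothetical, and the rate is an anecdotal observation, not a documented limit:

```python
def labeling_hours(num_emails: int, mails_per_second: float = 50.0) -> float:
    """Estimate hours needed to label `num_emails` at a fixed rate."""
    return num_emails / mails_per_second / 3600.0


# Hypothetical mailbox sizes, chosen only to show the scale.
for n in (100_000, 500_000, 1_000_000):
    print(f"{n:>9,} emails: {labeling_hours(n):.1f} h")
```

At that rate, a mailbox needs about 180,000 messages before the job takes a full hour.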


I find it hard to believe that Gmail will always serve certain users from the same machines, especially in this day and age, with “cattle, not pets” and ephemeral containers.

I’m sure they have machines that are only used to serve G Suite and Google One customers, and maybe some other VIPs, but regular heavy users? It sounds like an urban legend to me.


Gmail accounts will sometimes be automatically "hospitalized" -- assigned more than the usual amount of resources because for some reason they are chronically behind or growing without bound -- or "jailed" -- moved into isolation along with other bozo accounts, to keep from disturbing normal people's accounts. Not a legend.


The data has to be sharded somehow, though. You might not be hitting the same exact machine, especially for the frontend, but your data isn't just magically everywhere in "the cloud".


Not the same machines, but different groups of machines (pools, shards, farms, whatever). They'll certainly be able to move you around to balance the pools, or to decommission pools, or to put you in a pool with primary data closer to where you usually access from, etc. Grouping by behavior makes sense too: separating heavy and light users means you can serve a lot more light users from one pool, and the heavy users won't impact their service.
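As a generic illustration of grouping by behavior (not a description of how Gmail actually does it), a minimal sketch could pick a tier from usage and then hash the account into a shard within that tier. The threshold, shard count and naming scheme are all made-up assumptions:

```python
import hashlib


def assign_pool(account_id: str, requests_per_day: int,
                heavy_threshold: int = 10_000,
                shards_per_pool: int = 4) -> str:
    """Return a pool name such as 'light-2' for an account.

    The tier comes from usage; the shard comes from a stable hash of the
    account id, so the same account maps to the same shard on every call.
    """
    tier = "heavy" if requests_per_day >= heavy_threshold else "light"
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % shards_per_pool
    return f"{tier}-{shard}"


print(assign_pool("alice@example.com", 120))
print(assign_pool("bulk-sender@example.com", 250_000))
```

Moving an account between pools would then be a matter of overriding this default mapping, which is the kind of rebalancing described above.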


It isn't about the compute. This is about the databases. I don't see it as a stretch that Google is using a sharding strategy for their datastores.


That's not the "cattle vs. pet" as I understand it. The servers are identical, i.e. cattle. This is just a case of sticky sessions. It's a common pattern to help latency and keep resource usage down.


I get brief gmail outages once every several months. Bulk labelling is really fast. I wonder if this is why...


Interesting. I held F5 down for less than a minute and I got an "Unusual usage - account temporarily locked down" message. Disappeared after a few seconds pause though...


Drive and Gmail are not the same thing as search. The bottlenecks are different, the architecture and problem spaces aren't the same either.


Primarily cache hits with no state.


This. And to expand on it:

For Gmail, you have to log in. It has to know it is you. It has to know that everything it is serving you is only for you. It has to retain copies of documents that, as far as it knows, are unique to you or whomever you share with. It has to do all this while keeping that information safe from other people who might want to take a look at it.

And that's just the items I can think of in real time while typing.


The Google Cloud status page offers a bit more technical detail about what happened:

"We are experiencing an issue with Google Kubernetes Engine (GKE) clusters using node auto-provisioning becoming stuck during node version upgrades. Node auto-upgrades have been disabled temporarily."

https://status.cloud.google.com/


That issue is unrelated. This is the correct one: https://status.cloud.google.com/incident/zall/20008


Does that mean Gmail is hosted on GKE?


No, it’s likely the GKE incident is caused by a dependency that Gmail also has.


thanks


I'm guessing this outage will allow GSuite customers to claim Service Credits under the SLA - does anyone have any experience with doing so? Google's documentation is lacking in detail[0].

[0] https://gsuite.google.com/intl/en/terms/sla.html


Follow-up to that: Does anyone know if there are tools to help customers collect and submit data to their vendors for SLA credits?

I’d like to add something like that to StatusGator but I’m unsure if there’s a market.


Good luck with that - read GSuite's terms, they, Google, define what an outage is, not the customer.


The linked terms say:

> "Downtime" means, for a domain, if there is more than a five percent user error rate. Downtime is measured based on server side error rate.
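Read literally, that definition is a threshold check on a server-side error rate. A minimal sketch, assuming the rate is simply failed requests divided by total requests over some window (the quoted terms don't say how the window or the per-user accounting works, so that part is a guess):

```python
def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that failed, as seen by the server."""
    if total_requests == 0:
        # Nothing was recorded, so no error rate can be measured.
        return 0.0
    return failed_requests / total_requests


def is_downtime(total_requests: int, failed_requests: int,
                threshold: float = 0.05) -> bool:
    """True only when the error rate is strictly above the threshold."""
    return error_rate(total_requests, failed_requests) > threshold


print(is_downtime(1000, 50))  # exactly 5% is not "more than five percent"
print(is_downtime(1000, 51))
print(is_downtime(0, 0))      # no recorded requests never counts as downtime
```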


Which is fantastic if the error causes the server side not to be able to log errors


It depends on the specific country's laws. In Germany you cannot write anything you want into a contract and call it valid. In this case, you could still sue Google if, according to the industry-standard definition, it actually was an outage.


The same is true in the US but some people like to pretend contracts mean everything here.


Incident says it's affecting anything that uses Cloud Storage, which I'm guessing is most things.

https://status.cloud.google.com/incident/zall/20007

Edit: New incident number https://status.cloud.google.com/incident/zall/20008


Confirmed it's an Object Storage issue. I've personally had dramas uploading APKs to the Play Store. Also noticed a bunch of big websites down. Must be a decent issue.


Well, this is what happens if you rely only on 3rd party providers. My email server is still running without any issues and without any incident for the last 3 years (system reinstall).

Sure, no one can send me emails now that Gmail is down, as most of the world is relying on a single point of failure, but that is another story.

Moral of this story is - always own your mission critical infrastructure.


> Moral of this story is - always own your mission critical infrastructure.

Sure, if you have the time and money to own it properly. How far do you need to go to say you own it? Multi-regional servers located on properties that you own?


I'm so used to Gmail/G Suite working that I assumed the problem was with my internet and restarted the router a couple of times.


Yeah same w/ hotel internet.

Especially because Teams status notifications seem funky today too


Gmail is practically unusable at the moment. It's been going on for a while too, so I'm interested in the post mortem for this.

I'm in Sydney, Australia.


Plain HTML GMail seems to be working as well as ever here in India...


Do you have a signature image or other attachment? It sends instantly without, fails with [on the desktop web client]. The google drive server they save our stuff on is giving a 503.


My email drafts can't even save. A popup saying "Oops, something went wrong. Recent changes may not have been saved."

No images either!


Hmm odd. I had a draft from this morning (I took a photo on my phone, attached, then went to desktop later to type the rest up). It would not send and kept popping up that error you mentioned while typing. As soon as I deleted the attachment, I could send it. New emails with nothing attached work fine too in all 3 accounts.


Yes -- remove the attachment (including in-line images) and it'll work


Was greeted by this just as we started work. Observations so far: Reading and receiving emails via the Gmail web UI works, sending emails via the web UI doesn't work, sending emails via SMTP works.


It's mostly related to attachments. Regular emails should be fine.


YouTube is also affected. You can upload files but they won't process.


Gmail accessible via IMAP at $undisclosed_location. SMTP send works, but sent messages reappear in the inbox rather than in sent mail.

Haven't tested the Gmail web interface. I don't use it.


Sent messages appearing in the inbox has been a bug with Gmail since the dawn of time.


I have 5 tabs open with emails written ready to send at some point. Had issues saving drafts, too.


In fairness Gmail has a good uptime record, so it's not the end of the world


I'm not that interested in a dashboard, but I'd be interested in reading the post-mortem. Since they have paying customers for these products, they might release that.


They usually provide it on the same page after the incident (an example: https://www.google.com/appsstatus#hl=en&v=issue&ts=159398639... )


I wonder how much money this kind of outage can cost all the affected companies and institutions.


Hopefully a lot less than they save by outsourcing this part of their infrastructure :-)


Yeah, I imagine so :D


Right at 9am pacific too. Good reason for me to wrap up in APAC. Thanks Google!


It's nearly the end of the workday here in Australia. Very insensitive of them, they should be more timezone aware.

Edit: I seem to be able to send from some accounts but not others. It looks like emails without attachments are fine, otherwise the drive error messes them up.


>It's nearly the end of the workday here in Australia. Very insensitive of them

It's nearly the end of the workday on the east coast of Australia. Very insensitive of you. Sincerely, Western Australia

The above is a joke just in case that wasn't clear.


I guess I kicked an own goal there.

At least it's not daylight savings yet so the QLDers can't be annoyed at me too!


Please tell our SREs in the SYD (in darling harbour/pyrmont) office thank you from me.


Pacific as in Pacific Time (PST/PDT) ? It's not even 3am there currently.


Is this only affecting G Suite? I can still send and receive email normally.


That would explain why my ETLs are failing at 3am in the morning !!!


If it makes you feel any better, the same thing happened to me. An ugly Frankenstein process to get our data out of a third-party vendor who can only e-mail us reports failed halfway through its run, and I was about to spend hours digging through to see why until I saw this.


Hehehe.... my condolences... we're still trying to recover


Your ETL processes write to google drive or are dependent on email?

Earnest question, don't mean to sound derisive.


User emails file to drive

script connects to drive and parses list of files within

script copies files into redshift and moves original file to different directory or deletes it

alas what can be done? this is what the company gets for building like this ^.^
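The three steps above can be sketched roughly like this. To keep it self-contained, a local folder stands in for the Drive folder and an in-memory SQLite table stands in for Redshift; the file layout, column names and table name are all assumptions, not the real pipeline:

```python
import csv
import shutil
import sqlite3
import tempfile
from pathlib import Path


def load_inbox(inbox: Path, processed: Path, conn: sqlite3.Connection) -> int:
    """Load every CSV in `inbox` into the `reports` table, then archive it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reports (source TEXT, name TEXT, value TEXT)"
    )
    processed.mkdir(parents=True, exist_ok=True)
    loaded = 0
    for path in sorted(inbox.glob("*.csv")):
        with path.open(newline="") as handle:
            rows = [(path.name, r["name"], r["value"])
                    for r in csv.DictReader(handle)]
        # Commit the load before moving the file, so a failed load leaves
        # the original in place for the next run.
        with conn:
            conn.executemany("INSERT INTO reports VALUES (?, ?, ?)", rows)
        shutil.move(str(path), str(processed / path.name))
        loaded += len(rows)
    return loaded


# Demo with a throwaway directory and a tiny made-up report.
with tempfile.TemporaryDirectory() as tmp:
    inbox = Path(tmp) / "inbox"
    processed = Path(tmp) / "processed"
    inbox.mkdir()
    (inbox / "report.csv").write_text("name,value\nalpha,1\nbeta,2\n")

    conn = sqlite3.connect(":memory:")
    loaded = load_inbox(inbox, processed, conn)
    row_count = conn.execute("SELECT COUNT(*) FROM reports").fetchone()[0]
    remaining = list(inbox.glob("*.csv"))
    archived = (processed / "report.csv").exists()
    conn.close()

print(f"loaded {loaded} rows, {row_count} in table")
```

If the file store is unavailable, the listing step fails before anything is loaded or moved, which matches the kind of failure described in this thread.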


My understanding is that part of Google Cloud (Google Cloud Storage) failed, causing many outages, including but not limited to Google Drive and email.


I'm assuming some kind of workflow which uses email notifications, and only goes to the next step if the email notification succeeds..?


Gmail works but I can't send any emails - so frustrating.


We're unable to publish apps in the Play Store console/fastlane, too.



