There was an outage August 19th, 2019 - almost 1 year ago to the day. As I posted at the time:
"Google often has a outage or two around this time of the year when all the US schools come back and millions of students log in at the same time."
My pet theory wasn't too popular but I'm going to stick with it :)
I work on the educational part of the product my company develops, and I can attest that the start of school is a stressful day, full of login attempts, assignment lookups, and other setup activities for the school year.
I wouldn't doubt that Google Classroom and other systems that use Google's SSO will be under strain from millions of students.
Google runs a disaster recovery exercise once per year... They do things like deliberately turning off datacenters with no warning. Sometimes failover systems don't work as intended.
Every year around this same time, people have to work on Perf (the internal performance review); maybe people were more focused on that than on keeping the systems up... or maybe they needed to push their latest update to get it included in their perf...
US schools don't all start on the same day, though; it's pretty staggered, with some starting in early-to-mid August and most in the Northeast starting right after Labor Day.
Right, which is exactly what I'd expect Google or any half-decent service to be able to withstand easily. It's not a sudden spike that jumps several orders of magnitude above the average weekly peak within a few minutes; it's a fairly gentle upward slope.
And if this happened last year too, you would think it would be at the top of the list of things to watch for, and to add capacity for, this year. Amazon and Walmart start planning and drilling now for their holiday season.
That's an interesting theory because the timing does correlate.
A lot of people would immediately dismiss it because Google has the resources to scale up. But having resources doesn't guarantee someone actually turns the knob that increases the number of instances. (Whether automatic or manual, the adjustment could be too slow to match an unanticipated spike in demand.)
But there's another reason I don't think that's the explanation. Gmail has 1.5 billion active users[1]. Millions of students logging in at the same time sounds like a lot, but if Gmail has 100 million more active users today than yesterday, that's not even a 10% increase!
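As a back-of-envelope check (the 1.5 billion figure is from the comment above; the 100 million surge is hypothetical):

```python
# Back-of-envelope: what would 100M returning users do to a 1.5B base?
total_active = 1_500_000_000   # Gmail active users, per the comment above
student_surge = 100_000_000    # hypothetical wave of returning students

print(f"{student_surge / total_active:.1%}")  # -> 6.7%
```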
I don't think it's the load on Gmail that's an issue. I'd point more to Google Drive, Docs and the underlying shared storage infrastructure. Also keep in mind most of those 1.5 billion users won't be very active - a few million users that have no usage at all for a few months and then all come back to being extremely active within a few days can be pretty disruptive!
IMO it's not really about having the resources to scale, but the unpredictable emergent behaviours which can happen when the load profile suddenly changes
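To make the "turning the knob too slowly" point concrete, here is a minimal sketch, with entirely hypothetical numbers, of a purely reactive autoscaler; with a fixed provisioning delay, even a gentle ramp keeps it permanently behind:

```python
# Minimal sketch of a purely reactive autoscaler trailing a steady ramp.
# All numbers are hypothetical; real autoscalers are far more sophisticated.

CAPACITY_PER_INSTANCE = 1_000   # requests/sec one instance can serve
SCALE_UP_DELAY = 5              # minutes before a requested instance is ready

instances = 10
pending = []                    # (ready_at_minute, count) of booting instances

for minute in range(20):
    load = 10_000 + 2_000 * minute                 # gentle upward slope
    instances += sum(n for t, n in pending if t <= minute)
    pending = [(t, n) for t, n in pending if t > minute]

    capacity = instances * CAPACITY_PER_INSTANCE
    status = "ok" if load <= capacity else "OVERLOADED"
    print(f"t={minute:2d}m  load={load:6d}  capacity={capacity:6d}  {status}")

    # "Turn the knob": request only what isn't already booting.
    in_flight = sum(n for _, n in pending)
    shortfall = load - (instances + in_flight) * CAPACITY_PER_INSTANCE
    if shortfall > 0:
        needed = -(-shortfall // CAPACITY_PER_INSTANCE)   # ceiling division
        pending.append((minute + SCALE_UP_DELAY, needed))
```

The specific numbers don't matter; the point is that a reactive loop trails a sustained ramp by its provisioning delay no matter how much capacity is theoretically available, which is why pre-provisioning (the "add capacity for next year" argument above) is the usual answer.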
Actually a legit observation. Sorry to say, but the naysayers on here are super dumb and missing the point with their comparisons to other Google services and their claims of a lack of evidence.
Millions of people are searching simultaneously on google.com or youtube.com, yet those servers aren't crashing. The issue isn't traffic overload; it's something else.
Press and hold the F5 key on your keyboard for 2 minutes while on gmail.com. You will get a "service unavailable" error. About 500 other people whose data happens to be co-hosted with yours will also get the same error, and all of you will be unable to send or receive email, even via IMAP, for about 10 minutes while your particular corner of the data store is restarted and its data integrity checked.
Not sure if this is still the case, but if you did this a couple of times, your account data would be permanently migrated to an instance with more CPU and RAM allocated. You'd also be in with all the other badly behaved accounts, so reliability goes down a lot. The benefit was much quicker complex searches, and being able to bulk-label or delete emails without it taking minutes or hours.
Don't believe how slow it is on a regular instance? Try going to "All mail", selecting all of your emails, and applying a label to them all. In my experience, it can only label about 50 mails per second, so it can take hours to do them all. It will keep going if you quit the browser, but will stop if the Gmail devs do a software update, which they seem to do usually on Tuesdays, but never on Fridays or weekends.
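At the quoted rate, the arithmetic is easy to check (the mailbox size is hypothetical):

```python
# Bulk-labeling time at the ~50 mails/second rate quoted above.
mails = 500_000   # hypothetical "All mail" size for a heavy account
rate = 50         # labels applied per second

print(f"{mails / rate / 3600:.1f} hours")  # -> 2.8 hours
```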
I find it hard to believe that Gmail will always serve certain users from the same machines, especially in this day and age, with “cattle, not pets” and ephemeral containers.
I’m sure they have machines that are only used to serve G Suite and Google One customers, and maybe some other VIPs, but regular heavy users? It sounds like an urban legend to me.
Gmail accounts will sometimes be automatically "hospitalized" -- assigned more than the usual amount of resources because for some reason they are chronically behind or growing without bound -- or "jailed" -- moved into isolation along with other bozo accounts, to keep from disturbing normal people's accounts. Not a legend.
The data has to be sharded somehow, though. You might not be hitting the same exact machine, especially for the frontend, but your data isn't just magically everywhere in "the cloud".
Not the same machines, but different groups of machines (pools, shards, farms, whatever). They'll certainly be able to move you around to balance the pools, to decommission pools, or to put you in a pool whose primary data is closer to where you usually access from, etc. Grouping by behavior makes sense too: separating heavy and light users lets you serve a lot more light users from one pool, and the heavy users won't impact their service.
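A minimal sketch of the kind of account-to-pool mapping being described, with hypothetical pool names; this is illustrative only, not Gmail's actual design:

```python
# Illustrative account-to-pool mapping with explicit reassignment
# (the "moved between pools" idea above; not Gmail's real design).
import hashlib

LIGHT_POOLS = ["light-1", "light-2"]   # hypothetical pool names
overrides: dict[str, str] = {}         # accounts explicitly migrated

def pool_for(account: str) -> str:
    if account in overrides:           # "hospitalized"/"jailed" accounts
        return overrides[account]
    digest = hashlib.sha256(account.encode()).digest()
    return LIGHT_POOLS[digest[0] % len(LIGHT_POOLS)]  # stable default

def migrate(account: str, pool: str) -> None:
    # In reality the data would be copied before flipping the pointer.
    overrides[account] = pool

print(pool_for("alice@example.com"))       # same pool on every request
migrate("alice@example.com", "heavy-1")    # move a heavy user out
print(pool_for("alice@example.com"))       # now served from the heavy pool
```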
That's not the "cattle vs. pet" as I understand it. The servers are identical, i.e. cattle. This is just a case of sticky sessions. It's a common pattern to help latency and keep resource usage down.
Interesting. I held F5 down for less than a minute and got an "Unusual usage - account temporarily locked down" message. It disappeared after a few seconds' pause, though...
For Gmail, you have to log in. It has to know it is you. It has to know that everything it is serving you is only for you. It has to retain copies of documents that -- as far as it knows -- are unique to you or whomever you share with. It has to do all this while keeping that information safe from other people who might want to take a look at it.
And that's just the items I can think of in real time while typing.
The Google Cloud status page offers a bit more technical detail on what happened:
"We are experiencing an issue with Google Kubernetes Engine (GKE) clusters using node auto-provisioning becoming stuck during node version upgrades. Node auto-upgrades have been disabled temporarily."
I'm guessing this outage will allow GSuite customers to claim Service Credits under the SLA - does anyone have any experience with doing so? Google's documentation is lacking in detail[0].
It depends on the specific country's laws. In Germany you cannot write anything you want into a contract and call it valid; in this case, you could still sue Google if, by the industry-standard definition, it actually was an outage.
Confirmed it's an object storage issue. I've personally had dramas uploading APKs to the Play Store. Also noticed a bunch of big websites down. Must be a decent-sized issue.
Well, this is what happens if you rely only on 3rd-party providers. My email server is still running without any issues and without any incident for the last 3 years (since a system reinstall).
Granted, no one can send me emails now that Gmail is down, since most of the world relies on a single point of failure, but that is another story.
Moral of this story is - always own your mission critical infrastructure.
> Moral of this story is - always own your mission critical infrastructure.
Sure, if you have the time and money to own it properly. How far do you need to go to say you own it? Multi-regional servers located on properties that you own?
Do you have a signature image or other attachment? It sends instantly without one, but fails with one [on the desktop web client]. The Google Drive server they save our stuff on is giving a 503.
Hmm, odd. I had a draft from this morning (I took a photo on my phone, attached it, then went to desktop later to type the rest up). It would not send, and kept popping up the error you mentioned while I typed. As soon as I deleted the attachment, I could send it. New emails with nothing attached work fine too, in all 3 accounts.
Was greeted by this just as we started work. Observations so far: Reading and receiving emails via the Gmail web UI works, sending emails via the web UI doesn't work, sending emails via SMTP works.
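For anyone who wants to try the SMTP route while the web UI is broken, here's a minimal sketch using Python's standard smtplib (addresses and credentials are placeholders; an account with 2FA needs an app password rather than the normal one):

```python
# Minimal sketch: sending through Gmail's SMTP endpoint directly,
# bypassing the web UI. Addresses and the password are placeholders.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "me@gmail.com"
msg["To"] = "you@example.com"
msg["Subject"] = "Sent via SMTP while the web UI is down"
msg.set_content("Hello from smtplib.")

with smtplib.SMTP("smtp.gmail.com", 587) as smtp:
    smtp.starttls()                    # upgrade the connection to TLS
    smtp.login("me@gmail.com", "app-password-here")
    smtp.send_message(msg)
```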
I'm not that interested in a dashboard, but I'd be interested in reading the post-mortem. Since they have paying customers for these products, they might release one.
It's nearly the end of the workday here in Australia. Very insensitive of them, they should be more timezone aware.
Edit: I seem to be able to send from some accounts but not others. It looks like emails without attachments are fine; otherwise, the Drive error messes them up.
If it makes you feel any better: the same thing happened to me. An ugly Frankenstein process that gets our data out of a third-party vendor who can only e-mail us reports failed halfway through its run, and I was about to spend hours digging through it to see why, until I saw this.
[1] https://news.ycombinator.com/item?id=20740997