Amazon EC2 Issues in Frankfurt AZ (amazon.com)
36 points by mrmattyboy on Nov 12, 2019 | 37 comments



It is baffling that the Europe tab for AWS' Status page posts timestamps in PST. What the fuck is PST?

We have the ability to do everything and anything in JavaScript EXCEPT for fucking UTC or localized time zones.

Is PST the new UTC? Is there an RFC for this? Did I miss the memo?

Un-fucking-real.


That's due to the general ignorance about the world outside the US.


Outside the Bay Area, really.


AWS is based in Seattle.


Sorry, Bay Area plus Seattle ;)


Actually, the post was written by someone located outside the U.S. (a place not using any US-specific time-zone), and posted to the status page by someone else located outside the U.S.


PST - Potato Standard Time, everybody knows that.


And how about they inform us which availability zone is affected instead of just writing "single availability zone" 10 times.


That is not that easy, as the availability zone numbering is account-specific (eu-central-1a in your account may be a different physical zone than eu-central-1a in mine).

Doable, of course, but it is not just a matter of putting it into the issue description.


Amazon publishes all the underlying zone names for your account so it would be trivial for them to specify the exact availability zone.


This is a non-issue.

"We are experiencing elevated API error rates and network connectivity errors in a single Availability Zone."

Key fact: "A SINGLE AZ". Availability Zones are isolated from each other, with redundant power supplies and internet connectivity, and most often physically separate datacenter locations. Well-architected applications are designed to tolerate a single AZ becoming unavailable. This is precisely why the cloud is useful: you can bring up new capacity in the other Availability Zones behind the same load balancer with zero effort - the autoscaling does that for you automatically.
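As a rough sketch of that pattern, here is what a multi-AZ Auto Scaling group behind a load balancer looks like with the AWS CLI. All names, subnet IDs, and the target group ARN are made-up placeholders:

```shell
# Hypothetical names/IDs for illustration; substitute your own.
# The three subnets live in three different AZs of eu-central-1, so the
# ASG spreads instances across zones and ELB health checks replace
# instances that a failing AZ takes out.
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --launch-template LaunchTemplateName=web-template \
  --min-size 3 --max-size 12 --desired-capacity 3 \
  --vpc-zone-identifier "subnet-aaa111,subnet-bbb222,subnet-ccc333" \
  --target-group-arns "arn:aws:elasticloadbalancing:eu-central-1:123456789012:targetgroup/web/abc123" \
  --health-check-type ELB \
  --health-check-grace-period 120
```

With `--health-check-type ELB`, the ASG terminates instances the load balancer marks unhealthy and launches replacements in whichever listed subnets can still fulfill them.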


Except it's not currently handling this nicely. We've got an ASG behind an ELB, spread across multiple AZs including the affected one, and the ASG never sees the new instances it launches as coming into service - they just stick in "Not yet in service".


I spun up a new t2.small instance in eu-central-1 and it took less than 90 seconds, which is more than it usually takes, but not that bad.

https://pastebin.com/WC1hkh0c (see ts on uptime command, ignore timezone differences)


ASG is dead, not terminating or spinning up new instances at this moment.


We have exactly this for 2 autoscaling groups. Very annoying :(


What do you mean, "non issue"? We had timeouts all over the place. It doesn't matter how well your app is designed: if the network is congested, you're gonna have calls from your clients...


It's a non-issue because you are not immune to the laws of probability. Things will fail, and you should expect them to. The beautiful thing here is that the failovers work to spec, and in case a single AZ goes out, mean time to recovery for your app can be on the order of single-digit minutes (if built right).


This is a very naive perspective to take.

>The beautiful thing here is that the failovers work to spec

Whose failover? AWS' certainly did not for us.


My managed ElasticSearch cluster did go down despite being multi az.


Are you talking about stateless applications here? The story is a bit different for databases on AWS in my experience. Yes, you can do multi-AZ synchronous replication, but failover times are between 1-2 minutes.

Also, I don't know how AWS implements failovers internally, but I can see several potential issues that could arise from connectivity loss between availability zones, especially if both EC2 and RDS are involved – few of which are trivial to handle if your application cannot tolerate loss of consistency.


We use DynamoDB for storage and state and it's unaffected.

Are you referring to this kind of scenario? https://github.blog/2018-10-30-oct21-post-incident-analysis/


Then you probably don't have the strict consistency and durability requirements that I mentioned. That is probably true for many use cases, but certainly not for all.

Yes, after skimming your referenced article, this is exactly the type of issue that prevents a "simple" failover when databases are involved.

Depending on your application, even returning stale data on one AZ in a split brain scenario might not be acceptable (this is a stronger requirement than not allowing non-reconcilable conflicting writes).


DynamoDB allows strongly consistent reads, and your data is replicated across AZs. I'd argue that it's even better than an RDS setup in that respect.

RDS lets you choose a DB engine (MySQL, Postgres, etc). The replication depends on the engine you choose, e.g. Postgres native streaming. Here's a good writeup on how it works and best practices (https://aws.amazon.com/blogs/database/best-practices-for-ama...)
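For reference, opting into a strongly consistent DynamoDB read is a single flag on the request. Table name and key below are hypothetical:

```shell
# Hypothetical table/key for illustration.
# DynamoDB reads are eventually consistent by default; --consistent-read
# asks for a strongly consistent read that reflects all prior successful
# writes, at twice the read-capacity cost.
aws dynamodb get-item \
  --table-name orders \
  --key '{"order_id": {"S": "o-1234"}}' \
  --consistent-read
```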


It doesn't affect a single AZ though. We see numerous problems with AWS services including but not limited to being unable to create new EC2 instances in any AZ.

This is creating a huge number of issues as you lose capacity with the AZ going down and can't recover by creating new instances.


> We see numerous problems with AWS services including but not limited to being unable to create new EC2 instances in any AZ.

How were you launching? Using EC2 directly (e.g. `aws ec2` cli or RunInstances API call - https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_R... ), or using e.g. ASGs?

If you were launching directly with EC2, were you doing targeted launches (e.g. specifying Placement.AvailabilityZone for classic or SubnetId for VPC)?

In the case of an AZ failure, launches which don't specify an AvailabilityZone or SubnetId may be impacted for a few minutes, whereas launches which do specify them should still succeed.
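A targeted launch of that kind just means naming the subnet (and therefore the AZ) explicitly. AMI and subnet IDs below are placeholders:

```shell
# Pin the launch to a specific AZ by specifying its subnet
# (hypothetical IDs). Launches that name a subnet skip the capacity
# placement decision that can degrade during an AZ failure.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type t3.micro \
  --subnet-id subnet-aaa111 \
  --count 1
```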


I have a couple of arguments, even if you are spread evenly:

* If you lose an AZ, you are in a state of reduced redundancy.

* If you have intermittent drops, you can have lots of issues as instances keep coming up and dropping out. This will also have an impact on replication of services and cause additional load on instances in working AZs (RDS saw this).


We were affected on one of our marketing pages that we migrated to ECS in a non-HA configuration; our main applications are set up in Multi-AZ HA configs and weren't affected by this.

Not a big deal to remediate as all the other AZs are working.


Looks like this took down TransferWise (debit card/forex/payments processor): https://status.transferwise.com


I wonder if this is related to TransferWise issues, their app and even card payments have been down all morning.

https://twitter.com/TransferWise/status/1194168200210124800?...


This year has been really bad for the cloud vendors in terms of stability.


This is a single AZ. Your application should tolerate single AZ outages. That is the first rule of building reliable, highly available services on AWS.

   12:08 AM PST We are investigating increased network connectivity errors to instances in a single Availability Zone in the EU-CENTRAL-1 Region.


True, but autoscaling at the moment is unable to scale in/out because of this issue. If your autoscaling group includes the affected AZ, you are out of luck: those instances are being terminated since their health checks fail, but because of the failures autoscaling is unable to complete the termination, and so it is also unable to launch new instances in the other AZs - it is stuck on the terminating part.
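When an ASG wedges like that, the scaling activity log usually shows what it is stuck on. Group name and instance ID below are hypothetical:

```shell
# Inspect recent scaling activities for a stuck group (hypothetical name).
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name web-asg \
  --max-items 10

# Sometimes a wedged instance can be forced out manually without
# reducing desired capacity, prompting a replacement launch elsewhere:
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0abc123def456 \
  --no-should-decrement-desired-capacity
```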


That is an interesting detail. Would you consider having 3 separate autoscaling groups (one per AZ), or is this not feasible for some reason? One of the interesting aspects of running services on AWS was being able to remove an AZ from "rotation" while an outage lasts - i.e. making a DNS change to exclude its public endpoint from taking any traffic. If you have 3 separate groups, each doing 1/3 of the load with independent DNS entries, autoscaling groups, etc., then moving traffic from one AZ to another is probably easier. Not sure about the details of your setup though; you might have reasons not to do this.


A setup with an ASG per AZ makes it a lot harder to do real autoscaling based on load/mem/connections. If it were purely for running a fixed number of instances equally spread across AZs this would probably work, but not in our setup, where we have unpredictable traffic and load patterns.

[edit] I know it can be done with combined metrics etc, but it would make it a lot more complicated ;-)


It is most certainly more complicated. We were ok putting up with that additional complication because service reliability (especially tolerating single-AZ outages with ease) was higher on the requirements list than avoiding complication. :)


Probably unrelated, but there is also an issue with the JIRA cloud offering. https://status.atlassian.com/


It's related. Atlassian runs on AWS.

