I've been very impressed overall with the direction of Google Cloud over the last year. I feel like their container strategy is much better than Amazon's ECS in that the core is built on open-source technology (Kubernetes).
This can wipe away a lot of goodwill, though. A worldwide outage is catastrophic and embarrassing. AWS has had some pretty spectacular failures in us-east (which has a huge chunk of the web running within it), but I'm not sure that I can recall a global outage. To my understanding, these systems are built specifically not to let failures spill over to other regions.
Ah well. Godspeed to anyone affected by this, including the SREs over at Google!
I'm totally impressed with gcloud. Slick, smooth interface. Cheap pricing. The fact that the UI spits out API examples for whatever you're doing is really cool. And it's oh-so-fast. (From what I can tell, gcloud's SSDs are roughly 10x the performance of AWS's at the same price, or 1/10th the cost at the same performance.)
And this is coming from a guy who really dislikes Google overall. I was working on a project that might qualify for Azure's BizSpark Plus (they give you something like $5K a month in credit), and I'd still prefer to pay for gcloud than get Azure for free.
Same; I was considering GCP for the future, but this is bad. I'm not using them without some kind of redundancy with another provider. I hope they write a good post-mortem; these are always interesting at large scale.
How bad is it really? They started investigating at 18:51, confirmed a problem in asia-east1 at 19:00, the problem went global at 19:21, and was resolved at 19:26.
They posted that they will share results of their internal investigation.
That kind of rapid response and communication is admirable. There will be problems with cloud services - it's inevitable. It's how cloud providers respond to those problems that is important.
In this situation, I am thoroughly impressed with Google.
It's bad because it hit all their regions at the same time, while competing providers have mitigations in place against exactly this. AWS completely isolates its regions, for instance [1], so they can fail independently without affecting anything else. That Google let an issue (or even a cascade of problems) affect all its geographic points of presence really shows a lack of maturity in the platform. I don't want to make too many assumptions, and that specific problem could have affected AWS in the same way, so let's wait for more details on their part.
The response times are what's expected when you are running one of the biggest server fleets in the world.
Expecting that the problems that happen everywhere else won't happen with a cloud provider is a pipe dream. They might be better at it because of scale, but no cloud provider can always be up. It happened to Amazon, and now it's happened to Google. Eventually, finding a provider that has never gone down will be like finding an airline that has never crashed.
Operating across regions decreases the chances of downtime, it does not eliminate them.
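To put rough numbers on that (a back-of-the-envelope sketch; the failure rates are made up for illustration):

```python
# Back-of-the-envelope availability math; the 0.1% figure is made up.
# If each region is down 0.1% of the time and failures were independent,
# running in two regions would only be down when both fail at once:
p_region_down = 0.001

p_both_down = p_region_down ** 2   # 1e-06: roughly half a minute per year
print(f"independent failures: {p_both_down:.0e}")

# A control-plane or networking bug that hits every region at once is a
# correlated failure, so the multiplication doesn't apply at all:
p_all_regions_down = p_region_down  # still ~9 hours per year
print(f"correlated failure:  {p_all_regions_down:.0e}")
```

The multiplication only works as long as failures are independent, which is exactly the assumption a global outage violates.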
> The response times are what's expected when you are running one of the biggest server fleets in the world.
That may be true, but actually delivering on that expectation is a huge positive. And more than having the right processes in place, they had the right people in place to recognize and deal with the problem. That's not a very easy thing to make happen when your resources cross global borders and time zones.
Look at what happened with Sony and Microsoft - they were both down for days and while Microsoft was communicative, Sony certainly was not. Granted, those were private networks, but the scale was enormous and they were far from the only companies affected.
AWS has never had a worldwide outage of anything (feel free to correct me). It's not about finding "the airline that never crashed", it's finding the airline whose planes don't crash all at the same time. It's pretty surprising coming from Google because 15 years ago they already had a world-class infrastructure, while Amazon was only known for selling books on the Internet.
Regarding the response times, I recognize that Amazon could do better on communication during an outage. They tend to wait until there is a complete failure in an availability zone before putting the little "i" on their green availability checkmark, rather than signaling things like elevated error rates.
AWS had two regions in 2008 [1]. That was eight years ago, and I think you would agree that running a distributed object storage system across an ocean is a whole different beast from ensuring individual connectivity to servers in 2016.
Yeah... just don't look too closely under the covers. AWS has been working towards this goal, but they aren't there yet. If us-east-1 actually disappeared off the face of the earth, AWS would be pretty F-ed.
Our servers didn't go down; they just lost connectivity. The same has happened to even big providers like Level3. Someone leaks routes or something and boom, it's all gone.
I'd be surprised if AWS didn't have a similar way to fail, even if they haven't. This is obviously a negative for gcloud, no doubt, but it's hardly omg-super-concerning. I'm sure the post-mortem will be great.
Actually, according to the status report, they confirmed that the issue affected all regions at 19:21 and resolved it by 19:27. That's six minutes of global outage.
The outage took my site down (on us-central1-c) at 19:13, according to my logs, so it was already impacting multiple regions before the 19:21 confirmation. (I have been using GCP since 2012 and love it.)
Thank you, I missed that on my first reading - I saw that the status update was posted at 19:45 but missed the content within it stating the issue was resolved at 19:27. I updated my parent comment.
Switching from ECS to GKE (Google Container Engine) currently. Both seem overcomplex for the simpler cases of deploying apps (and provide a lot of flexibility in return), but I have found the performance of GKE (e.g. time for configuration changes to be applied, new containers to boot, etc.) to be vastly superior. The networking is also much better: GKE has overlay networking, so your containers can talk to each other and the outside world pretty smoothly.
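For example, from inside any pod you can just hit another service by name; kube-dns plus the overlay network does the rest (a minimal sketch; `backend` is a hypothetical Service exposing port 8080):

```python
# Minimal sketch of GKE/Kubernetes flat networking, run from inside a pod.
# "backend" is a hypothetical Service name; kube-dns resolves it and the
# overlay network routes the request pod-to-pod - no host port mapping.
import urllib.request

resp = urllib.request.urlopen("http://backend:8080/healthz", timeout=5)
print(resp.status, resp.read().decode())
```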
GKE has good command-line tools, but the web interface is even more limited than ECS's - I assume at some point they'll integrate the Kubernetes web UI into the GCP console.
GKE is still pretty immature though, more so than I realized when I started working with it. The deployments API (which is a huge improvement) has only just landed, and the integration with load balancing and SSL etc is still very green. ECS is also pretty immature though.
The problem is that GCP doesn't run an RDS-style service with PostgreSQL, and external vendors are mostly more costly than AWS RDS - especially for customer homepages where you want to run on managed infrastructure as cheaply as possible.
This is sad, for sure. The new Cloud SQL 2.0 (MySQL) is really good, and if you use a DB-agnostic ORM you can probably make MySQL work for quite a while. Sad to lose access to all the new PG features, though, and I would love it if Google expanded their Cloud SQL offerings.
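For what it's worth, with a DB-agnostic ORM like SQLAlchemy the engine choice mostly comes down to a connection string (a sketch; the model, credentials, and host are made up):

```python
# Sketch of keeping the MySQL/Postgres decision down to one line with
# SQLAlchemy; the table, credentials, and host below are made up.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String(100))

# Cloud SQL (MySQL) today...
engine = create_engine("mysql+pymysql://app:secret@10.0.0.5/homepage")
# ...and if a managed Postgres ever shows up, only this line changes:
# engine = create_engine("postgresql+psycopg2://app:secret@10.0.0.5/homepage")

Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(Customer(name="example"))
    session.commit()
```

The catch is that in the meantime you have to stay away from PG-only features like JSONB, or the portability is gone.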
While not Docker Cloud specifically, when we eyeballed UCP we found it very underwhelming when pitted against Kubernetes.
To us it appeared to be yet another in a sea of orchestration tools that give you a very quick and impressive "Hello World" but then fail to adapt to real-world situations.
This is what Kubernetes really has going for it: every release adds more useful, composable blocks and tools targeting real-world use (and allows many of us crazies to deal with the oddball, quirky behavior our fleet of applications may have), not just a single path for how applications would ideally work.
This has generally been a trend with Docker's tooling outside of Docker itself, unfortunately.
Similarly, docker-compose is great for our development boxes but nowhere near useful for production.
And it doesn't help that Docker's enterprise offerings still steer you towards docker-compose and the like.
Not to bash, but the page you linked is classic Docker - it says literally nothing about what "Docker Cloud" is.
"BUILD SHIP & RUN, ANY APP, ANYWHERE" is the slogan they repeat everywhere, including here, and it means even less everytime they do it. What IS Docker Cloud? Is it like Swarm? Does it use Swarm? What kinds of customers is Docker Cloud especially good at helping? All these mysteries and more, resolved never.
So am I (I'm a YC alum), but RDS is too important for us to move away from it.
Let me put it this way - if you had an RDS equivalent in Docker Cloud, lots of people would switch. Docker is more popular than you know.
Heroku should be an interesting learning example for the tons of new-age cloud PaaS offerings I'm seeing. Heroku's database hosting has always been key to its adoption - to the extent that lots of people continue to use it even after they move their servers to bare metal. The consideration and price sensitivity around data is very different than for app servers.
I believe this is Tutum, which they bought some time ago. I tried Tutum before with Azure. After deleting the containers from the Tutum portal, it didn't clean everything up in Azure. To this day, the storage Tutum created is still sitting in my Azure account. LOL.
Seconded - I can tell the ECS documentation is trying to help, but the foreign task/service/cluster model + crude console UI keeps telling me to let my workload ride on EC2 and maybe come back later.
What I figured out much later was that ECS is a thin layer on top of a number of AWS services - they use an AMI that I can use myself, EC2 VMs that I can run myself, and Security Groups + IAM roles that I can create on my own.
But the way they have built the ECS layer is very, very, VERY bad... and I have an unusually high threshold for documentation pain.
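You can actually see how thin that layer is from the API; here's a sketch with boto3 (the cluster name is hypothetical) that maps each ECS "container instance" back to the plain EC2 VM underneath:

```python
# Sketch showing that ECS "container instances" are just EC2 VMs (boto3;
# the cluster name "my-cluster" is hypothetical).
import boto3

ecs = boto3.client("ecs")
ec2 = boto3.resource("ec2")

arns = ecs.list_container_instances(cluster="my-cluster")["containerInstanceArns"]
details = ecs.describe_container_instances(cluster="my-cluster",
                                           containerInstances=arns)

for ci in details["containerInstances"]:
    vm = ec2.Instance(ci["ec2InstanceId"])  # the same VM shows up in the EC2 console
    print(ci["containerInstanceArn"], "->", vm.instance_id, vm.image_id)
```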
I work on Convox, an open source PaaS. Currently it is AWS only. It sets up a cluster correctly in a few minutes. Then you have a simple API - apps, builds, releases, environment and processes - to work with. Under the hood we deploy to ECS but you don't have to worry about it.
So I do agree that ECS is hard to use, but with better tooling it doesn't have to be.