Fly.io Status – Deployments are broken

jteppinette · on Oct 28, 2022

Fly deployments have been "down" (partially / fully) for a couple days per their status page.

After all of the recent talk about moving from Heroku to Fly, I was surprised to have all of these operational issues when doing my initial sniff test.

Can anyone testify to the production worthiness of Fly, or should I look somewhere else?

Edit: Before anyone says "editorialized title", when I posted this, the title of the linked page literally said "Deployments are broken".

mrkurt · on Oct 28, 2022

We've had an issue with our Consul/Nomad since last night. This was preventing new deploys, and also preventing rescheduling people's VMs when they crashed. Not good!

This did not affect running apps (unless they crashed and needed rescheduling).

This kind of event is super rare. I think this is the second outage of this scale we've had in the last three years.

The Consul/Nomad deploy infrastructure is the most brittle part of our stack. We are working to replace this. New Postgres DBs don't use it at all, but it'll be a few months before all apps are off.

While we're still relying on Consul/Nomad, there's a chance this will happen again. But the way these tend to work is things break when we cross some capacity threshold. We get that fixed and it buys us time to discover the next capacity threshold.

Also, we _aggressively_ update our status page. It's not really an indication of our reliability relative to other providers. You need to read each individual incident to get an idea of what the effect was. Earlier this week we had an issue where new apps couldn't get new IPv4 addresses that lasted about 45 minutes. That's not awesome, but it's not the same scale of problem as we dealt with last night either.

Other status page entries are "a host in a particular region failed". This is entirely normal, and something we're going to deal with forever.

jteppinette · on Oct 29, 2022

Out of curiosity, what are you replacing Consul/Nomad with?

jtmarmon · on Oct 28, 2022

I think this status page is inaccurate - hosting is affected. My app is _unfixably_ broken right now. I have an app on fly whose VM appears to have died due to these issues, and because deploys and restarts are broken, I have literally no way of fixing it. https://community.fly.io/t/ewr-app-is-completely-inaccessibl...

jteppinette · on Oct 28, 2022

This is what is worrying me about moving over to Fly. I am surprised that it has been so heavily pushed here on HN. Perhaps this is just a relatively isolated event; we will see how it is handled moving forward.

mrkurt · on Oct 28, 2022

This didn't actually kill VMs, but it _did_ prevent them from being rescheduled for upwards of an hour. The vast majority of apps running on the platform had 100% uptime throughout the incident. The ones that didn't were relying on our rescheduling infrastructure to recover from app errors.

jtmarmon · on Oct 28, 2022

Except my app isn't down due to an app error but a failed host in EWR which I couldn't escape from (due to the concurrent scheduling issues) https://status.flyio.net/incidents/v2dshzvy1mcl

EDIT: I recognize that these may be poorly timed but unrelated incidents, but it has been frustrating to be trapped on a broken box for 12 hours and have the status page telling me it's just new deploys that are borked :)

mrkurt · on Oct 28, 2022

I don't want to belabor this because we need to do a much better job making it obvious: but single node, development postgres databases are going to have downtime in our infrastructure. We'll get that host back for you, but you should _definitely_ add a replica if you care about availability.

sebslomski · on Oct 28, 2022

One of our (non-essential) services, which we used to test fly.io in production, has been down for hours.

No email from them yet, nothing.

mrkurt · on Oct 28, 2022

The status page is the communication mechanism we use for incidents. Definitely subscribe to that!