Heroku - if you're listening. Please do NOT migrate IPs / load balancers here for a little while. You might also do some work to detect those of us who have bypassed your DNS, and give us some nudges individually to move back to your DNS when the time is right.
You just caused a very painful outage for your entire customer base, and a number of us used this hack to get back online. If you don't recognize this, and cause ANOTHER outage (the third in the past few weeks...) we're going to flee even faster to more stable platforms.
Those IPs tend to stay pretty stable and are shared across the fleet. While you should change them back once things are stable, it's reasonably likely these won't change out from under you in a few hrs.
Hi all, deleting a root CNAME record and adding a new A record on Cloudflare was too risky for us, with its potential to impact external DNS responses, so we took the following steps instead:
1. Create a new temporary subdomain in the Cloudflare DNS panel, e.g. "temp-heroku-resolve.your-domain.com", with IPs from the responses you get by querying Heroku's upstream provider directly (`dig @1.1.1.1 +trace your-app.herokuapp.com`). Make them unproxied (grey cloud) A records with a 30s TTL.
2. Change your root CNAME to point at "temp-heroku-resolve.your-domain.com"
To undo, just reverse these steps and point your root CNAME back at Heroku.
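For anyone scripting this, here is a minimal sketch of the records those two steps produce, written out as BIND-style zone lines. The function name, the hostnames, and the IPs (taken from the 192.0.2.0/24 documentation range) are placeholders, not values from Heroku or Cloudflare. Reusing the 30s TTL for the CNAME is also an assumption. A CNAME at the zone root only works on providers that flatten it, as Cloudflare does.

```python
def temp_resolve_records(temp_name, heroku_ips, root_name, ttl=30):
    """Return zone-file style lines for the temporary-subdomain workaround.

    All arguments are placeholders supplied by the caller.
    """
    # Step 1: one unproxied A record per IP on the temporary subdomain.
    lines = [f"{temp_name}. {ttl} IN A {ip}" for ip in heroku_ips]
    # Step 2: repoint the root CNAME at the temporary subdomain.
    # Using the same TTL here is an assumption, not part of the original steps.
    lines.append(f"{root_name}. {ttl} IN CNAME {temp_name}.")
    return lines


records = temp_resolve_records(
    "temp-heroku-resolve.your-domain.com",
    ["192.0.2.10", "192.0.2.11"],  # placeholder IPs, not real Heroku addresses
    "your-domain.com",
)
for line in records:
    print(line)
```

Undoing the change is then a single edit: point the root CNAME back at the Heroku DNS target and delete the temporary A records afterwards.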
The advantage is that you can do the CNAME swap atomically in the Cloudflare interface, instead of deleting a CNAME record and adding a new A record, which would lead to a few seconds when your DNS is entirely removed. Those few seconds may not sound like a lot, but they could be cached by Cloudflare or other external DNS resolvers, trigger "clean-up" functions within Cloudflare to remove resources associated with those domains, etc.
In our case, we have substantial logic running on the edge in Cloudflare Workers, as well as lots of other Cloudflare configuration options, and I don't even want to know what it could potentially do to our zone to remove the root DNS record that everything's tied to.
Honestly, I think this solution worked better than any dedicated feature could have, since it used the exact same existing codepaths that Cloudflare already had for CNAME flattening. Using CNAME records for self-documentation seems like a better practice anyway. Spending the time building a complicated new feature for such a rare case seems like a bad idea.
Deleting a record causes requests for that record to be negatively cached. The negative cache TTL[1] is often set to 1 day. So, deleting and immediately recreating a record can take your site offline for days. (Source: I made this mistake when changing a Route53 record for a subdomain that was getting about 70Krps. Luckily it was our source code making the requests, so we could change it, but that took an hour or so to roll out.)
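To make the negative-caching point concrete: under RFC 2308, a resolver may cache a "no such record" answer for the smaller of the SOA record's own TTL and the SOA MINIMUM field. Here is a small sketch of that arithmetic. The numbers are illustrative, not taken from any real zone, and individual resolvers may apply a lower cap of their own.

```python
def negative_cache_ttl(soa_ttl, soa_minimum):
    """Seconds a resolver may cache NXDOMAIN/NODATA (RFC 2308, section 5)."""
    return min(soa_ttl, soa_minimum)


# Illustrative zone: SOA TTL of 1 day and SOA MINIMUM of 1 day.
ttl = negative_cache_ttl(86400, 86400)
print(f"negative answers may be cached for up to {ttl // 3600} hours")

# Lowering either value shortens the window.
print(negative_cache_ttl(3600, 86400))
```

If either value is small (say 5 minutes), a brief gap between deleting and recreating a record does much less damage.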
Nah bc you're not swapping from CNAME > A records, which could have other random networking related impacts, especially if you're running a public API.
Nice. We got our test site online by migrating just the dyno to render.com. (And connecting back to Heroku for the database.) We might just switch our DNS there and call it a permanent move.
Anybody know what the risks are for doing this? We have not one but 13 heroku domains running where we might do this.
What are the chances that their current work on the DNS issues might replace the underlying IP-addresses and hence we are making it harder on ourselves?
I thought about this, and it can't really get any worse than 'total outage', so I thought it was worth taking a shot. It can't go any more down.
It worked for me, so once Heroku has sorted their s*t out and their DNS system is back up it's just a matter of deleting the A records and replacing them with the CNAME.
Worst case scenario is the underlying Amazon IP Addresses change and it goes back down.
Wouldn't the worst case be that the underlying IP address changes, and your site now points to some other service? I don't think Heroku offers static IPs.
Heroku seems to use the Host header on the HTTP request to determine what customer the request is sent to, so I don't think it can end up somewhere it shouldn't be unless the IP leaves control of Heroku entirely and is picked up by another AWS customer.
Heroku uses the host name to route requests. So you wouldn't serve another site, it would just give you an error that your site wasn't found or something.
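A toy illustration of what routing on the Host header means. This is not Heroku's router; the routing table, the app names, and the error text are all hypothetical. The point is only that the lookup key is the hostname in the request, not the IP the request arrived on.

```python
# Hypothetical routing table: hostname -> app name. Not Heroku's real data model.
ROUTES = {
    "www.your-domain.com": "your-app",
    "your-app.herokuapp.com": "your-app",
}


def route(host_header):
    """Pick an app from the Host header; the destination IP plays no part."""
    host = host_header.split(":")[0].strip().lower()  # drop any :port suffix
    app = ROUTES.get(host)
    if app is None:
        return 404, "no such app"
    return 200, app


print(route("www.your-domain.com"))
print(route("someone-else.example"))
```

Under this model a stale A record that still lands on the shared fleet either reaches your app or gets a "not found" style error. It only reaches someone else's service if the IP has left the platform altogether.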
There is one more problem I can think of (but yes you're right, it can't get worse than 'total outage'): Heroku might detect that the CNAME is missing and will deactivate some stuff from their automatic domain management / TLS certificate issuing.
For those on Cloudflare, this will still work. I just deleted the existing CNAME, added the 4 IP addresses as separate A records and it came back up instantly.
That is a random public DNS server I found on a DNS checker site that works (as it seems to be down from all the common ones like 8.8.8.8, 1.1.1.1, etc). DOMAIN can be your actual domain or the long heroku alias.
Can someone clarify what to do if your app doesn't use a custom domain, and running `heroku domains -a` just returns `my_app.herokuapp.com`, and then entering that into https://securitytrails.com/dns-trails returns no results?
No we just run our API on heroku and use no custom domain names. For some of us requests are working as normal, but for others every request is failing.
What error are you getting? Because for us it seemed that the Cloudflare <-> HerokuDNS link was down. But even then I was able to access my app from the dashboard (like https://my-app.herokuapp.com/)
Presumably if your site is behind Cloudflare then this strategy won't work, right? Since the IP addresses that Security Trails sees are just of Cloudflare rather than your actual Heroku IPs...?
Or is it possible in the Cloudflare dashboard there is somewhere to see your Heroku server's IP address?
For me the securitytrails.com website just crashes. I put my DNS target: "stark-horse-mrp4jeowu9yvwpnnma32x6hd.herokudns.com", clicked "Run Check" and it seems to redirect to a failed (no CSS) webpage. Anyone else experiencing this?
Make sure to go through all the tabs at the top (Cloudflare DNS, Google DNS) - for me they were all "no A records found". Only "Authoritative" gave me 3 A records which I successfully managed to use.
We use cloudflare. If you look up the IP address for your public domain name, you will get the cloudflare IP, yes. If you lookup the IP address of the CNAME target, you will get the heroku IP.
Note: If you used to have a CNAME record for yourdomain.com to www.yourdomain.com, then you have to add two A records per each IP (one whose name is yourdomain.com pointing to the IP, and another whose name is www pointing to the same IP)
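In other words, you need one A record for every combination of name and IP. Here is a quick sketch that lists them; the names and the documentation-range IPs are placeholders.

```python
from itertools import product


def a_records(names, ips):
    """Return one (name, ip) pair for every combination of name and IP."""
    return list(product(names, ips))


# Placeholder names and IPs; substitute your own domain and the IPs you found.
pairs = a_records(["yourdomain.com", "www"], ["192.0.2.10", "192.0.2.11"])
for name, ip in pairs:
    print(f"{name} A {ip}")
```

With 2 names and 4 IPs that is 8 records to create, and 8 to remember to delete when you revert.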
Worked... thanks a bunch.
Does anyone know if it's safe to go back to CNAME now, or should I leave the A record till the morning? Would prefer to get some sleep.
We're currently using Heroku for app hosting, while evaluating fly.io, render.com and railway.app. All three have had exceptional reviews from other customers and differ slightly on their service offerings and setups. All seem like viable alternatives so far!
I’d be really interested to know how your research goes. We are running a large production scale app on heroku. And that we have no DevOps or Infra staff is awesome. But heroku has been a huge let down lately.
Most of the reviews we have seen of the competitors are all hobby level. And last time we checked some of these competitors, we found their security posture was not at the level we would require.
So we had to simply rule them out and either stay with Heroku or move to a big 3.
If you’re keen to share - let me know and I’ll send you my details.
Craig here from Crunchy Data. I think you're speaking to the app side of things on hobby level. On the database side of things, I'd say our security posture for Crunchy Bridge is stronger than the Heroku one. By default we isolate all databases in a VPC, and everything is purely single tenant, whereas Heroku Postgres, at least when I was there, had multiple forms of multi-tenancy, which in Postgres can carry risks[1]. This applies even to the major 3 cloud providers. Our team is essentially the original Heroku Postgres team, so we've built with security but also user experience for Postgres in mind since day one.
Now I assume you were speaking to the 3 mentioned, render, railway, fly in terms of hobby level. All three are fairly young relative to Heroku's age, but Fly did recently get their SOC2 and the team really took it to heart and invested in it so I'd put some stock in that. I can't speak definitively to the others, but do know all three can be solid for production apps. If you've got HIPAA or other specific requirements I'd encourage a conversation with them.
Thanks Craig. I took a look at Crunchydata - however, best I can tell, unless we are in Enterprise Tier Heroku (or maybe not even then) we have to connect to Crunchydata via internet (with IP whitelist?) rather than through VPC peering or similar. Which is a limitation of Heroku rather than you. I assume with something like fly it could be done via VPC peering?
I just read that Fly got SOC2 Type I recently. But I mean, this is hosting infra containing all our data and our customers' data. People providing infra really need to take security extremely seriously and prove it.
Awesome what they are doing - just don’t feel like they are ready for primetime business. We are a small startup (5k monthly on Heroku) but there is just no reasonable way we can tell our enterprise customers' security teams that we are hosted on these guys and that we can vouch for and vet their security.
This is the same path we're on. Migrated to Crunchy a month ago or so to remove the major migration risk and are using Render to host an auxiliary service while our core application remains on Heroku. Haven't yet done any non-toy deployments on Fly.io or Railway but I very much like Render's Blueprints and environment groups.
For what it's worth: Crunchydata appears to have been using Heroku DNS in some way. We have a Heroku app with Crunchy databases, and our Crunchy dbs became inaccessible shortly before the app did.
Craig here from Crunchy, we do run some small pieces on Heroku, but during the DNS outage saw no interruption of databases and they all seemed to be available and up. Please do feel free to drop us a note and we'll happily investigate if something did occur there.
I moved from Heroku to Render. Can't recommend them enough. Smaller operation, of course so keep that in mind if you need support, but for a hobbyist like myself, I'm running 2 postgres dbs, 1 redis instance, and 4 web services (2 websites).
(Render founder) thanks for the rec and feel free to get in touch with me anytime (email in profile). For those who need them, we offer enterprise support plans with response and uptime SLAs.
How have you found uptime on fly.io (and are you using multiple instances for a high-availability setup?). We've been running on Heroku without HA, and it's been good enough, and wondering if we might expect to be able to get away with that on fly.io too...
Uptime has mostly been good, although I didn't set up alerting for any of my Fly projects until recently. All my apps are running on a single node per app, so I haven't tried
The one notable outage was this bug where Fly was evicting smaller instances at overloaded DCs rather than preventing new apps from acquiring resources:
No, I wasn't using much Heroku-specific. All of my apps run in Docker containers, so it's mostly just migrating my CI to deploy to Fly instead of Heroku.
I did get a lot of perf increases moving from GCP to Fly, but I think I have to credit most of the improvements to SQLite being faster than Firestore:
Moving to ECS on AWS, only reason is that I'm a DevOps engineer with familiarity in that space. Honestly wouldn't know where to send developers without the time to figure the nuts and bolts of convenience that Heroku provides.
Please check out withcoherence.com (I'm a cofounder). We manage the nuts and bolts of apps using ECS/fargate on AWS and Cloud Run/GKE on GCP. We also integrate managed cloud IDEs and managed CI/CD pipelines using the same simple configuration, and put a dashboard on top of it!
Same. The heroku incident report says "increased failure of insert/delete/changes to DNS."
Haven't changed DNS in months. Site is down.
Heroku is looking increasingly like nobody is minding the store these days. I still know of no competitor which can match the DX when it's working, but what good does that do you if it breaks all the time.
We're still waiting on them fixing our ability to restore from backups without a manual hack -- which is affecting many customers, but which they don't even have an active published incident on.
This does not appear to be a total outage. I cannot reach any of our sites, and Pingdom also reports we are down, however, I can see normal looking traffic reaching our servers (via heroku logs --tail). In addition, members of our team are reporting via Slack that some can reach our Heroku-hosted sites, others cannot. It seems to be ISP-related. Two people within 1 block of each other on different ISPs see different results.
We proxy some services through Cloudflare to gain IPv6 support, and all of those are down, which suggests the Cloudflare -> Heroku network route is broken.
So my conclusion is that NS1 is having issues responding to DNS queries from other DNS servers. Interestingly, there is no public information on Heroku being dependent on NS1, nor any current outage listed on the NS1 status page.
Well, the new product is hosted on AWS, so it's much more stable, and we've made our first lot of pre-sales so we're on a tight deadline. Not sure why you're being so negative for no reason.
Seeing issues with hackerweb.app and substack.com too.
They don't share upstream DNS (and I'm not sure heroku's homepage has the same DNS provider as customer domains). NS matches SOA for each of these domains.
SOA heroku.com. 1h00m00s "dns1.p04.nsone.net." "hostmaster.nsone.net."
SOA hackerweb.app. 1h00m00s "olga.ns.cloudflare.com." "dns.cloudflare.com."
SOA substack.com. 1h00m00s "ali.ns.cloudflare.com." "dns.cloudflare.com."
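If you want to do the same comparison across a longer list of domains, here is a rough sketch that groups zones by the registrable part of the SOA primary nameserver. It parses lines in exactly the format shown above; the regex and the "last two labels" shortcut are simplifications that would misfire on names like example.co.uk.

```python
import re

SOA_LINES = '''\
SOA heroku.com. 1h00m00s "dns1.p04.nsone.net." "hostmaster.nsone.net."
SOA hackerweb.app. 1h00m00s "olga.ns.cloudflare.com." "dns.cloudflare.com."
SOA substack.com. 1h00m00s "ali.ns.cloudflare.com." "dns.cloudflare.com."
'''

# zone name, TTL (ignored), quoted primary nameserver, quoted contact (ignored)
PATTERN = re.compile(r'^SOA (\S+)\. \S+ "(\S+)\." "\S+"$')


def provider_by_zone(text):
    """Map each zone to the last two labels of its SOA primary nameserver."""
    result = {}
    for line in text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            zone, primary_ns = match.groups()
            result[zone] = ".".join(primary_ns.split(".")[-2:])
    return result


for zone, provider in provider_by_zone(SOA_LINES).items():
    print(f"{zone}: {provider}")
```

For the three lines above this puts heroku.com on nsone.net and the other two on cloudflare.com, which is the "no shared upstream" point.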
Wow. Imagine being so out of touch that you have to have another entity run your DNS services.
Email I get, because there has been a hard push for decades to force everyone on to big providers, but DNS can literally be run by anyone, anywhere.
Did the primary servers push bad data, making the secondary / tertiary ones break, too? If not, why not extend the cache lifetime and run off of them until the primary are fixed?
Sigh. This is rather ridiculous, and is rather embarrassing for Heroku.
I am now seeing one of my services coming back online. For this service I have not replaced the CNAME. Anyone else seeing some service restoration as well?
Yes, can't even access https://dashboard-next.heroku.com/ so the problem seems broader than what they describe on their status page, which seems to imply only issues related to updating DNS settings.
Weirdly, dig +trace works fine, but public resolvers like Google and Cloudflare refuse to return the DNS records. This has to be a DNSSEC issue, right? paging tptacek :p
Maybe. I'm just trying to figure out what's common between Steam, Observable, and Heroku, and Cloudflare itself which are all down according to downdetector.
That's blaming the wrong entity - again. No-one seems to read error pages nowadays, not shocking. Also, Steam never used CF services, and user reports about CF are that - they looked at their favorite Substack (https://substack.com) and just seeing a CF page they assumed it's CF's fault.
1. Find your DNS Target in heroku. It should end with .herokudns.com
2. Lookup the historical DNS record to get the IP addresses. You can find historical DNS records here: https://securitytrails.com/dns-trails
3. Replace your CNAME record in your DNS provider with A records that point to the IP addresses you just found.
Your site should come back up shortly. We plan to revert back to CNAME records once Heroku gets their DNS issues sorted.
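For deciding when to revert, here is a minimal sketch that checks whether the DNS target resolves again and whether the IPs you pinned still match it. The function names, the example target, and the stand-in resolvers are hypothetical. It only sees what your local resolver sees, which during this outage differed between ISPs, so treat the result as a hint rather than a guarantee. A pool that rotates normally could also trip the "differ" branch.

```python
import socket


def resolve_ipv4(host, resolver=socket.getaddrinfo):
    """Return the set of IPv4 addresses for host, or an empty set on failure."""
    try:
        infos = resolver(host, 443, socket.AF_INET, socket.SOCK_STREAM)
    except OSError:  # socket.gaierror is a subclass of OSError
        return set()
    return {info[4][0] for info in infos}


def revert_status(dns_target, pinned_ips, resolver=socket.getaddrinfo):
    """Compare the pinned A-record IPs with what the DNS target resolves to now."""
    current = resolve_ipv4(dns_target, resolver)
    if not current:
        return "target does not resolve yet; keep the A records"
    if not set(pinned_ips) <= current:
        return "target resolves but pinned IPs differ; switch back to CNAME"
    return "target resolves and matches pinned IPs; CNAME looks usable again"


# Stand-ins for socket.getaddrinfo so the example runs offline.
def fake_resolver(host, port, family, socktype):
    return [(family, socktype, 6, "", ("192.0.2.10", port))]


def down_resolver(host, port, family, socktype):
    raise OSError("simulated DNS outage")


target = "example.herokudns.com"  # placeholder DNS target
print(revert_status(target, ["192.0.2.10"], resolver=fake_resolver))
print(revert_status(target, ["192.0.2.10"], resolver=down_resolver))
```

Drop the `resolver=` argument to run it against your real resolver.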