Issues with upstream DNS provider (heroku.com)
165 points by bfirsh on Aug 23, 2022 | 149 comments



We just deployed a workaround that brought our site back up.

1. Find your DNS Target in Heroku. It should end with .herokudns.com

2. Look up the historical DNS record to get the IP addresses. You can find historical DNS records here: https://securitytrails.com/dns-trails

3. Replace your CNAME record in your DNS provider with A records that point to the IP addresses you just found.

Your site should come back up shortly. We plan to revert to CNAME records once Heroku gets their DNS issues sorted.
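The three steps above amount to swapping one CNAME for a set of A records. Here is a minimal sketch of that swap in Python. The function name `a_records`, the hostname `www.example.com`, the 300-second TTL, and the documentation-range IP addresses are all placeholders rather than anything from Heroku. It renders the zone-file style lines you would enter at your DNS provider, and rejects anything that is not a valid IPv4 address:

```python
import ipaddress


def a_records(name, ips, ttl=300):
    """Render one zone-file style A record line per IPv4 address.

    `name` is the hostname that used to carry the CNAME; `ips` are the
    addresses recovered from historical DNS data. The default TTL is an
    assumption, kept short so that reverting to the CNAME takes effect fast.
    """
    lines = []
    for ip in ips:
        # Raises ValueError on anything that is not a valid IPv4 address.
        addr = ipaddress.IPv4Address(ip)
        lines.append(f"{name}. {ttl} IN A {addr}")
    return lines


if __name__ == "__main__":
    # 192.0.2.0/24 is reserved for documentation; substitute your own IPs.
    for line in a_records("www.example.com", ["192.0.2.10", "192.0.2.11"]):
        print(line)
```

If both the bare domain and `www` used to resolve through the CNAME, call the function once per name with the same list of IPs.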


Heroku - if you're listening. Please do NOT migrate IPs / load balancers here for a little while. You might also do some work to detect those of us who have bypassed your DNS, and give us some nudges individually to move back to your DNS when the time is right.

You just caused a very painful outage for your entire customer base, and a number of us used this hack to get back online. If you don't recognize this, and cause ANOTHER outage (the third in the past few weeks...) we're going to flee even faster to more stable platforms.


Those IPs tend to stay pretty stable and are shared across the fleet. While you should change them back once things are stable, it's reasonably likely these won't change out from under you in a few hrs.


Hi all, deleting a root CNAME record and adding a new A record on Cloudflare was too risky for us, with its potential to impact external DNS responses, so we took the following steps instead:

1. Create a new temporary subdomain in the Cloudflare DNS panel, e.g. "temp-heroku-resolve.your-domain.com", with IPs from the responses you get by querying Heroku's upstream provider directly (`dig @1.1.1.1 +trace your-app.herokuapp.com`). Make them unproxied (grey cloud) A records with a 30s TTL.

2. Change your root CNAME to point at "temp-heroku-resolve.your-domain.com"

To undo, just reverse these steps and point your root CNAME back at Heroku.
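The ordering is the important part of this variant: the temporary A records must exist before the root CNAME is repointed, and the root record is only ever updated in place, never deleted. A rough sketch of that plan as plain data follows. The function name `staged_cname_plan` is hypothetical, and the dictionary keys (`type`, `name`, `content`, `ttl`, `proxied`) are modeled on the fields shown in the Cloudflare DNS panel as an assumption, not a verified API payload:

```python
def staged_cname_plan(apex, temp_label, ips, ttl=30):
    """Return (records_to_create, record_to_update) for the two-step swap.

    Step 1 creates unproxied A records on a temporary subdomain.
    Step 2 is a single in-place edit of the existing root CNAME, so the
    root name always has a record.
    """
    temp_name = f"{temp_label}.{apex}"
    creates = [
        {"type": "A", "name": temp_name, "content": ip,
         "ttl": ttl, "proxied": False}
        for ip in ips
    ]
    update = {"type": "CNAME", "name": apex, "content": temp_name}
    return creates, update


if __name__ == "__main__":
    # Example values only; 192.0.2.0/24 is a documentation range.
    creates, update = staged_cname_plan(
        "your-domain.com", "temp-heroku-resolve", ["192.0.2.10", "192.0.2.11"]
    )
    for record in creates:
        print("create:", record)
    print("then update:", update)
```

Rolling back is the same single update with `content` set back to the original Heroku DNS target, followed by deleting the temporary A records.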


This is essentially doing the same thing, right? What is the advantage of this method? Fewer DNS changes?

(not trying to be snarky, trying to understand)


The advantage is that you can do the CNAME swap atomically in the Cloudflare interface, instead of deleting a CNAME record and adding a new A record, which would lead to a few seconds when your DNS is entirely removed. Those few seconds may not sound like a lot, but they could be cached by Cloudflare or other external DNS resolvers, trigger "clean-up" functions within Cloudflare to remove resources associated with those domains, etc.

In our case, we have substantial logic running on the edge in Cloudflare Workers, as well as lots of other Cloudflare configuration options, and I don't even want to know what it could potentially do to our zone to remove the root DNS record that everything's tied to.


Sounds like Cloudflare could add an atomic CNAME -> A record feature.


Cloudflare should support RFC 2136 which includes atomic updates and is the standard for DNS updates.


Honestly, I think this solution worked better than any dedicated feature could have, since it used the exact same existing codepaths that Cloudflare already had for CNAME flattening. Using CNAME records for self-documentation seems like a better practice anyway. Spending the time building a complicated new feature for such a rare case seems like a bad idea.


Good points, thanks!


Deleting a record causes requests for that record to be negatively cached. The negative cache TTL[1] is often set to 1 day. So, deleting and immediately recreating a record can take your site offline for days. (Source: I made this mistake when changing a Route53 record for a subdomain that was getting about 70Krps. Luckily it was our source code making the requests, so we could change it, but that took an hour or so to roll out.)

[1]: https://www.rfc-editor.org/rfc/rfc2308
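For reference, RFC 2308 derives the negative-cache lifetime from the zone's SOA record: it is the smaller of the SOA record's own TTL and the SOA MINIMUM field. A tiny sketch of that rule, with a hypothetical function name and made-up values:

```python
def negative_cache_ttl(soa_ttl, soa_minimum):
    """Seconds a resolver may cache a "no such record" answer (RFC 2308).

    The value is the minimum of the SOA record's TTL and its MINIMUM field.
    Resolvers may additionally apply their own upper limit.
    """
    return min(soa_ttl, soa_minimum)


if __name__ == "__main__":
    # Example values only, not taken from any real zone.
    print(negative_cache_ttl(3600, 86400))    # SOA TTL is the smaller value
    print(negative_cache_ttl(172800, 86400))  # MINIMUM is the smaller value
```

Checking both numbers in your zone's SOA record before deleting anything tells you how long a stray negative answer could linger.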


Nah bc you're not swapping from CNAME > A records, which could have other random networking related impacts, especially if you're running a public API.


Nice. We got our test site online by migrating just the dyno to render.com. (And connecting back to Heroku for the database.) We might just switch our DNS there and call it a permanent move.


probably worth staying at render considering how many issues heroku's had recently


Anybody know what the risks are for doing this? We have not one but 13 heroku domains running where we might do this.

What are the chances that their current work on the DNS issues might replace the underlying IP-addresses and hence we are making it harder on ourselves?


I thought about this, and it can't really get any worse than 'total outage', so I thought it was worth taking a shot. It can't go any more down.

It worked for me, so once Heroku has sorted their s*t out and their DNS system is back up it's just a matter of deleting the A records and replacing them with the CNAME.

Worst case scenario is the underlying Amazon IP Addresses change and it goes back down.


Wouldn't the worst case be that the underlying IP address changes, and your site now points to some other service? I don't think Heroku offers static IPs.


Heroku seems to use the Host header on the HTTP request to determine what customer the request is sent to, so I don't think it can end up somewhere it shouldn't be unless the IP leaves control of Heroku entirely and is picked up by another AWS customer.


Heroku uses the host name to route requests. So you wouldn't serve another site, it would just give you an error that your site wasn't found or something.
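A toy model of Host-header routing shows why a stale IP would produce an error rather than someone else's site. This is only an illustration of the idea described in the two comments above, not Heroku's actual router; `route` and the hostnames are invented for the example:

```python
def route(host_header, apps):
    """Pick an app by the request's Host header.

    `apps` maps hostnames to app names. The shared IP address plays no
    part in the decision, so an unknown hostname gets a 404 instead of
    another customer's app.
    """
    # Drop an optional ":port" suffix and normalise case.
    host = host_header.split(":", 1)[0].lower()
    app = apps.get(host)
    if app is None:
        return 404, "no such app"
    return 200, app


if __name__ == "__main__":
    table = {"www.example.com": "my-app"}
    print(route("www.example.com", table))
    print(route("someone-else.example", table))
```

The remaining risk is the one already mentioned: the IP leaving the platform's control entirely, in which case no routing table on the original platform is consulted at all.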


There is one more problem I can think of (but yes you're right, it can't get worse than 'total outage'): Heroku might detect that the CNAME is missing and will deactivate some stuff from their automatic domain management / TLS certificate issuing.


Oh my god, thank you so much. You're a lifesaver.

For those on Cloudflare, this will still work. I just deleted the existing CNAME, added the 4 IP addresses as separate A records and it came back up instantly.


No historical DNS records on that site for us unfortunately, any other possible sites to check?


Try this command: nslookup DOMAIN 212.230.255.1

That is a random public DNS server I found on a DNS checker site that works (as it seems to be down from all the common ones like 8.8.8.8, 1.1.1.1, etc). DOMAIN can be your actual domain or the long heroku alias.
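The same lookup can be scripted. The sketch below shells out to `dig` (assumed to be installed) with `+short` against a resolver of your choice and keeps only the lines that are IPv4 addresses, since `+short` also prints CNAME targets and error comments. The helper names `ipv4_lines` and `lookup_a` are hypothetical:

```python
import ipaddress
import subprocess


def ipv4_lines(text):
    """Keep only the lines of `text` that are valid IPv4 addresses."""
    found = []
    for line in text.splitlines():
        candidate = line.strip()
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # e.g. a CNAME target or a ";; timed out" comment
        found.append(candidate)
    return found


def lookup_a(name, resolver):
    """Ask `resolver` for the A records of `name` by running dig."""
    result = subprocess.run(
        ["dig", "+short", f"@{resolver}", name, "A"],
        capture_output=True, text=True, timeout=15, check=False,
    )
    return ipv4_lines(result.stdout)
```

For example, `lookup_a("DOMAIN", "212.230.255.1")` mirrors the nslookup command above; an empty list means that resolver returned no usable answer.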


Thanks, for some reason this worked!


You bloody legend, thanks!


you saved my users! thanks so much


This one worked for me: https://dnshistory.org/


Used this site https://www.whatsmydns.net, and it worked.

Unfortunately the Heroku issue lost us a few thousand $ alongside a few customers.


This worked for us, thank you.

And for those using Cloudflare, this method works.


This got us back online. Thank you so much.


Can someone clarify what to do if your app doesn't use a custom domain, and running `heroku domains -a` just returns `my_app.herokuapp.com`, and then entering that into https://securitytrails.com/dns-trails returns no results?


Do you use Cloudflare?


No we just run our API on heroku and use no custom domain names. For some of us requests are working as normal, but for others every request is failing.


What error are you getting? Because for us it seemed that the Cloudflare <-> HerokuDNS link was down. But even then I was able to access my app from the dashboard (like https://my-app.herokuapp.com/)


Presumably if your site is behind Cloudflare then this strategy won't work, right? Since the IP addresses that Security Trails sees are just of Cloudflare rather than your actual Heroku IPs...?

Or is it possible in the Cloudflare dashboard there is somewhere to see your Heroku server's IP address?


You need to put in the herokudns.com address that the CNAME is pointing at – e.g. stark-wisteria-rnbgkawldfk6gq7m8308ytts.herokudns.com in our case.


For me the securitytrails.com website just crashes. I put my DNS target: "stark-horse-mrp4jeowu9yvwpnnma32x6hd.herokudns.com", clicked "Run Check" and it seems to redirect to a failed (no CSS) webpage. Anyone else experiencing this?

EDIT: I MANAGED to make it work with this:

https://www.nslookup.io/

Make sure to go through all the tabs at the top (Cloudflare DNS, Google DNS) - for me they were all "no A records found". Only "Authoritative" gave me 3 A records which I successfully managed to use.


That seems to happen when you put in a DNS target that doesn't have any records in SecurityTrails. In that case, it is best to use nslookup


Yeah securitytrails.com was working for me, but went down about 10 mins ago.


Thanks so much, it worked for us.


We use Cloudflare. If you look up the IP address for your public domain name, you will get the Cloudflare IP, yes. If you look up the IP address of the CNAME target, you will get the Heroku IP.


on Cloudflare do you create A records (with found IPs) with name "www" or "mydomain.com"? also do you make that A record proxied or no? Thank you!


For my domain (https://www.poof.io), it's www.

If you just use something like https://poof.io, then it would be @. Depends on your site.

There should be a few historical IP addresses, but you would create an A record for each of them.


Use an A record and type in @. Proxied is fine.


Worked for us on Cloudflare


We use Cloudflare + Heroku and it worked for us.


It worked for us too. Thanks.

Note: If you used to have a CNAME record for yourdomain.com to www.yourdomain.com, then you have to add two A records for each IP (one whose name is yourdomain.com pointing to the IP, and another whose name is www pointing to the same IP)


Bless you! Thank you for taking the time out of your day to help us internet people :)


Heroku reports the outage is over - but I wasn't able to resolve our Heroku-based domain name until recently today.

Should be able to go back to CNAME now. Regardless, recheck yours


Worked... thanks a bunch. Does anyone know if it's safe to go back to the CNAME now, or should I leave the A record until the morning? Would prefer to get some sleep.


Worked for us on Cloudflare using the .dns Heroku target!


Thanks so much, worked for us as well. I created A records for all three of the listed IP addresses but it started working for the first one already.


Yes, this worked for me...I was trying to do this right away, but I was missing a way to find historical DNS, thank you!!


Woot! Thank you - that seems to work, just used the first IP that came up on ST (there were 4 listed).


This saved our bacon and got me a pat on the back from our VP. Thanks!


You and your team are brilliant. Thank you for this post! Much love.


This worked for us. We're using Heroku + Cloudflare.


Anyone know how to make this work for a subdomain?


Do you have a CNAME record pointing to a herokudns.com domain? Replace it with an A record that points to the IP instead. You can find the IP by going to https://securitytrails.com/domain/<something-1234.herokudns....


Wow this worked! You're amazing! Thank you so much


I think I love you. We're back online.


Worked for me as well! Thank you!


Thank you! You saved our ass.


Thank you a lot!!


This worked for us too - thank you!

Edit: It also works for subdomains.


Working well !


Production completely inaccessible, on the plus side I might finally now get backing to migrate away from the Heroku cesspit.


Yes, I'm also done with Heroku. What alternative do you recommend?


My company moved a 600GB postgres db to crunchydata (https://www.crunchydata.com), the process was super smooth. In fact, it went so well that I agreed to a quick case study: https://www.crunchydata.com/case-studies/calibre.

We're currently using Heroku for app hosting, while evaluating fly.io, render.com and railway.app. All three have had exceptional reviews from other customers and differ slightly on their service offerings and setups. All seem like viable alternatives so far!


I’d be really interested to know how your research goes. We are running a large production scale app on heroku. And that we have no DevOps or Infra staff is awesome. But heroku has been a huge let down lately.

Most of the reviews we have seen of the competitors are all hobby level. And the last time we checked some of these competitors, we found their security posture was not at the level we would require.

So we had to simply rule them out and either stay with Heroku or move to a big 3.

If you’re keen to share - let me know and I’ll send you my details.


Craig here from Crunchy Data. I think you're speaking to the app side of things being hobby level. On the database side of things, I'd say our security posture for Crunchy Bridge is stronger than Heroku's. By default we isolate all databases in a VPC, and everything is purely single tenant, whereas Heroku Postgres, at least when I was there, had multiple forms of multi-tenancy. Multi-tenancy in Postgres can carry risks[1], and this applies even to the major 3 cloud providers. Our team is essentially the original Heroku Postgres team, so we've built with security but also user experience for Postgres in mind since day one.

Now I assume you were speaking to the 3 mentioned, render, railway, fly in terms of hobby level. All three are fairly young relative to Heroku's age, but Fly did recently get their SOC2 and the team really took it to heart and invested in it so I'd put some stock in that. I can't speak definitively to the others, but do know all three can be solid for production apps. If you've got HIPAA or other specific requirements I'd encourage a conversation with them.

[1] https://www.wiz.io/blog/the-cloud-has-an-isolation-problem-p...


Thanks Craig. I took a look at Crunchydata. However, as best I can tell, unless we are in Enterprise Tier Heroku (or maybe not even then) we have to connect to Crunchydata via the internet (with an IP whitelist?) rather than through VPC peering or similar. Which is a limitation of Heroku rather than you. I assume with something like fly it could be done via VPC peering?

I just read fly got SOC2 type I recently. But I mean, this is hosting infra containing all our data and our customers' data. People providing infra really need to take security extremely seriously and prove it.

Awesome what they are doing - I just don’t feel like they are ready for primetime business. We are a small startup (5k monthly on Heroku) but there is just no reasonable way we can tell our enterprise customers' security teams that we are hosted with these guys and can vouch for and vet their security.

Once fly has type II - we’ll take another look.


This is the same path we're on. Migrated to Crunchy a month ago or so to remove the major migration risk and are using Render to host an auxiliary service while our core application remains on Heroku. Haven't yet done any non-toy deployments on Fly.io or Railway but I very much like Render's Blueprints and environment groups.


For what it's worth: Crunchydata appears to have been using Heroku DNS in some way. We have a Heroku app with Crunchy databases, and our Crunchy dbs became inaccessible shortly before the app did.


Craig here from Crunchy, we do run some small pieces on Heroku, but during the DNS outage saw no interruption of databases and they all seemed to be available and up. Please do feel free to drop us a note and we'll happily investigate if something did occur there.


We did some trials and ended up on Crunchy Data for the Postgres part of the equation. Wish we had done so earlier, logical replication was a big win.

Sadly, we're still using Heroku DNS, but this should accelerate finding an alternative.


I'm looking at fly.io and render.com.


I moved from Heroku to Render. Can't recommend them enough. It's a smaller operation, of course, so keep that in mind if you need support, but as a hobbyist I'm running 2 postgres dbs, 1 redis instance, and 4 web services (2 websites).


(Render founder) thanks for the rec and feel free to get in touch with me anytime (email in profile). For those who need them, we offer enterprise support plans with response and uptime SLAs.


I just completed the move of my web app to render, and for now connecting back to Heroku for the postgres instance. I'll move that over eventually.


We’re using Porter to simplify the migration to AWS


Left Heroku for Fly.io a year or so ago and have been very happy.


How have you found uptime on fly.io (and are you using multiple instances for a high-availability setup?). We've been running on Heroku without HA, and it's been good enough, and wondering if we might expect to be able to get away with that on fly.io too...


Uptime has mostly been good, although I didn't set up alerting for any of my Fly projects until recently. All my apps are running on a single node per app, so I haven't tried

The one notable outage was this bug where Fly was evicting smaller instances at overloaded DCs rather than preventing new apps from acquiring resources:

https://community.fly.io/t/app-stuck-in-pending-state-after-...

The workaround was to upgrade to a larger instance, but I probably evicted someone else.


Did you use the https://fly.io/launch/heroku migration page? It almost looks too good to be true.


No, I wasn't using much Heroku-specific. All of my apps run in Docker containers, so it's mostly just migrating my CI to deploy to Fly instead of Heroku.

I did get a lot of perf increases moving from GCP to Fly, but I think I have to credit most of the improvements to SQLite being faster than Firestore:

https://mtlynch.io/retrospectives/2021/12/#migrating-my-side...


wow it def does lol


We moved our app to Porter after the credentials incident this spring and have been very pleased.


Moving to ECS on AWS; the only reason is that I'm a DevOps engineer with familiarity in that space. Honestly wouldn't know where to send developers without the time to figure out the nuts and bolts of the convenience that Heroku provides.


Please check out withcoherence.com (I'm a cofounder). We manage the nuts and bolts of apps using ECS/fargate on AWS and Cloud Run/GKE on GCP. We also integrate managed cloud IDEs and managed CI/CD pipelines using the same simple configuration, and put a dashboard on top of it!


Render.com has been great. Our test site is up, connecting back to Heroku for postgres. ( https://www.public.pink )

I just happened to try out Render last week - perfect timing.


Yep...it's time (me too).


Looks like just DNS for the CNAME is broken.

    $ dig @1.1.1.1 stark-wisteria-rnbgkawldfk6gq7m8308ytts.herokudns.com A
    ...
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 1232
    ; OPT=15: 00 09 6e 6f 20 53 45 50 20 6d 61 74 63 68 69 6e 67 20 74 68 65 20 44 53 20 66 6f 75 6e 64 20 66 6f 72 20 68 65 72 6f 6b 75 64 6e 73 2e 63 6f 6d 2e ("..no SEP matching the DS found for herokudns.com.")
    ;; QUESTION SECTION:
    ;stark-wisteria-rnbgkawldfk6gq7m8308ytts.herokudns.com. IN A

I wonder if there is any way to get out an IP address of the Heroku router we were assigned to that we can use in place of the CNAME.

Might be in the logs somewhere, or in Cloudflare somewhere?


"no SEP matching" is a DNSSEC error. The DS records in the .com. zone are there, but the keys they refer to are no longer available in herokudns.com.

Hopefully someone has a backup of those keys. If not, I think they have to contact .com. to replace the keys. It can take several hours to come back.
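The `OPT=15` line in the dig output above is an Extended DNS Error option (RFC 8914), which older dig versions print as raw hex: a two-byte info code followed by free-form text. A small sketch that decodes it follows. The function name `decode_ede` is made up, and the lookup table lists only a few of the registered codes:

```python
# A few info codes from RFC 8914; the full registry is longer.
EDE_CODES = {
    6: "DNSSEC Bogus",
    7: "Signature Expired",
    9: "DNSKEY Missing",
    10: "RRSIGs Missing",
}


def decode_ede(hex_dump):
    """Decode the payload of an EDNS option 15 printed as spaced hex bytes.

    Returns (info_code, code_name, extra_text).
    """
    raw = bytes.fromhex(hex_dump)  # whitespace between bytes is ignored
    info_code = int.from_bytes(raw[:2], "big")
    extra_text = raw[2:].decode("utf-8", errors="replace")
    return info_code, EDE_CODES.get(info_code, "unknown"), extra_text


if __name__ == "__main__":
    dump = (
        "00 09 6e 6f 20 53 45 50 20 6d 61 74 63 68 69 6e 67 20 74 68 65 20 "
        "44 53 20 66 6f 75 6e 64 20 66 6f 72 20 68 65 72 6f 6b 75 64 6e 73 "
        "2e 63 6f 6d 2e"
    )
    print(decode_ede(dump))
```

On the bytes quoted in this thread, the info code is 9 (DNSKEY Missing), which fits the DNSSEC explanation given above.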


Immediately came to HN, best status server


Same. If I am having a problem with a site but nothing else, next step is HN.


I wonder if some clever person could build a routine to grep HN headlines and build a multi-service outage dashboard from what suddenly gets popular.


I'm building a service to create a forum, a la HN style, for service incidents.

If you're interested in getting early access - get in touch here: hello@awareops.com


Haven't made any updates or changes in the last few days, and my site still won't load. So I don't think it has to do with new apps or updates.


Same. The heroku incident report says "increased failure of insert/delete/changes to DNS."

Haven't changed DNS in months. Site is down.

Heroku is looking increasingly like nobody is minding the store these days. I still know of no competitor which can match the DX when it's working, but what good does that do you if it breaks all the time.

We're still waiting on them fixing our ability to restore from backups without a manual hack -- which is affecting many customers, but which they don't even have an active published incident on.


+1 prod outage here too. No changes to DNS on our part.


Upstream DNS provider? I would have thought Heroku would run their own DNS servers? Is it just turtles all the way down?


NS1 is a venture of Salesforce, the owner of Heroku, so it makes sense if that is the provider.


No one wants to do anything themselves anymore. It's pretty sad, to be honest. DNS isn't even that hard.


Our site just came back a few minutes ago, and we're showing up in 8.8.8.8 again, so it seems that Heroku has sorted out their DNS issues.


> We will be providing updates on a 1 hour cadence.

Ouch. Not a great look, Heroku!


And no update after 2 hours. How/why did Salesforce run this platform into the ground?


last update at this point: over an hour ago


It seems that NS1 (Heroku's provider) was having problems, but according to them it has been fixed: https://ns1status.com/#!/incident/365716


This does not appear to be a total outage. I cannot reach any of our sites, and Pingdom also reports we are down, however, I can see normal looking traffic reaching our servers (via heroku logs --tail). In addition, members of our team are reporting via Slack that some can reach our Heroku-hosted sites, others cannot. It seems to be ISP-related. Two people within 1 block of each other on different ISPs see different results.

We proxy some services through Cloudflare to gain IPv6 support, and all of those are down, which suggests the Cloudflare -> Heroku network route is broken.


Looks like Heroku uses NS1 as the upstream DNS provider. You can find the information like this:

  dig NS @1.1.1.1 test.herokuapp.com -> fail

  dig @1.1.1.1 test.herokuapp.com -> fail

  dig NS @dns1.p03.nsone.net test.herokuapp.com -> works

  dig @dns1.p03.nsone.net test.herokuapp.com -> works

So my conclusion is that NS1 is having issues responding to DNS queries from other DNS servers. Interestingly, there is no public information on Heroku being dependent on NS1, nor any current outage listed on the NS1 status page.
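The comparison above (public resolver fails, authoritative server answers) can be written down as a small decision rule. This is a rough sketch with invented labels; it takes the two answer lists as plain inputs, however you collected them, and does no DNS lookups itself:

```python
def diagnose(recursive_ips, authoritative_ips):
    """Classify a lookup from two lists of A-record answers.

    recursive_ips: answers from a public recursive resolver (e.g. 1.1.1.1).
    authoritative_ips: answers from the zone's own name server.
    The returned labels are hypothetical, chosen for this example.
    """
    if authoritative_ips and not recursive_ips:
        # The pattern reported in this thread: the data exists, but
        # resolvers cannot or will not hand it out.
        return "resolver-side failure"
    if not authoritative_ips and not recursive_ips:
        return "no answer anywhere"
    if not authoritative_ips:
        return "cached answer only"
    if set(recursive_ips) != set(authoritative_ips):
        return "answers differ"
    return "consistent"


if __name__ == "__main__":
    # Documentation-range addresses, for illustration only.
    print(diagnose([], ["192.0.2.10", "192.0.2.11"]))
    print(diagnose(["192.0.2.10"], ["192.0.2.10"]))
```

A "resolver-side failure" result does not say why the resolvers fail; validation problems such as DNSSEC errors and reachability problems between resolver and authoritative server both produce it.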


Complete outage on all of my sites.

I have a migration plan in place but it's been put on hold to launch a new Product.


And the new product might have to be put on hold until Heroku is back… catch-22 - you’ll never migrate away


Well, the new product is hosted on AWS, so it's much more stable, and we've made our first lot of pre-sales so we're on a tight deadline. Not sure why you're being so negative for no reason.


I think they were being funny... :)


Then they should have added a smiley face :p


Here :)


Yes...all my apps.

https://status.heroku.com


I feel silly for posting now. Thanks for sharing.


Missive, Substack, and Steam appear to be down because of this


Our entire suite of sites is down, both ones proxied via Cloudflare and ones not proxied.


TIL Substack runs on Heroku.



They might just have the same upstream DNS provider?


Seeing issues with hackerweb.app and substack.com too.

They don't share upstream DNS (and I'm not sure heroku's homepage has the same DNS provider as customer domains). NS matches SOA for each of these domains.

    SOA heroku.com.    1h00m00s   "dns1.p04.nsone.net." "hostmaster.nsone.net."
    SOA hackerweb.app. 1h00m00s   "olga.ns.cloudflare.com." "dns.cloudflare.com."
    SOA substack.com.  1h00m00s   "ali.ns.cloudflare.com." "dns.cloudflare.com."


Wow. Imagine being so out of touch that you have to have another entity run your DNS services.

Email I get, because there has been a hard push for decades to force everyone on to big providers, but DNS can literally be run by anyone, anywhere.

Did the primary servers push bad data, making the secondary / tertiary ones break, too? If not, why not extend the cache lifetime and run off of them until the primary are fixed?

Sigh. This is rather ridiculous, and is rather embarrassing for Heroku.


All services are completely down for my business.

Looks like they nuked their DNS.


I am now seeing one of my services coming back online. For this service I have not replaced the CNAME. Anyone else seeing some service restoration as well?


Same boat for me. How long do you think you'll wait to go back to CNAME's?


I went back to the CNAME now and everything is up and running. Fingers crossed that Heroku reports to have the issues resolved.


Our service started coming back up around twenty minutes ago too.


Why does Heroku have so many uptime issues? Seems to be happening every few months. Last week there was downtime, now again...

Is Salesforce committed to Heroku?


Heroku has brought our site down 5 times in the last 6 months. It's infuriating, but I only have myself to blame for not migrating.


Yes, can't even access https://dashboard-next.heroku.com/ so the problem seems broader than what they describe on their status page, which seems to imply only issues related to updating DNS settings.


17:57 PDT, at least a few of our sites are starting to wake up. Decidedly not all of them yet. sigh


Weirdly, dig +trace works fine, but public resolvers like Google and Cloudflare refuse to return the DNS records. This has to be a DNSSEC issue, right? paging tptacek :p


I've swapped the A records back to the CNAME and all is looking good. The Heroku Dashboard still isn't loading app details though.


Production outage here as well.

Is this isolated to Heroku?


Heroku dashboard was at least working before, but now trying to view details of an app it just hangs


It seems that their service is back up


Other Cloudflare-backed services like Observable are offline right now. Is it a Cloudflare outage?



More likely that Cloudflare is unable to access downed servers.


Maybe. I'm just trying to figure out what's common between Steam, Observable, Heroku, and Cloudflare itself, which are all down according to downdetector.


That's blaming the wrong entity - again. No-one seems to read error pages nowadays, not shocking. Also, Steam never used CF services, and user reports about CF are that - they looked at their favorite Substack (https://substack.com) and just seeing a CF page they assumed it's CF's fault.


Seems to be resolved now


Heroku. DansGame.


Heroku Herokuing.



