I've said it before and I'll say it again: expiring certs are the new DNS for outages.
I still marvel at just how good Tailscale is. I'm a minor user really but I have two sites that I use tailscale to access: a couple of on-prem servers and my AWS production setup.
I can literally work from anywhere - had an issue over the weekend where I was trying to deploy an ECS container but the local wifi was so slow that the deploy kept timing out.
I simply SSH'd over to my on-prem development machine, did a git pull of the latest code and did the deploy from there. All while remaining secure, with no open ports at all on my on-prem system and none in AWS. I can even do testing against the production Aurora database without any open ports on it; you simply run a Tailscale agent in AWS on a nano-sized EC2 instance.
Got another developer you need to give access to your network to? Tailscale makes that trivial (as it does revoking them).
Yeah, for that deployment I could just make a GitHub action or something and avoid the perils of terrible internet, but for this I like to do it manually and Tailscale lets me do just that.
Tailscale remains useful when deploying with GitHub Actions. Currently, I have my cloud VM open on an unconventional SSH port so that GHA workers can SSH into it and initiate the deployment. I plan to use their action [0] so that any GHA worker can access the deployment machine without exposing any ports.
I'd recommend, as part of the post-mortem, moving their install script off their marketing site, or putting in some other fallback, so that marketing-site activity is off the critical path for customer operations. They're almost there in terms of maintaining that kind of isolation, which helps because this kind of thing is common.
We track uptime of our various providers, and seeing bits like the GitHub or Zendesk sites go down is more common than we expected... and they're the good cases.
Further, the security of a marketing site tends to be a lower priority than that of the product itself, and an install script should generally be secured similarly to the product.
Yes. We're lamentably probably going to have to move it (the install script), even though it has a nice URL today.
When we picked that URL, the marketing site was created and run by the same people who built the rest of the product, so it didn't seem like a concern at the time.
You can achieve both. The only mistake you made was to half-bake the proxy (doing it for IPv6 only): proxy every http(s) request to tailscale.com. Vercel's platform is valuable for a whole host of reasons, and the networking side isn't the important part; your developers will still get plenty of value out of Vercel even if every request is proxied through a web server hosting tailscale.com that answers /install.sh itself instead of passing it through to the marketing site.
(In Google Cloud you could do it entirely with load balancing rules, no need to even run a web server)
`curl -fsSL https://install.tailscale.com | sh` wouldn't be any less nice. Append /sh if having something human-friendly at the root is desirable (SEO, etc.), and you're still at the same overall length as today.
> Append /sh if having something human-friendly at the root is desirable
Even this isn't really necessary; curl sends a default User-Agent header identifying the traffic as coming from curl. It's simple enough to direct traffic with the curl user agent to the script and all other traffic to a static website with directions for how to quick-install.
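To illustrate (install.tailscale.com is the hypothetical URL proposed above, and the routing itself is assumed to exist): curl identifies itself with a User-Agent like `curl/8.x.x`, so the two paths would look like this:

```sh
# Hypothetical: assumes install.tailscale.com exists and routes on the User-Agent header.
curl -s https://install.tailscale.com                      # default curl UA -> the install script
curl -s -A 'Mozilla/5.0' https://install.tailscale.com     # browser-like UA -> the human-readable page
```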
Very nifty, but I'd argue against it due to possible confusion in various atypical scenarios, such as:
* The user wants to read the script before executing it, and their preferred reader (perhaps due to browser extension or something) is a standard browser.
* The user has `curl` aliased to `curl-impersonate` in order to avoid things like Cloudflare's bot detection (a captcha that triggers on things beyond the HTTP request, like the less fancy TLS handshake of regular curl) -- https://github.com/lwthiker/curl-impersonate
* The user doesn't have curl installed, but has wget / lynx / some headless browser / etc. and expects any of those to work the same as curl.
Not to mention, if a site encouraged users to execute an HTTP response by piping curl into sh, and the response for curl was different than the response otherwise, that just might make the top of HN for being sketchy as hell.
> The user wants to read the script before executing it, and their preferred reader (perhaps due to browser extension or something) is a standard browser.
I mean, the point of wanting to read the script before executing it is to try and protect yourself from malicious scripts that abuse the curl | sh pattern. So since it would be simple enough for a malicious actor to return something different when the user agent indicates the usage of curl, the only responsible thing to do, anyway, is to use curl to download the script to a file, read the file, then execute the file.
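Concretely, that's just (URL is a placeholder):

```sh
# Download and read the script before running it, instead of piping straight into sh.
curl -fsSL https://example.com/install.sh -o install.sh
less install.sh      # actually read what you're about to execute
sh ./install.sh
```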
> `curl` aliased to `curl-impersonate`
So when the user uses a tool to impersonate a browser, they'll see exactly what they'll see in a browser... which are the quick-install instructions anyway, which can include a note about the user agent, if anyone actually hits this in the real world?
> wget / lynx / some headless browser
Which would provide the quick-install instructions to use curl :)
That's great that DevOps (or whatever their title) owns both product and marketing sites. Far too many companies (and DevOps teams) think the www site is "not important" or "not their core job" and outsource it to either a less qualified team, or out of the company altogether.
From an external perspective, no one cares whether www going down is "your fault" or of "direct impact to the product". It's a corporate black eye either way.
This hasn't been my experience. In my experience, the reason why your ops teams divest themselves from the marketing side is because marketing decides to contract some firm to design their site for them, and the firm decides to deploy to Vercel or WP-Engine or whatever. Then, Marketing comes to ops and says "hey, I need you to change this DNS thing" weeks/months into their engagement, with no understanding of the ramifications. Ops/product team pushes back, because the change would fundamentally break the application. Marketing gets defensive, "we've spent all this time and money on this, you just need to make it work", a broken halfway solution is implemented, and ops/product, in protest, divests themselves from the solution. Bingo bango, shadow IT is ratified, the kludgy hackjob lives in production forever, and no one thinks about it until the next time something breaks.
Reminds me of the time marketing decided to change the logo on the marketing site for the product team I was on without being aware that the site was scraped and redeployed on a different domain (by hand). When the logo changed, the CSS for the image element wasn't updated, truncating part of the logo, proudly displaying the word "ass" as a part of the logo in an unfortunate cropping incident.
"Far too many companies (and DevOps teams) think the www site is "not important" or "not their core job" and outsource it to either a less qualified team, or out of the company altogether"
It's impossible to know because they won't admit it publicly. You are guessing based on some anecdotal experience.
But then again... here's mine! I worked at a very successful SaaS that had (really not kidding) the most incompetent, lazy dope running the www site. He live-edited a "staging" version of the site on the fly (no, it wasn't private, you could access this thing from the internet, and he didn't know or care about that). When he was happy with his changes he'd destroy the live instances behind the load balancer and clone his staging instance, without taking it down or running any extra checks. This staging instance was around for years and I don't think he ever bothered doing a system update. Since he didn't use git, I'll bet that at least once he cloned a live instance back to staging to undo a bunch of bork.
I lost count of the incidents. He never detected them himself, was never available to troubleshoot them and was generally a big "durrrr" when you'd finally get him on the call. Example: one time we had a "slow, intermittent errors" customer support ticket surfaced to us, not because it was our job, but because dopey was being an absolute ass to the helpdesk guys. He ran his crap in another AWS account we didn't have access to. About a day later the www site went down completely, so we got hold of the AWS account and dug in. All 5 of the instances behind the load balancer were "unhealthy" for various reasons: certs expired, disks full, Apache stopped. We bounced them, got them back up, and SSHed in. They all had different versions of the site. It was a complete mess. Turns out dopey wasn't very good at killing the old instances and cloning staging. He was probably live-editing the instances for smaller changes if that seemed easier than a bunch of AWS console work.
Unbelievably he wasn't fired and continued to mismanage the site, and we could do nothing because the head of marketing didn't listen to the head of engineering. They hated each other. The way Marketing saw it "your SRE guys couldn't fix it, they had to wait for <dopey> to get on the call". I'm not even kidding.
Just more anecdotal evidence from me. You might be right.
Cloudflare has a dedicated team handling certificate-related engineering challenges. While it simplifies the process for your domain on Cloudflare, internally-facing domains remain a pain point for a lot of engineering teams.
They made the same mistake we did at a former company — put a link to our webapp’s login page (app.foo.com) on the marketing site (www.foo.com) homepage.
It wasn’t until our first marketing web site outage that we realised that our $40/mo hosting plan was not merely hosting a “marketing site” but rather critical infrastructure. That was a load-bearing $40 hosting plan. Our app wasn’t down but the users thought it was.
I learned then that users follow the trails you make for them without realising there are others, and if you take one away then a segment of your user base will be completely lost.
When I type "tailscale" into my browser, the first result is tailscale.com. I do not need to use the tailscale admin console often enough that I would go out of my way to memorize the different URL.
My browser used to autofill dash.cloudflare.com when I typed in "cloudflare". I visited the cloudflare.com website exactly once, and now that's what shows up as the first result, so I find myself doing the same thing with Cloudflare.
A couple of years ago there was an outage at Cloudflare where www.cloudflare.com started pointing to an unrelated third-party marketing website, while dash.cloudflare.com was still showing the dashboard.
I really like these guys; I wish their pricing wasn't so ridiculous. Proper access control shouldn't cost 18 bucks a month for a VPN. It's basically unsellable to management at that price, and the lower tiers are unsellable without it.
I'm really interested in what you're comparing Tailscale to internally, because it does way more than just VPN. What are the cheaper options, and do they also have an SSH feature, oauth authentication to the network for automation services, the ability to stand up VPN node loadbalancers in kubernetes clusters, and ACME certificate request automation through LetsEncrypt? Just to list a few features that I use from Tailscale's free tier that I don't normally think of as the job of a VPN service. And they're constantly adding new features that make it a really interesting and competitive choice, in my opinion. Honestly I'm mostly interested in this take because I'm shocked by how much they offer in the cheaper tiers.
My problem is that I don't care about any of the stuff you mentioned, because I can't easily recreate my internal VLAN segregation at the six-dollar tier; I need to pay up for the 18-dollar-a-user tier, which is insane pricing for my shop of 70-100 users. Compare that to the Forti SSL VPN included with my router: I can add users to groups in Active Directory that are then synced to Azure AD, which provides SAML authentication to the VPN. Users log in with their O365 credentials and I can configure it to mimic the already-approved VLAN segregation easily, based on groups provided by O365. This basic stuff should just be there for a corporate solution, and it appears to be, but for 100 users the bill is $21.6k/year, which is literally 4 times more than I can justify / get approved. Yes, it has a bunch of other features, but they are irrelevant because the basics aren't there at a price that is doable.
> Yes it has a bunch of other features, but they are irrelevant because the basics aren't there at a price that is doable.
The basics don't cost $18/user/month though. The whole package does. I hear what you're saying, and maybe you just accidentally worded it this way, but the obvious rebuttal to it is: How much would it cost you to set up a solution where only ACL'd users can SSH into your infrastructure/servers? You're looking at services that cost money like Userify for that. For many of the other features Tailscale offers, you're probably either paying another service to handle that responsibility, not doing it at all, or you're spending your time recreating it, and I bet your time isn't cheap to the company either.
Anyway, that's somewhat of a hypothetical rebuttal. I actually assume you did the due diligence and weighed the cost with the portion of the feature set that you actually would make use of. I could see the price being more fair if they offered a lower cost tier that only provided the VPN and ACLs for unlimited users, but I'm not a savvy businessman so I'm not sure if it makes sense for a multi-tool company to sell a minority of users a screwdriver.
You're assuming that feature matters to me (it does, but I have a lot of users relative to servers, so even the most basic tier of Userify Express or their cloud offering would handle my needs for dramatically less than premium Tailscale). What you're missing is that the foundation of being able to deploy something like this depends on easy ACLs that integrate with my current identity system (i.e. SAML), because I need to be able to control access per business unit so that when the mesh of everything is brought online, users still only have access to the subset of systems they should have access to. This matters whether it's Sheila in accounting who only ever uses it to remote-desktop to her on-prem workstation, or Jackie in IT using it to manage on-prem servers across multiple locations and cloud servers.
So yes, maybe a small subset of my users are actually using enough of the premium bundle to justify that cost, but I can't even mix and match, because the basis of every use case (the ACLs) is only present in the premium package in a functional way.
The problem might be your approval process? $18 compared to fully loaded monthly cost is nothing. Unless you are using 100 tools like this maybe. Or based in LCOL location.
The pricing isn't ridiculous, it is by design. For better or worse, SaaS pricing is about finding features (regardless of their actual cost) that act as signals for "is a customer who can afford more". The $6 tier was you paying their marketing and market-research cost by trying them out :-). They probably don't need the $6; they need the data point that you were willing to pay something!
That's fair. I guess if you just need a VPN, it doesn't really make sense to consider a product that packs in all of these VPN-adjacent features. But part of my point was how much you're able to do on Tailscale for free, so in that regard the pricing really doesn't seem as bonkers to me, to be honest. The $6/month tier is also incredibly reasonable for unlimited users, but it is annoying, I'll grant, that they actually drop some ACL features from the Free tier to Starter. I suppose that's how they funnel a large number of enterprises away from Starter and toward Premium. But if you actually make use of the feature set of the platform, and if you have a decent DevOps team I'm pretty sure there are tools in there that they'll love, then $18/user/month actually doesn't seem too outrageous to me.
Just a quick heads up, colleagues of mine could not successfully host headscale themselves. In the end, they saw the value and bought tailscale access.
Configuring wireguard really is that hard. Tailscale is easily worth it
FYI: I've been self-hosting headscale for 9 months or so, and it's pretty brilliant. I didn't find it very hard to set up. A dedicated DERP server was pretty hard to set up, but most of that was because I was trying to host it behind our office load balancer, and that's no bueno. Once I put it on a dedicated IP, my secondary DERP was pretty easy.
But if you are going to self-host, seriously consider Nebula instead of Tailscale, unless you need non-technical users accessing it; Tailscale has a better story there.
(edit) The biggest downside of headscale is I don't feel confident I can update ACLs without having a high likelihood of taking down the entire tailnet until I can get it fixed.
It wasn't hard to set up headscale (or netbird, for that matter). I have set up both at home to test fairly easily. They aren't appropriate for a corporate setting though. I actually want to pay somebody for this, because I want the support when some change causes it to eat it at 4 in the morning, with the business day about to start and a strong requirement for it to be working.
Your (their) mileage may vary. I set up Headscale with external auth and it's been a dream, the kind where I don't really have to think too much about it. The only little gotcha is that sometimes getting the iOS client to read the server url from settings can be tricky. But once authed, it "just works" for me.
I had a very easy time selling this. We moved away from an OpenVPN setup, and Tailscale made it so much easier to onboard new employees and to do a lot more things “the right way”. We’re a fully remote company, so it’s even more important.
Although I admit that in my role I have quite a lot of weight in convincing management on these topics, price was not a concern.
We’ve been a happy customer since April last year, everyone on the premium / “expensive” tier. I’m also very impressed with their development speed: some features that were said “May take a few years to be delivered” actually were delivered last year already.
Cloudflare One could have been an alternative, but that would have been even more expensive.
Yah, if we were a fully remote company this would have been an easier sell. What we actually are (and what the vast majority of businesses likely are) is a small bunch of office locations plus a bunch of cloud infrastructure, so I can approximate Tailscale, in a way I'm unhappy with right now, by limiting cloud access to the office IPs and controlling access with various things from the office PCs. The VPN to do that works properly with ACLs and is included in our networking gear. So what Tailscale is competing with is something that is a sunk cost and not ideal, but of minimal additional cost, versus 20k a year of additional cost to provide premium features to all users when maybe 10% of them need those features.
Basically, Tailscale has a bundling problem. They bundled the features everyone needs (proper ACLs) with a bunch of premium stuff that is of less value to many, and they don't have the market power of Microsoft with its Windows operating system to force that kind of arrangement down my throat. They need a 2-3 dollar a month tier with proper ACLs and the mesh VPN, then the rest of the feature bloat à la carte per user (SSH key management is worth no more than 2 dollars a month based on the competition; no idea what the other features are worth because I have no use for them).
They also really need to improve their windows experience. More than once during testing I had a windows update break the vpn requiring alternate means of logging in and reconnecting, but that's an ancillary issue.
That's basically the issue that pushed us to Twingate. Turns out I like Twingate a little better for their routing capabilities anyway (which isn't to say I don't like Tailscale at all; I use both for different purposes).
I wonder what provider they use for their website. Sounds like a lot of hoops to jump through for IPv6 when just about any other provider has IPv6 support.
Wow, all comments removed as spam or hidden by default, update posted saying "We are targeting to land support for IPv6 towards the beginning of next year." Well, Q1 2024 has come and gone. Where's IPv6 support or the communication about what is happening? Good reason to never use Vercel if you ask me.
>I apologize for the slow response. We are targeting to land support for IPv6 towards the beginning of next year. We will communicate updates on this issue. Thanks for the patience.
So cringey. Why not just post a new post that said "sorry the deadline slipped, no new date available at the moment"? I will strongly recommend _against_ this company solely based on this communication. If this sort of gaslighting is how they handle their public comms, imagine how their support must be run.
Wow, mad jelly that their CI/CD and monitoring processes are robust enough to trust a major rollout in December. That's a pretty badass eng culture.
That being said, still some unanswered questions:
- If the issue was ipv6 configuration breaking automated cert renewals for ipv4, wouldn't they have hit this like.. a long time ago? Did I miss something here?
- Why did this take 90 minutes to resolve? I know it's like a blog post and not a real post-mortem, but some kind of timeline would have been nice to include in the post.
- Why not move to DNS provider that natively supports ipv6s?
Also I'm curious if it's worth the overhead to have a dedicated domain for scripts/packages? Do other folks do this? (excluding third-parties like package repositories).
>- If the issue was ipv6 configuration breaking automated cert renewals for ipv4, wouldn't they have hit this like.. a long time ago? Did I miss something here?
AIUI, they switched to their current setup 90 days prior to the outage. The initial cert they installed during their migration lasted 90 days. So 90 days after the migration, they had an outage.
Why does the proxy need to terminate TLS? If it were just a TCP proxy, then at least the monitoring wouldn't have been fooled into thinking the certificate wasn't about to expire.
Heck, a TCP proxy might even allow automatic renewal to work if the domain validation is being done using a TLS-ALPN challenge.
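A minimal sketch of such a pass-through proxy, using socat (the origin hostname is a placeholder); since nothing is decrypted, TLS-ALPN-01 challenges would reach the origin untouched:

```sh
# Forward raw TCP on port 443 to the origin; use TCP6-LISTEN for an IPv6-only front end.
socat TCP-LISTEN:443,fork,reuseaddr TCP:origin.example.com:443
```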
A TCP proxy discards the user's IP address, unless you use something like the PROXY protocol[1], which then needs to be supported by the target HTTPS server. You would also need a way to prevent unauthorized users from injecting their own PROXY header.
This isn't a problem if you don't need the user's IP address at all, but it's often useful for logging and abuse detection.
If it is point-to-point and you control both those points (forward A to B with ports open as approp), proxying any protocol should be straightforward, no?
It doesn't. That was one of our mistakes and action items to fix.
The original proxy was stood up quickly when it was first discovered IPv6 was broken and the people standing up the proxy didn't know at the time how ACME worked.
I've done enough fighting with Certbot to learn that the path of least resistance is to use dns-01 with the cloudflare plugin, and just make a very limited access key and store it on the servers that I am using to renew the cert.
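Roughly what that looks like (paths and domains are placeholders; the credentials file holds a narrowly-scoped Cloudflare API token):

```sh
# dns-01 validation via the certbot-dns-cloudflare plugin; no inbound ports needed.
certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d example.com -d '*.example.com'
```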
A non-TLS-terminating proxy is a great thing to host on a service like Hetzner. If you set up CAA correctly, then you are trusting the provider for latency and availability only, and you might as well avoid hilariously expensive services like CloudFront or an EC2-based proxy.
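For reference, a minimal CAA setup looks something like this (example.com is a placeholder):

```sh
# Zone-file syntax restricting issuance to a single CA:
#   example.com.  IN  CAA  0 issue "letsencrypt.org"
# Check what's actually published:
dig +short CAA example.com
```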
Hmm, it looks like Tailscale is using NetActuate for pkgs.tailscale.com. I bet NetActuate could help serve up a non-terminating proxy with plenty of PoPs at a reasonable price. Their website doesn’t give pricing, but it sounds like the kind of company that doesn’t mark up egress 50x.
> A non-TLS-terminating proxy is a great thing to host on a service like Hetzner. If you set up CAA correctly, then you are trusting the provider for latency and availability only, and you might as well avoid hilariously expensive services like CloudFront or an EC2-based proxy.
Are you really getting any latency or availability improvements in that case? What does a non-TLS-terminating proxy give you?
An HTTP proxy would need to be configured cleverly enough to serve its own ACME challenges directly and proxy any requests for the backend's ACME challenges, which is, I think, the trick that was missed in the Tailscale setup.
Anything even remotely security-adjacent that Tailscale as an institution fumbles, even once, is too dangerous for the merely mildly paranoid (like me, for example).
They have monitoring for their infrastructure, right? Add 50 lines of code that connects to all public domains on ipv4 and ipv6 and alerts if the cert expires in under 19 days. Set automatic renewal to happen 20 days out. Done.
I wrote this code years ago, after missing a couple ssl renewals in the early days of our small company. Haven’t had an ssl-related outage since.
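A sketch of that kind of probe, in case it's useful (host list and threshold are illustrative; `openssl x509 -checkend` exits non-zero when the cert expires within the given number of seconds):

```sh
# Check each public endpoint over IPv4 and IPv6 separately; alert under 19 days.
for host in example.com login.example.com pkgs.example.com; do
  for ipflag in -4 -6; do
    if ! echo | openssl s_client $ipflag -connect "$host:443" -servername "$host" 2>/dev/null \
        | openssl x509 -noout -checkend $((19 * 24 * 3600)) >/dev/null; then
      echo "ALERT: cert for $host ($ipflag) expires within 19 days (or the check failed)"
    fi
  done
done
```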
Edit: this is the only necessary fix, no need for calendar invites:
> We also plan to update our prober infrastructure to check IPv4 and IPv6 endpoints separately.
It sounds more like "90 days of alerts about DNS, and then certs fail". The fact that the presence of IPv6/AAAA DNS records causes Vercel to decline to auto-renew certificates seems to not have been known to the Tailscale team prior to the incident. (I haven't seen the alerts in question, so I don't know whether they made this fact clear.)
“That means the root issue with renewal is still a problem, and we plan to address it in the short term much like our ancestors did: multiple redundant calendar alerts and a designated window to manually renew the certificates ourselves”.
The other day I was looking for a system where we can track recurring yearly/monthly/etc tasks (such as cert rotation) and get alerted a week before and on the day.
About two hours into my search, while contemplating building my own, someone pointed out we could just use a shared GSuite calendar.
I'd probably spend the effort adding SSL expiration to the monitoring system for all the certs in use. Have it trigger a month/week/whatever before they're due to expire.
Jira is really underrated for its workflows and automations.
Part of my love/hate relationship with JIRA lasted until the lightbulb moment that it's not supposed to work perfectly out of the box, because no two places are the same.
The org I work for at $dayjob is in the midst of a poorly-managed migration from Jira to Azure DevOps (ADO), and I've had a tolerate/hate relationship with Jira for some years now. It often felt like it has too many distractions (what other people might call "bells and whistles"). So when I recently started using ADO, I thought, eh, this is close enough to Jira, why all the fuss from some of my peers about missing Jira features and such? And then I realized ADO lacks auto save (but Jira has it); freakin' auto save! Of all the things I could miss from Jira, this is one of those seemingly minor features that I grew to depend on. So, yeah, I guess I hate Jira a little less. ;-)
> When the Third Millennium Stasis kick off, basilisks on leashes, those are the intelligence treasure troves they will stripmine your optionality with.
I don't know what all this means, but generally I think it's probably not too hard to deduce an org's schedule for certificate rotation by just looking at the expiration dates on those same certificates.
How can someone be hosting infrastructure without some form of monitoring and alerting framework? Domain and cert expiry (not just for your estate but for any of your dependencies) seems like the lowest of the hanging fruit checks to implement when setting that up.
The issue they mention is that renewals are automated, but the IPv4-hosting service noticed some extra IPv6 addresses and halted the renewals because of that. On top of that, their monitoring of cert expiry was checking the IPv6 proxy so it didn't notice the problem.
The conclusion is hilarious: "we plan to address it in the short term much like our ancestors did: multiple redundant calendar alerts and a designated window to manually renew the certificates ourselves"
It's a joke, but it's also the least that should be done while whatever real fix is coming gets put into place.
A simple cronjob looks like it would handle it (see the sketch below), but what usually ends up being needed, once you have 10-15 of these types of tasks, is a simple, independent BPM workflow platform that tracks whether it happened or not... or anything else.
Learned this the hard way and won't do it any other way.
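For completeness, the naive cron version referred to above is just one line (schedule and reload hook are illustrative), and it's exactly the thing that silently stops working without an external tracker:

```sh
# Daily renewal attempt; certbot renew only acts on certs close to expiry, so running it often is safe.
0 3 * * * certbot renew --quiet --deploy-hook "systemctl reload nginx"
```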
My thoughts exactly. If feature X is important enough to do all the silly workarounds they did, then why choose a provider that didn't support feature X in the first place?
The choice of IPv4 + shenanigans vs IPv6 seems pretty straightforward.
Certificate Transparency is used to account for maliciously or mistakenly issued certificates. Perhaps it could also be used to assert the unavailability of correctly issued but obsolete certificates that are believed to be purged but actually aren't. (Services like KeyChest might already do this.)
Let's Encrypt is a miracle compared to the expensive pain of getting a cert 20 years ago. Rather than resting on laurels, would there be any benefit to renewing even more frequently, like daily? This might have confined the Tailscale incident to a quick "oops!" while the provider migration was still underway and being actively watched.
90 day renewal is frequent enough in my book. It's not so often as to be easy to miss, but often enough that the person setting it up can witness the first renewal cycle (if they so choose, which in this case they apparently did not).
Right. I was thinking of keeping the same 90-day validity but renewing much more frequently, rather than the 60-day period that LE recommends. But I can see my questions have irked other community members, so I'll leave it at that. :)
I renew some of my LetsEncrypt certificates monthly, which should be plenty, in my opinion. Gets you about 2 buffer cycles to notice the certificate isn't updating and recognize an issue in your automation.
Is this a good amount for not overwhelming the system? E.g. they recommend 60 days: https://letsencrypt.org/2015/11/09/why-90-days - and folks say the bot process won't let you renew unless the cert is within 30 days of expiring (but this might not be limited by the server-side API, because --force-renewal exists).
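For what it's worth, a couple of knobs exist for renewing earlier than the default window; a hedged sketch (domain and schedule are illustrative):

```sh
# 1. Unconditional renewal, e.g. from a monthly job:
certbot renew --force-renewal
# 2. Or raise the per-certificate threshold in /etc/letsencrypt/renewal/example.com.conf:
#    [renewalparams]
#    renew_before_expiry = 60 days
```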