I've said it before and I'll say it again: expiring certs are the new DNS for outages.
I still marvel at just how good Tailscale is. I'm a minor user really but I have two sites that I use tailscale to access: a couple of on-prem servers and my AWS production setup.
I can literally work from anywhere - had an issue over the weekend where I was trying to deploy an ECS container but the local wifi was so slow that the deploy kept timing out.
I simply SSH'd over to my on-prem development machine, did a git pull of the latest code and did the deploy from there. All while remaining secure, with no open ports at all on my on-prem system and none in AWS. I can even do testing against the production Aurora database without any open ports on it; you simply run a Tailscale agent in AWS on a nano-sized EC2 instance.
Got another developer you need to give access to your network to? Tailscale makes that trivial (as it does revoking them).
Yeah, for that deployment I could just make a GitHub action or something and avoid the perils of terrible internet, but for this I like to do it manually and Tailscale lets me do just that.
Tailscale remains useful when deploying with GitHub Actions. Currently, I have my cloud VM open on an unconventional SSH port so that GHA workers can SSH into it and initiate the deployment. I plan to use their action [0] so that any GHA worker can access the deployment machine without exposing any ports.
I'd recommend, as part of the post-mortem, moving their install script off their marketing site, or putting in some other fallback, so that marketing-site activity is off the critical path for customer operations. They're almost there in terms of maintaining that kind of isolation, which helps because this kind of thing is common.
We track uptime of our various providers, and seeing bits like the GitHub or Zendesk sites go down is more common than we expected... and they're the good cases.
Further, the security of a marketing site tends to be a lower priority than that of the product itself, and an install script should generally be secured similarly to the product.
Yes. We're lamentably probably going to have to move it (the install script), even though it has a nice URL today.
When we picked that URL, the marketing site was created and run by the same people who built the rest of the product, so it didn't seem like a concern at the time.
You can achieve both. The only mistake you made was to half-bake the proxy (doing it for IPv6 only): proxy every http(s) request to tailscale.com. Vercel's platform is valuable for a whole host of reasons, and the networking side isn't the important part; your developers will still get plenty of value out of Vercel even if every request is proxied through a web server hosting tailscale.com that answers /install.sh itself instead of passing it through to the marketing site.
(In Google Cloud you could do it entirely with load balancing rules, no need to even run a web server)
`curl -fsSL https://install.tailscale.com | sh` wouldn't be any less nice. Append /sh if having something human-friendly at the root is desirable (SEO, etc.), and you're still at the same overall length as today.
> Append /sh if having something human-friendly at the root is desirable
Even this isn't really necessary; curl sends a default User-Agent header identifying the traffic as coming from curl. It's simple enough to direct traffic with the curl user agent to the script and all other traffic to a static website with directions for how to quick-install.
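To illustrate (install.tailscale.com is the hypothetical URL proposed above, and the routing itself is assumed to exist): curl identifies itself with a User-Agent like `curl/8.x.x`, so the two paths would look like this:

```sh
# Hypothetical: assumes install.tailscale.com exists and routes on the User-Agent header.
curl -s https://install.tailscale.com                      # default curl UA -> the install script
curl -s -A 'Mozilla/5.0' https://install.tailscale.com     # browser-like UA -> the human-readable page
```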
Very nifty, but I'd argue against it due to possible confusion in various atypical scenarios, such as:
* The user wants to read the script before executing it, and their preferred reader (perhaps due to browser extension or something) is a standard browser.
* The user has `curl` aliased to `curl-impersonate` in order to avoid things like Cloudflare's bot detection (a captcha that triggers on things beyond the HTTP request, like the less fancy TLS handshake of regular curl) -- https://github.com/lwthiker/curl-impersonate
* The user doesn't have curl installed, but has wget / lynx / some headless browser / etc. and expects any of those to work the same as curl.
Not to mention, if a site encouraged users to execute an HTTP response by piping curl into sh, and the response for curl was different than the response otherwise, that just might make the top of HN for being sketchy as hell.
> The user wants to read the script before executing it, and their preferred reader (perhaps due to browser extension or something) is a standard browser.
I mean, the point of wanting to read the script before executing it is to try and protect yourself from malicious scripts that abuse the curl | sh pattern. So since it would be simple enough for a malicious actor to return something different when the user agent indicates the usage of curl, the only responsible thing to do, anyway, is to use curl to download the script to a file, read the file, then execute the file.
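Concretely, that's just (URL is a placeholder):

```sh
# Download and read the script before running it, instead of piping straight into sh.
curl -fsSL https://example.com/install.sh -o install.sh
less install.sh      # actually read what you're about to execute
sh ./install.sh
```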
> `curl` aliased to `curl-impersonate`
So when the user uses a tool to impersonate a browser, they'll see exactly what they'll see in a browser... which are the quick-install instructions anyway, which can include a note about the user agent, if anyone actually hits this in the real world?
> wget / lynx / some headless browser
Which would provide the quick-install instructions to use curl :)
That's great that DevOps (or whatever their title) owns both product and marketing sites. Far too many companies (and DevOps teams) think the www site is "not important" or "not their core job" and outsource it to either a less qualified team, or out of the company altogether.
From an external perspective, no one cares whether www going down is "your fault" or of "direct impact to the product". It's a corporate black eye either way.
This hasn't been my experience. In my experience, the reason why your ops teams divest themselves from the marketing side is because marketing decides to contract some firm to design their site for them, and the firm decides to deploy to Vercel or WP-Engine or whatever. Then, Marketing comes to ops and says "hey, I need you to change this DNS thing" weeks/months into their engagement, with no understanding of the ramifications. Ops/product team pushes back, because the change would fundamentally break the application. Marketing gets defensive, "we've spent all this time and money on this, you just need to make it work", a broken halfway solution is implemented, and ops/product, in protest, divests themselves from the solution. Bingo bango, shadow IT is ratified, the kludgy hackjob lives in production forever, and no one thinks about it until the next time something breaks.
Reminds me of the time marketing decided to change the logo on the marketing site for the product team I was on without being aware that the site was scraped and redeployed on a different domain (by hand). When the logo changed, the CSS for the image element wasn't updated, truncating part of the logo, proudly displaying the word "ass" as a part of the logo in an unfortunate cropping incident.
"Far too many companies (and DevOps teams) think the www site is "not important" or "not their core job" and outsource it to either a less qualified team, or out of the company altogether"
It's impossible to know because they won't admit it publicly. You are guessing based on some anecdotal experience.
But then again... here's mine! I worked at a very successful SaaS that had (really not kidding) the most incompetent, lazy dope running the www site. He live-edited a "staging" version of the site on the fly (no, it wasn't private, you could access this thing from the internet, and he didn't know or care about that). When he was happy with his changes he'd destroy the live instances behind the load balancer and clone his staging instance, without taking it down or running any extra checks. This staging instance was around for years and I don't think he ever bothered doing a system update. Since he didn't use git, I'll bet that at least once he cloned a live instance back to staging to undo a bunch of bork.
I lost count of the incidents. He never detected them himself, was never available to troubleshoot them and was generally a big "durrrr" when you'd finally get him on the call. Example: one time we had a "slow, intermittent errors" customer support ticket surfaced to us, not because it was our job, but because dopey was being an absolute ass to the helpdesk guys. He ran his crap in another AWS account we didn't have access to. About a day later the www site went down completely, so we got hold of the AWS account and dug in. All 5 of the instances behind the load balancer were "unhealthy" for various reasons: certs expired, disks full, Apache stopped. We bounced them, got them back up, and SSHed in. They all had different versions of the site. It was a complete mess. Turns out dopey wasn't very good at killing the old instances and cloning staging. He was probably live-editing the instances for smaller changes if that seemed easier than a bunch of AWS console work.
Unbelievably he wasn't fired and continued to mismanage the site, and we could do nothing because the head of marketing didn't listen to the head of engineering. They hated each other. The way Marketing saw it "your SRE guys couldn't fix it, they had to wait for <dopey> to get on the call". I'm not even kidding.
Just more anecdotal evidence from me. You might be right.
Cloudflare has a dedicated team handling certificate-related engineering challenges. While it simplifies the process for your domain on Cloudflare, internally-facing domains remain a pain point for a lot of engineering teams.
They made the same mistake we did at a former company — put a link to our webapp’s login page (app.foo.com) on the marketing site (www.foo.com) homepage.
It wasn’t until our first marketing web site outage that we realised that our $40/mo hosting plan was not merely hosting a “marketing site” but rather critical infrastructure. That was a load-bearing $40 hosting plan. Our app wasn’t down but the users thought it was.
I learned then that users follow the trails you make for them without realising there are others, and if you take one away then a segment of your user base will be completely lost.
When I type "tailscale" into my browser, the first result is tailscale.com. I do not need to use the tailscale admin console often enough that I would go out of my way to memorize the different URL.
My browser used to autofill dash.cloudflare.com when I typed in "cloudflare". I visited the cloudflare.com website exactly once, and now that's what shows up as the first result, so I find myself doing the same thing with Cloudflare.
A couple of years ago there was an outage at Cloudflare where www.cloudflare.com started pointing to an unrelated third-party marketing website, while dash.cloudflare.com was still showing the dashboard.
I really like these guys; I wish their pricing wasn't so ridiculous. Proper access control shouldn't cost 18 bucks a month for a VPN. It's basically unsellable to management at that price, and the lower tiers are unsellable without it.
I'm really interested in what you're comparing Tailscale to internally, because it does way more than just VPN. What are the cheaper options, and do they also have an SSH feature, oauth authentication to the network for automation services, the ability to stand up VPN node loadbalancers in kubernetes clusters, and ACME certificate request automation through LetsEncrypt? Just to list a few features that I use from Tailscale's free tier that I don't normally think of as the job of a VPN service. And they're constantly adding new features that make it a really interesting and competitive choice, in my opinion. Honestly I'm mostly interested in this take because I'm shocked by how much they offer in the cheaper tiers.
My problem is that I don't care about any of the stuff you mentioned, because I can't easily recreate my internal VLAN segregation at the six-dollar tier; I need to pay up for the 18-dollar-a-user tier, which is insane pricing for my shop of 70-100 users. Compare that to the Forti SSL VPN included with my router: I can add users to groups in Active Directory that are then synced to Azure AD, which provides SAML authentication to the VPN. Users log in with their O365 credentials and I can configure it to mimic the already-approved VLAN segregation easily, based on groups provided by O365. This basic stuff should just be there for a corporate solution, and it appears to be, but for 100 users the bill is $21.6k/year, which is literally 4 times more than I can justify / get approved. Yes, it has a bunch of other features, but they are irrelevant because the basics aren't there at a price that is doable.
> Yes it has a bunch of other features, but they are irrelevant because the basics aren't there at a price that is doable.
The basics don't cost $18/user/month though. The whole package does. I hear what you're saying, and maybe you just accidentally worded it this way, but the obvious rebuttal to it is: How much would it cost you to set up a solution where only ACL'd users can SSH into your infrastructure/servers? You're looking at services that cost money like Userify for that. For many of the other features Tailscale offers, you're probably either paying another service to handle that responsibility, not doing it at all, or you're spending your time recreating it, and I bet your time isn't cheap to the company either.
Anyway, that's somewhat of a hypothetical rebuttal. I actually assume you did the due diligence and weighed the cost with the portion of the feature set that you actually would make use of. I could see the price being more fair if they offered a lower cost tier that only provided the VPN and ACLs for unlimited users, but I'm not a savvy businessman so I'm not sure if it makes sense for a multi-tool company to sell a minority of users a screwdriver.
You're assuming that feature matters to me (it does, but I have a lot of users relative to servers, so even the most basic tier of Userify Express or their cloud offering would handle my needs for dramatically less than premium Tailscale). What you're missing is that the foundation of being able to deploy something like this depends on easy ACLs that integrate with my current identity system (i.e. SAML), because I need to be able to control access per business unit so that when the mesh of everything is brought online, users still only have access to the subset of systems they should have access to. This matters whether it's Sheila in accounting who only ever uses it to remote-desktop to her on-prem workstation, or Jackie in IT using it to manage on-prem servers across multiple locations and cloud servers.
So yes, maybe a small subset of my users are actually using enough of the premium bundle to justify that cost, but I can't even mix and match, because the basis of every use case (the ACLs) is only present in the premium package in a functional way.
The problem might be your approval process? $18 compared to fully loaded monthly cost is nothing. Unless you are using 100 tools like this maybe. Or based in LCOL location.
The pricing isn't ridiculous, it is by design. For better or worse, SaaS pricing is about finding features (regardless of their actual cost) that act as signals for "is a customer who can afford more". The $6 tier was you paying their marketing and market-research cost by trying them out :-). They probably don't need the $6; they need the data point that you were willing to pay something!
That's fair. I guess if you just need a VPN, it doesn't really make sense to consider a product that packs in all of these VPN-adjacent features. But part of my point was how much you're able to do on Tailscale for free, so in that regard the pricing really doesn't seem as bonkers to me, to be honest. The $6/month tier is also incredibly reasonable for unlimited users, but it is annoying, I'll grant, that they actually drop some ACL features from the Free tier to Starter. I suppose that's how they funnel a large number of enterprises away from Starter and toward Premium. But if you actually make use of the feature set of the platform, and if you have a decent DevOps team I'm pretty sure there are tools in there that they'll love, then $18/user/month actually doesn't seem too outrageous to me.
Just a quick heads up, colleagues of mine could not successfully host headscale themselves. In the end, they saw the value and bought tailscale access.
Configuring wireguard really is that hard. Tailscale is easily worth it
FYI: I've been self-hosting headscale for 9 months or so, and it's pretty brilliant. I didn't find it very hard to set up. A dedicated DERP server was pretty hard to set up, but most of that was because I was trying to host it behind our office load balancer, and that's no bueno. Once I put it on a dedicated IP, my secondary DERP was pretty easy.
But if you are going to self-host, seriously consider Nebula instead of Tailscale, unless you need non-technical users accessing it; Tailscale has a better story there.
(edit) The biggest downside of headscale is I don't feel confident I can update ACLs without having a high likelihood of taking down the entire tailnet until I can get it fixed.
It wasn't hard to set up headscale (or netbird, for that matter). I have set up both at home to test fairly easily. They aren't appropriate for a corporate setting though. I actually want to pay somebody for this, because I want the support when some change causes it to eat it at 4 in the morning, with the business day about to start and a strong requirement for it to be working.
Your (their) mileage may vary. I set up Headscale with external auth and it's been a dream, the kind where I don't really have to think too much about it. The only little gotcha is that sometimes getting the iOS client to read the server url from settings can be tricky. But once authed, it "just works" for me.
I had a very easy time selling this. We moved away from an OpenVPN setup, and Tailscale made it so much easier to onboard new employees and to do a lot more things “the right way”. We’re a fully remote company, so it’s even more important.
Although I admit that in my role I have quite a lot of weight in convincing management on these topics, price was not a concern.
We’ve been a happy customer since April last year, everyone on the premium / “expensive” tier. I’m also very impressed with their development speed: some features that were said “May take a few years to be delivered” actually were delivered last year already.
Cloudflare One could have been an alternative, but that would have been even more expensive.
Yah, if we were a fully remote company this would have been an easier sell. What we actually are (and what the vast majority of businesses likely are) is a small bunch of office locations plus a bunch of cloud infrastructure, so I can approximate Tailscale, in a way I'm unhappy with right now, by limiting cloud access to the office IPs and controlling access with various things from the office PCs. The VPN to do that works properly with ACLs and is included in our networking gear. So what Tailscale is competing with is something that is a sunk cost and not ideal, but of minimal additional cost, versus 20k a year of additional cost to provide premium features to all users when maybe 10% of them need those features.
Basically, Tailscale has a bundling problem. They bundled the features everyone needs (proper ACLs) with a bunch of premium stuff that is of less value to many, and they don't have the market power of Microsoft with its Windows operating system to force that kind of arrangement down my throat. They need a 2-3 dollar a month tier with proper ACLs and the mesh VPN, then the rest of the feature bloat à la carte per user (SSH key management is worth no more than 2 dollars a month based on the competition; no idea what the other features are worth because I have no use for them).
They also really need to improve their windows experience. More than once during testing I had a windows update break the vpn requiring alternate means of logging in and reconnecting, but that's an ancillary issue.
That's basically the issue that pushed us to Twingate. Turns out I like Twingate a little better for their routing capabilities anyway (which isn't to say I don't like Tailscale at all; I use both for different purposes).
I wonder what provider they use for their website. Sounds like a lot of hoops to jump through for IPv6 when just about any other provider has IPv6 support.
Wow, all comments removed as spam or hidden by default, update posted saying "We are targeting to land support for IPv6 towards the beginning of next year." Well, Q1 2024 has come and gone. Where's IPv6 support or the communication about what is happening? Good reason to never use Vercel if you ask me.
>I apologize for the slow response. We are targeting to land support for IPv6 towards the beginning of next year. We will communicate updates on this issue. Thanks for the patience.
So cringey. Why not just post a new post that said "sorry the deadline slipped, no new date available at the moment"? I will strongly recommend _against_ this company solely based on this communication. If this sort of gaslighting is how they handle their public comms, imagine how their support must be run.
Wow, mad jelly that their CI/CD and monitoring processes are robust enough to trust a major rollout in December. That's a pretty badass eng culture.
That being said, still some unanswered questions:
- If the issue was ipv6 configuration breaking automated cert renewals for ipv4, wouldn't they have hit this like.. a long time ago? Did I miss something here?
- Why did this take 90 minutes to resolve? I know it's like a blog post and not a real post-mortem, but some kind of timeline would have been nice to include in the post.
- Why not move to DNS provider that natively supports ipv6s?
Also I'm curious if it's worth the overhead to have a dedicated domain for scripts/packages? Do other folks do this? (excluding third-parties like package repositories).
>- If the issue was ipv6 configuration breaking automated cert renewals for ipv4, wouldn't they have hit this like.. a long time ago? Did I miss something here?
AIUI, they switched to their current setup 90 days prior to the outage. The initial cert they installed during their migration lasted 90 days. So 90 days after the migration, they had an outage.
Why does the proxy need to terminate TLS? If it were just a TCP proxy, then at least the monitoring wouldn't have been fooled into thinking the certificate wasn't about to expire.
Heck, a TCP proxy might even allow automatic renewal to work if the domain validation is being done using a TLS-ALPN challenge.
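A minimal sketch of such a pass-through proxy, using socat (the origin hostname is a placeholder); since nothing is decrypted, TLS-ALPN-01 challenges would reach the origin untouched:

```sh
# Forward raw TCP on port 443 to the origin; use TCP6-LISTEN for an IPv6-only front end.
socat TCP-LISTEN:443,fork,reuseaddr TCP:origin.example.com:443
```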
A TCP proxy discards the user's IP address, unless you use something like the PROXY protocol[1], which then needs to be supported by the target HTTPS server. You would also need a way to prevent unauthorized users from injecting their own PROXY header.
This isn't a problem if you don't need the user's IP address at all, but it's often useful for logging and abuse detection.
If it is point-to-point and you control both those points (forward A to B with ports open as approp), proxying any protocol should be straightforward, no?
It doesn't. That was one of our mistakes and action items to fix.
The original proxy was stood up quickly when it was first discovered IPv6 was broken and the people standing up the proxy didn't know at the time how ACME worked.
I've done enough fighting with Certbot to learn that the path of least resistance is to use dns-01 with the cloudflare plugin, and just make a very limited access key and store it on the servers that I am using to renew the cert.
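Roughly what that looks like (paths and domains are placeholders; the credentials file holds a narrowly-scoped Cloudflare API token):

```sh
# dns-01 validation via the certbot-dns-cloudflare plugin; no inbound ports needed.
certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials /etc/letsencrypt/cloudflare.ini \
  -d example.com -d '*.example.com'
```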
A non-TLS-terminating proxy is a great thing to host on a service like Hetzner. If you set up CAA correctly, then you are trusting the provider for latency and availability only, and you might as well avoid hilariously expensive services like CloudFront or an EC2-based proxy.
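For reference, a minimal CAA setup looks something like this (example.com is a placeholder):

```sh
# Zone-file syntax restricting issuance to a single CA:
#   example.com.  IN  CAA  0 issue "letsencrypt.org"
# Check what's actually published:
dig +short CAA example.com
```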
Hmm, it looks like Tailscale is using NetActuate for pkgs.tailscale.com. I bet NetActuate could help serve up a non-terminating proxy with plenty of PoPs at a reasonable price. Their website doesn’t give pricing, but it sounds like the kind of company that doesn’t mark up egress 50x.
> A non-TLS-terminating proxy is a great thing to host on a service like Hetzner. If you set up CAA correctly, then you are trusting the provider for latency and availability only, and you might as well avoid hilariously expensive services like CloudFront or an EC2-based proxy.
Are you really getting any latency or availability improvements in that case? What does a non-TLS-terminating proxy give you?
An HTTP proxy would need to be configured cleverly enough to serve its own ACME challenges directly and proxy any requests for the backend's ACME challenges, which is, I think, the trick that was missed in the Tailscale setup.
Anything even remotely security-adjacent that Tailscale as an institution fumbles, even once, is too dangerous for the merely mildly paranoid (like me, for example).
They have monitoring for their infrastructure, right? Add 50 lines of code that connects to all public domains on ipv4 and ipv6 and alerts if the cert expires in under 19 days. Set automatic renewal to happen 20 days out. Done.
I wrote this code years ago, after missing a couple ssl renewals in the early days of our small company. Haven’t had an ssl-related outage since.
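A sketch of that kind of probe, in case it's useful (host list and threshold are illustrative; `openssl x509 -checkend` exits non-zero when the cert expires within the given number of seconds):

```sh
# Check each public endpoint over IPv4 and IPv6 separately; alert under 19 days.
for host in example.com login.example.com pkgs.example.com; do
  for ipflag in -4 -6; do
    if ! echo | openssl s_client $ipflag -connect "$host:443" -servername "$host" 2>/dev/null \
        | openssl x509 -noout -checkend $((19 * 24 * 3600)) >/dev/null; then
      echo "ALERT: cert for $host ($ipflag) expires within 19 days (or the check failed)"
    fi
  done
done
```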
Edit: this is the only necessary fix, no need for calendar invites:
> We also plan to update our prober infrastructure to check IPv4 and IPv6 endpoints separately.
It sounds more like "90 days of alerts about DNS, and then certs fail". The fact that the presence of IPv6/AAAA DNS records causes Vercel to decline to auto-renew certificates seems to not have been known to the Tailscale team prior to the incident. (I haven't seen the alerts in question, so I don't know whether they made this fact clear.)
“That means the root issue with renewal is still a problem, and we plan to address it in the short term much like our ancestors did: multiple redundant calendar alerts and a designated window to manually renew the certificates ourselves”.
The other day I was looking for a system where we can track recurring yearly/monthly/etc tasks (such as cert rotation) and get alerted a week before and on the day.
About two hours into my search, while contemplating building my own, someone pointed out we could just use a shared GSuite calendar.
I'd probably spend the effort adding SSL expiration to the monitoring system for all the certs in use. Have it trigger a month/week/whatever before they're due to expire.
Jira is really underrated for its workflows and automations.
Part of my love/hate relationship with JIRA lasted until the lightbulb moment that it's not supposed to work perfectly out of the box, because no two places are the same.
The org I work for at $dayjob is in the midst of a poorly-managed migration from Jira to Azure DevOps (ADO), and I've had a tolerate/hate relationship with Jira for some years now. It often felt like it has too many distractions (what other people might call "bells and whistles"). So when I recently started using ADO, I thought, eh, this is close enough to Jira, why all the fuss from some of my peers about missing Jira features and such? And then I realized ADO lacks auto save (but Jira has it); freakin' auto save! Of all the things I could miss from Jira, this is one of those seemingly minor features that I grew to depend on. So, yeah, I guess I hate Jira a little less. ;-)
> When the Third Millennium Stasis kick off, basilisks on leashes, those are the intelligence treasure troves they will stripmine your optionality with.
I don't know what all this means, but generally I think it's probably not too hard to deduce an org's schedule for certificate rotation by just looking at the expiration dates on those same certificates.
How can someone be hosting infrastructure without some form of monitoring and alerting framework? Domain and cert expiry (not just for your estate but for any of your dependencies) seems like the lowest of the hanging fruit checks to implement when setting that up.
The issue they mention is that renewals are automated, but the IPv4-hosting service noticed some extra IPv6 addresses and halted the renewals because of that. On top of that, their monitoring of cert expiry was checking the IPv6 proxy so it didn't notice the problem.
The conclusion is hilarious: "we plan to address it in the short term much like our ancestors did: multiple redundant calendar alerts and a designated window to manually renew the certificates ourselves"
It's a joke, but it's also the least that should be done while whatever real fix is coming gets put into place.
A simple cronjob looks like it would handle it (see the sketch below), but what usually ends up being needed, once you have 10-15 of these types of tasks, is a simple, independent BPM workflow platform that tracks whether it happened or not... or anything else.
Learned this the hard way and won't do it any other way.
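For completeness, the naive cron version referred to above is just one line (schedule and reload hook are illustrative), and it's exactly the thing that silently stops working without an external tracker:

```sh
# Daily renewal attempt; certbot renew only acts on certs close to expiry, so running it often is safe.
0 3 * * * certbot renew --quiet --deploy-hook "systemctl reload nginx"
```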
My thoughts exactly. If feature X is important enough to do all the silly workarounds they did, then why choose a provider that didn't support feature X in the first place?
The choice of IPv4 + shenanigans vs IPv6 seems pretty straightforward.
Certificate Transparency is used to account for maliciously or mistakenly issued certificates. Perhaps it could also be used to assert the unavailability of correctly issued but obsolete certificates that are believed to be purged but actually aren't. (Services like KeyChest might already do this.)
Let's Encrypt is a miracle compared to the expensive pain of getting a cert 20 years ago. Rather than resting on laurels, would there be any benefit to renewing even more frequently, like daily? This might have confined the Tailscale incident to a quick "oops!" while the provider migration was still underway and being actively watched.
90 day renewal is frequent enough in my book. It's not so often as to be easy to miss, but often enough that the person setting it up can witness the first renewal cycle (if they so choose, which in this case they apparently did not).
Right. I was thinking of keeping the same 90-day validity but renewing much more frequently, rather than the 60-day period that LE recommends. But I can see my questions have irked other community members, so I'll leave it at that. :)
I renew some of my LetsEncrypt certificates monthly, which should be plenty, in my opinion. Gets you about 2 buffer cycles to notice the certificate isn't updating and recognize an issue in your automation.
Is this a good amount for not overwhelming the system? E.g. they recommend 60 days: https://letsencrypt.org/2015/11/09/why-90-days - and folks say the bot process won't let you renew unless the cert is within 30 days of expiring (but this might not be limited by the server-side API, because --force-renewal exists).
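For what it's worth, a couple of knobs exist for renewing earlier than the default window; a hedged sketch (domain and schedule are illustrative):

```sh
# 1. Unconditional renewal, e.g. from a monthly job:
certbot renew --force-renewal
# 2. Or raise the per-certificate threshold in /etc/letsencrypt/renewal/example.com.conf:
#    [renewalparams]
#    renew_before_expiry = 60 days
```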