Let's Encrypt API v2 Service Disruption (12 Sep 2021) (status.io)
35 points by frutiger on Sept 12, 2021 | 12 comments



I think the OP was referencing this: https://letsencrypt.status.io/pages/history/55957a99e800baa4...

   September 12, 2021 20:30 UTC
      Service disruption
   
   September 12, 2021 05:39 UTC
      Production planned maintenance
   
   September 12, 2021 02:33 UTC
      Degraded API and OCSP Performance
And then you see this planned maintenance for the 20th:

   September 20, 2021 16:30 - 17:00 UTC
      API Database Maintenance
I'm curious what kind of database and what kind of maintenance. The one useful thing about Cloud-managed NoSQL databases is the potential for zero-downtime maintenance. Making it stable enough is a herculean task for the Cloud provider, but the customer never has to think about it.


> The one useful thing about Cloud-managed NoSQL databases is the potential for zero-downtime maintenance.

Ah yes... that magical "zero-downtime" cloud. Lots of promises are made by cloud providers in that area, but all are eventually broken at some point. It's a fact of life with technology, unfortunately. :)

No need for the cloud: ScyllaDB[1], CockroachDB[2], and others are all used in production.

Let's Encrypt chose their setup for perfectly valid reasons, described on their blog[3]. Please explain your expertise for saying they're "doing it wrong", because it's clear from their blog that they put a lot of thought into it.

[1] https://www.scylladb.com/
[2] https://www.cockroachlabs.com/
[3] https://letsencrypt.org/2021/01/21/next-gen-database-servers...


While absolutely "zero downtime" is hard to guarantee in the face of massive catastrophic system failure, a distributed NoSQL database like Scylla should prove to be survivable with the loss of any one datacenter. (You'd have to knock out two of three, or all three datacenters to pull us down. Is that still possible? Sure.) I do recommend people put a lot of thought and some actual chaos monkeying into their "zero downtime" architectures.

https://www.scylladb.com/2021/03/23/kiwi-com-nonstop-operati...
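To make the survivability math concrete, here's a minimal sketch (hypothetical contact points, keyspace, and table names) using the Python driver. With three replicas in each of three datacenters, a QUORUM read or write needs 5 of the 9 replicas, so losing any one datacenter still leaves 6 and the request succeeds.

    # Sketch only: contact points, keyspace, and table are hypothetical.
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement
    from cassandra import ConsistencyLevel

    # One contact point per datacenter.
    cluster = Cluster(["10.0.1.10", "10.0.2.10", "10.0.3.10"])
    session = cluster.connect()

    # Replicate every row 3x in each of three datacenters (9 replicas total).
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS demo WITH replication = {
            'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3, 'dc3': 3
        }""")
    session.execute(
        "CREATE TABLE IF NOT EXISTS demo.kv (k text PRIMARY KEY, v text)")

    # QUORUM = 5 of 9 replicas, so this write still succeeds with one DC down;
    # it fails only if two (or all three) datacenters are unreachable.
    write = SimpleStatement(
        "INSERT INTO demo.kv (k, v) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(write, ("some-key", "some-value"))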


I'm not an expert, and I'm not saying they're doing it wrong.

But for what it's worth, just because they thought about it for a long time doesn't mean they made the right choices. Some of their design decisions haven't been great. You can still obtain perfectly valid certs for domains you don't control with a variety of common attacks. Then there were the ACME API vulns. Either they knew about all this and deliberately chose a simpler but more flawed design, or all that thinking just didn't end up being correct.


Yawn. If you're after a Let's Encrypt bashing session then I suggest you take it elsewhere.

They might not be perfect, but you get what you pay for, just like open-source software: free, but no warranty. Fine by me, since Let's Encrypt is available 99.999% of the time; outages are the exception rather than the rule (plus you should always renew your certs in good time, not at the last minute).
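For what it's worth, here's a minimal sketch (hostname and 30-day threshold are illustrative) of checking how much lifetime a cert has left, so renewal happens well ahead of the deadline rather than during an outage:

    # Sketch only: hostname and threshold are illustrative.
    import socket
    import ssl
    from datetime import datetime, timezone

    def days_until_expiry(host, port=443):
        ctx = ssl.create_default_context()
        with socket.create_connection((host, port)) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
        # notAfter looks like "Jun  1 12:00:00 2025 GMT"
        not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
        return (not_after.replace(tzinfo=timezone.utc)
                - datetime.now(timezone.utc)).days

    # Let's Encrypt certs last 90 days; renewing with ~30 days left means a
    # multi-hour API outage is a non-event.
    if days_until_expiry("example.com") < 30:
        print("time to renew")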


> Making it stable enough is a herculean task for the Cloud provider, but the customer never has to think about it.

If the Cloud provider messes up, the customer still has to think about it. I'm not a huge fan of high-level cloud services for critical infrastructure, because they're likely to go wrong (as high-level services do from time to time), and when they do, you'll usually have little visibility into them, which makes recovery dependent on the Cloud provider.


Sure, but there are different levels of complexity, risk, cost, and recovery with different services. And it depends on your design. If you do go with a distributed NoSQL DB, I would absolutely push the maintenance off to cloud hosting; that shit is a nightmare to do yourself. An RDBMS you can totally run yourself, but if you're not really great at DBA/infra/etc. and it's the linchpin of your service, will you really do a better job than a dedicated cloud hosting team?


Having done some of this, I know how sometimes 'no impact' scheduled maintenances become total system outages. Having no real input into when those get scheduled and sometimes having them happen without notice sounds like a great way to have my stuff down with no control. Sure, I've got someone to blame, but that doesn't help me serve my customers.

I'd rather have potentially less uptime but more control, but maybe that's me.


I completely agree that scheduling has to be controlled by the customer (you), and having more context on the change helps. In AWS I can schedule the maintenance windows, I know whether it's a major version bump or just a patch, and I can talk to support if I'm concerned.
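For example (a sketch with a hypothetical instance name, ARN, and window), pinning the RDS maintenance window and checking what's pending looks roughly like this with boto3:

    # Sketch only: instance identifier, ARN, and window are hypothetical.
    import boto3

    rds = boto3.client("rds")

    # Pin maintenance to a low-traffic window (UTC, ddd:hh24:mi-ddd:hh24:mi).
    rds.modify_db_instance(
        DBInstanceIdentifier="prod-api-db",
        PreferredMaintenanceWindow="sun:08:00-sun:08:30",
        ApplyImmediately=False,  # defer changes to the window, don't apply now
    )

    # And see what's scheduled before it happens.
    pending = rds.describe_pending_maintenance_actions(
        ResourceIdentifier="arn:aws:rds:us-east-1:123456789012:db:prod-api-db")
    print(pending["PendingMaintenanceActions"])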

For me the major consideration is getting away from toil so I can spend time improving things that will move the needle. If somebody else can do a half-decent job, and there's a backup plan in case things go wrong, please somebody else do it for me!


Also, you might want to check with your DBaaS provider whether backups, upgrades, or other administrative tasks can be paused and resumed, because you never know when you'll see a burst of real-time traffic, even during a projected lull for your production systems.



"Database performance is the single most critical factor in our ability to scale while meeting service level objectives."

Ah, that's unfortunate. It's hard to get RDBMS reliability/performance "right" in the face of changes, and scaling it isn't easy. But at least all the problems are well-known.





