I run the signing systems for a bunch of TLDs and honestly it scares me.
.NZ was just the latest public failure. If multiple large registry operators employing people who wrote the DNS and DNSSEC RFCs have had failures, how can you expect regular mortals to get it right?
Transferring your domain, where your old ISP had DNSSEC on but the new one knows nothing about it and didn't check ... down you go.
Mess up the key-signing-key changeover procedure while transferring to a competent DNSSEC provider ... down you go again (although hopefully for a shorter time).
We do best-practice DNSSEC, using fancy HSMs to do so, but if my fancy, proprietary, badly documented Thales HSM dies and the next person with my job has trouble getting the new box configured... no changes to the zones until it's fixed. Take long enough that the RRSIGs expire ... down YOU go again. If at that point they throw out the HSM and roll regular DNSSEC using Knot or something, it will probably be at least 24 hours for the zone to work properly, even with IANA's emergency procedures.
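(For anyone in a similar spot: here's a rough sketch of the kind of RRSIG-expiry monitoring that catches this before the "down YOU go" moment. It's Python with dnspython, the zone and resolver are placeholders, and real monitoring around an HSM-backed signer would obviously be more involved.)

    # Rough sketch: warn well before RRSIGs lapse (Python + dnspython;
    # the zone and resolver below are placeholders).
    import time
    import dns.message
    import dns.query
    import dns.rdatatype

    ZONE = "example.nz."          # placeholder zone
    RESOLVER = "8.8.8.8"          # any DNSSEC-aware resolver
    WARN_BEFORE = 3 * 24 * 3600   # start shouting three days before expiry

    query = dns.message.make_query(ZONE, dns.rdatatype.SOA, want_dnssec=True)
    response = dns.query.udp(query, RESOLVER, timeout=5)

    now = time.time()
    for rrset in response.answer:
        if rrset.rdtype == dns.rdatatype.RRSIG:
            for sig in rrset:
                seconds_left = sig.expiration - now
                status = "OK" if seconds_left > WARN_BEFORE else "RENEW NOW"
                covered = dns.rdatatype.to_text(sig.type_covered)
                print(f"{rrset.name} RRSIG({covered}) expires in "
                      f"{seconds_left / 3600:.1f}h [{status}]")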
I'm paranoid, so we have 3 identical HSMs and detailed restore procedures, but there is zero guarantee that if I ever leave, my replacement will get it right. You can't hire DNSSEC staff; they are impossible to find, so you have to train on the job.
The basic idea is excellent but the implementation in the real world is horrible. If DNSSEC usage were widespread we might reach some decent maturity with the tools and protocols, but I can't see that happening any time soon.
> If DNSSEC usage were widespread we might reach some decent maturity with the tools and protocols, but I can't see that happening any time soon.
DNSSEC's lack of maturity is a symptom of DNS itself not having a healthy ecosystem. I feel like it'd take as little as a half-percent of the people writing new HTTP tooling to noticeably improve the DNS ecosystem as a whole.
Totally. DNS was always an IETF RFC-bound thing. Recently, big players just kinda started making their own decisions.
I'm not sure which is better. Having big companies control the internet is definitely bad, but the pedantry and bikeshedding the IETF offers can be equally bad.
I'd probably rather see the two cooperate/converge, but that's a pipe dream.
This sounds exactly like the age-old debates bemoaning the bureaucracy of government versus the exploitation of industry. People knowing the pain of one always look hopefully at the other.
The basic idea isn't excellent. It's wrong. It's firmly rooted in mid-1990s cryptography (the original design was a DOD contract to TIS); in particular, to the idea that cryptography is too expensive to do on edge systems. That's why DNSSEC's core design has offline signers (my belief is that the supposed security benefits of offline signing are back-rationalized --- remember that a best-practices "leaf" DNSSEC deployment today will have online signers, to foil enumeration attacks).
If you were to scrap DNSSEC-ter and start from scratch in 2023, there is no chance the design would look anything like DNSSEC. It would be fully encrypted for privacy. It would strike a different balance between authenticated denial and offline keys. It would use different primitives, and it would probably have a different DNS RR schema.
There are lots of implementation problems with DNSSEC, but the real issue is that the design is the best we could do in 1998, and that most of what's "improved" since then have been more effective ways to shoehorn the same model into the evolving core DNS infrastructure, rather than really improving the fundamental design.
One major problem with new security technology is that testing is generally created as the result of failure, rather than preceding the failure so that it doesn't occur in the first place. Even very basic things like verifying that a new key actually fits the intended lock (i.e., that a new key corresponds to the published data) are something people tend to implement only once they have experienced a failure.
One general solution is to have standardized tools with standardized testing. Transferring a domain should always involve testing the old data, testing the old keys, testing the new data, testing the new keys, testing the chain of delegation, testing the chain of trust, and so on. Some TLDs have gone as far as doing basic sanity tests before a domain owner is allowed to transfer a domain (the industry term is to delegate a domain, since transfer is strictly about legal ownership).
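To make that concrete, here is a rough sketch (Python with dnspython, placeholder domain) of one of those checks: does the DS set published in the parent actually match a DNSKEY the child serves? A mismatch here is exactly the "chain of trust" failure mentioned above.

    # Sketch: check that the parent's DS records match a DNSKEY the child
    # actually publishes (one piece of the chain-of-trust testing above).
    import dns.resolver
    import dns.dnssec
    import dns.name

    DOMAIN = dns.name.from_text("example.nz.")   # placeholder domain

    ds_records = dns.resolver.resolve(DOMAIN, "DS")
    dnskeys = dns.resolver.resolve(DOMAIN, "DNSKEY")

    DIGESTS = {1: "SHA1", 2: "SHA256", 4: "SHA384"}
    matched = False
    for ds in ds_records:
        digest = DIGESTS.get(ds.digest_type)
        if digest is None:
            continue
        for key in dnskeys:
            if dns.dnssec.make_ds(DOMAIN, key, digest) == ds:
                matched = True
                print(f"DS with key tag {ds.key_tag} matches a published DNSKEY")

    if not matched:
        print("WARNING: no DS matches any published DNSKEY -- broken chain of trust")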
Thankfully there are initiatives to create standard tools and standard tests. Zonemaster (https://zonemaster.net) is one I recommend for testing; it is the result of a collaboration between .se and .fr. There are also initiatives for multi-signer tooling (https://github.com/DNSSEC-Provisioning/music), which lets multiple keys operate on the same domain.
Redundancy in HSMs is nice, but redundancy is not a replacement for testing and backup procedures. That was one of the hard lessons learned by the IT industry, especially with data. No amount of RAID will save a company if a rogue script deletes all the files, or if the database becomes corrupt. A lot of experts learned the hard way that redundancy is mostly just a complement to sanity tests and backup procedures.
The experience of just setting up DNSSEC when I tried it on Google Domains was so incredibly frustrating due to the time delays and the pain points around correcting mistakes.
One trick to make it easier: set all relevant TTLs to 1 minute before trying something dangerous. Plenty of resolvers ignore that, but the main ones (1.1.1.1, 8.8.8.8, etc.) respect that.
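It's also worth confirming what TTLs are actually being handed out before pulling the trigger; a quick sketch, assuming dnspython and a placeholder name:

    # Quick check of the TTLs currently being served before attempting a
    # risky change (placeholder name; needs dnspython).
    import dns.resolver

    NAME = "example.com."   # placeholder
    for rdtype in ("DNSKEY", "DS", "A"):
        answer = dns.resolver.resolve(NAME, rdtype)
        print(f"{rdtype}: TTL {answer.rrset.ttl}s")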
Correcting mistakes is a pain point, but really applies to DNS as a whole rather than just DNSSEC.
I think a lot of the problem stems from people using really long TTLs for the keys, which used to be the standard advice.
That said, as someone who used to manage DNS and DNSSEC at a TLD level, I will admit that documentation and best practices are poor. I remember asking someone why it was this way, and he told me it was because people want to make money contracting, so were less than willing to make it accessible.
Once you understand what's happening, you can make a little cheat sheet and it's actually really simple. But it seems like everyone forges ahead their own way, myself included.
I've been out of the business for seven years, and don't have much anymore as I've switched jobs a few times.
It was mainly around process and timings. BIND 9 has a pretty good guide on it. The easy option is to just add the new key, let everything sign, then later (after at least a TTL period, probably longer to be safe) remove the old.
The other way, which we did, is to publish keys before you use them, then retain the old key after the signing-key swap, for a TTL period each. That keeps the zone size small.
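One thing that helps with the "wait at least a TTL period" step: verify the newly published key is actually visible from the big resolvers before retiring the old one. A rough sketch with dnspython (the zone and key tag are placeholders):

    # Sketch: confirm the newly published DNSKEY is visible from several
    # large resolvers before removing the old key (placeholders below).
    import dns.resolver
    import dns.dnssec

    ZONE = "example.nz."    # placeholder zone
    NEW_KEY_TAG = 12345     # placeholder: tag of the key you just published
    RESOLVERS = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]

    for server in RESOLVERS:
        r = dns.resolver.Resolver(configure=False)
        r.nameservers = [server]
        keys = r.resolve(ZONE, "DNSKEY")
        tags = sorted(dns.dnssec.key_id(k) for k in keys)
        seen = "new key visible" if NEW_KEY_TAG in tags else "NOT YET VISIBLE"
        print(f"{server}: key tags {tags} -> {seen}")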
Lastly, don't roll a KSK and ZSK at the same time. It's doable, but not worth the dance in any situation.
That said, if you have any specific questions, I'm happy to help.
The protocol wasn't designed on purpose to do this, but I think it's worth keeping in mind that this is why centralized DNS providers do DNSSEC features and promote DNSSEC: it's a thing that keeps you dependent on SAAS providers, because the DIY is so difficult. Keeping your domain anchored to a particular platform is a powerful customer retention mechanism; you're likely to do everything you can with that domain with that same platform.
Also, validation errors are how you get people to ignore XHTML.
Strict validation is tricky. With HTTPS, validation was initially very lax, and AFAIK, most SMTP servers still do not validate the certificates of the destination SMTP servers. It is hard to get people to move to stricter validation once habits and working systems have been established which depend on lax – or no – validation.
Some try to avoid this by specifying strict validation in advance, like XHTML and DNSSEC, which obviously has its own drawbacks.
> Exhorting people to not screw up things like DNSSEC KSK rollover clearly hasn't worked, so the only real solution would be better ways to automatically recover from it.
The whole issue of expiration in cryptographic systems is extremely fraught. My gut sense is that X.509 certificate expiry is the most common cause of mega-outages these days, where by "mega-outage" I mean the sort of thing that takes down entire cellphone networks or major online services for hours at a time. In the past I've tried arguing against this practice of immediately cutting all communication because a deadline expired, but cryptography as a field is highly resistant to business-requirements analysis of any kind. No CEO in their right mind would ever agree to having expiry in their infrastructure, but the decisions are made by people far away, very deep down in organizational stacks, and it honestly doesn't occur to most business-side people that anyone could be crazy enough to litter miniature time bombs all throughout mission-critical HA infrastructure anyway.
Part of the issue is misalignment of requirements. Keys don't actually expire, of course. In the old RSA days there was a notion of a sort of slow rusting as factoring attacks tended to get better over time, so every so often you'd want to bump the size of an RSA key to keep it secure. But that wasn't something that had to be done on a deadline. When elliptic curve cryptography went off-patent and started getting deployed it fixed that, and a secp256r1 key made today is as good as one made 15 years ago.
So what is the purpose of expiry, then, in today's world?
For DNSSEC, it seems to be a kind of encoding of an assumption that keys can leak without anyone being aware of it, whilst the holes being used to do so get sort of ambiently patched, again without anyone being aware that it's happened. If you assume a world where keys are constantly being stolen and abused, then regular rotation can help. However, that's not the real world! These keys are stored in HSMs. Has there ever been a leaked DNSSEC key spotted being abused in the wild? Certainly it's questionable whether the juice is worth the squeeze, given the outages it causes and the way these undermine the entire system.
For certificates there's that, but the original justification was actually about constraining the size of CRLs. If certificates don't expire, then in theory any decommissioned cert has to be recorded and stuck on a revocation list forever, which would mean CRLs can never shrink. How exactly to handle the phase-out of old keys and certificate details has seen a lot of churn over the years, with CRLs, OCSP, OCSP stapling, and nowadays CRLSets all doing similar tasks.
Still, this is primarily a logistics and cost-management issue rather than an actual security problem. The task could also be done with a gradual phase-out in which, at first, expired certs only trigger warnings to developers via the browser console, Google Search Console, etc. Then later it graduates to a warning in the URL bar. Then eventually TLS stacks start increasing the latency on connections, and so on. Once the artificial latency passes a certain threshold, that's basically equivalent to an outage anyway and you may as well just return an error immediately.
Lately there's another reason for expiry, with short expiry times being enforced by the CA/B Forum so they can change the rules for certificates (not keys) more quickly. It took a long time to phase out SHA-1 and introduce CT; the people who set the certificate rules didn't like that, as their job is security, so by making certs constantly expire they hope to push people into setting up automated renewal pipelines, which in turn means they can then mandate new algorithms, change the contents of certificates, and so on. Constant expiry certainly helps them do their job, but it makes everyone else's job harder, as automated renewal isn't always easy, especially outside the context of simple web servers that use Let's Encrypt.
Summary: If tomorrow I was temporarily made King Of The Internet and told that my only mission in life was to increase availability, I'd start by altering X.509 libraries to implement gradual degradation instead of hard expiry, e.g. with a sleep(time_since_expiry) call somewhere in the stack. I'd also default to simply ignoring expiration outside of TLS context, because X.509 gets used in a lot of places that aren't TLS and they're often hard to reach and fix e.g. certs that are hard-coded into embedded devices. You do NOT want your electricity grid to have an outage caused by certificate expiry. In those contexts it hardly matters because the PKIs are often custom and not very large, so CRL sizes just don't matter. You can easily list every revoked certificate for the entire lifetime of the system, if you even need PKIX revocation support at all, and for those use cases expiry is probably a much bigger business risk than attacks using obsolete algorithms.
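For what it's worth, the sleep(time_since_expiry) idea is tiny to express. Here's a toy sketch in Python using the cryptography package (a recent release, for not_valid_after_utc); it's purely illustrative of the gradual-degradation idea, not a claim about how any real TLS stack behaves:

    # Toy sketch of gradual degradation: latency proportional to how long
    # ago the certificate expired, instead of an immediate hard failure.
    import time
    from datetime import datetime, timezone
    from cryptography import x509

    MAX_PENALTY = 30.0   # seconds; past this you may as well fail outright

    def expiry_penalty(pem_bytes: bytes) -> float:
        cert = x509.load_pem_x509_certificate(pem_bytes)
        expired_for = (datetime.now(timezone.utc) - cert.not_valid_after_utc).total_seconds()
        if expired_for <= 0:
            return 0.0                                  # still valid: no penalty
        return min(expired_for / 86400, MAX_PENALTY)    # ~1s per day past expiry (arbitrary scale)

    def degraded_handshake(pem_bytes: bytes) -> None:
        penalty = expiry_penalty(pem_bytes)
        if penalty >= MAX_PENALTY:
            raise RuntimeError("certificate expired too long ago; treating as an outage")
        time.sleep(penalty)   # the gradual part: slower and slower, then eventually fail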
There's some irony here: short expiry has some clear practical benefits for at least some of the goals of the WebPKI, and that works because the WebPKI has stakeholders that will push things like "if you issue one more SHA1 certificate we will distrust your CA". Certificate expiration failures are common, but not that common, because "needs a certificate" isn't the default for every host.
Key expiry is a much bigger problem for DNSSEC, since once you sign, every hostname in your whole zone depends on no RRSIG in your DNS configuration expiring. But there are no stakeholders extracting any value from those expirations: nobody can say "if you sign one more record with RSA-1024 we will break your whole TLD".
Further, unlike with the WebPKI, where verification is conceptually offline --- you have a trust anchor and a signature chain and the verification is all done locally --- DNSSEC resolution is online: you have a TTL issue with revocation, but not a "this bogus signature will be valid for an entire year because there is no effective revocation system".
So, arguably, the RRSIG expiration field is another fundamental flaw with DNSSEC.
Yes. It seems there's some thinking around doing offline validation for DNSSEC as part of DANE, so you'd get the proofs you need from DNSSEC on the server side, then combine those with a certificate and send that to TLS clients. That seems reasonable.
There was an effort to staple DANE proofs to TLS handshakes, but it failed: if you could suborn a CA, you could control the TLS handshake enough to strip the DANE proofs out, and the whole point of the DANE proofs was to avoid trusting CAs.
Catch-22. To know the server wants to exclusively use DANE you must learn that in a secure way, like via DNSSEC, but the point of DANE stapling is to not do interactive DNSSEC queries.
I guess if you're publishing an intent to use DANE stapling exclusively via DNS, then you could have tools that crawl the CT logs looking for CA issuances for domains that claim they don't want them. Though maybe detecting an attack in that way isn't worth much.
If a key is ever potentially compromised (say, a key employee quits), you should rotate the key. If this is too hard, people will just not do that, and we will have a world of hurt, with old keys being hard-coded all over the place. But you know what they say, “if something hurts, do it more often”. If key rotation is hard, automate it. Then, when a key really does need replacing, it’s easy. I prefer the pain of key rotation over the pain of all keys being hard-coded everywhere.
There are at least two possible fixes for that situation:
1. Use HSMs
2. Do constant key rotation
But (2) doesn't work for several reasons and (1) does, which is why in practice people use HSMs.
It doesn't work because:
1. There is always a period for which a key is live, and if a rogue employee can simply copy a private key onto a USB stick, they will probably abuse it quite soon after doing so. You're encoding into the infrastructure (at massive cost) the assumption that for some reason the attacker will wait a long-ish period of time before using their access, and that rotation will thus defeat them, but there's no reason to assume that. So it can easily end up useless.
2. If your key handling is so weak that someone can just walk off with the private key then they can probably tamper with the key generation process too.
It's obviously great if you can make key rotation easy, but in the cases I've been involved with it was always hard. There's got to be a root of trust somewhere so often making key rotation easy and fast just means pushing the trust to a different key that can't be easily rotated (e.g. firmware key).
HSMs aren't perfect. They're just specialized computers at the end of the day. But given a choice between building a better HSM, or trying to solve very thorny distributed computing problems that at minimum assume things like a globally synchronized clock, the former seems easier, much more likely to actually work and much less likely to create unforced outages.
Remember that in my post I'm discussing not only the relatively easy case of well maintained always-online laptops and servers running software built by well funded tech firms and getting certificates from Google-subsidized free CAs. I'm also considering all the other uses of expiring keys/certs, like in private networks, credit card chip systems, industrial control, phone networks, internal cloud usages etc.
A (potential) key compromise can result from more things than an employee leaving. It can also be due to, say, an algorithm change. HSMs do not solve this (and in fact probably make it harder to change the encryption algorithm).
Why do HSMs make it harder? Algorithm changes are rare, and when they happen you'd need to upgrade all the clients that are working with the keys or certificates, so at that point it's a question of software rotation rather than key rotation. The HSM shouldn't be your biggest problem at that point.
When I’ve enabled it in systemd-resolved it breaks every hotel / conference centre / airport WiFi. Maybe that’s a good thing but it’s pretty inconvenient.
Create your own root certificate, import it into your browser, and into the OS certificate store. Then just sign CSRs with this root certificate's private key.
This is pretty much what all large corporations do for their internal infrastructure.
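For anyone curious what that looks like in code, here's a rough sketch with Python's cryptography package (all names and lifetimes are placeholders); a real internal CA would add name constraints, an offline root, intermediates, revocation, and so on:

    # Sketch: a throwaway private root CA that can sign CSRs, as described
    # above. Illustrative only; placeholder names throughout.
    import datetime
    from cryptography import x509
    from cryptography.x509.oid import NameOID
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import ec

    # 1. Root key + self-signed root cert (import this into the OS/browser store).
    root_key = ec.generate_private_key(ec.SECP256R1())
    root_name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "Example Internal Root CA")])
    now = datetime.datetime.now(datetime.timezone.utc)
    root_cert = (
        x509.CertificateBuilder()
        .subject_name(root_name)
        .issuer_name(root_name)
        .public_key(root_key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(now + datetime.timedelta(days=3650))
        .add_extension(x509.BasicConstraints(ca=True, path_length=None), critical=True)
        .sign(root_key, hashes.SHA256())
    )

    # 2. Sign an incoming CSR (e.g. from an internal web server) with the root key.
    def sign_csr(csr_pem: bytes, days: int = 90) -> bytes:
        csr = x509.load_pem_x509_csr(csr_pem)
        cert = (
            x509.CertificateBuilder()
            .subject_name(csr.subject)
            .issuer_name(root_cert.subject)
            .public_key(csr.public_key())
            .serial_number(x509.random_serial_number())
            .not_valid_before(now)
            .not_valid_after(now + datetime.timedelta(days=days))
            .sign(root_key, hashes.SHA256())
        )
        return cert.public_bytes(serialization.Encoding.PEM)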
This, of course, only works for your own internal use. I feel this disclaimer needs to be called out explicitly.
If you're trying to do something that will be trusted by the global internet, you'll need to use an official CA, ergo the complaints of DNSSEC being too 'centralized'.
Then again, what the "global CAs" are doing is not wholly different at a technical level from what a company with their own internal CA does, except that I hope the global CAs take security very, very seriously, and they pass an expensive audit.
In the end, it is just some company or organisation that creates a root CA cert and intermediates, and then jumps through hoops to make browsers and OS cert stores add their certs. Not saying it is easy, just that the technical parts are rather similar, the rest being a cumbersome way to prove you are trustworthy by showing off your processes and the fences around the signing boxes.
I had the power turned off at home about a year ago. I think it was due to maintenance by the utilities company, or maybe I did it for DIY reasons.
Anyway, after it came back on, it took me far too long to realise why my internet was down. All DNS responses were being rejected. It turns out my Pi-Hole DNS server - which had forgotten the time, of course - was trying to resolve the IP of an NTP server so it could fetch the current time, but that name resolution was validated with DNSSEC, which requires the current time. You need the time to get the time.
I turned off DNSSEC and haven't looked back since.
For me it's a reason to use IP addresses in /etc/ntp.conf instead. DNS over HTTPS, and HTTPS in general, also need the current time to validate a certificate, so DNSSEC is not unique here.
I had exactly the same issue happen to me! I’ve since added a Real Time Clock (RTC), so the Pi shouldn’t forget the current time anymore. That is, until the battery runs out, of course, for which there is no monitoring in place. So yeah, maybe I should disable DNSSEC as well. At least we need to get better at building fault-tolerant systems. Maybe Pi-Hole needs to be aware of and handle this scenario better: a special “we’re booting up and we don’t have the current time, so don’t do DNSSEC yet” mode.
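For what it's worth, that bootstrap mode could be as simple as resolving the NTP name with the CD (Checking Disabled) bit set until the clock is synced, then validating as normal. A rough sketch with dnspython (the upstream resolver and NTP name are placeholders):

    # Sketch of the "we don't have the time yet, skip DNSSEC validation"
    # bootstrap: resolve the NTP name with the CD bit set, sync the clock,
    # then go back to validating as usual.
    import dns.flags
    import dns.message
    import dns.query
    import dns.rdatatype

    UPSTREAM = "9.9.9.9"            # placeholder validating upstream resolver
    NTP_NAME = "pool.ntp.org."      # placeholder NTP server name

    query = dns.message.make_query(NTP_NAME, dns.rdatatype.A)
    query.flags |= dns.flags.CD     # ask the upstream not to enforce DNSSEC validation
    response = dns.query.udp(query, UPSTREAM, timeout=5)

    addresses = [rr.address for rrset in response.answer
                 if rrset.rdtype == dns.rdatatype.A for rr in rrset]
    print("bootstrap NTP addresses:", addresses)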
fake-hwclock[1] is useful for this. I don't know much about Pi-Hole, but if it involves writable storage and has a package manager, fake-hwclock may be available from there.
> At least we need to get better at building fault-tolerant systems.
Security and fault tolerance are at odds here. Failing closed when you can't validate is correct, but not so useful. Properly modeling the clock might give you a way to get bootstrapped, but then an adversary would arrange for a bootstrap situation so they could interfere with your clock and then make use of the incorrect time for their nefarious purposes.
The grandparent is saying that DoH ignores the DNS settings on your computer and on your network (from DHCP), because every piece of software and every "smart" device just asks Cloudflare where to go, meaning Pi-Hole is ineffective.
Why isn't blockchain tech used to solve the problem of domain name resolution? Why aren't these solutions getting any traction? Tokens would serve as a built-in payment system for purchasing domain names and also as a spam-prevention mechanism; it's decentralized and therefore wouldn't rely on any centralized authority; the incentives for running nodes would be built in... Blockchains are highly replicated and can replicate without limit, which means the system would be able to scale to support unlimited reads (which is what matters most when it comes to a DNS system). It would be distributed geographically and therefore could provide the fastest possible lookup times, since you could run your own node to get the lowest-latency lookups for existing domains... Also, ownership of domains would be cryptographically secured. WTF, why is this not getting any traction? Why aren't browser makers adopting this or enabling blockchain-based domain lookups by default? There are solutions like Unlimited domains, but they're not actually being rolled out...
I forgot to mention the most relevant aspect: because you could run a node on-premises which syncs up with all the latest domains, you may not even need encryption... But if you didn't want to run your node on-premises, you could have an encrypted tunnel to someone (or a company) who does.