It means the authoritative servers of your domain simply do not support EDNS (...

colmmacc · on Jan 17, 2019

Route 53 absolutely does support EDNS0 and large answers without needing to truncate.

The minor error reported is that Route 53 does not respond with a "BADVERS" error in response to a query reporting EDNS version 1.

There is no EDNS version 1 standardized, and the specs say to respond with an error when you see a version you don't support. Per the specs Route 53 is "in the wrong". The thinking from the spec authors is that this kind of behavior will make it easier for resolvers to negotiate future EDNS versions and figure out what works. I'm skeptical that this will ever actually work in practice: most protocols tend to have backwards-compatible version upgrades that don't require round-trips.
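For anyone who wants to see where the version number and the BADVERS code actually live on the wire, here is a minimal sketch. It is not Route 53's code or any real server's; `build_opt`, `parse_version`, `full_rcode` and `respond` are made-up names, and the 1232-byte UDP size is just an example value. It encodes the OPT pseudo-record by hand and contrasts a strict responder (BADVERS for an unknown version) with a lenient one (answer anyway).

```python
import struct

OPT_TYPE = 41   # RR type of the EDNS OPT pseudo-record
BADVERS = 16    # extended RCODE meaning "unsupported EDNS version"

def build_opt(version, ext_rcode_hi=0, udp_size=1232):
    """Encode an OPT pseudo-RR with no options (udp_size is an arbitrary example value)."""
    # NAME is the root (one zero byte), CLASS carries the UDP payload size,
    # and the 32-bit TTL is split into ext-rcode (8) | version (8) | flags (16).
    # The final field is RDLEN, which is 0 because no options are attached.
    return b"\x00" + struct.pack("!HHBBHH", OPT_TYPE, udp_size, ext_rcode_hi, version, 0, 0)

def parse_version(opt):
    """Return the EDNS version byte from an encoded OPT pseudo-RR."""
    _type, _size, _ext_hi, version, _flags, _rdlen = struct.unpack("!HHBBHH", opt[1:11])
    return version

def full_rcode(header_rcode, opt):
    """Combine the 4-bit header RCODE with the upper 8 bits stored in the OPT TTL."""
    ext_hi = opt[5]
    return (ext_hi << 4) | header_rcode

def respond(query_opt, strict, supported=0):
    """Return (header_rcode, response_opt) for a query carrying query_opt."""
    if strict and parse_version(query_opt) > supported:
        # BADVERS does not fit in the 4-bit header field, so it is split in two.
        return BADVERS & 0xF, build_opt(supported, ext_rcode_hi=BADVERS >> 4)
    # Lenient: ignore the unknown version and answer as a version 0 responder.
    return 0, build_opt(supported)

query = build_opt(version=1)
print(full_rcode(*respond(query, strict=True)))    # 16
print(full_rcode(*respond(query, strict=False)))   # 0
```

The strict branch is what the spec asks for; the lenient branch is the "give them an answer" behavior being discussed here.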

The thinking on the Route 53 side, or at least my thinking - I wrote Route 53's EDNS support - is that we see incorrect versions occasionally on the wire, from weird clients where people have clearly made some kind of mistake, and that it is better to fail gracefully and give them /an/ answer than an error that could lead to outages. I might be completely wrong about that, but that's the thinking. We tend to always heavily lean in the direction that is likely to increase availability and avoid any potential for outages.

JoshTriplett · on Jan 17, 2019

The problem is that that behavior means people won't have to fix the software that's emitting bad version numbers, which will then make it difficult to upgrade to a real EDNS version 1 (or other version) in the future. A similar problem often occurs with other Internet protocol upgrades, as well as with things like Linux kernel syscalls that don't error out on unknown flags. If you don't give an error on things you don't understand, you make it much harder to build newer software.

If you see "incorrect versions occasionally on the wire from weird clients", perhaps the people building those weird clients will know to fix them if they get BADVERS errors.

colmmacc · on Jan 17, 2019

Personally I have no appetite for breaking any customer. Sure they may be "in the wrong" because they have an old dig recipe, or a load balancer health check tool, or a latency measurer, or some kind of DNS canary, that mistakenly used a wrong EDNS version, but if we change our behavior and break them they will a) feel it viscerally and b) rightly blame us for breaking things. We're pretty serious about maintaining backwards compatibility always and treating every API like a promise.

The other side of this is that being too tolerant can lead to network ossification. This impacted TLS1.3's roll-out, which had to be made to look similar enough to TLS1.2 session resumption that many network middle boxes can't tell the difference and let the traffic through. This is less elegant than a cleaner new design that isn't tainted by having to appear like the old formats.

Personally, I worry about this a lot less and consider the extra effort to be backwards compatible very worth it. I think it will still be very easy to find totally unused never-seen-in-the-wild EDNS version numbers; there's no shortage. On the TLS working group we've been through this a few times, having to find unused magic numbers that hadn't been burned as part of the various experiments. We've done this with protocol versions, cipher suites, etc.

On the DNS side, the proposed hypothetical EDNS version negotiation scheme that might show up in the future requires round-trips and state. Resolvers would have to first send an EDNSv1 query, observe it fail, store that state somewhere and then try with EDNSv0. What if the authoritative server rolls back their support? What if the auth service is mixed and only some of the fleet supports the new version? These challenges tend to make that kind of version negotiation impractical ... hence my skepticism that it will ever work like that.
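To make the round-trip and state cost concrete, here is a toy simulation of that trial-and-fallback idea. Everything in it is hypothetical: there is no real EDNSv1, no actual resolver works this way, and `AuthServer`, `RoundRobinFleet` and `Resolver` are invented for illustration. Servers only report whether they accept a version, and the resolver caches the last version that worked per name.

```python
import itertools

NOERROR = 0
BADVERS = 16

class AuthServer:
    """Toy authoritative server that accepts EDNS versions up to max_version."""
    def __init__(self, max_version):
        self.max_version = max_version

    def query(self, version):
        return NOERROR if version <= self.max_version else BADVERS

class RoundRobinFleet:
    """Several servers behind one name; each query lands on the next one in turn."""
    def __init__(self, servers):
        self._next = itertools.cycle(servers)

    def query(self, version):
        return next(self._next).query(version)

class Resolver:
    """Tries its preferred version first, steps down on BADVERS, remembers the result."""
    def __init__(self, preferred=1):
        self.preferred = preferred
        self.cache = {}         # name -> EDNS version that last worked
        self.round_trips = 0

    def resolve(self, name, server):
        version = self.cache.get(name, self.preferred)
        while True:
            self.round_trips += 1
            if server.query(version) == NOERROR:
                self.cache[name] = version
                return version
            if version == 0:
                raise RuntimeError("server rejected EDNS version 0")
            version -= 1        # fall back and pay another round trip

# A v0-only server costs an extra round trip until the failure is cached.
r = Resolver()
r.resolve("old.example", AuthServer(0))
print(r.round_trips)    # 2

# A mixed fleet: the resolver gets downgraded by whichever server it hits
# and, with this naive cache, never tries version 1 again.
mixed = RoundRobinFleet([AuthServer(1), AuthServer(0)])
m = Resolver()
print([m.resolve("mixed.example", mixed) for _ in range(3)])   # [1, 0, 0]
```

Even in this tiny model the resolver needs per-name state, an expiry policy the sketch does not have, and extra round trips whenever that state goes stale after a rollback.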

A small part of all this is that the DNS WG isn't really stewarded with these practical issues front of mind. DNAME and DNSSEC have each had backwards-incompatible implications that real-world operators had to fudge around and ignore precisely what the specs say, to be able to satisfy customers. Due to that history, IMO the DNS specs have lost a certain level of moral authority that other specifications still enjoy.

JoshTriplett · on Jan 17, 2019

> Personally I have no appetite for breaking any customer. Sure they may be "in the wrong" because they have an old dig recipe, or a load balancer health check tool, or a latency measurer, or some kind of DNS canary, that mistakenly used a wrong EDNS version, but if we change our behavior and break them they will a) feel it viscerally and b) rightly blame us for breaking things. We're pretty serious about maintaining backwards compatibility always and treating every API like a promise.

That's completely understandable.

What I'm wondering is whether you could accept it for now, but reach out to customers you see it from and provide guidance, in addition to blogging about the issue.

I'm not suggesting "turn it off tomorrow and break people", I'm suggesting "work towards a long-term plan of turning it off".

colmmacc · on Jan 17, 2019

That's probably what we will do. Rightly or wrongly, the new tool will probably accelerate it because I'm sure we'll be getting support queries from needlessly worried customers. Sometimes that kind of effort is good, but in this case I have sad feelings about that result, because the impetus is misguided. I totally predict that any future EDNS rev will not use a trial-and-fallback kind of negotiation, making all of this work pointless.

takeda · on Jan 17, 2019

Make this the default behavior, and provide a checkbox to allow answering requests that use version 1.

joethebl9w · on Jan 20, 2019

Perhaps make it respond progressively slower. It won't be broken but will encourage movement.

dagenix · on Jan 17, 2019

What do other DNS services do? If they do the same thing, this behavior makes sense to me. But if other services do fail these invalid requests, then I really don't see how this helps anyone. If these "weird" broken clients can only talk to Route 53, it's difficult for me to believe that they are production clients and worth the effort of this workaround.

privateSFacct · on Jan 18, 2019

Says someone who's clearly never tried to maintain production uptime :)

Seriously, if you have responsibility for a large system (i.e., think the GE network) - you can take the AWS route - and uptime will be good, or you can do the backwards incompatible changes - and everyone will hate you. Seriously, just the printer DNS lookup clients / the email lookup clients / the amount of cruft in a big system is mind-bending.

The other thing I'm not getting: you're going to be doing DNS round trips to get to a version match? DNS is on the critical path to web page response - I hope this is a joke. Put this into a hint in a recursive lookup or something so when I look up what DNS server handles abc.com I (can optionally) see what EDNS version that server supports? Or a TXT field or something? Does the error returned say what is supported? Or will the client need to do multiple R/T's to find the correct version? That can't be right - I'm not a DNS expert, but I'm not seeing the point of this approach.

dagenix · on Jan 23, 2019

You must be a real delight to work with.

None of your rant addressed my question at all - is what Route 53 is doing standard practice, or is it the unusual one?