Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.
Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code. Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.
For these kinds of things the post mortem and remediation have to take as given that, eventually, an unhandled unknown error that couldn't be predicted in advance will occur, and then work on how it could be handled better. Because of course the solution to a bug is to fix the bug, but the real issue, and the reason for the meltdown, is a DR plan that couldn't be executed in a reasonable timeframe. I don't care what programming practices, what style, what language, what tooling: something of a similar caliber will happen again eventually, with probability 1, even with the best coders.
I agree with your first paragraph, but your second paragraph is quite defeatist. I was involved in quite a few "premortem" meetings where people think up increasingly improbable failure modes and devise strategies for them. It's a useful meeting to hold before large changes to critical systems go live. In my opinion, this should totally be a known error.
> Having found an entry and exit point, with the latter being the duplicate and therefore geographically incorrect, the software could not extract a valid UK portion of flight plan between these two points.
It doesn't take much imagination to surmise that real-world data is sometimes broken and that you'll occasionally be handed data that doesn't contain a valid UK portion of the flight plan. Bugs can happen, yes, such as in this case where a valid flight plan was misinterpreted as invalid, but gracefully dealing with an invalid plan should be a requirement.
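To make that concrete, here's a minimal sketch of what treating it as a known error could look like. The names (extract_uk_portion, FlightPlanError) are hypothetical, not taken from the real system:

    class FlightPlanError(Exception):
        """Known, expected failure: this one plan can't be interpreted."""


    def extract_uk_portion(waypoints, uk_entry, uk_exit):
        """Return the sub-route between the UK entry and exit waypoints.

        A failed lookup surfaces as FlightPlanError (a known error scoped
        to one plan) instead of escaping as an unclassified exception.
        """
        try:
            start = waypoints.index(uk_entry)
            end = waypoints.index(uk_exit, start)
        except ValueError as exc:
            raise FlightPlanError(
                f"no valid UK portion between {uk_entry!r} and {uk_exit!r}"
            ) from exc
        return waypoints[start:end + 1]

The caller can then catch FlightPlanError, set that single plan aside for manual handling, and keep going.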
> Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code.
I think there's a world of difference between writing bug free code, and writing code such that a bug in one system doesn't propagate to others. Obviously it's unreasonable to foresee every possible issue with a flight plan and handle each, but it's much more reasonable to foresee that there might be some issue with some flight plan at some point, and structure the code such that it doesn't assume an error-free flight plan, and the damage is contained. You can't make systems completely immune to failure, but you can make it so an arbitrarily large number of things have to all go wrong at the same time to get a catastrophic failure.
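A rough sketch of that kind of containment: each plan gets its own error boundary, and anything unexpected quarantines just that plan. The process_all wrapper and the dict-with-callsign shape of a plan are illustrative assumptions, not the real system's design:

    import logging

    logger = logging.getLogger("flight_plans")


    def process_all(plans, process_one):
        """Process each plan independently; a failure never spreads."""
        quarantined = []
        for plan in plans:
            try:
                process_one(plan)
            except Exception:
                # Unknown error, but scoped to one plan: log it with a
                # traceback and set the plan aside for manual review.
                logger.exception("quarantining plan %r", plan.get("callsign"))
                quarantined.append(plan)
        return quarantined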
> Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.
How many KeyError exceptions have brought down your whole server? It doesn't happen, because whoever coded your web framework knows better and added a big try-catch around the code that handles individual requests. That way you get a 500 error on the specific request instead of a complete shutdown every time a developer makes a mistake.
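That boundary is roughly this shape; a simplified, hypothetical handle_request wrapper rather than any particular framework's actual code:

    import traceback


    def handle_request(handler, request):
        """Catch-all around a single request, as most frameworks do."""
        try:
            return handler(request)
        except Exception:
            # A KeyError (or anything else) in one handler becomes a 500
            # for that request; the server keeps serving everyone else.
            traceback.print_exc()
            return 500, "Internal Server Error"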
Crashing is a feature, though. It's not as if crash-on-unhandled-exception wrote itself into the interpreter spec by accident; it's a deliberate design choice. It just so happens that web apps don't need airbags that slow the business down.
That line of reasoning is how you get systemic failures like this one (or the Ariane 5 debacle). It only makes sense in the most dire of situations, like shutting down a reactor, not in input validation. At most, this failure should have grounded just the one affected flight rather than the entire transportation network.
> Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.
What surprised me more is that the amount of data covering all waypoints on the globe is quite small. If I were implementing a feature that queries waypoints by name as an identifier, the first thing I'd do is check for duplicates in the dataset, because if there are any, I need to handle that condition in every place where I query a waypoint by a potentially duplicated identifier.
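A quick sketch of that check, assuming purely for illustration that waypoint records are dicts with a "name" field:

    from collections import Counter


    def duplicated_names(waypoints):
        """Return every name shared by more than one waypoint record."""
        counts = Counter(w["name"] for w in waypoints)
        return sorted(name for name, count in counts.items() if count > 1)

Any non-empty result means a bare name is not a unique key, so every lookup by name needs a disambiguation rule.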
I had that thought immediately when looking at the flight plan format; I noticed the short strings referring to waypoints well before getting to the section where they point out the name collision issue.
Maybe I'm too used to working with absurd amounts of data (at least in comparison to this dataset); a constant part of my job is doing cursory data analysis to understand the parameters of the data I'm working with, which values can be duplicated or malformed, and so on.
If there are duplicate waypoint IDs, they are not close together. They can be easily eliminated by selecting the one that is one hop away from the prior waypoint. Just traversing the graph of waypoints in order would filter out any unreachable duplicates.
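A sketch of that disambiguation rule, assuming a simple adjacency-map model of the airway graph (ids_by_name and neighbours are invented names, not the real data model):

    def resolve_waypoint(name, prior_id, ids_by_name, neighbours):
        """Pick, among waypoints sharing `name`, the one adjacent to prior_id.

        ids_by_name: name -> list of waypoint IDs sharing that name.
        neighbours:  waypoint ID -> set of directly connected waypoint IDs.
        """
        candidates = ids_by_name.get(name, [])
        if len(candidates) == 1:
            return candidates[0]
        reachable = [c for c in candidates if c in neighbours.get(prior_id, set())]
        if len(reachable) == 1:
            return reachable[0]
        # Zero or several reachable duplicates: a known error for this one
        # plan, not an unknown failure of the whole system.
        raise LookupError(f"cannot disambiguate {name!r} after {prior_id!r}")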
That it's safety critical is all the more reason it should fail gracefully (while still surfacing errors to warn the user). A single bad flight plan shouldn't jeopardize things by making data on all the other flight plans unavailable.
Well, yes, because you're describing a system where the stakes are really low and crash recovery is always possible, since you can just throw away all your local state.
The flip side would be a database that fails to parse some part of its WAL due to disk corruption and just says, "eh, delete those sections and move on."
The other “tabs” here are other airplanes in flight, depending on being able to land before they run out of fuel. You don’t just ignore one and move on.
Nonsense comparison: your browser's tabs are de facto insulated from each other, while the flight paths of 7,000 daily planes over the UK literally share the same space.
No, it's more like saying your browser has detected possible internal corruption with, say, its history or cookies database and should stop writing to it immediately. Which probably means it has to stop working.
It definitely isn't. It was just a validation error in one of the thousands of external data files that the system processes. Something very routine for almost any software dealing with data.