So they forgot to "geographically disparate" fence their queries. Having built a flight navigation system before, I know this bug. I've seen this bug. I've followed the spec to include a geofence to avoid this bug.
1. Pilots occasionally have to fat finger them into ruggedized I/O devices and read them off to ATC over radios.
2. These are defined by the various regional aviation authorities. The US FAA will define one list (and they'll be unique in the US), the EU will have one (EASA?), etc.
The AA965 crash (1995-12-20) was due to an aliased waypoint name: Colombia had two waypoints with the same name ('R') within 150 nautical miles of each other, in violation of ICAO regulations dating from around the '70s.
I'm trying to imagine someone ensuring differentiation between minimums.unsettled.depends (Idaho), minimums.unsettled.depend (Alaska), minimums.unsettles.depend (Spain), and minimum.unsettles.depend (Russia) while typing them in on a T9-style keypad with a 7-character display in turbulence.
The word list is 40,000 words long, so without plurals there probably aren't enough words that people could spell or even pronounce. A better fix would be making it "what four words" - I wonder if they'd already committed too much to the "three" concept before discovering the flaw? Either way, using phony statistics to make unwarrantable claims of accuracy is a poor workaround.
Since the app gives you the words to say, and translates those back to coordinates on the receiving end, in theory they could alter the word list, at the cost of making any written-down version obsolete.
Maybe they should release a new service called What4ActuallyVettedWordsAndWordCombinations ;)
Something like what3words might be useful, but what3words itself doesn't have enough "auditory distance" between words. (i.e. - there are some/many words used by what3words that sound similar enough to be indistinguishable over an audio channel with noise.)
Something like FixPhrase seems better for use over radio.
There are a number of word lists whose words were picked due to their beneficial properties given the use-case of possibly needing to be understood verbally over unclear connections. The NATO phonetic alphabet, and PGP word lists come to mind: https://en.wikipedia.org/wiki/PGP_word_list
I'm particularly a fan of the PGP word list (it would definitely require more than 3 words for this purpose, though) because it has built-in error detection (of transposition, insertion or deletion): Separate word lists are used for "even" and "odd" hex digits. This makes it, IMHO, fairly ideal for use over verbal channels. From the wiki: "The words were carefully chosen for their phonetic distinctiveness, using genetic algorithms to select lists of words that had optimum separations in phoneme space"
It sounds like the w3w folks did not do any such thing
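For illustration, here's a rough sketch of that even/odd scheme, with tiny made-up word lists standing in for the real 256-entry PGP lists:

    # Minimal sketch of the PGP even/odd word-list idea, with placeholder lists
    # (the real lists each have 256 words indexed directly by the byte value).
    EVEN_WORDS = ["adult", "basin", "clamp", "drumbeat"]       # placeholder 2-syllable list
    ODD_WORDS  = ["absurd", "bodyguard", "commando", "decadence"]  # placeholder 3-syllable list

    def encode(data: bytes) -> list[str]:
        out = []
        for i, b in enumerate(data):
            # Even-offset bytes draw from one list, odd-offset bytes from the other.
            words = EVEN_WORDS if i % 2 == 0 else ODD_WORDS
            out.append(words[b % len(words)])  # real lists would index with b (0-255)
        return out

    def check(words: list[str]) -> bool:
        # A dropped, added, or swapped word puts a word from the wrong list at
        # some position, so transposition/insertion/deletion is detectable.
        return all(
            w in (EVEN_WORDS if i % 2 == 0 else ODD_WORDS)
            for i, w in enumerate(words)
        )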
EDIT: According to my napkin math, 6 PGP words should be enough to cover the 64 trillion coordinates that "what3words" covers, but with way better properties such as error detection and phonetic incongruity (and not only that, it is just over 4 times larger, which means it can achieve a resolution of 5 feet instead of 10)
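Quick check of that napkin math, using the figures quoted above:

    pgp_list_size = 256                 # words per PGP list (even or odd)
    combos_6_words = pgp_list_size ** 6
    print(combos_6_words)               # 281,474,976,710,656 ~ 281 trillion

    w3w_squares = 64e12                 # ~64 trillion squares claimed by what3words
    print(combos_6_words / w3w_squares) # ~ 4.4, "just over 4 times larger"

    # Cell area shrinks by that factor, so linear resolution improves by its
    # square root: 10 ft / sqrt(4.4) ~ 4.8 ft, i.e. roughly 5 feet.
    print(10 / (combos_6_words / w3w_squares) ** 0.5)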
As a New Zealander, the PGP list is unfriendly because there are plenty of words that are hard to spell, or are too US centric.
dogsled (contains silent d, and sleigh might be a British spelling)
Galveston (I've never heard of the place)
Geiger (easy to type i before e - unobvious)
Wichita (I would have guessed the spelling began with which or witch)
And why did the designers not make the words have some connection to the numbers e.g. there are 12 even and 12 odd words beginning with E - add 16 more E words and you could use E words for E0 to EF. Redundant encoding like that helps humans (and would help when scanning for errors or matches too)
I imagine it is even harder for ESOL people from other countries! I am sure the UI has completion to help - but I wouldn't recommend using that list for anything except a pure US audience.
I have been to Galveston and I can assure you that you have not missed anything. There is no good reason to visit or know anything about it.
Making a word list that could work well for speakers of different English dialects and for speakers of English as a second language sounds really hard. Has such a list ever been made?
Probably it is too hard so we will continue to ignore the problem.
It should be discussed like this! It's clear that the w3w people didn't even do the bare minimum here!
The thing is, once you agree that some words are subpar or need translations, you can do a 1-to-1 mapping.
The problem with What3Words is that supporting the original word set will always be a pain even if they release a v2 word set with a 1-to-1 mapping (I believe they've already released versions for other languages?)
re: Geiger- parsing it could trivially accept misspellings of words
I mean there is the ICAO phonetic alphabet already known and used by every single licensed pilot the world over, regardless of their native language.
or, or... hang with me here for a minute...
We could instead use one of these cool new hash algorithms that require a computer and use about fifteen thousand English words! I understand they are all the rage in the third world countries that lack a postal system.
These are all lovely technical solutions. The problem I imagine isn't coming up with unique words. The problem is organizing a switchover for dozens if not hundreds of systems and agencies around the world. The chaos of change probably outweighs the benefits.
what3words is not useful at all.
1. The FAA (and thus the world) has a hard character limit of 8, to support old mainframes running old Unix dispatch software (Delta, I'm looking at you).
2. The cockpit computers have limited characters on screen. An FMC (Flight Management Computer) can display 28 characters x 16 rows at best; most are 8 rows, and some military aircraft have only 2. The FMC is really just an old embedded chip.
3. The entire airline, flight, tourism, booking, and ticketing ecosystem of the world would need to change, including all legacy systems, paper charts, maps, BMSs, AirBosses, ATC software, and radio beacons.
There is no chance that any of this will change simply because someone came up with a way to associate words with landmarks you can't see from the air.
You could maybe make them globally unique by adding the country where appropriate like we do with Paris, France vs Paris, Texas? And not using the same name twice in the same country.
The names have to be entered manually by pilots, e.g. if they change the route. They have to be transmitted over the air by humans. So they must be short and simple.
Yes but shouldn’t one step of the code be to translate these non-unique human-readable identifiers into completely unique machine-readable identifiers?
How exactly would you do that? It’s impossible to map from a dataset of non-unique identifiers to unique identifiers without additional data and heuristics. The mapping is ambiguous by definition.
The underlying flight plan standards were all created in an era of low-memory machines, when humans were expected to directly interpret data exactly as the programs represented it internally (because serialisation and deserialisation are expensive when you need every CPU cycle just to run your core algorithms).
Couldn’t you use the surrounding points? Each point is surrounded by a set of nearby points. You can prepare a map of pairs of points into unique ids beforehand, then have a step that takes (before, current, after) for each point in the flight plan and finds the ID of current.
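Something like this rough sketch (waypoint names, coordinates, and the distance shortcut are invented for illustration):

    import math

    # Hypothetical disambiguation: when a name maps to several candidate fixes,
    # pick the one closest to the previously resolved point.
    WAYPOINTS = {
        "RESIA": [(46.7, 10.5), (14.9, 42.6)],   # two unrelated fixes sharing a name
        "DVL":   [(48.1, -98.9), (48.2, 16.5)],
    }

    def distance(a, b):
        # Rough flat-earth distance in degrees; good enough to break ties between
        # candidates thousands of km apart.
        return math.hypot(a[0] - b[0], (a[1] - b[1]) * math.cos(math.radians(a[0])))

    def resolve(route_names, start):
        resolved, here = [], start
        for name in route_names:
            here = min(WAYPOINTS[name], key=lambda cand: distance(here, cand))
            resolved.append((name, here))
        return resolved

    print(resolve(["RESIA", "DVL"], start=(47.0, 8.0)))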
That sort of thing happens already; for instance, the MCDU of an Airbus aircraft will present various options in the case of ambiguous input, with a distance in nautical miles for each option. Usually, the closest option is the most appropriate.
Directly mapping is impossible, so you can't just do a dumb ID-at-time pre-processing step (which is what your comment seems to suggest). You need a more complex pre-processing step that's capable of understanding the surrounding context the identifier is being used in. A major issue with the flight planning system (as highlighted in the article) is that they attempted to do this heuristic mapping as part of their core processing step, and just assumed the ID wouldn't be too ambiguous, and certainly wouldn't repeat.
But it's how the waypoint codes work in practice - they are contextual. If Air Traffic Control tell a plane to head for waypoint RESIA, they mean the one nearby, not the one 4000 km away.
Have to admit, I read the article in full detail only after commenting and I see your point.
Especially since the implementing company is called out explicitly for failing to achieve this, and the risks of changing the well-established identifiers are also illustrated.
Perfect might be the enemy of the good then, or the standardization thing at least is a separate topic.
Not really because each point is only adjacent to a small neighborhood of other points, so if you want to test every possibility then your search space only grows by a constant factor proportional to the maximum degree of the graph.
As for implementation complexity, you would hope they would use formal verification for something like this.
Long story: because changing identifiers is a considerable refactoring, and it takes coordination with multiple worldwide distributed partners to transition safely from the old to the new system, all to avoid a hypothetical issue some software engineer came up with
Short story: money. It costs money to do things well.
Pretty sure that is still not the meaning of refactoring. As I understand it refactoring should mean no changes to the external interface but changes to how it is implemented internally.
We can pontificate on how to define the scope of a system here. I will only state that, from the perspective of a consumer, you could consider this a Service on which the interface of find flight, book flight, etc. would appear to be the same while the connections internal to each of the above modules would have to account for the change.
Functionally, I suppose it's the equivalent of upgrading an ID field that was originally declared as an unsigned 32-bit integer to a wider 64-bit representation. We may not be changing anything fundamental in the functionality, but every boundary interface, protocol, and storage mechanism must now suffer through a potentially painful modification.
does refactoring mean literally any non-local change even just like changing a variable name, or does it usually mean some kind of structural or architectural non-local change
It sounds like for actual processing they replace them with GPS coordinates (or at least augment them with such). But this is the system that is responsible for actually doing that...
W3W contains homonyms and words that are easily confused by non-native English speakers, often within just a few km. The latter is why ATC uses "niner", to avoid confusing "nine" and "nein".
I always love it when someone helicopters in to a complex, long-established system and, without even attempting to understand the requirements, constraints or history, knows this thing they read on a blog one time would fix all the problems thousands of work-years have failed to address.
As software developers, we are often living in our own bubble.
As a pilot and developer working on an aviation solution, I quite often run into this issue when discussing solutions with my colleagues.
The biggest fault (besides being proprietary) is that you must be online in order to use WTW. The times that you might need WTW are ALSO the times you are most likely to be unable to be online.
> The biggest fault (besides being proprietary) is that you must be online in order to use WTW.
That doesn't seem to be the case anymore.
It's still not a great system – many included words are ambiguous (e.g. English singular and plural forms are both possible, and an "s" is notoriously difficult to hear over a bad phone line), and it's proprietary, as you already mentioned.
It's definitely not the case, as the word list and algorithm are not secret (notwithstanding that they're proprietary) and have been re-implemented and ported into at least a couple of languages that allow for offline use. I have a Rust implementation that started life as a transliteration from Javascript. I wouldn't recommend using it, still -- I wrote it in the hope of finding more problems with collisions, not because I like it.
That would actually be pretty bad. As mentioned, W3W is proprietary, requires an online connection, and has homonyms. On top of that, you need to enter these waypoints into your aircraft's navigation system - sometimes one letter at a time using a rotary dial. These navigation systems will stay in service for decades.
Aviation already uses phonetically pronounceable waypoint names. Typically 5 characters long for RNAV (GPS) waypoints, for example "ALTAM" or "COLLI". Easy to pronounce, easy to spell phonetically if needed, and easy to enter.
The problem is the list of waypoints are independently defined by each country, so duplicates are possible between countries.
Rather than replacing a system that mostly works (and mandating changes to aircraft navigation systems, ATC systems, and human training for marginal benefit)... an easier fix would just be to have ICAO mandate that these waypoints are globally unique.
If only there was a globally unique set of short two-letter names for every country that could be used as prefixes to enforce uniqueness while still allowing every country to manage their own internal waypoint list.
I'm sure they thought about this at some point. Airports already have a country-code prefix. (For example, airports in the Continental US always start with K.)
For whatever reason, by convention navaids never use a country prefix. Even when it would make sense - the code for San Francisco International Airport is "KSFO", but the identifier for the colocated VOR-DME is just "SFO". (Sometimes this does make a big difference, when navaids are located off site - KCCR vs CCR for Concord Airport vs the off-site Concord VOR-DME, for example.)
It's even worse for NDB navaids, which are often just two letters.
Either way, we're stuck with it because it's baked into aircraft avionics and would be incredibly expensive to change at this point.
> the backup system applied the same logic to the flight plan with the same result
Oops. In software, the backup system should use different logic. When I worked at Boeing on the 757 stab trim system, there were two avionics computers attached to the wires to activate the trim. The attachment was through a comparator, that would shut off the authority of both boxes if they didn't agree.
The boxes were designed with:
1. different algorithms
2. different programming languages
3. different CPUs
4. code written by different teams with a firewall between them
The idea was that bugs from one box would not cause the other to fail in the same way.
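Roughly the shape of that comparator, as a sketch (the tolerance value and channel inputs are made up, not Boeing's numbers):

    # 2oo2 sketch: two dissimilar channels drive the actuator only while they
    # agree; on disagreement both lose authority and the pilot is the backup.
    TOLERANCE = 0.05  # allowable slack between channels, illustrative value

    def comparator(channel_a: float, channel_b: float):
        """Return a trim command if the channels agree, otherwise revoke authority."""
        if abs(channel_a - channel_b) <= TOLERANCE:
            return (channel_a + channel_b) / 2.0
        return None  # disagreement: shut off both boxes, hand control to the pilot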
This would have been a 2oo2 system where the pilot becomes the backup. 2oo2 systems are not highly available.
Air traffic control systems should at least be 2oo3[1] (3 systems independently developed of which 2 must concur at any given time) so that a failure of one system would still allow the other two to continue operation without impacting availability of the aviation industry.
Human backup is not possible because of human resourcing and complexity. ATC systems would need to be available to provide separation under IFR[2] and CVFR[3] conditions.
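A sketch of the 2oo3 voting idea, for contrast (threshold and fallback behaviour are illustrative, not from any real ATC spec):

    # 2oo3 sketch: three independently developed channels; the output is whatever
    # at least two of them agree on (within slack), so a single faulty channel is
    # outvoted and the system stays available.
    from itertools import combinations

    AGREEMENT = 0.05  # illustrative threshold

    def vote_2oo3(a: float, b: float, c: float):
        for x, y in combinations((a, b, c), 2):
            if abs(x - y) <= AGREEMENT:
                return (x + y) / 2.0
        return None  # no two channels agree: degrade, e.g. fall back to procedures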
> Air traffic control systems should at least be 2oo3... Human backup is not possible because of human resourcing and complexity.
But this was a 1oo1 system, and the human backup handled it well enough: a lot of people were inconvenienced, but there were no catastrophes, and (AFAIK) nothing that got close to being one.
As for the benefits of independent development: it might have helped, but the chances of this being so are probably not as much as one would have hoped if one thought programming errors are essentially random defects analogous to, say, weaknesses in a bundle of cables; I had a bit more to say about it here:
True. I don't want to downplay the actual cost (or, worse, suggest that we should accept "the system worked as intended" excuses), but it's not just that there were no crashes: the air traffic itself remained under control throughout the event. Compare this to, for example, the financial "flash crash" of 2010, or the nuclear 'excursions' at Fukushima / Chernobyl / Three Mile Island / Windscale, where those nominally in control were reduced to being passive observers.
It also serves as a reminder of how far we have to go before we can automate away the jobs of pilots and air traffic controllers.
This reminds me of a backwoods hike I took with a friend some years back. We each brought a compass, "for redundancy", but it wasn't until we were well underway that we noticed our respective compasses frequently disagreed. We often wished we had a third to break the tie!
My grandfather was working with Stanisław Skarżyński, who was preparing for his first crossing of the Atlantic in a lightweight airplane (RWD-5bis, 450kg empty weight) in 1933.
They initially mounted two compasses in the cockpit, but Skarżyński taped one of them over so that it wasn't visible, saying wisely that if one fails, he will have no idea which one is correct.
In this case the problem was choosing an excessively naive algorithm. I'm very inexperienced but that seems to me like the solution would be to spend a bit more money on reviewing the one implementation rather than writing two new ones from scratch.
you would be very surprised how difficult avionics are from even a fundamental level.
I'll provide a relatively simple example.
Just attempting to design a Star Fox game clone where the ship goes towards the mouse cursor using Euler angles will almost immediately result in gimbal lock, with your starfighter locking up tighter than an unlubricated car engine at 100 mph and unable to move. [0]
The standard solution in games (or at least what I used) has been to use quaternions [1] (Hamilton defined a quaternion as the quotient of two directed lines in a three-dimensional space,[3] or, equivalently, as the quotient of two vectors). So you essentially lift your 3D coordinates into the 4D quaternion representation, apply your rotations there, then convert back to 3D space and apply your transforms.
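Roughly what that looks like as a sketch (axis, angle, and vector purely illustrative):

    import math

    # Build a rotation quaternion from an axis and angle, then rotate a vector
    # with q * v * q_conjugate, sidestepping the gimbal lock that stacked Euler
    # rotations can hit.

    def quat_from_axis_angle(axis, angle):
        ax, ay, az = axis
        n = math.sqrt(ax * ax + ay * ay + az * az)
        s = math.sin(angle / 2) / n
        return (math.cos(angle / 2), ax * s, ay * s, az * s)  # (w, x, y, z)

    def quat_mul(a, b):
        aw, ax, ay, az = a
        bw, bx, by, bz = b
        return (
            aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw,
        )

    def rotate(v, q):
        w, x, y, z = q
        q_conj = (w, -x, -y, -z)
        _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, *v)), q_conj)
        return (rx, ry, rz)

    # Example: yaw a forward-pointing ship 90 degrees about the vertical axis.
    q = quat_from_axis_angle((0, 0, 1), math.pi / 2)
    print(rotate((1, 0, 0), q))  # ~ (0, 1, 0)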
This was literally just to get my little space ship to go where my mouse cursor was on the screen without it locking up.
So... yeah, I cannot even begin to imagine the complexity of what a Boeing 757 (let alone a 787) is doing under the hood to deal with reality and not causing it to brick up and fall out of the sky.
I don't think we're talking about that kind of software, though. This bug was in code that needs to parse a line defined by named points and then clip the line to the portion in the UK. Not trivial, but I can imagine writing that myself.
But regardless, the more complex the code, the worse an idea it is to maintain three parallel implementations if you won't/can't afford to do it properly.
I was doing some orientation sensing 20 years ago with an IMU and ran into the same problem. I had never known at the time it was gimbal lock (which I had heard of) but did read quaternions were the way to fix it. Pesky problem.
> Human backup is not possible because of human resourcing
This is an artificial restraint. In the end, it comes down to risk management: "Are we willing to pay someone to make sure the system stays up when the computer does something unexpected?".
Considering this bug only showed up now, chances are there was a project manager who decided the risk would be extremely low and not worth spending another 200k or so of yearly operational expenses on.
First thought that came to my mind as well when I read it. This failover system seems to be more designed to mitigate hardware failures than software bugs.
I also understand that it is impractical to implement the ATC system software twice using different algorithms. The software at least checked for an illogical state and exited, which was the right thing to do.
A fix I would consider is to have the inputs more thoroughly checked for correctness before passing them on to the ATC system.
> A fix I would consider is to have the inputs more thoroughly checked for correctness before passing them on to the ATC system.
Thoroughly checking of the inputs as far as possible should be a given, but in this case, the inputs were correct: while the use of duplicate identifiers is considerably less than ideal, the constraints on where that was permitted meant that there was one deterministically unambiguous parsing of the flight plan, as demonstrated in the article. The proximate cause of the problem was not in the inputs, but how they were processed by the ATC system.
For the same reason, multiple implementations of the software would only have helped if a majority of the teams understood this issue and got it right. I recall a fairly influential paper in the '90s (IIRC) in which multiple independent implementations of a requirements specification were compared, and the finding was that the errors were quite strongly correlated - i.e. there was a tendency for the teams to make the same mistakes as each other.
not stronger isolation between different flight plans? it seems "obvious" to me that if one flight plan is causing a bug in the handling logic, the system should be able to recover by continuing with the next flight plan and flagging the error to operators to impact that flight only
I'm no aviation expert, but perhaps with waypoints:
A B C D E
/
F G H I J
If flight plan #1 is known to be going from F-B at flight level 130, and you have a (supposedly) bogus flight plan #2, they can't quite be sure if it might be going from A-G at flight level 130 at the same time and thus causing a really bad day for both aircraft. I'd worry that dropping plan #2 into a queue for manual intervention, especially if this kind of thing only happens once every 5 years, could be disastrous if people don't realize what's happening and why. Many people might never have seen anything in that queue and may not be trained to diagnose the problem and manually translate the flight plan.
This might not be the reason why the developer chose to have the program essentially pull the fire alarm and go home in this case, but that's the impression I got.
The ATC system handled well enough (i.e. no disasters, and AFAIK, no near misses) something much more complicated than one aircraft showing up with no flight plan: the failure of this particular system put all the flights in that category.
I mentioned elsewhere that any ATC system has to be resilient enough to handle things like in-flight equipment failure, medical emergencies, and the diversion of multiple aircraft on account of bad weather or an incident which shuts down a major airport.
As for why the system "pulled the plug", the author of the article suspects that this particular error was regarded as something that would not occur unless something catastrophic had caused it, whereas, in reality, it affected only one flight and could probably have been easily worked around if the system had informed ATC which flight plan was causing the problem.
I'm not sure they're even used for that purpose - that side of thing is done "live" as I understand it - the plans are so that ATC has the details on hand for each flight and it doesn't all need to be communicated by radio as they pass through.
I wonder where most of the complexity lies in ATC. Naively you’d think there would be some mega computer needed to solve the puzzle but the UK only sees 6k flights a day and the scale of the problem, like most things in the physical world, is well bounded. That’s about the same number of buses in London, or a tenth of the number of Uber drivers in NYC.
Much of the complexity is in interop. Passing data between ATC control positions, between different facilities, and between different countries. Then every airline has a bidirectional data feed, plus all the independent GA flights (either via flight service or via third-party apps). Plus additional systems for weather, traffic management, radar, etc. Plus everything happening on the defense side.
All using communication links and protocols that have evolved organically since the 1950s, need global consensus (with hundreds of different countries' implementations), and which need to never fail.
The system should have just rejected the FPL, notified the admins about the problem, and kept working.
The admins could have fixed whatever the software could not handle.
The affected flight could have been vectored by ATC if needed to divert from filed FPL.
Way less work and a better outcome than "the system throws its hands in the air and becomes unresponsive".
if this is true, then would it be a better investment to have the 2nd team produce a fuzz testing/systematic testing mechanism instead of producing a secondary copy of the same system?
In fact, make it adversarial testing such that this team is rewarded (may be financially) if mistakes or problems are found from the 1st team's program.
the whole point is that they're not collaborating so as to avoid cross-contamination. also you don't get paid unless and until you identify the mistake. if you decrease the reward over time, there is an additional incentive to not sit on the information
Naturally, any comparator would have some slack in it to account for variations. Even CPU internals have such slack, that's why there's a "clock" to synchronize things.
I seem to remember another problem at NATS which had the same effect. Primary fell over so they switched over to a secondary that fell over for the exact same reason.
It seems like you should only failover if you know the problem is with the primary and not with the software itself. Failing over "just because" just reinforces the idea that they didn't have enough information exposed to really know what to do.
The bit that makes me feel a bit sick, though, is that they didn't have a method called "ValidateFlightPlan" that throws an error if for any reason a plan couldn't be parsed, with that error handled in a really simple way. What programmer would look at a processor of external input and not think, "what do we do with bad input that makes it fall over?" I did something like that today for a simple message prompt, since I can't guarantee that in all scenarios the data I need will be present/correct. Try/catch and a simple message to the user: "Data could not be processed".
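Something along these lines, as a rough sketch (the plan format and validation rule here are invented, purely to show the containment):

    from dataclasses import dataclass

    class FlightPlanError(Exception):
        """Raised when a single flight plan cannot be parsed or interpreted."""

    @dataclass
    class FlightPlan:
        callsign: str
        waypoints: list

    def validate_flight_plan(raw: str) -> FlightPlan:
        parts = raw.split()
        if len(parts) < 3:
            raise FlightPlanError(f"too short to be a plan: {raw!r}")
        return FlightPlan(callsign=parts[0], waypoints=parts[1:])

    def process_batch(raw_plans):
        accepted, needs_human = [], []
        for raw in raw_plans:
            try:
                accepted.append(validate_flight_plan(raw))
            except FlightPlanError as err:
                needs_human.append(f"{raw!r}: {err}")  # flag it, keep going
        return accepted, needs_human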
Well, if the primary is known not to be in a good state, you might as well fail over and hope that the issue was a fried disk or a cosmic bit flip or something.
The real safety feature is the 4 hour lead time before manual processing becomes necessary.
One of the key safety controls in aviation is “if this breaks for any reason, what do we do”, not so much “how do we stop this breaking in the first place”.
It's very hard to ensure you capture every single possible failure mode. Yes, the engineering control is important but it's not the most critical. What to do if it does fail (for any reason) is the truly critical control, because it solves for the possibility of not knowing every possible way something might fail and therefore missing some way to prevent a failure
One or more of three results can come from the engineering exercise of trying to keep something from breaking in the first place:
1. You could know the solution, but it would be too heavy.
2. You could know the solution, but it would include more parts, each of which would need the same process on it, and the process might fail the same way
3. You miss something and it fails anyway, so your "what if this fails" path better be well rehearsed and executed.
Real engineering is facing the tradeoffs head on, not hand waving them away.
The engineering controls don't independently make systems safe, they make things more reliable and cost-effective, and hopefully reduce the number of times the process controls kick in.
The process controls do however independently make things safe.
The reason for this is that there are 'unknown unknowns'—we accept that our knowledge and skills are imperfect, and there may be failures that occur which could have been eliminated with the proper engineering controls, but we, as imperfect beings and organisations, did not implement the engineering controls because we did not identify this possible failure mode.
There are also known errors, where the cost of implementing engineering controls may simply outweigh the benefits when adequate process controls are in place.
It was in a bad state, but in a very inane way: a flight plan in its processing queue was faulty. The system itself was mostly fine. It was just not well-written enough to distinguish an input error from an internal error, and thus didn't just skip the faulty flight plan.
Indeed, that intention is quite transparent in this case. Anyways, I suspect that invalid input exists that would have made the system react in a similar way
No validation, and this point from the article stood out to me:
---
The programming style is very imperative. Furthermore, the description sounds like the procedure is working directly on the textual representation of the flight plan, rather than a data structure parsed from the text file. This would be quite worrying, but it might also just be how it is explained.
---
Given that description, I'd be surprised if it wasn't just running a regex / substring matches against the text and there's no classes / objects / data structure involved. Bearing in mind this is likely decades old C code that can't be rewritten or replaced because the entirety of the UK's aviation runs on it.
> Bearing in mind this is likely decades old C code that can't be rewritten or replaced because the entirety of the UK's aviation runs on it.
It's new code, from 2018 :)
Quote from the report:
> An FPRSA sub-system has existed in NATS for many years and in 2018 the previous FPRSA sub- system was replaced with new hardware and software manufactured by Frequentis AG, one of the leading global ATC System providers.
Failing over is correct because there's no way to discern that the hardware is not at fault. They should have designed a better response to the second failure to avoid the knock-on effects.
Retroactive inspection revealed that it wasn't a hardware failure, but the computer didn't know that at the time, and hardware failure can look like anything, so it was correct to exercise its only option.
And why could the system not put the failed flight plan in a queue for human review and just keep on working for the rest of the flights? I think the lack of that “feature” is what I find so boggling.
Because the code classified it as a "this should never happen!" error, and then it happened. The code didn't classify it as a "flight plan has bad data" error or a "flight plan data is OK but we don't support it yet" error.
If a "this should never happen!" error occurs, then you don't know what's wrong with the system or how bad or far-reaching the effects are. Maybe it's like what happened here and you could have continued. Or maybe you're getting the error because the software has a catastrophic new bug that will silently corrupt all the other flight plans and get people killed. You don't know whether it is or isn't safe to continue, so you stop.
That reasoning is fine, but it rather seems that the programmers triggered this catastrophic "stop the world" error because they were not thorough enough in considering all scenarios. As TA expounds, it seems that neither formal methods nor fuzzing were used, which would have gone a long way toward flushing out such errors.
> it rather seems that the programmers triggered this catastrophic "stop the world" error because they were not thorough enough considering all scenarios
Yes. But also, it's an ATC system. Its primary purpose "is to prevent collisions..." [1].
If the system encounters a "this should never happen!" error, the correct move is to shut it down and ground air traffic. (The error shouldn't have happened in the first place. But the shutdown should have been more graceful.)
Neither formal methods nor fuzzing would've helped if the programmer didn't know that input can repeat. Maybe they just didn't read the paragraph in whatever document describes how this should work and didn't know about it.
I didn't have to implement flight control software, but I had to write some stuff described by MIFID. It's a job from hell, if you take it seriously. It's a series of normative documents that explains how banks have to interact with each other which were published quicker than they could've been implemented (and therefore the date they had to take effect was rescheduled several times).
These documents aren't structured to answer every question a programmer might have. Sometimes the "interesting" information is close together. Sometimes you need to guess the keyword you need to search for to discover all the "interesting" parts... and it could be thousands of pages long.
The point of fuzzing is precisely to discover cases that the programmers couldn't think about, and formal methods are useful to discover invariants and assumptions that programmers didn't know they rely on.
Furthermore, identifiers from external systems always deserve scepticism. Even UUIDs can be suspect. Magic strings from hell even more so.
If programmer didn't know that repetitions are allowed, they wouldn't appear in the input to the fuzzer as well.
The mistake is too trivial to attribute to programmer incompetence or lack of attention. I'd bet my lunch it was because the spec is written in incomprehensible language, is all over the place in a thousand-page PDF, and the particular aspect of repetition isn't covered in what looks like the main description of how paths are defined.
I've dealt with specs like that. It's most likely the error created by the lack of understanding of the details of the requirements than of anything else. No automatic testing technique would help here. More rigorous and systematic approach to requirement specification would probably help, but we have no tools and no processes to address that.
> If programmer didn't know that repetitions are allowed, they wouldn't appear in the input to the fuzzer as well.
It totally would. The point of a fuzzer is to test the system with every technically possible input, to avoid bias and blind spots in the programmer's thinking.
Furthermore, assuming that no duplicates exist is a rather strong assumption that should always be questioned. Unless you know all about the business rules of an external system, you can't trust its data and can't assume much about its behavior.
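A toy illustration of that (the waypoint names and the buggy uniqueness assumption are invented): a fuzzer drawing from a small name pool produces duplicates almost immediately, surfacing exactly the input class nobody wrote a test for.

    import random

    NAMES = ["DVL", "RESIA", "ALTAM", "COLLI", "SFO"]

    def naive_extract_segment(route):
        # Hypothetical buggy logic that assumes names never repeat.
        assert len(set(route)) == len(route), "this should never happen"
        return route[1:-1]

    def fuzz(iterations=10_000, seed=0):
        rng = random.Random(seed)
        for _ in range(iterations):
            route = [rng.choice(NAMES) for _ in range(rng.randint(2, 8))]
            try:
                naive_extract_segment(route)
            except AssertionError:
                return route  # found an input class nobody thought about
        return None

    print(fuzz())  # quickly surfaces a route with a duplicated waypoint name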
Anyways, we are discussing about the wrong issue. Bugs happen, even halting the whole system can be justified, but the operators should have had an easier time figuring out what was actually going on, without the vendor having to pore through low-level logs.
No... that's not the point of fuzzing... You cannot write individual functions in such a way that they keep revalidating input handed to them. Because then, invariably, the validations will be different function to function, and once you have an error in your validation logic, you will have to track down all function that do this validation. So, functions have to make assumptions about input, if it doesn't come from an external source.
I.e. this function wasn't the one which did all the job -- it already knew that the input was valid because the function that provided the input already ensured validation happened.
It's pointless to deliberately send invalid input to a function that expects (for a good reason) that the input is valid -- you will create a ton of worthless noise instead of looking for actual problems.
> Furthermore, assuming that no duplicates exist is a rather strong assumption that should always be questioned.
How do you even come up with this? Do you write your code in such a way that any time it pulls a value from a dictionary, you iterate over the dictionary keys to make sure that they are unique?... There are plenty of things that are meant to be unique by design. The function in question wasn't meant to check if the points were unique. For all we know, the function might have been designed to take a map and the data was lost even before this function started processing it...
You really need to try doing what you suggest before suggesting it.
I am not going to comment the first paragraph since you turned my words around.
> How do you even come up with this? Do you write your code in such a way that any time it pulls a value from a dictionary, you iterate over the dictionary keys to make sure that they are unique?
A dictionary in my program is under my control and I can be sure that the key is unique since... well, I know it's a dictionary. I have no such knowledge about data coming from external systems.
> There are plenty of things that are meant to be unique by design. The function in question wasn't meant to check if the points were unique. For all we know, the function might have been designed to take a map and the data was lost even before this function started processing it...
"Meant to be" and "actually are" can be very different things, and it's the responsibility of a programmer to establish the difference, or to at least ask pointed questions. Actually, the programmers did the correct thing by not sweeping this unexpected problem under the rug. The reaction was just a big drastic, and the system did not make it easy for the operators to find out what went wrong.
Edit: as we have seen, input can be valid but still not be processable by our code. That's not great, but it's a fact of life, since specs are often unclear or incomplete. Also, the rules can actually change without us noticing. In these cases, we should make it as easy as possible to figure out what went wrong.
I've only heard from people engineering systems for the aerospace industry, and we're talking hundreds of pages of API documentation. It is very complex, so equally the chances of a human error are higher.
I agree with the general sentiment "if you see an unexpected error, STOP", but I don't really think that applies here.
That is, when processing a sequential queue which is what this job does, it seems to me reading the article that each job in the queue is essentially totally independent. In that case, the code most definitely should isolate "unexpected error in job" from a larger "something unknown happened processing the higher level queue".
I've actually seen this bug in different contexts before, and the lessons should always be: One bad job shouldn't crash the whole system. Error handling boundaries should be such that a bad job should be taken out of the queue and handled separately. If you don't do this (which really just entails being thoughtful when processing jobs about the types of errors that are specific to an individual job), I guarantee you'll have a bad time, just like these maintainers did.
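Roughly this shape, as a sketch (the job type, handler, and logger name are placeholders):

    import logging

    log = logging.getLogger("plan-worker")  # illustrative name

    def run_worker(jobs, handle_job):
        dead_letter = []
        for job in jobs:
            try:
                handle_job(job)
            except Exception:
                # Failure is scoped to this one job: record it, alert, move on.
                log.exception("job failed, diverting to manual queue: %r", job)
                dead_letter.append(job)
        return dead_letter

    # Usage sketch:
    # bad_jobs = run_worker(flight_plan_queue, process_flight_plan)
    # notify_operators(bad_jobs)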
If the code takes a valid series of ICAO waypoints and routes, generates the corresponding ADEXP waypoint list, but then when it uses that to identify the ICAO segment that leaves UK airspace it's capable of producing a segment from before when the route enters UK airspace, then that code is wrong, and who knows what other failure modes it has?
Maybe it can also produce the wrong segment within British airspace, meaning another flight plan might be processed successfully, but with the system believing it terminates somewhere it doesn't?
Maybe it's already been processing all the preceding flight plans wrongly, and this is just the first time when this error has occurred in a way that causes the algorithm to error?
Maybe someone's introduced an error in the code or the underlying waypoint mapping database and every flight plan that is coming into the system is being misinterpreted?
An "unexpected error" is always a logic bug. The cause of the logic error is not known, because it is unexpected. Therefore, the software cannot determine if it is an isolated problem or a systemic problem. For a systemic problem, shutting down the system and engaging the backup is the correct solution.
I'm pretty inexperienced, but I'm starting to learn the hard way that it takes more discipline to add more complex error recovery. (Just recently my implementation of what you're suggesting - limiting the blast radius of server side errors - meant all my tests were passing with a logged error I missed when I made a typo)
Considering their level 1 and 2 support techs couldn't access the so-called "low level" logs with the actual error message it's not clear to me they'd be able to keep up with a system with more complicated failure states. For example, they'd need to make sure that every plan rejected by the computer is routed to and handled by a human.
They physically cannot be independent. The system works on the assumption that the flight was accepted and is valid, but it cannot place it. What if it accidentally schedules another flight at the same time and place?
Except that you can't be sure this bad flight plan doesn't contain information that will lead to a collision. The system needs to maintain the integrity of all plans it sees. If it can't process one, and there's the risk of a plane entering airspace with a bad flight plan, you need to stop operations.
>> Except that you can't be sure this bad flight plan doesn't contain information that will lead to a collision.
Flight plans don't contain any information relevant for collision avoidance. They only say when and where the plane is expected to be. There is not enough specificity to ensure no collisions. Things change all the time, from late departures, to diverting around bad weather. On 9/11 they didn't have every plane in the sky file a new flight plan carefully checked against every other...
Aviation is incredibly risk-averse, which is part of why it's one of the safest modes of travel that exists. I can't imagine any aviation administration in a developed country being OK with a "yeah just keep going" approach in this situation.
That's true, but then, why did engineers try to restart the system several times if they had no clue what was happening, and restarting it could have been dangerous?
A customer of mine is adamant in their resolve to log errors, retry a few times, give up and go on with the next item to process.
That would have grounded only the plane with the flight plan that the UK system could not process.
Still a bug, but with fewer effects across the whole continent: planes that could not get into or out of the UK could not fly, and that affected all of Europe and possibly more.
> That would have grounded only the plane with the flight plan that the UK system could not process.
By the looks of it, the flight had been in the air for a few hours by the time the system broke down. Considering the system didn't know what the problem was, it seems appropriate that it shut down. No planes collided, so the worst didn't happen.
Couldn't the outcome be "access to the UK airspace denied" only for that flight? It would have checked with an ATC and possibly landed somewhere before approaching the UK.
In the case of a problem with all flights, the outcome would have been the same they eventually had.
Of course I have no idea if that would be a reasonable failure mode.
This here is the true takeaway. The bar for writing "this should never happen" code must be set so impossibly high that it might as well be translated into "'this should never happen' should never happen"
The problem with that is that most programming languages aren't sufficiently expressive to be able to recognise that, say, only a subset of switch cases are actually valid, the others having been already ruled out. It's sometimes possible to re-architect to avoid many of this kind of issue, but not always.
What you're often led to is "if this happens, there's a bug in the code elsewhere" code. It's really hard to know what to do in that situation, other than terminate whatever unit of work you were trying to complete: the only thing you know for sure is that the software doesn't accurately model reality.
In this story, there obviously was a bug in the code. And the broken algorithm shouldn't have passed review. But even so, the safety critical aspect of the complete system wasn't compromised, and that part worked as specified -- I suspect the system behaviour under error conditions was mandated, and I dread to think what might have happened if the developers (the company, not individuals) were allowed to actually assume errors wouldn't happen and let the system continue unchecked.
To be fair, the article suggests early on that sometimes these plans are being processed for flights already in the air (although at least 4 hours away from the UK).
If you can stop the specific problematic plane taking off then keeping the system running is fine, but once you have a flight in the air it's a different game.
It's not totally unreasonable to say "we have an aircraft en route to enter UK airspace and we don't know when or where - stop planning more flights until we know where that plane is".
If you really can't handle the flight plan, I imagine a reasonable solution would be to somehow force the incoming plane to redirect and land before reaching the UK, until you can work out where it's actually going, but that's definitely something that needs to wait for manual intervention anyway.
For the most part (although there are important exceptions), IFR flights are always in radar contact with a controller. The flight plan is a tool that allows ATC and the plane to agree on a route so that they don't have to be constantly communicating. ATC 'clears' a plane to continue on the route to a given limit, and expects the plane to continue on the plan until that limit unless they give any further instructions.
In this regard UK ATC can choose to do anything they like with a plane when it comes under their control - if they don't consider the flight plan to be valid or safe they can just instruct the plane to hold/divert/land etc.
I'm not sure the NATS system that failed has the ability to reject a given flight plan back upstream.
Mostly yes; however, there are large parts of the Atlantic and Pacific where that (radar contact) isn't true. I know the Atlantic routes are frequently full of planes that left the US and Canada heading to the UK.
I have no idea what percent of the volume into the UK comes from outside radar control; if they asked a flight to divert, that may open multiple other cans of worms.
> If they asked a flight to divert, that may open multiple other cans of worms.
Any ATC system has to be resilient enough to handle a diversion on account of things like bad weather, mechanical failure or a medical emergency. In fact, I would think the diversion of one aircraft would be less of a problem than those caused by bad weather, and certainly less than the problem caused by this failure. Furthermore, I would guess that the mitigation would be just to manually direct the flight according to the accepted flight plan, as it was a completely valid one.
One of the many problems here is that they could not identify the problem-triggering flight plan for hours, and only with the assistance of the vendor's engineers. Another is that the system had immediately foreclosed on that option anyway, by shutting down.
Only theoretically. In practice the only thing that usually matches is from which other ATC unit the plane is coming. But it could be on a different route and will almost always be at a different time due to operational variation.
That doesn't matter, because the previous unit actively hands the plane over. You don't need the flight plan for that.
What does matter is knowing what the plane is planning to do inside your airspace. That's why they're so interested in the UK part of the flight plan. Because if you don't give any other instructions, the plane will follow the filed routing. Making turns on its own, because the departing ATC unit cleared it for that route.
> the previous unit actively hands the plane over. You don't need the flight plan for that.
I thought practically, what's handed over is the CPL (current flight plan), which is essentially the flight plan as filed (FPL) plus any agreed-upon modifications to it?
> Because if you don't give any other instructions, the plane will follow the filed routing. Making turns on its own, because the departing ATC unit cleared it for that route.
Without voice or datalink clearance (i.e. the plane calling the new ATC), would the flight even be allowed to enter a new FIR?
To be fair that is exactly what the article said was a major problem, and which the postmortem also said was a major problem. I agree I think this is the most important issue:
> The FPRSA-R system has bad failure modes
> All systems can malfunction, so the important thing is that they malfunction in a good way and that those responsible are prepared for malfunctions.
> A single flight plan caused a problem, and the entire FPRSA-R system crashed, which means no flight plans are being processed at all. If there is a problem with a single flight plan, it should be moved to a separate slower queue, for manual processing by humans. NATS acknowledges this in their "actions already undertaken or in progress":
>> The addition of specific message filters into the data flow between IFPS and FPRSA-R to filter out any flight plans that fit the conditions that caused the incident.
Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.
Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code. Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.
For these kinds of things the post mortem and remediation have to kinda take as given that eventually a not predictable in advance unhandled unknown error will occur and then work on how it could be handled better. Because of course the solution to a bug is to fix the bug, but the issue and the reason for the meltdown is a DR plan that couldn't be implemented in a reasonable timeframe. I don't care what programming practices, what style, what language, what tooling. Something of a similar caliber will happen again eventually with probability 1 even with the best coders.
I agree with your first paragraph, but your second paragraph is quite defeatist. I was involved in quite a few "premortem" meetings where people think of increasingly improbable failure modes and devise strategies for them. It's a useful meeting before large changes to critical systems are made live. In my opinion, this should totally be a known error.
> Having found an entry and exit point, with the latter being the duplicate and therefore geographically incorrect, the software could not extract a valid UK portion of flight plan between these two points.
It doesn't take much imagination to surmise that perhaps real world data is broken and sometimes you are handed data that doesn't have a valid UK portion of flight plan. Bugs can happen, yes, such as in this case where a valid flight plan was misinterpreted to be invalid, but gracefully dealing with the invalid plan should be a requirement.
> Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code.
I think there's a world of difference between writing bug free code, and writing code such that a bug in one system doesn't propagate to others. Obviously it's unreasonable to foresee every possible issue with a flight plan and handle each, but it's much more reasonable to foresee that there might be some issue with some flight plan at some point, and structure the code such that it doesn't assume an error-free flight plan, and the damage is contained. You can't make systems completely immune to failure, but you can make it so an arbitrarily large number of things have to all go wrong at the same time to get a catastrophic failure.
> Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.
How many KeyError exceptions have brought down your whole server? It doesn't happen because whoever coded your web framework knows better and added a big try-catch around the code which handles individual requests. That way you get a 500 error on the specific request instead of a complete shutdown every time a developer made a mistake.
Crashing is a feature, though. Exceptions didn't make their way into interpreter specifications by accident. It just so happens that web apps don't need airbags that would slow the business down.
That line of reasoning is how you have systemic failures like this (or the Ariane 5 debacle). It only makes sense in the most dire of situations, like shutting down a reactor, not input validation. At most this failure should have grounded just the one affected flight rather than the entire transportation network.
> Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.
What surprised me more is that the amount of data covering all waypoints on the globe is quite small. If I were to implement a feature that queries them by name as an identifier, the first thing I'd do is check for duplicates in the dataset, because if there are any, I need to consider that condition in every place where I'd be querying a waypoint by a potentially duplicated identifier.
I had that thought immediately when looking at flight plan format, noticed the short strings referring to waypoints, way before getting to the section where they point out the name collision issue.
Maybe I'm too used to working with absurd amounts of data (at least in comparison to this dataset); it's a constant part of my job to do some cursory data analysis to understand the parameters of the data I'm working with, what values can be duplicated or malformed, etc.
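For example, a cursory check along these lines (the waypoint names and coordinates are illustrative):

    from collections import Counter

    # Before trusting waypoint names as identifiers, count how often each name
    # appears across the merged national datasets.
    waypoints = [
        ("DVL", 48.11, -98.91),   # one fix
        ("DVL", 48.23, 16.55),    # a second, unrelated fix with the same name
        ("RESIA", 46.74, 10.51),
    ]

    duplicates = {name: n for name, n in Counter(w[0] for w in waypoints).items() if n > 1}
    if duplicates:
        print("identifiers that are NOT unique:", duplicates)
        # => every lookup by name needs disambiguation logic, not a plain dict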
If there are duplicate waypoint IDs, they are not close together. They can be easily eliminated by selecting the one that is one hop away from the prior waypoint. Just traversing the graph of waypoints in order would filter out any unreachable duplicates.
That it's safety critical is all the more reason it should fail gracefully (albeit surfacing errors to warn the user). A single bad flight plan shouldn't jeopardize things by making data on all the other flight plans unavailable.
Well yes because you're describing a system where there are really low stakes and crash recovery is always possible because you can just throw away all your local state.
The flip side would be like a database failing to parse some part of its WAL log due to disk corruption and just said, "eh just delete those sections and move on."
The other “tabs” here are other airplanes in flight, depending on being able to land before they run out of fuel. You don’t just ignore one and move on.
Nonsense comparison, your browser's tabs are de facto insulated from each other, flight paths for 7000 daily planes over the UK literally share the same space.
No, it's more like saying your browser has detected possible internal corruption with, say, its history or cookies database and should stop writing to it immediately. Which probably means it has to stop working.
It definitely isn't. It was just a validation error in one of thousands external data files that the system processes. Something very routine for almost any software dealing with data.
The algorithm as described in the blogpost is probably not implemented as a straightforward piece of procedural code that goes step by step through the input flightplan waypoints as described. It may be implemented in a way that incorporates some abstractions that obscured the fact that this was an input error.
If from the code’s point of view it looked instead like a sanity failure in the underlying navigation waypoint database, aborting processing of flight plans makes a lot more sense.
Imagine the code is asking some repository of waypoints and routes ‘find me the waypoint where this route leaves UK airspace’; then it asks to find the route segment that incorporates that waypoint; then it asserts that that segment passes through UK airspace… if that assertion fails, that doesn’t look immediately like a problem with the flight plan but rather with the invariant assumptions built into the route data.
And of course in a sense it is potentially a fatal bug because this issue demonstrates that the assumptions the algorithm is making about the data are wrong and it is potentially capable of returning incorrect answers.
I've had brief glimpses at these systems, and honestly I wouldn't be surprised if it took more than a year for a simple feature like this to be implemented. These systems look like decades of legacy code duct-taped together.
> why could the system not put the failed flight plan in a queue
Because it doesn't look at the data as a "flight plan" consisting of "waypoints" with "segments" along a "route" that has any internal self-consistency. It's a bag of strings and numbers that's parsed and the result passed along, if parsing is successful. If not, give up. In this case, fail the entire system and take it out of production.
Airline industry code is a pile of badly-written legacy wrappers on top of legacy wrappers. (Mostly not including actual flight software on the aircraft. Mostly.) The FPRSA-R system mentioned here is not a flight plan system, it's an ETL system. It's not coded to model or work with flight plans, it's just parsing data from system A, re-encoding it for system B, and failing hard if it can't.
Good ETLs are usually designed to separate good records from bad records, so even if one or two rows in the stream don't conform to the schema, you can put them aside and process the rest.
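For what it's worth, the basic shape of that pattern is tiny; a minimal sketch (Python, with a hypothetical quarantine sink and dict-like records):

    import logging

    log = logging.getLogger("etl")

    def process_batch(records, transform, load, quarantine):
        # Transform and load each record; divert failures instead of halting the stream.
        for record in records:
            try:
                load(transform(record))
            except Exception:
                # One bad record must not poison the rest: park it for human review.
                log.exception("record %s failed; quarantined", record.get("id"))
                quarantine.append(record)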
The problem is that it means you have a plane entering the airspace at some point in the near future and the system doesn't know it is going to be there. The whole point of this is to make sure no two planes are attempting to occupy the same space at the same time. If you don't know where one of the planes will be you can't plan all of the rest to avoid it.
The thing that blows my mind is that this was apparently the first time this situation had happened after 15 million records processed. I would have expected it to trigger much more often. It makes me wonder if there wasn't someone who was fixing these as they came up in the 4 hour window, and he just happened to be off that day.
Bad records aren't supposed to be ignored. They are supposed to be looked at by a human who can determine what to do.
Failing the way NATS did means that all future flight plan data, including for planes already in the sky, is no longer being processed. The safer failure mode was definitely to flag this plan and surface it to a human while continuing to process other plans.
> It makes me wonder if there wasn't someone who was fixing these as they came up in the 4 hour window, and he just happened to be off that day.
This is very possible. I know of a guy who does (or at least a few years ago did) 24x7 365 on-call for a piece of mission (although not safety) critical aviation software.
Most of his calls were fixing AWBs quickly because otherwise planes would need to take off empty or lose their take-off slot.
Although there had been some “bus factor” planning and mitigation around this guy’s role, it involved engaging vendors etc. and would have likely resulted in a lot of disruption in the short term.
A one-in-15-million chance, with 7,000 daily flights over the UK handled by NATS, meant it could be expected to happen roughly once every 69 months; it took a few months less than that.
I never said it was a good ETL system. Heck, I don't even know if the specs for it even specify what to do with a bad record - there are at least 300 pages detailing the system. Looking around at other stories, I see repeated mentions of how the circumstances leading to this failure are supposedly extremely rare, "one in 15 million" according to one official[1]. But at 100,000 flights/day (estimated), this kind of situation would occur, statistically, about twice a year.
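Back-of-the-envelope check of those numbers, assuming one independent 1-in-15-million draw per flight plan:

    p = 1 / 15_000_000
    for flights_per_day in (7_000, 100_000):        # UK-only vs rough worldwide estimate
        expected_days = 1 / (p * flights_per_day)   # mean time between occurrences
        print(f"{flights_per_day:>7}/day -> every {expected_days:,.0f} days (~{expected_days / 30.44:.0f} months)")
    # 7,000/day   -> every ~2,143 days (~70 months)
    # 100,000/day -> every 150 days, i.e. about twice a year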
The recent episode of The Daily about the (US) aviation industry has convinced me that we’ll see a catastrophic headline soon. Things can’t go on like this.
The fact that they blamed a French flight plan that had already been accepted by Eurocontrol shows that they didn't really know how the software works. And here the Austrian company should take part of the blame for the lack of intensive testing.
"software supplier"???
Why on God's green earth isn't someone familiar with the code on 7/24 pager duty for a system with this level of mission criticality?
That would be... the software supplier. This is quite a specific fault (albeit one that shouldn't have happened if better programming practices had been used), so I don't think anyone but the software's original developers would know what to do. This system is not safety-critical, luckily.
I think there is a bit of ignorance about how software like this is sold. This is not just some Windows or browser application: the sale also included staff training, help procuring hardware to run the software, and maybe more. Such systems get closed off from the outside without a way to send telemetry to the public internet (I've seen this before; it is bizarre and hard to deal with). The contract would have clauses for such situations, where you always have someone on call as the last line of defense if a critical issue happens. Otherwise, the trained teams should have been able to deal with it, but could not.
It is mostly quite primitive, but it also works amazingly well. For example ILS or VOR or ATC audio comms can all be received and read correctly using hardware built from entry level ham radio knowledge. Altimeters still require a manual input of pressure. Fuel levels can be checked with sticks.
Kinda the opposite of a modern web/mobile app: complicated, massively bloated, and breaking rather often :).
It's worse than you know. Ancient computer systems, non-ASCII character encodings, analog phone lines, and ticker-tape weather.
You'll also be surprised to learn there's still parts of the US where there's no radar or radio coverage with ATC, if flying at lower altitudes. (Heck, there's still a part of the Pacific Ocean that doesn't have ATC service at any altitude.)
Aviation drove a lot of the early developments in networked computing, which also means there's some really old tech in the stack. The globally decentralized nature of it all and it being a life-critical system means it's expensive and complicated to upgrade. (And to be clear, it does get upgraded - but it in a backwards compatible way.) Today's ATC systems need to work with planes built in the 1950s, and talk to ATC units in small countries that still use ancient teletype systems and fax machines.
But yet it's all still incredibly safe, because the technology is there to augment human processes - not replace them. Even if all the technology fails, everything can still be done manually using pen and paper.
Essentially this is down to the lack of a proper namespace; who'd have thought aerospace engineers need to study operating systems! I've a friend who's a retired air force pilot and graduated from Cranfield University, the UK's foremost postgraduate institution for aerospace engineering, with its own airport for teaching and research [1]. According to him he did study OSes at Cranfield, and now I finally understand why.
Apparently, based on the other comments, a standard for namespacing is already available but isn't currently being used by NATS/ATC; hopefully they've learnt their lesson and start using it, for goodness' sake. The top comment mentioned the geofencing bug, but if NATS/ATC used a proper namespace, geofencing probably wouldn't be necessary in the first place.
It sounds like a great place to study that has its own ~2km long airstrip! It would be nice if they had a spare Trident or Hercules just lying around for student baggage transport :)
"the description sounds like the procedure is working directly on the textual representation of the flight plan, rather than a data structure parsed from the text file. This would be quite worrying, but it might also just be how it is explained."
Oh, this is typical in airline industry work. Ask programmers about a domain model or parsing and they give you blank stares. They love their validation code, and they love just giving up if something doesn't validate. It's all dumb data pipelines. At no point is there code that models the activities happening in the real world.
In no system is there a "flight plan" type that has any behavior associated with it or anything like a set of waypoint types. Any type found would be a struct of strings in C terms, passed around and parsed not once, but every time the struct member is accessed. As the article notes, "The programming style seems very imperative.".
Giving up when something doesn't validate is indeed standard practice to avoid propagating badly interpreted data and causing far more complex bugs down the line. Validate soon, validate strongly, report errors, and don't try to interpret whatever the hell is wrong with the input; don't try to be 'clever', because that's where the safety holes lie. Crashing on bad input is wrong, but trying to interpret data that doesn't validate, without specs (of course), is fraught with incomprehension and incompatibilities down the line, or unexpected corner cases (or untested ones, but no one wants to pay for a fully tested handle-everything system, or even for the tools to simulate 'wrong inputs' or for formal validation of the parser and all the code using the parser's results).
There are already too many problems with non-compliant or legacy (or just buggy) data emitters, with the complexity in semantics or timing of the interfaces, to try and be clever with badly formatted/encoded data.
It's already difficult (and costly) to make a system work as specified, so adding subtle variations to make it more tolerant of unspecified behaviour is just asking for bugs (or for more expensive systems that don't clear the purchasing price bar).
From a safety-critical standpoint, I've always found this article interesting but strange. You want both, before taking into account any data from anything outside of the system. Do both, as soon as possible. Don't propagate data you haven't validated in every way your spec says to. If you have more stringent specs than any standard you're using, be explicit about it and reject the data with a clear failure report. Check for anything that could be corrupted, misformatted, or otherwise something you're not expecting that could cause unexpected behaviour.
I feel the lack of investment in destroying the parsing- (and validation-) related classes of bugs is the worst oversight in the history of computing. We have the tools to build crash-proof parsers (SPARK, Frama-C, and custom model-checked code generators such as RecordFlux) which, while not perfect in any way, would have let us move on to other problems if they'd received even a tiny bit of the effort the security industry put into mending all the 'Postel's law' junk out there.
I built, with an intern, an in-house bit-precise code generator for deserializers that can be proved free of runtime errors, and am moving on to semantic checks ('field X and field Y can only be present together', or 'field Y must be greater than or equal to the previous time field Y was present'). It's not that hard, compared to many other proof and safety/security endeavours.
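A toy sketch of what declarative inter-field checks like that can look like (Python; the field names and message format are made up, this is not the in-house generator described above):

    def check_presence_pair(msg, x, y):
        # "x and y can only be present together"
        if (x in msg) != (y in msg):
            return f"{x} and {y} must be present together"

    def check_monotonic(msgs, field):
        # "field must be >= the previous time it was present"
        last = None
        for i, msg in enumerate(msgs):
            if field in msg:
                if last is not None and msg[field] < last:
                    return f"{field} decreased at message {i}"
                last = msg[field]

    def validate_stream(msgs):
        errors = [e for m in msgs if (e := check_presence_pair(m, "pos_lat", "pos_lon"))]
        if (e := check_monotonic(msgs, "timestamp")):
            errors.append(e)
        return errors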
> It's not that hard, compared to many other proof and safety/security endeavours.
Yes, but the code has to understand and model the input into a program representation: the AST. That's the essence of the "parse, don't validate" paradigm. Instead of looking at each piece of a blob of data in isolation to determine if it's a valid value, turn the input into a type-rich representation in the problem domain.
In the case of the FPRSA-R system in question, it does none of that. It's simply a gateway to translate data in format A to data in format B, like an ETL system. It's not looking at the input as a flight plan with waypoints, segments and routes.
Why the programmers chose to do the equivalent of bluescreening on one failed input, I can't say. As others have pointed out, the situation it gave up on isn't so rare: 1 in 15 million will happen. Of course switching to an identical backup system is a bad choice, too. In safety-critical work, there needs to be a different backup, much like the Backup Flight System in the space shuttle or the Abort Guidance System on the Apollo Lunar Module: a completely different set of avionics, programmed independently.
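To make the "parse, don't validate" point above concrete, a minimal sketch of the idea (Python; the field names are illustrative, not the ICAO4444/ADEXP schema):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Waypoint:
        ident: str
        lat: float
        lon: float

    @dataclass(frozen=True)
    class FlightPlan:
        callsign: str
        route: tuple[Waypoint, ...]

    def parse_flight_plan(raw: dict, waypoint_lookup) -> FlightPlan:
        # Raise on anything malformed; on success the result needs no re-checking,
        # and downstream code never touches raw strings again.
        route = tuple(waypoint_lookup(ident) for ident in raw["route"])
        if len(route) < 2:
            raise ValueError("a route needs at least an entry and an exit point")
        return FlightPlan(callsign=raw["callsign"], route=route)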
One of the reasons developers 'let it crash' is because no one wants to pay for error recovery, and I mean the whole design (including system level), testing, and long-term maintenance of barely used code.
THAT SAID, isolating the decoding code and data structures, and having a way back to either checkpoint/restore or wipe out bad state (or proving the absence of side effects, as SPARK dataflow contracts allow, for example), is better design, and I wish it were taught more often. I really dislike how often exception propagation is taught without showing how to handle the side effects...
That's super interesting (and a little terrifying). It's funny how different industries have developed different "cultures" for seemingly random reasons.
It was terrifying enough for me in the gig I worked on that dealt with reservations and check-in, where a catastrophic failure would be someone boarding a flight when they shouldn't have. To avoid that sort of failure, the system mostly just gave up and issued the passenger what's called an "Airport Service Document": effectively a record that shows the passenger as having a seat on the flight, but unable to check-in. This allows the passenger to go to the airport and talk to an agent at the check-in desk. At that point, yes, a person gets involved, and a good agent can usually work out the problem and get the passenger on their flight, but of course that takes time.
If you've ever been at the airline desk waiting to check in and an agent spends 10 minutes working with a passenger (or passengers), it's because they got an ASD and the agent has to screw around directly in the user-hostile SABRE interface to fix the reservation.
It's better to say SABRE replicated, in digital form, that card file. And even today the legacy of that card form defines SABRE and all the wrappers and gateways to it.
A day I don't want to remember.
Took me 15 hours to reach my destination instead of 2.
Had to take train, bus, then train again. 30 minutes after I had booked my tickets, everything was fully booked for two days.
I waited in the airport for 6 hours before learning that my flight was cancelled, and had to rebook... I was flying to New York to see my family, so I didn't really have any alternate transportation options!
That's a shame, sorry to hear that. I got more lucky: I had to wait for 6 hours too but my flight suddenly resumed (must have been one of the first few). I didn't have any alternatives to go home either so I feel for all of those stuck in a foreign country.
I wish the article contained some explanation of why the processing for NATS requires looking at both the ADEXP waypoints and the ICAO4444 waypoints (not a criticism per se, it may not have been addressed in the underlying report). Just looking at the ADEXP seems sufficient for the UK segment logic.
I'm guessing it has something to do with how ICAO4444 is technically human readable, and how in some meaningful sense, pilots and ATC staff "prefer" it. e.g., maybe all ICAO4444 waypoints are "significant" to humans (like international airports), whereas ADEXP waypoints are often "insignificant" (local airports, or even locations without any runway at all).
Of course with 20/20 hindsight, it seems obviously incorrect to loop through the ICAO4444 waypoints in their entirety, instead of "resuming" from an advanced position. But why look at them at all?
They use the ADEXP to determine which part of the route is in the UK, because the auto-generated points are ATC area handover points. So this data is the best way to see which part of the route is within UK airspace.
Then it needs to find the ICAO part that corresponds, because the controller needs to use the ICAO plan that the pilot has.
If the controller sees other (auto generated) waypoints that the pilots don't have you get problems during operation. A simple example is that controllers can tell pilots to fly in a straight line to a specific point on their filed route (and do so quite often). The pilot is expected to continue the filed route from that point onwards.
They can also tell a pilot to fly direct to some random other point (this also happens but less often). The pilot is then not expected to pick up a route after that point.
The radio instruction for both is exactly the same, the only difference is whether the point is part of the planned route or not. So the controller needs to see the exact same route as the pilots have, not one with additional waypoints added by the IFPS system.
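If I'm reading that right, the matching step looks something like this (a sketch with made-up data shapes; it glosses over the "clipping" case mentioned elsewhere in the thread):

    def uk_portion(adexp_points, icao_points):
        # Use the expanded ADEXP points to find UK entry/exit, then locate those
        # same fixes in the pilot's filed ICAO route so the controller sees
        # exactly what the pilot filed.
        uk = [p for p in adexp_points if p["in_uk_airspace"]]
        if not uk:
            return None                              # flight never enters UK airspace
        entry, exit_ = uk[0]["ident"], uk[-1]["ident"]

        idents = [p["ident"] for p in icao_points]
        i = idents.index(entry)                      # first occurrence of the entry fix
        j = i + idents[i:].index(exit_)              # search for the exit only at or after the
                                                     # entry fix, never in the part already flown
        return icao_points[i:j + 1]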
Possibly it needs the ICAO information to communicate with some systems, but has to work in ADEXP to have sufficient granularity (the essay mentions the possibility of “clipping”, a flight going through the UK between two ICAO waypoints).
What I don’t understand in situations like this when thousands of flights are cancelled is how do they catch up? It always seems like flights are at max capacity at all times, at least when I fly. If they cancel 1,000 flights in one day, how do they absorb that extra volume and get everyone where they need to be? Surely a lot of people have their plans permanently cancelled?
There's always some empty capacity, whether it's non-rev tickets for flight crew and their families which are lower priority than paying customers or people who miss their flights.
I had a cancelled flight recently and they booked people two weeks out because every flight from that day onward was full or nearly full. I showed up the next morning and was able to board the next flight because exactly one person had scanned in their boarding pass (was present at the airport) but did not show up for whatever reason to the airplane.
Beyond that, people just make alternate plans, whether it's taking a bus or taxi home, traveling elsewhere, picking another airline, anything is possible.
I work in logistics for an FMCG company and sometimes our main producer goes down and we run out of certain types of stock. We send as much out as we can and cancel the rest.
If they really want the stock the customers can rebook an order for tomorrow because they aren't getting it today. And we just start adding extra stock to each delivery.
It's the best of a bad situation.
We don't have the money to have extra trucks and very perishable stock laying about and I know the airlines don't pay 300 grand a month to lease a 737 just to have it sat about doing nothing. There's very little slack.
I had been considering becoming an air traffic controller myself, and it rather tickles me to think I might have missed my once-in-a-lifetime opportunity to direct aircraft with the original pen-and-paper flight strip mechanism in the 21st century! Completely safe, excruciatingly low-capacity, and sounds like awfully good fun as a novelty (for the willing ATC, not the passengers stuck on the ground, I hasten to add).
Quite a few non-major airports still rely heavily on pen-and-paper methods to some degree.
An example is islands that see only a few flights per week and can't justify heavy upgrade investments.
Airplanes are generally spaced hours apart and you need to work out by hand where they are. But again, there are so few planes that the risks are minimal.
Indeed, but the set of aerodromes that are large enough to have a tower controller but not large enough to have their own radar surveillance is shrinking all the time. Radar is getting cheaper and what with ADS-C and TA/RA, a big reason to have ATC even without radar is vanishing (namely that of preventing collisions close to the airport). Oceanic control is probably the closest you can get nowadays to routine ATC without radar, even though they now have automatic position reports via satellite.
There was a time recently when only 3 out of the 300+ air traffic control centers in the U.S. were fully staffed. All the rest were short-handed. Not sure how it stands today
Every system I've ever made has better error reporting than that one. Even those that only I use. First thing I get working in a new project is the system to tell me when something fails and to help me understand and fix the problem quickly. I then use that system throughout development so that it works very well in production. I'd love to talk to the people who made the system discussed in the article. Is one of them reading this? Can you explain how this problem came to report itself so badly?
Yes it seems incredibly lame error reporting that they had to spend hours contacting the original vendor (to "analyse low-level software logs") just to find out which flight plan had crashed the system
Trusted input should rarely be trusted. It's input. You need to validate it as if it is hostile and have a process for dealing with malformed input. Now of course, standing on the sidelines it is easy to criticize and I'm sure whoever worked on this wasn't stupid. But I've seen this error often enough in practice that I think it needs to be drilled into programmers' heads more forcefully: stuff is only valid if you have just validated it. If you send it to someone else, if someone you trust sends it to you, if you store it in a database and then retrieve it, and so on, then it is just input all over again and you should probably validate it for being well-formed. If you don't do that, then you're a bitflip, a migration, or an update away from an error that will cause your system to go into an unstable state, and the real problem is that you might just propagate the error downstream because you didn't identify it.
Input is hard. Judging what constitutes 'input' in the first place can be harder.
That's fine, and is exactly the kind of case that I was thinking of: your software has a different idea of what is valid than an upstream piece of software, so from your perspective it is invalid. So you need to pull this message out of the stream, sideline it so it can be looked at by someone qualified enough to make the call of what's the case (because it could well be either way) and processing for all other messages should continue as normal. After all the only reason you can say with confidence that it in fact was valid is because someone looked at it! You can only do that well after the fact.
A message switch [1] that I worked on had to deal with messages sourced from hundreds of different parties, and while in principle everybody was working from the same spec (CCITT [2]), every day some malformed messages would land in the 'error' queue. Usually the problem was on the side of the sender, but sometimes (fortunately rarely) it wasn't, and then the software would be improved to be able to handle that case correctly as well. Given the size of the specs and the many variations on the protocols it wasn't weird at all to see parties get confused. What's surprising is that it happens as rarely as it does.
The big takeaway here should be that even if something happens very rarely it should still not result in a massive cascade, the system should handle this gracefully.
This really isn't about input. Whether it comes from outside or is produced inside the application, the reality is that everything can have bugs. A correct input can cause a buggy application to fail. So while verifying input is obviously an important step, it is not even a beginning if you are really looking to build reliable software.
What really is the heart of the matter is for the entire thing to be allowed to crash due to a problem with single transaction.
What you really want to do is to have firewalls. For example, you want a separate module that runs individual transactions and a separate shell that orchestrates everything but has no or very limited contact with the individual transactions. As bad as giving up on processing a single aircraft is, allowing the problem to cascade to entire system is way worse.
What's even more tragic about this monumental waste of resources is that the knowledge about how to do all of this is readily available. The aerospace and automotive industry have very high development standards along with people you can hire who know those standards and how to use them to write reliable software.
Yes, there are multiple problems here that interplay in a really bad way and that's one of them. But the input processing/validation step is the first point of contact with that particular flight plan and it should have never progressed beyond that state.
It all hinges on a whole bunch of assumptions and each and every one of those should be dealt with structurally rather than by patching things over.
Just from reading TFA I see a very long list of things that would need attention. Quick recap:
- validate all input
- ensure the system can never stall on any one record
- the system will occasionally come across malformed input which needs a process
- it won't be immediately clear whether the system or the input is at fault, which needs a process
- testing will need to take these scenarios into account
- negative tests will need to be created (such as: purposefully malformed input)
- attempts should be made to force the system into undefined states using malformed and well formed input
- a supervisor mechanism needs to be built into the system that checks overall system health (a rough sketch follows below)
And probably many more besides. But this is what I gather from the article is what they'll need at a minimum. Typically once you start digging into what it would take to implement any of these you'll run into new things that also need fixing.
As for the last bit of your comment: I'm quite sure that those standards were in play for this particular piece of software, the question is whether or not they were properly applied and even then there are no guarantees against mistakes, they can and do happen. All that those standards manage to do is to reduce their frequency by catching the bulk of them. But some do slip through, and always will. Perfect software never is.
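A sketch of the "never stall on one record" and "supervisor mechanism" items from the list above (Python; the thresholds and names are invented for illustration, not taken from the real system):

    import time

    def supervised_loop(source, handle, quarantine, alert,
                        stale_after=60.0, max_consecutive_errors=10):
        # Process records one at a time, quarantine failures, and track overall
        # health so a stuck or degrading system gets escalated to a human.
        last_progress = time.monotonic()
        consecutive_errors = 0
        for record in source:                         # source yields flight-plan messages
            try:
                handle(record)
                consecutive_errors = 0
                last_progress = time.monotonic()
            except Exception as exc:
                consecutive_errors += 1
                quarantine.append((record, exc))      # keep the offending input visible
            if consecutive_errors >= max_consecutive_errors:
                alert("many consecutive failures -- likely a systemic fault, not one bad input")
            if time.monotonic() - last_progress > stale_after:
                alert("nothing processed successfully for a while -- check system health")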
> Tonight we were wondering why nobody had identified the flight which caused the UK air traffic control crash so we worked it out. It was FBU (French Bee) 731 from LAX/KLAX to ORY/LFPO.
> It passed two waypoints called DVL on its expanded flight plan: Devil's Lake, Wisconsin, US, and Deauville, Normandy, FR (an intermediate on airway UN859).
> The software and system are not properly tested.
Followed by a suggestion to do fuzz testing.
* Automatically generating valid flight paths is somewhat hard (and you'd have to know which ones are valid, because the system, apparently, is designed to also reject some paths). It's also possible that such a generator would generate valid but improbable flight paths. There's probably an astronomical number of possible flight paths, which makes exhaustive testing impossible, so there's no guarantee that a "weird" path would have been found. The points through which the paths go are also somewhat dynamic: new airports aren't added every day, but over the life-span of such a system a few probably will be, and more realistically some points on flight paths may be removed. Does the fuzzing have to account for the possibility of new / removed points?
* This particular functionality is probably buried deep inside other code with no direct or easy way to extricate it from its surroundings, and so would be very difficult to feed into a fuzzer. Which leads to the question of how much fuzzing should be done and at what level. Add to this that some testing methodologies insist on divorcing testing from development so as not to create an incentive for testers to automatically okay the output of development (as they would be sort of okaying their own work). This is not very common in places like the Web, but is common in e.g. medical equipment (it's actually in the guidelines). So, if the developer simply didn't understand what the specification told them to do, then it's possible that external testing wasn't capable of reaching the problematic code path, or was severely limited in its ability to hit it.
* In my experience with formats and standards like these, it's often the case that the standard captures a lot of impossible or unrealistic cases, hopefully a superset of what's actually needed in practice. Flagging every way in which a program doesn't match the specification becomes useless or even counter-productive, because developers get overloaded with bug reports most of which aren't really relevant. It's hard to identify the cases that are rare but plausible. The fact that the testers didn't find this defect in time is really just a function of how much time they have. And, really, the time we have to test any program covers a tiny fraction of what's required to test it exhaustively. So, you need to rely on heuristics and gut feeling.
None of this really argues against fuzz testing; even with completely bogus/malformed flight plans, it shouldn't be possible for a dead letter to take down the entire system. And, since it's translating between an upstream and downstream format (and all the validation is done when ingesting the upstream), you probably want to be sure anything that is valid upstream is also valid downstream.
It's true that fuzz testing is easiest when you can do it more at the unit level (fuzz this function implementing a core algorithm, say) but doing whole-system fuzz tests is perfectly fine too.
This is not against the principle of fuzz testing. This is to say that the author doesn't really know the reality of testing and is very quick to point fingers. It's easy to tell in retrospect that this particular aspect should've been tested. It's basically impossible to find such defects proactively.
Easy for me to say in retrospect, but IMO this is a textbook example of where you should reach for fuzz testing; it’s basically protocol parsing, you have a well-known text format upstream and you need to ensure your system can parse all well-formed protocol messages and at very least not crash if a given message is invalid in your own system.
Similarly with a message queue, handling dead letters is textbook stuff, and you must have system tests to verify that poison pills do not break your queue.
I did not think the author was setting unreasonable expectations for the a priori testing regime. These are common best practices.
This all sounds like exactly the stuff that fuzzing or property-based testing is good for
And if the functionality is "buried deep inside other code with no direct or easy way to extricate it from its surrounding" making it hard to test then that's just a further symptom of badly designed software in this case
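A property-based test in that spirit can be quite small; a sketch using Hypothesis (translate_plan and RejectedPlan are stand-ins for the real translation step, not anything from the actual system):

    from hypothesis import given, strategies as st

    class RejectedPlan(Exception):
        pass                                  # clean rejection of a plan we can't handle

    def translate_plan(route):
        # Stand-in for the real ICAO -> downstream translation.
        if not route:
            raise RejectedPlan("empty route")
        return [wp.upper() for wp in route]

    waypoints = st.text(alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", min_size=1, max_size=5)

    @given(st.lists(waypoints, max_size=50))
    def test_translation_never_crashes(route):
        # Property: any input either translates or is rejected cleanly;
        # no other exception may escape (let alone halt the whole system).
        try:
            translate_plan(route)
        except RejectedPlan:
            pass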
What I still don't understand is how flight plans get approved?
In my mind they would only be approved once all involved countries have reviewed and processed the plan. That way we wouldn't need this ridiculous idea of failing safe across the whole of UK airspace for a single error.
That day a single flight plan could have been rejected, perhaps just resubmitted, and the bug quietly fixed in the background.
French here, as much as I wish It was the case for comical effect… I don’t think so.
Our right wing press is also desperately economically liberal so anything privately run is inherently better.
Maybe radio stations? Honestly, major respect to the daily mail for those snarky attacks that keep up the good spirits between our two countries.
It's maybe the food or the weather that makes them aggro? Idk, but don't worry, we love to hate the perfide Albion. Too.
Fellow French here: am I wrong? Maybe "valeur actuelle" could pull that type of bullshit, but I think they are too busy blaming Islam to start thinking about our former colony across the channel.
> Safety critical software systems are designed to always fail safely. This means that in the event they cannot proceed in a demonstrably safe manner, they will move into a state that requires manual intervention.
unrelated - this instantly caused me to think about tesla autopilot crashes that have been reported with emergency vehicles
Has the culprit flight-plan been disclosed? I'd be interested to know how easy it is to create a realistic looking flight-plan through UK airspace that reproduces the problem. I.e. how much truth is there when NATS say this was a 1 in 15m probability?
This is an interesting engineering problem and I'm not sure what the best approach is. Fail safe and stop the world, or keep running and risk danger? I imagine critical systems like trading/aerospace have this worked out to some degree.
There isn't and cannot be a preference for either one. It always depends on what the system is doing and what the consequences would be... A pacemaker cannot "fail safe", for example, under any circumstances. It's meaningless to consider such cases. But if escalation to a human operator is possible, then it will also depend on how the system is meant to be used. In some cases it's absolutely necessary that the system doesn't try to handle errors (e.g. if a patient is in a CT machine, you always want to stop, to at least prevent more radiation), but in a situation like the one with flight control, my guess is that you want the system to keep trying while alerting the human operator.
But then it can also depend on what's in the contract and who will get the blame for the system functioning incorrectly. My guess here is that failing without attempting to recover was, while overkill, a safer strategy than letting e.g. two airplanes be scheduled for the same path (and potentially collide).
The best approach is to simply print the error to the screen, rather than burying it in a “low level log” which only the software vendor has access to.
They had a four hour buffer until the world stopped, but most of that was pissed away because no one knew what the problem was.
Bugs happen. Fact of being written by fleshy meatballs.
What should also have been highlighted is that they clearly had no easy way of finding the specific buggy input in the logs nor simulating it without contacting the manufacturer.
It sounds like a simple functional smoke test throwing random flight plans at the system would have eventually (and probably pretty soon) triggered this. I hope they at least do it now.
I once worked with 4G BTS (Base Transceiver Stations), where one of the issues was preventing errors on the running board from propagating to the backup systems. There was no clean way to do it, given that the malformed input will eventually reach the backup system and produce the same error. The post talks about the system delaying processing to protect the backup. Perhaps a solution would be to go in the other direction and have a staging step to avoid compromising the pipeline. Very interesting article.
Well, I certainly hope they've at least stopped issuing waypoints with identical names... although it wouldn't surprise me if geographically-distant is the best we can do as a species.
They appear to be sequences of 5 upper-case letters. Assuming the 26-character alphabet, that should allow for nearly 12 million unique waypoint IDs. The world is a big place but that seems like it should be enough. The more likely problem is that there is (or was) no internationally-recognized authority in charge of handing out waypoint IDs, so we have at least legacy duplicates if not potential new ones.
You have to reduce that to the (still massive) set of IDs that are somewhat pronounceable in languages that use the Latin script. You don't want to be the air traffic controller trying to work out how to say 'Lufthansa 451, fly direct QXKCD'. Nonetheless, I think there is little cause for concern about changing existing IDs. There might be sentimental attachment, but it takes barely a few flights before the new IDs start sticking, and it's not like pilots never fly new routes.
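Quick check of the numbers, plus a crude stand-in for the "pronounceable" constraint (consonant/vowel alternation only, purely illustrative):

    print(26 ** 5)                           # 11,881,376 possible five-letter IDs
    V, C = 5, 21                             # vowels, consonants
    print(C * V * C * V * C)                 # CVCVC patterns: 231,525
    print(V * C * V * C * V)                 # VCVCV patterns:  55,125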
No, waypoints aren't spelled out with the ICAO alphabet. They are mnemonics that are pronounced as a word and only spelled out if the person on the receiving end requests it because of bad radio reception, or unfamiliarity with the area/waypoint.
For example, Hungarian waypoints, at least the more important ones, are normally named after cities, towns or other geographical locations near them, and use the location's name or an abbreviated version of it, taking care that they can be pronounced reasonably easily by English speakers. Like: ERGOM (for the city Esztergom), ABONY (for the town Füzesabony), SOPRO (for Sopron), etc.
It is, but fixes are almost always spoken as words rather than letter-by-letter. For this reason, they are usually chosen to be somewhat pronounceable, and occasionally you even get jokes in the names. Likewise, radio beacons and airports are usually referred to by the name of their location; for instance "proceed direct Dover" rather than "proceed direct Delta Victor Romeo".
I think a lot of pilots and air traffic controllers would be irritated if they had to spend longer reading out clearances and instructions. In a world where vocal communication is still the primary method of air traffic control, there might be a measurable reduction in capacity in some busier regions.
Disney has a whole lot of special fixes in Orlando and Anaheim. The PIGLT arrival passes through HKUNA, MTATA, JAZMN, JAFAR, RFIKI, TTIGR. I'm fairly sure I've heard about some variants on MICKY, MINEE, GOOFY, PLUTO, etc.
My question is: why was the algorithm searching any section before the UK entry point. You can’t exit at a waypoint before you enter so there is no reason to search that space.
> The manufacturer was able to offer further expertise including analysis of lower-level software logs which led to identification of the likely flight plan that had caused the software exception.
This part stood out to me. I've found it super helpful to include a reference to which piece of data I'm working with in log messages and exceptions. It helps isolate problems so much faster.
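The cheap version of that habit is just to attach the identifier of the thing you were working on before re-raising, so "which flight plan did it?" takes seconds instead of hours of vendor log analysis. A sketch (hypothetical names):

    def process(plan, translate):
        try:
            translate(plan)                  # whatever the real processing step is
        except Exception as exc:
            # The wrapped error now names the offending input.
            raise RuntimeError(f"failed while processing flight plan {plan.get('id')}") from exc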
Did the creator of the flight plan software engage in adversarial testing to see if they could break the system with badly formed flight plans? Or was / is the typical practice to mostly just see if the system meets just the "well-behaved" flight plan processing requirements? (with unit tests, etc)
It must suck to be responsible for a system that everyone depends on and millions of dollars are riding on so you are very reluctant to change it, even if you know it needs technical improvements.
Formal verification or fuzzing could have helped them over that mistrust, but are not panaceas
I imagine, for this kind of system, there is only one supplier. Why not force that supplier, as part of their 10-15 yr contract, to publish the source code for everything, not necessarily as FOSS. This way if there are bugs they can be reported and fixed.
Poison Pill! Why on earth would the best failure mode be to cease operating? Just don’t accept the new plan being ingested and tell the person uploading that their plan was rejected. Impact one flight not thousands!
I wondered this- I have absolutely no understanding of what's involved in flight system development, but does anyone know why it doesn't do this?
By contrast, it's normal for an API to return 500 if something goes wrong and keep serving other requests. It would seem insane if it crashed out and completely stopped. Any idea why the parallel isn't true for a flight system?
Interesting to see that flight plans over the UK have to be filed 4 hours in advance.
No mention of plane, pilot, passenger or cargo manifests. So why the 4 hour lead time? Is this the time it takes UK authorities to look people up or work out whether the cargo could be dangerous, in an airborne Anthrax (Gruinard) Island [1] or Japanese subway sarin [2] sense, or an IRA favourite, a fertilizer bomb that's bypassed the usual purchase reporting regulations, used by people like Jeremy Clarkson and Harry Metcalfe as their store of wealth [3]?
It makes me wonder just how much more surveillance of the population exists, knowing I can't even step out of the front door without attracting surveillance of the type that followed Dr David Kelly.
Sure, it's not a cyber attack per se, carried out over the internet like a DDoS attack or a brute-force password-guessing attack with port-knocking mitigation, but how would one carry out a cyber attack on this system if the only attack vector is people submitting flight plans?
There sure is a constant playing down of the cyber attack angle to this which makes me think someone wants to Blurred Lines!
One point on the lack of uniquely named global waypoints, which is the main crux of why the system fell over, if some are to be believed.
The USA demonstrates a disproportionate number of similar names, by virtue of Europeans migrating to the US [4]. So has this situation arisen with this system in other parts of the world, like the US? How can a country that created the globe-spanning British Empire become so insular with regard to air travel in this way?
I'd agree with the initial assessment that there appears to be a lack of testing, but are the specifications simply not fit for purpose? I'm sure various pilots could speak out here, because some of the regulations require planes to be minimally distanced from each other when transiting across the UK.
On the point of ICAO and other bodies eradicating non-unique waypoint names, it's clear there is some legacy constraint still impeding the safety of air travellers, perhaps caused by poor-quality analogue radio audio, so perhaps it's time for the unambiguous and globally recognised What 3 Words form of location identifier to come into effect?
The UK police already prefer it to speed up response times [4]. And although the same location can create 3 different words, suggesting drift with GPS [5], even if What 3 Words could not be used for a global system, having something a bit longer to create an easily recognisable human globally unique identifier is needed for these flight plans and perhaps maritime situations.
Obviously global coordination will be like herding cats, and if such a fixed-size global network of cells were introduced, some areas, like transits over the Atlantic or Pacific, could command bigger cells, while transiting over built-up areas like London would require smaller identifiable cells. But IF ever there was a time for the New World Order to step up to the plate and assert itself, to create a Globally Unique Place ID (GUPID) for the whole planet, now is the time.
On the point that humans were kept safe only by the sheer common sense of the pilots and control tower staff: it's not something NATS did or should claim. Their systems were down, so everyone had to fall back to pen and paper and blocks in queues, and apart from Silverstone when the F1 British Grand Prix is on, is airspace ever that densely populated?
NATS were caught with their pants down at so many levels of altitude. Is this the laissez-faire UK management style that saw the Govt having to step in to bail out the banks during the financial crisis still infecting other parts of UK life and still coming to light?
To answer your question without conspiracy drivel, let's look up CAP 694: The UK Flight Planning Guide [0]
Chapter 1
> 6.1 The general ICAO requirement is that FPLs should be filed on the ground at least 60 minutes before clearance to start-up or taxi is requested. The "Estimated Off Block Time" (EOBT) is used as the planned departure time in flight planning, not the planned airborne time.
> 6.3 IFR flights on the North Atlantic and on routes subject to Air Traffic Flow Management, should be filed a minimum of 3 hours before EOBT (see Chapter 4).
Chapter 4
> 1.1 The UK is a participating State in the Integrated Initial Flight Plan Processing System (IFPS), which is an integral part of the Eurocontrol centralised Air Traffic Flow Management (ATFM) system.
> 4.1 FPLs should be filed a minimum of 3 hours before Estimated Off Block Time (EOBT) for North Atlantic flights and those subject to ATFM measures, and a minimum of 60 minutes before EOBT for all other flights.
So the answer is because the UK is part of a Europe-wide air traffic control system, which hands out full flight plans to all the relevant authorities for each airspace, and they decided 3 hours is needed so that all possible participants can get their shit together and tell you if they accept the plan or not.
An entirely separate system exists to share Advanced Passenger Information, i.e. passenger manifests [1], and it goes even further that airlines share your overall identity with each other, known as a Passenger Name Record [2], and a variety of countries, led by the USA, insist on this information in advance before the plane is allowed to take off [3]
If you're going to be paranoid, please work with known facts instead of speculating.
So on point 3 then, why do countries turn people away at the destination and not before take-off, if their visa or passport is not in order?
Are their systems not joined up, or does the state just like making examples of people once they're in the destination country? I can watch this stuff happening to people at airport border controls on TV all the time, so which is it?
Most countries do not have prescreening or data sharing treaties. Even in the case where two countries do have it, not all entrance criteria can be determined by electronic records. Countries reserve the right to check the traveller themselves before they permit entry.
Airline employees routinely turn people away at the departure airport due to visa/passport paperwork not being in order. Timatic is the usual system that most airlines subscribe to for this kind of thing. Airlines are highly incented to avoid letting a passenger board who won't be admitted, because they're on the hook for returning that passenger. But an airline employee at the departure airport is never going to be a perfect proxy for an immigration officer in every country they fly to, and immigration officers generally have wide latitude in who they accept/reject. It is extremely possible that all your paperwork is in order but the immigration control officer rejects you for other reasons.
And is being on the hook for returning passengers who are not allowed into the destination country a legal obligation?
I got caught in the US when Hurricane Katrina made landfall, and while we were flown out by our carrier, other carriers in that situation would also honour our ticket and fly us back to the UK.
It seemed like an exodus where all the carriers just got people out of the country as quickly as possible. We were on the last flight out of the airport, but this isn't a legal thing, is it?
Software has bugs, that's not really the damning part... The damning part is that in four hours and two levels of support teams, there was no one who actually knew anything about how the system worked who could remove the problematic flight plan so that the rest of the system could continue operating!
What exactly is the point of these support teams when they can't fix the most basic failure mode (a single bad input...)
Unfortunately, I work on a reasonably modern ERP system which has been customized significantly for the client and also works with a wider range of client-specific data combinations that the vendor has seemingly not anticipated / other clients do not have.
What it means is that on a regular basis, teams will be woken up at 2am because a batch process aborted on bad data; AND it doesn't tell you what data / where in the process it aborted.
The only possibility is to rerun the process with crippling traces, and then manually review the logs to find the issue, remove it, and then re-run the program again (hopefully remembering to remove the trace:).
Even when all goes per plan, this can at times take more than 4 hrs.
Now, we are not running a mission-critical real-time system like air traffic, and I'm in NO way saying any of this is good; but it may not be the case that "two levels of support teams didn't know anything" - the system could just be so poorly designed that even with the best operational experience and knowledge, it still took that long :-< .
On HN, we take a certain level of modernity, logging, failure states, messaging, and restartability for granted, which may not be even remotely present on more niche or legacy systems (again, NOT saying that's good; just indicating the issue may be less about operational competence than about design). It's easy to judge from our external perspective, but we have no idea what was presented / available to the support teams, or what their mandatory process is.
They bought software from a third party and treat it as a "black box". There are a few known ways that the software fails, and the local team has instructions on how to fix them. But if it fails in an unexpected way, good luck: it's impossible for the local team to identify and fix the problem without the vendor.
The reason it took so long was that they realized too late that they needed to call the vendor.
Probably you have to blame managers rather than engineers in the support team.
Considering this same failure has happened a few times in recent memory, maybe it's over-optimistic of me to expect an entry on the support wiki or something.
One important software engineering skill that is often overlooked is the art of writing just the right amount of logging, such that you have sufficient information to debug easily when things go wrong, but not so much that it gets ignored or pruned in production.
And when did you last test your monthly backups? But seriously. If you fill out all the positions in an org chart it's easy to think you're delivering, and for a lot of situations it usually works. Anointing someone a manager usually works out because people can muddle through. It doesn't work in medicine, or as it turns out, air traffic control.
Having worked in tech support: level 3 (Devs) should have described their source code structure to level 2, and let them access it when they needed it.
You don’t need a complete diagnosis if you can spit out enough debug info that says, “oops shat the bed while working with this flight plan”, then the support people can remove the one that’s causing you to fail, restart the system, and tell ATC to route that one manually.
Try to get developers who love to code and create to stay on a support team and be on an on-call roster. I betcha at least half will say no, and the other half will either leave or you'll run out of money paying them.
> The consequence of all this was not that any human lives were put in danger, ..
When you're arguing that cancelling 2000 flights cost £100M and that no human danger was incurred, something should feel off. That might be around 600k humans who weren't able to be where they felt they needed to be. Did they have somewhere safe to sleep? Did they have all the medications they needed with them? Did they have to miss a scheduled surgery? Could we try to measure the effect on their well-being in aggregate, using a metric other than the binary state of alive or facing imminent death? You get the idea.
Of course I agree with the version of the claim that says that no direct danger was caused from the point of view of the failing-safe system. But when you're designing a system, it ought to be part of your role to wonder where risk is going as you more stringently displace it from the singular system and source of risk that you maintain.
I mean it could have also saved lives by that logic. Did someone missing their flight mean they also missed a terrible pileup on the roadways after landing? We can imagine pretty much any scenario here.
I agree with you that we don't know! But my thesis is that we should still do our best, when considering how much risk the systems we maintain should be willing to keep operating through.
Well, that's the Daily Mail for you, where they tag anything parenting- or health-related as the "femail" section... cause you know, only women look at that stuff.
Lol.
Anyways, I actually think that's just a reasonable response: a system goes down (or a related system goes down), and in the review they end up making frivolous updates to names that aren't needed.
I would question those updates (though they may be a minor part of the overall updates occurring).
At least until the 70s most newspapers had a section called "Women" or something similar. Even the news about the 60s/70s women's movement appeared there, not in the main "news" sections. Those sections were mostly renamed around that time to "Lifestyle", "Home", or just "Features".
Is this the UK or US edition? It's always easy fun to have a go at the Daily Mail which presumably you read regularly else you wouldn't be commenting. Its sin seems to be that it's not a serious broadsheet. It's a tabloid with very broad appeal that has to be profitable and therefore tries to reflect the requirements of the British public for such a publication. Perhaps you should lower your expectations.
'Tag anything parenting or healthy ...'? No, that's not correct. Here are a few health & food related items back to mid-September that did not appear in 'femail'. You are right about parenting; most parenting in the UK is still undertaken primarily (in terms of executive action) by females, so items on this topic are reasonably included in 'femail'. The growing number of people who don't have children probably appreciate this sub-grouping by the Mail. You may not approve but this is what happens. Single males with dependent children are not known for objecting to checking out that section. It's not forbidden.
The Daily Mail is actually a site I frequent multiple times a day, every day.
Not all the content is for everyone, but they've got something; they're definitely tabloid style.
They push particular views to the public but cover all sorts of content, and a lot of it I would consider advertisements/plugs rather than actual articles.
I would guess a heavily elderly/conservative majority reader base.
They pander to the lowest common denominator, which is fine -- they're a for-profit news/tabloid outlet, and I find some of it entertaining (hence the daily visits).
Do you work for them, or are you just a big fan, to do all that digging in defense of the DM over my exaggeration that ALL content like that is in that category? I didn't take my own comment all that seriously, so honest ask.
Except that was a completely different incident and it occurred in the United States, not the UK. The Daily Mail did try to make hay out of the idpol angle, but the British can't reasonably be accused of shirking responsibility for the FAA grounding flights in the US.
Yes, typically it would be used to mean things like the code mutates data in place rather than using persistent data structures, explicitly loops over data rather than using higher-order map, fold etc. operations, and explicitly checks tag bits rather than using sum types.
Fine, I'll give you that (sounds like a generic description) but there's nothing like that from the description given in the article and the paragraph immediately before that statement. It's almost as if the author completely made that up.
What ticked me off is that when the primary system threw in the towel, an EXACT SAME system took over and ran the exact same code on the exact same data as the primary. I know that with code and algorithms it's not always the case, but even then, you know what doing the same thing over and over while expecting different results defines...
Yes, it can be argued that the software should've had more graceful failure modes and this shouldn't have thrown a critical exception. It can be argued that the programmers should've seen this possibility. We can argue a lot of things about this.
But the reality is that this is a mission-critical system. And for such systems, there're ways to mitigate all of these mistakes and allow the system to continue functioning.
The easiest (but least safe) one would be to have the secondary system loaded with code that does the same thing but written by a different team/vendor. It reduces the chance from 100% to much, much less that if any input provokes an unforeseen, system-breaking bug in the primary, the same input will provoke the same bug in the secondary.
An even better solution is to have a triumvirate system, where all 3 have code written by different teams, and they always compare results. If 3 agree, great, if 2 agree, not so great but safe to assume that the bug is in the 1 not the 2 (but should throw an alert for the supervisors that the whole system is in a degraded mode where any further node failure is a showstopper), and if all disagree, grind everything to a halt because the world is ending, and let the humans handle it.
It can be refined even further. And it's not something new. So why wasn't this system implemented in such a way? (Aside from cost. I don't care about anyones cost-cutting incentives in mission-critical systems. Sorry capitalism...)
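A toy sketch of that 2-out-of-3 voting step, in Python purely for illustration (nothing like this is described in the actual system):

    def vote(result_a, result_b, result_c):
        """Return (result, status): 'ok', 'degraded', or 'halt'."""
        results = [result_a, result_b, result_c]
        for candidate in set(results):
            agreeing = results.count(candidate)
            if agreeing == 3:
                return candidate, "ok"
            if agreeing == 2:
                # One node disagrees: keep going on the majority answer, but
                # alert supervisors that any further node failure is a showstopper.
                return candidate, "degraded"
        # All three disagree: grind to a halt and hand over to the humans.
        return None, "halt"

    # e.g. vote("ACCEPT", "ACCEPT", "REJECT") -> ("ACCEPT", "degraded")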
> an EXACT SAME system took over and ran the exact same code
Did you ever work with HA systems? Because this is how they work. It's two copies of the same system, intended for cases when, e.g., hardware fails or network partitioning happens.
No, I have not. But HA systems work like that because hardware or network failure is what they are designed to guard against, not a latent bug in the software logic. If there's a software bug, both systems will exhibit the same behavior, so HA fails there.
In practice, you have two kinds of HA systems (based on this criterion):
* Live + standby. Typically, the state of the live system is passively replicated to the standby, and the standby is meant to take over if it doesn't hear from the live one / the live one sends nonsense. (For example, you can use the Kubernetes API server in this capacity.)
* Consensus systems where each actor plays the same role, while there's an "elected" master which deals with synchronization of the system state. (For example, you can use Etcd).
In either case, it's the same program, but with a somewhat different state.
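For the first kind, a bare-bones sketch of how the standby side typically decides to take over (hypothetical names and timeout; this is the general pattern, not the actual FPRSA-R failover mechanism):

    import time

    HEARTBEAT_TIMEOUT_S = 5.0   # illustrative value

    class Standby:
        def __init__(self):
            self.last_heartbeat = time.monotonic()
            self.replicated_state = {}   # passively replicated from the live node
            self.active = False

        def on_heartbeat(self, state_delta):
            # The live node periodically ships its state changes along with a heartbeat.
            self.last_heartbeat = time.monotonic()
            self.replicated_state.update(state_delta)

        def tick(self):
            # Same program, somewhat different state: promotion is just a flag flip,
            # after which this copy starts processing from the replicated state.
            if not self.active and time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S:
                self.active = True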
It doesn't make sense to use different programs to deal with this problem, because you will have double the number of bugs for no practical gain. It's a lot more likely that two different programs will fail to communicate with each other than one program communicating with its own replica. Also, if you believe you were right the first time, why would you make the other one different? You would want to pick the better of the two and run copies of that, rather than have a better one and a worse one work together...
How can you tell whether the problem is due to a software bug or due to a hardware fault though? The software could have thrown the "catastrophic failure, stop the world" exception due to memory corruption.
I'm wondering if the backup system could have a delayed queue; say, 30 seconds behind. If the primary fails, and exactly 30 seconds later the secondary system fails, you have reasonable assurance that it was the queued input that caused the failure. Roll back to the last successful queue input, skip and flag the suspect input, and see if the next input is successful.
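Roughly what I have in mind, as a Python sketch (entirely hypothetical, and glossing over how state gets replicated):

    import time
    from collections import deque

    DELAY_S = 30   # how far the standby trails the primary

    class DelayedStandby:
        def __init__(self, process_fn):
            self.process_fn = process_fn   # same processing logic as the primary
            self.queue = deque()           # (received_at, message) pairs
            self.quarantined = []          # suspect inputs flagged for humans

        def enqueue(self, message):
            self.queue.append((time.monotonic(), message))

        def tick(self):
            # Only apply inputs that are at least DELAY_S old. If the primary just
            # died, the message that killed it is still sitting here, unapplied.
            while self.queue and time.monotonic() - self.queue[0][0] >= DELAY_S:
                _, message = self.queue.popleft()
                try:
                    self.process_fn(message)
                except Exception:
                    # Don't follow the primary off the cliff: flag the suspect
                    # input and carry on with the next one.
                    self.quarantined.append(message)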
This looks to me like it could work, but would need a ready force of technicians always expecting something like that so they can troubleshoot it in a timely manner.
I don't know, but in recent years I'm increasingly seeing mission-critical systems with only token or "apparent" redundancies instead of real ones, and I couldn't find any rationale other than cost savings and shareholder bottom lines. I'm not saying that capitalism = bad, it's mostly better than the alternatives, but just like its most direct competitor, it suffers from bad implementations across the world and unbounded human greed.
A recent and very "in your face" example, also from the air travel industry, would be the B737 MAX and its AoA sensors. There were two, for two flight computers, but MCAS only used one flight computer and one AoA sensor, despite the already existing crosslinks between the flight computers and the sensors...
Profit maxing first with the "no need for a new type rating for the pilots", then cost-cutting in aeronautical engineering (solving an airframe design problem with software, plus designing a flight envelope protection system that can overpower the human pilots).
Then cost-cutting in software engineering and QC, rushing out software made by engineers who were (probably) inexperienced in the field, and failing to properly test it and ensure that it had the needed redundancy.
You are correct, but it's an opinion that bridges the gap editorially between those knowledgeable about ATC but not data, and those knowledgeable about data but not ATC. This is a valuable service to provide, as both fields are rather complex.
Thanks. I didn't have the patience to read it all. I initially hoped that the author was a field expert or even someone with inside knowledge, but he is apparently from a completely different domain and not in the UK, and there were assumptions about things the report was rather specific about (as specific as such reports usually are). It would be more useful if people would take a closer look at the report and draw the right conclusions about organizational failures and how to avoid them. All the great software technologies to achieve memory safety, etc. are of little use if the analyses and specifications are flawed or the assumptions of the various parties in a system of systems do not match. But people seem to prefer to speculate and argue about secondary issues.
Since it's not Reddit but HN, it's all the stranger to dismiss a perfectly legitimate question. But times and mores seem to change much faster than I realize.
It's because your question was poorly phrased - it sounds like you are trying to dismiss the value of the submission for no apparent reason. If you genuinely want to know the answer to a question, don't start with your conclusion and append "isn't it?" to turn it into a question. Just say something like "I don't have time to read the article. Does the author provide any industry expertise to the incident beyond what was in the original report?".
It's not poorly phrased. It's my conclusion after spending ten minutes with the text and I was interested whether others came to the same conclusion, which apparently is the case. It also turned out that the author is not even a specialist and has no affiliation with an involved organization. But apparently people prefer to read and discuss arbitrary opinions.
This is one of the many reasons there should be a universal data standard using a format like JSON. Heavily structured, easy to parse, easy to debug. What you lose in footprint (i.e., more disk space), you gain in system stability.
Imagine a world where everybody uses JSON and if they offer an API, you can just consume the data without a bunch of hoop jumping. Failures like this would vanish overnight.
Parsing the data formats had zero contribution to the problem. They had a problem running an algorithm on the input data, and error reporting when that algorithm failed. Nothing about JSON would improve the situation.
Yes, but look at the data. The algorithm was buggy because the input data is a nightmare. If the data didn't look like that, it's very unlikely the bug(s) would have ever existed.
ADEXP sounds like the universal data standard you want then. The UK just has an existing NATS that cannot understand it without transformation by this problematic algorithm. So the significant part of your suggestion might be to elide the NATS specific processing and upgrade NATS to use ADEXP directly.
Using a JSON format changes nothing. Just adds a few more characters to the text representation.
No change at all? I find that hard to believe. There's also a data design problem here, but the structure of JSON would aid in, not subtract from, that process.
The question at hand is: "heavily structured data vs. a blob of text as input into a complex algorithm, which one is preferred?"
Unless you're lying, you'd choose the former given the option.
The issue is using both ADEXP and ICAO4444 waypoints, and doing so in a sloppy way. For the waypoint lists, there is no issue with structurelessness -- the fact that they're lists is pretty obvious, even in the existing formats. Adding some ["",] would not have helped the specific problem, as the relevant structure was already perfectly clear to the implementers. I am not lying when I say the bug would have been equally likely in a JSON format in this specific case.
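To put it another way (waypoint designators here are placeholders, and this is a drastic simplification of the real lookup): the mistake was in how a match was chosen, and that mistake is just as easy to make against a JSON array as against space-separated text.

    import json

    raw_text = "XYZ AAA BBB XYZ"               # space-separated, '\n'-terminated style
    as_json = json.dumps(raw_text.split())     # '["XYZ", "AAA", "BBB", "XYZ"]'

    waypoints = json.loads(as_json)

    # Same bug either way: take the first occurrence of the exit designator,
    # even though the intended exit point is the second, far-away duplicate.
    exit_index = waypoints.index("XYZ")
    assert exit_index == 0                     # wrong waypoint, JSON or not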
Now I'm wigging out to the idea of how the act of overcoming the inertia of the existing system just to migrate to JSON would spawn thousands of bugs on its own — many life-threatening, surely.
To me an XML-ified version of this would look more nightmarish than the status quo... it's just brief, space-separated, \n-terminated ASCII. No need to overcomplicate things this simple.
> The algorithm was buggy because the input data is a nightmare.
No, the algorithm was "buggy" because it didn't account for the possibility of the entry point into and exit point from the UK having the same designation, since duplicates are supposed to be geographically distant (these two were 4,000 NM apart!) and the UK ain't that big.
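The kind of guard that would have caught it is not exotic. Something along these lines, as a Python sketch (helper and field names are hypothetical; the real FPRSA-R logic is only known from the report's prose):

    from math import asin, cos, radians, sin, sqrt

    MAX_PLAUSIBLE_NM = 1000   # illustrative threshold; the duplicates were ~4000 NM apart

    def great_circle_nm(lat1, lon1, lat2, lon2):
        # Haversine distance in nautical miles (mean Earth radius ~= 3440 NM)
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * asin(sqrt(a)) * 3440.065

    def pick_exit_waypoint(candidates, entry):
        """Among waypoints sharing the exit designator, reject geographically absurd ones."""
        plausible = [w for w in candidates
                     if great_circle_nm(entry.lat, entry.lon, w.lat, w.lon) <= MAX_PLAUSIBLE_NM]
        if not plausible:
            # Fail the single flight plan, not the whole system.
            raise ValueError("no geographically plausible exit waypoint; route to manual handling")
        return plausible[0]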
There are already standards like XML and RDF Turtle that allow you to clearly communicate vocabulary, such that a property 'iso3779:vin' (shorthand for a made-up URI 'https://ns.iso.org/standard/52200#vin') is interpreted in the same way anywhere in the structures and across API endpoints across companies (unlike JSON, where you need to fight both the existence of multiple labels like 'vin', 'vin_no', 'vinNumber', as well as the fact that the meaning of a property is strongly connected to its place in the JSON tree). The problem is that the added burden is not respected at the small scale and once large scale is reached, the switching costs are too big. And that XML is not cool, naturally.
On top of that, RDF Turtle is the only widely used standard graph data format (as opposed to tree-based formats like JSON and XML). This allows you to reduce the hoop jumping when consuming responses from multiple APIs as graph union is a trivial operation, while n-way tree merging is not.
Finally, RDF Turtle promotes the use of URIs as primary identifiers (the ones exposed to API consumers) instead of primary keys, bespoke tokens, or UUIDs. Following this rule makes all identifiers globally unique and dereferenceable (i.e., the ID contains the necessary information on how to fetch the resource identified by a given ID).
P.S.: The problem at hand was caused by the algorithm that was processing the parsed data, not by the parsing per se. The only improvement a better data format like RDF Turtle would bring is that two different waypoints with the same label would have two different URI identifiers.
Furthermore, there are already XML namespaces for flight plans. These are not, however, used by ATC - only by pilots to load new routes into their aircraft's navigation computers.
I'm not sure whether there is an existing RDF ontology for flight plans; it would probably be of low to medium complexity considering how powerful RDF is and the kind of global-scale users it already has.
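To illustrate the labels-vs-identifiers point from the P.S. above (the namespace and properties below are made up for the example; the only assumption is the widely used rdflib package):

    from rdflib import Graph

    turtle_data = """
    @prefix ex:   <https://example.org/waypoints/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    ex:XYZ_US rdfs:label "XYZ" ; ex:region "US" .
    ex:XYZ_EU rdfs:label "XYZ" ; ex:region "EU" .
    """

    g = Graph()
    g.parse(data=turtle_data, format="turtle")

    # Same human-readable label, two distinct dereferenceable identifiers --
    # the ambiguity that bit the flight plan processor is at least explicit here.
    print(sorted(set(g.subjects())))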
Airport software predates basically every standard on the planet. I would not be surprised to learn that they have their own bizarro world implementation of ASCII, unix epoch time, etc.
(There is a modern replacement for AFTN called AMHS, which replaces analog phone lines with X.400 messages over IP... but the system still needs to be backwards compatible for ATC units still using analog links.)
Correct. The other "leg" of a solution to this problem would be to codify migration practices so that stagnation at the tech level is a non-issue long-term.
But after you did it, you'd still have exactly the same problem. The cause was not related to deserialization. That part worked perfectly. The problem is in the business logic that was applied to the model after the message was parsed.
I think this won't work: no one really wants to touch a system that works, and people will try to find any excuse to avoid migrating.
The reason for this is that everyone prefers systems that work and fail in known ways over new systems where no one knows how they can fail.
Does the system work if it randomly fails and collapses the entire system for days?
People generally prefer to be lazy and to not use their brains, show up, and receive a paycheck for the minimum amount of effort. Not to be rude, but that's where this attitude originates. Having a codified process means that attitude can't exist because you're given all of the tools you need to solve the problem.
> Having a codified process means that attitude can't exist because you're given all of the tools you need to solve the problem.
Yes, but in real life it doesn't work.
Processes have corner cases. As you said, people are lazy and will do everything they can to find a corner case to fit into.
Just an example from the banking sector.
There are processes (and even laws) that force banks to use only certified, supported, and regularly patched software: there are still a lot of Windows 2000 servers in their datacenters, and they will be there for many years.
Broadly speaking I think this is done for new systems. What you need to identify here is how and when you transition legacy systems to this new better standard of practice.
I'd argue in favor of at least an annual review process. Have a dedicated "feature freeze, emergencies only" period where you evaluate your existing data structures and queue up any necessary work. The only real hang-up here is one of bad management.
In terms of how, it's really just a question of Schema A to Schema B mapping. Have a small team responsible for collection/organization of all the possible schemas and then another small team responsible for writing the mapping functions to transition existing data.
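As a trivial sketch of what one of those mapping functions might look like (field names entirely made up; the point is just that each mapping is a small, testable pure function):

    from typing import TypedDict

    class LegacyPlan(TypedDict):      # "Schema A": the shape an existing system emits
        dep: str
        dest: str
        rte: str                      # space-separated waypoints

    class TargetPlan(TypedDict):      # "Schema B": the agreed target shape
        departure: str
        destination: str
        route: list[str]

    def legacy_to_target(old: LegacyPlan) -> TargetPlan:
        return {
            "departure": old["dep"],
            "destination": old["dest"],
            "route": old["rte"].split(),
        }

    assert legacy_to_target({"dep": "EGLL", "dest": "KJFK", "rte": "CPT WELIN"}) == {
        "departure": "EGLL",
        "destination": "KJFK",
        "route": ["CPT", "WELIN"],
    }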
It would require will/force. Ideally, too, jobs of those responsible would be dependent on completion of the task so you couldn't just kick the can. You either do it and do it correctly or you're shopping your resume around.
Great. It should be fixed by replacing the FORTRAN systems with a modern solution. It's not that it can't be done, it's that the engineers don't bother to start the process (which is a side-effect of bad incentive structure at the employment level).
No migration of this magnitude is blocked because of engineers not "bothering" to start the process. Imagine how many approvals you'd need, plus getting budget from who-knows how many government departments. Someone is paying for your time as an engineer and they decide what you work on. I'm glad we live in a world where engineers can't just decide to rewrite a life or death system because it's written in an old(er) programming language. (Not that there is any evidence that this specific system is written in anything older than C++ or maybe Ada.)
That's... not how that works. I take it you're probably more of a frontend person than a backend person, based on this comment. In the backend world, you usually can't fully and completely replace old systems; you can only replace parts of systems while maintaining full backwards compatibility. The most critical systems in the world -- healthcare, transportation, military, and banking -- still run on mainframes, for the most part. This isn't a coincidence. When these systems get migrated, any issues, including issues of backwards compatibility, cause people to /DIE/. This isn't an issue of a button being two pixels to the left after you bump frontend platform revs; these systems are relied on for the lives and livelihoods of millions of people, every single day.
I am totally with you wishing these systems were more modern, having worked with them extensively, but I'm also realistic about the prospect. If every major airline regulator in the world worked on upgrading their ATC systems to something modern by 2023 standards, and everything went perfectly, we could expect to no longer need backwards compatibility with the old system sometime in 2050, and that's /very/ optimistic. These systems are basically why IBM is still in business, frankly.
Many of them have been upgraded. In the US, we've replaced HOST (the old ATC backend system) with ERAM (the modern replacement) as of 2015.
However, you have to remember this is a global problem. You need to maintain 100% backwards compatibility with every country on the planet. So even if you upgrade your country's systems to something modern, you still have to support old analog communication links and industry standard data formats.
In some sense, yes. Notice that most of the responses to what I've said are immediately negative or dismissive of the idea. If that's the starting point (bad mindset), of course nothing gets fixed and you land where we are today.
My initial approach would be to weed out anyone with that point of view before any work took place (the "not HR friendly" part being that this is purposefully exclusionary). The only way a problem of this scope/scale can be solved is by a team of people with extremely thick skin who are comfortable grabbing a beer and telling jokes after they spent the day telling each other to go f*ck themselves.
Anyone who has worked with me knows that I have no issue coming in like a wrecking ball in order to make things happen, when necessary. I've also been involved in some of these migration projects. I think your take on the complexity of these projects (and I do mean inherent complexity, not incidental complexity) and the responses you've received is exceptionally naive.
The amount of wise-cracks and beers your team can handle after a work day is not the determining factor in success. /Most/ of these organizations /want/ to migrate these systems to something better. There is political will and budget to do so, but these are still inglorious multi-decade slogs which cannot fail, ever, because failure means people die. No amount of attitude will change that.
> The amount of wise-cracks and beers your team can handle after a work day is not the determining factor in success.
Of course it isn't. But it's a starting point for building a team that can deal with what you describe (a decade-plus long timeline, zero room for failure, etc). If the people responsible are more or less insufferable, progress will be extremely difficult, irrespective of how talented they are.
Airplane logistics feels like one of the most complicated systems running today. A single airline has to track millions of entities: planes, parts, engineers, luggage, cargo, passengers, pilots, gate agents, maintenance schedules, etc. Most of it was created before best practices were a thing. Not only is the software complex, but there are probably millions of devices in the world expecting exactly format X that will never be upgraded.
I have no doubt that eventually the software will be Ship of Theseus-ed into something approaching sanity, but there are likely to be glaciers of tech debt which cannot be abstracted away in anything less than decades of work.
It would still be valuable to replace components piece by piece, starting with rigorously defining internal data structures and publicly providing schemas for existing data structures so that companies can incorporate them.
I would like to point out that the article (and the incident) does not relate to airline systems; it is to do with Eurocontrol and NATS and their respective commercial suppliers of software.
The problem was not in the format, but with the way the semantics of the data is understood by the system. It could be fixed-width, XML, json, whatever, and the problem would still be the same.
So the "engineering teams" couldn't tail /var/log/FPRSA-R.log and see the cause of the halt?
I've had servers and software that I had never, ever used before stop working, and it took a lot less than four hours to figure out what went wrong. I've even dealt with situations where bad data caused a primary and secondary to both stop working, and I've had to learn how to back out that data and restart things.
Sure, hindsight is easy, but when you have two different systems halt while processing the same data, the list of possible causes shrinks tremendously.
The lack of competence in the "engineering teams" tells us lots about how horribly these supposedly critical systems are managed.
You're assuming that there is in fact a /var/log/FPRSA-R.log to tail - it would not at all surprise me if a system this old is still writing its logs to a 5.25 inch floppy in Prestwick or Swanwick^1.
^1: they closed the West Drayton centre about twenty years ago; I don't imagine they moved their old IBM 9020D too, if they still had it by then. My comment is nonetheless only slightly exaggerated ;)
No. That's silly. The logs would've / should've just shown that the program halted because it was confused about data. The actual commands to fix would've been quite different.
Small suggestion: don't choose an obscure language (in terms of popularity, 28th on the TIOBE index with a 0.65% rating) to visualize structures and algorithms. Otherwise you risk the average reader stopping the moment they encounter the code samples.
There are 27 more popular languages, some of them orders of magnitude more so.