UK air traffic control meltdown (jameshaydon.github.io)
932 points by jameshh on Sept 11, 2023 | 436 comments



So they forgot to "geographically disparate" fence their queries. Having built a flight navigation system before, I know this bug. I've seen this bug. I've followed the spec to include a geofence to avoid this bug.


Why on earth do they not have GUIDs for these navigation points if the names are not globally unique and inter-region routes are commonplace?


1. Pilots occasionally have to fat finger them into ruggedized I/O devices and read them off to ATC over radios.

2. These are defined by the various regional aviation authorities. The US FAA will define one list (and they'll be unique in the US), the EU will have one (EASA?), etc.

The AA965 crash (1995-12-20) was due to an aliased waypoint name. Colombia had two waypoints with the same name within 150 nautical miles of each other. (the name was 'R') This was in violation of ICAO regulations from like the '70s.

https://en.wikipedia.org/wiki/American_Airlines_Flight_965


> Pilots occasionally have to fat finger them into ruggedized I/O devices

You're saying What3Words (W3W) is unsuitable for safety-critical applications? /s


I'm trying to imagine someone ensuring differentiation between minimums.unsettled.depends (Idaho), minimums.unsettled.depend (Alaska), minimums.unsettles.depend (Spain), and minimum.unsettles.depend (Russia) while typing them in on a t-9 style keypad with a 7 figure display in turbulence.


That's easily fixed – just spell them out using the ICAO spelling alphabet!


I can't believe the What3Words person or people didn't normalize all words to be singular before canonizing the list.

That's ridiculous.


The word list is 40,000 words long, so without plurals there probably aren't enough words that people could spell or even pronounce. A better fix would be making it "what four words" - I wonder if they'd already committed too much to the "three" concept before discovering the flaw? Either way, using phony statistics to make unwarranted claims of accuracy is a poor workaround.


That seems like a huge flaw in their system, has it never been addressed?


No, and by my understanding it can't be, as the algorithm is now permanent.

But it's worse than that, there are confusables within small distances of each other:

https://cybergibbons.com/security-2/why-what3words-is-not-su...

https://w3w.me.ss/


Since the app gives you the words to say, and translates those back to coordinates on the receiving end, in theory they could alter the word list, at the cost of making any written-down version obsolete.

Maybe they should release a new service called What4ActuallyVettedWordsAndWordCombinations ;)


Aren't they trying to turn their word list into a subscription service? Obsoleting paper copies might be a feature.


Something like what3words might be useful, but what3words itself doesn't have enough "auditory distance" between words. (i.e. - there are some/many words used by what3words that sound similar enough to be indistinguishable over an audio channel with noise.)

Something like FixPhrase seems better for use over radio.


There are a number of word lists whose words were picked due to their beneficial properties given the use-case of possibly needing to be understood verbally over unclear connections. The NATO phonetic alphabet, and PGP word lists come to mind: https://en.wikipedia.org/wiki/PGP_word_list

I'm particularly a fan of the PGP word list (it would definitely require more than 3 words for this purpose, though) because it has built-in error detection (of transposition, insertion or deletion): Separate word lists are used for "even" and "odd" hex digits. This makes it, IMHO, fairly ideal for use over verbal channels. From the wiki: "The words were carefully chosen for their phonetic distinctiveness, using genetic algorithms to select lists of words that had optimum separations in phoneme space"

It sounds like the w3w folks did not do any such thing

EDIT: According to my napkin math, 6 PGP words should be enough to cover the 64 trillion coordinates that "what3words" covers, but with way better properties such as error detection and phonetic incongruity (and not only that, it is just over 4 times larger, which means it can achieve a resolution of 5 feet instead of 10)
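
Rough sketch of how the even/odd trick works (only the first few entries of each 256-word list, written from memory, so treat them as placeholders; a real encoder indexes the full tables):

  # Even byte positions use the two-syllable list, odd positions the three-syllable
  # list, so a dropped or swapped word breaks the alternation and is detectable.
  EVEN = ["aardvark", "absurd", "accrue", "acme"]             # two-syllable list (subset)
  ODD  = ["adroitness", "adviser", "aftermath", "aggregate"]  # three-syllable list (subset)

  def encode(data: bytes) -> list[str]:
      # full lists have 256 entries; this subset only covers byte values 0-3
      return [(EVEN if i % 2 == 0 else ODD)[b] for i, b in enumerate(data)]

  def decode(words: list[str]) -> bytes:
      out = bytearray()
      for i, w in enumerate(words):
          table = EVEN if i % 2 == 0 else ODD
          if w not in table:
              # a word from the wrong list signals a transposition/insertion/deletion
              raise ValueError(f"word {w!r} is not valid at position {i}")
          out.append(table.index(w))
      return bytes(out)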


> I'm particularly a fan of the PGP word list

As a New Zealander, the PGP list is unfriendly because there are plenty of words that are hard to spell, or are too US centric.

dogsled (contains silent d, and sleigh might be a British spelling)

Galveston (I've never heard of the place)

Geiger (easy to type i before e - unobvious)

Wichita (I would have guessed the spelling began with which or witch)

And why did the designers not make the words have some connection to the numbers e.g. there are 12 even and 12 odd words beginning with E - add 16 more E words and you could use E words for E0 to EF. Redundant encoding like that helps humans (and would help when scanning for errors or matches too)

I imagine it is even harder for ESOL people from other countries! I am sure the UI has completion to help - but I wouldn't recommend using that list for anything except a pure US audience.


I have been to Galveston and I can assure you that you have not missed anything. There is no good reason to visit or know anything about it.

Making a word list that could work well for speakers of different English dialects and for speakers of English as a second language sounds really hard. Has such a list ever been made?

Probably it is too hard so we will continue to ignore the problem.


Great ideas all!

It should be discussed like this! It's clear that the w3w people didn't even do the bare minimum here!

The thing is, once you agree that some words are subpar or need translations, you can do a 1-to-1 mapping.

The problem with What3Words is that supporting the original word set will always be a pain even if they release a v2 word set with a 1-to-1 mapping (I believe they've already released versions for other languages?)

re: Geiger - the parsing could trivially accept common misspellings of words


I never knew there was a “sleigh” in “dogsled,” I’ve only ever heard “sled,” like “slid” or “skid.”


> It sounds like the w3w folks did not do any such thing

They were too busy spending money on marketing, it's not like every news organisation ran a story about it by accident


I mean there is the ICAO phonetic alphabet already known and used by every single licensed pilot the world over, regardless of their native language.

or, or... hang with me here for a minute...

We could instead use one of these cool new hash algorithms that require a computer and use about fifteen thousand English words! I understand they are all the rage in the third world countries that lack a postal system.


These are all lovely technical solutions. The problem, I imagine, isn't coming up with unique words. The problem is organizing a switchover for dozens if not hundreds of systems and agencies around the world. The chaos of change probably outweighs the benefits.


What3Words is not useful at all.

1) The FAA (and thus the world) has a hard character limit of 8; this is to support old mainframes running old Unix dispatch software (Delta, I'm looking at you).

2) The cockpit computers have limited characters on screen. An FMC can display 28 characters x 16 rows at best. Most are 8 rows. Military aircraft have some that are 2 rows. The FMC, or Flight Management Computer, is really just an old embedded chip.

3) The entire airline, flight, tourism, booking, and ticketing systems of the world would need to change. Including all legacy systems, all paper charts, all maps, all BMSs, all AirBosses, all ATC software, all radio beacons.

There is no chance that any of this will change simply because someone came up with a way to associate words with landmarks you can't see from the air.


Many authorities on the subject warn against using W3W for safety-critical applications.[0][1][2]

Personally I could imagine that Maidenhead Locator System[3] may be more useful. It's just 4-12 chars (depending on what degree of accuracy you need)

[0] https://www.summerlandreview.com/news/bc-search-and-rescue-g...

[1] https://globalnews.ca/news/8258671/north-shore-rescue-what3w...

[2] https://www.squamishchief.com/local-news/squamish-search-and...

[3] https://en.wikipedia.org/wiki/Maidenhead_Locator_System
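
For illustration, computing a Maidenhead locator is only a few lines (standard field/square/subsquare encoding; a rough Python sketch):

  def maidenhead(lat: float, lon: float) -> str:
      # shift to positive ranges: lon 0..360, lat 0..180
      lon += 180.0
      lat += 90.0
      # field: 20 x 10 degrees, letters A-R
      loc = chr(ord('A') + int(lon // 20)) + chr(ord('A') + int(lat // 10))
      lon %= 20.0; lat %= 10.0
      # square: 2 x 1 degrees, digits 0-9
      loc += str(int(lon // 2)) + str(int(lat))
      lon %= 2.0; lat %= 1.0
      # subsquare: 5 x 2.5 minutes, letters a-x
      loc += chr(ord('a') + int(lon * 12)) + chr(ord('a') + int(lat * 24))
      return loc

  # e.g. maidenhead(51.5074, -0.1278) -> 'IO91wm' (central London)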


You could maybe make them globally unique by adding the country where appropriate like we do with Paris, France vs Paris, Texas? And not using the same name twice in the same country.


The names have to be entered manually by pilots, e.g. if they change the route. They have to be transmitted over the air by humans. So they must be short and simple.


Clippy: It looks like you are trying to enter a non unique navigation point, did you mean the one in France or the one in Australia?


if only it were as simple as that - what about unique but easily confusable "human-friendly" identifiers?

As a layman, I'd argue that such efforts would be a band-aid, and better spent on robust standardization


Yes but shouldn’t one step of the code be to translate these non-unique human-readable identifiers into completely unique machine-readable identifiers?


How exactly would you do that? It’s impossible to map from a dataset of non-unique identifiers to unique identifiers without additional data and heuristics. The mapping is ambiguous by definition.

The underlying flight plan standards were all created in an era of low-memory machines, when humans were expected to directly interpret data exactly as the programs represented it internally (because serialisation and deserialisation are expensive when you need every CPU cycle just to run your core algorithms)


Couldn’t you use the surrounding points? Each point is surrounded by a set of nearby points. You can prepare a map of pairs of points into unique ids beforehand, then have a step that takes (before, current, after) for each point in the flight plan and finds the ID of current.


That sort of thing happens already; for instance, the MCDU of an Airbus aircraft will present various options in the case of ambiguous input, with a distance in nautical miles for each option. Usually, the closest option is the most appropriate.


Yes you can, but that would be using

> additional data and heuristics

Directly mapping is impossible, so you can't just do a dumb ID-at-a-time pre-processing step (which is what your comment seems to suggest). You need a more complex pre-processing step that's capable of understanding the surrounding context the identifier is being used in. A major issue with the flight planning system (as highlighted in the article) is that they attempted to do this heuristic mapping as part of their core processing step, and just assumed the ID wouldn't be too ambiguous, and certainly wouldn't repeat.


If two waypoints have the same name, assume it's the one closest to the adjacent ones in the route-chain rather than the one 4000 km away.
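
Rough sketch of that heuristic (names and data structures made up, not whatever NATS actually does): resolve each duplicate name to the candidate nearest the previously resolved fix.

  import math

  def dist_nm(a, b):
      # equirectangular approximation, good enough for ranking candidates
      lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
      x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
      return math.hypot(x, lat2 - lat1) * 3440.065   # Earth radius in nautical miles

  def resolve(route, db):
      """route: list of waypoint names; db: name -> list of (lat, lon) candidates."""
      fixes = []
      for name in route:
          candidates = db[name]
          if fixes and len(candidates) > 1:
              # duplicate name: pick the candidate closest to the previous fix
              candidates = sorted(candidates, key=lambda c: dist_nm(fixes[-1], c))
          fixes.append(candidates[0])
      return fixes

A fuller version would also look ahead at the following fix (the (before, current, after) idea upthread), since the first waypoint in the route can itself be ambiguous.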


Sounds like a helpful idea with considerable implementation complexity, including the potential for new disastrous failure modes.


But it's how the waypoint codes work in practice - they are contextual. If Air Traffic Control tell a plane to head for waypoint RESIA, they mean the one nearby, not the one 4000 km away.


Have to admit, I read the article in full detail only after commenting and I see your point.

Especially since the implementing company is called out explicitly for failing to achieve this, and the risks of changing the well-established identifiers are also illustrated.

Perfect might be the enemy of the good then, or the standardization thing at least is a separate topic.


Not really because each point is only adjacent to a small neighborhood of other points, so if you want to test every possibility then your search space only grows by a constant factor proportional to the maximum degree of the graph.

As for implementation complexity, you would hope they would use formal verification for something like this.


Sounds like we should have globally unique human-enterable identifiers governed by an ISO standard...


Long story: because changing identifiers is a considerable refactoring, and it takes coordination with multiple worldwide distributed partners to transition safely from the old to the new system, all to avoid a hypothetical issue some software engineer came up with

Short story: money. It costs money to do things well.


> Long story: because changing identifiers is a considerable refactoring

is this what refactoring means


Yes. It would cascade into:

Changes in how ATCs operate

Changes in how pilots operate

Changes in how airplanes receive these instructions (including the flight software itself, safety systems, etc.)

Changes in how airplanes are tested

Changes in how pilots are trained

Etc. In this case, the refactoring requires changes to hardware, software, training, manufacturing, and humans.


Pretty sure that is still not the meaning of refactoring. As I understand it refactoring should mean no changes to the external interface but changes to how it is implemented internally.


You could see it as the whole international flight system being refactored, consumers will still use planes like before


We can pontificate on how to define the scope of a system here. I will only state that, from the perspective of a consumer, you could consider this a Service on which the interface of find flight, book flight, etc. would appear to be the same while the connections internal to each of the above modules would have to account for the change.

Functionally, I suppose it's the equivalent of upgrading an ID field that was originally declared as an unsigned 32-bit integer to a wider 64-bit representation. We may not be changing anything fundamental in the functionality, but every boundary interface, protocol, and storage mechanism must now suffer through a potentially painful modification.


does refactoring mean literally any non-local change even just like changing a variable name, or does it usually mean some kind of structural or architectural non-local change


Aviation protocols are extremely backwards compatible and low-tech compatible.

You need to be able to read, write, hear, and speak the identifier. (And receive/transmit in morse code)

Would it be okay to have an "area code prefix" in the identifier? Plausible (but practically speaking too late for that)


FAA regulations state that fixes, navs, and waypoints must be phonetically transmittable over radio.

E.g. Yankee = YANKY. The pilot and ATC must be location-aware. Apparently their software is not.


I would guess because humans have to read this and ascertain meaning from it. Not everyone is a technical resource.


They do, and use lat/lon in some cases. Reviewing and inputting that (when done manually) is another story - but it's technically possible.


It sounds like for actual processing they replace them with GPS coordinates (or at least augment them with such). But this is the system that is responsible for actually doing that...


Because they need to be short, that's why they are 5 letters long. And need to be understood phonetically very quickly by pilots.


What3Words would be a better solution than a GUID, as it's transmittable over radio.


W3W contains homonyms, and words that are easily confused by non-native English speakers, often within just a few km of each other. The latter is why ATC uses "niner", to avoid confusing "nine" and "nein".

Talk to someone deep in the GIS rabbit hole and you'll get a rant about how bad W3W is: https://cybergibbons.com/security-2/why-what3words-is-not-su...


Wernher von Braun polled his reports for whether the rocket was reliable or not. Each engineer replied "nein".

Von Braun reported that the rocket had six nines probability of success.


Also, "tree" for 3 and "fife" for 5.


I always love it when someone helicopters in to a complex, long-established system and, without even attempting to understand the requirements, constraints or history, knows this thing they read on a blog one time would fix all the problems thousands of work-years have failed to address.


As software developers, we are often living in our own bubble. As a pilot and developer working on an aviation solution, I quite often run into this issue when discussing solutions with my colleagues.


Well, it would be hard for them to helicopter in with their navigation system. ATC would have a field day. ;)


W3W is a proprietary system that should never be used:

https://www.walklakes.co.uk/opus64534.html

The biggest fault (besides being proprietary) is that you must be online in order to use W3W. The times that you might need W3W are ALSO the times you are most likely to be unable to get online.


> The biggest fault (besides being proprietary) is that you must be online in order to use W3W.

That doesn't seem to be the case anymore.

It's still not a great system – many included words are ambiguous (e.g. English singular and plural forms are both possible, and an "s" is notoriously difficult to hear over a bad phone line), and it's proprietary, as you already mentioned.


It's definitely not the case, as the word list and algorithm are not secret (notwithstanding that they're proprietary) and have been re-implemented and ported into at least a couple of languages that allow for offline use. I have a Rust implementation that started life as a transliteration from Javascript. I wouldn't recommend using it, still -- I wrote it in the hope of finding more problems with collisions, not because I like it.


That would actually be pretty bad. As mentioned, W3W is proprietary, requires an online connection, and has homonyms. On top of that, you need to enter these waypoints into your aircraft's navigation system - sometimes one letter at a time using a rotary dial. These navigation systems will stay in service for decades.

Aviation already uses phonetically pronounceable waypoint names. Typically 5 characters long for RNAV (GPS) waypoints, for example "ALTAM" or "COLLI". Easy to pronounce, easy to spell phonetically if needed, and easy to enter.

The problem is the list of waypoints are independently defined by each country, so duplicates are possible between countries.

Rather than replacing a system that mostly works (and mandating changes to aircraft navigation systems, ATC systems, and human training for marginal benefit)... an easier fix would just be to have ICAO mandate that these waypoints are globally unique.


If only there was a globally unique set of short two-letter names for every country that could be used as prefixes to enforce uniqueness while still allowing every country to manage their own internal waypoint list.

If only.


I'm sure they thought about this at some point. Airports already have a country-code prefix. (For example, airports in the Continental US always start with K.)

For whatever reason, by convention navaids never use a country prefix. Even when it would make sense - the code for San Francisco International Airport is "KSFO", but the identifier for the colocated VOR-DME is just "SFO". (Sometimes this does make a big difference, when navaids are located off site - KCCR vs CCR for Concord Airport vs the off-site Concord VOR-DME, for example.)

It's even worse for NDB navaids, which are often just two letters.

Either way, we're stuck with it because it's baked into aircraft avionics and would be incredibly expensive to change at this point.


Yes, but country-prefixes are something you could migrate gradually in a backwards-compatible way.

"Waypoint Charlie Alpha Dash ALKOG, over."

"This is an old machine, so just ALKOG then, over."


That's "What3Words" -- https://en.m.wikipedia.org/wiki/What3words -- a system for representing geographic location using globally-unique word triads.


Just curious, what language was used to develop it?


An ICAO standard, effective from 1978, only permits duplicate identifiers if they are more than 600 nmi (690 mi; 1,100 km) apart.


> the backup system applied the same logic to the flight plan with the same result

Oops. In software, the backup system should use different logic. When I worked at Boeing on the 757 stab trim system, there were two avionics computers attached to the wires to activate the trim. The attachment was through a comparator, that would shut off the authority of both boxes if they didn't agree.

The boxes were designed with:

1. different algorithms

2. different programming languages

3. different CPUs

4. code written by different teams with a firewall between them

The idea was that bugs from one box would not cause the other to fail in the same way.


This would have been a 2oo2 system where the pilot becomes the backup. 2oo2 systems are not highly available.

Air traffic control systems should at least be 2oo3[1] (3 systems independently developed of which 2 must concur at any given time) so that a failure of one system would still allow the other two to continue operation without impacting availability of the aviation industry.

Human backup is not possible because of human resourcing and complexity. ATC systems would need to be available to provide separation under IFR[2] and CVFR[3] conditions.

[1] https://en.wikipedia.org/wiki/Triple_modular_redundancy

[2] https://en.wikipedia.org/wiki/Instrument_flight_rules#Separa...

[3] https://en.wikipedia.org/wiki/Visual_flight_rules#Controlled...
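
For illustration, the voting part of 2oo3 is tiny (a toy sketch, nothing like real avionics code); the hard and expensive part is producing three genuinely independent implementations:

  from collections import Counter

  def vote_2oo3(implementations, flight_plan):
      # run all three independently developed implementations on the same input
      results = [impl(flight_plan) for impl in implementations]
      answer, votes = Counter(results).most_common(1)[0]   # results must be hashable
      if votes < 2:
          # three-way disagreement: no majority, so fail safe and raise the alarm
          raise RuntimeError("no two implementations agree")
      return answer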


> Air traffic control systems should at least be 2oo3... Human backup is not possible because of human resourcing and complexity.

But this was a 1oo1 system, and the human backup handled it well enough: a lot of people were inconvenienced, but there were no catastrophes, and (AFAIK) nothing that got close to being one.

As for the benefits of independent development: it might have helped, but the chances of this being so are probably not as much as one would have hoped if one thought programming errors are essentially random defects analogous to, say, weaknesses in a bundle of cables; I had a bit more to say about it here:

https://news.ycombinator.com/item?id=37476624


> But this was a 1oo1 system, and the human backup handled it well enough ...

Heh, a hundred million pound outage. ;)

True, no-one seems to have died from it directly though.


True. I don't want to downplay the actual cost (or, worse, suggest that we should accept "the system worked as intended" excuses), but it's not just that there were no crashes: the air traffic itself remained under control throughout the event. Compare this to, for example, the financial "flash crash" of 2010, or the nuclear 'excursions' at Fukushima / Chernobyl / Three Mile Island / Windscale, where those nominally in control were reduced to being passive observers.

It also serves as a reminder of how far we have to go before we can automate away the jobs of pilots and air traffic controllers.


This reminds me of a backwoods hike I took with a friend some years back. We each brought a compass, "for redundancy", but it wasn't until we were well underway that we noticed our respective compasses frequently disagreed. We often wished we had a third to break the tie!


Sounds like the joke about a man with one watch always being sure about what time it is, but a man with two being continuously in doubt.


Just compute the average, then correct for the documented drift against an external source?


My grandfather was working with Stanisław Skarżyński, who was preparing for his first crossing of the Atlantic in a lightweight airplane (RWD-5bis, 450kg empty weight) in 1933.

They initially mounted two compasses in the cockpit, but Skarżyński taped one of them over so that it wasn't visible, saying wisely that if one fails, he will have no idea which one is correct.


> if one fails, he will have no idea which one is correct

Depends how it fails! For example, say, when you change direction one turns and the other doesn't.


Couldn't he bring his own 3rd? Compasses aren't heavy?


…or a 4th and a 5th, and have voting rounds — an idea explored by Stanisław Lem in "Golem XIV", where a parliament of machines voted :-)


That's a cool story! Would have loved to have heard more about that :)


In this case the problem was choosing an excessively naive algorithm. I'm very inexperienced, but it seems to me the solution would be to spend a bit more money on reviewing the one implementation rather than writing two new ones from scratch.


You would be very surprised how difficult avionics are at even a fundamental level.

I'll provide a relatively simple example.

Even attempting to design a Star Fox game clone where the ship goes towards the mouse cursor using Euler angles will almost immediately result in gimbal lock, with your starfighter locking up tighter than an unlubricated car engine going 100 mph and unable to move. [0]

The standard solution in games (or at least what I used) has been to use quaternions [1] (Hamilton defined a quaternion as the quotient of two directed lines in a three-dimensional space, or, equivalently, as the quotient of two vectors). So you essentially dump your 3D coordinates into the 4D quaternion representation, apply your rotations there, then convert back to 3D space and apply your transforms.

This was literally just to get my little space ship to go where my mouse cursor was on the screen without it locking up.

So... yeah, I cannot even begin to imagine the complexity of what a Boeing 757 (let alone a 787) is doing under the hood to deal with reality and not causing it to brick up and fall out of the sky.

[0] https://math.stackexchange.com/questions/8980/euler-angles-a... [1] https://en.wikipedia.org/wiki/Quaternion
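
For the curious, the quaternion approach boils down to a handful of lines (a generic sketch of axis-angle rotation via q v q*, nothing avionics-specific):

  import math

  def axis_angle_to_quat(axis, angle):
      # unit quaternion representing a rotation of `angle` radians about `axis`
      ax, ay, az = axis
      n = math.sqrt(ax*ax + ay*ay + az*az)
      s = math.sin(angle / 2) / n
      return (math.cos(angle / 2), ax*s, ay*s, az*s)     # (w, x, y, z)

  def quat_mul(q, r):
      # Hamilton product of two quaternions
      w1, x1, y1, z1 = q
      w2, x2, y2, z2 = r
      return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
              w1*x2 + x1*w2 + y1*z2 - z1*y2,
              w1*y2 - x1*z2 + y1*w2 + z1*x2,
              w1*z2 + x1*y2 - y1*x2 + z1*w2)

  def rotate(v, q):
      # rotate vector v by unit quaternion q: q * (0, v) * conjugate(q)
      w, x, y, z = quat_mul(quat_mul(q, (0.0, *v)), (q[0], -q[1], -q[2], -q[3]))
      return (x, y, z)

No gimbal lock, because you never pass through an intermediate Euler-angle representation.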


I don't think we're talking about that kind of software, though. This bug was in code that needs to parse a line defined by named points and then clip the line to the portion in the UK. Not trivial, but I can imagine writing that myself.

But regardless, the more complex the code, the worse an idea it is to maintain three parallel implementations if you won't/can't afford to do it properly.


I was doing some orientation sensing 20 years ago with an IMU and ran into the same problem. I didn't know at the time that it was gimbal lock (which I had heard of), but I did read that quaternions were the way to fix it. Pesky problem.


> Human backup is not possible because of human resourcing

This is an artificial constraint. In the end, it comes down to risk management: "Are we willing to pay someone to make sure the system stays up when the computer does something unexpected?"

Considering this bug only showed up now, chances are there was a project manager who decided the risk would be extremely low and not worth spending another 200k or so of yearly operating expenses on.


First thought that came to my mind as well when I read it. This failover system seems to be more designed to mitigate hardware failures than software bugs.


I also understand that it is impractical to implement the ATC system software twice using different algorithms. The software at least checked for an illogical state and exited, which was the right thing to do.

A fix I would consider is to have the inputs more thoroughly checked for correctness before passing them on to the ATC system.


> A fix I would consider is to have the inputs more thoroughly checked for correctness before passing them on to the ATC system.

Thorough checking of the inputs, as far as possible, should be a given, but in this case the inputs were correct: while the use of duplicate identifiers is considerably less than ideal, the constraints on where that was permitted meant that there was one deterministically unambiguous parsing of the flight plan, as demonstrated in the article. The proximate cause of the problem was not in the inputs, but how they were processed by the ATC system.

For the same reason, multiple implementations of the software would only have helped if a majority of the teams understood this issue and got it right. I recall a fairly influential paper in the '90s (IIRC) in which multiple independent implementations of a requirements specification were compared, and the finding was that the errors were quite strongly correlated - i.e. there was a tendency for the teams to make the same mistakes as each other.


not stronger isolation between different flight plans? it seems "obvious" to me that if one flight plan is causing a bug in the handling logic, the system should be able to recover by continuing with the next flight plan and flagging the error to operators to impact that flight only


I'm no aviation expert, but perhaps with waypoints:

  A B C D E
   /
  F G H I J
If flight plan #1 is known to be going from F-B at flight level 130, and you have a (supposedly) bogus flight plan #2, they can't quite be sure if it might be going from A-G at flight level 130 at the same time and thus causing a really bad day for both aircraft. I'd worry that dropping plan #2 into a queue for manual intervention, especially if this kind of thing only happens once every 5 years, could be disastrous if people don't realize what's happening and why. Many people might never have seen anything in that queue and may not be trained to diagnose the problem and manually translate the flight plan.

This might not be the reason why the developer chose to have the program essentially pull the fire alarm and go home in this case, but that's the impression I got.


The ATC system handled well enough (i.e. no disasters, and AFAIK, no near misses) something much more complicated than one aircraft showing up with no flight plan: the failure of this particular system put all the flights in that category.

I mentioned elsewhere that any ATC system has to be resilient enough to handle things like in-flight equipment failure, medical emergencies, and the diversion of multiple aircraft on account of bad weather or an incident which shuts down a major airport.

As for why the system "pulled the plug", the author of the article suspects that this particular error was regarded as something that would not occur unless something catastrophic had caused it, whereas, in reality, it affected only one flight and could probably have been easily worked around if the system had informed ATC which flight plan was causing the problem.


I'm not sure they're even used for that purpose - that side of things is done "live" as I understand it - the plans are so that ATC has the details on hand for each flight and it doesn't all need to be communicated by radio as they pass through.


"unexpected errors" are not necessarily problems with the flight plans. They could be anything.


I wonder where most of the complexity lies in ATC. Naively you’d think there would be some mega computer needed to solve the puzzle but the UK only sees 6k flights a day and the scale of the problem, like most things in the physical world, is well bounded. That’s about the same number of buses in London, or a tenth of the number of Uber drivers in NYC.

It would be interesting to actually see the code.


Much of the complexity is in interop. Passing data between ATC control positions, between different facilities, and between different countries. Then every airline has a bidirectional data feed, plus all the independent GA flights (either via flight service or via third-party apps). Plus additional systems for weather, traffic management, radar, etc. Plus everything happening on the defense side.

All using communication links and protocols that have evolved organically since the 1950s, need global consensus (with hundreds of different countries' implementations), and which need to never fail.


The system should have just rejected the FPL, notified the admins about the problem, and kept working. The admins could have fixed whatever the software could not handle.

The affected flight could have been vectored by ATC if needed to divert from filed FPL.

Way less work and a better outcome than the "system throws its hands in the air and becomes unresponsive".


"When a failsafe system fails, it fails by failing to fail safe."

J. Gall


Different teams often make the same mistake. The system you describe is not perfect, but makes sense.


I neglected to mention there was a third party that reviewed the algorithms to verify they weren't the same.

Nothing is perfect, though, and the pilot is the backup for failure of that system. I.e. turn off the stab trim system.


Is this still the case for simple algorithms?


I don't know as much about modern avionics.


if this is true, then would it be a better investment to have the 2nd team produce a fuzz testing/systematic testing mechanism instead of producing a secondary copy of the same system?

In fact, make it adversarial testing, such that this team is rewarded (maybe financially) if mistakes or problems are found in the 1st team's program.


Such incentives can lead to reduced collaboration. If I get paid every time you make mistakes, I won't want you to get better at your job


the whole point is that they're not collaborating so as to avoid cross-contamination. also you don't get paid unless and until you identify the mistake. if you decrease the reward over time, there is an additional incentive to not sit on the information


As as side note, too bad they knowingly didn't reuse such an approach for the MAX..


The MAX system relied on the pilot remembering the stab trim cutoff switch and what it was for.


Even though the trim cutoff switch didn't work as it used to do on the previous generation of 737s, and the pilots were not notified about the change.


Wouldn't trim be a number for which a significant tolerance is permissible at any given time? Or does "agree" mean "within a preset tolerance"?


Naturally, any comparator would have some slack in it to account for variations. Even CPU internals have such slack, that's why there's a "clock" to synchronize things.


I would be very interested in knowing which languages were used. Do you know which were? Thanks


One of them was Pascal. This was around 1980 or so.


I seem to remember another problem at NATS which had the same effect. Primary fell over so they switched over to a secondary that fell over for the exact same reason.

It seems like you should only failover if you know the problem is with the primary and not with the software itself. Failing over "just because" just reinforces the idea that they didn't have enough information exposed to really know what to do.

The bit that makes me feel a bit sick though is that they didn't have a method called "ValidateFlightPlan" that throws an error if for any reason it couldn't be parsed and that error could be handled in a really simple way. What programmer would look at a processor of external input and not think, "what do we do with bad input that makes it fall over?". I did something today for a simple message prompt since I can't guarantee that in all scenarios the data I need will be present/correct. Try/catch and a simple message to the user "Data could not be processed".


Well, if the primary is known not to be in a good state, you might as well fail over and hope that the issue was a fried disk or a cosmic bit flip or something.

The real safety feature is the 4 hour lead time before manual processing becomes necessary.

One of the key safety controls in aviation is “if this breaks for any reason, what do we do”, not so much “how do we stop this breaking in the first place”.


I'm no aviation safety controls expert but it seems to me that there are two types of controls that should be in place:

1. Process controls: What do we do when this breaks for any reason.

2. Engineering controls: What can we do to keep this from breaking in the first place?

Both of them seem to be somewhat essential for a truly safe system.


It's very hard to ensure you capture every single possible failure mode. Yes, the engineering control is important but it's not the most critical. What to do if it does fail (for any reason) is the truly critical control, because it solves for the possibility of not knowing every possible way something might fail and therefore missing some way to prevent a failure


One or more of three results can come from the engineering exercise of trying to keep something from breaking in the first place:

1. You could know the solution, but it would be too heavy.

2. You could know the solution, but it would include more parts, each of which would need the same process on it, and the process might fail the same way

3. You miss something and it fails anyway, so your "what if this fails" path better be well rehearsed and executed.

Real engineering is facing the tradeoffs head on, not hand waving them away.


The engineering controls don't independently make systems safe, they make things more reliable and cost-effective, and hopefully reduce the number of times the process controls kick in.

The process controls do however independently make things safe.

The reason for this is that there are 'unknown unknowns'—we accept that our knowledge and skills are imperfect, and there may be failures that occur which could have been eliminated with the proper engineering controls, but we, as imperfect beings and organisations, did not implement the engineering controls because we did not identify this possible failure mode.

There are also known errors, where the cost of implementing engineering controls may simply outweigh the benefits when adequate process controls are in place.


Everyone uses slightly different terminology and groups things differently but this will give you the gist.

https://en.m.wikipedia.org/wiki/Hierarchy_of_hazard_controls


It was in a bad state, but in a very inane way: a flight plan in its processing queue was faulty. The system itself was mostly fine. It was just not well-written enough to distinguish an input error from an internal error, and thus didn't just skip the faulty flight plan.


at the risk of nitpicking: "a flight plan in its processing queue was faulty" isn't true, the flight plan was fine. It couldn't process it.

I mention this only because the Daily Mail headline pissed me off with its usual bullshit foreigner fear-mongering crap.


Indeed, that intention is quite transparent in this case. Anyways, I suspect that invalid input exists that would have made the system react in a similar way


No validation, and this point from the article stood out to me:

> The programming style is very imperative. Furthermore, the description sounds like the procedure is working directly on the textual representation of the flight plan, rather than a data structure parsed from the text file. This would be quite worrying, but it might also just be how it is explained.

Given that description, I'd be surprised if it wasn't just running regex / substring matches against the text, with no classes / objects / data structures involved. Bearing in mind this is likely decades old C code that can't be rewritten or replaced because the entirety of the UK's aviation runs on it.


> Bearing in mind this is likely decades old C code that can't be rewritten or replaced because the entirety of the UK's aviation runs on it.

It's new code, from 2018 :) Quote from the report:

> An FPRSA sub-system has existed in NATS for many years and in 2018 the previous FPRSA sub- system was replaced with new hardware and software manufactured by Frequentis AG, one of the leading global ATC System providers.


Failing over is correct because there's no way to discern that the hardware is not at fault. They should have designed a better response to the second failure to avoid the knock-on effects.


I don't think anything in this incident pointed to a hardware fault

The software raised an exception because a "// TODO: this should never happen" case happened

A hardware fault would look like machines not talking to each other, or a corrupted data file being unreadable.


Retroactive inspection revealed that it wasn't a hardware failure, but the computer didn't know that at the time, and hardware failure can look like anything, so it was correct to exercise its only option.


Yep. In electrical terms, you replaced the fuse to watch it blow again. There are no more fuses in your shop. Progress?


Stick a nail in


The Ariane 5 launch failure[1] was a similar issue, albeit with a more spectacular outcome.

Primary suffers integer overflow, fails. Secondary is identical, which also overflows. Angle of attack increases, boosters separate. Rocket goes boom.

[1] https://en.wikipedia.org/wiki/Ariane_flight_V88


And why could the system not put the failed flight plan in a queue for human review and just keep on working for the rest of the flights? I think the lack of that “feature” is what I find so boggling.


Because the code classified it as a "this should never happen!" error, and then it happened. The code didn't classify it as a "flight plan has bad data" error or a "flight plan data is OK but we don't support it yet" error.

If a "this should never happen!" error occurs, then you don't know what's wrong with the system or how bad or far-reaching the effects are. Maybe it's like what happened here and you could have continued. Or maybe you're getting the error because the software has a catastrophic new bug that will silently corrupt all the other flight plans and get people killed. You don't know whether it is or isn't safe to continue, so you stop.


That reasoning is fine, but it rather seems that the programmers triggered this catastrophic "stop the world" error because they were not thorough enough considering all scenarios. As TA expounds, it seems that neither formal methods nor fuzzing were used, which would have gone a long way flushing out such errors.


> it rather seems that the programmers triggered this catastrophic "stop the world" error because they were not thorough enough considering all scenarios

Yes. But also, it's an ATC system. Its primary purpose "is to prevent collisions..." [1].

If the system encounters a "this should never happen!" error, the correct move is to shut it down and ground air traffic. (The error shouldn't have happened in the first place. But the shutdown should have been more graceful.)

[1] https://en.wikipedia.org/wiki/Air_traffic_control


Neither formal methods nor fuzzing would've helped if the programmer didn't know that input can repeat. Maybe they just didn't read the paragraph in whatever document describes how this should work and didn't know about it.

I didn't have to implement flight control software, but I had to write some stuff described by MIFID. It's a job from hell, if you take it seriously. It's a series of normative documents that explains how banks have to interact with each other which were published quicker than they could've been implemented (and therefore the date they had to take effect was rescheduled several times).

These documents aren't structured to answer every question a programmer might have. Sometimes the "interesting" information is close together. Sometimes you need to guess the keyword you need to search for to discover all the "interesting" parts... and it could be thousands of pages long.


The point of fuzzing is precisely to discover cases that the programmers couldn't think about, and formal methods are useful to discover invariants and assumptions that programmers didn't know they rely on.

Furthermore, identifiers from external systems always deserve scepticism. Even UUIDs can be suspect. Magic strings from hell even more so.
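
For example, a property-based test (sketched here with the hypothesis library; extract_uk_segment is a made-up stand-in for the real routine) will happily generate routes with repeated names even if nobody thought to ask for them:

  from hypothesis import given, strategies as st

  NAMES = ["DVL", "RESIA", "ALKOG", "COLLI"]   # small pool, so duplicates appear often

  @given(st.lists(st.sampled_from(NAMES), min_size=2, max_size=20))
  def test_extract_uk_segment_never_crashes(route):
      # property: any syntactically valid route is either processed or rejected
      # with a per-plan error, never an unhandled "this should never happen" case.
      # extract_uk_segment is the (hypothetical) routine under test.
      extract_uk_segment(route)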


Sorry, you missed the point.

If programmer didn't know that repetitions are allowed, they wouldn't appear in the input to the fuzzer as well.

The mistake is too trivial to attribute to programmer incompetence or lack of attention. I'd bet my lunch it was because the spec is written in an incomprehensible language, is all over the place in a thousand-page PDF, and the particular aspect of repetition isn't covered in what looks like the main description of how paths are defined.

I've dealt with specs like that. It's most likely the error created by the lack of understanding of the details of the requirements than of anything else. No automatic testing technique would help here. More rigorous and systematic approach to requirement specification would probably help, but we have no tools and no processes to address that.


> If programmer didn't know that repetitions are allowed, they wouldn't appear in the input to the fuzzer as well.

It totally would. The point of a fuzzer is to test the system with every technically possible input, to avoid bias and blind spots in the programmer's thinking.

Furthermore, assuming that no duplicates exist is a rather strong assumption that should always be questioned. Unless you know all about the business rules of an external system, you can't trust its data and can't assume much about its behavior.

Anyways, we are discussing the wrong issue. Bugs happen, even halting the whole system can be justified, but the operators should have had an easier time figuring out what was actually going on, without the vendor having to pore through low-level logs.


No... that's not the point of fuzzing... You cannot write individual functions in such a way that they keep revalidating input handed to them, because then, invariably, the validations will be different from function to function, and once you have an error in your validation logic, you will have to track down all the functions that do this validation. So, functions have to make assumptions about input if it doesn't come from an external source.

I.e. this function wasn't the one which did all the job -- it already knew that the input was valid because the function that provided the input already ensured validation happened.

It's pointless to deliberately send invalid input to a function that expects (for a good reason) that the input is valid -- you will create a ton of worthless noise instead of looking for actual problems.

> Furthermore, assuming that no duplicates exist is a rather strong assumption that should always be questioned.

How do you even come up with this? Do you write your code in such a way that any time it pulls a value from a dictionary, you iterate over the dictionary keys to make sure that they are unique?... There are plenty of things that are meant to be unique by design. The function in question wasn't meant to check if the points were unique. For all we know, the function might have been designed to take a map and the data was lost even before this function started processing it...

You really need to try doing what you suggest before suggesting it.


I am not going to comment the first paragraph since you turned my words around.

> How do you even come up with this? Do you write your code in such a way that any time it pulls a value from a dictionary, you iterate over the dictionary keys to make sure that they are unique?

A dictionary in my program is under my control and I can be sure that the key is unique since... well, I know it's a dictionary. I have no such knowledge about data coming from external systems.

> There are plenty of things that are meant to be unique by design. The function in question wasn't meant to check if the points were unique. For all we know, the function might have been designed to take a map and the data was lost even before this function started processing it...

"Meant to be" and "actually are" can be very different things, and it's the responsibility of a programmer to establish the difference, or to at least ask pointed questions. Actually, the programmers did the correct thing by not sweeping this unexpected problem under the rug. The reaction was just a big drastic, and the system did not make it easy for the operators to find out what went wrong.

Edit: as we have seen, input can be valid but still not be processable by our code. That's not fine, but it's a fact of life, since specs are often unclear or incomplete. Also, the rules can actually change without us noticing. In these cases, we should make it as easy as possible to figure out what went wrong.


I've only heard from people engineering systems for the aerospace industry, and we're talking hundreds of pages of API documentation. It is very complex, so the chances of human error are correspondingly higher.


I agree with the general sentiment "if you see an unexpected error, STOP", but I don't really think that applies here.

That is, when processing a sequential queue which is what this job does, it seems to me reading the article that each job in the queue is essentially totally independent. In that case, the code most definitely should isolate "unexpected error in job" from a larger "something unknown happened processing the higher level queue".

I've actually seen this bug in different contexts before, and the lessons should always be: One bad job shouldn't crash the whole system. Error handling boundaries should be such that a bad job should be taken out of the queue and handled separately. If you don't do this (which really just entails being thoughtful when processing jobs about the types of errors that are specific to an individual job), I guarantee you'll have a bad time, just like these maintainers did.
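
Something like this (names made up) is the kind of error boundary I mean:

  import logging

  def process_queue(plans, process_plan, quarantine):
      for plan in plans:
          try:
              process_plan(plan)
          except Exception:
              # unexpected failure in this plan only: log it, park it for
              # manual review, and keep the rest of the queue moving
              logging.exception("could not process flight plan %r", plan)
              quarantine.append(plan)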


If the code takes a valid series of ICAO waypoints and routes, generates the corresponding ADEXP waypoint list, but then when it uses that to identify the ICAO segment that leaves UK airspace it's capable of producing a segment from before when the route enters UK airspace, then that code is wrong, and who knows what other failure modes it has?

Maybe it can also produce the wrong segment within British airspace, meaning another flight plan might be processed successfully, but with the system believing it terminates somewhere it doesn't?

Maybe it's already been processing all the preceding flight plans wrongly, and this is just the first time when this error has occurred in a way that causes the algorithm to error?

Maybe someone's introduced an error in the code or the underlying waypoint mapping database and every flight plan that is coming into the system is being misinterpreted?


An "unexpected error" is always a logic bug. The cause of the logic error is not known, because it is unexpected. Therefore, the software cannot determine if it is an isolated problem or a systemic problem. For a systemic problem, shutting down the system and engaging the backup is the correct solution.


I'm pretty inexperienced, but I'm starting to learn the hard way that it takes more discipline to add more complex error recovery. (Just recently my implementation of what you're suggesting - limiting the blast radius of server side errors - meant all my tests were passing with a logged error I missed when I made a typo)

Considering their level 1 and 2 support techs couldn't access the so-called "low level" logs with the actual error message it's not clear to me they'd be able to keep up with a system with more complicated failure states. For example, they'd need to make sure that every plan rejected by the computer is routed to and handled by a human.


> is essentially totally independent

They physically cannot be independent. The system works on an assumption that the flight was accepted and is valid, but it cannot place it. What if it accidentally schedules another flight in the same time and place?


> What if it accidentally schedules another flight in the same time and place?

Flight plans are not responsible for flight separation. It is not their job and nobody uses them for that.

As a first approximation they are used so ATC doesn't need to ask every airplane every five minutes "so flight ABC123, where do you want to go today?"

I’m staring to think that there is a need for a “falsehoods programers believe about aviation” article.


Except that you can't be sure this bad flight plan doesn't contain information that will lead to a collision. The system needs to maintain the integrity of all plans it sees. If it can't process one, and there's the risk of a plane entering airspace with a bad flight plan, you need to stop operations.


>> Except that you can't be sure this bad flight plan doesn't contain information that will lead to a collision.

Flight plans don't contain any information relevant for collision avoidance. They only say when and where the plane is expected to be. There is not enough specificity to ensure no collisions. Things change all the time, from late departures, to diverting around bad weather. On 9/11 they didn't have every plane in the sky file a new flight plan carefully checked against every other...


But they have 4 hours to reach out to the one plane whose flight plan didn't get processed and tell them to land somewhere else.


Assuming they can identify that plane.

Aviation is incredibly risk-averse, which is part of why it's one of the safest modes of travel that exists. I can't imagine any aviation administration in a developed country being OK with a "yeah just keep going" approach in this situation.


That's true, but then, why did engineers try to restart the system several times if they had no clue what was happening, and restarting it could have been dangerous?


And that's why I never (or very rarely) put "this should never happen" exceptions anymore in my code

Because you eventually figure out that, yes, it does happen


A customer of mine is adamant in their resolve to log errors, retry a few times, give up and go on with the next item to process.

That would have grounded only the plane with the flight plan that the UK system could not process.

Still a bug, but with less impact on the whole continent: in the actual outage, planes that could not get into or out of the UK could not fly, and that affected all of Europe and possibly more.


> That would have grounded only the plane with the flight plan that the UK system could not process.

By the looks of it, it was a few hours in the air by the time the system had a breakdown. Considering it didn't know what the problem was, it seems appropriate that it shut down. No planes collided, so the worst didn't happen.


Couldn't the outcome be "access to the UK airspace denied" only for that flight? It would have checked with an ATC and possibly landed somewhere before approaching the UK.

In the case of a problem with all flights, the outcome would have been the same they eventually had.

Of course I have no idea if that would be a reasonable failure mode.


This here is the true takeaway. The bar for writing "this should never happen" code must be set so impossibly high that it might as well be translated into "'this should never happen' should never happen"


The problem with that is that most programming languages aren't sufficiently expressive to be able to recognise that, say, only a subset of switch cases are actually valid, the others having been already ruled out. It's sometimes possible to re-architect to avoid many of this kind of issue, but not always.

What you're often led to is "if this happens, there's a bug in the code elsewhere" code. It's really hard to know what to do in that situation, other than terminate whatever unit of work you were trying to complete: the only thing you know for sure is that the software doesn't accurately model reality.

In this story, there obviously was a bug in the code. And the broken algorithm shouldn't have passed review. But even so, the safety critical aspect of the complete system wasn't compromised, and that part worked as specified -- I suspect the system behaviour under error conditions was mandated, and I dread to think what might have happened if the developers (the company, not individuals) were allowed to actually assume errors wouldn't happen and let the system continue unchecked.


So what does your code do when you did not handle the "this should never happen" exception? Exit and print out a stack trace to stdout?


To be fair, the article suggests early on that sometimes these plans are being processed for flights already in the air (although at least 4 hours away from the UK).

If you can stop the specific problematic plane taking off then keeping the system running is fine, but once you have a flight in the air it's a different game.

It's not totally unreasonable to say "we have an aircraft en route to enter UK airspace and we don't know when or where - stop planning more flights until we know where that plane is".

If you really can't handle the flight plan, I imagine a reasonable solution would be to somehow force the incoming plane to redirect and land before reaching the UK, until you can work out where it's actually going, but that's definitely something that needs to wait for manual intervention anyway.


> "we have an aircraft en route to enter UK airspace and we don't know when or where - stop planning more flights until we know where that plane is".

Flight plans don't tell where the plane is. Where is this assumption coming from?


Presumably you need to know where upcoming flights are going to be in the future (based on the plan), before they hit radar etc.


For the most part (although there are important exceptions), IFR flights are always in radar contact with a controller. The flight plan is a tool that allows ATC and the plane to agree on a route so that they don't have to be constantly communicating. ATC 'clears' a plane to continue on the route up to a given limit, and expects the plane to follow the plan until that limit unless further instructions are given.

In this regard UK ATC can choose to do anything they like with a plane when it comes under their control - if they don't consider the flight plan to be valid or safe they can just instruct the plane to hold/divert/land etc.

I'm not sure the NATS system that failed has the ability to reject a given flight plan back upstream.


Mostly yes; however, there are large parts of the Atlantic and Pacific where that (radar contact) isn't true. I know the Atlantic routes are frequently full of planes that left the US and Canada heading for the UK.

I have no idea what percent of the volume into the UK comes from outside radar control; if they asked a flight to divert, that may open multiple other cans of worms.


> If they asked a flight to divert, that may open multiple other cans of worms.

Any ATC system has to be resilient enough to handle a diversion on account of things like bad weather, mechanical failure or a medical emergency. In fact, I would think the diversion of one aircraft would be less of a problem than those caused by bad weather, and certainly less than the problem caused by this failure. Furthermore, I would guess that the mitigation would be just to manually direct the flight according to the accepted flight plan, as it was a completely valid one.

One of the many problems here is that they could not identify the problem-triggering flight plan for hours, and only with the assistance of the vendor's engineers. Another is that the system had immediately foreclosed on that option anyway, by shutting down.


Flight plans do inform ATC where and when a plane is expected to enter their FIR though, no?


Only theoretically. In practice the only thing that usually matches is from which other ATC unit the plane is coming. But it could be on a different route and will almost always be at a different time due to operational variation.

That doesn't matter, because the previous unit actively hands the plane over. You don't need the flight plan for that.

What does matter is knowing what the plane is planning to do inside your airspace. That's why they're so interested in the UK part of the flight plan. Because if you don't give any other instructions, the plane will follow the filed routing. Making turns on its own, because the departing ATC unit cleared it for that route.


> the previous unit actively hands the plane over. You don't need the flight plan for that.

I thought practically, what's handed over is the CPL (current flight plan), which is essentially the flight plan as filed (FPL) plus any agreed-upon modifications to it?

> Because if you don't give any other instructions, the plane will follow the filed routing. Making turns on its own, because the departing ATC unit cleared it for that route.

Without voice or datalink clearance (i.e. the plane calling the new ATC), would the flight even be allowed to enter a new FIR?


To be fair that is exactly what the article said was a major problem, and which the postmortem also said was a major problem. I agree I think this is the most important issue:

> The FPRSA-R system has bad failure modes

> All systems can malfunction, so the important thing is that they malfunction in a good way and that those responsible are prepared for malfunctions.

> A single flight plan caused a problem, and the entire FPRSA-R system crashed, which means no flight plans are being processed at all. If there is a problem with a single flight plan, it should be moved to a separate slower queue, for manual processing by humans. NATS acknowledges this in their "actions already undertaken or in progress":

>> The addition of specific message filters into the data flow between IFPS and FPRSA-R to filter out any flight plans that fit the conditions that caused the incident.


Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.

Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code. Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.

For these kinds of things the post mortem and remediation have to kinda take as given that eventually an unhandled, unknown error that couldn't be predicted in advance will occur, and then work on how it could be handled better. Because of course the solution to a bug is to fix the bug, but the real issue, and the reason for the meltdown, is a DR plan that couldn't be implemented in a reasonable timeframe. I don't care what programming practices, what style, what language, what tooling: something of a similar caliber will happen again eventually with probability 1, even with the best coders.


I agree with your first paragraph, but your second paragraph is quite defeatist. I've been involved in quite a few "premortem" meetings where people think up increasingly improbable failure modes and devise strategies for them. It's a useful meeting to hold before large changes to critical systems go live. In my opinion, this should totally be a known error.

> Having found an entry and exit point, with the latter being the duplicate and therefore geographically incorrect, the software could not extract a valid UK portion of flight plan between these two points.

It doesn't take much imagination to surmise that real-world data is sometimes broken and you are handed data that doesn't have a valid UK portion of the flight plan. Bugs can happen, yes, such as in this case where a valid flight plan was misinterpreted as invalid, but gracefully dealing with an invalid plan should be a requirement.


> Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code.

I think there's a world of difference between writing bug free code, and writing code such that a bug in one system doesn't propagate to others. Obviously it's unreasonable to foresee every possible issue with a flight plan and handle each, but it's much more reasonable to foresee that there might be some issue with some flight plan at some point, and structure the code such that it doesn't assume an error-free flight plan, and the damage is contained. You can't make systems completely immune to failure, but you can make it so an arbitrarily large number of things have to all go wrong at the same time to get a catastrophic failure.


> Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.

How many KeyError exceptions have brought down your whole server? It doesn't happen because whoever coded your web framework knows better and added a big try-catch around the code which handles individual requests. That way you get a 500 error on the specific request instead of a complete shutdown every time a developer made a mistake.
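The pattern is dead simple; a sketch of it (the handler registry and names here are hypothetical, not any particular framework):

    # A bug in one handler produces a 500 for that one request; the process
    # keeps serving everything else.
    import logging
    import traceback

    handlers = {}  # path -> handler function, registered elsewhere

    def dispatch(path, payload):
        handler = handlers.get(path)
        if handler is None:
            return 404, "not found"
        try:
            return 200, handler(payload)
        except Exception:
            logging.error("handler for %s failed:\n%s", path, traceback.format_exc())
            return 500, "internal error"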


Crashing is a feature, though. Exceptions didn't raise themselves into interpreter specifications by accident. It just so happens that web apps don't need airbags that slow the business down.


That line of reasoning is how you have systemic failures like this (or the Ariane 5 debacle). It only makes sense in the most dire of situations, like shutting down a reactor, not input validation. At most this failure should have grounded just the one affected flight rather than the entire transportation network.


On a multi-user system, only partial crashes are features. Total crashes are bugs.

A web server is a multi-user system, just like a country's air traffic control.


I love that phrasing, I'm gonna use that from now on when talking about low-stakes vs high-stakes systems.


> big try-catch around the code which handles individual requests.

I mean, that's assuming the code isolating requests is also bug free. You just don't know.


> Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.

What surprised me more is that the dataset of all waypoints on the globe is quite small. If I were to implement a feature that queries them by name as an identifier, the first thing I'd do is check for duplicates in the dataset, because if there are any, I need to account for that condition in every place where I'd be looking up a waypoint by a potentially duplicated identifier.

I had that thought immediately when looking at the flight plan format and noticed the short strings referring to waypoints, well before getting to the section where they point out the name collision issue.

Maybe I'm too used to working with absurd amounts of data (at least compared to this dataset); a constant part of my job is doing some cursory data analysis to understand the parameters of the data I'm working with, which values can be duplicated or malformed, etc.
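The check I mean is only a few lines; a sketch, assuming a CSV with hypothetical ident/lat/lon columns:

    # Flag any waypoint identifier that appears more than once in the dataset.
    import csv
    from collections import defaultdict

    def find_duplicate_idents(path):
        by_ident = defaultdict(list)
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                by_ident[row["ident"]].append((float(row["lat"]), float(row["lon"])))
        return {ident: locs for ident, locs in by_ident.items() if len(locs) > 1}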


If there are duplicate waypoint IDs, they are not close together. They can be easily eliminated by selecting the one that is one hop away from the prior waypoint. Just traversing the graph of waypoints in order would filter out any unreachable duplicates.
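Roughly like this, as a sketch (the graph representation and names are assumptions on my part):

    # Keep only the duplicate that is reachable (one airway hop) from the
    # previously resolved waypoint.
    def resolve(prev_id, candidates, edges):
        # candidates: waypoint ids sharing one designator
        # edges: set of (from_id, to_id) airway segments
        reachable = [c for c in candidates if (prev_id, c) in edges]
        if len(reachable) == 1:
            return reachable[0]
        raise ValueError(f"cannot disambiguate {candidates} after {prev_id}")

    def resolve_route(start_id, designators, by_designator, edges):
        # by_designator: designator -> list of waypoint ids sharing that name
        route, prev = [start_id], start_id
        for d in designators:
            prev = resolve(prev, by_designator[d], edges)
            route.append(prev)
        return route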


That it's safety critical is all the more reason it should fail gracefully (albeit surfacing errors to warn the user). A single bad flight plan shouldn't jeopardize things by making data on all the other flight plans unavailable.


That's like saying that because one browser tab tried to parse some invalid JSON then my whole browser should crash.


Well yes because you're describing a system where there are really low stakes and crash recovery is always possible because you can just throw away all your local state.

The flip side would be like a database failing to parse some part of its WAL log due to disk corruption and just said, "eh just delete those sections and move on."


Crash the tab and allow all the others to carry on!

The problem here is that one individual document failed to parse.


The other “tabs” here are other airplanes in flight, depending on being able to land before they run out of fuel. You don’t just ignore one and move on.


Nonsense comparison, your browser's tabs are de facto insulated from each other, flight paths for 7000 daily planes over the UK literally share the same space.


You don't know that the JSON is invalid. Maybe the JSON is perfect and your parser is broken.


No, it's more like saying your browser has detected possible internal corruption with, say, its history or cookies database and should stop writing to it immediately. Which probably means it has to stop working.


It definitely isn't. It was just a validation error in one of thousands external data files that the system processes. Something very routine for almost any software dealing with data.


The algorithm as described in the blogpost is probably not implemented as a straightforward piece of procedural code that goes step by step through the input flightplan waypoints as described. It may be implemented in a way that incorporates some abstractions that obscured the fact that this was an input error.

If from the code’s point of view it looked instead like a sanity failure in the underlying navigation waypoint database, aborting processing of flight plans makes a lot more sense.

Imagine the code is asking some repository of waypoints and routes ‘find me the waypoint where this route leaves UK airspace’; then it asks to find the route segment that incorporates that waypoint; then it asserts that that segment passes through UK airspace… if that assertion fails, that doesn’t look immediately like a problem with the flight plan but rather with the invariant assumptions built into the route data.

And of course in a sense it is potentially a fatal bug because this issue demonstrates that the assumptions the algorithm is making about the data are wrong and it is potentially capable of returning incorrect answers.
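A sketch of the shape I mean (every name here is hypothetical; this is not the real FPRSA-R code):

    # Written against a repository abstraction, the duplicate-designator case
    # surfaces as a violated internal invariant, not as a suspect input.
    def uk_portion(plan, repo):
        # resolves a duplicated designator to the wrong waypoint
        exit_wp = repo.exit_waypoint_from_uk(plan)
        segment = repo.segment_containing(plan, exit_wp)
        if not repo.segment_crosses_uk(segment):
            # From down here this looks like corrupt route/waypoint data,
            # so aborting everything seems like the "safe" thing to do.
            raise RuntimeError("invariant violated: exit segment not in UK airspace")
        return segment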


I've had brief glimpses at these systems, and honestly I wouldn't be surprised if it took more a year for a simple feature like this to be implemented. These systems look like decades of legacy code duct-taped together.


> why could the system not put the failed flight plan in a queue

Because it doesn't look at the data as a "flight plan" consisting of "waypoints" with "segments" along a "route" that has any internal self-consistency. It's a bag of strings and numbers that's parsed, with the result passed along if parsing is successful. If not, give up. In this case, fail the entire system and take it out of production.

Airline industry code is a pile of badly-written legacy wrappers on top of legacy wrappers. (Mostly not including actual flight software on the aircraft. Mostly.) The FPRSA-R system mentioned here is not a flight plan system, it's an ETL system. It's not coded to model or work with flight plans, it's just parsing data from system A, re-encoding it for system B, and failing hard if it can't.


Good ETLs are usually designed to separate good records from bad ones, so even if one or two rows in the stream don't conform to the schema, you can put them aside and process the rest.

seems like poor engineering
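The textbook version is only a few lines; a sketch (names made up):

    # Sideline bad records for a human instead of failing the whole stream.
    def run_etl(records, transform, publish, dead_letter):
        ok = bad = 0
        for rec in records:
            try:
                publish(transform(rec))
                ok += 1
            except Exception as exc:
                dead_letter(rec, exc)   # park the record plus the reason, keep going
                bad += 1
        return ok, bad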


The problem is that it means you have a plane entering the airspace at some point in the near future and the system doesn't know it is going to be there. The whole point of this is to make sure no two planes are attempting to occupy the same space at the same time. If you don't know where one of the planes will be you can't plan all of the rest to avoid it.

The thing that blows my mind is that this was apparently the first time this situation had happened after 15 million records processed. I would have expected it to trigger much more often. It makes me wonder if there wasn't someone who was fixing these as they came up in the 4 hour window, and he just happened to be off that day.


Bad records aren't supposed to be ignored. They are supposed to be looked at by a human who can determine what to do.

Failing the way NATS did means that all future flight plan data, including for planes already in the sky, is no longer being processed. The safer failure mode was definitely to flag this plan and surface it to a human while continuing to process other plans.


> It makes me wonder if there wasn't someone who was fixing these as they came up in the 4 hour window, and he just happened to be off that day.

This is very possible. I know of a guy who does (or at least a few years ago did) 24x7 365 on-call for a piece of mission (although not safety) critical aviation software.

Most of his calls were fixing AWBs quickly because otherwise planes would need to take off empty or lose their take-off slot.

Although there had been some “bus factor” planning and mitigation around this guy’s role, it involved engaging vendors etc. and would have likely resulted in a lot of disruption in the short term.


Please tell me this guy is now wealthy beyond imagination and living a life of leisure?


I would love to. But it wouldn’t be true.


A one-in-15M chance with 7,000 daily flights over the UK handled by NATS meant it could be expected to happen roughly once every 70 months; it took a few months less than that.
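Back-of-envelope, using the same numbers:

    # 1-in-15M odds at ~7,000 flight plans per day
    days = 15_000_000 / 7_000      # ~2,143 days between expected occurrences
    print(days / 30.4)             # ~70 months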


I never said it was a good ETL system. Heck, I don't even know whether the specs for it specify what to do with a bad record; there are at least 300 pages detailing the system. Looking around at other stories, I see repeated mentions of how the circumstances leading to this failure are supposedly extremely rare, "one in 15 million" according to one official[1]. But at 100,000 flights/day (estimated), this kind of situation would occur, statistically, about twice a year.

1 https://news.sky.com/story/major-flights-disruption-caused-b...


This flight plan was correct though, if there was some validation like that then it should have passed.

The code that crashed had a bug, it couldn't deal with all valid data.


Because some software developers are crap at their jobs.


Related. Others?

Coincidentally-identical waypoint names foxed UK air traffic control system - https://news.ycombinator.com/item?id=37430384 - Sept 2023 (64 comments)

UK air traffic control outage caused by bad data in flight plan - https://news.ycombinator.com/item?id=37402766 - Sept 2023 (20 comments)

NATS report into air traffic control incident details root cause and solution - https://news.ycombinator.com/item?id=37401864 - Sept 2023 (19 comments)

UK Air traffic control network crash - https://news.ycombinator.com/item?id=37292406 - Aug 2023 (23 comments)


The recent episode of The Daily about the (US) aviation industry has convinced me that we’ll see a catastrophic headline soon. Things can’t go on like this.


The title of this post made me think there was a new, current meltdown !


The fact that they blamed the French flight plan already accepted by Eurocontrol proves that they didn't really know how the software works. And here the Austrian company should take part of the blame for the lack of intensive testing.


They blamed the French because they are British, that's it. It's hard to get rid of bad habits.


But, but, but... ...the EU!


The same instinct led to something called Brexit ;)


This is a great post. My reading of it:

- waypoint names used around the world are not unique

- as a sort of kludge, "In order to avoid confusion latest standards state that such identical designators should be geographically widely spaced."

- but still you might get the same waypoint name used twice in a route to mean different places

- the software was not written with that possibility in mind

- route did not compute

- threw 'critical exception' and entered 'maintenance mode' - i.e. crashed

- backup system took over, hit the same bug with the same bit of data, also crashed

- support people have a crap time

- it wasn't until they called the software supplier that they found the low-level logs that revealed the cause of the problem


"software supplier"??? Why on God's green earth isn't someone familiar with the code on 7/24 pager duty for a system with this level of mission criticality?


That would be... the software supplier. This is quite a specific fault (albeit one that shouldn't have happened if better programming practices had been used), so I don't think anyone but the software's original developers would know what to do. This system is not safety-critical, luckily.


I think there is a bit of ignorance about how software is sold in some cases. This is not just some Windows or browser application that was sold; it also came with staff training, help procuring the hardware to run the software, and maybe more. Such systems get closed off from the outside, without a way to send telemetry to the public internet (I've seen this before; it is bizarre and hard to deal with). The contract would have clauses dealing with such situations, so that you always have someone on call as the last line of defense if a critical issue happens. Otherwise, the trained teams should have been able to deal with it, but could not.


My jaw kept dropping with each new bullet point.


Same, is aviation technology really this primitive?


It is mostly quite primitive, but it also works amazingly well. For example ILS or VOR or ATC audio comms can all be received and read correctly using hardware built from entry level ham radio knowledge. Altimeters still require a manual input of pressure. Fuel levels can be checked with sticks.

Kinda the opposite of a modern web/mobile app, complicated, massively bloated and breaks rather often :).


It's worse than you know. Ancient computer systems, non-ASCII character encodings, analog phone lines, and ticker-tape weather.

You'll also be surprised to learn there's still parts of the US where there's no radar or radio coverage with ATC, if flying at lower altitudes. (Heck, there's still a part of the Pacific Ocean that doesn't have ATC service at any altitude.)

Aviation drove a lot of the early developments in networked computing, which also means there's some really old tech in the stack. The globally decentralized nature of it all and it being a life-critical system means it's expensive and complicated to upgrade. (And to be clear, it does get upgraded - but it in a backwards compatible way.) Today's ATC systems need to work with planes built in the 1950s, and talk to ATC units in small countries that still use ancient teletype systems and fax machines.

But yet it's all still incredibly safe, because the technology is there to augment human processes - not replace them. Even if all the technology fails, everything can still be done manually using pen and paper.


You might find it interesting that the SF subway runs on floppy disks. Not the fancy new 3.5" ones, either.

https://sfstandard.com/2023/02/02/sfs-market-street-subway-r...


And what non-primitive software do we have that is reliable? None that I know of.


Airline messaging is wild, this blog from 2010 knows what’s what: https://cos.livejournal.com/79455.html


shhh, nobody tell xvector that unleaded avgas finally happened in 2022 :)


Thanks for the summary and TL;DR.

Essentially this comes down to the lack of proper namespacing; who'd have thought aerospace engineers need to study operating systems! I have a friend who's a retired air force pilot and graduated from Cranfield University, the UK's foremost postgraduate institution for aerospace engineering, with its own airport for teaching and research [1]. According to him he did study OSes at Cranfield, and now I finally understand why.

Apparently, based on the other comments, a standard for namespacing is already available but isn't currently being used by NATS/ATC; hopefully they've learnt their lesson and will start using it, for goodness' sake. The top comment mentioned the geofencing bug, but if NATS/ATC used proper namespacing, geofencing probably wouldn't be necessary in the first place.

[1] Cranfield University:

https://en.wikipedia.org/wiki/Cranfield_University


It sounds like a great place to study that has its own ~2km long airstrip! It would be nice if they had a spare Trident or Hercules just lying around for student baggage transport :)


"the description sounds like the procedure is working directly on the textual representation of the flight plan, rather than a data structure parsed from the text file. This would be quite worrying, but it might also just be how it is explained."

Oh, this is typical of airline industry work. Ask programmers about a domain model or parsing and they give you blank stares. They love their validation code, and they love just giving up if something doesn't validate. It's all dumb data pipelines. At no point is there code that models the activities happening in the real world.

In no system is there a "flight plan" type that has any behavior associated with it or anything like a set of waypoint types. Any type found would be a struct of strings in C terms, passed around and parsed not once, but every time the struct member is accessed. As the article notes, "The programming style seems very imperative.".


Giving up when something doesn't validate is indeed standard, to avoid propagating badly interpreted data and causing far more complex bugs down the line. Validate soon, validate strongly, report errors, and don't try to interpret whatever the hell is wrong with the input; don't try to be 'clever', because therein lie the safety holes. Crashing on bad input is wrong, but trying to interpret data that doesn't validate, without specs (of course), is fraught with incomprehension and incompatibilities down the line, or unexpected corner cases (or untested ones, but no one wants to pay for a fully tested all-goes system, or even for the tools to simulate 'wrong inputs', or for formal validation of the parser and all the code using the parser's results).

There are already too many problems with non-compliant or legacy (or just buggy) data emitters, with the complexity in semantics or timing of the interfaces, to try and be clever with badly formatted/encoded data.

It's already difficult (and costly) to make a system work as specified, so adding subtle variations to make it more tolerant of unspecified behaviour is just asking for bugs (or for more expensive systems that don't clear the purchasing price bar).


There's a difference between parsing and validating. https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

You're right about all the buggy stuff out there, and that nobody wants to pay to make it better, though.


From a safety-critical standpoint, I've always found this article interesting but strange. You want both, before taking into account any data from anything outside the system. Do both. As soon as possible. Don't propagate data you haven't validated in every way your spec requires. If you have more stringent specs than any standard you're using, be explicit about it and reject the data with a clear failure report. Check for anything that could be corrupted, misformatted, or otherwise unexpected and able to cause unexpected behaviour.

I feel the lack of investment in destroying the parsing- (and validation-) related classes of bugs is the worst oversight in the history of computing. We have the tools to build crash-proof parsers (SPARK, Frama-C, and custom model-checked code generators such as RecordFlux); they aren't perfect in any way, but if they got even a tiny bit of the effort the security industry puts into mending all the 'Postel's law' junk out there, we'd be working on other stuff.

I built, with an intern, an in-house bit-precise code generator for deserializers that can be proved free of runtime errors, and am moving on to semantic checks ('field X and field Y can only be present together', or 'field Y must be greater than or equal to its value the previous time it was present'). It's not that hard, compared to many other proof and safety/security endeavours.
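For a flavour, the semantic layer on top of a structurally-correct decode looks roughly like this (a sketch; the field names are placeholders, not the real format):

    # Cross-field / cross-message checks after the structural decode.
    def check_semantics(msg, prev_msg):
        errors = []
        if (msg.get("x") is None) != (msg.get("y") is None):
            errors.append("x and y must be present together")
        if prev_msg and None not in (msg.get("y"), prev_msg.get("y")):
            if msg["y"] < prev_msg["y"]:
                errors.append("y must not decrease between messages")
        return errors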


> It's not that hard, compared to many other proof and safety/security endeavours.

Yes, but the code has to understand and model the input into a program representation: the AST. That's the essence of the "parse, don't validate" paradigm. Instead of looking at each piece of a blob of data in isolation to determine if it's a valid value, turn the input into a type-rich representation in the problem domain.
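In Python terms, the difference is roughly this (a sketch; the field set here is invented, not the real ADEXP/ICAO content):

    # "Parse, don't validate": build a typed value once at the boundary,
    # then pass that around instead of a bag of strings.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Waypoint:
        ident: str
        lat: float
        lon: float

    @dataclass(frozen=True)
    class FlightPlan:
        callsign: str
        waypoints: tuple[Waypoint, ...]

    def parse_plan(raw: dict) -> FlightPlan:
        wps = tuple(Waypoint(w["ident"], float(w["lat"]), float(w["lon"]))
                    for w in raw["waypoints"])
        if len(wps) < 2:
            raise ValueError("a plan needs at least two waypoints")
        return FlightPlan(callsign=raw["callsign"], waypoints=wps)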

In the case of the FPRSA-R system in question, it does none of that. It's simply a gateway to translate data in format A to data in format B, like an ETL system. It's not looking at the input as a flight plan with waypoints, segments and routes.

Why the programmers chose to do the equivalent of bluescreening on one failed input, I can't say. As others have pointed out, the situation it gave up on isn't so rare: 1 in 15 million will happen. Of course switching to an identical backup system is a bad choice, too. In safety-critical work, there needs to be a different backup, much like the Backup Flight System in the space shuttle or the Abort Guidance System on the Apollo Lunar Module: a completely different set of avionics, programmed independently.


One of the reasons developers 'let it crash' is because no one wants to pay for error recovery, and I mean the whole design (including system level), testing, and long-term maintenance of barely used code.

THAT SAID, isolating the decoding code and data structures, and having a way back via checkpoint/restore or wiping out bad state (or proving the absence of side effects, as SPARK dataflow contracts allow, for example), is better design, and I wish it were taught more often. I really dislike how often exception propagation is taught without showing how to handle side effects...


That's super interesting (and a little terrifying). It's funny how different industries have developed different "cultures" for seemingly random reasons.


It was terrifying enough for me in the gig I worked on that dealt with reservations and check-in, where a catastrophic failure would be someone boarding a flight when they shouldn't have. To avoid that sort of failure, the system mostly just gave up and issued the passenger what's called an "Airport Service Document": effectively a record that shows the passenger as having a seat on the flight, but unable to check-in. This allows the passenger to go to the airport and talk to an agent at the check-in desk. At that point, yes, a person gets involved, and a good agent can usually work out the problem and get the passenger on their flight, but of course that takes time.

If you've ever been at the airline desk waiting to check in and an agent spends 10 minutes working with a passenger (or passengers), it's because they got an ASD and the agent has to screw around directly in the user-hostile SABRE interface to fix the reservation.


SABRE is pretty good compared to the card file it replaced.


It's better to say SABRE replicated, in digital form, that card file. And even today the legacy of that card form defines SABRE and all the wrappers and gateways to it.


"UK air traffic control: inquiry into whether French error caused failure"

Of course bloody not. How is it a French airline's fault when it's a UK system? Systems like this should be foolproof with redundancies.

If one entry is bad reject it and carry on, even.


Well if you want to get all nationalistic, the software was Austrian.


It's not nationalism just because I'm defending one country in favour of another.

I'm not French and nor am I British. I feel neutral about both of them though I do live in the UK.

It's just logic, not nationalism. :/


But built to UK spec


A day I don't want to remember. Took me 15 hours to reach my destination instead of 2. Had to take train, bus, then train again. 30 minutes after I had booked my tickets, everything was fully booked for two days.


I waited in the airport for 6 hours before learning that my flight was cancelled, and had to rebook... I was flying to New York to see my family, so I didn't really have any alternate transportation options!


That's a shame, sorry to hear that. I got more lucky: I had to wait for 6 hours too but my flight suddenly resumed (must have been one of the first few). I didn't have any alternatives to go home either so I feel for all of those stuck in a foreign country.


Did you meet John Candy along the way?


I wish the article contained some explanation of why the processing for NATS requires looking at both the ADEXP waypoints and the ICAO4444 waypoints (not a criticism per se, it may not have been addressed in the underlying report). Just looking at the ADEXP seems sufficient for the UK segment logic.

I'm guessing it has something to do with how ICAO4444 is technically human readable, and how in some meaningful sense, pilots and ATC staff "prefer" it. e.g., maybe all ICAO4444 waypoints are "significant" to humans (like international airports), whereas ADEXP waypoints are often "insignificant" (local airports, or even locations without any runway at all).

Of course with 20/20 hindsight, it seems obviously incorrect to loop through the ICAO4444 waypoints in their entirety, instead of "resuming" from an advanced position. But why look at them at all?


They use the ADEXP to determine which part of the route is in the UK, because the auto-generated points are ATC area handover points. So this data is the best way to see which part of the route is within UK airspace.

Then it needs to find the ICAO part that corresponds, because the controller needs to use the ICAO plan that the pilot has.

If the controller sees other (auto generated) waypoints that the pilots don't have you get problems during operation. A simple example is that controllers can tell pilots to fly in a straight line to a specific point on their filed route (and do so quite often). The pilot is expected to continue the filed route from that point onwards.

They can also tell a pilot to fly direct to some random other point (this also happens but less often). The pilot is then not expected to pick up a route after that point.

The radio instruction for both is exactly the same, the only difference is whether the point is part of the planned route or not. So the controller needs to see the exact same route as the pilots have, not one with additional waypoints added by the IFPS system.


Thanks for this explanation. So, is the ICAO plan in some sense the "single source of truth", being the international standard and all?


Yes, thanks, t0mas88. I have some more questions but I can feel myself drifting into the weeds a bit, and your comment was extremely helpful as-is.


Possibly it needs the ICAO information to communicate with some systems, but has to work in ADEXP to have sufficient granularity (the essay mentions the possibility of “clipping”, a flight going through the UK between two ICAO waypoints).


Yes, I'm essentially wanting to know more about those existing ICAO-based systems, be they machine or not.


What I don’t understand in situations like this when thousands of flights are cancelled is how do they catch up? It always seems like flights are at max capacity at all times, at least when I fly. If they cancel 1,000 flights in one day, how do they absorb that extra volume and get everyone where they need to be? Surely a lot of people have their plans permanently cancelled?


There's always some empty capacity, whether it's non-rev tickets for flight crew and their families which are lower priority than paying customers or people who miss their flights.

I had a cancelled flight recently and they booked people two weeks out because every flight from that day onward was full or nearly full. I showed up the next morning and was able to board the next flight because exactly one person had scanned their boarding pass (i.e. was present at the airport) but, for whatever reason, did not show up at the airplane.

Beyond that, people just make alternate plans, whether it's taking a bus or taxi home, traveling elsewhere, picking another airline, anything is possible.


You don't.

I work in logistics for a FMCG company and sometimes our main producer goes down and we run out of certain types of stock. We send as much out as we can and cancel the rest.

If they really want the stock the customers can rebook an order for tomorrow because they aren't getting it today. And we just start adding extra stock to each delivery.

It's the best of a bad situation.

We don't have the money to have extra trucks and very perishable stock laying about and I know the airlines don't pay 300 grand a month to lease a 737 just to have it sat about doing nothing. There's very little slack.


Exactly. People's plans get pushed out into the evenings or during the less busy times, absorbed, then forgotten as collateral damage.


I heard in the news that this was caused by a "bad flight plan".

It was clear, even without any more information than that, that this was a software failure ("bad flight plan"?).

It will be interesting to see if Frequentis has to pay a price for causing this


Yeah, it seems clear from the report NATS published that it wasn't a bad flight plan at all... the plan was valid per the relevant specifications

But the specifications allow ambiguities (non-unique waypoint ids) and the software did not handle this particular ambiguity correctly


I think Frequentis will actually be paid more to “add feature to make the system more robust” and a bump in support schedules.


I had been considering becoming an air traffic controller myself, and it rather tickles me to think I might have missed my once-in-a-lifetime opportunity to direct aircraft with the original pen-and-paper flight strip mechanism in the 21st century! Completely safe, excruciatingly low-capacity, and sounds like awfully good fun as a novelty (for the willing ATC, not the passengers stuck on the ground, I hasten to add).


Quite a few non-major airports still rely heavily on pen-and-paper methods to some degree.

Examples are islands that serve a few flights per week and can't justify heavy upgrade investments.

Airplanes are generally spaced by hours and you need to do the math about where they are by hand. But again, there are so few planes that the risks are minimal.


Indeed, but the set of aerodromes that are large enough to have a tower controller but not large enough to have their own radar surveillance is shrinking all the time. Radar is getting cheaper and what with ADS-C and TA/RA, a big reason to have ATC even without radar is vanishing (namely that of preventing collisions close to the airport). Oceanic control is probably the closest you can get nowadays to routine ATC without radar, even though they now have automatic position reports via satellite.


I think an island in the middle of the Atlantic that is mostly used for refueling is exactly that kind of airport.

Can't remember the name but I'm quite sure it belongs to Portugal.


Was it part of the Azores?


If you want to hear about how bad air traffic control is in the United States, you can listen/read here https://www.nytimes.com/2023/09/05/podcasts/the-daily/plane-...

There was a time recently when only 3 out of the 300+ air traffic control centers in the U.S. were fully staffed. All the rest were short-handed. Not sure how it stands today


Every system I've ever made has better error reporting than that one. Even those that only I use. The first thing I get working in a new project is the system to tell me when something fails and to help me understand and fix the problem quickly. I then use that system throughout development so that it works very well in production. I'd love to talk to the people who made the system discussed in the article. Is one of them reading this? Can you explain how this problem came to report itself so badly?


Yes, it seems like incredibly lame error reporting that they had to spend hours contacting the original vendor (to "analyse low-level software logs") just to find out which flight plan had crashed the system


Trusted input should rarely be trusted. It's input. You need to validate it as if it were hostile and have a process for dealing with malformed input. Now of course, standing on the sidelines it is easy to criticize, and I'm sure whoever worked on this wasn't stupid. But I've seen this error often enough in practice that I think it needs to be drilled into programmers' heads more forcefully: stuff is only valid if you have just validated it. If you send it to someone else, if someone you trust sends it to you, if you store it in a database and then retrieve it, and so on, then it is just input all over again and you probably should validate it for being well-formed. If you don't do that then you're one bitflip, migration or update away from an error that will cause your system to go into an unstable state, and the real problem is that you might just propagate the error downstream because you didn't identify it.

Input is hard. Judging what constitutes 'input' in the first place can be harder.


From what I gathered from the article, the input WAS valid. It's the software that was unable to handle a specific case of valid input.


That's fine, and is exactly the kind of case that I was thinking of: your software has a different idea of what is valid than an upstream piece of software, so from your perspective it is invalid. So you need to pull this message out of the stream, sideline it so it can be looked at by someone qualified enough to make the call of what's the case (because it could well be either way) and processing for all other messages should continue as normal. After all the only reason you can say with confidence that it in fact was valid is because someone looked at it! You can only do that well after the fact.

A message switch [1] that I worked on had to deal with messages sourced from hundreds of different parties, and while in principle everybody was working from the same spec (CCITT [2]), every day some malformed messages would land in the 'error' queue. Usually the problem was on the side of the sender, but sometimes (fortunately rarely) it wasn't, and then the software would be improved to handle that case correctly as well. Given the size of the specs and the many variations on the protocols, it wasn't weird at all to see parties get confused. What's surprising is that it happens as rarely as it does.

The big takeaway here should be that even if something happens very rarely it should still not result in a massive cascade, the system should handle this gracefully.

[1] https://www.kvsa.nl/en/

[2] https://en.wikipedia.org/wiki/Group_4_compression


Exact same experience developing systems that process RFC-822 (and descendents) email messages.


This really isn't about input. Whether it comes from outside or is produced inside the application, the reality is that everything can have bugs. A correct input can cause a buggy application to fail. So while verifying input is obviously an important step, it's not even a beginning if you are really looking to build reliable software.

What really is the heart of the matter is for the entire thing to be allowed to crash due to a problem with single transaction.

What you really want to do is to have firewalls. For example, you want a separate module that runs individual transactions and a separate shell that orchestrates everything but has no or very limited contact with the individual transactions. As bad as giving up on processing a single aircraft is, allowing the problem to cascade to entire system is way worse.

What's even more tragic about this monumental waste of resources is that the knowledge about how to do all of this is readily available. The aerospace and automotive industries have very high development standards, along with people you can hire who know those standards and how to use them to write reliable software.


Yes, there are multiple problems here that interplay in a really bad way and that's one of them. But the input processing/validation step is the first point of contact with that particular flight plan and it should have never progressed beyond that state.

It all hinges on a whole bunch of assumptions and each and every one of those should be dealt with structurally rather than by patching things over.

Just from reading TFA I see a very long list of things that would need attention. Quick recap:

- validate all input

- ensure the system can never stall on any one record

- the system will occasionally come across malformed input which needs a process

- it won't be immediately clear whether the system or the input is at fault, which needs a process

- testing will need to take these scenarios into account

- negative tests will need to be created (such as: purposefully malformed input)

- attempts should be made to force the system into undefined states using malformed and well formed input

- a supervisor mechanism needs to be built into the system that checks overall system health

And probably many more besides. But this is what I gather from the article is what they'll need at a minimum. Typically once you start digging into what it would take to implement any of these you'll run into new things that also need fixing.

As for the last bit of your comment: I'm quite sure that those standards were in play for this particular piece of software, the question is whether or not they were properly applied and even then there are no guarantees against mistakes, they can and do happen. All that those standards manage to do is to reduce their frequency by catching the bulk of them. But some do slip through, and always will. Perfect software never is.


This is not the first time this has happened; the phenomenon has even got a name - "poison flight plan".


> the phenomenon has even got a name - "poison flight plan".

Maybe, but it must not be a common phrase because your comment is the first result when I search for it.

And it is also mentioned in this article: http://www.aero-news.net/subsite.cfm?do=main.textpost&id=ce2...

And that's about it? Do you have any other sources?


I think that term was invented four days ago by that article's writer. There are four other occurrences before then and they're about PS2 games.


This term was in wide circulation when I was consulting at NATS in the 2000-2005 time frame.


That is indeed very curious. So you're saying NATS was aware of this vulnerability?


Ironically the term about name clashes has a name clash!


The generic term I'm familiar with is "ping of death".


For those of you still following this story, the flight plan that triggered the chaos has been identified!

https://chaos.social/@russss/111048524540643971

> Tonight we were wondering why nobody had identified the flight which caused the UK air traffic control crash so we worked it out. It was FBU (French Bee) 731 from LAX/KLAX to ORY/LFPO.

> It passed two waypoints called DVL on its expanded flight plan: Devil's Lake, Wisconsin, US, and Deauville, Normandy, FR (an intermediate on airway UN859).

> https://www.flightaware.com/live/flight/FBU731/history/20230...

> Credit to @marksteward and @benelsen for doing much of the legwork here.


I want to comment specifically on:

> The software and system are not properly tested.

Followed by suggesting to do fuzzing tests.

* Automatically generating valid flight paths is somewhat hard (and you'd have to know which ones are valid, because the system, apparently, is designed to reject some paths as well). It's also possible that such a generator would produce valid but improbable flight paths. There's probably an astronomical number of possible flight paths, which makes exhaustive testing impossible, so there's no guarantee that a "weird" path would've been found. The points through which the paths go seem to be somewhat dynamic (i.e. new airports aren't added every day, but over the life-span of such a system there will probably be a few added). More realistically, some points on flight paths may be removed. Does the fuzzing have to account for the possibility of new / removed points?

* This particular functionality is probably buried deep inside other code with no direct or easy way to extricate it from its surroundings, and so would be very difficult to feed into a fuzzer. Which leads to the question of how much fuzzing should be done and at what level. Add to this that some testing methodologies insist on divorcing testing from development so as not to create an incentive for testers to automatically okay the output of development (as they would be sort of okaying their own work). This is not very common in places like the Web, but is common in e.g. medical equipment (it's actually in the guidelines). So, if the developer simply didn't understand what the specification told them to do, then it's possible that external testing wasn't capable of reaching the problematic code path, or was severely limited in its ability to hit it.

* In my experience with formats and standards like these it's often the case that the standard captures a lot of impossible or unrealistic cases, hopefully a superset of what's actually needed in practice. Flagging every way in which a program doesn't match the specification becomes useless or even counter-productive because developers become overloaded with bug reports most of which aren't really relevant. It's hard to identify the cases that are rare but plausible. The fact that the testers didn't find this defect on time is really just a function of how much time they have. And, really, the time we have to test any program can cover a tiny fraction of what's required to test a program exhaustively. So, you need to rely on heuristics and gut feeling.


None of this really argues against fuzz testing; even with completely bogus/malformed flight plans, it shouldn't be possible for a dead letter to take down the entire system. And, since it's translating between an upstream and downstream format (and all the validation is done when ingesting the upstream), you probably want to be sure anything that is valid upstream is also valid downstream.

It's true that fuzz testing is easiest when you can do it more at the unit level (fuzz this function implementing a core algorithm, say) but doing whole-system fuzz tests is perfectly fine too.
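For what it's worth, even a property-based test at the system boundary is cheap to write. A sketch using Hypothesis, where process_flight_plan and PlanRejected are hypothetical stand-ins for the real entry point:

    # Property: no input string may crash the converter; it either returns a
    # result or raises the one documented per-plan rejection error.
    from hypothesis import given, strategies as st

    from fprsa import process_flight_plan, PlanRejected  # stand-ins, not a real module

    @given(st.text(alphabet=st.characters(min_codepoint=32, max_codepoint=126),
                   max_size=2000))
    def test_never_crashes(raw_plan):
        try:
            process_flight_plan(raw_plan)
        except PlanRejected:
            pass  # a clean, per-plan rejection is fine; anything else is a bug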


This is not against the principle of fuzz testing. This is to say that the author doesn't really know the reality of testing and is very quick to point fingers. It's easy to tell in retrospect that this particular aspect should've been tested. It's basically impossible to find such defects proactively.


I've read both messages and I'm still unsure how fuzz testing would not have brought up similar edge cases.

We're literally talking about a parser shutting down an entire system rather than reporting malformed data.

Considering this is a "one in 15M" case, it seems to me that fuzz testing would've caught this, and probably more bugs, in a short time span.


Easy for me to say in retrospect, but IMO this is a textbook example of where you should reach for fuzz testing; it's basically protocol parsing. You have a well-known text format upstream and you need to ensure your system can parse all well-formed protocol messages and, at the very least, not crash if a given message is invalid in your own system.

Similarly with a message queue, handling dead letters is textbook stuff, and you must have system tests to verify that poison pills do not break your queue.

I did not think the author was setting unreasonable expectations for the a priori testing regime. These are common best practices.


This all sounds like exactly the stuff that fuzzing or property-based testing is good for

And if the functionality is "buried deep inside other code with no direct or easy way to extricate it from its surrounding" making it hard to test then that's just a further symptom of badly designed software in this case


What I still don't understand is how flight plans get approved.

In my mind they would only be approved once all involved countries have reviewed and processed the plan. That way we wouldn't need this ridiculous idea of failing safe across the whole UK airspace for a single error.

That day a single flight plan could have been rejected, perhaps just resubmitted and the bug quietly fixed in the background


>"in typical Mail Online reporting style: "Did blunder by French airline spark air traffic control issues?"

The Daily Mail is a horrible, right-wing paper in the UK that blames 'foreigners' for everything. Particularly the French.

Out of curiosity, is there a corresponding French paper that blames the English or the British for everything?


French here; as much as I wish it were the case for comical effect… I don't think so.

Our right wing press is also desperately economically liberal so anything privately run is inherently better.

Maybe radio stations? Honestly, major respect to the daily mail for those snarky attacks that keep up the good spirits between our two countries.

Is it maybe the food or the weather that makes them aggro? Idk, but don't worry, we love to hate perfidious Albion too.

Fellow French: am I wrong? Maybe Valeurs actuelles could pull that type of bullshit, but I think they're too busy blaming Islam to start thinking about our former colony across the Channel.


>major respect to the daily mail for those snarky attacks

There is really nothing to like or respect about the Daily Mail. https://www.globaljustice.org.uk/blog/2017/10/horrible-histo...

>our former colony across the channel

Touché! ;0)


Isn't that why France has a President? The position was created just so there's someone to blame. lol


Our last iteration of the constitution grants them large powers and leeway

But indeed, the unspoken rule is also that we hate them with a passion no matter what.


Well not really. People in France don't really care that much about England.

The one country that is often blamed for problems is rather Germany, but honestly even Germany doesn't get blamed for petty problems like that.


> Safety critical software systems are designed to always fail safely. This means that in the event they cannot proceed in a demonstrably safe manner, they will move into a state that requires manual intervention.

unrelated - this instantly caused me to think about tesla autopilot crashes that have been reported with emergency vehicles


It made me think of the “put a traffic cone on it” denial of service attack


Has the culprit flight-plan been disclosed? I'd be interested to know how easy it is to create a realistic looking flight-plan through UK airspace that reproduces the problem. I.e. how much truth is there when NATS say this was a 1 in 15m probability?


This is an interesting engineering problem and I'm not sure what the best approach is. Fail safe and stop the world, or keep running and risk danger? I imagine critical systems like trading/aerospace have this worked out to some degree.


There isn't and cannot be a preference for either one. It always depends on what the system is doing and what the consequences would be... A pacemaker cannot "fail safe", for example, under any circumstances. It's meaningless to consider such cases. But if escalation to a human operator is possible, then it also depends on how the system is meant to be used. In some cases it's absolutely necessary that the system doesn't try to handle errors (e.g. if a patient is in a CT machine, you always want to stop, at the very least to prevent more radiation), but in a situation like the one with flight control, my guess is that you want the system to keep trying while alerting the human operator.

But then it can also depend on what's in the contract and who will get the blame for the system functioning incorrectly. My guess here is that failing without attempting to recover was, while overkill, a safer strategy than letting, say, two airplanes be scheduled for the same path (and potentially collide).


Absolutely no idea on what is correct, but I love to reference this article on software practices at NASA[0], They Write the Right Stuff.

[0] https://www.fastcompany.com/28121/they-write-right-stuff


The best approach is to simply print the error to the screen, rather than burying it in a “low level log” which only the software vendor has access to.

They had a four hour buffer until the world stopped, but most of that was pissed away because no one knew what the problem was.


Bugs happen. Fact of being written by fleshy meatballs. What should also have been highlighted is that they clearly had no easy way of finding the specific buggy input in the logs nor simulating it without contacting the manufacturer.


It sounds like a simple functional smoke test throwing random flight plans at the system would have eventually (and probably pretty soon) triggered this. I hope they at least do it now.

This reminds me of: https://danluu.com/wat/


No way or no procedure.


I once worked with 4G BTS (Base Transceiver Stations), where one of the issues was preventing errors in the running board from propagating to the backup systems. There was no clean way to do it, given that the malformed input will eventually reach the backup system and produce the same error. The post talks about the system delaying processing to protect the backup. Perhaps a solution would be to go in the other direction and have a staging step to avoid compromising the pipeline. Very interesting article.


Well, I certainly hope they've at least stopped issuing waypoints with identical names... although it wouldn't surprise me if geographically-distant is the best we can do as a species.


They appear to be sequences of 5 upper-case letters. Assuming the 26-character alphabet, that should allow for nearly 12 million unique waypoint IDs. The world is a big place but that seems like it should be enough. The more likely problem is that there is (or was) no internationally-recognized authority in charge of handing out waypoint IDs, so we have at least legacy duplicates if not potential new ones.
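The raw numbers, with a crude pronounceability proxy thrown in (the at-least-one-vowel rule is just an assumption for illustration):

    print(26 ** 5)            # 11,881,376 five-letter designators
    print(26 ** 5 - 21 ** 5)  # 7,797,275 of them contain at least one vowel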


Not all 5-character long strings are usable though. They have to be pronounceable as a single word and understandable over radio as much as possible.


You have to reduce that to the (still massive) set of IDs that are somewhat pronounceable in languages that use the Latin script. You don't want to be the air traffic controller trying to work out how to say 'Lufthansa 451, fly direct QXKCD'. Nonetheless, I think there is little cause for concern about changing existing IDs. There might be sentimental attachment, but it takes barely a few flights before the new IDs start sticking, and it's not like pilots never fly new routes.


I thought that is what the "ICAO pronunciation" was for?

"Fly direct Quebec Xray Kilo Charlie Delta"


No, waypoints aren't spelled out with the ICAO alphabet. They are mnemonics that are pronounced as a word and only spelled out if the person on the receiving end requests it because of bad radio reception, or unfamiliarity with the area/waypoint.

For example, Hungarian waypoints, at least the more important ones, are normally named after cities, towns or other geographical locations near them, and use the location's name or an abbreviated form of it, being careful that they can be pronounced reasonably easily by English speakers. Like: ERGOM (for the city Esztergom), ABONY (for the town Füzesabony), SOPRO (for Sopron), etc.


It is, but fixes are almost always spoken as words rather than letter-by-letter. For this reason, they are usually chosen to be somewhat pronounceable, and occasionally you even get jokes in the names. Likewise, radio beacons and airports are usually referred to by the name of their location; for instance "proceed direct Dover" rather than "proceed direct Delta Victor Romeo".

I think a lot of pilots and air traffic controllers would be irritated if they had to spend longer reading out clearances and instructions. In a world where vocal communication is still the primary method of air traffic control, there might be a measurable reduction in capacity in some busier regions.


I really enjoy the joke names.

Portsmouth, NH has a Sylvester/Tweety Bird approach: ITAWT, ITAWA, PUDYE, TTATT, followed by IDEED for the missed approach.

https://www.pilotsofamerica.com/community/threads/unique-way...

Australia has WALTZ, INGMA, TILDA, and also WONSA, JOLLY, SWAGY, CAMBS, BUIYA, BYLLA, BONGS

https://www.cntraveler.com/stories/2015-06-02/a-pilot-explai...

Disney has a whole lot of special fixes in Orlando and Anaheim. The PIGLT arrival passes through HKUNA, MTATA, JAZMN, JAFAR, RFIKI, TTIGR. I'm fairly sure I've heard about some variants on MICKY, MINEE, GOOFY, PLUTO, etc.

https://aerosavvy.com/wp-content/uploads/2016/04/MCO-PIGLT-S...

According to the same article, Louisville has LUUKE – IAMUR – FADDR.


My favourite in the USA, which I'll never forget, is outside Washington Dulles: NEVVR just next to FORGT.


My question is: why was the algorithm searching any section before the UK entry point? You can't exit at a waypoint before you enter, so there is no reason to search that space.
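
If the route is held as an ordered list of waypoints, one way to express that constraint (a sketch, not the actual FPRSA-R logic) is to only search the portion of the route after the entry point for the exit:

    def find_exit(route, entry_index, uk_boundary_names):
        # Only waypoints strictly after the UK entry point can be the exit.
        for i in range(entry_index + 1, len(route)):
            if route[i] in uk_boundary_names:
                return i
        raise ValueError("no UK exit waypoint found after the entry point")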


> The manufacturer was able to offer further expertise including analysis of lower-level software logs which led to identification of the likely flight plan that had caused the software exception.

This part stood out to me. I've found it super helpful to include a reference to which piece of data I'm working with in log messages and exceptions. It helps isolate problems so much faster.


Did the creator of the flight plan software engage in adversarial testing to see if they could break the system with badly formed flight plans? Or was / is the typical practice to mostly just see if the system meets just the "well-behaved" flight plan processing requirements? (with unit tests, etc)


I think we all know the answer to this.

A huge portion of "exploits" in the last 20 years have been, if you will, "internal business APIs" being exposed to malicious actors.


It must suck to be responsible for a system that everyone depends on and millions of dollars are riding on so you are very reluctant to change it, even if you know it needs technical improvements.

Formal verification or fuzzing could have helped them over that mistrust, but are not panaceas


I imagine, for this kind of system, there is only one supplier. Why not force that supplier, as part of their 10-15 year contract, to publish the source code for everything, not necessarily as FOSS? This way, if there are bugs, they can be reported and fixed.


I agree. But this would assume that:

1- the people writing and approving the specs even understand why this might be a good suggestion

2- the people ultimately approving the contract aren't in bed with the supplier


3- the people operating the system are capable of maintaining its source code


There's always prison for those people.


Poison Pill! Why on earth would the best failure mode be to cease operating? Just don’t accept the new plan being ingested and tell the person uploading that their plan was rejected. Impact one flight not thousands!


I wondered this too. I have absolutely no understanding of what's involved in flight system development, but does anyone know why it doesn't do this?

By contrast, it's normal for an API to return 500 if something goes wrong and keep serving other requests. It would seem insane if it crashed out and completely stopped. Any idea why the parallel isn't true for a flight system?


OK, if the system finds something that it does not understand, what should it do, and how does the programmer know it will work?


> Flight Plan Reception Suite Automated (FPRSA-R)

Where does the "-R" come from?


Replacement.


Lol that's like me naming my filenames _final2_realfinal before I learned about git.


This looks like the perfect definition of "a man with two watches is always confused about the time".


Great writeup


They should have used Erlang OTP


Interesting to see that flight plans over the UK have to be filed 4 hours in advance.

No mention of plane, pilot, passenger or cargo manifests. So why the 4-hour lead time? Is this the time it takes UK authorities to look people up, or to work out whether the cargo could be dangerous: an airborne Anthrax (Gruinard) Island [1], a Japanese subway Sarin attack [2], or an IRA favourite, a fertiliser bomb that has bypassed the usual purchase-reporting regulations, of the sort used by people like Jeremy Clarkson and Harry Metcalfe as a store of wealth [3]?

It makes me wonder just how much more surveillance of the population exists, knowing I can't even step out of the front door without attracting surveillance of the type that followed Dr David Kelly.

Sure, it's not a cyber attack per se, carried out over the internet like a DDoS attack or a brute-force password-guessing attack with port-knocking mitigation, but how would one carry out a cyber attack on this system if the only attack vector is people submitting flight plans?

There sure is a constant playing down of the cyber-attack angle to this, which makes me think someone wants to blur the lines!

One point on the lack of uniquely named global waypoints, which, if some are to be believed, is the main crux of why the system fell over.

The USA has a disproportionate number of similar names, by virtue of Europeans migrating to the US [4]. So has this situation arisen with this system in other parts of the world, like in the US? How can a country that created the globe-spanning British Empire become so insular with regards to air travel in this way?

I'd agree with the initial assessment that there appears to be a lack of testing, but are the specifications simply not fit for purpose? I'm sure various pilots could speak out here, because some of the regulations require planes to be minimally distanced from each other when transiting across the UK.

On the point of ICAO and other bodies failing to eradicate non-unique waypoint names, it's clear there is some legacy constraint still impeding the safety of air travellers, perhaps caused by poor-quality analogue radio audio, so perhaps it's time for the unambiguous and globally recognised What 3 Words form of location identifier to come into effect?

The UK police already prefer it to speed up response times [5]. And although the same location can produce 3 different words, suggesting drift with GPS [6], even if What 3 Words could not be used for a global system, something a bit longer that creates an easily recognisable, globally unique human identifier is needed for these flight plans and perhaps maritime situations.

Obviously global coordination will be like herding cats, and if such a fixed-size global network of cells were introduced, some areas, like transits over the Atlantic or Pacific, could command bigger cells, while transits over built-up areas like London would require smaller identifiable cells. But IF ever there was a time for the New World Order to step up to the plate and assert itself, to create a Globally Unique Place ID (GUPID) for the whole planet, now is the time.

On the point that human lives were kept safe: they were kept safe only by the sheer common sense of the pilots and control tower staff, so it's not something NATS did or should claim. Their systems were down, so everyone had to resort back to pen and paper and blocks in queues. And apart from Silverstone when the F1 British Grand Prix is on, is airspace ever that densely populated?

NATS were caught with their pants down at so many levels of altitude. Is this the laissez-faire UK management style that saw the Govt having to step in and bail out the banks during the financial crisis, still infecting other parts of UK life and still coming to light?

It's beginning to look a lot like Christmas!

[1] https://www.youtube.com/watch?v=_8Zr0IPtx80

[2] https://www.youtube.com/watch?v=RTr1lquCQMg

[3] https://youtu.be/LS54AJSadT4?t=279

[4] https://en.wikipedia.org/wiki/List_of_U.S._places_named_afte...

[5] https://www.bloomberg.com/news/articles/2019-03-21/u-k-polic...

[6] https://support.what3words.com/en/articles/2212837-why-do-i-...


> so why the 4 hour lead time

To answer your question without conspiracy drivel, let's look up CAP 694: The UK Flight Planning Guide [0]

Chapter 1

> 6.1 The general ICAO requirement is that FPLs should be filed on the ground at least 60 minutes before clearance to start-up or taxi is requested. The "Estimated Off Block Time" (EOBT) is used as the planned departure time in flight planning, not the planned airborne time.

> 6.3 IFR flights on the North Atlantic and on routes subject to Air Traffic Flow Management, should be filed a minimum of 3 hours before EOBT (see Chapter 4).

Chapter 4

> 1.1 The UK is a participating State in the Integrated Initial Flight Plan Processing System (IFPS), which is an integral part of the Eurocontrol centralised Air Traffic Flow Management (ATFM) system.

> 4.1 FPLs should be filed a minimum of 3 hours before Estimated Off Block Time (EOBT) for North Atlantic flights and those subject to ATFM measures, and a minimum of 60 minutes before EOBT for all other flights.

So the answer is because the UK is part of a Europe-wide air traffic control system, which hands out full flight plans to all the relevant authorities for each airspace, and they decided 3 hours is needed so that all possible participants can get their shit together and tell you if they accept the plan or not.

An entirely separate system exists to share Advance Passenger Information, i.e. passenger manifests [1], and it goes even further: airlines share your overall identity with each other, known as a Passenger Name Record [2], and a variety of countries, led by the USA, insist on this information in advance before the plane is allowed to take off [3]

If you're going to be paranoid, please work with known facts instead of speculating.

[0] https://publicapps.caa.co.uk/docs/33/CAP%20694.pdf

[1] https://en.wikipedia.org/wiki/Advance_Passenger_Information_...

[2] https://en.wikipedia.org/wiki/Passenger_name_record

[3] https://en.wikipedia.org/wiki/United_States%E2%80%93European...


So on point 3 then, why do countries turn people away at the destination and not before take-off, if their visa or passport is not in order?

Are their systems not joined up, or does the state just like making examples of people once they're in the destination country? I can watch this stuff happening to people at airport border controls on TV all the time, so which is it: are their systems not joined up, or do they just like making examples of people?


Most countries do not have prescreening or data sharing treaties. Even in the case where two countries do have it, not all entrance criteria can be determined by electronic records. Countries reserve the right to check the traveller themselves before they permit entry.

Here's a list of common reasons for the UK to refuse entry on arrival: https://www.gov.uk/government/publications/suitability-refus...

Some examples:

* Your valid travel document (as communicated digitally) turns out to be invalid when you actually present it

* You got a tourist visa, not a work visa, so why do you have work tools in your luggage?

* You turn up infected with Ebola


Airline employees routinely turn people away at the departure airport due to visa/passport paperwork not being in order. Timatic is the usual system that most airlines subscribe to for this kind of thing. Airlines are highly incentivized to avoid letting a passenger board who won't be admitted, because they're on the hook for returning that passenger. But an airline employee at the departure airport is never going to be a perfect proxy for an immigration officer in every country they fly to, and immigration officers generally have wide latitude in who they accept/reject. It is entirely possible that all your paperwork is in order but the immigration control officer rejects you for other reasons.


So how do they spot the fake passports then?

And is being on the hook for returning passengers who are not allowed into the destination country a legal obligation?

I got caught in the US when Hurricane Katrina was making landfall, and while we were flown out by our carrier, other carriers in this situation would also have honoured our ticket and flown us back to the UK.

It seemed like an exodus where all the carriers just got people out of the country as quickly as possible. We were on the last flight out of the airport, but this isn't a legal thing, is it?


Software has bugs; that's not really the damning part... The damning part is that in four hours and two levels of support teams, there was no one who actually knew anything about how the system worked and could remove the problematic flight plan so that the rest of the system could continue operating!

What exactly is the point of these support teams when they can't fix the most basic failure mode (a single bad input...)


Unfortunately, I work on a reasonably modern ERP system which has been customized significantly for the client and also works with a wider range of client-specific data combinations that the vendor has seemingly not anticipated / other clients do not have.

What it means is that on a regular basis, teams will be woken up at 2am because a batch process aborted on bad data; AND it doesn't tell you what data / where in the process it aborted.

The only possibility is to rerun the process with crippling traces, and then manually review the logs to find the issue, remove it, and then re-run the program again (hopefully remembering to remove the trace:).

Even when all goes per plan, this can at times take more than 4 hrs.

Now, we are not running a mission-critical real-time system like air traffic, and I'm in NO way saying any of this is good; but it may not be the case that "two levels of support teams didn't know anything": the system could just be so poorly designed that, even with the best operational experience and knowledge, it still took that long :-< .

On HN, we take a certain level of modernity, logging, failure states, messaging, and restartability for granted, which may not be even remotely present on more niche or legacy systems (again, NOT saying that's good; just indicating the issue may be less about operational competence than about design). It's easy to judge from our external perspective, but we have no idea what was presented / available to the support teams, and what their mandatory process is.


Just guessing:

They bought software from a third party and treat it as a "black box". There are a few known ways that the software fails, and the local team has instructions on how to fix them. But if it fails in an unexpected way, good luck: it's impossible for the local team to identify and fix the problem without the vendor.

The reason it took so long was that they realized too late that they needed to call the vendor.

Probably you have to blame managers rather than the engineers in the support team.


Considering this same failure has happened a few times in recent memory, maybe it's over-optimistic of me to expect an entry on the support wiki or something.


> Considering this same failure has happened a few times in recent memory

Which previous instances are you thinking about?


One important software engineering skill that is often overlooked is the art of writing just the right amount of logging, such that one has sufficient information to debug easily when things go wrong, but it is not so verbose that it will be ignored or pruned in production.


And when did you last test your monthly backups? But seriously. If you fill out all the positions in an org chart it's easy to think you're delivering, and for a lot of situations it usually works. Anointing someone a manager usually works out because people can muddle through. It doesn't work in medicine, or as it turns out, air traffic control.

Lesson learned for about the next ~5 years.


I wouldn't expect level 1 and level 2 to be able to diagnose a problem like this

level 3 (devs) should have been brought in much quicker though


Having worked in tech support: level 3 (Devs) should have described their source code structure to level 2, and let them access it when they needed it.


You don’t need a complete diagnosis if you can spit out enough debug info that says, “oops, shat the bed while working with this flight plan”; then the support people can remove the one that’s causing you to fail, restart the system, and tell ATC to route that one manually.
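
Roughly, the per-plan loop just needs to name the offending input and park it instead of halting; a hedged Python-flavoured sketch, with all names invented:

    import logging

    def process_batch(flight_plans, process_one, dead_letter):
        for plan in flight_plans:
            try:
                process_one(plan)
            except Exception:
                # Name the exact input instead of burying a generic
                # "critical exception" in a log only the vendor can read.
                logging.exception("failed while processing flight plan %r; parking it", plan)
                dead_letter.append(plan)  # hand this one to ATC to route manually
                # keep going: one bad plan shouldn't stop thousands of others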


> What exactly is the point of these support teams when they can't fix the most basic failure mode (a single bad input...)

To collect money on support contracts, I suspect.


Try to get developers who love to code and create to stay on a support team and be on an on-call roster. I betcha at least half will say no, and the other half will either leave or you'll run out of money paying them.


They were probably on vacation


Great post. This part goes too far, I think:

> Human lives were kept safe at all times

> The consequence of all this was not that any human lives were put in danger, ..

When you're arguing that cancelling 2000 flights cost £100M and that no human danger was incurred, something should feel off. That might be around 600k humans who weren't able to be where they felt they needed to be. Did they have somewhere safe to sleep? Did they have all the medications they needed with them? Did they have to miss a scheduled surgery? Could we try to measure the effect on their well-being in aggregate, using a metric other than the binary state of alive or facing imminent death? You get the idea.

Of course I agree with the version of the claim that says that no direct danger was caused from the point of view of the failing-safe system. But when you're designing a system, it ought to be part of your role to wonder where risk is going as you more stringently displace it from the singular system and source of risk that you maintain.


I mean it could have also saved lives by that logic. Did someone missing their flight mean they also missed a terrible pileup on the roadways after landing? We can imagine pretty much any scenario here.


I agree with you that we don't know! But my thesis is that we should still do our best, when considering how much risk the systems we maintain should be willing to keep operating through.


But how many lives were saved by the reduced carbon emissions that were not produced by the cancelled flights?


Of course they blamed the French ^^


As is tradition =)

(We actually have no major issues with the French, at least in my generation; it's all just good fun)


It’s all good. We still take cheap-shots at English food and English women ;)

Edit: I lived in London for 3 years. I miss it every day.


I wouldn't worry about it, we take cheap shots at French food and French people too ;)


Hold on, I think we take cheap shots at French people, but _expensive_ shots on the food and wine


Sir, those are dueling words.


[flagged]


Well, that's the Daily Mail for you, where they tag anything parenting- or health-related as the "Femail" section... because, you know, only women are looking at that stuff.

Lol.

Anyway, I actually think that's just a reasonable response: a system goes down (or a related system goes down), and in the review they are making frivolous updates to names that aren't needed.

I would question those updates (while they may be a minor part of the overall updates occurring).


At least until the 70s most newspapers had a section called "Women" or something similar. Even the news about the 60s/70s women's movement appeared there, not in the main "news" sections. Those sections were mostly renamed around that time to "Lifestyle", "Home", or just "Features".


Is this the UK or US edition? It's always easy fun to have a go at the Daily Mail which presumably you read regularly else you wouldn't be commenting. Its sin seems to be that it's not a serious broadsheet. It's a tabloid with very broad appeal that has to be profitable and therefore tries to reflect the requirements of the British public for such a publication. Perhaps you should lower your expectations.

'Tag anything parenting or healthy ...'? No, that's not correct. Here are a few health & food related items back to mid-September that did not appear in 'Femail'. You are right about parenting; most parenting in the UK is still undertaken primarily (in terms of executive action) by females, so items on this topic are reasonably included in 'Femail'. The growing number of people who don't have children probably appreciate this sub-grouping by the Mail. You may not approve, but this is what happens. Single males with dependent children are not known for objecting to checking out that section. It's not forbidden.

https://www.dailymail.co.uk/wires/pa/article-12505173/Health... https://www.dailymail.co.uk/wires/ap/article-12504751/Eggpla... https://www.dailymail.co.uk/health/article-12504649/Suicide-... https://www.dailymail.co.uk/health/article-12504813/Anthony-... https://www.dailymail.co.uk/health/article-12503801/Cancer-n... https://www.dailymail.co.uk/wires/reuters/article-12503815/W... https://www.dailymail.co.uk/wires/reuters/article-12503299/R... https://www.dailymail.co.uk/news/article-12468365/One-woman-... https://www.dailymail.co.uk/wires/reuters/article-12502685/W... https://www.dailymail.co.uk/wires/ap/article-12501533/Food-r... https://www.dailymail.co.uk/news/article-12490747/How-safe-c...


The Daily Mail is actually a site I frequent multiple times a day, every day.

Not all content is for everyone, but they've got something; they are definitely tabloid style.

They narrate particular views to the public but cover all sorts of content, and a lot of it I would consider advertisements/plugs rather than actual articles.

I would guess a largely elderly/conservative majority base.

They pander to the lowest common denominator, which is fine -- they are a for-profit news tabloid, and I find some of it entertaining (as per my daily visits).

Do you work for them, or are you just a big fan, to do all that digging in defence of the DM over my exaggeration that ALL content like that goes in that category? I didn't take my own comment all that seriously, so honest ask.


> tries to reflect the requirements of the British public

The issue though is that quite often the Dailymail doesn't reflect, but rather controls the requirements of the British public.


Except that was a completely different incident and it occurred in the United States, not the UK. The Daily Mail did try to make hay out of the idpol angle, but the British can't reasonably be accused of shirking responsibility for the FAA grounding flights in the US.


> The programming style is very imperative

Is that supposed to be a meaningful statement?


Yes, typically it would be used to mean things like the code mutates data in place rather than using persistent data structures, explicitly loops over data rather than using higher-order map, fold etc. operations, and explicitly checks tag bits rather than using sum types.


Fine, I'll give you that (it sounds like a generic description), but there's nothing like that in the description given in the article or in the paragraph immediately before that statement. It's almost as if the author completely made that up.


[flagged]


> the author of the blogpost is one of those functional programming proselytizers ... Such pure, very monoid in the category of endofunctors

Sorry, who's sneering here?


Thanks for making me laugh :D


This is what happens when you de-industrialize your nation and focus on things like finance that bring in quick and cheap money.


"Jesus, what a clusterfuck!" - J.K. Simmons in Burn After Reading


What ticked me off is that when the primary system threw in the towel, an EXACT copy of that system took over and ran the exact same code on the exact same data as the primary. I know that with code and algorithms it's not always the case, but even then, you know what doing the same thing over and over while expecting different results defines...

Yes, it can be argued that the software should've had more graceful failure modes and this shouldn't have thrown a critical exception. It can be argued that the programmers should've seen this possibility. We can argue a lot of things about this.

But the reality is that this is a mission-critical system. And for such systems, there are ways to mitigate all of these mistakes and allow the system to continue functioning.

The easiest (but least safe) one would be to have the secondary system loaded with code that does the same thing but written by a different team/vendor. It reduces the chance that an input which provokes an unforeseen, system-breaking bug in the primary will provoke the same bug in the secondary from 100% to much, much less.

An even better solution is to have a triumvirate system, where all 3 have code written by different teams, and they always compare results. If 3 agree, great, if 2 agree, not so great but safe to assume that the bug is in the 1 not the 2 (but should throw an alert for the supervisors that the whole system is in a degraded mode where any further node failure is a showstopper), and if all disagree, grind everything to a halt because the world is ending, and let the humans handle it.
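
A toy sketch of that 2-out-of-3 vote, assuming three independently written implementations exposed as plain functions with comparable outputs:

    def triumvirate(plan, impl_a, impl_b, impl_c, alert):
        results = [impl(plan) for impl in (impl_a, impl_b, impl_c)]
        for candidate in results:
            votes = results.count(candidate)
            if votes == 3:
                return candidate
            if votes == 2:
                alert("degraded: one implementation disagrees")  # supervisors take note
                return candidate
        alert("no two implementations agree")
        raise RuntimeError("grind to a halt and hand over to the humans")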

It can be refined even further. And it's not something new. So why wasn't this system implemented in such a way? (Aside from cost. I don't care about anyone's cost-cutting incentives in mission-critical systems. Sorry, capitalism...)


> an EXACT SAME system took over and ran the exact same code

Did you ever work with HA systems? Because this is how they work. It's two copies of the same system, intended for cases when e.g. the hardware fails, or network partitioning happens, etc.


No, I do not. But HA systems work like that because hardware or network failure is what they are designed to guard against, not a latent bug in the software logic. If there's a software bug, both systems will exhibit the same behavior, so HA fails there.


In practice, you have two kinds of HA systems (based on this criteria):

* Live + standby. Typically, the state of the live system is passively replicated to the standby, where the standby is meant to take over if it doesn't hear from the live one / the live one sends nonsense. (For example, you can use the Kubernetes API server in this capacity).

* Consensus systems where each actor plays the same role, while there's an "elected" master which deals with synchronization of the system state. (For example, you can use Etcd).

In either case, it's the same program, but with a somewhat different state.

It doesn't make sense to make different programs to deal with this problem because you will have double the amount of bugs for no practical gain. It's a lot more likely that two different programs will fail to communicate with each other than one program communicating with its own replica. Also, if you believe you were right the first time, why would you make the other one different? You will definitely want to choose the better of the two and have copies of that, rather than have a better and a worse one working together...


How can you tell whether the problem is due to a software bug or due to a hardware fault though? The software could have thrown the "catastrophic failure, stop the world" exception due to memory corruption.


I'm wondering if the backup system could have a delayed queue; say, 30 seconds behind. If the primary fails, and exactly 30 seconds later the secondary system fails, you have reasonable assurance that it was queue input that caused the failure. Roll back to the last successful queue input, skip and flag the suspect input, and see if the next input is successful.
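
A rough sketch of that delayed-standby idea (all names made up), where the standby trails the live feed by a fixed delay and skips the message the primary is known to have died on:

    import collections, time

    class DelayedStandby:
        """Processes the same input stream as the primary, but DELAY seconds behind."""
        DELAY = 30

        def __init__(self, process):
            self.process = process
            self.queue = collections.deque()   # (arrival_time, message)

        def feed(self, message):
            self.queue.append((time.monotonic(), message))

        def tick(self, primary_failed_on=None):
            while self.queue and time.monotonic() - self.queue[0][0] >= self.DELAY:
                _, message = self.queue.popleft()
                if message == primary_failed_on:
                    continue   # skip and flag the suspect input instead of dying too
                self.process(message)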


This looks to me like it could work, but would need a ready force of technicians always expecting something like that so they can troubleshoot it in a timely manner.


> Aside from cost. I don't care about anyones cost-cutting incentives in mission-critical systems. Sorry capitalism...

Capitalism is happy to have redundancy in mission critical systems all the time. Why would it care here?


I don't know, but in recent years I'm increasingly seeing mission-critical systems having only token or "apparent" redundancies instead of real ones, and I couldn't find any other rationale than cost savings and shareholder bottom lines. I'm not saying that capitalism = bad; it's mostly better than the alternatives, but just like its most direct competitor, it suffers from bad implementations across the world and unbounded human greed.

A recent and very "in your face" example, also from the air travel industry, would be the B737 MAX and its AoA sensors. There were two, for two flight computers, but MCAS only used 1 flight computer and 1 AoA sensor, despite the already existing cross-links between the flight computers and the sensors...

Profit-maxing first with the "no need for a new type rating for the pilots", then cost-cutting in aeronautical engineering (solving an airframe design problem with software, plus designing a flight envelope protection system that can overpower the human pilots).

Then cost-cutting in software engineering and QC, rushing out software made by engineers (probably) inexperienced in the field and failing to properly test it and ensure that it had the needed redundancy.


This is apparently just an opinion, with no more inside information than we had from the report (https://news.ycombinator.com/item?id=37401981), isn't it?

EDIT: downvoting this question instead of responding is a pretty strange reaction.


You are correct, but it's an opinion that bridges the gap editorially between those knowledgeable about ATC but not data, and those knowledgeable about data but not ATC. This is a valuable service to provide, as both fields are rather complex.


Thanks. I didn't have the patience to read it all. I initially hoped that the author was a field expert or even someone with inside knowledge, but he is apparently from a completely different domain and not in the UK, and there were assumptions about things the report was rather specific about (as specific as such reports usually are). It would be more useful if people would take a closer look at the report and draw the right conclusions about organizational failures and how to avoid them. All the great software technologies to achieve memory safety, etc. are of little use if the analyses and specifications are flawed or the assumptions of the various parties in a system of systems do not match. But people seem to prefer to speculate and argue about secondary issues.


Dude, this isn't Reddit; don't worry about the votes.


Since it's not Reddit but HN, it's all the stranger to dismiss a perfectly legitimate question. But times and mores seem to change much faster than I realize.


It's because your question was poorly phrased - it sounds like you are trying to dismiss the value of the submission for no apparent reason. If you genuinely want to know the answer to a question, don't start with your conclusion and append "isn't it?" to turn it into a question. Just say something like "I don't have time to read the article. Does the author provide any industry expertise to the incident beyond what was in the original report?".


It's not poorly phrased. It's my conclusion after spending ten minutes with the text and I was interested whether others came to the same conclusion, which apparently is the case. It also turned out that the author is not even a specialist and has no affiliation with an involved organization. But apparently people prefer to read and discuss arbitrary opinions.


This is one of the many reasons there should be a universal data standard using a format like JSON. Heavily structured, easy to parse, easy to debug. What you lose in footprint (i.e., more disk space), you gain in system stability.

Imagine a world where everybody uses JSON and if they offer an API, you can just consume the data without a bunch of hoop jumping. Failures like this would vanish overnight.


Parsing the data formats had zero contribution to the problem. They had a problem running an algorithm on the input data, and error reporting when that algorithm failed. Nothing about JSON would improve the situation.


Yes, but look at the data. The algorithm was buggy because the input data is a nightmare. If the data didn't look like that, it's very unlikely the bug(s) would have ever existed.


ADEXP sounds like the universal data standard you want then. The UK just has an existing NATS that cannot understand it without transformation by this problematic algorithm. So the significant part of your suggestion might be to elide the NATS specific processing and upgrade NATS to use ADEXP directly.

Using a JSON format changes nothing. Just adds a few more characters to the text representation.


I have seen bad outages caused by valid JSON whose consumer implemented something incorrectly.

I agree with dundarius that "doing this in JSON" would not have changed the likelihood the bug could have manifested.


No change at all? I find that hard to believe. There's also a data design problem here, but the structure of JSON would aid in, not subtract from, that process.

The question at hand is: "heavily structured data vs. a blob of text as input into a complex algorithm, which one is preferred?"

Unless you're lying, you'd choose the former given the option.


The issue is using both ADEXP and ICAO4444 waypoints, and doing so in a sloppy way. For the waypoint lists, there is no issue with structurelessness -- the fact that they're lists is pretty obvious, even in the existing formats. Adding some ["",] would not have helped the specific problem, as the relevant structure was already perfectly clear to the implementers. I am not lying when I say the bug would have been equally likely in a JSON format in this specific case.


Now I'm wigging out to the idea of how the act of overcoming the inertia of the existing system just to migrate to JSON would spawn thousands of bugs on its own — many life-threatening, surely.


These old standards ARE heavily structured data, despite what their formatting or lack of punctuation suggests.


To me, an XML-ified version of this would look more nightmarish than the status quo... it's just brief, space-separated and \n-terminated ASCII. No need to overcomplicate things this simple.


> The algorithm was buggy because the input data is a nightmare.

No, the algorithm was "buggy" because it didn't account for the entry to and exit points from the UK to have the same designation because they're supposed to be geographically distant (they were 4000Nm apart!) and the UK ain't that big.
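
To illustrate (a toy sketch with fictional coordinates): a naive "first match by name" lookup goes wrong in exactly the same way whether the route arrived as JSON, XML or fixed-width text, because the ambiguity is in the data, not the serialization.

    waypoints = [
        {"name": "ABCDE", "lat": 48.0, "lon": 0.1},    # fictional European fix
        {"name": "ABCDE", "lat": 48.1, "lon": -98.0},  # fictional US fix, same name
    ]

    def find_by_name(name):
        # Returns whichever duplicate happens to come first in the list;
        # without a geographic sanity check, the format can't save you.
        return next(wp for wp in waypoints if wp["name"] == name)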


There are already standards like XML and RDF Turtle that allow you to clearly communicate vocabulary, such that a property 'iso3779:vin' (shorthand for a made-up URI 'https://ns.iso.org/standard/52200#vin') is interpreted in the same way anywhere in the structures and across API endpoints across companies (unlike JSON, where you need to fight both the existence of multiple labels like 'vin', 'vin_no', 'vinNumber', as well as the fact that the meaning of a property is strongly connected to its place in the JSON tree). The problem is that the added burden is not respected at the small scale and once large scale is reached, the switching costs are too big. And that XML is not cool, naturally.

On top of that, RDF Turtle is the only widely used standard graph data format (as opposed to tree-based formats like JSON and XML). This allows you to reduce the hoop jumping when consuming responses from multiple APIs as graph union is a trivial operation, while n-way tree merging is not.

Finally, RDF Turtle promotes use of URIs as primary identifiers (the ones exposed to the API consumers) instead of primary keys, bespoke tokens, or UUIDs. Followig this rule makes all identifiers globally unique and dereferenceable (ie, the ID contains the necessary information on how to fetch the resource identified by a given ID).

P.S.: The problem at hand was caused by the algorithm that was processing the parsed data, not with the parsing per se. The only improvement a better data format like RDF Turtle would bring is that two different waypoints with the same label would have two different URI identifiers.


Furthermore, there are already XML namespaces for flight plans. These are not, however, used by ATC, only by pilots to load new routes into their aircraft's navigation computers.

I'm not sure whether there is an existing RDF ontology for flight plans; it would probably be of low to medium complexity considering how powerful RDF is and the kind of global-scale users it already has.


Airport software predates basically every standard on the planet. I would not be surprised to learn that they have their own bizarro world implementation of ASCII, unix epoch time, etc.


Yes, FPL messages are sent over AFTN, which uses ITA-2 Baudot code instead of ASCII: https://en.wikipedia.org/wiki/Baudot_code

The keyboards used by ATC don't even allow entering symbols: https://www.reddit.com/media?url=https%3A%2F%2Fi.redd.it%2F1...

(There is a modern replacement for AFTN called AMHS, which replaces analog phone lines with X.400 messages over IP... but the system still needs to be backwards compatible for ATC units still using analog links.)


It won't fix anything. JSON is the "standard" today, 15 years ago it was XML and in 15 years we will have protobuf or another new standard.


Correct. The other "leg" of a solution to this problem would be to codify migration practices so stagnation at the tech level is a non-issue long-term.


You could do all that stuff.

But after you did it, you'd still have exactly the same problem. The cause was not related to deserialization. That part worked perfectly. The problem is the business logic that applied to the model after the message was parsed.


> codify migration practices

I think this won't work: no one really wants to touch a system that works, and people will try to find any excuse to avoid migrating. The reason for this is that everyone prefers a system that works and fails in known ways over a new system that no one knows how it can fail.


Does the system work if it randomly fails and collapses the entire system for days?

People generally prefer to be lazy and to not use their brains, show up, and receive a paycheck for the minimum amount of effort. Not to be rude, but that's where this attitude originates. Having a codified process means that attitude can't exist because you're given all of the tools you need to solve the problem.


> Having a codified process means that attitude can't exist because you're given all of the tools you need to solve the problem.

Yes, but in real life it doesn't work. Processes have corner cases. As you said, people are lazy and will do everything to find the corner case to fit into.

Just an example from the banking sector: there are processes (and even laws) that force banks to use only certified, supported and regularly patched software, yet there are still a lot of Windows 2000 servers in their datacenters, and they will be there for many years.


There are several XML formats for expressing flight plans, most notably ARINC 633 and FIXM.


Broadly speaking I think this is done for new systems. What you need to identify here is how and when you transition legacy systems to this new better standard of practice.


I'd argue in favor of at least an annual review process. Have a dedicated "feature freeze, emergencies only" period where you evaluate your existing data structures and queue up any necessary work. The only real hang up here is one of bad management.

In terms of how, it's really just a question of Schema A to Schema B mapping. Have a small team responsible for collection/organization of all the possible schemas and then another small team responsible for writing the mapping functions to transition existing data.

It would require will/force. Ideally, too, jobs of those responsible would be dependent on completion of the task so you couldn't just kick the can. You either do it and do it correctly or you're shopping your resume around.


The problem is systems written in the 1970s in FORTRAN to run on Mainframes don't speak JSON.


Great. It should be fixed by replacing the FORTRAN systems with a modern solution. It's not that it can't be done, it's that the engineers don't bother to start the process (which is a side-effect of bad incentive structure at the employment level).


No migration of this magnitude is blocked because of engineers not "bothering" to start the process. Imagine how many approvals you'd need, plus getting budget from who-knows how many government departments. Someone is paying for your time as an engineer and they decide what you work on. I'm glad we live in a world where engineers can't just decide to rewrite a life or death system because it's written in an old(er) programming language. (Not that there is any evidence that this specific system is written in anything older than C++ or maybe Ada.)


That's... not how that works. I take it you're probably more of a frontend person than a backend person from this comment. In the backend world, you usually can't fully and completely replace old systems; you can only replace parts of systems while maintaining full backwards compatibility. The most critical systems in the world -- healthcare, transportation, military, and banking -- all run on mainframes still, for the most part. This isn't a coincidence. When these systems get migrated, any issues, including issues of backwards compatibility, cause people to /DIE/. This isn't an issue of a button being two pixels to the left after you bump frontend platform revs; these systems are relied on for the lives and livelihoods of millions of people, every single day.

I am totally with you wishing these systems were more modern, having worked with them extensively, but I'm also realistic about the prospect. If every major airline regulator in the world worked on upgrading their ATC systems to something modern by 2023 standards, and everything went perfectly, we could expect to no longer need backwards compatibility with the old system sometime in 2050, and that's /very/ optimistic. These systems are basically why IBM is still in business, frankly.


Many of them have been upgraded. In the US, we've replaced HOST (the old ATC backend system) with ERAM (the modern replacement) as of 2015.

However, you have to remember this is a global problem. You need to maintain 100% backwards compatibility with every country on the planet. So even if you upgrade your country's systems to something modern, you still have to support old analog communication links and industry standard data formats.


Have you ever been involved in such a migration?

It’s invariably a complete clusterfuck.


I haven't, but I'd love to. My approach wouldn't be very "HR friendly," though.


Ah yes, migration through sheer force of will.


It's trivial. Only took Amadeus hundreds of developers working for over a decade to migrate off TPF. /s

[0] https://amadeus.com/en/insights/blog/celebrating-one-year-fu...


In some sense, yes. Notice that most of the responses to what I've said are immediately negative or dismissive of the idea. If that's the starting point (bad mindset), of course nothing gets fixed and you land where we are today.

My initial approach would be to weed out anyone with that point of view before any work took place (the "not HR friendly" part being to be purposefully exclusionary). The only way a problem of this scope/scale can be solved is by a team of people with extremely thick skin who are comfortable grabbing a beer and telling jokes after they spent the day telling each other to go f*ck themselves.


Anyone who has worked with me knows that I have no issue coming in like a wrecking ball in order to make things happen, when necessary. I've also been involved in some of these migration projects. I think your take on the complexity of these projects (and I do mean inherent complexity, not incidental complexity) and the responses you've received is exceptionally naive.

The amount of wise-cracks and beers your team can handle after a work day is not the determinate factor in success. /Most/ of these organizations /want/ to migrate these systems to something better. There is political will and budget to do so, but these are still inglorious multi-decade slogs which cannot fail, ever, because failure means people die. No amount of attitude will change that.


> The amount of wise-cracks and beers your team can handle after a work day is not the determinate factor in success.

Of course it isn't. But it's a starting point for building a team that can deal with what you describe (a decade-plus long timeline, zero room for failure, etc). If the people responsible are more or less insufferable, progress will be extremely difficult, irrespective of how talented they are.


I guess we should rewrite it in Rust.

Airplane logistics feels like one of the most complicated systems running today. A single airline has to track millions of entities: planes, parts, engineers, luggage, cargo, passengers, pilots, gate agents, maintenance schedules, etc. Most of which was created all before best-practices were a thing. Not only is the software complex, but there are probably millions of devices in the world expecting exactly format X and will never be upgraded.

I have no doubt that eventually the software will be Ship of Theseus-ed into something approaching sanity, but there are likely to be glaciers of tech debt which cannot be abstracted away in anything less than decades of work.


It would still be valuable to replace components piece by piece, starting with rigorously defining internal data structures and publicly providing schemas for existing data structures so that companies can incorporate them.

I would like to point out that the article (and the incident) does not relate to airline systems; it is to do with Eurocontrol and NATS and their respective commercial suppliers of software.


The bug here was a processing one, having the data in json would make no difference.


The problem was not in the format, but with the way the semantics of the data is understood by the system. It could be fixed-width, XML, json, whatever, and the problem would still be the same.


So the "engineering teams" couldn't tail /var/log/FPRSA-R.log and see the cause of the halt?

I've had servers and software that I had never, ever used before stop working, and it took a lot less than four hours to figure out what went wrong. I've even dealt with situations where bad data caused a primary and secondary to both stop working, and I've had to learn how to back out that data and restart things.

Sure, hindsight is easy, but when you have two different systems halt while processing the same data, the list of possible causes shrinks tremendously.

The lack of competence in the "engineering teams" tells us lots about how horribly these supposedly critical systems are managed.


You're assuming that there is in fact a /var/log/FPRSA-R.log to tail - it would not at all surprise me if a system this old is still writing its logs to a 5.25 inch floppy in Prestwick or Swanwick^1.

^1: they closed the West Drayton centre about twenty years ago; I don't imagine they moved their old IBM 9020D too, if they still had it by then. My comment is nonetheless only slightly exaggerated ;)


Damn, if only you had been there to instantly save the day by just running that simple command!


No. That's silly. The logs would've / should've just shown that the program halted because it was confused about data. The actual commands to fix would've been quite different.


Small suggestion: don't choose an obscure language (in terms of popularity, 28th on the TIOBE index with a 0.65% rating) to visualize structure and algorithms. Otherwise you risk that the average reader will stop reading the moment they encounter code samples. There are 27 more popular languages, some of them orders of magnitude more popular.


Maybe he doesn’t care if people stop reading and he’d prefer to use the language he’s most comfortable with? It’s his blog after all, not yours.

Additionally, perhaps he’s making the point that a language with an expressive type system makes solving problems like this trivial.


If you don't care whether readers read it or not, then what is the point of publishing an article?


I read it. Probably lots of other people did too. Presumably the people who don’t think computer science begins and ends with JavaScript


Why does there have to be a point? If there is one, why do you need to understand it?


I appreciated the Haskell examples, they aren't particularly hard to follow. How do you think those more popular languages got more popular?


The code is a relatively small part of the article, and quite far into it I might add.



