So they forgot to "geographically disparate" fence their queries. Having built a flight navigation system before, I know this bug. I've seen this bug. I've followed the spec to include a geofence to avoid this bug.
1. Pilots occasionally have to fat finger them into ruggedized I/O devices and read them off to ATC over radios.
2. These are defined by the various regional aviation authorities. The US FAA will define one list (and they'll be unique in the US), the EU will have one (EASA?), etc.
The AA965 crash (1995-12-20) was due to an aliased waypoint name: Colombia had two waypoints with the same name ('R') within 150 nautical miles of each other, in violation of ICAO regulations dating from around the '70s.
I'm trying to imagine someone ensuring differentiation between minimums.unsettled.depends (Idaho), minimums.unsettled.depend (Alaska), minimums.unsettles.depend (Spain), and minimum.unsettles.depend (Russia) while typing them in on a T9-style keypad with a 7-character display in turbulence.
The word list is 40,000 words long, so without plurals there probably aren't enough words that people could spell or even pronounce. A better fix would be making it "what four words" - I wonder if they'd already committed too much to the "three" concept before discovering the flaw? Either way, using phony statistics to make unwarrantable claims of accuracy is a poor workaround.
Since the app gives you the words to say, and translates those back to coordinates on the receiving end, in theory they could alter the word list, at the cost of making any written-down version obsolete.
Maybe they should release a new service called What4ActuallyVettedWordsAndWordCombinations ;)
Something like what3words might be useful, but what3words itself doesn't have enough "auditory distance" between words. (i.e. - there are some/many words used by what3words that sound similar enough to be indistinguishable over an audio channel with noise.)
Something like FixPhrase seems better for use over radio.
There are a number of word lists whose words were picked due to their beneficial properties given the use-case of possibly needing to be understood verbally over unclear connections. The NATO phonetic alphabet, and PGP word lists come to mind: https://en.wikipedia.org/wiki/PGP_word_list
I'm particularly a fan of the PGP word list (it would definitely require more than 3 words for this purpose, though) because it has built-in error detection (of transposition, insertion or deletion): Separate word lists are used for "even" and "odd" hex digits. This makes it, IMHO, fairly ideal for use over verbal channels. From the wiki: "The words were carefully chosen for their phonetic distinctiveness, using genetic algorithms to select lists of words that had optimum separations in phoneme space"
It sounds like the w3w folks did not do any such thing
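For illustration, here's a rough sketch of that even/odd scheme, with tiny made-up word lists standing in for the real 256-entry PGP lists:

    # Minimal sketch of the PGP even/odd word-list idea, with placeholder lists
    # (the real lists each have 256 words indexed directly by the byte value).
    EVEN_WORDS = ["adult", "basin", "clamp", "drumbeat"]       # placeholder 2-syllable list
    ODD_WORDS  = ["absurd", "bodyguard", "commando", "decadence"]  # placeholder 3-syllable list

    def encode(data: bytes) -> list[str]:
        out = []
        for i, b in enumerate(data):
            # Even-offset bytes draw from one list, odd-offset bytes from the other.
            words = EVEN_WORDS if i % 2 == 0 else ODD_WORDS
            out.append(words[b % len(words)])  # real lists would index with b (0-255)
        return out

    def check(words: list[str]) -> bool:
        # A dropped, added, or swapped word puts a word from the wrong list at
        # some position, so transposition/insertion/deletion is detectable.
        return all(
            w in (EVEN_WORDS if i % 2 == 0 else ODD_WORDS)
            for i, w in enumerate(words)
        )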
EDIT: According to my napkin math, 6 PGP words should be enough to cover the 64 trillion coordinates that "what3words" covers, but with way better properties such as error detection and phonetic incongruity (and not only that, it is just over 4 times larger, which means it can achieve a resolution of 5 feet instead of 10)
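Quick check of that napkin math, using the figures quoted above:

    pgp_list_size = 256                 # words per PGP list (even or odd)
    combos_6_words = pgp_list_size ** 6
    print(combos_6_words)               # 281,474,976,710,656 ~ 281 trillion

    w3w_squares = 64e12                 # ~64 trillion squares claimed by what3words
    print(combos_6_words / w3w_squares) # ~ 4.4, "just over 4 times larger"

    # Cell area shrinks by that factor, so linear resolution improves by its
    # square root: 10 ft / sqrt(4.4) ~ 4.8 ft, i.e. roughly 5 feet.
    print(10 / (combos_6_words / w3w_squares) ** 0.5)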
As a New Zealander, the PGP list is unfriendly because there are plenty of words that are hard to spell, or are too US centric.
dogsled (contains silent d, and sleigh might be a British spelling)
Galveston (I've never heard of the place)
Geiger (easy to type i before e - unobvious)
Wichita (I would have guessed the spelling began with which or witch)
And why did the designers not make the words have some connection to the numbers e.g. there are 12 even and 12 odd words beginning with E - add 16 more E words and you could use E words for E0 to EF. Redundant encoding like that helps humans (and would help when scanning for errors or matches too)
I imagine it is even harder for ESOL people from other countries! I am sure the UI has completion to help - but I wouldn't recommend using that list for anything except a pure US audience.
I have been to Galveston and I can assure you that you have not missed anything. There is no good reason to visit or know anything about it.
Making a word list that could work well for speakers of different English dialects and for speakers of English as a second language sounds really hard. Has such a list ever been made?
Probably it is too hard so we will continue to ignore the problem.
It should be discussed like this! It's clear that the w3w people didn't even do the bare minimum here!
The thing is, once you agree that some words are subpar or need translations, you can do a 1-to-1 mapping.
The problem with What3Words is that supporting the original word set will always be a pain even if they release a v2 word set with a 1-to-1 mapping (I believe they've already released versions for other languages?)
re: Geiger- parsing it could trivially accept misspellings of words
I mean there is the ICAO phonetic alphabet already known and used by every single licensed pilot the world over, regardless of their native language.
or, or... hang with me here for a minute...
We could instead use one of these cool new hash algorithms that require a computer and use about fifteen thousand English words! I understand they are all the rage in the third world countries that lack a postal system.
These are all lovely technical solutions. The problem I imagine isn't coming up with unique words. The problem is organizing a switchover for dozens if not hundreds of systems and agencies around the world. The chaos of change probably outweighs the benefits.
what3words is not useful at all.
1. The FAA (and thus the world) has a hard character limit of 8, to support old mainframes running old Unix dispatch software (Delta, I'm looking at you).
2. The cockpit computers have limited characters on screen. An FMC (Flight Management Computer) can display 28 characters x 16 rows at best; most are 8 rows, and some military aircraft have only 2. The FMC is really just an old embedded chip.
3. The entire airline, flight, tourism, booking, and ticketing ecosystem of the world would need to change, including all legacy systems, paper charts, maps, BMSs, AirBosses, ATC software, and radio beacons.
There is no chance that any of this will change simply because someone came up with a way to associate words with landmarks you can't see from the air.
You could maybe make them globally unique by adding the country where appropriate like we do with Paris, France vs Paris, Texas? And not using the same name twice in the same country.
The names have to be entered manually by pilots, e.g. if they change the route. They have to be transmitted over the air by humans. So they must be short and simple.
Yes but shouldn’t one step of the code be to translate these non-unique human-readable identifiers into completely unique machine-readable identifiers?
How exactly would you do that? It’s impossible to map from a dataset of non-unique identifiers to unique identifiers without additional data and heuristics. The mapping is ambiguous by definition.
The underlying flight plan standards were all created in an era of low-memory machines, when humans were expected to directly interpret data exactly as the programs represented it internally (because serialisation and deserialisation are expensive when you need every CPU cycle just to run your core algorithms).
Couldn’t you use the surrounding points? Each point is surrounded by a set of nearby points. You can prepare a map of pairs of points into unique ids beforehand, then have a step that takes (before, current, after) for each point in the flight plan and finds the ID of current.
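Something like this rough sketch (waypoint names, coordinates, and the distance shortcut are invented for illustration):

    import math

    # Hypothetical disambiguation: when a name maps to several candidate fixes,
    # pick the one closest to the previously resolved point.
    WAYPOINTS = {
        "RESIA": [(46.7, 10.5), (14.9, 42.6)],   # two unrelated fixes sharing a name
        "DVL":   [(48.1, -98.9), (48.2, 16.5)],
    }

    def distance(a, b):
        # Rough flat-earth distance in degrees; good enough to break ties between
        # candidates thousands of km apart.
        return math.hypot(a[0] - b[0], (a[1] - b[1]) * math.cos(math.radians(a[0])))

    def resolve(route_names, start):
        resolved, here = [], start
        for name in route_names:
            here = min(WAYPOINTS[name], key=lambda cand: distance(here, cand))
            resolved.append((name, here))
        return resolved

    print(resolve(["RESIA", "DVL"], start=(47.0, 8.0)))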
That sort of thing happens already; for instance, the MCDU of an Airbus aircraft will present various options in the case of ambiguous input, with a distance in nautical miles for each option. Usually, the closest option is the most appropriate.
Directly mapping is impossible, so you can't just do a dumb ID-at-time pre-processing step (which is what your comment seems to suggest). You need a more complex pre-processing step that's capable of understanding the surrounding context the identifier is being used in. A major issue with the flight planning system (as highlighted in the article) is that they attempted to do this heuristic mapping as part of their core processing step, and just assumed the ID wouldn't be too ambiguous, and certainly wouldn't repeat.
But it's how the waypoint codes work in practice - they are contextual. If Air Traffic Control tell a plane to head for waypoint RESIA, they mean the one nearby, not the one 4000 km away.
Have to admit, I read the article in full detail only after commenting and I see your point.
Especially since the implementing company is called out explicitly for failing to achieve this, and the risks of changing the well-established identifiers are also illustrated.
Perfect might be the enemy of the good then, or the standardization thing at least is a separate topic.
Not really because each point is only adjacent to a small neighborhood of other points, so if you want to test every possibility then your search space only grows by a constant factor proportional to the maximum degree of the graph.
As for implementation complexity, you would hope they would use formal verification for something like this.
Long story: because changing identifiers is a considerable refactoring, and it takes coordination with multiple worldwide distributed partners to transition safely from the old to the new system, all to avoid a hypothetical issue some software engineer came up with
Short story: money. It costs money to do things well.
Pretty sure that is still not the meaning of refactoring. As I understand it refactoring should mean no changes to the external interface but changes to how it is implemented internally.
We can pontificate on how to define the scope of a system here. I will only state that, from the perspective of a consumer, you could consider this a Service on which the interface of find flight, book flight, etc. would appear to be the same while the connections internal to each of the above modules would have to account for the change.
Functionally, I suppose it's the equivalent of upgrading an ID field that was originally declared as an unsigned 32-bit integer to a wider 64-bit representation. We may not be changing anything fundamental in the functionality, but every boundary interface, protocol, and storage mechanism must now suffer through a potentially painful modification.
does refactoring mean literally any non-local change even just like changing a variable name, or does it usually mean some kind of structural or architectural non-local change
It sounds like for actual processing they replace them with GPS coordinates (or at least augment them with such). But this is the system that is responsible for actually doing that...
W3W contains homonyms and words that are easily confused by non-native English speakers, often within just a few km. The latter is why ATC uses "niner", to avoid confusing "nine" and "nein".
I always love it when someone helicopters in to a complex, long-established system and, without even attempting to understand the requirements, constraints or history, knows this thing they read on a blog one time would fix all the problems thousands of work-years have failed to address.
As software developers, we are often living in our own bubble.
As a pilot and developer working on an aviation solution, I quite often run into this issue when discussing solutions with my colleagues.
The biggest fault (besides being proprietary) is that you must be online in order to use WTW. The times that you might need WTW are ALSO the times you are most likely to be unable to be online.
> The biggest fault (besides being proprietary) is that you must be online in order to use WTW.
That doesn't seem to be the case anymore.
It's still not a great system – many included words are ambiguous (e.g. English singular and plural forms are both possible, and an "s" is notoriously difficult to hear over a bad phone line), and it's proprietary, as you already mentioned.
It's definitely not the case, as the word list and algorithm are not secret (notwithstanding that they're proprietary) and have been re-implemented and ported into at least a couple of languages that allow for offline use. I have a Rust implementation that started life as a transliteration from Javascript. I wouldn't recommend using it, still -- I wrote it in the hope of finding more problems with collisions, not because I like it.
That would actually be pretty bad. As mentioned, W3W is proprietary, requires an online connection, and has homonyms. On top of that, you need to enter these waypoints into your aircraft's navigation system - sometimes one letter at a time using a rotary dial. These navigation systems will stay in service for decades.
Aviation already uses phonetically pronounceable waypoint names. Typically 5 characters long for RNAV (GPS) waypoints, for example "ALTAM" or "COLLI". Easy to pronounce, easy to spell phonetically if needed, and easy to enter.
The problem is the list of waypoints are independently defined by each country, so duplicates are possible between countries.
Rather than replacing a system that mostly works (and mandating changes to aircraft navigation systems, ATC systems, and human training for marginal benefit)... an easier fix would just be to have ICAO mandate that these waypoints are globally unique.
If only there was a globally unique set of short two-letter names for every country that could be used as prefixes to enforce uniqueness while still allowing every country to manage their own internal waypoint list.
I'm sure they thought about this at some point. Airports already have a country-code prefix. (For example, airports in the Continental US always start with K.)
For whatever reason, by convention navaids never use a country prefix. Even when it would make sense - the code for San Francisco International Airport is "KSFO", but the identifier for the colocated VOR-DME is just "SFO". (Sometimes this does make a big difference, when navaids are located off site - KCCR vs CCR for Concord Airport vs the off-site Concord VOR-DME, for example.)
It's even worse for NDB navaids, which are often just two letters.
Either way, we're stuck with it because it's baked into aircraft avionics and would be incredibly expensive to change at this point.
> the backup system applied the same logic to the flight plan with the same result
Oops. In software, the backup system should use different logic. When I worked at Boeing on the 757 stab trim system, there were two avionics computers attached to the wires to activate the trim. The attachment was through a comparator, that would shut off the authority of both boxes if they didn't agree.
The boxes were designed with:
1. different algorithms
2. different programming languages
3. different CPUs
4. code written by different teams with a firewall between them
The idea was that bugs from one box would not cause the other to fail in the same way.
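Roughly the shape of that comparator, as a sketch (the tolerance value and channel inputs are made up, not Boeing's numbers):

    # 2oo2 sketch: two dissimilar channels drive the actuator only while they
    # agree; on disagreement both lose authority and the pilot is the backup.
    TOLERANCE = 0.05  # allowable slack between channels, illustrative value

    def comparator(channel_a: float, channel_b: float):
        """Return a trim command if the channels agree, otherwise revoke authority."""
        if abs(channel_a - channel_b) <= TOLERANCE:
            return (channel_a + channel_b) / 2.0
        return None  # disagreement: shut off both boxes, hand control to the pilot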
This would have been a 2oo2 system where the pilot becomes the backup. 2oo2 systems are not highly available.
Air traffic control systems should at least be 2oo3[1] (3 systems independently developed of which 2 must concur at any given time) so that a failure of one system would still allow the other two to continue operation without impacting availability of the aviation industry.
Human backup is not possible because of human resourcing and complexity. ATC systems would need to be available to provide separation under IFR[2] and CVFR[3] conditions.
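A sketch of the 2oo3 voting idea, for contrast (threshold and fallback behaviour are illustrative, not from any real ATC spec):

    # 2oo3 sketch: three independently developed channels; the output is whatever
    # at least two of them agree on (within slack), so a single faulty channel is
    # outvoted and the system stays available.
    from itertools import combinations

    AGREEMENT = 0.05  # illustrative threshold

    def vote_2oo3(a: float, b: float, c: float):
        for x, y in combinations((a, b, c), 2):
            if abs(x - y) <= AGREEMENT:
                return (x + y) / 2.0
        return None  # no two channels agree: degrade, e.g. fall back to procedures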
> Air traffic control systems should at least be 2oo3... Human backup is not possible because of human resourcing and complexity.
But this was a 1oo1 system, and the human backup handled it well enough: a lot of people were inconvenienced, but there were no catastrophes, and (AFAIK) nothing that got close to being one.
As for the benefits of independent development: it might have helped, but the chances of this being so are probably not as much as one would have hoped if one thought programming errors are essentially random defects analogous to, say, weaknesses in a bundle of cables; I had a bit more to say about it here:
True. I don't want to downplay the actual cost (or, worse, suggest that we should accept "the system worked as intended" excuses), but it's not just that there were no crashes: the air traffic itself remained under control throughout the event. Compare this to, for example, the financial "flash crash" of 2010, or the nuclear 'excursions' at Fukushima / Chernobyl / Three Mile Island / Windscale, where those nominally in control were reduced to being passive observers.
It also serves as a reminder of how far we have to go before we can automate away the jobs of pilots and air traffic controllers.
This reminds me of a backwoods hike I took with a friend some years back. We each brought a compass, "for redundancy", but it wasn't until we were well underway that we noticed our respective compasses frequently disagreed. We often wished we had a third to break the tie!
My grandfather was working with Stanisław Skarżyński, who was preparing for his first crossing of the Atlantic in a lightweight airplane (RWD-5bis, 450kg empty weight) in 1933.
They initially mounted two compasses in the cockpit, but Skarżyński taped one of them over so that it wasn't visible, saying wisely that if one fails, he will have no idea which one is correct.
In this case the problem was choosing an excessively naive algorithm. I'm very inexperienced but that seems to me like the solution would be to spend a bit more money on reviewing the one implementation rather than writing two new ones from scratch.
you would be very surprised how difficult avionics are from even a fundamental level.
I'll provide a relatively simple example.
Just attempting to design a Star Fox game clone where the ship goes towards the mouse cursor using Euler angles will almost immediately result in gimbal lock, with your starfighter locking up tighter than an unlubricated car engine at 100 mph and unable to move. [0]
The standard solution in games (or at least what I used) has been to use quaternions [1] (Hamilton defined a quaternion as the quotient of two directed lines in a three-dimensional space,[3] or, equivalently, as the quotient of two vectors). So you essentially lift your 3D coordinates into the 4D quaternion representation, apply your rotations there, then convert back to 3D space and apply your transforms.
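Roughly what that looks like as a sketch (axis, angle, and vector purely illustrative):

    import math

    # Build a rotation quaternion from an axis and angle, then rotate a vector
    # with q * v * q_conjugate, sidestepping the gimbal lock that stacked Euler
    # rotations can hit.

    def quat_from_axis_angle(axis, angle):
        ax, ay, az = axis
        n = math.sqrt(ax * ax + ay * ay + az * az)
        s = math.sin(angle / 2) / n
        return (math.cos(angle / 2), ax * s, ay * s, az * s)  # (w, x, y, z)

    def quat_mul(a, b):
        aw, ax, ay, az = a
        bw, bx, by, bz = b
        return (
            aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw,
        )

    def rotate(v, q):
        w, x, y, z = q
        q_conj = (w, -x, -y, -z)
        _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, *v)), q_conj)
        return (rx, ry, rz)

    # Example: yaw a forward-pointing ship 90 degrees about the vertical axis.
    q = quat_from_axis_angle((0, 0, 1), math.pi / 2)
    print(rotate((1, 0, 0), q))  # ~ (0, 1, 0)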
This was literally just to get my little space ship to go where my mouse cursor was on the screen without it locking up.
So... yeah, I cannot even begin to imagine the complexity of what a Boeing 757 (let alone a 787) is doing under the hood to deal with reality and not causing it to brick up and fall out of the sky.
I don't think we're talking about that kind of software, though. This bug was in code that needs to parse a line defined by named points and then clip the line to the portion in the UK. Not trivial, but I can imagine writing that myself.
But regardless, the more complex the code, the worse an idea it is to maintain three parallel implementations if you won't/can't afford to do it properly.
I was doing some orientation sensing 20 years ago with an IMU and ran into the same problem. I had never known at the time it was gimbal lock (which I had heard of) but did read quaternions were the way to fix it. Pesky problem.
> Human backup is not possible because of human resourcing
This is an artificial restraint. In the end, it comes down to risk management: "Are we willing to pay someone to make sure the system stays up when the computer does something unexpected?".
Considering this bug only showed up now, chances are there was a project manager who decided the risk would be extremely low and not worth spending another 200k or so of yearly operational expenses on.
First thought that came to my mind as well when I read it. This failover system seems to be more designed to mitigate hardware failures than software bugs.
I also understand that it is impractical to implement the ATC system software twice using different algorithms. The software at least checked for an illogical state and exited, which was the right thing to do.
A fix I would consider is to have the inputs more thoroughly checked for correctness before passing them on to the ATC system.
> A fix I would consider is to have the inputs more thoroughly checked for correctness before passing them on to the ATC system.
Thoroughly checking of the inputs as far as possible should be a given, but in this case, the inputs were correct: while the use of duplicate identifiers is considerably less than ideal, the constraints on where that was permitted meant that there was one deterministically unambiguous parsing of the flight plan, as demonstrated in the article. The proximate cause of the problem was not in the inputs, but how they were processed by the ATC system.
For the same reason, multiple implementations of the software would only have helped if a majority of the teams understood this issue and got it right. I recall a fairly influential paper in the '90s (IIRC) in which multiple independent implementations of a requirements specification were compared, and the finding was that the errors were quite strongly correlated - i.e. there was a tendency for the teams to make the same mistakes as each other.
not stronger isolation between different flight plans? it seems "obvious" to me that if one flight plan is causing a bug in the handling logic, the system should be able to recover by continuing with the next flight plan and flagging the error to operators to impact that flight only
I'm no aviation expert, but perhaps with waypoints:
A B C D E
/
F G H I J
If flight plan #1 is known to be going from F-B at flight level 130, and you have a (supposedly) bogus flight plan #2, they can't quite be sure if it might be going from A-G at flight level 130 at the same time and thus causing a really bad day for both aircraft. I'd worry that dropping plan #2 into a queue for manual intervention, especially if this kind of thing only happens once every 5 years, could be disastrous if people don't realize what's happening and why. Many people might never have seen anything in that queue and may not be trained to diagnose the problem and manually translate the flight plan.
This might not be the reason why the developer chose to have the program essentially pull the fire alarm and go home in this case, but that's the impression I got.
The ATC system handled well enough (i.e. no disasters, and AFAIK, no near misses) something much more complicated than one aircraft showing up with no flight plan: the failure of this particular system put all the flights in that category.
I mentioned elsewhere that any ATC system has to be resilient enough to handle things like in-flight equipment failure, medical emergencies, and the diversion of multiple aircraft on account of bad weather or an incident which shuts down a major airport.
As for why the system "pulled the plug", the author of the article suspects that this particular error was regarded as something that would not occur unless something catastrophic had caused it, whereas, in reality, it affected only one flight and could probably have been easily worked around if the system had informed ATC which flight plan was causing the problem.
I'm not sure they're even used for that purpose - that side of thing is done "live" as I understand it - the plans are so that ATC has the details on hand for each flight and it doesn't all need to be communicated by radio as they pass through.
I wonder where most of the complexity lies in ATC. Naively you’d think there would be some mega computer needed to solve the puzzle but the UK only sees 6k flights a day and the scale of the problem, like most things in the physical world, is well bounded. That’s about the same number of buses in London, or a tenth of the number of Uber drivers in NYC.
Much of the complexity is in interop. Passing data between ATC control positions, between different facilities, and between different countries. Then every airline has a bidirectional data feed, plus all the independent GA flights (either via flight service or via third-party apps). Plus additional systems for weather, traffic management, radar, etc. Plus everything happening on the defense side.
All using communication links and protocols that have evolved organically since the 1950s, need global consensus (with hundreds of different countries' implementations), and which need to never fail.
The system should have just rejected the FPL, notified the admins about the problem, and kept working.
The admins could have fixed whatever the software could not handle.
The affected flight could have been vectored by ATC if needed to divert from filed FPL.
Way less work and a better outcome than "the system throws its hands in the air and becomes unresponsive".
if this is true, then would it be a better investment to have the 2nd team produce a fuzz testing/systematic testing mechanism instead of producing a secondary copy of the same system?
In fact, make it adversarial testing such that this team is rewarded (may be financially) if mistakes or problems are found from the 1st team's program.
the whole point is that they're not collaborating so as to avoid cross-contamination. also you don't get paid unless and until you identify the mistake. if you decrease the reward over time, there is an additional incentive to not sit on the information
Naturally, any comparator would have some slack in it to account for variations. Even CPU internals have such slack, that's why there's a "clock" to synchronize things.
I seem to remember another problem at NATS which had the same effect. Primary fell over so they switched over to a secondary that fell over for the exact same reason.
It seems like you should only failover if you know the problem is with the primary and not with the software itself. Failing over "just because" just reinforces the idea that they didn't have enough information exposed to really know what to do.
The bit that makes me feel a bit sick, though, is that they didn't have a method called "ValidateFlightPlan" that throws an error if for any reason a plan couldn't be parsed, with that error handled in a really simple way. What programmer would look at a processor of external input and not think, "what do we do with bad input that makes it fall over?" I did something like that today for a simple message prompt, since I can't guarantee that in all scenarios the data I need will be present/correct. Try/catch and a simple message to the user: "Data could not be processed".
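Something along these lines, as a rough sketch (the plan format and validation rule here are invented, purely to show the containment):

    from dataclasses import dataclass

    class FlightPlanError(Exception):
        """Raised when a single flight plan cannot be parsed or interpreted."""

    @dataclass
    class FlightPlan:
        callsign: str
        waypoints: list

    def validate_flight_plan(raw: str) -> FlightPlan:
        parts = raw.split()
        if len(parts) < 3:
            raise FlightPlanError(f"too short to be a plan: {raw!r}")
        return FlightPlan(callsign=parts[0], waypoints=parts[1:])

    def process_batch(raw_plans):
        accepted, needs_human = [], []
        for raw in raw_plans:
            try:
                accepted.append(validate_flight_plan(raw))
            except FlightPlanError as err:
                needs_human.append(f"{raw!r}: {err}")  # flag it, keep going
        return accepted, needs_human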
Well, if the primary is known not to be in a good state, you might as well fail over and hope that the issue was a fried disk or a cosmic bit flip or something.
The real safety feature is the 4 hour lead time before manual processing becomes necessary.
One of the key safety controls in aviation is “if this breaks for any reason, what do we do”, not so much “how do we stop this breaking in the first place”.
It's very hard to ensure you capture every single possible failure mode. Yes, the engineering control is important but it's not the most critical. What to do if it does fail (for any reason) is the truly critical control, because it solves for the possibility of not knowing every possible way something might fail and therefore missing some way to prevent a failure
One or more of three results can come from the engineering exercise of trying to keep something from breaking in the first place:
1. You could know the solution, but it would be too heavy.
2. You could know the solution, but it would include more parts, each of which would need the same process on it, and the process might fail the same way
3. You miss something and it fails anyway, so your "what if this fails" path better be well rehearsed and executed.
Real engineering is facing the tradeoffs head on, not hand waving them away.
The engineering controls don't independently make systems safe, they make things more reliable and cost-effective, and hopefully reduce the number of times the process controls kick in.
The process controls do however independently make things safe.
The reason for this is that there are 'unknown unknowns'—we accept that our knowledge and skills are imperfect, and there may be failures that occur which could have been eliminated with the proper engineering controls, but we, as imperfect beings and organisations, did not implement the engineering controls because we did not identify this possible failure mode.
There are also known errors, where the cost of implementing engineering controls may simply outweigh the benefits when adequate process controls are in place.
It was in a bad state, but in a very inane way: a flight plan in its processing queue was faulty. The system itself was mostly fine. It was just not well-written enough to distinguish an input error from an internal error, and thus didn't just skip the faulty flight plan.
Indeed, that intention is quite transparent in this case. Anyways, I suspect that invalid input exists that would have made the system react in a similar way
No validation, and this point from the article stood out to me:
---
The programming style is very imperative. Furthermore, the description sounds like the procedure is working directly on the textual representation of the flight plan, rather than a data structure parsed from the text file. This would be quite worrying, but it might also just be how it is explained.
---
Given that description, I'd be surprised if it wasn't just running a regex / substring matches against the text and there's no classes / objects / data structure involved. Bearing in mind this is likely decades old C code that can't be rewritten or replaced because the entirety of the UK's aviation runs on it.
> Bearing in mind this is likely decades old C code that can't be rewritten or replaced because the entirety of the UK's aviation runs on it.
It's new code, from 2018 :)
Quote from the report:
> An FPRSA sub-system has existed in NATS for many years and in 2018 the previous FPRSA sub- system was replaced with new hardware and software manufactured by Frequentis AG, one of the leading global ATC System providers.
Failing over is correct because there's no way to discern that the hardware is not at fault. They should have designed a better response to the second failure to avoid the knock-on effects.
Retroactive inspection revealed that it wasn't a hardware failure, but the computer didn't know that at the time, and hardware failure can look like anything, so it was correct to exercise its only option.
And why could the system not put the failed flight plan in a queue for human review and just keep on working for the rest of the flights? I think the lack of that “feature” is what I find so boggling.
Because the code classified it as a "this should never happen!" error, and then it happened. The code didn't classify it as a "flight plan has bad data" error or a "flight plan data is OK but we don't support it yet" error.
If a "this should never happen!" error occurs, then you don't know what's wrong with the system or how bad or far-reaching the effects are. Maybe it's like what happened here and you could have continued. Or maybe you're getting the error because the software has a catastrophic new bug that will silently corrupt all the other flight plans and get people killed. You don't know whether it is or isn't safe to continue, so you stop.
That reasoning is fine, but it rather seems that the programmers triggered this catastrophic "stop the world" error because they were not thorough enough in considering all scenarios. As TA expounds, it seems that neither formal methods nor fuzzing were used, which would have gone a long way toward flushing out such errors.
> it rather seems that the programmers triggered this catastrophic "stop the world" error because they were not thorough enough considering all scenarios
Yes. But also, it's an ATC system. Its primary purpose "is to prevent collisions..." [1].
If the system encounters a "this should never happen!" error, the correct move is to shut it down and ground air traffic. (The error shouldn't have happened in the first place. But the shutdown should have been more graceful.)
Neither formal methods nor fuzzing would've helped if the programmer didn't know that input can repeat. Maybe they just didn't read the paragraph in whatever document describes how this should work and didn't know about it.
I didn't have to implement flight control software, but I had to write some stuff described by MIFID. It's a job from hell, if you take it seriously. It's a series of normative documents that explains how banks have to interact with each other which were published quicker than they could've been implemented (and therefore the date they had to take effect was rescheduled several times).
These documents aren't structured to answer every question a programmer might have. Sometimes the "interesting" information is close together. Sometimes you need to guess the keyword you need to search for to discover all the "interesting" parts... and it could be thousands of pages long.
The point of fuzzing is precisely to discover cases that the programmers couldn't think about, and formal methods are useful to discover invariants and assumptions that programmers didn't know they rely on.
Furthermore, identifiers from external systems always deserve scepticism. Even UUIDs can be suspect. Magic strings from hell even more so.
If programmer didn't know that repetitions are allowed, they wouldn't appear in the input to the fuzzer as well.
The mistake is too trivial to attribute to programmer incompetence or lack of attention. I'd bet my lunch it was because the spec is written in incomprehensible language, is all over the place in a thousand-page PDF, and the particular aspect of repetition isn't covered in what looks like the main description of how paths are defined.
I've dealt with specs like that. It's most likely the error created by the lack of understanding of the details of the requirements than of anything else. No automatic testing technique would help here. More rigorous and systematic approach to requirement specification would probably help, but we have no tools and no processes to address that.
> If programmer didn't know that repetitions are allowed, they wouldn't appear in the input to the fuzzer as well.
It totally would. The point of a fuzzer is to test the system with every technically possible input, to avoid bias and blind spots in the programmer's thinking.
Furthermore, assuming that no duplicates exist is a rather strong assumption that should always be questioned. Unless you know all about the business rules of an external system, you can't trust its data and can't assume much about its behavior.
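A toy illustration of that (the waypoint names and the buggy uniqueness assumption are invented): a fuzzer drawing from a small name pool produces duplicates almost immediately, surfacing exactly the input class nobody wrote a test for.

    import random

    NAMES = ["DVL", "RESIA", "ALTAM", "COLLI", "SFO"]

    def naive_extract_segment(route):
        # Hypothetical buggy logic that assumes names never repeat.
        assert len(set(route)) == len(route), "this should never happen"
        return route[1:-1]

    def fuzz(iterations=10_000, seed=0):
        rng = random.Random(seed)
        for _ in range(iterations):
            route = [rng.choice(NAMES) for _ in range(rng.randint(2, 8))]
            try:
                naive_extract_segment(route)
            except AssertionError:
                return route  # found an input class nobody thought about
        return None

    print(fuzz())  # quickly surfaces a route with a duplicated waypoint name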
Anyways, we are discussing about the wrong issue. Bugs happen, even halting the whole system can be justified, but the operators should have had an easier time figuring out what was actually going on, without the vendor having to pore through low-level logs.
No... that's not the point of fuzzing... You cannot write individual functions in such a way that they keep revalidating input handed to them. Because then, invariably, the validations will be different function to function, and once you have an error in your validation logic, you will have to track down all function that do this validation. So, functions have to make assumptions about input, if it doesn't come from an external source.
I.e. this function wasn't the one which did all the job -- it already knew that the input was valid because the function that provided the input already ensured validation happened.
It's pointless to deliberately send invalid input to a function that expects (for a good reason) that the input is valid -- you will create a ton of worthless noise instead of looking for actual problems.
> Furthermore, assuming that no duplicates exist is a rather strong assumption that should always be questioned.
How do you even come up with this? Do you write your code in such a way that any time it pulls a value from a dictionary, you iterate over the dictionary keys to make sure that they are unique?... There are plenty of things that are meant to be unique by design. The function in question wasn't meant to check if the points were unique. For all we know, the function might have been designed to take a map and the data was lost even before this function started processing it...
You really need to try doing what you suggest before suggesting it.
I am not going to comment the first paragraph since you turned my words around.
> How do you even come up with this? Do you write your code in such a way that any time it pulls a value from a dictionary, you iterate over the dictionary keys to make sure that they are unique?
A dictionary in my program is under my control and I can be sure that the key is unique since... well, I know it's a dictionary. I have no such knowledge about data coming from external systems.
> There are plenty of things that are meant to be unique by design. The function in question wasn't meant to check if the points were unique. For all we know, the function might have been designed to take a map and the data was lost even before this function started processing it...
"Meant to be" and "actually are" can be very different things, and it's the responsibility of a programmer to establish the difference, or to at least ask pointed questions. Actually, the programmers did the correct thing by not sweeping this unexpected problem under the rug. The reaction was just a big drastic, and the system did not make it easy for the operators to find out what went wrong.
Edit: as we have seen, input can be valid but still not be processable by our code. That's not great, but it's a fact of life, since specs are often unclear or incomplete. Also, the rules can actually change without us noticing. In these cases, we should make it as easy as possible to figure out what went wrong.
I've only heard from people engineering systems for the aerospace industry, and we're talking hundreds of pages of API documentation. It is very complex, so equally the chances of a human error are higher.
I agree with the general sentiment "if you see an unexpected error, STOP", but I don't really think that applies here.
That is, when processing a sequential queue which is what this job does, it seems to me reading the article that each job in the queue is essentially totally independent. In that case, the code most definitely should isolate "unexpected error in job" from a larger "something unknown happened processing the higher level queue".
I've actually seen this bug in different contexts before, and the lessons should always be: One bad job shouldn't crash the whole system. Error handling boundaries should be such that a bad job should be taken out of the queue and handled separately. If you don't do this (which really just entails being thoughtful when processing jobs about the types of errors that are specific to an individual job), I guarantee you'll have a bad time, just like these maintainers did.
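Roughly this shape, as a sketch (the job type, handler, and logger name are placeholders):

    import logging

    log = logging.getLogger("plan-worker")  # illustrative name

    def run_worker(jobs, handle_job):
        dead_letter = []
        for job in jobs:
            try:
                handle_job(job)
            except Exception:
                # Failure is scoped to this one job: record it, alert, move on.
                log.exception("job failed, diverting to manual queue: %r", job)
                dead_letter.append(job)
        return dead_letter

    # Usage sketch:
    # bad_jobs = run_worker(flight_plan_queue, process_flight_plan)
    # notify_operators(bad_jobs)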
If the code takes a valid series of ICAO waypoints and routes, generates the corresponding ADEXP waypoint list, but then when it uses that to identify the ICAO segment that leaves UK airspace it's capable of producing a segment from before when the route enters UK airspace, then that code is wrong, and who knows what other failure modes it has?
Maybe it can also produce the wrong segment within British airspace, meaning another flight plan might be processed successfully, but with the system believing it terminates somewhere it doesn't?
Maybe it's already been processing all the preceding flight plans wrongly, and this is just the first time when this error has occurred in a way that causes the algorithm to error?
Maybe someone's introduced an error in the code or the underlying waypoint mapping database and every flight plan that is coming into the system is being misinterpreted?
An "unexpected error" is always a logic bug. The cause of the logic error is not known, because it is unexpected. Therefore, the software cannot determine if it is an isolated problem or a systemic problem. For a systemic problem, shutting down the system and engaging the backup is the correct solution.
I'm pretty inexperienced, but I'm starting to learn the hard way that it takes more discipline to add more complex error recovery. (Just recently my implementation of what you're suggesting - limiting the blast radius of server side errors - meant all my tests were passing with a logged error I missed when I made a typo)
Considering their level 1 and 2 support techs couldn't access the so-called "low level" logs with the actual error message it's not clear to me they'd be able to keep up with a system with more complicated failure states. For example, they'd need to make sure that every plan rejected by the computer is routed to and handled by a human.
They physically cannot be independent. The system works on the assumption that the flight was accepted and is valid, but it cannot place it. What if it accidentally schedules another flight at the same time and place?
Except that you can't be sure this bad flight plan doesn't contain information that will lead to a collision. The system needs to maintain the integrity of all plans it sees. If it can't process one, and there's the risk of a plane entering airspace with a bad flight plan, you need to stop operations.
>> Except that you can't be sure this bad flight plan doesn't contain information that will lead to a collision.
Flight plans don't contain any information relevant for collision avoidance. They only say when and where the plane is expected to be. There is not enough specificity to ensure no collisions. Things change all the time, from late departures, to diverting around bad weather. On 9/11 they didn't have every plane in the sky file a new flight plan carefully checked against every other...
Aviation is incredibly risk-averse, which is part of why it's one of the safest modes of travel that exists. I can't imagine any aviation administration in a developed country being OK with a "yeah just keep going" approach in this situation.
That's true, but then, why did engineers try to restart the system several times if they had no clue what was happening, and restarting it could have been dangerous?
A customer of mine is adamant in their resolve to log errors, retry a few times, give up and go on with the next item to process.
That would have grounded only the plane with the flight plan that the UK system could not process.
Still a bug, but with fewer effects across the whole continent: planes that could not get into or out of the UK could not fly, and that affected all of Europe and possibly more.
> That would have grounded only the plane with the flight plan that the UK system could not process.
By the looks of it, the flight had been in the air for a few hours by the time the system broke down. Considering the system didn't know what the problem was, it seems appropriate that it shut down. No planes collided, so the worst didn't happen.
Couldn't the outcome be "access to the UK airspace denied" only for that flight? It would have checked with an ATC and possibly landed somewhere before approaching the UK.
In the case of a problem with all flights, the outcome would have been the same they eventually had.
Of course I have no idea if that would be a reasonable failure mode.
This here is the true takeaway. The bar for writing "this should never happen" code must be set so impossibly high that it might as well be translated into "'this should never happen' should never happen"
The problem with that is that most programming languages aren't sufficiently expressive to be able to recognise that, say, only a subset of switch cases are actually valid, the others having been already ruled out. It's sometimes possible to re-architect to avoid many of this kind of issue, but not always.
What you're often led to is "if this happens, there's a bug in the code elsewhere" code. It's really hard to know what to do in that situation, other than terminate whatever unit of work you were trying to complete: the only thing you know for sure is that the software doesn't accurately model reality.
In this story, there obviously was a bug in the code. And the broken algorithm shouldn't have passed review. But even so, the safety critical aspect of the complete system wasn't compromised, and that part worked as specified -- I suspect the system behaviour under error conditions was mandated, and I dread to think what might have happened if the developers (the company, not individuals) were allowed to actually assume errors wouldn't happen and let the system continue unchecked.
To be fair, the article suggests early on that sometimes these plans are being processed for flights already in the air (although at least 4 hours away from the UK).
If you can stop the specific problematic plane taking off then keeping the system running is fine, but once you have a flight in the air it's a different game.
It's not totally unreasonable to say "we have an aircraft en route to enter UK airspace and we don't know when or where - stop planning more flights until we know where that plane is".
If you really can't handle the flight plan, I imagine a reasonable solution would be to somehow force the incoming plane to redirect and land before reaching the UK, until you can work out where it's actually going, but that's definitely something that needs to wait for manual intervention anyway.
For the most part (although there are important exceptions), IFR flights are always in radar contact with a controller. The flight plan is a tool that allows ATC and the plane to agree on a route so that they don't have to be constantly communicating. ATC 'clears' a plane to continue on the route to a given limit, and expects the plane to continue on the plan until that limit unless they give any further instructions.
In this regard UK ATC can choose to do anything they like with a plane when it comes under their control - if they don't consider the flight plan to be valid or safe they can just instruct the plane to hold/divert/land etc.
I'm not sure the NATS system that failed has the ability to reject a given flight plan back upstream.
Mostly yes; however, there are large parts of the Atlantic and Pacific where that (radar contact) isn't true. I know the Atlantic routes are frequently full of planes that left the US and Canada heading to the UK.
I have no idea what percent of the volume into the UK comes from outside radar control; if they asked a flight to divert, that may open multiple other cans of worms.
> If they asked a flight to divert, that may open multiple other cans of worms.
Any ATC system has to be resilient enough to handle a diversion on account of things like bad weather, mechanical failure or a medical emergency. In fact, I would think the diversion of one aircraft would be less of a problem than those caused by bad weather, and certainly less than the problem caused by this failure. Furthermore, I would guess that the mitigation would be just to manually direct the flight according to the accepted flight plan, as it was a completely valid one.
One of the many problems here is that they could not identify the problem-triggering flight plan for hours, and only with the assistance of the vendor's engineers. Another is that the system had immediately foreclosed on that option anyway, by shutting down.
Only theoretically. In practice the only thing that usually matches is from which other ATC unit the plane is coming. But it could be on a different route and will almost always be at a different time due to operational variation.
That doesn't matter, because the previous unit actively hands the plane over. You don't need the flight plan for that.
What does matter is knowing what the plane is planning to do inside your airspace. That's why they're so interested in the UK part of the flight plan. Because if you don't give any other instructions, the plane will follow the filed routing. Making turns on its own, because the departing ATC unit cleared it for that route.
> the previous unit actively hands the plane over. You don't need the flight plan for that.
I thought practically, what's handed over is the CPL (current flight plan), which is essentially the flight plan as filed (FPL) plus any agreed-upon modifications to it?
> Because if you don't give any other instructions, the plane will follow the filed routing. Making turns on its own, because the departing ATC unit cleared it for that route.
Without voice or datalink clearance (i.e. the plane calling the new ATC), would the flight even be allowed to enter a new FIR?
To be fair that is exactly what the article said was a major problem, and which the postmortem also said was a major problem. I agree I think this is the most important issue:
> The FPRSA-R system has bad failure modes
> All systems can malfunction, so the important thing is that they malfunction in a good way and that those responsible are prepared for malfunctions.
> A single flight plan caused a problem, and the entire FPRSA-R system crashed, which means no flight plans are being processed at all. If there is a problem with a single flight plan, it should be moved to a separate slower queue, for manual processing by humans. NATS acknowledges this in their "actions already undertaken or in progress":
>> The addition of specific message filters into the data flow between IFPS and FPRSA-R to filter out any flight plans that fit the conditions that caused the incident.
Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.
Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code. Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.
For these kinds of things the post mortem and remediation have to kinda take as given that eventually a not predictable in advance unhandled unknown error will occur and then work on how it could be handled better. Because of course the solution to a bug is to fix the bug, but the issue and the reason for the meltdown is a DR plan that couldn't be implemented in a reasonable timeframe. I don't care what programming practices, what style, what language, what tooling. Something of a similar caliber will happen again eventually with probability 1 even with the best coders.
I agree with your first paragraph, but your second paragraph is quite defeatist. I was involved in quite a few "premortem" meetings where people think of increasingly improbable failure modes and devise strategies for them. It's a useful meeting before large changes to critical systems are made live. In my opinion, this should totally be a known error.
> Having found an entry and exit point, with the latter being the duplicate and therefore geographically incorrect, the software could not extract a valid UK portion of flight plan between these two points.
It doesn't take much imagination to surmise that perhaps real world data is broken and sometimes you are handed data that doesn't have a valid UK portion of flight plan. Bugs can happen, yes, such as in this case where a valid flight plan was misinterpreted to be invalid, but gracefully dealing with the invalid plan should be a requirement.
> Saying this should have been handled as a known error is totally reasonable but that's broadly the same as saying they should have just written bug free code.
I think there's a world of difference between writing bug free code, and writing code such that a bug in one system doesn't propagate to others. Obviously it's unreasonable to foresee every possible issue with a flight plan and handle each, but it's much more reasonable to foresee that there might be some issue with some flight plan at some point, and structure the code such that it doesn't assume an error-free flight plan, and the damage is contained. You can't make systems completely immune to failure, but you can make it so an arbitrarily large number of things have to all go wrong at the same time to get a catastrophic failure.
> Even if they had parsed it into some structure this would be the equivalent of a KeyError popping out of nowhere because the code assumed an optional key existed.
How many KeyError exceptions have brought down your whole server? It doesn't happen because whoever coded your web framework knows better and added a big try-catch around the code which handles individual requests. That way you get a 500 error on the specific request instead of a complete shutdown every time a developer made a mistake.
Crashing is a feature, though. Exceptions didn't make their way into interpreter specifications by accident. It just so happens that web apps don't need airbags that would slow the business down.
That line of reasoning is how you have systemic failures like this (or the Ariane 5 debacle). It only makes sense in the most dire of situations, like shutting down a reactor, not input validation. At most this failure should have grounded just the one affected flight rather than the entire transportation network.
> Because they hit "unknown error" and when that happens on safety critical systems you have to assume that all your system's invariants are compromised and you're in undefined behavior -- so all you can do is stop.
What surprised me more is that the amount of data covering all waypoints on the globe is quite small. If I were to implement a feature that queries them by name as an identifier, the first thing I'd do is check for duplicates in the dataset, because if there are any, I need to consider that condition in every place where I'd be querying a waypoint by a potentially duplicated identifier.
I had that thought immediately when looking at flight plan format, noticed the short strings referring to waypoints, way before getting to the section where they point out the name collision issue.
Maybe I'm too used to working with absurd amounts of data (at least in comparison to this dataset); it's a constant part of my job to do some cursory data analysis to understand the parameters of the data I'm working with, what values can be duplicated or malformed, etc.
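For example, a cursory check along these lines (the waypoint names and coordinates are illustrative):

    from collections import Counter

    # Before trusting waypoint names as identifiers, count how often each name
    # appears across the merged national datasets.
    waypoints = [
        ("DVL", 48.11, -98.91),   # one fix
        ("DVL", 48.23, 16.55),    # a second, unrelated fix with the same name
        ("RESIA", 46.74, 10.51),
    ]

    duplicates = {name: n for name, n in Counter(w[0] for w in waypoints).items() if n > 1}
    if duplicates:
        print("identifiers that are NOT unique:", duplicates)
        # => every lookup by name needs disambiguation logic, not a plain dict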
If there are duplicate waypoint IDs, they are not close together. They can be easily eliminated by selecting the one that is one hop away from the prior waypoint. Just traversing the graph of waypoints in order would filter out any unreachable duplicates.
That it's safety critical is all the more reason it should fail gracefully (albeit surfacing errors to warn the user). A single bad flight plan shouldn't jeopardize things by making data on all the other flight plans unavailable.
Well yes because you're describing a system where there are really low stakes and crash recovery is always possible because you can just throw away all your local state.
The flip side would be like a database failing to parse some part of its WAL log due to disk corruption and just said, "eh just delete those sections and move on."
The other “tabs” here are other airplanes in flight, depending on being able to land before they run out of fuel. You don’t just ignore one and move on.
Nonsense comparison, your browser's tabs are de facto insulated from each other, flight paths for 7000 daily planes over the UK literally share the same space.
No, it's more like saying your browser has detected possible internal corruption with, say, its history or cookies database and should stop writing to it immediately. Which probably means it has to stop working.
It definitely isn't. It was just a validation error in one of thousands external data files that the system processes. Something very routine for almost any software dealing with data.
The algorithm as described in the blogpost is probably not implemented as a straightforward piece of procedural code that goes step by step through the input flightplan waypoints as described. It may be implemented in a way that incorporates some abstractions that obscured the fact that this was an input error.
If from the code’s point of view it looked instead like a sanity failure in the underlying navigation waypoint database, aborting processing of flight plans makes a lot more sense.
Imagine the code is asking some repository of waypoints and routes ‘find me the waypoint where this route leaves UK airspace’; then it asks to find the route segment that incorporates that waypoint; then it asserts that that segment passes through UK airspace… if that assertion fails, that doesn’t look immediately like a problem with the flight plan but rather with the invariant assumptions built into the route data.
And of course in a sense it is potentially a fatal bug because this issue demonstrates that the assumptions the algorithm is making about the data are wrong and it is potentially capable of returning incorrect answers.
I've had brief glimpses at these systems, and honestly I wouldn't be surprised if it took more than a year for a simple feature like this to be implemented. These systems look like decades of legacy code duct-taped together.
> why could the system not put the failed flight plan in a queue
Because it doesn't look at the data as a "flight plan" consisting of "waypoints" with "segments" along a "route" that has any internal self-consistency. It's a bag of strings and numbers that's parsed and the result passed along, if parsing is successful. If not, give up. In this case, fail the entire system and take it out of production.
Airline industry code is a pile of badly-written legacy wrappers on top of legacy wrappers. (Mostly not including actual flight software on the aircraft. Mostly.) The FPRSA-R system mentioned here is not a flight plan system, it's an ETL system. It's not coded to model or work with flight plans, it's just parsing data from system A, re-encoding it for system B, and failing hard if it can't.
Good ETLs are usually designed to separate good records from bad records, so even if one or two rows in the stream don't conform to the schema, you can put them aside and process the rest.
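For what it's worth, the basic shape of that pattern is tiny; a minimal sketch (Python, with a hypothetical quarantine sink and dict-like records):

    import logging

    log = logging.getLogger("etl")

    def process_batch(records, transform, load, quarantine):
        # Transform and load each record; divert failures instead of halting the stream.
        for record in records:
            try:
                load(transform(record))
            except Exception:
                # One bad record must not poison the rest: park it for human review.
                log.exception("record %s failed; quarantined", record.get("id"))
                quarantine.append(record)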
The problem is that it means you have a plane entering the airspace at some point in the near future and the system doesn't know it is going to be there. The whole point of this is to make sure no two planes are attempting to occupy the same space at the same time. If you don't know where one of the planes will be you can't plan all of the rest to avoid it.
The thing that blows my mind is that this was apparently the first time this situation had happened after 15 million records processed. I would have expected it to trigger much more often. It makes me wonder if there wasn't someone who was fixing these as they came up in the 4 hour window, and he just happened to be off that day.
Bad records aren't supposed to be ignored. They are supposed to be looked at by a human who can determine what to do.
Failing the way NATS did means that all future flight plan data, including for planes already in the sky, is no longer being processed. The safer failure mode was definitely to flag this plan and surface it to a human while continuing to process other plans.
> It makes me wonder if there wasn't someone who was fixing these as they came up in the 4 hour window, and he just happened to be off that day.
This is very possible. I know of a guy who does (or at least a few years ago did) 24x7 365 on-call for a piece of mission (although not safety) critical aviation software.
Most of his calls were fixing AWBs quickly because otherwise planes would need to take off empty or lose their take-off slot.
Although there had been some “bus factor” planning and mitigation around this guy’s role, it involved engaging vendors etc. and would have likely resulted in a lot of disruption in the short term.
A one-in-15-million chance, with 7,000 daily flights over the UK handled by NATS, meant it could be expected to happen roughly once every 69 months; it took a few months less than that.
I never said it was a good ETL system. Heck, I don't even know if the specs for it even specify what to do with a bad record - there are at least 300 pages detailing the system. Looking around at other stories, I see repeated mentions of how the circumstances leading to this failure are supposedly extremely rare, "one in 15 million" according to one official[1]. But at 100,000 flights/day (estimated), this kind of situation would occur, statistically, about twice a year.
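Back-of-the-envelope check of those numbers, assuming one independent 1-in-15-million draw per flight plan:

    p = 1 / 15_000_000
    for flights_per_day in (7_000, 100_000):        # UK-only vs rough worldwide estimate
        expected_days = 1 / (p * flights_per_day)   # mean time between occurrences
        print(f"{flights_per_day:>7}/day -> every {expected_days:,.0f} days (~{expected_days / 30.44:.0f} months)")
    # 7,000/day   -> every ~2,143 days (~70 months)
    # 100,000/day -> every 150 days, i.e. about twice a year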
The recent episode of The Daily about the (US) aviation industry has convinced me that we’ll see a catastrophic headline soon. Things can’t go on like this.
The fact that they blamed a French flight plan that had already been accepted by Eurocontrol shows that they didn't really know how the software works. And here the Austrian company should take part of the blame for the lack of intensive testing.
"software supplier"???
Why on God's green earth isn't someone familiar with the code on 7/24 pager duty for a system with this level of mission criticality?
That would be... the software supplier. This is quite a specific fault (albeit one that shouldn't have happened if better programming practices had been used), so I don't think anyone but the software's original developers would know what to do. This system is not safety-critical, luckily.
I think there is a bit of ignorance about how software like this is sold. This is not just some Windows or browser application: the sale also included staff training, help procuring hardware to run the software, and maybe more. Such systems get closed off from the outside without a way to send telemetry to the public internet (I've seen this before; it is bizarre and hard to deal with). The contract would have clauses for such situations, where you always have someone on call as the last line of defense if a critical issue happens. Otherwise, the trained teams should have been able to deal with it, but could not.
It is mostly quite primitive, but it also works amazingly well. For example ILS or VOR or ATC audio comms can all be received and read correctly using hardware built from entry level ham radio knowledge. Altimeters still require a manual input of pressure. Fuel levels can be checked with sticks.
Kinda the opposite of a modern web/mobile app: complicated, massively bloated, and breaking rather often :).
It's worse than you know. Ancient computer systems, non-ASCII character encodings, analog phone lines, and ticker-tape weather.
You'll also be surprised to learn there's still parts of the US where there's no radar or radio coverage with ATC, if flying at lower altitudes. (Heck, there's still a part of the Pacific Ocean that doesn't have ATC service at any altitude.)
Aviation drove a lot of the early developments in networked computing, which also means there's some really old tech in the stack. The globally decentralized nature of it all and it being a life-critical system means it's expensive and complicated to upgrade. (And to be clear, it does get upgraded - but it in a backwards compatible way.) Today's ATC systems need to work with planes built in the 1950s, and talk to ATC units in small countries that still use ancient teletype systems and fax machines.
But yet it's all still incredibly safe, because the technology is there to augment human processes - not replace them. Even if all the technology fails, everything can still be done manually using pen and paper.
Essentially this is down to the lack of a proper namespace; who'd have thought aerospace engineers need to study operating systems! I've a friend who's a retired air force pilot and graduated from Cranfield University, the UK's foremost postgraduate institution for aerospace engineering, with its own airport for teaching and research [1]. According to him he did study OSes at Cranfield, and now I finally understand why.
Apparently, based on the other comments, a standard for namespacing is already available but isn't currently being used by NATS/ATC; hopefully they've learnt their lesson and start using it, for goodness' sake. The top comment mentioned the geofencing bug, but if NATS/ATC used a proper namespace, geofencing probably wouldn't be necessary in the first place.
It sounds like a great place to study that has its own ~2km long airstrip! It would be nice if they had a spare Trident or Hercules just lying around for student baggage transport :)
"the description sounds like the procedure is working directly on the textual representation of the flight plan, rather than a data structure parsed from the text file. This would be quite worrying, but it might also just be how it is explained."
Oh, this is typical in airline industry work. Ask programmers about a domain model or parsing and they give you blank stares. They love their validation code, and they love just giving up if something doesn't validate. It's all dumb data pipelines. At no point is there code that models the activities happening in the real world.
In no system is there a "flight plan" type that has any behavior associated with it or anything like a set of waypoint types. Any type found would be a struct of strings in C terms, passed around and parsed not once, but every time the struct member is accessed. As the article notes, "The programming style seems very imperative.".
Giving up when something doesn't validate is indeed standard practice to avoid propagating badly interpreted data and causing far more complex bugs down the line. Validate soon, validate strongly, report errors, and don't try to interpret whatever the hell is wrong with the input; don't try to be 'clever', because that's where the safety holes lie. Crashing on bad input is wrong, but trying to interpret data that doesn't validate, without specs (of course), is fraught with incomprehension and incompatibilities down the line, or unexpected corner cases (or untested ones, but no one wants to pay for a fully tested handle-everything system, or even for the tools to simulate 'wrong inputs' or for formal validation of the parser and all the code using the parser's results).
There are already too many problems with non-compliant or legacy (or just buggy) data emitters, with the complexity in semantics or timing of the interfaces, to try and be clever with badly formatted/encoded data.
It's already difficult (and costly) to make a system work as specified, so adding subtle variations to make it more tolerant of unspecified behaviour is just asking for bugs (or for more expensive systems that don't clear the purchasing price bar).
From a safety-critical standpoint, I've always found this article interesting but strange. You want both, before taking into account any data from anything outside of the system. Do both, as soon as possible. Don't propagate data you haven't validated in every way your spec says to. If you have more stringent specs than any standard you're using, be explicit about it and reject the data with a clear failure report. Check for anything that could be corrupted, misformatted, or otherwise something you're not expecting that could cause unexpected behaviour.
I feel the lack of investment in destroying the parsing- (and validation-) related classes of bugs is the worst oversight in the history of computing. We have the tools to build crash-proof parsers (SPARK, Frama-C, and custom model-checked code generators such as RecordFlux) which, while not perfect in any way, would have let us move on to other problems if they'd received even a tiny bit of the effort the security industry put into mending all the 'Postel's law' junk out there.
I built, with an intern, an in-house bit-precise code generator for deserializers that can be proved free of runtime errors, and am moving on to semantic checks ('field X and field Y can only be present together', or 'field Y must be greater than or equal to the previous time field Y was present'). It's not that hard, compared to many other proof and safety/security endeavours.
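A toy sketch of what declarative inter-field checks like that can look like (Python; the field names and message format are made up, this is not the in-house generator described above):

    def check_presence_pair(msg, x, y):
        # "x and y can only be present together"
        if (x in msg) != (y in msg):
            return f"{x} and {y} must be present together"

    def check_monotonic(msgs, field):
        # "field must be >= the previous time it was present"
        last = None
        for i, msg in enumerate(msgs):
            if field in msg:
                if last is not None and msg[field] < last:
                    return f"{field} decreased at message {i}"
                last = msg[field]

    def validate_stream(msgs):
        errors = [e for m in msgs if (e := check_presence_pair(m, "pos_lat", "pos_lon"))]
        if (e := check_monotonic(msgs, "timestamp")):
            errors.append(e)
        return errors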
> It's not that hard, compared to many other proof and safety/security endeavours.
Yes, but the code has to understand and model the input into a program representation: the AST. That's the essence of the "parse, don't validate" paradigm. Instead of looking at each piece of a blob of data in isolation to determine if it's a valid value, turn the input into a type-rich representation in the problem domain.
In the case of the FPRSA-R system in question, it does none of that. It's simply a gateway to translate data in format A to data in format B, like an ETL system. It's not looking at the input as a flight plan with waypoints, segments and routes.
Why the programmers chose to do the equivalent of bluescreening on one failed input, I can't say. As others have pointed out, the situation it gave up on isn't so rare: 1 in 15 million will happen. Of course switching to an identical backup system is a bad choice, too. In safety-critical work, there needs to be a different backup, much like the Backup Flight System in the space shuttle or the Abort Guidance System on the Apollo Lunar Module: a completely different set of avionics, programmed independently.
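To make the "parse, don't validate" point above concrete, a minimal sketch of the idea (Python; the field names are illustrative, not the ICAO4444/ADEXP schema):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Waypoint:
        ident: str
        lat: float
        lon: float

    @dataclass(frozen=True)
    class FlightPlan:
        callsign: str
        route: tuple[Waypoint, ...]

    def parse_flight_plan(raw: dict, waypoint_lookup) -> FlightPlan:
        # Raise on anything malformed; on success the result needs no re-checking,
        # and downstream code never touches raw strings again.
        route = tuple(waypoint_lookup(ident) for ident in raw["route"])
        if len(route) < 2:
            raise ValueError("a route needs at least an entry and an exit point")
        return FlightPlan(callsign=raw["callsign"], route=route)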
One of the reasons developers 'let it crash' is because no one wants to pay for error recovery, and I mean the whole design (including system level), testing, and long-term maintenance of barely used code.
THAT SAID, isolating the decoding code and data structures, and having a way back to either checkpoint/restore or wipe out bad state (or proving the absence of side effects, as SPARK dataflow contracts allow, for example), is better design, and I wish it were taught more often. I really dislike how often exception propagation is taught without showing how to handle the side effects...
That's super interesting (and a little terrifying). It's funny how different industries have developed different "cultures" for seemingly random reasons.
It was terrifying enough for me in the gig I worked on that dealt with reservations and check-in, where a catastrophic failure would be someone boarding a flight when they shouldn't have. To avoid that sort of failure, the system mostly just gave up and issued the passenger what's called an "Airport Service Document": effectively a record that shows the passenger as having a seat on the flight, but unable to check-in. This allows the passenger to go to the airport and talk to an agent at the check-in desk. At that point, yes, a person gets involved, and a good agent can usually work out the problem and get the passenger on their flight, but of course that takes time.
If you've ever been at the airline desk waiting to check in and an agent spends 10 minutes working with a passenger (or passengers), it's because they got an ASD and the agent has to screw around directly in the user-hostile SABRE interface to fix the reservation.
It's better to say SABRE replicated, in digital form, that card file. And even today the legacy of that card form defines SABRE and all the wrappers and gateways to it.
A day I don't want to remember.
Took me 15 hours to reach my destination instead of 2.
Had to take train, bus, then train again. 30 minutes after I had booked my tickets, everything was fully booked for two days.
I waited in the airport for 6 hours before learning that my flight was cancelled, and had to rebook... I was flying to New York to see my family, so I didn't really have any alternate transportation options!
That's a shame, sorry to hear that. I got more lucky: I had to wait for 6 hours too but my flight suddenly resumed (must have been one of the first few). I didn't have any alternatives to go home either so I feel for all of those stuck in a foreign country.
I wish the article contained some explanation of why the processing for NATS requires looking at both the ADEXP waypoints and the ICAO4444 waypoints (not a criticism per se, it may not have been addressed in the underlying report). Just looking at the ADEXP seems sufficient for the UK segment logic.
I'm guessing it has something to do with how ICAO4444 is technically human readable, and how in some meaningful sense, pilots and ATC staff "prefer" it. e.g., maybe all ICAO4444 waypoints are "significant" to humans (like international airports), whereas ADEXP waypoints are often "insignificant" (local airports, or even locations without any runway at all).
Of course with 20/20 hindsight, it seems obviously incorrect to loop through the ICAO4444 waypoints in their entirety, instead of "resuming" from an advanced position. But why look at them at all?
They use the ADEXP to determine which part of the route is in the UK, because the auto-generated points are ATC area handover points. So this data is the best way to see which part of the route is within UK airspace.
Then it needs to find the ICAO part that corresponds, because the controller needs to use the ICAO plan that the pilot has.
If the controller sees other (auto generated) waypoints that the pilots don't have you get problems during operation. A simple example is that controllers can tell pilots to fly in a straight line to a specific point on their filed route (and do so quite often). The pilot is expected to continue the filed route from that point onwards.
They can also tell a pilot to fly direct to some random other point (this also happens but less often). The pilot is then not expected to pick up a route after that point.
The radio instruction for both is exactly the same, the only difference is whether the point is part of the planned route or not. So the controller needs to see the exact same route as the pilots have, not one with additional waypoints added by the IFPS system.
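If I'm reading that right, the matching step looks something like this (a sketch with made-up data shapes; it glosses over the "clipping" case mentioned elsewhere in the thread):

    def uk_portion(adexp_points, icao_points):
        # Use the expanded ADEXP points to find UK entry/exit, then locate those
        # same fixes in the pilot's filed ICAO route so the controller sees
        # exactly what the pilot filed.
        uk = [p for p in adexp_points if p["in_uk_airspace"]]
        if not uk:
            return None                              # flight never enters UK airspace
        entry, exit_ = uk[0]["ident"], uk[-1]["ident"]

        idents = [p["ident"] for p in icao_points]
        i = idents.index(entry)                      # first occurrence of the entry fix
        j = i + idents[i:].index(exit_)              # search for the exit only at or after the
                                                     # entry fix, never in the part already flown
        return icao_points[i:j + 1]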
Possibly it needs the ICAO information to communicate with some systems, but has to work in ADEXP to have sufficient granularity (the essay mentions the possibility of “clipping”, a flight going through the UK between two ICAO waypoints).
What I don’t understand in situations like this when thousands of flights are cancelled is how do they catch up? It always seems like flights are at max capacity at all times, at least when I fly. If they cancel 1,000 flights in one day, how do they absorb that extra volume and get everyone where they need to be? Surely a lot of people have their plans permanently cancelled?
There's always some empty capacity, whether it's non-rev tickets for flight crew and their families which are lower priority than paying customers or people who miss their flights.
I had a cancelled flight recently and they booked people two weeks out because every flight from that day onward was full or nearly full. I showed up the next morning and was able to board the next flight because exactly one person had scanned in their boarding pass (was present at the airport) but did not show up for whatever reason to the airplane.
Beyond that, people just make alternate plans, whether it's taking a bus or taxi home, traveling elsewhere, picking another airline, anything is possible.
I work in logistics for an FMCG company and sometimes our main producer goes down and we run out of certain types of stock. We send as much out as we can and cancel the rest.
If they really want the stock the customers can rebook an order for tomorrow because they aren't getting it today. And we just start adding extra stock to each delivery.
It's the best of a bad situation.
We don't have the money to have extra trucks and very perishable stock laying about and I know the airlines don't pay 300 grand a month to lease a 737 just to have it sat about doing nothing. There's very little slack.
I had been considering becoming an air traffic controller myself, and it rather tickles me to think I might have missed my once-in-a-lifetime opportunity to direct aircraft with the original pen-and-paper flight strip mechanism in the 21st century! Completely safe, excruciatingly low-capacity, and sounds like awfully good fun as a novelty (for the willing ATC, not the passengers stuck on the ground, I hasten to add).
Quite a few non-major airports still rely heavily on pen-and-paper methods to some degree.
An example is islands that see only a few flights per week and can't justify heavy upgrade investments.
Airplanes are generally spaced hours apart and you need to work out by hand where they are. But again, there are so few planes that the risks are minimal.
Indeed, but the set of aerodromes that are large enough to have a tower controller but not large enough to have their own radar surveillance is shrinking all the time. Radar is getting cheaper and what with ADS-C and TA/RA, a big reason to have ATC even without radar is vanishing (namely that of preventing collisions close to the airport). Oceanic control is probably the closest you can get nowadays to routine ATC without radar, even though they now have automatic position reports via satellite.
There was a time recently when only 3 out of the 300+ air traffic control centers in the U.S. were fully staffed. All the rest were short-handed. Not sure how it stands today
Every system I've ever made has better error reporting than that one. Even those that only I use. First thing I get working in a new project is the system to tell me when something fails and to help me understand and fix the problem quickly. I then use that system throughout development so that it works very well in production. I'd love to talk to the people who made the system discussed in the article. Is one of them reading this? Can you explain how this problem came to report itself so badly?
Yes it seems incredibly lame error reporting that they had to spend hours contacting the original vendor (to "analyse low-level software logs") just to find out which flight plan had crashed the system
Trusted input should rarely be trusted. It's input. You need to validate it as if it is hostile and have a process for dealing with malformed input. Now of course, standing on the sidelines it is easy to criticize and I'm sure whoever worked on this wasn't stupid. But I've seen this error often enough in practice that I think it needs to be drilled into programmers' heads more forcefully: stuff is only valid if you have just validated it. If you send it to someone else, if someone you trust sends it to you, if you store it in a database and then retrieve it, and so on, then it is just input all over again and you should probably validate it for being well-formed. If you don't do that, then you're a bitflip, a migration, or an update away from an error that will cause your system to go into an unstable state, and the real problem is that you might just propagate the error downstream because you didn't identify it.
Input is hard. Judging what constitutes 'input' in the first place can be harder.
That's fine, and is exactly the kind of case that I was thinking of: your software has a different idea of what is valid than an upstream piece of software, so from your perspective it is invalid. So you need to pull this message out of the stream, sideline it so it can be looked at by someone qualified enough to make the call of what's the case (because it could well be either way) and processing for all other messages should continue as normal. After all the only reason you can say with confidence that it in fact was valid is because someone looked at it! You can only do that well after the fact.
A message switch [1] that I worked on had to deal with messages sourced from hundreds of different parties, and while in principle everybody was working from the same spec (CCITT [2]), every day some malformed messages would land in the 'error' queue. Usually the problem was on the side of the sender, but sometimes (fortunately rarely) it wasn't, and then the software would be improved to be able to handle that case correctly as well. Given the size of the specs and the many variations on the protocols it wasn't weird at all to see parties get confused. What's surprising is that it happens as rarely as it does.
The big takeaway here should be that even if something happens very rarely it should still not result in a massive cascade, the system should handle this gracefully.
This really isn't about input. Whether it comes from outside or is produced inside the application, the reality is that everything can have bugs. A correct input can cause a buggy application to fail. So while verifying input is obviously an important step, it is not even a beginning if you are really looking to build reliable software.
What really is the heart of the matter is for the entire thing to be allowed to crash due to a problem with single transaction.
What you really want to do is to have firewalls. For example, you want a separate module that runs individual transactions and a separate shell that orchestrates everything but has no or very limited contact with the individual transactions. As bad as giving up on processing a single aircraft is, allowing the problem to cascade to entire system is way worse.
What's even more tragic about this monumental waste of resources is that the knowledge about how to do all of this is readily available. The aerospace and automotive industry have very high development standards along with people you can hire who know those standards and how to use them to write reliable software.
Yes, there are multiple problems here that interplay in a really bad way and that's one of them. But the input processing/validation step is the first point of contact with that particular flight plan and it should have never progressed beyond that state.
It all hinges on a whole bunch of assumptions and each and every one of those should be dealt with structurally rather than by patching things over.
Just from reading TFA I see a very long list of things that would need attention. Quick recap:
- validate all input
- ensure the system can never stall on any one record
- the system will occasionally come across malformed input which needs a process
- it won't be immediately clear whether the system or the input is at fault, which needs a process
- testing will need to take these scenarios into account
- negative tests will need to be created (such as: purposefully malformed input)
- attempts should be made to force the system into undefined states using malformed and well formed input
- a supervisor mechanism needs to be built into the system that checks overall system health (a rough sketch follows below)
And probably many more besides. But this is what I gather from the article is what they'll need at a minimum. Typically once you start digging into what it would take to implement any of these you'll run into new things that also need fixing.
As for the last bit of your comment: I'm quite sure that those standards were in play for this particular piece of software, the question is whether or not they were properly applied and even then there are no guarantees against mistakes, they can and do happen. All that those standards manage to do is to reduce their frequency by catching the bulk of them. But some do slip through, and always will. Perfect software never is.
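A sketch of the "never stall on one record" and "supervisor mechanism" items from the list above (Python; the thresholds and names are invented for illustration, not taken from the real system):

    import time

    def supervised_loop(source, handle, quarantine, alert,
                        stale_after=60.0, max_consecutive_errors=10):
        # Process records one at a time, quarantine failures, and track overall
        # health so a stuck or degrading system gets escalated to a human.
        last_progress = time.monotonic()
        consecutive_errors = 0
        for record in source:                         # source yields flight-plan messages
            try:
                handle(record)
                consecutive_errors = 0
                last_progress = time.monotonic()
            except Exception as exc:
                consecutive_errors += 1
                quarantine.append((record, exc))      # keep the offending input visible
            if consecutive_errors >= max_consecutive_errors:
                alert("many consecutive failures -- likely a systemic fault, not one bad input")
            if time.monotonic() - last_progress > stale_after:
                alert("nothing processed successfully for a while -- check system health")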
> Tonight we were wondering why nobody had identified the flight which caused the UK air traffic control crash so we worked it out. It was FBU (French Bee) 731 from LAX/KLAX to ORY/LFPO.
> It passed two waypoints called DVL on its expanded flight plan: Devil's Lake, Wisconsin, US, and Deauville, Normandy, FR (an intermediate on airway UN859).
> The software and system are not properly tested.
Followed by a suggestion to do fuzz testing.
* Automatically generating valid flight paths is somewhat hard (and you'd have to know which ones are valid, because the system, apparently, is designed to also reject some paths). It's also possible that such a generator would generate valid but improbable flight paths. There's probably an astronomical number of possible flight paths, which makes exhaustive testing impossible, so there's no guarantee that a "weird" path would have been found. The points through which the paths go are also somewhat dynamic: new airports aren't added every day, but over the life-span of such a system a few probably will be, and more realistically some points on flight paths may be removed. Does the fuzzing have to account for the possibility of new / removed points?
* This particular functionality is probably buried deep inside other code with no direct or easy way to extricate it from its surroundings, and so would be very difficult to feed into a fuzzer. Which leads to the question of how much fuzzing should be done and at what level. Add to this that some testing methodologies insist on divorcing testing from development so as not to create an incentive for testers to automatically okay the output of development (as they would be sort of okaying their own work). This is not very common in places like the Web, but is common in e.g. medical equipment (it's actually in the guidelines). So, if the developer simply didn't understand what the specification told them to do, then it's possible that external testing wasn't capable of reaching the problematic code path, or was severely limited in its ability to hit it.
* In my experience with formats and standards like these, it's often the case that the standard captures a lot of impossible or unrealistic cases, hopefully a superset of what's actually needed in practice. Flagging every way in which a program doesn't match the specification becomes useless or even counter-productive, because developers get overloaded with bug reports most of which aren't really relevant. It's hard to identify the cases that are rare but plausible. The fact that the testers didn't find this defect in time is really just a function of how much time they have. And, really, the time we have to test any program covers a tiny fraction of what's required to test it exhaustively. So, you need to rely on heuristics and gut feeling.
None of this really argues against fuzz testing; even with completely bogus/malformed flight plans, it shouldn't be possible for a dead letter to take down the entire system. And, since it's translating between an upstream and downstream format (and all the validation is done when ingesting the upstream), you probably want to be sure anything that is valid upstream is also valid downstream.
It's true that fuzz testing is easiest when you can do it more at the unit level (fuzz this function implementing a core algorithm, say) but doing whole-system fuzz tests is perfectly fine too.
This is not against the principle of fuzz testing. This is to say that the author doesn't really know the reality of testing and is very quick to point fingers. It's easy to tell in retrospect that this particular aspect should've been tested. It's basically impossible to find such defects proactively.
Easy for me to say in retrospect, but IMO this is a textbook example of where you should reach for fuzz testing; it’s basically protocol parsing, you have a well-known text format upstream and you need to ensure your system can parse all well-formed protocol messages and at very least not crash if a given message is invalid in your own system.
Similarly with a message queue, handling dead letters is textbook stuff, and you must have system tests to verify that poison pills do not break your queue.
I did not think the author was setting unreasonable expectations for the a priori testing regime. These are common best practices.
This all sounds like exactly the stuff that fuzzing or property-based testing is good for
And if the functionality is "buried deep inside other code with no direct or easy way to extricate it from its surrounding" making it hard to test then that's just a further symptom of badly designed software in this case
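A property-based test in that spirit can be quite small; a sketch using Hypothesis (translate_plan and RejectedPlan are stand-ins for the real translation step, not anything from the actual system):

    from hypothesis import given, strategies as st

    class RejectedPlan(Exception):
        pass                                  # clean rejection of a plan we can't handle

    def translate_plan(route):
        # Stand-in for the real ICAO -> downstream translation.
        if not route:
            raise RejectedPlan("empty route")
        return [wp.upper() for wp in route]

    waypoints = st.text(alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ", min_size=1, max_size=5)

    @given(st.lists(waypoints, max_size=50))
    def test_translation_never_crashes(route):
        # Property: any input either translates or is rejected cleanly;
        # no other exception may escape (let alone halt the whole system).
        try:
            translate_plan(route)
        except RejectedPlan:
            pass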
What I still don't understand is how flight plans get approved?
In my mind they would only be approved once all involved countries have reviewed and processed the plan. That way we wouldn't need this ridiculous idea of failing safe across the whole of UK airspace for a single error.
That day a single flight plan could have been rejected, perhaps just resubmitted, and the bug quietly fixed in the background.
French here, as much as I wish It was the case for comical effect… I don’t think so.
Our right wing press is also desperately economically liberal so anything privately run is inherently better.
Maybe radio stations? Honestly, major respect to the daily mail for those snarky attacks that keep up the good spirits between our two countries.
It's maybe the food or the weather that makes them aggro? Idk, but don't worry, we love to hate the perfide Albion. Too.
Fellow French here: am I wrong? Maybe "valeur actuelle" could pull that type of bullshit, but I think they are too busy blaming Islam to start thinking about our former colony across the channel.
> Safety critical software systems are designed to always fail safely. This means that in the event they cannot proceed in a demonstrably safe manner, they will move into a state that requires manual intervention.
unrelated - this instantly caused me to think about tesla autopilot crashes that have been reported with emergency vehicles
Has the culprit flight-plan been disclosed? I'd be interested to know how easy it is to create a realistic looking flight-plan through UK airspace that reproduces the problem. I.e. how much truth is there when NATS say this was a 1 in 15m probability?
This is an interesting engineering problem and I'm not sure what the best approach is. Fail safe and stop the world, or keep running and risk danger? I imagine critical systems like trading/aerospace have this worked out to some degree.
There isn't and cannot be a preference for either one. It always depends on what the system is doing and what the consequences would be... A pacemaker cannot "fail safe", for example, under any circumstances. It's meaningless to consider such cases. But if escalation to a human operator is possible, then it will also depend on how the system is meant to be used. In some cases it's absolutely necessary that the system doesn't try to handle errors (e.g. if a patient is in a CT machine, you always want to stop, to at least prevent more radiation), but in a situation like the one with flight control, my guess is that you want the system to keep trying while alerting the human operator.
But then it can also depend on what's in the contract and who will get the blame for the system functioning incorrectly. My guess here is that failing without attempting to recover was, while overkill, a safer strategy than letting e.g. two airplanes be scheduled for the same path (and potentially collide).
The best approach is to simply print the error to the screen, rather than burying it in a “low level log” which only the software vendor has access to.
They had a four hour buffer until the world stopped, but most of that was pissed away because no one knew what the problem was.
Bugs happen. Fact of being written by fleshy meatballs.
What should also have been highlighted is that they clearly had no easy way of finding the specific buggy input in the logs nor simulating it without contacting the manufacturer.
It sounds like a simple functional smoke test throwing random flight plans at the system would have eventually (and probably pretty soon) triggered this. I hope they at least do it now.
I once worked with 4G BTS (Base Transceiver Stations), where one of the issues was preventing errors on the running board from propagating to the backup systems. There was no clean way to do it, given that the malformed input will eventually reach the backup system and produce the same error. The post talks about the system delaying processing to protect the backup. Perhaps a solution would be to go in the other direction and have a staging step to avoid compromising the pipeline. Very interesting article.
Well, I certainly hope they've at least stopped issuing waypoints with identical names... although it wouldn't surprise me if geographically-distant is the best we can do as a species.
They appear to be sequences of 5 upper-case letters. Assuming the 26-character alphabet, that should allow for nearly 12 million unique waypoint IDs. The world is a big place but that seems like it should be enough. The more likely problem is that there is (or was) no internationally-recognized authority in charge of handing out waypoint IDs, so we have at least legacy duplicates if not potential new ones.
You have to reduce that to the (still massive) set of IDs that are somewhat pronounceable in languages that use the Latin script. You don't want to be the air traffic controller trying to work out how to say 'Lufthansa 451, fly direct QXKCD'. Nonetheless, I think there is little cause for concern about changing existing IDs. There might be sentimental attachment, but it takes barely a few flights before the new IDs start sticking, and it's not like pilots never fly new routes.
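Quick check of the numbers, plus a crude stand-in for the "pronounceable" constraint (consonant/vowel alternation only, purely illustrative):

    print(26 ** 5)                           # 11,881,376 possible five-letter IDs
    V, C = 5, 21                             # vowels, consonants
    print(C * V * C * V * C)                 # CVCVC patterns: 231,525
    print(V * C * V * C * V)                 # VCVCV patterns:  55,125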
No, waypoints aren't spelled out with the ICAO alphabet. They are mnemonics that are pronounced as a word and only spelled out if the person on the receiving end requests it because of bad radio reception, or unfamiliarity with the area/waypoint.
For example, Hungarian waypoints, at least the more important ones, are normally named after cities, towns or other geographical locations near them, and use the location's name or an abbreviated version of it, taking care that they can be pronounced reasonably easily by English speakers. Like: ERGOM (for the city Esztergom), ABONY (for the town Füzesabony), SOPRO (for Sopron), etc.
It is, but fixes are almost always spoken as words rather than letter-by-letter. For this reason, they are usually chosen to be somewhat pronounceable, and occasionally you even get jokes in the names. Likewise, radio beacons and airports are usually referred to by the name of their location; for instance "proceed direct Dover" rather than "proceed direct Delta Victor Romeo".
I think a lot of pilots and air traffic controllers would be irritated if they had to spend longer reading out clearances and instructions. In a world where vocal communication is still the primary method of air traffic control, there might be a measurable reduction in capacity in some busier regions.
Disney has a whole lot of special fixes in Orlando and Anaheim. The PIGLT arrival passes through HKUNA, MTATA, JAZMN, JAFAR, RFIKI, TTIGR. I'm fairly sure I've heard about some variants on MICKY, MINEE, GOOFY, PLUTO, etc.
My question is: why was the algorithm searching any section before the UK entry point. You can’t exit at a waypoint before you enter so there is no reason to search that space.
> The manufacturer was able to offer further expertise including analysis of lower-level software logs which led to identification of the likely flight plan that had caused the software exception.
This part stood out to me. I've found it super helpful to include a reference to which piece of data I'm working with in log messages and exceptions. It helps isolate problems so much faster.
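The cheap version of that habit is just to attach the identifier of the thing you were working on before re-raising, so "which flight plan did it?" takes seconds instead of hours of vendor log analysis. A sketch (hypothetical names):

    def process(plan, translate):
        try:
            translate(plan)                  # whatever the real processing step is
        except Exception as exc:
            # The wrapped error now names the offending input.
            raise RuntimeError(f"failed while processing flight plan {plan.get('id')}") from exc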
Did the creator of the flight plan software engage in adversarial testing to see if they could break the system with badly formed flight plans? Or was / is the typical practice to mostly just see if the system meets just the "well-behaved" flight plan processing requirements? (with unit tests, etc)
It must suck to be responsible for a system that everyone depends on and millions of dollars are riding on so you are very reluctant to change it, even if you know it needs technical improvements.
Formal verification or fuzzing could have helped them over that mistrust, but are not panaceas
I imagine, for this kind of system, there is only one supplier. Why not force that supplier, as part of their 10-15 yr contract, to publish the source code for everything, not necessarily as FOSS. This way if there are bugs they can be reported and fixed.
Poison Pill! Why on earth would the best failure mode be to cease operating? Just don’t accept the new plan being ingested and tell the person uploading that their plan was rejected. Impact one flight not thousands!
I wondered this- I have absolutely no understanding of what's involved in flight system development, but does anyone know why it doesn't do this?
By contrast, it's normal for an API to return 500 if something goes wrong and keep serving other requests. It would seem insane if it crashed out and completely stopped. Any idea why the parallel isn't true for a flight system?
Interesting to see that flight plans over the UK have to be filed 4 hours in advance.
No mention of plane, pilot, passenger or cargo manifests. So why the 4 hour lead time? Is this the time it takes UK authorities to look people up or work out whether the cargo could be dangerous, in an airborne Anthrax (Gruinard) Island [1] or Japanese subway sarin [2] sense, or an IRA favourite, a fertilizer bomb that's bypassed the usual purchase reporting regulations, used by people like Jeremy Clarkson and Harry Metcalfe as their store of wealth [3]?
It makes me wonder just how much more surveillance of the population exists, knowing I can't even step out of the front door without attracting surveillance of the type that followed Dr David Kelly.
Sure, it's not a cyber attack per se, carried out over the internet like a DDoS attack or a brute-force password-guessing attack with port-knocking mitigation, but how would one carry out a cyber attack on this system if the only attack vector is people submitting flight plans?
There sure is a constant playing down of the cyber attack angle to this which makes me think someone wants to Blurred Lines!
One point on the lack of uniquely named global waypoints, which is the main crux of why the system fell over, if some are to be believed.
The USA demonstrates a disproportionate number of similar names, by virtue of Europeans migrating to the US [4]. So has this situation arisen with this system in other parts of the world, like the US? How can a country that created the globe-spanning British Empire become so insular with regard to air travel in this way?
I'd agree with the initial assessment that there appears to be a lack of testing, but are the specifications simply not fit for purpose? I'm sure various pilots could speak out here, because some of the regulations require planes to be minimally distanced from each other when transiting across the UK.
On the point of ICAO and other bodies eradicating non-unique waypoint names, it's clear there is some legacy constraint still impeding the safety of air travellers, perhaps caused by poor-quality analogue radio audio, so perhaps it's time for the unambiguous and globally recognised What 3 Words form of location identifier to come into effect?
The UK police already prefer it to speed up response times [4]. And although the same location can create 3 different words, suggesting drift with GPS [5], even if What 3 Words could not be used for a global system, having something a bit longer to create an easily recognisable human globally unique identifier is needed for these flight plans and perhaps maritime situations.
Obviously global coordination will be like herding cats, and if such a fixed-size global network of cells were introduced, some areas, like transits over the Atlantic or Pacific, could command bigger cells, while transiting over built-up areas like London would require smaller identifiable cells. But IF ever there was a time for the New World Order to step up to the plate and assert itself, to create a Globally Unique Place ID (GUPID) for the whole planet, now is the time.
On the point that humans were kept safe only by the sheer common sense of the pilots and control tower staff: it's not something NATS did or should claim. Their systems were down, so everyone had to fall back to pen and paper and blocks in queues, and apart from Silverstone when the F1 British Grand Prix is on, is airspace ever that densely populated?
NATS were caught with their pants down at so many levels of altitude. Is this the laissez-faire UK management style that saw the Govt having to step in to bail out the banks during the financial crisis still infecting other parts of UK life and still coming to light?
To answer your question without conspiracy drivel, let's look up CAP 694: The UK Flight Planning Guide [0]
Chapter 1
> 6.1 The general ICAO requirement is that FPLs should be filed on the ground at least 60 minutes before clearance to start-up or taxi is requested. The "Estimated Off Block Time" (EOBT) is used as the planned departure time in flight planning, not the planned airborne time.
> 6.3 IFR flights on the North Atlantic and on routes subject to Air Traffic Flow Management, should be filed a minimum of 3 hours before EOBT (see Chapter 4).
Chapter 4
> 1.1 The UK is a participating State in the Integrated Initial Flight Plan Processing System (IFPS), which is an integral part of the Eurocontrol centralised Air Traffic Flow Management (ATFM) system.
> 4.1 FPLs should be filed a minimum of 3 hours before Estimated Off Block Time (EOBT) for North Atlantic flights and those subject to ATFM measures, and a minimum of 60 minutes before EOBT for all other flights.
So the answer is because the UK is part of a Europe-wide air traffic control system, which hands out full flight plans to all the relevant authorities for each airspace, and they decided 3 hours is needed so that all possible participants can get their shit together and tell you if they accept the plan or not.
An entirely separate system exists to share Advanced Passenger Information, i.e. passenger manifests [1], and it goes even further that airlines share your overall identity with each other, known as a Passenger Name Record [2], and a variety of countries, led by the USA, insist on this information in advance before the plane is allowed to take off [3]
If you're going to be paranoid, please work with known facts instead of speculating.
So on point 3 then, why do countries turn people away at the destination and not before take-off, if their visa or passport is not in order?
Are their systems not joined up, or does the state just like making examples of people once they're in the destination country? I can watch this stuff happening to people at airport border controls on TV all the time, so which is it?
Most countries do not have prescreening or data sharing treaties. Even in the case where two countries do have it, not all entrance criteria can be determined by electronic records. Countries reserve the right to check the traveller themselves before they permit entry.
Airline employees routinely turn people away at the departure airport due to visa/passport paperwork not being in order. Timatic is the usual system that most airlines subscribe to for this kind of thing. Airlines are highly incented to avoid letting a passenger board who won't be admitted, because they're on the hook for returning that passenger. But an airline employee at the departure airport is never going to be a perfect proxy for an immigration officer in every country they fly to, and immigration officers generally have wide latitude in who they accept/reject. It is extremely possible that all your paperwork is in order but the immigration control officer rejects you for other reasons.
And is being on the hook for returning passengers who are not allowed into the destination country a legal obligation?
I got caught in the US when Hurricane Katrina made landfall, and while we were flown out by our carrier, other carriers in that situation would also honour our ticket and fly us back to the UK.
It seemed like an exodus where all the carriers just got people out of the country as quickly as possible. We were on the last flight out of the airport, but this isn't a legal thing, is it?
Software has bugs, that's not really the damning part... The damning part is that in four hours and two levels of support teams, there was no one who actually knew anything about how the system worked who could remove the problematic flight plan so that the rest of the system could continue operating!
What exactly is the point of these support teams when they can't fix the most basic failure mode (a single bad input...)
Unfortunately, I work on a reasonably modern ERP system which has been customized significantly for the client and also works with a wider range of client-specific data combinations that the vendor has seemingly not anticipated / other clients do not have.
What it means is that on a regular basis, teams will be woken up at 2am because a batch process aborted on bad data; AND it doesn't tell you what data / where in the process it aborted.
The only possibility is to rerun the process with crippling traces, and then manually review the logs to find the issue, remove it, and then re-run the program again (hopefully remembering to remove the trace:).
Even when all goes per plan, this can at times take more than 4 hrs.
Now, we are not running a mission-critical real-time system like air traffic, and I'm in NO way saying any of this is good; but it may not be the case that "two levels of support teams didn't know anything" - the system could just be so poorly designed that even with the best operational experience and knowledge, it still took that long :-< .
On HN, we take a certain level of modernity, logging, failure states, messaging, and restartability for granted, which may not be even remotely present on more niche or legacy systems (again, NOT saying that's good; just indicating the issue may be less about operational competence than about design). It's easy to judge from our external perspective, but we have no idea what was presented / available to the support teams, or what their mandatory process is.
They bought software from a third party and treat it as a "black box". There are a few known ways that the software fails, and the local team has instructions on how to fix them. But if it fails in an unexpected way, good luck: it's impossible for the local team to identify and fix the problem without the vendor.
The reason it took so long was that they realized too late that they needed to call the vendor.
Probably you have to blame managers rather than engineers in the support team.
Considering this same failure has happened a few times in recent memory, maybe it's over-optimistic of me to expect an entry on the support wiki or something.
One important software engineering skill that is often overlooked is the art of writing just the right amount of logging, such that you have sufficient information to debug easily when things go wrong, but not so much that it gets ignored or pruned in production.
And when did you last test your monthly backups? But seriously. If you fill out all the positions in an org chart it's easy to think you're delivering, and for a lot of situations it usually works. Anointing someone a manager usually works out because people can muddle through. It doesn't work in medicine, or as it turns out, air traffic control.
Having worked in tech support: level 3 (Devs) should have described their source code structure to level 2, and let them access it when they needed it.
You don’t need a complete diagnosis if you can spit out enough debug info that says, “oops shat the bed while working with this flight plan”, then the support people can remove the one that’s causing you to fail, restart the system, and tell ATC to route that one manually.
Try to get developers who love to code and create to stay on a support team and be on an on-call roster. I betcha at least half will say no, and the other half will either leave or you'll run out of money paying them.
> The consequence of all this was not that any human lives were put in danger, ..
When you're arguing that cancelling 2000 flights cost £100M and that no human danger was incurred, something should feel off. That might be around 600k humans who weren't able to be where they felt they needed to be. Did they have somewhere safe to sleep? Did they have all the medications they needed with them? Did they have to miss a scheduled surgery? Could we try to measure the effect on their well-being in aggregate, using a metric other than the binary state of alive or facing imminent death? You get the idea.
Of course I agree with the version of the claim that says that no direct danger was caused from the point of view of the failing-safe system. But when you're designing a system, it ought to be part of your role to wonder where risk is going as you more stringently displace it from the singular system and source of risk that you maintain.
I mean it could have also saved lives by that logic. Did someone missing their flight mean they also missed a terrible pileup on the roadways after landing? We can imagine pretty much any scenario here.
I agree with you that we don't know! But my thesis is that we should still do our best, when considering how much risk the systems we maintain should be willing to keep operating through.
Well, that's the Daily Mail for you, where they tag anything parenting- or health-related as the "femail" section... cause you know, only women look at that stuff.
Lol.
Anyways, I actually think that's just a reasonable response: a system goes down (or a related system goes down), and in the review they end up making frivolous updates to names that aren't needed.
I would question those updates (though they may be a minor part of the overall updates occurring).
At least until the 70s most newspapers had a section called "Women" or something similar. Even the news about the 60s/70s women's movement appeared there, not in the main "news" sections. Those sections were mostly renamed around that time to "Lifestyle", "Home", or just "Features".
Is this the UK or US edition? It's always easy fun to have a go at the Daily Mail which presumably you read regularly else you wouldn't be commenting. Its sin seems to be that it's not a serious broadsheet. It's a tabloid with very broad appeal that has to be profitable and therefore tries to reflect the requirements of the British public for such a publication. Perhaps you should lower your expectations.
'Tag anything parenting or healthy ...'? No, that's not correct. Here are a few health & food related items back to mid-September that did not appear in 'femail'. You are right about parenting; most parenting in the UK is still undertaken primarily (in terms of executive action) by females, so items on this topic are reasonably included in 'femail'. The growing number of people who don't have children probably appreciate this sub-grouping by the Mail. You may not approve but this is what happens. Single males with dependent children are not known for objecting to checking out that section. It's not forbidden.
The Daily Mail is actually a site I frequent multiple times a day, every day.
Not all the content is for everyone, but they've got something; they're definitely tabloid style.
They push particular views to the public but cover all sorts of content, and a lot of it I would consider advertisements/plugs rather than actual articles.
I would guess a heavily elderly/conservative majority reader base.
They pander to the lowest common denominator, which is fine -- they're a for-profit news/tabloid outlet, and I find some of it entertaining (hence the daily visits).
Do you work for them, or are you just a big fan, to do all that digging in defense of the DM over my exaggeration that ALL content like that is in that category? I didn't take my own comment all that seriously, so honest ask.
Except that was a completely different incident and it occurred in the United States, not the UK. The Daily Mail did try to make hay out of the idpol angle, but the British can't reasonably be accused of shirking responsibility for the FAA grounding flights in the US.
Yes, typically it would be used to mean things like the code mutates data in place rather than using persistent data structures, explicitly loops over data rather than using higher-order map, fold etc. operations, and explicitly checks tag bits rather than using sum types.
Fine, I'll give you that (sounds like a generic description) but there's nothing like that from the description given in the article and the paragraph immediately before that statement. It's almost as if the author completely made that up.
What ticked me off is that when the primary system threw in the towel, an EXACT SAME system took over and ran the exact same code on the exact same data as the primary. I know that with code and algorithms it's not always the case, but even then, you know what doing the same thing over and over while expecting different results defines...
Yes, it can be argued that the software should've had more graceful failure modes and this shouldn't have thrown a critical exception. It can be argued that the programmers should've seen this possibility. We can argue a lot of things about this.
But the reality is that this is a mission-critical system. And for such systems, there're ways to mitigate all of these mistakes and allow the system to continue functioning.
The easiest (but least safe) one would be to have the secondary system loaded with code that does the same thing but written by a different team/vendor. It reduces the chance from 100% to much, much less that if any input provokes an unforeseen, system-breaking bug in the primary, the same input will provoke the same bug in the secondary.
An even better solution is to have a triumvirate system, where all 3 have code written by different teams, and they always compare results. If 3 agree, great, if 2 agree, not so great but safe to assume that the bug is in the 1 not the 2 (but should throw an alert for the supervisors that the whole system is in a degraded mode where any further node failure is a showstopper), and if all disagree, grind everything to a halt because the world is ending, and let the humans handle it.
It can be refined even further. And it's not something new. So why wasn't this system implemented in such a way? (Aside from cost. I don't care about anyones cost-cutting incentives in mission-critical systems. Sorry capitalism...)
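A toy sketch of that 2-out-of-3 voting step, in Python purely for illustration (nothing like this is described in the actual system):

    def vote(result_a, result_b, result_c):
        """Return (result, status): 'ok', 'degraded', or 'halt'."""
        results = [result_a, result_b, result_c]
        for candidate in set(results):
            agreeing = results.count(candidate)
            if agreeing == 3:
                return candidate, "ok"
            if agreeing == 2:
                # One node disagrees: keep going on the majority answer, but
                # alert supervisors that any further node failure is a showstopper.
                return candidate, "degraded"
        # All three disagree: grind to a halt and hand over to the humans.
        return None, "halt"

    # e.g. vote("ACCEPT", "ACCEPT", "REJECT") -> ("ACCEPT", "degraded")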
> an EXACT SAME system took over and ran the exact same code
Did you ever work with HA systems? Because this is how they work. It's two copies of the same system, intended for cases when, e.g., hardware fails or network partitioning happens.
No, I have not. But HA systems work like that because hardware or network failure is what they are designed to guard against, not a latent bug in the software logic. If there's a software bug, both systems will exhibit the same behavior, so HA fails there.
In practice, you have two kinds of HA systems (based on this criterion):
* Live + standby. Typically, the state of the live system is passively replicated to the standby, and the standby is meant to take over if it doesn't hear from the live one / the live one sends nonsense. (For example, you can use the Kubernetes API server in this capacity.)
* Consensus systems where each actor plays the same role, while there's an "elected" master which deals with synchronization of the system state. (For example, you can use Etcd).
In either case, it's the same program, but with a somewhat different state.
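For the first kind, a bare-bones sketch of how the standby side typically decides to take over (hypothetical names and timeout; this is the general pattern, not the actual FPRSA-R failover mechanism):

    import time

    HEARTBEAT_TIMEOUT_S = 5.0   # illustrative value

    class Standby:
        def __init__(self):
            self.last_heartbeat = time.monotonic()
            self.replicated_state = {}   # passively replicated from the live node
            self.active = False

        def on_heartbeat(self, state_delta):
            # The live node periodically ships its state changes along with a heartbeat.
            self.last_heartbeat = time.monotonic()
            self.replicated_state.update(state_delta)

        def tick(self):
            # Same program, somewhat different state: promotion is just a flag flip,
            # after which this copy starts processing from the replicated state.
            if not self.active and time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S:
                self.active = True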
It doesn't make sense to use different programs to deal with this problem, because you will have double the number of bugs for no practical gain. It's a lot more likely that two different programs will fail to communicate with each other than one program communicating with its own replica. Also, if you believe you were right the first time, why would you make the other one different? You would want to pick the better of the two and run copies of that, rather than have a better one and a worse one work together...
How can you tell whether the problem is due to a software bug or due to a hardware fault though? The software could have thrown the "catastrophic failure, stop the world" exception due to memory corruption.
I'm wondering if the backup system could have a delayed queue; say, 30 seconds behind. If the primary fails, and exactly 30 seconds later the secondary system fails, you have reasonable assurance that it was the queued input that caused the failure. Roll back to the last successful queue input, skip and flag the suspect input, and see if the next input is successful.
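Roughly what I have in mind, as a Python sketch (entirely hypothetical, and glossing over how state gets replicated):

    import time
    from collections import deque

    DELAY_S = 30   # how far the standby trails the primary

    class DelayedStandby:
        def __init__(self, process_fn):
            self.process_fn = process_fn   # same processing logic as the primary
            self.queue = deque()           # (received_at, message) pairs
            self.quarantined = []          # suspect inputs flagged for humans

        def enqueue(self, message):
            self.queue.append((time.monotonic(), message))

        def tick(self):
            # Only apply inputs that are at least DELAY_S old. If the primary just
            # died, the message that killed it is still sitting here, unapplied.
            while self.queue and time.monotonic() - self.queue[0][0] >= DELAY_S:
                _, message = self.queue.popleft()
                try:
                    self.process_fn(message)
                except Exception:
                    # Don't follow the primary off the cliff: flag the suspect
                    # input and carry on with the next one.
                    self.quarantined.append(message)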
This looks to me like it could work, but would need a ready force of technicians always expecting something like that so they can troubleshoot it in a timely manner.
I don't know, but in recent years I'm increasingly seeing mission-critical systems with only token or "apparent" redundancies instead of real ones, and I couldn't find any rationale other than cost savings and shareholder bottom lines. I'm not saying that capitalism = bad, it's mostly better than the alternatives, but just like its most direct competitor, it suffers from bad implementations across the world and unbounded human greed.
A recent and very "in your face" example, also from the air travel industry, would be the B737 MAX and its AoA sensors. There were two, for two flight computers, but MCAS only used one flight computer and one AoA sensor, despite the already existing crosslinks between the flight computers and the sensors...
Profit maxing first with the "no need for a new type rating for the pilots", then cost-cutting in aeronautical engineering (solving an airframe design problem with software, plus designing a flight envelope protection system that can overpower the human pilots).
Then cost-cutting in software engineering and QC, rushing out software made by engineers who were (probably) inexperienced in the field, and failing to properly test it and ensure that it had the needed redundancy.
You are correct, but it's an opinion that bridges the gap editorially between those knowledgeable about ATC but not data, and those knowledgeable about data but not ATC. This is a valuable service to provide, as both fields are rather complex.
Thanks. I didn't have the patience to read it all. I initially hoped that the author was a field expert or even someone with inside knowledge, but he is apparently from a completely different domain and not in the UK, and there were assumptions about things the report was rather specific about (as specific as such reports usually are). It would be more useful if people would take a closer look at the report and draw the right conclusions about organizational failures and how to avoid them. All the great software technologies to achieve memory safety, etc. are of little use if the analyses and specifications are flawed or the assumptions of the various parties in a system of systems do not match. But people seem to prefer to speculate and argue about secondary issues.
Since it's not Reddit but HN, it's all the stranger to dismiss a perfectly legitimate question. But times and mores seem to change much faster than I realize.
It's because your question was poorly phrased - it sounds like you are trying to dismiss the value of the submission for no apparent reason. If you genuinely want to know the answer to a question, don't start with your conclusion and append "isn't it?" to turn it into a question. Just say something like "I don't have time to read the article. Does the author provide any industry expertise to the incident beyond what was in the original report?".
It's not poorly phrased. It's my conclusion after spending ten minutes with the text and I was interested whether others came to the same conclusion, which apparently is the case. It also turned out that the author is not even a specialist and has no affiliation with an involved organization. But apparently people prefer to read and discuss arbitrary opinions.
This is one of the many reasons there should be a universal data standard using a format like JSON. Heavily structured, easy to parse, easy to debug. What you lose in footprint (i.e., more disk space), you gain in system stability.
Imagine a world where everybody uses JSON and if they offer an API, you can just consume the data without a bunch of hoop jumping. Failures like this would vanish overnight.
Parsing the data formats had zero contribution to the problem. They had a problem running an algorithm on the input data, and error reporting when that algorithm failed. Nothing about JSON would improve the situation.
Yes, but look at the data. The algorithm was buggy because the input data is a nightmare. If the data didn't look like that, it's very unlikely the bug(s) would have ever existed.
ADEXP sounds like the universal data standard you want then. The UK just has an existing NATS that cannot understand it without transformation by this problematic algorithm. So the significant part of your suggestion might be to elide the NATS specific processing and upgrade NATS to use ADEXP directly.
Using a JSON format changes nothing. Just adds a few more characters to the text representation.
No change at all? I find that hard to believe. There's also a data design problem here, but the structure of JSON would aid in, not subtract from, that process.
The question at hand is: "heavily structured data vs. a blob of text as input into a complex algorithm, which one is preferred?"
Unless you're lying, you'd choose the former given the option.
The issue is using both ADEXP and ICAO4444 waypoints, and doing so in a sloppy way. For the waypoint lists, there is no issue with structurelessness -- the fact that they're lists is pretty obvious, even in the existing formats. Adding some ["",] would not have helped the specific problem, as the relevant structure was already perfectly clear to the implementers. I am not lying when I say the bug would have been equally likely in a JSON format in this specific case.
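To put it another way (waypoint designators here are placeholders, and this is a drastic simplification of the real lookup): the mistake was in how a match was chosen, and that mistake is just as easy to make against a JSON array as against space-separated text.

    import json

    raw_text = "XYZ AAA BBB XYZ"               # space-separated, '\n'-terminated style
    as_json = json.dumps(raw_text.split())     # '["XYZ", "AAA", "BBB", "XYZ"]'

    waypoints = json.loads(as_json)

    # Same bug either way: take the first occurrence of the exit designator,
    # even though the intended exit point is the second, far-away duplicate.
    exit_index = waypoints.index("XYZ")
    assert exit_index == 0                     # wrong waypoint, JSON or not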
Now I'm wigging out to the idea of how the act of overcoming the inertia of the existing system just to migrate to JSON would spawn thousands of bugs on its own — many life-threatening, surely.
To me an XML-ified version of this would look more nightmarish than the status quo... it's just brief, space-separated, \n-terminated ASCII. No need to overcomplicate things this simple.
> The algorithm was buggy because the input data is a nightmare.
No, the algorithm was "buggy" because it didn't account for the possibility of the entry point into and exit point from the UK having the same designation, since duplicates are supposed to be geographically distant (these two were 4,000 NM apart!) and the UK ain't that big.
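The kind of guard that would have caught it is not exotic. Something along these lines, as a Python sketch (helper and field names are hypothetical; the real FPRSA-R logic is only known from the report's prose):

    from math import asin, cos, radians, sin, sqrt

    MAX_PLAUSIBLE_NM = 1000   # illustrative threshold; the duplicates were ~4000 NM apart

    def great_circle_nm(lat1, lon1, lat2, lon2):
        # Haversine distance in nautical miles (mean Earth radius ~= 3440 NM)
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * asin(sqrt(a)) * 3440.065

    def pick_exit_waypoint(candidates, entry):
        """Among waypoints sharing the exit designator, reject geographically absurd ones."""
        plausible = [w for w in candidates
                     if great_circle_nm(entry.lat, entry.lon, w.lat, w.lon) <= MAX_PLAUSIBLE_NM]
        if not plausible:
            # Fail the single flight plan, not the whole system.
            raise ValueError("no geographically plausible exit waypoint; route to manual handling")
        return plausible[0]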
There are already standards like XML and RDF Turtle that allow you to clearly communicate vocabulary, such that a property 'iso3779:vin' (shorthand for a made-up URI 'https://ns.iso.org/standard/52200#vin') is interpreted in the same way anywhere in the structures and across API endpoints across companies (unlike JSON, where you need to fight both the existence of multiple labels like 'vin', 'vin_no', 'vinNumber', as well as the fact that the meaning of a property is strongly connected to its place in the JSON tree). The problem is that the added burden is not respected at the small scale and once large scale is reached, the switching costs are too big. And that XML is not cool, naturally.
On top of that, RDF Turtle is the only widely used standard graph data format (as opposed to tree-based formats like JSON and XML). This allows you to reduce the hoop jumping when consuming responses from multiple APIs as graph union is a trivial operation, while n-way tree merging is not.
Finally, RDF Turtle promotes the use of URIs as primary identifiers (the ones exposed to API consumers) instead of primary keys, bespoke tokens, or UUIDs. Following this rule makes all identifiers globally unique and dereferenceable (i.e., the ID contains the necessary information on how to fetch the resource identified by a given ID).
P.S.: The problem at hand was caused by the algorithm that was processing the parsed data, not by the parsing per se. The only improvement a better data format like RDF Turtle would bring is that two different waypoints with the same label would have two different URI identifiers.
Furthermore, there are already XML namespaces for flight plans. These are not, however, used by ATC - only by pilots to load new routes into their aircraft's navigation computers.
I'm not sure whether there is an existing RDF ontology for flight plans; it would probably be of low to medium complexity considering how powerful RDF is and the kind of global-scale users it already has.
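To illustrate the labels-vs-identifiers point from the P.S. above (the namespace and properties below are made up for the example; the only assumption is the widely used rdflib package):

    from rdflib import Graph

    turtle_data = """
    @prefix ex:   <https://example.org/waypoints/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    ex:XYZ_US rdfs:label "XYZ" ; ex:region "US" .
    ex:XYZ_EU rdfs:label "XYZ" ; ex:region "EU" .
    """

    g = Graph()
    g.parse(data=turtle_data, format="turtle")

    # Same human-readable label, two distinct dereferenceable identifiers --
    # the ambiguity that bit the flight plan processor is at least explicit here.
    print(sorted(set(g.subjects())))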
Airport software predates basically every standard on the planet. I would not be surprised to learn that they have their own bizarro world implementation of ASCII, unix epoch time, etc.
(There is a modern replacement for AFTN called AMHS, which replaces analog phone lines with X.400 messages over IP... but the system still needs to be backwards compatible for ATC units still using analog links.)
Correct. The other "leg" of a solution to this problem would be to codify migration practices so that stagnation at the tech level is a non-issue long-term.
But after you did it, you'd still have exactly the same problem. The cause was not related to deserialization. That part worked perfectly. The problem is in the business logic that was applied to the model after the message was parsed.
I think this won't work: no one really wants to touch a system that works, and people will try to find any excuse to avoid migrating.
The reason for this is that everyone prefers systems that work and fail in known ways over new systems where no one knows how they can fail.
Does the system work if it randomly fails and collapses the entire system for days?
People generally prefer to be lazy and to not use their brains, show up, and receive a paycheck for the minimum amount of effort. Not to be rude, but that's where this attitude originates. Having a codified process means that attitude can't exist because you're given all of the tools you need to solve the problem.
> Having a codified process means that attitude can't exist because you're given all of the tools you need to solve the problem.
Yes, but in real life it doesn't work.
Processes have corner cases. As you said, people are lazy and will do everything they can to find a corner case to fit into.
Just an example from the banking sector.
There are processes (and even laws) that force banks to use only certified, supported, and regularly patched software: there are still a lot of Windows 2000 servers in their datacenters, and they will be there for many years.
Broadly speaking I think this is done for new systems. What you need to identify here is how and when you transition legacy systems to this new better standard of practice.
I'd argue in favor of at least an annual review process. Have a dedicated "feature freeze, emergencies only" period where you evaluate your existing data structures and queue up any necessary work. The only real hang-up here is one of bad management.
In terms of how, it's really just a question of Schema A to Schema B mapping. Have a small team responsible for collection/organization of all the possible schemas and then another small team responsible for writing the mapping functions to transition existing data.
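As a trivial sketch of what one of those mapping functions might look like (field names entirely made up; the point is just that each mapping is a small, testable pure function):

    from typing import TypedDict

    class LegacyPlan(TypedDict):      # "Schema A": the shape an existing system emits
        dep: str
        dest: str
        rte: str                      # space-separated waypoints

    class TargetPlan(TypedDict):      # "Schema B": the agreed target shape
        departure: str
        destination: str
        route: list[str]

    def legacy_to_target(old: LegacyPlan) -> TargetPlan:
        return {
            "departure": old["dep"],
            "destination": old["dest"],
            "route": old["rte"].split(),
        }

    assert legacy_to_target({"dep": "EGLL", "dest": "KJFK", "rte": "CPT WELIN"}) == {
        "departure": "EGLL",
        "destination": "KJFK",
        "route": ["CPT", "WELIN"],
    }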
It would require will/force. Ideally, too, jobs of those responsible would be dependent on completion of the task so you couldn't just kick the can. You either do it and do it correctly or you're shopping your resume around.
Great. It should be fixed by replacing the FORTRAN systems with a modern solution. It's not that it can't be done, it's that the engineers don't bother to start the process (which is a side-effect of bad incentive structure at the employment level).
No migration of this magnitude is blocked because of engineers not "bothering" to start the process. Imagine how many approvals you'd need, plus getting budget from who-knows how many government departments. Someone is paying for your time as an engineer and they decide what you work on. I'm glad we live in a world where engineers can't just decide to rewrite a life or death system because it's written in an old(er) programming language. (Not that there is any evidence that this specific system is written in anything older than C++ or maybe Ada.)
That's... not how that works. I take it you're probably more of a frontend person than a backend person, based on this comment. In the backend world, you usually can't fully and completely replace old systems; you can only replace parts of systems while maintaining full backwards compatibility. The most critical systems in the world -- healthcare, transportation, military, and banking -- still run on mainframes, for the most part. This isn't a coincidence. When these systems get migrated, any issues, including issues of backwards compatibility, cause people to /DIE/. This isn't an issue of a button being two pixels to the left after you bump frontend platform revs; these systems are relied on for the lives and livelihoods of millions of people, every single day.
I am totally with you wishing these systems were more modern, having worked with them extensively, but I'm also realistic about the prospect. If every major airline regulator in the world worked on upgrading their ATC systems to something modern by 2023 standards, and everything went perfectly, we could expect to no longer need backwards compatibility with the old system sometime in 2050, and that's /very/ optimistic. These systems are basically why IBM is still in business, frankly.
Many of them have been upgraded. In the US, we've replaced HOST (the old ATC backend system) with ERAM (the modern replacement) as of 2015.
However, you have to remember this is a global problem. You need to maintain 100% backwards compatibility with every country on the planet. So even if you upgrade your country's systems to something modern, you still have to support old analog communication links and industry standard data formats.
In some sense, yes. Notice that most of the responses to what I've said are immediately negative or dismissive of the idea. If that's the starting point (bad mindset), of course nothing gets fixed and you land where we are today.
My initial approach would be to weed out anyone with that point of view before any work took place (the "not HR friendly" part being that this is purposefully exclusionary). The only way a problem of this scope/scale can be solved is by a team of people with extremely thick skin who are comfortable grabbing a beer and telling jokes after they spent the day telling each other to go f*ck themselves.
Anyone who has worked with me knows that I have no issue coming in like a wrecking ball in order to make things happen, when necessary. I've also been involved in some of these migration projects. I think your take on the complexity of these projects (and I do mean inherent complexity, not incidental complexity) and the responses you've received is exceptionally naive.
The amount of wise-cracks and beers your team can handle after a work day is not the determining factor in success. /Most/ of these organizations /want/ to migrate these systems to something better. There is political will and budget to do so, but these are still inglorious multi-decade slogs which cannot fail, ever, because failure means people die. No amount of attitude will change that.
> The amount of wise-cracks and beers your team can handle after a work day is not the determining factor in success.
Of course it isn't. But it's a starting point for building a team that can deal with what you describe (a decade-plus long timeline, zero room for failure, etc). If the people responsible are more or less insufferable, progress will be extremely difficult, irrespective of how talented they are.
Airplane logistics feels like one of the most complicated systems running today. A single airline has to track millions of entities: planes, parts, engineers, luggage, cargo, passengers, pilots, gate agents, maintenance schedules, etc. Most of it was created before best practices were a thing. Not only is the software complex, but there are probably millions of devices in the world expecting exactly format X that will never be upgraded.
I have no doubt that eventually the software will be Ship of Theseus-ed into something approaching sanity, but there are likely to be glaciers of tech debt which cannot be abstracted away in anything less than decades of work.
It would still be valuable to replace components piece by piece, starting with rigorously defining internal data structures and publicly providing schemas for existing data structures so that companies can incorporate them.
I would like to point out that the article (and the incident) does not relate to airline systems; it is to do with Eurocontrol and NATS and their respective commercial suppliers of software.
The problem was not in the format, but with the way the semantics of the data is understood by the system. It could be fixed-width, XML, json, whatever, and the problem would still be the same.
So the "engineering teams" couldn't tail /var/log/FPRSA-R.log and see the cause of the halt?
I've had servers and software that I had never, ever used before stop working, and it took a lot less than four hours to figure out what went wrong. I've even dealt with situations where bad data caused a primary and secondary to both stop working, and I've had to learn how to back out that data and restart things.
Sure, hindsight is easy, but when you have two different systems halt while processing the same data, the list of possible causes shrinks tremendously.
The lack of competence in the "engineering teams" tells us lots about how horribly these supposedly critical systems are managed.
You're assuming that there is in fact a /var/log/FPRSA-R.log to tail - it would not at all surprise me if a system this old is still writing its logs to a 5.25 inch floppy in Prestwick or Swanwick^1.
^1: they closed the West Drayton centre about twenty years ago; I don't imagine they moved their old IBM 9020D too, if they still had it by then. My comment is nonetheless only slightly exaggerated ;)
No. That's silly. The logs would've / should've just shown that the program halted because it was confused about data. The actual commands to fix would've been quite different.
Small suggestion: don't choose an obscure language (in terms of popularity, 28th on the TIOBE index with a 0.65% rating) to visualize structures and algorithms. Otherwise you risk the average reader stopping the moment they encounter the code samples.
There are 27 more popular languages, some of them orders of magnitude more so.