Apparently the Enhanced Tactical Flow Management System (ETFMS) packed up.
>“ETFMS facilitates improvements in flight management from the pre-planning stage to the arrival of the flight. It maximises the updating of flight-related data and thus improves the real picture of a given flight, thereby contributing to the Gate-to-Gate Concept,” Eurocontrol explains on its website.
>The agency initially reported that contingency procedures were immediately put in place which reduced the capacity of the European network by around 10 per cent. http://www.airtrafficmanagement.net/2018/04/eurocontrol-give...

They don't seem to say what went wrong, but do say:

>In over 20 years of operation, the ETFMS has only had one other outage which occurred in 2001. The system currently manages up to 36,000 flights a day.
Tech details from the French Wikipedia:
Written in Ada and running on HP-UX, the system is based on an exchange of messages between the airlines (who file/change/update flight plans), the air traffic control bodies, and the CFMU; the messages are written in ADEXP format.
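For a flavour of ADEXP, here's a hand-written sketch of a flight-plan message (my own illustration based on published ADEXP keywords such as ARCID, ADEP, ADES and EOBT; not a captured message, and real messages carry many more fields):

    -TITLE IFPL
    -ARCID ABC123
    -ADEP EGLL
    -ADES LFPG
    -EOBT 0945

ARCID is the aircraft identifier, ADEP/ADES the departure and destination aerodromes, and EOBT the estimated off-block time.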
ETFMS uses at least 5 fundamental notions:
flight plan: describes the 4D trajectory of an airplane.
regulation: an aircraft rate applied to a "traffic volume". Example: 50 aircraft/hour.
"traffic volume": the association of a geographical reference (air sector, waypoint, airport, etc.) with a set of aircraft flows.
the list of takeoff slots, or simply "slots". Example: if the regulation rate is 30 planes/hour, there is a slot every 2 minutes: 10:00, 10:02, etc. (see the sketch below).
the delay: the difference between the take-off time desired by the airline and the schedule calculated by ETFMS.
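To make the slot arithmetic concrete, here is a minimal sketch (my own illustration in Ada, the system's implementation language; not actual ETFMS code) that expands a regulation rate into slot times:

    with Ada.Text_IO;       use Ada.Text_IO;
    with Ada.Strings;       use Ada.Strings;
    with Ada.Strings.Fixed; use Ada.Strings.Fixed;

    procedure Slots is
       --  Regulation: 30 planes/hour means one slot every 2 minutes.
       Rate_Per_Hour : constant := 30;
       Interval      : constant := 60 / Rate_Per_Hour;
    begin
       for N in 0 .. Rate_Per_Hour - 1 loop
          declare
             Minute : constant Natural := N * Interval;
             Image  : constant String  := Trim (Natural'Image (Minute), Left);
          begin
             --  Prints "Slot at 10:00", "Slot at 10:02", ... "Slot at 10:58"
             Put_Line ("Slot at 10:" & (if Minute < 10 then "0" else "") & Image);
          end;
       end loop;
    end Slots;

The real system allocates slots per traffic volume and assigns flights to them; the rate-to-interval arithmetic above is just the core idea.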
It would be fascinating to read about what bug took down a system with one outage in 20 years. I remember reading in the Chubby paper (from Google) that user error was the cause of more than half the downtime incidents. Wonder if that's the case here too.
If I remember correctly, a similar outage in the UK was due to increased flight volumes. The system had a hardcoded limit somewhere which caused issues as flight volume increased. The software was old enough that they weren't aware of that issue beforehand.
I could imagine something similar happening here.
It definitely shows how well software can work with the right practices. Only one outage in 20 years and that one only caused a reduction of 10% in capacity. Don't think many companies can match that.
>It definitely shows how well software can work with the right practices. Only one outage in 20 years and that one only caused a reduction of 10% in capacity. Don't think many companies can match that.
This is what happens when you build software with the same meticulousness as other engineering disciplines. However, a lot of software is so much more complex than what other disciplines can build (because you can build anything you can imagine), and so burdened by early deadlines and profit pressure, that it's unrealistic for most software to be developed this way. You easily have a cost factor of 100x in time as well as money.
I'm talking out of my ass here, but I can imagine that the focus for this software system was very sharp and hasn't changed (much) in the past 20 years. When you have a product with tight focus, you can polish the shit out of it and make it last 20+ years. A bit of the Unix mentality.
Most products today - and stuff on HN is at the forefront of that - are much more about selling a product to a lot of people, often in highly contested markets. That is, if you make a product, focus on just the core and don't do anything else, you'll stagnate and die.
Then again, on the other hand, there's Dropbox, whose core functionality has not changed for a decade as far as I can tell - it still does the exact same thing as when I first installed it. And Spotify, whose IPO was today, doesn't seem to have changed its core model either. In both cases, though, I don't know where they put all their development capacity; probably into back-end systems/scalability, marketing, and new applications (like Dropbox creating a Google Docs competitor).
I see this sentiment a lot. Intuitively it seems like it should be true, but I don't think the case is really quite so clear cut.
The costs involve way more than just the initial development. Maintenance eats up a huge, perhaps even a majority, of the total cost as well. And outages or other failures can be very expensive too.
It's also important to keep in mind that this isn't an all or nothing situation. We can have software that is more reliable without asking that it chug away without issue for a decade, or anywhere near as long as we expect bridges or buildings to last.
The process of developing more reliable software isn't necessarily more expensive than less reliable software. It can even be cheaper. I'm struggling to find the links (maybe somebody else has them handy, or I'll edit them in if I find them), but there have been a few case studies done a few years back by companies that moved to using Ada. In addition to the benefits of more reliable software, they also found development costs were better or at least no worse than C. I know that isn't exactly the language to compare to these days, but as I said these were done some time ago.
This is just my own argument, but I suspect that's because the same defects that cause problems after release also cause problems during development. With a more reliable programming system/environment, defects that might otherwise show up later are flagged as an issue immediately. This means the issue doesn't need to be tracked down after the fact, which can take some serious time, and the developers are still fresh on the problem area.
Personally speaking, I've been totally won over by Ada. It ain't perfect, but it's a hell of a lot better than anything else I've seen - and I've looked a lot. In my own projects (mainly personal or for school, admittedly) development is much easier and ultimately quicker. I don't have to spend a day tracking down a weird bug because the compiler lets me know about the issue as soon as I try to introduce it.
>The process of developing more reliable software isn't necessarily more expensive than less reliable software. It can even be cheaper. I'm struggling to find the links (maybe somebody else has them handy, or I'll edit them in if I find them), but there have been a few case studies done a few years back by companies that moved to using Ada. In addition to the benefits of more reliable software, they also found development costs were better or at least no worse than C. I know that isn't exactly the language to compare to these days, but as I said these were done some time ago.
I can believe that. Ada catches a lot of errors you would normally only notice by extensive testing at compile time. You're preaching to the strong-static typing choir here. I believe Ada and Rust could solve a lot of problems of companies working with C/C++ and make development cheaper. You can properly model your domain and abstract without sacrificing safety.
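For a flavour of that, here's a toy sketch (my own, not from any real ATC codebase) of how Ada's distinct ranged types let the compiler reject whole classes of mix-ups:

    with Ada.Text_IO; use Ada.Text_IO;

    procedure Typing_Demo is
       --  Two quantities that are both "just integers" underneath,
       --  but deliberately incompatible types.
       type Flights_Per_Hour is range 0 .. 60;
       type Feet             is range 0 .. 60_000;

       Rate     : Flights_Per_Hour := 40;
       Altitude : constant Feet    := 35_000;
    begin
       --  Rate := Altitude;  --  rejected at compile time: type mismatch
       --  Rate := 75;        --  flagged at compile time: outside 0 .. 60
       Rate := Rate + 2;      --  fine, and the range is still checked at run time
       Put_Line (Flights_Per_Hour'Image (Rate) & Feet'Image (Altitude));
    end Typing_Demo;

The first commented-out line is a hard compile error; for the second, GNAT warns at compile time that the value is out of range and the run-time check is guaranteed to fail.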
I'm also a strong believer that TDD makes you much faster and safer in the long run.
My experience tells me that most tools, languages or methods that catch errors earlier will save money.
Ada also has the best tested compiler I can think of.
However my larger point was about the engineering processes not the language itself. I think with languages and tools you can make it easier to make good software. The 100x time and cost is more in the sense of process changes when you're working on safety critical systems. How everything has to be traceable from requirement to test, how there are mandatory reviews before any code change that need to be documented, how there are qualification criteria for the toolchain, etc. All these things cost a lot of time and manpower, with arguably very bad cost-benefit analysis, which is only really worth it when human lives are at stake.
> The 100x time and cost is more in the sense of process changes when you're working on safety critical systems. How everything has to be traceable from requirement to test, how there are mandatory reviews before any code change that need to be documented, how there are qualification criteria for the toolchain, etc. All these things cost a lot of time and manpower, with arguably very bad cost-benefit analysis, which is only really worth it when human lives are at stake.
Absolutely. That's part of what I was getting at by mentioning all of this exists on a continuum. We don't need to, and really shouldn't, treat a SaaS startup exactly the same as a military aviation project.
We can, however, draw from the lessons learned on those safety critical projects and use parts of the process that make sense for the nature of whatever we're actually working on.
You're right that in general I suspect that comes down to strong static typing, particularly for the sorts of projects common to the HN crowd. When dealing with very large enterprise projects the balance might start to shift to more than just typing, though it would probably take a lot of real-world data that nobody is keen to supply to figure out where the tipping points are.
And I'd argue about how well Rust actually helps with these things, but that would really be going off the rails. Unfortunately.
Anyone could, few care to. They really are the wrong practices for a whole lot of things. The level of care, design, and verification just isn't necessary for applications with few or fixable consequences.
It is a bit sad that nearly nothing these days strives for that kind of excellence.
Well - I wouldn't call it dumb stuff. After all, it's only a matter of time before one of us does something truly stupid :-) Even more so under the stress and pressure when shit hits the fan, which is usually when human operators have to intervene. It's part of building reliable systems to reduce the chances of operator error. See: the Hawaii missile alert bug! It must be truly terrifying to sit at that particular keyboard typing in any command.
Read "Inviting Disaster" for a lot more about this topic.
Many very high profile disasters were caused by operator error, or, more precisely, by complex systems not designed for what you might call failure ergonomics.
Silly jokes don't go down well on HN, as they're distractions from the conversation. It's considered "Reddit-like" behaviour, and discouraged through downvotes.
Humour is allowed, and so is downvoting. In practice, the humour has to be really good to avoid being mercilessly downvoted. Meta-humour based on usernames and TV shows usually doesn't cut it. Many of us view this as a good thing.
However, HN does seem to curate specific, on-topic, informative posts, especially early on in the conversation (when everyone is reading, rather than just people looking for replies to their comments).
pftbest: can you please explain this go syntax to me?

    type ImmutableTreeListᐸElementTᐳ struct {

I thought go doesn't have generics.

Uncaffeinated: It doesn't. That's just a "template" file, which I run search and replace over in order to generate the three monomorphized go files. If you look closely, those aren't angle brackets, they're characters from the Canadian Aboriginal Syllabics block, which are allowed in Go identifiers. From Go's perspective, that's just one long identifier.
For those curious, Eurocontrol MUAC (Maastricht Upper Area Control Centre) migrated to 50 virtual SUSE Linux Enterprise servers running under the IBM z/VM hypervisor on an IBM z196 mainframe system in 2013.
I was on a plane just about to push back at Heathrow, and the pilot informed us we'd be delayed 15 minutes due to this failure. In the end it was 10 minutes, and we landed only 5 minutes late at my destination. Doesn't appear to have been a big deal, at least for me.
10% reduction probably meant that they tried to keep most flights on time and "strategically" delayed some flights. Makes it worse for some passengers but keeps knock-on effects for transfer passengers under control.
Technically, Time Based Separation helps only arrivals since it tries to negate the effects of headwinds on final approach, but it does add a lot of resilience to the airport to absorb delays that would normally ripple to departures.
I'm pretty sure it doesn't. I was reading about it, and it mentioned that even the time when the pilot can start the engines is calculated. It saves millions of liters of fuel just by waiting a few minutes.
There is a digital display facing the pilot showing him when he can depart, and even which size of aircraft is allowed to depart. It's extremely well organized.
Yes, this sucked. I just had a 1-hour delay on a 1-hour flight (AMS-ZRH). Unfortunately, no compensation until it's a 2-hour delay (but thank god it was only one hour!). Passengers who had a layover were noticeably less happy.
You probably wouldn't have been eligible for compensation anyway, since this delay was definitely beyond the airline's control.
While a lot of airlines try to weasel out of their obligations (mechanical failure, for example, which is in fact the airline's responsibility), I would think such a case is pretty clear cut.
I was at a conference in Northern Virginia some months ago and saw a presentation from the folks at Upside, a startup specializing in booking business travel. They described the legacy system which handles pretty much all booking in the US, a system called SABRE. They described it as an ancient 6-bit computer system in Texas with no modern API. Everything they do tech-wise is a modern wrapper around that system. So I'm not at all surprised by any air travel computer failures if tech like that is central to the system.
You're talking about TPF[1]. Many smart people and organizations have tried and failed to build something that could match it, including a company Google paid $700 million to buy. I personally know of at least 10 failed attempts :)
Not sure where "6 bit" is coming from though, and you can use gcc/c++ now, not just assembler[2]. And it's in Tulsa, Oklahoma, not Texas. Sabre's HDQ is in Texas, the data center is not. The hardware is very modern and new Z series mainframes in big loosely coupled clusters.
Amadeus, Sabre's main competition, also still has TPF at the core.
There is one notable non-TPF reservation system: http://www.navitaire.com/
Last I checked, it couldn't scale well enough to handle a large airline.
Both Sabre and Amadeus are replacing TPF, but one function at a time (shopping, fare engine, booking, check in, etc). And very slowly.
Fwiw, TPF is basically a huge, distributed, and transactionally consistent nosql database. Most orgs still using it have extracted most of the business logic that was in it, out to Linux boxes that front it. Not for stability reasons, but faster time to market with new features. To date, attempts to replace the high contention and high transaction rate nosql type traffic haven't scaled well enough.
Just in the US, 2.5 million people fly each day, and the processes to sell, board, etc. each passenger involve many transactions apiece. It's a pretty big scale. I think it's fairly close to Amazon's sales per day, but with more contention and sub-transactions.
All Visa credit card transactions are also still on TPF.
I've heard of enough misguided "modernisations" (and failures thereof) that I think the "legacy system" was the part that stayed working throughout, and it's the newer stuff added around it that failed. The old stuff may be old but there's a reason it's old... it's outlasted any attempts at replacing it.
I can confirm this. It's usually the Java/Tomcat boxes that front the TPF box that cause this type of huge meltdown. It's almost never the TPF core. It happens, of course, but they've offloaded most of the functionality to more modern technology. So there isn't much code change in the TPF core. And code change is usually what drives outages.
The counteracting force to that will be that the complexity of the surrounding environment increases and becomes more brittle, or is simply in the way, as constantly increasing demands for new features drift further from the capabilities of the old system in the middle.
A few years ago, SAS introduced a new status tier. It took 18 months to introduce into the system (Amadeus). The system may be stable, but that kind of turnaround time for a minor customer service change simply isn't feasible. I don't have numbers, but I wouldn't be surprised if one of the reasons upstart airlines such as EasyJet are competitive is that their IT is comparatively modern and can actively support the organisation, while IT is more of a millstone around the legacy airlines' necks.
Wow. I remember Sabre from GEnie - General Electric Network Information Exchange - in the 1980s. I’m not sure CompuServe had been invented yet. My father and I could use “electronic mail” to stay in touch. For you youngsters: I had never yet had a “remote control” for a TV. Yep, had to get up to adjust the volume or change the channel (don’t get me started on adjusting the antenna).
And it was already ancient then! Sabre started operating in 1960, and traces its origins to a chance meeting of an IBM salesman and the CEO of American Airlines on a flight in 1953.
In the air-travel world, a lot of seemingly-random limitations on systems that interface with reservations comes from the fact that they were originally built with telephone interfaces intended for use only by trained staff.
Every few years, for example, someone digs up and reposts one of the articles explaining why some airlines didn't allow 'Q' and 'Z' in account passwords (they were passing things directly through to SABRE on the backend, and so only allowed letters that could be "dialed" on the 1960s rotary phones SABRE was designed to interface with).
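As a toy illustration of that constraint (my own sketch, not SABRE's actual validation code): the dial mapped letters onto digits 2-9 with Q and Z simply absent, so a front-end check would have looked something like this:

    with Ada.Text_IO; use Ada.Text_IO;

    procedure Dialable is
       --  Letters with a digit on a 1960s rotary dial: A .. Z minus Q and Z.
       function Is_Dialable (C : Character) return Boolean is
         (C in 'A' .. 'Z' and then C /= 'Q' and then C /= 'Z');

       Password : constant String := "QANTAS";
    begin
       for C of Password loop
          if not Is_Dialable (C) then
             Put_Line ("Rejected: no digit for '" & C & "' on the dial");
          end if;
       end loop;
    end Dialable;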
Can confirm. My flight from Zurich was delayed by over an hour today, causing my family and me to need to run, OJ Simpson-style, through the Philadelphia airport to make our connection.
Everyone at that organisation is paid boatloads of money. [1]
They barely pay taxes on it (~10%, "for solidarity"), which means they get to enjoy healthcare paid for by nationals taxed at ~55%.
And 90% of the organisation (especially the management) has absolutely nothing, nothing whatsoever, to do with guiding planes anywhere. In fact, those departments are severely understaffed. The department doing "regulatory support" is about 2/3rds of the organisation (tldr: making sure half the local government officials don't have to get their own coffee - and before you say it, no, Eurocontrol employees don't get them coffee, they're merely in charge of making sure someone's there to get them coffee, and steak, and cake, and so on). The coffees, I might add, are baffling: made on a steam-boiler machine in front of you, with fair-trade beans, sweetened not with sugar but with expensive imported bars of chocolate melted into milk that's frothed in front of you (they somehow work the chocolate in while frothing the milk with steam, without getting steam on the chocolate), and you get the rest of the case of that (expensive) chocolate to take home for the kids. No, not when you ask - they'll offer. And it's not really the rest of the case they give you; you get a fresh case. Oh, and of course, from the bar they opened they prepare just one coffee (about 1/4 of the bar); the rest gets thrown in the trash, they don't use the same bar for the next coffee. As for the steaks... oh my God.
And in case you're wondering: the odds of 2 planes colliding with zero guidance outside of the ATC zones around airports (which aren't covered by Eurocontrol), over even a region as big as Europe, are more than 10 billion to one against, per year. So if Eurocontrol didn't exist at all, and we just allowed every plane to fly wherever... nothing would go wrong at all.
So... what is the problem here? Disruption of millions of travelers for no reason whatsoever? Let's please not pretend anyone at Eurocontrol cares (well, they care about not being interfered with, and that will make them care NOW, but if one thing's guaranteed, it's that the software/ATC departments will remain the same size, and only the bribery departments will grow).
Just as there are no bugs without security impact, is it realistic to say that this has no impact on flight safety? Any error can be a contributing factor.
Safety is ensured by ATC controllers. ETFMS is there to ensure that traffic does not increase beyond controllers' capacity. I have read that without ETFMS, traffic is reduced by 10%. I have been involved in ATC simulations where controllers had to land about 40 flights per hour using new procedures and tools. We tried with 38 flights per hour, and it was too easy for the controllers: their work was perfect even without the tools. With 42 flights, the controllers were getting angry because the traffic could not be managed. At 40, we could see the benefits of our new procedures and tools (more regular separation of flights). IMHO 10% less traffic gives a large enough capacity margin to ensure safety.
Sure, the skies are more and more crowded, meaning more and more people will be affected by a once-in-20-years failure, but why would it happen more and more often?
If the reliability stays constant expressed as a rate (arguably optimistic), like ".1% failure per X flights", and the # of flights increases, then failures become more frequent.
Failure rates must decrease proportionally to the utilization/capacity increase for the # of failures to stay constant. Otherwise, the # of failures will naturally increase; it's simple multiplication.
For example:
If crime rates stay constant, but your population increases, you have more crime - not the same amount of crime as before. You have more terrorists, you have more school shootings, because your population doubled in size and you didn't cut the rates of abhorrent behaviors in half.
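To put illustrative (made-up) numbers on the aviation case: at a constant rate of one failure per 10 million flights, 18,000 flights a day is about 6.6 million flights a year, or ~0.66 expected failures; double the traffic to 36,000 a day and you get about 13.1 million flights and ~1.3 expected failures. Twice the flights, twice the failures, unless the rate itself is halved.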
Air travel has _exploded_ in popularity, this is why you see all sorts of things more frequently than you did before. From reports of airline passengers being dragged off planes in totally asinine scenarios to deaths of pets in the cabin.
You're assuming the failure rate is related to the number of flights. With 2 failures in 20 years there's not really enough data to extrapolate.
And remember this wasn't a complete system outage - it affected 50% of flights, a degraded service.
In comparison, the vast majority of startups that HN people work at will be lucky to last 2 years. In the last 10 years, Gmail has been down... well, I'm not sure how often; it was down 6 times in one year alone. Its uptime is apparently 99.8%, which works out to about 175 hours of downtime per decade, so a 6-hour outage seems quite reliable by comparison.
Level 3 went down last month. And in 2016. And in 2015.
Having a system that has to work for decades without downtime is unheard of in the web world, even for relatively simple systems like Gmail and AWS.
Well the archaic systems are getting older and older. From what I've heard it's almost impossible to get replacement parts for some of the computer systems that run airports/airlines.
Has not been my experience. 2 decades in the airline space. It's mostly uninformed people assuming that TPF is the issue or that it runs on old hardware. Neither is true.
Airline tech outages are just more visible than outages in other industries. Like manufacturing, etc. It doesn't make the news when all your manufacturing plants halt for 6 hours due to tech issues. Airlines could improve, sure, but pundit observations are usually off base. They are also dependent on systems they don't control, like ATC, TSA, Airport owned kiosks, and so on.
And on the other hand, there are massive requirements at all levels for this kind of software.
I'm not sure what the right answer is at that point. From a dev perspective, I wouldn't want half-assed software directing flights. From an ops perspective, I don't want software directing flights to require specific hardware, so that failed hardware can be replaced with easily available parts.
That's a good point. I'm just a bit disillusioned with the quality of some of the legacy stuff at my current place. Especially because robustness and the probability of failures are mostly determined by requirements.
In situations like this I’m glad I book my flights with Chase Sapphire Reserve. It comes with a trip delay reimbursement for up to $500 per ticket if a flight is delayed more than 6 hours. No sweat!
In such circumstances, I'm glad I live in the EU: EU regulation 261 [1] covers so much, with €250 to €600 compensation depending on flight distance for delays over 4 hours, and a percentage of full compensation for shorter delays. No specific credit card required.
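The distance bands (from Article 7 of the regulation) are simple enough to sketch; this toy Ada version is my own illustration and ignores the 50% reductions and the intra-EU special case for long flights:

    with Ada.Text_IO; use Ada.Text_IO;

    procedure EU261 is
       --  EU261 Article 7 compensation bands, in EUR, by flight distance.
       function Compensation (Distance_Km : Positive) return Positive is
         (if Distance_Km <= 1_500 then 250
          elsif Distance_Km <= 3_500 then 400
          else 600);
    begin
       Put_Line ("AMS-ZRH:" & Integer'Image (Compensation (600)) & " EUR");
       Put_Line ("AMS-JFK:" & Integer'Image (Compensation (5_900)) & " EUR");
    end EU261;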
Most (some?) airlines will just suck up the cost though in those extraordinary circumstances (bad weather, for example), as a general policy to keep their customers sane.
Not sure if they'll give you direct compensation, but most will pay for food/drink, some place to stay overnight if need be and rebook your flight for free too if it's cancelled outright.
What airlines do you fly with? Comp for hotel and food is paid, but the EU comp is really hard to get. There's a reason that so many services specialise in getting that money for passengers for a fee. It's not just Ryanair and EasyJet making it hard to claim. I haven't heard of a single case where reasons like this one led to a payout.
> Most (some?) airlines will just suck up the cost though in those extraordinary circumstances (bad weather, for example), as a general policy to keep their customers sane
I kind of doubt that matte_black's gamble (because that's what "insurance" is for costs which you could easily carry yourself if need be, unlike e.g. medical costs) would include that either. Then again, I'm surprised how many "insurances" cover things under your control (like dropping a device) so I might be wrong.
I have no desire to deal with an airline that has already fucked up a flight, and is probably already swamped. As soon as they give me the options for the next flight, I’ll take my own taxi, get my own hotel, and have Chase pick up the tab.
It's clearly a fundamental cultural difference. An inherent distrust of government and corporations, along with a willingness to pay to avoid issues, compared to, well, a modicum of trust in government, and an established legal framework that protects the customer from the corporation.
I've been on the receiving end of delayed flights too. The agent was polite and simply offered me an alternative flight, or the compensation/refund, all at the desk. There was no animosity or distrust, just "here's what we both have to do".
Not overseas, and it has a deductible. I used a generic travel credit card and it covered my rental damage once; now I make sure to use such a card whenever I rent.
DoT regulations don't require that they do that, although in the event of substantial delays they often will. The reality is that cash comp is only due in involuntary denied-boarding situations.