It’s fascinating that the same hopscotch travel pattern that allows SWA to offer better service to more places is also what caused the network to suffer cascading failure. Once a critical mass of pieces (planes/crew) were out of position the whole network fell apart, and it’s large enough that it seems like neither the humans nor software can easily reason about how to resume operations. Hence the need for a “full system reboot” over many days.
Anecdotally, I flew Southwest just before Christmas. The network was already buckling and we had major delays, but we were lucky and made it through. Despite the stress, the SWA crews were helpful, empathetic, and polite. They handled it better than I would have if I had been in their shoes.
Is the resumption itself difficult, or is it resumption plus making whole the tens of thousands of customers who were supposed to be moved a week prior?
No idea what SkySolver actually does in totality. I'm sure it's complicated, but I would think a flight crew could indicate where they are right now and then it could maybe pick up the next possible course they could perform. Not sure why the phone lines "jam up" exactly, don't you have a hierarchical management structure for this sort of thing? Or do 1000 pilots all report to one person?
They've got like 1000 planes and like 100-150 destinations, it's not the traveling salesman problem, an optimal plan isn't needed now so much as a functional one.
Of course, it's easy to bitch about it not being hard when I've never seen the code. Maybe it also tracks hours and does payroll and a dozen other functions.
It's essentially an optimizing constraint solving problem. As far as I know, there are 2 main known approaches to this class of problem:
- use a tree traversal algorithm with a lot of built-in optimizations and pruning. Google OR-Tools is one well-known solver.
- use a set of meta-heuristics (Tabu search, simulated annealing, etc.) against arbitrary predicate expressions. The best example of this kind of solver is OptaPlanner.
The advantage of tree traversal is that it is exhaustive and guaranteed to be optimal; but it requires a fixed computing budget for a given set of constraints. Given a large compute cluster it's ideal, but on individual machines/servers these solutions tend to take hours or days. Southwest's system would likely require a significant supercomputer to run its scheduling through an exhaustive optimizing constraint solver, since it would have to re-run the entire solution (or significant numbers of subtrees) when any variable changes (which happens likely several times per second).
Meta-heuristics are more flexible and allow for all sorts of interesting, convenient features that can be very helpful when your compute budget is less than a TOP500 supercomputer.
They offer time-constrained solving, real-time monitoring of a solution in progress (you can watch it get better as more of the solution space is searched), and over-constrained solving (where the requested solution is impossible, but we need to make a best-effort attempt).
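To make the time-constrained and over-constrained points concrete, here is a minimal sketch of time-boxed simulated annealing in plain Python. It is not how OptaPlanner or SkySolver work internally; the crew locations, departure airports, and the `anneal`, `cost`, and `neighbor` names are all made up for illustration. The toy instance is deliberately over-constrained (no crew is in OAK), so a cost of 0 is impossible and the solver can only make a best-effort attempt.

```python
import math
import random
import time

# Hypothetical toy data: where each crew currently is, and where each flight departs.
CREW_AT = ["BWI", "BWI", "MHT", "DAL", "DEN", "DEN"]
FLIGHT_FROM = ["DEN", "BWI", "DAL", "DEN", "MHT", "OAK"]


def cost(assignment):
    """Number of flights whose assigned crew is not at the departure airport."""
    return sum(CREW_AT[crew] != FLIGHT_FROM[flight]
               for flight, crew in enumerate(assignment))


def neighbor(assignment, rng):
    """Swap the crews of two randomly chosen flights."""
    i, j = rng.sample(range(len(assignment)), 2)
    swapped = list(assignment)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return tuple(swapped)


def anneal(initial, cost, neighbor, budget_s=0.2, t0=1.0, seed=0):
    """Time-boxed simulated annealing; returns the best state seen before the deadline."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    start = time.monotonic()
    while best_cost > 0:
        elapsed = time.monotonic() - start
        if elapsed >= budget_s:
            break  # out of time: hand back whatever we have
        # Linear cooling schedule tied to the wall-clock budget.
        temperature = max(t0 * (1.0 - elapsed / budget_s), 1e-9)
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with a shrinking probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


initial = tuple(range(len(CREW_AT)))  # flight i gets crew i
best, best_cost = anneal(initial, cost, neighbor)
print("uncovered flights before:", cost(initial), "after:", best_cost)
```

Raising `budget_s` buys more search, and the caller gets the best assignment seen so far whenever the clock runs out, which is the property that matters when inputs change faster than an exhaustive search can finish.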
Other paid constraint solvers are Gurobi and IBM’s CPLEX. I only have experience with the latter - scheduling giant chemistry and biology experiments into a robotic factory - there are a ton of foot guns in this space.
IBM has ILOG as well which is likely a better match for scheduling problems.
Fun Fact: google-ortools was written by the same people that wrote ILOG. You used to have to "bring your own search" when using ortools, which I used to assume was to avoid conflict with IBM. I haven't used it in a while, but looks like there's a one-size-fits-all search algorithm now. I'd be curious to try it sometime.
> Not sure why the phone lines "jam up" exactly, don't you have a hierarchical management structure for this sort of thing? Or do 1000 pilots all report to one person?
It's not a problem of management (well it is a problem of 15 years of management fuckup which is the root cause), it's a problem of the scheduling system needing manual updates when things didn't go to plan.
At this point it's completely screwed, so it needs to be completely reset, as in the entire scheduling system needs to be reconfigured from empty, more or less.
And because Southwest operates entirely on point to point, the same cascading properties which led to its complete collapse mean it needs to restart in a somewhat synchronised manner, otherwise you fly 3 planes, there's no followup, and you're hosed again.
> They've got like 1000 planes and like 100-150 destinations, it's not the traveling salesman problem, an optimal plan isn't needed now so much as a functional one.
The system completely lost track of crews, so all of them need to be relocated, their work cycles reworked, flight slots need to be reallocated, flights need to be re-encoded.
Here is where we should take note of how things were resolved (or are in the process of) and note where “throw more money at it” (fly inefficient routes, farm out fares, etc) would have also helped.
Then next earnings release - when they post shareholders profits - that’s exactly how much they were willing to hold back to fix this, trading profit for people stuck for days.
Good point. I think it will be smart for them to post a record loss for this quarter.
Accrue a few billion for refunds, depreciate your software to zero, and start to set aside a few billion for new cloud whatever.
Basically, take this as a huge loss and communicate that you’ve lost a lot of goodwill and trying to make passengers whole is #1 priority (after safety, of course).
Airlines are extremely low-margin businesses, discount airlines like SWA doubly so. “Posting a record loss” of this magnitude isn’t something that can be afforded. What you’re talking about is likely to turn into some form of bankruptcy. Maybe it’s a relatively minor Chapter 11 where all those refunds get written down by the bankruptcy court, or maybe SWA goes the way of TWA and Pan-Am.
Replying to a sibling comment. There will never be a bailout for SWA. Too small, too many competitors and based in a red state with a blue administration in power.
In the last fiscal year pre-pandemic, they had operating income of $3 billion on revenue of $22 billion, or 13%. This is not a low-margin business.
If they have operating income of around three billion a year, spending a third of a billion to upgrade systems seems like it would have been a reasonable investment instead of buying back shares and increasing the dividend.
Depends on the specifics of the duty time rules and how the flights were canceled. If the crew showed up and started getting the flight ready, but the flight was cancelled before takeoff, that may count as duty hours and could reduce their availability in the near term.
If the duty time rules only count time after takeoff, or only time after doors are closed, those both offer a different set of times.
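A rough sketch of how much the choice of clock-start rule changes a crew's remaining availability after a cancelled leg. Everything here is assumed for illustration: the 13-hour `MAX_DUTY`, the `CancelledLeg` fields, and the timestamps are made up, and the real rules are more involved than a single limit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

MAX_DUTY = timedelta(hours=13)  # hypothetical limit for illustration, not a regulatory figure


@dataclass
class CancelledLeg:
    report: datetime                  # crew showed up and started prepping the flight
    doors_closed: Optional[datetime]  # None if the doors never closed
    takeoff: Optional[datetime]       # None if the plane never left the ground
    cancelled: datetime               # when the flight was called off


def duty_used(leg: CancelledLeg, clock_starts_at: str) -> timedelta:
    """Duty time consumed by a cancelled leg under one of three clock-start rules."""
    start = {
        "report": leg.report,
        "doors_closed": leg.doors_closed,
        "takeoff": leg.takeoff,
    }[clock_starts_at]
    if start is None:  # the event that starts the clock never happened
        return timedelta(0)
    return max(leg.cancelled - start, timedelta(0))


def duty_remaining(leg: CancelledLeg, clock_starts_at: str) -> timedelta:
    """Duty time still available to the crew after the cancelled leg."""
    return MAX_DUTY - duty_used(leg, clock_starts_at)


# Crew reports at 06:00, doors close at 07:15, flight is cancelled at 09:00 before takeoff.
leg = CancelledLeg(
    report=datetime(2022, 12, 23, 6, 0),
    doors_closed=datetime(2022, 12, 23, 7, 15),
    takeoff=None,
    cancelled=datetime(2022, 12, 23, 9, 0),
)
for rule in ("report", "doors_closed", "takeoff"):
    print(rule, duty_remaining(leg, rule))
```

With these made-up numbers the same cancelled leg leaves the crew 10 hours, 11 hours 15 minutes, or the full 13 hours, depending on which event starts the clock.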
I’m by no means an expert in this area. But from my experience writing field service software in another life, even on a city wide scale when you may have 20 people and around 200 stops it got tedious when the system went down and you had to manually schedule it.
I used to write field service software for ruggedized Windows CE devices.
> Once a critical mass of pieces (planes/crew) were out of position the whole network fell apart, and it’s large enough that it seems like neither the humans nor software can easily reason about how to resume operations. Hence the need for a “full system reboot” over many days.
We had days like that in the UK but with the rail system. And when it happens it’s also due to snow. We’ve seen it on a global scale recently due to covid putting ships in the wrong places so that optimised shipping routes become a mess.
I don’t see how a full system reboot should take days. If you don’t care about serving customers (which they don’t right now), then the problem could be simplified to getting every plane and every crew home to where they would spend the night under a normal schedule. With few or no paying customers on each plane, there should be plenty of capacity to move misplaced crew members around. None of this needs to approximate normal routing, and fewer segments than normal are needed, since passengers can be ignored.
That being said, the software sucks. Southwest may have lost track of where their employees are. The ground crews are quitting. I wouldn’t be utterly shocked if management doesn’t even have a good overview of where the planes are.
(Obviously anyone halfway competent could hack up a script to find all the planes based on ADS-B data in a few hours. And it wouldn’t be terribly hard to text a link to all crew asking them to fill out a simple form with their location, nearest airport, and when they can get there. But this requires competence and agility.)
It's far more complicated than that. For example, there's a lot of strict regulation around how many hours crew can work, how much rest they need in between, etc. Likewise planes can't just randomly fly wherever they want, whenever they want. There's mandatory maintenance, etc. You also have a limited number of hangars and jetways at each airport, so you have to coordinate how planes move around to not overload things. Oh, and we need to be sure there's enough fuel, fresh oxygen bottles, and so many other things.
The problem is nothing like writing a script to scrape ADS-B data. This is the classic fallacy of a programmer thinking the most cartoonish imagination of the technology is the problem while being completely blind to the fundamental difficulty of organizing large groups of humans in some activity.
Modeling and simulation has the concept of Emergent Behavior[1].
Once that complexity tipping point is reached (and other comments suppose that SWA competitors were more aggressive at cancellations, avoiding that point in their systems) the system takes on a life of its own.
What follows, for those running the system, is called a "Significant Emotional Event".
Yes. Globally the vast majority of airlines use either Amadeus, Apollo, or SABRE for everything from reservations to crew scheduling. United wrote Apollo, American wrote SABRE back in 1960. So yeah, other airlines have far better tech people and have for decades. While Southwest now pushes ticketing data to those three, as of a few years ago they were using other software for scheduling and even doing things by hand (like tracking bags) that other airlines had already automated. If anecdotes from crew are accurate the other problem is that Southwest did nothing to optimize their disaster recovery plan (if there even is one). Manual data entry can scale more than it is at Southwest, but they're still stuck in the teensy airline state of mind where even wildly inefficient workflows can scale.
As for ADS-B, that's hopelessly naive. ADS-B tracks planes, not people. The problem Southwest is having is trying to figure out who is legal to work and where they are. Tracking equipment is so trivial in comparison it's largely a secondary concern.
The other carriers do also have a tipping point of no return, where their software can't handle the situation. But, they were more aggressive about canceling flights in careful waves before they got to that point.
Some of the observations here are correct, but there's a big missing piece. Southwest went too far in with an overoptimistic schedule. They won't say that because then there's no easy scapegoat, it's just a pure management thing.
There is always a trade-off between robustness and efficiency. By leaving less slack the SWA schedule is more efficient, but also more tightly strung. When failures exceeded the slack capacity the whole schedule fell apart.
The problem of recovery is also extensively studied. But it seems like SW did not put enough effort into having fault tolerant restart.
Right. I'm saying though, it's not just the initial state of less slack in their system due to hub and spoke versus point to point. It's also that other airlines were more realistic about backing off flights earlier and more aggressively as a reaction to the developing info on the storm.
It paints a picture of an organisation which didn't see the need to accurately assess risk on a real time basis. If I was the COO I would want rolling projections for the week based on good/bad/ugly weather. But it seems SW was too busy cutting costs and building capacity as cheaply as possible to think about that.
As others have said, you are ridiculously oversimplifying the process.
If you are running a little video game that plots flight charts, sure perhaps you could write such a script. You cannot, however, script a bunch of logistics support together across disparate, independent, multinational airports.
Add the obligatory paperwork (we don't really want to be shipping people around in coffins, enabling human trafficking or illegal substance smuggling, trust me), and add your standard bit of managerial incompetence managed from an Excel document and SharePoint, and it's actually downright impressive if it's only a couple days to completely reboot the system.
Arguably, management doing the standard "oh that will never happen" is probably why it's not even better - you would think the airports would be able to automatically "fix" themselves with a self healing protocol, but that was probably deemed too expensive and left out of the feature set.
Of course I’m oversimplifying the process. But my point is the actual number of flights, passenger and luggage loads and unloads, etc can be a lot lower than a normal workload. On a normal day, an average Southwest plane flies quite a few hops, loads many passengers, unloads many passengers, etc. For a worst case reset, the normal routing doesn’t need to be followed.
Of course this will introduce complexity. But it will also need decently less than a normal day’s number of operations.
If I were designing a solver for where to move aircraft and crews (which is apparently what SkySolver does), I would test it on random and malicious inputs. If it’s initialized with every crew member and every plane at an independent random airport in the US (where Southwest has service) with a few broken for good measure and a few airports shut down, it should still find a solution. Waiting 12-24 hours to start so none of the crew is timed out could be part of a valid solution.
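A minimal sketch of that kind of randomized testing, with a trivial stand-in solver since nobody here has seen SkySolver. `plan_ferries`, `check_plan`, `fuzz`, and the airport and tail-number naming are all hypothetical; the point is only the shape of the harness: generate scrambled states, run the solver, and assert invariants on whatever it returns.

```python
import random


def plan_ferries(position, home, closed):
    """Toy stand-in for a repositioning solver: one direct ferry per misplaced plane."""
    moves, stranded = [], []
    for plane, airport in sorted(position.items()):
        target = home[plane]
        if airport == target:
            continue  # already where it belongs
        if airport in closed or target in closed:
            stranded.append(plane)  # cannot move yet; wait for the airport to reopen
        else:
            moves.append((plane, airport, target))
    return moves, stranded


def check_plan(position, home, closed, moves, stranded):
    """Invariants any plan must satisfy, regardless of how it was produced."""
    final = dict(position)
    for plane, src, dst in moves:
        assert final[plane] == src, "move starts somewhere the plane is not"
        assert src not in closed and dst not in closed, "move touches a closed airport"
        final[plane] = dst
    for plane in position:
        assert final[plane] == home[plane] or plane in stranded, \
            "plane silently left out of position"


def fuzz(trials=500, seed=1):
    """Throw randomly scrambled fleets and random closures at the solver."""
    rng = random.Random(seed)
    airports = [f"A{i}" for i in range(12)]
    for _ in range(trials):
        planes = [f"N{i}" for i in range(rng.randint(1, 40))]
        position = {p: rng.choice(airports) for p in planes}
        home = {p: rng.choice(airports) for p in planes}
        closed = set(rng.sample(airports, rng.randint(0, 3)))
        moves, stranded = plan_ferries(position, home, closed)
        check_plan(position, home, closed, moves, stranded)
    return trials


print("plans checked:", fuzz())
```

A real harness would add crews, duty limits, and maintenance status to the generated state, but the invariants would keep the same form: every plane and crew member is either placed or explicitly reported as unresolved.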
It’s still probably not that simple; the problem seems like it can be analogous to some sort of knapsack and/or traveling salesman problem, which are NP-hard! Even testing would take a long time.
That’s true if you want an optimal solution. If you just want a decent solution (for various definitions of decent), the complexity may not be too bad.
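As a toy illustration of "decent versus optimal", here is a sketch comparing a greedy assignment with exhaustive search. The 3x3 `COST` matrix (hours of deadheading to put crew c on flight f) is invented, and neither function is claimed to resemble what any airline actually runs.

```python
from itertools import permutations

# Hypothetical deadhead hours: COST[crew][flight].
COST = [
    [1, 2, 9],
    [2, 8, 9],
    [9, 9, 3],
]


def greedy_assign(cost):
    """Repeatedly take the cheapest remaining (crew, flight) pair. Roughly O(n^2 log n)."""
    n = len(cost)
    pairs = sorted((cost[c][f], c, f) for c in range(n) for f in range(n))
    crew_used, flights_used, plan, total = set(), set(), {}, 0
    for hours, c, f in pairs:
        if c not in crew_used and f not in flights_used:
            crew_used.add(c)
            flights_used.add(f)
            plan[c] = f
            total += hours
    return total, plan


def optimal_assign(cost):
    """Try every permutation. O(n!), so only usable on tiny instances."""
    n = len(cost)
    return min(
        (sum(cost[c][f] for c, f in enumerate(perm)), perm)
        for perm in permutations(range(n))
    )


print("greedy total: ", greedy_assign(COST)[0])
print("optimal total:", optimal_assign(COST)[0])
```

On this instance greedy covers every flight for 12 hours of deadheading while the optimum is 7. The greedy plan is worse but still valid, and its running time grows polynomially rather than factorially.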
>then the problem could be simplified to getting every plane and every crew home to where they would spend the night under a normal schedule.
I'm confused, maybe my calibration of the scale involved here is wrong, but how do you not see this process taking days? Even if you had perfect information of where all the planes and crews were and where they should be, just coordinating with various airports to schedule the flights would take days; and that's assuming every plane is (1) flying empty, (2) fueled and (3) isn't carrying any passenger luggage.
"If I ignore all real-world constraints and all the regulatory burdens, it's TRIVIAL. Why can't they figure this out when I could solve it in an internet comment?"
He's ignorant and naive and disrespectful of the workers involved.
Well, let's see... A bit of internet searching shows SWA has about 60,000 total employees. Roughly 6,000 of those are pilots. So probably about 10,000 cabin crew (3 or 4 flight attendants, 2 pilots per flight). Most of those 15,000+ people need to be in exactly the expected place at the expected time, or things will get snarled up. If any one of those 5 or 6 people is a no-show then the plane can't legally fly passengers. There's limited fungibility; one of the pilots has to be a Captain. You don't want two Captains flying, if for no other reason than some other flight will then be short a Captain. One of the cabin crew has to be a Purser (supervisor). It's really preferable to have at least one of the pilots flying to the airports they usually frequent. Off-shift crew still need to be in the proper place at the proper time, or the plane has to stop mid-cycle. Then there's the ground and gate crews.
And you can't just say "go to your usual starting airport". Flights are 24/7. Days of the week matter. Holidays matter. Even your local burger joint doesn't have cast-in-stone schedules. Much less a huge interconnected network that's trying to reboot. Even if 80% of the crew could go to a "usual airport", how would you know if you were in the 20%? Oh... everybody has to phone in, or the system has to send a huge number of notifications out. System crash. (Remember, normally the system depends on scheduling in advance, and only a few last-minute changes need to be handled.)
And how do these what, 20k+ people get where they should be? If they aren't already at that airport, then normally by hopping on a SWA flight. Which aren't happening. So now what? Book flights on other airlines? Who's going to be doing the booking? How does it get paid? How many employees have a company credit card (any?) Use the employee's cards? How many are maxed out for Christmas?
So let's suppose you've scheduled everyone and everybody knows where they're supposed to be and when. What happens when 10% call back in (without crashing the system!) and say they can't get there on time? (Remember, many SWA customers are stuck and are having a hard time arranging alternate travel. Crew without access to SWA flights would have the same trouble.) You're going to either apply massive changes to the existing schedule, or start all over and reschedule everyone. Which is exactly the problem they're currently having. They can't handle massive corrections. What's worse, a crew member may think they're good, and then their travel gets delayed somehow.
How sensitive is the scheduling? If 10% of the crew are no-show or delayed then about 9% of the planes are affected; every 8 hours 90 planes will have the following 8 hours of flights cancelled or delayed (throwing further monkey wrenches in the schedule). About 40 of those planes will be missing a pilot, which means they can't even be deadheaded to where they're supposed to be next.
So how to reboot? They've had to cancel two-thirds of their flights, so apparently they're able to keep 1/3 of them going. Keep those flying so they can shuttle crew around. You're initially only scheduling 1/3 of the crew, so the poor overloaded system can handle it. 2/3 of the flights are just outright cancelled, days in advance, so the customer support load is reduced. ("I'm sorry, your flight has been cancelled. We can't reschedule you until x days from now." vs. much back-and-forth trying to find something with an overloaded system.) Slowly add additional crew and flights so the number of phone-ins is kept manageable.
As you said, this is certainly a complicated situation.
But this is HN, and, from an optimization / constraint satisfaction problem perspective / scale perspective, it’s really not a very large problem. The “going home” problem is a standard programming contest problem. This is more constrained, but a globally optimal solution isn’t needed. Southwest literally has a program called SkySolver. It should work.
So you need to choose destinations for a few thousand planes? Pick the place you would normally have them at 4 AM two days hence. Need captains familiar with the airports? Surely the captains who would normally fly from a given airport at 4 AM two days hence are familiar with those airports. This gives a good starting solution for further optimization.
You need to get a message to ~20k people? Great, Twilio will do that with minimal effort.
With a budget of a couple million dollars, this is not hard. There are good SMT solvers available for free. CPLEX is expensive but not on this scale. Managing test cases on this scale is straightforward.
Even in an emergency, something good enough to get to the point where SkySolver starts working again should be doable.
People who manage electric grids have protocols for “black starts”. These protocols are messy and complicated, but they are developed in advance, and they work. Airlines should have the equivalent capability.
> But this is HN, and, from an optimization / constraint satisfaction problem perspective / scale perspective, it’s really not a very large problem.
i can't even begin to imagine what the utility function _alone_ must look like.
the solver can't just produce a blank slate solution every time, it's got to optimize... something, subject to some constraints about not changing too much from the previous plan it came up with. and it's probably got to do this every time there's new real world input.
> The “going home” problem is a standard programming contest problem.
the hubris and handwaving on here is hitting epic levels.
You touched upon the core issues. Yes, SkySolver should be able to get a solution quickly. Yes, there should be systems in place to co-ordinate individual actions of 20k people with rapid adjustments. Yes, they should have procedures in place to do "black starts". But they don't. And you can't design, test, and roll out even a new communications system to 20k users literally overnight. If the existing scheduling system doesn't already email crew, then do they have email addresses for all of them? If so, it's probably in HR somewhere; if that's outsourced then it could take days for SWA to get that pulled out even on an expedited basis. If they send email, what percentage of crew will receive it? In what timeframe? Email campaigns are notoriously porous. And how do crewmembers provide feedback? A standalone web page? Great--how do you authenticate? By interfacing with the existing system? What will that take? Probably not, though--it appears the existing feedback mechanism is a manual phone call to a person. Do crewmembers even have login IDs?
Finally, even with perfect communication there's the problem of getting people where they need to be when the transport system is very flaky. I think that's probably the issue when the CEO said 'they would get things manually set up, and then something would happen and we'd have to start all over again.'
The pilots' union and others have been criticizing SWA for not investing in infrastructure. The previous CEO, responding to the previous melt-down, said something about 'you can't test for these kind of scenarios.' That's wrong. You can't do it easily or cheaply, but it is possible. They haven't done it, they haven't developed robust systems, and now they can't instantly resolve the problem.
I mean, let’s assume that this problem is as simple as you say and everything else on the ground and with scheduling works perfectly (which it doesn’t, and everyone here is rightly calling you out on it. Ever work in an operations environment? Shit goes wrong fast).
You still have to contend with the laws of physics, AKA the speed of aircraft and the distance between airports. Compound that across thousands of planes making potentially cross country trips and it will take days to do what you are suggesting, leading to the same problem SWA is already facing.
> So you need to choose destinations for a few thousand planes?
No. You need to figure out which planes are where, which of those planes are legal to fly out of which airports, which planes can be made legal to fly with low effort, and which planes need to go in for service (potentially requiring a non-revenue flight).
Ostensibly Southwest operates a fleet entirely of Boeing 737s with two subfleets (ETOPS and non-ETOPS). The ETOPS planes can be used anywhere but the others cannot be used to Hawaii. Some of the ETOPS routes require the range of the MAX and some do not. Some of the airports Southwest flies into can only handle the smallest planes. It's entirely possible that everything stuck in Oakland is too big to fly into Burbank for instance. Or it's entirely possible that all of the ETOPS fleet is stuck in Hawaii (so all of the routes to Hawaii are no-gos).
Then you've got to figure out what's broken on each plane. Depending on what's inoperative a plane may not be able to fly into a specific airport if the weather is anything but perfect. If enough stuff is broken the plane may be stuck away from a maintenance base and illegal to fly in revenue service. There goes a plane and a flight crew. These are laws, not optimizations.
And finally you need to figure out where all the diverted flights went. No idea what's normal but Southwest had to divert two flights today (one due to mechanical issues and one due to an unruly passenger). So there's at least one plane that's out of position and unable to fly until it's been repaired.
> Need captains familiar with the airports?
No. In fact I'm pretty sure that unlike RyanAir and EasyJet, Southwest doesn't fly into any airports that categorically require specific familiarity and training beyond the ETOPS certs required for flight and maintenance crew on the Hawaii routes. You do need to figure out who's legal to fly in revenue service and who might be legal to fly on a ferry permit though. Southwest was pretty late to the autopilot game but at other airlines some captains are qualified to land in worse weather than others. So even if you have a flight crew ready to go they may be unable to complete your desired route today – and that's been a problem. The Southwest scheduling software assumes that every pilot completes their assigned flights successfully. Whoops.
Then you've gotta do the same with the cabin crew. And then you've gotta make sure none of this runs afoul of the various CBAs. The baggage handlers? They've been working without a contract for nearly three years and were threatened with immediate termination by Southwest's VP of ops. Wanna guess how strictly the ramp rats will stick to the letter of the contract?
And you haven't even addressed the issue of out of position luggage. It's not just that luggage is piling up at airports, but Southwest's actually flown some of the luggage without the passengers. The hand waving I've seen suggests that Southwest still does a lot of the luggage tracking manually.
When that's all said and done you're still going to need to handle everything that goes tits up as service resumes. As I pointed out earlier, Southwest had at least two diversions today. So now you have two more out of position planes, crews, and more out of position luggage and passengers.
It's a massively complex, dynamic problem. If a couple million bucks would solve things, Southwest would've spent it already. List price on a single 737 is around $40 million. Even if Southwest got half off you're still talking tens of millions of dollars per plane. Put another way, Southwest has so far bought two airlines only to junk their entire fleet. A few million here and there is nothing, especially if it could've headed off this chaos.
> People who manage electric grids have protocols for “black starts”. These protocols are messy and complicated, but they are developed in advance, and they work. Airlines should have the equivalent capability.
Most airlines do but, based on the anecdotes from Southwest crew, Southwest does not. Again it's not really that Southwest flies point-to-point, and it's not just a software issue at this point. Procedures that worked when Southwest was a scrappy little airline simply don't scale and Southwest management is ill-equipped to handle this. It's all compounded by the employee contempt that Kelly's managed to foment.
Something worth pointing out here is that according to levels.fyi, a tech lead there who has worked for 20 years has a TC of 174K. It’s unlikely their technical talent would be up to what you propose which is related to the apparent decision at Southwest to underinvest - by their own reckoning - in technology.
This is again the HN tech bubble. That TC is around average for most senior software engineers in most corporations in most cities in the US.
And getting into a tech company and having high compensation doesn’t require “top talent”. Just the ability to memorize DS&A (junior to mid) and memorize system design and being able to regurgitate answers to behavioral questions showing “scope” and “impact”.
It takes knowing two or three well known algorithms to implement the scheduling problem they are trying to solve and system design chops to know how to scale it.
But, as far as having to be on hold for hours, that’s a staffing issue.
Most of their issues are symptoms of poor management - not software engineers.
To be clear, I don’t think you can blame these software engineers. Management has decided the average value of a tech lead to them is about 175K and they are getting that quality of engineer. I think their current tech issues suggest that the value management is placing on engineers is lower than it should be.
Like I said earlier in the thread - there are many bad engineers that are overpaid and many good engineers that are underpaid. On average though, compensation and quality are correlated. In the absence of additional information, compensation is the best metric we have to assess the quality of the underlying engineering talent. It also passes the sniff test - try using the Southwest Airlines app!
Again, you are just as deep in your bubble as everyone else. There are 2.7 million developers in the US. The vast majority of them are paid between $80K-$170k. They are working in banks, government, smaller startups, etc.
And it doesn’t take a $400K senior developer that can reverse a binary tree on the whiteboard while juggling bowling balls while riding a unicycle on a tightrope to make a mobile app.
How much do you think Delta developers make in Atlanta (hint: they aren’t making over 200K). Their app is excellent.
It never ceases to amaze me how little most people on HN know about compensation throughout the industry.
1) These types of comments always make me cringe because it's built on an assumption that everyone is solely trying to optimize for income. I know plenty of people who would take a pay cut to continuously work on what they consider more interesting problems than, say, a social media app or basic CRUD app.
2) COLA matters. $174k in, say, Southwest HQ in Dallas is roughly equated to $422k in SV.
It’s ridiculous to suggest that compensation isn’t correlated with quality. There are many exceptions but it’s a good rule of thumb in the absence of additional information. It also just passes the sniff test - try comparing the Southwest Airlines app to the united airlines app.
I don’t think COLA matters all that much. The reason SV pays such high salaries reflects the underlying higher productivity of those engineers. The fact that much of that income goes to landlords in the Bay Area is orthogonal to the underlying quality of the engineers.
Nobody is claiming they aren't correlated. The question is: how strong is the correlation? The OP seems to insinuate they are extremely correlated on account that it is the sole consideration in the comment.
Like so many strong claims made online, it's an overly simplified mental model for something that is much more complex in reality. Like you said, we need more information before making strong claims.
I wonder if it’s that or simply a lack of slack in their system.
It seems to me that just like pre-staged inventory helps in logistics management, that extra planes and crews in the rotation could improve operations under these circumstances.
With a normal airline, you have pilots sitting on "reserve" at bases, who can be called in at any time to fill in any gaps that may occur. They are being paid but are not flying, it's quite a good gig if you can get on the reserve list.
I don't know how this is handled at Southwest, who does not fly hub-and-spoke and thus doesn't have a bunch of pilots sitting reserve around a base at, say, Atlanta.
In the past I had a job where some contract required a trained body to be on site 24/7. The company hired EXACTLY enough workers to fill the position, with _zero_ slack for anything.
That lack of slack is hell. It makes any disruption, even minor ones, require the other workers to work more time. Major disruptions mean soul-crushing crunch level hours to just get by.
Slack _must_ be planned into a system, otherwise there won't be any safety / recovery margin, and you're seeing the results live with Southwest's implosion.
It'll be interesting to see. I'm sure Congress will go through it pretty strongly (opinions, that is), assuming they ever have a successful vote for a Speaker.
> Hence the need for a “full system reboot” over many days.
My understanding is that the full system reboot wouldn't have taken all that long; it's just that the company was trying to do a major fix while keeping whatever was still sort-of-working running. As any sysop will tell you, patching a running system is all kinds of crazy risky.
>"Anecdotally, I flew Southwest just before Christmas. The network was already buckling and we had major delays, but we were lucky and made it through."
Interesting. You don't say how far before Christmas you were traveling. Had this crazy weather system already started moving from West to East at that point? Or was the system buckling just from passenger volume at that point, i.e. similar to the summer meltdown that Southwest had?
According to the article, the system produces some nonsensical tasking:
"In one example during the storm, the system assigned a pilot to deadhead on a flight from Baltimore to Manchester, N.H., and then back to Baltimore the next day, without ever flying a plane"*
* The article defines deadheading as sending a pilot as a passenger to get to another location.
It would be interesting to look at what the system is trying to optimize for to make such choices.
If the solver is updating based on current events, it’s completely reasonable that something like that could happen. Manchester was projected to be short a pilot, but the flight made it on time, and Baltimore unexpectedly ended up short.
This sounds like what you might expect from using a random process (genetic algorithms, simulated annealing) to solve the NP-complete problem. The randomness injects suboptimal routes like this, as well as more optimal ones, but the fitness/cost function has to distill a whole bunch of things into one scalar. In my experience what happens is that the different things you want to value sort of compete with each other. I'm guessing the current state of the system is quite suboptimal, and it might not be able to remove the randomness.
It's quite feasible to get within a factor of 2 of the optimal solution (with much less processing power), which sounds great from a CS algorithm analysis standpoint, but a factor of 2 looks like an awful schedule from a human standpoint.
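To make the "one scalar" point concrete, here is a minimal simulated-annealing sketch. Everything in it is made up for illustration (the airports, the pilot positions, the two weights, and the names `cost` and `anneal`), and it says nothing about how SkySolver actually works. It only shows how a weighted sum forces the goals "avoid deadheads" and "don't churn the published plan" to compete.

```python
import math
import random

# Hypothetical toy data: flight i departs from FLIGHTS[i]; pilot p is in PILOTS[p].
FLIGHTS = ["BWI", "MHT", "BWI", "DAL", "DAL", "MHT"]
PILOTS = ["BWI", "BWI", "MHT", "DAL", "DAL", "MHT"]
PUBLISHED = [0, 1, 2, 3, 4, 5]  # assumed existing plan: flight i -> pilot PUBLISHED[i]

# Assumed weights; choosing them is where the competing goals collide.
W_DEADHEAD = 3.0  # pilot must ride as a passenger to reach the flight
W_CHANGE = 1.0    # flight gets a different pilot than the published plan


def cost(plan):
    """Distill the whole plan into one scalar (lower is better)."""
    deadheads = sum(PILOTS[p] != FLIGHTS[f] for f, p in enumerate(plan))
    changes = sum(p != q for p, q in zip(plan, PUBLISHED))
    return W_DEADHEAD * deadheads + W_CHANGE * changes


def anneal(steps=2000, start_temp=2.0, seed=0):
    """Simulated annealing over pilot swaps, starting from the published plan."""
    rng = random.Random(seed)
    current = list(PUBLISHED)
    best = list(current)
    for step in range(steps):
        temp = start_temp * (1 - step / steps) + 1e-9  # linear cooling
        i, j = rng.sample(range(len(current)), 2)
        candidate = list(current)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse plans with shrinking probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
            if cost(current) < cost(best):
                best = list(current)
    return best, cost(best)


best_plan, best_cost = anneal()
print(best_plan, best_cost)
```

Changing the ratio of `W_DEADHEAD` to `W_CHANGE` changes which plans look "good", which is the competition between objectives described above.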
Compare the cost of keeping one surplus pilot forever deadheading around the system to the cost & disruption of canceling (say) one flight a month because "a pilot got sick, it was not legal to fly short-handed".
Having a (relative) bunch of surplus critical-task workers, forever being shuffled to where the system guesses they're most likely to suddenly be needed, makes perfect sense.
(Yes, you'd have to be a bit more sophisticated, so your surplus pilots got enough flying time to stay certified. And didn't get pissed and quit. Assume that I know why RAID 5 is better than RAID 4:)
Sounds to me more like a bug in an edge case rather than trying to optimize for the wrong thing.
Or (more likely, I think) the first deadhead was planned to have the pilot take over a flight which was then cancelled after the pilot was already underway.
Also sounds like a case of not throwing enough adversarial data at the system - you can't just code coverage your code, you can't even establish KPIs, you have to establish its performance under system failure (does it freeze or gracefully shut down, does it persist to disk, what happens when the disk is yanked out of the system), etc.
Very few software shops I am aware of do anything like this.
That could also be the result of thrashing after they were really far into the mess. That is, the solver maybe did output something sensible, but that solution has to get into the system that's used at runtime. Someone may have run another solution, or did manual updates while it was too overwhelmed to get all the changes posted. The solvers also depend on other context that's changing underneath them.
I've told this story a few times, but maybe 10 years ago I had a cross-country JetBlue flight that was delayed perhaps 6 hours. It was a few days after a major storm. Like Southwest here, JetBlue didn't have much flex capacity and relied on the daisy chain to keep on chaining. Our plane had gotten stuck somewhere, so they had to find a different one at some far-away airport and fly it in, which took hours. But the kicker was that when the plane finally landed, the crew already onboard couldn't man the flight because that would exceed their duty limits. The airline didn't realize this ahead of time, so they had to gather a new crew (like literally call them in), which added a couple of hours to the delay.
Naively, I'd assumed these kinds of things were handled in some sort of mission-control center with warnings from rule engines blinking on some big screen and a team of crack operators mapping out what needed done. But clearly that wasn't so: they were just making things up as they went along. Sounds like Southwest is in a similar spot, but this time on a much bigger scale.
> clearly that wasn't so: they were just making things up as they went along.
Where did you get your information? I have experience in the industry and scheduling logistics is clearly not how you describe it. The issue is that to optimize for profit you sacrifice the ability to maintain service through catastrophic events and can end up in a bit of a dominoes situation.
Which information? That they belatedly realized they'd run out of duty hours and had to call in new crew, after the plane with the soon-to-expire crew had landed? They told us that while we waited at the gate, including updates about the expected time that the newly-called-in crew would arrive. They were quite transparent about everything.
Or are you asking why I think they're making it up as they go along? That is my conclusion, to be sure. It's one thing for dominoes to fall, but if you are in a situation where the dominoes are falling and you are not able to predict which dominoes will fall next and respond accordingly, you are making things up as you go along. I'd have expected that almost any decision they could even hypothetically make would run through a system that checked it for violations of constraints, which would have told them way ahead of time that they needed a fresh crew, and they'd have had the entire flight of the replacement plane to get one in (IIRC it flew Miami->Boston just to get us the plane).
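The kind of ahead-of-time constraint check described above can be sketched in a few lines. This is only an illustration: the 13-hour limit, the function name `can_work`, and the times are assumptions, not actual regulations or anything an airline is known to run.

```python
from datetime import datetime, timedelta

MAX_DUTY = timedelta(hours=13)  # assumed limit for illustration, not a real rule


def can_work(duty_start, next_departure, next_block_time, max_duty=MAX_DUTY):
    """True if the crew would finish the next flight within the duty limit."""
    duty_end = next_departure + next_block_time
    return duty_end - duty_start <= max_duty


# Hypothetical scenario: the crew started at 08:00 and ferries the replacement
# plane in; the delayed cross-country flight would leave at 19:00 and take 6h.
duty_start = datetime(2020, 1, 1, 8, 0)
ok = can_work(duty_start, datetime(2020, 1, 1, 19, 0), timedelta(hours=6))
print(ok)  # False: 17 hours on duty exceeds the assumed 13-hour limit
```

Run at the moment the replacement plane is dispatched, a check like this would flag the need for a fresh crew hours before the plane lands.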
Regarding your first paragraph, rarely are the gate agents telling the public the entire story. The truth is typically more complex and basically not of concern to the average passenger.
As for your conclusion, it's just not accurate. Systems _are_ used for scheduling and allocation of equipment and crew resources. You just have to realize that getting you, as an individual, to your destination on time on any given day isn't the number one priority of the company. There are a lot more concerns they are considering.
If the airline industry is really as dysfunctional as you think, with such shortcomings in areas as important as equipment and crew scheduling and allocation, you should drop whatever you're doing and start an airline. You'll soon be filthy rich and do the world a great service.
That’s just hand-waving. Sure, they could have been lying to us and had some hyper-competent, complex reason that delayed us and then made up some malarkey about duty limits. But you don’t know either, and the explanation they presented seems vastly more likely.
Also your generalizations fly directly in the face of the information emerging from this Southwest fiasco, which indicates they don’t have systems in place to even track where their crews are (just where they are scheduled to be), so the shortcomings seem to be quite real.
Couldn't the emergency flight have run into some problems, making it take longer and that led to the crew being out of duty hours? Like the flight was supposed to take 3-4 hours but wind and some airport issue caused it to take 5? Something that may not have been obvious until very late in the flight too.
>The issue is that to optimize for profit you sacrifice the ability to maintain service through catastrophic events and can end up in a bit of a dominoes situation.
I don't follow. Nationalization might add more slack, and more resilience, but it'll be so much more costly to operate.
Airlines run on very thin margins, so need to optimize heavily.
We just need to accept that we can either get "affordable" tickets with occasional meltdowns or we go back to 30 years ago and pay double price for tickets but have more resilience.
The market as a whole has already resoundingly chosen the former.
Safely flying humans in metal structures is just really expensive and the current ticket prices pretty much reflect the real cost (and even ignore the climate externalities)
> I don't follow. Nationalization might add more slack, and more resilience, but it'll be so much more costly to operate.
Who cares? The point of nationalization is costs don't matter, we just see it as a cost we bear because transportation is important for civilization. The primary mode (for better or worse) in the US is public (roads) and that "cost" is not covered by anyone.
I’ve seen it happen and it’s always strange that they seem to not realize the crew will be unusable until they arrive … I assume they had been trying to get another crew at the same time.
There are several time limits, but I believe that one of them stops when pulling back from the gate, i.e. once the plane is moving, they're granted X more hours, but if they haven't left yet, then they've timed out. So the arriving crew may plausibly be usable upon arrival, but a few minutes late may be all it takes to time them out.
I've been on a delayed plane where one of the pilots timed out while sitting at the gate, so we had another second delay finding another pilot.
There's also the fun wrinkle that not all crews or crew members are current on ratings (i.e. training courses) to operate all types/versions of planes the airline has in service.
You also need sufficient open gates of the right type in the right place during the right time window or you idle a plane (which can then idle a crew). It's pretty easy to imagine how such a system which may work at a certain level of load can be tipped into a runaway scenario.
As a systems person I'd be an interested reader if Southwest were to someday release an incident analysis and post-mortem of the type that good IT organizations do.
Once you realize how many people in critical industries (power grid, telecom, global cargo/logistics) are in fact making things up as they go along, and there aren't really any highly organized and responsible people running the show, you start to worry.
What is the alternative? Extensive playbooks for every possible scenario that may or may not make sense in the actual scenario because a few critical details are different? A "mission-control center" is really no different: it's just people making stuff up on the spot.
In the end, there is no substitute for human judgement in the situation itself.
> What is the alternative? Extensive playbooks for every possible scenario that may or may not make sense in the actual scenario because a few critical details are different?
Yes? I suspect they have playbooks and/or run gameday exercises for other aspects of disaster preparedness and business continuity. Why not do the same for these sorts of issues that affect the travel network?
The network of airports, planes, and crews is a graph (or multiple graphs). What happens when a node disappears?
That may be a hard question to answer, but it seems cheaper to answer, and plan around, it before the situation occurs rather than after.
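As a rough sketch of the "what happens when a node disappears" question: treat one aircraft's day as a chain of airports and ask which legs break when one airport closes. The rotation, the airport codes, and the assumption that every leg downstream of the first blocked one is lost are all hypothetical simplifications.

```python
def disrupted_legs(rotation, closed):
    """Legs that cannot be flown as planned once `closed` is unavailable.

    Assumes the aircraft is stuck before the first leg touching the closed
    airport, so every later leg in the chain is lost too.
    """
    legs = list(zip(rotation, rotation[1:]))
    for i, (origin, dest) in enumerate(legs):
        if closed in (origin, dest):
            return legs[i:]
    return []


# Hypothetical point-to-point rotation for one aircraft.
rotation = ["BWI", "MHT", "BWI", "DAL", "HOU"]
print(disrupted_legs(rotation, "DAL"))
# [('BWI', 'DAL'), ('DAL', 'HOU')]
```

Running this over every rotation for every airport is a cheap way to see, before any storm, which closures take out the most downstream flying.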
The actual answer is buried at the end of a long article.
> Unlike many rival airlines, Southwest’s planes generally hop from one city to another, rather than orbiting a major hub. That approach lets Southwest maximize use of its planes and crew, but the daisy chain structure also makes its network more delicate—problems in one corner of the country can be difficult to contain
That is only part of the issue. Not every airline is 100% hub-based; most are a pretty big mix.
Further, a lot of major hubs were impacted by the storm, yet those airlines were able to transition. Why?
Well, most airlines have mobile apps, and web portals and other technology so their crews can be reassigned in almost real time (just like I would get auto-booked on a new flight via the mobile app before I even knew my flight was canceled).
Instead Southwest has systems from the 80's they require crews and customers to call and talk to an actual live human...
> Instead Southwest has systems from the 80's they require crews and customers to call and talk to an actual live human...
From what I understand (from /r/flying) it's the reporting / fixing of issues which is manual. This means when the reporting / fixing is overloaded and not given a respite it can't catch up, keeps getting more overloaded, and ultimately the scheduling system completely fell over (it lost track of crews entirely is what I understand).
"Mobile apps, and web portals and other technology" is orthogonal to the issue at hand.
Mobile apps or a web portal absolutely could have helped, simply by massively increasing the throughput of crew members being able to report that they didn't make it to their assigned location. Which could have stopped the cascading effect much sooner.
The manual "reporting/fixing" of issues, which was done over the phone, includes pilots and crew members reporting "hey, I didn't actually get on the flight I was scheduled to". If the system did not receive that message, it was built to assume that the crew members made it to their next location and would be available for their next scheduled flight. But once a certain threshold of flights got canceled because of the storm, the limiting factor became how quickly staff members could even report that they didn't make it to their next location - they'd stay on hold for hours, until their next flight in their next city was scheduled to leave, and the system would still assume they were on that flight because it hadn't yet heard otherwise.
Lack of webapps certainly wasn't the only issue here - as the article notes, their whole point-to-point scheduling model is more vulnerable to cascading failure as well, vs. a hub-and-spoke model. But the phone-based system did certainly cause a feedback loop that kept the disruption happening for much longer than it should have. And moving away from point-to-point would have many more tradeoffs/downsides for consumers than simply modernizing their tech system.
The interview I saw with one of the execs seemed to indicate that even under normal conditions when a flight is cancelled each crew member needed to call into a call center to be reassigned.
All crew are also under a regulation that limits their time available for flight, similar to a trucker but with larger consequences. eg If you are on shift for a certain number of hours, waiting for a plane in a hotel or in the terminal, it_does_not_matter. You have to have a mandatory sleep block where you are now unavailable. Once thousands of staff started timing out, even after the weather clears, it_did_not_matter. They were now unavailable and their replacements hadn't come in due to the weather, leaving some locales understaffed and all schedules backed up (which continued to compound as weather shifted).
Similar to humans, is there a wait time [cooling time] for aircraft also? I suspect before and after each takeoff and landing, there are a bunch of tests run to verify proper function of aircraft parts and subsystems.
Aircraft have to be scheduled for various maintenance checks such as "A", "B", "C", and "D" (which are the heaviest maintenance checks and take them out of service for many days/weeks). They're of course checked by the Ramp Crew and also receive a pre-flight walk around by the pilots before take-off. The only wait time is if there is a necessary repair to deal with a warning/indicator light and related repair time, hopefully does not take the aircraft out of service. Better out of service than break down in-air.
Transport aircraft can run continuously for multiple flights. They don't turn off the engines under certain conditions (short hops, type of plane, size of loads, etc). There are regulatory inspections that require it to hangar, but I'm not familiar enough to cite any.
I mean, a system for crew scheduling that requires manual notification and updating when flights get cancelled... seems to be missing very important functionality for a crew scheduling system?
Not entirely orthogonal? If the fixing issues is manual (therefore slower), that means that once issues start happening, it's slower to recover, thereby making a total meltdown more likely. And when a total meltdown occurs, recovery is manual, therefore slower.
You can have a non-computerized system that is more efficient. Apps can and do fail, and I'll go out on an unpopular limb and suggest that what most folks working in tech would build would still fail frequently enough to be problematic in an airline of this size.
The problem is that their disaster recovery plan (assuming that there is one) is highly centralized. Southwest has each and every body calling the scheduling department one at a time. The proper design (and what most other airlines actually do) is to decentralize a wee bit and have, for instance, crews report to station managers who in turn call the scheduling department.
Mobile apps / web portals could have helped the part of the problem related to status updates of staff location/availability not ingested fast enough. Apparently communications ingest bottleneck was a significant part of why the overall problem tipped into an even worse runaway state with no viable recovery options short of "full reboot."
In a complex enough system, it's reasonable to assume that at any point in time something is broken and the system is not running at 100%. What this means is they probably had their reporting/fixing system running at the thin edge all the time, and once the failures exceeded a certain rate, it was game over. The fixing could never catch up.
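The "running at the thin edge" argument is basic queueing arithmetic, sketched below. All of the rates are invented for illustration; nothing here comes from Southwest's actual call volumes.

```python
def backlog_over_time(arrivals_per_hour, capacity_per_hour):
    """Unhandled crew reports at the end of each hour."""
    backlog, history = 0, []
    for arrivals in arrivals_per_hour:
        backlog = max(0, backlog + arrivals - capacity_per_hour)
        history.append(backlog)
    return history


# Hypothetical load: 3 normal hours, a 6-hour storm, then 24 normal hours.
# Schedulers can clear 100 reports/hour; normal load is 90/hour.
load = [90] * 3 + [400] * 6 + [90] * 24
history = backlog_over_time(load, capacity_per_hour=100)
print(history[2], history[8], history[-1])  # 0 1800 1560
```

With only 10 reports/hour of spare capacity, the 1800-report backlog built up in 6 hours would take 180 hours to drain, and that is before counting the new failures caused by the stale data in the meantime.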
Major hub getting impacted is less of an issue. Either the planes and crews are stuck there, you cancel flights and send them home. Or they fly to nearest available location and wait until they can get back to the hub.
And explaining that FAA or someone banned flying from this airport is much simpler.
I mean... it's not entirely untrue but at the same time airlines have been winding down their "hub and spoke" model for a point to point one for a while.
That's in part what doomed the A380, which was popular with airlines still going strong with hub-and-spoke (Emirates being by far the most prominent one) but is worthless in a point-to-point model.
There are still a number of major airports that are slot constrained, especially now that passenger demand is growing again. It seems like there could be a profitable market for an efficient twin-engine airliner larger than the Boeing 777X. Perhaps even a double-decker?
> There are still a number of major airports that are slot constrained
Major airports are major airports even in a point to point system. If everybody wants to go into or come out of NY, NY's airports are going to be slot constrained.
Also when I say "winding down" I don't mean "killed entirely", most airlines still have home bases, and mix point to point with regional hubs (especially for international flights).
But at the height of the hub-and-spokes model, unless you departed from or went to a hub you'd always need a layover.
La Guardia is gimped by stupid rules about travel distance. I'm sure these rules made sense years and years ago when planes that could travel long distances were very loud, but by now, the airport has been there longer than most residents and planes are significantly quieter.
From a net utility standpoint, NYC should just pay for "port packages" for nearby houses, add the cost to landing fees, and remove source/destination restrictions on La Guardia.
La Guardia doesn't operate in a vacuum. It's too small for long range aircraft and increased traffic at LGA would create problems for JFK and EWR. Besides, the runways at LGA are much shorter than those at JFK, EWR, and extending the runways at any of those three is problematic. Even SWF has longer runways.
Southwest is statistically the worst airline in terms of delays and cancellations but has deluded its customers into thinking it's the best (according to surveys asking people to rate airlines on their reliability).
Thanks for not understanding statistics and linking to an article that also doesn't understand basic math.
SW has a high number of delays and cancellations BECAUSE THEY FLY A HIGH VOLUME OF PEOPLE. By percentage, they are in the middle of the pack for both delays and cancellations, which isn't great, but they are not the worst by any means.
How are HN people so consistently bad with basic information?
They are actually more expensive, which is why they do two things to fool their customers:
1. On their flight search page, it shows the prices on a per-leg basis. Every other airline shows round trip prices. That way, the initial result on Southwest looks great and it’s only when you get to the final payment page that you realize it’s actually twice as expensive as you thought.
2. They refuse to share their data with any of the third party flight search engines like Google Flights or Expedia. Again, so people don’t realize how expensive they actually are.
b. other airlines will advertise prices that don't exist: it'll be the base price for the seat, but all (remaining) seats will be of the upcharged variety, so you can't pay the base price.
When you make the adjustments to compare apples-to-apples prices, SWA's get a bit closer. (But I do think SWA does tend to not be the cheapest, unless they're having a sale.)
I actually prefer the per-flight¹ pricing, too. That's what I'm buying: two flights.
¹IIRC, the pricing is per-flight, not per-leg/segment. (But it isn't, as you say, round-trip.)
I fly all the time and SW is usually more expensive than American.
And that’s why they refuse to share their prices with Expedia and Google Flights. If people could more easily compare prices, Southwest would lose a lot of sales.
Maybe it has something to do with my location, but my experience has been that SWA is normally much cheaper than other airlines. I recently cut 70% off a trip I'm planning in May by switching from Delta to SWA. That's probably a uniquely extreme case, but I'll happily risk a delay of even a day or two if it saves me literally thousands of dollars.
Southwest’s base fares suck, but their discount tier fares combined with frequent sales, the companion pass, and a decent frequent flyer program make them pretty competitive in my experience. That said, if you don’t live or travel on one of their major routes, they are a bad choice.
The statistics in that article are of the "damned lies" variety: none of the values are normalized (they compared simple number of cancellations and delays without taking that as a per flight value, or perhaps better, per passenger) and they treat all delays (and cancellations) as equal; I'll take an airline often delayed by 5 minutes over an airline sometimes delayed by 3 hours.
Perhaps it's true nonetheless, but the numbers there won't tell you.
(And IME, it's perhaps true that SWA is often delayed … but by tolerable amounts. Compared to delays I've endured with Delta, where, e.g., a flight was delayed longer than the time it would take the plane to drive at highway speeds, from where it was coming from. Or … also Delta … where I was cancelled on twice in the same flight. They wanted to go 0-3 but I gave up and bought a ticket on … SWA.)
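The normalization complaint is easy to show with made-up numbers. The two airlines and their counts below are purely hypothetical and are not taken from the linked article or any real dataset.

```python
def cancellation_rate(cancelled, scheduled):
    """Cancellations as a fraction of scheduled flights."""
    return cancelled / scheduled


# Hypothetical data: name -> (cancelled flights, scheduled flights)
airlines = {"Big": (200, 4000), "Small": (80, 1000)}

worst_by_count = max(airlines, key=lambda name: airlines[name][0])
worst_by_rate = max(airlines, key=lambda name: cancellation_rate(*airlines[name]))
print(worst_by_count, worst_by_rate)  # Big Small
```

"Big" cancels more flights in absolute terms (200 vs. 80) but only 5% of its schedule, while "Small" cancels 8%. Raw counts mostly measure airline size.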
Perhaps they are thinking frequency for a trip they take often. It matters less that your Dallas->Houston flight is late when there's another one in 30 minutes during peak times.
I suspect also that, like JetBlue in its early years, a lot of its flights were out of secondary airports that are generally less exposed to a lot of operational disruptions (e.g. they flew into Hobby in Houston early on). They also had the advantage of being in a region that, per its name, probably has fewer weather issues in general.
The pattern I've seen over the years is that upstart airlines, as they grow, end up having to look--for various reasons--a lot more like legacy carriers over time. Whether that's flying into areas with seasonal bad weather, flying out of default airports, instituting various forms of passenger status, etc.
I don’t travel much these days but I found it common when I did that you would get there before arrival time that is told to passengers. There is some buffer.
I find it especially ironic that SWA system failed them, and this large failure was preceded by worse and worse "near failures", since SWA is in the aviation business.
In the aviation arena, high reliability is maintained in part by careful analysis of "near failures": lessons are extracted and improvements are made to aircraft designs, procedures, etc.
By contrast, the "near failures" of the SWA system as a whole don't appear to have been utilized to motivate system improvements.
Per comments on /r/southwestairlines for 20 years the C-suite were Wall Street boys and set that as management culture, as long as shares were up things were fine.
Frontlines have been emitting concerns and warnings for years but management didn’t care until an ops CEO (Bob Jordan) got in recently (as in 2022), but now there’s 20 years of ops neglect to deal with, and less than a year is nowhere near enough to start enacting real changes.
On that note, I think it's interesting how glowingly people talk about Herb Kelleher. Wouldn't he be responsible for allowing that tech debt to build until it finally caused the system to fail catastrophically?
Kelleher stepped down in 2004, so not really. In 2004, replacing 10-year-old systems starts becoming a consideration but it's not an active need. Kelly was CEO from 2004 to 2022, and is now chairman of the board. Bob Jordan is the new CEO, since February 2022.
I'm surprised they haven't tried to blame a cyberattack yet.
That said, I feel like these sorts of catastrophic ultra-fragile McKinsey-consulted-to-death failures we keep seeing in various industries are basically a giant signal to any adversaries that say "Hi! Check out how easy it would be to grind this entire industry to a halt!"
Resiliency is literally the opposite of efficiency. These systems need to have slack, aka inefficiency built into them. Unfortunately the business culture has moved towards ultra fragile, ultra efficient thinking.
That could potentially increase their cyber security insurance premiums if not cause the insurer to drop them immediately. Not to mention the broader impact on the market and industry as a whole.
Skysolver is a GE Flight Services trademark - there’s a video here showing how it works and SW planes. Contrary to the reddit claim, it does appear to use a predictive algorithm.
Highlight quote from the video:
“It is humanly impossible when there’s a major disruption for somebody to figure out what the optimal approach is to get them back on schedule”
I would be very interested in a post mortem of the software used, called SkySolver. It's supposed to be a Java application which is said to be developed by Accenture? Anyone have actual technical insights into why it failed?
About a decade ago, I was part of a startup trying to disrupt crew scheduling. It's a non-trivial operations problem when you aim to honor crew preferences, union-negotiated affordances, FAA legalities, etc. We were only involved in the pre-planned schedules. At that time, the airlines we were courting had an entirely different human-heavy system to resolve real-time issues. There was some level of reserve redundancy baked in so the human planners had some wiggle room to work with... but redundancy is expensive to maintain. As a relatively new engineer at the time it was a pretty neat domain with big $$$ at stake. As it turns out, pretty much nobody wanted to take the risk on a new system even if it had provably better schedules for all parties. All it takes is one snafu for the whole thing to turn into a major regret.
We ran a pilot with one carrier for a subset of their crew for maybe a year, but eventually failed to gain further traction. Very tough sales and implementation cycle in spite of the fact that we could convince many individuals (both crew and management with their competing concerns) of the benefits of our system.
In what way was it a tough implementation cycle? e.g. resistant to change, or outsourced (e.g. could not get close enough to the fire/ talk to the right people) or was it something else?
Would love to pick your brain because I can relate to what you're saying.
Integrating into the operations on a provisional evaluation basis is a whole lot of work and the stakes are high. As my grandpa used to say, "There's nothing easier than doing nothing." Both parties (our company and the airline) brought the right people to the table, but for the system to really be evaluated we needed an entire crew base to use the system in a "realistic" fashion. There was a fair bit of understandable skepticism about how the system would perform under the stress of real user input, but due to the overall complex nature of the system there was a non-trivial learning curve and lots of crew members didn't see a lot of value in spending their time helping us put the thing through its paces. As an example, the system would allow crew members to specify a highly detailed, specific set of flight preferences that were either declared as ranked priorities and/or they could select specific flights. If a crew member was forced to "try" our system, they could put in a generic set of input ("I like the Vegas flight."). However, we suspected that once they were faced with the opportunity to share their real schedule, they would have a lot more incentive to put in a lot more particulars, which would stress our optimization engine differently.
I'm actually working at a startup that wants to use an algorithm to plan transport on a large scale (different industry though). Got any tips or insights on biggest challenges?
The biggest challenges we had were more political. Getting buy-in across the airline as well as _support_ from the airline side to integrate their data and system into ours.
Sadly, it was a people problem, not a tech or algo problem.
So, did you manage to "tackle" the people problem in the end? How did you get them to move and integrate with your system in the end? Was it a bottom up (from the end users) or top down (making it part of company strategy) approach?
Sadly, I don't have these answers. I ended up quitting to go on a long sabbatical.
Last I heard, the airline scrapped the entire project to continue doing things the old way because they felt it was too much work on their end. ¯\_(ツ)_/¯
Does the algo work? Is it better than the status quo and under what circumstances is it susceptible to failure? Does it know when the whole system is over-constrained?
If we assume the algo rocks (because you have operations research veterans), what stage of planning does your system address? What is the experience required and cost of manual intervention?
Fun problem. Human factors and edge cases abound. So much is at stake for a system that already works (to any predictable degree).
The algo works. We have two algorithmic experts onboard, of which one is a professor and one is already working with algos each day (e.g. worked on patients/beds capacity challenge when covid began).
The startup focuses on assigning goods to transport. (As in, it's not part of the commercial sales process, only the execution of the required capacity planning that comes after it.) It's something that has been tried 20 times and failed. It is currently done manually.
I'll bet it's just another excuse. The situation was probably so bad that the only solutions SkySolver could offer were beyond their capacity. There were probably partial solutions that were possible, but they rolled the dice on being able to resolve it by overworking people. They decided to double down instead of paying a whole lot of vouchers and refunds they'd rather not have to. Is it the software's fault the bet didn't pay out? As the problem snowballs and becomes more intractable, you get too far off the script and reach a point of no return.
I know nothing about the SkySolver or leadership at Southwest, but that seems like a very likely scenario based on what I've seen of what corporate types expect from software and rank-and-file employees.
SkySolver is a total shit show no doubt, but it’s not the entirety of the problem. Keep in mind when you have disruptions you actually have multiple products. SkySolver only manages the crew. They have another system that manages the aircraft and flight schedule. Believe it or not these systems are mostly decoupled - the schedule is modified first, then the crew, and there may even be a feedback loop back to the schedule if there is no crew solution.
Still my go-to airline. The rest are so difficult, unfriendly, or just greedy that I'd rather deal with Southwest every time and feel like I am a human being.
I mentioned this in another thread, but Southwest feels like they fuck up in ways that are understandable, if unfortunate. Delta, United, American and co. feel like they intentionally screw you out of perverse pleasure and a desire to cause you pain.
Southwest's seating may be chaotic, but it's not United where you have to pay an extra $60 to not get a middle seat, likely in the back of the plane.
That's one thing I have seen over and over regarding the current situation, that Southwest personnel are polite, empathetic and as helpful as they can be given the circumstances.
Do their planes still have smiles painted on them?
It's easy to ask, "How could this happen?", but it's also a wonder this sort of thing doesn't happen more often with airlines and other businesses that rely on solutions to complex logistical challenges at their core. Overall, despite the horrors of war, perils of a pandemic, etc., sometimes I pause and ponder how remarkably well the world works.
I’ve been lucky to have caught a flight back to SF after cancellations and can wait at home while figuring out how to get to my original destination.
What I don’t understand is how come SW couldn’t enlist help to get customers rebooked on other airlines - their phone lines were slammed (I waited 3 hours) just to get a refund since their app wouldn’t allow me to choose to rebook/cancel.
If I was as customer focused as they say they are - I would’ve contacted AMEX global travel and gotten their entire network of booking agents to backfill and rebook customers on other flights.
Southwest doesn’t have interline agreements with other airlines nor, AFAIK, the integration into reservation systems that allow rebooking onto another airline.
From what we know, this to me sounds like a story about technical debt.
"Sure, it's held together with rubber bands and is a mess, but it would cost hundreds of millions to fix, and it's working, isn't it? So the programmers complain a bit, that's their job."
Which works until conditions change in some way and it catastrophically does not.
I think a lot of our society is now run on unreliable, fragile software. I expect to see a lot more of this. "Automation" is especially cost-saving when you don't mind it being a fragile, unreliable time bomb.
I would expect any interview candidate to spot the issue within a minute.
For hub systems, ready crews are either at the hub, or at a spoke, ready to come back. That gives the hub a queue of ready crews, and each spoke can return a crew-plane combination to the hub when available. So with natural queues, there's no delay cascade: it's all a function of weather and crew/plane readiness.
For point-to-point systems, crew-plane combinations are scattered, and the next flight opportunity might not be the next flight need. There is no buffer anywhere. Furthermore, any greedy/opportunistic strategy at one point can block a superior global solution.
That's the point-to-point trade-off taken by SWA. In the common case of good weather, you avoid the extra miles from going via hubs. But in the rare case of global weather shutdowns, there is no good recovery.
The only real question is whether SWA had any obligation to communicate this to investors and passengers. So far, Apple stock has gone down more than Southwest's in this period, and passengers are remaining loyal, so no damage done.
So they are blaming SkySolver software. The article says it is off-the-shelf software? But in other news reports, they make it sound like it was developed in-house.
A lot of enterprise software is both; it is more akin to a "super framework", where you start with a base system that provides common functions but is then extensible.
Most ERP platforms are like this. Over the decades, it is not uncommon for little of the original platform to remain in use.
My org's ERP is like that: we use probably 10% of the commercial code, and 90% of functions are custom code written in-house.
Often, even more complicated: the ERP software vendor is one party; then there's the consulting VAR who does the customer's "customizations", and then the in-house team which does the "administration" and some "configuration". The VAR may or may not have a continuing relationship and of course the three groups have ample opportunity for finger-pointing; and the "administration" team is often under-resourced and doesn't really have the right skills to continue or maintain the work.
It's worse than a framework because most frameworks are designed for people to do some form of software engineering with. But with some "configurable" tool, you end up with something that you need to do software engineering but it doesn't have the tools, power or affordances to do so.
I've done some Jira consulting and worked in MDM and I'm of the firm belief that "buy vs. build" isn't the real dichotomy: it's "build" vs. "buy and build", and the effort of building on top of the platform is usually underestimated and undervalued.
We have a team of in-house devs that do all of our customization, but yeah, outsourcing is still all the rage with many other companies' VPs looking for that next bonus...
Even a SaaS poster-child like Salesforce is like that--between partners and in-house development of various sorts. I think it may be the largest tech conference these days and it's not because everyone buys a standard Salesforce subscription.
> The article says it is off-the-shelf software? But in other news reports, they make it sound like it was developed in-house.
Working in enterprise, there is a lot of “modifiable off-the-shelf” software; ideally, this provides basic functionality out of the box, but also the alignment to custom business needs of in-house software.
(Often, it seems like it combines the up-front and ongoing external licensing cost of COTS software, with the internal development and maintenance costs of in-house software, and the combined problems of both.)
If it's anything like an ERP, then that sounds about right. There's the out of box part that does the most generic things, and then heavy customization to integrate and customize it to fit the particular business.
I don't buy the narrative that inadequate technology is the main reason for the Southwest debacle. We must ask, why did this happen now and not before? Southwest has previously been able to better deal with disruptions like this. While the weather event did happen in the middle of their network, it wasn't unprecedented.
I think a more obvious reason is staffing issues brought on by covid, layoffs, and the vaccine mandates. They lost experienced employees who were able to wrangle the bad scheduling software. Throughout 2022, Southwest was having hiring issues because they were still mandating the vaccine through at least the summer for new employees. Their pilots association warned about this causing disruptions after a bunch of summer cancellations. Do people forget how flaky Southwest was during summer 2022? Southwest just recently reached staffing levels that matched their 2019 high. This "inadequate technology" narrative just seems like a convenient scapegoat.
> This "inadequate technology" narrative just seems like a convenient scapegoat.
My brother has worked in the white and blue collar unions (he prefers his ramp job). It's not like there's some impermeable cover of secrecy. These are just regular people who you can talk to. It's a combination of computer problems and regulatory controls (sleep blocks) leaving insufficient staff (and mechanical dangers) due to weather. The ramp teams were sitting at almost quad pay with no planes to service out of Minneapolis for a significant part of the weekend. This same situation has occurred, to some degree, every year.
Due to the inevitable Gulf Stream collapse, this will be a routine problem until SWA triages it.
This is probably a contributing factor but will get buried by the media. However, I doubt it was the sole or even primary cause. There is definitely a staffing component here in addition to their bad software.
> They lost experienced employees who were able to wrangle the bad scheduling software.
Employees could quit because they don't want to get vaccinated, but they also could have like just died from COVID too, or won the lottery, etc.
So to me this still points to the technology as something of a root cause. Your tech is as brittle as the number of people who know how to use it. Losing the people who make your sunk-cost old tech actually work, and not planning for the "bus factor", is still your fault for not addressing it.
As with any event, multiple factors were involved. I have no doubt tech and process could be at the center, but our self-imposed response to COVID undoubtedly had major impacts.
For a meltdown, the stock is back to where it was in October, tracking other airlines and the overall market, which keeps falling. I think people have become so accustomed to this sort of stuff that it does not affect business long term. After Covid, people are accustomed to major inconvenience when traveling.
You only get bitten in the ass by tech debt after it’s too late. I’m sure management justified not paying it down because, truthfully, the consequences are never really felt until it’s too late. It’s better to pay down tech debt incrementally, instead of grand projects promising full rewrites.
Aren't cases like this where the automated solvers are expected to shine?
If, on the other hand, it's not at all about software failures as many comments here suggest ("company management lost track of crews" notion), then does it have something to do with software at all?
Automated solvers that have a super computer backing them sure, but I doubt an airline has that kind of hardware at their disposal. Especially if their solver on their lower end hardware works most of the time for most disruptions.
Wouldn't this be one reasonable use case for a super computer? Perhaps not the fastest one on earth, but we use this stuff to predict the weather so I don't see why we couldn't use the same for air travel.
Southwest has the legacy SkySolver version which - believe it or not - runs locally on Windows OS. There is a cloud version that Southwest was either too cheap to buy or too resistant to migrate.
I thought this stuff also gets more likely to happen towards the end of the month, when pilots are close to the max number of hours they can fly per month (100h in the US per calendar month), making pilot shortage cascades even more likely.
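To make the end-of-month effect concrete, here is a minimal sketch. It takes the 100-hour monthly cap quoted above at face value (the exact regulatory window is not checked here), and the pilot names, hours, and function names are made up for illustration.

```python
def can_fly(hours_flown_this_month, leg_hours, cap_hours=100.0):
    """True if adding this leg keeps the pilot at or under the monthly cap.

    cap_hours defaults to the figure quoted in the comment above; treat it
    as an assumption, not a statement of the actual regulation.
    """
    return hours_flown_this_month + leg_hours <= cap_hours


def eligible_pilots(hours_by_pilot, leg_hours, cap_hours=100.0):
    """Names of pilots who could still legally take a leg of this length."""
    return sorted(
        name
        for name, hours in hours_by_pilot.items()
        if can_fly(hours, leg_hours, cap_hours)
    )


# Hypothetical roster: same three pilots, early vs. late in the month.
early_month = {"A": 20.0, "B": 15.5, "C": 31.0}
late_month = {"A": 98.5, "B": 72.0, "C": 96.0}

print(eligible_pilots(early_month, leg_hours=3.5))  # all three are available
print(eligible_pilots(late_month, leg_hours=3.5))   # pilot A drops out
```

The pool of legal replacements shrinks as the month goes on, so the same disruption has fewer ways out on the 27th than on the 3rd.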
Pilots are only allowed to fly 100h a month? That seems really low, considering a 40 hour work week over 4 weeks in a month is 160h. Although I guess I would rather have my pilot more energized than overworked.
They relied on hand written notes for baggage until 2017... 2017...
This is going to be a prime example CIOs can use of what happens when a company fails to invest in IT, treating it as a cost center whose budget can always be gutted.
Somewhere there are sysadmins and devs at Southwest just facepalming at the years and years they have begged for resources to improve things.
It isn’t just tech problems that drive this kind of stuff, SWA couldn’t get a deal worked out with AMFA to use bag scanners until 2016.
> Southwest said it will begin scanning bags on the tarmac this year to improve accuracy. Other large U.S. airlines already use this technology. Southwest's contract with bag handlers did not include a provision for using scanners until 2016. The airline then spent time evaluating technology and testing scanners at a few airports.
That's obviously not a blanket case of "AMFA bag handlers categorically refuse to use bag scanner tech" but rather "SW, unlike other airlines, didn't negotiate (early enough) the use of bag scanners in its contract with AMFA bag handlers."
SW decides what their priorities are when negotiating these contracts and deciding the terms. Other airlines obviously prioritized bag scanner tech and made the necessary tradeoffs to make it happen while SW didn't. Half the blame at the very least should be apportioned to SW.
The fact that a provision was required tells me there was already a provision, negotiated by the AMFA, requiring bag handling to be done a certain way to protect jobs by indirectly limiting innovation and automation.
I didn’t say it was categorically anyone’s fault, I just wanted to illustrate that airlines have unusual constraints on them so assigning blame to any one party is hard. The limitations are unusual compared to what we experience in less regulated, less unionized industries.
Since the airline industry, like most transport, exists only through subsidized infrastructure, it's prone to uncompetitive practice all around.
I realized this on my travel this holiday, just considering the entire security theater apparatus in the US. It's not just that the lines are longer and slower; they added differentiated tiers of service for the line, and the longer waits and ban on liquids encourage more purchases once through the gate. All the players involved in the racket will look at these add-ons and say, "sounds good for my bottom line," even though it acts as a tax on service quality.
And most of these things will likely remain because the technical considerations and the political realm exist in an equilibrium; giving the consumer more options can't be done within the logistical and safety requirements of large-scale air travel, thus it's up to the whims of the political system to coordinate and come to an agreement. A "sky highway" technology could automate away a lot of the coordination in flight, and smaller/more efficient aerial vehicles with VTOL and low noise could present a lower footprint for air travel, but these disruptions to the model are still a long ways away from reality.
The question is: did SW attempt to add it before, only to have it rejected by the union? Or is this just another example of SW not putting tech in the foreground, and instead treating it like a cost center that only needs to be maintained?
I don't really care if Southwest tried to negotiate for it, chose not to for some well thought out reason or just dropped the ball. The fact that they have to negotiate for it is broken. The "system" is not set up to allow for agility and innovation. Instead, it acts on protectionist incentives on all sides, none of which directly consider the consumer first.
The phone reliance in this case is for dispatchers to know where crews and airplanes are. In the article crews will call in for location and assignment (where to go to serve a flight) but are on hold for hours because there's not enough people to talk to and, I guess, update something that looks like this:
It isn't normal for an employer to not even know where flight crews are physically located without a phone call, and then to need additional phone calls to get them scheduled onto subsequent flights.
Then you tack on a hugely understaffed call center and problems with their phone systems and you have a hell-storm.
AIUI the problem is they're using the same lines for passenger and crew rescheduling. So they don't know where their pilots and crew are and can't communicate with them when the lines get hammered with customers trying to reschedule their flights that are canceled because they don't know where their pilots and crew are to staff them.
I have had more than a few flights cancelled in my travels, and never once have I rescheduled by phone. It was always on the mobile app. 90% of my travel is on American or Delta.
apparently one other factor is that long on-hold phone calls (by Southwest staff) count against FAA work time limits, and thus can prevent that person from serving on a flight - according to the former CCO of JetBlue (St. George) at https://twitter.com/martysg/status/1608161473083183106
I wonder if there's an accurate record (time series) of what got canceled when, why, in what order, along with an accurate starting state for the whole SWA system. Then, for Monday morning quarterbacking, modelers and scientists could step through the meltdown in slow motion, and see, at each step, what they think could have been done differently.
None of this actually explains what happened. Okay they have an off-the-shelf product called SkySolver that they use to manage their flights and it’s old (I guess the traveling salesman problem has really changed since 1930) and it couldn’t handle the sudden change in resource availability and flight constraints.
There doesn't exist an automated way for the crew to log in the system that the flight is canceled.
The crew of the cancelled flight calls in to a call center in a central office. There's a person there on the other end. The crew tells them that SW1234 or whatever was canceled. The person manually enters into SkySolver that the flight was canceled. SkySolver runs the algorithms or whatever and reschedules/reshuffles planes and crews to ensure that all passengers get to where they're going. SkySolver automatically sends out the corrections to all the crews and agents what the new crew/aircraft/flight assignments are.
This works great if the number of canceled flights is less than the capacity of the call center. But if the number of canceled flights exceeds the capacity of the call center by 1 aircraft, SkySolver does not know that the canceled flight did not arrive at its destination, and is unable to perform its next leg. It does not know that there's still an aircraft taking ramp space, so when/if the next flight arrives they have nowhere to park. So your 1 cancelled flight turns into 3 cancelled flights. But we're already over the capacity of the call center; so those 3 cancelled flights turns into 9 cancelled flights. 27 cancelled flights. 81 cancelled flights. 243 cancelled flights.
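The 1 → 3 → 9 → ... arithmetic above can be written out as a toy model. This is only a sketch of the commenter's reasoning, not anything known about how SkySolver behaves: the capacity figure, the fan-out of 3, and the function name are all assumptions.

```python
def cascade(cancellations, capacity, fanout=3, rounds=5):
    """Unlogged cancellations per round in a saturated call center.

    cancellations: cancelled flights that need to be phoned in (assumed)
    capacity:      how many the call center can log (assumed)
    fanout:        knock-on cancellations caused by each unlogged one
    """
    overflow = max(0, cancellations - capacity)
    history = [overflow]
    for _ in range(rounds):
        # The call center is already full, so none of the knock-on
        # cancellations get logged either; each strands `fanout` more.
        overflow *= fanout
        history.append(overflow)
    return history


print(cascade(cancellations=10, capacity=10))  # [0, 0, 0, 0, 0, 0]
print(cascade(cancellations=11, capacity=10))  # [1, 3, 9, 27, 81, 243]
```

At or under capacity nothing happens; one flight over capacity is enough to start the geometric growth.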
Yep, my guess is that the key point of failure tipping the system into a runaway cascade was increased latency in ingesting status reports on location/availability of crew. There probably exists an unknown "delay time threshold" in which to process and act on changing status updates. And once the average status update time exceeds that latency it just snowballs because the internal model the system is solving for is sufficiently out of sync with the real world that its "forward fixes" are causing new future failures approaching a 1:1 rate so the system can never catch up.
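A rough way to picture that "delay time threshold" is a queue of status reports. The rates and the half-hour tolerance below are invented numbers, and the function name is hypothetical; the only point is that the delay stays flat while reports arrive slower than they are processed and grows without bound once they arrive faster.

```python
def hours_until_out_of_sync(arrivals_per_hour, processed_per_hour,
                            max_delay_hours, horizon=72):
    """First hour at which the newest report waits longer than max_delay_hours.

    Returns None if the backlog never gets that bad within `horizon` hours.
    All rates are illustrative assumptions, not Southwest figures.
    """
    backlog = 0.0
    for hour in range(1, horizon + 1):
        # Reports that could not be entered this hour carry over.
        backlog = max(0.0, backlog + arrivals_per_hour - processed_per_hour)
        # Approximate wait for a report joining the back of the queue.
        delay = backlog / processed_per_hour
        if delay > max_delay_hours:
            return hour
    return None


print(hours_until_out_of_sync(90, 100, max_delay_hours=0.5))   # None
print(hours_until_out_of_sync(110, 100, max_delay_hours=0.5))  # 6
```

Nothing dramatic separates the two runs except which side of the processing rate the arrivals fall on, which is why the tipping point is so hard to see coming.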
Essentially the system never knows if a flight was successful or not; it just assumes it went well. In the case of a delay or a change, some employee has to input data manually into the system. For a few flights that's ok, but for an exceptional event like this the problem snowballs and now the system doesn't know where planes, crew and passengers are.
Quoting the top comment by 4Sammich here for those who refuse to touch Reddit:
> I have friends in CS and the hotel assignment side too. There were 2 specific problems, the software for scheduling is woefully antiquated by at least 20 years. No app/internet options, all manual entry and it has settings that you DO NOT CHANGE for fear of crashing it. Those settings create the automated flow as a crewmember is moving about their day, it doesn’t know you flew the leg DAL-MCO it just assumes it and moves your piece forward.
> In the event of a disruption you call scheduling and they manually adjust you. It does work, it just works for an airline 1/3 the size of SWA.
> So the storm came and it impacted ground ops so bad that many many crews were now “unaccounted” for and the system in place couldn’t keep up. Then it happened for several more days. By Xmas evening the CS department had essentially reached the inability to do anything but simple, one off assignments. And to make matters worse, the phone system was updated not too long ago and it was not working well.
> Last nite they did a web form and had planned to get the system up as much as possible with what communication they could muster, however it was too much to keep up on and ultimately the method for tracking crews failed again.
> This 100% is at the feet of all management who refused to invest in technology updates because it is the southwest way to be stuck in 1993. Heck, they still do 35 min turns on a -700 and 45 on an -800 frequently with only 2 man gates. But the good news is HDQ has a pickle ball court now.
> Edit: I just realized I never added the 2nd issue. Staffing. When the weather hit all those stations at once the ramp crews had to work in shifts to not become injured due to the cold. That slowed down the turns and backed up the planes. Many many ramp staff quit because of the management harassment (Denver) and just over it. So many rampers are new and making around 17/hr. Once they lost so much staff the crew scheduling software inputs couldn’t keep up because CS is also woefully understaffed and it became what we have today.
Except the pickle ball court was a conversion of existing recreational facilities (basketball courts?) long requested by employees.
The unnecessary, axe-to-grind editorializing undermines some of the credibility here, imo. There's no way this event hinged on a simple, binary "invest in tech" or "don't invest in tech," and that execs thumbed their nose at the latter.
So if I understand this correctly, almost the only thing that would have been needed would be a system of self-reporting from crews where they and their plane are, in case they are delayed or diverted?
The "manual entry" system is already there, so not much seems to be needed other than adding this seemingly simple front end. You don't muck around with old business critical software, you add to it whenever possible. And this seems to be exactly like that. The existing manual entry system feels like it should be able to accept self-reports from a mobile app or web site with minimal changes to the existing system.
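For what it's worth, the self-report idea could look something like the sketch below. Everything here is hypothetical (the record fields, the names, the in-memory table); nothing is known about what inputs the real system accepts, so read it as the shape of the idea rather than a design.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass(frozen=True)
class CrewStatusReport:
    """One self-report from a crew member (hypothetical schema)."""
    crew_id: str
    airport: str                 # where the crew actually is, e.g. "DAL"
    tail_number: Optional[str]   # aircraft they are with, if any
    reported_at: datetime


def apply_report(positions: Dict[str, CrewStatusReport],
                 report: CrewStatusReport) -> bool:
    """Record the report unless a newer one is already on file."""
    current = positions.get(report.crew_id)
    if current is not None and current.reported_at >= report.reported_at:
        return False  # stale or duplicate report, ignore it
    positions[report.crew_id] = report
    return True


positions: Dict[str, CrewStatusReport] = {}
apply_report(positions, CrewStatusReport(
    "crew-1", "DAL", "N123", datetime(2022, 12, 24, 10, 0)))
# A late-arriving, older report must not overwrite the newer position.
apply_report(positions, CrewStatusReport(
    "crew-1", "MCO", None, datetime(2022, 12, 24, 9, 0)))
print(positions["crew-1"].airport)  # DAL
```

The timestamp check matters more than it looks: during a meltdown reports arrive late and out of order, and a front end that blindly applies them would make the model worse, not better.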
None of these accounts review the human factors. Experienced administrators, pilots and crew were terminated over the past year, and the replacement staff is not as experienced. Not everything is an algorithm error.
As someone with over 20 years experience developing decision support models for domestic and international airlines, here's my take.
Southwest Airlines is a domestic carrier flying a point to point schedule, using the same fleet type to eliminate the need to train pilots on multiple aircraft types and also reduce training costs for transitional training from a narrow to a wide-body jet. All pilots can (theoretically) fly all planes in their fleet, a tremendous training and labor savings. When flight crews "bid" on their flights, having only one aircraft type also reduces the complexities of bid lines down to basic seniority. (Let's ignore over water to Hawaii, m'kay?)
So, getting SWA back in the air should be over simplified compared to every carrier who got their planes and crews back flying the same or next day - get it? SWA has a single fleet and all of their crew is qualified to operate those jets or crew the cabin - so "wat da problem is"??? US Gov, inquiring minds want to know!!!
At all airlines, crews and planes are scheduled, and optimized, using decision support models which take into account how many hours the crew (pilots and flight attendants) can fly by law and contract; how long each tail# can fly until it needs to be flown into a maintenance base for an A/B/C/D check for scheduled maintenance, and how to get the maximum airtime out of the asset per day - in perfect weather conditions.
There are also decision support systems that monitor pricing of competitors every second of the day and why your airfares change from one browser refresh to another, yield management models which run overnight taking input from industry load data from the past year in the market SWA flies to help predict the passenger "load" which sets fares and also permits manual inputs for special events such as a World Series, or Super Bowl etc which would spike demand and drive airfares higher.
And there are Air Operations systems which are similar to the named product which take into account weather and crew events and help an airline re-plan based on where crew is currently. These Ops systems should have interfaces built to the crew (pilot and flight attendant) systems to know where they are located as well as where the jets are. Those values, along with the number of hours the crew has worked, would be used to re-calculate the crew and fleet assignments with the associated fleet and crew scheduling decision support models.
These DSS ran on either big iron multiple CPU Unix servers or multiple CPU Linux servers - point being the computational power was outstanding and yes, CPLEX was typically a library utilized by our PhD's. There was a lot of money spent on the hardware and the people who developed these models - it wasn't cheap but then none of the clients ever experienced this type of problem with scheduling and yes, the clients are named in the weather impact article alongside SWA, but they are up and running either same or next day...
My world had a common database and data model where all data was integrated from various systems regardless of if it was our system or not because a decision support system without current and accurate data is like having corn cobs for toilet paper vs. Charmin... painful and not a lot of value.
This is the one time I'm looking forward to the gov looking under the hood of private industry and revealing where the technical and management issues are. You can't blame the technical teams as they're only paid what SWA pays to hire both FTEs and contractors, and why I've never had a phone call that lasted beyond finding out what the position paid. We all know, you get what you pay for, but again, my opinion, and I'm greedy!!!
Nice writeup. So basically this problem could have been mitigated by having as much of the data available locally in a database, and a UI to make changes to that data by crew?
A friend of a friend is a technical recruiter for SWA. They told me that SWA pays their software engineers below-market salaries. They complained that because of that, it's hard to recruit people, and they end up with mediocre developers.
Doesn’t it feel that, since COVID, we keep hearing more and more about these colossal, systemic failures more frequently? It makes me concerned because it seems most assume that at least someone knows what’s going on. Whereas the reality is that almost no one, if anyone, actually knows what’s going on, and that we have fragile software systems “running” everything. And COVID seems to have been a real shot to the brow for many of these systems.
I just worry about what hell we’re creating. Software basically captures miscommunication and poor understanding and executes it.
There's a lot of people who know exactly what's going on. We built fragile systems that optimize for efficiency, except edge cases. Because edge cases are expensive to handle and usually don't happen.
Except when edge cases happen, and then we have problems.
This is endemic to the whole world now, which has supply chains stretched across the globe in a just in time inventory system.
Some economist should figure out what percentage of our productivity gains weren't really productivity gains at all, but rather risk acceptance.
E.g. suppose we are currently producing 1000 units per year, and then adopt a new plan in which we produce 1010 units 95% of the time, 1000 units 4.9% of the time, and then 1% of the time we'll produce only 400. Just as people do price comparisons of unreliable electricity generation to reliable generation without taking the reliability into account, the various executives that signed off on this assumed (correctly) that the public would see a gain and give them the appropriate bonuses. Wall Street did a similar thing during the Great Financial Crisis.
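Running the numbers on that example: the percentages as written add up to 100.9%, so the sketch below assumes the rare bad year was meant to be 0.1% rather than 1%. That reading is a guess, and the variable and function names are made up.

```python
# (probability, units produced) for the new plan.
# The 0.001 tail probability is an assumption; see the note above.
new_plan = [(0.95, 1010), (0.049, 1000), (0.001, 400)]
baseline = 1000


def expected_units(scenarios):
    """Probability-weighted average output across scenarios."""
    total_probability = sum(p for p, _ in scenarios)
    assert abs(total_probability - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * units for p, units in scenarios)


expected = expected_units(new_plan)
worst = min(units for _, units in new_plan)

print(round(expected, 1))  # 1008.9
print(worst)               # 400
```

So the average really does go up by a little under 1%, which is what shows up as a "productivity gain", while the 60% shortfall in the bad year sits in the tail where nobody is looking.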
This type of analysis has already been done by Dean Baker for productivity gains due to trade deficits[1] - e.g. instead of building an iPhone, we just design the iPhone, and then call up China and ask "make us 100 million of these", and because the designers can capture so much value of the finished phone, it appears like an explosion of labor productivity. But it's all contingent on having good terms of trade and strong IP enforcement. If these were to decline, we'd capture much less value and suddenly the productivity of those designers would plummet, even though they are still doing the same work per unit of time, because productivity is measured in terms of value creation. So it could well be that on an apples to apples comparison, total productivity growth in the US from 2000 to 2020 has been flat or negative, and we've merely hidden the losses elsewhere, for example in unexpected outages, labor force problems, or periods of rapidly rising prices (revaluations).
> There's a lot of people who know exactly what's going on.
Who or where are these people? (For any industry / system.)
In my experience, people understand things at a local level but quickly lose understanding at a system level. And I think there is a subtle difference between figuring out what went wrong once something did go wrong and understanding things prior to such an occurrence. Humans are pretty good at figuring out things after the fact, but we have a pretty poor track record of systematic understanding and of preventing failures.
> Who or where are these people? (For any industry / system.)
If you hear someone on a huge incident conference call leading the capturing of new information, threading their way from one subsystem to another (most of which are outside the person's usual silo) based upon what they see, paging in teams as necessary to get the answers they need, giving a blow-by-blow account of the troubleshooting picture they're currently painting, with reasoning behind what they think might be happening, how to test that, what possible next steps might be, instantly discarding anything that doesn't move everyone closer to resolution, crafting ad hoc "good enough" tooling on the spot as needed to get enough information to make a decision but not enough to know details to the last degree, freely admitting if they're on a subsystem they know nothing about yet able to quickly grasp the essentials in minutes from domain experts to work with those domain experts to pull just the information they need...you've found one of those people. Extremely rare to find in one person, usually you see 2-3 people in such incidents splitting such a dynamic.
They don't know "exactly what's going on". But they sufficiently grasp how systems work to approximate that outcome close enough for most business purposes.
These people are SREs (source: am one) and there is a reason everyone is trying to hire a team of them.
It’s rare to find much teaching on complex systems in school, so the hiring pipeline is fraught with peril, and low-tier companies understand the buzz but not the job. In my experience, SRE is mostly people who wander in and realize they’re in the right place, rather than people who set out to be good at the required skills, like managing complex systems and large-scale incident response.
Yeah, I feel like we're living in a shadow disaster, where everything is screwed up but nobody's talking about it. Health care centers are still completely overwhelmed with 3-8 hour waits (in small cities, not even huge cities). Call centers can't take calls. Getting into supermarkets in large cities means walking around literal piles of human feces. It just goes on and on.
Airlines and airports have been suffering from some bad labor shortages because of the pandemic. A lot of staff were let go or furloughed during the pandemic and staffing hasn't recovered yet. That's why it's happening so often right now.
I was on a flight recently where apparently one of the flight attendants just didn't show up. It seems that they were eventually able to grab someone from the airline who was flight attendant-certified or whatever (he was just in street clothes) so we were able to go. At least the gate agent kept us informed.
Anyone else work at a large company with a massively complex system that basically controls everything, and a skeleton crew of devs that know how it all works?
Nope, we did but PE bought the outfit and replaced that crew with offshore contractors before selling it off to the next guy. Now nobody knows how it works.
Sorry, the rest of the microservice rewrite has to be delayed so my pet feature can be added. Maybe if you hurry you'll be able to pick between more of the rewrite and resolving crippling tech debt from the last time.
It's good to see that the billions of dollars in federal bailouts they received during the pandemic were put to good use and not handed to executives with salaries already in the millions of dollars or paid out as shareholder dividends... oh wait...
That article says “We recently had the opportunity to incorporate Employees’ requests to enhance campus. We converted a basketball court into three pickleball courts – a growing request by Employees over the past few years."
Reading a news article planted by the PR department of a big company is a skill. I can understand how this reads like "we're doing this for our people". This is just "the execs wanted to play pickleball so they had courts put in" dolled up to sound like big, happy, community-oriented charity.
Some sort of paddles-and-ball game that afaik is renowned for its (a) endless, obnoxious, loud “THWOK! … THWOK! … THWOK!” sound and (b) obnoxious, self-entitled players.
Seriously, I have never heard anything good about it from anyone who does not play, only controversy and upset from neighbours and other potential users of the court.
There really should be laws or regulations that fine airlines for this kind of behavior and compensate passengers. Reliability failures that have that much impact on people need more consequences than just seeking another airline next time.
EU and other countries like Indonesia have laws that do exactly this. Light touch regulation really. Airlines can plan as they wish, but if you leave passengers stranded (or massively delay their journey) you have to pay the customer directly.
It’s a bit of a question whether this would be fully covered by EU 261 (including compensation etc.), as it contains exemption clauses for events that are outside the airline's control. So you’d need to argue here that the failure was not due to the storm, but rather due to bad planning. Might fall either way.
AFAIK they still need to make sure you get food and a place to sleep, though.
EU261 is a good model that we should look to implement here: basically you are due up to 600 euros (depending on flight distance) when a flight is cancelled, and varying amounts for delays. It's absolutely insane that airlines can strand you for days and are not required to give accommodations overnight or any compensation.
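For anyone curious what the amounts look like, here is a rough sketch of the distance tiers in Article 7 of the regulation. The function name and parameters are made up for illustration, and it deliberately ignores the extraordinary-circumstances exemption, advance-notice rules, and the reductions that can apply when you are rerouted, so treat it as a simplification and not legal advice.

```python
def eu261_base_compensation_eur(distance_km, intra_eu):
    """Base EU261 compensation tier by flight distance (simplified sketch).

    Hypothetical helper: ignores exemptions, notice periods, and
    reductions for rerouting that the regulation also defines.
    """
    if distance_km <= 1500:
        return 250
    # Flights within the EU over 1500 km are capped at the middle tier,
    # as are all other flights between 1500 and 3500 km.
    if intra_eu or distance_km <= 3500:
        return 400
    return 600


print(eu261_base_compensation_eur(900, intra_eu=True))    # short hop
print(eu261_base_compensation_eur(6000, intra_eu=False))  # long haul
```

So 600 euros is the top tier for long-haul flights; shorter flights get less.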
In theory, yes. In practice, I got delayed for a day on one of the flights from Prague to Paris (pre-COVID times) due to the flight being late by hours and me missing the connection. They gave me the hotel, but refused any compensation beyond that. I tried to get them to pay via AirHelp, they took 3 years and said they can't do anything. The law is one thing, getting it is quite another.
FWIW Southwest is refunding fares and covering lodging, meals, and rebookings on other airlines. Not sure how much more we could ask for except the impossible “go back in time and don’t screw up.”
i've read that they will be reimbursing some people for their meals/lodging/rebooking, which can be a huge burden for people with bad/no credit or savings. i also believe that DOT is only able to provide enforcement of reimbursements or vouchers for southwest because southwest had made promises that they would in the past
i'd like to see all airlines required to provide the sort of compensation that southwest is providing here
Reimbursing for lodging and meals by submitting a receipt at a later date is not the same thing as taking stranded customers out of the airport to a hotel and giving them food vouchers. There are plenty of people who have been stuck sleeping in airports because they cannot afford to pay peak walk-up holiday rates for 4 or 5 days. Those people are sleeping at the airport. This is happening in Denver, Austin, Midway etc. The news media is chock full of these no-lodging horror stories.
Further, Southwest famously does not have interline or code sharing agreements with other carriers that would allow their agents to rebook passengers on a different airline. This is yet another detail of Southwest operations that is exacerbating the situation.
I’ll believe it when I see it. I am out of pocket $500 so far from chaining my own flights on other airlines to get to my destination after my SW flight on Christmas Eve was cancelled. They offered a rebooking only on their own routes on the 27th which, in hindsight, would have been cancelled again. I am finally headed to my final destination today, 4 days later, on another airline, as I didn’t expect SW to pay for any of the extra costs.
The market can more than solve this, if it's an issue. We don't need to impose yet another regulatory burden on an industry already nearly completely insulated from competitive forces that drive innovation.
One would hope that the market could work this out, and when people went looking for options, would keep this in mind.
Instead what people do is find the lowest fare possible, all other considerations be damned. It’s not like this is the first time this has happened to SW.
We do mandate safety in general. (Probably less so in some countries than others.) So you mostly can't choose to fly an airline that defers maintenance to cut costs.
Presumably the consequence of regulators making canceling and delaying flights expensive is that airlines pay more to reduce the likelihood of that happening and fares go up as a result.
Whether that's good or bad is a matter of perspective I guess. Regulators could implement a lot of rules that would make flying more pleasant. But prices would go up.
Trouble with the free market here is that starting a new airline is all but impossible. Only so many slots available at airports and they’re all taken.
Yes, you can choose a more expensive airline but let’s not pretend Delta, United etc don’t also have huge delays and failures from time to time. Up until now Southwest’s reputation really hasn’t been bad, they’re no Frontier. No matter, there should be avenues for compensation.
> One would hope that the market could work this out
Southwest did lose 10% of its market value overnight, so it seems the market is punishing them. Given how much executive compensation is in stock, a lot of senior level folks at Southwest are likely feeling some pain (or, at least, as much pain as any very rich person ever feels).
> Instead what people do is find the lowest fare possible, all other considerations be damned.
That's an over-simplification. Yes, passengers are very price sensitive. But whenever I talk to friends and family about travel, it's become increasingly clear that they factor predicted quality of service into their choice.
I don't fly United and try to avoid American. I'm certain that many many people will hesitate to fly on Southwest after this. People aren't stupid and no one wants to roll the dice on an entire vacation just to save twenty bucks.
I don't totally agree with this. Prices between airlines and routes can vary dramatically to the same destination, so it's not like saving $20 will usually be the difference between the cheapest option and a more preferred option. Also if you look at situations like Volkswagen, it seems to me that neither governments nor consumers hit them very hard after their massive scandal a few years ago. I see new VWs all over the place.
I love my Volkswagen. In theory I care about emissions test cheating but in practice it wouldn’t affect my decision to buy in the future. It’s the best designed and built car I’ve ever had and so Volkswagen group brands will always be on the top of my list of cars to look at.
A couple of VW execs went to prison and it cost the company 30 billion, that feels like getting hit hard enough by governments. What more do you want?
In practice when you need to fly from one city to another at a specific time there is often only one option available. So customers have no ability to pay more for higher reliability even if they want to.
Well, in my case I had a really bad experience with United in 2011, connecting at LAX. After that I just never fly United or fly to/from LAX. My sanity is worth more than a few hundred bucks.
There is, that is why they are trying to blame it on the weather. If it is weather, they do not have to compensate people; if it is a failure due to their own systems, well, there are all kinds of provisions that require compensation.
Or we could just make airlines always liable for any situation where they fail to get you to where they said they were going to get you by when they said they were going to get you. Like literally any service that is not delivered as promised. The problem will solve itself very rapidly.
No, if they kill someone, they will get fined for gross negligence on top of the inevitable tort case (file it under "airline malpractice"). I am not asking them to fly in terrible weather: I am asking them to have enough slack in the system that one outage does not destroy their entire system because places unconnected to the weather are affected when planes they expect to arrive never do.
The point is that it is irresponsible to be running so lean that their entire system is tightly coupled like this, and we need to make doing so completely unprofitable to airlines.
The way I read it, you kind of are, just without saying it out loud:
> we could just make airlines always liable for any situation where they fail to get you to where they said they were going to get you by when they said they were going to get you. Like literally any service that is not delivered as promised.
What incentives does that set up when terrible (or even questionable) weather exists at the departure or destination airports or widespread en route?
There are, but more importantly, as critical infrastructure you should have to demonstrate a BCDR (business continuity disaster recovery) plan exists and that you can perform it from a cold start and under extreme system stress.
I have to imagine that they are bleeding money from this self inflicted wound. How would fines have helped if the threat of total chaos and huge losses didn't do it?
This is a pretty huge incentive to get it right, and you can be sure every other airline will be looking into how they can prevent something similar from happening to them over the next few months.
Insane that the US has put all its eggs into one basket. One bad event and interregional passenger travel completely breaks down. Looks like it would be wise to invest in a more resilient mode of transportation.
This is one airline out of several. Also, what other mode of transportation doesn't use scheduling software? Certainly not rail or inter-state buses, they have very similar headaches.
And indeed having multiple competing companies with planes is going to be more resilient - as each flight is dependent on the airports at each end and not much more, whereas a road or rail method has to have all the interconnecting links working.
What would be a more resilient mode of transportation? Few people would take a train from Chicago to Miami, for example. That's something like 1200 miles (2000km). Traveling from LA to NYC is like traveling from Lisbon to Moscow (2500mi/4000km by air).
The US should definitely have more rail--and higher speed rail--but that is only practical in select areas with enough population density. Many of the people in those areas could drive to where rail would service.
On the other hand, a high speed train can do 200 mph (320 km/h) in actual operation. That would be a six hour train ride. A plane is faster, but once you add all the airport overhead, the train might actually be competitive.
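To put rough numbers on that comparison, here is a back-of-the-envelope sketch. The distance and train speed are the figures from this thread; the plane's average speed and both overhead figures are pure assumptions picked for illustration, and a real train would average less than its 200 mph top speed.

```python
# Rough door-to-door comparison for the Chicago-Miami example.
# Distance and train speed come from the thread; the plane speed and
# both overhead figures are illustrative assumptions, not measured data.

def door_to_door_hours(distance_miles, speed_mph, overhead_hours):
    """Travel time at a constant average speed plus fixed terminal overhead."""
    return distance_miles / speed_mph + overhead_hours

DISTANCE_MILES = 1200        # Chicago-Miami, per the thread
TRAIN_SPEED_MPH = 200        # high-speed rail, per the thread
PLANE_SPEED_MPH = 500        # assumed average speed gate to gate
TRAIN_OVERHEAD_HOURS = 0.5   # assumed: arrive shortly before departure
PLANE_OVERHEAD_HOURS = 3.0   # assumed: security, boarding, baggage, transfers

train = door_to_door_hours(DISTANCE_MILES, TRAIN_SPEED_MPH, TRAIN_OVERHEAD_HOURS)
plane = door_to_door_hours(DISTANCE_MILES, PLANE_SPEED_MPH, PLANE_OVERHEAD_HOURS)

print(f"train: {train:.1f} h, plane: {plane:.1f} h")
```

With these guesses the plane still wins by about an hour, but the gap is small enough that a different overhead assumption flips the result.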
AMS-PAR is a good example of when to take the train. Yet I wouldn't know how to begin to book a comparable train alternative in the USA for, say, LAX-SFO or CLE-PIT.
Nobody seems to figure out the core problem for airlines: they're selling tickets that they cannot fulfill under current conditions. If they know they don't have enough pilots, enough attendants, and so on, why in the world are they selling all the tickets they've "planned"? Is this a realistic plan or is it a SCAM, i.e., taking people's money for a service that they know they won't be able to fulfill in the future? These companies need to be investigated!
They're taking reservations months ahead of time. It's not as if they can magically know what staffing they'll have in 3-9 months. I'm affected by this, but I bought my ticket in September. It doesn't seem reasonable to require any business to know for sure that every employee needed to run a flight in December will be there in order to sell it in September.
Anyone can quit, die, or become ill at any time. If they had to contract every employee into a binding contract where they must fly a given route on a given day at a given time, we'd be howling about how they're abusing their employees, and in fact that's not far off from what everyone's been complaining about the railroad industry doing.
Predicting the future, even a few days at a time, is always best-effort.
When I was stuck in Florida during Hurricane Ian earlier this year my flight got cancelled 3 times. Having a ticket means nothing in terms of reserving a seat on a plane, it’s a scam for sure.