How Southwest Airlines melted down

nthitz · on Dec 28, 2022

burlesona · on Dec 28, 2022

It’s fascinating that the same hopscotch travel pattern that allows SWA to offer better service to more places is also what caused the network to suffer cascading failure. Once a critical mass of pieces (planes/crew) were out of position the whole network fell apart, and it’s large enough that it seems like neither the humans nor software can easily reason about how to resume operations. Hence the need for a “full system reboot” over many days.

Anecdotally, I flew Southwest just before Christmas. The network was already buckling and we had major delays, but we were lucky and made it through. Despite the stress, the SWA crews were helpful, empathetic, and polite. They handled it better than I would have if I had been in their shoes.

TheCondor · on Dec 28, 2022

Is resumption difficult or is it resumption and then make whole the tens of thousands of customers that were supposed to be moved a week prior?

No idea what SkySolver actually does in totality. I'm sure it's complicated, but I would think a flight crew could indicate where they are right now and then it could maybe pick up the next possible course they could perform. Not sure why the phone lines "jam up" exactly, don't you have a hierarchical management structure for this sort of thing? Or do 1000 pilots all report to one person?

They've got like 1000 planes and like 100-150 destinations, it's not the traveling salesman problem, an optimal plan isn't needed now so much as a functional one.

Of course, it's easy to bitch about it not being hard when I've never seen the code. Maybe it also tracks hours and does payroll and a dozen other functions.

TimTheTinker · on Dec 28, 2022

It's essentially an optimizing constraint solving problem. As far as I know, there are 2 main known approaches to this class of problem:

- use a tree traversal algorithm with a lot of built-in optimizations and pruning. Google OR-Tools is one well-known solver.

- use a set of meta-heuristics (Tabu search, simulated annealing, etc.) against arbitrary predicate expressions. The best example of this kind of solver is OptaPlanner.

The advantage of tree traversal is that it is exhaustive and guaranteed to be optimal; but it requires a fixed computing budget for a given set of constraints. Given a large compute cluster it's ideal, but on individual machines/servers these solutions tend to take hours or days. Southwest's system would likely require a significant supercomputer to run its scheduling through an exhaustive optimizing constraint solver, since it would have to re-run the entire solution (or significant numbers of subtrees) when any variable changes (which happens likely several times per second).

Meta-heuristics are more flexible and allow for all sorts of interesting, convenient features that can be very helpful when your compute budget is less than a TOP500 supercomputer. They offer time-constrained solving, real-time monitoring of a solution in progress (you can watch it get better as more of the solution space is searched), and over-constrained solving (where the requested solution is impossible, but we need to make a best-effort attempt).
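
To make the "time-constrained" and "watch it get better" properties concrete, here is a minimal sketch of simulated annealing on a toy crew-to-flight assignment. Everything in it is made up for illustration (the cost matrix, the `anneal` function, the linear cooling schedule); it is not how OptaPlanner or SkySolver work internally.

```python
import math
import random
import time


def total_cost(assignment, cost):
    """Sum of cost[crew][flight] for a one-crew-per-flight assignment."""
    return sum(cost[crew][flight] for flight, crew in enumerate(assignment))


def anneal(cost, budget_seconds=0.2, start_temp=10.0, seed=0):
    """Anytime search: return the best assignment found within the time budget.

    assignment[flight] = crew index. It is always a permutation, so every
    intermediate state is a usable (if suboptimal) schedule.
    """
    rng = random.Random(seed)
    n = len(cost)
    current = list(range(n))
    current_cost = total_cost(current, cost)
    best, best_cost = current[:], current_cost
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        # Temperature falls linearly from start_temp towards zero.
        remaining = (deadline - time.monotonic()) / budget_seconds
        temp = max(start_temp * remaining, 1e-9)
        i, j = rng.sample(range(n), 2)
        current[i], current[j] = current[j], current[i]  # swap two crews
        new_cost = total_cost(current, cost)
        delta = new_cost - current_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current_cost = new_cost
            if new_cost < best_cost:
                best, best_cost = current[:], new_cost
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
    return best, best_cost


if __name__ == "__main__":
    rng = random.Random(1)
    cost = [[rng.randint(1, 100) for _ in range(20)] for _ in range(20)]
    best, best_cost = anneal(cost, budget_seconds=0.1)
    print(best_cost, "vs. starting cost", total_cost(list(range(20)), cost))
```

The point is only the shape of the loop: you can stop it at any deadline and still hold a valid assignment, which an exhaustive tree search does not give you until it finishes.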

geethree · on Dec 29, 2022

Other paid constraint solvers are Gurobi and IBM’s CPLEX. I only have experience with the latter - scheduling giant chemistry and biology experiments into a robotic factory - there are a ton of foot guns in this space.

0xfaded · on Dec 29, 2022

IBM has ILOG as well which is likely a better match for scheduling problems.

Fun Fact: google-ortools was written by the same people that wrote ILOG. You used to have to "bring your own search" when using ortools, which I used to assume was to avoid conflict with IBM. I haven't used it in a while, but looks like there's a one-size-fits-all search algorithm now. I'd be curious to try it sometime.

ge0ffrey · on Jan 3, 2023

OptaPlanner (Java) or OptaPy (Python) doesn't need any other installations. Just add a dependency in Maven/Gradle (Java) or pip install it (Python).

masklinn · on Dec 28, 2022

> Is resumption difficult

The resumption itself.

> Not sure why the phone lines "jam up" exactly, don't you have a hierarchical management structure for this sort of thing? Or do 1000 pilots all report to one person?

It's not a problem of management (well it is a problem of 15 years of management fuckup which is the root cause), it's a problem of the scheduling system needing manual updates when things didn't go to plan.

At this point it's completely screwed, so it needs to be completely reset, as in the entire scheduling system needs to be reconfigured from empty, more or less.

And because SouthWest operates entirely on point to point, the same cascading properties which led to its complete collapse mean it needs to restart in a somewhat synchronised manner, otherwise you fly 3 planes, there's no followup, and you're hosed again.

> They've got like 1000 planes and like 100-150 destinations, it's not the traveling salesman problem, an optimal plan isn't needed now so much as a functional one.

The system completely lost track of crews, so all of them need to be relocated, their work cycles reworked, flight slots need to be reallocated, flights need to be re-encoded.

flandish · on Dec 28, 2022

Here is where we should take note of how things were resolved (or are in the process of) and note where “throw more money at it” (fly inefficient routes, farm out fares, etc) would have also helped.

Then next earnings release - when they post shareholders profits - that’s exactly how much they were willing to hold back to fix this, trading profit for people stuck for days.

smugma · on Dec 29, 2022

Good point. I think it will be smart for them to post a record loss for this quarter.

Accrue a few billion for refunds, depreciate your software to zero, and start to set aside a few billion for new cloud whatever.

Basically, take this as a huge loss and communicate that you’ve lost a lot of goodwill and trying to make passengers whole is #1 priority (after safety, of course).

philwelch · on Dec 29, 2022

Airlines are extremely low-margin businesses, discount airlines like SWA doubly so. “Posting a record loss” of this magnitude isn’t something that can be afforded. What you’re talking about is likely to turn into some form of bankruptcy. Maybe it’s a relatively minor Chapter 11 where all those refunds get written down by the bankruptcy court, or maybe SWA goes the way of TWA and Pan-Am.

scarface74 · on Dec 29, 2022

Airlines make a lot of their money from loyalty programs

https://www.forbes.com/sites/advisor/2020/07/15/how-airlines...

And loyalty programs are bigger business for the higher end airlines that cater to business travelers and the more affluent.

On top of that, the top 3 airlines make a lot of money off first class and business travelers who are using other people’s money.

philwelch · on Dec 29, 2022

You just listed a bunch of revenue sources SWA doesn’t have.

scarface74 · on Dec 29, 2022

That’s the point…

philwelch · on Dec 29, 2022

Gotcha.

chasd00 · on Dec 29, 2022

Replying to a sibling comment. There will never be a bailout for SWA. Too small, too many competitors and based in a red state with a blue administration in power.

cmmeur01 · on Dec 29, 2022

Or maybe their lobbyists get on the phones and secure them a nice bailout, since they’re so “essential” as we’ve just seen.

smugma · on Dec 29, 2022

They lost $300M in Q1 and made $800M in Q2. They can take another $300M loss (1B?) in the short term to mitigate reputational damage and legislation.

panarky · on Jan 1, 2023

In the last fiscal quarter pre-pandemic, they had operating income of $3 billion on revenue of $22 billion, or 13%. This is not a low-margin business.

If they have operating income of eight to ten billion a year, spending a third of a billion to upgrade systems seems like it would have been a reasonable investment instead of buying back shares and increasing the dividend.

lbotos · on Dec 28, 2022

I suspect crews timing out also play a big factor into this planning challenge:

https://flyingbynumbers.com/cabin-crew-duty-timeouts/

TheCondor · on Dec 29, 2022

Yes, but this has to be less of a problem when they've been canceling tons of flights, right?

toast0 · on Dec 29, 2022

Depends on the specifics of the duty time rules and how the flights were canceled. If the crew showed up and started getting the flight ready, but the flight was cancelled before takeoff, that may count as duty hours and could reduce their availability in the near term.

If the duty time rules only count time after takeoff, or only time after doors are closed, those both offer a different set of times.
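
A rough sketch of why the rule definition matters. The 14-hour limit, the field names, and both counting rules below are invented for illustration; they are not the actual FAA or contract rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

DUTY_LIMIT = timedelta(hours=14)  # hypothetical limit, not a real regulation


@dataclass
class DutyPeriod:
    report: datetime                  # crew showed up at the airport
    doors_closed: Optional[datetime]  # None if the flight was cancelled first
    released: datetime                # crew sent home or flight completed


def duty_from_report(p: DutyPeriod) -> timedelta:
    """Rule A: the clock starts when the crew reports for duty."""
    return p.released - p.report


def duty_from_doors(p: DutyPeriod) -> timedelta:
    """Rule B: the clock starts only once the doors are closed."""
    if p.doors_closed is None:
        return timedelta(0)
    return p.released - p.doors_closed


def remaining(periods, rule) -> timedelta:
    """Duty time left under the given rule, floored at zero."""
    used = sum((rule(p) for p in periods), timedelta(0))
    return max(DUTY_LIMIT - used, timedelta(0))


if __name__ == "__main__":
    # Crew reported at 06:00, flight cancelled at 10:00 before doors closed.
    cancelled = DutyPeriod(
        report=datetime(2022, 12, 26, 6, 0),
        doors_closed=None,
        released=datetime(2022, 12, 26, 10, 0),
    )
    print(remaining([cancelled], duty_from_report))  # 10:00:00
    print(remaining([cancelled], duty_from_doors))   # 14:00:00
```

Same crew, same cancelled flight, four hours of difference in availability depending on which rule applies.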

scarface74 · on Dec 29, 2022

I’m by no means an expert in this area. But from my experience writing field service software in another life, even on a city wide scale when you may have 20 people and around 200 stops it got tedious when the system went down and you had to manually schedule it.

I used to write field service software for ruggedized Windows CE devices.

DrBazza · on Dec 29, 2022

> Once a critical mass of pieces (planes/crew) were out of position the whole network fell apart, and it’s large enough that it seems like neither the humans nor software can easily reason about how to resume operations. Hence the need for a “full system reboot” over many days.

We had days like that in the UK but with the rail system. And when it happens it’s also due to snow. We’ve seen it on a global scale recently due to covid putting ships in the wrong places so that optimised shipping routes become a mess.

amluto · on Dec 28, 2022

I don’t see how a full system reboot should take days. If you don’t care about serving customers (which they don’t right now), then the problem could be simplified to getting every plane and every crew home to where they would spend the night under a normal schedule. With few or no paying customers on each plane, there should be plenty of capacity to move misplaced crew members around. None of this needs to approximate normal routing, and fewer segments than normal are needed, since passengers can be ignored.

That being said, the software sucks. Southwest may have lost track of where their employees are. The ground crews are quitting. I wouldn’t be utterly shocked if management doesn’t even have a good overview of where the planes are.

(Obviously anyone halfway competent could hack up a script to find all the planes based on ADS-B data in a few hours. And it wouldn’t be terribly hard to text a link to all crew asking them to fill out a simple form with their location, nearest airport, and when they can get there. But this requires competence and agility.)

jasonwatkinspdx · on Dec 28, 2022

It's far more complicated than that. For example, there's a lot of strict regulation around how many hours crew can work, how much rest they need between, etc. Likewise planes can't just randomly fly wherever they want, whenever they want. There's mandatory maintenance, etc. You also have a limited number of hangars and jetways at each airport, so you have to coordinate how planes move around to not overload things. Oh, and we need to be sure there's enough fuel, fresh oxygen bottles, and so many other things.

The problem is nothing like writing a script to scrape ADS-B data. This is the classic fallacy of a programmer thinking the most cartoonish imagination of the technology is the problem while being completely blind to the fundamental difficulty of organizing large groups of humans in some activity.

smitty1e · on Dec 29, 2022

Modeling and simulation has the concept of Emergent Behavior[1].

Once that complexity tipping point is reached (and other comments suppose that SWA competitors were more aggressive at cancellations, avoiding that point in their systems) the system takes on a life of its own.

What follows, for those running the system, is called a "Significant Emotional Event".

[1] https://en.wikipedia.org/wiki/Emergence

landemva · on Dec 28, 2022

The other carriers made it through this high volume period. Do they have better programmers?

inferiorhuman · on Dec 29, 2022

Yes. Globally the vast majority of airlines use either Amadeus, Apollo, or SABRE for everything from reservations to crew scheduling. United wrote Apollo, American wrote SABRE back in 1960. So yeah, other airlines have far better tech people and have for decades. While Southwest now pushes ticketing data to those three, as of a few years ago they were using other software for scheduling and even doing things by hand (like tracking bags) that other airlines had already automated. If anecdotes from crew are accurate the other problem is that Southwest did nothing to optimize their disaster recovery plan (if there even is one). Manual data entry can scale more than it is at Southwest, but they're still stuck in the teensy airline state of mind where even wildly inefficient workflows can scale.

As for ADS-B, that's hopelessly naive. ADS-B tracks planes, not people. The problem Southwest is having is trying to figure out who is legal to work and where they are. Tracking equipment is so trivial in comparison it's largely a secondary concern.

taway22323 · on Dec 28, 2022

The other carriers do also have a tipping point of no return, where their software can't handle the situation. But, they were more aggressive about canceling flights in careful waves before they got to that point.

Some of the observations here are correct, but there's a big missing piece. Southwest went too far in with an overoptimistic schedule. They won't say that because then there's no easy scapegoat, it's just a pure management thing.

angry_octet · on Dec 29, 2022

There is always a trade-off between robustness and efficiency. By leaving less slack the SWA schedule is more efficient, but also more tightly strung. When failures exceeded the slack capacity the whole schedule fell apart.
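
A toy model of that trade-off, for a single aircraft flying a chain of legs with no new disruptions along the way. The function name and all numbers are made up, and a real schedule is a network of planes and crews rather than one chain, so treat this as a sketch of the mechanism only.

```python
def propagate(initial_delay, slack_per_turn, legs):
    """Delay in minutes at the departure of each subsequent leg.

    Each turnaround absorbs up to slack_per_turn minutes of inherited delay.
    """
    delays = []
    delay = initial_delay
    for _ in range(legs):
        delay = max(0, delay - slack_per_turn)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    print(propagate(90, 10, 6))  # [80, 70, 60, 50, 40, 30]
    print(propagate(90, 30, 6))  # [60, 30, 0, 0, 0, 0]
```

With 10 minutes of slack per turn, a 90 minute hit is still hurting six legs later; with 30 minutes it is gone by the third leg. The tighter schedule flies more legs per day in good weather and pays for it here.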

The problem of recovery is also extensively studied. But it seems like SW did not put enough effort into having fault tolerant restart.

This whole event is a management decision.

taway22323 · on Dec 29, 2022

Right. I'm saying though, it's not just the initial state of less slack in their system due to hub and spoke versus point to point. It's also that other airlines were more realistic about backing off flights earlier and more aggressively as a reaction to the developing info on the storm.

angry_octet · on Dec 30, 2022

It paints a picture of an organisation which didn't see the need to accurately assess risk on a real time basis. If I was the COO I would want rolling projections for the week based on good/bad/ugly weather. But it seems SW was too busy cutting costs and building capacity as cheaply as possible to think about that.

jasonwatkinspdx · on Dec 28, 2022

No, as the article explains they use more of a hub and spoke dispatch pattern, which has fewer chained dependencies.

smaudet · on Dec 28, 2022

As others have said, you are ridiculously oversimplifying the process.

If you are running a little video game that plots flight charts, sure perhaps you could write such a script. You cannot, however, script a bunch of logistics support together across disparate, independent, multinational airports.

Add the obligatory paperwork (we don't really want to be shipping people around in coffins, enabling human trafficking or illegal substance smuggling, trust me), and add your standard bit of managerial incompetence managed from an Excel document and SharePoint, and it's actually downright impressive if it's only a couple days to completely reboot the system.

Arguably, management doing the standard "oh that will never happen" is probably why it's not even better - you would think the airports would be able to automatically "fix" themselves with a self healing protocol, but that was probably deemed too expensive and left out of the feature set.

amluto · on Dec 28, 2022

Of course I’m oversimplifying the process. But my point is the actual number of flights, passenger and luggage loads and unloads, etc can be a lot lower than a normal workload. On a normal day, an average Southwest plane flies quite a few hops, loads many passengers, unloads many passengers, etc. For a worst case reset, the normal routing doesn’t need to be followed.

Of course this will introduce complexity. But it will also need decently less than a normal day’s number of operations.

If I were designing a solver for where to move aircraft and crews (which is apparently what SkySolver does), I would test it on random and malicious inputs. If it’s initialized with every crew member and every plane at an independent random airport in the US (where Southwest has service) with a few broken for good measure and a few airports shut down, it should still find a solution. Waiting 12-24 hours to start so none of the crew is timed out could be part of a valid solution.
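
Roughly what I mean by testing on random inputs, as a sketch. It models planes only (no crews, duty limits, or distances), `toy_solver` is a greedy stand-in and not SkySolver, and the "one plane wanted at every open airport" target is a made-up requirement.

```python
import random


def random_scenario(rng, airports, n_planes, p_broken=0.05, n_closed=2):
    """Scatter planes at random airports, break a few, close a few airports."""
    closed = set(rng.sample(airports, n_closed))
    planes = [
        {"id": i, "at": rng.choice(airports), "broken": rng.random() < p_broken}
        for i in range(n_planes)
    ]
    # Hypothetical target: one plane wanted at every open airport.
    wanted = [a for a in airports if a not in closed]
    return planes, closed, wanted


def toy_solver(planes, closed, wanted):
    """Greedy placeholder: prefer a plane already in place, else any usable one.

    Planes that are broken or stuck at a closed airport count as unusable.
    Returns {airport: plane_id}, or None if there are too few usable planes.
    """
    usable = [p for p in planes if not p["broken"] and p["at"] not in closed]
    if len(usable) < len(wanted):
        return None
    plan, free = {}, {p["id"]: p for p in usable}
    for airport in wanted:  # first pass: planes that need no repositioning
        pid = next((i for i, p in free.items() if p["at"] == airport), None)
        if pid is not None:
            plan[airport] = pid
            del free[pid]
    for airport in wanted:  # second pass: ferry any remaining usable plane
        if airport not in plan:
            plan[airport] = free.popitem()[0]
    return plan


def check_plan(plan, planes, closed, wanted):
    """Invariants any valid plan must satisfy, whatever solver produced it."""
    by_id = {p["id"]: p for p in planes}
    assert set(plan) == set(wanted)
    assert len(set(plan.values())) == len(plan)  # no plane used twice
    for airport, pid in plan.items():
        assert airport not in closed
        assert not by_id[pid]["broken"]


def stress(runs=200, seed=0):
    """Run the solver on many random scenarios; return how many were solved."""
    rng = random.Random(seed)
    airports = [f"A{i}" for i in range(20)]
    solved = 0
    for _ in range(runs):
        planes, closed, wanted = random_scenario(rng, airports, n_planes=40)
        plan = toy_solver(planes, closed, wanted)
        if plan is not None:
            check_plan(plan, planes, closed, wanted)
            solved += 1
    return solved
```

The useful part is `check_plan`: it checks invariants independently of the solver, so the same harness could be pointed at a real solver and fed adversarial scenarios instead of random ones.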

bfung · on Dec 29, 2022

It’s still probably not that simple; the problem seems like it can be analogous to some sort of knapsack and/or traveling salesman problem, which are NP-hard! Even testing would take a long time.

amluto · on Dec 29, 2022

That’s true if you want an optimal solution. If you just want a decent solution (for various definitions of decent), the complexity may not be too bad.

nemothekid · on Dec 28, 2022

>then the problem could be simplified to getting every plane and every crew home to where they would spend the night under a normal schedule.

I'm confused, maybe my calibration of the scale involved here is wrong, but how do you not see this process taking days? Even if you had perfect information of where all the planes and crews were and where they should be, just coordinating with various airports to schedule the flights would take days; and that's assuming every plane is (1) flying empty, (2) fueled and (3) isn't carrying any passenger luggage.

irjustin · on Dec 28, 2022

I'm downvoting.

It's simple to say, "oh it would be this easy..." and not consider a single real life scenario.

> problem could be simplified to getting every plane and every crew home to where they would spend the night under a normal schedule

This alone would take days under your own recommendation.

Sakos · on Dec 29, 2022

"If I ignore all real-world constraints and all the regulatory burdens, it's TRIVIAL. Why can't they figure this out when I could solve it in an internet comment?"

He's ignorant and naive and disrespectful of the workers involved.

aliswe · on Dec 29, 2022

Weeks

mgsouth · on Dec 29, 2022

Well, let's see... A bit of internet searching shows SWA has about 60,000 total employees. Roughly 6,000 of those are pilots. So probably about 10,000 cabin crew (3 or 4 flight attendants, 2 pilots per flight). Most of those 15,000+ people need to be in exactly the expected place at the expected time, or things will get snarled up. If any one of those 5 or 6 people is a no-show then the plane can't legally fly passengers. There's limited fungibility; one of the pilots has to be a Captain. You don't want two Captains flying, if for no other reason than some other flight will then be short a Captain. One of the cabin crew has to be a Purser (supervisor). It's really preferable to have at least one of the pilots flying to the airports they usually frequent. Off-shift crew still need to be in the proper place at the proper time, or the plane has to stop mid-cycle. Then there's the ground and gate crews.

And you can't just say "go to your usual starting airport". Flights are 24/7. Days of the week matter. Holidays matter. Even your local burger joint doesn't have cast-in-stone schedules. Much less a huge interconnected network that's trying to reboot. Even if 80% of the crew could go to a "usual airport", how would you know if you were in the 20%? Oh... everybody has to phone in, or the system has to send a huge number of notifications out. System crash. (Remember, normally the system depends on scheduling in advance, and only a few last-minute changes need to be handled.)

And how do these what, 20k+ people get where they should be? If they aren't already at that airport, then normally by hopping on a SWA flight. Which aren't happening. So now what? Book flights on other airlines? Who's going to be doing the booking? How does it get paid? How many employees have a company credit card (any?) Use the employee's cards? How many are maxed out for Christmas?

So let's suppose you've scheduled everyone and everybody knows where they're supposed to be and when. What happens when 10% call back in (without crashing the system!) and say they can't get there on time? (Remember, many SWA customers are stuck and are having a hard time arranging alternate travel. Crew without access to SWA flights would have the same trouble.) You're going to either apply massive changes to the existing schedule, or start all over and reschedule everyone. Which is exactly the problem they're currently having. They can't handle massive corrections. What's worse, a crew member may think they're good, and then their travel gets delayed somehow.

How sensitive is the scheduling? If 10% of the crew are no-show or delayed then about 9% of the planes are affected; every 8 hours 90 planes will have the following 8 hours of flights cancelled or delayed (throwing further monkey wrenches in the schedule). About 40 of those planes will be missing a pilot, which means they can't even be deadheaded to where they're supposed to be next.

So how to reboot? They've had to cancel two-thirds of their flights, so apparently they're able to keep 1/3 of them going. Keep those flying so they can shuttle crew around. You're initially only scheduling 1/3 of the crew, so the poor overloaded system can handle it. 2/3 of the flights are just outright cancelled, days in advance, so the customer support load is reduced. ("I'm sorry, your flight has been cancelled. We can't reschedule you until x days from now." vs. much back-and-forth trying to find something with an overloaded system.) Slowly add additional crew and flights so the number of phone-ins is kept manageable.

amluto · on Dec 29, 2022

As you said, this is certainly a complicated situation.

But this is HN, and, from an optimization / constraint satisfaction problem perspective / scale perspective, it’s really not a very large problem. The “going home” problem is a standard programming contest problem. This is more constrained, but a globally optimal solution isn’t needed. Southwest literally has a program called SkySolver. It should work.

So you need to choose destinations for a few thousand planes? Pick the place you would normally have them at 4 AM two days hence. Need captains familiar with the airports? Surely the captains who would normally fly from a given airport at 4 AM two days hence are familiar with those airports. This gives a good starting solution for further optimization.

You need to get a message to ~20k people? Great, Twilio will do that with minimal effort.

With a budget of a couple million dollars, this is not hard. There are good SMT solvers available for free. CPLEX is expensive but not on this scale. Managing test cases on this scale is straightforward.

Even in an emergency, something good enough to get to the point where SkySolver starts working again should be doable.

People who manage electric grids have protocols for “black starts”. These protocols are messy and complicated, but they are developed in advance, and they work. Airlines should have the equivalent capability.

sigstoat · on Dec 29, 2022

> But this is HN, and, from an optimization / constraint satisfaction problem perspective / scale perspective, it’s really not a very large problem.

i can't even begin to imagine what the utility function _alone_ must look like.

the solver can't just produce a blank slate solution every time, it's got to optimize... something, subject to some constraints about not changing too much from the previous plan it came up with. and it's probably got to do this every time there's new real world input.

> The “going home” problem is a standard programming contest problem.

the hubris and handwaving on here is hitting epic levels.

mgsouth · on Dec 29, 2022

You touched upon the core issues. Yes, SkySolver should be able to get a solution quickly. Yes, there should be systems in place to co-ordinate individual actions of 20k people with rapid adjustments. Yes, they should have procedures in place to do "black starts". But they don't. And you can't design, test, and roll out even a new communications system to 20k users literally overnight. If the existing scheduling system doesn't already email crew, then do they have email addresses for all of them? If so, it's probably in HR somewhere; if that's outsourced then it could take days for SWA to get that pulled out even on an expedited basis. If they send email, what percentage of crew will receive it? In what timeframe? Email campaigns are notoriously porous. And how do crewmembers provide feedback? A standalone web page? Great--how do you authenticate? By interfacing the existing system? What will that take? Probably not, though--it appears the existing feedback mechanism is a manual phone call to a person. Do crewmembers even have login IDs?

Finally, even with perfect communication there's the problem of getting people where they need to be when the transport system is very flaky. I think that's probably the issue when the CEO said 'they would get things manually set up, and then something would happen and we'd have to start all over again.'

The pilots' union and others have been criticizing SWA for not investing in infrastructure. The previous CEO, responding to the previous melt-down, said something about 'you can't test for these kind of scenarios.' That's wrong. You can't do it easily or cheaply, but it is possible. They haven't done it, they haven't developed robust systems, and now they can't instantly resolve the problem.

quartesixte · on Dec 29, 2022

I mean, let’s assume that this problem is as simple as you say and everything else on the ground and with scheduling works perfectly (which it doesn’t, and everyone here is rightly calling you out on it. Ever work in an operations environment? Shit goes wrong fast).

You still have to contend with the laws of physics. AKA the speed of aircraft and distance between airports. Compound that across thousands of planes making potentially cross country trips and it will take days to do what you are suggesting, leading to the same problem SWA is already facing.

inferiorhuman · on Dec 29, 2022

  this is certainly a complicated situation.

uhh

  it’s really not a very large problem

Ok.

  So you need to choose destinations for a few thousand planes?

No. You need to figure out which planes are where, which of those planes are legal to fly out of which airports, which planes can be made legal to fly with low effort, and which planes need to go in for service (potentially requiring a non-revenue flight).

Ostensibly Southwest operates a fleet entirely of Boeing 737s with two subfleets (ETOPS and non-ETOPS). The ETOPS planes can be used anywhere but the others cannot be used to Hawaii. Some of the ETOPS routes require the range of the MAX and some do not. Some of the airports Southwest flies into can only handle the smallest planes. It's entirely possible that everything stuck in Oakland is too big to fly into Burbank for instance. Or it's entirely possible that all of the ETOPS fleet is stuck in Hawaii (so all of the routes to Hawaii are no-gos).
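
As a sketch of what "which planes are legal out of which airports" looks like as hard constraints: the tail numbers, the size classes, and the field names below are all invented, and a real check would also cover range, maintenance status, and much more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plane:
    tail: str
    at: str      # airport where the plane is currently sitting
    etops: bool
    size: int    # hypothetical size class: 1 = smallest


@dataclass(frozen=True)
class Route:
    origin: str
    dest: str
    needs_etops: bool
    max_size: int  # largest size class the destination can handle


def eligible(plane: Plane, route: Route) -> bool:
    """Hard constraints only: position, ETOPS certification, airport size limit."""
    return (
        plane.at == route.origin
        and (plane.etops or not route.needs_etops)
        and plane.size <= route.max_size
    )


def uncovered(routes, planes):
    """Routes that no plane, given where it currently sits, is able to fly."""
    return [r for r in routes if not any(eligible(p, r) for p in planes)]


if __name__ == "__main__":
    planes = [
        Plane("N1", at="OAK", etops=False, size=2),
        Plane("N2", at="HNL", etops=True, size=2),
    ]
    routes = [
        Route("OAK", "BUR", needs_etops=False, max_size=1),
        Route("OAK", "HNL", needs_etops=True, max_size=2),
        Route("HNL", "OAK", needs_etops=True, max_size=2),
    ]
    for r in uncovered(routes, planes):
        print(r.origin, "->", r.dest, "has no legal plane")
```

In this made-up state only HNL to OAK can go: the Oakland plane is too big for Burbank and not certified for Hawaii. No amount of optimization fixes that without first moving a plane.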

Then you've got to figure out what's broken on each plane. Depending on what's inoperative a plane may not be able to fly into a specific airport if the weather is anything but perfect. If enough stuff is broken the plane may be stuck away from a maintenance base and illegal to fly in revenue service. There goes a plane and a flight crew. These are laws, not optimizations.

And finally you need to figure out where all the diverted flights went. No idea what's normal but Southwest had to divert two flights today (one due to mechanical issues and one due to an unruly passenger). So there's at least one plane that's out of position and unable to fly until it's been repaired.

  Need captains familiar with the airports?

No. In fact I'm pretty sure that unlike RyanAir and EasyJet, Southwest doesn't fly into any airports that categorically require specific familiarity and training beyond the ETOPS certs required for flight and maintenance crew on the Hawaii routes. You do need to figure out who's legal to fly in revenue service and who might be legal to fly on a ferry permit though. Southwest was pretty late to the autopilot game but at other airlines some captains are qualified to land in worse weather than others. So even if you have a flight crew ready to go they may be unable to complete your desired route today – and that's been a problem. The Southwest scheduling software assumes that every pilot completes their assigned flights successfully. Whoops.

Then you've gotta do the same with the cabin crew. And then you've gotta make sure none of this runs afoul of the various CBAs. The baggage handlers? They've been working without a contract for nearly three years and were threatened with immediate termination by Southwest's VP of ops. Wanna guess how strictly the ramp rats will stick to the letter of the contract?

And you haven't even addressed the issue of out of position luggage. It's not just that luggage is piling up at airports, but Southwest's actually flown some of the luggage without the passengers. The hand waving I've seen suggests that Southwest still does a lot of the luggage tracking manually.

When that's all said and done you're still going to need to handle everything that goes tits up as service resumes. As I pointed out earlier, Southwest had at least two diversions today. So now you have two more out of position planes, crews, and more out of position luggage and passengers.

It's a massively complex, dynamic problem. If a couple million bucks would solve things, Southwest would've spent it already. List price on a single 737 is around $40 million. Even if Southwest got half off you're still talking tens of millions of dollars per plane. Put another way, Southwest has so far bought two airlines only to junk their entire fleet. A few million here and there is nothing, especially if it could've headed off this chaos.

  People who manage electric grids have protocols for “black starts”. These protocols are
  messy and complicated, but they are developed in advance, and they work. Airlines should
  have the equivalent capability.

Most airlines do but, based on the anecdotes from Southwest crew, Southwest does not. Again it's not really that Southwest flies point-to-point, and it's not just a software issue at this point. Procedures that worked when Southwest was a scrappy little airline simply don't scale and Southwest management is ill-equipped to handle this. It's all compounded by the employee contempt that Kelly's managed to foment.

itissid · on Dec 29, 2022

I think the guy you are replying to does not realize that all these are not data in some MySQL database to "use" to solve this quickly.

inferiorhuman · on Dec 29, 2022

To be fair it seems like Southwest management fell for the same trap.

benced · on Dec 29, 2022

Something worth pointing out here is that according to levels.fyi, a tech lead there who has worked for 20 years has a TC of 174K. It’s unlikely their technical talent would be up to what you propose which is related to the apparent decision at Southwest to underinvest - by their own reckoning - in technology.

scarface74 · on Dec 29, 2022

This is the HN tech bubble again. That TC is around average for most senior software engineers at most corporations in most US cities.

And getting into a tech company and having high compensation doesn’t require “top talent”. Just the ability to memorize DS&A (junior to mid) and memorize system design and being able to regurgitate answers to behavioral questions showing “scope” and “impact”.

It takes knowing two or three well known algorithms to implement the scheduling problem they are trying to solve and system design chops to know how to scale it.

But, as far as having to be on hold for hours, that’s a staffing issue.

Most of their issues are symptoms of poor management - not software engineers.

benced · on Dec 29, 2022

To be clear, I don’t think you can blame these software engineers. Management has decided the average value of a tech lead to them is about 175K and they are getting that quality of engineer. I think their current tech issues suggest that the value management is placing on engineers is lower than it should be.

Like I said earlier in the thread - there are many bad engineers that are overpaid and many good engineers that are underpaid. On average though, compensation and quality are correlated. In the absence of additional information, compensation is the best metric we have to assess the quality of the underlying engineering talent. It also passes the sniff test - try using the Southwest Airlines app!

scarface74 · on Dec 30, 2022

Again, you are just as deep in your bubble as everyone else. There are 2.7 million developers in the US. The vast majority of them are paid between $80K-$170k. They are working in banks, government, smaller startups, etc.

And it doesn’t take a $400K senior developer that can reverse a binary tree on the whiteboard while juggling bowling balls while riding a unicycle on a tightrope to make a mobile app.

How much do you think Delta developers make in Atlanta? (Hint: they aren’t making over 200K.) Their app is excellent.

It never ceases to amaze me how little most people on HN know about compensation throughout the industry.

bumby · on Dec 29, 2022

1) These types of comments always make me cringe because it's built on an assumption that everyone is solely trying to optimize for income. I know plenty of people who would take a pay cut to continuously work on what they consider more interesting problems than, say, a social media app or basic CRUD app.

2) COLA matters. $174k in, say, Southwest HQ in Dallas is roughly equivalent to $422k in SV.

benced · on Dec 29, 2022

It’s ridiculous to suggest that compensation isn’t correlated with quality. There are many exceptions but it’s a good rule of thumb in the absence of additional information. It also just passes the sniff test - try comparing the Southwest Airlines app to the United Airlines app.

I don’t think COLA matters all that much. The reason SV pays such high salaries reflects the underlying higher productivity of those engineers. The fact that much of that income goes to landlords in the Bay Area is orthogonal to the underlying quality of the engineers.

bumby · on Dec 30, 2022

Nobody is claiming they aren't correlated. The question is: how strong is the correlation? The OP seems to insinuate they are extremely correlated on account that it is the sole consideration in the comment.

Like so many strong claims made online, it's an overly simplified mental model for something that is much more complex in reality. Like you said, we need more information before making strong claims.

MR4D · on Dec 28, 2022

I wonder if it’s that or simply a lack of slack in their system.

It seems to me that just like pre-staged inventory helps in logistics management, that extra planes and crews in the rotation could improve operations under these circumstances.

EMM_386 · on Dec 29, 2022

With a normal airline, you have pilots sitting on "reserve" at bases, who can be called in at any time to fill in any gaps that may occur. They are being paid but are not flying, it's quite a good gig if you can get on the reserve list.

I don't know how this is handled at Southwest, who does not fly hub-and-spoke and thus doesn't have a bunch of pilots sitting reserve around a base at, say, Atlanta.

mjevans · on Dec 29, 2022

In the past I had a job where some contract required a trained body to be on site 24/7. The company hired EXACTLY enough workers to fill the position, with _zero_ slack for anything.

That lack of slack is hell. It makes any disruption, even minor ones, require the other workers to work more time. Major disruptions mean soul-crushing crunch level hours to just get by.

Slack _must_ be planned into a system, otherwise there won't be any safety / recovery margin, and you're seeing the results live with Southwest's implosion.

Tao3300 · on Dec 28, 2022

They probably had that kind of wiggle room once upon a time. Then some sort of pandemic turned everything to just enough planes and skeleton crews.

MR4D · on Jan 4, 2023

It'll be interesting to see. I'm sure Congress will go through it pretty strongly (opinions, that is), assuming they ever have a successful vote for a Speaker.

cratermoon · on Dec 28, 2022

> Hence the need for a “full system reboot” over many days.

My understanding is that the full system reboot wouldn't have taken all that long, it's just that the company was trying to do a major fix while keeping whatever was still sort-of-working running. As any sysop will tell you, patching a running system is all kinds of crazy risky.

bogomipz · on Dec 29, 2022

>"Anecdotally, I flew Southwest just before Christmas. The network was already buckling and we had major delays, but we were lucky and made it through."

Interesting. You don't say how far before Christmas you were traveling. Had this crazy weather system already started moving from West to East at that point? Or was the system buckling just from passenger volume at that point, i.e. similar to the summer meltdown that Southwest had?

firstSpeaker · on Dec 28, 2022

I imagine the system cannot account for where all the planes, staff, passengers are and where they want to go economically.

bumby · on Dec 28, 2022

According to the article, the system makes some non-sensical tasking:

"In one example during the storm, the system assigned a pilot to deadhead on a flight from Baltimore to Manchester, N.H., and then back to Baltimore the next day, without ever flying a plane"*

* The article defines deadheading as sending a pilot as a passenger to get to another location.

It would be interesting to look at what the system is trying to optimize for to make such choices.

twobitshifter · on Dec 28, 2022

If the solver is updating based on current events, it’s completely reasonable that something like that could happen. Manchester was projected to be short a pilot, but the flight made it on time, and Baltimore unexpectedly ended up short.

prewett · on Dec 29, 2022

This sounds like what you might expect from using a random process (genetic algorithms, simulated annealing) to solve the NP-complete problem. The randomness injects suboptimal routes like this, as well as more optimal ones, but the fitness/cost function has to distill a whole bunch of things into one scalar. In my experience what happens is that the different things you want to value sort of compete with each other. I'm guessing the current state of the system is quite suboptimal, and it might not be able to remove the randomness.

It's quite feasible to get within a factor of 2 of the optimal solution (with much less processing power), which sounds great from a CS algorithm analysis standpoint, but a factor of 2 looks like an awful schedule from a human standpoint.
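To make the "one scalar" point concrete, here is a minimal sketch, assuming a made-up cost function with made-up weights (nothing here is known about how SkySolver or any real crew solver scores schedules). `WEIGHTS`, `Schedule`, `cost`, and `accept` are all hypothetical names. It shows how a schedule full of deadhead legs can still score "better" than one with a single cancellation, and how an annealing-style acceptance rule deliberately lets some worse moves through.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical weights: every name and number here is made up for illustration.
WEIGHTS = {"cancelled_flights": 100.0, "deadhead_legs": 5.0, "overtime_hours": 2.0}


@dataclass
class Schedule:
    cancelled_flights: int
    deadhead_legs: int
    overtime_hours: float


def cost(s: Schedule) -> float:
    """Collapse several competing concerns into the single scalar the search minimizes."""
    return (WEIGHTS["cancelled_flights"] * s.cancelled_flights
            + WEIGHTS["deadhead_legs"] * s.deadhead_legs
            + WEIGHTS["overtime_hours"] * s.overtime_hours)


def accept(delta: float, temperature: float, rng: random.Random) -> bool:
    """Annealing-style rule: always accept improvements, sometimes accept worse moves."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


one_cancellation = Schedule(cancelled_flights=1, deadhead_legs=0, overtime_hours=0.0)
many_deadheads = Schedule(cancelled_flights=0, deadhead_legs=19, overtime_hours=0.0)
print(cost(one_cancellation), cost(many_deadheads))  # 100.0 95.0
```

With these weights, 19 deadhead legs beat one cancellation on paper, which is the kind of result that looks fine to the scalar and awful to a human.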

bell-cot · on Dec 29, 2022

Compare the cost of keeping one surplus pilot forever deadheading around the system to the cost & disruption of canceling (say) one flight a month because "a pilot got sick, it was not legal to fly short-handed".

Having a (relative) bunch of surplus critical-task workers, forever being shuffled to where the system guesses they're most likely to suddenly be needed, makes perfect sense.

(Yes, you'd have to be a bit more sophisticated, so your surplus pilots got enough flying time to stay certified. And didn't get pissed and quit. Assume that I know why RAID 5 is better than RAID 4:)

brazzy · on Dec 28, 2022

Sounds to me more like a bug in an edge case rather than trying to optimize for the wrong thing.

Or (more likely, I think) the first deadhead was planned to have the pilot take over a flight which was then cancelled after the pilot was already underway.

smaudet · on Dec 28, 2022

Also sounds like a case of not throwing enough adversarial data at the system - you can't just code coverage your code, you can't even establish KPIs, you have to establish its performance under system failure (does it freeze or gracefully shut down, does it persist to disk, what happens when the disk is yanked out of the system), etc.

Very few software shops I am aware of do anything like this.

taway22323 · on Dec 29, 2022

"non-sensical tasking"

That could also be the result of thrashing after they were really far into the mess. That is, the solver maybe did output something sensible, but that solution has to get into the system that's used at runtime. Someone may have run another solution, or did manual updates while it was too overwhelmed to get all the changes posted. The solvers also depend on other context that's changing underneath them.

icambron · on Dec 28, 2022

I've told this story a few times, but maybe 10 years ago I had a cross-country JetBlue flight that was delayed perhaps 6 hours. It was a few days after a major storm. Like Southwest here, JetBlue didn't have much flex capacity and relied on the daisy chain to keep on chaining. Our plane had gotten stuck somewhere, so they had to find a different one at some far-away airport and fly it in, which took hours. But the kicker was that when the plane finally landed, the crew already onboard couldn't man the flight because that would exceed their duty limits. The airline didn't realize this ahead of time, so they had to gather a new crew (like literally call them in), which added a couple of hours to the delay.

Naively, I'd assumed these kinds of things were handled in some sort of mission-control center with warnings from rule engines blinking on some big screen and a team of crack operators mapping out what needed done. But clearly that wasn't so: they were just making things up as they went along. Sounds like Southwest is in a similar spot, but this time on a much bigger scale.

seandoe · on Dec 28, 2022

> clearly that wasn't so: they were just making things up as they went along.

Where did you get your information? I have experience in the industry and scheduling logistics is clearly not how you describe it. The issue is that to optimize for profit you sacrifice the ability to maintain service through catastrophic events and can end up in a bit of a dominoes situation.

icambron · on Dec 28, 2022

Which information? That they belatedly realized they'd run out of duty hours and had to call in new crew, after the plane with the soon-to-expire crew had landed? They told us that while we waited at the gate, including updates about the expected time that the newly-called-in crew would arrive. They were quite transparent about everything.

Or are you asking why I think they're making it up as they go along? That is my conclusion, to be sure. It's one thing for dominoes to fall, but if you are in a situation where the dominoes are falling and you are not able to predict which dominoes will fall next and respond accordingly, you are making things up as you go along. I'd have expected that almost any decision they could even hypothetically make would run through a system that checked it for violations of constraints, which would have told them way ahead of time that they needed a fresh crew, and they'd have had the entire flight of the replacement plane to get one in (IIRC it flew Miami->Boston just to get us the plane).
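The kind of ahead-of-time constraint check described above can be sketched in a few lines. This is a minimal illustration, assuming a single flat duty limit; `MAX_DUTY`, `crew_can_fly`, and all the times are hypothetical, and real duty rules are considerably more involved.

```python
from datetime import datetime, timedelta

# Hypothetical flat limit for illustration; not an actual regulatory value.
MAX_DUTY = timedelta(hours=14)


def crew_can_fly(duty_start: datetime, next_departure: datetime,
                 next_block_time: timedelta, max_duty: timedelta = MAX_DUTY) -> bool:
    """True if the crew would finish the next flight inside its duty window."""
    return next_departure + next_block_time <= duty_start + max_duty


duty_start = datetime(2020, 1, 1, 6, 0)        # crew went on duty at 06:00
inbound_eta = datetime(2020, 1, 1, 16, 0)      # replacement plane expected at 16:00
turnaround = timedelta(minutes=45)

# Checked while the inbound flight is still in the air, not after it lands.
long_leg_ok = crew_can_fly(duty_start, inbound_eta + turnaround, timedelta(hours=5))
short_leg_ok = crew_can_fly(duty_start, inbound_eta + turnaround, timedelta(hours=3))
print(long_leg_ok, short_leg_ok)  # False True
```

The point is only that the inputs (duty start, inbound ETA, planned block time) are all known hours before the plane lands, so the "need a fresh crew" answer is available that early too.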

seandoe · on Dec 29, 2022

Regarding your first paragraph, rarely are the gate agents telling the public the entire story. The truth is typically more complex and basically not of concern to the average passenger.

As for your conclusion, it's just not accurate. Systems _are_ used for scheduling and allocation of equipment and crew resources. You just have to realize that getting you, as an individual, to your destination on time on any given day isn't the number one priority of the company. There are a lot more concerns they are considering. If the airline industry is really as dysfunctional as you think, with such shortcomings in areas as important as equipment and crew scheduling and allocation, you should drop whatever you're doing and start an airline. You'll soon be filthy rich and do the world a great service.

icambron · on Dec 30, 2022

That’s just hand-waving. Sure, they could have been lying to us and had some hyper-competent, complex reason that delayed us and then made up some malarkey about duty limits. But you don’t know either, and the explanation they presented seems vastly more likely.

Also your generalizations fly directly in the face of the information emerging from this Southwest fiasco, which indicates they don’t have systems in place to even track where their crews are (just where they are scheduled to be), so the shortcomings seem to be quite real.

boomboomsubban · on Dec 29, 2022

Couldn't the emergency flight have run into some problems, making it take longer and that led to the crew being out of duty hours? Like the flight was supposed to take 3-4 hours but wind and some airport issue caused it to take 5? Something that may not have been obvious until very late in the flight too.

noobermin · on Dec 29, 2022

>The issue is that to optimize for profit you sacrifice the ability to maintain service through catastrophic events and can end up in a bit of a dominoes situation.

Sounds like an argument for nationalization.

andruby · on Dec 29, 2022

I don't follow. Nationalization might add more slack, and more resilience, but it'll be so much more costly to operate.

Airlines run on very thin margins, so need to optimize heavily.

We just need to accept that we can either get "affordable" tickets with occasional meltdowns or we go back to 30 years ago and pay double price for tickets but have more resilience.

The market as a whole has already resoundingly chosen the former.

Safely flying humans in metal structures is just really expensive and the current ticket prices pretty much reflect the real cost (and even ignore the climate externalities)

noobermin · on Dec 30, 2022

> I don't follow. Nationalization might add more slack, and more resilience, but it'll be so much more costly to operate.

Who cares? The point of nationalization is costs don't matter, we just see it as a cost we bear because transportation is important for civilization. The primary mode (for better or worse) in the US is public (roads) and that "cost" is not covered by anyone.

chasd00 · on Dec 29, 2022

> Sounds like an argument for nationalization.

No, it sounds like an argument to never fly SWA again.

bombcar · on Dec 28, 2022

I’ve seen it happen and it’s always strange that they seem to not realize the crew will be unusable until they arrive … I assume they had been trying to get another crew at the same time.

ipqk · on Dec 28, 2022

There are several time limits, but I believe that one of them stops when pulling back from the gate, i.e. once the plane is moving, they're granted X more hours, but if they haven't left yet, then they've timed out. So the arriving crew may plausibly be usable upon arrival, but a few minutes late may be all it takes to time them out.

I've been on a delayed plane where one of the pilots timed out while sitting at the gate, so we had another second delay finding another pilot.

mrandish · on Dec 28, 2022

There's also the fun wrinkle that not all crews or crew members are current on ratings (ie training courses) to operate all types/versions of planes the airline has in service.

You also need sufficient open gates of the right type in the right place during the right time window or you idle a plane (which can then idle a crew). It's pretty easy to imagine how such a system which may work at a certain level of load can be tipped into a runaway scenario.

As a systems person I'd be an interested reader if Southwest were to someday release an incident analysis and post-mortem of the type that good IT organizations do.

bombcar · on Dec 28, 2022

The first is why Southwest (and others) only fly one type of plane.

nradov · on Dec 28, 2022

Sometimes the dispatchers expect that the crew will still be legal, but then the incoming flight gets delayed just enough to push them over the limit.

walrus01 · on Dec 28, 2022

Once you realize how many people in critical industries (power grid, telecom, global cargo/logistics) are in fact making things up as they go along, and there aren't really any highly organized and responsible people running the show, you start to worry.

isiahl · on Dec 28, 2022

Reminds me of this Onion headline: "Smart, Qualified People Behind The Scenes Keeping America Safe: ‘We Don't Exist’"

https://www.theonion.com/smart-qualified-people-behind-the-s...

Beltalowda · on Dec 28, 2022

What is the alternative? Extensive playbooks for every possible scenario that may or may not make sense in the actual scenario because a few critical details are different? A "mission-control center" is really no different: it's just people making stuff up on the spot.

In the end, there is no substitute for human judgement in the situation itself.

clintonb · on Dec 29, 2022

> What is the alternative? Extensive playbooks for every possible scenario that may or may not make sense in the actual scenario because a few critical details are different?

Yes? I suspect they have playbooks and/or run gameday exercises for other aspects of disaster preparedness and business continuity. Why not do the same for these sorts of issues that affect the travel network?

The network of airports, planes, and crews is a graph (or multiple graphs). What happens when a node disappears?

That may be a hard question to answer, but it seems cheaper to answer, and plan around, it before the situation occurs rather than after.
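As a toy version of the "what happens when a node disappears?" drill, here is a minimal sketch, assuming a tiny made-up route map (the airport codes are real, the routes between them are invented for illustration). `routes` and `reachable` are hypothetical names.

```python
from collections import deque

# Made-up directed route map: airport -> airports it has a departing flight to.
routes = {
    "BWI": ["MHT", "DEN"],
    "MHT": ["BWI"],
    "DEN": ["LAS", "BWI"],
    "LAS": ["OAK"],
    "OAK": ["DEN"],
}


def reachable(routes, start, closed=frozenset()):
    """Breadth-first search: airports a plane starting at `start` can still get to."""
    if start in closed:
        return set()
    seen = {start}
    queue = deque([start])
    while queue:
        here = queue.popleft()
        for nxt in routes.get(here, []):
            if nxt not in closed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


before = reachable(routes, "BWI")
after = reachable(routes, "BWI", closed={"DEN"})
cut_off = before - after - {"DEN"}
print(sorted(cut_off))  # ['LAS', 'OAK']
```

Running this for every node ahead of time is cheap on a network of 100-150 airports; the hard part is deciding what to do with the answer, which is what a playbook would be for.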

ghaff · on Dec 28, 2022

Or you realize a lot of things involving real world physical systems are hard and customers aren't fine with paying "whatever."

nostromo · on Dec 28, 2022

The actual answer is buried at the end of a long article.

> Unlike many rival airlines, Southwest’s planes generally hop from one city to another, rather than orbiting a major hub. That approach lets Southwest maximize use of its planes and crew, but the daisy chain structure also makes its network more delicate—problems in one corner of the country can be difficult to contain

phpisthebest · on Dec 28, 2022

That is only part of the issue. Not every airline is a 100% hub system; most are a pretty big mix.

Further, a lot of major hubs were impacted by the storm, yet those airlines were able to transition. Why?

Well, most airlines have mobile apps, and web portals and other technology so their crews can be reassigned in almost real time (just like I would get auto-booked on a new flight via the mobile app before I even knew my flight was canceled).

Instead Southwest has systems from the 80's that require crews and customers to call and talk to an actual live human...

masklinn · on Dec 28, 2022

> Instead Southwest has systems from the 80's that require crews and customers to call and talk to an actual live human...

From what I understand (from /r/flying) it's the reporting / fixing of issues which is manual. This means when the reporting / fixing is overloaded and not given a respite it can't catch up, keeps getting more overloaded, and ultimately the scheduling system completely fell over (it lost track of crews entirely is what I understand).

"Mobile apps, and web portals and other technology" is orthogonal to the issue at hand.

mlinsey · on Dec 28, 2022

Mobile apps or a web portal absolutely could have helped, simply by massively increasing the throughput of crew members being able to report that they didn't make it to their assigned location. Which could have stopped the cascading effect much sooner.

The manual "reporting/fixing" of issues, which was done over the phone, includes pilots and crew members reporting "hey, I didn't actually get on the flight I was scheduled to". If the system did not receive that message, it was built to assume that the crew members made it to their next location and would be available for their next scheduled flight. But once a certain threshold of flights got canceled because of the storm, the limiting factor became how quickly staff members could even report that they didn't make it to their next location - they'd stay on hold for hours, until their next flight in their next city was scheduled to leave, and the system would still assume they were on that flight because it hadn't yet heard otherwise.

Lack of webapps certainly wasn't the only issue here - as the article notes, their whole point-to-point scheduling model is more vulnerable to cascading failure as well, vs. a hub-and-spoke model. But the phone-based system did certainly cause a feedback loop that kept the disruption happening for much longer than it should have. And moving away from point-to-point would have many more tradeoffs/downsides for consumers than simply modernizing their tech system.
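The "assume they made it unless told otherwise" behavior described above can be sketched like this. It is a minimal illustration, assuming a made-up record layout; `crew`, `believed_location`, and the field names are hypothetical and not taken from any real Southwest system.

```python
# Made-up crew records: where each person was scheduled to end up, where they
# actually are, and whether their phone report has gotten through yet.
crew = [
    {"name": "crew_1", "scheduled": "DEN", "actual": "DEN", "report_received": False},
    {"name": "crew_2", "scheduled": "BWI", "actual": "MDW", "report_received": True},
    {"name": "crew_3", "scheduled": "DAL", "actual": "LAS", "report_received": False},
]


def believed_location(member):
    """Without a report, the scheduler falls back to the scheduled location."""
    return member["actual"] if member["report_received"] else member["scheduled"]


# Crew the scheduler would assign to flights from cities they are not in.
phantoms = [m["name"] for m in crew if believed_location(m) != m["actual"]]
print(phantoms)  # ['crew_3']
```

Every hour someone like `crew_3` spends on hold is an hour in which the schedule is built on a wrong location, which is why report throughput matters so much here.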

phpisthebest · on Dec 28, 2022

The interview I saw with one of the execs seemed to indicate that even under normal conditions when a flight is cancelled each crew member needed to call into a call center to be reassigned.

Supermancho · on Dec 28, 2022

All crew are also under a regulation that limits their time available for flight, similar to a trucker but with larger consequences. eg If you are on shift for a certain number of hours, waiting for a plane in a hotel or in the terminal, it_does_not_matter. You have to have a mandatory sleep block where you are now unavailable. Once thousands of staff started timing out, even after the weather clears, it_did_not_matter. They were now unavailable and their replacements hadn't come in due to the weather, leaving some locales understaffed and all schedules backed up (which continued to compound as weather shifted).

redtriumph · on Dec 28, 2022

Similar to humans, is there a wait time [cooling time] for aircraft also? I suspect before and after each takeoff and landing, there are a bunch of tests run to verify proper function of aircraft parts and subsystems.

TexasDawg · on Dec 29, 2022

Aircraft have to be scheduled for various maintenance checks such as "A", "B", "C", and "D" (which are the heaviest maintenance checks and take them out of service for many days/weeks). They're of course checked by the Ramp Crew and also receive a pre-flight walk around by the pilots before take-off. The only wait time is when a repair is needed to deal with a warning/indicator light, which hopefully does not take the aircraft out of service. Better out of service than break down in-air.

Supermancho · on Dec 29, 2022

Transport aircraft can run continuously for multiple flights. They don't turn off the engines under certain conditions (short hops, type of plane, size of loads, etc). There are regulatory inspections that require it to hangar, but I'm not familiar enough to cite any.

jrochkind1 · on Dec 29, 2022

I mean, a system for crew scheduling that requires manual notification and updating when flights get cancelled... seems to be missing very important functionality for a crew scheduling system?

AnimalMuppet · on Dec 28, 2022

Not entirely orthogonal? If fixing issues is manual (therefore slower), that means that once issues start happening, it's slower to recover, thereby making a total meltdown more likely. And when a total meltdown occurs, recovery is manual, therefore slower.

masklinn · on Dec 28, 2022

And you can have mobile apps and a web portal in front of that and it won't help none.

AnimalMuppet · on Dec 28, 2022

It will make it faster than manual phone calls. "Faster" actually will help here. If nothing else, it will help the recovery not take so long.

inferiorhuman · on Dec 28, 2022

You can have a non-computerized system that is more efficient. Apps can and do fail, and I'll go out on an unpopular limb and suggest that what most folks working in tech would build would still fail frequently enough to be problematic in an airline of this size.

The problem is that their disaster recovery plan (assuming that there is one) is highly centralized. Southwest has each and every body calling the scheduling department one at a time. The proper design (and what most other airlines actually do) is to decentralize a wee bit and have, for instance, crews report to station managers who in turn call the scheduling department.

mrandish · on Dec 28, 2022

Mobile apps / web portals could have helped the part of the problem related to status updates of staff location/availability not ingested fast enough. Apparently communications ingest bottleneck was a significant part of why the overall problem tipped into an even worse runaway state with no viable recovery options short of "full reboot."

cratermoon · on Dec 28, 2022

In a complex enough system, it's reasonable to assume that at any point in time something is broken and the system is not running at 100%. What this means is they probably had their reporting/fixing system running at the thin edge all the time, and once the failures exceeded a certain rate, it was game over. The fixing could never catch up.
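A back-of-the-envelope sketch of why running "at the thin edge" means the fixing never catches up. The numbers are invented and `backlog_over_time` is a hypothetical helper; it just tracks how many unresolved reports carry over each hour when a fixed number can be handled per hour.

```python
def backlog_over_time(arrivals_per_hour, capacity_per_hour):
    """Unresolved reports left at the end of each hour, given a fixed handling capacity."""
    backlog = 0
    history = []
    for arrivals in arrivals_per_hour:
        backlog = max(0, backlog + arrivals - capacity_per_hour)
        history.append(backlog)
    return history


# Made-up load: normally 95 reports/hour, with a two-hour storm spike of 300/hour.
load = [95, 95, 300, 300, 95, 95, 95]

print(backlog_over_time(load, capacity_per_hour=100))  # [0, 0, 200, 400, 395, 390, 385]
print(backlog_over_time(load, capacity_per_hour=150))  # [0, 0, 150, 300, 245, 190, 135]
```

With only 5 reports/hour of headroom, the 400-report backlog from a two-hour spike takes 80 hours to drain, and that assumes nothing else goes wrong in the meantime.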

Ekaros · on Dec 29, 2022

Major hub getting impacted is less of an issue. Either the planes and crews are stuck there, you cancel flights and send them home. Or they fly to nearest available location and wait until they can get back to the hub.

And explaining that FAA or someone banned flying from this airport is much simpler.

masklinn · on Dec 28, 2022

I mean... it's not entirely untrue but at the same time airlines have been winding down their "hub and spoke" model for a point to point one for a while.

That's in part what doomed the A380, which was popular with airlines still going strong with hub-and-spoke (Emirates being by far the most prominent one) but is worthless in a point-to-point model.

ChrisMarshallNY · on Dec 28, 2022

> the A380

My worst nightmare, was pulling into a gate, and seeing one of those puppies arriving on the same concourse.

Immigration queues are bad enough, with 747s, but the A380 is much worse.

ikrenji · on Dec 29, 2022

yeah well there is no reason why an entire A380 is being processed by two customs officers while 8 extra booths stand unused

nradov · on Dec 28, 2022

There are still a number of major airports that are slot constrained, especially now that passenger demand is growing again. It seems like there could be a profitable market for an efficient twin-engine airliner larger than the Boeing 777X. Perhaps even a double-decker?

masklinn · on Dec 28, 2022

> There are still a number of major airports that are slot constrained

Major airports are major airports even in a point to point system. If everybody wants to go into or come out of NY, NY's airports are going to be slot constrained.

Also when I say "winding down" I don't mean "killed entirely", most airlines still have home bases, and mix point to point with regional hubs (especially for international flights).

But at the height of the hub-and-spokes model, unless you departed from or went to a hub you'd always need a layover.

coredog64 · on Dec 28, 2022

La Guardia is gimped by stupid rules about travel distance. I'm sure these rules made sense years and years ago when planes that could travel long distances were very loud, but by now, the airport has been there longer than most residents and planes are significantly quieter.

From a net utility standpoint, NYC should just pay for "port packages" for nearby houses, add the cost to landing fees, and remove source/destination restrictions on La Guardia.

inferiorhuman · on Dec 28, 2022

La Guardia doesn't operate in a vacuum. It's too small for long range aircraft and increased traffic at LGA would create problems for JFK and EWR. Besides, the runways at LGA are much shorter than those at JFK, EWR, and extending the runways at any of those three is problematic. Even SWF has longer runways.

ComputerGuru · on Dec 28, 2022

Southwest is statistically the worst airline in terms of delays and cancellations but has deluded its customers into thinking it's the best (according to surveys asking people to rate airlines on their reliability).

https://www.insidehook.com/daily_brief/travel/airlines-fewes...

skellington · on Dec 29, 2022

Thanks for not understanding statistics and linking to an article that also doesn't understand basic math.

SW has a high number of delays and cancellations BECAUSE THEY FLY A HIGH VOLUME OF PEOPLE. By percentage, they are in the middle of the pack for both delays and cancellations, which isn't great, but they are not the worst by any means.

How are HN people so consistently bad with basic information?

enjoylife · on Dec 29, 2022

Probably because the raw data is often hidden from the readers so it’s hard to corroborate a story's statistical narrative.

Here is the data which backs up the majority of these low effort cancellation related news articles.

https://public.tableau.com/app/profile/flightaware/viz/Airli...

vl · on Dec 28, 2022

But also SW attracts a very specific kind of customer. If you fly for business, or just can afford another airline, why would you fly Southwest?

noyoudumbdolt · on Dec 29, 2022

They are actually more expensive, which is why they do two things to fool their customers:

1. On their flight search page, it shows the prices on a per-leg basis. Every other airline shows round trip prices. That way, the initial result on Southwest looks great and it’s only when you get to the final payment page that you realize it’s actually twice as expensive as you thought.

2. They refuse to share their data with any of the third party flight search engines like Google Flights or Expedia. Again, so people don’t realize how expensive they actually are.

deathanatos · on Dec 29, 2022

a. other airlines will charge for baggage

b. other airlines will advertise prices that don't exist: it'll be the base price for the seat, but all (remaining) seats will be of the upcharged variety, so you can't pay the base price.

When you make the adjustments to compare apples-to-apples prices, SWA's get a bit closer. (But I do think SWA does tend to not be the cheapest, unless they're having a sale.)

I actually prefer the per-flight¹ pricing, too. That's what I'm buying: two flights.

¹IIRC, the pricing is per-flight, not per-leg/segment. (But it isn't, as you say, round-trip.)

noyoudumbdolt · on Dec 29, 2022

I fly all the time and SW is usually more expensive than American.

And that’s why they refuse to share their prices with Expedia and Google Flights. If people could more easily compare prices, Southwest would lose a lot of sales.

ComputerGuru · on Dec 28, 2022

SW is hardly even the cheapest. If not booking months in advance, SW is almost always twice the price of United or American, at least in my parts.

CivBase · on Dec 29, 2022

Maybe it has something to do with my location, but my experience has been that SWA is normally much cheaper than other airlines. I recently cut 70% off a trip I'm planning in May by switching from Delta to SWA. That's probably a uniquely extreme case, but I'll happily risk a delay of even a day or two if it saves me literally thousands of dollars.

el_benhameen · on Dec 29, 2022

Southwest’s base fares suck, but their discount tier fares combined with frequent sales, the companion pass, and a decent frequent flyer program make them pretty competitive in my experience. That said, if you don’t live or travel on one of their major routes, they are a bad choice.

deathanatos · on Dec 29, 2022

The statistics in that article are of the "damned lies" variety: none of the values are normalized (they compared simple number of cancellations and delays without taking that as a per flight value, or perhaps better, per passenger) and they treat all delays (and cancellations) as equal; I'll take an airline often delayed by 5 minutes over an airline sometimes delayed by 3 hours.

Perhaps it's true nonetheless, but the numbers there won't tell you.

(And IME, it's perhaps true that SWA is often delayed … but by tolerable amounts. Compared to delays I've endured with Delta, where, e.g., a flight was delayed longer than the time it would take the plane to drive at highway speeds, from where it was coming from. Or … also Delta … where I was cancelled on twice in the same flight. They wanted to go 0-3 but I gave up and bought a ticket on … SWA.)
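The normalization point above can be made concrete with a small sketch. All numbers and names here (`AirlineStats`, "Airline A", "Airline B") are made up for illustration and are not taken from the article; the only point is that raw counts and per-flight values can rank two airlines in opposite orders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AirlineStats:
    name: str
    flights: int               # scheduled flights in the period
    cancellations: int         # raw number of cancelled flights
    delay_minutes: List[int]   # one entry per delayed flight


def cancellation_rate(s: AirlineStats) -> float:
    """Cancellations per scheduled flight."""
    return s.cancellations / s.flights


def delay_minutes_per_flight(s: AirlineStats) -> float:
    """Total delay minutes spread over all scheduled flights."""
    return sum(s.delay_minutes) / s.flights


# Hypothetical data: A is big and often slightly late, B is small and sometimes very late.
a = AirlineStats("Airline A", flights=1000, cancellations=20, delay_minutes=[5] * 300)
b = AirlineStats("Airline B", flights=200, cancellations=10, delay_minutes=[180] * 20)

for s in (a, b):
    print(s.name, s.cancellations, len(s.delay_minutes),
          cancellation_rate(s), delay_minutes_per_flight(s))
```

On raw counts A looks worse (more cancellations, more delayed flights); per flight, B cancels more often and costs far more delay minutes.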

tyingq · on Dec 29, 2022

Perhaps they are thinking frequency for a trip they take often. It matters less that your Dallas->Houston flight is late when there's another one in 30 minutes during peak times.

variant · on Dec 29, 2022

Deluded? Or could it be that customers value economy over predictability?

kube-system · on Dec 28, 2022

I believe this has changed in recent years due to similar hiccups, and their reliability in prior years was previously good.

ghaff · on Dec 28, 2022

I suspect also that, like JetBlue in its early years, a lot of its flights were out of secondary airports that are generally less exposed to a lot of operational disruptions (e.g. they flew out of Hobby in Houston early on). They also had the advantage of being in a region that, per its name, probably has fewer weather issues in general.

The pattern I've seen over the years is that upstart airlines, as they grow, end up having to look--for various reasons--a lot more like legacy carriers over time. Whether that's flying into areas with seasonal bad weather, flying out of default airports, instituting various forms of passenger status, etc.

tmpburning · on Dec 28, 2022

You are lucky if your flights are on time 75% of the time on average, with any airline.

quickthrower2 · on Dec 29, 2022

Usually better, I find, as they overestimate the flight time as if headwinds are happening every time.

tmpburning · on Dec 29, 2022

Either you are very lucky, using a specific company with better odds than most, or are not traveling very much.

quickthrower2 · on Dec 29, 2022

I don’t travel much these days, but when I did I found it common to get there before the arrival time that is told to passengers. There is some buffer.

alanbernstein · on Dec 28, 2022

(in 2022)

marze · on Dec 29, 2022

I find it especially ironic that SWA system failed them, and this large failure was preceded by worse and worse "near failures", since SWA is in the aviation business.

In the aviation arena, high reliability is maintained in part by careful analysis of "near failures": lessons are extracted and improvements are made to aircraft designs, procedures, etc.

By contrast, the "near failures" of the SWA system as a whole don't appear to have been utilized to motivate system improvements.

masklinn · on Dec 29, 2022

Per comments on /r/southwestairlines for 20 years the C-suite were Wall Street boys and set that as management culture, as long as shares were up things were fine.

Frontlines have been emitting concerns and warnings for years but management didn’t care until an ops CEO (Bob Jordan) got in recently (as in 2022), but now there’s 20 years of ops neglect to deal with, and less than a year is nowhere near enough to start enacting real changes.

Sakos · on Dec 29, 2022

On that note, I think it's interesting how glowingly people talk about Herb Kelleher. Wouldn't he be responsible for allowing that tech debt to build until it finally caused the system to fail catastrophically?

masklinn · on Dec 29, 2022

Kelleher stepped down in 2004, so not really. In 2004, replacing 10-year-old systems starts becoming a consideration but it's not an active need. Kelly was CEO from 2004 to 2022, and is now chairman of the board. Bob Jordan is the new CEO, since February 2022.

Sakos · on Dec 30, 2022

Apparently Kelleher was on the board until 2008 and remained an FTE until 2013. I suspect he still had plenty of influence over Southwest's direction.

igetspam · on Dec 28, 2022

A friend of mine wrote on this topic today as well.

https://www.seat31b.com/2022/12/the-great-southwest-meltdown...

thepasswordis · on Dec 28, 2022

I'm surprised they haven't tried to blame a cyberattack yet.

That said, I feel like these sorts of catastrophic ultra-fragile McKinsey-consulted-to-death failures we keep seeing in various industries are basically a giant signal to any adversaries that say "Hi! Check out how easy it would be to grind this entire industry to a halt!"

Resiliency is literally the opposite of efficiency. These systems need to have slack, aka inefficiency built into them. Unfortunately the business culture has moved towards ultra fragile, ultra efficient thinking.

ghaff · on Dec 28, 2022

In part because, in this case, customers will buy the ticket that is $10 cheaper.

factsarelolz · on Dec 28, 2022

That could potentially increase their cyber security insurance premiums if not cause the insurer to drop them immediately. Not to mention the broader impact on the market and industry as a whole.

twobitshifter · on Dec 28, 2022

https://blog.geaerospace.com/technology/big-wins-in-flight-e...

Skysolver is a GE Flight Services trademark - there’s a video here showing how it works and SW planes. Contrary to the reddit claim, it does appear to use a predictive algorithm.

Highlight quote from the video:

“It is humanly impossible when there’s a major disruption for somebody to figure out what the optimal approach is to get them back on schedule”

cerved · on Dec 29, 2022

I would expect it to be some kind of OR solver

nlstitch · on Dec 28, 2022

I would be very interested in a post mortem of the software used, called SkySolver. It's supposed to be a Java application which is said to be developed by Accenture? Anyone have actual technical insights into why it failed?

DLarsen · on Dec 29, 2022

About a decade ago, I was part of a startup trying to disrupt crew scheduling. It's a non-trivial operations problem when you aim to honor crew preferences, union-negotiated affordances, FAA legalities, etc. We were only involved in the pre-planned schedules. At that time, the airlines we were courting had an entirely different human-heavy system to resolve real-time issues. There was some level of reserve redundancy baked in so the human planners had some wiggle room to work with... but redundancy is expensive to maintain. As a relatively new engineer at the time it was a pretty neat domain with big $$$ at stake. As it turns out, pretty much nobody wanted to take the risk on a new system even if it had provably better schedules for all parties. All it takes is one snafu for the whole thing to turn into a major regret.

nlstitch · on Dec 29, 2022

So your biggest challenge was stakeholder management or expectation management at that startup? Did you eventually go to market and succeed?

DLarsen · on Dec 29, 2022

We ran a pilot with one carrier for a subset of their crew for maybe a year, but eventually failed to gain further traction. Very tough sales and implementation cycle in spite of the fact that we could convince many individuals (both crew and management with their competing concerns) of the benefits of our system.

nlstitch · on Dec 29, 2022

In what way was it a tough implementation cycle? e.g. resistant to change, or outsourced (e.g. could not get close enough to the fire/ talk to the right people) or was it something else?

Would love to pick your brain because I can relate to what you're saying.

DLarsen · on Jan 2, 2023

Integrating into the operations on a provisional evaluation basis is a whole lot of work and the stakes are high. As my grandpa used to say, "There's nothing easier than doing nothing." Both parties (our company and the airline) brought the right people to the table, but for the system to really be evaluated we needed an entire crew base to use the system in a "realistic" fashion. There was a fair bit of understandable skepticism about how the system would perform under the stress of real user input, but due to the overall complex nature of the system there was a non-trivial learning curve and lots of crew members didn't see a lot of value in spending their time helping us put the thing through its paces. As an example, the system would allow crew members to specify a highly detailed, specific set of flight preferences that were declared as ranked priorities, and/or they could select specific flights. If a crew member was forced to "try" our system, they could put in a generic set of input ("I like the Vegas flight."). However, we suspected that once they were faced with the opportunity to shape their real schedule, they would have a lot more incentive to put in a lot more particulars, which would stress our optimization engine differently.

kilroy123 · on Dec 29, 2022

Interesting. My last job was doing something similar but much more ambitious.

Manage the entire fleet. All tails, flight legs, and crew.

It was for a much smaller US airline but still a household name.

The crew aren't union so that helped. But it was tricky to manage the crew part.

nlstitch · on Dec 29, 2022

I'm actually working at a startup that wants to use an algorithm to plan transport on a large scale. (Different industry though.) Got any tips or insights on biggest challenges?

kilroy123 · on Dec 29, 2022

The biggest challenges we had were more political. Getting buy-in across the airline as well as _support_ from the airline side to integrate their data and system into ours.

Sadly, it was a people problem, not a tech or algo problem.

nlstitch · on Dec 29, 2022

So, did you manage to "tackle" the people problem in the end? How did you get them to move and integrate with your system in the end? Was it a bottom-up (from the end users) or top-down (making it part of company strategy) approach?

kilroy123 · on Dec 29, 2022

Sadly, I don't have these answers. I ended up quitting to go on a long sabbatical.

Last I heard, the airline scrapped the entire project to continue doing things the old way because they felt it was too much work on their end. ¯\_(ツ)_/¯

DLarsen · on Dec 29, 2022

Does the algo work? Is it better than the status quo and under what circumstances is it susceptible to failure? Does it know when the whole system is over-constrained?

If we assume the algo rocks (because you have operations research veterans), what stage of planning does your system address? What is the experience required and cost of manual intervention?

Fun problem. Human factors and edge cases abound. So much is at stake for a system that already works (to any predictable degree).

I'd love to hear more though.

nlstitch · on Dec 29, 2022

The algo works. We have two algorithmic experts onboard, of which one is a professor and one is already working with algos each day (e.g. worked on patients/beds capacity challenge when covid began).

The startup focusses on assigning goods to transport. (As in, it's not part of the commercial sales process; only execution of the required capacity planning that comes after it.) Something that has been tried 20 times and failed. It currently is done manually.

So yeah, it's going to be a challenge.

Tao3300 · on Dec 28, 2022

I'll bet it's just another excuse. The situation was probably so bad that the only solutions SkySolver could offer were beyond their capacity. There were probably partial solutions that were possible, but they rolled the dice on being able to resolve it by overworking people. They decided to double down instead of paying a whole lot of vouchers and refunds they'd rather not have to. Is it the software's fault the bet didn't pay out? As the problem snowballs and becomes more intractable, you get too far off the script and reach a point of no return.

I know nothing about the SkySolver or leadership at Southwest, but that seems like a very likely scenario based on what I've seen of what corporate types expect from software and rank-and-file employees.

ainvb · on Dec 29, 2022

SkySolver is a total shit show no doubt, but it’s not the entirety of the problem. Keep in mind when you have disruptions you actually have multiple products. SkySolver only manages the crew. They have another system that manages the aircraft and flight schedule. Believe it or not these systems are mostly decoupled - the schedule is modified first, then the crew, and there may even be a feedback loop back to the schedule if there is no crew solution.

Multiple things can be true here.
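A rough sketch of the decoupled flow described above (schedule first, then crew, with feedback to the schedule when no crew solution exists). Everything here is hypothetical: `solve_crew` is a trivial stand-in, not SkySolver's actual logic, and "cancel the last flight" is just a placeholder for whatever the real schedule system does.

```python
def solve_crew(flights, crews_available):
    """Hypothetical stand-in for a crew solver.

    Succeeds only if there is at least one crew per flight; returns None otherwise.
    """
    if len(flights) <= crews_available:
        return {flight: f"crew_{i}" for i, flight in enumerate(flights)}
    return None


def recover(flights, crews_available):
    """Modify the schedule, try to crew it, and feed failures back to the schedule."""
    schedule = list(flights)
    cancelled = []
    while schedule:
        assignment = solve_crew(schedule, crews_available)
        if assignment is not None:
            return schedule, assignment, cancelled
        # Feedback loop: no crew solution, so the schedule side drops a flight and retries.
        cancelled.append(schedule.pop())
    return schedule, {}, cancelled


schedule, assignment, cancelled = recover(["A", "B", "C"], crews_available=2)
print(schedule, assignment, cancelled)
```

The point of the sketch is only the shape of the loop: each system can be correct in isolation while the combination still takes many rounds to converge.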

ProAm · on Dec 28, 2022

Still my go-to airline because the rest are so difficult, unfriendly or just greedy. I'd rather deal with Southwest every time to feel like I am a human being.

noirbot · on Dec 28, 2022

I mentioned this in another thread, but Southwest feels like they fuck up in ways that are understandable, if unfortunate. Delta, United, American and co. feel like they intentionally screw you out of perverse pleasure and a desire to cause you pain.

Southwest's seating may be chaotic, but it's not United where you have to pay an extra $60 to not get a middle seat, likely in the back of the plane.

themadturk · on Dec 28, 2022

That's one thing I have seen over and over regarding the current situation, that SouthWest personnel are polite, empathetic and as helpful as they can be given the circumstances.

Do their planes still have smiles painted on them?

crosen99 · on Dec 28, 2022

It's easy to ask, "How could this happen?", but it's also a wonder this sort of thing doesn't happen more often with airlines and other businesses that rely on solutions to complex logistical challenges at their core. Overall, despite the horrors of war, perils of a pandemic, etc., sometimes I pause and ponder how remarkably well the world works.

calbear81 · on Dec 28, 2022

I’ve been lucky to have caught a flight back to SF after cancellations and can wait at home while figuring out how to get to my original destination.

What I don’t understand is how come SW couldn’t enlist help to get customers rebooked on other airlines - their phone lines were slammed (I waited 3 hours) just to get a refund since their app wouldn’t allow me to choose to rebook/cancel.

If I was as customer focused as they say they are - I would’ve contacted AMEX global travel and gotten their entire network of booking agents to backfill and rebook customers on other flights.

realityking · on Dec 28, 2022

Southwest doesn’t have interline agreements with other airlines nor, AFAIK, the integration into reservation systems that allow rebooking onto another airline.

jrochkind1 · on Dec 29, 2022

From what we know, this to me sounds like a story about technical debt.

"Sure, it's held together with rubber bands and is a mess, but it would cost hundreds of millions to fix, and it's working, isn't it? So the programmers complain a bit, that's their job."

Which works until conditions change in some way and it catastrophically does not.

I think a lot of our society is now run on unreliable fragile software. I expect to see a lot more of this. "Automation" is especially cost-saving when you don't mind it being a fragile unreliable time-bomb.

w10-1 · on Dec 30, 2022

So much chatter!

I would expect any interview candidate to spot the issue within a minute.

For hub systems, ready crews are either at the hub, or at a spoke, ready to come back. That gives the hub a queue of ready crews, and each spoke can return a crew-plane combination to the hub when available. So with natural queues, there's no delay cascade: it's all a function of crew/plane readiness.

For point-to-point systems, crew-plane combinations are scattered, and the next flight opportunity might not be the next flight need. There is no buffer anywhere. Furthermore, any greedy/opportunistic strategy at one point can block a superior global solution.

That's the point-to-point trade-off taken by SWA. In the common case of good weather, you avoid the extra miles from going via hubs. But in the rare case of global weather shutdowns, there is no good recovery.

The only real question is whether SWA had any obligation to communicate this to investors and passengers. So far, Apple stock has gone down more than SouthWest's in this period, and passengers are remaining loyal, so no damage done.
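The claim above that a greedy choice at one point can block a better global solution can be shown with a toy example. The crews, flights, and eligibility table are invented for illustration; real crew recovery has far more constraints, and the exhaustive search here only works because the instance is tiny.

```python
# Hypothetical toy data: which flights each crew could legally cover right now.
eligible = {
    "crew_1": ["F1", "F2"],
    "crew_2": ["F1"],
}


def greedy_cover(eligible):
    """Each crew, in order, grabs the first still-uncovered flight it can fly."""
    taken = {}
    for crew, options in eligible.items():
        for flight in options:
            if flight not in taken:
                taken[flight] = crew
                break
    return taken


def best_cover(eligible, crews=None, taken=None):
    """Exhaustive search for the assignment covering the most flights."""
    crews = list(eligible) if crews is None else crews
    taken = {} if taken is None else taken
    if not crews:
        return dict(taken)
    crew, rest = crews[0], crews[1:]
    best = best_cover(eligible, rest, taken)  # option: leave this crew unassigned
    for flight in eligible[crew]:
        if flight not in taken:
            taken[flight] = crew
            candidate = best_cover(eligible, rest, taken)
            if len(candidate) > len(best):
                best = candidate
            del taken[flight]
    return best


print("greedy:", greedy_cover(eligible))
print("best:  ", best_cover(eligible))
```

Greedy puts crew_1 on F1, which strands crew_2 and leaves F2 uncovered; the global search sends crew_1 to F2 so both flights go.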

francisofascii · on Dec 28, 2022

So they are blaming SkySolver software. The article says it is off-the-shelf software? But in other news reports, they make it sound like it was developed in-house.

phpisthebest · on Dec 28, 2022

A lot of enterprise software is both; it is more akin to a "super framework", where you start with the base system that provides common functions but is then extendable.

Most ERP platforms are like this. Over the decades it is not uncommon for little of the original platform to be used.

My org's ERP is like that: we use probably 10% of the commercial code, and 90% of functions are custom code written in-house.

partdavid · on Dec 29, 2022

Often, even more complicated: the ERP software vendor is one party; then there's the consulting VAR who does the customer's "customizations", and then the in-house team which does the "administration" and some "configuration". The VAR may or may not have a continuing relationship and of course the three groups have ample opportunity for finger-pointing; and the "administration" team is often under-resourced and doesn't really have the right skills to continue or maintain the work.

It's worse than a framework because most frameworks are designed for people to do some form of software engineering with. But with some "configurable" tool, you end up with something that you need to do software engineering but it doesn't have the tools, power or affordances to do so.

I've done some Jira consulting and worked in MDM and I'm of the firm belief that "buy vs. build" isn't the real dichotomy: it's "build" vs. "buy and build", and the effort of building on top of the platform is usually underestimated and undervalued.

phpisthebest · on Dec 29, 2022

We have a team of in-house devs that do all of our customization, but yeah, outsourcing is still all the rage with many other companies' VPs looking for that next bonus...

Companies are allergic to headcount

ghaff · on Dec 28, 2022

Even a SaaS poster-child like Salesforce is like that--between partners and in-house development of various sorts. I think it may be the largest tech conference these days and it's not because everyone buys a standard Salesforce subscription.

dragonwriter · on Dec 28, 2022

> The article says it is off-the-shelf software? But in other news reports, they make it sound like it was developed in-house.

Working in enterprise, there is a lot of “modifiable off-the-shelf” software; ideally, this provides basic functionality out of the box, but also the alignment to custom business needs of in-house software.

(Often, it seems like it combines the up-front and ongoing external licensing cost of COTS software, with the internal development and maintenance costs of in-house software, and the combined problems of both.)

kube-system · on Dec 28, 2022

If it's anything like an ERP, then that sounds about right. There's the out of box part that does the most generic things, and then heavy customization to integrate and customize it to fit the particular business.

Beltalowda · on Dec 28, 2022

> So they are blaming SkySolver software. The article says it is off-the-shelf software?

The article actually says "SkySolver, an off-the-shelf piece of software that Southwest has customized and updated".

chris_wot · on Dec 28, 2022

It probably was - massive customizations would have had to have been done to make it fit for purpose.

Good luck replacing it, on an operational airline that can’t shutdown. How you would do this is to me, just mind-boggling.

hellcats · on Jan 5, 2023

AA does shut down for about 5-10 mins to run their solver and upload new assignments after a major event.

billsac · on Jan 4, 2023

So what year was it introduced? Is GE still in the software biz? Who knew.

crisdux · on Dec 28, 2022

I don't buy the narrative that inadequate technology is the main reason for the Southwest debacle. We must ask, why did this happen now and not before? Southwest has previously been able to better deal with disruptions like this. While the weather event did happen in the middle of their network, it wasn't unprecedented.

I think a more obvious reason is staffing issues brought on by covid, layoffs, and the vaccine mandates. They lost experienced employees who were able to wrangle the bad scheduling software. Throughout 2022, Southwest was having hiring issues because they were still mandating the vaccine through at least the summer for new employees. Their pilots association warned about this causing disruptions after a bunch of summer cancellations. Do people forget how flaky Southwest was during summer 2022? Southwest just recently reached staffing levels that matched their 2019 high. This "inadequate technology" narrative just seems like a convenient scapegoat.

Supermancho · on Dec 28, 2022

> This "inadequate technology" narrative just seems like a convenient scapegoat.

My brother has worked in the white and blue collar unions (he prefers his ramp job). It's not like there's some impermeable cover of secrecy. These are just regular people who you can talk to. It's a combination of computer problems and regulatory controls (sleep blocks) leaving insufficient staff (and mechanical dangers) due to weather. The ramp teams were sitting at almost quad pay with no planes to service out of Minneapolis for a significant part of the weekend. This same situation has occurred, to some degree, every year.

Due to the inevitable Gulf Stream collapse, this will be a routine problem until SWA triages it.

MaxHoppersGhost · on Dec 28, 2022

This is probably a contributing factor but will get buried by the media. However, I doubt it was the sole or even primary cause. There is definitely a staffing component here in addition to their bad software.

tenebrisalietum · on Dec 28, 2022

> They lost experienced employees who were able to wrangle the bad scheduling software.

Employees could quit because they don't want to get vaccinated, but they also could have like just died from COVID too, or won the lottery, etc.

So to me this still points to the technology as something of a root cause. Your tech is as brittle as the number of people who know how to use it. Losing the people who make your sunk-cost old tech actually work, and not planning for the "bus factor", is still your fault for not addressing.

bdavis__ · on Dec 30, 2022

Someone should calculate this.

"Dying from covid is like winning $1000 in the lottery"

// It sure isn't like winning $1 Million. // And it sure isn't a $2 winner // Somewhere in between.

variant · on Dec 29, 2022

As with any event, multiple factors were involved. I have no doubt tech and process could be at the center, but our self-imposed response to COVID undoubtedly had major impacts.

kube-system · on Dec 28, 2022

They also had mass cancellations in Oct 2021

lotsofpulp · on Dec 28, 2022

They had the same computer related problem and mass cancellations in 2016.

paulpauper · on Dec 29, 2022

For a meltdown, the stock is back to where it was in October, tracking other airlines and the overall market, which keeps falling. I think people have become so accustomed to this sort of stuff that it does not affect business long term. After Covid, people are accustomed to major inconvenience when traveling.

mise_en_place · on Dec 29, 2022

You only get bitten in the ass by tech debt after it’s too late. I’m sure management justified not paying it down because, truthfully, the consequences are never really felt until it’s too late. It’s better to pay down tech debt incrementally, instead of grand projects promising full rewrites.

zx8080 · on Dec 29, 2022

Aren't cases like this where the automated solvers are expected to shine?

If, on the other hand, it's not at all about software failures as many comments here suggest ("company management lost track of crews" notion), then does it have something to do with software at all?

cube00 · on Dec 29, 2022

Automated solvers that have a super computer backing them sure, but I doubt an airline has that kind of hardware at their disposal. Especially if their solver on their lower end hardware works most of the time for most disruptions.

bob1029 · on Dec 29, 2022

Wouldn't this be one reasonable use case for a super computer? Perhaps not the fastest one on earth, but we use this stuff to predict the weather so I don't see why we couldn't use the same for air travel.

ainvb · on Dec 29, 2022

Southwest has the legacy SkySolver version which - believe it or not - runs locally on Windows OS. There is a cloud version that Southwest was either too cheap to buy or too resistant to migrate.

christkv · on Dec 28, 2022

I thought this stuff also gets more likely to happen towards the end of the month, as pilots get close to the maximum number of hours they can fly per month (100h in the US per calendar month), making pilot shortage cascades even more likely.
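A minimal sketch of that end-of-month effect, assuming the 100h figure quoted above is the only limit (real rules have more conditions). The pilot names and hours are invented; `can_fly` is a hypothetical helper, not anything from an actual scheduling system.

```python
MONTHLY_CAP_HOURS = 100.0  # figure quoted in the comment above; treated as an assumption


def can_fly(hours_flown_this_month, flight_hours, cap=MONTHLY_CAP_HOURS):
    """True if taking this flight keeps the pilot at or under the monthly cap."""
    return hours_flown_this_month + flight_hours <= cap


# Hypothetical hours already flown late in the month.
pilots = {"pilot_a": 62.0, "pilot_b": 97.5, "pilot_c": 99.0}

# Who can still take a 2.5-hour flight?
available = [name for name, hours in pilots.items() if can_fly(hours, 2.5)]
print(available)
```

Early in the month nearly everyone passes this check; late in the month the pool shrinks, so each delay or reroute that adds flight time removes more pilots from it.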

londonReed · on Dec 28, 2022

Pilots are only allowed to fly 100h a month? That seems really low, considering a 40 hour work week over 4 weeks in a month is 160h. Although I guess I would rather have my pilot more energized than overworked.

christkv · on Dec 28, 2022

https://www.flyingmag.com/guides/how-many-hours-do-pilots-wo...

Even less for a commercial airplane pilot, it seems. 100h is for transport.

krisoft · on Dec 29, 2022

There is some confusion here. Transport pilots are commercial pilots.

atdrummond · on Dec 28, 2022

There’s work before and after the flight that isn’t included in that 100 hours.