It would be fascinating to read about what bug took down a system with one outag...

dx034 · on April 4, 2018

If I remember correctly, a similar outage in the UK was due to increased flight volumes. The system had a hardcoded limit somewhere which caused issues as flight volume increased. The software was old enough that they weren't aware of that issue beforehand. I could imagine something similar happening here.

It definitely shows how well software can work with the right practices. Only one outage in 20 years and that one only caused a reduction of 10% in capacity. Don't think many companies can match that.

carlmr · on April 4, 2018

>It definitely shows how well software can work with the right practices. Only one outage in 20 years and that one only caused a reduction of 10% in capacity. Don't think many companies can match that.

This is what happens when you build software with the same meticulousness as other engineering disciplines. However a lot of software is so much more complex than what other disciplines can build (because you can build anything you can imagine), coupled with early deadlines and profit pressure, that it's unrealistic for most software to be developed this way. You easily have a cost factor of 100x in time as well as money.

Cthulhu_ · on April 4, 2018

I'm talking out of my ass here, but I can imagine that the focus for this software system was very sharp and hasn't changed (much) in the past 20 years. When you have a product with tight focus, you can polish the shit out of it and make it last 20+ years. A bit of the Unix mentality.

Most products today - and stuff on HN is at the forefront of that - is much more about selling a product to a lot of people, often in highly contested markets. That is, if you make a product, focus on just the core and don't do anything else, you'll stagnate and die.

Then again / on the other hand, there's Dropbox whose core functionality has not changed as far as I can tell for a decade - it still does the exact same thing as when I first installed it. Spotify whose IPO was today that doesn't seem to change its core model. In both cases though, I don't know where they put all their development capacity; probably on back-end systems / scalability, marketing, and new applications (like dropbox creating a google docs competitor).

xcvbxzas · on April 4, 2018

I see this sentiment a lot. Intuitively it seems like it should be true, but I don't think the case is really quite so clear cut.

The costs involve way more than just the initial development. Maintenance eats up a huge, perhaps even a majority, of the total cost as well. And outages or other failures can be very expensive too.

It's also important to keep in mind that this isn't an all or nothing situation. We can have software that is more reliable without asking that it chug away without issue for a decade, or anywhere near as long as we expect bridges or buildings to last.

The process of developing more reliable software isn't necessarily more expensive than less reliable software. It can even be cheaper. I'm struggling to find the links (maybe somebody else has them handy, or I'll edit them in if I find them), but there have been a few case studies done a few years back by companies that moved to using Ada. In addition to the benefits of more reliable software, they also found development costs were better or at least no worse than C. I know that isn't exactly the language to compare to these days, but as I said these were done some time ago.

This is just my own argument, but I suspect that's because the same problems that ultimately cause problems after release also cause problems during development. With a more reliable programming system/environment, problems that might show up later during development are shown to be an issue immediately. This means the issue doesn't need to be tracked down, which can take some serious time. The developers are even fresh on problem area.

Personally speaking, I've been totally won over by Ada. It ain't perfect, but it's a hell of a lot better than anything else I've seen - and I've looked a lot. In my own projects (mainly personal or for school admittedly) development is much easier and ultimately quicker. I don't have to spend a day tracking down a weird bug because the compiler let's me know about the issue as soon as I try to cause it.

carlmr · on April 4, 2018

>The process of developing more reliable software isn't necessarily more expensive than less reliable software. It can even be cheaper. I'm struggling to find the links (maybe somebody else has them handy, or I'll edit them in if I find them), but there have been a few case studies done a few years back by companies that moved to using Ada. In addition to the benefits of more reliable software, they also found development costs were better or at least no worse than C. I know that isn't exactly the language to compare to these days, but as I said these were done some time ago.

I can believe that. Ada catches a lot of errors you would normally only notice by extensive testing at compile time. You're preaching to the strong-static typing choir here. I believe Ada and Rust could solve a lot of problems of companies working with C/C++ and make development cheaper. You can properly model your domain and abstract without sacrificing safety.

I'm also a strong believer that TDD makes you much faster and safer in the long run.

My experience tells me that most tools, languages or methods that catches errors earlier will save money.

Ada also has the best tested compiler I can think of.

However my larger point was about the engineering processes not the language itself. I think with languages and tools you can make it easier to make good software. The 100x time and cost is more in the sense of process changes when you're working on safety critical systems. How everything has to be traceable from requirement to test, how there are mandatory reviews before any code change that need to be documented, how there are qualification criteria for the toolchain, etc. All these things cost a lot of time and manpower, with arguably very bad cost-benefit analysis, which is only really worth it when human lives are at stake.

xcvbxzas · on April 5, 2018

> The 100x time and cost is more in the sense of process changes when you're working on safety critical systems. How everything has to be traceable from requirement to test, how there are mandatory reviews before any code change that need to be documented, how there are qualification criteria for the toolchain, etc. All these things cost a lot of time and manpower, with arguably very bad cost-benefit analysis, which is only really worth it when human lives are at stake.

Absolutely. That's part of what I was getting at by mentioning all of this exists on a continuum. We don't need to, and really shouldn't, treat a SaaS startup exactly the same as a military aviation project.

We can, however, draw from the lessons learned on those safety critical projects and use parts of the process that make sense for the nature of whatever we're actually working on.

You're right that in general I suspect that comes down to strong static typing, particularly for the sorts of projects common to the HN crowd. When dealing with very large enterprise projects the balance might start to shift to more than just typing, though it would probably take a lot of real-word data that nobody is keen to supply to figure out where the tipping points are.

And I'd argue about how well Rust actually helps with these things, but that would really be going off the rails. Unfortunately.

colechristensen · on April 4, 2018

Anyone could, few care to. They really are the wrong practices for a whole lot of things. The level of care, design, and verification just isn't necessary for applications with few or fixable consequences.

It is a bit sad that nearly nothing these days strives for that kind of excellence.

tim333 · on April 4, 2018

Yeah, I recall last year

"BA’s £150,000,000 outage was caused by someone turning computers on and off too quickly"

I think companies often try to stay quite about the dumb stuff.

makmanalp · on April 4, 2018

Well - I wouldn't call it dumb stuff. After all, it's only a matter of time one of us does something truly stupid :-) Even more so when under stress and pressure when shit hits the fan, which is usually when human operators have to intervene. It's part of building reliable systems to reduce the chances of operator error. See: the Hawaii missile alert bug! It must be truly terrifying to sit at that particular keyboard typing in any command.

colechristensen · on April 4, 2018

Read "Inviting Disaster" for a lot more about this topic.

Many very high profile disasters were caused by operator error or more precisely complex systems not designed for what you might call failure ergonomics.