
I think the software industry itself has accumulated enough bugs over the past few decades. E.g.

F-22 navigation system core dumps when crossing the international date line: https://medium.com/alfonsofuggetta-it/software-bug-halts-f-2...

Loss of Mars probe due to metric-imperial conversion error

I've hit a few of these myself (e.g. a misplaced decimal that made $12mil into $120mil), but sadly cannot divulge details.
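Unit-mixup bugs like the Mars probe loss are a whole class of their own. A minimal sketch of one common defence, using invented names (this is not how the probe's software worked, just an illustration): store every quantity in one canonical unit behind a small wrapper type, so a raw number in the wrong unit can't silently slip into a calculation.

```python
from dataclasses import dataclass

LBF_TO_NEWTON = 4.448222  # pound-force to newtons

@dataclass(frozen=True)
class Force:
    newtons: float  # stored in exactly one canonical unit

    @classmethod
    def from_lbf(cls, lbf: float) -> "Force":
        # explicit conversion at the boundary, never implicit
        return cls(lbf * LBF_TO_NEWTON)

def total_impulse(forces: list[Force], dt: float) -> float:
    """Sum impulse in newton-seconds; raw floats are rejected loudly."""
    if not all(isinstance(f, Force) for f in forces):
        raise TypeError("expected Force values, got raw numbers")
    return sum(f.newtons for f in forces) * dt

# A bare float (implicitly pound-force) can no longer sneak in unnoticed:
impulse = total_impulse([Force.from_lbf(100.0)], dt=2.0)
```

The point is not the wrapper itself but that the mismatch fails at the type boundary instead of corrupting a trajectory months later.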




The worst bug I encountered was when physically relocating a multi-rack storage array for a mobile provider. The array had never been powered down(!) so we anticipated that a good number of the spindles would fail to come up on restart. So we added an extra mirror to protect each existing RAID set. The problem was that a bug in the firmware meant the mere existence of this extra mirror caused the entire array's volume layout to become corrupted at reboot time. Fortunately a field engineer managed to reconstruct the layout, but not before a lot of hair had been whitened.


Close call. I know of a similar case where a fire suppression system check ended up causing massive data loss in a very large storage array.


My worst bug was a typo in a single line of HTML that removed 3DS protection from many millions of dollars of credit card payments.


Pretty epic.

I was working for a webhosting company, and someone asked me to rush a change just before leaving. Instead of updating 1500 A records, I updated about 50k. Someone senior managed to turn off the cron though, so what I actually lost was the delta of changes between the last backup and my SQL.

I was in the room for this though: https://www.theregister.com/2008/08/28/flexiscale_outage/


I love the title to that article "Engineer accidentally deletes cloud".

It's like a single individual managed to delete the monolithic cloud where everyone's files are stored.


That is eerily similar to what happened to us with IBM "Cloud" at a previous gig. An engineer was doing "account cleanup" and somehow our account got on the list and all our resources were blown away. The most interesting conversation was convincing the support person that those deletion audit events were in fact not us, but rather (according to the engineer's LinkedIn page) an SRE at IBM.


Although it probably wasn't funny at the time I can imagine how comical that conversation was.

Thinking about it further the term "cloud" is a good metaphor for storing files on someone else's computer because clouds just disappear.


This was ~14 years ago, and both MS & AWS had loss-of-data incidents, IIRC.


Bear in mind, this was a small startup in 2008 that claimed to be the 2nd cloud in the world (read: on-demand IaaS provider).

Flexiscale at the time was a single region backed by a NetApp. Each VM essentially had a thin-provisioned LUN (logical volume): basically a copy-on-write clone of the underlying OS image.

So when someone accidentally deletes vol0, they take out a whopping 6TB of data, which takes ~20TB to restore because you're rebuilding filesystems from safe mode (thanks, NetApp support). It's fairly monolithic in that sense.
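A toy model (purely illustrative, not NetApp's actual on-disk format) of why deleting one base volume takes out every thin-provisioned clone at once: each clone stores only its own changed blocks and falls back to the shared base image for everything else.

```python
class BaseVolume:
    def __init__(self, blocks):
        self.blocks = dict(blocks)  # block_id -> data, shared by all clones

class ThinClone:
    def __init__(self, base):
        self.base = base
        self.own = {}  # copy-on-write: only modified blocks live here

    def write(self, block_id, data):
        self.own[block_id] = data  # divergent block copied on first write

    def read(self, block_id):
        if block_id in self.own:
            return self.own[block_id]
        return self.base.blocks[block_id]  # KeyError if the base is gone

base = BaseVolume({0: "bootloader", 1: "kernel"})
vm_a, vm_b = ThinClone(base), ThinClone(base)
vm_a.write(1, "patched kernel")

base.blocks.clear()  # "someone accidentally deletes vol0"
# vm_a keeps its own copy of block 1, but every unmodified block
# on every clone is now unreadable.
```

Thin provisioning is great for density, but it concentrates the blast radius of a single deletion across the whole fleet.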

I guess I was 23 at the time, but I'd written the v2 API, orchestrator & scheduling layer. It was fairly naive, but it fit the criteria of a cloud, i.e. elastic, on-demand, metered usage, despite using a SAN.


At this point there's basically 3 clouds, and then everyone else.


AWS, Azure and Cloudflare?


And Google Cloud Platform


I wonder if any such pathways remain at AWS, Google, Apple and MS that would still allow a thing like that to happen.


You could call that a feature making payments easier for customers! All 3DS does is protect the banks by inconveniencing consumers since banks are responsible for fraud.


I believe they pass on the risk to merchants now. If you let fraud through, $30 per incident or whatever. So typically things like 3DS are turned on because that cost got too high, and the banks assure you that it will fix everything.


It's mostly the payment processor. It may or may not be the bank itself.


My worst bug was changing how a zip code zone was fetched from the cache in a large ecommerce site with tens of thousands of users using it all day long. Worked great in DEV :D but when the thundering herd hit it, the entire site came down.
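The classic fix for that thundering-herd pattern is single-flight loading: on a cache miss, one thread recomputes the entry while the rest wait, instead of every request hammering the backing store at once. A minimal sketch with invented names (`fetch_zip_zone`, `load_from_db` are placeholders, not the site's real API):

```python
import threading

_cache = {}
_locks = {}
_locks_guard = threading.Lock()

def fetch_zip_zone(zip_code, load_from_db):
    if zip_code in _cache:          # fast path: cache hit, no locking
        return _cache[zip_code]
    with _locks_guard:              # one lock object per key
        lock = _locks.setdefault(zip_code, threading.Lock())
    with lock:                      # single-flight: first thread loads,
        if zip_code not in _cache:  # later threads find the cache filled
            _cache[zip_code] = load_from_db(zip_code)
    return _cache[zip_code]
```

With this shape, a cold cache under heavy traffic costs one database query per key rather than one per request.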


Startup, shutdown and migration are all periods of significantly elevated risk. Especially for systems that have been up for a long time, there are all kinds of ways in which things can go pear-shaped: drives that die on shutdown (or the subsequent boot-up), RAIDs that fail to rebuild, cascading failures, power supplies that fail, UPSes that fail, generators that don't start (or that run for 30 seconds and then quit because someone made off with the fuel), and so on.


I posted this, in case we want to collect these gems: https://news.ycombinator.com/item?id=37160295


In terms of engineering (QC, processes etc), modern day software industry is worse than almost any other industry out there. :-(

And no, sheer complexity or a fast-moving environment is a factor, but it's not the issue. It's that steps are skipped which are not skipped in other branches of engineering (eg. continuous improvement of processes, learning from mistakes & implementing those lessons; in software land the same mistakes are made again & again & again, poorly designed languages remain in use, the list goes on).

A long way still to go.


If you're making the same mistakes over and over again I think that says more about your company than it does about the software industry.

My first job was at a major automotive manufacturer. Implementing half the procedures they had would slow down any software company 10X - just look at the state of most car infotainment systems. If something is safety critical, obviously this makes sense but the reality is 85% of software isn't.


GP was speaking in the general sense not about their company.


Is that not coming from experience of working at a software company? As I believe you said elsewhere.


It could easily be from looking from the outside in, as it is in my case.


Oh, so, without experience. Understood.


No, with lots of experience. I've been programming since I was 17, did it professionally for three decades, and have since moved into technical due diligence, 16 years and counting. That gives me a pretty unique perspective on IT: I get to talk to and work with teams from a very large sample of companies (230+ to date), which in turn gives a fairly large statistical base for that opinion. This includes seed-stage companies, mid-sized and very large ones.

So forgive me if I take your attitude as non-productive, you are essentially just trying to discount my input based on some assumptions because it apparently doesn't please you. I'm fine with that but then just keep it to yourself instead of pulling down the level of discourse here. If you wanted to make a point you could have been constructive rather than dismissive.


And reading this thread it doesn't look as if there is much awareness of that.


> In terms of engineering (QC, processes etc), modern day software industry is worse than almost any other industry out there. :-(

How do you know that?


Take airplane safety: a plane crashes, the cause is thoroughly investigated, and the report recommends procedures to avoid that type of crash. Sometimes such recommendations become enforced across the industry. Result: air travel gets safer and safer, to the point where sitting in a (flying!) airplane all day is safer than sitting on a bench on the street.

Building regulations: similar.

Foodstuffs (hygiene requirements for manufacturers): similar.

Car parts: see ISO9000 standards & co.

Software: eg. memory leaks - they've been around forever, but every day new software is released that has them.

C: ancient, not memory safe, should really only be used for niche domains. Yet it still is everywhere.

New AAA game: pay $$ after year(s?) of development, then download a many-MB patch on day 1 because the game is buggy. It could have been tested better, but was released anyway because getting it out & making sales weighed more heavily than shipping a reliable, working product.

All of this = not improving methods.

I'm not arguing C v. Rust here or whatever. Just pointing out: better tools and better procedures exist, but using them is the exception rather than the rule.

Like I said the list goes on. Other branches of engineering don't (can't) work like that.


Exactly. The driving force is there, but what is also good is that the industry - for the most part at least - realizes that safety is what keeps them in business. So not only is there a structure of oversight and enforcement, there is also a strongly internalized culture of safety, built over decades, to build on. An engineer who proposed something obviously unsafe would not get to finish their proposal, let alone implement it.

In 'regular' software circles you can find the marketing department with full access to raw data and front end if you're unlucky.


Experience?


On HN and Reddit, experience doesn't count. Only reading about others' experiences after they've been paid to write up their research.


Don't forget to cite your sources! The nerds will rake you over the coals for not doing so.


In any other field of engineering, the engineers are all trained and qualified. In software 'engineering', not so much.


That training and qualification is only as good as the processes and standards being trained for and qualified on. We don't have those processes and standards to train against (and frankly I'm not convinced we should or even can) for generic "software engineers". I have a number of friends who are PEs, and it isn't the training and certification process that differentiates their work from mine, it is that there are very clear standards for how you engineer a safe structure or machine. But I contend that there is not a way to write such standards for "software". It's just too broad a category of thing. Writing control software for physical systems is just very different from writing a UX-driven web application. It would be odd and wasteful to have the same standards for both things.

I do think qualification would make sense for more narrow swathes of the "software engineering" practice. For instance, "Automotive Control Software Engineer", etc.


This is exactly why I support "engineer" being a protected term, like Doctor. It should tell you that a certain level of training and qualification has been met, to the point that the engineer is responsible and accountable for the work they do and sign off on. Especially for things that affect safety.

Many software engineers these days are flying by the seat of their pants, moving quickly and breaking things. Thankfully this seems to largely be in places that aren't going to affect life or limb, but I'm still rubbed the wrong way seeing people (including myself, mind you) building run-of-the-mill CRUD apps under the title of engineer.

Is it a big deal? Not really. It's probably even technically correct to use the term this way. But I do think it dilutes it a bit. For context, I'm in Canada where engineer is technically a protected term, and there are governing bodies that qualify and designate professional engineers.


I'm curious how you think the word "Doctor" is protected.

Do you mean that History PhDs can't call themselves Doctors?

Or chiropractors can't pass themselves off as doctors?

Or you mean Doctor J was licensed to perform basketball bypass surgeries?

Or perhaps podiatrists can't deliver babies?


> Thankfully this seems to largely be in places that aren't going to affect life or limb

I've seen Agile teams doing medical stuff using the latest hotness. Horrorshow.

I've also seen very, very clean software and firmware development at really small companies.

It's all over the place and you have to look inside to know what is going on. Though job advertisements sometimes can be pretty revealing.


Are all "engineers" trained on human error, safety principles, and the like? The failures described in the article are precisely not software failures.


Yes? Most engineering programs (it might even be an accreditation requirement) involve ethics classes and learning from past failures.

My CS degree program required an ethics class and discussed things like the CFAA and famous cases like Therac-25, but nobody took it seriously because STEM majors think they are god's gift to an irrational world.


The important distinction is that engineers are professionally liable.


Does anyone have a similar compendium specifically for software engineering disasters?

Not of nasty bugs like the F-22 -- those are fun stories, but they don't really illustrate the systemic failures that led to the bug being deployed in the first place. Much more interested in systemic cultural/practice/process factors that led to a disaster.


Yes, the RISKS mailing list.


Find and take a CS ethics class.


My funniest was a wrong param in a template generator which turned off escaping of parameter values provided indirectly by users. Good thing it was discovered during the yearly pen-testing analysis, because it led to shell execution in the cloud environment.
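A stripped-down illustration of that bug class, with invented names (not the actual template engine involved): a renderer whose escaping flag, when flipped by a single wrong parameter, turns user-supplied input into live markup.

```python
import html

def render(template, value, autoescape=True):
    # With autoescape on, user input is neutralized before substitution;
    # with it off, whatever the user supplied lands in the output verbatim.
    safe = html.escape(value) if autoescape else value
    return template.replace("{name}", safe)

payload = "<script>steal()</script>"
escaped  = render("<p>Hello {name}</p>", payload)                    # inert text
injected = render("<p>Hello {name}</p>", payload, autoescape=False)  # live script
```

The scary part is that both calls look identical in testing with benign input; only hostile input reveals the difference, which is exactly why a pen test caught it and day-to-day use didn't.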



