> But the main difference would be that failure modes in CS tend to be binary - a piece of code produces the expected result or it doesn't.
I think if you’re talking about CS and code you’re right - it’s a very pure and deterministic system. But if you consider software engineering at scale, you do get scenarios where failure states emerge in a gradual fashion, like the OP example of a worn mechanism. An example would be monitoring error rates on a large fleet of servers. You expect a certain background level of failures, but when the rate starts to rise you might be observing the beginning of some kind of cascading failure mode. The errors might be manageable up to a point, but it isn’t always clear when the system will tip beyond that point. Post-mortems of cloud platform outages often mention these kinds of failures, because they typically arise from complex interactions between different layers that put the system into a state nobody anticipated.
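To make that concrete, here's roughly the kind of rolling-window check I have in mind - a toy Kotlin sketch where the window size, baseline rate and alert multiplier are all made-up numbers, and a real fleet would feed this from a proper metrics pipeline rather than anything this naive:

```kotlin
import java.util.ArrayDeque

// Toy rolling-window error-rate check. All thresholds are invented for
// illustration; a real fleet would drive this from a metrics pipeline.
class ErrorRateMonitor(
    private val windowMinutes: Int = 10,
    private val baselineRate: Double = 0.002, // expected background error rate
    private val alertFactor: Double = 3.0     // flag when we exceed 3x baseline
) {
    private data class Sample(val requests: Long, val errors: Long)
    private val window = ArrayDeque<Sample>()

    // Feed one minute of aggregated counts; returns true when the rolling
    // error rate has climbed past the alert threshold.
    fun record(requests: Long, errors: Long): Boolean {
        window.addLast(Sample(requests, errors))
        if (window.size > windowMinutes) window.removeFirst()

        val totalRequests = window.sumOf { it.requests }
        if (totalRequests == 0L) return false
        val rate = window.sumOf { it.errors }.toDouble() / totalRequests
        return rate > baselineRate * alertFactor
    }
}

fun main() {
    val monitor = ErrorRateMonitor()
    // Simulated minutes: steady background errors, then a climb.
    val minutes = listOf(10_000L to 20L, 10_000L to 25L, 10_000L to 90L, 10_000L to 400L)
    for ((requests, errors) in minutes) {
        if (monitor.record(requests, errors)) {
            println("error rate rising past threshold - possible start of a cascade")
        }
    }
}
```

The hard part isn't the arithmetic, it's picking the threshold - which is really the same problem as not knowing when the system will tip.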
I guess my response to that would be that any change of platform - like scaling up, replication, etc. - is an alteration of the environment, similar to changing the geometry of your car's wheels by changing their size, getting different shocks or raising/lowering it. In that case you'd expect failure states to emerge, but at that point you're already looking for unpredictable behavior. I think I was just trying to make the point that in a static system, code you write that works keeps working, whereas mechanical things break down. Emergent problems in mechanical systems come from normal wear, not from scaling.
Yes. The point is that software is not static. I work on a mobile app that doesn't really have scaling issues, but it's not that uncommon for a new OS update or phone model to introduce new bugs that need to be worked around, or to change some undocumented behavior that the app accidentally depended on. It's not that different from mechanical wear.
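For example, those workarounds often end up looking something like this (a hypothetical sketch - the API level cutoff and the parsing details are invented purely to show the shape of a version-gated fix):

```kotlin
// Hypothetical version-gated workaround. The cutoff (34) and the parsing
// logic are made up; the point is only the pattern.
fun parseDeviceTimestamp(raw: String, osApiLevel: Int): Long =
    if (osApiLevel >= 34) {
        // suppose newer OS releases return plain epoch milliseconds
        raw.trim().toLong()
    } else {
        // while older releases prepended a prefix the app had come to rely on
        raw.removePrefix("ts:").trim().toLong()
    }

fun main() {
    println(parseDeviceTimestamp("ts:1700000000000", osApiLevel = 33)) // 1700000000000
    println(parseDeviceTimestamp("1700000000000", osApiLevel = 34))    // 1700000000000
}
```

The code itself never changed; the platform underneath it did, which is why it doesn't feel that different from wear.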
In a previous job in the early 2000s they had some legacy software that was originally written for VAX/VMS and ran on an emulator. Even that environment wasn't stable enough, so they maintained some real VAX hardware just in case. And as far as I know, that hardware was physically breaking down.