I was going to trash this, but it actually made me think of something non-obviou...

kansface · on Nov 23, 2020

> For example, we have code in systems that performs network retries, and I have code in my own software that performs polling operations when a request/response "should" work, but in practice, a variety of reasons cause it to be less robust.

Apples and oranges. The network is unreliable and all code dealing with the network should treat it as such. The file system and disks are also unreliable, but are probably more reliable by a factor of 10K than the network. Engineers mostly choose to ignore these potential errors via unintentional decisions or do stuff like let the process crash because restarting an API server once a year doesn't matter. Of course, not everyone has that luxury. Determining how much memory to allocate ought to be perfectly deterministic which is why this is such a code smell.

> where is the line, exactly?

The line to me is pretty clear in this case! There are all sorts of reasons why this code would be acceptable: if fixing the code to prevent the bug[s] is sufficiently onerous as to cause more bugs than the fix would prevent, or is so much work that it would never be undertaken, or so complex that the code is un-mergable, an emergency hot fix... all acceptable so long as the code is documented as such. In other words, the line crossed here was in the comment. Where did the number 5 come from? Where's the link to an issue tracking memory allocation/estimation?

I believe most comments in code are worse than useless - that we shouldn't merely document bad code, but write code that is so blindingly obvious and idiomatic that comments detract from its perfection. Comments exist when we deviate from the platonic ideal — the real world with leaky abstractions, deadlines, bugs and dollars.

gfodor · on Nov 23, 2020

I don't think you understood my comment - it was raising the question where the line actually is, and if the line is drawn more upon cultural norms than solid engineering principles. That isn't an argument the code of the OP is a good idea, it isn't, the question is if there are large categories of code we write that ought to be defensive but aren't, or vice versa.

gwern · on Nov 24, 2020

I'm reminded of Feynman's Challenger report:

"[T]here are several references to previous flights; the acceptance and success of these flights are taken as evidence of safety. But erosion and blowby are not what the design expected. They are warnings that something is wrong. The equipment is not operating as expected, and therefore there is a danger that it can operate with even wider deviations in the unexpected and not thoroughly understood way. The fact that this danger did not lead to catastrophe before is no guarantee that it will not the next time, unless it is completely understood. (...) The origin and consequences of the erosion and blowby were not understood. Erosion and blowby did not occur equally on all flights or in all joints: sometimes there was more, sometimes less. Why not sometime, when whatever conditions determined it were right, wouldn't there be still more, leading to catastrophe?"

Networks are unreliable and 'have you tried turning them off and then back on' works well and we have extensive experience with them; adding some retries is well within predicted workarounds and tricks. However, parsing a data structure should be straightforward, exactly reproducible, simple, and always work and use the expected amount of memory; adding on arbitrary amounts of memory is not a standard workaround, and, somewhat like Mercury failing to be where Newton's theory predicted it should be, indicates that your mental model of the system is not merely a little fuzzy on the edges, but fundamentally incorrect and must be replaced by a completely different theory (like relativity), and in the true model, the safety and correctness may be arbitrarily different than what you thought they were (in the way that Newton & Einstein make arbitrarily different predictions if you go fast enough).

gfodor · on Nov 24, 2020

I think these responses show I failed to articulate my point well at all. I appreciate them but they don’t seem to attack the question I asked which is if defensive programming as a practice (retries just being a extreme example) is more natural when you are talking “over the wire” for some innate reason, even tho other forms of defensiveness may be equally relevant and useful in non-networked programming.

WalterBright · on Nov 23, 2020

> where is the line between hacks like this and smart behavior?

Understanding exactly what the cause of the problem. Hacks imply a fix that "seems" to work but not understanding.

zamadatix · on Nov 23, 2020

Depends what you call smart behaviour I suppose. For the vast majority of projects though I'd assume the following to be smart behaviour:

- Is it robust

- Is it fast (enough)

- Is it the simplest way to achieve those levels of robustness and speed

If the answer to the first 2 is largely "yes" and the 3rd is "no" then I'd call it a bad hack. OTOH if the answer to 3 is also "yes" then it's a good hack. If the answer to 1 or 2 is "no" then it's not ready to ship.

In either case a comment saying "this is the simple/safe way, performance opportunity via..." Or "this is the complicated way because <reasons complicated way was needed, usually perf> and it this is why it works..." is good to leave in case performance requirements ever change or someone is trying to figure out why you went the way you did.

trinovantes · on Nov 23, 2020

Maybe the line should be drawn at problems that are formally proved to be unsolvable? Retrying network calls is essentially a real world approximation to the Two Generals Problem [1].

[1] https://en.wikipedia.org/wiki/Two_Generals%27_Problem

lmm · on Nov 24, 2020

Most "defensive programming" is this kind of mistake, in my experience. For example it's far better for the program to crash immediately on an unexpected null than to pass through a bunch of functions that never expected to receive null but "defensively" bail out in that case, and then you eventually get an error miles away from the original problem. Erlang achieves extremely high reliability through its explicit "let it crash" philosophy: as soon as any component gets into an unexpected state it should abort, and rely on higher-level error handling, rather than try to continue.