When robustly tolerable beats precariously optimal (2020) (askell.blog)
42 points by pajuc 9 months ago | 22 comments



I learned this in a circuits class I took back in college, around 1984. Specifically, it was about amplifier circuits and op-amp design. Such designs are a puzzle of tradeoffs, and the teacher emphasized that often the "optimal" design is an inferior design in light of real-world constraints.

The globally optimal point on whatever thing we were optimizing might indeed be the highest peak of the graph, but if it is a sharp peak, any deviation in the voltage, temperature, or the real world values of the components would put the operating point far down the slope of that peak.

It was much better to find a reasonable operating point that had low sensitivity to voltage/temperature/component values but had acceptable behavior (gain, noise, whatever was important).

The surprising thing I learned from that class is that even though the resistor and capacitor values and the gains of individual transistors in IC op-amps are an order of magnitude worse than in discrete designs, the matching of those terrible components is an order of magnitude better than for discrete components. Designers came up with many clever ways to take advantage of that to wring terrific performance from terrible components.

For example, say the nominal value of a given resistor in the design might be 4K ohms, and in the discrete design they might be 2%, or 1%, or 0.5% off (the ones with tighter tolerance get ever more expensive), while in the monolithic design the tolerance might be +/- 20%. But all the resistors would be off by the same amount and would match each other to a fraction of a percent, even across temperature and voltage variations.
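
A quick toy illustration of why matching beats absolute tolerance for anything that depends on a ratio (say, the gain of an inverting amplifier set by Rf/Rin), as a small Monte Carlo sketch in Python. The tolerance and mismatch numbers here are made up purely for illustration, not taken from any datasheet:

    import random

    NOMINAL_RF, NOMINAL_RIN = 40_000, 4_000   # ideal gain magnitude = 10

    def discrete_pair(tol=0.01):
        # discrete 1% parts: each resistor drifts independently
        return (NOMINAL_RF  * (1 + random.uniform(-tol, tol)),
                NOMINAL_RIN * (1 + random.uniform(-tol, tol)))

    def monolithic_pair(abs_tol=0.20, mismatch=0.002):
        # on-chip parts: both share the same +/-20% process shift,
        # but match each other to about 0.2%
        shift = 1 + random.uniform(-abs_tol, abs_tol)
        return (NOMINAL_RF * shift * (1 + random.uniform(-mismatch, mismatch)),
                NOMINAL_RIN * shift)

    def gain_range(pairs):
        gains = [rf / rin for rf, rin in pairs]
        return round(min(gains), 3), round(max(gains), 3)

    N = 100_000
    print("discrete   gain range:", gain_range([discrete_pair()   for _ in range(N)]))
    print("monolithic gain range:", gain_range([monolithic_pair() for _ in range(N)]))

The absolute resistor values in the monolithic case are far worse, but the gain (the ratio) barely moves.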

The other funny effect is that when you buy a discrete 2% tolerance resistor, the distribution isn't Gaussian around the mean. That is because the manufacturers have measured all of them: the ones within 0.5% get marked up and put in the 0.5% bin, and the remaining ones within 1% tolerance get marked up less and put in the 1% bin. As a result, the distribution is bimodal on either side of the "hole" in the middle.
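
That "hole" is easy to see in a toy simulation: assume a roughly Gaussian factory spread and skim the tightest parts off into their own bins first. The numbers are entirely made up for illustration:

    import random

    NOMINAL = 4000
    parts = [random.gauss(NOMINAL, NOMINAL * 0.008) for _ in range(100_000)]

    def pct_error(r):
        return abs(r - NOMINAL) / NOMINAL * 100

    bin_half = [r for r in parts if pct_error(r) <= 0.5]          # sold as 0.5%
    bin_one  = [r for r in parts if 0.5 < pct_error(r) <= 1.0]    # sold as 1%
    bin_two  = [r for r in parts if 1.0 < pct_error(r) <= 2.0]    # sold as 2%

    print("bin sizes (0.5%, 1%, 2%):", len(bin_half), len(bin_one), len(bin_two))
    # nothing left in the 2% bin is closer than 1% to nominal -- the "hole"
    print("closest 2%-bin part to nominal:",
          round(min(pct_error(r) for r in bin_two), 2), "% off")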


This brings flashbacks of standing next to part bins in circuit labs with a multimeter in hand, trying to match components. Unfortunately, the TAs and course instructors did not emphasize your point, and the measured part values and thus the measured filter cutoffs had to be close to perfect to be checked off.

In the past, I have looked at using optimizers to solve for component values of complex analog circuits. I only looked at optimizing accuracy at one corner, but it would be interesting to see what people have done to optimize multiple variables, like including noise, across multiple corners. I think I've seen Monte Carlo simulation mentioned as a way to suggest fuzzy solutions that stay within some specification. I would be curious to see if others know more about this.


Surely anyone doing analogue design for serial or mass production would do a sensitivity analysis, even in 1984. Now it should be much easier to repeatedly randomise the values of every part within their tolerance and run a simulation to check that the end product remains in spec.

You might even do this in order to find out if wider tolerance parts would be good enough and save a little money.
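
A minimal sketch of that kind of check, assuming a first-order RC low-pass and a +/-10% window on the cutoff frequency; the spec and part tolerances here are placeholders, not from any real design:

    import math, random

    R_NOM, C_NOM = 10e3, 15.9e-9        # nominal cutoff ~1 kHz
    F_MIN, F_MAX = 900.0, 1100.0        # spec window in Hz

    def yield_in_spec(r_tol, c_tol, trials=50_000):
        ok = 0
        for _ in range(trials):
            r = R_NOM * (1 + random.uniform(-r_tol, r_tol))
            c = C_NOM * (1 + random.uniform(-c_tol, c_tol))
            fc = 1 / (2 * math.pi * r * c)
            ok += F_MIN <= fc <= F_MAX
        return ok / trials

    # would the cheaper, looser capacitors still be good enough?
    for c_tol in (0.02, 0.05, 0.10):
        print(f"1% R, {c_tol:.0%} C -> {yield_in_spec(0.01, c_tol):.1%} of units in spec")

(A real flow would use worst-case corners or the manufacturer's actual distributions rather than uniform ones, but the idea is the same.)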


> measured all of them and the ones within 0.5% get marked up and put in the 0.5% bin, and the remaining ones within 1% tolerance get marked up less and get put in the 1% bin. As a result, the distribution is bimodal on either side of the "hole" in the middle.

That's a great explanation of how iterative tolerance filtering works and why that happens!


That last part is a bit of a myth, or at least something that doesn't happen consistently nowadays. Binning parts is expensive, and for something like a resistor it's rarely worth it, so it's much more likely to just be a well dialed-in manufacturing process and design that stays within that tolerance.


This is a good evaluation of the way risk posture can inform design decisions, but I think it sort of ignores the elephant in the room: a short-term strategy wins more on average in a competitive context, especially when existential risk is on the table for entities who lose a competition. Most solutions to this problem involve solving hard coordination problems to change the balance of incentives or lower the lethality of competition. Figuring out systems that work for your level of risk tolerance is very achievable and can be incrementally improved, but designing for robustness is something that needs fractal alignment at higher meta-layers of incentives to be sustainable.


> especially when existential risk is on the table for entities who lose a competition.

This really just describes externalities. Competition creates apparent short-range existential risk to a business. Real existential risk (to the people or things damaged by your defective product because of "precarious perfection") lands somewhere else - usually where its impact is much larger.


If something is considered an externality by some system, decisions that prioritize it are by definition hard or impossible to make within that system. In other words, to change that, we have to rewrite systemic priorities to some degree


That's an interesting thought I'll give some chin stroking to. Sounds like something Kurt Gödel and John Gall would have hatched together. Is it a personal insight, from experience, or something you think might be attributable? Cheers.

> we have to rewrite systemic priorities to some degree

maybe Forrester or Meadows has an insight?


I mean I am sure I'm not the first person to have this insight - as I said I view it as baked into the definition of "externality" - but I'm also not pulling it from any specific source I can consciously recall


> but designing for robustness is something that needs fractal alignment at higher meta-layers of incentives to be sustainable

This is why I remain an engineer and nothing more. If this sentence ever became meaningful I guess I would have been over-promoted.


lol okay I'll admit upon reading it again that that was a more verbose way to make that point than it needed to be

How about "prioritizing robustness and other kinds of long-termism needs to be at least somewhat protected by the system of incentives it operates in to succeed"? I invoked fractal self-similarity there because I think you best protect long-termism by prioritizing long-termism in decision-making at scopes further out.


It's an argument against such things as HTTP/3. That yields a slight increase in performance (maybe), for which there's a large increase in complexity. Classic issue in military and industrial equipment, where you often accept somewhat less than maximum possible performance in exchange for robustness. Mechanical designers think about this a lot, because their enemies are wear, vibration, and fragility.


See advael's remark above ours on short-termism. In the commercial digital world, wear, vibration, and fragility are not the enemy - other companies are.

Until we can move past that silly winner-takes-all incentive we can't have nice things. Most of the genuinely good stuff will be stillborn. We'll always have a 5% vying for perfection in an ever-escaping, unrealistic red queen's race, while the bottom 95% suffer a dearth of the simply good-enough. How many objectively better search engines than Google died in the ditch of obscurity between 1998 and 2020?


TCP must be destroyed. It's totally ridiculous that the default behavior of video players on flaky wifi networks is that they'll open a separate TCP connection for each video segment, and then when one of those connections inevitably decides that the link's bandwidth is like 1kbps, the whole video will stall for a minute or until the player skips the segment.

HTTP/2 gets you this behavior less often, but when you get unlucky with packet loss it affects every segment instead of just one, so the player doesn't have any segments to skip forward to.


It all depends on the definition of the 'loss' function. One can actually include robustness/sensitivity as the goal and optimize for that.
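
As a toy version of that (nothing to do with the article's own examples): score each candidate design not by its nominal performance but by its worst performance under small perturbations, so a broad, slightly lower peak beats a sharp, slightly higher one.

    import math

    def performance(x):
        # sharp, slightly higher peak at x=2; broad, slightly lower peak at x=5
        return 1.2 * math.exp(-50 * (x - 2) ** 2) + math.exp(-0.5 * (x - 5) ** 2)

    def robust_score(x, spread=0.2, samples=11):
        # worst performance anywhere in a +/-spread band around the chosen point
        return min(performance(x + spread * (2 * i / (samples - 1) - 1))
                   for i in range(samples))

    xs = [i / 100 for i in range(801)]               # candidate designs 0.00 .. 8.00
    nominal_best = max(xs, key=performance)
    robust_best = max(xs, key=robust_score)

    print("nominal optimum:", nominal_best, "worst case:", round(robust_score(nominal_best), 3))
    print("robust optimum: ", robust_best, "worst case:", round(robust_score(robust_best), 3))

The nominal objective picks the precarious peak at 2; the robustness-aware one picks the tolerable plateau at 5.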


Optimizers are very good. Any "smart" change of the loss function will be equally smartly exploited by the optimizer.

The only way to optimize well is to include the uncertainty of your world model in the model.

For the travelling salesman problem, you obviously want to model that certain roads take longer to travel at different times of day. No tweak of the loss function would allow you to get realistic/robust solutions to TSP.


Why not? Just adding 2 SDs to the travel time of each road will get it to try to avoid the roads that degrade the worst.
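
A rough sketch of that, with made-up travel-time samples and brute force standing in for a real TSP solver: score each road by mean + 2*stddev of its observed times and route on that instead of the mean.

    import statistics
    from itertools import permutations

    # made-up travel-time samples (minutes) per road, taken across a day
    times = {
        ("A", "B"): [10, 10, 11, 11],
        ("A", "C"): [13, 14, 13, 14],
        ("A", "D"): [12, 13, 12, 13],
        ("B", "C"): [10, 11, 12, 25],   # usually quick, awful at rush hour
        ("B", "D"): [14, 15, 14, 15],
        ("C", "D"): [11, 10, 11, 12],
    }

    def cost(u, v, robust):
        samples = times.get((u, v)) or times[(v, u)]
        mu = statistics.mean(samples)
        return mu + 2 * statistics.stdev(samples) if robust else mu

    def best_tour(robust):
        def tour_cost(order):
            legs = zip(order, order[1:] + order[:1])
            return sum(cost(u, v, robust) for u, v in legs)
        return min(permutations("ABCD"), key=tour_cost)

    print("tour on mean times:      ", best_tour(robust=False))   # goes via B-C
    print("tour on mean + 2*stddev: ", best_tour(robust=True))    # avoids B-C

(Brute force over four cities only, obviously; the point is just the change of edge weights.)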


And run minimax on a cloud of perturbations surrounding each world model, and then weight them together in a tree after that.


Don't let the perfect be the enemy of the good.


Great short read on a concept we can find in many places, including software engineering or even UX design in general.


Curiously, this is a big facet in our dev/ops re-organization.

For example, we in infra-operations are responsible for storing the data customers upload into our systems. This data has to be considered irreproducible, especially once it's older than a few days. If we lose it, we lose it for good, and then people are disappointed and turn angry.

As such, large scale data wipes are handled very carefully with manual approvals from several different teams. The full deletion of a customer goes through us, account management, contract and us again. And this is fine. Even with the GDPR and such, it is entirely fine that deleting a customer takes 1-2 weeks. Especially because the process has caught errors in other internal processes, and errors in our customers processes. Suddenly you're the hero vendor if the customer goes "Oh fuck, noooooo".

On the other hand, stateless code updates without persistence changes are supposed to be able to move as fast as the build server allows. If it goes wrong, just deploy a fix with the next build or roll back. And sure, you can construct situations in which code changes cause big, persistent, stateful issues, but these tend to be rare with a decent dev-team.

We as infra-ops and central services need to be robust and reliable and are fine shedding speed (outside of standard requests) for this. A dev-team with a good understanding of stateful and stateless changes should totally be able to run into a wall at full speed, since they can stand back up just as quickly. We're easily looking at hours of backup restore for hosed databases. And no, there is no way to speed it up without hardware changes.



