My favorite sentence: "Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event."
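For reference, the back-off behavior being described is usually some flavor of capped exponential backoff with jitter. A minimal generic sketch in Python (not AWS's actual client code; `send_request` is a hypothetical callable):

```python
import random
import time

def call_with_backoff(send_request, max_attempts=5, base=0.1, cap=10.0):
    """Capped exponential backoff with full jitter around a flaky call."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep somewhere in [0, min(cap, base * 2^attempt)]; the
            # randomness keeps a fleet of clients from retrying in lockstep.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter is the part that matters most during a congestion event: without it, clients that failed together retry together.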
I saw pleeeeeenty of untested code at Amazon/AWS. Looking back, it was almost like the most important services/code had the least testing, while internal boondoggle projects (I worked on a couple) had complicated test plans and debates about coverage metrics.
It's gotta be a whole thing to even think about how to accurately test this kind of software: simulating all kinds of hardware failures, network partitions, power failures, or the thousand other failure modes.
Then again, they pull in something like $100B in revenue; that should buy some decent unit tests.
The most important services get the most attention from leaders who apply the most pressure, especially in the first ~2y of a fast-growing or high-potential product. So people skip tests.
In reality, most real-world successful projects are mostly untested, because testing isn't actually a high-ROI endeavor. It kills me to realize that mediocre code you can hack all over to do unnatural things is generally higher value in phase I than the same code done well in twice the time.
I think the pendulum swinging back is going to look like designing code that is harder to make bad.
Typescript is a good example of trying to fix this. Rust is even better.
Deno, I think, takes things in a better direction as well.
Ultimately we're going to need systems that just don't let you do "unnatural" things but still maintain a great deal of forward mobility. I don't think that's an unreasonable ask of the future.
Interesting. I also work for a cloud provider. My team works on both internal infrastructure and product features. We take test coverage very seriously and tie the metrics to the team's perf. Any product feature must have unit tests, integration tests at each layer of the stack, staging tests, production tests, and continuous probers in production. But our reliability is still far from satisfactory. Now with your observation at AWS, I start wondering whether the coverage effort and different types of tests really help or not...
> Now with your observation at AWS, I start wondering whether the coverage effort and different types of tests really help or not...
Figuring out ROI for testing is a very tricky problem. I'm glad to hear your team invests in testing. I agree it's hard to know if you're wasting money or not doing enough!
My take is that the overwhelming majority of services underinvest in making testing easy. The services that need to grow fast due to customer demand skip the tests, while the services that aren't going much of anywhere spend way too much time on testing.
I found myself wishing for a few code snippets here; it would be interesting. A lot of the time, code that handles "connection refused" or other fast failures doesn't handle network slowness well. I've seen outages from "best effort" services (and the best-effort-ness worked when the services were hard down) because all of a sudden calls that had been taking 50 ms were not failing but were all taking 1500+ ms. Best effort, but with no client-enforced SLAs that were low enough to matter.
Load shedding never kicked in, so things had to be shut down for a bit and then restarted.
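Since the parent wished for snippets, here's a minimal sketch of the distinction, assuming a hypothetical best-effort dependency call. The fast-failure case is the easy one; the missing piece in the scenario above is a client-enforced deadline that treats 1500 ms the same as a failure:

```python
import concurrent.futures

# One shared pool; creating a pool per call would defeat the timeout,
# since shutting a pool down waits for the slow call to finish anyway.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def best_effort(call, deadline_s=0.1, default=None):
    """Run `call`, but give up after `deadline_s` and return `default`."""
    future = _POOL.submit(call)
    try:
        return future.result(timeout=deadline_s)
    except ConnectionRefusedError:
        return default  # hard down: the easy, fast-failing case
    except concurrent.futures.TimeoutError:
        return default  # merely slow: the case that causes the pain above
```

Even with the deadline, slow calls still tie up pool threads, which is exactly where load shedding has to come in.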
Seems like their normal operating state might be what is called "metastable": dynamically stable at high throughput, unless/until a brief glitch bumps the system into a state where little work gets finished, which is also stable.
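A toy illustration of that shape (my own sketch, nothing to do with their actual internals): arrivals sit below capacity, so the steady state is healthy, until a brief glitch builds a backlog. Once queueing delay exceeds the client timeout, clients resend while their originals are still queued, the server burns capacity on requests nobody is waiting for, and goodput stays near zero long after the glitch is over:

```python
def simulate(ticks=120, arrival=80, capacity=100, timeout=3, glitch=range(50, 55)):
    """Toy FIFO queue with impatient clients that resend after `timeout` ticks."""
    queue = []  # timestamps of queued requests, oldest first
    for tick in range(ticks):
        queue += [tick] * arrival
        # Clients resend once their request has waited past the timeout --
        # but the stale original stays queued and still gets served.
        queue += [tick] * sum(1 for t in queue if tick - t == timeout + 1)
        cap = 0 if tick in glitch else capacity
        served, queue = queue[:cap], queue[cap:]
        goodput = sum(1 for t in served if tick - t <= timeout)
        if tick % 10 == 0:
            print(f"tick {tick:3d}  backlog {len(queue):6d}  goodput {goodput:3d}")

simulate()
```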
thundering herd and accidental synchronization for the win
I am sad to say, I find issues like this any time I look at retry logic written by anyone I haven't previously interacted with on the topic. It is shockingly common, even at companies whose bread and butter is networking.
> It absolutely is difficult. A challenge I have seen is when retries are stacked and callers time out subprocesses that are doing retries.
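A sketch of that stacking (hypothetical gateway/backend layers): each layer retries three times without knowing the layer below it retries too, so one user request fans out into nine backend attempts, and in a real system the outer caller's deadline typically expires while the inner layer is still dutifully retrying work nobody is waiting for:

```python
import time

ATTEMPTS = {"gateway": 0, "backend": 0}

def backend_call():
    ATTEMPTS["backend"] += 1
    raise TimeoutError("backend brownout")  # simulate a degraded dependency

def with_retries(fn, tries=3, delay=0.05):
    for i in range(tries):
        try:
            return fn()
        except Exception:
            if i == tries - 1:
                raise
            time.sleep(delay)  # backoff the layers above can't see

def gateway_call():
    ATTEMPTS["gateway"] += 1
    # This layer retries, unaware the client below it already retries too.
    return with_retries(backend_call)

try:
    with_retries(gateway_call)
except TimeoutError:
    pass

print(ATTEMPTS)  # {'gateway': 3, 'backend': 9} -- 9x amplification downstream
```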
This is also a general problem with (presumably stateless) concurrent/distributed systems. It irked me when I worked on such a system, and I still haven't found meaningful resources on it that aren't extremely platform/stack/implementation specific:
A concurrent system hits some global/network-wide/partitioned-subset-wide error or backoff condition. If the workers are actually stateless and receive pushed work, communicating that condition to them either means pushing the state management back to a less concurrent orchestrator to reprioritize (introducing a huge bottleneck and a single or fragile point of failure) or accepting that a lot of failed work will be processed in pathological ways.
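For concreteness, a sketch of the first horn of that dilemma (hypothetical names, Python): each stateless worker can only hold a local view of the backoff condition, e.g. a per-process circuit breaker, so every worker in the fleet has to rediscover the same network-wide condition on its own; the alternative is centralizing that state in the orchestrator that pushes the work, i.e. the bottleneck described above:

```python
import time

class LocalBreaker:
    """Per-process circuit breaker; nothing here is shared across workers."""

    def __init__(self, threshold=5, cooldown_s=30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at, self.failures = None, 0  # simplified half-open probe
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

# Each stateless worker receiving pushed work holds its own instance, so a
# fleet of 500 workers rediscovers the same outage 500 times, and the failing
# dependency eats 500 * threshold doomed calls before the fleet backs off.
```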