My favorite sentence: "Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event."
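For reference, the back-off behavior being described is usually some flavor of capped exponential backoff with jitter. A minimal generic sketch in Python (not AWS's actual client code; `send_request` is a hypothetical callable):

```python
import random
import time

def call_with_backoff(send_request, max_attempts=5, base=0.1, cap=10.0):
    """Capped exponential backoff with full jitter around a flaky call."""
    for attempt in range(max_attempts):
        try:
            return send_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Sleep somewhere in [0, min(cap, base * 2^attempt)]; the
            # randomness keeps a fleet of clients from retrying in lockstep.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter is the part that matters most during a congestion event: without it, clients that failed together retry together.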
I saw pleeeeeenty of untested code at Amazon/AWS. Looking back, it was almost like the most important services/code had the least testing, while internal boondoggle projects (I worked on a couple) had complicated test plans and debates about coverage metrics.
It's gotta be a whole thing to even think about how to accurately test this kind of software: simulating all kinds of hardware failures, network partitions, power failures, or the thousand other failure modes.
Then again, they pull in something like $100B in revenue; that should buy some decent unit tests.
The most important services get the most attention from leaders who apply the most pressure, especially in the first ~2y of a fast-growing or high-potential product. So people skip tests.
In reality, most real-world successful projects are mostly untested, because testing isn't actually a high-ROI endeavor. It kills me to realize that mediocre code you can hack all over to do unnatural things is generally higher value in phase I than the same code done well in twice the time.
I think the pendulum swinging back is going to look like designing code that is harder to make bad.
Typescript is a good example of trying to fix this. Rust is even better.
Deno, I think, takes things in a better direction as well.
Ultimately we're going to need systems that just don't let you do "unnatural" things but still maintain a great deal of forward mobility. I don't think that's an unreasonable ask of the future.
Interesting. I also work for a cloud provider. My team works on both internal infrastructure and product features. We take test coverage very seriously and tie the metrics to the team's perf. Any product feature must have unit tests, integration tests at each layer of the stack, staging tests, production tests, and continuous probers in production. But our reliability is still far from satisfactory. Now with your observation at AWS, I start wondering whether the coverage effort and different types of tests really help or not...
> Now with your observation at AWS, I start wondering whether the coverage effort and different types of tests really help or not...
Figuring out ROI for testing is a very tricky problem. I'm glad to hear your team invests in testing. I agree it's hard to know if you're wasting money or not doing enough!
My take is that the overwhelming majority of services underinvest in making testing easy. The services that need to grow fast due to customer demand skip the tests, while the services that aren't going much of anywhere spend way too much time on testing.
I found myself wishing for a few code snippets here; it would be interesting. A lot of the time, code that handles "connection refused" or other fast failures doesn't handle network slowness well. I've seen outages from "best effort" services (and the best-effort-ness worked when the services were hard down) because all of a sudden calls that had been taking 50 ms were not failing but were all taking 1500+ ms. Best effort, but with no client-enforced SLAs that were low enough to matter.
Load shedding never kicked in, so things had to be shut down for a bit and then restarted.
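Since the parent wished for snippets, here's a minimal sketch of the distinction, assuming a hypothetical best-effort dependency call. The fast-failure case is the easy one; the missing piece in the scenario above is a client-enforced deadline that treats 1500 ms the same as a failure:

```python
import concurrent.futures

# One shared pool; creating a pool per call would defeat the timeout,
# since shutting a pool down waits for the slow call to finish anyway.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)

def best_effort(call, deadline_s=0.1, default=None):
    """Run `call`, but give up after `deadline_s` and return `default`."""
    future = _POOL.submit(call)
    try:
        return future.result(timeout=deadline_s)
    except ConnectionRefusedError:
        return default  # hard down: the easy, fast-failing case
    except concurrent.futures.TimeoutError:
        return default  # merely slow: the case that causes the pain above
```

Even with the deadline, slow calls still tie up pool threads, which is exactly where load shedding has to come in.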
Seems like their normal operating state might be what is called "metastable": dynamically stable at high throughput, unless/until a brief glitch bumps the system into a state where little work gets finished, which is also stable.
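A toy illustration of that shape (my own sketch, nothing to do with their actual internals): arrivals sit below capacity, so the steady state is healthy, until a brief glitch builds a backlog. Once queueing delay exceeds the client timeout, clients resend while their originals are still queued, the server burns capacity on requests nobody is waiting for, and goodput stays near zero long after the glitch is over:

```python
def simulate(ticks=120, arrival=80, capacity=100, timeout=3, glitch=range(50, 55)):
    """Toy FIFO queue with impatient clients that resend after `timeout` ticks."""
    queue = []  # timestamps of queued requests, oldest first
    for tick in range(ticks):
        queue += [tick] * arrival
        # Clients resend once their request has waited past the timeout --
        # but the stale original stays queued and still gets served.
        queue += [tick] * sum(1 for t in queue if tick - t == timeout + 1)
        cap = 0 if tick in glitch else capacity
        served, queue = queue[:cap], queue[cap:]
        goodput = sum(1 for t in served if tick - t <= timeout)
        if tick % 10 == 0:
            print(f"tick {tick:3d}  backlog {len(queue):6d}  goodput {goodput:3d}")

simulate()
```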
thundering herd and accidental synchronization for the win
I am sad to say, I find issues like this any time I look at retry logic written by anyone I haven't previously interacted with on the topic. It is shockingly common, even at companies whose bread and butter is networking.
> It absolutely is difficult. A challenge I have seen is when retries are stacked and callers time out subprocesses that are doing retries.
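A sketch of that stacking (hypothetical gateway/backend layers): each layer retries three times without knowing the layer below it retries too, so one user request fans out into nine backend attempts, and in a real system the outer caller's deadline typically expires while the inner layer is still dutifully retrying work nobody is waiting for:

```python
import time

ATTEMPTS = {"gateway": 0, "backend": 0}

def backend_call():
    ATTEMPTS["backend"] += 1
    raise TimeoutError("backend brownout")  # simulate a degraded dependency

def with_retries(fn, tries=3, delay=0.05):
    for i in range(tries):
        try:
            return fn()
        except Exception:
            if i == tries - 1:
                raise
            time.sleep(delay)  # backoff the layers above can't see

def gateway_call():
    ATTEMPTS["gateway"] += 1
    # This layer retries, unaware the client below it already retries too.
    return with_retries(backend_call)

try:
    with_retries(gateway_call)
except TimeoutError:
    pass

print(ATTEMPTS)  # {'gateway': 3, 'backend': 9} -- 9x amplification downstream
```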
This is also a general problem with (presumably stateless) concurrent/distributed systems. It irked me when I worked on such a system, and I still haven't found meaningful resources on it that aren't extremely platform/stack/implementation specific:
A concurrent system hits some global/network-wide/partitioned-subset-wide error or backoff condition. If the workers are actually stateless and receive pushed work, communicating that condition to them either means pushing the state management back to a less concurrent orchestrator to reprioritize (introducing a huge bottleneck and a single or fragile point of failure) or accepting that a lot of failed work will be processed in pathological ways.
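For concreteness, a sketch of the first horn of that dilemma (hypothetical names, Python): each stateless worker can only hold a local view of the backoff condition, e.g. a per-process circuit breaker, so every worker in the fleet has to rediscover the same network-wide condition on its own; the alternative is centralizing that state in the orchestrator that pushes the work, i.e. the bottleneck described above:

```python
import time

class LocalBreaker:
    """Per-process circuit breaker; nothing here is shared across workers."""

    def __init__(self, threshold=5, cooldown_s=30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at, self.failures = None, 0  # simplified half-open probe
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()

# Each stateless worker receiving pushed work holds its own instance, so a
# fleet of 500 workers rediscovers the same outage 500 times, and the failing
# dependency eats 500 * threshold doomed calls before the fleet backs off.
```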