> That’s a bit error rate of 1 in 10^15 requests. In the real world, we see that blade of grass get missed pretty frequently – and it’s actually something we need to account for in S3.
One of the things I remember from my time at AWS was conversations about how 1 in a billion events end up being a daily occurrence when you're operating at S3 scale. Things that you'd normally mark off as so wildly improbable it's not worth worrying about, have to be considered, and handled.
Glad to read about ShardStore, and especially the formal verification, property based testing etc. The previous generation of services were notoriously buggy, a very good example of the usual perils of organic growth (but at least really well designed such that they'd fail "safe", ensuring no data loss, something S3 engineers obsessed about).
> daily occurrence when you're operating at S3 scale
Yeah! With S3 averaging over 100M requests per second, 1 in a billion happens every ten seconds. And it's not just S3. For example, for Prime Day 2022, DynamoDB peaked at over 105M requests per second (just for the Amazon workload): https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...
In the post, Andy also talks about Lightweight Formal Methods and the team's adoption of Rust. When even extremely low probability events are common, we need to invest in multiple layers of tooling and process around correctness.
James' posts are always a treat. It's so rare to encounter such plain, straightforward content from someone with a title and responsibilities like his. Without layers of marketing sugar over everything. Dude just wants to post about the cool shit he did on his GeoCities-tier website and I love it.
I agree with what I think is your sentiment -- that people seem to be treating this as if it's some sort of profound insight that you only get if you work at a very senior level in engineering for major US cloud providers, when it's in fact blindingly obvious!
> that people seem to be treating this as if it's some sort of profound insight that you only get if you work at a very senior level in engineering for major US cloud providers, when it's in fact blindingly obvious!
I don't mean to imply it's a profound insight, and the discussions I had in AWS were never in those terms.
It's just that when you're designing and building things that are going to operate at that scale, you have to very seriously consider the improbable.
What's more difficult is actually knowing what needs to be considered. e.g. prior to working at AWS, I don't think I'd have even considered "NIC corrupts packet, in such a way it gets to the OS mangled" as something that would be worth handling. Yet S3 and similar scale services see that and other improbable events so regularly that they actually have to consciously design for it, everywhere.
It's also one reason why larger services end up being incredibly conservative about the use of technology. You know what the failure modes are, however improbable, and can account for them. New technology tends to be kept on the fringes, and only adopted in more significant places once proven and improbable failures become understood.
Well it is - nobody maintains the level of detail required to actually know about these sorts of events.
I worked on a safety critical system where we’d find all sorts of unusual bugs… because we were looking for them. It really narrowed the scope for product selection, many vendors were just disqualified.
Was an SDM of a team of brand new SDEs standing up a new service. In a code review, pointed to an issue that could cause a Sev2, and the SDE pushed back "that's like one in a million chance, at most". Pointed out once we were dialled up to 500k TPS (which is where we needed to be at), that was 30 times a minute... "You want to be on call that week?". Insist on Highest Standards takes on a different meaning in that stack compared to most orgs.
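The back-of-the-envelope arithmetic in the comments above (a 1-in-a-billion event at 100M requests per second, a 1-in-a-million event at 500k TPS) can be sketched as follows. The function names are made up for illustration, and the sketch assumes every request independently has the same chance of hitting the event.

```python
def seconds_between_events(requests_per_second: float, odds_one_in: float) -> float:
    """Expected seconds between occurrences of a '1 in odds_one_in' event."""
    return odds_one_in / requests_per_second


def events_per_minute(requests_per_second: float, odds_one_in: float) -> float:
    """Expected occurrences per minute of a '1 in odds_one_in' event."""
    return requests_per_second / odds_one_in * 60.0


if __name__ == "__main__":
    # Figures quoted in the thread, not measurements.
    print(seconds_between_events(100e6, 1e9))  # 1 in a billion at 100M req/s
    print(events_per_minute(500e3, 1e6))       # 1 in a million at 500k TPS
```

Plugging in other rates shows how quickly "improbable" turns into "routine" as traffic grows.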
Daily? A component I worked on that supported S3’s Index could hit a 1 in a billion issue multiple times a minute. Thankfully we had good algorithms and hardware that is a lot more reliable these days!
Personally I'd love working in that kind of environment. That one in a billion hole still itches at me.
There's also a slightly-perverse little voice in my head ready with popcorn in case I'm lucky enough to watch the ensuing fallout from the first major crypto hash collision :-).
I think Ceph hit similar problems and had to add more robust checksumming to the system, since relying on TCP checksums alone for integrity was no longer enough.
Not that surprising, given this was already extensively documented in the 2000s (so already widely known by then) with iSCSI and such; see https://www.rfc-editor.org/rfc/rfc3385 for example.
Yes, I remember TCP checksumming coming up as not sufficient at one stage. Even saw S3 deal with a real head-scratcher of a non-impacting event that came down to a single NIC in a single machine corrupting the TCP checksum under very specific circumstances.
Well... yeah. S3 has checksums and all sorts of fixity checks right throughout. At no stage do they ever rely on a single mechanism. If there's one thing they're insanely paranoid about, it's data correctness and durability.
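To illustrate why a TCP-style checksum on its own is a weak integrity check, here is a minimal sketch. It is not how S3 or Ceph do it; `internet_checksum` is a simplified 16-bit ones' complement sum in the style of RFC 1071, and the payloads are made up. Because that sum doesn't care about word order, swapping two 16-bit words goes undetected, while an application-level digest catches it.

```python
import hashlib


def internet_checksum(data: bytes) -> int:
    """Simplified 16-bit ones' complement checksum (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"AABBCCDD"
corrupted = b"CCBBAADD"  # the 16-bit words "AA" and "CC" swapped in transit

# The ones' complement sum is order-independent, so this corruption slips through.
same_checksum = internet_checksum(original) == internet_checksum(corrupted)

# A digest computed end to end over the payload does notice the difference.
same_digest = hashlib.sha256(original).digest() == hashlib.sha256(corrupted).digest()

if __name__ == "__main__":
    print("checksum matches:", same_checksum)
    print("sha256 matches:", same_digest)
```

Which digest to layer on top (CRC32C, SHA-256, something else) is a trade-off between cost and strength; the point of the sketch is only that a second, independent check catches what the first one misses.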
It has been several years, so I really don't remember much about the TCP checksum / corrupting NIC thing. Typically TCP checksum failures are handled entirely by the NIC, so you wouldn't even notice them. My vague recollection was it coming up between two services not in the customer synchronous path (so e.g. not involved in getting data to or from the customer), and it caused something on the OS side.
I do remember that there was a contingent of engineers that were convinced it was a cosmic ray bit flip, which seems to be a thing certain types of engineers end up doing when presented with improbable-seeming circumstances. It wasn't until it had happened a second or third time (weeks later) that they realised the origin machine was the same each time, and were able to dig in deeper to the point of reproduction.
To think that when Andy’s Coho Data built their first prototype on top of my abandoned Lithium [1] code base from VMware, the first thing they did was remove “all the crazy checksumming code” to not slow things down…
Eh, UUIDs are usually not truly global anyway; so you’d need a collision in the context of a single region, cell, user, resource, etc. for it to matter.
A billion seconds at a billion requests per second is already 2^60 items. You'd only need a few billion seconds to have a 50:50 collision chance with a real UUID's 122 random bits, and only around twenty billion even with a full 128 random bits.
You'd hit 1% odds of collision after roughly a decade.
If you actually want to go for a billion years, you need to expand that UUID by 50%.
This seems off. A few billion seconds to have a 50:50 chance? Why wouldn't a billion seconds at a billion per second (2^60 total requests) give a 1 in 2^68 chance (or 1 in 2^62 if it's really only 122 bits)?
Because we're talking about collisions, as opposed to comparing 2^64 independent pairs. With 2^128 possible values, if you've picked 2^63 distinct ones, the chance that a randomly selected value collides with one of those is 1 in 2^65. If none of your second batch of 2^63 collide with each other, that gives a 2^63/2^65 = 1/4 chance of one of them colliding with the first batch. Considering the possibility of collisions within each batch of 2^63 brings it closer to 1 in 2.
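The birthday-style reasoning above can be checked numerically. This is a rough sketch using the standard approximation p ≈ 1 - exp(-n² / (2·2^bits)), assuming perfectly uniform random IDs; the function names are hypothetical.

```python
import math


def collision_probability(n_items: float, random_bits: int) -> float:
    """Approximate chance of at least one collision among n_items random IDs."""
    space = 2.0 ** random_bits
    return -math.expm1(-(n_items ** 2) / (2.0 * space))


def items_for_probability(p: float, random_bits: int) -> float:
    """Approximate number of IDs needed to reach collision probability p."""
    space = 2.0 ** random_bits
    return math.sqrt(2.0 * space * -math.log1p(-p))


if __name__ == "__main__":
    rate = 1e9  # IDs per second, the figure used in this thread
    for bits in (122, 128):
        seconds_to_half = items_for_probability(0.5, bits) / rate
        print(bits, "bits: ~%.2e seconds to 50%% odds" % seconds_to_half)
    # Two batches of 2^63, i.e. 2^64 IDs total, in a 128-bit space.
    print(collision_probability(2.0 ** 64, 128))
```

Under this approximation the number of IDs needed for a given collision probability grows with the square root of the ID space, which is why 2^64-ish item counts already matter for 128-bit IDs.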
There have been many cases of UUIDv4 collisions because an RNG wasn’t as random as expected, due to broken RNG or developer error. It is one of those cases where practice is not as reliable as theory, and it is banned in some places as a consequence.
A significant number of implementers roll their own UUIDv4. It seems so easy so why not? Most UUIDs are used in contexts where the devs are not that sophisticated so it isn’t that surprising that naive mistakes happen. If you are using it for distributed UUID generation, it just takes one person making a mistake to create havoc.
UUIDv4 is banned in many high security environments primarily because it is easy for people to screw up in practice and it is difficult to detect when those mistakes are made. 128-bits doesn’t leave much room for mistakes using probabilistic uniqueness.
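As a sketch of the kind of mistake being described: the two helpers below are hypothetical, and both produce values that look like valid version-4 UUIDs. The first draws its bytes from the OS CSPRNG via `secrets`; the second uses a seeded `random.Random`, so any two generators that end up with the same seed emit the same "unique" ID, and nothing in the ID itself reveals the problem.

```python
import random
import secrets
import uuid


def uuid4_from_csprng() -> uuid.UUID:
    """Version-4 UUID built from 16 bytes of OS-provided randomness."""
    # version=4 makes uuid.UUID overwrite the version and variant bits.
    return uuid.UUID(bytes=secrets.token_bytes(16), version=4)


def uuid4_from_seeded_prng(seed: int) -> uuid.UUID:
    """Broken on purpose: a deterministic PRNG yields repeatable 'UUIDs'."""
    rng = random.Random(seed)
    return uuid.UUID(int=rng.getrandbits(128), version=4)


if __name__ == "__main__":
    good = uuid4_from_csprng()
    print(good, good.version)

    # Two "independent" workers that happened to share a seed collide immediately.
    a = uuid4_from_seeded_prng(1234)
    b = uuid4_from_seeded_prng(1234)
    print(a == b)
```

In Python the standard `uuid.uuid4()` is usually the simpler choice; the hand-rolled versions here are only to show how a well-formed UUID can still come from a bad source of randomness.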