> That’s a bit error rate of 1 in 10^15 requests. In the real world, we see that...

mjb · on July 27, 2023

> daily occurrence when you're operating at S3 scale

Yeah! With S3 averaging over 100M requests per second, 1 in a billion happens every ten seconds. And it's not just S3. For example, for Prime Day 2022, DynamoDB peaked at over 105M requests per second (just for the Amazon workload): https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...

In the post, Andy also talks about Lightweight Formal Methods and the team's adoption of Rust. When even extremely low probability events are common, we need to invest in multiple layers of tooling and process around correctness.

ignoramous · on July 27, 2023

James Hamilton, AWS' chief architect, wrote about this phenomenon in 2017: At scale, rare events aren't rare; https://news.ycombinator.com/item?id=14038044

jdwithit · on July 28, 2023

James' posts are always a treat. It's so rare to encounter such plain, straightforward content from someone with a title and responsibilities like his. Without layers of marketing sugar over everything. Dude just wants to post about the cool shit he did on his GeoCities-tier website and I love it.

aborsy · on July 28, 2023

This phenomenon is just multiplication of the sample size (scale) times a probability (rare).
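As a rough sketch of that multiplication, using the request rates and odds quoted elsewhere in this thread (the function names are made up for illustration):

```python
def seconds_between_events(requests_per_second: float, probability: float) -> float:
    """Mean time in seconds between occurrences of an event
    that hits each request independently with the given probability."""
    expected_per_second = requests_per_second * probability
    return 1.0 / expected_per_second


def events_per_minute(requests_per_second: float, probability: float) -> float:
    """Expected number of occurrences per minute at the given request rate."""
    return requests_per_second * probability * 60


if __name__ == "__main__":
    # 100M requests/s with a 1-in-a-billion event: about one every ten seconds.
    print(seconds_between_events(100e6, 1e-9))
    # 500k TPS with a 1-in-a-million event: about thirty per minute.
    print(events_per_minute(500e3, 1e-6))
```

This assumes every request has the same independent chance of hitting the event, which real workloads only approximate.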

maweki · on July 28, 2023

It shows that, however improbable, people do win the lottery.

It's good to be reminded of that if you've been trained for years not to play the lottery because you personally won't ever win.

In this case, the Cloud vendor is the lottery organizer and they indeed need to plan for people winning.

da39a3ee · on July 28, 2023

I agree with what I think is your sentiment -- that people seem to be treating this as if it's some sort of profound insight that you only get if you work at a very senior level in engineering for major US cloud providers, when it's in fact blindingly obvious!

Twirrim · on July 29, 2023

> that people seem to be treating this as if it's some sort of profound insight that you only get if you work at a very senior level in engineering for major US cloud providers, when it's in fact blindingly obvious!

I don't mean to imply it's a profound insight, and the discussions I had in AWS were never in those terms. It's just that when you're designing and building things that are going to operate at that scale, you have to very seriously consider the improbable.

What's more difficult is actually knowing what needs to be considered. e.g. prior to working at AWS, I don't think I'd have even considered "NIC corrupts packet, in such a way it gets to the OS mangled" as something that would be worth handling. Yet S3 and similar scale services see that and other improbable events so regularly that they actually have to consciously design for it, everywhere.

It's also one reason why larger services end up being incredibly conservative about the use of technology. You know what the failure modes are, however improbable, and can account for them. New technology tends to be kept on the fringes, and only adopted in more significant places once proven and improbable failures become understood.

da39a3ee · on July 31, 2023

Thanks! That was interesting and helpful.

Spooky23 · on July 28, 2023

Well it is - nobody maintains the level of detail required to actually know about these sorts of events.

I worked on a safety critical system where we’d find all sorts of unusual bugs… because we were looking for them. It really narrowed the scope for product selection, many vendors were just disqualified.

PaulRobinson · on July 28, 2023

Was an SDM of a team of brand new SDEs standing up a new service. In a code review, pointed to an issue that could cause a Sev2, and the SDE pushed back "that's like one in a million chance, at most". Pointed out once we were dialled up to 500k TPS (which is where we needed to be at), that was 30 times a minute... "You want to be on call that week?". Insist on Highest Standards takes on a different meaning in that stack compared to most orgs.

rubiquity · on July 27, 2023

Daily? A component I worked on that supported S3’s Index could hit a 1 in a billion issue multiple times a minute. Thankfully we had good algorithms and hardware that is a lot more reliable these days!

Twirrim · on July 27, 2023

This was 7-8 years ago now. Lot of scaling up since those days :)

rubiquity · on July 28, 2023

I’m sure my numbers are out of date now too

rkagerer · on July 27, 2023

Personally I'd love working in that kind of environment. That one in a billion hole still itches at me. There's also a slightly-perverse little voice in my head ready with popcorn in case I'm lucky enough to watch the ensuing fallout from the first major crypto hash collision :-).

fooker · on July 28, 2023

That probability is significantly lower than one in a billion.

One in a billion would be if keys were ~30 bits. Luckily they aren't.

rkagerer · on July 29, 2023

The one in a billion was in reference to storage related stats described in the article. Not private crypto keys.

delecti · on July 28, 2023

I love conversations like this that remind me how unintuitive big numbers are.

ldjkfkdsjnv · on July 27, 2023

Also worked at Amazon, saw some issues with major well known open source libraries that broke in places nobody would ever expect.

wrboyce · on July 27, 2023

Any examples you can share?

ruckfool · on July 28, 2023

Redis Node failover

ldjkfkdsjnv · on July 28, 2023

Apache tomcat starts to break down

thewakalix · on July 28, 2023

Could you elaborate?

baz00 · on July 28, 2023

We get this on a much lower scale. We have to maintain many forks because no one is responsive on taking patches.

ilyt · on July 27, 2023

I think Ceph hit similar problems and they had to add more robust checksumming to the system, as relying on just tcp checksums for integrity for example was no longer enough

benou · on July 28, 2023

Not that surprising, given this was already extensively documented in the 2000's (so already widely known by then) with iSCSI and such, see https://www.rfc-editor.org/rfc/rfc3385 for example.

Twirrim · on July 27, 2023

Yes, I remember tcp checksumming coming up as not sufficient at one stage. Even saw S3 deal with a real head-scratcher of a non-impacting event that came down to a single NIC in a single machine corrupting the tcp checksum under very specific circumstances.

jamesblonde · on July 28, 2023

HDFS never relied on only network checksums. Blocks should be checksummed and validated at clients - a reliable end-to-end guarantee.
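A minimal sketch of the end-to-end idea, assuming CRC32 from Python's `zlib` as a stand-in checksum. The `write_block` and `read_block` helpers are hypothetical and are not how HDFS or S3 actually implement this:

```python
import zlib


def write_block(data: bytes) -> tuple[bytes, int]:
    """Writer side: compute a checksum over the payload before it leaves the client."""
    return data, zlib.crc32(data)


def read_block(data: bytes, expected_crc: int) -> bytes:
    """Reader side: recompute the checksum and refuse to return corrupt data."""
    actual = zlib.crc32(data)
    if actual != expected_crc:
        raise ValueError(
            f"checksum mismatch: expected {expected_crc:#010x}, got {actual:#010x}"
        )
    return data


if __name__ == "__main__":
    payload, crc = write_block(b"some block contents")
    # Simulate a single bit flipped somewhere between writer and reader.
    corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
    try:
        read_block(corrupted, crc)
    except ValueError as err:
        print("rejected:", err)
```

The point is that the check happens at the two ends, so it covers every hop in between (NICs, switches, memory, disks) regardless of what each hop verifies on its own.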

Twirrim · on July 28, 2023

Well... yeah. S3 has checksums and all sorts of fixity checks right throughout. At no stage do they ever rely on a single mechanism. If there's one thing they're insanely paranoid about, it's data correctness and durability.

It has been several years, so I really don't remember much about the tcp checksum / corrupting NIC thing. Typically tcp checksum failures are handled entirely by the NIC, you wouldn't even notice it. My vague recollection was it coming up between two services not in the customer synchronous path (so e.g. not involved in getting data to or from the customer), and it caused something on the OS side.

I do remember that there was a contingent of engineers that were convinced it was a cosmic ray bit flip, which seems to be a thing certain types of engineers end up doing when presented with improbable-seeming circumstances. It wasn't until it had happened a second or third time (weeks later) that they realised the origin machine was the same each time, and were able to dig in deeper to the point of reproduction.

jacobgorm · on July 29, 2023

To think that when Andy’s Coho Data built their first prototype on top of my abandoned Lithium [1] code base from VMware, the first thing they did was remove “all the crazy checksumming code” to not slow things down…

[1] https://dl.acm.org/doi/10.1145/1807128.1807134

Waterluvian · on July 27, 2023

Ever see a UUID collision?

on July 28, 2023

[deleted]

kortilla · on July 28, 2023

How did you know it was a double bit flip and not just a BGP bug or an in-memory bit flip before being sent to the socket?

abwizz · on July 28, 2023

> two bit flips in the same tcp packet cancel each other out and cause the checksum to pass

checksum != parity check

not sure if there even exists a chance for this to happen
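For reference, the TCP checksum is a 16-bit one's-complement sum, so the quoted scenario is possible in principle: two flips in the same bit position of different 16-bit words, one going 1 to 0 and the other 0 to 1, leave the sum unchanged. A simplified sketch (it sums only a payload and ignores the pseudo-header and TCP header that the real checksum also covers):

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum of big-endian 16-bit words, complemented."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    original = bytes([0x12, 0x34, 0x56, 0x78])
    corrupted = bytearray(original)
    corrupted[1] ^= 0x04  # 0x34 -> 0x30: this bit goes 1 -> 0
    corrupted[3] ^= 0x04  # 0x78 -> 0x7C: same bit position goes 0 -> 1
    print(bytes(corrupted) != original)
    print(internet_checksum(bytes(corrupted)) == internet_checksum(original))
```

Whether that is what happened in the incident described above is a separate question; the sketch only shows that a sum-based checksum cannot rule it out.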

raverbashing · on July 28, 2023

Wow this is at the level of Homer Simpson "Cereal with Milk catching fire"

But yeah, mathematically possible (in AWS scale, but still) so of course it will happen once in a lifetime.

cmckn · on July 28, 2023

Eh, UUIDs are usually not truly global anyway; so you’d need a collision in the context of a single region, cell, user, resource, etc. for it to matter.

mabbo · on July 27, 2023

Even at a billion requests per second, 128 bit UUIDs shouldn't collide for something like a billion years.

And that's if you're going completely random and not taking care to try to reduce collisions.

Dylan16807 · on July 28, 2023

Are you sure about that math?

A billion seconds at a billion requests per second is already 2^60 items. You'd only need a few billion seconds to have a 50:50 collision chance with 128 random bits, and even less with a real UUID that only has 122 random bits.

You'd hit 1% odds of collision after less than a decade.

If you actually want to go for a billion years, you need to expand that UUID by 50%.

mabbo · on July 28, 2023

You know I think I converted powers of two and powers of ten interchangeably in my calculations. You're very likely correct.

danielmarkbruce · on July 28, 2023

This seems off. A few billion seconds to have a 50:50 chance? Why wouldn't a billion seconds at a billion per second (2^60 total requests) give a 1 in 2^68 chance (or 1 in 2^62 if it's really only 122 bits)?

Dylan16807 · on July 28, 2023

Birthday paradox. The number of opportunities to collide is the number of items squared. (Divided by two and a smidge)
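A quick way to check the orders of magnitude is the standard birthday approximation, p ≈ 1 - exp(-n^2 / 2N) for n items drawn uniformly from N possible values. The sketch below assumes perfectly uniform random IDs and the billion-per-second rate used in this thread; the function names are illustrative:

```python
import math


def collision_probability(items: float, random_bits: int) -> float:
    """Approximate chance of at least one collision among `items`
    uniform random values of `random_bits` bits each."""
    space = 2.0 ** random_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-(items * items) / (2.0 * space))


def items_for_probability(p: float, random_bits: int) -> float:
    """Approximate number of items at which the collision chance reaches p."""
    space = 2.0 ** random_bits
    return math.sqrt(2.0 * space * -math.log1p(-p))


if __name__ == "__main__":
    rate = 1e9  # IDs generated per second (figure from the thread)
    seconds_per_year = 365.25 * 24 * 3600
    for bits in (122, 128):
        for p in (0.01, 0.5):
            years = items_for_probability(p, bits) / rate / seconds_per_year
            print(f"{bits} random bits, p={p}: ~{years:.0f} years")
```

Under these assumptions the answers come out in years to centuries rather than a billion years, because the exponent grows with the square of the number of items.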

danielmarkbruce · on July 28, 2023

Lol. I must be brain dead. Yes.

penteract · on July 28, 2023

Because we're talking about collisions, as opposed to comparing 2^64 independent pairs. With 2^128 possible values, if you've picked 2^63 distinct ones, the chance that a randomly selected value collides with one of those is 1 in 2^65. If none of your second batch of 2^63 collide with each other, that gives a 2^63/2^65 = 1/4 chance of one of them colliding with the first batch. Considering the possibility of collisions within each batch of 2^63 brings it closer to 1 in 2.

jandrewrogers · on July 28, 2023

There have been many cases of UUIDv4 collisions because an RNG wasn’t as random as expected, due to broken RNG or developer error. It is one of those cases where practice is not as reliable as theory, and it is banned in some places as a consequence.

It depends on how paranoid you need to be.

MichaelZuo · on July 29, 2023

NIST standards on RNG are not as random as expected?

Or do you mean certain folks intentionally chose substandard implementations for some reason?

jandrewrogers · on Aug 1, 2023

A significant number of implementers roll their own UUIDv4. It seems so easy so why not? Most UUIDs are used in contexts where the devs are not that sophisticated so it isn’t that surprising that naive mistakes happen. If you are using it for distributed UUID generation, it just takes one person making a mistake to create havoc.

UUIDv4 is banned in many high security environments primarily because it is easy for people to screw up in practice and it is difficult to detect when those mistakes are made. 128 bits doesn’t leave much room for mistakes using probabilistic uniqueness.
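One way such a mistake can look, as a hypothetical illustration rather than a report of any specific incident: a hand-rolled generator built on a seedable PRNG, where two workers end up with the same seed. `homemade_uuid4` is an invented name for this sketch.

```python
import random
import uuid


def homemade_uuid4(rng: random.Random) -> uuid.UUID:
    """Hypothetical hand-rolled UUIDv4 built on a non-cryptographic PRNG."""
    return uuid.UUID(int=rng.getrandbits(128), version=4)


# Two "independent" workers that both seed from the same value,
# e.g. a coarse timestamp or a constant left over from testing.
worker_a = random.Random(1234)
worker_b = random.Random(1234)
ids_a = [homemade_uuid4(worker_a) for _ in range(3)]
ids_b = [homemade_uuid4(worker_b) for _ in range(3)]
print(ids_a == ids_b)  # the two workers emit identical ID sequences

# CPython's uuid.uuid4() takes its bytes from os.urandom instead.
safe_ids = {uuid.uuid4() for _ in range(1000)}
print(len(safe_ids))
```

Every ID from the first generator is well-formed, which is part of why this kind of bug is hard to spot by inspection.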

polynomial · on July 28, 2023

Facts.

lazide · on July 28, 2023

Shouldn’t != never happens. All sorts of weird implementation issues can cause problems.