
> That’s a bit error rate of 1 in 10^15 requests. In the real world, we see that blade of grass get missed pretty frequently – and it’s actually something we need to account for in S3.

One of the things I remember from my time at AWS was conversations about how 1 in a billion events end up being a daily occurrence when you're operating at S3 scale. Things that you'd normally mark off as so wildly improbable it's not worth worrying about, have to be considered, and handled.

Glad to read about ShardStore, and especially the formal verification, property based testing etc. The previous generation of services were notoriously buggy, a very good example of the usual perils of organic growth (but at least really well designed such that they'd fail "safe", ensuring no data loss, something S3 engineers obsessed about).



> daily occurrence when you're operating at S3 scale

Yeah! With S3 averaging over 100M requests per second, 1 in a billion happens every ten seconds. And it's not just S3. For example, for Prime Day 2022, DynamoDB peaked at over 105M requests per second (just for the Amazon workload): https://aws.amazon.com/blogs/aws/amazon-prime-day-2022-aws-f...

In the post, Andy also talks about Lightweight Formal Methods and the team's adoption of Rust. When even extremely low probability events are common, we need to invest in multiple layers of tooling and process around correctness.


James Hamilton, AWS' chief architect, wrote about this phenomenon in 2017: At scale, rare events aren't rare; https://news.ycombinator.com/item?id=14038044


James' posts are always a treat. It's so rare to encounter such plain, straightforward content from someone with a title and responsibilities like his. Without layers of marketing sugar over everything. Dude just wants to post about the cool shit he did on his GeoCities-tier website and I love it.


This phenomenon is just multiplication of the sample size (scale) times a probability (rare).
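
To put numbers on that multiplication, here is a minimal sketch using the rates quoted elsewhere in this thread (100M requests per second for S3, 500k TPS for the new service). The function names are made up for illustration, and it assumes every request independently triggers the event with the same probability.

```python
def seconds_between_events(requests_per_second: float, probability: float) -> float:
    """Mean gap between events when each request triggers one with `probability`."""
    return 1.0 / (requests_per_second * probability)


def events_per_minute(requests_per_second: float, probability: float) -> float:
    """Expected number of events per minute at the given request rate."""
    return requests_per_second * probability * 60.0


# "1 in a billion" at 100M requests per second.
s3_gap = seconds_between_events(100e6, 1e-9)      # about 10 seconds

# "One in a million" at 500k TPS.
sev2_rate = events_per_minute(500e3, 1e-6)        # about 30 per minute

print(f"1-in-a-billion event every {s3_gap:.1f} s")
print(f"1-in-a-million event {sev2_rate:.1f} times per minute")
```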


It shows that, however improbable, people do win the lottery.

It's good to be reminded of that if you've been trained for years not to play the lottery because you personally won't ever win.

In this case, the Cloud vendor is the lottery organizer and they indeed need to plan for people winning.


I agree with what I think is your sentiment -- that people seem to be treating this as if it's some sort of profound insight that you only get if you work at a very senior level in engineering for major US cloud providers, when it's in fact blindingly obvious!


> that people seem to be treating this as if it's some sort of profound insight that you only get if you work at a very senior level in engineering for major US cloud providers, when it's in fact blindingly obvious!

I don't mean to imply it's a profound insight, and the discussions I had in AWS were never in those terms. It's just that when you're designing and building things that are going to operate at that scale, you have to very seriously consider the improbable.

What's more difficult is actually knowing what needs to be considered. e.g. prior to working at AWS, I don't think I'd have even considered "NIC corrupts packet, in such a way it gets to the OS mangled" as something that would be worth handling. Yet S3 and similar scale services see that and other improbable events so regularly that they actually have to consciously design for it, everywhere.

It's also one reason why larger services end up being incredibly conservative about the use of technology. You know what the failure modes are, however improbable, and can account for them. New technology tends to be kept on the fringes, and only adopted in more significant places once proven and improbable failures become understood.


Thanks! That was interesting and helpful.


Well it is - nobody maintains the level of detail required to actually know about these sorts of events.

I worked on a safety critical system where we’d find all sorts of unusual bugs… because we were looking for them. It really narrowed the scope for product selection, many vendors were just disqualified.


Was an SDM of a team of brand new SDEs standing up a new service. In a code review, pointed to an issue that could cause a Sev2, and the SDE pushed back "that's like one in a million chance, at most". Pointed out once we were dialled up to 500k TPS (which is where we needed to be at), that was 30 times a minute... "You want to be on call that week?". Insist on Highest Standards takes on a different meaning in that stack compared to most orgs.


Daily? A component I worked on that supported S3’s Index could hit a 1 in a billion issue multiple times a minute. Thankfully we had good algorithms and hardware that is a lot more reliable these days!


This was 7-8 years ago now. Lot of scaling up since those days :)


I’m sure my numbers are out of date now too


Personally I'd love working in that kind of environment. That one in a billion hole still itches at me. There's also a slightly-perverse little voice in my head ready with popcorn in case I'm lucky enough to watch the ensuing fallout from the first major crypto hash collision :-).


That probability is significantly lower than one in a billion.

One in a billion would be if keys were ~30 bits. Luckily they aren't.


The one in a billion was in reference to storage related stats described in the article. Not private crypto keys.


I love conversations like this that remind me how unintuitive big numbers are.


Also worked at Amazon, saw some issues with major well known open source libraries that broke in places nobody would ever expect.


Any examples you can share?


Redis Node failover


Apache tomcat starts to break down


Could you elaborate?


We get this on a much lower scale. We have to maintain many forks because no one is responsive on taking patches.


I think Ceph hit similar problems and they had to add more robust checksumming to the system, as relying on just tcp checksums for integrity, for example, was no longer enough.


Not that surprising, given this was already extensively documented in the 2000s (so already widely known by then) with iSCSI and such, see https://www.rfc-editor.org/rfc/rfc3385 for example.


Yes, I remember tcp checksumming coming up as not sufficient at one stage. Even saw S3 deal with a real head-scratcher of a non-impacting event that came down to a single NIC in a single machine corrupting the tcp checksum under very specific circumstances.


HDFS never relied on only network checksums. Blocks should be checksummed and validated at clients - a reliable end-to-end guarantee.
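
As a rough illustration of that end-to-end idea (not HDFS's or S3's actual code): the writer computes a digest at the source, the digest travels with the block, and the reader recomputes it before trusting the bytes. The function names are hypothetical, and SHA-256 is an arbitrary choice for the sketch.

```python
import hashlib
from typing import Tuple


def write_block(data: bytes) -> Tuple[bytes, str]:
    """Writer side: pair the payload with a digest computed at the source."""
    return data, hashlib.sha256(data).hexdigest()


def read_block(data: bytes, expected_digest: str) -> bytes:
    """Reader side: recompute the digest and reject data that does not match."""
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_digest:
        raise ValueError("block failed end-to-end checksum validation")
    return data


payload, digest = write_block(b"some block contents")

# Simulate corruption somewhere between writer and reader: flip one bit.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]

try:
    read_block(corrupted, digest)
    corruption_detected = False
except ValueError:
    corruption_detected = True
```

The point of the design is that the check does not depend on any intermediate hop (NIC, switch, disk controller) behaving correctly.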


Well... yeah. S3 has checksums and all sorts of fixity checks right throughout. At no stage do they ever rely on a single mechanism. If there's one thing they're insanely paranoid about, it's data correctness and durability.

It has been several years, so I really don't remember much about the tcp checksum / corrupting NIC thing. Typically tcp checksum failures are handled entirely by the NIC, you wouldn't even notice it. My vague recollection was it coming up between two services not in the customer synchronous path (so e.g. not involved in getting data to or from the customer), and it caused something on the OS side.

I do remember that there was a contingent of engineers that were convinced it was a cosmic ray bit flip, which seems to be a thing certain types of engineers end up doing when presented with improbable-seeming circumstances. It wasn't until it had happened a second or third time (weeks later) that they realised the origin machine was the same each time, and were able to dig in deeper to the point of reproduction.


To think that when Andy’s Coho Data built their first prototype on top of my abandoned Lithium [1] code base from VMware, the first thing they did was remove “all the crazy checksumming code” to not slow things down…

[1] https://dl.acm.org/doi/10.1145/1807128.1807134


Ever see a UUID collision?


[deleted]


How did you know it was a double bit flip and not just a BGP bug or an in-memory bit flip before being sent to the socket?


> two bit flips in the same tcp packet cancel each other out and cause the checksum to pass

checksum != parity check

not sure if there even exists a chance for this to happen
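
For what it's worth, the chance does exist. The TCP checksum is a 16-bit ones' complement sum, so two flips in the same bit position of different 16-bit words, one going 0 to 1 and the other 1 to 0, leave the sum unchanged. A minimal sketch in the style of RFC 1071, with made-up payload bytes (a real TCP checksum also covers the pseudo-header and TCP header, which is left out here):

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


original = bytes([0x12, 0x34, 0x56, 0x79])

# One flipped bit: lowest bit of the first word goes 0 -> 1.
single_flip = bytes([0x12, 0x35, 0x56, 0x79])

# Two flipped bits in the same column: first word 0 -> 1, second word 1 -> 0.
double_flip = bytes([0x12, 0x35, 0x56, 0x78])

single_detected = internet_checksum(single_flip) != internet_checksum(original)
double_detected = internet_checksum(double_flip) != internet_checksum(original)
```

Here the single flip changes the checksum but the paired flips do not, which is why a plain sum is weaker than a CRC against this kind of error.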


Wow this is at the level of Homer Simpson "Cereal with Milk catching fire"

But yeah, mathematically possible (at AWS scale, but still) so of course it will happen once in a lifetime.


Eh, UUIDs are usually not truly global anyway; so you’d need a collision in the context of a single region, cell, user, resource, etc. for it to matter.


Even at a billion requests per second, 128 bit UUIDs shouldn't collide for something like a billion years.

And that's if you're going completely random and not taking care to try to reduce collisions.


Are you sure about that math?

A billion seconds at a billion requests per second is already 2^60 items. You'd only need a few billion seconds to have a 50:50 collision chance with 128 random bits, and even less with a real UUID that only has 122 random bits.

You'd hit 1% odds of collision after less than a decade.

If you actually want to go for a billion years, you need to expand that UUID by 50%.


You know I think I converted powers of two and powers of ten interchangeably in my calculations. You're very likely correct.


This seems off. A few billion seconds to have a 50:50 chance? Why wouldn't a billion seconds at a billion per second (2^60 total requests) give a 1 in 2^68 chance (or 1 in 2^62 if it's really only 122 bits)?


Birthday paradox. The number of opportunities to collide is the number of items squared. (Divided by two and a smidge)


Lol. I must be brain dead. Yes.


Because we're talking about collisions, as opposed to comparing 2^64 independent pairs. With 2^128 possible values, if you've picked 2^63 distinct ones, the chance that a randomly selected value collides with one of those is 1 in 2^65. If none of your second batch of 2^63 collide with each other, that gives a 2^63/2^65 = 1/4 chance of one of them colliding with the first batch. Considering the possibility of collisions within each batch of 2^63 brings it closer to 1 in 2.
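
The same reasoning can be checked with the usual birthday approximation, p ≈ 1 - exp(-n^2 / (2 * 2^bits)). This is a sketch under stated assumptions: perfectly uniform random IDs, a constant rate of a billion per second, and helper names invented for illustration.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def collision_probability(items: float, random_bits: int) -> float:
    """Birthday approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    return -math.expm1(-(items * items) / (2.0 * 2.0 ** random_bits))


def items_for_probability(p: float, random_bits: int) -> float:
    """Invert the approximation: n ~= sqrt(2 * 2^bits * ln(1 / (1 - p)))."""
    return math.sqrt(2.0 * 2.0 ** random_bits * -math.log1p(-p))


def years_until(p: float, random_bits: int, rate_per_second: float = 1e9) -> float:
    """Years of generating IDs at `rate_per_second` before collision odds reach p."""
    return items_for_probability(p, random_bits) / rate_per_second / SECONDS_PER_YEAR


for bits in (128, 122):
    print(bits, "bits:",
          f"1% after {years_until(0.01, bits):.1f} years,",
          f"50% after {years_until(0.5, bits):.1f} years")
```

Under this approximation, 2^64 items in a 2^128 space gives roughly a 39% collision chance, in line with the "closer to 1 in 2" estimate. At a billion IDs per second, the 122-bit case reaches 1% odds after roughly ten years and the 128-bit case after roughly eighty.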


There have been many cases of UUIDv4 collisions because an RNG wasn’t as random as expected, due to a broken RNG or developer error. It is one of those cases where practice is not as reliable as theory, and it is banned in some places as a consequence.

It depends on how paranoid you need to be.


NIST standards on RNG are not as random as expected?

Or do you mean certain folks intentionally chose substandard implementations for some reason?


A significant number of implementers roll their own UUIDv4. It seems so easy so why not? Most UUIDs are used in contexts where the devs are not that sophisticated so it isn’t that surprising that naive mistakes happen. If you are using it for distributed UUID generation, it just takes one person making a mistake to create havoc.

UUIDv4 is banned in many high security environments primarily because it is easy for people to screw up in practice and it is difficult to detect when those mistakes are made. 128 bits doesn’t leave much room for mistakes when you rely on probabilistic uniqueness.
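
One hypothetical example of the kind of mistake meant here: a hand-rolled UUIDv4 that draws from a seeded, non-cryptographic PRNG. The seed value and function name are invented for illustration. If two workers happen to seed identically (say, from the same coarse timestamp), they emit the same "random" UUID, and nothing about the output looks wrong.

```python
import random
import uuid


def naive_uuid4(rng: random.Random) -> uuid.UUID:
    """Hand-rolled UUIDv4 built from a non-cryptographic PRNG (the mistake)."""
    return uuid.UUID(int=rng.getrandbits(128), version=4)


# Two hypothetical workers that both seed from the same coarse timestamp.
seed = 1_700_000_000
worker_a = naive_uuid4(random.Random(seed))
worker_b = naive_uuid4(random.Random(seed))

# Both values are well-formed version 4 UUIDs, yet they are identical.
print(worker_a, worker_b, worker_a == worker_b)

# For comparison, CPython's uuid.uuid4() draws its bytes from os.urandom.
safe_a = uuid.uuid4()
safe_b = uuid.uuid4()
```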


Facts.


Shouldn’t != never happens. All sorts of weird implementation issues can cause problems.



