99.99999999% reliability means you will not lose more than one byte in every 10 ...

andrewxdiamond · on Feb 8, 2023

Important to note that S3 does not have any Durability SLA. We promise Durability and take it extremely seriously, but there is no SLA. It is much more of an SLO.

andrewxdiamond · on Feb 9, 2023

Also, “durability” is not a property you can delegate to another service. Plenty of corruption is caused in-transit, not just at rest.

If your system handles the data in any way, you must compute and validate checksums.

If you do not have end to end checksums for the data, you do not get to claim your service adopts S3’s Durability guarantees.

S3 has that many 9s because your data is checksummed by the SDK. Every service that touches that data in any way recomputes and validates that (or a bracketed) checksum. Soup to nuts. All the way to when the data gets read out again.
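A minimal sketch of the end-to-end idea, not S3's actual implementation: the checksum is computed once at the client, and every hop validates the bytes against that original value before passing them on. SHA-256 and the names `compute_checksum`, `validate`, and `forward` are assumptions made for illustration.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when data does not match the checksum it arrived with."""


def compute_checksum(data: bytes) -> str:
    # SHA-256 is an illustrative choice, not a statement about any real service.
    return hashlib.sha256(data).hexdigest()


def validate(data: bytes, expected: str) -> None:
    actual = compute_checksum(data)
    if actual != expected:
        raise ChecksumMismatch(f"expected {expected}, got {actual}")


def forward(data: bytes, expected: str) -> tuple[bytes, str]:
    # Hypothetical intermediate service: validate on the way in, then hand on
    # the ORIGINAL checksum rather than one recomputed from possibly bad bytes.
    validate(data, expected)
    return data, expected


payload = b"customer object"
digest = compute_checksum(payload)        # computed once, at the client
data, digest = forward(payload, digest)   # hop 1
data, digest = forward(data, digest)      # hop 2
validate(data, digest)                    # final check when the data is read back
```

The point of carrying the original checksum through is that a hop which silently recomputes it from corrupted bytes would make the corruption look valid to everything downstream.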

And there is a lot more to Durability than data corruption. Protections against accidental deletions, mutations, or other data loss events come into play too. How good is your durability SLO when you accidentally overwrite one customer’s data with another’s?

Check out some of the talks S3 has on what Durability actually means, then maybe investigate how durable your service is.

https://youtu.be/P1gGFYS9LRk

ps: I haven’t looked at the code yet, but plan to. Maybe I’m being presumptuous and your service is fully secured. I’ll let you know if I find anything!

pps: I work for amazon but all my opinions are my own and do not necessarily reflect my employer’s. I don’t speak for Amazon in any way :D

andrewstuart2 · on Feb 8, 2023

As you allude to in your response, that's usually referred to as durability, not reliability. The home page could probably use an update there to reflect that terminology.

riku_iki · on Feb 8, 2023

That doesn't sound like a very practical metric, since losing one byte often makes the whole dataset useless (encryption, checksum failures).

ravi-delia · on Feb 8, 2023

It's an average. Presumably they don't smear files across disks byte by byte, since that would be insane. But with drives randomly breaking, at some point every copy of at least one file will go at once. With, say, a terabyte of files over a thousand years, you'd expect to lose files totaling about 100 KB. So probably not even one, with some small chance of losing half a drive.
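The arithmetic behind that estimate can be sketched as follows, assuming the ten nines are an annual figure (an expected loss of 1 part in 10^10 per year), a decimal terabyte, and loss that scales linearly with size and time. The function name is made up for illustration.

```python
# 99.99999999% durability leaves an expected loss of 1 part in 10**10.
# Treating that as a per-year figure is an assumption.
LOSS_DENOMINATOR = 10**10


def expected_loss_bytes(stored_bytes: int, years: int) -> float:
    # Expected value only: real loss arrives as whole files or drives,
    # not as evenly spread individual bytes.
    return stored_bytes * years / LOSS_DENOMINATOR


TERABYTE = 10**12  # decimal terabyte
loss = expected_loss_bytes(TERABYTE, years=1000)
print(f"{loss:,.0f} bytes")  # 100,000 bytes, i.e. about 100 KB
```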

riku_iki · on Feb 8, 2023

I think the probability of losing any data in 100 TB would be a good metric.

908B64B197 · on Feb 8, 2023

As in there's no durability guarantee for the data? I can expect data loss at a rate of 1b per GB per year?

CodesInChaos · on Feb 8, 2023

It's unavoidable that too many disk failures in quick succession lead to data loss. For example, if you store two copies, your durability rests on being able to detect a disk failure and create another copy before the sole remaining version dies as well.
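A back-of-the-envelope sketch of the two-copy case, under assumptions not stated in the thread: disks fail independently at a constant rate, and a replacement copy is ready a fixed number of hours after the first failure. The function name and the 2% annual failure rate are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365


def annual_loss_rate(annual_failure_rate: float, repair_hours: float) -> float:
    """Rough expected number of data-loss events per year for one two-copy pair."""
    lam = annual_failure_rate                # failures per disk per year
    window_years = repair_hours / HOURS_PER_YEAR
    # Either of the two disks can fail first (rate 2 * lam). Data is lost if
    # the surviving disk also fails before the replacement copy is ready.
    p_second_failure = -math.expm1(-lam * window_years)
    return 2 * lam * p_second_failure


one_day = annual_loss_rate(0.02, repair_hours=24)
one_week = annual_loss_rate(0.02, repair_hours=24 * 7)
print(one_day, one_week)  # the longer repair window gives the higher loss rate
```

In this model the loss rate grows roughly in proportion to the repair window, which is why fast failure detection and re-replication matter as much as the number of copies.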

juliangoldsmith · on Feb 8, 2023

"What do you mean you mean it can't recover from a 100% disk failure rate?

At least it's all in RAID 0, so the data's safe."