
A few factual inaccuracies in here that don't affect the general thrust. For example, the claim that S3 uses a 5:9 sharding scheme. In fact they use many different sharding schemes, and iirc 5:9 isn't one of them.

The main reason is that a ratio of 1.8 physical bytes to 1 logical byte is awful for HDD costs. You can get that down significantly, and you get wider parallelism and better availability guarantees to boot (consider: if a whole AZ goes down, how many shards can you lose before an object is unavailable for GET?).
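
To make that AZ question concrete (numbers purely illustrative, not a claim about S3's actual layout), here is a quick Python sketch of a hypothetical 5-of-9 scheme spread evenly over 3 AZs:

    # Hypothetical 5-of-9 scheme, 3 shards per AZ (illustrative only).
    n, k, azs = 9, 5, 3
    per_az = n // azs              # 3 shards placed in each AZ
    survivors = n - per_az         # 6 shards still reachable if one AZ is down
    margin = survivors - k         # 1 extra shard loss tolerated
    print(n / k, margin)           # 1.8 physical:logical, margin of 1

So under that hypothetical layout, one AZ outage plus two more failed drives is enough to make an object unreadable.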





See timestamp 42:20 at https://youtu.be/NXehLy7IiPM?si=QQEOMCt7kOBTMaGK

The way it’s worded makes me think that’s the scheme they’re using. Curious to hear what you know



The page loads, then quickly shifts up as some video loads, and the content is gone

Web 3 in a nutshell

Wow that is very annoying. Here is a better page

https://www.vastdata.com/blog/introducing-rack-scale-resilie...


Naively it seems difficult to decrease the 1.8x ratio while simultaneously increasing availability. The less duplication, the greater the risk of data loss if an AZ goes down? (I thought AWS promises you have a complete independent copy in all 3 AZs, though?)

To me, though, the idea that to read a single 16MB chunk you actually need to read roughly 3MB from each of 5 different hard drives, and that this is faster, is baffling.


Availability zones are not durability zones. S3 aims for objects to still be available with one AZ down, but not more than that. That does actually impose a constraint on the ratio relative to the number of AZs you shard across.

If we assume 3 AZs, then you lose 1/3 of your shards when an AZ goes down. You could do at most 6:9, which is a 1.5 byte ratio. But that's unacceptable: you know you will temporarily lose shards to HDD failure, and with 6:9 losing an AZ leaves exactly the 6 shards you need, so a single additional HDD failure makes the object unavailable. So 1.5 is the floor, and you can't actually reach it.

To lower the ratio from 1.8, it's necessary to increase the denominator (the number of shards necessary to reconstruct the object). This is not possible while preserving availability guarantees with just 9 shards.

Note that Cloudflare's R2 makes no such guarantees, and so does achieve a more favorable cost with their erasure coding scheme.

Note also that if you increase the number of shards, it becomes possible to lower the ratio without sacrificing availability. Example: with 18 shards we can choose 11:18, which gives us about 1.64 physical bytes per logical byte. And it still takes 1 AZ + 2 shards to make an object unavailable.
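
A quick sketch of the arithmetic for both cases, assuming shards are spread evenly across the 3 AZs (my assumption for the sake of the example):

    # k-of-n erasure coding, shards spread evenly across 3 AZs (sketch).
    def margin_after_az_loss(k, n, azs=3):
        """Extra shard losses tolerated after one whole AZ is down."""
        return (n - n // azs) - k

    for k, n in [(6, 9), (11, 18)]:
        print(f"{k}-of-{n}: ratio {n / k:.2f}, "
              f"margin {margin_after_az_loss(k, n)}")
    # 6-of-9:   ratio 1.50, margin 0 -> one more HDD failure and it's unreadable
    # 11-of-18: ratio 1.64, margin 1 -> still tolerates one more failed shard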

You can extrapolate from there to develop other sharding schemes that would improve the ratio and improve availability!
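
For example (same assumptions as above: 3 AZs, even spread, keep at least a 1-shard margin after an AZ outage), you can scan wider schemes and watch the ratio fall:

    # For each shard count n, pick the largest k that still leaves a
    # 1-shard margin after losing a whole AZ, and report the ratio.
    for n in range(9, 37, 3):
        k = (n - n // 3) - 1       # widest k with margin >= 1
        print(f"{k}-of-{n}: ratio {n / k:.3f}")
    # 5-of-9: 1.800 ... 11-of-18: 1.636 ... 23-of-36: 1.565

(The flip side, as the 16MB comment above hints at, is that wider stripes touch more drives per read.)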

Another key hidden assumption is that you don't worry about correlated shard loss except in the AZ down case. HDDs fail, but these are independent events. So you can bound the probability of simultaneous shard loss using the mean time to failure and the mean time to repair that your repair system achieves.
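
As a rough sketch of that bound (all numbers here are made up for illustration; real MTTF/MTTR would come from your fleet and repair pipeline):

    from math import comb

    def p_at_least(shards, losses, p):
        """P(at least `losses` of `shards` independent shards are down at once)."""
        return sum(comb(shards, i) * p**i * (1 - p)**(shards - i)
                   for i in range(losses, shards + 1))

    # Each shard is down with probability ~ MTTR / MTTF (hypothetical values).
    p = 24 / (4 * 365 * 24)        # 24 h repair, ~4-year drive MTTF
    # 11-of-18 with all AZs up: 8 simultaneous shard losses -> unavailable.
    print(p_at_least(18, 8, p))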



