I'm running a startup and we're storing north of 10PB of data and growing. We're currently on AWS and our contract is up for renewal. I'm exploring other storage solutions.
Minimum requirement: the equivalent of AWS S3 One Zone-IA (https://aws.amazon.com/s3/storage-classes/?nc=sn&loc=3).
How would you store >10PB if you were in my shoes? Feel free to run the thought experiment both with and without the data transfer cost out of our current S3 buckets.
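To keep the thought experiment concrete, here's the rough back-of-envelope I've been using (a sketch only; the per-GB prices are assumptions taken from public list pricing, not our actual negotiated rates):

```python
# Rough back-of-envelope for the thought experiment.
# Prices are assumptions based on public list pricing (not negotiated rates):
#   S3 One Zone-IA storage  ~ $0.01 per GB-month
#   S3 data transfer out    ~ $0.05 per GB at this volume tier
data_pb = 10
data_gb = data_pb * 1024 * 1024          # ~10.5M GB

storage_per_month = data_gb * 0.01       # ongoing One Zone-IA storage cost
one_time_egress = data_gb * 0.05         # one-time cost to move the data out

print(f"Storage: ~${storage_per_month:,.0f}/month")
print(f"Egress:  ~${one_time_egress:,.0f} one-time")
# Storage: ~$104,858/month
# Egress:  ~$524,288 one-time
```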
Please also mention what your experience is based on; ideally you store large amounts of data yourself and can speak from first-hand experience.
Thank you for your support!! I will post a thread once we've reached a decision on what we ended up doing.
Update:
Should have mentioned earlier: the data needs to be accessible at all times. It's user-generated data that is downloaded in the background to a mobile phone, so super low latency is not important, but less than 1000 ms is required.
The data is all images and videos, and no queries need to be performed on the data.
HPE sells their Apollo 4000[^1] line, which takes 60x3.5" drives; with 16TB drives that's 960TB per machine, so one rack of 10 of these is 9.6PB raw, which nearly covers your 10PB. (We have some racks like this.) They are not cheap. (Note: Quanta makes servers that can take 108x3.5" drives, but they need special deep racks.)
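The chassis math, roughly (the redundancy factors below are assumptions, not recommendations; raw capacity is not usable capacity, which is why I say 3 racks further down):

```python
# Chassis / rack capacity sketch (redundancy factors are assumptions;
# pick whatever your durability target actually needs).
drives_per_chassis = 60      # Apollo 4000-class, 60 x 3.5" bays
drive_tb = 16
chassis_per_rack = 10

raw_per_chassis_tb = drives_per_chassis * drive_tb               # 960 TB
raw_per_rack_pb = raw_per_chassis_tb * chassis_per_rack / 1000   # 9.6 PB raw

# Raw != usable: e.g. HDFS 3x replication vs. RS(10,4) erasure coding
usable_3x_replication = raw_per_rack_pb / 3.0        # ~3.2 PB usable
usable_ec_10_4 = raw_per_rack_pb * 10 / 14           # ~6.9 PB usable

print(f"Raw per rack:           {raw_per_rack_pb:.1f} PB")
print(f"Usable at 3x repl:      {usable_3x_replication:.1f} PB")
print(f"Usable at RS(10,4) EC:  {usable_ec_10_4:.1f} PB")
```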
The problem here would be the "filesystem" (read: the distributed service): I don't have much experience with Ceph, and ZFS across multiple machines is nasty as far as I'm aware, but I could be wrong. HDFS would work, but the latency can be completely random there.
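If you do go the HDFS route, measure read latency against your own workload before committing to the <1000ms requirement. Something like this quick probe against the WebHDFS REST endpoint is enough to get a feel for it (the namenode host and test path below are placeholders, and the default NameNode HTTP port depends on your Hadoop version):

```python
# Quick-and-dirty read-latency probe via WebHDFS (placeholder host/path;
# default NameNode HTTP port is 9870 on Hadoop 3.x, 50070 on 2.x).
import time
import requests

NAMENODE = "http://namenode.example.internal:9870"   # assumption: your NN host
TEST_PATH = "/data/images/sample.jpg"                 # assumption: a small test object

latencies = []
for _ in range(20):
    start = time.monotonic()
    # op=OPEN redirects to a DataNode; requests follows the redirect automatically
    r = requests.get(f"{NAMENODE}/webhdfs/v1{TEST_PATH}",
                     params={"op": "OPEN"}, timeout=5)
    r.raise_for_status()
    latencies.append((time.monotonic() - start) * 1000)

latencies.sort()
print(f"p50 = {latencies[len(latencies) // 2]:.0f} ms, "
      f"max = {latencies[-1]:.0f} ms")
```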
[^1]: https://www.hpe.com/uk/en/storage/apollo-4000.html
So unless you are desperate to save money in the long run, stick to the cloud, and let someone else sweat the filesystem-level issues :)
EDIT: btw, we let the dead drives "rot": replacing them would cost more, and the failure rate is not that bad, so they stay in the machine, and we disable them in fstabs, configs, etc.
EDIT2: at 10PB HDFS would be happy; buy 3 racks of those Apollos and you're done. We first started struggling at 1000+ nodes; now, with 2400 nodes, nearly 250PB of raw capacity, and literally a billion filesystem objects, we are slow as f*, so plan carefully.
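For context on where that pain comes from: the usual rule of thumb is on the order of 150 bytes of NameNode heap per namespace object (treat that figure, and the average object size below, as assumptions, not hard numbers):

```python
# Rough NameNode metadata sizing (~150 bytes/object is the usual Hadoop
# rule of thumb; average media size below is an assumption for illustration).
objects = 1_000_000_000          # ~1 billion namespace objects (files + blocks)
bytes_per_object = 150

heap_gb = objects * bytes_per_object / 1024**3
print(f"NameNode heap just for metadata: ~{heap_gb:.0f} GB")   # ~140 GB

# At 10PB of large media files you're nowhere near that:
avg_object_mb = 50               # assumption: average image/video size
files_at_10pb = 10 * 1024**3 / avg_object_mb
print(f"Files at 10PB of ~{avg_object_mb}MB media: ~{files_at_10pb / 1e6:.0f} million")
```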