Ask HN: How would you store 10PB of data for your startup today?
307 points by philippb on April 23, 2021 | 366 comments
I'm running a startup and we're storing north of 10PB of data and growing. We're currently on AWS and our contract is up for renewal. I'm exploring other storage solutions.

Minimum requirements: those of AWS S3 One Zone IA (https://aws.amazon.com/s3/storage-classes/?nc=sn&loc=3)

How would you store >10PB if you were in my shoes? The thought experiment can be with and without the data transfer cost out of the current S3 buckets. Please also mention what your experience is based on. Ideally you store large amounts of data yourself and can speak from first-hand experience.

Thank you for your support!! I will post a thread once we've reached a decision on what we ended up doing.

Update: Should have mentioned earlier, the data needs to be accessible at all times. It’s user-generated data that is downloaded in the background to a mobile phone, so super low latency is not important, but less than 1000ms is required.

The data is all images and videos, and no queries need to be performed on the data.




Non-cloud:

HPE sells their Apollo 4000[^1] line, which takes 60x3.5" drives - with 16TB drives, that's 960TB per machine, so one rack of 10 of these is 9PB+, which nearly covers your 10PB needs. (We have some racks like this.) They are not cheap. (Note: Quanta makes servers that can take 108x3.5" drives, but they need special deep racks.)

The problem here would be the "filesystem" (read: the distributed service): I don't have much experience with Ceph, and ZFS across multiple machines is nasty as far as I'm aware, but I could be wrong. HDFS would work, but the latency can be completely random there.

[^1]: https://www.hpe.com/uk/en/storage/apollo-4000.html

So unless you are desperate to save money in the long run, stick to the cloud, and let someone else sweat about the filesystem level issues :)

EDIT: btw, we let the dead drives "rot": replacing them would cost more, and the failure rate is not that bad, so they stay in the machine, and we disable them in fstabs, configs, etc.

EDIT2: at 10PB HDFS would be happy; buy 3 racks of those Apollos, and you're done. We first started struggling at 1000+ nodes; now, with 2400 nodes, nearly 250PB raw capacity, and literally a billion filesystem objects, we are slow as f*, so plan carefully.


> The problem here would be the "filesystem" (read: the distributed service): I don't have much experience with Ceph,

I think at that scale you would want a ceph expert on staff as a full time salaried position.

For an organization that has 10PB now and can project a growth path to 15, 20, 25PB in the future, you should talk with management about creating a vacant position for that role, and filling it.

> EDIT: btw, we let the dead drives "rot": replacing them would cost more, and the failure rate is not that bad, so they stay in the machine, and we disable them in fstabs, configs, etc.

I am a huge advocate of hosting stuff yourself on bare metal you own, but this is a ridiculous statement. Any drive in that class should come with a 3 or 5 year warranty. And the manual labor and hassle time to replace one (you have hundreds of thousands of dollars of storage and no ready to go cold spares on a shelf?!?!) is infinitesimal.


OK, clarification: most of our fleet has a LOT of Supermicro machines, where it's impossible to identify a drive except by serial number. There's no UID light, the machine needs to go offline, plus some 10 screws need to come out to open the chassis, 4 more per drive.

The amount of downtime this would generate for a single machine, plus the operational cost, isn't worth the hassle unless the machine loses a significant chunk of drives.


Chassis built for mass storage usually have lever caddies backed on to hotplug SAS backplanes.


One would think indeed, but not the early FatTwins.


If the colo is far and there’s plenty of headroom, it might not justify much urgency.


That's what remote hands are for. Yes, you batch the replacements up, but this is exactly what remote hands are for.


Costs less to leave them alone, and go once a year for a trash run. Cattle not pets, no trips to the vet. Don’t waste money diagnosing / fixing.


'cattle not pets' is not a valid argument when you're the owner and operator of the bare metal hardware. Do you also recommend that ISPs not replace failed fans in core and edge routers and optical transport systems? Let things with dual power supplies run for six months on one failed power supply?

Also you've clearly never interacted with cattle, sheep, goats, llamas or alpacas, which absolutely do get things like veterinary care and vaccinations. Large animal vet is a whole specialty and they spend lots of time working on animals other than horses. No trips to the vet???


I’ve been both farmer of black angus beef cattle (on 750 acres) right down to castrating steer, and founder/owner of the world’s largest VDN (14 international data centers) right down to pulling drives.

For what it’s worth, meat packers buy dead cattle and don’t ask questions. But this is a well known metaphor, and I’m pointing out by that metaphor, no trips to the vet. As a cattle farmer, I’d argue it holds true if you’re big enough they’ve got tags not names: the vet comes to you and only if you think you’ve got a herd problem instead of an individual problem.

As for the HN angle: these are contrarian and objection-inspiring policies that let us wholesale video delivery to/through other CDNs while making a profit.


This doesn't make sense to me. I work for a CDN with tens of thousands of servers in over a hundred data centers. We are always working to improve our turnaround time on repairing servers, even though we have thousands. Hard drive failure is one of the leading causes of server failures. Dead servers means diminished capacity, and capacity is what pays our bills.

Farmers absolutely have a vet who takes care of the cattle. I am not sure what you are on about.

The whole point of cattle-vs-pets is you are supposed to treat all the servers the same, not that you have to treat all of them poorly.


Less is more.

Another contrarian view — particularly suitable for large VDN content like media, not small CDN content like html/js — is that one doesn’t need to be in over a hundred data centers, one needs to be in the key exchanges: you don’t have to be at the ends of every spoke if you pick the right hubs.

Agree swapping bad drives is a reasonable use of smart hands when done in batches, as no diagnosis is needed. I’d advocate considering extending that practice to the servers themselves. Math works if you find local tech repo/refurb shops that take gear (w/o drives) to bulk refurb & resell. The other way is to get your OEM to provide aliveness-as-a-service.

Anything so you don’t have to do manual labor, ideally ever.


You can also get units like this direct from Western Digital/HGST. We have a system with 3 of their 4U60 units, and they weren't all that expensive. Ordering direct from HGST, we only paid a small premium on top of the cost of the SAS drives.


This is the answer that worked for us storing petabytes a decade ago.

We collaborated with OEMs and also shared/compared notes with Backblaze on rackable mass storage for commodity drives.

Backblaze published a series of iterations of designs of multi-drive chassis, and one of the OEMs would make them for other buyers as well. If you’re doing this route, read through those for considerations and lessons learned.

Performance was > 10x better than enterprise solutions. A policy to “leave dead disks dead” aka “let them rot” as said elsewhere in this thread kept maintenance cheap.

The secret sauce part making this viable for commercial online storage hosting (we hosted video) was we used disks as JBOD with an in-house meta index with P2P health awareness to place objects redundantly across disks, chassis, racks, colocation providers, and regions.
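
Very rough sketch of the placement idea in Python; the cluster map, replica count, and all the names here are made up, and a real system would also track disk health and rebalance, but it shows the shape of "spread copies across failure domains":

    import hashlib
    import random

    # Toy cluster map: colo -> rack -> chassis. Entirely hypothetical.
    CLUSTER = {
        "colo-a": {"rack-1": ["chassis-1", "chassis-2"], "rack-2": ["chassis-3"]},
        "colo-b": {"rack-3": ["chassis-4", "chassis-5"], "rack-4": ["chassis-6"]},
    }

    def place(object_key, copies=3):
        """Pick `copies` chassis, no two sharing a rack, deterministically per key."""
        rng = random.Random(hashlib.sha256(object_key.encode()).digest())
        slots = [(colo, rack, chassis)
                 for colo, racks in CLUSTER.items()
                 for rack, chassis_list in racks.items()
                 for chassis in chassis_list]
        rng.shuffle(slots)
        chosen, used_racks = [], set()
        for colo, rack, chassis in slots:
            if (colo, rack) in used_racks:
                continue
            chosen.append((colo, rack, chassis))
            used_racks.add((colo, rack))
            if len(chosen) == copies:
                break
        return chosen

    print(place("user123/video456.mp4"))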


Don't buy HPE gear. Qualify the gear with sample units from a few competing vendors and you'll see why.


I’ve done qualification on hpe and various competing vendors and honestly haven’t seen dramatic differences in terms of performance and failure rates. From my experience the biggest difference was with vendor support services rather than the actual hardware. I’d be curious to hear more about your qualification experience with this particular vendor if you’d be willing.


When we set up user content storage of images and mp3s for Last.fm back in 2006ish we used MogileFS (from the bradfitz LJ perl days) running on our own hardware. 3/4/5/6u machines stuffed full of disks. I still think it's an elegant concept – easy to grok, easy to debug, easy to reason about. No special distributed filesystem to worry about.

Don't take this as an endorsement of the MogileFS perl codebase in 2021, but worth considering this style of storage system depending on your precise needs.


MinIO is an option as well and would allow you to transition from testing in S3 to your own MinIO cluster seamlessly.
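
For what it's worth, the "seamless" part is mostly that the same S3 client code keeps working and only the endpoint and credentials change. A minimal boto3 sketch (endpoint, bucket, and keys are made up):

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://minio.internal.example.com:9000",  # your MinIO cluster
        aws_access_key_id="MINIO_ACCESS_KEY",
        aws_secret_access_key="MINIO_SECRET_KEY",
    )
    s3.upload_file("photo.jpg", "user-media", "user123/photo.jpg")
    blob = s3.get_object(Bucket="user-media", Key="user123/photo.jpg")["Body"].read()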


I wonder if anyone can comment on whether they have experience running MinIO at scale. It would be a pleasant surprise if a “simple” MinIO cluster could handle such a workload.


Does it scale that far?


Likely not that "gracefully". But Ceph absolutely does, and has an S3 gateway.


Using Ceph like S3 could be a bit tricky if all of that 10PB is very small files.

Red Hat did an interesting series of blog posts about Ceph and getting it to 1 billion objects:

https://www.redhat.com/en/blog/scaling-ceph-billion-objects-...


I'd be interested to learn more about your HDFS usage and your experience at that scale. Would you be willing to have a chat? If so, my email is in my profile.


At that kind of scale, S3 makes zero sense. You should definitely be rolling your own.

10PB costs more than $210,000 per month at S3, or more than $12M after five years.

RackMountPro offers a 4U server with 102 bays, similar to the BackBlaze servers, which fully configured with 12GB drives is around $11k total and stores 1.2 PB per server. (https://www.rackmountpro.com/product.php?pid=3154)

That means that you could fit all 15TB (for erasure encoding with Minio) in less than two racks for around $150k up-front.

Figure another $5k/mo for monthly opex as well (power, bandwidth, etc.)

Instead of $12M spent after five years, you'd be at less than $500k, including traffic (also far cheaper than AWS.) Even if you got AWS to cut their price in half (good luck with that), you'd still be saving more than $5 million.
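
Back-of-the-napkin version of the above so you can plug in your own quotes; every input here is just the figure from this comment, not authoritative pricing:

    # Five-year comparison using the numbers above; adjust to your own quotes.
    S3_MONTHLY = 210_000           # ~10PB on S3, $/month
    SELF_UPFRONT = 150_000         # two racks of 102-bay servers, fully configured
    SELF_MONTHLY = 5_000           # power, bandwidth, colo, remote hands, etc.
    MONTHS = 60

    s3_total = S3_MONTHLY * MONTHS
    self_total = SELF_UPFRONT + SELF_MONTHLY * MONTHS
    print(f"S3:          ${s3_total:>12,}")
    print(f"Self-hosted: ${self_total:>12,}")
    print(f"Savings:     ${s3_total - self_total:>12,}")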

Getting the data out of AWS won't be cheap, but check out the snowball options for that: https://aws.amazon.com/snowball/pricing/


[disclaimer: while I have some small experience putting things in DC, including big GPU servers, I have never been anywhere near that scale, certainly not storage]

$10k is for a server with no hard drives. With 12TB disks, and with enough RAM, we're talking closer to $40-50k per server. Let's say for simplicity you're going to need to buy 15 of those, and let's say you only need to replace 2 of them per year. You need 25 over five years; that's already ~$750k over 5 years.

And then you need to factor in the network equipment, the hosting in a colocation space, and if storage is your core value, you need to think about disaster recovery.

You will need at least 2 people full time on this, in the US that means minimum 2x 150k$ of costs per year: over 5 years, that's 1.5m$. If you use software-defined storage, that's likely gonna cost you much more because of the skill demand.

Altogether that's all gonna cost you much more than 500k$ over 5 years. I would say you would need at least 5x to 10x this.


yes, the TCO needs consideration, not just the metal


After a certain size, AWS et al simply don't make sense, unless you have infinitely deep pockets. For storage that you pull from, AWS et al charge bandwidth costs. These costs are non-trivial for non-trivial IO. I worked up financial operational models for one of my previous employers when we were looking at the costs of remaining on S3 versus rolling it into our own DCs. The download costs, the DC space, staff, etc. were far less per year (and the download cost is a one-time cost) than the cold storage costs.

Up to about 1PB with infrequent use, AWS et al might be better. When you look at 10-100PB and beyond (we were at 500PB usable or so last I remembered) the costs are strongly biased towards in-house (or in-DC) vs cloud. That is, unless you have infinitely deep pockets.


I should add to this comment, as it may give the impression that I'm anti cloud. I'm not. Quite pro-cloud for a number of things.

The important point to understand in all of this is that there are cross-over points in the economics at which one becomes better than the other. Part of the economics is the speed of standing up new bits (the opportunity cost of not having those new bits instantly available). This flexibility and velocity is where cloud generally wins on design, for small projects (well below 10PB).

This said, if your use case appears to be rapidly blasting through these cross-over points, the economics usually dictates a hybrid strategy (best case) or a migration strategy (worst case).

And while your use case may be rapidly approaching these limits (you need to determine where they are if you are growing/shrinking), there are things you can do to de-risk and cost-reduce transitions ahead of this.

Hybrid as a strategy can work well, as long as your hot tier is outside of the cloud. Hybrid makes sense also if you have to include the possibility of deplatforming from cloud providers (which, sadly, appears to be a real, and significant, risk to some business models and people).

None of this analysis is trivial. You may not even need to do it, if you are below 1PB, and your cloud bills are reasonable. This is the approach that works best for many folks, though as you grow, it is as if you are a frog in ever increasing temperature water (with regard to costs). Figuring out the pain point where you need to make changes to get spending on a different (better) trajectory for your business is important then.


And again at an even larger size it makes sense again with >80% discounts on compute and $0 egress.


We had taken the discounts into account (we had qualified for them). The $0 egress was not a thing when we did our analysis. And we were moving 10's of PB/month. BW costs were running into sizable fractions of millions of dollars per month.


The thing about fitting everything in one rack, potentially, is vibration. There have been several studies into drive performance degradation from vibration, and there's a noticeable impact in some scenarios. The Open Compute "Knox" design as used by Facebook spins drives up when needed, and then back down, though whether that's for vibration impact, I don't know (their cold storage use [0]).

0: https://datacenterfrontier.com/inside-facebooks-blu-ray-cold...

https://www.dtc.umn.edu/publications/reports/2005_08.pdf

https://digitalcommons.mtu.edu/cgi/viewcontent.cgi?article=1...


Here is Brendan Gregg showing how vibrations can affect disk latency:

https://www.youtube.com/watch?v=tDacjrSCeq4


I'm an absolute noob here, but is using SSD racks for storage a feasible option cost wise and for this issue in particular?


Absolutely it's an option

It's gonna cost more

But it's also going to be nearly vibration-free (just the PSU fans), and stupidly-fast


No, SSDs are still way too expensive if you don't need the performance.


10PB costs more than $210,000 per month at S3, or more than $12M after five years.

Your pricing is off by a 2X - he said he's ok with infrequent access, 1 zone, which is $0.01/GB, or $100K/month.

If he rarely needs to read most of the data, he can cut the price by 1/10th by using deep archive, $0.00099 per GB, so $10K/month, or around $600K over 5 years, not including retrieval costs.
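
For reference, the per-GB list prices being discussed here work out roughly like this (2021 prices as quoted in this thread; retrieval and request fees not included, and the Standard figure is approximate):

    data_gb = 10 * 1_000_000  # 10PB in GB (decimal)
    tiers = {
        "S3 Standard":             0.021,    # approx, tiered
        "S3 One Zone-IA":          0.010,
        "S3 Glacier Deep Archive": 0.00099,
    }
    for name, per_gb in tiers.items():
        print(f"{name:24s} ${data_gb * per_gb:>10,.0f} / month")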


Nope, can't use Deep Archive as he specified max retrieval time of 1000ms. But you're correct with S3-IA


For a 10X reduction in cost, things that are impossible often become possible.


> Nope, can't use Deep Archive as he specified max retrieval time of 1000ms.

If accesses can be anticipated, pre-loading data from cold storage to something warmer might make it viable.
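
On AWS that pre-loading is a restore request per object. A hedged sketch with boto3 (the bucket, key, and restore window are made up; real code would batch these and poll for restore completion):

    import boto3

    s3 = boto3.client("s3")
    # Stage a Deep Archive / Glacier object into a temporarily retrievable copy
    # before the user is expected to ask for it.
    s3.restore_object(
        Bucket="example-user-media",
        Key="user123/video456.mp4",
        RestoreRequest={"Days": 1, "GlacierJobParameters": {"Tier": "Standard"}},
    )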


>RackMountPro offers a 4U server with 102 bays, similar to the BackBlaze servers, which fully configured with 12GB drives is around $11k total and stores 1.2 PB per server. (https://www.rackmountpro.com/product.php?pid=3154)

I dare you to buy 102 12TB drives for $11k

The cheapest consumer-class 12TB HDD is ~$275 a pop

That's $28k just for the drives


If you have a PBs of data that you rarely access, it seems to make sense to compress it first.

I've rarely seen any non-giants with PBs of data properly compressed. For example, small JSON files converted into larger, compressed parquet files will use 10-100x less space. I am not familiar with images but see no reason why encoding batches of similar images should make it hard to get similar or even better compression ratios

Also, if you decide to move off later on, your transfer costs will also be cheaper if you can move it off in a compressed form first.
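
For the small-JSON case specifically, a minimal pyarrow sketch of what "convert into larger, compressed parquet files" looks like (file names are made up, it assumes the files share a schema, and the actual ratio obviously depends on the data):

    import glob
    import pyarrow as pa
    import pyarrow.json as pj
    import pyarrow.parquet as pq

    # Fold many small newline-delimited JSON files into one compressed Parquet file.
    tables = [pj.read_json(path) for path in sorted(glob.glob("events/*.json"))]
    pq.write_table(pa.concat_tables(tables), "events-2021-04.parquet", compression="zstd")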


Could be wrong, but I don't believe batches of already-compressed images compress well

but I'd be very interested to hear about techniques for this, because I have a lot of space eaten up by timelapses myself


It's not about space reduction, it's about handling the small file problem. HDFS can handle up to 500M files without issue but the amount of RAM needed to store the files' metadata starts to go beyond what you'd typically find in a single server these days.

When you store multiple images and/or videos inside of a single PQ file, you'll end up keeping fewer files on your server.

I believe Uber stores JPEG data in PQ files and Spotify stores audio files in PQ or a similar format on their backend.
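
If it helps, the packing itself is pretty mundane; here's a sketch of stuffing a batch of image blobs into one Parquet file with pyarrow (paths are made up, and I'm not claiming this is exactly what Uber/Spotify do):

    import glob
    import pyarrow as pa
    import pyarrow.parquet as pq

    paths = sorted(glob.glob("photos/*.jpg"))
    blobs = [open(p, "rb").read() for p in paths]
    table = pa.table({"path": paths, "image": blobs})
    # JPEGs are already compressed, so skip recompression; the win is fewer files
    # (and therefore far less namenode metadata), not smaller bytes.
    pq.write_table(table, "photos-batch-0001.parquet", compression="none")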


On the contrary, batches of images with a high degree of similarity compress _very_ well. You have to use an algorithm specifically designed for that task though. Video codecs are a real-world example of such - consider that H.265 is really compressing a stream of (potentially) completely independent frames under the hood.

I'm not sure what the state of lossless algorithms might be for that though.


Best I know of for that is something like lrzip still, but even then it's probably not state of the art. https://github.com/ckolivas/lrzip

It'll also take a hell of a long time to do the compression and decompression. It'd probably be better to do some kind of chunking and deduplication instead of compression itself simply because I don't think you're ever going to have enough ram to store any kind of dictionary that would effectively handle so much data. You'd also not want to have to re-read and reconstruct that dictionary to get at some random image too.
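
A toy version of the chunk-and-dedup idea, using fixed-size chunks and a content hash as the key (the chunk size and the in-memory store are just placeholders; a real system would chunk content-defined and persist to disk):

    import hashlib

    CHUNK = 4 * 1024 * 1024      # 4 MiB chunks, arbitrary choice
    store = {}                    # sha256 -> chunk bytes; really an on-disk object store

    def ingest(path):
        """Store a file as deduplicated chunks; return the recipe to rebuild it."""
        recipe = []
        with open(path, "rb") as f:
            while block := f.read(CHUNK):
                digest = hashlib.sha256(block).hexdigest()
                store.setdefault(digest, block)   # identical chunks are stored once
                recipe.append(digest)
        return recipe

    def rebuild(recipe):
        return b"".join(store[d] for d in recipe)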


A movie is a series of similar images, and while it does allow temporal compression along a 3rd axis beyond the 2D raster, H.265 is about as good as it gets at the moment, but it's also lossy, which might not be tolerable.


H.266 VVC looks impressive. Waiting to get my hands on an FPGA codec for testing.


Right, but we're not talking about compressing a video stream, we're talking about compressing individually compressed pictures - big difference.


I’ve heard reports that minio gets slow beyond the hundreds of millions of objects threshold


You are mixing up your units, with 12GB drives and 15TB in a rack.


You didn't take personnel cost into account. You will need at least two system administrators to look after those racks (even if remote hands to change faulty drives are in the monthly opex). That quickly takes you upwards of 200k/year at current prices (which will rise another 50% in 5 years).

On the other hand, you may negotiate a very sizable discount from AWS for 10PB of storage for 5 years.


Does Snowball let you exfiltrate data from AWS? I was under the impression it was only for bulk ingestion.


First sentence on the linked page: "With AWS Snowball, you pay only for your use of the device and for data transfer out of AWS."


Wow that’s up to $500,000 just to export 10PB (depending on region).


According to https://aws.amazon.com/snowball/pricing/, egress fees depends on the region, which can range from $0.03/GB (North America & parts of Europe) to $0.05/GB (parts of Asia and Africa).

So US$300K to US$500K for egress fees + cost of Snowball devices.

The major downside of Snowball in this export case is the size limit of 80TB per device - from https://aws.amazon.com/snowball/features/ :

"Snowball Edge Storage Optimized provides 80 TB of HDD capacity for block volumes and Amazon S3-compatible object storage, and 1 TB of SATA SSD for block volumes."

That'd be around 125 Snowball devices to get 10PB out.

If OP actually has 10PB on S3 currently, the OP may want to fallback to leaving the existing data on S3 and accessing new data in the new location.


> If OP actually has 10PB on S3 currently, the OP may want to fallback to leaving the existing data on S3 and accessing new data in the new location.

I remember asking an Amazon executive in London when AWS was very new and they were evangelising it to developers; I asked him what the cost of getting data out of AWS would be if I wanted to move it to another service provider, or how easy it would be. And he avoided giving a straight simple answer. I realised then that the business model from the start was to lock developers/startups/companies in to the AWS ecosystem.


> If OP actually has 10PB on S3 currently, the OP may want to fallback to leaving the existing data on S3 and accessing new data in the new location.

Another option would be to leave data on S3, store new data locally, and proxy all S3 download requests, ie, all requests go to the local system first. If an object is on S3, download it, store it locally, then pass it on to your customer. That way your data will gradually migrate away from S3. Of course you can speed this up to any degree you want by copying objects from S3 without a customer request.

An advantage of doing this is that you can phase in your solution gradually, for example:

Phase 1: direct all requests to local proxies, always get the data from S3, send it to customers. You can do this before any local storage servers are setup.

Phase 2: configure a local storage server, send all requests to S3, store the S3 data before sending to customers. If the local storage server is full, skip the store.

Phase 3: send requests to S3, if local servers have the data, verify it matches, send to customer

Phase 4: if local servers have the data, send it w/o S3 request. If not, make S3 request, store it locally, send data

Phase 5: store new data both locally and on S3

At this point you are still storing data on S3, so it can be considered your master copy and your local copy is basically a cache. If you lose your entire local store, everything will still work, assuming your proxies work. For the next phase, your local copy becomes the master, so you need to make sure backups, replication, etc are all working before proceeding.

Phase 6: start storing new content locally only.

Phase 7: as a background maintenance task, start sending list requests to S3. For objects that are stored locally, issue S3 delete requests to the biggest objects first, at whatever rate you want. If an object isn't stored locally, make a note that you need to sync it sometime.

Phase 8: using the sync list, copy S3 objects locally, biggest objects first, and remove them from S3.

The advantage IMO is that it's a gradual cutover, so you don't have to have a complete, perfect local solution before you start gaining experience with new technology.
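
The core of phases 2-4 is basically a read-through cache. A minimal sketch of that piece (bucket and cache path are made up; a real service adds auth, range requests, error handling, verification, and capacity checks):

    import os
    import boto3

    BUCKET = "example-user-media"
    CACHE_ROOT = "/srv/objects"
    s3 = boto3.client("s3")

    def get_object(key):
        local_path = os.path.join(CACHE_ROOT, key)
        if os.path.exists(local_path):                 # local hit: no S3 request at all
            with open(local_path, "rb") as f:
                return f.read()
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, "wb") as f:              # cache it, so data migrates over time
            f.write(body)
        return body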


There's also the snowmobile https://aws.amazon.com/snowmobile/


The AWS Snowmobile pages only talk about migrating INTO AWS, not OUT OF.

from https://aws.amazon.com/snowmobile/ :

AWS Snowmobile is an Exabyte-scale data transfer service used to move extremely large amounts of data to AWS. You can transfer up to 100PB per Snowmobile, a 45-foot long ruggedized shipping container, pulled by a semi-trailer truck. Snowmobile makes it easy to move massive volumes of data to the cloud, including video libraries, image repositories, or even a complete data center migration.

from https://aws.amazon.com/snowmobile/faqs/ :

Q: What is AWS Snowmobile?

AWS Snowmobile is the first exabyte-scale data migration service that allows you to move very large datasets from on-premises to AWS.


I mean, the title on the snowmobile page says:

> Migrate or transport exabyte-scale data sets into and out of AWS


Unfortunately the header is misleading. The FAQ says:

Q: Can I export data from AWS with Snowmobile?

Snowmobile does not support data export. It is designed to let you quickly, easily, and more securely migrate exabytes of data to AWS. When you need to export data from AWS, you can use AWS Snowball Edge to quickly export up to 100TB per appliance and run multiple export jobs in parallel as necessary


That wording is not inconsistent with the interpretation that Snowball is for in only.


You realize you can't fit 10 appliances of 4U in a rack? (A rack is 42U)

There's network equipment and power equipment that requires space in the rack. There's power limitations and weight limitations on the rack that prevents to fill it to the brim.


I've put 39U of drives in a rack before. You only need 1U for a network switch, and you can get power that attaches vertically to the back, so it doesn't take up any space. If you have a cabinet with rack in front and back and all the servers have rails, the weight shouldn't be an issue.

The biggest issue will be cooling depending on how hot your servers run.

Specifically, it was a rack full of Xserve RAIDs, which are 3U each and about 100lbs each. So that was over 1300lbs.


Looked up some specs.

* A typical rack is rated for somewhere between 450 and 900 kg (your mileage may vary).

* A disk is about 720g.

* A 4U quanta enclosure is 36 kg empty.

* With 10 enclosures of 60 disks, that's a total of 792 kg inside the rack.

You will want to check what rack you have exactly and weight things up.

The rack itself is another 100 to 200 kg. You will want to double check whether the floor was designed to carry 1 ton per square meter. It might not be.

My personal tip. Definitely do NOT put a row of that in an improvised room in an average office building. You might have a bad surprise with the floor. ;)

Anyway. The project will probably be abandoned after the OP tries to assemble the first enclosure (80kg fully loaded) and realize he's not going to move that.


You run a single network switch for a rack filled to the brim with drives?


Sure. A single rack is a common failure domain so you make sure to replicate across racks.

E.g. Dunno about anyone else, but Facebook racks (generally) have a single switch.


That seems like a rather unnecessary risk. Sure, you stripe across the racks, but another ToR switch in an MLAG configuration is a minuscule expense compared to the costs involved here.


You could easily run two switches, there would be enough room. But normally yes, I'd run one switch per rack. Switch failure is pretty rare, and when it does happen it's pretty easy to switch it out for a spare.


Gold standard APC PDUs are all 0U side mount.


What if you want to move off S3? Let's do the math.

* To store 10+ PB of data.

* You need 15 PB of storage (running at 66% capacity)

* You need 30 PB of raw disks (twice for redundancy).

You're looking at buying thousands of large disks, on the order of a million dollars upfront. Do you have that sort of money available right now?

Maybe you do. Then, are you ready to receive and handle entire pallets of hardware? That will need to go somewhere with power and networking. They won't show up for another 3-6 months because that's the lead time to receive an order like that.

If you talk to Dell/HP/other, they can advise you and sell you large storage appliances. Problem is, the larger appliances will only host 1 or 2 PB. That's nowhere near enough.

There is a sweet spot in moving off the cloud, if you can fit your entire infrastructure into one rack. You're not in that sweet spot.

You're going to be filling multiple racks, which is a pretty serious issue in terms of logistics (space, power, upfront costs, networking).

Then you're going to have to handle "sharding" on top of the storage because there's no filesystem that can easily address 4 racks of disks. (Ceph/Lustre is another year long project for half a person).

The conclusion of this story: S3 is pretty good. Your time would be better spent optimizing the software. What is expensive? The storage or the bandwidth or both?

* If it's the bandwidth. You need to improve your CDN and caching layer.

* If it's the storage. You should work on better compression for the images and videos. And check whether you can adjust retention.


> Let's do the math.

Offers no math.

At retail, 625 16TB drives is $400000. This is about 2x the MONTHLY retail s3 pricing. Further, as we all know, AWS bandwidth pricing is absolutely bonkers (1).

I think your conclusion that S3 is "pretty good" needs a lot more math to support.

(1) https://twitter.com/eastdakota/status/1371252709836263425


The math should also include the price of the staff who babysit 625 spinning metal disks, who likely drive to a data center multiple times a week to swap failed drives. I shudder to think if this job fell in my lap!


Sure, but you actually have to go through the steps. The very first step, back of the napkin, indicates savings. More than enough to warrant a more detailed evaluation. Then you can start considering the more complicated factors (redundancy, staffing, power, ...).

My response was in the context of someone who didn't do any of that.

Also, my tongue-in-cheek response to you is: the price will be offset by the SRE engineers who were babysitting your AWS setup that you'll no longer need. (More seriously, I don't think finding quality sysadmins who enjoy this stuff is particularly harder than finding quality roles for any other tech positions).


Been there, done that, at all levels. I would much rather be working on a 10PB set of hardware racks, including all the drive replacements. When you factor in the costs of compute hardware (to make that useful), networking equipment, etc, the cost trebles again, and then again for the power, cooling, and cage space to run it all. The actual break-even point of running your own hardware is more like 2 years.

But it's not about price: It's about control, and it's about the expertise you gain from running all of that. If you have 10PB of data, you should have someone in-house who knows how to work with 10PB of data at a low level, and the best way to get that is to employ people at all levels to make that work. You gain significant advantage from having the direct performance data and the expertise of having techs whose 9-5 is replacing disks.


>>> My response was in the context of someone who didn't do any of that.

I did and you ignored all of it -_-

* To store 10+ PB of data.

* You need 15 PB of storage (running at 66% capacity)

* You need 30 PB of raw disks (twice for redundancy).

>>> At retail, 625 16TB drives is $400000.

That's only 10 PB of disks. That's about one third of the actual need.

Please triple your number and we will start talking. That's about 2000 disks and well above a million dollar.

You can't just call a supplier and get a thousand 16TB disks (or even a hundred). They don't have that in stock now. The lead time might be 6 months to get a few hundred. They might not have 16TB disks for sale at all; the closest might be a 12 or 14TB.

Handling large amount of hardware is a logistic problem. Not a cost problem.


With over 600 drives you would have at least 20 hot spares, and I don't think you'd have more than 2 drives fail per week if you don't have bad batches of them.


> 625 16TB drives is $400000

how much is the real estate cost of 625 drives and associated machinery to run it?

At a guess, AWS has an operating margin of about 30%, so you can approximate their cost of hardware, bandwidth, and other fixed costs as 70% of their sticker price. As a startup, can you actually get this price to be lower? I actually don't think you can, unless your operation is very small and can be done out of a home/small office.


Their margin on bandwidth is literally over 1000%. Quick google says that S3 costs 320% more than Backblaze (which, presumably, isn't running at a loss).

> At a guess

The comments in this discussion that try to provide actual numbers show a fairly lopsided argument against S3. The comments that are advocating for S3 aren't as detailed.

You can look at this at the macro level, as one comment did, and see that one 1.2PB RackmountPro 4U server is $11K. Yes, of course you still need space and power. But at least this gives us actual numbers to play with as a base (e.g. buying 10 of these is less than what you'll spend on S3 in a month).

At a micro level: you can spend $650 on a 16TB hard drive, or $650 on 16TB for 2 months of S3. Now, S3 is battle-tested, has redundancy, has power, has a CPU, has a network card (but not bandwidth), and is managed - unquestionably HUGE wins. But the hard drive (and other equipment) come with a 3-5 year warranty. Now, the difference between $650 for the hard drive and $12000 for S3 over 3 years won't let you get the power, rent the racks, hire the staff, and invest in learning Ceph. But the difference between $400K and $5 million will.


> one 1.2PB RackmountPro 4U server is $11K

An empty 4U server with 96+ bays looks like it will set you back ~$7k minimum. At $500 per drive (I have no idea what volume discounts are like) filling it with drives would be in the range of ~$50k. You'd still need RAM. And (as you noted) space and power.

I have no idea how the math ends up working out, but a 1PB appliance in working order is nowhere near as cheap as $11k.


>>> You can look at this at the macro level, as on comment did, and see that one 1.2PB RackmountPro 4U server is $11K

protip: $11k is the cost of the empty enclosure. disks are sold separately.


FWIW you can get great redundancy with far less than 2x storage factor. e.g. Facebook uses a 10:14 erasure coding scheme[1] so they can lose up to 4 disks without losing data, and that only incurs a 1.4x storage factor. If one's data is cold enough, one can go wider than this, e.g. 50:55 or something has a 1.1x factor.

Not that this fundamentally changes your analysis and other totally valid points, but the 2x bit can probably be reduced a lot.

[1] https://engineering.fb.com/2015/05/04/core-data/under-the-ho...
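
The arithmetic, for anyone playing along (k data shards out of n total shards; any k of the n reconstruct the object):

    # (k, n) erasure-coding schemes: storage factor n/k, survives n-k lost shards.
    for k, n in [(10, 14), (50, 55), (8, 12), (1, 2)]:   # 1:2 == plain 2x replication
        print(f"{k}:{n}  factor {n / k:.2f}x, tolerates {n - k} simultaneous failures")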


https://en.wikipedia.org/wiki/Parchive

Basically they use par2 multi-file archives for cold storage with each archive file segment scattered across different physical locations. Always fun to see the kids rediscovering tricks from the old days.


> If you talk to Dell/HP/other, they can advise you and sell you large storage appliances. Problem is, the larger appliances will only host 1 or 2 PB. That's nowhere near enough.

This is just incorrect.

If you talk to HPE, they should be quite happy to sell you my employer's software (Qumulo) alongside their hardware. 10+ PB is definitely supported. (The HPE part is not required.)

If you talk to Dell EMC, they will quite happily sell you their competing product, which is also quite capable of scaling beyond 1-2PB.


Most (all?) enterprise vendors will go well beyond 1-2PB.

Four years ago, one of the all flash vendors routinely advertised “well under a dollar a gigabyte”. Their prices have dropped dramatically since then, but the out of date numbers translate to “well under a million per PB”. That’s at the high end of performance with posix (nfs) or crash coherent (block) semantics. (Some also do S3, if that’s preferable for some reason)

With a 5 year depreciation cycle, those old machines were at << $16K / month per PB. Today’s all flash systems fit multiple PB per rack, and need less than one full time admin.

Hope that helps.


I've checked what I could find on Qumulo. It is software that you run on top of regular servers, to form a storage cluster.

It seems to me you're only confirming my previous point, that you need to invest in complicated/expensive software to make the raw storage usable.

>>> Then you're going to have to handle "sharding" on top of the storage because there's no filesystem that can easily address 4 racks of disks. (Ceph/Lustre is another year long project for half a person).

There's no listed price on the website, you will need to call sales. Wouldn't be surprised if it started at 6 figures a year for a few servers.

It looks like it may not run on just any server, but may need certified server hardware from HP or Qumulo.


Always fun stumbling across another Qumulon on here :)


AWS is ridiculously expensively at their scale, both for storage and egress. But the choice is not only between that and building a staffed on-premise storage facility.

You can compromise at a middle ground - rent a bunch of VPS/managed servers and let the hosting companies deal with all the nastiness of managing physical hardware and CAPEX. Cost around $1.6-2/TB/month (e.g. Hetzner's SX) for raw non-redundant storage, an order of magnitude better than AWS. Comes with far more reasonably priced bandwidth too.

Build some error correction on top using one of the many open-source distributed filesystems out there, or perhaps an in-house software solution (Reed-Solomon isn't exactly rocket science). And for some 30+% overhead, depending on workload (you can have very low overhead if you have few reads or very relaxed latency requirements), you should have a decently fault-tolerant distributed storage at a fraction of AWS costs.
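
To illustrate the "not rocket science" point with the simplest possible case, here is single-parity striping (RAID-5 style: one lost stripe is recoverable). A real deployment would use proper Reed-Solomon with more parity shards; this toy is just the shape of it, and the stripe count is arbitrary:

    def make_stripes(blob, data_stripes=3):
        """Split blob into data_stripes equal stripes plus one XOR parity stripe."""
        stripe_len = -(-len(blob) // data_stripes)            # ceil division
        padded = blob.ljust(stripe_len * data_stripes, b"\0") # pad; keep len(blob) as metadata
        stripes = [padded[i * stripe_len:(i + 1) * stripe_len] for i in range(data_stripes)]
        parity = bytes(stripe_len)
        for s in stripes:
            parity = bytes(p ^ b for p, b in zip(parity, s))
        return stripes + [parity]                             # ship one stripe per host

    def recover_missing(surviving_stripes):
        """XOR of all surviving stripes reconstructs the single missing one."""
        out = bytes(len(surviving_stripes[0]))
        for s in surviving_stripes:
            out = bytes(o ^ b for o, b in zip(out, s))
        return out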


I agree that considering Hetzner is a good idea. I have used them often, never any problems, and very low pricing.


> You need to improve your CDN and caching layer.

Depends on usage patterns, but if this is 10PB of users' personal photos and videos, then you're not going to get much value from caching because the hit rate will be so low.


> If it's the bandwidth. You need to improve your CDN and caching layer.

What would you recommend for this?

(considering data is stored in S3)


Verizon and Redis have worked well for me.


> * To store 10+TB of data.

> * You need 15 TB of storage (running at 66% capacity)

> * You need 30 TB of raw disks (twice for redundancy).

Did you mean PB?


Corrected.


Very good advice!


If you have good sysadmin/devops types, this is a few racks of storage in a datacenter. Ceph is pretty good at managing something this size, and offers an S3 interface to the data (with a few quirks). We were mostly storing massive keys that were many gigabytes, so I'm not sure about performance/scaling limits with smaller keys and 10PB. I'd be sure to give your team a few months to build a test cluster, then build and scale the full-size cluster. And a few months to transfer the data...

But you'll need to balance the cost of finding people with that level of knowledge and adaptability against the cost of bundled storage packages. We were running super lean, got great deals on bandwidth and power, and had low performance requirements. When we ran the numbers for all-in costs, it was less than we thought we could get from any other vendor. And if you commit to buying the server racks it will take to fit 10PB, you can probably get somebody like Quanta to talk to you.


This is amazing. Thank you. I’ve been looking at Backblaze storage pods, which seem to be designed for that use case. I've never rented rack space.

Do you remember roughly the math on how much cheaper it was, or how you thought about upfront cost vs ongoing? Just an order of magnitude would be great.


Roughly a decade ago S3 storage pricing had a ~10x premium over self-hosted. The convenience of not having to touch any hardware is expensive.


It's also important to consider how often disks will fail when you are operating hundreds of them - it's probably more often than you'd think, and if you don't have someone on staff and nearby to your colo provider, you're going to pay a lot in remote hands fees.

Your colo facility will almost certainly have 24/7 staff on hand who can help you with tasks like swapping disks from a pile of spares, but expect to pay $300+ minimum just to get someone to walk over to your racks, even if the job is 10 mins.

With that said, the cost savings can still be enormous. But know what you're getting into.


Like another comment said, don't bother swapping out disks, just leave the dead ones in place and disable them in software. Then eventually either replace the whole server or get someone on site to do a mass swap of disks. At this scale redundancy needs to be spread between machines anyway so no gain in replacing disks as they die.


That also means that you need extra spare disks in the system, which also means extra servers, extra racks, extra power feeds, extra cooling etc.

If you do a 60-disk 4U setup you'll need 1 full rack of those just to get your 10PB, then you'll need yet another one for redundancy. And then a quarter for hot spares. At that point you have single redundancy, no file history and no scaling. Is it possible? Sure. Is this something you can do 'on a side track with the people you already have'? Unlikely if you are a startup with no datacenter, no colocation yet, etc.


You don't do redundancy that way at that scale, that's completely insane. You run Ceph or BeeGFS or Windows Storage Server and back up to tape with a tape library. If you've got big bucks (though still peanuts compared to S3) you replicate the entire setup 1:1 at a second site.


The author doesn't want a second site. And at that scale you do redundancy within the requested parameters.

If you set your object store to be resilient to single-partition loss per object (within CAP) you effectively duplicate everything once. If you want more-than-one you get into sharding to spread the risk. We're not talking about RAID here, but about replicas or copies.

Windows Storage Server doesn't belong in a setup like this, and neither does tape since it needs to be accessible in under 1s. If higher latencies were fine the author would have been able to use something between S3 IA and Glacier. Heck, you could use cold HDD storage for that kind of access. The drives would need to spin up to collect the shards to assemble at least one replica to be able to read the file, but that's still multiple orders of magnitude faster than tape.

I have written a larger post with more numbers, and unless you seriously reduce the features you use, it's not really cheaper than S3 if you start off with no physical IT and no people to support it. It's not that it isn't possible, it's just that you need to spin up an entire business unit for it and at that point you're eating way more cost.

Regardless of the object store (or filesystem if you want to go full on legacy style), you still need at least the minimum amount of physical bits on disk to be able to store the data. And pretty much no object store supports a 1:1 logical-physical storage scale. It's almost always at least 1:1.66 in degraded mode or 1:2 in minimum operational mode.


>We're not talking about RAID here, but about replicas or copies.

Most distributed filesystems support some form of erasure coding. Ceph does, Minio does, HDFS does, etc. So no, you don't need to duplicate everything.


You're talking about data integrity, this is not the same as redundancy.


> You're talking about data integrity, this is not the same as redundancy.

To be clear, you're talking about mitigating the risk of data corruption (eg. bits will flip randomly due to cosmic rays or what have you) over time, vs. the risk of outright data loss, yes?

Isn't there some some overlap between the solutions?


No, I'm talking about mitigating system failure (be it a dead disk, PHY, entire server, single PDU, single rack or entire feed). I didn't even go down to the level of individual object durability yet (or web access to those objects, consistent access control and the like).

There is some overlap in the sense that having redundant copies makes it possible to replace a bad copy with a good copy if a checksum mismatches on one of them. That also allows for bringing the copy count back in spec if a single copy goes missing (regardless of the type of failure).

But no matter what methods are used, data is data and needs to be stored somewhere. If the bits constituting that data go missing, the data is gone. To prevent that, you need to make sure those bits exist in more than one place. The specific places come with differences in cost, mitigations and effort:

- Two copies on the same disk mitigates bit flips in one copy but not disk failure

- Two copies on two disks on the same HBA mitigates bit flips and disk failure but not HBA failure

The list goes on until you reach the requirement posted at the top of this Ask HN where it is stated that OneZone IA is used. That means it does not need multiple zones for zone-outage mitigation. Effectively that means the racks are allowed to be placed in the same datacenter. So that datacenter being unavailable or destroyed means the data is unavailable (temporarily or permanently), which appears to be the accepted risk.

But within that zone (or datacenter) you would still need all other mitigations offered by the durable object storage S3 provides (unless specified differently - if we just make up new requirements we can make it very cheap and just accept total system failure with 1 bit flip and be done with it).


I currently pay about $40 for a half hour of remote hands at a large data center. Modern disks rarely need to be swapped. You can look at BackBlaze's published failure rates and do the math yourself if you don't believe me.


I’ve used Netapps and Isilon in the past. We didn’t change any disks, they did as part of the maintenance. Not sure how the physical security worked but they were let in by the data centre staff and did their thing. I think they came in weekly.

The whole solution wasn't cheap though, and all of these extras were baked into the cost. We were getting better-than-S3 costs on a straight per-TB basis, without considering power, cooling and rack space costs. Network was significantly cheaper than AWS.

Not sure how far these NASes scale, but I would expect deep discounts for something of this scale.


I've run the math on this for 1PB of similar data (all pictures), and for us it was about 1.5-2 orders of magnitude cheaper over the span of 10 years (our guess for depreciation on the hardware).

Note that we were getting significantly cheaper bandwidth than S3 and similar providers, which made up over half of our savings.



Upfront costs, with networking, racked and stacked, and wired, were far under $100/TB raw, around $40-$60, but this was quite a while ago and I don't know how it looks in the era of 10+TB drives. Also remember that once you are off S3 you are in the situation of doing your own backup, and the use case dictates the required availability when things fail... we didn't need anything online, but mirrored to a second site. With erasure coding, you can get by with 1.5x copies at each site or so, with a performance hit. So properly backed up with a full double, it's about 3x raw...

Opex will be power, data center rent, and internet access, which are hugely variable. And of course, the personnel will be at least 1 full-time person who's extremely competent.


If you have looked at BB storage pods, you should look at 45drives.com, the child of Protocase, which manufactures the BB pods.


Totally out-of-band for this thread, but... what are the uses for a multi-gigabyte key?! I'm clearly unaware of some cool tech, any key words I can search?


I'm no expert but I would guess it's just a fancy word for "file", as in "key-value store", as opposed to a god-proof encryption key.


In this case, wouldn't the value be multi-gigabyte, not the key?


Well, you'd refer to an object by its key, so while the value of the object would have the data, you could still refer to your objects as keys, the same way we refer to files on a filesystem. It's not the file that is big, but the blocks it represents.


When I say "key" I mean the blob that gets stored, but I may be misremembering or misusing S3 terms... it was large amounts of DNA sequencing data, and one of the first tasks was to add S3 support for indexed reads to our internal HTSlib fork, and since then somebody else's implementation has been added to the library. In any case, I quickly forgot about most of the details of S3 when I no longer had to deal with it directly...


That makes perfect sense, assume I'm a pleb! Thanks for the follow-up, large files/values make sense.

My head was in huge cryptographic keys for some purpose


This is outside my domain and I don't know how the pricing works out, but AWS Outpost will sell you a physical rack that is fully S3 compatible and redundant to cloud.


The pricing would be prohibitive, I reckon. S3 on Outposts is $0.1/GB/mo, whereas the S3 single zone IA that OP is using as a baseline is $0.01/GB/mo - an order of magnitude less. (Prices are based on us-east-1.)


There are four hidden costs which not many have touched upon.

1) Staff. You'll need at least one, maybe two, to build, operate, and maintain any self-hosted solution. A quick peek at Glassdoor and Salary shows the unloaded salary for a Storage Engineer runs $92,000-130,000 US. Multiply by 1.25-1.4 for the loaded cost of an employee (things like FICA, insurance, laptop, facilities, etc). Storage Administrators run lower, but still around $70K US unloaded. Point is, you'll be paying around $100K+/year per storage staff position.

2) Facilities (HVAC, electrical, floor loading, etc). If you host on-site (not in a hosting facility), you'd better make certain your physical facilities can handle it. Can your HVAC handle the cooling, or will you need to upgrade it? What about your electrical? Can you get the increased electrical in your area? How much will your UPS and generator cost? Can the physical structure of the building (floor loading, etc) handle the weight of racks and hundreds of drives, the vibration of mechanical drives, the air cycling?

3) Disaster Recovery/Business Continuity. Since you're using S3 One Zone IA, you have no multi-zone duplicated redundancy. Its use case is secondary backup storage for data, not the primary data store for running a startup. When there is an outage/failure (and it will happen), the startup may be toast, and investors none too happy. So this is another expense you're going to have to seriously consider, whether you stick with S3 or roll your own.

4) Cost of money. With rolling your own, you're going to be doing CAPEX and OPEX. How much upfront and ongoing CAPEX can the startup handle? Would the depreciation on storage assets be helpful financially? You really need to talk to the CPA/finance person before this. There may be better tax and financial benefits to staying on S3 (OPEX). Or not.

Good luck.


I agree with this 100%; especially cash flow for a startup is going to be harder to manage. I think S3 is still the answer.


I have worked in HPC (academia), where cluster storage size has been measured in multiples of PB for a decade. Since latency and bandwidth are killer requirements there, InfiniBand (instead of Ethernet) is the de facto standard for connecting the storage pools to the computing nodes.

Maintaining such a (storage) cluster requires 1-2 people on site who replace a few hard disks every day.

Nevertheless, if I continuously needed massive amounts of data, I would opt to do it myself anytime instead of using cloud services. I just know how well these clusters run, and there is little to no saving when outsourcing it.


I am a researcher in academia who handles most of my system admin needs myself. It's way cheaper to do yourself than some of these comments here make it sound (if you have good server rack space available). I ordered two 60-drive JBODs that I racked by myself (I removed all the drives first to lighten them) for ~82k. I used ZFS and 10-drive raidz2 vdevs for a total capacity of ~960TB of usable file system space. Installing the servers, testing some setups and putting it into use took about 4-5 days. In four years I've put many PBs of reads and writes through these and had to replace 3 drives. I'd estimate I spend about 2% of my active work focus on maintaining and troubleshooting it. Scaling up to 10PB I'd probably switch to a supported SDS solution, which would be much more expensive, but still way way cheaper than cloud.
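
For anyone checking the numbers, the usable capacity works out roughly like this (the drive size is my assumption, chosen to match the ~960TB figure):

    drives = 2 * 60                 # two 60-bay JBODs
    vdev_width, parity = 10, 2      # 10-drive raidz2 vdevs
    drive_tb = 10                   # assumed drive size
    vdevs = drives // vdev_width
    usable_tb = vdevs * (vdev_width - parity) * drive_tb
    print(f"{vdevs} vdevs, ~{usable_tb} TB usable before ZFS overhead")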


Since he needs a 1000ms response on storage, isn't Ethernet the better option? It can reach 400Gb/s on the fastest hardware now. I thought InfiniBand was only reasonable to use when machines need to quickly access other machines' primary memory. I would like to know if I'm wrong about this though.


Agreed, and at this point with RoCE there's little reason to go with InfiniBand, given you can find fast Ethernet hardware that'll go toe to toe with InfiniBand on latency and throughput.


I've done multiple multipetabyte scale projects and you only need to swap disks once a month or so. I had a project (as a solo engineer) 2 hours away and I drove there once in six months.


I would host in a datacenter of your choice and do a cross connect into AWS: https://aws.amazon.com/directconnect/pricing/

This allows you to read the data into AWS instances at no cost and process it as needed since there is 0 cost for ingress into AWS. I have some experience with this (hosting using Equinix)


Direct Connect isn't required from a cost perspective - ingress into AWS is free in all cases I can think of, but certainly in the case of S3 [0]. DX is useful when customers need assurances of bandwidth/throughput, or if they want to avoid their traffic routing over the internet.

[0] "You pay for all bandwidth into and out of Amazon S3, except for the following: Data transferred in from the internet..." - https://aws.amazon.com/s3/pricing/


Thanks for the pointer. Never thought about this as an option. Great stuff!!!


I had a similar problem at a past job, though we only had a PB of data. We used a product called SwiftStack. It is open source, but they have paid support. I recommend getting support, as their support is really good. It is an object store like S3, but it has its own API. Though I think they have an S3-compatible gateway now.

We had about 25 Dell R730xd servers. When the cluster would start to fill up, we would just replace drives with larger drives. Upgrading drives with SwiftStack is a piece of cake. When I left we were upgrading to 10TB drives as that was the best pricing. We didn't buy the drives from Dell as they were crazy expensive. We just bought drives from Amazon/New Egg, and kept some spares onsite. We got a better warranty that way too. Dell only had a 1 year warranty, but the drives we were buying had a 5 year warranty.


Way late to the discussion, but I second the positive remarks on SwiftStack. It's in the easy button category in this case. The core storage engine of SwiftStack is open source (OpenStack Swift). However, the nice wrap-around tooling and web dashboard is not open source.


I’m not an AWS pricing expert, but you should be aware you’re still on the hook for S3 requests even if you can get out of paying for bandwidth. Is AWS direct connect a pure peering arrangement? I wonder what their requirements are for that. Guess I’ll read the link :)

Idk what your team’s expertise is, but I’d advise avoiding the cloud as long as possible. If you can build out an on-premise infrastructure, it will be a huge competitive advantage for your company because it will allow you to offer features that your competitors can’t.

Examples of this:

- Cloudflare built up their own network and infrastructure and it’s always been their biggest asset. They set the standard for free tier of CDN pricing, and nobody who builds a CDN on top of an existing cloud provider will ever beat it.

- Zoom. By hosting their own servers and network, Zoom is similarly able to offer a free tier where they are not subject to variable costs from free customers losing them money on bandwidth charges.

- WhatsApp. They scaled to hundreds of millions of users with less than a dozen engineers, a few dozen (?) servers, and some Erlang code.

IMO defaulting to the cloud is one of the worst mistakes a young company can make. If your app is not business critical, you can probably afford up to a day of downtime or even some data loss. And that is unlikely to happen anyway, as long as you’ve got a capable team looking after it who chooses standard and robust software.


I run cloud infra for a living. Have been managing infrastructure for 20 years. I would never for one second consider building my own hosting for a start-up. It would be like a grocery delivery company starting their own farm because seeds are cheap.


Depends what you’re doing I suppose. I think the three companies I mentioned (CloudFlare, Zoom and WhatsApp) are good examples of infrastructure investment as a competitive advantage.


None of those are start-ups, though. They've either IPOed (CloudFlare, Zoom) or been acquired by publicly-traded companies (WhatsApp).

A startup is a company that might still need to pivot to find its final business model, potentially shedding its entire existing infrastructure base in the process. Start-ups are why IaaS providers don't default to instance reservations — because, as a startup, you might suddenly realize that you won't be needing that $10k/hr of compute, but rather $10k/hr of something else.


Or suppose you run the most successful/profitable Fantasy Sports League start-up on the internet (used to work for 'em) and host your own gear. Every year you have to analyze trends in use and predict future load, to build the capital needed to buy all new racks of servers every 2-3 years, pay for all the IT staff, datacenter costs.

That was before the cloud existed. They had to poach experts from hosting companies to build and maintain their gear. They built a 24/7 NOC, did server repair, became network experts, storage experts, database experts. Besides being incredibly complex and burdensome, it was financially risky. If they missed their projections they could over-invest by 1-2 million bucks, or even worse, not have the capacity needed to meet demand.

If somebody told us back then that we could pay a premium to be able to scale at any time as much as we needed, when we needed it? We would have flipped out. We had heard about Amazon building some kind of "grid computing" thing, but it seemed like a pipe dream for universities, like parallel computing. Turns out it was a different kind of grid.


WhatsApp ran on bare metal in SoftLayer prior to (and well after) being acquired by FB.

CloudFlare went well beyond leasing servers and built their own POPs with network etc prior to IPO. Much of what they built wouldn't have made economic sense with AWS tax.


I didn't mean to imply that IPOing is the point at which a start-up becomes a not-start-up. None of these three were a start-up for quite a few years before their IPO, either.


In most of these cases, the companies' growth from startup to not-startup was only possible because of their infrastructure advantage. Do you think Cloudflare the startup could have offered a free tier if they had to pay Amazon $0.10 per GB that their users sent over the network?

Of course not. But the free tier was a vital component of Cloudflare's growth, first-mover advantage and wide adoption.


> as long as you’ve got a capable team looking after it who chooses standard and robust software.

And cheap.

If you put people in charge who are looking for ways of expanding their empire and budget through spending money on EMC/VMWare/Oracle/etc/etc then you can quickly wind up spending a lot more money.

Simplistic network designs, simplistic server designs, simplistic storage designs with mostly open source software used everywhere can be highly competitive with Cloud services.

Mostly all that Amazon did to create AWS/EC2 was to fire anyone who said words like SAN or EMC and do everything very cheaply using open source software, evolving away from enterprise vendors and towards commodity hardware.

If you make "frugality" a core competency in your datacenter design like Amazon did, then you can easily beat the cloud.

You also need to have [dev]ops people who are inclined to say "yes" to the business and who know how to debug things and can operate independently of needing to phone up EMC.


> fire anyone who said words like SAN

Is EBS not, itself, a SAN?


If you narrowly focus on the words outside of the context of what "SAN" has meant in the industry for decades now, yes it is. But no, it isn't.


Can you explain more? Because I honestly don't know enough about SANs to know the difference.

To me, a "Storage Area Network" is 1. a cluster of disk-servers, serving the role of exposing logical block-storage over a protocol like iSCSI (whether directly to client machines, or managed and dynamically allocated by hypervisor software like vSphere), where 2. machines are connected to that storage cluster over a dedicated network interface, to keep LAN/WAN packets from contending for throughput with SAN packets.

By that definition, EBS is definitely a SAN. (And technically, so is my two-drive NAS, if I configure it as an iSCSI target and then run a second switch that connects to its second network port and my workstation's second network port.)

Does "SAN" imply some specific internal architecture for the storage cluster or something?

And, if so, then what do you call the type of thing that EBS is?


> Does "SAN" imply some specific internal architecture for the storage cluster or something?

It implies purchasing dedicated hardware. SANs are CAPEX heavy solutions.

> And, if so, then what do you call the type of thing that EBS is?

If you insist, you could call EBS a SAN-as-a-Service, I suppose.


EBS is absolutely SAN-as-a-Service, and it's fantastic.

For a SAN, not only do you have to become a "storage expert", but their individual limitations will leave you with thousands of hours of wasted time and effort, constrain your architecture, and hold back your application's development.

For EBS, you don't need to know anything about storage. You just say "Give me some space and attach it to any VM I want" and you have it. "Expand that space" and you have it. "Give me a snapshot" and you have it. "Give me a bunch of performance guarantees" and you have it. "Make it all encrypted": Done.

You don't need to maintain it, repair it, upgrade it. No maintenance windows to apply a firmware patch. No waiting for someone to buy, deliver, and install a new storage array to get more space. No hoping your hardware has the right interconnects. No upgrading switch backbones to deal with performance issues. And I'm not even a storage person! I'm so happy that I don't deal with SANs anymore.
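For contrast, the EBS side of that workflow really is a handful of API calls; a hedged sketch with boto3, where the instance ID, AZ and sizes are placeholders:

    import boto3

    ec2 = boto3.client("ec2")

    # "Give me some space" (encrypted, with a performance target)
    vol = ec2.create_volume(AvailabilityZone="us-east-1a", Size=500,
                            VolumeType="gp3", Iops=6000, Encrypted=True)

    # "...and attach it to any VM I want"
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-0123456789abcdef0",   # placeholder instance
                      Device="/dev/sdf")

    # "Expand that space" and "Give me a snapshot"
    ec2.modify_volume(VolumeId=vol["VolumeId"], Size=1000)
    ec2.create_snapshot(VolumeId=vol["VolumeId"], Description="nightly")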


No, EBS is reliable.


I’d like to add I’d agree with the parent comment and add some specifics.

Buy storage servers from 45drives; they basically build the same hardware as Backblaze uses. Add copper 10G NICs to the servers.

https://www.45drives.com/

Get necessary switches 10G with 40G uplink ports. Whatever your favorite. Use 10GBaseT to the servers.

Install hardware in a quality data center. Like one of theirs -

https://www.digitalrealty.com/

And get 10G virtual cross connects to AWS.

Back-of-the-envelope calculation: you need about 30PB raw, so about 60 servers. They aren't really that power hungry, so 10 per cabinet, 6 cabinets, and at least 6+2 switches.

Software wise you have lots of options with this infra. High upfront cost but low MRC vs all other options. Assuming you have skilled sys admins who know what they are doing.


+ some Glacier Deep Archive? I think waiting 12h for data is acceptable if your datacenter burns down, but that may not be the case for you.


It's going to depend entirely on a number of factors.

How are you storing this data? Is it tons of small objects, or a smaller number of massive objects?

If you can aggregate the small objects into larger ones, can you compress them? Is this 10PB compressed or not? If this is video or photo data, compression won't buy you nearly as much. If you have to access small bits of data, and this data isn't something like Parquet or JSON, S3 won't be a good fit.

Will you access this data for analytics purposes? If so, S3 has querying functionality like Athena and S3 Select. If it's instead for serving small files, S3 may not be a good fit.

Really, at PB scale these questions are all critically important and any one of them completely changes the answer. There is no easy "store PB of data" architecture; you're going to need to optimize heavily for your specific use case.


Great question. I updated the original post. It’s user generated images and videos. We download those to the phones in the background.

We don’t touch the data at all.


> Update: Should have mentioned earlier, data needs to be accessible at all time. It’s user generated data that is downloaded in the background to a mobile phone, so super low latency is not important, but less than 1000ms required.

> The data is all images and videos, and no queries need to be performed on the data.

OK, so this definitely helps a bit.

At 10PB my assumption is that storage costs are the major thing to optimize for. Compression is an obvious must, but as it's image and video you're going to have some trouble there.

Aggregation where you can is probably a good idea - like if a user has a photo album, it might make sense to store all of those photos together, compressed, and then store an index of photo ID to album. Deduplication is another thing to consider architecting for - if the user has the same photo, across N albums, you should ensure it's only stored the one time. Depending on what you expect to be more or less common this will change your approach a lot.

Of course, you want to avoid mutating objects in S3 too - so an external index to track all of this will be important. You don't want to have to pull from S3 just to determine that your data was never there. You can also store object metadata and query that first.
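A minimal sketch of that dedup + external-index idea, with made-up names and in-memory dicts standing in for whatever database you'd actually use: content-address each file by hash so the same photo across N albums is stored once, and resolve photo IDs to storage keys without touching S3:

    import hashlib

    object_index = {}   # content hash -> S3 key (in practice: DynamoDB/Postgres/etc.)
    photo_index = {}    # (user_id, photo_id) -> content hash

    def store_photo(s3, bucket, user_id, photo_id, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in object_index:                 # only upload genuinely new content
            key = f"blobs/{digest[:2]}/{digest}"
            s3.put_object(Bucket=bucket, Key=key, Body=data)
            object_index[digest] = key
        photo_index[(user_id, photo_id)] = digest      # album entries just point at the blob
        return object_index[digest]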

AFAIK S3 is the cheapest way to store a huge amount of data other than running your own custom hardware. I don't think you're at that scale yet.

Latency is probably an easy one. Just don't use Glacier, basically, or use it sparingly for data that is extremely rare to access ie: if you back up disabled user accounts in case they come back or something like that.

I think this'll be less of a "do we use S3 or XYZ" and more of a "how do we organize our data so that we can compress as much of it together, deduplicate as much of it as possible, and access the least bytes necessary".


Isn't Backblaze B2 cheaper than S3?


Yeah, I guess I shouldn't say S3 is the cheapest option there, I was thinking 'In AWS' but Backblaze is cheaper.


In my opinion, knowing what you're planning to do w/the data once it's stored is the important piece to giving you some idea of where to put it.


Good point. I updated the post with some more info


What is your loss tolerance? If a file is gone, who is annoyed: a free user, a $50/year customer, or a $10k/year customer?

Are these files WORM?


Agreed - though I feel like every data use comes after the fact. Original software engineers/developers rarely have the foresight to know what the data scientists will need the information for (at least IMHO).


To be fair, the data scientists rarely have the foresight to know what the data scientists need the information for. The only time I've seen a data scientist correctly include all the data they needed (but still be wrong) was when they answered "All of it. We need all of the data".


So true. Tough to know in advance which data will hold the secrets.


And you can't build a time machine to go and get it once you do know. Want X days of historical data for training/backtesting and we just implemented the metric this sprint? Good luck meeting your deadline!


I can build a 720T raw SSD storage box for ~$138k

Or a 648T raw HDD storage box for ~$53k

To get that up to raw 10 PB, I need ~$2m for all-SSD, or ~$850k for all-HDD

Bake in a 2-system safety margin, and that's ~$2.3m all-SSD or ~$960k all-HDD

Run TrueNAS and ZFS on each of them ... and my overhead becomes a little bit of cross-over sysadmin/storage admin time per year and power

Say that's 1 FTE at $180k ($120k salary + 50% overhead) per year (even though actual admin time is only going to be maybe 10% of their workload - I like rounding-up for these types of approximations)

Peak cost, therefore, is ~$2.5m the first year, and ~$200k per year afterwards

And, of course, we'll want to plan for replacement systems to pop-in ... so factor-up to $250k per year in overhead (salary, benefits, taxes, power, budget for additional/replacement servers)

Using [Wasabi](https://wasabi.com/cloud-storage-pricing/#three-info), 10PB is going to run ~$62k/mo, or ~$744k per year

It's cheaper to build-vs-buy in no more than 5 years ... probably under 3


Backblaze B2, ingress and egress are free through Cloudflare, and it's S3 compatible. It's peanuts by comparison but I've been storing ~22TB on there for years and love it.

Wasabi and Glacier would be my 2nd choices.


>Backblaze B2, ingress and egress are free through cloudflare

AFAIK cloudflare ToS prohibits you from using it as a file hosting proxy. You might not run into issues if you're transferring a few gigabytes a month, but if you're transferring multiple terabytes it's just asking for trouble.

edit:

https://www.cloudflare.com/terms/ section 2.8 Limitation on Serving Non-HTML Content


You can definitely serve way more than a few GB per month through Cloudflare on the free plan. I serve tens of terabytes a month for free. If OP needs to serve hundreds of terabytes per month they may get an email asking to upgrade, but the backblaze/Cloudflare setup would probably still be the cheapest. BunnyCDN is great too.


OTOH I've been told (by CloudFlare support, in contact with their engineers) that for their "for hosting game levels and other content" use case[1], any of their ordinary plans should be fine.

I'm not... super confident in that answer, because despite that being a use case they promote on the site the terms seem a bit murkier, and the page on that use-case doesn't say much about which plan(s) they expect you to use (I'd have expected an "enterprise" plan for serving hundreds of TB of transfer of game-assets per month, but they said no, any normal plan's fine, which... I was up front with them about what our usage would look like, and they held that line, but that seems too good to be true).

I haven't tested these claims yet.

[1] https://www.cloudflare.com/gaming/


Definitely not backblaze. If you get a signed URL it remains valid for 24hrs and can be used over and over. If they are going through a proxy, that would be different, but I imagine they don't want that as that doubles bandwidth cost. You definitely don't want your client to be able to upload all the data they can in your bucket.


Wait what?! There is a way to egress from free from Backblaze B2! That’s a big deal if true.



I’ve looked at them. Would love to talk to you about your usage and experience with them.


I should preface this with: I read the question as you want something on-premises/in a colo. If you're talking hosted S3 by someone other than Amazon that's a different story.

It probably depends on if you are tied at the hip to other AWS services. If you are, then you're kind of stuck. The ingress/egress traffic will kill you doing anything with that data anywhere else.

If you aren't, the major players for on-prem S3 (assuming you want to continue accessing the data that way) would be (in no specific order):

Cloudian

Scality

NetApp Storagegrid

Hitachi Vantara HCP

Dell/EMC ECS

There are plusses and minuses to all of them. At that capacity I would honestly avoid a roll-your-own unless you're on a shoestring budget. Any of the above will be cheaper than Amazon.


[Disclaimer: I work on Quantum ActiveScale, an on-prem S3 system that fits this list]

Yup this is why these vendors exist. You’re definitely not alone, cloud repatriation is a ‘thing’.

These vendors have replaced, with software, the Ceph or Minio expert others in this thread said you'd need to budget for. The system detects dead/degrading disks and automatically evicts them and rebuilds that chunk of the erasure code on another disk. Every few months you go in and hot-swap the bad disks = the ones with a blinking LED. Also Prometheus metrics, alerts in case of issues,... You don't need a storage admin babysitting this.

1 rack of 4u90 enclosures with 18TB disks is 15PB RAW so with erasure coding overhead about the 10PB usable you need today.

I'm obviously biased on which vendor to pick. Do your due diligence, for instance on how the system does capacity expansions.


I assume you're already making use of most of S3s auto-archive features?[0] Really it seems like this comes down to how quickly any of your data /needs/ to be loaded. I'd probably investigate after how much time a file is only ~1-10% likely to be accessed in the next 30 days, then auto-archive files in S3 to Glacier after that threshold. If you want to be a bit 'smarter' about it, here's an article by Dropbox[1] on how they saved $1.7M/year by determining which file previews actually need to be generated, and their strategy seems like it could be applied to your use case. That said, it seems like you are more likely to save money by going colo than by staying in the cloud.

[0] https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/ [1] https://dropbox.tech/machine-learning/cannes--how-ml-saves-u...
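As a concrete (hedged) illustration of the auto-archive idea, a lifecycle rule like the following transitions objects to Glacier once they cross whatever age threshold your access statistics suggest; the bucket name and 180-day cutoff are placeholders:

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-media-bucket",                      # placeholder
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-cold-media",
                "Filter": {"Prefix": ""},              # apply to the whole bucket
                "Status": "Enabled",
                "Transitions": [{"Days": 180, "StorageClass": "GLACIER"}],
            }]
        },
    )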


I have done 2 PB HPC data storage with ZFS. If I may extrapolate, I don’t see why it wouldn’t workout the same for 10 PB.

A 1U rack server attached to two JBODs(each 4U containing 60 spinning disks) connected to the server via 4 SAS HD cables. The rack server gets 512GiB of RAM to cache reads, and an Optane drive as persistent cache for writes. The usable storage depends on your redundancy and spare needs. But, as an example my setup - (9 * 6 drives(RAIDz2) + 4 hot spares) nets me about 450 TiB per JBOD or 900 TiB per rack server + two JBODs.

Repeat the setup 6 times, and it would meet your 10 PB need. Throw in a few 10Gbps links per server, have them all linked up by a switch, and you've got your own storage setup. Maybe MinIO (I have no experience with it) or something like that would give you an S3 interface over the whole thing.

I bet it would come out much cheaper than AWS. But, you’ve got to get your hands dirty a bit with system in work, and automate all the things with a tool like Ansible. Having done it, I’d say it is totally worth it at your scale.
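If you did front a setup like this with MinIO, the application side stays simple; a minimal sketch with the MinIO Python SDK, where the endpoint, credentials and bucket are placeholders and nothing is specific to the ZFS layout underneath:

    from minio import Minio

    client = Minio("minio.internal.example.com:9000",
                   access_key="ACCESS", secret_key="SECRET", secure=False)

    if not client.bucket_exists("media"):
        client.make_bucket("media")

    # Upload straight from local disk; the cluster decides where it lands.
    client.fput_object("media", "user123/clip.mp4", "/data/incoming/clip.mp4")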


Why do you need all 10PB accessible? Have you analyzed your usage pattern to see if you really need that much data accessible? This seems so unlikely and could solve most of your problems if you change the parameters.


It seems to me like you could save a ton of money by using your own hardware. Perhaps buy a bunch of big Synology boxes? At that scale you should also consider looking at technologies such as Ceph.

We've recently switched to a setup with several Synology boxes for around 1PB net storage.


Those boxes are slooooow; the 8-slot box has like a 500MB/s read speed limit, even if you RAID0 8 SSDs and use 10Gbps networking. This limitation is in the product spec document, but in the smallest letters possible.


They advertise some boxes with 5.5GB/s.


Funny you should mention this. I once worked at a startup that stored lots of remote sensing data. Their strategy was to put it on a Synology. When the Synology filled up, they bought another, and so forth. Only some of the Synologys were online at any particular time, and there was no indexing to find which Synology held what data.

Plus, there were no backups so if one Synology were to blow up, all the data on it was lost.

Since they were a small startup it made some sense to start this way, but they had no plans on what to do about it as they got bigger.


Using Synologys doesn't mean that you have to be stupid about it :-)


Thank you!


At this scale, there's no one perfect answer. You need to consider your usage patterns, business needs, etc.

Is the data cold storage, that is rarely accessed? Is it OK to risk losing a percentage of it? Can you identify that percentage? If it's actively utilized, is it all used, or just a subset? Which subset? How much data is added every day? How much is deleted? What are the I/O patterns?

Etc.

I have direct experience moving big cloud datasets to on-site storage (in my case, RAID arrays), but it was a situation where the data had a long-tail usage pattern, and it didn't really matter if some was lost. YMMV.


I'd go with Ceph and dedicated hardware. Something like Hetzner or Datapacket, or build it yourself and go big with something like SoftIron. We've built and maintain a number of these types of clusters - using S3-compatible APIs (CephObjectStore). SoftIron is probably overkill but good lord is it fun to play with that much thruput!

If you’re looking for a partner/consultant to get things going, feel free to reach out! This stuff is sort of our wheelhouse, as me and my co-founder were previously Ops at Imgur, you can imagine the kinds of image hosting problems we’ve seen :P


SoftIron would love to assist with this.


Late to the party, but one does not simply store 10PB of data :)

The short story is, ignore most of the advice, poach^H^H^H^H^Hhire someone who has done this, and leverage their expertise. There is no armchair quarterbacking infrastructure at this scale.


Honestly, 10PB is small potatoes nowadays - and it's not "armchair quarterbacking" to think about storing it in a distributed environment

I work with one customer right now that's storing something like 27PB across ~100 systems for analysis on a rolling 90-to-365-day period (and they're a relatively small customer) with Splunk

If they were doing more storage than analysis, that storage cluster would be substantially smaller


I don't really know much about optimizing storage costs, but you could learn from the storage giants.

An example is Backblaze's Storage Pod 6.0: according to them it holds 0.5PB at a cost of $10k, so you would need about 20 * $10K = $200K + maintenance (they also publish failure rates). The schematics and everything are on their website, and according to them they already have a supplier who builds such devices, which you could probably buy from. Note: this was published in 2016; they probably have Pod 7.0 by now, so cost may be better.

Reference: https://www.backblaze.com/blog/open-source-data-storage-serv...


fyi that 10k includes no drives.


Reading: https://www.backblaze.com/blog/open-source-data-storage-serv... it seems the drives are included.

That 10.3k includes drives but you have to assemble the pod yourself.

For 12.8k you get drives and assembled pod from 3rd party manufacturer.

Backblaze pays about 8.7k at scale for the whole enchilada.

Those numbers do not make sense if we exclude drives. The server itself is not that expensive (2-3k tops) without the drives.


Are you fundamentally a data storage business or are you another business that happens to store a tremendous amount of data?

If it's the former, then investing in-house might make sense (a la Dropbox's reverse course).


He's the CTO of KeepSafe.


Ok, so it seems that they are indeed a data storage company.


We are.


Do you want to grow your business by finding other ways to sell people storage, or by adding features to your app that may or may not require any additional storage to develop?


Cloud or self-hosted will depend on your in-house expertise. For cloud others have already mentioned Backblaze and Wasabi, but you can also check Scaleway, they do 0.02 EUR/GB/mo for hot storage and 0.002/GB/mo for cold storage.

Since we're talking about images and videos, do you already have different quality of each media available? Maybe thumbnail, high quality, and full quality. It could allow you to use cold storage for the full quality media, serving the high quality version while waiting for retrieval.

If the use case is more of a backup/restore service and a restore typically takes longer than a cold storage retrieval (being Glacier or self hosted tape robot), then keep just enough in S3 to restore while you wait for the retrieval of the rest.

If you go the self-hosted route, I like software that is flexible around hardware failures. Something that will rebalance automatically and reduce the total capacity of the cluster, rather than require you to swap the drive ASAP. That way you can batch all the hardware swapping/RMA once per week/month/quarter.


I believe Scaleway costs 0.01 EUR/GB, so a bit more than half of S3.


Also have a look at the Datahoarder community [1] on Reddit. Some people are storing astronomical amounts of data. [1]: https://www.reddit.com/r/DataHoarder/


How firm are your "less than 1000ms" requirements. Could you identify a subset of your images/videos that are very unlikely to ever be accessed and move those to s3 glacier and price in that some fractional percentage will require expedited retrieval costs?
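If that subset exists, the retrieval side is a single call; a hedged sketch with boto3, where the names are placeholders and expedited retrievals still take minutes, which is why this only works for data you're confident will almost never be requested:

    import boto3

    s3 = boto3.client("s3")
    s3.restore_object(
        Bucket="my-media-bucket",
        Key="user123/old-video.mp4",
        RestoreRequest={
            "Days": 1,                                    # keep the restored copy briefly
            "GlacierJobParameters": {"Tier": "Expedited"},
        },
    )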


Netapp. If you are managing it yourself do not accept alternatives.

https://www.ebay.com/itm/313012077673?_trkparms=aid%3D111000...


At that level of data you should be negotiating with the 3 largest cloud providers, and going with whoever gives you the best deal. You can negotiate the storage costs and also egress.


Take any credits you can get from a provider switch and then thoroughly map out your access patterns, ingestion, and egress. Do whatever you can to segment data by your needs for availability and modification.

If it's all archival storage then it's pretty straight forward. If you're on GCP you take it all and dump it into archival single region DRA (Durable Reduced Availability) storage for the lowest costs.

Otherwise, identify your segments and figure out a strategy for "load balancing" between standard, nearline, coldline, and archive storage classes. If you can figure out a chronological pattern, you can write a small script that uses gsutil's built-in rsync feature to mirror data from a higher-grade storage class to a lower one at the right time.

The strategy will probably be similar in any of the other big 3 providers as well, but fair warning, some providers archival grade storage does not have immediate availability last I checked.

See: https://cloud.google.com/storage/docs/storage-classes

https://cloud.google.com/storage/docs/gsutil/commands/rsync
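A small sketch of the kind of script described above, assuming objects are laid out by date and the destination bucket's default storage class is a cheaper tier; the bucket names and 90-day cutoff are placeholders, and a plain lifecycle rule may get you the same result with less machinery:

    import subprocess
    from datetime import date, timedelta

    cutoff = date.today() - timedelta(days=90)
    prefix = cutoff.strftime("%Y/%m")   # assumes gs://media-standard/YYYY/MM/... layout

    subprocess.run(
        ["gsutil", "-m", "rsync", "-r",
         f"gs://media-standard/{prefix}",
         f"gs://media-archive/{prefix}"],   # archive-class destination bucket
        check=True,
    )
    # After verifying the copy, delete or lifecycle-expire the standard-class originals.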


Flip side: how much time would that migration take? As a startup, focusing that time on product would lead to more VC investment or more sales sooner, with the seed/series funding and sales being many multiples of the cost savings.


Agree with someone else's comment questioning how the data is ingested and used.

10PB seems like a lot to store in S3 buckets. I assume much of that data is not accessed frequently or would be used in a big data scenario. Maybe some other services like Glacier or RedShift (I think).


10PB is a crazy amount of data. Far more than any normal business would ever have to deal with. Presuming you aren't crazy, you must have an unusual business plan to legitimately need to handle that much data. That means it's tough for us to say much - any assumptions we might have about it could be invalid depending on your actual business needs. You're just going to have to tell us some more about your business case before we can say anything useful about it.


Unless they do video. In that case 10PB is not much at all.


It's not merely video

Machine data takes a lot of space, depending on how long you need/want to hold onto it - any given event might only be a few hundred bytes, but with 1000s or 10s of 1000s of devices sending, even syslog turns into a metric buttload of data :)


Disclaimer: *I work for Nutanix*

Consider looking at Nutanix - you can get the hardware from HPE (including Apollo).

Object storage from Nutanix doesn’t even break a sweat at 10PB of usable storage.

However the main reasons to look at Nutanix would be ease of use for

day 0 (bootstrapping), day 1 (administration operations, capacity management), fault tolerance, and day n operations (upgrades, security patches, etc.)

Nutanix spends considerable time and resources on all this to make life of our customers easy.


Amazing how one post will tell you that, at your scale, S3 is stupid and other posts will tell you that at your not-small-enough-and-yet not-big-enough scale S3 is the only option. I say stick with cloud. If cost is an issue go negotiate a better contract — GCP will probably give you a nice discount. Setting up a highly available service at that scale is not a walk in the park. Can you afford the distractions from your primary app while you figure it out?


Wasabi is a good option. They’re S3 compatible and don’t charge any egress or ingress fees. Been using them for a few years. Great speeds and customer support.


10PB with their pricing calculator comes out to over $60,000/mo. Feels like a lot.

edit: perhaps their RCS option would be cheaper if you know exactly how much data you need to store in advance.


To be fair, purchasing and hosting even the most basic mirrored RAID array of that scale comes to well over half a million for the disks alone. Then you need to manage them.


10x Supermicro SSG-6049P-E1CR60H servers (60 x 3.5" HDD in 4U enclosure) - $5k each

600x Western Digital Ultrastar DC HC550 18TB (10.8PB in total) - $500 each

~$350k in hardware, up to 20kW energy consumption, should fit in two rack towers. You can host it for about $1.5k somewhere. All assuming no redundancy :)


Don’t forget labor. You need to find talent to manage your little data center. And deal with it when it shits the bed at 4:12am on Christmas morning.

So toss in at least one SRE type person. Say $200k/year.

Since you only have one, they are gonna be on call 24/7, so assume you’ll burn them out after a year and a half and need to hire a new one....

Since redundancy is a thing, double that $350k. And 10pb is what they have now so double it again for 20pb. Add in $10k per rack for switches, routers, wires, etc.

So probably you are looking at a million dollars of capital plus labor to actually execute on this. And don’t forget the lead time might be a month to get the hardware and a week or two to install it. Plus all the configuration management that needs to be built up. Not to mention monitoring. So maybe a quarter of work just to have it functional.

I haven’t even factored in opportunity costs. What could this business be doing that adds more value than building out a little data center?

I dunno. Maybe it does make sense to manage your own hardware. But it helps to calculate the entire cost of ownership, not just the cost of the servers.


> Since you only have one, they are gonna be on call 24/7, so assume you’ll burn them out after a year and a half and need to hire a new one....

This person's entire job is managing a few racks of hard drives? How often do you think they're actually going to get called in?

> Since redundancy is a thing, double that $350k.

True, but you can do redundancy for cheaper with parity or tape.

> And 10pb is what they have now so double it again for 20pb.

> So probably you are looking at a million dollars of capital plus labor to actually execute on this.

You can go a couple PB at a time if the upfront cost is daunting.

> Add in $10k per rack for switches, routers, wires, etc.

Yep, though that's not very much in comparison.

> Plus all the configuration management that needs to be built up. Not to mention monitoring. So maybe a quarter of work just to have it functional.

This is the one I'd really worry about.

> I haven’t even factored in opportunity costs. What could this business be doing that adds more value than building out a little data center?

You always have to keep opportunity costs in mind, but something like this can pay for itself in under a year if there's significant bandwidth cost too, and that's an amazing ROI.


> How often do you think they're actually going to get called in?

Not often. But the server gods are a cruel mistress and it will definitely shit the bed when you are on your honeymoon, or maybe the day after your first kid is born.


>>> This person's entire job is managing a few racks of hard drives? How often do you think they're actually going to get called in?

How about every day?

Quick guess how often disks need to be replaced when there are thousands of them. ;)


You can replace disks once a month or less. That's not an on-call thing, even if you do make your $200k admin do that grunt work.

Also for one or two thousand disks I would expect less than one failure per week.


> True, but you can do redundancy for cheaper with parity or tape.

At this volumes you probably do want a carbon copy at another site to mitigate disasters like datacenter fires.


You're right, I wasn't really serious. Since I'm in the middle of calculating costs of our own servers in rented racks in Poland (you're right, labor is more difficult than hardware), let me imagine the rest of the infrastructure (probably not all) for this "project", just for fun:

- network switch Juniper EX4600 (10Gbps ports) + 3rd party optics ~$11k

- cheap 1Gbps switch for management access <$1k

- some router for VPN for management network - $500

- 1Gbps (not guaranteed) internet access with few IPs ~$350 / month

- 100Mbps low traffic internet access for the management/OOB network.

Time to get the hardware - 2 months. Time to rent and install hardware in rack - about 1 month. I don't count configuring the software.

This setup is full of single points of failure so I would consider it one "region" and use something like Ceph + some spare servers in each "region". That way you don't need to react immediately to hardware failures. Just send a box of hardware from time to time to the DC and use a ~$20-40/h remote hands service to replace the failed drives or whole servers. You could also buy on-site service from the hardware vendor for 1-3 years, adding some cost.

I think the most important thing would be to have a clever person who designs a fault-tolerant system, automatic failover, and good monitoring and alerting, so that any on-call and maintenance job is easy and based on procedures. That way you could outsource it. Only then might it make some sense.


10 petabytes at AWS is $210,000 per month just for storage (even excluding AWS's very high egress and transaction pricing), so even $1M (which seems like a high estimate indeed) would be amortized in less than six months.

Also, the hardware can be depreciated, which reduces its net (of taxes) cost dramatically over time.

Five years (probably the useful life of the equipment in general) of $210,000 per month is $12.6M. That's a lot of savings.


At this scale you should be able to negotiate with AWS and get a deal better than the listed price.

Regarding accounting, the AWS monthly charges are also net of taxes so it makes no difference.


Thanks for laying this out. Never rented rack space myself


I see from their home page they do not charge for egress, but the FAQ clarifies this is only valid if your monthly egress total is less than or equal to your storage total, otherwise they suspend your service. Should be clarified on the home page in my opinion. At least with an asterisk beside "No egress charges".


1. Shrink your data. That's just an absurd amount of data for a start-up. Even large organizations can't quickly work around too much data. Resource growth directly affects system performance and complexity and limits what you will be able to practically do with the data. You already have a million problems as a start-up, don't make another one for yourself by trying to find a clever solution when you can just get rid of the problem.

2. As a general-purpose alternative, I would use Backblaze. It's cheap and they know what they're doing. Here is a comparison of (non-personal) cloud vendor storage prices: https://gist.github.com/peterwwillis/83a4636476f01852dc2b670...

3. You need to know how the architecture impacts the storage costs. There are costs for incoming traffic, outgoing traffic, intra-zone traffic, storage costs, archive costs, 'access' costs (cost per GET | POST | etc). You may end up paying $500K a month just to serve files smaller than 1KB. (A rough back-of-the-envelope sketch follows this list.)

4. You need to match up availability and performance requirements against providers' guarantees, and then measure a real-world performance test over a month. Some providers enforce rate limits, with others you might be in a shared pool of rate limits.

5. You need to verify the logistics for backup and restore. For 10PB you're gonna need an option to mail physical drives/tapes. Ensure that process works if you want to keep the data around.

6. Don't become your own storage provider. Unless you have a ton of time and money and engineering talent to waste and don't want to ship a reliable product soon.
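Here's the back-of-the-envelope sketch promised in point 3. The per-request rates and volumes are assumptions roughly in line with published list pricing; swap in your own provider's price sheet:

    PUT_PER_1K = 0.005     # USD per 1,000 PUT/POST requests (assumed list rate)
    GET_PER_1K = 0.0004    # USD per 1,000 GET requests (assumed list rate)

    puts_per_month = 1_000_000_000      # a billion tiny objects written
    gets_per_month = 10_000_000_000     # ten billion reads

    request_cost = (puts_per_month / 1000) * PUT_PER_1K \
                 + (gets_per_month / 1000) * GET_PER_1K
    print(f"~${request_cost:,.0f}/month in request charges alone")  # ~$9,000 with these numbers

Push the volumes higher, or add per-object overhead like replication and tagging, and the request line item can dominate the storage line item, which is the point being made above.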


SoftIron would love to help with this project. We're in your backyard and could have POC on your hands in no time at all, and full 10PB in about 6 weeks. matt@softiron.com


Meta-question: shouldn't there be a website dedicated specifically to reliable, crowd-sourced answers to questions like these? Does it really not exist? I'm thinking like StackShare, but you start from "What's the problem I'm trying to solve?", not "What products are big companies using?".


There is http://highscalability.com/ but you have to distinguish between PR and decent technical articles.


Yes it’s called hacker news


Having dealt with a lot of big data, I often came to the realization that we actually did not need most of it.

Try being intentional and smart in front of your data pipeline and purge data that is not useful. Too many times people store data "just in case" and that case never happens years later.


You wrote, "data needs to be accessible at all time ... less than 1000ms" latency, but this does not tell the whole story about accessibility/latency. Does your use case allow you to do something similar to lazy loading, where you serve reduced quality images/video at low latency and only offer the full quality on demand/as needed with greater latency? For example, initially serve a reduced-resolution or reduced-length video instead of the full-res/full-length original, which you keep in colder storage at a reduced cost? Depending on the details of what is permissible and data characteristics, this approach might save you a lot overall by reducing warm storage costs.


I'm wondering here if this data is currently oversized? If the use case is all mobile, has your product committed to losslessly storing something or not?

While there's definitely a cross-over point where you should roll your own, the overhead costs of running a storage cluster reliably (and all the problems you don't really have to deal with because they're outsourced to AWS) mean it might be a better use of time and effort to see how much you can cut that number down by changing the parameters of your storage. The immediate savings will be much easier to justify.

Keep in mind you've also got a migration problem: getting 10PB off Amazon is not a simple, hands-free project.


My only comment is that I have a hard time reconciling these two statements:

> downloaded in the background to a mobile phone

and

> but less than 1000ms required

I'm struggling to think of what kind of application needs data access in the background with latency of less than 1000ms. That would normally be for interactive use of some kind.

Getting to 1 min access time would get you into the S3 glacier territory ... you will obviously have considered this but I feel like some really hard scrutiny on requirements could be critical here. With intelligent tiering and smart software you might make a near order of magnitude difference in cost and lose almost no user-perceptible functionality.


> I'm struggling to think of what kind of application needs data access in the background with latency of less than 1000ms

TikTok is most obvious example.


> Should have mentioned earlier, data needs to be accessible at all time. It’s user generated data that is downloaded in the background to a mobile phone, so super low latency is not important, but less than 1000ms required.

> The data is all images and videos, and no queries need to be performed on the data.

Okay, this is a good start, but there are some other important factors.

For every PB of data, how much bandwidth is used in a month, and what percentage of the data is actually accessed?

Annoyingly, the services that have the best warm/"cold" storage offerings also tend to be the services that overcharge the most for bandwidth.


Need more details. Maybe a graph (or several graphs) of requests per day for various items (categorized by popularity and size is fine): a curve (I suppose not very hyperbolic) breaking down the popularity of the top requested items vs. the long tail of rarely or almost-never-seen items, which I suppose make up most of those 10PB. Combine that with current bandwidth and data size/volume to get an idea of the bandwidth, IOPS, structure of the data, request patterns, and caching-layer requirements. I think a shared filesystem is probably worse than distributed blob storage here (assuming spinning disks somewhere and not huge caches). Not all days' usage patterns are equal, and your requirements are different from a database's (which is more in line with some suggestions here). Plus, data safety is everything for your kind of business, so redundancy is a must, and speed too (don't even think about Filecoin IMHO). I would think about a mix of spinning disks and NVMe as a cache layer, made redundant across multiple datacenters, if it's to save costs. If it's to save effort and a bit of cost, look at OVH's blob storage offerings, or contact Backblaze about a custom solution hosted by them?

Plus, here we are not talking about 10PB but probably 25PB given redundancy, and probably also 100PB and more given the assumption that your company is growing. So a solution that costs slightly less today but whose cost only does 2x when you do 10x would still be very interesting IMO. There is a lot to talk about ;)


I have a startup idea and want to make sure it scales. I was thinking S3 but don't like vendor lock-in. Not that far along yet; I was thinking maybe SeaweedFS, or even going crazy enough to write my own storage system: use a database like CockroachDB or MongoDB to store the metadata, and then replicate pieces of the file to "chunk servers". However, cleaning up deleted files, etc. seems a bit of a pain. I was thinking, instead of top down, let each node contain a copy of the metadata and scan each node individually instead of the central database trying to manage each node. Then have a process to handle under-replicated files. However, if you can adjust the number of replicas for, say, a popular file, you'd need to then coordinate which extra copies to remove when scaling down. Maybe a bit optimistic.

Kinda disappointed the file-storage solutions seem more complicated, with nothing as simple to set up as some of the newer databases like CockroachDB or MongoDB are to use. I feel like reinventing the wheel is kinda bad, as I'd rather let people who are more expert in this field handle this stuff, but I hate the idea of vendor lock-in and being forced to use other people's servers; self-hosting would be nice, from a single node for testing to a cluster spanning multiple datacenters. Maybe there's a solution out there; I've done some searching and it just seems to go in circles. I saw one system, but if you wanted to add or remove nodes in the future, you couldn't just "drain" a chunk server by moving its data.


If data storage isn't your startup's job then I would negotiate heavily on the AWS contract.


How much can you get the pricing reduced at AWS? At list price, 10PB of IA storage cost $1.5M/yr.


AWS will blow up your phone if they know you're interested in dealing. Various online forms smattered around the site will put you in this pipeline. Just ensure you have a competing quote for them to work against


At startup grade, it's fine to stick and grow with an IaaS provider like Amazon, Google, Microsoft, Oracle or whatever you like.

However, you'll get to a point where it's crucial to become profitable. And storing that much data costs a lot of money with any of the mentioned providers.

So, when you think it's the right time to become “mature”, then get your own servers up and running using colocation.

What options do you have here (just a quick brainstorm):

1. Set up some servers, put in a lot of hard drives, format them using ZFS, and make them available over NFS on your network

2. Get some storage servers

3. Set up a Ceph cluster

I used to work as a CTO at a hosting company and evaluated all of these options and more. Each of these options comes with pros and cons.

Just one last piece of advice: evaluate your options and get some external help on this. Each of these options has pitfalls and you need experienced consultants to set up and run such an infrastructure.

All in all, it's an investment that will save you a lot of money and give you the freedom and flexibility to grow further.

P.S. we ended up setting up a Ceph cluster. We found a partner who's specialized in hosting custom infrastructures. That partner is responsible for all the maintenance, so we could focus on the product itself.


If you want to stick with cloud, then stick with what you're doing or migrate to a cheaper alternative like wasabi, backblaze, etc.

If you're not afraid of having a few operations people on staff and running a few racks in multiple data centers, then buy a bunch of drives and servers and install something to expose everything via S3 interface (Ceph, Minio, ...) so none of your tools have to change.


I think they either stick to S3 or run their own DC with Minio in front. Backblaze as I mentioned in another comment will be a bad idea due to the poor S3 compatible interface. See - https://www.backblaze.com/b2/docs/b2_get_upload_url.html Wasabi might be fine, but I don't know if they can handle 10PB.


Disclaimer: I work for Backblaze so I'm biased and you should keep me honest. :-)

> Backblaze as I mentioned in another comment will be a bad idea due to the poor S3 compatible interface

Backblaze released an S3 compatible API recently: https://www.backblaze.com/b2/docs/s3_compatible_api.html

We're ALWAYS curious about any issue customers see, so if there is something specifically missing you use, we both want to hear about it, and we might be able to add it. Even if it doesn't help you right away, maybe it will help somebody else a few months down the road who might need the same feature.

We know we're compatible with Veeam backups (which only go through S3 APIs) for instance, and we continue to maintain that. We added the "S3 Object Lock" specifically for this particular vendor. So if you are missing one or two APIs, let us know!


Good to know. How does the presigned URL work? I thought it was this - https://www.backblaze.com/b2/docs/b2_get_upload_url.html. Does it function differently from this? My guess is that for a mobile app, the API backend will generate the presigned upload URL and hand it off to the mobile app. But we certainly don't want the mobile app to have unlimited upload for a 24hr period. So: one presigned URL, one upload.


> how does the presigned URL work? I thought it was this - https://www.backblaze.com/b2/docs/b2_get_upload_url.html

If you want to use Amazon S3 APIs, you do not call ANYTHING that is documented on the Backblaze website, and you especially should not call "b2_get_upload_url" because that is a B2 native API, not an Amazon S3 API. You can always tell if you are using "B2 Native" if the call starts with "b2_" -> then that has literally nothing to do with Amazon S3 compatibility, it is the custom Backblaze protocol.

If you want to find out about Amazon S3 APIs (which you use to communicate with Backblaze's Storage Cloud or Amazon S3) then you can start here: https://docs.aws.amazon.com/general/latest/gr/signature-vers... Make sure you stay ENTIRELY on the Amazon website, and only read Amazon documentation, and use the APIs Amazon talks about (but of course you are doing all of this communicating with the Backblaze Storage Cloud backend). If any of that fails in your application, or is incompatible, PLEASE LET US KNOW!!
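A hedged sketch of that flow with boto3's S3 API (the endpoint/region string below is illustrative; use the one shown on your bucket, and note a pre-signed URL is time-limited rather than single-use, so keep the expiry short and scope it to one key):

    import boto3

    b2 = boto3.client(
        "s3",
        endpoint_url="https://s3.us-west-002.backblazeb2.com",  # from your bucket details
        aws_access_key_id="KEY_ID",
        aws_secret_access_key="APPLICATION_KEY",
    )

    upload_url = b2.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-media-bucket", "Key": "user123/IMG_0001.jpg"},
        ExpiresIn=300,   # five minutes instead of 24 hours
    )
    # Hand upload_url to the mobile client; it PUTs the file directly to the bucket.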


Got it! I thought the S3 API is a wrapper on B2 native API. Good to know


One interesting thing that Backblaze supports (that I don't think many people use) is that you can actually use any S3 API on any bucket, and any B2 API on the same bucket. Like every other call if you want.

So if you HAPPEN to find something is more clear with the B2 API, it's fine to use those calls on a bucket. If you find something is more clear in the S3 API, it's also fine to use those calls. The bucket won't get confused. :-)


If you put the data on Storj DCS, it would run about $40k/month for list pricing with global availability and encryption. I'm sure you could get a deal if you asked though. It has S3 compatibility, so would be plug and play with whatever you have now. Egress out of AWS would be free.

Way cheaper than AWS, and a lot less headache than trying to run it all yourself.


Is this a case where GlusterFS and ZFS would work? I don't have PBs of data, but many TBs. Gluster nodes are spread around the globe; I use ZFS for the "brick" and then the Gluster magic gives me distribute/replica.

Surprised I didn't see Gluster already in this thread. Maybe it's not for such big scale?

edit: Wikipedia says " GlusterFS to scale up to several petabytes on commodity hardware"


Check whether you really need 10 PB or you can make do with several orders of magnitude less. I wouldn't be surprised if it was some sort of perverse incentive CV building thing, like engineers building a Kubernetes cluster for every tiny thing. If you really do need 10 PB, then still you probably should check again because you probably don't need 10 PB.


In cloud:

Wasabi's Reserved Capacity Storage is likely to be the cheapest: https://wasabi.com/rcs/

If you front it with Cloudflare, egress would be close to free given both these companies are part of the Bandwidth Alliance: https://www.cloudflare.com/bandwidth-alliance/

Cloudflare has an images product in closed beta, but that is likely unnecessary and probably expensive for your usecase: https://blog.cloudflare.com/announcing-cloudflare-images-bet...

--

If you're curious still, take a look at Facebook's F4 (generic blob store) and Haystack (for IO bound image workloads) designs: https://archive.is/49GUM


Besides what others have asked:

What are your access patterns? You say "no queries need to be performed," but are you accessing via key-value look-ups? Or ranged look-ups?

What do customers do with the pictures? Do customers browse through images and videos?

You mention it's "user generated data" - how many users (order of magnitude)? How often is new data generated? Does the dataset grow, or can you evict older images/videos (so you have a moving window of data through time)?

Besides your immediate needs, what other needs do you anticipate? (Will you need to do ML/Analytics work on the data in the future? Will you want to generate thumbnails from the existing data set?)

What my experience is based on: I was formerly Senior Software Engineer/Principal Engineer for a team that managed reporting tools for internal reporting of Amazon's Retail data. The team I was on provides tools for accessing several years worth of Amazon.com's order/shipment data.


S3 + Glacier. For data you're accessing via Spark/Presto/Hive I believe Parquet is a good format. At your scale AWS should prob provide discounts, worth connecting w/ an account rep.

I'd recommend reaching out to some data eng in the various Bigs, they certainly have more clear numbers. Happy to make an intro if you need, feel free to dm me.


Actual answer: There is almost NO company that really needs that much data. This has mostly just become a pissing match. In general, companies (especially startups) are way better off making sure they have a small amount of high-quality, accurate, data than a huge pile-o-dung that they think they're going to use magical AI/ML pixie dust to do something with.

That said, if you really think you must, spend effort on good deduping/transcoding (relatively easy with images/video), and consider some far lower-cost storage options than S3, which is pretty pricey no matter what you do. If S3 is a good fit, I hear good things about Wasabi, but haven't used it myself.

If you have the technical ability (non-trivial, you need someone who really understands disk and system I/O, RAID controllers, PCI lane optimization, SAN protocols and network performance (not just IP), etc.) and the wherewithal to invest, then putting this on good hardware with something like, say, ZFS at your site or a good co-lo will be WAY cheaper and probably offer higher performance than any other option, especially combined with serious deduping. (Look carefully at everything that comes in once and you never have to do it again.) Also, keep in mind that even-numbered RAID levels can make more sense for video streaming, if that's a big part of the mix.

The MAIN thing: Keep in mind that understanding your data flows is way more important than just "designing for scale". And really try to not need so much data in the first place.

(Aside: I was cofounder and chief technologist of one of the first onsite storage service providers - we built a screamer of a storage system that was 3-4x as fast, and scaled 10x larger, than IBM's fastest Shark array, at less than 10% of the cost. The bad news: we were planning to launch the week of 9/11 and, being self-funded, ran out of money before the economy came back. The system kicked ass, though.)


As others have said, it’s a complicated question, but if you have the resources/wherewithal to run Ceph but don’t want to deal with co-location, you can get a bunch of storage servers from Hetzner and get a much better grasp on cost over S3.

For example, at 10PB with every object stored twice (so 20PB raw storage), you'd need ~90 of their SX293[1] boxes, coming out to around €30k/mo. This doesn't include time to configure/maintain on your end, but it does cover any costs associated with drive replacement on failure.

I’ve done similar setups for cheap video storage & CDN origin systems before, and it’s worked fairly well if you’re cost conscious.

[1] https://www.hetzner.com/dedicated-rootserver/sx293/configura...


Buying just one of these looks pretty challenging, let alone ~90. :(


You would probably pick up the phone to buy 90


It's a complex question. I had experience working with a ~60-petabyte-ish system back in 2016, and there are a lot of things to cover (not only storage):

* network access - do you have data that will be accessed frequently, and with high traffic? You need to cover this skewed access pattern in your solution.

* data migration from one node to another, etc...

* ability to restore quickly in case of failure.

I would suggest to:

* use some open-source solution on top of the hosted infrastructure (Hetzner or similar is a good choice)

* bring in a seasoned expert to analyze your data usage/storage patterns; maybe there are other ways to make storage more cost-effective than simply moving out of AWS S3.


Try https://min.io/ I would 100% go for it if my company was not a https://www.caringo.com/products/swarm customer


I'd like to echo a suggestion I read earlier in this thread: at this scale (i.e. yearly spend), talk to AWS, GCP, Azure or a reseller of your trust and get a good deal to compare your other options with.

Disclaimer: I'm working at a consultancy/partner for a competing cloud.


I would consider moving to my own metal and using hadoop.


Maybe take a look at Backblaze Storage Pods:

https://www.backblaze.com/blog/open-source-data-storage-serv...

Their Storage Pod 6.0 can hold up to 480TB per server.


I am working on SeaweedFS. It was originally designed to store images as in Facebook's Haystack paper, and should be ideal for your use case. See https://github.com/chrislusf/seaweedfs

And it already supports S3 API, and other HTTP, FUSE, WebDAV, Hadoop, etc.

There should be many existing hardware options that are much cheaper than AWS S3.


I would go for something like Wasabi cloud storage.

Its API is S3 compliant.

I also believe they have minimal cost for transferring data from S3 into Wasabi, so the initial setup cost should be lower too.

It should be relatively cheaper than self-hosting too, when you account for the hidden costs that come with self-hosting: managing additional employees, having protocols in place for recovering from faults, expanding the storage as you go, maintaining existing infrastructure, etc.

You can compare the prices with respect to S3 at

(https://wasabi.com/cloud-storage-pricing/#cost-estimates)


Look at the cost of moving out of the cloud carefully.

Can you afford the up-front costs of the hardware needed to run the solutions you may want to run?

Will those solutions have good enough data locality to be useful to you?

It isn't really useful to have all your data on-site and your operations in the cloud; you've introduced many new layers that can fail.

If you go on-prem, the solution to look at is likely Ceph.

Source: Storage Software Engineer, who has spoken at SNIA SDC. I currently maintain a "small" 1PB ceph cluster at work.

Recommendation: Get someone who knows storage and systems engineering to work with you on the project. Even if you decide not to move, understanding why is the most important part.


If I were in your shoes I'd still host it on AWS, unless your shoes have a problem with the AWS bill, but then you run into other problems:

- Paying for physical space and facilities

- Paying people to maintain it

- Paying for DRP/BCP

- Paying periodically since it doesn't last forever so it'll need replacements

But if you had to move out of AWS and Azure and GCP aren't options, you can do Ceph and HDDs. Keep two extra copies of each file (three replicas total), so you have to lose three drives before any specific file (and only that file) suffers data loss. This does not come with versioning, full IAM-style access control, or webservers for static files (which you get 'for free' with S3).

HDDs don't need to be in servers, they can be in drive racks, connected with SAS or iSCSI to servers. This means you only need a few nodes to control many harddisks.

A more integrated option would be (As suggested) back blaze pod-style enclosures, or storinator type top loaders (supermicro has those too). It's generally 4U rack units for 40 to 60 3.5" drives, which again generally comes to about 1PB per 4U. A 48U rack holds 11 units when using side-mounted PDUs, a single top-of-rack switch and no environmental monitoring in the rack (and no electronic access control - no space!).

This means that for redundancy you'd need 3 racks of 10 units. If availability isn't a problem (1 rack down == entire service down) you can do 1 rack; if availability is important enough that you don't want downtime for maintenance, you need at least 2 racks. Cost will be about $510k per rack. Lifetime is about 5 to 6 years, but you'll have to replace dead drives almost every day at that volume, which means an additional ~2,000 drives over the lifespan; perhaps some RAM will fail too, and maybe one or two HBAs, NICs and a few SFPs. That's about $1,500,000 in spare parts over the life of the hardware, not including the racks themselves, and not including power, cooling or the physical facilities to house them.

Note: all of the figures above are 'prosumer' class and semi-DIY. There are vendors that will support you partially, but that is an additional cost.

I'm probably repeating myself (and others) here, but unless you happen to already have most of this (say: the people, skills, experience, knowledge, facilities, money upfront and money during its lifecycle), this is a bad idea and 10PB isn't nearly enough to do by yourself 'for cheaper'. You'd have to get into the 100PB or more arena to 'start' with this stuff if you need to get all of those externalities covered as well (unless it happens to be your core business, which from the opening post it doesn't seem to be).

A rough S3 One Zone IA calculation shows a worst-case cost of about $150,000 monthly, but at that rate you can negotiate significant discounts, and with some smart lifecycle configuration you can get it down further, to the point where letting AWS do it can end up roughly half as expensive as doing it yourself.

Calculation as follows:

DIY: at least 3 racks to match AWS One Zone IA (you'd need 3 racks in each of 3 different locations, 9 racks total, to have 3 zones, but we're not doing that as per your request), which means the initial starting cost is a minimum of $1,530,000. Combined with a lifetime spares cost of at least $1,500,000 over 5 years (if we're lucky), that's about $606,000 per year, just for the contents of racks that you already have to own.

Adding to this, you'd have some average colocation costs, whether you have an entire room, a private cage or a shared corridor. That's at least 160U, and at least 1,400VA per 4U (roughly 12A at 120V), which is what a third of a normal rack might use on its own! Roughly, that boils down to a monthly racking cost of $1,300 per 4U if you use one of those colocation facilities. That's another ~$45k per month, at the very least.

So a no-personnel colocated setup can be done, but doing all that stuff 'externally' is expensive: about $95,500 every month, with no scalability, no real security, no web services or load balancing, etc.

That means below-par features get you a rough saving of $50k monthly, assuming you don't need any personnel and nothing breaks more than usual. You'd also have to not use any S3 features besides raw storage. And if you use anything outside the datacenter where you're located (i.e. if you host an app in AWS EC2, ECS or a Lambda or something) and need a reasonable pipe between your storage and the app, that's a couple of thousand per month you can add, eating into the perceived savings.
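For reference, a sketch that consolidates the rough arithmetic above (all figures are the estimates from this comment, not quotes):

    # Back-of-envelope check of the DIY-vs-S3 figures above (rough estimates only).
    hardware = 3 * 510_000          # 3 racks of storage servers
    spares_over_life = 1_500_000    # drives, RAM, HBAs, NICs, SFPs over ~5 years
    years = 5

    diy_hw_monthly = (hardware + spares_over_life) / (years * 12)   # ~$50.5k
    colo_monthly = 45_000                                           # racks, power, cooling
    diy_monthly = diy_hw_monthly + colo_monthly                     # ~$95.5k

    s3_onezone_ia_monthly = 150_000  # worst-case list price, before discounts
    print(f"DIY ~${diy_monthly:,.0f}/mo vs S3 1Z-IA list ~${s3_onezone_ia_monthly:,}/mo")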


Strong plus-one here. Rolling your own basically means you will need an entire brand new business function to keep the lights on. That's something your entire company is going to have to adapt to. New staff, new ways of thinking about data, new problems the C-suite needs to consider. The opportunity cost alone can be immense here, since your engineers will need to spend their time working on rote data storage and not business problems.


Why not downsample everything to 10% the size, put those online, and use Amazon Glacier for the originals? (e.g. for exporting)

If you're storing images and videos directly from the phone, they can be downsampled drastically without losing quality on a viewing device that anyone's likely to have.

It's unlikely that anyone wants to download the full size copy, and if they do, they can wait a few hours for Glacier.

You could expose this to the customer, e.g. offer direct access of originals at 2x or 5x the price. But 99.9% of people will be OK with immediate access to quality images/video and eventual access to the unmodified originals.
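If the originals stay in S3, the Glacier half of this can be done with a lifecycle rule rather than manual moves. A minimal boto3 sketch, where the bucket name and the originals/ prefix are placeholders for illustration:

    # Minimal sketch: transition objects under originals/ to Glacier after 30 days.
    # Bucket name and prefix are placeholders.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-media-bucket",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "originals-to-glacier",
                "Filter": {"Prefix": "originals/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }]
        },
    )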


Perhaps look into Vast Data? They have a TCO calculator [1] but it seems to compare to other on-prem data storage providers (like Isilon...). 10PB in One Zone IA costs $100,000/mo without discount, or $1.2M per year, and that's just for storage alone. Vast claims something like $3.5M TCO over 5 years with 10PB of data and no growth assumption. 5 years on your S3 zone with no data growth (or transfer...) is $6M.

[1] https://vastdata.com/tco-calculator/


1) For hardware, you want cheap, expendable, bare metal. Look up posts about how Google built their own servers for reference.

2) For RAID, go with software-only RAID. You will sidestep problems caused by hardware RAID controllers each having their own on-disk format (i.e. non-swappable across models/makes).

3) For the filesystem, look at OpenAFS. CERN is using OpenAFS to store petabytes of data from the LHC.

4) For the operating system, look at Debian. Coupled with FAI (Fully Automatic Installation), it will enable you to deploy multiple servers in an automated way to host your files.


With a volume like that you should negotiate at least three storage+CDN providers and see who will give you the best offer. It could be as much as 50% off street price and even more if you are ready to sign a 2-3 years contract.

I personally would consider S3 Glacier+CloudFront, member of Bandwidth Alliance [0] of your choice+CloudFlare, and whomever serves TikTok now.

[0] https://www.cloudflare.com/en-gb/bandwidth-alliance/


I would buy commodity hardware and build my own storage cluster with ZFS, and just put MinIO in distributed mode on it. You have full control of redundancy levels, either at the cluster level or on the individual ZFS pool side, and can fine-tune it to what your business needs. Maybe you don't need to mirror all the data, so you can use RAIDZ2 at just 20-30% extra cost.

Hiring staff to build this would make sense at this point, because if your S3 storage cost is really $200,000/month, you can hire 3 good engineers for $450,000/year, which is roughly the cost of just two months of S3 storage.


I strongly recommend having more than one zone. A datacenter being offline for a while or totally burning is possible. It did happen a few weeks ago and a lot of companies learnt the value of multi zones the hard way.


It definitely depends on how you accumulate and the usage patterns. More clarity is needed there to make recommendations.

As an aside, you can often get nice credits for moving off of AWS to Azure or GCP. I recommend the latter.


Can you elaborate on what the >10PB of data is and why it’s important to your startup? Is it archived customer data, like backups? Or is it data purchased from vendors for analysis and ML?


See updated question. Thanks for asking


Hey Philip,

We store north of 2PB with AWS and have just committed to an agreement that will increase that commitment based on some competitive pricing they've given us.

Give me a shout if you'd like to chat.


I have designed, deployed and supported an S3-compatible storage system with 5PB capacity for a couple of years, so I have acquired the experience to put the right hardware and software together to build such a system. The cost reduction compared to a public cloud like AWS is tremendous. If you are interested in building private cloud storage of your own, you can contact me at hackernewsantispam@gmail.com for a more detailed discussion.


My high level view is that if you are storing that much content, most of it is bad, so the solution for me would be to delete it!

As for my own storage, I use 1TB SanDisk SD cards in a Raspberry Pi 2 cluster for write-once (user) data, and 8x64GB 50nm SATA drives from 2011 on a 2x 8-core Atom for data that changes all the time! Xo

People say that content is king, I think that final technology (systems that don't need rewriting ever) is king and content has peaked! ;)


Latency being time to first byte downloaded I’d still store this in cloud somewhere so that the really “hot” images/videos could be cached in a cloudfront CDN or something.

Also this is a startup, no? A million or so in storage so you need not preoccupy your startup with having to deal with failing disks, disk provisioning, collocation costs, etc. etc. not to mention the 11 9s of durability you get with S3, to me it just makes the most sense to do this on the cloud.


I'd look at using a Storinator cluster with a scalable network filesystem like Gluster, Lustre, Ceph or something along those lines. A 4U Storinator with 60 18TB drives has over 1PB of raw capacity and costs about $43,000. You'd be looking at an upfront cost of roughly $500k, but if you amortize that over a 5-year period you're looking at $100k per year, plus you'll need someone dedicating time to maintaining it.


If AWS is what you know I'd stick with it.

Changing that can be very very difficult for not much gain. Plus AWS skills are very easy to recruit for vs Google cloud.


By moving from AWS to a cheaper backup storage provider like B2, you could cut costs from ~$200k to ~$50k per month.

There is an S3-like interface, so you may just need to change the access key and region host: https://www.backblaze.com/b2/docs/s3_compatible_api.html
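A minimal sketch of what that switch might look like with boto3; the region in the endpoint URL, the bucket name and the credentials are placeholders, so check your own B2 bucket settings:

    # Minimal sketch: pointing an S3 client at Backblaze B2's S3-compatible API.
    # Endpoint region, bucket and credentials below are placeholders.
    import boto3

    b2 = boto3.client(
        "s3",
        endpoint_url="https://s3.us-west-002.backblazeb2.com",  # your B2 region endpoint
        aws_access_key_id="<keyID>",
        aws_secret_access_key="<applicationKey>",
    )
    b2.put_object(Bucket="my-bucket", Key="user1/video.mp4", Body=b"...")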


My previous startup (~2014) had a similar problem: PBs of data, with millions of mixed clients accessing it at close to real-time speeds. The biggest difference is that we needed to do real-time processing before delivering the content. We needed storage capacity balanced with CPU and RAM.

We ended up buying lots of Supermicro's ultra dense servers [1]. That's a 3U box containing 24 servers that are interconnected with internal switches (think: 1 box is a self-contained mini cloud). Each server has (in the cheap config) one CPU with 4 Xeon cores, 32GB RAM, and a 4TB disk.

Those were bought & hosted in China, and IIRC price tag was around $20k USD per box. That's 96TB per 3U, or >1.2PB and ~$200k per rack. We had a lot of racks in multiple datacenters. These days capacity can be much larger, e.g.: 6TB disk, 144TB per 3U and >1.8PB per rack.

We've tried Ceph, GlusterFS, HDFS, even early versions of Citus, and pretty much everything that existed and was maintained at the time. We eventually settled on Cassandra. It required 2 people to maintain the software, and 1 for the hardware.

Today, I would have done the same hardware setup, mainly because I haven't had a single Supermicro component fail on me since I first bought them in the early 2000s. Cassandra would've been replaced by FoundationDB. I've been using FoundationDB for a while now, and it just works: zero maintenance, incredible speeds, multi-datacenter replication, etc.

Alternatively, if I needed storage without processing, but with fast access, I'd probably go with Supermicro's 4U 90-bay pods [2]. That'd be 90*16TB, 1.4PB in 4U, or ~14PB per rack. And FoundationDB, no doubt.
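To make the FoundationDB suggestion concrete, here's a minimal sketch of chunked blob storage with its Python bindings. The chunk size and key layout are my own assumptions; note that values are capped at 100 KB and a single transaction at roughly 10 MB, so large videos would need multiple transactions in practice:

    # Minimal sketch: chunked blob storage in FoundationDB's Python bindings.
    # Chunk size and key layout are arbitrary choices for illustration.
    import fdb

    fdb.api_version(630)
    db = fdb.open()
    CHUNK = 90_000  # stay under the 100 KB value limit

    @fdb.transactional
    def write_blob(tr, prefix, data):
        # Split the blob into fixed-size chunks under a common key prefix.
        for i in range(0, len(data), CHUNK):
            tr[prefix + b"/%08d" % (i // CHUNK)] = data[i:i + CHUNK]

    @fdb.transactional
    def read_blob(tr, prefix):
        # Reassemble the blob by scanning every key under the prefix.
        return b"".join(kv.value for kv in tr.get_range_startswith(prefix))

    write_blob(db, b"blob/user1/img001", b"...image bytes...")
    print(len(read_blob(db, b"blob/user1/img001")))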

As a fun aside: back then, we also tried Kinetic Ethernet Attached Storage [3]. Great idea but what a pain in the rear it was. We did however have a very early access device. No idea if it's still in production or not.

[1] https://www.supermicro.com/en/products/system/3U/5038/SYS-50...

[2] https://www.supermicro.com/en/products/system/4U/6048/SSG-60...

[3] https://www.supermicro.com/products/nfo/files/storage/d_SSG-...


I've used Wasabi a ton in the past and it's been excellent. It's already been talked about a lot in this thread, but I haven't seen their marketing video[0] linked, and it's pretty funny so I thought I'd leave it here!

https://www.youtube.com/watch?v=P7OzyTG4fCM


Tape, if it fits your storage needs. You won't beat the cost of tape if you are doing cold storage.

For online or nearline storage, you should look at what Backblaze did. Either buy hardware that is similar to what they did (basically disk shelves, you can cram ~100 drives into a 4U chassis) or if you are at that scale you can probably build your own just like they did.


Have you considered deleting most of it?

Chances are you don't need all of it. Every company today thinks they need "Big Data" to do their theoretical magic machine learning, but most of them are wrong. Hoarding petabytes of worthless data doesn't make you Facebook.

To be a little less glib, I'd start by auditing how much of that 10PB actually matters to anyone.


For on-premises storage (without managing storage racks and Ceph yourself) you can look at Infinibox (https://www.infinidat.com/en/products-technology/infinibox).

(I'm not working there anymore, posting this just to help)


I just wanted to thank everyone for taking the time to reply. This has been way better input than I thought it would turn out to be.


I hope you will share your decision with us if you can, would be interesting to understand.

Good luck.


Ceph is a beast and will require at least 2-3 technicians with intricate Ceph knowledge to run multiple (!) Ceph clusters in a business continuity responsible manner.

Because you must be able to deal with Ceph quirks.

If you can shard your data over multiple independent stand-alone ZFS boxes, that would be much simpler and more robust. But it might not scale like Ceph.


Have you tried backblaze b2 storage? Requires more work client-side but is around 1/4 to 1/5 the price.

The only issue is whether or not you have a CDN in front of this data. If you do then backblaze might not be much cheaper than S3->Cloudfront. You'd save storage costs but easily exceed those savings in egress.


Disclaimer: I work for Backblaze so I'm biased.

> Have you tried backblaze b2 storage? Requires more work client-side...

Backblaze recently released an S3 compatible API, so I'm hoping it is zero client-side work: https://www.backblaze.com/b2/docs/s3_compatible_api.html

If you try it, and find any issues, please let us know!


I think if I _had_ to decide (I'm not the best informed person on the matter) I'd lean towards leofs[1].

I only read about it, but never used it.

It advertises itself as exabyte-scalable and provides S3 and NFS access.

[1] https://leo-project.net/leofs/



You can buy an appliance from Cloudian and have your S3 on-premises, with support.

They're basically 100% S3-compatible.

I don't know the details of their pricing, but they're production grade in the real sense of the word.

I am not affiliated with them in any way, but I interviewed with them a couple of years ago and left with a good impression.


Wasabi + BunnyCDN has worked like a charm for us. We've got about 50TB there, if I recall. Our bill is dramatically smaller than when we were on AWS. Wasabi has had some issues-- notably a DNS snafu that took the service out for about 8 hours, if I recall. But over all, the savings have been worth it.


Sounds like a standard business problem, make a spec and get the main 20 cloud providers to submit bids.


Compression is always a good alternative, which is especially effective when modification is infrequent.


If they are a good data storage company, the data is encrypted so they can't compress what they already have. Perhaps they could compress the new incoming data client side before encryption to save a few bits.


It would be cool to actually have a "blockchain" for something like this. I know that this huge amount of data to be stored is a niche market, but hear me out:

Everyone that wants to make extra money can join

You join with your computer hooked up to the internet, with a piece of software running in the background

You share % of your hard-drive and limit speed that can be used to upload/download

When someone needs to store 100PB of data (the "uploader"), they submit a "contract" on a blockchain. They also set up the redundancy rate, meaning how many copies need to be spread around to guarantee the consistency of the data as a whole.

The "uploader" shares a file - the file is being chop in chunks and each chunk being encrypted with uploader private PHP key. The info re chunks are uploaded to blockchain and everyone get a piece. In return, all parties that keep piece of uploader data get paid small % either via PayPal or simply in crypto.

I think that would be a cool project, but someone would have to do some back-of-napkin number crunching to see whether it would be profitable enough for data hoarders :)
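A toy sketch of the chunk-and-fingerprint step such a scheme would need; the chunk size and hash choice are arbitrary here, and real systems like Filecoin/IPFS handle this very differently:

    # Toy illustration: split a file into chunks and record their fingerprints,
    # the minimum metadata a storage "contract" would need to verify holders.
    import hashlib

    CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, arbitrary

    def chunk_manifest(path):
        manifest = []
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                manifest.append(hashlib.sha256(chunk).hexdigest())
        return manifest  # one hash per chunk; this is what would go on-chain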


I was hoping that someone had experience storing data with Filecoin. But I think it's still just too early to bet an existing business on it.


I'm curious why distributed cloud storage systems such as filecoin haven't been mentioned as a possible solution. Estimates of cost of storage that I saw on "file.app" put it at something like 100x cheaper than S3.

Not worth the risk or why?


Not from experience, but if I were given the task, I'd probably think about how that data could be distributed. Maybe use my own instance of IPFS, so each 'node' doesn't have to store all of the data.


Just run another venture round and don't think too hard about this problem. If everything goes well it won't be your problem for much longer; if it goes badly, then who cares anyway.


I happen to own exa-byte.com, in case you need a domain for it ;-)

(In 1998, in school, I looked up in our math book what would come after mega, giga... 20 years later, just as fresh and useless as on day one ;))


Have you looked into the storage tiering (eg moving objects to glacier) for less active users?

Perhaps it’s a mix of some app pattern changes and leveraging the storage tier options in AWS to reduce your cost.


Is the storage of the data critical to the future growth of the business?


Here's an unpopular answer - don't store 10PB of data. Find a way for your startup to work without needlessly having to store insane amounts of data that will likely never be needed.


Excellent advice for a data backup startup.


Doesn't this imply that they started a company without actually having a plan for the most fundamental part of what they are selling?

This is like an ISP asking how they can get hooked up to the internet.


It's more like a company that has validated product fit now needing to figure out how to scale economically.

Apple didn't start manufacturing with mega Foxconn contracts. They had to figure that out along the way as their scale demanded.

However I share your sentiment: doing things the same way but cheaper is usually not the solution. Doing things differently (in-sourcing) might be the path forward.


Figuring out how to scale is not just a part of a storage startup, it's the whole thing.

Apple created something people wanted and sold at a price that would still make money if it was assembled by hand. They didn't form a company around a commodity like data storage.

Data storage is a commodity. Everyone already has some, online storage companies already exist. If you don't know how to store a lot of data and your company's whole purpose is to store a lot of data, it sounds like something that should have been worked out before making the company.


No. They are already managing 10PB, planning for which would be very stupid when just starting up.


Why would planning to be able to execute the single focus of your startup be stupid?


I tend to agree. If you are a storage company, I’d think that part of your secret sauce should be how to store tens or hundreds of petabytes of customer backups economically.

Maybe I’m wrong though. Perhaps the real secret sauce is the end user experience and the kind of storage you use on the backend doesn’t matter at all.

However I bet that the “cloud storage space” is pretty crowded and lots of people shop on price more than anything. If your business model is all about price, then finding economical storage is critical to your company and needs to be part of your core competency.

If price isn’t that important, perhaps it doesn’t matter... the “winners” would win no matter how expensive their storage solution is.

But honestly.... I feel like part of your core competency needs to be managing the storage system.


I also tend to agree. I think AWS is great and use it as my default solution, but if I was starting a company that had high bandwidth and/or storage requirements, I would be looking for other solutions from day one.


A rule of thumb for performance is that every 10-100x in scale involves changing up your fundamentals.

It's a bit different nowadays that a lot of scaling tech is commoditized, but still means things like negotiating new contracts, finding & fixing the odd pieces that weren't stressed before, etc.

(congrats on hitting the new usage levels + good luck! we're at a much smaller scale, but trying to figure out some similar questions for stuff like web-scale publishing of data journalism without dying on egress $, so it's an interesting thread...)


If you can solve that for me without affecting revenue I have $1m in cash for you right there.

We are a photo/video storage service.


I don't know if this is a crazy idea or if it creates scalability issues, but could you craft an algorithm to cold store data for users who do not show a need for instant access, and/or warm up the data when you predict it will be needed? Kind of like a physical logistics company would need to do with distributed warehousing.

Sticking points I see are: 1. If you get it wrong, you'll need some form of UX that keeps users from getting too angry about it. 2. The cost of moving the data between hot and cold storage might make this prohibitive until a much larger scale. 3. User behaviors might not be predictable enough.
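On S3 this kind of policy can even be approximated without moving providers, by rewriting cold objects into a cheaper storage class and restoring them on demand. A rough boto3 sketch, with bucket/key names as placeholders:

    # Rough sketch: demote a cold object to Glacier, and request a restore
    # when a user unexpectedly needs it again. Names are placeholders.
    import boto3

    s3 = boto3.client("s3")

    def demote_to_glacier(bucket, key):
        # Rewrite the object in place with a colder storage class.
        s3.copy_object(
            Bucket=bucket, Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            StorageClass="GLACIER",
            MetadataDirective="COPY",
        )

    def warm_up(bucket, key, days=7):
        # Kick off an async restore; the object becomes readable hours later.
        s3.restore_object(
            Bucket=bucket, Key=key,
            RestoreRequest={"Days": days, "GlacierJobParameters": {"Tier": "Bulk"}},
        )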


So from a completely evil (well, capitalist) perspective, do you have data on how often people retrieve backups, and at what 'age' they do so?

Because there may be an inflection point that offering monetary compensation for data loss, rather than actually trying to store the data, would make more financial sense. I.e., "All data > than 2 years gets silently expunged, and anyone trying to retrieve it at that point gets $10 per gig in compensation for 'our mistake'".

Please don't actually consider that though.


(A less evil approach that might still lead to reduced costs would be detecting old, unaccessed data of sufficient size and flagging it for users, with a small refund or service discount if they purge it. Though that assumes you have 'power users' who are storing massive amounts, to where the savings in storage costs would be worth it)

(And if you don't already, I would also consider making it so items that are in the trash for some period of time, say 30 days, get deleted automatically as well, possibly with a reminder email a few days before)

(And lastly, depending on user profiles and usage, incentives around reducing resolution/quality of photos and video, and automating that in the app as part of the sync process, might provide some opportunities to reduce costs of storage > the lost revenue of cheaper plans.)


"We used advanced machine learning algorithms to predict which users will need to retrieve which pieces of data in the future, and silently delete everything else."


What about storing the older things in some slower backup service that's cheap but slow to access? If the user eventually accesses them some day, you would kick off a background job to get them fresh again. Not super UX-friendly of course, but it could reduce costs. Or is this already the standard thing to do?


You could probably put a fun marketing spin on that.

"We use ML to ensure we only store the highest quality data, freeing you from the chains of having too much worthless data and nothing to do with it."


GPT-3 that convinces you why you don't need to store this thing


Image processing to label each image ("baby with spaghetti on head", "cat playing with string", "naked person"), and then only save one image with each label.


Just save the label. Then you use generative techniques when they want to retrieve the image.


They're a data storage service.


At that scale I would contact AWS, Backblaze and Wasabi directly to see what improvements they can offer in terms of TCO (and potentially for a longer term contract).


+1

and tell each that you contact others. Lowest bidder wins.


See how Filecoin works, and how decentralized databases work. It should be way cheaper than AWS. Search for S3-like APIs on decentralized databases and you will get your answer.


Google Nearline etc. cost a bit less; also, coming from AWS, they may give a good discount. Considering operations and maintenance, the cloud will be cheaper.


Use Intelligent-Tiering or some kind of custom system that moves data into Glacier more aggressively based on access times. It can help a lot.


Always look to nature first. Nature never lies. DNA storage:

Escherichia coli, for instance, has a storage density of about 10^19 bits per cubic centimeter. At that density, all the world’s current storage needs for a year could be well met by a cube of DNA measuring about one meter on a side.

There are several companies doing it: https://www.scientificamerican.com/article/dna-data-storage-...


What happens to your business if you lose this data?


I'm unsure if it's mature enough for your use right now (in particular, the retrieval market is undeveloped for fast access, but I wonder if you looked at filecoin?)

https://file.app/ https://docs.filecoin.io/build/powergate/

(Disclosure: I am indirectly connected to filecoin, but interested in genuine answers)


Have you looked into Backblaze? They’re a lot cheaper than Amazon and have S3-compatible APIs.


Off topic, but I'm shocked that anyone would trust uploading sensitive files (e.g. nudes) to this service. Photo vault type apps can be useful, but I would never want the content in those apps to upload to a small service like this based on their word that employees won't go through it.


> no queries need to be performed on the data.

cat >/dev/null, obviously. ;-)


Not sure the state of some of the decentralized solutions...


Tape drives. Semi joke.

How often you access data is another question.


Pure Flashblade 100%

Feel free to ama on it, I'm a huge fan


How much is it costing to keep 10PB on AWS S3? According to calculator.s3.amazonaws.com, it's USD 200,000+ per month.


900 LTO-U8 tapes


I'd store in node_modules/


The right answer for you may have more to do with your business requirements than technical requirements. I've done large scale storage in cloud providers (S3, GCS, etc.) and on premise (I designed the early storage systems at Dropbox). I haven't found there to be a one-size-fits-all answer.

If you place a high value on engineering velocity and you already rely on managed services, then I would look to stay in S3. Do the legwork to gather competitive bids (GCS, Azure, maybe one second tier option) and use that in your price negotiation. Negotiation is a skill, so depending on the experience in your team, you may have better or worse results -- but it should be possible to get some traction if you engage in good faith with AWS.

There is a considerable opportunity cost to moving that data to another cloud provider. No matter how well you plan and execute it, you're going to lose some amount of velocity for at least several months. In a worse scenario, you are running two parallel systems for a considerable amount of time and have to pay that overhead cost on your engineering team's productivity. In the worst case scenario, you experience service degradation or even lose customer data. It's quite easy for 2-3 months to turn into 2-3 years when other higher priority requirements appear, and it's also easy for unknowns to pop up and complicate your migration.

With all of that said, if the fully baked cost of migrating to another cloud provider (engineering time + temporary migration services + a period of duplicated costs between services + opportunity cost) is trajectory changing for your business, then it certainly can be done. I feel like GCS is a bit better of a product vs S3, although S3 has managed to iron out some of its legacy cruft in the last few years. Azure is not my cup of tea. I have never seriously considered any other vendors in the space, although there are many.

Your other option is to build it. I've done it several times, people do it every day. You may need someone on the team who either has or can grow the skillset you're going to need: vendor negotiation, capacity planning, hardware qualification, and other operational tasks. You can save a bunch of money, but the opportunity cost can be even greater.

10PB is the equivalent of maybe 1-2 racks of servers in a world where you can easily get 40-50 drive systems with 10-18TB drives (of course for redundancy you would need more like 2-2.5x, and you need space to grow into so that you're always ahead of your user growth curve). At any rate, my point is that the deployment isn't particularly large, so you aren't going to see good economies of scale. If you expect to be in the 100+PB range in 6-12 months, this could still be the right option.
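A rough sanity check of that sizing; every number below is an assumption (45-drive 4U chassis, 16TB drives, 10 usable 4U slots per rack, 2.5x raw-to-usable overhead), not a quote:

    # Rough sizing sanity check; every figure here is an assumption.
    drives_per_chassis = 45          # a typical 4U top-loader
    tb_per_drive = 16
    chassis_per_rack = 10            # leaving room for switches/compute
    overhead = 2.5                   # replication/EC plus growth headroom

    raw_pb_per_rack = drives_per_chassis * tb_per_drive * chassis_per_rack / 1000
    usable_pb_per_rack = raw_pb_per_rack / overhead
    print(raw_pb_per_rack, usable_pb_per_rack)   # ~7.2 PB raw, ~2.9 PB usable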

Personally, I would look to build a service like this in S3 and migrate to on-premise at an inflection point probably 2 orders of magnitude above yours, if the future growth curve dictated it. The migration time and cost will be even more onerous, but the flexibility while finding product/market fit probably countermands the cost overhead.

There is a third option, which is hosted storage where someone else runs the machines for you. Personally I see it as a stop-gap solution on the path to running the machines yourself, and so it's not very exciting. But it is a way to minimize your investment before fully committing.


On tape.


Context please.

1. Do you have paying customers already?

2. Can the startup weather large capex? does opex work better for you?

3. Do you already have staff with sufficient bandwidth to support this, or will you need to hire?

4. What are the access patterns for the data?

5. What is the data growth rate?

6. What is the cost of losing some, or all of this data?

7. What is your expected ROI?

TL;DR - storing and serving up the data is the easy part.


Talk with Linus at LTT.


Using Erasure coding.


GlusterFS + ZFS


I'm way late to the conversation. There are a few things that I haven't seen mentioned (apologies if I overlooked them).

I have no idea how you evaluate the necessity of keeping the data safe, and that plays a huge factor in deciding what's appropriate. Amazon S3 makes it a no-brainer for having your data safe across failure domains. Of course, the same can be done with non-S3 solutions, but someone has to set it all up, test it, and pay for it.

My background in storage is mostly related to working with Ceph and Swift (both OpenStack Swift and SwiftStack) while being employed by various hardware vendors.

Some thoughts on Ceph:

- In my opinion, Ceph is better suited for block storage than object storage. To be fair, it does support object storage via the Rados Gateway (RGW), and RGW does support the S3 API. However, Ceph has a strong consistency model, and in my opinion strong consistency tends to be better suited to block storage. Why is this? For a 10PB cluster (or larger), failures of various types will be the norm (mostly disk failures). What does Ceph do when a disk fails? It goes to work right away to move whatever data was on the failed disk (using its redundant copies/fragments) to a new place. No big deal if it's only a single HDD in failed status at any given point in time. What if you have a server, disk controller, or drive shelf fail? You get a whole bunch of data backfilling going on all at once. The other consideration with a strong consistency model is multi-site storage, which doesn't work so well due to the higher latency of inter-site communication.

- Ceph has a ton of knobs, is very feature rich, and is high on complexity (although it has improved). The open-source install mechanisms and admin tools have experienced (and continue to have) a high rate of churn: do a quick search on how to install/deploy Ceph and you'll see multiple answers, and the same goes for admin tools. Should you strongly consider Ceph as an option, I would strongly advise you to license and use one of the 3rd-party software suites that (a) take the pain away from install/deploy/admin, and (b) reduce the amount of deep expertise you would need to keep it running successfully. Examples of these 3rd-party Ceph admin suites are Croit [0] and OSNEXUS [1]. Alternatively, if you like the idea of a Ceph appliance, I would take a close look at SoftIron [2].

Aside from Ceph, it's worth taking a very close look at OpenStack Swift [3][4]. It's only object storage and has been around for about 10 years. It supports the S3 protocol and also has its own Swift protocol. It's open source and it has an eventually consistent data model. Eventually consistent is (IMO) a much better fit for a 10+PB cluster of objects. Why is this? Because failures can be handled with less urgency and at more opportune times. Additionally, an eventually consistent model makes multi-site storage MUCH easier to deal with.

I suggest going further and spending some quality time with the folks at SwiftStack [5]. Object storage is their game and they're very good at it. They can also help with on-prem vs hosted vs hybrid deployments.

Additionally, you would definitely want to use erasure coding (EC) as opposed to full replication. This is easy enough to do with either Swift or Ceph.
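For readers unfamiliar with the idea, here's a toy single-parity illustration of what erasure coding buys you; real deployments use Reed-Solomon schemes (e.g. 10 data + 4 parity fragments) via Swift EC policies or Ceph EC pools, not this:

    # Toy erasure coding: k data blocks plus 1 XOR parity block.
    # Any single lost block can be rebuilt from the survivors.
    from functools import reduce

    def xor_blocks(blocks):
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data = [b"AAAA", b"BBBB", b"CCCC"]      # k = 3 equal-sized data blocks
    parity = xor_blocks(data)               # m = 1 parity block

    # Simulate losing block 1, then rebuild it from the survivors + parity.
    survivors = [data[0], data[2], parity]
    assert xor_blocks(survivors) == data[1]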

Disclaimers and disclosures - I am not currently (nor have ever been) employed by any of the companies I mentioned above.

Dell EMC Technical Lead and co-author of these documents:

   Dell EMC Ready Architecture for Red Hat Ceph Storage 3.2 - Object Storage Architecture [6]
   Dell EMC Ready Architecture for SwiftStack Storage - Object Storage Architecture Guide [7]
Intel co-author of this document:

   "Accelerating Swift with Intel Cache Acceleration Software" [8]


   [0] https://croit.io
   [1] https://www.osnexus.com/technology/ceph
   [2] https://softiron.com
   [3] https://wiki.openstack.org/wiki/Swift
   [4] https://github.com/openstack/swift
   [5] https://www.swiftstack.com
   [6] https://www.delltechnologies.com/resources/en-us/asset/technical-guides-support-information/solutions/red_hat_ceph_storage_v3-2_object_storage_architecture_guide.pdf
   [7] https://infohub.delltechnologies.com/section-assets/solution-brief-swiftstack-1
   [8] https://www.intel.sg/content/www/xa/en/software/intel-cache-acceleration-software-performance/intel-cache-acceleration-software-performance-accelerating-swift-white-paper.html


Floppies. Lots of floppy disks. Like, 7B of them.


In my opinion, you're probably better off building and managing your own infrastructure at that scale, especially if you control the rest of the software stack that runs your platform. It would be best to go with an open source solution and invest in your own technology, infrastructure and people. This way, no matter what happens you can be in control of your data for as long as you want to and avoid vendor lock-in at every level.

If this isn't already something that your company is familiar with, you'll need people who know how to buy, build, test and manage infrastructure across datacentres, including servers and core networking. Understanding platforms like Linux will be critical, as well as monitoring and logging solutions (perhaps like Prometheus and Elastic).

The only solution that I know of which would scale to your requirements would be OpenStack Swift (https://wiki.openstack.org/wiki/Swift). It's explicitly designed as an eventually consistent object store which makes it great for multi-region, and it scales. It is Apache 2.0 licensed, written in Python with a simple REST API (plus support for S3).

The Swift architecture is pretty simple. It has 4 roles (proxy, account, container and object) which you can mix and match on your nodes and scale independently. The proxy nodes handle all your incoming traffic, like retrieving data from clients and sending it on to the object nodes and vice versa. Proxy nodes can be addressed independently rather than through a load balancer, which is one of the ways Swift is able to scale out so well. You could start with three and go up to dozens across regions, as required.

The object nodes are pretty simple, they are also Linux machines with a bunch of disks each formatted with a simple XFS file system where they read and write data. Whole files are stored on disk but very large files can be sharded automatically and spread across multiple nodes. You can use replication or erasure coding and the data is scrubbed continuously, so if there is a corrupt object it will be replaced automatically.

Data is automatically kept on different nodes to avoid loss for when a node dies, in which case new copies of the data are made automatically from existing nodes. You can also configure regions and zones to help determine the placement of data across the wider cluster. For example, you could say you want at least one copy of an object per datacentre.

I know that many large companies use Swift and I've personally designed and built large clusters of over 100 nodes (with SwiftStack product) across three datacentres. This gives us three regions (although we mostly use two) and we have a few different DNS entries as entry points into the cluster. For example, we have one at swift.domain.com which resolves to 12 proxy nodes across each region, then others which resolves to proxy nodes in one region only, e.g. swift-dc1.domain.com. This way users can go to a specific region if they want to, or just the wider cluster in general.
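For a sense of the client side, a minimal sketch using python-swiftclient against such an endpoint; the auth URL, credentials and container name are placeholders, and real deployments typically use Keystone rather than the v1-style auth shown here:

    # Minimal sketch using python-swiftclient (v1/tempauth-style auth shown;
    # endpoint, credentials and container name are placeholders).
    from swiftclient.client import Connection

    conn = Connection(
        authurl="https://swift.domain.com/auth/v1.0",
        user="account:user",
        key="secret",
        auth_version="1",
    )
    conn.put_container("photos")
    conn.put_object("photos", "user1/img001.jpg", contents=b"...")
    headers, body = conn.get_object("photos", "user1/img001.jpg")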

We used Linux on commodity hardware, stock 2RU HPE servers with 12 x 12 TB drives (so total cluster size is ~14PB raw), but I'm sure there's a better sweet spot out there. You could also create different types, higher density or faster disk as required, perhaps even an "archive" tier. NVMe is ideal for the account and container services; the rest can be regular SATA/NL-SAS. You want each drive to be addressed individually, so no multi-disk RAID arrays; however, each of our drives sits on its own single-member RAID-0 array in order to make use of some caching from the RAID controller (so 12 x RAID-0 arrays per object node).

Our cluster nodes connect to Cisco spine and leaf networking and have multiple networks; e.g. the routeable frontend network for accessing the proxy nodes, private cluster network for accessing objects and the replication network for sending objects around the cluster.

Ceph is another open source option, and while I love it as block storage for VMs, I’m not convinced that it’s quite the right design for a large, distributed object store. Compared to Swift, object storage seems more of an afterthought and inherits a system designed for blocks. For example, it is synchronous and latency sensitive, so multi-region can be tricky. Could still be worth looking into, though.

Given the size of your data and ongoing costs of keeping it in AWS, it might be worthwhile investing in a small proof of concept with Swift (and perhaps some others). If you can successfully move your data onto your own infrastructure I'm sure you can not only save money but be in better control overall.

I've worked on upstream OpenStack and I'm sure the community would be very welcoming if you wanted to go that way. Swift is also just a really great piece of technology and I love seeing more people using it :-) Feel free to reach out if you want more details or some help, I'll be glad to do what I can.


Pied Piper ?


RAM?


Lol. A stick of 256gb RAM costs ~$3000. 1TB needs 4. 1PB needs 4000. 10PB needs 40,000. So this would be an upfront cost of $120M.

And this doesn't even cover how you'd fit 40,000 sticks of RAM together.


10PB of RAM being only $120M blows my mind, to be honest. I would have guessed that price was closer to the SSD cost for 10PB.


This made me curious about what the SSD cost would be. It looks like you can get a 2TB SSD for $200. So that's $100/TB = $1M for 10PB. Of course prices may be higher for enterprise SSDs and you may need redundancy. Then again, you could probably get a bulk discount at that scale.


Tape is still very cost effective. Load latency might be a few minutes though


Wasabi storage


You almost certainly should not have 10PB of data. Not only is it extremely expensive, it is unlikely that millions of people have each allowed you to take gigabytes of their data. You are sitting on a huge violation of CCPA, GDPR, and other privacy laws, as well as copyright issues. If you are scraping data off the Internet you likely have content that is illegal to possess in several different countries (such as child sexual abuse material or videos of ISIL killings). As a startup you do not have the legal and technical capabilities to manage this data, so you should not have it.


A quick search shows this is the cofounder of Keepsafe, so I guess they most likely got the data from their customers.


Move to Oracle Cloud and before everybody starts hammering me look at this: https://www.oracle.com/cloud/economics/

I am not from Oracle and I am also running a startup with growing pains. Oracle is a bit late to the cloud game, so they are loading up their customer base now; the ear-squeezing will come 3-5 years down the road. Maybe you can take advantage of this.



