Ask HN: How would you store 10PB of data for your startup today?
307 points by philippb on April 23, 2021 | 366 comments
I'm running a startup and we're storing north of 10PB of data and growing. We're currently on AWS and our contract is up for renewal. I'm exploring other storage solutions.

Minimum requirements: those of AWS S3 One Zone IA (https://aws.amazon.com/s3/storage-classes/?nc=sn&loc=3)

How would you store >10PB if you were in my shoes? The thought experiment can be with and without the data transfer cost out of the current S3 buckets. Please also mention what your experience is based on. Ideally you store large amounts of data yourself and can speak from first-hand experience.

Thank you for your support!! I will post a thread once we've reached a decision on what we ended up doing.

Update: Should have mentioned earlier, the data needs to be accessible at all times. It’s user-generated data that is downloaded in the background to a mobile phone, so super low latency is not important, but less than 1000ms is required.

The data is all images and videos, and no queries need to be performed on the data.




Non-cloud:

HPE sells their Apollo 4000[^1] line, which takes 60x3.5" drives - with 16TB drives, that's 960TB per machine, so one rack of 10 of these is 9PB+, which nearly covers your 10PB needs. (We have some racks like this.) They are not cheap. (Note: Quanta makes servers that can take 108x3.5" drives, but they need special deep racks.)

The problem here would be the "filesystem" (read: the distributed service): I don't have much experience with Ceph, and ZFS across multiple machines is nasty as far as I'm aware, but I could be wrong. HDFS would work, but the latency can be completely random there.

[^1]: https://www.hpe.com/uk/en/storage/apollo-4000.html

So unless you are desperate to save money in the long run, stick to the cloud, and let someone else sweat about the filesystem level issues :)

EDIT: btw, we let the dead drives "rot": replacing them would cost more, and the failure rate is not that bad, so they stay in the machine, and we disable them in fstabs, configs, etc.

EDIT2: at 10PB HDFS would be happy; buy 3 racks of those Apollos, and you're done. We first started struggling at 1000+ nodes; now, with 2400 nodes, nearly 250PB raw capacity, and literally a billion filesystem objects, we are slow as f*, so plan carefully.


> The problem here would be the "filesystem" (read: the distributed service): I don't have much experience with Ceph,

I think at that scale you would want a ceph expert on staff as a full time salaried position.

For an organization that has 10PB now and can project a growth path to 15, 20, 25PB in the future, you should talk with management about creating a vacant position for that role, and filling it.

> EDIT: btw, we let the dead drives "rot": replacing them would cost more, and the failure rate is not that bad, so they stay in the machine, and we disable them in fstabs, configs, etc.

I am a huge advocate of hosting stuff yourself on bare metal you own, but this is a ridiculous statement. Any drive in that class should come with a 3 or 5 year warranty. And the manual labor and hassle time to replace one (you have hundreds of thousands of dollars of storage and no ready to go cold spares on a shelf?!?!) is infinitesimal.


OK, clarification: most of our fleet has a LOT of Supermicro machines, where it's impossible to identify a drive except by serial number. There's no UID light, the machine needs to go offline, plus some 10 screws need to come out to open the chassis, 4 more per drive.

The amount of downtime this would generate for a single machine, plus the operational cost, isn't worth the hassle unless the machine loses a significant chunk of drives.


Chassis built for mass storage usually have lever caddies backed on to hotplug SAS backplanes.


One would think indeed, but not the early FatTwins.


If the colo is far and there’s plenty of headroom, it might not justify much urgency.


That's what remote hands are for. Yes, you batch the replacements up, but this is exactly what remote hands are for.


Costs less to leave them alone, and go once a year for a trash run. Cattle not pets, no trips to the vet. Don’t waste money diagnosing / fixing.


'cattle not pets' is not a valid argument when you're the owner and operator of the bare metal hardware. Do you also recommend that ISPs not replace failed fans in core and edge routers and optical transport systems? Let things with dual power supplies run for six months on one failed power supply?

Also you've clearly never interacted with cattle, sheep, goats, llamas or alpacas, which absolutely do get things like veterinary care and vaccinations. Large animal vet is a whole specialty and they spend lots of time working on animals other than horses. No trips to the vet???


I’ve been both farmer of black angus beef cattle (on 750 acres) right down to castrating steer, and founder/owner of the world’s largest VDN (14 international data centers) right down to pulling drives.

For what it’s worth, meat packers buy dead cattle and don’t ask questions. But this is a well known metaphor, and I’m pointing out by that metaphor, no trips to the vet. As a cattle farmer, I’d argue it holds true if you’re big enough they’ve got tags not names: the vet comes to you and only if you think you’ve got a herd problem instead of an individual problem.

As for the HN angle: these are contrarian and objection-inspiring policies that let us wholesale video delivery to/through other CDNs while making a profit.


This doesn't make sense to me. I work for a CDN with tens of thousands of servers in over a hundred data centers. We are always working to improve our turnaround time on repairing servers, even though we have thousands. Hard drive failure is one of the leading causes of server failures. Dead servers means diminished capacity, and capacity is what pays our bills.

Farmers absolutely have a vet who takes care of the cattle. I am not sure what you are on about.

The whole point of cattle-vs-pets is you are supposed to treat all the servers the same, not that you have to treat all of them poorly.


Less is more.

Another contrarian view — particularly suitable for large VDN content like media, not small CDN content like html/js — is that one doesn’t need to be in over a hundred data centers, one needs to be in the key exchanges: you don’t have to be at the ends of every spoke if you pick the right hubs.

Agree swapping bad drives is a reasonable use of smart hands when done in batches, as no diagnosis is needed. I’d advocate considering extending that practice to the servers themselves. Math works if you find local tech repo/refurb shops that take gear (w/o drives) to bulk refurb & resell. The other way is to get your OEM to provide aliveness-as-a-service.

Anything so you don’t have to do manual labor, ideally ever.


You can also get units like this direct from Western Digital/HGST. We have a system with 3 of their 4U60 units, and they weren't all that expensive. Ordering direct from HGST, we only paid a small premium on top of the cost of the SAS drives.


This is the answer that worked for us storing petabytes a decade ago.

We collaborated with OEMs and also shared/compared notes with Backblaze on rackable mass storage for commodity drives.

Backblaze published a series of iterations of designs of multi-drive chassis, and one of the OEMs would make them for other buyers as well. If you’re doing this route, read through those for considerations and lessons learned.

Performance was > 10x better than enterprise solutions. A policy to “leave dead disks dead” aka “let them rot” as said elsewhere in this thread kept maintenance cheap.

The secret sauce part making this viable for commercial online storage hosting (we hosted video) was we used disks as JBOD with an in-house meta index with P2P health awareness to place objects redundantly across disks, chassis, racks, colocation providers, and regions.
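
Very rough sketch of the placement idea in Python; the cluster map, replica count, and all the names here are made up, and a real system would also track disk health and rebalance, but it shows the shape of "spread copies across failure domains":

    import hashlib
    import random

    # Toy cluster map: colo -> rack -> chassis. Entirely hypothetical.
    CLUSTER = {
        "colo-a": {"rack-1": ["chassis-1", "chassis-2"], "rack-2": ["chassis-3"]},
        "colo-b": {"rack-3": ["chassis-4", "chassis-5"], "rack-4": ["chassis-6"]},
    }

    def place(object_key, copies=3):
        """Pick `copies` chassis, no two sharing a rack, deterministically per key."""
        rng = random.Random(hashlib.sha256(object_key.encode()).digest())
        slots = [(colo, rack, chassis)
                 for colo, racks in CLUSTER.items()
                 for rack, chassis_list in racks.items()
                 for chassis in chassis_list]
        rng.shuffle(slots)
        chosen, used_racks = [], set()
        for colo, rack, chassis in slots:
            if (colo, rack) in used_racks:
                continue
            chosen.append((colo, rack, chassis))
            used_racks.add((colo, rack))
            if len(chosen) == copies:
                break
        return chosen

    print(place("user123/video456.mp4"))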


Don't buy HPE gear. Qualify the gear with sample units from a few competing vendors and you'll see why.


I’ve done qualification on hpe and various competing vendors and honestly haven’t seen dramatic differences in terms of performance and failure rates. From my experience the biggest difference was with vendor support services rather than the actual hardware. I’d be curious to hear more about your qualification experience with this particular vendor if you’d be willing.


When we set up user content storage of images and mp3s for Last.fm back in 2006ish we used MogileFS (from the bradfitz LJ perl days) running on our own hardware. 3/4/5/6u machines stuffed full of disks. I still think it's an elegant concept – easy to grok, easy to debug, easy to reason about. No special distributed filesystem to worry about.

Don't take this as an endorsement of the MogileFS perl codebase in 2021, but worth considering this style of storage system depending on your precise needs.


MinIO is an option as well and would allow you to transition from testing in S3 to your own MinIO cluster seamlessly.
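
For what it's worth, the "seamless" part is mostly that the same S3 client code keeps working and only the endpoint and credentials change. A minimal boto3 sketch (endpoint, bucket, and keys are made up):

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://minio.internal.example.com:9000",  # your MinIO cluster
        aws_access_key_id="MINIO_ACCESS_KEY",
        aws_secret_access_key="MINIO_SECRET_KEY",
    )
    s3.upload_file("photo.jpg", "user-media", "user123/photo.jpg")
    blob = s3.get_object(Bucket="user-media", Key="user123/photo.jpg")["Body"].read()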


I wonder if anyone can comment on whether they have experience running MinIO at scale. It would be a pleasant surprise if a “simple” MinIO cluster could handle such a workload.


Does it scale that far?


Likely not that "gracefully". But Ceph absolutely does, and has an S3 gateway.


Using Ceph like S3 could be a bit tricky if all of that 10PB is very small files.

Red Hat did an interesting series of blog posts about Ceph and getting it to 1 billion objects:

https://www.redhat.com/en/blog/scaling-ceph-billion-objects-...


I'd be interested to learn more about your HDFS usage and your experience at that scale. Would you be willing to have a chat? If so, my email is in my profile.


At that kind of scale, S3 makes zero sense. You should definitely be rolling your own.

10PB costs more than $210,000 per month at S3, or more than $12M after five years.

RackMountPro offers a 4U server with 102 bays, similar to the BackBlaze servers, which fully configured with 12GB drives is around $11k total and stores 1.2 PB per server. (https://www.rackmountpro.com/product.php?pid=3154)

That means that you could fit all 15TB (for erasure encoding with Minio) in less than two racks for around $150k up-front.

Figure another $5k/mo for monthly opex as well (power, bandwidth, etc.)

Instead of $12M spent after five years, you'd be at less than $500k, including traffic (also far cheaper than AWS.) Even if you got AWS to cut their price in half (good luck with that), you'd still be saving more than $5 million.
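
Back-of-the-napkin version of the above so you can plug in your own quotes; every input here is just the figure from this comment, not authoritative pricing:

    # Five-year comparison using the numbers above; adjust to your own quotes.
    S3_MONTHLY = 210_000           # ~10PB on S3, $/month
    SELF_UPFRONT = 150_000         # two racks of 102-bay servers, fully configured
    SELF_MONTHLY = 5_000           # power, bandwidth, colo, remote hands, etc.
    MONTHS = 60

    s3_total = S3_MONTHLY * MONTHS
    self_total = SELF_UPFRONT + SELF_MONTHLY * MONTHS
    print(f"S3:          ${s3_total:>12,}")
    print(f"Self-hosted: ${self_total:>12,}")
    print(f"Savings:     ${s3_total - self_total:>12,}")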

Getting the data out of AWS won't be cheap, but check out the snowball options for that: https://aws.amazon.com/snowball/pricing/


[disclaimer: while I have some small experience putting things in DC, including big GPU servers, I have never been anywhere near that scale, certainly not storage]

$10k is for a server with no hard drives. With 12TB disks, and with enough RAM, we're talking closer to $40-50k per server. Let's say for simplicity you're going to need to buy 15 of those, and let's say you only need to replace 2 of them per year. You need 25 over five years; that's already ~$750k over 5 years.

And then you need to factor in the network equipment, the hosting in a colocation space, and if storage is your core value, you need to think about disaster recovery.

You will need at least 2 people full time on this, in the US that means minimum 2x 150k$ of costs per year: over 5 years, that's 1.5m$. If you use software-defined storage, that's likely gonna cost you much more because of the skill demand.

Altogether that's all gonna cost you much more than 500k$ over 5 years. I would say you would need at least 5x to 10x this.


yes, the TCO needs consideration, not just the metal


After a certain size, AWS et al simply don't make sense, unless you have infinitely deep pockets. For storage that you pull from, AWS et al charge bandwidth costs. These costs are non-trivial for non-trivial IO. I worked up financial operational models for one of my previous employers when we were looking at the costs of remaining on S3 versus rolling it into our own DCs. The download costs, the DC space, staff, etc. were far less per year (and the download cost is a one-time cost) than the cold storage costs.

Up to about 1PB with infrequent use, AWS et al might be better. When you look at 10-100PB and beyond (we were at 500PB usable or so last I remembered) the costs are strongly biased towards in-house (or in-DC) vs cloud. That is, unless you have infinitely deep pockets.


I should add to this comment, as it may give the impression that I'm anti cloud. I'm not. Quite pro-cloud for a number of things.

The important point to understand in all of this is that there are cross-over points in the economics at which one becomes better than the other. Part of the economics is the speed of standing up new bits (the opportunity cost of not having those new bits instantly available). This flexibility and velocity is where cloud generally wins on design, for small projects (well below 10PB).

This said, if your use case appears to be rapidly blasting through these cross-over points, the economics usually dictates a hybrid strategy (best case) or a migration strategy (worst case).

And while your use case may be rapidly approaching these limits (you need to determine where they are if you are growing/shrinking), there are things you can do to de-risk and cost-reduce transitions ahead of this.

Hybrid as a strategy can work well, as long as your hot tier is outside of the cloud. Hybrid makes sense also if you have to include the possibility of deplatforming from cloud providers (which, sadly, appears to be a real, and significant, risk to some business models and people).

None of this analysis is trivial. You may not even need to do it, if you are below 1PB, and your cloud bills are reasonable. This is the approach that works best for many folks, though as you grow, it is as if you are a frog in ever increasing temperature water (with regard to costs). Figuring out the pain point where you need to make changes to get spending on a different (better) trajectory for your business is important then.


And again at an even larger size it makes sense again with >80% discounts on compute and $0 egress.


We had taken the discounts into account (we had qualified for them). The $0 egress was not a thing when we did our analysis. And we were moving 10's of PB/month. BW costs were running into sizable fractions of millions of dollars per month.


The thing about fitting everything in one rack, potentially, is vibration. There have been several studies into drive performance degradation from vibration, and there's a noticeable impact in some scenarios. The Open Compute "Knox" design as used by Facebook spins drives up when needed, and then back down, though whether that's for vibration impact, I don't know (their cold storage use [0]).

0: https://datacenterfrontier.com/inside-facebooks-blu-ray-cold...

https://www.dtc.umn.edu/publications/reports/2005_08.pdf

https://digitalcommons.mtu.edu/cgi/viewcontent.cgi?article=1...


Here is Brendan Gregg showing how vibrations can affect disk latency:

https://www.youtube.com/watch?v=tDacjrSCeq4


I'm an absolute noob here, but is using SSD racks for storage a feasible option cost wise and for this issue in particular?


Absolutely it's an option

It's gonna cost more

But it's also going to be nearly vibration-free (just the PSU fans), and stupidly-fast


No, SSDs are still way too expensive if you don't need the performance.


10PB costs more than $210,000 per month at S3, or more than $12M after five years.

Your pricing is off by a 2X - he said he's ok with infrequent access, 1 zone, which is $0.01/GB, or $100K/month.

If he rarely needs to read most of the data, he can cut the price by 1/10th by using deep archive, $0.00099 per GB, so $10K/month, or around $600K over 5 years, not including retrieval costs.
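
For reference, the per-GB list prices being discussed here work out roughly like this (2021 prices as quoted in this thread; retrieval and request fees not included, and the Standard figure is approximate):

    data_gb = 10 * 1_000_000  # 10PB in GB (decimal)
    tiers = {
        "S3 Standard":             0.021,    # approx, tiered
        "S3 One Zone-IA":          0.010,
        "S3 Glacier Deep Archive": 0.00099,
    }
    for name, per_gb in tiers.items():
        print(f"{name:24s} ${data_gb * per_gb:>10,.0f} / month")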


Nope, can't use Deep Archive as he specified max retrieval time of 1000ms. But you're correct with S3-IA


For a 10X reduction in cost, things that are impossible often become possible.


> Nope, can't use Deep Archive as he specified max retrieval time of 1000ms.

If accesses can be anticipated, pre-loading data from cold storage to something warmer might make it viable.
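
On AWS that pre-loading is a restore request per object. A hedged sketch with boto3 (the bucket, key, and restore window are made up; real code would batch these and poll for restore completion):

    import boto3

    s3 = boto3.client("s3")
    # Stage a Deep Archive / Glacier object into a temporarily retrievable copy
    # before the user is expected to ask for it.
    s3.restore_object(
        Bucket="example-user-media",
        Key="user123/video456.mp4",
        RestoreRequest={"Days": 1, "GlacierJobParameters": {"Tier": "Standard"}},
    )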


>RackMountPro offers a 4U server with 102 bays, similar to the BackBlaze servers, which fully configured with 12GB drives is around $11k total and stores 1.2 PB per server. (https://www.rackmountpro.com/product.php?pid=3154)

I dare you to buy 102 12TB drives for $11k

The cheapest consumer-class 12TB HDD is ~$275 a pop

That's $28k just for the drives


If you have a PBs of data that you rarely access, it seems to make sense to compress it first.

I've rarely seen any non-giants with PBs of data properly compressed. For example, small JSON files converted into larger, compressed parquet files will use 10-100x less space. I am not familiar with images but see no reason why encoding batches of similar images should make it hard to get similar or even better compression ratios

Also, if you decide to move off later on, your transfer costs will also be cheaper if you can move it off in a compressed form first.
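
For the small-JSON case specifically, a minimal pyarrow sketch of what "convert into larger, compressed parquet files" looks like (file names are made up, it assumes the files share a schema, and the actual ratio obviously depends on the data):

    import glob
    import pyarrow as pa
    import pyarrow.json as pj
    import pyarrow.parquet as pq

    # Fold many small newline-delimited JSON files into one compressed Parquet file.
    tables = [pj.read_json(path) for path in sorted(glob.glob("events/*.json"))]
    pq.write_table(pa.concat_tables(tables), "events-2021-04.parquet", compression="zstd")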


Could be wrong, but I don't believe batches of already-compressed images compress well

but I'd be very interested to hear about techniques for this, because I have a lot of space eaten up by timelapses myself


It's not about space reduction, it's about handling the small file problem. HDFS can handle up to 500M files without issue but the amount of RAM needed to store the files' metadata starts to go beyond what you'd typically find in a single server these days.

When you store multiple images and/or videos inside of a single PQ file, you'll end up keeping fewer files on your server.

I believe Uber stores JPEG data in PQ files and Spotify stores audio files in PQ or a similar format on their backend.
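
If it helps, the packing itself is pretty mundane; here's a sketch of stuffing a batch of image blobs into one Parquet file with pyarrow (paths are made up, and I'm not claiming this is exactly what Uber/Spotify do):

    import glob
    import pyarrow as pa
    import pyarrow.parquet as pq

    paths = sorted(glob.glob("photos/*.jpg"))
    blobs = [open(p, "rb").read() for p in paths]
    table = pa.table({"path": paths, "image": blobs})
    # JPEGs are already compressed, so skip recompression; the win is fewer files
    # (and therefore far less namenode metadata), not smaller bytes.
    pq.write_table(table, "photos-batch-0001.parquet", compression="none")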


On the contrary, batches of images with a high degree of similarity compress _very_ well. You have to use an algorithm specifically designed for that task though. Video codecs are a real-world example of such - consider that H.265 is really compressing a stream of (potentially) completely independent frames under the hood.

I'm not sure what the state of lossless algorithms might be for that though.


Best I know of for that is something like lrzip still, but even then it's probably not state of the art. https://github.com/ckolivas/lrzip

It'll also take a hell of a long time to do the compression and decompression. It'd probably be better to do some kind of chunking and deduplication instead of compression itself simply because I don't think you're ever going to have enough ram to store any kind of dictionary that would effectively handle so much data. You'd also not want to have to re-read and reconstruct that dictionary to get at some random image too.
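
A toy version of the chunk-and-dedup idea, using fixed-size chunks and a content hash as the key (the chunk size and the in-memory store are just placeholders; a real system would chunk content-defined and persist to disk):

    import hashlib

    CHUNK = 4 * 1024 * 1024      # 4 MiB chunks, arbitrary choice
    store = {}                    # sha256 -> chunk bytes; really an on-disk object store

    def ingest(path):
        """Store a file as deduplicated chunks; return the recipe to rebuild it."""
        recipe = []
        with open(path, "rb") as f:
            while block := f.read(CHUNK):
                digest = hashlib.sha256(block).hexdigest()
                store.setdefault(digest, block)   # identical chunks are stored once
                recipe.append(digest)
        return recipe

    def rebuild(recipe):
        return b"".join(store[d] for d in recipe)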


A movie is a series of similar images, and while it does allow temporal compression along a 3rd axis beyond the 2D raster, H.265 is about as good as it gets at the moment, but it's also lossy, which might not be tolerable.


H.266 VVC looks impressive. Waiting to get my hands on an FPGA codec for testing.


Right, but we're not talking about compressing a video stream, we're talking about compressing individually compressed pictures - big difference.


I’ve heard reports that minio gets slow beyond the hundreds of millions of objects threshold


You are mixing up your units, with 12GB drives and 15TB in a rack.


You didn't take personnel cost into account. You will need at least two system administrators to look after those racks (even if remote hands to change faulty drives are in the monthly opex). That quickly takes you upwards of 200k/year at current prices (which will rise another 50% in 5 years).

On the other hand, you may negotiate a very sizable discount from AWS for 10PB of storage for 5 years.


Does Snowball let you exfiltrate data from AWS? I was under the impression it was only for bulk ingestion.


First sentence on the linked page: "With AWS Snowball, you pay only for your use of the device and for data transfer out of AWS."


Wow that’s up to $500,000 just to export 10PB (depending on region).


According to https://aws.amazon.com/snowball/pricing/, egress fees depends on the region, which can range from $0.03/GB (North America & parts of Europe) to $0.05/GB (parts of Asia and Africa).

So US$300K to US$500K for egress fees + cost of Snowball devices.

The major downside of Snowball in this export case is the size limit of 80TB per device - from https://aws.amazon.com/snowball/features/ :

"Snowball Edge Storage Optimized provides 80 TB of HDD capacity for block volumes and Amazon S3-compatible object storage, and 1 TB of SATA SSD for block volumes."

That'd be around 125 Snowball devices to get 10PB out.

If OP actually has 10PB on S3 currently, the OP may want to fallback to leaving the existing data on S3 and accessing new data in the new location.


> If OP actually has 10PB on S3 currently, the OP may want to fallback to leaving the existing data on S3 and accessing new data in the new location.

I remember asking an Amazon executive in London when AWS was very new and they were evangelising it to developers; I asked him what the cost of getting data out of AWS would be if I wanted to move it to another service provider, or how easy it would be. And he avoided giving a straight simple answer. I realised then that the business model from the start was to lock developers/startups/companies in to the AWS ecosystem.


> If OP actually has 10PB on S3 currently, the OP may want to fallback to leaving the existing data on S3 and accessing new data in the new location.

Another option would be to leave data on S3, store new data locally, and proxy all S3 download requests, ie, all requests go to the local system first. If an object is on S3, download it, store it locally, then pass it on to your customer. That way your data will gradually migrate away from S3. Of course you can speed this up to any degree you want by copying objects from S3 without a customer request.

An advantage of doing this is that you can phase in your solution gradually, for example:

Phase 1: direct all requests to local proxies, always get the data from S3, send it to customers. You can do this before any local storage servers are setup.

Phase 2: configure a local storage server, send all requests to S3, store the S3 data before sending to customers. If the local storage server is full, skip the store.

Phase 3: send requests to S3, if local servers have the data, verify it matches, send to customer

Phase 4: if local servers have the data, send it w/o S3 request. If not, make S3 request, store it locally, send data

Phase 5: store new data both locally and on S3

At this point you are still storing data on S3, so it can be considered your master copy and your local copy is basically a cache. If you lose your entire local store, everything will still work, assuming your proxies work. For the next phase, your local copy becomes the master, so you need to make sure backups, replication, etc are all working before proceeding.

Phase 6: start storing new content locally only.

Phase 7: as a background maintenance task, start sending list requests to S3. For objects that are stored locally, issue S3 delete requests to the biggest objects first, at whatever rate you want. If an object isn't stored locally, make a note that you need to sync it sometime.

Phase 8: using the sync list, copy S3 objects locally, biggest objects first, and remove them from S3.

The advantage IMO is that it's a gradual cutover, so you don't have to have a complete, perfect local solution before you start gaining experience with new technology.
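
The core of phases 2-4 is basically a read-through cache. A minimal sketch of that piece (bucket and cache path are made up; a real service adds auth, range requests, error handling, verification, and capacity checks):

    import os
    import boto3

    BUCKET = "example-user-media"
    CACHE_ROOT = "/srv/objects"
    s3 = boto3.client("s3")

    def get_object(key):
        local_path = os.path.join(CACHE_ROOT, key)
        if os.path.exists(local_path):                 # local hit: no S3 request at all
            with open(local_path, "rb") as f:
                return f.read()
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        with open(local_path, "wb") as f:              # cache it, so data migrates over time
            f.write(body)
        return body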


There's also the snowmobile https://aws.amazon.com/snowmobile/


The AWS Snowmobile pages only talk about migrating INTO AWS, not OUT OF.

from https://aws.amazon.com/snowmobile/ :

AWS Snowmobile is an Exabyte-scale data transfer service used to move extremely large amounts of data to AWS. You can transfer up to 100PB per Snowmobile, a 45-foot long ruggedized shipping container, pulled by a semi-trailer truck. Snowmobile makes it easy to move massive volumes of data to the cloud, including video libraries, image repositories, or even a complete data center migration.

from https://aws.amazon.com/snowmobile/faqs/ :

Q: What is AWS Snowmobile?

AWS Snowmobile is the first exabyte-scale data migration service that allows you to move very large datasets from on-premises to AWS.


I mean, the title on the snowmobile page says:

> Migrate or transport exabyte-scale data sets into and out of AWS


Unfortunately the header is misleading. The FAQ says:

Q: Can I export data from AWS with Snowmobile?

Snowmobile does not support data export. It is designed to let you quickly, easily, and more securely migrate exabytes of data to AWS. When you need to export data from AWS, you can use AWS Snowball Edge to quickly export up to 100TB per appliance and run multiple export jobs in parallel as necessary


That wording is not inconsistent with the interpretation that Snowball is for in only.


You realize you can't fit 10 appliances of 4U in a rack? (A rack is 42U)

There's network equipment and power equipment that requires space in the rack. There's power limitations and weight limitations on the rack that prevents to fill it to the brim.


I've put 39U of drives in a rack before. You only need 1U for a network switch, and you can get power that attaches vertically to the back, so it doesn't take up any space. If you have a cabinet with rack in front and back and all the servers have rails, the weight shouldn't be an issue.

The biggest issue will be cooling depending on how hot your servers run.

Specifically, it was a rack full of Xserve RAIDs, which are 3U each and about 100lbs each. So that was over 1300lbs.


Looked up some specs.

* A typical rack is rated for somewhere between 450 and 900 kg (your mileage may vary).

* A disk is about 720g.

* A 4U quanta enclosure is 36 kg empty.

* With 10 enclosures of 60 disks, that's a total of 792 kg inside the rack.

You will want to check what rack you have exactly and weight things up.

The rack itself is another 100 to 200 kg. You will want to double check whether the floor was designed to carry 1 ton per square meter. It might not be.

My personal tip. Definitely do NOT put a row of that in an improvised room in an average office building. You might have a bad surprise with the floor. ;)

Anyway. The project will probably be abandoned after the OP tries to assemble the first enclosure (80kg fully loaded) and realize he's not going to move that.


You run a single network switch for a rack filled to the brim with drives?


Sure. A single rack is a common failure domain so you make sure to replicate across racks.

E.g. Dunno about anyone else, but Facebook racks (generally) have a single switch.


That seems like a rather unnecessary risk. Sure, you stripe across the racks, but another ToR switch in an MLAG configuration is a minuscule expense compared to the costs involved here.


You could easily run two switches, there would be enough room. But normally yes, I'd run one switch per rack. Switch failure is pretty rare, and when it does happen it's pretty easy to switch it out for a spare.


Gold standard APC PDUs are all 0U side mount.


What if you want to move off S3? Let's do the math.

* To store 10+ PB of data.

* You need 15 PB of storage (running at 66% capacity)

* You need 30 PB of raw disks (twice for redundancy).

You're looking at buying thousands of large disks, on the order of a million dollars upfront. Do you have that sort of money available right now?

Maybe you do. Then, are you ready to receive and handle entire pallets of hardware? That will need to go somewhere with power and networking. They won't show up for another 3-6 months because that's the lead time to receive an order like that.

If you talk to Dell/HP/other, they can advise you and sell you large storage appliances. Problem is, the larger appliances will only host 1 or 2 PB. That's nowhere near enough.

There is a sweet spot in moving off the cloud, if you can fit your entire infrastructure into one rack. You're not in that sweet spot.

You're going to be filling multiple racks, which is a pretty serious issue in terms of logistics (space, power, upfront costs, networking).

Then you're going to have to handle "sharding" on top of the storage because there's no filesystem that can easily address 4 racks of disks. (Ceph/Lustre is another year long project for half a person).

The conclusion of this story: S3 is pretty good. Your time would be better spent optimizing the software. What is expensive? The storage or the bandwidth or both?

* If it's the bandwidth. You need to improve your CDN and caching layer.

* If it's the storage. You should work on better compression for the images and videos. And check whether you can adjust retention.


> Let's do the math.

Offers no math.

At retail, 625 16TB drives is $400000. This is about 2x the MONTHLY retail s3 pricing. Further, as we all know, AWS bandwidth pricing is absolutely bonkers (1).

I think your conclusion that S3 is "pretty good" needs a lot more math to support.

(1) https://twitter.com/eastdakota/status/1371252709836263425


The math should also include the price of the staff who babysit 625 spinning metal disks, who likely drive to a data center multiple times a week to swap failed drives. I shudder to think if this job fell in my lap!


Sure, but you actually have to go through the steps. The very first step, back of the napkin, indicates savings. More than enough to warrant a more detailed evaluation. Then you can start considering the more complicated factors (redundancy, staffing, power, ...).

My response was in the context of someone who didn't do any of that.

Also, my tongue-in-cheek response to you is: the price will be offset by the SRE engineers who were babysitting your AWS setup that you'll no longer need. (More seriously, I don't think finding quality sysadmins who enjoy this stuff is particularly harder than finding quality roles for any other tech positions).


Been there, done that, at all levels. I would much rather be working on a 10PB set of hardware racks, including all the drive replacements. When you factor in the costs of compute hardware (to make that useful), networking equipment, etc, the cost trebles again, and then again for the power, cooling, and cage space to run it all. The actual break-even point of running your own hardware is more like 2 years.

But it's not about price: It's about control, and it's about the expertise you gain from running all of that. If you have 10PB of data, you should have someone in-house who knows how to work with 10PB of data at a low level, and the best way to get that is to employ people at all levels to make that work. You gain significant advantage from having the direct performance data and the expertise of having techs whose 9-5 is replacing disks.


>>> My response was in the context of someone who didn't do any of that.

I did and you ignored all of it -_-

* To store 10+ PB of data.

* You need 15 PB of storage (running at 66% capacity)

* You need 30 PB of raw disks (twice for redundancy).

>>> At retail, 625 16TB drives is $400000.

That's only 10 PB of disks. That's about one third of the actual need.

Please triple your number and we will start talking. That's about 2000 disks and well above a million dollar.

You can't just call a supplier and get a thousand 16TB disks (or even a hundred). They don't have that in stock now. The lead time might be 6 months to get a few hundred. They might not have 16TB disks for sale at all; the closest might be a 12 or 14TB.

Handling large amount of hardware is a logistic problem. Not a cost problem.


With over 600 drives you would have at least 20 hot spares, and I don't think you'd have more than 2 drives fail per week if you don't have bad batches of them.


> 625 16TB drives is $400000

how much is the real estate cost of 625 drives and associated machinery to run it?

At a guess, AWS has an operating margin of about 30%, so you can approximate their cost of hardware, bandwidth, and other fixed costs as 70% of their sticker price. As a startup, can you actually get this price to be lower? I actually don't think you can, unless your operation is very small and can be done out of a home/small office.


Their margin on bandwidth is literally over 1000%. Quick google says that S3 costs 320% more than Backblaze (which, presumably, isn't running at a loss).

> At a guess

The comments in this discussion that try to provide actual numbers show a fairly lopsided argument against S3. The comments that are advocating for S3 aren't as detailed.

You can look at this at the macro level, as one comment did, and see that one 1.2PB RackmountPro 4U server is $11K. Yes, of course you still need space and power. But at least this gives us actual numbers to play with as a base (e.g. buying 10 of these is less than what you'll spend on S3 in a month).

At a micro level: you can spend $650 on a 16TB hard drive, or $650 on 16TB for 2 months of S3. Now, S3 is battle-tested, has redundancy, has power, has a CPU, has a network card (but not bandwidth), and is managed - unquestionably HUGE wins. But the hard drive (and other equipment) come with a 3-5 year warranty. Now, the difference between $650 for the hard drive and $12000 for S3 over 3 years won't let you get the power, rent the racks, hire the staff, and invest in learning Ceph. But the difference between $400K and $5 million will.


> one 1.2PB RackmountPro 4U server is $11K

An empty 4U server with 96+ bays looks like it will set you back ~$7k minimum. At $500 per drive (I have no idea what volume discounts are like) filling it with drives would be in the range of ~$50k. You'd still need RAM. And (as you noted) space and power.

I have no idea how the math ends up working out, but a 1PB appliance in working order is nowhere near as cheap as $11k.


>>> You can look at this at the macro level, as on comment did, and see that one 1.2PB RackmountPro 4U server is $11K

protip: $11k is the cost of the empty enclosure. disks are sold separately.


FWIW you can get great redundancy with far less than 2x storage factor. e.g. Facebook uses a 10:14 erasure coding scheme[1] so they can lose up to 4 disks without losing data, and that only incurs a 1.4x storage factor. If one's data is cold enough, one can go wider than this, e.g. 50:55 or something has a 1.1x factor.

Not that this fundamentally changes your analysis and other totally valid points, but the 2x bit can probably be reduced a lot.

[1] https://engineering.fb.com/2015/05/04/core-data/under-the-ho...
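
The arithmetic, for anyone playing along (k data shards out of n total shards; any k of the n reconstruct the object):

    # (k, n) erasure-coding schemes: storage factor n/k, survives n-k lost shards.
    for k, n in [(10, 14), (50, 55), (8, 12), (1, 2)]:   # 1:2 == plain 2x replication
        print(f"{k}:{n}  factor {n / k:.2f}x, tolerates {n - k} simultaneous failures")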


https://en.wikipedia.org/wiki/Parchive

Basically they use par2 multi-file archives for cold storage with each archive file segment scattered across different physical locations. Always fun to see the kids rediscovering tricks from the old days.


> If you talk to Dell/HP/other, they can advise you and sell you large storage appliances. Problem is, the larger appliances will only host 1 or 2 PB. That's nowhere near enough.

This is just incorrect.

If you talk to HPE, they should be quite happy to sell you my employer's software (Qumulo) alongside their hardware. 10+ PB is definitely supported. (The HPE part is not required.)

If you talk to Dell EMC, they will quite happily sell you their competing product, which is also quite capable of scaling beyond 1-2PB.


Most (all?) enterprise vendors will go well beyond 1-2PB.

Four years ago, one of the all flash vendors routinely advertised “well under a dollar a gigabyte”. Their prices have dropped dramatically since then, but the out of date numbers translate to “well under a million per PB”. That’s at the high end of performance with posix (nfs) or crash coherent (block) semantics. (Some also do S3, if that’s preferable for some reason)

With a 5 year depreciation cycle, those old machines were at << $16K / month per PB. Today’s all flash systems fit multiple PB per rack, and need less than one full time admin.

Hope that helps.


I've checked what I could find on Qumulo. It is software that you run on top of regular servers, to form a storage cluster.

It seems to me you're only confirming my previous point, that you need to invest in complicated/expensive software to make the raw storage usable.

>>> Then you're going to have to handle "sharding" on top of the storage because there's no filesystem that can easily address 4 racks of disks. (Ceph/Lustre is another year long project for half a person).

There's no listed price on the website, you will need to call sales. Wouldn't be surprised if it started at 6 figures a year for a few servers.

It looks like it may not run on just any server, but may need certified server hardware from HP or Qumulo.


Always fun stumbling across another Qumulon on here :)


AWS is ridiculously expensively at their scale, both for storage and egress. But the choice is not only between that and building a staffed on-premise storage facility.

You can compromise at a middle ground - rent a bunch of VPS/managed servers and let the hosting companies deal with all the nastiness of managing physical hardware and CAPEX. Cost around $1.6-2/TB/month (e.g. Hetzner's SX) for raw non-redundant storage, an order of magnitude better than AWS. Comes with far more reasonably priced bandwidth too.

Build some error correction on top using one of the many open-source distributed filesystems out there, or perhaps an in-house software solution (Reed-Solomon isn't exactly rocket science). And for some 30+% overhead, depending on workload (you can have very low overhead if you have few reads or very relaxed latency requirements), you should have a decently fault-tolerant distributed storage at a fraction of AWS costs.
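
To illustrate the "not rocket science" point with the simplest possible case, here is single-parity striping (RAID-5 style: one lost stripe is recoverable). A real deployment would use proper Reed-Solomon with more parity shards; this toy is just the shape of it, and the stripe count is arbitrary:

    def make_stripes(blob, data_stripes=3):
        """Split blob into data_stripes equal stripes plus one XOR parity stripe."""
        stripe_len = -(-len(blob) // data_stripes)            # ceil division
        padded = blob.ljust(stripe_len * data_stripes, b"\0") # pad; keep len(blob) as metadata
        stripes = [padded[i * stripe_len:(i + 1) * stripe_len] for i in range(data_stripes)]
        parity = bytes(stripe_len)
        for s in stripes:
            parity = bytes(p ^ b for p, b in zip(parity, s))
        return stripes + [parity]                             # ship one stripe per host

    def recover_missing(surviving_stripes):
        """XOR of all surviving stripes reconstructs the single missing one."""
        out = bytes(len(surviving_stripes[0]))
        for s in surviving_stripes:
            out = bytes(o ^ b for o, b in zip(out, s))
        return out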


I agree that considering Hetzner is a good idea. I have used them often, never any problems, and very low pricing.


> You need to improve your CDN and caching layer.

Depends on usage patterns, but if this is 10PB of users' personal photos and videos, then you're not going to get much value from caching because the hit rate will be so low.


> If it's the bandwidth. You need to improve your CDN and caching layer.

What would you recommend for this?

(considering data is stored in S3)


Verizon and Redis have worked well for me.


> * To store 10+TB of data.

> * You need 15 TB of storage (running at 66% capacity)

> * You need 30 TB of raw disks (twice for redundancy).

Did you mean PB?


Corrected.


Very good advice!


If you have good sysadmin/devops types, this is a few racks of storage in a datacenter. Ceph is pretty good at managing something this size, and offers an S3 interface to the data (with a few quirks). We were mostly storing massive keys that were many gigabytes, so I'm not sure about performance/scaling limits with smaller keys and 10PB. I'd be sure to give your team a few months to build a test cluster, then build and scale the full-size cluster. And a few months to transfer the data...

But you'll need to balance the cost of finding people with that level of knowledge and adaptability against the cost of bundled storage packages. We were running super lean, got great deals on bandwidth and power, and had low performance requirements. When we ran the numbers for all-in costs, it was less than we thought we could get from any other vendor. And if you commit to buying the server racks it will take to fit 10PB, you can probably get somebody like Quanta to talk to you.


This is amazing. Thank you. I’ve been looking at Backblaze storage pods, which seem to be designed for that use case. I've never rented rack space.

Do you remember roughly the math on how much cheaper it was, or how you thought about upfront cost vs ongoing? Just an order of magnitude would be great.


Roughly a decade ago S3 storage pricing had a ~10x premium over self-hosted. The convenience of not having to touch any hardware is expensive.


It's also important to consider how often disks will fail when you are operating hundreds of them - it's probably more often than you'd think, and if you don't have someone on staff and nearby to your colo provider, you're going to pay a lot in remote hands fees.

Your colo facility will almost certainly have 24/7 staff on hand who can help you with tasks like swapping disks from a pile of spares, but expect to pay $300+ minimum just to get someone to walk over to your racks, even if the job is 10 mins.

With that said, the cost savings can still be enormous. But know what you're getting into.


Like another comment said, don't bother swapping out disks, just leave the dead ones in place and disable them in software. Then eventually either replace the whole server or get someone on site to do a mass swap of disks. At this scale redundancy needs to be spread between machines anyway so no gain in replacing disks as they die.


That also means that you need extra spare disks in the system, which also means extra servers, extra racks, extra power feeds, extra cooling etc.

If you do a 60-disk 4U setup you'll need 1 full rack of those just to get your 10PB, then you'll need yet another one for redundancy. And then a quarter for hot spares. At that point you have single redundancy, no file history and no scaling. Is it possible? Sure. Is this something you can do 'on a side track with the people you already have'? Unlikely if you are a startup with no datacenter, no colocation yet, etc.


You don't do redundancy that way at that scale, that's completely insane. You run Ceph or BeeGFS or Windows Storage Server and back up to tape with a tape library. If you've got big bucks (though still peanuts compared to S3) you replicate the entire setup 1:1 at a second site.


The author doesn't want a second site. And at that scale you do redundancy within the requested parameters.

If you set your object store to be resilient to single-partition loss per object (within CAP) you effectively duplicate everything once. If you want more-than-one you get into sharding to spread the risk. We're not talking about RAID here, but about replicas or copies.

Windows Storage Server doesn't belong in a setup like this, and neither does tape since it needs to be accessible in under 1s. If higher latencies were fine the author would have been able to use something between S3 IA and Glacier. Heck, you could use cold HDD storage for that kind of access. The drives would need to spin up to collect the shards to assemble at least one replica to be able to read the file, but that's still multiple orders of magnitude faster than tape.

I have written a larger post with more numbers, and unless you seriously reduce the features you use, it's not really cheaper than S3 if you start off with no physical IT and no people to support it. It's not that it isn't possible, it's just that you need to spin up an entire business unit for it and at that point you're eating way more cost.

Regardless of the object store (or filesystem if you want to go full on legacy style), you still need at least the minimum amount of physical bits on disk to be able to store the data. And pretty much no object store supports a 1:1 logical-physical storage scale. It's almost always at least 1:1.66 in degraded mode or 1:2 in minimum operational mode.


>We're not talking about RAID here, but about replicas or copies.

Most distributed filesystems support some form of erasure coding. Ceph does, Minio does, HDFS does, etc. So no, you don't need to duplicate everything.


You're talking about data integrity, this is not the same as redundancy.


> You're talking about data integrity, this is not the same as redundancy.

To be clear, you're talking about mitigating the risk of data corruption (eg. bits will flip randomly due to cosmic rays or what have you) over time, vs. the risk of outright data loss, yes?

Isn't there some some overlap between the solutions?


No, I'm talking about mitigating system failure (be it a dead disk, PHY, entire server, single PDU, single rack or entire feed). I didn't even go down to the level of individual object durability yet (or web access to those objects, consistent access control and the like).

There is some overlap in the sense that having redundant copies makes it possible to replace a bad copy with a good copy if a checksum mismatches on one of them. That also allows for bringing the copy count back in spec if a single copy goes missing (regardless of the type of failure).

But no matter what methods are used, data is data and needs to be stored somewhere. If the bits constituting that data go missing, the data is gone. To prevent that, you need to make sure those bits exist in more than one place. The specific places come with differences in cost, mitigations and effort:

- Two copies on the same disk mitigates bit flips in one copy but not disk failure

- Two copies on two disks on the same HBA mitigates bit flips and disk failure but not HBA failure

The list goes on until you reach the requirement posted at the top of this Ask HN where it is stated that OneZone IA is used. That means it does not need multiple zones for zone-outage mitigation. Effectively that means the racks are allowed to be placed in the same datacenter. So that datacenter being unavailable or destroyed means the data is unavailable (temporarily or permanently), which appears to be the accepted risk.

But within that zone (or datacenter) you would still need all other mitigations offered by the durable object storage S3 provides (unless specified differently - if we just make up new requirements we can make it very cheap and just accept total system failure with 1 bit flip and be done with it).


I currently pay about $40 for a half hour of remote hands at a large data center. Modern disks rarely need to be swapped. You can look at BackBlaze's published failure rates and do the math yourself if you don't believe me.


I’ve used Netapps and Isilon in the past. We didn’t change any disks, they did as part of the maintenance. Not sure how the physical security worked but they were let in by the data centre staff and did their thing. I think they came in weekly.

The whole solution wasn't cheap though, and all of these extras were baked into the cost. We were getting better-than-S3 costs on a straight per-TB basis, without considering power, cooling and rack space costs. Network was significantly cheaper than AWS.

Not sure how far these NASes scale, but I would expect deep discounts for something of this scale.


I've run the math on this for 1PB of similar data (all pictures), and for us it was about 1.5-2 orders of magnitude cheaper over the span of 10 years (our guess for depreciation on the hardware).

Note that we were getting significantly cheaper bandwidth than S3 and similar providers, which made up over half of our savings.



Upfront costs, with networking, racked and stacked, and wired, were far under $100/TB raw, around $40-$60, but this was quite a while ago and I don't know how it looks in the era of 10+TB drives. Also remember that once you are off S3 you are in the situation of doing your own backup, and the use case dictates the required availability when things fail... we didn't need anything online, but mirrored to a second site. With erasure coding, you can get by with 1.5x copies at each site or so, with a performance hit. So properly backed up with a full double, it's about 3x raw...

Opex will be power, data center rent, and internet access, which are hugely variable. And of course, the personnel will be at least 1 full-time person who's extremely competent.


If you have looked at BB storage pods, you should look at 45drives.com, the child of Protocase, which manufactures the BB pods.


Totally out-of-band for this thread, but... what are the uses for a multi-gigabyte key?! I'm clearly unaware of some cool tech, any key words I can search?


I'm no expert but I would guess it's just a fancy word for "file", as in "key-value store", as opposed to a god-proof encryption key.


In this case, wouldn't the value be multi-gigabyte, not the key?


Well, you'd refer to an object by its key, so while the value of the object would have the data, you could still refer to your objects as keys, the same way we refer to files on a filesystem. It's not the file that is big, but the blocks it represents.


When I say "key" I mean the blob that gets stored, but I may be misremembering or misusing S3 terms... it was large amounts of DNA sequencing data, and one of the first tasks was to add S3 support for indexed reads to our internal HTSlib fork, and since then somebody else's implementation has been added to the library. In any case, I quickly forgot about most of the details of S3 when I no longer had to deal with it directly...


That makes perfect sense, assume I'm a pleb! Thanks for the follow-up, large files/values make sense.

My head was in huge cryptographic keys for some purpose


This is outside my domain and I don't know how the pricing works out, but AWS Outpost will sell you a physical rack that is fully S3 compatible and redundant to cloud.


The pricing would be prohibitive, I reckon. S3 on Outposts is $0.1/GB/mo, whereas the S3 single zone IA that OP is using as a baseline is $0.01/GB/mo - an order of magnitude less. (Prices are based on us-east-1.)


There are four hidden costs which not many have touched upon.

1) Staff. You'll need at least one, maybe two, to build, operate, and maintain any self-hosted solution. A quick peek at Glassdoor and Salary shows the unloaded salary for a Storage Engineer runs $92,000-130,000 US. Multiply by 1.25-1.4 for the loaded cost of an employee (things like FICA, insurance, laptop, facilities, etc). Storage Administrators run lower, but still around $70K US unloaded. Point is, you'll be paying around $100K+/year per storage staff position.

2) Facilities (HVAC, electrical, floor loading, etc). If you host on-site (not in a hosting facility), you'd better make certain your physical facilities can handle it. Can your HVAC handle the cooling, or will you need to upgrade it? What about your electrical? Can you get the increased electrical in your area? How much will your UPS and generator cost? Can the physical structure of the building (floor loading, etc) handle the weight of racks and hundreds of drives, the vibration of mechanical drives, the air cycling?

3) Disaster Recovery/Business Continuity. Since you're using S3 One Zone IA, you have no multi-zone duplicated redundancy. Its use case is secondary backup storage for data, not the primary data store for running a startup. When there is an outage/failure (and it will happen), the startup may be toast, and investors none too happy. So this is another expense you're going to have to seriously consider, whether you stick with S3 or roll your own.

4) Cost of money. With rolling your own, you're going to be doing CAPEX and OPEX. How much upfront and ongoing CAPEX can the startup handle? Would the depreciation on storage assets be helpful financially? You really need to talk to the CPA/finance person before this. There may be better tax and financial benefits to staying on S3 (OPEX). Or not.

Good luck.


I agree with this 100%; especially cash flow for a startup is going to be harder to manage. I think S3 is still the answer.


I have worked in HPC (academia), where cluster storage size has been measured in multiples of PB for a decade. Since latency and bandwidth are killer requirements there, InfiniBand (instead of Ethernet) is the de facto standard for connecting the storage pools to the computing nodes.

Maintaining such a (storage) cluster requires 1-2 people on site who replace a few hard disks every day.

Nevertheless, if I continuously needed massive amounts of data, I would opt to do it myself anytime instead of using cloud services. I just know how well these clusters run, and there is little to no saving when outsourcing it.


I am a researcher in academia who handles most of my system admin needs myself. It's way cheaper to do yourself than some of these comments here make it sound (if you have good server rack space available). I ordered two 60-drive JBODs that I racked by myself (I removed all the drives first to lighten them) for ~82k. I used ZFS and 10-drive raidz2 vdevs for a total capacity of ~960TB of usable file system space. Installing the servers, testing some setups and putting it into use took about 4-5 days. In four years I've put many PBs of reads and writes through these and had to replace 3 drives. I'd estimate I spend about 2% of my active work focus on maintaining and troubleshooting it. Scaling up to 10PB I'd probably switch to a supported SDS solution, which would be much more expensive, but still way way cheaper than cloud.
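
For anyone checking the numbers, the usable capacity works out roughly like this (the drive size is my assumption, chosen to match the ~960TB figure):

    drives = 2 * 60                 # two 60-bay JBODs
    vdev_width, parity = 10, 2      # 10-drive raidz2 vdevs
    drive_tb = 10                   # assumed drive size
    vdevs = drives // vdev_width
    usable_tb = vdevs * (vdev_width - parity) * drive_tb
    print(f"{vdevs} vdevs, ~{usable_tb} TB usable before ZFS overhead")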


Since he needs a 1000ms response on storage, isn't Ethernet the better option? It can reach 400Gb/s on the fastest hardware now. I thought InfiniBand was only reasonable to use when machines need to quickly access other machines' primary memory. I would like to know if I'm wrong about this though.


Agreed, and at this point with RoCE there's little reason to go with InfiniBand, given you can find fast Ethernet hardware that'll go toe to toe with InfiniBand on latency and throughput.


I've done multiple multipetabyte scale projects and you only need to swap disks once a month or so. I had a project (as a solo engineer) 2 hours away and I drove there once in six months.


I would host in a datacenter of your choice and do a cross connect into AWS: https://aws.amazon.com/directconnect/pricing/

This allows you to read the data into AWS instances at no cost and process it as needed since there is 0 cost for ingress into AWS. I have some experience with this (hosting using Equinix)


Direct Connect isn't required from a cost perspective - ingress into AWS is free in all cases I can think of, but certainly in the case of S3 [0]. DX is useful when customers need assurances of bandwidth/throughput, or if they want to avoid their traffic routing over the internet.

[0] "You pay for all bandwidth into and out of Amazon S3, except for the following: Data transferred in from the internet..." - https://aws.amazon.com/s3/pricing/


Thanks for the pointer. Never thought about this as an option. Great stuff!!!


I had a similar problem at a past job, though we only had a PB of data. We used a product called SwiftStack. It is open source, but they have paid support. I recommend getting support, as their support is really good. It is an object store like S3, but it has its own API. Though I think they have an S3-compatible gateway now.

We had about 25 Dell R730xd servers. When the cluster would start to fill up, we would just replace drives with larger drives. Upgrading drives with SwiftStack is a piece of cake. When I left we were upgrading to 10TB drives as that was the best pricing. We didn't buy the drives from Dell as they were crazy expensive. We just bought drives from Amazon/New Egg, and kept some spares onsite. We got a better warranty that way too. Dell only had a 1 year warranty, but the drives we were buying had a 5 year warranty.


Way late to the discussion, but I second the positive remarks on SwiftStack. It's in the easy button category in this case. The core storage engine of SwiftStack is open source (OpenStack Swift). However, the nice wrap-around tooling and web dashboard is not open source.


I’m not an AWS pricing expert, but you should be aware you’re still on the hook for S3 requests even if you can get out of paying for bandwidth. Is AWS direct connect a pure peering arrangement? I wonder what their requirements are for that. Guess I’ll read the link :)

Idk what your team’s expertise is, but I’d advise avoiding the cloud as long as possible. If you can build out an on-premise infrastructure, it will be a huge competitive advantage for your company because it will allow you to offer features that your competitors can’t.

Examples of this:

- Cloudflare built up their own network and infrastructure and it’s always been their biggest asset. They set the standard for free tier of CDN pricing, and nobody who builds a CDN on top of an existing cloud provider will ever beat it.

- Zoom. By hosting their own servers and network, Zoom is similarly able to offer a free tier where they are not subject to variable costs from free customers losing them money on bandwidth charges.

- WhatsApp. They scaled to hundreds of millions of users with less than a dozen engineers, a few dozen (?) servers, and some Erlang code.

IMO defaulting to the cloud is one of the worst mistakes a young company can make. If your app is not business critical, you can probably afford up to a day of downtime or even some data loss. And that is unlikely to happen anyway, as long as you’ve got a capable team looking after it who chooses standard and robust software.


I run cloud infra for a living. Have been managing infrastructure for 20 years. I would never for one second consider building my own hosting for a start-up. It would be like a grocery delivery company starting their own farm because seeds are cheap.


Depends what you’re doing I suppose. I think the three companies I mentioned (CloudFlare, Zoom and WhatsApp) are good examples of infrastructure investment as a competitive advantage.


None of those are start-ups, though. They've either IPOed (CloudFlare, Zoom) or been acquired by publicly-traded companies (WhatsApp).

A startup is a company that might still need to pivot to find its final business model, potentially shedding its entire existing infrastructure base in the process. Start-ups are why IaaS providers don't default to instance reservations — because, as a startup, you might suddenly realize that you won't be needing that $10k/hr of compute, but rather $10k/hr of something else.


Or suppose you run the most successful/profitable Fantasy Sports League start-up on the internet (used to work for 'em) and host your own gear. Every year you have to analyze trends in use and predict future load, to build the capital needed to buy all new racks of servers every 2-3 years, pay for all the IT staff, datacenter costs.

That was before the cloud existed. They had to poach experts from hosting companies to build and maintain their gear. They built a 24/7 NOC, did server repair, became network experts, storage experts, database experts. Besides being incredibly complex and burdensome, it was financially risky. If they missed their projections they could over-invest by 1-2 million bucks, or even worse, not have the capacity needed to meet demand.

If somebody told us back then that we could pay a premium to be able to scale at any time as much as we needed, when we needed it? We would have flipped out. We had heard about Amazon building some kind of "grid computing" thing, but it seemed like a pipe dream for universities, like parallel computing. Turns out it was a different kind of grid.


WhatsApp ran on bare metal in SoftLayer prior to (and well after) being acquired by FB.

CloudFlare went well beyond leasing servers and built their own POPs with network etc prior to IPO. Much of what they built wouldn't have made economic sense with AWS tax.


I didn't mean to imply that IPOing is the point at which a start-up becomes a not-start-up. None of these three were a start-up for quite a few years before their IPO, either.


In most of these cases, the companies' growth from startup to not-startup was only possible because of their infrastructure advantage. Do you think Cloudflare the startup could have offered a free tier if they had to pay Amazon $0.10 per GB that their users sent over the network?

Of course not. But the free tier was a vital component of Cloudflare's growth, first-mover advantage and wide adoption.


> as long as you’ve got a capable team looking after it who chooses standard and robust software.

And cheap.

If you put people in charge who are looking for ways of expanding their empire and budget through spending money on EMC/VMWare/Oracle/etc/etc then you can quickly wind up spending a lot more money.

Simplistic network designs, simplistic server designs, simplistic storage designs with mostly open source software used everywhere can be highly competitive with Cloud services.

Mostly all that Amazon did to create AWS/EC2 was to fire anyone who said words like SAN or EMC and do everything very cheaply using open source software, evolving away from enterprise vendors and towards commodity hardware.

If you make "frugality" a core competency in your datacenter design like Amazon did, then you can easily beat the cloud.

You also need to have [dev]ops people who are inclined to say "yes" to the business and who know how to debug things and can operate independently of needing to phone up EMC.


> fire anyone who said words like SAN

Is EBS not, itself, a SAN?


If you narrowly focus on the words outside of the context of what "SAN" has meant in the industry for decades now, yes it is. But no, it isn't.


Can you explain more? Because I honestly don't know enough about SANs to know the difference.

To me, a "Storage Area Network" is 1. a cluster of disk-servers, serving the role of exposing logical block-storage over a protocol like iSCSI (whether directly to client machines, or managed and dynamically allocated by hypervisor software like vSphere), where 2. machines are connected to that storage cluster over a dedicated network interface, to keep LAN/WAN packets from contending for throughput with SAN packets.

By that definition, EBS is definitely a SAN. (And technically, so is my two-drive NAS, if I configure it as an iSCSI target and then run a second switch that connects to its second network port and my workstation's second network port.)

Does "SAN" imply some specific internal architecture for the storage cluster or something?

And, if so, then what do you call the type of thing that EBS is?


> Does "SAN" imply some specific internal architecture for the storage cluster or something?

It implies purchasing dedicated hardware. SANs are CAPEX heavy solutions.

> And, if so, then what do you call the type of thing that EBS is?

If you insist, you could call EBS a SAN-as-a-Service, I suppose.


EBS is absolutely SAN-as-a-Service, and it's fantastic.

For a SAN, not only do you have to become a "storage expert", but their individual limitations will leave you with thousands of hours of wasted time and effort, constrain your architecture, and hold back your application's development.

For EBS, you don't need to know anything about storage. You just say "Give me some space and attach it to any VM I want" and you have it. "Expand that space" and you have it. "Give me a snapshot" and you have it. "Give me a bunch of performance guarantees" and you have it. "Make it all encrypted": Done.

You don't need to maintain it, repair it, upgrade it. No maintenance windows to apply a firmware patch. No waiting for someone to buy, deliver, and install a new storage array to get more space. No hoping your hardware has the right interconnects. No upgrading switch backbones to deal with performance issues. And I'm not even a storage person! I'm so happy that I don't deal with SANs anymore.
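For contrast, the EBS side of that workflow really is a handful of API calls; a hedged sketch with boto3, where the instance ID, AZ and sizes are placeholders:

    import boto3

    ec2 = boto3.client("ec2")

    # "Give me some space" (encrypted, with a performance target)
    vol = ec2.create_volume(AvailabilityZone="us-east-1a", Size=500,
                            VolumeType="gp3", Iops=6000, Encrypted=True)

    # "...and attach it to any VM I want"
    ec2.attach_volume(VolumeId=vol["VolumeId"],
                      InstanceId="i-0123456789abcdef0",   # placeholder instance
                      Device="/dev/sdf")

    # "Expand that space" and "Give me a snapshot"
    ec2.modify_volume(VolumeId=vol["VolumeId"], Size=1000)
    ec2.create_snapshot(VolumeId=vol["VolumeId"], Description="nightly")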


No, EBS is reliable.


I’d like to add I’d agree with the parent comment and add some specifics.

Buy storage servers from 45drives; they basically build the same hardware as Backblaze uses. Add copper 10G NICs to the servers.

https://www.45drives.com/

Get necessary switches 10G with 40G uplink ports. Whatever your favorite. Use 10GBaseT to the servers.

Install hardware in a quality data center. Like one of theirs -

https://www.digitalrealty.com/

And get 10G virtual cross connects to AWS.

Back-of-the-envelope calculation: you need about 30PB raw, so about 60 servers. They aren't really that power hungry, so 10 per cabinet, 6 cabinets, and at least 6+2 switches.

Software wise you have lots of options with this infra. High upfront cost but low MRC vs all other options. Assuming you have skilled sys admins who know what they are doing.


+ some Glacier Deep Archive? I think waiting 12h for data is acceptable if your datacenter burns down, but that may not be the case for you.


It's going to depend entirely on a number of factors.

How are you storing this data? Is it tons of small objects, or a smaller number of massive objects?

If you can aggregate the small objects into larger ones, can you compress them? Is this 10PB compressed or not? If this is video or photo data, compression won't buy you nearly as much. If you have to access small bits of data, and this data isn't something like Parquet or JSON, S3 won't be a good fit.

Will you access this data for analytics purposes? If so, S3 has querying functionality like Athena and S3 Select. If it's instead for serving small files, S3 may not be a good fit.

Really, at PB scale these questions are all critically important and any one of them completely changes the answer. There is no easy "store PB of data" architecture; you're going to need to optimize heavily for your specific use case.


Great question. I updated the original post. It’s user generated images and videos. We download those to the phones in the background.

We don’t touch the data at all.


> Update: Should have mentioned earlier, data needs to be accessible at all time. It’s user generated data that is downloaded in the background to a mobile phone, so super low latency is not important, but less than 1000ms required.

> The data is all images and videos, and no queries need to be performed on the data.

OK, so this definitely helps a bit.

At 10PB my assumption is that storage costs are the major thing to optimize for. Compression is an obvious must, but as it's image and video you're going to have some trouble there.

Aggregation where you can is probably a good idea - like if a user has a photo album, it might make sense to store all of those photos together, compressed, and then store an index of photo ID to album. Deduplication is another thing to consider architecting for - if the user has the same photo, across N albums, you should ensure it's only stored the one time. Depending on what you expect to be more or less common this will change your approach a lot.

Of course, you want to avoid mutating objects in S3 too - so an external index to track all of this will be important. You don't want to have to pull from S3 just to determine that your data was never there. You can also store object metadata and query that first.
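A minimal sketch of that dedup + external-index idea, with made-up names and in-memory dicts standing in for whatever database you'd actually use: content-address each file by hash so the same photo across N albums is stored once, and resolve photo IDs to storage keys without touching S3:

    import hashlib

    object_index = {}   # content hash -> S3 key (in practice: DynamoDB/Postgres/etc.)
    photo_index = {}    # (user_id, photo_id) -> content hash

    def store_photo(s3, bucket, user_id, photo_id, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in object_index:                 # only upload genuinely new content
            key = f"blobs/{digest[:2]}/{digest}"
            s3.put_object(Bucket=bucket, Key=key, Body=data)
            object_index[digest] = key
        photo_index[(user_id, photo_id)] = digest      # album entries just point at the blob
        return object_index[digest]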

AFAIK S3 is the cheapest way to store a huge amount of data other than running your own custom hardware. I don't think you're at that scale yet.

Latency is probably an easy one. Just don't use Glacier, basically, or use it sparingly for data that is extremely rare to access ie: if you back up disabled user accounts in case they come back or something like that.

I think this'll be less of a "do we use S3 or XYZ" and more of a "how do we organize our data so that we can compress as much of it together, deduplicate as much of it as possible, and access the least bytes necessary".


Isn't Backblaze B2 cheaper than S3?


Yeah, I guess I shouldn't say S3 is the cheapest option there, I was thinking 'In AWS' but Backblaze is cheaper.


In my opinion, knowing what you're planning to do w/the data once it's stored is the important piece to giving you some idea of where to put it.


Good point. I updated the post with some more info


What is your loss tolerance? If a file is gone, who is annoyed: a free user, a $50/year customer, or a $10k/year customer?

Are these files WORM?


Agreed - though I feel like every data use comes after the fact. Original software engineers/developers rarely have the foresight to know what the data scientists will need the information for (at least IMHO).


To be fair, the data scientists rarely have the foresight to know what the data scientists need the information for. The only time I've seen a data scientist correctly include all the data they needed (but still be wrong) was when they answered "All of it. We need all of the data".


So true. Tough to know in advance which data will hold the secrets.


And you can't build a time machine to go and get it once you do know. Want X days of historical data for training/backtesting and we just implemented the metric this sprint? Good luck meeting your deadline!


I can build a 720T raw SSD storage box for ~$138k

Or a 648T raw HDD storage box for ~$53k

To get that up to raw 10 PB, I need ~$2m for all-SSD, or ~$850k for all-HDD

Bake in a 2-system safety margin, and that's ~$2.3m all-SSD or ~$960k all-HDD

Run TrueNAS and ZFS on each of them ... and my overhead becomes a little bit of cross-over sysadmin/storage admin time per year and power

Say that's 1 FTE at $180k ($120k salary + 50% overhead) per year (even though actual admin time is only going to be maybe 10% of their workload - I like rounding-up for these types of approximations)

Peak cost, therefore, is ~$2.5m the first year, and ~$200k per year afterwards

And, of course, we'll want to plan for replacement systems to pop-in ... so factor-up to $250k per year in overhead (salary, benefits, taxes, power, budget for additional/replacement servers)

Using [Wasabi](https://wasabi.com/cloud-storage-pricing/#three-info), 10PB is going to run ~$62k/mo, or ~$744k per year

It's cheaper to build-vs-buy in no more than 5 years ... probably under 3


Backblaze B2, ingress and egress are free through Cloudflare, and it's S3 compatible. It's peanuts by comparison but I've been storing ~22TB on there for years and love it.

Wasabi and Glacier would be my 2nd choices.


>Backblaze B2, ingress and egress are free through cloudflare

AFAIK cloudflare ToS prohibits you from using it as a file hosting proxy. You might not run into issues if you're transferring a few gigabytes a month, but if you're transferring multiple terabytes it's just asking for trouble.

edit:

https://www.cloudflare.com/terms/ section 2.8 Limitation on Serving Non-HTML Content


You can definitely serve way more than a few GB per month through Cloudflare on the free plan. I serve tens of terabytes a month for free. If OP needs to serve hundreds of terabytes per month they may get an email asking to upgrade, but the backblaze/Cloudflare setup would probably still be the cheapest. BunnyCDN is great too.


OTOH I've been told (by CloudFlare support, in contact with their engineers) that for their "for hosting game levels and other content" use case[1], any of their ordinary plans should be fine.

I'm not... super confident in that answer, because despite that being a use case they promote on the site the terms seem a bit murkier, and the page on that use-case doesn't say much about which plan(s) they expect you to use (I'd have expected an "enterprise" plan for serving hundreds of TB of transfer of game-assets per month, but they said no, any normal plan's fine, which... I was up front with them about what our usage would look like, and they held that line, but that seems too good to be true).

I haven't tested these claims yet.

[1] https://www.cloudflare.com/gaming/


Definitely not backblaze. If you get a signed URL it remains valid for 24hrs and can be used over and over. If they are going through a proxy, that would be different, but I imagine they don't want that as that doubles bandwidth cost. You definitely don't want your client to be able to upload all the data they can in your bucket.


Wait what?! There is a way to egress from free from Backblaze B2! That’s a big deal if true.



I’ve looked at them. Would love to talk to you about your usage and experience with them.


I should preface this with: I read the question as you want something on-premises/in a colo. If you're talking hosted S3 by someone other than Amazon that's a different story.

It probably depends on if you are tied at the hip to other AWS services. If you are, then you're kind of stuck. The ingress/egress traffic will kill you doing anything with that data anywhere else.

If you aren't, the major players for on-prem S3 (assuming you want to continue accessing the data that way) would be (in no specific order):

Cloudian

Scality

NetApp Storagegrid

Hitachi Vantara HCP

Dell/EMC ECS

There are plusses and minuses to all of them. At that capacity I would honestly avoid a roll-your-own unless you're on a shoestring budget. Any of the above will be cheaper than Amazon.


[Disclaimer: I work on Quantum ActiveScale, an on-prem S3 system that fits this list]

Yup this is why these vendors exist. You’re definitely not alone, cloud repatriation is a ‘thing’.

These vendors have replaced, with software, the Ceph or Minio expert others in this thread said you'd need to budget for. The system detects dead/degrading disks and automatically evicts them and rebuilds that chunk of the erasure code on another disk. Every few months you go in and hot-swap the bad disks = the ones with a blinking LED. Also Prometheus metrics, alerts in case of issues,... You don't need a storage admin babysitting this.

1 rack of 4u90 enclosures with 18TB disks is 15PB RAW so with erasure coding overhead about the 10PB usable you need today.

I'm obviously biased on which vendor to pick. Do your due diligence, for instance on how the system does capacity expansions.


I assume you're already making use of most of S3s auto-archive features?[0] Really it seems like this comes down to how quickly any of your data /needs/ to be loaded. I'd probably investigate after how much time a file is only ~1-10% likely to be accessed in the next 30 days, then auto-archive files in S3 to Glacier after that threshold. If you want to be a bit 'smarter' about it, here's an article by Dropbox[1] on how they saved $1.7M/year by determining which file previews actually need to be generated, and their strategy seems like it could be applied to your use case. That said, it seems like you are more likely to save money by going colo than by staying in the cloud.

[0] https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/ [1] https://dropbox.tech/machine-learning/cannes--how-ml-saves-u...
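As a concrete (hedged) illustration of the auto-archive idea, a lifecycle rule like the following transitions objects to Glacier once they cross whatever age threshold your access statistics suggest; the bucket name and 180-day cutoff are placeholders:

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-media-bucket",                      # placeholder
        LifecycleConfiguration={
            "Rules": [{
                "ID": "archive-cold-media",
                "Filter": {"Prefix": ""},              # apply to the whole bucket
                "Status": "Enabled",
                "Transitions": [{"Days": 180, "StorageClass": "GLACIER"}],
            }]
        },
    )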


I have done 2 PB HPC data storage with ZFS. If I may extrapolate, I don’t see why it wouldn’t workout the same for 10 PB.

A 1U rack server attached to two JBODs(each 4U containing 60 spinning disks) connected to the server via 4 SAS HD cables. The rack server gets 512GiB of RAM to cache reads, and an Optane drive as persistent cache for writes. The usable storage depends on your redundancy and spare needs. But, as an example my setup - (9 * 6 drives(RAIDz2) + 4 hot spares) nets me about 450 TiB per JBOD or 900 TiB per rack server + two JBODs.

Repeat the setup 6 times, and it would meet your 10 PB need. Throw in a few 10Gbps links per server, have them all linked up by a switch, and you've got your own storage setup. Maybe MinIO (I have no experience with it) or something like that would give you an S3 interface over the whole thing.

I bet it would come out much cheaper than AWS. But, you’ve got to get your hands dirty a bit with system in work, and automate all the things with a tool like Ansible. Having done it, I’d say it is totally worth it at your scale.
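If you did front a setup like this with MinIO, the application side stays simple; a minimal sketch with the MinIO Python SDK, where the endpoint, credentials and bucket are placeholders and nothing is specific to the ZFS layout underneath:

    from minio import Minio

    client = Minio("minio.internal.example.com:9000",
                   access_key="ACCESS", secret_key="SECRET", secure=False)

    if not client.bucket_exists("media"):
        client.make_bucket("media")

    # Upload straight from local disk; the cluster decides where it lands.
    client.fput_object("media", "user123/clip.mp4", "/data/incoming/clip.mp4")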


Why do you need all 10PB accessible? Have you analyzed your usage pattern to see if you really need that much data accessible? This seems so unlikely and could solve most of your problems if you change the parameters.


It seems to me like you could save a ton of money by using your own hardware. Perhaps buy a bunch of big Synology boxes? At that scale you should also consider looking at technologies such as Ceph.

We've recently switched to a setup with several Synology boxes for around 1PB net storage.


Those boxes are slooooow; the 8-slot box has like a 500MB/s read speed limit, even if you RAID0 8 SSDs and use 10Gbps networking. This limitation is in the product spec document, but in the smallest letters possible.


They advertise some boxes with 5.5GB/s.


Funny you should mention this. I once worked at a startup that stored lots of remote sensing data. Their strategy was to put it on a Synology. When the Synology filled up, they bought another, and so forth. Only some of the Synologys were online at any particular time, and there was no indexing to find which Synology held what data.

Plus, there were no backups so if one Synology were to blow up, all the data on it was lost.

Since they were a small startup it made some sense to start this way, but they had no plans on what to do about it as they got bigger.


Using Synologys doesn't mean that you have to be stupid about it :-)


Thank you!


At this scale, there's no one perfect answer. You need to consider your usage patterns, business needs, etc.

Is the data cold storage, that is rarely accessed? Is it OK to risk losing a percentage of it? Can you identify that percentage? If it's actively utilized, is it all used, or just a subset? Which subset? How much data is added every day? How much is deleted? What are the I/O patterns?

Etc.

I have direct experience moving big cloud datasets to on-site storage (in my case, RAID arrays), but it was a situation where the data had a long-tail usage pattern, and it didn't really matter if some was lost. YMMV.


I'd go with Ceph and dedicated hardware. Something like Hetzner or Datapacket, or build it yourself and go big with something like SoftIron. We've built and maintain a number of these types of clusters - using S3-compatible APIs (CephObjectStore). SoftIron is probably overkill but good lord is it fun to play with that much thruput!

If you’re looking for a partner/consultant to get things going, feel free to reach out! This stuff is sort of our wheelhouse, as me and my co-founder were previously Ops at Imgur, you can imagine the kinds of image hosting problems we’ve seen :P


SoftIron would love to assist with this.


Late to the party, but one does not simply store 10PB of data :)

The short story is, ignore most of the advice, poach^H^H^H^H^Hhire someone who has done this, and leverage their expertise. There is no armchair quarterbacking infrastructure at this scale.


Honestly, 10PB is small potatoes nowadays - and it's not "armchair quarterbacking" to think about storing it in a distributed environment

I work with one customer right now that's storing something like 27PB across ~100 systems for analysis on a rolling 90-to-365-day period (and they're a relatively small customer) with Splunk

If they were doing more storage than analysis, that storage cluster would be substantially smaller


I don't really know much about optimizing storage costs, but you could learn from the storage giants.

An example is Backblaze's Storage Pod 6.0: according to them it holds 0.5PB at a cost of $10k, so you would need about 20 * $10K = $200K + maintenance (they also publish failure rates). The schematics and everything are on their website, and according to them they already have a supplier who builds such devices, which you could probably buy from. Note: this was published in 2016; they probably have Pod 7.0 by now, so cost may be better.

Reference: https://www.backblaze.com/blog/open-source-data-storage-serv...


fyi that 10k includes no drives.


Reading: https://www.backblaze.com/blog/open-source-data-storage-serv... it seems the drives are included.

That 10.3k includes drives but you have to assemble the pod yourself.

For 12.8k you get drives and assembled pod from 3rd party manufacturer.

Backblaze pays about 8.7k at scale for the whole enchilada.

Those numbers do not make sense if we exclude drives. The server itself is not that expensive (2-3k tops) without the drives.


Are you fundamentally a data storage business or are you another business that happens to store a tremendous amount of data?

If it's the former, then investing in-house might make sense (a la Dropbox's reverse course).


He's the CTO of KeepSafe.


Ok, so it seems that they are indeed a data storage company.


We are.


Do you want to grow your business by finding other ways to sell people storage, or by adding features to your app that may or may not require any additional storage to develop?


Cloud or self-hosted will depend on your in-house expertise. For cloud others have already mentioned Backblaze and Wasabi, but you can also check Scaleway, they do 0.02 EUR/GB/mo for hot storage and 0.002/GB/mo for cold storage.

Since we're talking about images and videos, do you already have different quality of each media available? Maybe thumbnail, high quality, and full quality. It could allow you to use cold storage for the full quality media, serving the high quality version while waiting for retrieval.

If the use case is more of a backup/restore service and a restore typically takes longer than a cold storage retrieval (being Glacier or self hosted tape robot), then keep just enough in S3 to restore while you wait for the retrieval of the rest.

If you go the self-hosted route, I like software that is flexible around hardware failures. Something that will rebalance automatically and reduce the total capacity of the cluster, rather than require you to swap the drive ASAP. That way you can batch all the hardware swapping/RMA once per week/month/quarter.


I believe Scaleway costs 0.01 EUR/GB, so a bit more than half of S3.


Also have a look at the Datahoarder community [1] on Reddit. Some people are storing astronomical amounts of data. [1]: https://www.reddit.com/r/DataHoarder/


How firm are your "less than 1000ms" requirements. Could you identify a subset of your images/videos that are very unlikely to ever be accessed and move those to s3 glacier and price in that some fractional percentage will require expedited retrieval costs?
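If that subset exists, the retrieval side is a single call; a hedged sketch with boto3, where the names are placeholders and expedited retrievals still take minutes, which is why this only works for data you're confident will almost never be requested:

    import boto3

    s3 = boto3.client("s3")
    s3.restore_object(
        Bucket="my-media-bucket",
        Key="user123/old-video.mp4",
        RestoreRequest={
            "Days": 1,                                    # keep the restored copy briefly
            "GlacierJobParameters": {"Tier": "Expedited"},
        },
    )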


Netapp. If you are managing it yourself do not accept alternatives.

https://www.ebay.com/itm/313012077673?_trkparms=aid%3D111000...


At that level of data you should be negotiating with the 3 largest cloud providers, and going with whoever gives you the best deal. You can negotiate the storage costs and also egress.


Take any credits you can get from a provider switch and then thoroughly map out your access patterns, ingestion, and egress. Do whatever you can to segment data by your needs for availability and modification.

If it's all archival storage then it's pretty straight forward. If you're on GCP you take it all and dump it into archival single region DRA (Durable Reduced Availability) storage for the lowest costs.

Otherwise, identify your segments and figure out a strategy for "load balancing" between standard, nearline, coldline, and archive storage classes. If you can figure out a chronological pattern, you can write a small script that uses gsutil's built-in rsync feature to mirror data from a higher-grade storage class to a lower one at the right time.

The strategy will probably be similar in any of the other big 3 providers as well, but fair warning, some providers archival grade storage does not have immediate availability last I checked.

See: https://cloud.google.com/storage/docs/storage-classes

https://cloud.google.com/storage/docs/gsutil/commands/rsync
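A small sketch of the kind of script described above, assuming objects are laid out by date and the destination bucket's default storage class is a cheaper tier; the bucket names and 90-day cutoff are placeholders, and a plain lifecycle rule may get you the same result with less machinery:

    import subprocess
    from datetime import date, timedelta

    cutoff = date.today() - timedelta(days=90)
    prefix = cutoff.strftime("%Y/%m")   # assumes gs://media-standard/YYYY/MM/... layout

    subprocess.run(
        ["gsutil", "-m", "rsync", "-r",
         f"gs://media-standard/{prefix}",
         f"gs://media-archive/{prefix}"],   # archive-class destination bucket
        check=True,
    )
    # After verifying the copy, delete or lifecycle-expire the standard-class originals.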


Flip side: how much time would that migration take? As a startup, focusing that time on product would lead to more VC investment or more sales sooner, with the seed/series funding and sales being many multiples of the cost savings.


Agree with someone else's comment questioning how the data is ingested and used.

10PB seems like a lot to store in S3 buckets. I assume much of that data is not accessed frequently or would be used in a big data scenario. Maybe some other services like Glacier or RedShift (I think).


10PB is a crazy amount of data. Far more than any normal business would ever have to deal with. Presuming you aren't crazy, you must have an unusual business plan to legitimately need to handle that much data. That means it's tough for us to say much - any assumptions we might have about it could be invalid depending on your actual business needs. You're just going to have to tell us some more about your business case before we can say anything useful about it.


Unless they do video. In that case 10PB is not much at all.


It's not merely video

Machine data takes a lot of space, depending on how long you need/want to hold onto it - any given event might only be a few hundred bytes, but with 1000s or 10s of 1000s of devices sending, even syslog turns into a metric buttload of data :)


Disclaimer: *I work for Nutanix*

Consider looking at Nutanix - you can get the hardware from HPE (including Apollo).

Object storage from Nutanix doesn’t even break a sweat at 10PB of usable storage.

However the main reasons to look at Nutanix would be ease of use for

day 0 (bootstrapping), day 1 (administration operations, capacity management), fault tolerance, and day n operations (upgrades, security patches, etc.)

Nutanix spends considerable time and resources on all this to make life of our customers easy.


Amazing how one post will tell you that, at your scale, S3 is stupid and other posts will tell you that at your not-small-enough-and-yet not-big-enough scale S3 is the only option. I say stick with cloud. If cost is an issue go negotiate a better contract — GCP will probably give you a nice discount. Setting up a highly available service at that scale is not a walk in the park. Can you afford the distractions from your primary app while you figure it out?


Wasabi is a good option. They’re S3 compatible and don’t charge any egress or ingress fees. Been using them for a few years. Great speeds and customer support.


10PB with their pricing calculator comes out to over $60,000/mo. Feels like a lot.

edit: perhaps their RCS option would be cheaper if you know exactly how much data you need to store in advance.


To be fair, purchasing and hosting even the most basic mirrored RAID array of that scale comes to well over half a million for the disks alone. Then you need to manage them.


10x Supermicro SSG-6049P-E1CR60H servers (60 x 3.5" HDD in 4U enclosure) - $5k each

600x Western Digital Ultrastar DC HC550 18TB (10.8PB in total) - $500 each

~$350k in hardware, up to 20kW energy consumption, should fit in two rack towers. You can host it for about $1.5k somewhere. All assuming no redundancy :)


Don’t forget labor. You need to find talent to manage your little data center. And deal with it when it shits the bed at 4:12am on Christmas morning.

So toss in at least one SRE type person. Say $200k/year.

Since you only have one, they are gonna be on call 24/7, so assume you’ll burn them out after a year and a half and need to hire a new one....

Since redundancy is a thing, double that $350k. And 10pb is what they have now so double it again for 20pb. Add in $10k per rack for switches, routers, wires, etc.

So probably you are looking at a million dollars of capital plus labor to actually execute on this. And don’t forget the lead time might be a month to get the hardware and a week or two to install it. Plus all the configuration management that needs to be built up. Not to mention monitoring. So maybe a quarter of work just to have it functional.

I haven’t even factored in opportunity costs. What could this business be doing that adds more value than building out a little data center?

I dunno. Maybe it does make sense to manage your own hardware. But it helps to calculate the entire cost of ownership, not just the cost of the servers.


> Since you only have one, they are gonna be on call 24/7, so assume you’ll burn them out after a year and a half and need to hire a new one....

This person's entire job is managing a few racks of hard drives? How often do you think they're actually going to get called in?

> Since redundancy is a thing, double that $350k.

True, but you can do redundancy for cheaper with parity or tape.

> And 10pb is what they have now so double it again for 20pb.

> So probably you are looking at a million dollars of capital plus labor to actually execute on this.

You can go a couple PB at a time if the upfront cost is daunting.

> Add in $10k per rack for switches, routers, wires, etc.

Yep, though that's not very much in comparison.

> Plus all the configuration management that needs to be built up. Not to mention monitoring. So maybe a quarter of work just to have it functional.

This is the one I'd really worry about.

> I haven’t even factored in opportunity costs. What could this business be doing that adds more value than building out a little data center?

You always have to keep opportunity costs in mind, but something like this can pay for itself in under a year if there's significant bandwidth cost too, and that's an amazing ROI.


> How often do you think they're actually going to get called in?

Not often. But the server gods are a cruel mistress and it will definitely shit the bed when you are on your honeymoon, or maybe the day after your first kid is born.


>>> This person's entire job is managing a few racks of hard drives? How often do you think they're actually going to get called in?

How about every day?

Quick guess how often disks need to be replaced when there are thousands of them. ;)


You can replace disks once a month or less. That's not an on-call thing, even if you do make your $200k admin do that grunt work.

Also for one or two thousand disks I would expect less than one failure per week.


> True, but you can do redundancy for cheaper with parity or tape.

At this volumes you probably do want a carbon copy at another site to mitigate disasters like datacenter fires.


You're right, I wasn't really serious. Since I'm in the middle of calculating costs of our own servers in rented racks in Poland (you're right, labor is more difficult than hardware), let me imagine the rest of the infrastructure (probably not all) for this "project", just for fun:

- network switch Juniper EX4600 (10Gbps ports) + 3rd party optics ~$11k

- cheap 1Gbps switch for management access <$1k

- some router for VPN for management network - $500

- 1Gbps (not guaranteed) internet access with few IPs ~$350 / month

- 100Mbps low traffic internet access for the management/OOB network.

Time to get the hardware - 2 months. Time to rent and install hardware in rack - about 1 month. I don't count configuring the software.

This setup is full of single points of failure so I would consider it one "region" and use something like Ceph + some spare servers in each "region". That way you don't need to react immediately to hardware failures. Just send a box of hardware from time to time to the DC and use a ~$20-40/h remote hands service to replace the failed drives or whole servers. You could also buy on-site service from the hardware vendor for 1-3 years, adding some cost.

I think the most important thing would be to have a clever person who designs a fault-tolerant system, automatic failover, and good monitoring and alerting, so that any on-call and maintenance job is easy and based on procedures. That way you could outsource it. Only then might it make some sense.


10 petabytes at AWS is $210,000 per month just for storage (even excluding AWS's very high egress and transaction pricing), so even $1M (which seems like a high estimate indeed) would be amortized in less than six months.

Also, the hardware can be depreciated, which reduces its net (of taxes) cost dramatically over time.

Five years (probably the useful life of the equipment in general) of $210,000 per month is $12.6M. That's a lot of savings.


At this scale you should be able to negotiate with AWS and get a deal better than the listed price.

Regarding accounting, the AWS monthly charges are also net of taxes so it makes no difference.


Thanks for laying this out. Never rented rack space myself


I see from their home page they do not charge for egress, but the FAQ clarifies this is only valid if your monthly egress total is less than or equal to your storage total, otherwise they suspend your service. Should be clarified on the home page in my opinion. At least with an asterisk beside "No egress charges".


1. Shrink your data. That's just an absurd amount of data for a start-up. Even large organizations can't quickly work around too much data. Resource growth directly affects system performance and complexity and limits what you will be able to practically do with the data. You already have a million problems as a start-up, don't make another one for yourself by trying to find a clever solution when you can just get rid of the problem.

2. As a general-purpose alternative, I would use Backblaze. It's cheap and they know what they're doing. Here is a comparison of (non-personal) cloud vendor storage prices: https://gist.github.com/peterwwillis/83a4636476f01852dc2b670...

3. You need to know how the architecture impacts the storage costs. There are costs for incoming traffic, outgoing traffic, intra-zone traffic, storage costs, archive costs, 'access' costs (cost per GET | POST | etc). You may end up paying $500K a month just to serve files smaller than 1KB. (A rough back-of-the-envelope sketch follows this list.)

4. You need to match up availability and performance requirements against providers' guarantees, and then measure a real-world performance test over a month. Some providers enforce rate limits, with others you might be in a shared pool of rate limits.

5. You need to verify the logistics for backup and restore. For 10PB you're gonna need an option to mail physical drives/tapes. Ensure that process works if you want to keep the data around.

6. Don't become your own storage provider. Unless you have a ton of time and money and engineering talent to waste and don't want to ship a reliable product soon.
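Here's the back-of-the-envelope sketch promised in point 3. The per-request rates and volumes are assumptions roughly in line with published list pricing; swap in your own provider's price sheet:

    PUT_PER_1K = 0.005     # USD per 1,000 PUT/POST requests (assumed list rate)
    GET_PER_1K = 0.0004    # USD per 1,000 GET requests (assumed list rate)

    puts_per_month = 1_000_000_000      # a billion tiny objects written
    gets_per_month = 10_000_000_000     # ten billion reads

    request_cost = (puts_per_month / 1000) * PUT_PER_1K \
                 + (gets_per_month / 1000) * GET_PER_1K
    print(f"~${request_cost:,.0f}/month in request charges alone")  # ~$9,000 with these numbers

Push the volumes higher, or add per-object overhead like replication and tagging, and the request line item can dominate the storage line item, which is the point being made above.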


SoftIron would love to help with this project. We're in your backyard and could have POC on your hands in no time at all, and full 10PB in about 6 weeks. matt@softiron.com


Meta-question: shouldn't there be a website dedicated specifically to reliable, crowd-sourced answers to questions like these? Does it really not exist? I'm thinking like StackShare, but you start from "What's the problem I'm trying to solve?", not "What products are big companies using?".


There is http://highscalability.com/ but you have to distinguish between PR and decent technical articles.


Yes it’s called hacker news


Having dealt with a lot of big data, I often came to the realization that we actually did not need most of it.

Try being intentional and smart in front of your data pipeline and purge data that is not useful. Too many times people store data "just in case" and that case never happens years later.


You wrote, "data needs to be accessible at all time ... less than 1000ms" latency, but this does not tell the whole story about accessibility/latency. Does your use case allow you to do something similar to lazy loading, where you serve reduced quality images/video at low latency and only offer the full quality on demand/as needed with greater latency? For example, initially serve a reduced-resolution or reduced-length video instead of the full-res/full-length original, which you keep in colder storage at a reduced cost? Depending on the details of what is permissible and data characteristics, this approach might save you a lot overall by reducing warm storage costs.


I'm wondering here if this data is currently oversized? If the use case is all mobile, has your product committed to losslessly storing something or not?

While there's definitely a cross-over point where you should roll your own, the overhead costs of running a storage cluster reliably (and all the problems you don't really have to deal with because they're outsourced to AWS) mean it might be a better use of time and effort to see how much you can cut that number down by changing the parameters of your storage. The immediate savings will be much easier to justify.

Keep in mind you've also got a migration problem: getting 10PB off Amazon is not a simple, hands-free project.


My only comment is that I have a hard time reconciling these two statements:

> downloaded in the background to a mobile phone

and

> but less than 1000ms required

I'm struggling to think of what kind of application needs data access in the background with latency of less than 1000ms. That would normally be for interactive use of some kind.

Getting to 1 min access time would get you into the S3 glacier territory ... you will obviously have considered this but I feel like some really hard scrutiny on requirements could be critical here. With intelligent tiering and smart software you might make a near order of magnitude difference in cost and lose almost no user-perceptible functionality.


> I'm struggling to think of what kind of application needs data access in the background with latency of less than 1000ms

TikTok is most obvious example.


> Should have mentioned earlier, data needs to be accessible at all time. It’s user generated data that is downloaded in the background to a mobile phone, so super low latency is not important, but less than 1000ms required.

> The data is all images and videos, and no queries need to be performed on the data.

Okay, this is a good start, but there are some other important factors.

For every PB of data, how much bandwidth is used in a month, and what percentage of the data is actually accessed?

Annoyingly, the services that have the best warm/"cold" storage offerings also tend to be the services that overcharge the most for bandwidth.


Need more details. Maybe a graph (or several graphs) of requests per day for various items (categorized by popularity and size is fine): a curve (I suppose not very hyperbolic) breaking down the popularity of the top requested items vs. the long tail of rarely or almost-never-seen items, which I suppose make up most of those 10PB. Combine that with current bandwidth and data size/volume to get an idea of the bandwidth, IOPS, structure of the data, request patterns, and caching-layer requirements. I think a shared filesystem is probably worse than distributed blob storage here (assuming spinning disks somewhere and not huge caches). Not all days' usage patterns are equal, and your requirements are different from a database's (which is more in line with some suggestions here). Plus, data safety is everything for your kind of business, so redundancy is a must, and speed too (don't even think about Filecoin IMHO). I would think about a mix of spinning disks and NVMe as a cache layer, made redundant across multiple datacenters, if it's to save costs. If it's to save effort and a bit of cost, look at OVH's blob storage offerings, or contact Backblaze about a custom solution hosted by them?

Plus, here we are not talking about 10PB but probably 25PB given redundancy, and probably also 100PB and more given the assumption that your company is growing. So a solution that costs slightly less today but whose cost only does 2x when you do 10x would still be very interesting IMO. There is a lot to talk about ;)


I have a startup idea and want to make sure it scales. I was thinking S3 but don't like vendor lock-in. Not that far along yet; I was thinking maybe SeaweedFS, or even going crazy enough to write my own storage system: use a database like CockroachDB or MongoDB to store the metadata, and then replicate pieces of the file to "chunk servers". However, cleaning up deleted files, etc. seems a bit of a pain. I was thinking, instead of top down, let each node contain a copy of the metadata and scan each node individually instead of the central database trying to manage each node. Then have a process to handle under-replicated files. However, if you can adjust the number of replicas for, say, a popular file, you'd need to then coordinate which extra copies to remove when scaling down. Maybe a bit optimistic.

Kinda disappointed the file-storage solutions seem more complicated, with nothing as simple to set up as some of the newer databases like CockroachDB or MongoDB are to use. I feel like reinventing the wheel is kinda bad, as I'd rather let people who are more expert in this field handle this stuff, but I hate the idea of vendor lock-in and being forced to use other people's servers; self-hosting would be nice, from a single node for testing to a cluster spanning multiple datacenters. Maybe there's a solution out there; I've done some searching and it just seems to go in circles. I saw one system, but if you wanted to add or remove nodes in the future, you couldn't just "drain" a chunk server by moving its data.


If data storage isn't your startup's job then I would negotiate heavily on the AWS contract.


How much can you get the pricing reduced at AWS? At list price, 10PB of IA storage cost $1.5M/yr.


AWS will blow up your phone if they know you're interested in dealing. Various online forms smattered around the site will put you in this pipeline. Just ensure you have a competing quote for them to work against


At startup grade, it's fine to stick and grow with an IaaS provider like Amazon, Google, Microsoft, Oracle or whatever you like.

However, you'll get to a point where it's crucial to become profitable. And storing that much data costs a lot of money with any of the mentioned providers.

So, when you think it's the right time to become “mature”, then get your own servers up and running using colocation.

What options do you have here (just a quick brainstorm):

1. Set up some servers, put in a lot of hard drives, format them using ZFS, and make them available over NFS on your network

2. Get some storage servers

3. Set up a Ceph cluster

I used to work as a CTO at a hosting company and evaluated all of these options and more. Each of these options comes with pros and cons.

Just one last piece of advice: evaluate your options and get some external help on this. Each of these options has pitfalls and you need experienced consultants to set up and run such an infrastructure.

All in all, it's an investment that will save you a lot of money and give you the freedom and flexibility to grow further.

P.S. we ended up setting up a Ceph cluster. We found a partner who's specialized in hosting custom infrastructures. That partner is responsible for all the maintenance, so we could focus on the product itself.


If you want to stick with cloud, then stick with what you're doing or migrate to a cheaper alternative like wasabi, backblaze, etc.

If you're not afraid of having a few operations people on staff and running a few racks in multiple data centers, then buy a bunch of drives and servers and install something to expose everything via S3 interface (Ceph, Minio, ...) so none of your tools have to change.


I think they either stick to S3 or run their own DC with Minio in front. Backblaze as I mentioned in another comment will be a bad idea due to the poor S3 compatible interface. See - https://www.backblaze.com/b2/docs/b2_get_upload_url.html Wasabi might be fine, but I don't know if they can handle 10PB.


Disclaimer: I work for Backblaze so I'm biased and you should keep me honest. :-)

> Backblaze as I mentioned in another comment will be a bad idea due to the poor S3 compatible interface

Backblaze released an S3 compatible API recently: https://www.backblaze.com/b2/docs/s3_compatible_api.html

We're ALWAYS curious about any issue customers see, so if there is something specifically missing you use, we both want to hear about it, and we might be able to add it. Even if it doesn't help you right away, maybe it will help somebody else a few months down the road who might need the same feature.

We know we're compatible with Veeam backups (which only go through S3 APIs) for instance, and we continue to maintain that. We added the "S3 Object Lock" specifically for this particular vendor. So if you are missing one or two APIs, let us know!


Good to know. How does the presigned URL work? I thought it was this - https://www.backblaze.com/b2/docs/b2_get_upload_url.html. Does it function differently from this? My guess is that for a mobile app, the API backend will generate the presigned upload URL and hand it off to the mobile app. But we certainly don't want the mobile app to have unlimited upload for a 24hr period. So: one presigned URL, one upload.


> how does the presigned URL work? I thought it was this - https://www.backblaze.com/b2/docs/b2_get_upload_url.html

If you want to use Amazon S3 APIs, you do not call ANYTHING that is documented on the Backblaze website, and you especially should not call "b2_get_upload_url" because that is a B2 native API, not an Amazon S3 API. You can always tell if you are using "B2 Native" if the call starts with "b2_" -> then that has literally nothing to do with Amazon S3 compatibility, it is the custom Backblaze protocol.

If you want to find out about Amazon S3 APIs (which you use to communicate with Backblaze's Storage Cloud or Amazon S3) then you can start here: https://docs.aws.amazon.com/general/latest/gr/signature-vers... Make sure you stay ENTIRELY on the Amazon website, and only read Amazon documentation, and use the APIs Amazon talks about (but of course you are doing all of this communicating with the Backblaze Storage Cloud backend). If any of that fails in your application, or is incompatible, PLEASE LET US KNOW!!
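A hedged sketch of that flow with boto3's S3 API (the endpoint/region string below is illustrative; use the one shown on your bucket, and note a pre-signed URL is time-limited rather than single-use, so keep the expiry short and scope it to one key):

    import boto3

    b2 = boto3.client(
        "s3",
        endpoint_url="https://s3.us-west-002.backblazeb2.com",  # from your bucket details
        aws_access_key_id="KEY_ID",
        aws_secret_access_key="APPLICATION_KEY",
    )

    upload_url = b2.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-media-bucket", "Key": "user123/IMG_0001.jpg"},
        ExpiresIn=300,   # five minutes instead of 24 hours
    )
    # Hand upload_url to the mobile client; it PUTs the file directly to the bucket.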


Got it! I thought the S3 API is a wrapper on B2 native API. Good to know


One interesting thing that Backblaze supports (that I don't think many people use) is that you can actually use any S3 API on any bucket, and any B2 API on the same bucket. Like every other call if you want.

So if you HAPPEN to find something is more clear with the B2 API, it's fine to use those calls on a bucket. If you find something is more clear in the S3 API, it's also fine to use those calls. The bucket won't get confused. :-)


If you put the data on Storj DCS, it would run about $40k/month for list pricing with global availability and encryption. I'm sure you could get a deal if you asked though. It has S3 compatibility, so would be plug and play with whatever you have now. Egress out of AWS would be free.

Way cheaper than AWS, and a lot less headache than trying to run it all yourself.


Is this a case where GlusterFS and ZFS would work? I don't have PBs of data, but many TBs. Gluster nodes are spread around the globe; I use ZFS for the "brick" and then the Gluster magic gives me distribute/replica.

Surprised I didn't see Gluster already in this thread. Maybe it's not for such big scale?

edit: Wikipedia says " GlusterFS to scale up to several petabytes on commodity hardware"


Check whether you really need 10 PB or you can make do with several orders of magnitude less. I wouldn't be surprised if it was some sort of perverse incentive CV building thing, like engineers building a Kubernetes cluster for every tiny thing. If you really do need 10 PB, then still you probably should check again because you probably don't need 10 PB.


In cloud:

Wasabi's Reserved Capacity Storage is likely to be the cheapest: https://wasabi.com/rcs/

If you front it with Cloudflare, egress would be close to free given both these companies are part of the Bandwidth Alliance: https://www.cloudflare.com/bandwidth-alliance/

Cloudflare has an images product in closed beta, but that is likely unnecessary and probably expensive for your usecase: https://blog.cloudflare.com/announcing-cloudflare-images-bet...

--

If you're curious still, take a look at Facebook's F4 (generic blob store) and Haystack (for IO bound image workloads) designs: https://archive.is/49GUM


Besides what others have asked:

What are your access patterns? You say "no queries need to be performed," but are you accessing via key-value look-ups? Or ranged look-ups?

What do customers do with the pictures? Do customers browse through images and videos?

You mention it's "user generated data" - how many users (order of magnitude)? How often is new data generated? Does the dataset grow, or can you evict older images/videos (so you have a moving window of data through time)?

Besides your immediate needs, what other needs do you anticipate? (Will you need to do ML/Analytics work on the data in the future? Will you want to generate thumbnails from the existing data set?)

What my experience is based on: I was formerly Senior Software Engineer/Principal Engineer for a team that managed reporting tools for internal reporting of Amazon's Retail data. The team I was on provides tools for accessing several years worth of Amazon.com's order/shipment data.


S3 + Glacier. For data you're accessing via Spark/Presto/Hive I believe Parquet is a good format. At your scale AWS should prob provide discounts, worth connecting w/ an account rep.

I'd recommend reaching out to some data eng in the various Bigs, they certainly have more clear numbers. Happy to make an intro if you need, feel free to dm me.


Actual answer: There is almost NO company that really needs that much data. This has mostly just become a pissing match. In general, companies (especially startups) are way better off making sure they have a small amount of high-quality, accurate, data than a huge pile-o-dung that they think they're going to use magical AI/ML pixie dust to do something with.

That said, if you really think you must, spend effort on good deduping/transcoding (relatively easy with images/video), and consider some far lower-cost storage options than S3, which is pretty pricey no matter what you do. If S3 is a good fit, I hear good things about Wasabi, but haven't used it myself.

If you have the technical ability (non-trivial, you need someone who really understands disk and system I/O, RAID controllers, PCI lane optimization, SAN protocols and network performance (not just IP), etc.) and the wherewithal to invest, then putting this on good hardware with something like, say, ZFS at your site or a good co-lo will be WAY cheaper and probably offer higher performance than any other option, especially combined with serious deduping. (Look carefully at everything that comes in once and you never have to do it again.) Also, keep in mind that even-numbered RAID levels can make more sense for video streaming, if that's a big part of the mix.

The MAIN thing: Keep in mind that understanding your data flows is way more important than just "designing for scale". And really try to not need so much data in the first place.

(Aside: I was cofounder and chief technologist of one of the first onsite storage service providers - we built a screamer of a storage system that was 3-4x as fast, and scaled 10x larger, than IBM's fastest Shark array, at less than 10% of the cost. The bad news: we were planning to launch the week of 9/11 and, being self-funded, ran out of money before the economy came back. The system kicked ass, though.)


As others have said, it’s a complicated question, but if you have the resources/wherewithal to run Ceph but don’t want to deal with co-location, you can get a bunch of storage servers from Hetzner and get a much better grasp on cost over S3.

For example, at 10PB with every object stored twice (so 20PB raw storage), you'd need ~90 of their SX293[1] boxes, coming out to around €30k/mo. This doesn't include time to configure/maintain on your end, but it does cover any costs associated with drive replacement on failure.

I’ve done similar setups for cheap video storage & CDN origin systems before, and it’s worked fairly well if you’re cost conscious.

[1] https://www.hetzner.com/dedicated-rootserver/sx293/configura...


Buying just one of these looks pretty challenging, let alone ~90. :(


You would probably pick up the phone to buy 90


It's a complex question. I had experience working with a ~60-petabyte-ish system back in 2016, and there are a lot of things to cover (not only storage):

* network access - do you have data that will be accessed frequently, and with high traffic? You need to cover this skewed access pattern in your solution.

* data migration from one node to another, etc...

* ability to restore quickly in case of failure.

I would suggest to:

* use some open-source solution on top of the hosted infrastructure (Hetzner or similar is a good choice)

* bring in a seasoned expert to analyze your data usage/storage patterns; maybe there are other ways to make storage more cost-effective than simply moving out of AWS S3.


Try https://min.io/ I would 100% go for it if my company was not a https://www.caringo.com/products/swarm customer


I'd like to echo a suggestion I read earlier in this thread: at this scale (i.e. yearly spend), talk to AWS, GCP, Azure or a reseller of your trust and get a good deal to compare your other options with.

Disclaimer: I'm working at a consultancy/partner for a competing cloud.


I would consider moving to my own metal and using hadoop.


Maybe take a look at Backblaze Storage Pods:

https://www.backblaze.com/blog/open-source-data-storage-serv...

Their Storage Pod 6.0 can hold up to 480TB per server.


I am working on SeaweedFS. It was originally designed to store images as in Facebook's Haystack paper, and should be ideal for your use case. See https://github.com/chrislusf/seaweedfs

And it already supports S3 API, and other HTTP, FUSE, WebDAV, Hadoop, etc.

There should be many existing hardware options that are much cheaper than AWS S3.


I would go for something like Wasabi cloud storage.

Its API is S3 compliant.

I also believe they have minimal cost for transferring data from S3 into Wasabi, so the initial setup cost should be lower too.

It should be relatively cheaper than self-hosting too, when you account for the hidden costs that come with self-hosting: managing additional employees, having protocols in place for recovering from faults, expanding the storage as you go, maintaining existing infrastructure, etc.

You can compare the prices with respect to S3 at

(https://wasabi.com/cloud-storage-pricing/#cost-estimates)


Look at the cost of moving out of the cloud carefully.

Can you afford the up-front costs of the hardware needed to run the solutions you may want to run?

Will those solutions have good enough data locality to be useful to you?

It isn't really useful to have all your data on-site and your operations in the cloud; you've introduced many new layers that can fail.

If you go on-prem, the solution to look at is likely Ceph.

Source: Storage Software Engineer, who has spoken at SNIA SDC. I currently maintain a "small" 1PB ceph cluster at work.

Recommendation: Get someone who knows storage and systems engineering to work with you on the project. Even if you decide not to move, understanding why is the most important part.


If I were in your shoes I'd still host it on AWS, unless your shoes have a problem with the AWS bill, but then you run into other problems:

- Paying for physical space and facilities

- Paying people to maintain it

- Paying for DRP/BCP

- Paying periodically since it doesn't last forever so it'll need replacements

But if you had to move out of AWS and Azure and GCP aren't options, you can do Ceph and HDDs. Keep two extra copies of each file (three replicas total), so you have to lose three drives before any specific file (and only that file) suffers data loss. This does not come with versioning, full IAM-style access control, or webservers for static files (which you get 'for free' with S3).

HDDs don't need to be in servers, they can be in drive racks, connected with SAS or iSCSI to servers. This means you only need a few nodes to control many harddisks.

A more integrated option would be (As suggested) back blaze pod-style enclosures, or storinator type top loaders (supermicro has those too). It's generally 4U rack units for 40 to 60 3.5" drives, which again generally comes to about 1PB per 4U. A 48U rack holds 11 units when using side-mounted PDUs, a single top-of-rack switch and no environmental monitoring in the rack (and no electronic access control - no space!).

This means that for redundancy you'd need 3 racks of 10 units. If availability isn't a problem (1 rack down == entire service down) you can do 1 rack; if availability is important enough that you don't want downtime for maintenance, you need at least 2 racks. Cost will be about $510k per rack. Lifetime is about 5 to 6 years, but you'll have to replace dead drives almost every day at that volume, which means an additional ~2,000 drives over the lifespan; perhaps some RAM will fail too, and maybe one or two HBAs, NICs and a few SFPs. That's about $1,500,000 in spare parts over the life of the hardware, not including the racks themselves, and not including power, cooling or the physical facilities to house them.

Note: all of the figures above are 'prosumer' class and semi-DIY. There are vendors that will support you partially, but that is an additional cost.

I'm probably repeating myself (and others) here, but unless you happen to already have most of this (say: the people, skills, experience, knowledge, facilities, money upfront and money during its lifecycle), this is a bad idea and 10PB isn't nearly enough to do by yourself 'for cheaper'. You'd have to get into the 100PB or more arena to 'start' with this stuff if you need to get all of those externalities covered as well (unless it happens to be your core business, which from the opening post it doesn't seem to be).

A rough S3 One Zone IA calculation shows a worst-case cost of about $150,000 monthly, but at that rate you can negotiate significant discounts, and with some smart lifecycle configuration you can get it down further, to the point where letting AWS do it can end up roughly half as expensive as doing it yourself.

Calculation as follows:

DIY: at least 3 racks to match AWS One Zone IA (you'd need 3 racks in each of 3 different locations, 9 racks total, to have 3 zones, but we're not doing that as per your request), which means the initial starting cost is a minimum of $1,530,000. Combined with a lifetime spares cost of at least $1,500,000 over 5 years (if we're lucky), that's about $606,000 per year, just for the contents of racks that you already have to own.

Adding to this, you'd have some average colocation costs, whether you have an entire room, a private cage or a shared corridor. That's at least 160U, and at least 1,400VA per 4U (roughly 12A at 120V), which is what a third of a normal rack might use on its own! Roughly, that boils down to a monthly racking cost of $1,300 per 4U if you use one of those colocation facilities. That's another ~$45k per month, at the very least.

So a no-personnel colocated setup can be done, but doing all that stuff 'externally' is expensive: about $95,500 every month, with no scalability, no real security, no web services or load balancing, etc.

That means below-par features get you a rough saving of $50k monthly, assuming you don't need any personnel and nothing breaks more than usual. You'd also have to not use any S3 features besides raw storage. And if you use anything outside the datacenter where you're located (i.e. if you host an app in AWS EC2, ECS or a Lambda or something) and need a reasonable pipe between your storage and the app, that's a couple of thousand per month you can add, eating into the perceived savings.
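For reference, a sketch that consolidates the rough arithmetic above (all figures are the estimates from this comment, not quotes):

    # Back-of-envelope check of the DIY-vs-S3 figures above (rough estimates only).
    hardware = 3 * 510_000          # 3 racks of storage servers
    spares_over_life = 1_500_000    # drives, RAM, HBAs, NICs, SFPs over ~5 years
    years = 5

    diy_hw_monthly = (hardware + spares_over_life) / (years * 12)   # ~$50.5k
    colo_monthly = 45_000                                           # racks, power, cooling
    diy_monthly = diy_hw_monthly + colo_monthly                     # ~$95.5k

    s3_onezone_ia_monthly = 150_000  # worst-case list price, before discounts
    print(f"DIY ~${diy_monthly:,.0f}/mo vs S3 1Z-IA list ~${s3_onezone_ia_monthly:,}/mo")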


Strong plus-one here. Rolling your own basically means you will need an entire brand new business function to keep the lights on. That's something your entire company is going to have to adapt to. New staff, new ways of thinking about data, new problems the C-suite needs to consider. The opportunity cost alone can be immense here, since your engineers will need to spend their time working on rote data storage and not business problems.


Why not downsample everything to 10% the size, put those online, and use Amazon Glacier for the originals? (e.g. for exporting)

If you're storing images and videos directly from the phone, they can be downsampled drastically without losing quality on a viewing device that anyone's likely to have.

It's unlikely that anyone wants to download the full size copy, and if they do, they can wait a few hours for Glacier.

You could expose this to the customer, e.g. offer direct access of originals at 2x or 5x the price. But 99.9% of people will be OK with immediate access to quality images/video and eventual access to the unmodified originals.
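If the originals stay in S3, the Glacier half of this can be done with a lifecycle rule rather than manual moves. A minimal boto3 sketch, where the bucket name and the originals/ prefix are placeholders for illustration:

    # Minimal sketch: transition objects under originals/ to Glacier after 30 days.
    # Bucket name and prefix are placeholders.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-media-bucket",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "originals-to-glacier",
                "Filter": {"Prefix": "originals/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }]
        },
    )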


Perhaps look into Vast Data? They have a TCO calculator [1] but it seems to compare to other on-prem data storage providers (like Isilon...). 10PB in One Zone IA costs $100,000/mo without discount, or $1.2M per year, and that's just for storage alone. Vast claims something like $3.5M TCO over 5 years with 10PB of data and no growth assumption. 5 years on your S3 zone with no data growth (or transfer...) is $6M.

[1] https://vastdata.com/tco-calculator/


1) For hardware, you want cheap, expendable, bare metal. Look up posts about how Google built their own servers for reference.

2) For RAID, go with software-only RAID. You will sidestep problems caused by hardware RAID controllers each having their own on-disk format (i.e. non-swappable across models/makes).

3) For the filesystem, look at OpenAFS. CERN is using OpenAFS to store petabytes of data from the LHC.

4) For the operating system, look at Debian. Coupled with FAI (Fully Automatic Installation), it will enable you to deploy multiple servers in an automated way to host your files.


With a volume like that you should negotiate at least three storage+CDN providers and see who will give you the best offer. It could be as much as 50% off street price and even more if you are ready to sign a 2-3 years contract.

I personally would consider S3 Glacier+CloudFront, member of Bandwidth Alliance [0] of your choice+CloudFlare, and whomever serves TikTok now.

[0] https://www.cloudflare.com/en-gb/bandwidth-alliance/


I would buy commodity hardware and build my own storage cluster with ZFS, and just put MinIO in distributed mode on it. You have full control of redundancy levels, either at the cluster level or on the individual ZFS pool side, and can fine-tune it to what your business needs. Maybe you don't need to mirror all the data, so you can use RAIDZ2 at just 20-30% extra cost.

Hiring staff to build this would make sense at this point, because if your S3 storage cost is really $200,000/month, you can hire 3 good engineers for $450,000/year, which is roughly the cost of just two months of S3 storage.


I strongly recommend having more than one zone. A datacenter being offline for a while or totally burning is possible. It did happen a few weeks ago and a lot of companies learnt the value of multi zones the hard way.


It definitely depends on how you accumulate and the usage patterns. More clarity is needed there to make recommendations.

As an aside, you can often get nice credits for moving off of AWS to Azure or GCP. I recommend the latter.


Can you elaborate on what the >10PB of data is and why it’s important to your startup? Is it archived customer data, like backups? Or is it data purchased from vendors for analysis and ML?


See updated question. Thanks for asking


Hey Philip,

We store north of 2PB with AWS and have just committed to an agreement that will increase that commitment based on some competitive pricing they've given us.

Give me a shout if you'd like to chat.


I have designed, deployed and supported an S3-compatible storage system with 5PB capacity for a couple of years, so I have acquired the experience to put the right hardware and software together to build such a system. The cost reduction compared to a public cloud like AWS is tremendous. If you are interested in building private cloud storage of your own, you can contact me at hackernewsantispam@gmail.com for a more detailed discussion.


My high level view is that if you are storing that much content, most of it is bad, so the solution for me would be to delete it!

As for my own storage, I use 1TB SanDisk SD cards in a Raspberry Pi 2 cluster for write-once (user) data, and 8x64GB 50nm SATA drives from 2011 on a 2x 8-core Atom for data that changes all the time! Xo

People say that content is king, I think that final technology (systems that don't need rewriting ever) is king and content has peaked! ;)


Latency being time to first byte downloaded I’d still store this in cloud somewhere so that the really “hot” images/videos could be cached in a cloudfront CDN or something.

Also this is a startup, no? A million or so in storage so you need not preoccupy your startup with having to deal with failing disks, disk provisioning, collocation costs, etc. etc. not to mention the 11 9s of durability you get with S3, to me it just makes the most sense to do this on the cloud.


I'd look at using a Storinator cluster with a scalable network filesystem like Gluster, Lustre, Ceph or something along those lines. A 4U Storinator with 60 18TB drives has over 1PB of raw capacity and costs about $43,000. You'd be looking at an upfront cost of roughly $500k, but if you amortize that over a 5-year period you're looking at $100k per year, plus you'll need someone dedicating time to maintaining it.


If AWS is what you know I'd stick with it.

Changing that can be very very difficult for not much gain. Plus AWS skills are very easy to recruit for vs Google cloud.


By moving from AWS to a cheaper backup storage provider like B2, you could cut costs from ~$200k to ~$50k per month.

There is an S3-like interface, so you may just need to change the access key and region host: https://www.backblaze.com/b2/docs/s3_compatible_api.html
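A minimal sketch of what that switch might look like with boto3; the region in the endpoint URL, the bucket name and the credentials are placeholders, so check your own B2 bucket settings:

    # Minimal sketch: pointing an S3 client at Backblaze B2's S3-compatible API.
    # Endpoint region, bucket and credentials below are placeholders.
    import boto3

    b2 = boto3.client(
        "s3",
        endpoint_url="https://s3.us-west-002.backblazeb2.com",  # your B2 region endpoint
        aws_access_key_id="<keyID>",
        aws_secret_access_key="<applicationKey>",
    )
    b2.put_object(Bucket="my-bucket", Key="user1/video.mp4", Body=b"...")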


My previous startup (~2014) had a similar problem: PBs of data, with millions of mixed clients accessing it at close to real-time speeds. The biggest difference is that we needed to do real-time processing before delivering the content. We needed storage capacity balanced with CPU and RAM.

We ended up buying lots of Supermicro's ultra dense servers [1]. That's a 3U box containing 24 servers that are interconnected with internal switches (think: 1 box is a self-contained mini cloud). Each server has (in the cheap config) one CPU with 4 Xeon cores, 32GB RAM, and a 4TB disk.

Those were bought & hosted in China, and IIRC price tag was around $20k USD per box. That's 96TB per 3U, or >1.2PB and ~$200k per rack. We had a lot of racks in multiple datacenters. These days capacity can be much larger, e.g.: 6TB disk, 144TB per 3U and >1.8PB per rack.

We've tried Ceph, GlusterFS, HDFS, even early versions of Citus, and pretty much everything that existed and was maintained at the time. We eventually settled on Cassandra. It required 2 people to maintain the software, and 1 for the hardware.

Today, I would have done the same hardware setup, mainly because I haven't had a single Supermicro component fail on me since I first bought them in the early 2000s. Cassandra would've been replaced by FoundationDB. I've been using FoundationDB for a while now, and it just works: zero maintenance, incredible speeds, multi-datacenter replication, etc.

Alternatively, if I needed storage without processing, but with fast access, I'd probably go with Supermicro's 4U 90-bay pods [2]. That'd be 90*16TB, 1.4PB in 4U, or ~14PB per rack. And FoundationDB, no doubt.
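To make the FoundationDB suggestion concrete, here's a minimal sketch of chunked blob storage with its Python bindings. The chunk size and key layout are my own assumptions; note that values are capped at 100 KB and a single transaction at roughly 10 MB, so large videos would need multiple transactions in practice:

    # Minimal sketch: chunked blob storage in FoundationDB's Python bindings.
    # Chunk size and key layout are arbitrary choices for illustration.
    import fdb

    fdb.api_version(630)
    db = fdb.open()
    CHUNK = 90_000  # stay under the 100 KB value limit

    @fdb.transactional
    def write_blob(tr, prefix, data):
        # Split the blob into fixed-size chunks under a common key prefix.
        for i in range(0, len(data), CHUNK):
            tr[prefix + b"/%08d" % (i // CHUNK)] = data[i:i + CHUNK]

    @fdb.transactional
    def read_blob(tr, prefix):
        # Reassemble the blob by scanning every key under the prefix.
        return b"".join(kv.value for kv in tr.get_range_startswith(prefix))

    write_blob(db, b"blob/user1/img001", b"...image bytes...")
    print(len(read_blob(db, b"blob/user1/img001")))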

As a fun aside: back then, we also tried Kinetic Ethernet Attached Storage [3]. Great idea but what a pain in the rear it was. We did however have a very early access device. No idea if it's still in production or not.

[1] https://www.supermicro.com/en/products/system/3U/5038/SYS-50...

[2] https://www.supermicro.com/en/products/system/4U/6048/SSG-60...

[3] https://www.supermicro.com/products/nfo/files/storage/d_SSG-...


I've used Wasabi a ton in the past and it's been excellent. It's already been talked about a lot in this thread, but I haven't seen their marketing video[0] linked, and it's pretty funny so I thought I'd leave it here!

https://www.youtube.com/watch?v=P7OzyTG4fCM


Tape, if it fits your storage needs. You won't beat the cost of tape if you are doing cold storage.

For online or nearline storage, you should look at what Backblaze did. Either buy hardware that is similar to what they did (basically disk shelves, you can cram ~100 drives into a 4U chassis) or if you are at that scale you can probably build your own just like they did.


Have you considered deleting most of it?

Chances are you don't need all of it. Every company today thinks they need "Big Data" to do their theoretical magic machine learning, but most of them are wrong. Hoarding petabytes of worthless data doesn't make you Facebook.

To be a little less glib, I'd start by auditing how much of that 10PB actually matters to anyone.


For on-premises storage (without managing storage racks and Ceph yourself) you can look at Infinibox (https://www.infinidat.com/en/products-technology/infinibox).

(I'm not working there anymore, posting this just to help)


I just wanted to thank everyone for taking the time to reply. This has been way better input than I thought it would turn out to be.


I hope you will share your decision with us if you can, would be interesting to understand.

Good luck.


Ceph is a beast and will require at least 2-3 technicians with intricate Ceph knowledge to run multiple (!) Ceph clusters in a business continuity responsible manner.

Because you must be able to deal with Ceph quirks.

If you can shard your data over multiple independent stand-alone ZFS boxes, that would be much simpler and more robust. But it might not scale like Ceph.


Have you tried backblaze b2 storage? Requires more work client-side but is around 1/4 to 1/5 the price.

The only issue is whether or not you have a CDN in front of this data. If you do then backblaze might not be much cheaper than S3->Cloudfront. You'd save storage costs but easily exceed those savings in egress.


Disclaimer: I work for Backblaze so I'm biased.

> Have you tried backblaze b2 storage? Requires more work client-side...

Backblaze recently released an S3 compatible API, so I'm hoping it is zero client-side work: https://www.backblaze.com/b2/docs/s3_compatible_api.html

If you try it, and find any issues, please let us know!


I think if I _had_ to decide (I'm not the best informed person on the matter) I'd lean towards leofs[1].

I only read about it, but never used it.

It advertises itself as exabyte-scalable and provides S3 and NFS access.

[1] https://leo-project.net/leofs/



You can buy an appliance from Cloudian and have your S3 on-premises, with support.

They're basically 100% S3-compatible.

I don't know the details of their pricing, but they're production grade in the real sense of the word.

I am not affiliated with them in any way, but I interviewed with them a couple of years ago and left with a good impression.


Wasabi + BunnyCDN has worked like a charm for us. We've got about 50TB there, if I recall. Our bill is dramatically smaller than when we were on AWS. Wasabi has had some issues-- notably a DNS snafu that took the service out for about 8 hours, if I recall. But over all, the savings have been worth it.


Sounds like a standard business problem, make a spec and get the main 20 cloud providers to submit bids.


Compression is always a good alternative, which is especially effective when modification is infrequent.


If they are a good data storage company, the data is encrypted so they can't compress what they already have. Perhaps they could compress the new incoming data client side before encryption to save a few bits.


It would be cool to actually have a "blockchain" for something like this. I know that this huge amount of data to be stored is a niche market, but hear me out:

Everyone that wants to make extra money can join

You join with your computer hooked up to the internet, with a piece of software running in the background

You share % of your hard-drive and limit speed that can be used to upload/download

When someone needs to store 100PB of data (the "uploader"), they submit a "contract" on a blockchain. They also set up the redundancy rate, meaning how many copies need to be spread around to guarantee the consistency of the data as a whole.

The "uploader" shares a file - the file is being chop in chunks and each chunk being encrypted with uploader private PHP key. The info re chunks are uploaded to blockchain and everyone get a piece. In return, all parties that keep piece of uploader data get paid small % either via PayPal or simply in crypto.

I think that would be a cool project, but someone would have to do some back-of-napkin number crunching to see whether it would be profitable enough for data hoarders :)
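A toy sketch of the chunk-and-fingerprint step such a scheme would need; the chunk size and hash choice are arbitrary here, and real systems like Filecoin/IPFS handle this very differently:

    # Toy illustration: split a file into chunks and record their fingerprints,
    # the minimum metadata a storage "contract" would need to verify holders.
    import hashlib

    CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB, arbitrary

    def chunk_manifest(path):
        manifest = []
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                manifest.append(hashlib.sha256(chunk).hexdigest())
        return manifest  # one hash per chunk; this is what would go on-chain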


I was hoping that someone had experience storing data with Filecoin. But I think it's still just too early to bet an existing business on it.


I'm curious why distributed cloud storage systems such as filecoin haven't been mentioned as a possible solution. Estimates of cost of storage that I saw on "file.app" put it at something like 100x cheaper than S3.

Not worth the risk or why?


Not from experience, but if I were given the task, I'd probably think about how that data could be distributed. Maybe use my own instance of IPFS, so each 'node' doesn't have to store all of the data.


Just run another venture round and don't think too hard about this problem. If everything goes well it won't be your problem for much longer; if it goes badly, then who cares anyway.


I happen to own exa-byte.com, in case you need a domain for it ;-)

(In 1998, in school, I looked up in our math book what would come after mega, giga... 20 years later, just as fresh and useless as on day one ;))


Have you looked into the storage tiering (eg moving objects to glacier) for less active users?

Perhaps it’s a mix of some app pattern changes and leveraging the storage tier options in AWS to reduce your cost.


Is the storage of the data critical to the future growth of the business?


Here's an unpopular answer - don't store 10PB of data. Find a way for your startup to work without needlessly having to store insane amounts of data that will likely never be needed.


Excellent advice for a data backup startup.


Doesn't this imply that they started a company without actually having a plan for the most fundamental part of what they are selling?

This is like an ISP asking how they can get hooked up to the internet.


It's more like a company that has validated product fit now needing to figure out how to scale economically.

Apple didn't start manufacturing with mega Foxconn contracts. They had to figure that out along the way as their scale demanded.

However I share your sentiment: doing things the same way but cheaper is usually not the solution. Doing things differently (in-sourcing) might be the path forward.


Figuring out how to scale is not just a part of a storage startup, it's the whole thing.

Apple created something people wanted and sold at a price that would still make money if it was assembled by hand. They didn't form a company around a commodity like data storage.

Data storage is a commodity. Everyone already has some, online storage companies already exist. If you don't know how to store a lot of data and your company's whole purpose is to store a lot of data, it sounds like something that should have been worked out before making the company.


No. They are already managing 10PB, planning for which would be very stupid when just starting up.


Why would planning to be able to execute the single focus of your startup be stupid?


I tend to agree. If you are a storage company, I’d think that part of your secret sauce should be how to store tens or hundreds of petabytes of customer backups economically.

Maybe I’m wrong though. Perhaps the real secret sauce is the end user experience and the kind of storage you use on the backend doesn’t matter at all.

However I bet that the “cloud storage space” is pretty crowded and lots of people shop on price more than anything. If your business model is all about price, then finding economical storage is critical to your company and needs to be part of your core competency.

If price isn’t that important, perhaps it doesn’t matter... the “winners” would win no matter how expensive their storage solution is.

But honestly.... I feel like part of your core competency needs to be managing the storage system.


I also tend to agree. I think AWS is great and use it as my default solution, but if I was starting a company that had high bandwidth and/or storage requirements, I would be looking for other solutions from day one.


A rule of thumb for performance is that every 10-100x in scale involves changing up your fundamentals.

It's a bit different nowadays that a lot of scaling tech is commoditized, but still means things like negotiating new contracts, finding & fixing the odd pieces that weren't stressed before, etc.

(congrats on hitting the new usage levels + good luck! we're at a much smaller scale, but trying to figure out some similar questions for stuff like web-scale publishing of data journalism without dying on egress $, so it's an interesting thread...)


If you can solve that for me without affecting revenue I have $1m in cash for you right there.

We are a photo/video storage service.


I don't know if this is a crazy idea or if it creates scalability issues, but could you craft an algorithm to cold store data for users who do not show a need for instant access, and/or warm up the data when you predict it will be needed? Kind of like a physical logistics company would need to do with distributed warehousing.

Sticking points I see are: 1. If you get it wrong, you'll need some form of UX that keeps users from getting too angry about it. 2. The cost of moving the data between hot and cold storage might make this prohibitive until a much larger scale. 3. User behaviors might not be predictable enough.
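On S3 this kind of policy can even be approximated without moving providers, by rewriting cold objects into a cheaper storage class and restoring them on demand. A rough boto3 sketch, with bucket/key names as placeholders:

    # Rough sketch: demote a cold object to Glacier, and request a restore
    # when a user unexpectedly needs it again. Names are placeholders.
    import boto3

    s3 = boto3.client("s3")

    def demote_to_glacier(bucket, key):
        # Rewrite the object in place with a colder storage class.
        s3.copy_object(
            Bucket=bucket, Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            StorageClass="GLACIER",
            MetadataDirective="COPY",
        )

    def warm_up(bucket, key, days=7):
        # Kick off an async restore; the object becomes readable hours later.
        s3.restore_object(
            Bucket=bucket, Key=key,
            RestoreRequest={"Days": days, "GlacierJobParameters": {"Tier": "Bulk"}},
        )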


So from a completely evil (well, capitalist) perspective, do you have data on how often people retrieve backups, and at what 'age' they do so?

Because there may be an inflection point that offering monetary compensation for data loss, rather than actually trying to store the data, would make more financial sense. I.e., "All data > than 2 years gets silently expunged, and anyone trying to retrieve it at that point gets $10 per gig in compensation for 'our mistake'".

Please don't actually consider that though.


(A less evil approach that might still lead to reduced costs would be detecting old, unaccessed data of sufficient size and flagging it for users, with a small refund or service discount if they purge it. Though that assumes you have 'power users' who are storing massive amounts, to where the savings in storage costs would be worth it)

(And if you don't already, I would also consider making it so items that are in the trash for some period of time, say 30 days, get deleted automatically as well, possibly with a reminder email a few days before)

(And lastly, depending on user profiles and usage, incentives around reducing resolution/quality of photos and video, and automating that in the app as part of the sync process, might provide some opportunities to reduce costs of storage > the lost revenue of cheaper plans.)


"We used advanced machine learning algorithms to predict which users will need to retrieve which pieces of data in the future, and silently delete everything else."


What about storing the older things in some slower backup service that's cheap but slow to access? If the user eventually accesses them some day, you would kick off a background job to get them fresh again. Not super UX-friendly of course, but it could reduce costs. Or is this already the standard thing to do?


You could probably put a fun marketing spin on that.

"We use ML to ensure we only store the highest quality data, freeing you from the chains of having too much worthless data and nothing to do with it."


GPT-3 that convinces you why you don't need to store this thing


Image processing to label each image ("baby with spaghetti on head", "cat playing with string", "naked person"), and then only save one image with each label.


Just save the label. Then you use generative techniques when they want to retrieve the image.


They're a data storage service.


At that scale I would contact AWS, Backblaze and Wasabi directly to see what improvements they can offer in terms of TCO (and potentially for a longer term contract).


+1

and tell each that you contact others. Lowest bidder wins.


See how Filecoin works, and how decentralized databases work. It should be way cheaper than AWS. Search for S3-like APIs on decentralized databases and you will get your answer.


Google Nearline etc. cost a bit less; also, coming from AWS, they may give a good discount. Considering operations and maintenance, the cloud will be cheaper.


Use Intelligent-Tiering or some kind of custom system that moves data into Glacier more aggressively based on access times. It can help a lot.


Always look to nature first. Nature never lies. DNA storage:

Escherichia coli, for instance, has a storage density of about 10^19 bits per cubic centimeter. At that density, all the world’s current storage needs for a year could be well met by a cube of DNA measuring about one meter on a side.

There are several companies doing it: https://www.scientificamerican.com/article/dna-data-storage-...


What happens to your business if you lose this data?


I'm unsure if it's mature enough for your use right now (in particular, the retrieval market is undeveloped for fast access, but I wonder if you looked at filecoin?)

https://file.app/ https://docs.filecoin.io/build/powergate/

(Disclosure: I am indirectly connected to filecoin, but interested in genuine answers)


Have you looked into Backblaze? They’re a lot cheaper than Amazon and have S3-compatible APIs.


Off topic, but I'm shocked that anyone would trust uploading sensitive files (e.g. nudes) to this service. Photo vault type apps can be useful, but I would never want the content in those apps to upload to a small service like this based on their word that employees won't go through it.


> no queries need to be performed on the data.

cat >/dev/null, obviously. ;-)


Not sure the state of some of the decentralized solutions...


Tape drives. Semi joke.

How often you access data is another question.


Pure Flashblade 100%

Feel free to ama on it, I'm a huge fan


How much is it costing to keep 10PB on AWS S3? According to calculator.s3.amazonaws.com, it's USD 200,000+ per month.


900 LTO-U8 tapes


I'd store in node_modules/


The right answer for you may have more to do with your business requirements than technical requirements. I've done large scale storage in cloud providers (S3, GCS, etc.) and on premise (I designed the early storage systems at Dropbox). I haven't found there to be a one-size-fits-all answer.

If you place a high value on engineering velocity and you already rely on managed services, then I would look to stay in S3. Do the legwork to gather competitive bids (GCS, Azure, maybe one second tier option) and use that in your price negotiation. Negotiation is a skill, so depending on the experience in your team, you may have better or worse results -- but it should be possible to get some traction if you engage in good faith with AWS.

There is a considerable opportunity cost to moving that data to another cloud provider. No matter how well you plan and execute it, you're going to lose some amount of velocity for at least several months. In a worse scenario, you are running two parallel systems for a considerable amount of time and have to pay that overhead cost on your engineering team's productivity. In the worst case scenario, you experience service degradation or even lose customer data. It's quite easy for 2-3 months to turn into 2-3 years when other higher priority requirements appear, and it's also easy for unknowns to pop up and complicate your migration.

With all of that said, if the fully baked cost of migrating to another cloud provider (engineering time + temporary migration services + a period of duplicated costs between services + opportunity cost) is trajectory changing for your business, then it certainly can be done. I feel like GCS is a bit better of a product vs S3, although S3 has managed to iron out some of its legacy cruft in the last few years. Azure is not my cup of tea. I have never seriously considered any other vendors in the space, although there are many.

Your other option is to build it. I've done it several times, people do it every day. You may need someone on the team who either has or can grow the skillset you're going to need: vendor negotiation, capacity planning, hardware qualification, and other operational tasks. You can save a bunch of money, but the opportunity cost can be even greater.

10PB is the equivalent of maybe 1-2 racks of servers in a world where you can easily get 40-50 drive systems with 10-18TB drives (of course for redundancy you would need more like 2-2.5x, and you need space to grow into so that you're always ahead of your user growth curve). At any rate, my point is that the deployment isn't particularly large, so you aren't going to see good economies of scale. If you expect to be in the 100+PB range in 6-12 months, this could still be the right option.
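A rough sanity check of that sizing; every number below is an assumption (45-drive 4U chassis, 16TB drives, 10 usable 4U slots per rack, 2.5x raw-to-usable overhead), not a quote:

    # Rough sizing sanity check; every figure here is an assumption.
    drives_per_chassis = 45          # a typical 4U top-loader
    tb_per_drive = 16
    chassis_per_rack = 10            # leaving room for switches/compute
    overhead = 2.5                   # replication/EC plus growth headroom

    raw_pb_per_rack = drives_per_chassis * tb_per_drive * chassis_per_rack / 1000
    usable_pb_per_rack = raw_pb_per_rack / overhead
    print(raw_pb_per_rack, usable_pb_per_rack)   # ~7.2 PB raw, ~2.9 PB usable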

Personally, I would look to build a service like this in S3 and migrate to on-premise at an inflection point probably 2 orders of magnitude above yours, if the future growth curve dictated it. The migration time and cost will be even more onerous, but the flexibility while finding product/market fit probably countermands the cost overhead.

There is a third option, which is hosted storage where someone else runs the machines for you. Personally I see it as a stop-gap solution on the path to running the machines yourself, and so it's not very exciting. But it is a way to minimize your investment before fully committing.


On tape.


Context please.

1. Do you have paying customers already?

2. Can the startup weather large capex? does opex work better for you?

3. Do you already have staff with sufficient bandwidth to support this, or will you need to hire?

4. What are the access patterns for the data?

5. What is the data growth rate?

6. What is the cost of losing some, or all of this data?

7. What is your expected ROI?

TL;DR - storing and serving up the data is the easy part.


Talk with Linus at LTT.


Using Erasure coding.


GlusterFS + ZFS


I'm way late to the conversation. There are a few things that I haven't seen mentioned (apologies if I overlooked them).

I have no idea how you evaluate the necessity of keeping the data safe, and that plays a huge factor in deciding what's appropriate. Amazon S3 makes it a no-brainer for having your data safe across failure domains. Of course, the same can be done with non-S3 solutions, but someone has to set it all up, test it, and pay for it.

My background in storage is mostly related to working with Ceph and Swift (both OpenStack Swift and SwiftStack) while being employed by various hardware vendors.

Some thoughts on Ceph:

- In my opinion, Ceph is better suited for block storage than object storage. To be fair, it does support object storage via the Rados Gateway (RGW), and RGW does support the S3 API. However, Ceph has a strong consistency model, and in my opinion strong consistency tends to be better suited to block storage. Why is this? For a 10PB cluster (or larger), failures of various types will be the norm (mostly disk failures). What does Ceph do when a disk fails? It goes to work right away to move whatever data was on the failed disk (using its redundant copies/fragments) to a new place. No big deal if it's only a single HDD in failed status at any given point in time. What if you have a server, disk controller, or drive shelf fail? You get a whole bunch of data backfilling going on all at once. The other consideration with a strong consistency model is multi-site storage, which doesn't work so well due to the higher latency of inter-site communication.

- Ceph has a ton of knobs, is very feature rich, and is high on complexity (although it has improved). The open-source install mechanisms and admin tools have experienced (and continue to have) a high rate of churn: do a quick search on how to install/deploy Ceph and you'll see multiple answers, and the same goes for admin tools. Should you strongly consider Ceph as an option, I would strongly advise you to license and use one of the 3rd-party software suites that (a) take the pain away from install/deploy/admin, and (b) reduce the amount of deep expertise you would need to keep it running successfully. Examples of these 3rd-party Ceph admin suites are Croit [0] and OSNEXUS [1]. Alternatively, if you like the idea of a Ceph appliance, I would take a close look at SoftIron [2].

Aside from Ceph, it's worth taking a very close look at OpenStack Swift [3][4]. It's only object storage and has been around for about 10 years. It supports the S3 protocol and also has its own Swift protocol. It's open source and it has an eventually consistent data model. Eventually consistent is (IMO) a much better fit for a 10+PB cluster of objects. Why is this? Because failures can be handled with less urgency and at more opportune times. Additionally, an eventually consistent model makes multi-site storage MUCH easier to deal with.

I suggest going further and spending some quality time with the folks at SwiftStack [5]. Object storage is their game and they're very good at it. They can also help with on-prem vs hosted vs hybrid deployments.

Additionally, you would definitely want to use erasure coding (EC) as opposed to full replication. This is easy enough to do with either Swift or Ceph.
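For readers unfamiliar with the idea, here's a toy single-parity illustration of what erasure coding buys you; real deployments use Reed-Solomon schemes (e.g. 10 data + 4 parity fragments) via Swift EC policies or Ceph EC pools, not this:

    # Toy erasure coding: k data blocks plus 1 XOR parity block.
    # Any single lost block can be rebuilt from the survivors.
    from functools import reduce

    def xor_blocks(blocks):
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data = [b"AAAA", b"BBBB", b"CCCC"]      # k = 3 equal-sized data blocks
    parity = xor_blocks(data)               # m = 1 parity block

    # Simulate losing block 1, then rebuild it from the survivors + parity.
    survivors = [data[0], data[2], parity]
    assert xor_blocks(survivors) == data[1]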

Disclaimers and disclosures - I am not currently (nor have ever been) employed by any of the companies I mentioned above.

Dell EMC Technical Lead and co-author of these documents:

   Dell EMC Ready Architecture for Red Hat Ceph Storage 3.2 - Object Storage Architecture [6]
   Dell EMC Ready Architecture for SwiftStack Storage - Object Storage Architecture Guide [7]
Intel co-author of this document:

   "Accelerating Swift with Intel Cache Acceleration Software" [8]


   [0] https://croit.io
   [1] https://www.osnexus.com/technology/ceph
   [2] https://softiron.com
   [3] https://wiki.openstack.org/wiki/Swift
   [4] https://github.com/openstack/swift
   [5] https://www.swiftstack.com
   [6] https://www.delltechnologies.com/resources/en-us/asset/technical-guides-support-information/solutions/red_hat_ceph_storage_v3-2_object_storage_architecture_guide.pdf
   [7] https://infohub.delltechnologies.com/section-assets/solution-brief-swiftstack-1
   [8] https://www.intel.sg/content/www/xa/en/software/intel-cache-acceleration-software-performance/intel-cache-acceleration-software-performance-accelerating-swift-white-paper.html


Floppies. Lots of floppy disks. Like, 7B of them.


In my opinion, you're probably better off building and managing your own infrastructure at that scale, especially if you control the rest of the software stack that runs your platform. It would be best to go with an open source solution and invest in your own technology, infrastructure and people. This way, no matter what happens you can be in control of your data for as long as you want to and avoid vendor lock-in at every level.

If this isn't already something that your company is familiar with, you'll need people who know how to buy, build, test and manage infrastructure across datacentres, including servers and core networking. Understanding platforms like Linux will be critical, as well as monitoring and logging solutions (perhaps like Prometheus and Elastic).

The only solution that I know of which would scale to your requirements would be OpenStack Swift (https://wiki.openstack.org/wiki/Swift). It's explicitly designed as an eventually consistent object store which makes it great for multi-region, and it scales. It is Apache 2.0 licensed, written in Python with a simple REST API (plus support for S3).

The Swift architecture is pretty simple. It has 4 roles (proxy, account, container and object) which you can mix and match on your nodes and scale independently. The proxy nodes handle all your incoming traffic, like retrieving data from clients and sending it on to the object nodes and vice versa. Proxy nodes can be addressed independently rather than through a load balancer, which is one of the ways Swift is able to scale out so well. You could start with three and go up to dozens across regions, as required.

The object nodes are pretty simple, they are also Linux machines with a bunch of disks each formatted with a simple XFS file system where they read and write data. Whole files are stored on disk but very large files can be sharded automatically and spread across multiple nodes. You can use replication or erasure coding and the data is scrubbed continuously, so if there is a corrupt object it will be replaced automatically.

Data is automatically kept on different nodes to avoid loss for when a node dies, in which case new copies of the data are made automatically from existing nodes. You can also configure regions and zones to help determine the placement of data across the wider cluster. For example, you could say you want at least one copy of an object per datacentre.

I know that many large companies use Swift and I've personally designed and built large clusters of over 100 nodes (with SwiftStack product) across three datacentres. This gives us three regions (although we mostly use two) and we have a few different DNS entries as entry points into the cluster. For example, we have one at swift.domain.com which resolves to 12 proxy nodes across each region, then others which resolves to proxy nodes in one region only, e.g. swift-dc1.domain.com. This way users can go to a specific region if they want to, or just the wider cluster in general.
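For a sense of the client side, a minimal sketch using python-swiftclient against such an endpoint; the auth URL, credentials and container name are placeholders, and real deployments typically use Keystone rather than the v1-style auth shown here:

    # Minimal sketch using python-swiftclient (v1/tempauth-style auth shown;
    # endpoint, credentials and container name are placeholders).
    from swiftclient.client import Connection

    conn = Connection(
        authurl="https://swift.domain.com/auth/v1.0",
        user="account:user",
        key="secret",
        auth_version="1",
    )
    conn.put_container("photos")
    conn.put_object("photos", "user1/img001.jpg", contents=b"...")
    headers, body = conn.get_object("photos", "user1/img001.jpg")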

We used Linux on commodity hardware, stock 2RU HPE servers with 12 x 12 TB drives (so total cluster size is ~14PB raw), but I'm sure there's a better sweet spot out there. You could also create different types, higher density or faster disk as required, perhaps even an "archive" tier. NVMe is ideal for the account and container services; the rest can be regular SATA/NL-SAS. You want each drive to be addressed individually, so no multi-disk RAID arrays; however, each of our drives sits on its own single-member RAID-0 array in order to make use of some caching from the RAID controller (so 12 x RAID-0 arrays per object node).

Our cluster nodes connect to Cisco spine and leaf networking and have multiple networks; e.g. the routeable frontend network for accessing the proxy nodes, private cluster network for accessing objects and the replication network for sending objects around the cluster.

Ceph is another open source option, and while I love it as block storage for VMs, I’m not convinced that it’s quite the right design for a large, distributed object store. Compared to Swift, object storage seems more of an afterthought and inherits a system designed for blocks. For example, it is synchronous and latency sensitive, so multi-region can be tricky. Could still be worth looking into, though.

Given the size of your data and ongoing costs of keeping it in AWS, it might be worthwhile investing in a small proof of concept with Swift (and perhaps some others). If you can successfully move your data onto your own infrastructure I'm sure you can not only save money but be in better control overall.

I've worked on upstream OpenStack and I'm sure the community would be very welcoming if you wanted to go that way. Swift is also just a really great piece of technology and I love seeing more people using it :-) Feel free to reach out if you want more details or some help, I'll be glad to do what I can.


Pied Piper ?


RAM?


Lol. A stick of 256gb RAM costs ~$3000. 1TB needs 4. 1PB needs 4000. 10PB needs 40,000. So this would be an upfront cost of $120M.

And this doesn't even cover how you'd fit 40,000 sticks of RAM together.


10PB of RAM being only $120M blows my mind, to be honest. I would have guessed that price was closer to the SSD cost for 10PB.


This made me curious about what the SSD cost would be. It looks like you can get a 2TB SSD for $200. So that's $100/TB = $1M for 10PB. Of course prices may be higher for enterprise SSDs and you may need redundancy. Then again, you could probably get a bulk discount at that scale.


Tape is still very cost effective. Load latency might be a few minutes though


Wasabi storage


You almost certainly should not have 10PB of data. Not only is it extremely expensive, it is unlikely that millions of people have each allowed you to take gigabytes of their data. You are sitting on a huge violation of CCPA, GDPR, and other privacy laws, as well as copyright issues. If you are scraping data off the Internet you likely have content that is illegal to possess in several different countries (such as child sexual abuse material or videos of ISIL killings). As a startup you do not have the legal and technical capabilities to manage this data, so you should not have it.


A quick search shows this is the cofounder of Keepsafe, so I guess they most likely got the data from their customers.


Move to Oracle Cloud and before everybody starts hammering me look at this: https://www.oracle.com/cloud/economics/

I am not from Oracle and I am also running a startup with growing pains. Oracle is a bit late to the cloud game, so they are loading up their customer base now; the ear-squeezing will come 3-5 years down the road. Maybe you can take advantage of this.



