Cloud Storage for $2 per TB per month (sia.tech)
659 points by beedrillzzzzz on March 6, 2020 | 334 comments



I worked on the design of Dropbox's exabyte-scale storage system, and from that experience I can say that these numbers are all extremely optimistic, even with their "you can do it cheaper if you only target 95% uptime" caveat. Networking is much more expensive, labor is much more expensive, space is much more expensive, depreciation is faster than they say, etc etc. I don't think the authors have ever done any actual hardware provisioning before.

I didn't read all their math but I expect their final result to be off by a factor of 2-5x. Hard drives are a surprisingly low percentage of the cost of a storage system.


Author here. A lot of these numbers are drawn from experience in the mining world, where people realized that when cost is the ultimate bottom line, a lot of corners can be cut.

Sia systems don't need a ton of networking. I ran the networking buildout costs by some networking people, and again it comes down to cutting corners. If you only need 10 gbps per rack, if you don't mind having extra milliseconds added, etc, you can get away with very scrappy setups. The whole point is that it's not a highly reliable facility.


Sure, let's dig into networking. Who pays for rereplication traffic? If you do 64-of-96 RS encoding, that means for every failure you need to transfer 64x the lost storage capacity. If you're targeting a "low individual uptime but high aggregate uptime" model this means you need to be storing data in multiple sites -- and dedicated cross-geo bandwidth is expensive. I agree that in the happy case you can use low-bandwidth cheap equipment, but to get good reliability you need to provision for larger clustered failures such as rack- and row-level outages.


Sure! First up, we don't do repairs every time one host goes down. Standard practice on the network is to wait to do a repair until a full 25% of the redundancy is missing (in 64-of-96, that would be 8 hosts offline). Then you repair all 8 at once, significantly reducing the total amount of repair traffic.
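A rough sketch of the repair-traffic arithmetic in this exchange. It assumes a repairer has to download 64 pieces to rebuild a chunk and then uploads one rebuilt piece per lost piece; the function name and that cost model are illustrative assumptions, not Sia's actual repair code.

```python
# Traffic is counted in units of one piece (chunk size / 64).
DATA_PIECES = 64                             # pieces needed to reconstruct a chunk
TOTAL_PIECES = 96                            # pieces stored per chunk
PARITY_PIECES = TOTAL_PIECES - DATA_PIECES   # 32 pieces of redundancy


def repair_traffic(lost_pieces: int, batch_size: int) -> int:
    """Pieces transferred to replace `lost_pieces`, repairing `batch_size` at a time.

    Assumed cost model: each repair pass downloads DATA_PIECES pieces once,
    then uploads one rebuilt piece for every lost piece in the batch.
    """
    passes = -(-lost_pieces // batch_size)   # ceiling division
    return passes * DATA_PIECES + lost_pieces


# Threshold from the comment: repair once 25% of the redundancy is missing.
threshold = PARITY_PIECES // 4               # 8 hosts offline

eager = repair_traffic(threshold, batch_size=1)          # repair after every failure
lazy = repair_traffic(threshold, batch_size=threshold)   # repair all 8 at once

print(f"eager: {eager} pieces, lazy: {lazy} pieces, ratio {eager / lazy:.1f}x")
```

Under these assumptions batching cuts the download side of repair by the batch size, while the upload side is unchanged.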

But secondly, offline doesn't usually mean dead and gone. With unstable datacenters like this, hosts are usually back online before the user has lost a full 25% of their redundancy.

Row level and rack level outages are handled by data randomization. The entire Sia system heavily depends on probabilistic techniques, both on the renting and hosting side. Row level failures will take out some of your data, but nobody should be disproportionately impacted by a cluster failure.

On Sia, each piece is at a different site. So 64-of-96 implies that each chunk of data (96 pieces to a chunk) is located in 96 different places. This doesn't help with the geo-bandwidth, but as discussed above there are other techniques to handle that.

Surprisingly, bandwidth pricing on the Sia network is even cheaper than storage pricing relative to centralized competition. That's a lot harder to model at scale though, so we aren't as confident the Sia bandwidth pricing will hold up at $1 / TB in the long term.

And technically, most of this stuff is customizable per-customer. If your particular use case has a different optimal parameterization, it's fairly easy to tune your client to suit your particular needs.


So, is this it? Dropbox designer challenges the project broadly, gets top comment, Author refutes – and we leave it at that?

I mean this is basically the moment where I would expect every systems designer on HN coming out of the woods and crushing Sia into the ground, if there were, in fact, any ground at all to crush Sia into.

Is this actually legit? If so, where is the rejoicing? What am I missing?


No, the author's idea is ok and he's mostly right on the cost part. The configuration and the numbers are a bit off and unrealistic: you won't get such low 95% availability per site due to other economic and technological constraints. You'll get at least 99%, but probably closer to three nines per site, and 64 out of 96 won't be necessary at all (something like 8 out of 12 could be enough). Dropbox designer is just ignorant, biased and conditioned to US market and environment, but appeals to authority, so people upvote his bad comment. I do storage too, on smaller scale than Dropbox of course, not in the US, but it is distributed and the cost is already lower than what you see in the title.


The fact that nobody is able to prove it's a bad idea doesn't necessarily mean it's a good one. There might still be other downsides that haven't been considered, some of which could be solvable with more development work and some not.

At this point the cautious skeptic will be thinking "hmm, maybe there's something to this", not necessarily full on rejoicing.

That said, I agree it does seem promising. If you ever find yourself in need of cheap cloud storage it wouldn't hurt to look into Sia as a possible option.


Dropbox designer name dropped "64-of-96 RS encoding" as if they're the only person that's heard of, or dealt with, Reed-Solomon encoding before, and expected the author to get scared off. There is, in the case of Dropbox, plenty of ground to crush Sia into. That is the ground between the 95% and multiple-nines of availability.

Engineering is about tradeoffs. I could build a network as good as Google's with infinite money, infinite time, and infinite help. I could design a product as beautiful as Apple's with the same lack of limitations. Unfortunately for me, I have limited money, limited time, and limited help. Every systems designer understands that, innately, so isn't rushing out of the woodwork because Sia and Dropbox have merely chosen different tradeoffs. That one has IPO'd is uninteresting in the abstract. It's just money after all.


no. the author mentions "64-of-96" things in the post as well. I don't think kmod means to do what you said.


That sounds incredibly energy-inefficient. On average you have 12.5% of servers running but not contributing and possibly incurring load on other nodes.


12.5% overhead isn't that much. It's what just the networking gear can easily eat in a data center (12% out of all the non-cooling-related power supply).

Reed-Solomon encoding adds 50%, if you want 3 blocks per 2 data blocks. Replicated encoding (not relevant here since this is a low-throughput use case, but necessary if you want to sustain high read throughput) adds at least 200% (if you want 3x replication, which I think should be the minimum).
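The overhead figures in this comment follow from one ratio. A minimal sketch, where the function name is made up for illustration and "overhead" means extra raw storage relative to the user's data:

```python
def storage_overhead(data_units: int, total_units: int) -> float:
    """Extra raw storage as a fraction of the user data stored."""
    return (total_units - data_units) / data_units


rs_3_per_2 = storage_overhead(2, 3)        # 3 blocks stored per 2 data blocks
rs_64_of_96 = storage_overhead(64, 96)     # the 64-of-96 scheme discussed upthread
replication_3x = storage_overhead(1, 3)    # three full copies

for name, value in [("RS 2-of-3", rs_3_per_2),
                    ("RS 64-of-96", rs_64_of_96),
                    ("3x replication", replication_3x)]:
    print(f"{name}: {value:.0%} overhead")
```

Both Reed-Solomon layouts cost the same 50% in raw capacity; what differs is how many simultaneous losses each can tolerate.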


12.5% is far from the total overhead, just an additional compounding factor.


Turn 'em off if they're not contributing.


Spinning iron doesn't like start/stop cycles. Server drives go through very few in their entire life for that very reason.


These are SSDs.


They are not, see the link in the article: https://www.amazon.com/HGST-7-2K-SATA-Drive-Model/dp/B07XQRB...


Oops. Thanks.


87.5% efficient is not incredibly inefficient.


Compared to what though? How efficient is a typical data center? Probably way less than this.


I think his point is that he's targeting low aggregate uptime, too.


Definitely not, aggregate uptime is extremely high. We've never seen downtime due to network outages, only software bugs. And even then, only some users were impacted by the bugs; we've never in 5 years had a broad outage.


Gotcha, and your reply to the GP clarified a lot for me!


Here's the issue. We know that due to economy of scale and domain experience, AWS will always have the lowest cost (to Amazon) for storage -- whether that's totally-reliable storage, or sorta-reliable. If there was a demand for sorta-reliable, they'd build a sorta-reliable S3 and undercut you. Then, blockchain adds inefficiency. Therefore, it's basically impossible for any blockchain solution to have a lower total cost to provide storage.


> If there was a demand for sorta-reliable, they'd build a sorta-reliable S3 and undercut you.

Amazon's goal is to get people to pay a premium to get access to their entire ecosystem of services. They don't optimize the price of each individual service. There's a lot of take it or leave it for their specific offerings. Look at how they used to have reduced redundancy as a cost saver, but don't anymore. If you want corner-cutting storage from Amazon, you get funneled into Glacier now.

So let's look at how glacier deep archive is $.99 per TB per month, with a waiting time. And Google's weird offering where it costs $1.23 per TB per month, with instant access, but you pay a bunch of money to access. That means they can't store it on tapes, but they expect to profit anyway. And they're probably not running that as a loss leader that depends on the archival data being accessed.


S3 Reduced Redundancy is replaced with S3 Infrequent Access (and another lower tier One-Zone Infrequent Access) so there's still some pricing flexibility available.


Infrequent Access is closer. But wow, one-zone sure is only 20% cheaper than three-zone. And the retrieval cost is significantly more expensive than glacier's bulk price.

Actually, looking closer, is one-zone even storing fewer copies? It's offering the same durability, except in the case of "availability zone destruction". So if they're selling three copies for $10 a TB, then the equivalent price for 64-of-96 would be $5 a TB. Cutting that down with cheaper worse hardware would go a long way to get you toward the $2 goal.


S3 Standard and S3 IA store multiple copies spread across 3 zones. S3 One-Zone IA is stored as multiple copies but all within a single zone. They give you the same technical durability but less availability if that zone goes down, and if it's destroyed then you lose all your data.


That has the potential to cut both ways. If Amazon decides that S3 is their loss leader, then they would absolutely bury boutique players who only cover a few functional areas.

If instead they decide it's the cash cow, then it wouldn't be a stretch to predict that they price it a bit above the competition. The customer has the impression that they can amortize the cost across the other value-adds in the ecosystem. Whether the customer is right or not barely matters to Amazon from a purely financial aspect. All that matters is that they believe it to be true.

There's also the short-term versus the long term strategy. Short term they could do either and I wouldn't be surprised. Long term, I think I would expect the latter.


Where did you get those prices? I can't find anything cheaper than $4 per TB month at both Google and Amazon.


https://aws.amazon.com/s3/pricing/

Glacier Deep Archive is .99/TB/month in US East and EU Ireland and several others, plus $2.50/TB bulk retrieval.

https://cloud.google.com/storage/pricing

Archive Storage is $1.20/TB/month in us-east1 and europe-west1 and several others. But (instant) retrieval is $50/TB, with no options...
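To make the storage-versus-retrieval tradeoff concrete, here is a small sketch that uses only the per-TB prices quoted in this comment. The class and field names are illustrative, and network egress fees are deliberately left out (an assumption; they are discussed elsewhere in the thread).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveTier:
    name: str
    storage_per_tb_month: float   # USD per TB per month
    retrieval_per_tb: float       # USD per TB retrieved

    def cost(self, tb: float, months: int, full_retrievals: int) -> float:
        """Storage plus retrieval cost in USD, excluding egress bandwidth."""
        return tb * (self.storage_per_tb_month * months
                     + self.retrieval_per_tb * full_retrievals)


# Prices as quoted in the comment above (March 2020).
glacier = ArchiveTier("Glacier Deep Archive", 0.99, 2.50)
google = ArchiveTier("Google Archive Storage", 1.20, 50.00)

for tier in (glacier, google):
    print(f"{tier.name}: 1 TB for 12 months, one full restore = "
          f"${tier.cost(1, 12, 1):.2f}")
```

With these figures a single full restore on the Google tier costs more than three years of storage, which is why the retrieval price dominates the comparison.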


Everybody with more money than you can always undercut you in anything you ever do so why bother ever trying to do anything


Very good point! Let's say you move into a new market, trying to undercut the status quo. The status quo, with their power of scale and experience can just undercut you back. Who is benefitting from this? The customers! So if the customers want more competition they have to pay you to play. Which means they have to co-invest with you and promise to buy your service later.

A good example is the Apple iPhone screens. If Apple wants a new competing supplier, they have to invest in new competitors.


Not really. Suppose giant A undercuts startup B, where A is making a temporary loss but B runs out of money trying to compete. Only A remains in the market, and A starts charging a premium since it doesn’t have competition.

This is bad for customers in the long term. Like how Amazon is slowly destroying brick and mortar stores, or having Amazon Basics at the front undercutting other sellers. Monopolies mean you don’t have an open fair market anymore.


It's true that benefits for customers would likely be only temporary. But that is not my main point.

I'm explaining a common business move called pay to play, in industries that require very high investments for new competitors. If customers want B to compete, they have to pay B first to ensure B does not make a loss, so everyone wins a little in the end.

So in the case of OP: he needs to find paying customers first, who are willing to pre-order his service.


I think the point is that the cost should not be the main motivator. I would agree that there needs to be other differentiators in addition to cost, which could provide sufficient moat against other competitors, big or small.


It's possible that there could be non-obvious innovations in how to save money with a low reliability threshold, which Amazon might not be able to effortlessly copy.


It should also be noted that you can get S3 storage for $1/TB/month already if you use the Glacier Deep Archive storage class.


Did that include network IO to retrieve your data?


Nope. That's what actually kills you on Amazon - network costs.


It's... interesting... that Amazon offers reasonably priced bandwidth on Lightsail, but it's against the TOS to use it in connection with other services.


Yeah, I have been proxy accessing my personal S3 that way, but I'm not really willing to bet a business on it.


Not really. Have you ever done the maths on what it costs to recover 1TB in a recovery situation?

Definitely not worth it.


Putting it into normal S3 only costs $2.50 a TB. That's very affordable.

To get out of Amazon entirely, they definitely want to gouge you, but it's not the end of the world. If $24 a year is an acceptable storage price, then Snowball export costing somewhere around $36/TB in bulk isn't too awful. (And if you don't have enough data to fill up a snowball, you can probably smuggle it out through lightsail.)


My point is taking something, adding inefficiency, then pronouncing it's "cheapest in class" isn't logically possible.


You are confusing "cheapest option possible" with "cheapest option available". Maybe Amazon _could_ be cheapest and most efficient in everything but they aren't, that's why they make a lot of profit with their cloud services. Thus there can easily be a competitor with a cheaper offering, especially if they are cutting some corners as per the article, even if they have some inefficiencies in other areas.


The point is not to do it cheaper but better.


Quantity is a type of quality. Lots of people mainly use cloud storage as a cold storage locker for backing up photos and documents. I'm sure there's ways to innovate in the space that would make certain clouds more appealing, but most solutions are already effective as simple storage lockers. Doing it better instead of cheaper at this point is more likely to be either a megacorp affiliation perk, like a discount on a YouTube Music subscription, or phone company style family plan discounts "Get discounted storage for the whole family if you pay slightly less than the price of four memberships!"


... better than AWS with its SLAs, that every major company relies on to some extent?


Yes. There's a reason companies like Digital Ocean, Heroku, OVH, ZEIT, Joyent, Rackspace, Linode, Cloudflare and many more have been able to survive and grow rapidly in an AWS-dominated space. None of them are competing by undercutting Amazon in price.


OVH is certainly undercutting Amazon in price. The free bandwidth included in each instance is already a dealbreaker if you actually use the instance for anything other than very heavy cpu-bound loads with small output to send back.



"Better" in a niche-filling way, not a "this is very well engineered, but for my use case it's very overengineered" way.


market your product better, steal better companies people, lie, play political games, leverage relationships, kill

so many options


> ...they'd build a sorta-reliable S3...

From operational experience reports of S3 I've seen out there and discussions [1], including on HN [2], once you reach a large enough object count, S3 is "sorta-reliable". Objects will disappear. The durability claim is not SLA-enforced [3], which is uptime-oriented. The only remedy is a service credit.

[1] https://www.theregister.co.uk/2018/07/19/data_durability_sta...

[2] https://news.ycombinator.com/item?id=14880617

[3] https://aws.amazon.com/s3/sla/


Had about 20k objects and lost one (405s no matter what) about six months in.

I didn’t think much about it, but I guess I was really unlucky.


I agree that AWS has economies of scale that make it hard to build a better/cheaper S3, but one way you can get around this problem is by building to lesser requirements. S3 has to work for all use cases, but if you know you need less of something costly (say, fewer IOPS) you can build a system that's cheaper than S3 even if the individual components are more expensive than what S3 is paying.


This is actually not correct.

Because of AWS scale it cannot be the cheapest. Amazon cannot use odd lots or small lots, which means that it has very few possible suppliers and those suppliers will never be the cheapest.

The same goes for the network infrastructure. While it is possible to buy 100Gbit/sec at a throwaway price because some salesperson needs to make his numbers, and neither he nor his director of sales thinks Kmod Hosting would be able to fill the pipe it is buying, Amazon would need 900Gbit/sec for their exit, and no one at the vendor is going to blink at making Amazon pay a higher price than Kmod Hosting.


you are wrong that AWS pays more than small vendors.


That's not what they said. They pay more than small vendors, sometimes.

If you are able to take advantage of sales and clearance deals here and there, you are very likely to get a lower price than even the biggest buyers.


1 month of sales hunting has got me a PC rig for literally 2/3rd of the cost of buying without waiting. And this was an optimised build in the first place.


They probably could, but they're showing no sign of wanting to compete on cost.

E.g. bandwidth costs on AWS are high enough that if you actually serve up lots of data from S3 you can typically afford to rent servers to cache all of it 'in front' of AWS and still save a ton of money.

S3 only gets close to competitive if you never access the data from outside of AWS.

Which gets to the point: If you use an AWS service like S3, you pretty much have to use other AWS services if you want the cost to be even somewhat reasonable in aggregate.

They don't need to compete on cost, because once they get you to buy into one set of services, moving any one set of services off AWS gets more painful and/or costly, and a full migration looks too scary for most people.

S3 will never be priced to undercut anything but big players for that reason. They need to be competitive with Google and Azure, because those guys can offer to offset transitioning costs and generally aggressively target AWS customers.

A small player isn't the same threat even if substantially cheaper.


Sia is a lot more than a low cost storage platform. It's a full reimagining of how the cloud should work with an emphasis on user ownership and control, open access for developers, and ultimately decentralization that allows users to be certain their applications will always work (no more "RSS Reader is shutting down").

This post is intended to address people who do not feel that decentralization can be cost effective at scale.


Where do I find an introduction to what Sia is? I tried clicking around in the headers from OP, but couldn't find it within my very lazy tolerance.


https://blog.sia.tech/skynet-bdf0209d6d34 - this is probably the best thing we have at the moment.

https://support.sia.tech/article/dk91b0eibc-welcome-to-sia - this support article is also a good introduction to the network and why it was built


Website: https://sia.tech/

Also take a look at Skynet, a file sharing protocol built on top of Sia: https://siasky.net/


I cringe every time a project uses the name "Skynet". First, it's overused. Second, it's literally the name of the AI project that went rogue and tried to kill all the humans in the Terminator movie universe. Not exactly the best association. Just find another name.


On the other hand, I am filled with positive associations at the name.

Reliability, redundancy, disaster recovery, cost cutting, and absolute focus on execution!


See Soylent.


What aspect of this infrastructure is user-owned?


You're certainly reimagining the marketing slang of a few years ago (a decade for the gratuitous dig at Google Reader)


Amazon has huge margins. They could sell you the same services for waaaaaay less money. They just won't.


having worked at aws, all I'll say is I think you're wildly over-estimating their agility


Care to share any funny stories?


Isn’t that what AWS S3 one zone infrequent access is?


> (paraphrased) domain experience means they are best

As someone who has worked with a couple of market leaders in their respective fields, I want to dissuade anyone who will listen from this notion. NO, the market leaders aren't doing things optimally. Some things are downright stupid.


The buildout in the article doesn't work. You can't plug in that 4-lane SAS SFF-8087 splitter cable into that motherboard. You're only getting 8 hard drives per motherboard with that setup, not 32.

That puts the cost of 192 TB at more like $6240, not $4945. Could be less if you find a good deal on mini-SAS PCIe cards, but still going to be substantially higher than $4945.


Even if it had slots for the splitter cable, Intel and AMD onboard SATA explicitly doesn't support port multipliers as far as I know. You can buy PCIe SAS cards that do for relatively cheap, but then you have to find a board with enough PCIe slots. Easy enough on the "gamer" boards but if you want ECC (and you probably do, for storage) and IPMI (you probably do, if you have more than a few dozen servers) your options get much more limited. Other than 1 or 2 Asrock Rack boards, you pretty much have to move into Epyc 7000-series or Xeon Silver or above. Often dual-socket on the Xeons to get a board with lots of PCIe.

In theory something like an Epyc 3000-series with lots of PCIe or onboard SATA that supports port multipliers would work great, but I don't think anyone actually makes that.


You are going to get a 4U server that takes 60 disks and populate it with SAS cards driving them using SAS-2-SATA cables.


The third sentence of your Medium article says; "Despite this, the Sia network is able to achieve 99.9999% uptime for files." How do you achieve this in a "not a highly reliable facility"?


Redundancy, I assume. Lots of unreliable facilities and nodes can be very durable in aggregate.
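A back-of-the-envelope check of that intuition for the 64-of-96 layout described upthread. It assumes every host is up 95% of the time and that hosts fail independently, which is exactly the assumption questioned in the replies below; the function name is illustrative.

```python
from math import comb


def unavailability(k: int, n: int, host_uptime: float) -> float:
    """Probability that fewer than k of n independent hosts are online.

    The failure terms are summed directly, rather than computing
    1 - availability, so that very small results do not round to zero.
    """
    down = 1.0 - host_uptime
    return sum(
        comb(n, up) * host_uptime**up * down**(n - up)
        for up in range(k)
    )


single_host = unavailability(1, 1, 0.95)     # one host, no redundancy
erasure_coded = unavailability(64, 96, 0.95)  # any 64 of 96 pieces suffice

print(f"single host:  {single_host:.3e}")
print(f"64-of-96:     {erasure_coded:.3e}")
print(f"meets 99.9999%: {erasure_coded < 1e-6}")
```

Under the independence assumption the result is far below the one-in-a-million budget, so any real-world shortfall would have to come from correlated failures rather than from the arithmetic.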


Sounds kinda like the people who thought that bundling a bunch of bad mortgage debt together in slices could get it rated AAA.


It does sound like that! The mistake there was to assume the failure of one mortgage was statistically independent of the failure of another, which is obviously incorrect for many failure scenarios. In this case, it would be similar if all nodes were in, say, the same datacenter. That doesn't appear to be the case, but there may be other dimensions on which the network lacks the required diversity to support the reliability claims (disk vendor and age...? There must be others.)


Your mind is going to be blown when you learn how TCP/IP works.


Or computers for that matter. Deep down it is not really about 1s and 0s, more like thresholds in between 1 and 0.

To me a big part of computer science is abstracting away unreliable details to make them seem reliable.


Not really. Just probability. If you have fully redundant services then ALL of them have to go down to have an outage. Suppose you have 5 copies with 75% uptime each. The probability that all of them are down is 0.25^5 ≈ 0.001

Now of course that assumes they are uncorrelated, but since Sia nodes are distributed across the internet, that's likely as opposed to multiple servers at a few data centers like AWS.

Turns out mortgage bonds tend to be correlated.
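The arithmetic in this comment, as a short sketch. It models full copies that fail independently, as the comment assumes; the function names are illustrative, and the six-nines target is borrowed from the article quote upthread.

```python
def all_down_probability(copies: int, uptime: float) -> float:
    """Chance that every one of `copies` independent replicas is offline."""
    return (1.0 - uptime) ** copies


def copies_needed(uptime: float, target_unavailability: float) -> int:
    """Smallest replica count whose all-down probability meets the target."""
    copies = 1
    while all_down_probability(copies, uptime) > target_unavailability:
        copies += 1
    return copies


five_copies = all_down_probability(5, 0.75)
six_nines = copies_needed(0.75, 1e-6)

print(f"5 copies at 75% uptime: all down with p = {five_copies:.6f}")
print(f"copies needed for 99.9999% at 75% uptime: {six_nines}")
```

If failures are correlated, the exponent no longer applies and the real figure can be much worse, which is the mortgage-bond point.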


They all run the same software stack though, so despite being deployed on diverse hardware, they can’t claim to only have independent failure modes.


True. It would be interesting to analyze what other aspects are correlated.


Reminds me of a saying from the first dot.com crash. (Or at least I heard it then first.)

Tying two bricks together doesn't make them float.


It's a great saying, but aren't all large ships these days made out of components that don't float individually?


No, they are mostly made of air, which floats (Partial sarcasm, as that is what makes them float)


I mean, the concept actually works, but you have to understand what is actually going into the bundled product. There is no reason you couldn't bundle 100 million in mortgages that had a 10% default risk and sell 10 million of that bundle as AAA.

It is when you just start bundling everything and then saying the entire thing is AAA that the trouble starts.


Except the people who put those sub prime mortgages in probably didn’t apply rigorous statistical analysis to their situation.


Redundancy is kind of meaningless unless you have sufficient diversity. If all your replicas are located on the same rack, you're screwed if there is a minor disaster like a sprinkler going off.


This is why the post is a red herring. You cannot use storage if you cannot get to it. Talking about the price of disk space/mth is absolutely useless without talking about the bandwidth required to use it, which means, if you want to go big-boys-math, 95% bandwidth and peering agreement pricing. All the network chatter between hosts for replication/repair/etc., not to mention actually _downloading_ of the data from the Sia network will take that $1.50/TB/mth to $70/TB/mth.

Talking about the disk-space pricing is, IMO, disingenuous. Talk about the math for an all-in use-case of the service. I appreciate what the OP is going for, however, hand-waving around uncomfortable critical items like bandwidth cost for the Sia network is not a good look.


> not to mention actually _downloading_ of the data from the Sia network will take that $1.50/TB/mth to $70/TB/mth

You really need to back that math up.

Here's my math: 10mbps is over 3TB per month at max utilization, so let's say it's good enough for either 2TB throughput with semi-even use, or 1TB throughput if the use is really focused on part of the day.

When buying transit in bulk in the US, Google tells me that 10mbps was very roughly $8.50 a month in 2017, $4.50 a month in 2018, and $2.50 a month in 2019.

Are you expecting the data to be rebuilt every single day or something?
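The transit math in this comment can be reproduced as follows. The monthly prices are the rough figures quoted above; the 30-day month, decimal terabytes, and helper names are assumptions made for the sketch.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumes a 30-day month


def tb_per_month(mbps: float, utilization: float = 1.0) -> float:
    """Decimal terabytes moved in a month over a link of `mbps` megabits/s."""
    bytes_per_second = mbps * 1_000_000 / 8
    return bytes_per_second * SECONDS_PER_MONTH * utilization / 1e12


def cost_per_tb(monthly_price: float, tb_moved: float) -> float:
    """USD per TB actually transferred for a flat monthly transit price."""
    return monthly_price / tb_moved


max_tb = tb_per_month(10)            # 10 mbps at full utilization

# Rough bulk price of 10 mbps per month, as quoted in the comment.
price_by_year = {2017: 8.50, 2018: 4.50, 2019: 2.50}

print(f"10 mbps at 100% utilization: {max_tb:.2f} TB/month")
for year, price in price_by_year.items():
    print(f"{year}: ${cost_per_tb(price, 2.0):.2f}/TB at 2 TB moved, "
          f"${cost_per_tb(price, 1.0):.2f}/TB at 1 TB moved")
```

Even at the pessimistic 1 TB per month of useful throughput, these figures land in single-digit dollars per TB transferred.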


You will be able to boast of 99.9999% uptime up until that sprinkler goes off. And with luck, it could be decades until it does.


OTOH, your object storage can be even cheaper if you don't do Reed-Solomon erasure coding, but use rateless erasure codes. For example, online codes[1] have been used by Amplidata to have more reliable storage with lower overhead. There are downsides however (no partial reads, no mutability, ....)

[1] https://en.wikipedia.org/wiki/Online_codes


[flagged]


You're taking "cutting corners" out of context. He is describing how individual nodes don't need to run at the standards of regular data servers and are hence cheaper, but in aggregate provide a reliable service.

> So, I assume you’re quite young.

So you do you have a technical criticism?


> I didn't read all their math but I expect their final result to be off by a factor of 2-5x.

Can't be more than 2.5 because Backblaze B2 already gives you $5/TB/Mo.


Backblaze is operating at their own economy of scale with dedicated deals with suppliers, custom bare bones hardware, optimized processes, etc. It's always more expensive starting out as a little company for raw hardware and processes until you're big enough and mature enough to get the deals and processes in place.

That is also not the only service that Backblaze offers and wasn't their first. It could be that B2 is simply a way for them to offset their cost for extra capacity and are running it effectively near-cost for them.


Only if all their other customers on the $5/month unlimited backup plan store less than 1TB. But since it’s really aggressive about backing up everything on your computer (no, please don’t back up my steam games), I think they go over that.


Here's the histogram from their AMA: https://i.imgur.com/iVEuwUT.jpg

About a third of the users are under 100GB.

Another third is under 500GB.

13% from 500GB to 1TB.

9% from 1TB to 2TB.

8% from 2TB to 5TB.

5% above that.

And they cited raw costs of over $3.50 per TB. But of course that's with real datacenters and non-shoddy equipment.


Don't forget the vast amount of notebooks with SSDs that have less than 1TB storage to begin with.


Backblaze deduplicates data right? So it doesn’t really cost them anything to back up your steam games.


They can't deduplicate properly encrypted data, that will have to happen client-side. So while they can deduplicate your data (if you have a copy of a picture in multiple places or maybe even a game installed on multiple machines), it won't work across users.


In that case I’d also expect them to skip uploading the data, but everything is uploaded.


they'd probably charge you the same anyway, even if THEY get it at basically free ;)


> Can't be more than 2.5 because Backblaze B2 already gives you $5/TB/Mo.

Well it can be, if they have a lot of inefficiencies. Backblaze could have more experienced engineers who overcame these. I assure you, I can accidentally design a very expensive storage system as I’m not that smart ;)


Who are you?


Scaleway C14 [1] gives the $2/TB/month that the title promises. Now, providing Sia storage space may be more expensive because you have to calculate costs for proof-of-storage (and may not be able to pull off the "cold storage" model, which may also only work in conjunction with a higher-priced low-latency offering), plus potentially more replication than a centralized service uses, but it does indicate that it's probably not completely unrealistic.

[1] https://www.scaleway.com/en/c14-cold-storage/


Backblaze doesn't give you a random read/write like DropBox, and wanting your disk drive back takes some time...


> exabyte-scale storage system

Somewhat of a random question, can you point me to some state of the art research?



Storage systems of this scale are these days almost exclusively built around object-based storage rather than "legacy" block- or file-backed solutions. I guess today, min.io would be the way to go. (to "go".. little pun on the end:)


What does object-based mean? How is an object different from a file (which I presumed was a collection of blocks)?


It means you access your object through a GUID. Think about it like parking your own car vs. a valet. When you park your own car you need to know the address of the garage you parked in, the floor you were on, and the spot you were in. When you valet park, you hand the attendant a ticket and he brings your car back.

With a standard fileshare, you need to walk the filesystem to retrieve your file - this incurs a ton of metadata overhead. It also means when you've got potentially billions of files in a directory, it can be slooooowww. All the metadata requests also make it very chatty - so doing it over a WAN link tends to be extremely painful if it works at all. Newer versions of SMB and NFS have done a lot to batch the metadata requests but they are still protocols meant to happen at extremely low latency inside a datacenter.


Some object stores do this, but aws S3 for example does not. You can list the contents of buckets, nicely sorted by name. You can mimic directory structures if you want.

However, you touched a key point: object stores are all about throughput, not latency. You can store at a GB/s (if you have the pipes), but even checking if an object exists will cost you a few milliseconds.


Got it. Thanks for explaining.


My guess: No random write access to objects, you can (at best) append-only but often you can only append until the object is finalized, and cannot read it until it is finalized.


It's like a key-value store (or a dictionary). However, the values are objects (big blobs of data). This means you can't update parts of an object without rewriting the whole blob. However, most object stores offer metadata operations (move, tag, ...), concatenation of n objects into 1, and partial reads.
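
To make the key-value picture concrete, here is a toy in-memory sketch. The class and method names (`ToyObjectStore`, `put`, `get`, `list_keys`, `concat`) are made up for illustration and are not the API of S3, min.io, or any real object store; it only mirrors the properties described above: whole-blob writes, partial reads, prefix listing, and concatenation.

```python
from typing import Optional


class ToyObjectStore:
    """In-memory sketch of an object store: flat keys mapped to whole blobs."""

    def __init__(self):
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        # Whole-object write: replaces any previous blob stored under this key.
        # There is deliberately no method for patching bytes in place.
        self._blobs[key] = bytes(data)

    def get(self, key: str, start: int = 0, end: Optional[int] = None) -> bytes:
        # Partial (ranged) read: returns bytes [start, end) of the blob.
        return self._blobs[key][start:end]

    def list_keys(self, prefix: str = "") -> list:
        # The namespace is flat; "directories" are just shared key prefixes.
        return sorted(k for k in self._blobs if k.startswith(prefix))

    def concat(self, dest: str, sources: list) -> None:
        # Build one new object out of n existing ones.
        self.put(dest, b"".join(self._blobs[s] for s in sources))


store = ToyObjectStore()
store.put("movies/a.part1", b"hello ")
store.put("movies/a.part2", b"world")
store.concat("movies/a", ["movies/a.part1", "movies/a.part2"])
print(store.get("movies/a", 0, 5))
print(store.list_keys("movies/"))
```

Changing one byte of `movies/a` here means calling `put` again with the full blob, which is the "no partial update" limitation in a nutshell.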


Google's clusters are all block and file based (depending on what layer you want to use them at)


I am not the op nor an expert on this, but I think https://ceph.io/ targets clusters of this scale.


Yes, with croit.io you can manage Ceph clusters with ease, reducing labor costs and increasing reliability.

We build clusters at low-PB scale whose TCO, covering everything from labor to hardware, from financing to electricity, from routers to cables, comes in below 3€/TB. For that you can store data as block (RBD, iSCSI), object (S3, Swift), or file storage (CephFS, NFS, SMB), highly available on Supermicro hardware in any datacenter worldwide.

Feel free to contact us or use our free community edition to start your own cluster.


Storage can be far cheaper when decentralized. Sending data over the Atlantic is super expensive compared to LAN networking. Almost all content providers peer with ISPs with onsite hardware. But why stop there? Put the "racks" in people's basements. Data storage is very compact nowadays; you can probably fit 100 TB in a shoebox.


that'd need trust that they won't run off with your drives, their basement won't get flooded, there aren't power outages often, they have good protections against surges and whatnot... which at ANY datacenter, is automatically included ;)


You need to trust that not too many of them will have that happen at the same time.

Which is not that hard to do.

And all those protections in a datacenter are far from free.


>I didn't read all their math but I expect their final result to be off by a factor of 2-5x.

I looked at their parts list and it's obvious they aren't serious. CPU is missing, memory is missing, SAS to SATA cables, but no SAS controller, no mounting for the system board. Low effort at best.


It seems you’re getting heavily downvoted. I think it’s because CPU and memory are not missing.


That wasn't on that page when I opened it, you think I'd miss that? They were reading the comments here and updated the page after I wrote the comment.

The price was $4700 something, now it's $4945.

They simply smooth over it with this:

"So we will be using a rig cost of $4500 in our spreadsheet."

That way their overall math doesn't change. Should get banned for doing things like that. These guys are up to no good.


We have done this calculation: even if you put your gear into Equinix/Digital Realty in the most expensive places and use a Backblaze-type setup (which is not optimized and buys at retail), bringing 10Gbit to every 4U, the price for double writes on 5TB disks is $10/year per TB.


> Hard drives are a surprisingly low percentage of the cost of a storage system

THIS! There's no use of a 2TB storage if you can't upload/download this amount each month


Many businesses have TB of data they upload and keep for several years, looking at it once or twice over that time if ever. For these you could provision monthly bandwidth as low as 1/30th of the total storage.


It says right there in the article that bandwidth is charged for separately. So you just buy a line of appropriate size, based on actual usage measurements. And each one of these rigs only needs a gigabit connection to upload/download its full capacity each month, which means the network equipment costs can be minimal.
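
A quick back-of-the-envelope check of the gigabit claim. This is only a sketch: it assumes the link is saturated around the clock for a 30-day month and ignores protocol overhead, and the rig capacity it gets compared against has to come from the article's own parts list.

```python
def monthly_transfer_tb(link_gbps: float, days: float = 30.0) -> float:
    """Decimal terabytes moved by a link running flat out for `days` days."""
    bytes_per_second = link_gbps * 1e9 / 8   # bits per second -> bytes per second
    seconds = days * 24 * 3600
    return bytes_per_second * seconds / 1e12


print(f"{monthly_transfer_tb(1.0):.0f} TB per month at 1 Gbps")
```

That works out to roughly 324 TB per month for a saturated 1 Gbps link, so the claim holds for any rig whose capacity is below that.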


have you ever heard of backups and archiving? ;) there are lots of scenarios when you'd want to write some amount, but not read it back for a while, if ever...


I agree with you. Not 2-5x but they are rounding down on costs and optimistic on risks.


I work in telecom/datacenter infrastructure and this is fanciful. The whole way they take the wattage load of one machine and then hand-wave away all of the rest of the costs of either building and running a datacenter, or paying ongoing monthly colocation costs... is just scary. I truly don't mean to offend anyone but this looks like a bunch of enthusiastic dilettantes.

Generators?

UPS?

Cooling costs?

Square footage costs for the real estate itself?

Security and staffing?

At the scale they intend to accomplish they will need at minimum several hundred kilowatts of datacenter space. Even assuming somewhere with a very low kWh cost of electricity, that much space for bare metal things isn't cheap. Go price a lot of square footage and 300kW of equipment load in Quincy, WA or anywhere else comparable, the monthly recurring dollar figure will be quite high.

And all of that is before you even start to look into network costs to build a serious IP network and interconnect with transits and peers.


They're not talking about a datacenter. Datacenters need to be reliable. Sia storage pools don't, because security and reliability is achieved at the global network level, not at the level of individual systems or storage pools. 95% reliability means you can be down for two whole weeks out of every year and still be well within acceptable uptime requirements.

Generators? Who needs those? Just wait for the power to come back on. UPS? Why bother? Square footage? Stick some wooden shelves in the cheapest building possible. Cooling? Locate in a cold climate and buy some window fans.

This isn't anything like the sort of infrastructure you're used to dealing with. Think Bitcoin mining farm, not Backblaze datacenter. Any corners that can be cut will be.


Who are the customers for such unreliable systems?


Sia is very reliable from the customer's perspective. It's only individual systems and storage pools that have lax uptime requirements. Thanks to some clever network-level redundancy mechanisms (10-of-30 redundancy), 95% uptime at the storage pool level translates to 99.9999% uptime from the customer's perspective. See the "Uptime Math" section in the OP for details.
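
A rough way to sanity-check the 10-of-30 figure, assuming every host is independently online 95% of the time. That independence is a simplification: the OP's "Uptime Math" may model things differently, and correlated outages (same site, same software bug) are not captured. The function name is made up for this sketch.

```python
from math import comb


def unavailability(k: int, n: int, host_uptime: float) -> float:
    """P(fewer than k of n hosts are online), assuming independent hosts."""
    p = host_uptime
    # Sum the binomial probabilities of exactly 0, 1, ..., k-1 hosts being up.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


u = unavailability(10, 30, 0.95)
print(f"chance a file is unreachable at a given moment: {u:.2e}")
```

Under these assumptions the result is many orders of magnitude below the 0.0001% that "99.9999%" allows, which suggests the quoted figure is limited by something other than independent host downtime.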


When you say Sia is reliable, do you mean it could be reliable if hypothetical X, Y, and Z things happen?

Because according to the homepage (sia.tech) there are only 895 hosts storing a total of 206TB right now, which is a very, very small amount. Backblaze, a relatively small player as compared to the big cloud providers has 1.1 million TB of raw capacity as of last year (redundancy reduces the available capacity, but still) [0].

[0] https://www.backblaze.com/blog/hard-drive-stats-for-2019/


By reliable in this context I mean robust against hardware failure. (Including the failure of entire storage sites.) The OP explains the math and associated assumptions for how they derived that "99.9999%" figure, and acknowledges that since the calculated chance of data loss due to hardware failure is so infinitesimally small, other failure modes outside of what they modeled are likely to dominate.

As for the relatively small number of hosts at present, 895 is more than enough for 10-of-30 redundancy to work "as advertised". You really only need 30 hosts technically. The bigger issue I think is the relative immaturity of the software. Sia is still pretty new compared to most other data storage systems; and although I've never heard of any software bugs in Sia resulting in data loss, that doesn't mean such a bug will never be discovered. Be cautious, keep backups, and never rely on any single storage medium to store your data.


That is not an answer to the question above. What type of customer uses this company?


Users of Sia are the customers. So your question reduces to: who uses Sia? It used to mostly be interesting for customers interested in reliable large-file data backup on the cheap, but now it has expanded to customers looking to do content distribution on the web, etc.


> Think Bitcoin mining farm, not Backblaze datacenter. Any corners that can be cut will be.

And yet Sia is about half the cost of Backblaze (i.e. not much savings).

Hard to imagine situations where this is a good trade-off.


It's super interesting to dive into the world of cryptocurrency mining, where some DCs are getting a PUE of 1.1 or better with a buildout that basically amounts to shelves and box fans.

No generators, just eat the downtime. No batteries. No 24/7 staff. No racks, just shelves (folded sheet metal is cheap). Security varies from farm to farm.

These servers don't need to run cool, as long as you are in a climate that doesn't get over 100 degrees you can get away with fans and no AC.


If all you intend to do is generate hashes and run for the exits when your mining equipment is no longer economical, due to difficulty increase, sure.

Reliable data storage for paying customers is a very different thing.


Per-site reliability (beyond a stated/assumed 95%, which is right around the "I put a computer on my desk" level) isn't a design goal, though. You can argue with their math or their assumptions, but you can't say they're wrong for not designing in the reliability features you list above.


Things on my desk are meeting three nines or better of uptime at present, 95% would be 18 days hard down per year.
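
For reference, the conversion from an uptime percentage to yearly downtime can be sketched like this (365-day year assumed; the function name is just for illustration):

```python
def downtime_per_year(uptime: float) -> tuple:
    """Return (days, hours) of downtime per 365-day year for an uptime fraction."""
    hours = (1 - uptime) * 365 * 24
    return hours / 24, hours


for label, uptime in [("95%", 0.95), ("three nines", 0.999)]:
    days, hours = downtime_per_year(uptime)
    print(f"{label}: {days:.2f} days ({hours:.2f} hours) down per year")
```

So 95% is about 18.25 days a year, while three nines is under 9 hours.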


I think it very much depends on the desk. I mean, power alone at my home just barely reaches three nines. Internet connectivity glitches out occasionally. I'll fat finger configurations and reinstall stuff for some more hours a year. It all adds up.

You're right that you can take the same hardware, add a $70 UPS and some thought and care, and do much, much better. But my point was only that "95%" is a trivially achievable goal even in the most naive setups, which makes me treat their analysis with a little more credence than most folks here.


Which should give you even more confidence in Sia's reliability, no?


Crypto currency servers didn’t need to last though. Their primary components were being replaced regularly with the latest generations.

Here they’re depreciating components over 7 years, so will those components last 7 years under those conditions?


I think it's interesting to dive into the economics of 95% uptime. Maybe you don't need full datacenter cooling; you can just locate in a cool climate and have a fan in the window blowing cold air in. If there's a blizzard, you lose your drives because snow blows in and melts on your drives. If the chance of the blizzard * price of drives is less than needing 750W of cooling, then you win. Yeah, sometimes everything shorts out.

Power is similar. Maybe you just use solar and turn the drives off when it's cloudy. With enough distribution throughout the world, it will probably be sunny somewhere.

I haven't done the math and I'm not saying it will work out favorably. I also don't have a use case for 95% availability. (That's two weeks a year where your data is gone!) But it's something that someone with the right needs could consider, and maybe come out ahead of someone shooting for 5 nines and drives that aren't covered in snow.


OVH, the massive hosting company, is located in a former aluminum smelter next to a hydroelectric dam in Quebec. They keep costs as low as absolutely possible, but their operating costs as a whole are still considerably higher than something designed with one nine of uptime.

I get that you're mostly joking by saying "yeah sometimes everything shorts out", but we have electrical codes in the US/Canada for a reason.


In most places you don't need to build to code, as long as you take other steps to mitigate risk.

Saying "this entire building is designed to catch fire, and if it does, it won't do harm to neighbouring buildings, people, or the environment" is probably a good start.


If you're in the developed world, that isn't going to fly. "other buildings won't burn down" isn't going to get you a pass on these things.


Insurance?


> I also don't have a use case for 95% availability. (That's two weeks a year where your data is gone!)

Gone from only one of the places it's stored. Your data is still available even if just 10 out of 30 servers are online.


They're not talking about a real datacenter; they're talking about a deathtrap crypto mine with hard disks instead of GPUs/ASICs. Can't you run a million hard disks off a consumer cable modem? ;-)

At the (slow) rate Sia is growing, I don't think there will ever be enough demand to justify this design anyway.


How much is a terabyte of dropbox storage?


Dropbox is $10/TB/mo, which is the "old" industry standard (and still what most big guys charge). Backblaze is $5, but they charge egress fees, so usage matters. Dropbox doesn't charge for egress, but there are limits on "Public" folder bandwidth (and you can't pay for more if you max out). I don't think there are limits on private content, but I bet there's a soft throttle or some limit that will get you contacted.

There aren't a lot of options for <$5/TB and they all charge egress. I've tried Backblaze, pCloud, and DO Spaces. My specific use case is storing full HD movies for streaming via Plex, which requires a fairly reliable 6Gbit/second, and most of them can't keep up. DO Spaces did the best of all the ones I tried, so I'm using that for now, but it's just skating the edge of usability.


Surely you mean Mbit, unless you’re streaming uncompressed video for some reason! :)


Although it's certainly a typo, ~6 Mbits a second isn't enough for more than poor quality 1080p video, roughly equivalent to streaming from Netflix or Hulu. If you were streaming a backup of a UHD Bluray, for example, the bitrate could be over 100 Mbps.


Indeed. HD streaming seems to need about 5.4Mbps reliably. Spaces is working well enough, but sometimes I have to pause for ten minutes to buffer enough for the rest of the film. It usually works, but with variable bitrate it must spike at times.


I'd love to see someone try running a TV network off of DO spaces :)


Did you try GSuite Business account with 5 accounts? Google offers unlimited cloud storage if you get more than 5 accounts (5*12 = 60 USD/month and no egress)


It's 'unlimited' until you use it too much, and then all your google accounts will get perma-banned...


I've read this as well, so I didn't try. I also don't want to spend that much. I have 120 movies in Spaces and it only costs me about $8/mo. My egress is under the free tier so far, and overages amount to about four cents per movie stream.
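
The per-stream figure is easy to sanity-check. This sketch takes the 5.4 Mbps bitrate mentioned upthread and assumes a two-hour film and an overage rate of $0.01 per GB; both of those are assumptions for illustration, not quoted prices.

```python
def stream_size_gb(bitrate_mbps: float, minutes: float) -> float:
    """Decimal gigabytes transferred when streaming at a constant bitrate."""
    bits = bitrate_mbps * 1e6 * minutes * 60
    return bits / 8 / 1e9


def egress_cost_usd(size_gb: float, usd_per_gb: float) -> float:
    """Cost of sending `size_gb` out at a flat per-GB rate."""
    return size_gb * usd_per_gb


ASSUMED_USD_PER_GB = 0.01   # hypothetical overage rate
ASSUMED_MINUTES = 120       # hypothetical film length

size = stream_size_gb(5.4, ASSUMED_MINUTES)
print(f"{size:.2f} GB per stream, ${egress_cost_usd(size, ASSUMED_USD_PER_GB):.3f} egress")
```

With those inputs a stream is about 4.86 GB and costs a little under five cents, in the same ballpark as the figure above. The same function puts a two-hour 100 Mbps UHD stream at about 90 GB.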


Any evidence of that? Isn't it likely that Google has plenty of commercial customers using this service with large amounts of data?


For $6 per month (paid yearly) you get 1TB of OneDrive + Office (desktop applications). My math may be off, but if you buy Office every 3 years for $200 then you're already even... and the 1TB of storage is basically free.


You could also say, MS Office is virtually free for the 1 TB these days.


I should add that places like Backblaze and pCloud and DO Spaces and others charge per GB stored, so if you're using 1.3TB it's prorated accordingly.


0. You can only get 2 TB+


It sounds like they're intentionally forgoing backup power and UPSs, as well as 24/7 staffing. However, you're right that this is probably pretty optimistic.


If you’re paying more than $300k/mo for 300kw of power in Quincy the DC sales guy probably bought your purchasing people a boat.


I don’t really agree. Retail Datacenter pricing all in should be well under $500/kW; half of what they suggested.
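
Putting the two figures in this subthread on the same footing (a trivial sketch; the function name is made up):

```python
def usd_per_kw_month(monthly_bill_usd: float, load_kw: float) -> float:
    """All-in colocation price expressed per kW of load per month."""
    return monthly_bill_usd / load_kw


parent_rate = usd_per_kw_month(300_000, 300)   # the $300k/mo for 300 kW ceiling
reply_rate = 500                               # the "well under $500/kW" claim
print(parent_rate, reply_rate * 300)
```

The $300k/mo ceiling is $1000 per kW per month, and $500/kW would put the same 300 kW at $150k per month, hence "half".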


In 2018, I spent about six weeks running a series of tests to measure Sia's real world costs. At that time, storage cost ~$4.50/TB on Sia to back up large real world files (backups of DVDs and Blu-Rays).[0] Community members have re-run my tests every few months, most recently in October 2019, when the cost was measured at $1.31/TB, though it's worth noting that recent tests use synthetic data optimized to minimize Sia's cost.[1] It's also unclear how much the market value of Sia's utility token affects these costs, as the price of Siacoin has fallen by ~80% since I conducted my original set of tests.

The calculations in today's blog post account for the labor cost of assembling hardware, but leave out major other labor costs:

1. You need an SRE to keep the servers online. Sia pushes out updates every few months, and the network penalizes you if you don't upgrade to the latest version. In addition, to optimize costs, you need to adjust your node's pricing in response to changes in the market.

2. You need a compliance officer to handle takedown requests. Since Sia allows anyone to upload data to your server without proving their identity, there's nothing stopping anyone from uploading illegal data to the network. If Sia reached the point where people are building $4k hosting rigs, then it's safe to assume clients would also be using Sia to store illegal data. When law enforcement identifies illegal data, they would send takedown notices to all hosts who are storing copies of it, and those hosts would need someone available to process those takedowns quickly.

[0] https://blog.spaceduck.io/load-test-wrapup/

[1] https://siastats.info/benchmarking


Another cost associated with your 2nd point is the collateral a host would have to burn to comply with take-down notices.


I'm going through Sia's website now. It seems this article is meant to bolster the claim on their website which states "When the Sia network is fully optimized, pricing will fall somewhere around $2/TB/month." [1]

Call me skeptical but it seems that they aren't committing to building out this infrastructure themselves or providing a specific amount of storage at this pricing. They seem to be outlining a potential infrastructure that some enterprising individual (or corporation) could use to provide storage at that price to "renters" within their marketplace.

I guess I'll just wait until someone puts their money where their mouth is. Given that this is a marketplace, the fact that a theoretical setup could be built to provide some service doesn't necessarily guarantee it will be built.

1. https://support.sia.tech/article/thvymhf1ff-about-renting


Since you're bringing up the website... I don't get this marketing strategy. The cryptocurrency angle is just as off putting as telling me as a potential customer that my data will be stored on janky servers in unreliable places, no matter if the uptime is the same. OP even claims they have reliability and experience I'd consider but those aspects sure send signals that make me not want to deal with that stack.

Just looking at the website for Sia I see a bunch of fluffy marketing stuff, fair enough, that's normal these days. But where is the selling point? https://sia.tech/technology tells me my data is stored securely and in a redundant manner, great, just like any storage provider. That is followed by "Renters And Hosts Pay With Siacoin" and talks about payment channels, which links to a wikipedia article and not something that tells me how I would even pay them, not to even talk about how much (I saw the calculator thingy on my way to that site, the messaging is still weird).

The "Getting started" call to action is a similar experience, a bunch of downloads, cool - I don't even know if you're right for me yet. I'm five levels deep into the "Getting started guide" linked there and so far found that I'd apparently have to deal with weird crypto exchanges to pay somebody for this, plus I couldn't use most of my pretty standard tooling anymore (at least not without involving one of those proxy things on the getting started page that cover a few use cases, some of which seem to be operated by others?).


It's live right now though, the community has deployed the infrastructure [1] and the pricing is approximately what they claim [2].

1. https://siastats.info/hosts_network

2. https://siastats.info/storage_pricing


That's the price being paid for the storage, but is it actually covering the cost of providing that storage? Or is it just 100-300 people who thought "Huh, neat, I'll toss a host online and see how it goes" ? I'd lean towards the latter and assume those storage costs are heavily subsidized by a few people satisfying curiosity.


> That means about 2 hours of labor per rig. We’ll call that $50

Does that seem low to anyone else? I don’t really have any background in the area, but 25/hr cost to the company would be less than 20/hr pay for the skilled labor. Other countries are different of course, but in US I could make that much flipping burgers in the right area.


It's outrageously low. They're also fancifully assuming cpu tdp = electrical power, cooling = 0w, and another 0w to the motherboard/network cards. And each box has 1 non-redundant, schmuck-grade $80 psu, as well as a consumer grade mobo. This would never be anywhere near their uptime.


CPU TDP is accounted for, the CPU linked in the blog post draws 65w and that is used in the electricity calculations.

I did realize that I completely forgot about RAM, when I get back to a computer I'll have to make some updates, but it won't materially move the numbers, there's 33% margin of error between the number in the spreadsheet and $2 / TB / Mo.

The $80 PSU is what I could link to from Newegg, I do have experience in industrial electronics, and I know from firsthand experience that you can buy a 10+year PSU at 93% efficiency for well under $80 at 300 watts. At that level, you're going to be able to request all the required cabling as well, which means you're getting a much better price than the $7 per cable linked in the post.

95% uptime means 18 days of downtime per year. Consumer grade PSUs and mobos do much better than that.


You included tdp but tdp /= actual electrical power. Intel approximates it from base clock (practically all consumer motherboards ignore the tdp/boost spec). AMD uses voodoo to calculate tdp, it's a pure marketing number with no basis in power usage. To make matters worse, motherboards typically go nuts with voltage on consumer boards, and power draw can vary wildly depending on what instructions you're using. There are a crap load of x86-64 extensions.

Yes, you forgot ram. And network chips, motherboard power delivery losses, motherboard power usage, cabling losses, etc. I'd guess it would total 50-100w, but feel free to current clamp the psu rails to get a realistic number.

As for the 95% uptime, I agree with you. I wasn't considering how much breathing room that actually provides, I was just going with my instincts.


Assembling computers and replacing drives is generally speaking unskilled labor, right there next to factory workers.


Right, the people putting together servers for Dell or HP certainly are not getting paid $20 or $25/hr. If your standard of quality is "crypto shitbox", even $10/hr seems more than enough. This is assuming that many of these storage farms are going to be in places like Vietnam or Romania, not the US or Germany.


If you're from the USA, perhaps just ×2 any price you read if that helps you get through the article. Work that needs doing doesn't always need to be done in high-income/high CoL places.

I earn about €26/h before taxes in western Europe, an income which lets me live in relative luxury (not "private jet" luxury, but I literally do anything I want and still save more than a third of my income with a 36-hour work week), and that's for security consultancy which is way more specialised than the job you're talking about. I think it's also above the national average, but I don't have the statistics on hand. Not sure what the cost to the company is, I think they put in another hundred a month for health insurance or pension or something (they pay 50% and I pay 50%, though I don't see why keeping 50% off my payslip helps anyone, an employer will just deduct that from the salary they can offer) plus some overhead for accounting and whatever, but it's probably not that far off.


I don't think they mean engineer time here. It's assembling the servers, so technicians, and yeah, I guess like most people they aren't paid like in-demand software engineers. This is how most people have to get by!


In some developed countries you can get qualified labor that cheap. For example in Southern or Eastern Europe. Not to mention developing countries of course.


Not to mention, this work isn't technical in nature. It's very quick to pick up, probably more so than landscaping tbh.


There is way too much hand-waving and assuming going on in this article. It is a load of BS that does not take into account real-world inefficiencies, e.g. sometimes buying in bulk is more expensive than buying at retail, esp. when you need consistent supply. Sure, you may need only an hour of sysadmin time a day, but what sysadmin will let you employ them an hour a day? The buildout did not list a CPU. The assumptions about uptime are over-amortized: outages, given the resources they quote, may average out to 95% uptime, but their latency for getting systems back up is going to be absolutely terrible and I’d be surprised if outages were shorter than a day or two on average. They aren’t factoring in cooling. They aren’t factoring in the drastically reduced lifetime of drives in their ridiculously cramped and under-ventilated cubbies. They are completely ignoring diagnostic time, presuming they can only quote actual repair times, which is an absolute joke given the lack of smart hardware and enterprise DC management. They think they can average out throughput over the number of drives, not taking into account per-channel limitations. They are not taking into account the extra time to build and dismantle systems in their hacked-together IKEA shelves. They are underestimating the costs of electricity at commercial rates. I could go on and on, but suffice to say that I would never, ever use their network for any purpose without another backup (which they don’t factor into their costs, of course ;). I thought B2 was risky; this is taking it to an entirely different level.


> The buildout did not list a CPU.

It did, or does now anyway, the RYZEN 3 1200 for $95.

EDIT: Although the better option is the 3200G so that you can actually get a display output from the thing. Same price, so it doesn't really change anything, but it does cut the CPU core count down a bit if that matters at all.

That said the buildout still doesn't work because you can't actually plug the "sata splitter" cable they linked into the motherboard. Because the splitter was actually a 4-lane SAS SFF-8087 breakout cable, and there's no consumer motherboard with 8x of those connectors on it. Good luck finding even 1 or 2 of those connectors on a consumer board, and it sure as hell won't be at dirt cheap prices.

So you either need 4x the computers they calculated, or you need to budget for add-in SATA/SAS controller cards. Which, because they aren't used in consumer land, are not cheap. You could go used, but that's still going to increase the bottom line (and won't be a reliable source of parts)

They also aren't factoring in assembly time nor budgeting for that. Building these isn't going to go very quickly.


I have gotten away with cheap PCIex1 2xSATA2 adapter cards, for roughly 10-15USD at the time of purchase. They did work, but this assumes a motherboard with room for lots of PCIe cards.

Edit: to clarify on the CPU usage, could a potential build also get away with a cheap AMD Athlon 3000G?


Yeah, that cpu wasn’t listed at the time. It’s clear that this is a thought experiment and nowhere near worth the attention it is getting.


I feel like Backblaze has already done most of this and has it in production [1]. Whereas this is just a back-of-the-napkin calculation.

[1] https://www.backblaze.com/b2/storage-pod.html


Backblaze has tried to make their datacenter as efficient as possible, and still only ends up hitting $5/tb/mo for their b2 service, as a point of reference.


>and still only ends up hitting $5/tb/mo for their b2 service

$5/TB/mo was the B2 price with profits, better depreciation (Backblaze replaces drives more often), and a faster connection.

$2/TB/mo was napkin math with ~10% gross profit.


Backblaze is very good but they are definitely not efficient in $$ utilization.

Efficient $$ utilization is bread racks, built-out data centers abandoned by the likes of Pep Boys that landlords will part with for $3/sq ft per year, and Google using servers without cases and Velcro to keep hard drives attached.


Abandoned pepboys stores don't usually have very good fiber connectivity. Backblaze and similar hosting/storage companies move enough traffic that they need to be topologically close to major IX points.

If all you want is cheap commercial real estate with cheap dollars per square foot figures, there are plenty of economically depressed areas within the United States that you could put things. Those areas usually have very poor fibre connectivity, fibre diversity, and choice of carriers.

I have previously explained this to a number of people who asked me, basically, why don't all of these gigantic abandoned shopping malls get converted into Data center space? Two reasons: poor connectivity, and nowhere near enough electrical grid feed capacity (as proper three phase service) in terms of watts per square foot. Bulldozing empty land in Quincy and putting up a tilt-up concrete on slab dedicated purpose datacenter structure is much less costly than extensively retrofitting abandoned, 30, 40, 50 year old commercial real estate.


> Abandoned pepboys stores don't usually have very good fiber connectivity. Backblaze and similar hosting/storage companies move enough traffic that they need to be topologically close to major IX points.

Not stores. Data centers.

It is typically cheaper to get long distance fiber links than metro fiber and midsize data centers do not consume that much power, power which is plentiful outside the major metros, especially in the old manufacturing areas.

The real reason why companies do not go there is because it is not sexy and non-sexy places do not get "ninja" employees that would be passing brain teasers on a whiteboard.

> Bulldozing empty land in Quincy and putting up a tilt-up concrete on slab dedicated purpose datacenter structure is much less costly than extensively retrofitting abandoned, 30, 40, 50 year old commercial real estate.

If you are doing it in a major metro then unless you get a big fat tax break your real estate taxes are going to kill you.


> Not stores. Data centers.

Honestly, it doesn't make much difference. Datacenters built for random corporations often only have connectivity to the local ILEC or MSO which is going to get you pretty poor pricing.


All of them are on net and all of them have fiber already to the premises. Most of them have not just local loops but termination point for long distance carriers.


I didn't expect to see Pep Boys as an example. Do/did they have their own data centers?


I toured three in the late 2000s, about 20-25 minutes away from Philadelphia in different directions. I was told if we wanted to go further away there were dozens.

There's a whole market of its own for 15-30k sq ft built-out data processing facilities. Is it something that competes with Equinix? Not at all, but it is definitely competitive with 3rd tier colos at the major carrier hotels at a fraction of the cost.


According to your link, backblaze hardware costs are around 10x cheaper than OP's estimate. $2.56 per TB (based on .05 per GB) vs $25 per TB.


Backblaze has not only purchase costs but also considerable ongoing operational costs for salaries, network, datacenter.


To be fair, it’s an extremely large napkin.


One interesting point of reference is that Backblaze currently charges $5 / TB / Mo. Assuming they haven't changed their profit margin of 50% from 2017 (https://www.backblaze.com/blog/cost-of-cloud-storage/), this would imply that they have a direct cost of roughly $2.5 / TB / Mo.


Top of Hacker News and there's nothing clickable above the fold that takes me to the SIA website.

Content marketers and technical marketers - don't miss the opportunity on Medium and other platforms to at the VERY LEAST link to your homepage in the first section.

In fact, what is at the top of this awesome piece of content marketing is a "Sign Up" button for Medium . . .


I ended up at https://github.com/NebulousLabs/Sia and there's no activity in the last two years, the latest issues are a few "you broke my wallet with the update and my password doesn't work" from 2018.


Yeah, they moved to gitlab: https://gitlab.com/NebulousLabs/Sia


Thanks


They’re on GitLab now.


I just removed the word blog so there was no subdomain, and it worked. I mostly agree with what you're saying, but don't neglect the URL as part of the user interface. The information was there and a click away.


The word "Sia" in the first sentence links to their website


I've been using Sia for about three months to backup some personal files. Nothing crazy, but it seems to work well.

I'm looking forward to seeing this project mature as well as have some more layers built on top of it moving forward. I really wish the client offered synchronization or access across multiple devices. For now you have to try third party layers on top of Sia to accomplish this.


> I'm looking forward to seeing this project mature as well as have some more layers built on top of it moving forward. I really wish the client offered synchronization or access across multiple devices. For now you have to try third party layers on top of Sia to accomplish this.

Yea I'd actually pick it up now and give it a try if it had this feature.


[flagged]


This isn't a new project. It's been around for many years.


Really smart people make this mistake a lot, so I'm wondering what Sia is doing to decorrelate failure rates. If hedge fund quants can turn mortgage tranches into a machine for massive correlated economic losses, can blockchain quants turn storage tranches into a machine for massive correlated storage losses?

Or if one of the major hyperscalers or datacenter operators decides to start selling storage to Sia, it seems likely that their control plane across datacenters could result in correlated failures. A networking outage for their AS could result in multiple datacenters appearing offline concurrently, for example.


This analysis entirely omits the cost of a sysadmin to manage the storage servers. Even if sia is assumed to do almost everything, and even if we only want 95% uptime, you still need someone to deal with software updates, hard drive monitoring, etc etc.

The profit of $570/year/box is not enough to pay a part-time sysadmin and still have any useful profit.


>If we assume that the 30 hosts go offline independently

I wonder how reasonable this assumption really is. For regular CPU-bound crypto-mining we see that it tends to centralize geographically in zones where electricity, workforce and real-estate space to build a datacenter are cheap.

Assuming that Sia ends up following a similar distribution, it wouldn't be surprising if several of these hosts ended up sharing a single point of failure.
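To put rough numbers on that worry, here is a minimal sketch. The 10-of-30 scheme and 95% uptime come from the article; the idea that a block of hosts sits behind one shared point of failure, and that this block fails as a unit with the same 95% uptime, is purely an assumption for illustration.

```python
from math import comb

def prob_unavailable(n, k, p_up):
    """P(fewer than k of n independent hosts are online)."""
    return sum(comb(n, i) * p_up**i * (1 - p_up)**(n - i) for i in range(k))

def prob_unavailable_shared(n, k, p_up, shared):
    """Same, but `shared` of the n hosts go up or down together.

    Hypothetical model: the shared group is online with probability p_up,
    and the remaining hosts stay independent.
    """
    rest = n - shared
    # Group online: only max(k - shared, 0) of the remaining hosts are needed.
    if_group_up = prob_unavailable(rest, max(k - shared, 0), p_up)
    # Group offline: all k pieces must come from the remaining hosts.
    if_group_down = prob_unavailable(rest, k, p_up)
    return p_up * if_group_up + (1 - p_up) * if_group_down

independent = prob_unavailable(30, 10, 0.95)
clustered = prob_unavailable_shared(30, 10, 0.95, shared=21)
print(f"independent hosts: {independent:.3e}")
print(f"21 hosts behind one failure point: {clustered:.3e}")
```

With 21 of the 30 hosts behind one failure point, the file is unreachable whenever that one point is down, so the result is dominated by the 5% downtime of the shared facility rather than by the tiny independent-failure tail.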

Beyond that, if only copying stuff around three times to provide tolerance is enough to lower the costs to $2/TB/Mo, why aren't centralized commercial offerings already offering something like that? Just pool three datacenters with 95+% uptime around the world and you should get the same numbers without the overhead of the decentralized solution, no? Surely the overhead of accounting for hosts going offline and redistributing the chunks alone must be very non-trivial. With a centralized, trusted solution it would be much simpler to deal with.

Or is the real catch that Sia has very high latency?


I'm guessing there's not a lot of 95% datacenters that don't have heavy generators or UPS on site. You'd have to basically build a datacenter that has lower guarantees.


Wait, how are they connecting 32 drives to that motherboard? They seem to be implying they are splitting each SATA plug 4 ways, which as far as I know is impossible.

The adapter they're linking to is SFF-8087 to 4x SATA, not SATA to 4x SATA (which shouldn't exist). That motherboard doesn't have SFF-8087, it has 8 SATA3 connections.

Unless I've missed something big, SFF-8087 cannot be plugged into SATA3.


It cannot. I assume there’s an HBA thrown in the mix somewhere they did not mention.


I don't think it is correct to say that the only options are "host failures are truly independent" or "world war three".

The hosts are not ever going to be fully independent. There will be hundreds, if not thousands, of hosts co-located in the same location -- likely of the cheapest grade, without any extras like fire alarms or halon extinguishers or redundant power feeds. A single fire (flood, broken power station) has a chance of taking out thousands of hosts simultaneously.

And there is the management system as well -- AWS has thousands of engineers working on security. Will there be one at this super-cheap farm? What are the chances there will be farms with default passwords and password-less VNC connections? And since machines are likely to be cloned, any compromise affects thousands of hosts.

... and all of those things are made worse by the fact that if you store hundreds of thousands of files, your failure probability rises significantly. If a data center burns down, at least a few of your files may be unlucky enough to be lost.


at a minimum the facility will need some power conditioning and/or insurance. you don't want a brief power surge to eat all of your capital, and lockup fees, in one go.

> For a 32 HDD system, you expect about 5 drives to fail per year. This takes time to repair and you will need on-site staff (just not 24/7). To account for these costs, we will budget $50 per year per rig.

will you not also lose 6TB (times utilization) of your lockup every time a drive dies?

> 8x 4 way SATA data splitters

you've linked to SAS breakout cables. they don't plug into SATA ports, they plug into SFF-8087 SAS ports.

they cannot plug into the motherboard you've listed. nor have I ever seen one listed for retail sale that has 8 SFF-8087 ports.

the cheapest way to get 8 SFF-8087 ports is with some SAS expander card, and a SAS HBA. even scraping off eBay that's another $50 per host, and two more components to fail.

there are also actual SATA expanders out there, but they last about 3 months before catastrophic failure in my experience.


Isn't a potential problem with "SATA splitters" also that all disks will share the same channel and therefore end up with worse performance? (Though I guess it won't make a difference for mechanical drives)


any of the expander (SATA, or SAS) things, yes, will be sharing bandwidth. but as you mention, it won't be a limiting factor for mechanical drives. and considering the latency involved in this sort of retrieval, probably isn't a problem regardless.

FWIW the break out cable they've listed is splitting up a connector that has 4 electrical channels onto 4 physically separate cables, so there's no problem with it. they just don't have anywhere to plug it in.


In addition to the SATA port multipliers, they'd need actual SATA PCIe cards. Basically nobody makes a motherboard with onboard SATA that supports port multipliers.


Big deal. I charge $5 per TB per month and I'm not even trying to be cheap.

The economies of scale should make this much less expensive. Colocating your own machine in a real datacenter and hosting your own data shouldn't still be cheaper than practically all of "the cloud" offerings, but it is. What does that tell you about "the cloud"? It's marketing bullshit.

Sure, it's fine for occasional use, but anyone using the hell out of "the cloud" can easily save money by using anything else.


That’s not really surprising, but people tend to forget about it. In the end somebody has to pay for ops. It’s business as usual, like it was a century ago.

There are cases where you can indeed save money by doing more by yourself. But how much time does it cost you and how much is your time worth?

How much time do you need to research, purchase, and eventually build your hardware? How much time do you need to get a decent data center deal? How much time do you need to bootstrap your setup? How much time do you need to regularly maintain your infrastructure?


Those are all just "cloud" talking points.

My time is worth a tremendous amount to me, which means I want to use my own hardware. "The cloud" does not guarantee reliability.

Any company that does any project that even slightly regularly requires compute / storage can easily justify the time to do all the things you mentioned.

The fact that many companies have gone towards "the cloud" goes hand-in-hand with the fact that many companies use Windows. It's clearly not the best thing to use to get things done, but the IT people don't want to reduce their importance and the management people like the kickbacks and perks they get from buying certain things from certain companies.

The savings look good on paper, but the reality is that they're based on leaving out lots of information. I've helped several companies move from "the cloud" back to good, local compute resources because of the amount of money they were hemorrhaging to "cloud" providers.

For the most part, it's all marketing bullshit.


From their site: [1] Sia is a variant on the Bitcoin protocol that enables decentralized file storage via cryptographic contracts

[1]https://sia.tech/sia.pdf


I don't know anything about the subject, so no idea if these claims are realistic. But whatever, either they deliver or they don't.

My (or their, actually) problem is I don't really get what they are offering right now. There is an impressive landing page with big numbers and pretty pictures which explains pretty much nothing. Project seems to be in production for at least 3 years, there are some apps, but I don't actually see if I can use it to backup/store some data and how much it costs right now. I mean, they say "1TB of files on Sia costs about $1-2 per month" right there on the main page, but it cannot be true, right? It's just what they promise in the hypothetical future, not current price-tag?

The only technical question I'm interested here is why they actually need blockchain? This is always suspicious and I don't remember if I saw any startup at all that actually needs it for things other than hype. It is basically their internal money system to enable actual money exchange between storage providers and their customers, right? So, just a billing system akin to what telecom and ISP companies have? Is it cheaper to implement it on blockchain than by conventional means? How so?


> but it cannot be true, right? It's just what they promise in the hypothetical future, not current price-tag?

Here's the live pricing, right now: https://siastats.info/storage_pricing

> Is it cheaper to implement it on blockchain than by conventional means?

It's more so that anyone can join the network as a host. They don't have to have a financial or business relationship with anyone, they can just provide their storage service and charge for it. No way to do that currently in the world without a blockchain.


> It's more so that anyone can join the network as a host. They don't have to have a financial or business relationship with anyone, they can just provide their storage service and charge for it. No way to do that currently in the world without a blockchain.

Maybe I misunderstand your point, but I could certainly install MinIO (a S3 compatible object store) on a home NAS and charge people for it without using a blockchain. I see your point about not having a financial or business relationship with a blockchain network acting as an intermediary, but I can assure you that the IRS and various law enforcement and regulatory agencies would tell you that you absolutely do have a financial and business relationship with whoever is paying you via the crypto-network whether you'd like to or not.


> I could certainly install MinIO (a S3 compatible object store) on a home NAS and charge people for it

But how would that work? You'd probably make a website or app that had users sign up for an account, and then with that account they could associate payment information from a payment processing company, and then you'd provide them with credentials where they could log in to their Minio instance. Right?

Then, you have to go out and market your service, explain to people why they should use it instead of existing alternatives, convince people that you're trustworthy, build a reputation, and generally do sales.

In the case of Sia, you build your host, plug it in, announce it to the Sia blockchain, and then clients from all around the world start paying to use your storage.

Clients don't have to register for an account first, don't have to involve a third-party payment processing company, and don't need a sales pitch because they algorithmically test, measure, and rank hosts.

I remember at the outset of the web, a new thing was this user demand for services to become "self-serve", as in, you would no longer need to talk to a salesperson and establish a relationship in order to buy something — even something custom. I see this as the next step of that, where you want to be able to programmatically and algorithmically establish and dissolve those kinds of service agreements.


On a related topic, I've had a ton of problems finding a cloud storage system that will reliably handle files around 100-200gb. Does anyone have a recommendation for a service that can handle that file size with ease?


Any object storage system (S3, Backblaze B2, Azure Blob, GCP) should be able to handle those file sizes with proper chunking into smaller parts (limit details below per object store).

S3: https://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.... (Max object size: 5TB, Max single multipart size: 5GB)

Backblaze B2: https://www.backblaze.com/b2/docs/large_files.html (Max file size: 10TB, Max single object size: 5GB)

Azure: https://docs.microsoft.com/en-us/rest/api/storageservices/un... (Max file size: 4.75 TB, Max single block size: 100MB)

Google: https://cloud.google.com/storage/quotas (Max file size: 5TB, doesn't appear there is a lower limit for objects to be composed into a single object, docs could be better in this regard)

@khc: Terminology updated to be more clear for S3
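For the 100-200GB case, the chunking works out roughly as sketched below. The 5GB maximum part size is from the list above; the 10,000-part cap and 5 MiB minimum part size are what I recall of the S3 multipart limits, so treat them as assumptions and check the current docs. `plan_parts` is a made-up helper, not part of any SDK.

```python
import math

MIB = 1024**2
GIB = 1024**3

def plan_parts(file_size, max_part=5 * GIB, min_part=5 * MIB, max_parts=10_000):
    """Pick a part size (whole MiB) and part count for a multipart upload.

    The limits are parameters because they differ per provider; the
    defaults are assumed S3-style values.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    # Smallest part size that keeps the upload within max_parts parts.
    part_size = max(min_part, math.ceil(file_size / max_parts))
    if part_size > max_part:
        raise ValueError("file is too large for these limits")
    # Round up to a whole MiB so every part except the last has the same tidy size.
    part_size = math.ceil(part_size / MIB) * MIB
    return part_size, math.ceil(file_size / part_size)

size, count = plan_parts(200 * GIB)
print(f"{size // MIB} MiB parts, {count} parts")
```

In practice most SDKs and tools such as rclone pick the part size for you; the point is only that a 200GB file fits comfortably inside the per-object limits once it is split.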


I'd drop Microsoft from that list because the request was for something "reliable."

It's reliable enough, if you can get it to Microsoft's cloud. But for the last six months I've struggled putting very large files into Azure, using five different connections from five different providers in three locations. Small files are no problem. But large ones take two, three, or four tries.


I don't disagree Azure is hot garbage, just listing for completeness. S3 and Backblaze B2 are my go-to object stores.


Hi, GCS engineer here. The lower limit on composing objects is one source object, in which case you are not so much composing as you are copying with style. Zero source objects is an error. I will file a note about the docs, thanks.


Thank you so much! Is there a limit on size of objects besides the 5TB max described in the docs, similar to other object stores where multiparts have lower limits than the total composed object?


There is also a lower bound of 0 bytes :)

You can compose a 4TB object with a 1 byte object, or you can compose 32 150GB objects, just so long as the destination object doesn't go over 5 terabytes.


Brilliant! Thank you again!


S3 max single object size is 5TB. The 5GB limit is for single part upload API. You can use the multi-part API to upload single object in multiple parts.


"On a related topic, I've had a ton of problems finding a cloud storage system that will reliably handle files around 100-200gb. Does anyone have a recommendation for a service that can handle that file size with ease?"

rsync.net gives you an empty, ZFS filesystem that you can do anything you like with.[1]

I believe the file-size limit is 16 exbibytes (264 bytes).

rsync.net can also talk to every cloud storage provider[2] because rclone[3] is built into the platform:

    ssh user@rsync.net rclone s3:/some/bucket rsync/home/dir

[1] http://rsync.net

[2] https://www.rsync.net/products/universal.html

[3] https://rclone.org/


Looks like HN ate the * characters in

  2**64

16 exbibytes is 2^64 bytes.


Sia, the system described above, comfortably handles files of 100-200 gb, and with the recent Skynet link also enables filesharing for files up to 400 GB.


Have you tried amazon s3? What is your use case? Can you elaborate? edit:Grammar


So no CPU (or APU, so you don't need a GPU), no RAM, and those breakout cables are actually for SAS, but no SAS card listed in the total. This does not inspire confidence in the project at all.


Interesting article, but "black swan situations like world war three" may be an underestimation. Software bugs are more likely and sometimes fatal.

I wonder why transfer prices are not included. As you explain, every transfer is paid for, so does that mean one has to pay for 10 uploads of every single object? And as equipment ages and peers go out of business, who pays for the data rebalancing transfers?


It's probably feasible to reach these levels of cost. I certainly still keep NAS in two locations because even places like hetzner don't sell lower powered machines with lots of disk space. But the build they specify doesn't have a CPU or RAM and it's using a SAS cable to connect to a SATA motherboard. Depending on the requirements of the platform they may be able to get away with non-ECC RAM, a simple APU to not need a graphics card, and a few cheap SATA PCIe cards to get enough connections. It will probably add ~500$ or ~10% to the build though. I don't know if the other costs have similar issues.


Hetzner's super cheap "cloud" machines allow you to attach "block storage" volumes of whatever size you like. It's not particularly well advertised (searching for this sort of thing takes you to their dedicated storage offerings) but it's there and it's great.


Hetzner's dedicated storage servers cost ~180 $/month for 100 TB storage:

https://www.hetzner.de/dedicated-rootserver/matrix-sx?countr...

That is pretty close to the $2 per TB per month from the article (assuming the same factor 1.5 replication they have planned), while also providing 128 GB ECC RAM, enterprise disks, and 24/7 phone support.
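A quick sketch of that comparison, using only the figures quoted here ($180/month, 100 TB, redundancy factor 1.5). `cost_per_usable_tb` is just an illustrative helper, and it ignores bandwidth and any second site you would want for real redundancy.

```python
def cost_per_usable_tb(monthly_cost, raw_tb, redundancy):
    """Monthly cost per TB of user data when each TB is stored `redundancy` times over."""
    usable_tb = raw_tb / redundancy
    return monthly_cost / usable_tb

raw = cost_per_usable_tb(180, 100, 1.0)      # cost per raw TB
usable = cost_per_usable_tb(180, 100, 1.5)   # cost per TB after 1.5x redundancy
print(f"${raw:.2f} per raw TB, ${usable:.2f} per usable TB")
```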


From what I can see on the website that block storage consists of SSDs priced at ~50€/month/TB which is a non-starter. The cloud instances are incredibly cheap though, I may have a look at those.


Check out their storage box if using their cloud instances. https://www.hetzner.com/storage/storage-box?country=ot

€10 for 2TB and unlimited transfer within the same datacenter (or 10TB if accessing from outside). You will get a 1Gbps or more transfer speed and many options for mounting the storage (SCP, webdav, etc). I used to use them as an intermediate backup location when I used their cloud instances and dedi servers.


That's starting to look quite interesting, thanks! 5TB of rsyncable storage for 22€/month could definitely replace one of my NAS boxes. The 10 concurrent connections are the only big limitation really. Not being able to run my own stuff is also not ideal. But it's the kind of thing I've envisioned to stop using a NAS and just point syncer at:

https://github.com/pedrocr/syncer/


I was expecting an ad based on the title, but it ended up being an interesting analysis of just how much storage ends up costing them with their focused hardware setup.


Honestly, I'm a bit confused on who the targeted audience is for this article. I've been running as a host for Sia for months now. My rig is a raspberry pi and a 10tb external HDD I had laying around.


The target audience is people who are concerned that Sia's long term economic story is fragile, and can't survive beyond what underutilized storage exists today.

The types of farms described in the article are what I imagine Sia to look like 10 years from now, not something we expect to spin up in the next 18 months.


Exactly. I assume they’re targeting the “big guns” that can spend/invest money in order to build a Sia data center and then get a return out of it. I had a raspberry running Sia for a couple of months and made nothing out of it :-( When I originally heard/read about Sia I had the optimistic view that many of us users would power their network, whereas in fact only a few hosts exist (~350). See https://siastats.info for more details.


Do you make anything from it?


For the first few months, you'll basically have all of your Sia drained as your contracts start coming in and your host wallet gives them collateral. But yeah, my contracts only recently started to complete, and I've got the amount of Sia I started with at the very least.

So... Yes, but probably not very much? Particularly when accounting for the price drop in Sia.

Edit: If you do decide to host, it'll likely be about 5-6 months before your contracts start completing if your host settings go by the recommended 26 week max duration. And don't go out buying hardware unless you're in it for the long run, which makes me now realize the point of this article.


Can you share more info on cost breakdown and earnings? Also, how expensive is electricity in your area?


I wouldn't try hosting on Sia if you're trying to make a profit. I spent 20USD on Sia currency. And I didn't go out and purchase a RPi + harddrive for it. You likely wouldn't make a profit if you did. RPi's are great for costing next to nothing in electricity, though.


I worked for a p2p startup 15 years ago. We were exploring ideas and products in this space. We came close to partnering with a company doing distributed cloud storage. Their idea was to allow people to rent storage space in personal computers.

We decided to scrap the plan to do p2p storage, ended up using cloud storage. This p2p storage idea is a tough one. People are not willing to make a few dimes renting out their hard drive or CPU. The economic unit is too small to work. But good luck trying this idea. I wouldn't be surprised if someone tries again in 20 years. :)


The IO operations amplification for 64 of 96 is pretty brutal, and particularly unfavorable in a world where capacity-per-IOPS keeps trending up. I wonder how they'll deal with that.


Each of the 64 pieces is fetched from a different host on the network, meaning all of the IOPS are happening in parallel at network speeds. You aren't going to be doing sub-millisecond updates for sure, but you can easily get under 100ms.

The reed-solomon also isn't even the most computationally expensive part, the computationally expensive part is computing and verifying the Merkle roots. All parts of the system though can go >1gbps on standard CPUs.
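For anyone unfamiliar with the Merkle part, here is a generic sketch of computing a Merkle root over a blob. This is not Sia's actual construction: the 64-byte leaf size, the BLAKE2b hash, the 0x00/0x01 domain prefixes, and the way an odd node is promoted are all assumptions chosen for illustration.

```python
import hashlib

LEAF_SIZE = 64  # bytes per leaf; assumed value for illustration

def _hash(data):
    return hashlib.blake2b(data, digest_size=32).digest()

def merkle_root(data, leaf_size=LEAF_SIZE):
    """Return the root hash of a binary Merkle tree built over `data`."""
    leaves = [data[i:i + leaf_size] for i in range(0, len(data), leaf_size)] or [b""]
    # Prefix leaves and interior nodes differently so one cannot pass for the other.
    level = [_hash(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        nxt = [_hash(b"\x01" + level[i] + level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd node is promoted unchanged
        level = nxt
    return level[0]

sector = bytes(range(256)) * 16          # 4 KiB of sample data
print(merkle_root(sector).hex())
```

The cost being described comes from hashing every leaf of every sector: a host or renter has to touch all the data once to build the tree, whereas verifying a single leaf against a known root only needs a logarithmic number of hashes.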


It's fine from the perspective of a single request but it seems like it reduces the overall throughput of the network. If Sia has, say, 1M hard disks and can do 100M raw IOPS then 10-of-30 gives 10M net IOPS but 64-of-96 gives only 1.5M net IOPS.
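Spelled out, the arithmetic is just raw IOPS divided by the number of pieces each logical read has to touch. The 1M disks and 100M raw IOPS are the hypothetical figures above; the model (every read hits exactly k disks, nothing is cached) is a simplifying assumption.

```python
def net_iops(raw_iops, pieces_per_read):
    """Logical reads per second if every read costs one I/O on each of k disks."""
    return raw_iops / pieces_per_read

RAW = 100_000_000  # hypothetical network-wide raw IOPS

for k, n in [(10, 30), (64, 96)]:
    print(f"{k}-of-{n}: {net_iops(RAW, k):,.0f} net IOPS")
```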


The most important thing not addressed here is demand. Last I checked (granted, this was a while ago) it simply wasn't there, meaning that if you built this rig you might only be able to rent out a small part of it.

If this has changed I would be interested in hearing about it.

One other thing I am not understanding is how this makes financial sense, even if the demand is there. If I am buying a rig for 4500 bucks to get 200TB, making "570 a year in profit" is nowhere near exciting enough. Practically any other use pays more. Renting a dedicated server for a game, web hosting, hell even GPU mining makes more.

(A single 1080 Ti can do about $1 a day in gross revenue on grin/eth/etc, and can be had used for ~400 bucks. Or you can get a P102, which is the mining card version with no display output, for 250 bucks. Payback with power costs etc. is well below the 10 year threshold of Siacoin.)

Now where it might be interesting (IF there is demand), is just adding harddrives to an existing infrastructure already in place. So if you are a GPU miner and have 1000 rigs already in place, just adding a single 4TB harddrive to each machine might not be too bad- They go for about $50 each used and according to this, will pay back $8 a month with minimal extra costs
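Rough payback numbers for the three options, using only the figures in this comment. The GPU figure is gross revenue, so its real payback is longer once power is paid for; `payback_months` is just an illustrative helper.

```python
def payback_months(upfront_cost, monthly_income):
    """Months until cumulative income covers the upfront cost."""
    return upfront_cost / monthly_income

options = {
    "Sia rig ($4500, $570/year profit)": payback_months(4500, 570 / 12),
    "Used 1080 Ti ($400, ~$1/day gross)": payback_months(400, 30),
    "Used 4TB HDD in existing rig ($50, $8/month)": payback_months(50, 8),
}
for name, months in options.items():
    print(f"{name}: {months:.1f} months ({months / 12:.1f} years)")
```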


So I did a quick look and it seems like the total usage of siacoin is not that large.

https://siastats.info/hosts_network Only 710 TB is in use. Or about $20k worth of hardware TOTAL for the entire network, according to the above URL.

Also, why is this a cryptocurrency at all? Wouldn't this business be drastically simplified by simply paying people out and letting people rent space with either USD or bitcoin?


> Also, why is this a cryptocurrency at all?

Do you even need to ask? Because they have minted a bunch of SIA coin and this is their effort to give it value out of thin air, making them all millionaires. Using someone else's currency, despite its obvious benefits, means no huge pre-minted pool under their control = no lambos.

Every single ICO project is like this.


The website seemed okay and even useful until:

"Both renters and hosts use Siacoin, a unique cryptocurrency built on the Sia blockchain. Renters use Siacoin to buy storage capacity from hosts, while hosts deposit Siacoin into each file contract as collateral."


Until payment processing for regular money becomes free, peer-to-peer micropayments can only be done economically with a cryptocurrency. The need for a new unique crypto currency for this project is debatable, but I can understand that the people who work on this project full-time need to earn a paycheck.


See also Tardigrade

https://tardigrade.io/

It is $10/mo/TB, but has different uptime, speed and security characteristics.


How did they go from 8 SATA ports to 32 SATA devices? The linked cable uses a SAS host/connector. Was a SAS card left out of the parts list?


Using "8x 4 way SATA data splitters".

What I don't get is why they don't use 14TB HDDs; they are only 15% more expensive per TB. On the other hand they'd need 2.33x fewer PCs at $550 each, plus their power use.

So instead of every 7 PCs with 6TB HDDs they'd need 3 with 14TB HDDs.

PS: They could also use a mainboard with 10 SATA ports instead of 8. They are only $15 more than the chosen board. Adding one or more PCIe 8x SATA controller cards might also make sense, depending on the average load of a system.
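The 7-versus-3 figure can be checked with a few lines, assuming the article's 32 drives per PC; the 1344 TB target is just a convenient made-up capacity that both configurations fill exactly.

```python
def pcs_needed(total_tb, drive_tb, drives_per_pc=32):
    """Number of PCs needed to hold total_tb of raw capacity (rounded up)."""
    per_pc = drive_tb * drives_per_pc
    return -(-total_tb // per_pc)  # ceiling division on integers

TARGET_TB = 1344  # hypothetical raw capacity
print(pcs_needed(TARGET_TB, 6), "PCs with 6TB drives")
print(pcs_needed(TARGET_TB, 14), "PCs with 14TB drives")
```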


> Using "8x 4 way SATA data splitters".

There's no such thing as a passive SATA data splitter.


it's probably a SAS/Sata controller, and the sas interface split to 4x sata.


They do link to the motherboard they are talking about, and it doesn't have SAS.


They mention 8 pcie slots with 48 pcie lanes... I'm presuming they are filling them with sas/sata controllers.


> 48 PCIe lanes

PCIe lane count depends on CPU support; for AM4 I believe it ranges from 6 lanes of PCIe 3.0 (cheapest AMD Athlon/Ryzen CPUs with integrated GPU) to 24 (PCIe 4.0 for the latest Zen 2 based Ryzen 3000 series).

The CPU they list in the article supports 16 lanes of PCIe 3.0 connectivity + 4 lanes for chipset (storage and other IO). Nowhere near the 48 PCIe lanes you mention, although you could argue that 20+4 lanes of PCIe 4.0 bandwidth is equal to 48 lanes of PCIe 3.0 bandwidth, but this would require a compatible CPU, which would increase the cost by hundreds of dollars.


6TB might be the most reliable at the time the decision was made.


We at croit.io operate Ceph-based storage, including everything from datacenter, power, switches, labor, and licenses, all at a price point of 3€/TB.

No consumer ware


There is one big problem that I've not seen anyone else point out with systems like this. I know because I did the calculation early on with Peergos and came to the conclusion that it doesn't work.

The problem comes when you want to store multiple files. If the corresponding erasure code fragments from different files are not stored on the same server then you don't have correlated failures. Contrast this with a typical raid scheme where a failed drive means the nth erasure fragment of every file is gone - correlated failures. If the failures across different files are not correlated, which is the case if you're storing each new block on a random node, then you are basically guaranteed to lose data if you have enough files. Depending on your scheme, this can happen as low as 1 TiB of data for a user. It is similar to the birthday paradox.

For erasure codes to work for a filesystem you need to have correlated failures.


totally false

ordinary raid has very slow recovery because it’s concentrated on a hot spot of a new drive. plus recovery waits for a new drive to be inserted (double stupid).

when fragment placement is randomized, recovery is widely distributed and can happen in less total time so lower chance of data loss.


I didn't say anything about speed of recovery because it's not relevant. Recovery can't happen at all if enough fragments aren't online. The maths says that with uncorrelated fragment placement, and thus uncorrelated failures, and with enough data, you are basically guaranteed to lose data. Try doing the maths for an entire filesystem, where each file/block is individually erasure coded.


We stored hundreds of petabytes on cheap SATA drives with random fragment placement using reed solomon 6+3 coding (half the space of 3 replicas but same durability). Never lost a byte.

Speed of recovery is crucial, because that’s your window of vulnerability to multiple failures. For example, try RAID 5 on giant drives. Losing a second drive during recovery is very likely.


No need to be rude. EDIT: The offensive part was removed

What was the probability of failure of your drives? My guess is you just didn't hit the threshold for your failure rate. The maths checks out (PhD here). Seriously, do the calculation.


We lost drives all the time. In fact we moved so much data we needed checksums to avoid the 1e-13 undetected data errors.

We seriously did do the calculations (done by serious PhDs) and we seriously did not lose data.

I’m sure you are imagining a system that doesn’t work. But that doesn’t mean only a raidlike setup can work.

And by the way, could you explain how to calculate chance of data loss without taking recovery time into account?


To clarify, the assumptions I'm making for the calculation are:

1) a fixed probability of a server failing

2) a fixed erasure coding scheme used for all files

3) uncorrelated server failures

4) an erasure fragment is stored on a random server


It boils down to the following:

You can calculate a probability L of losing a given file.

Because we've assumed totally uncorrelated failures that means this is the same for all files, and that the probability of losing NO files if you have T files is (1 - L)^T

As you can see, this approaches 0, meaning Pr(losing a file) approaches 1 as T increases.

Take the probability of file loss in Sia, which I would say is too low, but let's ignore that. They get L = 10^-19.

This leads to T = ~10^19 before you expect to lose data. If you're erasure coding on the byte level, then that's 10 exabytes.

I expect your probability of failure is much less than random nodes on a distributed global network of volunteers. so yes, ~petabyte is below the threshold, but there is a threshold.
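For anyone who wants to plug in their own numbers, here is a minimal sketch of the formula above, 1 - (1 - L)^T, under the same assumptions (every file has the same independent loss probability L). It uses log1p/expm1 because with L = 10^-19 the naive expression rounds 1 - L to exactly 1.0 in floating point.

```python
import math

def prob_any_loss(per_file_loss, n_files):
    """P(at least one of n_files is lost) = 1 - (1 - L)^T, computed stably."""
    return -math.expm1(n_files * math.log1p(-per_file_loss))

L = 1e-19  # per-file loss probability quoted for Sia
for t in (1e12, 1e15, 1e19, 1e21):
    print(f"T = {t:.0e}: P(lose something) = {prob_any_loss(L, t):.3g}")
```

At T = 10^19 the chance of having lost at least one file is about 63%, which is the threshold being described; at petabyte-scale file counts it is still negligible for this value of L.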


This tech seems very cool, but with only 200TB stored I worry that it is destined to not pay for its overheads. No big project can survive on a revenue of $20/month!

When will the project grow some mobile apps like Dropbox or Google drive that you can just put a credit card number into, pay a few bucks and know your data is safe?


That rack setup in the article reminds me of Gilfoyle's setup for Anton in the garage from Silicon Valley.


This is pretty incredible both from a product perspective and for its potential to push the whole industry towards a race to the bottom. Equilibrium here pushes storage, processing, and availability towards distributed nodes, unless high availability is required for some unique business case.


Offtopic question: I went through their website, and I have no idea about blockchain. I have gone through the documentation, and almost everything they are doing is possible without it as well. Cryptoproofs for storing data, smart contracts et al.: I'm not sure how this differs from ordinary encrypted deduplication with standard per-hour/per-minute billing. Also, I'm not sure Siacoin is the best way to handle payment. I think I am missing something, and would be glad if someone could point me in the right direction.


You lost me on the homepage (sia.tech): there are only 895 hosts storing a total of 206TB right now, and the infrastructure looks shady.

You aren’t touching my enterprise data, not even the cold storage logs.


For people that use Plex... do you think that it's a good place? If we have several TB, is it cheaper to use your own PC with (let's say) 2x10TB HDD + 2x10TB HDD backup, or is it better to go online today? When I check the price, I feel that it's always more expensive.

For my backup, it's not synced in real time; I do a manual backup every 3 months. I can lose some data but I feel ok with that.


The website sia.tech required a Google captcha challenge for me to even load, clicking through from the article.

So.. that turned me off in an instant.


Where are these datacenters? I live in Ohio in the area of two of the points on the home page at https://sia.tech/ One appears to be a private residence or a farm. The other dot is literally on a golf course fairway, is a private residence, or a power substation.


I wish there was an easy solution that would allow me to plug in an S3 bucket or a virtual drive or whatever and mount it as a partition in my cheap 20 GB VPS.

The price for an additional “drive”, like 5 bucks per month for 50 GB or something, is insane. Especially when compared with Dropbox or OneDrive pricing (or even physical drives sold over the counter).


I have no clue how Google stores my 35TB for $12 a month. They do it at a loss and I guess they don’t really care.


Where are you getting that price? Google Drive is 300$+ for that and GCP archive storage is 140$.

Edit: apparently that's the unlimited storage in GSuite for 12$/month/user as long as you have 5 users or more.


They say you need 5 people, but it's just me and I pay $12.


They make it up with the dozens paying $12/month and only using 8gb.


I expected this title to be about https://www.scaleway.com/en/c14-cold-storage/ which already offers exactly that, Cloud storage for $2 per TB per month.


Sia isn’t cold storage.


I think it's a lot more comparable to cold storage than to anything else. I assume you cannot serve directly from Sia? How long does recovery take?

(Does anyone know how long Scaleway takes in practice? They claim minutes but I haven't used them yet.)


Does any consumer grade motherboard have IPMI* support? When I tried to optimize my server costs one issue I ran into was that colocation providers require IPMI capability, which seems only available in server-grade motherboards.

* IPMI is for remote hardware management


To elaborate, while I like such optimization efforts, for this to work you'd need to run your own datacenter because the _ones I have found_ that offer cheap bandwidth require IPMI (to lower their labor costs).

Or you need to know some providers I don't know, in which case, tell me. :)


I just saw IBM claiming to do 10TB at zero cost on twitter, how are you guys gonna beat that?


> It also turns out that 32 HDDs only consume 200w, so the 750w PSU we picked is more than sufficient.

Yep. Stopped reading right there. HDDs use ~15 watts when they boot up. I experienced this and I never allocate less than 20 watts / hard disk.


Staggered spin-up is a thing. That motherboard doesn't have it most likely, but they also can't plug 32 HDDs into it either. They'd need an HBA card of some kind, which is more likely to have staggered spin-up.
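
A rough power budget makes the two comments above concrete. This is a sketch under stated assumptions: the 200 W running figure and 750 W PSU come from the article quote, the 20 W spin-up allowance from the parent comment, and the group sizes are hypothetical. CPU, motherboard, and fan draw are ignored.

```python
DRIVES = 32
PSU_WATTS = 750
RUNNING_WATTS = 200 / DRIVES  # article's figure: 32 drives draw 200 W in total
SPINUP_WATTS = 20             # per-drive allowance from the parent comment

def peak_watts(group_size: int) -> float:
    """Worst-case drive power when drives spin up `group_size` at a time.

    The worst moment is the final group: it is spinning up while every
    earlier drive is already running.
    """
    already_running = DRIVES - group_size
    return group_size * SPINUP_WATTS + already_running * RUNNING_WATTS

for group in (32, 8, 4):  # 32 at a time means no staggering at all
    print(f"{group:2d} at a time: {peak_watts(group):.0f} W of {PSU_WATTS} W")
```

Under these assumptions, spinning up all 32 drives at once peaks at 640 W for the drives alone, which leaves little of the 750 W for the rest of the system. Staggering in groups of 8 or 4 keeps the peak much closer to the running figure.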


> over the long term, storage will be cheaper than $2 / TB / Mo

OK, but what is it today?


https://siastats.info/storage_pricing

It's less than $2 / TB / Mo today, but relies on a completely different set of economics that don't scale beyond a few hundred PB. This article was aimed at people who understand why Sia is so cheap today, but do not believe that Sia will continue to be cheap as the network scales.


In addition to what other people mentioned, there is also a huge cost in managing all the metadata that you'd get from billions of files. This is really even worse if you're using a crazy 64-32 encoding.


Yes, but with croit.io you can manage Ceph clusters with ease, reducing labor costs and increasing reliability.

We build clusters at low-PB scale whose TCO, covering everything from labor to hardware, from financing to electricity, from routers to cables, comes in below 3€/TB. For that you can store data as block (rbd, iscsi), object (S3, swift) or file storage (CephFS, NFS, SMB), highly available on Supermicro hardware in any datacenter worldwide.

Feel free to contact us or use our free community edition to start your own cluster.


This makes me want to buy storage just because I know exactly how it’s spent.


I was so excited to cancel all my storage subscriptions, but for some reason this article didn’t convince me that that was an option yet.


I wonder what it would take to build a frontend for this that uses other people’s exposed S3 buckets for storage.


So, $48 USD/year for 2TB? That's a little less than idrive, as they charge ~70/year for 2TB.


One thing that bothers me about this system, beyond the hand-waving math, is what would motivate me to give up control over my data being available. If I have no SLA and no ability to convince a bunch of down hosts to come back online with my data, why store it that way at all?


I wouldn't even buy new hardware. I would buy retired hardware for a fraction of the price.


I would imagine that managing such a pool of dated and highly heterogeneous hardware would incur a lot of overhead. The procurement effort may end up being rather high as well.


Yet a number of FAANGs have datacenters full of mismatched lease returns. So it must make sense at some scale.


Looks super interesting. Can anyone who has used it share their experience?


Durability? Availability?


Availability is the big one. Getting high availability makes systems incredibly more complex and expensive.

It's the same reason increasing redundancy and uptime to the next factor of nines is almost logarithmically more expensive.


Increasing redundancy and uptime to the next factor of nines is exponentially* more expensive.
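
For reference, here is what each extra nine means in allowed downtime. This is a sketch of the availability arithmetic only. The function name is made up, and the cost claim itself is the commenters', not something this code demonstrates.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(nines: int) -> float:
    """Allowed downtime per year at `nines` nines of availability (2 means 99%)."""
    return HOURS_PER_YEAR * 10 ** -nines

for n in range(1, 6):
    print(f"{n} nines: {downtime_hours_per_year(n):.3f} hours/year")
```

Each additional nine cuts the downtime budget by a factor of ten, from about 87.6 hours per year at 99% to under an hour at 99.99%.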


The shelf is made of a durable real-looking wood laminate!


This sounds interesting, thanks for sharing!


Great writeup from David as always. Can’t wait to see more!

If anyone hasn’t seen the work being done on the skynet platform, I highly recommend taking a look. Amazing stuff.



