
It is probably worth noting that most of the listed storage systems (including S3) are designed to scale not only across hard drives, but horizontally across many servers in a distributed system. They really are not optimized for the single-storage-node use case. There are other things to consider that can limit performance too, like what the storage backplane looks like for those 80 HDDs and how much throughput you can effectively push through it. Then there is the network connectivity, which will also be a limiting factor.




It's a very beefy server with 4 NVMe and 20 HDD bays plus a 60-drive external enclosure, and 2 enterprise-grade HBA cards set to multipath round-robin mode; even with 80 drives it's nowhere near the data-path saturation point.

The link is a 10G connection with a 9K MTU, and the server is only accessed via that local link.

Essentially, the drives being HDD are the only real bottleneck (besides the obvious single-node scenario).

At the moment, all writes are buffered into the NVMes via the OpenCAS write-through cache, so writes are very snappy and are pretty much ingested at whatever rate I can throw data at it. But read/delete operations require at least a metadata read, and due to the very high number of small (many even empty) objects they take a lot more time than I would like.

I'm willing to sacrifice the write-through cache benefits (the write performance is actually overkill for my use case) in order to make things a bit more balanced and get better List/Read/DeleteObject performance.

On paper, most "real" writes will be sequential data, so writing that directly to the HDDs should be fine, while metadata write operations will be handled exclusively by the flash storage, thus also taking care of the empty/small objects problem.
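For a rough sanity check of whether that split actually helps, something like the boto3 timing loop below could be run against the S3 endpoint before and after the change. This is only a sketch: the endpoint URL, credentials, bucket and key names are placeholders, not details from the setup described above.

    # Rough sketch: time the metadata-heavy S3 calls against the MinIO endpoint
    # before and after the cache change. Endpoint, credentials, bucket and key
    # are placeholders, not details from the setup described above.
    import time
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://backup-box:9000",   # placeholder endpoint
        aws_access_key_id="ACCESS_KEY",          # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )

    def timed(label, fn, repeat=20):
        start = time.perf_counter()
        for _ in range(repeat):
            fn()
        avg_ms = (time.perf_counter() - start) / repeat * 1000
        print(f"{label}: {avg_ms:.1f} ms avg over {repeat} calls")

    bucket = "backups"  # placeholder bucket full of small objects
    timed("ListObjectsV2 (1000 keys)",
          lambda: s3.list_objects_v2(Bucket=bucket, MaxKeys=1000))
    timed("HeadObject (small object)",
          lambda: s3.head_object(Bucket=bucket, Key="some/small/object"))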


> Essentially, the drives being HDD are the only real bottleneck

? On the low end a single HDD can deliver 100 MB/s, so 80 can deliver 8,000 MB/s; a single NVMe can do 700 MB/s and you have 4, so 2,800 MB/s. A 10Gb link can only do about 1,000 MB/s, so isn't your bottleneck the network, and then probably the CPU?
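Back-of-envelope, with those same rough numbers (the per-device figures are low-end assumptions, not measurements from this box):

    # Back-of-envelope bandwidth budget; per-device figures are rough
    # low-end assumptions, not measurements.
    HDD_MBPS = 100       # conservative sequential throughput per HDD
    NVME_MBPS = 700      # per-NVMe figure quoted above
    LINK_GBPS = 10       # 10 GbE link

    hdd_total = 80 * HDD_MBPS            # 8,000 MB/s across all spindles
    nvme_total = 4 * NVME_MBPS           # 2,800 MB/s across the NVMe drives
    link_total = LINK_GBPS * 1000 / 8    # 1,250 MB/s raw; ~1,000 MB/s after protocol overhead

    print(f"HDDs aggregate : {hdd_total:6,.0f} MB/s")
    print(f"NVMe aggregate : {nvme_total:6,.0f} MB/s")
    print(f"10GbE link     : {link_total:6,.0f} MB/s  <- smallest number for sequential transfers")

That arithmetic only covers sequential throughput, though; for the metadata/small-object problem described above the relevant number is random IOPS per spindle, which is where the HDDs fall behind.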


If your server is old, the RAID card's PCIe interface can be another bottleneck, on top of the latency the card adds if it is not that powerful to begin with.

The same applies to your NVMe throughput, since you now risk congesting the PCIe lanes if you're increasing lane count with PCIe switches.

If there are gateway services or other software-bound processes like zRAID, your processor will saturate way before your NIC, adding more jitter and inconsistency to your performance.

The NIC is an independent republic on the motherboard. NICs can accelerate almost anything related to the network stack, especially server-grade cards. If you can pump the data to the NIC, you can be sure it will be pushed out at line speed.

However, running a NIC at line speed with data read from elsewhere on the system is not always that easy.
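To put rough numbers on the PCIe point, here is a small sketch assuming x8 links for the HBA; the generation and lane width are assumptions, not known details of this server:

    # Rough one-direction PCIe ceilings, assuming x8 links (lane width and
    # generation are assumptions, not known details of this server).
    def pcie_gbps(gtps: float, lanes: int, encoding: float) -> float:
        """Approximate one-direction bandwidth in GB/s for a PCIe link."""
        return gtps * lanes * encoding / 8

    links = {
        "PCIe 2.0 x8": (5.0, 8, 8 / 10),     # 8b/10b encoding
        "PCIe 3.0 x8": (8.0, 8, 128 / 130),  # 128b/130b encoding
    }

    for name, (gtps, lanes, enc) in links.items():
        print(f"{name}: ~{pcie_gbps(gtps, lanes, enc):.1f} GB/s per direction")

    # ~4.0 GB/s for a gen2 x8 card vs. ~8 GB/s of aggregate HDD sequential
    # throughput: an older card really can cap the spindles before the drives do.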


Hope you don't have expectations (over the long run) for high availability. At some point that server will come down (planned or unplanned).

For sure, there are zero expectations of any kind of hardware downtime tolerance; it's secondary backup storage cobbled together from leftovers over many years :)

For software, at least with MinIO it's possible to do rolling updates/restarts since the 5 instances in docker-compose are enough for proper write quorum even with any single instance down.



