Looks pretty cool when you get to the GitHub page (https://github.com/hse-project). Order-of-magnitude performance gains! I imagine most of that comes from skipping the filesystem layer and just hitting the raw block layer directly.
I am curious about the durability and how well tested all of that is, though. On the one hand, filesystems put a lot of work into ensuring that bytes written to disk and synced are most likely durable, but OTOH Micron is a native SSD vendor, so they've probably thought of that.
I'm also curious whether RAIDing multiple SSDs together at the block layer and running HSE on top of that would be faster, or whether running multiple HSE instances (not the right word, it's a library, but you get what I mean), one per drive, and executing redundantly across instances would be. The argument for the former is that otherwise each instance has to redo the management work; the argument for the latter is that there's probably synchronization overhead within the library, so running more instances in parallel should allow for concurrency and parallelism gains.
> I am curious about the durability and how well tested all of that is though. On the one hand, filesystems put a lot of work towards ensuring that bytes written to disk and synced are most likely durable,
All of the SSDs that this software might be deployed to have power loss protection capacitors to ensure the drive can flush its write caches when necessary. So this software only needs to make sure that the OS actually sends data to the drive instead of holding it back in an IO scheduler queue (as you point out, they're already bypassing the FS layer). Since this software should be pretty good at structuring its writes in an SSD-friendly pattern, the operating system's IO scheduler should probably just be disabled.
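To make the "actually sends data to the drive" part concrete, here's a minimal sketch of the usual recipe on Linux: open the raw device with O_DIRECT so the page cache is out of the picture, then fdatasync() so nothing is left sitting in a kernel queue. The device path is just an example (and writing to a raw device is destructive, so point it at a scratch device or an ordinary file).

```c
/* Sketch: bypass the page cache with O_DIRECT and flush explicitly.
 * WARNING: writing to a raw block device destroys data there; use a
 * scratch device or a regular file when trying this out. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096))       /* O_DIRECT needs aligned buffers */
        return 1;
    memset(buf, 0xA5, 4096);

    if (pwrite(fd, buf, 4096, 0) != 4096) {     /* one aligned 4 KB write */
        perror("pwrite");
        return 1;
    }
    if (fdatasync(fd) < 0) {                    /* make sure nothing is still queued */
        perror("fdatasync");
        return 1;
    }
    close(fd);
    free(buf);
    return 0;
}
```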
> ...the operating system's IO scheduler should probably just be disabled.
The default on RHEL/Fedora has long been the noop scheduler. I'd be surprised if other distributions haven't followed suit, given the prevalence of SSDs.
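If you want to check or change it by hand, the scheduler is just a sysfs knob. A quick sketch, with nvme0n1 standing in for whatever device you have (on blk-mq kernels the noop equivalent is spelled "none", and writing requires root):

```c
/* Minimal sketch: read and set the block I/O scheduler through sysfs.
 * The device name is an example; adjust the path for your system. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/block/nvme0n1/queue/scheduler";
    char current[256] = {0};

    FILE *f = fopen(path, "r");
    if (f) {
        if (fgets(current, sizeof(current), f))
            printf("current: %s", current);   /* e.g. "[none] mq-deadline kyber" */
        fclose(f);
    }

    f = fopen(path, "w");
    if (!f) {
        perror("open for write (are you root?)");
        return 1;
    }
    fputs("none", f);                         /* "noop" on older non-blk-mq kernels */
    fclose(f);
    return 0;
}
```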
SSDs are already RAID devices internally, so there's really no point: whatever you can do, the vendor can do faster and better in hardware/firmware.
That's why the hyperscalers all have their own custom SKUs.
Are you asking how to design a custom hardware RAID controller, then tune it to match the performance characteristics of the specific NAND chips you have? That seems outside the range of most people's abilities.
SPDK's RocksDB integration really hasn't gotten a lot of love. There are really two main challenges we hit and then never revisited.
First, the IO threads in RocksDB are a thread pool that assumes it performs blocking operations. That doesn't jibe with SPDK's async model (nor io_uring's). We ended up having to message-pass to an async thread and block on semaphores in the IO threads.
Second, RocksDB was heavily reliant on the kernel page cache to make it fast.
Both of these things could have changed since we last worked on the integration. I haven't kept up with RocksDB development recently.
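For anyone curious what the first problem looks like in practice, here's a rough sketch of the bridge pattern described above: a blocking IO thread hands a request to a reactor thread and sleeps on a per-request semaphore until the completion fires. The reactor below just fakes the IO; in the real integration it would be the SPDK (or io_uring) event loop, and none of these names come from RocksDB or SPDK.

```c
/* Sketch of bridging a blocking thread pool to an async IO thread.
 * Build with: cc bridge.c -lpthread */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>

struct io_req {
    char           buf[64];   /* destination buffer for the "read"   */
    int            done;      /* completion status                   */
    sem_t          sem;       /* blocking side waits on this         */
    struct io_req *next;
};

static struct io_req  *queue_head;   /* tiny mailbox, enough for the demo */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

/* The "reactor": in reality this thread would poll SPDK completions or an
 * io_uring CQ; here it services a single request and completes it. */
static void *reactor_thread(void *arg)
{
    pthread_mutex_lock(&queue_lock);
    while (!queue_head)
        pthread_cond_wait(&queue_cond, &queue_lock);
    struct io_req *req = queue_head;
    queue_head = req->next;
    pthread_mutex_unlock(&queue_lock);

    strcpy(req->buf, "pretend block data");   /* simulated async completion */
    req->done = 0;
    sem_post(&req->sem);                      /* wake the blocking IO thread */
    return NULL;
}

/* Called from a RocksDB-style blocking IO thread. */
static int blocking_read(struct io_req *req)
{
    sem_init(&req->sem, 0, 0);

    pthread_mutex_lock(&queue_lock);
    req->next = queue_head;
    queue_head = req;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_lock);

    sem_wait(&req->sem);                      /* block until completion */
    sem_destroy(&req->sem);
    return req->done;
}

int main(void)
{
    pthread_t reactor;
    pthread_create(&reactor, NULL, reactor_thread, NULL);

    struct io_req req = {0};
    if (blocking_read(&req) == 0)
        printf("read completed: %s\n", req.buf);

    pthread_join(reactor, NULL);
    return 0;
}
```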
SPDK has had its share of detractors here on news.yc [0], especially with io_uring around the block [1]. It'd be interesting to see the improvements to these io-centric applications once they move to io_uring [2], which, in a way, like RocksDB, is sponsored by Facebook [3].
io_uring is a fantastic development for the kernel, and I really can't praise it enough.
However, there are still lots of reasons to use SPDK. Performance is still significantly better[0], and you can directly access all of the NVMe features on the device without going through any abstraction layers.
SPDK has made some progress since [0] was posted in 2015 (see https://spdk.io/ if you're curious). Still, I am excited about io_uring and what it brings to the table.
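As a point of reference for what "moving to io_uring" looks like at the lowest level, here's a minimal liburing sketch that submits one async read against a block device and waits for the completion. The device path is just an example (any readable file works if you drop O_DIRECT), and it needs liburing installed and -luring to build.

```c
/* Minimal io_uring read using liburing: submit one async read and wait
 * for its completion. Build with: cc uring_read.c -luring */
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct io_uring ring;
    char *buf;

    if (io_uring_queue_init(8, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return 1;
    }

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (posix_memalign((void **)&buf, 4096, 4096))   /* O_DIRECT wants alignment */
        return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);       /* 4 KB read at offset 0 */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                  /* block until completion */
    printf("read returned %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```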
PR copy aside, the claimed performance differences relative to RocksDB and WiredTiger are typical of many storage engines; the performance doesn't stand out. I don't think either RocksDB or WT has made a serious claim to prioritizing performance in its design in any case.
Also, I have to wonder how narrowly "open-source storage engine for SSDs" is being defined here such that it excludes so many earlier storage engines in claiming the title of "first".
And of course RocksDB has had similar graphs showing perf against other systems.
Every system manages to find a benchmark that fits their narrative :)
The reality is that both RocksDB and WiredTiger are high-performance storage engines, and they are both optimized for SSDs too. These types of benchmarks rarely tell the real story.
"World's first" Open-Source storage engine for SSDs? I believe Aerospike has advertised itself as that for years, and certainly most MongoDB instances are backed by SSD these days. Heck, conceptually etcd is a key-value storage engine built for SSDs.
> HSE optimizes performance and endurance by orchestrating data placement across DRAM and multiple classes of SSDs or other solid-state storage.
Orchestrating data placement? Isn't that what all storage engines do?
What am I missing here? Is this a block level rather than file-system level driver?
Isn't this 'HSE' conceptually an HSM? How does it compare to existing field-proven 'storage engines', some of them chock-full of features (because they are filesystems), such as Lustre or ZFS?
Nope. The only product they've announced so far using 3D XPoint is the Micron X100 SSD, which they're only selling to a limited number of major customers; you won't find it for sale on CDW. Intel's Optane products do use 3D XPoint memory, and at the moment I believe that's all manufactured in a Micron-owned fab. (Intel used to co-own it, and I don't think Intel will have their own production line up and running until next year.)
Someone needs to write a book about breaking into writing software like RocksDB, HSE, etc. Years ago I found myself wanting to learn more, but going from 0 to 1 felt impossible. I graduated from a T3 school in CS, so understanding the concepts wasn't the issue; I just didn't know how to build a good foundation in low-latency persistence. Years later I ended up contributing to low-latency Java, which was really interesting, but what an opportunity missed.
I just finished reading the OSTEP book[1] and it has a nice chapter on SSDs[2]. The entire last portion of the book is about filesystems/disks so you might find it interesting.
That chapter on SSDs looks pretty good to me. Their numbers for NAND page and especially erase block sizes are very outdated; more modern values are 4kB to 16kB for NAND pages and 16-24MB for erase blocks on TLC NAND. Section 44.9 on mapping table sizes is a little bit odd, because most SSDs really do have 1GB of RAM per 1TB of flash, and that expense is widely seen as worthwhile even for multi-TB SSDs. The exceptions are low-end consumer SSDs that cache only part of the mapping table in a smaller amount of DRAM or SRAM, and a few enterprise/datacenter models that use 32kB block sizes for their FTL instead of the typical 4kB and thus reduce the DRAM requirement by a factor of 8 at the expense of greatly lowered performance and increased write amplification when writing in units smaller than 32kB.
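For what it's worth, the 1GB-per-1TB figure falls straight out of the arithmetic for a flat logical-to-physical map, assuming the common case of 4-byte entries:

```c
/* Back-of-envelope FTL mapping table size: a flat logical-to-physical map
 * at 4 KB granularity with 4-byte entries needs roughly 1 GB of DRAM per
 * 1 TB of flash; a 32 KB map unit divides that by 8. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t flash      = 1ULL << 40;             /* 1 TB of NAND              */
    uint64_t entry_size = 4;                      /* bytes per table entry     */

    uint64_t table = flash / (4 * 1024) * entry_size;
    printf("4 KB map unit : %llu MB\n", (unsigned long long)(table >> 20));

    table = flash / (32 * 1024) * entry_size;     /* 32 KB unit, 8x smaller    */
    printf("32 KB map unit: %llu MB\n", (unsigned long long)(table >> 20));
    return 0;
}
```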
Aside from the two above issues, everything looks correct and relevant, and I can't think of any missing details that deserve to be added to an introduction of that length.
"Sorry! The URL you requested was not found on our server.
Is it Sunday between 4 and 8 PM (CST)?
If so, the server may be undergoing regularly scheduled downtime. Otherwise, please contact the maintainer of the referring page and ask them to fix their link. Thanks!"
"How can I learn an instrument without listening to music"
"How can I learn to write stories without reading books?"
Have you considered just reading the code? It's all available. The best way to learn is to look at what the masters are doing.
And even complex software is typically just a collection of simple things put together.
IMHO one of the biggest failings of many CS courses is that they never get above toy software. It would be much better to at least once dive into some production software and try to figure something out in a real code base.
Reading the code of a piece of production software is a poor way to get an understanding of the fundamentals of an unfamiliar topic. It's really inefficient in terms of "understanding" per hour; the student has no context as to why certain choices were made, just that they were made.
This is unbelievably cool. It's a key-value store with multi-segmented key prefixes. Can someone just strap Paxos or Raft to this and call it a day, please? Pretty please?
You're comparing apples to oranges, plus that's unduly harsh.
1. This is a KV store optimized (or claimed to be) for a combo of SSD and PMEM. Not a packaged, supported, appliance storage system.
2. People who pay for enterprise storage do so for reasons beyond being too bone-headed to appreciate the joys of cobbling together production systems from the white box low-bidder and open source software.