Looks pretty cool when you get to the GitHub page (https://github.com/hse-project). Order-of-magnitude performance gains! I imagine most of that comes from skipping the filesystem layer and just hitting the raw block layer directly.
I am curious about the durability and how well tested all of that is, though. On the one hand, filesystems put a lot of work into ensuring that bytes written to disk and synced are most likely durable, but OTOH Micron is a native SSD vendor, so they've probably thought of that.
I'm also curious whether RAIDing multiple SSDs together at the block layer and running HSE on top of that would be faster, or whether running multiple HSE instances (not the right word, it's a library, but you get what I mean), one per drive, and executing redundantly across instances would be. The argument for the former is that otherwise each instance has to redo the management work; the argument for the latter is that there's probably synchronization overhead within the library, so running more instances in parallel should allow for concurrency and parallelism gains.
> I am curious about the durability and how well tested all of that is though. On the one hand, filesystems put a lot of work towards ensuring that bytes written to disk and synced are most likely durable,
All of the SSDs that this software might be deployed to have power loss protection capacitors to ensure the drive can flush its write caches when necessary. So this software only needs to make sure that the OS actually sends data to the drive instead of holding it back in an IO scheduler queue (as you point out, they're already bypassing the FS layer). Since this software should be pretty good at structuring its writes in an SSD-friendly pattern, the operating system's IO scheduler should probably just be disabled.
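To make the "actually sends data to the drive" part concrete, here's a minimal sketch of the usual recipe on Linux: open the raw device with O_DIRECT so the page cache is out of the picture, then fdatasync() so nothing is left sitting in a kernel queue. The device path is just an example (and writing to a raw device is destructive, so point it at a scratch device or an ordinary file).

```c
/* Sketch: bypass the page cache with O_DIRECT and flush explicitly.
 * WARNING: writing to a raw block device destroys data there; use a
 * scratch device or a regular file when trying this out. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/nvme0n1", O_WRONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096))       /* O_DIRECT needs aligned buffers */
        return 1;
    memset(buf, 0xA5, 4096);

    if (pwrite(fd, buf, 4096, 0) != 4096) {     /* one aligned 4 KB write */
        perror("pwrite");
        return 1;
    }
    if (fdatasync(fd) < 0) {                    /* make sure nothing is still queued */
        perror("fdatasync");
        return 1;
    }
    close(fd);
    free(buf);
    return 0;
}
```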
> ...the operating system's IO scheduler should probably just be disabled.
The default on RHEL/Fedora has long been the noop scheduler. I'd be surprised if other distributions haven't followed suit, given the prevalence of SSDs.
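If you want to check or change it by hand, the scheduler is just a sysfs knob. A quick sketch, with nvme0n1 standing in for whatever device you have (on blk-mq kernels the noop equivalent is spelled "none", and writing requires root):

```c
/* Minimal sketch: read and set the block I/O scheduler through sysfs.
 * The device name is an example; adjust the path for your system. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/block/nvme0n1/queue/scheduler";
    char current[256] = {0};

    FILE *f = fopen(path, "r");
    if (f) {
        if (fgets(current, sizeof(current), f))
            printf("current: %s", current);   /* e.g. "[none] mq-deadline kyber" */
        fclose(f);
    }

    f = fopen(path, "w");
    if (!f) {
        perror("open for write (are you root?)");
        return 1;
    }
    fputs("none", f);                         /* "noop" on older non-blk-mq kernels */
    fclose(f);
    return 0;
}
```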
SSDs are already RAID devices internally, so there's really no point: whatever you can do, the vendor can do faster and better in hardware/firmware.
That's why the hyperscalers all have their own custom SKUs.
Are you asking how to design a custom hardware RAID controller, then tune it to match the performance characteristics of the specific NAND chips you have? That seems outside the range of most people's abilities.
SPDK's RocksDB integration really hasn't gotten a lot of love. There are really two main challenges we hit and then never revisited.
First, the IO threads in RocksDB are a thread pool that assumes it performs blocking operations. That doesn't jibe with SPDK's async model (nor io_uring's). We ended up having to message-pass to an async thread and block on semaphores in the IO threads.
Second, RocksDB was heavily reliant on the kernel page cache to make it fast.
Both of these things could have changed since we last worked on the integration. I haven't kept up with RocksDB development recently.
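For anyone curious what the first problem looks like in practice, here's a rough sketch of the bridge pattern described above: a blocking IO thread hands a request to a reactor thread and sleeps on a per-request semaphore until the completion fires. The reactor below just fakes the IO; in the real integration it would be the SPDK (or io_uring) event loop, and none of these names come from RocksDB or SPDK.

```c
/* Sketch of bridging a blocking thread pool to an async IO thread.
 * Build with: cc bridge.c -lpthread */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>

struct io_req {
    char           buf[64];   /* destination buffer for the "read"   */
    int            done;      /* completion status                   */
    sem_t          sem;       /* blocking side waits on this         */
    struct io_req *next;
};

static struct io_req  *queue_head;   /* tiny mailbox, enough for the demo */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;

/* The "reactor": in reality this thread would poll SPDK completions or an
 * io_uring CQ; here it services a single request and completes it. */
static void *reactor_thread(void *arg)
{
    pthread_mutex_lock(&queue_lock);
    while (!queue_head)
        pthread_cond_wait(&queue_cond, &queue_lock);
    struct io_req *req = queue_head;
    queue_head = req->next;
    pthread_mutex_unlock(&queue_lock);

    strcpy(req->buf, "pretend block data");   /* simulated async completion */
    req->done = 0;
    sem_post(&req->sem);                      /* wake the blocking IO thread */
    return NULL;
}

/* Called from a RocksDB-style blocking IO thread. */
static int blocking_read(struct io_req *req)
{
    sem_init(&req->sem, 0, 0);

    pthread_mutex_lock(&queue_lock);
    req->next = queue_head;
    queue_head = req;
    pthread_cond_signal(&queue_cond);
    pthread_mutex_unlock(&queue_lock);

    sem_wait(&req->sem);                      /* block until completion */
    sem_destroy(&req->sem);
    return req->done;
}

int main(void)
{
    pthread_t reactor;
    pthread_create(&reactor, NULL, reactor_thread, NULL);

    struct io_req req = {0};
    if (blocking_read(&req) == 0)
        printf("read completed: %s\n", req.buf);

    pthread_join(reactor, NULL);
    return 0;
}
```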
SPDK has had its share of detractors here on news.yc [0], especially with io_uring around the block [1]. It'd be interesting to see the improvements to these io-centric applications once they move to io_uring [2], which, in a way, like RocksDB, is sponsored by Facebook [3].
io_uring is a fantastic development for the kernel, and I really can't praise it enough.
However, there are still lots of reasons to use SPDK. Performance is still significantly better[0], and you can directly access all of the NVMe features on the device without going through any abstraction layers.
SPDK has made some progress since [0] was posted in 2015 (see https://spdk.io/ if you're curious). Still, I am excited about io_uring and what it brings to the table.
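As a point of reference for what "moving to io_uring" looks like at the lowest level, here's a minimal liburing sketch that submits one async read against a block device and waits for the completion. The device path is just an example (any readable file works if you drop O_DIRECT), and it needs liburing installed and -luring to build.

```c
/* Minimal io_uring read using liburing: submit one async read and wait
 * for its completion. Build with: cc uring_read.c -luring */
#define _GNU_SOURCE
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    struct io_uring ring;
    char *buf;

    if (io_uring_queue_init(8, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return 1;
    }

    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (posix_memalign((void **)&buf, 4096, 4096))   /* O_DIRECT wants alignment */
        return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);       /* 4 KB read at offset 0 */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                  /* block until completion */
    printf("read returned %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```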
PR copy aside, the claimed performance differences relative to RocksDB and WiredTiger are typical of many storage engines; the performance doesn't stand out. I don't think either RocksDB or WT has made a serious claim to prioritizing performance in its design in any case.
Also, I have to wonder how narrowly "open-source storage engine for SSDs" is being defined here such that it excludes so many earlier storage engines in claiming the title of "first".
And of course RocksDB has had similar graphs showing perf against other systems.
Every system manages to find a benchmark that fits their narrative :)
The reality is that both RocksDB and WiredTiger are high-performance storage engines, and they are both optimized for SSDs too. These types of benchmarks rarely tell the real story.
"World's first" Open-Source storage engine for SSDs? I believe Aerospike has advertised itself as that for years, and certainly most MongoDB instances are backed by SSD these days. Heck, conceptually etcd is a key-value storage engine built for SSDs.
> HSE optimizes performance and endurance by orchestrating data placement across DRAM and multiple classes of SSDs or other solid-state storage.
Orchestrating data placement? Isn't that what all storage engines do?
What am I missing here? Is this a block level rather than file-system level driver?
Isn't this 'HSE' conceptually an HSM? How does it compare to existing field-proven 'storage engines', some of them chock-full of features (because they are filesystems), such as Lustre or ZFS?
Nope. The only product they've announced so far using 3D XPoint is the Micron X100 SSD, which they're only selling to a limited number of major customers; you won't find it for sale on CDW. Intel's Optane products do use 3D XPoint memory, and at the moment I believe that's all manufactured in a Micron-owned fab. (Intel used to co-own it, and I don't think Intel will have their own production line up and running until next year.)
Someone needs to write a book about breaking into writing software like RocksDB, HSE, etc. Years ago I found myself wanting to learn more, but going from 0 to 1 felt impossible. I graduated from a T3 school in CS, so understanding the concepts wasn't the issue; I just didn't know how to build a good foundation in low-latency persistence. Years later I ended up contributing to low-latency Java, which was really interesting, but what an opportunity missed.
I just finished reading the OSTEP book[1] and it has a nice chapter on SSDs[2]. The entire last portion of the book is about filesystems/disks so you might find it interesting.
That chapter on SSDs looks pretty good to me. Their numbers for NAND page and especially erase block sizes are very outdated; more modern values are 4kB to 16kB for NAND pages and 16-24MB for erase blocks on TLC NAND. Section 44.9 on mapping table sizes is a little bit odd, because most SSDs really do have 1GB of RAM per 1TB of flash, and that expense is widely seen as worthwhile even for multi-TB SSDs. The exceptions are low-end consumer SSDs that cache only part of the mapping table in a smaller amount of DRAM or SRAM, and a few enterprise/datacenter models that use 32kB block sizes for their FTL instead of the typical 4kB and thus reduce the DRAM requirement by a factor of 8 at the expense of greatly lowered performance and increased write amplification when writing in units smaller than 32kB.
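For what it's worth, the 1GB-per-1TB figure falls straight out of the arithmetic for a flat logical-to-physical map, assuming the common case of 4-byte entries:

```c
/* Back-of-envelope FTL mapping table size: a flat logical-to-physical map
 * at 4 KB granularity with 4-byte entries needs roughly 1 GB of DRAM per
 * 1 TB of flash; a 32 KB map unit divides that by 8. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t flash      = 1ULL << 40;             /* 1 TB of NAND              */
    uint64_t entry_size = 4;                      /* bytes per table entry     */

    uint64_t table = flash / (4 * 1024) * entry_size;
    printf("4 KB map unit : %llu MB\n", (unsigned long long)(table >> 20));

    table = flash / (32 * 1024) * entry_size;     /* 32 KB unit, 8x smaller    */
    printf("32 KB map unit: %llu MB\n", (unsigned long long)(table >> 20));
    return 0;
}
```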
Aside from the two above issues, everything looks correct and relevant, and I can't think of any missing details that deserve to be added to an introduction of that length.
"Sorry! The URL you requested was not found on our server.
Is it Sunday between 4 and 8 PM (CST)?
If so, the server may be undergoing regularly scheduled downtime. Otherwise, please contact the maintainer of the referring page and ask them to fix their link. Thanks!"
"How can I learn an instrument without listening to music"
"How can I learn to write stories without reading books?"
Have you considered just reading the code? It's all available. The best way to learn is to look at what the masters are doing.
And even complex software is typically just a collection of simple things put together.
IMHO one of the biggest failings of many CS courses is that they never get above toy software. It would be much better to at least once dive into some production software and try to figure something out in a real code base.
Reading the code of a piece of production software is a poor way to get an understanding of the fundamentals of an unfamiliar topic. It's really inefficient in terms of "understanding" per hour; the student has no context as to why certain choices were made, just that they were made.
This is unbelievably cool. It's a key-value store with multi-segmented key prefixes. Can someone just strap Paxos or Raft to this and call it a day, please? Pretty please?
You're comparing apples to oranges, plus that's unduly harsh.
1. This is a KV store optimized (or claimed to be) for a combo of SSD and PMEM. Not a packaged, supported, appliance storage system.
2. People who pay for enterprise storage do so for reasons beyond being too bone-headed to appreciate the joys of cobbling together production systems from the white box low-bidder and open source software.