An embedded database written in Rust (github.com/spacejam)
277 points by mountainview on May 28, 2018 | 72 comments



Other people have had trouble wringing competitive performance out of Bw-Trees despite heroic optimization efforts [0]. Why is this implementation going to beat other index structures with just a bit of tuning?

[0] https://news.ycombinator.com/item?id=17041616


It might not. But the critiques of Bw-Trees' performance that I've seen have not had compelling data on the things that matter outside of academia or benchmarking shootouts, like write or space amplification. The Bw-Tree is a cheap thing to abandon after I implement a persistent ART and measure it, though. The Bw-Tree is only about 1k lines of Rust on top of the modular pagecache, which is the real heart of the system.


I am not familiar with bw-trees, but are you aware of this paper from this year’s SIGMOD?

https://db.cs.cmu.edu/papers/2018/mod342-wangA.pdf

Moving to ARTs (possibly combined with B+ trees) might be a smart choice after all.

Edit: sorry, I missed that the grandparent already cited the same paper.


That paper is pretty good, but it's comparing the Bw-Tree with much simpler in-memory data structures. I think Bw-Trees might work especially well for fast-disk storage.


This was my interpretation as well. I'm going to compare a disk-backed Bw-Tree with a disk-backed ART, both backed by the same pagecache, and maybe end up with an ART that scatters partial pages on disk, Bw-Tree style. But I need to measure apples to apples on the metrics that matter for storage first. The pagecache is where most of the complexity is in my implementation, and it makes building different kinds of persistent structures on top of it pretty easy. docs.rs/pagecache


>I think Bw-Trees might work especially well for fast-disk storage.

That's precisely the claim made in the original paper describing Bw-trees.

https://dl.acm.org/citation.cfm?id=2510649.2511251


Nice to see more development in embedded databases!

A comparison with mentat [1] would be interesting. Do you plan to support any query language such as datalog, like mentat does?

[1] https://github.com/mozilla/mentat


Yeah, I'm curious about using sled as a more SSD-friendly storage engine for mentat. I'm just starting to experiment with datalog implementations, but I think that by having harmony between the storage engine, query language, and hardware properties, we can make really compelling stateful systems. If this is something that interests you, I'd love to work with more people on this.


Sounds very exciting!


Sounds exciting!


"don't wake up operators. bring reliability techniques from academia into real-world practice."

What does he mean here? Don't wake up people with operational problems? Or does "wake up" refer to a scheduling strategy?

Either way, what are these techniques?


It means paying more attention to reliability than pop infrastructure and internet companies (who can offset poor reliability with human attention, or intentionally deprioritize it to sell more support contracts) tend to. Specifically: exhaustive concurrency testing of lock-free algorithm interleavings via ptrace-driven scheduling, model-based testing in combination with fault injection, ALICE-style file correctness testing, and, for the various distributed modules that sit on top, network simulation combined with lineage-driven fault injection. This is all very much a work in progress, and I'd love to work with more people on it!
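
To make the model-based testing idea concrete, here's a minimal sketch. It assumes sled's current API for the store under test; a real harness would generate the ops randomly (e.g. with quickcheck) and inject faults between steps, both of which are elided here:

    use std::collections::BTreeMap;

    // Apply the same operation stream to the store under test and to a
    // trivially correct in-memory model, then assert they agree.
    enum Op {
        Set(Vec<u8>, Vec<u8>),
        Del(Vec<u8>),
        Get(Vec<u8>),
    }

    fn check(tree: &sled::Db, ops: &[Op]) -> sled::Result<()> {
        let mut model: BTreeMap<Vec<u8>, Vec<u8>> = BTreeMap::new();
        for op in ops {
            match op {
                Op::Set(k, v) => {
                    model.insert(k.clone(), v.clone());
                    tree.insert(k.clone(), v.clone())?;
                }
                Op::Del(k) => {
                    model.remove(k);
                    tree.remove(k.clone())?;
                }
                Op::Get(k) => {
                    // The store must agree with the model on every read.
                    assert_eq!(tree.get(k)?.as_deref(), model.get(k).map(|v| v.as_slice()));
                }
            }
        }
        Ok(())
    }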


I took that as a reference to pager duty.


How well does that handle the storage disappearing halfway through a write? How well does it handle power being cut halfway through an update of some sort? How well does it handle some of the written blocks actually making it to the disc and others not? How about if the ones that made it were not the first or the last it issued to be written?

(For predictable answers to these, and many other complex questions, when dealing with data you care about, use sqlite)


ALICE showed that's not always true of SQLite. Sled is being built with an extreme bias toward reliability over features, but as the readme says, it has some time to go before reaching maturity. The tests are quite good at finding new issues and deterministically replaying them, so you can help it get there by mining bugs with the default test suite.


Most database designers assume that a power failure will only affect writes that are pending. Alas, for SSDs and NVMEs that's not always true. A power failure can cause all kinds of corruption. Long story short: even append-only strategies will not save you.

https://www.usenix.org/system/files/conference/fast13/fast13...


Indeed. This is why I aggressively checksum everything and pay particular attention to throwing away all data that was written after any detected corruption during recovery. This is easier with the log-only architecture. It's also totally OS- and filesystem-agnostic. I was happily surprised yesterday when it passed tests on Fuchsia :]
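
To make the recovery idea concrete, here's a minimal sketch: read length-prefixed, checksummed records and stop at the first one that fails verification. The frame layout and the FNV-style checksum are purely illustrative, not sled's actual on-disk format:

    use std::io::Read;

    // Illustrative checksum; a real implementation would use CRC32C or similar.
    fn checksum(data: &[u8]) -> u32 {
        data.iter()
            .fold(0x811c_9dc5u32, |h, &b| (h ^ b as u32).wrapping_mul(0x0100_0193))
    }

    // Read [u32 len][u32 sum][body] frames until the first torn or corrupt
    // record, discarding everything at and after the corruption.
    fn recover(mut log: impl Read) -> Vec<Vec<u8>> {
        let mut records = Vec::new();
        loop {
            let mut header = [0u8; 8];
            if log.read_exact(&mut header).is_err() {
                break; // clean end of log, or a torn header
            }
            let len = u32::from_le_bytes(header[0..4].try_into().unwrap()) as usize;
            let sum = u32::from_le_bytes(header[4..8].try_into().unwrap());
            let mut body = vec![0u8; len];
            if log.read_exact(&mut body).is_err() || checksum(&body) != sum {
                break; // everything from here on is suspect
            }
            records.push(body);
        }
        records
    }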


So you might end up throwing away everything. (I know, it's not your fault)


Users can rely on sequential recovery. At some point I'll probably write a partial recovery tool that gives you all versions of all keys that are present anywhere in the readable file, which won't be much work. Typical best practices encourage moving away from single-disk reliance for particularly valuable data, but this library will also work on phones, etc., so it is important to support people when a wide variety of things go wrong.


That paper sounds like it describes problems that cannot be worked around in software and need hardware fixes?


You could work around it (erasure coding, for example) if you had insight into the failure mode specifics, but it's vendor-specific and vendors are not exactly forthcoming.

So the only thing you can do is add checksumming schemes that allow you to detect you have been affected.

The paper is from 2013, so the situation might have improved in the meantime (I wouldn't put any money on it).


>How well does that handle the storage disappearing halfway through a write? How well does it handle power being cut halfway through an update of some sort?

Or the drive catching fire and the user manually deleting all existing backups.


I see that MVCC is a planned feature. Why would you need MVCC for an embedded database? It seems like unnecessary overhead that conflicts with the performance goals.


It's not needed for the single-key atomic record store, which is the sled Bw-Tree index, the current highest-level module. MVCC is implemented in most popular embedded DBs because it is an effective way to serve workloads that want to read a snapshot of the entire database at a single point in time without blocking writes while that happens. This functionality is desirable for transactions that support mixed workloads, which is why I'm building it for a higher-level module. This is a collection of modules that let you choose the abstraction and associated complexity that you want. There's also a modular pagecache that is totally decoupled and reusable for your own database experiments.
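
As a rough illustration of the snapshot-read idea, here is a toy single-writer version-chain sketch (not sled's design): each key maps to a chain of versions, and a reader at snapshot `ts` sees the newest version <= ts without ever blocking the writer:

    use std::collections::BTreeMap;

    #[derive(Default)]
    struct Mvcc {
        next_version: u64,
        data: BTreeMap<Vec<u8>, Vec<(u64, Option<Vec<u8>>)>>, // None = delete
    }

    impl Mvcc {
        // Append a new version; returns the version it became visible at.
        fn write(&mut self, key: &[u8], value: Option<Vec<u8>>) -> u64 {
            self.next_version += 1;
            self.data
                .entry(key.to_vec())
                .or_default()
                .push((self.next_version, value));
            self.next_version
        }

        // Read the value as of snapshot `ts`.
        fn read_at(&self, key: &[u8], ts: u64) -> Option<&[u8]> {
            self.data
                .get(key)?
                .iter()
                .rev()
                .find(|(v, _)| *v <= ts)
                .and_then(|(_, val)| val.as_deref())
        }
    }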


MVCC is a viable general-purpose approach to dealing with concurrency. It is the mother of all DAGs.


I think they mean "embedded" as in "embedded in a program", not "targeting embedded hardware". In particular the README states they intend to use it in https://github.com/spacejam/rasputin. This is a use case similar to LMDB, for instance, which has MVCC.


In Rust abstractions are free.


Some are (more or less) free, others are not. Rust simply gives you some great tools to make the choices yourself. The biggest win tends to be that you can get away without a GC, but still keep the guarantee that you're not leaking memory by forgetting to collect it.


Rust doesn't guarantee freedom from leaks, as leaking is memory-safe.


To be fair, I think the way GP worded it might still be accurate; they said that you wouldn't leak by forgetting to collect it. I took that as a statement about not needing to manually free things rather than an assertion that Rust guarantees no memory leaks (which can still occur from things like reference cycles).
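
For anyone curious, the classic example is a reference-counted cycle, which leaks in 100% safe Rust:

    use std::cell::RefCell;
    use std::rc::Rc;

    struct Node {
        next: Option<Rc<RefCell<Node>>>,
    }

    fn main() {
        let a = Rc::new(RefCell::new(Node { next: None }));
        let b = Rc::new(RefCell::new(Node { next: Some(a.clone()) }));
        // Close the cycle: a -> b -> a. Each Rc keeps the other alive,
        // so neither destructor ever runs -- a leak, with no unsafe code.
        a.borrow_mut().next = Some(b.clone());
    }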


You’re right, I might be overly sensitive to people incorrectly making this claim directly.


I too try (and sometimes fail) to be careful when talking about this. People get all hot and bothered when you go overboard :P It can be tempting to start explaining what an affine type system is, but that can hurt more than help.


<3


If that were so, there wouldn't have been a need for unsafe mode.


Unsafe does not exist for performance.

Quoting from the rust book:

> Unsafe Rust exists because, by nature, static analysis is conservative. When the compiler is trying to determine if code upholds the guarantees or not, it’s better for it to reject some programs that are valid than accept some programs that are invalid. That inevitably means there are some times when your code might be okay, but Rust thinks it’s not! In these cases, you can use unsafe code to tell the compiler, “trust me, I know what I’m doing.” The downside is that you’re on your own; if you get unsafe code wrong, problems due to memory unsafety, like null pointer dereferencing, can occur.

> There’s another reason Rust has an unsafe alter ego: the underlying hardware of computers is inherently not safe. If Rust didn’t let you do unsafe operations, there would be some tasks that you simply could not do. Rust needs to allow you to do low-level systems programming like directly interacting with your operating system, or even writing your own operating system! That’s one of the goals of the language. Let’s see what you can do with unsafe Rust, and how to do it.


As the person who wrote those paragraphs: yes, the fundamental reason unsafe exists is not performance. That being said, the parent is also right that unsafe is sometimes used to enhance performance. But that's not the usual case, and in fact it can even hurt performance in some ways. For example, &mut T can't alias, but *mut T can, so &mut T can be optimized more aggressively. (We recently turned these optimizations back on: https://github.com/rust-lang/rust/pull/50744 )
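
A tiny sketch of that aliasing point: with &mut the compiler may assume the two arguments never overlap, so the second read of *b can reuse the first load; with raw pointers it must reload *b after each store through a, since they might alias:

    fn sum_refs(a: &mut i32, b: &mut i32) -> i32 {
        *a += *b;
        *a += *b; // *b known unchanged by the writes through `a`
        *a
    }

    unsafe fn sum_ptrs(a: *mut i32, b: *mut i32) -> i32 {
        *a += *b;
        *a += *b; // *b may have changed if a == b
        *a
    }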


Your argument doesn't really make sense. Not all abstractions are workarounds for "unsafe" things. For example, unsafe is needed for dealing with the OS; no amount of abstraction will help you there. At some point you must talk to, and accept memory from, another process (the OS) that tells you it is correct. I'm not so knowledgeable in Rust that I can say definitively, but I think you should back up your statement with some (any) examples.



Do you realize "zero-cost" abstractions are not specific to Rust?


Your parent didn't say that. Speaking about Rust in a thread about a Rust project seems to be very on-topic?

(I would also disagree with that statement, but in a different way: Rust enables zero-cost abstractions, but nothing inherently means they have to be. I can also make quite costly ones, and they'll compile.)


> Your parent didn't say that.

Yes, and that's more disturbing. It says "In Rust abstractions are free" when an algorithm was being discussed.

Rust having "free" abstractions does not change the performance characteristics of an algorithm: that is what the grandparent was talking about.


Yes, I certainly agree with that distinction. If you'd have said that in the first place, I wouldn't have said anything :)


What is disturbing is that you are arguing just for the sake of arguing. "Algorithm" is too broad a term; everything is an algorithm, especially in programming. The talk was about abstractions, and now you're trying to say it was about algorithms. We need an "ignore" button here.


I would answer: you can write trash in any language; in Rust you need to find ways to do it, while in other languages you need to find ways to avoid it. It looks like ycombinator is now a place where people just look for negativity and for things to disagree with.


How does it compare relatively to dgraph's Badger library?


Considering this has not even exited alpha, it stands to reason that Badger is a fair bit more robust in performance and failure modes.


There is zero reason behind this assumption.


Does rust compile reliably to embedded targets yet? Last time I checked there were a lot of problems with armv5.


I'm not sure about ARMv5, but v7 and v8-A are Tier 1 for Firefox, so we get a pretty decent smoke test out of them. The thing about embedded is that it's quite diverse, so speaking about it in broad terms is tough. It goes from "lol nope" to "barely works" to "pretty decent" to "great", depending on which thing you're talking about.

We have a whole working group this year working on embedded.


quoting catwell:

> I think they mean "embedded" as in "embedded in a program", not "targeting embedded hardware".


I've been using embedded Rust for my thesis project (ARMv6-M), and it's been a pleasure.


It's been a focus of the team and has been getting better, with efforts underway to move some external tools into the mainline toolchain. Last I checked it was possible, but still a bit cumbersome.

P.S. Really looking forward to writing Rust on AVRs.


I'll take Richard Hipp's C code over just about anyone's Rust code any day of the week. The man is a national treasure!


I'll also take any project that has thousands (if not millions?) of man-hours together with an extremely solid test suite. That doesn't mean other projects aren't worth exploring or won't be useful.


It says it is alpha about 5 times in the readme. What gave you the impression that this was already more robust than SQLite?


But C is very unsafe! /s


sorry you got downvoted into strikethrough, but I love your other comments


No clustering :( I feel like good multi-master asynchronous and synchronous clustering is truly the frontier in DBs.


clustering and embedded seem like almost opposite ends of the spectrum


Then again, embedded databases are a great building block for things like etcd. It might also make sense to have something like this in-process, because that would remove quite a few failure modes coming from the client connection and simplify deployment.


You can't cluster an embedded database; this comment makes no sense.

Compare with SQLite, not with Oracle or Postgres.


Currently it's even more basic. The currently usable parts are a pagecache following the LLAMA approach, some great testing utility libraries, and an index (that you can use as a KV store) that follows the Bw-Tree approach. Later it will have structured access support, but it needs some more DB components to get there. It is a construction kit as well as a KV store.
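
For anyone who just wants the KV part, usage looks roughly like this. Note this is a sketch against the current sled crate's API (sled::open, insert, get, flush); the alpha-era API discussed in this thread used different names:

    fn main() -> sled::Result<()> {
        // Open (or create) a database directory on disk.
        let db = sled::open("my_db")?;

        // Keys and values are byte strings.
        db.insert(b"hello".to_vec(), b"world".to_vec())?;
        assert_eq!(db.get(b"hello")?.as_deref(), Some(&b"world"[..]));

        // Make sure writes have hit disk before exiting.
        db.flush()?;
        Ok(())
    }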


So... SQLite FoundationDB?


Not really; when the embedded database is allowed to run its own threads and network connections, embedding something like etcd makes perfect sense. It removes the failure modes of the client connection and simplifies deployment.


What about TiKV (https://github.com/pingcap/tikv), a distributed transactional Key-Value store?


Use it for large-scale HTAP! It's great for flexible use cases at high scale :)


What would be the use-case for multi-master clustering in embedded?


"real" serverless - mesh databases! ;p


This is honestly a use case I'm experimenting with, using a mix of CRDTs and OT. Our systems are becoming more and more location-agnostic, and I don't feel that our current data infrastructure is adequate to serve the workloads we're going to face as compute migrates to the edge.
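
As a taste of the CRDT side, here's a grow-only counter, one of the simplest CRDTs (a minimal sketch, not necessarily what I'm using): its merge is commutative, associative, and idempotent, which is what lets disconnected replicas converge without coordination:

    use std::collections::HashMap;

    // G-Counter: each replica only increments its own slot, and the
    // counter's value is the sum over all slots.
    #[derive(Clone, Default)]
    struct GCounter {
        counts: HashMap<String, u64>,
    }

    impl GCounter {
        fn increment(&mut self, replica: &str) {
            *self.counts.entry(replica.to_string()).or_insert(0) += 1;
        }

        fn value(&self) -> u64 {
            self.counts.values().sum()
        }

        // Take the per-replica max; applying merges in any order, any
        // number of times, yields the same state on every replica.
        fn merge(&mut self, other: &GCounter) {
            for (replica, &n) in &other.counts {
                let slot = self.counts.entry(replica.clone()).or_insert(0);
                *slot = (*slot).max(n);
            }
        }
    }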


It's modular, and there is a paxos implementation, but it has been built totally in simulation so far and I haven't plugged it into an IO layer yet. But this is trivial. That said, sled will always be a Bw-Tree index, and the other modular crates will stand on their own.


What do you think of FoundationDB?


I am a total devotee of their approach to building simulable systems, although I seek to push it even further and integrate lineage-driven fault injection from an early stage. I see a lot of cool things in what they have done, technically. Sled is free from day one.



