SimpleDB: A Basic RDBMS Built from Scratch

eatonphil · on Jan 31, 2022

Nice work! In my experience adding (basic) support for indexes shouldn't be too hard [0].

And good choice to support transactions up front. Adding that in later is pretty challenging.

Mine doesn't support transactions and refactoring to support it would take a while. :(

If HN-ers are interested in more database Internals content check out /r/databasedevelopment too.

[0] https://notes.eatonphil.com/database-basics-indexes.html

hackthesystem · on Feb 1, 2022

Thanks! I do plan on adding indexes soon

nstart · on Feb 1, 2022

A quick question about the MIT course this student followed[1]. Am I correct in assuming that the way to follow this course would be to take the lecture notes, the assigned reading material, and simply read it before then moving on to the assignments? Just wanted to make sure I'm not missing some video/other content before I take a look into this stuff.

[1] https://ocw.mit.edu/courses/electrical-engineering-and-compu...

hackthesystem · on Feb 1, 2022

I think that is the recommended approach, but I just read the lecture notes and started working on the assignments. I tried to prioritize coding because it helps me retain knowledge better over time

nstart · on Feb 1, 2022

Appreciate you sharing your direct experience. Thank you!

mathisonturing · on Feb 1, 2022

How does this stack up against Intro to Database Systems from CMU?

hackthesystem · on Feb 1, 2022

When I was implementing SimpleDB in 2019, I believe CMU's course didn't have resources and lab assignments that were publicly available. Now CMU has published a full video lecture series (which MIT doesn't have) and their labs. So if I were starting again today, I would probably go with CMU's course.

CMU Intro to Databases Labs: https://15445.courses.cs.cmu.edu/fall2021/assignments.html

CMU Intro to Databases Lectures: https://www.youtube.com/playlist?list=PLSE8ODhjZXjZaHA6QcxDf...

BusTub - CMU's Version of SimpleDB: https://github.com/cmu-db/bustub

nivertech · on Jan 31, 2022

There is already Amazon SimpleDB database service[1] written in Erlang, but it's kind of deprecated in favor of Amazon DynamoDB.

[1] https://aws.amazon.com/simpledb/

hackthesystem · on Feb 1, 2022

Ah yes, by the time I realized this it was too late to rename unfortunately. And very neat how Amazon SimpleDB is written in Erlang!

jen20 · on Feb 1, 2022

Is there some kind of paper or public info that details the Erlang usage? I'd never heard that before!

nivertech · on Feb 1, 2022

> Amazon SimpleDB is a distributed database written in Erlang[1] by Amazon.com.

https://en.wikipedia.org/wiki/Amazon_SimpleDB

> Highly Available – It’s Amazon. Running Erlang. Whoa.

https://web.archive.org/web/20110623221347/http://www.satine...

jen20 · on Feb 1, 2022

Thanks - very interesting! I’d always assumed it was Java, like many AWS services.

IggleSniggle · on Jan 31, 2022

For a half-moment, due to the way the title was capitalized, I thought this was going to be a RDBMS built in Scratch.

speed_spread · on Feb 1, 2022

By a 9 year old as school project, in a war torn country, hiding in cave, running on an OG Raspberry Pi, powered by guano-fueled generator.

nexuist · on Feb 1, 2022

Iron Man's distant cousin, Business Intelligence Man

rmbyrro · on Feb 1, 2022

Stretched :D

avinassh · on Feb 1, 2022

OP I am curious about the project license. Since it is built on top of the assignment code, can you use your own copyright? - https://github.com/awelm/simpledb/blob/2e78bb2/LICENSE

hackthesystem · on Feb 1, 2022

Good point. I just removed my name from the license, but I couldn't find the original author(s) so the copyright name list is just empty now

jsumrall · on Feb 1, 2022

Another very good “simple” database to look at is LMDB. Very very very fast, and is transactional (Implements MVCC)

http://www.lmdb.tech/doc/

akrymski · on Feb 1, 2022

Great work. I'm not an expert in database internals, but maybe someone can help answer some of these:

- Why not just use mmap and let the OS handle page caching?

- Why not use a write-ahead-log for all writes, with a background thread applying transactions to the read replica asynchronously? In most cases eventual consistency is fine and you don't need to query the latest version of the data, but this can be an optional query parameter.

- Instead of embedding an unpredictable SQL-to-code compiler, why not provide direct access to physical (relational algebra) operators via function calls? eg let me write: select([fields]).where(x=2).join(table) etc ... letting me select which index to use, how to do the joins etc.

hackthesystem · on Feb 1, 2022

Great questions! I'm not a database expert either but I can try answering these:

1) I think databases like to manage pages directly because the db can make more optimizations than the OS because the db has more context. For example, when aborting a transaction the db knows its dirty pages should be evicted (i'm not sure if mmap offers custom eviction). Also I believe if the db uses mmap, it loses control over when pages are flushed to disk. Flush control is necessary for guaranteeing transaction durability.

2) What you're describing here sounds similar to a LSM-tree database (e.g. RocksDB). They are used often for write-heavy workloads because writes are just appends, but they might not be great for read-heavy things.

3) This reminds me of PRQL[1] (which was trending on Hacker News last week) and Spark SQL. I'm not too familiar with this area though, so I can't really say why SQL was designed this way.

[1] https://github.com/max-sixty/prql?utm_source=hackernewslette...

akrymski · on Feb 2, 2022

1) Indeed you should only use mmap for reads afaik

2) Was thinking more of an event-sourcing model, whereby you log the SQL statements first, then update a B-Tree in the background.

Read via mmap, write by appending to a log and asynchronously applying the changes to the file.

3) Rather than yet another QL, expose a higher level API that I can target in any language

akrymski · on Feb 3, 2022

Another thing to consider is pluggable storage (a key/value interface) and pluggable query language (relational algebra interface?) and how to fit the two together.

HappyJoy · on Feb 1, 2022

Hey, this is really great. Fun fact - there is another SimpleDB also written in Java [1]. The book (out of print I think) is a good read.

[1] http://cs.bc.edu/~sciore/simpledb/

ohlookabird · on Feb 1, 2022

Indeed, thanks for the link! We used this SimpleDB in university 15 years ago and implemented our own database nodes like hash joins, table scans, index scans, etc. as well as the paging logic and more. This was really useful to better understand ideas behind many databases.

ddlutz · on Feb 1, 2022

I thought for sure that this post was about this simpledb.

hackthesystem · on Feb 1, 2022

Thank you! I'll be sure to check out the book

t8y · on Feb 1, 2022

Is that book what this course is based on?

albertzhong · on Feb 1, 2022

It's cool to see this here, thanks for posting! I remember working with SimpleDB when taking a a database internals class. It was an interesting and engaging way to learn concepts like strict two-phase locking and Selinger-style query optimization.

hackthesystem · on Feb 1, 2022

Yeah, I think its a shame that most teachers don't give assignments like this that tie the big picture together with the low-level details. After students complete a big assignment like SimpleDB, they'll have a working artifact that they can reference for the rest of their career

haggy102 · on Feb 2, 2022

I think the main issue that universities face is time. There is only so much time in each semester and, as we all know, building and improving on a database is a lifelong task.

zinclozenge · on Jan 31, 2022

I love seeing these educational databases. I'm still waiting for one to implement some form of MVCC, every one so far always seems to use 2 phase locking.

unusual-name · on Feb 1, 2022

https://github.com/erikgrinaker/toydb

zinclozenge · on Feb 1, 2022

You're right, I forgot about this one. I think because when I last looked at it, the MVCC part wasn't very straightforward to me.

jasfi · on Feb 1, 2022

They are probably no longer just educational dafter implementing MVCC.

hackthesystem · on Feb 1, 2022

Thanks! I'd also like to see a MVCC implementation because that would shed some light on how mysql works. I'll add MVCC to the future improvements section

eatonphil · on Jan 31, 2022

What other educational databases are you looking at? Curious in case I'm missing any of them.

zinclozenge · on Feb 1, 2022

RisingLight, SimpleDB by Edward Sciore, SimpleDB for the MIT course are the main ones that come to my mind immediately. I forgot that Erik Grinaker's ToyDB implements MVCC but I remember it being hard to grok and not very straightforward.

hackthesystem · on Feb 1, 2022

There is also BusTub from CMU which I stumbled upon earlier today:

https://github.com/cmu-db/bustub

jsumrall · on Feb 1, 2022

LMDB implements MVCC and is pretty easy to read, the implementation is very tiny.

java-man · on Jan 31, 2022

This is great!

Could you please clarify the license for your code? Thanks!

hackthesystem · on Feb 1, 2022

Thanks! Just added the MIT license

bjarneh · on Feb 1, 2022

Nice, although this will not build unless you already have javafx in your classpath. The only use of javafx seems to be javafx.util.Pair. Source version 1.5 is also deprecated if your Java version isn't ancient.

didip · on Feb 1, 2022

Reminded me of ToyDB, written in Rust. https://github.com/erikgrinaker/toydb

amilajack · on Jan 31, 2022

Great overview from first principles of how an RDBMS works!

FpUser · on Feb 1, 2022

You have a naturally curious mind. Wish there were more students like yourself. Very nice work.

ryanlpeterman · on Feb 1, 2022

I learned a lot from your post, thanks for writing!

hackthesystem · on Feb 1, 2022

Thanks, really glad to hear that!

alfonsodev · on Feb 1, 2022

anyone knows something similar for a graph database ?

timmit · on Feb 1, 2022

Nice work.

The only one comment to add is

Amazon has an Amazon SimpleDB.

So different name might be before for your branding.

nathell · on Jan 31, 2022

Get this through Jepsen and it’ll be more dependable than most high-hyped DBMSs out there.