Hacker News new | past | comments | ask | show | jobs | submit login
SimpleDB: A Basic RDBMS Built from Scratch (awelm.com)
210 points by hackthesystem on Jan 31, 2022 | hide | past | favorite | 51 comments



Nice work! In my experience adding (basic) support for indexes shouldn't be too hard [0].

And good choice to support transactions up front. Adding that in later is pretty challenging.

Mine doesn't support transactions and refactoring to support it would take a while. :(

If HN-ers are interested in more database Internals content check out /r/databasedevelopment too.

[0] https://notes.eatonphil.com/database-basics-indexes.html


Thanks! I do plan on adding indexes soon


A quick question about the MIT course this student followed[1]. Am I correct in assuming that the way to follow this course would be to take the lecture notes, the assigned reading material, and simply read it before then moving on to the assignments? Just wanted to make sure I'm not missing some video/other content before I take a look into this stuff.

[1] https://ocw.mit.edu/courses/electrical-engineering-and-compu...


I think that is the recommended approach, but I just read the lecture notes and started working on the assignments. I tried to prioritize coding because it helps me retain knowledge better over time


Appreciate you sharing your direct experience. Thank you!


How does this stack up against Intro to Database Systems from CMU?


When I was implementing SimpleDB in 2019, I believe CMU's course didn't have resources and lab assignments that were publicly available. Now CMU has published a full video lecture series (which MIT doesn't have) and their labs. So if I were starting again today, I would probably go with CMU's course.

CMU Intro to Databases Labs: https://15445.courses.cs.cmu.edu/fall2021/assignments.html

CMU Intro to Databases Lectures: https://www.youtube.com/playlist?list=PLSE8ODhjZXjZaHA6QcxDf...

BusTub - CMU's Version of SimpleDB: https://github.com/cmu-db/bustub


There is already Amazon SimpleDB database service[1] written in Erlang, but it's kind of deprecated in favor of Amazon DynamoDB.

[1] https://aws.amazon.com/simpledb/


Ah yes, by the time I realized this it was too late to rename unfortunately. And very neat how Amazon SimpleDB is written in Erlang!


Is there some kind of paper or public info that details the Erlang usage? I'd never heard that before!


> Amazon SimpleDB is a distributed database written in Erlang[1] by Amazon.com.

https://en.wikipedia.org/wiki/Amazon_SimpleDB

> Highly Available – It’s Amazon. Running Erlang. Whoa.

https://web.archive.org/web/20110623221347/http://www.satine...


Thanks - very interesting! I’d always assumed it was Java, like many AWS services.


For a half-moment, due to the way the title was capitalized, I thought this was going to be a RDBMS built in Scratch.


By a 9 year old as school project, in a war torn country, hiding in cave, running on an OG Raspberry Pi, powered by guano-fueled generator.


Iron Man's distant cousin, Business Intelligence Man


Stretched :D


OP I am curious about the project license. Since it is built on top of the assignment code, can you use your own copyright? - https://github.com/awelm/simpledb/blob/2e78bb2/LICENSE


Good point. I just removed my name from the license, but I couldn't find the original author(s) so the copyright name list is just empty now


Another very good “simple” database to look at is LMDB. Very very very fast, and is transactional (Implements MVCC)

http://www.lmdb.tech/doc/


Great work. I'm not an expert in database internals, but maybe someone can help answer some of these:

- Why not just use mmap and let the OS handle page caching?

- Why not use a write-ahead-log for all writes, with a background thread applying transactions to the read replica asynchronously? In most cases eventual consistency is fine and you don't need to query the latest version of the data, but this can be an optional query parameter.

- Instead of embedding an unpredictable SQL-to-code compiler, why not provide direct access to physical (relational algebra) operators via function calls? eg let me write: select([fields]).where(x=2).join(table) etc ... letting me select which index to use, how to do the joins etc.


Great questions! I'm not a database expert either but I can try answering these:

1) I think databases like to manage pages directly because the db can make more optimizations than the OS because the db has more context. For example, when aborting a transaction the db knows its dirty pages should be evicted (i'm not sure if mmap offers custom eviction). Also I believe if the db uses mmap, it loses control over when pages are flushed to disk. Flush control is necessary for guaranteeing transaction durability.

2) What you're describing here sounds similar to a LSM-tree database (e.g. RocksDB). They are used often for write-heavy workloads because writes are just appends, but they might not be great for read-heavy things.

3) This reminds me of PRQL[1] (which was trending on Hacker News last week) and Spark SQL. I'm not too familiar with this area though, so I can't really say why SQL was designed this way.

[1] https://github.com/max-sixty/prql?utm_source=hackernewslette...


1) Indeed you should only use mmap for reads afaik

2) Was thinking more of an event-sourcing model, whereby you log the SQL statements first, then update a B-Tree in the background.

Read via mmap, write by appending to a log and asynchronously applying the changes to the file.

3) Rather than yet another QL, expose a higher level API that I can target in any language


Another thing to consider is pluggable storage (a key/value interface) and pluggable query language (relational algebra interface?) and how to fit the two together.


Hey, this is really great. Fun fact - there is another SimpleDB also written in Java [1]. The book (out of print I think) is a good read.

[1] http://cs.bc.edu/~sciore/simpledb/


Indeed, thanks for the link! We used this SimpleDB in university 15 years ago and implemented our own database nodes like hash joins, table scans, index scans, etc. as well as the paging logic and more. This was really useful to better understand ideas behind many databases.


I thought for sure that this post was about this simpledb.


Thank you! I'll be sure to check out the book


Is that book what this course is based on?


It's cool to see this here, thanks for posting! I remember working with SimpleDB when taking a a database internals class. It was an interesting and engaging way to learn concepts like strict two-phase locking and Selinger-style query optimization.


Yeah, I think its a shame that most teachers don't give assignments like this that tie the big picture together with the low-level details. After students complete a big assignment like SimpleDB, they'll have a working artifact that they can reference for the rest of their career


I think the main issue that universities face is time. There is only so much time in each semester and, as we all know, building and improving on a database is a lifelong task.


I love seeing these educational databases. I'm still waiting for one to implement some form of MVCC, every one so far always seems to use 2 phase locking.



You're right, I forgot about this one. I think because when I last looked at it, the MVCC part wasn't very straightforward to me.


They are probably no longer just educational dafter implementing MVCC.


Thanks! I'd also like to see a MVCC implementation because that would shed some light on how mysql works. I'll add MVCC to the future improvements section


What other educational databases are you looking at? Curious in case I'm missing any of them.


RisingLight, SimpleDB by Edward Sciore, SimpleDB for the MIT course are the main ones that come to my mind immediately. I forgot that Erik Grinaker's ToyDB implements MVCC but I remember it being hard to grok and not very straightforward.


There is also BusTub from CMU which I stumbled upon earlier today:

https://github.com/cmu-db/bustub


LMDB implements MVCC and is pretty easy to read, the implementation is very tiny.


This is great!

Could you please clarify the license for your code? Thanks!


Thanks! Just added the MIT license


Nice, although this will not build unless you already have javafx in your classpath. The only use of javafx seems to be javafx.util.Pair. Source version 1.5 is also deprecated if your Java version isn't ancient.


Reminded me of ToyDB, written in Rust. https://github.com/erikgrinaker/toydb


Great overview from first principles of how an RDBMS works!


You have a naturally curious mind. Wish there were more students like yourself. Very nice work.


I learned a lot from your post, thanks for writing!


Thanks, really glad to hear that!


anyone knows something similar for a graph database ?


Nice work.

The only one comment to add is

Amazon has an Amazon SimpleDB.

So different name might be before for your branding.


Get this through Jepsen and it’ll be more dependable than most high-hyped DBMSs out there.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: