A quick question about the MIT course this student followed[1]. Am I correct in assuming that the way to follow this course would be to take the lecture notes, the assigned reading material, and simply read it before then moving on to the assignments? Just wanted to make sure I'm not missing some video/other content before I take a look into this stuff.
I think that is the recommended approach, but I just read the lecture notes and started working on the assignments. I tried to prioritize coding because it helps me retain knowledge better over time
When I was implementing SimpleDB in 2019, I believe CMU's course didn't have resources and lab assignments that were publicly available. Now CMU has published a full video lecture series (which MIT doesn't have) and their labs. So if I were starting again today, I would probably go with CMU's course.
Great work. I'm not an expert in database internals, but maybe someone can help answer some of these:
- Why not just use mmap and let the OS handle page caching?
- Why not use a write-ahead-log for all writes, with a background thread applying transactions to the read replica asynchronously? In most cases eventual consistency is fine and you don't need to query the latest version of the data, but this can be an optional query parameter.
- Instead of embedding an unpredictable SQL-to-code compiler, why not provide direct access to physical (relational algebra) operators via function calls? eg let me write: select([fields]).where(x=2).join(table) etc ... letting me select which index to use, how to do the joins etc.
Great questions! I'm not a database expert either but I can try answering these:
1) I think databases like to manage pages directly because the db can make more optimizations than the OS because the db has more context. For example, when aborting a transaction the db knows its dirty pages should be evicted (i'm not sure if mmap offers custom eviction). Also I believe if the db uses mmap, it loses control over when pages are flushed to disk. Flush control is necessary for guaranteeing transaction durability.
2) What you're describing here sounds similar to a LSM-tree database (e.g. RocksDB). They are used often for write-heavy workloads because writes are just appends, but they might not be great for read-heavy things.
3) This reminds me of PRQL[1] (which was trending on Hacker News last week) and Spark SQL. I'm not too familiar with this area though, so I can't really say why SQL was designed this way.
Another thing to consider is pluggable storage (a key/value interface) and pluggable query language (relational algebra interface?) and how to fit the two together.
Indeed, thanks for the link! We used this SimpleDB in university 15 years ago and implemented our own database nodes like hash joins, table scans, index scans, etc. as well as the paging logic and more. This was really useful to better understand ideas behind many databases.
It's cool to see this here, thanks for posting! I remember working with SimpleDB when taking a a database internals class. It was an interesting and engaging way to learn concepts like strict two-phase locking and Selinger-style query optimization.
Yeah, I think its a shame that most teachers don't give assignments like this that tie the big picture together with the low-level details. After students complete a big assignment like SimpleDB, they'll have a working artifact that they can reference for the rest of their career
I think the main issue that universities face is time. There is only so much time in each semester and, as we all know, building and improving on a database is a lifelong task.
I love seeing these educational databases. I'm still waiting for one to implement some form of MVCC, every one so far always seems to use 2 phase locking.
Thanks! I'd also like to see a MVCC implementation because that would shed some light on how mysql works. I'll add MVCC to the future improvements section
RisingLight, SimpleDB by Edward Sciore, SimpleDB for the MIT course are the main ones that come to my mind immediately. I forgot that Erik Grinaker's ToyDB implements MVCC but I remember it being hard to grok and not very straightforward.
Nice, although this will not build unless you already have javafx in your classpath. The only use of javafx seems to be javafx.util.Pair. Source version 1.5 is also deprecated if your Java version isn't ancient.
And good choice to support transactions up front. Adding that in later is pretty challenging.
Mine doesn't support transactions and refactoring to support it would take a while. :(
If HN-ers are interested in more database Internals content check out /r/databasedevelopment too.
[0] https://notes.eatonphil.com/database-basics-indexes.html