Foundations of Databases (1995) (inria.fr)
253 points by harperlee on April 14, 2021 | 53 comments


Plugging Andy Pavlo's database lectures @ CMU, which are completely free on YouTube. Great guy and great lectures.


Agreed. I finished watching them at the start of this year and it's significantly helped me in using and making choices about databases. I've also since started implementing my own database for learning off the back of them. Couldn't be more grateful for those courses being public.

link if anyone is interested: https://youtu.be/oeYBdghaIjc


Can you go wrong with a professor that says, "I fucking love databases"?

His videos are top-notch. Passion creates good quality.


Seconded. I have been watching them over the past two weeks - he's a brilliant guy and a great teacher.


And hands down the best music choice of all the CS lectures I've ever seen.


Can you please tell me what's unique/good about this class you're referring to?


If you want to understand how databases work under the hood, from the basics up to near-state-of-the-art, it's filled with tons of great information. I recommend reading Kleppmann's Designing Data-Intensive Applications and then watching Pavlo.


Previous HN post with comments:

https://news.ycombinator.com/item?id=19726520


From sec. 22.6 (temporal databases):

“Classical logics augmented with a temporal coordinate have been studied extensively, mostly geared toward specification and verification of concurrent programs. Such logics are usually referred to as temporal logics. There is a wealth of mathematical machinery developed around temporal logics; unfortunately, little of it seems to apply directly to databases.

“Although the view of a temporal database as a sequence of instances is conceptually clean, it is extremely inefficient to represent a temporal database in this manner. In practice, this information is summarized in a single database in which data is timestamped to indicate the time of validity. The timestamps can be placed at the tuple level or at the attribute level. Typically, timestamps are unions of intervals of the temporal domain. Such representations naturally lead to nested structures, as in the nested relation, semantic, and object-oriented data models.”
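
To make the tuple-level timestamping concrete, here's a rough sketch in Python (my own illustration, not from the book; the relation and the interval encoding are invented, and a real system would do this inside the storage engine):

  # Tuple-level timestamps: each tuple carries a union of [start, end)
  # validity intervals instead of the database storing one full
  # instance per time point.
  employees = [
      ("alice", "sales",       [(0, 5), (8, 12)]),  # left and rejoined
      ("alice", "engineering", [(5, 8)]),
      ("bob",   "sales",       [(3, 12)]),
  ]

  def snapshot(relation, t):
      """Reconstruct the classical (non-temporal) instance valid at time t."""
      return [(name, dept)
              for name, dept, intervals in relation
              if any(start <= t < end for start, end in intervals)]

  # The "sequence of instances" view is recovered on demand, without
  # ever materializing every instance:
  print(snapshot(employees, 4))  # [('alice', 'sales'), ('bob', 'sales')]
  print(snapshot(employees, 6))  # [('alice', 'engineering'), ('bob', 'sales')]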

I couldn’t find any instances of the words “epoch” or “epochal”, not even “mvcc”. As of 1995, MVCC at least must have been known fairly widely in academia, no? Wikipedia has Starkey’s VAX Rdb/ELN as the first commercial MVCC database (and VAXELN first release in 1981).

So their take on ‘time’ and data seems to be a bit narrow and literal. At the same time, searching the web for temporal logic + epoch also doesn’t turn up anything.

I find this curious. Can anyone shed light on this?


Can anyone recommend a good book specifically on distributed databases (not more general distributed systems stuff like e.g. Kleppmann's DDIA)?


In case someone isn't aware, Designing Data-Intensive Applications is a very good introduction to distributed databases, even if it doesn't specialize in them.


I’ll second this. It’s an excellent intro that’s very approachable.


Yeah I don't disagree, I've just read it already :)


Database Internals: A Deep Dive into How Distributed Data Systems Work by Alex Petrov


I got a lot of value out of

  Distributed Databases: Principles and Systems
  Stefano Ceri, Giuseppe Pelagatti
  McGraw-Hill, 1985
It adopts a very specific approach, and is obviously missing some newer techniques, but it certainly leaves you with a feeling of 'yeah, I can build this'.


Not a book... but the FoundationDB documentation (and source) might be worth looking at? https://apple.github.io/foundationdb/


Not a book, but the podcast series "Scaling Postgres" looks really good.


Not a book, but I would add to the foreword: "Try not to use them."


This is a valid comment that I'd expand to say "...without exhausting all the tools available."

For example, if you're facing lengthy search queries and are looking at partitioning the database to speed them up, many databases include helpful internal tools that should be attempted first. Proper indexing is a textbook all on its own, and the appropriate design of full-text search dictionaries and queries is another.

The one that shocked me a year or two ago was implementing full-text search in a gigantic SQLite DB. I was preparing to migrate it to PostgreSQL or Elasticsearch because my traditional "ilike" queries, even indexed, were ridiculously slow, and I was not aware FTS was an included library.

Although FTS pretty much doubled the size of the DB, a three-minute query dropped to a few seconds. Since read queries are non-blocking in SQLite, this allowed me to continue using the DB without needing a bigger, more complex solution.
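
For anyone who hasn't seen it, the setup looks roughly like this with SQLite's FTS5 extension (table and column names here are invented; FTS5 must be compiled into your SQLite build, which it is in most recent Python distributions):

  import sqlite3

  conn = sqlite3.connect("articles.db")
  conn.executescript("""
      CREATE TABLE IF NOT EXISTS articles (id INTEGER PRIMARY KEY, body TEXT);
      CREATE VIRTUAL TABLE IF NOT EXISTS articles_fts USING fts5(body);
  """)
  conn.execute("INSERT INTO articles (body) VALUES (?)", ("the quick brown fox",))
  # Mirror the text into the FTS index (a real setup would keep the two
  # in sync with triggers or FTS5's external-content tables).
  conn.execute("INSERT INTO articles_fts (rowid, body) SELECT id, body FROM articles")
  conn.commit()

  # Slow: LIKE with a leading wildcard can't use an ordinary index.
  print(conn.execute("SELECT id FROM articles WHERE body LIKE '%fox%'").fetchall())

  # Fast: an inverted-index lookup through the FTS5 MATCH operator.
  print(conn.execute("SELECT rowid FROM articles_fts WHERE articles_fts MATCH 'fox'").fetchall())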


Genuinely curious: What are some challenges with using distributed databases?


Distributed [decentralized] databases are basically the most complicated software you can play with that isn't classified/proprietary.

Read through all these Jepsen analyses: https://jepsen.io/analyses

Then consider that those analyses were done in controlled environments, using failures that the tests were written for. They don't capture all the garden-variety problems that happen when you are running 40 extra-extra-large instances on shitty non-cloud hardware and serving traffic to thousands of high performance apps on a globe-spanning network.

Things that seem stupid or obvious, like just expanding your storage, can be something these databases can't handle. Sometimes there are well-documented problems, like random data corruption, or replication that just stops and never starts again, or an inability to reconfigure a cluster without literally destroying it. So not only do you need to understand consensus algorithms and cutting edge functionality, you also have to become an operational expert on their quirks. Most are practically brand new, making them largely untested in large-scale scenarios, with features that haven't "baked" longer than a month in testing. There's no book to buy, and very few people to get support from.

Sometimes there's no getting around it and they are literally your only option to solve your problem. But it's not worth getting involved in if a non-distributed database can solve your problem.

On the other hand, if you are just a bored engineer and want to see some very big systems explode in dramatic and obscure ways, definitely use a distributed database.


Awesome! Thanks so much for the info.


Tons of complexity behind this, but the simple problem is source of truth. When you store data in multiple places and they become unsynchronized, which is the true answer? Within a monolithic system, you don't have differing clocks, network partitions that can be very weird, etc. A related problem is race conditions: what if you are reading one copy of the data while the other replica is being updated? Most complexity, imo, arises from this fundamental problem and the various techniques to deal with it.
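
A toy version of the "differing clocks" problem (completely invented; real systems reach for version vectors, quorums, or consensus rather than this naive scheme):

  # Naive last-write-wins reconciliation: each replica stamps writes
  # with its local clock, and on sync the higher timestamp wins.
  def reconcile(a, b):
      """Pick the (value, local_timestamp) pair with the larger timestamp."""
      return a if a[1] >= b[1] else b

  replica_1 = ("balance=100", 1000)  # written first, on a node whose clock runs fast
  replica_2 = ("balance=150", 999)   # written later, on a node with a slow clock

  # The genuinely newer write silently loses:
  print(reconcile(replica_1, replica_2))  # ('balance=100', 1000)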


I saw this book in Rich Hickey's Clojure Bookshelf: https://www.goodreads.com/list/show/137472.Rich_Hickey_s_Clo... (haven't read it myself, yet)

For some reason it is called "Foundations of Databases: The Logical Level" there.


Architecture of a Database System – Hellerstein, Stonebraker & Hamilton, 2007. It's a great paper focused on DB design and architecture. From 2007, but it still rocks. http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf


It's incredible how little attention is paid to data modeling and querying in the education of people entering the software engineering workforce.

Getting your database model right, on the logical and physical level, will make developing and deploying any data-driven app simpler and easier. Getting it wrong? No modern programming language or architectural pattern will save you from the worst kinds of bugs, workarounds and bad performance.

Do yourself a favor and read this or any other legendary DB book.


To wit: I made it through a master's in CS without a database class.

This reminds me of the famous Rob Pike quote:

"Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming."

I've often found that if I'm coding something and the code starts looking increasingly gnarly, rethinking the data structures / data model will clean up the code.
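
A trivial, made-up example of what I mean: index the data by the key the code actually looks things up with, and the gnarly part of the algorithm disappears:

  users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "grace"}]
  orders = [{"user_id": 2, "total": 40}, {"user_id": 1, "total": 15}]

  # Gnarly: a nested scan that only gets worse as requirements grow.
  for order in orders:
      for user in users:
          if user["id"] == order["user_id"]:
              print(user["name"], order["total"])

  # Rethought data model: a dict keyed by id makes the join self-evident.
  users_by_id = {u["id"]: u for u in users}
  for order in orders:
      print(users_by_id[order["user_id"]]["name"], order["total"])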


On that topic, Fred Brooks, author of The Mythical Man Month, said "Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious."


We had two books when I was doing my bachelor's in CS many moons ago (late 90's). One was by Silberschatz & Galvin and the other by Ullman. The first one was used for relational algebra, various normal forms, and introducing table design, keys, constraints, etc. The second was used to teach the theory for the internals of how a DB is implemented - B+ trees, deadlock handling, etc., etc. That was a hard course and it was mandatory.

I can see not electing to have databases as a course for a master's, though. A master's is meant to let you specialize and be more choice-driven.


> "Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming."

I'm saving this quote.

> I've often found that if I'm coding something and the code starts looking increasingly gnarly, rethinking the data structures / data model will clean up the code.

I've faced this over and over again. Writing algorithms is hard when the underlying data is not optimal. It simply invites writing complicated code and workarounds.


> I've faced this over and over again. Writing algorithms is hard when the underlying data is not optimal. It simply invites writing complicated code and workarounds.

This is actually one of the reasons I am moving more and more to functional programming for myself. I find that modeling the data as just data makes it easier to find a good representation. And that it usually results in simpler algorithms.


"Data structures, not algorithms, are central to programming.

Can we unify this to simply "data structures are essential to algorithms"? It's really weird to put these in opposition, when an algorithm with no data and data without an algorithm make no sense.

Data structures are encoded with use cases in mind; those use cases, at least at the very low level, are their algorithms.


I think programming is more than just algorithms. System design and architecture have a real impact on the longevity and maintainability of your system, and I'd argue that data models (not structures per se, the distinction being logical vs. physical) are foundational to a good design. In contrast, algorithms play a much more focused role.

Put differently:

> Data structures are encoded with use cases in mind; those use cases, at least at the very low level, are their algorithms.

As you've lampshaded, it's the use cases that are most fundamental. The data model should be designed to serve those use cases. The data structures and algorithms reflect the physical reality of the logical data model.


In the context of persistent databases, the schema is more important than the algorithms. You can change the algorithms on the fly, but the schema is fundamental.

The strength of the relational model is that it is separated from any specific algorithm.


There are underlying algorithms and data structures in databases (B-trees, indexing, WAL, etc.) and then there are application-level data structures and algorithms, like the schema and how you use it in your app.

So if we see data and algorithms disconnected, we're just thinking about different architectural layers.


Although it says data structures, isn’t it more important to talk about the structure of the data than the choice of data structure?


Every one of my data scraping projects is littered with files titled "database.db.bak1", "database.db.bak2"... for this exact reason. "Oh, I scraped 1000 pages and realized I missed an entire nested data structure... better blow away the DB and try again."


Copy pasting my comment[1] from an earlier thread.

A couple of years ago I spent quite some time trying to evaluate the tech stack (and general engineering culture) of merger/acquisition targets of my employer. It was quite a fun exercise, all said and done. I encountered all sorts: from a small startup team who had their tech more or less sorted out, to a largish organisation that relied on IBM's ESB, which exactly one person on their team understood!

I discovered this exact method during the third tech evaluation exercise. When the team began explaining various modules top-down, user flows, etc., I politely interrupted them and asked for their DB schema. It was just on a whim, because I was bored of the typical one-way session interrupted by me asking minor questions. Once I had the hang of their schema, the rest of the session was literally me telling them what their control and user flows were and them validating it.

Since then it's become my magic wand to understand a new company or team. Just go directly to the schema and work backwards.

Conversely, I've begun paying more attention to data modelling, because once a data model is fixed it's very hard to change; once enough data accumulates the inertia just increases, and instead of changing the data model (for fear of data migration, etc.) the tendency is to beat the use cases to fit the data model. It's not your usual fail-fast-and-iterate thing.

[1] https://news.ycombinator.com/item?id=24137997


Many years ago, my employer was transitioning ERP systems from...something CA-owned whose name escapes me and I'd probably have pretty awful memories dredged up if I actually went looking for it.

Anyway, we were lucky enough to have a very talented intern who was assigned the task of understanding the database schema behind the CA product in order to migrate our data to the new system.

It was...horrific. Absolute nightmare of spaghetti. I'd like to say I've seen worse, but I've never seen anything even in the same ballpark.

I think we finally gave up after weeks of digging into it and just started over with a mostly clean slate.


> I politely interrupted them and asked for their DB schema.

In order of importance:

1. DB schema

2. List of dependencies

3. List of 3rd party integrations

Depending on the domain, 2. and 3. can be switched.


This is very true, but also a hugely unfortunate reality. It amazes me that we have gotten this far with “object graph”-oriented data models backed by relational DBs. The two very seldom fit each other well.


Perhaps it’s a French thing, but I had hours and hours of data modeling, SQL, and database administration during my first two years of college, and a class about the foundations of databases during the fourth year, which ended with a project to build a basic relational database in Java.


> this or any other legendary DB book.

What are the other legendary DB books?


https://www.amazon.com/Modeling-Essentials-Third-Graeme-Sims...

My favorite. Not an easy read, it took me weeks to complete it.

https://www.amazon.com/Date-Database-Writings-2000-2006-Chri...

Relatively unknown, but a great book if you want to read the thoughts of DB heavyweight Christopher Date, who collaborated with Edgar Codd.

https://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Per...

The best book to start exploring the land outside the relational world. An easy read, can be completed in two to three days.


Was lucky enough to take Serge's class when he was visiting UC Berkeley back in the 90's. And yet I'm still writing SQL :( Curious to know what the state of the art in solving the then-thornier theoretical problems (mostly negation) is.


I think this was my textbook when I took a relational database class at UCSD Extension back in the 90s.


(1995)


Year programmers have ignored relational theory since:


I think to truly live up to the name "foundations" of databases, there would have to be some information as to how databases are implemented. For instance, how do you guarantee durability if someone trips over the power cable?
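
For what it's worth, the classic answer to the power-cable question is write-ahead logging: append the intended change to a log and fsync it before touching the data files, then replay the log on restart. A bare-bones sketch (invented, and glossing over checksums, torn writes, checkpointing, group commit, and everything else that makes real WALs hard):

  import os

  def write_durably(log_path, record):
      # 1. Append the intended change to the log first.
      with open(log_path, "a") as log:
          log.write(record + "\n")
          log.flush()
          # 2. Force it to stable storage *before* applying the change.
          #    If someone trips over the power cable now, the log survives.
          os.fsync(log.fileno())
      # 3. Only then apply the change to the actual data structures.

  def recover(log_path):
      # On restart, replay any logged changes that may not have been applied.
      if not os.path.exists(log_path):
          return []
      with open(log_path) as log:
          return [line.rstrip("\n") for line in log]

  write_durably("example.wal", "SET balance=150")
  print(recover("example.wal"))  # ['SET balance=150']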


We don't need "more theory". The vast majority of computer science theory goes unapplied in solving real-world problems. While I'm sure this is great for postgrad computer science students, it's not very useful for the everyday programmers out there.


>We don't need "more theory". The vast majority of computer science theory goes unapplied in solving real-world problems

The sheer amount of rewriting being done every day around the world because people didn't bother to understand what they were doing (and didn't think things through before trying to make them work) disagrees with this.


>We don't need "more theory".

>it's not very useful for the everyday programmers out there

So, programmers' talent is not being used to its fullest. Those who can't understand basic theory are easily condemned to always be mere users of tools, never improving them. They run a greater risk of becoming obsolete or devalued.


...and this is how we ended up with NoSQL.



