Database Gyms [pdf] (cidrdb.org)
112 points by greghn on June 19, 2023 | 8 comments



[first author here] I'm not sure why this is on the front page. Speaking only on my own behalf, I like to think of this as a paper that's motivated by problems that I kept running into while re-implementing papers related to self-driving database systems [0] research.

My TLDR would be: existing research has focused on trying to develop better models of database system behavior, but look at recent trends in modeling. Transformers, foundation models, AutoML -- modeling is increasingly "solved", as long as you have the right training data. Training data is the bottleneck now. How can we optimize the training data collection pipeline? Can we engineer training data that generalizes better? What opportunities arise when you control the entire pipeline?

Elaborating on that, I think you can abstract existing training data collection pipelines into these four modules:

- [Synthesizer]: The field has standardized on the use of various synthetic workloads (e.g., TPC-C, TPC-H, DSB) and common workload trace formats for real-world workloads (e.g., postgres_log, MySQL general query log). Research on workload forecasting and dataset scaling exists. In 2023, why can't I say "assuming trends hold, show me what my workload and database state will look like 3 months from now"?

- [Trainer]: Given a workload and state (e.g., from the Synthesizer), existing research executes the workload on the state to produce training data. But executing workloads in real time kind of sucks. If you have a workload trace that's one month long, I don't want to wait one month for training data. But I can't just smash all the queries together either, because that wouldn't be representative of actual deployment conditions. So right now, I'm intrigued by the idea of executing workloads faster than real time. Think of a fast-forward button on physics simulators, where you can reduce simulation fidelity in exchange for speed. Can we do that for databases? I'm also interested in playing tricks to help the training data generalize across different hardware, and in general, there seems to be a lot of unexplored opportunity here. Actively working on this!

- [Planner]: Given the training data (e.g., from the Trainer) and an objective function (e.g., latency, throughput), you might consider a set of tuning actions that improve the objective (e.g., build some indexes, change some knob settings). But how should you represent these actions? For example, a number of papers one-hot encode the possible set of indexes, but (1) you cannot actually do this in practice because there are too many possible indexes, and (2) you lose the notion of "distance" between your actions (e.g., indexes on the same table should probably be considered "related" in some way). Our research group is currently exploring some ideas here.

- [Decider]: Finally, once you're done applying all this domain-specific stuff to encode the states and actions, you're solidly in the realm of "learning to pick the best action" and can probably hand it off to an ML library. Why reinvent the wheel? :P That said, you can still do interesting work here (e.g., UDO is intelligent about batched action evaluation), but it's not something that I'm currently that interested in (relative to the other stuff above, which is more uncharted territory).
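
To make the Planner point about "distance" concrete, here is a minimal Python sketch. It is not taken from the paper: the `IndexAction` class, both distance functions, and the table and column names are hypothetical, and the structured metric (different tables are maximally far apart, same-table indexes are compared by column overlap) is just one assumed way to encode "relatedness".

```python
from dataclasses import dataclass
from math import sqrt
from typing import Tuple


@dataclass(frozen=True)
class IndexAction:
    """Hypothetical 'build an index' tuning action (illustrative only)."""
    table: str
    columns: Tuple[str, ...]


def one_hot_distance(a: IndexAction, b: IndexAction) -> float:
    # Two distinct one-hot vectors differ in exactly two positions,
    # so every pair of different actions is sqrt(2) apart.
    return 0.0 if a == b else sqrt(2)


def structured_distance(a: IndexAction, b: IndexAction) -> float:
    # Assumed metric: indexes on different tables are maximally distant (1.0);
    # indexes on the same table are at most 0.5 apart, scaled by how
    # little their column sets overlap (Jaccard distance).
    if a.table != b.table:
        return 1.0
    cols_a, cols_b = set(a.columns), set(b.columns)
    union = cols_a | cols_b
    if not union:
        return 0.0
    jaccard = len(cols_a & cols_b) / len(union)
    return 0.5 * (1.0 - jaccard)


# Hypothetical actions on made-up tables.
by_customer = IndexAction("orders", ("customer_id",))
by_customer_date = IndexAction("orders", ("customer_id", "created_at"))
by_sku = IndexAction("items", ("sku",))
```

With one-hot encoding, `by_customer` is exactly as far from `by_customer_date` as it is from `by_sku`. The structured distance places the two `orders` indexes closer together, which is the kind of signal a downstream learner could exploit.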

If anyone is at SIGMOD this week, I'm happy to chat! :)

[0] https://db.cs.cmu.edu/papers/2017/p42-pavlo-cidr17.pdf


tl;dr - paper by Andy Pavlo's team. Conclusion:

> Most of the previous work in using ML for DBMS automation has focused on designing better ML models of DBMS behavior, but recent advances in ML have largely automated model design. The challenge now is to obtain good training data for building these models. This paper outlined the architecture of the database gym, an integrated environment that generates training data by using the DBMS to simulate itself at the highest possible fidelity.


I had a feeling this was OtterTune's product through a reading of the abstract before I noticed the names. Really interesting and excited to see where they take it.

P.S. Brent, you helped kickstart my career when I was exposed to your early contributions on StackOverflow and MetaStackOverflow. By far the largest accelerant I ever had on the DBMS front. Really cool seeing you around, so I wanted to send a friendly thank you.


> I had a feeling this was OtterTune's product through a reading of the abstract before I noticed the names.

No, this project is separate from OtterTune. I keep our CMU research strictly firewalled from OtterTune for legal reasons.

The Database Gym project rose up from the ashes of the NoisePage self-driving DBMS project. See my comment from last week about why it failed:

https://news.ycombinator.com/item?id=36355963


A random aside: I truly thank you for your course. You're the teacher I wish I had in my bachelor's 8 years back. You single-handedly revived my interest in learning databases. :)

Through your course, I reached a stage where I can contribute to OSS projects and their features.


> so I wanted to send a friendly thank you.

Awww, thanks! That's awesome to hear! My big goal is to make other people's journeys through data easier.


I can second his appreciation. Posts on your site helped me keep my sanity as a junior developer who stood too close to a database server.


For context, Andy leads OtterTune, an AI-powered database tuner.



