A few weeks ago, we wrote about how we implemented SIMD instructions to aggregate a billion rows in milliseconds [1], thanks in large part to Agner Fog’s VCL library [2]. Although the initial scope was limited to table-wide aggregates into a single scalar value, this was a first step towards very promising results on more complex aggregations. With the latest release of QuestDB, we are extending this level of performance to key-based aggregations.
To do this, we implemented Google’s fast hash table, aka “Swisstable” [3], which can be found in the Abseil library [4]. In all modesty, we also found room to slightly accelerate it for our use case. Our version of Swisstable is dubbed “rosti”, after the traditional Swiss dish [5]. There were also a number of improvements thanks to techniques suggested by the community, such as prefetch (which, interestingly, turned out to have no effect in the map code itself) [6]. Besides C++, we used our very own queue system written in Java to parallelise the execution [7].
The results are remarkable: millisecond latency on keyed aggregations that span over billions of rows.
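Keyed aggregations here mean queries of this shape; both of these run against the demo's taxi dataset:

    -- grouped implicitly by the non-aggregated column
    select cab_type, count() from trips;
    select passenger_count, avg(total_amount) from trips;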
We thought it could be a good occasion to show our progress by making this latest release available to try online with a pre-loaded dataset. It runs on an AWS instance using 23 threads. The data is stored on disk and includes a 1.6 billion-row NYC taxi dataset, 10 years of weather data with around 30-minute resolution, and a decade of weekly gas prices. The instance is located in London, so folks outside of Europe may experience different network latencies. The server-side time is reported as “Execute”.
We provide sample queries to get started, but you are encouraged to modify them. However, please be aware that not every type of query is fast yet. Some still run under an old single-threaded model. If you hit one of these, you’ll know: it will take minutes instead of milliseconds. But bear with us: it is only a matter of time before we make these instantaneous as well. Next in our crosshairs are time-bucket aggregations using the SAMPLE BY clause.
If you are interested in checking out how we did this, our code is available open-source [8]. We look forward to receiving your feedback on our work so far. Even better, we would love to hear more ideas to further improve performance. Even after decades in high performance computing, we are still learning something new every day.
Cool project! You posted a comparison to ClickHouse on the front page. Do you have any insights on how QuestDB compares performance-wise to PrestoDB?
Also, PrestoDB is known to be manageable at scale, for example as part of AWS Athena. What are your thoughts on building out an Athena-like service using QuestDB?
We will definitely make QuestDB distributed in the near future. Today, our focus is to extract the maximum performance out of a single node. If we start scaling out too early, it may become harder to improve performance after the fact. So all improvements we get today are foundational: they will have multiplicative effects once you start to scale out.
What BI tools can currently connect to QuestDB for ad-hoc reporting (without the need to write SQL queries by hand)? I guess popular solutions like Tableau/Power BI don't support QuestDB.
Is there demand for BI tools that can use QuestDB as a data source? I'm asking because we could add a connector for QuestDB to our BI tool (SeekTable) with little effort, but this only makes sense if someone will actually use it.
I cannot speak for the whole industry, but I know for sure that I (working in big data / online marketing) was looking for exactly this recently: a BI tool that is agile and a database that is fast. I ended up with Tableau Server and ClickHouse (which is impressively fast). The problem was: Tableau does not really support CH, and Tableau is not that agile. So to answer your question: yes. If QuestDB is that fast, there is a demand.
JFYI, SeekTable already has a connector for ClickHouse :) It works over the native binary protocol (with connection-level compression). Since you have control over the SQL query and the SQL expressions used for dimensions/measures, the ClickHouse SQL dialect is not a problem.
That's a very cool demo, good job! I have a few questions if you don't mind.
- Could you share the specs of the hardware powering the demo?
- For IoT-type usage, one common access pattern is "get the readings of sensor X from device Y in the given time frame". But unfortunately, the data set in the demo doesn't include something like a "cab_id" column that would allow me to simulate such a query. Is there a demo that has such a column?
- Does QuestDB require rows to be inserted strictly in order, or is there some leeway?
- Are you using any sort of compression? Time-series columns are extremely compressible, and a good compression algorithm can greatly reduce IO wait times. We're using the Postgres-based TimescaleDB at work, and its compression capabilities have been a game changer [0].
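For reference, turning Timescale's native compression on looks roughly like this (a sketch; the hypertable and column names are made up):

    -- mark a hypothetical hypertable as compressible, segmented by device
    ALTER TABLE readings SET (
        timescaledb.compress,
        timescaledb.compress_segmentby = 'device_id'
    );
    -- older chunks are then compressed via a policy or compress_chunk()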
Thanks for your feedback, I am pleased to know that you like it!
- This is a c5.metal instance (AWS) with 48 physical cores and 192 GB of memory. We use only 24 of those 48 cores.
- If I understand your example well, a query like this should be similar:

    select * from trips where trip_type = 'Dispatch' and pickup_datetime = '2018-03' and vendor_id = 'VTS';

There is no limit on how many filters you can specify in the WHERE clause, and doing so will have a very minor (if any) impact on performance.
- Currently yes, but this is something we are working on. Out-of-order insertion is one of our priorities, and you can expect it to be implemented in the very near future.
- Currently no. However, the way data is stored means that columns are extremely compressible, as you mentioned. Our internal API (how we store and access/retrieve columns) would make it easy for us to implement compression, and we will do it if it is important to our users. In the meantime, a filesystem like BTRFS or ZFS would do a nice job (if disk usage is your concern).
The use of a WHERE clause right now turns off vectorization completely and falls back on simple single-threaded execution. This is the reason why the query is slow right now. We are going to vectorize 'where' really soon.
How timely! I've done a deep dive into column store databases for the past couple of weeks. Reading through the Quest docs, I'd give it the following characteristics. Are these accurate?
- single node database, not [yet] distributed
- primary focus is time-series data, specifically in-order time series data (the `designated timestamp` extension)
- physical data layout is an append-only column store
- Implements a small subset of SQL with some affordances for time series (LATEST BY, SAMPLE BY).
- Doesn't support explicit GROUP BY or HAVING clauses.
Instead, QuestDB implicitly assumes GROUP BY or HAVING based on the presence of aggregation functions in the select clause (see the example after this list).
- Small standard library of functions: only 4 text functions.
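For example, this query (which also appears elsewhere in the thread) groups by cab_type with no explicit GROUP BY:

    -- implicit grouping by the non-aggregated column;
    -- standard SQL equivalent: select cab_type, count(*) from trips group by cab_type
    select cab_type, count() from trips;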
Based on these characteristics, it seems QuestDB is well positioned against Influx. It's probably faster than TimescaleDB but significantly less flexible, given that Timescale has all of Postgres behind it. QuestDB might eventually compete with ClickHouse, but that's a long way out given that it's not distributed and implements a much smaller subset of SQL.
I'd love to get any insight into join performance. Quite a few column stores handle large joins poorly (ClickHouse, Druid).
Regarding the theme of joins, I would just add that QuestDB also has a concept called SYMBOL, which stores repetitive strings in a table as integers behind the scenes while letting the user manipulate them as strings. This means that instead of storing something like 'sensor_id' and then looking up the corresponding 'sensor_name' via a join, you can manipulate 'sensor_name' directly and skip the join.
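A sketch of what that looks like (the table and column names here are hypothetical):

    -- sensor_name is stored as an integer behind the scenes
    CREATE TABLE readings (ts TIMESTAMP, sensor_name SYMBOL, temperature DOUBLE) timestamp(ts);
    -- filter on the human-readable name directly, with no join to a lookup table
    SELECT ts, temperature FROM readings WHERE sensor_name = 'sensor_42';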
If you're interested in a columnar database that can handle joins, Microsoft has one. It is called Azure Data Explorer (Kusto), and it is really fast. Here's the usual taxi rides bench; you can see joins at the end:
https://yonileibowitz.github.io/kusto.blog/blog-posts/analyz...
Interesting, I hadn't come across Kusto. Looks like a competitor to Snowflake, perhaps with stronger analysis capabilities. Do you know if there's an academic paper? I found a white paper: https://azure.microsoft.com/mediahandler/files/resourcefiles...
> Generally we provide both row and column based access despite being column store. So anything a traditional database does we can do too.
Does QuestDB use the same physical data layout for row-based access? My understanding was that providing an entire row of data from a columnar file is expensive because you have to assemble the row from num_columns different slices of the columnar file.
The data layout is indeed the same for both cases. We don't assemble rows as such. In QuestDB, a row is a collection of data addresses, each of which is calculated lazily from the row_id using a bit shift (for an 8-byte column, for instance, the offset would be row_id << 3).
Greenplum DB is open source, based on Postgres, and is a proper sharded OLAP database in the vein of Teradata. Its answer to these types of large analytic queries is that it stores the answers, and as long as the underlying table data doesn't change, the query can run at near-zero cost. I believe the same approach is probably in use here.
Shame generate_series() isn't a valid function here; joining to an artificial table is a classic way to break the pre-aggregated values and force a recalculation (sketch below).
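Roughly, the trick looks like this in Postgres (the join column is hypothetical):

    -- joining to an artificial series defeats cached/pre-aggregated answers
    -- and forces the engine to recompute over the base table
    SELECT g.n, count(*)
    FROM generate_series(1, 10) AS g(n)
    JOIN trips t ON t.passenger_count = g.n
    GROUP BY g.n;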
Ask HN: what’s the market for a new database? Who do you build it for, and how do you sell it? Asking as an uninformed user who hasn’t needed a niche database.
Valid question. I was the guy who embraced and optimized a full offering in our org, first on Redshift and then on Snowflake. Even merely evaluating a new DB (on top of actually parsing their landing pages to understand what does what) for our data needs is exhausting, to say the least. I still occasionally get enthusiastic about some cool new DB (most recently Materialize) and spend a weekend trying to spin it up and load our data, but I invariably walk away feeling Redshift/Snowflake is fine/better, while simultaneously feeling guilty that I didn't give the tool a real chance at all.
In the end, I'm probably going to give up on trying new DB ideas unless their pitch absolutely blows me away. I've gotten much better ROI just doubling down on optimizing our code to fit these popular offerings (which continue to add features every day) to our use cases and just throwing a bit more money at it.
For example, we actually query a 115 billion-row, 1.5 TB table in Snowflake, and barring some cluster warm-up time, queries that return less than a million rows run in a couple hundred ms max. Until the day we need this to speed up to just a few ms, I'll probably stick here.
If you’re talking about time series databases then there are a number of them used in industrial settings called “process historians”, for instance, OSI PI.
Very cool. Major props for putting this out there and accepting arbitrary queries.
Couple comments / questions:
- Correct that there’s no GROUP BY support?
- EXPLAIN or similar would be nice, both to get a peek at how your engine works & to anticipate whether a query is using parallel / SIMD execution and will take milliseconds vs. minutes.
Thank you! There is no 'GROUP BY' keyword support; the optimiser works out group-by queries without it. We are planning to put a no-op there, just to be nice to finger memory.
EXPLAIN would be quite boring in our case; we use a myriad of methods to make queries fast. We would love to keep up the performance so that no one ever needs to be interested in the query plan.
Check the example queries; one of them has aggregation and simple column selection in the same query, with an "order by" clause at the end.
I can't tell from the output whether there is an implicit "group by" going on over the non-aggregated columns, but e.g. this query produced some results without any errors:
The SQL optimiser works out that this is a cross join and renders the execution path as such. The cross join implementation is streaming, with a pre-determined result set size. We would not be able to render this much data this quickly without the optimiser :)
I'm seeing the same results on my end for that query. I'm a little puzzled: when I use the same query but with != instead of = to produce a 1.6 trillion-row result, it's about 3-4ms of execution. I don't know why it's taking 10,000x longer using = instead of !=.
Something about how timestamp comparisons are handled?
This seems very similar to VictoriaMetrics. VictoriaMetrics is very much based on the design of ClickHouse and currently shows best-in-class performance numbers for time series data; it would be a lot more interesting to see a comparison to VictoriaMetrics than to ClickHouse (which is not fully optimized for time series).
VictoriaMetrics is Prometheus-compatible, whereas QuestDB now supports Postgres compatibility. Both are compatible with InfluxDB.
I'm not much of a database expert by any stretch, but this query took about 80 seconds, which seems like quite a long time; maybe it's more complicated than it appears? My understanding is that the group by is handled automatically, and the results seem to support that:
    select cab_type, payment_type, count() from trips;
I was looking for dev/production specs for QuestDB the other day and didn't see them in the docs. Being that it's in Java, which can be quite memory-hungry, what's the minimum and recommended RAM/CPU required to run it?
Our startup sequence is sub-optimal. We query the JDK to index the timezone database, and the JDK generates a tonne of garbage in the process. The whole feat requires roughly 500MB of RAM. Once out of the startup sequence, all of this is collected and we can operate in a 128MB heap.
The data itself is memory-mapped. Columns are kept as primitives, so they take as much memory as their unit size times the row count; a DOUBLE column over the 1.6 billion-row trips table, for instance, maps about 12.8 GB.
Thanks for the explanation, makes sense! Sounds like I could get away with 1GB RAM for development and probably want a bit more of a buffer to be safe in production.
WHERE queries are slow indeed. This is not because of the load, but rather because WHERE clauses are not yet implemented on the fast path. Currently they run under an old, single-threaded model, which is why they are painfully slow.
While HN load is high, this is just because of a poor implementation on our side. Filters are single-threaded and not yet vectorized in any way. This will be fixed eventually to bring them in line with the queries we've already optimised.
It looks very promising, congrats. I use ClickHouse in production and I'd love to see how this project evolves.
My main disappointment is the number of aggregation functions: https://questdb.io/docs/functionsAggregation
ClickHouse provides hundreds of functions, many of which I use.
It would be hard to even consider QuestDB with this number of functions.
I'll stay tuned, anyway.
Keep up the good work!
Thanks! Indeed we are still light on available aggregations. Right now we are focused on building a fast infrastructure to support them, for example Swissmap / "rosti" for keyed aggregations. Next steps will include multi-pass algorithms. Once the infrastructure is done (and fast), adding functions will be much easier.
Any plans for geo support? Looks like you have lat/lon stored as two independent doubles, which does not lend itself well to any sort of geo operations.
Yes, this is part of the roadmap. Right now, we don't have any interesting query to showcase that would use these columns - they are here because they are part of the original dataset and we are keeping them for when we demo further releases.
Ordering would be memory-intensive. There are configurable artificial limits in place on this demo instance, just because the database takes internet traffic directly. In any other setting these limits can be relaxed, which is the default behaviour anyway.
Very impressive. I think building your own (performant) database from scratch is one of the most impressive software engineering feats. Can you let me know a little bit about how an interested person with a CS background can approach this daunting topic?
Databases can be daunting, but for a variety of reasons. In my experience, it's not so much the sheer number of features and functions that need to be implemented; it's that you have to make sure every component is fast and reliable. Ultimately, the software can only be as fast as the slowest part of the stack, so you are going to spend a lot of time hunting for the slow part and looking for ways to accelerate it.
More generally, the way I approached this was to break the project down into small pieces, making each as fast as it can be. This makes the problem more approachable. Also, you have to know and love how hardware works.
Could you take a moment to comment on why one would choose QuestDB in lieu of Google's Bigtable or BigQuery? Furthermore, how does this compare to other proprietary and open-source solutions currently available in the marketplace? In short, I'm struggling to understand where this technology fits and the overall value proposition.
Asked a different way: If I were a lead of IT, director-level executive, CTO, etc. how and why should I begin to evaluate the QuestDB service / solution?
Thank you for sharing, bluestreak and team, very interesting and exciting work!
The main value proposition is raw performance which translates into hardware efficiency and ultimately, costs at scale.
To get a rough idea of how we compare, you could try running `select cab_type, count() from trips` and `select passenger_count, avg(total_amount) from trips`. They are the equivalents of queries (1) and (2) of this benchmark [1]. In that benchmark, BigQuery took 2 seconds for each query.
Our server runs both in hundreds of milliseconds. And this is actually slow, because it is the same server that is currently being hammered by HN users. Also, we are scanning more rows: our dataset contains 1.6 billion rows, while the attached benchmark runs on 1.1 billion. Lastly, we only use one 24-core CPU in one server, while the top of this benchmark is clusters or GPUs.
Of course this is only an approximation, but I hope to be given the chance to feature in this benchmark when we are ready.
Timescale is built on top of Postgres. QuestDB has been built from the ground up. Both are SQL. QuestDB is primarily focused on performance, while Timescale has more features today.
Sharing some thoughts here, as I am currently developing a similar thing:
1. "Query 1.6B rows in milliseconds, live" is just like "sum 1.6B numbers from memory in milliseconds".
In fact, if full SQL functionality is not supported, a naive SQL query is just a tight loop on top of arrays (as partitions for naive data parallelism) and multi-core processors.
So this kind of query is just a several-line benchmark (ignoring the data preparation and threading wrapper) to see how quickly the sum loop can finish.
In fact, again, this is just naive memory-bandwidth benchmark code.
Let's count: a 6-channel Xeon-SP can now provide ~120 GB/s of bandwidth, so a sum loop over 1.6B uncompressed 4-byte ints in such a processor's memory could finish in about 1.6B x 4 bytes / 120 GB/s ≈ 53 ms.
Then, if you find that you get 200 ms in database xxx, you have in fact wasted 75% of the time (150 ms) on things other than your own home-brewed small C program for such a toy analysis.
2. Some readers would like to see comparisons to ClickHouse (referred to as CH below).
The fact is that CH is a little slow for such naive cases as these (as seen at the page [1] others have pointed to).
This is because CH is a real-world product. The optimizations here represent ten years of research and usage in the database industry, all of which is included in CH, and much, much more.
Can you hold such a statement in the title when you enable reading from persistent disk? Or when doing a high-cardinality aggregation in the query (imagine that a low-cardinality aggregation is just a tight loop + a hash table in L2)?
1. We do plan to support full SQL functionality; we have a pretty good subset already [1]. I think what we do is more than "naive memory bandwidth bench code", but I am happy to listen and, when the time is right, implement functionality/features that you think we are missing.
2. "Can you hold such statement in the title when you enable reading from persistent disk?" We already persist to disk; the numbers you see imply reading from disk. We do this by using memory-mapped files [2].
This is running on a single AWS c5.metal instance. It has 192GB of RAM, but we are using only 40GB at peak. The instance provides two 24-core CPUs, but we are only using one of them, on 23 threads. So far it's handling the flow of connections and requests from HN fine. We over-provisioned just in case, as we were not sure what to expect :-)
Thanks for replying. What is the size of the data for these 1.6B rides?
I'm assuming the data is on disk (SSD?) and not in-memory. Or, is Quest an in-memory DB?
Nice! It'd be cool to support everything from PostgREST, that way people can build API-based tools on top of it and directly take advantage of the performance increases: postgrest.org/
I'm working on a... spreadsheet thing or something that requires PostgREST, and it'd be really nice to simply point the PSQL_URI configuration underneath the hood at whatever database has my data. Then there's no need for ETLing back and forth.
PostgREST is an interesting project, thanks! I will check how good our support for it is. It really depends on how tailored to Postgres they are; if they only rely on the wire protocol, then it should be fine!
How nice of you to hijack the back button (to return to HN after some sample queries), despite that being how I and others found this in the first place.
I was really liking the product up until that point.
As someone else pointed out, and a QuestDB engineer responded, it's been noted and an issue has been created for it. The reasoning for it was actually solid.
We will find a way to fix this and have the grid scroll without problems at the same time. This is not a dedicated website; it is exactly the same web console that is packaged with our free product. It wasn't meant to run on the internet and confuse people. Sorry!
Sorry for the frustration, guys; this was not meant to be "something to lock people in". The rationale for doing this was to prevent macOS users from navigating back when scrolling the grid horizontally. I submitted a PR to remove this for now; in the future, we will implement a better solution that does not mess with the History API.
I'd keep it. I hate it when I'm working in a box and moving left outside the box takes me back; THAT is the annoyance, especially if I'm doing entry work. I'd always wondered WHY this happened.
I have mixed feelings about this feature, actually. It's super handy when browsing the web and going through a lot of pages in a short amount of time. When using proper web apps, it's a total pain; plus, sometimes the History API is not supported correctly, so you lose the context entirely (which means you have to start from scratch in whatever app you are using).
Gestures in general should be standard, cross-platform and cross-browser, with a nice API exposed to JavaScript. However, I understand why they're not: they were a mess on Linux until very recently, it's all over the place on Windows, support depends on the hardware, they are not very intuitive, etc.
I don't know anything about this database, but this link did break the browser <back> button. Perhaps you can look into fixing that in a future release.
Interesting. This little bug comes up a lot on HN and is usually just that: a bug. In this case, it seems that the "breaking" of the back button was intentional, based on the line comment [0]. What was the initial reasoning for that?
The reason was grid scrolling with the touchpad on OS X: when you swipe the grid from left to right, there is a 50% chance of losing the web console due to the back button being triggered by the same motion.
[1] https://questdb.io/blog/2020/04/02/using-simd-to-aggregate-b...
[2] https://www.agner.org/optimize/vectorclass.pdf
[3] https://www.youtube.com/watch?v=ncHmEUmJZf4
[4] https://github.com/abseil/abseil-cpp
[5] https://github.com/questdb/questdb/blob/master/core/src/main...
[6] https://github.com/questdb/questdb/blob/master/core/src/main...
[7] https://questdb.io/blog/2020/03/15/interthread
[8] https://github.com/questdb/questdb