This response illustrates an important point: if you're an expert in technology A and compare it to technology B, which you're not an expert in, the comparison is very likely to be unfair.
I would very much like to see vendors at least follow journalistic ethics and reach out to their competition before publishing, so others are given a chance to suggest optimizations.
>This response illustrates an important point: if you're an expert in technology A and compare it to technology B, which you're not an expert in, the comparison is very likely to be unfair.
To have fair benchmarks, one organization can manage the test requirements and submissions, while experts in a given technology submit their best efforts to be measured and compared against the others.
The comparison with the old version is actually in the article, for the patient reader. It could go to the top, but I don't think it would make a difference. At the end of the day, it's an article on the official QuestDB website, which already gives the reader a spoiler about the bias.
I am intrigued what Timescale is going to publish next.
Agree. And for a blog post it could even have a story like: "We compared with ClickHouse and we were 10x slower; then we looked at this case and made it 100x faster. Thank you, benchmark and ClickHouse developers, for showing us a use case where we could do better."
For me benchmarking is usually: "Why does this query take so long? We need to improve it." Sometimes by 1000x.
Right? How do the folks at QuestDB know that their new JIT engine is actually responsible for those performance improvements? My understanding is that, index or not, data is still sorted by time in QuestDB, which is exactly what the ClickHouse engineers are replicating in the new schema.
The query ClickHouse picked on does not actually leverage time order. Perhaps the ClickHouse folks on this thread can comment on the relevance of the date partitioning for this query. My best guess is that it might help the execution logic create data chunks for a parallel scan.
QuestDB also uses partitions for this purpose, but we calculate chunks dynamically based on available CPUs to distribute load across cores more evenly.
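The dynamic chunking idea can be sketched in a few lines of Python (a toy illustration of the compile-time-vs-core-count tradeoff, not QuestDB's actual code; the function name and minimum-chunk heuristic are assumptions):

```python
import os

def plan_chunks(row_count, workers=None, min_chunk=1024):
    """Split a scan over row_count rows into roughly equal chunks,
    one per worker, so the load spreads evenly across cores."""
    workers = workers or os.cpu_count() or 1
    # Ceiling division; enforce a floor so tiny partitions aren't
    # shredded into chunks where scheduling overhead dominates.
    size = max(min_chunk, -(-row_count // workers))
    return [(lo, min(lo + size, row_count))
            for lo in range(0, row_count, size)]

# e.g. a 1M-row partition on an 8-core box:
chunks = plan_chunks(1_000_000, workers=8)
```

Computing the chunk boundaries at query time, from the live core count, is what lets the split stay even regardless of how the data happens to be partitioned on disk.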
That's fair enough and I get the broader point about ClickHouse being rather inflexible wrt query performance. It still seems like the initial sorting key for CH would've been the worst possible one for all benchmark scenarios.
What an extremely unfair comment. Having read QuestDB’s blog, it’s quite clear they’ve taken great pains to point out that a single specific benchmark isn’t the be-all and end-all of DB analysis.
They quite clearly start out by saying they’re only looking to demonstrate the impact of a specific new DB feature they’ve created, and are using benchmarks that illustrate the difference. They make zero claims that QuestDB is faster than ClickHouse overall, and quite carefully point out that prospective users need to run their own benchmarks on their own data to figure out which DB will work for them.
I’m commenting specifically on the blog post provided by GP, which the parent comment made some pretty derogatory comments about.
I’ve made no attempt to deeply research what QuestDB has said elsewhere, because I don’t care. I don’t use, or have a need for, any of the products mentioned in any of the linked articles. I’m only interested in the narrow discussion of the original blog post provided by GP, to which the OP is replying.
I am in fact very proud of my team, who worked very hard on both the implementation and the article. It is disappointing to read unfounded insults when we made every effort to be fair.
I appreciate your benchmark and was interested to learn about how QuestDB processes TSBS queries efficiently. I work extensively with ClickHouse and it's always enlightening to learn about how other databases achieve high performance. Your descriptions of the internals are clear and easy to follow, especially since you included comparisons with older versions of QuestDB.
That said, I think I can understand how some users might be a little put off by the comparisons. Your article effectively says "ClickHouse is really slow" without giving readers any easy way to judge what was happening under the covers. I was personally a bit frustrated not to have the time to set up TSBS and dig into what was going on. I therefore appreciated Geoff's effort to look up the results and show that the default index choices didn't make a lot of sense for this particular case. That doesn't detract from QuestDB's performance, at least from my perspective.
Anyway congratulations on the performance improvement. As a famous character in Star Wars said, "we will watch your career with great interest."
I wonder what "every effort to be fair" means? The first thing you could have done is reach out to the ClickHouse community to ask for optimization suggestions.
"fair" means that we comparing apples to apples. Ad-hoc, unindexed predicate, compiled by QuestDB into AVX2 assembly (using AsmJIT) vs same predicate complied by Clickhouse (I'm assuming by LLVM). One can perhaps view this as comparing SIMD-based scans from both databases. Perhaps we generate better assembly, which incidentally offers better IO.
We all understand that creating a very specific index might improve a specific query's performance. Great, ClickHouse geared the entire table storage model to be ultra-specific for latitude search. What if you search by longitude, or some other column? Back to the beginning.
JIT-compiled predicates offer arbitrary query optimisation with zero impact on ingestion. This is sometimes useful.
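The compile-once idea behind JIT-compiled predicates can be sketched in stdlib Python (a hypothetical stand-in: QuestDB emits AVX2 machine code via AsmJIT, whereas this only captures compiling a filter expression once and then running it over many rows):

```python
def compile_predicate(expr):
    """'JIT' a filter expression string into a callable once,
    instead of re-interpreting the expression for every row."""
    code = compile(f"lambda row: ({expr})", "<predicate>", "eval")
    # Empty builtins: the compiled filter can only touch the row.
    return eval(code, {"__builtins__": {}})

rows = [{"lat": 10.5}, {"lat": 55.2}, {"lat": 41.0}]
pred = compile_predicate("row['lat'] > 40.0")
matches = [r for r in rows if pred(r)]  # the two rows above 40°
```

The point of the comment stands either way: because the predicate is compiled at query time, any ad-hoc filter gets the fast path, with no index to build or maintain at ingestion.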
What would you offer assuming that we reached out, other than creating an index?
ClickHouse does better than we do in other areas. It JITs more complicated expressions, such as some date functions. It optimises count() queries specifically: for example, we collect "found" row_ids in an array, while ClickHouse skips that for count(). We still have work to do. On the other hand, we ingested this very dataset about 5x quicker than ClickHouse, which we left out because the article is not about "QuestDB is faster than ClickHouse".
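The count() specialization described above boils down to skipping the row-id materialization. A toy sketch of the two paths (assumed shapes, not either engine's real code):

```python
def scan(column, pred, count_only=False):
    """Filter a column with a predicate.

    The generic path materializes matching row ids so later stages
    can fetch other columns for those rows; a count(*)-only query
    never needs the ids, so it can just tally the matches.
    """
    if count_only:
        return sum(1 for v in column if pred(v))
    return [i for i, v in enumerate(column) if pred(v)]

col = [3, 41, 7, 52, 18]
ids = scan(col, lambda v: v > 10)               # [1, 3, 4]
n = scan(col, lambda v: v > 10, count_only=True)  # 3
```

The saving is the allocation and writes for the id array, which for a highly selective count() over billions of rows is pure overhead.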
Doesn't matter, since that clearly wasn't the purpose of the article. After all, they were totally happy to add an index for another competing DB as long as they happened to win that comparison. Then they crow about how they beat having an index.
So maybe don't create specific scenarios for corner cases and then generalize the outcome? And write articles about the common scenarios that are important for people who will use the technology on a daily basis.
Full disclosure: I am CTO of QuestDB and I took part in the JIT implementation. The quote above is not mine; it was written by ClickHouse staff. The "utilizes its full indexing strategy" statement is false and is news to me.
Benchmarks are useless in 90% of cases. They can maybe give some baseline understanding, but it's important to always do your own benchmarks, since your performance can be very different from what the benchmark showed, simply because the data/data structures are slightly different.
I wouldn't go as far as to say that benchmarks are useless, but I agree that when looking at a benchmark, it's important to be aware of how similar its data distribution and query patterns are compared to your own.
That has the confounding variable of how good you are at configuring each database.
I can configure Postgres fairly well. I have little chance of knowing if I’m getting good performance out of most others without a serious time investment.
Sounds like they didn't re-run the QuestDB benchmark with the same change to the indexes, so their claim is that ClickHouse with a specific index is 27x faster than QuestDB without that index. Which is not a fair comparison.
Also, the tone of the post sounds really arrogant. They try to hide it a bit, I feel, but it just seeps through.
It's also part of a longer trend of saber rattling between these vendors - there's a history of these types of posts also from TimescaleDB: https://news.ycombinator.com/item?id=29096541
Well, I don't know how QuestDB works, and I couldn't find anything in the original benchmark, but presumably they already have some sort of (geo)index in place? It's really strange to search geo-data by scanning the whole surface of the Earth. The point that ClickHouse outperforms this by just sorting on one axis (without even using any fancy 2D indices) is reasonable.
Why does the lack of indexes matter? Especially when the size on disk is so much higher? Defining a sensible index isn't an unreasonable or daunting task, and minimal effort in CH got a 4x speedup over QuestDB. "It's faster if you invest literally zero time making it efficient" doesn't offer any practical benefit to anyone.
If it was demonstrated that Quest did a better job overall in the majority of cases where an optimization would have been missed, that's one thing. But this feels awfully nitpicky.
The article is not _just adding an index_. It embeds one of the search fields in the table's _primary key_. That likely means the whole physical table layout is tailored to that single specific query.
While that can help win this particular benchmark, it's questionable whether it's usable in practice. Chances are an analytical database serves queries of various shapes. If you only need to run a single query over and over again, you might be better off with a stream processing engine anyway.
The primary key is, in effect, an index. Specializing on the latitude field of a table of geographic data seems like an incredibly small thing to nitpick.
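The tradeoff being debated here — a sort key makes range filters on that one column cheap, while every other column still needs a full scan — can be seen in a toy stdlib sketch (hypothetical data, not the benchmark's):

```python
import bisect
import random

random.seed(0)
# 100k (lat, lon) rows, physically sorted by latitude — the rough
# analogue of putting latitude in the ClickHouse sorting key.
rows = sorted((random.uniform(-90, 90), random.uniform(-180, 180))
              for _ in range(100_000))
lats = [lat for lat, _ in rows]

# Latitude filter: binary search jumps straight to the 40°–50° band.
lo = bisect.bisect_left(lats, 40.0)
hi = bisect.bisect_right(lats, 50.0)
lat_band = rows[lo:hi]

# Longitude filter on the same layout: nothing to exploit, full scan.
lon_band = [r for r in rows if 40.0 <= r[1] <= 50.0]
```

Whether that asymmetry is a nitpick or a real limitation depends on whether the workload ever filters on anything other than the chosen sort column.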
I was curious to hear more details about this statement: "while QuestDB utilizes its full indexing strategy to read just a tiny fraction of the actual data". Did QuestDB create indexes in their QuestDB benchmark but just not mention it? Are there geo-indexes which are automatically enabled and which do help (but are of less value in the general sense, from ClickHouse's perspective)?
I don't know how QuestDB is implemented in any detail, but this statement struck me as confused. My understanding is that for this query, QuestDB is performing a full scan of the relevant columns, and the point of the blog post was how fast their JIT engine for filtering makes this.
There were two queries in the QuestDB benchmark over the same table. ClickHouse didn't even try to match both of them, choosing one as a victim. I guess that's what happens when you optimise the data storage for one query.
Is there an existing named adage for something like "if one creates a benchmark in order to rank general performance of some products, some of those products will ultimately sacrifice general performance in order to optimize for that benchmark"?
It's also why the only true benchmark is using the thing as it needs to be used - but this is hard to compare because often you need code to work with the tool and vice-versa.
There are the TPC benchmarks, which try to cover a wide variety of use cases and scenarios and are designed independently of any one engine: https://www.tpc.org/information/benchmarks5.asp
That is partially true, but these benchmarks force the schema. You can't reorganise the data, for example into a wide table, or add indices. So it doesn't actually show you how to use the system to solve this type of problem in the best way possible; it checks unoptimised results, as if you'd never learn and never utilise the best practices of the DBMS you chose for production.
As long as vendors are presenting the results of their own products, I'd expect them to denormalize and tune the queries to leverage every advantage their engine can possibly bring to bear on the benchmark's problem set, so long as they're transparent about what they did. ClickHouse apparently does quite well on TPC-DS according to: https://aavin.dev/tpcds-benchmark-on-clickhouse-part2/ but I'd love to see a more official result that the CH engineers would stand behind.
I have a ton of respect for the clickhouse team. We ran a massive cluster storing trillions of events across thousands of servers for years to serve a real-time reporting/ api use case and the tech is blazing fast and never once let us down.
I agree. I actually think it's ideal for something we need at work (basically storing a crap ton of logs) but it doesn't remotely sound like it from the name. Sounds like some kind of front-end analytics tool.