We'll be publishing a performance report soon (we didn't manage to get it out today).
Rough numbers you can expect with 1KB documents and a 25M-document database: 40K reads/sec/server, 5K writes/sec/server, with roughly linear scalability across nodes.
We should be able to get the report out in a couple of days.
Any work done in 2.0 for improving aggregation performance?
The last time I tried, with 1.16, I gave up my testing when even the simplest aggregation query (count + group by with what should be a sequential, streaming scan) took literally minutes in RethinkDB, compared to <1s in PostgreSQL. RethinkDB coredumped until I gave it enough RAM, at which point it ballooned to around 7GB, whereas Postgres uses virtually no RAM of its own, relying mostly on OS buffers.
We did a couple of scalability improvements in 2.0, but didn't optimize groups and counts specifically.
Would you mind writing me an email with your query or opening an issue at https://github.com/rethinkdb/rethinkdb/issues (unless you already have)? I'd like to look into it and see how we can best improve this.
We're planning to implement a faster count algorithm that might help with this (https://github.com/rethinkdb/rethinkdb/issues/3949), but it's not completely trivial and will take us a bit longer.
What I was doing was so trivial that you don't really need this information. This was my reference SQL query:
select path, count(*) from posts group by path;
(I don't have the exact Rethink query written down, but it was analogous to the SQL version.)
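From memory, it would have been something along these lines via the Python driver (a rough reconstruction, not the exact query I ran; table and field names just mirror the Postgres schema):

  import rethinkdb as r

  conn = r.connect('localhost', 28015)

  # Group the posts by path and count each group -- the ReQL
  # analogue of the SQL group-by above.
  r.table('posts').group('path').count().run(conn)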
You can demonstrate RethinkDB's performance issue with any largeish dataset by trying to group on a single field.
The path column in this case has a cardinality of 94, and the whole dataset is about 1 million documents. Some rows are big, some not; each has metadata plus a JSON document. The Postgres table is around 3.1GB (1GB for the main table + a 2.1GB TOAST table). Postgres does a seqscan + hash aggregate in about 1500ms.
It's been months since I did this, and I've since deleted RethinkDB and my test dataset.
I just tried an equivalent group-and-count query, where the grouping function maps the documents into one of 32 groups (so that's fewer than your 94, but it shouldn't make a giant difference; I just had this database around). I ran that on both a 1 million and a 25 million document table, and memory usage looked fine and very stable.
This was on RethinkDB 2.0, and I might retry that on 1.16 later to see if I can reproduce it there.
Do you remember if you had set an explicit cache size back when you were testing RethinkDB?
Cool. Well, the process eventually crashed if I used the defaults. I had to give it a 6GB cache (I think, maybe it was more) for it to return anything. The process would actually allocate that much, too, so it's clear that it was effectively loading everything into memory.
It wasn't using an index, but then Postgres wasn't, either. I don't think aggregating via a B-tree index is a good idea; aggregation is inherently suited to sequential access. An index only helps when the query touches a small fraction of the rows.
If you wrote your query with group and count and no index, then yes, you'd run into performance problems. RethinkDB generally does not do query optimization, except in specific ways (mostly about distributing where the query runs), unless that's changed very recently. You can write that query with a map and reduce operation so that it executes with appropriate memory usage.
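Roughly like this, via the Python driver (just a sketch; I'm assuming a 'posts' table and a 'path' field to match the SQL above):

  import rethinkdb as r

  conn = r.connect('localhost', 28015)

  # Grouped map/reduce form of the same count: emit 1 per document
  # and sum per group, instead of materializing the groups and
  # counting them afterwards.
  (r.table('posts')
      .group('path')
      .map(lambda post: 1)
      .reduce(lambda a, b: a + b)
      .run(conn))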
You would get the same behavior as Postgres in terms of how the data is traversed and aggregated -- that is, not by building a bunch of groups and counting them after the fact. I do think RethinkDB ought to be able to apply aggregations to grouped queries on the fly, though... I'm not really up to date on that.
Postgres will still have better numbers, I'm sure. It has a schema for starters.
While I don't know how RethinkDB is structured internally, I don't see any technical reason why a non-mapreduce group-by needs to load the entire table into memory instead of streaming it, or why a mapreduce group-by needs to be slow. M/R only becomes a slow algorithm once you involve shards and network traffic; any classical relational aggregation plan uses a kind of M/R anyway.
Postgres has a schema, of course, but it still needs to look up the column map (the ItemIdData) in each page as it scans it, the main difference being that this map is of fixed length, whereas in a schemaless page it would be variable-length.
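To illustrate what I mean by streaming: a grouped count only needs memory proportional to the number of distinct groups, not the number of rows scanned. In plain Python (just a sketch of the idea, nothing RethinkDB-specific):

  from collections import Counter

  def group_count(rows, key):
      # One sequential pass; memory is bounded by the number of
      # distinct keys (94 paths in my case), not by the row count.
      counts = Counter()
      for row in rows:
          counts[row[key]] += 1
      return counts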
Anyway, I'm hoping RethinkDB will get better at this. I sure like a lot about it.
Generally speaking, RethinkDB doesn't do query optimization, except in deterministic ways, unless they've changed policy on this. I don't see any reason why a plain group/aggregate query couldn't be evaluated appropriately -- I know it is when the grouping is done via an index, and maybe it is by now when the grouping is done otherwise (I don't know, but it would be sensible; I'm out of date).
I haven't used RethinkDB, but I would assume the answer is no. Choosing to use map/reduce is basically a declaration that performance is your lowest priority.
And the point is that the converse is definitely not true.
Postgres knows about the structure of your data and where it's located, and can do something reasonably optimal. A generic map/reduce algorithm will have to calculate the same thing as Postgres eventually, but it'll have tons of overhead.
(Also, what is with the fad for running map/reduce in the core of the database? Why would this be a good idea? It was a terrible, performance-killing idea on both Mongo and Riak. Is RethinkDB just participating in this fad to be buzzword-compliant?)
While there have been some truly misguided mapreduce implementations, mapreduce is just a computation model that isn't inherently slower than others: A relational aggregation of the type you get with SQL like:
select foo, count(*) from bar group by foo
...is essentially a mapreduce, although most databases probably don't use a reduce buffer larger than 2. (But they would benefit from it if they could use hardware vectorization, I believe.)
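To make that concrete, the group-by above is just a map step (emit a 1 per row, keyed by foo) followed by a per-key reduce (sum). A toy sketch in Python, with made-up data:

  from collections import defaultdict
  from functools import reduce

  rows = [{'foo': 'a'}, {'foo': 'b'}, {'foo': 'a'}]

  # map: emit (key, 1) for every row
  mapped = [(row['foo'], 1) for row in rows]

  # group by key, then reduce each group by summing
  groups = defaultdict(list)
  for key, one in mapped:
      groups[key].append(one)

  counts = {k: reduce(lambda a, b: a + b, ones) for k, ones in groups.items()}
  # {'a': 2, 'b': 1} -- the same answer as the SQL group-by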
Mapreduce works great if you are already sequentially churning through a large subset of a table, which is typically the case with aggregations such as "count" and "sum". Where mapreduce is foolish is when you try using mapreduce for real-time queries that only seek to extract a tiny subset of the dataset.
There is no relevant knowledge that Postgres has and RethinkDB lacks that lets it evaluate the query more efficiently (besides maybe a row layout with fixed offsets so that it doesn't have to parse documents, but that's not relevant to the reported problem). A generic map reduce certainly would have more overhead, obviously, but not the running-out-of-memory overhead reported above, just the overhead of merging big documents.
The reason you run queries in "the core" of a database is because copying all the data outside the database and doing computations there would be far worse.
Rough numbers indeed - you forgot to define what a "server" is -- dedicated hw 16 core xeon with 4xssd in hw raid0 or a Digital Ocean vps with 512MB ram? ;-)
Daniel @ RethinkDB here.
We'll release the details shortly.
This was running on 12 core Xeon servers with 2 SSDs each in software RAID 0.
There were also additional read queries running at the same time as the write queries, and the read throughput that coffeemug posted is the sustainable increase in reads/s that you get when adding an additional server to a cluster. Single-server performance is much higher due to missing network / message encoding overhead.
I realize these numbers alone are still not very meaningful and there are many remaining questions (size and structure of the data set, exact queries performed etc). Rest assured that all of these details will be mentioned in the actual performance report that should be up soon.
Thank you for providing some quick back-of-the-envelope numbers here, which is exactly what most people are looking for at a first pass. One question though - do those numbers change considerably between rotational disk and SSD?
The numbers were measured on SSD. If the active/hot dataset fits into RAM, the numbers between SSD and rotational don't change much. If the active dataset doesn't fit into RAM, RethinkDB performs significantly worse on rotational.
We often run tests scaling up to ~40 nodes without problems. You could probably push RethinkDB quite a bit further than that, but I think over 100 nodes would be hard. The goal is to keep pushing that boundary indefinitely.
I contributed the benchmarks to Dan's gorethink driver. Dan is great to collaborate with so if you want to hack on Go and contribute to OSS, consider giving his project a look.
One way to improve write throughput is to batch writes; an example is here.
"In soft durability mode RethinkDB will acknowledge the write immediately after receiving and caching it, but before the write has been committed to disk."
Obviously your business requirements come into play. I prefer hard writes because my data is important to me, but I do insert debug messages using soft writes in one application I have.
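For instance, a batched insert with soft durability via the Python driver looks roughly like this (table name and documents are made up for illustration):

  import rethinkdb as r

  conn = r.connect('localhost', 28015)

  # Build a batch of documents and insert them in one request.
  # durability='soft' means the write is acknowledged once it's
  # cached, before it has been committed to disk.
  batch = [{'level': 'debug', 'msg': 'something happened', 'n': i}
           for i in range(500)]
  r.table('log_messages').insert(batch, durability='soft').run(conn)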
*Edit: Heh, I forgot to mention: on my MacBook Pro I was getting 20k w/s while batching and using soft writes.
Individual writes for me hover around 10k w/s on the 8-CPU, 24GB instance I have. But yeah, define your business requirements, then write your own benchmarks and see if the need is met.
Many devs write benchmarks in order to be the fastest and not the correctest. Super lame.