\* Cool. This is a hard problem. \* Custom hardware sounds expensive. \* This do...

sergei · on May 4, 2010

Oracle RAC allows multiple machines to synchronize access to a shared disk. Each node in a RAC cluster locks individual disk pages and brings the data to the node for processing. Only one node is working on a query a time, while synchronizing access to resources at a physical level.

Clustrix is fundamentally different. We compile the query down into query fragments and evaluate relevant portions of the query at the node which holds the relevant data. We bring the query to the data.

You can find more specifics on our architecture in our white paper:

http://www.clustrix.com/wp-content/uploads/2010/04/clustrix-...

Clustrix is all about concurrency and parallelism. While we didn't specifically focus on OLAP, our model allows us to do very well for these types of workloads because we break the query apart and let individual nodes work on their portion of the query.

While we deliver our product as an appliance, it's all based on commodity hardware.

If you had an infinitely scalable database, would you still want to put a caching layer in front of it? Most people would rather not deal with the added complexity.

In a typical sharding scheme, one would choose a partition key (typically some primary key, e.g. user id) and use that as a distribution strategy. At such granularity, hot spots are a common problem. In Clustrix, we have independent distribution/partitioning for each table/index, down to the row level. To get hot spots is if you contend on a single row -- but that's really an indication of something going terribly wrong in the data model.

ACID (transactional properties) and KV (data model) are orthogonal concepts. Our systems scales exceptionally well and supports a wide range of data models.

strlen · on May 4, 2010

I'd hesitate to call RAC a true example of "scaling out". As you've mentioned, RAC relies upon shared disk in a SAN to provide data storage. Exadata would be the more appropriate comparison: there's also the "appliance" model and while Exadata was originally meant as an OLAP solution, it is claimed to be OLTP capable via SSDs and fast interconnect.

You're also very correct in that a caching layer in front of a database that _itself_ includes a cache seems like added complexity and resources (similarly to using a query cache in front of InnoDB's own cache with MySQL). A true scale out solution should not require (or even significantly benefit) from a caching layer.

However, I don't see how your product runs on commodity hardware: I can't buy a machine from Supermicro to run it, it makes use of high-speed interconnect (not something I can place in most data-centers). It seems more targeted at e.g., billing/accounting systems (which require strong transactional integrity) rather than storing high-scalability web application state (e.g., sessions, shopping carts, user histories, analytically derived data, CMS content, small pictures).

"Appliances" generally have the stigma of being difficult to manage, requiring more admins for for the same number of machines than regular hardware.

"Web scale" companies (e.g., Facebook, Google, Amazon, Yahoo) can run huge clusters with very few numbers of administrators (easily thousand machines per person) due to extensive automation of the whole process. Since the appliances running on a different OS image, the operations team can't manage them via their standard set of provisioning and configuration management tools: when one goes down, they can't use their main jumpstart/kickstart server to perform an unattended install; there will be difficulties managing its configuration via standard tools like puppet/cfengine/chef (much less custom tools that such companies write). From what I've seen, they end up typically relying on separate teams to manage their Oracle RAC installations due to these issues (these teams often have a much worse admin:machine ratio e.g., multiple-person teams for a <100 node RAC cluster). Facebook opts to stick with MySQL partly because their operations team know how to automate every bit of it.

How do you plan to address these issues?

That being said, I love seeing start-ups solve difficult problems. Congratulations on shipping a product that solves a tough problem.

rbranson · on May 4, 2010

> You're also very correct in that a caching layer in front of a database that _itself_ includes a cache seems like added complexity and resources (similarly to using a query cache in front of InnoDB's own cache with MySQL). A true scale out solution should not require (or even significantly benefit) from a caching layer.

Caches can be implemented at different levels. I'm talking about query caching (Exact SQL -> Exact Results). How would a query cache be effective in an environment like Clustrix (distributed MVCC)? Invalidation would be almost as expensive and hard to manage as the DELETE/UPDATE itself. With lots of RAM, it's probably cheaper just to run the query.

drosenthal · on May 4, 2010

Sergei, you make some great points here. Thanks for posting.

The OP asked what Clustrix does that RAC does not. Leaving aside the interesting architecture for a second, do you believe that your database can scale to a higher aggregate write per second load than Oracle RAC? than Exadata? At a lower cost?

sergei · on May 4, 2010

Absolutely. Yes to all three.

rbranson · on May 4, 2010

I read the paper. Very nice architecture. I really like the no-special-nodes aspect.

> Clustrix is all about concurrency and parallelism. While we didn't specifically focus on OLAP, our model allows us to do very well for these types of workloads because we break the query apart and let individual nodes work on their portion of the query.

I ask my original question because, in general, scaling OLAP and OLTP tend to use very different methods that work against each other. For instance, MapReduce works well for OLAP workloads, but doesn't provide the responsiveness required for OLTP (Not that it couldn't, MongoDB sure is using it for that, but it isn't the best for that). It sounds like you're applying some OLAP techniques ("broadcast" querying and processing data at it's site) to the OLTP realm.

> If you had an infinitely scalable database, would you still want to put a caching layer in front of it? Most people would rather not deal with the added complexity.

So this system will scale to 10,000 nodes? It seems as if queries will get distributed to hundreds of nodes at this point if they are joining in several tables. I can imagine serious network contention at this point. Do you still recommend some kind of denormalization for efficiency, as it seems it would reduce the number of table slices required for a query?

> In a typical sharding scheme, one would choose a partition key (typically some primary key, e.g. user id) and use that as a distribution strategy. At such granularity, hot spots are a common problem. In Clustrix, we have independent distribution/partitioning for each table/index, down to the row level. To get hot spots is if you contend on a single row -- but that's really an indication of something going terribly wrong in the data model.

Really? What about data that is very popular? To use worn out examples, the "real name" field for Ashton Kutcher's Twitter feed? This is a single row of data that could quite realistically be accessed tens or hundreds of thousands of times per second at peak. I know I am picking ridiculous examples, but that's what scaling is all about :)

I suppose I ask about caching because, in general, caching layers only hit one box with a simple key request that have been proven to work at volumes of hundreds of thousands of queries per second (memcached at Facebook). How does this database system that hits a single node (or two, considering replica) that has to do fundamentally more work to fetch the data perform anywhere near as efficiently as a caching system?

> ACID (transactional properties) and KV (data model) are orthogonal concepts. Our systems scales exceptionally well and supports a wide range of data models

Agreed. I apologize that I wasn't clear enough here. By KV I meant Dynamo-style loosely consistent DHTs.

apphacker · on May 4, 2010

>"If you had an infinitely scalable database, would you still want to put a caching layer in front of it?"

Yes because reading from memory is faster than file I/O?

sergei · on May 4, 2010

The database has a transitionally consistent memory cache. On a 5 node system, you have ~150GB of effective cache. And you can always add more nodes.

When you put a cache in front of the database, you sign up to manage your own cache coherence and consistency.

Are my users reading stale data? That write went into the cache, did it make it to the database? I use interface X to speak to the cache but interface Y to speak to the database. Some requests are ok to cache, some must go direct...

Or I can just get more nodes and not have to worry about it.

forkqueue · on May 4, 2010

I'm not sure the hardware is so custom - the pictures on the Clustrix site look very much like generic Supermicro x86 servers with a Clustrix logo stuck on them.

DenisM · on May 4, 2010

This does not solve CAP issues.

Their advertised setup is a small number of nodes interconnected with infiniband. Partitioning problem doesn't apply here.

sl_ · on May 4, 2010

Partitioning also happens when a node fails. But shouldn't be that much of a problem since they probably use a N=3/W=2 setup.

cx01 · on May 4, 2010

They don't use quorums. So I guess there will always be 2 replicas and in the case of failure a (Zookeeper-like) master will reconfigure the system.

rbranson · on May 4, 2010

Partitioning is only one part of CAP. Availability is sacrificed when site-specific power / network systems fail, or worse, when equipment is physically destroyed.