Clustrix (YC W06) Builds the Webscale Holy Grail: A Database That Scales

leif · on May 4, 2010

Good for them and all, and I know they're still kind of stealth mode, but this article doesn't say much to me.

It's not that I doubt they can deliver, I just want to see some numbers. The one graph I could find (http://www.clustrix.com/wp-content/uploads/2010/04/clustrix-...) really doesn't say much, and to be honest, I can't even really tell how I'm supposed to be reading it.

You don't go and build the best sports car in the world and then tell everyone "it goes really fast", you show videos of it screaming around a curvy mountain road. I want my curvy mountain road video.

bertove · on May 4, 2010

They were supposed to launch on May 4, but the story seems to have got out a day earlier. There should be more details coming, starting from their official launch at Web 2.0 tomorrow.

on May 4, 2010

[deleted]

leif · on May 4, 2010

Yeah, I know, but I reeeeeeeeally want to see their graphs!

qq66 · on May 4, 2010

Well, the performance specs of new sports cars are often known before the prototypes are fully built.

snitko · on May 4, 2010

I thought the holy grail of the web scale is being able to start with little (startups cannot pay 80K for a licence), then painlessly scale the thing. What they offer seems more like a better price option for large enterprise solutions - licence is (supposedly) cheaper than the cost of scaling. And that's cool, ok. But, personally, what I would like to be able to do is to launch a small application with a DB holding, say, 10K records and then scale it up to billions of records investing money when I actually have them. Or not.

ErrantX · on May 4, 2010

Agreed. We purchase an Oracle driven enterprise app that would probably be the target market for this. We would hold billions of records if it was economical to do so - the problem is that 80K is a lot and it is probably cheaper to pay me to do 3hrs a week maintenance/backup.

The other area we might have used this is somewhere we have about 40 billion database entries (several TB's of data) and growing. However, like Google et al, once you get to that scale it's not worth paying for a proprietary service because cheap/comodity hardware on a FOSS stack is a lot cheaper.

However, there is almost certainly a market here for "enterprise" type businesses (you know, the sort of grey faced telecoms-style place). The problem is supplanting Oracle...

alanh · on May 4, 2010

According to bertove, Clustrix may support "the entire MySQL API." If that's the case, migrating should be incredibly easy.

rbranson · on May 4, 2010

* Cool. This is a hard problem.

* Custom hardware sounds expensive.

* This does not solve CAP issues.

* There is a physical limit to ACID scalability because of synchronization latency and network throughput maximums. KV stores don't have this issue.

* Will developers still have to implement KV caching layers for database data? I suppose the real question is: does the system accelerate "hot" data where normal RDBMS sharding schemes would otherwise not?

* Does the linear scalability also apply to OLAP-type activity where queries are ran against large datasets?

* What are they doing that Oracle RAC isn't?

sergei · on May 4, 2010

Oracle RAC allows multiple machines to synchronize access to a shared disk. Each node in a RAC cluster locks individual disk pages and brings the data to the node for processing. Only one node is working on a query a time, while synchronizing access to resources at a physical level.

Clustrix is fundamentally different. We compile the query down into query fragments and evaluate relevant portions of the query at the node which holds the relevant data. We bring the query to the data.

You can find more specifics on our architecture in our white paper:

http://www.clustrix.com/wp-content/uploads/2010/04/clustrix-...

Clustrix is all about concurrency and parallelism. While we didn't specifically focus on OLAP, our model allows us to do very well for these types of workloads because we break the query apart and let individual nodes work on their portion of the query.

While we deliver our product as an appliance, it's all based on commodity hardware.

If you had an infinitely scalable database, would you still want to put a caching layer in front of it? Most people would rather not deal with the added complexity.

In a typical sharding scheme, one would choose a partition key (typically some primary key, e.g. user id) and use that as a distribution strategy. At such granularity, hot spots are a common problem. In Clustrix, we have independent distribution/partitioning for each table/index, down to the row level. To get hot spots is if you contend on a single row -- but that's really an indication of something going terribly wrong in the data model.

ACID (transactional properties) and KV (data model) are orthogonal concepts. Our systems scales exceptionally well and supports a wide range of data models.

strlen · on May 4, 2010

I'd hesitate to call RAC a true example of "scaling out". As you've mentioned, RAC relies upon shared disk in a SAN to provide data storage. Exadata would be the more appropriate comparison: there's also the "appliance" model and while Exadata was originally meant as an OLAP solution, it is claimed to be OLTP capable via SSDs and fast interconnect.

You're also very correct in that a caching layer in front of a database that _itself_ includes a cache seems like added complexity and resources (similarly to using a query cache in front of InnoDB's own cache with MySQL). A true scale out solution should not require (or even significantly benefit) from a caching layer.

However, I don't see how your product runs on commodity hardware: I can't buy a machine from Supermicro to run it, it makes use of high-speed interconnect (not something I can place in most data-centers). It seems more targeted at e.g., billing/accounting systems (which require strong transactional integrity) rather than storing high-scalability web application state (e.g., sessions, shopping carts, user histories, analytically derived data, CMS content, small pictures).

"Appliances" generally have the stigma of being difficult to manage, requiring more admins for for the same number of machines than regular hardware.

"Web scale" companies (e.g., Facebook, Google, Amazon, Yahoo) can run huge clusters with very few numbers of administrators (easily thousand machines per person) due to extensive automation of the whole process. Since the appliances running on a different OS image, the operations team can't manage them via their standard set of provisioning and configuration management tools: when one goes down, they can't use their main jumpstart/kickstart server to perform an unattended install; there will be difficulties managing its configuration via standard tools like puppet/cfengine/chef (much less custom tools that such companies write). From what I've seen, they end up typically relying on separate teams to manage their Oracle RAC installations due to these issues (these teams often have a much worse admin:machine ratio e.g., multiple-person teams for a <100 node RAC cluster). Facebook opts to stick with MySQL partly because their operations team know how to automate every bit of it.

How do you plan to address these issues?

That being said, I love seeing start-ups solve difficult problems. Congratulations on shipping a product that solves a tough problem.

rbranson · on May 4, 2010

> You're also very correct in that a caching layer in front of a database that _itself_ includes a cache seems like added complexity and resources (similarly to using a query cache in front of InnoDB's own cache with MySQL). A true scale out solution should not require (or even significantly benefit) from a caching layer.

Caches can be implemented at different levels. I'm talking about query caching (Exact SQL -> Exact Results). How would a query cache be effective in an environment like Clustrix (distributed MVCC)? Invalidation would be almost as expensive and hard to manage as the DELETE/UPDATE itself. With lots of RAM, it's probably cheaper just to run the query.

drosenthal · on May 4, 2010

Sergei, you make some great points here. Thanks for posting.

The OP asked what Clustrix does that RAC does not. Leaving aside the interesting architecture for a second, do you believe that your database can scale to a higher aggregate write per second load than Oracle RAC? than Exadata? At a lower cost?

sergei · on May 4, 2010

Absolutely. Yes to all three.

rbranson · on May 4, 2010

I read the paper. Very nice architecture. I really like the no-special-nodes aspect.

> Clustrix is all about concurrency and parallelism. While we didn't specifically focus on OLAP, our model allows us to do very well for these types of workloads because we break the query apart and let individual nodes work on their portion of the query.

I ask my original question because, in general, scaling OLAP and OLTP tend to use very different methods that work against each other. For instance, MapReduce works well for OLAP workloads, but doesn't provide the responsiveness required for OLTP (Not that it couldn't, MongoDB sure is using it for that, but it isn't the best for that). It sounds like you're applying some OLAP techniques ("broadcast" querying and processing data at it's site) to the OLTP realm.

> If you had an infinitely scalable database, would you still want to put a caching layer in front of it? Most people would rather not deal with the added complexity.

So this system will scale to 10,000 nodes? It seems as if queries will get distributed to hundreds of nodes at this point if they are joining in several tables. I can imagine serious network contention at this point. Do you still recommend some kind of denormalization for efficiency, as it seems it would reduce the number of table slices required for a query?

> In a typical sharding scheme, one would choose a partition key (typically some primary key, e.g. user id) and use that as a distribution strategy. At such granularity, hot spots are a common problem. In Clustrix, we have independent distribution/partitioning for each table/index, down to the row level. To get hot spots is if you contend on a single row -- but that's really an indication of something going terribly wrong in the data model.

Really? What about data that is very popular? To use worn out examples, the "real name" field for Ashton Kutcher's Twitter feed? This is a single row of data that could quite realistically be accessed tens or hundreds of thousands of times per second at peak. I know I am picking ridiculous examples, but that's what scaling is all about :)

I suppose I ask about caching because, in general, caching layers only hit one box with a simple key request that have been proven to work at volumes of hundreds of thousands of queries per second (memcached at Facebook). How does this database system that hits a single node (or two, considering replica) that has to do fundamentally more work to fetch the data perform anywhere near as efficiently as a caching system?

> ACID (transactional properties) and KV (data model) are orthogonal concepts. Our systems scales exceptionally well and supports a wide range of data models

Agreed. I apologize that I wasn't clear enough here. By KV I meant Dynamo-style loosely consistent DHTs.

apphacker · on May 4, 2010

>"If you had an infinitely scalable database, would you still want to put a caching layer in front of it?"

Yes because reading from memory is faster than file I/O?

sergei · on May 4, 2010

The database has a transitionally consistent memory cache. On a 5 node system, you have ~150GB of effective cache. And you can always add more nodes.

When you put a cache in front of the database, you sign up to manage your own cache coherence and consistency.

Are my users reading stale data? That write went into the cache, did it make it to the database? I use interface X to speak to the cache but interface Y to speak to the database. Some requests are ok to cache, some must go direct...

Or I can just get more nodes and not have to worry about it.

forkqueue · on May 4, 2010

I'm not sure the hardware is so custom - the pictures on the Clustrix site look very much like generic Supermicro x86 servers with a Clustrix logo stuck on them.

DenisM · on May 4, 2010

This does not solve CAP issues.

Their advertised setup is a small number of nodes interconnected with infiniband. Partitioning problem doesn't apply here.

sl_ · on May 4, 2010

Partitioning also happens when a node fails. But shouldn't be that much of a problem since they probably use a N=3/W=2 setup.

cx01 · on May 4, 2010

They don't use quorums. So I guess there will always be 2 replicas and in the case of failure a (Zookeeper-like) master will reconfigure the system.

rbranson · on May 4, 2010

Partitioning is only one part of CAP. Availability is sacrificed when site-specific power / network systems fail, or worse, when equipment is physically destroyed.

DenisM · on May 4, 2010

The "OLTP landscape chart" raises red flags. Matching functionality of Oracle in 4 years and $18m is an extraordinary claim.

So this raises the question as to what exactly can it do? The white-paper covers straight select, insert, sort and a an inner join which is pretty good already, but it doesn't go beyond that.

bertove · on May 4, 2010

I believe it supports the entire mysql api.

boris · on May 4, 2010

The complete MySQL API compatibility is a bold statement.

If they implemented the whole thing themselves (I would say that should take more than 4 years in a couple-of-guys setup) then maintaining that compatibility must be/will be a nightmare. If you sold me a product as MySQL-compatible and I find that some bizarre and undocumented side-effect of MySQL implementation that my code relies on is missing, you will have to emulate that side effect.

If, on the other hand, they based their implementation on MySQL code, then this raises some licensing questions (MySQL is GPL so they will have to make their changes public). They may have separated the code somehow so that this is less of an issue, though it adds a bit of uncertainty to the whole setup.

lzimm · on May 4, 2010

there was this programming language i remember reading about a while ago that was basically about moving computation around to different nodes to bring the "function to the data" as opposed to the other way around -- which is the exact same thing that the white paper says clustrix is doing.

anyways, with any sort of non-trivially related data, it becomes a continuous graph optimization problem to minimize the number of nodes you hit on any given "query" (ie: keep data that tends to be accessed together on the same nodes or die), and can easily see how though it may seem like an elegant solution, that you can very quickly run into scary problems that really are just a giant tangled mess.

anyways, distributed joins aren't that hard. select and insert are solved by the key-value/consistent hashing dbs. even mysql cluster gives you all that stuff these days (the whitepaper mentions non-trivial multi-master joins)

holla

Maro · on May 4, 2010

Hey, they're looking for a "Technical Support Engineer - 6AM Shift":

Required Skills / Duties: ... - Able to lift 80 lbs several times in a single day. Our servers weigh between 50 and 140 lbs depending on configuration.

Sounds more like a bodybuilder position to me =)

http://www.clustrix.com/about/careers

lsb · on May 4, 2010

Clustrix plans to sell its appliance (which consists of more than a terabyte of memory and its proprietary software)

If you give people a few TB of memory, of course database reads are going to be fast.

DenisM · on May 4, 2010

I think they meant TBs of storage...?

sl_ · on May 4, 2010

Each CLX 4010 appliance contains:

Two Quad-Core processors (8 CPU cores) 32GB memory Seven 160GB solid-state drives (SSD)

lsb · on May 4, 2010

See, that's a shame it's only 32GB of memory. I can go on dell.com and buy a machine with 128GB of RAM for about $10k, far less than I'd expect they're selling their boxes for.

mattmaroon · on May 4, 2010

My company might have bought this 6 months ago assuming it provably works as advertised. We ended up sharding (which I'd gladly pay $80k to have avoided) and going Cassandra for future apps.

DenisM · on May 4, 2010

hundreds of nodes

Whitepaper seems to imply all nodes are connected to all other nodes, using infiniband. Infiniband's point-to-point afaik, so it'd be interesting to know how they plan to do that with hundreds of nodes.

josephruscio · on May 4, 2010

IB is typically deployed in a Fat Tree network: http://en.wikipedia.org/wiki/Fat_tree

It's been used in high-performance compute clusters with thousands of individual nodes.

wmf · on May 4, 2010

Gratuitous Infiniband porn: http://blogs.sun.com/jonathan/entry/size_matters

3,456 ports.

DenisM · on May 4, 2010

Dear god. I had no idea.

jbellis · on May 4, 2010

Superficially at least it sounds a lot like what you'd get if you added SSDs to Oracle's Exadata.

drosenthal · on May 4, 2010

Exadata does use SSDs now, in the form of Sun's flash memory cards (though they still are backed by rotational disk). You're right, however that this and Exadata look to attack the same problem in a similar way. If we believe Clustrix about the "100s of nodes" rather than the 20 they show data for, it would likely out-perform Exadata by a bit too.

coffeemug · on May 4, 2010

Congrats Sergei and the Clustrix team! Scalable databases are a really hard problem to solve. It's great to see products that push the boundaries of what's possible.

shaunxcode · on May 4, 2010

How is this comparable to fathomdb in terms of what it provides on a business level. (not how i.e. fathomdb leverages 'the cloud' where as this seems to be using specialized hardware?)

bkrausz · on May 4, 2010

Their website has no call to action or availability information...only a hidden "Contact Us" link. What if I want to buy one (I don't, but what if)?

apphacker · on May 4, 2010

It's $80,000 for a 3-node machine running the software, and what, you're looking for a Guaranteed Paypal merchant button or something?

bkrausz · on May 12, 2010

No, I'm looking for a call to action if I'm interested in finding out more. I recognize that big purchases are on a salesperson system, but I should have an easy avenue to reach these salespeople.

sown · on May 4, 2010

I wonder what they look for in hiring? I guess experience is a big plus and general smarts.

I'm getting bored at my current job. I think if I could port our enterprise solution to cassandra it would help in an interview?

ck2 · on May 4, 2010

This makes me happy to hear, especially that it's NOT April Fools!

einarvollset · on May 4, 2010

Greenplum.com