Hacker News new | past | comments | ask | show | jobs | submit login
Sharding data models (citusdata.com)
136 points by pedrobelo on Aug 28, 2017 | hide | past | favorite | 19 comments



For financial transaction services I can recommend sharding first by customer, then by ledger. As a result, instead of enforcing double-entry book-keeping standards within a single database, do it at an application-specific middleware server layer to enforce only the guarantees you need.

As with all decisions, there are tradeoffs. For a little more up-front complexity and a tiny nominal performance hit, this allows maintenance, encryption, non-uniform storage paradigms to suit individual ledgers, rolling upgrades, storage location migration, and other cool stuff which is typically painful with traditional all-or-nothing RDBMS architectures.


Presumably you want a fairly strict consistency guarantee for whatever double-entry invariants you do have. Does that cause a lot of complexity when you kick it up to the middleware layer?


It depends what you mean by significant. SQL update becomes a two phase commit with double network connection setup, execute, teardown latency as a worst case. External auditing must be made. Sounds like hell.

A flip-side view is that, in many cases, if you really want to trust your data (not your database), then you want to be doing this stuff to a large extent anyway... which means that, basically, it's just enforcing good practices that should have already been present.

Complexity and per-TX latency increases, but for that you get maintainability and a great deal of flexibility. Nothing is free... take your pick!


Citus is a pretty powerful tool. We recently used Citus to help scale one of our databases and wrote a blog post about it: https://hipmunk.github.io/posts/2017/Aug/16/a-fare-cache-in-...


What were the before/after performance and load numbers? Your post mentions that data-refresh frequency was an issue, but doesn't mention what sort of problems it caused. Would love to hear how the move affected query perf


Then there's the 0 model for sharding - don't shard, just replicate everything, everywhere, with eventual consistency and MVCC


Author of the post here, couldn't agree more that if you don't have to shard then don't. Scaling out using replicas is an option to certain scale, you do have to ensure no long running queries and make sure your replication is up to date. Those are another option of things to manage, and at an intermediate stage very viable ones, at certain scale it all changes a bit. All that said if you're at a small data level I wouldn't encourage sharding for the sake of it.. if you know you're going to have to scale beyond a single node or approaching limits where you're frequently scaling up it's good to know your options and plan ahead.


Nobody voluntarily shards for their own enjoyment (unless they like pain). They shard because they need to in order to scale.


Sharding and master to master replication are two different things.

The tradeoff of master to master replication limits you to CRDTs, append only logs and manual merging of conflicts by the end user. Alternatively if dataloss is an acceptable tradeoff you can also use a last write wins strategy.

Sharding is basically having one isolated "database" per X (User, location, etc) but the tradeoff is you can't have transactions across two databases.

Document databases usually do both. Each document is it's own tiny database with atomic updates which then can be distributed over the cluster and they support multi master replication for availability/automatic failover.


Or you can do multi master replication, where you write to all replicas synchronously (and deal with downtime of replicas by continuing if you manage to write to a majority of them).


What ORMs out there natively support sharding well? I use Django the most and whenever I look into Django sharding, I don't see very many options being kept up to date...


I'm not sure any do completely out of the box but many are very close.

We've created a library at Citus to make rails sharding turn-key, activerecord-multi-tenant. We have a Django one in the works, if you want to take a look and give us any feedback on it please drop us a note. From what I've seen hibernate has some out of the box support, but we only have a few customers leveraging it so not as familiar with it.


Unfortunately, I don't think I have the time to test anything rigorously, but I would definitely be interested in learning more about the Django solution when it's more mature! Is there a mailing list or something I could sign up for?


I use ServiceStack's OrmLite for .NET, and it supports multi-nested database connections for Master/Shard setup: https://github.com/ServiceStack/ServiceStack.OrmLite#multi-n...



What distinction is being made between sharding by entity and sharding a graph? The approach seems to be the same, just with different naming.


Yeah, that section doesn't seem to be making much sense. I don't think there are actually any real examples of "graph sharding" in the wild. The graph databases that are available, like Neo4j, don't usually natively provide horizontal partitioning. (Of course -- the problem of finding the minimum k-cut of a graph is itself NP-complete. Doing this incrementally with a dynamic graph is even harder.)

The one solution that was mentioned, Facebook's TAO, isn't actually really a database; it's a cache, which means that it doesn't really have to deal with sharding in the way that persistent stores do. And it doesn't really shard at all; it basically stores a complete copy of the world's social graph in every region, which it can just populate from that region's MySQL replicas. (It's amazing the things you can do when you can be eventually consistent.)

(From what I recall, the main social network's MySQL also isn't really sharded by graph in any fancy way; it's basically "just" hash-sharded by entity ID.)


dgraph does try to provide horizontal scaling out of the box. The sharding is done by predicate - cf https://docs.dgraph.io/deploy/#multiple-instances for a documentation link ; I am not sure how it behaves for very frequent predicates though


I would have liked to read more about graph sharding. The article was rather sparse about it.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: