Does this question imply sharding would be more difficult with this sort of table design, compared with other table designs?
Obviously sharding is difficult, but what would make it any more difficult with this table design?
Picking the sharding keys, avoiding hotspots, rebalancing, failover, joining disparate tables on different shards, etc... these are all complicated issues with far-reaching implications, but I just don't understand how this table design would complicate sharding any more than another.
I'm not trying to sound condescending and suggest it's a dumb question... genuinely interested.
This would have happened after I left, so I'm just speculating based on experience.
1. You create a function in your application that maps a customer id to a database server. There is only one at first.
2. You set up replication to another server.
3. You update your application's mapping function to map about half of your ids to the new replica. Use something like md5(customer id) % 2 (see the sketch after this list). Using md5 or something else that has a good random distribution is important (sha-x DOES NOT have a good random distribution).
4. Your application will now write to the replica. Turn off replication, and delete from each server the ids that no longer map to it.
Adding a new server follows approximately the same lines, except you need to rebalance your keys, which can be tricky.
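For concreteness, the mapping function in steps 1 and 3 could look something like this minimal sketch (the server names and function name are mine, not from the comment above):

```python
# Sketch of the id -> server mapping described in the steps above.
import hashlib

SERVERS = ["db-0", "db-1"]  # starts with one entry; grows as you split

def server_for_customer(customer_id: int) -> str:
    # Hash the decimal form of the id, interpret the digest as an integer,
    # then take it modulo the number of servers to pick a shard.
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

print(server_for_customer(42))  # roughly half of all ids land on each server
```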
So I'm trying to learn about system design without having ever encountered large problems from any of my projects. What I'm reading keeps pushing for hash rings for these types of distribution problems.
Someone who does this stuff for real: do orgs actually just do a hash % serverCount? That's what a startup I worked at did 8 years ago. I thought nobody actually did this anymore, though, given the benefits of hash rings?
It depends on control and automation. If you can easily migrate things between servers, a hash ring makes more sense. If you are a scrappy startup who has too much traffic/data, you are doing everything manually, including rebalancing and babysitting servers.
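To make the comparison concrete, here is a minimal hash-ring sketch next to the plain modulo approach (the virtual-node count and server names are arbitrary assumptions of mine):

```python
# Minimal consistent-hashing ring, for comparison with plain "hash % N".
import bisect
import hashlib

def _h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server gets `vnodes` points on the ring so load evens out.
        self._points = sorted((_h(f"{s}#{i}"), s)
                              for s in servers for i in range(vnodes))
        self._hashes = [h for h, _ in self._points]

    def server_for(self, key: str) -> str:
        # Walk clockwise to the first point at or after the key's hash,
        # wrapping back to the start of the ring if we run off the end.
        idx = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[idx][1]

ring = HashRing(["db-0", "db-1", "db-2"])
print(ring.server_for("customer:42"))
# Adding a fourth server only remaps about 1/4 of the keys; with plain
# hash % N, going from 3 to 4 servers remaps about 3/4 of them.
```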
I arrived at this empirically doing some tests at work. In general, SHA and MD5 both produce randomly distributed numbers. However, most IDs (and the assumption made here) are numeric. For the first billion or so numbers, SHA doesn't create a random distribution, while MD5 does.
Looking back, I don't remember the experimental setup we used to verify the random distribution, so I recommend checking against your own setup. But that was the conclusion we arrived at.
I quickly threw a few into Excel with a Pareto graph to show the distribution. Granted, it is only the first 10,000 integers % 1000: https://imgur.com/a/HypGWP1
To my untrained eye, md5 and sha1 (and, not shown, sha256) look remarkably similar to random. If you want truly even distributions, then going with a naive "X % Y" approach looks better.
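If anyone wants to rerun that comparison without Excel, something along these lines reproduces it (a quick sketch; hashing the decimal string of each integer and looking at min/max bucket counts are my assumptions about how to measure it):

```python
# Bucket the first 10,000 integers into 1,000 bins via md5, sha1, and a
# raw modulus, then compare how uneven the bucket sizes are.
import hashlib
from collections import Counter

N, BUCKETS = 10_000, 1_000

def spread(bucket_fn):
    # Returns the (smallest, largest) bucket size; (10, 10) is perfectly even.
    counts = Counter(bucket_fn(i) for i in range(N))
    return min(counts.values()), max(counts.values())

print("md5: ", spread(lambda i: int(hashlib.md5(str(i).encode()).hexdigest(), 16) % BUCKETS))
print("sha1:", spread(lambda i: int(hashlib.sha1(str(i).encode()).hexdigest(), 16) % BUCKETS))
print("raw: ", spread(lambda i: i % BUCKETS))
```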
If you do % 1000 there'll be a bit of bias from the modulus, but not anything meaningful for the use case, and it affects MD5 and SHA both.
No argument that for simple incrementing integers a raw modulus is better, though. Even if hashing is needed, a cryptographic hash function probably isn't (depending on threat model, e.g. whether user inputs are involved).
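To put a rough number on "not anything meaningful": for a 128-bit digest like MD5 reduced mod 1000, a back-of-the-envelope sketch of the bias looks like this:

```python
# Size of the modulus bias for a 128-bit hash reduced mod 1000.
from fractions import Fraction

space = 2**128            # number of possible 128-bit digests
buckets = 1000
base = space // buckets   # every residue appears at least this many times
extra = space % buckets   # residues 0 .. extra-1 appear exactly one extra time

relative_bias = Fraction(1, base)   # (base + 1) / base - 1
print(extra, float(relative_bias))  # 456 residues, bias ~2.9e-36: negligible
```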
MD5 and SHA-x are available in almost every language. Things that are probably better (like Murmur3) might not be, or may even produce different hashes for the same input (C# vs JS with Murmur3, for example). I had to reimplement Murmur3 in C# to get the hashes to match the ones from JS; it boiled down to the C# version using longs and eventually casting to an int, while JS can't (or didn't) do that final cast.
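A toy illustration of that kind of width mismatch, in Python rather than C# or JS (this is not the actual Murmur3 rounds; the constants are only for show):

```python
# Python ints never overflow, so to match a 32-bit implementation you have
# to mask after every step, just like the two ports had to agree on the
# final 32-bit cast.
MASK32 = 0xFFFFFFFF
C1 = 0xCC9E2D51  # one of Murmur3's multiply constants, used only as an example

def mix_unbounded(x: int) -> int:
    # What you get with arbitrary-precision ints (or longs that never wrap).
    return x * C1 + 0x12345

def mix_32bit(x: int) -> int:
    # Same arithmetic, truncated to 32 bits after each step.
    return (((x * C1) & MASK32) + 0x12345) & MASK32

x = 123456789
print(mix_unbounded(x), mix_32bit(x))  # different results for the same input
```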
I wasn't thinking of anything exotic, just the built-in functions used for hash tables and random number generators (for integers and other inputs that can be interpreted as seeds, a random number generator is also a hash function, and of course in your case the hash function was a random number generator all along anyway!).
Looks like it. Though I think md5 is faster, which is maybe what I'm remembering, now that I think about it. This was years ago; funny how memory gets tainted.
I think that's probably true of lots of masters of their craft; sometimes you have a gut feeling and don't even remember the formative moments that created the conclusion, haha.
If you worry about the random distribution of the bits, wouldn't it be better to do (H(customer_id) % big_prime) % 2, so that the "% 2" doesn't just read the last bit?
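In code the idea would be something like this sketch (BIG_PRIME is a placeholder of mine, not a value from the thread):

```python
import hashlib

BIG_PRIME = 2_147_483_647  # 2^31 - 1, chosen only as an example

def shard_plain(customer_id: int) -> int:
    h = int(hashlib.md5(str(customer_id).encode()).hexdigest(), 16)
    return h % 2                 # depends only on the digest's lowest bit

def shard_double_mod(customer_id: int) -> int:
    h = int(hashlib.md5(str(customer_id).encode()).hexdigest(), 16)
    return (h % BIG_PRIME) % 2   # the prime reduction folds higher bits in first
```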
A finite field would ALMOST work here. However, you're assuming that the user ids hitting your application are random. If you have an A/B test running that uses that same mapping, you'll end up with one server hotter than the other if one arm of the experiment is wildly successful.
Thanks for that answer; it seems like quite a specialist job. So many moving parts. Why would you need a random distribution and not just customers 1 to 100k, for example?
> Why would you need a random distribution and not just customers 1 to 100k, for example?
To keep the databases from having hotspots. For example, if your average customer churns after 6 months, you'll have your "old customer" database mostly just idling, sucking up $$, and you can't downgrade it to a smaller machine because there are still customers on it. Meanwhile, your "new customer" database is on fire, swamped with requests. By using a random distribution, you spread the load out.
If you get fancy with some tooling, you can migrate customers between databases: have a pool of small databases for "dead customers" and a more powerful pool for "live customers", or even have geographical distribution and use the datacenter closest to them. You really do need some fancy tooling to pull that off, but it is doable.
I was wondering the same thing. But at the same time, it seems like having it all centralized in one place would make it really easy to write tooling to aid in such an endeavor.
For example, suppose it was decided that sharding would be by account ID. It would probably be very achievable to have some monitoring for all newly created relationships to check that they have an account ID specified, and break that down by relationship type (and therefore attributable to a team). Dashboards and metrics would be trivial to create and track progress, which would be essential for any large-scale engineering effort.