Cool technology, good explanation. Legitimate questions below.
What you're describing (uploading photos + storing metadata) sounds like something Facebook has given tech talks about at length at multiple venues. Their solution was to use a distributed FS for images (such as HDFS, though FB uses their internal "Haystack") and then use HBase for the metadata. To be honest, your solution, while it works now, looks like a weak home-grown HBase that leverages PL/pgSQL for unique IDs. Why not go the snowflake+HBase route? While it may add Ops complexity, it is a fairly battle-proven stack, and JVM ops is pretty well documented.
Or, if you insist on using an RDBMS for metadata, why not just throw money (and not that much of it) at the problem and buy an SSD for your DB? Increase your IOPS from 100 or 200 up to 30,000 or 40,000 with a cheap drive, and call it a day. Surely this would be less expensive than the engineering effort that went into (and will continue to go into) this project. It has the added benefit of having no impact on Ops complexity, and should scale to quite a staggering number of QPS.
We're on EC2, which has its set of limitations but means we can run a 10 million + user system with two-and-a-half engineers (and no ops team / overhead). So while we hear about more and more folks using SSDs in their DBs, it's not an option in our near-term future.
For SQL vs HBase/Haystack, we don't really have to worry about the photo storage itself, since S3 handles all of it. The data we shard out is more suited to an RDBMS, and since we're way more familiar with that world than with HBase and similar, it was the choice that let us make the most progress in a short time with a small team. Hope that's a helpful description of how we thought about it.
From what I understand from the post, you are actually sharding other tables like photos, likes, and comments, but not the main users table, right?
So, if one day you want to shard the users table, it will render all the current sharding useless, right?
Back in 2003 I was a developer on a team at Microsoft that was responsible for building a new storage backend for all the MSN Messenger and Hotmail contact lists. This store had to hold data for close to 300 million user accounts, which we sharded out to a few hundred SQL Server databases. Our sharding system consisted of a database with a single table with a row for each user that mapped their 128-bit GUID user ID to the ID of their assigned shard. New user creation involved generating a new GUID and inserting it into this table. Each read operation involved a select from this table.
The database ran on a machine with enough RAM to let SQL Server do its thing and cache almost the whole table in memory. At the time I left the team in 2005, it was executing over 25,000 requests per second, with an average latency of under 3ms. Pretty sure that on modern hardware it would handle much, much more.
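In sketch form, that directory-based approach looks something like this (names and shard count hypothetical; the real map was a SQL Server table cached almost entirely in RAM):

    import uuid

    NUM_SHARDS = 400  # hypothetical: "a few hundred" shard databases

    # Stand-in for the single-table directory; in the real system this was a
    # SQL Server table mapping GUID -> shard ID, kept hot in memory.
    user_to_shard = {}

    def create_user():
        user_id = uuid.uuid4()              # new 128-bit GUID
        shard = user_id.int % NUM_SHARDS    # or any placement policy
        user_to_shard[user_id] = shard      # the INSERT on user creation
        return user_id

    def shard_for(user_id):
        return user_to_shard[user_id]       # the SELECT on each read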
I work at Flickr, and I see they mentioned Flickr's ticket server idea ((ab)using MySQL's autoincrement and the "REPLACE INTO" trick) and noted that a con was the write bottleneck.
We're generating more GUIDs than ever with this system and those boxes are more or less idle on every metric. They're right in that we don't meet their time-ordered requirement, but I just wanted to say that writing (or reading) is not a bottleneck.
I like that you guys are using your database's stock features to accomplish this. Most of the time you don't need complicated systems to get things done. Reducing that mental overload of YET another system is huge.
I think it's a mistake to tie the shard id into the object id. The shard id should be derived from the object id dynamically, based on the placement of the object among the shards. If the shards grow or shrink, or an object is migrated, a different shard id is generated, but the object id doesn't have to change.
Edit: I like how constructive criticism got downvoted. Thanks for discouraging technical discussion.
If you do that then you need to keep a mapping of object IDs to shards. You can't store it on the application servers because of race conditions and memory constraints, so each object lookup will require two network trips.
They've defined 2^13 virtual shards, which is fine-grained enough that they can move entire shards between physical servers to eliminate hot spots.
I hear what you are saying, but what they have done is derive their IDs based partially on their current choice of architecture.
If, for any reason, they would like to change their architecture later, suddenly they have a real problem.
That feels flaky to me. It's a nice techy solution, and the write-up was interesting, but there are definitely a few weak points in their implementation.
Their computers have to be absolutely in sync for time, their ID space is entirely guessable once you know the algorithm, and they have effectively hard-coded their current architecture into their software.
The shard id is usually derived algorithmically rather than looked up from a mapping table, which introduces unnecessary complexity. Look up consistent hashing for details; it deals with virtual shards, shard addition/removal, and object migration in a consistent and simple manner.
Right on. The shard id could be derived from the last few bits of the object id, leaving space for additional entropy in the object id itself.
Edit: you don't need an object id / shard id mapping. Each time you need to locate a shard for an object, you take the last few bits of the object id modulo the number of shards to get your shard id.
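A minimal sketch of that derivation (bit width and shard count hypothetical):

    SHARD_ID_BITS = 13   # hypothetical: how many low bits of the id to use
    NUM_SHARDS = 100     # hypothetical current shard count; can change over time

    def shard_for(object_id):
        # Use only the low bits of the immutable object id, then fold them
        # onto however many shards exist right now; if NUM_SHARDS changes,
        # objects map to new shards but keep their ids.
        low_bits = object_id & ((1 << SHARD_ID_BITS) - 1)
        return low_bits % NUM_SHARDS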
Interesting article. It seems like a pretty good solution to the problem, especially when judged on ease of maintenance.
One thing that bothered me was, "Let’s walk through an example: let’s say it’s September 9th, 2011, at 5:00pm and our ‘epoch’ begins on January 1st, 2011. There have been 1387263000 milliseconds since the beginning of our epoch..."
The number of seconds between 5pm on one day and midnight on another obviously can't end in a 3 (both instants fall on an exact minute, so the count has to be a multiple of 60), so I looked into it. It looks like that number is taken from the epoch that is used later in the SQL, which starts on Aug 24th.
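A quick sanity check of the arithmetic (dates taken from the quote above):

    from datetime import datetime, timedelta

    five_pm = datetime(2011, 9, 9, 17, 0, 0)

    # Milliseconds since a January 1st, 2011 epoch -- nowhere near the quoted figure:
    print((five_pm - datetime(2011, 1, 1)).total_seconds() * 1000)  # 21747600000.0

    # Working backwards from the quoted 1387263000 ms lands in late August:
    print(five_pm - timedelta(milliseconds=1387263000))  # 2011-08-24 15:38:57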
I've used the "snowflake" like approach in the past with great success. It's really not all that complicated. The reliance on time in the Instagram approach is a bit scary. A few ms off here and there could really hurt this scheme. How do you handle seamlessly transitioning these across machines when your shards move?
Thanks for the comment! We don't need the IDs to be exactly sortable, only roughly sortable within a second or so. As long as the clock doesn't move backwards on any given machine (we use ntpd in its gradual-adjustment mode), the IDs are unique.
The way we move shards is to use PostgreSQL's built-in streaming replication to create an exact, in-sync copy of a set of tablespaces, then 'fail over' to a new machine and start reading/writing to a subset of those tablespaces (this is similar to how Facebook describes their shard-moving process).
Not really. This is like MongoDB, where the first four bytes of an ObjectId are a timestamp. That the timestamp is synced with other instances isn't paramount, because the actual value of the timestamp does not matter. What does matter is that the timestamp is new and increasing every second; this is to retain sorting capability. With the 13 bits that represent the logical shard ID from this article, Instagram will guarantee uniqueness of an ID within the granularity of a second.
"What does matter is that the timestamp is new and increasing every second."
Right. I'm just sayin' that you have to be careful when you move data with a caveat like that. Moving the shard keyspace (the 13 bits) to a new machine that started generating IDs even one second behind (the first 4 bytes) would be troublesome, no?
Yep--definitely something to watch out for. At worst, though, you'd have a duplicate key when trying to insert, and can re-try without the risk of having a duplicate ID floating around your system.
The lower rotating 10 bits should give them a reasonable safety margin. If they're creating less than 128 entries in a particular shard per second (right now they're doing that across their entire datastore), their clocks would need to be out by 8 seconds to cause a problem.
They should definitely be monitoring their clocks though :)
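Back-of-the-envelope version of that margin (the 128/sec rate is the assumption from the comment above):

    SEQUENCE_VALUES = 2 ** 10    # the 10 rotating bits give 1024 values per shard
    INSERTS_PER_SECOND = 128     # assumed per-shard write rate

    # How long before the same sequence value comes around again on one shard;
    # a clock would have to be off by more than this to mint a duplicate ID.
    print(SEQUENCE_VALUES / INSERTS_PER_SECOND)  # 8.0 seconds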
There's a 1:1 mapping between the user's hash modulo the number of shards, and the table they're writing to. So, if we had 1000 logical shards, we'd have 1000 schemas/tablespaces, with the same tables in each. And the database's own nextval() feature makes sure we never have the same ID twice. Hope that clarified things.
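For reference, the resulting bit layout (41 bits of time, 13 bits of shard, 10 bits of sequence, per the post) can be sketched in a few lines of Python; the real thing runs inside a PL/pgSQL function so nextval() supplies the sequence:

    import time

    EPOCH_MS = 1314220021721  # the custom Aug 24th, 2011 epoch from the post's SQL

    def next_id(shard_id, seq):
        # seq is the shard-local counter, i.e. nextval() in PostgreSQL
        ms = int(time.time() * 1000) - EPOCH_MS
        return (ms << 23) | (shard_id << 10) | (seq % 1024)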
I understand 1000 logical shards is probably a good number to use with PostgreSQL. I'm currently looking at doing something like this myself but I was thinking that it would be so much more elegant if each user could live in his own shard.
Does anybody know of a database product/solution that will handle hundreds of thousands of shards? I don't have a need for joins across shards; each user account only uses its own data. And data and access within a shard would be very small, so a SQLite instance for every user would be a theoretical solution. The idea would be to have one (in-memory) table to connect the app to the correct shard for the user and then get "perfect" horizontal scaling. Is anybody doing this?
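For what it's worth, the SQLite-per-user variant is easy to prototype (paths and schema entirely hypothetical):

    import os
    import sqlite3

    DATA_DIR = "/var/data/user_shards"  # hypothetical: one database file per user

    def connect_for_user(user_id):
        # The routing "table" here is just a filename convention; a real system
        # would keep an in-memory map from user to storage node instead.
        path = os.path.join(DATA_DIR, f"user_{user_id}.db")
        conn = sqlite3.connect(path)
        conn.execute("CREATE TABLE IF NOT EXISTS photos (id INTEGER PRIMARY KEY, url TEXT)")
        return conn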
Did you evaluate using something like Redis for this? It's got an atomic increment command that guarantees unique IDs, and the performance is stupid fast.
We love Redis at Instagram (it powers a lot of our systems...more write-ups on these soon), and considered it for our ID generation, but it would have introduced a single point of failure, unless we split the load between several Redis instances, at which point it would be hard to make the IDs time-sortable. Also, most of our Redis systems are durable within a minute (we write to disk on a slave every minute), but if we were to lose the master and slave simultaneously (imagine an EC2 network issue or such), then it would be hard to know what the last known 'good' ID was--not an insurmountable problem, but one more moving part to worry about. That said, using Redis with something like the upcoming Redis Cluster could be a good choice.
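For anyone curious, the Redis approach under discussion is only a couple of lines (a sketch using the redis-py client; the key name is hypothetical), which is exactly why the single-point-of-failure tradeoff is the whole question:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    def next_id():
        # INCR is atomic, so concurrent clients never get the same value--
        # but every new ID now depends on this one instance being reachable.
        return r.incr("global:id_sequence")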
I like it. Question: did you consider using composite keys of shard_id and id to make up a single primary key? If so, what were the pros/cons you found with that approach?
We considered the composite key, but we often have to store keys in other systems like Redis, where having a single 64-bit integer makes it more portable and compactly stored (Redis, for example, has an optimization when storing integer values vs string values in its lists). Thanks for the comment!
Something still puzzles me though. You wrote a sequence is unique to each table in each schema. However, your example seems to use the same sequence name for all tables within a schema.
Also, shouldn't your mod operation have only one %?
One nit about snowflake in your article (I'm the author of snowflake): the ZooKeeper integration is optional and only used for sanity-checking the configuration (that the worker ids are distinct, etc.).
You could avoid having to create a per-schema id function by passing the shard id and the sequence oid as params to the function, so you'd have:
default public.next_id(5,'insta5.table_id_seq'::regclass)
Hi Mike, excellent article! One question about sharding: do you find that it reduces high availability overall, as your uptime depends on the vagaries of additional database servers?
What I'd be more curious to hear about is how they deal with super nodes, and whether they're storing a map (or any kind of routing table) or simply using generic modulo hashing on a key.
Right now, it's a lookup dict in our Django app--which involves brief downtime just to update the shard map when moving data (more on this in another post). We hash on user ID (in most cases), look up the shard # in the dict, then look up that shard # in a logical-to-physical dict.
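In sketch form, that two-level lookup is roughly (names and counts hypothetical, with a plain modulo standing in for the hash):

    NUM_LOGICAL_SHARDS = 1000  # hypothetical, matching the example above

    # logical -> physical is the dict that gets updated (with brief downtime)
    # whenever a set of logical shards moves to a new machine
    LOGICAL_TO_PHYSICAL = {0: "db1.example.com",
                           1: "db1.example.com",
                           2: "db2.example.com"}  # ...and so on

    def server_for_user(user_id):
        logical_shard = user_id % NUM_LOGICAL_SHARDS  # stand-in for the hash
        return LOGICAL_TO_PHYSICAL[logical_shard]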
Ah ok that's what I figured. We do a lot of the same things right now with various DBs (we just describe it as pre-sharding so we can avoid the actual re-sharding dilemma). We've had to do this with both PostgreSQL and Redis now.
The schemas are definitely a neat way to handle this, though, so you don't have to worry about table names.
Holy god. The post mentions "We evaluated a few different NoSQL solutions, but (...)".
I couldn't even imagine having to _think_ about this sort of thing; CouchDB makes this an absolute no-brainer. I mean, the act of creating a document assigns it a UUID in the database _by_ _default_.
Or, do you want to fetch a UUID before assigning it to any data? localhost:5984/_uuids
Want to fetch 10? localhost:5984/_uuids?count=10
Want to fetch a million? localhost:5984/_uuids?count=1000000
Instagram seems like the absolutely perfect candidate for CouchDB -- unstructured data, speed, HTTP querying, big data, attachments...
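Fetching those from code is just as trivial; here's a standard-library Python sketch against CouchDB's _uuids endpoint:

    import json
    import urllib.request

    def fetch_uuids(count=10):
        # Ask a local CouchDB for a batch of fresh UUIDs in one round trip.
        url = f"http://localhost:5984/_uuids?count={count}"
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)["uuids"]

    print(fetch_uuids(3))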
I used to work at Meebo, which hosts a large CouchDB system, but when it came time to choose a solution for Instagram we went with PGSQL. There was a lot I really liked about Couch, especially for the analytics system we built at Meebo (where the map/reduce views worked great), but I wouldn't classify it as a low-Ops-burden technology--like any newer db solution, there are some rough edges and more 'unknowns' at scale.
Also, we'd still have to write the middleware to assign data to shards and fetch data from shards in our system (unless we used something like BigCouch), so having a more tried-and-tested solution that we already understood well was more appealing.
The default _uuids API call generates random UUIDs; however, you could override it in the _config to use a different algorithm field, one that could potentially still call "utc_random" while including a timestamp in the string to sort by later. Was this thought about when CouchDB was considered?
(author here) I think this should also be possible with MySQL's stored functions, but one huge benefit of PostgreSQL is the schema/tablespace feature, since it means all our logical shards live inside one database (you could do something similar with MySQL, but it would mean multiple databases or prefixing table names with the shard ID).
PostgreSQL schemas are very similar to MySQL databases in functionality. In fact, in MySQL you can use SCHEMA in all the places you can use the term DATABASE, e.g. CREATE SCHEMA foo; instead of CREATE DATABASE foo;
4. Store your shard keyspace in DNS with SRV/TXT records
5. Assign IDs randomly
This gives a couple of interesting properties: for replication, odds are very high that your corresponding server has a shard of similar size, and if not, it's generally only split over two servers. Servers also only need to speak directly to their one or two replicas in the other data center. The other really nice property, if you're using spindle disks, is that you can copy the entire HD and just delete the half of the keyspace not used after the split.
It's all written up here, with a full description of all the pieces, if you want to dig in:
http://bit.ly/FredPatent
It takes some nerve to actually be proud of that patent.
Luckily for the rest of us, there is not just one way to do sharding (which in the original version of your post you assertively referred to as "how to do sharding right").
It depends on the individual architecture and the tradeoffs people consider important. That means you don't get to have an undeserved but legal monopoly holding the entire scalable web hostage, just a part of it. Nevertheless: you should be ashamed of yourself.