Interesting, they have their own custom epoch, the "twepoch":
// Tue, 21 Mar 2006 20:50:14.000 GMT
val twepoch = 1142974214000L
According to the README they can fit 69 years worth of timestamps in 41 bits with the custom epoch, since they don't care about any times that happened before Twitter launched.
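The 41-bit arithmetic is easy to check. Here's a rough sketch in Python of how such an ID might be packed; the 41/10/12 bit split (timestamp / machine id / sequence) is my assumption about the layout, not a quote from the README:

```python
# The "twepoch" quoted above, in milliseconds.
TWEPOCH = 1142974214000

# Assumed field widths: 41 bits of ms since twepoch, then machine id, then
# a per-millisecond sequence number. Together they fill a 64-bit integer.
TIMESTAMP_BITS = 41
MACHINE_BITS = 10
SEQUENCE_BITS = 12

def make_id(timestamp_ms: int, machine_id: int, sequence: int) -> int:
    """Pack a roughly-time-sorted 64-bit id: timestamp in the high bits."""
    return (((timestamp_ms - TWEPOCH) << (MACHINE_BITS + SEQUENCE_BITS))
            | (machine_id << SEQUENCE_BITS)
            | sequence)

# 41 bits of milliseconds is where the "69 years" claim comes from:
years = (1 << TIMESTAMP_BITS) / 1000 / 60 / 60 / 24 / 365.25
```

Because the timestamp occupies the high bits, plain integer comparison of two IDs orders them by time first, which is exactly the "roughly sortable" property.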
The most interesting part of this post to me is the implication that despite a lot of noise about Twitter switching to Cassandra, they haven't actually done it yet -- if they had, they'd need snowflake in production, and they say that it's not.
Makes me think: when you scale databases up, you always seem to have to loosen constraints. It's similar to how the laws of nature change with scale, i.e. quantum mechanics when you go really small, or strange behavior when you look really far out into the universe.
Still, these theories have to be confirmed. I just hope I'll be alive when all the physics of the Universe has been explained (I mean that we can use the discovered laws and apply them, knowing every single outcome), and we create other Universes.
Their approach to this is genius: they have some ideas for designs and some things they've tested out -- but while it's still in alpha and not yet in production, they open-source it, allowing lots of other developers to take a look, find problems with their ideas, and hopefully make it even better.
This hints at a very interesting math problem just waiting to be investigated. Unique IDs are a natural fit for something like a hash function, but now consider the "roughly sortable" requirement -- how about a homomorphic hash (invent one!), or a hash with a weak/predictable avalanche property.
Of course, that is probably too much for twitter who just needs to Get It Done, but I find such things interesting to think about.
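One toy way to picture a "roughly sortable hash" is to let the high bits carry a coarse timestamp and the low bits carry an ordinary hash of the payload. This is purely my own illustration of the k-sorted idea, not anything Twitter does:

```python
import hashlib

def ksorted_hash(payload: bytes, timestamp_s: int) -> int:
    """Hypothetical roughly-sortable hash: coarse time prefix + hash suffix.

    IDs from different ~2-second buckets compare in time order; within a
    bucket the order is arbitrary -- that's the loosened constraint.
    """
    coarse = timestamp_s >> 1  # ~2 s buckets set the sorting bound "k"
    digest = hashlib.blake2b(payload, digest_size=4).digest()
    return (coarse << 32) | int.from_bytes(digest, "big")
```

The hash suffix gives uniqueness (up to collisions) while the prefix preserves rough time order, so sorting the raw integers roughly sorts by time.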
It would be an interesting problem to really try to get the timestamps to within tens of ms for the time-sorting bound. With multiple datacenters you might see up to ~10 ms of drift with NTPv4, so your starting point is already pretty crippled. And how would you even test? :-)
When looking at it, I was a bit dubious until we ran some tests at work and compared it with our current NTP setup using a GPS IRIG-B receiver. Even for disconnect periods longer than 48 hours, we saw only a small drift (around 200 nanoseconds) compared to traditional NTP.
The only drawback is that you need a kernel patch to make it work.
You could use a GPS clock in each datacenter...AFAIK, the time signal is accurate to ~1 µs, so you shouldn't need to coordinate the time signal between datacenters.
"Used in the Internet of today with computers ranging from personal workstations to supercomputers, NTP provides accuracies generally in the range of a millisecond in LANs and up to a few tens of milliseconds in the global Internet"
But this is news to me, interesting:
"When kernel support for precision timing signals, such as a pulse-per-second (PPS) signal, is available the accuracy can be improved ultimately to the order of one nanosecond in time and one nanosecond per second in frequency."
I think this is big because releasing this open source moves Twitter into an R&D/software type of company who is contributing to more than just their own system. They are moving in the same direction as companies such as Google and Facebook.
What an interesting problem! Definitely not something I had ever given much thought to in all honesty. Do Twitter now have enough Tweets flying out at any given moment in time that they need to have multiple ID-generating servers?
Not sure about anyone else but it makes my mind boggle!
Kind of pointless. Just use a UUID for the ID and a date for the thing you sort by. Twitter stores a "posted at" time, so why not sort by that? Using an ID column for a time-based sort when you are storing the time anyway is silly.
Remember that this is for Cassandra. Cassandra only allows you to sort records by the primary key, so if you want to sort by date, the time has to be part of the primary key while also being unique. So vanilla UUID doesn't work. Some UUIDs can take a time argument, but as the post says, they couldn't use UUIDs anyway, because they are all 128-bit and Twitter is limited for historical reasons to 64.
Creating an index on the posted_at time may be harder than it sounds, with "tens of thousands of tweets per second" and a distributed database. The "within a second" approach (i.e. loosening the ordering constraint) sounds pretty brilliant to me.
But you don't need to index on time; you can sort on the client. This is what twitter clients already do, except they sort an integer instead of a datetime.
Remember: Twitter is nothing new. We have done massively distributed messaging since the 70s. It's called e-mail.
Because the APIs fetch recent tweets, the service needs an index on time.
Although it should be easy for client applications to sort on time instead of id, that's not what all applications do. Twitter chose not to break clients that sort on id.
Except in email everyone has a separate mailbox. On twitter everyone's mailbox overlaps and there are an infinite number of possible public views of the tweets (searches, lists, etc.)
right, so it doesn't fit the needs of the OP, but for other users without the 64-bit requirement, version 1 UUIDs can be k-sorted without an additional column.
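A quick way to see the k-sortability of version 1 UUIDs (assuming a host where Python's `uuid.uuid1()` is available): they embed a 60-bit timestamp in 100 ns ticks, exposed in Python as the `.time` attribute, so you can sort on that field even though the UUIDs' byte order isn't time-ordered:

```python
import uuid

# Generate a few version 1 (time-based) UUIDs.
ids = [uuid.uuid1() for _ in range(3)]

# Each embeds a 60-bit timestamp; sort on it to recover time order.
by_time = sorted(ids, key=lambda u: u.time)
```

Note this carries the time inside the 128-bit UUID itself -- which is exactly what Twitter's 64-bit constraint rules out.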
Just wondering if the Twitter users here are able to use the Twitter website consistently. I cannot. I frequently am unable to view user profiles, load up my Twitter page in order to post tweets, or actually post a tweet after submitting the form. As a result my Twitter usage has dropped dramatically. Anyone else having the same issues?