Hacker News
Twitter Announces "Snowflake" for Unique Tweet IDs (engineering.twitter.com)
111 points by jolie on June 2, 2010 | hide | past | favorite | 38 comments



Interesting, they have their own custom epoch, the "twepoch":

    // Tue, 21 Mar 2006 20:50:14.000 GMT
    val twepoch = 1142974214000L
According to the README they can fit 69 years worth of timestamps in 41 bits with the custom epoch, since they don't care about any times that happened before Twitter launched.
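A rough sketch (in Python, not Twitter's Scala) of how a Snowflake-style 64-bit id could pack a millisecond timestamp, a worker id, and a per-millisecond sequence number. The 41-bit timestamp matches the README's claim (2^41 ms is about 69.7 years); the 10/12 split for worker and sequence bits, and the helper names, are illustrative assumptions, not Twitter's exact code:

```python
TWEPOCH = 1142974214000  # Tue, 21 Mar 2006 20:50:14 GMT, in milliseconds

WORKER_BITS = 10    # assumed split of the remaining 22 bits
SEQUENCE_BITS = 12

def make_id(timestamp_ms: int, worker_id: int, sequence: int) -> int:
    """Pack the three fields into one 64-bit integer, timestamp in the high bits."""
    return (
        ((timestamp_ms - TWEPOCH) << (WORKER_BITS + SEQUENCE_BITS))
        | (worker_id << SEQUENCE_BITS)
        | sequence
    )

def timestamp_of(snowflake_id: int) -> int:
    """Recover the millisecond timestamp from an id."""
    return (snowflake_id >> (WORKER_BITS + SEQUENCE_BITS)) + TWEPOCH

# Because the timestamp occupies the high bits, ids from a later
# millisecond always compare higher, regardless of which worker made them:
earlier = make_id(1275000000000, worker_id=9, sequence=100)
later = make_id(1275000000001, worker_id=0, sequence=0)
assert earlier < later
assert timestamp_of(earlier) == 1275000000000
```

Putting the timestamp in the high bits is what makes the ids "roughly sortable": within a single millisecond, ordering falls back to worker id and sequence, which is arbitrary across machines.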


That's the time of the first tweet. It saves us 30-something years of id space.


Mostly, though, we think it's just adorable.


Totally OT but do you guys still know the content of the first tweet?

Apparently the best google-fu effort only yields #20: http://twitter.com/jack/status/20


That is, as far as the archive goes, the very first public tweet.


The most interesting part of this post to me is the implication that despite a lot of noise about Twitter switching to Cassandra, they haven't actually done it yet -- if they had, they'd need snowflake in production, and they say that it's not.


They have a large Cassandra node which they are using for some of their data, but it is not the primary datastore yet.

Much like how we (reddit) are using it for a chunk of our data, but we still have Postgres as our canonical source.

We are however moving slowly towards more Cassandra.


It turns out big problems are big. We're moving existing data to Cassandra and putting a lot of new (mostly internal) data on Cassandra.


I didn't mean to imply that it was easy, merely that I'd got the impression from presentations that the migration had already happened.


Makes me think: when you scale databases up, you always seem to have to loosen constraints. It's similar to how the laws of nature change with scale: quantum mechanics when you go really small, or strange behavior when you look really far out into the universe.


Consistency, Availability, Partition Tolerance: pick any two.


Looks like our Universe chose the last two.


I dunno, quantum mechanics / oddness with atomic-and-smaller implies partition tolerance wasn't one of the choices.

Maybe Availability was all that was chosen? And it's doomed in a mere some-billion years anyway...


Still, we have to confirm these theories. I just hope I'll be alive when all of the Universe's physics has been explained (I mean that we can use the discovered laws and change them, knowing every single outcome), and we create other Universes.


Their approach for this is genius: they have some ideas for designs, some things they tested out - but while it's still alpha and not yet in production they open source it, allowing lots of other developers to take a look and find problems with their ideas and hopefully make it even better.


There is an implication of a very interesting math problem just waiting to be investigated. Unique IDs are a natural fit for something like a hash function, but now consider the "roughly sortable" requirement -- how about a homomorphic hash (invent one!), or a hash with a weak/predictable avalanche property.

Of course, that is probably too much for twitter who just needs to Get It Done, but I find such things interesting to think about.


It would be an interesting problem to really try to get the timestamps within tens of ms for the time-sorting bound. With multiple datacenters you might see up to ~10 ms of drift with NTPv4, so your starting point is already pretty crippled. And how would you even test? :-)


Yes, this is a good point. There is a nice alternative to NTP for limiting drift, called radclock:

http://www.cubinlab.ee.unimelb.edu.au/radclock/

More information about its accuracy is available here:

http://www.cubinlab.ee.unimelb.edu.au/radclock/performance.p...

When looking at it, I was a bit dubious until we ran some tests at work and compared it with our current NTP setup using a GPS IRIG-B receiver. Even for disconnection periods longer than 48 hours, we saw only a small drift (around 200 nanoseconds) compared to traditional NTP.

The only drawback is that you need a kernel patch to make it work.


You could use a GPS clock in each datacenter...AFAIK, the time signal is accurate to ~1 µs, so you shouldn't need to coordinate the time signal between datacenters.


NTP seems to work fairly well for this. It takes into account network latencies and attenuates over time.


Well, quoting from: http://www.eecis.udel.edu/~mills/ntp.html

"Used in the Internet of today with computers ranging from personal workstations to supercomputers, NTP provides accuracies generally in the range of a millisecond in LANs and up to a few tens of milliseconds in the global Internet"

But this is news to me, interesting:

"When kernel support for precision timing signals, such as a pulse-per-second (PPS) signal, is available the accuracy can be improved ultimately to the order of one nanosecond in time and one nanosecond per second in frequency."


So much for the new features; now where can I get my 3201st tweet?


I think this is big because releasing this as open source moves Twitter toward being an R&D/software type of company that contributes to more than just its own system. They are moving in the same direction as companies such as Google and Facebook.


What an interesting problem! Definitely not something I had ever given much thought to, in all honesty. Do Twitter now have enough tweets flying out at any given moment that they need multiple ID-generating servers?

Not sure about anyone else but it makes my mind boggle!


This would be somewhat useful for logging data as well!


Kind of pointless. Just use a UUID for the ID and a date for the thing you sort by. Twitter stores a "posted at" time, so why not sort by that? Using an ID column for a time-based sort when you are storing the time anyway is silly.

Oh well, at least it was a fun hack.


Just use a UUID for the ID and a date for the thing you sort by.

Sure, except for the stated design goal of being backwards compatible with code which expects a 64-bit ID and uses that value for sorting posts.


Remember that this is for Cassandra. Cassandra only allows you to sort records by the primary key, so if you want to sort by date, the time has to be part of the primary key while also being unique. So vanilla UUID doesn't work. Some UUIDs can take a time argument, but as the post says, they couldn't use UUIDs anyway, because they are all 128-bit and Twitter is limited for historical reasons to 64.


> Using an ID column for a time-based sort when you are storing the time anyway is silly.

Your solution violates the listed requirements that the ID is 64 bits and the IDs need to be roughly sortable.


Creating an index on the posted_at time may be harder than it sounds, with "tens of thousands of tweets per second" and a distributed database. The "within a second" approach (i.e. loosening the ordering constraint) sounds pretty brilliant to me.


But you don't need to index on time; you can sort on the client. This is what twitter clients already do, except they sort an integer instead of a datetime.

Remember: Twitter is nothing new. We have done massively distributed messaging since the 70s. It's called e-mail.


Because the APIs fetch recent tweets, the service needs an index on time.

Although it should be easy for client applications to sort on time instead of id, that's not what all applications do. Twitter chose not to break clients that sort on id.


You could just fake the id later. Why ruin your architecture because your users are lazy?


Except that in email everyone has a separate mailbox. On Twitter everyone's mailboxes overlap, and there are an infinite number of possible public views of the tweets (searches, lists, etc.).


You don't need a separate column if you are using Version 1 UUIDs.


Version 1 UUIDs require 128 bits.


Right, so it doesn't fit the needs of the OP, but for other users without the 64-bit requirement, version 1 UUIDs can be k-sorted without an additional column.
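Version 1 UUIDs embed a 60-bit timestamp (in 100 ns units since the Gregorian epoch, 1582-10-15), so they can be time-ordered by that field without storing a separate column. A quick sketch with Python's standard library:

```python
import uuid

first = uuid.uuid1()
second = uuid.uuid1()

# The .time property exposes the embedded 60-bit timestamp; UUIDs generated
# later in the same process carry values at least as large.
assert first.time <= second.time

# Caveat: sorting the raw 128-bit values does NOT give time order, because
# the v1 layout puts the LOW timestamp bits first. You have to sort on the
# extracted timestamp field (or use a reordered variant) to get k-sorting.
```

This is why "k-sorted without an additional column" still requires the sort key to be the decoded timestamp rather than the UUID's byte representation.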


Just wondering if the Twitter users here are able to use the Twitter website consistently. I cannot. I frequently am unable to view user profiles, load up my Twitter page in order to post tweets, or actually post a tweet after submitting the form. As a result my Twitter usage has dropped dramatically. Anyone else having the same issues?



