Jetpants: a MySQL toolkit for managing billions of rows and hundreds of DBs (engineering.tumblr.com)
191 points by evanelias on June 7, 2012 | 23 comments



Slightly OT, but Tumblr gave a great talk about their sharding architecture here: http://engineering.tumblr.com/post/12652551894/slides-from-o....

It's really quite a good intro to the subject, I think.


This is excellent stuff; it fits a niggling problem I have (I'm a better programmer than sysadmin, so managing massive data shards is a pain).

It's an elegant implementation. From a ~30 minute read through I reckon I can use it to replace our current "hacked up" solution in just a few hours.

Kudos Tumblr.


This is awesome! Also, I was at a MySQL event at Oracle two days ago and overheard the MySQL guys talking with the Pinterest folks about their sharding; the MySQL team is going to announce something soon and wanted the Pinterest team's feedback. MySQL is doing their scripting in Python, so for a Python shop their release might be more interesting.


Cool, looking forward to seeing that! I love Python too, and readily admit it's perhaps a more frequent choice for this type of automation.

That said, I really grew to love Ruby over the course of this project, which is actually my first in the language. Ruby's open classes allowed me to write a pretty flexible plugin/callback system with very little code. Jetpants allows you to hook arbitrary methods in before or after any method in any Jetpants class, and these callbacks "stack" (with support for different priorities) so multiple plugins can hook into the same place.

Because every large site seems to tackle sharding slightly differently, I figured a nice plugin system was pretty important in order for anyone else to be able to use this :)
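For anyone who wants to picture the stacking before/after callbacks described above, here is a minimal sketch of the general idea in Python. It is not Jetpants code or its API: the names `hookable` and `add_callback`, the `DB` class, and the rule that a higher priority number runs first are all assumptions made up for illustration. Unlike Ruby's open classes, this version needs an explicit decorator on each method that can be hooked.

```python
import functools
from collections import defaultdict

# Hypothetical registry: (class name, method name, "before"/"after") -> [(priority, callback)]
_registry = defaultdict(list)


def add_callback(cls, method_name, when, callback, priority=100):
    """Register a callback to run before or after cls.method_name."""
    _registry[(cls.__name__, method_name, when)].append((priority, callback))


def _run(key, obj, args, kwargs):
    # Assumption: higher priority runs first; ties keep registration order
    # because sorted() is stable.
    for _, callback in sorted(_registry[key], key=lambda pc: -pc[0]):
        callback(obj, *args, **kwargs)


def hookable(method):
    """Wrap a method so that registered callbacks stack around it."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        base = (type(self).__name__, method.__name__)
        _run(base + ("before",), self, args, kwargs)
        result = method(self, *args, **kwargs)
        _run(base + ("after",), self, args, kwargs)
        return result
    return wrapper


class DB:
    """Stand-in class; records what happened so the order is visible."""
    def __init__(self, name):
        self.name = name
        self.log = []

    @hookable
    def promote(self):
        self.log.append("promote")


# Two independent "plugins" hook the same place with different priorities.
add_callback(DB, "promote", "before", lambda db: db.log.append("notify monitoring"), priority=10)
add_callback(DB, "promote", "before", lambda db: db.log.append("check replication lag"), priority=50)
add_callback(DB, "promote", "after", lambda db: db.log.append("update config"))

db = DB("db1")
db.promote()
print(db.log)
```

The point of the design is that neither plugin needs to know about the other; each only names the method it cares about and a priority.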


It looks interesting, but one small thing: I looked at the link about transferring large files quickly [1] and saw that they were using netcat and tar to transfer files. This is not necessarily optimal [2], and applying some compression can go a long way, although this will depend on the use case. Compression also has the added bonus of transferring less data across the network and (if you don't uncompress) using less space at the other end.

SSH is also an option (a slow option, but an option) that provides certain things (encryption, authentication) that make it ideally suited for transfers across network boundaries.

[1] - http://engineering.tumblr.com/post/7658008285/efficiently-co...

[2] - http://www.ndchost.com/wiki/server-administration/netcat-ove...
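To make the tar-over-netcat idea with optional compression concrete, here is a rough sketch that only builds the two shell pipelines as strings. It is not the exact command line from the linked post: the directory, host, and port are made-up values, and `nc -l PORT` is the OpenBSD netcat form (traditional netcat wants `-l -p PORT`), so treat the flags as assumptions to check against your own netcat.

```python
import shlex


def receiver_command(dest_dir, port, compress=True):
    """Pipeline to start first on the destination: listen, decompress, unpack."""
    stages = [f"nc -l {int(port)}"]  # assumption: OpenBSD-style netcat
    if compress:
        stages.append("pigz -d")     # decompress the incoming stream
    stages.append(f"tar -xf - -C {shlex.quote(dest_dir)}")
    return " | ".join(stages)


def sender_command(src_dir, dest_host, port, compress=True):
    """Pipeline to run on the source: pack, compress, send."""
    stages = [f"tar -cf - -C {shlex.quote(src_dir)} ."]
    if compress:
        stages.append("pigz")        # parallel gzip, reads stdin and writes stdout
    stages.append(f"nc {shlex.quote(dest_host)} {int(port)}")
    return " | ".join(stages)


# Hypothetical values, for illustration only.
print(receiver_command("/var/lib/mysql", 7000))
print(sender_command("/var/lib/mysql", "10.0.0.2", 7000))
```

Passing `compress=False` drops the pigz stages, which is the trade-off being discussed: less CPU per byte, more bytes on the wire.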


They're using pigz on [1] so they are compressing?

I thought it wasn't a great idea. OK for an on-the-fly solution to the problem, but bittorrent or multicast would seem better; the serial route between machines isn't very fault tolerant, requiring a start-from-scratch on failure.

socat > nc BTW, and does multicast.

As for ssh, it's a shame the "no encryption" option was removed.


Wow, I completely missed pigz in there for some reason and had to look it up.

I agree with you on all counts though!


re: fault tolerance, it's a fair point. Although in practice I've never had this fail part-way on me, and I've used it a couple hundred times with >600GB transfers.

We usually use this to copy to 2 or 3 machines at once; it's rare that we'd need to bring up 4+ slaves simultaneously, or split a shard into 4+ pieces. Most Linux distributions already have all the software needed except pigz, which is tiny and available in several packaging systems.

I'll definitely give socat a look though, thanks for the tip.


Oh, OK, I agree, for 2-3 machines reliability isn't an issue, I was thinking more dozens.


It's so good to see companies open sourcing some of their technologies. They probably do so mostly to improve their image, but who cares, and I have to say... it works on me! Thanks Tumblr.


No. They mostly do it to ensure that the software is actively maintained. When homegrown software reaches a certain level of maturity, it makes sense to set a roadmap and release it to the public. More users + developers = profit for both the company and the public.


Yes, if you get there that's a win-win. I wonder, though, how many projects really get community support after they are open sourced by a company.


My quick reading of this is that it's suited for databases that don't change much (or at all) once the data is inserted, and not as much for apps that need to keep strong ACID compliance with guaranteed referential integrity.


We handle many thousands of write queries per second at Tumblr, and we use Jetpants to manage our entire MySQL topology, so trust me when I say the data changes quite often :) You can edit your existing posts on Tumblr, unlike on several other prominent social sites.

Definitely please let me know how you got that impression though -- I'm happy to improve confusing things in the docs.

As for ACID compliance: Jetpants is a toolkit for MySQL / InnoDB, and doesn't really impact the referential integrity guarantees of those systems any more or less than other partitioning schemes. MySQL is inherently not a distributed system, for better or worse.


I got the impression because I didn't see any discussion about the handling of bringing new slaves online (other than how to make it fast). Do you pause one of the slaves to get a consistent dump?


Slave cloning is performed by shutting down mysqld on a standby slave and copying its raw data files. There's no dump involved. This is widely regarded to be the fastest possible way to clone a slave in MySQL.

This is explained in the deeper doc files -- didn't want to bog down the top-level README with implementation details.

Meanwhile, data exporting (for shard splits or table defragmentation) is done on a standby slave with replication stopped.
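The ordering of the cloning procedure described above (stop mysqld on a standby slave, copy its raw files, no dump) can be written down as a tiny plan builder. This is only a sketch of the sequence, not how Jetpants implements it: the function name `clone_plan` and the host names are hypothetical, and the restart steps at the end are an assumption about what has to follow the copy.

```python
def clone_plan(source, targets):
    """Return ordered, human-readable steps for cloning a standby slave."""
    if not targets:
        raise ValueError("need at least one target host")
    if source in targets:
        raise ValueError("source cannot also be a target")
    steps = [
        # mysqld is down during the copy, so the files on disk are consistent.
        f"stop mysqld on {source}",
        f"copy raw data files from {source} to {', '.join(targets)}",
        # Assumption: everything is started again once the copy completes.
        f"start mysqld on {source}",
    ]
    steps += [f"start mysqld on {t}" for t in targets]
    return steps


for step in clone_plan("standby1", ["new1", "new2"]):
    print(step)
```

Only the standby is ever taken down, which is why the master and the active slaves keep serving traffic while new slaves are built.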


Is there a mailing list for Jetpants so we can ask questions? For example, how should my application connect to MySQL? Should it connect directly to MySQL or to a Jetpants server, and how?


Good call -- we'll definitely set one up if there's sufficient need. Until then, feel free to email me; my email address is in the gemspec. I'll write up an FAQ once I have enough questions answered.

re: your immediate question, you still connect to MySQL as normal. Jetpants isn't a server, middleware, framework, or ORM. Rather, it's a toolkit. Jetpants has a command suite that you can use to run its built-in functionality, but it's also a Ruby gem that you can integrate into custom scripts however you'd like.

The functionality is all geared towards managing large DBs (importing/exporting lots of data quickly, copying files quickly, etc) and managing large numbers of servers (promoting/demoting masters and slaves, adding new machines to a pool, rebalancing a shard).


Thanks!

Another question - we are using Django, so we are considering Postgres since there is a Python connection pool available. Could Jetpants potentially be used for Postgres? i.e., how much of the functionality is MySQL-specific?


The core functionality is currently very MySQL-specific. In theory a plugin could override a bunch of methods to target Postgres, and maybe even Redis or other persistent data stores with replication and import/export functionality. It would be a lot of work though.

I also made the mistake of putting "mysql" in the names of a few methods. At some point soon I'll change those to more generic names, and alias the old names to the new generic ones.
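The rename-then-alias approach mentioned above looks roughly like this when sketched in Python. The class and method names here (`Database`, `start_server`, `start_mysql`) are invented for the example and are not the actual Jetpants method names.

```python
class Database:
    """Stand-in for a class that used an engine-specific method name."""

    def __init__(self):
        self.running = False

    def start_server(self):
        """New, engine-neutral name; holds the real implementation."""
        self.running = True
        return self.running

    # Old name kept as an alias so existing scripts and plugins keep working.
    start_mysql = start_server


db = Database()
print(db.start_mysql())  # old callers still work through the alias
```

Because the alias is the same function object under a second name, there is only one implementation to maintain while callers migrate.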


I'd love for this to do PgSql too.

Thanks very much Evan as well for our chats ages ago (Andrew here), was happy to see you release this tool!


Sounds quite useful. Was looking for some good tools for sharding and replicas for MySQL.


[deleted]


No, this is not a typo.

Jetpants parallelizes many steps of this process, allowing for greater throughput than could normally be achieved with traditional tools.



