Ask HN: Content Serving Cluster
14 points by ajkirwin on March 1, 2009 | 12 comments
I run a small website, which shall remain unnamed so as to avoid unnecessary threads. I'm pushing probably a few terabytes of data a month now, so to improve speed whilst keeping costs low, I plan to serve files from multiple machines. A mini-CDN, if you will.

The files will be the same across each distribution server, but I am not sure of the best way to replicate the uploaded files across machines.

Example: Someone uploads xyz.tar.gz to Machine #1. I need it replicated, as fast as possible, to Machines #2 and #3, so that when visitors hit the site and get sent to http://cdn-3.mysite.com/, they'll still get the file.

Does HN have any suggestions as to the best way to go about this?

Edit: My files aren't very large (maybe 5 GB in total) and don't grow too fast; they're just accessed a lot.




MogileFS will get you close to this, but stores files with its own ID and directory scheme. If filenames don't matter, or if you have a gateway that's handling the mogile lookup and filename translation, it's a great solution.

If you already have a reliable central filestore, varnish or squid might accomplish faster distribution without having to replicate all your files.

Otherwise, I'm curious to see everyone's suggestions. I've looked at more *sync programs than I can count to handle this use case and come up empty-handed.


Honestly, I think you're likely to find that Amazon S3 is the best option. It costs money, but assuming your business generates money (many businesses do), it will probably be more reliable, and cost less of your expensive time, than anything else.

Otherwise: rsync is your friend. Run it in daemon mode. If you've just got a handful of machines, I'd nominate one machine as the server to receive all uploads; everybody else just syncs their upload directory from that machine's.
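A rough sketch of the pull side, assuming the master exports its upload directory as a daemon-mode rsync module called "uploads" (the hostname, module name, and paths here are all made up) and each mirror runs this from cron:

    #!/usr/bin/env python
    # Pull new uploads from the master's rsync daemon into the local docroot.
    # Run from cron on each mirror, e.g. every minute.
    import subprocess, sys

    MASTER = "rsync://master.mysite.com/uploads/"  # hypothetical daemon-mode module
    LOCAL = "/var/www/files/"                      # hypothetical local docroot

    # -a: archive mode (perms, times), -z: compress. rsync writes to a temp
    # file and renames it, so half-copied files never become visible.
    sys.exit(subprocess.call(["rsync", "-az", MASTER, LOCAL]))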

You may also want to consider offering bittorrent as an option, since this situation appears to be tailor-made for it.


I don't see how rsync will work here: during the sync, the new file won't be available (in its entirety) from the other machines. As I understand it, the OP is looking for something that will sync while already serving up the file, presumably giving priority to syncing the parts that are already being requested by clients (which will be the beginning of the file most of the time, unless clients are using partial downloads).

I've never set one up myself, but a clustering file system might work, e.g. OCFS: http://oss.oracle.com/projects/ocfs2/


You say "someone uploads xyz.tar.gz to Machine #1"; does this mean that all the bits are uploaded to the same machine, or will you have some files uploaded to machine #1, some files uploaded to machine #2, et cetera?

If files are uploaded to multiple machines, is it possible that you'd get two different files with the same name uploaded to different machines? If so, how do you want to handle this?

Will you ever have files deleted?

Do you have any ordering requirements, e.g., files have to appear on each machine in the same order as they were originally uploaded?


Files will never be deleted and there are no ordering requirements; files just need to exist across all machines. (Every machine must have a copy of every file so that they can serve files in a round-robin fashion.)


Why not write a custom 404 handler which, when triggered, downloads the file from the master server, and redirects the request to the master while the file hasn't been fully fetched yet?

This worked pretty well for a friend's site, and you don't have to care about replication anymore.
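For what it's worth, here's a minimal sketch of that idea as a standalone WSGI app (the master hostname and docroot are invented, and there's no path sanitizing): on a miss it bounces the client to the master and pulls a copy in the background, so the next request is served locally.

    # Minimal sketch of the 404-handler idea as a WSGI app. MASTER and DOCROOT
    # are invented; real code should also reject ".." in requested paths.
    import os, threading, urllib.request
    from wsgiref.simple_server import make_server

    MASTER = "http://master.mysite.com"  # hypothetical origin server
    DOCROOT = "/var/www/files"           # hypothetical local docroot

    def fetch(path):
        # Download to a temp name, then rename so we never serve half a file.
        local = os.path.join(DOCROOT, path.lstrip("/"))
        os.makedirs(os.path.dirname(local), exist_ok=True)
        urllib.request.urlretrieve(MASTER + path, local + ".part")
        os.rename(local + ".part", local)

    def app(environ, start_response):
        path = environ["PATH_INFO"]
        local = os.path.join(DOCROOT, path.lstrip("/"))
        if os.path.isfile(local):
            start_response("200 OK", [("Content-Type", "application/octet-stream")])
            return [open(local, "rb").read()]
        # Not here yet: start pulling it, and redirect this client to the master.
        threading.Thread(target=fetch, args=(path,), daemon=True).start()
        start_response("302 Found", [("Location", MASTER + path)])
        return [b""]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()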


Panther Express is cheap; cheaper than S3.

Let someone else worry about that issue. The amount of time you'll spend setting it up, testing it, and maintaining it is time taken away from other important aspects of your business/application. You are not going to be able to do it any better or any cheaper.


Panther really is dirt cheap.


You could use a distributed file system like:

http://en.wikipedia.org/wiki/Lustre_(file_system)

A simpler option would be to do a little scripting magic to catch inotify events and then trigger an rsync.
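For example, with the third-party pyinotify module on the master, something like this would push each finished upload out as soon as it's written (mirror hostnames and paths are made up):

    # Rough sketch: when a file finishes writing in the upload directory,
    # push it to each mirror over rsync+ssh. Assumes ssh keys are already
    # set up and a flat upload directory; names below are hypothetical.
    import subprocess
    import pyinotify

    UPLOADS = "/var/www/files"
    MIRRORS = ["cdn-2.mysite.com", "cdn-3.mysite.com"]

    class Pusher(pyinotify.ProcessEvent):
        def process_IN_CLOSE_WRITE(self, event):
            for host in MIRRORS:
                subprocess.call(["rsync", "-az", event.pathname,
                                 "%s:%s/" % (host, UPLOADS)])

    wm = pyinotify.WatchManager()
    wm.add_watch(UPLOADS, pyinotify.IN_CLOSE_WRITE, rec=True)
    pyinotify.Notifier(wm, Pusher()).loop()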


Your options (if you are going to do it yourself and not use S3 or CacheFly):

1. rsync on the backend; it's easy, it's relatively fast, but it is asynchronous and files won't be immediately available on all servers.

2. An inotify watcher that copies files to the slaves when a file is written or changed on the master. Faster than an rsync solution, but you'll need to write it yourself.

In either case, you will want to seriously question whether you should do it yourself; look at the cost of keeping machines operational and how much you are paying for bandwidth.


rsync is built for this: http://samba.anu.edu.au/rsync/


I recommend looking at GlusterFS. It's a fast, modular, layered clustering file system. It isn't tied to the Linux kernel because it's built on (patched) FUSE. With InfiniBand hardware, it's the fastest clustered file system available for free. See http://www.gluster.org/docs/index.php/GlusterFS_1.3.pre2-VER... for an example.



