Ask HN: Content Serving Cluster
14 points by ajkirwin on March 1, 2009 | 12 comments
I run a small website, which shall remain unnamed so as to avoid unnecessary threads. I'm pushing probably a few terabytes of data a month now, so to improve speed whilst keeping costs low, I plan to serve files from multiple machines. A mini-CDN, if you will.

The files will be the same across each distribution server, but I am not sure of the best way to replicate the uploaded files across machines.

Example: Someone uploads xyz.tar.gz to Machine #1. I need it replicated, as fast as possible, to Machines #2 and #3, so that when visitors hit the site and get sent to http://cdn-3.mysite.com/, they'll still get the file.

Does HN have any suggestions as to the best way to go about this?

Edit: My files aren't very large (maybe 5 GB in total) and don't grow too fast; they're just accessed a lot.




MogileFS will get you close to this, but stores files with its own ID and directory scheme. If filenames don't matter, or if you have a gateway that's handling the mogile lookup and filename translation, it's a great solution.

If you already have a reliable central filestore, varnish or squid might accomplish faster distribution without having to replicate all your files.

Otherwise, I'm curious to see everyone's suggestions. I've looked at more *sync programs than I can count to handle this use case and come up empty-handed.


Honestly, I think you're likely to find that Amazon S3 is the best option. It costs money, but assuming your business generates money (many businesses do), it will probably be more reliable, and cost less of your expensive time, than anything else.

Otherwise: rsync is your friend. Run it in daemon mode. If you've just got a handful of machines, I'd nominate one machine as the server to receive all uploads; everybody else just syncs their upload directory from that machine's.
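A rough sketch of the pull side, assuming the master exports its upload directory as a daemon-mode rsync module called "uploads" (the hostname, module name, and paths here are all made up) and each mirror runs this from cron:

    #!/usr/bin/env python
    # Pull new uploads from the master's rsync daemon into the local docroot.
    # Run from cron on each mirror, e.g. every minute.
    import subprocess, sys

    MASTER = "rsync://master.mysite.com/uploads/"  # hypothetical daemon-mode module
    LOCAL = "/var/www/files/"                      # hypothetical local docroot

    # -a: archive mode (perms, times), -z: compress. rsync writes to a temp
    # file and renames it, so half-copied files never become visible.
    sys.exit(subprocess.call(["rsync", "-az", MASTER, LOCAL]))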

You may also want to consider offering bittorrent as an option, since this situation appears to be tailor-made for it.


I don't see how rsync will work here: during the sync, the new file won't be available (in its entirety) from the other machines. As I understand it, the OP is looking for something that will sync while already serving up the file, presumably giving priority to syncing the parts that are already being requested by clients (which will be the beginning of the file most of the time, unless clients are using partial downloads).

I've never set one up myself, but a clustering file system might work, e.g. OCFS: http://oss.oracle.com/projects/ocfs2/


You say "someone uploads xyz.tar.gz to Machine #1"; does this mean that all the bits are uploaded to the same machine, or will you have some files uploaded to machine #1, some files uploaded to machine #2, et cetera?

If files are uploaded to multiple machines, is it possible that you'd get two different files with the same name uploaded to different machines? If so, how do you want to handle this?

Will you ever have files deleted?

Do you have any ordering requirements, e.g., files have to appear on each machine in the same order as they were originally uploaded?


Files will never be deleted and there are no ordering requirements; files just need to exist across all machines. (Every machine must have a copy of every file so that they can serve files in a round-robin fashion.)


Why not write a custom 404 handler which, when triggered, downloads the file from the master server, and redirects the request to the master while the file hasn't been fully fetched yet?

This worked pretty well for a friend's site, and you don't have to care about replication anymore.
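For what it's worth, here's a minimal sketch of that idea as a standalone WSGI app (the master hostname and docroot are invented, and there's no path sanitizing): on a miss it bounces the client to the master and pulls a copy in the background, so the next request is served locally.

    # Minimal sketch of the 404-handler idea as a WSGI app. MASTER and DOCROOT
    # are invented; real code should also reject ".." in requested paths.
    import os, threading, urllib.request
    from wsgiref.simple_server import make_server

    MASTER = "http://master.mysite.com"  # hypothetical origin server
    DOCROOT = "/var/www/files"           # hypothetical local docroot

    def fetch(path):
        # Download to a temp name, then rename so we never serve half a file.
        local = os.path.join(DOCROOT, path.lstrip("/"))
        os.makedirs(os.path.dirname(local), exist_ok=True)
        urllib.request.urlretrieve(MASTER + path, local + ".part")
        os.rename(local + ".part", local)

    def app(environ, start_response):
        path = environ["PATH_INFO"]
        local = os.path.join(DOCROOT, path.lstrip("/"))
        if os.path.isfile(local):
            start_response("200 OK", [("Content-Type", "application/octet-stream")])
            return [open(local, "rb").read()]
        # Not here yet: start pulling it, and redirect this client to the master.
        threading.Thread(target=fetch, args=(path,), daemon=True).start()
        start_response("302 Found", [("Location", MASTER + path)])
        return [b""]

    if __name__ == "__main__":
        make_server("", 8000, app).serve_forever()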


Panther Express is cheap; cheaper than S3.

Let someone else worry about that issue. The amount of time you'll spend setting it up, testing it, and maintaining it is time taken away from other important aspects of your business/application. You are not going to be able to do it any better or any cheaper.


Panther really is dirt cheap.


You could use a distributed file system like:

http://en.wikipedia.org/wiki/Lustre_(file_system)

A simpler option would be to do a little scripting magic to catch inotify events and then trigger an rsync.
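For example, with the third-party pyinotify module on the master, something like this would push each finished upload out as soon as it's written (mirror hostnames and paths are made up):

    # Rough sketch: when a file finishes writing in the upload directory,
    # push it to each mirror over rsync+ssh. Assumes ssh keys are already
    # set up and a flat upload directory; names below are hypothetical.
    import subprocess
    import pyinotify

    UPLOADS = "/var/www/files"
    MIRRORS = ["cdn-2.mysite.com", "cdn-3.mysite.com"]

    class Pusher(pyinotify.ProcessEvent):
        def process_IN_CLOSE_WRITE(self, event):
            for host in MIRRORS:
                subprocess.call(["rsync", "-az", event.pathname,
                                 "%s:%s/" % (host, UPLOADS)])

    wm = pyinotify.WatchManager()
    wm.add_watch(UPLOADS, pyinotify.IN_CLOSE_WRITE, rec=True)
    pyinotify.Notifier(wm, Pusher()).loop()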


Your options (if you are going to do it yourself and not use S3 or CacheFly):

1. rsync on the backend; it's easy, it's relatively fast, but it is asynchronous and files won't be immediately available on all servers.

2. An inotify watcher that copies files to the slaves when a file is written or changed on the master. Faster than an rsync solution, but you'll need to write it yourself.

In either case, you will want to seriously question whether you should do it yourself; look at the cost of keeping machines operational and how much you are paying for bandwidth.


rsync is built for this: http://samba.anu.edu.au/rsync/


I recommend looking at GlusterFS. It's a fast, modular, layered clustering file system. It isn't tied to the Linux kernel because it's built on (patched) FUSE. With InfiniBand hardware, it's the fastest clustered file system available for free. See http://www.gluster.org/docs/index.php/GlusterFS_1.3.pre2-VER... for an example.



