Ask HN: Best framework for parsing thousands of feeds?
6 points by kez on Feb 4, 2011 | hide | past | favorite | 4 comments
I am currently working on what I hope will be a startup (lean, bootstrapped, etc.), and I am dealing with thousands of feeds.

Presently I am downloading feeds in batches of 5-10 across a pool of threads from Ruby, using FeedZirra (https://github.com/pauldix/feedzirra), and then parsing them.

Has anyone been in a similar situation and done something particularly innovative they care to share? I plan on ranking feeds by update frequency after some analysis, but in the meantime I am resigned to pulling everything down as quickly as possible.

I would love to use Superfeedr for this, but cost is prohibitive for me and I do not want to stump up the cash to pay for the credits whilst in development (although I could move to this in the future).

Not so bothered about the technology/language - this is a hodgepodge of Ruby, Ramaze, MySQL, Solr and good old file system storage.

Thanks in advance, and my appreciation for any and all comments!



I've done my fair share of it, and while I don't know that we ultimately tackled it 100%, there were plenty of gotchas.

1) Respect ETags / Last-Modified headers. This will save you a ton of bandwidth, for one, and keep you from getting banned by the feeds you're pulling. It's important. What I ended up doing was a different method for new feeds vs. ones I already knew about -- on the initial parse (and subsequent ones too), I would check for an ETag or Last-Modified indicator. If I couldn't detect anything, I set the poll frequency to something like half an hour. This kept me from slamming servers that didn't properly implement ETags, while I could check the ones that did more frequently.
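Roughly, that per-feed bookkeeping can be sketched like so (the dict field names and the two intervals are my own illustration, not any library's API):

```python
# Sketch of conditional-GET bookkeeping for a feed poller.
# Field names (etag, last_modified, poll_interval) are assumptions.

NO_VALIDATOR_INTERVAL = 30 * 60   # back off to half an hour
VALIDATOR_INTERVAL = 5 * 60       # cheap conditional GETs can run sooner

def conditional_headers(feed):
    """Build If-None-Match / If-Modified-Since from saved state."""
    headers = {}
    if feed.get("etag"):
        headers["If-None-Match"] = feed["etag"]
    if feed.get("last_modified"):
        headers["If-Modified-Since"] = feed["last_modified"]
    return headers

def record_response(feed, response_headers):
    """Save any validators and pick the next poll interval."""
    feed["etag"] = response_headers.get("ETag")
    feed["last_modified"] = response_headers.get("Last-Modified")
    has_validator = feed["etag"] or feed["last_modified"]
    feed["poll_interval"] = (
        VALIDATOR_INTERVAL if has_validator else NO_VALIDATOR_INTERVAL
    )
    return feed
```

A server that answers 304 Not Modified to those headers costs almost nothing per poll, which is why the feeds that support them can be checked more often.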

2) Hang on to your sockets. Opening / closing sockets is expensive for this particular task. What ended up working for us was queueing entries and reusing the same urllib handle for as many feeds as needed polling at a time. Otherwise, we were flooding the box with open sockets.
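One way to sketch that idea (the bucketing helper is mine; the fetch loop that reuses one `http.client` connection per host is commented out since it needs the network):

```python
# Group queued feed URLs by host so one persistent connection can
# serve many fetches instead of opening a socket per feed.
from collections import defaultdict
from urllib.parse import urlparse

def urls_by_host(urls):
    """Bucket URLs by host; each bucket shares one connection."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[urlparse(url).netloc].append(url)
    return dict(buckets)

# Fetch loop sketch (needs the network):
# import http.client
# for host, bucket in urls_by_host(queue).items():
#     conn = http.client.HTTPConnection(host)   # one socket per host
#     for url in bucket:
#         conn.request("GET", urlparse(url).path or "/")
#         body = conn.getresponse().read()      # drain before next request
#     conn.close()
```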

3) Use a task queue. My environment was Python, so I had the beautiful RabbitMQ and Celery to work with. We never ended up having to scale, but with a distributed task queue it was built such that we could just add other nodes to handle the fetching tasks if we needed to.
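The shape of it, using only the stdlib for illustration (in our setup the queue and workers were RabbitMQ + Celery, and scaling out just meant pointing extra nodes at the same broker):

```python
# Stdlib sketch of the fan-out pattern: feed URLs go into a queue,
# worker threads pull and fetch. A real broker replaces queue.Queue.
import queue
import threading

def run_workers(urls, fetch, num_workers=4):
    """Fan feed URLs out to worker threads; return {url: result}."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return                     # queue drained, worker exits
            result = fetch(url)            # e.g. download + parse a feed
            with lock:
                results[url] = result

    for url in urls:
        tasks.put(url)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```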


You might want to take a look at Samuel Clay's NewsBlur project: https://github.com/samuelclay/NewsBlur and see how he handles this problem.


As others have said, use task queues (look into Celery, which is in Python) and Mark Pilgrim's feedparser, and make sure you don't fetch more often than you need to. A few thousand feeds is fine, but if you grow into the hundreds of thousands of feeds and want to update them more than once a day, you're going to have to pass in the cache controls (ETags and last-modified dates). Even then, 50,000+ feeds strains the limits of a single DB.

If you were to reach that point, I recommend moving to a NoSQL db (like Mongo, which is what I use for NewsBlur) and sharding it so you can read/write more feeds without a problem. All of your analysis will have to be written as MapReduce jobs so it can be sent to different shards, but that's not too difficult to learn. NewsBlur has a few examples of how to do this.
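For a feel of the MapReduce shape, here it is in plain Python, counting fetched entries per feed (the document layout `{"feed_id": ..., "entries": [...]}` is an assumption of mine; in a real deployment each shard runs the map step and the reduce merges the results):

```python
# Toy MapReduce: map emits (feed_id, entry_count) per document,
# reduce sums counts per feed across all shards.
from collections import Counter
from itertools import chain

def map_phase(docs):
    """Emit (feed_id, entry_count) pairs, one per document."""
    for doc in docs:
        yield doc["feed_id"], len(doc["entries"])

def reduce_phase(pairs):
    """Sum the counts per feed across all emitted pairs."""
    totals = Counter()
    for feed_id, count in pairs:
        totals[feed_id] += count
    return dict(totals)

def map_reduce(shards):
    """Run map on each shard, then merge with a single reduce."""
    return reduce_phase(chain.from_iterable(map_phase(s) for s in shards))
```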

I used Python, but Ruby would also be a fine choice. Although I'm not sure what Ruby libraries you would use.


Check out http://www.feedparser.org/. It's for Python and pretty robust: it handles ETags and Last-Modified headers, is well documented, and has loads of unit tests.
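feedparser takes the saved validators as keyword arguments and reports a 304 status when nothing changed; a small wrapper might look like this (the `saved_state` dict and the `parse_with_cache` name are my own, not part of the library):

```python
# Wrapper around feedparser-style conditional fetching. The etag /
# modified keywords and the 304 status come from feedparser's docs;
# everything else here is an illustrative assumption.

def parse_with_cache(url, saved_state, parse):
    """Call parse (e.g. feedparser.parse) with any saved validators."""
    kwargs = {}
    if saved_state.get("etag"):
        kwargs["etag"] = saved_state["etag"]
    if saved_state.get("modified"):
        kwargs["modified"] = saved_state["modified"]
    doc = parse(url, **kwargs)
    if getattr(doc, "status", None) == 304:
        return None, saved_state          # unchanged since last fetch
    return doc, {"etag": getattr(doc, "etag", None),
                 "modified": getattr(doc, "modified", None)}

# Usage (network and feedparser required):
# import feedparser
# doc, state = parse_with_cache(feed_url, {}, feedparser.parse)
```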



