Ask HN: Best framework for parsing thousands of feeds?
6 points by kez on Feb 4, 2011 | hide | past | favorite | 4 comments
I am currently working on what I hope will be a startup (lean, bootstrapped, etc.), and I am dealing with thousands of feeds.

Presently I am downloading feeds in batches of 5-10 across a pool of threads from Ruby, using FeedZirra (https://github.com/pauldix/feedzirra), and then parsing them.

Has anyone been in a similar situation and done something particularly innovative they care to share? I plan on ranking feeds by update frequency after some analysis, but in the meantime I am resigned to pulling everything down as quickly as possible.

I would love to use Superfeedr for this, but cost is prohibitive for me and I do not want to stump up the cash to pay for the credits whilst in development (although I could move to this in the future).

Not so bothered about the technology/language - this is a hodgepodge of Ruby, Ramaze, MySQL, Solr and good old file system storage.

Thanks in advance, and my appreciation for any and all comments!



I've done my fair share of it, and while I don't know that we ultimately tackled it 100%, there were plenty of gotchas.

1) Respect ETags / Last-Modified headers. This will save you a ton of bandwidth, for one, and keep you from getting banned by the feeds you're pulling. It's important. What I ended up doing was a different method for new feeds vs. ones I already knew about -- on the initial parse (and subsequent ones too), I would check for an ETag or Last-Modified indicator. If I couldn't detect anything, I set the poll frequency to something like half an hour. This kept me from slamming servers that didn't properly implement ETags, while I could check the ones that did more frequently.
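Roughly, that per-feed bookkeeping can be sketched like so (the dict field names and the two intervals are my own illustration, not any library's API):

```python
# Sketch of conditional-GET bookkeeping for a feed poller.
# Field names (etag, last_modified, poll_interval) are assumptions.

NO_VALIDATOR_INTERVAL = 30 * 60   # back off to half an hour
VALIDATOR_INTERVAL = 5 * 60       # cheap conditional GETs can run sooner

def conditional_headers(feed):
    """Build If-None-Match / If-Modified-Since from saved state."""
    headers = {}
    if feed.get("etag"):
        headers["If-None-Match"] = feed["etag"]
    if feed.get("last_modified"):
        headers["If-Modified-Since"] = feed["last_modified"]
    return headers

def record_response(feed, response_headers):
    """Save any validators and pick the next poll interval."""
    feed["etag"] = response_headers.get("ETag")
    feed["last_modified"] = response_headers.get("Last-Modified")
    has_validator = feed["etag"] or feed["last_modified"]
    feed["poll_interval"] = (
        VALIDATOR_INTERVAL if has_validator else NO_VALIDATOR_INTERVAL
    )
    return feed
```

A server that answers 304 Not Modified to those headers costs almost nothing per poll, which is why the feeds that support them can be checked more often.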

2) Hang on to your sockets. Opening / closing sockets is expensive for this particular task. What ended up working for us was queueing entries and reusing the same urllib handle for as many feeds as needed polling at a time. Otherwise, we were flooding the box with open sockets.
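One way to sketch that idea (the bucketing helper is mine; the fetch loop that reuses one `http.client` connection per host is commented out since it needs the network):

```python
# Group queued feed URLs by host so one persistent connection can
# serve many fetches instead of opening a socket per feed.
from collections import defaultdict
from urllib.parse import urlparse

def urls_by_host(urls):
    """Bucket URLs by host; each bucket shares one connection."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[urlparse(url).netloc].append(url)
    return dict(buckets)

# Fetch loop sketch (needs the network):
# import http.client
# for host, bucket in urls_by_host(queue).items():
#     conn = http.client.HTTPConnection(host)   # one socket per host
#     for url in bucket:
#         conn.request("GET", urlparse(url).path or "/")
#         body = conn.getresponse().read()      # drain before next request
#     conn.close()
```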

3) Use a task queue. My environment was Python, so I had the beautiful RabbitMQ and Celery to work with. We never ended up having to scale, but with a distributed task queue it was built such that we could just add other nodes to handle the fetching tasks if we needed to.
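The shape of it, using only the stdlib for illustration (in our setup the queue and workers were RabbitMQ + Celery, and scaling out just meant pointing extra nodes at the same broker):

```python
# Stdlib sketch of the fan-out pattern: feed URLs go into a queue,
# worker threads pull and fetch. A real broker replaces queue.Queue.
import queue
import threading

def run_workers(urls, fetch, num_workers=4):
    """Fan feed URLs out to worker threads; return {url: result}."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return                     # queue drained, worker exits
            result = fetch(url)            # e.g. download + parse a feed
            with lock:
                results[url] = result

    for url in urls:
        tasks.put(url)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```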


You might want to take a look at Samuel Clay's NewsBlur project: https://github.com/samuelclay/NewsBlur and see how he handles this problem.


As others have said, use task queues (look into Celery, which is in Python) and Mark Pilgrim's feedparser, and make sure you don't fetch more often than you need to. A few thousand feeds is fine, but if you grow into the hundreds of thousands of feeds and want to update them more than once a day, you're going to have to pass in the cache controls (ETags and last-modified dates). Even then, 50,000+ feeds strains the limits of a single DB.

If you were to reach that point, I recommend moving to a NoSQL db (like Mongo, which is what I use for NewsBlur) and sharding it so you can read/write more feeds without a problem. All of your analysis will have to be written as MapReduce jobs so it can be sent to different shards, but that's not too difficult to learn. NewsBlur has a few examples of how to do this.
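For a feel of the MapReduce shape, here it is in plain Python, counting fetched entries per feed (the document layout `{"feed_id": ..., "entries": [...]}` is an assumption of mine; in a real deployment each shard runs the map step and the reduce merges the results):

```python
# Toy MapReduce: map emits (feed_id, entry_count) per document,
# reduce sums counts per feed across all shards.
from collections import Counter
from itertools import chain

def map_phase(docs):
    """Emit (feed_id, entry_count) pairs, one per document."""
    for doc in docs:
        yield doc["feed_id"], len(doc["entries"])

def reduce_phase(pairs):
    """Sum the counts per feed across all emitted pairs."""
    totals = Counter()
    for feed_id, count in pairs:
        totals[feed_id] += count
    return dict(totals)

def map_reduce(shards):
    """Run map on each shard, then merge with a single reduce."""
    return reduce_phase(chain.from_iterable(map_phase(s) for s in shards))
```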

I used Python, but Ruby would also be a fine choice. Although I'm not sure what Ruby libraries you would use.


Check out http://www.feedparser.org/. It's for Python and pretty robust: it handles ETags and Last-Modified headers, is well documented, and has loads of unit tests.
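feedparser takes the saved validators as keyword arguments and reports a 304 status when nothing changed; a small wrapper might look like this (the `saved_state` dict and the `parse_with_cache` name are my own, not part of the library):

```python
# Wrapper around feedparser-style conditional fetching. The etag /
# modified keywords and the 304 status come from feedparser's docs;
# everything else here is an illustrative assumption.

def parse_with_cache(url, saved_state, parse):
    """Call parse (e.g. feedparser.parse) with any saved validators."""
    kwargs = {}
    if saved_state.get("etag"):
        kwargs["etag"] = saved_state["etag"]
    if saved_state.get("modified"):
        kwargs["modified"] = saved_state["modified"]
    doc = parse(url, **kwargs)
    if getattr(doc, "status", None) == 304:
        return None, saved_state          # unchanged since last fetch
    return doc, {"etag": getattr(doc, "etag", None),
                 "modified": getattr(doc, "modified", None)}

# Usage (network and feedparser required):
# import feedparser
# doc, state = parse_with_cache(feed_url, {}, feedparser.parse)
```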



