I am currently working on what I hope will be a startup (lean, bootstrapped etc) and I am dealing with thousands of feeds.
At present I download 5-10 feeds at a time in batches of threads from Ruby using Feedzirra (https://github.com/pauldix/feedzirra), and then parse them.
Has anyone been in a similar situation and done something particularly innovative they care to share? I plan on ranking feeds by frequency of updates after some analysis, but in the meantime I am resigned to pulling everything down as quickly as possible.
I would love to use Superfeedr for this, but cost is prohibitive for me and I do not want to stump up the cash to pay for the credits whilst in development (although I could move to this in the future).
Not so bothered about the technology/language - this is a hodgepodge of Ruby, Ramaze, MySQL, Solr and good old file system storage.
Thanks in advance for any and all comments!
1) Respect ETag / Last-Modified headers. This will save you a ton of bandwidth, for one, and keep you from getting banned by the sites you're pulling from. It's important. What I ended up doing was using a different method for new feeds vs. ones I already knew about: on the initial parse (and subsequent ones too), I would check the response for an ETag or Last-Modified header. If I couldn't detect either, I set the poll frequency to something like half an hour. This kept me from slamming servers that didn't properly implement ETags, while I could check headers more frequently on the ones that did.
2) Hang on to your sockets. Opening and closing sockets is expensive for this particular task. What ended up working for us was queueing feeds and reusing the same urllib handle for as many of them as needed polling at a given time. Otherwise, we were flooding the box with open sockets.
3) Use a task queue. My environment was Python, so I had the beautiful RabbitMQ and Celery to work with. I never ended up having to scale, but the intention was that, with a distributed task queue, we could just add more nodes to do the fetching tasks if we needed to.
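To make point 1) concrete, here is a minimal sketch of a conditional GET using only Python's standard library. The function names, the timeout, and the 5-minute interval for well-behaved servers are my own assumptions for illustration; only the half-hour fallback comes from the description above.

```python
import urllib.error
import urllib.request

FALLBACK_POLL_SECONDS = 30 * 60   # no validators seen: poll about every half hour
VALIDATED_POLL_SECONDS = 5 * 60   # hypothetical shorter interval for servers with validators


def build_request(url, etag=None, last_modified=None):
    """Build a GET request that carries whatever validators we stored last time."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    return req


def poll_interval(etag, last_modified):
    """Poll more often only when the server gave us something to validate against."""
    if etag or last_modified:
        return VALIDATED_POLL_SECONDS
    return FALLBACK_POLL_SECONDS


def fetch(url, etag=None, last_modified=None, timeout=10):
    """Return (body, etag, last_modified); body is None when the feed is unchanged."""
    req = build_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read()
            return body, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the validators we already have
            return None, etag, last_modified
        raise
```

You would store the returned ETag and Last-Modified values alongside each feed and pass them back in on the next poll, skipping the parse entirely whenever the body comes back as None.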
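For point 2), one way to hang on to sockets is to group the feeds due for polling by host and send every request for a host over a single connection. This is only a sketch: it uses http.client rather than the urllib handle mentioned above, the helper names are made up, and it ignores redirects and error handling.

```python
import http.client
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host(urls):
    """Map (scheme, host) -> list of request paths, so each host needs one connection."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        groups[(parts.scheme, parts.netloc)].append(path)
    return dict(groups)


def fetch_group(scheme, netloc, paths, timeout=10):
    """Fetch every path on one host over a single connection where the server allows it."""
    if scheme == "https":
        conn = http.client.HTTPSConnection(netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(netloc, timeout=timeout)
    results = {}
    try:
        for path in paths:
            conn.request("GET", path)
            resp = conn.getresponse()
            # The body must be read in full before the connection can be reused.
            results[path] = (resp.status, resp.read())
    finally:
        conn.close()
    return results
```

Calling fetch_group once per entry of group_by_host(urls) keeps the number of open sockets proportional to the number of hosts rather than the number of feeds.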
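For point 3), the shape of a task queue can be shown without any broker. The sketch below is a single-process stand-in built on queue.Queue and threads, not Celery or RabbitMQ; run_workers and its parameters are hypothetical names. With a real distributed queue, the fetch callable would become a task and the workers would be separate processes or machines.

```python
import queue
import threading


def run_workers(urls, fetch, num_workers=4):
    """Push URLs onto a queue and let worker threads pull and fetch them."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            url = tasks.get()
            if url is None:          # sentinel: no more work for this worker
                tasks.task_done()
                break
            try:
                outcome = fetch(url)
            except Exception as exc:  # record the failure instead of killing the worker
                outcome = exc
            with lock:
                results[url] = outcome
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)              # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return results
```

The producer only knows about the queue, so adding capacity means adding workers, which is the property that makes the distributed version easy to scale out.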