Show HN: crawl a website and store it in S3 from your browser (github.com/spullara)
43 points by spullara on April 30, 2011 | 12 comments



Is there a header set to identify this crawler so I can limit it?

For what it's worth, when people crawl sites I am responsible for and it impacts our product's performance, I block the offending IP.

It doesn't get unblocked until the system reboots (rarely) or someone lodges a support ticket to say they cannot access the site.

It's heavy-handed, but crawlers can cause a significant amount of trouble given that they're an unusual usage pattern. Spiders often have a delay between hits; I hope you have programmed one in rather than going full speed!
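
Even something this simple goes a long way (a sketch; fetchAndStore is a stand-in for whatever fetch-and-upload step the crawler actually performs):

  // Crawl one URL, then wait a second before scheduling the next,
  // rather than hammering the server at full speed.
  function crawlNext(queue) {
    if (queue.length === 0) return;
    fetchAndStore(queue.shift(), function () {
      setTimeout(function () { crawlNext(queue); }, 1000);
    });
  }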


As it only fetches one URL at a time and uploads to S3 between requests, it shouldn't unduly load any reasonable system. I'll add an additional "BrowserCrawler" string to the user agent; that seems very reasonable.

Update: it appears jQuery can't set the User-Agent header on AJAX requests. I have instead set a new X-User-Agent header to BrowserCrawler.
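
Roughly what that looks like (pageUrl is just a placeholder; browsers refuse to let XHR override User-Agent, but custom headers go through fine):

  $.ajax({
    url: pageUrl,  // placeholder for the page being crawled
    headers: { 'X-User-Agent': 'BrowserCrawler' },
    success: function (html) {
      // upload the fetched page to S3 here
    }
  });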


What's this X-User-Agent stuff? The entire internet isn't going to special-case your hack. Just honour robots.txt.
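
Even a naive check would go a long way. A rough sketch (this ignores wildcards, Allow lines, and named user-agent sections, which a real parser handles):

  // Return false if the path matches a "Disallow" rule in the
  // "User-agent: *" section of a robots.txt file.
  function isAllowed(robotsTxt, path) {
    var lines = robotsTxt.split('\n');
    var applies = false;
    var disallowed = [];
    for (var i = 0; i < lines.length; i++) {
      var line = lines[i].replace(/^\s+|\s+$/g, '');
      if (/^user-agent:\s*\*$/i.test(line)) {
        applies = true;
      } else if (/^user-agent:/i.test(line)) {
        applies = false;
      } else if (applies && /^disallow:/i.test(line)) {
        var rule = line.substring(9).replace(/^\s+/, '');
        if (rule) disallowed.push(rule);
      }
    }
    for (var j = 0; j < disallowed.length; j++) {
      if (path.indexOf(disallowed[j]) === 0) return false;
    }
    return true;
  }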


I don't think an individual storing web pages has any requirement to obey robots.txt. You should be able to click "Save Page" on anything you have access to. Obviously, redistributing that is another can of worms.


Ah, I'm familiar with S3 latencies. That'd work.

Also, if you're uploading other people's content to public S3 buckets, you're very likely to be in violation of most sites' copyright and TOS. You might want to mention that in the README.

If we found some of our sites mirrored on the public internet we'd issue cease and desist notices.


How hard would it be to modify it to store it locally on your computer?


  wget --mirror
is probably your friend


Probably wouldn't be that hard to modify. I haven't looked at the local file API available from Safari plugins, but it should be easy.


There is a nice Firefox plugin called Scrapbook that does this.


Great! I had the same idea, only with GAE integration and an option to download locally. Any plans on expanding that interface?


I have it going to S3 because it gives me the ability to instantly send someone a link, or to access it from wherever I am. Also, I planned on doing larger crawls plus analysis. It should be pretty easy, though, to hook it into a local file API like I mentioned above if that is more convenient for you.
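
For the browser-side upload, one way to do it (not necessarily how this plugin does it; signedUrl and pageHtml are placeholders) is a pre-signed S3 PUT:

  // Assumes a pre-signed S3 PUT URL was generated elsewhere with
  // your AWS credentials, so no secrets live in the browser.
  $.ajax({
    type: 'PUT',
    url: signedUrl,
    data: pageHtml,
    contentType: 'text/html',
    processData: false,
    success: function () {
      // the stored page is now addressable by a plain S3 link
    }
  });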


That is very cool.



