Show HN: crawl a website and store it in S3 from your browser (github.com/spullara)
43 points by spullara on April 30, 2011 | 12 comments



Is there a header set to identify this crawler so I can limit it?

For what it's worth, when people crawl sites I am responsible for and it impacts our product's performance, I block the offending IP.

It doesn't get unblocked until the system reboots (rarely) or someone lodges a support ticket to say they cannot access the site.

It's heavy-handed, but crawlers can cause a significant amount of trouble given that they're an unusual usage pattern. Spiders often have a delay between hits; I hope you have programmed one in rather than going full speed!
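
Even something this simple goes a long way (a sketch; fetchAndStore is a stand-in for whatever fetch-and-upload step the crawler actually performs):

  // Crawl one URL, then wait a second before scheduling the next,
  // rather than hammering the server at full speed.
  function crawlNext(queue) {
    if (queue.length === 0) return;
    fetchAndStore(queue.shift(), function () {
      setTimeout(function () { crawlNext(queue); }, 1000);
    });
  }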


As it only fetches one URL at a time and uploads to S3 between requests, it shouldn't unduly load any reasonable system. I'll add an additional "BrowserCrawler" string to the user agent; that seems very reasonable.

Update: it appears jQuery can't set the User-Agent header on AJAX requests. I have instead set a new X-User-Agent header to BrowserCrawler.
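
Roughly what that looks like (pageUrl is just a placeholder; browsers refuse to let XHR override User-Agent, but custom headers go through fine):

  $.ajax({
    url: pageUrl,  // placeholder for the page being crawled
    headers: { 'X-User-Agent': 'BrowserCrawler' },
    success: function (html) {
      // upload the fetched page to S3 here
    }
  });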


What's this X-User-Agent stuff? The entire internet isn't going to special-case your hack. Just honour robots.txt.
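
Even a naive check would go a long way. A rough sketch (this ignores wildcards, Allow lines, and named user-agent sections, which a real parser handles):

  // Return false if the path matches a "Disallow" rule in the
  // "User-agent: *" section of a robots.txt file.
  function isAllowed(robotsTxt, path) {
    var lines = robotsTxt.split('\n');
    var applies = false;
    var disallowed = [];
    for (var i = 0; i < lines.length; i++) {
      var line = lines[i].replace(/^\s+|\s+$/g, '');
      if (/^user-agent:\s*\*$/i.test(line)) {
        applies = true;
      } else if (/^user-agent:/i.test(line)) {
        applies = false;
      } else if (applies && /^disallow:/i.test(line)) {
        var rule = line.substring(9).replace(/^\s+/, '');
        if (rule) disallowed.push(rule);
      }
    }
    for (var j = 0; j < disallowed.length; j++) {
      if (path.indexOf(disallowed[j]) === 0) return false;
    }
    return true;
  }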


I don't think an individual storing web pages has any requirement to obey robots.txt. You should be able to click "Save Page" on anything you have access to. Obviously, redistributing that is another can of worms.


Ah, I'm familiar with S3 latencies. That'd work.

Also, if you're uploading other people's content to public S3 buckets, you're very likely to be in violation of most sites' copyright and TOS. You might want to mention that in the README.

If we found some of our sites mirrored on the public internet we'd issue cease and desist notices.


How hard would it be to modify it to store it locally on your computer?


  wget --mirror
is probably your friend


Probably wouldn't be that hard to modify. I haven't looked at the local file API available from Safari plugins, but it should be easy.


There is a nice Firefox plugin called Scrapbook that does this.


Great! I had the same idea, only with GAE integration and an option to download locally. Any plans on expanding that interface?


I have it going to S3 because it gives me the ability to instantly send someone a link, or to access it from wherever I am. Also, I planned on doing larger crawls plus analysis. It should be pretty easy, though, to hook it into a local file API like I mentioned above if that is more convenient for you.
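
For the browser-side upload, one way to do it (not necessarily how this plugin does it; signedUrl and pageHtml are placeholders) is a pre-signed S3 PUT:

  // Assumes a pre-signed S3 PUT URL was generated elsewhere with
  // your AWS credentials, so no secrets live in the browser.
  $.ajax({
    type: 'PUT',
    url: signedUrl,
    data: pageHtml,
    contentType: 'text/html',
    processData: false,
    success: function () {
      // the stored page is now addressable by a plain S3 link
    }
  });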


That is very cool.



