Is there a header set to identify this crawler so I can limit it?
For what it's worth, when people crawl sites I am responsible for and it impacts our product's performance, I block the offending IP.
It doesn't get unblocked until the system reboots (rarely) or someone lodges a support ticket to say they cannot access the site.
It's heavy-handed, but crawlers can cause a significant amount of trouble because theirs is an unusual usage pattern. Spiders usually wait between hits; I hope you have programmed a delay in rather than going full speed!
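To be concrete, even something like this keeps the load reasonable. It's a rough sketch, not your code; the two-second delay and the fetchAndStore helper are just illustrative:

```javascript
// Sketch of a polite crawl loop: one URL at a time, with a fixed pause
// between hits rather than requesting at full speed. The 2-second delay
// is an arbitrary example value.
const crawlDelayMs = 2000;

async function fetchAndStore(url) {
  const response = await fetch(url);   // fetch one page
  const html = await response.text();  // body would be handed off for storage
  return html;
}

async function crawl(urls) {
  for (const url of urls) {
    await fetchAndStore(url);                                         // one URL at a time
    await new Promise(resolve => setTimeout(resolve, crawlDelayMs));  // pause before the next hit
  }
}
```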
As it only does one URL at a time and uploads to S3 between requests, it shouldn't unduly load any reasonable system. I'll add a "BrowserCrawler" string to the user agent; that seems very reasonable.
Update: it appears jQuery can't set the User-Agent header on ajax requests. I have instead set a new X-User-Agent header to BrowserCrawler.
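For reference, the header is set roughly like this. This is a sketch of the jQuery call with a placeholder URL, not the exact code in the repo:

```javascript
// Browsers refuse to let scripts override User-Agent on XMLHttpRequest,
// so send a custom X-User-Agent header on every jQuery ajax request instead.
$.ajaxSetup({
  headers: { 'X-User-Agent': 'BrowserCrawler' }
});

// Or per request:
$.ajax({
  url: 'https://example.com/page.html',  // placeholder URL
  headers: { 'X-User-Agent': 'BrowserCrawler' },
  success: function (html) {
    // hand the page body off for upload
  }
});
```

That should give site operators a string to match on if they want to rate-limit or block the crawler.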
I don't think that an individual storing web pages has any requirement to obey robots.txt. You should be able to click "Save Page" on anything you have access to. Obviously, redistributing that is another can of worms.
Also, if you're uploading other people's content to public S3 buckets, you're very likely to be in violation of most sites' copyright and TOS. You might want to mention that in the README.
If we found some of our sites mirrored on the public internet we'd issue cease and desist notices.
I have it going to S3 because it gives me the ability to instantly send someone a link, or to access it from wherever I am. I also planned on doing larger crawls plus analysis. It should be pretty easy, though, to hook it into a local file API like I mention above if that is more convenient for you.
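In case it helps, the upload step is roughly this shape. This is a sketch assuming the AWS SDK for JavaScript v2, with a hypothetical bucket name and key scheme rather than whatever the project actually uses:

```javascript
// Sketch of the "upload one page to S3 between requests" step,
// assuming the AWS SDK for JavaScript v2 is loaded and credentials
// are configured. Bucket name and key layout are placeholders.
var s3 = new AWS.S3();

function uploadPage(url, html, callback) {
  s3.putObject({
    Bucket: 'my-crawl-bucket',                 // hypothetical bucket
    Key: 'crawls/' + encodeURIComponent(url),  // one object per crawled URL
    Body: html,
    ContentType: 'text/html'
  }, callback);
}
```

Swapping in a local file API would mostly mean replacing uploadPage with a local write; the crawl loop wouldn't need to change.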