
@jacquesm please send me a list of URLs to crawl (10M+), and I'll set up an 80legs job to do this. shion - at - 80legs - com.


The problem is that Posterous is hard to crawl. For one, they'll continuously and automatically ban your IPs, even if you rotate over a lot of them. Two, Posterous can't handle all of the requests.

We (ArchiveTeam) have unfortunately made Posterous unresponsive multiple times. So please be careful not to bring it down completely if you're doing a solo effort.

Please also bear in mind that it's not as simple as just "chucking it into the downloader".
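
To make the "be careful" part concrete, here's a rough sketch of per-host throttling with exponential backoff. This is not what ArchiveTeam actually runs; the class name, the delays, and the choice of status codes to back off on (403, 429, 5xx) are all my own assumptions for illustration.

```python
class PoliteThrottle:
    """Tracks how long to wait before the next request to one host.

    Hypothetical helper: the name and the default delays are assumptions.
    """

    def __init__(self, base_delay=1.0, max_delay=300.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay

    def record(self, status):
        """Update the delay from an HTTP status code and return it."""
        # Assumption: 403 may mean a ban, 429 and 5xx mean the site is struggling.
        if status in (403, 429) or status >= 500:
            # Double the wait, capped at max_delay.
            self.delay = min(self.delay * 2, self.max_delay)
        else:
            # Any other response drops back to the base rate.
            self.delay = self.base_delay
        return self.delay
```

You'd call `time.sleep(throttle.record(status))` after every response, so the crawler slows down as soon as the site starts refusing or failing requests.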


Also, please use a sensible format if you're crawling/archiving this.

We're using WARC (Web ARChive), an official ISO file format standard (ISO 28500) that the Internet Archive's Wayback Machine can use. It's also a pretty good format for archiving web pages in general.
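
For anyone who hasn't seen one, a WARC record is just a block of named headers, a blank line, and the raw HTTP response, followed by two CRLFs. Below is a minimal sketch of building a single `response` record by hand. The function name is made up and the URL in the usage note is a placeholder; for real archiving you'd want an existing WARC-capable tool rather than this.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(url, http_bytes, when=None):
    """Build one WARC/1.0 'response' record as bytes.

    Hypothetical helper. http_bytes is the raw HTTP response
    (status line, headers, body) exactly as received.
    """
    when = when or datetime.now(timezone.utc)
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        # Assumes `when` is in UTC.
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {url}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ]
    # Header block, blank line, payload, then two CRLFs to end the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_bytes + b"\r\n\r\n"
```

Appending the result of `warc_response_record("http://example.posterous.com/", raw)` to a file opened in binary mode gives you a file of concatenated records; real tools usually also write a `warcinfo` record and gzip each record, which this sketch skips.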


Please ask on irc://efnet/#preposterus; that's where the Archive Team guys hang out. I don't have a list of seeds, but they may be able to figure out a way in which you can put 80legs to good use.



