
Any nice tool to archive web sites using the WARC format?


If you want a full-fidelity WARC file, browsertrix-crawler is nice. It is slower than wget because it drives Chrome, but it works better for sites with highly dynamic content, and it can generate WACZ files, which can be served efficiently from a file or object store when used with something like replayweb.page.
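
For anyone wanting to try it, here is a rough sketch of how a single-site crawl might be launched from Python through Docker. The image name and the `--url`, `--generateWACZ`, and `--collection` options are as I remember them from the browsertrix-crawler README, so check the current docs before relying on them. The helper name `browsertrix_command`, the `crawls` directory, and the example URL are made up for illustration.

```python
import os
import shlex


def browsertrix_command(url, collection, crawls_dir="crawls"):
    """Build a `docker run` argv for a browsertrix-crawler crawl (hypothetical helper).

    Flag names are assumed from the project's README and may change between releases.
    """
    return [
        "docker", "run", "--rm",
        # Mount a host directory so the crawl output survives the container.
        "-v", f"{os.path.abspath(crawls_dir)}:/crawls/",
        "webrecorder/browsertrix-crawler",
        "crawl",
        "--url", url,
        "--generateWACZ",            # package the crawl as a WACZ file
        "--collection", collection,  # name of the output collection
    ]


if __name__ == "__main__":
    cmd = browsertrix_command("https://example.com/", "example")
    # Print the command for review; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Printing the command first makes it easy to review before handing it to `subprocess.run(cmd, check=True)`.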




The Wikipedia article on the WARC format references wget.
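
A minimal sketch of the wget route, assuming a wget build with WARC support. `--mirror`, `--page-requisites`, `--warc-file`, and `--warc-cdx` are standard wget options; the helper name `wget_warc_command` and the example URL are just placeholders.

```python
import shlex


def wget_warc_command(url, warc_name):
    """Build a wget argv that mirrors `url` and records the traffic to a WARC file."""
    return [
        "wget",
        "--mirror",                  # recursive download with timestamping
        "--page-requisites",         # also fetch images, CSS, and scripts
        f"--warc-file={warc_name}",  # wget adds the .warc.gz extension itself
        "--warc-cdx",                # also write a CDX index next to the WARC
        url,
    ]


if __name__ == "__main__":
    cmd = wget_warc_command("https://example.com/", "example")
    # Print the command for review; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Note that wget only records what it fetches over plain HTTP, so content loaded by JavaScript will be missing from the archive.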


Webrecorder's tools are by far the best imo: https://archiveweb.page and Browsertrix Crawler. ArchiveBox uses wget internally for its WARC generation, but I'd love to integrate it with Webrecorder in the future for this part.



