I really need to try this out soon. I keep bumping into it online. What do people on HN use it for?
What I do right now "to collect, save, and view sites you want to preserve offline" is by use of a Firefox plugin called WebScrapBook. Click-click-done, and I have a local searchable (!) copy of a webpage exactly as it looked in the browser. With styles and all, in one file. WebScrapBook is pretty highly configurable.
In the future I would like to have a solution that doesn't require some Firefox plugin.
I’ve used it to save a lot of pages related to ham radio. I have several 30-40 year old radios and I’m afraid one day information about them will just drop off the internet.
Fairly quick, depending of course on the weight of the page(s) and how many archive methods you enable (I have several turned off because they are redundant for my purposes). The "adding to archive" interstitial still tends to time out which is a little annoying, but the actual archive process is backgrounded so it doesn't matter.
The recommended install includes a search engine that works well, aside from a few false positives since it's a fuzzy search. I don't have much in it yet so I can't say what the performance is like once you reach e.g. thousands of pages, but I imagine it would still perform well except maybe for mass operations like rebuilding the entire index.
Looks like ArchiveBox has more export options?
EDIT: looks like ArchiveBox is focused on continuous change tracking rather than just snapshots like Wallabag.
Just FYI, I have this set up as a Docker container on my Synology and it is now patiently crawling through my (imported) 2000+ Pocket URLs, to which I’m adding a lot of other stuff scattered across other “clipping” tools (like OneNote).
Key benefit for me is having actual local files. The resulting PDFs are searchable on their own, so I can sync those back to my Mac for reference (and Spotlight indexing). But the HTML snapshots are also pretty decent.
One thing I’ll be looking into is automatic tagging (since it’s a Django app there are plenty of likely ways to inject that info).
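For example, something like this run from `archivebox shell` could do simple keyword-based tagging. Untested sketch: Snapshot/Tag and the tags relation are my guesses at the models, so check core/models.py in your installed version first.

    # Untested sketch, run from `archivebox shell` (a Django shell).
    # Snapshot/Tag and the `tags` relation are assumptions about the
    # models in core/models.py, not a documented API -- verify first.
    from core.models import Snapshot, Tag

    RULES = {
        "github.com": "code",
        "arxiv.org": "papers",
    }

    for snapshot in Snapshot.objects.all():
        for needle, tag_name in RULES.items():
            if needle in snapshot.url:
                tag, _ = Tag.objects.get_or_create(name=tag_name)
                snapshot.tags.add(tag)  # M2M add, so re-running is harmless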
I just got my first Synology literally two days ago, a DS3617xsII. Looking forward to playing with it, especially the virtualization / Docker features. How do you like it?
One neat tiny implementation detail of ArchiveBox that I just highlighted on HN today is our use of asymptotic progress bars when we don't know how long archiving a page is going to take: https://news.ycombinator.com/item?id=27860022
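The gist of it, as a toy Python sketch rather than our actual implementation: map elapsed time through 1 - e^(-t/T), so the bar fills quickly at first and then creeps toward 100% without ever getting there until the task really finishes.

    import math
    import sys
    import time

    def draw_asymptotic_bar(elapsed, expected=30.0, width=40):
        # 1 - e^(-t/expected) rises quickly at first, then creeps
        # toward (but never reaches) 100% while the task runs
        fraction = 1 - math.exp(-elapsed / expected)
        filled = int(fraction * width)
        sys.stdout.write("\r[" + "#" * filled + "-" * (width - filled)
                         + f"] {fraction:5.1%}")
        sys.stdout.flush()

    start = time.time()
    for _ in range(50):  # stand-in for "the task is still running"
        draw_asymptotic_bar(time.time() - start)
        time.sleep(0.1)
    print()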
When I tried to add my bookmarks from a file, one of the websites seems to no longer be online; ArchiveBox just stops execution when it encounters this.
How can I make archivebox ignore such errors and continue with the rest of the websites?
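For now I'm working around it with a small wrapper script that feeds `archivebox add` one URL at a time, assuming a plain-text bookmarks.txt with one URL per line:

    import subprocess

    # Feed URLs to `archivebox add` one at a time so a failure on a
    # single dead site can't abort the rest of the batch.
    with open("bookmarks.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        result = subprocess.run(["archivebox", "add", url])
        if result.returncode != 0:
            print(f"skipped {url} (exit code {result.returncode})")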
What is the advantage of this over something like Kiwix, or just using the Playwright CLI? It seems useful but a bit unnecessary if you're just using it to create Archive.org links.
I use it like permanent bookmarks. I can go back to it and trust it'll still be there. And this isn't just "things disappear eventually" - for a specific example, I was working on something rather last-minute recently, and wanted to refer to a vendor's whitepaper - and their entire site was "down for maintenance" all weekend. So I went onto my archive and I still had a copy from my first pass over the topic. It's a free resource, I can't blame them and I can't complain - but if I can make sure it doesn't impact me, even better.
I know other people would still have that tab open from 3 weeks ago, but I just don't work like that.
I'm not going to complain about wayback/archive.org at all, but the nature of the beast is that there are certain requests they have to obey. With my own offline, non-exposed equivalent, I don't (well, I do, but I simply never receive them).
I create browser bookmarks regularly, but I can't be bothered to also SSH into my server to tell it to grab a copy of the URL. Automating this with a browser plugin would be cool.
There is a web UI, so I have a JS "bookmarklet" in my toolbar. If you go to the web UI and add a URL, right at the bottom of the page there's a link you can bookmark to do the same. So that's my workflow - click Archive (or cmd-alt-1), then hit enter. Done.
This does _a lot more_ than just creating archive.org links. It saves the entire page contents (in multiple formats, including nicely searchable PDF and some embedded media) locally.
ArchiveBox is also an awesome tool for creating copies of a website, whether you want to demonstrate a phishing attack or do a PoC for integrating your product.