I really need to try this out soon. I keep bumping into it online. What do people on HN use it for?
What I do right now "to collect, save, and view sites you want to preserve offline" is by use of a Firefox plugin called WebScrapBook. Click-click-done, and I have a local searchable (!) copy of a webpage exactly as it looked in the browser. With styles and all, in one file. WebScrapBook is pretty highly configurable.
In the future I would like to have a solution that doesn't require some Firefox plugin.
I’ve used it to save a lot of pages related to ham radio. I have several 30-40 year old radios and I’m afraid one day information about them will just drop off the internet.
Fairly quick, depending of course on the weight of the page(s) and how many archive methods you enable (I have several turned off because they are redundant for my purposes). The "adding to archive" interstitial still tends to time out which is a little annoying, but the actual archive process is backgrounded so it doesn't matter.
The recommended install includes a search engine that works well, aside from a few false positives since it's a fuzzy search. I don't have much in it yet so I can't say what the performance is like once you reach e.g. thousands of pages, but I imagine it would still perform well except maybe for mass operations like rebuilding the entire index.
Looks like ArchiveBox has more export options?
EDIT: looks like ArchiveBox is focused on continuous change tracking rather than just snapshots like Wallabag.
Just FYI, I have this set up as a Docker container on my Synology and it is now patiently crawling through my (imported) 2000+ Pocket URLs, to which I’m adding a lot of other stuff scattered across other “clipping” tools (like OneNote).
Key benefit for me is having actual local files. The resulting PDFs are searchable on their own, so I can sync those back to my Mac for reference (and Spotlight indexing). But the HTML snapshots are also pretty decent.
One thing I’ll be looking into is automatic tagging (since it’s a Django app there are plenty of likely ways to inject that info).
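For example, something like this run from `archivebox shell` could do simple keyword-based tagging. Untested sketch: Snapshot/Tag and the tags relation are my guesses at the models, so check core/models.py in your installed version first.

    # Untested sketch, run from `archivebox shell` (a Django shell).
    # Snapshot/Tag and the `tags` relation are assumptions about the
    # models in core/models.py, not a documented API -- verify first.
    from core.models import Snapshot, Tag

    RULES = {
        "github.com": "code",
        "arxiv.org": "papers",
    }

    for snapshot in Snapshot.objects.all():
        for needle, tag_name in RULES.items():
            if needle in snapshot.url:
                tag, _ = Tag.objects.get_or_create(name=tag_name)
                snapshot.tags.add(tag)  # M2M add, so re-running is harmless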
I just got my first Synology literally two days ago, a DS3617xsII. Looking forward to playing with it, especially the virtualization / Docker features. How do you like it?
One neat tiny implementation detail of ArchiveBox that I just highlighted on HN today is our use of asymptotic progress bars when we don't know how long archiving a page is going to take: https://news.ycombinator.com/item?id=27860022
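The gist of it, as a toy Python sketch rather than our actual implementation: map elapsed time through 1 - e^(-t/T), so the bar fills quickly at first and then creeps toward 100% without ever getting there until the task really finishes.

    import math
    import sys
    import time

    def draw_asymptotic_bar(elapsed, expected=30.0, width=40):
        # 1 - e^(-t/expected) rises quickly at first, then creeps
        # toward (but never reaches) 100% while the task runs
        fraction = 1 - math.exp(-elapsed / expected)
        filled = int(fraction * width)
        sys.stdout.write("\r[" + "#" * filled + "-" * (width - filled)
                         + f"] {fraction:5.1%}")
        sys.stdout.flush()

    start = time.time()
    for _ in range(50):  # stand-in for "the task is still running"
        draw_asymptotic_bar(time.time() - start)
        time.sleep(0.1)
    print()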
When I tried to add my bookmarks from a file, one of the websites seems to no longer be online; ArchiveBox just stops execution when it encounters this.
How can I make archivebox ignore such errors and continue with the rest of the websites?
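For now I'm working around it with a small wrapper script that feeds `archivebox add` one URL at a time, assuming a plain-text bookmarks.txt with one URL per line:

    import subprocess

    # Feed URLs to `archivebox add` one at a time so a failure on a
    # single dead site can't abort the rest of the batch.
    with open("bookmarks.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        result = subprocess.run(["archivebox", "add", url])
        if result.returncode != 0:
            print(f"skipped {url} (exit code {result.returncode})")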
What is the advantage of this over something like Kiwix, or just using the Playwright CLI? It seems useful but a bit unnecessary if you're just using it to create Archive.org links.
I use it like permanent bookmarks. I can go back to it and trust it'll still be there. And this isn't just "things disappear eventually" - for a specific example, I was working on something rather last-minute recently, and wanted to refer to a vendor's whitepaper - and their entire site was "down for maintenance" all weekend. So I went onto my archive and I still had a copy from my first pass over the topic. It's a free resource, I can't blame them and I can't complain - but if I can make sure it doesn't impact me, even better.
I know other people would still have that tab open from 3 weeks ago, but I just don't work like that.
I'm not going to complain about wayback/archive.org at all, but the nature of the beast is that there are certain requests they have to obey. With my own offline, non-exposed equivalent, I don't (well, I do, but I simply never receive them).
I create browser bookmarks regularly, but I can't be bothered to also SSH into my server to tell it to grab a copy of the URL. Automating this with a browser plugin would be cool.
There is a web UI, so I have a JS "bookmarklet" in my toolbar. If you go to the web UI and add a URL, right at the bottom of the page there's a link you can bookmark to do the same. So that's my workflow - click Archive (or cmd-alt-1), then hit enter. Done.
This does _a lot more_ than just creating archive.org links. It saves the entire page contents (in multiple formats, including nicely searchable PDF and some embedded media) locally.
ArchiveBox is also an awesome tool for creating copies of a website, whether you want to demonstrate a phishing attack or do a PoC for integrating your product.